# Data exploration/visualization

**SageMaker Studio Kernel**: Data Science

The challenge we're trying to address here is to identify the sentiment of tweets.
The dataset is a public one taken from [Kaggle](https://www.kaggle.com/code/sagniksanyal/tweet-s-text-classicifaction/data).
Each record contains:
 - Username
 - User location
 - User description
 - User creation date
 - User followers
 - User friends
 - User favourites
 - User is verified
 - Date of the tweet
 - Text of the tweet
 - Sentiment associated with the tweet

Let's start preparing our dataset, then.

## Let's take a look at the data
Loading the dataset using Pandas...

In [None]:
! pip install emoji

In [None]:
import csv
import datetime
import emoji
import logging
import numpy as np
import pandas as pd
import re
import seaborn as sns
import time

sns.set(rc={'figure.figsize':(11.7,8.27)})

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
file_name = "data.csv"

In [None]:
df = pd.read_csv(
    "./../data/{}".format(file_name),
    sep=",",
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    escapechar='\\',
    encoding='utf-8',
    on_bad_lines='skip'  # pandas >= 1.3; replaces error_bad_lines=False, which was removed in pandas 2.0
)

### Plotting the data to get a first impression

In [None]:
df.head()

In [None]:
df.describe(include='all')

In [None]:
%matplotlib inline

ds = df['source'].value_counts().reset_index()
ds.columns = ['source', 'count']
ds = ds.sort_values(['count'],ascending=False)

# `orient` is the documented seaborn parameter for horizontal bars
ax = sns.barplot(
    x=ds.head(10)["count"],
    y=ds.head(10)["source"],
    orient='h',
)
ax.set_title('Top 10 user sources by number of tweets')

## Data preparation
Now let's clean the text content of the tweets.

In [None]:
def clean_text(text):
    text = text.lower()

    text = text.lstrip()
    text = text.rstrip()

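    # Raw strings avoid invalid-escape warnings in the regular expressions
    text = re.sub(r"\[.*?\]", "", text)  # remove bracketed fragments
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"\n", " ", text)  # replace newlines with a space so words are not glued together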
    text = " ".join(filter(lambda x:x[0]!="@", text.split()))

    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U0001F1F2-\U0001F1F4"  # Macau flag
                               u"\U0001F1E6-\U0001F1FF"  # flags
                               u"\U0001F600-\U0001F64F"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U0001F1F2"
                               u"\U0001F1F4"
                               u"\U0001F620"
                               u"\u200d"
                               u"\u2640-\u2642"
                               "]+", flags=re.UNICODE)

    text = emoji_pattern.sub(r'', text)

    text = emoji.replace_emoji(text, "")

    # Strip Python 2 style u'' prefixes (note: this also alters words such as "you're")
    text = text.replace("u'", "'")

    # Drop any remaining non-ASCII characters
    text = text.encode("ascii", "ignore").decode()

    if not any(c.isalpha() for c in text):
        return ""
    else:
        return text
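
To see what the cleaning steps do to a single tweet, here is a minimal sketch that uses only the standard library. `simple_clean` is a hypothetical, simplified stand-in and not the `clean_text` function above: it skips the `emoji` package and the explicit emoji pattern, and relies on the ASCII round trip to drop emojis. The sample strings are made up for illustration.

```python
import re


def simple_clean(text):
    """Simplified, hypothetical stand-in for clean_text (standard library only)."""
    text = text.lower().strip()
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Drop @mentions
    text = " ".join(w for w in text.split() if not w.startswith("@"))
    # Dropping non-ASCII characters also removes emojis
    text = text.encode("ascii", "ignore").decode()
    # Collapse the whitespace left behind by the removals
    text = " ".join(text.split())
    # Discard results that contain no letters at all
    return text if any(c.isalpha() for c in text) else ""


print(simple_clean("@AWS Loving SageMaker \U0001F600 https://aws.amazon.com"))
print(repr(simple_clean("12345 !!!")))
```

The first call keeps only the lower-cased words, while the second returns an empty string, which the preparation step later turns into a missing value and drops.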

In [None]:
def convert_date(date):
    date = time.mktime(datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S").timetuple())

    return date
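
Note that `time.mktime` interprets the parsed value in the local time zone of the machine running the notebook, so the same date string can map to different epoch values on different hosts. If the tweet dates are in UTC (an assumption, since the dataset description does not say), a variant along these lines gives a host-independent result. `convert_date_utc` is a hypothetical name used only for this sketch.

```python
import datetime


def convert_date_utc(date):
    """Hypothetical variant of convert_date that assumes the string is in UTC."""
    parsed = datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
    # Attach UTC explicitly so the result does not depend on the local time zone
    return parsed.replace(tzinfo=datetime.timezone.utc).timestamp()


print(convert_date_utc("1970-01-02 00:00:00"))  # 86400.0
```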

In [None]:
df = df[["user_name", "date", "text", "Sentiment"]]

LOGGER.info("Original count: {}".format(len(df.index)))

df = df.dropna()

df["user_name"] = df["user_name"].apply(lambda x: clean_text(x))
df["text"] = df["text"].apply(lambda x: clean_text(x))

df['user_name'] = df['user_name'].map(lambda x: x.strip())
df['user_name'] = df['user_name'].replace('', np.nan)
df['user_name'] = df['user_name'].replace(' ', np.nan)

df['date'] = df['date'].map(lambda x: x.strip())
df['date'] = df['date'].replace('', np.nan)
df['date'] = df['date'].replace(' ', np.nan)
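# Skip missing dates: convert_date would raise on NaN, and those rows are dropped below anyway
df["date"] = df["date"].apply(lambda x: convert_date(x) if pd.notna(x) else np.nan)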

df['text'] = df['text'].map(lambda x: x.strip())
df['text'] = df['text'].replace('', np.nan)
df['text'] = df['text'].replace(' ', np.nan)

df['Sentiment'] = df['Sentiment'].map(lambda x: x.strip())
df['Sentiment'] = df['Sentiment'].replace('', np.nan)
df['Sentiment'] = df['Sentiment'].replace(' ', np.nan)

df["Sentiment"] = df["Sentiment"].map({"Negative": 0, "Neutral": 1, "Positive": 2})

df = df.dropna()

LOGGER.info("Current count: {}".format(len(df.index)))
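
The label encoding above relies on `Series.map` returning `NaN` for any value missing from the dictionary, so the final `dropna()` also removes rows with unexpected sentiment labels. A small sketch of that behavior, using made-up rows rather than the real dataset:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset
sample = pd.DataFrame({
    "text": ["great service", "terrible delay", "no opinion", "???"],
    "Sentiment": ["Positive", "Negative", "Neutral", "Unknown"],
})

label_map = {"Negative": 0, "Neutral": 1, "Positive": 2}

# Labels missing from label_map (here "Unknown") become NaN
sample["Sentiment"] = sample["Sentiment"].map(label_map)

# dropna() then removes the rows with unmapped labels
sample = sample.dropna()

print(sample["Sentiment"].tolist())
```

Because `NaN` forces the column to a floating-point dtype, the labels come out as floats; calling `astype(int)` after `dropna()` converts them back if integer labels are needed.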

### Plotting cleaned data

In [None]:
df.head()

We have just cleaned and explored our dataset. Now let's move on and see how to process data using Amazon SageMaker Processing Jobs.

 > [Prepare-Data-ML](./01-Prepare-Data-ML.ipynb)
