# Data exploration and visualization

This notebook demonstrates how to perform data analysis and feature engineering interactively with Amazon SageMaker Studio.

By running the cells in order, we read the data and visualize it. We also use the [SageMaker Data Wrangler connector](https://aws.amazon.com/blogs/machine-learning/interactive-data-prep-widget-for-notebooks-powered-by-amazon-sagemaker-data-wrangler/), an interactive widget for notebooks, to prepare the data.

Let's start preparing our dataset.

**SageMaker Studio Kernel**: Data Science

# Install Dependencies

Let's install some required dependencies for our environment.

In [None]:
! pip install emoji seaborn

***

# Dataset

The dataset (The Social Dilemma Tweets - Text Classification 2020) was downloaded from [Kaggle](https://www.kaggle.com/datasets/kaushiksuresh147/the-social-dilemma-tweets).
It contains the Twitter responses posted with the #TheSocialDilemma hashtag after the documentary "The Social Dilemma" was released on an OTT platform (Netflix) on September 9th, 2020.
The dataset was extracted using the Twitter API and consists of nearly 10,526 tweets from Twitter users all over the globe.

We'd like to train a model that determines the sentiment of a tweet from its text.

This is a multi-class classification problem:
* Negative - 0
* Neutral - 1
* Positive - 2
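
The encoding above can be captured in a pair of lookup tables, which is convenient later for turning predicted class ids back into readable names. This is a minimal sketch: `LABEL2ID`, `ID2LABEL`, and `encode_labels` are illustrative names that do not appear elsewhere in this notebook.

```python
# Illustrative names, not part of the original notebook.
LABEL2ID = {"Negative": 0, "Neutral": 1, "Positive": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_labels(names):
    """Map sentiment names to class ids; unknown names become None."""
    return [LABEL2ID.get(name.strip()) for name in names]


encoded = encode_labels(["Positive", " Neutral ", "Negative", "Unknown"])
print(encoded)
print([ID2LABEL[i] for i in encoded if i is not None])
```

Returning `None` for unknown names mirrors the cleaning step later in the notebook, where unmapped sentiments become missing values and are dropped.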


In [None]:
! rm -rf ./data && mkdir -p data
! curl https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/tweets_dataset/TheSocialDilemma.csv -o data/data.csv

***

# Step 1 - Import Modules

In [None]:
import boto3
import csv
import emoji
import logging
import numpy as np
import pandas as pd
import re
import sagemaker
import sagemaker_datawrangler
import seaborn as sns
from sklearn.model_selection import train_test_split

sns.set(rc={'figure.figsize':(11.7,8.27)})

In [None]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

In [None]:
sagemaker_session = sagemaker.Session()

***

# Step 2 - Data Exploration

Load the dataset using pandas.

Each row contains the following fields:
 - Username
 - User location
 - User description
 - User creation date
 - User followers
 - User friends
 - User favourites
 - User is verified
 - Date of the tweet
 - Text of the tweet
 - Sentiment associated to the text

In [None]:
file_name = "data.csv"

In [None]:
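df = pd.read_csv(
    "./data/{}".format(file_name),
    sep=",",
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    escapechar='\\',
    encoding='utf-8',
    # Skip malformed rows; replaces error_bad_lines=False, which was removed in pandas 2.0.
    # On pandas older than 1.3, use error_bad_lines=False instead.
    on_bad_lines="skip"
)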

### Plotting the data to get a first impression

In [None]:
df

In [None]:
%matplotlib inline

ds = df['source'].value_counts().reset_index()
ds.columns = ['source', 'count']
ds = ds.sort_values(['count'],ascending=False)

top_sources = ds.head(10)

# barplot returns the Axes; set the title on it instead of keeping the Text object
ax = sns.barplot(data=top_sources, x="count", y="source", orient="h")
ax.set_title("Top 10 user sources by number of tweets")

With the `sagemaker_datawrangler` connector, we can apply data transformations directly from the table and automatically generate the code that reproduces those data preparation steps in another notebook cell.

Now let's clean the text content of the tweets.

In [None]:
def clean_text(text):
    """Normalize a tweet: lowercase, strip URLs, mentions, emojis and non-ASCII characters."""
    text = text.lower().strip()

    # Raw strings avoid invalid-escape warnings in the regex patterns
    text = re.sub(r"\[.*?\]", "", text)  # bracketed fragments
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"\n", "", text)  # line breaks
    # Drop user mentions (tokens starting with "@") and collapse whitespace
    text = " ".join(word for word in text.split() if not word.startswith("@"))

    text = emoji.replace_emoji(text, "")

    # Remove leftover u'...' prefixes without touching contractions such as "you're"
    text = re.sub(r"\bu'", "'", text)

    # Drop any remaining non-ASCII characters
    text = text.encode("ascii", "ignore").decode()

    # Discard texts that contain no letters at all
    return text if any(c.isalpha() for c in text) else ""
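
To see what the regex-based steps of the cleaning function do to a tweet, here is a cut-down sketch that leaves out the emoji and ASCII steps so that it runs with the standard library alone. The sample tweet and the `strip_noise` helper are made up for illustration.

```python
import re

# Made-up sample tweet, used only for illustration.
sample = "@netflix Loved it! [link] https://example.com/watch\nMust see www.example.org"


def strip_noise(text):
    """Apply only the regex-based steps: brackets, URLs, newlines, mentions."""
    text = text.lower().strip()
    text = re.sub(r"\[.*?\]", "", text)  # bracketed fragments
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"\n", "", text)  # line breaks
    # Drop tokens that start with "@" (user mentions).
    return " ".join(tok for tok in text.split() if not tok.startswith("@"))


cleaned = strip_noise(sample)
print(cleaned)
```

Splitting on whitespace and joining with single spaces also collapses the gaps left behind by the removed fragments.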

In [None]:
# Keep only the columns needed for training; copy to avoid SettingWithCopyWarning
df = df[["text", "Sentiment"]].copy()

logger.info("Original count: {}".format(len(df.index)))

df = df.dropna()

# Clean the tweets and turn empty results into missing values
df["text"] = df["text"].apply(clean_text).str.strip()
df["text"] = df["text"].replace("", np.nan)

# Normalize the labels and map them to class ids (unknown labels become NaN)
df["Sentiment"] = df["Sentiment"].str.strip()
df["Sentiment"] = df["Sentiment"].map({"Negative": 0, "Neutral": 1, "Positive": 2})

# Drop rows with an empty text or an unknown label
df = df.dropna()

df = df.rename(columns={"Sentiment": "labels"})

# The mapping yields floats when NaN values were present; store ids as integers
df["labels"] = df["labels"].astype(int)

df = df[["text", "labels"]]

logger.info("Current count: {}".format(len(df.index)))

### Previewing the cleaned data

In [None]:
df.head()

### Split dataset into Train and Test

In [None]:
data_train, data_test = train_test_split(df, test_size=0.2)

logger.info("Training dataset count: {}".format(len(data_train.index)))
logger.info("Test dataset count: {}".format(len(data_test.index)))
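
The split above is random and unseeded, so the class proportions and the exact rows in each part can vary between runs. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments to control this. The pure-Python sketch below is not the scikit-learn implementation; it only illustrates what a seeded, stratified 80/20 split does, using a hypothetical `stratified_split` helper and toy rows.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_size=0.2, seed=42):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 10 rows per class, labels 0/1/2 as in this notebook.
rows = [(f"tweet {i}", i % 3) for i in range(30)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))
```

Splitting each class separately keeps rare classes represented in the test set, which matters when the sentiment distribution is imbalanced.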

In [None]:
data_train.to_csv(
    "./data/train.csv",
    index=False,
    header=True,
    quoting=csv.QUOTE_ALL,
    encoding="utf-8",
    escapechar="\\",
    sep=","
)

In [None]:
data_test.to_csv(
    "./data/test.csv",
    index=False,
    header=True,
    quoting=csv.QUOTE_ALL,
    encoding="utf-8",
    escapechar="\\",
    sep=","
)
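
Both files are written with every field quoted, so tweets that contain commas or double quotes remain a single column. The sketch below illustrates the idea with the standard `csv` module rather than pandas. The rows are made up, and the labels are strings because `csv.reader` returns every field as text.

```python
import csv
import io

# Made-up rows; the second text contains a comma and double quotes.
rows = [["text", "labels"], ["great documentary", "2"], ['he said "wow", twice', "1"]]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, escapechar="\\")
writer.writerows(rows)

# Read the same content back with matching settings.
buffer.seek(0)
reader = csv.reader(buffer, quoting=csv.QUOTE_ALL, escapechar="\\")
round_tripped = list(reader)

print(buffer.getvalue())
```

Reading with the same quoting settings that were used for writing is what makes the round trip lossless.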

### Upload data to Amazon S3

In [None]:
bucket_name = sagemaker_session.default_bucket()

In [None]:
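# Remove any objects left under the output prefix by a previous run
# (delete_object only removes a single exact key, so delete by prefix instead)
boto3.resource("s3").Bucket(bucket_name).objects.filter(Prefix="e2e-base/data/output/").delete()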

In [None]:
train_path = sagemaker_session.upload_data('./data/train.csv', key_prefix="e2e-base/data/output/train")

train_path

In [None]:
test_path = sagemaker_session.upload_data('./data/test.csv', key_prefix="e2e-base/data/output/test")

test_path
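
The values shown for `train_path` and `test_path` are S3 URIs. For a single file, the SageMaker SDK documents the returned format as `s3://{bucket}/{key_prefix}/{file name}`. The sketch below rebuilds that layout offline as a sanity check; `expected_s3_uri` is a hypothetical helper and `my-bucket` is a placeholder, not the real default bucket name.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build s3://{bucket}/{key_prefix}/{file name} for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return "s3://" + posixpath.join(bucket, key_prefix, file_name)


# "my-bucket" is a placeholder, not the real default bucket name.
uri = expected_s3_uri("my-bucket", "e2e-base/data/output/train", "./data/train.csv")
print(uri)
```

Comparing the returned paths against this pattern is a quick way to confirm that the training and test files landed under separate prefixes.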