# Data exploration/visualization

**SageMaker Studio Kernel**: Data Science

The challenge we're trying to address here is identifying the sentiment of tweets.
The dataset is a public dataset taken from [Kaggle](https://www.kaggle.com/code/sagniksanyal/tweet-s-text-classicifaction/data).
Each record contains:
 - Username
 - User location
 - User description
 - User creation date
 - User followers
 - User friends
 - User favourites
 - User is verified
 - Date of the tweet
 - Text of the tweet
 - Sentiment associated with the tweet

Let's start preparing our dataset, then.

## Let's take a look at the data
Loading the dataset using Pandas...

In [None]:
import argparse
import boto3
import csv
import logging
import os
from os import listdir
from os.path import isfile, join
import pandas as pd
import pathlib
import re
import traceback

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
file_name = "TheSocialDilemma.csv"

In [None]:
df = pd.read_csv(
    "./../data/{}".format(file_name),
    sep=",",
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    escapechar='\\',
    encoding='utf-8',
    on_bad_lines="skip"  # replaces error_bad_lines=False (removed in pandas 2.0); requires pandas >= 1.3
)

### Previewing the data, just to get an idea

In [None]:
df.head()

In [None]:
df[["text", "Sentiment"]].head()
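
Before cleaning, it can also help to check how the labels are distributed and how many values are missing. The following is a minimal sketch on a small made-up DataFrame (the rows and the name `sample_df` are hypothetical); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
sample_df = pd.DataFrame({
    "text": ["Great documentary", "Not convinced", None, "Scary but true"],
    "Sentiment": ["Positive", "Negative", "Neutral", "Positive"],
})

# Number of tweets per sentiment label
label_counts = sample_df["Sentiment"].value_counts()
print(label_counts)

# Number of missing values per column
missing_counts = sample_df[["text", "Sentiment"]].isna().sum()
print(missing_counts)
```

A strongly skewed label distribution is worth knowing about early, since it may affect how the model is evaluated later.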

## Data preparation
Now let's clean the text content of the tweets

In [None]:
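def clean_text(text):
    """Lowercase a tweet and strip bracketed text, URLs, newlines and @mentions."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)  # drop text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"\n", " ", text)  # newline -> space so adjacent words stay separate
    # Drop @mentions; split/join also collapses repeated whitespace
    text = " ".join(filter(lambda x: x[0] != "@", text.split()))
    return text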
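
To see what the cleaning does, the main substitutions can be tried one at a time on a single string. This is only an illustrative sketch: the sample tweet and the intermediate variable names are made up, and the patterns mirror the ones used in `clean_text`.

```python
import re

# Hypothetical tweet, not taken from the dataset
sample = "Check [this] out https://example.com/video @someone great docu"

# Lowercase, then remove anything inside square brackets
no_brackets = re.sub(r"\[.*?\]", "", sample.lower())

# Remove URLs
no_urls = re.sub(r"https?://\S+|www\.\S+", "", no_brackets)

# Drop tokens that start with "@" and collapse extra whitespace
no_mentions = " ".join(t for t in no_urls.split() if not t.startswith("@"))

print(no_mentions)  # check out great docu
```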

In [None]:
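df = df[["text", "Sentiment"]]

LOGGER.info("Original count: {}".format(len(df.index)))

# Drop rows with a missing tweet text or label; copy so the assignments below
# do not trigger a SettingWithCopyWarning
df = df.dropna(subset=["text", "Sentiment"]).copy()

LOGGER.info("Current count: {}".format(len(df.index)))

df["text"] = df["text"].apply(clean_text)
# Encode the labels as integers; any label outside this mapping becomes NaN
df["Sentiment"] = df["Sentiment"].map({"Negative": 0, "Neutral": 1, "Positive": 2})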

### Previewing the cleaned data

In [None]:
df.head()
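
Two things can go unnoticed after this kind of cleaning: a tweet made up only of mentions and links ends up as an empty string, and `Series.map` returns NaN for any label that is not in the mapping. The sketch below shows one possible sanity check on a made-up frame (`cleaned_df` and its rows are hypothetical); whether such rows should be dropped is a judgment call the notebook does not make.

```python
import pandas as pd

# Hypothetical cleaned rows: one text became empty, one label was not in the mapping
cleaned_df = pd.DataFrame({
    "text": ["great documentary", "", "scary but true"],
    "Sentiment": [2.0, 1.0, float("nan")],
})

# Keep rows that still have some text and a valid label
has_text = cleaned_df["text"].str.strip() != ""
has_label = cleaned_df["Sentiment"].notna()
checked_df = cleaned_df[has_text & has_label].copy()

# With no NaN left, the labels can be stored as integers
checked_df["Sentiment"] = checked_df["Sentiment"].astype(int)
print(checked_df)
```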

We have just cleaned and explored our dataset. Now let's move on to the end-to-end journey with Amazon SageMaker.

 > [Train-Build-Model](./01-Train-Build-Model.ipynb)