# Data Cleaning and Description
Tyler Christensen, William Lewis, Addison Powell, Jared Smith

## Dataset Descriptions
For our project, we intend to explore a few different branches. Each of these requires its own dataset due to the nature of the project, though the methodology will remain largely the same. Our two main branches of exploration are:
- Sentiment Analysis of Casual Texts
    - For this we will use a dataset of over 1.6 million tweets, each labeled with a class related to its sentiment (negative, neutral, or positive).
- Sentiment Analysis of Reviews
    - For this we will use two datasets. The first is a set of Yelp reviews from <a href="https://www.yelp.com/dataset/documentation/main">this link</a>. There are TODO reviews, each labeled with a score out of 5 stars (TODO: CHECK THIS). The second is a set of 50,000 IMDb movie reviews, each labeled with either a positive or negative sentiment.

## Validation Set
Before looking at the data in full, we set aside 20% of each dataset for a final analysis. The held-out rows were chosen at random and immediately saved to separate files, so we cannot access the validation sets unless we explicitly load those files in our code.
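
As a rough illustration of this kind of hold-out split, the sketch below samples 20% of a toy DataFrame at random and keeps the remainder as the working set. The helper `split_holdout`, the seed, and the output file names in the final comment are hypothetical and are not the code we actually ran. It also assumes the index is unique.

```python
import pandas as pd

def split_holdout(df, holdout_frac=0.2, seed=0):
    """Return (working, holdout), with holdout_frac of the rows sampled at random."""
    holdout = df.sample(frac=holdout_frac, random_state=seed)
    # Dropping by index assumes the index values are unique
    working = df.drop(index=holdout.index)
    return working, holdout

# Toy frame standing in for one of the datasets
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "target": [0, 4] * 5,
})
working, holdout = split_holdout(toy)

# Hypothetical file names; saving right away keeps the held-out rows sealed off
# working.to_csv("twitter_data.csv")
# holdout.to_csv("twitter_holdout.csv")
```

Fixing the seed makes the split reproducible, which matters if the files ever have to be regenerated.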

TODO: Yelp Dataset is 5GB, and I don't have near enough space on my computer for that. Do it on another computer.

## Data Access
All the data can be found in <a href="https://drive.google.com/drive/folders/1Hp54gH3TQ93ELuzkHJI76C1nXsSI5CpE?usp=sharing">this Google Drive folder.</a>

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from html.parser import HTMLParser
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')  # fetch the stopword list if it is not already cached

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/smithj00/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing Data

In [36]:
# the directory that the data is stored in 
# don't store in git repo, please store it somewhere else on your computer
# data_dir = '../../data/'
data_dir = '../../../Winter2024/'

# load the datasets
twitter_df = pd.read_csv(data_dir + "twitter_data.csv", index_col=0)
imdb_df = pd.read_csv(data_dir + "imdb_data.csv", index_col=0)
# TODO: yelp_df = pd.read_csv(data_dir + "yelp_data.csv", index_col=0)
twitter_df.head()

Unnamed: 0,target,id,date,flag,user,text
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [49]:
imdb_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive


## Data Transformation
- TODO: change text data into something models can read

In [61]:
# Removes punctuation and numbers (by character) and returns as a single string
def remove_punctuation(text):
    return ''.join([char for char in text if (char not in string.punctuation) and (not char.isdigit())])

# Remove URLs from a string
def remove_urls(text, replacement_text=""):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(replacement_text, text)

# Remove HTML from a string
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()  # buffer that collects the text between tags

    def handle_data(self, d):
        # Called by the parser for each run of plain text
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

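# Splits the message on runs of one or more non-word characters
# Returns a list of tokens, dropping the empty strings that re.split
# produces when the text starts or ends with a separator
def tokenize(text):
    return [token for token in re.split(r"\W+", text) if token]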
    
# Build the stopword set and the stemmer once instead of on every call
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Remove stopwords from a list of tokens
# Also reduce the remaining words to their root form
def remove_stopwords(tokens):
    return [STEMMER.stem(word) for word in tokens if word not in STOPWORDS]


In [62]:
# Apply cleaning functions to the data
clean_data = twitter_df['text'].apply(lambda x: remove_stopwords(   # Remove stopwords and shorten to root words
                                                tokenize(           # Split message into a list
                                                remove_punctuation( # Remove punctuation and numbers
                                                remove_urls(        # Remove URLs
                                                strip_tags(x)       # Remove HTML tags
                                                )).lower())))

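# Identity function: the text is already cleaned and tokenized
f = lambda x: x
# An alternative function could be TfidfVectorizer, but
# I just don't understand how that one works enough
# token_pattern=None silences the warning about the unused default pattern
vectorizer = CountVectorizer(preprocessor=f, tokenizer=f, token_pattern=None)  # Turn lists of words into numbers
X = vectorizer.fit_transform(clean_data)
# This should show a list of all the words that are our features
print(vectorizer.get_feature_names_out())
# The counts are stored as a sparse matrix. Converting all of it with
# toarray() would need far too much memory at this scale, so show the
# shape and a small dense slice instead
print(X.shape)
print(X[:5, :20].toarray())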
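
Since TF-IDF comes up above as a possible alternative to raw counts, here is a small pure-Python sketch of what it computes. It uses the smoothed weighting that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The function `tfidf` and the toy token lists are hypothetical and only illustrate the idea; this is not a replacement for the library.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, rows) of L2-normalized weights."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: how many documents contain each token
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: rarer tokens get larger weights
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts, as CountVectorizer would give
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy documents in the same shape as the cleaned, tokenized tweets
docs = [["upset", "updat", "facebook"], ["dive", "ball", "ball"], ["facebook", "fire"]]
vocab, rows = tfidf(docs)
```

In the first toy document, "facebook" ends up with a smaller weight than "upset" because it also appears in another document. `TfidfVectorizer` accepts the same `preprocessor` and `tokenizer` arguments as `CountVectorizer`, so it could be swapped in once the weighting makes sense for our models.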

## Feature Analysis
- TODO: Any missing values? How to impute?
- TODO: Variables to drop?
- TODO: What feature engineering is required?

## Data Visualizations and Analysis
The distribution of words used should follow a Zipf distribution, even with a smaller lexicon of keywords that excludes things such as particles. See this figure for a Zipf distribution on an X (formerly known as Twitter) dataset: https://www.researchgate.net/figure/Zipf-distribution-of-Twitter-keywords-at-different-spatial-levels_fig1_311857596  
For a good/bad sentiment analysis, we also expect the labels to follow a binomial distribution weighted heavily toward the negative sentiment (our assumption is that most posts on X are negative).  
If we have pairs of either/or states, such as a good/bad sentiment or a funny/serious sentiment, we can make predictions from a correlation matrix. Some visualizations include identifying keywords and creating a heatmap of the intensity with which they correspond to our hidden states. Other visualizations could include lists of words or sentence fragments that correspond most strongly to certain sentiments.
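
One way to check the Zipf expectation is to rank tokens by frequency and fit a line to log(frequency) against log(rank); a slope near -1 is the classic Zipf signature. The sketch below does this on toy data built to follow the law exactly. The helpers `rank_frequency` and `zipf_slope` are hypothetical names, and the toy token lists stand in for the cleaned tweets.

```python
import math
from collections import Counter

def rank_frequency(token_lists):
    """Return [(rank, frequency), ...] with rank 1 for the most common token."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Toy data whose frequencies are exactly 12 / rank
toy_tokens = [["a"] * 12 + ["b"] * 6, ["c"] * 4 + ["d"] * 3]
pairs = rank_frequency(toy_tokens)
slope = zipf_slope(pairs)
```

On the real data, passing the ranks and frequencies to `plt.loglog` would give the usual rank-frequency plot, where a roughly straight line suggests Zipf-like behavior.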

## Conclusions
- TODO: What assumptions do our models rely on, and does our data line up with these models? Or does our dataset change what models would be appropriate?