# Data Cleaning and Description
Tyler Christensen, William Lewis, Addison Powell, Jared Smith

## Dataset Descriptions
For our project, we intend to explore a few different branches. Each branch requires its own dataset due to the nature of the project, though the methodology will remain largely the same. Our two main branches of exploration are:
- Sentiment Analysis of Casual Texts
    - For this we will use a dataset of over 1.6 million tweets, each labeled with a class related to its sentiment (negative, neutral, or positive).
- Sentiment Analysis of Reviews
    - For this we will use two datasets. The first is a set of Yelp reviews from <a href="https://www.yelp.com/dataset/documentation/main">this link</a>. We use 1,250,000 reviews (a subset of the full dataset of 6 million reviews, chosen because of memory concerns), each labeled with a score out of 5 stars. The second is a set of 50,000 IMDb movie reviews, each labeled with either a positive or negative sentiment.

## Validation Set
Before we began to look at the data in full, we sealed off 20% of each set to save for a final analysis. This was split via files, and as such we will not be able to access the validation sets unless we specifically load those files in our code. The split was chosen randomly and immediately saved into a separate file. We will reopen these and check our results towards the end of our project.
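
As a rough illustration of the kind of split described above, here is a minimal sketch using only the standard library. The function name `split_holdout`, the seed, and the toy rows are assumptions made for illustration; they are not the code we actually used to create the files.

```python
import random

def split_holdout(rows, holdout_frac=0.2, seed=0):
    """Randomly partition rows into (working, holdout) lists."""
    rng = random.Random(seed)            # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_holdout = int(round(len(rows) * holdout_frac))
    holdout_idx = set(indices[:n_holdout])
    # Preserve the original row order inside each partition
    working = [row for i, row in enumerate(rows) if i not in holdout_idx]
    holdout = [row for i, row in enumerate(rows) if i in holdout_idx]
    return working, holdout

# Hypothetical toy data standing in for one of the datasets
rows = [f"review {i}" for i in range(10)]
working, holdout = split_holdout(rows)
```

With a pandas DataFrame, a comparable split could be obtained with `df.sample(frac=0.2, random_state=seed)` for the held-out rows and `df.drop(holdout.index)` for the rest, with each part then written to its own file.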

## Data Access
All the data can be found in <a href="https://drive.google.com/drive/folders/1Hp54gH3TQ93ELuzkHJI76C1nXsSI5CpE?usp=sharing">this Google Drive folder.</a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from html.parser import HTMLParser
from tqdm.notebook import tqdm_notebook
from tqdm import tqdm
import string
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import download

tqdm.pandas()
download('stopwords')

[nltk_data] Downloading package stopwords to /home/tylerc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing Data

In [2]:
# the directory that the data is stored in 
# don't store in git repo, please store it somewhere else on your computer
# data_dir = '../../data/'
data_dir = '/home/tylerc/dat/school/acme/'

# load the datasets
twitter_df = pd.read_csv(data_dir + "twitter_data.csv", index_col=0)
twitter_df.head()

Unnamed: 0,target,id,date,flag,user,text
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [3]:
imdb_df = pd.read_csv(data_dir + "imdb_data.csv", index_col=0)
imdb_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive


In [4]:
yelp_df = pd.read_csv(data_dir + "yelp_data.csv", index_col=0)
yelp_df.head()

Unnamed: 0,user_id,business_id,stars,useful,funny,cool,date,text
477697,5sHSDjYnNkaia5lUec7rNw,ZSJeZPEoHXMdgMpRIMBpiQ,5.0,2,1,2,We have been visiting the Devon Horse Show for...,2019-05-30 14:18:56
757770,4dwF1g0wOZjwjxyQ8cRVoA,AjQGanUkM-SFa7MxwTfMRw,5.0,2,0,1,"What an amazing, random Friday adventure. I re...",2019-06-21 23:14:42
1013355,NyPaks2v8GkcWVsCXctpKA,G8r_HHphWNWfRN0LgVwrFA,1.0,0,0,0,Last three items broke- they wouldn't return t...,2011-11-14 01:10:28
98618,gpzgC3AwKY7cLkMsdSNA3w,2kuhZOrWcLYe_XePccr4lA,4.0,0,0,0,Delicious. We got Chicago style. Nice place. O...,2018-03-19 00:28:12
446665,Iu1akOzyVihFr7oj9JnK1Q,dvNNkfCyAjOq1HHltSRXRA,5.0,0,0,0,"Great experience, Roman is the best, he went b...",2016-04-15 17:24:36


## Data Transformation
Because of memory and time constraints on our local machines, we perform the data transformation on a subset of 10,000 samples from the Twitter data. However, the process is the same for the entire dataset.

Since we are treating each sequence of data as 'time series' data, the transformation process requires a few steps.
- First, we clean the data of punctuation, numbers, URLs, and HTML tags. We also remove stop words (words that are so widely used that they contain no useful information) and reduce the remaining words to their stems.
- Next, we create a large corpus of unique words found in the dataset.
- Last, we iterate through each word in each sequence and replace it with the index in the corpus. 

As an example, assume our data is the following two rows:<br>
```
["Hey, I was curious about why you would even think that @Wendys?",
"Just saw @Dune. I think it was pretty good. What about you?"]
```
The first step would transform the data like so (stop-word removal and stemming are left out here to keep the example readable):<br>
```
[['hey', 'i', 'was', 'curious', 'about', 'why', 'you', 'would', 'even', 'think', 'that', 'wendys'],
 ['just', 'saw', 'dune', 'i', 'think', 'it', 'was', 'pretty', 'good', 'what', 'about', 'you']]
```
Then, we create a set of unique words:
```
['hey', 'i', 'was', 'curious', 'about', 'why',
 'you', 'would', 'even', 'think', 'that', 'wendys',
 'just', 'saw', 'dune', 'it', 'pretty', 'good',
 'what']
```
Last, we replace each word in the transformed data with its index in the corpus, giving us our ordered sequence of data.
```
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [12, 13, 14, 1, 9, 15, 2, 16, 17, 18, 4, 6]]
```
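
The three steps of the worked example can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's pipeline: the helper names `clean`, `build_vocab`, and `encode` are hypothetical, stop-word removal and stemming are skipped as in the example, and each distinct word is assigned one index in order of first appearance.

```python
import string

def clean(text):
    # Lowercase, drop punctuation and digits, then split on whitespace
    kept = "".join(
        ch for ch in text.lower()
        if ch not in string.punctuation and not ch.isdigit()
    )
    return kept.split()

def build_vocab(messages):
    # Map each distinct word to the position at which it was first seen
    vocab = {}
    for message in messages:
        for word in message:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(messages, vocab):
    # Replace every word with its index in the vocabulary
    return [[vocab[word] for word in message] for message in messages]

raw = [
    "Hey, I was curious about why you would even think that @Wendys?",
    "Just saw @Dune. I think it was pretty good. What about you?",
]
tokens = [clean(text) for text in raw]
vocab = build_vocab(tokens)
sequences = encode(tokens, vocab)
```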


In [5]:
# Removes punctuation and numbers (by character) and returns as a single string
def remove_punctuation(text):
    return ''.join([char for char in text if (char not in string.punctuation) and (not char.isdigit())])

# Remove URLs from a string
def remove_urls(text, replacement_text=""):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(replacement_text, text)

# Remove HTML from a string
class MLStripper(HTMLParser):
    def __init__(self):
        # HTMLParser.__init__ already calls reset();
        # convert_charrefs=True decodes entities such as &amp; into text
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Collect only the text found between tags
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

# Splits the message on one or more non-word characters
# Returns as a list
def tokenize(text):
    # Raw string so that \W is passed to the regex engine unchanged
    return re.split(r"\W+", text)
    
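# Define stopwords and remove them from the list
# Also reduce words to the root word
STOPWORDS = set(stopwords.words('english'))  # built once; set membership is fast
STEMMER = PorterStemmer()

def remove_stopwords(text):
    # text is a list of tokens; returns the stemmed tokens that are not stopwords
    return [STEMMER.stem(word) for word in text if word not in STOPWORDS]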



In [10]:
# Apply cleaning functions to the data
twitter_subset = twitter_df.sample(10000)
clean_data = twitter_subset['text'].progress_apply(lambda x: remove_stopwords(   # Remove stopwords and shorten to root words
                                                tokenize(           # Split message into a list
                                                remove_punctuation( # Remove punctuation and numbers
                                                remove_urls(        # Remove URLs
                                                strip_tags(x)       # Remove HTML tags
                                                )).lower())))

100%|██████████| 10000/10000 [00:01<00:00, 5889.89it/s]


In [16]:
# Turn each message into a sequence of unique indices
#   that correspond to a given word
loop = tqdm(total=len(clean_data), position=0, leave=False)

words = []
for message in clean_data:
    words = list(set(words + message))
    loop.update()

 98%|█████████▊| 9837/10000 [00:03<00:00, 1853.90it/s] 

In [17]:

seq_data = [[words.index(word) for word in message] for message in clean_data]

100%|██████████| 10000/10000 [00:17<00:00, 1853.90it/s]
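
The `words.index(word)` call above scans the vocabulary list once for every token, which makes this the slowest step. A minimal sketch of an alternative, assuming a dictionary from word to index is acceptable, is shown below. The function name `index_messages` and the toy messages are hypothetical, and because the vocabulary is sorted here, the indices it produces will differ from the ones printed in this notebook.

```python
def index_messages(messages):
    # One pass to collect the distinct words; sorting makes the indices reproducible
    vocabulary = sorted(set().union(*messages))
    # Dictionary lookups are constant time, unlike list.index
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    sequences = [[word_to_index[word] for word in message] for message in messages]
    return vocabulary, sequences

# Toy stand-in for clean_data
messages = [["lost", "camera", "last", "night"], ["last", "night", "fun"]]
vocabulary, sequences = index_messages(messages)
```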

In [45]:
i = 5129
print("Before/After Transformation:")
print(twitter_subset.iloc[i]['text'])
print(f"{clean_data.iloc[i]}")
print(f"{seq_data[i]}")

Before/After Transformation:
and i lost my camera last night 
['lost', 'camera', 'last', 'night', '']
[8478, 1479, 13632, 751, 0]
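
The trailing `''` token in the output above appears because the tweet ends with a space, so splitting on non-word characters yields an empty string after the last word. One possible way to avoid this, sketched here under the assumption that empty tokens carry no information for our models, is to filter them out during tokenization. The name `tokenize_nonempty` is hypothetical.

```python
import re

def tokenize_nonempty(text):
    # Split on runs of non-word characters and discard empty strings
    return [token for token in re.split(r"\W+", text) if token]

tokens = tokenize_nonempty("and i lost my camera last night ")
```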


## Feature Analysis
- TODO: Any missing values? How to impute?
- TODO: Variables to drop?
- TODO: What feature engineering is required?

## Data Visualizations and Analysis
The distribution of words used should follow a Zipf distribution, even with a smaller lexicon of keywords that excludes things such as particles. See this paper for a Zipf distribution on an X (formerly known as Twitter) dataset: https://www.researchgate.net/figure/Zipf-distribution-of-Twitter-keywords-at-different-spatial-levels_fig1_311857596  
For a good/bad sentiment analysis, we expect to see a binomial distribution heavily weighted toward the negative sentiment (we expect most posts on X to be negative).  
If we have pairs of either/or states, such as a good/bad sentiment or a funny/serious sentiment, we can make predictions from a correlation matrix. Possible visualizations include identifying keywords and creating a heatmap of the intensity with which they correspond to our hidden states. Other visualizations could include lists of the words or sentence fragments that correspond most strongly to certain sentiments.
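
To check the Zipf expectation, one option is to rank words by frequency and plot rank against frequency on log-log axes, where a Zipf-like distribution appears roughly as a straight line. The sketch below is only an illustration: `rank_frequency` and `plot_zipf` are hypothetical helpers, and the toy messages stand in for the cleaned token lists.

```python
from collections import Counter

def rank_frequency(messages):
    # Count every token, then pair each frequency with its rank (1 = most common)
    counts = Counter(word for message in messages for word in message)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))

def plot_zipf(pairs):
    # Imported here so the counting code runs even without matplotlib installed
    import matplotlib.pyplot as plt
    ranks, freqs = zip(*pairs)
    plt.loglog(ranks, freqs, marker=".")
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title("Word rank vs. frequency")
    plt.show()

# Toy stand-in for clean_data
toy = [["good", "day"], ["good", "night"], ["good", "day", "today"]]
pairs = rank_frequency(toy)
```

Calling `plot_zipf(rank_frequency(clean_data))` would then draw the curve for the cleaned tweets.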

## Conclusions
- TODO: What assumptions do our models rely on, and does our data line up with these models? Or does our dataset change what models would be appropriate?