# W266 Project

### Adam Sayre & Erin Werner

## Personal Cleaning Method

Although the dataset provides both the original content and a cleaned version, we want to apply our own cleaning techniques and compare how each version performs in the same models.

To start, we look at the cleaned and original content provided in the dataset.

In [14]:
import numpy as np
import csv
import pandas as pd 
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import importlib
import emoji
import tensorflow as tf
import nltk
import re
from nltk.corpus import brown
nltk.download('stopwords')
from nltk.corpus import stopwords
assert nltk.download("treebank")
from nltk.corpus import europarl_raw
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import Counter
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/erinwerner/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/erinwerner/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [20]:
# Keras libraries
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#from tensorflow.keras.utils.np_utils import to_categorical
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow.keras

In [2]:
data = pd.read_csv("~/Downloads/dataset(clean).csv") 
data.head()

Unnamed: 0,Emotion,Content,Original Content
0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...
1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it
2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...
3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t..."
4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...


The cleaned content drops all website links and user tags, and it collapses each emoji name into a single token (for example, `grinningfacewithsweat`).

#### Custom Preprocessing Technique #1

For our own cleaning technique, we make several changes to the original content. First, we replace user tags and website links with the tokens 'USERTAGINSTANCE' and 'WEBSITEINSTANCE', respectively. These Twitter interactions may carry sentiment that is useful to the model, and the placeholders let us generalize over them in the same way that numbers are often replaced in other NLP tasks. Next, we strip special characters, lowercase the text, and remove stopwords, so the placeholders end up lowercased as well. Last, we split the emoji name descriptions into individual tokens, because the words within a name may be more informative individually than as one fused token. As a result, this cleaning approach produces different text from the dataset's own cleaned version.
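
The steps above can be sketched on a single tweet. This is a minimal illustration rather than the exact notebook code: the helper name `clean_tweet`, the tiny stopword set, the sample tweet, and the `:snake_case:` emoji form are all assumptions made for the example (the notebook itself uses NLTK's English stopword list).

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"i", "did", "a", "the", "it", "by", "am", "so", "that"}

def clean_tweet(raw_text, stopword_set=STOPWORDS):
    """Hypothetical helper sketching the cleaning steps described above."""
    text = re.sub(r'^b[\'"]', '', raw_text)                      # drop the leading bytes-literal marker
    text = re.sub(r'@\S+', ' USERTAGINSTANCE ', text)            # user tags -> placeholder
    text = re.sub(r'https?://\S+', ' WEBSITEINSTANCE ', text)    # links -> placeholder
    # Assumed emoji form ":grinning_face_with_sweat:" -> separate word tokens
    text = re.sub(r':([a-z_]+):',
                  lambda m: ' ' + m.group(1).replace('_', ' ') + ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text).lower()             # letters only, lowercase
    return ' '.join(tok for tok in text.split() if tok not in stopword_set)

sample = "b'RT @someuser: Oh no.... did I break it :grinning_face_with_sweat: https://t.co/abc123'"
cleaned = clean_tweet(sample)
print(cleaned)
# rt usertaginstance oh no break grinning face with sweat websiteinstance
```

Because the placeholders are inserted before lowercasing, they appear as `usertaginstance` and `websiteinstance` in the cleaned text.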

In [3]:
data['E_Content'] = data['Original Content']

In [4]:
def preprocess(raw_text):
    # Keep letters only, lowercase, and drop English stopwords
    stopword_set = set(stopwords.words("english"))
    letters_only = re.sub(r'[^a-zA-Z\s]', " ", raw_text)
    return " ".join(w for w in letters_only.lower().split() if w not in stopword_set)

In [None]:
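for i in range(len(data)):
    tweet = data.loc[i, 'E_Content']
    tweet = re.sub(r'^b[\'"]', '', tweet)                      # strip the leading bytes-literal marker only
    tweet = re.sub(r'@\S+', 'USERTAGINSTANCE', tweet)          # user tags -> placeholder
    tweet = re.sub(r'https?://\S+', 'WEBSITEINSTANCE', tweet)  # whole link -> placeholder
    tweet = preprocess(tweet)

    if i % 2000 == 0:
        print(i)  # progress indicator

    # .loc avoids pandas chained-assignment problems
    data.loc[i, 'E_Content'] = tweet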

In [None]:
#na_index = data_e[pd.isna(data_e['E_Content'])].index

#for n in na_index:
#    data_e['E_Content'][n] = data_e['Content'][n]

In [None]:
#data.to_csv("~/Downloads/dataset(clean)_e.csv")

In [5]:
data_e = pd.read_csv("~/Downloads/dataset(clean)_e.csv") 
data_e.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Emotion,Content,Original Content,E_Content,label
0,0,0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...,rt usertaginstance usertaginstance oh fuck wro...,0
1,1,1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it,feel shamed,0
2,2,2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...,feeling little bit defeated steps faith would ...,0
3,3,3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t...",usertaginstance imagine reaction guy called jj...,1
4,4,4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...,wouldnt feel burdened would live life testamen...,0


#### Custom Preprocessing Technique #2

In [23]:
# Some starting variables
vocab_size = 10000
max_length = 40

In [24]:
X = data['Original Content'].to_numpy()
y = data.Emotion.to_numpy()

# First split the data into train and test
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.10, random_state=42)

# Next split the train data into train and dev data
X_train_a, X_dev_a, y_train_a, y_dev_a = train_test_split(X_train_a, y_train_a, test_size=0.33, random_state=42)
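
Because the second split is applied to the 90% that remains after the first, the dev set is 33% of the remainder rather than 33% of the full dataset. The short calculation below works out the resulting proportions; exact row counts depend on the dataset size and on how `train_test_split` rounds, so only fractions are shown.

```python
# Fractions passed to train_test_split in the cell above
test_frac = 0.10
dev_frac_of_remainder = 0.33

remainder = 1.0 - test_frac                  # share left after removing the test set
dev_frac = remainder * dev_frac_of_remainder
train_frac = remainder - dev_frac

print(f"train={train_frac:.3f}, dev={dev_frac:.3f}, test={test_frac:.3f}")
# train=0.603, dev=0.297, test=0.100
```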

In [25]:
# Tokenizing
tk = Tokenizer(num_words=vocab_size, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{"}~\t\n', lower=True, split=" ")
tk.fit_on_texts(X_train_a)

X_train_seq = tk.texts_to_sequences(X_train_a)
X_train_seq_trunc = pad_sequences(X_train_seq, maxlen=max_length)

X_dev_seq = tk.texts_to_sequences(X_dev_a)
X_dev_seq_trunc = pad_sequences(X_dev_seq, maxlen=max_length)

X_test_seq = tk.texts_to_sequences(X_test_a)
X_test_seq_trunc = pad_sequences(X_test_seq, maxlen=max_length)

# Encoding output variable
le = LabelEncoder()

y_train_le = le.fit_transform(y_train_a)
y_train_emb = to_categorical(y_train_le)

y_dev_le = le.transform(y_dev_a)
y_dev_emb = to_categorical(y_dev_le)

y_test_le = le.transform(y_test_a)
y_test_emb = to_categorical(y_test_le)
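
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`) and uses 0 as the fill value. The pure-Python stand-in below illustrates that behavior on toy id lists; `pad_pre` is a hypothetical helper written for this example and is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]                    # truncation keeps the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq   # padding is added on the left

short_seq = [5, 12, 7]
long_seq = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_pre(short_seq, maxlen=5)
padded_long = pad_pre(long_seq, maxlen=5)
print(padded_short)  # [0, 0, 5, 12, 7]
print(padded_long)   # [3, 4, 5, 6, 7]
```

With `max_length = 40`, tweets longer than 40 tokens keep only their last 40 token ids, and shorter tweets are left-padded with zeros.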

In [26]:
# Use these for training!
X_train_final_a = X_train_seq_trunc
X_dev_final_a = X_dev_seq_trunc
X_test_final_a = X_test_seq_trunc

y_train_final_a = y_train_emb
y_dev_final_a = y_dev_emb
y_test_final_a = y_test_emb
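
The label arrays above are one-hot vectors. As a rough illustration of what `LabelEncoder` followed by `to_categorical` produces, the sketch below encodes a toy label list by hand. It uses only the two emotions visible in the previewed rows, which is an assumption; the full dataset may contain more classes.

```python
# Toy labels taken from the previewed rows (assumption: more classes may exist)
labels = ["disappointed", "happy", "disappointed"]

# LabelEncoder assigns integer ids to the sorted unique class values
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}
label_ids = [class_to_id[name] for name in labels]

# One-hot rows: a 1.0 in the column of the class id, 0.0 elsewhere
one_hot = [[1.0 if class_id == col else 0.0 for col in range(len(classes))]
           for class_id in label_ids]

print(label_ids)  # [0, 1, 0]
print(one_hot)    # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```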

In [27]:
data['A_Content'] = data['Original Content']

In [28]:
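def preprocess_2(raw_text):
    # Replace the same punctuation characters used in the Tokenizer filters, then
    # lowercase and drop English stopwords. The characters must sit inside a
    # regex character class; as a bare pattern they do not form a valid regex.
    stopword_set = set(stopwords.words("english"))
    return " ".join([i for i in re.sub(r'[!"#$%&()*+,\-./:;<=>?@\[\\\]^_`{"}~\t\n]', " ", raw_text).lower().split() if i not in stopword_set])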

In [29]:
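for i in range(len(data)):
    tweet = data.loc[i, 'A_Content']
    # Note: this calls preprocess (letters only), which is what the recorded
    # A_Content output reflects; preprocess_2 is defined above but not applied here.
    tweet = preprocess(tweet)

    if i % 2000 == 0:
        print(i)  # progress indicator

    # .loc avoids pandas chained-assignment problems
    data.loc[i, 'A_Content'] = tweet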

0
2000
4000
...
298000
300000
3

In [52]:
#na_index = data_a[pd.isna(data_a['A_Content'])].index

#for n in na_index:
#    data_a['A_Content'][n] = data_a['Content'][n]

In [49]:
#data.to_csv("~/Downloads/dataset(clean)_a.csv")

In [50]:
data_a = pd.read_csv("~/Downloads/dataset(clean)_a.csv") 

In [51]:
data_a.head()[['Emotion','Content','Original Content','A_Content']]

Unnamed: 0,Emotion,Content,Original Content,A_Content
0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...,b rt davbingodav mcrackins oh fuck wrote fil g...
1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it,feel shamed
2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...,feeling little bit defeated steps faith would ...
3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t...",b ksiolajidebt imagine reaction guy called jj ...
4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...,wouldnt feel burdened would live life testamen...
