# W266 Project

### Adam Sayre & Erin Werner

## Personal Cleaning Method

Although the dataset provides both the original content and a cleaned version, we want to apply our own cleaning techniques and compare how each version performs in the same models.

To start, we look at the cleaned and original content provided in the dataset.

In [1]:
import numpy as np
import csv
import pandas as pd 
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import importlib
import emoji
import tensorflow as tf
import nltk
import re
from nltk.corpus import brown
nltk.download('stopwords')
from nltk.corpus import stopwords
assert(nltk.download("treebank"))
from nltk.corpus import europarl_raw
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import Counter
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\sayre\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [2]:
# Keras and scikit-learn utilities
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#from tensorflow.keras.utils.np_utils import to_categorical
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow.keras

In [3]:
data = pd.read_csv("~/Downloads/dataset(clean).csv") 
data.head()

Unnamed: 0,Emotion,Content,Original Content
0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...
1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it
2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...
3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t..."
4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...


We can see that the cleaned content does not include any website links or user tags. It also represents each emoji by its name, collapsed into a single token (for example, `grinningfacewithsweat`).

#### Custom Preprocessing Technique #1

For our first cleaning technique, we make several changes to the original content. First, we replace user tags and website links with the tokens 'USERTAGINSTANCE' and 'WEBSITEINSTANCE', respectively. These Twitter interactions may carry sentiment that is useful to our model, and the placeholder tokens let us generalize over them, much as numbers are replaced with a generic token in other NLP tasks. Next, we strip special characters, lowercase the text, and remove stopwords. Because lowercasing runs after the replacement, the placeholders appear as `usertaginstance` and `websiteinstance` in the cleaned column. Last, we split the emoji name descriptions into individual tokens, since the words within a name may be more informative as separate tokens than as a single fused token. As a result, this cleaning approach yields different text from the cleaned version provided with the dataset.
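
One way the emoji-splitting step might look, as a minimal sketch: the snippet below uses Python's standard `unicodedata` module as a stand-in. This is an assumption, not the notebook's method; the notebook imports the `emoji` package, and Unicode character names can differ from the emoji names used in the dataset. The helper name `split_emoji_names` is hypothetical.

```python
import unicodedata

def split_emoji_names(text):
    """Replace each symbol character with its lowercased Unicode name,
    so the words of the name become separate whitespace-delimited tokens."""
    pieces = []
    for ch in text:
        # Category "So" (Symbol, other) covers most emoji code points.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            pieces.append(" " + name.lower() + " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the names.
    return " ".join("".join(pieces).split())

print(split_emoji_names("so tired \U0001F600"))  # so tired grinning face
```

With the name split this way, `grinning` and `face` can each receive their own entry in the vocabulary instead of sharing one fused token.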

In [3]:
data['E_Content'] = data['Original Content']

In [4]:
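STOPWORD_SET = set(stopwords.words("english"))  # built once, not on every call

def preprocess(raw_text):
    # Replace non-letters with spaces, lowercase, then drop stopwords
    letters_only = re.sub(r'[^a-zA-Z\s]', " ", raw_text).lower()
    return " ".join(w for w in letters_only.split() if w not in STOPWORD_SET)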

In [None]:
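for i in range(len(data)):
    tweet = data.loc[i, 'E_Content']
    # Strip only the leading b' or b" left over from the bytes repr
    tweet = re.sub(r'^b[\'"]', '', tweet)
    # Replace user tags and links with placeholder tokens
    tweet = re.sub(r'@\S+', 'USERTAGINSTANCE', tweet)
    # Match the whole link, not just the "https" scheme
    tweet = re.sub(r'https?://\S+', 'WEBSITEINSTANCE', tweet)
    tweet = preprocess(tweet)

    # Progress indicator
    if i % 2000 == 0:
        print(i)

    # Use .loc to avoid chained assignment
    data.loc[i, 'E_Content'] = tweet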
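
To check the per-tweet steps on a single string without loading the dataset, they can be collected into one function. The sketch below is illustrative only: `clean_tweet`, `DEMO_STOPWORDS`, and the sample tweet are hypothetical, the stopword list is a tiny stand-in for NLTK's English list, and the patterns (anchoring the prefix strip, matching the whole link) are one reasonable reading of the steps described above.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English stopword list.
DEMO_STOPWORDS = {"i", "am", "by", "it", "nor", "did", "the", "a"}

def clean_tweet(raw_text, stopword_set=DEMO_STOPWORDS):
    # Strip the leading b' or b" left over from the bytes repr.
    text = re.sub(r"^b['\"]", "", raw_text)
    # Replace user tags and links with placeholder tokens.
    text = re.sub(r"@\S+", "USERTAGINSTANCE", text)
    text = re.sub(r"https?://\S+", "WEBSITEINSTANCE", text)
    # Replace non-letters with spaces, lowercase, and drop stopwords.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    return " ".join(tok for tok in text.split() if tok not in stopword_set)

example = "b'RT @some_user: Oh no.... did I break it? https://example.com/abc'"
print(clean_tweet(example))
# rt usertaginstance oh no break websiteinstance
```

A function like this could also be passed to `data['E_Content'].apply(...)` in place of the explicit loop, at the cost of losing the progress printout.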

In [None]:
#na_index = data_e[pd.isna(data_e['E_Content'])].index

#for n in na_index:
#    data_e['E_Content'][n] = data_e['Content'][n]

In [None]:
#data.to_csv("~/Downloads/dataset(clean)_e.csv")

In [5]:
data_e = pd.read_csv("~/Downloads/dataset(clean)_e.csv") 
data_e.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Emotion,Content,Original Content,E_Content,label
0,0,0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...,rt usertaginstance usertaginstance oh fuck wro...,0
1,1,1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it,feel shamed,0
2,2,2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...,feeling little bit defeated steps faith would ...,0
3,3,3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t...",usertaginstance imagine reaction guy called jj...,1
4,4,4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...,wouldnt feel burdened would live life testamen...,0


#### Custom Preprocessing Technique #2

Our second technique is simpler: it removes every character other than letters, digits, and spaces, then lowercases the text. Stopwords are kept, and user tags and links are not replaced with placeholder tokens.

In [4]:
# Copying original content into new column A_Content
data['A_Content'] = data['Original Content']

In [5]:
# Defining preprocessing technique number two
def preprocess_2(raw_text):
    return re.sub('[^A-Za-z0-9 ]+', '', raw_text).lower()

In [6]:
data['A_Content'] = data['A_Content'].apply(preprocess_2)

In [7]:
# Where preprocessing leaves an NA, fall back to the uncleaned content
na_index = data[pd.isna(data['A_Content'])].index
data.loc[na_index, 'A_Content'] = data.loc[na_index, 'Original Content']

In [8]:
data.head()[['Emotion','Content','Original Content','A_Content']]

Unnamed: 0,Emotion,Content,Original Content,A_Content
0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...,brt davbingodav mcrackins oh fuck did i wrote ...
1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it,i feel nor am i shamed by it
2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...
3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t...",bksiolajidebt imagine if that reaction guy tha...
4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...
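
The `brt` and `bksiolajidebt` tokens in rows 0 and 3 arise because the quote after the bytes prefix `b` is deleted, fusing the `b` onto the next word. The sketch below reproduces this on a made-up string and shows a hypothetical variant, `preprocess_2_no_prefix`, that strips the prefix first. The variant is an assumption for illustration and is not part of the notebook.

```python
import re

def preprocess_2(raw_text):
    # Same rule as the notebook cell: keep letters, digits, and spaces.
    return re.sub('[^A-Za-z0-9 ]+', '', raw_text).lower()

def preprocess_2_no_prefix(raw_text):
    # Hypothetical variant: drop a leading b' or b" before cleaning.
    return preprocess_2(re.sub(r"^b['\"]", "", raw_text))

sample = "b'RT @some_user: Oh no.... did I'"
print(preprocess_2(sample))            # brt someuser oh no did i
print(preprocess_2_no_prefix(sample))  # rt someuser oh no did i
```

Note that the underscore in the handle is also removed, so `some_user` becomes `someuser` under either version.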


In [11]:
data.to_csv("dataset(clean)_a.csv")

In [13]:
# Checks
data_a = pd.read_csv("dataset(clean)_a.csv") 

In [14]:
data_a.head()[['Emotion','Content','Original Content','A_Content']]

Unnamed: 0,Emotion,Content,Original Content,A_Content
0,disappointed,oh fuck did i wrote fil grinningfacewithsweat ...,b'RT @Davbingodav: @mcrackins Oh fuck.... did ...,brt davbingodav mcrackins oh fuck did i wrote ...
1,disappointed,i feel nor am i shamed by it,i feel nor am i shamed by it,i feel nor am i shamed by it
2,disappointed,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...,i had been feeling a little bit defeated by th...
3,happy,imagine if that reaction guy that called jj kf...,"b""@KSIOlajidebt imagine if that reaction guy t...",bksiolajidebt imagine if that reaction guy tha...
4,disappointed,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...,i wouldnt feel burdened so that i would live m...
