## __LSTM IMPLEMENTATION__
<hr>

### Import Libraries
We will use *pandas* and *numpy* for data manipulation, *nltk* for NLP, *matplotlib*, *seaborn*, and *plotly* for data visualization, and *sklearn* and *keras* for training the model.

In [9]:
import pandas as pd
import numpy as np
import string, re
import itertools
import nltk
import plotly.offline as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
#from wordcloud import WordCloud,STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences  # public path; keras.utils.data_utils is an internal module
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.callbacks import EarlyStopping
py.init_notebook_mode(connected=True)
%matplotlib inline

### Load the dataset
Each source contributes roughly 500 entries: tweets from Twitter, and posts and comments from Facebook.

In [10]:
twitter = pd.read_csv('../data/twitter.csv', usecols=[3,4])
facebook = pd.read_csv('../data/facebook.csv', usecols=[3,4])
df = pd.concat([twitter, facebook])
df.head()

Unnamed: 0,Clean_Translated,Tag
0,Hahahahahhaha happiest day of my life. You hav...,1
1,I like the chocolate bars and then you are thi...,0
2,Hey. You're cute,0
3,I save on going to the massage spa because I h...,0
4,Pretending we know things,0
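
Because `pd.concat` keeps each frame's original row labels by default, the combined frame carries repeated index values (the Twitter rows and the Facebook rows both start at 0). A minimal sketch, assuming small in-memory stand-ins for the two CSV files and a hypothetical `Source` column that is not part of the original data, shows one way to keep the labels unique and remember where each row came from:

```python
import pandas as pd

# Tiny stand-ins for the two CSV files (illustrative rows, not the real data).
twitter_demo = pd.DataFrame({"Clean_Translated": ["first tweet", "second tweet"], "Tag": [1, 0]})
facebook_demo = pd.DataFrame({"Clean_Translated": ["a post", "a comment", "another post"], "Tag": [0, 0, 1]})

# Default concat keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([twitter_demo, facebook_demo])
print(stacked.index.tolist())

# `Source` is a hypothetical helper column; ignore_index=True renumbers rows 0..n-1.
combined = pd.concat(
    [twitter_demo.assign(Source="twitter"), facebook_demo.assign(Source="facebook")],
    ignore_index=True,
)
print(combined.index.is_unique, combined.shape)
```

Whether to renumber is a design choice: unique labels make later `.loc` lookups unambiguous, while the original labels preserve each row's position in its source file.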


### Data Analysis - Shape Checks
In the next two cells, we examine the shape of the dataset and check for missing values.

In [11]:
print(df.shape)
print(facebook.shape)
print(twitter.shape)

(1001, 2)
(501, 2)
(500, 2)


We check for missing values using __isna()__. The empty frame in the output means that no row of the combined dataset contains a missing value:

In [12]:
twitter[twitter.isna().any(axis=1)]
facebook[facebook.isna().any(axis=1)]
df[df.isna().any(axis=1)]

Unnamed: 0,Clean_Translated,Tag
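
In a notebook only the last expression of a cell is displayed, so the cell above shows the result for `df` alone. A hedged alternative, sketched here on a small made-up frame rather than the real data, is to print a per-column count of missing values, which gives one number per column and can be repeated for each frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing text entry (illustrative only).
demo = pd.DataFrame({"Clean_Translated": ["hello", np.nan, "bye"], "Tag": [0, 1, 0]})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value, as in the cell above.
rows_with_missing = demo[demo.isna().any(axis=1)]
print(len(rows_with_missing))
```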


### Download NLTK resources

In [15]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to C:\Users\Aira Mae
[nltk_data]     Aloveros\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Aira Mae
[nltk_data]     Aloveros\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

### Tokenize Text
Tokenizes the text, tags each token with its part of speech, and removes determiners (`DT`) and personal pronouns (`PRP`).

In [21]:
import nltk


def tokenize(text: str):
    # Tag each token with its part of speech, then drop determiners (DT) and personal pronouns (PRP).
    # Reference: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag not in ("DT", "PRP")]


print(tokenize("You are very ugly. I hate you."))


['are', 'very', 'ugly', '.', 'hate', '.']
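
The example output still contains the sentence-final periods, because `.` is tagged as punctuation rather than as a determiner or pronoun. Below is a minimal sketch of the same filter with punctuation removed as well. It assumes hand-written `(token, tag)` pairs in place of a real `nltk.pos_tag` call so that it runs without the NLTK resources, and `filter_tagged` is a hypothetical helper, not part of the notebook:

```python
import string

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output (an assumption).
tagged = [
    ("You", "PRP"), ("are", "VBP"), ("very", "RB"), ("ugly", "JJ"), (".", "."),
    ("I", "PRP"), ("hate", "VBP"), ("you", "PRP"), (".", "."),
]

def filter_tagged(pairs, drop_tags=("DT", "PRP"), drop_punctuation=True):
    """Keep tokens whose tag is not in drop_tags, optionally dropping punctuation."""
    kept = []
    for token, tag in pairs:
        if tag in drop_tags:
            continue
        # A token made only of punctuation characters is skipped when requested.
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

print(filter_tagged(tagged))                          # ['are', 'very', 'ugly', 'hate']
print(filter_tagged(tagged, drop_punctuation=False))  # ['are', 'very', 'ugly', '.', 'hate', '.']
```

Whether punctuation should be dropped depends on the downstream model; the notebook's own `tokenize` keeps it.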


### Tokenize Data

In [25]:
#df['Clean_Translated'] & df['Tag']
df

Unnamed: 0,Clean_Translated,Tag
0,Hahahahahhaha happiest day of my life. You hav...,1
1,I like the chocolate bars and then you are thi...,0
2,Hey. You're cute,0
3,I save on going to the massage spa because I h...,0
4,Pretending we know things,0
...,...,...
496,After Calax construction is this river still a...,0
497,Do you think that? There are times when I can ...,0
498,"Happy 18th Birthday, Madumb Ariel! <333 Hope y...",0
499,In the art of art\nCabbage with Sardinas…\nDin...,0
