# Unit 4 - Who is the Saltiest Hacker?
## Machine Learning Engineers: Rob Bennett & Hernan Echeverry


## Initial model vs. NLP processes to determine sentiment

I've tried a few different approaches to the models below. Our initial push was to find a dataset to train a model and then deploy the model to a RapidAPI app for the backend to access. 

Our initial data was found here: https://zenodo.org/record/45901#.X0VTK8hKiUl 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
pip install vaderSentiment

Collecting vaderSentiment
  Downloading https://files.pythonhosted.org/packages/76/fc/310e16254683c1ed35eeb97386986d6c00bc29df17ce280aed64d55537e9/vaderSentiment-3.3.2-py2.py3-none-any.whl (125kB)

### Previous experience with NLP told me that a pre-trained model would save me a lot of time and heartache. After some research, I found that VaderSentiment, aside from having an awesome name, has decent performance.

Below are the functions I wrote to apply VADER's sentiment scores to the comment text.

In [None]:
# Imports and libraries
import pandas as pd
import vaderSentiment
import tensorflow as tf
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np
import re
import os
from tensorflow.keras import layers
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

The rest of the team performed extensive cleaning on a sample of the initial 18-million-row dataset. The result is a cleaned, 200k-row CSV, loaded below, which I mounted to my drive for ease of reference.

In [None]:
df = pd.read_csv('/content/drive/My Drive/cleaned_comments.csv')

In [None]:
df.head()

Unnamed: 0,story_id,story_time,story_url,story_text,story_author,comment_id,comment_text,comment_author,comment_ranking,author_comment_count,story_comment_count
0,4941692,1355906280,http://blog.vincentlaforet.com/2012/12/19/the-...,,aaronbrethorst,4942376,Meh I think it will just take time before peop...,polyfractal,28,226,57
1,7559692,1397055696,http://m.motherjones.com/politics/2014/04/inqu...,,fixedd,7559855,Some other features that can help classify a l...,a8da6b0c91d,11,98,12
2,3348011,1323794108,http://blog.macromates.com/2011/textmate-2-0-a...,,fredleblanc,3348087,List of changes httpmacromatescomchanges,c4urself,18,78,30
3,6702077,1384014431,http://exploresion.org,,slashdotaccount,6702761,Perhaps this uses DBpedia httpx2Fx2Fwikidbpedi...,roryokane,0,204,21
4,2877563,1313161720,http://web.mit.edu/newsoffice/2011/introductio...,,ilamont,2877872,Its strange the article talks about it being ...,marshray,6,368,14


## Creating a function to get positive, neutral and negative scoring based on the sentiment analysis

In [None]:
# This is the original function, which assigns a word label (Positive, Negative or Neutral) to a sentence.

def sentiment_scores(sentence): 
  
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # The polarity_scores method of the SentimentIntensityAnalyzer
    # object returns a sentiment dictionary
    # which contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence) 
      
#    print("Overall sentiment dictionary is : ", sentiment_dict) 
#    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
#    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
#    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive") 
  
#    print("Sentence Overall Rated As", end = " ") 
  
    # decide sentiment as positive, negative and neutral 
    if sentiment_dict['compound'] >= 0.05 : 
        return "Positive" 
  
    elif sentiment_dict['compound'] <= - 0.05 : 
        return "Negative" 
  
    else : 
        return "Neutral" 
  


## These are the secondary functions, which I am currently leaning toward us using
## They return numeric scores rather than word labels.

In [None]:
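# Preprocessing and sentiment analysis functions
def preprocess_df(df):
    # Drop rows with no comment text; copy so the column assignments below
    # do not write into a view of the original frame
    df = df[df['comment_text'].notna()].copy()
    # Placeholder columns for the VADER scores
    df['neg'], df['neu'], df['pos'], df['compound'] = [np.nan, np.nan, np.nan, np.nan]
    return df

def sentiment_analysis(df):
    # Fills the score columns of df in place and returns nothing
    df['neg'], df['neu'], df['pos'], df['compound'] = [np.nan, np.nan, np.nan, np.nan]
    sid = SentimentIntensityAnalyzer()
    for i, row in df.iterrows():
        # polarity_scores returns a dict with neg, neu, pos and compound keys
        ss = sid.polarity_scores(row['comment_text'])
        for k in ss:
            df.at[i, k] = ss[k]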
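
The row-by-row loop in `sentiment_analysis` could also be written as a single mapping step. The sketch below is only an illustration of that idea: `add_score_columns` and `toy_scorer` are hypothetical names, and `toy_scorer` is a stand-in so the example runs without VADER installed. In the notebook, the scorer passed in would be `SentimentIntensityAnalyzer().polarity_scores`.

```python
import pandas as pd

def add_score_columns(df, scorer, text_col="comment_text"):
    """Return a copy of df with one new column per key in scorer(text)."""
    # Drop rows without text, mirroring the preprocessing step
    kept = df[df[text_col].notna()].copy()
    # Each call to scorer returns a dict; a list of dicts becomes a frame of columns
    scores = pd.DataFrame(kept[text_col].map(scorer).tolist(), index=kept.index)
    return kept.join(scores)

def toy_scorer(text):
    # Hypothetical stand-in for sid.polarity_scores, for illustration only
    excited = "!" in text
    return {
        "neg": 0.0,
        "neu": 0.5 if excited else 1.0,
        "pos": 0.5 if excited else 0.0,
        "compound": 0.5 if excited else 0.0,
    }

demo = pd.DataFrame({"comment_text": ["I like this a lot!", None, "List of changes"]})
scored = add_score_columns(demo, toy_scorer)
print(scored)
```

Unlike the loop version, this returns a new frame instead of modifying `df` in place, so the caller has to keep the result.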

In [None]:
# Looking at the processed dataframe
df = preprocess_df(df)
sentiment_analysis(df)
df.describe()

Unnamed: 0,story_id,story_time,comment_id,comment_ranking,author_comment_count,story_comment_count,neg,neu,pos,compound
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,5194671.0,1352291000.0,5195429.0,16.15,299.985,35.51,0.05353,0.826465,0.11997,0.28768
std,2834398.0,57448570.0,2834549.0,45.405128,494.251614,77.889949,0.064755,0.122983,0.111952,0.514239
min,92441.0,1198516000.0,92515.0,0.0,11.0,11.0,0.0,0.213,0.0,-0.9449
25%,2758964.0,1310557000.0,2759658.0,5.0,42.0,14.0,0.0,0.74825,0.0495,0.0
50%,5387452.0,1363414000.0,5387574.0,9.0,113.5,20.0,0.033,0.8415,0.1055,0.3787
75%,7377534.0,1394514000.0,7377794.0,16.0,356.0,30.0,0.09225,0.90825,0.1535,0.716775
max,10313950.0,1443727000.0,10315050.0,606.0,3224.0,791.0,0.415,1.0,0.787,0.9951


In [None]:
df.head()

Unnamed: 0,story_id,story_time,story_url,story_text,story_author,comment_id,comment_text,comment_author,comment_ranking,author_comment_count,story_comment_count,neg,neu,pos,compound
0,4941692,1355906280,http://blog.vincentlaforet.com/2012/12/19/the-...,,aaronbrethorst,4942376,Meh I think it will just take time before peop...,polyfractal,28,226,57,0.018,0.908,0.074,0.6154
1,7559692,1397055696,http://m.motherjones.com/politics/2014/04/inqu...,,fixedd,7559855,Some other features that can help classify a l...,a8da6b0c91d,11,98,12,0.113,0.639,0.248,0.6486
2,3348011,1323794108,http://blog.macromates.com/2011/textmate-2-0-a...,,fredleblanc,3348087,List of changes httpmacromatescomchanges,c4urself,18,78,30,0.0,1.0,0.0,0.0
3,6702077,1384014431,http://exploresion.org,,slashdotaccount,6702761,Perhaps this uses DBpedia httpx2Fx2Fwikidbpedi...,roryokane,0,204,21,0.05,0.95,0.0,-0.0258
4,2877563,1313161720,http://web.mit.edu/newsoffice/2011/introductio...,,ilamont,2877872,Its strange the article talks about it being ...,marshray,6,368,14,0.051,0.876,0.073,0.1531


In [None]:
# Checking out the first row of the data frame
df['comment_text'][0]

'Makes sense for Verizon, and if they are being greedy, others will step in to fill the void.'

In [None]:
# Checking the sentiment label of the first comment
sentiment_scores(df['comment_text'][0])

'Negative'
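
The label returned by `sentiment_scores` depends only on the compound score and the ±0.05 cut-offs used in that function. As a minimal sketch, the same rule can be isolated in a small helper; `label_from_compound` is a hypothetical name, and the sample values are compound scores taken from the tables above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a word label using symmetric cut-offs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound scores that appear in the describe() and head() outputs above
for score in (0.6154, 0.0, -0.0258, -0.9449):
    print(score, label_from_compound(score))
```

Note that a slightly negative score such as -0.0258 still counts as Neutral, because it falls inside the ±0.05 band.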

## For my initial neural network, I explored the dataset and ran sentiment analysis on each comment to generate a y value for training. This ultimately yielded poor results, but here is the effort made.

In [None]:
# Scoring each of the 200,000 comments with sentiment_scores
sentiment = []
for i in range(200000):
    sentiment.append(sentiment_scores(df['comment_text'][i]))
    # Print progress every 10,000 rows instead of on every row
    if (i + 1) % 10000 == 0:
        print(i + 1)


In [None]:
len(sentiment)

200000

In [None]:
# Looking at the size of the dataframe and the values within
print(df.shape)
df['comment_author'].value_counts().head(20)

(200000, 11)


tptacek           367
ChuckMcM          342
ck2               317
jacquesm          302
DanielBMarkham    264
tokenadult        258
rdl               254
Tichy             247
jrockway          242
davidw            241
edw519            230
patio11           229
petercooper       226
danso             225
DanBC             216
bane              214
known             211
gojomo            210
brudgers          201
stcredzero        193
Name: comment_author, dtype: int64

## Making features and some additional cleaning

In [None]:
# Engineering a feature called 'sentiment'
df['sentiment'] = sentiment

In [None]:
# Looking at the values of the new column
df['sentiment'].value_counts()

Positive    124601
Negative     46226
Neutral      29173
Name: sentiment, dtype: int64

In [None]:
# Dropping unnecessary columns
to_drop = ['story_id', 'story_time', 'story_url', 'story_text', 'story_author']
df = df.drop(columns=to_drop)
df.head()

Unnamed: 0,comment_id,comment_text,comment_author,comment_ranking,author_comment_count,story_comment_count,sentiment
0,3985756,"Makes sense for Verizon, and if they are being...",Quizzy,10,11,13,Negative
1,2481521,"""Made to play"" is a contradiction.",petervandijck,9,1125,16,Positive
2,6303075,I like this a lot!<p>The research manuscript e...,heurist,5,31,42,Positive
3,6270567,Fetching stuff over HTTP can be incredibly stu...,MBCook,5,84,11,Positive
4,5487972,"Conversely, start working( harder ).",TheSOB88,43,171,44,Neutral


In [None]:
# Encoding Negative as 1 and Positive as 5; Neutral becomes NaN so it can be dropped
df['sentiment'] = df['sentiment'].replace(['Negative'], 1)
df['sentiment'] = df['sentiment'].replace(['Positive'], 5)
df['sentiment'] = df['sentiment'].replace(['Neutral'], np.nan)
df.head()

Unnamed: 0,comment_id,comment_text,comment_author,comment_ranking,author_comment_count,story_comment_count,sentiment
0,3985756,"Makes sense for Verizon, and if they are being...",Quizzy,10,11,13,1.0
1,2481521,"""Made to play"" is a contradiction.",petervandijck,9,1125,16,5.0
2,6303075,I like this a lot!<p>The research manuscript e...,heurist,5,31,42,5.0
3,6270567,Fetching stuff over HTTP can be incredibly stu...,MBCook,5,84,11,5.0
4,5487972,"Conversely, start working( harder ).",TheSOB88,43,171,44,


In [None]:
# Getting a null count
df.isnull().sum()

comment_id                  0
comment_text                0
comment_author              0
comment_ranking             0
author_comment_count        0
story_comment_count         0
sentiment               29173
dtype: int64

In [None]:
# Dropping the rows with null sentiment (the Neutral comments)
df = df.dropna()

In [None]:
df.shape

(170827, 7)

## After shaping the data and scoring sentiment, we moved on to splitting the data and built a tokenized sequential model for some initial training.

In [None]:
X,y = (df['comment_text'].values, df['sentiment'].values)

In [None]:
# Necessary imports
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

tk = Tokenizer(lower=True)
tk.fit_on_texts(X)
X_seq = tk.texts_to_sequences(X)
X_pad = pad_sequences(X_seq, 
                      maxlen=100, 
                      padding='post')
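
To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences` does with `padding='post'` when the truncation setting is left at its Keras default of `'pre'`. `pad_post` is a hypothetical helper written for illustration, not part of Keras, and a `maxlen` of 5 is used instead of 100 to keep the output readable.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'."""
    # Default truncation drops tokens from the FRONT, keeping the last maxlen tokens
    trimmed = seq[-maxlen:]
    # 'post' padding appends the fill value after the real tokens
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([4, 8, 15], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

One consequence worth knowing: comments longer than 100 tokens lose their opening words rather than their closing ones unless `truncating='post'` is passed as well.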


In [None]:
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X_pad, 
                                                    y, 
                                                    test_size=.2, 
                                                    random_state=42)

In [None]:
# Hold out the first batch_size (64) training rows as a small validation set
batch_size = 64
X_train1 = X_train[batch_size:]
y_train1 = y_train[batch_size:]

X_valid = X_train[:batch_size]
y_valid = y_train[:batch_size]
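
A 64-row validation set is very small next to a training set of well over 100,000 rows, so its metrics will be noisy. One possible alternative, sketched here under the assumption that a fixed fraction is acceptable, is to hold out a percentage of the training rows. `split_off_validation` is a hypothetical helper; Keras also offers a `validation_split` argument on `model.fit` that serves a similar purpose.

```python
def split_off_validation(features, labels, valid_fraction=0.1):
    """Hold out the first valid_fraction of rows for validation."""
    n_valid = max(1, int(len(features) * valid_fraction))
    train = (features[n_valid:], labels[n_valid:])
    valid = (features[:n_valid], labels[:n_valid])
    return train, valid

# Toy data standing in for X_train and y_train
toy_features = list(range(20))
toy_labels = [i % 2 for i in range(20)]

(train_x, train_y), (valid_x, valid_y) = split_off_validation(toy_features, toy_labels)
print(len(train_x), len(valid_x))
```

The slicing works the same way on the NumPy arrays produced by `train_test_split`.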


## Running, fitting and testing the model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

vocabulary_size = len(tk.word_counts.keys())+1
max_words = 100
embedding_size = 32
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(200))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

In [None]:
# Fitting the model
history = model.fit(X_train1, 
                    y_train1, 
                    validation_data=(X_valid, y_valid),
                    batch_size=batch_size,
                    epochs=1)



In [None]:
# Testing
scores = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: ", scores[1])

Test accuracy:  0.272463858127594
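
One possible contributor to this score is the label encoding. The model ends in `Dense(1, activation='sigmoid')` trained with `binary_crossentropy`, which expects targets of 0 or 1, while the labels here are 1 (Negative) and 5 (Positive). As far as I understand the Keras accuracy metric for this setup, it compares a thresholded prediction (0 or 1) with the label, so a row labelled 5 can never count as correct. The arithmetic below uses the class counts from the `value_counts()` output above; `to_binary` is a hypothetical remapping shown only as a sketch.

```python
# Class counts after dropping the Neutral rows (from value_counts() above)
n_negative = 46226   # rows labelled 1
n_positive = 124601  # rows labelled 5
total = n_negative + n_positive

# A thresholded sigmoid output is 0 or 1, so at best every row labelled 1 matches
accuracy_ceiling = n_negative / total

# For comparison: always predicting the larger class with 0/1 labels
majority_baseline = n_positive / total

def to_binary(label):
    """Hypothetical remap: Positive (5) becomes 1, Negative (1) becomes 0."""
    return 1 if label == 5 else 0

print(round(accuracy_ceiling, 3), round(majority_baseline, 3))
print([to_binary(v) for v in (1, 5, 5)])
```

The ceiling of roughly 27% over the whole dataset is close to the test accuracy reported above, which suggests the number may reflect the encoding rather than the architecture. Retraining with 0/1 labels would be needed to confirm that.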


## Saving the dataframe to a CSV file and the model to a new directory

In [None]:
# Saving dataframe as a CSV
df.to_csv('sample.csv')

In [None]:
# Creating a new directory for the SavedModel export
!mkdir -p saved_model

# Saving the model in both the SavedModel and HDF5 formats
model.save('saved_model/my_model')
model.save('model1.h5')

INFO:tensorflow:Assets written to: saved_model/my_model/assets


In [None]:
text = "This post is garbage and I hate it."

new_model = tf.keras.models.load_model('model1.h5')

In [None]:
cleaned = pd.read_csv('/content/drive/My Drive/hacker_news_comments.csv')


## Conclusion: Ultimately, none of my model attempts exceeded 27% accuracy, so I believe we will go the pre-trained NLP route (VADER) rather than the neural network route.
