## Designing our own sentiment analysis tool

**Plenty of tools will automatically score the sentiment of a piece of text, but they don't always agree! Let's design our own to see how these tools work internally and how we can test how well they perform.**

Algorithms used to build the models:
1. LinearRegression (imported below, but not trained in this notebook)
2. LogisticRegression
3. RandomForestClassifier
4. LinearSVC (support vector machine)
5. MultinomialNB (naive Bayes)

In [0]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt

### Before we get started, we need to download all of the data we'll be using.
**sentiment140-subset.csv: cleaned subset of Sentiment140 data - half a million tweets marked as positive or negative**

In [0]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip -P data
!unzip -n -d data data/sentiment140-subset.csv.zip

--2020-05-29 06:37:33--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17927149 (17M) [application/zip]
Saving to: ‘data/sentiment140-subset.csv.zip’


2020-05-29 06:37:34 (18.2 MB/s) - ‘data/sentiment140-subset.csv.zip’ saved [17927149/17927149]

Archive:  data/sentiment140-subset.csv.zip
  inflating: data/sentiment140-subset.csv  


In [0]:
!pip install scikit-learn



**Training on tweets**

Let's say we were going to analyze the sentiment of tweets. If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.

Luckily, we have Sentiment140 - http://help.sentiment140.com/for-students - a list of 1.6 million tweets, each with a score saying whether it is negative or positive. We'll use it to train our own machine learning models to separate positivity from negativity.

In [0]:
df = pd.read_csv("data/sentiment140-subset.csv", nrows=30000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last ti...
4,0,@MrKinetik Not a thing!!! I don't really have...


In [0]:
df.shape

(30000, 2)

In [0]:
df.polarity.value_counts()

1    15064
0    14936
Name: polarity, dtype: int64
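
The two classes are almost perfectly balanced, which gives us a handy yardstick for later: a "model" that always guesses the more common label would be right only about half the time. Here is a minimal sketch of that baseline in plain Python, using the counts printed above (the variable names are ours, for illustration only).

```python
# Class counts copied from the value_counts() output above.
counts = {1: 15064, 0: 14936}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"rows: {total}")
print(f"majority label: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier we train should beat this number comfortably before we trust it.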

**Train our algorithm**

**Vectorize our tweets**

Create a TfidfVectorizer and use it to vectorize our tweets. 

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334095,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.427465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
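
To get a feel for where numbers like these come from, here is a rough sketch of the TF-IDF arithmetic in plain Python. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalized rows) and uses three made-up tweets rather than the real data, so the values will not match the table above.

```python
import math
import re
from collections import Counter

# Three made-up tweets (not from the dataset), for illustration only.
docs = [
    "love this kitten",
    "hate this keyboard",
    "love love this game",
]

def tokenize(text):
    # Mirrors TfidfVectorizer's defaults: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = Counter(w for toks in tokenized for w in set(toks))

# Smoothed IDF, the formula scikit-learn uses when smooth_idf=True (its default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]  # L2-normalize each row

rows = [tfidf_row(toks) for toks in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Words that show up in every tweet (like "this") end up with a lower weight than rarer words (like "kitten"), which is the whole point of the IDF part.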


##### Because we want to fit in with all the other programmers, we need to create two variables: one called X and one called y.

###### X is all of our features, the things we use to predict positive or negative. That's going to be our words. y is all of our labels, the positive or negative rating. We'll use the polarity column for that.

In [0]:
X = words_df
y = df.polarity

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

**Training our algorithms**
When we teach our algorithm what a positive or a negative tweet looks like, this is called training. Training can take very different amounts of time depending on the kind of algorithm you are using.

In [0]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 13.7 s, sys: 724 ms, total: 14.5 s
Wall time: 7.39 s


In [0]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 23.5 s, sys: 102 ms, total: 23.6 s
Wall time: 23.6 s


In [0]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 339 ms, sys: 0 ns, total: 339 ms
Wall time: 344 ms


In [0]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 195 ms, sys: 13.9 ms, total: 209 ms
Wall time: 141 ms


**Use our models**

Now that we've trained our models, they can try to predict whether some content is **positive** or **negative**.

**Preparing the data**

Add a few more sentences below. They should be a mix of **positive** and **negative**. They can be **boring**, they can be **exciting**, they can be **short**, they can be **long**.

In [0]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

unknown = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})
unknown

Unnamed: 0,content
0,I love love love love this kitten
1,I hate hate hate hate this keyboard
2,I'm not sure how I feel about toast
3,Did you see the baseball game yesterday?
4,The package was delivered late and the contents were broken
5,Trashy television shows are some of my favorites
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
7,"I find chirping birds irritating, but I know I'm not the only one"


First we need to **vectorize** our sentences into numbers, so the algorithm can understand them.

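Our algorithm only knows **certain words.** Run `vectorizer.get_feature_names_out()` to show the list of words it knows. (Older scikit-learn releases called this `get_feature_names()`, which has since been removed.)

In [0]:
print(vectorizer.get_feature_names_out().tolist())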

['10', '100', '11', '12', '15', '1st', '20', '2day', '2nd', '30', 'able', 'about', 'account', 'actually', 'add', 'after', 'afternoon', 'again', 'ago', 'agree', 'ah', 'ahh', 'ahhh', 'air', 'album', 'all', 'almost', 'alone', 'already', 'alright', 'also', 'although', 'always', 'am', 'amazing', 'amp', 'an', 'and', 'annoying', 'another', 'any', 'anymore', 'anyone', 'anything', 'anyway', 'app', 'apparently', 'apple', 'appreciate', 'are', 'around', 'art', 'as', 'ask', 'asleep', 'ass', 'at', 'ate', 'aw', 'awake', 'awards', 'away', 'awesome', 'aww', 'awww', 'baby', 'back', 'bad', 'band', 'bbq', 'bday', 'be', 'beach', 'beautiful', 'because', 'bed', 'been', 'beer', 'before', 'behind', 'being', 'believe', 'best', 'bet', 'better', 'big', 'bike', 'birthday', 'bit', 'bitch', 'black', 'blip', 'blog', 'blue', 'body', 'boo', 'book', 'books', 'bored', 'boring', 'both', 'bought', 'bout', 'box', 'boy', 'boys', 'break', 'breakfast', 'bring', 'bro', 'broke', 'broken', 'brother', 'brothers', 'btw', 'bus', 'bu
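
Because the vectorizer only kept 1,000 words, anything outside that list is silently ignored when we transform new sentences. The sketch below imitates that behaviour in plain Python. The tiny `known_words` set is a hypothetical stand-in for the real vocabulary, and the tokenizing rule assumes the vectorizer's default settings.

```python
import re

# Hypothetical mini-vocabulary; the real one has 1,000 words.
known_words = {"about", "feel", "how", "not", "sure", "yesterday", "you"}

def split_known_unknown(sentence):
    # Default-style tokenizing: lowercase, tokens of 2+ word characters,
    # so single letters such as "I" and the "m" in "I'm" disappear.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    kept = [t for t in tokens if t in known_words]
    dropped = [t for t in tokens if t not in known_words]
    return kept, dropped

kept, dropped = split_known_unknown("I'm not sure how I feel about toast")
print("kept:   ", kept)
print("dropped:", dropped)
```

A sentence made up entirely of unfamiliar words turns into a row of zeros, so the models have nothing to go on and fall back on whatever they predict for an "empty" tweet.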

In [0]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.417209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.537291,0.0,0.0,0.244939,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215967,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
unknown_words_df.shape

(8, 1000)

In [0]:
unknown['pred_logreg'] = logreg.predict(unknown_words_df)

In [0]:
unknown

Unnamed: 0,content,pred_logreg
0,I love love love love this kitten,1
1,I hate hate hate hate this keyboard,0
2,I'm not sure how I feel about toast,0
3,Did you see the baseball game yesterday?,1
4,The package was delivered late and the contents were broken,0
5,Trashy television shows are some of my favorites,0
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",1
7,"I find chirping birds irritating, but I know I'm not the only one",0


In [0]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [0]:
unknown

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,I love love love love this kitten,1,0.950516,1,0.846984,1,1,0.747222
1,I hate hate hate hate this keyboard,0,0.009595,0,0.0,0,0,0.122383
2,I'm not sure how I feel about toast,0,0.180953,0,0.24,0,0,0.416819
3,Did you see the baseball game yesterday?,1,0.614999,1,0.66,1,1,0.509662
4,The package was delivered late and the contents were broken,0,0.058225,1,0.51,0,0,0.219788
5,Trashy television shows are some of my favorites,0,0.330459,0,0.32,0,1,0.534234
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",1,0.558401,0,0.24,1,1,0.533493
7,"I find chirping birds irritating, but I know I'm not the only one",0,0.060197,0,0.36,0,0,0.295739
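
Two things are worth checking in this table. First, for a two-class model the predicted label is just "probability of class 1 above 0.5" (LinearSVC has no `predict_proba`, so it only gets a label column). Second, the models do not always agree. The sketch below verifies both in plain Python, with the numbers copied by hand from the table above.

```python
# Probabilities and labels copied from the table above.
proba = {
    "logreg": [0.950516, 0.009595, 0.180953, 0.614999, 0.058225, 0.330459, 0.558401, 0.060197],
    "forest": [0.846984, 0.0, 0.24, 0.66, 0.51, 0.32, 0.24, 0.36],
    "bayes": [0.747222, 0.122383, 0.416819, 0.509662, 0.219788, 0.534234, 0.533493, 0.295739],
}
svc_pred = [1, 0, 0, 1, 0, 0, 1, 0]

# Turn each probability into a 0/1 label with a 0.5 cutoff.
pred = {name: [int(p > 0.5) for p in values] for name, values in proba.items()}
pred["svc"] = svc_pred

# Rows where the four models do not all give the same label.
n_rows = len(svc_pred)
disagreements = [i for i in range(n_rows) if len({pred[m][i] for m in pred}) > 1]
print("rows with disagreement:", disagreements)
```

The rows that split the models are the ambiguous ones (the broken package, trashy television, the Kubrick film), and their probabilities mostly sit close to 0.5.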


In [0]:
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
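
With no extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing, and because no `random_state` is set the split changes on every run. The sketch below is a plain-Python imitation of that idea, not scikit-learn's own code; `holdout_split` and its `seed` parameter are names we made up for illustration.

```python
import random

def holdout_split(n_rows, test_fraction=0.25, seed=0):
    """Return shuffled (train_indices, test_indices) for n_rows samples."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed, so the split is repeatable
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(30000)
print(len(train_idx), len(test_idx))
```

For our 30,000 tweets this leaves 22,500 rows for training and 7,500 for testing, which is why each confusion matrix further down adds up to 7,500.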

In [0]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 44.8 s, sys: 1.72 s, total: 46.5 s
Wall time: 31.7 s


In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
#Logistic Regression
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2769,965
Is positive,894,2872


In [0]:
#Random forest
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2810,924
Is positive,1077,2689


In [0]:
#SVC
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2764,970
Is positive,883,2883


In [0]:
#Multinomial Naive Bayes
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2811,923
Is positive,986,2780
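
Four tables of raw counts are hard to compare by eye, so here is a small sketch that boils each one down to accuracy, precision and recall. It is plain Python with the counts copied from the outputs above; the `summarize` helper is ours, for illustration only.

```python
# Confusion matrices copied from the outputs above:
# rows are the true label (negative, positive), columns the predicted label.
matrices = {
    "logreg": [[2769, 965], [894, 2872]],
    "forest": [[2810, 924], [1077, 2689]],
    "svc": [[2764, 970], [883, 2883]],
    "bayes": [[2811, 923], [986, 2780]],
}

def summarize(matrix):
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # of the predicted positives, how many are right
        "recall": tp / (tp + fn),     # of the truly positive tweets, how many were found
    }

scores = {name: summarize(m) for name, m in matrices.items()}
for name, s in scores.items():
    print(f"{name:7s} " + "  ".join(f"{k}={v:.3f}" for k, v in s.items()))
```

On this particular split all four models land between roughly 73% and 75% accuracy. LinearSVC and logistic regression are only six tweets apart, which is well within the noise of an unseeded random split.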


In [0]:
#Logistic Regression
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.741564,0.258436
Is positive,0.237387,0.762613


In [0]:
#Random Forest
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.752544,0.247456
Is positive,0.28598,0.71402


In [0]:
#SVC
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.740225,0.259775
Is positive,0.234466,0.765534


In [0]:
#Multinomial Naive Bayes
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.752812,0.247188
Is positive,0.261816,0.738184
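
The `.div(matrix.sum(axis=1), axis=0)` at the end of each cell divides every row by its own total, so each row reads as "of the tweets that really are negative (or positive), what share got each prediction". A minimal plain-Python sketch of the same arithmetic, using the logistic regression counts from earlier:

```python
# Logistic regression counts copied from the earlier confusion matrix.
matrix = [[2769, 965], [894, 2872]]

# Divide each row by its own total so that every row sums to 1.
normalized = [[cell / sum(row) for cell in row] for row in matrix]

for label, row in zip(["Is negative", "Is positive"], normalized):
    print(label, [round(v, 6) for v in row])
```

The diagonal of the normalized table is the per-class recall: about 74% of negative tweets and about 76% of positive tweets were labelled correctly by logistic regression.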


## Testing on our extracted tweets from Twitter

In [0]:
unknown=pd.read_csv('last1_Stanford.csv')
pd.set_option("display.max_colwidth", 200)

In [0]:
unknown.head(2)

Unnamed: 0.1,Unnamed: 0,Date,Time,hashtags,tagged_usernames,Text,tokens,data,pol_sub,polarity,subjectivity,Stanford_score,Stanford_sentiment
0,0,2020-05-20,11:47:00,"['#bjp', '#rss']",['@any_morya:'],"['big', 'shame', 'bjp', 'rss', 'tried', 'divide', 'hindu', 'muslim', 'sunlight', 'heat', 'keeping', 'daily', 'ramzan', 'fasting']","['big', 'shame', 'bjp', 'rss', 'tried', 'divide', 'hindu', 'muslim', 'sunlight', 'heat', 'keeping', 'daily', 'ramzan', 'fasting']",its big shame to bjp rss who tried to divide hindu muslim in such sunlight and heat keeping their daily ramzan fasting,"Sentiment(polarity=0.0, subjectivity=0.19999999999999998)",0.0,0.2,1.0,Negative
1,1,2020-05-20,11:47:00,[],"['@milligazette:', '@muslimmirror']","['fir', 'news', 'website', '.', 'according', 'gujarat', 'police', 'day', 'shahpur', 'stone-pelting', 'video']","['fir', 'news', 'website', '.', 'according', 'gujarat', 'police', 'day', 'shahpur', 'stone-pelting', 'video']",fir against news website according to gujarat police a day after the shahpur stonepelting a video was,"Sentiment(polarity=0.0, subjectivity=0.0)",0.0,0.0,2.0,Neutral


In [0]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown['data'])
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.159639,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.415174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
unknown_words_df.shape

(2672, 1000)

In [0]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [0]:
unknown

Unnamed: 0.1,Unnamed: 0,Date,Time,hashtags,tagged_usernames,Text,tokens,data,pol_sub,polarity,subjectivity,Stanford_score,Stanford_sentiment,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,0,2020-05-20,11:47:00,"['#bjp', '#rss']",['@any_morya:'],"['big', 'shame', 'bjp', 'rss', 'tried', 'divide', 'hindu', 'muslim', 'sunlight', 'heat', 'keeping', 'daily', 'ramzan', 'fasting']","['big', 'shame', 'bjp', 'rss', 'tried', 'divide', 'hindu', 'muslim', 'sunlight', 'heat', 'keeping', 'daily', 'ramzan', 'fasting']",its big shame to bjp rss who tried to divide hindu muslim in such sunlight and heat keeping their daily ramzan fasting,"Sentiment(polarity=0.0, subjectivity=0.19999999999999998)",0.000000,0.200000,1.0,Negative,0,0.044493,0,0.120000,0,0,0.277326
1,1,2020-05-20,11:47:00,[],"['@milligazette:', '@muslimmirror']","['fir', 'news', 'website', '.', 'according', 'gujarat', 'police', 'day', 'shahpur', 'stone-pelting', 'video']","['fir', 'news', 'website', '.', 'according', 'gujarat', 'police', 'day', 'shahpur', 'stone-pelting', 'video']",fir against news website according to gujarat police a day after the shahpur stonepelting a video was,"Sentiment(polarity=0.0, subjectivity=0.0)",0.000000,0.000000,2.0,Neutral,0,0.388191,0,0.480000,0,1,0.513531
2,2,2020-05-20,11:48:00,"['#bjp', '#gujrati', '#haridwar', '#covid__19.are']",[],"['bjp', 'govt', 'allows', '18', 'luxury', 'coaches', 'gujrati', 'pilgrims', 'haridwar', 'gujarat', 'amidst', 'covid__19.are', 'ppl']","['bjp', 'govt', 'allows', '18', 'luxury', 'coaches', 'gujrati', 'pilgrims', 'haridwar', 'gujarat', 'amidst', 'covid__19.are', 'ppl']",bjp govt allows luxury coaches for gujrati pilgrims from haridwar to gujarat amidst covid are the ppl of,"Sentiment(polarity=0.0, subjectivity=0.0)",0.000000,0.000000,2.0,Neutral,1,0.719626,1,0.643333,1,1,0.556054
3,3,2020-05-20,11:46:00,"['#bjp', '#rss', '#rss_terrorists']",['@muslimmirror:'],"['armed', 'bjp', 'rss', 'activists', 'police', 'shahpur', 'ahmedabad', 'rss_terrorists']","['armed', 'bjp', 'rss', 'activists', 'police', 'shahpur', 'ahmedabad', 'rss_terrorists']",armed bjp rss activists with police in shahpur ahmedabad rss terrorists,"Sentiment(polarity=0.0, subjectivity=0.0)",0.000000,0.000000,2.0,Neutral,1,0.738175,1,0.699048,1,1,0.558944
4,4,2020-05-20,11:46:00,"['#hindurashtra', '#india', '#hindus', '#modi.']",[],"['aspirants', 'hindurashtra', 'india', 'must', 'see', 'pathetic', 'conditions', 'poor', 'hindus', 'modi', '.', 'worst']","['aspirants', 'hindurashtra', 'india', 'must', 'see', 'pathetic', 'conditions', 'poor', 'hindus', 'modi', '.', 'worst']",aspirants of hindurashtra in india must see pathetic conditions of poor hindus under modi in worst,"Sentiment(polarity=-0.7999999999999999, subjectivity=0.8666666666666667)",-0.800000,0.866667,1.0,Negative,0,0.007435,0,0.020000,0,0,0.133602
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,2667,2020-05-20,01:27:00,[],['@ronnielouise2:'],"['dem', 'congress', 'see', 'ca25', 'went', 'red', '22', 'years', 'rep', 'wi', 'winning', 'votes', 'ano']","['dem', 'congress', 'see', 'ca25', 'went', 'red', '22', 'years', 'rep', 'wi', 'winning', 'votes', 'ano']",any dem in congress who does not see that ca went red after years and the rep in wi winning who votes for ano,"Sentiment(polarity=0.25, subjectivity=0.375)",0.250000,0.375000,3.0,Positive,1,0.834752,0,0.180000,1,1,0.571076
2668,2668,2020-05-20,01:27:00,['#congress'],['@vernbuchanan'],"['corrupt', 'congress', 'hard', 'work', '.', 'vote', 'stand', 'alone', 'unconstitutionality']","['corrupt', 'congress', 'hard', 'work', '.', 'vote', 'stand', 'alone', 'unconstitutionality']",corrupt congress hard at work did you vote for this stand alone unconstitutionality too or do you,"Sentiment(polarity=-0.39583333333333337, subjectivity=0.7708333333333333)",-0.395833,0.770833,1.0,Negative,0,0.229779,0,0.440000,0,0,0.333080
2669,2669,2020-05-20,01:26:00,['#tx10'],"['@harrisdemocrats', '@priteshgandhimd']","['case', 'anyone', 'wondering', 'tx10', 'debate', 'going', '...', 'killing', '.']","['case', 'anyone', 'wondering', 'tx10', 'debate', 'going', '...', 'killing', '.']",in case anyone is wondering how the tx debate is going is killing it,"Sentiment(polarity=0.0, subjectivity=0.0)",0.000000,0.000000,2.0,Neutral,1,0.827298,1,0.668889,1,1,0.611772
2670,2670,2020-05-20,01:24:00,"['#congress', '#nationalguard']",['@maddow:'],"['.', 'cant', 'congress', 'anything', 'protect', 'nationalguard', 'servicepeople', 'work', 'june', '25']","['.', 'cant', 'congress', 'anything', 'protect', 'nationalguard', 'servicepeople', 'work', 'june', '25']",cant congress do anything to protect the nationalguard servicepeople so they can work through june,"Sentiment(polarity=0.0, subjectivity=0.0)",0.000000,0.000000,2.0,Neutral,0,0.124383,0,0.280000,0,0,0.331514
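
One thing to keep in mind when comparing columns: `Stanford_sentiment` has three labels (Negative, Neutral, Positive), while our models can only answer 0 or 1, so every Neutral tweet gets pushed to one side or the other. The sketch below tallies this for the nine rows displayed above, in plain Python with the values copied by hand. Nine rows are far too few to draw conclusions from; on the full DataFrame, `pd.crosstab(unknown['Stanford_sentiment'], unknown['pred_logreg'])` would give the same kind of table.

```python
from collections import Counter

# (Stanford_sentiment, pred_logreg) pairs copied from the nine rows displayed above.
shown_rows = [
    ("Negative", 0), ("Neutral", 0), ("Neutral", 1), ("Neutral", 1), ("Negative", 0),
    ("Positive", 1), ("Negative", 0), ("Neutral", 1), ("Neutral", 0),
]

crosstab = Counter(shown_rows)
for (stanford_label, predicted), count in sorted(crosstab.items()):
    print(f"{stanford_label:8s} pred_logreg={predicted}: {count}")
```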


**Save the final dataframe to the csv file 'last2_allModels.csv'**

In [0]:
# index=False keeps reloading the file from adding yet another "Unnamed: 0" column
unknown.to_csv('last2_allModels.csv', index=False)