# Logistic Regression with Count Vectorizer

We apply logistic regression to tweet text that has been converted into sparse count matrices with scikit-learn's CountVectorizer. The dataset has only two columns: the tweet text (content) and a binary label (Normal). Normal is 1 for a normal tweet and 0 for a bot tweet, so this single column is enough to tell the two classes apart.

In [1]:
import numpy as np
import pandas as pd

# Load only the tweet text and the label (columns 1 and 2 of the CSV).
frame = pd.read_csv('Data/tweetData.csv', index_col=None, header=0, usecols=[1, 2])
print(frame.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362458 entries, 0 to 362457
Data columns (total 2 columns):
content    362458 non-null object
Normal     362458 non-null int64
dtypes: int64(1), object(1)
memory usage: 5.5+ MB
None


In [2]:
print(frame.shape)
print(frame.head(10))

(362458, 2)
                                             content  Normal
0  Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...       1
1  Good. Morning. #morning #Saturday #diner #VT #...       1
2  @gratefuldead recordstoredayus 🌹🌹🌹 @ TOMS M...       1
3  Egg in a muffin!!! (@ Rocket Baby Bakery - @ro...       1
4  @lyricwaters should've gave the neighbor  a bu...       1
5  On the way to CT! (@ Mamaroneck, NY in Mamaron...       1
6  We're #hiring! Read about our latest #job open...       1
7  Me... @ Montgomery Scrap Corporation https://t...       1
8  BAYADA Home Health Care: Home Health Registere...       1
9  Shift Supervisor Trainee - CVS Health: (#OCEAN...       1


In [3]:
print(frame.Normal.value_counts())

0    190252
1    172206
Name: Normal, dtype: int64
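
The class counts above give a useful reference point: a classifier that always predicts the larger class sets the accuracy floor that any real model should beat. The sketch below works this out from the printed counts; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 190252, 1: 172206}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_class] / float(total)

print("total tweets: %d" % total)
print("majority class: %d" % majority_class)
print("baseline accuracy: %.4f" % baseline_accuracy)
```

The two classes are close to balanced, so plain accuracy is a reasonable first metric here.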


In [4]:
X = frame.content
y = frame.Normal
print(X.shape)
print(y.shape)

(362458,)
(362458,)


# Count Vectorizer
First we use scikit-learn to split the data into training and test sets. We then use its CountVectorizer to convert the tweets into vectors. The result is a sparse matrix in which each row is a tweet and each column holds the count of one word from the training vocabulary. These word counts show which words trolls use most often, which helps separate their tweets from the others.
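
To make the idea of count vectors concrete, here is a minimal pure-Python sketch of bag-of-words counting. It is not scikit-learn's implementation: the helper names (tokenize, fit_vocabulary, transform), the tokenization rule, and the toy tweets are all assumptions made for illustration. Each row is stored as a dictionary of column index to count, which mimics a sparse row.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (a simplification chosen for this sketch).
    return re.findall(r"\w\w+", text.lower())


def fit_vocabulary(docs):
    # Map every distinct training word to a column index, in sorted order.
    words = sorted(set(w for d in docs for w in tokenize(d)))
    return {w: i for i, w in enumerate(words)}


def transform(docs, vocabulary):
    # One sparse row per document: {column index: count}.
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        rows.append({vocabulary[w]: c for w, c in counts.items() if w in vocabulary})
    return rows


# Hypothetical toy data, not taken from the tweet dataset.
train_docs = ["good morning good day", "we are hiring"]
test_docs = ["good job hiring now"]

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(train_rows)
print(test_rows)
```

Words that appear only in the test tweets ("job", "now") are dropped, because the vocabulary is learned from the training data alone. This is why the notebook calls fit_transform on the training set and only transform on the test set.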

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [6]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(242846,)
(119612,)
(242846,)
(119612,)
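
The split sizes printed above can be checked by hand. The sketch assumes the test share is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the shapes shown; it is an arithmetic illustration, not a description of the library's internals.

```python
import math

n_samples = 362458   # rows in the full dataset
test_size = 0.33     # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training.
n_test = int(math.ceil(test_size * n_samples))
n_train = n_samples - n_test

print("train rows: %d" % n_train)
print("test rows: %d" % n_test)
```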


In [7]:
print(np.bincount(y_train))

[127342 115504]


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
vectorizer = CountVectorizer()

In [10]:
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [11]:
feature_names = vectorizer.get_feature_names()

In [12]:
print(len(feature_names))

380454


In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validated accuracy on the training set.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(np.mean(scores))



0.9259242415906146


# Logistic Regression
After vectorization we fit logistic regression on the count vectors. The model reaches about 97.8% accuracy on the training set and about 92.9% on the test set, in line with the 92.6% mean from 5-fold cross-validation. The gap between training and test accuracy suggests some overfitting. Logistic regression is a strong baseline for binary classification problems like this one. Finally, we build a confusion matrix to count the false positives and false negatives.
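
As a rough illustration of what the fitted model does with a count vector, the sketch below computes a weighted sum of word counts and passes it through the logistic (sigmoid) function to get a probability for class 1 (Normal). The weights, bias, and example tweet are hypothetical values chosen for the example, not coefficients learned from the dataset.

```python
import math


def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(word_counts, weights, bias):
    # Linear score: bias plus weight * count for every word in the tweet.
    z = bias + sum(weights.get(w, 0.0) * c for w, c in word_counts.items())
    return sigmoid(z)


# Hypothetical coefficients, for illustration only.
weights = {"hiring": 1.2, "job": 0.8, "breaking": -1.5}
bias = -0.1

tweet = {"hiring": 1, "job": 2}   # word counts of one made-up tweet
p = predict_proba(tweet, weights, bias)
label = 1 if p >= 0.5 else 0      # 0.5 decision threshold

print("P(Normal) = %.3f, predicted label = %d" % (p, label))
```

Words with positive weights push the prediction toward class 1 and words with negative weights push it toward class 0, which is what makes the learned coefficients easy to inspect.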

In [14]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
print(logreg.score(X_train, y_train))
print(logreg.score(X_test, y_test))

0.977784274807903
0.9293381934922917


In [15]:
from sklearn.metrics import confusion_matrix
pred_logreg = logreg.predict(X_test)
confusion = confusion_matrix(y_test, pred_logreg)
print(confusion)

[[58923  3987]
 [ 4465 52237]]


In [2]:
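# Rows of the confusion matrix are true labels and columns are predictions,
# with class 1 (Normal) treated as the positive class.
print("false positive : %d" % confusion[0, 1])
print("false negative : %d" % confusion[1, 0])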

false positive : 3987
false negative : 4465
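
The confusion matrix also gives precision, recall, and F1, which say more than accuracy alone. The sketch below derives them from the four counts printed above, assuming rows are true labels, columns are predictions, and class 1 (Normal) is the positive class; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
confusion = [[58923, 3987],
             [4465, 52237]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)   # share of predicted Normal that is truly Normal
recall = tp / float(tp + fn)      # share of true Normal that was found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy : %.4f" % accuracy)
print("precision: %.4f" % precision)
print("recall   : %.4f" % recall)
print("f1       : %.4f" % f1)
```

The accuracy computed this way agrees with the test score reported by logreg.score. If the bot class (0) is the one of interest, the roles of the false positives and false negatives swap.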
