# Intro to NLP Lab

In this lab, you'll classify randomly selected tweets from political officials as either partisan or neutral. The import statement below loads only the two columns we need (`bias` and `text`), but the full dataset may contain other useful features. Feel free to explore.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [3]:
df = pd.read_csv('datasets/political_media.csv',
                usecols=[7, 20])
df.head()

Unnamed: 0,bias,text
0,partisan,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,partisan,VIDEO - #Obamacare: Full of Higher Costs and ...
2,neutral,Please join me today in remembering our fallen...
3,neutral,RT @SenatorLeahy: 1st step toward Senate debat...
4,partisan,.@amazon delivery #drones show need to update ...


## Set up

Please split the dataset into a training set and a test set, and convert the `bias` label into 0s and 1s (1 for partisan, 0 for neutral).

In [4]:
# Encode the label (1 = partisan, 0 = neutral), then split into train and test sets.
df['bias'] = df['bias'].apply(lambda x: 1 if x == 'partisan' else 0)
X_train, X_test, y_train, y_test = train_test_split(df['text'].values,
                                                   df['bias'].values)

In [5]:
df['bias'].values

array([1, 1, 0, ..., 0, 0, 0], dtype=int64)

In [6]:
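# Check the shapes of X_train and X_test (y_train and y_test follow in the next cell).
# A notebook cell displays only its last expression, so the output below is X_test.shape.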

X_train.shape
X_test.shape

(1250,)

In [7]:
y_train.shape
y_test.shape

(1250,)
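
The call to `train_test_split` above relies on the defaults, so the split changes on every run and the class ratio can differ between the two sets. A minimal sketch of a reproducible, stratified variant follows. The toy DataFrame, the seed `42`, and the variable names are assumptions made for illustration; they are not part of the lab data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the lab's DataFrame (made-up rows, same two columns).
toy = pd.DataFrame({
    "bias": ["partisan", "neutral", "neutral", "neutral"] * 4,
    "text": [f"tweet number {i}" for i in range(16)],
})

# Map the string labels to integers: 1 = partisan, 0 = neutral.
toy["bias"] = toy["bias"].map({"partisan": 1, "neutral": 0})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"].values,
    toy["bias"].values,
    test_size=0.25,                # the default, written out explicitly
    stratify=toy["bias"].values,   # keep the class ratio the same in both splits
    random_state=42,               # arbitrary seed so reruns give the same split
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())    # share of partisan tweets in each split
```

Stratifying matters here because the classes are imbalanced: the test set in this lab holds 929 neutral and 321 partisan tweets.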

## Modeling

Please try the following techniques to transform the data. For each technique, do the following:

1. Transform the training data
2. Fit a `RandomForestClassifier` to the transformed training data
3. Transform the test data
4. Discuss the goodness of fit of your model using the test data and a classification report and confusion matrix
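
The four steps above repeat for every vectorizer, so they can be wrapped in one helper. This is only a sketch: the function name `evaluate_vectorizer`, its `seed` parameter, and the tiny corpus are hypothetical and exist purely to show the call pattern.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix


def evaluate_vectorizer(vectorizer, X_train, y_train, X_test, y_test, seed=0):
    """Run the four lab steps for one vectorizer; return (accuracy, confusion matrix)."""
    # 1. Learn the vocabulary on the training text only, then transform it.
    X_train_vec = vectorizer.fit_transform(X_train)
    # 2. Fit the classifier on the transformed training data.
    rf = RandomForestClassifier(random_state=seed)
    rf.fit(X_train_vec, y_train)
    # 3. Transform the test text with the already-fitted vocabulary.
    X_test_vec = vectorizer.transform(X_test)
    # 4. Report on the held-out data.
    predictions = rf.predict(X_test_vec)
    print(classification_report(y_test, predictions))
    return rf.score(X_test_vec, y_test), confusion_matrix(y_test, predictions)


# Tiny made-up corpus, only to show how the helper is called.
train_text = ["vote no on this terrible bill", "happy holidays to all",
              "the other party is wrong again", "remembering our veterans today"]
train_y = [1, 0, 1, 0]
test_text = ["this bill is wrong", "happy veterans day"]
test_y = [1, 0]

acc, cm = evaluate_vectorizer(CountVectorizer(), train_text, train_y, test_text, test_y)
print(acc)
print(cm)
```

In the lab itself the call would presumably be `evaluate_vectorizer(CountVectorizer(), X_train, y_train, X_test, y_test)`, and likewise for each `TfidfVectorizer` variant. Fitting the vectorizer on the training text only keeps test-set vocabulary from leaking into the model.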

### 1. `CountVectorizer()`

In [8]:
# Fit a CountVectorizer on the training text, then transform both splits.
cv = CountVectorizer()
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
 

0.7488
[[879  50]
 [264  57]]
             precision    recall  f1-score   support

          0       0.77      0.95      0.85       929
          1       0.53      0.18      0.27       321

avg / total       0.71      0.75      0.70      1250
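
As a worked check, the figures in the report can be recomputed by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[879, 50],
               [264, 57]])
tn, fp, fn, tp = cm.ravel()

accuracy = float((tp + tn) / cm.sum())    # 936 / 1250
precision = float(tp / (tp + fp))         # 57 / 107, partisan class
recall = float(tp / (tp + fn))            # 57 / 321, partisan class
# Accuracy of a model that always predicts "neutral".
majority_baseline = float((tn + fp) / cm.sum())   # 929 / 1250

print(round(accuracy, 4), round(precision, 2), round(recall, 2),
      round(majority_baseline, 4))
# prints: 0.7488 0.53 0.18 0.7432
```

The accuracy of 0.7488 is only slightly above the 0.7432 obtained by always predicting neutral, and the model recovers just 57 of the 321 partisan tweets. For this dataset, recall and F1 on the partisan class say more than accuracy does.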



### 2. `CountVectorizer()` with your choice of `min_df` and `max_df`

In [9]:
cv = CountVectorizer(min_df=0.10, max_df=0.90)
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.7024
[[832  97]
 [275  46]]
             precision    recall  f1-score   support

          0       0.75      0.90      0.82       929
          1       0.32      0.14      0.20       321

avg / total       0.64      0.70      0.66      1250
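
To see what `min_df` and `max_df` do, it helps to look at the vocabulary they leave behind. A float is read as a proportion of documents and an integer as an absolute document count, so `min_df=0.10` keeps only terms that occur in at least 10% of the training tweets. That is a strict cutoff for short texts and may account for part of the lower score above, although a single unseeded run is not enough to be sure. The sketch below uses a made-up four-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the senate votes on the budget",
    "the house votes on the budget",
    "the president signs the bill",
    "the committee debates the farm bill",
]

# No pruning: every token of two or more characters enters the vocabulary.
full = CountVectorizer().fit(docs)

# Keep only terms found in at least 2 documents (min_df=2, an absolute count)
# and in at most 75% of documents (max_df=0.75, a proportion).
pruned = CountVectorizer(min_df=2, max_df=0.75).fit(docs)

print(len(full.vocabulary_))        # 12
print(sorted(pruned.vocabulary_))   # ['bill', 'budget', 'on', 'votes']
```

Here `the` is dropped because it appears in every document, and words such as `senate` or `farm` are dropped because they appear in only one.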



### 3. `CountVectorizer()` with English stop words removed

In [10]:
cv = CountVectorizer(stop_words='english')
cv.fit(X_train)

X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_cv, y_train)
print(rf.score(X_test_cv, y_test))
predictions = rf.predict(X_test_cv)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.7272
[[829 100]
 [241  80]]
             precision    recall  f1-score   support

          0       0.77      0.89      0.83       929
          1       0.44      0.25      0.32       321

avg / total       0.69      0.73      0.70      1250



### 4. `TfidfVectorizer()` 

In [11]:
tfidf = TfidfVectorizer()
tfidf.fit(X_train)

X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_tfidf, y_train)
print(rf.score(X_test_tfidf, y_test))
predictions = rf.predict(X_test_tfidf)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.736
[[868  61]
 [269  52]]
             precision    recall  f1-score   support

          0       0.76      0.93      0.84       929
          1       0.46      0.16      0.24       321

avg / total       0.69      0.74      0.69      1250



### 5. `TfidfVectorizer()` with English stop words removed

In [12]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(X_train)

X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train_tfidf, y_train)
print(rf.score(X_test_tfidf, y_test))
predictions = rf.predict(X_test_tfidf)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.744
[[851  78]
 [242  79]]
             precision    recall  f1-score   support

          0       0.78      0.92      0.84       929
          1       0.50      0.25      0.33       321

avg / total       0.71      0.74      0.71      1250



### Moving forward

With the remainder of your time, try to find the best combination of model and data transformation for predicting partisan tweets. This is a challenging dataset and can be approached in a number of ways.

Some techniques to try are:

1. Different types of data transformation 
2. Custom preprocessors for `CountVectorizer`
3. Custom stopword lists
4. Use of a dimensionality reduction technique (like `TruncatedSVD`)
5. Optimizing hyperparameters using `GridSearchCV`
6. Trying a different modeling technique such as `KNeighborsClassifier` or `LogisticRegression`
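
Several of these ideas can be combined in one `Pipeline` and tuned together. The sketch below is one possible starting point, not a recommended solution. The `clean_tweet` preprocessor, the grid values, `n_components=2`, and the twelve-tweet corpus are all assumptions chosen so the example runs quickly; on the real data you would pass `X_train` and `y_train` and use a much larger `n_components`.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def clean_tweet(text):
    """Hypothetical preprocessor: lowercase, then drop URLs and @mentions."""
    text = text.lower()  # a custom preprocessor replaces the built-in lowercasing
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"@\w+", " ", text)


pipe = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_tweet)),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    # class_weight="balanced" is one way to counter the class imbalance.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Step names are joined to parameter names with a double underscore.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Score on F1 for the partisan class (label 1) rather than accuracy.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")

# Made-up tweets standing in for X_train / y_train.
texts = [
    "the other party keeps blocking our jobs bill",
    "their reckless spending plan will hurt families",
    "this law is full of higher costs",
    "our opponents refuse to protect voters",
    "this partisan budget is a disaster @SpeakerExample http://example.com/a",
    "stop the other side from raising taxes",
    "join me today in remembering our fallen heroes",
    "happy thanksgiving to everyone in the district",
    "my office hours are open this friday",
    "congratulations to the local team on the win",
    "visiting a school in the district today http://example.com/b",
    "thank you to the volunteers at the food bank",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

search.fit(texts, labels)
print(search.best_params_)
print(search.predict(["thank you to our veterans"]))
```

Keeping the vectorizer inside the pipeline means each cross-validation fold refits its own vocabulary, so the validation folds never influence the features.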