# Udacity Machine Learning Nanodegree

## Capstone Project — Tweet Classifier

> *Classify tweets as either __Republican__ or __Democrat__.*

The goal of this project is to create a neural network text classifier that performs better than a benchmark Naive Bayes classifier.

---

## Step 1 — Examine the Dataset

Some summary stats:

In [409]:
import pandas as pd
import numpy as np

# Load the tweets; the columns used below are Party, Handle, and Tweet.
tweets = pd.read_csv('./dataset/ExtractedTweets.csv')
tweet_texts = tweets['Tweet']

print("Total number of tweets: %i" % tweet_texts.count())

print("------")
party_count = tweets.groupby('Party').count()
print("Tweets by party:")
print(party_count['Tweet'])

print("------")
# Count tweets per account and find accounts that deviate from 200 tweets.
handle_counts = tweets.groupby('Handle').count()
non_two_hundred = handle_counts[handle_counts['Tweet'] != 200]['Tweet'].count()
print("Mean number tweets per account: %f" % handle_counts['Tweet'].mean())
print("Median number of tweets per account: %f" % handle_counts['Tweet'].median())
print("Number of accounts without exactly two hundred tweets: %f" % non_two_hundred)


Total number of tweets: 86460
------
Tweets by party:
Party
Democrat      42068
Republican    44392
Name: Tweet, dtype: int64
------
Mean number tweets per account: 199.676674
Median number of tweets per account: 200.000000
Number of accounts without exactly two hundred tweets: 17.000000


The dataset is nearly balanced across the two labels. Of the **86460** tweets, **42068** (48.66%) are labeled *Democrat* and **44392** (51.34%) are labeled *Republican*. All but 17 of the accounts contributed exactly 200 tweets each.

Because neither party nor any single account dominates the data, a binary classifier trained on it should be driven mainly by the text of the tweets rather than by class imbalance.
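As a rough check on the balance claim, the class shares and the accuracy of always guessing the larger class can be recomputed from the two counts printed above. This is a minimal sketch; the name `majority_baseline` is introduced here for illustration and does not appear elsewhere in the notebook.

```python
# Counts copied from the summary output above.
party_counts = {"Democrat": 42068, "Republican": 44392}

total = sum(party_counts.values())
# Share of the dataset carried by each label.
shares = {party: count / float(total) for party, count in party_counts.items()}
# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(shares.values())

for party in sorted(shares):
    print("%s: %.2f%%" % (party, 100 * shares[party]))
print("Majority-class baseline accuracy: %.4f" % majority_baseline)
```

Any classifier worth keeping should beat this majority-class figure by a clear margin.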

---

## Step 2 — Vectorize: Bag-of-Words / TF-IDF

In [410]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Build the bag-of-words matrix: one row per tweet, one column per token.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(tweet_texts)

print(X_train_counts.shape)

(86460, 126329)


In [411]:
tweet_sample = tweet_texts[0:10]
print(tweet_sample)
print(tweet_sample.index)


0    Today, Senate Dems vote to #SaveTheInternet. P...
1    RT @WinterHavenSun: Winter Haven resident / Al...
2    RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3    RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4    RT @Vegalteno: Hurricane season starts on June...
5    RT @EmgageActionFL: Thank you to all who came ...
6    Hurricane Maria left approx $90 billion in dam...
7    RT @Tharryry: I am delighted that @RepDarrenSo...
8    RT @HispanicCaucus: Trump's anti-immigrant pol...
9    RT @RepStephMurphy: Great joining @WeAreUnidos...
Name: Tweet, dtype: object
RangeIndex(start=0, stop=10, step=1)


In [412]:

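from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

# TfidfVectorizer combines tokenization, counting, and TF-IDF weighting.
tfv = TfidfVectorizer()
doc = tfv.fit_transform(tweets['Tweet'])
print(len(tfv.vocabulary_.keys()))
index_of_senate = tfv.vocabulary_.get(u'senate')
print("index of 'senate': %i" % index_of_senate)
print(tfv.idf_)
# Note: idf_ holds the corpus-wide inverse document frequency of each term,
# not a per-tweet TF-IDF score; per-tweet weights are the entries of `doc`.
print("tfidf of 'senate': %f" % tfv.idf_[index_of_senate])

print("\n__________")
print("Sample tweet: '%s'" % tweets['Tweet'][0])
print("Tweet breakdown:")
print(doc[0])
print("__________\n")

print(tfv.vocabulary_.get(u'today'))

# Equivalent two-step route: apply TF-IDF weighting to the raw count matrix.
tfxfr = TfidfTransformer()
X_train_tfidf = tfxfr.fit_transform(X_train_counts)
X_train_tfidf.shape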


126329
index of 'senate': 98443
[ 7.15796257  5.59322484 11.67430154 ... 11.67430154 11.67430154
 11.26883644]
tfidf of 'senate': 5.598956

__________
Sample tweet: 'Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… https://t.co/n3tggDLU1L'
Tweet breakdown:
  (0, 74945)	0.4856049705506262
  (0, 28191)	0.052833650863872844
  (0, 53345)	0.05292198936794678
  (0, 52688)	0.17166959365086887
  (0, 105638)	0.06484465864207267
  (0, 55875)	0.09591415175080434
  (0, 51255)	0.1902203520427344
  (0, 66211)	0.22046448919469594
  (0, 76460)	0.2783065606436417
  (0, 99761)	0.3525527653079148
  (0, 103134)	0.1957869759852857
  (0, 87421)	0.19415404163917044
  (0, 97338)	0.3492886551285527
  (0, 106897)	0.13840569968604383
  (0, 114602)	0.21903540977450434
  (0, 33246)	0.3015487758119325
  (0, 98443)	0.23289450050801297
  (0, 106944)	0.13729330830673972
__________

106944


(86460, 126329)
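The output above shows two different numbers for 'senate': the corpus-wide IDF (5.598956) and the weight of that term in the first tweet (about 0.2329 at column 98443). The sketch below shows how the two relate, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`). The toy corpus and the helper names `smoothed_idf` and `tfidf_vector` are hypothetical, and tokenization is skipped entirely.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized; not taken from the dataset.
docs = [
    ["senate", "votes", "today"],
    ["senate", "hearing"],
    ["hurricane", "season", "starts", "today"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df(t))) + 1
    return math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0

def tfidf_vector(doc):
    # Raw term counts times IDF, then L2-normalized per document.
    counts = Counter(doc)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[0])
for term in sorted(vec):
    print("%s idf=%.4f weight=%.4f" % (term, smoothed_idf(term), vec[term]))
```

Rarer terms get a larger IDF, and the per-document weights always fall between 0 and 1 because each row is scaled to unit length.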

In [426]:
party_names = tweets.Party

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, party_names)

# A few hand-written examples as a quick sanity check.
test1 = ['I love savetheinternet. netneutrality',
         'Christian values are very important. SAD!',
         'cannot believe the audacity of the liberal party',
         'with god’s help, we can end abortion',
         'we must save obamacare from destruction #netneutrality']
# Use the CountVectorizer here (not tfv) so TF-IDF weighting is applied only once,
# matching how X_train_tfidf was built.
test1_counts = vectorizer.transform(test1)
test1_tfidf = tfxfr.transform(test1_counts)
p = clf.predict(test1_tfidf)
print(p)

['Democrat' 'Republican' 'Republican' 'Republican' 'Democrat']
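For readers unfamiliar with the model behind these predictions, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (alpha = 1, which is also the `MultinomialNB` default). It works on raw token counts rather than the TF-IDF weights used in the notebook, and the training data, the labels "A" and "B", and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data; labels and tokens are illustrative only.
train = [
    (["save", "net", "neutrality"], "A"),
    (["net", "neutrality", "now"], "A"),
    (["cut", "taxes", "now"], "B"),
    (["cut", "spending"], "B"),
]

class_docs = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for tokens, label in train:
    term_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(label) + sum of log P(token | label), with Laplace smoothing.
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / float(len(train)))
    for t in tokens:
        if t in vocab:  # ignore words never seen in training
            score += math.log((term_counts[label][t] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # Pick the label with the highest (unnormalized) log posterior.
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["net", "neutrality"]))
print(predict(["cut", "taxes"]))
```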


---

## Step 3 — Baseline Classifier: Multinomial Naive Bayes

In [433]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split

# Split by account rather than by tweet, so that no account's tweets
# appear in both the training and the test set.
tweet_accounts = tweets[['Party', 'Handle']].drop_duplicates()
accounts_train, accounts_test = train_test_split(tweet_accounts.Handle,
                                                 stratify=tweet_accounts.Party,
                                                 test_size=0.2, random_state=43)

tweets_train = tweets[tweets.Handle.isin(accounts_train)].reset_index(drop=True)
tweets_test = tweets[tweets.Handle.isin(accounts_test)].reset_index(drop=True)
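The split above is made on accounts, not on individual tweets. A plausible reason is that a per-tweet split would let the model see each politician's vocabulary during training and inflate the test score. The sketch below illustrates the idea with the standard library only. The rows and handles are hypothetical, and unlike the notebook's `train_test_split` call it does not stratify by party.

```python
import random

# Hypothetical (handle, tweet) rows; the real data lives in ExtractedTweets.csv.
rows = [("acct_%d" % (i % 10), "tweet %d" % i) for i in range(50)]

# Shuffle the unique handles with a fixed seed for reproducibility.
handles = sorted({handle for handle, _ in rows})
rng = random.Random(43)
rng.shuffle(handles)

# Hold out 20% of the accounts, then assign every tweet by its account.
n_test = int(round(0.2 * len(handles)))
test_handles = set(handles[:n_test])
train_handles = set(handles[n_test:])

train_rows = [r for r in rows if r[0] in train_handles]
test_rows = [r for r in rows if r[0] in test_handles]

print("train accounts: %d, test accounts: %d" % (len(train_handles), len(test_handles)))
print("train tweets: %d, test tweets: %d" % (len(train_rows), len(test_rows)))
```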


In [437]:
from sklearn.pipeline import Pipeline

mnb_pipeline = Pipeline([('vec', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

mnb_pipeline.fit(tweets_train.Tweet, tweets_train.Party)

p = mnb_pipeline.predict(tweets_test.Tweet)
np.mean(p == tweets_test.Party)

0.7505605058924979