# Udacity Machine Learning Nanodegree

## Capstone Project — Tweet Classifier

> *Classify tweets as either __Republican__ or __Democrat__.*

The goal of this project is to create a neural network text classifier that performs better than a benchmark Naive Bayes classifier.

---

## Step 1 — Examine the Dataset

Summary statistics for the dataset:

In [137]:
import pandas as pd
import numpy as np

# Load the tweets; the columns used below are 'Party', 'Handle', and 'Tweet'.
tweets = pd.read_csv('./dataset/ExtractedTweets.csv')
print("Total number of tweets: %i" % tweets['Tweet'].count())

print("------")
# Number of tweets per party label.
party_count = tweets.groupby('Party').count()
print("Tweets by party:")
print(party_count['Tweet'])

print("------")
# Number of tweets per account, and how many accounts deviate from 200.
handle_counts = tweets.groupby('Handle').count()
non_two_hundred = handle_counts[handle_counts['Tweet'] != 200]['Tweet'].count()
print("Mean number tweets per account: %f" % handle_counts['Tweet'].mean())
print("Median number of tweets per account: %f" % handle_counts['Tweet'].median())
print("Number of accounts without exactly two hundred tweets: %f" % non_two_hundred)


Total number of tweets: 86460
------
Tweets by party:
Party
Democrat      42068
Republican    44392
Name: Tweet, dtype: int64
------
Mean number tweets per account: 199.676674
Median number of tweets per account: 200.000000
Number of accounts without exactly two hundred tweets: 17.000000


This is an almost perfectly balanced binary-labeled dataset. There are a total of **86460** tweets: **42068** (48.66%) are labeled *Democrat* and **44392** (51.34%) are labeled *Republican*. All but 17 of the accounts in the dataset contributed exactly 200 tweets each.

Because the two classes are nearly the same size and almost every account contributes the same number of tweets, the dataset is well suited to binary classification based on tweet text alone, with little influence from class or account imbalance.
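The balance figures quoted above can be checked directly rather than computed by hand. The following is a minimal sketch on a hypothetical five-row stand-in for `ExtractedTweets.csv`; the `toy` frame, its values, and the `target` of 2 are illustrative assumptions, and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real CSV (same column names).
toy = pd.DataFrame({
    'Party':  ['Democrat', 'Democrat', 'Republican', 'Republican', 'Republican'],
    'Handle': ['a', 'a', 'b', 'b', 'c'],
    'Tweet':  ['t1', 't2', 't3', 't4', 't5'],
})

# Share of tweets per party, as fractions that sum to 1.
party_share = toy['Party'].value_counts(normalize=True)

# Tweets per account, then the number of accounts that miss the target count.
tweets_per_handle = toy.groupby('Handle')['Tweet'].count()
target = 2  # assumption for the toy data; the real dataset would use 200
off_target = int((tweets_per_handle != target).sum())

print(party_share)
print(off_target)
```

On the real data, replacing `toy` with `tweets` and `target` with 200 should reproduce the party shares and the count of accounts without exactly 200 tweets.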

## Step 2 — Vectorize

---


In [199]:
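from sklearn.feature_extraction.text import CountVectorizer

# Build a bag-of-words matrix: one row per tweet, one column per token.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(tweets['Tweet'])

# Mapping from each token to its column index in X_train_counts.
vectorizer.vocabulary_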


{u'fawn': 42024,
 u'01xg2gzxxm': 78,
 u'6omo6ddyte': 8964,
 u'repjimcooper': 93011,
 u'm8twhf3anz': 69262,
 u'2qtyckm2wy': 3892,
 u'6atwljwhar': 8447,
 u'sonja': 100884,
 u'riccieodqc': 94189,
 u'woods': 118324,
 u'ewtnews': 40615,
 u'hanging': 50269,
 u'2btpjdxxjf': 3345,
 u'woody': 118335,
 u'hastily': 50550,
 u'vavetbenefits': 113118,
 u'localized': 67618,
 u'debat\xeda': 32697,
 u'8pq4rketxv': 11399,
 u'4ee5jiyhcs': 6088,
 u'cfzxmoyspz': 26294,
 u'55elhtw5ty': 7085,
 u'6i8cqxykzl': 8724,
 u'ihkh7pnom7': 55172,
 u'tnyajegezb': 106895,
 u'tq5yuwpp2e': 107427,
 u'nbcnro62bm': 75809,
 u'igual': 55114,
 u'refunding': 92293,
 u'5bjlzria07': 7285,
 u'tnnlmljthu': 106870,
 u'y2awcxtsck': 121645,
 u'3ghqf2hk1u': 4835,
 u'eybswidnzw': 41086,
 u'mgp7vl0a5i': 71694,
 u'rgzcpilbhd': 94048,
 u'ulldbagr3y': 110306,
 u'errkyahbi1': 39872,
 u'qsvgajbnpu': 89984,
 u'bnskdfo15y': 22088,
 u'sowell': 101048,
 u'hi4czjxxka': 51586,
 u'mmomeubc5r': 72848,
 u'bollwerk': 22271,
 u'replaces': 93076,
 u'vy0u
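The dictionary above maps each token to the index of its column in `X_train_counts`. A minimal sketch of that mapping on a toy corpus, assuming the default `CountVectorizer` settings; `docs`, `toy_vectorizer`, and the other names here are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for the tweets.
docs = ["Winter Haven teacher", "teacher appreciation week"]

# Default settings lowercase the text before tokenizing.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)

vocab = toy_vectorizer.vocabulary_

# Column indices follow the alphabetical order of the tokens.
haven_index = vocab.get('haven')

# The capitalized form was never stored, so the lookup returns None.
capitalized_index = vocab.get('Haven')

print(toy_counts.shape)
print(haven_index, capitalized_index)
```

Because tokens are lowercased by default, vocabulary lookups need lowercase keys; passing `lowercase=False` to the constructor would preserve case instead.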

In [200]:
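# CountVectorizer lowercases tokens by default, so the capitalized key
# u'Haven' is not in the vocabulary and .get() returns None.
print(vectorizer.vocabulary_.get(u'Haven'))

print("---")
# Nonzero (row, column) count entries for one tweet, as a sparse row.
print(X_train_counts[50596])
print("---")

# Print one tweet, then each element of the string on its own line.
print(tweets['Tweet'][1])
for ch in tweets['Tweet'][1]:
    print(ch)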
    


None
---
  (0, 55875)	1
  (0, 105638)	1
  (0, 53345)	1
  (0, 28191)	1
  (0, 44016)	1
  (0, 80981)	1
  (0, 106022)	1
  (0, 56314)	1
  (0, 73387)	1
  (0, 83482)	1
  (0, 114098)	1
  (0, 116734)	1
  (0, 29192)	1
  (0, 34908)	1
---
RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…
R
T
 
@
W
[... the rest of the tweet, one character per line; the final "…" is printed as three separate bytes ...]
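Each `(0, j)  1` line in the sparse row printed above says that column `j` has a count of 1 for that tweet. One way to read such a row is to invert `vocabulary_` so that column indices map back to tokens. This is a sketch on a hypothetical toy corpus, not the notebook's own code; `docs`, `vec`, and `index_to_token` are assumed names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for the tweets.
docs = ["teacher teacher appreciation", "winter haven resident"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)

# Invert the token -> column mapping into column -> token.
index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

# Take one sparse row and walk its nonzero (column, count) pairs.
row = counts[0].tocoo()
decoded = {index_to_token[int(j)]: int(v) for j, v in zip(row.col, row.data)}

print(decoded)
```

Applied to the notebook's objects, the same pattern with `vectorizer` and `X_train_counts[50596]` would list the tokens behind the column indices shown in the output.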
