# Udacity Machine Learning Nanodegree

## Capstone Project — Tweet Classifier

> *Classify tweets as either __Republican__ or __Democrat__.*

The goal of this project is to create a neural network text classifier that performs better than a benchmark Naive Bayes classifier.
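The benchmark itself is not shown in this section. As a rough illustration of what a Naive Bayes text classifier does, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one smoothing, written for Python 3. The four training sentences and the labels `class_a` / `class_b` are invented placeholders, not tweets from the dataset, and `train_nb` / `predict_nb` are hypothetical helper names; a real benchmark would more likely use a library implementation.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def train_nb(texts, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented placeholder data, for illustration only.
texts = [
    "budget vote on the house floor",
    "town hall on the budget",
    "visiting local schools today",
    "great meeting with teachers and students",
]
labels = ["class_a", "class_a", "class_b", "class_b"]

model = train_nb(texts, labels)
print(predict_nb(model, "budget vote"))
print(predict_nb(model, "students and teachers"))
```

The neural network would be evaluated against a classifier of this kind, trained on the same vectorized tweets.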

---

## Step 1 — Examine the Dataset

Some summary stats:

In [137]:
import pandas as pd
import numpy as np

tweets = pd.read_csv('./dataset/ExtractedTweets.csv')
print("Total number of tweets: %i" % tweets['Tweet'].count())

print("------")
# Count tweets per party label.
party_count = tweets.groupby('Party').count()
print("Tweets by party:")
print(party_count['Tweet'])

print("------")
# Count tweets per account (Twitter handle).
handle_counts = tweets.groupby('Handle').count()
non_two_hundred = handle_counts[handle_counts['Tweet'] != 200]['Tweet'].count()
print("Mean number tweets per account: %f" % handle_counts['Tweet'].mean())
print("Median number of tweets per account: %f" % handle_counts['Tweet'].median())
print("Number of accounts without exactly two hundred tweets: %f" % non_two_hundred)


Total number of tweets: 86460
------
Tweets by party:
Party
Democrat      42068
Republican    44392
Name: Tweet, dtype: int64
------
Mean number tweets per account: 199.676674
Median number of tweets per account: 200.000000
Number of accounts without exactly two hundred tweets: 17.000000


This is a nearly balanced binary-labeled dataset. There are a total of **86460** tweets: **42068** (48.66%) are labeled *Democrat* and **44392** (51.34%) are labeled *Republican*. All but seventeen (17) of the accounts in the dataset contributed exactly two hundred (200) tweets each.

Because the classes are nearly balanced and almost every account contributes the same number of tweets, a classifier trained on this data should have to rely on the text of the tweets rather than on class imbalance or on a few prolific accounts.
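The class shares can be recomputed directly from the counts printed in the summary output. The snippet below is a small sketch written for Python 3; the dictionary `party_counts` is typed in by hand from that output rather than read from the CSV.

```python
# Counts copied from the "Tweets by party" output above.
party_counts = {"Democrat": 42068, "Republican": 44392}

total = sum(party_counts.values())
shares = {party: 100.0 * n / total for party, n in party_counts.items()}

for party in sorted(shares):
    print("%-10s %6d  %.2f%%" % (party, party_counts[party], shares[party]))
print("Total      %6d" % total)
```

With the full DataFrame loaded, `tweets['Party'].value_counts(normalize=True)` should give the same proportions.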

## Step 2 — Vectorize

---


In [212]:
X_train_counts = None

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(tweets['Tweet'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [253]:
# Inspect two sample tweets and the vocabulary a fresh vectorizer learns from them.
tweet_sample = tweets['Tweet'][2:4]
print(tweet_sample)

v = CountVectorizer()
v.fit(tweet_sample)
print(v.vocabulary_)

# Transform a single tweet with the vectorizer fitted on the full dataset;
# the second dimension of the result is the size of the full vocabulary.
t = vectorizer.transform(tweets['Tweet'][2:3])
t.shape



2    RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3    RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
Name: Tweet, dtype: object
{u'ed': 8, u'nbclatino': 22, u'guzman': 10, u'thanks': 27, u'in': 13, u'repdarrensoto': 24, u'allocated': 3, u'rt': 25, u'for': 9, u'latinoleader': 14, u'marucci': 17, u'hurricane': 12, u'approximately': 4, u'to': 31, u'damages': 7, u'nalcabpolicy': 20, u'has': 11, u'meeting': 19, u'congress': 6, u'that': 28, u'90': 1, u'with': 32, u'maria': 16, u'billion': 5, u'about': 2, u'nalcabpolicy2018': 21, u'18': 0, u'taking': 26, u'time': 30, u'meet': 18, u'the': 29, u'noted': 23, u'left': 15}


(1, 126329)
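The vocabulary printed above maps each lowercased token to its position in the alphabetically sorted list of terms, and a transformed tweet is a vector of counts at those positions. The following pure-Python sketch (Python 3) mimics that behavior using the default `token_pattern` shown in the `CountVectorizer` repr. The helpers `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical names for illustration, not part of scikit-learn, and the sketch ignores options such as `stop_words` and `ngram_range`.

```python
import re
from collections import Counter

# Same pattern as the default token_pattern: runs of two or more word
# characters, so punctuation and one-letter tokens are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    # lowercase=True is the default, so tokens are lowercased first.
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(documents):
    """Map each distinct token to its index in sorted order."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def count_vector(text, vocabulary):
    """Return {column index: count} for the tokens found in `text`."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
    return {vocabulary[tok]: n for tok, n in counts.items()}


# One tweet from the dataset, as printed elsewhere in this notebook.
tweet = (
    u"RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one "
    u"of several recognized by @RepDarrenSoto for National Teacher Apprecia\u2026"
)

vocabulary = build_vocabulary([tweet])
vector = count_vector(tweet, vocabulary)

print(vocabulary)
print(vector)
print(vocabulary.get(u"Haven"))  # capitalized lookup misses the lowercased key
```

Because tokens are lowercased before the vocabulary is built, lookups must use the lowercase form (`u'haven'`, not `u'Haven'`).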

In [200]:
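# NOTE: `count_vect` and `X_train_counts` are not defined in the cells shown here
# (this cell ran earlier in the session); the fitted `vectorizer` is assumed equivalent.
X_train_counts = vectorizer.transform(tweets['Tweet'])

# Prints None: CountVectorizer lowercases tokens, so the key is u'haven', not u'Haven'.
print(vectorizer.vocabulary_.get(u'Haven'))

print("---")
print(X_train_counts[50596])  # sparse row: (row, column index) followed by the count
print("---")

print(tweets['Tweet'][1])
# Iterating over the string yields one character (one byte in Python 2) per step.
for ch in tweets['Tweet'][1]:
    print(ch)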
    


None
---
  (0, 55875)	1
  (0, 105638)	1
  (0, 53345)	1
  (0, 28191)	1
  (0, 44016)	1
  (0, 80981)	1
  (0, 106022)	1
  (0, 56314)	1
  (0, 73387)	1
  (0, 83482)	1
  (0, 114098)	1
  (0, 116734)	1
  (0, 29192)	1
  (0, 34908)	1
---
RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…
[Output truncated: the loop prints the tweet above one byte per line, 142 lines in total. The last three lines are the individual UTF-8 bytes of the trailing "…" and do not render as readable characters.]
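The three unreadable lines at the end of the per-character output are most likely the trailing "…" split into its UTF-8 bytes: in Python 2, strings read from the CSV without decoding are byte strings, so iterating over them yields bytes rather than characters. The short Python 3 sketch below shows the byte-versus-character difference for that one symbol; it illustrates the encoding and does not reproduce the notebook's Python 2 session.

```python
ellipsis = u"\u2026"                 # the single character "…"
encoded = ellipsis.encode("utf-8")   # its UTF-8 byte representation

print(len(ellipsis))                 # number of characters
print(len(encoded))                  # number of bytes
print([hex(b) for b in encoded])     # the individual byte values

# Decoding the bytes as UTF-8 recovers the original character.
print(encoded.decode("utf-8") == ellipsis)
```

Decoding the text to Unicode before iterating (or running under Python 3, where `str` is already Unicode) would avoid the split.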
