# Text Data IIb: Building a tweet classifier

Name: Alex Davenport

In this project, you'll finish off all the work that we've done so far, and build a collection of classifiers that 

## 1: Reshaping your code into classes

At this point, you should have a solid understanding of text preprocessing, and the Naive Bayes algorithm.  Take the code from your previous projects, and consolidate it into two Python classes: 

* `TextPreprocessor`: Given a Pandas series of tweets (with punctuation, links, and other garbage), perform all the preprocessing necessary to construct a Pandas dataframe of vectorized tweets.  It should have some keyword arguments (set to reasonable defaults) that include all of the choices that you might make regarding text processing.  Here's an example, taken from our class worksheet:
    
    ```
    >>> df = pd.DataFrame({"Tweets":["Trick or Treat!",
                                     "One to Two Guesses",
                                     "Try this one weird trick!",
                                     "That's weird, you might guess",
                                     "Can you guess these 10 health tricks?"]})
    >>> vectorized_tweets = TextProcessor(df, N=1000)
    >>> vectorized_tweets
    ```
    
|      | "one" | "weird" | "trick" | "guess" |
|------|------|------|------|------|
|   "Trick or Treat!"  |  0   |  0   |  1   |  0    |
| "One to Two Guesses"  |   1  |   0  |  0   |  1    |
| "Try this one weird trick!"   |   1  |  1   |  1   |   0   |
| "That's weird, you might guess"   |  0   |  1   |  0   |  1    |
| "Can you guess these 10 health tricks?"   |   0  |  0   |  1   |  1 |
    
* `NaiveBayes`: Given a Pandas Dataframe of vectorized tweets (the output of your `TextPreprocessor` class), train a Naive Bayes classifier, which can then be used to classify tweets it hasn't seen before.  

    ```
    >>> model = NaiveBayes()
    >>> model.fit(vectorized_tweets, y)      // y is the column of 1's and 0's, as usual
    >>> model.predict("Man, @nicholaszufelt is such a great teacher! #brownnoseforlife")
    1
    ```

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#import seaborn as sns
#from sklearn.datasets import make_circles, make_moons, make_classification
from sklearn.linear_model import LogisticRegression
from math import *
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.stem.porter import PorterStemmer
import re
from collections import Counter

%matplotlib inline

plt.style.use('fivethirtyeight')

fivethiryeight_colors = {"blue": "#30a2da", 
                         "red": "#fc4f30", 
                         "green": "#e5ae38", 
                         "yellow": "#6d904f", 
                         "gray": "#8b8b8b"}
fivethirtyeight_rb = ["#30a2da", "#fc4f30"]

In [2]:
stopwords_list = stopwords.words('english') + ["https","http","co","amp","rt","â","ð","¾ð","â¦","ï","½","½ï","¾","wâ","ð¤â¾","www"]

#so elegant (i <3 nltk)
#this basically is the first lab in like 6 lines
stemmer = PorterStemmer() #initialize stemmer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+') #initialize tokenizer
ttk = nltk.tokenize.casual.TweetTokenizer(preserve_case=False, strip_handles=True)

def preprocess(string):
    handleless = " ".join(ttk.tokenize(string))
    tokens = tokenizer.tokenize(handleless) #tokenize, make lowercase, create list
    filtered_words = [stemmer.stem(w) for w in tokens if not w in stopwords_list and len(w)>=3] #remove stopwords
    return filtered_words

In [43]:
class TextPreprocessor:
    def __init__(self, dataset, n=1000):
        self.tweets = dataset 
        
        #stem everything
        self.tweets["stemmed"] = self.tweets["text"].apply(preprocess)

        #n most common positive words
        yay = self.tweets[self.tweets["sentiment"] == 1].copy()
        good_words = []
        for tweet in yay["stemmed"]:
            good_words += set(tweet)
        best = Counter(good_words).most_common(n)
        
        #only use words in common list
        pos_features = [i[0] for i in best]
        self.com_check = lambda tweet: set([word for word in tweet if word in pos_features])
        self.tweets["stemmed"] = self.tweets["stemmed"].apply(self.com_check)
        
        #outputs
        self.indicators = self.tweets['stemmed'].str.join(sep=' ').str.get_dummies(sep=' ')
        self.vec_with_text = pd.concat([self.tweets["text"],self.indicators], axis=1)
        
    def nb_prep(self, text):
        nb_parsed = [word for word in preprocess(text) if word in list(self.indicators.columns)]
        return nb_parsed
    
    def test_process(self, text):
        parsed = self.com_check(preprocess(text))
        #print(parsed)
        names = list(self.indicators.columns)
        data = pd.DataFrame(np.zeros((1,len(names))), columns=names)
        #print(data)
        for i in names:
            if i in parsed:
                #print(names[i])
                data[i] = 1
        return data

In [4]:
class NaiveBayes:
    def __init__ (self):
        self.p1 = None
        self.p0 = None
        self.pos = None
        self.neg = None
        
        
    def fit(self, a, b):
        #fit_data = pd.concat([a,b], axis=1)
        #print(b)
        self.pos = a[b == 1]
        self.neg = a[b == 0]
        
        plen = len(self.pos)
        nlen = len(self.neg)
        total_len = plen + nlen
        
        self.p1 = plen/total_len
        self.p0 = nlen/total_len
        
        self.pos = (self.pos.sum(axis=0)+1)/plen
        self.neg = (self.neg.sum(axis=0)+1)/nlen
    
    def multi(self, l):
        up = self.p1
        down = self.p0
        for word in l:
            up *= self.pos[word]
            down *= self.neg[word]
        return [down, up]
    
    def predict(self, text):
        probs = []
        if isinstance(text, list):
            probs.append(self.multi(text))
        else:
            for item in text:
                probs.append(self.multi(item))
        return probs

## 2: Train-Test split 

Now, before you build your model, you need to construct some datasets to try out.  Fortunately, we've been working on building one!  You can download `tweets_class.txt` from Canvas.  It's a `.txt`, not a `.csv`, because tweets have a lot of commas in them, and that messed with Pandas `read_csv` method.  So `tweets_class.txt` is a tab-delimited dataset.  You'll need to change the delimiter for `read_csv`.  I also found that including the keyword `encoding='ISO-8859-1'` helped a ton as well.

Take the data, split off some amount of it as a **test dataset**.  This means that you don't give it to your model, but you run it through the model after training to test its accuracy on tweets it hasn't seen before.  How much to split off is a good question, and the numbers vary from 10% to 50%.  Both of those extremes I think are a little over-the-top, I would recommend about 20-30%.  The remaining dataset is called your **training dataset**.

A quick google search allowed me to find the gigantic [Sentiment Analysis Dataset](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip).  It's 150MB, consisting of 1.5 million tweets which have been labeled for sentiment.  You may be getting frustrated with me at this point: "Why did we have to go label all those tweets when there's a huge dataset already?!"  I have many answers to this, but the most important one is that all of these labeled tweets were labeled by someone else's model, so I didn't want you to work off of entirely computer-generated data (It's the same idea behind "a copy of a copy of a copy...".  Nonetheless, let's pad our dataset with it.  Here's a line I found useful for opening that massive dataset in pandas:

In [5]:
def fixer(val):
    if val == 4:
        return 1
    else:
        return 0

mega_df = pd.read_csv("data/training.1600000.processed.noemoticon.csv", error_bad_lines=False, encoding='ISO-8859-1', names=["sentiment","id","timestamp","idk","handle","text"])[["sentiment","text"]]
mega_df = mega_df.loc[mega_df['sentiment'] != 2]
mega_df["sentiment"] = mega_df["sentiment"].apply(fixer)

mega_df.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [6]:
class_df = pd.read_csv("data/tweets_class.txt", sep="\t", encoding='ISO-8859-1', error_bad_lines=False).dropna().reset_index(drop=True)
class_df.head()

Unnamed: 0,sentiment,text
0,0.0,Moving the Tesla announcement to Wednesday. Ne...
1,1.0,@markpinc @TeslaMotors thanks!
2,0.0,@Reuters Umm...Autobahn?
3,0.0,@vicentes obviously wrong
4,0.0,@Cocoanetics @heiseonline Not actually based o...


Add some number of these labeled tweets to your dataset.  How much is up to you, I might suggest some number of thousands, but less than 10 thousand (Though feel free to try more!).  Take them randomly from the dataframe, not from the top.  

In [7]:
class_df = class_df.sample(len(class_df)).reset_index(drop=True)
split = int(len(class_df)/4)
test_df = class_df.ix[:split].copy()
training_set = class_df.ix[split:].copy()

SAMPLE_COUNT = 5000
training_set = pd.concat([training_set, mega_df.sample(SAMPLE_COUNT)], axis=0, ignore_index=True)

training_set.head()

Unnamed: 0,sentiment,text
0,0.0,The attack on the Orange County HQ @NCGOP offi...
1,1.0,Half way there&newline;#Believeland #Cubs #wor...
2,0.0,RT @lifewithoutPB: Put on some #BadHombre for ...
3,0.0,Thank you Colorado Springs. If Im elected Pres...
4,0.0,i would go a step further these are the two mo...


In [44]:
training_data = TextPreprocessor(training_set, n=300)
X = training_data.indicators
y = training_set["sentiment"]

## 3: Train and test your models

Build both a Naive Bayes and a Logistic Regression classifier on your training dataset, and test them on your test dataset.  Which is better? What percent of positives and negatives do you have? How many false positives and false negatives do you have?  Interpret your results.  Is your model better or worse when you include the computer-generated data?  Add some bigrams to, and try changing your `N` in Naive Bayes, and try changing your `C` in Logistic Regression.  What's the best model? (This is why creating a robust class in part 1 will help you.)

In [45]:
log_binned_test = pd.concat(list(test_df["text"].apply(training_data.test_process)), axis=0).reset_index(drop=True)

In [46]:
log_mod = LogisticRegression()
log_mod.fit(X,y)
log_results = log_mod.predict(log_binned_test)

In [47]:
nb_binned_test = test_df["text"].apply(training_data.nb_prep)

In [48]:
def best(array):
    out = []
    for row in array:
        if row[0] > row[1]:
            out.append(0)
        else:
            out.append(1)
    return np.array(out)

In [49]:
nb_mod = NaiveBayes()
nb_mod.fit(X,y)
nb_results = best(nb_mod.predict(nb_binned_test)) #predict and make sure same data type as log_results

In [50]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_df["sentiment"],log_results))
print(confusion_matrix(test_df["sentiment"],nb_results))

[[341 102]
 [172 240]]
[[359  84]
 [181 231]]


We can see that the naive bayes classifier is marginally better. With an n of 500, the logistic has 274 falses and the NB has 265 falses. The NB classifier tends to prefer false negatives to false positives. This may be an artifact of an imbalance in the sentiments of the tweets. The Naive Bayes has a success rate of 69%. Pretty abysmal. Logistic is awful as well with 67.9%. I found that the data was better with computer generated data added. This indicates that out data probably is not very good.