# Stanford Data Maxent Classifier

This notebook uses the Stanford data found online and my <code>maxent_sentiment_analysis.py</code> module to build a maximum entropy classifier.

The data used by this code can be found <a href="http://help.sentiment140.com/for-students/" target = "_blank">here</a>, and the original paper is <a href="http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf" target = "_blank">here</a>.


## First import the classifier

First I will import the classifier class from my module.

In [1]:
# Import the file
from maxent_sentiment_analysis import MaxEntSentimentClassifier as maxent

## Now add data

I stored the data in three files within the "Stanford_data" folder:

1. <em>Stanford_data\\training_data.txt</em>
2. <em>Stanford_data\\training_classifier.txt</em>
3. <em>Stanford_data\\test_tweets.txt</em>

We can save these file paths in the three variables below, ready for uploading the data.


In [2]:
training_tweets = "Stanford_data/training_data.txt"
training_classes = "Stanford_data/training_classifier.txt"
test_tweets = "Stanford_data/test_tweets.txt"

## Create the object and upload data


### Create the object

We will now create a <code>MaxEntSentimentClassifier</code> by utilizing the <code>maxent</code> alias imported above. This can be done simply by saying:

classifier = maxent()

### Upload data

Next we can add the training data, which consists of the tweets themselves and the classification for each tweet. Two functions handle importing the data from the two text files:

1. <code>MaxEntSentimentClassifier.add_tweets(filename)</code>
2. <code>MaxEntSentimentClassifier.add_classification_data(filename)</code>

The data MUST be in a <em>.txt</em> format and MUST be listed such that each tweet is a row, and the classification for that tweet is on the corresponding row in the second text file.
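As a rough illustration of that layout (not the module's actual loading code), the sketch below reads two parallel text files and checks that their rows line up. The helper name <code>load_parallel_files</code> and the sample tweets are hypothetical.

```python
import os
import tempfile


def load_parallel_files(tweets_path, classes_path):
    """Hypothetical helper: read one tweet and one label per row."""
    with open(tweets_path, encoding="utf-8") as f:
        tweets = f.read().splitlines()
    with open(classes_path, encoding="utf-8") as f:
        classes = f.read().splitlines()
    # Row i of the label file must describe row i of the tweet file.
    if len(tweets) != len(classes):
        raise ValueError(
            "row count mismatch: %d tweets vs %d labels" % (len(tweets), len(classes))
        )
    return tweets, classes


# Tiny made-up example written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    tweets_file = os.path.join(folder, "tweets.txt")
    classes_file = os.path.join(folder, "classes.txt")
    with open(tweets_file, "w", encoding="utf-8") as f:
        f.write("what a great day\nmy phone broke again\n")
    with open(classes_file, "w", encoding="utf-8") as f:
        f.write("4\n0\n")
    demo_tweets, demo_classes = load_parallel_files(tweets_file, classes_file)

print(list(zip(demo_tweets, demo_classes)))
```

Failing early on a row-count mismatch is a design choice in this sketch: a silent misalignment would pair tweets with the wrong labels.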

In [3]:
# Create object
classifier = maxent()

# Upload files
classifier.add_tweets(training_tweets)
classifier.add_classification_data(training_classes)

# Check to see that everything uploaded correctly
print(len(classifier.tweets))
print(len(classifier.classes))

2000
2000


## Train the classifier

Now one can train the classifier.  This is handled by the function

<code>MaxEntSentimentClassifier.train_data()</code>

This function first organizes the data into a format that NLTK can handle: a list where each entry is a pair. The first item in the pair is a dictionary giving the frequency of each word in the tweet. My object uses a Python <code>Counter()</code> object to create this. A <code>Counter</code> is a subclass of <code>dict</code>, so NLTK can handle it. The second item is simply the classification for that tweet.
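A minimal sketch of that list-of-pairs format follows. The function name <code>to_featuresets</code> is hypothetical, and splitting on whitespace is a simplifying assumption; the module's own tokenization may differ.

```python
from collections import Counter


def to_featuresets(tweets, classes):
    """Pair each tweet's word counts with its label."""
    # Whitespace splitting is an assumption made for this sketch only.
    return [(Counter(tweet.split()), label) for tweet, label in zip(tweets, classes)]


featuresets = to_featuresets(["good good day", "bad day"], ["4", "0"])
for features, label in featuresets:
    print(dict(features), label)
```

A list shaped like this, with (feature dictionary, label) pairs, is the kind of training set that NLTK's <code>MaxentClassifier.train</code> accepts.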

As of 9/9/2016 I also added cleaning steps.  The cleaning steps are in the <code>__clean_data(self, s)</code> function.  Right now I clean by:

1. Lower casing the tweet
2. Removing all URLs
3. Collapsing any character repeated more than twice in a row down to two occurrences
   - Ex: "haaaaapppy" => "haappy"
4. Removing usernames from the data (i.e. anything beginning with an @ symbol)
5. Removing all remaining punctuation

This cleaning is meant to mimic the cleaning in the original Stanford paper.
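The five cleaning steps listed above could be sketched roughly as follows. This is not the module's <code>__clean_data</code> code: the function name <code>clean_tweet</code>, the regular expressions, and the final whitespace tidy-up are all assumptions made for illustration.

```python
import re
import string


def clean_tweet(s):
    """Hypothetical stand-in for the module's private cleaning function."""
    s = s.lower()                                # 1. lowercase the tweet
    s = re.sub(r"(https?://|www\.)\S+", "", s)   # 2. remove URLs
    s = re.sub(r"(.)\1{2,}", r"\1\1", s)         # 3. squeeze runs of 3+ repeats to 2
    s = re.sub(r"@\w+", "", s)                   # 4. remove @usernames
    # 5. remove all remaining ASCII punctuation
    s = s.translate(str.maketrans("", "", string.punctuation))
    # Extra step (not in the list above): collapse leftover whitespace.
    return " ".join(s.split())


print(clean_tweet("@Bob I am sooooo HAAAAAPPPY!!! http://t.co/abc"))
# prints: i am soo haappy
```

URLs are removed before repeated characters are squeezed so that the "www" prefix is still intact when the URL pattern runs.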


In [4]:
# Train the classifier on the loaded training data
classifier.train_data()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.500
             2          -0.55704        0.920
             3          -0.47102        0.943
             4          -0.41173        0.951
             5          -0.36808        0.957
             6          -0.33434        0.960
             7          -0.30731        0.965
             8          -0.28504        0.968
             9          -0.26630        0.970
            10          -0.25027        0.972
            11          -0.23637        0.974
            12          -0.22416        0.976
            13          -0.21335        0.978
            14          -0.20369        0.979
            15          -0.19500        0.980
            16          -0.18712        0.980
            17          -0.17996        0.981
            18          -0.17340        0.981
            19          -0.16737        0.982
 

## Now let's see how well we did

### Upload the test data

This can be done with the function:

<code>MaxEntSentimentClassifier.add_test_data(filename)</code>

We already saved the file path in the <code>test_tweets</code> variable.

### Classify the test data

We can now classify the test data. The following function:

<code>MaxEntSentimentClassifier.classify_test_data()</code>

will classify the data and return the classification of each tweet as a list, which the cell below prints to standard out.

In [5]:
# Add data
classifier.add_test_data(test_tweets)
# Classify data
test_classes = classifier.classify_test_data()
print(test_classes)

['0', '4', '4', '4', '0', '4', '0', '4', '4', '4', '0', '4', '0', '4', '4', '0', '0', '4', '0', '0', '4', '4', '0', '4', '0', '4', '0', '4', '0', '4', '4', '4', '4', '0', '0', '0', '0', '0', '0', '4', '4', '0', '0', '0', '0', '0', '4', '0', '4', '4', '4', '4', '0', '0', '4', '0', '0', '4', '4', '0', '0', '4', '4', '4', '4', '4', '4', '0', '0', '0', '0', '4', '0', '4', '0', '4', '4', '0', '0', '0', '4', '4', '4', '4', '4', '4', '0', '4', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '4', '0', '4', '4', '4', '4', '4', '0', '4', '0', '4', '4', '4', '4', '0', '0', '0', '0', '4', '0', '4', '4', '4', '4', '4', '0', '0', '4', '0', '0', '4', '0', '4', '0', '0', '4', '4', '0', '4', '4', '0', '0', '4', '0', '4', '0', '0', '0', '0', '0', '0', '0', '0', '4', '0', '0', '0', '4', '0', '0', '0', '0', '0', '0', '4', '4', '0', '4', '0', '4', '0', '4', '4', '4', '4', '4', '0', '0', '4', '4', '0', '4', '0', '4', '4', '0', '0', '0', '4', '4', '4', '4', '4', '0', '4', '4', '0', '4', '4', '0', '4', '0',

## Test accuracy

Let's see how well we did.  The true classification results are in

<em>Stanford_data\\test_true_classification.txt</em>

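Again, each row holds the label for the corresponding tweet. In the Sentiment140 data, a label of 0 marks a negative tweet and 4 marks a positive one.

Some tweets are marked as neutral (2), so we will have to filter those out. To get a confusion matrix of the results, we'll make the following variables. Note that "positive" here means the class the confusion matrix treats as its target, which in the code below is the label "0", not positive sentiment:

1. TP = True Positive. Positive = "0"
2. TN = True Negative. Negative = "4"
3. FP = False Positive
4. FN = False Negative
5. Total = total results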



In [6]:
# Open file and read one true label per row
with open("Stanford_data/test_true_classification.txt", "r") as f:
    true_classes = f.read().split('\n')

# Make vars
TP = 0
TN = 0
FP = 0
FN = 0

for i in range(len(test_classes)):
    if test_classes[i] == '0' and true_classes[i] == '0':
        TP += 1
    elif test_classes[i] == '0' and true_classes[i] == '4':
        FP += 1
    elif test_classes[i] == '4' and true_classes[i] == '4':
        TN += 1
    elif test_classes[i] == '4' and true_classes[i] == '0':
        FN += 1

Total = (TP + TN + FP + FN)
Accuracy = (TP + TN)/Total
Recall = TP/(TP + FN)
Precision = TP/(TP + FP)

print("True Positive: " + str(TP/Total))
print("True Negative: " + str(TN/Total))
print("False Positive: " + str(FP/Total))
print("False Negative: " + str(FN/Total))
print("Accuracy: " + str(Accuracy))
print("Precision: " + str(Precision))
print("Recall: " + str(Recall))


True Positive: 0.30919220055710306
True Negative: 0.3342618384401114
False Positive: 0.17270194986072424
False Negative: 0.18384401114206128
Accuracy: 0.6434540389972145
Precision: 0.6416184971098265
Recall: 0.6271186440677966
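For reuse, the same bookkeeping could be wrapped in a small function. The sketch below is one possible shape, with a hypothetical name <code>binary_metrics</code> and made-up sample labels; like the cell above, it treats "0" as the positive class and skips any row whose true label is neither "0" nor "4".

```python
def binary_metrics(predicted, actual, positive="0", negative="4"):
    """Hypothetical helper: confusion counts and summary metrics."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for guess, truth in zip(predicted, actual):
        if truth not in (positive, negative):
            continue  # skip neutral ("2") or blank rows
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        elif guess == negative:
            counts["TN" if truth == negative else "FN"] += 1
    total = sum(counts.values())

    def ratio(num, den):
        # Avoid ZeroDivisionError when a class never appears.
        return num / den if den else float("nan")

    return {
        "counts": counts,
        "accuracy": ratio(counts["TP"] + counts["TN"], total),
        "precision": ratio(counts["TP"], counts["TP"] + counts["FP"]),
        "recall": ratio(counts["TP"], counts["TP"] + counts["FN"]),
    }


result = binary_metrics(["0", "0", "4", "4", "0"], ["0", "4", "4", "0", "2"])
print(result)
```

Using <code>zip</code> stops at the shorter of the two lists, so a trailing blank row in the label file does not cause an index error.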


## What's next

Other than taking out punctuation, there was no filtering of these tweets. The Stanford paper took out all emojis, RTs, and usernames (defined by the "@" symbol). More filtering will be necessary in the future.

As of 09/09/2016 the filtering has been completed. Accuracy did not improve that much. I am now thinking of trying more data, but I need to work out the best way to store it. 5000 tweets is not much.