## Use NLTK to build a sentiment analysis model

In [2]:
import nltk

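# Requires the nltk_data collection (about 3.5 GB in total) for initial training;
# run 'sudo python3 -m nltk.downloader all' to download it
def format_sentence(sent):
    # Map every token in the sentence to True, giving a bag-of-words feature dict
    return {word: True for word in nltk.word_tokenize(sent)}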

print(format_sentence('That is a nice wombat'))

{'That': True, 'is': True, 'a': True, 'nice': True, 'wombat': True}


- The function above turns a sentence into a dictionary that maps each of its tokens to `True`.
- This lets us train a prediction model on the tokens each text contains.

In [3]:
positive = []
# a list of [features, label] pairs, where features is the dictionary built by format_sentence above
with open('./positive_tweets.txt') as f:
    for i in f:
        positive.append([format_sentence(i), 'positif'])

negative = []
with open('./negative_tweets.txt') as f:
    for i in f:
        negative.append([format_sentence(i), 'négatif'])

# split labeled data into training and test data
training = positive[:int((.8)*len(positive))] + negative[:int((.8)*len(negative))]
test = positive[int((.8)*len(positive)):] + negative[int((.8)*len(negative)):]

### Building a Classifier
- All NLTK classifiers work with feature structures
    - These can be simple dictionaries mapping a feature name to a feature value.
- The Naive Bayes classifier makes predictions based on how frequently each word occurs with each label
    -  http://www.nltk.org/_modules/nltk/classify/naivebayes.html
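
To make the idea of "word frequencies associated with the label" concrete, here is a minimal pure-Python sketch of a Naive Bayes classifier over boolean word features. It is a simplified illustration, not NLTK's implementation: the add-one smoothing, the `pos`/`neg` labels, the toy data, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count how many training items carry each label, and each word per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    return label_counts, word_counts

def classify(features, label_counts, word_counts):
    """Return the label with the highest log-probability score."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for word in features:
            # Add-one smoothing (an assumption) so unseen words never give log(0)
            score += math.log((word_counts[label][word] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data in the same [features, label] shape as the training list
toy_training = [
    ({"love": True, "this": True}, "pos"),
    ({"awesome": True, "day": True}, "pos"),
    ({"bad": True, "headache": True}, "neg"),
    ({"hate": True, "this": True}, "neg"),
]
label_counts, word_counts = train_naive_bayes(toy_training)
print(classify({"love": True, "day": True}, label_counts, word_counts))
```

Each word contributes to the score independently of the others, which is the "naive" assumption.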

In [4]:
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(training)

- With this method, we can see which words are the strongest indicators of a positive or negative label
- The Naive Bayes classifier (NBC) bases this on how frequently each word occurs under each label

In [5]:
classifier.show_most_informative_features()

Most Informative Features
                      no = True           négati : positi =     19.4 : 1.0
                    love = True           positi : négati =     19.0 : 1.0
                 awesome = True           positi : négati =     17.2 : 1.0
                headache = True           négati : positi =     16.2 : 1.0
                      Hi = True           positi : négati =     12.7 : 1.0
               beautiful = True           positi : négati =      9.7 : 1.0
                     New = True           positi : négati =      9.7 : 1.0
                     fan = True           positi : négati =      9.7 : 1.0
                   Thank = True           positi : négati =      9.7 : 1.0
                    haha = True           positi : négati =      9.3 : 1.0


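- The first column is the feature (`word = True`), which is why we needed format_sentence
    - The classifier counts the occurrences of each word under both labels to compute the ratio between the two
- The second column shows the two labels (cut short in the output), with the label under which the word occurs more frequently on the left
- The third column is the ratio between the two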
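
As a rough sketch of how such a ratio can arise, the snippet below compares the smoothed probability of a word being present under each of two labels. The counts (38 of 400 versus 2 of 400), the smoothing constant, and the function names are hypothetical values chosen for illustration; they are not taken from the tweet data or from NLTK's source.

```python
def smoothed_prob(count, total, gamma=0.5):
    """P(word present | label), with additive smoothing over two outcomes (present / absent)."""
    return (count + gamma) / (total + 2 * gamma)

def informative_ratio(count_a, total_a, count_b, total_b):
    """Return (ratio, winner), where winner is 'a' or 'b', whichever label favors the word."""
    p_a = smoothed_prob(count_a, total_a)
    p_b = smoothed_prob(count_b, total_b)
    if p_a >= p_b:
        return p_a / p_b, "a"
    return p_b / p_a, "b"

# Hypothetical counts: the word appears in 38 of 400 items labeled 'a' and 2 of 400 labeled 'b'
ratio, winner = informative_ratio(38, 400, 2, 400)
print(f"favors {winner} = {ratio:.1f} : 1.0")
```

A word that is rare overall can still have a large ratio if nearly all of its occurrences fall under one label.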

In [6]:
test1 = "The wombat is nature's most perfect animal"
print(classifier.classify(format_sentence(test1)))

positif


In [7]:
test2 = "I hate Mondays, like that orange cat"
print(classifier.classify(format_sentence(test2)))

négatif


In [8]:
test3 = 'I do not have a headache'
print(classifier.classify(format_sentence(test3)))

négatif


- Naive Bayes does not consider the relationships between words (here, that "not" negates "headache"), so it misclassified the sentence above

In [9]:
test4 = 'headache love love love love'
print(classifier.classify(format_sentence(test4)))
# hmm still negative

négatif
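
One thing that probably contributes to this result: the features are dictionary keys, so a repeated word is recorded only once. The sketch below uses a hypothetical stand-in, `format_sentence_simple`, which swaps `nltk.word_tokenize` for `str.split()` so it runs without the NLTK data; for this sentence the resulting tokens should be the same.

```python
def format_sentence_simple(sent):
    # Stand-in for format_sentence: str.split() replaces nltk.word_tokenize (an assumption)
    return {word: True for word in sent.split()}

features = format_sentence_simple('headache love love love love')
print(features)  # {'headache': True, 'love': True}
```

Repeating "love" four times therefore adds no more evidence than saying it once, and the classifier sees just two features.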


In [10]:
# Compute accuracy
from nltk.classify.util import accuracy
print(accuracy(classifier, test))

0.8308457711442786
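
Conceptually, the accuracy is the fraction of test items whose predicted label matches the true label. A minimal sketch of that calculation follows; `simple_accuracy`, `toy_classify`, and the toy test set are hypothetical names and data for illustration, not NLTK's code.

```python
def simple_accuracy(classify_fn, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify_fn predicts the given label."""
    correct = sum(
        1 for features, label in labeled_featuresets if classify_fn(features) == label
    )
    return correct / len(labeled_featuresets)

def toy_classify(features):
    # Hypothetical rule-based classifier standing in for a trained model
    return "neg" if "headache" in features else "pos"

toy_test = [
    ({"love": True}, "pos"),
    ({"headache": True}, "neg"),
    ({"awesome": True}, "pos"),
    ({"hate": True}, "neg"),  # misclassified by the toy rule
]
print(simple_accuracy(toy_classify, toy_test))  # 0.75
```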


- ~83% accuracy
- Why not higher?
    - Tweets contain typos, abbreviations, grammatical errors, and the like