In [0]:
from collections import Counter, defaultdict
import math, random, re, glob, codecs
import urllib.request

# Naive Bayes Classifier
## Based on materials from Dr. Kevin Scannell

In this lab, we will create our own Bayesian classifier.

The problem we'll be looking at is one of "sentiment analysis". For the purposes of this lab, we'll formulate the problem as a binary classification task in which we try to label texts as being either "positive sentiment" or "negative sentiment". (Some formulations allow gradations, or just a third class for "neutral sentiment", etc., but I prefer having only two classes for this first example.)

The training data we'll use consists of tweets written in the
Irish language, gathered as part of one of Dr. Scannell's research projects. (Note from Dr. Scannell: "I very intentionally chose a language that (I suspect) no one in the course actually speaks! This will prevent you from introducing any bias in your experiments, or injecting any intuition into how you model the problem.")

There are 10000 "positive sentiment" tweets in the file irish-happy.txt, and 10000 "negative sentiment" tweets in the file irish-sad.txt. Any personally identifying information (usernames, etc.) has been stripped. Normally, assembling big sets of training data in machine learning is a huge challenge; very often the best way to achieve good results when building a classifier is to pay humans to label examples by hand to create a training set. In this case, Dr. Scannell took a big shortcut, and just gathered tweets containing either a happy or sad emoticon or emoji character! So, strictly speaking, we're not really doing sentiment analysis; we're building a classifier that predicts whether a given tweet will contain a happy face or a frownie face, as a kind of "proxy" for sentiment analysis. As you'll see below, the results show this isn't completely unreasonable.

# Loading the Data

In [0]:
def load_labeled_data(url,label):
    # load the content from the url
    content = urllib.request.urlopen(url).read()

    # decode it (make sure all the letters look right)
    decoded = codecs.decode(content,encoding='utf-8')
    
    # split on the line breaks so that each tweet is an element of a list
    split_list = decoded.split('\n')

    # create a list of tweets w/ the label that got passed in,
    # skipping blank lines (e.g. the empty string left by a trailing newline)
    tweets = []
    for tweet in split_list:
        if tweet.strip():
            tweets.append((tweet, label))

    return tweets

In [0]:
irish_happy_url = 'https://raw.githubusercontent.com/kscanne/1070/master/lab04/irish-happy.txt'
irish_sad_url = 'https://raw.githubusercontent.com/kscanne/1070/master/lab04/irish-sad.txt'

# Load happy and sad tweets, labeling the happy ones as 'True' and the sad ones as 'False'
happy_data = load_labeled_data(irish_happy_url,True)
sad_data = load_labeled_data(irish_sad_url,False)

# Concatenate the happy and sad data into one list
full_dataset = happy_data.copy()
full_dataset.extend(sad_data)

# shuffle the dataset so we don't have all trues at the start and all falses at the end
random.shuffle(full_dataset)

Now we can take a look at the first five elements of our dataset:


In [66]:
full_dataset[:5]

[('@USER tá   is breá liom Barcelona is cathair iontach é! Tá grá agam do Catalóin, áit is fearr liom in Eoraip. Tár éis Éire cinnte 😉',
  False),
 ('@USER @USER @USER @USER Na bí buartha. Beidh na pubanna foscailte  ', True),
 ('@USER An-sásta leis na píobairí, leis na pióga steak & mushroom, agus leis an aimsir a bhí muid, áfach.   ',
  True),
 ("@USER  Tá Gaeilge ag go leor 'foreigners' anois  ", True),
 ('Ní bhfuair mé ach ceithre huair I mo choladh aréir, scriosta go hiomlan anois! Beidh sé ag éirí a bhfad níos measa an bhliain seo #Strus  !',
  False)]

# Training / Testing Split
Now, let's split our data into a training and testing set, just like we did in our k-Nearest Neighbors classification. The code here is a bit different, but the idea is the same: each row is assigned at random, so if you pass in .75 as the prob parameter, roughly 75% of the data will be put into training and the remaining 25% or so into testing.

In [0]:
def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

train_data, test_data = split_data(full_dataset, 0.75)  

In [67]:
len(train_data), len(test_data)

(13306, 4412)
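
Because each row is assigned by its own random draw, the split sizes are only approximately 75/25 and change from run to run. The sketch below illustrates this on made-up toy data; `toy_data` and the fixed seed are assumptions added only so the demo is repeatable, and they are not part of the lab.

```python
import random

def split_data(data, prob):
    """Same logic as the lab's split_data: each row lands in the
    first list with probability prob, otherwise in the second."""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

random.seed(0)                 # assumption: fixed seed, only for a repeatable demo
toy_data = list(range(1000))   # hypothetical stand-in for the tweet list

train, test = split_data(toy_data, 0.75)

# The sizes are close to 750 and 250, but usually not exactly those numbers
print(len(train), len(test))
```

Every row ends up in exactly one of the two lists, so the two lengths always add up to the size of the original dataset.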

# Creating the Classifier

The next big block of code actually implements the classifier. Later I'll have you come back and explore some different parts of this, but for now, just run this cell so that we can run our classifier.

In [0]:
def tokenize(message):
    message = message.lower()                       # convert to lowercase
    patt = re.compile(u"[a-záéíóú'-]+", re.UNICODE)
    all_words = re.findall(patt, message)
    return set(all_words)                           # remove duplicates

def count_words(training_set):
    """training set consists of pairs (message, is_true)"""
    counts = defaultdict(lambda: [0, 0])
    for message, is_true in training_set:
        for word in tokenize(message):
            counts[word][0 if is_true else 1] += 1
    return counts

def word_probabilities(counts, total_true, total_false, k=0.5):
    """turn the word_counts into a list of triplets 
    w, p(w | true) and p(w | false)"""
    return [(w,
             (truec + k) / (total_true + 2 * k),
             (falsec + k) / (total_false + 2 * k))
             for w, (truec, falsec) in counts.items()]

def true_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_true = log_prob_if_false = 0.0

    for word, prob_if_true, prob_if_false in word_probs:
        # for each word in the message, 
        # add the log probability of seeing it 
        if word in message_words:
            log_prob_if_true += math.log(prob_if_true)
            log_prob_if_false += math.log(prob_if_false)

        # for each word that's not in the message
        # add the log probability of _not_ seeing it
        else:
            log_prob_if_true += math.log(1.0 - prob_if_true)
            log_prob_if_false += math.log(1.0 - prob_if_false)
            
    ans = 1.0 / (1.0 + math.exp(log_prob_if_false - log_prob_if_true))
    return ans

def p_true_given_word(word_prob):
    word, prob_if_true, prob_if_false = word_prob
    return prob_if_true / (prob_if_true + prob_if_false)
  
class NaiveBayesClassifier:
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):
        num_trues = len([is_true 
                         for message, is_true in training_set 
                         if is_true])
        num_falses = len(training_set) - num_trues

        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts, 
                                             num_trues, 
                                             num_falses,
                                             self.k)
                                             
    def classify(self, message):
        return true_probability(self.word_probs, message)
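
To get a feel for two of the pieces above, here is a minimal sketch that runs the tokenizer on a made-up string and applies the smoothing formula to made-up counts. The `smoothed` helper and the example numbers are hypothetical; they only mirror the `(count + k) / (total + 2 * k)` expression used in `word_probabilities`.

```python
import re

def tokenize(message):
    """Same tokenizer as in the cell above: lowercase the text, then keep
    runs of a-z, the accented vowels, apostrophes, and hyphens."""
    message = message.lower()
    patt = re.compile(u"[a-záéíóú'-]+", re.UNICODE)
    return set(re.findall(patt, message))

def smoothed(count, total, k=0.5):
    """Hypothetical helper mirroring the formula in word_probabilities."""
    return (count + k) / (total + 2 * k)

# The repeated word appears only once because tokenize returns a set
tokens = tokenize("Tá áthas orm, tá!")
print(sorted(tokens))

# A word seen in 0 of 100 tweets still gets a small nonzero probability
p_unseen = smoothed(0, 100)
# A word seen in 3 of 100 tweets gets a somewhat larger one
p_seen = smoothed(3, 100)
print(p_unseen, p_seen)
```

Keeping the probability of an unseen word above zero matters because `true_probability` takes `math.log` of these values, and the log of zero is undefined.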

# Train!
Now, we'll create an instance of the classifier class we defined above. This is like when we created an instance of the KNN classifier from sklearn, except this time, we implemented the classifier ourselves rather than importing it from a library.

Then we will train our classifier using our training data.

In [0]:
classifier = NaiveBayesClassifier()
classifier.train(train_data)
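
As an aside on what the trained classifier returns: `true_probability` adds up log probabilities instead of multiplying raw probabilities, then converts the difference into a number between 0 and 1. The sketch below uses made-up per-word probabilities (not values from the tweet data) to suggest why. Note that the formula compares only the two likelihoods, which amounts to assuming both classes are equally likely beforehand.

```python
import math

# Hypothetical per-word probabilities: 1000 uninformative words,
# plus one word that is three times as likely under "true"
probs_if_true = [0.01] * 1000 + [0.03]
probs_if_false = [0.01] * 1000 + [0.01]

# Multiplying many small numbers underflows to 0.0 in floating point
naive_true = 1.0
naive_false = 1.0
for p_t, p_f in zip(probs_if_true, probs_if_false):
    naive_true *= p_t
    naive_false *= p_f
print(naive_true, naive_false)

# Summing logs stays well within floating-point range
log_prob_if_true = sum(math.log(p) for p in probs_if_true)
log_prob_if_false = sum(math.log(p) for p in probs_if_false)

# Same final step as true_probability
posterior = 1.0 / (1.0 + math.exp(log_prob_if_false - log_prob_if_true))
print(posterior)
```

With the naive products both equal to zero, the ratio between them is lost, while the log version still recovers that the "true" class is favored three to one.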

# Test!
The following code will loop over the test dataset and predict whether each tweet is happy or sad (remember: true == happy; false == sad) using the classifier that was trained in the cell above.

It will then print out the results of our classifier on
our testing data, with an output that looks like this:

Counter({(True, True): 1992, (False, False): 1389, (False, True): 579, (True, False): 471})

This is saying that 1992 happy tweets were labeled as happy by the classifier, 1389 sad tweets were labeled as sad, 579 sad tweets were labeled as happy, and 471 happy tweets were labeled as sad.

We can use this to compute the percentage labeled correctly, as a simple measure of the accuracy of the classifier.
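
As a rough sketch of that calculation, the snippet below computes the accuracy from the illustrative counts shown above (not from a fresh run). Each key is an (actual, predicted) pair, so a prediction is correct when the two values match.

```python
from collections import Counter

# Illustrative counts copied from the example output above;
# keys are (actual, predicted)
counts = Counter({(True, True): 1992, (False, False): 1389,
                  (False, True): 579, (True, False): 471})

# Correct predictions are the entries where actual == predicted
correct = sum(n for (actual, predicted), n in counts.items()
              if actual == predicted)
total = sum(counts.values())
accuracy = correct / total

print(f"accuracy = {correct}/{total} = {accuracy:.3f}")
```

Your own numbers will differ, since the training/testing split is random.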

In [73]:
classified = [(subject, is_true, classifier.classify(subject)) for subject, is_true in test_data]

counts = Counter((is_true, true_probability > 0.5) # (actual, predicted)
                     for _, is_true, true_probability in classified)

print(counts)

Counter({(True, True): 1983, (False, False): 1396, (False, True): 524, (True, False): 509})


# HOMEWORK

Please answer the following questions in a file type of your choosing (word document, Google Doc, slack post -- whatever!) and post it to your Slack channel before 11:59pm Sunday, April 5th.

1) Reading the code. - 5pts

(a) Recall that in our Naive Bayes model, we treat "words" as features. Where (in which specific function or lines of code) do we break the texts into words? (This step relies on knowing any characters special to the language in question.)

(b) Can you find the place in the code where word probabilities are computed and "smoothed"? Hint: instead of "adding 1" to all word counts, this code adds a parameter "k" with default value 0.5...

2) Compute the accuracy of our classifier. Remember that the training set is generated randomly each time we run the cell above that creates the training/testing split. Generate a handful of different splits, train the classifier on each new split, compute the accuracy for each one, and average over all of the different splits.

3) Error analysis. The following code outputs the five "truest false" tweets, and the five "falsest true" tweets. The "truest false" tweets are the ones that are actually sad, but which "look" the happiest to the classifier. Similarly for "falsest true".

For better or worse, Google Translate supports Irish-to-English translation; copy and paste these tweets into Google Translate and see if you can figure out why the classifier is particularly confused in these cases. Describe your explanation, using example tweets and translations.

In [85]:
classified.sort(key=lambda row: row[2])
truest_falses = list(filter(lambda row: not row[1], classified))[-5:]
falsest_trues = list(filter(lambda row: row[1], classified))[:5]

print("truest_falses")
for t in truest_falses:
  print(t[0]+'\n')

print("\n\n\nfalsest_trues")
for t in falsest_trues:
  print(t[0]+'\n')

truest_falses
Táimid ag éisteacht le #bricfeastablasta deiridh @USER ar maidin ar @USER   Go n-éirí an t-ádh le do thograí nua go léir, Lisa

@USER níl. ag fanacht ar uimhir nua a fháil... you can't get good help these days   

@USER Aw ná habair sin a Emma, ní orainn an locht   Súil againn gur éirigh go maith leat sa scrúdú / Hope the exam went well! RRR.ie x

@USER Loving the combined use of Béarla agus Gaeilge ar an nuacht inniu! Pity it couldn't be like this gach lá   #gaeilge #SnaG2016

RT @USER: Campa Mhacha thart do bhliain eile   Míle buíochas leis na múinteoirí & cúntóirí. Go raibh maith agaibh speisialta leis na…




falsest_trues
#Gaeilige @USER 'Northern Ireland....a nation State'!    Is ag magadh fúm atá tú nach ea? Níl ina 'nation State' ar chor ar bith é.

@USER tá tú ar meisce agus níl sé ach leath uair taréis a naoi   awh bhuel tá mé ag ól buidéal Cobra mé fhéin

Faraor géar ní bheidh an Club ar oscailt anocht dá bharr fadhb aibhléise. Gabhann muid ár leithscéal, tá sú

5) The next cell shows the ten words most characteristic of happy tweets and sad tweets, respectively. Again, copy and paste these into Google Translate and see if they appear reasonable. Can you "explain away" any that don't appear reasonable?  Should we change the model in some way?

In [86]:
words = sorted(classifier.word_probs, key=p_true_given_word)

truest_words = words[-10:]
falsest_words = words[:10]

print("truest_words:")
for t in truest_words:
  print(t[0] + '\n')

print("\n\n\nfalsest_words:")
for t in falsest_words:
  print(t[0] + '\n')

truest_words:
brilliant

happy

shona

gradaim

seolta

aithne

greann

breithe

dhaoibh

smile




falsest_words:
ugh

slánlepeadar

mins

abalta

croíbhriste

léarscáil

arabacha

mbreatain

polaitiúil

ochón

