# Assignment 1: Preprocessing and Text Classification

Student Name: Clarisca Lawrencia

Student ID: xxxxxx

# General Info

<b>Due date</b>: Sunday, 28 March 2021 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -10% per day (both week and weekend days counted)

<b>Marks</b>: 9% of mark for class (with 8% on correctness + 1% on quality and efficiency of your code)

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/121115/pages/using-jupyter-notebook-and-python?module_item_id=2681264) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. We recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

To familiarize yourself with NLTK, here is a free online book:  Steven Bird, Ewan Klein, and Edward Loper (2009). <a href=http://nltk.org/book>Natural Language Processing with Python</a>. O'Reilly Media Inc. You may also consult the <a href=https://www.nltk.org/api/nltk.html>NLTK API</a>.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each question is worth is explicitly given. 

You will be marked not only on the correctness of your methods, but also the quality and efficency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourge you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

# Overview

In this homework, you'll be working with a collection tweets. The task is to predict the geolocation (country) where the tweet comes from. This homework involves writing code to preprocess data and perform text classification.

# Preprocessing (4 marks)

**Instructions**: Download the data (as1-data.json) from Canvas and put it in the same directory as this iPython notebook. Run the code below to load the json data. This produces two objects, `x` and `y`, which contains a list of  tweets and corresponding country labels (it uses the standard [2 letter country code](https://www.iban.com/country-codes)) respectively. **No implementation is needed.**

In [1]:
import json

x = []
y = []
data = json.load(open("as1-data.json"))
for k, v in data.items():
    x.append(k)
    y.append(v)
    
print("Number of tweets =", len(x))
print("Number of labels =", len(y))
print("\nSamples of data:")
for i in range(10):
    print("Country =", y[i], "\tTweet =", x[i])
    
assert(len(x) == 943)
assert(len(y) == 943)

Number of tweets = 943
Number of labels = 943

Samples of data:
Country = us 	Tweet = @Addictd2Success thx u for following
Country = us 	Tweet = Let's just say, if I were to ever switch teams, Khalesi would be top of the list. #girlcrush
Country = ph 	Tweet = Taemin jonghyun!!! Your birits make me go~ http://t.co/le8z3dntlA
Country = id 	Tweet = depart.senior 👻 rapat perdana (with Nyayu, Anita, and 8 others at Ruang Aescullap FK Unsri Madang) — https://t.co/swRALlNkrQ
Country = ph 	Tweet = Done with internship with this pretty little lady!  (@ Metropolitan Medical Center w/ 3 others) [pic]: http://t.co/1qH61R1t5r
Country = gb 	Tweet = Wow just Boruc's clanger! Haha Sunday League stuff that, Giroud couldn't believe his luck! #clown
Country = my 	Tweet = I'm at Sushi Zanmai (Petaling Jaya, Selangor) w/ 5 others http://t.co/bcNobykZ
Country = us 	Tweet = Mega Fest!!!! Its going down🙏🙌  @BishopJakes
Country = gb 	Tweet = @EllexxxPharrell wow love the pic babe xx
Country = us 	Tweet = You 

### Question 1 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); (2) lowercase all words; (3) remove any word that does not contain any English alphabets (e.g. {_hello_, _#okay_, _abc123_} would be kept, but not {_123_, _!!_}) and (4) remove stopwords (based on NLTK `stopwords`). An empty tweet (after preprocessing) and its country label should be **excluded** from the output (`x_processed` and `y_processed`).

**Task**: Complete the `preprocess_data(data, labels)` function. The function takes **a list of tweets** and **a corresponding list of country labels** as input, and returns **two lists**. For the first list, each element is a bag-of-words representation of a tweet. For the second list, each element is a corresponding country label. Note that while we do not need to preprocess the country labels (`y`), we need to have a new output list (`y_processed`) because some tweets maybe removed after the preprocessing (due to having an empty set of bag-of-words).

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [2]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re
from collections import Counter

tt = TweetTokenizer()
stopwords = set(stopwords.words('english')) #note: stopwords are all in lowercase

def preprocess_data(data, labels):
    
    ###
    # Your answer BEGINS HERE
    ###
    
    #Tokenizing each tweet and lower casing
    tokenized_tweet=[]
    for tweet in data:
        tweet = tweet.lower()
        tokenized_tweet.append(tt.tokenize(tweet))
    
    
    #Append the country to the processed tweet
    dataset=list(zip(tokenized_tweet,labels))
    
    #Remove non-alphabetic words
    processed_data_list=[]
    processed_labels=[]
    processed_data=[]
    for twt,country in dataset:
        tweet=[]
        for tw in twt:
            if tw.islower() and tw not in stopwords:
                tweet.append(tw)
        processed_data_list.append(tweet)
        
        #Check if list of tweets is empty
        if (len(processed_data_list)!=0):
            processed_labels.append(country)
    
    #Creating a new dictionary for the features 
    for t in processed_data_list:
        dic = {y:t.count(y) for y in t}
        processed_data.append(dic)
    
    return processed_data, processed_labels
    ###
    # Your answer ENDS HERE
    ###

x_processed, y_processed = preprocess_data(x, y)

print("Number of preprocessed tweets =", len(x_processed))
print("Number of preprocessed labels =", len(y_processed))
print("\nSamples of preprocessed data:")
for i in range(10):
    print("Country =", y_processed[i], "\tTweet =", x_processed[i])

Number of preprocessed tweets = 943
Number of preprocessed labels = 943

Samples of preprocessed data:
Country = us 	Tweet = {'@addictd2success': 1, 'thx': 1, 'u': 1, 'following': 1}
Country = us 	Tweet = {"let's": 1, 'say': 1, 'ever': 1, 'switch': 1, 'teams': 1, 'khalesi': 1, 'would': 1, 'top': 1, 'list': 1, '#girlcrush': 1}
Country = ph 	Tweet = {'taemin': 1, 'jonghyun': 1, 'birits': 1, 'make': 1, 'go': 1, 'http://t.co/le8z3dntla': 1}
Country = id 	Tweet = {'depart.senior': 1, 'rapat': 1, 'perdana': 1, 'nyayu': 1, 'anita': 1, 'others': 1, 'ruang': 1, 'aescullap': 1, 'fk': 1, 'unsri': 1, 'madang': 1, 'https://t.co/swrallnkrq': 1}
Country = ph 	Tweet = {'done': 1, 'internship': 1, 'pretty': 1, 'little': 1, 'lady': 1, 'metropolitan': 1, 'medical': 1, 'center': 1, 'w': 1, 'others': 1, 'pic': 1, 'http://t.co/1qh61r1t5r': 1}
Country = gb 	Tweet = {'wow': 1, "boruc's": 1, 'clanger': 1, 'haha': 1, 'sunday': 1, 'league': 1, 'stuff': 1, 'giroud': 1, 'believe': 1, 'luck': 1, '#clown': 1}
Countr

**For your testing**:

In [3]:
assert(len(x_processed) == len(y_processed))
assert(len(x_processed) > 800)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [4]:
def get_all_hashtags(data):
    
    hashtags = set([])
    for d in data:
      
        for word, frequency in d.items():
            if word.startswith("#") and len(word) > 1:
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(x_processed)
print("Number of hashtags =", len(hashtags))
print(sorted(hashtags))

Number of hashtags = 425
['#100percentpay', '#1stsundayofoctober', '#1yearofalmostisneverenough', '#2011prdctn', '#2015eebritishfilmacademyawards', '#2k16', '#2littlebirds', '#365picture', '#5sosacousticatlanta', '#5sosfam', '#8thannualpubcrawl', '#affsuzukicup', '#aflpowertigers', '#ahimacon14', '#aim20', '#airasia', '#allcity', '#alliswell', '#allwedoiscurls', '#amazing', '#anferneehardaway', '#ariona', '#art', '#arte', '#artwork', '#ashes', '#asian', '#asiangirl', '#askcrawford', '#askherforfback', '#askolly', '#asksteven', '#at', '#australia', '#awesome', '#awesomepict', '#barcelona', '#bart', '#bayofislands', '#beautiful', '#bedimages', '#bell', '#beringmy', '#bettybooppose', '#bff', '#big', '#bigbertha', '#bigbreakfast', '#blackhat', '#blessedmorethanicanimagine', '#blessedsunday', '#blogtourambiente', '#bluemountains', '#bonekachika', '#boomtaob', '#booyaa', '#bored', '#boredom', '#bradersisterhood', '#breaktime', '#breedingground', '#bringithomemy', '#brooksengland', '#burgers'

### Question 2 (1.0 mark)

**Instructions**: Our task here to tokenize the hashtags, by implementing the **MaxMatch algorithm** discussed in class.

NLTK has a list of words that you can use for matching, see starter code below (`words`). Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words (`words`) includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching (provided by the function `lemmatize(word)`). Note that the list of words (`words`) is the only source that you'll use for matching (i.e. you do not need to find  other external word lists). If you are unable to make any longer match, your code should default to matching a single letter.

For example, given "#newrecord", the algorithm should produce: \["#", "new", "record"\].

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing the MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of tokenised words".

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [5]:
from nltk.corpus import wordnet

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK
words = set([ word.lower() for word in words ]) #lowercase all the words for better matching

def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma


def tokenize_hashtags(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    
    tags_dict={}
    
    #Looping over the hashtag
    for ht in hashtags:
        tokens=[]
        index =0
        
        while index <len(ht):
            maxWord=ht[index]
            
            #looping over the letters in a word, and finding the longest word        
            for j in range(index,len(ht)):
                temp = ht[index:j+1]
                
                #find the lemmatized word in the corpus, and assign it as a maxWord if found
                if lemmatize(temp) in words and len(temp)>len(maxWord):
                    maxWord = temp
                else:
                    continue
                    
            #Changing the index with respect to the maximum word found
            index=index+len(maxWord)
            tokens.append(maxWord)
                
        tags_dict[ht] = tokens
       
    return tags_dict
  
    ###
    # Your answer ENDS HERE
    ###

#tokenise hashtags with MaxMatch
tokenized_hashtags = tokenize_hashtags(hashtags)

#print results
count=1
for k, v in sorted(tokenized_hashtags.items())[-30:]:
    
    print(count,k, v)
    count+=1

1 #vanilla ['#', 'vanilla']
2 #vca ['#', 'v', 'ca']
3 #vegan ['#', 'vega', 'n']
4 #veganfood ['#', 'vega', 'n', 'food']
5 #vegetables ['#', 'vegetables']
6 #vegetarian ['#', 'vegetarian']
7 #video ['#', 'video']
8 #vma ['#', 'v', 'ma']
9 #voteonedirection ['#', 'vote', 'one', 'direction']
10 #vsco ['#', 'vs', 'c', 'o']
11 #vscocam ['#', 'vs', 'coca', 'm']
12 #walking ['#', 'walking']
13 #watch ['#', 'watch']
14 #weare90s ['#', 'wear', 'e', '9', '0', 's']
15 #wearesocial ['#', 'weares', 'o', 'c', 'i', 'al']
16 #white ['#', 'white']
17 #wings ['#', 'wings']
18 #wok ['#', 'wo', 'k']
19 #wood ['#', 'wood']
20 #work ['#', 'work']
21 #workmates ['#', 'work', 'mates']
22 #world ['#', 'world']
23 #worldcup2014 ['#', 'world', 'cup', '2', '0', '1', '4']
24 #yellow ['#', 'yellow']
25 #yiamas ['#', 'y', 'i', 'ama', 's']
26 #ynwa ['#', 'yn', 'wa']
27 #youtube ['#', 'you', 'tube']
28 #yummy ['#', 'yummy']
29 #yws13 ['#', 'y', 'ws', '1', '3']
30 #zweihandvollfarm ['#', 'z', 'wei', 'hand', 'vol', 'l',

**For your testing:**

In [6]:
assert(len(tokenized_hashtags) == len(hashtags))
assert(tokenized_hashtags["#newrecord"] == ["#", "new", "record"])

### Question 3 (1.0 mark)

**Instructions**: Our next task is to tokenize the hashtags again, but this time using a **reversed version of the MaxMatch algorithm**, where matching begins at the end of the hashtag and progresses backwards (e.g. for <i>#helloworld</i>, we would process it right to left, starting from the last character <i>d</i>). Just like before, you should use the provided word list (`words`) for word matching.

**Task**: Complete the `tokenize_hashtags_rev(hashtags)` function by the MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of tokenised words".

**Check**: Use the assertion statements in <b>"For your testing"</b> below for the expected output.

In [7]:
def tokenize_hashtags_rev(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    tags_dict={}
     
    #Looping over the hashtag
    for ht in hashtags:
        tokens=[]
        index = len(ht)-1
        symbol = ht[0]
        
        while index > 0:
            maxWord = ''
            j = len(ht)-1
            
            #looping over the letters in a word, and finding the longest word     
            while j > 0:
                temp=ht[j:index+1]
                if lemmatize(temp) in words:
                    maxWord = temp
                j=j-1
                
            #find the lemmatized word in the corpus, and assign it as a maxWord if found
            if len(maxWord)==0 :
                maxWord = ht[index]
            
            #Changing the index with respect to the longest word
            index = index - len(maxWord)
            tokens.append(maxWord)
        
        #Reverse the tokenized words 
        tokens.reverse()
        tokens.insert(0,symbol)
        tags_dict[ht] = tokens
    return tags_dict

    ###
    # Your answer ENDS HERE
    ###

#tokenise hashtags with the reversed version of MaxMatch
tokenized_hashtags_rev = tokenize_hashtags_rev(hashtags)

#print results
count=1
for k, v in sorted(tokenized_hashtags_rev.items())[-30:]:
    print(count,k, v)
    count+=1

1 #vanilla ['#', 'vanilla']
2 #vca ['#', 'v', 'ca']
3 #vegan ['#', 'v', 'e', 'gan']
4 #veganfood ['#', 'v', 'e', 'gan', 'food']
5 #vegetables ['#', 'vegetables']
6 #vegetarian ['#', 'vegetarian']
7 #video ['#', 'video']
8 #vma ['#', 'v', 'ma']
9 #voteonedirection ['#', 'vote', 'one', 'direction']
10 #vsco ['#', 'vs', 'c', 'o']
11 #vscocam ['#', 'vs', 'c', 'o', 'cam']
12 #walking ['#', 'walking']
13 #watch ['#', 'watch']
14 #weare90s ['#', 'we', 'are', '9', '0', 's']
15 #wearesocial ['#', 'we', 'are', 'social']
16 #white ['#', 'white']
17 #wings ['#', 'wings']
18 #wok ['#', 'w', 'ok']
19 #wood ['#', 'wood']
20 #work ['#', 'work']
21 #workmates ['#', 'work', 'mates']
22 #world ['#', 'world']
23 #worldcup2014 ['#', 'world', 'cup', '2', '0', '1', '4']
24 #yellow ['#', 'yellow']
25 #yiamas ['#', 'y', 'i', 'a', 'mas']
26 #ynwa ['#', 'yn', 'wa']
27 #youtube ['#', 'you', 'tube']
28 #yummy ['#', 'yummy']
29 #yws13 ['#', 'y', 'ws', '1', '3']
30 #zweihandvollfarm ['#', 'z', 'wei', 'hand', 'vol', 

**For your testing:**

In [8]:
assert(len(tokenized_hashtags_rev) == len(hashtags))
assert(tokenized_hashtags_rev["#newrecord"] == ["#", "new", "record"])

### Question 4 (1.0 mark)

**Instructions**: The two versions of MaxMatch will produce different results for some of the hashtags. For a hastag that has different results, our task here is to use a **unigram language model** (lecture 3) to score them to see which is better. Recall that in a unigram language model we compute P(<i>#</i>, <i>hello</i>, <i>world</i> = P(<i>#</i>)\*P(<i>hellow</i>)\*P(<i>world</i>).

You should: (1) use the NLTK's Brown corpus (`brown_words`) for collecting word frequencies (note: the words are already tokenised so no further tokenisation is needed); (2) lowercase all words in the corpus; (3) use add-one smoothing when computing the unigram probabilities; and (4) work in the log space to prevent numerical underflow.

**Task**: Build a unigram language model with add-one smoothing using the word counts from the Brown corpus. Iterate through the hashtags, and for each hashtag where MaxMatch and reversed MaxMatch produce different results, print the following: (1) the hashtag; (2) the results produced by MaxMatch and reversed MaxMatch; and (3) the log probability of each result as given by the unigram language model. Note: you **do not** need to print the hashtags where MaxMatch and reversed MaxMatch produce the same results.

An example output:
```
1. #abcd
MaxMatch = [#, a, bc, d]; LogProb = -2.3
Reversed MaxMatch = [#, a, b, cd]; LogProb = -3.5

2. #efgh
MaxMatch = [#, ef, g, h]; LogProb = -4.2
Reversed MaxMatch = [#, e, fgh]; LogProb = -3.1

```

Have a look at the output, and see if the sequences with better language model scores (i.e. less negative) are generally more coherent.

In [15]:
from nltk.corpus import brown
from nltk import ngrams
import math 
#words from brown corpus
brown_words = brown.words()

###
# Your answer BEGINS HERE
###

#A function to calculate frequency of brown corpus
def corpus_freq(corpus):
    wordfreq = {}
    for word in brown_words:
        word= word.lower()
        if word in wordfreq:
            wordfreq[word] +=1
        else:
          
            wordfreq[word] = 1
    return wordfreq
                
count_brown = corpus_freq(brown_words)
size_corpus = len(brown_words)

#A function to find all hashtags that yielded different results
def different_hashtags(hashtags_A, hashtags_B):
    different_htA={}
    different_htB={}
   
    for htA, htB in zip(hashtags_A.items(), hashtags_B.items()):
        if htA[1:] != htB[1:]:
         
            different_htA[htA[0]] = htA[1]
            different_htB[htB[0]] = htB[1]
    
    return different_htA, different_htB

hashtagA, hashtagB = different_hashtags(tokenized_hashtags,tokenized_hashtags_rev)

#Calculate unigram probability 
def unigram_prob(hashtag_list, freq_corpus, corpus_size):
    
    hashtag={}
    unique_corpus=0
    
    #Finding the unique words in the corpus
    for k in freq_corpus:   
        unique_corpus += 1
    denominator=unique_corpus+corpus_size
    
    #Looping over the hashtag
    for ht in hashtag_list:
        tags=hashtag_list[ht]
        word_dict={}
        prob=0
        
        #Calculating the occurrences in the corpus
        for word in tags:
            if word in word_dict:
                word_dict[word]+=1
            else:
                word_dict[word]=1
                
        #Calculating the probability of P(x|y)
        for w in word_dict:
            if w in freq_corpus:
                numerator= freq_corpus[w]
            else:
                numerator=0
                
        #Calculating the probability of the word
            prob+=math.log(((numerator+1)/(denominator+1)))
        hashtag[ht] = prob
       
    return hashtag
         

hashtagA_prob= unigram_prob(hashtagA,count_brown,size_corpus)
hashtagB_prob= unigram_prob(hashtagB,count_brown,size_corpus)

#Print result
counter=1
for htA_prob, htB_prob in zip(hashtagA_prob,hashtagB_prob):
    print(counter,".", htA_prob)
    print("Max Match: " ,hashtagA[htA_prob],"; LogProb= %.2f"%hashtagA_prob[htA_prob] )
    print("Reverse: ", hashtagB[htB_prob],"; LogProb= %.2f"%hashtagB_prob[htB_prob])
    print("\n")
    counter+=1
###
# Your answer ENDS HERE
###



1 . #lebedeintennis
Max Match:  ['#', 'l', 'e', 'bed', 'e', 'in', 'tennis'] ; LogProb= -59.90
Reverse:  ['#', 'l', 'e', 'be', 'de', 'in', 'tennis'] ; LogProb= -65.19


2 . #1yearofalmostisneverenough
Max Match:  ['#', '1', 'year', 'of', 'almost', 'is', 'never', 'enough'] ; LogProb= -60.89
Reverse:  ['#', '1', 'year', 'of', 'al', 'mos', 'tis', 'never', 'enough'] ; LogProb= -86.93


3 . #blessedsunday
Max Match:  ['#', 'blesseds', 'un', 'day'] ; LogProb= -46.86
Reverse:  ['#', 'blessed', 'sunday'] ; LogProb= -34.76


4 . #endomondo
Max Match:  ['#', 'end', 'om', 'on', 'do'] ; LogProb= -47.98
Reverse:  ['#', 'en', 'do', 'mon', 'do'] ; LogProb= -46.04


5 . #melbourne
Max Match:  ['#', 'mel', 'bourn', 'e'] ; LogProb= -49.42
Reverse:  ['#', 'm', 'elb', 'our', 'ne'] ; LogProb= -58.12


6 . #nevergetsold
Max Match:  ['#', 'never', 'gets', 'old'] ; LogProb= -38.78
Reverse:  ['#', 'never', 'get', 'sold'] ; LogProb= -38.99


7 . #instagood
Max Match:  ['#', 'ins', 'tag', 'o', 'od'] ; LogProb= -6

# Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we are interested to do text classification, to predict the country of origin of a given tweet. The task here is to create training, development and test partitions from the preprocessed data (`x_processed`) and convert the bag-of-words representation into feature vectors.

**Task**: Create training, development and test partitions with a 70%/15%/15% ratio. Remember to preserve the ratio of the classes for all your partitions. That is, say we have only 2 classes and 70% of instances are labelled class A and 30% of instances are labelled class B, then the instances in training, development and test partitions should also preserve this 7:3 ratio. You may use sklearn's builtin functions for doing data partitioning.

Next, turn the bag-of-words dictionary of each tweet into a feature vector. You may also use sklearn's builtin functions for doing this.

You should produce 6 objects: `x_train`, `x_dev`, `x_test` which contain the input feature vectors, and `y_train`, `y_dev` and `y_test` which contain the labels.

In [10]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


x_train, x_dev, x_test = None, None, None
y_train, y_dev, y_test = None, None, None

###
# Your answer BEGINS HERE
###


def prepare_data(x_data,y_data):
    
    #Split to get test set
    xtrain, xtest, ytrain, ytest = train_test_split(x_data,y_data, train_size=0.7,test_size=0.3,stratify=y_data,random_state=100)
    
    #Split to get development set
    #Split with the ratio of 50:50 to get 15% each 
    xtest, xdev, ytest, ydev = train_test_split(xtest,ytest, train_size=0.5,test_size=0.5, stratify=ytest, random_state=100)
    
    #Vectorize the features  
    v = DictVectorizer(sparse=True)
    xtrain_vect = v.fit_transform(xtrain)
    xtest_vect = v.transform(xtest)
    xdev_vect = v.transform(xdev)
    
    #Label Encoding the classes
    le = LabelEncoder()
    ytrain_le = le.fit_transform(ytrain)
    ytest_le = le.transform(ytest)
    ydev_le = le.transform(ydev)

    return xtrain_vect, ytrain_le, xtest_vect,ytest_le,xdev_vect,ydev_le, v, le


x_train, y_train, x_test, y_test, x_dev, y_dev, vectorizer, encoder= prepare_data(x_processed, y_processed)
###
# Your answer ENDS HERE
###

### Question 6 (1.0 mark)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation hyper-parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the **accuracy** with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance (accuracy) for different hyper-parameter settings.

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

###
# Your answer BEGINS HERE
###

#A function for hyperparameter tuning of Multinomial NB
def hyperparam_MultinomialNB(xtrain_vect,ytrain, xdev_vect, ydev):
    
    alpha = [1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.01,0.001,0.0001,0.00001]
    
    best_score=0
    best_alpha=None
    
    #Looping over alpha
    for a in alpha:
        #Train and predict the model
        nb = MultinomialNB(alpha=a)
        nb.fit(xtrain_vect, ytrain)
        y_pred_class = nb.predict(xdev_vect)
        #Evaluating the model
        current_score = metrics.accuracy_score(ydev, y_pred_class)
        print("Naive Bayes accuracy: "+str(metrics.accuracy_score(ydev, y_pred_class))," Alpha: ",a)
        
        #Change the current best score and its parameter
        if current_score > best_score :
            best_score = current_score
            best_alpha = a
           
    print("NB Best_score:",best_score, "Alpha: ", best_alpha)
    
    return best_alpha
    
#A function for hyperparameter tuning Logistic Regression
def hyperparam_LogisticRegression(xtrain_vect,ytrain, xdev_vect, ydev):
    
    solver = ['newton-cg','liblinear','lbfgs','sag','saga']
    C=[1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.175,0.15, 0.1, 0.75, 0.5, 0.01,0.001,0.0001,0.00001]
    
    best_score =0
    best_C=None
    best_solver=None
    param=[]
    
    #Looping over solver 
    for s in solver:
        #Lopping over C value
        for c in C:
            #Train and predict the model
            log_reg = LogisticRegression(C=c, solver=s)
            log_reg.fit(xtrain_vect,ytrain)
            y_logreg_class = log_reg.predict(xdev_vect)
            #Evaluating the model
            current_score = metrics.accuracy_score(ydev, y_logreg_class)   
            print("Logistic Regression accuracy: "+str(current_score),'C: ',c,'solver: ',s)
            
            #Change the current best score and its parameters
            if current_score > best_score:
                best_score = current_score
                best_c = c
                best_solver= s
    print("LR Best_score:",best_score, "C: ", best_c,"solver: ",best_solver)
    param.append(best_c)
    param.append(best_solver)
    
    return param
    
alpha_val= hyperparam_MultinomialNB(x_train, y_train, x_dev, y_dev)
parameters=hyperparam_LogisticRegression(x_train, y_train, x_dev, y_dev)


###
# Your answer ENDS HERE
###

Naive Bayes accuracy: 0.2605633802816901  Alpha:  1.0
Naive Bayes accuracy: 0.2605633802816901  Alpha:  0.9
Naive Bayes accuracy: 0.2605633802816901  Alpha:  0.8
Naive Bayes accuracy: 0.2605633802816901  Alpha:  0.7
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.6
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.5
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.4
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.3
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.2
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.1
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.01
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.001
Naive Bayes accuracy: 0.2535211267605634  Alpha:  0.0001
Naive Bayes accuracy: 0.2535211267605634  Alpha:  1e-05
NB Best_score: 0.2605633802816901 Alpha:  1.0
Logistic Regression accuracy: 0.30985915492957744 C:  1.0 solver:  newton-cg
Logistic Regression accuracy: 0.31690140845070425 C:  0.9 solver:  newton-cg
Logistic Regression accuracy: 0.3169



Logistic Regression accuracy: 0.31690140845070425 C:  0.9 solver:  sag
Logistic Regression accuracy: 0.31690140845070425 C:  0.8 solver:  sag




Logistic Regression accuracy: 0.31690140845070425 C:  0.7 solver:  sag
Logistic Regression accuracy: 0.33098591549295775 C:  0.6 solver:  sag




Logistic Regression accuracy: 0.323943661971831 C:  0.5 solver:  sag
Logistic Regression accuracy: 0.31690140845070425 C:  0.4 solver:  sag




Logistic Regression accuracy: 0.31690140845070425 C:  0.3 solver:  sag
Logistic Regression accuracy: 0.2887323943661972 C:  0.2 solver:  sag




Logistic Regression accuracy: 0.2887323943661972 C:  0.175 solver:  sag
Logistic Regression accuracy: 0.2887323943661972 C:  0.15 solver:  sag
Logistic Regression accuracy: 0.28169014084507044 C:  0.1 solver:  sag




Logistic Regression accuracy: 0.31690140845070425 C:  0.75 solver:  sag
Logistic Regression accuracy: 0.323943661971831 C:  0.5 solver:  sag
Logistic Regression accuracy: 0.2605633802816901 C:  0.01 solver:  sag
Logistic Regression accuracy: 0.2676056338028169 C:  0.001 solver:  sag




Logistic Regression accuracy: 0.19718309859154928 C:  0.0001 solver:  sag




Logistic Regression accuracy: 0.1056338028169014 C:  1e-05 solver:  sag
Logistic Regression accuracy: 0.30985915492957744 C:  1.0 solver:  saga




Logistic Regression accuracy: 0.31690140845070425 C:  0.9 solver:  saga
Logistic Regression accuracy: 0.323943661971831 C:  0.8 solver:  saga




Logistic Regression accuracy: 0.33098591549295775 C:  0.7 solver:  saga
Logistic Regression accuracy: 0.33098591549295775 C:  0.6 solver:  saga




Logistic Regression accuracy: 0.323943661971831 C:  0.5 solver:  saga
Logistic Regression accuracy: 0.323943661971831 C:  0.4 solver:  saga




Logistic Regression accuracy: 0.30985915492957744 C:  0.3 solver:  saga
Logistic Regression accuracy: 0.28169014084507044 C:  0.2 solver:  saga
Logistic Regression accuracy: 0.28169014084507044 C:  0.175 solver:  saga
Logistic Regression accuracy: 0.29577464788732394 C:  0.15 solver:  saga
Logistic Regression accuracy: 0.28169014084507044 C:  0.1 solver:  saga




Logistic Regression accuracy: 0.323943661971831 C:  0.75 solver:  saga
Logistic Regression accuracy: 0.323943661971831 C:  0.5 solver:  saga




Logistic Regression accuracy: 0.2535211267605634 C:  0.01 solver:  saga
Logistic Regression accuracy: 0.19014084507042253 C:  0.001 solver:  saga




Logistic Regression accuracy: 0.1619718309859155 C:  0.0001 solver:  saga
Logistic Regression accuracy: 0.1267605633802817 C:  1e-05 solver:  saga
LR Best_score: 0.33098591549295775 C:  0.6 solver:  newton-cg




### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both **accuracy** and **macro-averaged F-score** for each classifier. Be sure to label your output. You may use sklearn's inbuilt functions.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using their optimal hyper-parameter settings based on their development performance.

In [12]:
###
# Your answer BEGINS HERE
###
from sklearn.metrics import f1_score

#A function to train the model using Multinomial NB
def train_MultinomialNB(xtrain_vect,ytrain, xtest_vect, ytest, a):
    nb = MultinomialNB(alpha=a)
    nb.fit(xtrain_vect, ytrain)
    y_pred_class = nb.predict(xtest_vect)
    accuracy = metrics.accuracy_score(ytest, y_pred_class)
    macro_f1 = metrics.f1_score(ytest,y_pred_class,average='macro')
    print("Naive Bayes Accuracy: "+str(accuracy)," Alpha:",a)
    print("Naive Bayes Macro Averaged F1 score: "+str(macro_f1)," Alpha:",a)
    
    return nb

#A function to train the model using Logistic Regression
def train_LogisticRegression(xtrain_vect,ytrain, xtest_vect, ytest,param):
    c = param[0]
    sol = param[1]
    lr = LogisticRegression(C=c, solver=sol)
    lr.fit(xtrain_vect, ytrain)
    y_pred_class = lr.predict(xtest_vect)
    accuracy = metrics.accuracy_score(ytest, y_pred_class)
    macro_f1 = metrics.f1_score(ytest,y_pred_class,average='macro')
    print("Logistic Regression Accuracy: "+str(accuracy)," C:",c," Solver:",sol)
    print("Logistic Regression Macro Averaged F1 score: "+str(macro_f1)," C:",c," Solver:",sol)
    
    return lr
    
nb_clf= train_MultinomialNB(x_train,y_train,x_test,y_test,alpha_val)
print("=============================================")
lr_clf=train_LogisticRegression(x_train, y_train,x_test,y_test,parameters)
###
# Your answer ENDS HERE
###

Naive Bayes Accuracy: 0.2624113475177305  Alpha: 1.0
Naive Bayes Macro Averaged F1 score: 0.2562357665583472  Alpha: 1.0
Logistic Regression Accuracy: 0.3262411347517731  C: 0.6  Solver: newton-cg
Logistic Regression Macro Averaged F1 score: 0.30805083701888974  C: 0.6  Solver: newton-cg


### Question 8 (1.0 mark)

**Instructions**: Print the most important features and their weights for each class for the two classifiers.


**Task**: For each of the classifiers (Logistic Regression and Naive Bayes) you've built in the previous question, print out the top-20 features (words) with the highest weight for each class (countries).

An example output:
```
Classifier = Logistic Regression

Country = au
aaa (0.999) bbb (0.888) ccc (0.777) ...

Country = ca
aaa (0.999) bbb (0.888) ccc (0.777) ...

Classifier = Naive Bayes

Country = au
aaa (-1.0) bbb (-2.0) ccc (-3.0) ...

Country = ca
aaa (-1.0) bbb (-2.0) ccc (-3.0) ...
```

Have a look at the output, and see if you notice any trend/pattern in the words for each country.

In [13]:
###
# Your answer BEGINS HERE
###
import numpy as np 

#A function that prints the top 20 features trained using Multinomial Naive Bayes
def top_feat_NB(classifier, vectorizer, label_encoder):
    #Inverse encode the label
    le= label_encoder
    y_labels = le.inverse_transform(classifier.classes_)
    #Obtain the feature names from the vectorizer
    features = np.asarray(vectorizer.get_feature_names())
    
    #Loop over the class labels
    for i, label in enumerate(y_labels):
          
        #Select the best 20 features based on their log probability weight
        top20_feats =sorted(zip(classifier.feature_log_prob_[i], features),reverse=True)[:20]
        print("Country = ",label)
        
        #Loop over the best 20 features and print the results
        for coef,feat in top20_feats: 
            print(feat,"(%.3f)"% coef," ",sep=' ', end='', flush=True)      
        print("\n")

#A function that prints the top 20 features trained using Logistic Regression
def top_feat_LR(classifier, vectorizer,label_encoder):
    le = label_encoder
    y_labels = le.inverse_transform(classifier.classes_)
    features = np.asarray(vectorizer.get_feature_names())    
   
    #Loop over the class labels
    for i, label in enumerate(y_labels):
        
        #Select the best 20 features based on their log probability weight
        top20_feats =sorted(zip(classifier.coef_[i], features),reverse=True)[:20]
        print("Country = ",label)
        
        #Loop over the best 20 features and print the features
        for coef,feat in top20_feats: 
            print(feat,"(%.3f)"% coef," ",sep=' ', end='', flush=True)
        print("\n")

print("Classifier = Logistic Regression\n")
top_feat_LR(lr_clf,vectorizer,encoder)
        
print("Classifier = Naive Bayes\n")
top_feat_NB(nb_clf,vectorizer,encoder)
###
# Your answer ENDS HERE
###

Classifier = Logistic Regression

Country =  au
melbourne (0.962)  australia (0.933)  #melbourne (0.589)  #photography (0.554)  #sunrise (0.538)  #mtvhottest (0.519)  one (0.512)  friends (0.502)  finished (0.500)  little (0.488)  song (0.487)  forward (0.456)  great (0.453)  even (0.450)  lachie (0.435)  @dasheryoung (0.435)  lovely (0.434)  name (0.432)  @darlingsaila (0.431)  @whennboys (0.431)  

Country =  ca
happen (0.694)  right (0.690)  first (0.682)  bed (0.649)  thing (0.611)  great (0.593)  let's (0.591)  next (0.561)  really (0.557)  sounds (0.553)  finally (0.538)  movies (0.536)  later (0.523)  one (0.513)  found (0.475)  looks (0.475)  awesome (0.472)  works (0.465)  actually (0.463)  new (0.444)  

Country =  de
posted (0.989)  night (0.742)  love (0.740)  photo (0.669)  germany (0.647)  enough (0.570)  painting (0.523)  roseninsel (0.496)  https://t.co/df7ficsci3 (0.496)  krefeld (0.482)  jessica (0.475)  hyde (0.475)  https://t.co/brkwmsvzrb (0.475)  gauting (0.475)  