## Lyrics Era classification with Naive Bayes 

You will implement a Naive Bayes text classifier that classifies song lyrics by era.

In particular, we created a dataset for you that consists of song lyrics from *heavy metal bands* spanning the last half century!

<img src="pics/cover.jpg">

*This assignment is inspired by material from J. Eisenstein.*




# 1. Preprocessing


Read the data into a dataframe

In [None]:
import pandas as pd
df_train = pd.read_csv('data/metal-lyrics-train.csv')

A dataframe is a structured representation of your data. You can preview a dataframe using `head()`

In [None]:
df_train.head()

In [None]:
## Let's inspect an example. Can you guess the era? :) [Can the machine do so?]
df_train['Lyrics'][10]

##  <font color='blue'>Task 1</font>: Representing and inspecting the data - Bag of words (BOW), label distribution and OOV rate

Your first task is to convert the text into a bag-of-words representation.


- **Deliverable 1.1**: implement `bag_of_words`, a counter of all words in a single document. Decide how to tokenize the data. Load the data files `data/metal-lyrics-train.csv` and `data/metal-lyrics-dev.csv` and inspect them.
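
As a possible starting point, the sketch below lowercases the text and splits it with a simple regular expression before counting. The name `toy_bag_of_words`, the regular expression and the example sentence are illustrative assumptions, not the required tokenization; keeping case or treating punctuation differently are equally reasonable choices.

```python
import re
from collections import Counter

def toy_bag_of_words(text):
    # Assumption: lowercase, then keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Made-up example sentence (not taken from the dataset).
example = toy_bag_of_words("Ride the storm, and ride the night")
print(example)
```

Whatever tokenization you pick, apply the same one to the training and dev data, since it determines the vocabulary.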

In [None]:
from collections import Counter

# deliverable 1.1
def bag_of_words(text):
    '''
    Count the number of occurrences of each word in a single document. The function also tokenizes the text.

    :param text: a document, as a single string
    :returns: a Counter for a single document
    :rtype: Counter
    '''
    # your code here
    raise NotImplementedError

In [None]:
### helper code
    
def read_data(filename, label='Era', text="Lyrics", preprocessor=bag_of_words):
    '''
    Read the data and convert with preprocessor
    '''
    df = pd.read_csv(filename)
    return df[label].values, [preprocessor(string) for string in df[text].values]

In [None]:
## Load the data

## your code here


## Label Distribution

- **Deliverable 1.2**: inspect the data. Plot the label distribution in the training and dev data.
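
A minimal sketch of how a label distribution could be computed before plotting it. The label values below are hypothetical placeholders; the real ones come from the `Era` column.

```python
from collections import Counter

# Hypothetical era labels, for illustration only.
toy_labels = ["1980s", "1990s", "1980s", "2000s", "1980s", "1990s"]

label_counts = Counter(toy_labels)
total = sum(label_counts.values())
# Relative frequency of each label.
label_dist = {label: count / total for label, count in label_counts.items()}

for label in sorted(label_dist):
    print(label, label_counts[label], round(label_dist[label], 2))
```

The resulting counts can then be passed to a bar plot, once for the training data and once for the dev data, to compare the two distributions.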


In [None]:
## your code here

## Unseen words

One challenge for classification is that words will appear in the test data that do not appear in the training data. Compute the number of words that appear in `metal-lyrics-dev.csv`, but not in `metal-lyrics-train.csv`. To do this, implement the following deliverables:


- **Deliverable 1.3**: implement `aggregate_counts`, a counter of all words in a list of bags-of-words. This function creates the vocabulary over the entire dataset (here the vocabulary is the set of all unique *word types*).
- **Deliverable 1.4**: implement `compute_oov`, returning the set of words that appear in one bag-of-words, but not in another. Then, use your implementation to calculate the out-of-vocabulary (OOV) rate.
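
The two ideas above can be sketched on toy data as follows. The variable names and the tiny counters are made up for illustration, and this is only one way to do it: `Counter` objects can be added together, and the unseen word types are a set difference over the keys.

```python
from collections import Counter

# Two hypothetical "corpora", each a list of per-document Counters.
toy_train = [Counter({"fire": 2, "night": 1}), Counter({"night": 3, "steel": 1})]
toy_dev = [Counter({"night": 1, "dragon": 2}), Counter({"fire": 1})]

# Counters support addition, so summing merges the per-document counts.
train_counts = sum(toy_train, Counter())
dev_counts = sum(toy_dev, Counter())

# Word types seen in dev but never in train.
unseen = set(dev_counts) - set(train_counts)
# Type-level OOV rate: unseen types divided by all dev types.
rate = len(unseen) / len(dev_counts)
print(unseen, rate)
```

Note that this rate is computed over word types, not tokens, which matches the `oov_rate` helper in this notebook.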



In [None]:
# deliverable 1.3
def aggregate_counts(bags_of_words):
    '''
    Aggregate word counts for individual documents into a single bag of words representation

    :param bags_of_words: a list of bags of words as Counters from the bag_of_words method
    :returns: an aggregated bag of words for the whole corpus
    :rtype: Counter
    '''

    counts = Counter()
    # your code here
    raise NotImplementedError

    
# deliverable 1.4
def compute_oov(bow1, bow2):
    '''
    Return the set of words that appear in bow1, but not in bow2

    :param bow1: a bag of words
    :param bow2: a bag of words
    :returns: the set of words in bow1, but not in bow2
    :rtype: set
    '''
    # your code here
    raise NotImplementedError


# helper code
def oov_rate(bow1, bow2):
    return len(compute_oov(bow1, bow2)) / len(bow1.keys())

Now that you have implemented the functions, we can use them to calculate the OOV rate.
- **Deliverable 1.5**: calculate the OOV rate on the dev data.

What percentage of words in the dev set do not appear in the training set?

In [None]:
## your code here

## Power laws

Word count distributions are said to follow [power law](https://en.wikipedia.org/wiki/Power_law) distributions. 

In practice, this means that a log-log plot of frequency against rank is nearly linear. Let's see if this holds for our data.
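
The reason is that a power law $f(r) = C\, r^{-s}$ becomes $\log f = \log C - s \log r$ after taking logs, which is a straight line with slope $-s$. The sketch below checks this on idealised, made-up frequencies (frequency proportional to 1/rank) by fitting a least-squares line in log-log space; real lyrics data will only follow this approximately.

```python
import math

# Hypothetical, idealised Zipfian frequencies: frequency proportional to 1/rank.
ranks = list(range(1, 51))
freqs = [1000 / r for r in ranks]

log_r = [math.log(r) for r in ranks]
log_f = [math.log(f) for f in freqs]

# Ordinary least-squares slope of log(frequency) against log(rank).
mean_r = sum(log_r) / len(log_r)
mean_f = sum(log_f) / len(log_f)
slope = (sum((x - mean_r) * (y - mean_f) for x, y in zip(log_r, log_f))
         / sum((x - mean_r) ** 2 for x in log_r))
print(round(slope, 3))
```

For these idealised frequencies the fitted slope is -1 up to floating-point error.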

You can see the most common items in a counter by calling `counts.most_common()`:



In [None]:
# your code here

In [None]:
import matplotlib.pyplot as plt
plt.loglog([val for word, val in counts_tr.most_common()])
plt.loglog([val for word, val in counts_dv.most_common()])
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend(['training set','dev set']);
plt.show()

What would this curve look like if it were not plotted in log space?

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\bar}{\,|\,}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\Pulp}{\text{Pulp}}
\newcommand{\Fiction}{\text{Fiction}}
\newcommand{\PulpFiction}{\text{Pulp Fiction}}
\newcommand{\pnb}{\prob^{\text{NB}}}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# <font color='blue'>Task 2</font>: Naive Bayes

You'll now implement a Naive Bayes classifier for this task.


###  Naive Bayes Prediction

Given a trained NB model (we have estimated model parameters $\params$), we search for the $y\in\Ys$ with maximum *a posteriori* probability:

$$
\argmax_{y\in\Ys} \prob_\params(y|\x) =  \argmax_{y\in\Ys} \frac{\prob(\x|y) \prob(y) }{ \prob(\x) } =\\ \argmax_{y\in\Ys} \prob(\x|y) \prob(y) 
$$


where the parameters $\params$ in NB consist of two sets:
* $\prob(y)$ - prior probability (or $\bbeta$ for class priors) 
* $\prob(\x|y)$ - likelihood (or $\balpha$, per-class feature probabilities)
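
To make the decision rule concrete, here is a small sketch with hypothetical parameters for two made-up classes and a three-word vocabulary. It scores each class by $\log \prob(y) + \sum_w \text{count}(w) \log \prob(w|y)$ and returns the best one; the names `toy_predict`, `log_prior` and `log_likelihood` are assumptions for this example only.

```python
import math

# Hypothetical parameters for two classes and a three-word vocabulary.
log_prior = {"old": math.log(0.5), "new": math.log(0.5)}
log_likelihood = {
    "old": {"steel": math.log(0.6), "night": math.log(0.3), "cyber": math.log(0.1)},
    "new": {"steel": math.log(0.2), "night": math.log(0.3), "cyber": math.log(0.5)},
}

def toy_predict(bow):
    # Score each class by log p(y) + sum_w count(w) * log p(w | y).
    scores = {}
    for y in log_prior:
        scores[y] = log_prior[y] + sum(
            n * log_likelihood[y][w]
            for w, n in bow.items()
            if w in log_likelihood[y]  # unknown words are ignored
        )
    return max(scores, key=scores.get)

print(toy_predict({"steel": 2, "night": 1}))
print(toy_predict({"cyber": 1, "night": 1}))
```

The denominator $\prob(\x)$ never appears because it is the same for every class and therefore does not change the argmax.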

To train the model, we use Maximum Likelihood estimation: 

\begin{split}
  \alpha_{w,y} & = \frac{\counts{\train}{w,y}}{\sum_{w'}\counts{\train}{w',y}}\\\\
  \beta_{y} & = \frac{\counts{\train}{y}}{\left| \train \right|}
\end{split}
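
As a worked illustration of these two estimates, the sketch below computes $\beta_y$ and $\alpha_{w,y}$ for a hypothetical three-document corpus. It uses plain dictionaries rather than the matrix layout suggested later, and all names and data are made up.

```python
from collections import Counter

# Hypothetical labelled corpus: (label, bag of words) pairs.
toy_data = [
    ("old", Counter({"steel": 2, "night": 1})),
    ("old", Counter({"steel": 1})),
    ("new", Counter({"cyber": 1, "night": 1})),
]

# beta_y: fraction of documents with label y.
doc_counts = Counter(label for label, _ in toy_data)
beta = {y: n / len(toy_data) for y, n in doc_counts.items()}

# alpha_{w,y}: count of w in class y divided by all word tokens in class y.
word_counts = {y: Counter() for y in doc_counts}
for label, bow in toy_data:
    word_counts[label].update(bow)
alpha = {
    y: {w: n / sum(counts.values()) for w, n in counts.items()}
    for y, counts in word_counts.items()
}

print(beta)
print(alpha["old"])
```

In this toy corpus `cyber` never occurs with the label `old`, so its unsmoothed estimate for that class is zero; this is the situation that smoothing addresses.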


Write Python code to use a training set of documents to estimate the probabilities in the Naive Bayes model. Return the data structure containing the probabilities. The input parameter of this function should be a list of lyrics with labels. 

Hints:

* use *log* probabilities to avoid numeric over/underflow.
* words that occur at test time but were never seen in training are dropped completely (as mentioned in J&M, we don’t use unknown word models for naive Bayes)
* first implement a model without any smoothing; evaluate it; then add Laplace smoothing to the likelihoods and observe how that impacts performance
* it might help to develop your model first on toy data; for this purpose we include the worked example from Section 4.3 of the textbook (`data/tinysentiment*.csv`)
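
The first and third hints can be illustrated with a few lines of Python. This is a sketch on made-up numbers: it shows a product of small probabilities underflowing to zero while the sum of logs stays finite, and it applies add-one (Laplace) smoothing, $(\text{count} + 1) / (\text{total} + |V|)$, to hypothetical per-class counts.

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p

# ... while the equivalent sum of logs stays a finite number.
log_sum = sum(math.log(p) for p in probs)
print(product, log_sum)

# Laplace (add-one) smoothing on hypothetical per-class counts.
counts = {"steel": 3, "night": 1, "cyber": 0}
total = sum(counts.values())
vocab_size = len(counts)
smoothed = {w: (n + 1) / (total + vocab_size) for w, n in counts.items()}
print(smoothed)
```

After smoothing, the word with a zero count receives a small non-zero probability and the distribution still sums to one.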






##  <font color='blue'>Task 2.A </font>: Towards estimating likelihood - counting features per class

Above you have implemented code to convert the text into a bag-of-words representation. 

For the estimation of the Naive Bayes likelihood parameters, we are interested in counting the number of times a word appears, given a certain class. 

- **Deliverable 1.4**: implement `aggregate_counts_for_label`, a counter of all words in a list of bags-of-words for a specific label only. 

Test it on the provided worked toy example


In [None]:
# deliverable 1.4
def aggregate_counts_for_label(bags_of_words, y_train, label):
    '''
    Aggregate word counts for individual documents given a certain label (class) into a single bag of words representation

    :param bags_of_words: a list of bags of words as Counters from the bag_of_words method
    :param y_train: the list of labels, aligned with bags_of_words
    :param label: the class whose documents are aggregated (later used to compute the likelihood term p(w|class))
    :returns: an aggregated bag of words for the subset of the corpus specific to the given label
    :rtype: Counter
    '''

    counts = Counter()
    # your code here
    raise NotImplementedError

In [None]:
## load the toy example and inspect the aggregate counts
## your code here



##  <font color='blue'>Task 2.B </font>:  Implement Naive Bayes (from scratch)


- **Deliverable 2.1**: implement `train_nb`, which estimates the parameters for the Naive Bayes model. 

- **Deliverable 2.2**: evaluate the estimated model parameters on the evaluation dataset. Add Laplace smoothing and observe what happens to the performance of the model. 

Tips:

- we provide helper code below. You are free to ignore it and code it up your own way (with dictionaries or whichever structure you feel most comfortable with)
- for efficiency, the code below stores all model parameters in a single matrix, so that prediction reduces to a matrix product. The parameter matrix has size `(num_features + 1) x num_classes`: the first row (index 0) is reserved for the class priors, and rows 1 onwards contain the likelihood of each feature given the class. Some useful functions are `np.array()`, `np.sum(data, axis=0)` and [`np.nan_to_num`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html).
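
To see why this layout is convenient, the sketch below builds a hypothetical parameter matrix for three features and two classes and scores two documents with a single matrix product. The numbers are made up; the leading `1` in each document row plays the same role as the prior column that the helper `test` method adds.

```python
import numpy as np

# Hypothetical parameter matrix for 3 features and 2 classes.
# Row 0 holds log priors; rows 1..3 hold log likelihoods per class.
params = np.log(np.array([[0.5, 0.5],    # priors
                          [0.6, 0.2],    # feature 1
                          [0.3, 0.3],    # feature 2
                          [0.1, 0.5]]))  # feature 3

# One row per document: a leading 1 switches on the prior row,
# the remaining entries are word counts.
X = np.array([[1, 2, 1, 0],
              [1, 0, 1, 1]])

scores = X @ params            # shape: (num_documents, num_classes)
predicted = scores.argmax(axis=1)
print(predicted)
```

Each entry of `scores` is the log prior plus the count-weighted sum of log likelihoods for one document and one class, so the argmax over columns is the Naive Bayes prediction.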

In [None]:
## helpful numpy code
import numpy as np
a = np.array([[0, 0, 1, 0],
              [1, 1, 2, 10],
              [4, 0, 0, 1],
              [1, 2, 0, 0]])

b = a / a.sum(axis=0)
b
# Probabilities must sum to one! (or very close)
print("sum to one:", b.sum(axis=0))


In [None]:
### helper code 
from abc import ABC
import numpy as np

class LinearClassifier(ABC):
    """
    General class for a linear classifier
    """

    def __init__(self):
        self.is_trained = False
        self.lab2idx = {} # mapping of each class label to an index
        self.feat2idx = {} # mapping of features to indices

    def train(self, X, y):
        """
        Estimates the model parameters
        """
        raise NotImplementedError

    def get_scores(self, x, w):
        """
        Computes the dot product between X and w (several instances at once)
        """
        return np.dot(x, w)

    def get_label(self, x, w):
        """
        Computes the label for each data instance
        """
        scores = np.dot(x, w)
        return np.argmax(scores, axis=1).transpose()

    def test(self, x, w):
        """
        Finds the most likely label for each instance
        """
        if not self.is_trained:
            raise RuntimeError("Please train the model first")
        idx2lab = {i: lab for lab, i in self.lab2idx.items()} # reverse mapping
        x_matrix = np.zeros((len(x),len(self.feat2idx)+1)) # add prior
        for i, inst in enumerate(x):
            # add prior: a constant 1 in column 0 selects the prior row of w
            x_matrix[i][0] = 1
            # likelihood
            for f in inst:
                if f in self.feat2idx: #otherwise ignore
                    fIdx = self.feat2idx[f]
                    x_matrix[i][fIdx] = inst[f]
               
        predicted_label_indices = self.get_label(x_matrix, w)
        return [idx2lab[i] for i in predicted_label_indices]

    def evaluate(self, gold, predicted):
        """
        Estimate accuracy of the predictions
        """
        correct = 0
        total = 0
        for g,p in zip(gold,predicted):
            if g == p:
                correct += 1
            total += 1
        return correct/total

In [None]:
import numpy as np

class NaiveBayes(LinearClassifier):

    def __init__(self):
        LinearClassifier.__init__(self)
        self.is_trained = False

    def train(self, X, y):
        print("Training a multinomial NB model")
        params = self.train_nb(X, y)
        self.is_trained = True
        return params
    
    ### deliverable 2.1
    def train_nb(self, X_train, y_train):
        # estimate the model parameters
        
        # this function should return the following matrix
        # 
        #   parameters = np.zeros((vocab_size+1, num_classes))
        #
        #   where
        #    - the first row [0] contains the prior (log probs) per class  parameters[0, i] = log p of c
        #    - and the remaining rows contain the per class likelihood parameters[1:, i]
        #
           
        num_classes = len(np.unique(y_train))
        features_train = aggregate_counts(X_train)
        vocab_size = len(features_train)
        print("{} classes, {} vocab size".format(num_classes, vocab_size))

        
        # instantiate mappers
        self.feat2idx = {f: i+1 for i,f in enumerate(features_train)}  # keep 0 reserved for prior 
        self.lab2idx = {l: i for i,l in enumerate(np.unique(y_train))}


        # your code here
        
        
        #return parameters
        raise NotImplementedError

### Evaluate the model on the worked toy example

In [None]:
nb = NaiveBayes()
# your code here


##  <font color='blue'>Task 3</font>: Evaluation and Analysis

**Deliverable 2.3**: Evaluate your model on the metal lyrics dev set. Create a confusion matrix. Which classes are often confused with each other?
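
One possible way to build a confusion matrix without extra libraries is to count (gold, predicted) pairs. The labels and predictions below are hypothetical; in the assignment they would come from the dev set and from your classifier.

```python
from collections import Counter

# Hypothetical gold and predicted labels.
gold = ["old", "old", "new", "new", "new", "mid"]
pred = ["old", "mid", "new", "new", "old", "mid"]

labels = sorted(set(gold) | set(pred))
pair_counts = Counter(zip(gold, pred))

# Rows are gold labels, columns are predicted labels.
matrix = [[pair_counts[(g, p)] for p in labels] for g in labels]

print(labels)
for g, row in zip(labels, matrix):
    print(g, row)
```

Correct predictions lie on the diagonal; large off-diagonal entries point to pairs of classes that the model confuses.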
    


In [None]:
# evaluate - deliverable 2.2 & deliverable 2.3



In [None]:
# plot confusion matrix