## Lyrics classification

*Hey man, every-every-everybody's talkin' about it, everybody's talkin' bout...*

In this milestone you will build a system that can automatically classify song lyrics by era.

* Do basic text processing, tokenizing your input and converting it into a bag-of-words representation
* Build a machine learning classifier based on the generative model, using Naive Bayes
* Build a machine learning classifier based on the discriminative model, using a logistic regression classifier implemented in Keras
* Evaluate your classifiers and examine what they have learned
* Implement techniques to improve your classifier.

Requested reading: Jurafsky & Martin, Chapters 4 and 5 (3rd edition)

We created a dataset for you that consists of song lyrics from *rock bands* spanning the last half century!

<img src="pics/cover.jpg">

*This assignment is adapted from material by J. Eisenstein.*




# 0. Setup

You will need [Python 3.6](https://www.python.org/downloads/) and the following libraries. Most, if not all, of them ship with [Anaconda](https://www.continuum.io/downloads), so installing it is a good starting point.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- numpy (installed together with scipy, for example via Anaconda as above; otherwise install it separately)
- [matplotlib](http://matplotlib.org/users/installing.html)
- [nosetests](https://nose.readthedocs.org/en/latest/)
- [pandas](http://pandas.pydata.org/) (for dataframes)


## About this assignment

- This is a Jupyter notebook. 
- Most of your coding will be in the python source files in the directory ```snlp```.
- The directory ```tests``` contains unit tests which you should run to see that you're on the right track. 
- You may want to add more tests, but that is completely optional. 
- Code locally, push to your git repository and consider running computationally heavier parts on the assigned compute server.

In [103]:
import sys
from importlib import reload

In [104]:
print('My Python version')

print('python: {}'.format(sys.version))

My Python version
python: 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [105]:
import nose

import pandas as pd
import numpy as np
import scipy as sp
import matplotlib
import matplotlib.pyplot as plt

import keras

%matplotlib inline

In [106]:
print('My library versions')

print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('scipy: {}'.format(sp.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('nose: {}'.format(nose.__version__))
print('keras: {}'.format(keras.__version__))

My library versions
pandas: 0.23.4
numpy: 1.15.4
scipy: 1.1.0
matplotlib: 3.0.2
nose: 1.3.7
keras: 2.2.4


To test whether your libraries are the right version, run:

In [107]:
# use ! to run shell commands in your notebook
! nosetests tests/test_environment.py

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_environment.test_library_versions ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


To avoid long debug messages when running in the shell:

In [22]:
! nosetests tests/test_environment.py --nologcapture --nocapture


nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_environment.test_library_versions ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


# 1. Preprocessing


Read the data into a dataframe

In [23]:
df_train = pd.read_csv('data/rock-lyrics-train.csv')

A dataframe is a structured representation of your data. You can preview a dataframe using `head()`

In [24]:
df_train.head()

Unnamed: 0,Era,Lyrics
0,2000s,Don't tell me what to think 'Cause I don't car...
1,2000s,Whenever the lights go down That's when she co...
2,2000s,You say this will be alright Just put my faith...
3,2000s,There's one who takes it all And there's one w...
4,2000s,So far away from knowing where I am going I am...


Explore the data. How many different classes are there? 

In [25]:
## your code
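
One possible way to approach this is sketched below on a tiny made-up dataframe rather than the real CSV. The column name `Era` is taken from the preview above; the rows and the names `df_toy`, `era_counts` and `num_classes` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_train; the real data comes from data/rock-lyrics-train.csv.
df_toy = pd.DataFrame({
    "Era": ["2000s", "1980s", "2000s", "pre-1980s", "1990s", "2000s"],
    "Lyrics": ["a b", "c d", "e f", "g h", "i j", "k l"],
})

# value_counts() gives one entry per distinct label, sorted by frequency.
era_counts = df_toy["Era"].value_counts()

# nunique() gives the number of distinct labels directly.
num_classes = df_toy["Era"].nunique()

print(era_counts)
print("number of classes:", num_classes)
```

Looking at the per-class counts, and not only at the number of classes, also tells you how balanced the dataset is.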

## Bags of words

Your first task is to convert the text to a bag-of-words representation. For this data, much of the preprocessing is already done: the text is lower-cased and punctuation is removed. You only need to create a `Counter` for each instance.

- **Deliverable 1.1**: Complete the function `snlp.preproc.bag_of_words`. 
- **Test**: `nosetests tests/test_preproc.py:test_d1_1_bow`
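
To make the idea concrete, here is a minimal sketch of a bag-of-words conversion. It assumes that whitespace tokenization is enough for this data; `bag_of_words_sketch` is a hypothetical name, and the real `snlp.preproc.bag_of_words` may be expected to behave differently, so check the unit test.

```python
from collections import Counter

def bag_of_words_sketch(text):
    """Split the text on whitespace and count each token."""
    return Counter(text.split())

bow = bag_of_words_sketch("love me do you know i love you")

# A Counter returns 0 for words it has never seen, without raising KeyError.
print(bow["love"], bow["you"], bow["missing"])
```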

In [29]:
from snlp import preproc

In [32]:
# run this block to update the notebook as you change the preproc library code
reload(preproc);

In [33]:
! nosetests tests/test_preproc.py:test_d1_1_bow --nologcapture --nocapture

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_preproc.test_d1_1_bow ... ok

----------------------------------------------------------------------
Ran 1 test in 0.282s

OK


In [34]:
y_tr,x_tr = preproc.read_data('data/rock-lyrics-train.csv',preprocessor=preproc.bag_of_words)
y_dv,x_dv = preproc.read_data('data/rock-lyrics-dev.csv',preprocessor=preproc.bag_of_words)

In [35]:
y_te,x_te = preproc.read_data('data/rock-lyrics-test-hidden.csv',preprocessor=preproc.bag_of_words)

## Unseen words

One challenge for classification is that words will appear in the test data that do not appear in the training data. Compute the number of words that appear in `rock-lyrics-dev.csv`, but not in `rock-lyrics-train.csv`. To do this, implement the following deliverables:

- **Deliverable 1.2**: implement `snlp.preproc.aggregate_counts`, a counter of all words in a list of bags-of-words. 
- **Tests**: `tests/test_preproc.py:test_d1_2_agg`, `tests/test_preproc.py:test_d1_3a_oov`
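
One way to aggregate is sketched below, assuming each bag-of-words is a `Counter`. The name `aggregate_counts_sketch` is hypothetical and this is not necessarily the intended solution.

```python
from collections import Counter

def aggregate_counts_sketch(bags_of_words):
    """Sum a list of Counters into one Counter over the whole corpus."""
    total = Counter()
    for bow in bags_of_words:
        # update() adds counts in place, so no new Counter is built per document.
        total.update(bow)
    return total

toy_bags = [Counter({"rock": 1, "roll": 2}), Counter({"rock": 3})]
print(aggregate_counts_sketch(toy_bags))
```

An alternative such as `sum(bags_of_words, Counter())` gives the same totals for positive counts but creates a fresh `Counter` at every addition, which tends to be noticeably slower on a large corpus.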

In [36]:
from collections import Counter

In [48]:
reload(preproc);

In [49]:
! nosetests tests/test_preproc.py:test_d1_2_agg --nologcapture --nocapture

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_preproc.test_d1_2_agg ... ok

----------------------------------------------------------------------
Ran 1 test in 0.464s

OK


To write fast code, you can find bottlenecks using the `%%timeit` cell magic.

Here I'm timing my implementation of `aggregate_counts`; rerun the cell after changing the code to compare alternative implementations.

In [50]:
%%timeit
preproc.aggregate_counts(x_tr)

152 ms ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [51]:
counts_dv = preproc.aggregate_counts(x_dv)

You can see the most common items in a counter by calling `counts.most_common()`:

In [52]:
counts_dv.most_common(5)

[('the', 4678), ('I', 4005), ('you', 3456), ('to', 2628), ('a', 2363)]

In [53]:
counts_tr = preproc.aggregate_counts(x_tr)

- **Deliverable 1.3**: implement `snlp.preproc.compute_oov`, returning a list of words that appear in one list of bags-of-words, but not another. 
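
A minimal sketch of the out-of-vocabulary computation follows. It assumes both arguments are aggregated counters and that the rate is measured over distinct words; the `_sketch` names are hypothetical, and the real `compute_oov` and `oov_rate` may differ in return type or definition.

```python
from collections import Counter

def compute_oov_sketch(bow1, bow2):
    """Return the words that occur in bow1 but never in bow2."""
    return sorted(set(bow1) - set(bow2))

def oov_rate_sketch(bow1, bow2):
    """Fraction of distinct words in bow1 that are missing from bow2."""
    return len(compute_oov_sketch(bow1, bow2)) / len(bow1)

toy_dev = Counter({"rock": 3, "roll": 1, "yeah": 2, "baby": 1})
toy_train = Counter({"rock": 5, "yeah": 1, "love": 4})

print(compute_oov_sketch(toy_dev, toy_train))
print(oov_rate_sketch(toy_dev, toy_train))
```

Note that the argument order matters: swapping the two counters answers a different question, namely which training words never occur in the dev set.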

In [54]:
! nosetests tests/test_preproc.py:test_d1_3a_oov --nologcapture --nocapture

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_preproc.test_d1_3a_oov ... ok

----------------------------------------------------------------------
Ran 1 test in 0.475s

OK


In [55]:
len(preproc.compute_oov(counts_dv,counts_tr))

3145

In [56]:
len(preproc.compute_oov(counts_tr,counts_dv))

33146

In [57]:
preproc.oov_rate(counts_dv,counts_tr)

0.2723651164804711

If all went well, you should see that about 27% of the distinct words in the dev set do not appear in the training set.

## Power laws

Word count distributions are said to follow [power law](https://en.wikipedia.org/wiki/Power_law) distributions. 

In practice, this means that a log-log plot of frequency against rank is nearly linear. Let's see if this holds for our data.

In [None]:
plt.loglog([val for word,val in counts_tr.most_common()])
plt.loglog([val for word,val in counts_dv.most_common()])
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend(['training set','dev set']);
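
To put a number on "nearly linear", you could fit a straight line to the log-log data and inspect its slope. The sketch below does this with ordinary least squares on synthetic Zipf-like counts; `loglog_slope` and `zipf_counts` are hypothetical names, and the synthetic data is not drawn from the lyrics corpus.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    All frequencies must be positive.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic counts with frequency proportional to 1 / rank.
zipf_counts = [1000.0 / rank for rank in range(1, 501)]
slope = loglog_slope(zipf_counts)
print(round(slope, 3))
```

On real data you could call `loglog_slope(counts_tr.values())`. A slope near -1 would correspond to the classic Zipf pattern, but real corpora usually deviate from it at the head and the tail of the distribution.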

What would this curve look like if it were not plotted in log space?

## Pruning the vocabulary

Let's prune the vocabulary to include only words that appear at least ten times in the training data.

- **Deliverable 1.4:** Implement `preproc.prune_vocabulary` 
- **Test**: `tests/test_preproc.py:test_d1_4_prune`
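
A sketch of the pruning step, assuming the argument order used in the calls below (training counts, data to prune, minimum count) and that "at least ten times" means `count >= min_counts`. The name `prune_vocabulary_sketch` is hypothetical.

```python
from collections import Counter

def prune_vocabulary_sketch(training_counts, target_data, min_counts):
    """Drop words seen fewer than min_counts times in training.

    Returns the pruned bags-of-words and the retained vocabulary.
    """
    vocab = {word for word, count in training_counts.items()
             if count >= min_counts}
    pruned = [Counter({word: count for word, count in bow.items()
                       if word in vocab})
              for bow in target_data]
    return pruned, vocab

toy_counts = Counter({"love": 12, "baby": 10, "zeppelin": 2})
toy_data = [Counter({"love": 1, "zeppelin": 1}), Counter({"baby": 2})]
toy_pruned, toy_vocab = prune_vocabulary_sketch(toy_counts, toy_data, 10)
print(toy_pruned, toy_vocab)
```

The vocabulary is always derived from the training counts, even when pruning the dev or test data, so that all three sets share the same feature space.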

In [61]:
reload(preproc);

In [62]:
! nosetests tests/test_preproc.py:test_d1_4_prune --nologcapture --nocapture

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_preproc.test_d1_4_prune ... ok

----------------------------------------------------------------------
Ran 1 test in 23.016s

OK


In [63]:
x_tr_pruned, vocab = preproc.prune_vocabulary(counts_tr,x_tr,10)
x_dv_pruned, _ = preproc.prune_vocabulary(counts_tr,x_dv,10)
x_te_pruned, _ = preproc.prune_vocabulary(counts_tr,x_te,10)

In [64]:
len(vocab)

5992

(Section 2 is skipped)

# 3. Naive Bayes

You'll now implement a Naive Bayes classifier.


In [74]:
from snlp import naive_bayes
reload(naive_bayes);

- **Deliverable 3.1**: (warmup) implement ```get_corpus_counts``` in ```naive_bayes.py```. 
- **Test**: `tests/test_classifier.py:test_d3_1_corpus_counts`

This function should compute the word counts for a given label.
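
As a rough illustration, the per-label counts can be collected by walking over instances and labels in parallel. This sketch assumes `x` is a list of `Counter` objects and `y` the matching list of labels; `get_corpus_counts_sketch` is a hypothetical name.

```python
from collections import Counter

def get_corpus_counts_sketch(x, y, label):
    """Aggregate word counts over all instances carrying the given label."""
    counts = Counter()
    for bow, bow_label in zip(x, y):
        if bow_label == label:
            counts.update(bow)
    return counts

toy_x = [Counter({"today": 2}), Counter({"today": 1, "night": 1}), Counter({"night": 4})]
toy_y = ["1980s", "1980s", "2000s"]
print(get_corpus_counts_sketch(toy_x, toy_y, "1980s"))
```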

In [75]:
! nosetests tests/test_classifier.py:test_d3_1_corpus_counts --nologcapture --nocapture

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_classifier.test_d3_1_corpus_counts ... ok

----------------------------------------------------------------------
Ran 1 test in 11.994s

OK


In [70]:
eighties_counts = naive_bayes.get_corpus_counts(x_tr_pruned,y_tr,"1980s");
print(eighties_counts['today'])
print(eighties_counts['yesterday'])

24
8


- **Deliverable 3.2**: Implement ```estimate_pxy``` in ```naive_bayes.py```. 
- **Test**: `tests/test_classifier.py:test_d3_2_pxy`

This function should compute the *smoothed* multinomial distribution $\log P(x \mid y)$ for a given label $y$. Smoothing here means adding pseudocounts, as shown in equation 2.3 of Jacob Eisenstein's [notes (page 22)](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf).

<img src="pics/eq23.png">

Hint: note that this function takes the vocabulary as an argument. You have to assign a probability even for words that do not appear in documents with label $y$, if they are in the vocabulary.

You can use ```get_corpus_counts``` in this function if you want to, but you don't have to.
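
The sketch below illustrates add-$\alpha$ smoothing in its usual form, $\log \frac{\alpha + \text{count}(y, j)}{V\alpha + \sum_{j'} \text{count}(y, j')}$, which I assume matches the pictured equation. The argument order follows the call used later in this notebook; `estimate_pxy_sketch` is a hypothetical name and not the reference solution.

```python
import math
from collections import Counter

def estimate_pxy_sketch(x, y, label, smoothing, vocab):
    """Smoothed log P(word | label) for every word in vocab.

    smoothing must be positive, otherwise unseen words get log(0).
    """
    counts = Counter()
    for bow, bow_label in zip(x, y):
        if bow_label == label:
            counts.update(bow)
    # Normalize over the vocabulary only, so the distribution sums to one.
    denom = sum(counts[word] for word in vocab) + smoothing * len(vocab)
    return {word: math.log(counts[word] + smoothing) - math.log(denom)
            for word in vocab}

toy_x = [Counter({"guitar": 2, "love": 1}), Counter({"synth": 3}), Counter({"love": 2})]
toy_y = ["1980s", "2000s", "1980s"]
toy_vocab = {"guitar", "love", "synth"}
toy_log_pxy = estimate_pxy_sketch(toy_x, toy_y, "1980s", 0.1, toy_vocab)
print(sum(math.exp(value) for value in toy_log_pxy.values()))
```

In this toy example "synth" never occurs with the label "1980s", yet it still receives a finite log-probability thanks to the pseudocount.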

In [76]:
reload(naive_bayes);

In [77]:
! nosetests tests/test_classifier.py:test_d3_2_pxy

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_classifier.test_d3_2_pxy ... ok

----------------------------------------------------------------------
Ran 1 test in 14.438s

OK


In [79]:
log_pxy = naive_bayes.estimate_pxy(x_tr_pruned,y_tr,"1980s",0.1,vocab)

The probabilities must sum to one (or very close to it).

In [80]:
sum(np.exp(list(log_pxy.values())))

0.999999999999961

Let's look at the log-probabilities of the words that appear in some hand-tuned weights.

In [83]:
from collections import defaultdict
from snlp import constants
reload(constants);

# weight vectors must be defaultdicts
theta_hand = defaultdict(float,
                         {('2000s','money'):0.1,
                          ('2000s','name'):0.2,
                          ('1980s','tonight'):0.1,
                          ('2000s','man'):0.1,
                          ('1990s','fly'):0.1,
                          ('pre-1980s',constants.OFFSET):0.1
                         })


In [84]:
print({word:log_pxy[word] for (_,word),weight in theta_hand.items() if weight>0})

{'money': -7.969973834156985, 'name': -7.377406759792738, 'tonight': -6.9382388281494505, 'man': -6.3519475704903305, 'fly': -8.360722957632635, '**OFFSET**': 0.0}


In [85]:
log_pxy_more_smooth = naive_bayes.estimate_pxy(x_tr_pruned,y_tr,"1980s",10,vocab)

In [86]:
print({word:log_pxy_more_smooth[word] for (_,word),weight in theta_hand.items() if weight>0})

{'money': -8.173461476943206, 'name': -7.679803656799581, 'tonight': -7.287410630258769, 'man': -6.740405349915276, 'fly': -8.46826101716385, '**OFFSET**': 0.0}


- **Deliverable 3.3**: Now you are ready to implement ```estimate_nb``` in ```naive_bayes.py```. 
- **Test**: `tests/test_classifier.py:test_d3_3a_nb`



- The goal is for the score returned by ```clf_base.predict``` to equal the log joint probability $\log P(x,y)$, as described in the notes.
- Don't forget the offset feature, whose weights should be set to the prior $\log P(y)$.
- The log-probabilities for the offset feature should not be smoothed.
- You can call the functions you have defined above, but you don't have to.
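
The points above can be sketched end to end as follows. This is an illustration under several assumptions: weights are stored in a `defaultdict` keyed by `(label, feature)` as in `theta_hand`, the offset key equals the `**OFFSET**` string printed earlier, and prediction is a weighted sum of counts. The `_sketch` functions are hypothetical and may not match `naive_bayes.estimate_nb` or `clf_base.predict` exactly.

```python
import math
from collections import Counter, defaultdict

OFFSET = "**OFFSET**"  # assumed to equal constants.OFFSET

def estimate_nb_sketch(x, y, smoothing):
    """Naive Bayes weights: smoothed log P(word | y) plus unsmoothed log P(y)."""
    vocab = {word for bow in x for word in bow}
    label_counts = Counter(y)
    weights = defaultdict(float)
    for label in label_counts:
        counts = Counter()
        for bow, bow_label in zip(x, y):
            if bow_label == label:
                counts.update(bow)
        denom = sum(counts.values()) + smoothing * len(vocab)
        for word in vocab:
            weights[(label, word)] = math.log((counts[word] + smoothing) / denom)
        # The prior goes on the offset feature and is not smoothed.
        weights[(label, OFFSET)] = math.log(label_counts[label] / len(y))
    return weights

def predict_sketch(bow, weights, labels):
    """Score each label by log P(y) + sum of count * log P(word | y)."""
    scores = {}
    for label in labels:
        score = weights.get((label, OFFSET), 0.0)
        for word, count in bow.items():
            score += weights.get((label, word), 0.0) * count
        scores[label] = score
    return max(scores, key=scores.get), scores

toy_x = [Counter("guitar solo guitar".split()),
         Counter("synth beat synth".split()),
         Counter("guitar riff".split())]
toy_y = ["pre-1980s", "1980s", "pre-1980s"]
toy_theta = estimate_nb_sketch(toy_x, toy_y, 0.1)
toy_label, toy_scores = predict_sketch(Counter({"synth": 2}), toy_theta, set(toy_y))
print(toy_label, toy_scores)
```

Because every score is a sum of log-probabilities, the values are large negative numbers; only their relative order matters for the prediction.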

In [90]:
reload(naive_bayes);

In [91]:
! nosetests tests/test_classifier.py:test_d3_3a_nb

nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
Using TensorFlow backend.
test_classifier.test_d3_3a_nb ... ok

----------------------------------------------------------------------
Ran 1 test in 22.656s

OK


In [92]:
theta_nb = naive_bayes.estimate_nb(x_tr_pruned,y_tr,0.1)

In [95]:
from snlp import clf_base
reload(clf_base)
labels = set(y_tr) #figure out all possible labels
print(labels)
clf_base.predict(x_tr_pruned[155],theta_nb,labels)

{'1980s', 'pre-1980s', '2000s', '1990s'}


('1990s',
 {'1980s': -1702.5481166437244,
  'pre-1980s': -1673.2432003414203,
  '2000s': -1664.5625204786743,
  '1990s': -1658.4478945049184})

Now let's see how good these weights are, by evaluating on the dev set.



In [98]:
from snlp import evaluation
reload(evaluation);

In [99]:
y_hat = clf_base.predict_all(x_dv_pruned,theta_nb,labels)
print(evaluation.acc(y_hat,y_dv))

0.5038910505836576
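
For reference, accuracy is simply the fraction of predictions that match the gold labels. The sketch below shows that computation along with a simple tally of (gold, predicted) pairs, which can help you see which eras get confused. Both function names are hypothetical and are not part of `snlp.evaluation`.

```python
from collections import Counter

def acc_sketch(y_hat, y):
    """Fraction of positions where the prediction equals the gold label."""
    if len(y_hat) != len(y):
        raise ValueError("predictions and gold labels differ in length")
    correct = sum(pred == gold for pred, gold in zip(y_hat, y))
    return correct / len(y)

def confusion_counts_sketch(y_hat, y):
    """Count how often each (gold, predicted) pair occurs."""
    return Counter(zip(y, y_hat))

toy_gold = ["1980s", "1990s", "2000s", "2000s"]
toy_pred = ["1980s", "2000s", "2000s", "1990s"]
print(acc_sketch(toy_pred, toy_gold))
print(confusion_counts_sketch(toy_pred, toy_gold))
```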


In [100]:
# this block shows how we write and read predictions for evaluation
evaluation.write_predictions(y_hat,'nb-dev.preds')
y_hat_dv = evaluation.read_predictions('nb-dev.preds')
evaluation.acc(y_hat_dv,y_dv)

0.5038910505836576

# Deliverables:
 
1. Create the simplest baseline you can come up with for this task. Compare the NB performance to that simple baseline. What do you observe?
2. Implement a neural model of your choice (e.g., CBOW/FNN) and compare it to the Naive Bayes classifier. Tune its parameters on the development data. How does the neural model compare to the traditional ML? (optional: implement another traditional classifier, e.g., Logistic Regression).
3. Elaborate on the choices you investigated. Discuss your findings in light of the results. What is particularly easy with this dataset? What is particularly difficult? Prepare a 5-minute presentation for the next class.
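
As a starting point for the first deliverable, one common choice of simple baseline is to always predict the most frequent training label. The sketch below is one such option, not the required one; `majority_baseline_sketch` and the toy labels are invented for illustration.

```python
from collections import Counter

def majority_baseline_sketch(y_train, n_predictions):
    """Predict the most frequent training label for every instance."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return [most_common_label] * n_predictions

toy_train_labels = ["2000s", "2000s", "1990s", "1980s"]
toy_dev_labels = ["2000s", "1990s", "2000s"]

toy_baseline = majority_baseline_sketch(toy_train_labels, len(toy_dev_labels))
toy_acc = sum(p == g for p, g in zip(toy_baseline, toy_dev_labels)) / len(toy_dev_labels)
print(toy_baseline, toy_acc)
```

The accuracy of such a baseline equals the share of the majority class in the evaluation data, which gives a floor against which the 0.50 dev accuracy of Naive Bayes can be judged.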
