# RNN for text classification and text generation
### Dr. Omri Allouche 2018. YData Deep Learning Course

[Open in Google Colab](https://colab.research.google.com/github/omriallouche/deep_learning_course/blob/master/DL_rnn_text_classification_generation.ipynb)

In the first part of this exercise, we’ll continue our attempts to classify text using different network architectures. This time, we’ll try an LSTM. We'll use the Metrolyrics dataset we used in the previous exercise.  

You are encouraged to review the code in [this](https://github.com/prakashpandey9/Text-Classification-Pytorch) repo, which contains implementations of several deep learning architectures for text classification in PyTorch. If you are short on time, you're welcome to adapt it to your needs instead of writing your own code from scratch.

In the second part of this exercise, you'll unleash the hidden creativity of your computer, by letting it generate Country songs (yeehaw!). You'll train a character-level RNN-based language model, and use it to generate new songs.


### Special Note
Our Deep Learning course was packed with both theory and practice. In a short time, you've got to learn the basics of deep learning theory and get hands-on experience training and using pretrained DL networks, while learning PyTorch.  
Past exercises required a lot of work, and hopefully gave you a sense of the challenges and difficulties one faces when using deep learning in the real world. While the investment you've made in the course so far is enormous, I strongly encourage you to take a stab at this exercise. 

DL networks for NLP are much shallower than those for image classification. It's possible to construct your own networks from scratch and achieve nice results. While I hope the theoretical foundations of RNNs are clear after our class sessions, getting your hands dirty with their implementation in PyTorch allows you to set breakpoints, watch the dimensions of the different layers and components, and get a much better understanding of the theory, in addition to code that might prove useful later in your own projects. 

I tried to provide references for all parts that walk you through a very similar task (actually, the same task on a different dataset). I expect this exercise to require much less of your time than previous exercises.

The exercise aims to help you better understand the concepts. I am not looking for optimal model performance or for extensive hyperparameter optimization. The task we face in this exercise, namely classifying a song’s genre from its text alone, is quite challenging, and we probably shouldn’t expect great results from our classifier. Don’t let this discourage you - not every task reaches an F1 score of 90%+. 

In fact, one of the reasons I chose this dataset is that it highlights some of the issues we face with machine learning models in the real world. Examples include:
- The classes are highly imbalanced - try to think how this affects the network learning
- Given the small amount of data for some classes, you might actually prefer to remove them from the dataset. How would you decide that?
- NLP tasks often involve preprocessing (lowercasing, tokenization, lemmatization, stopword removal etc.). The decision on the actual preprocessing pipeline depends on the task, and is often influenced by our beliefs about the data and by exploratory analysis of it. Thinking consciously about these questions helps you be a better data scientist
- Some songs contain no lyrics (for example, they just contain the text "instrumental"). Others include non-English characters. You'll often need to preprocess your data and make decisions as to what your network should actually get as input (think - how should you treat newline characters?)
- While model performance on this dataset is not amazing, we can try to answer interesting follow-up questions - which genres are more similar to each other and are often confused? Do genres become more similar through the years? ...
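
The first two points above can be explored with a few lines of plain Python. The sketch below is only an illustration under stated assumptions: the helper names `filter_rare_classes` and `inverse_frequency_weights`, the `min_count` threshold, and the toy genre list are all hypothetical, and inverse-frequency weighting is just one common way to handle imbalance, not something this exercise prescribes.

```python
from collections import Counter


def filter_rare_classes(labels, min_count):
    """Return the set of labels that occur at least `min_count` times."""
    counts = Counter(labels)
    return {label for label, n in counts.items() if n >= min_count}


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Toy genre column (hypothetical data, not the real Metrolyrics distribution).
genres = ["Rock"] * 6 + ["Pop"] * 3 + ["Folk"] * 1

kept = filter_rare_classes(genres, min_count=2)      # "Folk" is too rare
kept_genres = [g for g in genres if g in kept]
weights = inverse_frequency_weights(kept_genres)

print(sorted(kept))
print(weights)
```

Weights like these could then be converted to a tensor, ordered by class index, and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that mistakes on rare genres cost more than mistakes on Rock.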

More issues will probably pop up while you're working on this task. If you face technical difficulties or find a step in the process that takes too long, please let me know. It would also be great if you share with the class code you wrote that speeds up some of the work (for example, a data loader class, a parsed dataset etc.)

## RNN for Text Classification
In this section you'll write a text classifier using LSTM, to determine the genre of a song based on its lyrics.  
The code needed for this section should be very similar to code you've written for the previous exercise, and use the same dataset.  

In [0]:

%reset -f
import os
import sys

os.environ['PATH'] += ':/usr/local/cuda/bin'
sys.version

!pip3 install 'torch==0.4.0'
!pip3 install 'torchvision==0.2.1'
!pip3 install --no-cache-dir -I 'pillow==5.1.0'
!pip install 'livelossplot==0.2.2'
!pip install 'imageio==2.4.1'
!pip install 'torchnet==0.0.4'

print('done')
# Restart the kernel.
# This workaround is needed to properly upgrade PIL on Google Colab.
os._exit(0)

In [2]:
import sys
sys.path

['',
 '/env/python',
 '/usr/lib/python36.zip',
 '/usr/lib/python3.6',
 '/usr/lib/python3.6/lib-dynload',
 '/usr/local/lib/python3.6/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.6/dist-packages/IPython/extensions',
 '/root/.ipython']

In [3]:
import itertools
import os
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchnet
from imageio import imread
from livelossplot import PlotLosses
from PIL import Image
from skimage import io, transform
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
from torchnet.meter import ConfusionMeter
from torchvision import transforms

Using TensorFlow backend.


In [4]:
import nltk
%matplotlib inline
sns.set_style("darkgrid")
nltk.download('punkt')
nltk.download('stopwords')
use_cuda = torch.cuda.is_available()

SEED = 999
import random 
def fixSeed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if use_cuda:
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

fixSeed(SEED)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
from google.colab import files
files.upload()
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle/


Saving kaggle.json to kaggle.json
kaggle.json


In [5]:
!kaggle datasets download -d gyani95/380000-lyrics-from-metrolyrics


380000-lyrics-from-metrolyrics.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
!unzip -q 380000-lyrics-from-metrolyrics.zip -d data

replace data/lyrics.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [0]:
lyrics_df = pd.read_csv("data/lyrics.csv", usecols=['genre', 'lyrics'])


In [8]:
lyrics_df.describe(include = 'all')

Unnamed: 0,genre,lyrics
count,362237,266557
unique,12,244873
top,Rock,INSTRUMENTAL
freq,131377,1369


In [9]:
lyrics_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362237 entries, 0 to 362236
Data columns (total 2 columns):
genre     362237 non-null object
lyrics    266557 non-null object
dtypes: object(2)
memory usage: 5.5+ MB


In [10]:
lyrics_df = lyrics_df[lyrics_df.lyrics.notnull()]
lyrics_df.sample(10)

Unnamed: 0,genre,lyrics
343917,Rock,(corthon)\nGot lots of money\nGot lots of phon...
267582,Rock,It's no big surprise\nWe turned out this way\n...
309031,Hip-Hop,"I don't sweat no bitches, I only issue dick\nI..."
19899,Not Available,Me estÃ¡ gustando\nQue me des los buenos dÃ­as...
175151,Hip-Hop,What's so different?\nWhat's so different?\nIf...
113034,Not Available,"Hello cruel world, so this is you\nA broken he..."
201066,Jazz,"I thought I told you everything\nYou needed, n..."
311614,Metal,Mike and Susan have been together for nearly f...
156807,Rock,There's a specter in the corner of an illustra...
293961,Jazz,Heb je wel 'ns verlangd naar 't Tjeukemeer\nDa...


In [0]:
from nltk.tokenize import RegexpTokenizer

MIN_OCCURENCES = 5
UNKOWN_WORDS = '<UNK>'

texts = lyrics_df["lyrics"].tolist()
tokenizer = RegexpTokenizer(r'\w+')
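# Tokenize every lyric (lowercased) into a list of word lists.
tokenized_texts = [tokenizer.tokenize(text.lower()) for text in texts]
# Flatten into a single list of tokens and count them.
all_text = [token for tokens in tokenized_texts for token in tokens]
freq_dist = nltk.FreqDist(all_text)
# Words seen fewer than MIN_OCCURENCES times are dropped by cleanText;
# words seen exactly MIN_OCCURENCES times are replaced by the <UNK> token.
rare_words = {word for (word, count) in freq_dist.items() if count < MIN_OCCURENCES}
replace_words = {word for (word, count) in freq_dist.items() if count == MIN_OCCURENCES}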

In [0]:
import string 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

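# Built once, instead of on every call to cleanText.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)
ENGLISH_STOPWORDS = set(stopwords.words("english"))

def cleanText(text, rare_words=None, replace_words=None):
    """Tokenize a lyric and return its cleaned tokens, or None if nothing is left.

    Punctuation is stripped, tokens are lowercased and English stopwords are removed.
    Tokens in `rare_words` are dropped; tokens in `replace_words` become UNKOWN_WORDS.
    """
    text = text.translate(PUNCTUATION_TABLE)
    tokens = [w.lower() for w in word_tokenize(text)]

    words = [word for word in tokens
             if word not in ENGLISH_STOPWORDS
             and (rare_words is None or word not in rare_words)]

    if replace_words:
        words = [UNKOWN_WORDS if word in replace_words else word for word in words]

    return words if words else None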

In [0]:
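# Work on the first 1000 rows only; copy() avoids pandas' SettingWithCopyWarning
# when a new column is added to the subset in the next cell.
lyrics_df = lyrics_df[0:1000].copy()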

In [25]:
lyrics_df["clean_lyrics"] = lyrics_df["lyrics"].map(lambda text: cleanText(text, rare_words, replace_words))
lyrics_df = lyrics_df[lyrics_df.clean_lyrics.notnull()]
lyrics_df.sample(10)

Unnamed: 0,genre,lyrics,clean_lyrics
1085,Rock,Our father which art on Wall Street\nHonored b...,"[father, art, wall, street, thy, buck, thy, <U..."
1429,Metal,Somewhere in time there was a dream\nA dream I...,"[somewhere, time, dream, dream, felt, deep, in..."
269,Hip-Hop,This is a real life jack in progress. Nigga gi...,"[real, life, jack, progress, nigga, give, shit..."
1482,Not Available,(M. Detroit)\nMaybe it's my blood sugar\nMaybe...,"[detroit, maybe, blood, sugar, maybe, im, mad,..."
4,Pop,"Party the people, the people the party it's po...","[party, people, people, party, popping, sittin..."
77,Pop,[Verse 1]\nI'm in my penthouse half naked\nI c...,"[verse, 1, im, half, naked, naked, hell, one, ..."
822,Rock,I wish you well\nCouldn't you tell after all t...,"[wish, well, tell, years, wish, love, life, wo..."
1110,Rock,"Hey baby, be my dog, ooh\nAlice in my fantasie...","[hey, baby, dog, ooh, fantasies, uh, promised,..."
688,Jazz,It's only small town talk\nYou know how people...,"[small, town, talk, know, people, cant, stand,..."
468,Other,"Sen arardÄ±n beni cep telefonumdan\nArardÄ±n,a...","[sen, arardä±n, beni, cep, telefonumdan, arard..."
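
Before token lists like the ones in `clean_lyrics` can be fed to an embedding layer, they need to become fixed-length sequences of integer ids. The following is a minimal sketch of one way to do that, assuming id 0 is reserved for padding and id 1 for unknown words; the helper names `build_vocab` and `encode`, the `max_len` value, and the toy songs are hypothetical and not part of the exercise code.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(token_lists):
    """Map each token to an integer id; 0 is padding, 1 is unknown."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {PAD: 0, UNK: 1}
    for tok, _ in counts.most_common():
        # setdefault keeps id 1 if "<UNK>" already appears in the data.
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with 0."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Toy stand-in for the clean_lyrics column.
songs = [["party", "people", "party"], ["small", "town", "talk", "town"]]

vocab = build_vocab(songs)
batch = [encode(song, vocab, max_len=5) for song in songs]
print(batch)
```

Every row of `batch` has the same length, so the list can be turned into a `(batch_size, max_len)` tensor of word indices. Truncating long songs and padding short ones is a design choice; sorting songs by length into buckets is an alternative that wastes less computation on padding.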


In [8]:
# torch, torch.nn (as nn) and torch.nn.functional (as F) are imported in an earlier cell.

class LSTMClassifier(nn.Module):
    """LSTM text classifier: embedding look-up -> single-layer LSTM -> linear layer.

    Arguments
    ---------
    batch_size : Default batch size. Kept for compatibility with the original code;
        forward() infers the batch size from its input.
    output_size : Number of target classes (here, the number of genres)
    hidden_size : Size of the hidden state of the LSTM
    vocab_size : Size of the vocabulary (number of unique words)
    embedding_length : Dimension of the word embeddings
    weights : Pre-trained word embeddings (e.g. GloVe) of shape (vocab_size, embedding_length),
        used to initialize the embedding look-up table
    """

    def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
        super(LSTMClassifier, self).__init__()

        self.batch_size = batch_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.embedding_length = embedding_length

        self.word_embeddings = nn.Embedding(vocab_size, embedding_length)  # Look-up table
        # Use the pre-trained embeddings and keep them frozen during training.
        self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False)
        self.lstm = nn.LSTM(embedding_length, hidden_size)
        self.label = nn.Linear(hidden_size, output_size)

    def forward(self, input_sentence, batch_size=None):
        """
        Parameters
        ----------
        input_sentence : tensor of word indices, shape = (batch_size, num_sequences)
        batch_size : default = None, in which case it is inferred from input_sentence

        Returns
        -------
        Logits for each class, computed by the linear layer from the final hidden state of the LSTM.
        final_output.shape = (batch_size, output_size)
        """
        if batch_size is None:
            batch_size = input_sentence.size(0)

        # Map every index in the input sequence to its word vector.
        embedded = self.word_embeddings(input_sentence)  # (batch_size, num_sequences, embedding_length)
        embedded = embedded.permute(1, 0, 2)  # (num_sequences, batch_size, embedding_length)

        # Zero initial hidden and cell states, created on the same device as the
        # embeddings, so the model runs on both CPU and GPU.
        h_0 = embedded.new_zeros(1, batch_size, self.hidden_size)
        c_0 = embedded.new_zeros(1, batch_size, self.hidden_size)

        output, (final_hidden_state, final_cell_state) = self.lstm(embedded, (h_0, c_0))
        # final_hidden_state.size() = (1, batch_size, hidden_size)
        final_output = self.label(final_hidden_state[-1])  # (batch_size, output_size)

        return final_output
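
To watch the tensor dimensions described in the docstrings, it can help to run the three layers by hand on random data. The sketch below is a self-contained shape walk-through, not the training code for this exercise: all sizes (`vocab_size`, `num_genres`, and so on) are arbitrary assumptions, and the embeddings are randomly initialized rather than pre-trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary sizes chosen for illustration only.
vocab_size, embedding_length, hidden_size, num_genres = 50, 8, 16, 12
batch_size, seq_len = 4, 10

embedding = nn.Embedding(vocab_size, embedding_length)
lstm = nn.LSTM(embedding_length, hidden_size)  # expects (seq_len, batch, features)
classifier = nn.Linear(hidden_size, num_genres)

# A random batch of word indices, shape (batch_size, seq_len).
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

embedded = embedding(token_ids).permute(1, 0, 2)  # (seq_len, batch_size, embedding_length)
output, (h_n, c_n) = lstm(embedded)               # initial states default to zeros
logits = classifier(h_n[-1])                      # (batch_size, num_genres)

# Cross-entropy against random genre labels, just to check the shapes line up.
labels = torch.randint(0, num_genres, (batch_size,))
loss = nn.CrossEntropyLoss()(logits, labels)

print(embedded.shape, output.shape, h_n.shape, logits.shape, loss.item())
```

Note that `output` holds the hidden state at every time step, while `h_n` holds only the last one; the classifier above uses the last one, which is the same choice the `LSTMClassifier` makes.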
    




## RNN for Text Generation
In this section, we'll use an LSTM to generate new songs. You can pick any genre you like, or just use all genres. You can even try to generate songs in the style of a certain artist - remember that the Metrolyrics dataset contains the artist of each song. 

For this, we’ll first train a character-based language model. In class we mostly discussed using RNNs to predict the next word given the previous words, but as we mentioned, RNNs can also be used to learn sequences of characters.
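
To make the character-level setup concrete: the model reads a sequence of characters and, at every position, is trained to predict the character that follows. The sketch below shows one way to prepare such data, assuming fixed-length, non-overlapping chunks; the helper names `build_char_vocab` and `make_training_pairs`, the `seq_len` value, and the toy lyric are hypothetical.

```python
def build_char_vocab(text):
    """Assign an integer id to every distinct character in the text."""
    chars = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def make_training_pairs(text, char_to_idx, seq_len):
    """Slice the text into (input, target) id sequences.

    Each target is its input shifted one character to the right.
    """
    ids = [char_to_idx[ch] for ch in text]
    pairs = []
    for start in range(0, len(ids) - seq_len, seq_len):
        inputs = ids[start:start + seq_len]
        targets = ids[start + 1:start + seq_len + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy corpus; newlines are kept as ordinary characters so the model can learn line breaks.
lyrics = "yeehaw\nyeehaw\n"

char_to_idx, idx_to_char = build_char_vocab(lyrics)
pairs = make_training_pairs(lyrics, char_to_idx, seq_len=4)

for inputs, targets in pairs:
    print(repr("".join(idx_to_char[i] for i in inputs)), "->",
          repr("".join(idx_to_char[i] for i in targets)))
```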

First, please go through the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html) on generating family names. You can download a .py file or a jupyter notebook with the entire code of the tutorial. 

As a reminder of topics we've discussed in class, see Andrej Karpathy's popular blog post ["The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). You are also encouraged to view [this](https://gist.github.com/karpathy/d4dee566867f8291f086) vanilla implementation of a character-level RNN, written in numpy with just 100 lines of code, including the forward and backward passes.  

Other tutorials that might prove useful:
1. http://warmspringwinds.github.io/pytorch/rnns/2018/01/27/learning-to-generate-lyrics-and-music-with-recurrent-neural-networks/
1. https://github.com/mcleonard/pytorch-charRNN
1. https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb
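
Once a character-level model is trained, a song is generated one character at a time by sampling from the model's output distribution and feeding the sampled character back in. The sketch below shows only the sampling step, in plain Python. It assumes the model returns one unnormalized score (logit) per character, and it uses a temperature parameter, a common knob in character-RNN generation that this exercise does not require; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Sample the index of the next character from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits for a 3-character vocabulary.
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)

cold = softmax_with_temperature(logits, temperature=0.5)  # more conservative
hot = softmax_with_temperature(logits, temperature=2.0)   # more diverse
next_char_index = sample_index(logits, temperature=0.8, rng=rng)

print(cold, hot, next_char_index)
```

Low temperatures tend to produce repetitive but well-formed text, while high temperatures produce more surprising text with more spelling mistakes, so it is worth generating a few songs at different values.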

### Final Tips
As a final tip, I do encourage you to do most of the work first on your local machine. They say that Data Scientists spend 80% of their time cleaning the data and preparing it for training (and 20% complaining about cleaning the data and preparing it). Handling these parts on your local machine usually means you will spend less time complaining. You can switch to the cloud once your code runs and your pipeline is in place, for the actual training using a GPU.  

I also encourage you to use a small subset of the dataset first, so things run smoothly. The Metrolyrics dataset contains over 300k songs. You can start with a much smaller set (even 3,000 songs) and try to train a network on it. Once everything runs properly, add more data. 

Good luck!  
Omri