## Text Analysis Tutorial

Hello there - we'll be following this Jupyter Notebook for the tutorial. 
The purpose of this tutorial is to walk you through the different parts of the text analysis pipeline: getting hold of our data, cleaning and annotating it, and going all the way to swapping verbs in sentences and evaluating topic models.

We will not explore our textual data in depth, but rather in breadth, to give a taste of the different kinds of analysis we can do.

Our first step, naturally, is setting up our imports. We will be using spaCy for data pre-processing and computational linguistics, gensim for topic modelling, scikit-learn for classification, and Keras for text generation.
We will also use numpy and matplotlib for other parts of the tutorial.

### Imports

In [None]:
import gensim
import numpy as np
import spacy
from spacy import displacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import matplotlib.pyplot as plt
import sklearn
import keras

In [None]:
import warnings
import os
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now
%matplotlib inline

## Gathering Data

A huge part of text analysis is data collection - one of the initial goals of the tutorial was to walk the user through the process of cleaning messy Twitter data, or scraping data off the internet. While this remains an integral part of text analysis, a one-and-a-half-hour tutorial cannot do justice to both data collection and data analysis - so we will use two popular, readily available data-sets for the purpose of the tutorial.

Keep in mind that the main difference between using a standardised data-set and scraping your own data off the internet is that internet data is largely unstructured; this means we would spend a lot of time organising our data into a form that is easy to pre-process. The datasets we will be working with are the Lee corpus, which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF), and the [20NG dataset](http://qwone.com/~jason/20Newsgroups/). We will be performing different tasks with these two datasets, and will talk a little bit more about the datasets when we come across them.

Let us now get started with loading our first data-set, the Lee corpus, which we load using Gensim.

In [None]:
# the Lee corpus ships with gensim's test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:
    text = f.read()

## Cleaning Data

It is often said of Machine Learning and NLP algorithms: garbage in, garbage out. We can't expect state-of-the-art results without data that is just as good. Let's spend this section working on cleaning and understanding our data set.
NLTK is a popular choice for pre-processing, but it is rather [outdated](https://explosion.ai/blog/dead-code-should-be-buried), so we will be checking out spaCy, an industry-grade text-processing package.

spaCy uses language models similar to the one we just downloaded before starting this tutorial.

In [None]:
nlp = spacy.load("en")  # on spaCy 3+, shortcut names were removed; load a full package name such as "en_core_web_sm"

For good measure, let's add some stopwords. It's a newspaper corpus, so it is likely we will be coming across variations of 'said' and 'Mister' which will not really add any value to the topic models.


In [None]:
my_stop_words = [u'say', u'\'s', u'mr', u'be', u'said', u'says', u'saying', 'today']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

In [None]:
doc = nlp(text.lower())

Voila! With the `English` pipeline, all the heavy lifting has been done. Let's see what went on under the hood.

In [None]:
doc

## Computational Linguistics

Okay - now that we have our doc object, what exactly can we do with it?
We can see that the doc object now contains the entire corpus. This is important because we will be using this doc object to create our corpus for the machine learning algorithms. When creating a corpus for gensim/scikit-learn, we sometimes forget the incredible power which spaCy packs in its pipeline, so we will briefly demonstrate the same in this section with a smaller example sentence. Keep in mind that whatever we can do with a sentence, we can also just as well do with the entire corpus.

In [None]:
sent = nlp(u"Tom went to IKEA to get some of those delicious Swedish meatballs.")

Simple enough sentence, right? When we pass any kind of text through the spaCy pipeline, it becomes annotated. We will quickly have a look at the three most important capabilities which spaCy provides - POS-tagging, NER-tagging, and dependency parsing.

#### POS-Tagging

In [None]:
for token in sent:
    print(token.text, token.pos_, token.tag_)

#### NER-Tagging

In [None]:
for token in sent:
    print(token.text, token.ent_type_)

In [None]:
for ent in sent.ents:
    print(ent.text, ent.label_)

In [None]:
displacy.render(sent, style='ent', jupyter=True)

#### Dependency Parsing

In [None]:
for chunk in sent.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)


In [None]:
for token in sent:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])


In [None]:
displacy.render(sent, style='dep', jupyter=True)

In [None]:
for w in nlp("Bhargav an absolute knocker, the ball is a sun."):
    print(w.text, w.is_stop, w.lemma_)

This is just an example of the kind of annotations spaCy adds when it runs any text through its pipeline. We will see in the very next section that spaCy has a bunch of other information as well, such as whether a token is a number or not, stop-word or not, and other information which comes in very handy when pre-processing text. 

## Continuing Cleaning

Have a quick look at the output of the doc object. It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus. You can check out what a gensim corpus looks like [here](https://radimrehurek.com/gensim/tut1.html).

In [None]:
# build one list of lemmas per document (documents are separated by newlines)
texts, article = [], []
for w in doc:
    # keep the token only if it is not a newline, stop word, punctuation mark or number
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []

In [None]:
texts

And this is the magic of spaCy - just like that, we've managed to get rid of stopwords and punctuation marks, and kept the lemmatized form of each word.

Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly.

In [None]:
bigram = gensim.models.Phrases(texts)

In [None]:
texts = [bigram[line] for line in texts]

In [None]:
texts[0]
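
To get a feel for what a bigram model does, here is a simplified plain-Python sketch. It scores adjacent word pairs with a count-based formula in the spirit of the one gensim's `Phrases` uses by default, and then merges the pairs which score above a threshold. The function names, the toy documents and the `min_count` and `threshold` values are our own assumptions for illustration; gensim's defaults are much stricter, and its implementation has more moving parts.

```python
from collections import Counter


def find_bigrams(docs, min_count=1, threshold=1.0):
    """Score adjacent word pairs; keep those that co-occur unusually often."""
    unigrams = Counter(word for doc in docs for word in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in pairs.items():
        # frequent pairs made of otherwise rare words get a high score
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores


def merge_bigrams(doc, bigrams, sep="_"):
    """Replace every detected pair in a document by a single joined token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            merged.append(doc[i] + sep + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


# toy corpus, made up for this example
toy_docs = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "parks"],
    ["parks", "are", "big"],
]

bigrams = find_bigrams(toy_docs)
merged = merge_bigrams(toy_docs[0], bigrams)
print(bigrams)
print(merged)
```

Only ('new', 'york') appears more than once here, so it is the only pair that gets merged.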

In [None]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
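
If the bag-of-words format is new to you, the following plain-Python sketch mimics what the `Dictionary` and `doc2bow` step produces: every document becomes a list of `(token_id, count)` pairs. The helper names and toy documents are hypothetical, and the ids gensim assigns to tokens may well differ from the ones below.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Turn a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# toy documents, made up for this example
toy_docs = [["cat", "sat", "cat"], ["dog", "sat"]]
token2id = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Note that word order is lost in this representation; only the counts survive.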

## Topic Modelling

Topic modelling refers to the probabilistic modelling of text documents as mixtures of topics. Gensim is one of the most popular libraries for such modelling, and we will be using it to build our topic models.

LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm out there. Here we create a simple topic model with 10 topics. This is where the corpus we created earlier comes in handy.

In [None]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [None]:
ldamodel[corpus[88]]

The output lists the topics the model assigns to document 88 of our corpus, along with their probabilities. This is a great way to get a view of what kind of topics might be present in our documents. For more details, such as the other topic models which Gensim provides, as well as ways to measure topic coherence (performance), and visualisation, the topic modelling notebook in the same directory will serve as a good resource.
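
The result of `ldamodel[corpus[88]]` is a list of `(topic_id, probability)` pairs. As a small sketch, with made-up numbers standing in for real model output, this is how we could pick out the dominant topic of a document. The `dominant_topic` helper is our own, not part of gensim.

```python
# hypothetical output of ldamodel[some_document]; real values will differ
doc_topics = [(1, 0.12), (4, 0.80), (7, 0.08)]


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


topic_id, prob = dominant_topic(doc_topics)
print(topic_id, prob)
```

Topics with a very low probability are typically left out of the list, so the probabilities need not sum to exactly 1.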

## Text Classification

In the previous example, we worked with unlabelled, unstructured data. Classification is a machine learning task which is quite different from the previous examples because we are dealing with labelled data, and we know what classes we want to put our documents into - we are not discovering topics or classes.

For such an example, we would need to use a labelled data-set, and in our case we will be using the previously mentioned 20NG dataset.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# the four newsgroups we will classify
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

In [None]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                             shuffle=True, random_state=42)
n_components = 5
labels = data_train.target
true_k = np.unique(labels).shape[0]

# convert to TF-IDF format
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english', use_idf=True)
X_train = vectorizer.fit_transform(data_train.data)

# Reduce dimensions
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X_train = lsa.fit_transform(X_train)

In [None]:
# order of labels in `target_names` can be different from `categories`
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42)

target_names = data_train.target_names
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

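print("Extracting features from the test data using the same vectorizer")
X_test = vectorizer.transform(data_test.data)
# reuse the SVD fitted on the training data; refitting on the test set would put
# the test features in a different space from the one the classifiers are trained on
X_test = lsa.transform(X_test)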

Take a minute to note the pre-processing steps we used above - it is less transparent than our method with spaCy, but it is still important to know and to be able to use the scikit-learn modules for the same. 
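
To make the TF-IDF step a little more transparent, here is a plain-Python sketch of the weighting on already tokenised documents. It uses a smoothed inverse document frequency followed by L2 normalisation, which, to the best of our knowledge, mirrors scikit-learn's default settings. It ignores tokenisation, `max_df`, `min_df` and stop words, and the `tfidf` function and toy documents are our own assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency: rare terms get a larger weight
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise so that long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# toy documents, made up for this example
toy_docs = [["space", "shuttle", "launch"], ["graphics", "card", "launch"]]
vectors = tfidf(toy_docs)
print(vectors[0])
```

The word 'launch' occurs in both documents, so it ends up with a lower weight than 'space' or 'graphics', which are specific to one document each.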

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb = GaussianNB()
y_pred_NB = gnb.fit(X_train, y_train).predict(X_test)

In [None]:
y_pred_NB

In [None]:
from sklearn.svm import SVC

In [None]:
svm = SVC()
y_pred_SVM = svm.fit(X_train, y_train).predict(X_test) 
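
The cells above stop at the predictions. A simple way to compare two classifiers is accuracy, the fraction of documents whose predicted label matches the true one. Below is a plain-Python sketch with made-up labels; with the real arrays you would pass `y_test` together with `y_pred_NB` or `y_pred_SVM`, and scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# hypothetical labels for six documents and four classes
toy_true = [0, 1, 2, 3, 1, 0]
toy_pred = [0, 1, 2, 1, 1, 3]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```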

## Deep Learning

The final bit of our tutorial will explore the ideas of neural networks, and using RNNs to generate text.

A Recurrent Neural Network goes one step further than a plain feed-forward network because of its ability to remember context: at every time step it combines the current input with a hidden state carried over from the previous steps, and this additional context allows it to perform better on sequences. We will be using a particular variant of an RNN called LSTM, or Long Short-Term Memory - as the name suggests, it has a short-term memory which can last for a long period of time. Whenever there is a significant time-lag between related inputs, LSTMs tend to perform well - considering the nature of language, where a word which appears later in a sentence is influenced by the context of the sentence, this property becomes important. For a more detailed explanation of the mathematics or intuition behind LSTMs and RNNs, the following blog posts can be very useful:

[Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
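
To make the idea of carried-over context concrete, here is a minimal sketch of a plain recurrent step (not an LSTM) with a single scalar hidden state. The weights are arbitrary values chosen for illustration, not trained ones, and the `rnn_forward` function is our own.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent network over a sequence; return every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states


with_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])
print(with_signal)
print(without_signal)
```

Although only the first input is non-zero, the later hidden states in `with_signal` are still non-zero: the network remembers what it saw earlier, even if that memory fades with every step. LSTMs add gates to control how quickly this fading happens.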

For this part of the tutorial, we will be using the code written by my good friend Kirit Thadaka - you can find the original code over on this [GitHub repository](https://github.com/kirit93/Personal/blob/master/text_generation_keras/text_generation.ipynb).

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
SEQ_LENGTH = 100

In [None]:
test_x = np.array([1, 2, 0, 4, 3, 7, 10])
# one hot encoding
test_y = np_utils.to_categorical(test_x)
print(test_x)
print(test_y)
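
If one-hot encoding is unfamiliar, the plain-Python sketch below shows the idea behind `to_categorical`: each integer label becomes a row of zeros with a single 1 at the label's index. The `one_hot` helper is our own and returns nested lists rather than a numpy array.

```python
def one_hot(labels, num_classes=None):
    """Encode integer labels as rows with a single 1.0 at the label's index."""
    if num_classes is None:
        # one column per possible label, from 0 up to the largest label seen
        num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


encoded = one_hot([1, 2, 0, 4, 3, 7, 10])
for row in encoded:
    print(row)
```

The largest label is 10, so every row has 11 columns, even though only seven distinct labels occur.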

In [None]:
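# Using keras functional model
from keras.layers import Input
from keras.models import Model

def create_functional_model(n_layers, input_shape, hidden_dim, n_out, **kwargs):
    drop        = kwargs.get('drop_rate', 0.2)
    activ       = kwargs.get('activation', 'softmax')
    mode        = kwargs.get('mode', 'train')
    hidden_dim  = int(hidden_dim)

    inputs      = Input(shape = (input_shape[1], input_shape[2]))
    # return_sequences = True makes the LSTM emit one output per time step
    model       = LSTM(hidden_dim, return_sequences = True)(inputs)
    model       = Dropout(drop)(model)
    outputs     = Dense(n_out, activation = activ)(model)

    # wrap the layer graph into a model and return it
    return Model(inputs = inputs, outputs = outputs)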


In [None]:
# Using keras sequential model
def create_model(n_layers, input_shape, hidden_dim, n_out, **kwargs):
    drop        = kwargs.get('drop_rate', 0.2)
    activ       = kwargs.get('activation', 'softmax')
    mode        = kwargs.get('mode', 'train')
    hidden_dim  = int(hidden_dim)
    model       = Sequential()
    flag        = True 

    if n_layers == 1:   
        model.add( LSTM(hidden_dim, input_shape = (input_shape[1], input_shape[2])) )
        if mode == 'train':
            model.add( Dropout(drop) )

    else:
        model.add( LSTM(hidden_dim, input_shape = (input_shape[1], input_shape[2]), return_sequences = True) )
        if mode == 'train':
            model.add( Dropout(drop) )
        for i in range(n_layers - 2):
            model.add( LSTM(hidden_dim, return_sequences = True) )
            if mode == 'train':
                model.add( Dropout(drop) )
        model.add( LSTM(hidden_dim) )

    model.add( Dense(n_out, activation = activ) )

    return model

In [None]:
def train(model, X, Y, n_epochs, b_size, vocab_size, **kwargs):    
    loss            = kwargs.get('loss', 'categorical_crossentropy')
    opt             = kwargs.get('optimizer', 'adam')
    
    model.compile(loss = loss, optimizer = opt)

    os.makedirs("Weights", exist_ok = True)  # make sure the checkpoint directory exists
    filepath        = "Weights/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
    checkpoint      = ModelCheckpoint(filepath, monitor = 'loss', verbose = 1, save_best_only = True, mode = 'min')
    callbacks_list  = [checkpoint]
    X               = X / float(vocab_size)
    model.fit(X, Y, epochs = n_epochs, batch_size = b_size, callbacks = callbacks_list)

The fit function will run over the input batch-wise for n_epochs epochs, and the callback saves the weights to a file whenever the loss improves.

After the training is done or once you find a loss that you are happy with, you can test how well the model generates text.
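
The "save only on improvement" behaviour of the checkpoint callback boils down to tracking the best loss seen so far. Here is a plain-Python sketch of that logic with made-up loss values; the `epochs_that_save` function is hypothetical and only mimics the bookkeeping, it does not touch Keras or write any files.

```python
def epochs_that_save(losses):
    """Return the 1-based epochs at which the loss beats every earlier epoch."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # a new best loss: this is when a checkpoint would be written
            best = loss
            saved.append(epoch)
    return saved


# hypothetical per-epoch training losses
toy_losses = [4.31, 4.02, 4.10, 3.95]
saved_epochs = epochs_that_save(toy_losses)
print(saved_epochs)
```

Epoch 3 is skipped because its loss is worse than the one from epoch 2.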

In [None]:
def generate_text(model, X, filename, ix_to_char, vocab_size):
    
    # Load the weights from the epoch with the least loss
    model.load_weights(filename)
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

    start   = np.random.randint(0, len(X) - 1)
    pattern = np.ravel(X[start]).tolist()

    # We seed the model with a random sequence of 100 so it can start predicting
    print ("Seed:")
    print ("\"", ''.join([ix_to_char[value] for value in pattern]), "\"")
    output = []
    for i in range(250):
        x           = np.reshape(pattern, (1, len(pattern), 1))
        x           = x / float(vocab_size)
        prediction  = model.predict(x, verbose = 0)
        index       = np.argmax(prediction)
        result      = index
        output.append(result)
        pattern.append(index)
        pattern = pattern[1 : len(pattern)]

    print("Predictions")
    print ("\"", ''.join([ix_to_char[value] for value in output]), "\"")
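
The heart of `generate_text` is a sliding window: predict the next character, append it, and drop the oldest one. The sketch below isolates that loop, with a trivial stand-in function in place of the trained model. Both `generate` and `toy_predict` are hypothetical names used only for this illustration.

```python
def generate(predict_next, seed, n_steps):
    """Greedy generation: feed each prediction back in as the newest input."""
    window = list(seed)
    output = []
    for _ in range(n_steps):
        next_index = predict_next(window)
        output.append(next_index)
        # slide the window: drop the oldest item, append the prediction
        window = window[1:] + [next_index]
    return output


def toy_predict(window):
    """Stand-in for the model: always predicts (last item + 1) modulo 5."""
    return (window[-1] + 1) % 5


generated = generate(toy_predict, seed=[0, 1, 2], n_steps=4)
print(generated)
```

In the real function, `predict_next` corresponds to calling `model.predict` on the scaled window and taking the `argmax` of the result.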

In [None]:
# filename    = 'data/game_of_thrones.txt'
# data        = open(filename).read()
# data        = data.lower()
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
data = open(lee_train_file).read()
# Find all the unique characters
chars       = sorted(list(set(data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
ix_to_char  = dict((i, c) for i, c in enumerate(chars))
vocab_size  = len(chars)

print("List of unique characters : \n", chars)

print("Number of unique characters : \n", vocab_size)

print("Character to integer mapping : \n", char_to_int)

In [None]:
list_X      = []
list_Y      = []

# Python append is faster than numpy append. Try it!
for i in range(0, len(data) - SEQ_LENGTH, 1):
    seq_in  = data[i : i + SEQ_LENGTH]
    seq_out = data[i + SEQ_LENGTH]
    list_X.append([char_to_int[char] for char in seq_in])
    list_Y.append(char_to_int[seq_out])

n_patterns  = len(list_X)
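
To see what these training patterns look like, here is the same windowing applied to a tiny string with a window of 4 characters instead of 100. The `make_patterns` helper and the toy text are our own; the characters are kept as text here rather than mapped to integers, to keep the output readable.

```python
def make_patterns(data, seq_length):
    """Pair every window of seq_length characters with the character that follows it."""
    patterns = []
    for i in range(len(data) - seq_length):
        seq_in = data[i : i + seq_length]
        seq_out = data[i + seq_length]
        patterns.append((seq_in, seq_out))
    return patterns


toy_patterns = make_patterns("hello world", 4)
for seq_in, seq_out in toy_patterns:
    print(repr(seq_in), "->", repr(seq_out))
```

A text of 11 characters with a window of 4 gives 7 patterns, which is why `n_patterns` above is `len(data) - SEQ_LENGTH`.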

In [None]:
X           = np.reshape(list_X, (n_patterns, SEQ_LENGTH, 1)) # (n, 100, 1)
# Encode output as one-hot vector
Y           = np_utils.to_categorical(list_Y)

In [None]:
model   = create_model(1, X.shape, 256, Y.shape[1], mode = 'train')

In [None]:
train(model, X[:1024], Y[:1024], 2, 512, vocab_size)


In [None]:
# the loss in the file name depends on your run; point this at a file that exists in Weights/
generate_text(model, X, "Weights/weights-improvement-01-4.3050.hdf5", ix_to_char, vocab_size)

In [None]:
generate_text(model, X, "Weights/weights-improvement-02-4.0229.hdf5", ix_to_char, vocab_size)
