# Naive demonstration of a Markov Language Model 

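We build a model that computes the probability of a word sequence under a Markov assumption of order $m$, where each word depends only on the $m$ words that precede it:
$$
P(w_1, \dots, w_n) = \prod\limits_{i=1}^{n} P(w_i \mid w_{i-m}, \dots, w_{i-1})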
$$
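
As a minimal sketch of what this product means in practice, the snippet below estimates the conditional probabilities from raw counts on a toy corpus with $m = 1$. It is not the `NaiveMarkovLM` implementation, whose internals are not shown in this notebook; `toy_corpus`, `cond_prob` and `joint_prob` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical data for illustration only).
toy_corpus = [
    ["i", "am", "sure"],
    ["i", "am", "here"],
    ["you", "are", "sure"],
]

START = "#S"  # start-of-sentence marker, mirroring the '#S' symbol used later

# Count bigrams (previous word, word) and how often each context occurs.
bigram_counts = Counter()
context_counts = Counter()
for sentence in toy_corpus:
    padded = [START] + sentence
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def cond_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def joint_prob(sentence):
    """P(w_1, ..., w_n) as the product of first-order conditionals (m = 1)."""
    p = 1.0
    padded = [START] + sentence
    for prev, word in zip(padded, padded[1:]):
        p *= cond_prob(word, prev)
    return p


print(joint_prob(["i", "am", "sure"]))
```

Here the product is $P(i \mid \#S) \cdot P(am \mid i) \cdot P(sure \mid am) = \tfrac{2}{3} \cdot 1 \cdot \tfrac{1}{2}$. Without smoothing, any unseen bigram drives the whole product to zero, which is one reason practical models work with smoothed estimates and log-probabilities.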

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
import numpy as np 
import pandas as pd 

### Gather some data

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
download_dir = '/Users/flint/Data/sklearn/'
subset = 'all'
remove = ('headers', 'footers', 'quotes')
data = fetch_20newsgroups(data_home=download_dir, subset=subset, remove=remove)

In [4]:
print(data.data[0][:250], '...')
print('--------')
print(data.target[0], data.target_names[data.target[0]])



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' re ...
--------
10 rec.sport.hockey


### Training the LM

In [5]:
import sys 
sys.path.append('../nlp/')
from nlp.markovlm import NaiveMarkovLM

In [6]:
lm = NaiveMarkovLM(n=3)

In [7]:
lm.train(data.data[:10])

In [8]:
lm.index.T.head()

Unnamed: 0,Unnamed: 1,i,actually,however,man,jagr,he,bowman,pens,!,my,...,made,3-4,flaming,wings,pizza,hut,commercial,tlu/a,gic,bait
#S,#S,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#S,i,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
i,am,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
am,sure,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sure,some,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Compute conditional probabilities

In [9]:
seq = ('i', 'am', 'alfio')
print(lm.P(*seq, log=True))
seq = ('i', 'am', 'sure')
print(lm.P(*seq, log=True))

-1.791759469228055
-1.252762968495368
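
To read these numbers as probabilities, they can be exponentiated. The sketch below assumes that `log=True` returns natural logarithms, which is not confirmed by the notebook since the `NaiveMarkovLM` source is not shown.

```python
import math

# Log-probabilities printed above, assumed to be natural logarithms.
log_p_alfio = -1.791759469228055  # ('i', 'am', 'alfio')
log_p_sure = -1.252762968495368   # ('i', 'am', 'sure')

p_alfio = math.exp(log_p_alfio)
p_sure = math.exp(log_p_sure)

print(round(p_alfio, 4), round(p_sure, 4))
```

Under that assumption the two values correspond to 1/6 and 2/7. The unseen word `alfio` still receives a non-zero probability, which suggests that the model applies some form of smoothing or fallback; this is an inference from the output, not a documented property.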


**Text probability**

In [10]:
text = sent_tokenize(data.data[0])[0].replace('\n', '')
print(text)
print(lm.joint_log_probability(text))

I am sure some bashers of Pens fans are pretty confused about the lackof any kind of posts about the recent Pens massacre of the Devils.
-4.882801922586371


#### Generate text

In [11]:
prefix = ('#S', '#S')
text = list(lm.generate(prefix=prefix, max_len=20))
print(" ".join(text))

#S #S ! #E
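
Generation of this kind can be sketched as repeated sampling of the next word given the last two words, stopping at the end marker or at `max_len`. The code below is an illustrative stand-in built on hypothetical trigram counts, not the `generate` method of `NaiveMarkovLM`; the `rng` parameter is an addition that makes runs reproducible.

```python
import random
from collections import Counter, defaultdict

START, END = "#S", "#E"

# Hypothetical trigram counts: context (w1, w2) -> Counter of next words.
counts = defaultdict(Counter)
for tokens in [["i", "am", "sure", "."], ["i", "am", "here", "."]]:
    padded = [START, START] + tokens + [END]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1


def generate(prefix=(START, START), max_len=20, rng=None):
    """Yield the prefix, then sample words until END or max_len tokens."""
    rng = rng or random.Random()
    history = list(prefix)
    yield from history
    while len(history) < max_len:
        options = counts.get(tuple(history[-2:]))
        if not options:  # unseen context: nothing to sample from
            break
        words = list(options.keys())
        weights = list(options.values())
        # Sample the next word in proportion to its count in this context.
        word = rng.choices(words, weights=weights, k=1)[0]
        history.append(word)
        yield word
        if word == END:
            break


print(" ".join(generate(rng=random.Random(0))))
```

With so little training data, a sampled sequence can reach `#E` almost immediately, as in the `#S #S ! #E` output above, where the model was trained on only ten documents.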


## Applications

### Classification
1. we train a general model over the whole corpus
2. we clone the general model into one specific model per class
3. we fine-tune each class-specific model on the documents of its class
4. given a text, we compute its probability under each class model and select the class whose model assigns the highest probability
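
The four steps can be sketched end to end on a toy corpus. The class below is a hypothetical stand-in for `NaiveMarkovLM`: it uses bigrams with add-one smoothing and whitespace tokenization, which are simplifications chosen for brevity and not necessarily what the real model does. Only the method names `train`, `clone` and `joint_log_probability` follow the notebook.

```python
import copy
import math
from collections import Counter


class TinyBigramLM:
    """Hypothetical stand-in for NaiveMarkovLM: bigram counts, add-one smoothing."""

    def __init__(self):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    def train(self, documents):
        # Training only adds counts, so calling it again fine-tunes the model.
        for doc in documents:
            tokens = ["#S"] + doc.lower().split() + ["#E"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1

    def clone(self):
        return copy.deepcopy(self)

    def joint_log_probability(self, text):
        tokens = ["#S"] + text.lower().split() + ["#E"]
        v = len(self.vocab) + 1  # +1 leaves room for unseen words
        return sum(
            math.log((self.bigrams[(p, w)] + 1) / (self.contexts[p] + v))
            for p, w in zip(tokens, tokens[1:])
        )


# Hypothetical two-class corpus.
corpus = {
    "hockey": ["the pens won the game", "the devils lost the game"],
    "graphics": ["the image has many pixels", "render the image faster"],
}

# Step 1: general model over the whole corpus.
general = TinyBigramLM()
general.train([doc for docs in corpus.values() for doc in docs])

# Steps 2 and 3: clone once per class, then fine-tune each clone on its class.
class_models = {}
for label, docs in corpus.items():
    class_models[label] = general.clone()
    class_models[label].train(docs)

# Step 4: score a text with every class model and keep the best.
text = "the pens won"
scores = {label: m.joint_log_probability(text) for label, m in class_models.items()}
print(max(scores, key=scores.get))
```

Because every class model starts from the same general counts, fine-tuning only shifts probability mass toward the n-grams of its own class, so the differences between class scores can be small compared with the scores themselves.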

In [13]:
global_lm = NaiveMarkovLM(n=3)
global_lm.train(documents=data.data)

In [14]:
print(list(global_lm.generate(max_len=20)))

['#S', '#S', 'if', 'you', 'are', "n't", 'running', '.', '#E']


**Cloning and fine-tuning**

In [17]:
from tqdm.notebook import tqdm

In [19]:
class_models = {}
# Test with three classes only to speed up the process
run = list(enumerate(data.target_names[:3]))
for i, label in tqdm(run):
    class_docs = [data.data[j] for j, k in enumerate(data.target) if k==i]
    class_models[label] = global_lm.clone()
    # fine-tuning: continue training the clone on the documents of this class
    class_models[label].train(class_docs)


In [20]:
print(class_models.keys())

dict_keys(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc'])


**Classification**

In [27]:
count = 0
for i, doc in enumerate(data.data):
    label = data.target_names[data.target[i]]
    if label in class_models:
        print("Correct class: ", label)
        print("Prediction")
        for pred, clm in class_models.items():
            log_p = clm.joint_log_probability(doc)
            print(pred, log_p)
        count += 1
        print("===========")
    if count > 10:
        break 

Correct class:  alt.atheism
Prediction
alt.atheism -137.80277922199454
comp.graphics -138.07748550812423
comp.os.ms-windows.misc -138.582829631957
Correct class:  comp.graphics
Prediction
alt.atheism -1029.8680763235138
comp.graphics -1034.1258635539702
comp.os.ms-windows.misc -1045.4617026153967
Correct class:  comp.graphics
Prediction
alt.atheism -201.77446261715642
comp.graphics -202.0126661968555
comp.os.ms-windows.misc -205.37547148798566
Correct class:  alt.atheism
Prediction
alt.atheism -2402.879511230015
comp.graphics -2413.6960941956645
comp.os.ms-windows.misc -2433.8188997727807
Correct class:  comp.graphics
Prediction
alt.atheism -434.15892132454286
comp.graphics -435.72816410439407
comp.os.ms-windows.misc -441.47625707730936
Correct class:  comp.os.ms-windows.misc
Prediction
alt.atheism -664.4213946267886
comp.graphics -667.7872258557901
comp.os.ms-windows.misc -674.5441184644216
Correct class:  comp.os.ms-windows.misc
Prediction
alt.atheism -330.9984957632746
comp.graphics
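
To turn these scores into predictions, the class with the highest joint log-probability is selected for each document. The sketch below does this for the six documents whose scores are fully printed above (values rounded to two decimals, the truncated last document left out). The variable names are hypothetical.

```python
# (true label, {class: joint log-probability}) pairs copied from the output above.
results = [
    ("alt.atheism",
     {"alt.atheism": -137.80, "comp.graphics": -138.08, "comp.os.ms-windows.misc": -138.58}),
    ("comp.graphics",
     {"alt.atheism": -1029.87, "comp.graphics": -1034.13, "comp.os.ms-windows.misc": -1045.46}),
    ("comp.graphics",
     {"alt.atheism": -201.77, "comp.graphics": -202.01, "comp.os.ms-windows.misc": -205.38}),
    ("alt.atheism",
     {"alt.atheism": -2402.88, "comp.graphics": -2413.70, "comp.os.ms-windows.misc": -2433.82}),
    ("comp.graphics",
     {"alt.atheism": -434.16, "comp.graphics": -435.73, "comp.os.ms-windows.misc": -441.48}),
    ("comp.os.ms-windows.misc",
     {"alt.atheism": -664.42, "comp.graphics": -667.79, "comp.os.ms-windows.misc": -674.54}),
]

# Predicted class = argmax of the joint log-probability over class models.
predictions = [max(scores, key=scores.get) for _, scores in results]
correct = sum(pred == true for pred, (true, _) in zip(predictions, results))

print(predictions)
print(f"{correct}/{len(results)} correct")
```

On this printed sample the `alt.atheism` model scores highest for every document, so only the two `alt.atheism` documents are classified correctly. Six documents are far too few to draw conclusions about the approach.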

### Generation
1. we fine-tune class-specific models, as for classification
2. we generate texts from each model to check whether they are consistent with its class

In [30]:
prefix = ('i', 'am')
for label, clm in class_models.items():
    print(label)
    for _ in range(4):
        print('\t', " ".join(clm.generate(prefix=prefix, max_len=20)))

alt.atheism
	 i am using it as optional ? #E
	 i am sick and taking holiday snaps , but i prefer not to generalize about atheists and non-atheists on the case
	 i am just now understanding that the la area that newer dsos are addressing . #E
	 i am well aware that some jews and additional data . #E
comp.graphics
	 i am not mathew ( mantis ) but he 's setting himself up to 16 megahertz system speed ) ? ''
	 i am running on the bids stop coming to have anything to be pulled because of the fundamental issues at hand
	 i am describing , even a journalist who writes `` then man goes to a dog . #E
	 i am from turkey . '' #E
comp.os.ms-windows.misc
	 i am to 12:00 pm and 1:00 pm today . #E
	 i am not asking much . #E
	 i am not ! #E
	 i am looking pro a win for mr. vanderbyl ( no sex , right ? ) . #E
