# Sherlock Holmes Stories Markov Chain Model NLP

## Introduction

The purpose of this model is to generate text that is similar to `Shelock Holmes Stories` using `Markov Chains` approach. To build a Markov Chain we need states that will be words of the text and transitions that are the probabilities of going among the states. To find the transition probabilities we traverse the text and find the conditional probability of each word following another. So basically, the probability that the next word (state) will be `j` given that the current word (state) is `i` based on the adjacency list. We can cosider the transitions as statistical properties of the text data and because of this the model will produce random text which is similar to the original text. To generate the text, the model will do a walk (traverse by following the transition probabilities) from a innitial optional (given by us) state (word) and go to the next states. Each state (word) we visited will be part of our text. This can be considered a directed graph with weights traversal, that each node is a state (word) and the edges will define the transitions (probability to be adjacency).

## Implementation

### Installing required Python modules

In [None]:
%pip install tokenizer
%pip install nltk

### Importing required Python modules

In [1]:
import os
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import random

#### Fixing NLTK

In [None]:
import nltk
nltk.download('punkt')

### Read Shelock Holmes stories

In [2]:
def read_all_stories(path: str) -> list:
    stories: list = []
    
    for _, _, files in os.walk(path):
        for file in files:
            with open(path + file) as f:
                for line in f:
                    line = line.strip() # remove spaces around the string
                    # Each story in the dataset has the LICENSE
                    # in the end of the file which will be ignored
                    if line == "----------": break 
                    if line != '': stories.append(line)
                        
    return stories

story_path: str =  "./sherlock-stories-data-set/"

stories: list = read_all_stories(story_path)

assert len(stories) > 0

### Clean the data

Clean dataset required for `NLP`.

In [3]:
def clean_data(data: list) -> list:
    cleaned_data: list = []
    for text in data:
        text = text.lower()
        text = re.sub(r"[,.\"\'!@#$%^&*(){}?/;`~:<>+=-\\]", "", text)
        tokens: list = word_tokenize(text)
        words = [word for word in tokens if word.isalpha()]
        cleaned_data += words
    return cleaned_data

cleaned_stories: list = clean_data(stories)

assert len(cleaned_stories) > 0

### Markov Model

The function will accept also the n_gram which will be used to define number of words in a state (sequence of n words). This will give more context so, the text produced by this model will probably be similar and make more sense

In [4]:
def create_markov_model(cleaned_stories: list, n_gram: int = 1) -> dict:
    markov_model: dict = {}
        
    for i in range(len(cleaned_stories) - n_gram - 1):
        
        current_state: str = ""
        next_state: str = ""
        
        for j in range(n_gram):
            current_state += cleaned_stories[i + j] + " "
            next_state += cleaned_stories[i + j + n_gram] + " "
            
        current_state = current_state[:-1]
        next_state = next_state[:-1]
        
        if current_state not in markov_model:
            markov_model[current_state] = {}
            markov_model[current_state][next_state] = 1
        else:
            if next_state in markov_model[current_state]:
                markov_model[current_state][next_state] += 1
            else:
                markov_model[current_state][next_state] = 1
    
    # Calculating transition probabilities
    for current_state, transition in markov_model.items():
        total = sum(transition.values())
        for state, count in transition.items():
            markov_model[current_state][state] = count / total
        
    return markov_model

markov_model: dict = create_markov_model(cleaned_stories, 2)

assert len(markov_model.keys()) > 0

#### Test the model

In [5]:
# All transitions from `dear holmes` state
print(markov_model['dear holmes'])

{'said i': 0.07017543859649122, 'he has': 0.07017543859649122, 'oh yes': 0.07017543859649122, 'i have': 0.07017543859649122, 'i thought': 0.07017543859649122, 'i ejaculated': 0.07017543859649122, 'what do': 0.07017543859649122, 'i exclaimed': 0.07017543859649122, 'am i': 0.05263157894736842, 'my previous': 0.05263157894736842, 'if i': 0.05263157894736842, 'and tell': 0.05263157894736842, 'that i': 0.05263157894736842, 'i fear': 0.07017543859649122, 'you are': 0.05263157894736842, 'it is': 0.05263157894736842}


In [6]:
# All transitions from `that great` state
print(markov_model['that great'])

{'forest which': 0.12903225806451613, 'brain of': 0.12903225806451613, 'cesspool into': 0.0967741935483871, 'emporium proved': 0.12903225806451613, 'grimpen mire': 0.0967741935483871, 'city so': 0.0967741935483871, 'rich corporations': 0.0967741935483871, 'developments are': 0.12903225806451613, 'trunk of': 0.0967741935483871}


### Generate story

In [7]:
def generate_story(markov_model: dict, start: str, length: int = 100) -> str:
    story: str = start
    current_state = start
    next_state = {}
    
    for _ in range(length):
        next_state = random.choices(list(markov_model[current_state].keys()),
                                    list(markov_model[current_state].values()))[0]
        
        current_state = next_state
        
        story += current_state + " "
        
    return story

### Generating some random stories with different innitial states

In [8]:
# Story 1
print(generate_story(markov_model, "sherlock holmes"), 300)

sherlock holmeswhom i was not on it hes holding off the walls and yet those strange peaked roofs and peep in at the other end which was found in the dead mans hand and almost danced with excitement in his manner but on the day that sir robert has not returned i must find the story in the first signs of a murderous attack upon him nor would he do come to consult with the blue smoke curling up from him and to serve the purpose must have been generally successful then you begin to wonder i believe shoscombe prince and the derby the sporting interest of his case for it made me rather more lax than befits a medical man you know whom they loved and it was the children say at last in a dry rasping tone the best this last statement appeared to me i heard something of the world that it was no delusion one of them were destined to travel it has appeared to be sheltering themselves from me i dont know his address no except that it was only after her just now but it is coming to consult even at a 

In [9]:
# Story 2
print(generate_story(markov_model, "the case"), 500)

the caseis clear that mrs hudson and kindly send one of the older man is aware that i was looking for miss rachel with which we had hoped that perhaps you will have the kindness to place it in both hers in an hour i waited with some curiosity as to come and the billet was such a blind but why should he drug his own and it is also my old friend watson in she is after and she is brooding and malicious eyes had devised a safe which stood two of them were smoking cigars and coffee a few details about the arrival of her people but if you had shared his fortunes and had the place immediately afterwards i went round and examined each and all further investigation has served a purpose what is the meaning of it but i dont mind telling you mr sherlock holmes was standing smiling on the next morning but what in heavens name was strange to me to drink then she despised me as a doctor eh cried he much excited have you your very excellent and so lonesome that i thought you knew who i am but you are 

In [10]:
# Story 3
print(generate_story(markov_model, "dear holmes"), 180)

dear holmesmy previous letters and telegrams which were handed unto the holy joseph smith at palmyra we have come to our needs if you would find within the grounds the lake to which i have a curious collection very curious and the story was evidently the bearer of a kitchen garden the inner cartilage in all essentials it was the same terrible winter darkly the shadow lay upon the inside and get away just to the north of oporto the proceedings it is you like mr holmes i can understand how you attained this result simply by having the heart to band themselves together against their oppressors rumours had reached the bottom of the box the professor was buried in his flight had dropped his cravat and his worn boots he was not the mere fright of a man whose patriotism is beyond suspicion he would change his name it appeared less funny than he could have happened and finally what dr thorneycroft huxtable of the priory is a guilty reason for it the adventures of sherlock holmes requests for t