### **NLP Model with Markov Chains**

**Library setup**

In [4]:
# Libraries
import numpy as np
import pandas as pd
import os
import re
import random
import string

# Tokenization
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


**1. Load and read the .txt file**

In [5]:
# Import the path of the text file from the config module
from config import file_txt

with open(file_txt, 'r') as file:
   text = file.read()
   
print(text[:1000])

Chapter 1

I am by birth a Genevese, and my family is one of the most
distinguished of that republic.  My ancestors had been for many years
counsellors and syndics, and my father had filled several public
situations with honour and reputation.  He was respected by all who
knew him for his integrity and indefatigable attention to public
business.  He passed his younger days perpetually occupied by the
affairs of his country; a variety of circumstances had prevented his
marrying early, nor was it until the decline of life that he became a
husband and the father of a family.

As the circumstances of his marriage illustrate his character, I cannot
refrain from relating them.  One of his most intimate friends was a
merchant who, from a flourishing state, fell, through numerous
mischances, into poverty.  This man, whose name was Beaufort, was of a
proud and unbending disposition and could not bear to live in poverty
and oblivion in the same country where he had formerly been
distinguished fo

In [6]:
# Collapse runs of whitespace, then split the text into words
def text_form(text):
   txt_clean = re.sub(r'\s+', ' ', text)
   txt_words = re.findall(r'\b\w+\b', txt_clean)
   return txt_words

txt_words = text_form(text)
print('Number of words:', len(txt_words))

Number of words: 69690


**2. Cleaning the text and arranging it in a list**

In [7]:
# Read the file into a list of stripped, non-empty lines
def read_txt(file_txt):
   txt = []
   with open(file_txt) as f:
      for line in f:
         line = line.strip()
         if line != '':
            txt.append(line)
   return txt

rows_list = read_txt(file_txt)
print('Number of lines: ', len(rows_list))

Number of lines:  5941


In [8]:
# Remove special characters and tokenize each line into words
def clean_txt(txt):
   cleaned_text = []
   for line in txt:
      line = line.lower()
      # Note: inside the character class, "=-\\" is parsed as a range from '=' to '\',
      # so a literal hyphen is not removed by this substitution
      line = re.sub(r"[,.\"\'!@#$%^&*(){}?/;`~:<>+=-\\]", "", line)
      tokens = word_tokenize(line, language="english", preserve_line=True)
      # Keep purely alphabetic tokens only
      words = [word for word in tokens if word.isalpha()]
      cleaned_text += words
   return cleaned_text

cleaned_txt = clean_txt(rows_list)
print('number of tokenized words:', len(cleaned_txt))

number of tokenized words: 69389
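
The cleaning pattern ends with `+=-\\`, and inside a character class a hyphen placed between two characters defines a range. The short check below illustrates this on a fragment of the pattern. The names `as_written`, `hyphen_literal` and the sample string are made up for this illustration and are not part of the notebook. The second pattern is only one possible way to match a literal hyphen, assuming that removing hyphens was the intent.

```python
import re

# Fragment of the pattern as written: "=-\\" is a range from '=' to '\'
as_written = r"[+=-\\]"
# Hypothetical variant: the hyphen is escaped, so it is matched literally
hyphen_literal = r"[+=\-\\]"

sample = "well-known a=b [x]"

# The range covers '=' and '[', but not '-' or ']'
result_as_written = re.sub(as_written, "", sample)
# The escaped form removes '=' and '-', and leaves the brackets alone
result_hyphen_literal = re.sub(hyphen_literal, "", sample)

print(result_as_written)      # well-known ab x]
print(result_hyphen_literal)  # wellknown ab [x]
```

Switching patterns would change which tokens pass the `isalpha()` filter, so the token count reported above would likely differ.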


**3. Markov Chain Model - Statistical NLP Model**

*The implemented Markov model is a probabilistic language model that captures the sequential structure of a text. It scans the cleaned token sequence and estimates the transition probability from each n-gram (a contiguous sequence of n words) to the n-gram that immediately follows it.*
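
The quantity being estimated can be written as a relative frequency. The symbols $a$, $b$ and $C$ are notation introduced here for illustration and do not appear in the notebook: $C(a, b)$ is the number of times the n-gram $a$ is immediately followed by the n-gram $b$ in the token sequence.

$$P(b \mid a) = \frac{C(a, b)}{\sum_{b'} C(a, b')}$$

With this normalization, the probabilities of all successors of a given state sum to 1. For example, a state seen six times with six different successors assigns each of them $1/6 \approx 0.167$.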

In [13]:
# Defining the model: map each state to the probabilities of the state that follows it
def MC_model1(cleaned_txt, n_gram = 2):
   MC_model1_Output = {}
   # First pass: count how often each next_state follows each current_state
   for i in range(len(cleaned_txt) - n_gram - 1):
      current_state, next_state = "", ""
      for j in range(n_gram):
         # Note: the separator is an empty string, so the words are fused ("i" + "am" -> "iam")
         current_state += cleaned_txt[i+j] + ""
         next_state += cleaned_txt[i+j+n_gram] + ""
      # Note: this drops the last character of each fused string ("iam" -> "ia")
      current_state = current_state[:-1]
      next_state = next_state[:-1]
      if current_state not in MC_model1_Output:
         MC_model1_Output[current_state] = {}
         MC_model1_Output[current_state][next_state] = 1
      else:
         if next_state in MC_model1_Output[current_state]:
            MC_model1_Output[current_state][next_state] += 1
         else:
            MC_model1_Output[current_state][next_state] = 1

   # Second pass: turn the counts into probabilities that sum to 1 per state
   for current_state, transition_prob in MC_model1_Output.items():
      total = sum(transition_prob.values())
      for state, count in transition_prob.items():
         MC_model1_Output[current_state][state] = count / total

   return MC_model1_Output

In [15]:
# Build the model from the cleaned tokens
MC1 = MC_model1(cleaned_txt)

In [16]:
print(len(MC1))

38026


In [18]:
MC1

{'chapter': {'amb': 0.16666666666666666,
  'spentth': 0.16666666666666666,
  'layo': 0.16666666666666666,
  'nowhaste': 0.16666666666666666,
  'saton': 0.16666666666666666,
  'wassoo': 0.16666666666666666},
 'ia': {'bybirt': 0.010309278350515464,
  'oncegav': 0.010309278350515464,
  'takenfro': 0.010309278350515464,
  'happysai': 0.010309278350515464,
  'notrecordin': 0.010309278350515464,
  'acquaintedtha': 0.010309278350515464,
  'reservedupo': 0.010309278350515464,
  'nowconvince': 0.010309278350515464,
  'moralizingi': 0.010309278350515464,
  'tose': 0.010309278350515464,
  'atlengt': 0.010309278350515464,
  'rewardedfo': 0.010309278350515464,
  'firstexpresse': 0.010309278350515464,
  'suresh': 0.010309278350515464,
  'innocentbu': 0.010309278350515464,
  'onlylef': 0.010309278350515464,
  'checked': 0.010309278350515464,
  'saidsh': 0.010309278350515464,
  'wellacquainte': 0.010309278350515464,
  'glado': 0.010309278350515464,
  'nowan': 0.010309278350515464,
  'sover': 0.0103092

In [23]:
print(MC1['bybirth'])

KeyError: 'bybirth'
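
The keys printed earlier (`'ia'`, `'bybirt'`, `'amb'`) suggest why this lookup fails: the words of each state are concatenated without a separator and the last character is then trimmed, so neither `'bybirth'` nor `'by birth'` is ever stored as a key. A minimal sketch of one way to build space-separated states follows. The function name `build_markov_model` and the toy sentence are hypothetical, introduced only for this illustration, and this is not necessarily how the notebook goes on to resolve the error.

```python
from collections import defaultdict


def build_markov_model(tokens, n_gram=2):
    """Map each space-joined n-gram to the probabilities of the n-gram that follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    # Stop early enough that both the current and the next n-gram fit in the list
    for i in range(len(tokens) - 2 * n_gram + 1):
        current_state = " ".join(tokens[i:i + n_gram])
        next_state = " ".join(tokens[i + n_gram:i + 2 * n_gram])
        counts[current_state][next_state] += 1

    # Normalize the counts of each state into probabilities
    model = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        model[state] = {nxt: c / total for nxt, c in followers.items()}
    return model


# Made-up toy sentence, not taken from the corpus
toy_tokens = "i am by birth a genevese and i am by nature calm".split()
toy_model = build_markov_model(toy_tokens)

print(toy_model["i am"])      # {'by birth': 0.5, 'by nature': 0.5}
print(toy_model["by birth"])  # {'a genevese': 1.0}
```

With states stored this way, a lookup uses the space-separated form, for example `'by birth'` rather than `'bybirth'`.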