### Hidden Markov Model and NGram
#### Brief recap of Hidden Markov Model
Hidden Markov Models (HMMs) are statistical models that represent systems with hidden states. They are used to model sequential data where the system being modeled is assumed to follow a Markov process with hidden states. HMMs are characterized by:

- **States**: The hidden states of the system.
- **Observations**: The observed data that is generated by the hidden states.
- **Transition Probabilities**: The probabilities of transitioning from one state to another.
- **Emission Probabilities**: The probabilities of observing a particular observation given a state.
- **Initial Probabilities**: The probabilities of the system starting in each state.

HMMs are widely used in various applications such as speech recognition, part-of-speech tagging, and bioinformatics.
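
To make the five components above concrete, here is a minimal sketch of Viterbi decoding, which recovers the most likely hidden-state sequence for a series of observations. The weather/activity states and every probability are made-up illustration values, not part of this lab's data.

```python
# Toy HMM: hidden weather states emit observed activities.
# All names and numbers below are illustrative assumptions.
states = ["Rainy", "Sunny"]
initial = {"Rainy": 0.6, "Sunny": 0.4}
transition = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def viterbi(observations):
    """Return the most likely hidden-state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (initial[s] * emission[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * transition[p][s] * emission[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob

path, prob = viterbi(["walk", "shop", "clean"])
print(path, prob)
```

With these numbers the decoded path is Sunny, Rainy, Rainy: the walk is best explained by sun, and the shopping and cleaning by rain.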

#### Brief recap of NGram
N-grams are contiguous sequences of n items from a given sample of text or speech. They are used in various natural language processing tasks to capture the context and structure of the text. N-grams can be unigrams (single words), bigrams (pairs of words), trigrams (triplets of words), and so on.

- **Unigram**: A single word. Example: "the"
- **Bigram**: A pair of consecutive words. Example: "the cat"
- **Trigram**: A triplet of consecutive words. Example: "the cat sat"

N-grams are useful for tasks such as text classification, language modeling, and machine translation. They help in understanding the context and meaning of the text by considering the relationships between consecutive words.
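
As a quick illustration of the definitions above, n-grams can be read off a token list with a sliding window. The helper name `extract_ngrams` is hypothetical and the whitespace tokenization is a simplification.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of six tokens yields six unigrams, five bigrams, and four trigrams.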

## NGram

### Dataset

The SMS Spam Collection is a dataset of SMS messages tagged for spam research. It contains 5,574 English messages labeled as either ham (legitimate) or spam.

#### Content

Each line in the dataset consists of two columns: 
- **v1**: The label (ham or spam)
- **v2**: The raw text of the message

In [None]:
import pandas as pd
import numpy as np

In [None]:
#Importing data
df = pd.read_csv('spam.csv', encoding='latin-1')
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)
df.columns = ['label', 'message']
df

### Data Cleaning and Preprocessing

In this section, we will clean and preprocess the data to prepare it for analysis. This involves removing unwanted characters, converting text to lowercase, and removing stopwords. We will also use stemming and lemmatization techniques to reduce words to their root forms.
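
Before running the steps on the full dataset, it can help to see them on a single message. The sketch below is only an illustration: the sample message, the `clean_message` helper, and the tiny stop-word set are assumptions made for this example, whereas the lab itself uses NLTK's English stop-word list.

```python
import re

# A tiny hand-picked stop-word set, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "have", "now", "your"}

def clean_message(text):
    """Keep letters only, lowercase, split into tokens, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = clean_message("WINNER!! You have won a 1000 prize. Call 09061701461 now!")
print(cleaned)
```

Digits and punctuation become spaces, so the phone number and the amount disappear and only the content words remain.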

In [None]:
import re
import nltk
nltk.download('stopwords')

#### Stemming

Stemming is the process of reducing a word to its base or root form. This is useful in natural language processing to ensure that different forms of a word are treated as the same word. For example, "running" and "runs" are both reduced to the stem "run". One of the most common stemming algorithms is the Porter Stemmer.

#### Porter Stemmer

The Porter Stemmer is a widely used stemming algorithm that applies a series of rules to transform words into their root forms. It was developed by Martin Porter in 1980 and is known for its simplicity and effectiveness. The algorithm works by applying a sequence of rules that strip common suffixes from words. The resulting stem is not always a dictionary word, and some related forms (such as "runner") are left unchanged.

Example:
- "running" -> "run"
- "runs" -> "run"
- "happiness" -> "happi"

In the context of our lab, we will use the Porter Stemmer to preprocess the text data and reduce words to their root forms.

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
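# Data cleaning and preprocessing
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast
corpus = []
for i in range(len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the rest; the result must be assigned back
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)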


#### Creating Bag of Words Model

In this section, we will create a Bag of Words (BoW) model using the `CountVectorizer` from the `sklearn.feature_extraction.text` module. The BoW model is a common text representation technique used in natural language processing. It converts text data into numerical feature vectors, where each feature represents the frequency or presence of a word in the text.

We will use the `CountVectorizer` with the following parameters:
- `max_features=2500`: Limits the number of features to the top 2500 most frequent words.
- `binary=True`: Indicates that the feature values should be binary (1 if the word is present, 0 if not).

The resulting feature vectors will be stored in the variable `X`.
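
To show what a binary Bag of Words looks like without the library, here is a simplified pure-Python sketch. The `binary_bow` helper and the three sample messages are hypothetical, and the tokenization is plain whitespace splitting, so it will not match `CountVectorizer` in every detail.

```python
from collections import Counter

def binary_bow(docs, max_features=None):
    """Return (vocabulary, rows); rows[i][j] is 1 if vocabulary[j] occurs in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    # Keep the most frequent terms, then sort them for a stable column order.
    vocabulary = sorted(term for term, _ in counts.most_common(max_features))
    rows = []
    for doc in tokenized:
        present = set(doc)
        rows.append([int(term in present) for term in vocabulary])
    return vocabulary, rows

docs = ["free prize call now", "call me later", "free free prize entry"]
vocabulary, rows = binary_bow(docs, max_features=3)
print(vocabulary)
print(rows)
```

Note that the third message contains "free" twice but still gets a 1 in that column, which is the effect of `binary=True`.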


In [None]:
# Creating bag of words model:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500, binary=True)
X = cv.fit_transform(corpus).toarray()

In [None]:
pd.DataFrame(X, columns=cv.get_feature_names_out())

In [None]:
# Preprocessing using Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
lemma = WordNetLemmatizer()

In [None]:
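# Start from an empty corpus so the stemmed messages are not kept alongside
corpus = []
stop_words = set(stopwords.words('english'))
for i in range(len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and lemmatize the rest; the result must be assigned back
    review = [lemma.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)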

In [None]:
cv = CountVectorizer(max_features=2500, binary=True)
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())

<p>sklearn's CountVectorizer offers an ngram_range parameter that controls which n-grams go into the BoW vocabulary.
<p>Consider the following two messages:
<li>'Boy is Good'
<li>'Boy is not Good'
<p>With the unigram vocabulary ['Boy','is','good','not'], their vector representations are [1,1,1,0] and [1,1,1,1].
<p>The cosine similarity of these two vectors is high (about 0.87), even though the messages have opposite meanings.
<p>To counter this issue, n-grams can be used to add sequences of two or more consecutive words to the vocabulary.
<p>For ngram_range = (1,2) ---> vocab: ['Boy','is','good','not','Boy is','is good','is not','not good']
<p>Vector representation: sentence 1 ---> [1,1,1,0,1,1,0,0] and sentence 2 ---> [1,1,1,1,1,0,1,1]
<p>The cosine similarity now drops to about 0.68, so the two messages are easier to tell apart.
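
The effect of adding bigrams can be checked numerically. The sketch below computes the cosine similarity of the binary term vectors for the two example messages, first with unigrams only and then with unigrams plus bigrams. The helper names are hypothetical and the tokenization is plain whitespace splitting.

```python
import math

def ngram_terms(sentence, ngram_range):
    """Return the set of word n-grams in `sentence` for every n in the inclusive range."""
    tokens = sentence.lower().split()
    low, high = ngram_range
    return {
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    }

def binary_cosine(a, b, ngram_range):
    """Cosine similarity of the binary term vectors of two sentences."""
    terms_a = ngram_terms(a, ngram_range)
    terms_b = ngram_terms(b, ngram_range)
    # For 0/1 vectors the dot product is the number of shared terms,
    # and each vector's norm is the square root of its term count.
    shared = len(terms_a & terms_b)
    return shared / math.sqrt(len(terms_a) * len(terms_b))

unigram_sim = binary_cosine("Boy is Good", "Boy is not Good", (1, 1))
bigram_sim = binary_cosine("Boy is Good", "Boy is not Good", (1, 2))
print(round(unigram_sim, 3), round(bigram_sim, 3))  # 0.866 0.676
```

The bigrams "is not" and "not good" appear only in the second message, which is what pulls the two vectors apart.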

In [None]:
# Using n-grams via ngram_range=(min_n, max_n), which includes every n in that range:
# (1, 2) = unigrams and bigrams
# (1, 3) = unigrams, bigrams and trigrams
# (2, 3) = bigrams and trigrams, and so on

cv = CountVectorizer(max_features=2500, binary=True,ngram_range=(1, 2))
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())

# Hidden Markov Model
In this part, we will implement spellcheck and autocomplete using an HMM.

To perform spellcheck and autocomplete using Hidden Markov Model (HMM), we will use the following steps:

1. **Spellcheck**: We will train an HMM part-of-speech tagger and pair it with a dictionary of common misspellings to identify and correct misspelled words in a given sentence. The tagger estimates the most likely sequence of tags for a sentence; the corrections themselves come from the misspelling dictionary, with an edit-distance search over the vocabulary as a fallback.

2. **Autocomplete**: We will build a bigram model from a corpus of text to predict the next word(s) in a given sentence. The bigram model will help us generate probable word sequences based on the context provided by the input text.

The implementation details are provided in the subsequent code cells.

In [None]:
import nltk
from nltk.util import ngrams
from nltk.corpus import reuters
from collections import defaultdict, Counter
from lab_helpers import *

nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('reuters')

### Autocomplete using Hidden Markov Model

In this section, we will implement an autocomplete feature using a Hidden Markov Model (HMM). The steps involved are:

1. **Building a Bigram Model**: We will build a bigram model from a corpus of text (the `acq` category of the Reuters corpus). The bigram model will help us predict the next word based on the previous word in the sequence.
2. **Autocomplete Function**: We will use the already-defined `autocomplete` function (made available by the `lab_helpers` import), which takes an input text and uses the bigram model to predict the next few words, completing the input text.

The code implementation is provided in the subsequent code cells.
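
The body of the autocomplete helper is not shown in this notebook. As a rough idea of how such a function could work, here is a minimal greedy sketch on a toy corpus. The names `build_toy_bigram_model` and `greedy_autocomplete`, the sample sentences, and the choice of always taking the most frequent follower are all assumptions, so the lab's own helper may behave differently.

```python
from collections import Counter, defaultdict

def build_toy_bigram_model(sentences):
    """Count how often each word follows another, with sentence boundary markers."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            model[w1][w2] += 1
    return model

def greedy_autocomplete(text, model, num_words=3):
    """Hypothetical helper: repeatedly append the most frequent follower of the last word."""
    words = text.lower().split()
    for _ in range(num_words):
        followers = model.get(words[-1]) if words else None
        if not followers:
            break  # unseen word: nothing to predict
        next_word = followers.most_common(1)[0][0]
        if next_word == "</s>":
            break  # the model predicts that the sentence ends here
        words.append(next_word)
    return " ".join(words)

toy_sentences = [
    "the company said it will buy the shares",
    "the company said it will sell the unit",
    "the company said profits rose",
]
toy_model = build_toy_bigram_model(toy_sentences)
completed = greedy_autocomplete("The company", toy_model, num_words=3)
print(completed)
```

A greedy choice is deterministic and easy to follow, but it always produces the same continuation; sampling from the follower counts is a common alternative.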

In [None]:
corpus = reuters.sents(categories='acq')

def build_bigram_model(corpus):
    bigram_model = defaultdict(Counter)
    for sentence in corpus:
        sentence = [word.lower() for word in sentence]
        for w1, w2 in ngrams(sentence, 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"):
            bigram_model[w1][w2] += 1
    return bigram_model

In [None]:
bigram_model = build_bigram_model(corpus)

In [None]:
input_text = "Machine learning is"
completed_text = autocomplete(input_text, bigram_model, num_words=3)
print("Autocomplete:", completed_text)

### Autocomplete using Custom NLTK Data

In this section, we will implement an autocomplete feature using a Hidden Markov Model (HMM) with a custom NLTK data path. The steps involved are:

1. **Setting Up Custom NLTK Data Path**: We will set up a custom path for NLTK data downloads and ensure the necessary corpora are available.
2. **Building a Bigram Model**: We will build a bigram model from the Gutenberg corpus. The bigram model will help us predict the next word based on the previous word in the sequence.
3. **Autocomplete Function**: We will reuse an already-defined function, `autocomplete_HMM`, which takes an input text and uses the bigram model to predict the next few words, completing the input text.

The code implementation is provided in the subsequent code cell.

In [None]:
import os
import nltk
from nltk.util import ngrams
from collections import defaultdict, Counter

In [None]:
custom_nltk_path = os.path.expanduser('~/custom_nltk_data')
os.makedirs(custom_nltk_path, exist_ok=True)  # no error if the directory already exists

In [None]:
nltk.download('punkt', download_dir=custom_nltk_path)
nltk.download('gutenberg', download_dir=custom_nltk_path)


nltk.data.path.append(custom_nltk_path)

corpus = nltk.corpus.gutenberg.sents()

In [None]:
def build_bigram_model(corpus):
    bigram_model = defaultdict(Counter)
    for sentence in corpus:
        sentence = [word.lower() for word in sentence]
        for w1, w2 in ngrams(sentence, 2, pad_left=True, pad_right=True, left_pad_symbol="<s>", right_pad_symbol="</s>"):
            bigram_model[w1][w2] += 1
    return bigram_model

In [None]:
bigram_model = build_bigram_model(corpus)

In [None]:
input_text = "machine learning is"
completed_text = autocomplete_HMM(input_text, bigram_model, num_words=3)
print("Autocomplete:", completed_text)

### Spellcheck using Hidden Markov Model

In this section, we will implement a spellcheck feature using a Hidden Markov Model (HMM). The steps involved are:

1. **Training the HMM Tagger**: We will train an HMM tagger on the treebank corpus with the universal tagset. The tagger estimates the most likely sequence of part-of-speech tags for a sentence.
2. **Spellcheck Function**: We will define a function that takes a tokenized sentence and the HMM tagger, flags words that are missing from the vocabulary, and corrects them. It first looks each word up in a dictionary of common misspellings and otherwise falls back to `suggest_corrections` with a maximum edit distance of 2.

The code implementation is provided in the subsequent code cell.
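
The fallback step relies on a `suggest_corrections` helper whose body is not shown in this notebook (it presumably comes from `lab_helpers`). The sketch below shows one way such a lookup could work, using a hand-written Levenshtein distance. The names `levenshtein` and `closest_word`, the toy vocabulary, and the tie-breaking rule are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def closest_word(word, vocabulary, max_distance=2):
    """Hypothetical helper: nearest vocabulary word within max_distance, else the word itself."""
    best_word, best_distance = word, max_distance + 1
    for candidate in sorted(vocabulary):  # sorted so ties resolve deterministically
        distance = levenshtein(word.lower(), candidate)
        if distance < best_distance:
            best_word, best_distance = candidate, distance
    return best_word

toy_vocabulary = {"cat", "dog", "quick", "jumps"}
print(closest_word("catt", toy_vocabulary))       # cat
print(closest_word("quikc", toy_vocabulary))      # quick
print(closest_word("xylophone", toy_vocabulary))  # xylophone (nothing close enough)
```

Scanning the whole vocabulary for every unknown word is slow on a large word list, which is one reason a small dictionary of known misspellings is checked first.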

In [None]:
import nltk
from nltk.corpus import treebank, words
from nltk.tag import hmm
from nltk.metrics.distance import edit_distance

nltk.download('treebank')
nltk.download('universal_tagset')
nltk.download('words')

In [None]:
train_sents = treebank.tagged_sents(tagset='universal')
english_words = set(words.words())
train_vocab = set(word.lower() for sent in train_sents for word, _ in sent)
full_vocabulary = english_words.union(train_vocab)

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train(train_sents)

common_misspellings = {
    'teh': 'the',
    'quikc': 'quick',
    'brownn': 'brown',
    'fxo': 'fox',
    'jupms': 'jumps',
    'lazzy': 'lazy',
}

In [None]:
def is_correct(word):
    return word.lower() in full_vocabulary

def hmm_spell_checker(sentence, tagger, common_misspellings):
    corrected_sentence = []

    for i, word in enumerate(sentence):
        if not is_correct(word):
            correction = common_misspellings.get(word.lower(), None)
            if correction:
                corrected_sentence.append(correction)
                print(f"Correcting '{word}' to '{correction}'")
            else:
                suggested_word = suggest_corrections(word, full_vocabulary, max_distance=2)
                corrected_sentence.append(suggested_word)
                print(f"Suggesting '{suggested_word}' for '{word}'")
        else:
            corrected_sentence.append(word)

    return " ".join(corrected_sentence)

In [None]:
input_sentence = ['The', 'quikc', 'brownn', 'fxo', 'jupms', 'over', 'the', 'lazzy', 'dog', 'cta', 'cet','catt']
corrected_text = hmm_spell_checker(input_sentence, tagger, common_misspellings)
print("\nCorrected Sentence:")
print(corrected_text)

### Conclusion

In this notebook, we explored various natural language processing techniques and models, including Hidden Markov Models (HMMs) and N-grams. We covered the following key points:

1. **Hidden Markov Models (HMMs)**: We provided a brief recap of HMMs, their components, and their applications in tasks such as speech recognition and part-of-speech tagging.

2. **N-Grams**: We discussed N-grams and their use in capturing the context and structure of text. We implemented a Bag of Words (BoW) model using unigrams and then extended it with bigrams through `ngram_range=(1, 2)`.

3. **Data Cleaning and Preprocessing**: We cleaned and preprocessed the SMS Spam Collection dataset, including removing unwanted characters, converting text to lowercase, and applying stemming and lemmatization techniques.

4. **Bag of Words Model**: We created a BoW model using the `CountVectorizer` from `sklearn`, converting text data into numerical feature vectors.

5. **Autocomplete using HMM**: We implemented an autocomplete feature using bigram models built from the Reuters and Gutenberg corpora.

6. **Spellcheck using HMM**: We implemented a spellcheck feature using an HMM tagger trained on the treebank corpus and a dictionary of common misspellings.

These techniques and models are fundamental in natural language processing and can be applied to various tasks such as text classification, language modeling, and text generation. By understanding and implementing these methods, we can enhance our ability to process and analyze textual data effectively.