# Natural Language Processing
### A Sentence Completion ML Model
- The first part of this project is a sentence completion model that takes 5 consecutive words as input features and predicts the word that follows. The goal is to make typing more efficient by learning typing patterns from the user's conversation history.

- The second part wraps the model in a function that calls it iteratively: each predicted word is appended to the sentence and the context window shifts forward by one word, so the function keeps predicting the next word until a specified word limit is reached.

### 📦 Importing Libraries and Setting Up the Environment

In this section, we import all the required libraries for handling data, visualizing trends, processing text, and training our machine learning models.

- **Data Handling & Visualization**
  - `pandas` and `numpy` for data manipulation.
  - `seaborn` and `matplotlib.pyplot` for visualizations.

- **Machine Learning**
  - `TfidfVectorizer` to convert text into numerical feature vectors.
  - `train_test_split` to divide data into training and testing sets.
  - `MultinomialNB` for Naive Bayes text classification.
  - `RandomForestClassifier` as an alternative model for text classification.

- **Text Preprocessing**
  - `stopwords` and `word_tokenize` from `nltk` for cleaning and tokenizing text.

- **Utilities**
  - `Counter` from `collections` for frequency analysis.
  - `random` for sampling among candidate words.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from collections import Counter

from sklearn.ensemble import RandomForestClassifier

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics import classification_report

### 🧹 Reading and Cleaning the Raw Text

We begin by loading the raw text data from a `.txt` file and performing basic cleaning by removing punctuation marks.

- The `with open(...)` block reads the contents of **`Video Games.txt`** into the variable `initial_text`.
- A set of punctuation characters is defined using a raw string (`r'''...'''`).
- We then remove all punctuation by iterating through each character in the text and keeping only those not found in the `punctuations` string.

In [2]:
with open("Video Games.txt", "r", encoding="utf-8") as text_file:
    initial_text = text_file.read()

Remove all punctuation:

In [3]:
# Characters to strip; the raw string keeps the backslash literal
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&’*_~'''
# Remove punctuation from the text
text_variable = ''.join(char for char in initial_text if char not in punctuations)
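
For larger files, the same filtering could be sketched with `str.translate`, which strips the characters in a single pass. This is an alternative shown for illustration, not what the notebook runs, and `sample` is a made-up string.

```python
# Same punctuation set as in the cell above.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&’*_~'''

# Build a translation table that maps every punctuation character to None.
remove_punct = str.maketrans('', '', punctuations)

sample = "Video games aren't just a hobby; they're an industry!"  # hypothetical input
cleaned = sample.translate(remove_punct)
print(cleaned)
```

Note that apostrophes are removed too, so contractions are merged into single tokens. This is why words such as `wasnt` and `anythings` appear later in the dataset.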

In [4]:
# print(text_variable)

### 🧠 Tokenization and Sequence Preparation

In this section, we prepare the textual data for training a language model by:

- **Tokenizing** the cleaned text using `nltk.word_tokenize` and converting all characters to lowercase.
- *(Optional)* Removing stopwords: this step is skipped here because, in next-word prediction, stopwords carry meaningful context.
- **Generating input-target pairs**:
  - We use a sliding window approach to extract 5-word input sequences.
  - The **6th word** following each 5-word sequence is treated as the **target label** for prediction.

This is a standard approach in building datasets for **next-word prediction** or **language modeling**.

In [5]:
# Tokenize the text
tokens = word_tokenize(text_variable.lower())

# Create dataset: 5-word sequences with 6th word as target
input_sequences = []
target_words = []

for i in range(len(tokens) - 5):
    input_sequences.append(tokens[i:i+5])
    target_words.append(tokens[i+5])

# Preview
print("Sample input:", input_sequences[0])
print("Target word:", target_words[0])

Sample input: ['video', 'games', 'have', 'evolved', 'into']
Target word: a
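
To make the windowing concrete, here is a small sketch on a made-up token list. The helper `make_windows` and its `window` parameter are illustrative names, not part of the notebook.

```python
def make_windows(tokens, window=5):
    """Return (contexts, targets): each context is `window` tokens, each target the next token."""
    contexts, targets = [], []
    for i in range(len(tokens) - window):
        contexts.append(tokens[i:i + window])
        targets.append(tokens[i + window])
    return contexts, targets

toy_tokens = "video games have evolved into a major form".split()
contexts, targets = make_windows(toy_tokens)
for context, target in zip(contexts, targets):
    print(context, "->", target)
```

A text with `N` tokens yields `N - 5` pairs, and consecutive contexts overlap in four of their five words.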


### 🧾 Converting Word Sequences to Clean Text Strings

Now that we have our input sequences as lists of individual words, we convert each sequence into a single space-separated string.

- This transformation is useful when working with vectorizers like `TfidfVectorizer` or `CountVectorizer`, which expect raw text input.
- Each 5-word list is joined into a single string using `' '.join(...)`, and stored in `the_list`.

In [6]:
the_list = []

for words in input_sequences:
    new_clean_text = ' '.join(words)

    the_list.append(new_clean_text)

### 🧱 Creating the Final Dataset

We now create a structured DataFrame that pairs each 5-word input sequence with its corresponding target word.

- The `Sentence` column contains the 5-word context as a single string.
- The `Target` column holds the next word that follows each sentence — the word we want the model to predict.

This DataFrame will serve as the **training corpus** for our language modeling task.

In [7]:
corpus_df = pd.DataFrame({'Sentence' : the_list, 'Target' : target_words})

In [8]:
corpus_df

|  | Sentence | Target |
|---|---|---|
| 0 | video games have evolved into | a |
| 1 | games have evolved into a | major |
| 2 | have evolved into a major | form |
| 3 | evolved into a major form | of |
| 4 | into a major form of | entertainment |
| ... | ... | ... |
| 12607 | high score because in the | world |
| 12608 | score because in the world | of |
| 12609 | because in the world of | games |
| 12610 | in the world of games | anythings |


In [9]:
corpus_df.Target.value_counts()

and            420
the            362
a              329
of             263
to             234
              ... 
wasnt            1
cultivation      1
formed           1
chat             1
anythings        1
Name: Target, Length: 3613, dtype: int64

In [10]:
# corpus_df.Target = corpus_df.Target.apply(lambda x :'others' if x not in corpus_top else x)
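
The commented-out line above refers to `corpus_top`, which is not defined anywhere in this notebook. One plausible reading, and it is only an assumption, is that it holds the most frequent target words so that every other target can be collapsed into a single `others` class. A minimal sketch of that idea with `Counter` on a toy list:

```python
from collections import Counter

# Toy stand-in for corpus_df.Target; the real column has 3613 distinct words.
toy_targets = ["and", "the", "and", "a", "of", "the", "and", "chat", "formed"]

# Assumption: corpus_top is the set of the N most frequent target words.
top_n = 2
corpus_top = {word for word, _ in Counter(toy_targets).most_common(top_n)}

# Replace every target outside that set with the catch-all label 'others'.
collapsed = [word if word in corpus_top else "others" for word in toy_targets]
print(Counter(collapsed))
```

Collapsing rare targets would shrink the number of classes, but the model could then emit `others` as a "word", which is of little use for text generation.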

### 🧮 Vectorizing Input Sentences with TF-IDF

To convert textual data into numerical format for modeling, we use **TF-IDF (Term Frequency–Inverse Document Frequency)**:

- `TfidfVectorizer` transforms each sentence into a vector of TF-IDF scores.
- This highlights words that are frequent in a sentence but rare across the corpus, improving model focus on informative terms.

We apply this transformation to the `Sentence` column of our dataset and store the resulting sparse matrix in `X`.

In [12]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_df.Sentence)
y = corpus_df.Target

Printing `X` lists only the non-zero entries, in the form `(doc_index, feature_index)    tfidf_score`. `X` is stored as a sparse matrix for memory efficiency, so zeros are not shown.


In [13]:
print(X)

  (0, 1676)	0.36304682846865993
  (0, 1098)	0.6368387560404972
  (0, 1458)	0.46580932310381934
  (0, 1320)	0.2876209140440805
  (0, 3391)	0.4036449968197907
  (1, 1676)	0.3968090537138697
  (1, 1098)	0.6960627785089225
  (1, 1458)	0.5091281405530329
  (1, 1320)	0.31436876397338415
  (2, 1873)	0.5433142907210021
  (2, 1676)	0.35092439516401003
  (2, 1098)	0.6155741842537654
  (2, 1458)	0.4502555652709021
  (3, 1265)	0.5309563308389728
  (3, 1873)	0.515627982573967
  (3, 1676)	0.3330419262012861
  (3, 1098)	0.5842056433490291
  (4, 2134)	0.27857704997764005
  (4, 1265)	0.6283073780335314
  (4, 1873)	0.6101685712266656
  (4, 1676)	0.3941052912885008
  (5, 1055)	0.4806625003031631
  (5, 2134)	0.26579808489588636
  (5, 1265)	0.599485484610713
  (5, 1873)	0.5821787462704642
  :	:
  (12607, 2727)	0.552879490168834
  (12607, 263)	0.4339450374497415
  (12607, 1492)	0.6050552565970259
  (12607, 3145)	0.2347879970344969
  (12607, 1601)	0.2912021750197425
  (12608, 2727)	0.6106197010003567
  (1260

### 🌲 Training with Random Forest Classifier

In this section, we train a **Random Forest Classifier** on our TF-IDF feature set.

#### Steps:
- **Data Splitting**:  
  We split the dataset into training and testing sets using `train_test_split`, with 80% for training and 20% for testing.
- **Model Training**:  
  A `RandomForestClassifier` is instantiated and trained using the training data.
- **Prediction**:  
  The trained model is then used to predict the target values for the test set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [15]:
RF_model = RandomForestClassifier()
RF_model.fit(X_train, y_train)

In [16]:
rf_y_pred = RF_model.predict(X_test)

In [17]:
from sklearn.metrics import f1_score
print(f1_score(y_test, rf_y_pred, average= 'weighted'))

0.030424610717650312


### 📊 Training with Naive Bayes Classifier

In this section, we train a **Multinomial Naive Bayes** classifier and evaluate its performance using the **F1-score**.

#### Steps:
- **Model Training**:  
  `MultinomialNB` is trained on the TF-IDF features (`X_train`) and corresponding target words (`y_train`).
- **Prediction**:  
  We generate predictions for the test set using `predict()`.
- **Evaluation**:  
  We compute the **weighted F1-score**: the per-class F1 averaged with each class weighted by its number of true instances, which accounts for label imbalance.
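
As a rough illustration of what `average='weighted'` computes, the sketch below calculates per-class F1 by hand and averages it with each class weighted by its number of true instances (its support). The labels are invented, and the helper `weighted_f1` is not part of the notebook.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this label.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(y_true)

toy_true = ["the", "the", "the", "of", "games", "chat"]
toy_pred = ["the", "the", "of", "of", "the", "the"]
print(round(weighted_f1(toy_true, toy_pred), 3))
```

Every class the model never predicts correctly contributes zero. With 3613 distinct target words, many of which occur only once, this pulls the weighted score down sharply.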

In [18]:
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)

In [19]:
nb_y_pred = NB_model.predict(X_test)

In [20]:
print(f1_score(y_test, nb_y_pred, average= 'weighted'))

0.011500818522530043


## 📏 Why We Did Not Rely Solely on Evaluation Metrics

While evaluation metrics like **F1-score** are commonly used to assess model performance, we found them **inadequate for this specific task** due to the nature of our dataset and prediction objective.

### Here's why:
- The **target column** represents the 6th word following each 5-word input sequence in the raw text.
- These targets are **highly varied and context-dependent**, meaning the same input may not always lead to a single "correct" next word in real usage.
- After splitting the dataset into training and testing sets, there's **no guarantee that similar patterns exist in both subsets**, making accuracy-based metrics less meaningful.
- The prediction process is inherently **sequential and creative**, not strictly deterministic like traditional classification.
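
The split issue can be made concrete with a small check: count how many test targets never occur among the training targets, since a classifier cannot predict a label it has not seen. The sketch uses invented lists and a hypothetical helper `unseen_fraction`; with the notebook's data one would pass `y_train` and `y_test` instead.

```python
def unseen_fraction(train_targets, test_targets):
    """Share of test targets whose label never appears in the training targets."""
    seen = set(train_targets)
    unseen = [word for word in test_targets if word not in seen]
    return len(unseen) / len(test_targets)

toy_train = ["and", "the", "a", "of", "and", "games", "the"]
toy_test = ["the", "cultivation", "and", "chat"]
print(unseen_fraction(toy_train, toy_test))
```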

### What We Did Instead:
- For **completeness**, we still compared the **F1-scores** of the Naive Bayes and Random Forest models.
- However, the **true measure of performance** lies in **qualitative evaluation**:  
  → How well does the model generate natural, context-aware next words?

This approach reflects the creative, language-generation nature of our task — where context and fluency matter more than raw metrics.


## Testing

The function below predicts the next word for a given input text, using whichever trained model is passed in.

In [21]:
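def predict_word(text, model):
    # Wrap the input in a Series so the vectorizer receives an iterable of documents
    input_series = pd.Series(str(text))

    # Transform the input text using the same fitted vectorizer
    features = vectorizer.transform(input_series)

    # Predict the most likely next word; returns a one-element array
    output = model.predict(features)

    return output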

### ✍️ Making Predictions with Custom Input

We can now test the trained models with a custom input sequence.  
Here, we prompt the user to enter a 5-word sentence using Python's `input()` function.

We pass the same phrase to both models and compare the predicted next words to see which model produces the more meaningful continuation.

In [22]:
new_review = input("Enter text here:")

In [23]:
print(new_review)

the world of video games


- Result from the Random Forest Classifier model.

In [24]:
print(predict_word(new_review, RF_model))

['continues']


- Result from the Naive Bayes Classifier model.

In [25]:
print(predict_word(new_review, NB_model))

['a']


### 🧠 Next-Word Prediction Function

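This revised `predict_word()` replaces the version above. Instead of returning the single most likely word, it samples the next word from the top candidates that the specified model (e.g., Naive Bayes or Random Forest) proposes for a 5-word input string.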

#### Key Features:
- **Model-Agnostic**: Accepts any trained classification model (e.g., Naive Bayes or Random Forest).
- **Input Flexibility**: Takes a 5-word user input string and uses it as the prediction context.
- **Top-k Sampling**: Instead of choosing just the highest probability word, the function:
  - Retrieves the **top 5 most probable words** based on the model's predictions.
  - Randomly selects **one word from these top 5**, introducing controlled randomness and variation in the generated text.

In [26]:
def predict_word(text, model):
    input_series = pd.Series(str(text))

    # Transform the input text using the same fitted vectorizer
    features = vectorizer.transform(input_series)
    # Get class probabilities, shape (1, n_classes)
    proba = model.predict_proba(features)

    # Indices of the 5 most probable classes, highest probability first
    top_k = 5
    top_classes = np.argsort(proba, axis=1)[:, -top_k:][:, ::-1]

    # Map indices to class labels and pick one of them uniformly at random
    top_class_labels = model.classes_[top_classes][0]
    rand_variable = random.choice(top_class_labels)

    return rand_variable
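
The function above picks uniformly among the five candidates, so the fifth-ranked word is as likely to be chosen as the first. A possible variation, not used in this notebook, is to sample in proportion to the predicted probabilities. The sketch below shows the idea on a made-up probability vector; `toy_classes` and `toy_proba` stand in for `model.classes_` and one row of `predict_proba`.

```python
import random
import numpy as np

toy_classes = np.array(["a", "and", "continues", "games", "of", "the", "to"])
toy_proba = np.array([0.05, 0.30, 0.10, 0.02, 0.20, 0.25, 0.08])

top_k = 5
# Indices of the k largest probabilities, highest first.
top_idx = np.argsort(toy_proba)[-top_k:][::-1]
top_words = toy_classes[top_idx].tolist()
top_weights = toy_proba[top_idx].tolist()

# random.choices normalises the weights, so they need not sum to 1.
rng = random.Random(0)  # seeded only to make the sketch repeatable
next_word = rng.choices(top_words, weights=top_weights, k=1)[0]
print(top_words, next_word)
```

One caveat under this assumption: `random.choices` raises a `ValueError` if all weights are zero, so a real implementation would need a fallback for that case.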

### 🔮 Enhanced Next-Word Prediction Function

The function below builds on `predict_word()` to support **continuous text generation**: it calls `predict_word()` repeatedly, appending each predicted word to the sentence and sliding the context window forward by one word.


In [27]:
def generate_sentence(words, model, count=50):
    # count: number of words to generate (default 50)
    word_list = words.split()  # turn input into a list of context words

    for _ in range(count):
        main_words = ' '.join(word_list)  # form the current context string
        next_word = str(predict_word(main_words, model))  # predict the next word
        words = words + " " + next_word  # add it to the sentence
        word_list = word_list[1:]  # drop the oldest context word
        word_list.append(next_word)  # include the new word in the context

    return words

The code above extends the input text by 50 predicted words.
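
To see how the context window moves, the loop can be traced with a stand-in predictor. `toy_predict` below is a hypothetical replacement for `predict_word` that returns a numbered placeholder rather than a model prediction, so the window contents are easy to follow. The sketch uses a `deque` with a fixed `maxlen`, which differs slightly from the notebook's list slicing: it always keeps the last `window` words, whatever the length of the seed text.

```python
from collections import deque

def toy_predict(context, step):
    """Hypothetical stand-in for predict_word: returns a numbered placeholder."""
    return f"w{step}"

def trace_generation(seed_text, steps=3, window=5):
    words = seed_text.split()
    # With maxlen set, appending a new word drops the oldest one automatically.
    context = deque(words[-window:], maxlen=window)
    for step in range(1, steps + 1):
        next_word = toy_predict(" ".join(context), step)
        print(list(context), "->", next_word)
        context.append(next_word)
        words.append(next_word)
    return " ".join(words)

result = trace_generation("the world of video games")
print(result)
```

After five steps the context consists entirely of generated words, so any early mistake feeds into every later prediction.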

- Results from the Random Forest Classifier model.

In [28]:
print(generate_sentence(new_review, RF_model))

the world of video games wild expanding ride are through technology on leaps creates that players a transcend national national and boundaries language in barriers games between player property are titles as borders like pacman and the of gaming generation thought whats is like games through adapt that with the isnt feels feels just of


- Results from the Naive Bayes Classifier model.

In [29]:
print(generate_sentence(new_review, NB_model))

the world of video games a of games of of and the and and of in a and the gaming to and the of a of gaming to of the and the of of gaming gaming and a of to to a and a the and in games and of to and to a a


## 📈 Performance Check

Based on the generated text above and the weighted F1-scores (about 0.030 for Random Forest versus 0.012 for Naive Bayes), the **Random Forest model** produces more contextually appropriate words than the **Naive Bayes model**, although both scores are very low in absolute terms.

The Naive Bayes model tends to predict common stopwords. This behavior is likely due to the model's reliance on word occurrence frequencies: stopwords such as `and`, `the`, and `a` are the most frequent targets in the dataset, so Naive Bayes assigns them higher probabilities.

In contrast, the Random Forest model produces more varied, topic-related words, likely because it captures more complex patterns and relationships in the feature space beyond just frequency. Its output is still far from grammatical, so this comparison is relative rather than a sign of real fluency.
