# Basic Models in Natural Language Processing

In this notebook, we will explore foundational models and techniques used in NLP, including:
- N-grams
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Naive Bayes

We will implement these techniques step-by-step, ensuring a deep understanding of their mechanics and applications.

In [176]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import nltk
from nltk.util import ngrams
from collections import Counter, defaultdict
from sklearn.pipeline import Pipeline

In [2]:
# nltk.download('punkt')

# N-grams

N-grams are contiguous sequences of `n` items (here, word tokens) from a given text.

In [3]:
sentence = 'I love programming in Python.'
tokens = nltk.word_tokenize(sentence)
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print('Bigrams:', bigrams)
print('Trigrams:', trigrams)

Bigrams: [('I', 'love'), ('love', 'programming'), ('programming', 'in'), ('in', 'Python'), ('Python', '.')]
Trigrams: [('I', 'love', 'programming'), ('love', 'programming', 'in'), ('programming', 'in', 'Python'), ('in', 'Python', '.')]
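
The sliding-window idea behind n-grams can also be sketched without NLTK. The helper name `make_ngrams` below is hypothetical (it is not part of NLTK), and the token list is typed in by hand to mirror the tokenizer output above.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (hypothetical helper)."""
    # Build n staggered views of the token list; zip stops at the shortest view,
    # so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["I", "love", "programming", "in", "Python", "."]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, so asking for an `n` larger than the sentence returns an empty list.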


In [4]:
nltk.download('words')

[nltk_data] Downloading package words to /Users/chen.m/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## Create An Unwise Language Model

In [5]:
some_list = [1, 2, 1, 3, 4, 1, 2, 3, 4, 2, 1]

In [6]:
c = Counter()
c.update(some_list)
c

Counter({1: 4, 2: 3, 3: 2, 4: 2})

In [7]:
not_default_dict = {}

In [8]:
type(not_default_dict)

dict

In [9]:
not_default_dict['a'] = 4

In [10]:
not_default_dict['a']

4

In [11]:
not_default_dict['b']

KeyError: 'b'

In [12]:
default_dict = defaultdict(int)

In [184]:
default_dict['b']

0
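
`Counter` and `defaultdict` can be combined into a nested counting structure. The snippet below is a minimal sketch with made-up `(word, next_word)` pairs; the variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Outer dict: a missing key gets a fresh, empty Counter automatically
follow_counts = defaultdict(Counter)

pairs = [("is", "a"), ("is", "played"), ("is", "a")]  # made-up (word, next_word) pairs
for word, next_word in pairs:
    # No KeyError the first time either key is seen
    follow_counts[word][next_word] += 1

print(follow_counts["is"])
print(follow_counts["is"].most_common(1))
```

Checking membership with `in` does not create a key, whereas indexing a missing key (`follow_counts["x"]`) does.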

### Instructions:
1. Create a function that returns a dictionary mapping each word to the counts of the words that follow it (its bigrams).
For example, if the corpus is:
'Football is very very cool'
the dict will be
{football: {is: 1}, is: {very: 1}, very: {very: 1, cool: 1}}
2. Create another function that, for a given word, predicts the word that most commonly follows it.
For example, if the word 'a' appears in multiple sentences and the word that most often comes after 'a' (across the whole corpus) is 'banana', return 'banana'.
3. Adjust the function to repeatedly (recursively or in a loop) append the next word according to the logic of step 2 until some stop condition is met. For example, 'a' --> 'a banana' --> 'a banana is' --> 'a banana is yellow'

In [164]:
corpus = [
    "Football is a popular sport",
    "Basketball requires great skill and teamwork",
    "Tennis is played on a court with a racket and ball",
    "Cricket is a game with batsmen and bowlers",
    "Baseball is a bat-and-ball game",
    "Soccer is known as football in many countries",
    "Golf is played on a course with clubs and a ball",
    "Hockey is played on ice or field",
    "Volleyball is a game of strategy and power",
    "Running is a fundamental part in many sports",
    "Swimming requires strength and endurance",
    "The capital of France is Paris"
]

### Solution

In [166]:
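def build_bigram_model(corpus):
    # Map each word to a Counter of the words that follow it
    bigram_model = defaultdict(Counter)
    for sentence in corpus:  # corpus is a list of sentence strings
        words = sentence.lower().split()  # Split sentence into words
        for i in range(len(words) - 1):  # Create bigrams
            bigram_model[words[i]][words[i + 1]] += 1
    return bigram_model

# Build the model used by the cells below
bigram_model = build_bigram_model(corpus)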

In [167]:
STOP_WORD = 'No prediction available'

In [168]:
def predict_next_word(word, bigram_model):
    # A word that is never followed by another word has no entry in the model
    if word not in bigram_model:
        return STOP_WORD
    # most_common(1) returns [(next_word, count)]
    next_word = bigram_model[word].most_common(1)[0][0]
    return next_word

In [169]:
def return_lm_response(query, max_new_words=20):
    # Extend the query one word at a time with the most common continuation
    response = query.split()
    if not response:
        return ''
    # Cap the length: bigram chains can cycle (e.g. 'a' -> 'game' -> 'with' -> 'a')
    for _ in range(max_new_words):
        predicted_word = predict_next_word(response[-1].lower(), bigram_model)
        if predicted_word == STOP_WORD:  # Stop: the last word has no known successor
            break
        response.append(predicted_word)
    return ' '.join(response)

In [170]:
current_word = "the"
predicted_word = predict_next_word(current_word, bigram_model)
print(f"Next word for '{current_word}': {predicted_word}")

Next word for 'the': capital


In [171]:
current_word = "known"
predicted_word = predict_next_word(current_word, bigram_model)
print(f"Next word for '{current_word}': {predicted_word}")

Next word for 'known': as


In [172]:
return_lm_response('Running is a fundamental')

'Running is a fundamental part in many countries'
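
Bigram counts can also be read as estimated probabilities of the next word rather than only as a "most common" lookup. The sketch below assumes a small made-up corpus and a hypothetical helper `next_word_probabilities`; it divides each count by the total number of continuations seen for that word.

```python
from collections import Counter, defaultdict

# Made-up sentences for illustration
toy_corpus = ["football is a game", "tennis is a sport", "golf is played outdoors"]

counts = defaultdict(Counter)
for sentence in toy_corpus:
    words = sentence.lower().split()
    # Pair each word with the word that follows it
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

def next_word_probabilities(word, counts):
    """Relative frequency of each word seen after `word` (hypothetical helper)."""
    followers = counts.get(word, Counter())  # .get avoids creating a new key
    total = sum(followers.values())
    return {nxt: c / total for nxt, c in followers.items()}

probs = next_word_probabilities("is", counts)
print(probs)
```

Here 'is' is followed by 'a' twice and by 'played' once, so the estimates are 2/3 and 1/3. A word with no recorded continuation returns an empty dictionary.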

# TF-IDF - See Separate Notebook

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.

In [None]:
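# X_train and X_test (collections of raw text documents) are assumed to be defined in the separate notebook
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)  # Learn vocabulary and IDF weights from the training text
X_test_tfidf = tfidf_vectorizer.transform(X_test)  # Reuse the fitted vocabulary and IDF weights
print('TF-IDF Feature Names:', tfidf_vectorizer.get_feature_names_out())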
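
To make the measure concrete, here is a minimal sketch of one textbook variant of TF-IDF in plain Python, where tf is the term count divided by the document length and idf is log(N / df). The mini corpus and the helper `tf_idf` are made up for illustration. `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [  # made-up mini corpus
    "football is a popular sport",
    "tennis is played on a court",
    "the capital of france is paris",
]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook variant: tf = count / document length, idf = log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

last_doc = tokenized[2]
print("is:", tf_idf("is", last_doc))
print("paris:", tf_idf("paris", last_doc))
```

Because "is" occurs in every document, its IDF is log(1) = 0 and its score vanishes, while "paris" appears in only one document and receives a positive weight.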

# Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem.

See separate notebook

## Naive Bayes for Sentiment Analysis in NLP

Naive Bayes is a classification algorithm based on Bayes' Theorem. It’s called "naive" because it assumes that all features (e.g., words in a text) are independent of each other given the class, which is often not true but simplifies the calculations.

## Objectives
1. Understand Naive Bayes and how it works.
2. Perform text preprocessing.
3. Train a Naive Bayes classifier on a dataset of reviews.
4. Predict sentiment for new reviews.
    

## The Formula
The formula for Bayes' Theorem is:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

### Explanation:
- **$P(A|B)$:** Probability of $A$ (class, e.g., "Spam") given $B$ (evidence, e.g., a specific word in an email).
- **$P(B|A)$:** Probability of observing $B$ (e.g., word) given $A$ (e.g., class "Spam").
- **$P(A)$:** Prior probability of $A$ (how common is "Spam" in the dataset?).
- **$P(B)$:** Probability of $B$ (how common is that word in all emails?).
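
A small numeric sketch of the formula follows. All probabilities are made-up numbers chosen for illustration, not estimates from any dataset.

```python
# Made-up numbers for illustration only
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.5   # P(B|A): probability the word appears in a spam email
p_word_given_ham = 0.05   # probability the word appears in a non-spam email

# P(B): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

With these numbers, seeing the word raises the probability of spam from the prior of 0.2 to roughly 0.714.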

## Dataset

In [2]:
reviews = [
    "I love this movie",
    "I hate this movie",
    "This movie is great",
    "This movie is bad"
]
labels = ["Positive", "Negative", "Positive", "Negative"]

df = pd.DataFrame({"Review": reviews, "Sentiment": labels})
df    

|   | Review | Sentiment |
|---|---|---|
| 0 | I love this movie | Positive |
| 1 | I hate this movie | Negative |
| 2 | This movie is great | Positive |
| 3 | This movie is bad | Negative |



## Preprocessing Text

Naive Bayes requires numeric input. We will use **CountVectorizer** to convert text into a Bag of Words representation.

### What is Bag of Words (BoW)?
BoW represents each text as a vector of word counts over a fixed vocabulary, ignoring word order.
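
The representation can be sketched by hand with a `Counter`. The helper `bag_of_words` is hypothetical and uses simple whitespace tokenization. `CountVectorizer` tokenizes differently: by default it drops single-character tokens such as "I".

```python
from collections import Counter

reviews = ["I love this movie", "This movie is great"]

# Vocabulary: every distinct lowercased token, in sorted order
vocabulary = sorted({token for review in reviews for token in review.lower().split()})

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

print(vocabulary)
for review in reviews:
    print(bag_of_words(review, vocabulary))
```

Each review becomes a vector with one entry per vocabulary term, which is the numeric input a classifier can work with.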

## Training a Naive Bayes Classifier

In [3]:
model = Pipeline([
    ('vectorizer', CountVectorizer()),  # Convert text to Bag of Words
    ('classifier', MultinomialNB())    # Train Naive Bayes classifier
])
model.fit(reviews, labels)  

## Making Predictions

In [4]:
new_review = ["I love this movie"]
prediction = model.predict(new_review)

print(f"Review: {new_review[0]}")
print(f"Predicted Sentiment: {prediction[0]}")    

Review: I love this movie
Predicted Sentiment: Positive
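
To see where such a prediction can come from, the sketch below scores each class by hand using multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions (whitespace tokenization, hypothetical helpers `log_score` and `predict`), not a reproduction of scikit-learn's internals.

```python
import math
from collections import Counter, defaultdict

reviews = ["I love this movie", "I hate this movie", "This movie is great", "This movie is bad"]
labels = ["Positive", "Negative", "Positive", "Negative"]

word_counts = defaultdict(Counter)  # class -> word -> count
class_counts = Counter(labels)      # class -> number of reviews
for review, label in zip(reviews, labels):
    word_counts[label].update(review.lower().split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(labels))
    for word in text.lower().split():
        if word not in vocabulary:
            continue  # skip words never seen in training
        score += math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocabulary)))
    return score

def predict(text):
    # Choose the class with the highest log score
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict("I love this movie"))
```

The word "love" occurs only in a positive review, so its smoothed probability is higher under "Positive"; the other words are equally frequent in both classes, which tips the score toward "Positive".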


## Evaluate the Model

In [5]:
train_predictions = model.predict(reviews)  # Predictions on the same reviews the model was trained on

df['Predicted Sentiment'] = train_predictions
df    

|   | Review | Sentiment | Predicted Sentiment |
|---|---|---|---|
| 0 | I love this movie | Positive | Positive |
| 1 | I hate this movie | Negative | Negative |
| 2 | This movie is great | Positive | Positive |
| 3 | This movie is bad | Negative | Negative |
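
The comparison in the table can be summarized as an accuracy score. The sketch below copies the labels and predictions from the table by hand rather than calling the model.

```python
labels = ["Positive", "Negative", "Positive", "Negative"]
train_predictions = ["Positive", "Negative", "Positive", "Negative"]  # copied from the table above

# Fraction of reviews whose predicted sentiment matches the true label
accuracy = sum(pred == true for pred, true in zip(train_predictions, labels)) / len(labels)
print(accuracy)
```

This is accuracy on the training data, which tends to be optimistic; a fairer estimate would require reviews the model has not seen during training.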



## Key Takeaways

1. Naive Bayes is a simple but effective algorithm for text classification.
2. It assumes that features are independent given the class, an assumption that rarely holds, yet the classifier often works well in practice.
3. Preprocessing text data (like tokenization and Bag of Words) is essential for working with text.
    