# Basic NLP Tutorial: Sentiment Analysis with Preprocessing and Bag of Words

This tutorial will walk you through:
- Text preprocessing: tokenization, stopword removal, stemming, and lemmatization
- Vectorization using Bag of Words (unigrams) and n-grams (bigrams)
- Applying a classical ML algorithm (Multinomial Naive Bayes) for sentiment classification
- Comparing accuracies and understanding the impact of preprocessing and vectorization

In [5]:
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

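# Recent NLTK releases load the tokenizer from 'punkt_tab'; download that
# as well if word_tokenize raises a LookupError.
nltk.download('punkt')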
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ARITRA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Sample Dataset

We'll use a tiny dataset similar to IMDB movie reviews, labeled with positive (1) or negative (0) sentiment.


In [6]:
data = {
    "text": [
        "I loved the movie, it was fantastic!",
        "What a terrible film. I hated it.",
        "An excellent movie with a great story.",
        "Worst movie I have ever seen.",
        "Absolutely wonderful experience, highly recommended!",
        "Horrible acting and bad direction.",
        "The plot was dull and boring.",
        "A masterpiece. Beautifully made and touching.",
        "Terrible! Do not waste your time.",
        "Brilliant performance and amazing visuals."
    ],
    "label": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # 1 = Positive, 0 = Negative
}

df = pd.DataFrame(data)
df.head()


| | text | label |
|---|---|---|
| 0 | I loved the movie, it was fantastic! | 1 |
| 1 | What a terrible film. I hated it. | 0 |
| 2 | An excellent movie with a great story. | 1 |
| 3 | Worst movie I have ever seen. | 0 |
| 4 | Absolutely wonderful experience, highly recomm... | 1 |


## Preprocessing Setup

We will:
- convert text to lowercase
- tokenize it into words
- drop punctuation and other non-alphabetic tokens
- remove stopwords (common words like "the", "is", "and")
- either stem or lemmatize the remaining words

Let's define a preprocessing function.


In [8]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text, method='stem'):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    
    if method == 'stem':
        tokens = [stemmer.stem(word) for word in tokens]
    elif method == 'lemm':
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(tokens)
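
To see what the first stages of this function do, here is a minimal, dependency-free sketch of the lowercase, tokenize, and stopword-filter steps. It is an illustration only: `simple_tokenize` and `TINY_STOPWORDS` are hypothetical stand-ins (a regular expression and a six-word list) for NLTK's `word_tokenize` and its much longer English stopword list, so results on other sentences may differ from `preprocess_text`.

```python
import re

# Hypothetical stand-in for illustration; NLTK's English stopword list is much longer.
TINY_STOPWORDS = {"i", "the", "it", "was", "a", "and"}

def simple_tokenize(text):
    """Lowercase the text and extract runs of letters (a crude stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stop_words=TINY_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

sentence = "I loved the movie, it was fantastic!"
tokens = simple_tokenize(sentence)
filtered = remove_stopwords(tokens)

print(tokens)    # every word, punctuation already gone
print(filtered)  # only the content words remain
```

The stemming or lemmatization step would then be applied to `filtered` before the tokens are joined back into a string.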


## Step 1: Stemming Output

Let's preprocess the text using stemming and print the results.

In [10]:
stemmed_texts = df['text'].apply(lambda x: preprocess_text(x, method='stem'))
for i, sent in enumerate(stemmed_texts, 1):
    print(f"{i}. {sent}")


1. love movi fantast
2. terribl film hate
3. excel movi great stori
4. worst movi ever seen
5. absolut wonder experi highli recommend
6. horribl act bad direct
7. plot dull bore
8. masterpiec beauti made touch
9. terribl wast time
10. brilliant perform amaz visual


**Explanation:**

- Words are reduced to stems by the Porter stemmer, which strips suffixes using fixed rules.
- For example, "loved" → "love", "fantastic" → "fantast", "horrible" → "horribl".
- The resulting stems are often not real words, but related forms collapse into one token, which reduces vocabulary size.
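
The vocabulary-shrinking effect is easier to see on a few isolated words. The sketch below is a small illustration, not part of the pipeline: it runs the same `PorterStemmer` on words from the sample reviews, then on three inflections of one word (the `variants` list is made up for this example).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sample reviews; the stems match the output above.
words = ["loved", "fantastic", "horrible", "movie"]
stems = [stemmer.stem(word) for word in words]
for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")

# Different inflections of the same word share one stem,
# which is how stemming reduces the number of distinct tokens.
variants = ["loved", "loving", "loves"]  # illustrative list, not from the dataset
unique_stems = {stemmer.stem(word) for word in variants}
print(unique_stems)
```

The stemmer needs no downloaded corpus, so this runs with only the `nltk` package installed.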

## Step 2: Lemmatization Output

Now let's preprocess using lemmatization instead, which returns valid base forms of words.

In [13]:
lemm_texts = df['text'].apply(lambda x: preprocess_text(x, method='lemm'))
for i, sent in enumerate(lemm_texts, 1):
    print(f"{i}. {sent}")

1. loved movie fantastic
2. terrible film hated
3. excellent movie great story
4. worst movie ever seen
5. absolutely wonderful experience highly recommended
6. horrible acting bad direction
7. plot dull boring
8. masterpiece beautifully made touching
9. terrible waste time
10. brilliant performance amazing visuals


**Explanation:**

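- Lemmatization maps each word to a dictionary form (its lemma), and the correct lemma depends on the word's part of speech.
- `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, and `preprocess_text` does not pass one. That is why verb forms such as "loved", "hated", and "recommended" come through unchanged in the output above, whereas a plural noun such as "movies" would be reduced to "movie".
- The result keeps natural, readable word forms, but without part-of-speech information it normalizes far less than stemming does.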


## Step 3A: Bag of Words (Unigram) Vectorization on Lemmatized Text

Let's create a vocabulary of all words and vectorize the sentences with unigram counts.


In [15]:
vectorizer_uni = CountVectorizer(ngram_range=(1,1))
X_uni = vectorizer_uni.fit_transform(lemm_texts)
vocab_uni = vectorizer_uni.get_feature_names_out()
print(f"Vocabulary (unigram) sample:\n{vocab_uni[:15]} ...\n")

print("First sentence after lemmatization:")
print(f"\"{lemm_texts[0]}\"")

vector_0_uni = X_uni.toarray()[0]
print("Vector representation (counts) of first sentence:")
print(vector_0_uni)

Vocabulary (unigram) sample:
['absolutely' 'acting' 'amazing' 'bad' 'beautifully' 'boring' 'brilliant'
 'direction' 'dull' 'ever' 'excellent' 'experience' 'fantastic' 'film'
 'great'] ...

First sentence after lemmatization:
"loved movie fantastic"
Vector representation (counts) of first sentence:
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]


**Explanation:**

- Vocabulary lists unique words after preprocessing.
- The vector for a sentence counts how many times each word appears.
- In the first sentence, 'fantastic', 'loved', and 'movie' each have a count of 1 (zero-based positions 12, 18, and 21 in the alphabetically sorted vocabulary).
- Other vocab positions are zero since those words don't appear in this sentence.
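
The counting itself is simple enough to write by hand. The following is a rough sketch of the idea using only the standard library, on three of the lemmatized reviews; it is not how `CountVectorizer` is implemented (which, among other things, ignores single-character tokens by default), and `to_counts` is a name invented for this example.

```python
from collections import Counter

# Three of the lemmatized reviews from the output above.
docs = [
    "loved movie fantastic",
    "terrible film hated",
    "worst movie ever seen",
]

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_counts(doc, vocab) for doc in docs]

print(vocab)
print(vectors[0])  # counts for "loved movie fantastic"
```

Each vector has one slot per vocabulary word, so its length grows with the vocabulary while most entries stay zero.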

## Step 3B: Bigram (Unigrams + Bigrams) Vectorization on Lemmatized Text

Now let's vectorize using both unigrams and bigrams (pairs of consecutive words).

In [17]:
vectorizer_bi = CountVectorizer(ngram_range=(1,2))
X_bi = vectorizer_bi.fit_transform(lemm_texts)
vocab_bi = vectorizer_bi.get_feature_names_out()
print(f"Vocabulary (unigrams + bigrams) sample:\n{vocab_bi[:20]} ...\n")

print("First sentence after lemmatization:")
print(f"\"{lemm_texts[0]}\"")

vector_0_bi = X_bi.toarray()[0]
print("Vector representation (counts) of first sentence with bigrams:")
print(vector_0_bi)

Vocabulary (unigrams + bigrams) sample:
['absolutely' 'absolutely wonderful' 'acting' 'acting bad' 'amazing'
 'amazing visuals' 'bad' 'bad direction' 'beautifully' 'beautifully made'
 'boring' 'brilliant' 'brilliant performance' 'direction' 'dull'
 'dull boring' 'ever' 'ever seen' 'excellent' 'excellent movie'] ...

First sentence after lemmatization:
"loved movie fantastic"
Vector representation (counts) of first sentence with bigrams:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


**Explanation:**

- Vocabulary now includes single words *and* two-word sequences (bigrams).
- The vector counts appearances of unigrams and bigrams.
- For example, the bigrams 'loved movie' and 'movie fantastic' appear once each in the first sentence.
- This adds phrase-level context beyond single words.
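
To make the feature set for `ngram_range=(1, 2)` concrete, here is a minimal sketch that lists the unigrams and bigrams of one preprocessed sentence. `word_ngrams` is a hypothetical helper written for this example; it splits on whitespace only and is not scikit-learn's own tokenizer.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """List every run of n_min..n_max consecutive words, shortest runs first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("loved movie fantastic")
print(features)
```

The sentence yields three unigrams and two bigrams, which corresponds to the five non-zero entries in the vector printed above.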


## Step 4: Model Training and Accuracy Comparison

Let's train Multinomial Naive Bayes models on all four setups and compare accuracies ("Bigram" below means unigrams plus bigrams, i.e. `ngram_range=(1, 2)`):
1. Stem + Unigram
2. Stem + Bigram
3. Lemmatize + Unigram
4. Lemmatize + Bigram

In [19]:
results = []

for method in ['stem', 'lemm']:
    processed_texts = df['text'].apply(lambda x: preprocess_text(x, method=method))
    for ngram in [(1, 1), (1, 2)]:
        vectorizer = CountVectorizer(ngram_range=ngram)
        X = vectorizer.fit_transform(processed_texts)
        y = df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        model = MultinomialNB()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append((method, ngram, acc))

for method, ngram, acc in results:
    ngram_str = f"{ngram[0]}-{ngram[1]}"
    print(f"Accuracy using {method} + ngram {ngram_str}: {acc:.4f}")

Accuracy using stem + ngram 1-1: 0.3333
Accuracy using stem + ngram 1-2: 1.0000
Accuracy using lemm + ngram 1-1: 0.3333
Accuracy using lemm + ngram 1-2: 1.0000
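
When reading these numbers, it helps to know how coarse the metric is here. `test_size=0.3` on ten reviews leaves three reviews in the test set (consistent with the 0.3333 scores above), so accuracy can take only four values. The short calculation below lists them; `n_test` is hard-coded from that reasoning rather than read from scikit-learn.

```python
n_test = 3  # 30% of the 10 reviews are held out for testing

# Accuracy = correct predictions / test-set size, for 0..n_test correct predictions.
possible_accuracies = [round(correct / n_test, 4) for correct in range(n_test + 1)]
print(possible_accuracies)
```

A jump from 0.3333 to 1.0000 therefore corresponds to just two more reviews being classified correctly.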


**Observations:**

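- Stemming and lemmatization give identical accuracy here (0.3333 with unigrams, 1.0000 with unigrams + bigrams), so this experiment provides no evidence that either one is better.
- Adding bigrams raised accuracy from 0.3333 to 1.0000, but the test set holds only 3 of the 10 reviews, so that gap is a difference of two predictions and should not be read as a general result.
- Many words occur in only one review, and the vectorizer is fitted on all ten reviews before the split, so these scores are better treated as a demonstration of the workflow than as a measurement.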

This illustrates how preprocessing and feature extraction choices can change a model's results; measuring their real effect requires a much larger dataset.
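
One way to get a somewhat steadier estimate from so little data is cross-validation with the vectorizer placed inside a pipeline, so that each fold builds its vocabulary from the training reviews only. The sketch below is one possible setup, not the tutorial's method: it uses scikit-learn's built-in `stop_words="english"` list in place of the NLTK preprocessing, and the five-fold split is an arbitrary choice for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "I loved the movie, it was fantastic!",
    "What a terrible film. I hated it.",
    "An excellent movie with a great story.",
    "Worst movie I have ever seen.",
    "Absolutely wonderful experience, highly recommended!",
    "Horrible acting and bad direction.",
    "The plot was dull and boring.",
    "A masterpiece. Beautifully made and touching.",
    "Terrible! Do not waste your time.",
    "Brilliant performance and amazing visuals.",
]
labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

# Five stratified folds: each test fold holds one positive and one negative review.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for ngram_range in [(1, 1), (1, 2)]:
    # The pipeline refits the vectorizer on the training part of every fold.
    # stop_words="english" is an assumption standing in for the NLTK steps.
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range, stop_words="english"),
        MultinomialNB(),
    )
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    print(f"ngram_range={ngram_range}: mean accuracy {scores.mean():.4f} over {len(scores)} folds")
```

Even with cross-validation, ten reviews are too few for the scores to mean much; the point of the sketch is the structure, which carries over unchanged to a larger dataset.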