# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Set Up the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [38]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [39]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!

True

In [40]:
import nltk

packages = [
    "punkt",                    
    "stopwords",
    "wordnet",
    "omw-1.4",                   
    "averaged_perceptron_tagger", 
    "averaged_perceptron_tagger_eng",  
]

for p in packages:
    try:
        nltk.download(p, quiet=False)
    except Exception as e:
        print(f"Could not download {p}: {e}")

# If your data installs under AppData\Roaming, make sure NLTK looks there too
nltk.data.path.append(r"C:/Users/macat/AppData/Roaming/nltk_data")
print("NLTK paths:", nltk.data.path)

NLTK paths: ['C:\\Users\\macat/nltk_data', 'c:\\Users\\macat\\anaconda3\\envs\\IronHack1\\nltk_data', 'c:\\Users\\macat\\anaconda3\\envs\\IronHack1\\share\\nltk_data', 'c:\\Users\\macat\\anaconda3\\envs\\IronHack1\\lib\\nltk_data', 'C:\\Users\\macat\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data', 'C:/Users/macat/AppData/Roaming/nltk_data', 'C:/Users/macat/AppData/Roaming/nltk_data', 'C:/Users/macat/AppData/Roaming/nltk_data', 'C:/Users/macat/AppData/Roaming/nltk_data']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_percep


## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [5]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [6]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
tokens = word_tokenize(text)

Remove stop words and store the result in a variable called `filtered_tokens`.

In [7]:
stop_words = set(stopwords.words('english'))


In [8]:
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

In [9]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
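The `word.lower()` call in the filter matters because NLTK's English stop word list is all lowercase. The sketch below illustrates this without NLTK: the token list is written out by hand to mirror what `word_tokenize` returns for the text above, and `stop_subset` is a small hand-picked subset of the stop word list, used here only as an assumption for illustration.

```python
# Tokens written out by hand so the snippet runs without NLTK data.
tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a',
          'fascinating', 'field', 'of', 'study', '!', 'It', 'involves',
          'analyzing', 'and', 'understanding', 'human', 'language', '.']

# Hypothetical mini stop list; the real list is stopwords.words('english').
stop_subset = {"is", "a", "of", "it", "and"}

# Comparing the raw token misses capitalized stop words.
case_sensitive = [t for t in tokens if t not in stop_subset]
# Comparing the lowercased token catches them.
case_insensitive = [t for t in tokens if t.lower() not in stop_subset]

leaked = [t for t in case_sensitive if t not in case_insensitive]
print("Stop words missed without .lower():", leaked)
```

With this subset, only the sentence-initial "It" slips through the case-sensitive filter.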


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [10]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [11]:
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

In [12]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [13]:
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]


In [14]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
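To make the comparison concrete, the sketch below lines up the three printed lists side by side. The lists are copied from the outputs above, so nothing is recomputed. Two things probably explain what you see: `PorterStemmer` lowercases its input by default, and `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb forms such as "involves" come back unchanged.

```python
# Lists copied from the printed outputs above.
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
stemmed_tokens = ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']
lemmatized_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']

rows = list(zip(filtered_tokens, stemmed_tokens, lemmatized_tokens))

# Print a simple aligned table: token, stem, lemma.
print(f"{'token':<15}{'stem':<15}{'lemma':<15}")
for token, stem, lemma in rows:
    print(f"{token:<15}{stem:<15}{lemma:<15}")

# Which tokens did each technique actually change?
changed_by_stemmer = [t for t, s, _ in rows if s != t]
changed_by_lemmatizer = [t for t, _, l in rows if l != t]
print("Changed by stemmer:", len(changed_by_stemmer))
print("Changed by lemmatizer:", len(changed_by_lemmatizer))
```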


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [15]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [16]:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()


In [17]:
X = vectorizer.fit_transform(corpus)

In [18]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
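Notice that "I" is missing from the vocabulary. By default `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`). The following is a minimal pure-Python sketch of the same counting, meant only to show what the matrix contains and not how scikit-learn implements it; `tokenize` is a hypothetical helper.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Same pattern as CountVectorizer's default: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and extract tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Columns are the sorted unique tokens across the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Each row counts how often every vocabulary term occurs in one document.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocabulary])

print("Vocabulary:", vocabulary)
for row in bow:
    print(row)
```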


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()


In [20]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [21]:
print("The dictionary contains", len(tfidf_vectorizer.vocabulary_), "words")
print(tfidf_vectorizer.vocabulary_)

The dictionary contains 9 words
{'love': 5, 'nlp': 7, 'is': 3, 'amazing': 0, 'enjoy': 1, 'learning': 4, 'new': 6, 'things': 8, 'in': 2}


In [22]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
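The numbers above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, multiplies it by the raw term count, and then scales each row to unit L2 length. The sketch below follows that recipe in plain Python. It is an illustration of the formula, not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Tokenize as the vectorizer does by default: lowercase, 2+ word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF, as used when smooth_idf=True (the default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf = []
for doc in tokenized:
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocabulary]
    # L2-normalize the row so its Euclidean length is 1.
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf.append([w / norm for w in weights])

for row in tfidf:
    print([round(w, 4) for w in row])
```

Because "nlp" appears in all three documents, its IDF is the minimum value of 1, which is why it gets the smallest weight in every row.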


### **Exercise 5: N-grams**

Generate **bigrams (2-grams)** from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]


In [24]:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

In [25]:
X_bigram = bigram_vectorizer.fit_transform(corpus)

In [26]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
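A bigram is simply a pair of adjacent tokens. The sketch below builds the same bigram vocabulary in plain Python by pairing each token with its successor; `word_bigrams` is a hypothetical helper. Since single-character tokens are dropped before pairing (the default `token_pattern` again), a bigram such as "i love" never appears.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

def word_bigrams(doc):
    """Return the adjacent token pairs of a document as space-joined strings."""
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    # zip(tokens, tokens[1:]) pairs every token with the one that follows it.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams_per_doc = [word_bigrams(doc) for doc in corpus]
bigram_vocabulary = sorted({b for doc in bigrams_per_doc for b in doc})

for doc, bigrams in zip(corpus, bigrams_per_doc):
    print(repr(doc), "->", bigrams)
print("Bigram vocabulary:", bigram_vocabulary)
```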


In [27]:
import pandas as pd
vocab_dict = tfidf_vectorizer.vocabulary_
vocab_df = pd.DataFrame(list(vocab_dict.items()), columns=["Word", "Index"])
vocab_df = vocab_df.sort_values(by="Word").reset_index(drop=True)
print("Vocabulary Table:")
print(vocab_df)

Vocabulary Table:
       Word  Index
0   amazing      0
1     enjoy      1
2        in      2
3        is      3
4  learning      4
5      love      5
6       new      6
7       nlp      7
8    things      8


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [41]:
import nltk
nltk.data.path.append("C:/Users/macat/AppData/Roaming/nltk_data")

In [42]:
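def get_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        # Fall back to NOUN, which is also the lemmatizer's own default
        return wordnet.NOUN

def text_preprocessing_pipeline(text):
    """Tokenize, drop stop words and punctuation, then lemmatize using POS tags."""
    tokens = word_tokenize(text)
    # Compare in lowercase so capitalized stop words such as "It" are removed too
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Note: `in string.punctuation` is a substring test, so multi-character
    # tokens such as "..." are kept
    tokens = [word for word in tokens if word not in string.punctuation]
    # Tag the remaining tokens, then lemmatize each one with its mapped POS
    lemmatizer = WordNetLemmatizer()
    pos_tags = pos_tag(tokens)
    lemmatized_tokens = [
        lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags
    ]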

    return lemmatized_tokens
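One detail of the punctuation step is worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. It removes single punctuation characters, but multi-character tokens such as `...` or the quote tokens that `word_tokenize` can emit pass through. The sketch below contrasts it with a character-wise check. The token list and the `is_punctuation` helper are hypothetical and used only for illustration.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Hypothetical token list with single- and multi-character punctuation.
tokens = ["Wait", "...", "``", "really", "''", "?", "!"]

# Substring test, as in the pipeline: only single characters are dropped.
substring_filter = [t for t in tokens if t not in string.punctuation]
# Character-wise test: multi-character punctuation is dropped as well.
charwise_filter = [t for t in tokens if not is_punctuation(t)]

print("Substring test keeps:     ", substring_filter)
print("Character-wise test keeps:", charwise_filter)
```

Whether the stricter filter is preferable depends on the data: for tweets it would also drop emoticons such as `:)`, which can carry sentiment.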


Apply this function to the following text:

In [47]:
from nltk import pos_tag, word_tokenize

text = "Natural Language Processing is fun."
tokens = word_tokenize(text)
tags = pos_tag(tokens)

print("Tokens:", tokens)
print("POS Tags:", tags)

Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '.']
POS Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fun', 'NN'), ('.', '.')]
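The tags printed above are Penn Treebank tags, while the lemmatizer expects one of WordNet's four single-letter classes. `get_wordnet_pos` bridges the two by looking at the first letter of the tag. The sketch below shows an equivalent dictionary-based mapping applied to the tags from the output. It uses the literal letters behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` so that it runs without the WordNet data; `penn_to_wordnet` is a hypothetical name.

```python
# WordNet POS constants are single letters: ADJ "a", VERB "v", NOUN "n", ADV "r".
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter via its first character."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

# Tags copied from the output above.
tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
        ('is', 'VBZ'), ('fun', 'NN'), ('.', '.')]

mapped = [(word, tag, penn_to_wordnet(tag)) for word, tag in tags]
for word, tag, wn_pos in mapped:
    print(f"{word:<12}{tag:<6}{wn_pos}")
```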


In [48]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
processed_text = text_preprocessing_pipeline(text)


In [49]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinate', 'field', 'study', 'involve', 'analyze', 'understand', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [51]:
sentence = "The cats are playing with the mice in the garden."
stop_words = set(stopwords.words("english"))

# Tokenize, then drop stop words and single-character punctuation
tokens = word_tokenize(sentence)
filtered_tokens = [w for w in tokens if w.lower() not in stop_words and w not in string.punctuation]

# Stemming: rule-based suffix stripping
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(w) for w in filtered_tokens]

# Lemmatization: dictionary lookup guided by each token's POS tag
lemmatizer = WordNetLemmatizer()
pos_tags = pos_tag(filtered_tokens)
lemmatized_tokens = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tags]




In [52]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'play', 'mouse', 'garden']
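The only disagreement is "mice": the stemmer leaves it alone, while the lemmatizer returns "mouse". A likely reason is that stemming strips suffixes by rule, and "mice" has no suffix to strip, whereas lemmatization looks words up in a dictionary that knows irregular forms. The toy sketch below mimics that difference. `toy_stem` is a deliberately crude stand-in and not the Porter algorithm, and `IRREGULAR` is a hypothetical one-entry lookup standing in for WordNet.

```python
def toy_stem(word):
    """Strip a couple of common suffixes. A toy stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical one-entry lookup standing in for WordNet's irregular forms.
IRREGULAR = {"mice": "mouse"}

def toy_lemmatize(word):
    """Check the irregular-form lookup first, then fall back to suffix rules."""
    return IRREGULAR.get(word, toy_stem(word))

tokens = ['cats', 'playing', 'mice', 'garden']
toy_stems = [toy_stem(t) for t in tokens]
toy_lemmas = [toy_lemmatize(t) for t in tokens]

print("Toy stems: ", toy_stems)
print("Toy lemmas:", toy_lemmas)
```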


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [53]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


True

In [54]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [55]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels` (1 for positive, 0 for negative).

In [56]:
all_tweets = positive_tweets + negative_tweets
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)



In [57]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [58]:
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

In [59]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['FollowFriday', 'France_Inte', 'PKuchly57', 'Milipol_Paris', 'top', 'engage', 'member', 'community', 'week']
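The sample shows that user handles such as `France_Inte` survive as ordinary tokens, because `word_tokenize` splits off the `@` and the pipeline then discards it as punctuation. Handles and links rarely say much about sentiment, so one optional step is to strip them before tokenizing. The following is a minimal regex-based sketch of that idea. `clean_tweet` is a hypothetical helper that is not part of the lab's pipeline, and its patterns are simple assumptions and not a complete tweet grammar.

```python
import re

def clean_tweet(tweet):
    """Remove URLs and @handles and drop the '#' from hashtags (hypothetical helper)."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # links
    tweet = re.sub(r"@\w+", " ", tweet)          # user handles
    tweet = tweet.replace("#", "")               # keep the hashtag word itself
    return " ".join(tweet.split())               # collapse leftover whitespace

# Sample tweet copied from the output in Exercise 8.
sample = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
cleaned = clean_tweet(sample)
print(cleaned)
```

The cleaned string could then be passed to `text_preprocessing_pipeline` in place of the raw tweet.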


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [61]:
# 0) If preprocessed_tweets is a list of token lists, join them back to strings
texts_clean = [" ".join(tokens) for tokens in preprocessed_tweets]
# 1) Bag of Words
bow_vectorizer = CountVectorizer(ngram_range=(1,1), lowercase=False)
X_bow = bow_vectorizer.fit_transform(texts_clean)
# 2) TF‑IDF
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), lowercase=False)
X_tfidf = tfidf_vectorizer.fit_transform(texts_clean)



In [None]:
print("BoW shape:", X_bow.shape)  
print("TF‑IDF shape:", X_tfidf.shape) 
print("Sample BoW features:", bow_vectorizer.get_feature_names_out()[:15])
print("Sample TF‑IDF features:", tfidf_vectorizer.get_feature_names_out()[:15])

BoW shape: (10000, 21960)
TF‑IDF shape: (10000, 21960)
Sample BoW features: ['00' '000' '001' '00128835' '009' '00962778381838' '00YckcE7wj' '00am'
 '00kouhey00' '01' '01282' '0129anne' '01482' '02' '02079']
Sample TF‑IDF features: ['00' '000' '001' '00128835' '009' '00962778381838' '00YckcE7wj' '00am'
 '00kouhey00' '01' '01282' '0129anne' '01482' '02' '02079']
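Mixed-case features such as `00YckcE7wj` show that case is preserved: the pipeline never lowercases and both vectorizers are created with `lowercase=False`. One consequence is that differently cased spellings of the same word become separate columns. The sketch below illustrates the effect on the processed tokens from Exercise 6, where both "Language" and "language" occur. It only compares vocabulary sizes and does not use scikit-learn.

```python
# Tokens copied from the "Processed Text" output in Exercise 6.
processed_text = ['Natural', 'Language', 'Processing', 'NLP', 'fascinate', 'field',
                  'study', 'involve', 'analyze', 'understand', 'human', 'language']

# With case preserved, "Language" and "language" are distinct features.
case_sensitive_vocab = sorted(set(processed_text))
# Lowercasing first merges them into a single feature.
case_folded_vocab = sorted({tok.lower() for tok in processed_text})

print("Case-sensitive vocabulary size:", len(case_sensitive_vocab))
print("Case-folded vocabulary size:   ", len(case_folded_vocab))
```

Whether to fold case is a judgment call: it shrinks the feature space, but capitalization can itself be a signal in tweets.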


## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

