# NLP Preprocessing and Feature Extraction
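This notebook walks through common text preprocessing steps and then a few feature extraction methods:

- Tokenization
- Stopword removal
- Stemming
- Lemmatization
- Feature extraction: Bag of Words, TF-IDF, and Word2Vec embeddings

**Preprocessing uses the NLTK and spaCy libraries; feature extraction uses scikit-learn and gensim.**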

In [1]:
!pip install nltk spacy scikit-learn gensim pandas
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [7]:
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer


# Download the NLTK resources used below
nltk.download('punkt')      # tokenizer models used by word_tokenize()
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('stopwords')  # stopword lists (e.g. "the", "is", "in")
nltk.download('wordnet')    # WordNet lexical database, needed for lemmatization


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Sample Text

In [8]:
text = "I am learning Natural Language Processing, NLP is fascinating and powerful!"
print(text)

I am learning Natural Language Processing, NLP is fascinating and powerful!


## 1. Tokenization with NLTK

In [9]:
# Split the sentence into word and punctuation tokens
tokens = word_tokenize(text)
print(tokens)

['I', 'am', 'learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'is', 'fascinating', 'and', 'powerful', '!']
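
For intuition about what the tokenizer produced above, here is a rough regular-expression approximation. It is only a sketch: `simple_tokenize` is a hypothetical helper, not NLTK's algorithm, and it ignores cases that `word_tokenize` handles specially, such as contractions.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither
    # a word character nor whitespace (i.e. a punctuation mark)
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I am learning Natural Language Processing, NLP is fascinating and powerful!"
approx_tokens = simple_tokenize(sample)
print(approx_tokens)
```

For this particular sentence the result matches the NLTK output above, but the two diverge on inputs such as "don't", which the regex splits at the apostrophe.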


## 2. Stopwords Removal with NLTK

In [11]:
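# The corpus file is named 'english' (lowercase); 'English' fails on case-sensitive filesystems
stop_words = set(stopwords.words('english'))
# Compare in lowercase so capitalized stopwords such as 'I' are removed as well
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print(filtered_tokens)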

['learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'fascinating', 'powerful', '!']
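
Note that the filtered list still contains `','` and `'!'`, because punctuation marks are not stopwords. If punctuation is not wanted downstream, it can be dropped in the same pass. The sketch below assumes a tiny hand-written stopword set (`demo_stop_words`, a stand-in for NLTK's English list) so that it runs without any downloads.

```python
import string

# Assumption: a tiny hand-written stopword set standing in for NLTK's English list
demo_stop_words = {"i", "am", "is", "and"}

demo_tokens = ['I', 'am', 'learning', 'Natural', 'Language', 'Processing', ',',
               'NLP', 'is', 'fascinating', 'and', 'powerful', '!']

def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Keep a token only if it is neither a stopword nor made up entirely of punctuation
content_tokens = [
    w for w in demo_tokens
    if w.lower() not in demo_stop_words and not is_punctuation(w)
]
print(content_tokens)
```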


## 3. Stemming with NLTK

In [12]:
stemmer = PorterStemmer()
# The Porter stemmer lowercases each token and strips suffixes by rule
stems = [stemmer.stem(word) for word in filtered_tokens]
print(stems)

['learn', 'natur', 'languag', 'process', ',', 'nlp', 'fascin', 'power', '!']


## 4. Lemmatization with NLTK

In [13]:
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is given
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmas)

['learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'fascinating', 'powerful', '!']


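The lemmas above are identical to the input tokens because `lemmatize()` defaults to `pos='n'` (noun). With a verb tag, `lemmatizer.lemmatize('learning', pos='v')` returns `'learn'`.

Use stemming when speed matters more than accuracy: it strips suffixes by rule and can produce non-words such as `natur` and `languag`.

Use lemmatization when you need valid dictionary words. It is slower and works best when each token's part of speech is known.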

## 5. Tokenization and Lemmatization with spaCy

In [15]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print(doc)

# spaCy tokenizes and lemmatizes in a single pipeline pass
print([token.text for token in doc])
print([token.lemma_ for token in doc])

I am learning Natural Language Processing, NLP is fascinating and powerful!
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'is', 'fascinating', 'and', 'powerful', '!']
['I', 'be', 'learn', 'Natural', 'Language', 'Processing', ',', 'NLP', 'be', 'fascinating', 'and', 'powerful', '!']


# Feature Extraction

In [16]:
import pandas as pd

In [18]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec


# Sample Corpus

In [19]:
corpus = [
    "I love natural language processing",
    "Language models are powerful",
    "I love deep learning for NLP"
]

print("Sample Corpus:")
print(corpus)


Sample Corpus:
['I love natural language processing', 'Language models are powerful', 'I love deep learning for NLP']


# Bag of Words (BoW)

In [25]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Feature Names:", vectorizer.get_feature_names_out())

# Convert to a DataFrame for readability
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())


Feature Names: ['are' 'deep' 'for' 'language' 'learning' 'love' 'models' 'natural' 'nlp'
 'powerful' 'processing']


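|   | are | deep | for | language | learning | love | models | natural | nlp | powerful | processing |
|---|-----|------|-----|----------|----------|------|--------|---------|-----|----------|------------|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |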
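
To make the counts above less opaque, the same matrix can be rebuilt with the standard library. This is a sketch rather than scikit-learn's implementation: `bow_tokenize`, `bow_vocab`, and `bow_matrix` are illustrative names, and the regex mirrors `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters, which is why "I" never appears as a feature.

```python
import re
from collections import Counter

bow_docs = [
    "I love natural language processing",
    "Language models are powerful",
    "I love deep learning for NLP",
]

def bow_tokenize(doc):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: the sorted set of unique tokens across the whole corpus
bow_vocab = sorted({tok for doc in bow_docs for tok in bow_tokenize(doc)})

# One row per document, one count per vocabulary term (missing terms count as 0)
bow_matrix = []
for doc in bow_docs:
    counts = Counter(bow_tokenize(doc))
    bow_matrix.append([counts[term] for term in bow_vocab])

print(bow_vocab)
for row in bow_matrix:
    print(row)
```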


# TF-IDF

In [28]:
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("Feature names:", tfidf.get_feature_names_out())

# Convert to a DataFrame for readability
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())

Feature names: ['are' 'deep' 'for' 'language' 'learning' 'love' 'models' 'natural' 'nlp'
 'powerful' 'processing']


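|   | are | deep | for | language | learning | love | models | natural | nlp | powerful | processing |
|---|-----|------|-----|----------|----------|------|--------|---------|-----|----------|------------|
| 0 | 0.0 | 0.0 | 0.0 | 0.428046 | 0.0 | 0.428046 | 0.0 | 0.562829 | 0.0 | 0.0 | 0.562829 |
| 1 | 0.528635 | 0.0 | 0.0 | 0.40204 | 0.0 | 0.0 | 0.528635 | 0.0 | 0.0 | 0.528635 | 0.0 |
| 2 | 0.0 | 0.467351 | 0.467351 | 0.0 | 0.467351 | 0.355432 | 0.0 | 0.0 | 0.467351 | 0.0 | 0.0 |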
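
The weights above can be reproduced by hand, which helps show where numbers like 0.428046 come from. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. All names (`tfidf_docs`, `tfidf_rows`, and so on) are illustrative.

```python
import math
import re
from collections import Counter

tfidf_docs = [
    "I love natural language processing",
    "Language models are powerful",
    "I love deep learning for NLP",
]

def tfidf_tokenize(doc):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized_docs = [tfidf_tokenize(d) for d in tfidf_docs]
tfidf_vocab = sorted({t for toks in tokenized_docs for t in toks})
n_docs = len(tfidf_docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(t for toks in tokenized_docs for t in set(toks))

# Smoothed idf, assuming the scikit-learn default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in tfidf_vocab}

tfidf_rows = []
for toks in tokenized_docs:
    term_counts = Counter(toks)
    raw = [term_counts[t] * idf[t] for t in tfidf_vocab]
    # L2-normalize the row so that each document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw))
    tfidf_rows.append([v / norm for v in raw])

for row in tfidf_rows:
    print([round(v, 6) for v in row])
```

Terms that occur in two documents ("language", "love") get a lower idf than terms that occur in only one, which is why they carry smaller weights within the same row.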


# Word Embeddings (Word2Vec)

In [29]:

# Tokenize each sentence
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
print("Tokenized Corpus:", tokenized_corpus)

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, workers=4)

# Get embedding for a word
print("Vector for word 'language':")
print(model.wv['language'])

# Find most similar words
print("Most similar to 'language':", model.wv.most_similar('language'))

Tokenized Corpus: [['i', 'love', 'natural', 'language', 'processing'], ['language', 'models', 'are', 'powerful'], ['i', 'love', 'deep', 'learning', 'for', 'nlp']]
Vector for word 'language':
[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]
Most similar to 'language': [('natural', 0.16
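
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that measure on its own, using made-up three-dimensional vectors (`vec_a`, `vec_b`, `vec_c` are hypothetical and not taken from the trained model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up purely for illustration
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, so similarity is close to 1
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a, so similarity is 0

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

With a corpus of only three sentences, the similarities reported by the model likely reflect the random initialization of the vectors more than any learned meaning, so they should not be read as evidence of semantic relatedness.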