# NLP Preprocessing and Feature Extraction
This notebook demonstrates four text preprocessing steps:

- Tokenization
- Stopwords removal
- Stemming
- Lemmatization

**We will use the NLTK and spaCy libraries.**

In [1]:
!pip install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [7]:
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer


# Download NLTK resources
nltk.download('punkt')      # Tokenizer model used by word_tokenize()
nltk.download('punkt_tab')  # Tokenizer tables that newer NLTK releases load instead of 'punkt'
nltk.download('stopwords')  # Stopword lists (e.g. "the", "is", "in")
nltk.download('wordnet')    # WordNet lexical database, needed for lemmatization


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Anju\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Sample Text

In [8]:
text = "I am learning Natural Language Processing, NLP is fascinating and powerful!"
print(text)

I am learning Natural Language Processing, NLP is fascinating and powerful!


## 1. Tokenization with NLTK

In [9]:
tokens = word_tokenize(text)  # split the sentence into word and punctuation tokens
print(tokens)

['I', 'am', 'learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'is', 'fascinating', 'and', 'powerful', '!']


## 2. Stopwords Removal with NLTK

In [11]:
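stop_words = set(stopwords.words('english'))  # the corpus file name is lowercase
# Compare in lowercase so capitalized stopwords such as 'I' are removed too
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]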
print(filtered_tokens)

['learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'fascinating', 'powerful', '!']
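
The filtered list above still contains the punctuation tokens `,` and `!`, because they are not in the stopword list. If punctuation is not wanted downstream, one simple option is to keep only alphabetic tokens. The snippet below is a minimal sketch, not part of the original pipeline; the name `word_tokens` is introduced here for illustration, and the token list is copied from the output above so the block runs on its own.

```python
# Tokens copied from the stopword-removal output above.
filtered_tokens = ['learning', 'Natural', 'Language', 'Processing', ',',
                   'NLP', 'fascinating', 'powerful', '!']

# str.isalpha() is True only for tokens made up entirely of letters,
# so ',' and '!' are dropped.
word_tokens = [t for t in filtered_tokens if t.isalpha()]
print(word_tokens)
```

Note that `isalpha()` would also discard numbers and hyphenated tokens, so a stricter or looser test may suit other texts better.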


## 3. Stemming with NLTK

In [12]:
stemmer = PorterStemmer()
# Porter stemming strips suffixes by rule and lowercases each token
stems = [stemmer.stem(word) for word in filtered_tokens]
print(stems)

['learn', 'natur', 'languag', 'process', ',', 'nlp', 'fascin', 'power', '!']


## 4. Lemmatization with NLTK

In [13]:
lemmatizer = WordNetLemmatizer()
# lemmatize() treats each word as a noun (pos='n') unless a pos argument is given
lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmas)

['learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'fascinating', 'powerful', '!']


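Stemming strips suffixes with fast, fixed rules, so its results (`natur`, `languag`) are not always real words. Use stemming if speed is more important than accuracy.

Lemmatization looks each word up in a dictionary (WordNet here) and returns a valid base form. Use lemmatization if you need meaningful dictionary words. Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is why the lemmas above are identical to the input tokens.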

## 5. Tokenization and Lemmatization with spaCy

In [15]:
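nlp = spacy.load('en_core_web_sm')
doc = nlp(text)  # runs the full pipeline: tokenizer, tagger, lemmatizer, ...
print(doc)

print([token.text for token in doc])    # tokens
print([token.lemma_ for token in doc])  # lemmas, informed by the predicted part of speech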

I am learning Natural Language Processing, NLP is fascinating and powerful!
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', ',', 'NLP', 'is', 'fascinating', 'and', 'powerful', '!']
['I', 'be', 'learn', 'Natural', 'Language', 'Processing', ',', 'NLP', 'be', 'fascinating', 'and', 'powerful', '!']
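
spaCy tokens also carry lexical flags such as `is_stop` and `is_punct`, so stopword and punctuation filtering can be done without a separate word list. The following is a minimal sketch under one assumption: it uses `spacy.blank('en')`, a tokenizer-only English pipeline, so that it runs without a downloaded model. Lemmas are not available from a blank pipeline; with `en_core_web_sm` loaded as above, the same filter can be applied to `doc`. The names `nlp_blank`, `blank_doc`, and `content_tokens` are introduced here for illustration.

```python
import spacy

# Tokenizer-only English pipeline: no model download needed, but no lemmas either.
nlp_blank = spacy.blank('en')
blank_doc = nlp_blank(
    "I am learning Natural Language Processing, NLP is fascinating and powerful!"
)

# Keep tokens that are neither stopwords nor punctuation.
content_tokens = [token.text for token in blank_doc
                  if not token.is_stop and not token.is_punct]
print(content_tokens)
```

spaCy's built-in stopword list differs from NLTK's, so the two libraries may not filter exactly the same words on other texts.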
