# NLP Text Processing Fundamentals

This notebook will guide you through essential text processing techniques in Natural Language Processing (NLP), including:
- Tokenization
- Stopwords Removal
- Stemming
- Lemmatization

We will use libraries like `nltk` and `torchtext`, and implement custom functions where possible.

In [7]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## 1. Tokenization

Tokenization: The process of breaking text into smaller units like words, sentences, or subwords for easier processing.
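
To make the definition concrete, here is a minimal from-scratch sketch using only the standard library. The function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers, not part of `nltk`, and the regular expressions are deliberately naive (no handling of abbreviations such as "Dr." or of hyphenated words).

```python
import re

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens with a single regex."""
    # \w+(?:'\w+)? keeps contractions such as "Let's" together;
    # [^\w\s] emits each punctuation character as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is amazing! Let's learn."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

Unlike NLTK's `word_tokenize`, which splits "Let's" into "Let" and "'s", this sketch keeps the contraction as one token. Neither choice is wrong; it depends on what the later steps expect.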

In [3]:
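# The Punkt models back both sent_tokenize and word_tokenize.
# Recent NLTK releases may ask for 'punkt_tab' instead; download that if a LookupError appears.
nltk.download('punkt')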

text = "NLP is amazing! It allows machines to understand human language. Let's learn how to preprocess text."

word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

sentence_tokens = sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)

Word Tokens: ['NLP', 'is', 'amazing', '!', 'It', 'allows', 'machines', 'to', 'understand', 'human', 'language', '.', 'Let', "'s", 'learn', 'how', 'to', 'preprocess', 'text', '.']
Sentence Tokens: ['NLP is amazing!', 'It allows machines to understand human language.', "Let's learn how to preprocess text."]


[nltk_data] Downloading package punkt to /Users/chen.m/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Exercise 1
- Tokenize the following text into words and sentences:

"Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data."

- Count the frequency of each word.
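
As a hint for the counting step, one possible approach is `collections.Counter` from the standard library. The token list below is a made-up example, not the exercise text, and the choice to lowercase and skip punctuation is an assumption about what "word" should mean here.

```python
from collections import Counter

# Made-up tokens standing in for the output of a tokenizer.
tokens = ["To", "be", ",", "or", "not", "to", "be", "."]

# Lowercase and keep alphabetic tokens only, so "To"/"to" merge
# and punctuation is not counted as a word.
counts = Counter(t.lower() for t in tokens if t.isalpha())

for word, freq in counts.most_common():
    print(word, freq)
```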

## 2. Stopwords Removal

Stopwords Removal: Dropping very common words (e.g., "is", "to", "the") that carry little meaning on their own, so that later steps focus on content words.

In [9]:
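nltk.download('stopwords')
# A set gives constant-time membership checks.
stop_words = set(stopwords.words('english'))
# NLTK's list is lowercase, so compare lowercased tokens to catch words like "It".
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)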

Filtered Words: ['NLP', 'amazing', '!', 'allows', 'machines', 'understand', 'human', 'language', '.', 'Let', "'s", 'learn', 'preprocess', 'text', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chen.m/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
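
The filtered list above still contains punctuation tokens such as '!' and '.', because those are not stopwords. A minimal sketch of a filter that drops both is shown below. `MINI_STOP_WORDS` is a hypothetical hand-written list used only so the example runs without downloads; NLTK's English list is much longer.

```python
import string

# Hypothetical mini stop list for illustration only.
MINI_STOP_WORDS = {"is", "it", "to", "how", "the", "a", "an"}

def remove_stopwords(tokens, stop_words=MINI_STOP_WORDS, drop_punct=True):
    """Keep tokens that are neither stopwords nor (optionally) pure punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # case-insensitive stopword match
        if drop_punct and all(ch in string.punctuation for ch in tok):
            continue  # token consists only of punctuation characters
        kept.append(tok)
    return kept

tokens = ["NLP", "is", "amazing", "!", "It", "allows", "machines",
          "to", "understand", "human", "language", "."]
print(remove_stopwords(tokens))
```

Passing `stop_words=set(stopwords.words('english'))` would swap in the NLTK list used in the cell above.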


### Exercise 2
- Remove stopwords from the word tokens you generated in Exercise 1.

## 3. Stemming

Stemming: Stripping word endings with fixed rules to reduce words to a common stem (e.g., "running" → "run"), without looking at grammatical context. The stem is not guaranteed to be a real word (e.g., "amazing" → "amaz").

In [12]:
filtered_words

['NLP',
 'amazing',
 '!',
 'allows',
 'machines',
 'understand',
 'human',
 'language',
 '.',
 'Let',
 "'s",
 'learn',
 'preprocess',
 'text',
 '.']

In [10]:
porter = PorterStemmer()

stemmed_words = [porter.stem(word) for word in filtered_words]
print("Stemmed Words:", stemmed_words)

Stemmed Words: ['nlp', 'amaz', '!', 'allow', 'machin', 'understand', 'human', 'languag', '.', 'let', "'s", 'learn', 'preprocess', 'text', '.']
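
To show why stems such as 'amaz' and 'machin' appear, here is a toy suffix-stripping stemmer. It is a rough sketch and not the Porter algorithm, which applies several ordered rule phases with extra conditions. The `SUFFIXES` list and the `min_stem_len` parameter are assumptions made for this example.

```python
# Hypothetical rule list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["amazing", "allows", "machines", "language", "text"]:
    print(w, "->", naive_stem(w))
```

The rules never consult a dictionary, so nothing stops them from producing non-words. That is the trade-off stemming makes for speed and simplicity.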


### Exercise 3
- Apply stemming to a list of custom words (e.g., 'running', 'runs', 'easily', 'fairness').

## 4. Lemmatization

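Lemmatization: Reducing words to their base or dictionary form (e.g., "running" → "run") using a vocabulary and the word's part of speech. NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed.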

In [13]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print("Lemmatized Words:", lemmatized_words)

Lemmatized Words: ['NLP', 'amazing', '!', 'allows', 'machine', 'understand', 'human', 'language', '.', 'Let', "'s", 'learn', 'preprocess', 'text', '.']


[nltk_data] Downloading package wordnet to /Users/chen.m/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
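
The role of the part-of-speech tag can be sketched with a toy lookup table. `LEXICON` and `toy_lemmatize` are hypothetical and stand in for WordNet only to illustrate the idea; a real lemmatizer also applies morphological rules.

```python
# Hypothetical mini lexicon: (surface form, part of speech) -> lemma.
LEXICON = {
    ("allows", "v"): "allow",
    ("machines", "n"): "machine",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", pos="v"))  # verb reading
print(toy_lemmatize("meeting", pos="n"))  # noun reading
# The noun lookup misses here, mirroring the unchanged 'allows' in the output above.
print(toy_lemmatize("allows"))
print(toy_lemmatize("allows", pos="v"))
```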


### Exercise 4
- Apply lemmatization with POS tagging (e.g., nouns, verbs).
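
One piece this exercise usually needs is a mapping from tagger output to the single-letter codes WordNet uses ('n', 'v', 'a', 'r'). The helper below is a sketch that assumes Penn Treebank style tags such as 'VBG' or 'NNS'. The name `penn_to_wordnet` is hypothetical, and falling back to noun for unknown tags is a design choice, not a rule.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With NLTK, the tags would typically come from `nltk.pos_tag(tokens)` (which needs its own tagger resource downloaded first), and the mapped letter would be passed as `lemmatizer.lemmatize(word, pos=...)`.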

## Combined Exercise
- Write a function that performs tokenization, stopwords removal, stemming, and lemmatization on a given text.
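
One possible shape for such a function is to take the normalization step as a parameter, so a stemmer or a lemmatizer can be plugged in without changing the pipeline. The sketch below uses only the standard library. The name `preprocess`, its parameters, and the `\w+` tokenizer are assumptions made for this example.

```python
import re

def preprocess(text, stop_words, normalize=str.lower):
    """Tokenize, drop stopwords, then apply `normalize` to each kept token."""
    tokens = re.findall(r"\w+", text)  # words only; punctuation is discarded
    kept = [t for t in tokens if t.lower() not in stop_words]
    return [normalize(t) for t in kept]

print(preprocess("It allows machines to understand.", {"it", "to"}))
```

Any callable that maps a string to a string fits the `normalize` slot, for example `porter.stem` or `lemmatizer.lemmatize` from the cells above.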

## Write Your Own X
- Choose one of the methods above and implement it yourself, from scratch.