# Day 60 – Introduction to Natural Language Processing (NLP) 

Today, I am starting the journey into **Artificial Intelligence (AI)**. I've successfully completed the core **Machine Learning (ML)** part, which focused heavily on structured data. Now, I will shift my focus to **unstructured data** and the technologies that allow machines to understand it.

AI is a broader concept that focuses on building systems capable of performing tasks that typically require **human intelligence** — such as understanding language, recognizing images, making decisions, or learning from experience.

---

## 1. The World of Artificial Intelligence (AI)

> **Artificial Intelligence (AI)** is the concept of creating machines and computer systems that can mimic or simulate human intelligence.

While Machine Learning focuses on learning patterns from data, AI encompasses a broader range of tasks, including:

  * Reasoning and problem-solving.
  * Perception (via image and audio data).
  * Language understanding and generation.

AI systems often work with unstructured data, which includes:

  * **Text Data** (the focus of NLP)
  * **Image, Audio, and Video**
  * **PDFs, XML/HTML documents**
  * Sensor data and drone data


### Types of AI Data and Focus Areas

When dealing with **unstructured data**, AI uses different specialized subfields:

| Type of Data         | AI Domain / Technique                                                                          |
| -------------------- | ---------------------------------------------------------------------------------------------- |
| **Text Data**        | NLP (Natural Language Processing), RNNs (Recurrent Neural Networks), LLMs (Large Language Models) |
| **Image Data**       | Computer Vision, CNNs                                                                          |
| **Audio/Video Data** | Speech Recognition, Audio Processing, Deep Learning                                            |

In this notebook, I begin with the **Text Data** part — that is, **Natural Language Processing (NLP)**.


---

## 2. Natural Language Processing (NLP)

> **Natural Language Processing (NLP)** is a branch of Artificial Intelligence that helps computers understand, interpret, and generate human language.

It bridges the gap between **human communication** and **machine understanding**, enabling machines to process text and speech in ways that approximate how humans use language.

### How NLP Works

NLP works by breaking down text into smaller, meaningful pieces and applying linguistic and statistical techniques to extract meaning.

A general NLP workflow includes:

1. **Text Preprocessing** – Cleaning, tokenizing, stemming, lemmatizing text.
2. **Feature Extraction** – Converting text into numerical features (Bag of Words, TF-IDF, Word2Vec, etc.).
3. **Model Building** – Applying ML or DL models for tasks like classification or translation.
4. **Evaluation** – Measuring performance using metrics such as accuracy, precision, recall, etc.
5. **Deployment** – Using trained NLP models in applications like chatbots or recommendation systems.
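
The steps above can be sketched end to end in a few lines of plain Python. This is only a toy illustration under stated assumptions: the labelled texts, the `preprocess`, `bag_of_words`, and `predict` helpers, and the overlap-based "model" are all hypothetical stand-ins for real preprocessing, feature extraction, and ML models.

```python
import re
from collections import Counter

# Hypothetical toy dataset: 1 = spam, 0 = not spam.
train = [
    ("win a free prize now", 1),
    ("free money win win", 1),
    ("meeting at noon tomorrow", 0),
    ("lunch meeting with the team", 0),
]
test = [("free prize inside", 1), ("team meeting tomorrow", 0)]

def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 2: Bag of Words, i.e. one count per vocabulary word.
vocabulary = sorted({tok for text, _ in train for tok in preprocess(text)})

def bag_of_words(text):
    counts = Counter(preprocess(text))
    return [counts[word] for word in vocabulary]

def predict(text):
    """Step 3 (toy model): copy the label of the most similar training text."""
    features = bag_of_words(text)

    def similarity(example):
        # Dot product between the two count vectors.
        return sum(a * b for a, b in zip(features, bag_of_words(example[0])))

    _, label = max(train, key=similarity)
    return label

# Step 4: accuracy = correct predictions / total predictions.
correct = sum(predict(text) == label for text, label in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

Step 5 (deployment) would amount to calling something like `predict` from inside an application; a real project would swap the toy pieces for library components.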

### Applications of NLP

NLP is widely used in various AI applications, including:

*  **Text Classification** – Spam filtering, sentiment analysis, topic detection
*  **Chatbots & Virtual Assistants** – Siri, Alexa, ChatGPT
*  **Information Retrieval** – Search engines like Google
*  **Text Summarization** – Automatic summary generation
*  **Machine Translation** – Google Translate
*  **Named Entity Recognition (NER)** – Identifying names, locations, organizations, etc.

### Components of NLP

NLP consists of two main components:

#### 1. **NLU (Natural Language Understanding)**

Focuses on understanding human language and extracting meaning.
Tasks include:

* Intent Detection
* Entity Recognition
* Sentiment Analysis
* Semantic Interpretation

#### 2. **NLG (Natural Language Generation)**

Focuses on generating text that sounds natural to humans.
Tasks include:

* Chatbot response generation
* Report writing
* Text summarization
* Story or content creation


### Key NLP Libraries

While libraries like `scikit-learn` and `XGBoost` cover general-purpose ML, NLP has its own set of specialized libraries for handling text data:

  * **NLTK (Natural Language Toolkit)**: A widely used, older, foundational library providing access to many corpora and lexical resources. It's often used for academic and research purposes.
  * **spaCy**: A newer, more production-ready library known for its speed and efficiency, often used for industrial-scale NLP applications.
  * **Gensim**: A robust library primarily focused on topic modeling, document similarity, and word embeddings (e.g., Word2Vec).
  * **Stanford NLP** (e.g., CoreNLP): A suite of high-quality tools for advanced language analysis (parsing, named entity recognition).

---

## 3. Hierarchy of Text in NLP

Text data follows a specific hierarchy, from the smallest unit to the largest:

  * **Word**: The smallest meaningful unit of text.
  * **Collection of Words** = **Sentence**: A grammatically complete thought.
  * **Collection of Sentences** = **Paragraph**: A group of sentences forming one idea.
  * **Collection of Paragraphs** = **Document**: A complete article, email, or report.
  * **Collection of Documents** = **Corpus**: A large body of text data used for training and analysis.
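
As a rough illustration of this hierarchy, the sketch below turns a tiny made-up corpus into nested lists (documents, then paragraphs, then sentences, then words). The splitting rules are deliberately naive assumptions: blank lines separate paragraphs, and sentence-ending punctuation followed by whitespace separates sentences. Real tokenizers handle many more cases.

```python
import re

# Hypothetical two-document corpus; a blank line separates paragraphs.
corpus = [
    "NLP is fun. It is also useful.\n\nTokenization comes first.",
    "Stemming is fast.",
]

def split_document(document):
    """Return a document as nested lists: paragraphs -> sentences -> words."""
    paragraphs = re.split(r"\n\s*\n", document)
    return [
        [sentence.split() for sentence in re.split(r"(?<=[.!?])\s+", paragraph)]
        for paragraph in paragraphs
    ]

hierarchy = [split_document(document) for document in corpus]

for doc_id, document in enumerate(hierarchy):
    for par_id, paragraph in enumerate(document):
        print(f"document {doc_id}, paragraph {par_id}: {len(paragraph)} sentence(s)")
```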


---

## 4. Text Preprocessing: Tokenization, Stemming and Lemmatization

Before any NLP model can be trained, raw text must be cleaned and converted into a usable format through preprocessing steps like tokenization and stemming.


## 4.1 Tokenization

#### What is Tokenization?
> Tokenization is the process of breaking text into smaller pieces called **tokens** (words, sentences, or paragraphs).

It is usually the **first step** in an NLP pipeline, because later steps (stemming, feature extraction, modeling) operate on tokens rather than on raw strings.


#### Types of Tokenization (using NLTK)

| Tokenization Type | Description | NLTK Function |
| :--- | :--- | :--- |
| **Word Tokenize** | Splits text into individual words and punctuation. | `word_tokenize` |
| **Sentence Tokenize** | Splits text into individual sentences. | `sent_tokenize` |
| **Blankline Tokenize** | Splits a document into paragraphs (tokens separated by blank lines). | `blankline_tokenize` |
| **Whitespace Tokenizer** | Splits text only on whitespace, so punctuation stays attached to words (e.g. `machines.`). | `WhitespaceTokenizer` |
| **WordPunct Tokenizer** | Splits text into runs of alphanumeric characters and runs of punctuation, so `$3.88` becomes `$`, `3`, `.`, `88`. | `wordpunct_tokenize` / `WordPunctTokenizer` |

You can explore the official NLTK documentation here:
🔗 [https://www.nltk.org/api/nltk.tokenize.html](https://www.nltk.org/api/nltk.tokenize.html)

---

In [1]:
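import nltk

# First run only: word_tokenize / sent_tokenize need the Punkt models, e.g.
# nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab'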

In [2]:
AI = '''Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of 
humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and 
problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines. 
It is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe
AI could solve major challenges and crisis situations.'''

### Word Tokenize

In [3]:
from nltk.tokenize import word_tokenize

In [4]:
AI_tokens = word_tokenize(AI)
print(AI_tokens)

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines', '.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals', '.', 'With', 'Artificial', 'Intelligence', ',', 'machines', 'perform', 'functions', 'such', 'as', 'learning', ',', 'planning', ',', 'reasoning', 'and', 'problem-solving', '.', 'Most', 'noteworthy', ',', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines', '.', 'It', 'is', 'probably', 'the', 'fastest-growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation', '.', 'Furthermore', ',', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations', '.']


In [5]:
len(AI_tokens)

81

### Sentence Tokenize

In [6]:
from nltk.tokenize import sent_tokenize

In [7]:
AI_sent = sent_tokenize(AI)
AI_sent

['Artificial Intelligence refers to the intelligence of machines.',
 'This is in contrast to the natural intelligence of \nhumans and animals.',
 'With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and \nproblem-solving.',
 'Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.',
 'It is probably the fastest-growing development in the World of technology and innovation.',
 'Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']

In [8]:
len(AI_sent)

6

### Blankline Tokenize

In [9]:
from nltk.tokenize import blankline_tokenize

# Paragraphs are separated by blank lines; AI contains none, so one token comes back
AI_blank = blankline_tokenize(AI)
AI_blank

['Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of \nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and \nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines. \nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']

In [10]:
len(AI_blank) 

1

### Whitespace Tokenizer

In [11]:
from nltk.tokenize import WhitespaceTokenizer
wt = WhitespaceTokenizer().tokenize(AI)
print(wt)

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals.', 'With', 'Artificial', 'Intelligence,', 'machines', 'perform', 'functions', 'such', 'as', 'learning,', 'planning,', 'reasoning', 'and', 'problem-solving.', 'Most', 'noteworthy,', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines.', 'It', 'is', 'probably', 'the', 'fastest-growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation.', 'Furthermore,', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations.']


In [12]:
len(wt)

70

### WordPunct Tokenizer

In [13]:
from nltk.tokenize import wordpunct_tokenize

s = 'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'
s

'Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks.'

In [14]:
print(wordpunct_tokenize(s))

['Good', 'apple', 'cost', '$', '3', '.', '88', 'in', 'Hyderabad', '.', 'Please', 'buy', 'two', 'of', 'them', '.', 'Thanks', '.']


In [15]:
len(wordpunct_tokenize(s))

18

In [16]:
w_p = wordpunct_tokenize(AI)
print(w_p)

['Artificial', 'Intelligence', 'refers', 'to', 'the', 'intelligence', 'of', 'machines', '.', 'This', 'is', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'of', 'humans', 'and', 'animals', '.', 'With', 'Artificial', 'Intelligence', ',', 'machines', 'perform', 'functions', 'such', 'as', 'learning', ',', 'planning', ',', 'reasoning', 'and', 'problem', '-', 'solving', '.', 'Most', 'noteworthy', ',', 'Artificial', 'Intelligence', 'is', 'the', 'simulation', 'of', 'human', 'intelligence', 'by', 'machines', '.', 'It', 'is', 'probably', 'the', 'fastest', '-', 'growing', 'development', 'in', 'the', 'World', 'of', 'technology', 'and', 'innovation', '.', 'Furthermore', ',', 'many', 'experts', 'believe', 'AI', 'could', 'solve', 'major', 'challenges', 'and', 'crisis', 'situations', '.']


In [17]:
len(w_p)

85
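
The different token counts above (70, 81, and 85 for the same text) come down to how each tokenizer treats punctuation. As a rough sketch, the whitespace, word-punct, and blank-line behaviours can be approximated with the standard library alone. The regular expressions below are approximations chosen for illustration rather than a statement of NLTK's implementation, and `sample` and `two_paragraphs` are made-up inputs.

```python
import re

sample = "Good apple cost $3.88 in Hyderabad. Please buy two of them. Thanks."

# Whitespace-style: split on runs of whitespace; punctuation stays attached.
whitespace_tokens = sample.split()

# Word-punct-style: runs of word characters OR runs of punctuation.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Blank-line-style: one or more blank lines separate paragraphs.
two_paragraphs = "First paragraph.\n\nSecond paragraph."
paragraphs = re.split(r"\s*\n\s*\n\s*", two_paragraphs)

print(len(whitespace_tokens), whitespace_tokens[3])
print(len(wordpunct_tokens), wordpunct_tokens[3:7])
print(len(paragraphs), paragraphs)
```

The price `$3.88` stays a single token under whitespace splitting but becomes four tokens under word-punct splitting, which is why the word-punct count is always the largest.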

---

## 4.2 Understanding N-Grams

While tokenization breaks text into individual words or sentences, sometimes analyzing **combinations of consecutive words** provides better context.  
This is where **N-Grams** come in.

| Type | Description | Example (for sentence: “Natural Language Processing is fun”) |
|------|--------------|---------------------------------------------------------------|
| **Unigram** | Single word | `["Natural", "Language", "Processing", "is", "fun"]` |
| **Bigram** | Pair of two consecutive words | `[("Natural", "Language"), ("Language", "Processing"), ...]` |
| **Trigram** | Sequence of three consecutive words | `[("Natural", "Language", "Processing"), ("Language", "Processing", "is"), ...]` |
| **N-Gram** | Sequence of *n* words | Generalized form of above |

These are helpful in identifying **common patterns and word relationships**, especially in text classification and language modeling tasks.
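
An n-gram is just a sliding window of length *n* over the token list, so a sequence of *k* tokens yields `k - n + 1` n-grams. A minimal sketch in plain Python, where `make_ngrams` is a hypothetical helper name (NLTK's own functions are used in the cells that follow):

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing is fun".split()

print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

If *n* is larger than the number of tokens, the window never fits and the result is an empty list.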



In [18]:
from nltk.util import bigrams, trigrams, ngrams

In [19]:
string = 'the best and most beautifull thing in the world cannot be seen or even touched,they must be felt with heart'
string

'the best and most beautifull thing in the world cannot be seen or even touched,they must be felt with heart'

In [20]:
quotes_tokens = nltk.word_tokenize(string)
print(quotes_tokens)

['the', 'best', 'and', 'most', 'beautifull', 'thing', 'in', 'the', 'world', 'can', 'not', 'be', 'seen', 'or', 'even', 'touched', ',', 'they', 'must', 'be', 'felt', 'with', 'heart']


In [21]:
len(quotes_tokens)

23

### Bigrams

In [22]:
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
print(quotes_bigrams)

[('the', 'best'), ('best', 'and'), ('and', 'most'), ('most', 'beautifull'), ('beautifull', 'thing'), ('thing', 'in'), ('in', 'the'), ('the', 'world'), ('world', 'can'), ('can', 'not'), ('not', 'be'), ('be', 'seen'), ('seen', 'or'), ('or', 'even'), ('even', 'touched'), ('touched', ','), (',', 'they'), ('they', 'must'), ('must', 'be'), ('be', 'felt'), ('felt', 'with'), ('with', 'heart')]


### Trigrams

In [23]:
quotes_trigrams = list(nltk.trigrams(quotes_tokens))
print(quotes_trigrams)

[('the', 'best', 'and'), ('best', 'and', 'most'), ('and', 'most', 'beautifull'), ('most', 'beautifull', 'thing'), ('beautifull', 'thing', 'in'), ('thing', 'in', 'the'), ('in', 'the', 'world'), ('the', 'world', 'can'), ('world', 'can', 'not'), ('can', 'not', 'be'), ('not', 'be', 'seen'), ('be', 'seen', 'or'), ('seen', 'or', 'even'), ('or', 'even', 'touched'), ('even', 'touched', ','), ('touched', ',', 'they'), (',', 'they', 'must'), ('they', 'must', 'be'), ('must', 'be', 'felt'), ('be', 'felt', 'with'), ('felt', 'with', 'heart')]


### N-Grams (n = 4)

In [24]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 4)) 
print(quotes_ngrams)

[('the', 'best', 'and', 'most'), ('best', 'and', 'most', 'beautifull'), ('and', 'most', 'beautifull', 'thing'), ('most', 'beautifull', 'thing', 'in'), ('beautifull', 'thing', 'in', 'the'), ('thing', 'in', 'the', 'world'), ('in', 'the', 'world', 'can'), ('the', 'world', 'can', 'not'), ('world', 'can', 'not', 'be'), ('can', 'not', 'be', 'seen'), ('not', 'be', 'seen', 'or'), ('be', 'seen', 'or', 'even'), ('seen', 'or', 'even', 'touched'), ('or', 'even', 'touched', ','), ('even', 'touched', ',', 'they'), ('touched', ',', 'they', 'must'), (',', 'they', 'must', 'be'), ('they', 'must', 'be', 'felt'), ('must', 'be', 'felt', 'with'), ('be', 'felt', 'with', 'heart')]


---

## 4.3 Stemming

> **Stemming** is the process of reducing a word to its **root or base form**.

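  * **Example**: A stemmer reduces "running" and "runs" to the stem "run." Irregular forms such as "ran" are usually left unchanged, because stemmers strip suffixes by rule rather than look words up in a dictionary.
  * **Note**: The stem may not be a valid word itself.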

Stemming helps reduce variations of a word to a common root, simplifying analysis.
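
To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy sketch, not the Porter algorithm: the `SUFFIXES` tuple and the `toy_stem` function are made up for illustration, and real stemmers apply many more rules (for example, undoubling the final consonant in *running*).

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["playing", "played", "plays", "studies", "running", "ran"]:
    print(word + " : " + toy_stem(word))
```

Even this crude version shows the two typical side effects of rule-based stemming: some stems are not valid words, and irregular forms are not linked to their base form.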

###  Common Stemmer Algorithms in NLTK

| Stemmer              | Description                                                                        |
| -------------------- | ---------------------------------------------------------------------------------- |
| **PorterStemmer**    | Most common stemmer; balances accuracy and simplicity.                             |
| **LancasterStemmer** | More aggressive; may over-stem words.                                              |
| **SnowballStemmer**  | An improved and versatile version of Porter stemmer supporting multiple languages. |


### Porter Stemmer 

In [25]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [26]:
# Only the last expression in a cell is displayed, so just the stem of 'give' appears
pst.stem('having')
pst.stem('affection')
pst.stem('playing')
pst.stem('give')

'give'

In [27]:
words_to_stem = ['give', 'giving', 'given', 'gave', 'thinking', 'loving', 'final', 'finalized', 'finally']
for word in words_to_stem:
    print(word + ' : ' + pst.stem(word))

give : give
giving : give
given : given
gave : gave
thinking : think
loving : love
final : final
finalized : final
finally : final


### Lancaster Stemmer

In [28]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
for word in words_to_stem:
    print(word + ' : ' + lst.stem(word))

give : giv
giving : giv
given : giv
gave : gav
thinking : think
loving : lov
final : fin
finalized : fin
finally : fin


### Snowball Stemmer (English)

In [29]:
from nltk.stem import SnowballStemmer
sbst = SnowballStemmer('english')

for word in words_to_stem:
    print(word + ' : ' + sbst.stem(word))

give : give
giving : give
given : given
gave : gave
thinking : think
loving : love
final : final
finalized : final
finally : final


---
## 4.4 Lemmatization

> **Lemmatization** is the process of reducing a word to its base or dictionary form, known as the **lemma**.

Like **stemming**, lemmatization reduces words to a **base form**.  
The key difference is that lemmatization uses a **dictionary** and the word's **part of speech (POS)** to find the *correct* base form, instead of stripping suffixes by rule.
For example (assuming the right POS is supplied: verb for *running*, noun for *studies*, adjective for *better*):

| Word | Lemmatized Form |
|-------|------------------|
| running | run |
| studies | study |
| better | good |

Unlike stemming, which may simply cut off word endings, lemmatization looks each word up in a vocabulary, taking its **part of speech** into account, and returns a proper dictionary word.

### Stemming vs. Lemmatization

| Aspect | Stemming | Lemmatization |
|---------|-----------|----------------|
| Approach | Rule-based (cuts suffixes) | Dictionary-based (uses linguistic analysis) |
| Output | May produce non-words (e.g., *“studi”* for *“studies”* with the Porter stemmer) | Produces valid dictionary words |
| Speed | Faster | Slower |
| Accuracy | Lower | Higher |

### Why Use Lemmatization?

- Groups inflected forms of a word under one entry (e.g., *run, running, ran → run*, given the right POS tag)  
- Can improve accuracy in NLP models by shrinking the vocabulary without creating non-words  
- Useful for tasks like **sentiment analysis, document classification, and information retrieval**


In [30]:
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()

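# No pos argument is given, so each word is treated as a noun (the default, pos='n');
# pass pos='v' to lemmatize verb forms such as 'giving'
for word in words_to_stem:
    print(word + ' : ' + word_lem.lemmatize(word))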

give : give
giving : giving
given : given
gave : gave
thinking : thinking
loving : loving
final : final
finalized : finalized
finally : finally
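
Every word in the output above comes back unchanged because `lemmatize()` was called without a part-of-speech tag, and NLTK's `WordNetLemmatizer` treats words as nouns (`pos='n'`) unless told otherwise. The effect of the POS tag can be illustrated with a toy lookup table. Everything in this sketch is hypothetical: the `LEMMAS` dictionary and the `toy_lemmatize` function stand in for a real lexical database such as WordNet.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("giving", "v"): "give",
    ("gave", "v"): "give",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("giving"))           # looked up as a noun: no entry
print(toy_lemmatize("giving", pos="v"))  # looked up as a verb
print(toy_lemmatize("better", pos="a"))  # looked up as an adjective
```

With NLTK itself, the equivalent step is to pass the tag explicitly, for example `word_lem.lemmatize('giving', pos='v')`.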


---

## Conclusion

In this notebook, I officially began my journey into **Artificial Intelligence (AI)** by exploring one of its most impactful branches — **Natural Language Processing (NLP)**.  
After completing the Machine Learning section, this marks a major step toward working with **unstructured data**, especially **text data**.

I learned how computers process, understand, and generate human language using various techniques.  
From **tokenization** (splitting text into meaningful units) to **stemming** and **lemmatization** (reducing words to a base form), each concept plays a foundational role in preparing text for NLP models.

These text preprocessing steps are essential before applying advanced NLP methods such as **TF-IDF, Word2Vec, Transformers, and LLMs**.


## Key Takeaways

- **AI** is about mimicking human intelligence — it deals with unstructured data such as text, audio, and images.  
- **NLP (Natural Language Processing)** helps machines understand and process human language.  
- **Tokenization** breaks down text into smaller components (words, sentences, paragraphs) for analysis.  
- **Stemming** simplifies words to their root form but may produce non-words (e.g., *studi*).  
- **Lemmatization** provides dictionary-based, accurate base forms of words (e.g., *studies → study*).  
- These preprocessing techniques are the **first step** toward building powerful NLP pipelines.  
- Libraries such as **NLTK**, **spaCy**, **Gensim**, and **Stanford NLP** are essential tools in this domain.  

---