
### **NLP Summary**


| Step                    | Description                                       | Tool/Library                | Example Input                                   | Example Output             |
| ----------------------- | ------------------------------------------------- | --------------------------- | ----------------------------------------------- | -------------------------- |
| **Lowercasing**         | Converts all text to lowercase                    | Python string methods       | "Hello World"                                   | "hello world"              |
| **Tokenization**        | Splits text into words/sentences                  | `nltk.word_tokenize()`      | "I love NLP."                                   | \['I', 'love', 'NLP', '.'] |
| **Stopword Removal**    | Removes common words with little meaning          | `nltk.corpus.stopwords`     | "I am learning NLP"                             | \['learning', 'NLP']       |
| **Punctuation Removal** | Removes symbols like `.,!?`                       | `string.punctuation` + `re` | "Hello!"                                        | "Hello"                    |
| **Stemming**            | Trims words to their root form                    | `PorterStemmer`, `Snowball` | "running", "flies"                              | "run", "fli"               |
| **Lemmatization**       | Converts word to dictionary base form using POS   | `WordNetLemmatizer`         | "better", "was"                                 | "good", "be"               |
| **POS Tagging**         | Tags words with part of speech (noun, verb, etc.) | `nltk.pos_tag()`            | "run", "beautiful"                              | \[('run', 'VB'), ...]      |
| **Regex Cleaning**      | Custom text cleanup using patterns                | `re` module                 | "User123@gmail.com"                             | "User" (after regex)       |


- NLP text hierarchy: Corpus -> Document -> Sentence -> Vocabulary

| Level          | Description                                                                         | Example                            |
| -------------- | ----------------------------------------------------------------------------------- | ---------------------------------- |
| **Corpus**     | A **collection of documents**. The entire dataset you're analyzing.                 | All the articles in a news archive |
| **Document**   | A **single piece of text** (can be an article, paragraph, review, etc.)             | One news article                   |
| **Sentence**   | A **meaningful sequence of words**, typically ending in a period/question mark/etc. | "The cat sat on the mat."          |
| **Vocabulary** | The set of **unique words** in the corpus (after cleaning).                         | {“cat”, “sat”, “mat”, ...}         |
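
The hierarchy can be sketched with the standard library alone. The three-document corpus and the helper names `split_sentences` and `build_vocabulary` are invented for illustration, and the sentence splitter is a naive rule, not what NLTK does.

```python
import re

# Hypothetical three-document corpus, invented for illustration.
corpus = [
    "The cat sat on the mat. The dog barked.",
    "A cat chased the dog.",
    "Dogs and cats can be friends!",
]

def split_sentences(document):
    # Naive rule: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", document) if s]

def build_vocabulary(documents):
    # Lowercase, keep alphabetic runs only, and collect the unique words.
    vocab = set()
    for doc in documents:
        vocab.update(re.findall(r"[a-z]+", doc.lower()))
    return vocab

sentences = [split_sentences(doc) for doc in corpus]
vocabulary = build_vocabulary(corpus)

print(len(corpus))          # number of documents in the corpus
print(sentences[0])         # sentences of the first document
print(sorted(vocabulary))   # unique words across the whole corpus
```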


### **Session 69 (25-June)**

##### **String Punctuation**


In [135]:
import string

In [136]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [137]:
punc = string.punctuation

In [138]:
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [139]:
msg = "Good Evening Everyone, Welcome to 'NLP' Class (AI)"

In [140]:
msg

"Good Evening Everyone, Welcome to 'NLP' Class (AI)"

In [141]:
import time

In [142]:
for i in msg :
  print(i)
  time.sleep(0.5)  # prints each character (or item) in msg, one at a time, with a 0.5 second delay between each — creating a "typing effect".

G
o
o
d
 
E
v
e
n
i
n
g
 
E
v
e
r
y
o
n
e
,
 
W
e
l
c
o
m
e
 
t
o
 
'
N
L
P
'
 
C
l
a
s
s
 
(
A
I
)


In [143]:
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [144]:
msg

"Good Evening Everyone, Welcome to 'NLP' Class (AI)"

In [145]:
for i in msg :
   if i not in punc:
      print(i)


G
o
o
d
 
E
v
e
n
i
n
g
 
E
v
e
r
y
o
n
e
 
W
e
l
c
o
m
e
 
t
o
 
N
L
P
 
C
l
a
s
s
 
A
I


In [146]:
slist = []
for i in msg :
   if i not in punc:
      slist.append(i)
print(slist)
print("".join(slist))

['G', 'o', 'o', 'd', ' ', 'E', 'v', 'e', 'n', 'i', 'n', 'g', ' ', 'E', 'v', 'e', 'r', 'y', 'o', 'n', 'e', ' ', 'W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 't', 'o', ' ', 'N', 'L', 'P', ' ', 'C', 'l', 'a', 's', 's', ' ', 'A', 'I']
Good Evening Everyone Welcome to NLP Class AI


In [147]:
"".join([ c    for c in msg  if c not in punc])

'Good Evening Everyone Welcome to NLP Class AI'
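
A common alternative to the character loop is `str.translate`, which deletes every punctuation character in one pass. This is a small sketch that reuses the same `msg` string.

```python
import string

msg = "Good Evening Everyone, Welcome to 'NLP' Class (AI)"

# The third argument of str.maketrans lists characters to delete.
remove_punct = str.maketrans("", "", string.punctuation)
cleaned = msg.translate(remove_punct)
print(cleaned)  # Good Evening Everyone Welcome to NLP Class AI
```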

### **Session 70 (25 June)**

##### **Special Character with Regular Expression**

In [148]:
wish = "Wish -You -A@ Happy_ New_ Year #2025!"

In [149]:
wish

'Wish -You -A@ Happy_ New_ Year #2025!'

In [150]:
import re

In [151]:
re.findall("[a-z]", wish)  # uses Python’s re module to find all lowercase alphabetic characters (a to z) in the string wish.

['i', 's', 'h', 'o', 'u', 'a', 'p', 'p', 'y', 'e', 'w', 'e', 'a', 'r']

In [152]:
# re.findall("[a-zA-Z]", wish)

In [153]:
re.findall("[a-zA-Z0-9]", wish)

['W',
 'i',
 's',
 'h',
 'Y',
 'o',
 'u',
 'A',
 'H',
 'a',
 'p',
 'p',
 'y',
 'N',
 'e',
 'w',
 'Y',
 'e',
 'a',
 'r',
 '2',
 '0',
 '2',
 '5']

In [154]:
re.findall("[^a-zA-Z0-9 ]", wish)

['-', '-', '@', '_', '_', '#', '!']

In [155]:
re.sub("[^a-zA-Z0-9 ]","", wish)  #removes all characters from the string wish that are not letters, digits, or spaces.

'Wish You A Happy New Year 2025'

In [156]:
# re.findall()	#Returns all matches in a list
# re.search()	#Returns first match (or None)
# re.match()	#Matches only from start of string
# re.sub()	#Substitutes matches with a new string
# re.split()	#Splits string based on a pattern
# re.compile()	#Precompiles a regex pattern for reuse
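
The functions listed above that were not demonstrated yet can be tried on the same `wish` string. This is a minimal sketch; the variable names are chosen for illustration.

```python
import re

wish = "Wish -You -A@ Happy_ New_ Year #2025!"

year_pattern = re.compile(r"\d{4}")      # precompile a pattern for reuse
first = year_pattern.search(wish)        # first match anywhere in the string
at_start = re.match(r"Wish", wish)       # succeeds: 'Wish' is at the start
not_at_start = re.match(r"Year", wish)   # None: 'Year' is not at the start
parts = re.split(r"\s+", wish)           # split on runs of whitespace

print(first.group())          # 2025
print(at_start is not None)   # True
print(not_at_start)           # None
print(parts)
```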

##### **Tokenization**

* Tokenization is the process of breaking text into smaller pieces called tokens. These tokens can be:
 * Words, Subwords, Characters, or even punctuation marks
* It’s a fundamental step in NLP used by models like GPT, BERT, etc.

* nltk, spaCy, and HuggingFace tokenizers for both sentence and word tokenization
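
To make the token granularities concrete, here is a rough standard-library sketch of word-level and character-level tokens. The regex is only an approximation and is not how `word_tokenize` works internally; subword tokenization needs a trained vocabulary and is not shown.

```python
import re

text = "I love NLP."

# Word level: runs of word characters, or a single punctuation symbol.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character level: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)  # ['I', 'love', 'NLP', '.']
print(char_tokens)
```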

In [157]:
# %pip install nltk
import nltk  #  Natural Language Toolkit
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # Downloads the Punkt tokenizer models for NLTK. These are required for functions like word_tokenize() to work properly.
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [158]:
para = "Hello All! How are you doing today? We are implementing tokenization in NLP. This is done using NLTK."
print(word_tokenize(para))
# splits punctuation as separate tokens. Notice how !, ?, and . are treated as separate tokens
# This is ideal for tasks like POS tagging, NER, or training models


print(sent_tokenize(para)) #  keeps punctuation attached to sentence
# Sentence boundaries are inferred based on punctuation and capitalization patterns

['Hello', 'All', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'We', 'are', 'implementing', 'tokenization', 'in', 'NLP', '.', 'This', 'is', 'done', 'using', 'NLTK', '.']
['Hello All!', 'How are you doing today?', 'We are implementing tokenization in NLP.', 'This is done using NLTK.']


In [159]:
print(para.split())  #Splits text only on whitespace. Does not separate punctuation from words
print(word_tokenize(para))

['Hello', 'All!', 'How', 'are', 'you', 'doing', 'today?', 'We', 'are', 'implementing', 'tokenization', 'in', 'NLP.', 'This', 'is', 'done', 'using', 'NLTK.']
['Hello', 'All', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'We', 'are', 'implementing', 'tokenization', 'in', 'NLP', '.', 'This', 'is', 'done', 'using', 'NLTK', '.']


##### **Stop Words**

* Stop words are common words in a language (like "the", "is", "in", "and") that are usually filtered out before processing text in NLP, because they carry little meaningful information and contribute little to the semantic meaning of sequential data.

* Why Remove Stop Words?
   * They appear very frequently, but add little value to tasks like:
    * Text classification
    * Topic modeling
    * Information retrieval
   * Removing them reduces noise (the model focuses on the important words) and can improve performance (fewer words means lower memory usage).

* When to remove stop words:
  * Text classification
    * Spam detection: the useful signal is in words like
      * Discount, offer, personal loans, excellent opportunity
  * Search engines: to ignore irrelevant terms in queries
  * Document clustering
  * Sentiment analysis (with care: NLTK's English list includes negations such as "no" and "not")
  * Topic modeling (LDA): stop words add noise
  * TF-IDF vectorization: stop words often have high frequency but low informativeness



* When You Don’t Remove Them
   * **Context-sensitive models (BERT, GPT)**: These models learn to assign low weight to stopwords, so removal isn't necessary.
   * **Sequence tasks (e.g., translation, summarization)**: Removing stopwords would distort grammar and meaning.

In [160]:
# Sentence: "The cat is on the mat."

# Without stopwords: "cat mat"

# For a classifier, this may retain core meaning

# For a translator or summarizer — this would break the sentence
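
The example in the comments can be reproduced with a small sketch. The `STOPWORDS` set here is a hand-picked assumption for illustration; NLTK's English list is much longer.

```python
# Hand-picked stop word set for illustration only (not NLTK's list).
STOPWORDS = {"the", "is", "on", "a", "an", "in", "and"}

def remove_stopwords(sentence):
    # Strip surrounding punctuation from each word, lowercase, then filter.
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return [w for w in words if w and w not in STOPWORDS]

kept = remove_stopwords("The cat is on the mat.")
print(" ".join(kept))  # cat mat
```

Storing the stop words in a `set` keeps each membership check cheap, which matters when filtering long documents.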


In [161]:
 #stopwords
import nltk
from nltk.corpus import stopwords  # imports the stop words corpus from the NLTK library, which provides predefined lists of common stop words in multiple languages (like English, French, German, etc.).
nltk.download('stopwords') # Download the stopwords data (only once):
print(stopwords.words('english'))
print(len(stopwords.words('english')))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [162]:
#To list available languages like:
print(stopwords.fileids())

['albanian', 'arabic', 'azerbaijani', 'basque', 'belarusian', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'tamil', 'turkish']


In [163]:
# filtered
print([word for word in word_tokenize(para) if word.lower() not in stopwords.words('english') and word not in string.punctuation])

['Hello', 'today', 'implementing', 'tokenization', 'NLP', 'done', 'using', 'NLTK']


### **Session 71 - (26 June)**

In [164]:
news = """Defence Minister Rajnath Singh on Thursday has reportedly refused to sign the joint statement of the
Shanghai Cooperation Organisation (SCO). As per reports, the defence minister refused to dot India's name on the
document due to its failure to address India's concern regarding cross-border terrorism."""

In [165]:
words = word_tokenize(news)

In [None]:
for w in words:
  if w not in stopwords.words('english'):
    print(w,end=' ')
    time.sleep(0.5)

Defence Minister Rajnath Singh 

In [None]:
' '.join([w for w in words if w not in stopwords.words('english')])

##### **Stemming**

In [None]:
# Stemming
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [None]:
ps = PorterStemmer()  # One of the oldest and most common stemmers. Conservative: trims suffixes but tends to preserve meaning better.
ss = SnowballStemmer(language="english")  # More modern, multilingual, and slightly more aggressive than Porter. Often preferred over Porter for non-English text or more consistent stemming.
ls = LancasterStemmer()  # Very aggressive. May cut too much and produce root forms that aren't real words.

In [None]:
ps.stem("running")  # Output: run
ps.stem("studies")  # Output: studi
ps.stem("happiness")  # Output: happi
ps.stem("maximum")  # Output: maximum
ps.stem("writing") # write

ss.stem("running")  # Output: run
ss.stem("studies")  # Output: studi
ss.stem("happiness")  # Output: happi
ss.stem("maximum")  # Output: maximum
ss.stem("writing") # write

ls.stem("running")  # Output: run
ls.stem("studies")  # Output: study
ls.stem("happiness")  # Output: happy
ls.stem("maximum")  # Output: maxim
ls.stem("writing") # writ

# Each stemmer is useful in different contexts:
  # Use Porter for general English.
  # Use Snowball for multilingual or slightly better balance.
  # Use Lancaster when you need shorter roots but can tolerate distortion.

In [None]:
print(ps.stem("writing"))
print(ss.stem("writing"))
print(ls.stem("writing"))

In [None]:
inflected_words = ["code", "coder", "coding", "coders", "codings", "change", "changes", "changing", "changed",
                   "trouble", "troubled","troubling", "troubles", "university", "universities","universe", "universal", "run",
                   "ran", "running", "runs", "writer", "write", "writers", "writes", "writing"]
print([ps.stem(i) for i in inflected_words])
print([ss.stem(i) for i in inflected_words])
print([ls.stem(i) for i in inflected_words])

In [None]:
# for  word in inflected_words:
#   p = ps.stem(word)
#   s = ss.stem(word)
#   l = ls.stem(word)
#   print(word,"===>",p,"===>",s,"===>", l)
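
To see why stemmers can produce non-words, here is a deliberately naive suffix stripper. It is a toy written for illustration, not the Porter, Snowball, or Lancaster algorithm; the suffix list and the length check are assumptions.

```python
def naive_stem(word):
    # Toy rules tried in order; real stemmers apply many more conditions.
    for suffix in ("ies", "ing", "es", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "studies", "changed", "run"]:
    print(w, "->", naive_stem(w))
# running -> runn
# studies -> stud
# changed -> chang
# run -> run
```

Mechanical trimming like this is fast, but nothing checks that the result is a dictionary word.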

##### **Lemmatization**
- Lemmatization is an NLP technique used to reduce words to their base or dictionary form, known as a lemma.
- Unlike stemming, which often chops off word endings without understanding context, lemmatization uses vocabulary and grammar rules to return real words.
- Lemmatizers consider the part of speech (POS) to determine the correct lemma. For example:
  - running as a verb → run
  - running as a noun (e.g., "the running was fun") → remains running

In [None]:
import nltk
nltk.download('wordnet')
lemma = nltk.WordNetLemmatizer()

In [None]:
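lemma.lemmatize("writing")  # pos defaults to noun ('n'), so this returns 'writing' unchanged
# Expected lemmas when the matching pos argument is passed:
# running (pos='v')  run
# better  (pos='a')  good
# studies (pos='n')  study
# was     (pos='v')  be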

In [None]:
inflected_words = ["code", "coder", "coding", "coders", "codings", "change", "changes", "changing", "changed",
                   "trouble", "troubled","troubling", "troubles", "university", "universities","universe", "universal", "run",
                   "ran", "running", "runs", "writer", "write", "writers", "writes", "writing"]
print([lemma.lemmatize(i) for i in inflected_words])

In [None]:
# for w in inflected_words:
#   res = lemma.lemmatize(w)
#   print(w," ===> ",res)

In [None]:
from nltk.stem import WordNetLemmatizer  # The main lemmatizer class in NLTK.
nltk.download('averaged_perceptron_tagger_eng')
from nltk.corpus import wordnet  # provides part-of-speech (POS) constants (like wordnet.NOUN).
from nltk import pos_tag, word_tokenize  # pos_tag tags each word in a sentence with its part of speech (e.g., noun, verb)

lemmatizer = WordNetLemmatizer()  # Creates an instance of the lemmatizer.

def get_wordnet_pos(word):
    """Map POS tag to first character for WordNetLemmatizer."""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ,
                'N': wordnet.NOUN,
                'V': wordnet.VERB,
                'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

word = "writing"
lemmatized_word = lemmatizer.lemmatize(word, get_wordnet_pos(word))
print(lemmatized_word)  # 'write' when the tagger labels "writing" as a verb


In [None]:
news = """Defence Minister Rajnath Singh on Thursday has reportedly refused to sign the joint statement of the
Shanghai Cooperation Organisation (SCO). 123456 As per reports, the defence minister refused to dot India's name on the
document due to its failure to address India's concern regarding cross-border terrorism."""

In [None]:
import string
import re
''.join([i for i in news if  i not in string.punctuation])

In [None]:
# string.punctuation -> Contains all punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Used to remove punctuation during text cleaning.
# re -> Python's built-in module for pattern matching and text manipulation. Common use: remove unwanted patterns like extra spaces, special characters, digits.
# Tokenization -> Splits text into smaller units: words (word tokenization) or sentences (sentence tokenization).
# Stopword removal -> Stopwords are common words (e.g., “the”, “is”, “and”) that usually carry less meaning.
# Stemming -> Reduces words to their root form, which may not always be a valid word.
# Lemmatization -> Similar to stemming, but returns the dictionary base form (lemma) and uses POS tagging.

| Feature              | `stopwords`                          | `string.punctuation`                  | `re` (Regular Expressions)                 |
| -------------------- | ------------------------------------ | ------------------------------------- | ------------------------------------------ |
| **Module**           | `nltk.corpus.stopwords`              | Python `string` module                | Python `re` module                         |
| **Purpose**          | Remove common, low-value words       | Remove punctuation/special characters | Remove or match patterns (flexible)        |
| **Example Elements** | "the", "is", "and", "in", "at"       | ``!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~`` | Any text pattern (e.g., `\d+`, `[A-Za-z]`) |
| **Use Case**         | Text simplification, keyword focus   | Cleaning symbols from text            | Cleaning, pattern matching, substitutions  |
| **Customizable?**    | Yes (you can add/remove words)       | No (fixed punctuation list)           | Yes (very flexible patterns)               |
| **Example Code**     | `word in stopwords.words('english')` | `if char in string.punctuation`       | `re.sub(r'\d+', '', text)`                 |
| **Returns**          | List of stop words                   | String of punctuation characters      | Modified string or match object            |

-----------------------

| Feature | **Stemming**                                | **Lemmatization**                                 |
| ------- | ------------------------------------------- | ------------------------------------------------- |
| Purpose | Reduces words to **root forms** by trimming | Reduces words to **dictionary base form** (lemma) |
| Output  | May not be a real word                      | Always produces a valid word                      |
| Method  | Rule-based, mechanical                      | Rule-based **+ vocabulary + grammar (POS)**       |

| Scenario                                                   | Choose                                                |
| ---------------------------------------------------------- | ----------------------------------------------------- |
| Speed-sensitive pipeline                                   | Stemming                                              |
| Meaning-preserving NLP task (like QA, chatbots, sentiment) | Lemmatization                                         |
| Indexing/search engines                                    | Stemming (e.g., "run", "runs", "running" all → "run") |

| Word             | **PorterStemmer** | **SnowballStemmer** | **LancasterStemmer** | **Lemmatization** |
| ---------------- | ----------------- | ------------------- | -------------------- | ----------------- |
| **running**      | run               | run                 | run                  | run               |
| **studies**      | studi             | studi               | study                | study             |
| **flies**        | fli               | fli                 | fly                  | fly               |
| **happily**      | happili           | happili             | happy                | happily           |
| **better**       | better            | better              | bet                  | good *(adj)*      |
| **caring**       | care              | care                | car                  | care              |
| **maximum**      | maximum           | maximum             | maxim                | maximum           |
| **organization** | organ             | organ               | organ                | organization      |
| **was**          | wa                | wa                  | was                  | be *(verb)*       |
| **children**     | children          | children            | child                | child             |



In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

from nltk.corpus import stopwords
nltk.download('stopwords')
# print(stopwords.words('english'))

from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer, LancasterStemmer
ps= PorterStemmer()
ss=SnowballStemmer(language = "english")
ls=LancasterStemmer()

nltk.download('wordnet')
lemma = nltk.WordNetLemmatizer()



In [None]:
' '.join([lemma.lemmatize(ps.stem(i)) for i in word_tokenize (news) if i not in stopwords.words('english') and i not in string.punctuation])
# ' '.join([ps.stem(i) for i in word_tokenize (news) if i not in stopwords.words('english') and i not in string.punctuation])

In [None]:
' '.join([lemma.lemmatize(i) for i in word_tokenize (news) if i not in stopwords.words('english') and i not in string.punctuation])

In [None]:
# full pipeline example
import nltk
import string
import re

from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag

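# Download required resources (only once)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer resource also downloaded earlier in this notebook
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # tagger resource name used earlier in this notebook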

# Input text
text = "Running faster than the wind, she was better at it than anyone!"

# Step 1: Lowercasing
text = text.lower()

# Step 2: Remove punctuation using regex
text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Stopword removal
stop_words = set(stopwords.words("english"))
filtered_tokens = [w for w in tokens if w not in stop_words]

# Step 5: Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered_tokens]

# Step 6: Lemmatization with POS
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    pos_dict = {'J': wordnet.ADJ, 'N': wordnet.NOUN,
                'V': wordnet.VERB, 'R': wordnet.ADV}
    return pos_dict.get(tag, wordnet.NOUN)

lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_tokens]

# Output
print("Original Tokens:", tokens)
print("After Stopword Removal:", filtered_tokens)
print("After Stemming:", stemmed)
print("After Lemmatization:", lemmatized)


### **Session 72- NLP Pipeline - (27 June)**

In [None]:
import pandas as pd
import numpy as np
import re
df = pd.read_csv("/content/IMDB.csv", on_bad_lines='skip')
df.head(3)

In [None]:
df1 = df.sample(10000)
df1["sentiment"].value_counts()

In [None]:
df1.reset_index(inplace = True)

In [None]:
df1.head(2)

In [None]:
df1.drop("index", axis = 1, inplace=True)

In [None]:
df1.review.head(3)

In [None]:
# df1["review"].map(lambda x : x.upper())
# df1["review"].map(lambda x : x.lower())

df1["Result"] = df1["review"].map(lambda x : x.lower())

# df1["Result"].map(lambda x :  re.findall("<.*?>", x))
# df1["Result"].map(lambda x :  len(x))   # length before
# df1["Result"].map(lambda x :  re.sub("<.*?>","", x))

df1["Result"] = df1["Result"].map(lambda x :  re.sub("<.*?>","", x))

# df1["Result"].map(lambda x :  len(x))  #length after

In [None]:
import string
from string import punctuation
punc = punctuation

In [None]:
df1["Result"][0]

In [None]:
"".join( [ c   for c in df1["Result"][0] if c not in punc] )

In [None]:
len(df1["Result"][0])

In [None]:
len("".join( [ c   for c in df1["Result"][0] if c not in punc] ))

In [None]:
# df1["Result"].map(lambda x : [ c   for c in x  if c not in punc] )
# df1["Result"].map(lambda x : "".join([ c   for c in x  if c not in punc]))
df1["Result"] = df1["Result"].map(lambda x : "".join([ c   for c in x  if c not in punc]))

In [None]:
df1.head()
df1.drop("review", axis = 1, inplace = True)

In [None]:
df1.columns
df1.columns = ['sentiment', 'Review']

In [None]:
df1["Review"]

In [None]:
# Tokenize
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

df1["Review"][0]

In [None]:
# word_tokenize(df1["Review"][0])
# df1["Review"].map(lambda x : word_tokenize(x) )
df1["Tokens"] = df1["Review"].map(lambda x : word_tokenize(x) )
df1.head(3)

In [None]:
# Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
sw_eng = stopwords.words("english")
df1.head()

In [None]:
#tokens
# df1["Tokens"][0]
# len(df1["Tokens"][0])
# [ w  for w in df1["Tokens"][0] if w not in sw_eng]
# len([ w  for w in df1["Tokens"][0] if w not in sw_eng])
# df1["Tokens"].map( lambda x : [w    for w in x  if w not in sw_eng])

df1["Tokens"] = df1["Tokens"].map( lambda x : [w    for w in x  if w not in sw_eng])

In [None]:
# stemming
from nltk.stem import PorterStemmer
port_stem = PorterStemmer()
# df1["Tokens"].map(lambda x :  [port_stem.stem(w)  for w in x ])

df1["Tokens"] = df1["Tokens"].map(lambda x :  " ".join([port_stem.stem(w)  for w in x ]))

In [None]:
df1.head(5)

### **Session 73- Automate NLP Pipeline Process - (1st July)**

### **Session 74- NGrams- (2nd July)**

In [None]:
import nltk
from nltk import ngrams
sent = "Data Scientist work on Machine Learning and Deep Learning"
sent.split()

In [None]:
#Unigram
ngrams( sent.split(), 1)

In [None]:
list(ngrams( sent.split(), 1))

In [None]:
# Bigram
ngrams(sent.split(),2)

In [None]:
list(ngrams(sent.split(),2))

In [None]:
# Tri - Gram
list( ngrams( sent.split(), 3) )
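
For intuition, n-grams can also be built without NLTK by zipping staggered views of the token list. The helper name `make_ngrams` is made up for this sketch.

```python
def make_ngrams(tokens, n):
    # Zip n staggered slices so each tuple holds n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

sent = "Data Scientist work on Machine Learning and Deep Learning"
tokens = sent.split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:3])
print(len(tokens), len(bigrams), len(trigrams))  # 9 8 7
```

A sentence with `k` tokens yields `k - n + 1` n-grams, which is why the counts drop by one at each step.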

### **Session 75- NER Recognition & POS Tagging- (3rd July)**

- POS (Part-of-Speech) tagging in NLP is the process of labeling each word in a sentence with its correct part of speech, such as noun, verb, or adjective, based on both its definition and its context.
- POS tagging is a crucial step in NLP that helps a model or system understand the grammatical structure and meaning of the input sequence.
- POS tagging helps the model understand how words are related to each other.
- POS tagging helps the model with:
  - Understanding the sequential structure of a sentence
  - Removing ambiguity from word meanings
  - Extracting the hidden context of the input sentence
- POS tagging is used in:
  - Machine Translation
  - Sentiment Analysis
  - Information Retrieval
  - Text Summarization

In [None]:
import nltk
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')
# nltk.download('tagsets_json')

from nltk.help import brown_tagset
from nltk.help import upenn_tagset

from nltk.tokenize import word_tokenize, sent_tokenize
sent = "The black cat jumps over green gate"
tokens = word_tokenize(sent)   # ['The', 'black', 'cat', 'jumps', 'over', 'green', 'gate']
tokens_pos= nltk.pos_tag(tokens)
tokens_pos

In [None]:
# upenn_tagset("DT")
# upenn_tagset("JJ")
# upenn_tagset("NN")

In [None]:
# import time
# for w , tag  in tokens_pos:
#     print(w,": ==>", tag)
#     res = upenn_tagset(tag)
#     print(res)
#     time.sleep(1)

**Named Entity Recognition**- NER is an important NLP technique that focuses on identifying named entities within a sentence and classifying them into predefined categories.

| Type    | Meaning                   | Example               |
| ------- | ------------------------- | --------------------- |
| PERSON  | Individual names          | Elon Musk             |
| ORG     | Organizations/Companies   | Google, NASA          |
| GPE     | Countries, cities, states | India, Paris, Texas   |
| DATE    | Dates                     | July 11, 2025         |
| TIME    | Time                      | 10:00 AM              |
| MONEY   | Monetary values           | \$100, ₹500           |
| PERCENT | Percentages               | 20%                   |
| PRODUCT | Products                  | iPhone, Tesla Model S |


In [None]:
#  nltk.download('maxent_ne_chunker_tab')
# nltk.download('words')
# !pip install -qU svgling

from nltk import ne_chunk
from nltk import pos_tag
text = "Apple Inc. is looking to buy a startup in America for $200 millions"
tokens = word_tokenize(text)
tag_tokens = pos_tag(tokens)
tag_tokens

In [None]:
ner = ne_chunk(tag_tokens)
ner

In [None]:
sent = "Elon Musk is CEO of Tesla Inc. Located in America"
ne_chunk(pos_tag(word_tokenize(sent)))

### **Session 76- BoW- (4th July)**

- Bag of Words (BoW) is a simple and widely used text representation technique in NLP. It converts text into a numerical format so that machine learning algorithms can understand and work with it.

- Vectorization in NLP refers to the process of converting textual data (words or documents) into numerical representations (vectors) so that machine learning models can process and learn from them.

| Technique                  | Description                                                                                                                  | Example                                                                |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **Bag of Words (BoW)**     | Counts the frequency of each word in the document.                                                                           | `["dog barks", "cat meows"]` → `{dog:1, barks:1, cat:1, meows:1}`      |
| **TF-IDF**                 | Adjusts word frequency by how rare a word is across all documents. Helps reduce the weight of common words like "the", "is". | Term Frequency × Inverse Document Frequency                            |
| **One-Hot Encoding**       | Each word is represented by a binary vector with 1 at its index.                                                             | `["apple", "banana", "grape"]` → apple = `[1,0,0]`, banana = `[0,1,0]` |
| **Word Embeddings**        | Pre-trained vectors capturing word semantics. Words with similar meanings have similar vectors.                              | Word2Vec, GloVe, FastText                                              |
| **Transformer Embeddings** | Context-aware vectors for words based on sentence structure.                                                                 | BERT, RoBERTa, GPT embeddings                                          |


- Example: Suppose you have these 2 sentences:
    - I love NLP
    - I love machine learning
    - Vocabulary (unique words): [I, love, NLP, machine, learning]

Now, represent each sentence as a vector of word counts:

| Sentence                  | I | love | NLP | machine | learning |
| ------------------------- | - | ---- | --- | ------- | -------- |
| `I love NLP`              | 1 | 1    | 1   | 0       | 0        |
| `I love machine learning` | 1 | 1    | 0   | 1       | 1        |
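
The count table above can be reproduced with a minimal pure-Python sketch. It assumes simple whitespace tokenization and keeps the original casing; the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

sentences = ["I love NLP", "I love machine learning"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(s) for s in sentences]
print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```

Note that scikit-learn's `CountVectorizer` would give a slightly different vocabulary here: it lowercases by default, and its default token pattern only keeps tokens of two or more characters, so "I" is dropped.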

- BoW ignores word order and word meaning, so it is worth comparing with TF-IDF, Word2Vec, or Transformer-based embeddings like BERT.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer()  # lowercases the text by default (lowercase=True)
cvec

In [None]:
sent = [ "Tall brown Dog is playing in brown Green Fields",
         "White Rabbit eating green Grass near rabbit cage",
          "Brown Dog chase white rabbit",
          "White Rabbit escaped into Green Gate."]
cvec.fit(sent)

In [None]:
cvec.get_feature_names_out()

In [None]:
cvec.vocabulary_

In [None]:
cvec.transform(sent)

In [None]:
ary=cvec.transform(sent).toarray()
ary

In [None]:
# Take the column names from the fitted vectorizer instead of typing them by hand
pd.DataFrame(ary, columns=cvec.get_feature_names_out())

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["I love NLP nlp machine", "I love machine learning"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
pd.DataFrame(X.toarray() , columns = vectorizer.get_feature_names_out() )

### **Session 77- TF-IDF- (7th July)**

- TF-IDF is another popular vectorization technique in NLP.
- It is a statistical measure from information retrieval that scores how important each word is to a document, relative to the whole collection of documents (corpus).

- TF ==> Term Frequency
  - Measures how frequently a term / word appears in a specific document.
  - The more often a word appears in a document, the more weight it gets as an important word in that document.
- IDF ==> Inverse Document Frequency
  - Measures how rare or common a term / word is across the entire corpus.
  - A word that appears in many documents is considered less important and gets a low weight.
  - A word that appears in few documents is considered more important and gets a high weight.
- TF = how important a word is within a document
- IDF = how unique that word is across all documents (corpus)

| Term                       | High TF | High IDF | Importance |
| -------------------------- | ------- | -------- | ---------- |
| Common word (e.g. "the")   | ✅       | ❌        | Low        |
| Rare word (e.g. "quantum") | ✅       | ✅        | High       |


Example corpus

| Document | Content                   |
| -------- | ------------------------- |
| Doc1     | "I love NLP"              |
| Doc2     | "I love machine learning" |

- Step 1
  - vocabulary- [I, love, NLP, machine, learning]
- Step 2- Term Frequency (TF) = (Number of times a word appears in a document) / (Total words in that document)

| Word     | TF in Doc1  | TF in Doc2 |
| -------- | ----------- | ---------- |
| I        | 1/3 = 0.333 | 1/4 = 0.25 |
| love     | 1/3 = 0.333 | 1/4 = 0.25 |
| NLP      | 1/3 = 0.333 | 0          |
| machine  | 0           | 1/4 = 0.25 |
| learning | 0           | 1/4 = 0.25 |

- Step 3: Inverse Document Frequency (IDF)= log(Total Docs in corpus  / (1 + Number of Docs containing word))
(Total Docs = 2)

| Word     | Docs Appeared In | IDF                           |
| -------- | ---------------- | ----------------------------- |
| I        | 2                | log(2 / (1+2)) = log(2/3)     |
| love     | 2                | log(2 / (1+2)) = log(2/3)     |
| NLP      | 1                | log(2 / (1+1)) = log(2/2) = 0 |
| machine  | 1                | log(2 / (1+1)) = 0            |
| learning | 1                | log(2 / (1+1)) = 0            |

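 The log is usually natural (base e) or base 10, and the +1 in the denominator is a smoothing term. On a corpus this small, that smoothing gives 0 for words found in one document and a negative value for words found in both, so the scores are only meaningful relative to each other. scikit-learn's `TfidfVectorizer` defaults to ln((1 + N) / (1 + df)) + 1 instead, which keeps every weight positive.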

- Step 4: TF × IDF- Multiply TF and IDF for each word per document:

| Word     | TF-IDF Doc1        | TF-IDF Doc2     |
| -------- | ------------------ | --------------- |
| I        | 0.333 × log(2/3)   | 0.25 × log(2/3) |
| love     | 0.333 × log(2/3)   | 0.25 × log(2/3) |
| NLP      | 0.333 × log(2/2)=0 | 0               |
| machine  | 0                  | 0.25 × 0 = 0    |
| learning | 0                  | 0.25 × 0 = 0    |

| Word     | Doc1 TF-IDF | Doc2 TF-IDF |
| -------- | ----------- | ----------- |
| I        | Low         | Low         |
| love     | Low         | Low         |
| NLP      | High        | 0           |
| machine  | 0           | High        |
| learning | 0           | High        |


✅ Interpretation:
- Common words (like "I", "love") → Low TF-IDF
- Unique words (like "NLP", "machine", "learning") → Higher TF-IDF
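
The four steps can be checked with a small pure-Python sketch. It assumes whitespace tokenization and compares two IDF variants: the formula used in the steps above (`idf_notes`, a name chosen for this example) and the smoothed formula that scikit-learn uses by default (`idf_smooth`, also a made-up name).

```python
import math

docs = {
    "Doc1": "I love NLP".split(),
    "Doc2": "I love machine learning".split(),
}
vocabulary = ["I", "love", "NLP", "machine", "learning"]
n_docs = len(docs)

def tf(word, tokens):
    """Term frequency: occurrences of the word divided by the document length."""
    return tokens.count(word) / len(tokens)

def doc_freq(word):
    """Number of documents that contain the word."""
    return sum(word in tokens for tokens in docs.values())

def idf_notes(word):
    # Formula from the steps above: log(N / (1 + df)).
    return math.log(n_docs / (1 + doc_freq(word)))

def idf_smooth(word):
    # scikit-learn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq(word))) + 1

for word in vocabulary:
    tfidf = [round(tf(word, tokens) * idf_smooth(word), 3) for tokens in docs.values()]
    print(f"{word:<9} idf_notes={idf_notes(word):.3f} "
          f"idf_smooth={idf_smooth(word):.3f} smooth tf-idf={tfidf}")
```

With the smoothed variant, "NLP" scores higher than "I" or "love" in Doc1, which matches the interpretation above. `TfidfVectorizer` additionally L2-normalizes each row by default, so its output will not match these raw products exactly.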



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfvec = TfidfVectorizer() #creating an instance of the vectorizer, but you still need to fit it on a list of documents to generate the TF-IDF values.
tfvec

In [None]:
sent = [ "Tall brown Dog is playing in Green Fields",
         "White Rabbit eating green Grass near rabbit cage",
          "Brown Dog chase white rabbit",
          "White Rabbit escaped into Green Gate."]
tfvec.fit(sent)

In [None]:
tfvec.get_feature_names_out() #returns the list of unique terms (vocabulary) extracted from your corpus

In [None]:
tfvec.vocabulary_  # Python dictionary mapping each term (word) in the vocabulary to its column index in the TF-IDF matrix.

In [None]:
tfvec.transform(sent)

In [None]:
out = tfvec.transform(sent).toarray()
out

In [None]:
import pandas as pd
# Take the column names from the fitted vectorizer instead of typing them by hand
pd.DataFrame(out, columns=tfvec.get_feature_names_out())

### **Session 78- Embeddings pdf- (8th July)**

### **Session 79- Word2Vec- (9th July)**

In [None]:
# !pip install gensim
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors

# Pre-trained Word2Vec embedding model: about 1.5 GB download, 300-dimensional vectors, trained on a corpus of about 100 billion words
model = api.load("word2vec-google-news-300")
model

In [None]:
king_vector= model["king"]
len(king_vector)

In [None]:
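laptop_vector = model["laptop"]

# A notebook cell only displays its last expression, so print each result
print(model.most_similar("computer"))              # nearest neighbours by cosine similarity
print(model.similarity("man", "king"))
print(model.similarity("queen", "king"))
print(model.similarity("doctor", "hospital"))
print(model.doesnt_match(["java", "php", "dog"]))  # the odd one out

# king - man + woman: a raw vector query does not exclude the input words,
# so "king" itself may appear among the top results
result = model["king"] - model["man"] + model["woman"]
print(model.most_similar(result))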
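
What `similarity` and the analogy query compute can be illustrated without downloading the model. The sketch below uses cosine similarity on tiny 3-dimensional vectors that are invented purely for illustration; real Word2Vec vectors have 300 dimensions and different values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", not taken from any trained model.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# king - man + woman, then look for the closest word by cosine similarity.
analogy = toy["king"] - toy["man"] + toy["woman"]
best = max(toy, key=lambda word: cosine_similarity(analogy, toy[word]))

print(cosine_similarity(toy["king"], toy["queen"]))
print(best)
```

In this toy setup the closest vector to the analogy result is "queen"; with real embeddings the outcome depends on the trained vectors.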

### **Session 80- GloVe Embeddings- (10th July)**

In [None]:
# !pip install gensim
import gensim
import gensim.downloader as api

glove_model = api.load("glove-wiki-gigaword-300")
glove_model

In [None]:
glove_model["lion"]
len(glove_model["lion"])

In [None]:
glove_model["love"]
glove_model.most_similar("car")
glove_model.most_similar("car", topn = 5 )
glove_model["husband"]
glove_model["wife"]
glove_model["man"]
glove_model["woman"]
# husband - man + woman

# positive ==>  husband + woman
# negative  ==> man

res = glove_model.most_similar( positive = ["husband", "woman"], negative = ["man"], topn = 2)
res

res1 = glove_model.most_similar( positive = ["king", "woman"], negative = ["man"], topn = 5)
res1

##### **FastText**

      - FastText is an efficient library developed by Facebook AI Research (FAIR) for text classification and word representation in NLP.
      - It is a word embedding technique designed to handle rare words well.
      - Each word is split further into subwords (character n-grams), so the model learns relationships between subwords.
      - It is an enhanced version of Word2Vec that can build vectors for rare words and for words never seen during training, such as "playng" (a misspelling of "playing").
      - Working mechanism:
        - Tokenization
        - Each token is split into subwords using a sliding window of characters
        - FastText combines the subword vectors to create the final word vector
        - The model is trained to predict the target word
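
The subword step can be sketched as follows. This is a simplified illustration, not the library's implementation: it wraps the word in `<` and `>` boundary markers and lists character n-grams of length 3 to 6 (FastText's default range), while the helper name `char_ngrams` is made up here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

seen = set(char_ngrams("playing"))   # a word assumed to be in the training data
unseen = set(char_ngrams("playng"))  # a misspelling assumed to be unseen
shared = sorted(seen & unseen)

print(char_ngrams("cat", 3, 3))
print(shared)
```

Because the unseen word shares several n-grams with a known word, a vector can still be built for it from the vectors of those shared subwords. The real library also hashes n-grams into a fixed number of buckets, which this sketch leaves out.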

In [None]:
sample_text = """The cat sat on the mat.
Dogs are loyal animals.
I love learning about natural language processing.
FastText is great for text classification.
Birds can fly in the sky.
Apples and oranges are fruits.
The sun rises in the east.
Python is a popular programming language.
She is reading a book under the tree.
AI is changing the world rapidly."""

with open("data.txt", "w") as f:
    f.write(sample_text)


# !pip install -qU fasttext
import fasttext  # FastText library for training word embeddings or text classifiers

# Train a CBOW model (pass model='skipgram' to train Skip-gram instead).
# minCount=1 keeps every word; the default of 5 would discard most of this tiny corpus.
model = fasttext.train_unsupervised('data.txt', model='cbow', minCount=1)

# Get the embedding vector for "language" (a 100-dimensional float vector by default)
print(model.get_word_vector("language"))


| Step                      | Purpose                            |
| ------------------------- | ---------------------------------- |
| `train_unsupervised`      | Train word vectors from raw text   |
| `'cbow'`                  | Predict a center word from context |
| `'skipgram'`              | Predict context from a center word |
| `get_word_vector("word")` | Fetch the learned vector of a word |

| Feature                      | FastText         | Word2Vec / GloVe   |
| ---------------------------- | ---------------- | ------------------ |
| Handles OOV words            | ✅ Yes            | ❌ No               |
| Uses subword information     | ✅ Yes            | ❌ No               |
| Fast training                | ✅ Yes            | ✅ Yes              |
| Morphological awareness      | ✅ Strong         | ❌ Weak             |
| Pre-trained models available | ✅ 157+ languages | ✅ English (mainly) |


### **Session 81- Sequence Data pdf- (12th July)**

    - CNN (Convolutional Neural Network) ==> Images
    - RNN (Recurrent Neural Network) ==> Sequence Data

      **Sequence Data**: data in which the order of the elements matters. Each element (for example, each word) is related to the elements before and after it. This type of data is often collected over time.

- Why Sequence Matters

| Sentence A                   | Sentence B                    |
| ---------------------------- | ----------------------------- |
| "He didn’t say he stole it." | "He said he didn’t steal it." |
      Same words, different order, different meaning. This is why models need to capture the sequence of words — not just their presence.

- Examples of sequence data
      - Textual data arranged in a meaningful order, e.g. sentences, paragraphs, documents, articles, chat conversations.
      - Time series data: data collected at regular intervals of time, e.g. data for sales prediction, temperature prediction, stock price prediction.

| Type of Text      | Sequence Example                               |
| ----------------- | ---------------------------------------------- |
| Sentence          | "The cat sat on the mat."                      |
| Chat conversation | "Hi" → "Hello!" → "How are you?"               |
| Paragraph         | Context builds over multiple sentences         |
| Document          | News articles, essays, or instructions         |
| Code or commands  | "if", "else", "print", etc. (sequence matters) |

- Summary
      - Sequential data is core to NLP — word order shapes meaning.
      - NLP models must preserve or learn this order to perform well.
      - Sequence-based models (RNNs, Transformers) are built to handle this structured dependency.

- Key Features of Sequential Data
      - Order Matters: The position of an element within the sequence is important for understanding the overall meaning and context. Example: "I only eat fruit" ≠ "Only I eat fruit"
      - Sequential Relationship (Temporal Relationship): Each element is related to the elements before and after it. Example: In "She was not happy", the word "not" modifies "happy".
      - Variable Length: Sequences have different lengths (e.g., short tweets vs long articles), which affects the contextual meaning and how relationships among words are understood. Longer sequences often require memory or attention mechanisms to capture distant relationships.
      - Dependencies: Each element in the sequence often depends on surrounding elements. Example: In "If it rains, we will cancel the picnic", the second part depends on the first.
      - Contextual Sensitivity: The meaning of each element depends on the context created by the rest of the sequence. This is why simple bag-of-words models often fail: they lose sequence and context.
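
The "Order Matters" example can be checked directly: once the two sentences are reduced to bags of words, they become indistinguishable. A minimal sketch, assuming lowercasing and whitespace tokenization:

```python
from collections import Counter

a = "I only eat fruit"
b = "Only I eat fruit"

tokens_a = a.lower().split()
tokens_b = b.lower().split()

bag_a = Counter(tokens_a)
bag_b = Counter(tokens_b)

print(bag_a == bag_b)        # compares the unordered word counts
print(tokens_a == tokens_b)  # compares the ordered token sequences
```

The bags are equal while the token sequences differ, so any model that only sees the bag cannot tell the two meanings apart.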

- Models that transform one sequence into another are called Seq2Seq models.
      A Sequence-to-Sequence (Seq2Seq) model is a neural architecture designed to transform one sequence into another. It is widely used when both the input and output are sequences, possibly of different lengths.

      Basic models that deal with sequence data:
        RNN ==> Recurrent Neural Network
        LSTM ==> Long Short-Term Memory

| Task                        | Input Sequence               | Output Sequence                     |
| --------------------------- | ---------------------------- | ----------------------------------- |
| Machine Translation         | `"I love NLP"`               | `"J'aime le traitement du langage"` |
| Text Summarization          | `"The article discusses..."` | `"Summary of the article"`          |
| Chatbot Response Generation | `"Hi, how are you?"`         | `"I'm good, thanks!"`               |
| Speech Recognition          | `Audio waveform`             | `"Hello world"`                     |
| Code Generation             | `"Write a Python function"`  | `"def add(a, b): return a+b"`       |



| Model Type                         | How It Works                                         |
| ---------------------------------- | ---------------------------------------------------- |
| **RNN (Recurrent Neural Network)** | Reads tokens one by one and maintains state          |
| **LSTM / GRU**                     | Improved RNNs for long-range dependencies            |
| **CNN (1D)**                       | Detects local n-gram features, ignores global order  |
| **Transformer**                    | Uses attention to relate all positions at once       |
| **BERT/GPT**                       | Pretrained Transformers that understand full context |


### **Session 82- RNN pdf- (14th July)**

       Hierarchy of how machine learning and deep learning models evolve to address increasingly complex data problems
       ML → ANN → CNN → RNN
       
        Machine Learning Models
        - Can deal only with small volumes of data
        - Needed: Feature Extraction, Feature Selection, Error Rectification, Encoding, Imbalance handling, etc.
        - Disadvantages:
          - Can't handle large volumes of data.
          - There is no automatic feature extraction.
          - Can't handle a large number of features.
        - Solution: Feed Forward Neural Network (ANN)
                        - Auto Feature Engineering
                        - Auto feature extraction
                        - It can handle large volume of data

        Artificial Neural Network :
          - Feed Forward Neural Network (ANN)
          - Auto Feature Engineering
          - Auto feature extraction
          - It can handle large volume of data
          - It has components such as Weights, Activation Functions, Bias, Forward Propagation and Back Propagation.
          - Problem with ANN: it can't deal well with
            - Complex multi-dimensional data, Image Recognition, Speech Conversation, datasets with a huge number of input parameters
          - Solution:
            - CNN addresses these problems of ANN.
            - Image
            - Huge Input Parameter
            - Complex multi dimensional data.
          CNN
            - Stands for Convolutional Neural Network, which handles complex multi-dimensional data like colour images.
            - Image Classification, Image Segmentation, Image Recognition
            - Huge Input Parameter
            - Complex multi dimensional data.
            - Limitation of CNN: CNN can't deal well with
              - Sequential data, Time Series data, Text data, Language Translation, Text Summarization, Sentiment Analysis, etc.
              - No memory: CNNs do not have memory of past inputs. They are stateless and process each input independently.
              - Memory in neural networks refers to the ability to retain information about previous inputs in a sequence. This is crucial for sequential data like text, time series, and speech, where context builds over time.

            - Solution :
              - RNN can handle Sequential data problems
              - Time Series, Language Translation
              - Text Summarization
              - Sentiment Analysis...etc.

              For text, consider the sentence:
              "He is very not good"
              - A CNN might treat "not" and "good" separately or only within a small window, without remembering their relation across the whole sentence.
              - An RNN processes one word at a time and maintains memory:
                  "He → is → very → not → good"
                  So it can capture the negation by relating steps t, t+1, t+2 in the sequential data.

| Model Type                             | Best For                            | Key Features                                                                                        | Limitations                                                                                                     | Solution                        |
| -------------------------------------- | ----------------------------------- | --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------- |
| **ML (Machine Learning)**              | Small structured/tabular data       | - Manual feature engineering<br>- Simple algorithms<br>- Fast on small data                         | - Can't handle large data<br>- No auto feature extraction<br>- Struggles with unstructured data                 | ➡️ Use **ANN**                  |
| **ANN (Artificial Neural Network)**    | Large structured data               | - Auto feature extraction<br>- Forward & backpropagation<br>- Handles high dimensionality           | - Not effective for images<br>- Struggles with spatial data<br>- Weak with high-dimensional input (like images) | ➡️ Use **CNN**                  |
| **CNN (Convolutional Neural Network)** | Images, spatial data                | - Detects patterns via filters<br>- Efficient with large input<br>- Handles 2D/3D data              | - Not suitable for sequences<br>- Can't handle time series, text, speech                                        | ➡️ Use **RNN**                  |
| **RNN (Recurrent Neural Network)**     | Sequential data: text, time, speech | - Maintains memory across steps<br>- Learns temporal patterns<br>- Works with variable-length input | - Training can be slow<br>- Can forget long dependencies (vanishing gradient)                                   | ➡️ Use **LSTM/GRU/Transformer** |

      ANN Architecture Summary Table

| Layer Type          | Purpose                          | Activation Function      | # of Neurons             |
| ------------------- | -------------------------------- | ------------------------ | ------------------------ |
| **Input Layer**     | Accept raw data                  | None                     | Equal to # of features   |
| **Hidden Layer(s)** | Learn patterns & transformations | ReLU, Sigmoid, Tanh      | Tunable (hyperparameter) |
| **Output Layer**    | Give predictions                 | Sigmoid / Softmax / None | Depends on target type   |


| Feature                    | **ANN (Artificial Neural Network)**                      | **CNN (Convolutional Neural Network)**                            | **RNN (Recurrent Neural Network)**                         |
| -------------------------- | -------------------------------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------- |
| **Purpose**                | General-purpose, structured data                         | Spatial data like images                                          | Sequential data like text, time series, speech             |
| **Input Type**             | Fixed-size vector input                                  | Grid-like (e.g., image pixels)                                    | Sequences (e.g., words, time steps)                        |
| **Layer Types**            | Dense (Fully Connected)                                  | Convolutional + Pooling + Dense                                   | Recurrent (looped) + Dense                                 |
| **Memory / Context**       | ❌ No memory of past inputs                               | ❌ No memory (local features only)                                 | ✅ Yes – remembers past inputs via hidden states            |
| **Handles Sequence**       | ❌ No                                                     | ❌ No                                                              | ✅ Yes                                                      |
| **Handles Image Data**     | ✅ But not optimized                                      | ✅ Specialized for images                                          | ❌ Not suitable                                             |
| **Handles Text/Time Data** | ❌ Not well                                               | ✅ Limited (with filters)                                          | ✅ Best suited                                              |
| **Computation**            | Low to Medium                                            | Medium to High                                                    | High (due to time-step loops)                              |
| **Parallelism (Training)** | ✅ High                                                   | ✅ High                                                            | ❌ Low – sequential processing                              |
| **Use Cases**              | - Tabular data<br>- Regression<br>- Basic classification | - Image classification<br>- Object detection<br>- Medical imaging | - Text generation<br>- Translation<br>- Sentiment analysis |
| **Weight Sharing**         | ❌ No                                                     | ✅ Yes (shared kernels)                                            | ✅ Shared across time steps                                 |
| **Example Algorithms**     | Logistic Regression, MLP                                 | LeNet, AlexNet, VGG, ResNet                                       | Vanilla RNN, LSTM, GRU                                     |
| **Strength**               | Simplicity                                               | Local spatial feature extraction                                  | Temporal/contextual modeling                               |
| **Weakness**               | No spatial or temporal awareness                         | No memory for sequences                                           | Vanishing gradient (in vanilla RNN)                        |



###### RNN

      RNN :
      - A Recurrent Neural Network (RNN) is a type of Artificial Neural Network designed to handle sequential data.
      - RNNs have internal memory, which lets them process a sequence of inputs while retaining information about previous elements in the sequence.
      - This "internal memory" makes them suitable for tasks where the order of the input data matters, such as natural language processing and speech recognition.
      - Unfolded vs Folded RNN

    🧩 1. Input Sequence: Suppose the input is a sequence of words (or time steps):
    X = [x₁, x₂, x₃, ..., xₜ]. Each input xₜ is passed into the RNN one at a time (not all at once as in an ANN).

      2. Hidden State Update: At every time step t, the RNN:

          - Receives current input xₜ
          - Receives previous hidden state hₜ₋₁
          - Produces new hidden state hₜ
          - hₜ = f(Wₓ * xₜ + Wₕ * hₜ₋₁ + b)
          - Wₓ, Wₕ: weights, b: bias, f: activation function (tanh or ReLU), hₜ: updated memory
          - This allows the model to remember what it saw before
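
The hidden state update hₜ = f(Wₓ * xₜ + Wₕ * hₜ₋₁ + b) can be sketched in NumPy. This is a forward pass only, with random untrained weights, tanh as the activation, and sizes chosen arbitrarily for illustration; the names `rnn_step`, `W_x`, and `W_h` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 4, 3, 5  # hypothetical sizes

# Randomly initialised parameters (untrained, for illustration only).
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

X = rng.normal(size=(steps, input_size))  # a sequence of 5 input vectors
h = np.zeros(hidden_size)                 # initial hidden state h_0

hidden_states = []
for x_t in X:                             # inputs are fed one at a time
    h = rnn_step(x_t, h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

print(hidden_states.shape)
```

The same `W_x`, `W_h`, and `b` are reused at every step, which is the weight sharing across time steps mentioned in the comparison table.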

### **Session 83- RNN- (21st July)**

      Evolution of RNN :
      - In 1982 ==> John Hopfield ==> Hopfield Network
              - In a Hopfield network, multiple neural networks are connected to each other through feedback. The feedback of the previous network is passed to the next network as input, which lets a pattern be remembered over time in sequential data.
              - Drawbacks of the Hopfield Network were addressed by David Rumelhart, with the help of Geoffrey Hinton, by introducing BPTT (Back Propagation Through Time)
      - LSTM : 1997 - 2010: Long Short Term Memory. A complicated architecture consisting of
              - Forget Gate
              - Input Gate
              - Output Gate
              - Cell State
              - Hidden State
      - GRU (Gated Recurrent Unit)  ==> Equally Powerful as LSTM with Simple Architecture ==> 2014
      - Encoder - Decoder Architecture- Ilya Sutskever. LSTM ==> multiple
      - Attention Mechanism
      - Self Attention Mechanism
      - Transformer Model-> BERT & GPT

| **Year**    | **Contributor(s)**           | **Model / Milestone**      | **Contribution**                                                                      | **Summary**                                               | **Drawbacks**                                                  |
| ----------- | ---------------------------- | -------------------------- | ------------------------------------------------------------------------------------- | --------------------------------------------------------- | -------------------------------------------------------------- |
| 1982        | **John Hopfield**            | **Hopfield Network**       | Introduced feedback-based network. Remembered patterns using energy minimization.     | Early recurrent model with memory for fixed patterns      | Limited memory, not suitable for sequence prediction           |
| 1986        | **Rumelhart & Hinton**       | **Backpropagation + BPTT** | Extended backpropagation for RNNs using time-unrolling                                | Enabled training RNNs using BPTT                          | Prone to vanishing/exploding gradients                         |
| Early 1990s | **Sepp Hochreiter**          | **RNN Gradient Analysis**  | Found major training issues with deep RNNs (vanishing gradient)                       | Highlighted need for better architectures                 | Couldn’t solve long-term dependency problems                   |
| 1997        | **Hochreiter & Schmidhuber** | **LSTM**                   | Introduced memory cells, input/output/forget gates to preserve long-term dependencies | First practical solution for long sequence learning       | Computationally intensive, complex architecture                |
| 2014        | **Kyunghyun Cho et al.**     | **GRU**                    | Simplified LSTM with reset and update gates, faster training                          | Lightweight alternative to LSTM                           | Slightly less accurate on some tasks than LSTM                 |
| 2014–2017   | **Google, Facebook, etc.**   | **Attention Mechanism**    | Allowed models to focus on relevant input tokens during decoding                      | Improved translation and summarization accuracy           | Still depends on RNN backbone (before Transformers)            |
| 2017        | **Google**                   | **Transformer**            | Fully attention-based model with parallel processing, no recurrence                   | Most powerful and scalable architecture (e.g., BERT, GPT) | Requires large data and compute, lacks inherent temporal order |


    🔁 RNN Variants Based on Input/Output Architecture

| **Type**                        | **Input**      | **Output**                         | **Use Case**                          |
| ------------------------------- | -------------- | ---------------------------------- | ------------------------------------- |
| **One-to-One**                  | Single         | Single                             | Image classification                  |
| **One-to-Many**                 | Single         | Sequence                           | Image captioning                      |
| **Many-to-One**                 | Sequence       | Single                             | Sentiment analysis                    |
| **Many-to-Many (synchronous)**  | Sequence       | Sequence                           | POS tagging, named entity recognition |
| **Many-to-Many (asynchronous)** | Input sequence | Output sequence (different length) | Machine translation, summarization    |
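
The difference between many-to-one and synchronous many-to-many comes down to which hidden states feed the output layer. The NumPy sketch below uses random untrained weights and a hypothetical 2-unit output layer `W_y`; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
steps, input_size, hidden_size, output_size = 6, 4, 3, 2  # hypothetical sizes

W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_y = rng.normal(scale=0.5, size=(output_size, hidden_size))

def run_rnn(X):
    """Return the hidden state at every time step for one input sequence."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

X = rng.normal(size=(steps, input_size))
states = run_rnn(X)

# Many-to-one (e.g. sentiment analysis): only the last hidden state is used.
many_to_one = W_y @ states[-1]

# Many-to-many, synchronous (e.g. POS tagging): one output per time step.
many_to_many = states @ W_y.T

print(many_to_one.shape, many_to_many.shape)
```

The asynchronous many-to-many case (translation, summarization) needs a separate decoder that generates an output sequence of a different length, which this sketch does not cover.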


### **Session 84- RNN Implementation- (22 July)**

      RNN Implementation on Twitter Sentiment Data (from the given URL)

    TensorFlow is an open-source machine learning and deep learning framework developed by Google. It allows developers to build, train, and deploy machine learning models easily — especially neural networks.
    ✅ Build and train models for:
      Image classification (e.g., recognizing cats vs dogs)
      Natural Language Processing (e.g., chatbots, translation)
      Time series forecasting
      Speech recognition
      Object detection
      Recommender systems




In [None]:
import pandas as pd
import numpy as np
import re
# import contractions
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN

from nltk.corpus import stopwords
nltk.download("stopwords")

url = "https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv"
df = pd.read_csv(url)
df.head(5)

# label - dependent column (0 = negative, 1 = positive)
# tweet - independent column

In [None]:
# df.drop("id", axis = 1, inplace = True)  # don't require ID column
# df["tweet"]
# df["tweet"][0]
# df["tweet"][3]
df.shape

In [None]:
df.isna().sum()  # count missing values per column
df.label.value_counts()  # class counts; oversampling or undersampling is required to balance the data

In [None]:
X = df[["tweet"]]
y= df["label"]

In [None]:
from imblearn.over_sampling import RandomOverSampler  # RandomOverSampler is used to handle imbalanced datasets by randomly duplicating examples from the minority class until both classes are balanced.
ros = RandomOverSampler()
ros

In [None]:
X_ros, y_ros = ros.fit_resample(X, y)
y_ros.value_counts()

In [None]:
df1 = pd.DataFrame(X_ros)
df1['label'] = y_ros
df1.head(3)

In [None]:
# remove stopwords
swords = stopwords.words("english")  # load the list of English stopwords (downloaded earlier with nltk.download)
len(swords)
swords

In [None]:
# !pip install contractions
import contractions

df1["tweet"] = df1["tweet"].map(lambda x: x.lower())
# Expand contractions ("don't" -> "do not") before apostrophes and stopwords are removed
df1["tweet"] = df1["tweet"].map(lambda x: contractions.fix(x))
df1["tweet"] = df1["tweet"].map(lambda x: re.sub(r"@\w+|#\w+", "", x))       # remove usernames (@user) and hashtags (#tag)
df1["tweet"] = df1["tweet"].map(lambda x: re.sub(r"http\S+|www\S+", "", x))  # remove URLs
df1["tweet"] = df1["tweet"].map(lambda x: re.sub(r"[^a-zA-Z0-9 ]", "", x))   # remove punctuation, special characters, emojis, etc.
df1["tweet"] = df1["tweet"].map(lambda x: " ".join([w for w in x.split() if w not in swords]))  # remove stopwords

# check the result of the preprocessing steps
df1['tweet'].head(3)
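
The cleaning steps can also be bundled into one function and tried on a single string. The sketch below uses only the standard library; the stopword set is a small hypothetical stand-in for NLTK's English list, and the sample tweet is invented.

```python
import re

# Small hypothetical stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {"i", "am", "the", "a", "to", "is", "and", "my", "so"}

def clean_tweet(text):
    """Lowercase, strip URLs, mentions, hashtags and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs
    text = re.sub(r"@\w+|#\w+", "", text)       # mentions and hashtags
    text = re.sub(r"[^a-z0-9 ]", "", text)      # punctuation, emojis, other symbols
    return " ".join(w for w in text.split() if w not in STOPWORDS)

sample = "@user I am SO happy!! Check https://t.co/abc #blessed"
print(clean_tweet(sample))
```

Keeping the steps in one function makes it easier to apply the same cleaning to new tweets at prediction time, for example with `df1["tweet"].map(clean_tweet)`.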

### **Session 85- RNN Implementation Cont- (23rd July)**


      A tokenizer in NLP is a tool or function that breaks text into smaller units called tokens, typically words, subwords, or characters.

      # How the Keras Tokenizer assigns word indexes. After calling tokenizer.fit_on_texts(texts), the tokenizer:
        # (Note: this is a different tool from nltk.tokenize's word_tokenize and sent_tokenize.)
        # Tokenizes (splits) all input texts into words (tokens).
        # Counts the frequency of each word.
        # Sorts words by descending frequency (most frequent first).
        # Assigns indexes starting from 1, or from 2 if oov_token is specified (the OOV token then takes index 1). Index 0 is reserved for padding.
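
The steps above can be sketched in plain Python. This only mimics the behaviour described here and is not the Keras source; `build_word_index` is a hypothetical name, and the simple regex stands in for the Tokenizer's own filtering rules.

```python
import re

def build_word_index(texts, oov_token="<OOV>"):
    counts = {}  # dicts keep first-seen order, which settles frequency ties
    for text in texts:
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
    # Stable sort: most frequent first, ties stay in first-seen order.
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    vocab = ([oov_token] if oov_token else []) + ranked
    return {word: i for i, word in enumerate(vocab, start=1)}  # 0 is left free for padding

print(build_word_index(["I love NLP", "NLP is amazing"]))
# {'<OOV>': 1, 'nlp': 2, 'i': 3, 'love': 4, 'is': 5, 'amazing': 6}
```

"nlp" appears twice, so it gets the lowest word index (2); the remaining words each appear once and keep their first-seen order.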

In [None]:
# Tokenization using the Keras Tokenizer, which turns raw text (sentences or documents) into sequences of integers that a model can consume.

from tensorflow.keras.preprocessing.text import Tokenizer
# num_words=10000: texts_to_sequences keeps only the most frequent words (indexes below 10,000).
# oov_token: any word outside that vocabulary is replaced by this out-of-vocabulary token.
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
# fit_on_texts counts word frequencies, builds the vocabulary, and assigns each word an integer index by frequency rank.
tokenizer.fit_on_texts(df1["tweet"])
print(tokenizer.word_index)

In [None]:
# sample tokenizer
tokenizer1 = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer1.fit_on_texts(["I love NLP", "NLP is amazing"])

print(tokenizer1.word_index)
seq = tokenizer1.texts_to_sequences(["I love AI", "AI is Love NLP amazing"])  # "AI" was never seen during fitting, so it maps to the OOV index
seq
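
The lookup that turns text into integers can be illustrated without Keras. In this sketch `word_index` is written by hand in the shape a tokenizer fitted on the two sample sentences would hold, and `to_sequence` is a hypothetical helper, not a library function.

```python
OOV_INDEX = 1

# Hand-written vocabulary (an assumption), shaped like one fitted on ["I love NLP", "NLP is amazing"].
word_index = {"<OOV>": 1, "nlp": 2, "i": 3, "love": 4, "is": 5, "amazing": 6}

def to_sequence(text, word_index, num_words=None):
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word, OOV_INDEX)  # unseen word -> OOV index
        if num_words is not None and idx >= num_words:
            idx = OOV_INDEX                    # word ranked outside the num_words cap -> OOV index
        seq.append(idx)
    return seq

seqs = [to_sequence(t, word_index) for t in ["I love AI", "AI is Love NLP amazing"]]
print(seqs)
# [[3, 4, 1], [1, 5, 4, 2, 6]]
```

Every occurrence of "AI" becomes 1, while "Love" is lowercased and found in the vocabulary.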

In [None]:
# pad_sequences makes all sequences the same length, which is necessary because neural networks require input tensors of uniform shape.
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(seq, maxlen=10, padding = "post", truncating="post")
padded_sequences

In [None]:
len(padded_sequences[0]), len(padded_sequences[1])
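
What `padding="post"` and `truncating="post"` do can be shown with a few lines of plain Python. `pad_post` is a hypothetical helper that returns lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating="post": drop tokens from the end
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding="post": fill with zeros at the end
    return padded

print(pad_post([[3, 4, 1], [1, 5, 4, 2, 6]], maxlen=4))
# [[3, 4, 1, 0], [1, 5, 4, 2]]
```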

In [None]:
df1["tweet"].head(3)

In [None]:
# tokenizer.texts_to_sequences(df1["tweet"])
sequences = tokenizer.texts_to_sequences(df1["tweet"])
sequences

In [None]:
# Performing Padding
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, maxlen=50, padding = "post", truncating="post")
padded_sequences

In [None]:
padded_sequences.shape

### **Session 86- RNN Implementation Cont- (24 July)**


In [None]:
# splitting data
X = padded_sequences  # numpy.ndarray
y = df1["label"].values  # numpy.ndarray

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# print all four shapes; a notebook cell only displays its last bare expression
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
type(X), type(y), type(df1["label"])

In [None]:
# Model Building

# Importing the Sequential model and required layers: Embedding, SimpleRNN, and Dense.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN

smodel = Sequential()  # Initializes a sequential neural network (layer-by-layer model).

# Note: input_length is deprecated in recent Keras releases and can simply be dropped there.
smodel.add(Embedding(input_dim=10000, output_dim=64, input_length=50))
smodel.add(SimpleRNN(64))
smodel.add(Dense(1, activation = "sigmoid"))

smodel.summary()
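
The parameter counts that `summary()` should report for a built model of this shape can be checked by hand. The arithmetic below assumes the standard layer layouts: one vector per word index, a SimpleRNN with input weights, recurrent weights, and a bias, and a Dense layer with a bias.

```python
vocab_size, embed_dim, rnn_units = 10000, 64, 64

embedding_params = vocab_size * embed_dim             # one 64-d vector per word index
rnn_params = rnn_units * (embed_dim + rnn_units + 1)  # input weights + recurrent weights + bias
dense_params = rnn_units * 1 + 1                      # one weight per RNN unit + bias

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 640000 8256 65 648321
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.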

      Embedding Layer: Converts word indices (integers) into dense vectors.
        - input_dim=10000: Vocabulary size (max word index is 9999)
        - output_dim=64: Each word will be represented as a 64-dimensional vector
        - input_length=50: Input sequence length is 50 tokens
        📌 Output Shape: (None, 50, 64)
        
      Simple RNN Layer: Processes the embedded sequence one timestep at a time, maintaining a hidden state.
        - 64 is the number of RNN units (neurons)
        - It returns the last hidden state, not the full sequence
        📌 Output Shape: (None, 64)

      Dense Layer: Output layer with 1 neuron
        - activation="sigmoid": Outputs a probability between 0 and 1, suited to binary classification (e.g., sentiment analysis)
        📌 Output Shape: (None, 1)
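
To see why the RNN layer outputs a single vector per tweet, here is a toy forward pass in plain Python. It assumes the usual SimpleRNN update h_t = tanh(x_t·W_x + h_(t-1)·W_h + b); the function names, random weights, and tiny sizes are illustrative only.

```python
import math
import random

def simple_rnn_last_state(inputs, W_x, W_h, b):
    """inputs: list of timestep vectors. Returns the hidden state after the final timestep."""
    units = len(b)
    h = [0.0] * units                       # initial hidden state
    for x_t in inputs:                      # one timestep at a time
        h = [
            math.tanh(
                sum(x_t[i] * W_x[i][j] for i in range(len(x_t)))
                + sum(h[k] * W_h[k][j] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h                                # only the last state is returned

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
timesteps, embed_dim, units = 5, 4, 3       # toy sizes; the notebook uses 50, 64, 64
embedded = rand_matrix(rng, timesteps, embed_dim)  # stands in for the Embedding output of one tweet
W_x, W_h, b = rand_matrix(rng, embed_dim, units), rand_matrix(rng, units, units), [0.0] * units

h_last = simple_rnn_last_state(embedded, W_x, W_h, b)
print(len(h_last))
# 3
```

Five timesteps go in, but only one vector of length `units` comes out, which matches the (None, 64) output shape for 64 units.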



In [None]:
smodel.compile(optimizer="adam",
               loss = "binary_crossentropy",
               metrics = ["accuracy"] )
smodel.fit(X_train, y_train, epochs = 10, batch_size = 64)
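
The `binary_crossentropy` loss used above penalises confident wrong predictions heavily. A minimal sketch of the formula follows; `binary_crossentropy` here is a hand-written function, not the Keras one, and the clipping constant is an assumption.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = binary_crossentropy([1, 0], [0.9, 0.1])  # confident and right
bad = binary_crossentropy([1, 0], [0.1, 0.9])   # confident and wrong
print(good < bad)
# True
```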

In [None]:
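# evaluate returns [loss, accuracy]. An accuracy of about 96% means the model labels roughly 96% of the test tweets correctly as positive or negative.
# Caveat: the data was oversampled before the train/test split, so duplicated tweets can land in both sets and inflate this score.
smodel.evaluate(X_test, y_test)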
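
The accuracy metric compares thresholded sigmoid outputs with the true labels. A small sketch, assuming the common 0.5 threshold; `accuracy` is a hypothetical helper and the probabilities are made up.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    preds = [1 if p > threshold else 0 for p in y_prob]  # probability -> class label
    correct = sum(int(p == y) for p, y in zip(preds, y_true))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1]))
# 0.75
```

The third example has probability 0.4, so it is predicted as class 0 and counted as an error.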

In [None]:
np.unique(y)

In [None]:
# Summary of RNN Program
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# 1. Load data
url = "https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv"
df = pd.read_csv(url)

# 2. Preprocess text and labels
texts = df['tweet'].astype(str).tolist()
labels = df['label'].values  # 0 = negative, 1 = positive

# 3. Tokenization
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')

# 4. Model definition
model = Sequential([
    Embedding(input_dim=10000, output_dim=16, input_length=100),
    SimpleRNN(32),
    Dense(1, activation='sigmoid')
])

# 5. Compile & train
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded_sequences, labels, epochs=5, batch_size=64, validation_split=0.2)
