# **Gen-AI Bootcamp 24**
_________________________________________________________________________

## **Part 1: Text Collection and Loading**
**Objective:** *Collect and load a text dataset from a selected domain into a suitable format for
processing.*

#### **Domain**: *Social Media* 

#### **Kaggle Dataset**: *https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset*


#### **Loading Dataset**:

In [1]:
import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('Twitter_Data.csv')
# Print the shape of the DataFrame
df.shape

(162980, 2)

#### **Displaying the first few rows**

In [2]:
# Print the first five rows of the DataFrame
df.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


_________________________________________________________________________
## **Part 2: Text Preprocessing**
**Objective:** *Gain hands-on experience with text preprocessing techniques.*

### **Step 1: Import the Necessary Libraries and Corpus**

In [3]:
import nltk
nltk.download('brown')  # Download Brown Data
nltk.download('punkt') # Download Punkt Data
nltk.download('stopwords') # Download Stopwords Data
nltk.download('wordnet') # Download WordNet Data
nltk.download('omw-1.4')  # Download Open Multilingual Wordnet data (used alongside WordNet)
from nltk.corpus import brown
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package brown to C:\Users\Digital
[nltk_data]     Zone\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package punkt to C:\Users\Digital
[nltk_data]     Zone\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Digital
[nltk_data]     Zone\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Digital
[nltk_data]     Zone\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Digital
[nltk_data]     Zone\AppData\Roaming\nltk_data...


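### **Step 2: Load the Text**

In [4]:
# Brown Corpus alternative, kept for reference:
# text = ' '.join(brown.words(categories='science_fiction'))
# Use the tweets instead. Note: str() on a Series returns its truncated display
# (index labels, "..." ellipses, dtype footer), not the full text of every tweet.
# To process all tweets, build the text with: ' '.join(df['clean_text'].dropna().astype(str))
text = str(df['clean_text'])
print("Original Text:", text[:500])  # Print the first 500 characters for reference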

Original Text: 0         when modi promised “minimum government maximum...
1         talk all the nonsense and continue all the dra...
2         what did just say vote for modi  welcome bjp t...
3         asking his supporters prefix chowkidar their n...
4         answer who among these the most powerful world...
                                ...                        
162975    why these 456 crores paid neerav modi not reco...
162976    dear rss terrorist payal gawar what about modi...
162977    did you co


### **Step 3: Tokenization**
Tokenization splits the text into individual words and sentences.

**Impact of Tokenization:**
* Sentence Tokenization: Breaks down the text into manageable units (sentences) for further processing.
* Word Tokenization: Provides the basic units (words) needed for subsequent analysis steps.

In [5]:
# Sentence Tokenization
sentences = sent_tokenize(text)
print("First 5 sentences:", sentences[:5])

# Word Tokenization
words = word_tokenize(text)
print("First 20 words:", words[:20])

First 5 sentences: ['0         when modi promised “minimum government maximum...\n1         talk all the nonsense and continue all the dra...\n2         what did just say vote for modi  welcome bjp t...\n3         asking his supporters prefix chowkidar their n...\n4         answer who among these the most powerful world...\n                                ...                        \n162975    why these 456 crores paid neerav modi not reco...\n162976    dear rss terrorist payal gawar what about modi...\n162977    did you cover her interaction forum where she ...\n162978    there big project came into india modi dream p...\n162979    have you ever listen about like gurukul where ...\nName: clean_text, Length: 162980, dtype: object']
First 20 words: ['0', 'when', 'modi', 'promised', '“', 'minimum', 'government', 'maximum', '...', '1', 'talk', 'all', 'the', 'nonsense', 'and', 'continue', 'all', 'the', 'dra', '...']


### **Step 4: Stemming**
Stemming reduces words to their root form, stripping suffixes.

**Impact of Stemming:**
* Reduction of Variants: Words like "running" and "runs" are reduced to "run," which simplifies the text and shrinks the vocabulary. Irregular forms such as "ran" are left unchanged.
* Potential Loss of Meaning: Stemming can strip too much, and the resulting stem is not always a real word (for example, "promised" becomes "promis").

In [6]:
# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in words]
print("First 20 stemmed words:", stemmed_words[:20])

First 20 stemmed words: ['0', 'when', 'modi', 'promis', '“', 'minimum', 'govern', 'maximum', '...', '1', 'talk', 'all', 'the', 'nonsens', 'and', 'continu', 'all', 'the', 'dra', '...']


### **Step 5: Lemmatization**
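Lemmatization reduces words to their base or dictionary form (the lemma), using a vocabulary and, when supplied, the word's part of speech.

**Impact of Lemmatization:**
* Part-of-Speech-Aware Reduction: Can be more accurate than stemming because it can take the part of speech into account. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed.
* Improved Meaning Preservation: Returns real dictionary words, so it maintains the integrity of the words better than stemming.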

In [7]:
# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

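# Apply lemmatization to each word. The input here is the stemmed tokens, so most
# are already truncated and come out unchanged; pass `words` to lemmatize the original tokens.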
lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
print("First 20 lemmatized words:", lemmatized_words[:20])

First 20 lemmatized words: ['0', 'when', 'modi', 'promis', '“', 'minimum', 'govern', 'maximum', '...', '1', 'talk', 'all', 'the', 'nonsens', 'and', 'continu', 'all', 'the', 'dra', '...']


### **Step 6: Stop Word Removal**
Stop words are common words (e.g., "the," "is") that may not add significant meaning to text analysis.

**Impact of Stop Word Removal:**
* Noise Reduction: Eliminates common but uninformative words, reducing the size of the text data.
* Focus on Meaningful Words: Enhances the focus on significant words that contribute more to the text analysis.

In [8]:
# Get the list of stop words
stop_words = set(stopwords.words('english'))

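# Remove stop words and ASCII punctuation marks from the lemmatized words
# (multi-character tokens such as '...' and non-ASCII quotes such as '“' are kept)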
filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words and word not in string.punctuation]
print("First 20 words after stop word removal:", filtered_words[:20])

First 20 words after stop word removal: ['0', 'modi', 'promis', '“', 'minimum', 'govern', 'maximum', '...', '1', 'talk', 'nonsens', 'continu', 'dra', '...', '2', 'say', 'vote', 'modi', 'welcom', 'bjp']


_________________________________________________________________________
## **Part 3: Feature Extraction Techniques**
**Objective:** *Understand and apply text data transformation into machine-readable vectors.*

NLP algorithms operate on numbers, so text cannot be fed to them directly. It must first be converted into numeric vectors.

### **1. Bag-of-words**
The Bag-of-Words model represents each document as a vector of word counts over a fixed vocabulary. Word order and grammar are discarded; only how often each word occurs is kept.

##### **Step-by-Step Example**
Consider the following three simple documents:
1. "the cat sat on the mat"
2. "the dog barked at the cat"
3. "the cat chased the mouse"

##### **Step 1: Tokenization**
Split each document into words (tokens)

    Document 1: ["the", "cat", "sat", "on", "the", "mat"]

    Document 2: ["the", "dog", "barked", "at", "the", "cat"]

    Document 3: ["the", "cat", "chased", "the", "mouse"]


##### **Step 2: Vocabulary Creation**
Combine all tokens from all documents and identify unique words to create the vocabulary

    Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "barked", "at", "chased", "mouse"]


##### **Step 3: Vectorization**
Create a vector for each document based on the vocabulary. Each position in the vector corresponds to a word in the vocabulary, and the value is the count of that word in the document.

For the given vocabulary ["the", "cat", "sat", "on", "mat", "dog", "barked", "at", "chased", "mouse"], let's create the vectors for each document:

![Description](images/docum1.png)
![Description](images/docum2.png)
![Description](images/docum3.png)
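The three steps above can be sketched in plain Python. This is a minimal illustration using only the standard library; the helper name `bow_vector` is hypothetical, and whitespace splitting stands in for a real tokenizer because the toy sentences contain no punctuation.

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse",
]

# Step 1: tokenization (whitespace split is enough for these toy sentences)
tokenized = [doc.split() for doc in documents]

# Step 2: vocabulary of unique words, kept in first-seen order
vocabulary = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# Step 3: one count vector per document, aligned with the vocabulary
def bow_vector(tokens, vocab):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each vector has one entry per vocabulary word, so all three documents end up with the same length (10) regardless of how many words they contain.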


### **2. TF-IDF (Term Frequency-Inverse Document Frequency)**
TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It helps to highlight important words in a document while reducing the impact of commonly occurring words that might be less informative. 

##### **2.1 Term Frequency (TF)**
Term Frequency measures how frequently a term appears in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document.

![Description](images/tf.png)


##### **2.2 Inverse Document Frequency (IDF)**
Inverse Document Frequency (IDF) measures how important a term is by comparing the number of documents that contain the term to the total number of documents. If a term appears in many documents, its IDF value will be low.

![Description](images/idf.png)


##### **2.3 TF-IDF**
TF-IDF combines both measures to give a score that represents the importance of a term in a document relative to the entire corpus.

![Description](images/tf-idf.png)
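For reference, the three definitions above can be written compactly as follows. The symbols are assumptions introduced here: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$. The logarithm base is not fixed by the text; any base gives the same ranking of terms.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{n_t}, \qquad
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```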


##### **2.4 TF-IDF Example**
Step 1 : Calculate Term Frequency (TF)

Term Frequency (TF) measures how frequently a term appears in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document.

![Description](images/tfnew.png)

Step 2: Calculate Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how important a term is across the entire corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

![Description](images/idfnew.png)

Step 3: Calculate TF-IDF

TF-IDF is the product of TF and IDF for each term in each document.

![Description](images/tf-idfnew.png)

Summary of Results

![Description](images/tfidfres.png)
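The same three steps can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it uses the three cat/dog/mouse documents from the Bag-of-Words section, the natural logarithm for IDF, and no smoothing, so the numbers may differ from a table computed with another log base or from library implementations that add smoothing. The function names are hypothetical.

```python
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(tokens):
    """TF: count of each term divided by the total number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequency(tokenized_docs):
    """IDF: log(total documents / documents containing the term), natural log assumed."""
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    n_docs = len(tokenized_docs)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(tokenized)

# TF-IDF: product of TF and IDF for each term in each document
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
    for tokens in tokenized
]

for i, scores in enumerate(tfidf, start=1):
    print(f"Document {i}:", {term: round(score, 3) for term, score in scores.items()})
```

Because "the" and "cat" occur in all three documents, their IDF is log(3/3) = 0 and their TF-IDF score is 0 everywhere, while words unique to one document, such as "sat" or "mouse", receive the highest scores.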


### **3. n-grams**
The n-gram model is an extension of the Bag-of-Words model that considers sequences of n words (called n-grams) instead of individual words (unigrams). This allows the model to capture some of the context and order of words in the text. An n-gram is a contiguous sequence of n items from a given text.

#### **Types of n-grams**
1. Unigram: Single word (n=1)
2. Bigram: Sequence of two words (n=2)
3. Trigram: Sequence of three words (n=3)

### **Example**
Consider the following three simple documents:
1. "the cat sat on the mat"
2. "the dog barked at the cat"
3. "the cat chased the mouse"

##### **Step 1: Tokenization**
    Document 1: ["the", "cat", "sat", "on", "the", "mat"]

    Document 2: ["the", "dog", "barked", "at", "the", "cat"]

    Document 3: ["the", "cat", "chased", "the", "mouse"]


##### **Step 2: Generate Bigrams:**
    Document 1: ["the cat", "cat sat", "sat on", "on the", "the mat"]

    Document 2: ["the dog", "dog barked", "barked at", "at the", "the cat"]

    Document 3: ["the cat", "cat chased", "chased the", "the mouse"]



##### **Step 3: Vocabulary Creation:**
Combine all bigrams from all documents and find unique bigrams

    Vocabulary: ["the cat", "cat sat", "sat on", "on the", "the mat", "the dog", "dog barked", "barked at", "at the", "cat chased", "chased the", "the mouse"]


##### **Step 4: Vectorization:**
Create a vector for each document based on the vocabulary. Each position in the vector corresponds to a bigram in the vocabulary, and the value is the count of that bigram in the document.

For the given vocabulary ["the cat", "cat sat", "sat on", "on the", "the mat", "the dog", "dog barked", "barked at", "at the", "cat chased", "chased the", "the mouse"], let's create the vectors for each document:

![Description](images/docu1.png)
![Description](images/docu2.png)
![Description](images/docu3.png)
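The bigram walk-through above can be sketched in plain Python. The helper `ngrams` is a hypothetical name introduced for illustration; changing `n` to 1 or 3 yields unigrams or trigrams with the same code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

documents = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the cat chased the mouse",
]

# Steps 1 and 2: tokenize, then generate bigrams for each document
bigram_docs = [ngrams(doc.split(), 2) for doc in documents]

# Step 3: vocabulary of unique bigrams, kept in first-seen order
vocabulary = list(dict.fromkeys(bigram for doc in bigram_docs for bigram in doc))

# Step 4: one count vector per document, aligned with the bigram vocabulary
vectors = []
for doc in bigram_docs:
    counts = Counter(doc)
    vectors.append([counts[bigram] for bigram in vocabulary])

for doc, vec in zip(bigram_docs, vectors):
    print(doc, "->", vec)
```

Note that the bigram vocabulary (12 entries) is already larger than the unigram vocabulary (10 entries) for the same three sentences; vocabulary size tends to grow quickly with `n`.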


_________________________________________________________________________
## **Part 5: Model Training and Evaluation**
**Objective:** *Understand RNNs and their ability to handle sequence data*

##### **Introduction**

##### **Sequential Data**
Sequential data is data in which the order of the elements carries meaning. It is common in many real-world applications, such as:

1. Time series data: Stock prices, weather data, sensor readings, etc.
2. Natural language processing: Sentences in a text, where the order of words is important.
3. Biological sequences: DNA or protein sequences, where the order of nucleotides or amino acids matters.

##### **1. Recurrent Neural Networks (RNNs)**


##### **Overview:**


1. RNNs are designed to handle sequential data by maintaining a hidden state that captures information from previous time steps.
2. They are used for tasks where the order of data points matters, such as time series forecasting, language modeling, and speech recognition.
3. Unlike feedforward neural networks, RNNs have connections that form directed cycles, which allow them to maintain a state and capture temporal dependencies.

##### **Structure:**
RNNs have loops that allow information to persist. They process input sequences one step at a time, maintaining a hidden state that is updated at each time step.
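A minimal sketch of this step-by-step update, written in plain Python so it runs without a deep learning library. It assumes the common vanilla RNN formulation `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)`; the function names and the toy weights are hypothetical and untrained.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for j in range(len(h_prev)):
        total = b_h[j]
        total += sum(w_xh[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(sequence, h0, w_xh, w_hh, b_h):
    """Process the sequence one step at a time, carrying the hidden state forward."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Hypothetical toy weights: 1 input feature, 2 hidden units
w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]
h0 = [0.0, 0.0]

states = rnn_forward([[1.0], [0.0], [-1.0]], h0, w_xh, w_hh, b_h)
reversed_states = rnn_forward([[-1.0], [0.0], [1.0]], h0, w_xh, w_hh, b_h)

print("final state:", states[-1])
print("final state, reversed input:", reversed_states[-1])
```

Feeding the same values in reverse order produces a different final hidden state, which is what makes the model sensitive to order.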


##### **Problems:**
Vanishing Gradient: Gradients can become very small, making it difficult for the model to learn long-term dependencies.

Exploding Gradient: Gradients can become very large, leading to unstable training.
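Both problems come from multiplying many per-step factors together during backpropagation through time. The sketch below uses a deliberately simplified scalar linear recurrence `h_t = w * h_{t-1}` (an assumption made for illustration, not a full RNN), for which the gradient of the last state with respect to the first is just `w` raised to the number of steps.

```python
def gradient_through_time(recurrent_weight, num_steps):
    """Gradient of h_T with respect to h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= recurrent_weight  # chain rule: one factor of w per time step
    return grad

vanishing = gradient_through_time(0.5, 50)  # |w| < 1: gradient shrinks toward zero
exploding = gradient_through_time(1.5, 50)  # |w| > 1: gradient grows without bound

print(f"w = 0.5, 50 steps: {vanishing:.3e}")
print(f"w = 1.5, 50 steps: {exploding:.3e}")
```

In a real RNN the per-step factor also involves the activation's derivative and a weight matrix rather than a scalar, but the same shrinking or growing behavior appears over long sequences.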


##### **2. Long Short-Term Memory (LSTM)**


##### **Overview:**


1. LSTM networks are a type of RNN designed to capture long-term dependencies. They include mechanisms called gates to regulate the flow of information.
2. LSTMs are used for the same tasks as RNNs but perform better when the sequence length is long.


##### **Structure:**
1. Forget Gate: Decides what information to throw away from the cell state.
2. Input Gate: Decides which new information to store in the cell state.
3. Output Gate: Decides what part of the cell state to output.
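The interplay of the three gates can be sketched with a scalar LSTM cell in plain Python. This follows the commonly used formulation (sigmoid gates, a tanh candidate, cell update `c_t = f_t * c_prev + i_t * g_t`); the function names, the parameter layout, and the hand-picked weights are hypothetical and chosen only to make the gate behavior visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. `params` maps each name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_prev + bias

    f_t = sigmoid(pre_activation("forget"))       # what to keep from the old cell state
    i_t = sigmoid(pre_activation("input"))        # how much new information to store
    g_t = math.tanh(pre_activation("candidate"))  # the new candidate information
    o_t = sigmoid(pre_activation("output"))       # what part of the cell state to expose

    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hand-picked biases: forget gate near 1, input gate near 0, output gate near 1
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 2.0
for x_t in [5.0, -3.0, 1.0]:
    h, c = lstm_step(x_t, h, c, params)

print("cell state after 3 steps:", c)
print("hidden state after 3 steps:", h)
```

With the forget gate open and the input gate closed, the cell state stays close to its initial value of 2.0 no matter what inputs arrive, which is the mechanism that lets an LSTM carry information across many time steps.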


##### **Problems:**
Complexity: LSTMs are more complex and computationally expensive compared to simple RNNs.

Training Time: Longer training times due to their complexity.

##### **3. Gated Recurrent Unit (GRU)**


##### **Overview:**


1. GRUs are a simplified version of LSTMs that combine the forget and input gates into a single update gate.
2. They aim to provide similar benefits to LSTMs but with fewer parameters and less computational complexity.


##### **Structure:**
1. Reset Gate: Decides how much of the past information to forget.
2. Update Gate: Decides how much of the new information to use to update the hidden state (a GRU has no separate cell state).
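A scalar GRU step in plain Python shows how the two gates interact. This is a sketch under one common convention, in which the update gate `z_t` weights the new candidate: `h_t = (1 - z_t) * h_prev + z_t * h_candidate`. Some libraries swap the roles of `z_t` and `1 - z_t`. The function names, parameter layout, and weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, params):
    """One step of a scalar GRU cell. `params` maps each name to (w_x, w_h, bias)."""
    def pre_activation(name, h_in):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_in + bias

    r_t = sigmoid(pre_activation("reset", h_prev))   # how much past state to forget
    z_t = sigmoid(pre_activation("update", h_prev))  # how much new information to use
    # The reset gate scales the previous state before it enters the candidate
    h_candidate = math.tanh(pre_activation("candidate", r_t * h_prev))
    return (1.0 - z_t) * h_prev + z_t * h_candidate

base = {"reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}

# Update gate near 0: the previous hidden state is carried over almost unchanged
h_keep = gru_step(3.0, 0.5, {**base, "update": (0.0, 0.0, -20.0)})

# Update gate near 1: the hidden state is replaced by the new candidate
h_overwrite = gru_step(3.0, 0.5, {**base, "update": (0.0, 0.0, 20.0)})

print("update gate closed:", h_keep)
print("update gate open:", h_overwrite)
```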


##### **Problems:**
Long-Term Dependencies: While GRUs can capture long-term dependencies, they might not be as effective as LSTMs in very long sequences.

Complexity: More complex than simple RNNs but simpler than LSTMs.

##### **Comparison of RNN, LSTM, and GRU:**


##### **RNN:**

Pros: Simpler architecture, faster computation.

Cons: Struggles with long-term dependencies due to vanishing/exploding gradient problems.


##### **LSTM:**
Pros: Effective at capturing long-term dependencies, mitigates the vanishing gradient problem.

Cons: More complex, longer training time.


##### **GRU:**
Pros: Simpler than LSTMs, faster training, effective at capturing long-term dependencies.

Cons: Slightly less expressive than LSTMs but generally performs similarly.
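The difference in complexity can be made concrete by counting parameters. The sketch below assumes the textbook formulation in which each gate or candidate block has an input weight matrix, a recurrent weight matrix, and one bias vector; a simple RNN has 1 such block, a GRU 3, and an LSTM 4. Some frameworks store extra bias terms, so the counts they report can differ slightly. The function name and the layer sizes are hypothetical.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters of a recurrent layer with `num_blocks` gate/candidate blocks."""
    # Each block: input weights + recurrent weights + one bias vector
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 100, 128  # hypothetical layer sizes

counts = {
    "RNN": recurrent_param_count(input_size, hidden_size, 1),
    "GRU": recurrent_param_count(input_size, hidden_size, 3),
    "LSTM": recurrent_param_count(input_size, hidden_size, 4),
}

for name, count in counts.items():
    print(f"{name}: {count:,} parameters")
```

For the same layer sizes, a GRU has three quarters of the parameters of an LSTM, which is consistent with its shorter training time.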

##### **Applications of RNNs**
1. Time series prediction
2. Natural language processing
3. Speech recognition
4. Video processing