<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>NLP Pipeline
</b></div>
### What is an NLP Pipeline?

An NLP pipeline is the set of steps followed to build end-to-end NLP software. A typical pipeline consists of the following steps:

1. **Data Acquisition**
   
2. **Text Preparation**
   - **Text Cleanup**
   - **Basic Preprocessing**
   - **Advanced Preprocessing**

3. **Feature Engineering**

4. **Modelling**
   - **Model Building**
   - **Evaluation**

5. **Deployment**
   - **Deployment**
   - **Monitoring**
   - **Model Update**

These steps are essential for creating effective NLP applications, from acquiring and preparing data to engineering features, building and evaluating models, and finally deploying and maintaining the models in a production environment.

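**It's not universal**: the exact steps depend on the task and the data.

**The pipeline is non-linear**: you often return to an earlier step, for example from evaluation back to preprocessing.

**This is an ML-based pipeline**: deep learning pipelines can look somewhat different.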

<div style="text-align:center; margin-top:20px;">
    <img src="https://miro.medium.com/v2/resize:fit:944/1*dWY7adQ62NDn_w_sc4lAKw.png" alt="NLP Pipeline" style="width:80%; border-radius:10px;"/>
    <p style="font-size:14px; color:#555;">Image Credit: <a href="https://medium.com/swlh/journey-through-the-world-of-nlp-nlp-pipeline-part-2-1-744c0f72125f" target="_blank" style="color:#555;">Medium</a></p>
</div>


### Detailed explanation of each point in an NLP pipeline:

### Data Acquisition
**Data Acquisition** is the process of collecting text data for NLP tasks. This can include:
- **Web Scraping**: Extracting text data from websites.
- **APIs**: Using APIs to gather data from various platforms like Twitter, Reddit, etc.
- **Databases**: Retrieving text data from structured databases.
- **Manual Collection**: Hand-collecting data, including surveys and interviews.

### Text Preparation
**Text Preparation** involves cleaning and preprocessing the raw text data to make it suitable for analysis.

#### Text Cleanup
- **Remove Noise**: Eliminate irrelevant data such as HTML tags, special characters, and extra spaces.
- **Case Normalization**: Convert all text to lowercase or uppercase for consistency.
- **Spelling Correction**: Correct common spelling errors to ensure uniformity.

#### Basic Preprocessing
- **Tokenization**: Splitting text into words, sentences, or phrases.
- **Stop Words Removal**: Removing common words that do not contribute much meaning (e.g., "and", "the").
- **Punctuation Removal**: Eliminating punctuation marks to focus on the words.

#### Advanced Preprocessing
- **Lemmatization**: Reducing words to their base or dictionary form (e.g., "running" to "run").
- **Stemming**: Reducing words to their root form (e.g., "fishing" to "fish").
- **POS Tagging**: Identifying parts of speech (nouns, verbs, adjectives, etc.) for each word.
- **Named Entity Recognition (NER)**: Identifying and classifying named entities (e.g., names of people, organizations, locations).

### Feature Engineering
**Feature Engineering** involves creating features from text data that can be used for modeling:
- **Bag of Words (BoW)**: Representing text as a collection of its words.
- **TF-IDF**: Weighing the importance of words based on their frequency and uniqueness.
- **Word Embeddings**: Representing words as dense vectors (e.g., Word2Vec, GloVe).
- **N-grams**: Extracting contiguous sequences of n tokens.
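
As a rough illustration of the representations listed above, the sketch below builds bag-of-words counts, bigrams, and TF-IDF weights using only the standard library. The toy corpus and the helper names (`ngrams`, `tf_idf`) are assumptions made for this example, and the weighting shown (relative term frequency times `log(N / df)`) is only one of several TF-IDF variants; real projects usually rely on a library vectorizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, for illustration only)
docs = [
    "the movie was great",
    "the movie was boring",
    "great acting and great plot",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: one word-count mapping per document
bow = [Counter(tokens) for tokens in tokenized]

def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(tokenized_docs):
    """Weight each term by (relative term frequency) * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

weights = tf_idf(tokenized)
print(bow[2]["great"])          # how often "great" occurs in the third document
print(ngrams(tokenized[0], 2))  # bigrams of the first document
print(weights[1])               # "boring" (one document) outweighs "the" (two documents)
```

Because "boring" appears in only one document, it receives a higher weight than words shared across documents, which is the intuition behind TF-IDF.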

### Modelling
**Modelling** involves building and evaluating machine learning models to perform NLP tasks.

#### Model Building
- **Selecting Algorithms**: Choosing appropriate algorithms (e.g., Naive Bayes, SVM, neural networks).
- **Training Models**: Feeding the processed text data into the algorithms to train the models.
- **Hyperparameter Tuning**: Adjusting model parameters to improve performance.

#### Evaluation
- **Metrics**: Using metrics like accuracy, precision, recall, F1-score, etc., to evaluate model performance.
- **Cross-Validation**: Using techniques like k-fold cross-validation to assess model reliability and robustness.
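- **Intrinsic vs. Extrinsic Evaluation**: Intrinsic evaluation scores the model on technical metrics, while extrinsic evaluation measures how well it serves the downstream application. See https://ai.plainenglish.io/nlp-evaluation-intrinsic-vs-extrinsic-assessment-ff1401505631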
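
To make the metrics above concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand for a binary sentiment task. The label lists are made up for illustration and the function name `precision_recall_f1` is an assumption; in practice a library implementation would normally be used.

```python
def precision_recall_f1(y_true, y_pred, positive="positive"):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels and predictions
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here two of the three predicted positives are correct and two of the three true positives are found, so precision and recall both come out at 2/3 while accuracy is 3/5.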

### Deployment
**Deployment** involves integrating the trained NLP model into a production environment and ensuring it functions correctly.

#### Deployment
- **Integration**: Embedding the model into applications or services where it will be used.
- **API Creation**: Developing APIs to allow external systems to interact with the model.

#### Monitoring
- **Performance Tracking**: Continuously monitoring model performance to detect issues.
- **Error Analysis**: Analyzing errors and making necessary adjustments to improve accuracy.

#### Model Update
- **Retraining**: Periodically retraining the model with new data to maintain its effectiveness.
- **Versioning**: Keeping track of model versions to manage updates and changes efficiently.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/game-of-thrones-books/004ssb.txt
/kaggle/input/game-of-thrones-books/005ssb.txt
/kaggle/input/game-of-thrones-books/001ssb.txt
/kaggle/input/game-of-thrones-books/002ssb.txt
/kaggle/input/game-of-thrones-books/003ssb.txt
/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Text Preprocessing
</b></div>

**Text preprocessing in NLP** is the process of cleaning and preparing raw text data for analysis. It includes steps like removing noise (e.g., HTML tags, special characters), normalizing case, correcting spelling errors, tokenizing text into words or sentences, removing stop words, stripping punctuation, and performing lemmatization or stemming to reduce words to their base forms. Advanced preprocessing may involve POS tagging, named entity recognition, and feature extraction techniques such as TF-IDF or word embeddings. This process enhances the quality and performance of NLP models.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Lowercasing
</b></div>

**Lowercasing** refers to the process of converting all characters in a text to lowercase. This standardization helps in reducing the complexity of text data by treating words with different cases (e.g., "Apple" and "apple") as the same word, thereby improving the efficiency and accuracy of subsequent text processing and analysis steps. Lowercasing is particularly useful in ensuring uniformity and consistency in the dataset.
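
A small sketch of the effect on vocabulary size, using a made-up sentence: the three spellings of "apple" collapse into a single token after lowercasing. Note the trade-off, which depends on the task: lowercasing also erases distinctions that can matter, such as "Apple" the company versus "apple" the fruit.

```python
from collections import Counter

# Hypothetical sample sentence
tokens = "Apple pie and apple juice and APPLE sauce".split()

print(len(Counter(tokens)))                     # 7 distinct tokens before lowercasing
print(len(Counter(t.lower() for t in tokens)))  # 5 distinct tokens after lowercasing
```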

In [2]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

In [3]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | `A wonderful little production. <br /><br />The...` | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [4]:
df["review"][5].lower()

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [5]:
df["review"] = df["review"].str.lower()

In [6]:
df["review"]

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Remove HTML tags using Regular expressions
</b></div>
We remove HTML tags from text for several key reasons:

1. **Clean Text**: HTML tags don't contribute to the actual content, only to its structure and presentation.
2. **Normalization**: Removing tags helps standardize the text, making it easier to process uniformly.
3. **Preprocessing**: Tags can interfere with tokenization and other text processing steps.
4. **Accuracy**: Clean text improves the performance of NLP models by focusing on meaningful content.
5. **Consistency**: Ensures uniformity across different text sources, simplifying downstream tasks.

In [7]:
import re

def remove_html_tags(text):
    # Non-greedy match: everything from "<" up to the nearest ">"
    pattern = re.compile(r"<.*?>")
    return pattern.sub("", text)

In [8]:
text = "<html><body><p> File </p><p> Author - Aman Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [9]:
remove_html_tags(text)

' File  Author - Aman Khan Click here to download'

In [10]:
df['review'] = df['review'].apply(remove_html_tags)
df['review'][7]

"this show was an amazing, fresh & innovative idea in the 70's when it first aired. the first 7 or 8 years were brilliant, but things dropped off after that. by 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.it's truly disgraceful how far this show has fallen. the writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. i find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. how can one recognize such brilliance and then see fit to replace it with such mediocrity? i felt i must give 2 stars out of respect for the original cast that made this show such a huge success. as it is now, the show is just awful. i can't believe it's still on the air."

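The regex approach has two limits that show up on real reviews: it leaves character entities such as `&amp;` untouched, and it glues together words that were separated only by `<br />` (see "today.it's" in the review above). The sketch below is one possible alternative based on the standard library's `html.parser`; the names `TextExtractor` and `strip_html` are hypothetical, and inserting a space for `<br>` is a design assumption rather than a rule.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat a line break as a space so neighbouring words are not merged
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(strip_html("<p>Tom &amp; Jerry<br />is <b>fun</b></p>"))  # Tom & Jerry is fun
```
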
<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Removing URLs
</b></div>

In NLP, removing URLs from text is important for several reasons:

1. **Noise Reduction**: URLs are often irrelevant to the text's main content and can introduce noise, affecting the quality of text analysis.

2. **Normalization**: Like HTML tags, URLs can disrupt the uniform processing of text, complicating tokenization and other preprocessing steps.

3. **Improved Model Performance**: Clean text without URLs helps NLP models focus on meaningful content, leading to better performance.

4. **Consistency**: Removing URLs ensures a consistent text format across different sources, simplifying text processing and analysis.

5. **Privacy and Security**: URLs can contain sensitive information or lead to security risks, so removing them helps in maintaining privacy and security.

Overall, removing URLs is a standard preprocessing step to ensure cleaner, more consistent, and useful text for NLP tasks.

In [11]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

This Python function `remove_url` is designed to remove URLs from a given text string.

**Regular Expression Compilation**:
   ```python
   pattern = re.compile(r'https?://\S+|www\.\S+')
   ```
   This line compiles a regular expression (regex) pattern into a regex object for later use. The pattern `r'https?://\S+|www\.\S+'` is used to match URLs:
   - `https?://`: Matches `http://` or `https://`. The `s?` part makes the `s` optional, so it matches both `http` and `https`.
   - `\S+`: Matches one or more non-whitespace characters, effectively capturing the entire URL.
   - `|`: Acts as an OR operator, meaning the pattern will match either the left side (`https?://\S+`) or the right side (`www\.\S+`).
   - `www\.\S+`: Matches URLs starting with `www.` followed by one or more non-whitespace characters.


In [12]:
text1 = 'Check out my Facebook https://www.facebook.com/'
text2 = 'Check out my Instagram https://www.instagram.com/'
text3 = 'Google search here www.google.com'
text4 = 'For GitHub click https://github.com/ to search check www.google.com'

In [13]:
remove_url(text2)

'Check out my Instagram '
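
Because `\S+` keeps matching until the next whitespace, the pattern also swallows punctuation that directly follows a URL, such as a comma or a full stop. The sketch below is one possible variation, not part of the original function: it swaps each URL for a placeholder token and puts trailing punctuation back. The name `replace_url`, the `<URL>` placeholder, and the set of trailing characters are all assumptions.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TRAILING = ".,;:!?)\"'"  # punctuation treated as not belonging to the URL (assumption)

def replace_url(text, placeholder="<URL>"):
    """Swap each URL for a placeholder, keeping punctuation that trails the URL."""
    def _swap(match):
        url = match.group(0)
        core = url.rstrip(TRAILING)
        return placeholder + url[len(core):]
    return URL_PATTERN.sub(_swap, text)

print(replace_url("For GitHub click https://github.com/, to search check www.google.com."))
# For GitHub click <URL>, to search check <URL>.
```

Keeping a placeholder instead of deleting the URL preserves the information that a link was present, which some tasks can use as a feature.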

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Removing Punctuation
</b></div>

In NLP, removing punctuation helps:

1. **Simplify Text**: Reduces complexity for processing.
2. **Normalize Data**: Ensures uniform text format.
3. **Improve Tokenization**: Prevents punctuation from affecting word splits.
4. **Enhance Model Performance**: Focuses on meaningful content for better results.
5. **Reduce Size**: Punctuation adds tokens and characters to the document; removing it leaves less text to process.

In [14]:
import string, time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char, "")
    return text

In [17]:
# The backslash is escaped (\\) so that Python does not read "\b" as a backspace character
text = "Hello, world! This is a test: do you like it? Yes, I do... A lot; really! How about you? @username #hashtag $dollar %percent ^caret &amp *star (parentheses) -dash_underscore+plus=equals{curly}brackets[brackets]|\\backslash~tilde`backtick"


In [18]:
remove_punctuation(text)

'Hello world This is a test do you like it Yes I do A lot really How about you username hashtag dollar percent caret amp star parentheses dashunderscoreplusequalscurlybracketsbracketsbackslashtildebacktick'

In [19]:
start=time.time()
print(remove_punctuation(text))
time1=time.time()-start
print(time1)

Hello world This is a test do you like it Yes I do A lot really How about you username hashtag dollar percent caret amp star parentheses dashunderscoreplusequalscurlybracketsbracketsbackslashtildebacktick
0.00015306472778320312


In [20]:
def remove_punctuation2(text):
    return text.translate(str.maketrans('','',exclude))

In [21]:
start=time.time()
print(remove_punctuation2(text))
time2=time.time()-start
print(time2)

Hello world This is a test do you like it Yes I do A lot really How about you username hashtag dollar percent caret amp star parentheses dashunderscoreplusequalscurlybracketsbracketsbackslashtildebacktick
0.00023508071899414062
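
The two single measurements above each include the cost of `print` and come from one run on a very short string, so they are too noisy to rank the two functions. A more reliable comparison, sketched below under the assumption that a longer input is representative, repeats each call many times with `timeit`. The sample string and the repeat count are arbitrary choices, and the absolute numbers will depend on the machine.

```python
import string
import timeit

exclude = string.punctuation
# Longer input so the work being timed dominates the measurement overhead
sample = "Hello, world! This is a test: do you like it? " * 1000

def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char, "")
    return text

table = str.maketrans("", "", exclude)  # build the translation table once

def remove_punctuation2(text):
    return text.translate(table)

# Both approaches should return the same string
assert remove_punctuation(sample) == remove_punctuation2(sample)

# Average over repeated runs
runs = 50
for func in (remove_punctuation, remove_punctuation2):
    seconds = timeit.timeit(lambda: func(sample), number=runs)
    print(func.__name__, seconds / runs)
```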


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Common Chat Abbreviations and Slang
</b></div>

Handling chat words in NLP is crucial for several reasons:

1. **Improved Understanding**: Expanding chat abbreviations helps models better understand the content.
2. **Contextual Accuracy**: Many chat words affect sentiment, tone, or intent (e.g., "LOL" vs. "Laughing Out Loud").
3. **Data Normalization**: Ensures uniformity and consistency in text data, simplifying processing and analysis.
4. **Enhanced Model Training**: Models trained on expanded forms of chat words perform more accurately.
5. **Sentiment Analysis**: Properly handling chat words ensures more accurate sentiment detection (e.g., "LMAO" indicates strong amusement).
6. **Readability**: Expanded chat words are clearer for both humans and NLP tasks like summarization or translation.

In [22]:
chat_word = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': "For What It's Worth",
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA?': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laugher',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': "I don't care",
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'BFF': 'Best friends forever',
    'CSL': "Can't stop laughing"
}

In [23]:
def short_conv(text):
    new_text = []  # Initialize an empty list to hold the processed words
    for w in text.split():  # Split the input text into words and iterate over them
        if w.upper() in chat_word:  # Check if the uppercase version of the word is in the chat_word dictionary
            new_text.append(chat_word[w.upper()])  # If it is, append the full form from the dictionary to new_text
        else:
            new_text.append(w)  # If it is not, append the original word to new_text
    return " ".join(new_text)  # Join the processed words into a single string and return it

In [24]:
short_conv("LOL I will BRB")

'Laughing Out Loud I will Be Right Back'
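
Two edge cases are worth keeping in mind with `short_conv`: a token with punctuation attached (for example "brb!") is not found in the dictionary, and because the lookup upper-cases every word, ordinary words such as "time" are expanded too ("TIME" is a key). The sketch below shows one possible way to handle both. The reduced dictionary `chat_word_sample`, the `ambiguous` set, and the function name `expand_chat_words` are hypothetical, and the rule "expand ambiguous keys only when written in capitals" is an assumption.

```python
import re

# A few entries copied from chat_word so that the sketch runs on its own
chat_word_sample = {
    "LOL": "Laughing Out Loud",
    "BRB": "Be Right Back",
    "TIME": "Tears in my eyes",
}
# Keys that are also ordinary English words (hypothetical, hand-picked)
ambiguous = {"TIME"}

def expand_chat_words(text, mapping=chat_word_sample):
    def _expand(match):
        word = match.group(0)
        key = word.upper()
        if key not in mapping:
            return word
        if key in ambiguous and word != key:
            return word  # leave the ordinary word "time" alone
        return mapping[key]
    # \w+ matches each word without the punctuation around it
    return re.sub(r"\w+", _expand, text)

print(expand_chat_words("lol, no time now... brb!"))
# Laughing Out Loud, no time now... Be Right Back!
```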

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Spelling Correction
</b></div>

Spelling correction in NLP is done to improve text quality and ensure accurate analysis. Correcting spelling errors helps in:

1. **Enhanced Understanding**: Ensures that words are recognized correctly by NLP models.
2. **Data Consistency**: Maintains uniformity in text data.
3. **Improved Model Performance**: Reduces noise, leading to better model training and predictions.
4. **Accurate Results**: Improves the accuracy of tasks like sentiment analysis, information retrieval, and machine translation.

In [25]:
from textblob import TextBlob
incorrect_text = "Ths is an exmple of a sentnce with sevral speling erors."

textblb = TextBlob(incorrect_text)
textblb.correct().string

'The is an example of a sentence with several spelling errors.'
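
Note that the correction is not perfect: in the output above, "Ths" became "The" rather than "This". To give a feel for how a corrector can pick candidates, the sketch below uses the standard library's `difflib.get_close_matches` against a tiny vocabulary. This is not how TextBlob works internally; the vocabulary, the cutoff value, and the name `correct_word` are assumptions for illustration, and a usable corrector would need a full word list plus word frequencies.

```python
import difflib

# Hypothetical mini-vocabulary; a real corrector needs a full word list
vocabulary = ["this", "is", "an", "example", "of", "a", "sentence",
              "with", "several", "spelling", "errors"]

def correct_word(word, vocab=vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

words = "ths is an exmple of a sentnce with sevral speling erors".split()
print(" ".join(correct_word(w) for w in words))
```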

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Removing StopWords
</b></div>

Removing stop words in NLP text processing is like cleaning up unnecessary words like "the", "is", and "and" from sentences. These words appear frequently in language but don't add much meaning. By getting rid of them, we focus more on the important words that carry the actual message, making our analysis faster and more accurate. It's like decluttering a room so you can see and understand the important things better.

In [26]:
from nltk.corpus import stopwords

stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [27]:
stop_words = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word

def remove_stopwords(text):
    # Drop stop words entirely, comparing in lowercase so that "The" is removed as well
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

text = "The quick brown fox jumps over the lazy dog. In a nutshell, it's all about how you can improve your writing skills by using the right words in the right context."
remove_stopwords(text)

'quick brown fox jumps lazy dog. nutshell, improve writing skills using right words right context.'

In [28]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. the filming tec... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |


In [29]:
# df['review'].apply(remove_stopwords)
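
One caveat for sentiment tasks such as these movie reviews: NLTK's English list contains negations like "not" and "no", and removing them can flip the meaning of a sentence. The sketch below uses a small hand-picked stop-word set so that it runs on its own; the set, the `keep` exceptions, and the function name are assumptions for illustration.

```python
# Small hand-picked stop-word set for illustration; NLTK's English list is much longer
stop_words = {"the", "is", "a", "and", "this", "was", "not", "no"}
# Negations carry sentiment, so this sketch keeps them (a design assumption)
keep = {"not", "no"}
sentiment_stop_words = stop_words - keep

def remove_stopwords_keep_negation(text):
    return " ".join(w for w in text.split() if w.lower() not in sentiment_stop_words)

print(remove_stopwords_keep_negation("this movie was not good"))  # movie not good
```

Without the exception the result would be "movie good", which reads as the opposite of the original review.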

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Handling Emojis 😊
</b></div>

Handling emojis in NLP text processing is important because emojis convey emotional and contextual information that traditional text alone may not fully capture. Here's why it matters:

1. **Emotional Context**: Emojis provide emotional cues such as happiness 😊, sadness 😢, or surprise 😮, which are crucial for sentiment analysis and understanding the tone of text.

2. **Enhanced Meaning**: They enrich the meaning of text by adding nuances that words alone might not express effectively. For example, "I'm excited!" might convey more with a 😃 emoji.

3. **Communication Style**: Emojis reflect modern communication styles and can impact how messages are interpreted in social media, customer feedback, or online reviews.

4. **Choice of Handling**: Depending on the application, emojis can be removed to focus purely on textual analysis, or they can be replaced with their textual description (emojis like 😊 become "smiling face with smiling eyes").

In [30]:
# Removing Emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                               u"\U0001F680-\U0001F6FF"  # transport and map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U00002FC2-\U0001F251"  # very broad range: also matches many non-emoji characters (e.g. CJK text)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

text="I'm so excited for the party tonight! 🎉 Can't wait to see everyone there! 😄"
remove_emoji(text)

"I'm so excited for the party tonight!  Can't wait to see everyone there! "

In [31]:
#Replacing Emojis
import emoji
print(emoji.demojize(text))

I'm so excited for the party tonight! :party_popper: Can't wait to see everyone there! :grinning_face_with_smiling_eyes:
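
If the `emoji` package is not available, a rough stand-in can be sketched with the standard library's `unicodedata` module, which knows the official Unicode name of each character. This is an approximation under stated assumptions: the function name `describe_emoji` is hypothetical, the "Symbol, other" category (`So`) also covers non-emoji symbols such as ©, multi-character emoji sequences are not handled, and Unicode names can differ from the names the `emoji` package uses.

```python
import unicodedata

def describe_emoji(text):
    """Replace characters in the 'Symbol, other' category with their Unicode names."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown")
            out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return "".join(out)

print(describe_emoji("Party tonight! 🎉"))  # Party tonight! :party_popper:
```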


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Tokenization
</b></div>

#### What is Tokenization?

Tokenization is the process of breaking down text into smaller pieces called tokens. These tokens can be words, phrases, or even individual characters, depending on the application. Think of it like cutting a paragraph into smaller, manageable parts.

#### Example:

Imagine you have this sentence:

```plaintext
I love eating pizza!
```

When we tokenize it into words, it becomes:

```plaintext
["I", "love", "eating", "pizza", "!"]
```

Each word and punctuation mark becomes a separate token.

#### Why Do We Use Tokenization in NLP?

1. **Easier Analysis**: Breaking text into tokens makes it easier to analyze. It's like reading a book one word at a time instead of trying to understand it all at once.
   
2. **Understanding Context**: It helps in understanding the context of each word in a sentence. For example, knowing that "love" is followed by "eating" gives a clear picture of the meaning.

3. **Efficient Processing**: Computers can process and analyze tokens more efficiently than long strings of text. It speeds up tasks like searching for specific words or understanding the structure of sentences.

4. **Building Blocks for NLP Tasks**: Tokenization is the first step for many NLP tasks like sentiment analysis, translation, and text summarization. It prepares the text for more complex processing.

Tokenization helps break down text into smaller, understandable parts, making it easier for computers to analyze and work with.

### 1. Split function

In [32]:
# word tokenization
sent1 = 'I am from mumbai'
sent1.split()

['I', 'am', 'from', 'mumbai']

In [33]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [34]:
# Problem with split(): punctuation stays attached to the word
sent3 = 'I am going to delhi!!!!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!!!!']

In [35]:
# Problem with split('.'): sentences that end with '?' are not separated
sent4 = 'Where do you think I should go? I have 3 day holiday'
sent4.split('.')

['Where do you think I should go? I have 3 day holiday']

### 2. Regular Expression

In [36]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string so that \w is passed to the regex engine unchanged
tokens

['I', 'am', 'going', 'to', 'delhi']

In [37]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

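The two patterns above drop the punctuation: `[\w']+` discards the "!" and splitting on `[.!?] ` consumes the sentence-ending mark. A possible refinement, sketched below with hypothetical function names, keeps punctuation as separate word tokens and leaves the end mark attached to its sentence by splitting on the whitespace after it. It is still a heuristic: abbreviations such as "Ph.D. in" would be split incorrectly.

```python
import re

def regex_word_tokenize(text):
    # \w+(?:'\w+)? keeps contractions such as "Let's" together;
    # [^\w\s] emits every punctuation mark as its own token
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def regex_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ? so the mark stays with its sentence
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(regex_word_tokenize("I love eating pizza!"))
# ['I', 'love', 'eating', 'pizza', '!']
print(regex_sent_tokenize("Where do you think I should go? I have 3 day holiday. Great!"))
# ['Where do you think I should go?', 'I have 3 day holiday.', 'Great!']
```
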
### 3. NLTK

In [38]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [39]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [40]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [41]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com" # Failed: the email address is split into three tokens
sent7 = 'A 5km ride cost $10.50' # Failed: '5km' is kept as a single token

print(word_tokenize(sent5))
print(word_tokenize(sent6))
print(word_tokenize(sent7))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


### 4. spaCy

In [42]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7) # here '5km' is split into '5' and 'km'
doc4 = nlp(sent1)

In [43]:
for token in doc3:
    print(token)

A
5
km
ride
cost
$
10.50


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Stemming
</b></div>

### What is Stemming?

Stemming is the process of reducing words to their base or root form, usually by stripping suffixes. It's like finding the "stem" of a word, which lets us treat different variations of the same word alike. The resulting stem is not always a dictionary word.

### Example:

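Imagine you have these words:

```plaintext
running, runner, runs, ran
```

When we apply a suffix-stripping stemmer such as Porter's, the regular inflections are reduced to the stem, while "runner" and the irregular form "ran" are left unchanged:

```plaintext
running -> run
runs    -> run
runner  -> runner
ran     -> ran
```

So "running" and "runs" both become "run". Mapping an irregular form such as "ran" to "run" requires lemmatization, which relies on a dictionary rather than suffix rules.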

### Why Do We Use Stemming in NLP?

1. **Simplifies Text**: Stemming simplifies words to their root form, which makes it easier to analyze text. For instance, "running" and "runs" are different forms of the same word, and stemming treats them as one.

2. **Reduces Complexity**: By converting different forms of a word to a common base, stemming reduces the number of unique words in a text. This makes the analysis more manageable and less complex.

3. **Improves Search Results**: In tasks like search engines or information retrieval, stemming helps find relevant documents by matching different word forms. For example, searching for "run" can also return results for "running" and "runs".

4. **Consistent Analysis**: It ensures that variations of a word are consistently analyzed together, improving the accuracy of tasks like text classification, sentiment analysis, and topic modeling.

### Example:

If you are building a program to understand customer reviews, you might have sentences like:

```plaintext
I enjoyed running in the park.
She runs every morning.
He is a fast runner.
Yesterday, I ran for an hour.
```

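Stemming will reduce "running" and "runs" to the common stem "run", so your program can treat the first two sentences as describing the same activity. A rule-based stemmer such as Porter's typically leaves "runner" and the irregular form "ran" unchanged, so the last two sentences are not merged automatically.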

Stemming helps simplify and standardize words in text, making it easier for computers to analyze and understand different forms of words as part of the same concept.

### What is a Stemmer?

A stemmer is a tool in NLP that reduces words to their root form or base form. This helps in simplifying and standardizing words for easier analysis.
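To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is deliberately simplified and is not the Porter algorithm: the suffix list, the minimum stem length of 3, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only, NOT the Porter algorithm).
# SUFFIXES and the minimum stem length of 3 are arbitrary assumptions.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order, longest first


def toy_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # "running" -> "runn" -> "run": drop one letter of a doubled consonant
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


words = ["running", "runs", "ran", "runner", "movies"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Even this tiny rule set shows two typical traits of stemmers: irregular forms such as "ran" are left untouched, and some results ("movi") are not dictionary words.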

### PorterStemmer:

- **Developed by**: Martin Porter in 1980.
- **Characteristics**: 
  - It's one of the oldest and most widely used stemming algorithms.
  - It uses a set of rules to iteratively strip suffixes from words.
  - Known for its simplicity and efficiency.
- **Example**:
  ```plaintext
  "running", "runs" -> "run"
  ```

### Snowball Stemmer:

- **Developed by**: Martin Porter as well; it is an improvement over the original Porter Stemmer.
- **Characteristics**:
  - Its English version is also known as the Porter2 Stemmer.
  - Revises some of the original rules and is generally regarded as a slight improvement over the original Porter Stemmer.
  - Supports multiple languages, unlike the original Porter Stemmer, which is English-only.
- **Example**:
  ```plaintext
  "running", "runs" -> "run"
  ```

Both stemmers aim to reduce words to their root form, but the Snowball Stemmer is a more advanced and versatile version of the original Porter Stemmer, supporting additional languages and refined stemming rules.

In [44]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [45]:
sample = "running run runs runned"
stem_words(sample)

'run run run run'

In [46]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [47]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

### Disadvantages of Stemming

1. **Over-Simplification**: Stemming can sometimes be too aggressive, reducing words to forms that are not real words (e.g., "movie" becoming "movi" and "probably" becoming "probabl" in the output above).

2. **Loss of Meaning**: Important nuances and meanings might be lost when different words are reduced to the same stem (e.g., "university" and "universe" both becoming "univers" with the Porter stemmer).

3. **Inconsistency**: Different stemming algorithms might produce different results for the same word, leading to inconsistency in text analysis.

4. **Language Limitations**: Some stemmers are designed for specific languages and might not work well with others.

Although stemming helps simplify text, it can sometimes go too far, losing important details and creating inconsistencies.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Lemmatization
</b></div>

### What is Lemmatization?

Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. Unlike stemming, which cuts off word endings, lemmatization considers the context and converts words to their actual root form as found in the dictionary.

### Example:

Imagine you have these words:

```plaintext
running, ran, runs
```

Lemmatization converts them all to:

```plaintext
run
```

### Why Do We Use Lemmatization in NLP?

1. **Accurate Base Forms**: It provides accurate base forms of words, maintaining the meaning. For example, "better" becomes "good," which is its true lemma.
   
2. **Improves Understanding**: Helps in understanding the text better by converting words to their proper form, making it easier for NLP models to analyze.

3. **Consistent Analysis**: Ensures consistency in text analysis by using standardized forms of words.

### What is a Lemma?

A lemma is the base or dictionary form of a word. For instance, the lemma of "running" and "ran" is "run."

Lemmatization is like looking up the correct word form in the dictionary. It helps computers understand and process text more accurately by converting words to their true base form. This way, words like "running" and "ran" are understood to be the same action, "run".
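The "dictionary lookup" idea can be sketched in a few lines. The table below is a tiny hand-written stand-in for a real lexicon, and the names `LEMMA_TABLE` and `toy_lemmatize` are hypothetical; real lemmatizers (such as the spaCy pipeline used in the next cell) rely on much larger resources and on part-of-speech information.

```python
# Toy dictionary-based lemmatizer (illustrative only).
# LEMMA_TABLE is a tiny hand-written lookup, an assumption for this sketch.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("better", "ADJ"): "good",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


examples = [("Running", "VERB"), ("ran", "VERB"), ("better", "ADJ"), ("startup", "NOUN")]
for word, pos in examples:
    print(word, "->", toy_lemmatize(word, pos))
```

Because the lookup is keyed on the part of speech as well as the word, irregular forms such as "ran" and "better" can be mapped to "run" and "good", which suffix stripping alone cannot do.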

In [48]:
import spacy

# Load the small English language model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Print named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
    
print("Word - Lemma")
for token in doc:
    print(f"{token.text} - {token.lemma_}")

Apple ORG
U.K. GPE
$1 billion MONEY
Word - Lemma
Apple - Apple
is - be
looking - look
at - at
buying - buy
U.K. - U.K.
startup - startup
for - for
$ - $
1 - 1
billion - billion
. - .


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>When to Use Stemming or Lemmatization?
</b></div>

### **Stemming:**

- **Quick and Simple**: Use stemming when you need fast results and don't care about perfect accuracy.
- **Large Datasets**: It's good for handling large amounts of text quickly.
- **Internal Use**: Best for when the text won't be shown to others, like internal processing or quick keyword matching.

### **Lemmatization:**

- **Accurate and Contextual**: Use lemmatization for more accurate word forms and better understanding.
- **Complex Tasks**: Ideal for tasks like sentiment analysis or translation where meaning matters.
- **Readable Output**: Choose lemmatization when the text will be shown to others, ensuring it looks correct and makes sense.

Use stemming for speed and simplicity when the text is for internal use, and go for lemmatization when accuracy and readability are important.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Feature Engineering
</b></div>

Feature engineering in NLP involves creating and selecting relevant features from raw text data to improve the performance of machine learning models. It includes processes like text normalization, tokenization, and extracting meaningful representations such as word embeddings or frequency counts.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>One Hot Encoding
</b></div>

**One-Hot Encoding (OHE)** is a technique used to convert categorical data into a binary (0 or 1) format. Each category in the dataset is transformed into a new binary column, where only one column is set to 1 (indicating the presence of the category), and all others are set to 0.

### Example with a Small Dataset

Suppose we have a dataset with a single categorical feature, "Fruit":

| ID | Fruit    |
|----|----------|
| 1  | Apple    |
| 2  | Banana   |
| 3  | Orange   |
| 4  | Banana   |
| 5  | Apple    |

After applying one-hot encoding, the dataset is transformed into:

| ID | Fruit_Apple | Fruit_Banana | Fruit_Orange |
|----|-------------|--------------|--------------|
| 1  | 1           | 0            | 0            |
| 2  | 0           | 1            | 0            |
| 3  | 0           | 0            | 1            |
| 4  | 0           | 1            | 0            |
| 5  | 1           | 0            | 0            |

### Advantages of One-Hot Encoding

1. **Simplicity**: Easy to implement and understand.
2. **No Ordinal Relationships**: Suitable for categorical variables where there is no ordinal relationship (no natural ordering).
3. **Compatibility**: Works well with many machine learning algorithms, including linear models and neural networks.

### Disadvantages of One-Hot Encoding

1. **High Dimensionality**: Can lead to a large number of columns, especially if the categorical variable has many unique values.
2. **Sparse Representation**: Results in sparse matrices, which can be memory inefficient.
3. **Loss of Information**: Does not capture any inherent relationships between categories (e.g., similarity between "Red" and "Pink").
4. **Out-of-Vocabulary (OOV) Data**: Categories that were not present in the training set have no column in the encoding. When such values appear during testing or deployment, the encoder must either raise an error or map them to a fallback such as an all-zero vector, which discards information.

### Why Use One-Hot Encoding in NLP for Feature Extraction

In Natural Language Processing (NLP), one-hot encoding is often used to represent words or tokens as binary vectors. Here’s why:

1. **Representation of Categorical Data**: Words are categorical data and need to be converted into a numerical form for machine learning models.
2. **No Ordinal Relationship**: In many cases, there is no inherent order to words, making one-hot encoding appropriate.
3. **Compatibility with Algorithms**: Many NLP algorithms and models (e.g., neural networks) can easily work with one-hot encoded vectors.
4. **Baseline Representation**: One-hot encoding provides a simple baseline representation for more complex embeddings like Word2Vec, GloVe, or contextual embeddings from transformer models.

### Example in NLP

Consider the sentence "I love NLP":

| Word  | One-Hot Vector |
|-------|----------------|
| I     | [1, 0, 0]      |
| love  | [0, 1, 0]      |
| NLP   | [0, 0, 1]      |

Each word is represented as a vector with a length equal to the number of unique words in the vocabulary, with a 1 indicating the presence of the word and 0s elsewhere.
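As a minimal sketch in plain Python, the encoding for "I love NLP" can be built by hand, following the rule that the vector length equals the vocabulary size. The helper name `one_hot` and the choice to map unseen words to an all-zero vector are assumptions for illustration; other conventions (such as a dedicated unknown-word column) are equally possible.

```python
# Minimal one-hot encoder for tokens (plain Python, no libraries).
sentence = "I love NLP"
tokens = sentence.split()

# Vocabulary: each unique token gets a column index in order of first appearance.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))


def one_hot(token, vocab):
    """Return a vector with a single 1 at the token's index.

    Out-of-vocabulary tokens map to an all-zero vector (an assumed convention).
    """
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab[token]] = 1
    return vec


for tok in tokens:
    print(tok, one_hot(tok, vocab))
print("Python", one_hot("Python", vocab))  # unseen (out-of-vocabulary) word
```

The last line illustrates the out-of-vocabulary problem: an unseen word carries no information at all in this representation.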

### Conclusion

One-hot encoding is a fundamental technique in machine learning and NLP for handling categorical data. Despite its limitations, it serves as a simple and effective method for representing categorical features in a format suitable for various algorithms.

In [49]:
from sklearn.preprocessing import OneHotEncoder

In [50]:
data = {
    'ID': [1, 2, 3, 4, 5],
    'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple']
}

df = pd.DataFrame(data)
df

Unnamed: 0,ID,Fruit
0,1,Apple
1,2,Banana
2,3,Orange
3,4,Banana
4,5,Apple


In [51]:
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['Fruit']])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['Fruit']))
final_df = pd.concat([df, encoded_df], axis=1)
final_df

Unnamed: 0,ID,Fruit,Fruit_Apple,Fruit_Banana,Fruit_Orange
0,1,Apple,1.0,0.0,0.0
1,2,Banana,0.0,1.0,0.0
2,3,Orange,0.0,0.0,1.0
3,4,Banana,0.0,1.0,0.0
4,5,Apple,1.0,0.0,0.0


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Bag of Words
</b></div>

**Bag of Words (BoW)** is a simple and commonly used technique in natural language processing (NLP) for converting text data into numerical features. It represents text data by counting the occurrence of each word in the document, disregarding grammar and word order, but keeping multiplicity.

### Example with a Small Dataset

Suppose we have a small dataset with three sentences:

1. "I love apples"
2. "I hate bananas"
3. "I love oranges"

First, we create a vocabulary of all unique words in the dataset:

| Word     | Index |
|----------|-------|
| I        | 1     |
| love     | 2     |
| apples   | 3     |
| hate     | 4     |
| bananas  | 5     |
| oranges  | 6     |

Using this vocabulary, we can represent each sentence as a vector of word counts:

| Sentence         | I | love | apples | hate | bananas | oranges |
|------------------|---|------|--------|------|---------|---------|
| "I love apples"  | 1 | 1    | 1      | 0    | 0       | 0       |
| "I hate bananas" | 1 | 0    | 0      | 1    | 1       | 0       |
| "I love oranges" | 1 | 1    | 0      | 0    | 0       | 1       |

### Advantages of Bag of Words

1. **Simplicity**: Easy to understand and implement.
2. **Direct Representation**: Directly represents the frequency of words in a document, making it straightforward to interpret.
3. **Compatibility**: Works well with many traditional machine learning algorithms like Naive Bayes and Support Vector Machines (SVM).

### Disadvantages of Bag of Words

1. **High Dimensionality**: Can result in very large and sparse matrices, especially with large vocabularies.
2. **No Context or Semantics**: Ignores the order of words and does not capture any semantic relationships between words.
3. **Feature Independence**: Assumes independence between words, which is often not true in natural language.

### Why Use Bag of Words in NLP for Feature Extraction

1. **Baseline Model**: Provides a simple baseline for text representation, which can be used as a starting point before moving to more complex models.
2. **Text Classification**: Effective for text classification tasks where the frequency of individual words is more important than their order or context.
3. **Feature Extraction**: Converts text into a numerical format that machine learning models can process.

### Core Intuition of Bag of Words

The core intuition behind Bag of Words is to treat text as a collection of individual words and to represent documents based on the frequency of each word in the document. The representation ignores the order and structure of words, focusing solely on their presence and count.

### Formula

Let $D$ be a document and let $V = \{w_1, w_2, \ldots, w_n\}$ be the vocabulary.

The Bag of Words representation of $D$ is a vector $\mathbf{v}_D$ of length $n$ whose $i$-th element $v_i$ is the count of word $w_i$ in $D$:

$\mathbf{v}_D = [\text{count}(w_1, D), \text{count}(w_2, D), \ldots, \text{count}(w_n, D)]$

where $\text{count}(w_i, D)$ is the number of times word $w_i$ appears in document $D$.

### Example

Consider the same sentences:

1. "I love apples"
2. "I hate bananas"
3. "I love oranges"

Vocabulary: {I, love, apples, hate, bananas, oranges}

For "I love apples":

$\mathbf{v}_{D1} = [1, 1, 1, 0, 0, 0]$

For "I hate bananas":

$\mathbf{v}_{D2} = [1, 0, 0, 1, 1, 0]$

For "I love oranges":

$\mathbf{v}_{D3} = [1, 1, 0, 0, 0, 1]$
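The vectors above can be reproduced with a short sketch in plain Python. It assumes simple whitespace tokenization and orders the vocabulary by first appearance; the helper name `bow_vector` is hypothetical.

```python
from collections import Counter

corpus = ["I love apples", "I hate bananas", "I love oranges"]

# Build the vocabulary in order of first appearance (whitespace tokenization).
vocab = []
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)


def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [bow_vector(doc, vocab) for doc in corpus]
print(vocab)
for doc, vec in zip(corpus, vectors):
    print(doc, vec)
```

Library implementations may tokenize differently. For example, scikit-learn's `CountVectorizer` lowercases text, keeps only tokens of two or more characters by default (so "I" is dropped), and orders columns alphabetically.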

### Conclusion

Bag of Words is a fundamental technique in NLP for converting text into numerical features. While it has limitations like ignoring word order and context, its simplicity and effectiveness make it a useful starting point for many text processing and machine learning tasks.


In [52]:
data = {
    'Text': [
        'I love apples',
        'I hate bananas',
        'I love oranges',
        'I love mango'
    ],
    'Output': [1, 1, 0, 0]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Text,Output
0,I love apples,1
1,I hate bananas,1
2,I love oranges,0
3,I love mango,0


In [53]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [54]:
bow = cv.fit_transform(df['Text'])

In [55]:
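# Learned vocabulary (word -> column index). "I" is missing because the
# default token_pattern only keeps tokens of two or more characters.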
cv.vocabulary_

{'love': 3, 'apples': 0, 'hate': 2, 'bananas': 1, 'oranges': 5, 'mango': 4}

In [56]:
bow[0].toarray()

array([[1, 0, 0, 1, 0, 0]])

In [57]:
vocab = cv.get_feature_names_out()
vocab

array(['apples', 'bananas', 'hate', 'love', 'mango', 'oranges'],
      dtype=object)

In [58]:
bow.toarray()

array([[1, 0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]])

In [59]:
bow_df = pd.DataFrame(bow.toarray(), columns=vocab)

final_df = pd.concat([df[['Output']], bow_df], axis=1)

print("Original DataFrame:")
print(df)
print("\nBag of Words DataFrame with binary parameter as False:")
print(final_df)

Original DataFrame:
             Text  Output
0   I love apples       1
1  I hate bananas       1
2  I love oranges       0
3    I love mango       0

Bag of Words DataFrame with binary parameter as False:
   Output  apples  bananas  hate  love  mango  oranges
0       1       1        0     0     1      0        0
1       1       0        1     1     0      0        0
2       0       0        0     0     1      0        1
3       0       0        0     0     1      1        0


In [60]:
#New Text
cv.transform(['apples apples hate bananas mango']).toarray()

array([[2, 1, 1, 0, 1, 0]])

```python
class sklearn.feature_extraction.text.CountVectorizer(*, 
                                                      input='content',  # Type of input (string, file, etc.)
                                                      encoding='utf-8',  # Character encoding for input
                                                      decode_error='strict',  # Error handling for decoding ('strict', 'ignore','replace')
                                                      strip_accents=None,  # Remove accents ('ascii', 'unicode', None)
                                                      lowercase=True,  # Convert text to lowercase
                                                      preprocessor=None,  # Custom preprocessing function
                                                      tokenizer=None,  # Custom tokenization function
                                                      stop_words=None,  # Words to ignore (list or 'english')
                                                      token_pattern='(?u)\\b\\w\\w+\\b',  # Regex for token extraction
                                                      ngram_range=(1, 1),  # Range of n-values for n-grams
                                                      analyzer='word',  # Type of analysis ('word', 'char', 'char_wb')
                                                      max_df=1.0,  # Max document frequency for filtering
                                                      min_df=1,  # Min document frequency for filtering
                                                      max_features=None,  # Max number of features
                                                      vocabulary=None,  # Predefined vocabulary
                                                      binary=False,  # If True, return binary occurrence
                                                      dtype=<class 'numpy.int64'>  # Data type of output
                                                     )

```
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>N-grams
</b></div>

**N-grams** are contiguous sequences of n items (words, characters, etc.) from a given text or speech. They are used in NLP to capture the context of words by considering the surrounding words in a text.

### Types of N-grams

1. **Unigrams**: Single words (n=1)
2. **Bigrams**: Pairs of consecutive words (n=2)
3. **Trigrams**: Triples of consecutive words (n=3)
4. **N-grams**: General form for sequences of n words

### Example with a Small Dataset

Consider the sentence: "I love natural language processing"

**Unigrams** (n=1):

| Unigram |
|---------|
| I       |
| love    |
| natural |
| language|
| processing |

**Bigrams** (n=2):

| Bigram                |
|-----------------------|
| I love                |
| love natural          |
| natural language      |
| language processing   |

**Trigrams** (n=3):

| Trigram                     |
|-----------------------------|
| I love natural              |
| love natural language       |
| natural language processing |

### Advantages of N-grams

1. **Contextual Representation**: Bigrams and trigrams capture more context than unigrams by considering adjacent words.
2. **Flexibility**: N-grams can be adjusted to capture different levels of context by changing the value of n.
3. **Improved Performance**: Including bigrams and trigrams can improve the performance of models, especially in tasks like text classification and language modeling.

### Disadvantages of N-grams

1. **Data Sparsity**: As n increases, the number of possible n-grams grows exponentially, leading to sparse data issues.
2. **Memory and Computation**: Higher-order n-grams require more memory and computational power to process and store.
3. **Limited Long-range Context**: N-grams capture only a limited context and may miss long-range dependencies in the text.

### Why Use N-grams in NLP for Feature Extraction

1. **Context Preservation**: Unlike unigrams, bigrams and trigrams can preserve some word order and context information, which is crucial for many NLP tasks.
2. **Feature Enrichment**: Using n-grams enriches the feature set for machine learning models, potentially leading to better performance.
3. **Language Models**: N-grams are foundational in building statistical language models to predict the next word in a sequence.

### Core Intuition of N-grams

The core intuition behind n-grams is to capture local context by considering contiguous sequences of words. This helps in understanding the syntactic and semantic relationships between words, improving the model's ability to process natural language.

### Formula

For a given sequence of words $S = [w_1, w_2, \ldots, w_m]$:

1. **Unigrams**: $[w_1, w_2, \ldots, w_m]$, i.e. $S$ itself
2. **Bigrams**: $[(w_1, w_2), (w_2, w_3), \ldots, (w_{m-1}, w_m)]$
3. **Trigrams**: $[(w_1, w_2, w_3), (w_2, w_3, w_4), \ldots, (w_{m-2}, w_{m-1}, w_m)]$

### Example

Consider the same sentence: "I love natural language processing"

**Unigrams**:

$ S_{uni} = [I, love, natural, language, processing] $

**Bigrams**:

$ S_{bi} = [(I, love), (love, natural), (natural, language), (language, processing)] $

**Trigrams**:

$ S_{tri} = [(I, love, natural), (love, natural, language), (natural, language, processing)] $
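A sliding window over the token list is enough to generate these sequences. The following is a minimal sketch in plain Python; the function name `ngrams` is an assumption, and tokenization is a simple whitespace split.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love natural language processing".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
print(len(unigrams), len(bigrams), len(trigrams))
```

A sequence of $m$ tokens yields $m - n + 1$ n-grams, so a five-word sentence gives 5 unigrams, 4 bigrams, and 3 trigrams.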

### Conclusion

N-grams are a fundamental concept in NLP for capturing the local context of words in a text. By considering sequences of words, n-grams help in preserving some syntactic and semantic relationships, making them useful for various NLP tasks such as text classification, language modeling, and feature extraction. Despite their limitations, n-grams provide a simple yet powerful way to enhance text representation and improve model performance.

In [61]:
data = {
    'Text': [
        'I love apples',
        'I hate bananas',
        'I love oranges',
        'I love mango'
    ],
    'Output': [1, 1, 0, 0]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Text,Output
0,I love apples,1
1,I hate bananas,1
2,I love oranges,0
3,I love mango,0


In [62]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range = (1,2)) #Both Uni and Bi gram

In [63]:
gram = cv.fit_transform(df['Text'])
feature_name = cv.get_feature_names_out()
df_gram = pd.DataFrame(gram.toarray(), columns = feature_name)
df_gram

Unnamed: 0,apples,bananas,hate,hate bananas,love,love apples,love mango,love oranges,mango,oranges
0,1,0,0,0,1,1,0,0,0,0
1,0,1,1,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,0,1
3,0,0,0,0,1,0,1,0,1,0


In [64]:
len(cv.vocabulary_)

10

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>TF-IDF
</b></div>

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).

- **Term Frequency (TF)**: Measures how frequently a term occurs in a document. It is often normalized to prevent bias towards longer documents.
- **Inverse Document Frequency (IDF)**: Measures how important a term is. It decreases the weight of terms that appear frequently in many documents and increases the weight of terms that appear in a few documents.

### Example

Consider a corpus with three documents:

1. "I love apples"
2. "I love bananas"
3. "I love apples and bananas"

#### Step 1: Calculate Term Frequency (TF)

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

| Document | Term     | TF      |
|----------|----------|---------|
| D1       | I        | 1/3     |
| D1       | love     | 1/3     |
| D1       | apples   | 1/3     |
| D2       | I        | 1/3     |
| D2       | love     | 1/3     |
| D2       | bananas  | 1/3     |
| D3       | I        | 1/5     |
| D3       | love     | 1/5     |
| D3       | apples   | 1/5     |
| D3       | and      | 1/5     |
| D3       | bananas  | 1/5     |

#### Step 2: Calculate Inverse Document Frequency (IDF)

$$
\text{IDF}(t) = \log_{10}\left(\frac{N}{df(t)}\right)
$$

where $N$ is the total number of documents and $df(t)$ is the number of documents containing the term $t$. The values below use the base-10 logarithm.

| Term    | df(t) | IDF       |
|---------|-------|-----------|
| I       | 3     | log(3/3) = 0    |
| love    | 3     | log(3/3) = 0    |
| apples  | 2     | log(3/2) = 0.176|
| bananas | 2     | log(3/2) = 0.176|
| and     | 1     | log(3/1) = 0.477|

#### Step 3: Calculate TF-IDF

TF-IDF is calculated by multiplying TF and IDF for each term in each document.

| Document | Term     | TF      | IDF       | TF-IDF        |
|----------|----------|---------|-----------|---------------|
| D1       | I        | 1/3     | 0         | 0             |
| D1       | love     | 1/3     | 0         | 0             |
| D1       | apples   | 1/3     | 0.176     | 0.059         |
| D2       | I        | 1/3     | 0         | 0             |
| D2       | love     | 1/3     | 0         | 0             |
| D2       | bananas  | 1/3     | 0.176     | 0.059         |
| D3       | I        | 1/5     | 0         | 0             |
| D3       | love     | 1/5     | 0         | 0             |
| D3       | apples   | 1/5     | 0.176     | 0.035         |
| D3       | and      | 1/5     | 0.477     | 0.095         |
| D3       | bananas  | 1/5     | 0.176     | 0.035         |
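The three steps can be checked with a few lines of plain Python. This is a minimal sketch of the textbook formulas used in this example (length-normalized TF and base-10 IDF without smoothing); the helper names `tf`, `idf`, and `tf_idf` are assumptions.

```python
import math

docs = {
    "D1": "I love apples",
    "D2": "I love bananas",
    "D3": "I love apples and bananas",
}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)


def tf(term, tokens):
    """Term count divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """log10(N / df); only valid for terms that occur in the corpus."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log10(n_docs / df)


def tf_idf(term, doc_name):
    return tf(term, tokenized[doc_name]) * idf(term)


for name, tokens in tokenized.items():
    for term in tokens:
        print(name, term, round(tf_idf(term, name), 3))
```

Terms that occur in every document ("I", "love") get an IDF of 0 and therefore a TF-IDF of 0, while "and", which occurs in only one document, gets the highest weight.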

### Advantages of TF-IDF

1. **Simple to Understand**: TF-IDF is easy to compute and understand.
2. **Effective**: Often effective for text representation in various NLP tasks like text classification and information retrieval.
3. **Reduces Noise**: By reducing the weight of common terms, it helps in emphasizing more informative words.

### Disadvantages of TF-IDF

1. **High Dimensionality**: Like Bag of Words, it can result in high-dimensional feature vectors.
2. **Context Ignorance**: Does not capture the semantic meaning or context of words.
3. **Static Nature**: Needs to be recomputed if the corpus changes, which can be computationally expensive for large datasets.

### Why Use TF-IDF in NLP for Feature Extraction

1. **Feature Weighting**: Provides a way to weight features based on their importance, making it useful for text mining and information retrieval.
2. **Information Retrieval**: Helps in ranking documents based on relevance to a query by giving higher weight to rare but important terms.
3. **Text Classification**: Enhances the performance of classifiers by focusing on significant words and ignoring common words that carry less information.

### Core Intuition of TF-IDF

The core intuition behind TF-IDF is to assign higher weights to words that are important in a particular document but not common across the entire corpus. This helps in distinguishing documents based on their unique terms.

### TF-IDF Formula

The TF-IDF value for a term $ t $ in a document $ d $ is calculated as:

$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $

where:
- $ \text{TF}(t, d) $ is the term frequency of $ t $ in document $ d $.
- $ \text{IDF}(t) $ is the inverse document frequency of term $ t $.

### Example

Consider document D1, "I love apples", from the three-document corpus above:

1. **Term Frequency (TF)**: 

| Term   | TF (raw count) | TF (normalized) |
|--------|----------------|-----------------|
| I      | 1              | 1/3             |
| love   | 1              | 1/3             |
| apples | 1              | 1/3             |

2. **Inverse Document Frequency (IDF)**:

| Term   | df(t) | IDF                  |
|--------|-------|----------------------|
| I      | 3     | log(3/3) = 0    |
| love   | 3     | log(3/3) = 0    |
| apples | 2     | log(3/2) = 0.176|

3. **TF-IDF Calculation**:

| Term   | TF (normalized) | IDF      | TF-IDF       |
|--------|-----------------|----------|--------------|
| I      | 1/3             | 0        | 0            |
| love   | 1/3             | 0        | 0            |
| apples | 1/3             | 0.176    | 0.059        |

### Conclusion

TF-IDF is a powerful technique for text feature extraction, offering a way to weigh terms by their importance in a document relative to a corpus. It reduces the influence of common terms and highlights unique terms, improving the effectiveness of information retrieval and text classification tasks.

In [65]:
data = {
    'Text': [
        'I love apples',
        'I hate bananas',
        'I love oranges',
        'I love mango'
    ],
    'Output': [1, 1, 0, 0]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Text,Output
0,I love apples,1
1,I hate bananas,1
2,I love oranges,0
3,I love mango,0


In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
out = tfidf.fit_transform(df['Text'])

In [67]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.91629073 1.91629073 1.91629073 1.22314355 1.91629073 1.91629073]
['apples' 'bananas' 'hate' 'love' 'mango' 'oranges']


In [68]:
feature_name = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(out.toarray(), columns = feature_name)
df_final = pd.concat([df[['Text']], df_tfidf], axis = 1)
df_final

Unnamed: 0,Text,apples,bananas,hate,love,mango,oranges
0,I love apples,0.842926,0.0,0.0,0.538029,0.0,0.0
1,I hate bananas,0.0,0.707107,0.707107,0.0,0.0,0.0
2,I love oranges,0.0,0.0,0.0,0.538029,0.0,0.842926
3,I love mango,0.0,0.0,0.0,0.538029,0.842926,0.0
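The numbers above differ from the textbook formula because `TfidfVectorizer` by default uses a smoothed IDF with the natural logarithm, $\text{idf}(t) = \ln\left(\frac{1 + N}{1 + df(t)}\right) + 1$, multiplies it by the raw term count, and then L2-normalizes each row. The sketch below reproduces those values in plain Python; the tokenization line only approximates the default behaviour (lowercasing and dropping single-character tokens), and the helper name `tfidf_row` is hypothetical.

```python
import math

docs = ["I love apples", "I hate bananas", "I love oranges", "I love mango"]

# Approximate the default tokenization: lowercase, keep tokens of 2+ characters.
tokenized = [[t for t in d.lower().split() if len(t) >= 2] for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df = sum(1 for tokens in tokenized if term in tokens)
    idf[term] = math.log((1 + n) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count * smoothed IDF, followed by L2 normalization of the row."""
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


print({t: round(v, 8) for t, v in idf.items()})
print([round(x, 6) for x in tfidf_row(tokenized[0])])
```

With smoothing, a term that appears in most documents ("love") still receives a positive weight instead of dropping to zero.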


<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Word Embedding
</b></div>

Word embedding is a technique used in Natural Language Processing (NLP) to represent words as vectors of real numbers in a continuous vector space. This representation allows words with similar meanings to have similar representations, which makes it easier for machine learning models to process and understand text data. 

### Key Points about Word Embeddings:

1. **Dimensionality Reduction**: Instead of representing words as sparse vectors (like in one-hot encoding), word embeddings represent words in a lower-dimensional space, capturing the semantic relationships between words.

2. **Contextual Meaning**: Words with similar meanings or words that often appear in similar contexts have vectors that are close together in the embedding space.

3. **Applications**:
   - **Text Classification**: Word embeddings can be used to convert text into numerical features for machine learning models.
   - **Machine Translation**: By understanding the semantic meaning of words, word embeddings can improve the quality of translations.
   - **Sentiment Analysis**: They help in capturing the sentiment of words in context.
   - **Information Retrieval**: Word embeddings can improve search results by understanding the meaning behind search queries.

### Example:

Suppose we have the following sentences:

1. "The cat sits on the mat."
2. "The dog lies on the rug."

In a word embedding space, the words "cat" and "dog" would be represented by vectors that are close to each other, as they both represent animals. Similarly, "mat" and "rug" would have similar vectors, representing objects that one can lie or sit on.
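
"Close to each other" is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors, invented purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not the output of any trained model.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "mat": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_mat = cosine_similarity(embeddings["cat"], embeddings["mat"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs mat: {sim_cat_mat:.3f}")
```

With these made-up values, "cat" scores much higher against "dog" than against "mat", which is the pattern a trained embedding is expected to show.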

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Word2Vec
</b></div>

### What is Word2Vec?

Word2Vec is a popular word embedding technique developed at Google that uses a shallow neural network to learn word associations from a large corpus of text. It represents words as vectors in a continuous vector space, where words with similar contexts have similar vector representations. There are two main models within Word2Vec:

1. **Continuous Bag of Words (CBOW)**: Predicts the target word (center word) from the context words (surrounding words).
2. **Skip-gram**: Predicts the context words from the target word.

### Example with a Small Dataset

Let's use a simple dataset consisting of three sentences:

1. "I love dogs"
2. "Dogs are awesome"
3. "I love cats"

We'll create a vocabulary and show how Word2Vec generates embeddings. For simplicity, we'll use the Skip-gram model.

#### Vocabulary:
- I
- love
- dogs
- are
- awesome
- cats

#### Training Data for Skip-gram (window size = 1):

| Center Word | Context Word |
|-------------|--------------|
| I           | love         |
| love        | I            |
| love        | dogs         |
| dogs        | love         |
| dogs        | are          |
| are         | dogs         |
| are         | awesome      |
| awesome     | are          |
| I           | love         |
| love        | I            |
| love        | cats         |
| cats        | love         |
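
The pairs in the table can be generated mechanically. A minimal sketch, assuming whitespace tokenization and lowercasing (so "I" becomes "i" and "Dogs" becomes "dogs"); the function name `skipgram_pairs` is made up for this example.

```python
def skipgram_pairs(sentences, window=1):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    pairs.append((center, tokens[j]))
    return pairs

sentences = ["I love dogs", "Dogs are awesome", "I love cats"]
pairs = skipgram_pairs(sentences, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Pairs never cross sentence boundaries, which is why "dogs" at the end of the first sentence is not paired with "Dogs" at the start of the second.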

#### Word Embeddings:
After training, Word2Vec generates an embedding (vector) for each word. For simplicity, assume 2-dimensional vectors; the values below are illustrative, not the output of an actual training run.

| Word    | Embedding      |
|---------|----------------|
| I       | [0.4, 0.1]     |
| love    | [0.5, 0.2]     |
| dogs    | [0.6, 0.3]     |
| are     | [0.3, 0.4]     |
| awesome | [0.2, 0.5]     |
| cats    | [0.7, 0.6]     |

### Why Do We Need Word2Vec?

1. **Semantic Meaning**: Captures semantic relationships between words, making it useful for tasks like similarity measurement and sentiment analysis.
2. **Dimensionality Reduction**: Reduces the high dimensionality of text data while preserving important information.
3. **Improved Performance**: Enhances the performance of machine learning models by providing better word representations.
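4. **Contextual Understanding**: Learns from the contexts in which words appear, which supports more accurate natural language understanding. Each word still gets a single static vector, so different senses of the same word are not distinguished.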

### Differences from Other Text Representations

#### Bag of Words (BoW)
- **Representation**: Represents text as a vector of word counts or frequencies.
- **Dimensionality**: High dimensionality (number of unique words in the corpus).
- **Context**: Ignores the order of words and their context.
- **Example**: For the sentence "I love dogs", the BoW vector might be [1, 1, 1, 0, 0, 0].

#### Term Frequency-Inverse Document Frequency (TF-IDF)
- **Representation**: Represents text as a vector of weighted word counts, where weights are determined by the frequency of words in the document and the inverse frequency of words in the entire corpus.
- **Dimensionality**: High dimensionality, similar to BoW.
- **Context**: Ignores the order of words and their context, but adjusts for common versus rare words.
- **Example**: For the sentence "I love dogs", the TF-IDF vector might be [0.3, 0.4, 0.5, 0, 0, 0] (weights depend on the corpus).

#### Word2Vec
- **Representation**: Represents words as dense vectors in a continuous space.
- **Dimensionality**: Lower dimensionality, typically 100-300 dimensions.
- **Context**: Captures context and semantic meaning by predicting words from their surroundings.
- **Example**: For the word "love", the Word2Vec vector might be [0.5, 0.2].

### Underlying Assumption

The underlying assumption of Word2Vec is that words that appear in similar contexts tend to have similar meanings. This is based on the distributional hypothesis in linguistics, which suggests that words that occur in the same contexts tend to have similar meanings. Word2Vec leverages this hypothesis by learning vector representations of words in such a way that words with similar contexts in the training corpus have similar vector representations.

### Summary

| Feature          | BoW                     | TF-IDF                   | Word2Vec                   |
|------------------|-------------------------|--------------------------|----------------------------|
| Dimensionality   | High                    | High                     | Low                        |
| Context          | No                      | No                       | Yes                        |
| Semantic Meaning | No                      | No (term weighting only) | Yes                        |
| Representation   | Sparse vectors (counts) | Sparse vectors (weights) | Dense vectors (embeddings) |

Word2Vec provides a more meaningful and context-aware representation of words compared to BoW and TF-IDF, making it a powerful tool for various NLP tasks.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Types of Word2Vec
</b></div>


<div style="text-align:center; margin-top:20px;">
    <img src="https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0216636.g001&type=large" alt="Word2Vec architectures: CBOW and Skip-gram" style="width:80%; border-radius:10px;"/>
    <p style="font-size:14px; color:#555;">Image Credit: <a href="https://www.researchgate.net/figure/Word2Vec-architecture-The-figure-shows-two-variants-of-word2vec-architecture-CBOW-and_fig1_339921386" target="_blank" style="color:#555;">ResearchGate</a></p>
</div>

### CBOW (Continuous Bag of Words)

CBOW is a model used in natural language processing (NLP) to learn word embeddings from a corpus of text. It predicts the target word (center word) from its surrounding context words.

### Understanding CBOW

CBOW operates by predicting a target word given its context words. It treats this prediction task as a supervised learning problem, where the input is the context words and the output is the target word.

### Example Scenario

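Consider the following text snippet with a window of 3 words, that is, the target word plus one context word on each side. (In `gensim`, by contrast, the `window` parameter counts the words on each side of the target.)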

**Text:**
"Use kaggle for data science"

For a window size of 3, we generate training pairs:

1. Window 1: Context words = ["Use", "for"], Target word = "kaggle"
2. Window 2: Context words = ["kaggle", "data"], Target word = "for"
3. Window 3: Context words = ["for", "science"], Target word = "data"

### Data Preparation

After selecting context and target pairs, we convert each context word into a one-hot vector. With the vocabulary indexed in sentence order (Use = 0, kaggle = 1, for = 2, data = 3, science = 4), each vector has five entries:

| Context (X)                 | Target (Y) |
|-----------------------------|------------|
| `[1 0 0 0 0] [0 0 1 0 0]`   | kaggle     |
| `[0 1 0 0 0] [0 0 0 1 0]`   | for        |
| `[0 0 1 0 0] [0 0 0 0 1]`   | data       |
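
The data preparation step can be sketched in a few lines. This example indexes the five words of the sentence in order (Use = 0, kaggle = 1, for = 2, data = 3, science = 4) and, to match the three windows listed earlier, only uses targets that have a context word on both sides; that restriction is an assumption of this sketch, since practical implementations typically also use the shorter windows at the sentence edges.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

tokens = "Use kaggle for data science".split()
vocab = {word: i for i, word in enumerate(tokens)}  # all five words are unique

half = 1  # one context word on each side, i.e. a 3-word window
examples = []
for i in range(half, len(tokens) - half):
    context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
    target = tokens[i]
    x = [one_hot(vocab[w], len(vocab)) for w in context]
    examples.append((context, x, target))

for context, x, target in examples:
    print(context, x, "->", target)
```

For the target "kaggle", the context words "Use" and "for" map to positions 0 and 2, so their one-hot vectors have the 1 in the first and third slots.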




### Training the Neural Network

We feed these pairs into a neural network where the input is the one-hot encoded context vector and the output is the target word. The network learns to predict the target word based on the context words.

### Skip-gram


In contrast to CBOW, the Skip-gram model predicts the context words from a given target word, reversing the CBOW setup. Skip-gram requires more computational resources than CBOW because every (target, context) pair is a separate training example, but it tends to learn better representations for rare words.

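Using the same sentence with the vocabulary indexed in order (Use = 0, kaggle = 1, for = 2, data = 3, science = 4), the input and output swap roles:

| Input (X): target word | Output (Y): context words   |
|------------------------|-----------------------------|
| kaggle `[0 1 0 0 0]`   | `[1 0 0 0 0] [0 0 1 0 0]`   |
| for `[0 0 1 0 0]`      | `[0 1 0 0 0] [0 0 0 1 0]`   |
| data `[0 0 0 1 0]`     | `[0 0 1 0 0] [0 0 0 0 1]`   |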


### Improving Word2Vec Performance

To enhance Word2Vec performance:
- **Increase Training Data**: Use larger datasets to train the model more effectively.
- **Increase Vector Dimensions**: Higher-dimensional embeddings capture more semantic nuances but require more computational resources.
- **Expand Context Window**: Enlarging the context window captures broader semantic relationships but increases training time.

### Conclusion

CBOW provides an efficient method for learning word embeddings by predicting a target word from its context. It trains faster than Skip-gram, which makes it a practical default when training time matters, while Skip-gram is often preferred when rare words are important. By tuning training data size, vector dimensions, and context window, either model can be tailored to specific NLP tasks such as text classification and information retrieval.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Creating our own Word2Vec Model using words from Game Of Thrones Book
</b></div>

`gensim`: A library for topic modeling, document indexing, and similarity retrieval with large corpora.

`sent_tokenize` from `nltk`: A function for splitting text into sentences.

`simple_preprocess` from `gensim.utils`: A function that tokenizes text, lowercases it, and strips punctuation; by default it also drops tokens shorter than 2 or longer than 15 characters.

In [69]:
import gensim
import os
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

# Initialize an empty list to store preprocessed sentences
story = []

# Iterate over all files in the directory
for filename in os.listdir('/kaggle/input/game-of-thrones-books'):
    
    # Open each file and read its content
    file_path = os.path.join('/kaggle/input/game-of-thrones-books', filename)
    
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        corpus = f.read()
    
    # Sentence tokenize the corpus
    raw_sent = sent_tokenize(corpus)
    
    # Preprocess each sentence and append to the story list
    for sent in raw_sent:
        story.append(simple_preprocess(sent))  # Basic preprocessing

In [70]:
len(story)

158872

In [71]:
story[100]

['dragons',
 'must',
 'be',
 'the',
 'least',
 'of',
 'the',
 'things',
 'man',
 'might',
 'find',
 'in',
 'qarth',
 'and',
 'asshai',
 'and',
 'yi',
 'ti']

In [72]:
model = gensim.models.Word2Vec(window = 10, 
                               min_count = 2,
                               workers = 4)

### Explanation of Parameters:

1. **`gensim.models.Word2Vec`**:
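   - This is the Word2Vec class from the `gensim` library, which is used to create and train Word2Vec models. Parameters not passed here keep their defaults, such as `vector_size=100` (the embedding dimension) and `sg=0` (CBOW; `sg=1` selects Skip-gram).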

2. **`window=10`**:
   - The `window` parameter defines the maximum distance between the current and predicted word within a sentence.
   - For example, if `window=10`, the model considers up to 10 words to the left and 10 words to the right of each target word as its context.
   - This allows the model to learn relationships between words that are up to 10 words apart, providing a wider context for each word.

3. **`min_count=2`**:
   - The `min_count` parameter sets the minimum number of occurrences of a word for it to be included in the Word2Vec model.
   - Words that appear fewer times than this threshold are ignored and not included in the training process.
   - Setting `min_count=2` ensures that only words that appear at least twice in the corpus are considered, which helps to reduce noise by ignoring rare words.

4. **`workers=4`**:
   - The `workers` parameter specifies the number of worker threads to use for training the model.
   - Using multiple threads can speed up the training process by parallelizing it.
   - Setting `workers=4` means that 4 threads will be used for training the model, making it faster than using a single thread.

In [73]:
model.build_vocab(story)

In [74]:
model.train(story, total_examples=model.corpus_count, epochs = model.epochs)

(6575413, 8625265)

In [75]:
model.wv.most_similar('mormont')

[('slynt', 0.779107391834259),
 ('janos', 0.7674131393432617),
 ('marsh', 0.7495144009590149),
 ('littlefinger', 0.7464926242828369),
 ('tarly', 0.7462854385375977),
 ('qyburn', 0.7352117300033569),
 ('harwood', 0.7182857990264893),
 ('jason', 0.7040857672691345),
 ('wyman', 0.6971112489700317),
 ('rowan', 0.6961427927017212)]

In [76]:
model.wv.doesnt_match(["jaime", "tyrion", "cersei", "arya"])

'arya'

In [77]:
#Vector Dimension
model.wv['king']

array([ 2.014022  , -0.91976166,  0.1844158 , -3.247232  , -1.0503274 ,
        3.921612  ,  0.10615434, -2.608058  ,  1.5398357 ,  2.769984  ,
        0.44791904,  2.2324872 ,  0.03294838, -0.853302  , -0.6018878 ,
        0.92401826,  0.0867687 ,  0.63234246, -3.47414   , -1.5476279 ,
        4.354349  ,  2.4032953 , -0.07850899, -2.160132  ,  1.5359776 ,
       -0.49857235, -0.3678546 , -0.36599886,  1.2317008 , -1.5626659 ,
       -1.8909967 ,  1.8192202 , -2.9440567 , -2.0858681 ,  2.1928043 ,
        0.24764332,  0.43810955, -2.1319523 , -1.5585552 , -2.238373  ,
       -0.48949862, -2.0886695 ,  1.3925611 ,  0.3771918 ,  3.1935298 ,
       -2.3525846 ,  0.21972947, -0.81753635,  0.8940463 , -0.91356474,
        0.207845  , -0.04152534,  1.4885468 , -1.9201828 , -1.0271955 ,
       -0.82902867,  1.05767   ,  1.3534781 , -0.54307985, -1.2807306 ,
        0.8154349 , -0.9690467 ,  0.37769008,  0.6050772 , -2.6443164 ,
        0.49013776, -1.2913054 ,  0.05694449, -0.2580893 , -0.78

In [78]:
model.wv.similarity('arya', 'sansa')

0.85503215

In [79]:
model.wv.similarity('arya', 'jon')

0.44143063

In [80]:
model.wv.similarity('tyrion', 'sansa')

0.5435527

In [81]:
model.wv.similarity('tyrion', 'arya')

0.5880462

In [82]:
model.wv.similarity('daenerys', 'dragon')

0.4676541

In [83]:
# The get_normed_vectors() method retrieves the normalized vectors (unit length) from a trained Word2Vec model. 
model.wv.get_normed_vectors()

array([[ 0.04253045, -0.00271727, -0.01326206, ..., -0.16387872,
        -0.09463339,  0.10440636],
       [ 0.08754002,  0.03868708,  0.0323594 , ..., -0.03960106,
        -0.06577273,  0.12148576],
       [-0.0066203 , -0.06391785,  0.00725519, ...,  0.00279815,
         0.15526482, -0.13929999],
       ...,
       [-0.03168227, -0.10925772,  0.03761256, ..., -0.09049765,
         0.22529933, -0.01122991],
       [-0.10589752, -0.11385921,  0.17470562, ..., -0.18788782,
        -0.10042623,  0.09203169],
       [ 0.07815515, -0.0789224 ,  0.05534671, ..., -0.15016463,
        -0.0321193 ,  0.06174895]], dtype=float32)

In [84]:
model.wv.get_normed_vectors().shape

(17869, 100)

In [85]:
len(model.wv.index_to_key)

17869

In [86]:
y = model.wv.index_to_key
# wv.index_to_key attribute in a trained Word2Vec model provides the list of words (vocabulary) in the order of their vectors. 
# This allows you to see which words correspond to the vectors in the wv object.

In [87]:
#To see a 3D representation of the model, we have reduced the dimensions using PCA.
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)

In [88]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X = pca.fit_transform(model.wv.get_normed_vectors())
X.shape

(17869, 3)

In [89]:
X[:10]

array([[-0.21928276,  0.5830314 , -0.01150548],
       [-0.19380479,  0.3607731 ,  0.06721126],
       [ 0.24894103,  0.5864633 ,  0.24963972],
       [-0.03280409,  0.39058435, -0.10606393],
       [ 0.06819967,  0.54511684,  0.2542263 ],
       [-0.08530259,  0.27836925,  0.3199789 ],
       [ 0.0646778 ,  0.37276375,  0.32314423],
       [ 0.5125614 ,  0.6221849 ,  0.13542317],
       [-0.05592027,  0.39837986,  0.3417407 ],
       [-0.11791261,  0.43828732,  0.07669486]], dtype=float32)

In [90]:
import plotly.express as px
fig = px.scatter_3d(X[:500], x=0, y=1, z=2, color=y[:500])
fig.show()



In [91]:
X_subset = X[100:300]  
y_subset = y[100:300]  
df = pd.DataFrame(X_subset, columns=['x', 'y', 'z'])

fig = px.scatter_3d(df, x='x', y='y', z='z', color=y_subset)

annotations = []
h_index = y_subset.index('catelyn')
annotations.append(dict(x=X_subset[h_index, 0], y=X_subset[h_index, 1], z=X_subset[h_index, 2],
                        text='catelyn', showarrow=True, arrowhead=1))

fig.update_layout(scene=dict(annotations=annotations))

fig.show()





<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Text Classification
</b></div>

Text classification in NLP is the process of categorizing text into predefined classes or labels. This involves training a model to identify the category a given piece of text belongs to based on its content. Common applications include spam detection, sentiment analysis, and topic categorization.

### Types

Text classification can be categorized into several types based on the nature and number of labels assigned to each text:

1. **Binary Classification**:
   - Classifies text into one of two possible categories.
   - Example: Spam vs. Not Spam.

2. **Multiclass Classification**:
   - Classifies text into one of more than two possible categories.
   - Example: Classifying news articles into categories like Sports, Politics, Technology, etc.

3. **Multilabel Classification**:
   - Assigns multiple labels to a single text.
   - Example: Tagging a movie review with labels like Action, Adventure, and Comedy.

4. **Hierarchical Classification**:
   - Classifies text into a hierarchy of categories, where categories can have subcategories.
   - Example: Classifying documents first into broad categories like Science, and then into narrower subcategories like Physics, Chemistry, and Biology.
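
The first three types differ mainly in the shape of the target. A small sketch with made-up labels, showing how a multilabel target is commonly turned into a 0/1 indicator vector:

```python
# Hypothetical label set, invented for illustration.
all_labels = ["Action", "Adventure", "Comedy"]

binary_target = 1                          # e.g. 1 = spam, 0 = not spam
multiclass_target = "Sports"               # exactly one of several classes
multilabel_target = {"Action", "Comedy"}   # any subset of the label set

# One slot per possible label: 1 if the label applies, 0 otherwise.
indicator = [1 if label in multilabel_target else 0 for label in all_labels]
print(indicator)
```

A multilabel model then predicts one yes/no decision per slot, whereas a multiclass model picks a single class.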

Each type serves a different purpose and is chosen based on the specific requirements of the application.

### Applications

Text classification has numerous applications across various domains. Here are some key applications:

1. **Spam Detection**:
   - Automatically identifying and filtering out spam emails from a user's inbox.

2. **Sentiment Analysis**:
   - Determining the sentiment (positive, negative, neutral) of a piece of text, such as reviews, social media posts, or customer feedback.

3. **Topic Categorization**:
   - Classifying news articles, blog posts, or documents into predefined topics or categories, such as sports, politics, technology, etc.

4. **Language Detection**:
   - Identifying the language in which a piece of text is written.

5. **Document Organization**:
   - Automatically organizing and tagging documents in large datasets, such as legal documents, research papers, or customer service tickets.

6. **Customer Support Automation**:
   - Routing customer queries to the appropriate department based on the content of the query.

7. **Product Recommendation**:
   - Categorizing products based on customer reviews to recommend similar products.

8. **Fraud Detection**:
   - Identifying fraudulent activities or transactions by analyzing text data.

9. **Content Moderation**:
   - Automatically detecting and filtering inappropriate or harmful content in social media, forums, or online communities.

10. **Chatbots and Virtual Assistants**:
    - Understanding and classifying user intents to provide appropriate responses.

11. **Information Retrieval**:
    - Enhancing search engines by categorizing and tagging content for more accurate and relevant search results.

These applications showcase the versatility and importance of text classification in various industries and everyday applications.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Using Machine Learning models for Training and Prediction
</b></div>

In [92]:
temp_df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df = temp_df.iloc[:25000].copy()  # copy so later in-place edits don't act on a slice of temp_df
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [93]:
df.size  # total number of cells (rows x columns); df.shape gives (rows, columns)

50000

In [94]:
df['review'][7]

"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air."

In [95]:
df.duplicated().sum()

103

In [96]:
df.drop_duplicates(inplace=True)






In [97]:
df.duplicated().sum()

0

In [98]:
df['sentiment'].value_counts()

sentiment
negative    12451
positive    12446
Name: count, dtype: int64

In [99]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [100]:
#Basic Preprocessing
# Remove HTML tags
# lowercase
# remove stopwords

In [101]:
import re

def remove_tags(text):
    # Strip HTML tags such as <br />; the non-greedy pattern stops at the first '>'
    return re.sub(r"<.*?>", "", text)

In [102]:
df['review'] = df['review'].apply(remove_tags)






In [103]:
df["review"] = df["review"].str.lower()






In [104]:
from nltk.corpus import stopwords

# A set gives constant-time membership tests; a plain list is scanned for every word
stop_list = set(stopwords.words('english'))
df['review'] = df['review'].apply(
    lambda text: " ".join(word for word in text.split() if word not in stop_list)
)






In [105]:
df['review'][7]

"show amazing, fresh & innovative idea 70's first aired. first 7 8 years brilliant, things dropped that. 1990, show really funny anymore, continued decline complete waste time today.it's truly disgraceful far show fallen. writing painfully bad, performances almost bad - mildly entertaining respite guest-hosts, show probably still air. find hard believe creator hand-selected original cast also chose band hacks followed. one recognize brilliance see fit replace mediocrity? felt must give 2 stars respect original cast made show huge success. now, show awful. can't believe still air."

In [106]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [107]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y)
y

array([1, 1, 1, ..., 1, 1, 0])
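
`LabelEncoder` assigns integer codes in sorted order of the class names, so here `negative` becomes 0 and `positive` becomes 1 (the fitted mapping is also available as `encoder.classes_`). The equivalent logic in plain Python, on a made-up list of labels:

```python
labels = ["positive", "negative", "positive", "negative"]  # toy input

classes = sorted(set(labels))                      # ['negative', 'positive']
to_index = {name: i for i, name in enumerate(classes)}
encoded = [to_index[label] for label in labels]

print(classes)
print(encoded)
```

Knowing this order matters later when reading a confusion matrix, because row and column 0 refer to `negative` and row and column 1 to `positive`.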

In [108]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
X_train.shape, X_test.shape

((19917, 1), (4980, 1))

In [109]:
#Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

X_train_bow.shape

(19917, 70921)

In [110]:
# Training with GaussianNB
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_bow, y_train)

In [111]:
from sklearn.metrics import accuracy_score,confusion_matrix

y_pred = gnb.predict(X_test_bow)

accuracy_score(y_test,y_pred)

0.6399598393574297

In [112]:
confusion_matrix(y_test,y_pred)

array([[1947,  539],
       [1254, 1240]])
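
The confusion matrix explains where the 64% accuracy comes from. In scikit-learn's layout, rows are true labels and columns are predicted labels; assuming class 1 is `positive` (the order `LabelEncoder` produces for these two labels), the counts can be unpacked as follows:

```python
# Confusion matrix copied from the output above.
cm = [[1947, 539],
      [1254, 1240]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the reviews predicted positive, how many were
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The low recall shows that this model labels roughly half of the positive reviews as negative, which accuracy alone does not reveal.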

In [113]:
# Training with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.851004016064257

In [114]:
cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8387550200803213

In [115]:
cv = CountVectorizer(ngram_range=(1,2),max_features=5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8451807228915663
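
With `ngram_range=(1, 2)`, `CountVectorizer` counts single words and pairs of adjacent words, and `max_features=5000` keeps only the 5000 most frequent of those terms across the corpus. A rough sketch of the n-gram extraction itself, using whitespace tokenization as a simplifying assumption (the helper name `word_ngrams` is made up):

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List all word n-grams for n in the inclusive range, shortest first."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("show really funny anymore")
print(grams)
```

Bigrams such as "really funny" keep a little word-order information that single-word counts discard.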

In [116]:
#Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

# RandomForestClassifier accepts sparse matrices, so no .toarray() is needed
X_train_tfidf = tfidf.fit_transform(X_train['review'])
X_test_tfidf = tfidf.transform(X_test['review'])

In [117]:
rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)

0.8506024096385543

In [118]:
# Using Word2vec
import gensim
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))
    
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [119]:
model.build_vocab(story)
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(14797965, 15525230)

In [120]:
len(model.wv.index_to_key)

47074

In [121]:
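def document_vector(doc):
    # Keep in-vocabulary words; `in model.wv` is a dict lookup, while
    # `in model.wv.index_to_key` scans the whole vocabulary list per word
    words = [word for word in doc.split() if word in model.wv]
    if not words:
        return np.zeros(model.wv.vector_size)  # avoid NaN for empty input
    return np.mean(model.wv[words], axis=0)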

document_vector(df['review'].values[0])

array([-0.42056623,  0.36349672,  0.25875875,  0.17984214,  0.06109646,
       -0.3510774 ,  0.31541118,  0.83514225,  0.17776473, -0.3724004 ,
       -0.16042408, -0.41054025,  0.04798472,  0.21565548,  0.25403252,
       -0.20799641, -0.02364288, -0.32523414,  0.15488128, -0.03037474,
        0.17235416,  0.33111802,  0.20676415, -0.2661609 , -0.10102784,
       -0.2327879 , -0.34784958,  0.02639874, -0.22555448,  0.02803126,
        0.06146627, -0.03602965, -0.09691825, -0.24101672, -0.33573252,
        0.44448924, -0.03987772, -0.08728085, -0.17730325, -0.5127163 ,
        0.07763765, -0.19635803, -0.14438656, -0.06349266, -0.04507221,
       -0.1553419 , -0.24808691, -0.27521482, -0.18998145,  0.20929493,
        0.27704525, -0.44401935,  0.1223617 ,  0.06539241,  0.05389095,
        0.5719021 ,  0.64783067,  0.09950755, -0.46062863, -0.2657431 ,
        0.08166499,  0.00163   , -0.20864245, -0.33988726, -0.17963094,
        0.10905342, -0.02566193,  0.4313338 , -0.13996449, -0.01

In [122]:
from tqdm import tqdm
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

100%|██████████| 24897/24897 [19:08<00:00, 21.68it/s]
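
Each review is reduced to a single vector by averaging the vectors of its known words. Membership tests against a list such as `index_to_key` scan the whole vocabulary for every word, which is a likely reason a loop like this runs for many minutes; a dictionary lookup avoids that. The sketch below shows the idea with a plain dict of hypothetical 2-dimensional vectors standing in for `model.wv`:

```python
def document_vector(text, vectors, dim):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    words = [w for w in text.split() if w in vectors]  # dict lookup, not a list scan
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]

# Hypothetical vectors, invented for illustration only.
toy_vectors = {"great": [1.0, 0.0], "movie": [0.0, 1.0]}

known = document_vector("great movie xyz", toy_vectors, dim=2)
unknown = document_vector("xyz", toy_vectors, dim=2)
print(known)
print(unknown)
```

Averaging discards word order, so this representation behaves like a dense bag of words.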


In [123]:
X = np.array(X)
X[0]

array([-0.42056623,  0.36349672,  0.25875875,  0.17984214,  0.06109646,
       -0.3510774 ,  0.31541118,  0.83514225,  0.17776473, -0.3724004 ,
       -0.16042408, -0.41054025,  0.04798472,  0.21565548,  0.25403252,
       -0.20799641, -0.02364288, -0.32523414,  0.15488128, -0.03037474,
        0.17235416,  0.33111802,  0.20676415, -0.2661609 , -0.10102784,
       -0.2327879 , -0.34784958,  0.02639874, -0.22555448,  0.02803126,
        0.06146627, -0.03602965, -0.09691825, -0.24101672, -0.33573252,
        0.44448924, -0.03987772, -0.08728085, -0.17730325, -0.5127163 ,
        0.07763765, -0.19635803, -0.14438656, -0.06349266, -0.04507221,
       -0.1553419 , -0.24808691, -0.27521482, -0.18998145,  0.20929493,
        0.27704525, -0.44401935,  0.1223617 ,  0.06539241,  0.05389095,
        0.5719021 ,  0.64783067,  0.09950755, -0.46062863, -0.2657431 ,
        0.08166499,  0.00163   , -0.20864245, -0.33988726, -0.17963094,
        0.10905342, -0.02566193,  0.4313338 , -0.13996449, -0.01

In [124]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

y = encoder.fit_transform(df['sentiment'])
y

array([1, 1, 1, ..., 1, 1, 0])

In [125]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.8082329317269076

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>POS Tagging
</b></div>

<div style="text-align:center; margin-top:20px;">
    <img src="https://www.nltk.org/images/chunk-segmentation.png" alt="POS-tagged sentence with chunk segmentation" style="width:80%; border-radius:10px;"/>
    <p style="font-size:14px; color:#555;">Image Credit: <a href="https://www.nltk.org/book/ch07.html" target="_blank" style="color:#555;">NLTK</a></p>
</div>

### What is POS Tagging?

POS (Part-of-Speech) tagging is the process of assigning a part-of-speech label to each word in a sentence. These labels include categories such as nouns, verbs, adjectives, adverbs, and others. The process involves both lexical analysis (identifying words) and syntactic analysis (assigning POS tags based on context and grammar rules).

### Example:

Given the sentence:
- "The quick brown fox jumps over the lazy dog."

POS tagging would output:
- The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN

Here, DT stands for determiner, JJ for adjective, NN for noun, VBZ for verb (third person singular present), and IN for preposition.
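
The `word/TAG` notation above is easy to handle in code. The following is a minimal sketch in plain Python (no NLP library involved) that splits the tagged sentence into (word, tag) pairs and prints the description of each tag; `TAG_DESCRIPTIONS` is a hypothetical lookup table that covers only the five tags used in this example.

```python
# Descriptions of the tags used in the example sentence (illustrative subset only).
TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun",
    "VBZ": "verb, third person singular present",
    "IN": "preposition",
}

tagged = "The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN"

# Split each "word/TAG" token on its last slash to get (word, tag) pairs.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]

for word, tag in pairs:
    print(f"{word:<6} {tag:<4} {TAG_DESCRIPTIONS[tag]}")
```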

### Applications of POS Tagging:

1. **Text Preprocessing**:
   - Used as a step in text preprocessing to prepare data for further NLP tasks such as parsing, named entity recognition, and sentiment analysis.

2. **Information Retrieval**:
   - Helps improve the performance of search engines by understanding the role of each word in a query, which can enhance search accuracy and relevance.

3. **Named Entity Recognition (NER)**:
   - Assists in identifying and classifying proper nouns, such as names of people, organizations, and locations.

4. **Sentiment Analysis**:
   - Enhances sentiment analysis by understanding the role of adjectives, adverbs, and verbs, which are crucial for determining sentiment.

5. **Machine Translation**:
   - Helps in understanding the grammatical structure of sentences, which is essential for accurately translating text from one language to another.

6. **Speech Recognition**:
   - Aids in transcribing spoken language into written text by correctly identifying the part of speech for each word, improving the accuracy of the transcription.

7. **Grammar Checking**:
   - Used in grammar checkers to identify and correct grammatical errors by analyzing the syntactic structure of sentences.

8. **Question Answering Systems**:
   - Enhances the ability of QA systems to understand and generate responses by correctly interpreting the syntactic roles of words in both questions and answers.

9. **Text-to-Speech Systems**:
   - Improves the naturalness of synthesized speech by providing accurate prosody and intonation patterns based on the parts of speech.

10. **Coreference Resolution**:
    - Helps in identifying and resolving references to the same entities within a text, which is crucial for tasks that require understanding the relationships between different parts of the text.

POS tagging is a fundamental task in NLP that supports and enhances various advanced language processing applications, making it a critical component in the field.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Hidden Markov Model (HMM) and Viterbi Algorithm for POS Tagging
</b></div>

### Hidden Markov Model (HMM) in POS Tagging

A Hidden Markov Model (HMM) is a statistical model that is particularly useful for sequence labeling tasks like Part-of-Speech (POS) tagging. An HMM models the sequence of tags and words with two key assumptions:

1. **Markov Assumption**: The probability of a tag depends only on the previous tag.
2. **Output Independence Assumption**: The probability of a word depends only on the current tag.

An HMM for POS tagging consists of:
- **States**: Represent the POS tags.
- **Observations**: Represent the words in the sentence.
- **Transition Probabilities (A)**: Probability of moving from one state (tag) to another state (tag).
- **Emission Probabilities (B)**: Probability of observing a word given a state (tag).
- **Initial Probabilities (π)**: Probability of starting in each state (tag).

### HMM Parameters
- **A (Transition probabilities)**: $ P(t_i \mid t_{i-1}) $
- **B (Emission probabilities)**: $ P(w_i \mid t_i) $
- **π (Initial probabilities)**: $ P(t_1) $
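
To make the roles of the three parameter sets concrete, here is a minimal sketch of how an HMM scores one tagged sentence. Under the two assumptions above, the joint probability factorizes as $ P(w_{1:n}, t_{1:n}) = \pi_{t_1} B_{t_1}(w_1) \prod_{i=2}^{n} A_{t_{i-1} t_i} B_{t_i}(w_i) $. The numeric values below are made up for illustration and are not estimated from any corpus, and `joint_probability` is a hypothetical helper name.

```python
# Hypothetical HMM parameters for a three-tag toy model (illustrative values only).
pi = {"DT": 0.8, "NN": 0.1, "VB": 0.1}        # initial probabilities P(t_1)
A = {                                          # transitions: A[prev][cur] = P(cur | prev)
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
    "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}
B = {                                          # emissions: B[tag][word] = P(word | tag)
    "DT": {"the": 1.0},
    "NN": {"cat": 0.6, "sleeps": 0.4},
    "VB": {"sleeps": 0.9, "cat": 0.1},
}

def joint_probability(words, tags):
    """Return P(words, tags) under the HMM defined by pi, A and B."""
    # First position: initial probability times emission.
    prob = pi[tags[0]] * B[tags[0]].get(words[0], 0.0)
    # Remaining positions: transition from the previous tag times emission.
    for i in range(1, len(words)):
        prob *= A[tags[i - 1]][tags[i]] * B[tags[i]].get(words[i], 0.0)
    return prob

print(joint_probability(["the", "cat", "sleeps"], ["DT", "NN", "VB"]))
```

Words that a tag never emits get probability 0 through `.get(word, 0.0)`, so any tag sequence containing such a pairing scores 0 overall.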

### Viterbi Algorithm in POS Tagging

The Viterbi algorithm is a method used in POS tagging to find the most probable sequence of tags for a sentence. It does this by tracking the best paths and their probabilities for each word, ensuring efficient computation of the optimal tag sequence.

### Steps of the Viterbi Algorithm

1. **Initialization**:
   - $ v_1(k) = \pi_k \cdot B_k(o_1) $
   - Here, $ v_1(k) $ is the probability of the most likely path ending in state $(k)$ at step 1.

2. **Recursion**:
   - For each time step $ t $ from 2 to $ T $:
     - $ v_t(k) = \max_{j} \left[ v_{t-1}(j) \cdot A_{jk} \right] \cdot B_k(o_t) $
     - Track the state $ j $ that maximizes $ v_t(k) $.

3. **Termination**:
   - Find the maximum probability of the final states:
     - $ P^* = \max_{k} \left[ v_T(k) \right] $
   - Backtrack to find the optimal path.

### Example

Given a simple sentence and an HMM, let's illustrate the process.

#### Sentence: "The cat sleeps"

#### Tags: $ T = \{DT, NN, VB\} $

#### Words: $ W = \{The, cat, sleeps\} $

#### Transition Probabilities (A):
$$
A = \begin{bmatrix}
P(DT|DT) & P(NN|DT) & P(VB|DT) \\
P(DT|NN) & P(NN|NN) & P(VB|NN) \\
P(DT|VB) & P(NN|VB) & P(VB|VB)
\end{bmatrix}
$$

#### Emission Probabilities (B):
$$
B = \begin{bmatrix}
P(The|DT) & P(cat|DT) & P(sleeps|DT) \\
P(The|NN) & P(cat|NN) & P(sleeps|NN) \\
P(The|VB) & P(cat|VB) & P(sleeps|VB)
\end{bmatrix}
$$

#### Initial Probabilities (π):
$$
\pi = \{ P(DT), P(NN), P(VB) \}
$$

#### Viterbi Algorithm Execution:

1. **Initialization**:
   $$
   v_1(DT) = π(DT) \cdot B_{DT}(The)
   $$
   $$
   v_1(NN) = π(NN) \cdot B_{NN}(The)
   $$
   $$
   v_1(VB) = π(VB) \cdot B_{VB}(The)
   $$

2. **Recursion**:
   $$
   v_2(NN) = \max \left[ v_1(DT) \cdot A_{DT,NN}, v_1(NN) \cdot A_{NN,NN}, v_1(VB) \cdot A_{VB,NN} \right] \cdot B_{NN}(cat)
   $$
   Repeat for all words and states.

3. **Termination**:
   $$
   P^* = \max \left[ v_3(DT), v_3(NN), v_3(VB) \right]
   $$
   Backtrack to find the optimal tag sequence.
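
The three steps can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `viterbi` and all probability values are assumptions chosen for the "The cat sleeps" example (lower-cased here), not values estimated from data. It multiplies raw probabilities, which is fine for short sentences; for long inputs one would typically sum log-probabilities instead to avoid underflow.

```python
def viterbi(observations, states, pi, A, B):
    """Return (best_path, best_prob) for a sequence of observations.

    pi[k]   : initial probability of state k
    A[j][k] : transition probability from state j to state k
    B[k][o] : emission probability of observation o in state k
    """
    # Initialization: v_1(k) = pi_k * B_k(o_1)
    v = [{k: pi[k] * B[k].get(observations[0], 0.0) for k in states}]
    backpointers = [{}]

    # Recursion: v_t(k) = max_j [v_{t-1}(j) * A_jk] * B_k(o_t)
    for t in range(1, len(observations)):
        v.append({})
        backpointers.append({})
        for k in states:
            best_prev = max(states, key=lambda j: v[t - 1][j] * A[j][k])
            v[t][k] = v[t - 1][best_prev] * A[best_prev][k] * B[k].get(observations[t], 0.0)
            backpointers[t][k] = best_prev

    # Termination: pick the best final state, then follow the backpointers.
    last = max(states, key=lambda k: v[-1][k])
    best_prob = v[-1][last]
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    path.reverse()
    return path, best_prob


# Hypothetical parameters (illustrative values only).
states = ["DT", "NN", "VB"]
pi = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
A = {
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
    "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}
B = {
    "DT": {"the": 1.0},
    "NN": {"cat": 0.6, "sleeps": 0.4},
    "VB": {"sleeps": 0.9, "cat": 0.1},
}

best_path, best_prob = viterbi(["the", "cat", "sleeps"], states, pi, A, B)
print(best_path, best_prob)
```

With these assumed values the best path is DT, NN, VB. The backpointer table is what makes the backtracking step possible: it records, for every tag at every position, which previous tag produced the maximum.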

### Summary

- **HMM**: Models the sequence of POS tags and words using transition and emission probabilities.
- **Viterbi Algorithm**: Finds the most likely sequence of POS tags for a given sentence using dynamic programming.

This combination of HMMs and the Viterbi algorithm provides a powerful method for POS tagging, efficiently handling the complexities of natural language.

<a id="1"></a>
# <div style="text-align:center; background-color:#ebbf21; padding:10px; border-radius:10px; color:black; font-family:'Georgia', serif;"><b>Example of POS Tagging Using HMM and Viterbi Algorithm with Custom Dataset
</b></div>

Suppose we have a dataset with parts of speech: Verb (V), Noun (N), and Adjective (A).

| Docs                       |
|----------------------------|
| Sarah enjoys reading       |
| Can Sarah read books       |
| Will John read books       |
| John enjoys sports         |
| Sarah loves sports         |

In the first step, we need to label each word in the dataset with its part of speech. This is a supervised learning problem.

| Docs                       | POS                       |
|----------------------------|---------------------------|
| Sarah enjoys reading       | N      V        N         |
| Can Sarah read books       | V     N    V      N       |
| Will John read books       | V     N    V      N       |
| John enjoys sports         | N      V        N         |
| Sarah loves sports         | N      V        N         |

Next, we calculate the emission probabilities. We start by counting how often each word appears with each tag.

| Word   | N    | V    | A    |
|--------|------|------|------|
| Sarah  | 3    | 0    | 0    |
| enjoys | 0    | 2    | 0    |
| reading| 1    | 0    | 0    |
| Can    | 0    | 1    | 0    |
| read   | 0    | 2    | 0    |
| books  | 2    | 0    | 0    |
| Will   | 0    | 1    | 0    |
| John   | 2    | 0    | 0    |
| loves  | 0    | 1    | 0    |
| sports | 2    | 0    | 0    |

For each row and column, the value is the number of times the word appears with that part of speech. For example, "Sarah" appears as a noun 3 times.

Now we convert these frequencies into probabilities by dividing each count by its column total: 10 noun tokens and 7 verb tokens. No word is tagged as an adjective in this dataset, so the A column stays at 0.

| Word   | N    | V    | A    |
|--------|------|------|------|
| Sarah  | 3/10 | 0    | 0    |
| enjoys | 0    | 2/7  | 0    |
| reading| 1/10 | 0    | 0    |
| Can    | 0    | 1/7  | 0    |
| read   | 0    | 2/7  | 0    |
| books  | 2/10 | 0    | 0    |
| Will   | 0    | 1/7  | 0    |
| John   | 2/10 | 0    | 0    |
| loves  | 0    | 1/7  | 0    |
| sports | 2/10 | 0    | 0    |
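
The counting and normalization can also be done programmatically. The sketch below computes the emission counts and probabilities directly from the five tagged sentences; the variable names are illustrative, and `Fraction` is used so the results stay exact (note that it prints fractions in lowest terms, so 2/10 appears as 1/5).

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The five tagged sentences from the example, as (word, tag) pairs.
tagged_docs = [
    [("Sarah", "N"), ("enjoys", "V"), ("reading", "N")],
    [("Can", "V"), ("Sarah", "N"), ("read", "V"), ("books", "N")],
    [("Will", "V"), ("John", "N"), ("read", "V"), ("books", "N")],
    [("John", "N"), ("enjoys", "V"), ("sports", "N")],
    [("Sarah", "N"), ("loves", "V"), ("sports", "N")],
]

# emission_counts[tag][word] = number of times `word` was labeled `tag`
emission_counts = defaultdict(Counter)
for sentence in tagged_docs:
    for word, tag in sentence:
        emission_counts[tag][word] += 1

# P(word | tag) = count(word, tag) / count(tag)
emission_probs = {
    tag: {word: Fraction(n, sum(counts.values())) for word, n in counts.items()}
    for tag, counts in emission_counts.items()
}

for tag, probs in emission_probs.items():
    for word, p in probs.items():
        print(f"P({word} | {tag}) = {p}")
```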

Next, we calculate the transition probability. We add "Start" and "End" to our dataset.

| Docs                                   |
|----------------------------------------|
| Start Sarah enjoys reading End         |
| Start Can Sarah read books End         |
| Start Will John read books End         |
| Start John enjoys sports End           |
| Start Sarah loves sports End           |

Counting each pair of adjacent tags (including "Start" and "End") gives:

| Transition | N    | V    | A    | End  |
|------------|------|------|------|------|
| Start      | 3    | 2    | 0    | 0    |
| N          | 0    | 5    | 0    | 5    |
| V          | 7    | 0    | 0    | 0    |

Converting these to probabilities by dividing each count by its row total:

| Transition | N    | V    | A    | End  |
|------------|------|------|------|------|
| Start      | 3/5  | 2/5  | 0    | 0    |
| N          | 0    | 5/10 | 0    | 5/10 |
| V          | 7/7  | 0    | 0    | 0    |

This means, for example, the probability of starting with a noun is 3/5.
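
As a cross-check, the transition counts can be derived from the tag sequences alone. This is a small sketch with illustrative variable names: it pads each sequence with "Start" and "End", counts adjacent tag pairs, and normalizes each row.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Tag sequences of the five training sentences.
tag_sequences = [
    ["N", "V", "N"],
    ["V", "N", "V", "N"],
    ["V", "N", "V", "N"],
    ["N", "V", "N"],
    ["N", "V", "N"],
]

# transition_counts[prev][nxt] = number of times tag `nxt` follows tag `prev`
transition_counts = defaultdict(Counter)
for tags in tag_sequences:
    padded = ["Start"] + tags + ["End"]
    for prev, nxt in zip(padded, padded[1:]):
        transition_counts[prev][nxt] += 1

# P(nxt | prev) = count(prev, nxt) / count(prev)
transition_probs = {
    prev: {nxt: Fraction(n, sum(counts.values())) for nxt, n in counts.items()}
    for prev, counts in transition_counts.items()
}

for prev, probs in transition_probs.items():
    for nxt, p in probs.items():
        print(f"P({nxt} | {prev}) = {p}")
```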

Imagine the parts of speech as a network where each state (N, V, A) is connected, and words are mapped to their POS with associated probabilities.

For prediction, say we have the new sentence "Will Sarah read books".

The correct POS sequence is V, N, V, N, and we want the model to recover it.

| Docs                                   |
|----------------------------------------|
| Start Will Sarah read books End        |

Initially, we label all words as nouns.

| Docs                                   | POS                |
|----------------------------------------|---------------------|
| Start Will Sarah read books End        | Start N N N N End   |

Calculate the probabilities step by step. Each emission probability is the word's count under that tag divided by the tag's total count in the training data (10 noun tokens, 7 verb tokens):

- Start-N: Transition probability 3/5 and "Will" emission probability 0 ("Will" never appears as a noun)
- N-N: Transition probability 0 and "Sarah" emission probability 3/10
- N-N: Transition probability 0 and "read" emission probability 0
- N-N: Transition probability 0 and "books" emission probability 2/10
- N-End: Transition probability 5/10

Several factors are 0, so the product for N N N N is 0.

Try another sequence, N N N V:

- Start-N: Transition probability 3/5 and "Will" emission probability 0
- N-N: Transition probability 0 and "Sarah" emission probability 3/10
- N-N: Transition probability 0 and "read" emission probability 0
- N-V: Transition probability 5/10 and "books" emission probability 0 ("books" never appears as a verb)
- V-End: Transition probability 0

Again the product is 0.

Test all combinations; in total there are (number of parts of speech)^(number of words) candidate sequences, here 3^4 = 81.

Choose the sequence with the highest total probability. For the correct sequence (V N V N), every factor is non-zero and the product is 3/12250 ≈ 0.000245:

- Start-V: Transition probability 2/5 and "Will" emission probability 1/7
- V-N: Transition probability 1 and "Sarah" emission probability 3/10
- N-V: Transition probability 5/10 and "read" emission probability 2/7
- V-N: Transition probability 1 and "books" emission probability 2/10
- N-End: Transition probability 5/10

Multiplying these probabilities gives the highest value, indicating the most probable POS sequence.
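
The exhaustive search described above can be sketched as follows. The code re-estimates the transition and emission probabilities from the five tagged training sentences, scores every possible tag sequence for "Will Sarah read books", and keeps the best one. The helper names `prob` and `score` are illustrative, and brute force is only practical here because the sentence is short.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import product

tagged_docs = [
    [("Sarah", "N"), ("enjoys", "V"), ("reading", "N")],
    [("Can", "V"), ("Sarah", "N"), ("read", "V"), ("books", "N")],
    [("Will", "V"), ("John", "N"), ("read", "V"), ("books", "N")],
    [("John", "N"), ("enjoys", "V"), ("sports", "N")],
    [("Sarah", "N"), ("loves", "V"), ("sports", "N")],
]

# Count emissions (tag -> word) and transitions (previous tag -> next tag).
emit = defaultdict(Counter)
trans = defaultdict(Counter)
for sentence in tagged_docs:
    tags = ["Start"] + [tag for _, tag in sentence] + ["End"]
    for prev, nxt in zip(tags, tags[1:]):
        trans[prev][nxt] += 1
    for word, tag in sentence:
        emit[tag][word] += 1

def prob(counter, key):
    """Relative frequency of `key` in `counter` (0 if the counter is empty)."""
    total = sum(counter.values())
    return Fraction(counter[key], total) if total else Fraction(0)

def score(words, tags):
    """Product of all transition and emission probabilities for one tagging."""
    padded = ["Start"] + list(tags) + ["End"]
    p = Fraction(1)
    for i, word in enumerate(words):
        p *= prob(trans[padded[i]], padded[i + 1])   # transition into the tag
        p *= prob(emit[padded[i + 1]], word)         # emission of the word
    p *= prob(trans[padded[-2]], "End")              # transition into End
    return p

words = ["Will", "Sarah", "read", "books"]
scores = {tags: score(words, tags) for tags in product("NVA", repeat=len(words))}
best = max(scores, key=scores.get)
print(len(scores), best, scores[best])
```

The number of candidates grows exponentially with sentence length, which is the motivation for using dynamic programming instead of enumeration.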


The Viterbi algorithm avoids scoring every sequence separately. Moving through the sentence one word at a time, it keeps, for each tag, only the highest-probability path that ends in that tag, and extends those paths to the next word. Paths whose probability is zero can never become the best path, so they drop out automatically. For a sentence of n words and K tags, this takes on the order of n·K² steps instead of K^n, which makes it well suited to sequence labeling tasks such as POS tagging and to applications like speech recognition.

In [126]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [127]:
doc = nlp(u"I will google about facebook")

In [128]:
doc.text

'I will google about facebook'

"Coarse-grained parts of speech" refers to a broad categorization of words into general types, as opposed to "fine-grained" categories that make more specific distinctions. spaCy exposes the coarse-grained tag as `token.pos_`, using the Universal POS tag set, which is why the preposition "about" is reported as `ADP` (adposition) below. In traditional grammar, coarse-grained parts of speech include:

1. **Nouns**: Words that represent people, places, things, or ideas.
2. **Verbs**: Words that describe actions, states, or occurrences.
3. **Adjectives**: Words that describe or modify nouns.
4. **Adverbs**: Words that modify verbs, adjectives, or other adverbs.
5. **Pronouns**: Words that replace nouns.
6. **Prepositions**: Words that show relationships between other words (e.g., in, on, at).
7. **Conjunctions**: Words that connect clauses or sentences (e.g., and, but, or).
8. **Interjections**: Words or phrases that express strong emotion or reaction (e.g., oh, wow).

These categories help in understanding the basic functions of words within sentences, although more nuanced analyses might break these categories down further into subcategories or include additional parts of speech.

In [129]:
# Coarse-grained parts of speech
doc[3].pos_

'ADP'

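Fine-grained parts of speech are more detailed categories within the broad ones. spaCy exposes them as `token.tag_`, which for English models uses Penn Treebank-style tags such as `NN`, `VB`, and `IN`. In traditional grammar, the finer distinctions include, for example: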

- **Nouns**: Common, proper, abstract, concrete.
- **Verbs**: Action, stative, transitive, intransitive.
- **Adjectives**: Descriptive, quantitative, demonstrative.
- **Adverbs**: Manner, place, time, frequency, degree.
- **Pronouns**: Personal, possessive, relative, demonstrative.
- **Prepositions**: Simple, compound, complex.
- **Conjunctions**: Coordinating, subordinating, correlative.
- **Interjections**: Exclamations, reactions.

In [130]:
# Fine-grained parts of speech
doc[3].tag_

'IN'

In [131]:
spacy.explain('IN')

'conjunction, subordinating or preposition'

In [132]:
for word in doc:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

I ------> PRON PRP pronoun, personal
will ------> AUX MD verb, modal auxiliary
google ------> VERB VB verb, base form
about ------> ADP IN conjunction, subordinating or preposition
facebook ------> NOUN NN noun, singular or mass


In [133]:
doc2 = nlp(u"Andy will google facebook")
for word in doc2:
    print(word.text,"------>", word.pos_,word.tag_,spacy.explain(word.tag_))

Andy ------> PROPN NNP noun, proper singular
will ------> AUX MD verb, modal auxiliary
google ------> VERB VB verb, base form
facebook ------> PROPN NNP noun, proper singular


In [134]:
from spacy import displacy
doc3 = nlp(u"The quick brown fox jumped over the lazy dog")
displacy.render(doc3,style='dep',jupyter=True)

In [135]:
options={
    'distance':110,
    'compact':True,
    'color':'#080608',
    'bg':'#ffadfe'
}
displacy.render(doc3,style='dep',jupyter=True,options=options)

This notebook was created using insights and techniques from the YouTube playlist available at [link to the playlist](https://www.youtube.com/playlist?list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQsjfX). 
The playlist provided valuable tutorials and resources that guided the development and implementation of the methods used in this project.