# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">How to Handle Textual Data in NLP</p>

***In this notebook, I aim to provide insights into "How to Handle Textual Data in NLP". I will cover two major aspects:***

***1. Text Preprocessing:***
***I will demonstrate how to preprocess textual data using a real-world dataset, specifically the "IMDB-Movies" dataset. My primary focus is to simplify and illustrate text preprocessing techniques. However, if you encounter any difficulties, please feel free to comment, and I will do my best to address them.***

***2. Text Representation:***
***Text representation involves converting textual data into numerical format. I will discuss some of the most important techniques for handling textual data in machine learning, including:***
   1. One-Hot Encoding (OHE)
   2. Bag of Words
   3. N-grams
   4. TF-IDF
   
***I will provide demonstrations of how these four techniques work on the original dataset.***

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#F52549;">About Author!</p>
##### ***Hello, I'm  Muhammad_Abdullah : A Data Science Enthusiast and Kaggle 2x Expert***

Greetings! I'm delighted to welcome you into my world of data science exploration and innovation. I'm **Muhammad_Abdullah**, a passionate data scientist with a fervent dedication to unraveling the mysteries hidden within datasets and leveraging the power of machine learning to drive meaningful insights and solutions.

###### ***A Passion for Data Science***

Since the inception of my journey into the captivating realm of data science, I've been driven by an insatiable curiosity and an unwavering passion for uncovering the stories embedded in data. From the thrill of diving deep into complex datasets to the exhilaration of crafting predictive models that shape our understanding of the world, data science has become not just a profession but a lifelong passion.

###### ***Guiding Light on Kaggle***

As a Kaggle 2x Expert, I've had the privilege of sharing my knowledge, insights, and experiences with the vibrant Kaggle community. Through meticulously crafted notebooks, engaging discussions, and collaborative projects, I've had the opportunity to mentor aspiring data enthusiasts, foster a culture of learning and growth, and contribute to the collective pursuit of excellence in data science.

###### ***Let's Explore Together***

Join me on an exhilarating adventure into the dynamic world of data science! Together, we'll unlock the potential of data, unravel its mysteries, and embark on a transformative journey of discovery and innovation. Whether you're a seasoned data enthusiast or just beginning your data science odyssey, I'm excited to share this journey with you and explore the endless possibilities that data science has to offer.
<div style="text-align: left;">
    <table>
        <tr>
            <th><b>Website</b></th>
            <th><b>Links</b></th>
        </tr>
        <tr>
        </tr>
        <tr>
            <td>GitHub</td>
            <td><a href="https://github.com/muhammadabdullah0303"><img src="https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github" alt="GitHub"/></a></td>
        </tr>
        <tr>
            <td>LinkedIn</td>
            <td><a href="https://www.linkedin.com/in/muhammad-abdullah-6b84b4297/"><img src="https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin" alt="LinkedIn"/></a></td>
        </tr>
        <tr>
        </tr>
        <tr>
            <td>Facebook</td>
            <td><a href="https://web.facebook.com/abd.sentaflexmental"><img src="https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook" alt="Facebook"/></a></td>
        </tr>
        <tr>
            <td>Gmail</td>
            <td><a href="mailto:mrabdullah@gmail.com"><img src="https://img.shields.io/badge/Gmail-Contact%20Me-red?style=for-the-badge&logo=gmail" alt="Gmail"/></a></td>
        </tr>
    </table>
</div>


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">Basic Libraries</p>

In [None]:
# Punctuations
import string
# Pandas
import pandas as pd

# Regular Expressions
import re

# Lemmatization
from nltk.stem import WordNetLemmatizer
# Tokenization
from nltk.tokenize import word_tokenize
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np

import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

# Remove Stopwords
from nltk.corpus import stopwords 
# Import PorterStemmer from NLTK Library
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">Loading Data </p>

In [56]:
df = pd.read_csv('IMDB Dataset.csv')

In [57]:
# Head 
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [58]:
# Null Values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1. Basic Text Preprocessing</p>

![image.png](attachment:09c6d7f5-ed17-4921-9e60-62cc1baddbf4.png)

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.1 LowerCasing</p>


In [59]:
step_1 = df['review'][0]

# Lowercase All The Text
df['review'] = df['review'].str.lower()

# Head 
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.2 Removing Punctuations</p>


In [60]:
step_2 = df['review'][0]

# Removing Punctuations
df['review'] = df['review'].str.translate(str.maketrans('', '', string.punctuation))

# Head
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.3 Removing StopWords</p>

In [None]:
step_3 = df['review'][0]

# Initialize stopwords (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))

# Remove stopwords
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# Head 
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive


In [62]:
df['review'][0]

'one reviewers mentioned watching 1 oz episode youll hooked right exactly happened mebr br first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordbr br called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awaybr br would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannere

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.4 Removing HTML Tags</p>

In [63]:
step_4 = df['review'][0]

# Function to remove HTML tags
# Note: step 1.2 already stripped '<' and '>', so on this data nothing is left to match
def remove_tags(text):
    return re.sub(r'<[^<]+?>', '', text)

# Calling Function
df['review'] = df['review'].apply(remove_tags)

# Head
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
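
The `br br` fragments are still visible in the output above because the tag characters were deleted before `remove_tags` ran. The following is a minimal sketch of how the order of the two steps changes the result. The sample string is made up, `strip_punct` is a hypothetical helper mirroring step 1.2, and replacing a tag with a space (instead of an empty string) is my assumption so that neighbouring words do not merge.

```python
import re
import string

def remove_tags(text):
    # Same pattern as in the cell above; a space is used as replacement
    # (assumption) so that words on either side of a tag stay separate.
    return re.sub(r'<[^<]+?>', ' ', text)

def strip_punct(text):
    # Hypothetical helper mirroring step 1.2.
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "hooked.<br /><br />the first thing"

# Order used in this notebook: punctuation first, then tags.
punct_first = remove_tags(strip_punct(sample))

# Alternative order: tags first, then punctuation, then whitespace cleanup.
tags_first = ' '.join(strip_punct(remove_tags(sample)).split())

print(punct_first)  # hookedbr br the first thing
print(tags_first)   # hooked the first thing
```

If the leftover `br` tokens matter for your task, moving the HTML step before punctuation removal is one way to avoid them.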


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.5 Removing URLs</p>

In [64]:
# Remove URLs (regex=True is needed for pattern matching in recent pandas versions)
df['review'] = df['review'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)

# Head
df.head()

step_5 = df['review'][0]
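
To see what the URL pattern matches, here is a small standalone sketch using only the `re` module. The sample sentence and the name `URL_PATTERN` are made up for illustration, and the sketch assumes the text still contains its punctuation (in this notebook, punctuation has already been removed by this point, so real URLs would no longer look like this).

```python
import re

# Raw string avoids invalid-escape warnings; `\.` matches a literal dot.
URL_PATTERN = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)

sample = "full review at https://example.com/oz and WWW.example.org today"

# Remove the URLs, then collapse the leftover double spaces.
cleaned = ' '.join(URL_PATTERN.sub('', sample).split())

print(cleaned)  # full review at and today
```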

In [65]:
print(f"Raw : {step_1}")
print(f"After lowercase : {step_2}")
print(f"After removing punctuation : {step_3}")
print(f"After removing stopwords : {step_4}")
print(f"After removing HTML and URLs : {step_5}")

# example of stopwords : "of" "the" "other"

Raw : One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to t

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.6 Handling Chat Words</p>

In [66]:
# Chat words (abbreviation -> expansion), taken from a GitHub repository
# Repository Link : https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek you (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA?": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your sex and age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That feeling when",
    "MFW": "My face when",
    "MRW": "My reaction when",
    "IFYP": "I feel your pain",
    "TNTL": "Trying not to laugh",
    "JK": "Just kidding",
    "IDC": "I don't care",
    "ILY": "I love you",
    "IMU": "I miss you",
    "ADIH": "Another day in hell",
    "ZZZ": "Sleeping, bored, tired",
    "WYWH": "Wish you were here",
    "TIME": "Tears in my eyes",
    "BAE": "Before anyone else",
    "FIMH": "Forever in my heart",
    "BSAAW": "Big smile and a wink",
    "BWL": "Bursting with laughter",
    "BFF": "Best friends forever",
    "CSL": "Can't stop laughing"
}

In [67]:
# Function
def chat_conversion(text):
    new_text = []
    for i in text.split():
        if i.upper() in chat_words:
            new_text.append(chat_words[i.upper()])
        else:
            new_text.append(i)
    return " ".join(new_text)

# Calling Function 
df['review'] = df['review'].apply(chat_conversion)

# Head
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br filming tech...,positive
2,thought wonderful way spend Tears in my eyes h...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love Tears in my eyes money vis...,positive
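
The output above shows a side effect: the ordinary word `time` became "Tears in my eyes" because `TIME` is a key in `chat_words`. One possible guard is to skip keys that are also everyday English words. The sketch below is only an illustration: `chat_words_excerpt` is a three-entry excerpt of the dictionary, and `AMBIGUOUS` and `expand_chat_words` are hypothetical names with a hand-picked list that you would need to adapt.

```python
# Three-entry excerpt of the chat_words dictionary defined above.
chat_words_excerpt = {
    "BRB": "Be Right Back",
    "IMO": "In My Opinion",
    "TIME": "Tears in my eyes",
}

# Hypothetical blocklist: keys that are also everyday English words.
AMBIGUOUS = {"TIME", "KISS", "GAL", "STATS"}

def expand_chat_words(text):
    out = []
    for token in text.split():
        key = token.upper()
        # Expand only abbreviations that are not on the blocklist.
        if key in chat_words_excerpt and key not in AMBIGUOUS:
            out.append(chat_words_excerpt[key])
        else:
            out.append(token)
    return " ".join(out)

result = expand_chat_words("imo wonderful way to spend time brb")
print(result)  # In My Opinion wonderful way to spend time Be Right Back
```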


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.7 Word_Tokenization</p>


In [None]:
# Tokenization 
from nltk.tokenize import word_tokenize # more precise than .split()

# Apply word_tokenize
df['review_word_token'] = df['review'].apply(word_tokenize)

# Head
df.head()

Unnamed: 0,review,sentiment,review_word_token
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e..."
1,wonderful little production br br filming tech...,positive,"[wonderful, little, production, br, br, filmin..."
2,thought wonderful way spend Tears in my eyes h...,positive,"[thought, wonderful, way, spend, Tears, in, my..."
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,..."
4,petter matteis love Tears in my eyes money vis...,positive,"[petter, matteis, love, Tears, in, my, eyes, m..."


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.8 Sentence_Tokenization</p>


In [69]:
# Tokenization 
from nltk.tokenize import sent_tokenize

# Apply sent_tokenize
# Note: punctuation was removed in step 1.2, so each review comes back as a single "sentence"
df['review_sent_token'] = df['review'].apply(sent_tokenize)

# Head
df.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...
1,wonderful little production br br filming tech...,positive,"[wonderful, little, production, br, br, filmin...",[wonderful little production br br filming tec...
2,thought wonderful way spend Tears in my eyes h...,positive,"[thought, wonderful, way, spend, Tears, in, my...",[thought wonderful way spend Tears in my eyes ...
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...
4,petter matteis love Tears in my eyes money vis...,positive,"[petter, matteis, love, Tears, in, my, eyes, m...",[petter matteis love Tears in my eyes money vi...


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.9 Stemming</p>


In [70]:
# Initialize PorterStemmer
ps = PorterStemmer()

# Apply PorterStemmer
df['review_stemmed'] = df['review'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))

# Head
df.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token,review_stemmed
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...,one review mention watch 1 oz episod youll hoo...
1,wonderful little production br br filming tech...,positive,"[wonderful, little, production, br, br, filmin...",[wonderful little production br br filming tec...,wonder littl product br br film techniqu unass...
2,thought wonderful way spend Tears in my eyes h...,positive,"[thought, wonderful, way, spend, Tears, in, my...",[thought wonderful way spend Tears in my eyes ...,thought wonder way spend tear in my eye hot su...
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...,basic there famili littl boy jake think there ...
4,petter matteis love Tears in my eyes money vis...,positive,"[petter, matteis, love, Tears, in, my, eyes, m...",[petter matteis love Tears in my eyes money vi...,petter mattei love tear in my eye money visual...


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">1.10 Lemmatization</p>


In [71]:
# Download NLTK resources (you can skip this cell if they are already downloaded)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...


True

In [72]:
# Initialize lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize each word; pos='v' treats every token as a verb
df['review_lemmatized'] = df['review'].apply(lambda x: ' '.join([wordnet_lemmatizer.lemmatize(word, pos='v') for word in x.split()]))

# Head
df.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token,review_stemmed,review_lemmatized
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...,one review mention watch 1 oz episod youll hoo...,one reviewers mention watch 1 oz episode youll...
1,wonderful little production br br filming tech...,positive,"[wonderful, little, production, br, br, filmin...",[wonderful little production br br filming tec...,wonder littl product br br film techniqu unass...,wonderful little production br br film techniq...
2,thought wonderful way spend Tears in my eyes h...,positive,"[thought, wonderful, way, spend, Tears, in, my...",[thought wonderful way spend Tears in my eyes ...,thought wonder way spend tear in my eye hot su...,think wonderful way spend Tears in my eye hot ...
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...,basic there famili littl boy jake think there ...,basically theres family little boy jake think ...
4,petter matteis love Tears in my eyes money vis...,positive,"[petter, matteis, love, Tears, in, my, eyes, m...",[petter matteis love Tears in my eyes money vi...,petter mattei love tear in my eye money visual...,petter matteis love Tears in my eye money visu...


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">2. Text Representation</p>


![image.png](attachment:49f570de-2754-45c4-b6a3-ade7a5d2e9b7.png)

### ***1. What is Feature Extraction from Text?***
***Feature extraction from text refers to the process of converting textual data into numerical representations. It's also known as text representation or text vectorization.***

### ***2. Why Do We Need It?***
***In natural language processing (NLP) tasks using machine learning, providing effective features to models is crucial. Text vectorization is a pivotal step in ML-based NLP, facilitating the transformation of text data into a format suitable for modeling.***

### ***3. Why Is It Difficult?***
***While it's straightforward to represent data like pixels in images or audio in numerical form, converting text into meaningful numerical representations poses challenges. Text carries rich semantic meaning, and preserving this meaning during conversion is complex.***

### ***4. What Is the Core Idea?***
***The core idea of text-to-number conversion is to encode the semantic meaning of the text into numerical representations effectively.***

### ***5. What Are the Techniques?***
***a. One-Hot Encoding (OHE): Assigning each word in the vocabulary a unique binary vector that contains a single 1.***

***b. Bag of Words (BOW): Representing text as a frequency distribution of words, disregarding grammar and word order.***

***c. N-Grams: Capturing the co-occurrence of adjacent words by considering sequences of n words.***

***d. Term Frequency-Inverse Document Frequency (TF-IDF): Weighing the importance of words in a document relative to their frequency across the corpus.***
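
To make item d concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the plain textbook definitions, tf = count / document length and idf = ln(N / df). Libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ. All names (`toy_docs`, `df_counts`, `tf_idf`) are hypothetical.

```python
import math
from collections import Counter

toy_docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word appear?
df_counts = Counter(word for doc in toy_docs for word in set(doc))

def tf_idf(doc):
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / df_counts[word])
        for word, count in counts.items()
    }

scores = tf_idf(toy_docs[0])
for word, score in sorted(scores.items()):
    print(f"{word}: {score:.3f}")
```

In the first document, "plot" appears once but in no other document, so it scores higher than "good", which appears twice but is shared with another document.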

### ***Some Common Terms***
#### a. ***Corpus: The entire collection of text data, comprising all documents or texts under consideration.***

#### b. ***Vocabulary: The set of unique words present in the corpus.***

#### c. ***Document: Each individual piece of text in the corpus, such as a review, article, or sentence.***

#### d. ***Word: The basic unit of textual data, representing a single element of meaning.***
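
The four terms can be illustrated on a tiny made-up example (the names `toy_corpus`, `toy_documents` and `toy_vocabulary` are hypothetical and unrelated to the variables built later in this notebook):

```python
# Corpus: the whole collection, here two documents.
toy_corpus = ["the movie was great", "the plot was thin"]

# Document: one entry of the corpus; words come from whitespace splitting.
toy_documents = [doc.split() for doc in toy_corpus]

# Vocabulary: the set of unique words across all documents.
toy_vocabulary = sorted({word for doc in toy_documents for word in doc})

print(len(toy_corpus))   # 2
print(toy_vocabulary)    # ['great', 'movie', 'plot', 'the', 'thin', 'was']
```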

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">Corpus</p>

In [84]:
# First, build the corpus: one cleaned string per review
corpus = []

# Run a Loop and Append Reviews in corpus
for i in range(len(df)):
    #review = re.sub('[^a-zA-Z]', ' ', df['review_lemmatized'][i]) #replace all of the char which aren't in the alphabet
    review = re.sub('[^a-zA-Z]', ' ', df['review_stemmed'][i])
    review = review.split()
    review = ' '.join(review)
    corpus.append(review)

In [85]:
# Length of corpus
print(f'The Length of the Corpus is : {len(corpus)}') # i.e. the number of reviews

The Length of the Corpus is : 50000


In [86]:
# Total number of words in corpus
# Initialize total_words counter
total_words = 0

# Iterate through each element in the corpus list
for text in corpus:
    # Split the text into words and update the total_words counter
    total_words += len(text.split())

# Print the total number of words
print(f"Total words in Corpus is : {total_words}")

Total words in Corpus is : 6155951


In [87]:
# head
corpus[0]

'one review mention watch oz episod youll hook right exactli happen mebr br first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordbr br call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awaybr br would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti surreal couldnt say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may

In [88]:
df['review'][11]

'saw movie 12 came recall scariest scene big bird eating men dangling helplessly parachutes right air horror horrorbr br young kid going cheesy b films saturday afternoons still tired formula monster type movies usually included hero beautiful woman might daughter professor happy resolution monster died end didnt care much romantic angle 12 year old predictable plots love unintentional humorbr br year later saw psycho came loved star janet leigh bumped early film sat took notice point since screenwriters making story make scary possible wellworn formula rules'

In [89]:
corpus[11]

'saw movi came recal scariest scene big bird eat men dangl helplessli parachut right air horror horrorbr br young kid go cheesi b film saturday afternoon still tire formula monster type movi usual includ hero beauti woman might daughter professor happi resolut monster die end didnt care much romant angl year old predict plot love unintent humorbr br year later saw psycho came love star janet leigh bump earli film sat took notic point sinc screenwrit make stori make scari possibl wellworn formula rule'

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">Vocabulary</p>


In [90]:
# Let's find the unique words in the corpus
vocabulary = set()

# Apply vocabulary
for review in corpus:
    # Split the review into words
    words = review.split()
    # Update the vocabulary set with unique words from the review
    vocabulary.update(words)

# Convert the set back to a list if needed
vocabulary = list(vocabulary)

In [91]:
# Length of Vocab
print(f'The Length of the Vocabulary  is : {len(vocabulary)}')

The Length of the Vocabulary  is : 137361


In [92]:
# Head of Vocab
vocabulary[0:10]

['denishawn',
 'sixti',
 'menahem',
 'booklength',
 'presencewth',
 'happensdavid',
 'lim',
 'luxproduc',
 'tooquick',
 'relationshipbr']

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">2.1 OneHotEncoding Text</p>


***In natural language processing (NLP), one-hot encoding (OHE) is a technique used to represent categorical data, such as words or phrases, in a binary format. Each word in a vocabulary is assigned a unique binary vector, where only one bit is set to 1 (indicating the presence of the word) and the rest are set to 0.***
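
Before applying this to the real vocabulary, here is a minimal pure-Python sketch of the idea on a made-up three-word vocabulary. The names `ohe_vocabulary`, `ohe_index` and `one_hot` are hypothetical, and this is not how Keras implements it internally.

```python
ohe_vocabulary = ["bird", "cat", "dog"]
ohe_index = {word: i for i, word in enumerate(ohe_vocabulary)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's vocabulary index.
    vec = [0] * len(ohe_vocabulary)
    vec[ohe_index[word]] = 1
    return vec

encoded = [one_hot(w) for w in "dog cat dog".split()]
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Each vector is as long as the vocabulary, which is why the cells below restrict the vocabulary to 20,000 words.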

In [95]:
vocabulary_short = vocabulary[0:20000]

In [96]:
# Length of Vocab
print(f'The Length of the Vocabulary  is : {len(vocabulary_short)}')

The Length of the Vocabulary  is : 20000


In [105]:
from tensorflow.keras.preprocessing.text import Tokenizer

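# Create a tokenizer that keeps at most `num_words` word indices (most frequent first)
# Note: Keras reserves index 0, so only indices 1..num_words-1 get a column;
# with 20,000 words and num_words=20000 the last word is encoded as all zeros
# (see the final row of the output below)
tokenizer = Tokenizer(num_words=20000)  # Adjust the value as needed

# Fit tokenizer on vocabulary_short; each word is treated as its own one-word "text"
tokenizer.fit_on_texts(vocabulary_short)

# Example: if vocabulary = ["dog", "cat", "bird"]
# and a text contains "dog" and "cat",
# its binary vector is [1, 1, 0] (ignoring the reserved index 0)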

# One-hot encode the text data
one_hot_encoded = tokenizer.texts_to_matrix(vocabulary_short, mode='binary')

# Print the one-hot encoded representation
print(one_hot_encoded)
print(len(one_hot_encoded)) # one vector for each word in the vocab
print(len(one_hot_encoded[0])) # num_words given to the tokenizer

[[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 0.]]
20000
20000


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">2.2 Bag of Words (BoW)</p>


***Bag of Words (BoW) is a well-known and widely used technique in text classification, despite its age. Its appeal lies in its simplicity and in the fact that out-of-vocabulary (OOV) words do not break it: they are simply ignored.***
### ***2. Working***
***Here's how it works: first, we create a vocabulary from our corpus, typically represented by a fixed number of words, let's say 5. Then, for each document, we count how many times each word from the vocabulary occurs. The order of words doesn't matter in BoW.***
### ***3. Core Idea***
***The core idea behind BoW is that it focuses on word frequency rather than context. It assumes that documents with similar content will have similar word frequencies. In other words, if two documents belong to the same category, they are likely to have similar word frequencies.***
### ***4. How We Capture Semantic Meaning of Words***
***However, BoW doesn't capture contextual or semantic meaning directly. It treats text as a vector in a high-dimensional space, with each dimension representing a word from the vocabulary. Documents are then compared based on the angle between their vectors. Similar documents will have smaller angles between their vectors, indicating similar word frequencies.***
### ***5. Summarization***
***BoW tolerates OOV words: a word that is not in the vocabulary is simply ignored when a new document is encoded, although whatever information it carried is lost. In summary, BoW is a simple yet effective technique for text classification that focuses on word frequency rather than context or meaning. It represents text as vectors in a high-dimensional space and compares documents based on the angle between their vectors.***
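
The points above can be sketched in pure Python on a made-up vocabulary of 5 words. This is only an illustration of the idea, not scikit-learn's implementation, and the names `bow_vocab`, `bow_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

bow_vocab = ["good", "movie", "bad", "plot", "acting"]  # fixed vocabulary of 5 words

def bow_vector(text):
    counts = Counter(text.split())
    # Words outside the vocabulary (OOV) have no column and are dropped.
    return [counts[word] for word in bow_vocab]

def cosine(u, v):
    # Cosine of the angle between two count vectors (1.0 = same direction).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

a = bow_vector("good movie good plot")
b = bow_vector("good plot good movie")       # same words, different order
c = bow_vector("bad acting terrible movie")  # "terrible" is OOV

print(a, b, c)
print(round(cosine(a, b), 3), round(cosine(a, c), 3))
```

Documents `a` and `b` get identical vectors because word order is discarded, while `c` shares only one word with `a` and therefore has a much smaller cosine similarity.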

### ***Code Working***

1. **Initializing CountVectorizer**:
   - `cv = CountVectorizer(max_features=10000)`: Here, an instance of the `CountVectorizer` class is created with a parameter `max_features` set to 10000. 
     - `max_features` specifies the maximum number of features (words or tokens) to consider based on their frequency. In this case, it limits the vocabulary to the top 10,000 most frequent words in the corpus.

2. **Fitting the Data**:
   - `bow = cv.fit_transform(df['review_stemmed'])`: This line fits the `CountVectorizer` to the text data and transforms the text data into a document-term matrix.
     - `df['review_stemmed']` is the DataFrame column created in step 1.9, containing the preprocessed and stemmed reviews.
     - The `fit_transform()` method first learns the vocabulary from the text data (`df['review_stemmed']`) and then transforms the text into a sparse matrix where each row represents a document (review) and each column represents a token (word). The values in the matrix are the counts of each token in the corresponding document.

In summary, this code segment imports the `CountVectorizer` class, initializes it with a maximum feature limit of 10,000, and then fits it to the stemmed text data to create a document-term matrix representing the word counts in each document.
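
To see what a frequency cap like `max_features` does, here is a small pure-Python sketch on three made-up documents. It mimics the idea (keep the most frequent terms, then count them per document) rather than reproducing scikit-learn's code. The alphabetical column order and all variable names are assumptions for illustration.

```python
from collections import Counter

toy_reviews = ["good movie good plot", "bad movie", "good acting"]
toy_max_features = 2  # stand-in for max_features=10000

# Total count of every term across the whole corpus.
term_counts = Counter(word for doc in toy_reviews for word in doc.split())

# Keep only the most frequent terms, then sort them to fix the column order.
kept_terms = sorted(word for word, _ in term_counts.most_common(toy_max_features))

# Document-term matrix: one row per document, one column per kept term.
doc_term = [[doc.split().count(term) for term in kept_terms] for doc in toy_reviews]

print(kept_terms)  # ['good', 'movie']
print(doc_term)    # [[2, 1], [0, 1], [1, 0]]
```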

In [106]:
# import 
from sklearn.feature_extraction.text import CountVectorizer

# Initialize
cv = CountVectorizer(max_features=10000) # Adjust the Value as needed 

# Fit on Data 
bow = cv.fit_transform(df['review_stemmed'])

In [107]:
# Show vocabulary size
print(f'The length of the Vocabulary Generated by Bag of Words is : {len(cv.vocabulary_)}')

The length of the Vocabulary Generated by Bag of Words is : 10000


In [108]:
# Encoded sequence for the first review
bow[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [109]:
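# Number of times each word occurs in the whole corpus
# (summing the sparse matrix directly avoids building a dense 50000 x 10000 array)
word_counts = np.asarray(bow.sum(axis=0)).ravel()
word_counts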

array([  76,   79, 4243, ...,   42,   59,   47], dtype=int64)

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">2.3 Bi-Gram and Tri-Gram</p>

***This technique is quite similar to BOW, but with a key difference. Instead of treating single words as vocabulary units, N-grams use sequences of N consecutive words. For example, Bi-grams pair two adjacent words as one vocabulary unit, while Tri-grams use three adjacent words.***

### ***3. Advantages:***
1. ***Able to capture some local context and word order (for example, "not good" is kept distinct from "good").***
2. ***Simple and intuitive.***

### ***4. Disadvantages:***
1. ***Increased vocabulary dimensionality in Bi-grams and Tri-grams can lead to longer training times with large datasets.***
2. ***No perfect solution for Out-Of-Vocabulary (OOV) words.***
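
Both points can be seen on a tiny example. The sketch below uses two made-up sentences (`toy_docs` and `unseen` are illustrative only): unigram counts cannot tell "not good" from "good" as a unit, while bigrams keep the pair together, and a review made entirely of unseen words maps to an all-zero vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, used only for illustration
toy_docs = ["the movie was not good", "the movie was good"]

uni = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
bi = CountVectorizer(ngram_range=(2, 2)).fit(toy_docs)

print(sorted(uni.vocabulary_))  # single words only
print(sorted(bi.vocabulary_))   # pairs such as 'not good' and 'was good'

# Out-of-vocabulary text: none of these words (or pairs) were seen during fit,
# so every column of the resulting vector stays at zero
unseen = ["terrible acting"]
print(uni.transform(unseen).toarray())
print(bi.transform(unseen).toarray())
```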

### ***Code Working***

1. `cv_bi = CountVectorizer(max_features=10000, ngram_range=(2,2))`:
   - Here, `CountVectorizer` is a class from the scikit-learn library used for converting a collection of text documents into a matrix of token counts.
   - `max_features=10000` specifies that we want to limit the number of features to the 10,000 most frequent ones. This parameter helps in managing memory and processing time, especially for large datasets.
   - `ngram_range=(2,2)` specifies that we want to create bigrams, which are sequences of two adjacent words in the text. Because both the lower and upper bound are 2, the CountVectorizer uses only pairs of consecutive words as features; single words are not included (that would require `ngram_range=(1,2)`).

2. `bow_bi = cv_bi.fit_transform(df['review_stemmed'])`:
   - `cv_bi.fit_transform()` is a method that fits the CountVectorizer to the text data and transforms the text data into a document-term matrix.
   - `df['review_stemmed']` is assumed to be a Pandas DataFrame column containing preprocessed and stemmed text data (e.g., reviews).
   - The `fit_transform()` method first fits the CountVectorizer to learn the vocabulary from the text data (`df['review_stemmed']`) and then transforms the text data into a sparse matrix where each row represents a document (review) and each column represents a bigram. The values in the matrix indicate the count of each bigram in the corresponding document.

In summary, this code snippet initializes a CountVectorizer object to create a document-term matrix using `bigrams` (pairs of consecutive words) as features and limits the vocabulary to the 10,000 most frequent bigrams. Then, it fits the CountVectorizer to the stemmed text data and transforms the data into a sparse matrix representation.
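
To make the effect of `ngram_range` concrete, here is a small sketch on a single made-up sentence (the `doc` list is illustrative only). It compares `(2, 2)`, which yields bigrams only, with `(1, 2)`, which yields single words and bigrams together:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus, used only for illustration
doc = ["the film was great"]

only_bigrams = CountVectorizer(ngram_range=(2, 2)).fit(doc)
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

# (2, 2): every feature is a pair of adjacent words
print(sorted(only_bigrams.vocabulary_))

# (1, 2): single words and pairs share one vocabulary
print(sorted(uni_and_bi.vocabulary_))
```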

In [None]:
# Bigram
cv_bi = CountVectorizer(max_features=10000, ngram_range=(2, 2))

# Fitting
bow_bi = cv_bi.fit_transform(df['review_stemmed'])

In [None]:
# Vocabulary of bi-grams
bi_gram_vocabulary = cv_bi.get_feature_names_out()

# Length
print(f'The length of the Vocabulary Generated by Bag of Words Bi Gram is: {len(bi_gram_vocabulary)}')

The length of the Vocabulary Generated by Bag of Words Bi Gram is: 10000


In [None]:
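# Number of times each bigram has occurred across the corpus
# (summing the sparse matrix directly avoids building a large dense array)
import numpy as np
word_counts = np.asarray(bow_bi.sum(axis=0)).ravel()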
word_counts

array([238, 112, 378, ...,  78,  56, 169])

---


### ***Code Working***

1. `cv_tri = CountVectorizer(max_features=10000, ngram_range=(3,3))`:
   - As in the previous code, `CountVectorizer` converts the text data into a matrix of token counts, and `max_features=10000` keeps only the 10,000 most frequent features.
   - `ngram_range=(3,3)` specifies that we want to create **Trigrams**, which are sequences of three adjacent words in the text. Because both bounds are 3, the CountVectorizer uses only triplets of consecutive words as features; single words and bigrams are not included.

2. The fitting and transformation steps are the same as above.

In summary, this code initializes a CountVectorizer object to create a document-term matrix using trigrams (triplets of consecutive words) as features. Then, it fits the CountVectorizer to the stemmed text data and transforms the data into a sparse matrix representation.

In [None]:
# Trigram
cv_tri = CountVectorizer(max_features=10000, ngram_range=(3, 3))

# Fitting
bow_tri = cv_tri.fit_transform(df['review_stemmed'])

In [None]:
# Vocabulary of tri-grams
tri_gram_vocabulary = cv_tri.get_feature_names_out()

# Length
print(f'The length of the Vocabulary Generated by Bag of Words Tri Gram is: {len(tri_gram_vocabulary)}')

The length of the Vocabulary Generated by Bag of Words Tri Gram is: 10000


In [None]:
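# Number of times each trigram has occurred across the corpus
# (summing the sparse matrix directly avoids building a large dense array)
import numpy as np
word_counts = np.asarray(bow_tri.sum(axis=0)).ravel()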
word_counts

array([10, 14,  9, ..., 15, 10, 10])

# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">2.4 TF-IDF</p>

### ***TF-IDF (Term Frequency-Inverse Document Frequency)***
***TF-IDF (Term Frequency-Inverse Document Frequency) assigns different values to words in a vocabulary based on their importance in documents. It operates on a simple principle: if a word appears frequently in a particular document but rarely across the entire corpus, it's considered significant for that document, thus receiving a higher weight.***
 
### ***To calculate these values, TF-IDF uses two terms:***
1. ***TF (Term Frequency): The frequency of a term in a document, normalized between 0 and 1, indicating the probability of that term occurring in the document.***
   ***Formula: TF(t,d) = Number of Occurrences of Term t in Document d / Total Number of Terms in Document d***
 
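2. ***IDF (Inverse Document Frequency): Measures how rare a term is across the entire corpus.***
   ***Formula: IDF(t) = log(Total Number of Documents in Corpus / Number of Documents with Term t)***

***So the idea is: if a word appears in many documents of the corpus, its IDF value will be low; if it appears in only a few documents, its IDF value will be high. The final weight is TF-IDF(t,d) = TF(t,d) × IDF(t). Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, IDF(t) = ln((1 + N) / (1 + DF(t))) + 1, so its values differ slightly from the plain formula.***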
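
The two terms can be computed by hand on a small example. The sketch below is not how `TfidfVectorizer` works internally. It applies the TF formula above together with the common logarithmic form of IDF, log(N / DF), to a made-up three-document corpus (`toy_docs` and the helper functions `tf`, `idf` and `tf_idf` are illustrative names):

```python
import math

# Hypothetical tokenized mini-corpus, used only for illustration
toy_docs = [
    "good movie good cast".split(),
    "bad movie".split(),
    "good story".split(),
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

first = toy_docs[0]
# 'good' is frequent in this document but also common in the corpus;
# 'cast' appears once here and nowhere else, so its IDF is higher
for term in ("good", "cast"):
    print(term, tf(term, first), idf(term, toy_docs), tf_idf(term, first, toy_docs))
```

In this toy corpus the rarer word "cast" ends up with a larger TF-IDF weight than "good", even though "good" occurs twice in the document.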

In [None]:
# Import 
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Initialize
tf_idf = TfidfVectorizer()

# Fitting
tf = tf_idf.fit_transform(df['review_stemmed'])

In [None]:
# Length of vocabulary
print(f"The Length of Tf-idf Vocabulary is {len(tf_idf.vocabulary_)}")

The Length of Tf-idf Vocabulary is 142234


In [None]:
# TF-IDF vector of one review (index 120)
tf[120].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# IDF scores of words
idf_scores = tf_idf.idf_

# Print the IDF scores of the words
print("IDF Scores of Words:", idf_scores)

IDF Scores of Words: [ 8.98658494 10.721186   11.1266511  ... 11.1266511  11.1266511
 11.1266511 ]
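
These scores can be read back as document frequencies. The sketch below assumes scikit-learn's default smoothed formula, IDF = ln((1 + N) / (1 + DF)) + 1, and assumes a corpus of N = 50,000 reviews. The corpus size is an assumption (it is not stated in this section), but the printed values are consistent with it. The helper names `smoothed_idf` and `doc_freq_from_idf` are illustrative:

```python
import math

N_DOCS = 50_000  # assumed number of reviews in the corpus

def smoothed_idf(doc_freq, n_docs):
    """Smoothed IDF as used by TfidfVectorizer with its default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def doc_freq_from_idf(idf_value, n_docs):
    """Invert the smoothed formula to recover the document frequency."""
    return round((1 + n_docs) / math.exp(idf_value - 1) - 1)

# A word that appears in a single review gets the highest score
print(smoothed_idf(1, N_DOCS))

# Reading the printed scores back as "number of reviews containing the word"
for score in (8.98658494, 10.721186, 11.1266511):
    print(score, "->", doc_freq_from_idf(score, N_DOCS))
```

Under these assumptions, the repeated value 11.1266511 corresponds to words that occur in only one review, which suggests that a large part of the 142,234-word vocabulary consists of very rare tokens.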


# <p style="font-family:newtimeroman;font-size:100%;text-align:center;color:#F52549;">The End</p>


***Hope you like the setup! If you encounter any issues or have questions, feel free to comment below. I'll do my best to provide clear and helpful answers.***