# __NLP__
- NLP stands for natural language processing
- A branch of AI that gives machines the ability to understand human language
- Human language can take the form of text or audio

### __Tokenization__
- A technique that divides a sentence or phrase into smaller units known as __tokens__
- Tokens can be words, dates, punctuation marks, or fragments of words
- It is a critical step in many NLP tasks, such as text processing, language modelling, and machine translation


__Types of Tokenization__
1. Word Tokenization
    - In word tokenization, a text is divided into individual words
    - Words are treated as the basic unit of meaning
    - __example:__ 
        - Input: "Tokenization is an important NLP task."
        - Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
2. Sentence Tokenization
    - In sentence tokenization, the text is segmented into sentences
    - Useful for tasks that analyze or process individual sentences
    - __example:__
        - Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
        - Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
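
The two tokenization types above can be approximated with regular expressions. This is only a rough sketch using the standard library: the helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the patterns are far simpler than what NLTK does (for instance, they would split wrongly after an abbreviation such as "Dr.").

```python
import re

def simple_word_tokenize(text):
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows ".", "!" or "?"
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Tokenization is an important NLP task. It helps break down text into smaller units."
words = simple_word_tokenize("Tokenization is an important NLP task.")
sentences = simple_sent_tokenize(text)

print(words)      # ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']
print(sentences)  # two sentences, split after the first full stop
```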

__Implementing Tokenization__
1. Word tokenization using __word_tokenize()__
    - The word_tokenize() function breaks a sentence down into individual words

In [3]:
# Syntax
# from nltk.tokenize import word_tokenize
# word_tokenize(Sentence)

In [4]:
from nltk.tokenize import word_tokenize

sentence = """Hello. my name is Asad. I am 20 years old"""

words = word_tokenize(sentence)
print(words)

['Hello', '.', 'my', 'name', 'is', 'Asad', '.', 'I', 'am', '20', 'years', 'old']


2. Sentence tokenization using __sent_tokenize()__
    - It converts a block of text into a list of sentences
    

In [5]:
# Syntax:
# from nltk.tokenize import sent_tokenize
# sent_tokenize(text)

In [6]:
# Example:
from nltk.tokenize import sent_tokenize

text = "Hello, my name is Asad. I am 20 years old"
sentence = sent_tokenize(text)
print(sentence)

['Hello, my name is Asad.', 'I am 20 years old']


### __Stemming__
- A method in text processing that strips affixes (prefixes and suffixes) from words, reducing them to their fundamental/root form
- It simplifies text by applying stemmers, also called stemming algorithms
- __e.g:__ "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve"


__Stemmer in nltk (using Porter's Stemmer)__
- Based on the idea that suffixes in English are made up of smaller and simpler suffixes
- A group of related words is mapped to the same stem, which is not necessarily a meaningful word
- __e.g:__ EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE”, so ‘agreed’ becomes ‘agree’
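
To make the EED -> EE rule concrete, here is a minimal sketch of that single rule in isolation. The function name `apply_eed_rule` is hypothetical, and the condition is a simplification of Porter's actual "measure" test (it ignores the special handling of "y", for example); it is not the full Porter algorithm.

```python
import re

def apply_eed_rule(word):
    """Apply only the EED -> EE rule; every other Porter rule is omitted."""
    word = word.lower()
    if not word.endswith("eed"):
        return word
    stem = word[:-3]
    # Simplified condition: the part before "eed" must contain
    # a vowel immediately followed by a consonant
    if re.search(r"[aeiou][^aeiou]", stem):
        return stem + "ee"
    return word

results = {w: apply_eed_rule(w) for w in ["agreed", "feed", "bleed"]}
for original, stemmed in results.items():
    print(original, "->", stemmed)
```

Under this simplified condition, "agreed" is shortened to "agree", while "feed" and "bleed" are left alone because nothing before the ending satisfies the vowel-consonant test.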
 

In [7]:
# Syntax:
# from nltk.stem import PorterStemmer
# porter=PorterStemmer()
# porter.stem(token)

In [8]:
# Example 1:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem("liking"))
print(porter.stem("liked"))
print(porter.stem("likes"))
print(porter.stem("Realization"))

like
like
like
realiz


In [9]:
# Example 2: Stemmer producing a stem that is not a real word
from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem("Communication"))

commun


### __Lemmatization__
- Process of grouping together different inflected forms of a word so they can be analyzed as a single term
- Similar to stemming, but it takes the context of the word into account
- Links words with similar meaning to one word
- __Note:__ Always convert text to lowercase before performing lemmatization
- __e.g:__
    - meeting -> meet
    - was -> be
    - mice -> mouse
    - better -> good
    - corpora -> corpus
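
The mappings above cannot be produced by stripping suffixes ("mice" and "mouse" share no common stem), which is why lemmatization relies on a dictionary. The sketch below uses a hypothetical hand-written table, `LEMMA_TABLE`, built only from the examples above; a real lemmatizer consults a full lexical database and the word's part of speech.

```python
# Hypothetical lookup table containing only the examples listed above
LEMMA_TABLE = {
    "meeting": "meet",
    "was": "be",
    "mice": "mouse",
    "better": "good",
    "corpora": "corpus",
}

def toy_lemmatize(word):
    # Lowercase first so that "Mice" and "mice" hit the same entry
    key = word.lower()
    # Words missing from the table are returned unchanged
    return LEMMA_TABLE.get(key, key)

print(toy_lemmatize("Mice"))     # mouse
print(toy_lemmatize("corpora"))  # corpus
print(toy_lemmatize("table"))    # table (not in the toy dictionary)
```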

__Lemmatizer in nltk (WordNet):__
- WordNet links words through semantic relations
- It groups synonyms into sets called synsets

In [10]:
# Syntax:
# from nltk.stem import WordNetLemmatizer
# WordNet=WordNetLemmatizer()
# WordNet.lemmatize(token)

In [11]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
tokens = [
    "kites",
    "babies",
    "dogs",
    "flying",
    "smiling",
    "driving",
    "died",
    "tried",
    "feet",
]
for token in tokens:
    print(wnl.lemmatize(token))

# > kites ---> kite
# > babies ---> baby
# > dogs ---> dog
# > flying ---> flying
# > smiling ---> smiling
# > driving ---> driving
# > died ---> died
# > tried ---> tried
# > feet ---> foot

kite
baby
dog
flying
smiling
driving
died
tried
foot


### __Stop Words__
- Commonly used words (such as "a", "the", "an", or "in") that a search engine is programmed to ignore
- These words take up space in the database and valuable processing time while adding little meaning, so we remove them
- __e.g:__
    1. Can listening be exhausting? -> listening, exhausting
    2. I like reading, so I read -> like, reading, read

__Checking English Stop Words List__
- It includes common words that carry little semantic meaning and are often excluded during text analysis
- __e.g:__ "this", "and", "is", "for", "it"
- These are removed to focus on the more meaningful terms when processing text data in NLP

In [12]:
# Example 1: Checking list of StopWords
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

__Removing Stop Words using nltk__

In [13]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words= stopwords.words('english')
text= """This is a sample sentence, showing off the stop words filtration."""
print(text)
#Convert text into lowercase
text=text.lower()
#Tokenize the text
tokens = word_tokenize(text)
filtered_sentence=[]

#Filter out stop words
for t in tokens:
    if t not in stop_words:
        filtered_sentence.append(t)
#Print the filtered sentence
print(filtered_sentence)

This is a sample sentence, showing off the stop words filtration.
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
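
The filtered list above still contains the punctuation tokens ',' and '.'. If those are unwanted too, one possible variation is sketched below using only the standard library. The `STOP_WORDS` set here is a hypothetical, deliberately tiny stand-in for NLTK's list, and the regular expression is a simplified tokenizer, not `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer
STOP_WORDS = {"this", "is", "a", "off", "the"}

text = "This is a sample sentence, showing off the stop words filtration."

# Lowercase, then split into words and single punctuation characters
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Keep a token only if it is alphanumeric and not a stop word;
# a set makes each membership check fast
filtered = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]
print(filtered)
```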


### __Bag of Words Model__
- A model that preprocesses text by converting it into a bag of words, which ignores grammar and word order
- The bag of words keeps a count of the occurrences of the most frequently used words

__Steps for applying Bag of Words Model__
1. __Preprocess the data__
    - Convert text to lowercase
    - Remove all non-word characters
    - Remove all punctuation and extra whitespace

In [14]:
# Step 1:
# Consider the following text:
import nltk
import re
import numpy as np
text= "Beans. I was trying to explain to somebody as we were flying in, that’s corn.  That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."

#Sentence tokenize
tokens= nltk.sent_tokenize(text)
for i in range(len(tokens)):
    tokens[i]=tokens[i].lower() #Convert text to lowercase
    tokens[i]=re.sub(r'\W',' ',tokens[i]) #Remove all non-word and punctuation characters
    tokens[i]=re.sub(r'\s+',' ',tokens[i]) # Remove all extra spaces
print(tokens)

['beans ', 'i was trying to explain to somebody as we were flying in that s corn ', 'that s beans ', 'and they were very impressed at my agricultural knowledge ', 'please give it up for amaury once again for that outstanding introduction ', 'i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here ', 'i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have ', 'and it s great to see you governor ', 'i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today ', 'and i am deeply honored at the paul douglas award that is being given to me ', 'he is somebody who set the path for so much outstanding public service here in illinois ', 'now i want to start by addressing the elephant in the room ', 'i know people are still wondering why i d

2. __Obtain the most frequent words in the text__
    - Declare a dictionary to hold the bag of words
    - Tokenize each sentence into words
    - For each word, check whether it already exists in the dictionary
    - If it does not, set its count to 1; otherwise increment its count by 1
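
The counting steps above can also be written with `collections.Counter` from the standard library, which handles the "first time seen vs. seen before" check internally. This is a sketch on a few hypothetical, already-preprocessed sentences rather than the speech used in this notebook.

```python
from collections import Counter

# Hypothetical sentences, already lowercased and stripped of punctuation
sentences = ["beans", "that s corn", "that s beans", "beans and corn"]

word_freq = Counter()
for sentence in sentences:
    # update() adds 1 for every word, creating the entry when it is missing
    word_freq.update(sentence.split())

# most_common(n) returns the n highest-count (word, count) pairs
top_word = word_freq.most_common(1)
print(word_freq)
print(top_word)
```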

In [15]:
# Step 1:
# Consider the following text:
from nltk.tokenize import sent_tokenize
import re
import numpy as np
text= "Beans. I was trying to explain to somebody as we were flying in, that’s corn.  That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."

#Sentence tokenize
tokens=sent_tokenize(text)
for i in range(len(tokens)):
    tokens[i]=tokens[i].lower() #Convert text to lowercase
    tokens[i]=re.sub(r'\W',' ',tokens[i]) #Remove all non-word and punctuation characters
    tokens[i]=re.sub(r'\s+',' ',tokens[i]) # Remove all extra spaces
    
# Step 2:
#Tokenize sentence into words
from nltk.tokenize import word_tokenize
word_freq={}
for token in tokens:
    words=word_tokenize(token)
    for word in words:
        if word not in word_freq.keys():
            word_freq[word]=1
        else:
            word_freq[word]+=1
print(word_freq)

#Keep the 100 most frequent words (word_freq becomes a list of words, not a dict)
import heapq
word_freq= heapq.nlargest(100,word_freq,key=word_freq.get)
print(word_freq)


{'beans': 2, 'i': 12, 'was': 1, 'trying': 1, 'to': 8, 'explain': 1, 'somebody': 3, 'as': 1, 'we': 2, 'were': 2, 'flying': 1, 'in': 5, 'that': 4, 's': 3, 'corn': 1, 'and': 7, 'they': 1, 'very': 1, 'impressed': 1, 'at': 4, 'my': 1, 'agricultural': 1, 'knowledge': 1, 'please': 1, 'give': 1, 'it': 3, 'up': 1, 'for': 5, 'amaury': 1, 'once': 1, 'again': 1, 'outstanding': 2, 'introduction': 1, 'have': 3, 'a': 2, 'bunch': 1, 'of': 3, 'good': 1, 'friends': 1, 'here': 5, 'today': 2, 'including': 1, 'who': 4, 'served': 1, 'with': 1, 'is': 4, 'one': 1, 'the': 9, 'finest': 1, 'senators': 1, 'country': 1, 're': 1, 'lucky': 1, 'him': 1, 'your': 1, 'senator': 1, 'dick': 1, 'durbin': 1, 'also': 1, 'noticed': 1, 'by': 2, 'way': 1, 'former': 1, 'governor': 2, 'edgar': 1, 'haven': 1, 't': 2, 'seen': 1, 'long': 1, 'time': 1, 'somehow': 1, 'he': 2, 'has': 1, 'not': 1, 'aged': 1, 'great': 1, 'see': 1, 'you': 1, 'want': 2, 'thank': 1, 'president': 1, 'killeen': 1, 'everybody': 1, 'u': 1, 'system': 1, 'making'

3. __Build the bag of words model__
    - For each sentence, construct a vector with one entry per frequent word
    - If the frequent word appears in the sentence, set its entry to 1; otherwise set it to 0

In [16]:
# Step 1:
# Consider the following text:
from nltk.tokenize import sent_tokenize
import re
import numpy as np
text= "Beans. I was trying to explain to somebody as we were flying in, that’s corn.  That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."

#Sentence tokenize
tokens=sent_tokenize(text)
for i in range(len(tokens)):
    tokens[i]=tokens[i].lower() #Convert text to lowercase
    tokens[i]=re.sub(r'\W',' ',tokens[i]) #Remove all non-word and punctuation characters
    tokens[i]=re.sub(r'\s+',' ',tokens[i]) # Remove all extra spaces
# Step 2:
#Tokenize sentence into words
from nltk.tokenize import word_tokenize
word_freq={}
for token in tokens:
    words=word_tokenize(token)
    for word in words:
        if word not in word_freq.keys():
            word_freq[word]=1
        else:
            word_freq[word]+=1
#Keep the 100 most frequent words (word_freq becomes a list of words, not a dict)
import heapq
word_freq= heapq.nlargest(100,word_freq,key=word_freq.get)
#Step 3:
x = []
for token in tokens:
    vector = []
    for word in word_freq:
        if word in word_tokenize(token):
            vector.append(1)
        else:
            vector.append(0)
    x.append(vector)
x=np.array(x)
print(x)

[[0 0 0 ... 0 0 0]
 [1 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 1 0 ... 1 1 1]
 [1 1 1 ... 0 0 0]
 [1 1 0 ... 0 0 0]]
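
Because the printed matrix is abbreviated, the structure is easier to see on a tiny case. The sketch below builds the same kind of binary vectors for two hypothetical sentences and a hand-picked four-word vocabulary standing in for the "most frequent words".

```python
sentences = ["the sky is blue", "the sun is bright"]
vocabulary = ["the", "is", "sky", "sun"]  # hypothetical "frequent words"

vectors = []
for sentence in sentences:
    words = set(sentence.split())
    # One entry per vocabulary word: 1 if present in the sentence, else 0
    vectors.append([1 if word in words else 0 for word in vocabulary])

for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
```

Each row corresponds to one sentence and each column to one vocabulary word, so the two rows differ only in the "sky" and "sun" columns.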


### __TF-IDF__
- It stands for Term Frequency-Inverse Document Frequency
- It measures how relevant a word is to a document within a collection (corpus) of documents


__Terminologies in TF-IDF__
1. __Term Frequency__
    - In a document d, term frequency represents the number of occurrences of a term t
    - The weight of a term in a document is proportional to its term frequency
    - __tf(t,d) = count of t in d / number of words in d__

2. __Document Frequency__
    - This measures how common a term is across the whole document set (similar in spirit to TF)
    - The only difference is that TF counts the occurrences of term t within a single document d, while DF counts the documents in the document set N that contain term t
    - In other words, DF is the number of documents in which the word is present
    - __df(t) = number of documents containing t__

3. __Inverse Document Frequency__
    - It tests how relevant a word is
    - The aim of a search is to locate the records that best fit the query
    - TF considers all terms equally significant, so IDF is used to down-weight terms that appear in many documents
    - __idf(t)=log(N/df(t))__ where
        - __df(t) = document frequency of term t (number of documents containing t)__
        - __N = total number of documents__


4. __TF-IDF__
    - __TF-IDF(t,d,D)=TF(t,d)*IDF(t,D)__
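
The formulas above can be checked by hand with a few lines of standard-library Python. This is a sketch under two assumptions: the natural logarithm is used for `log` (the notes do not state a base), and the corpus is a lowercased, punctuation-free copy of the four example sentences used with scikit-learn in this section. The helper names are hypothetical.

```python
import math

corpus = [
    "the sky is blue",
    "the sun is bright and the sky is blue",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # total number of documents

def tf(term, doc):
    # count of t in d / number of words in d
    return doc.count(term) / len(doc)

def df(term):
    # number of documents containing t
    return sum(1 for doc in docs if term in doc)

def idf(term):
    # log(N / df(t)); assumes the term occurs in at least one document
    return math.log(N / df(term))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("blue", docs[0]), 4))  # 0.1733
print(round(tf_idf("the", docs[0]), 4))   # 0.0, since "the" is in every document
```

These numbers will not match the output of scikit-learn's `TfidfVectorizer`, which by default uses raw counts for TF, a smoothed IDF, and L2 normalization of each document vector. That is why "the" receives a non-zero score there.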
    

In [17]:
# Example 1:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus of documents
corpus = [
    "The sky is blue",
    "The sun is bright and the sky is blue",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
]
# Step 2
# Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Get TF-IDF values 
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (i.e., the words)
feature_names = vectorizer.get_feature_names_out()

# Step 3
# Convert the sparse matrix to a dense format for display
dense_matrix = tfidf_matrix.todense()

# Step 4
# Display the results
for i, doc in enumerate(dense_matrix):
    print(f"Document {i+1} TF-IDF values:")
    for j, score in enumerate(doc.tolist()[0]):
        print(f"  {feature_names[j]}: {score:.4f}")



Document 1 TF-IDF values:
  and: 0.0000
  blue: 0.6031
  bright: 0.0000
  can: 0.0000
  in: 0.0000
  is: 0.4883
  see: 0.0000
  shining: 0.0000
  sky: 0.4883
  sun: 0.0000
  the: 0.3992
  we: 0.0000
Document 2 TF-IDF values:
  and: 0.4240
  blue: 0.3343
  bright: 0.2706
  can: 0.0000
  in: 0.0000
  is: 0.5413
  see: 0.0000
  shining: 0.0000
  sky: 0.2706
  sun: 0.2706
  the: 0.4425
  we: 0.0000
Document 3 TF-IDF values:
  and: 0.0000
  blue: 0.0000
  bright: 0.3310
  can: 0.0000
  in: 0.5186
  is: 0.3310
  see: 0.0000
  shining: 0.0000
  sky: 0.3310
  sun: 0.3310
  the: 0.5412
  we: 0.0000
Document 4 TF-IDF values:
  and: 0.0000
  blue: 0.0000
  bright: 0.2391
  can: 0.3746
  in: 0.0000
  is: 0.0000
  see: 0.3746
  shining: 0.3746
  sky: 0.0000
  sun: 0.4782
  the: 0.3910
  we: 0.3746


### __Word Embeddings__
- Method of extracting features from text so that the text can be fed into a machine learning model
- They preserve syntactic and semantic information

### __Word2vec__
- NLP technique for obtaining vector representations of words
- The vectors capture the meaning of words based on their surrounding words
- The semantic information and relations between different words are preserved
- Developed by researchers at Google
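
"Based on surrounding words" can be made concrete by listing the (target, context) pairs that a skip-gram style model learns from. The sketch below only generates those pairs for a fixed window; the function name `context_pairs` is hypothetical, and real Word2vec training adds further steps (such as sampling tricks) that are not shown.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["i", "love", "natural", "language", "processing"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbours, so "natural" is paired with "love" and "language". Words that keep appearing in similar contexts end up with similar vectors.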

__Creating Word2Vec__
1. Tokenize the sentences
2. Train a word2vec model on tokenized sentences

In [18]:
# Import necessary libraries
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK tokenizer data
nltk.download('punkt')

# Sample sentences to train the Word2Vec model
sentences = [
    "Hello, my name is Asad.",
    "I love natural language processing.",
    "Word2Vec is a powerful model for word embeddings.",
    "I enjoy learning new things.",
    "Deep learning is fascinating.",
    "Machine learning and artificial intelligence are interrelated fields."
]

# Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train a Word2Vec model on the tokenized sentences
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Example usage: Find the vector of a word
word = "learning"
if word in model.wv:
    print(f"Vector for '{word}':\n{model.wv[word]}\n")

# Example usage: Find most similar words to a given word
similar_words = model.wv.most_similar("language", topn=3)
print(f"Words most similar to 'language':\n{similar_words}\n")

# Example usage: Find similarity between two words
similarity_score = model.wv.similarity("learning", "processing")
print(f"Similarity between 'learning' and 'processing': {similarity_score}")


[nltk_data] Downloading package punkt to C:\Users\S.A
[nltk_data]     Tech\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Vector for 'learning':
[ 9.9660458e-05  3.0821718e-03 -6.8126968e-03 -1.3774707e-03
  7.6670009e-03  7.3487638e-03 -3.6748096e-03  2.6433435e-03
 -8.3152140e-03  6.2032156e-03 -4.6383245e-03 -3.1595279e-03
  9.3136756e-03  8.6830143e-04  7.4923071e-03 -6.0711433e-03
  5.1649613e-03  9.9231759e-03 -8.4552281e-03 -5.1386035e-03
 -7.0631793e-03 -4.8644664e-03 -3.7822647e-03 -8.5424874e-03
  7.9531269e-03 -4.8400508e-03  8.4222108e-03  5.2616741e-03
 -6.5445201e-03  3.9631235e-03  5.4638721e-03 -7.4266698e-03
 -7.4060163e-03 -2.4757332e-03 -8.6282277e-03 -1.5768069e-03
 -4.0439886e-04  3.2994591e-03  1.4390461e-03 -8.8006386e-04
 -5.5882698e-03  1.7333244e-03 -8.9027028e-04  6.7976406e-03
  3.9730081e-03  4.5287893e-03  1.4353305e-03 -2.7041151e-03
 -4.3698736e-03 -1.0341319e-03  1.4363434e-03 -2.6442122e-03
 -7.0785680e-03 -7.8077246e-03 -9.1245631e-03 -5.9377626e-03
 -1.8496348e-03 -4.3226960e-03 -6.4599686e-03 -3.7209135e-03
  4.2887270e-03 -3.7411784e-03  8.3785607e-03  1.5363722e-03
 

In [None]:
# Example 2:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt') # Download the tokenizer models if not already downloaded

sample = "Word embeddings are dense vector representations of words."
tokenized_corpus = word_tokenize(sample.lower()) # Lowercasing for consistency

skipgram_model = Word2Vec(
    sentences=[tokenized_corpus],
    vector_size=100,  # Dimensionality of the word vectors
    window=5,         # Maximum distance between the current and predicted word within a sentence
    sg=1,             # Skip-Gram model (1 for Skip-Gram, 0 for CBOW)
    min_count=1,      # Ignores all words with a total frequency lower than this
    workers=4,        # Number of CPU cores to use for training the model
)

# Continue training (the constructor above already trained once on the same sentences)
skipgram_model.train([tokenized_corpus], total_examples=1, epochs=10)
skipgram_model.save("skipgram_model.model")
loaded_model = Word2Vec.load("skipgram_model.model")
vector_representation = loaded_model.wv['word']
print("Vector representation of 'word':", vector_representation)
