# **Feature Extraction**

Feature engineering is a key part of Natural Language Processing. Algorithms and machines can’t understand characters, words, or sentences directly, so we can’t feed raw text to a machine learning model. Instead, we need to encode the words in a numerical form that the model can work with.

### **1. Bag of Words (BOW) model**

This is the simplest model. Imagine a sentence as a bag of words: the idea is to take the whole text data, count how often each word occurs, and map each word to its frequency. This method doesn’t care about the order of the words, but it does care how many times a word occurs, and the default bag of words model treats all words equally.

The length of every feature vector equals the size of the vocabulary, so each document is encoded with the same number of features. Words that occur many times get a higher weight, which biases the model toward frequent words.
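
To make the counting idea concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and a two-sentence toy corpus; the function name `bag_of_words` is made up for this illustration and is not part of any library.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct words in the corpus
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One entry per vocabulary word, so every vector has the same length
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Both vectors have five entries, one per vocabulary word, and "the" gets the largest count in the second document simply because it is repeated.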

In [1]:
# The PyPI package is named scikit-learn; sklearn is only the import name
!pip install scikit-learn
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()


In [8]:
text = ["I work in Mumbai",
        "NLP is a niche skill",
        "I will travel to London in a month"]
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = vectorizer.get_feature_names_out())
print(df)

   in  is  london  month  mumbai  niche  nlp  skill  to  travel  will  work
0   1   0       0      0       1      0    0      0   0       0     0     1
1   0   1       0      0       0      1    1      1   0       0     0     0
2   1   0       1      1       0      0    0      0   1       1     1     0


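After fitting the CountVectorizer, we can transform any new text using the fitted vocabulary. Words that are not in the vocabulary (here “love”, “but”, “prefer”, and “stay”) are ignored, and single-character tokens such as “I” are dropped by the default tokenizer.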

In [9]:
text2 = ['I love to travel to London, but I prefer to stay in mumbai']
print(vectorizer.transform(text2).toarray())

[[1 0 1 0 1 0 0 0 3 1 0 0]]


### **2. Implementation of the BOW model with n-gram:**

Take the phrase “not bad”: if we split it into “not” and “bad”, it loses its meaning, since “not bad” is similar to “good” to some extent. We don’t want to split phrases that lose their meaning when broken into single words, and this is where n-grams come into the picture. An n-gram is a sequence of n consecutive words, so with bigrams (n = 2) “not bad” is kept as a single feature.
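
As a rough illustration of what an n-gram extractor does, the sketch below builds n-grams from a list of tokens in plain Python. The helper name `ngrams` and the example sentence are assumptions made for this illustration.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "the movie was not bad".split()
print(ngrams(tokens, 2, 2))
# ['the movie', 'movie was', 'was not', 'not bad']
```

With bigrams only, "not bad" survives as one feature instead of being split into two unrelated words.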

In [10]:
text = ["I work in Mumbai",
        "NLP is a niche skill",
        "I will travel to London in a month"]
# ngram_range=(1, 1) means only unigrams, (1, 2) means unigrams and bigrams,
# and (2, 2) means only bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
count_matrix = vectorizer.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = vectorizer.get_feature_names_out())
print(df)

   in  in month  in mumbai  is  is niche  london  london in  month  mumbai  \
0   1         0          1   0         0       0          0      0       1   
1   0         0          0   1         1       0          0      0       0   
2   1         1          0   0         0       1          1      1       0   

   niche  ...  nlp is  skill  to  to london  travel  travel to  will  \
0      0  ...       0      0   0          0       0          0     0   
1      1  ...       1      1   0          0       0          0     0   
2      0  ...       0      0   1          1       1          1     1   

   will travel  work  work in  
0            0     1        1  
1            0     0        0  
2            1     0        0  

[3 rows x 22 columns]


In [11]:
text2 = ['I love to travel to London, but I prefer to stay in mumbai']
print(vectorizer.transform(text2).toarray())

[[1 0 1 0 0 1 0 0 1 0 0 0 0 0 3 1 1 1 0 0 0 0]]


The BOW model has a drawback that often hurts results. If a particular word appears in every document, and several times in each, it ends up with a high frequency of occurrence and therefore a large weight in every vector, even though it does little to tell one document from another. That’s not good for our analysis.

### **3. TF-IDF (Term Frequency-Inverse Document Frequency)**

The idea of TF-IDF is to reflect the importance of a word to its document or sentence by down-weighting words that occur frequently across the whole collection of documents.

**Term frequency (TF):** the number of times a term appears in a document, divided here by the total number of terms in that document.

The term frequency measures how common a word is within a given sentence.

**Inverse Document Frequency (IDF):**

The inverse document frequency (IDF) measures how rare a word is across the documents of the corpus. Words like “the” and “a” show up in almost all the documents, but rare words occur in only a few.

If a word appears in almost every document, it is not significant for classification.

IDF of a word = log(N/n)

N: total number of documents.
n: number of documents containing the term (word).

TF-IDF is the product TF × IDF. It evaluates how relevant a word is to its sentence within a collection of sentences or documents.
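
The formula can be checked with a few lines of arithmetic. This sketch assumes the natural logarithm and a hypothetical corpus of N = 3 documents; the function name `idf` is only for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Plain IDF = log(N / n), using the natural logarithm."""
    return math.log(total_docs / docs_with_term)

N = 3  # assumed corpus size for this example

print(round(idf(N, 3), 4))  # term in every document -> 0.0
print(round(idf(N, 1), 4))  # term in a single document -> 1.0986

# A word that occurs once in a 4-word document and in no other document
tf = 1 / 4
print(round(tf * idf(N, 1), 4))  # 0.2747
```

A word that appears in every document gets an IDF of zero, so its TF-IDF weight vanishes no matter how often it is repeated.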

**Manual creation of Tf-idf model from scratch**

In [20]:
import nltk
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
text = ["I work in Mumbai",
        "NLP is a niche skill",
        "I will travel to London in a month"]
#Preprocessing the text data
sentences = []
word_set = []
 
for sent in text:
    x = [i.lower() for  i in word_tokenize(sent) if i.isalpha()]
    sentences.append(x)
    for word in x:
        if word not in word_set:
            word_set.append(word)
 
#Set of vocab 
word_set = set(word_set)
#Total documents in our corpus
total_documents = len(sentences)
 
#Creating an index for each word in our vocab.
index_dict = {} #Dictionary to store index for each word
i = 0
for word in word_set:
    index_dict[word] = i
    i += 1

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
#Create a count dictionary: for each word, the number of sentences that contain it (document frequency)
def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count
word_count = count_dict(sentences)

In [15]:
#Term Frequency: occurrences of the word divided by the number of tokens in the document
def termfreq(doc, word):
    N = len(doc)
    occurrence = len([token for token in doc if token == word])
    return occurrence/N

In [16]:
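import numpy as np

#Inverse Document Frequency with +1 in the denominator: log(N / (n + 1)).
#The result is zero or negative for words that occur in almost every document.
def inverse_doc_freq(word):
    try:
        word_occurrence = word_count[word] + 1
    except KeyError:
        word_occurrence = 1
    return np.log(total_documents/word_occurrence)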

In [18]:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
         
        value = tf*idf
        tf_idf_vec[index_dict[word]] = value 
    return tf_idf_vec

In [19]:
#TF-IDF Encoded text corpus
import numpy as np
import pprint
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)
 
pprint.pprint(vectors)

[array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        , -0.07192052, -0.07192052,
        0.        ,  0.        ,  0.10136628]),
 array([ 0.08109302,  0.        ,  0.        ,  0.08109302,  0.        ,
        0.        ,  0.08109302,  0.        , -0.05753641, -0.05753641,
        0.        ,  0.        ,  0.        ]),
 array([ 0.        ,  0.04505168,  0.        ,  0.        ,  0.04505168,
        0.04505168,  0.        ,  0.04505168, -0.03196467, -0.03196467,
        0.04505168,  0.04505168,  0.        ])]


**Tf-idf model using sklearn**

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
text = ["I work in Mumbai",
        "NLP is a niche skill to work on",
        "I will travel to London for work in a month",
        "I came to work here on nlp"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text)
count_array = matrix.toarray()
df = pd.DataFrame(data=count_array,columns = vectorizer.get_feature_names_out())
print(df)

       came       for      here        in        is    london     month  \
0  0.000000  0.000000  0.000000  0.572892  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.450701  0.000000  0.000000   
2  0.000000  0.398368  0.000000  0.314078  0.000000  0.398368  0.398368   
3  0.504889  0.000000  0.504889  0.000000  0.000000  0.000000  0.000000   

     mumbai     niche       nlp        on     skill        to    travel  \
0  0.726641  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.450701  0.355338  0.355338  0.450701  0.287677  0.000000   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.254273  0.398368   
3  0.000000  0.000000  0.398060  0.398060  0.000000  0.322264  0.000000   

       will      work  
0  0.000000  0.379192  
1  0.000000  0.235195  
2  0.398368  0.207885  
3  0.000000  0.263472  
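
These values are on a different scale from the manual version because, according to the scikit-learn documentation, `TfidfVectorizer` with default settings uses raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and then L2-normalizes each row. The sketch below recomputes the first row of the table in plain Python under those assumptions; the helper names `smoothed_idf` and `tfidf_row` are made up for this illustration.

```python
import math
import re

docs = ["I work in Mumbai",
        "NLP is a niche skill to work on",
        "I will travel to London for work in a month",
        "I came to work here on nlp"]

# Lowercase and keep tokens of two or more word characters,
# which mirrors the default tokenization of TfidfVectorizer.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    # Document frequency: number of documents that contain the term
    df = sum(term in toks for toks in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2 normalization of the row
    raw = {t: tokens.count(t) * smoothed_idf(t) for t in set(tokens)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row0 = tfidf_row(tokenized[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

The printed weights for "in", "mumbai", and "work" should come out close to the first row of the table (0.572892, 0.726641, 0.379192). "work" has the lowest weight because it occurs in all four documents.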


### **Word2Vec**

Word2Vec is a family of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec uses two architectures, CBOW and Skip Gram, both described below.

Learning links:
- https://www.youtube.com/watch?v=UqRCEmrv1gQ
- https://www.youtube.com/watch?v=Otde6VGvhWM


In [30]:
!pip install nltk
!pip install gensim


#### **1. CBOW (Continuous Bag of Words):**
The CBOW model predicts the current word from the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer has as many units as the number of dimensions in which we want to represent the current word present at the output layer.
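
To show what "context words within a window" means in practice, here is a small sketch that builds CBOW training examples from one sentence. It only prepares (context, target) pairs and does not train a network; the function name `cbow_pairs` and the window size of 2 are assumptions for this illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form the context
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "alice was beginning to get very tired".split()
for context, target in cbow_pairs(tokens, window=2)[:3]:
    print(context, "->", target)
# ['was', 'beginning'] -> alice
# ['alice', 'beginning', 'to'] -> was
# ['alice', 'was', 'to', 'get'] -> beginning
```

Each pair feeds the context words to the input layer and asks the network to predict the target word at the output layer.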

In [31]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec

# Read the training text; the with-block closes the file afterwards
with open("word2vec_training_data_aliceinwonderland.txt") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
    # tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (sg=0, the default, selects CBOW)
model1 = Word2Vec(data, min_count=1, window=5, sg=0)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))



Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.98836577
Cosine similarity between 'alice' and 'machines' - CBOW :  0.9476097


#### **2. Skip Gram:**
Skip gram predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer has as many units as the number of dimensions in which we want to represent the current word present at the input layer.
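
The training examples for skip gram can be sketched the same way: every (current word, context word) combination inside the window becomes one pair. This is only an illustration of the data preparation, not of the network itself; the function name `skipgram_pairs` and the window size of 1 are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "alice was beginning to get".split()
print(skipgram_pairs(tokens, window=1))
# [('alice', 'was'), ('was', 'alice'), ('was', 'beginning'),
#  ('beginning', 'was'), ('beginning', 'to'), ('to', 'beginning'),
#  ('to', 'get'), ('get', 'to')]
```

Here the current word is the input and each neighbouring word is a separate prediction target, which is the reverse of the CBOW setup.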

In [38]:
# Create Skip Gram model (sg=1 selects skip gram)
model2 = gensim.models.Word2Vec(data, min_count=1, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))



Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.6384189
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.8152881
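
The scores printed above are cosine similarities between word vectors: the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python follows, using made-up 2-dimensional vectors in place of real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not real word embeddings
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Values near 1 mean the two vectors point in almost the same direction, and values near 0 mean they are unrelated.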
