# NLP techniques 

* Bag of Words 
* Bigram, trigram, n-gram, skip-gram
* TF-IDF (1972)
* Word2Vec

# Bag of Words


```python
# Define the input sentences
sentences = ["I love coding", "Coding is fun", "Python is awesome"]

# Initialize an empty dictionary to store the word counts
word_counts = {}

# Iterate over each sentence
for sentence in sentences:
    # Split the sentence into words
    words = sentence.split()
    
    # Iterate over each word
    for word in words:
        # Check if the word is already in the dictionary
        if word in word_counts:
            # If yes, increment the count by 1
            word_counts[word] += 1
        else:
            # If no, add the word to the dictionary with count 1
            word_counts[word] = 1

# Print the word counts
for word, count in word_counts.items():
    print(f"{word}: {count}")
```
This code implements the bag of words technique by counting the occurrences of each word in the input sentences. It splits each sentence into words, and then iterates over each word to update the word counts in a dictionary. Finally, it prints the word counts.
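
The counting loop above can also be written more compactly. The following is a minimal sketch, not the notebook's own method: it assumes lowercasing and a simple `\w+` regular expression are acceptable tokenization choices, and the `tokenize` helper is a hypothetical name introduced only for illustration. The trailing period in the second sentence is there to show how punctuation is handled.

```python
import re
from collections import Counter

sentences = ["I love coding", "Coding is fun.", "Python is awesome and fun"]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of word characters,
    # which drops punctuation such as the trailing "."
    return re.findall(r"\w+", text.lower())

# Counter handles the "is the word already in the dictionary?" check for us
word_counts = Counter()
for sentence in sentences:
    word_counts.update(tokenize(sentence))

for word, count in word_counts.items():
    print(f"{word}: {count}")
```

With this tokenization, "Coding" and "coding" share one key, and "fun." and "fun" are counted together.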



In [3]:
# Define the input sentences
sentences = ["I love coding", "Coding is fun.", "Python is awesome and fun"]

# Initialize an empty dictionary to store the word counts
word_counts = {}

# Iterate over each sentence
for sentence in sentences:
    # Split the sentence into words
    words = sentence.split()
    
    # Iterate over each word
    for word in words:
        # Check if the word is already in the dictionary
        if word.lower() in word_counts:
            # If yes, increment the count by 1
            word_counts[word.lower()] += 1
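        else:
            # If no, add the word to the dictionary with count 1.
            # Note: the key stored here is `word`, not `word.lower()`, so "I" and
            # "Python" keep their capitals in the output below. Use `word.lower()`
            # here as well for fully case-insensitive counts.
            word_counts[word] = 1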

# Print the word counts
for word, count in word_counts.items():
    print(f"{word}: {count}")

I: 1
love: 1
coding: 2
is: 2
fun.: 1
Python: 1
awesome: 1
and: 1
fun: 1


In [8]:
# create a dataframe from wordcounts
import pandas as pd
df = pd.DataFrame(list(word_counts.items()), columns = ['Word', 'Count'])
# print(df)
# add an 'unk' (unknown word) row with count 1
df.loc[len(df.index)] = ['unk', 1]
print(df)

      Word  Count
0        I      1
1     love      1
2   coding      2
3       is      2
4     fun.      1
5   Python      1
6  awesome      1
7      and      1
8      fun      1
9      unk      1


In [16]:
# build a per-sentence count matrix:
# - use the words as columns
# - add one row per sentence
# - fill each row with that sentence's word counts
# - fill the missing values with 0
# - print the dataframe

# create a new dataframe
df1 = pd.DataFrame(columns = list(word_counts.keys()))

# iterate over each sentence

for sentence in sentences:
    # split the sentence into words
    words = sentence.split()
    
    # create a dictionary to store the word counts for the sentence
    sentence_word_counts = {}
    
    # iterate over each word
    for word in words:
        # check if the word is already in the dictionary
        if word.lower() in sentence_word_counts:
            # if yes, increment the count by 1
            sentence_word_counts[word.lower()] += 1
        else:
            # if no, add the word to the dictionary with count 1
            sentence_word_counts[word.lower()] = 1
    
    # add the sentence to the dataframe
    df1.loc[len(df1.index)] = sentence_word_counts

df1.fillna(0, inplace=True)
df1 = df1.astype(int)
print(df1)


   I  love  coding  is  fun.  Python  awesome  and  fun
0  0     1       1   0     0       0        0    0    0
1  0     0       1   1     1       0        0    0    0
2  0     0       0   1     0       0        1    1    1
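
In the table above, the `I` and `Python` columns are 0 in every row because the column names keep their original case while the per-sentence keys are lowercased, and `fun.` and `fun` end up as separate columns. A minimal sketch of a consistent version follows, assuming lowercasing and `\w+` tokenization; it uses plain lists rather than pandas to stay self-contained, and `tokenize`, `vocabulary`, and `matrix` are illustrative names.

```python
import re
from collections import Counter

sentences = ["I love coding", "Coding is fun.", "Python is awesome and fun"]

def tokenize(text):
    # Hypothetical helper: lowercase and keep only runs of word characters
    return re.findall(r"\w+", text.lower())

# Vocabulary in first-seen order, built from the same tokens used for counting
vocabulary = list(dict.fromkeys(tok for s in sentences for tok in tokenize(s)))

# One row per sentence, one column per vocabulary word.
# Counter returns 0 for words that do not occur in the sentence.
matrix = []
for s in sentences:
    counts = Counter(tokenize(s))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The resulting `matrix` and `vocabulary` could be passed to `pd.DataFrame(matrix, columns=vocabulary)` to get the same tabular view as above.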


## CountVectorizer in scikit-learn


In [18]:
sentences

['I love coding', 'Coding is fun.', 'Python is awesome and fun']

In [26]:
# use count vectorizer to do the same
from sklearn.feature_extraction.text import CountVectorizer

# create an instance of CountVectorizer
# use a custom token_pattern so that single-character words (such as "I") are kept


vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")

# fit the vectorizer on the sentences
vectorizer.fit(sentences)

# transform the sentences to a matrix
X = vectorizer.transform(sentences)

# create a dataframe from the matrix
df2 = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
print(df2)

df2['Unk'] = 0
df2

   and  awesome  coding  fun  i  is  love  python
0    0        0       1    0  1   0     1       0
1    0        0       1    1  0   1     0       0
2    1        1       0    1  0   1     0       1


   and  awesome  coding  fun  i  is  love  python  Unk
0    0        0       1    0  1   0     1       0    0
1    0        0       1    1  0   1     0       0    0
2    1        1       0    1  0   1     0       1    0
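
The custom `token_pattern` matters because scikit-learn's documented default, `r"(?u)\b\w\w+\b"`, only matches tokens of two or more word characters. The sketch below applies both patterns with Python's `re` module to show the difference. It only illustrates the regular expressions and is not a reimplementation of `CountVectorizer`.

```python
import re

# scikit-learn's documented default: tokens of 2 or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# The pattern used in the cell above: tokens of 1 or more word characters
CUSTOM_PATTERN = r"(?u)\b\w+\b"

text = "I love coding".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
custom_tokens = re.findall(CUSTOM_PATTERN, text)

print(default_tokens)  # the single-character token "i" is dropped
print(custom_tokens)   # "i" is kept
```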


Bag of words ignores word order. Two sentences with the same words in a different order, and therefore a different meaning:
* "The cat chased the dog eagerly."
* "Eagerly, the dog chased the cat."
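
A quick check of this limitation, as a minimal sketch: assuming lowercasing and `\w+` tokenization, both sentences produce exactly the same bag of words. The `bag_of_words` helper is a hypothetical name used only here.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: lowercase, strip punctuation, count each word
    return Counter(re.findall(r"\w+", text.lower()))

a = bag_of_words("The cat chased the dog eagerly.")
b = bag_of_words("Eagerly, the dog chased the cat.")

# The two bags are identical even though the meanings differ
print(a == b)
```

This is one motivation for n-gram features, which keep some local word order.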


# Bigram, Trigram, and N-gram

Sentence: "The quick brown fox jumps over the lazy dog."

## Bigrams:
```
(The, quick)
(quick, brown)
(brown, fox)
(fox, jumps)
(jumps, over)
(over, the)
(the, lazy)
(lazy, dog)
```

## Trigrams:
```
(The, quick, brown)
(quick, brown, fox)
(brown, fox, jumps)
(fox, jumps, over)
(jumps, over, the)
(over, the, lazy)
(the, lazy, dog)
```

## 4-grams (quadgrams):
```
(The, quick, brown, fox)
(quick, brown, fox, jumps)
(brown, fox, jumps, over)
(fox, jumps, over, the)
(jumps, over, the, lazy)
(over, the, lazy, dog)
```

Sentence: "I love to eat pizza with extra cheese."

Bigrams:

* (I, love)
* (love, to)
* (to, eat)
* (eat, pizza)
* (pizza, with)
* (with, extra)
* (extra, cheese)

Trigrams:

* (I, love, to)
* (love, to, eat)
* (to, eat, pizza)
* (eat, pizza, with)
* (pizza, with, extra)
* (with, extra, cheese)

4-grams:

* (I, love, to, eat)
* (love, to, eat, pizza)
* (to, eat, pizza, with)
* (eat, pizza, with, extra)
* (pizza, with, extra, cheese)

Explanation:

* Bigrams: pairs of consecutive words in the sentence.
* Trigrams: triplets of consecutive words in the sentence.
* N-grams: sequences of n consecutive words in the sentence.

Each level (bigram, trigram, n-gram) captures more context about word order than the one before, which is useful in natural language processing tasks such as language modeling, text generation, and machine translation. A sentence with k words yields k - n + 1 n-grams, so larger n gives fewer but more specific features.
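
The n-gram lists above can be generated with a sliding window. This is a minimal sketch assuming whitespace tokenization with the final period already removed; `ngrams` is an illustrative helper name, not a library function.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    # Returns an empty list when the sentence has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love to eat pizza with extra cheese".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
fourgrams = ngrams(tokens, 4)

for gram in bigrams:
    print(gram)
```

For this 8-word sentence the window produces 7 bigrams, 6 trigrams, and 5 4-grams, matching the lists above.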


Vocabulary size = number of unique words (or unique n-grams) in the dataset.
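
As a rough illustration of vocabulary size, the sketch below counts unique unigrams and unique bigrams for the three example sentences. It assumes lowercasing and `\w+` tokenization; `tokenize` and `ngrams` are hypothetical helpers.

```python
import re

sentences = ["I love coding", "Coding is fun.", "Python is awesome and fun"]

def tokenize(text):
    # Hypothetical helper: lowercase and keep runs of word characters
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    # Sliding window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A set keeps each distinct word or n-gram exactly once
unigram_vocab = {tok for s in sentences for tok in tokenize(s)}
bigram_vocab = {" ".join(g) for s in sentences for g in ngrams(tokenize(s), 2)}

print(len(unigram_vocab), len(bigram_vocab))
```

On a dataset this small both vocabularies have 8 entries; on real corpora the n-gram vocabulary usually grows much faster than the unigram one.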



In [27]:
# Create bigram features using CountVectorizer
vectorizer = CountVectorizer(lowercase=True, ngram_range=(2,2), token_pattern=r"(?u)\b\w+\b")

# fit the vectorizer on the sentences
vectorizer.fit(sentences)

# transform the sentences to a matrix
X = vectorizer.transform(sentences)

# create a dataframe from the matrix
df3 = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
print(df3)


   and fun  awesome and  coding is  i love  is awesome  is fun  love coding  \
0        0            0          0       1           0       0            1   
1        0            0          1       0           0       1            0   
2        1            1          0       0           1       0            0   

   python is  
0          0  
1          0  
2          1  


In [28]:
# Create unigram and bigram features using CountVectorizer (ngram_range=(1,2))
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1,2), token_pattern=r"(?u)\b\w+\b")

# fit the vectorizer on the sentences
vectorizer.fit(sentences)

# transform the sentences to a matrix
X = vectorizer.transform(sentences)

# create a dataframe from the matrix
df3 = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
print(df3)


   and  and fun  awesome  awesome and  coding  coding is  fun  i  i love  is  \
0    0        0        0            0       1          0    0  1       1   0   
1    0        0        0            0       1          1    1  0       0   1   
2    1        1        1            1       0          0    1  0       0   1   

   is awesome  is fun  love  love coding  python  python is  
0           0       0     1            1       0          0  
1           0       1     0            0       0          0  
2           1       0     0            0       1          1  


In [29]:
# Create bigram and trigram features using CountVectorizer (ngram_range=(2,3))
vectorizer = CountVectorizer(lowercase=True, ngram_range=(2,3), token_pattern=r"(?u)\b\w+\b")

# fit the vectorizer on the sentences
vectorizer.fit(sentences)

# transform the sentences to a matrix
X = vectorizer.transform(sentences)

# create a dataframe from the matrix
df3 = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
print(df3)


   and fun  awesome and  awesome and fun  coding is  coding is fun  i love  \
0        0            0                0          0              0       1   
1        0            0                0          1              1       0   
2        1            1                1          0              0       0   

   i love coding  is awesome  is awesome and  is fun  love coding  python is  \
0              1           0               0       0            1          0   
1              0           0               0       1            0          0   
2              0           1               1       0            0          1   

   python is awesome  
0                  0  
1                  0  
2                  1  
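
To make the effect of `ngram_range=(min_n, max_n)` concrete, the following sketch collects every n-gram with n from `min_n` to `max_n` inclusive, in plain Python. It is an approximation under stated assumptions (lowercasing, `\w+` tokenization, space-joined feature names, alphabetical ordering), not scikit-learn's implementation, and `ngram_features` is a hypothetical name.

```python
import re

sentences = ["I love coding", "Coding is fun.", "Python is awesome and fun"]

def ngram_features(text, min_n, max_n):
    # Lowercase and tokenize, then collect all n-grams for min_n <= n <= max_n
    tokens = re.findall(r"\w+", text.lower())
    features = []
    for n in range(min_n, max_n + 1):
        features += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return features

# Unique bigram and trigram features across all sentences, sorted alphabetically
features = sorted({f for s in sentences for f in ngram_features(s, 2, 3)})

for name in features:
    print(name)
```

For these sentences the sketch yields the same 13 feature names that appear as column headers in the `ngram_range=(2,3)` output above.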
