## Word vectorization techniques
### `1. Bag of words`
### `2. n-grams`
### `3. TF-IDF`

### Bag of words: A bag of words represents a sentence, text, or document by the frequency of each word in it. The frequencies are arranged in a matrix, with one row per document and one column per vocabulary word. In NLP, text data must be converted into numeric vectors before it can be fed as input to a model.

In [37]:
## Bag of words: example 1 (single sentence)
from sklearn.feature_extraction.text import CountVectorizer

string = ["This is one of the examples one !"]
vect1 = CountVectorizer()

# Fit the vocabulary and build the count matrix in a single pass
x = vect1.fit_transform(string).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Features are:", vect1.get_feature_names_out().tolist())   # the learned vocabulary
print('The word vectors are:', x)                                # per-word frequencies

Features are: ['examples', 'is', 'of', 'one', 'the', 'this']
The word vectors are: [[1 1 1 2 1 1]]


Explanation:
- This corpus has only one sentence. It contains 7 word tokens but only 6 distinct words, so the vector has 6 entries.
- The word 'one' appears twice in the sentence, so its value is 2. Every other word appears once, so its value is 1.
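
To make the counting step concrete, here is a minimal sketch that rebuilds the same vector with only the standard library. The helper name `bag_of_words` is made up for illustration, and the regular expression only approximates `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Return (vocabulary, vector) for a single sentence."""
    # Lowercase and keep tokens of 2+ word characters; this is an
    # approximation of CountVectorizer's default tokenizer.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    counts = Counter(tokens)
    vocabulary = sorted(counts)                      # alphabetical feature order
    vector = [counts[word] for word in vocabulary]   # one count per feature
    return vocabulary, vector

vocabulary, vector = bag_of_words("This is one of the examples one !")
print("Features are:", vocabulary)
print("The word vector is:", vector)
```

The punctuation mark is dropped by the tokenizer, which is why it never shows up as a feature.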

In [42]:
## Bag of words: example 2 (multiple sentences)
text = 'words that using is and analysis on TED talks and on using words analytics and learning key. Using learning is the key here.We have not grabbed their land, their culture, their history and tried to enforce our way of life on them. Why, because we respect the freedom of others. That is why my first vision is that of freedom. Anobody has the vision of freedom.'

# Cleaning the text data
# Assumes the NLTK tokenizer, stop-word, and WordNet data have already been
# fetched with nltk.download(); exact resource names vary by NLTK version.
import nltk
import re
from nltk.corpus import stopwords            # stop-word list to filter out
from nltk.stem import WordNetLemmatizer      # reduces words to their base forms (lemmatization)

wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))     # build the set once, not once per word
sentences = nltk.sent_tokenize(text)
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace non-letters with spaces
    review = review.lower().split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
print('The corpus is:', corpus)
print()

vect2 = CountVectorizer()   
freq_matrix = vect2.fit_transform(corpus).toarray()

print(freq_matrix) 

The corpus is: ['word using analysis ted talk using word analytics learning key', 'using learning key grabbed land culture history tried enforce way life', 'respect freedom others', 'first vision freedom', 'anobody vision freedom']

[[1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 2 0 0 2]
 [0 0 0 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]]


- Vector of Sentence 1: [1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 2 0 0 2]
- Vector of Sentence 2: [0 0 0 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 0] and so on   

Explanation:
- This corpus contains 5 sentences, so the matrix has 5 rows.
- Each row holds the word frequencies of one sentence over the 22-word vocabulary.
- The highest count in the matrix is 2, and it occurs twice.
- Both occurrences are in the first sentence, for the words 'using' and 'word'.
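
As a rough cross-check of the explanation, the sketch below maps the matrix columns back to words. The matrix is copied from the output above, and the `vocabulary` list is an assumption reconstructed from the printed corpus in alphabetical order, which is how `CountVectorizer` orders its features. Both are hard-coded so the snippet runs without scikit-learn.

```python
vocabulary = ['analysis', 'analytics', 'anobody', 'culture', 'enforce', 'first',
              'freedom', 'grabbed', 'history', 'key', 'land', 'learning', 'life',
              'others', 'respect', 'talk', 'ted', 'tried', 'using', 'vision',
              'way', 'word']
freq_matrix = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 0, 2],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]

# Highest count inside a single sentence, with its (sentence index, word)
highest = max(max(row) for row in freq_matrix)
peaks = [(i, vocabulary[j])
         for i, row in enumerate(freq_matrix)
         for j, count in enumerate(row) if count == highest]
print("Highest per-sentence count:", highest, peaks)

# Corpus-wide totals: sum each column over all sentences
totals = {word: sum(row[j] for row in freq_matrix)
          for j, word in enumerate(vocabulary)}
top = [word for word, total in totals.items() if total == max(totals.values())]
print("Most frequent across the corpus:", top)
```

The per-sentence maximum and the corpus-wide total answer different questions: a word can be the most frequent overall without reaching the highest count in any one sentence.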

#### Drawbacks of using a Bag-of-Words (BoW) model: In the example above, each vector has length 22. Problems arise when we encounter new sentences:
- If the new sentences contain new words, the vocabulary grows, and the length of every vector grows with it.
- The vectors also contain many zeros, resulting in a sparse matrix (which we would like to avoid).
- No information about the grammar of the sentences or the order of the words in the text is retained.
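
The drawbacks above can be illustrated with a small standard-library sketch. The sentences and the helper names `build_vocabulary` and `vectorize` are made up for this example and are not part of the notebook.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Sorted list of the distinct lowercase words in all sentences."""
    return sorted({word for s in sentences for word in s.lower().split()})

def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

sentences = ["dog bites man", "man bites dog"]
vocab = build_vocabulary(sentences)
v1 = vectorize(sentences[0], vocab)
v2 = vectorize(sentences[1], vocab)
print(vocab, v1, v2)
print("Same vector despite different meaning:", v1 == v2)   # word order is lost

# A sentence with unseen words lengthens every vector and adds zeros
bigger_vocab = build_vocabulary(sentences + ["cat chases mouse"])
print(len(vocab), "->", len(bigger_vocab))
print(vectorize(sentences[0], bigger_vocab))
```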

### n-grams: An n-gram is a sequence of n tokens (or words). We can think of n-grams as an extension of the BoW model. In BoW, each word is treated as a single token. With n-grams, we are free to treat n consecutive words together as one token.
- n = 1: it is called a “unigram”
- n = 2: it is called a “bigram”
- n = 3: it is called a “trigram”
- The BoW model is simply the n-gram model with n = 1.
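
A sliding window is enough to sketch how the tokens are grouped. The function name `ngrams` is illustrative. Unlike `CountVectorizer`, which sorts its features alphabetically, this version keeps the n-grams in sentence order.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is the example two".split()
for n in range(1, 5):
    print(f"{n}-gram:", ngrams(tokens, n))
```

A sentence of k tokens yields k - n + 1 n-grams, so the list is empty once n exceeds the sentence length.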

In [36]:
#n-grams
from sklearn.feature_extraction.text import CountVectorizer 

string = ["This is the example two!"]

#unigram
vect1 = CountVectorizer(ngram_range=(1,1))
vect1.fit_transform(string)

#bigram
vect2 = CountVectorizer(ngram_range=(2,2))
vect2.fit_transform(string)

#trigram
vect3 = CountVectorizer(ngram_range=(3,3))
vect3.fit_transform(string)

#4gram
vect4 = CountVectorizer(ngram_range=(4,4))
vect4.fit_transform(string)


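# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("1-gram  :",vect1.get_feature_names_out().tolist())
print("2-gram  :",vect2.get_feature_names_out().tolist())
print("3-gram  :",vect3.get_feature_names_out().tolist())
print("4-gram  :",vect4.get_feature_names_out().tolist())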

1-gram  : ['example', 'is', 'the', 'this', 'two']
2-gram  : ['example two', 'is the', 'the example', 'this is']
3-gram  : ['is the example', 'the example two', 'this is the']
4-gram  : ['is the example two', 'this is the example']


In [38]:
vect4.fit_transform(string).toarray()

array([[1, 1, 1, 1]], dtype=int64)

#### Limitations of the n-gram approach to language modeling: n-gram based language models have a few drawbacks:
- A larger n usually captures more context and gives a better model, but the number of possible n-grams grows rapidly with n, so the computation and memory required increase sharply.
- n-grams are a sparse representation of language. The model is built from the observed probabilities of words co-occurring, so it assigns zero probability to any word or word sequence that is not present in the training corpus.
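
The zero-probability problem can be seen in a minimal, unsmoothed bigram model. The two-sentence corpus and the function name `bigram_probability` are assumptions made for illustration only.

```python
from collections import Counter

corpus = ["this is the example", "this is the key"]

context_counts = Counter()   # how often each word is followed by another word
bigram_counts = Counter()    # how often each (previous, word) pair occurs
for sentence in corpus:
    tokens = sentence.split()
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(previous, word):
    """Unsmoothed estimate of P(word | previous) from raw counts."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

print(bigram_probability("the", "example"))   # pair seen in the corpus
print(bigram_probability("the", "vision"))    # pair never seen
```

Smoothing techniques exist to soften this behaviour; none is applied here, so any unseen pair gets probability zero.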

### TF-IDF: 

###### 1. Term Frequency
- The frequency with which a word appears in a document, relative to the total number of words in that document. Mathematically:
- TF = (Number of times a word appears in the document) / (Total number of words in the document)

###### 2. Inverse Document Frequency
- Term frequency alone gives higher weights to more frequent words. Words like ‘a’, ‘the’, ‘in’, and ‘of’ appear far more often in documents than other words, so the more informative words wrongly receive lower weights because they are less frequent. IDF was introduced to tackle this problem: it decreases the weight of terms that appear in many documents and increases the weight of rare terms. Mathematically:
- IDF = log [(Number of documents) / (Number of documents the word appears in)]
- Note: the log here has base 2. The base is a convention. scikit-learn's `TfidfVectorizer`, used in the example that follows, uses the natural log with smoothing and then normalizes each row, so its scores differ from this formula.

##### Finally, Tf-Idf Score = TF * IDF
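
The formulas above translate directly into a few lines of Python. This is a minimal sketch of the textbook definition with a base-2 log, not of scikit-learn's variant. The three short documents are taken from the cleaned corpus used in this notebook, and the function names are illustrative.

```python
import math

documents = [
    ["respect", "freedom", "others"],
    ["first", "vision", "freedom"],
    ["anobody", "vision", "freedom"],
]

def term_frequency(word, document):
    """Occurrences of the word divided by the document length."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    """log2(N / df); assumes the word occurs in at least one document."""
    containing = sum(1 for document in documents if word in document)
    return math.log2(len(documents) / containing)

def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

for word in documents[0]:
    print(word, round(tf_idf(word, documents[0], documents), 4))
```

Because 'freedom' appears in every document, its IDF is log2(3/3) = 0 and its score vanishes, while the rarer words keep a positive weight.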


In [51]:
text = 'words that using is and analysis on words TED talks and on using words analytics and learning key. Using learning is the key here.We have not grabbed their land, their culture, their history and tried to enforce our way of life on them. Why, because we respect the freedom of others. That is why my first vision is that of freedom. Anobody has the vision of freedom.'

# Cleaning the text data
# Assumes the NLTK tokenizer, stop-word, and WordNet data have already been
# fetched with nltk.download(); exact resource names vary by NLTK version.
import nltk
import re
from nltk.corpus import stopwords            # stop-word list to filter out
from nltk.stem import WordNetLemmatizer      # reduces words to their base forms (lemmatization)

wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))     # build the set once, not once per word
sentences = nltk.sent_tokenize(text)
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace non-letters with spaces
    review = review.lower().split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
print('The corpus is:', corpus)
print()

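from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd                          # needed for the DataFrame below

tfidf = TfidfVectorizer()
x = tfidf.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names_out())
df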

The corpus is: ['word using analysis word ted talk using word analytics learning key', 'using learning key grabbed land culture history tried enforce way life', 'respect freedom others', 'first vision freedom', 'anobody vision freedom']



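| | analysis | analytics | anobody | culture | enforce | first | freedom | grabbed | history | key | ... | life | others | respect | talk | ted | tried | using | vision | way | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.243213 | 0.243213 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.196222 | ... | 0.0 | 0.0 | 0.0 | 0.243213 | 0.243213 | 0.0 | 0.392445 | 0.0 | 0.0 | 0.729638 |
| 1 | 0.0 | 0.0 | 0.0 | 0.316978 | 0.316978 | 0.0 | 0.0 | 0.316978 | 0.316978 | 0.255735 | ... | 0.316978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316978 | 0.255735 | 0.0 | 0.316978 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.427993 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.63907 | 0.63907 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.690159 | 0.462208 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.556816 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.690159 | 0.0 | 0.0 | 0.0 | 0.462208 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.556816 | 0.0 | 0.0 |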
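
The scores in this output do not come from the plain TF * IDF formula with a base-2 log. With its default settings, `TfidfVectorizer` uses raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below reproduces the first row by hand under those defaults, using only the standard library. The function names are illustrative and the corpus is copied from the printed output.

```python
import math
from collections import Counter

corpus = [
    "word using analysis word ted talk using word analytics learning key",
    "using learning key grabbed land culture history tried enforce way life",
    "respect freedom others",
    "first vision freedom",
    "anobody vision freedom",
]
documents = [sentence.split() for sentence in corpus]
n_documents = len(documents)

# Document frequency: in how many documents each word occurs
document_frequency = Counter(word for document in documents for word in set(document))

def smoothed_idf(word):
    """ln((1 + N) / (1 + df)) + 1, the smoothed IDF described above."""
    return math.log((1 + n_documents) / (1 + document_frequency[word])) + 1

def tfidf_row(document):
    counts = Counter(document)                        # raw counts, not divided by length
    weights = {word: count * smoothed_idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}   # L2-normalized

row0 = tfidf_row(documents[0])
for word in ("analysis", "key", "using", "word"):
    print(word, round(row0[word], 6))
```

Rounded to six decimals, these values should agree with row 0 of the table.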
