**Bag of Words**
* The scikit-learn `CountVectorizer` implements the bag-of-words model: each document becomes a vector of token counts.
* The `ngram_range` parameter sets which n-gram lengths are included when vectorizing.
* `(1,1)` means unigrams only, i.e. individual words.
* `(1,2)` means unigrams and bigrams, `(1,3)` adds trigrams, and so on.
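
To make the ranges concrete, here is a minimal pure-Python sketch of how word n-grams can be enumerated for a given range. The helper `word_ngrams` is hypothetical and written only for illustration; it assumes plain whitespace tokenization and is not how `CountVectorizer` is implemented internally.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return every word n-gram of `text` with min_n <= n <= max_n."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


unigrams = word_ngrams("the sky is blue", (1, 1))
uni_and_bigrams = word_ngrams("the sky is blue", (1, 2))
print(unigrams)
print(uni_and_bigrams)
```

With `(1, 2)` the result contains the four single words followed by the three adjacent word pairs.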

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

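def bow_extractor(corpus, ngram_range=(1, 1)):
    # min_df=1 keeps every term that occurs in at least one document
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    # Learn the vocabulary and return the sparse document-term count matrix
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features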

In [3]:
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]

new_doc = ['loving this blue sky today']

In [3]:
bow_vectorizer,bow_features=bow_extractor(CORPUS)
features=bow_features.todense()
features

matrix([[0, 0, 1, 0, 1, 0, 1, 0, 1],
        [1, 1, 1, 0, 2, 0, 2, 0, 0],
        [0, 1, 1, 0, 1, 0, 1, 1, 1],
        [0, 0, 1, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [4]:
bow_vectorizer.transform(new_doc).todense()

matrix([[0, 0, 1, 0, 0, 0, 1, 0, 0]])
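
The columns of these matrices follow the sorted vocabulary learned from `CORPUS`. The sketch below rebuilds that mapping in plain Python so the counts can be read column by column. It assumes the vectorizer's default behaviour of lowercasing and keeping only tokens of two or more word characters (which is why "i" gets no column); `tokenize` and `count_vector` are hypothetical helper names.

```python
import re
from collections import Counter

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese',
]
new_doc = ['loving this blue sky today']


def tokenize(text):
    # Assumption: mimic the default token pattern (words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())


# "Fit": the vocabulary is every distinct corpus token, in sorted order.
vocabulary = sorted({token for doc in CORPUS for token in tokenize(doc)})


def count_vector(doc):
    # "Transform": count only the tokens that are in the fitted vocabulary.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


print(vocabulary)
print(count_vector(new_doc[0]))
```

The words "loving", "this" and "today" never occur in `CORPUS`, so they have no column and are dropped; only the `blue` and `sky` columns are non-zero for the new document.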

In [6]:
bow_vectorizer,bow_features=bow_extractor(CORPUS,ngram_range=(1,3))
features=bow_features.todense()
features

matrix([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
         1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1],
        [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0, 2,
         2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
         1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
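
The matrix above has 32 columns because the vocabulary now holds unigrams, bigrams and trigrams. As a rough check, the sketch below enumerates that vocabulary in plain Python. It assumes the same default tokenization as before (lowercase, tokens of two or more word characters); `ngram_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
import re

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese',
]


def tokenize(text):
    # Assumption: mimic the default token pattern (words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())


def ngram_vocabulary(corpus, ngram_range):
    """Collect the distinct word n-grams of a corpus, sorted alphabetically."""
    min_n, max_n = ngram_range
    vocab = set()
    for doc in corpus:
        tokens = tokenize(doc)
        for n in range(min_n, max_n + 1):
            for start in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[start:start + n]))
    return sorted(vocab)


vocab = ngram_vocabulary(CORPUS, (1, 3))
# Number of vocabulary entries per n-gram length.
sizes = {n: sum(1 for term in vocab if len(term.split()) == n) for n in (1, 2, 3)}
print(len(vocab), sizes)
```

For this corpus the sketch finds 9 unigrams, 12 bigrams and 11 trigrams, 32 terms in total, which matches the width of the matrix.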

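**TF-IDF**
* The scikit-learn `TfidfVectorizer` weights each term count by its inverse document frequency, so terms that occur in many documents count for less.
* `norm="l2"` scales each document vector to unit length, and `smooth_idf=True` adds one to the document frequencies to avoid division by zero.

In [1]: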
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidfextractor(corpus, ngram_range=(1, 1)):
    # L2-normalized TF-IDF vectors with smoothed IDF weights
    vectorizer = TfidfVectorizer(min_df=1, norm="l2", smooth_idf=True,
                                 use_idf=True, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [4]:
tfidfvectorizer,features=tfidfextractor(CORPUS)
print(features.todense())
tfidfvectorizer.transform(new_doc).todense()

[[0.         0.         0.39921021 0.         0.48829139 0.
  0.48829139 0.         0.60313701]
 [0.44051607 0.34730793 0.22987956 0.         0.5623514  0.
  0.5623514  0.         0.        ]
 [0.         0.43202578 0.28595344 0.         0.3497621  0.
  0.3497621  0.54796992 0.43202578]
 [0.         0.         0.34618161 0.66338461 0.         0.66338461
  0.         0.         0.        ]]


matrix([[0.        , 0.        , 0.63295194, 0.        , 0.        ,
         0.        , 0.77419109, 0.        , 0.        ]])
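
The weights above can be reproduced by hand. The sketch below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents for `smooth_idf=True`, multiplies it by the raw term count, and then L2-normalizes each document vector. The helpers `tokenize`, `fit_idf` and `tfidf_vector` are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese',
]
new_doc = ['loving this blue sky today']


def tokenize(text):
    # Assumption: mimic the default token pattern (words of 2+ characters).
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(corpus):
    """Return the sorted vocabulary and the smoothed IDF weight of each term."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    vocabulary = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    return vocabulary, idf


def tfidf_vector(doc, vocabulary, idf):
    """Term count times IDF, scaled to unit L2 length."""
    counts = Counter(tokenize(doc))
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


vocabulary, idf = fit_idf(CORPUS)
first_row = tfidf_vector(CORPUS[0], vocabulary, idf)
new_row = tfidf_vector(new_doc[0], vocabulary, idf)
print([round(w, 8) for w in first_row])
print([round(w, 8) for w in new_row])
```

In the first document, "blue" appears in all four documents and so gets the smallest weight, while "the" appears in only two and gets the largest.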

**Word2Vec**
* Word2Vec is a model released by Google, built on the continuous bag-of-words (CBOW) and skip-gram architectures.
* It embeds words in an n-dimensional space in which words with similar meanings lie close to each other.
* Below is the gensim implementation of Word2Vec. Its parameters are as follows:
* `size` (named `vector_size` in gensim 4.0 and later): the dimensionality of the word vectors, i.e. the number of features in the feature space.
* `window`: the size of the context window, i.e. how many words on either side of the current word are considered.
* `min_count`: the minimum number of occurrences across the corpus for a word to be kept; rarer words are ignored.
* `sample`: the threshold for downsampling the most frequent words.
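
The effect of the window can be pictured with a small sketch that lists the (target, context) word pairs a skip-gram style model would see. This is a simplified illustration under stated assumptions: `skipgram_pairs` is a hypothetical helper, and real implementations add further details (such as subsampling and negative sampling) that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """List (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "sky", "is", "blue"], window=1)
print(pairs)
```

A larger window pairs each word with more distant neighbours, so the training signal reflects broader context.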

In [2]:
import gensim
import nltk

In [4]:
# word_tokenize needs NLTK's Punkt tokenizer data, which is installed via nltk.download()
TOKENIZED_CORPUS=[nltk.word_tokenize(sentence) for sentence in CORPUS]
tokenized_new_doc=[nltk.word_tokenize(sentence) for sentence in new_doc]

In [6]:
# gensim 4.0 renamed `size` to `vector_size`; on gensim 3.x pass size=10 instead
model=gensim.models.Word2Vec(TOKENIZED_CORPUS,vector_size=10,window=10,min_count=2,sample=1e-3)

In [9]:
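# Use the model to look up the feature vector of each individual word.
# The vectors live on model.wv; indexing the model directly was deprecated in gensim 3.x and removed in 4.0.
print(model.wv["sky"])
model.wv["blue"]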

[ 0.00893308  0.02171977  0.04001991  0.02354194 -0.01633224  0.02735614
 -0.00345451  0.00202995  0.03671695 -0.04223331]

array([ 9.0188542e-03,  2.6809189e-02,  3.8394030e-02, -2.6277043e-02,
        3.2703634e-02,  2.0800639e-02,  3.6418807e-02, -3.9342128e-02,
        2.8878924e-02, -6.8178852e-05], dtype=float32)
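
"Closer" in the embedding space is commonly measured with cosine similarity. The sketch below applies it to the two vectors printed above, copied by hand. `cosine_similarity` is a hypothetical helper written for illustration, and with a four-sentence training corpus the resulting value should not be expected to be meaningful.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors copied from the output above.
sky = [0.00893308, 0.02171977, 0.04001991, 0.02354194, -0.01633224,
       0.02735614, -0.00345451, 0.00202995, 0.03671695, -0.04223331]
blue = [9.0188542e-03, 2.6809189e-02, 3.8394030e-02, -2.6277043e-02,
        3.2703634e-02, 2.0800639e-02, 3.6418807e-02, -3.9342128e-02,
        2.8878924e-02, -6.8178852e-05]

similarity = cosine_similarity(sky, blue)
print(similarity)
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.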