# Demonstrate CountVectorizer

__TfidfVectorizer__ and __CountVectorizer__ are both methods for converting text data into numerical vectors, since a model can process only numerical data.

__CountVectorizer__ only counts the number of times each word appears in a document, which biases the representation in favour of the most frequent words. This tends to drown out rare words that could have helped us process the data more effectively.

To overcome this, we use __TfidfVectorizer__.

__TfidfVectorizer__ also considers how a word is distributed across the whole set of documents. It weights each word count by a measure of how often the word appears across the documents, so words that occur in most documents are penalized and rarer, more distinctive words carry more weight.

__Example__<br/>
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

__CountVectorizer__
<img src=Data/CountVectorizer.png>

__TfidfVectorizer__
<img src=Data/TfidfVectorizer.png>


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
cv = CountVectorizer()

In [3]:
x_traincv = cv.fit_transform(['Hi how are you', 'how are you doing'])

In [4]:
# The two sentences above are converted to a bag-of-words vocabulary
# (get_feature_names_out() replaces the deprecated get_feature_names())
list(cv.get_feature_names_out())

['are', 'doing', 'hi', 'how', 'you']

In [5]:
# The two sentences are now represented using bag of words in vector form
x_train_arr = x_traincv.toarray()
x_train_arr

array([[1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1]], dtype=int64)
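
To make the link between columns and words explicit, each row of the array can be paired with the vocabulary. This is a small sketch using the fitted `vocabulary_` attribute, which maps each word to its column index; the names `sentences`, `matrix`, and `index_to_word` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['Hi how are you', 'how are you doing']
cv = CountVectorizer()
matrix = cv.fit_transform(sentences).toarray()

# vocabulary_ maps word -> column index; invert it to look up words by column
index_to_word = {idx: word for word, idx in cv.vocabulary_.items()}

for sentence, row in zip(sentences, matrix):
    # Label every count in the row with the word its column stands for
    labelled = {index_to_word[i]: int(count) for i, count in enumerate(row)}
    print(sentence, '->', labelled)
```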

In [6]:
# This is the first sentence in vector form
x_train_arr[0]

array([1, 0, 1, 1, 1], dtype=int64)

In [7]:
# Recent scikit-learn versions expect a 2D array here, so pass the row as a 1 x n slice
cv.inverse_transform(x_train_arr[0:1])

[array(['are', 'hi', 'how', 'you'], dtype='<U5')]

In [8]:
cv.inverse_transform(x_train_arr[1:2])

[array(['are', 'doing', 'how', 'you'], dtype='<U5')]

# Demonstrate TfidfVectorizer

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
cv = TfidfVectorizer()

In [11]:
x_traincv = cv.fit_transform(['Hi how are you', 'how are you doing'])

In [12]:
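# The vocabulary learned from the two sentences is the same as with CountVectorizer
# (get_feature_names_out() replaces the deprecated get_feature_names())
list(cv.get_feature_names_out())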

['are', 'doing', 'hi', 'how', 'you']

In [13]:
# The two sentences are now represented as TF-IDF weighted vectors
x_train_arr = x_traincv.toarray()
x_train_arr

array([[0.44832087, 0.        , 0.63009934, 0.44832087, 0.44832087],
       [0.44832087, 0.63009934, 0.        , 0.44832087, 0.44832087]])
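
The weights above can be reproduced by hand. The sketch below assumes scikit-learn's default settings, namely smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, and uses a simplified whitespace tokenizer that happens to give the same tokens for these two sentences. All names are illustrative.

```python
import math

docs = ['Hi how are you', 'how are you doing']

# Simplified tokenization (assumption): lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    # Raw term count times IDF, then scale the row to unit L2 length
    weights = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(w, 8) for w in row])
```

Words that occur in both sentences ('are', 'how', 'you') get an IDF of exactly 1, while 'hi' and 'doing' occur in one sentence each and get a larger IDF, which is why they end up with the larger weight in each row.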

In [14]:
# This is the first sentence in vector form
x_train_arr[0]

array([0.44832087, 0.        , 0.63009934, 0.44832087, 0.44832087])

In [15]:
# Recent scikit-learn versions expect a 2D array here, so pass the row as a 1 x n slice
cv.inverse_transform(x_train_arr[0:1])

[array(['are', 'hi', 'how', 'you'], dtype='<U5')]

In [16]:
cv.inverse_transform(x_train_arr[1:2])

[array(['are', 'doing', 'how', 'you'], dtype='<U5')]