Text features are about converting text into representative numerical values. One of the simplest methods is <i>word counts</i>: take each snippet of text, count the occurrences of each word, and put the results in a table.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']
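
Before reaching for a library, the word-count idea can be sketched by hand. The snippet below is only an illustration: it assumes words are separated by whitespace and lowercases everything, which is simpler than the tokenization a real vectorizer performs. The names `counts` and `table` are introduced here for the example.

```python
from collections import Counter

import pandas as pd

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Count words per snippet; whitespace splitting is a simplifying assumption
counts = [dict(Counter(text.lower().split())) for text in sample]

# One row per snippet, one column per word; words absent from a snippet become 0
table = pd.DataFrame(counts).fillna(0).astype(int)

# Sort the columns alphabetically so the layout is predictable
table = table[sorted(table.columns)]
print(table)
```

Each row corresponds to one snippet and each column to one distinct word, which is the tabular format described above.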

A quick way to build this table is scikit-learn's <b>CountVectorizer</b>.

In [3]:
vec = CountVectorizer()
X = vec.fit_transform(sample)
X

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

The result is a sparse matrix recording the number of times each word appears. Converting it to a DataFrame with labeled columns makes it easier to inspect.

In [4]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

   evil  horizon  of  problem  queen
0     1        0   1        1      0
1     1        0   0        0      1
2     0        1   0        1      0
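
To connect the sparse-matrix summary with the table, it may help to look at a few attributes of the fitted vectorizer and the matrix. This is a small sketch that refits on the same three snippets. `vocab` is a name introduced here; `vocabulary_`, `shape`, and `nnz` are existing attributes of `CountVectorizer` and SciPy sparse matrices.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

vec = CountVectorizer()
X = vec.fit_transform(sample)

# vocabulary_ maps each word to its column index; sort by index for readability
vocab = dict(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))
print(vocab)

# shape is (number of snippets, number of distinct words)
print(X.shape)

# nnz is the number of stored non-zero entries, i.e. the filled cells of the table
print(X.nnz)
```

Only the non-zero counts are stored, which is why the sparse format pays off once the vocabulary grows far beyond five words.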


Disadvantages of using CountVectorizer:<br>
- Raw word counts put a lot of emphasis on words that occur frequently, which is often misleading: conjunctions and prepositions, for example, occur most often but carry little meaning on their own.
- This can bias classification algorithms.
<p>A common way to address this is <b>TF-IDF (term frequency-inverse document frequency)</b>, which weights the word counts by a measure of how often each word occurs across all the documents, i.e. it considers the <i>overall document weightage</i> of a word. In layman's terms, it penalizes words that occur in too many documents.
<br>Note that TF-IDF is sensitive to how documents are distributed in the corpus: the inverse document frequency is computed from the collection the vectorizer is fitted on, so a skewed corpus shifts the weights.

In [5]:
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

       evil   horizon        of   problem     queen
0  0.517856  0.000000  0.680919  0.517856  0.000000
1  0.605349  0.000000  0.000000  0.000000  0.795961
2  0.000000  0.795961  0.000000  0.605349  0.000000
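
The weights above can be reproduced by hand, which shows where the penalty comes from. The sketch below assumes the default settings of `TfidfVectorizer` as I understand them: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The names `counts`, `df`, `idf`, `weighted`, `manual`, and `reference` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Term frequency: the raw word counts
counts = CountVectorizer().fit_transform(sample).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many snippets each word appears
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit L2 norm (assumed default: norm='l2')
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(sample).toarray()
print(np.allclose(manual, reference))
```

In the first snippet, 'evil' and 'problem' each appear in two of the three documents and so get a lower weight than 'of', which appears in only one. With such a tiny corpus a stop word like 'of' ends up looking distinctive, which illustrates how strongly the weights depend on the corpus.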
