Parts of code snippets are from:

a. http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/
b. http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
c. https://buhrmann.github.io/tfidf-analysis.html
d. https://www.youtube.com/watch?v=hXNbFNCgPfY
TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words (or "terms") in a document based on how often they appear in that document and in how many documents of the collection they appear.
1. If a word appears frequently in a document, it's important. Give the word a high score.
2. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

The Vector Space Model is an algebraic model that represents text as a vector, where each dimension corresponds to a feature (term) extracted from the documents. Step 1 is to build a dictionary of all terms in the documents, one dimension per term, typically ignoring common English words (stop words).

In [1]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

In [2]:
# In scikit-learn, what we have presented as the term frequency
# (raw word counts) is implemented by CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [3]:
print (vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [4]:
# Here is the vocabulary index. No stop-word list was passed (stop_words=None),
# so common words such as 'the' and 'is' are kept
vectorizer.fit_transform(train_set)
print (vectorizer.vocabulary_)

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}
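
For comparison, here is a minimal sketch of the same vocabulary step with stop words removed. It assumes scikit-learn's built-in English list (`stop_words="english"`); the names `train_docs`, `stop_vectorizer` and `vocab` are illustrative and not part of the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ("The sky is blue.", "The sun is bright.")

# stop_words="english" drops common words such as "the" and "is"
# before the vocabulary is built.
stop_vectorizer = CountVectorizer(stop_words="english")
stop_vectorizer.fit(train_docs)

# Terms are indexed in alphabetical order; cast indices to plain ints for printing.
vocab = {term: int(idx) for term, idx in stop_vectorizer.vocabulary_.items()}
print(vocab)
```

Only the four content words (blue, bright, sky, sun) remain as dimensions, so the vectors become shorter and are no longer dominated by "the".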


In [5]:
# We then use the same vectorizer to turn the test set into a SciPy
# sparse matrix (CSR format). Printing it lists only the non-zero
# entries, one per line, as
# (document index, vocabulary index)  number of occurrences
smatrix = vectorizer.transform(test_set)
print (smatrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	2
  (1, 1)	1
  (1, 4)	2
  (1, 5)	2
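
The index pairs above are easier to read once the column index is mapped back to its term. The following is a small sketch under the same setup; it rebuilds the vectorizer so that it runs on its own, and `index_to_term` and `readable` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ("The sky is blue.", "The sun is bright.")
test_docs = ("The sun in the sky is bright.",
             "We can see the shining sun, the bright sun.")

vec = CountVectorizer()
vec.fit(train_docs)

# COO format exposes parallel row, col and data arrays.
counts = vec.transform(test_docs).tocoo()

# Invert the vocabulary: column index -> term.
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

readable = sorted(
    (int(doc), index_to_term[int(col)], int(n))
    for doc, col, n in zip(counts.row, counts.col, counts.data)
)
for doc, term, n in readable:
    print(doc, term, n)
```

Words of the test set that never occurred in the training set ("in", "we", "shining", ...) have no column and are silently ignored.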


In [6]:
# We can convert this sparse matrix into a dense matrix.
# The dense matrix has shape (number of documents, number of vocabulary words).
# Every element is the number of occurrences of a word in a document.
# Each row holds the counts of (blue, bright, is, sky, sun, the)
dmatrix = smatrix.todense()
print (dmatrix)

[[0 1 1 1 1 2]
 [0 1 0 0 2 2]]


In [7]:
# tf-idf comes to the rescue for the problem seen above: raw counts
# give the highest values to very common words such as 'the'.
# tf-idf scales down terms that occur in many documents and scales up
# rare ones. A term that occurs 10 times more often than another is not
# 10 times more important, which is why idf uses a logarithmic scale.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
print ("IDF:", tfidf.idf_)

IDF: [ 2.09861229  1.          1.40546511  1.40546511  1.          1.        ]
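
The IDF values can be checked by hand. This sketch assumes scikit-learn's default `smooth_idf=True` formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. The document frequencies are read off the count matrix printed earlier.

```python
import math

n_docs = 2
# Number of test documents containing each term, in vocabulary order
# (blue, bright, is, sky, sun, the).
doc_freq = [0, 2, 1, 1, 2, 2]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
print([round(v, 8) for v in idf])
```

"blue" never occurs in the test documents, so it gets the largest weight; terms present in both documents get the minimum weight of 1.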


In [8]:
tf_idf_matrix = tfidf.transform(smatrix)
print (tf_idf_matrix.todense())

[[ 0.          0.31701073  0.44554752  0.44554752  0.31701073  0.63402146]
 [ 0.          0.33333333  0.          0.          0.66666667  0.66666667]]
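
The rows of this matrix can be reproduced without scikit-learn: multiply each count by its idf and divide the row by its Euclidean (L2) length. A minimal sketch, assuming the smoothed idf formula ln((1 + n) / (1 + df)) + 1 and the counts printed earlier; `tfidf_row` is an illustrative helper.

```python
import math

counts = [[0, 1, 1, 1, 1, 2],
          [0, 1, 0, 0, 2, 2]]
# Smoothed idf for n = 2 documents and document frequencies (0, 2, 1, 1, 2, 2).
idf = [math.log(3 / (1 + df)) + 1 for df in [0, 2, 1, 1, 2, 2]]

def tfidf_row(row):
    # Weight each count by its idf, then scale the row to unit L2 length.
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

rows = [tfidf_row(r) for r in counts]
for r in rows:
    print([round(v, 8) for v in r])
```

Because of the normalization, every row has length 1, so scores are comparable between short and long documents.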


In [9]:
# Understanding tf-idf in more detail
docA = 'the cat sat on my face'
docB = 'the dog sat on my bed'

In [10]:
# Tokenizing
bowA = docA.split(" ")
bowB = docB.split(" ")

In [11]:
print (bowA)

['the', 'cat', 'sat', 'on', 'my', 'face']


In [12]:
# Let us create a set of all words
wordSet = set(bowA).union(set(bowB))

In [13]:
print (wordSet)

{'cat', 'face', 'the', 'sat', 'bed', 'my', 'on', 'dog'}


In [15]:
# create dictionaries for each word count
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [16]:
wordDictA

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

In [17]:
# Count how often each word occurs and store the counts in the dicts
for word in bowA:
    wordDictA[word]+=1

for word in bowB:
    wordDictB[word]+=1

In [18]:
wordDictA

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [19]:
# put the dictionary into a df
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

   bed  cat  dog  face  my  on  sat  the
0    0    1    0     1   1   1    1    1
1    1    0    1     0   1   1    1    1


In [26]:
# The count matrix above is dominated by words common to both documents; this is the problem tf-idf addresses.
# tfidf(w) = tf(w) * idf(w)
# tf(w) = (number of times w appears in a document) / (total number of words in the document)
# idf(w) = log(number of documents / number of documents that contain w)
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict

In [27]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [44]:
print (tfBowA)

{'cat': 0.16666666666666666, 'face': 0.16666666666666666, 'the': 0.16666666666666666, 'sat': 0.16666666666666666, 'bed': 0.0, 'my': 0.16666666666666666, 'on': 0.16666666666666666, 'dog': 0.0}


In [30]:
import math

def computeIDF(docList):
    N = len(docList)

    # count the documents that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # idf(w) = log(N / number of documents containing w);
    # every word occurs in at least one document, so val is never 0
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

In [32]:
idfs =  computeIDF([wordDictA, wordDictB])

In [46]:
print (idfs)

{'cat': 0.6931471805599453, 'face': 0.6931471805599453, 'the': 0.0, 'sat': 0.0, 'bed': 0.6931471805599453, 'my': 0.0, 'on': 0.0, 'dog': 0.6931471805599453}


In [38]:
def computeTFIDF(tfBow, idfs):
    # Multiply each term frequency by the idf of the same word
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [40]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [41]:
# Putting this into a Matrix
pd.DataFrame([tfidfBowA, tfidfBowB])

        bed       cat       dog      face   my   on  sat  the
0  0.000000  0.115525  0.000000  0.115525  0.0  0.0  0.0  0.0
1  0.115525  0.000000  0.115525  0.000000  0.0  0.0  0.0  0.0
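
The same numbers can be obtained more compactly with `collections.Counter`. This is a sketch of an equivalent computation, not a replacement for the step-by-step cells; `tfidf_scores` is an illustrative helper, and unlike the table it only returns entries for words that occur in the document.

```python
import math
from collections import Counter

docs = ["the cat sat on my face", "the dog sat on my bed"]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)

def tfidf_scores(tokens):
    # tf = count / document length, idf = log(N / document frequency)
    counts = Counter(tokens)
    return {word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

scores = [tfidf_scores(tokens) for tokens in tokenized]
print(round(scores[0]["cat"], 6))  # 0.115525, i.e. (1/6) * ln(2)
print(scores[0]["the"])            # 0.0, because "the" occurs in every document
```

With this unsmoothed idf, any word shared by all documents scores exactly zero, which is why only cat, face, dog and bed are non-zero in the table.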


In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
# No stop-word list is passed here: the output below keeps all eight words
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), use_idf=True,
                                   smooth_idf=True, sublinear_tf=True)

In [54]:
from time import time
t0 = time()
# Note: wordSet holds single words, so each word is treated as its own
# one-term document
tfidf = tfidf_vectorizer.fit_transform(wordSet)
print("done in %0.3fs." % (time() - t0))

done in 0.011s.


In [55]:
print (tfidf)

  (0, 1)	1.0
  (1, 3)	1.0
  (2, 7)	1.0
  (3, 6)	1.0
  (4, 0)	1.0
  (5, 4)	1.0
  (6, 5)	1.0
  (7, 2)	1.0
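
Every score above is 1.0 because each "document" consists of a single word, and an L2-normalized vector with one non-zero entry always has the value 1. A sketch of what happens when the two original sentences are passed instead, with the same vectorizer settings (the variable names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on my face", "the dog sat on my bed"]

vec = TfidfVectorizer(ngram_range=(1, 2), use_idf=True,
                      smooth_idf=True, sublinear_tf=True)
X = vec.fit_transform(docs)

# 8 distinct unigrams plus 8 distinct bigrams give 16 features.
print(X.shape)

dense = X.toarray()
row_norms = np.linalg.norm(dense, axis=1)  # each row has unit L2 length

cat_score = dense[0, vec.vocabulary_["cat"]]
the_score = dense[0, vec.vocabulary_["the"]]
print(cat_score > the_score)  # "cat" is specific to the first sentence
```

Here terms that occur in only one sentence receive higher scores than terms shared by both.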



In [57]:
# stopwords lives in the nltk.corpus subpackage
from nltk.corpus import stopwords

In [13]:
# Creating a vectorization pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# NOTE: JsonFields and Squash are custom transformers that are not defined
# in this notebook; they must be defined or imported before calling this function.
def get_vec_pipe(num_comp=0, reducer='svd'):
    ''' Create text vectorization pipeline with optional dimensionality reduction. '''
    tfv = TfidfVectorizer(min_df=6, max_features=None, strip_accents='unicode', analyzer="word",
                          token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True)
    # Vectorizer
    vec_pipe = [
        ('col_extr', JsonFields(0, ['title', 'body', 'url'])),
        ('squash', Squash()),
        ('vec', tfv)
    ]

    # Reduce dimensions of tfidf
    if num_comp > 0:
        if reducer == 'svd':
            vec_pipe.append(('dim_red', TruncatedSVD(num_comp)))
        elif reducer == 'kbest':
            vec_pipe.append(('dim_red', SelectKBest(chi2, k=num_comp)))
        elif reducer == 'percentile':
            vec_pipe.append(('dim_red', SelectPercentile(f_classif, percentile=num_comp)))

        vec_pipe.append(('norm', Normalizer()))

    return Pipeline(vec_pipe)
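
To see the vectorize, reduce, normalize idea in isolation, here is a stripped-down sketch that runs without any custom transformers. It is not the pipeline above: the corpus, the two SVD components and `random_state=0` are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

corpus = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

pipe = Pipeline([
    ("vec", TfidfVectorizer(sublinear_tf=True)),                # text -> tf-idf matrix
    ("dim_red", TruncatedSVD(n_components=2, random_state=0)),  # tf-idf -> 2 dimensions
    ("norm", Normalizer()),                                     # rescale rows to unit length
])

reduced = pipe.fit_transform(corpus)
print(reduced.shape)
```

Each document ends up as a dense two-dimensional vector of unit length, which is convenient for cosine-similarity comparisons or as input to a classifier.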

In [None]:
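import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def top_tfidf_feats(row, features, top_n=10):
    # Return the top_n features with the highest tf-idf values in a row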
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=10):
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=10):
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=10):
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

def plot_tfidf_classfeats_h(dfs):
    fig = plt.figure(figsize=(12, 100), facecolor="w")
    x = np.arange(len(dfs[0]))
    for i, df in enumerate(dfs):
        #z = int(str(int(i/3)+1) + str((i%3)+1))
        ax = fig.add_subplot(9, 1, i+1)
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        ax.set_frame_on(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.set_xlabel("Mean Tf-Idf Score", labelpad=16, fontsize=16)
        ax.set_ylabel("Feature", labelpad=16, fontsize=16)
        ax.set_title("Class = " + str(df.label), fontsize=18)
        ax.ticklabel_format(axis='x', style='sci', scilimits=(-2,2))
        ax.barh(x, df.tfidf, align='center')
        ax.set_yticks(x)
        ax.set_ylim([-1, x[-1]+1])
        yticks = ax.set_yticklabels(df.feature)
        plt.subplots_adjust(bottom=0.09, right=0.97, left=0.15, top=0.95, wspace=0.52)
    plt.show()
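
A small usage sketch of the "top features in a document" idea on the two example sentences. It is self-contained rather than calling the helpers above; `top_terms` is an illustrative name, and the feature list is rebuilt from `vocabulary_` so that it does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on my face", "the dog sat on my bed"]

vec = TfidfVectorizer()
Xtr = vec.fit_transform(docs)

# Column index -> term, built from the fitted vocabulary.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

def top_terms(row_id, top_n=2):
    # Densify one row, sort its scores in descending order, keep the best top_n.
    row = np.squeeze(Xtr[row_id].toarray())
    top_ids = np.argsort(row)[::-1][:top_n]
    return [(features[i], float(row[i])) for i in top_ids]

top_doc0 = top_terms(0)
print(top_doc0)
```

For the first sentence the two highest-scoring terms are "cat" and "face", the words that do not occur in the second sentence; their scores are tied, so their relative order is not guaranteed.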