Parts of the code snippets are taken from the following sources.
These illustrate TF-IDF from the mathematics through to scikit-learn:

a. http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

b. http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

Another approach to understanding TF-IDF:

c. https://www.youtube.com/watch?v=hXNbFNCgPfY

d. https://www.youtube.com/watch?v=BJ0MnawUpaU&t=352s

Illustrates some functions that can be applied to the TF-IDF output:

e. https://buhrmann.github.io/tfidf-analysis.html

TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
1. If a word appears frequently in a document, it's important. Give the word a high score.
2. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

The Vector Space Model is an algebraic model that represents textual information as a vector whose components are features extracted from the document. Step 1 is to build a dictionary of all terms in the documents, one dimension per term, usually ignoring common English terms (stop words).

In [1]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

In [2]:
# What we have described as the term frequency is implemented in
# scikit-learn by CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [3]:
print (vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [4]:
# Build the vocabulary index from the training set (no stop words are removed)
vectorizer.fit_transform(train_set)
print (vectorizer.vocabulary_)

{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}


In [5]:
# We then use the same vectorizer to turn the test set into a sparse
# count matrix (a SciPy sparse matrix in Compressed Sparse Row format).
# Printing it lists each nonzero entry as
# (document index, vocabulary index)    number of occurrences
smatrix = vectorizer.transform(test_set)
print (smatrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	2
  (1, 1)	1
  (1, 4)	2
  (1, 5)	2


In [6]:
# We can convert this sparse matrix into a dense matrix.
# The dense matrix has shape (number of documents, number of vocabulary terms).
# Every element is a number of occurrences.
# Each row holds the counts of (blue, bright, is, sky, sun, the)
dmatrix = smatrix.todense()
print (dmatrix)

[[0 1 1 1 1 2]
 [0 1 0 0 2 2]]
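
The counting step can also be reproduced without scikit-learn. The following is a minimal sketch, assuming the default token pattern shown in the CountVectorizer repr above (words of two or more characters, lowercased). The helper names tokenize, build_vocabulary and count_vector are made up for this illustration and are not part of any library.

```python
import re

# Same pattern as the CountVectorizer default printed above:
# tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())

def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def count_vector(doc, vocabulary):
    """Count each vocabulary term in doc; tokens outside the vocabulary are dropped."""
    row = [0] * len(vocabulary)
    for tok in tokenize(doc):
        if tok in vocabulary:
            row[vocabulary[tok]] += 1
    return row

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")

vocabulary = build_vocabulary(train_set)
counts = [count_vector(doc, vocabulary) for doc in test_set]
print(vocabulary)
print(counts)
```

Words such as "in" or "shining" never appear in the training set, so they have no column and are silently ignored when the test set is transformed.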


In [7]:
# Raw counts overweight frequent terms. tf-idf addresses this by scaling
# down terms that occur in many documents and scaling up rare ones.
# A term that occurs 10 times more often than another isn't 10 times more
# important, which is why the idf weight uses a logarithmic scale.
# Note: the idf weights below are fitted on the test-set counts (smatrix).
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
print ("IDF:", tfidf.idf_)

IDF: [ 2.09861229  1.          1.40546511  1.40546511  1.          1.        ]
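
The IDF values printed above can be checked by hand. This sketch assumes the smoothed formula that TfidfTransformer applies with its default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The counts are copied from the dense matrix shown earlier.

```python
import math

# Term counts for the two test documents, columns ordered as
# (blue, bright, is, sky, sun, the).
counts = [[0, 1, 1, 1, 1, 2],
          [0, 1, 0, 0, 2, 2]]

n_docs = len(counts)
# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0)
            for j in range(len(counts[0]))]
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

print(doc_freq)
print([round(v, 8) for v in idf])
# → [2.09861229, 1.0, 1.40546511, 1.40546511, 1.0, 1.0]
```

The term "blue" occurs in neither test document, so it gets the largest weight. Terms present in both documents get the minimum weight of 1.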


In [8]:
tf_idf_matrix = tfidf.transform(smatrix)
print (tf_idf_matrix.todense())

[[ 0.          0.31701073  0.44554752  0.44554752  0.31701073  0.63402146]
 [ 0.          0.33333333  0.          0.          0.66666667  0.66666667]]
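
Each row of the matrix above is the count vector multiplied element-wise by the IDF weights and then scaled to unit Euclidean length (norm="l2"). The sketch below redoes that arithmetic in plain Python, assuming the smoothed IDF formula ln((1 + n) / (1 + df)) + 1. The helper l2_normalize is a name chosen for this illustration.

```python
import math

# Counts per test document, columns: (blue, bright, is, sky, sun, the).
counts = [[0, 1, 1, 1, 1, 2],
          [0, 1, 0, 0, 2, 2]]
doc_freq = [0, 2, 1, 1, 2, 2]
n_docs = len(counts)
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero rows stay zero)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]

tf_idf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]
for row in tf_idf:
    print([round(v, 8) for v in row])
```

The printed rows should agree with the tf_idf_matrix output above up to rounding.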


In [10]:
# Understanding tf-idf in more detail
docA = 'the cat sat on my face'
docB = 'the dog sat on my bed'

In [11]:
# Tokenizing
bowA = docA.split(" ")
bowB = docB.split(" ")

In [12]:
print (bowA)

['the', 'cat', 'sat', 'on', 'my', 'face']


In [13]:
# Let us create a set of all words
wordSet = set(bowA).union(set(bowB))

In [14]:
print (wordSet)

{'the', 'sat', 'my', 'face', 'bed', 'cat', 'on', 'dog'}


In [15]:
# create dictionaries for each word count
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [16]:
wordDictA

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

In [17]:
# Count the occurrences of each word and store them in the dict
for word in bowA:
    wordDictA[word]+=1

for word in bowB:
    wordDictB[word]+=1

In [18]:
wordDictA

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [19]:
# put the dictionary into a df
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

Unnamed: 0,bed,cat,dog,face,my,on,sat,the
0,0,1,0,1,1,1,1,1
1,1,0,1,0,1,1,1,1


In [20]:
# The count matrix above is dominated by words common to both documents; this is the problem tf-idf addresses
# tfidf(w) = tf(w) * idf(w)
# tf(w) = (number of times w appears in a document) / (total number of words in the document)
# idf(w) = log(number of documents / number of documents that contain w)
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict

In [21]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [22]:
print (tfBowA)

{'the': 0.16666666666666666, 'sat': 0.16666666666666666, 'my': 0.16666666666666666, 'face': 0.16666666666666666, 'bed': 0.0, 'cat': 0.16666666666666666, 'on': 0.16666666666666666, 'dog': 0.0}


In [23]:
def computeIDF(docList):
    import math
    idfDict = {}
    N = len(docList)
    
    # count the documents that contain each word w
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # divide N by denominator above and take a log
    for word, val in idfDict.items():
        idfDict[word] = math.log(N/ float(val))
    return idfDict

In [24]:
idfs =  computeIDF([wordDictA, wordDictB])

In [25]:
print (idfs)

{'the': 0.0, 'sat': 0.0, 'my': 0.0, 'face': 0.6931471805599453, 'bed': 0.6931471805599453, 'cat': 0.6931471805599453, 'on': 0.0, 'dog': 0.6931471805599453}


In [26]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [27]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [28]:
# Putting this into a Matrix
pd.DataFrame([tfidfBowA, tfidfBowB])

Unnamed: 0,bed,cat,dog,face,my,on,sat,the
0,0.0,0.115525,0.0,0.115525,0.0,0.0,0.0,0.0
1,0.115525,0.0,0.115525,0.0,0.0,0.0,0.0,0.0
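
The hand-written computeIDF uses the plain formula log(N / df), which gives words shared by every document a weight of exactly zero. scikit-learn's default is a smoothed variant, so its scores will not match the table above number for number. A small sketch of the difference, with plain_idf and smoothed_idf as illustrative names:

```python
import math

def plain_idf(n_docs, doc_freq):
    """idf as used by computeIDF above: log(N / df)."""
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf (scikit-learn default): ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2
for label, df in (("word in both documents", 2), ("word in one document", 1)):
    print(label, round(plain_idf(n_docs, df), 4), round(smoothed_idf(n_docs, df), 4))
```

With the smoothed formula a shared word keeps a weight of 1 instead of 0, so it still contributes to the vector unless it is removed as a stop word.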


In [44]:
# Let us apply the TFIDF from sklearn and compare the results
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(analyzer='word', lowercase=True, stop_words='english')

In [33]:
print(bowA)

['the', 'cat', 'sat', 'on', 'my', 'face']


In [53]:
from time import time
t0 = time()
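# Caution: fit and fit_transform expect an iterable of documents, so each
# token in wordSet, bowA and bowB is treated as its own one-word document.
# Every fit_transform call also refits the vocabulary, so after this cell
# the vectorizer only knows the non-stop-word tokens of bowB.
tfidf_vectorizer.fit(wordSet)
tfidf_A = tfidf_vectorizer.fit_transform(bowA)
tfidf_B = tfidf_vectorizer.fit_transform(bowB)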
print("done in %0.3fs." % (time() - t0))

done in 0.006s.


In [54]:
print (tfidf_A)

  (1, 0)	1.0
  (2, 2)	1.0
  (5, 1)	1.0


In [55]:
fin_tfidfA = tfidf_A.todense()
fin_tfidfB = tfidf_B.todense()

In [56]:
print (fin_tfidfA)

[[ 0.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  1.  0.]]


In [57]:
fin_tfidfA.shape

(6, 3)

In [58]:
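# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
# The names reflect the last fit, which was on bowB.
feature_names = tfidf_vectorizer.get_feature_names_out().tolist()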
print (feature_names)

['bed', 'dog', 'sat']
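
The cells above pass lists of tokens to the vectorizer, so every token is treated as a separate document. That is why the result has six rows and only 0/1 entries. For a like-for-like comparison with the hand-computed table, the two sentences can be passed as two documents. This is a sketch of that usage; doc_vectorizer and doc_tfidf are names introduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docA = 'the cat sat on my face'
docB = 'the dog sat on my bed'

# One fit over both documents, so they share a vocabulary and idf weights.
doc_vectorizer = TfidfVectorizer(analyzer='word', lowercase=True,
                                 stop_words='english')
doc_tfidf = doc_vectorizer.fit_transform([docA, docB])

# Order the terms by their column index.
terms = sorted(doc_vectorizer.vocabulary_, key=doc_vectorizer.vocabulary_.get)
print(terms)
print(doc_tfidf.shape)
print(doc_tfidf.toarray().round(3))
```

The result has one row per document and one column per remaining term. The scores still differ from the hand-computed table because scikit-learn uses raw counts for tf, a smoothed idf, and L2 normalization by default.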


In [65]:
# Using TFIDF to perform topic extraction
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups

In [66]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
n_samples = 2000  # number of documents to keep (matches the count printed below)
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 2.721s.


In [73]:
print (len(data_samples))

2000


In [74]:
stopset = set(stopwords.words('english'))
print (stopset)

{'off', 'here', 'hasn', 'over', 'against', 'their', 'but', 'above', 'needn', 'doesn', 'this', 'down', 'i', 'how', 'ourselves', 'too', 'with', 'no', 'any', 'hers', 'just', 'own', 'himself', 'did', 'couldn', 'won', 'yourselves', 'nor', 'herself', 'so', 'to', 'during', 'up', 'than', 'mustn', 'each', 'where', 'all', 'into', 'only', 's', 't', 'having', 'if', 'what', 'from', 'before', 'after', 'are', 'ours', 'through', 'me', 'between', 'can', 'is', 'more', 'an', 'it', 'as', 'o', 'very', 'was', 'had', 'its', 'mightn', 'should', 'out', 've', 'which', 'ain', 'whom', 'both', 'then', 'd', 'didn', 'you', 'were', 'same', 'itself', 'a', 'haven', 'most', 'the', 'of', 'weren', 'yourself', 'below', 'further', 'his', 'we', 'there', 'll', 'such', 'on', 'these', 'at', 'isn', 'when', 'other', 'myself', 'am', 'until', 'now', 'not', 'by', 'that', 'or', 'under', 'my', 'does', 'yours', 'for', 'who', 'be', 'him', 're', 'they', 'few', 'wouldn', 'been', 'shan', 'why', 'will', 'being', 'in', 'she', 'them', 'again'

In [75]:
# Inspect the first document before vectorizing
data_samples[0]

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [83]:
# Recent scikit-learn releases expect stop_words as a list rather than a set
vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=True, ngram_range=(1,3), max_features=5000, analyzer='word', 
                             lowercase=True, norm='l2')
X = vectorizer.fit_transform(data_samples)

In [84]:
X[0]

<1x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 46 stored elements in Compressed Sparse Row format>

In [88]:
print (X[0])

  (0, 4844)	0.0711724269128
  (0, 4370)	0.0849337242543
  (0, 4294)	0.119602186967
  (0, 3962)	0.196050256492
  (0, 695)	0.154891800464
  (0, 1413)	0.136647103602
  (0, 4260)	0.117365134839
  (0, 2789)	0.48825320374
  (0, 3491)	0.134410051474
  (0, 2369)	0.283686649061
  (0, 4916)	0.0963772935546
  (0, 2631)	0.134410051474
  (0, 1634)	0.14184332453
  (0, 3644)	0.129572824027
  (0, 3142)	0.0557308970244
  (0, 1346)	0.134410051474
  (0, 2570)	0.127876246769
  (0, 3114)	0.15748268052
  (0, 4860)	0.105344092594
  (0, 4615)	0.0905284227811
  (0, 2211)	0.135503147729
  (0, 1669)	0.15748268052
  (0, 1636)	0.167453317579
  (0, 2541)	0.0931792069269
  (0, 1319)	0.144921163404
  (0, 4500)	0.0679603960735
  (0, 2835)	0.0849337242543
  (0, 3651)	0.102851772256
  (0, 3736)	0.124085586487
  (0, 995)	0.119602186967
  (0, 4029)	0.154891800464
  (0, 1257)	0.144921163404
  (0, 3738)	0.129572824027
  (0, 340)	0.13910821757
  (0, 4141)	0.140437763884
  (0, 3662)	0.116318567674
  (0, 1994)	0.0970700327922


In [89]:
X.shape

(2000, 5000)

In [90]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

In [91]:
lsa.components_[0]

array([ 0.02491891,  0.02282795,  0.00282885, ...,  0.00723936,
        0.00481203,  0.00079901])
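
Each row of lsa.components_ holds one weight per vocabulary term, and the terms with the largest weights characterize that concept. The toy example below shows the same idea with an exact SVD from NumPy on a made-up 4 x 4 matrix. The numbers and term list are hypothetical, and TruncatedSVD approximates the leading components rather than computing the full decomposition.

```python
import numpy as np

# Hypothetical tf-idf-like matrix: rows are documents, columns are terms.
terms = ["god", "bible", "drive", "disk"]
X = np.array([
    [0.9, 0.4, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.5],
    [0.0, 0.0, 0.5, 0.4],
])

n_components = 2
# Exact SVD; the rows of Vt are the term-space directions ("concepts").
U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:n_components]  # shape: (n_components, n_terms)

concepts = []
for i, comp in enumerate(components):
    # The sign of a singular vector is arbitrary, so rank by absolute weight.
    top = np.argsort(np.abs(comp))[::-1][:2]
    concepts.append([terms[j] for j in top])
    print("Concept %d:" % i, concepts[-1])
```

The first two documents share one group of terms and the last two share another, so the two leading components separate those groups.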

In [95]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    termsInComp = zip(terms,comp)
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print ("Concept %d:" %i)
    for term in sortedTerms:
        print (term[0])
    print (" ")

Concept 0:
would
one
like
people
know
get
think
good
time
could
 
Concept 1:
god
people
think
jesus
bible
even
us
say
law
government
 
Concept 2:
geb
gordon banks
pitt edu
cadre
banks n3jxp
banks n3jxp skepticism
cadre dsl
cadre dsl pitt
chastity
chastity intellect
 
Concept 3:
god
thanks
please
bible
jesus
anyone
know
mail
advance
christian
 
Concept 4:
thanks
would
anyone
please
advance
know
game
thanks advance
mail
could
 
Concept 5:
key
chip
government
clipper
encryption
keys
use
law
clipper chip
system
 
Concept 6:
drive
god
00
chip
car
new
sale
drives
please
hard
 
Concept 7:
edu
com
mail
please
space
send
ftp
00
last
information
 
Concept 8:
game
chip
key
team
god
games
clipper
keys
play
encryption
 
Concept 9:
would
drive
edu
israel
drives
00
software
com
computer
people
 
Concept 10:
would
god
car
like
00
key
would like
edu
use
windows
 
Concept 11:
edu
com
file
drive
get
ftp
problem
try
car
think
 
Concept 12:
space
like
nasa
data
time
window
earth
computer
moon
launch
 
Conc

In [13]:
# Creating a vectorization pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: JsonFields and Squash are custom transformers that are not defined in
# this notebook and are not part of scikit-learn; they must be defined before
# this function is called.
def get_vec_pipe(num_comp=0, reducer='svd'):
    ''' Create text vectorization pipeline with optional dimensionality reduction. '''
    tfv = TfidfVectorizer(min_df=6, max_features=None, strip_accents='unicode', analyzer="word",
                          token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True)
    # Vectorizer
    vec_pipe = [
        ('col_extr', JsonFields(0, ['title', 'body', 'url'])),
        ('squash', Squash()),
        ('vec', tfv)
    ]

    # Reduce dimensions of tfidf
    if num_comp > 0:
        if reducer == 'svd':
            vec_pipe.append(('dim_red', TruncatedSVD(num_comp)))
        elif reducer == 'kbest':
            vec_pipe.append(('dim_red', SelectKBest(chi2, k=num_comp)))
        elif reducer == 'percentile':
            vec_pipe.append(('dim_red', SelectPercentile(f_classif, percentile=num_comp)))

        # Re-normalize rows after the reduction step
        vec_pipe.append(('norm', Normalizer()))

    return Pipeline(vec_pipe)

In [None]:
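import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Return the top_n features of a single tf-idf row as a DataFrame
def top_tfidf_feats(row, features, top_n=10):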
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=10):
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=10):
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=10):
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

def plot_tfidf_classfeats_h(dfs):
    fig = plt.figure(figsize=(12, 100), facecolor="w")
    x = np.arange(len(dfs[0]))
    for i, df in enumerate(dfs):
        #z = int(str(int(i/3)+1) + str((i%3)+1))
        ax = fig.add_subplot(9, 1, i+1)
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        ax.set_frame_on(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.set_xlabel("Mean Tf-Idf Score", labelpad=16, fontsize=16)
        ax.set_ylabel("Gene", labelpad=16, fontsize=16)
        ax.set_title("Class = " + str(df.label), fontsize=18)
        ax.ticklabel_format(axis='x', style='sci', scilimits=(-2,2))
        ax.barh(x, df.tfidf, align='center')
        ax.set_yticks(x)
        ax.set_ylim([-1, x[-1]+1])
        yticks = ax.set_yticklabels(df.feature)
        plt.subplots_adjust(bottom=0.09, right=0.97, left=0.15, top=0.95, wspace=0.52)
    plt.show()
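
The core of top_mean_feats and top_feats_by_class is to zero out small scores, average the rows of a class, and rank the features by that mean. The sketch below shows that step on a small dense array. The data, the labels and the helper top_mean_terms are made up for illustration, and it is a simplified stand-in rather than a replacement for the functions above.

```python
import numpy as np

features = ["bed", "cat", "dog", "face", "sat"]
# Hypothetical tf-idf rows for three documents (made-up values).
Xtr = np.array([
    [0.0, 0.6, 0.0, 0.6, 0.5],
    [0.6, 0.0, 0.55, 0.0, 0.5],
    [0.0, 0.7, 0.0, 0.05, 0.7],
])
y = np.array(["A", "B", "A"])

def top_mean_terms(rows, features, min_tfidf=0.1, top_n=2):
    """Rank features by mean tf-idf after zeroing scores below min_tfidf."""
    rows = rows.copy()
    rows[rows < min_tfidf] = 0
    means = rows.mean(axis=0)
    order = np.argsort(means)[::-1][:top_n]
    return [(features[i], float(means[i])) for i in order]

top_by_class = {}
for label in np.unique(y):
    # A boolean mask selects the rows that belong to this class.
    top_by_class[str(label)] = top_mean_terms(Xtr[y == label], features)
    print(label, top_by_class[str(label)])
```

The min_tfidf threshold keeps very small scores, such as the 0.05 for "face" in the third row, from contributing to a class average.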