# TF-IDF

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical method used to quantify how important a word is within a given text, relative to a background corpus.

How does this work?

https://triton.ml/blog/tf-idf-from-scratch

## TF-IDF from scratch

First, let's read in the complete set of publication data.

In [None]:
import config
import csv

#Read columns 2 and 3 of each row and store them as a list of lower-cased tokens
pubData=[]
with open(config.demoPubmedFile, newline='') as file:
    reader = csv.reader(file,delimiter='\t')
    #Skips the header row
    next(reader, None)
    for row in reader:
        text=row[2]+' '+row[3]
        pubData.append(text.lower().split())

print(len(pubData),'publications')


#### Computing a TF Map

**TF(term) = # of times the term appears in document / total # of terms in document** 

Now that our data is usable, we’d like to start computing the TF and the IDF. Computing the tf of a word in a publication requires two numbers: the total number of words in the publication, and the number of times the word appears in it. We can store each (word, word count) pair in a dictionary; the keys of the dictionary are then just the unique terms in the publication. The following function takes in a publication and outputs a tf dictionary for that publication.

In [89]:
def computePublicationTFDict(publication):
    """ Returns a tf dictionary for one publication (a list of words)
    whose keys are all the unique words in the publication and whose
    values are their corresponding tf.
    """
    #Counts the number of times the word appears in publication
    publicationTFDict = {}
    for word in publication:
        if word in publicationTFDict:
            publicationTFDict[word] += 1
        else:
            publicationTFDict[word] = 1
    #Computes tf for each word           
    for word in publicationTFDict:
        publicationTFDict[word] = publicationTFDict[word] / len(publication)
    return publicationTFDict

In [90]:
#run for each list
tfDict={}
for d in range(0,len(pubData)):
    tfDict[d]=computePublicationTFDict(pubData[d])
print(tfDict[0])

{'sixty-five': 0.0041841004184100415, 'common': 0.0041841004184100415, 'genetic': 0.008368200836820083, 'variants': 0.0041841004184100415, 'and': 0.016736401673640166, 'prediction': 0.0041841004184100415, 'of': 0.029288702928870293, 'type': 0.008368200836820083, '2': 0.008368200836820083, 'diabetes.': 0.0041841004184100415, 'we': 0.008368200836820083, 'developed': 0.008368200836820083, 'a': 0.02092050209205021, '65': 0.0041841004184100415, 'diabetes': 0.0041841004184100415, '(t2d)': 0.0041841004184100415, 'variant-weighted': 0.0041841004184100415, 'gene': 0.016736401673640166, 'score': 0.02092050209205021, 'to': 0.029288702928870293, 'examine': 0.0041841004184100415, 'the': 0.06276150627615062, 'impact': 0.0041841004184100415, 'on': 0.0041841004184100415, 't2d': 0.016736401673640166, 'risk': 0.02092050209205021, 'assessment': 0.0041841004184100415, 'in': 0.012552301255230125, 'u.k.-based': 0.0041841004184100415, 'consortium': 0.0041841004184100415, 'prospective': 0.0041841004184100415,
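The same term-frequency step can be sketched more compactly with `collections.Counter`. This is only an illustration: `compute_tf` and `toy_doc` are hypothetical names and data, not part of the notebook above.

```python
from collections import Counter

def compute_tf(tokens):
    """Return {term: count / total number of tokens} for one tokenised document."""
    total = len(tokens)
    if total == 0:
        return {}  # avoid dividing by zero for an empty document
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document, already lower-cased and split on whitespace
toy_doc = "genetic risk score for genetic variants".split()
toy_tf = compute_tf(toy_doc)
print(toy_tf["genetic"])  # 'genetic' is 2 of the 6 tokens
```

Because every count is divided by the same document length, the tf values of a single document always sum to 1.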

#### Computing an IDF Map

**IDF(term) = log(total # of documents / # of documents with term in it)** 

Computing the idf of a word requires us to compute the total number of documents and the number of documents that contain the word. In our case we can calculate the total number of documents with len(pubData), the number of publications. For each publication, we increment the document count for each unique word. We can use the keys of the dictionaries that we calculated in the TF step to get the unique set of words. The resulting IDF dictionary’s keys will be the set of all unique words across every document.

In [91]:
def computeCountDict():
    """ Returns a dictionary whose keys are all the unique words in
    the dataset and whose values count the number of publications in
    which the word appears.
    """
    countDict = {}
    # Run through each publication's tf dictionary and increment countDict's (word, doc count) pair
    for review in tfDict:
        for word in tfDict[review]:
            if word in countDict:
                countDict[word] += 1
            else:
                countDict[word] = 1
    return countDict

#Stores the publication count dictionary
countDict = computeCountDict()
testWord='genetic'
print(testWord,countDict[testWord])

genetic 1001
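The key point in the counting step is that a word is counted once per document, however often it occurs there. A minimal sketch of that idea on a made-up corpus (`document_frequency` and `toy_docs` are hypothetical, for illustration only):

```python
from collections import Counter

def document_frequency(tokenised_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenised_docs:
        # set() collapses repeats, so each document adds at most 1 per term
        df.update(set(tokens))
    return df

# Hypothetical toy corpus of three tokenised documents
toy_docs = [
    "genetic risk score".split(),
    "genetic genetic variants".split(),
    "risk model".split(),
]
toy_df = document_frequency(toy_docs)
print(toy_df["genetic"], toy_df["risk"], toy_df["model"])
```

Note that 'genetic' appears three times in total but only in two documents, so its document count is 2.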


Finally, we can compute an idfDict, using countDict and some math, and store it.

In [92]:
import math

def computeIDFDict():
    """ Returns a dictionary whose keys are all the unique words in the
    dataset and whose values are their corresponding idf.
    """
    idfDict = {}
    for word in countDict:
        idfDict[word] = math.log(len(pubData) / countDict[word])
    return idfDict
  
#Stores the idf dictionary
idfDict = computeIDFDict()

print(idfDict["genetic"])

2.4113364566200812


In this case there are 11,160 publications, and the word genetic appears in 1,001 of them. Therefore the idf = log(11160/1001), using the natural log.
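As a quick sanity check, the arithmetic can be reproduced directly from the two counts quoted here. The variable names are just for illustration.

```python
import math

n_publications = 11160   # total number of publications, as reported above
n_with_genetic = 1001    # number of publications containing 'genetic'

# math.log is the natural logarithm, matching the idf code above
idf_genetic = math.log(n_publications / n_with_genetic)
print(round(idf_genetic, 4))

# A word that appears in every publication would get an idf of exactly 0
idf_everywhere = math.log(n_publications / n_publications)
print(idf_everywhere)
```

The second value shows why very common words end up with tf-idf scores close to zero, no matter how often they occur in a single publication.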

#### Computing the TF-IDF Map

**TF-IDF(term) = TF(term) * IDF(term)**

The last step is to compute the TF-IDF. We take each publication's existing tf dictionary and simply multiply each value by the idf of that word. Every word we look up is guaranteed to be in the idf dictionary, since its keys contain every unique word in the corpus.

In [93]:
def computeReviewTFIDFDict(pubIndex):
    """ Takes the key of a publication in tfDict and returns a dictionary
    whose keys are all the unique words in that publication and whose
    values are their corresponding tfidf.
    """
    reviewTFIDFDict = {}
    #For each word in the publication, we multiply its tf and its idf.
    for word in tfDict[pubIndex]:
        reviewTFIDFDict[word] = tfDict[pubIndex][word] * idfDict[word]
    return reviewTFIDFDict

#Stores the TF-IDF dictionaries
tfidfDict = [computeReviewTFIDFDict(review) for review in tfDict]
print(tfidfDict[0])


{'sixty-five': 0.03609600023169605, 'common': 0.01105860993593375, 'genetic': 0.020178547754142937, 'variants': 0.013644217742064873, 'and': 0.0010234509476662228, 'prediction': 0.01862994470912016, 'of': 0.0009847888075354757, 'type': 0.025659099785437893, '2': 0.024545706414833945, 'diabetes.': 0.020215292327209883, 'we': 0.006099618113897391, 'developed': 0.02719124274420763, 'a': 0.004423669849068166, '65': 0.02412026432822547, 'diabetes': 0.01613497669836691, '(t2d)': 0.029361950388875548, 'variant-weighted': 0.03899619763989665, 'gene': 0.04483529817899097, 'score': 0.09205251694784888, 'to': 0.0039019646067503305, 'examine': 0.012372649089518146, 'the': 0.003748328683648812, 'impact': 0.01162787128345355, 'on': 0.003454195544283977, 't2d': 0.11181646705138147, 'risk': 0.04028447968025657, 'assessment': 0.014681800179743102, 'in': 0.0013262666868937099, 'u.k.-based': 0.03609600023169605, 'consortium': 0.021935371514768126, 'prospective': 0.01349065553766978, 'studies,': 0.0164861

So, using the **genetic** example from before: for the first publication (index 0) we had a tf value of **'genetic': 0.008368200836820083**, and the idf is 2.4113364566200812. Multiply these together and you get 0.02 (ish), as seen above :)
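The multiplication can be checked with the numbers printed in the outputs above. The comparison with 'the' is an extra illustration: its idf is not printed anywhere, so the sketch backs it out from the tf and tf-idf values that are shown.

```python
# Values copied from the outputs above for the first publication
tf_genetic = 0.008368200836820083
idf_genetic = 2.4113364566200812
tfidf_genetic = tf_genetic * idf_genetic
print(round(tfidf_genetic, 6))

# 'the' is far more frequent in the publication, but scores much lower
tf_the = 0.06276150627615062
tfidf_the = 0.003748328683648812
implied_idf_the = tfidf_the / tf_the  # derived here, not printed by the notebook
print(round(implied_idf_the, 4))
```

So although 'the' makes up a much larger share of the publication than 'genetic', its tiny idf pushes its tf-idf well below that of 'genetic'.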

## TF-IDF using sklearn

The above has been implemented in the Python package scikit-learn (sklearn): https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

It can be achieved in just a few lines:

In [94]:
from sklearn.feature_extraction.text import TfidfVectorizer

#first, let's create some functions to help process the data

#load orcid to pmid data
def load_orcid():
    print('load_orcid')
    orcidToPubmedID={}
    with open(config.demoOrcidFile) as f:
        next(f)
        for line in f:
            orcid,pmid = line.rstrip().split('\t')
            if orcid in orcidToPubmedID:
                orcidToPubmedID[orcid].append(pmid)
            else:
                orcidToPubmedID[orcid]=[pmid]
    return orcidToPubmedID

#load the publication data
def load_pubmed():
    print('load_pubmed')
    pubmedText={}
    with open(config.demoPubmedFile, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        next(reader, None)
        for row in reader:
            text=row[2]+' '+row[3]
            pubmedText[row[0]]=text
    return pubmedText

#create dictionary of orcid to publication text
def orcid_to_pubmed():
    print('orcid_to_pubmed')
    orcidToPubmedID=load_orcid()
    pubmedText = load_pubmed()
    orcidToPubmed={}
    for orcid in orcidToPubmedID:
        oText=''
        for p in orcidToPubmedID[orcid]:
            if p in pubmedText:
                oText+=(pubmedText[p])
        orcidToPubmed[orcid]=oText
    return orcidToPubmed

print('Reading corpus')
token_dict = {}
orcidToPubmed = orcid_to_pubmed()
for orcid in orcidToPubmed:
    token_dict[orcid] = orcidToPubmed[orcid].lower()

#sklearn, including bigrams and trigrams
tfidf = TfidfVectorizer(stop_words='english',ngram_range=(1,3))

#fit_transform fits the tf-idf model and returns the document-term matrix (one row per person)
%time tfs = tfidf.fit_transform(token_dict.values())

#get similarity matrix for all people
#https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents
#.toarray() is the explicit form of the .A shorthand
matrix=(tfs * tfs.T).toarray()
#store this and the dictionary for access in other notebooks
%store matrix
%store token_dict
%store tfs
%store tfidf

print('Done')

Reading corpus
orcid_to_pubmed
load_orcid
load_pubmed
CPU times: user 14.2 s, sys: 660 ms, total: 14.9 s
Wall time: 13.2 s
Stored 'matrix' (ndarray)
Stored 'token_dict' (dict)
Stored 'tfs' (csr_matrix)
Stored 'tfidf' (TfidfVectorizer)
Done
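Why does `tfs * tfs.T` give a similarity matrix? By default `TfidfVectorizer` L2-normalises each row (`norm='l2'`), so the dot product of two rows is their cosine similarity. Here is a pure-Python sketch of that idea; the vectors and helper names are made up for illustration and are not taken from the model above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(vec):
    """Scale a vector to unit length (all-zero vectors are left as they are)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical tf-idf weights for three people over a four-term vocabulary
raw_rows = [
    [0.5, 0.0, 1.2, 0.0],
    [0.4, 0.0, 1.0, 0.3],
    [0.0, 2.0, 0.0, 0.1],
]
rows = [l2_normalise(r) for r in raw_rows]

# Equivalent of (tfs * tfs.T): every pairwise dot product of the normalised rows
similarity = [[dot(a, b) for b in rows] for a in rows]
for line in similarity:
    print([round(v, 3) for v in line])
```

The diagonal is 1 (everyone is identical to themselves), and people who share no terms at all get a similarity of 0.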


And to test a document:

In [95]:
def tfidf_doc(tfidf='',text=''):
    text=text.lower()
    #transform maps the document to a 1 x n_features document-term matrix
    response = tfidf.transform([text])

    #get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = tfidf.get_feature_names_out()
    res={}
    for col in response.nonzero()[1]:
        res[feature_names[col]]=response[0, col]
    #reverse sort the results, once, after the loop (also covers a document with no known terms)
    sorted_res = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_res

#sorted_res=tfidf_doc(tfidf=tfidf,text='genetic mendelian the')
testText=",".join(pubData[0])
#print(testText)
sorted_res=tfidf_doc(tfidf=tfidf,text=testText)
for s in sorted_res[:10]:
    print(s)

#get the genetic score
for s in sorted_res:
    #print(s[0])
    if s[0] == 'genetic':
        print(s)


('t2d', 0.2771374468347099)
('gene score', 0.18732077434507402)
('95 ci', 0.1680563752219566)
('ci', 0.16519742224518033)
('95', 0.14483197609614537)
('score', 0.14159779049586887)
('risk model', 0.12883139598887866)
('nri', 0.09366038717253701)
('framingham', 0.09366038717253701)
('27 kg', 0.09064520376269428)
('genetic', 0.03563686144053027)


#### Why do the tf-idf scores differ from the first approach?

The first example treated each publication separately. In the sklearn example above, each person's collection of publications is treated as a single record. This does mean that some publications appear more than once in the model. Combine this with sklearn's slightly different tf-idf formula (by default it smooths the idf and L2-normalises each document vector), the use of bigrams and trigrams, and sklearn's stop word removal, and this probably explains the difference.
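To get a feel for how much the formula alone matters, here is a sketch comparing the from-scratch weighting with the weighting scikit-learn documents as its default: raw counts for tf, idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row. This is a re-implementation of the formula for illustration, not sklearn's own code, and the corpus and function names are made up.

```python
import math
from collections import Counter

def doc_freq(tokenised_docs):
    df = Counter()
    for tokens in tokenised_docs:
        df.update(set(tokens))
    return df

def plain_tfidf(tokenised_docs):
    """From-scratch variant: (count / doc length) * ln(n / df)."""
    n = len(tokenised_docs)
    df = doc_freq(tokenised_docs)
    return [
        {t: (c / len(tokens)) * math.log(n / df[t])
         for t, c in Counter(tokens).items()}
        for tokens in tokenised_docs
    ]

def smoothed_l2_tfidf(tokenised_docs):
    """sklearn-style variant: count * (ln((1 + n) / (1 + df)) + 1), L2-normalised."""
    n = len(tokenised_docs)
    df = doc_freq(tokenised_docs)
    rows = []
    for tokens in tokenised_docs:
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy corpus; 'study' appears in every document
toy_docs = [d.split() for d in ("t2d risk study", "t2d gene study", "risk model study")]
plain = plain_tfidf(toy_docs)
smoothed = smoothed_l2_tfidf(toy_docs)
print(plain[0]["study"], round(smoothed[0]["study"], 3))
```

A word found in every document scores exactly 0 with the from-scratch formula but keeps a non-zero weight with the smoothed one, so the two sets of scores are not on the same scale.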

Tokenisation also differs. The top term in sklearn is **t2d** (type 2 diabetes), whereas in the manual tf-idf method this is split across three results: **(t2d)**, **t2d.** and **t2d**
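The split happens because the manual method tokenises on whitespace only, so punctuation stays attached to the word. `TfidfVectorizer` instead uses the regular expression `(?u)\b\w\w+\b` as its documented default `token_pattern`. A small sketch on a made-up sentence:

```python
import re
from collections import Counter

text = "Type 2 diabetes (T2D) risk. T2D. T2D"

# Manual method: lower-case, then split on whitespace
whitespace_tokens = text.lower().split()

# sklearn default token pattern: runs of two or more word characters
regex_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

print(Counter(whitespace_tokens))
print(Counter(regex_tokens))
```

With whitespace splitting, '(t2d)', 't2d.' and 't2d' are three separate terms with one occurrence each; with the regular expression they collapse into a single 't2d' term. The pattern also drops single-character tokens such as '2'.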



## TF-IDF on our data

We can now identify the key words in each person's publications, by creating a single document of all their texts and comparing it to the background frequencies.

For example:

In [96]:
orcidToPubmedID=load_orcid()
pubmedText = load_pubmed()
#get all publications for a specific ORCID
orcidID='0000-0001-7328-4233'
oText=''
for p in orcidToPubmedID[orcidID]:
    if p in pubmedText:
        oText+=(pubmedText[p])
res = tfidf_doc(tfidf=tfidf,text=oText)
for r in res[0:10]:
    print(r)


load_orcid
load_pubmed
('nematode', 0.1399503126881953)
('mammary', 0.1399503126881953)
('ccr5', 0.13745566504259402)
('motu', 0.11465349697937983)
('crpc', 0.10690996169979536)
('cell', 0.09380439488333107)
('cancer', 0.09324539850690329)
('mammary stem', 0.09163711002839603)
('id4', 0.09163711002839603)
('genome', 0.08541729113595346)
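The per-person document is just all of that person's publication texts stuck together. A sketch of that grouping step with `collections.defaultdict`, using made-up identifiers and texts in place of the ORCID and PubMed files (`build_person_documents` is a hypothetical helper). Unlike the cells above, it joins texts with a space so the last word of one publication cannot run into the first word of the next.

```python
from collections import defaultdict

def build_person_documents(orcid_pmid_pairs, pmid_to_text):
    """Return {orcid: all available publication texts joined into one string}."""
    grouped = defaultdict(list)
    for orcid, pmid in orcid_pmid_pairs:
        parts = grouped[orcid]  # creates the entry even if no text is found
        if pmid in pmid_to_text:
            parts.append(pmid_to_text[pmid])
    return {orcid: " ".join(parts) for orcid, parts in grouped.items()}

# Hypothetical toy data standing in for the two input files
pairs = [
    ("0000-0000-0000-0001", "111"),
    ("0000-0000-0000-0001", "222"),
    ("0000-0000-0000-0002", "333"),  # no text available for this PMID
]
texts = {"111": "Nematode genome assembly.", "222": "Mammary stem cells."}

person_docs = build_person_documents(pairs, texts)
for orcid, doc in person_docs.items():
    print(orcid, repr(doc))
```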


We can now easily do this for all ORCIDs:

In [97]:
o=open(config.tfidfFile,'w')
counter=0

orcidToPubmedID=load_orcid()
pubmedText = load_pubmed()
for orcid in orcidToPubmed:
    #don't really want to do this for everyone here, so only ORCIDs with < 100 publications, and only the first 5 of those
    if len(orcidToPubmedID[orcid])<100:
        counter+=1
        if counter<=5:
            print(counter,orcid)
            oText=''
            for p in orcidToPubmedID[orcid]:
                if p in pubmedText:
                    oText+=(pubmedText[p])
            print(len(oText))
            %time res = tfidf_doc(tfidf=tfidf,text=oText)
            for r in res[0:100]:
                o.write(orcid+'\t'+r[0]+'\t'+str(r[1])+'\n')
o.close()

load_orcid
load_pubmed
1 0000-0001-5001-3350
1959
CPU times: user 2.58 s, sys: 37.4 ms, total: 2.62 s
Wall time: 2.63 s
2 0000-0001-5008-0705
378
CPU times: user 2.45 s, sys: 39.1 ms, total: 2.49 s
Wall time: 2.49 s
3 0000-0001-5017-9473
9903
CPU times: user 3.44 s, sys: 44.8 ms, total: 3.48 s
Wall time: 3.48 s
4 0000-0001-5031-7493
3435
CPU times: user 2.6 s, sys: 38.2 ms, total: 2.63 s
Wall time: 2.64 s
5 0000-0001-5052-3182
1935
CPU times: user 2.6 s, sys: 45.7 ms, total: 2.65 s
Wall time: 2.65 s
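The export step could also be sketched with `csv.writer`, which takes care of the tabs and newlines. `write_top_terms` and the example scores are hypothetical, and an in-memory buffer stands in for the real output file so the sketch runs on its own.

```python
import csv
import io

def write_top_terms(handle, orcid, scored_terms, limit=100):
    """Write up to `limit` (orcid, term, score) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for term, score in scored_terms[:limit]:
        writer.writerow([orcid, term, score])

# Hypothetical scores in the shape returned by tfidf_doc: (term, score) pairs
example_scores = [("nematode", 0.14), ("mammary", 0.13), ("ccr5", 0.12)]

# io.StringIO stands in for an opened output file
buffer = io.StringIO()
write_top_terms(buffer, "0000-0001-7328-4233", example_scores, limit=2)
print(buffer.getvalue())
```

With a real file, opening it in a `with open(..., 'w', newline='') as handle:` block would make sure it is closed even if one of the ORCIDs raises an error part way through.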
