# TF-IDF

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

TF-IDF (term frequency-inverse document frequency) is a statistic that quantifies how important a word is to a given text, relative to a background corpus.

How does this work?

https://triton.ml/blog/tf-idf-from-scratch

## TF-IDF from scratch

In [2]:
import config
import csv

#Read the tab-separated file, closing it automatically when done
pubData=[]
with open(config.demoPubmed, newline='') as file:
    reader = csv.reader(file,delimiter='\t')
    #Skips the header row
    next(reader, None)
    for row in reader:
        text=row[2]+' '+row[3]
        #Lower-case and split on whitespace to get a list of tokens
        pubData.append(text.lower().split())
print(len(pubData),'publications')


11160 publications


#### Computing a TF Map

**TF(term) = # of times the term appears in document / total # of terms in document** 

Now that our data is usable, we can compute the TF and the IDF. Computing the TF of a word in a document requires the total number of words in the document and the number of times the word appears in it. We can store each (word, count) pair in a dictionary whose keys are the unique terms in the document. The following function, whose names refer to a "review" but which works on any tokenised document, takes one publication and returns its TF dictionary.

In [None]:
def computeReviewTFDict(review):
    """ Returns a tf dictionary for each review whose keys are all 
    the unique words in the review and whose values are their 
    corresponding tf.
    """
    #Counts the number of times the word appears in review
    reviewTFDict = {}
    for word in review:
        if word in reviewTFDict:
            reviewTFDict[word] += 1
        else:
            reviewTFDict[word] = 1
    #Computes tf for each word           
    for word in reviewTFDict:
        reviewTFDict[word] = reviewTFDict[word] / len(review)
    return reviewTFDict

In [None]:
#run for each publication (range(len(pubData)) includes the last one)
tfDict={}
for d in range(len(pubData)):
    tfDict[d]=computeReviewTFDict(pubData[d])
print(tfDict[0])
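
As a quick sanity check of the TF formula, here is a minimal sketch on a made-up six-word document. The helper name `term_frequencies` and the toy sentence are assumptions for illustration only; `collections.Counter` simply replaces the manual counting loop.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total tokens} for one tokenised document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document, not taken from the PubMed data
toy_doc = "gene expression in the gene network".split()
toy_tf = term_frequencies(toy_doc)

# "gene" is 2 of the 6 tokens, so its TF is 2/6
print(toy_tf["gene"])
```

Because every count is divided by the same document length, the TF values of a document always sum to 1.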

#### Computing an IDF Map

**IDF(term) = log(total # of documents / # of documents with term in it)** 

Computing the IDF of a word requires the total number of documents and the number of documents that contain the word. In our case the total number of documents is len(pubData), the number of publications. For each publication, we increment the document count of each unique word it contains, using the keys of the TF dictionaries from the previous step as the set of unique words. The keys of the resulting IDF dictionary are the set of all unique words across every document.

In [None]:
def computeCountDict():
    """ Returns a dictionary whose keys are all the unique words in
    the dataset and whose values count the number of reviews in which
    the word appears.
    """
    countDict = {}
    # Run through each review's tf dictionary and increment countDict's (word, doc) pair
    for review in tfDict:
        for word in tfDict[review]:
            if word in countDict:
                countDict[word] += 1
            else:
                countDict[word] = 1
    return countDict

#Stores the review count dictionary
countDict = computeCountDict()
countDict["genetic"]

Finally, we can compute an idfDict, using countDict and some math, and store it.

In [None]:
import math

def computeIDFDict():
    """ Returns a dictionary whose keys are all the unique words in the
    dataset and whose values are their corresponding idf.
    """
    idfDict = {}
    for word in countDict:
        idfDict[word] = math.log(len(pubData) / countDict[word])
    return idfDict
  
#Stores the idf dictionary
idfDict = computeIDFDict()

print(idfDict["genetic"])
print(idfDict["mendelian"])
print(idfDict["the"])
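
The TF and IDF maps still have to be multiplied together to give a TF-IDF score per word and document. The sketch below does this on a made-up three-document corpus; the names `toy_docs`, `tf`, `idf` and `toy_tfidf` are assumptions for illustration and do not touch the PubMed data.

```python
import math

# Hypothetical toy corpus of three tokenised documents
toy_docs = [
    "the genetic basis of disease".split(),
    "the mendelian genetic study".split(),
    "the cohort study".split(),
]

def tf(tokens):
    """Term frequency: count of each term divided by document length."""
    return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

def idf(docs):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for t in set(tokens):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

toy_idf = idf(toy_docs)
# TF-IDF of a term in a document is its TF in that document times its corpus-wide IDF
toy_tfidf = [{t: w * toy_idf[t] for t, w in tf(doc).items()} for doc in toy_docs]

for term, score in sorted(toy_tfidf[1].items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 3))
```

In this toy corpus "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF is 0 everywhere, while "mendelian", which appears in only one document, ranks highest in that document.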

## TF-IDF using sklearn

The above is implemented in the Python package scikit-learn (sklearn) as TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

It can be run in just a few lines:

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

#first lets create some functions to help process the data

#load orcid to pmid data
def load_orcid():
    print('load_orcid')
    orcidToPubmedID={}
    with open(config.orcidFile) as f:
        next(f)
        for line in f:
            orcid,pmid = line.rstrip().split('\t')
            if orcid in orcidToPubmedID:
                orcidToPubmedID[orcid].append(pmid)
            else:
                orcidToPubmedID[orcid]=[pmid]
    return orcidToPubmedID

#load the publication data
def load_pubmed():
    print('load_pubmed')
    pubmedText={}
    with open(config.pubmedFile, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        next(reader, None)
        for row in reader:
            text=row[2]+' '+row[3]
            pubmedText[row[0]]=text
    return pubmedText

#create dictionary of orcid to publication text
def orcid_to_pubmed():
    print('orcid_to_pubmed')
    orcidToPubmedID=load_orcid()
    pubmedText = load_pubmed()
    orcidToPubmed={}
    for orcid in orcidToPubmedID:
        oText=''
        for p in orcidToPubmedID[orcid]:
            if p in pubmedText:
                oText+=(pubmedText[p])
        orcidToPubmed[orcid]=oText
    return orcidToPubmed

print('Reading corpus')
token_dict = {}
orcidToPubmed = orcid_to_pubmed()
for orcid in orcidToPubmed:
    token_dict[orcid] = orcidToPubmed[orcid].lower()

#sklearn vectoriser: removes English stop words and includes unigrams, bigrams and trigrams
tfidf = TfidfVectorizer(stop_words='english',ngram_range=(1,3))

#fit_transform learns the vocabulary and idf weights and returns the document-term tf-idf matrix
%time tfs = tfidf.fit_transform(token_dict.values())

#get similarity matrix for all people
#https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents
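#.toarray() gives a dense ndarray (the .A shortcut is not available in recent SciPy releases)
matrix=(tfs @ tfs.T).toarray()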
#store this and the dictionary for access in other notebooks
%store matrix
%store token_dict
%store tfs
%store tfidf

print('Done')

Reading corpus
orcid_to_pubmed
load_orcid
load_pubmed
CPU times: user 12.8 s, sys: 516 ms, total: 13.3 s
Wall time: 12.9 s
Stored 'matrix' (ndarray)
Stored 'token_dict' (dict)
Stored 'tfs' (csr_matrix)
Stored 'tfidf' (TfidfVectorizer)
Done
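
The product of the tf-idf matrix with its transpose works as a similarity matrix because TfidfVectorizer L2-normalises each row by default, so the dot product of two rows is their cosine similarity. A minimal pure-Python sketch of that idea follows; the two weight dictionaries are made-up values, not output from the corpus.

```python
import math

def l2_normalise(vec):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def dot(a, b):
    """Dot product of two sparse vectors; only shared terms contribute."""
    return sum(w * b[t] for t, w in a.items() if t in b)

# Hypothetical unnormalised weights for two people
person_a = l2_normalise({"genetic": 2.0, "mendelian": 1.0})
person_b = l2_normalise({"genetic": 1.0, "cohort": 3.0})

print(dot(person_a, person_a))  # self-similarity of a unit vector
print(dot(person_a, person_b))  # cosine similarity between the two people
```

Note that the dense matrix has one row and one column per person, so its memory use grows quadratically with the number of ORCIDs.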


And to test some words:

In [4]:
def tfidf_doc(tfidf='',text=''):
    text=text.lower()
    #transform maps a document onto the fitted vocabulary as a one-row document-term matrix
    response = tfidf.transform([text])

    #get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = tfidf.get_feature_names_out()
    res={}
    for col in response.nonzero()[1]:
        res[feature_names[col]]=response[0, col]
    #reverse sort the results once, after all terms have been collected
    sorted_res = sorted(res.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_res

sorted_res=tfidf_doc(tfidf=tfidf,text='genetic mendelian the')
for s in sorted_res:
    print(s)


('mendelian', 0.8418234170672653)
('genetic', 0.5397530310032479)
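
Two things are worth noting about this output. "the" is missing because it is in the English stop-word list, and the two scores are L2-normalised, so their squares sum to 1. The scores also differ from the from-scratch formula because, with its documented defaults (`smooth_idf=True`, `norm='l2'`), TfidfVectorizer uses raw term counts as TF and a smoothed IDF of ln((1 + n) / (1 + df)) + 1. The sketch below reproduces that weighting in plain Python; the corpus size and document frequencies are invented for illustration.

```python
import math

def sklearn_style_idf(n_docs, doc_freq):
    """Smoothed idf as documented for TfidfVectorizer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalise(weights):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Hypothetical document frequencies in a hypothetical 1000-document corpus
n_docs = 1000
doc_freq = {"genetic": 200, "mendelian": 10}

# Each term occurs once in the query, so its raw term count is 1
raw = {t: 1 * sklearn_style_idf(n_docs, df) for t, df in doc_freq.items()}
scores = l2_normalise(raw)
print(scores)
```

As in the real output, the rarer term gets the larger share of the unit-length vector.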


## TF-IDF on our data

We can now identify the key words in each person's publications by combining all of their texts into a single document and comparing it to the background frequencies.

For example:

In [6]:
orcidToPubmedID=load_orcid()
pubmedText = load_pubmed()
#get all publications for a specific ORCID
orcidID='0000-0001-7328-4233'
oText=''
for p in orcidToPubmedID[orcidID]:
    if p in pubmedText:
        oText+=(pubmedText[p])
res = tfidf_doc(tfidf=tfidf,text=oText)
for r in res[0:10]:
    print(r)


load_orcid
load_pubmed
('nematode', 0.14027902569742493)
('mammary', 0.14027902569742493)
('ccr5', 0.13777851866416813)
('motu', 0.11492279324093777)
('crpc', 0.10716107007213078)
('cell', 0.09402472111432539)
('cancer', 0.09346441177633637)
('mammary stem', 0.0918523457761121)
('id4', 0.0918523457761121)
('genome', 0.0856179178746168)


We can now easily do this for all ORCIDs:

In [29]:
o=open('output/orcid-tf-idf.txt','w')
counter=0

orcidToPubmedID=load_orcid()
pubmedText = load_pubmed()
for orcid in orcidToPubmed:
    #don't really want to do this for all, so just ORCIDs with < 100 publications, and only the first 5 of those
    if len(orcidToPubmedID[orcid])<100:
        counter+=1
        if counter<=5:
            print(counter,orcid)
            oText=''
            for p in orcidToPubmedID[orcid]:
                if p in pubmedText:
                    oText+=(pubmedText[p])
            print(len(oText))
            %time res = tfidf_doc(tfidf=tfidf,text=oText)
            for r in res[0:100]:
                o.write(orcid+'\t'+r[0]+'\t'+str(r[1])+'\n')
o.close()

load_orcid
load_pubmed
1 0000-0001-5001-3350
5877
CPU times: user 2.45 s, sys: 63.9 ms, total: 2.52 s
Wall time: 2.53 s
2 0000-0001-8347-5092
26148
CPU times: user 8.85 s, sys: 84 ms, total: 8.94 s
Wall time: 8.97 s
3 0000-0002-8570-0406
4913
CPU times: user 2.68 s, sys: 64.5 ms, total: 2.74 s
Wall time: 2.75 s
4 0000-0002-3091-3164
1228
CPU times: user 2.42 s, sys: 43.9 ms, total: 2.46 s
Wall time: 2.47 s
5 0000-0001-6563-9903
2452
CPU times: user 2.45 s, sys: 45.8 ms, total: 2.5 s
Wall time: 2.5 s
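
Since only the top 100 terms per ORCID are written out, a full sort of every non-zero term is not strictly needed. The sketch below shows one possible alternative using `heapq.nlargest` and `csv.writer`; the function `write_top_terms`, the placeholder ORCID and the scores are all hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import heapq
import io

def write_top_terms(out, orcid, scores, k=100):
    """Write the k highest-scoring (term, score) pairs as tab-separated rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # nlargest returns the k best items in descending order without sorting everything
    for term, score in heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]):
        writer.writerow([orcid, term, score])

# Hypothetical scores for a placeholder ORCID, written to an in-memory buffer
buffer = io.StringIO()
write_top_terms(buffer, "0000-0000-0000-0000",
                {"genome": 0.2, "cell": 0.5, "cancer": 0.3}, k=2)
print(buffer.getvalue())
```

The same function could be pointed at an open file handle instead of the buffer.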
