# Term Frequency-Inverse Document Frequency (TF-IDF) Vector Representation
__Anish Sachdeva (DTU/2K16/MC/013)__

__Natural Language Processing (IT-425)__

In this notebook we will extract Term Frequency-Inverse Document Frequency (TF-IDF) vector representations from a given corpus, where our corpus will be my resume. We will divide the corpus into 6 different parts and each part will be treated as a document. The vector for a given word will be a $1 \times 6$ vector, and each column will hold the word's TF-IDF weight in that particular document: the number of times the word occurred in the document, scaled by how rare the word is across the 6 documents.

## 1. Importing Required Packages

In [6]:
import pprint
from collections import Counter

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import numpy as np
import pandas

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Importing the Corpus (Resume)

In [38]:
# read the whole resume as lowercase text; the file is closed automatically
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()
print(resume)

anish sachdeva
software developer + clean code enthusiast

phone : 8287428181
email : anish_@outlook.com
home : sandesh vihar, pitampura, new delhi - 110034
date of birth : 7th april 1998
languages : english, hindi, french

work experience
what after college (4 months)
delhi, india
creating content to teach core java and python with data structures and algorithms and giving online classes to students.
giving python classes workshops to students all around india and teaching core data structures and the python api
with emphasis on data structures, algorithms and problem solving. see a sample python batch here:
https://github.com/anishlearnstocode/python-workshop-6

also teaching java to students in batches of 10 days, where the full java api and data types are covered along with many
important algorithms are aso taught. see a sample java batch here: https://github.com/anishlearnstocode/java-wac-batch-32

summer research fellow at university of auckland (2 months)
auckland, new zealand
w

## 3. Tokenizing The Resume
We now create a utility function called `tokenize` that takes in a corpus (the resume in this case) and returns a list of tokens after removing stopwords and punctuation. It keeps only purely alphabetic tokens, so numbers are ignored as well.

In [39]:
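# utility function for tokenizing: lowercases the text, splits it into runs of word characters,
# then drops English stopwords and every token that is not purely alphabetic
def tokenize(document: str, stopwords_en=frozenset(stopwords.words('english')), tokenizer=nltk.RegexpTokenizer(r'\w+')):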
    document = document.lower()
    return [token for token in tokenizer.tokenize(document) if token not in stopwords_en and token.isalpha()]
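To make the filtering step concrete, here is a minimal sketch that does not need NLTK. It assumes that `re.findall(r'\w+', ...)` splits text the same way as the `RegexpTokenizer(r'\w+')` used above, and it replaces NLTK's stopword list with a tiny hand-written set. The names `TOY_STOPWORDS`, `toy_tokenize` and `sample` are hypothetical and exist only for this illustration.

```python
import re

# hypothetical stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"of", "to", "and", "the"}

def toy_tokenize(document):
    # split into runs of word characters (letters, digits, underscore)
    words = re.findall(r"\w+", document.lower())
    # keep only non-stopwords made up entirely of letters
    return [w for w in words if w not in TOY_STOPWORDS and w.isalpha()]

sample = "Phone : 8287428181\nEmail : anish_@outlook.com\nDate of birth : 7th April 1998"
toy_result = toy_tokenize(sample)
print(toy_result)
# ['phone', 'email', 'outlook', 'com', 'date', 'birth', 'april']
```

Tokens such as `anish_` and `7th` are dropped because they contain a character that is not a letter, which is consistent with the notebook's token list, where `email` is followed directly by `outlook`.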

In [40]:
# tokenizing the resume
tokens = tokenize(resume)

# see first 30 tokens
print(tokens[: 30])

['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', 'email', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', 'date', 'birth', 'april', 'languages', 'english', 'hindi', 'french', 'work', 'experience', 'college', 'months', 'delhi', 'india']


## 4. Dividing the Corpus Into 6 Documents

In [42]:
k = len(tokens) // 6
documents = []
for i in range(5):
    documents.append(tokens[i * k: (i + 1) * k])
documents.append(tokens[5 * k:])
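The split above gives the first five documents exactly `k` tokens each and lets the sixth document absorb any remainder. A minimal sketch of the same idea for an arbitrary number of parts follows; the helper `split_into_documents` and the toy token list are hypothetical and are not part of the notebook.

```python
def split_into_documents(tokens, n_documents):
    """Split tokens into contiguous chunks; the last chunk takes the remainder."""
    k = len(tokens) // n_documents
    chunks = [tokens[i * k:(i + 1) * k] for i in range(n_documents - 1)]
    chunks.append(tokens[(n_documents - 1) * k:])
    return chunks

# 20 toy tokens split into 6 documents: k = 3, so the last chunk gets 5 tokens
toy_tokens = [f"t{i}" for i in range(20)]
chunk_sizes = [len(chunk) for chunk in split_into_documents(toy_tokens, 6)]
print(chunk_sizes)
# [3, 3, 3, 3, 3, 5]
```

With this scheme the last document can hold up to `n_documents - 1` more tokens than the others.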

In [43]:
# the 6th document is 
pprint.pp(documents[5])

['links',
 'https',
 'www',
 'linkedin',
 'com',
 'https',
 'github',
 'com',
 'anishlearnstocode',
 'https',
 'www',
 'hackerrank',
 'com',
 'anishviewer',
 'honours',
 'awards',
 'mitacs',
 'globalink',
 'scholarship',
 'cohort',
 'summer',
 'research',
 'fellowship',
 'university',
 'auckland',
 'mathematics',
 'department',
 'technical',
 'student',
 'cern',
 'google',
 'india',
 'challenge',
 'scholarship',
 'certifications',
 'trinity',
 'college',
 'london',
 'plectrum',
 'guitar',
 'grade',
 'distinction',
 'trinity',
 'college',
 'london',
 'plectrum',
 'guitar',
 'grade',
 'merit',
 'trinity',
 'college',
 'london',
 'plectrum',
 'guitar',
 'grade',
 'distinction',
 'trinity',
 'college',
 'london',
 'plectrum',
 'guitar',
 'grade',
 'distinction',
 'french',
 'level',
 'cern',
 'java',
 'data',
 'structures',
 'algorithms',
 'coding',
 'ninjas',
 'web',
 'development',
 'ruby',
 'rails',
 'coding',
 'ninjas',
 'competitive',
 'programming',
 'coding',
 'ninjas']


## 5. Calculating Most Common 5 Tokens From Each Document & Storing Frequency Tables for Each Document

In [44]:
most_common = set()
document_frequencies = []
for document in documents:
    frequencies = Counter(document)
    document_frequencies.append(frequencies)
    for word, frequency in frequencies.most_common(5):
        most_common.add(word)

In [45]:
# number of tokens we have selected; the top-5 lists of different documents can overlap,
# so we may end up with fewer than 30 unique tokens
print('Number of tokens:', len(most_common))

Number of tokens: 27
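The count is below 30 because a word that ranks in the top 5 of several documents is added to the set only once. A small sketch of this effect on two hypothetical toy documents, taking the top 2 words from each (`toy_documents` and `selected` are illustrative names):

```python
from collections import Counter

# hypothetical toy documents, not taken from the resume
toy_documents = [
    ["python", "data", "python", "java", "data", "students"],
    ["java", "data", "java", "data", "cern"],
]

selected = set()
for toy_document in toy_documents:
    # take the 2 most frequent words of each document
    for word, _count in Counter(toy_document).most_common(2):
        selected.add(word)

print(sorted(selected))
# ['data', 'java', 'python']
```

Two documents with two words each could give 4 tokens, but `data` is in both top-2 lists, so only 3 unique tokens remain.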


In [46]:
# The tokens from the first document are
print('Tokens from first document:', document_frequencies[0].most_common(5))

Tokens from first document: [('python', 5), ('data', 3), ('structures', 3), ('students', 3), ('com', 2)]


In [47]:
# The selected tokens are
pprint.pp(most_common)

{'algorithms',
 'also',
 'applications',
 'auckland',
 'cern',
 'college',
 'com',
 'computer',
 'data',
 'geometry',
 'group',
 'guitar',
 'java',
 'london',
 'many',
 'mathematics',
 'participated',
 'plectrum',
 'python',
 'requests',
 'research',
 'structures',
 'students',
 'theory',
 'trinity',
 'university',
 'worked'}


## 6. Calculating Number of Documents a Keyword Appears In
The TF-IDF weight of a word $w$ in a document $d$ is given by:
$$
\begin{aligned}
tfidf(w, d) &= tf(w, d) \times idf(w) \\
idf(w) &= \log{\frac{N_t}{N_w}}
\end{aligned}
$$

where:

$tf(w, d)$ is the number of times the word $w$ occurs in the document $d$,

$N_t$ is the total number of documents, and

$N_w$ is the total number of documents containing the keyword $w$.
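The inverse document frequency depends only on the ratio $N_t / N_w$. As a quick sketch, assuming the natural logarithm (which is what `np.log` computes later in this notebook) and using a hypothetical helper named `idf`:

```python
import math

def idf(n_total, n_containing):
    """Inverse document frequency: log(total documents / documents containing the word)."""
    return math.log(n_total / n_containing)

# IDF of a word that appears in 1, 2, 3 or all 6 of the 6 documents
idf_by_count = {n_w: round(idf(6, n_w), 6) for n_w in (1, 2, 3, 6)}
print(idf_by_count)
# {1: 1.791759, 2: 1.098612, 3: 0.693147, 6: 0.0}
```

Rarer words receive a larger weight, and a word that occurs in every document receives a weight of zero.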

We now create a dictionary `N_w` (_str_ $\rightarrow$ _int_) which will store the number of documents a word $w$ occurs in.

In [48]:
N_t = 6
N_w = {}
for word in most_common:
    count = 0
    for frequencies in document_frequencies:
        count = count + (word in frequencies)
    N_w[word] = count

In [49]:
# seeing the N_w map for all the selected words
pprint.pp(N_w)

{'requests': 1,
 'geometry': 1,
 'mathematics': 3,
 'university': 3,
 'algorithms': 4,
 'java': 6,
 'college': 2,
 'group': 2,
 'many': 2,
 'com': 3,
 'theory': 2,
 'python': 2,
 'plectrum': 1,
 'students': 2,
 'london': 1,
 'research': 3,
 'cern': 3,
 'trinity': 1,
 'participated': 1,
 'guitar': 1,
 'data': 5,
 'applications': 1,
 'worked': 3,
 'also': 3,
 'structures': 3,
 'auckland': 2,
 'computer': 1}


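We notice above that __java__ is the only word in the given list to appear in all 6 documents. Its inverse document frequency is therefore $\log{\frac{6}{6}} = 0$, so its TF-IDF weight will be zero in every document.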

## 7. Computing the TF-IDF Vectors

In [50]:
vectors = {}
for word in most_common:
    vector = [0] * 6
    for index, frequencies in enumerate(document_frequencies):
        vector[index] = frequencies[word] * np.log(N_t / N_w[word])
    vectors[word] = vector

In [51]:
# Let's see the vector output for a few words
print(vectors['java'])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [52]:
print(vectors['students'])

[3.295836866004329, 0.0, 0.0, 1.0986122886681098, 0.0, 0.0]


In [53]:
# you can also test it out with a word of your choice, try below:
word = 'python'
print(vectors.get(word, [0] * 6))

[5.493061443340549, 0.0, 0.0, 0.0, 1.0986122886681098, 0.0]
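These outputs can be checked by hand. The first document contains `students` 3 times and `python` 5 times (see the top-5 list for the first document), both words occur in 2 of the 6 documents, and `np.log` is the natural logarithm. Writing $d_0$ for the first document (a notation introduced here only for this check):

$$
\begin{aligned}
tfidf(\text{students}, d_0) &= 3 \times \ln\frac{6}{2} \approx 3 \times 1.0986 \approx 3.2958 \\
tfidf(\text{python}, d_0) &= 5 \times \ln\frac{6}{2} \approx 5 \times 1.0986 \approx 5.4931 \\
tfidf(\text{java}, d) &= tf(\text{java}, d) \times \ln\frac{6}{6} = 0 \quad \text{for every document } d
\end{aligned}
$$

The last line shows why the vector for `java` is all zeros even though the word occurs in every document.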


## 8. Representing The Vectors in a Tabular Form 

In [54]:
table = pandas.DataFrame(data=vectors)

In [55]:
print(table.iloc[:, 0:7])

   requests  geometry  mathematics  university  algorithms  java   college
0  0.000000  0.000000     0.000000    0.000000    0.810930   0.0  1.098612
1  0.000000  5.375278     2.079442    1.386294    0.405465   0.0  0.000000
2  0.000000  0.000000     0.000000    0.000000    0.000000   0.0  0.000000
3  3.583519  0.000000     0.000000    0.000000    0.000000   0.0  0.000000
4  0.000000  0.000000     1.386294    1.386294    0.810930   0.0  0.000000
5  0.000000  0.000000     0.693147    0.693147    0.405465   0.0  4.394449


In [56]:
print(table.iloc[:, 7:14])

      group      many       com    theory    python  plectrum  students
0  0.000000  0.000000  1.386294  0.000000  5.493061  0.000000  3.295837
1  1.098612  1.098612  0.693147  3.295837  0.000000  0.000000  0.000000
2  2.197225  0.000000  0.000000  1.098612  0.000000  0.000000  0.000000
3  0.000000  2.197225  0.000000  0.000000  0.000000  0.000000  1.098612
4  0.000000  0.000000  0.000000  0.000000  1.098612  0.000000  0.000000
5  0.000000  0.000000  2.079442  0.000000  0.000000  7.167038  0.000000


In [57]:
print(table.iloc[:, 14:20])

     london  research      cern   trinity  participated    guitar
0  0.000000  0.000000  0.000000  0.000000      0.000000  0.000000
1  0.000000  0.693147  0.000000  0.000000      0.000000  0.000000
2  0.000000  1.386294  2.772589  0.000000      0.000000  0.000000
3  0.000000  0.000000  0.693147  0.000000      3.583519  0.000000
4  0.000000  0.000000  0.000000  0.000000      0.000000  0.000000
5  7.167038  0.693147  1.386294  7.167038      0.000000  7.167038


In [58]:
print(table.iloc[:, 20:])

       data  applications    worked      also  structures  auckland  computer
0  0.546965      0.000000  0.000000  0.693147    2.079442  0.000000  0.000000
1  0.182322      0.000000  1.386294  0.000000    0.000000  3.295837  0.000000
2  0.000000      5.375278  2.079442  0.693147    0.000000  0.000000  0.000000
3  0.182322      0.000000  2.079442  2.079442    0.000000  0.000000  0.000000
4  0.364643      0.000000  0.000000  0.000000    2.079442  0.000000  5.375278
5  0.182322      0.000000  0.000000  0.000000    0.693147  1.098612  0.000000


__See full output in text format [here](https://github.com/anishLearnsToCode/bow-representation/blob/master/assets/tfidf.txt).__