# Term Frequency (TF) Vector Representation
__Anish Sachdeva (DTU/2K16/MC/013)__

__Natural Language Processing (IT-425)__

In this notebook we will extract Term Frequency (TF) vector representations from a given corpus, which here is my resume. We will divide the corpus into 6 parts and treat each part as a document. The vector for a given word will be a $1 \times 6$ vector in which each column holds the number of times the word occurred in the corresponding document.
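
The same definition can be written compactly. The symbols $d_1, \dots, d_6$ and $f$ are notation introduced here for illustration and are not used elsewhere in the notebook. For a word $w$ and documents $d_1, \dots, d_6$:

$$
\mathrm{tf}(w) = \big[\, f(w, d_1),\; f(w, d_2),\; \dots,\; f(w, d_6) \,\big]
$$

where $f(w, d_j)$ is the raw count of how many times $w$ occurs in document $d_j$. No normalization by document length is applied.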

## 1. Importing Required Packages

In [1]:
# Importing all necessary packages
from collections import Counter

import pprint
import nltk
import numpy as np
import pandas
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Importing the Corpus (Resume)

In [36]:
# read the resume and lowercase it; the context manager closes the file for us
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()
print(resume)

anish sachdeva
software developer + clean code enthusiast

phone : 8287428181
email : anish_@outlook.com
home : sandesh vihar, pitampura, new delhi - 110034
date of birth : 7th april 1998
languages : english, hindi, french

work experience
what after college (4 months)
delhi, india
creating content to teach core java and python with data structures and algorithms and giving online classes to students.
giving python classes workshops to students all around india and teaching core data structures and the python api
with emphasis on data structures, algorithms and problem solving. see a sample python batch here:
https://github.com/anishlearnstocode/python-workshop-6

also teaching java to students in batches of 10 days, where the full java api and data types are covered along with many
important algorithms are aso taught. see a sample java batch here: https://github.com/anishlearnstocode/java-wac-batch-32

summer research fellow at university of auckland (2 months)
auckland, new zealand
w

## 3. Tokenizing the Resume
We will now create a utility function called `tokenize` that takes our resume document and returns its tokens after removing stopwords, punctuation and numerals.

In [37]:
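# utility function to tokenize a document: lowercases it, splits on runs of
# word characters, then drops stopwords and tokens that are not purely alphabetic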
def tokenize(document: str, stopwords_en=stopwords.words('english'), tokenizer=nltk.RegexpTokenizer(r'\w+')):
    document = document.lower()
    return [token for token in tokenizer.tokenize(document) if token not in stopwords_en and token.isalpha()]

In [38]:
# creating the tokens
tokens = tokenize(resume)

# printing first 30 tokens
print(tokens[: 30])

['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', 'email', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', 'date', 'birth', 'april', 'languages', 'english', 'hindi', 'french', 'work', 'experience', 'college', 'months', 'delhi', 'india']
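
To see why tokens such as `7th`, `1998` and `anish_` disappear from the output above, here is a minimal standard-library sketch of the same filtering steps. It is an approximation rather than the notebook's function: `simple_tokenize` and `SAMPLE_STOPWORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and `re.findall` is assumed to behave like `nltk.RegexpTokenizer(r'\w+')` for this pattern.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's
# much larger English list instead.
SAMPLE_STOPWORDS = frozenset({"of", "the", "and", "to", "in", "a"})


def simple_tokenize(document, stop_words=SAMPLE_STOPWORDS):
    """Lowercase, split on runs of word characters, then drop stop words
    and any token that is not purely alphabetic."""
    candidates = re.findall(r"\w+", document.lower())
    return [t for t in candidates if t not in stop_words and t.isalpha()]


sample = "Date of Birth : 7th April 1998, email : anish_@outlook.com"
print(simple_tokenize(sample))
# → ['date', 'birth', 'april', 'email', 'outlook', 'com']
```

`\w` matches letters, digits and the underscore, so `7th`, `1998` and `anish_` are produced by the regular expression and only removed afterwards by the `isalpha()` check.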


## 4. Dividing the Corpus Tokens into 6 Documents
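We now divide these tokens into 6 documents. The first five documents each contain `len(tokens) // 6` tokens, and the sixth takes whatever remains, so it may be slightly longer than the others.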

In [39]:
k = len(tokens) // 6
documents = []
for i in range(5):
    documents.append(tokens[i * k: (i + 1) * k])
documents.append(tokens[5 * k:])

print('The third document contains the following tokens:')
pprint.pp(documents[2])

The third document contains the following tokens:
['scientist',
 'microsoft',
 'quantum',
 'research',
 'working',
 'cutting',
 'edge',
 'research',
 'using',
 'foundational',
 'group',
 'theory',
 'mobius',
 'transformations',
 'real',
 'world',
 'practical',
 'applications',
 'software',
 'developer',
 'cern',
 'months',
 'cern',
 'geneva',
 'switzerland',
 'worked',
 'core',
 'platforms',
 'team',
 'fap',
 'bc',
 'group',
 'part',
 'agile',
 'team',
 'developers',
 'maintains',
 'adds',
 'core',
 'functionality',
 'applications',
 'used',
 'internally',
 'cern',
 'hr',
 'financial',
 'administrative',
 'departments',
 'including',
 'scientific',
 'worked',
 'legacy',
 'applications',
 'comprise',
 'single',
 'times',
 'multiple',
 'frameworks',
 'java',
 'spring',
 'boot',
 'hibernate',
 'java',
 'ee',
 'also',
 'worked',
 'google',
 'polymer',
 'jsp',
 'client',
 'side',
 'maintained',
 'cern',
 'electronic',
 'document',
 'handing',
 'system',
 'application',
 'loc']
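
The splitting logic can be checked on a small list where the sizes are easy to verify by hand. This is a sketch under the assumption that a reusable helper is wanted; `split_into_documents` and `toy_tokens` are hypothetical names that do not appear in the notebook.

```python
def split_into_documents(tokens, parts):
    """Split tokens into `parts` consecutive chunks; the last chunk takes any remainder."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = len(tokens) // parts
    # the first parts - 1 chunks all have exactly `size` tokens
    chunks = [tokens[i * size:(i + 1) * size] for i in range(parts - 1)]
    # the final chunk runs to the end of the list
    chunks.append(tokens[(parts - 1) * size:])
    return chunks


toy_tokens = list("abcdefghijklmnopqrst")  # 20 single-letter tokens
toy_documents = split_into_documents(toy_tokens, 6)
print([len(d) for d in toy_documents])
# → [3, 3, 3, 3, 3, 5]
```

With 20 tokens and 6 parts, integer division gives a chunk size of 3, so the last document absorbs the 2 leftover tokens and no token is lost.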


## 5. Calculating Most Common 5 Tokens from Each Document & Storing Frequency Tables for Each Document

We take the 5 most common words from each document. The same word can rank among the top 5 in more than one document, so we collect the words in the set `most_common` to remove the repetition. We also store the word frequencies of all 6 documents in the list `document_frequencies`, where each element is a `Counter` (a dictionary subclass).

In [40]:
most_common = set()
document_frequencies = []
for document in documents:
    frequencies = Counter(document)
    document_frequencies.append(frequencies)
    for word, frequency in frequencies.most_common(5):
        most_common.add(word)

In [41]:
# see the most common words in document 5
print(document_frequencies[4].most_common(5))

[('structures', 3), ('computer', 3), ('algorithms', 2), ('java', 2), ('university', 2)]


In [42]:
# number of most common words from all 6 documents (may not be 30)
print('number of most common words:', len(most_common))

number of most common words: 27


In [43]:
# the most common words are
pprint.pp(most_common)

{'algorithms',
 'also',
 'applications',
 'auckland',
 'cern',
 'college',
 'com',
 'computer',
 'data',
 'geometry',
 'group',
 'guitar',
 'java',
 'london',
 'many',
 'mathematics',
 'participated',
 'plectrum',
 'python',
 'requests',
 'research',
 'structures',
 'students',
 'theory',
 'trinity',
 'university',
 'worked'}
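
The reason the set ends up with fewer than 30 words can be reproduced on toy data. The following sketch uses made-up documents (`toy_documents`, `top_words` and `toy_frequencies` are illustrative names only) and takes the top 2 words per document instead of the top 5.

```python
from collections import Counter

toy_documents = [
    ["java", "python", "java", "data"],
    ["data", "java", "data", "cern"],
]

top_words = set()
toy_frequencies = []
for doc in toy_documents:
    counts = Counter(doc)
    toy_frequencies.append(counts)
    # most_common(2) returns (word, count) pairs, highest count first
    top_words.update(word for word, _ in counts.most_common(2))

print(toy_frequencies[0].most_common(2))  # → [('java', 2), ('python', 1)]
print(sorted(top_words))                  # → ['data', 'java', 'python']
```

Two documents times two words would give four entries, but `java` is in the top 2 of both documents, so the set holds only three. On Python 3.7 and later, words with equal counts are returned in the order they were first encountered, which is why `python` is listed ahead of `data` for the first document.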


## 6. Calculating Term Frequency (TF) Vectors
We now calculate Term Frequency Vectors using the `document_frequencies` list.

In [44]:
vectors = {}
for word in most_common:
    vector = [0] * 6
    for index, frequencies in enumerate(document_frequencies):
        vector[index] = frequencies[word]
    vectors[word] = vector
pprint.pp(vectors)

{'requests': [0, 0, 0, 2, 0, 0],
 'structures': [3, 0, 0, 0, 3, 1],
 'students': [3, 0, 0, 1, 0, 0],
 'london': [0, 0, 0, 0, 0, 4],
 'group': [0, 1, 2, 0, 0, 0],
 'college': [1, 0, 0, 0, 0, 4],
 'also': [1, 0, 1, 3, 0, 0],
 'cern': [0, 0, 4, 1, 0, 2],
 'mathematics': [0, 3, 0, 0, 2, 1],
 'plectrum': [0, 0, 0, 0, 0, 4],
 'computer': [0, 0, 0, 0, 3, 0],
 'java': [2, 3, 2, 2, 2, 1],
 'guitar': [0, 0, 0, 0, 0, 4],
 'research': [0, 1, 2, 0, 0, 1],
 'participated': [0, 0, 0, 2, 0, 0],
 'data': [3, 1, 0, 1, 2, 1],
 'algorithms': [2, 1, 0, 0, 2, 1],
 'trinity': [0, 0, 0, 0, 0, 4],
 'worked': [0, 2, 3, 3, 0, 0],
 'geometry': [0, 3, 0, 0, 0, 0],
 'com': [2, 1, 0, 0, 0, 3],
 'auckland': [0, 3, 0, 0, 0, 1],
 'university': [0, 2, 0, 0, 2, 1],
 'many': [0, 1, 0, 2, 0, 0],
 'applications': [0, 0, 3, 0, 0, 0],
 'theory': [0, 3, 1, 0, 0, 0],
 'python': [5, 0, 0, 0, 1, 0]}


In [45]:
# see the tf vector for any word (you can modify below)
word = 'java'
print(list(vectors[word]))

[2, 3, 2, 2, 2, 1]
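
The vector construction can also be wrapped in a function and checked on a toy corpus. This is a sketch, not the notebook's code: `tf_vectors`, `toy_documents` and `toy_vectors` are hypothetical names, and the three short documents are invented for illustration.

```python
from collections import Counter


def tf_vectors(documents, vocabulary):
    """Map each word in `vocabulary` to its list of per-document raw counts."""
    counts = [Counter(doc) for doc in documents]
    # a Counter returns 0 for missing keys, so absent words need no special case
    return {word: [c[word] for c in counts] for word in vocabulary}


toy_documents = [
    ["java", "python", "java"],
    ["cern", "java"],
    ["python", "python", "cern"],
]
toy_vectors = tf_vectors(toy_documents, ["java", "python", "cern"])
print(toy_vectors["java"])    # → [2, 1, 0]
print(toy_vectors["python"])  # → [1, 0, 2]
```

Each vector has one entry per document, in document order, which mirrors the $1 \times 6$ vectors computed above for the resume.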


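## 7. Representing in Tabular Form
Each column of the table is a word and each row is a document (indexed 0 to 5). The 27 columns are printed a few at a time so that they fit on the page.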

In [46]:
table = pandas.DataFrame(data=vectors)
print(table.iloc[:, 0:8])

   requests  structures  students  london  group  college  also  cern
0         0           3         3       0      0        1     1     0
1         0           0         0       0      1        0     0     0
2         0           0         0       0      2        0     1     4
3         2           0         1       0      0        0     3     1
4         0           3         0       0      0        0     0     0
5         0           1         0       4      0        4     0     2


In [47]:
print(table.iloc[:, 8:15])

   mathematics  plectrum  computer  java  guitar  research  participated
0            0         0         0     2       0         0             0
1            3         0         0     3       0         1             0
2            0         0         0     2       0         2             0
3            0         0         0     2       0         0             2
4            2         0         3     2       0         0             0
5            1         4         0     1       4         1             0


In [48]:
print(table.iloc[:, 15:22])

   data  algorithms  trinity  worked  geometry  com  auckland
0     3           2        0       0         0    2         0
1     1           1        0       2         3    1         3
2     0           0        0       3         0    0         0
3     1           0        0       3         0    0         0
4     2           2        0       0         0    0         0
5     1           1        4       0         0    3         1


In [49]:
print(table.iloc[:, 22:])

   university  many  applications  theory  python
0           0     0             0       0       5
1           2     1             0       3       0
2           0     0             3       1       0
3           0     2             0       0       0
4           2     0             0       0       1
5           1     0             0       0       0
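
As an alternative layout, the table can be transposed so that each word is a row, which avoids slicing the columns into groups. The sketch below uses only string formatting and no pandas; `format_tf_table` and `sample_vectors` are hypothetical names, the two sample vectors are copied from the output above, and the function assumes a non-empty dictionary whose vectors all have the same length.

```python
def format_tf_table(vectors):
    """Return a plain-text table with one row per word and one column per document."""
    n_docs = len(next(iter(vectors.values())))
    width = max(len(word) for word in vectors)
    # header row: blank word column, then d0, d1, ... right-aligned in 4 characters
    header = " " * width + "".join(f"{'d' + str(i):>4}" for i in range(n_docs))
    rows = [
        f"{word:<{width}}" + "".join(f"{count:>4}" for count in counts)
        for word, counts in sorted(vectors.items())
    ]
    return "\n".join([header] + rows)


sample_vectors = {"java": [2, 3, 2, 2, 2, 1], "python": [5, 0, 0, 0, 1, 0]}
print(format_tf_table(sample_vectors))
```

Rows are sorted alphabetically by word, and the column labels `d0` to `d5` correspond to the row indices 0 to 5 in the pandas output.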


__See full output [here](https://github.com/anishLearnsToCode/bow-representation/blob/master/assets/tf.txt).__