# One Hot Vector Representation of Words
__Anish Sachdeva (DTU/2K16/MC/13)__

__Natural Language Processing - Dr. Seba Susan__

In this notebook we see how to extract one-hot vectors from a given corpus; here the corpus is a resume. The first step is to import all required packages.

## 1. Import Packages

In [15]:
from collections import Counter
import pprint
import random

import pandas
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Importing the Corpus
Our corpus is a resume, which we will later divide into 6 parts so that each part represents one document.

In [2]:
# read the whole resume and lowercase it; the with block closes the file for us
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()
print(resume)

anish sachdeva
software developer + clean code enthusiast

phone : 8287428181
email : anish_@outlook.com
home : sandesh vihar, pitampura, new delhi - 110034
date of birth : 7th april 1998
languages : english, hindi, french

work experience
what after college (4 months)
delhi, india
creating content to teach core java and python with data structures and algorithms and giving online classes to students.
giving python classes workshops to students all around india and teaching core data structures and the python api
with emphasis on data structures, algorithms and problem solving. see a sample python batch here:
https://github.com/anishlearnstocode/python-workshop-6

also teaching java to students in batches of 10 days, where the full java api and data types are covered along with many
important algorithms are aso taught. see a sample java batch here: https://github.com/anishlearnstocode/java-wac-batch-32

summer research fellow at university of auckland (2 months)
auckland, new zealand
w

## 3. Tokenizing the Resume
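We will now tokenize the resume to obtain the words/tokens of our corpus. The tokenizer lowercases the text, splits it on runs of word characters, and drops English stop words as well as any token that is not purely alphabetic (such as phone numbers). In the next section these tokens are divided into 6 parts to obtain our 6 documents.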

In [3]:
# creating utility function to tokenize the document
# the stop words are stored in a frozenset so that each membership test is fast
def tokenize(document: str, stopwords_en=frozenset(stopwords.words('english')), tokenizer=nltk.RegexpTokenizer(r'\w+')) -> list:
    document = document.lower()
    # keep only alphabetic tokens that are not stop words
    return [token for token in tokenizer.tokenize(document) if token not in stopwords_en and token.isalpha()]
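
To see what this filter does on a small input, here is a dependency-free sketch. It is an approximation, not the notebook's function: `re.findall(r'\w+', ...)` stands in for `nltk.RegexpTokenizer(r'\w+')`, and `SAMPLE_STOPWORDS` and `tokenize_sketch` are hypothetical names, with a tiny hand-picked stop word set replacing NLTK's much longer English list.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOPWORDS = frozenset({'to', 'and', 'with', 'of', 'the', 'in', 'at'})

def tokenize_sketch(document: str, stopwords_en=SAMPLE_STOPWORDS) -> list:
    # \w+ matches runs of letters, digits and underscores
    candidates = re.findall(r'\w+', document.lower())
    # keep purely alphabetic tokens that are not stop words
    return [token for token in candidates if token.isalpha() and token not in stopwords_en]

sample_tokens = tokenize_sketch('Phone : 8287428181, teaching Core Java and Python to students')
print(sample_tokens)
# ['phone', 'teaching', 'core', 'java', 'python', 'students']
```

The phone number is dropped by `isalpha()`, while `and` and `to` are dropped as stop words, which matches what happens to the resume header in the real output.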

In [5]:
# getting the tokens
tokens = tokenize(resume)
# printing first 25 tokens
print(tokens[: 25])

['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', 'email', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', 'date', 'birth', 'april', 'languages', 'english', 'hindi', 'french', 'work']


## 4. Dividing Corpus into 6 documents
We now divide the corpus into 6 documents by splitting the token list into 6 contiguous fragments of roughly equal size.

In [8]:
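# dividing corpus into 6 documents
k = len(tokens) // 6  # number of tokens in each of the first 5 documents
documents = []
for i in range(5):
    documents.append(tokens[i * k: (i + 1) * k])
# NOTE: this slice starts at 4 * k, so the sixth document also contains the whole fifth document;
# tokens[5 * k:] would give 6 non-overlapping documents (the outputs shown here are consistent with 4 * k)
documents.append(tokens[4 * k:])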

# printing the second document
pprint.pp(documents[1])

['batches',
 'days',
 'full',
 'java',
 'api',
 'data',
 'types',
 'covered',
 'along',
 'many',
 'important',
 'algorithms',
 'aso',
 'taught',
 'see',
 'sample',
 'java',
 'batch',
 'https',
 'github',
 'com',
 'anishlearnstocode',
 'java',
 'wac',
 'batch',
 'summer',
 'research',
 'fellow',
 'university',
 'auckland',
 'months',
 'auckland',
 'new',
 'zealand',
 'worked',
 'geometry',
 'mobius',
 'transformations',
 'differential',
 'geometry',
 'dr',
 'pedram',
 'hekmati',
 'department',
 'mathematics',
 'university',
 'auckland',
 'worked',
 'various',
 'topics',
 'mathematics',
 'abelian',
 'group',
 'theory',
 'measure',
 'theory',
 'graph',
 'theory',
 'differential',
 'geometry',
 'attended',
 'lectures',
 'conferences',
 'notable',
 'speakers',
 'throughout',
 'academia',
 'industry',
 'met',
 'ed',
 'witten',
 'currently',
 'forefront',
 'applied',
 'mathematics',
 'physics']
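
A general way to split a token list into `n` contiguous, non-overlapping parts can be sketched as follows. The helper name `split_into_parts` and the toy token list are assumptions made for illustration and are not part of the notebook; the last part absorbs whatever remainder is left after integer division.

```python
def split_into_parts(items: list, n: int) -> list:
    # size of every part except possibly the last one
    k = len(items) // n
    # the first n - 1 parts hold exactly k items each
    parts = [items[i * k: (i + 1) * k] for i in range(n - 1)]
    # the last part starts where the previous one ended and takes the remainder
    parts.append(items[(n - 1) * k:])
    return parts

toy_tokens = list('abcdefghijklmn')  # 14 toy tokens
toy_documents = split_into_parts(toy_tokens, 6)
print([len(part) for part in toy_documents])
# [2, 2, 2, 2, 2, 4]
```

Because every token lands in exactly one part, concatenating the parts gives back the original list.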


## 5. Calculating 5 Most Common Tokens from Each Document
We now calculate the 5 most common tokens in each document. We will not necessarily obtain $5 \times 6 = 30$ distinct tokens, since the same token can be among the most common in several documents.

In [10]:
# calculating most common 5 tokens from each document
most_common = set()
for document in documents:
    frequencies = Counter(document)
    for word, frequency in frequencies.most_common(5):
        most_common.add(word)

# print number of most common tokens
print('Number of Most Common Tokens:', len(most_common))

# print the most common tokens
pprint.pprint(most_common)

Number of Most Common Tokens: 25
{'algorithms',
 'also',
 'applications',
 'auckland',
 'cern',
 'college',
 'com',
 'data',
 'geometry',
 'group',
 'guitar',
 'java',
 'london',
 'mathematics',
 'participated',
 'plectrum',
 'python',
 'requests',
 'research',
 'structures',
 'students',
 'theory',
 'trinity',
 'university',
 'worked'}
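
The reason the set ends up smaller than the number of requested tokens can be seen on a toy example. This is a minimal sketch with made-up documents (`toy_documents` and `top_words` are hypothetical names): two documents with 2 top words each yield only 3 distinct words, because `java` is a top word in both.

```python
from collections import Counter

toy_documents = [
    ['java', 'java', 'java', 'python', 'python', 'data'],
    ['theory', 'theory', 'theory', 'java', 'java', 'data'],
]

top_words = set()
for toy_document in toy_documents:
    # most_common(2) returns the two highest-count (word, count) pairs
    for word, count in Counter(toy_document).most_common(2):
        top_words.add(word)

print(sorted(top_words))
# ['java', 'python', 'theory']
```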


## 6. Creating the One Hot Vectors for Each Word in the `most_common` Set
We iterate over all the words in the `most_common` set and, for each word, check whether it is present in each document. If the word occurs in document $i$ we set entry $i$ of its vector to $1$, otherwise to $0$.

In [16]:
vectors = {}
for word in most_common:
    # one entry per document: 1 if the word occurs in it, 0 otherwise
    vector = np.zeros(len(documents), dtype=int)
    for index, document in enumerate(documents):
        vector[index] = word in document
    vectors[word] = vector
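
The same construction on a toy corpus, as a minimal sketch: plain Python lists stand in for the NumPy arrays used above, and `toy_documents`, `toy_vocabulary` and `toy_vectors` are hypothetical names chosen for illustration.

```python
toy_documents = [
    ['java', 'python', 'data'],
    ['java', 'theory'],
    ['python', 'theory', 'guitar'],
]
toy_vocabulary = ['java', 'python', 'guitar']

toy_vectors = {
    # entry i is 1 when the word occurs in document i, else 0
    word: [int(word in document) for document in toy_documents]
    for word in toy_vocabulary
}

print(toy_vectors['java'])
# [1, 1, 0]
```

A word that appears in a single document, such as `guitar` here, gets a vector with exactly one non-zero entry.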

In [13]:
# let us see the vector for a sample word; you can also change the word below to see its one-hot vector representation
word = 'students'
print(word + ':', vectors[word].T)

students: [[1 0 0 0 1 1]]


In [14]:
# we can see the vector representations for all the words in most_common set of words
for word in most_common:
    print(word + ':', vectors[word].T)

python: [[1 0 0 0 1 1]]
also: [[1 0 1 1 0 0]]
college: [[1 0 0 0 0 1]]
mathematics: [[0 1 0 0 1 1]]
applications: [[0 0 1 0 0 0]]
structures: [[1 0 0 0 1 1]]
research: [[0 1 1 0 0 1]]
group: [[0 1 1 0 0 0]]
java: [[1 1 1 1 1 1]]
worked: [[0 1 1 1 0 0]]
guitar: [[0 0 0 0 0 1]]
students: [[1 0 0 0 1 1]]
trinity: [[0 0 0 0 0 1]]
theory: [[0 1 1 0 0 0]]
plectrum: [[0 0 0 0 0 1]]
auckland: [[0 1 0 0 0 1]]
data: [[1 1 0 0 1 1]]
london: [[0 0 0 0 0 1]]
geometry: [[0 1 0 0 0 0]]
participated: [[0 0 0 1 0 0]]
com: [[1 1 0 0 0 1]]
cern: [[0 0 1 1 0 1]]
requests: [[0 0 0 1 0 0]]
university: [[0 1 0 0 1 1]]
algorithms: [[1 1 0 0 1 1]]


## 7. Making the Table of Words & Vector Representations
We now create a $6 \times 25$ table in which each row is a document and each column is a word, so every column holds the respective word's vector.

In [17]:
table = pandas.DataFrame(data=vectors)
print(table)

   python  also  college  mathematics  applications  structures  research  \
0       1     1        1            0             0           1         0   
1       0     0        0            1             0           0         1   
2       0     1        0            0             1           0         1   
3       0     1        0            0             0           0         0   
4       1     0        0            1             0           1         0   
5       1     0        1            1             0           1         1   

   group  java  worked  ...  auckland  data  london  geometry  participated  \
0      0     1       0  ...         0     1       0         0             0   
1      1     1       1  ...         1     1       0         1             0   
2      1     1       1  ...         0     0       0         0             0   
3      0     1       1  ...         0     0       0         0             1   
4      0     1       0  ...         0     1       0         0    