# Removing Stop Words
## Anish Sachdeva (DTU/2K16/MC/013)
## Natural Language Processing - Dr. Seba Susan

## 1. Importing the NLTK Library and Functions
We need to import the [nltk](https://www.nltk.org/) library, which is heavily used in natural language preprocessing, along with some specific functions and classes from it that will help us with our data preprocessing.

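> Note: If the following imports fail, most likely one or more libraries are not installed on your machine. Open your terminal and run `pip install nltk`. The `pickle` and `collections` modules are part of the Python standard library and need no installation. The `word_tokenize` function used later also needs NLTK's Punkt tokenizer data, which can be fetched with `nltk.download('punkt')` (newer NLTK releases may ask for `punkt_tab` instead).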

In [65]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import pickle
from collections import Counter

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
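
If the imports in this section fail, a quick way to see which package is actually missing is to ask Python whether it can locate each one. The sketch below is only an illustration using the standard `importlib` module; the helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'nltk' is third-party; 'pickle' and 'collections' ship with Python itself.
print(missing_packages(['nltk', 'pickle', 'collections']))
```

An empty list means everything is importable; if `nltk` shows up in the list, it still needs to be installed with pip.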


## 2. Importing the Corpus
We will use standard Python functions to read our __resume.txt__ file, which contains the text of my current resume.

In [18]:
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read()
print('The resume file has the following data')
print(resume)

The resume file has the following data
Anish Sachdeva
Software Developer + Clean Code Enthusiast

Phone : 8287428181
email : anish_@outlook.com
home : sandesh vihar, pitampura, new delhi - 110034
date of birth : 7th April 1998
languages : English, Hindi, French

Work Experience
What After College (4 months)
Delhi, India
Creating content to teach Core Java and Python with Data Structures and Algorithms and giving online classes to students

Summer Research Fellow at University of Auckland (2 Months)
Auckland, New Zealand
Worked on Geometry of Mobius Transformations, Differential Grometry under Dr. Pedram Hekmati at the Department of Mathematics, University of Auckland

Software Developer at CERN (14 Months)
CERN, Geneva, Switzerland
Worked in the core Platforms team of the FAP-BC group. Part of an agile team of developers that maintains and adds core functionality to applications used internally at CERN by HR, Financial, Administrative and other departments including Scientific
Worked o

## 3. Tokenizing the Resume File
There are 3 ways to tokenize the resume corpus we have opened.
1. We create our own function that uses Python's built-in `str.split()` method to iterate over our file and extract tokens.
1. The second way is to use the built-in `word_tokenize` function in the `nltk.tokenize` package.
1. The third way is to create our own tokenizer using the `nltk.RegexpTokenizer` class.

> Note: Here the term token refers to a word. In NLP jargon, a word is referred to as a token, and in this notebook and in NLP in general the 2 terms are used interchangeably.

We will implement all 3 and see their respective advantages and disadvantages.

### 3.1 Custom Tokenizer

In [19]:
def custom_tokenizer(document):
    tokens = []
    for sentence in document.split('\n'):
        for word in sentence.split():
            tokens.append(word)
    return tokens

In [20]:
# Testing our Function
print(custom_tokenizer('this is a simple message string. Will this work?'))

['this', 'is', 'a', 'simple', 'message', 'string.', 'Will', 'this', 'work?']


We can see from the above example that our function works but makes mistakes when tokens are attached to punctuation. Words such as __work?__ should appear as the separate tokens `['work', '?']`, but the Python `split()` method splits only on whitespace and so cannot separate tokens from punctuation.

The above code can be shortened, and the punctuation problem fixed, by using the built-in `word_tokenize()` function defined in `nltk.tokenize`.
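
As an aside, the punctuation problem of the custom tokenizer can also be addressed with nothing more than the standard `re` module. The following is a minimal sketch, not a replacement for `word_tokenize`; the function name `regex_tokenizer` and the pattern are our own choices for illustration.

```python
import re

def regex_tokenizer(document):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches a single character that is neither a word character nor whitespace,
    # i.e. one punctuation mark at a time.
    return re.findall(r"\w+|[^\w\s]", document)

print(regex_tokenizer('this is a simple message string. Will this work?'))
# ['this', 'is', 'a', 'simple', 'message', 'string', '.', 'Will', 'this', 'work', '?']
```

This simple pattern does not know about contractions or abbreviations, so it should not be expected to match `word_tokenize` on every input.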

### 3.2 Using the `word_tokenize` Function 

In [22]:
print(word_tokenize('this is a simple message string. Will this work?'))

['this', 'is', 'a', 'simple', 'message', 'string', '.', 'Will', 'this', 'work', '?']


We can clearly see that this is a cleaner API and that it performed much better than our custom function, as it was able to separate punctuation from the tokens. But in the token extraction step for our corpus we also need to remove punctuation, and to accomplish that we create our own tokenizer using the `nltk.RegexpTokenizer` class.

In [32]:
# Note: Emojis are also treated as single tokens and are retained in the tokenization process by the word_tokenize function
print(word_tokenize("this is soo cool 😎 let's celebrate"))

['this', 'is', 'soo', 'cool', '😎', 'let', "'s", 'celebrate']


### 3.3 Using the `nltk.RegexpTokenizer` Class

In [23]:
tokenizer = nltk.RegexpTokenizer(r'\w+')

In [24]:
# testing the tokenizer (you can also test any string of your choice here)
print(tokenizer.tokenize('this is a simple message string. Will this work?'))

['this', 'is', 'a', 'simple', 'message', 'string', 'Will', 'this', 'work']


We can see that all punctuation is gone. We can use this tokenizer on our resume: it will remove punctuation and give us tokens in their original case.

> Note: The Tokenizer doesn't change the case of the tokens.

### 3.4 Tokenizing the Resume
Before tokenizing, we must convert our text data to lowercase.

In [25]:
resume = resume.lower()
# Using the nltk.RegexpTokenizer built before
tokens = tokenizer.tokenize(resume)

In [30]:
print('The number of tokens are:', len(tokens))
print('The first 40 tokens are:', tokens[:40])

The number of tokens are: 363
The first 40 tokens are: ['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', '8287428181', 'email', 'anish_', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', '110034', 'date', 'of', 'birth', '7th', 'april', '1998', 'languages', 'english', 'hindi', 'french', 'work', 'experience', 'what', 'after', 'college', '4', 'months', 'delhi', 'india', 'creating']


We can see that only tokens made of letters, digits and underscores (the characters matched by `\w`) have been retained, and all punctuation tokens like __.__ and __,__ have been removed. This step is beneficial because, when we wish to extract the important information from a text/resume or run sentiment analysis, punctuation contributes little to the information or the sentiment.
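
The lowercasing step in this section matters for the frequency analysis we run later: without it, differently cased spellings of the same word would be counted as different tokens. A small illustration, using a made-up word list rather than the resume itself:

```python
from collections import Counter

# Hypothetical tokens, chosen only to show the effect of casing
words = ['Java', 'java', 'JAVA', 'Python', 'python']

case_sensitive = Counter(words)
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive['java'])    # 1
print(case_insensitive['java'])  # 3
```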

In [29]:
# Note: Emojis are also removed using the tokenizer we have built
print(tokenizer.tokenize('amazing news 🎉 this is sooo awesome 😀🍰'))

['amazing', 'news', 'this', 'is', 'sooo', 'awesome']
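
To see why the `\w+` pattern behaves this way, we can try it directly with Python's `re` module. This is only a sketch: `re.findall` is used here as a stand-in for `tokenizer.tokenize`, on the assumption that both apply the pattern in the same way, and the helper name `word_tokens` is ours.

```python
import re

def word_tokens(text):
    # \w matches letters, digits and the underscore; everything else acts as a separator
    return re.findall(r'\w+', text)

# The underscore survives, while '@' and '.' split the email address apart
print(word_tokens('email : anish_@outlook.com'))
# ['email', 'anish_', 'outlook', 'com']

# An emoji (written here as an escape sequence) is not a word character, so it is dropped
print(word_tokens('amazing news \U0001F389 awesome'))
# ['amazing', 'news', 'awesome']
```

This also explains why `anish_`, `outlook` and `com` appear as three separate tokens in the resume output above.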


## 4. Removing Stopwords From the Tokens
Stopwords are words that serve a grammatical purpose in a language but, from an information point of view, do not add a lot of meaning to the corpus. These words can usually be removed without hampering information extraction or sentiment analysis, although some entries on the list (negations such as `not`, for example) can matter for sentiment. Some examples of these words are `['i', 'me', 'my', 'myself', 'yours', ....]` and many more.

In [33]:
# load the stopwords from the nltk corpus 
stopwords_en = stopwords.words('english') 

In [35]:
# print first 20 stopwords
print('The first 20 stopwords are:', stopwords_en[:20])

The first 20 stopwords are: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [36]:
print('The total number of stopwords in the english language as defined by nltk are:', len(stopwords_en))

The total number of stopwords in the english language as defined by nltk are: 179


We will now convert the stopwords list into a set so that we can check whether a word is a stopword in constant time on average, instead of scanning the whole list.

In [37]:
stopwords_en = set(stopwords_en)

In [38]:
# testing whether a word is a stop word or not - you can modify this to test for any word
print('i' in stopwords_en)

True
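
The benefit of the set can be made visible with a rough timing experiment. The sketch below uses a synthetic word list (not the NLTK stopwords) and the standard `timeit` module; the exact numbers depend on the machine, so none are quoted here.

```python
import timeit

# Synthetic vocabulary, much larger than the stopword list, to make the effect visible
words_list = [f'word{i}' for i in range(10_000)]
words_set = set(words_list)
target = 'word9999'  # last element: the worst case for a linear scan of the list

# Time 1,000 membership checks against each container
list_time = timeit.timeit(lambda: target in words_list, number=1_000)
set_time = timeit.timeit(lambda: target in words_set, number=1_000)

print(f'list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s')
```

With only 179 stopwords the absolute saving per lookup is small, but the check is repeated for every token in the corpus, so the set is still the natural choice.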


In [40]:
# Creating the updated tokens list after removing all stopwords 
stop_words_removed_tokens = [token for token in tokens if token not in stopwords_en]
print('The number of tokens after removing stopwords:', len(stop_words_removed_tokens))
print('First 20 tokens:', stop_words_removed_tokens[:20])

The number of tokens after removing stopwords: 288
First 20 tokens: ['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', '8287428181', 'email', 'anish_', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', '110034']


In [42]:
print('First 30 tokens (with stopwords)', tokens[:30], '\n')
print('First 30 tokens (without stopwords)', stop_words_removed_tokens[:30])

First 30 tokens (with stopwords) ['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', '8287428181', 'email', 'anish_', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', '110034', 'date', 'of', 'birth', '7th', 'april', '1998', 'languages', 'english', 'hindi', 'french'] 

First 30 tokens (without stopwords) ['anish', 'sachdeva', 'software', 'developer', 'clean', 'code', 'enthusiast', 'phone', '8287428181', 'email', 'anish_', 'outlook', 'com', 'home', 'sandesh', 'vihar', 'pitampura', 'new', 'delhi', '110034', 'date', 'birth', '7th', 'april', '1998', 'languages', 'english', 'hindi', 'french', 'work']


Spotting the difference between the two token lists in the above cell is tough (the stopword 'of' is missing from the second), but comparing the number of tokens clearly shows that removing stopwords has reduced the token count.

In [44]:
print('Number of tokens (with stopwords):', len(tokens))
print('number of tokens (without stopwords):', len(stop_words_removed_tokens))

Number of tokens (with stopwords): 363
number of tokens (without stopwords): 288
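
Two small additions make the comparison easier to read: listing the tokens that were actually removed, and expressing the reduction as a percentage. The sketch below uses a short token list and a made-up mini stopword set (`stopwords_demo`) so that it runs on its own; the counts 363 and 288 are the ones printed above.

```python
tokens_demo = ['date', 'of', 'birth', '7th', 'april', '1998']
stopwords_demo = {'of', 'the', 'and'}  # hypothetical mini stopword set

# Tokens that the stopword filter would drop
removed = [token for token in tokens_demo if token in stopwords_demo]
print(removed)  # ['of']

def percent_reduction(before, after):
    """Percentage by which the token count shrank, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)

print(percent_reduction(363, 288))  # 20.7
```

So roughly a fifth of the resume's tokens were stopwords.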


## 5. Combining All Steps To Create a Pipeline
We can now create a single Python function that takes a file location as input and returns a tokenized version of the file in lowercase with punctuation and stopwords removed.

In [97]:
# to tokenize a given string document
def tokenize(document):
    document = document.lower()
    tokenizer = nltk.RegexpTokenizer(r'\w+')
    tokens = [token for token in tokenizer.tokenize(document) if token not in stopwords_en]
    return tokens

In [59]:
# to tokenize a given file given a file path
def tokenize_file(file_location):
    with open(file_location, 'r') as file:
        document = file.read()
    return tokenize(document)

In [60]:
print(tokenize_file('../assets/do-not-go-gentle-into-that-good-night.txt'))

['go', 'gentle', 'good', 'night', 'old', 'age', 'burn', 'rave', 'close', 'day', 'rage', 'rage', 'dying', 'light', 'though', 'wise', 'men', 'end', 'know', 'dark', 'right', 'words', 'forked', 'lightning', 'go', 'gentle', 'good', 'night', 'good', 'men', 'last', 'wave', 'crying', 'bright', 'frail', 'deeds', 'might', 'danced', 'green', 'bay', 'rage', 'rage', 'dying', 'light', 'wild', 'men', 'caught', 'sang', 'sun', 'flight', 'learn', 'late', 'grieved', 'way', 'go', 'gentle', 'good', 'night', 'grave', 'men', 'near', 'death', 'see', 'blinding', 'sight', 'blind', 'eyes', 'could', 'blaze', 'like', 'meteors', 'gay', 'rage', 'rage', 'dying', 'light', 'father', 'sad', 'height', 'curse', 'bless', 'fierce', 'tears', 'pray', 'go', 'gentle', 'good', 'night', 'rage', 'rage', 'dying', 'light']


In [62]:
# sentence consisting of many stopwords
print(tokenize('To be, or not to be. That is the question.'))

['question']


In [64]:
# We can finally call the tokenize_file function on our resume file and save the tokens in a pickle file for further
# analytics and discussion
resume_tokens = tokenize_file('../assets/resume.txt')
with open('../assets/resume.p', 'wb') as token_file:
    pickle.dump(resume_tokens, token_file)

## 6. Analytics
We will perform the following analytics on our document:
1. To calculate unique words in a document.
1. To calculate frequency of words in a document.
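
Both measures map directly onto standard Python types: a `set` for the unique words and `collections.Counter` for the frequencies. A toy illustration on a handful of tokens (taken from the poem output above, not from the resume):

```python
from collections import Counter

toy_tokens = ['rage', 'rage', 'dying', 'light', 'rage', 'rage', 'dying', 'light']

toy_unique = set(toy_tokens)         # each distinct token once
toy_frequency = Counter(toy_tokens)  # token -> number of occurrences

print(len(toy_tokens))               # 8
print(len(toy_unique))               # 3
print(toy_frequency.most_common(1))  # [('rage', 4)]
```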

In [66]:
# loading tokens from pickle file
with open('../assets/resume.p', 'rb') as token_file:
    resume_tokens = pickle.load(token_file)
unique_tokens = set(resume_tokens)
frequency = Counter(resume_tokens)

In [90]:
print('Total Number of Tokens:', len(resume_tokens))
print('Number of Unique Tokens:', len(unique_tokens), '\n')
print('Frequency of 15 most common tokens: \n', frequency.most_common(15))

Total Number of Tokens: 288
Number of Unique Tokens: 214 

Frequency of 15 most common tokens: 
 [('java', 7), ('worked', 6), ('com', 4), ('delhi', 4), ('months', 4), ('core', 4), ('data', 4), ('structures', 4), ('university', 4), ('cern', 4), ('algorithms', 3), ('auckland', 3), ('mathematics', 3), ('computer', 3), ('https', 3)]


After running frequency analysis, we see that the top skills and experiences listed on my resume are Java, Data Structures and Algorithms, Mathematics and core computer skills, along with the University of Auckland and CERN. This is all important information that I would like prospective employers to know, and it would have been much harder to see had we not removed stopwords: the most frequent words would then likely have been stopwords such as 'i', 'am', 'doing', 'me' etc.

The result isn't perfect, though. We see that __worked__ also appears among our top tokens, and it doesn't convey a lot of information. We can be pretty sure that people who have listed some experience on their resume have definitely _worked_ or _interned_ at that company, so for this particular task we can update our list of stopwords.

Ideally we should curate our list of stopwords based on the domain we are working on. We will now add some Resume/CV related domain words to the stopwords set.

In [103]:
stopwords_en.add('work')
stopwords_en.add('working')
stopwords_en.add('worked')
stopwords_en.add('intern')
stopwords_en.add('interning')
stopwords_en.add('interned')
stopwords_en.add('https')
stopwords_en.add('com')
stopwords_en.add('new')
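
An alternative to mutating the shared `stopwords_en` set is to build a separate, domain-specific set with a set union, which leaves the general-purpose list untouched for other tasks. This is a sketch of that design choice only; `base_stopwords` is a tiny hypothetical stand-in for the NLTK list.

```python
base_stopwords = {'i', 'me', 'my', 'of', 'the'}  # hypothetical stand-in for stopwords_en
resume_stopwords = {'work', 'working', 'worked', 'intern', 'interning',
                    'interned', 'https', 'com', 'new'}

# The union creates a new set; base_stopwords itself is not modified
domain_stopwords = base_stopwords | resume_stopwords

sample = ['worked', 'cern', 'new', 'delhi', 'of']
print([token for token in sample if token not in domain_stopwords])  # ['cern', 'delhi']
print('worked' in base_stopwords)  # False
```

Note that domain stopwords can have side effects: with `new` on the list, place names such as "new delhi" and "new zealand" lose their first word, which is visible in the formatted output later in this notebook.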

In [104]:
# we once again create our tokens and run analytics:
# we call the tokenize_file function on our resume file and save the tokens in a pickle file
resume_tokens = tokenize_file('../assets/resume.txt')
with open('../assets/resume.p', 'wb') as token_file:
    pickle.dump(resume_tokens, token_file)

In [105]:
# loading tokens from pickle file
with open('../assets/resume.p', 'rb') as token_file:
    resume_tokens = pickle.load(token_file)
unique_tokens = set(resume_tokens)
frequency = Counter(resume_tokens)
print('Total Number of Tokens:', len(resume_tokens))
print('Number of Unique Tokens:', len(unique_tokens), '\n')
print('Frequency of 15 most common tokens: \n', frequency.most_common(15))

Total Number of Tokens: 272
Number of Unique Tokens: 209 

Frequency of 15 most common tokens: 
 [('java', 7), ('delhi', 4), ('months', 4), ('core', 4), ('data', 4), ('structures', 4), ('university', 4), ('cern', 4), ('algorithms', 3), ('auckland', 3), ('mathematics', 3), ('computer', 3), ('software', 2), ('developer', 2), ('english', 2)]


We can now infer more skills and experiences from the frequency analysis after adding Resume/CV-based (domain-specific) stopwords.

## 7. Creating Formatted (Pretty) Output
We will create a utility function that returns the tokenized document in a format similar to the original, preserving its line breaks.

In [77]:
def tokenized_formatted(document):
    result = []
    for line in document.split('\n'):
        tokens = tokenize(line)
        result.append(' '.join(tokens))
    return '\n'.join(result)
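
The idea of the function above is to tokenize one line at a time, so that blank lines stay blank and the layout of the document survives. A self-contained illustration of the same idea follows; `demo_tokenize`, `demo_formatted` and `demo_stopwords` are hypothetical stand-ins (using `re.findall` instead of the NLTK tokenizer) so that the example runs without the rest of the notebook.

```python
import re

demo_stopwords = {'at', 'the', 'of', 'and'}  # hypothetical mini stopword set

def demo_tokenize(line):
    # lowercase, keep runs of word characters, drop stopwords
    return [t for t in re.findall(r'\w+', line.lower()) if t not in demo_stopwords]

def demo_formatted(document):
    # process each line separately and rejoin with the original line breaks
    return '\n'.join(' '.join(demo_tokenize(line)) for line in document.split('\n'))

text = 'Software Developer at CERN\n\nGeneva, Switzerland'
print(demo_formatted(text))
# software developer cern
#
# geneva switzerland
```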

In [106]:
resume_file = open('../assets/resume.txt', 'r')
resume = resume_file.read()
resume_file.close()
resume_tokenized_pretty = tokenized_formatted(resume)
print(resume_tokenized_pretty)

anish sachdeva
software developer clean code enthusiast

phone 8287428181
email anish_ outlook
home sandesh vihar pitampura delhi 110034
date birth 7th april 1998
languages english hindi french

experience
college 4 months
delhi india
creating content teach core java python data structures algorithms giving online classes students

summer research fellow university auckland 2 months
auckland zealand
geometry mobius transformations differential grometry dr pedram hekmati department mathematics university auckland

software developer cern 14 months
cern geneva switzerland
core platforms team fap bc group part agile team developers maintains adds core functionality applications used internally cern hr financial administrative departments including scientific
legacy applications comprise single times multiple frameworks java spring boot hibernate java ee also google polymer 1 0 jsp client side
maintained cern electronic document handing system application 1m loc comprising multiple framewo

In [107]:
# We now store this output in a pickle file
with open('../assets/resume_tokenized.p', 'wb') as output_file:
    pickle.dump(resume_tokenized_pretty, output_file)

In [108]:
# We can view the contents of the file using the below code
with open('../assets/resume_tokenized.p', 'rb') as input_file:
    file = pickle.load(input_file)
print(file)

anish sachdeva
software developer clean code enthusiast

phone 8287428181
email anish_ outlook
home sandesh vihar pitampura delhi 110034
date birth 7th april 1998
languages english hindi french

experience
college 4 months
delhi india
creating content teach core java python data structures algorithms giving online classes students

summer research fellow university auckland 2 months
auckland zealand
geometry mobius transformations differential grometry dr pedram hekmati department mathematics university auckland

software developer cern 14 months
cern geneva switzerland
core platforms team fap bc group part agile team developers maintains adds core functionality applications used internally cern hr financial administrative departments including scientific
legacy applications comprise single times multiple frameworks java spring boot hibernate java ee also google polymer 1 0 jsp client side
maintained cern electronic document handing system application 1m loc comprising multiple framewo
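
Since the formatted output is an ordinary string, it could also be written to a plain text file that can be opened in any editor, instead of a pickle file. The sketch below shows this as an alternative, not as what the notebook does; the file name and the temporary directory are assumptions made so the example is safe to run anywhere.

```python
import os
import tempfile

formatted = 'anish sachdeva\nsoftware developer clean code enthusiast'

# Hypothetical output location inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'resume_tokenized.txt')
    with open(path, 'w', encoding='utf-8') as text_file:
        text_file.write(formatted)
    # Read the file back to confirm the round trip preserves the text
    with open(path, 'r', encoding='utf-8') as text_file:
        restored = text_file.read()

print(restored == formatted)  # True
```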

## 8. Discussion
We have seen that after removing the stopwords from the resume, our number of words has gone down considerably (from 363 to 288 tokens with the default list), and that the words we removed added little meaning to our text. Large companies that receive many resumes will want to search them using keywords such as `Java`, `Python`, `Web Development` etc., and words such as `i`, `me`, `mine` are superfluous for that purpose.

So, by removing these stopwords we have actually made our corpus more information-dense, and any further task we might perform, such as converting these words into embeddings or any other Machine Learning/Deep Learning task, will now be done on a smaller corpus and hence should run faster.

Considering the above advantages, removing stopwords is a very beneficial pre-processing step.