# Homework #1

In this homework you will analyze job descriptions from a number of different fields. The idea is that these job descriptions might contain both jargon words and jargon phrases.

The challenge here will be to analyze the text of the included job descriptions, but to also compare the words and phrases there with a reference set. In this case, we will use Reuters news articles as a background corpus to compare our possible jargon text with.

This homework requires that you read in the text of the job descriptions and tokenize it. You will then compare the resulting tokens to the Reuters corpus, both as individual tokens and as bigrams.
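
As a rough sketch of that comparison (the token lists and the `unique_terms` helper below are made up for illustration, and plain Python sets stand in for whatever NLTK structures you end up using), bigrams can be built by pairing each token with its successor, and the terms unique to the job descriptions then fall out of a set difference:

```python
def unique_terms(job_tokens, background_tokens):
    """Return (unigrams, bigrams) that occur in job_tokens but not in background_tokens."""
    # Pair each token with the one that follows it to form bigrams.
    job_bigrams = set(zip(job_tokens, job_tokens[1:]))
    background_bigrams = set(zip(background_tokens, background_tokens[1:]))
    unigrams = set(job_tokens) - set(background_tokens)
    bigrams = job_bigrams - background_bigrams
    return unigrams, bigrams


# Toy data, not taken from the real corpora.
job_tokens = ["fracture", "mechanics", "and", "corrosion", "evaluation"]
background_tokens = ["mechanics", "and", "evaluation", "of", "corrosion"]

unigrams, bigrams = unique_terms(job_tokens, background_tokens)
print(sorted(unigrams))
print(sorted(bigrams))
```

Note that a bigram such as `("corrosion", "evaluation")` can be unique to the job descriptions even though both of its words appear in the background tokens, which is why the two comparisons are worth running separately.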

You need not look at the frequency of the terms. We are aiming for just term differences, so simply reporting the tokens that appear only in the job descriptions will be sufficient. One key thing to consider is which kinds of tokens you want to report. For example, the job descriptions might contain numbers and other non-word tokens, and you generally would not want to report numbers. You might also want to consider lowercasing the text.
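
One possible way to make that decision explicit is a small filter function. The sketch below is only an illustration: `keep_token` and `normalize` are hypothetical names, and keeping purely alphabetic tokens is just one of several reasonable rules.

```python
def keep_token(token):
    """Hypothetical filter: keep purely alphabetic tokens, dropping numbers and symbols."""
    return token.isalpha()


def normalize(tokens):
    """Lowercase the tokens and discard those rejected by keep_token."""
    return [token.lower() for token in tokens if keep_token(token)]


# Toy tokens for illustration only.
raw_tokens = ["Engineering", "40", "R&D", "2-3", "Testing", "~"]
cleaned = normalize(raw_tokens)
print(cleaned)
```

A rule this strict also discards tokens such as `R&D`, which may well be jargon, so it is worth checking what the filter throws away before settling on it.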

If you'd like, you can also try stemming or lemmatizing the text.

The code has been built around NLTK, but you could just as easily do this with spaCy.

In [30]:
# here we will import necessary libraries for using NLTK
import nltk
import nltk.data
from os import listdir
from os.path import isfile, join
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import bigrams

# download the NLTK resources used below (this must come after `import nltk`)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
treebank_tokenizer = TreebankWordTokenizer()
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [31]:
dir_base = "data"


####
# Notice: We are reusing code from class notes... remember these kinds of building blocks
####

def read_file(filename):
    # use a context manager so the file handle is closed after reading
    with open(filename, encoding='utf-8') as input_file:
        return input_file.read()

    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        print(file_text)
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# here we will generate the list that contains all the files and their contents
text_corpus = read_directory_files(dir_base)
print(text_corpus)

Dominion Engineering, Inc. (DEI; domeng.com) is a small (~40-person) company that supports the commercial energy industry in the US and abroad with technology, laboratory R&D testing, and consulting. The working environment at DEI is close-knit and professional, but not overly formal. Typical project teams are 2-3 persons working together and in collaboration with the Customer. Newer employees work under the general mentorship of more senior engineers, while still maintaining fairly autonomous roles, roles that may evolve over time to meet emergent needs.

One of DEI’s areas of expertise is degradation of nuclear power plant materials. This position would provide engineering analysis support to DEI project managers and subject matter experts for materials degradation projects and may also be called upon to provide support to other emergent DEI projects. Engineering analysis areas will include corrosion evaluation, fracture mechanics, and microstructural characterization, as well as dev

In [42]:
def process_description(job_description_object):
    job_description = job_description_object["content"]

    # a set makes the stop word lookups below fast
    stopword = set(stopwords.words("english"))
    # lowercase, then drop digits and ASCII punctuation character by character
    job_description = job_description.lower()
    job_description = ''.join(char for char in job_description if not char.isdigit())
    job_description = ''.join(char for char in job_description if char not in punctuation)

    # tokenizing job description
    tokens = treebank_tokenizer.tokenize(job_description)
    # eliminating stop words
    tokens = [word for word in tokens if word not in stopword]

    # lemmatizing the tokens
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in tokens]

    return lemmatized_words

In [43]:
# looping over the contents of job description
all_job_description_words = []
for job_description in text_corpus:
    all_job_description_words.extend(process_description(job_description))

In [44]:
len(all_job_description_words)

3908

In [55]:
# bigram frequencies for the job descriptions
job_bigrams = nltk.bigrams(all_job_description_words)
text_freq_bi = nltk.FreqDist(job_bigrams)
text_freq_bi

In [46]:
from nltk.corpus import reuters

In [53]:
all_reuters_words = []
# using all files in the Reuters corpus
for doc_id in reuters.fileids():
    # apply the same cleaning, tokenizing, and lemmatizing steps as for the job descriptions
    reuters_text = reuters.raw(doc_id)
    all_reuters_words.extend(process_description({"content": reuters_text}))

In [48]:
len(all_reuters_words)

855324

In [54]:
# bigram frequencies for the Reuters corpus
reuters_bigrams = nltk.bigrams(all_reuters_words)
text_freq_bi_r = nltk.FreqDist(reuters_bigrams)
text_freq_bi_r

FreqDist({('mln', 'dlrs'): 4405, ('v', 'mln'): 3949, ('mln', 'v'): 3923, ('ct', 'v'): 3387, ('ct', 'net'): 2247, ('v', 'ct'): 1934, ('v', 'loss'): 1784, ('billion', 'dlrs'): 1662, ('rev', 'mln'): 1619, ('net', 'v'): 1579, ...})

In [50]:
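# comparing words in all_job_description_words to those in all_reuters_words
# a set gives fast membership tests against the large Reuters token list
reuters_vocabulary = set(all_reuters_words)
jargon = []
seen = set()
for word in all_job_description_words:
    # keep each word once, in order of first appearance
    if word not in reuters_vocabulary and word not in seen:
        seen.add(word)
        jargon.append(word)

print(jargon)
print(len(jargon))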

['domengcom', 'closeknit', 'collaboration', 'mentorship', 'emergent', '’', 'fracture', 'microstructural', 'characterization', 'graduating', 'bachelor', 'skillsattributes', '·', 'coursework', 'metallic', 'microstructures', 'hrswk', 'usable', 'parental', 'investmentgrade', 'unparalleled', 'rampup', 'recordexpansion', 'footprint', 'signon', '“', '”', 'worldclass', 'performancebased', 'allinclusive', 'spouse', 'firstyear', 'rookie', 'workplace', 'credentialing', 'ancc', 'credential', 'accreditation', 'transcript', 'validation', 'personify', 'aprnvalidationanaorg', 'uploads', 'templated', 'followsup', 'faculty', 'assigning', 'registrar', 'directorassistant', 'concurrently', 'passion', 'rewarding', 'stellar', 'interviewing', 'entrepreneurial', 'thrive', 'selfmotivated', 'fundraising', 'excellence', 'perk', 'seasoned', 'lap', 'uncapped', 'loyalty', 'localstate', 'disseminate', 'healthrelated', 'collegiate', 'residency', 'recruiter', 'informationtraining', 'soldierleader', 'noncontributory', '

In [51]:
jargon.sort()
print(jargon)

['accomplishing', 'accreditation', 'adaptable', 'adaptableflexible', 'affectively', 'agility', 'aligning', 'allinclusive', 'analysisdata', 'ancc', 'answered', 'approachable', 'aprnvalidationanaorg', 'aspiration', 'assertive', 'assigning', 'babs', 'bachelor', 'bpr', 'breadth', 'calculator', 'capturemigration', 'certificationregistrationlicensure', 'cgi', 'changeorders', 'characterization', 'charting', 'clarksburg', 'click', 'closeknit', 'coining', 'collaboration', 'collegeschool', 'collegiate', 'commissionbased', 'communitybased', 'compassionate', 'competitivenatured', 'concurrently', 'converse', 'coursework', 'credential', 'credentialing', 'crm', 'customerfocus', 'customize', 'cutover', 'cuttingedge', 'cws', 'decimal', 'deploying', 'description', 'desktopdefense', 'detailoriented', 'diploma', 'directorassistant', 'disability', 'discovering', 'disseminate', 'dod', 'domengcom', 'drawingscontract', 'dynamaps', 'email', 'embrace', 'emergent', 'empathetic', 'empower', 'enlightening', 'enlis

# Analysis of your results

Below this cell, please put a short writeup of your approach and comments on your results. The goal here is to explain how well you think your method worked based on looking at some of your output data. Additionally, please describe things you might do differently or ways in which you might improve the process if you were given more time.

## Method used:
- I extracted the content of each job description, then reduced noise by lowercasing all letters and removing digits and punctuation.
- Then I tokenized the content and removed all the stop words.
- After that I lemmatized the tokens. I chose lemmatization over stemming because a lemma is an actual word of the language.
- I applied the same steps to the Reuters corpus.

## Comparing the lists of tokens:
- After generating the two token lists, one for the job descriptions and one for Reuters, I compared them.
- I did the comparison with a for loop and if conditions.
- This produced the final list of jargon words.
- The method was successful in identifying most of the jargon words, but a few words in the list would not be considered jargon.

## Improving the method:
- The jargon list contains 268 words.
- Despite removing punctuation, the list still contains a few punctuation marks (for example ’, “, ” and ·), so I would look for a method that removes all of them.
- Ordinary words such as "answered", "calculator" and "georgetown" also ended up in my jargon list.
- To eliminate such words, I would compare against a larger set of files, or possibly a corpus that covers a wider variety of information.
- I did not use bigrams in this analysis, so my next step would probably be to explore them.
- Furthermore, I might look into some form of cross-validation, since there may be jargon words that the algorithm did not pick up.
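
On the leftover punctuation: `string.punctuation` only covers ASCII characters, so marks such as ’, “, ” and · pass through the filter, and deleting hyphens and slashes outright fuses words together (as in `closeknit` and `skillsattributes`). One possible alternative, sketched here under the assumption that splitting on punctuation is acceptable, is to replace every character in a Unicode punctuation or symbol category with a space. The `strip_punctuation` helper and the sample string are illustrative and not part of the notebook above.

```python
import unicodedata


def strip_punctuation(text):
    """Replace punctuation and symbol characters (Unicode categories P* and S*) and digits with spaces."""
    cleaned = []
    for char in text:
        category = unicodedata.category(char)
        if category.startswith(("P", "S")) or char.isdigit():
            cleaned.append(" ")
        else:
            cleaned.append(char)
    return "".join(cleaned)


# Illustrative sample built from phrases that appear in the job descriptions.
sample = "close-knit team · “world-class” skills/attributes (2-3 persons)"
tokens = strip_punctuation(sample.lower()).split()
print(tokens)
```

The trade-off is that `close-knit` now becomes two ordinary tokens that are likely to occur in Reuters, so hyphenated jargon would have to be recovered through a bigram comparison instead.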