<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction:-Word2Vec-Training-for-Job-Descriptions" data-toc-modified-id="Introduction:-Word2Vec-Training-for-Job-Descriptions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction: Word2Vec Training for Job Descriptions</a></span><ul class="toc-item"><li><span><a href="#Dataset" data-toc-modified-id="Dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Python-Library" data-toc-modified-id="Python-Library-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Python Library</a></span></li></ul></li><li><span><a href="#Data-Set-Loading-and-Cleaning-Up" data-toc-modified-id="Data-Set-Loading-and-Cleaning-Up-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Set Loading and Cleaning Up</a></span><ul class="toc-item"><li><span><a href="#Load-Job-Description-CSV-Data" data-toc-modified-id="Load-Job-Description-CSV-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Job Description CSV Data</a></span></li><li><span><a href="#Clean-Up" data-toc-modified-id="Clean-Up-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Clean Up</a></span></li></ul></li><li><span><a href="#Word-Embedding-Training-with-Word2Vec" data-toc-modified-id="Word-Embedding-Training-with-Word2Vec-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Word Embedding Training with Word2Vec</a></span><ul class="toc-item"><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Word-Embedding-Training-with-Gensim" data-toc-modified-id="Word-Embedding-Training-with-Gensim-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Word Embedding Training with Gensim</a></span></li></ul></li></ul></div>

# Introduction: Word2Vec Training for Job Descriptions

In this notebook, I train word2vec word embeddings on ~2.5 million job descriptions (from the Cloudera production rf_job_description table). I use NLTK for preprocessing and gensim's Word2Vec for training, and save the trained model as a binary file.

## Dataset

The dataset is a historical collection of job descriptions stored in the file "JobDescriptions_2.5M.csv".

## Python Library

In [1]:
# pandas and NumPy for loading and manipulating the job description data
import pandas as pd
import numpy as np
# Make the random numbers predictable
np.random.seed(42)
import multiprocessing
cpu_count = multiprocessing.cpu_count()

In [2]:
# Allow multiple output/display from one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
from gensim.models import Word2Vec
import nltk
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
# Stop Word Removal
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
from sklearn.metrics import accuracy_score, classification_report



In [5]:
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/ivan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to /home/ivan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw to /home/ivan/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

[nltk_data] Downloading package punkt to /home/ivan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Data Set Loading and Cleaning Up

## Load Job Description CSV Data

In [6]:
df = pd.read_csv('../JobDescriptions_2.5M.csv', header='infer')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2452237 entries, 0 to 2452236
Data columns (total 3 columns):
req_guid           object
job_title          object
job_description    object
dtypes: object(3)
memory usage: 56.1+ MB


In [7]:
print("The total amount of training data on job descriptions is: ", len(df))

The total amount of training data on job descriptions is:  2452237


## Clean Up

In [8]:
# Remove any rows with a missing value in any column (req_guid, job_title or job_description)
df.dropna(inplace=True)

In [9]:
# Prepend the job title to the job description so that the title becomes part of the description text
df['job_description'] = df['job_title'] + " " + df['job_description']

In [10]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2307199 entries, 0 to 2452236
Data columns (total 3 columns):
req_guid           object
job_title          object
job_description    object
dtypes: object(3)
memory usage: 70.4+ MB


| | req_guid | job_title | job_description |
|---|---|---|---|
| 0 | 15983255 | Handler | Handler Ability to lift 75 lbs Ability to mane... |
| 1 | 14382131 | Receiver day shift | Receiver day shift Contract role may be taken... |
| 2 | 15255153 | NET Developer PCI Compliance | NET Developer PCI Compliance Our client is lo... |
| 3 | HPJP00081328 | US Business Analyst 3 | US Business Analyst 3 TEC to Genuent Transfer ... |
| 4 | SSBK26782-1 | Team Member 3 Years | Team Member 3 Years Job Description TBD |


# Word Embedding Training with Word2Vec

In this step, raw text data will be transformed into feature vectors. I will implement the following steps in order to obtain relevant features from the dataset.

* Tokenization
* Stop word removal (shown but not applied)
* Lemmatization (rather than stemming, since stemming can reduce interpretability)
* Word embedding training

## Tokenization

Tokenization is the process of dividing text into smaller parts called tokens so that each token can be processed further for machine learning purposes. A token can be a character, a word, a sentence, or a paragraph. In this notebook, I only consider words as tokens.
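
As a minimal sketch of why the choice of split matters, the snippet below compares splitting on a single space with splitting on any run of whitespace. The sample string is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample text containing a double space (not from the dataset)
text = "Receiver day shift  Contract role"

# Splitting on a single space keeps an empty string where the space is doubled
tokens_single_space = text.lower().split(" ")
print(tokens_single_space)

# Splitting on any run of whitespace produces no empty tokens
tokens_whitespace = text.lower().split()
print(tokens_whitespace)
```

With a single-space split, the empty tokens have to be filtered out in a later step.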

In [11]:
# Tokenize the job description and title.
# NLTK's word_tokenize function could be used to split the job description field
# into word and punctuation tokens, like below:
# df['job_description'] = df.apply(lambda row: word_tokenize(row.job_description), axis=1)
# Since the job descriptions have already been cleaned, I just lowercase the text and use
# the Python string split function; empty tokens from repeated spaces are dropped later.
df['job_description'] = df["job_description"].str.lower()
df['job_description'] = df["job_description"].str.split(" ")

In [12]:
df.head()

| | req_guid | job_title | job_description |
|---|---|---|---|
| 0 | 15983255 | Handler | [handler, ability, to, lift, 75, lbs, ability,... |
| 1 | 14382131 | Receiver day shift | [receiver, day, shift, , contract, role, may, ... |
| 2 | 15255153 | NET Developer PCI Compliance | [, net, developer, pci, compliance, our, clien... |
| 3 | HPJP00081328 | US Business Analyst 3 | [us, business, analyst, 3, tec, to, genuent, t... |
| 4 | SSBK26782-1 | Team Member 3 Years | [team, member, 3, years, , , job, description,... |


__Stopword Removal__ (not used)

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that an NLP program is set up to ignore. The cell below shows how the NLTK stop word list could be used to remove stop words from the job description field; it is commented out and not applied in this notebook.

In [13]:
# # Get stopwords list from NLTK library
# stop_words = stopwords.words('english')
# # Define a function to remove any stop words from input text
# def removeStopWords(x):
#         return [w.lower() for w in x if (w not in stop_words) and (w != '') and (w is not None)]
# # Apply the defined function to remove stop words for job descriptions
# df['job_description'] = df.apply(lambda row: removeStopWords(row.job_description), axis=1)
# # Show some results
# df.head()

## Lemmatization

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. I will use the NLTK WordNetLemmatizer to convert words into their lemmas. Note that the code below passes pos='v', so every token is lemmatized as a verb.

In [14]:
# Define the lemmatization function using NLTK's WordNetLemmatizer.
# The lemmatizer is created once and reused for every row.
lemmatizer = WordNetLemmatizer()
def lemma(x):
    # pos='v' lemmatizes every token as a verb; empty or missing tokens are dropped
    return [lemmatizer.lemmatize(w.lower(), pos='v') for w in x if w]
# Apply the function to every job description
df['job_description'] = df['job_description'].apply(lemma)
# Show some results
df.head()

| | req_guid | job_title | job_description |
|---|---|---|---|
| 0 | 15983255 | Handler | [handler, ability, to, lift, 75, lbs, ability,... |
| 1 | 14382131 | Receiver day shift | [receiver, day, shift, contract, role, may, be... |
| 2 | 15255153 | NET Developer PCI Compliance | [net, developer, pci, compliance, our, client,... |
| 3 | HPJP00081328 | US Business Analyst 3 | [us, business, analyst, 3, tec, to, genuent, t... |
| 4 | SSBK26782-1 | Team Member 3 Years | [team, member, 3, years, job, description, tbd] |


## Word Embedding Training with Gensim

A word embedding represents words (and, by extension, documents) as dense vectors. The position of a word in the vector space is learned from text and is based on the words that surround it when it is used, so embeddings can be trained directly on the input texts. One can read more about word embeddings [here](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/) and [here](https://jalammar.github.io/illustrated-word2vec/).
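
To make "the words that surround the word" concrete, the sketch below lists the (center, context) pairs that fall within a fixed window of each token. It is a simplified illustration of the idea only, not gensim's internal training procedure, and the helper name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Tokens taken from the start of one of the sample rows shown earlier
tokens = ["net", "developer", "pci", "compliance"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

A larger window pairs each word with more distant neighbours, so the `window` setting controls how much surrounding context influences each vector.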

In this notebook, I use the Word2Vec implementation in the gensim library to generate word embedding vectors so that they can be used later to train other models.

Gensim Word2Vec will generate a vector (dimension of 300 here) for each word after training based on all job descriptions.
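
The notebook does not say how the word vectors are passed to the later models. One common option, shown here purely as an assumption, is to average the vectors of a description's tokens into a single fixed-length vector. The toy 3-dimensional vectors and the helper `average_vector` are made up for illustration; real vectors would come from the trained model.

```python
# Toy 3-dimensional vectors standing in for trained 300-dimensional embeddings (made-up values)
toy_vectors = {
    "team": [1.0, 0.0, 2.0],
    "member": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

# "tbd" has no vector in the toy dictionary, so it is skipped
doc_vector = average_vector(["team", "member", "tbd"], toy_vectors, dim=3)
print(doc_vector)
```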

In [15]:
# Prepare all the text input for training word2vec model
sentences = df['job_description'].tolist()

In [18]:
# Define and train a word2vec model.
# Here I set the vector dimension to 300, the window (word distance) to 5,
# keep every token regardless of frequency (min_count=1),
# and use all available CPUs for parallel processing.
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`.
model_w2v = Word2Vec(sentences, size=300, window=5, min_count=1, workers=cpu_count)
# summarize vocabulary
# word_vocabulary = list(model_w2v.wv.vocab)
# print(word_vocabulary)
# save the model to disk (gensim's native save format, despite the .bin extension)
model_w2v.save('nnc_word2vec.bin')
# load the model when needed so that this word2vec model doesn't need to be re-trained
# model_w2v = Word2Vec.load('nnc_word2vec.bin')
print(model_w2v)

Word2Vec(vocab=381161, size=300, alpha=0.025)
