<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction:-NLP-Learning-for-Job-Descriptions" data-toc-modified-id="Introduction:-NLP-Learning-for-Job-Descriptions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction: NLP Learning for Job Descriptions</a></span><ul class="toc-item"><li><span><a href="#Dataset" data-toc-modified-id="Dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Python-Library" data-toc-modified-id="Python-Library-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Python Library</a></span></li></ul></li><li><span><a href="#Data-Set-Loading-and-Cleaning-Up" data-toc-modified-id="Data-Set-Loading-and-Cleaning-Up-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Set Loading and Cleaning Up</a></span><ul class="toc-item"><li><span><a href="#Load-Job-Description-CSV-Data" data-toc-modified-id="Load-Job-Description-CSV-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Job Description CSV Data</a></span></li><li><span><a href="#Clean-Up" data-toc-modified-id="Clean-Up-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Clean Up</a></span></li></ul></li><li><span><a href="#Text-Feature-Engineering" data-toc-modified-id="Text-Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Text Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Stopword-Removal" data-toc-modified-id="Stopword-Removal-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Stopword Removal</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Word-Embedding-Vectors-with-Gensim" data-toc-modified-id="Word-Embedding-Vectors-with-Gensim-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Word Embedding Vectors with Gensim</a></span><ul class="toc-item"><li><span><a href="#Word2Vec-Mean-Vector" data-toc-modified-id="Word2Vec-Mean-Vector-3.4.1"><span class="toc-item-num">3.4.1&nbsp;&nbsp;</span>Word2Vec Mean Vector</a></span></li><li><span><a href="#Covert-Vectors-into-Columns" data-toc-modified-id="Covert-Vectors-into-Columns-3.4.2"><span class="toc-item-num">3.4.2&nbsp;&nbsp;</span>Covert Vectors into Columns</a></span></li></ul></li></ul></li><li><span><a href="#Training-and-Testing-Data-Preparation" data-toc-modified-id="Training-and-Testing-Data-Preparation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Training and Testing Data Preparation</a></span><ul class="toc-item"><li><span><a href="#Standard-Scaling" data-toc-modified-id="Standard-Scaling-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Standard Scaling</a></span></li><li><span><a href="#Label-Encoding" data-toc-modified-id="Label-Encoding-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Label Encoding</a></span></li><li><span><a href="#Data-Splits-for-Training-and-Testing-Data-Sets" data-toc-modified-id="Data-Splits-for-Training-and-Testing-Data-Sets-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Data Splits for Training and Testing Data Sets</a></span></li></ul></li><li><span><a href="#Neural-Network-Classification" data-toc-modified-id="Neural-Network-Classification-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Neural Network Classification</a></span><ul class="toc-item"><li><span><a href="#Neural-Network-Multi-class-Classifier-Training" data-toc-modified-id="Neural-Network-Multi-class-Classifier-Training-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Neural Network Multi-class Classifier Training</a></span></li><li><span><a href="#Training-Validation" data-toc-modified-id="Training-Validation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Training Validation</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Introduction: NLP Learning for Job Descriptions

In this notebook, I will demonstrate how to process unstructure text feature data by using natural language processing (NLP) technologies and train a classification model with those text features. I will use NLTK and gensim word2vec to do text feature engineering and then use multi-class neural network classifier to train a classification model.

## Dataset

The dataset is a historical data of job descriptions stored as "job_descriptions.csv" file.

## Python Library

In [1]:
# Pandas and numpy for converting from Spark dataframe into Pandas dataframe
import pandas as pd
import numpy as np
# Make the random numbers predictable
np.random.seed(42)
import multiprocessing
cpu_count = multiprocessing.cpu_count()

In [2]:
# Allow multiple output/display from one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
from gensim.models import Word2Vec
import nltk
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
# Stop Word Removal
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
from sklearn.metrics import accuracy_score, classification_report



In [5]:
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/ivan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to /home/ivan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw to /home/ivan/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

[nltk_data] Downloading package punkt to /home/ivan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Data Set Loading and Cleaning Up

## Load Job Description CSV Data

In [6]:
df = pd.read_csv('./job_descriptions.csv', header='infer')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127784 entries, 0 to 127783
Data columns (total 8 columns):
req_guid                  127784 non-null object
original_hcs_code         127784 non-null object
original_hcs_level        127784 non-null object
updated_assigned_hcs      127784 non-null object
updated_assigned_level    127784 non-null int64
level_indicator           127784 non-null int64
job_title                 127783 non-null object
job_description           127784 non-null object
dtypes: int64(2), object(6)
memory usage: 7.8+ MB


  interactivity=interactivity, compiler=compiler, result=result)


In [7]:
print("The total amount of training data on job descriptions is: ", len(df))

The total amount of training data on job descriptions is:  127784


## Clean Up

In [8]:
# remove any rows without job_description
df.dropna(inplace=True)

In [9]:
# Combine simply job description and title so that job title is a part of job description
df['job_description'] = df['job_title'] + " " + df['job_description']

In [10]:
# Prepare final data set for next step of text feature engineering
df = df[['req_guid', 'updated_assigned_hcs', 'job_title', 'job_description']]

In [11]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 127783 entries, 0 to 127783
Data columns (total 4 columns):
req_guid                127783 non-null object
updated_assigned_hcs    127783 non-null object
job_title               127783 non-null object
job_description         127783 non-null object
dtypes: object(4)
memory usage: 4.9+ MB


Unnamed: 0,req_guid,updated_assigned_hcs,job_title,job_description
0,15795912,00-000,Warehouse Material Handler 1st Shift TEMP 6 mo,Warehouse Material Handler 1st Shift TEMP 6 mo...
1,15370412,00-000,Structural Welder,Structural Welder Welding structural steel wit...
2,15735158,00-000,Construction Laborer,Construction Laborer This individual will be r...
3,15797256,00-000,Wrapping,Wrapping Must Have Ability to lift 50lbs palle...
4,15469997,00-000,Operations Representative,Operations Representative Executes the busines...


# Text Feature Engineering

In this step, raw text data will be transformed into feature vectors. I will implement the following steps in order to obtain relevant features from the dataset.

* Tokenizing
* Remove stop words
* Lemmatization (not stem since stemming can reduce the interpretability) 
* Word Embeddings as features

## Tokenization

Tokenization is the process by dividing the quantity of text into smaller parts called tokens so that each token can be further treated for machine learning purposes. A token can be a character, a word, a sentence or a paragraph. In this notebook, I only consider words as tokens.

In [12]:
# Tokenize the job description and title
# I can use NLTK word_tokenize function to process the job description field (by removing punctuations 
# and separating words) like below
df['job_description'] = df.apply(lambda row: word_tokenize(row.job_description), axis=1)
# Or I can just use python string split function to separate text since the job description has been cleaned
df['job_description'] = df["job_description"].str.lower()
df['job_description'] = df["job_description"].str.split(" ")

In [13]:
df.head()

Unnamed: 0,req_guid,updated_assigned_hcs,job_title,job_description
0,15795912,00-000,Warehouse Material Handler 1st Shift TEMP 6 mo,"[warehouse, material, handler, 1st, shift, tem..."
1,15370412,00-000,Structural Welder,"[structural, welder, welding, structural, stee..."
2,15735158,00-000,Construction Laborer,"[construction, laborer, this, individual, will..."
3,15797256,00-000,Wrapping,"[wrapping, must, have, ability, to, lift, 50lb..."
4,15469997,00-000,Operations Representative,"[operations, representative, executes, the, bu..."


## Stopword Removal

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a NLP program has been programmed to ignore. In this notebook, I will use NLTK stop words dataset to remove any stop words in job description field.

In [14]:
# Get stopwords list from NLTK library
stop_words = stopwords.words('english')
# Define a function to remove any stop words from input text
def removeStopWords(x):
        return [w.lower() for w in x if (w not in stop_words) and (w != '') and (w is not None)]
# Apply the defined function to remove stop words for job descriptions
df['job_description'] = df.apply(lambda row: removeStopWords(row.job_description), axis=1)
# Show some results
df.head()

Unnamed: 0,req_guid,updated_assigned_hcs,job_title,job_description
0,15795912,00-000,Warehouse Material Handler 1st Shift TEMP 6 mo,"[warehouse, material, handler, 1st, shift, tem..."
1,15370412,00-000,Structural Welder,"[structural, welder, welding, structural, stee..."
2,15735158,00-000,Construction Laborer,"[construction, laborer, individual, responsibl..."
3,15797256,00-000,Wrapping,"[wrapping, must, ability, lift, 50lbs, palleti..."
4,15469997,00-000,Operations Representative,"[operations, representative, executes, busines..."


## Lemmatization

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. I will use NLTK lemmatization function to convert words into their lemma.

In [15]:
# Define lemmatization function by using NLTK WordNetLemmatizer function
def lemma(x):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w.lower(), pos='v') for w in x if (w != '') and (w is not None)]
# Apply the defined function to process job descriptions
df['job_description'] = df.apply(lambda row: lemma(row.job_description), axis=1)
# Show some results
df.head()

Unnamed: 0,req_guid,updated_assigned_hcs,job_title,job_description
0,15795912,00-000,Warehouse Material Handler 1st Shift TEMP 6 mo,"[warehouse, material, handler, 1st, shift, tem..."
1,15370412,00-000,Structural Welder,"[structural, welder, weld, structural, steel, ..."
2,15735158,00-000,Construction Laborer,"[construction, laborer, individual, responsibl..."
3,15797256,00-000,Wrapping,"[wrap, must, ability, lift, 50lbs, palletizing..."
4,15469997,00-000,Operations Representative,"[operations, representative, execute, business..."


## Word Embedding Vectors with Gensim

A word embedding is a form of representing words and documents using a dense vector representation. The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. Word embeddings can be trained using the input texts. One can read more about word embeddings [here](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/) and [here](https://jalammar.github.io/illustrated-word2vec/).

In this notebook, I am going to use gensim library Word2Vec functionality to generate word embedding vectors so that I can use those vectors later on to train the classification model.

### Word2Vec Mean Vector

Gensim Word2Vec will generate a vector (dimension of 100 here) for each word after training based on all job descriptions. So, I will define a function to get average (mean) vectors for a job description.

In [16]:
# Prepare all the text input for training word2vec model
sentences = df['job_description'].tolist()

In [17]:
# Define and train a word2vec model. 
# Here I set vector dimension size to be 100, window (word distanse) to be 5 
# and use all available CPUs for parallel processing
model_w2v = Word2Vec(sentences, size=100, window=5, min_count=1, workers=cpu_count)
# summarize vocabulary
# word_vocabulary = list(model_w2v.wv.vocab)
# print(word_vocabulary)
# save model with binary format
model_w2v.save('nnc_word2vec.pkl')
# load model when needed so that this word2vec model doesn't need to be re-trained
# model_w2v = Word2Vec.load('nnc_word2vec.pkl')
print(model_w2v)

Word2Vec(vocab=55287, size=100, alpha=0.025)


In [18]:
# Define a function to get average (mean) vectors for a job description based on trained word2vec model
def get_mean_vectors(words):
    # remove out of vocabulary words
    words = [word for word in words if word in model_w2v.wv.vocab]
    if len(words) >= 1:
        return np.mean(model_w2v[words], axis=0)
    else:
        return []

In [19]:
# Apply the function to generate 100 dimension vectors for each job description in the data set
df['job_description_vectors'] = df.apply(lambda row: get_mean_vectors(row.job_description), axis=1) 
# Show some sample results
df.head()

  


Unnamed: 0,req_guid,updated_assigned_hcs,job_title,job_description,job_description_vectors
0,15795912,00-000,Warehouse Material Handler 1st Shift TEMP 6 mo,"[warehouse, material, handler, 1st, shift, tem...","[0.067955986, -0.052602034, 0.108059414, -0.27..."
1,15370412,00-000,Structural Welder,"[structural, welder, weld, structural, steel, ...","[1.5541024, -1.2365535, -0.10899881, 1.2434216..."
2,15735158,00-000,Construction Laborer,"[construction, laborer, individual, responsibl...","[-0.167428, -0.9671807, 0.015980506, -0.235558..."
3,15797256,00-000,Wrapping,"[wrap, must, ability, lift, 50lbs, palletizing...","[-0.2389299, 0.036015894, -0.20679249, -0.1900..."
4,15469997,00-000,Operations Representative,"[operations, representative, execute, business...","[0.035469715, -0.04396796, -0.034893647, 0.377..."


Actually, I can use gensim library Doc2Vec function to generate vectors as well.  However, I don't choose this method due to relatively too large files that Doc2Vec function will produce (see the screen shot below). Any interested in the discussion on Word2Vec vs. Doc2Vec, it can be refer to this [link](https://datascience.stackexchange.com/questions/20076/word2vec-vs-sentence2vec-vs-doc2vec)

![Title 1](./Capture.PNG)

In [19]:
# Prepare tagged documents for training
descriptions = df['job_description'].tolist()
tags = df['req_guid'].tolist()
docs = []
for i in range(len(tags)):
    docs.append(doc2vec.TaggedDocument(words=descriptions[i], tags=["Train-" + str(tags[i])]))
    
docs[:1]

In [21]:
# Setup doc2vec model which is similar to word2vec model setup
model_d2v = doc2vec.Doc2Vec(dm=0, vector_size=100, window=5, min_count=1, workers=cpu_count)
# summarize vocabulary
model_d2v.build_vocab(docs)
# train model
model_d2v.train(docs, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
# save model with binary format
model_d2v.save('doc2vec.pkl')

In [22]:
# Load doc2vec model when needed so that the model don't have to be re-trained
model_d2v = doc2vec.Doc2Vec.load('doc2vec.pkl')

In [23]:
# Get the job description vectors from the trained doc2vec model by using infer_vector function
df['job_description_vectors'] = df.apply(lambda row: model_d2v.infer_vector(row.job_description), axis=1)
# or directly retrieve from tagged documents
df['job_description_vectors'] = df.apply(lambda row: model_d2v.docvecs["Train-" + str(row.req_guid)], axis=1) 

### Covert Vectors into Columns

Here, I am going to convert job description vectors into columns so that I can easily generate training and testing data.

In [20]:
# Put job description vectors as columns
series = df['job_description_vectors'].apply(lambda x : np.array(x)).as_matrix().reshape(-1,1)
w2v = np.apply_along_axis(lambda x : x[0], 1, series)
w2v_df = pd.DataFrame(w2v)
final_df = pd.concat([df, w2v_df], axis=1)

  


In [21]:
# Name label field so that the classification model knows which field is the target (label) one
final_df.rename(columns={'updated_assigned_hcs':'label'}, inplace=True)
# Prepare final data table by removing unuseful columns
final_df.drop(columns=['req_guid','job_title', 'job_description', 'job_description_vectors'], inplace=True)

In [22]:
# Remove any NULL or blank rows
final_df.dropna(inplace=True)

In [23]:
# Show some sample results
final_df.head()

Unnamed: 0,label,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,00-000,0.067956,-0.052602,0.108059,-0.274452,-0.445171,0.184035,-0.415023,-0.282931,0.2427,...,0.276811,-0.602863,0.552518,0.552918,-0.268735,0.196824,-0.345696,0.906765,-0.207538,-0.412793
1,00-000,1.554102,-1.236554,-0.108999,1.243422,-1.27314,-0.707512,0.086695,1.114845,0.540642,...,-0.78499,-0.105541,0.763936,-1.061563,0.216444,-0.51471,0.317936,1.307202,-1.117071,0.313168
2,00-000,-0.167428,-0.967181,0.015981,-0.235559,-0.347159,-0.294594,-0.496681,0.486209,-0.905325,...,-0.789382,-0.309183,0.445784,0.726108,0.15822,-0.92346,0.015126,-0.397063,-1.572622,0.203792
3,00-000,-0.23893,0.036016,-0.206792,-0.190075,-1.207528,0.260064,-0.293271,0.37671,-0.099776,...,-0.576839,-1.660701,0.790664,-0.204595,-0.127094,-1.202703,0.254096,1.396883,0.24286,0.493747
4,00-000,0.03547,-0.043968,-0.034894,0.377821,-0.157486,0.530797,-0.689256,-1.000499,0.215361,...,0.117709,0.898091,0.442654,0.131698,0.141214,0.057979,0.185774,-0.812538,0.350675,0.269515


# Training and Testing Data Preparation

In [24]:
# Get numpy array on feature set X and target set y
X = np.array(final_df.drop(columns=['label']))

y = np.array(final_df['label'])

In [25]:
# Double check their shape
X.shape
y.shape

(127782, 100)

(127782,)

## Standard Scaling

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. neural network classifier). People usually can standardize features by removing the mean and scaling to unit variance. In this notebook, I will use StandardScaler function to do that.

In [26]:
# standardize the vectors
X = StandardScaler().fit_transform(X)
X

array([[-0.147318  ,  0.08925455,  0.09173775, ...,  0.8263443 ,
         0.22614478, -0.7334884 ],
       [ 3.0366647 , -2.2223854 , -0.5396993 , ...,  1.3645416 ,
        -1.4959618 ,  0.7712565 ],
       [-0.65161455, -1.6964407 , -0.17612602, ..., -0.9260334 ,
        -2.3585    ,  0.5445462 ],
       ...,
       [ 1.2356126 , -2.3783293 , -0.24089916, ...,  0.2772672 ,
        -0.22560264,  1.4271886 ],
       [ 0.23332882,  1.2109891 ,  0.48697844, ..., -0.3224733 ,
        -0.06261613,  2.0399702 ],
       [-0.30823442,  1.4374849 , -1.8495783 , ..., -2.1731262 ,
         2.2189806 , -1.0544403 ]], dtype=float32)

In [27]:
y

array(['00-000', '00-000', '00-000', ..., '00-000', '00-000', '00-000'],
      dtype=object)

## Label Encoding

Holds the label for each class. Encode categorical features using a one-hot or ordinal encoding scheme. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels. In this notebook, I am going to use LabelEncoder function to generate indexes for the target label field.

In [28]:
# Encoding the lebels
tmp = y
y = LabelEncoder().fit_transform(y)
y

array([0, 0, 0, ..., 0, 0, 0])

In [29]:
# Save the mapping between index and labels so that I can refer back to labels when get predicted results.
df_label = pd.DataFrame({'label':tmp, 'label_index':y})
df_label.drop_duplicates().to_csv('./hcs_label_index.csv', index = None, header=True)

## Data Splits for Training and Testing Data Sets

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

print(len(X_train))
print(len(X_test))

102225
25557


# Neural Network Classification

## Neural Network Multi-class Classifier Training

In [31]:
# Define a neural netowork multi-class classifier model
# Two hidden layers are set to 100 and 85
# Activation function is set to relu
# Solver function is set to adam
# Learning rate is set to adaptive, which means it will automatically adjust from initial rate
# Early stopping is set to true so that it won't waste time when not much improvement can be achieved
# Alpha is set to 0.01 so that some degree of overfitting can be prevented
# verbose is set to true so that all steps of training can be displayed
est = MLPClassifier(activation='relu', \
                    hidden_layer_sizes=(100,85),\
                    solver='adam',\
                    learning_rate='adaptive',\
                    max_iter=2000,\
                    learning_rate_init=0.0001,\
                    early_stopping=True,\
                    tol=0.000001,\
                    verbose=True,\
                    alpha=0.01)

In [32]:
print("Training Neural Network MLP-Classifier...")
nnc = est.fit(X, y)
print(nnc)

Training Neural Network MLP-Classifier...
Iteration 1, loss = 0.81464265
Validation score: 0.912826
Iteration 2, loss = 0.42569925
Validation score: 0.917051
Iteration 3, loss = 0.38669795
Validation score: 0.919634
Iteration 4, loss = 0.36450521
Validation score: 0.922529
Iteration 5, loss = 0.35019385
Validation score: 0.926051
Iteration 6, loss = 0.34008885
Validation score: 0.928398
Iteration 7, loss = 0.33246543
Validation score: 0.930902
Iteration 8, loss = 0.32694455
Validation score: 0.928946
Iteration 9, loss = 0.32222512
Validation score: 0.931137
Iteration 10, loss = 0.31833392
Validation score: 0.931450
Iteration 11, loss = 0.31510674
Validation score: 0.931998
Iteration 12, loss = 0.31194218
Validation score: 0.931920
Iteration 13, loss = 0.30915078
Validation score: 0.931763
Iteration 14, loss = 0.30669303
Validation score: 0.931841
Iteration 15, loss = 0.30454031
Validation score: 0.931685
Iteration 16, loss = 0.30244810
Validation score: 0.932780
Iteration 17, loss = 0.

## Training Validation

In [33]:
# Check classification accuracy on training data
print("Accuracy on Training Data are: ", accuracy_score(nnc.predict(X_train), y_train))

Accuracy on Training Data are:  0.9395940327708486


In [34]:
# Check classification accuracy on testing data 
print("Accuracy on Training Data are: ", accuracy_score(nnc.predict(X_test), y_test))

Accuracy on Training Data are:  0.9393903822827405


In [35]:
# Check classification (confusion matrix) report on overall data set
print(classification_report(y, nnc.predict(X)))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97    115892
           1       0.67      0.67      0.67      3332
           2       0.61      0.29      0.40       102
           3       0.66      0.40      0.50       677
           4       0.55      0.50      0.52       563
           5       0.62      0.22      0.33       238
           6       0.57      0.30      0.39        87
           7       0.61      0.51      0.56       424
           8       0.73      0.45      0.56       353
           9       0.72      0.55      0.62       477
          10       0.70      0.63      0.67      1480
          11       0.75      0.59      0.66       400
          12       0.00      0.00      0.00        13
          13       0.00      0.00      0.00        38
          14       0.62      0.19      0.29       292
          15       0.71      0.62      0.67       554
          16       0.64      0.47      0.54       152
          17       0.00    

  _warn_prf(average, modifier, msg_start, len(result))


In [36]:
# Save the model into a file so that it can be used later on without re-training the model
joblib.dump(nnc, './nnc.pkl')

['./nnc.pkl']

In [37]:
# Load the classification model
loaded_model = joblib.load('./nnc.pkl')
# Predict and score the data set with the loaded model
result = loaded_model.score(X_test, y_test)
print("Accuracy on Testing Data are: ", result)

Accuracy on Testing Data are:  0.9393903822827405


# Summary

In this notebook I went through major steps to demonstrate how a NLP data science project can be implemented. 

During this project implementation, I used pandas, gensim and scikit-learn libraries running on a local machine. That is, the running time can be an exponential growth on the size of training data set. This issue can be addressed by using multiprocessing or dask libraries to enhance parallel processing.

Another scalable solution for this problem is to use Spark (with pyspark library) on a Hadoop cluster (a group of multiple servers). However, it also means the NLP models will be depended on the spark distributed environment. So, this approach is probably only applicable or feasible when a big data volume of training data set needs to be considered.