<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction:-NLP-Learning-for-Job-Descriptions" data-toc-modified-id="Introduction:-NLP-Learning-for-Job-Descriptions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction: NLP Learning for Job Descriptions</a></span><ul class="toc-item"><li><span><a href="#Dataset" data-toc-modified-id="Dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Python-Library" data-toc-modified-id="Python-Library-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Python Library</a></span></li></ul></li><li><span><a href="#Data-Set-Loading-and-Cleaning-Up" data-toc-modified-id="Data-Set-Loading-and-Cleaning-Up-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Set Loading and Cleaning Up</a></span><ul class="toc-item"><li><span><a href="#Load-Job-Description-CSV-Data" data-toc-modified-id="Load-Job-Description-CSV-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Job Description CSV Data</a></span></li><li><span><a href="#Clean-Up" data-toc-modified-id="Clean-Up-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Clean Up</a></span></li></ul></li><li><span><a href="#Text-Feature-Engineering" data-toc-modified-id="Text-Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Text Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Stopword-Removal" data-toc-modified-id="Stopword-Removal-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Stopword Removal</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Word-Embedding-Vectors-with-Gensim" data-toc-modified-id="Word-Embedding-Vectors-with-Gensim-3.4"><span 
class="toc-item-num">3.4&nbsp;&nbsp;</span>Word Embedding Vectors with Gensim</a></span><ul class="toc-item"><li><span><a href="#Word2Vec-Mean-Vector" data-toc-modified-id="Word2Vec-Mean-Vector-3.4.1"><span class="toc-item-num">3.4.1&nbsp;&nbsp;</span>Word2Vec Mean Vector</a></span></li><li><span><a href="#Covert-Vectors-into-Columns" data-toc-modified-id="Covert-Vectors-into-Columns-3.4.2"><span class="toc-item-num">3.4.2&nbsp;&nbsp;</span>Covert Vectors into Columns</a></span></li></ul></li></ul></li><li><span><a href="#Data-Preparation-for-Training-and-Testing" data-toc-modified-id="Data-Preparation-for-Training-and-Testing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Preparation for Training and Testing</a></span><ul class="toc-item"><li><span><a href="#Standard-Scaling" data-toc-modified-id="Standard-Scaling-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Standard Scaling</a></span></li><li><span><a href="#Label-Encoding" data-toc-modified-id="Label-Encoding-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Label Encoding</a></span></li><li><span><a href="#Data-Splits-for-Training-and-Testing-Data-Sets" data-toc-modified-id="Data-Splits-for-Training-and-Testing-Data-Sets-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Data Splits for Training and Testing Data Sets</a></span></li></ul></li><li><span><a href="#Neural-Network-Classification" data-toc-modified-id="Neural-Network-Classification-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Neural Network Classification</a></span><ul class="toc-item"><li><span><a href="#Neural-Network-Multi-class-Classifier-Training" data-toc-modified-id="Neural-Network-Multi-class-Classifier-Training-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Neural Network Multi-class Classifier Training</a></span></li><li><span><a href="#Training-Validation" data-toc-modified-id="Training-Validation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Training 
Validation</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Introduction: NLP Learning for Job Descriptions

In this notebook, I demonstrate how to process unstructured text data with natural language processing (NLP) techniques and train a classification model on the resulting text features. I use NLTK and gensim's word2vec for text feature engineering, and then train a multi-class neural network classifier.

## Dataset

The dataset is a historical collection of job descriptions stored in a CSV file ("job_descriptions.csv").

## Python Library

In [1]:
# pandas and numpy for loading the data and manipulating arrays
import pandas as pd
import numpy as np
# Make the random numbers predictable
np.random.seed(42)
import multiprocessing
cpu_count = multiprocessing.cpu_count()

In [2]:
# Allow multiple output/display from one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
from gensim.models import Word2Vec
import nltk
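import ssl
# Work around SSL certificate errors during nltk.download() by falling back
# to an unverified HTTPS context (note: this disables certificate checks)
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context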
    
# Stop Word Removal
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
# sklearn.externals.joblib is deprecated and removed in newer scikit-learn releases
import joblib
from sklearn.metrics import accuracy_score, classification_report



In [5]:
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/ivan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to /home/ivan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw to /home/ivan/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

[nltk_data] Downloading package punkt to /home/ivan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Data Set Loading and Cleaning Up

## Load Job Description CSV Data

In [None]:
df = pd.read_csv('./golden_training_15_18_19.csv', header='infer')

In [7]:
print("The total amount of training data on job descriptions is: ", len(df))

The total amount of training data on job descriptions is:  127893


## Clean Up

In [None]:
# Drop rows with a missing value in any column (including job_description);
# this leaves gaps in the index
df.dropna(inplace=True)
# Reset the index, since it is used later to concatenate the tables
df = df.reset_index(drop=True)
# Check that the index has been reset
df.head()
df.tail()

In [9]:
# Prepend the job title to the job description so that the title becomes part of the text
df['job_description'] = df['job_title'] + " " + df['job_description']

In [10]:
# Prepare final data set for next step of text feature engineering
df = df[['req_guid', 'updated_assigned_codes', 'job_title', 'job_description']]

# Text Feature Engineering

In this step, raw text data is transformed into feature vectors. I apply the following steps, in order, to obtain relevant features from the dataset.

* Tokenization
* Stop word removal
* Lemmatization (rather than stemming, since stemming can reduce interpretability)
* Word embeddings as features

## Tokenization

Tokenization is the process of dividing text into smaller parts called tokens so that each token can be processed further for machine learning purposes. A token can be a character, a word, a sentence or a paragraph. In this notebook, I only consider words as tokens.

I can use NLTK's `word_tokenize` function to process the job description field (it splits the text into word and punctuation tokens), like below:

```python
from nltk.tokenize import word_tokenize
df['job_description'] = df['job_description'].apply(word_tokenize)
```

Alternatively, because the job descriptions have already been cleaned, I can simply use Python's string split function to separate the words, as below.

In [12]:
# Tokenize the job description and title
df['job_description'] = df["job_description"].str.lower()
df['job_description'] = df["job_description"].str.split(" ")
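
To make the trade-off concrete, the sketch below compares splitting on a single space with a simple regular-expression tokenizer on a hypothetical sample string. The regex is only an illustrative stand-in for a punctuation-aware tokenizer; it is not NLTK's `word_tokenize`.

```python
import re

# Hypothetical sample text (not taken from the dataset)
text = "senior data engineer,  python and sql required."

# Splitting on a single space keeps punctuation attached to words and
# yields an empty string wherever two spaces appear in a row.
space_tokens = text.lower().split(" ")

# Keep only runs of letters, digits, '+' and '#', so that punctuation
# such as commas and periods is dropped.
regex_tokens = re.findall(r"[a-z0-9+#]+", text.lower())

print(space_tokens)
print(regex_tokens)
```

The empty strings produced by the plain split are the reason the later cleaning steps filter out `''` tokens.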

In [None]:
df.head()

## Stopword Removal

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that an NLP program is set up to ignore. In this notebook, I use the NLTK stop word list to remove stop words from the job description field.

In [None]:
# Get the stop word list from NLTK as a set for fast membership checks
stop_words = set(stopwords.words('english'))
# Define a function to remove stop words (and empty tokens) from a list of tokens
def removeStopWords(x):
    return [w.lower() for w in x if (w is not None) and (w != '') and (w not in stop_words)]
# Apply the function to the tokenized job descriptions
df['job_description'] = df['job_description'].apply(removeStopWords)
# Show some results
df.head()
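
The filtering itself is a simple membership test. Here is a minimal sketch, using a tiny hand-written stop list (`toy_stop_words`, hypothetical) instead of NLTK's list so that it runs without any downloads:

```python
# A tiny hand-written stop list for illustration only; NLTK's English
# list is much longer.
toy_stop_words = {"the", "a", "an", "in", "and", "of"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are non-empty and not in stop_words."""
    return [t for t in tokens if t and t not in stop_words]

tokens = ["manage", "a", "team", "of", "engineers", "", "in", "the", "office"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
```

Storing the stop words in a set keeps each lookup cheap, which matters when the function is applied to every token of every job description.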

## Lemmatization

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. I use NLTK's WordNetLemmatizer to convert words into their lemmas, treating each token as a verb (`pos='v'`).

In [None]:
# Create the NLTK WordNetLemmatizer once and reuse it for every row
lemmatizer = WordNetLemmatizer()
# Lemmatize each token as a verb (pos='v'), skipping empty tokens
def lemma(x):
    return [lemmatizer.lemmatize(w.lower(), pos='v') for w in x if (w is not None) and (w != '')]
# Apply the function to the job descriptions
df['job_description'] = df['job_description'].apply(lemma)
# Show some results
df.head()

## Word Embedding Vectors with Gensim

A word embedding represents words (and, by extension, documents) as dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings can be trained on the input texts. One can read more about word embeddings [here](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/) and [here](https://jalammar.github.io/illustrated-word2vec/).

In this notebook, I use gensim's Word2Vec to generate word embedding vectors, which are used later to train the classification model.

### Word2Vec Mean Vector

Gensim's Word2Vec produces one vector (300 dimensions here) for each word in its vocabulary. To represent a whole job description, I define a function that averages (takes the mean of) the vectors of its words.

In this project, I use a word2vec model that was pre-trained on 2.5 million job descriptions.

In [16]:
# Load the pre-trained model
word2vec_path = '../../Word2Vec_Pretrained/nnc_word2vec.bin'
model_w2v = Word2Vec.load(word2vec_path)
print(model_w2v)

Word2Vec(vocab=381161, size=300, alpha=0.025)


In [17]:
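# Define a function to get the average (mean) vector for a job description based on the trained word2vec model
def get_mean_vectors(words):
    # remove out-of-vocabulary words
    words = [word for word in words if word in model_w2v.wv]
    if len(words) >= 1:
        # index the KeyedVectors (model_w2v.wv); indexing the model directly is deprecated
        return np.mean(model_w2v.wv[words], axis=0)
    else:
        # no known words: return a NaN vector of the right length so the row
        # keeps a consistent shape and is removed by the later dropna()
        return np.full(model_w2v.vector_size, np.nan, dtype=np.float32)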

In [None]:
# Apply the function to generate 300 dimension vectors for each job description in the data set
df['job_description_vectors'] = df.apply(lambda row: get_mean_vectors(row.job_description), axis=1) 
# Show some sample results
df.head()
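
The averaging step can be illustrated without the pre-trained model. The sketch below uses a hypothetical three-dimensional embedding table (`toy_vectors`) in place of the 300-dimensional word2vec model, and returns a NaN vector when no word is known, which is one possible convention rather than the only one.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the real model
toy_vectors = {
    "data": np.array([1.0, 0.0, 2.0]),
    "engineer": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(words, vectors, dim=3):
    """Average the vectors of the known words; NaN vector if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.full(dim, np.nan)
    return np.mean(known, axis=0)

# "zzz" is out of vocabulary and is ignored in the average
doc_vec = mean_vector(["data", "engineer", "zzz"], toy_vectors)
empty_vec = mean_vector(["zzz"], toy_vectors)
print(doc_vec)
print(empty_vec)
```

Averaging discards word order, so two descriptions containing the same words in a different order get the same vector.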

The following code can be used to train a custom word2vec embedding model when needed.

```python
# Prepare all the text input for training word2vec model
sentences = df['job_description'].tolist()
# Define and train a word2vec model.
# Here I set the vector dimension to 300, the window (maximum word distance) to 5,
# and use all available CPUs for parallel processing.
# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model_w2v = Word2Vec(sentences, size=300, window=5, min_count=1, workers=cpu_count)
# summarize vocabulary
# word_vocabulary = list(model_w2v.wv.vocab)
# print(word_vocabulary)
# save the model to disk
model_w2v.save('nnc_word2vec.pkl')
# load the model when needed so that it doesn't need to be re-trained
# model_w2v = Word2Vec.load('nnc_word2vec.pkl')
print(model_w2v)
```

Alternatively, I can use gensim's Doc2Vec to train a model and generate document vectors directly. However, I did not choose this method because (1) Doc2Vec produces relatively large model files (see the screenshot below) and (2) it is better suited to document similarity comparison. Anyone interested in the Word2Vec vs. Doc2Vec discussion can refer to this [link](https://datascience.stackexchange.com/questions/20076/word2vec-vs-sentence2vec-vs-doc2vec).

![Title 1](./Capture.PNG)

Training a Doc2Vec model and generating vectors from it can be done with the following code:
```python
from gensim.models import doc2vec

# Prepare tagged documents for training
descriptions = df['job_description'].tolist()
tags = df['req_guid'].tolist()
docs = [doc2vec.TaggedDocument(words=words, tags=["Train-" + str(tag)])
        for words, tag in zip(descriptions, tags)]

# Set up the doc2vec model, which is similar to the word2vec model setup
model_d2v = doc2vec.Doc2Vec(dm=0, vector_size=100, window=5, min_count=1, workers=cpu_count)
# Build the vocabulary
model_d2v.build_vocab(docs)
# Train the model
model_d2v.train(docs, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
# Save the model to disk
model_d2v.save('doc2vec.pkl')

# Load the doc2vec model when needed so that it doesn't have to be re-trained
model_d2v = doc2vec.Doc2Vec.load('doc2vec.pkl')

# Get the job description vectors from the trained doc2vec model by using infer_vector
df['job_description_vectors'] = df['job_description'].apply(model_d2v.infer_vector)
# or retrieve them directly by tag (`docvecs` was renamed to `dv` in gensim 4.0)
df['job_description_vectors'] = df['req_guid'].apply(lambda guid: model_d2v.docvecs["Train-" + str(guid)])
```

### Covert Vectors into Columns

Here, I am going to convert job description vectors into columns so that I can easily generate training and testing data.

In [19]:
# Put job description vectors as columns
series = df['job_description_vectors'].apply(lambda x : np.array(x)).values.reshape(-1,1)
w2v = np.apply_along_axis(lambda x : x[0], 1, series)
w2v_df = pd.DataFrame(w2v)
final_df = pd.concat([df, w2v_df], axis=1)
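
An arguably simpler way to get the same wide layout is to stack the per-row vectors with `np.vstack`. The sketch below uses a hypothetical two-row frame (`toy`) and assumes every vector has the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fixed-length vector per row
toy = pd.DataFrame({
    "label": ["a", "b"],
    "vec": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Stack the row vectors into a 2-D array of shape (n_rows, n_dims)
matrix = np.vstack(toy["vec"].to_numpy())

# One column per dimension, aligned on the original index
wide = pd.concat([toy[["label"]], pd.DataFrame(matrix, index=toy.index)], axis=1)
print(wide)
```

Passing `index=toy.index` keeps the new columns aligned with the original rows, which is the reason the index was reset after the earlier `dropna`.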

In [None]:
final_df.head()

In [21]:
# Rename the target column to 'label' so it is clear which field the classification model should predict
final_df.rename(columns={'updated_assigned_codes':'label'}, inplace=True)
# Prepare the final data table by dropping columns that are no longer needed
final_df.drop(columns=['req_guid','job_title', 'job_description', 'job_description_vectors'], inplace=True)

In [22]:
# Remove any NULL or blank rows
final_df.dropna(inplace=True)

In [23]:
# Show some sample results
final_df.head()

| | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 00-000 | -0.086361 | -0.29486 | -0.699821 | 0.10251 | -0.081705 | 0.60875 | -0.724328 | -0.128486 | 0.835863 | ... | 0.342909 | -0.746965 | 0.324627 | -0.347712 | 0.166916 | 0.406738 | -0.79952 | 0.077131 | 0.359275 | 0.287279 |
| 1 | 00-000 | -0.173583 | 1.278914 | -1.200068 | 0.564437 | 0.263772 | -0.438466 | 0.271884 | -0.734823 | -0.538172 | ... | -0.337122 | 0.506212 | -0.369966 | -0.994712 | 0.451871 | 0.017072 | -0.211761 | -0.362503 | 0.997578 | 0.576597 |
| 2 | 00-000 | 0.673318 | -0.032487 | 0.432261 | -0.508771 | 0.476943 | 0.291932 | -0.509735 | 0.862125 | -0.903905 | ... | -0.135338 | 1.484898 | -0.3767 | -0.514329 | 0.851197 | -0.124478 | 0.088335 | -0.612183 | -0.592375 | 0.263966 |
| 3 | 00-000 | 0.120055 | 0.1505 | 0.023888 | 0.623277 | -0.154826 | -0.322433 | -1.134776 | 0.638786 | -0.339165 | ... | -0.049732 | 0.486116 | 0.528126 | 0.541076 | 0.17981 | -0.529596 | 0.055889 | -0.019389 | -0.363229 | 0.296873 |
| 4 | 00-000 | 0.714245 | 0.083796 | -0.238266 | -0.223597 | 0.355629 | 0.173768 | -0.725232 | 0.296629 | -0.313183 | ... | 0.235431 | 0.755302 | 0.291932 | 0.643426 | -0.024221 | 0.418336 | 1.028122 | -1.041257 | 0.196144 | -1.213655 |


# Data Preparation for Training and Testing

In [24]:
# Build numpy arrays for the feature set X and the target y
X = np.array(final_df.drop(columns=['label']))

y = np.array(final_df['label'])

In [25]:
# Double check their shape
X.shape
y.shape

(125880, 300)

(125880,)

## Standard Scaling

Standardization of a dataset is a common requirement for many machine learning estimators (a neural network classifier, for example): they might behave badly if the individual features do not look more or less like standard normally distributed data. Features are typically standardized by removing the mean and scaling to unit variance. In this notebook, I use scikit-learn's StandardScaler to do that.

In [26]:
# standardize the vectors
X = StandardScaler().fit_transform(X)
X

array([[ 0.15307766, -1.0897384 , -1.6942576 , ...,  0.3473458 ,
         0.91302884,  0.58050215],
       [-0.03016952,  2.9683032 , -2.804353  , ..., -0.5940591 ,
         2.404998  ,  1.1093311 ],
       [ 1.7491156 , -0.41319907,  0.81794024, ..., -1.128707  ,
        -1.3113598 ,  0.5378902 ],
       ...,
       [-0.14342314, -0.18321522, -1.2022563 , ..., -0.05808655,
         3.3263485 , -0.6350132 ],
       [ 0.15869135,  0.35835943, -0.33742136, ...,  0.84332585,
        -1.1395992 ,  1.2240493 ],
       [ 0.39118102, -0.6659653 , -1.193912  , ...,  0.45354998,
        -0.08339486,  1.8141719 ]], dtype=float32)

In [27]:
y

array(['00-000', '00-000', '00-000', ..., '00-000', '00-000', '00-000'],
      dtype=object)
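
One thing to keep in mind with the standardization cell above is that the scaler is fit on the full data set, so statistics from rows that later end up in the testing set influence the scaled training features. A minimal sketch of the alternative, on random toy data (`X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`), fits `StandardScaler` on the training portion only and reuses it for the testing portion:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in place of the real feature matrix
rng = np.random.RandomState(42)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_toy = rng.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Learn the mean and variance from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# Apply the same training statistics to the testing rows
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with zero mean by construction, while the testing columns are only approximately centered, which is what unseen data would look like in production.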

## Label Encoding

LabelEncoder encodes target labels as integers between 0 and n_classes-1. It can transform non-numerical labels (as long as they are hashable and comparable) into numerical labels. In this notebook, I use LabelEncoder to generate an integer index for each value of the target label field.

In [28]:
# Encode the labels (the original labels are kept in tmp for the mapping file)
tmp = y
y = LabelEncoder().fit_transform(y)
y

array([0, 0, 0, ..., 0, 0, 0])

In [29]:
# Save the mapping between label index and label so that predicted indexes can be mapped back to labels.
df_label = pd.DataFrame({'label':tmp, 'label_index':y})
df_label.drop_duplicates().to_csv('./label_index.csv', index = None, header=True)
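
Keeping the fitted encoder object around is another way to map predictions back to labels, since `inverse_transform` reverses the encoding. A small sketch with hypothetical label codes:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label codes, for illustration only
labels = ["13-2011", "00-000", "15-1132", "00-000"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; position i is the label for index i
print(list(encoder.classes_))
print(list(encoded))

# Map integer predictions back to the original label strings
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Indexes are assigned in sorted label order, so the same set of labels always yields the same mapping.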

## Data Splits for Training and Testing Data Sets

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

print(len(X_train))
print(len(X_test))

100704
25176
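
When labels are heavily imbalanced, a plain random split can leave rare classes under-represented in one of the two sets. One option, sketched below on hypothetical toy labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits (it needs at least two samples per class).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 80/20 class ratio
print(np.bincount(y_tr))
print(np.bincount(y_te))
```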


# Neural Network Classification

## Neural Network Multi-class Classifier Training

In [31]:
# Define a neural network multi-class classifier model
# Two hidden layers are set to 300 and 200 units
# Activation function is set to relu
# Solver (optimizer) is set to adam for faster convergence and better performance (the lbfgs solver caused overfitting)
# Early_stopping is set to true to prevent overfitting (10% of the training data is held out for validation by default)
# Learning rate is set to adaptive; note that scikit-learn only uses this schedule with solver='sgd', so it has no effect with adam
# Alpha (L2 regularization strength) is set to 0.01 so that some degree of overfitting can be prevented
# Batch_size is set to 128 so that each gradient update is computed on 128 training instances
# Random_state is set to 42, consistent with the initial setup
# Verbose is set to true to display information for every iteration
est = MLPClassifier(activation='relu',
                    hidden_layer_sizes=(300, 200),
                    solver='adam',
                    learning_rate='adaptive',
                    max_iter=500,
                    learning_rate_init=0.001,
                    batch_size=128,
                    tol=0.000001,
                    random_state=42,
                    early_stopping=True,
                    verbose=True,
                    alpha=0.01)
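
To see what `early_stopping=True` does in practice, here is a small sketch on synthetic data (the data set and the much smaller network are assumptions for illustration). With early stopping on, scikit-learn holds out a validation fraction of the training data (10% by default) and records one validation score per epoch in `validation_scores_`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic three-class data standing in for the real feature matrix
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=42
)

clf = MLPClassifier(
    hidden_layer_sizes=(16,),
    solver="adam",
    alpha=0.01,
    batch_size=32,
    early_stopping=True,   # hold out part of the training data for validation
    n_iter_no_change=5,    # stop after 5 epochs without enough improvement
    max_iter=100,
    random_state=42,
)
clf.fit(X_toy, y_toy)

print("epochs run:", clf.n_iter_)
print("best validation score:", max(clf.validation_scores_))
```

Training stops once the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive epochs, or when `max_iter` is reached.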

In [32]:
print("Training Neural Network MLP-Classifier...")
nnc = est.fit(X_train, y_train)
print(nnc)

Training Neural Network MLP-Classifier...
Iteration 1, loss = 0.23895872
Validation score: 0.965247
Iteration 2, loss = 0.13646807
Validation score: 0.965942
Iteration 3, loss = 0.11615888
Validation score: 0.969814
Iteration 4, loss = 0.10408511
Validation score: 0.972098
Iteration 5, loss = 0.09497245
Validation score: 0.972098
Iteration 6, loss = 0.08947659
Validation score: 0.971204
Iteration 7, loss = 0.08344035
Validation score: 0.974084
Iteration 8, loss = 0.08043861
Validation score: 0.972694
Iteration 9, loss = 0.07744376
Validation score: 0.974680
Iteration 10, loss = 0.07567926
Validation score: 0.975573
Iteration 11, loss = 0.07334694
Validation score: 0.973985
Iteration 12, loss = 0.07051221
Validation score: 0.975871
Iteration 13, loss = 0.06924813
Validation score: 0.974978
Iteration 14, loss = 0.06974275
Validation score: 0.973091
Iteration 15, loss = 0.06719382
Validation score: 0.975375
Iteration 16, loss = 0.06495366
Validation score: 0.975871
Iteration 17, loss = 0.

## Training Validation

In [33]:
# Check classification accuracy on training data
print("Accuracy on Training Data is: ", accuracy_score(y_train, nnc.predict(X_train)))

Accuracy on Training Data is:  0.9932475373371464


In [34]:
# Check classification accuracy on testing data
print("Accuracy on Testing Data is: ", accuracy_score(y_test, nnc.predict(X_test)))

Accuracy on Testing Data is:  0.9766047028916428


In [35]:
# Print the per-class precision, recall and F1 report on the testing data set
print(classification_report(y_test, nnc.predict(X_test)))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99     21509
           1       0.92      0.94      0.93       605
           2       1.00      0.40      0.57        25
           3       0.80      0.88      0.84       152
           4       0.86      0.63      0.73       125
           5       0.47      0.62      0.53        34
           6       0.89      0.85      0.87        20
           7       0.92      0.91      0.91        96
           8       0.94      0.89      0.92        74
           9       0.78      0.95      0.86        88
          10       0.95      0.95      0.95       286
          11       0.92      0.89      0.91        82
          12       0.33      0.50      0.40         2
          13       0.50      0.50      0.50         6
          14       0.79      0.67      0.73        55
          15       0.90      0.80      0.85       111
          16       0.94      0.71      0.81        42
          17       1.00    

  _warn_prf(average, modifier, msg_start, len(result))
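
Because class 0 dominates the testing set, overall accuracy mostly reflects that one class. As a rough illustration on hypothetical toy labels (not the notebook's data), macro-averaged F1 weights every class equally, while weighted F1 follows the class sizes:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: class 0 is the majority class
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
# Unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Mean of the per-class F1 scores weighted by class support
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

A macro F1 noticeably below the accuracy is a hint that the smaller classes are being classified less reliably than the majority class.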


In [36]:
print("Training Neural Network MLP-Classifier on whole data set...")
nnc = est.fit(X, y)
print(nnc)

Training Neural Network MLP-Classifier on whole data set...
Iteration 1, loss = 0.22232934
Validation score: 0.963854
Iteration 2, loss = 0.13176461
Validation score: 0.968462
Iteration 3, loss = 0.11196658
Validation score: 0.963298
Iteration 4, loss = 0.10196866
Validation score: 0.971163
Iteration 5, loss = 0.09240039
Validation score: 0.972196
Iteration 6, loss = 0.08956471
Validation score: 0.970448
Iteration 7, loss = 0.08350056
Validation score: 0.972593
Iteration 8, loss = 0.07999116
Validation score: 0.974182
Iteration 9, loss = 0.07834055
Validation score: 0.971322
Iteration 10, loss = 0.07666497
Validation score: 0.975056
Iteration 11, loss = 0.07272874
Validation score: 0.974420
Iteration 12, loss = 0.07298363
Validation score: 0.974738
Iteration 13, loss = 0.07129260
Validation score: 0.975612
Iteration 14, loss = 0.06935033
Validation score: 0.975691
Iteration 15, loss = 0.06793387
Validation score: 0.973308
Iteration 16, loss = 0.06860305
Validation score: 0.976565
Itera

In [37]:
# Save the model into a file so that it can be used later on without re-training the model
joblib.dump(nnc, './nnc.pkl')

['./nnc.pkl']

In [38]:
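# Load the classification model
loaded_model = joblib.load('./nnc.pkl')
# Score the loaded model on X_test to confirm that the save/load round trip works.
# Note: the model was just refit on the whole data set, so X_test is no longer
# unseen data and this score is not a held-out test accuracy.
result = loaded_model.score(X_test, y_test)
print("Accuracy on Testing Data is: ", result)

Accuracy on Testing Data is:  0.992333968859231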
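
The saved model expects features scaled the same way as during training, but the `StandardScaler` fitted earlier is not saved. One way to keep the two together, sketched here on hypothetical toy data, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and persist that single object with joblib:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in place of the real features and labels
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(120, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

# The scaler and the classifier are fit, saved and loaded as one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", MLPClassifier(hidden_layer_sizes=(8,), max_iter=200, random_state=42)),
])
pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline scales raw features itself before predicting
same = np.array_equal(pipe.predict(X_toy), restored.predict(X_toy))
print("predictions identical after reload:", same)
```

With this layout, new job descriptions only need to be converted to mean vectors before calling `predict`; the scaling is applied inside the pipeline.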


# Summary

In this notebook, I went through the major steps to demonstrate how an NLP data science project can be implemented.

During this project, I used the pandas, gensim and scikit-learn libraries on a single local machine. As a result, the running time can grow quickly as the training data set gets larger. This can be mitigated by using libraries such as multiprocessing or dask for parallel processing.

Another scalable solution is to use Spark (with the pyspark library) on a Hadoop cluster (a group of multiple servers). However, this also means the NLP models will depend on the Spark distributed environment, so this approach is probably only worthwhile when a very large training data set needs to be handled.