# Topic modelling with spaCy + scikit-learn

1. **Motivation**: After researching some practical applications of topic modelling,
I looked for an approachable, logical path for performing such tasks on real-world data.
_Scikit-learn_ and _spaCy_ both offer features that are easy to understand and implement
for effective [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)
and [Topic Modelling](https://en.wikipedia.org/wiki/Topic_model).

2. **Data set**: Data Science Job listings from [Indeed](https://www.indeed.com)

3. **References**: 
    - [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
        - Scikit-learn LDA -- [sklearn.decomposition.LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
    - [Vectorizing](https://mc.ai/machine-learning%E2%80%8A-%E2%80%8Anlp-vectorization-techniques/)
        - Scikit-learn CountVectorizer -- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
        - spaCy word-embedding -- [spaCy word vectors](http://mlreference.com/word-vectors-spacy)
        - Scikit-learn TF-IDF Vectorizer -- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
    - Visualization
        - [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html)
        

In [1]:
# Imports
import pandas as pd

# sklearn imports
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# spaCy imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [2]:
# Load data
JOBS = pd.read_csv('job_scrape.csv')
JOBS.head()

| | desc | titles |
|---|---|---|
| 0 | With one application you can be considered for... | Data Engineer |
| 1 | About FulgentFulgent is a leader in genetic an... | Backend developer for Bioinformatics Software ... |
| 2 | With one application you can be considered for... | Data Warehouse Engineer |
| 3 | We are GMAD (Globe Marketing And Advertising D... | Data Engineer |
| 4 | What We DoDolphin (withdolphin.com) is a free ... | Data Science Intern |


In [3]:
# Get basic information about data
print(JOBS.shape)

(89, 2)


In [4]:
# Curious what the most common job titles are...
TOP_5 = JOBS['titles'].value_counts().head(5)
print('\n'.join(TOP_5.index))

Data Scientist
Data Engineer
Data Warehouse Engineer
Data Science Intern
Staff Data Scientist


In [5]:
# Take a look at some of our stop words
STOP_WORDS = list(STOP_WORDS)
print('\n'.join(STOP_WORDS[:10]))

nine
’d
where
been
namely
so
thence
i
us
in


## Step 1 - Clean and tokenize

In [6]:
# Load spacy model
NLP = spacy.load('en_core_web_lg')

In [7]:
# Write function to clean and tokenize each document in the dataframe
def tokenize(doc):
    """ Takes a single document, drops punctuation, stop words
        and digits, and returns the lowercased lemmas of the
        remaining tokens joined by spaces.
    """
    parsed_doc = NLP(doc)
    tokens = [
        word.lemma_.lower() for word in parsed_doc
        if not (word.is_punct or word.is_stop or word.is_digit)
    ]
    return ' '.join(tokens)

In [8]:
# Tokenize the job descriptions
TOKENS = JOBS['desc'].apply(tokenize)

In [9]:
# Write into dataframe
JOBS['tokens'] = TOKENS

In [10]:
JOBS['tokens']

0     application consider thousand tech role lead c...
1     fulgentfulgent leader genetic genomic clinical...
2     application consider thousand tech role lead c...
3     gmad globe marketing advertising distributors ...
4     dodolphin withdolphin.com free resource help f...
5     myers media group join myers media group inter...
6     fun work company people truly believe   commit...
7     cwds emphasize user center design collaboratio...
8     join world roadway smart safe   join true pass...
9     rs21rs21 lead datum science visualization comp...
10    carta design product transform way hospital us...
11    company mission highlights mpulse mobile leade...
12    mission data science team mist juniper company...
13    fun work company people truly believe   commit...
14    northrop grumman develop cut edge technology p...
15    datum technology drastically transform investm...
16    intuit hire staff data scientist focus consume...
17    application consider thousand tech role le

## Step 2 - Represent as vectors

LDA, like matrix decompositions such as Singular Value Decomposition, operates on numbers rather than text, so each document must first be converted into a numeric vector.
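
As a minimal sketch of what converting documents into vectors means for the count-based approach used in Step 3, the snippet below builds a small document-term count matrix by hand. It is plain Python for illustration only: `build_vocabulary` and `count_vectors` are hypothetical helper names, the tokens are simply split on whitespace, and scikit-learn's `CountVectorizer` does the same job (with its own tokenization rules) far more efficiently.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    return {token: idx for idx, token in enumerate(vocab)}


def count_vectors(docs, vocabulary):
    """Return one row of term counts per document, in vocabulary order."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


# Toy documents, not taken from the job listings
docs = ["clear vision lense", "vision improve clear vision"]
vocabulary = build_vocabulary(docs)
matrix = count_vectors(docs, vocabulary)

print(list(vocabulary))  # ['clear', 'improve', 'lense', 'vision']
print(matrix)            # [[1, 0, 1, 1], [1, 1, 0, 2]]
```

Each row is one document and each column one vocabulary term, which is the shape LDA expects as input.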

In [11]:
# Look at a sample of different vector representations
TEST_CORPUS = [
    "vision improve clear lense",
    "power tool light weight fun",
    ["clear vision lense safety"],
    ["review different"]
]

def vectorize(doc):
    """ Convert tokens in doc to vectors using 
        spaCy model.
    """
    doc = NLP(doc)
    vectors = [token.vector for token in doc]
    return vectors
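
`vectorize` returns one vector per token. When a single vector per document is needed, a common choice is the element-wise mean of the token vectors, which is also what spaCy's `Doc.vector` defaults to. A small sketch with made-up 3-dimensional vectors standing in for real word vectors; `mean_vector` is a hypothetical helper, not part of spaCy:

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy stand-ins for token vectors, not real spaCy output
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 1.0, 1.0]
```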

In [12]:
DOC = 'this is a test'
print(type(vectorize(DOC)))

<class 'list'>


In [13]:
# Oops, TEST_CORPUS mixes plain strings with one-element lists.
# Flatten it into a single list of strings. Iterating over a plain
# string would split it into characters, so only the lists are joined.
TEST_CORPUS = [
    doc if isinstance(doc, str) else ' '.join(doc)
    for doc in TEST_CORPUS
]

In [14]:
# Create DataFrame object to hold the test corpus and its vectors
CORPUS_MATRIX = pd.DataFrame()

# Create vectors
TEST_VECTORS = [vectorize(x) for x in TEST_CORPUS]

# Put corpus and vectors into dataframe
CORPUS_MATRIX['reviews'] = TEST_CORPUS
CORPUS_MATRIX['vectors'] = TEST_VECTORS

In [15]:
CORPUS_MATRIX.head()


## Step 3 - Analyze Topics from LDA Transformation

In [27]:
# Instantiate the vectorizer
VECTORIZER = CountVectorizer(stop_words='english', lowercase=False)

# Vectorize the job descriptions in preparation for LDA
DATA_VECTORIZED = VECTORIZER.fit_transform(JOBS['tokens'])

In [28]:
DATA_VECTORIZED

<89x3117 sparse matrix of type '<class 'numpy.int64'>'
	with 14621 stored elements in Compressed Sparse Row format>
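
The imports also include `TfidfVectorizer`, which this walkthrough does not use. TF-IDF reweights raw counts so that terms occurring in nearly every document count for less. The sketch below applies the smoothed IDF formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, to a made-up dense count matrix. It is an illustration under those assumptions: `tfidf_rows` is a hypothetical helper, and the L2 row normalization that `TfidfVectorizer` also applies by default is left out.

```python
import math


def tfidf_rows(count_rows):
    """Weight a dense count matrix by smoothed inverse document frequency."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs
    doc_freq = [
        sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)
    ]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return [
        [count * weight for count, weight in zip(row, idf)]
        for row in count_rows
    ]


# Made-up counts: 2 documents, 3 terms; only term 0 occurs in both
counts = [
    [2, 1, 0],
    [1, 0, 3],
]
weighted = tfidf_rows(counts)
for row in weighted:
    print([round(value, 3) for value in row])
```

The term shared by both documents keeps its raw count (its IDF is 1), while the terms that occur in only one document are scaled up.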

In [29]:
# Instantiate LDA
LDA = LatentDirichletAllocation(n_components=5, max_iter=10)

In [30]:
# fit transform on our vectorized representations
DATA_TRANSFORMED = LDA.fit_transform(DATA_VECTORIZED)
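
`fit_transform` returns a document-topic matrix with one row per job description and one column per topic; in current scikit-learn versions each row is normalized to sum to 1. One simple way to use it is to label every document with its highest-weight topic. The sketch below does this in plain Python on a small made-up matrix; `dominant_topics` is a hypothetical helper, and on the real NumPy array `DATA_TRANSFORMED.argmax(axis=1)` should give the same indices.

```python
def dominant_topics(doc_topic_rows):
    """Return the index of the highest-weight topic for each document."""
    return [
        max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows
    ]


# Made-up topic weights for three documents and three topics
doc_topic = [
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
    [0.30, 0.30, 0.40],
]
labels = dominant_topics(doc_topic)
print(labels)  # [1, 0, 2]
```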

In [31]:
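# Print the top keywords (with their weights) for each topic
def selected_topics(model, vectorizer, top_n=10):
    """ Print the top_n highest-weighted terms of each topic. """
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])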
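
The slice `topic.argsort()[:-top_n - 1:-1]` is compact but easy to misread: `argsort` orders the indices from smallest to largest weight, and the negative-step slice walks backwards from the end to pick the `top_n` largest. The same selection written out in plain Python on made-up weights, with `top_indices` as a hypothetical helper:

```python
def top_indices(weights, top_n):
    """Indices of the top_n largest weights, largest first."""
    # Equivalent of argsort: indices ordered by ascending weight
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after top_n entries
    return ascending[:-top_n - 1:-1]


weights = [0.5, 3.0, 1.2, 2.4]
best = top_indices(weights, 2)
print(best)  # [1, 3]
```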

In [33]:
selected_topics(LDA, VECTORIZER, top_n=5)

Topic 0:
[('datum', 116.81319261910616), ('experience', 66.20026488723788), ('year', 46.7724744093462), ('work', 44.853959947127066), ('team', 34.911610356251636)]
Topic 1:
[('experience', 56.70274418319609), ('application', 49.46684121889495), ('opportunity', 49.09134289869498), ('career', 42.98717776040977), ('datum', 39.692435879053974)]
Topic 2:
[('datum', 243.4270529508716), ('experience', 209.52654589002117), ('data', 171.58521664213634), ('work', 151.8556949305336), ('science', 129.73975456972582)]
Topic 3:
[('ashland', 1.1993198065696675), ('sgg', 1.1993198065696675), ('jr', 1.1993198065696675), ('preprocessing', 1.1993198065696675), ('abc', 1.1993198065696675)]
Topic 4:
[('experience', 58.37044191088628), ('work', 47.737597146157896), ('data', 45.31044620342194), ('team', 39.22776295323152), ('develop', 38.330815723724434)]


In [34]:
# import pyLDAvis for sklearn
# (pyLDAvis 3.4 and later renamed this module to pyLDAvis.lda_model)
import pyLDAvis.sklearn

In [35]:
# enable notebook
pyLDAvis.enable_notebook()
VIS = pyLDAvis.sklearn.prepare(LDA, DATA_VECTORIZED, VECTORIZER, mds='tsne')
VIS

