# Topic Modelling
---

## Introduction

Text analysis is used to summarize or extract useful information from large amounts of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g., surveys and administrative data). The goal of text analysis is to take a large corpus of complex, unstructured text and extract its important messages in a comprehensible way.


Text Analysis can help with the following tasks:

* **Information retrieval**: Help find relevant information in large databases, for example when conducting a systematic literature review.

* **Clustering and text categorization**: Techniques like topic modeling can summarize a large corpus of text by finding the most important phrases. 

* **Text Summarization**: Create category-sensitive text summaries of a large corpus of text. 

* **Machine Translation**: Translate from one language to another. 

In this tutorial, we are going to analyze job advertisements from 2010-2015 using topic modeling to examine the content of our data and document classification to tag the type of job in the advertisement. First we will go over how to transform our data into a matrix that can be read in by an algorithm. 
 


## Glossary of Terms

* **Corpus**: A corpus of documents is the set of all documents in the dataset.

* **Tokenize**: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is fairly trivial, as words are separated by whitespace.

* **Stemming**: Stemming is a type of text normalization that reduces different forms of a word to a common root, usually by stripping suffixes. For example, "testing," "tested," and "tests" all reduce to the stem "test." (Mapping irregular forms such as "went" to the dictionary form "go" is the related task of lemmatization.)

* **TFIDF**: TFIDF (term frequency-inverse document frequency) is an example of feature engineering in which the most important words are identified by weighing their frequency within a document against how many documents in the corpus contain them.

* **Topic Modeling**: Topic modeling is an unsupervised learning method where groups of co-occurring words are clustered into topics. Typically, the words in a cluster should be related and make sense (e.g., boat, ship, captain). Individual documents can then fall into multiple topics.

* **LDA**: LDA (latent Dirichlet allocation) is a type of probabilistic model commonly used for topic modelling. 

* **Stop Words**: Stop words are words that carry little semantic meaning, such as prepositions, articles and pronouns. They can often be ignored.
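As a rough illustration of the tokenization and stop-word terms above, the following sketch splits a sentence into lowercase alphabetic tokens and drops a tiny, hand-picked stop-word set. The sentence, the `STOP_WORDS` set, and the `tokenize` helper are assumptions made for this example; they are not the NLTK list used later in the tutorial.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "The captain of the ship is in a boat"
tokens = tokenize(sentence)

# Keep only the tokens that are not stop words.
content_tokens = [t for t in tokens if t not in STOP_WORDS]

print(tokens)
print(content_tokens)
```

Only the content-bearing words survive the filter, which is the behaviour we rely on when cleaning the job descriptions.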

In [2]:
%pylab inline 
import nltk
import re
import time
import progressbar

import pandas as pd
from __future__ import print_function
from six.moves import zip, range 

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, auc
from sklearn import preprocessing
from collections import Counter, OrderedDict
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk import RegexpTokenizer

from gensim import models
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

nltk.download('stopwords') #download the latest stopwords

Populating the interactive namespace from numpy and matplotlib
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vinee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Load the Data

Our dataset for this tutorial is a subset of job-advertisement data from 2010-2015 compiled by the Commonwealth of Virginia. The full data, and a description of how this subset was created, can be found in the data folder of this tutorial.

In [3]:
jobs_data = pd.read_csv('jobs_subset.csv')

# Explore the data

In [4]:
jobs_data.head()

| Unnamed: 0 | normalizedTitle_onetName | normalizedTitle_onetCode | jobDescription | title |
|---|---|---|---|---|
| 0 | Lawyers | 23-1011.00 | This position is in support of the United Stat... | Paralegal (Buffalo, NY) |
| 1 | Software Quality Assurance Engineers and Testers | 15-1199.01 | The successful candidate will maintain and dev... | System Integration / Test Engineer |
| 2 | Software Quality Assurance Engineers and Testers | 15-1199.01 | This position is a Senior System Integration &... | System Integration / Test Engineer Staff |
| 3 | Software Quality Assurance Engineers and Testers | 15-1199.01 | In this position, candidates may participate i... | Mult Func Fin Analyst |
| 4 | Software Quality Assurance Engineers and Testers | 15-1199.01 | This is a business operations analyst position... | Mult Func Fin Analyst |


The normalizedTitle_onetName and normalizedTitle_onetCode columns are derived from the O*NET database. We will use the names in the document tagging portion of the tutorial. The jobDescription column holds the text of the job advertisement, and the title column is derived from it.

# How many unique job titles are in this dataset?

In [5]:
jobs_data.title.unique()

array(['Paralegal (Buffalo, NY)', 'System Integration / Test Engineer',
       'System Integration / Test Engineer Staff', ...,
       'JAVA ENGINEER(perm)', 'BSA Analyst', 'AML/KYC Analyst'],
      dtype=object)

In [6]:
jobs_data.normalizedTitle_onetName.unique()

array(['Lawyers', 'Software Quality Assurance Engineers and Testers',
       'Graduate Teaching Assistants', 'Financial Examiners',
       'Credit Analysts'], dtype=object)

Our dataset has 5 unique job titles assigned by O*NET, while the title column contains many more distinct titles.

We can perform text analytics on the job descriptions to help us discover the most important jobs and phrases in the corpus.

# Topic Modelling 

We are going to apply topic modeling, an unsupervised learning method, to our corpus to find its high-level topics as a first pass at exploring our data. As we apply topic modeling we will discuss ways of cleaning and preprocessing our data to get the best results.

Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on one approach, Latent Dirichlet allocation (LDA). LDA is a fully Bayesian extension of probabilistic latent semantic indexing, itself a probabilistic extension of latent semantic analysis.

In topic modeling we first assume that topics exist in the corpus and that a small number of them can explain it. Topics, in this case, are ranked lists of words from our corpus, with the highest-probability words at the top. A single document can be explained by multiple topics. For instance, an article on net neutrality has to do with both technology and politics. The set of topics used by a document is known as the document's allocation; hence the name latent Dirichlet allocation: each document has an allocation of latent topics drawn from a Dirichlet distribution.
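The idea that each document has its own allocation of topics can be made concrete by simulating the generative story that LDA assumes. The sketch below is illustrative only: the vocabulary, the number of topics, and the Dirichlet parameters are made up, and it generates text from random topics rather than fitting a model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["software", "test", "legal", "court", "credit", "loan"]  # made-up vocabulary
n_topics = 3

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.ones(len(vocab)) * 0.5, size=n_topics)

# Each document has its own allocation (mixture) of topics.
doc_topic = rng.dirichlet(alpha=np.ones(n_topics) * 0.5)

# Generate a 10-word document: pick a topic, then a word from that topic.
words = []
for _ in range(10):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print(doc_topic.round(2))
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's topic allocation.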

# Data Processing 

One of the most important steps in working with text data is cleaning and processing. This includes, but is not limited to, forming a corpus of text, tokenization, removing stop words, finding words that occur together (n-grams), and stemming and lemmatization. Each of these steps will be discussed below. The ultimate goal is to transform our text data into a form that an algorithm can work with. A sequence of symbols cannot be fed directly into an algorithm: algorithms expect numerical feature vectors of fixed size rather than documents of variable length. We will transform our text corpus into a bag of n-grams to be analyzed further. In this form our text data is represented as a matrix where each row refers to a specific job description (document) and each column counts the occurrences of a word (feature).
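To make the matrix representation concrete, here is a minimal pure-Python sketch that builds a bag-of-words matrix for two made-up documents. It uses plain whitespace tokenization and is meant only to show what the rows and columns contain, not to replace a proper vectorizer.

```python
from collections import Counter

# Two made-up documents standing in for job descriptions.
docs = [
    "test engineer needed to test software",
    "legal assistant needed",
]

# Tokenize on whitespace and build a sorted vocabulary (the columns).
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length regardless of how long the document is, which is exactly the fixed-size numerical input an algorithm needs.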

# Bag of n-gram representation example

We want to take our collection of documents, the corpus, and convert it into a matrix. Fortunately sklearn has a pre-built object, CountVectorizer, that can tokenize, eliminate stop words, and identify n-grams (and, given a custom tokenizer, stem our corpus), outputting a matrix in one step.
We will first try it on a small example to see how it creates a bag of words before using it on the dataset.
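Before turning to the vectorizer, it may help to see what an n-gram range means. The hypothetical `ngrams` helper below lists every contiguous run of 1 to 2 tokens; it is a simplified sketch of the idea, not the vectorizer's internal code.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "senior test engineer".split()
print(ngrams(tokens))
```

With a range of (1, 1) only single words are produced; widening the range adds multi-word phrases such as "test engineer" as extra features.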

In [7]:
def create_bow(corpus, NGRAM_RANGE = (1,1), stem = False, stop_words = None, MIN_DF = 0.05, MAX_DF = 0.95, USE_IDF = False):
    
    '''
    corpus: our documents, an iterable of text strings
    NGRAM_RANGE = (min_n, max_n) range of n-gram sizes used as features
    stem = bool, whether we want to stem the words
    stop_words = list of commonly occurring words that we want to exclude from the corpus
    MIN_DF = remove words that occur in less than this share of documents (default 5%)
    MAX_DF = remove words that occur in more than this share of documents (default 95%)
    USE_IDF = bool, whether to apply inverse-document-frequency (TF-IDF) weighting
    
    RETURN
    
    bow = bag of words (or TF-IDF matrix) as a scipy sparse matrix
    features = list of unique words in the corpus
    '''
    
    #parameters to call the vectorizer function
    ANALYZER = 'word' #that analyzes a word rather than sentences
    STRIP_ACCENTS ='unicode'
    
    if stem:
        stemmer = PorterStemmer()  # instantiate the stemmer imported from nltk
        tokenizer = lambda x : [stemmer.stem(i) for i in x.split()]
    else:
        tokenizer = None
    vectorizer = CountVectorizer(analyzer=ANALYZER, strip_accents=STRIP_ACCENTS, tokenizer=tokenizer, stop_words=stop_words, max_df=MAX_DF, min_df=MIN_DF, ngram_range=NGRAM_RANGE)
    bow = vectorizer.fit_transform(corpus)
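    # get_feature_names() was removed in scikit-learn 1.2; older releases may still need it
    features = vectorizer.get_feature_names_out().tolist()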
    
    if USE_IDF:
        NORM = None
        SMOOTH_IDF = True
        SUBLINEAR_TF = True
        transformer = TfidfTransformer(norm=NORM, use_idf=USE_IDF, smooth_idf=SMOOTH_IDF, sublinear_tf=SUBLINEAR_TF)
        tfidf = transformer.fit_transform(bow)
        
        return tfidf, features
    else:
        return bow, features
    
    
    

In [78]:
test_sentence = ["Hi, my name is Vineet", "this is a test sentence and i am test it", 'lets rock this study']

In [79]:
test_bow, test_features = create_bow(test_sentence)

In [80]:
test_bow2, test_features2 = create_bow(test_sentence, USE_IDF=True)

In [81]:
test_bow2.toarray()

array([[0.        , 0.        , 1.69314718, 1.28768207, 0.        ,
        0.        , 1.69314718, 1.69314718, 0.        , 0.        ,
        0.        , 0.        , 0.        , 1.69314718],
       [1.69314718, 1.69314718, 0.        , 1.28768207, 1.69314718,
        0.        , 0.        , 0.        , 0.        , 1.69314718,
        0.        , 2.86674738, 1.28768207, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        1.69314718, 0.        , 0.        , 1.69314718, 0.        ,
        1.69314718, 0.        , 1.28768207, 0.        ]])
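The values in the array above can be reproduced by hand. The sketch below assumes the settings that `create_bow` passes when `USE_IDF=True` (`norm=None`, `smooth_idf=True`, `sublinear_tf=True`) and the formulas scikit-learn documents for them; `tfidf_weight` is a hypothetical helper written for this example.

```python
import math

def tfidf_weight(count, doc_freq, n_docs):
    """Sublinear term frequency times smoothed IDF, with no normalization."""
    if count == 0:
        return 0.0
    tf = 1.0 + math.log(count)                           # sublinear_tf=True
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smooth_idf=True
    return tf * idf

n_docs = 3  # the three test sentences

hi_weight = tfidf_weight(1, 1, n_docs)    # 'hi': once, in one document
is_weight = tfidf_weight(1, 2, n_docs)    # 'is': once, in two documents
test_weight = tfidf_weight(2, 1, n_docs)  # 'test': twice, in one document

print(round(hi_weight, 8), round(is_weight, 8), round(test_weight, 8))
```

Words that appear in fewer documents ('hi') receive a higher weight than words shared across documents ('is'), and repeating a word within a document ('test') raises its weight only logarithmically.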

CountVectorizer outputs a sparse matrix, that is, a matrix in which most entries are 0. To save space, scipy stores sparse matrices in a compact format that records only the non-zero entries rather than every 0. Now let's convert it to a dense array so we can see its shape and contents.

In [12]:
test_bow.toarray()

array([[0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
       [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 2, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0]], dtype=int64)
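To get a feel for what sparse storage keeps, the dense array above can be reduced to its non-zero cells. This is a pure-Python sketch of the coordinate idea (row, column, value); it is a simplification and not scipy's actual internal layout.

```python
# The dense bag-of-words array printed above.
dense = [
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 2, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
]

# Coordinate-style storage: keep only (row, column, value) for non-zero cells.
triples = [
    (r, c, value)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
]

total_cells = sum(len(row) for row in dense)
print(len(triples), "non-zero entries out of", total_cells, "cells")
```

For a matrix this small the saving is negligible; it becomes worthwhile when the vocabulary is large and each document uses only a small part of it.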

In [13]:
test_features

['am',
 'and',
 'hi',
 'is',
 'it',
 'lets',
 'my',
 'name',
 'rock',
 'sentence',
 'study',
 'test',
 'this',
 'vineet']

As you can see above, each entry in the list is a feature (a column of the matrix), and each value in the matrix is the number of times that word occurs in the document (a row of the matrix). For example, 'test' appears twice in the second sentence.

Now let's count the number of times each word occurs in the corpus. We can do this by summing each column of our bag-of-words matrix.

In [14]:
def word_counts(bag_of_words, feature_names):
    
    np_bow = bag_of_words.toarray()
    word_count = np.sum(np_bow, axis=0)
    np_word_count = np.asarray(word_count).ravel()
    
    dict_word_count = dict(zip(feature_names, np_word_count))

    ordered_word_counts = OrderedDict(sorted(dict_word_count.items(), key = lambda x : x[1], reverse=True))
    
    return ordered_word_counts

In [15]:
word_counts(test_bow, test_features)

OrderedDict([('is', 2),
             ('test', 2),
             ('this', 2),
             ('sentence', 1),
             ('name', 1),
             ('rock', 1),
             ('hi', 1),
             ('lets', 1),
             ('it', 1),
             ('study', 1),
             ('my', 1),
             ('vineet', 1),
             ('and', 1),
             ('am', 1)])

As you can see above, we successfully created a word counter for the corpus, which gives us an ordered dictionary mapping each word to its count, sorted from most to least frequent.

Now let's begin working on our jobs data. First, we extract the job descriptions into a NumPy array using the .values attribute.

In [82]:
corpus = jobs_data.jobDescription.values

In [83]:
corpus[:10]

array(["This position is in support of the United States Attorney's Office in Buffalo, NY located at 138 Delaware Avenue. The estimated length of this assignment is through May 31, 2016 with the possibility of extensions. Overtime and travel may be authorized as needed. The Employee will perform the following tasks, but not limited to: *Serves as a primary resource for the office on all aspects of docketing. *Recommends revisions in office procedures to improve the docketing function by expediting case processing, changing codes for data elements, etc. *Reviews reports to identify recurring errors and trains new employees in the operation and use of the case tracking system. Advises other personnel on appropriate codes for unprecedented cases. Explains the operation of the system and importance of timely information to new attorneys and support staff. *Responsibility for developing/maintaining automated records for the criminal and civil case workload from initial referral to final dis

Let's work on creating topics from these large documents.

In [42]:
def create_topics(tfidf, features, N_TOPICS=3, N_TOP_WORDS=5):
    """
    Given a matrix of features of text data generate topics
    
    Parameters
    -----------
    tfidf: scipy sparse matrix
        sparse matrix of text features
    features: list
        feature names, one per column of tfidf
    N_TOPICS: int
        number of topics (default 3)
    N_TOP_WORDS: int
        number of top words to display in each topic (default 5)
        
    Returns
    -------
    ls_keywords: list
        list of keywords for each topic
    doctopic: array
        numpy array with the topic proportions of each document
    """
    
    # n_components replaces the n_topics argument of older scikit-learn releases
    lda = LatentDirichletAllocation(n_components=N_TOPICS, learning_method='online')

    doctopic = lda.fit_transform(tfidf)

    ls_keywords = []

    for i, topic in enumerate(lda.components_):
        # indices of the N_TOP_WORDS largest weights in this topic
        word_idx = np.argsort(topic)[::-1][:N_TOP_WORDS]
        keywords = ", ".join(features[j] for j in word_idx)
        ls_keywords.append(keywords)
        print(i, keywords)
    return ls_keywords, doctopic
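The keyword selection inside `create_topics` relies on `np.argsort`. The small sketch below isolates that step, using made-up feature names and weights for a single topic, to show how the highest-weighted words are picked.

```python
import numpy as np

features = ["court", "law", "software", "test", "credit"]  # made-up feature names
topic_weights = np.array([0.1, 0.3, 2.5, 4.0, 0.2])        # made-up weights for one topic

n_top_words = 3
# argsort is ascending, so reverse it and keep the first n_top_words indices.
top_idx = np.argsort(topic_weights)[::-1][:n_top_words]
top_words = [features[j] for j in top_idx]

print(top_words)
```

Each row of `lda.components_` plays the role of `topic_weights` here, so the loop in `create_topics` repeats this selection once per topic.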
            

In [84]:
corpus_bow, corpus_features = create_bow(corpus)

Let's look at the first 100 features of the corpus (they are listed in alphabetical order).

In [85]:
corpus_features[:100]

['00',
 '000',
 '10',
 '100',
 '11',
 '12',
 '15',
 '20',
 '2015',
 '25',
 '30',
 '40',
 '50',
 '500',
 '80',
 '800',
 'aa',
 'abilities',
 'ability',
 'able',
 'about',
 'above',
 'academic',
 'access',
 'accordance',
 'according',
 'accounting',
 'accounts',
 'accredited',
 'accuracy',
 'accurate',
 'achieve',
 'across',
 'act',
 'action',
 'active',
 'activities',
 'addition',
 'additional',
 'administration',
 'administrative',
 'admired',
 'advance',
 'advanced',
 'affirmative',
 'again',
 'against',
 'age',
 'agencies',
 'agency',
 'agile',
 'agreeing',
 'al',
 'all',
 'along',
 'also',
 'an',
 'analysis',
 'analyst',
 'analytical',
 'analyze',
 'and',
 'anos',
 'any',
 'applicable',
 'applicants',
 'application',
 'applications',
 'apply',
 'applying',
 'appropriate',
 'architecture',
 'are',
 'area',
 'areas',
 'around',
 'as',
 'aspects',
 'assessment',
 'assigned',
 'assist',
 'assistance',
 'assistant',
 'assisting',
 'assists',
 'associate',
 'associated',
 'assurance',
 'a

This looks really bad! Many of these aren't even meaningful words. Let's look at the word counts.

In [21]:
list(word_counts(corpus_bow, corpus_features).items())[:10] #printing top 10

[('and', 72036),
 ('to', 36758),
 ('the', 34940),
 ('of', 31793),
 ('in', 23879),
 ('with', 16877),
 ('for', 15605),
 ('or', 13556),
 ('experience', 11108),
 ('is', 10153)]

Many of the words above are prepositions, conjunctions and articles that carry little semantic meaning.

In [43]:
ls_keywords, ls_doctopic = create_topics(corpus_bow, corpus_features)



0 de, en, la, con, el
1 and, to, the, of, in
2 and, to, the, of, in


Looking at the topics above, we learn little beyond the fact that some of the text is in Spanish. We need to clean this data. We will use a regular expression to replace punctuation and digits with spaces, and convert the text to lowercase.

In [47]:
REGEX_THAT = r'\W+|\d+'

processed_corpus = np.array([re.sub(REGEX_THAT, " ", comment).lower() for comment in corpus])
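To see what the pattern does, here it is applied to a short made-up string. `\W+` matches runs of non-word characters (punctuation, whitespace) and `\d+` matches runs of digits; each match is replaced by a single space.

```python
import re

REGEX_THAT = r'\W+|\d+'  # same pattern as in the cell above

sample = "Hands-on hardware/software, 5+ years!"  # made-up text
cleaned = re.sub(REGEX_THAT, " ", sample).lower()

# Splitting on whitespace collapses the repeated spaces left by the substitution.
cleaned_tokens = cleaned.split()
print(cleaned_tokens)
```

Adjacent matches (for example a comma, a digit and a plus sign) each leave their own space, so the cleaned text can contain runs of spaces; these are harmless because the vectorizer tokenizes on word boundaries.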

In [48]:
corpus[1]

"The successful candidate will maintain and develop electronic production and development test solutions. Provide hands on hardware/software troubleshooting and upgrade support for existing production test stations in a real time environment. Participate in the design, build, integration and validation of automated and manual test equipment. Interpret and understand test requirements, develop/modify test procedures, test reports and other test related documentation. Must be able to interface with the customer during planned meetings and support required presentations. The successful candidate will also have the ability and willingness to on occasion support our Product Support group in maintaining, troubleshooting and servicing electronic fielded equipment. Participate in the integration, installation and validation of fielded equipment. Interpret and understand repair procedures and other repair related documentation. Must be able to interface with the customer during field service tr

In [49]:
processed_corpus[1]

'the successful candidate will maintain and develop electronic production and development test solutions provide hands on hardware software troubleshooting and upgrade support for existing production test stations in a real time environment participate in the design build integration and validation of automated and manual test equipment interpret and understand test requirements develop modify test procedures test reports and other test related documentation must be able to interface with the customer during planned meetings and support required presentations the successful candidate will also have the ability and willingness to on occasion support our product support group in maintaining troubleshooting and servicing electronic fielded equipment participate in the integration installation and validation of fielded equipment interpret and understand repair procedures and other repair related documentation must be able to interface with the customer during field service trips and intern

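Now let's tokenize the text, that is, split it into a list of words, keeping only the informative ones. To do that we load NLTK's list of English stop words, which the vectorizer will drop during tokenization.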

In [50]:
english_stopwords = stopwords.words('english')

In [51]:
english_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [52]:
processed_bow, processed_features = create_bow(processed_corpus, stop_words=english_stopwords)
processed_word_counts = word_counts(processed_bow, processed_features)
processed_word_counts

OrderedDict([('experience', 11108),
             ('de', 8575),
             ('work', 6583),
             ('job', 5631),
             ('software', 5360),
             ('skills', 5337),
             ('development', 5079),
             ('test', 4744),
             ('team', 4450),
             ('ability', 4248),
             ('required', 4147),
             ('business', 4010),
             ('quality', 3664),
             ('requirements', 3653),
             ('position', 3494),
             ('testing', 3413),
             ('knowledge', 3366),
             ('years', 3336),
             ('management', 3320),
             ('must', 3261),
             ('systems', 3103),
             ('engineering', 3047),
             ('services', 3043),
             ('information', 2989),
             ('support', 2955),
             ('legal', 2816),
             ('en', 2791),
             ('status', 2757),
             ('opportunity', 2734),
             ('including', 2627),
             ('working', 2559),
   

As you can see above, the words from the top of the list down to "position" are used in almost every job description. We will remove these by adding them to the list of stop words.

In [61]:
remove_these_words = list(processed_word_counts)[:15]
new_stopwords = english_stopwords + remove_these_words
processed_bow, processed_features = create_bow(processed_corpus, stop_words=new_stopwords, USE_IDF=False)
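The cell above picks the 15 words with the highest total counts. A related heuristic, sketched here as an alternative rather than the tutorial's method, is to look at the share of documents that contain each word, which is the same quantity the `MAX_DF` argument of `create_bow` thresholds. The documents and the `threshold` value are made up for illustration.

```python
# Made-up documents standing in for cleaned job descriptions.
docs = [
    "experience with software test tools",
    "legal experience required",
    "experience in credit analysis required",
]

# Use a set per document so each word is counted at most once per document.
tokenized = [set(doc.split()) for doc in docs]
vocabulary = set().union(*tokenized)

# Fraction of documents that contain each word.
doc_fraction = {
    word: sum(word in tokens for tokens in tokenized) / len(docs)
    for word in vocabulary
}

threshold = 0.6  # hypothetical cut-off
too_common = sorted(w for w, frac in doc_fraction.items() if frac >= threshold)
print(too_common)
```

Words flagged this way could be appended to the stop-word list in the same manner as `remove_these_words`.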

In [62]:
processed_word_counts = word_counts(processed_bow, processed_features)
processed_word_counts

OrderedDict([('testing', 3413),
             ('knowledge', 3366),
             ('years', 3336),
             ('management', 3320),
             ('must', 3261),
             ('systems', 3103),
             ('engineering', 3047),
             ('services', 3043),
             ('information', 2989),
             ('support', 2955),
             ('legal', 2816),
             ('en', 2791),
             ('status', 2757),
             ('opportunity', 2734),
             ('including', 2627),
             ('working', 2559),
             ('design', 2531),
             ('company', 2528),
             ('technical', 2487),
             ('data', 2359),
             ('provide', 2307),
             ('time', 2297),
             ('degree', 2276),
             ('related', 2264),
             ('new', 2242),
             ('strong', 2206),
             ('environment', 2105),
             ('qualifications', 2103),
             ('system', 2072),
             ('compliance', 2019),
             ('equal', 1993),
 

In [63]:
processed_keywords, processed_topic = create_topics(processed_bow, processed_features)




0 must, legal, compliance, management, time
1 testing, engineering, systems, design, knowledge
2 en, la, con, legal, el


Let's increase the number of top words shown per topic, and then the number of topics, and check the results.

In [64]:
processed_keywords, processed_topic = create_topics(processed_bow, processed_features, N_TOP_WORDS=10)



0 en, la, con, el, experiencia, pruebas, para, trabajo, los, que
1 legal, must, time, medical, provide, company, training, apply, information, state
2 testing, systems, engineering, knowledge, design, years, technical, status, management, support


In [65]:
processed_keywords, processed_topic = create_topics(processed_bow, processed_features, N_TOP_WORDS=10, N_TOPICS=10)



0 en, la, con, el, experiencia, pruebas, para, trabajo, que, las
1 robert, half, must, apply, compensation, applicants, applying, openings, authorized, candidates
2 status, gender, protected, opportunity, disability, race, religion, origin, color, national
3 medical, laboratory, health, care, must, hours, clinical, duties, time, patient
4 management, knowledge, testing, process, procedures, ensure, related, project, support, data
5 services, data, us, world, clients, professional, service, looking, professionals, career
6 testing, design, engineering, engineer, systems, knowledge, tools, technical, web, years
7 compliance, management, financial, credit, sales, strong, risk, degree, new, including
8 systems, engineering, security, status, years, must, support, system, information, national
9 legal, law, counsel, matters, attorney, corporate, litigation, company, including, issues


Let's test this on a new job description. I copied a Data Scientist job description at Amazon from LinkedIn.

In [71]:
test_doc = "Amazon has built a reputation for excellence with recent examples of being named the #1 most trusted company for customers. To deliver on this reputation for trust the Marketplace Abuse team is tasked with identifying and preventing abuse for our customers and brand owners worldwide. Selling Partner Abuse is hiring a Data Scientist who will research and develop innovative machine learning based approaches to predict the near-term future. The Data Scientist must understand deeply about Machine Learning methods and love extracting, nurturing and transforming high volume of data into actionable business information and making it readily accessible to stakeholders."



In [72]:
preprocess_test = np.array([re.sub(REGEX_THAT, " ", test_doc).lower()])

In [73]:
preprocess_test

array(['amazon has built a reputation for excellence with recent examples of being named the   most trusted company for customers to deliver on this reputation for trust the marketplace abuse team is tasked with identifying and preventing abuse for our customers and brand owners worldwide selling partner abuse is hiring a data scientist who will research and develop innovative machine learning based approaches to predict the near term future the data scientist must understand deeply about machine learning methods and love extracting nurturing and transforming high volume of data into actionable business information and making it readily accessible to stakeholders '],
      dtype='<U664')

In [77]:
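# MIN_DF=1 / MAX_DF=1.0 keep every word of the single test document;
# a separate name avoids overwriting the corpus-level processed_features
processed_bow_test, processed_features_test = create_bow(preprocess_test, stop_words=new_stopwords, MIN_DF=1, MAX_DF=1.0)
processed_keys, processed_topic_test = create_topics(processed_bow_test, processed_features_test, N_TOPICS=2, N_TOP_WORDS=10)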

0 recent, examples, extracting, brand, must, actionable, customers, readily, volume, approaches
1 abuse, data, learning, scientist, machine, customers, reputation, volume, marketplace, understand
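The cell above fits a fresh two-topic model on the single test document. If the goal is instead to see which of the corpus topics a new advertisement falls under, one option is to keep the fitted vectorizer and LDA model and call `transform` on the new text. The sketch below illustrates that pattern on a tiny made-up corpus; it does not reuse `create_bow` or `create_topics`, because those helpers do not return the fitted objects.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for the job descriptions.
train_docs = [
    "software test engineer automation testing",
    "software developer testing systems design",
    "legal counsel litigation attorney law",
    "attorney law firm legal matters",
]
new_doc = ["data scientist machine learning software systems"]

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(train_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(train_bow)

# Reuse the fitted vocabulary and topics; words unseen in training are ignored.
new_bow = vectorizer.transform(new_doc)
new_doc_topics = lda.transform(new_bow)

print(new_doc_topics.round(2))  # one row, one column per topic
```

The resulting row is the new document's topic allocation, so its entries sum to 1 and can be compared directly with the rows of `doctopic` returned for the training corpus.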


