# Extracting topics from unlabeled training data

We attempted to adapt [a tutorial](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) on topic modeling with latent Dirichlet allocation (LDA) in Python to the job advertisement data set found at [Data Norge](https://data.norge.no). The data is already labeled in several ways, but we wanted to explore topic extraction based only on the job description text.

**NOTE:** You may run into memory issues when running the code samples in this notebook. We're not sure whether Jupyter's own memory consumption alone causes the issue, or whether other programs also contribute. We recommend the following course of action:
- Shut down all running notebooks and re-run the notebook from the start
- Close all potentially memory-consuming applications, browser tabs, etc. that you don't currently need and re-run the notebook from the start

If none of the above actions help, it may be that your system does not have enough memory available to properly run the notebook. Consider running it on a more powerful computer, if possible.

In [7]:
# imports
import pandas as pd
import os
import re # regular expressions (used for the tag and whitespace patterns below)
import wget # downloading data
from pprint import pprint # pretty printing
import HTMLParser
import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk

In [8]:
# setup and variable creation

np.random.seed(2018)
nltk.download("wordnet")

html = HTMLParser.HTMLParser()          # use html.unescape() later
stemmer = SnowballStemmer("norwegian")  # get the stems of words in Norwegian
whitespace_pattern = re.compile(r'\s+') # whitespace pattern (used when normalizing whitespace)

years_with_desc = [2002, 2015, 2016, 2017]
years_with_ads = list(range(2002, 2017 + 1))
data_dir = os.path.join('.', 'data')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### NOTE

A potential improvement would be to detect English ads automatically and use an English stemmer for those, though it's debatable whether accuracy would improve much. It might also help to consider only the Norwegian ads, or to model each language separately, so that mixed-language input doesn't throw off the topic extraction algorithm. The same separation principle could also be applied to Nynorsk and Bokmål documents.
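
The language detection mentioned above could be sketched roughly as follows. This is a minimal illustration and not part of the notebook's pipeline: the function name `guess_language` and the two tiny hint lists are hypothetical, and a real setup would need fuller stopword lists or a dedicated language detection library.

```python
# Hypothetical, deliberately tiny hint lists (assumption: not taken from the notebook).
NORWEGIAN_HINTS = {"og", "i", "er", "det", "som", "til", "med", "vi", "har", "en", "av"}
ENGLISH_HINTS = {"and", "the", "is", "of", "to", "with", "we", "are", "in", "a", "our"}


def guess_language(text):
    """Return 'english' or 'norwegian' depending on which hint list matches more tokens."""
    tokens = text.lower().split()
    norwegian_hits = sum(token in NORWEGIAN_HINTS for token in tokens)
    english_hits = sum(token in ENGLISH_HINTS for token in tokens)
    # Ties fall back to Norwegian (assumption: most ads are written in Norwegian).
    return "english" if english_hits > norwegian_hits else "norwegian"


print(guess_language("Vi har ambisjoner om å vokse og styrker teamet"))
print(guess_language("We are now expanding our leading team with senior researchers"))
```

The returned names match language names accepted by nltk's `SnowballStemmer`, so the result could be used to pick a stemmer per ad.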

In [9]:
# data processing function definitions

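# Lemmatize as a verb (3rd person -> 1st person, past tense -> present tense, ...),
# then convert to the stem of the word (works -> work, ...)
# NOTE: WordNetLemmatizer is based on the English WordNet, while the stemmer is Norwegian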
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Add tokens longer than 3 characters that are not in gensim's (English) stopword list
# to the result, after lemmatizing and stemming; repeated tokens are kept
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Remove html tags
def remove_tags(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext

# Unescape HTML-encoded characters
def unescape_html(raw_html):
    return html.unescape(raw_html)

# Normalize whitespace (replace any whitespace with a single whitespace)
def remove_whitespace(text):
    return whitespace_pattern.sub(' ', text)
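
To show what the three cleaning helpers above do when chained, here is a small self-contained sketch. It assumes Python 3, where `html.unescape` lives in the standard `html` module, and the name `clean_description` and the sample string are made up for illustration. Unlike the notebook's helpers, it also strips leading and trailing whitespace.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")         # same non-greedy tag pattern as remove_tags
WHITESPACE_PATTERN = re.compile(r"\s+")    # same pattern as whitespace_pattern


def clean_description(raw_html):
    """Remove tags, unescape HTML entities and normalize whitespace, in that order."""
    text = TAG_PATTERN.sub(" ", raw_html)      # tags become spaces so words don't merge
    text = html.unescape(text)                 # &oslash; -> ø, &aelig; -> æ, ...
    text = WHITESPACE_PATTERN.sub(" ", text)   # collapse runs of whitespace
    return text.strip()                        # extra step, not in the notebook's helpers


sample = "<p>Vi s&oslash;ker   en <b>l&aelig;rer</b></p>"
print(clean_description(sample))  # prints: Vi søker en lærer
```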

In [10]:
# function which downloads data

def download_data(start_year=2012, end_year=2017):
    if 'data' not in os.listdir('.'):
        !mkdir 'data'
    for year in range(start_year, end_year + 1):
        filename_ads = ''.join([str(year), '_ads', '.csv'])
        filename_desc = ''.join([str(year), '_desc', '.csv'])
        ads_url = ''.join(['https://hotell.difi.no/download/nav/ledige-stillinger/', str(year), '?download'])
        desc_url = ''.join(['https://hotell.difi.no/download/nav/stillingstekster/', str(year), '?download'])
        
        if filename_ads not in os.listdir(os.path.join(data_dir)) and year in years_with_ads:
            wget.download(ads_url, out=os.path.join(data_dir, filename_ads))
            print(''.join(['Downloaded ads for ', str(year)]))
        if filename_desc not in os.listdir(os.path.join(data_dir)) and year in years_with_desc:
            wget.download(desc_url, out=os.path.join(data_dir, filename_desc))
            print(''.join(['Downloaded desc for ', str(year)]))

print("Years with job descriptions:\n  %s\n" %years_with_desc)
print("All years with job ad data:\n  %s\n" %years_with_ads)

Years with job descriptions:
  [2002, 2015, 2016, 2017]

All years with job ad data:
  [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]



In [11]:
# download the data

download_data()
os.listdir(data_dir)

['2016_ads.csv',
 '2017_ads.csv',
 '2016_desc.csv',
 '2013_ads.csv',
 '2015_desc.csv',
 '2015_ads.csv',
 '2012_ads.csv',
 '2017_desc.csv',
 '2014_ads.csv']

In [12]:
# Load dataset from csv

year = 2015
job_ads_filename = ''.join([str(year), '_ads.csv'])
job_ads_desc_filename = ''.join([str(year), '_desc.csv'])

df_ads = pd.read_csv(
    os.path.join(data_dir, job_ads_filename),
    sep=';'
)

df_ads_desc = pd.read_csv(
    os.path.join(data_dir, job_ads_desc_filename),
    sep=';'
)

# Merge datasets into one
df = pd.merge(df_ads, df_ads_desc)

# Rename columns that have different names in different datasets
df.rename(columns={'stillingsutlysning': 'stillingsbeskrivelse'}, inplace=True)

# Perform some basic preprocessing of the text: decode as utf-8 to prevent ascii decoding issues,
# remove HTML tags, unescape HTML encoded stuff (such as &lt; becoming <), and normalize whitespace
# to exactly 1 space between every word
df[['stillingsbeskrivelse']] = df['stillingsbeskrivelse']\
  .map(lambda x: x.decode("utf-8"))\
  .map(remove_tags)\
  .map(unescape_html)\
  .map(remove_whitespace)

# show sample
df.head()

Unnamed: 0,stillingsnummer,nav_enhet_kode,registrert_dato,sistepubl_dato,statistikk_aar_mnd,offisiell_statistikk_flagg,stilling_kilde,arbeidssted_fylkesnummer,arbeidssted_fylke,arbeidssted_kommunenummer,...,yrke_grovgruppe,yrkeskode,yrke,yrkesbetegnelse,virksomhet_organisasjonsnr,virksomhet_navn,antall_stillinger,stillingstittel,statistikk_periode,stillingsbeskrivelse
0,101201412000010,101,18.12.2014,02.01.2015,201501,1,Annonsert i media,1,Østfold,101,...,Undervisning,2342,Førskolelærere,Førskolelærer,973967592,HØGSKOLEN I ØSTFOLD,1,Barnehagelærar / Førstekonsulent (14/04241),201501,# senter for framandspråk i opplæringa er det ...
1,101201412000011,101,19.12.2014,01.02.2015,201501,1,Overført fra arbeidsgiver,1,Østfold,101,...,Serviceyrker og annet arbeid,5413,Fengselsbetjenter,Fengselsbetjent,993315060,HALDEN FENGSEL,1,Halden fengsel - ledig fast stilling som verks...,201501,Jobbnorge ID: 108617 # fengsel er det ledig fa...
2,101201412000012,101,19.12.2014,16.01.2015,201501,1,Overført fra arbeidsgiver,1,Østfold,101,...,Serviceyrker og annet arbeid,5413,Fengselsbetjenter,Fengselsbetjent,993315060,HALDEN FENGSEL,1,Halden fengsel - ledig fast stilling som verks...,201501,Jobbnorge ID: 108301 # fengsel er det ledig fa...
3,101201412000013,101,23.12.2014,23.01.2015,201501,1,Annonsert i media,1,Østfold,101,...,Butikk- og salgsarbeid,3322,Selgere (engros),Key Account Manager (salg),886332882,SMART INNOVATION NORWAY AS,1,Key Account Manager,201501,# har ambisjoner om å vokse og styrker teamet ...
4,101201412000014,101,23.12.2014,30.01.2015,201501,1,Annonsert i media,1,Østfold,101,...,Ingeniør- og ikt-fag,2152,Sivilingeniører (elektronikk),Forsker (elektronikk),886332882,SMART INNOVATION NORWAY AS,1,Senior Reasearchers/Reasearchers,201501,# is now expanding its leading team with senio...


In [13]:
# show tokenization of a sample document

doc_sample = df[["stillingsbeskrivelse"]].iloc[0].values[0]

print('Original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\nTokenized and lemmatized document: ')
print(preprocess(doc_sample))

Original document: 
[u'#', u'senter', u'for', u'framandspr\xe5k', u'i', u'oppl\xe6ringa', u'er', u'det', u'ledig', u'stilling', u'som', u'Barnehagel\xe6rar', u'/', u'F\xf8rstekonsulent', u'Generelt', u'om', u'stillinga', u'Framandspr\xe5ksenteret', u'sine', u'ansvarsomr\xe5de', u'(sj\xe5', u'mandat', u'http://fremmedspraksenteret.no/nor/fremmedspraksenteret/fremmedspraksenteret/mandat)', u'f\xf8reset', u'at', u'dei', u'tilsette', u'er', u'dyktige', u'l\xe6rarar', u'i', u'sine', u'fag', u'og', u'at', u'dei', u'har', u'evne', u'til', u'\xe5', u'ta', u'del', u'i', u'arbeidet', u'med', u'\xe5', u'auke', u'engelsk-', u'og', u'framandspr\xe5kkunnskapane', u'i', u'Noreg.', u'Senteret', u'har', u'tilsett', u'r\xe5dgjevarar', u'for', u'dei', u'st\xf8rste', u'framandspr\xe5ka', u'i', u'skulen.', u'Dei', u'arbeider', u'med', u'ressursutvikling,', u'l\xe6replanrelaterte', u'oppg\xe5ver', u'og', u'andre', u'oppg\xe5ver', u'som', u'senteret', u'f\xe5r.', u'Senteret', u'vil', u'i', u'2015', u'fortset

In [14]:
# process the data
# NOTE: takes a while

processed = df["stillingsbeskrivelse"].map(preprocess)
processed[:10]

0    [sent, framandspråk, opplæring, led, still, ba...
1    [jobbnorg, fengsel, led, fast, still, verksbet...
2    [jobbnorg, fengsel, led, fast, still, verksbet...
3    [ambisjon, voks, styrk, team, rekrutter, rett,...
4    [expand, lead, team, senior, research, senior,...
5    [arbeidsoppgav, utfør, ettersyn, kontroll, sam...
6    [selskap, sterk, vekst, ønsk, derfor, styrk, s...
7    [kund, søk, dykt, formann, innlei, fast, still...
8    [caverion, søk, dykt, kuldemontør, tekniker, k...
9    [ønsk, styrk, utvikl, bygg, betong, avdeling, ...
Name: stillingsbeskrivelse, dtype: object

In [15]:
# build a dictionary mapping every unique token across all documents to an integer id,
# and print 10 of its (id, token) entries
dictionary = gensim.corpora.Dictionary(processed)
for _, (k, v) in zip(range(10), dictionary.iteritems()):
    print("%s: %s" %(k, v))

63660: vanligast
111295: torvgat
23487: xjhqs
114682: epidemiologoisk
40612: beober
58345: nanopartikl
15007: forutsettning
41435: demensplass
38939: skrankerutin
46635: sentrumsgat


In [16]:
# Remove extremes: keep only words appearing in at least 15 documents,
# but in no more than half, and only the most common 100000
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
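
The effect of this filtering step can be illustrated without gensim. The sketch below is a simplified stand-in for the idea behind `filter_extremes`, not gensim's actual implementation. The function name `filter_tokens` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_tokens(docs, no_below=2, no_above=0.5, keep_n=None):
    """Keep tokens found in at least `no_below` documents and in at most a
    `no_above` fraction of all documents, then keep the `keep_n` most frequent."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each token once per document
    max_docs = no_above * len(docs)
    kept = [t for t, n in doc_freq.items() if no_below <= n <= max_docs]
    kept.sort(key=lambda t: (-doc_freq[t], t))   # most frequent first, ties alphabetical
    return kept[:keep_n]                   # keep_n=None keeps everything


docs = [
    ["still", "led", "arbeid"],
    ["still", "fast"],
    ["still", "arbeid"],
    ["still", "kund", "fast"],
]
# "still" is in every document (too common), "led" and "kund" in only one (too rare).
print(filter_tokens(docs))  # prints: ['arbeid', 'fast']
```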

In [17]:
# create bag of words
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]

In [18]:
# show same sample document as before, but having been processed and analyzed/filtered
bow_doc_0 = bow_corpus[0]

for i in range(len(bow_doc_0)):
    print("Word %s (\"%s\") appears %s time(s)." %(bow_doc_0[i][0], dictionary[bow_doc_0[i][0]], bow_doc_0[i][1]))

Word 0 ("aktivitet") appears 1 time(s).
Word 1 ("ald") appears 1 time(s).
Word 2 ("all") appears 1 time(s).
Word 3 ("andr") appears 2 time(s).
Word 4 ("ansvarsområd") appears 1 time(s).
Word 5 ("arbeid") appears 6 time(s).
Word 6 ("attest") appears 1 time(s).
Word 7 ("auk") appears 1 time(s).
Word 8 ("bachelorgrad") appears 1 time(s).
Word 9 ("bakgrunn") appears 1 time(s).
Word 10 ("barnehag") appears 3 time(s).
Word 11 ("barnehagelær") appears 2 time(s).
Word 12 ("bedrift") appears 1 time(s).
Word 13 ("bedt") appears 1 time(s).
Word 14 ("behandling") appears 1 time(s).
Word 15 ("bruk") appears 1 time(s).
Word 16 ("båd") appears 1 time(s).
Word 17 ("dansk") appears 1 time(s).
Word 18 ("denn") appears 1 time(s).
Word 19 ("dersom") appears 1 time(s).
Word 20 ("dett") appears 1 time(s).
Word 21 ("difor") appears 1 time(s).
Word 22 ("dykt") appears 1 time(s).
Word 23 ("eining") appears 1 time(s).
Word 24 ("engelsk") appears 2 time(s).
Word 25 ("epost") appears 1 time(s).
Word 26 ("etnisk")
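
The bag-of-words representation printed above is a list of `(token id, count)` pairs per document. A minimal pure-Python sketch of the same idea follows. The helper names `build_token_ids` and `to_bow` are hypothetical, and gensim assigns ids in its own way, so the concrete numbers will differ from the notebook's.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs; tokens without an id are ignored."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


docs = [["arbeid", "barnehag", "arbeid"], ["barnehag", "kund"]]
token2id = build_token_ids(docs)
print(to_bow(docs[0], token2id))                    # prints: [(0, 2), (1, 1)]
print(to_bow(["kund", "ukjent", "kund"], token2id)) # prints: [(2, 2)]
```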

In [19]:
# build tf-idf model based on bag of words
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.04847050660185597),
 (1, 0.041437040533137764),
 (2, 0.021852977196621545),
 (3, 0.039152451583557554),
 (4, 0.04593479158756476),
 (5, 0.07030043810676488),
 (6, 0.030523788422979554),
 (7, 0.08739506749829075),
 (8, 0.04756081908836948),
 (9, 0.03862542053536803),
 (10, 0.12812970413530828),
 (11, 0.1233353283949595),
 (12, 0.036522159880601916),
 (13, 0.04877699895083946),
 (14, 0.04413239300377279),
 (15, 0.030200333445347387),
 (16, 0.023539413065908718),
 (17, 0.07666903905036727),
 (18, 0.03508965080248561),
 (19, 0.03412971205721225),
 (20, 0.022797318886060135),
 (21, 0.078138159961223),
 (22, 0.03086145742531736),
 (23, 0.07378886852222086),
 (24, 0.07367420843056069),
 (25, 0.05381898123905263),
 (26, 0.05670674534733809),
 (27, 0.03581854493103228),
 (28, 0.01692660753475327),
 (29, 0.015675591493595858),
 (30, 0.08697438335981353),
 (31, 0.0866269100334132),
 (32, 0.36434233896531415),
 (33, 0.12230734055581391),
 (34, 0.06886540731940217),
 (35, 0.12593005472830923
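
To make the weights above less opaque, here is a small sketch of a TF-IDF computation. It assumes the weighting we believe gensim uses by default (raw term count times log2 of the inverse document frequency, followed by scaling each document vector to unit length). Treat that as an assumption rather than a description of gensim's internals. The function name `tfidf_vectors` and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, scaled to unit length."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each token once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        weights = {t: w for t, w in weights.items() if w > 0}   # drop zero weights
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = [
    ["still", "barnehag", "barnehag", "fag"],
    ["still", "pasient", "fag"],
    ["still", "kund"],
]
for vector in tfidf_vectors(docs):
    print(vector)
```

A token that occurs in every document, such as "still" here, gets an inverse document frequency of zero and drops out, which is why TF-IDF tends to push down words that are common across all ads.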

In [20]:
# Create LDA model (Latent Dirichlet Allocation,
# https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) based on
# bag of words
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [22]:
# Print topics generated from bag of words
topics = {}
for idx, topic in lda_model.print_topics(-1):
    top = topic.split(" + ")
    
    topics[idx] = {"words": [{"weight": float(y[0]), "word": y[1]} for y in [x.split("*") for x in top]]}
    
for topic in topics:
    print("Topic: %s" %topic)
    for w in topics[topic]["words"]:
        print("Word:   %-20s (weight:   %s)" %(w["word"], w["weight"]))

Topic: 0
Word:   "inn"                (weight:   0.014)
Word:   "teknisk"            (weight:   0.013)
Word:   "drift"              (weight:   0.012)
Word:   "prosjekt"           (weight:   0.012)
Word:   "led"                (weight:   0.01)
Word:   "ønsk"               (weight:   0.009)
Word:   "relevant"           (weight:   0.008)
Word:   "bygg"               (weight:   0.008)
Word:   "vedlikehold"        (weight:   0.008)
Word:   "arbeidsoppgav"      (weight:   0.007)
Topic: 1
Word:   "pasient"            (weight:   0.019)
Word:   "avdeling"           (weight:   0.017)
Word:   "tverrfag"           (weight:   0.013)
Word:   "behandling"         (weight:   0.012)
Word:   "samarbeid"          (weight:   0.011)
Word:   "fag"                (weight:   0.01)
Word:   "psykisk"            (weight:   0.01)
Word:   "leg"                (weight:   0.01)
Word:   "poliklinikk"        (weight:   0.01)
Word:   "fast"               (weight:   0.009)
Topic: 2
Word:   "kirk"               (weight: 
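
The cell above parses the formatted topic strings returned by `print_topics`, which is why the words in the output keep their surrounding quotes. A self-contained sketch of the same parsing that also strips the quotes is shown below. The function name `parse_topic` is hypothetical, and the sample string reuses the first words of topic 0 from the output above.

```python
def parse_topic(topic_string):
    """Turn '0.014*"inn" + 0.013*"teknisk"' into a list of word/weight dicts."""
    words = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        words.append({"word": word.strip('"'), "weight": float(weight)})
    return words


sample = '0.014*"inn" + 0.013*"teknisk" + 0.012*"drift"'
for entry in parse_topic(sample):
    print("Word:   %-20s (weight:   %s)" % (entry["word"], entry["weight"]))
```

gensim's `show_topics(formatted=False)` returns the words and weights as tuples, which would avoid string parsing altogether.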

## Results:

Two team members independently labeled each topic based on what its top words seemed to have in common. These are our results:

### Team member 1 categories

| Topic   | Category                     | Comments                        |
|:------- |:---------------------------- |:------------------------------- |
| Topic 0 | Construction and engineering |                                 |
| Topic 1 | Nursing and care             |                                 |
| Topic 2 | English application text     |                                 |
| Topic 3 | Position information         |                                 |
| Topic 4 | Customer service             | ("kund")                        |
| Topic 5 | Desired qualifications       | ("søker", "utdanning", "ønsk")  |
| Topic 6 | Child and youth work         |                                 |
| Topic 7 | Public sector work           | ("kommun")                      |
| Topic 8 | Management                   |                                 |
| Topic 9 | Education                    |                                 |

### Team member 2 categories

| Topic   | Category                  | Comments                        |
|:------- |:------------------------- |:------------------------------- |
| Topic 0 | Building and construction |                                 |
| Topic 1 | Health                    |                                 |
| Topic 2 | Education/academic        |                                 |
| Topic 3 | Office                    |                                 |
| Topic 4 | Customer service          |                                 |
| Topic 5 | State/public sector       |                                 |
| Topic 6 | Kindergarten              |                                 |
| Topic 7 | Municipal sector          |                                 |
| Topic 8 | Management                |                                 |
| Topic 9 | Nynorsk                   |                                 |

In [None]:
# Create LDA model using TF-IDF information
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [23]:
# Print topics generated from TF-IDF
topics = {}
for idx, topic in lda_model_tfidf.print_topics(-1):
    top = topic.split(" + ")
    
    topics[idx] = {"words": [{"weight": float(y[0]), "word": y[1]} for y in [x.split("*") for x in top]]}
    
for topic in topics:
    print("Topic: %s" %topic)
    for w in topics[topic]["words"]:
        print("Word:   %-20s (weight:   %s)" %(w["word"], w["weight"]))

Topic: 0
Word:   "søkj"               (weight:   0.011)
Word:   "vert"               (weight:   0.009)
Word:   "ver"                (weight:   0.008)
Word:   "sjukeplei"          (weight:   0.007)
Word:   "innan"              (weight:   0.007)
Word:   "arbeidsoppgåv"      (weight:   0.007)
Word:   "søkjar"             (weight:   0.006)
Word:   "eigenskap"          (weight:   0.006)
Word:   "skul"               (weight:   0.006)
Word:   "kommun"             (weight:   0.006)
Topic: 1
Word:   "stat"               (weight:   0.005)
Word:   "jobbnorg"           (weight:   0.005)
Word:   "samisk"             (weight:   0.004)
Word:   "finnmark"           (weight:   0.003)
Word:   "søker"              (weight:   0.003)
Word:   "tan"                (weight:   0.003)
Word:   "utdanning"          (weight:   0.003)
Word:   "relevant"           (weight:   0.003)
Word:   "offent"             (weight:   0.003)
Word:   "institutt"          (weight:   0.003)
Topic: 2
Word:   "kirk"               (wei

## Next steps:

- Obtain the distribution of predicted topics within each of the existing categories of the training data
- Determine a likely label for each model topic based on its most relevant words and [Los](https://www.difi.no/fagomrader-og-tjenester/digitalisering-og-samordning/nasjonal-arkitektur/los)
- Look into hierarchical (agglomerative) clustering?
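
The first step in the list above could look roughly like the sketch below. It assumes each document comes with a category label (for example the `yrke_grovgruppe` column) and a list of `(topic id, probability)` pairs, which is the shape we expect from `lda_model[bow]`. The function names and all numbers are made up for illustration and are not results from the notebook.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_weights):
    """Return the id of the topic with the highest probability."""
    return max(topic_weights, key=lambda pair: pair[1])[0]


def topic_distribution_by_category(categories, doc_topics):
    """For each category, return the share of documents per dominant topic."""
    counts = defaultdict(Counter)
    for category, topic_weights in zip(categories, doc_topics):
        counts[category][dominant_topic(topic_weights)] += 1
    return {
        category: {topic: n / sum(c.values()) for topic, n in sorted(c.items())}
        for category, c in counts.items()
    }


# Made-up example data, NOT output from the model above.
categories = ["Undervisning", "Undervisning", "Ingeniør- og ikt-fag"]
doc_topics = [
    [(6, 0.7), (9, 0.3)],
    [(9, 0.8), (6, 0.2)],
    [(0, 0.9), (8, 0.1)],
]
print(topic_distribution_by_category(categories, doc_topics))
```

Using only the dominant topic discards the rest of each document's topic mixture. Averaging the full probability vectors per category would be an alternative.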