<h1>Job offers analyzer (Topic Modeling)</h1>

This is an application of unsupervised learning (a branch of machine learning). It takes the descriptions of job offers as input (in this case, jobs for electricians) and returns the categories or topics most commonly found in them (e.g., construction, design, maintenance).

First, we import the data obtained from the job offers website (a web scraper was written to extract it).

In [23]:
import pandas as pd

# Load the scraped job offers into a DataFrame
df = pd.read_excel('Electricista_2_20210107.xls')
df.head(3)

Unnamed: 0.1,Unnamed: 0,id_oferta,empresa,cargo,descripcion
0,0,B194FAD9362652E061373E686DCF3405,Dar Ayuda Temporal,Técnico electricista,"Descripción\nDiseño, ensamble, instalación, pr..."
1,1,D5060C2E95F4263361373E686DCF3405,Saitemp S.A,Oficial electricista - Subestaciones eléctricas,Descripción\nSe requiere técnico o tecnólogo e...
2,2,E3D143CB9CD5BED561373E686DCF3405,JOBANDTALENT CO S A S,Tecnico Electricista - ensamble de tableros,Descripción\nEn Jobandtalent empleamos a más d...


We import a list of "stopwords" in Spanish, which is the language of the job offers:

In [2]:
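with open('stopwords_esp.txt', 'r') as f:
    # Strip the trailing newline from each line and skip empty lines,
    # so the stop words match the tokens produced by the vectorizer
    stopwords_esp = [line.strip() for line in f if line.strip()]
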
print(type(stopwords_esp))

<class 'list'>


Then we build the "bag of words" from the job descriptions. This will be the input to the algorithm.
We set a maximum document frequency of 10% (max_df = 0.1), so that words appearing in more than 10% of the offers are discarded as too common, and we keep only the 1000 most frequent remaining words (max_features = 1000).
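
As a rough illustration of what these two parameters do, here is a minimal sketch on a toy corpus. The documents, the threshold of 0.5 and the names `toy_docs`, `toy_count` and `toy_X` are assumptions made up for this example; they are not part of the job offers data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the job offers data).
toy_docs = [
    "mantenimiento de tableros",
    "mantenimiento de motores",
    "mantenimiento de redes",
    "diseño de subestaciones",
]

# max_df=0.5 drops terms present in more than 50% of the documents
# ("mantenimiento" and "de" here); max_features=3 then keeps at most
# the 3 most frequent terms among those that remain.
toy_count = CountVectorizer(max_df=0.5, max_features=3)
toy_X = toy_count.fit_transform(toy_docs)

print(sorted(toy_count.vocabulary_))  # the 3 terms that were kept
print(toy_X.shape)                    # (number of documents, number of kept terms)
```

All remaining terms in this toy corpus occur once, so which three survive the `max_features` cut is an implementation detail; on real data the cut follows the overall term counts.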


In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the document-term matrix (bag of words) from the job descriptions
count = CountVectorizer(stop_words=stopwords_esp,
                        max_df=0.1,
                        max_features=1000)
X = count.fit_transform(df['descripcion'].values)



Next, we create an instance of the Latent Dirichlet Allocation (LDA) algorithm as implemented in scikit-learn.

The main parameter is the desired number of topics, in this case 4.

In [20]:
from sklearn.decomposition import LatentDirichletAllocation

# n_components: number of topics

lda = LatentDirichletAllocation(n_components=4,
                                random_state=500,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [21]:
lda.components_.shape

(4, 1000)
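
To make the two matrices produced by LDA more concrete, the following sketch fits a tiny model on a toy corpus. The documents and the names `toy_docs`, `toy_counts`, `toy_lda` and `toy_topics` are hypothetical and only serve the illustration; the real model above uses the job descriptions and 4 topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): six short "job descriptions".
toy_docs = [
    "mantenimiento reparación motores",
    "reparación fallas motores",
    "mantenimiento correctivo fallas",
    "diseño subestaciones ingeniero",
    "ingeniero diseño planos",
    "planos subestaciones calidad",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    random_state=0,
                                    learning_method='batch')
toy_topics = toy_lda.fit_transform(toy_counts)

# Document-topic matrix: one row per document, one column per topic.
print(toy_topics.shape)
# Topic-word matrix: one row per topic, one column per vocabulary term.
print(toy_lda.components_.shape)
# Each row of the document-topic matrix is a distribution over topics.
print(np.allclose(toy_topics.sum(axis=1), 1.0))
```

In the notebook, `X_topics` plays the role of `toy_topics`, and `lda.components_` is the topic-word matrix whose shape is shown above.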

Finally, for each topic we print the most relevant words detected by the algorithm and analyze the results. 

In [22]:
n_top_words = 10
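# scikit-learn >= 1.0; older versions use count.get_feature_names()
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    # Indices of the n_top_words largest weights, in descending order
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(" ".join(feature_names[i] for i in top_indices))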

Topic 1:
empleo máquinas 00 mecánico facatativá ofrecemos forma lectura colombia minima
Topic 2:
reparación mantenimientos ensamble controles mecánico energía detección fallas respectiva correctivos
Topic 3:
c1 licencia domingo b1 energía labor ciudad seguro gestión compañía
Topic 4:
ingeniero servicio subestaciones elementos comercial elaboración calidad profesionalaños eléctrica cargos
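
The slice applied to `argsort()` in the loop above can be hard to read. The small sketch below shows how it picks the indices of the largest weights in descending order; the weights and the names `toy_weights`, `toy_names` and `n_top` are invented for this example.

```python
import numpy as np

# Hypothetical word weights for a single topic.
toy_weights = np.array([0.2, 1.5, 0.7, 3.1, 0.1])
toy_names = np.array(["a", "b", "c", "d", "e"])
n_top = 3

# argsort() returns indices in ascending order of weight; the slice
# [:-n_top - 1:-1] walks that array backwards and stops after n_top items.
top_idx = toy_weights.argsort()[:-n_top - 1:-1]

print(top_idx)             # [3 1 2]
print(toy_names[top_idx])  # ['d' 'b' 'c']

# An equivalent, arguably more readable form: reverse, then take the first n_top.
same_idx = toy_weights.argsort()[::-1][:n_top]
```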


**Analysis:**

As a result, we obtain 4 topics, each described by the words the algorithm considers most important in the job descriptions. We observe that the algorithm identified the following topics:

•	Topics 1 and 2: electricians with repair and maintenance skills.

•	Topic 3: frequent words with no clearly defined topic.

•	Topic 4: electrical engineers for service in substations.

Application: with the information provided by topic modeling of job offers, candidates could tailor their CVs, and job alerts could be created. Candidates could also follow job trends and see which positions are most in demand.
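
One possible way to turn the topics into job alerts is to label each offer with its dominant topic and filter on that label. The sketch below assumes a small, made-up document-topic matrix (`toy_X_topics`) standing in for `X_topics`; the values and the `wanted_topic` choice are hypothetical.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 offers, 4 topics
# (in the notebook this role is played by X_topics).
toy_X_topics = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.20, 0.10, 0.60],
])

# Dominant topic of each offer (0-based column index).
dominant_topic = toy_X_topics.argmax(axis=1)

# Example alert: offers whose dominant topic is Topic 4 (index 3).
wanted_topic = 3
matching_offers = np.flatnonzero(dominant_topic == wanted_topic)

print(dominant_topic)   # [0 1 3]
print(matching_offers)  # [2]
```

With the real data, the indices in `matching_offers` could be used to select the corresponding rows of `df`.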


Based on the example published by Sebastian Raschka in the book Python Machine Learning, 2nd ed., pages 296-300.