
##Importing the data into a dataframe

In [0]:
import numpy as np
import pandas as pd
import re
import json
from urllib.request import urlopen
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize     
import nltk     
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer 
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import LatentDirichletAllocation

In [0]:
df=pd.read_csv("corpclimate_assessment.csv")

In [6]:
df['Description']

0                                  All Accounting Duties
1      We offer great service backed with 35+ years o...
2      Specialized in installing garage doors and gar...
3                                running the whole show!
4      Running the day to day operations of the bussi...
                             ...                        
295    <U+F0B2>\tInstrumental in growing iGATE busine...
296    My first working place. Unforgettable moments....
297    responsible to support the Regional Audit Asia...
298    Spokesperson,Served as a mentor to Junior and ...
299    conseptual design, detail design, prepared tec...
Name: Description, Length: 300, dtype: object

##Extracting names from the ID column

To extract the names from the ID column, I split each ID on the hyphen and kept the first two tokens (first name and last name) that were longer than one character and consisted only of letters.

In [0]:
li=list(df['ID'])
names=[]
for i in li:
  w=re.split(r'-',i)
  n=""
  c=0
  for j in w:
    #Getting the first and the last name from the ID column:
    #keep at most two alphabetic tokens longer than one character
    if len(j)>1 and c<=1:
      #Skip title tokens such as md and phd
      if j.isalpha() and j not in ("md","phd"):
        n=n+j+" "
        c=c+1
  names.append(n.strip())

We have observed that some IDs had company names instead of the names of the people in them.
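
The extraction rule described above can be sketched as a standalone function, which makes it easy to try on individual IDs. This is only an illustration: `extract_name`, `TITLES` and the first sample ID are made up here, while the other two IDs are the ones quoted in the observations at the end of this notebook.

```python
import re

# Assumed set of title tokens to skip; only "md" and "phd" come from the notebook.
TITLES = {"md", "phd"}

def extract_name(profile_id, max_parts=2):
    """Return up to `max_parts` alphabetic tokens from a hyphen-separated ID."""
    parts = []
    for token in re.split(r"-", profile_id):
        # Keep tokens longer than one character that contain only letters.
        if len(token) > 1 and token.isalpha() and token not in TITLES:
            parts.append(token)
        if len(parts) == max_parts:
            break
    return " ".join(parts)

print(extract_name("jane-doe-phd-1a2b3c4d"))                 # made-up ID
print(extract_name("a-1-business-solutions-pllc-b318b6a9"))  # company ID
print(extract_name("a-a-c-a5863310"))                        # ID with no name
```

The company ID still yields two words ("business solutions"), which is why such rows later come back from the gender lookup as unknown.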

##Getting the gender of the person from the name

I used the Gender API (gender-api.com) to infer each person's gender from their first name. The API returns "unknown" when the gender cannot be determined from the name. In our case, business names such as "drain services" have no gender.

In [0]:
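from urllib.parse import quote

def get_gender(name):
  #Query gender-api.com for a first name and return its "gender" field
  myKey = "ojShYjxwnEaMgXWjgk"
  #Percent-encode the name so that non-ASCII characters form a valid URL
  url = "https://gender-api.com/get?key=" + myKey + "&name=" + quote(name)
  response = urlopen(url)
  decoded = response.read().decode('utf-8')
  data = json.loads(decoded)
  return data["gender"]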

In [0]:
#getting the gender from the name
gender=[]
for i in names:
  if i == "":
    gender.append("unknown")
  else:
    gender.append(get_gender(i.split(" ")[0]))
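
Since many people share a first name, the loop above may query the API repeatedly for the same name. A minimal sketch of caching the lookups is shown below. The helper `lookup_genders` and the offline stand-in `fake_lookup` are hypothetical names introduced for illustration; in the notebook, `get_gender` would be passed as the `lookup` argument.

```python
def lookup_genders(names, lookup):
    """Return one gender label per name, calling `lookup` once per distinct first name."""
    cache = {}
    genders = []
    for full_name in names:
        if not full_name:
            # No usable name was extracted from the ID.
            genders.append("unknown")
            continue
        first = full_name.split(" ")[0]
        if first not in cache:
            cache[first] = lookup(first)
        genders.append(cache[first])
    return genders

# Offline stand-in for get_gender; the names and labels are made up.
calls = []
def fake_lookup(first_name):
    calls.append(first_name)
    return {"jane": "female", "john": "male"}.get(first_name, "unknown")

sample = ["jane doe", "", "john smith", "jane roe", "drain services"]
result = lookup_genders(sample, fake_lookup)
print(result)
print(calls)  # "jane" is looked up only once
```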

##Identifying the type of job from the descriptions

For this task we can use either LDA or NMF. We will use LDA to identify the prevalent topics in the description texts, and then see which topic is most present in each document (i.e. for each person).

1. Making a bag of words model.


In [0]:
data = list(df['Description'])

#Adding task-specific stop words
my_stopwords = set(ENGLISH_STOP_WORDS)
my_stopwords.update([
    "and", "of", "to", "for", "the", "in", "as", "all", "with", "do",
    "de", "et", "en", "la", "het", "es", "pt", "des", "une", "las", "van", "del",
])


#Making the bag of words model
#stop_words is passed as a list rather than a set
vect = CountVectorizer(ngram_range=(1, 3),stop_words=list(my_stopwords), lowercase=True)
#toarray() returns a plain ndarray, whereas todense() returns an np.matrix
bow = vect.fit_transform(data).toarray()
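
To make the effect of `ngram_range=(1, 3)` concrete, here is a small pure-Python sketch of the same idea: lowercase the text, keep tokens of two or more word characters, drop stop words, then count every run of one to three consecutive remaining tokens. It is an approximation written for illustration, not scikit-learn's implementation, and `ngram_counts` is a hypothetical helper.

```python
import re
from collections import Counter

def ngram_counts(text, stop_words, max_n=3):
    """Count word n-grams (n = 1..max_n) after lowercasing and stop-word removal."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in stop_words]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("Running the day to day operations", {"the", "to"})
for ngram, count in sorted(counts.items()):
    print(count, ngram)
```

Because stop words are dropped before the n-grams are built, words that were not adjacent in the original text can end up in the same n-gram ("day day" here).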


2. Using LDA to identify the topics and percentage of these topics in each document.

In [0]:
lda = LatentDirichletAllocation(n_components=25, learning_method="batch")
d_lda = lda.fit_transform(bow)

We chose to model the job descriptions with 25 topics. We will now look at the top 5 words contributing to each topic to identify which kind of job each topic might be modeling.

In [0]:
def get_key(n):
  for word, num in vect.vocabulary_.items():
    if num ==n:
        return word

In [98]:
topic_words = {}

for topic, comp in enumerate(lda.components_): 
    word_idx = np.argsort(comp)[::-1][:5]
    topic_words[topic] = [get_key(i) for i in word_idx]

for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  application, needs, department, control, insurance
Topic: 1
  engineering, design, unit, chemical, sp
Topic: 2
  2013, general, store, april, december
Topic: 3
  exchange, company, project, mas, fui aprendendo
Topic: 4
  teacher, com, study, job, aan
Topic: 5
  clients, planning, media, f0a7, management
Topic: 6
  records, f0b2, japan, management, business
Topic: 7
  projects, project, power, support, plant
Topic: 8
  certificate, training, al, leighton, services
Topic: 9
  activities, general, design, training, technical
Topic: 10
  militares, da, ex, rcito, ex rcito
Topic: 11
  sales, business, project, management, market
Topic: 12
  assisted, design, rãƒâ, services, trading
Topic: 13
  water, treatment, service, water treatment, cooling
Topic: 14
  staff, department, housekeeping, ensure, books
Topic: 15
  control, voor, proceso, design, group
Topic: 16
  financial, audit, customer, voor, research
Topic: 17
  maintenance, inspection, site, products, staff
Topic: 18
  depa

From the above output we can see that topic 0 covers insurance and application-oriented jobs, while topic 1 covers engineering, chemical engineering and design. The d_lda matrix tells us how strongly each person belongs to each topic we identified. Let us have a look at one row in the matrix.

In [99]:
d_lda[0]

array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
       0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.76, 0.01, 0.01,
       0.01, 0.01, 0.01])

We can see that the first person (row 0) does jobs relating to topic 19 (computer sales and specialist) with high probability (76%).
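
Reading the dominant topic off a row amounts to taking the index of its largest entry. The sketch below does this for the row printed above, using plain Python lists so that it runs on its own; `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(row):
    """Return (topic_index, weight) for the largest entry of a topic-proportion row."""
    idx = max(range(len(row)), key=lambda i: row[i])
    return idx, row[idx]

# The row shown above: 0.76 at index 19 and 0.01 everywhere else.
row0 = [0.01] * 19 + [0.76] + [0.01] * 5
topic, weight = dominant_topic(row0)
print(topic, weight)
```

For the full matrix, `d_lda.argmax(axis=1)` gives the dominant topic index of every row at once.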

3. We will now merge the gender list with the matrix obtained above and build a logistic regression classifier. We will try to predict gender from each person's topic proportions (that is, from their likely job). The coefficients of the 25 topics tell us which topics (sets of jobs) contribute most to the gender classification. This points to the jobs that predominantly employ one gender over the other.

In [0]:
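#Topic proportions as a numeric numpy array
a = np.asarray(d_lda)

#Making a list for the column names:
#column0..column24 hold the topic proportions, column25 holds the gender
ln=[]
for i in range(0,26):
  ln.append("column"+str(i))


#Making the array into a dataframe; gender is added as a separate column
#so that the topic proportions stay floats instead of being cast to strings
ddf= pd.DataFrame(a, columns=ln[:25])
ddf[ln[25]] = gender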

We will be removing rows that have gender as unknown.

In [0]:
ddf1= ddf[ddf["column25"]!='unknown']

In [128]:
ddf1.shape

(263, 26)

We lost 37 rows that had gender as unknown.

In [130]:
X=ddf1.loc[:, ddf1.columns != 'column25']
y = ddf1["column25"]
lr = LogisticRegression()
lr.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [136]:
lr.coef_

array([[ 0.37935867, -0.25280998, -0.79040919, -0.74956658,  0.61498461,
         0.44146906,  0.09473661,  0.54731516,  0.7276716 ,  0.11747732,
         0.52918818,  0.40968386, -0.7067211 ,  0.19102825,  0.31344065,
        -0.46633067, -0.49928982,  0.02201175,  0.62301074, -0.39486904,
         0.15810998, -1.22262023,  0.09408712, -0.24970439,  0.06878837]])

In [139]:
#Multiply by -1 so that argsort lists the topics by descending coefficient
np.argsort(list(-1*lr.coef_))

array([[ 8, 18,  4,  7, 10,  5, 11,  0, 14, 13, 20,  9,  6, 22, 24, 17,
        23,  1, 19, 15, 16, 12,  3,  2, 21]])

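We can see that the jobs denoted by topic 8 (certification and training services) have the largest positive coefficient, followed by topic 18 (gestion, Spanish for management; personal assistance departments) and so on. These topics push the prediction most strongly towards the class stored in lr.classes_[1]. The topics at the end of the list (21, 2 and 3) have the most negative coefficients and push towards the other class. Topic 21 has the largest coefficient in absolute value (-1.22), so it is the topic most strongly associated with a single gender.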
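
The ranking above orders the topics by signed coefficient. To see which topics separate the genders most strongly in either direction, one can instead rank by absolute value and report the sign separately. A minimal sketch using the coefficients printed above, with plain Python lists so that it runs on its own:

```python
# Coefficients copied from the lr.coef_ output above (topics 0..24).
coef = [0.37935867, -0.25280998, -0.79040919, -0.74956658, 0.61498461,
        0.44146906, 0.09473661, 0.54731516, 0.7276716, 0.11747732,
        0.52918818, 0.40968386, -0.7067211, 0.19102825, 0.31344065,
        -0.46633067, -0.49928982, 0.02201175, 0.62301074, -0.39486904,
        0.15810998, -1.22262023, 0.09408712, -0.24970439, 0.06878837]

# Topic indices ordered by the magnitude of their coefficient, largest first.
ranked = sorted(range(len(coef)), key=lambda i: abs(coef[i]), reverse=True)

for i in ranked[:5]:
    # In a binary scikit-learn model, a positive coefficient raises the
    # predicted probability of lr.classes_[1]; a negative one lowers it.
    direction = "towards classes_[1]" if coef[i] > 0 else "towards classes_[0]"
    print("topic %d: %+.2f (%s)" % (i, coef[i], direction))
```

Which gender `classes_[1]` corresponds to depends on the labels returned by the API, so it is worth checking `lr.classes_` before interpreting the signs.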

##Making a new dataframe and generating an excel sheet from this new dataframe

Let us save the cleaned data frame to a file (tab-separated, despite the .csv extension) that can be opened as a spreadsheet.

In [0]:
ddf1.to_csv("final.csv", sep='\t', encoding='utf-8')

##Comments and observations

Comments and Observations:

1. The data was noisy and had multiple languages (English, French, Spanish etc.).

2. The ID column sometimes had names of the people and sometimes had names of the companies. Ex: 'a-1-business-solutions-pllc-b318b6a9'


3. If we increase the number of topics (here we only used 25), we may obtain finer-grained job representations.

4. Some ID values did not have any names in them. Ex: 'a-a-c-a5863310'