# Data Analysis of LinkedIn Skills and Positions
Tingxiang Zhu & Mengyao Zhang

# Introduction

The aim of this project is to use **LinkedIn** data to study the **relationship between personal skills and occupational fields**. We analyze how certain types of skills contribute to a person’s chance of getting a job in a certain field. We also collect each person’s education background as supporting information, to see whether the college a person attended adds to those chances.

The whole project can be divided into five parts:
* data collection: collect personal skills and positions by scraping LinkedIn profile data
* pre-processing: clean and normalize the raw data we collected (the most important part)
* model training: use a model similar to TF-IDF to analyze the relationship between skills and occupations
* model evaluation: test the model after training
* result visualization: use word clouds and other means to present the model results

## Data Collection

First of all, the analysis needs people’s skills and their current or past job information. Since LinkedIn doesn’t provide a public API that lets us extract skill and job data directly, we get them through screen scraping. We start from a single person’s name and visit his or her profile page through its LinkedIn URL. Meanwhile, we store the “also viewed” profiles as new entries for scraping. For each profile page, we fetch all the listed skills and the titles of current and past work experiences. For each skill, we store its endorsement count as the weight of that skill. We also fetch the company for each period of work experience and the school (education) information for each person, in case they are useful in future analysis.
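The crawl order described above can be sketched as a breadth-first traversal with a "seen" set, so that no profile is visited twice. This is only an illustration under assumptions: `crawl`, `fetch_also_viewed` and the toy graph are hypothetical stand-ins, and no real LinkedIn page is fetched.

```python
from collections import deque


def crawl(seed, fetch_also_viewed, limit):
    """Visit profiles breadth-first, never visiting the same one twice."""
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < limit:
        profile = queue.popleft()
        visited.append(profile)
        # every "also viewed" profile becomes a new entry, unless already known
        for neighbour in fetch_also_viewed(profile):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return visited


# Toy stand-in for the "also viewed" sidebar (made-up names).
TOY_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["dave"],
    "dave": [],
}

order = crawl("alice", lambda name: TOY_GRAPH.get(name, []), limit=10)
```

Keeping the set of known names in memory avoids re-scanning the whole name list for duplicates on every iteration.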

In all, we scraped 7,000 lines of data from LinkedIn web pages. Here is a code snippet showing how we scrape skills from the website:

In [None]:
# Code Snippet from project file linkedin.py
# Reference: https://github.com/idwaker/linkedin
with WebBus(browser) as bus:
        bus.driver.get(LINKEDIN_URL)
        login_into_linkedin(bus.driver, username)
        iteration = 0
        idx = 0
        while iteration < 20000:
                name = all_names[idx]
                if name in all_names[:idx]:
                    idx += 1
                    continue
                iteration += 1
                idx += 1
                click.echo("Getting ...")
                try:
                    search_input = bus.driver.find_element_by_id('main-search-box')
                except NoSuchElementException:
                    continue
                search_input.send_keys(name)
                search_form = bus.driver.find_element_by_id('global-search')
                search_form.submit()
                profiles = []
                results = None
                try:
                    results = bus.driver.find_element_by_id('results-container')
                except NoSuchElementException:
                    continue
                links = results.find_elements_by_xpath(link_title)

                # get all the links before going through each page
                links = [link.get_attribute('href') for link in links]
                for link in links:
                    # XXX: This whole section should be separated from this method
                    bus.driver.get(link)
                    overview = None
                    overview_xpath = '//div[@class="profile-overview-content"]'
                    try:
                        overview = bus.driver.find_element_by_xpath(overview_xpath)
                    except NoSuchElementException:
                        click.echo("No overview section skipping this user")
                        continue
                    skills = ''
                    skills_xpath =  '//a[@class="endorse-item-name-text"]'
                    try:
                        skills_summary = overview.find_elements_by_xpath(skills_xpath)
                    except NoSuchElementException:
                        skills = ''
                    else:
                        try:
                            skills_list = [skill.text for skill in skills_summary]
                            for i in skills_list:
                                skills += str(i)+','
                        except Exception:
                            # skip skills whose text cannot be converted to str
                            pass
                    name_elements = None
                    name_list = []
                    name_xpath = '//a[contains(@href,"trk=prof-sb-browse_map-name")]'
                    try:
                        name_elements = overview.find_elements_by_xpath(name_xpath)
                    except NoSuchElementException:
                        name_list = ''
                        print("failed")
                    else:
                        for element in name_elements:
                            try:
                                if str(element.text.strip().lower()) != '' and (str(element.text.strip().lower()) not in name_list) and (str(element.text.strip().lower()) not in all_names):
                                    name_list.append(str(element.text.strip().lower()))
                            except Exception:
                                continue
                        if len(name_list) != 0:
                            with open("list_of_names.csv","a+") as f:
                                wr = csv.writer(f,delimiter="\n")
                                wr.writerow(name_list)
                    all_names = collect_names(infile)
                    data = {
                        'skills':skills
                    }
                    profiles.append(data)

The data is stored in a CSV file. Loading it into a pandas DataFrame gives a quick look at the data. (Full code: data_process.py)
    

In [None]:
print(dataframe.dtypes)

In [None]:
fullname           object
locality           object
industry           object
current summary    object
past summary       object
education          object
skills             object
endorsements       object
positions          object
dtype: object

In [None]:
print(dataframe.head(5))

In [None]:
head                fullname                  locality  \
0  Tingxiang Zhu (Star)  Pittsburgh, Pennsylvania   
1    Jiaming Ni (Oscar)  Pittsburgh, Pennsylvania   
2            Jiaming Ni      Shanghai City, China   
3            Jiaming Ni                     China   
4      Yuqi Wang (yuki)  Pittsburgh, Pennsylvania   

                               industry       current summary  \
0   Information Technology and Services                   NaN   
1   Information Technology and Services                   NaN   
2  Mechanical or Industrial Engineering   Honeywell Aerospace   
3                       Broadcast Media  Shanghai Media Group   
4                              Internet                   NaN   

                                        past summary  \
0  DaoCloud.io, 10years.me, Hand Enterprise Solut...   
1                                            NetEase   
2  SKF Global Technical Center China, Donghua Uni...   
3                                Shanghai Meda Group   
4                 Rakuten, Hundsun Technologies Inc.   

                                           education  \
0                         Carnegie Mellon University   
1                         Carnegie Mellon University   
2                                 Donghua University   
3                                                NaN   
4  Carnegie Mellon University - H. John Heinz III...   

                                              skills  \
0  Cloud Computing,Python,Hadoop,SQL,Start-ups,in...   
1  Python,Java,Shell Scripting,MySQL,MapReduce,Ha...   
2  Testing,Engineering,NI LabVIEW,Matlab,Manufact...   
3                                                NaN   
4  Java,Linux,Microsoft Office,Databases,HTML,Mic...   

                         endorsements  \
0        5,6,5,5,2,2,2,3,1,2,1,1,1,1,   
1  8,8,6,6,4,3,2,0,0,0,0,0,0,0,0,0,0,   
2    2,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,   
3                                 NaN   
4          20,14,13,12,9,6,6,5,4,3,3,   

                                           positions  
0  Software Engineer Intern,Co-Founder,Business I...  
1                       Software Development Intern,  
2  Advanced Manufacturing Engineer,Hard Machining...  
3           Researcher,Researcher,Researcher,Editor,  
4        Software Engineer,Android Developer Intern,  
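As the output shows, `skills` and `endorsements` are comma-terminated strings whose entries line up by position. A minimal sketch of pairing them follows; the helper name `parse_weighted_skills` is hypothetical, and summing the counts of a repeated skill is an assumption rather than something taken from data_process.py. Rows whose fields are NaN would have to be dropped before calling it.

```python
def parse_weighted_skills(skills_field, endorsements_field):
    """Pair each skill with its endorsement count.

    Returns an empty dict when the two fields do not line up.
    """
    skills = [s.strip().lower() for s in skills_field.split(",") if s.strip()]
    counts = [int(c) for c in endorsements_field.split(",") if c.strip()]
    if len(skills) != len(counts):
        return {}
    weighted = {}
    for skill, count in zip(skills, counts):
        # assumption: a skill listed twice keeps the sum of its counts
        weighted[skill] = weighted.get(skill, 0) + count
    return weighted


row = parse_weighted_skills("Python,Java,Shell Scripting,", "8,8,6,")
```

The length check matters because a skill name can itself contain commas (for example "MS Access, Excel, Word"), in which case a plain split on commas no longer matches the endorsement list.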

With some simple aggregation, we can get the following statistics from the raw data.

Number of different skills:

In [None]:
9541

Number of different job titles:

In [None]:
num of positions: 23994

Most popular positions (TOP 10):

In [None]:
['Director', 'Manager', 'Software Engineer', 'Consultant', 'Vice President', 'Owner', 'Project Manager', 'President', 'Intern', 'Founder']

Person with the largest number of skills:

In [None]:
Christina Quinones

His/her skills:

In [None]:
Oracle Applications,Oracle E-Business Suite,ERP,Oracle,CRM,Business Process,Testing,Business Analysis,Project Management,Visio,Microsoft Excel,Data Analysis,Oracle CRM,Analysis,Leadership,Troubleshooting,Program Management,Management,Financial Modeling,Cloud Computing,Oracle Order Management,Sales,Lean Process/DFSS Green...,MS Access, Excel, Word,Shoretel Administration,Project Management,Strategy,Agile Methodologies,Hedge Funds,Fixed Income,Software Development,Derivatives,SDLC,Management,Equities,SQL,Software Project...,.NET,Asset Managment,Consulting,Bloomberg,C#,Data Warehousing,Software Engineering

Most frequently mentioned skills overall:

In [None]:
['management', 'leadership', 'strategy', 'marketing', 'project management', 'social media', 'business development', 'strategic planning', 'program management', 'sales']

Most frequently mentioned skills among people who have been “Software Engineer”:

In [None]:
{'software development': 16, 'xml': 11, 'java': 11, 'javascript': 10, 'sql': 10, 'agile methodologies': 9, 'c#': 9, 'ajax': 8, 'linux': 8, 'scrum': 8}
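Statistics like the ones above can be produced with `collections.Counter`. The sketch below runs on three made-up records, not on the scraped data, and its variable names are illustrative.

```python
from collections import Counter

# Made-up records: (positions, skills) per person.
people = [
    (["Software Engineer", "Intern"], ["java", "sql", "linux"]),
    (["Software Engineer"], ["java", "python"]),
    (["Manager"], ["leadership", "sql"]),
]

position_counts = Counter()
skill_counts = Counter()
skills_by_position = {}
for positions, skills in people:
    position_counts.update(positions)
    skill_counts.update(skills)
    # count a person's skills once per distinct position held
    for position in set(positions):
        skills_by_position.setdefault(position, Counter()).update(skills)

top_position = position_counts.most_common(1)
top_engineer_skill = skills_by_position["Software Engineer"].most_common(1)
```

`most_common(10)` on the same counters would give a top-10 list in the format shown above.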

## Data Pre-processing

### Raw Data Problems

However, the raw data has many problems.
* **Invalid data problem**
  * The profile pages of some LinkedIn users are written in languages other than English.
  * Some positions are empty.
  * Some users did not fill in any skills.


* **Job title similarity problem:**
  ***This is a very tough nut to crack in this project.*** For example, some people use “Software Engineer” as their job title while others use “Software Developer”. Since these positions mean nearly the same thing, we want to merge such similar job titles into one. Likewise, we need to merge similar skills such as “teamwork” and “teamworking”. The school names are mostly in standard form.


### Pre-processing of data
#### Drop invalid data

The pre-processing task includes removing invalid data, unifying the expression of job positions and skills, and categorizing them.

* Remove invalid data: since we focus on the relationship between skills and positions, we mainly deal with these two columns. We drop the records where either one is NaN and store the rest in a dictionary for later processing.


In [None]:
## Code Snippet from file data_process.py
from collections import Counter
def read_skill_and_position(df):
    # drop NaN records
    df = df.dropna(subset=['positions', 'skills'])
    position_dic = {}
    positions_total = []
    top_skill_num = 0
    for index, row in df.iterrows():
        positions = row['positions'][:-1].split(",")
        skills = row['skills'][:-1].split(",")
        skills = [skill.lower() for skill in skills if skill]
        if len(skills) > top_skill_num:
            top_person = index
            top_skill_num = len(skills)
        positions_total.extend(positions)
        for position in positions:
            # strip first so the lookup and the update use the same key
            position = position.strip()
            if position in position_dic:
                position_dic[position] += Counter(skills)
            else:
                position_dic[position] = Counter(skills)
    positions_set = set(positions_total)
    position_counter = Counter(positions_total)
    return position_dic, position_counter, top_person, positions_set

After processing the data, we get position lists and skill lists for later analysis.

In [None]:
Snippet of positions list:
    
['Senior Security Architect', ' Police Central e-Crime Unit', 'Consulting Services Manager', 
 'Cape Sharp Tidal', 'Director of Training', 'Regional Sales Director Middle East',
 'Manager- Business Planning and Analysis', 'Product & Services Manager', 
 ' Marketing & Business Operations/Vice President', ' Development Manager', 
 'Temporary', "Vice President of America's Region Marketing", 'Senior Consultant']

In [None]:
Snippet of skills list:
    
['java script', 'ministry leadership', "children's books", 
 'probability', 'garden coaching', 'maple', 'logicworks', 'permanent life insurance',
 'activity coordination', 'computer arithmetic', 'dodaf', 'mathematical programming', 
 'tech sherpa', 'energy work', 'member of agcas...', 'goldsmithing', 'asp.net web api',
 'metal card', 'bathroom accessories', 'google website optimizer', 
 'iar ezbedded workbench', 'undercover', 'icefaces', 'gifting strategies']

#### Clustering positions and skills data:

As noted above under the "job title similarity problem", we can apply natural language processing techniques such as lemmatizing and word similarity analysis. Since we currently have 23994 different positions and 9541 different skills, we expect far fewer distinct skills and job positions to remain after this processing, few enough that we could then categorize them manually.

So the actual problem here is: how do we cluster a list of short strings such as positions and skills?

From the course we know that we can represent strings as numerical vectors. One typical approach is to combine k-means clustering with a distance measure, but how do we represent the mean of a set of strings? We would normally use TF-IDF weights, but those are designed for clustering text documents. We have no documents; we want to cluster individual words and short phrases.

We first need a way to measure the similarity between words:
* **Method 1:** using **edit distance** to calculate the similarity between words:



In [None]:
# http://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups
import numpy as np
import sklearn.cluster
import editdistance  # https://pypi.python.org/pypi/editdistance , a fast implementation of Levenshtein distance

words = np.asarray(list(positions_set)) #So that indexing with a list will work
# affinity propagation expects similarities, so negate the pairwise edit distances
lev_similarity = -1*np.array([[editdistance.eval(w1,w2) for w1 in words] for w2 in words])
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)


Then we get the result like this:

In [None]:
 - *Vice President:* Asst Vice-President, Doctoral Student, Global President, Graduate Student, President, Senior Vice President, Vice President, Vice-President
 - *PGA Assistant Golf Professional:* Assistant Golf Professional, PGA Assistant Golf Professional, PGA Head Golf Professional, PGA Lead Assistant Golf Professional
 - *Vice President - Account:*  President & Author, Executive Vice President - Account, General Ledger Accountant, Vice President - Account, Vice President of Peabody Hall
 - *Accounting Specialist:* Accounting Assistant, Accounting Specialist, Application Specialist, Communications Specialist, Hard Machining Specialist, Laboratory Specialist, Staffing Specialist
 - *Assistant Manager:* Account Technical Manager, Assistant Client Manager, Assistant Director, Assistant Engineer, Assistant Executive, Assistant Location Manager, Assistant Manager, Business Alliance Manager, Engineering Manager, Install Sales Manager, Operations Manager, Sales Associate and Cashier
 - *Graduate Student Researcher:* Final Year Project Researcher, Graduate Student Instructor, Graduate Student Researcher, Student Researcher, Undergraduate Researcher, Undergraduate independent research
 - * Economics Department:*  Economics Department,  Pharmacy Department, Economics 10 Teaching Fellow, Economics Tutor
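For reference, the distance behind these clusters can be written in a few lines of pure Python. This is a textbook dynamic-programming sketch of Levenshtein distance, not the implementation used by the `editdistance` package.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


hyphen_variant = levenshtein("Vice President", "Vice-President")
suffix_variant = levenshtein("teamwork", "teamworking")
```

The function counts character edits only, so spelling variants of one title come out close, while two titles with the same meaning but different wording do not.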

These clusters look plausible at first glance, but a closer look shows that this method only takes the surface similarity of the strings into account.
What we need is similarity in meaning between the titles.
So we came up with another method:

* Method 2: Use Wikipedia search results to cluster the positions. A Wikipedia search returns a list of titles that are close in meaning to the query. We thought this might work well because Wikipedia titles are edited by hand, so we expected higher accuracy. For example:

In [None]:
wikipedia.search("obama")

In [None]:
[u'Barack Obama', u'Family of Barack Obama', u'Michelle Obama', u'Crush on Obama', u'Barack Obama Sr.', u'Obama logo', u'Presidency of Barack Obama', u'Barack Obama in comics', u'Obama']

But when we clustered all positions using this method, we eliminated only about 200 positions, with over 20000 positions still left.
The result is poor.

* Method 3: Search Wikipedia for every word that appears in the position titles and use the retrieved articles as training material for word vectors. This lets us compute the similarity between words by their meaning instead of just the surface similarity of the strings.
For this step we use the [**Gensim**](https://radimrehurek.com/gensim/) Python library, which can automatically extract semantic structure from raw documents.
Here is the code snippet that trains the word vectors on the collected Wikipedia text:

In [None]:
# Code Snippet from project file wiki_word2vec_test.py
#  Reference ：http://www.shuang0420.com/2016/05/30/gensim-word2vec%E5%AE%9E%E6%88%98/
# -*- coding: utf-8 -*-
import gensim
from gensim.models import Phrases
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from stop_words import get_stop_words
import os
import logging
import re
import multiprocessing
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# logging information
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

# get input file, text format
inp = sys.argv[1]
input = open(inp, 'r')
output = open('output_word.seq', 'w')
#  remove the stop words
stop_words = get_stop_words('en')

# read file and separate words
for line in input.readlines():
    line=line.strip('\n')
    new_line = ''
    for i in line.split(' '):
        if i.isalpha():
            if i.lower() not in stop_words:
                new_line += i.lower() + ' '
    output.write(new_line)

output.close()
output= open('output_word.seq', 'r')
model = Word2Vec(LineSentence(output), size=100, window=3, min_count=5,workers=multiprocessing.cpu_count())
# # save model
model.save('output_word.model')
model.save_word2vec_format('output_word.vector', binary=False)


After saving the model, we can load it to get the similarity we need:

In [None]:
# test
model=gensim.models.Word2Vec.load('output_word.model')
vocab = list(model.vocab.keys())
print(model.similarity('engineer', 'consultant'))

In [None]:
0.512911466709
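The number returned by `model.similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation, using made-up 3-dimensional vectors in place of the 100-dimensional ones trained above:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up "embeddings", for illustration only.
engineer = [1.0, 2.0, 0.0]
consultant = [2.0, 1.0, 0.0]
score = cosine_similarity(engineer, consultant)
```

Values near 1 mean the two words appear in similar contexts in the training text; values near 0 mean they are unrelated.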

We also tried other ways of clustering these words, such as using a phrase model instead of a word model. However, the clustering results were not as good as with the word model.
Finally, once we have the similarity between the words in the positions, clustering all the positions becomes much easier.
We found a list of job fields, which appears as `category_list` in the code. We compare every position in our data against this list. Since some positions consist of several words, we compare each word of the title with each category and assign the category with the highest similarity.

Code Snippets:

In [None]:

# Code Snippet from project file get_new_positions.py 
model=gensim.models.Word2Vec.load('output_word.model')
count = 0
new_positions_dict = {}  # maps each original title to its category
category_list = ['software','product', 'consultant','analyst', 'accounting','manufacturing','sports','banking','fundraiser','fashion','information',
'operation', 'market', 'reporter', 'sales', 'finance', 'hr', 'public','technology','business']
for word in positions_set:
    old_word = word
    new_position = ''
    word = re.sub('[^0-9a-zA-Z]+', ' ', word)
    word_ist = word.split()
    max_similarity = 0
    for i in word_ist:
        for j in category_list:
            simi = 0
            try:
                simi = model.similarity(i.lower(), j)
            except KeyError:  # word is not in the model vocabulary
                continue
            if simi >= max_similarity:
                max_similarity = simi
                new_position = j
    if new_position not in category_list:
        new_position = 'other'
        count += 1
    new_positions_dict[old_word] = new_position

Snippet of the final old-position to new-position dictionary:

['Software Developer Engineer in Test': 'software', 'Commercial Restructuring Executive': 'finance', 'Partner/Creative': 'fashion', ' Web 1.0': 'software', 'Web Content Producer': 'reporter', 'Senior Law Enforcement Policy Analyst for US Army': 'analyst', 'Short-term attachment': 'fundraiser', 'CHRO & EVP Human Resources': 'hr', ' Business Developmen'Lab Intern': 'fundraiser',  'People & Culture Product Manager': 'product', 'Java/UI Developer': 'software']

We spot-checked the mapping, and most of the assigned categories are reasonable.
Using this new list of positions, we can build a better analysis model.

    

## Analysis Model

We used two types of models in this study of LinkedIn profiles.
* **Skill Model** to transform the skill set and position set of each person into numeric vectors
* **Learning Model** to learn from the training data to predict a person's position set from his/her skills

### Skill Model

The skill model converts the lists of skill and position strings into numeric vectors so that they can be accepted by the learning model.

#### Feature Vector

A person's skills are somewhat similar to the words of a document in natural language processing. Person A may share some skills with person B and other skills with person C, just as words occur across documents. The more skills two people have in common, the more likely they are to hold the same position, much as we classify documents by word frequency.

Therefore, we apply an idea from natural language processing to quantify the skills and positions: the TF-IDF matrix. In NLP, the term frequency (TF) measures the weight of a word in a document. For skills, we use the number of endorsements on the skill (SE) instead, since endorsements indicate to some degree how well the person has mastered the skill. Inverse document frequency (IDF) measures the importance of a term by how rarely it appears across documents. Accordingly, we use inverse person frequency (IPF) to weigh the importance of a skill: the more widely a skill is held (Microsoft, for instance), the less likely it is to characterize a particular type of position. The detailed correspondence between TF-IDF and our skill model is shown in the table below.

![Correspondence between TF-IDF and SE-IPF](img/se-ipf.png)
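The exact IPF formula is given only in the table image. As a hedged illustration, the sketch below assumes that IPF mirrors the standard IDF definition, $\log(N / n_s)$, where $N$ is the number of people and $n_s$ the number of people listing skill $s$. The function name and the four skill sets are made up.

```python
import math
from collections import Counter


def inverse_person_frequency(skill_sets):
    """IPF(s) = log(N / n_s), by analogy with IDF (an assumed formula)."""
    total_people = len(skill_sets)
    people_per_skill = Counter()
    for skills in skill_sets:
        people_per_skill.update(set(skills))  # count each person once per skill
    return {skill: math.log(total_people / float(count))
            for skill, count in people_per_skill.items()}


# Made-up skill sets for four people.
ipf = inverse_person_frequency([
    {"microsoft", "python"},
    {"microsoft", "marketing"},
    {"microsoft", "python", "data analysis"},
    {"microsoft"},
])
```

Under this assumption, a skill listed by everyone gets weight 0, and rarer skills get larger weights.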

The union of the skills of all people forms the whole skill set, and each person's skill vector is built over this set.

For instance, suppose the whole skill set is $['marketing','leadership','microsoft','python','data analysis','management']$ with IPF $[2, 1, 0.1, 2, 4, 2]$.

A person with skills $['data analysis', 'microsoft','leadership']$ and endorsements $[10, 15, 4]$ then has the skill vector $[0, 4, 1.5, 0, 40 ,0]$, where each entry is the endorsement count multiplied by the skill's IPF.
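The worked example can be reproduced with a short sketch. The function name `skill_vector` is illustrative and not taken from the project code.

```python
skill_set = ["marketing", "leadership", "microsoft", "python",
             "data analysis", "management"]
ipf = [2, 1, 0.1, 2, 4, 2]


def skill_vector(endorsements, skill_set, ipf):
    """Multiply each endorsement count (SE) by the skill's IPF; absent skills get 0."""
    return [endorsements.get(skill, 0) * weight
            for skill, weight in zip(skill_set, ipf)]


person = {"data analysis": 10, "microsoft": 15, "leadership": 4}
vector = skill_vector(person, skill_set, ipf)
```

Because the vector follows the order of the skill set, every person maps to a vector of the same length, which is what the learning model needs.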


We also tried a matrix using only SE without IPF; its classification results are slightly worse than with IPF. Therefore, SE-IPF is our final feature model.

#### Label Vector

One challenging part of our skill analysis is that each person has held different types of positions over the course of a career. Therefore, the class label should be a set of positions instead of a single position. That is, each input carries multiple labels in both the training and the testing data.

In this case, we use an ad-hoc representation for labels. The pre-processing step already gives us a position set:

$['product','fashion', 'finance', 'business', 'reporter', 'hr', 'sales', 'fundraiser', 'accounting', 'operation', 'manufacturing',\\ 'technology', 'analyst', 'market', 'information', 'consultant', 'sports', 'banking', 'other', 'public', 'software']$ 

Thus, the position vector is built according to this set. A position held by a person (at any time in his/her life) is marked 1 in the position vector; otherwise it is marked 0.

For instance, a person who has held jobs in $['software', 'business','analyst']$ will be assigned the label vector $[0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]$.

We mark a position 1 as long as the person has held it, no matter when or how many times.
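A minimal sketch of this encoding, using the position set listed above. The function name `label_vector` is illustrative; scikit-learn's `MultiLabelBinarizer` offers similar functionality.

```python
position_set = ["product", "fashion", "finance", "business", "reporter", "hr",
                "sales", "fundraiser", "accounting", "operation",
                "manufacturing", "technology", "analyst", "market",
                "information", "consultant", "sports", "banking", "other",
                "public", "software"]


def label_vector(positions_held, position_set):
    """Return a 0/1 vector; repeats and ordering of positions_held are ignored."""
    held = set(positions_held)
    return [1 if position in held else 0 for position in position_set]


# "software" appears twice on purpose: repeats do not change the label.
labels = label_vector(["software", "business", "analyst", "software"],
                      position_set)
```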

### Learning Model

The fact that a set of skills can lead to multiple positions makes our classification task a multilabel classification task. We expect to see multiple possible positions when we predict on a given set of skills. Therefore, we use the OneVsRestClassifier in sklearn.multiclass, which supports multilabel classification.

For the internal estimators of the classifier, we tested linear SVM, Multinomial Naive Bayes, Gaussian Naive Bayes and Orthogonal Matching Pursuit. The classification results for these estimators are listed in the table below. The classification results for using only SE without IPF are also included.

![Classification results by estimator](img/models.png)

We also tried implementing our own Multinomial Naive Bayes classifier, but its results were far worse than sklearn's. We therefore ended up using the classifiers in sklearn.

Notice that the result table reports average precision and recall as well as f1-score. Consider the practical situation in which a person uses the prediction on his/her skills to decide which positions to apply for: it is preferable to get more candidate positions, some of them only loosely related, than to get only closely related positions while missing some potential ones. Therefore, we value recall over precision. In the end, we chose Multinomial Naive Bayes, which has the highest recall.
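The trade-off between precision and recall can be made concrete with a small sketch. It uses micro-averaging over all (person, label) pairs, which is an assumption: the averaging behind the result table is not stated. The data is made up.

```python
def micro_scores(true_rows, predicted_rows):
    """Micro-averaged precision, recall and F1 over 0/1 label matrices."""
    tp = fp = fn = 0
    for true_row, predicted_row in zip(true_rows, predicted_rows):
        for truth, guess in zip(true_row, predicted_row):
            if guess and truth:
                tp += 1
            elif guess and not truth:
                fp += 1
            elif truth and not guess:
                fn += 1
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two people, three possible labels (made-up data).
truth = [[1, 0, 1], [0, 1, 0]]
generous = [[1, 1, 1], [1, 1, 0]]   # predicts extra labels, misses none
cautious = [[1, 0, 0], [0, 0, 0]]   # predicts few labels, misses two

generous_scores = micro_scores(truth, generous)
cautious_scores = micro_scores(truth, cautious)
```

The generous predictor has lower precision but finds every true position, which is the behaviour preferred above; the cautious one has perfect precision but misses two of the three true positions.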

## Result

### Position Prediction

With the model developed in the previous section, we can predict a person's potential positions given his/her skills and the corresponding endorsements.

For instance, given a person's skills $['java', 'python', 'programming', 'visual studio']$ with endorsements $[1,1,1,1]$, the prediction result is $['software', 'information', 'reporter']$.

While reporter doesn't seem very related, the positions that common sense ranks highest, software and information, are both in the predicted output. Moreover, a good programmer can probably become a good reporter!

### Skill graph

One good thing about the Naive Bayes classifier is that it predicts from the conditional probability of each feature given each class, and it stores those conditional probabilities as attributes of the classifier. Therefore, we can access those probabilities and find out which skills carry more weight for obtaining a certain position.
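As an illustration of those per-class probabilities, the sketch below estimates Laplace-smoothed $P(\text{skill} \mid \text{class})$ the way a multinomial Naive Bayes model does and ranks the skills for one field. scikit-learn's `MultinomialNB` exposes the corresponding values as `feature_log_prob_`. The data, the function name and the single "software" label are made up.

```python
import math


def feature_log_probs(rows, labels, alpha=1.0):
    """Laplace-smoothed log P(skill | class) for a binary (0/1) label."""
    n_features = len(rows[0])
    totals = {0: [0.0] * n_features, 1: [0.0] * n_features}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            totals[label][j] += value
    log_probs = {}
    for label, counts in totals.items():
        denominator = sum(counts) + alpha * n_features
        log_probs[label] = [math.log((c + alpha) / denominator) for c in counts]
    return log_probs


skills = ["java", "public relations", "sql"]
# Made-up feature rows and a 0/1 label for the "software" field.
rows = [[5.0, 0.0, 3.0], [4.0, 0.0, 1.0], [0.0, 6.0, 0.0]]
is_software = [1, 1, 0]

log_probs = feature_log_probs(rows, is_software)
ranked = sorted(zip(skills, log_probs[1]), key=lambda pair: pair[1], reverse=True)
top_skill = ranked[0][0]
```

The ranked (skill, log-probability) pairs are the kind of weights that can then be drawn with larger text for more probable skills.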

To visualize the importance of skills for certain positions, we use the word_cloud package (https://github.com/amueller/word_cloud) in Python to build a skill graph for each position based on the skills' conditional probabilities. Some sample figures are shown below.

![Sample skill graphs](img/skill_graph_sample.png)

From the graphs, we can get some idea of which skills are essential when applying for certain positions.

For instance, software jobs emphasize software development skills. Knowledge of leading technologies such as cloud computing and data centers also adds to the chance of entering the field. On the other hand, public jobs require skills like public relations and dealing with the media. Accounting jobs ask more for skills in accounting and financial reporting.

### Conclusion & Future Work

#### Conclusions

In this project, we scraped data from the LinkedIn website to analyze how skills meet the requirements of positions in specific fields. Along the way, we went through the complete process of "practical data science".

We began by getting raw data from LinkedIn and converting it into a form better suited to later analysis. We then trained a model on the training data and made predictions on the test data. Finally, we used the popular word cloud to visualize the results.

While working through the whole process step by step, we also tried many new models and libraries to optimize the final model.

The final complete model provides two functionalities:

**1. Predict potential positions given a set of skills**

**2. Illustrate the important skills in each of the 21 fields by word graph**

Using this model, we find that there do exist certain patterns between skills and job positions, which is consistent with our original assumption.

#### Future Work

There is still room to improve our model.

1. During data pre-processing, we filter and classify only the position data into 21 different fields to improve classification. In fact, the skills data also includes duplicate skills expressed in different words. Pre-processing those as well may further improve the result.

2. We use all the skills when training the model, although some of them have little impact on the results. Applying a feature selection algorithm to reduce the dimensionality may help.

3. When using the positions as labels, we mark a position as 1 if a person has held it, regardless of how often and when. Taking this information into account may further improve the classification.

4. We noticed that the training error of all the estimators we tried is nearly 0. That is, the trained models suffer from overfitting. Having more training samples may help solve the problem.

If we have time, we would like to try these methods in the future to see whether they can achieve higher accuracy.

References:
* http://www.tfidf.com/
* https://github.com/idwaker/linkedin
* http://www.shuang0420.com/2016/05/30/gensim-word2vec%E5%AE%9E%E6%88%98/
* https://radimrehurek.com/gensim/
* https://github.com/amueller/word_cloud