# 🎉 Exploratory Data Analysis for Linkedin Job Posting 2023  🎉

Team-RYL: Lluka Stojollari, Renqing Cuomao, Yunlong Dong 

March 2024, Data Visualization, Milestone 1, EPFL

This notebook contains the code and analysis for the exploratory data analysis of the Linkedin Job Posting dataset for the year 2023. The goal of this analysis is to gain insights and understand the characteristics of the job postings in order to inform decision-making and strategy.

**Note**: In order to reproduce the results, please ensure that you have downloaded the data locally. Alternatively, you can run the notebook on the repository containing the data from this link:  https://github.com/LukaSt99/COM-480-Data/tree/main

## Imports and Libraries 🚨

Here, we import all the needed libraries. Make sure to install all the frameworks and libraries used below.

In [131]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [132]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Preprocessing phase: Containing basic preprocessing , merging and normalizing the data

In [135]:
# Read the CSV file into a pandas DataFrame
job_posting = pd.read_csv("Data/job_postings.csv")
job_posting.head(2)

Unnamed: 0,job_id,company_id,title,description,max_salary,med_salary,min_salary,pay_period,formatted_work_type,location,...,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,scraped
0,3757940104,553718.0,Hearing Care Provider,Overview\n\nHearingLife is a national hearing ...,,5250.0,,MONTHLY,Full-time,"Little River, SC",...,,Entry level,,1699090000000.0,careers-demant.icims.com,0,FULL_TIME,USD,BASE_SALARY,1699138101
1,3757940025,2192142.0,Shipping & Receiving Associate 2nd shift (Beav...,Metalcraft of Mayville\nMetalcraft of Mayville...,,,,,Full-time,"Beaver Dam, WI",...,,,,1699080000000.0,www.click2apply.net,0,FULL_TIME,,,1699085420


In [136]:
#Company-related datasets
companies = pd.read_csv("Data/company_details/companies.csv")
companies.head(2)

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare


In [138]:
#Preprocess company_industries
company_industries = pd.read_csv("Data/company_details/company_industries.csv")
# group by company_id and create a list of industries (in order to get rid of duplicates)
company_industries = company_industries.groupby('company_id')['industry'].apply(lambda x: list(x)).reset_index()   
company_industries.head(3)

Unnamed: 0,company_id,industry
0,1009,"[Information Technology & Services, IT Service..."
1,1016,"[Hospital & Health Care, Hospitals and Health ..."
2,1021,"[Renewables & Environment, Renewable Energy Se..."


In [139]:
#Preprocess company_specialities
company_specialities = pd.read_csv("Data/company_details/company_specialities.csv")
company_specialities = company_specialities.groupby('company_id')['speciality'].apply(lambda x: list(x)).reset_index()
company_specialities.head(3)

Unnamed: 0,company_id,speciality
0,1009,"[Cloud, Mobile, Cognitive, Security, Research,..."
1,1016,"[Healthcare, Biotechnology]"
2,1021,"[Distributed Power, Gasification, Generators, ..."


In [140]:
employee_counts = pd.read_csv("Data/company_details/employee_counts.csv")
#Since there are duplicates for each company_id, we will keep the latest observation 
employee_counts = employee_counts.sort_values('time_recorded', ascending=True).drop_duplicates('company_id')
employee_counts.head(3)

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,81149246,6,91,1692645000.0
1,10033339,3,187,1692645000.0
2,6049228,20,82,1692645000.0


Merge , filter and preprocess the data realted to companies

In [142]:
#Merge the companies and company_industries datasets
companies = pd.merge(companies, company_industries, on='company_id', how='left') 
#Merge the companies and company_specialities datasets
companies = pd.merge(companies, company_specialities, on='company_id', how='left')
#Merge the companies and employee_counts datasets
companies = pd.merge(companies, employee_counts, on='company_id', how='left')
companies.head(5)

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url,industry_x,speciality_x,employee_count_x,follower_count_x,time_recorded_x,industry_y,speciality_y,employee_count_y,follower_count_y,time_recorded_y
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,"[Information Technology & Services, IT Service...","[Cloud, Mobile, Cognitive, Security, Research,...",316130.0,16114398.0,1692851000.0,"[Information Technology & Services, IT Service...","[Cloud, Mobile, Cognitive, Security, Research,...",316130.0,16114398.0,1692851000.0
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare,"[Hospital & Health Care, Hospitals and Health ...","[Healthcare, Biotechnology]",53495.0,2060378.0,1692853000.0,"[Hospital & Health Care, Hospitals and Health ...","[Healthcare, Biotechnology]",53495.0,2060378.0,1692853000.0
2,1021,GE Power,"GE Power, part of GE Vernova, is a world energ...",7.0,NY,US,Schenectady,12345,1 River Road,https://www.linkedin.com/company/gepower,"[Renewables & Environment, Renewable Energy Se...","[Distributed Power, Gasification, Generators, ...",26963.0,2340835.0,1692866000.0,"[Renewables & Environment, Renewable Energy Se...","[Distributed Power, Gasification, Generators, ...",26963.0,2340835.0,1692866000.0
3,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7.0,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...,"[Information Technology & Services, IT Service...",,70995.0,3646359.0,1692840000.0,"[Information Technology & Services, IT Service...",,70995.0,3646359.0,1692840000.0
4,1028,Oracle,We’re a cloud technology company that provides...,7.0,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle,"[Information Technology & Services, IT Service...","[enterprise, software, applications, database,...",202019.0,9289332.0,1692861000.0,"[Information Technology & Services, IT Service...","[enterprise, software, applications, database,...",202019.0,9289332.0,1692861000.0


Import maps datasets

In [170]:
#Maps datasets
industries = pd.read_csv("Data/maps/industries.csv")
industries.head(3)

Unnamed: 0,industry_id,industry_name
0,1,Defense and Space Manufacturing
1,3,Computer Hardware Manufacturing
2,4,Software Development


In [171]:
#Maps datasets
skills = pd.read_csv("Data/maps/skills.csv")
skills.head(3)

Unnamed: 0,skill_abr,skill_name
0,PRCH,Purchasing
1,SUPL,Supply Chain
2,PR,Public Relations


Job-related datasets

In [176]:
#Job-related datasets
benefits = pd.read_csv("Data/job_details/benefits.csv")
# group by job_id and create a list of benefits (in order to get rid of duplicates)
benefits = benefits.groupby('job_id')['type'].apply(lambda x: list(x)).reset_index()
benefits.head(3)

Unnamed: 0,job_id,type
0,3958427,[Medical insurance]
1,85008768,"[Medical insurance, Vision insurance, Dental i..."
2,133114754,"[Medical insurance, 401(k), Vision insurance]"


In [177]:

job_industries = pd.read_csv("Data/job_details/job_industries.csv")
#Merge job_industries and industries datasets
job_industries = pd.merge(job_industries, industries, on='industry_id', how='left')
# group by job_id and create a list of industries (in order to get rid of duplicates)

job_industries = job_industries.groupby('job_id')['industry_name'].apply(lambda x: list(x)).reset_index()
job_industries.head(3)

Unnamed: 0,job_id,industry_name
0,3958427,[Personal Care Product Manufacturing]
1,85008768,[Insurance]
2,102339515,[Consumer Services]


In [179]:
job_skills = pd.read_csv("Data/job_details/job_skills.csv")
#Merge job_skills and skills datasets
job_skills = pd.merge(job_skills, skills, on='skill_abr', how='left')
job_skills = job_skills.groupby('job_id')['skill_name'].apply(lambda x: list(x)).reset_index()
job_skills.head(3)

Unnamed: 0,job_id,skill_name
0,3958427,"[Design, Art/Creative, Information Technology]"
1,85008768,"[Sales, Business Development]"
2,102339515,"[Business Development, Sales]"


In [180]:
salaries = pd.read_csv("Data/job_details/salaries.csv")
salaries.head(3)

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type
0,1,3378133231,30.0,,22.0,HOURLY,USD,BASE_SALARY
1,2,3690843087,65000.0,,55000.0,YEARLY,USD,BASE_SALARY
2,3,3691794313,22.0,,19.0,HOURLY,USD,BASE_SALARY
