<a href="https://www.kaggle.com/code/enricofindley/linkedin-job-postings-2023-data-analysis?scriptVersionId=143356704" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## **LinkedIn Job Postings Analysis**

In this notebook I merge and clean the LinkedIn job postings data, then produce some basic visualizations for analysis.

I hope you enjoy the notebook. Please upvote if you like it. Thanks!

# Importing Dependencies

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from functools import reduce # helper for merging multiple dataframes
from wordcloud import WordCloud # module to print word cloud

# Data Collecting & Pre-Processing

In [None]:
# main dataframe
job_postings_data = pd.read_csv("/kaggle/input/linkedin-job-postings/job_postings.csv")
job_postings_data

Let's check whether <code>job_id</code> contains any duplicates or nulls.

In [None]:
duplicates = job_postings_data['job_id'].duplicated()
num_duplicates = duplicates.sum()
num_duplicates

In [None]:
job_postings_data['job_id'].isnull().sum()

There are no duplicates and no nulls, so we can move on to the next step.

<code>benefits.csv</code> contains duplicate <code>job_id</code> values and an <code>inferred</code> column we don't need, so let's combine the rows that share a <code>job_id</code> and drop the <code>inferred</code> column.
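
The two checks above can be folded into one small helper and reused for the other key columns. This is only a sketch on made-up data, and `check_key` is a hypothetical helper name, not something pandas provides.

```python
import pandas as pd

def check_key(df: pd.DataFrame, key: str) -> dict:
    """Count duplicated and missing values in a candidate key column."""
    return {
        "duplicates": int(df[key].duplicated().sum()),
        "nulls": int(df[key].isnull().sum()),
    }

# Toy frame standing in for job_postings.csv (values are made up)
toy = pd.DataFrame({"job_id": [1, 2, 2, None]})
report = check_key(toy, "job_id")
print(report)
```

A key column is safe to merge on when both counts are zero.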

In [None]:
job_benefits_data = pd.read_csv("/kaggle/input/linkedin-job-postings/job_details/benefits.csv")
job_benefits_data = job_benefits_data.drop('inferred', axis=1) # remove 'inferred' column
job_benefits_data = job_benefits_data.groupby('job_id')['type'].agg(lambda x: ', '.join(x)).reset_index() # aggregate same job benefits
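
To see what the aggregation does, here is a minimal sketch on a toy frame. The benefit names are made up for illustration; only the `job_id` and `type` column names come from the cell above.

```python
import pandas as pd

# Toy rows: job 101 appears twice, once per benefit
toy_benefits = pd.DataFrame({
    "job_id": [101, 101, 102],
    "type": ["Medical insurance", "401(k)", "Dental insurance"],
})

# Collapse to one row per job_id, joining the benefit names into one string
toy_agg = (
    toy_benefits.groupby("job_id")["type"]
    .agg(lambda x: ", ".join(x))
    .reset_index()
)
print(toy_agg)
```

After this step every `job_id` appears exactly once, which keeps the later left merge from multiplying rows.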

Then we merge the main dataframe with the job benefits dataframe.

In [None]:
job_postings_data = job_postings_data.merge(job_benefits_data, on="job_id", how="left")
job_postings_data
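
A left merge keeps every job posting, and postings without benefits get `NaN` in the new column. As a sketch on toy data (all values are made up), the `validate` and `indicator` parameters of `merge` can confirm that the row count does not change and show which rows found a match:

```python
import pandas as pd

jobs = pd.DataFrame({"job_id": [1, 2, 3], "title": ["A", "B", "C"]})
benefits = pd.DataFrame({"job_id": [1, 3], "type": ["Medical", "Dental, Vision"]})

# validate raises MergeError if job_id is not unique on either side;
# indicator adds a '_merge' column telling where each row came from
merged = jobs.merge(
    benefits, on="job_id", how="left", validate="one_to_one", indicator=True
)
print(merged)
print(len(merged) == len(jobs))
```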

We also combine the rows that share a <code>job_id</code> in <code>job_skills.csv</code> and join the result into the main dataframe.

In [None]:
job_skills_data = pd.read_csv("/kaggle/input/linkedin-job-postings/job_details/job_skills.csv")
job_skills_data = job_skills_data.groupby('job_id')['skill_abr'].agg(lambda x: ', '.join(x)).reset_index() # aggregate same job skills

job_postings_data = job_postings_data.merge(job_skills_data, on="job_id", how="left")
job_postings_data

The job_details data is done. Next we pre-process the company_details data.

In [None]:
company_data = pd.read_csv("/kaggle/input/linkedin-job-postings/company_details/companies.csv")

company_industries_data = pd.read_csv("/kaggle/input/linkedin-job-postings/company_details/company_industries.csv")
company_industries_data = company_industries_data.groupby('company_id')['industry'].agg(lambda x: ', '.join(x)).reset_index() # aggregate same company industries

company_specialities_data = pd.read_csv("/kaggle/input/linkedin-job-postings/company_details/company_specialities.csv")
company_specialities_data = company_specialities_data.groupby('company_id')['speciality'].agg(lambda x: ', '.join(x)).reset_index() # aggregate same company specialities

employee_counts_data = pd.read_csv("/kaggle/input/linkedin-job-postings/company_details/employee_counts.csv")
employee_counts_data = employee_counts_data.sort_values('time_recorded').drop_duplicates('company_id', keep='last') # keep the newest full row per company based on 'time_recorded'

company_data = company_data.merge(company_industries_data, on="company_id", how="left")
company_data = company_data.merge(company_specialities_data, on="company_id", how="left")
company_data = company_data.merge(employee_counts_data, on="company_id", how="left")
company_data
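
Taking the newest record per company means keeping the whole row with the largest `time_recorded`, not just the timestamp itself. A minimal sketch on made-up data, where `employee_count` is an assumed column name used only for illustration:

```python
import pandas as pd

counts = pd.DataFrame({
    "company_id": [1, 1, 2],
    "employee_count": [100, 120, 50],   # assumed column, made-up values
    "time_recorded": [10, 20, 15],
})

# Sort by time so the newest row of each company comes last, then keep that row
latest = (
    counts.sort_values("time_recorded")
    .drop_duplicates("company_id", keep="last")
    .sort_values("company_id")
    .reset_index(drop=True)
)
print(latest)
```

Unlike `groupby(...).max()` on a single column, this keeps the other columns that belong to the newest record.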

Finally, we merge the job postings and company data, which gives us data about the companies that post jobs on LinkedIn.

In [None]:
merged_data = job_postings_data.merge(company_data, on="company_id", how="left")
merged_data
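
Both dataframes have a `description` column, so pandas adds the default `_x` and `_y` suffixes during the merge. A small sketch on toy frames shows this, along with the `suffixes` parameter as an optional way to get readable names up front (the suffix strings here are just an example choice):

```python
import pandas as pd

jobs = pd.DataFrame({"company_id": [1], "description": ["job text"]})
companies = pd.DataFrame({"company_id": [1], "description": ["company text"]})

# Default behaviour: overlapping column names get _x (left) and _y (right)
default = jobs.merge(companies, on="company_id", how="left")
print(default.columns.tolist())

# Optional: choose explicit suffixes instead
named = jobs.merge(
    companies, on="company_id", how="left", suffixes=("_job", "_company")
)
print(named.columns.tolist())
```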

In [None]:
merged_data.isnull().sum()
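
Raw null counts are easier to compare as a share of rows. A minimal sketch on a toy frame (column names borrowed from this notebook, values made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "industry": ["x", None, None, "y"],
})

# The mean of a boolean mask is the fraction of missing values per column
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```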

Next we select the columns to use for analysis, then rename them so they are easier to understand.

In [None]:
chosen_columns = ['title','name','description_x','formatted_work_type','location','original_listed_time','application_type','sponsored','description_y','company_size','industry']
merged_data = merged_data[chosen_columns]
merged_data

In [None]:
pretty_column_name = {'title': 'job_title', 'name': 'company_name', 'description_x': 'job_description',
               'formatted_work_type': 'work_type','original_listed_time': 'listed_time','description_y': 'company_description'}
data = merged_data.rename(columns=pretty_column_name)
data = data.dropna()
data
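
Keep in mind that `dropna()` with no arguments removes a row if any column is missing, which can discard a lot of data. As a sketch on made-up rows, comparing it with `dropna(subset=...)` shows how many rows each option keeps:

```python
import pandas as pd

toy = pd.DataFrame({
    "job_title": ["a", "b", "c"],
    "company_description": ["x", None, "z"],
    "industry": [None, "i", "j"],
})

kept_all = toy.dropna()                                    # any null drops the row
kept_subset = toy.dropna(subset=["job_title", "industry"])  # only these columns matter
print(len(toy), len(kept_all), len(kept_subset))
```

Which columns belong in `subset` depends on the plots you plan to draw, so the choice above is only an example.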

# EDA (Exploratory Data Analysis)

In [None]:
# Create a word cloud from job titles
job_titles_text = ' '.join(data['job_title'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Job Title Word Cloud')
plt.axis('off')
plt.tight_layout()
plt.show()
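
A word cloud is nice to look at, but exact counts are hard to read from it. A plain word count gives the numbers behind the picture. This is a rough sketch on made-up titles, with simple lowercase and whitespace splitting, so it will not match the tokenization `WordCloud` applies internally:

```python
from collections import Counter

import pandas as pd

titles = pd.Series(["Senior Data Engineer", "Data Analyst", "Senior Analyst"])

# Join all titles, lowercase them, split on whitespace and count each word
word_counts = Counter(" ".join(titles).lower().split())
print(word_counts.most_common(3))
```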

In [None]:
# Visualization 2: Company Size Distribution (Pie Chart)
company_size_counts = data['company_size'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(company_size_counts, labels=company_size_counts.index, autopct='%1.1f%%', colors=['lightcoral', 'lightgreen', 'lightblue'])
plt.title('Company Size Distribution')
plt.axis('equal')
plt.tight_layout()
plt.show()

In [None]:
# Visualization 3: Job Type Breakdown (Bar Chart)
job_type_counts = data['work_type'].value_counts()
plt.figure(figsize=(10, 6))
job_type_counts.plot(kind='bar', color='lightseagreen')
plt.title('Job Type Breakdown')
plt.xlabel('Job Type')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
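
If shares are more useful than raw counts, `value_counts(normalize=True)` returns the fraction of rows per category. A minimal sketch with made-up work types:

```python
import pandas as pd

work_type = pd.Series(["Full-time", "Full-time", "Contract", "Full-time"])

# normalize=True divides each count by the total number of rows
shares = work_type.value_counts(normalize=True)
print(shares)
```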

In [None]:
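# Visualization 4: Sponsored vs. Non-Sponsored Listings (Pie Chart)
# map the flag to labels first (assumes 'sponsored' is coded 0/1) so each slice keeps its own label whatever the count order
sponsored_counts = data['sponsored'].map({0: 'Non-Sponsored', 1: 'Sponsored'}).value_counts()
plt.figure(figsize=(8, 8))
plt.pie(sponsored_counts, labels=sponsored_counts.index, autopct='%1.1f%%', colors=['lightblue', 'lightgreen'])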
plt.title('Sponsored vs. Non-Sponsored Listings')
plt.axis('equal')
plt.tight_layout()
plt.show()