## Sprint 1 - LinkedIn Job Postings Data Prep and Exploration

In today’s job market, recruiters and job seekers both struggle to make sense of a large and fast-changing pool of openings. With thousands of job postings listed on platforms like LinkedIn every day, understanding salary trends, skill requirements, and benefits offerings is crucial for making informed decisions.

This project addresses three such challenges: predicting salary ranges, uncovering temporal trends, and extracting key information from job descriptions. The aim is to make talent acquisition more efficient for recruiters and applicants alike, and to support better matches between employers and employees.

In this codealong, we walk through basic exploratory data analysis (EDA) of the `LinkedIn Job Postings` dataset. Our goal is to use pandas to build a cleaned dataset that can support later analysis with regression models.

### Dataset

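We will be using a real dataset from LinkedIn that contains detailed information about job postings from 2023. Each row describes one posting, with fields such as title, description, salary (minimum, median, maximum), pay period, work type, location, and experience level.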

### Data Loading

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
raw_data = pd.read_csv('Capstone Data/job_postings.csv')

In [8]:
raw_data.head()

Unnamed: 0,job_id,company_id,title,description,max_salary,med_salary,min_salary,pay_period,formatted_work_type,location,...,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,scraped
0,3757940104,553718.0,Hearing Care Provider,Overview\n\nHearingLife is a national hearing ...,,5250.0,,MONTHLY,Full-time,"Little River, SC",...,,Entry level,,1699090000000.0,careers-demant.icims.com,0,FULL_TIME,USD,BASE_SALARY,1699138101
1,3757940025,2192142.0,Shipping & Receiving Associate 2nd shift (Beav...,Metalcraft of Mayville\nMetalcraft of Mayville...,,,,,Full-time,"Beaver Dam, WI",...,,,,1699080000000.0,www.click2apply.net,0,FULL_TIME,,,1699085420
2,3757938019,474443.0,"Manager, Engineering",\nThe TSUBAKI name is synonymous with excellen...,,,,,Full-time,"Bessemer, AL",...,,,Bachelor's Degree in Mechanical Engineering pr...,1699080000000.0,www.click2apply.net,0,FULL_TIME,,,1699085644
3,3757938018,18213359.0,Cook,descriptionTitle\n\n Looking for a great oppor...,,22.27,,HOURLY,Full-time,"Aliso Viejo, CA",...,,Entry level,,1699080000000.0,jobs.apploi.com,0,FULL_TIME,USD,BASE_SALARY,1699087461
4,3757937095,437225.0,Principal Cloud Security Architect (Remote),"Job Summary\nAt iHerb, we are on a mission to ...",275834.0,,205956.0,YEARLY,Full-time,United States,...,,Mid-Senior level,,1699090000000.0,careers.iherb.com,0,FULL_TIME,USD,BASE_SALARY,1699085346


In [9]:
raw_data.tail()

Unnamed: 0,job_id,company_id,title,description,max_salary,med_salary,min_salary,pay_period,formatted_work_type,location,...,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,scraped
33241,133114754,77766802.0,Sales Manager,Are you a dynamic and creative marketing profe...,,,,,Full-time,"Santa Clarita, CA",...,,,,1692830000000.0,,0,FULL_TIME,,,1
33242,108965123,,Office Administrative Assistant,"A fast-fashion wholesaler, is looking for a fu...",,,,,Full-time,"New York, NY",...,,,,1699040000000.0,,0,FULL_TIME,,,1699044401
33243,102339515,52132271.0,Franchise Owner,DuctVentz is a dryer and A/C – heat vent clean...,,,,,Full-time,Greater Boston,...,,,,1699050000000.0,,0,FULL_TIME,,,1699063495
33244,85008768,,Licensed Insurance Agent,While many industries were hurt by the last fe...,52000.0,,45760.0,YEARLY,Full-time,"Chico, CA",...,,,,1692750000000.0,,1,FULL_TIME,USD,BASE_SALARY,1
33245,3958427,630152.0,Stylist/ Clorist,Karen Marie is looking for an awesome experien...,80000.0,,35000.0,YEARLY,Full-time,"Chicago, IL",...,,,Must be a seasoned stylist with an existing bo...,1699050000000.0,,0,FULL_TIME,USD,BASE_SALARY,1699057868
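
The salary columns in the previews are quoted per `pay_period` (`HOURLY`, `MONTHLY`, and `YEARLY` appear above), so the raw numbers are not directly comparable. One possible normalization for later regression work is sketched below. The multipliers (2,080 working hours and 12 months per year), the `ANNUAL_FACTOR` name, and the `_yearly` column names are assumptions for illustration. The toy rows reuse salary values from the previews, and any pay period not in the mapping becomes missing.

```python
import pandas as pd

# Assumed multipliers to convert each pay period to a yearly figure
ANNUAL_FACTOR = {"HOURLY": 2080, "MONTHLY": 12, "YEARLY": 1}

# Toy frame standing in for the real data (values taken from the previews)
toy = pd.DataFrame({
    "min_salary": [65.0, 45760.0, 205956.0],
    "max_salary": [75.0, 52000.0, 275834.0],
    "pay_period": ["HOURLY", "YEARLY", "YEARLY"],
})

# Map each row's pay period to its multiplier; unmapped periods give NaN
factor = toy["pay_period"].map(ANNUAL_FACTOR)

toy["min_salary_yearly"] = toy["min_salary"] * factor
toy["max_salary_yearly"] = toy["max_salary"] * factor

print(toy[["pay_period", "min_salary_yearly", "max_salary_yearly"]])
```

Keeping the original columns alongside the yearly ones makes it easy to audit the conversion before the raw values are dropped.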


In [10]:
raw_data.sample(5)

Unnamed: 0,job_id,company_id,title,description,max_salary,med_salary,min_salary,pay_period,formatted_work_type,location,...,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,scraped
25393,3697354826,71001279.0,Recruiting Coordinator,"Remote\nAbout DZConneX\nAt DZConneX (DZX), we ...",,,,,Full-time,"Florida, United States",...,,Entry level,,1692740000000.0,careers.dayzim.com,0,FULL_TIME,,,1
15192,3749356185,79634.0,Business Development Specialist - DC,Are you ready to make a positive impact on cli...,,,,,Full-time,"Washington, DC",...,,Mid-Senior level,,1699050000000.0,jobs.lever.co,0,FULL_TIME,,,1699137350
28866,3693074432,11056.0,IT Project Manager (Application Development),Position Title: Sr. Technical Project Manager ...,75.0,,65.0,HOURLY,Full-time,"Philadelphia, PA",...,,Mid-Senior level,,1692750000000.0,,0,FULL_TIME,USD,BASE_SALARY,1
21999,3699087779,12555.0,"Associate, PortNYC",Our Vision: To make New York City the global m...,70000.0,,68000.0,YEARLY,Full-time,"New York, NY",...,,Mid-Senior level,,1692830000000.0,,0,FULL_TIME,USD,BASE_SALARY,1
25387,3697354902,1344.0,Sr Communications Specialist,Design solutions to drive safe living and qual...,,,,,Full-time,"Charlotte, NC",...,,Mid-Senior level,,1692740000000.0,careers.honeywell.com,1,FULL_TIME,,,1
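
The `listed_time` values in the previews above (for example `1699090000000.0`) look like Unix timestamps in milliseconds. Assuming that reading is correct, they can be converted to datetimes for temporal analysis. This is a minimal sketch on a toy frame; the `listed_at` column name is hypothetical.

```python
import pandas as pd

# Two listed_time values copied from the previews above
toy = pd.DataFrame({"listed_time": [1699090000000.0, 1692830000000.0]})

# Assumption: values are milliseconds since the Unix epoch (UTC)
toy["listed_at"] = pd.to_datetime(toy["listed_time"], unit="ms")

# Derived calendar fields are handy for grouping postings by period
toy["listed_month"] = toy["listed_at"].dt.month

print(toy)
```

Both sample values land in 2023, which is consistent with the dataset description.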


In [11]:
print(f"We have {raw_data.shape[0]} rows and {raw_data.shape[1]} columns containing strings (categorical), floats, and integers.")

We have 33246 rows and 28 columns containing strings (categorical), floats, and integers.


In [12]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33246 entries, 0 to 33245
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   job_id                      33246 non-null  int64  
 1   company_id                  32592 non-null  float64
 2   title                       33246 non-null  object 
 3   description                 33245 non-null  object 
 4   max_salary                  11111 non-null  float64
 5   med_salary                  2241 non-null   float64
 6   min_salary                  11111 non-null  float64
 7   pay_period                  13352 non-null  object 
 8   formatted_work_type         33246 non-null  object 
 9   location                    33246 non-null  object 
 10  applies                     16238 non-null  float64
 11  original_listed_time        33246 non-null  float64
 12  remote_allowed              4802 non-null   float64
 13  views                       258

In [13]:
df_raw = raw_data.drop('scraped', axis=1)

In [14]:
# Percentage of missing values per column
df_raw.isna().mean() * 100

job_id                         0.000000
company_id                     1.967154
title                          0.000000
description                    0.003008
max_salary                    66.579438
med_salary                    93.259339
min_salary                    66.579438
pay_period                    59.838778
formatted_work_type            0.000000
location                       0.000000
applies                       51.158034
original_listed_time           0.000000
remote_allowed                85.556157
views                         22.138002
job_posting_url                0.000000
application_url               36.846538
application_type               0.000000
expiry                         0.000000
closed_time                   96.474764
formatted_experience_level    27.615352
skills_desc                   98.986344
listed_time                    0.000000
posting_domain                40.780846
sponsored                      0.000000
work_type                      0.000000
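
Among the columns listed above, only `med_salary`, `remote_allowed`, `closed_time`, and `skills_desc` are more than 80% missing. Rather than typing such names by hand, the columns to drop can be selected with a cutoff. This is a minimal sketch: the helper name `high_missing_columns`, the cutoff value, and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def high_missing_columns(df, threshold_pct=80.0):
    """Return the columns whose share of missing values exceeds threshold_pct."""
    missing_pct = df.isna().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Toy frame standing in for df_raw (values are made up for illustration)
toy = pd.DataFrame({
    "job_id": [1, 2, 3, 4, 5],
    "max_salary": [np.nan, 50000.0, np.nan, 60000.0, np.nan],  # 60% missing
    "skills_desc": [np.nan, np.nan, np.nan, np.nan, "Excel"],  # 80% missing
    "closed_time": [np.nan] * 5,                               # 100% missing
})

# A 70% cutoff is used here only so the toy data has something to drop
to_drop = high_missing_columns(toy, threshold_pct=70.0)
cleaned = toy.drop(columns=to_drop)

print(to_drop)
print(cleaned.columns.tolist())
```

The right cutoff is a judgment call: a sparse column may still be worth keeping if it matters for the analysis.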


In [None]:
# Drop columns that are mostly missing; drop() expects the labels as a single list
df_raw_dropped = df_raw.drop(columns=['med_salary', 'remote_allowed', 'closed_time', 'skills_desc'])

In [15]:
df_raw.duplicated().sum()

0
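
Called without arguments, `duplicated()` only flags rows that match on every column. A posting captured twice could still differ in a field such as `views`, so it can be worth checking the identifier on its own as well. The sketch below uses invented data; whether `job_id` is unique in the real dataset is not shown above.

```python
import pandas as pd

# Toy frame: job_id 102 appears twice with different view counts
toy = pd.DataFrame({
    "job_id": [101, 102, 102, 103],
    "views": [5.0, 7.0, 9.0, 1.0],
})

# Rows identical across all columns
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier job_id, regardless of other columns
key_dupes = int(toy.duplicated(subset="job_id").sum())

# Keep the first occurrence of each job_id
deduped = toy.drop_duplicates(subset="job_id", keep="first")

print(full_row_dupes, key_dupes, len(deduped))
```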