In [1]:
import re

import pandas as pd
import numpy as np

import glassdoor_scraper as gs

pd.set_option('display.max_colwidth', None)

In [2]:
df = pd.read_csv('./glassdoor_jobs.csv')
df

Unnamed: 0,Job Title,Company,Location,Salary Estimate,Rating,Description,Headquarters,Size,Founded,Ownership,Industry,Sector,Revenue,Competitors
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,10000+ Employees,1975,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,10000+ Employees,1998,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,-1,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",-1,Unknown,--,Company - Private,Electronics Manufacturing,Manufacturing,Unknown / Non-Applicable,-1
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
800,Principal Data Scientist,Delta Airlines,"Atlanta, GA",$134K - $172K (Glassdoor est.),4.2,"UNITED STATES, GEORGIA, ATLANTA\n\nGLOBAL CONSUMER INSIGHT\n\n17-JUL-2025\n\nREF #: 29205\nHOW YOU'LL HELP US KEEP CLIMBING (OVERVIEW & KEY RESPONSIBILITIES)\nAt Delta Air Lines, connection is at the heart of everything we do and guides our every action. We strive to welcome and care for all of our customers during their travels with us and aim to deliver an elevated experience.",-1,10000+ Employees,1928,Company - Public,"Airlines, Airports & Air Transportation",Transportation & Logistics,$10+ billion (USD),-1
801,DATA SCIENTIST,Reliance Global Services,"South Plainfield, NJ",$84K - $134K (Glassdoor est.),4.3,"Analyze large datasets and interact with relational databases. Write SQL queries, stored procedures, functions, triggers, and views. Develop and implement statistical models, machine learning algorithms, and data-driven solutions for business clients. Automate existing processes to streamline data analysis. Build dashboards and generate reports. Make recommendations based on analytical insights. Skills Required: Python, SQL, Git, Hadoop, Spark, SAS and Tableau. Master’s degree in Science, Techno...",-1,51 to 200 Employees,--,Company - Private,--,--,$5 to $25 million (USD),-1
802,Data Scientist and AI Engineer,SKT Lab,"Santa Clara, CA",$103K - $173K (Glassdoor est.),-1.0,"We are always looking for smart, enthusiastic, creative people to join our team! If you are interested in working with us, please email us at job@sktlab.com.\n\nQualifications:\nWe are looking for a candidate with 5+ years of experience in a Data Scientist role, who has Bachelor's or Graduate degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field.\nExperience with big data tools\nExperience with relational SQL and NoSQL databases\nDeveloping in R and Pyt...",-1,1 to 50 Employees,--,Company - Public,Computer Hardware Development,Information Technology,Unknown / Non-Applicable,-1
803,Data Scientist,Gustaine,"Orange, CA",$72K - $128K (Glassdoor est.),-1.0,"Role Summary / Purpose\nHighly motivated self-driven Engineer in statistics / predictive modeling / data quality to lead and guide multi-disciplinary project teams addressing key challenges for different businesses. Creation of intellectual property will be a key expectation in this role.\n\nEssential Responsibilities\nAs a senior data scientist in the Modeling and Optimization, you will create and guide programs to invent and deliver predictive modeling and decision technologies for diverse busines...",-1,Unknown,--,Company - Public,--,--,Unknown / Non-Applicable,-1


In [3]:
df.shape

(805, 14)

Of the 1,316 job postings scraped (805 + 511), 511 were duplicates, which leaves the 805 rows loaded above.

In [4]:
df.dtypes

Job Title           object
Company             object
Location            object
Salary Estimate     object
Rating             float64
Description         object
Headquarters         int64
Size                object
Founded             object
Ownership           object
Industry            object
Sector              object
Revenue             object
Competitors          int64
dtype: object

## Deleting Duplicates

In [5]:
mask = df.duplicated()
print(f"Number of duplicate job postings = {mask.sum()}")

Number of duplicate job postings = 92


In [6]:
print(f"Number of original job postings = {len(df)}")
# Drop the rows flagged as exact duplicates; the original index labels are kept
df.drop(index=df[mask].index, inplace=True)
print(f"Number of job postings after deleting duplicates = {len(df)}")

Number of original job postings = 805
Number of job postings after deleting duplicates = 713
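
`df.duplicated()` only flags rows in which every column matches, so the same posting scraped twice with a slightly different description would survive. The sketch below, on made-up rows, shows what a key-based check could look like. The choice of `Job Title`, `Company`, and `Location` as the identifying key is an assumption for illustration, not something this notebook does.

```python
import pandas as pd

# Hypothetical rows: 0 and 1 are the same posting with differently
# truncated descriptions, so they are not exact duplicates.
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "ML Engineer"],
    "Company": ["Acme", "Acme", "Acme"],
    "Location": ["Chicago, IL", "Chicago, IL", "Remote"],
    "Description": ["Build models...", "Build models", "Deploy models"],
})

# Exact-row duplicates, as in the cell above
exact_dupes = toy.duplicated().sum()

# Duplicates judged on an assumed identifying key only
key_cols = ["Job Title", "Company", "Location"]
key_dupes = toy.duplicated(subset=key_cols).sum()

# Keep the first occurrence of each key
deduped = toy.drop_duplicates(subset=key_cols, keep="first")
print(exact_dupes, key_dupes, len(deduped))
```

A key-based rule is more aggressive: it would also merge genuinely distinct openings that share a title, company, and location, so it is worth inspecting the flagged rows before dropping them.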


In [7]:
df.columns

Index(['Job Title', 'Company', 'Location', 'Salary Estimate', 'Rating',
       'Description', 'Headquarters', 'Size', 'Founded', 'Ownership',
       'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

In [8]:
df.drop(['Size', 'Founded'], axis='columns', inplace=True)

df.rename(columns={
    'Ownership':'ownership',
    'Revenue':'revenue',
    'Competitors':'competitors',
    "Job Title":"job_title",
    "Company":"company",
    "Location":"location",
    "Salary Estimate":"salary_est",
    "Rating":"rating",
    "Description":"description",
    "Headquarters":"headquarters",
    "Industry":"industry",
    "Sector":"sector",
}, inplace=True)

In [9]:
df.head()

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,-1,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",-1,Company - Private,Electronics Manufacturing,Manufacturing,Unknown / Non-Applicable,-1
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,-1,-1,-1,-1,-1,-1,-1
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,-1,-1,-1,-1,-1,-1


# Data Cleaning and Feature Engineering

In [10]:
job_titles = df['job_title'].unique()
print(f"Number of unique job titles = {len(job_titles)}")
print("Job titles scraped =>", job_titles)

Number of unique job titles = 338
Job titles scraped => ['Data Scientist' 'Risk Analytics Data Scientist'
 'Data Scientist, Applied AI - Remote' 'Sr. Data Scientist'
 'Senior Data Scientist' 'Data Scientist - Model Optimization'
 'Machine Learning Engineer' 'AI Scientist - Machine Learning (US/KR)'
 'Machine Learning Engineer L-1' 'ML/AI Engineers' 'Staff Data Scientist'
 'Data Scientist (N)' 'DATA SCIENTIST 2' 'Supply Chain Data Scientist'
 'Data Scientist, Marketing' 'Data Scientist I' 'Sr Data Scientist'
 'Data Scientist I - Early Careers' 'Data Scientist II'
 'Senior Product Data Scientist'
 'AI/ML Scientist – Operational Twinning & Healthcare Optimization'
 'Sr Data Scientist, Pricing Analytics (OR/Optimization)'
 'Data Scientist / Engineer' 'Data Scientist, Growth'
 'Senior Data Scientist - Insights and Analytics'
 'Data Scientist – Machine Learning Focus' 'Data Scientist 1'
 'Senior Data Scientist I'
 'Artificial Intelligence/ Machine Learning Engineer'
 'Machine Learning-

In [11]:
# Refine job titles into broad categories

def categorize_title(title):
    """Map a raw job title to a category; the first matching rule wins."""
    if not isinstance(title, str):
        return 'other'

    title_low = title.lower()
    # 1. Machine Learning / ML Engineer
    if any(k in title_low for k in ('machine learning', 'ml engineer', 'ml scientist', 'ml/llm')):
        return 'machine learning engineer'
    # 2. AI Scientist / AI Engineer
    if any(k in title_low for k in ('ai scientist', 'artificial intelligence', 'ai engineer', 'gen ai', 'ai/ml')):
        return 'ai/ml engineer/scientist'
    # 3. Data Scientist (broad; catches most titles)
    if 'data scientist' in title_low:
        return 'data scientist'
    # 4. Data Engineer
    if 'data engineer' in title_low:
        return 'data engineer'
    # 5. Data Analyst
    if 'analyst' in title_low:
        return 'data analyst'
    # 6. Research Scientist
    if 'research' in title_low:
        return 'research scientist'
    # 7. Statistician
    if 'statistician' in title_low:
        return 'statistician'
    # 8. Other (catch-all for everything else)
    return 'other'

# --- Function to Extract Seniority ---
def extract_seniority(title):
    if not isinstance(title, str):
        return 'Not Specified'

    title_low = title.lower()

    # Check from highest to lowest seniority; the first match wins.
    # These are plain substring checks, so 'lead' also matches 'leadership'.
    if 'principal' in title_low:
        return 'principal'
    if 'staff' in title_low:
        return 'staff'
    if 'lead' in title_low or 'manager' in title_low:
        return 'lead/manager'
    if 'sr' in title_low or 'senior' in title_low:
        return 'senior'
    if 'jr' in title_low or 'junior' in title_low or 'associate' in title_low or 'graduate' in title_low or 'entry' in title_low:
        return 'junior/associate'
    # Numbered levels (I, II, III or 1, 2, 3). \b is a word boundary, so
    # 'ii' inside a longer word is not treated as a level.
    if re.search(r'\b(iii|3)\b', title_low):
        return 'Level 3'
    if re.search(r'\b(ii|2)\b', title_low):
        return 'Level 2'
    # NOTE: as written, level 1 is assigned only to titles that contain
    # 'data scientist i' and contain neither 'level 1' nor 'l-1'. Titles such
    # as 'Data Scientist 1' or 'Machine Learning Engineer L-1' fall through.
    if (re.search(r'\b(i|1)\b', title_low)
            and 'level 1' not in title_low
            and 'l-1' not in title_low
            and 'data scientist i' in title_low):
        return 'level 1'
    # If no other seniority is found
    return 'standard/mid-level'


# 1. Create the job category column
df['job_category'] = df['job_title'].apply(categorize_title)
# 2. Create the seniority column
df['seniority'] = df['job_title'].apply(extract_seniority)

print("--- Job Categories ---")
print(df['job_category'].value_counts())

print("\n--- Seniority Levels ---")
print(df['seniority'].value_counts())

--- Job Categories ---
job_category
data scientist               574
machine learning engineer     77
ai/ml engineer/scientist      38
other                         12
data analyst                   6
research scientist             4
data engineer                  2
Name: count, dtype: int64

--- Seniority Levels ---
seniority
standard/mid-level    432
senior                151
Level 2                38
junior/associate       35
staff                  17
Level 3                17
level 1                12
principal               7
lead/manager            4
Name: count, dtype: int64
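
The substring checks in `extract_seniority` can fire inside longer words (for example `'lead'` in `'leadership'`), and its level 1 branch only recognizes titles containing `data scientist i`. Below is a sketch of a stricter, table-driven variant that uses word-boundary regexes throughout. The function name `extract_seniority_strict`, the patterns, and the lowercase label spellings are assumptions for illustration; running it on the real data would change the counts shown above.

```python
import re

# Ordered from highest to lowest seniority; the first matching rule wins.
# Patterns are illustrative assumptions, not the notebook's rules.
SENIORITY_RULES = [
    ("principal", r"\bprincipal\b"),
    ("staff", r"\bstaff\b"),
    ("lead/manager", r"\b(lead|manager)\b"),
    ("senior", r"\b(sr|senior)\b"),
    ("junior/associate", r"\b(jr|junior|associate|graduate|entry)\b"),
    ("level 3", r"\b(iii|3)\b"),
    ("level 2", r"\b(ii|2)\b"),
    ("level 1", r"\b(i|1)\b"),
]

def extract_seniority_strict(title):
    """Return a seniority label using whole-word matches only."""
    if not isinstance(title, str):
        return "not specified"
    title_low = title.lower()
    for label, pattern in SENIORITY_RULES:
        if re.search(pattern, title_low):
            return label
    return "standard/mid-level"

for t in ["Sr. Data Scientist", "Machine Learning Engineer L-1",
          "Data Scientist, Thought Leadership", "DATA SCIENTIST 2"]:
    print(t, "=>", extract_seniority_strict(t))
```

Keeping the rules in one ordered list makes the precedence explicit and lets a new level be added without touching the control flow.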


In [12]:
df

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,-1,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",-1,Company - Private,Electronics Manufacturing,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,-1,-1,-1,-1,-1,-1,-1,data scientist,senior
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,-1,-1,-1,-1,-1,-1,data scientist,senior
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
800,Principal Data Scientist,Delta Airlines,"Atlanta, GA",$134K - $172K (Glassdoor est.),4.2,"UNITED STATES, GEORGIA, ATLANTA\n\nGLOBAL CONSUMER INSIGHT\n\n17-JUL-2025\n\nREF #: 29205\nHOW YOU'LL HELP US KEEP CLIMBING (OVERVIEW & KEY RESPONSIBILITIES)\nAt Delta Air Lines, connection is at the heart of everything we do and guides our every action. We strive to welcome and care for all of our customers during their travels with us and aim to deliver an elevated experience.",-1,Company - Public,"Airlines, Airports & Air Transportation",Transportation & Logistics,$10+ billion (USD),-1,data scientist,principal
801,DATA SCIENTIST,Reliance Global Services,"South Plainfield, NJ",$84K - $134K (Glassdoor est.),4.3,"Analyze large datasets and interact with relational databases. Write SQL queries, stored procedures, functions, triggers, and views. Develop and implement statistical models, machine learning algorithms, and data-driven solutions for business clients. Automate existing processes to streamline data analysis. Build dashboards and generate reports. Make recommendations based on analytical insights. Skills Required: Python, SQL, Git, Hadoop, Spark, SAS and Tableau. Master’s degree in Science, Techno...",-1,Company - Private,--,--,$5 to $25 million (USD),-1,data scientist,standard/mid-level
802,Data Scientist and AI Engineer,SKT Lab,"Santa Clara, CA",$103K - $173K (Glassdoor est.),-1.0,"We are always looking for smart, enthusiastic, creative people to join our team! If you are interested in working with us, please email us at job@sktlab.com.\n\nQualifications:\nWe are looking for a candidate with 5+ years of experience in a Data Scientist role, who has Bachelor's or Graduate degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field.\nExperience with big data tools\nExperience with relational SQL and NoSQL databases\nDeveloping in R and Pyt...",-1,Company - Public,Computer Hardware Development,Information Technology,Unknown / Non-Applicable,-1,ai/ml engineer/scientist,standard/mid-level
803,Data Scientist,Gustaine,"Orange, CA",$72K - $128K (Glassdoor est.),-1.0,"Role Summary / Purpose\nHighly motivated self-driven Engineer in statistics / predictive modeling / data quality to lead and guide multi-disciplinary project teams addressing key challenges for different businesses. Creation of intellectual property will be a key expectation in this role.\n\nEssential Responsibilities\nAs a senior data scientist in the Modeling and Optimization, you will create and guide programs to invent and deliver predictive modeling and decision technologies for diverse busines...",-1,Company - Public,--,--,Unknown / Non-Applicable,-1,data scientist,standard/mid-level


In [13]:
# salary cleaning

def parse_salary(s):
    if pd.isna(s):
        return np.nan
    s = s.strip()

    # Case 1: $100K - $112K (...)
    match_range = re.match(r'^\$(\d+)K\s*-\s*\$(\d+)K', s)
    if match_range:
        low, high = map(int, match_range.groups())
        return (low + high) / 2 * 1000  # convert to full USD value

    # Case 2: $112K (...)
    match_single = re.match(r'^\$(\d+)K', s)
    if match_single:
        return int(match_single.group(1)) * 1000

    # Case 3: $38.07 - $56.73 Per Hour (...)
    match_hour = re.match(r'^\$(\d+(?:\.\d+)?)\s*-\s*\$(\d+(?:\.\d+)?)\s*Per Hour', s)
    if match_hour:
        low, high = map(float, match_hour.groups())
        hourly = (low + high) / 2
        yearly = hourly * 7 * 5 * 52  # 7 hrs/day * 5 days/week * 52 weeks/year
        return yearly

    return np.nan  # if none match

df['annual_salary_avg'] = df['salary_est'].apply(parse_salary)
df.head()

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority,annual_salary_avg
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,-1,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",-1,Company - Private,Electronics Manufacturing,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,-1,-1,-1,-1,-1,-1,-1,data scientist,senior,135000.0
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,-1,-1,-1,-1,-1,-1,data scientist,senior,152000.0
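
The hourly branch of `parse_salary` annualizes with 7 hours/day × 5 days/week × 52 weeks = 1,820 hours per year. A 40-hour week (2,080 hours per year) is another common convention, and the two differ by roughly 14%. The sketch below isolates that conversion so the assumption is visible as parameters; `annualize_hourly` and the example string are hypothetical, not part of the notebook.

```python
import re

# Same hourly pattern as Case 3 in parse_salary
HOURLY_RE = re.compile(r"^\$(\d+(?:\.\d+)?)\s*-\s*\$(\d+(?:\.\d+)?)\s*Per Hour")

def annualize_hourly(s, hours_per_day=7, days_per_week=5, weeks_per_year=52):
    """Convert an hourly range string to a yearly midpoint in USD.

    Returns None when the string is not an hourly range.
    """
    m = HOURLY_RE.match(s.strip())
    if m is None:
        return None
    low, high = map(float, m.groups())
    hourly_mid = (low + high) / 2
    return hourly_mid * hours_per_day * days_per_week * weeks_per_year

example = "$38.07 - $56.73 Per Hour (Employer provided)"  # made-up posting
print(round(annualize_hourly(example)))                   # 7-hour day, as in the notebook
print(round(annualize_hourly(example, hours_per_day=8)))  # 8-hour day alternative
```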


In [14]:
df['min_salary'] = df['salary_est'].str.extract(r'^\$(\d+)K', expand=False)
df['max_salary'] = df['salary_est'].str.extract(r'\$(\d+)K(\s\(.*\))?$')[0]

df['max_salary'] = pd.to_numeric(df['max_salary'], errors='coerce') * 1000
df['min_salary'] = pd.to_numeric(df['min_salary'], errors='coerce') * 1000

df.head()

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0,101000.0,173000.0
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,-1,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",-1,Company - Private,Electronics Manufacturing,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,,,
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,-1,-1,-1,-1,-1,-1,-1,data scientist,senior,135000.0,110000.0,160000.0
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,-1,-1,-1,-1,-1,-1,data scientist,senior,152000.0,132000.0,172000.0


In [15]:
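# Rows whose estimate is a single figure rather than a range;
# na=False keeps the mask boolean even if salary_est contains missing values
one_salary_mask = df['salary_est'].str.contains(r'^\$\d+K\s\(.*\)$', regex=True, na=False)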
df.loc[one_salary_mask, ['salary_est', 'min_salary', 'max_salary']]

Unnamed: 0,salary_est,min_salary,max_salary
55,$112K (Employer provided),112000.0,112000.0
189,$82K (Employer provided),82000.0,82000.0
234,$112K (Employer provided),112000.0,112000.0
327,$82K (Employer provided),82000.0,82000.0
398,$84K (Employer provided),84000.0,84000.0
433,$136K (Employer provided),136000.0,136000.0
561,$80K (Employer provided),80000.0,80000.0
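
Since `annual_salary_avg`, `min_salary`, and `max_salary` come from separate regexes, a quick consistency check can confirm that the average is the midpoint of the other two wherever both bounds exist. This is a sketch on a toy frame whose first two rows reuse values shown in the outputs above; on the real data it would be run against `df` directly.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "min_salary": [105000.0, 112000.0, np.nan],
    "max_salary": [120000.0, 112000.0, np.nan],
    "annual_salary_avg": [112500.0, 112000.0, np.nan],
})

# Only compare rows where both bounds were parsed
has_range = toy[["min_salary", "max_salary"]].notna().all(axis=1)
midpoint = (toy["min_salary"] + toy["max_salary"]) / 2

# Flag rows whose stored average disagrees with the recomputed midpoint
mismatch = has_range & ~np.isclose(midpoint, toy["annual_salary_avg"])
print(mismatch.sum())
```

Hourly postings are expected to have an average but no min/max, because the `K`-based patterns in the previous cell do not match them; those rows are skipped by `has_range`.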


In [16]:
print("--- Job Industries ---")
print(df['industry'].value_counts())

print("\n--- Sector ---")
print(df['sector'].value_counts())

--- Job Industries ---
industry
-1                                         79
--                                         71
Information Technology Support Services    59
Enterprise Software & Network Solutions    37
Health Care Services & Hospitals           35
                                           ..
Culture & Entertainment                     1
State & Regional Agencies                   1
Grantmaking & Charitable Foundations        1
Travel Agencies                             1
Primary & Secondary Schools                 1
Name: count, Length: 72, dtype: int64

--- Sector ---
sector
Information Technology                         179
-1                                              78
--                                              71
Financial Services                              46
Retail & Wholesale                              41
Manufacturing                                   40
Healthcare                                      35
Management & Consulting                     

In [17]:
temp_lst = ['job_title', 'company', 'salary_est', 'description', 'headquarters', 'industry', 'sector', 'job_category', 'seniority']
df[temp_lst] = df[temp_lst].replace(['..', '-1', '--'], pd.NA)
df.head(2)

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",-1,Company - Private,Business Consulting,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",-1,Company - Public,Internet & Web Services,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0,101000.0,173000.0
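
Note that the replacement above targets the strings `'-1'` and `'--'`, so numeric placeholders such as the integer `-1` in `headquarters` or `-1.0` in `rating` are left as they are, as the output shows. An alternative is to declare the placeholders when the file is read, using the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV with made-up rows; the column selection is an assumption.

```python
import io
import pandas as pd

# Made-up rows mimicking the placeholder conventions in the scraped file
csv_text = """Company,Industry,Sector,Rating
Acme,--,--,4.1
Globex,-1,-1,-1.0
Initech,Business Consulting,Management & Consulting,3.6
"""

# Per-column placeholders: only Industry and Sector treat '-1'/'--' as missing
toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Industry": ["-1", "--"], "Sector": ["-1", "--"]},
)
print(toy.isna().sum())
```

Passing a dict keeps the rule column-specific, so a legitimate `-1` in some other column is not silently turned into a missing value.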


In [18]:
industry_absent_filter = df['industry'].isna()
sector_absent_filter = df['sector'].isna()
print(len(df.loc[~industry_absent_filter & sector_absent_filter]))
print(len(df.loc[industry_absent_filter & ~sector_absent_filter]))

df.loc[industry_absent_filter | sector_absent_filter]

0
1


Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,,-1,-1,,,-1,-1,data scientist,senior,135000.0,110000.0,160000.0
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,,-1,-1,,,-1,-1,data scientist,senior,152000.0,132000.0,172000.0
5,Data Scientist - Model Optimization,"quadric, Inc","Burlingame, CA",$114K - $167K (Glassdoor est.),4.0,,-1,-1,,,-1,-1,data scientist,standard/mid-level,140500.0,114000.0,167000.0
6,Machine Learning Engineer,Adidev Technologies Inc,"Austin, TX",$100K - $147K (Glassdoor est.),3.8,,-1,-1,,,-1,-1,machine learning engineer,standard/mid-level,123500.0,100000.0,147000.0
7,AI Scientist - Machine Learning (US/KR),Gauss Labs,"Palo Alto, CA",$142K - $187K (Glassdoor est.),3.8,,-1,-1,,,-1,-1,machine learning engineer,standard/mid-level,164500.0,142000.0,187000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
781,Forward Deployed Data Scientist,Growth Signals,"Boston, MA",,-1.0,"Forward Deployed Data Scientist\nLocation: Boston, MA preferred, remote may be considered\nType: Full-time\nStage: Seed-stage AI Startup\nIndustry: B2B SaaS / Artificial Intelligence\n\nAs a Forward Deployed Data Scientist (FDDS) at Growth Signals, you'll work directly with enterprise customers to help them get real value from our GenAI platform. You'll be their primary technical partner, combining data science expertise with product knowledge and customer success to solve their toughest challenges.\nI...",-1,-1,,,-1,-1,data scientist,standard/mid-level,,,
791,Data Scientist,WTAnow,"Herndon, VA",$98K - $136K (Glassdoor est.),5.0,"At WTAnow, LLC dba WebTech Analytics, we rely on insightful data to power our systems and solutions. We’re seeking an experienced data scientist to deliver insights on a daily basis. The ideal candidate will have mathematical and statistical expertise, along with natural curiosity and a creative mind. While mining, interpreting, and cleaning our data, this person will be relied on to ask questions, connect the dots, and uncover hidden opportunities for realizing the data’s full potential. As par...",-1,Company - Private,,,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,117000.0,98000.0,136000.0
797,Data Scientist,SelectMinds,"Dallas, TX",$100K - $143K (Glassdoor est.),-1.0,"Job Description:\nWe are seeking a data scientist to join our business intelligence team to help us make better business decisions based on our data. The ideal candidate will have a working knowledge of statistics, mathematics, and data science programming languages (e.g., SQL, R, Python). Your primary responsibilities will be performing statistical analyses, running custom SQL queries, and identifying patterns and trends that can improve our products’ and services’ efficiency and usability. You ...",-1,Company - Public,,,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,121500.0,100000.0,143000.0
801,DATA SCIENTIST,Reliance Global Services,"South Plainfield, NJ",$84K - $134K (Glassdoor est.),4.3,"Analyze large datasets and interact with relational databases. Write SQL queries, stored procedures, functions, triggers, and views. Develop and implement statistical models, machine learning algorithms, and data-driven solutions for business clients. Automate existing processes to streamline data analysis. Build dashboards and generate reports. Make recommendations based on analytical insights. Skills Required: Python, SQL, Git, Hadoop, Spark, SAS and Tableau. Master’s degree in Science, Techno...",-1,Company - Private,,,$5 to $25 million (USD),-1,data scientist,standard/mid-level,109000.0,84000.0,134000.0


In [19]:
df.loc[industry_absent_filter & ~sector_absent_filter]

Unnamed: 0,job_title,company,location,salary_est,rating,description,headquarters,ownership,industry,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
275,AI & GEN AI Data Scientist-Experienced Associate,PRICE WATERHOUSE COOPERS,"Chicago, IL",$63K - $140K (Employer provided),3.7,,-1,-1,,Financial Services,$100 to $500 million (USD),-1,ai/ml engineer/scientist,junior/associate,101500.0,63000.0,140000.0


Drop `headquarters` and `industry`; neither column is used in the rest of the analysis.

In [20]:
df.at[275, 'sector'] = 'Financial Services'

In [21]:
df.drop(['headquarters', 'industry'], axis='columns', inplace=True)
df.head()

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",Company - Private,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",Company - Public,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0,101000.0,173000.0
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",Company - Private,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,,,
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),-1.0,,-1,,-1,-1,data scientist,senior,135000.0,110000.0,160000.0
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,,-1,,-1,-1,data scientist,senior,152000.0,132000.0,172000.0


In [22]:
df['rating'] = df['rating'].astype('float')
df.loc[df['rating'] < 0, 'rating'] = np.nan

df['rating'].unique()

array([3.6, 4.1, nan, 4.2, 4. , 3.8, 4.3, 3.4, 3.5, 3.7, 3.9, 3. , 4.7,
       2.4, 4.5, 2.8, 3.3, 2.9, 3.2, 2.6, 4.4, 4.8, 5. , 2.7, 3.1, 1.8,
       4.6, 2.5, 4.9, 2.2, 1. ])

In [23]:
df.loc[df['location'] == 'Remote']

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",Company - Private,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,,,
172,Data Scientist,"Riverbed Technology, Inc.",Remote,,3.0,"Riverbed. Empower the Experience:\nRiverbed, the leader in AI observability, helps organizations optimize their user’s experiences by leveraging AI automation for the prevention, identification, and resolution of IT issues. With over 20 years of experience in data collection and AI and machine learning, Riverbed’s open and AI-powered observability platform and solutions optimize digital experiences and greatly improves IT efficiency. Riverbed also offers industry-leading Acceleration solutions th...",Company - Private,Information Technology,$500 million to $1 billion (USD),-1,data scientist,standard/mid-level,,,
269,Sr. Data Scientist,Americor,Remote,$120K - $150K (Employer provided),3.6,"Americor is an innovative finance technology company with a unique approach to debt resolution. Americor focuses on getting Americans out of debt quickly and responsibly so they can take control of their lives again. Our employees love coming to work and find value in what they do, consistently recognizing Americor as a top place to work. With over a decade of experience paired with a number of best company awards, Americor has relieved over $2 billion in debt for over 200,000 Americans.\n\nWe are...",Company - Private,Financial Services,$100 to $500 million (USD),-1,data scientist,senior,135000.0,120000.0,150000.0
535,Data Scientist,SOURCEFLY LLC,Remote,,5.0,Job Information\nJob Type\nFull time\nIndustry\nGovernment/Military\nSecurity Clearance\nPublic Trust\nRemote Job\n\nAbout Us,Company - Private,Information Technology,$1 to $5 million (USD),-1,data scientist,standard/mid-level,,,
553,Data Scientist – Risk Modeling & Analytics,SoulPage IT Solutions,Remote,,4.4,"August 19, 2025\nRole: Data Scientist\nEmployment Type: C2C\nClient: REI Systems – FDA Account\nLocation: Remote, USA\nNo. of Positions: 2\nWork Experience: 5+ Years of experience",Company - Private,Information Technology,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,,,
613,AI/ML Specialist,Techfinite Systems,Remote,,5.0,"AI/ML Specialist\nHybrid/Remote · Full Time · Experience: 1-2 years\nAs an AI/ML Specialist at Techfinite Systems, you will support the design, development, and deployment of machine learning models for enterprise and healthcare applications. This role is ideal for early-career professionals eager to apply their academic or internship experience in a real-world environment with impactful projects.\nKey Responsibilities\nAssist in building, training, and testing machine learning models.\nWork with sen...",Company - Public,,Unknown / Non-Applicable,-1,ai/ml engineer/scientist,standard/mid-level,,,
691,Machine Learning Quantitative Researcher,Stormlight Capital,Remote,$300K - $600K (Employer provided),4.0,"Benefits:\nBonus based on performance\nFlexible schedule\nHome office stipend\nOpportunity for advancement\nPaid time off\nProfit sharing\nSigning bonus\n\nAbout Stormlight Capital\n\nStormlight Capital LLC is a high-frequency trading firm and market maker specializing in event contracts. Leveraging advanced technology and quantitative expertise, we deliver deep liquidity, efficient pricing, and robust risk management for our proprietary trading strategies. Continuous innovation and disciplined execution k...",Company - Private,,$1 to $5 million (USD),-1,research scientist,standard/mid-level,450000.0,300000.0,600000.0
742,Senior Data Scientist,Age of Learning,Remote,$170K - $215K (Employer provided),4.0,"Company Overview:\nAge of Learning® is the leading developer of engaging and effective Pre-K through 5th grade learning resources that help children build a strong foundation for academic success and a lifelong love of learning. The company’s research-based curriculum, developed by education experts, includes the award-winning programs ABCmouse.com® Early Learning Academy and Adventure Academy™, as well as the adaptive, personalized school solutions, My Math Academy®, My Reading Academy®, and My ...",Company - Private,Education,$100 to $500 million (USD),-1,data scientist,senior,192500.0,170000.0,215000.0


In [24]:
desc_remote_mask = df['description'].str.contains(r'remote', na=False)
print(df.loc[desc_remote_mask, 'description'])

115       Role Details\nLocation: San Francisco, CA. This role is based out of our San Francisco HQ and is not eligible for full-time remote work.\nAbout Us\nAt FurtherAI, we’re building the next generation of AI agents for the insurance industry - a trillion-dollar market ready for transformation.\nWe’ve raised more than $30M from top investors (Andreesen Horowitz, YC, Nexus, South Park Commons, Converge) and have grown 10x in revenue this year alone. Our customers include some of the largest names in insura...
120           Fullpower®-AI delivers a complete B2B IoT platform for AI-powered algorithms, remote contactless biosensing together with end-to-end engineering services, and customization of software in the field of life sciences, health, and biotechnology. Fullpower's platform is vetted and deployed as a PaaS, backed by a patent portfolio of 135+ patents. Fullpower's key areas of expertise include contactless biosensing, remote monitoring, non-invasive sleep technology, and the
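
Note that the pattern `r'remote'` is case-sensitive, so descriptions that only say "Remote" or "REMOTE" are not matched. A minimal sketch of the difference, using a hypothetical `toy` frame (the strings are illustrative, not rows from the dataset) and the `case=False` option of `str.contains`:

```python
import pandas as pd

# Hypothetical descriptions for illustration only.
toy = pd.DataFrame({
    'description': [
        'The position is FULLY REMOTE, based in Latin America.',
        'Remote, USA',
        'This role is not eligible for full-time remote work.',
        'Onsite role in Chicago.',
        None,
    ]
})

# Case-sensitive search, as in the cell above: only lowercase "remote" matches.
strict_mask = toy['description'].str.contains(r'remote', na=False)

# Case-insensitive search also catches "REMOTE" and "Remote".
loose_mask = toy['description'].str.contains(r'remote', case=False, na=False)

print(strict_mask.tolist())
print(loose_mask.tolist())
```

Either way, a match only means the word appears in the text. Row 115 above mentions "remote" precisely to say the role is not remote, so the mask is a starting point for manual review rather than a remote-job flag.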

In [25]:
df['job_state'] = df['location'].str.extract(r', ([A-Z]{2})', expand=False)
df.head()

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary,job_state
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",Company - Private,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0,IL
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",Company - Public,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0,101000.0,173000.0,TX
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",Company - Private,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,,,,
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),,,-1,,-1,-1,data scientist,senior,135000.0,110000.0,160000.0,IL
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,,-1,,-1,-1,data scientist,senior,152000.0,132000.0,172000.0,PA


In [26]:
missing_job_state_mask = df['job_state'].isna()
df.loc[missing_job_state_mask, ['location', 'job_state']]

Unnamed: 0,location,job_state
2,Remote,
23,United States,
29,Minnesota,
36,United States,
40,United States,
72,United States,
75,Alaska,
79,United States,
84,United States,
85,Oregon,


In [27]:
us_states = {
    'Alaska': 'AK',
    'California': 'CA',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'New Jersey': 'NJ',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Virginia': 'VA',
    'Remote': 'R',
}

missing_job_state_mask = df['job_state'].isna()
# map() looks each location up in us_states; locations not in the dict stay missing
df.loc[missing_job_state_mask, 'job_state'] = df.loc[missing_job_state_mask, 'location'].map(us_states)

df.loc[missing_job_state_mask, ['location', 'job_state']]

Unnamed: 0,location,job_state
2,Remote,R
23,United States,
29,Minnesota,MN
36,United States,
40,United States,
72,United States,
75,Alaska,AK
79,United States,
84,United States,
85,Oregon,OR
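
The two steps above (regex extraction, then a name lookup for what is left) can be sketched as a single expression. This is only an illustration on a hypothetical `toy` frame with a two-entry `state_codes` dict; the `$` anchor in the regex is my own addition to require the code at the end of the string, and is not in the original pattern.

```python
import pandas as pd

# Hypothetical locations covering the cases seen above.
toy = pd.DataFrame({'location': ['Chicago, IL', 'Minnesota', 'Remote', 'United States']})

# Hypothetical subset of the name -> code lookup.
state_codes = {'Minnesota': 'MN', 'Remote': 'R'}

# Step 1: pull a two-letter code that follows ", " at the end of the string.
extracted = toy['location'].str.extract(r', ([A-Z]{2})$', expand=False)

# Step 2: where nothing was extracted, fall back to the lookup table.
toy['job_state'] = extracted.fillna(toy['location'].map(state_codes))

print(toy)
```

Locations such as "United States" match neither step and stay missing, which is the same outcome as in the table above.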


In [28]:
df['job_state'].value_counts()

job_state
CA    133
TX     71
MA     55
IL     54
VA     39
GA     34
FL     29
PA     27
NJ     20
NC     20
WA     19
OH     19
NY     17
MI     17
AR     12
MD     11
MO     10
MN      9
UT      9
R       8
IN      8
KS      7
CT      6
CO      6
AZ      4
HI      4
TN      4
AL      4
SC      4
AK      3
ID      3
WY      3
ME      3
KY      2
WI      2
OK      2
WV      1
OR      1
DC      1
MS      1
NV      1
PR      1
ND      1
NE      1
DE      1
LA      1
Name: count, dtype: int64

California has the most data science postings in this dataset (133), followed by Texas (71).

In [29]:
df['desc_len'] = df['description'].apply(lambda x: len(x.strip()) if isinstance(x, str) else 0)
df[['description', 'desc_len']].head(60)

Unnamed: 0,description,desc_len
0,"Full/Part-time: Full time\nJob Category: Analytics\nCity: Chicago\nHAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",503
1,"The Company\nPayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy.\n\nWe operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",503
2,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America.\nThis position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",503
3,,0
4,,0
5,,0
6,,0
7,,0
8,"Job Description:\nDevelop high-quality, maintainable code to build and deploy computer vision modules and machine learning models as part of an AI pipeline\nWorks with data and software engineering team to integrate models into pipeline\nSupport construction and iteration of product prototypes\nSupport data pipeline engineering and automation\nStay up to date with state of the art developments in computer vision tasks\nOther duties as required by supervisors\nAdditional Requirements:",481
9,"Role: ML/AI Engineers\n(This role is open to US Citizens, Green Card holders, GC-EAD only. We do not sponsor visas.)\n\nSummary:\nAdidev is looking for an adept Machine Learning Engineer to take the helm in deploying advanced machine learning models, with a special emphasis on Generative AI. In this role, you will craft and refine AI-driven solutions, turning innovative ideas into value-adding features and services, thereby solidifying our market leadership and technological forefront for our client...",503


In [30]:
df.drop(['desc_len'], axis='columns', inplace=True)

df['description'] = df['description'].str.replace('\n', ' ')
df.head(1)

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary,job_state
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time Job Category: Analytics City: Chicago HAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",Company - Private,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0,IL


In [31]:
df['competitors'].unique()

array([-1])
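
Since `competitors` holds only the placeholder `-1`, it carries no information. A quick way to spot such constant columns in general is to count distinct values per column. The sketch below uses a hypothetical `toy` frame, not the real data:

```python
import pandas as pd

# Hypothetical frame: one constant column, one informative column.
toy = pd.DataFrame({
    'competitors': [-1, -1, -1],
    'rating': [3.6, 4.1, 3.6],
})

# A column with a single distinct value cannot help distinguish rows.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

print(constant_cols)
```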

In [32]:
# competitor count
df['num_comp'] = df['competitors'].apply(lambda x: len(x.split(',')) if x!=-1 else 0)

not_desc_list = ["job_title","company","location","salary_est","rating","ownership","sector","revenue","annual_salary_avg","job_state", "num_comp"]
df.head()[not_desc_list]

Unnamed: 0,job_title,company,location,salary_est,rating,ownership,sector,revenue,annual_salary_avg,job_state,num_comp
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,Company - Private,Management & Consulting,Unknown / Non-Applicable,112500.0,IL,0
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,Company - Public,Information Technology,$10+ billion (USD),137000.0,TX,0
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,,4.1,Company - Private,Manufacturing,Unknown / Non-Applicable,,R,0
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),,-1,,-1,135000.0,IL,0
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,-1,,-1,152000.0,PA,0


In [33]:
df[df['num_comp'] != 0]

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary,job_state,num_comp


In [34]:
df.drop(['num_comp'], axis='columns', inplace=True)

df['revenue'].unique()

array(['Unknown / Non-Applicable', '$10+ billion (USD)', '-1',
       'Less than $1 million (USD)', '$1 to $5 million (USD)',
       '$1 to $5 billion (USD)', '$5 to $10 billion (USD)',
       '$500 million to $1 billion (USD)', '$5 to $25 million (USD)',
       '$100 to $500 million (USD)', '$25 to $100 million (USD)'],
      dtype=object)

In [35]:
df['revenue'].value_counts()

revenue
Unknown / Non-Applicable            212
$10+ billion (USD)                  138
$1 to $5 billion (USD)               78
-1                                   77
$100 to $500 million (USD)           48
$5 to $25 million (USD)              45
$5 to $10 billion (USD)              35
$25 to $100 million (USD)            28
$500 million to $1 billion (USD)     22
$1 to $5 million (USD)               20
Less than $1 million (USD)           10
Name: count, dtype: int64

In [36]:
df['revenue'].isna().sum()

np.int64(0)
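
`isna()` reports 0 here only because missing revenue is stored as ordinary strings (`'-1'` and `'Unknown / Non-Applicable'`). If we wanted pandas to treat them as missing, one option is to mask them out. This is a sketch on a hypothetical `toy` series, and it assumes both placeholders mean "revenue not known":

```python
import pandas as pd

# Hypothetical sample of the revenue labels listed above.
toy = pd.Series(
    ['$10+ billion (USD)', '-1', 'Unknown / Non-Applicable', '$1 to $5 million (USD)'],
    name='revenue',
)

missing_before = int(toy.isna().sum())  # placeholders are plain strings

# Assumption: both placeholders stand for an unknown revenue.
placeholders = ['-1', 'Unknown / Non-Applicable']
cleaned = toy.mask(toy.isin(placeholders))  # masked entries become NaN

missing_after = int(cleaned.isna().sum())
print(missing_before, missing_after)
```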

In [37]:
df['ownership'].value_counts()

ownership
Company - Public                  267
Company - Private                 257
-1                                 79
Subsidiary or Business Segment     30
College / University               24
Nonprofit Organization             16
Unknown                            16
Government                          7
Hospital                            6
Contract                            5
Self-employed                       4
Private Practice / Firm             1
Franchise                           1
Name: count, dtype: int64

In [38]:
df['ownership'].isna().sum()

np.int64(0)

In [39]:
df['sector'].value_counts()

sector
Information Technology                         179
Financial Services                              46
Retail & Wholesale                              41
Manufacturing                                   40
Healthcare                                      35
Management & Consulting                         35
Aerospace & Defense                             28
Education                                       26
Pharmaceutical & Biotechnology                  22
Energy, Mining & Utilities                      21
Insurance                                       20
Media & Communication                           17
Government & Public Administration              10
Telecommunications                              10
Transportation & Logistics                       9
Arts, Entertainment & Recreation                 5
Construction, Repair & Maintenance Services      5
Real Estate                                      4
Restaurants & Food Service                       3
Human Resources & Staffi

In [40]:
len(df.loc[df['sector'].isna()])

149

In [41]:
ownership_sector_map = {
    'College / University': 'Education',
    'Nonprofit Organization': 'Nonprofit & NGO',
    'Government': 'Government & Public Administration',
    'Hospital': 'Healthcare',
    'Franchise': 'Restaurants & Food Service',
}

# rows whose ownership type implies a sector and whose sector is currently missing
mask_tmp = df['ownership'].isin(list(ownership_sector_map)) & df['sector'].isna()
print(f"# of rows whose missing sector can be imputed from ownership = {mask_tmp.sum()}")

df.loc[mask_tmp, 'sector'] = df.loc[mask_tmp, 'ownership'].map(ownership_sector_map)

# of rows whose missing sector can be imputed from ownership = 2


In [42]:
df['ownership'] = df['ownership'].replace('-1', 'Unknown')
df['ownership'] = df['ownership'].replace(['Company - Private', 'Private Practice / Firm', 'Self-employed', 'Franchise'], 'Private')
df['ownership'] = df['ownership'].replace(['Company - Public', 'Subsidiary or Business Segment', 'Government'], 'Public')
df['ownership'] = df['ownership'].replace(['Nonprofit Organization', 'College / University', 'Hospital', 'Contract'], 'Other')
df['ownership'] = df['ownership'].fillna('Unknown')
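
The same grouping can also be written as one lookup table, which keeps the category definitions in a single place. A minimal sketch on a hypothetical `toy` series; the names `ownership_groups` and `label_to_group` are illustrative and not part of the notebook:

```python
import pandas as pd

# Grouping copied from the replace() calls above.
ownership_groups = {
    'Private': ['Company - Private', 'Private Practice / Firm', 'Self-employed', 'Franchise'],
    'Public': ['Company - Public', 'Subsidiary or Business Segment', 'Government'],
    'Other': ['Nonprofit Organization', 'College / University', 'Hospital', 'Contract'],
    'Unknown': ['-1', 'Unknown'],
}

# Invert to a raw label -> group lookup.
label_to_group = {raw: group for group, raws in ownership_groups.items() for raw in raws}

# Hypothetical sample of raw ownership labels, including a missing value.
toy = pd.Series(['Company - Public', 'Hospital', '-1', None, 'Franchise'])
grouped = toy.map(label_to_group).fillna('Unknown')

print(grouped.tolist())
```

One difference from chained `replace()` calls: `map()` turns any label that is not in the lookup into a missing value, which the final `fillna` then sends to `'Unknown'`, whereas `replace()` leaves unlisted labels untouched.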

## Handling missing values

In [43]:
df.isna().sum()

job_title              0
company                0
location               0
salary_est            89
rating                56
description           63
ownership              0
sector               147
revenue                0
competitors            0
job_category           0
seniority              0
annual_salary_avg     91
min_salary           100
max_salary           100
job_state             25
dtype: int64

- We'll fill in missing salaries based on the sector.

In [44]:
null_sector_mask = df['sector'].isna()
null_desc_msk = df['description'].isna()
null_salary_mask = df['annual_salary_avg'].isna()

In [45]:
print(f"# rows which have sector but no salary = {len(df.loc[~null_sector_mask & null_salary_mask])}")
print(f"# rows which have no sector as well as no salary = {len(df.loc[null_sector_mask & null_salary_mask])}")
print(f"# rows which have no sector, no salary and also no description = {len(df.loc[null_desc_msk & null_sector_mask & null_salary_mask])}")

# rows which have sector but no salary = 68
# rows which have no sector as well as no salary = 23
# rows which have no sector, no salary and also no description = 6


Imputation strategy for `annual_salary_avg`
- Remove the 6 rows that have no sector, no salary, and no description.
- That leaves 17 rows with no sector and no salary. We could have passed their descriptions to an LLM to extract the sector, but we can't do that now, so remove these 17 rows as well. In total, all 23 rows with a null sector and a null salary are dropped.
- Impute the remaining 68 rows with the median `annual_salary_avg` of their sector.
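
The last step can be sketched as follows on a hypothetical `toy` frame (sector names and salaries are made up). It fills each missing salary with its sector median and falls back to the overall median when a sector has no observed salaries at all. Here the fallback median is computed from the observed salaries only, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Insurance" has no observed salary at all.
toy = pd.DataFrame({
    'sector': ['IT', 'IT', 'IT', 'Education', 'Education', 'Insurance'],
    'annual_salary_avg': [100000.0, 140000.0, np.nan, 90000.0, np.nan, np.nan],
})

# Median salary of each row's sector, aligned with the original rows.
sector_median = toy.groupby('sector')['annual_salary_avg'].transform('median')

# First fill from the sector median, then from the overall observed median.
toy['salary_imputed'] = toy['annual_salary_avg'].fillna(sector_median)
toy['salary_imputed'] = toy['salary_imputed'].fillna(toy['annual_salary_avg'].median())

print(toy)
```

The median is used rather than the mean so that a few very high salaries in a sector do not pull the imputed value upward.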

In [46]:
df.drop(list(df.loc[null_sector_mask & null_salary_mask].index), axis='index', inplace=True)

In [47]:
print(f"# rows which have no sector as well as no salary = {len(df.loc[df['sector'].isna() & df['annual_salary_avg'].isna()])}")

# rows which have no sector as well as no salary = 0


In [48]:
null_sector_mask = df['sector'].isna()
null_desc_msk = df['description'].isna()

null_sector_but_not_desc_mask = ~null_desc_msk & null_sector_mask
print(f"# rows with description but sector null = {len(df.loc[null_sector_but_not_desc_mask])}")

# rows with description but sector null = 68


- We could have imputed the sector for these 68 rows (and the 17 dropped earlier) from their descriptions, and then imputed their salaries, but we don't currently have the descriptions to do so. We'll leave the idea as is.

In [49]:
df.isna().sum()

job_title              0
company                0
location               0
salary_est            67
rating                44
description           57
ownership              0
sector               124
revenue                0
competitors            0
job_category           0
seniority              0
annual_salary_avg     68
min_salary            77
max_salary            77
job_state             19
dtype: int64

In [51]:
# impute each missing salary with the median salary of its sector

df['sector'] = df['sector'].fillna('Unknown')

salary_grouped_sectorwise = df.groupby('sector')
df['annual_salary_avg'] = salary_grouped_sectorwise['annual_salary_avg'].transform(lambda x: x.fillna(x.median()))

# For any sector that had all NaN salaries, fill with global median
overall_median = df['annual_salary_avg'].median()
df['annual_salary_avg'] = df['annual_salary_avg'].fillna(overall_median)

df.isna().sum()

  return np.nanmean(a, axis, out=out, keepdims=keepdims)


job_title             0
company               0
location              0
salary_est           67
rating               44
description          57
ownership             0
sector                0
revenue               0
competitors           0
job_category          0
seniority             0
annual_salary_avg     0
min_salary           77
max_salary           77
job_state            19
dtype: int64

In [53]:
not_desc_list.remove('num_comp')
df.loc[df['annual_salary_avg'].isna(), not_desc_list]

Unnamed: 0,job_title,company,location,salary_est,rating,ownership,sector,revenue,annual_salary_avg,job_state


In [54]:
df['rating'] = salary_grouped_sectorwise['rating'].transform(lambda x: x.fillna(x.median()))

df.isna().sum()

job_title             0
company               0
location              0
salary_est           67
rating                0
description          57
ownership             0
sector                0
revenue               0
competitors           0
job_category          0
seniority             0
annual_salary_avg     0
min_salary           77
max_salary           77
job_state            19
dtype: int64

In [55]:
# where the salary range is missing, fall back to the (imputed) average for both bounds
df['min_salary'] = df['min_salary'].fillna(df['annual_salary_avg'])
df['max_salary'] = df['max_salary'].fillna(df['annual_salary_avg'])

df['job_state'] = df['job_state'].fillna('Unknown')
df['salary_est'] = df['salary_est'].fillna('Unavailable')

df.isna().sum()

job_title             0
company               0
location              0
salary_est            0
rating                0
description          57
ownership             0
sector                0
revenue               0
competitors           0
job_category          0
seniority             0
annual_salary_avg     0
min_salary            0
max_salary            0
job_state             0
dtype: int64

In [56]:
len(df)

690

In [57]:
df.to_csv('salary_data_cleaned.csv')
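
By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. A small round-trip sketch on a hypothetical `toy` frame, using in-memory buffers instead of files:

```python
import io

import pandas as pd

# Hypothetical one-row frame.
toy = pd.DataFrame({'job_title': ['Data Scientist'], 'rating': [3.6]})

# Default: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# index=False: only the data columns are written.
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

Keeping the index, as the cell above does, preserves the original row labels after the dropped rows; passing `index=False` would be the choice if those labels are not needed downstream.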

In [58]:
df

Unnamed: 0,job_title,company,location,salary_est,rating,description,ownership,sector,revenue,competitors,job_category,seniority,annual_salary_avg,min_salary,max_salary,job_state
0,Data Scientist,Havi Supply Chain,"Chicago, IL",$105K - $120K (Employer provided),3.6,"Full/Part-time: Full time Job Category: Analytics City: Chicago HAVI is a global, privately owned company focused on innovating, optimizing and managing the supply chains of leading brands. Offering services in marketing analytics, packaging, supply chain management and logistics, HAVI partners with companies to address challenges big and small across the supply chain, from commodity to customer. Founded in 1974, HAVI employs more than 10,000 people and serves customers in more than 100 countrie...",Private,Management & Consulting,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,112500.0,105000.0,120000.0,IL
1,Risk Analytics Data Scientist,PayPal,"Austin, TX",$101K - $173K (Employer provided),3.6,"The Company PayPal has been revolutionizing commerce globally for more than 25 years. Creating innovative experiences that make moving money, selling, and shopping simple, personalized, and secure, PayPal empowers consumers and businesses in approximately 200 markets to join and thrive in the global economy. We operate a global, two-sided network at scale that connects hundreds of millions of merchants and consumers. We help merchants and consumers connect, transact, and complete payments, whet...",Public,Information Technology,$10+ billion (USD),-1,data scientist,standard/mid-level,137000.0,101000.0,173000.0,TX
2,"Data Scientist, Applied AI - Remote",Azumo,Remote,Unavailable,4.1,"Azumo is currently looking for a highly motivated Data Scientist / Machine Learning Engineer to develop and enhance our data and analytics infrastructure. The position is FULLY REMOTE, based in Latin America. This position will provide you with the opportunity to collaborate with a dynamic team and talented data scientists in the field of big data analytics and applied AI. If you have a passion for designing and implementing advanced machine learning and deep learning models, particularly in the...",Private,Manufacturing,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,116000.0,116000.0,116000.0,R
3,Sr. Data Scientist,EDGE,"Chicago, IL",$110K - $160K (Employer provided),3.8,,Unknown,Unknown,-1,-1,data scientist,senior,135000.0,110000.0,160000.0,IL
4,Senior Data Scientist,Envestnet,"Berwyn, PA",$132K - $172K (Glassdoor est.),4.2,,Unknown,Unknown,-1,-1,data scientist,senior,152000.0,132000.0,172000.0,PA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
800,Principal Data Scientist,Delta Airlines,"Atlanta, GA",$134K - $172K (Glassdoor est.),4.2,"UNITED STATES, GEORGIA, ATLANTA GLOBAL CONSUMER INSIGHT 17-JUL-2025 REF #: 29205 HOW YOU'LL HELP US KEEP CLIMBING (OVERVIEW & KEY RESPONSIBILITIES) At Delta Air Lines, connection is at the heart of everything we do and guides our every action. We strive to welcome and care for all of our customers during their travels with us and aim to deliver an elevated experience.",Public,Transportation & Logistics,$10+ billion (USD),-1,data scientist,principal,153000.0,134000.0,172000.0,GA
801,DATA SCIENTIST,Reliance Global Services,"South Plainfield, NJ",$84K - $134K (Glassdoor est.),4.3,"Analyze large datasets and interact with relational databases. Write SQL queries, stored procedures, functions, triggers, and views. Develop and implement statistical models, machine learning algorithms, and data-driven solutions for business clients. Automate existing processes to streamline data analysis. Build dashboards and generate reports. Make recommendations based on analytical insights. Skills Required: Python, SQL, Git, Hadoop, Spark, SAS and Tableau. Master’s degree in Science, Techno...",Private,Unknown,$5 to $25 million (USD),-1,data scientist,standard/mid-level,109000.0,84000.0,134000.0,NJ
802,Data Scientist and AI Engineer,SKT Lab,"Santa Clara, CA",$103K - $173K (Glassdoor est.),3.9,"We are always looking for smart, enthusiastic, creative people to join our team! If you are interested in working with us, please email us at job@sktlab.com. Qualifications: We are looking for a candidate with 5+ years of experience in a Data Scientist role, who has Bachelor's or Graduate degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field. Experience with big data tools Experience with relational SQL and NoSQL databases Developing in R and Pyt...",Public,Information Technology,Unknown / Non-Applicable,-1,ai/ml engineer/scientist,standard/mid-level,138000.0,103000.0,173000.0,CA
803,Data Scientist,Gustaine,"Orange, CA",$72K - $128K (Glassdoor est.),3.8,"Role Summary / Purpose Highly motivated self-driven Engineer in statistics / predictive modeling / data quality to lead and guide multi-disciplinary project teams addressing key challenges for different businesses. Creation of intellectual property will be a key expectation in this role. Essential Responsibilities As a senior data scientist in the Modeling and Optimization, you will create and guide programs to invent and deliver predictive modeling and decision technologies for diverse busines...",Public,Unknown,Unknown / Non-Applicable,-1,data scientist,standard/mid-level,100000.0,72000.0,128000.0,CA
