# AI Impact on the Job Market - Notebook 1 (Import, Organize, Describe)
### A visualization report by Ali Powell

## Import Libraries and Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
sns.set(context='notebook', style='white')

## Acquiring the Remote Data

In [11]:
# Install the package from inside the notebook (in a terminal, run `pip install kagglehub` instead)
%pip install kagglehub

Note: you may need to restart the kernel to use updated packages.


In [12]:
import kagglehub # Import the needed package for downloading
path = kagglehub.dataset_download("sahilislam007/ai-impact-on-job-market-20242030") # Access the specific dataset we want to work with from Kaggle
print("Current path to dataset files:", path) # Give us the path to the new file we created

Current path to dataset files: /Users/alipowell/.cache/kagglehub/datasets/sahilislam007/ai-impact-on-job-market-20242030/versions/1


In [13]:
# Folder that holds the downloaded CSV file.
# Reuse the path returned by kagglehub above instead of hard-coding a user-specific location.

data_path = path

In [15]:
# Shows our CSV file that we will be using

! ls {data_path}

ai_job_trends_dataset.csv
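The `! ls` call above relies on a Unix-style shell. As a minimal sketch, assuming the download folder only needs to be scanned for CSV files, the same check can be done in plain Python. The helper name `list_csv_files` is hypothetical, and the temporary folder below only stands in for the real `data_path`.

```python
import tempfile
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted names of the CSV files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.csv"))


# Demonstration on a throwaway folder (a stand-in for data_path)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ai_job_trends_dataset.csv").touch()
    (Path(tmp) / "notes.txt").touch()
    found = list_csv_files(tmp)

print(found)
```

In the notebook itself this would be called as `list_csv_files(data_path)`.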


## About the Dataset

This dataset was created by Kaggle user SahilIslam007 and was last updated in June 2025. It is an AI-generated dataset with 30,000 rows and 13 columns that reflects AI's potential influence on the job market from 2024 to 2030. According to the author, "This is a synthetic dataset generated using realistic modeling, public job data patterns (U.S. BLS, OECD, McKinsey, WEF reports), and AI simulation to reflect plausible scenarios from 2024 to 2030. Ideal for educational, research, and AI project purposes."

This dataset can be downloaded at "https://www.kaggle.com/datasets/sahilislam007/ai-impact-on-job-market-20242030/data".

## Features of the Data

In [12]:
# Read in the tables and show the first 5 rows

AI_data = pd.read_csv(f"{data_path}/ai_job_trends_dataset.csv")
AI_data.head()

Unnamed: 0,Job Title,Industry,Job Status,AI Impact Level,Median Salary (USD),Required Education,Experience Required (Years),Job Openings (2024),Projected Openings (2030),Remote Work Ratio (%),Automation Risk (%),Location,Gender Diversity (%)
0,Investment analyst,IT,Increasing,Moderate,42109.76,Master’s Degree,5,1515,6342,55.96,28.28,UK,44.63
1,"Journalist, newspaper",Manufacturing,Increasing,Moderate,132298.57,Master’s Degree,15,1243,6205,16.81,89.71,USA,66.39
2,Financial planner,Finance,Increasing,Low,143279.19,Bachelor’s Degree,4,3338,1154,91.82,72.97,Canada,41.13
3,Legal secretary,Healthcare,Increasing,High,97576.13,Associate Degree,15,7173,4060,1.89,99.94,Australia,65.76
4,Aeronautical engineer,IT,Increasing,Low,60956.63,Master’s Degree,13,5944,7396,53.76,37.65,Germany,72.57


In [13]:
# Show the titles of all the different columns in the dataset

AI_data.columns

Index(['Job Title', 'Industry', 'Job Status', 'AI Impact Level',
       'Median Salary (USD)', 'Required Education',
       'Experience Required (Years)', 'Job Openings (2024)',
       'Projected Openings (2030)', 'Remote Work Ratio (%)',
       'Automation Risk (%)', 'Location', 'Gender Diversity (%)'],
      dtype='object')

In [15]:
# Creating a table of column names, data types, and an example of the data that would be in that column.

cols_table = pd.DataFrame({
    "Column Name": AI_data.columns,
    "Data Type": AI_data.dtypes.astype(str),
    "Example Value": [AI_data[col].dropna().iloc[0] if AI_data[col].dropna().size > 0 else "" for col in AI_data.columns]
})

cols_table

Unnamed: 0,Column Name,Data Type,Example Value
Job Title,Job Title,object,Investment analyst
Industry,Industry,object,IT
Job Status,Job Status,object,Increasing
AI Impact Level,AI Impact Level,object,Moderate
Median Salary (USD),Median Salary (USD),float64,42109.76
Required Education,Required Education,object,Master’s Degree
Experience Required (Years),Experience Required (Years),int64,5
Job Openings (2024),Job Openings (2024),int64,1515
Projected Openings (2030),Projected Openings (2030),int64,6342
Remote Work Ratio (%),Remote Work Ratio (%),float64,55.96
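The same column summary is built for each dataset in this notebook, so it could be wrapped in a small helper. This is only a sketch: the name `summarize_columns` is hypothetical, and the tiny frame below is made up from values shown in the first rows above.

```python
import pandas as pd


def summarize_columns(frame):
    """Build a table of column names, dtypes, and the first non-missing value."""
    examples = [
        frame[col].dropna().iloc[0] if frame[col].notna().any() else ""
        for col in frame.columns
    ]
    return pd.DataFrame({
        "Column Name": list(frame.columns),
        "Data Type": [str(dtype) for dtype in frame.dtypes],
        "Example Value": examples,
    })


# Small stand-in frame; the "Empty" column shows the fallback for all-missing data
sample = pd.DataFrame({
    "Industry": ["IT", "Manufacturing"],
    "Median Salary (USD)": [42109.76, 132298.57],
    "Empty": [None, None],
})
summary = summarize_columns(sample)
print(summary)
```

With this helper, each of the three summary tables reduces to a single call such as `summarize_columns(AI_data)`.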


In [7]:
# Sort the columns into numerical and categorical groups (this will make it easier to graph them effectively later)
# This just creates a list of column names for each group that we can refer to later if needed

numerical_features = [
    'Median Salary (USD)', 'Experience Required (Years)',
    'Job Openings (2024)', 'Projected Openings (2030)',
    'Remote Work Ratio (%)', 'Automation Risk (%)',
    'Gender Diversity (%)'
]

categorical_features = [
    'Industry', 'Job Status', 'AI Impact Level',
    'Required Education', 'Location'
]
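The two lists above are typed by hand. As a cross-check, pandas can derive a similar split from the column dtypes. The sketch below uses a small made-up frame with values taken from the first rows of the dataset. Note that a dtype-based split also places `Job Title` in the categorical group, which the hand-written list leaves out.

```python
import pandas as pd

# Stand-in for AI_data with a few of its columns
sample = pd.DataFrame({
    "Job Title": ["Investment analyst", "Financial planner"],
    "Industry": ["IT", "Finance"],
    "Median Salary (USD)": [42109.76, 143279.19],
    "Experience Required (Years)": [5, 4],
})

# Numeric dtypes (int, float) go in one group, everything else in the other
numerical_auto = sample.select_dtypes(include="number").columns.tolist()
categorical_auto = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical_auto)
print(categorical_auto)
```

Comparing these lists with the manual ones is a quick way to catch a misspelled or forgotten column name.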

## Other Data/Sources Used to Refute Main Source

Note: The figures are generated from the datasets presented in these resources.

1 - https://www.bls.gov/emp/ - US Bureau of Labor Statistics

About: "The Employment Projections (EP) program develops information about the labor market for the Nation as a whole for 10 years in the future."

Data: Job Growth - Sheet1.csv

2 - https://www.bls.gov/mwe/factsheets/grouped-work-levels-factsheet.htm - US Bureau of Labor Statistics

About: "The Modeled Wage Estimates (MWE) provide annual estimates of average hourly wages for occupations by job characteristics and within a given geographical location. Job characteristics refer to the attributes of workers within an occupation and include worker bargaining status (union and nonunion), work status (part-time and full-time), basis of pay (incentive-based or time-based), and work level (entry, intermediate, experienced, 1–15 and unable to be leveled). 

MWE is produced by leveraging the strength and breadth of the Occupational Employment and Wage Statistics (OEWS) and National Compensation Survey (NCS) programs to provide more details on occupational wages than either program provides individually. The occupational and geographic wage data come from the OEWS and the job characteristics, which include bargaining status (union and nonunion), work status (part-time and full-time), basis of pay (incentive-based and time-based), and work level wage data, come from the NCS."

Data: WagesByExperience - Sheet1.csv

3 - Additional readings
- US Bureau of Labor Statistics - “Making Sense of Job Openings and Other Labor Market Measures”
- US Bureau of Labor Statistics - “The Employment Projections (EP) Methodology”
- Acker, Joan. “Gender Inequalities in the Workplace” (2015)
- England, Paula et al. “The Gender Wage Gap, Between-Firm Inequality, and Job Segregation by Gender” (2020)
- Upjohn Institute - “AI exposure and the future of work: Linking task-based measures to U.S. occupational employment projections”
- US Bureau of Labor Statistics - “Growth trends for selected occupations considered at risk from automation” (2022, Monthly Labor Review)

All of the articles/readings listed above are publicly available online and are quality academic pieces. They were written using accurate real-world data and professional analyses. The information pulled from these sources was used to develop a real-world analysis of the numeric features from the original AI-generated dataset and to compare the real trends with the AI-generated ones. (These are not datasets.)

In [4]:
# Reading in 2 new CSVs for the refuting work

# csv 1 - Wages by Experience
# The first row of this file is not the header row, so the column names are read from the second row (header=1)

df = pd.read_csv("WagesByExperience - Sheet1.csv", header=1)
df.columns = df.columns.str.strip()

df.columns


Index(['Sector', 'Info ID', 'Low Experience', 'Medium Experience',
       'High Experience'],
      dtype='object')

In [5]:
df.head()

Unnamed: 0,Sector,Info ID,Low Experience,Medium Experience,High Experience
0,29-0000 Healthcare Practitioners and Technical...,A,$22.67,–,$32.21
1,29-0000 Healthcare Practitioners and Technical...,B,–,$25.61,$28.27
2,29-0000 Healthcare Practitioners and Technical...,C,–,–,$29.48
3,29-0000 Healthcare Practitioners and Technical...,D,–,$23.89,$27.93
4,29-0000 Healthcare Practitioners and Technical...,E,$21.50,$27.83,–


In [8]:
cols_table2 = pd.DataFrame({
    "Column Name": df.columns,
    "Data Type": df.dtypes.astype(str),
    "Example Value": [df[col].dropna().iloc[0] if df[col].dropna().size > 0 else "" for col in df.columns]
})

cols_table2

Unnamed: 0,Column Name,Data Type,Example Value
Sector,Sector,object,29-0000 Healthcare Practitioners and Technical...
Info ID,Info ID,object,A
Low Experience,Low Experience,object,$22.67
Medium Experience,Medium Experience,object,–
High Experience,High Experience,object,$32.21
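All three wage columns are stored as `object` because the values carry a `$` sign and use `–` for missing entries, so they cannot be averaged or plotted directly. A minimal sketch of one way to convert them, assuming `–` always means "no estimate available" (the helper name `to_dollars` is hypothetical, and the rows are copied from the preview above):

```python
import pandas as pd

# Stand-in for df, using the first three rows shown above
wages = pd.DataFrame({
    "Info ID": ["A", "B", "C"],
    "Low Experience": ["$22.67", "–", "–"],
    "Medium Experience": ["–", "$25.61", "–"],
    "High Experience": ["$32.21", "$28.27", "$29.48"],
})
wage_cols = ["Low Experience", "Medium Experience", "High Experience"]


def to_dollars(series):
    """Strip '$' and ',' and convert to float; anything unparseable becomes NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


wages_numeric = wages.copy()
for col in wage_cols:
    wages_numeric[col] = to_dollars(wages_numeric[col])

print(wages_numeric.dtypes)
```

Because `errors="coerce"` turns the `–` placeholders into `NaN`, pandas skips them by default in later aggregations such as `mean()`.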


In [6]:
# csv 2 - Job Growth

dfone = pd.read_csv("Job Growth - Sheet1.csv")
dfone.columns # show column titles from the dataset

Index(['Industry sector', '2022 NAICS', 'Employment, 2014', 'Employment, 2024',
       'Employment, 2034', 'Employment change, numeric, 2014–24',
       'Employment change, numeric, 2024–34',
       'Employment change, percent, 2014–24',
       'Employment change, percent, 2024–34', 'Percent distribution, 2014',
       'Percent distribution, 2024', 'Percent distribution, 2034',
       'Compound annual rate of change, 2014–24',
       'Compound annual rate of change, 2024–34'],
      dtype='object')

In [7]:
dfone.head() # display first 5 rows

Unnamed: 0,Industry sector,2022 NAICS,"Employment, 2014","Employment, 2024","Employment, 2034","Employment change, numeric, 2014–24","Employment change, numeric, 2024–34","Employment change, percent, 2014–24","Employment change, percent, 2024–34","Percent distribution, 2014","Percent distribution, 2024","Percent distribution, 2034","Compound annual rate of change, 2014–24","Compound annual rate of change, 2024–34"
0,Total employment,—,150436.3,169956.1,175167.9,19519.8,5211.8,13.0,3.1,100.0,100.0,100.0,1.2,0.3
1,Self-employed workers,—,9344.3,9906.4,10122.8,562.1,216.4,6.0,2.2,6.2,5.8,5.8,0.6,0.2
2,Total wage and salary employment,—,141092.0,160049.7,165045.1,18957.7,4995.4,13.4,3.1,93.8,94.2,94.2,1.3,0.3
3,"Agriculture, forestry, fishing, and hunting",11,1383.5,1480.7,1482.1,97.2,1.4,7.0,0.1,0.9,0.9,0.8,0.7,0.0
4,"Mining, quarrying, and oil and gas extraction",21,838.4,586.1,576.8,-252.3,-9.3,-30.1,-1.6,0.6,0.3,0.3,-3.5,-0.2


In [9]:
cols_table3 = pd.DataFrame({
    "Column Name": dfone.columns,
    "Data Type": dfone.dtypes.astype(str),
    "Example Value": [dfone[col].dropna().iloc[0] if dfone[col].dropna().size > 0 else "" for col in dfone.columns]
})

cols_table3

Unnamed: 0,Column Name,Data Type,Example Value
Industry sector,Industry sector,object,Total employment
2022 NAICS,2022 NAICS,object,—
"Employment, 2014","Employment, 2014",object,150436.30
"Employment, 2024","Employment, 2024",object,169956.10
"Employment, 2034","Employment, 2034",object,175167.90
"Employment change, numeric, 2014–24","Employment change, numeric, 2014–24",object,19519.80
"Employment change, numeric, 2024–34","Employment change, numeric, 2024–34",object,5211.80
"Employment change, percent, 2014–24","Employment change, percent, 2014–24",float64,13.0
"Employment change, percent, 2024–34","Employment change, percent, 2024–34",float64,3.1
"Percent distribution, 2014","Percent distribution, 2014",float64,100.0
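The employment columns are read in as `object`, so they need converting before any arithmetic. The sketch below assumes the cause is text formatting (for example thousands separators or dash placeholders), which has not been confirmed for this file. It then recomputes the 2024–34 percent change and compound annual rate for the first two rows shown above, using the usual compound growth formula (end / start)^(1 / years) - 1, as a sanity check on the table.

```python
import pandas as pd

# Stand-in for dfone, using the first two rows shown above (values kept as text)
jobs = pd.DataFrame({
    "Industry sector": ["Total employment", "Self-employed workers"],
    "Employment, 2024": ["169956.10", "9906.40"],
    "Employment, 2034": ["175167.90", "10122.80"],
})

# Convert text to numbers; commas are removed and unparseable entries become NaN
for col in ["Employment, 2024", "Employment, 2034"]:
    as_text = jobs[col].astype(str).str.replace(",", "", regex=False)
    jobs[col] = pd.to_numeric(as_text, errors="coerce")

years = 10  # 2024 to 2034
ratio = jobs["Employment, 2034"] / jobs["Employment, 2024"]
jobs["pct_change"] = (ratio - 1) * 100
jobs["annual_rate"] = (ratio ** (1 / years) - 1) * 100

print(jobs[["Industry sector", "pct_change", "annual_rate"]].round(1))
```

Rounded to one decimal place, both rows reproduce the values in the "Employment change, percent, 2024–34" and "Compound annual rate of change, 2024–34" columns of the preview.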
