# Data Cleaning - AI Salary Analysis: Remote Work and Experience Premiums

## Purpose

This notebook prepares the raw AI salary data for analysis by:
1. Filtering to the US market and full-time positions 
2. Creating clean remote work classifications 
3. Categorizing job roles for hypothesis testing 
4. Removing outliers and selecting relevant variables

**Input:** 'salaries.csv'

**Output:** 'cleaned_salaries.csv'
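As a rough preview of the steps listed above, the sketch below runs them on a tiny in-memory stand-in for 'salaries.csv'. The column names come from the raw file. Everything else is an assumption made for illustration and not necessarily what this notebook does later: the `REMOTE_LABELS` mapping (including reading 50 as "Hybrid"), the keyword-based `role_category` helper, and the 1.5 × IQR outlier rule. Column selection is left out because the set of relevant variables is decided further on.

```python
import pandas as pd

# Tiny stand-in for salaries.csv (illustrative values only).
sample = pd.DataFrame({
    "employment_type": ["FT", "FT", "PT", "FT", "FT"],
    "company_location": ["US", "US", "US", "CA", "US"],
    "remote_ratio": [0, 100, 0, 100, 50],
    "job_title": ["Data Scientist", "Engineer", "Data Analyst",
                  "Head of Data", "Data Scientist"],
    "salary_in_usd": [145400, 160000, 60000, 232344, 81600],
})

# Hypothetical labels; treating 50 as "Hybrid" is an assumption.
REMOTE_LABELS = {0: "On-site", 50: "Hybrid", 100: "Remote"}


def role_category(title):
    """Hypothetical keyword-based grouping of job titles."""
    t = title.lower()
    if "head" in t or "manager" in t:
        return "Leadership"
    if "scientist" in t:
        return "Data Science"
    if "engineer" in t:
        return "Engineering"
    return "Other"


def clean_salaries(df, iqr_k=1.5):
    # Step 1: keep US-based companies and full-time positions.
    mask = (df["company_location"] == "US") & (df["employment_type"] == "FT")
    out = df.loc[mask].copy()

    # Step 2: turn the numeric remote ratio into a readable category.
    out["remote_category"] = out["remote_ratio"].map(REMOTE_LABELS)

    # Step 3: group job titles into broader roles.
    out["role_category"] = out["job_title"].apply(role_category)

    # Step 4: drop salaries outside the IQR fences (assumed outlier rule).
    q1, q3 = out["salary_in_usd"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["salary_in_usd"].between(q1 - iqr_k * iqr, q3 + iqr_k * iqr)
    return out.loc[keep].reset_index(drop=True)


cleaned = clean_salaries(sample)
print(cleaned[["job_title", "remote_category", "role_category", "salary_in_usd"]])
```

Keeping the steps in a single function makes it easy to rerun the whole pipeline when a filter or threshold changes; the real data would then be written out with `cleaned.to_csv('cleaned_salaries.csv', index=False)`.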

## Load Raw Data and Import Libraries

We import the necessary packages and load the raw data into this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


df_raw = pd.read_csv('salaries.csv')

print(f"Raw data loaded: {df_raw.shape[0]:,} rows, {df_raw.shape[1]} columns")
print(f"\nColumns: {df_raw.columns.tolist()}")
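print("\nFirst 5 rows:")
# head() must be the last expression in the cell for the notebook to render it;
# the full info() summary is printed in the next cell
df_raw.head()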


Raw data loaded: 151,445 rows, 11 columns

Columns: ['work_year', 'experience_level', 'employment_type', 'job_title', 'salary', 'salary_currency', 'salary_in_usd', 'employee_residence', 'remote_ratio', 'company_location', 'company_size']

First 5 rows:


|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | EX | FT | Head of Data | 348516 | USD | 348516 | US | 0 | US | M |
| 1 | 2025 | EX | FT | Head of Data | 232344 | USD | 232344 | US | 0 | US | M |
| 2 | 2025 | SE | FT | Data Scientist | 145400 | USD | 145400 | US | 0 | US | M |
| 3 | 2025 | SE | FT | Data Scientist | 81600 | USD | 81600 | US | 0 | US | M |
| 4 | 2025 | MI | FT | Engineer | 160000 | USD | 160000 | US | 100 | US | M |


To give context to the data, I printed the DataFrame summary and checked each column of the dataset for NaN values.

In [3]:
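# info() prints its summary directly and returns None,
# so wrapping it in print() adds the trailing "None" seen in the output
print(df_raw.info())

# Count missing values per column
print(df_raw.isnull().sum())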

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151445 entries, 0 to 151444
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   work_year           151445 non-null  int64 
 1   experience_level    151445 non-null  object
 2   employment_type     151445 non-null  object
 3   job_title           151445 non-null  object
 4   salary              151445 non-null  int64 
 5   salary_currency     151445 non-null  object
 6   salary_in_usd       151445 non-null  int64 
 7   employee_residence  151445 non-null  object
 8   remote_ratio        151445 non-null  int64 
 9   company_location    151445 non-null  object
 10  company_size        151445 non-null  object
dtypes: int64(4), object(7)
memory usage: 12.7+ MB
None
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remo