<img align="right" style="padding-left:50px;" src="figures_wk4/data_cleaning.png" width=350><br>
### User Bias in Data Cleaning
For your homework assignment this week, we will explore how our treatment of our data can impact the quality of our results.

**Dataset:**
The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs.

A list of the corresponding survey questions can be found [here](https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html).

 

In [None]:
! pip install pycountry

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('survey_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  28108 non-null  object 
 1   q1         28108 non-null  object 
 2   q2         28033 non-null  object 
 3   q3         28107 non-null  object 
 4   q4         7273 non-null   object 
 5   q5         28108 non-null  object 
 6   q6         20793 non-null  float64
 7   q7         28108 non-null  object 
 8   q8         211 non-null    object 
 9   q9         3047 non-null   object 
 10  q10        28108 non-null  object 
 11  q11        23074 non-null  object 
 12  q12        28026 non-null  object 
 13  q13        28108 non-null  object 
 14  q14        28108 non-null  object 
 15  q15        27885 non-null  object 
 16  q16        27937 non-null  object 
 17  q17        27931 non-null  object 
dtypes: float64(1), object(17)
memory usage: 3.9+ MB


In [4]:
df.head()

Unnamed: 0,timestamp,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


### Assignment
Your goal for this assignment is to observe how your data treatment during the cleaning process can skew or bias the dataset.

Before diving right in, stop and read through the questions associated with the dataset. As you can see, they are either free-form text entries or categorical selections. Knowing this, perform some exploratory data analysis (EDA) to investigate the "state" of the dataset.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  28108 non-null  object 
 1   q1         28108 non-null  object 
 2   q2         28033 non-null  object 
 3   q3         28107 non-null  object 
 4   q4         7273 non-null   object 
 5   q5         28108 non-null  object 
 6   q6         20793 non-null  float64
 7   q7         28108 non-null  object 
 8   q8         211 non-null    object 
 9   q9         3047 non-null   object 
 10  q10        28108 non-null  object 
 11  q11        23074 non-null  object 
 12  q12        28026 non-null  object 
 13  q13        28108 non-null  object 
 14  q14        28108 non-null  object 
 15  q15        27885 non-null  object 
 16  q16        27937 non-null  object 
 17  q17        27931 non-null  object 
dtypes: float64(1), object(17)
memory usage: 3.9+ MB


In [6]:
questions_map = {
    'q1': 'age',
    'q2': 'Work_Industry',
    'q3': 'Job_Title',
    'q4': 'Job_Title_Add_Comments',
    'q5': 'Annual_Salary',
    'q6': 'Bonus',
    'q7': 'Currency',
    'q8': 'Currency_Other',
    'q9': 'Income_Context',
    'q10': 'Country_Work',
    'q11': 'State_Work_USA',
    'q12': 'City_Work',
    'q13': 'Total_Years_of_Experience',
    'q14': 'Related_Work_Experience',
    'q15': 'Highest_Qualification',
    'q16': 'Gender',
    'q17': 'Race'
}

df.rename(columns=questions_map, inplace=True)
df.sample(5)

Unnamed: 0,timestamp,age,Work_Industry,Job_Title,Job_Title_Add_Comments,Annual_Salary,Bonus,Currency,Currency_Other,Income_Context,Country_Work,State_Work_USA,City_Work,Total_Years_of_Experience,Related_Work_Experience,Highest_Qualification,Gender,Race
25522,5/6/2021 18:49:40,25-34,Nonprofits,Research Associate,,46000,750.0,USD,,,United States,Virginia,Arlington,5-7 years,5-7 years,College degree,Woman,White
12178,4/28/2021 8:40:28,25-34,"Accounting, Banking & Finance",Project Coordinator,,74000,,USD,,,USA,Massachusetts,Boston,5-7 years,2 - 4 years,College degree,Woman,White
6846,4/27/2021 14:12:09,25-34,"Marketing, Advertising & PR",Global Communications Coordinator,,52000,,CAD,,,Canada,,Montreal,2 - 4 years,2 - 4 years,Master's degree,Woman,White
18518,4/29/2021 0:37:28,25-34,"Marketing, Advertising & PR",Senior Product Marketing Manager,,210000,35000.0,USD,,,United States,California,San Francisco,8 - 10 years,2 - 4 years,Master's degree,Man,Asian or Asian American
7765,4/27/2021 15:11:29,25-34,Real Estate,Assistant,,35000,300.0,USD,,300 per closed transaction,US,Louisiana,New Orleans,2 - 4 years,1 year or less,Master's degree,Woman,White


In [7]:
# Converting Strings to Numbers
# We need to convert the 'Annual_Salary' and 'Bonus' columns to numbers.

# Step 1: Remove thousands separators (commas) from 'Annual_Salary'.
# 'Bonus' was already read as float64, so it has no commas to strip.
df['Annual_Salary'] = df['Annual_Salary'].str.replace(',', '')

# Step 2: Convert 'Annual_Salary' and 'Bonus' to numbers.
# errors='coerce' turns any value that still cannot be parsed into NaN.
df['Annual_Salary'] = pd.to_numeric(df['Annual_Salary'], errors='coerce')
df['Bonus'] = pd.to_numeric(df['Bonus'], errors='coerce')

df[['Annual_Salary', 'Bonus']].describe()

Unnamed: 0,Annual_Salary,Bonus
count,28108.0,20793.0
mean,361932.4,18244.6
std,36193380.0,833624.9
min,0.0,0.0
25%,54000.0,0.0
50%,75000.0,2000.0
75%,109826.8,10000.0
max,6000070000.0,120000000.0


In [8]:
# Checking the distribution of the Category columns
unique_values = pd.DataFrame(columns=['Column', 'Unique Values'])
for column in df.select_dtypes(include='object').columns:
    new_row = pd.DataFrame({
        'Column': [column],
        'Unique Values': [df[column].nunique()]
    })
    unique_values = pd.concat([unique_values, new_row], ignore_index=True)

unique_values.sort_values('Unique Values', ascending=True)

Unnamed: 0,Column,Unique Values
14,Gender,5
13,Highest_Qualification,6
1,age,7
11,Total_Years_of_Experience,8
12,Related_Work_Experience,8
5,Currency,11
15,Race,51
6,Currency_Other,124
9,State_Work_USA,137
8,Country_Work,382


`Race` has a high unique-value count because the question permitted multi-selection or free-text entries; participants provided multiple or unique identifications, resulting in 51 distinct values.

**1. Currency_Other (124 unique):** A broad range of responses was contributed by free-text entries for alternative or less common currencies. <br>
**2. State_Work_USA (137 unique):** Multiple distinct state abbreviations or spellings were introduced, resulting in a high unique count. <br>
**3. Country_Work (382 unique):** A globally diverse respondent pool produced significant variation in country names. <br>
**4. Work_Industry (1220 unique):** Numerous specialized or niche sectors were captured through free-form descriptions of industries. <br>
**5. Income_Context (2983 unique):** Personalized explanations for salary circumstances were provided, creating considerable variability. <br>
**6. City_Work (4841 unique):** City names worldwide appeared in various spellings and languages, leading to a large number of distinct entries. <br>
**7. Job_Title_Add_Comments (7010 unique):** A wide range of clarifications was added to roles, accounting for extensive variation. <br>
**8. Job_Title (14377 unique):** The diversity of occupations was reflected in the broad array of unique job titles reported. <br>

In [9]:
# Looking at the 5 most frequent values in each of the first 5 object columns
# (unique_values was not sorted in place, so head(5) follows the column order)
unique_values_top_5 = unique_values.head(5)
for column in unique_values_top_5['Column']:
    # print(f"\n\n{column}")
    print(df[column].value_counts().head())
    print('*' * 50)

timestamp
4/27/2021 11:24:33    5
4/27/2021 11:30:18    5
4/27/2021 11:12:58    5
4/27/2021 11:05:08    5
4/27/2021 11:05:17    5
Name: count, dtype: int64
**************************************************
age
25-34    12668
35-44     9908
45-54     3193
18-24     1236
55-64      994
Name: count, dtype: int64
**************************************************
Work_Industry
Computing or Tech                       4711
Education (Higher Education)            2466
Nonprofits                              2420
Health care                             1899
Government and Public Administration    1893
Name: count, dtype: int64
**************************************************
Job_Title
Software Engineer           286
Project Manager             230
Director                    198
Senior Software Engineer    196
Program Manager             152
Name: count, dtype: int64
**************************************************
Job_Title_Add_Comments
Fundraising                           20
In commerc

In [10]:
# Checking the missing values percentage
missing_values = df.isnull().mean() * 100
missing_values = missing_values[missing_values > 0]
missing_values.sort_values(ascending=False)

Currency_Other            99.249324
Income_Context            89.159670
Job_Title_Add_Comments    74.124804
Bonus                     26.024619
State_Work_USA            17.909492
Highest_Qualification      0.793368
Race                       0.629714
Gender                     0.608368
City_Work                  0.291732
Work_Industry              0.266828
Job_Title                  0.003558
dtype: float64

**Reasons For High Null values**
1. Many respondents likely used one of the standard currency options, so they had no need to fill in “Currency_Other.”
2. For “Income_Context,” participants may have felt their salary situation was straightforward enough and therefore left that field blank.
3. With “Job_Title_Add_Comments,” respondents might not have considered additional comments necessary or found the question optional, resulting in a high rate of non-responses.

**Observations** <br>
Based on the exploratory data analysis, this dataset appears to be in a relatively messy state with several data quality challenges that need to be addressed. Here's a detailed assessment:

The dataset contains salary survey responses from over 28,000 participants, with 18 columns covering various aspects of employment information. One of the most significant issues is the inconsistency in data entry formats, particularly evident in columns like Job_Title (14,377 unique values), City_Work (4,841 unique values), and Work_Industry (1,220 unique values). This high number of unique values suggests a lack of standardization in how respondents entered their information, making it difficult to perform meaningful aggregations or analyses without substantial cleaning.

The missing value analysis reveals concerning patterns, with some columns having extremely high null rates. Currency_Other (99.25%), Income_Context (89.16%), and Job_Title_Add_Comments (74.12%) show the highest percentages of missing values. While some of these missing values may be legitimate (for example, Currency_Other would naturally be empty for those using standard currencies), others might represent data collection issues or participant non-response that could potentially bias our analysis.

The numerical data, particularly in the Annual_Salary and Bonus columns, shows signs of potential data quality issues. The presence of extreme values (maximum salary of 6 billion and maximum bonus of 120 million) suggests either data entry errors or outliers that need careful consideration. Additionally, the salary data was originally stored as strings with commas, requiring conversion to numeric format for analysis. The wide range in these values and the presence of multiple currencies (11 unique currency types) adds another layer of complexity to the data cleaning process.

The categorical variables show varying degrees of standardization issues. For example, the Country_Work column contains 382 unique values, far more than the actual number of countries in the world, indicating variations in spelling, capitalization, or format (e.g., "US", "USA", "United States" likely referring to the same country). Similar patterns are observed in the State_Work_USA column (137 unique values) and Race column (51 unique values), where free-form text entry has led to multiple representations of the same categories.

The demographic information (age, gender, race) and educational/experience data appear more structured, with reasonable numbers of unique values, though they still contain some missing values that need to be addressed. The timestamp column indicates that all data was collected within a specific timeframe, providing some consistency in the temporal aspect of the dataset.

Overall, while the dataset contains valuable information about salary distributions across various demographics and industries, its current state requires substantial cleaning and standardization before it can be effectively used for analysis. The main challenges lie in handling the free-form text entries, addressing missing values, dealing with currency conversions, and managing outliers while preserving the integrity of the underlying data.


#### The Plan

**Data Standardization and Currency Handling:**
First, we need to standardize the salary data by converting all currencies to USD. We'll create a currency conversion mapping using the Currency and Currency_Other columns. We'll also need to clean the Annual_Salary column by removing any non-numeric characters and standardizing number formats. Extreme outliers (like the 6 billion salary) should be investigated individually and corrected if they're clearly errors, rather than being automatically removed.
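
The conversion step could look roughly like the sketch below. It assumes a hand-maintained table of exchange rates and assumes that the `Currency` column uses the literal value `"Other"` when the respondent typed a currency into `Currency_Other`. The rates are illustrative placeholders, not real quotes, and `RATES_TO_USD`, `to_usd`, and `Annual_Salary_USD` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Placeholder exchange rates (assumption: NOT real quotes). Replace them with
# rates from a trusted source for the survey period before using this.
RATES_TO_USD = {"USD": 1.0, "CAD": 0.8, "GBP": 1.4, "EUR": 1.2}

def to_usd(frame, rates=RATES_TO_USD):
    """Return salaries converted to USD; unknown currencies become NaN."""
    # Use the free-text currency when the respondent chose "Other" (assumed label).
    currency = frame["Currency"].where(
        frame["Currency"] != "Other", frame["Currency_Other"]
    )
    currency = currency.str.strip().str.upper()
    # Codes missing from the rate table map to NaN instead of being guessed.
    return frame["Annual_Salary"] * currency.map(rates)

demo = pd.DataFrame({
    "Annual_Salary": [55000, 54600, 1000],
    "Currency": ["USD", "GBP", "Other"],
    "Currency_Other": [None, None, "cad"],
})
demo["Annual_Salary_USD"] = to_usd(demo)
```

Leaving unmapped currencies as missing keeps the gap visible, so the analyst decides later whether to drop those rows or look up a rate.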

**Geographic Data Cleanup:**
The Country_Work column needs standardization to consolidate variations like "US", "USA", and "United States". We can create a mapping dictionary for common variations. Similarly, for State_Work_USA, we'll standardize state names to their official two-letter abbreviations. The City_Work column, while messy, should be standardized for major cities but may need to be left as-is for smaller locations to avoid incorrect mappings. This geographic standardization will be particularly important for salary analysis across regions.
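
A minimal sketch of the state standardization, assuming a lookup dictionary keyed on lowercase state names. The dictionary below is deliberately partial, and `STATE_ABBREVIATIONS` and `abbreviate_state` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Partial lookup for illustration only; a full version needs every state
# plus the District of Columbia.
STATE_ABBREVIATIONS = {
    "massachusetts": "MA",
    "tennessee": "TN",
    "wisconsin": "WI",
    "south carolina": "SC",
    "district of columbia": "DC",
}

def abbreviate_state(value):
    """Map a state name to its two-letter code; leave unknown values untouched."""
    if pd.isna(value):
        return value
    key = str(value).strip().lower()
    # Returning the original value keeps unmapped entries visible for review.
    return STATE_ABBREVIATIONS.get(key, value)

states = pd.Series(["Massachusetts", " tennessee ", "Atlantis", None])
standardized = states.map(abbreviate_state)
```

Passing unmapped values through unchanged, rather than turning them into missing values, avoids silently discarding answers the lookup did not anticipate.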

**Industry Consolidation:**
The Work_Industry column requires categorization into broader industry groups while preserving specific sub-industries in a new column. The Job_Title_Add_Comments column should be preserved as-is since it contains valuable contextual information.
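
One way to sketch the consolidation is a keyword lookup that writes the broad group to a new column and leaves `Work_Industry` untouched. The keyword rules, the `Industry_Group` column, and the `group_industry` helper are all assumptions made for illustration; real rules would need to be built from the actual value counts.

```python
import pandas as pd

# Hypothetical keyword -> group rules. Order matters: the first match wins.
# Substring matching is blunt (e.g. "tech" also matches "biotech").
INDUSTRY_KEYWORDS = [
    ("tech", "Technology"),
    ("software", "Technology"),
    ("education", "Education"),
    ("health", "Health"),
    ("nonprofit", "Nonprofit"),
]

def group_industry(value):
    """Return a broad industry group, or 'Other' when no keyword matches."""
    if pd.isna(value):
        return value
    text = str(value).lower()
    for keyword, group in INDUSTRY_KEYWORDS:
        if keyword in text:
            return group
    return "Other"

industries = pd.DataFrame({"Work_Industry": [
    "Computing or Tech",
    "Education (Higher Education)",
    "Health care",
    "Underwater basket weaving",
]})
# Keep the original answers and add the broad group alongside them.
industries["Industry_Group"] = industries["Work_Industry"].map(group_industry)
```

The size of the resulting `Other` bucket is a useful check: a large one means the rules are hiding a lot of respondents behind a single label.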

**Race Binary Encoding:**
For the Race column, we'll create multiple binary columns for each possible race category since it allows multiple selections. For example:
- Race_White (0/1)
- Race_Asian (0/1)
- Race_Black (0/1)
- Race_Hispanic (0/1)
- Race_Native_American (0/1)
- Race_Pacific_Islander (0/1)

This approach will allow for proper representation of multi-racial identifications and make it easier to analyze demographic patterns.
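
The encoding could be sketched as a substring search per category. The keywords below are assumptions inferred from the sample rows and should be checked against `df['Race'].value_counts()` first; `RACE_KEYWORDS` and `add_race_flags` are hypothetical names.

```python
import pandas as pd

# Flag column -> substring searched for in the lowercased answer.
# Caveat: plain substrings can over-match ("asian" also matches "caucasian").
RACE_KEYWORDS = {
    "Race_White": "white",
    "Race_Asian": "asian",
    "Race_Black": "black",
    "Race_Hispanic": "hispanic",
    "Race_Native_American": "native american",
    "Race_Pacific_Islander": "pacific islander",
}

def add_race_flags(frame, column="Race"):
    """Return a copy of frame with one 0/1 column per race keyword."""
    result = frame.copy()
    text = result[column].str.lower()
    for flag, keyword in RACE_KEYWORDS.items():
        # na=False makes a missing answer count as 0 in every flag column.
        result[flag] = text.str.contains(keyword, na=False, regex=False).astype(int)
    return result

demo = pd.DataFrame({"Race": ["White", "Asian or Asian American, White", None]})
flags = add_race_flags(demo)
```

Note that a missing answer ends up as all zeros, which is indistinguishable from "none of these categories" unless a separate missing-value indicator is kept.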

**Handling Missing Values:**
For columns with high missing rates, we'll use different strategies:
- Currency_Other (99.25% missing): Can be safely left as-is since it's only relevant for non-standard currencies
- Income_Context (89.16% missing): Preserve as supplementary information without imputation
- Job_Title_Add_Comments (74.12% missing): Keep as-is since it's supplementary
- Bonus (26.02% missing): Replace with 0 for roles where bonuses aren't typical, based on industry patterns
- State_Work_USA (17.91% missing): Only fill for confirmed US locations, leave others as missing
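
Because filling `Bonus` with 0 changes the distribution, it helps to record which rows were blank before overwriting them. The sketch below uses toy values, and `Bonus_Was_Missing` and `Bonus_Filled` are hypothetical column names.

```python
import pandas as pd

demo = pd.DataFrame({"Bonus": [4000.0, None, 0.0]})

# Record which rows were originally blank so imputed zeros can be told apart
# from reported zeros later.
demo["Bonus_Was_Missing"] = demo["Bonus"].isna()
demo["Bonus_Filled"] = demo["Bonus"].fillna(0)

# Comparing the two means shows how far the fill-with-zero choice moves the
# average bonus.
mean_reported_only = demo["Bonus"].mean()      # skips missing values
mean_after_fill = demo["Bonus_Filled"].mean()  # treats missing as 0
```

In this toy frame the mean drops from 2000 to roughly 1333 once the blank is treated as zero, which is exactly the kind of shift the assignment asks us to watch for.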

**Demographic Data Cleanup:**
For the demographic column (Gender), we'll:
- Standardize gender categories while preserving non-binary and other gender identities, and handle null values explicitly

**Experience and Education:**
The Total_Years_of_Experience and Related_Work_Experience columns need to be converted to numeric ranges for analysis. We'll create minimum and maximum years columns while preserving the original categorical data. The Highest_Qualification column should be standardized to common degree terminology while maintaining the original granularity.

**Outlier Management:**
Instead of removing outliers, we'll flag them in new columns:
- Salary_Outlier_Flag: Based on industry-specific IQR calculations
- Experience_Mismatch_Flag: Where total experience seems inconsistent with age
- Bonus_Outlier_Flag: For unusually high bonus-to-salary ratios

This approach preserves the original data while maintaining its richness and allowing for different levels of analysis granularity. By keeping Job_Title as-is and implementing binary race columns, we avoid losing valuable information while making the race data more analytically useful.
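
A minimal sketch of the industry-specific salary flag, assuming the conventional 1.5 × IQR fences computed within each industry. The `flag_iqr_outliers` helper and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def flag_iqr_outliers(frame, value_col, group_col, k=1.5):
    """Return True where a value lies outside its group's IQR fences."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    too_low = frame[value_col] < q1 - k * iqr
    too_high = frame[value_col] > q3 + k * iqr
    return too_low | too_high

demo = pd.DataFrame({
    "Work_Industry": ["Tech"] * 5 + ["Nonprofits"] * 5,
    "Annual_Salary": [90000, 95000, 100000, 105000, 6000070000,
                      40000, 45000, 50000, 55000, 60000],
})
demo["Salary_Outlier_Flag"] = flag_iqr_outliers(demo, "Annual_Salary", "Work_Industry")
```

Quartiles are robust to a single extreme entry, so the 6 billion value is flagged without distorting the fences for the rest of its group. The flag should be computed after currency conversion, otherwise salaries in different currencies are compared on the same scale.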

#### Implementation

Based on the plan you described above, go ahead and clean up the dataset.

[Add as many code cells below here as needed]

In [11]:
import pycountry
import numpy as np
import re

def clean_string(text):
    if pd.isna(text):
        return text
        
    # Convert to lowercase
    cleaned = str(text).lower()
    
    # Remove everything in parentheses and after common separators
    cleaned = re.sub(r'\(.*?\)', '', cleaned)
    cleaned = re.sub(r'but.*$', '', cleaned)
    cleaned = re.sub(r'for.*$', '', cleaned)
    cleaned = re.sub(r'based.*$', '', cleaned)
    
    # Remove special characters and extra spaces
    cleaned = re.sub(r'[^\w\s]', ' ', cleaned)
    cleaned = re.sub(r'\s+', ' ', cleaned)
    cleaned = cleaned.strip()
    
    # Return None for non-country entries
    if len(cleaned) < 2:  # Too short to be a country
        return None
        
    # Common variations mapping
    us_patterns = [
        r'^u\s*s\s*a*$',  # usa, us, u s a
        r'^united\s*states?.*$',  # united states, united state
        r'^us[st][ast].*$',  # ussa, usst, usta, etc.
        r'^unit.*stat.*$',  # united states with typos
        r'.*states?\s*of\s*americ[as]*$'  # states of america variations
    ]
    
    uk_patterns = [
        r'^uk$',
        r'^united\s*kingdom.*$',
        r'^england.*$',
        r'^scotland.*$',
        r'^wales.*$',
        r'^great\s*britain.*$',
        r'^britain.*$'
    ]
    
    # Check for US variations
    for pattern in us_patterns:
        if re.match(pattern, cleaned):
            return 'United States'
            
    # Check for UK variations
    for pattern in uk_patterns:
        if re.match(pattern, cleaned):
            return 'United Kingdom'
            
    # Other common variations (keys are regex patterns matched from the start
    # of the string; raw strings keep escapes such as \s intact)
    common_fixes = {
        r'brasil': 'Brazil',
        r'uae': 'United Arab Emirates',
        r'hong konh?g?': 'Hong Kong',
        r'viet\s*nam': 'Vietnam',
        r'myanmar': 'Myanmar',
        r'burma': 'Myanmar',
        r'czechia': 'Czech Republic',
        r'ceska republika': 'Czech Republic',
        r'nederland': 'Netherlands',
        r'new zealand aotearoa': 'New Zealand',
        r'canda': 'Canada',
        r'canadw': 'Canada',
        r'australi.*': 'Australia'
    }
    
    for pattern, replacement in common_fixes.items():
        if re.match(pattern, cleaned):
            return replacement
            
    # Remove non-country entries
    non_countries = [
        'remote', 'global', 'international', 'worldwide', 'europe',
        'policy', 'contracts', 'finance', 'bonus', 'salary', 'benefits',
        'company', 'government', 'position', 'employee'
    ]
    
    # Substring check: this also covers the case where the whole entry is one of the words
    if any(word in cleaned for word in non_countries):
        return None
        
    return cleaned

In [12]:
# Get unique country values
unique_countries = df['Country_Work'].dropna().unique()

# Create mapping dictionary
country_mapping = {}
for country in unique_countries:
    try:
        cleaned = clean_string(country)
        if cleaned is None:
            country_mapping[country] = np.nan
        else:
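            try:
                # search_fuzzy returns a list of candidates (the first is used)
                # and raises LookupError when nothing matches
                found = pycountry.countries.search_fuzzy(cleaned)
                country_mapping[country] = found[0].name
            except Exception:
                # Fall back to an exact match against official country names
                if cleaned.title() in [c.name for c in pycountry.countries]:
                    country_mapping[country] = cleaned.title()
                else:
                    country_mapping[country] = np.nan
    except Exception:
        # Anything unexpected is treated as an unmappable entry
        country_mapping[country] = np.nan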

# Apply mapping
df['Country_Work'] = df['Country_Work'].map(country_mapping)

In [13]:
df.Country_Work.value_counts().to_dict()

{'United States': 23143,
 'Canada': 1686,
 'United Kingdom': 1582,
 'Australia': 391,
 'Germany': 197,
 'New Zealand': 130,
 'Ireland': 125,
 'Netherlands': 90,
 'France': 68,
 'Spain': 62,
 'Sweden': 41,
 'Switzerland': 38,
 'Belgium': 35,
 'Japan': 29,
 'Denmark': 24,
 'India': 23,
 'American Samoa': 22,
 'South Africa': 20,
 'Singapore': 20,
 'Austria': 18,
 'Finland': 16,
 'Italy': 15,
 'Norway': 14,
 'Israel': 14,
 'Malaysia': 13,
 'Philippines': 13,
 'Brazil': 12,
 'China': 11,
 'Poland': 10,
 'Mexico': 8,
 'Virgin Islands, U.S.': 7,
 'Czechia': 6,
 'Argentina': 6,
 'Thailand': 6,
 'Greece': 5,
 'Colombia': 5,
 'Taiwan, Province of China': 5,
 'Korea, Republic of': 5,
 'Portugal': 5,
 'Nigeria': 5,
 'Pakistan': 5,
 'Romania': 5,
 'Hong Kong': 5,
 'Puerto Rico': 4,
 'Latvia': 4,
 'Kenya': 3,
 'Ghana': 3,
 'Chile': 3,
 'Sri Lanka': 2,
 'Bermuda': 2,
 'Hungary': 2,
 'Luxembourg': 2,
 'Zimbabwe': 2,
 'Namibia': 2,
 'Indonesia': 2,
 'Lithuania': 2,
 'Slovenia': 2,
 'Myanmar': 2,
 'Cyp

In [14]:
from collections import defaultdict

city_state_map = defaultdict(lambda: defaultdict(int))

for _, row in df[
    (df['Country_Work'] == 'United States') & 
    (df['State_Work_USA'].notna()) & 
    (df['City_Work'].notna())
].iterrows():
    city = row['City_Work'].strip().lower()
    state = row['State_Work_USA'].strip()
    city_state_map[city][state] += 1

In [15]:
city_state_mapping = {
    city: max(states.items(), key=lambda x: x[1])[0]
    for city, states in city_state_map.items()
}

city_state_mapping

{'boston': 'Massachusetts',
 'chattanooga': 'Tennessee',
 'milwaukee': 'Wisconsin',
 'greenville': 'South Carolina',
 'hanover': 'New Hampshire',
 'columbia': 'South Carolina',
 'yuma': 'Arizona',
 'st. louis': 'Missouri',
 'palm coast': 'Florida',
 'scranton': 'Pennsylvania',
 'detroit': 'Michigan',
 'saint paul': 'Minnesota',
 'chicago': 'Illinois',
 'pomona': 'California',
 'atlanta': 'Georgia',
 'boca raton': 'Florida',
 'philadelphia': 'Pennsylvania',
 'dayton': 'Ohio',
 'bradenton': 'Florida',
 'ann arbor': 'Michigan',
 'washington dc': 'District of Columbia',
 'silver spring': 'Maryland',
 'washington': 'District of Columbia',
 'san antonio': 'Texas',
 'minneapolis': 'Minnesota',
 'washington, dc': 'District of Columbia',
 'richmond': 'Virginia',
 'research triangle': 'North Carolina',
 'kalamazoo': 'Michigan',
 'manhattan': 'New York',
 'sacramento': 'California',
 'dallas': 'Texas',
 'waynesboro': 'Virginia',
 'pittsburgh': 'Pennsylvania',
 'arlington, va': 'Virginia',
 'chape

In [16]:
mask = (
    (df['Country_Work'] == 'United States') & 
    (df['State_Work_USA'].isna()) & 
    (df['City_Work'].notna())
)


# Count before
print("Missing states for US records before:", df[
    (df['Country_Work'] == 'United States') & 
    (df['State_Work_USA'].isna())
].shape[0])



# Apply the mapping
for idx in df[mask].index:
    city = df.loc[idx, 'City_Work'].strip().lower()
    if city in city_state_mapping:
        df.loc[idx, 'State_Work_USA'] = city_state_mapping[city]

Missing states for US records before: 171


In [17]:
# Count after
print("Missing states for US records after:", df[
    (df['Country_Work'] == 'United States') & 
    (df['State_Work_USA'].isna())
].shape[0])



# Show a random sample of US rows that have a state value; the sample is drawn
# from all such rows, not only the ones filled in above
print("\nExample of filled state values:")
filled_examples = df[
    (df['Country_Work'] == 'United States') & 
    (df['State_Work_USA'].notna()) & 
    (df['City_Work'].notna())
].sample(5)


print(filled_examples[['City_Work', 'State_Work_USA']].to_string())

Missing states for US records after: 44

Example of filled state values:
         City_Work State_Work_USA
21863     Freeport       Illinois
9439      New York       New York
4592   Minneapolis      Minnesota
24557      Houston          Texas
2133      Columbus           Ohio


In [18]:
# Function to extract numeric ranges from experience strings
def extract_years_range(experience_str):
    if pd.isna(experience_str):
        return pd.NA, pd.NA
    
    experience_str = str(experience_str).lower().strip()
    
    # Handle special cases
    if experience_str in ['< 1 year', '1 year or less']:
        return 0, 1
    elif '41 years or more' in experience_str:
        return 41, 50  # 50 is an assumed cap for the open-ended top bracket

    # Parse ranges such as "5-7 years" or "11 - 20 years"
    numbers = [int(s) for s in experience_str.replace('years', '').split('-') if s.strip().isdigit()]

    if len(numbers) == 2:
        return numbers[0], numbers[1]
    elif len(numbers) == 1:
        # A lone number: treat 41 as the open-ended bracket, otherwise
        # use the number as both bounds
        if '41' in experience_str:
            return 41, 50
        return numbers[0], numbers[0]
    return pd.NA, pd.NA


In [19]:
# Create new columns for experience ranges
df[['Total_Years_Min', 'Total_Years_Max']] = df.apply(
    lambda x: pd.Series(extract_years_range(x['Total_Years_of_Experience'])), 
    axis=1
)

df[['Related_Years_Min', 'Related_Years_Max']] = df.apply(
    lambda x: pd.Series(extract_years_range(x['Related_Work_Experience'])), 
    axis=1
)

In [20]:
# Function to extract age ranges with additional handling
def extract_age_range(age_str):
    if pd.isna(age_str):
        return pd.NA, pd.NA
    
    age_str = str(age_str).lower().strip()
    
    # Handle special cases
    if 'under 18' in age_str:
        return 0, 17
    elif '65' in age_str or 'over' in age_str:
        return 65, 100  # Using 100 as an upper bound
    
    # Extract numbers from strings like "25-34", "35-44" etc.
    numbers = [int(s) for s in age_str.split('-') if s.isdigit()]
    
    if len(numbers) == 2:
        return numbers[0], numbers[1]
    return pd.NA, pd.NA

# Create new columns for age ranges
df[['Age_Min', 'Age_Max']] = df.apply(
    lambda x: pd.Series(extract_age_range(x['age'])), 
    axis=1
)

In [21]:
# Print detailed statistics to verify transformations
print("\nAge Categories Distribution:")
print(df['age'].value_counts())
print("\nAge Range Statistics:")
print(df[['Age_Min', 'Age_Max']].describe())

print("\nExperience Categories Distribution:")
print(df['Total_Years_of_Experience'].value_counts().head(10))
print("\nTotal Years Experience Range Statistics:")
print(df[['Total_Years_Min', 'Total_Years_Max']].describe())

# Check for any remaining null values
print("\nNull Values in Transformed Columns:")
print(df[['Age_Min', 'Age_Max', 'Total_Years_Min', 'Total_Years_Max']].isnull().sum())

# Print unique values to verify all categories are captured
print("\nUnique Experience Categories:")
print(sorted(df['Total_Years_of_Experience'].unique()))


Age Categories Distribution:
age
25-34         12668
35-44          9908
45-54          3193
18-24          1236
55-64           994
65 or over       95
under 18         14
Name: count, dtype: int64

Age Range Statistics:
            Age_Min       Age_Max
count  28108.000000  28108.000000
mean      31.672762     40.632702
std        8.710147      9.369640
min        0.000000     17.000000
25%       25.000000     34.000000
50%       35.000000     44.000000
75%       35.000000     44.000000
max       65.000000    100.000000

Experience Categories Distribution:
Total_Years_of_Experience
11 - 20 years       9630
8 - 10 years        5381
5-7 years           4886
21 - 30 years       3645
2 - 4 years         3038
31 - 40 years        870
1 year or less       533
41 years or more     125
Name: count, dtype: int64

Total Years Experience Range Statistics:
       Total_Years_Min  Total_Years_Max
count     28108.000000     28108.000000
mean         10.250605        15.785435
std           6.9152

In [22]:
df.isna().sum()

timestamp                        0
age                              0
Work_Industry                   75
Job_Title                        1
Job_Title_Add_Comments       20835
Annual_Salary                    0
Bonus                         7315
Currency                         0
Currency_Other               27897
Income_Context               25061
Country_Work                    78
State_Work_USA                4907
City_Work                       82
Total_Years_of_Experience        0
Related_Work_Experience          0
Highest_Qualification          223
Gender                         171
Race                           177
Total_Years_Min                  0
Total_Years_Max                  0
Related_Years_Min                0
Related_Years_Max                0
Age_Min                          0
Age_Max                          0
dtype: int64

In [23]:
df.sample(5)

Unnamed: 0,timestamp,age,Work_Industry,Job_Title,Job_Title_Add_Comments,Annual_Salary,Bonus,Currency,Currency_Other,Income_Context,...,Related_Work_Experience,Highest_Qualification,Gender,Race,Total_Years_Min,Total_Years_Max,Related_Years_Min,Related_Years_Max,Age_Min,Age_Max
24562,5/5/2021 16:24:23,35-44,Education (Higher Education),Administrative Assistant,,42000,,USD,,,...,11 - 20 years,College degree,Woman,White,11,20,11,20,35,44
6792,4/27/2021 14:09:11,25-34,Media & Digital,Assistant Production Editor,,41000,1000.0,USD,,,...,2 - 4 years,College degree,Woman,"Asian or Asian American, White",2,4,2,4,25,34
6671,4/27/2021 14:01:31,25-34,Computing or Tech,Software Engineer,,90000,,CAD,,,...,5-7 years,College degree,Woman,White,8,10,5,7,25,34
11242,4/27/2021 23:55:45,45-54,"Energy, Oil & Gas",Director of Contracts & Risk,,130000,15000.0,USD,,,...,21 - 30 years,Master's degree,Woman,White,21,30,21,30,45,54
10731,4/27/2021 22:03:29,35-44,Hospitality & Events,Event Monitor,"Work in a convention center setting up, taking...",28000,2000.0,USD,,,...,2 - 4 years,High School,Man,White,21,30,2,4,35,44


In [24]:
# First verify the new columns have been created correctly
print("Before dropping columns:")
print(df.columns)

# Drop original columns while keeping the new numeric range columns
columns_to_drop = ['age', 'Total_Years_of_Experience', 'Related_Work_Experience']
df = df.drop(columns=columns_to_drop)

print("\nAfter dropping columns:")
print(df.columns)

Before dropping columns:
Index(['timestamp', 'age', 'Work_Industry', 'Job_Title',
       'Job_Title_Add_Comments', 'Annual_Salary', 'Bonus', 'Currency',
       'Currency_Other', 'Income_Context', 'Country_Work', 'State_Work_USA',
       'City_Work', 'Total_Years_of_Experience', 'Related_Work_Experience',
       'Highest_Qualification', 'Gender', 'Race', 'Total_Years_Min',
       'Total_Years_Max', 'Related_Years_Min', 'Related_Years_Max', 'Age_Min',
       'Age_Max'],
      dtype='object')

After dropping columns:
Index(['timestamp', 'Work_Industry', 'Job_Title', 'Job_Title_Add_Comments',
       'Annual_Salary', 'Bonus', 'Currency', 'Currency_Other',
       'Income_Context', 'Country_Work', 'State_Work_USA', 'City_Work',
       'Highest_Qualification', 'Gender', 'Race', 'Total_Years_Min',
       'Total_Years_Max', 'Related_Years_Min', 'Related_Years_Max', 'Age_Min',
       'Age_Max'],
      dtype='object')


In [25]:
# Simplified gender standardization based on the actual survey options
def standardize_gender(gender):
    if pd.isna(gender):
        return 'Other or prefer not to answer'
    
    gender = str(gender).strip()
    
    # Map to only the survey options
    if gender in ['Man', 'Woman', 'Non-binary']:
        return gender
    else:
        return 'Other or prefer not to answer'

In [26]:
# Create new column with standardized categories
df['Gender_Standardized'] = df['Gender'].apply(standardize_gender)

# Print before and after
print("Before standardization:")
print(df['Gender'].value_counts(dropna=False))

print("\nAfter standardization:")
print(df['Gender_Standardized'].value_counts(dropna=False))

# After verification, update the gender column
df = df.drop(columns=['Gender'])
df = df.rename(columns={'Gender_Standardized': 'Gender'})

print("\nFinal Gender Distribution:")
print(df['Gender'].value_counts(dropna=False))

print("\nNull Value Count:")
df.Gender.isna().sum()

Before standardization:
Gender
Woman                            21389
Man                               5502
Non-binary                         747
Other or prefer not to answer      298
NaN                                171
Prefer not to answer                 1
Name: count, dtype: int64

After standardization:
Gender_Standardized
Woman                            21389
Man                               5502
Non-binary                         747
Other or prefer not to answer      470
Name: count, dtype: int64

Final Gender Distribution:
Gender
Woman                            21389
Man                               5502
Non-binary                         747
Other or prefer not to answer      470
Name: count, dtype: int64

Null Value Count:


np.int64(0)

In [27]:
# Print initial missing bonus count
print("Initial missing bonus values:", df['Bonus'].isnull().sum())

# Replace all missing bonus values with 0
df['Bonus'] = df['Bonus'].fillna(0)

# Print final missing bonus count and distribution
print("\nFinal missing bonus values:", df['Bonus'].isnull().sum())
print("\nBonus distribution after filling zeros:")
print(df['Bonus'].describe().round(2))

Initial missing bonus values: 7315

Final missing bonus values: 0

Bonus distribution after filling zeros:
count    2.810800e+04
mean     1.349651e+04
std      7.170322e+05
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      5.000000e+03
max      1.200000e+08
Name: Bonus, dtype: float64
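
Filling missing bonuses with 0 makes a skipped question indistinguishable from a reported bonus of zero. A minimal sketch, assuming that distinction is worth keeping for later analysis, records a flag before filling. The helper name `fill_bonus_with_flag` and the `Bonus_Was_Missing` column are hypothetical and are not part of the notebook above.

```python
import pandas as pd

def fill_bonus_with_flag(df, column="Bonus"):
    """Return a copy with missing values filled by 0 and a flag marking which rows were filled."""
    out = df.copy()
    # Record missingness first, because fillna erases that information
    out[f"{column}_Was_Missing"] = out[column].isna()
    out[column] = out[column].fillna(0)
    return out

# Small made-up frame for illustration (not survey data)
demo = pd.DataFrame({"Bonus": [4000.0, None, 0.0]})
result = fill_bonus_with_flag(demo)
print(result)
```

With the flag in place, a model or summary can treat "reported zero" and "did not answer" differently if that turns out to matter.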


In [28]:
df.columns

Index(['timestamp', 'Work_Industry', 'Job_Title', 'Job_Title_Add_Comments',
       'Annual_Salary', 'Bonus', 'Currency', 'Currency_Other',
       'Income_Context', 'Country_Work', 'State_Work_USA', 'City_Work',
       'Highest_Qualification', 'Race', 'Total_Years_Min', 'Total_Years_Max',
       'Related_Years_Min', 'Related_Years_Max', 'Age_Min', 'Age_Max',
       'Gender'],
      dtype='object')

In [29]:
# Define the conversion rates
conversion_rates = {
    'USD': 1.0, 
    'GBP': 1.35, 
    'CAD': 0.79, 
    'EUR': 1.12, 
    'AUD': 0.74, 
    'NZD': 0.71, 
    'CHF': 1.08, 
    'ZAR': 0.065, 
    'SEK': 0.11, 
    'HKD': 0.13, 
    'JPY': 0.0091
}

# Function to normalize monetary values to USD
def normalize_to_usd(row, column):
    try:
        # Clean the value by removing thousands separators
        value = float(str(row[column]).replace(',', ''))
    except (ValueError, TypeError):
        # Value could not be parsed as a number
        return None

    currency = row['Currency']

    # If currency exists in our conversion rates, use it
    if currency in conversion_rates:
        return value * conversion_rates[currency]

    # Any currency not in our list is assumed to already be in USD
    return value

In [30]:
# Create new columns with normalized values
df['Salary_USD'] = df.apply(lambda row: normalize_to_usd(row, 'Annual_Salary'), axis=1)

df['Bonus_USD'] = df.apply(lambda row: normalize_to_usd(row, 'Bonus'), axis=1)

# Fill NA values in Bonus_USD with 0
df['Bonus_USD'] = df['Bonus_USD'].fillna(0)
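
The conversion above treats any currency outside `conversion_rates` as USD, which can silently misstate those salaries. As one possible alternative, the sketch below leaves such rows as NaN so they can be reviewed separately. It is an illustration under assumptions, not the notebook's method: `to_usd`, `demo`, and `demo_rates` are hypothetical names, and the demo values are made up.

```python
import pandas as pd

def to_usd(values, currencies, rates):
    """Convert amounts to USD; currencies missing from `rates` yield NaN."""
    # Strip thousands separators, then parse; unparseable entries become NaN
    amounts = pd.to_numeric(
        values.astype(str).str.replace(",", "", regex=False), errors="coerce"
    )
    # Series.map returns NaN for currencies that are not keys of the rate table
    factors = currencies.map(rates)
    return amounts * factors

# Made-up rows for illustration
demo = pd.DataFrame({
    "Annual_Salary": ["55,000", "54600", "100"],
    "Currency": ["USD", "GBP", "Other"],
})
demo_rates = {"USD": 1.0, "GBP": 1.35}
demo["Salary_USD"] = to_usd(demo["Annual_Salary"], demo["Currency"], demo_rates)
print(demo)
```

Working on whole columns also avoids the row-by-row `apply`, which tends to be slower on a frame of this size.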

In [31]:
# Create salary and bonus outlier flags based on industry-specific IQR
# (adds the flag column to df in place)
def flag_outliers(df, column):
    flag_column = f'{column}_Outlier_Flag'
    df[flag_column] = False

    # Rows with a missing industry are skipped and stay unflagged
    for industry in df['Work_Industry'].dropna().unique():
        industry_mask = df['Work_Industry'] == industry
        industry_values = df.loc[industry_mask, column]

        Q1 = industry_values.quantile(0.25)
        Q3 = industry_values.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outlier_mask = (df[column] < lower_bound) | (df[column] > upper_bound)
        df.loc[industry_mask & outlier_mask, flag_column] = True

In [32]:
# Apply outlier flagging for both salary and bonus
flag_outliers(df, 'Salary_USD')
flag_outliers(df, 'Bonus_USD')
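
The same per-industry IQR rule can also be written without an explicit loop by using `groupby(...).transform(...)`. The sketch below is a hedged illustration: `iqr_outlier_flags` is a hypothetical helper that returns a boolean Series instead of modifying the frame, and the demo values are made up.

```python
import pandas as pd

def iqr_outlier_flags(df, column, group_col="Work_Industry", k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], computed within each group."""
    grouped = df.groupby(group_col)[column]
    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)

# Made-up values for illustration
demo = pd.DataFrame({
    "Work_Industry": ["A"] * 5 + ["B"] * 5,
    "Salary_USD": [10, 11, 12, 13, 1000, 1, 2, 3, 4, 5],
})
flags = iqr_outlier_flags(demo, "Salary_USD")
print(demo[flags])
```

One property of either version is worth keeping in mind: an industry with a single respondent has an IQR of 0 and bounds equal to its only value, so that row can never be flagged. With many free-text industry labels, a large share of groups may fall into this case.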

In [33]:
# Get unique race values from the dataset
unique_races = set()
for race_str in df['Race'].dropna():
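    # Split by comma and strip whitespace
    # (note: this also splits an option that itself contains commas,
    # such as "Hispanic, Latino, or Spanish origin")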
    races = [r.strip() for r in str(race_str).split(',')]
    unique_races.update(races)

# Create binary columns for each unique race value
for race in sorted(unique_races):
    column_name = f'Race_{race.replace(" ", "_")}'
    df[column_name] = df['Race'].apply(
        lambda x: 1 if pd.notna(x) and race in str(x) else 0
    )

In [34]:
# Create Multiple_Races flag
race_columns = [col for col in df.columns if col.startswith('Race_')]
df['Multiple_Races'] = (df[race_columns].sum(axis=1) > 1).astype(int)

# Print summary
print("Race Distribution:")
total_respondents = len(df)
for col in sorted(race_columns):
    count = df[col].sum()
    percentage = (count / total_respondents) * 100
    print(f"{col.replace('Race_', '').replace('_', ' ')}: {count:,} ({percentage:.1f}%)")

print(f"\nMultiple Races: {df['Multiple_Races'].sum():,} ({(df['Multiple_Races'].sum() / total_respondents) * 100:.1f}%)")

Race Distribution:
Another option not listed here or prefer not to answer: 725 (2.6%)
Asian or Asian American: 1,833 (6.5%)
Black or African American: 895 (3.2%)
Hispanic: 1,101 (3.9%)
Latino: 1,101 (3.9%)
Middle Eastern or Northern African: 182 (0.6%)
Native American or Alaska Native: 157 (0.6%)
White: 24,384 (86.8%)
or Spanish origin: 1,101 (3.9%)

Multiple Races: 1,853 (6.6%)
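
The identical counts for "Hispanic", "Latino", and "or Spanish origin" (1,101 each) suggest that a single option, "Hispanic, Latino, or Spanish origin", was split at its internal commas. That also inflates `Multiple_Races`, since a respondent who chose only that option is counted three times. A minimal sketch of one way around this is to match against a fixed list of options instead of splitting on commas. The list `RACE_OPTIONS` is an assumption reconstructed from the categories printed above, and `race_indicators` is a hypothetical helper.

```python
import pandas as pd

# Assumed survey options, reconstructed from the printed categories
RACE_OPTIONS = [
    "Another option not listed here or prefer not to answer",
    "Asian or Asian American",
    "Black or African American",
    "Hispanic, Latino, or Spanish origin",
    "Middle Eastern or Northern African",
    "Native American or Alaska Native",
    "White",
]

def race_indicators(race_series, options=RACE_OPTIONS):
    """Build one 0/1 column per known option using literal substring matches."""
    text = race_series.fillna("")
    out = pd.DataFrame(index=race_series.index)
    for option in options:
        column = "Race_" + option.replace(",", "").replace(" ", "_")
        out[column] = text.str.contains(option, regex=False).astype(int)
    return out

# Made-up responses for illustration
demo = pd.Series([
    "White",
    "Hispanic, Latino, or Spanish origin",
    "Asian or Asian American, White",
    None,
])
indicators = race_indicators(demo)
multiple = (indicators.sum(axis=1) > 1).astype(int)
print(multiple.tolist())
```

Substring matching works here only because none of the assumed options is contained in another. If the real option list differs, that property would need to be rechecked.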


In [35]:
df.sample(5)

Unnamed: 0,timestamp,Work_Industry,Job_Title,Job_Title_Add_Comments,Annual_Salary,Bonus,Currency,Currency_Other,Income_Context,Country_Work,...,Race_Another_option_not_listed_here_or_prefer_not_to_answer,Race_Asian_or_Asian_American,Race_Black_or_African_American,Race_Hispanic,Race_Latino,Race_Middle_Eastern_or_Northern_African,Race_Native_American_or_Alaska_Native,Race_White,Race_or_Spanish_origin,Multiple_Races
11065,4/27/2021 23:10:39,"Accounting, Banking & Finance",Product,,110000,25000.0,USD,,,United States,...,0,0,0,0,0,0,0,1,0,0
22289,4/30/2021 18:55:20,Engineering or Manufacturing,Technical services scientist,,95000,0.0,USD,,,United States,...,0,0,0,0,0,0,0,1,0,0
18584,4/29/2021 1:01:02,Business or Consulting,Project Manager,"Real estate, nonprofits, and venture capital",90000,2000.0,USD,,,United States,...,0,0,0,1,1,0,0,0,1,1
23930,5/3/2021 18:43:49,Education (Higher Education),Instructional Designer,My actual work in as a Project Manager,64000,0.0,USD,,,United States,...,0,0,0,0,0,0,0,1,0,0
24879,5/5/2021 21:18:41,Government and Public Administration,Research scientist,I work for a university but my work site is a ...,76000,0.0,USD,,,United States,...,0,0,0,0,0,0,0,1,0,0


**Observation**

The salary survey dataset presented several major data quality challenges that required careful handling. The most significant issues included inconsistent data entry formats, missing values across multiple columns, and the presence of extreme outliers in salary and bonus fields. Free-form text entries led to high variability in fields like Job_Title (14,377 unique values) and Work_Industry (1,220 unique values), making standardization difficult. Geographic data was particularly messy, with multiple variations of country names (e.g., "US", "USA", "United States") and inconsistent city/state pairings.

The cleaning process substantially improved the dataset's usability for machine learning in several ways. Converting categorical variables into structured numeric formats made them suitable for model input; for example, experience ranges like "5-7 years" became minimum and maximum year values. Standardizing currencies to USD and handling outliers with industry-specific flags rather than removal preserved data while marking potential issues. The binary race columns let a respondent carry more than one identification, with one caveat visible in the output above: splitting on commas appears to have broken the single option "Hispanic, Latino, or Spanish origin" into three columns with identical counts (1,101 each), so the Multiple_Races flag (1,853 respondents, 6.6%) overcounts. Apart from that, these transformations maintained the dataset's richness while making it processable by ML algorithms.

Training a model on the uncleaned dataset would likely produce unreliable results. The messy data would introduce several problems: currency differences would distort salary comparisons, free-form text fields would create sparse feature spaces, and unhandled outliers could skew predictions. Geographic analysis would be particularly unreliable due to inconsistent location data. In contrast, the cleaned dataset enables more accurate modeling by providing normalized values, structured categories, and clear indicators for potential data quality issues.

While cleaning the data, I made conscious choices to limit the bias I introduced while making the data more usable. Instead of dropping records with missing values, I used contextual filling where appropriate (e.g., zero for the 7,315 missing bonuses) and created flags for uncertain data. For gender, I kept the survey's four options and folded the 171 missing values and the single "Prefer not to answer" entry into "Other or prefer not to answer". Handling outliers through industry-specific flags rather than removal ensures that unusual but potentially valid data points remain in the dataset. However, some bias may have been introduced: zero-filling treats a skipped bonus question as "no bonus", the currency conversion uses fixed rates and assumes USD for any currency outside the rate table, and job titles and industries were consolidated. I attempted to mitigate this by documenting all transformations and by keeping the original Annual_Salary and Bonus columns alongside Salary_USD and Bonus_USD. The original age, experience, and gender columns were replaced rather than kept.