<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Capstone Part 9:** Data Wrangling Lab


#### Student Author: Abigail Hedden

## Objectives


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using multiple imputation strategies.

- Apply feature scaling and transformation techniques.


## Set-up

In [61]:
# import required packages
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np

## Load the Dataset

In [51]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv')
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

## Explore the Dataset


#### Summarize the dataset by displaying the column data types, counts, and missing values.


In [52]:
# display data types of each column
print(df.dtypes)
print('')

# display counts of non-null values
print(df.count())
print('')

# display missing value counts
print(df.isnull().sum())

ResponseId               int64
MainBranch              object
Age                     object
Employment              object
RemoteWork              object
                        ...   
JobSatPoints_11        float64
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
JobSat                 float64
Length: 114, dtype: object

ResponseId             65437
MainBranch             65437
Age                    65437
Employment             65437
RemoteWork             54806
                       ...  
JobSatPoints_11        29445
SurveyLength           56182
SurveyEase             56238
ConvertedCompYearly    23435
JobSat                 29126
Length: 114, dtype: int64

ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    4

#### Generate basic statistics for numerical columns.


In [53]:
df.describe()

| | ResponseId | CompTotal | WorkExp | JobSatPoints_1 | JobSatPoints_4 | JobSatPoints_5 | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | ConvertedCompYearly | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65437.0 | 33740.0 | 29658.0 | 29324.0 | 29393.0 | 29411.0 | 29450.0 | 29448.0 | 29456.0 | 29456.0 | 29450.0 | 29445.0 | 23435.0 | 29126.0 |
| mean | 32719.0 | 2.963841e+145 | 11.466957 | 18.581094 | 7.52214 | 10.060857 | 24.343232 | 22.96522 | 20.278165 | 16.169432 | 10.955713 | 9.953948 | 86155.29 | 6.935041 |
| std | 18890.179119 | 5.444117e+147 | 9.168709 | 25.966221 | 18.422661 | 21.833836 | 27.08936 | 27.01774 | 26.10811 | 24.845032 | 22.906263 | 21.775652 | 186757.0 | 2.088259 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 16360.0 | 60000.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32712.0 | 6.0 |
| 50% | 32719.0 | 110000.0 | 9.0 | 10.0 | 0.0 | 0.0 | 20.0 | 15.0 | 10.0 | 5.0 | 0.0 | 0.0 | 65000.0 | 7.0 |
| 75% | 49078.0 | 250000.0 | 16.0 | 22.0 | 5.0 | 10.0 | 30.0 | 30.0 | 25.0 | 20.0 | 10.0 | 10.0 | 107971.5 | 8.0 |
| max | 65437.0 | 1e+150 | 50.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 16256600.0 | 10.0 |


## Identifying and Removing Inconsistencies


* Identify inconsistent or irrelevant entries in specific columns (e.g., Country)
* Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format

In [54]:
print('Number of unique countries BEFORE cleaning:', df['Country'].nunique())
print('')
print(df["Country"].unique())

# map long or inconsistent country names to their commonly used short forms
country_mapping = {
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Viet Nam": "Vietnam",
    "Russian Federation": "Russia",
    "Republic of Korea": "South Korea",
    "Democratic People's Republic of Korea": "North Korea",
    "Iran, Islamic Republic of...": "Iran",
    "Venezuela, Bolivarian Republic of...": "Venezuela",
    "Lao People's Democratic Republic": "Laos",
    "Libyan Arab Jamahiriya": "Libya",
    "Syrian Arab Republic": "Syria",
    "Republic of Moldova": "Moldova",
    "Democratic Republic of the Congo": "DR Congo",
    "Congo, Republic of the...": "Republic of the Congo",
    "Micronesia, Federated States of...": "Micronesia",
    "Nomadic": "Other"
}

# update df with country mapping
df["Country"] = df["Country"].replace(country_mapping)

# verify update
print('')
print('Number of unique countries AFTER cleaning:', df['Country'].nunique())
print('')
print(df["Country"].unique())

Number of unique countries BEFORE cleaning: 185

['United States of America'
 'United Kingdom of Great Britain and Northern Ireland' 'Canada' 'Norway'
 'Uzbekistan' 'Serbia' 'Poland' 'Philippines' 'Bulgaria' 'Switzerland'
 'India' 'Germany' 'Ireland' 'Italy' 'Ukraine' 'Australia' 'Brazil'
 'Japan' 'Austria' 'Iran, Islamic Republic of...' 'France' 'Saudi Arabia'
 'Romania' 'Turkey' 'Nepal' 'Algeria' 'Sweden' 'Netherlands' 'Croatia'
 'Pakistan' 'Czech Republic' 'Republic of North Macedonia' 'Finland'
 'Slovakia' 'Russian Federation' 'Greece' 'Israel' 'Belgium' 'Mexico'
 'United Republic of Tanzania' 'Hungary' 'Argentina' 'Portugal'
 'Sri Lanka' 'Latvia' 'China' 'Singapore' 'Lebanon' 'Spain' 'South Africa'
 'Lithuania' 'Viet Nam' 'Dominican Republic' 'Indonesia' 'Kosovo'
 'Morocco' 'Taiwan' 'Georgia' 'San Marino' 'Tunisia' 'Bangladesh'
 'Nigeria' 'Liechtenstein' 'Denmark' 'Ecuador' 'Malaysia' 'Albania'
 'Azerbaijan' 'Chile' 'Ghana' 'Peru' 'Bolivia' 'Egypt' 'Luxembourg'
 'Montenegro' 'Cypr
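
Because `replace` silently ignores mapping keys that never occur in the column, it can help to list which keys had no effect. The following is a minimal sketch on toy data, not part of the original lab; `countries`, `mapping`, and `unused_keys` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy stand-in for df["Country"]; the values are illustrative only
countries = pd.Series(["United States of America", "Viet Nam", "Canada"])

# Small hypothetical mapping; "Nomadic" does not occur in the toy data
mapping = {
    "United States of America": "United States",
    "Viet Nam": "Vietnam",
    "Nomadic": "Other",
}

# Keys that never appear in the column have no effect when passed to replace()
present = set(countries.dropna().unique())
unused_keys = sorted(key for key in mapping if key not in present)
print("Mapping keys not found in the data:", unused_keys)

# Apply the mapping; values without a key are left unchanged
cleaned = countries.replace(mapping)
print(cleaned.tolist())
```

Running the same check against the real column would show whether every entry in the mapping corresponds to a value that is actually present.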

## Encoding Categorical Variables


#### Encode the Employment column using one-hot encoding
One-hot encoding takes a categorical variable and creates a new binary (0 or 1) column for each unique category. For each row, the column that matches the row's category is set to 1 and all others are set to 0. The resulting columns are sometimes called dummy variables.


In [55]:
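# create one indicator column per unique Employment value
employment_dummies = pd.get_dummies(df["Employment"], prefix="Employment")
# append the indicator columns to the original DataFrame
df = pd.concat([df, employment_dummies], axis=1)
# optionally drop the original column once it is no longer needed
#df.drop(columns=["Employment"], inplace=True)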
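
Some columns in this survey hold several answers in one cell separated by `;` (the `CodingActivities` preview above shows this). On such a column, `pd.get_dummies` creates one indicator per distinct combination of answers. Assuming `Employment` is stored the same way, which this lab does not confirm, `Series.str.get_dummies(sep=";")` would instead give one indicator per individual option. A small sketch on toy data; the values and variable names are hypothetical:

```python
import pandas as pd

# Toy stand-in for a multi-select column; values are illustrative only
employment = pd.Series([
    "Employed, full-time",
    "Student, part-time",
    "Employed, full-time;Student, part-time",
])

# One column per distinct cell value (the combination counts as its own category)
combo_dummies = pd.get_dummies(employment, prefix="Employment")

# One column per individual option after splitting each cell on ";"
option_dummies = employment.str.get_dummies(sep=";").add_prefix("Employment_")

print("Columns from pd.get_dummies:    ", combo_dummies.shape[1])
print("Columns from str.get_dummies:   ", option_dummies.shape[1])
print(option_dummies.columns.tolist())
```

With the split version, the third respondent is flagged in both option columns rather than being placed in a separate combined category.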

## Handling Missing Values


#### Identify columns with the highest number of missing values


In [56]:
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])

AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
                              ...  
YearsCode                      5568
NEWSOSites                     5151
LearnCode                      4949
EdLevel                        4653
AISelect                       4530
Length: 109, dtype: int64
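
Raw counts are easier to compare when shown next to the share of rows they represent. A minimal sketch on a toy DataFrame, assuming the same `isnull()` approach as above; `toy` and `missing_summary` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy data with the same kind of gaps as the survey; values are illustrative only
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", None, "Hybrid", None],
    "ConvertedCompYearly": [50000.0, np.nan, np.nan, np.nan],
})

# Count of missing values and percentage of rows missing, per column
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns with at least one missing value, largest first
missing_summary = (
    missing_summary[missing_summary["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```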


#### Impute missing values in `ConvertedCompYearly` with the mean or median


In [57]:
# number of nulls in the column before imputing
null_conv_comp = df['ConvertedCompYearly'].isnull().sum()
print(f"Number of nulls in 'ConvertedCompYearly' before imputing: {null_conv_comp}")

# impute with mean
mean_conv_comp = df["ConvertedCompYearly"].mean()
print('Mean ConvertedCompYearly = ', mean_conv_comp)
# assign the result back instead of using inplace=True on a column selection
df["ConvertedCompYearly"] = df["ConvertedCompYearly"].fillna(mean_conv_comp)

# verify that there are no null values
null_conv_comp = df['ConvertedCompYearly'].isnull().sum()
print(f"Number of nulls in 'ConvertedCompYearly' after imputing: {null_conv_comp}")

Number of nulls in 'ConvertedCompYearly' before imputing: 42002
Mean ConvertedCompYearly =  86155.28726264134
Number of nulls in 'ConvertedCompYearly' after imputing: 0
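
The heading allows either the mean or the median. The `describe()` output shows a mean (86155.29) well above the median (65000.0) for this column, which suggests a right-skewed distribution where a few very large values pull the mean upward. A minimal sketch on toy data comparing the two fill values; the numbers and the `was_missing` flag are hypothetical and not part of the lab:

```python
import numpy as np
import pandas as pd

# Toy compensation values with one extreme entry; illustrative only
comp = pd.Series([30000.0, 65000.0, np.nan, 110000.0, 16000000.0, np.nan])

# Record which rows were imputed so they can be identified later
was_missing = comp.isnull()

# Fill with the mean and with the median for comparison
mean_filled = comp.fillna(comp.mean())
median_filled = comp.fillna(comp.median())

print("Mean used for imputation:  ", comp.mean())
print("Median used for imputation:", comp.median())
print("Rows imputed:", int(was_missing.sum()))
```

In the toy data the single extreme value moves the mean far more than the median, so the median fill stays closer to a typical entry.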


#### Impute missing values in `RemoteWork` with the most frequent value


In [58]:
# number of nulls in the column before imputing
null_remote = df['RemoteWork'].isnull().sum()
print(f"Number of nulls in 'RemoteWork' before imputing: {null_remote}")

# impute with the mode (most frequent value)
mode_remote = df["RemoteWork"].mode()[0]
print('Mode RemoteWork = ', mode_remote)
# assign the result back instead of using inplace=True on a column selection
df["RemoteWork"] = df["RemoteWork"].fillna(mode_remote)

# verify that there are no null values
null_remote = df['RemoteWork'].isnull().sum()
print(f"Number of nulls in 'RemoteWork' after imputing: {null_remote}")

Number of nulls in 'RemoteWork' before imputing: 10631
Mode RemoteWork =  Hybrid (some remote, some in-person)
Number of nulls in 'RemoteWork' after imputing: 0


## Feature Scaling and Transformation


#### Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column


In [60]:
# rescale values to the [0, 1] range; the double brackets pass a 2D input as the scaler expects
scaler = MinMaxScaler()
df["ConvertedCompYearly_Scaled"] = scaler.fit_transform(df[["ConvertedCompYearly"]])
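
With its default settings, Min-Max scaling maps each value `x` to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below checks that formula against `MinMaxScaler` on toy data; `toy`, `manual`, and `scaled` are hypothetical names used only here:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy compensation values; illustrative only
toy = pd.DataFrame({"ConvertedCompYearly": [1.0, 50000.0, 100000.0, 200000.0]})

# Manual Min-Max scaling: (x - min) / (max - min)
col = toy["ConvertedCompYearly"]
manual = (col - col.min()) / (col.max() - col.min())

# Same transformation with scikit-learn; ravel() flattens the (n, 1) result
scaled = MinMaxScaler().fit_transform(toy[["ConvertedCompYearly"]]).ravel()

print("Manual and scikit-learn results match:", np.allclose(manual, scaled))
print(manual.tolist())
```

Since the result depends only on the minimum and maximum, a single extreme value compresses all other scaled values toward 0.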

#### Log-transform the ConvertedCompYearly column to reduce skewness


In [62]:
# log1p computes log(1 + x), which avoids log(0) for zero values
df["ConvertedCompYearly_Log"] = np.log1p(df["ConvertedCompYearly"])
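
One way to see whether the transform reduced skewness is to compare `Series.skew()` before and after. A minimal sketch on toy data with one extreme value; the numbers are hypothetical and the result on the real column may differ:

```python
import numpy as np
import pandas as pd

# Toy right-skewed compensation values; illustrative only
comp = pd.Series([1000.0, 20000.0, 50000.0, 65000.0, 90000.0, 110000.0, 16000000.0])

# log1p(x) = log(1 + x), so a value of 0 maps to 0 instead of -inf
comp_log = np.log1p(comp)

# Sample skewness before and after the transform
raw_skew = comp.skew()
log_skew = comp_log.skew()

print("Skewness before log transform:", round(raw_skew, 3))
print("Skewness after log transform: ", round(log_skew, 3))
```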

## Feature Engineering


#### Create a new column `ExperienceLevel` based on the `YearsCodePro` column


In [64]:
# replace strings with numbers
df2 = df.copy()
df2["YearsCodePro"] = df2["YearsCodePro"].replace({
    "Less than 1 year": "0.5",
    "More than 50 years": "51"
})

# convert column to numeric
df2["YearsCodePro"] = pd.to_numeric(df2["YearsCodePro"], errors='coerce')

# create experience levels based on number of years
def categorize_experience(years):
    if pd.isnull(years):
        return np.nan
    elif years < 3:
        return "Beginner"
    elif 3 <= years <= 5:
        return "Intermediate"
    elif 5 < years <= 10:
        return "Advanced"
    else:
        return "Expert"

df2["ExperienceLevel"] = df2["YearsCodePro"].apply(categorize_experience)

df2["ExperienceLevel"]

0                 NaN
1              Expert
2              Expert
3                 NaN
4                 NaN
             ...     
65432    Intermediate
65433             NaN
65434    Intermediate
65435        Beginner
65436             NaN
Name: ExperienceLevel, Length: 65437, dtype: object
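
The same categories can also be assigned without a row-by-row `apply`, using boolean masks with the same boundaries as `categorize_experience`. This is an alternative sketch on toy data, not the approach used in the lab; `years` and `level` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy years of professional experience, including boundary values and a missing entry
years = pd.Series([0.5, 3, 5, 7, 10, 11, np.nan])

# Start with all-missing labels; comparisons with NaN are False, so NaN rows stay missing
level = pd.Series(np.nan, index=years.index, dtype="object")

# Same boundaries as categorize_experience above
level.loc[years < 3] = "Beginner"
level.loc[(years >= 3) & (years <= 5)] = "Intermediate"
level.loc[(years > 5) & (years <= 10)] = "Advanced"
level.loc[years > 10] = "Expert"

print(level.tolist())
```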

Copyright © IBM Corporation. All rights reserved.
