# The dataset
source [Salary Prediction Classification](https://www.kaggle.com/datasets/ayessa/salary-prediction-classification)

> ### Columns are:
>   * age: continuous.
>   * workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
>   * fnlwgt: continuous.
>   * education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
>   * education-num: continuous.
>   * marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
>   * occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
>   * relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
>   * race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
>   * sex: Female, Male.
>   * capital-gain: continuous.
>   * capital-loss: continuous.
>   * hours-per-week: continuous.
>   * native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
>   * salary: <=50K or >50K

In [87]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Set default Seaborn style
sns.set()

# Prevent pandas from wrapping colums to the next line
pd.set_option('expand_frame_repr', False)

# Increase the maximum number of rows shown
pd.set_option('display.max_rows', 1000)

pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)

plt.rcParams['figure.figsize'] = [20, 10]

print("All modules loaded successfully")

All modules loaded successfully


In [88]:
salary_df = pd.read_csv("data/salary.csv")
print(salary_df.describe(), "\n")
print(salary_df.info(), "\n")

salary_df.head()

                age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000    32561.000000
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830       40.437456
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219       12.347429
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000        1.000000
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000       40.000000
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000       40.000000
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000       45.000000
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000       99.000000 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ---

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Some data cleaning

In [89]:
all_categorical_cols = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country', 'salary']

for column in all_categorical_cols:
    # Remove leading & trailing spaces in the categorical columns
    salary_df[column] =  salary_df[column].str.strip()
    unique_vals = salary_df[column].unique()
    # Remove Unknown values with a question mark
    print(f"column: {column}, unique values: {unique_vals} \n")
    salary_df = salary_df.loc[~(salary_df[column] == '?')].reset_index(drop=True)
    
print(salary_df.info())

column: workclass, unique values: ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked'] 

column: education, unique values: ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 'Preschool' '12th' '1st-4th'] 

column: marital-status, unique values: ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed'] 

column: occupation, unique values: ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' 'Protective-serv'
 'Armed-Forces' 'Priv-house-serv' '?'] 

column: relationship, unique values: ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative'] 

column: race, unique values: ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 

### My stratergy based on the observed data
  1. Separate the predictor variable `(X)` from the outcome variable `(y)`
  2. Separate the data into training & test sets.
  3. Create our pipeline
  4. Fit the pipeline to the training data (trains the model).
  5. Evaluate the model with the test set.

In [90]:
# set up the X and y variables for modeling
X = salary_df.drop(columns='salary')
y = salary_df.salary == ">50K"

In [91]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


# 2. Split data into training & test sets

In [92]:
# 80% of the observations will be in the training set
# 20% of the observations will be in the testing set
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.2, random_state=42)

# 3. Create our pipeline
Here, we will create our pipeline. A pipeline is a way to `automate the machine learning workflow` by allowing preprocessing of the data and instantiation of the estimator to occur in a single piece of code. We can easily create a pipeline in Python using sklearn’s `make_pipeline` function.

In [95]:
numeric_columns = [
    'age', 'fnlwgt', 'education-num', 'capital-gain', 
    'capital-loss', 'hours-per-week']

categorical_columns = [
    'workclass', 'education', 'marital-status', 'occupation', 
    'relationship', 'race', 'sex', 'native-country']

# Preprocessing using make_column_transformer
transformer = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (OneHotEncoder(handle_unknown='ignore'), categorical_columns)
)


# pipeline with preprocessing transformers first & estimator last
pipeline = make_pipeline(transformer, KNeighborsClassifier())

# 4. Fit the pipeline to the training data (trains the model).

In [96]:
# fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# 5. Evaluate the model with the test set.

In [97]:
# score the knn model on the testing data
score =  pipeline.score(X_test, y_test)
score_pct = round(score * 100, 2)
print(f"Model score = {score_pct}%")

Model score = 82.23%
