## Problem Statement 

We are working on an income prediction problem using the Adult Income Census dataset.
The goal is to accurately predict whether a person earns more than $50,000 a year.

#### About the Dataset
- **age:** Age of the individual. Continuous.
- **workclass:** Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- **fnlwgt:** Continuous.
- **education:** Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- **education-num:** Number of years spent in education. Continuous.
- **marital-status:** Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- **occupation:** Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- **relationship:** Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- **race:** White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- **sex:** Female, Male.
- **capital-gain:** Continuous.
- **capital-loss:** Continuous.
- **hours-per-week:** Continuous.
- **native-country:** United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- **salary:** >50K, <=50K. This is the target; it is stored in the `income` column of the CSV.
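Since the target has only two values, it can be turned into a 0/1 label before modelling. The snippet below is a minimal sketch of that step on a toy frame; the rows, the helper name `encode_income`, and the choice of `>50K` as the positive class are illustrative assumptions, not something taken from the project code.

```python
import pandas as pd

# Toy rows for illustration only (not copied from the dataset).
toy = pd.DataFrame({
    "age": [41, 52],
    "hours.per.week": [40, 50],
    "income": ["<=50K", ">50K"],
})

def encode_income(frame, target="income", positive=">50K"):
    """Return a 0/1 Series: 1 where the target equals the positive label."""
    return (frame[target].str.strip() == positive).astype(int)

y = encode_income(toy)
print(y.tolist())
```

The `str.strip()` call guards against stray whitespace around the labels, which some distributions of this dataset are known to contain.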

In [3]:
import pandas as pd
import numpy as np
import statistics as st
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
sns.set(rc={"figure.figsize":(15,6)})
pd.set_option("display.max_columns", None)

In [44]:
df = pd.read_csv(r"./data/adult.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [30]:
df.replace("?", np.nan,inplace=True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [32]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64

In [33]:
df.isnull().mean()*100

age               0.000000
workclass         5.638647
fnlwgt            0.000000
education         0.000000
education.num     0.000000
marital.status    0.000000
occupation        5.660146
relationship      0.000000
race              0.000000
sex               0.000000
capital.gain      0.000000
capital.loss      0.000000
hours.per.week    0.000000
native.country    1.790486
income            0.000000
dtype: float64

In [34]:
df['workclass'] = df['workclass'].fillna('MISSING')
df['workclass'].value_counts()

workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
MISSING              1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64

In [38]:
df['workclass'].value_counts(normalize=True)

workclass
Private             0.697030
Self-emp-not-inc    0.078038
Local-gov           0.064279
MISSING             0.056386
State-gov           0.039864
Self-emp-inc        0.034274
Federal-gov         0.029483
Without-pay         0.000430
Never-worked        0.000215
Name: proportion, dtype: float64
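Only `workclass` has been filled so far; `occupation` and `native.country` also contain missing values. One option is to apply the same `'MISSING'` label to all three columns at once. The sketch below assumes that treatment is acceptable for every categorical column; the helper `fill_missing_categories` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing_categories(frame, columns, label="MISSING"):
    """Return a copy of frame with NaN in the given columns replaced by label."""
    out = frame.copy()
    out[columns] = out[columns].fillna(label)
    return out

# Toy frame for illustration only.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan],
    "occupation": [np.nan, "Sales"],
    "native.country": ["United-States", np.nan],
})

filled = fill_missing_categories(toy, ["workclass", "occupation", "native.country"])
print(filled.isnull().sum())
```

Returning a copy keeps the original frame untouched, so the missing-value counts can still be inspected afterwards.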

In [39]:
df.select_dtypes(exclude=np.number)

Unnamed: 0,workclass,education,marital.status,occupation,relationship,race,sex,native.country,income
0,MISSING,HS-grad,Widowed,,Not-in-family,White,Female,United-States,<=50K
1,Private,HS-grad,Widowed,Exec-managerial,Not-in-family,White,Female,United-States,<=50K
2,MISSING,Some-college,Widowed,,Unmarried,Black,Female,United-States,<=50K
3,Private,7th-8th,Divorced,Machine-op-inspct,Unmarried,White,Female,United-States,<=50K
4,Private,Some-college,Separated,Prof-specialty,Own-child,White,Female,United-States,<=50K
...,...,...,...,...,...,...,...,...,...
32556,Private,Some-college,Never-married,Protective-serv,Not-in-family,White,Male,United-States,<=50K
32557,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States,<=50K
32558,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States,>50K
32559,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States,<=50K


In [40]:
df['income'].value_counts()

income
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [41]:
df['income'].value_counts(normalize=True)

income
<=50K    0.75919
>50K     0.24081
Name: proportion, dtype: float64
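The classes are imbalanced, roughly 76% to 24%. One common way to account for this is to weight each class inversely to its frequency, using the "balanced" heuristic `n_samples / (n_classes * class_count)`. The sketch below computes those weights from the counts printed above; whether the project actually uses class weights is not stated, so treat this as one possible approach.

```python
import pandas as pd

# Class counts taken from the value_counts() output above.
counts = pd.Series({"<=50K": 24720, ">50K": 7841})

# "Balanced" heuristic: n_samples / (n_classes * class_count).
weights = counts.sum() / (len(counts) * counts)
print(weights)
```

The minority class `>50K` receives the larger weight, so errors on it count for more during training.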

In [47]:
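def remove_outliers_IQR(col, df):
    """Cap values of df[col] that fall outside the 1.5*IQR fences (modifies df in place)."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)

    # Interquartile range and the usual 1.5*IQR fences
    iqr = q3 - q1

    lower_limit = q1 - 1.5*iqr
    upper_limit = q3 + 1.5*iqr

    # Despite the name, no rows are dropped: out-of-range values are clipped to the fences
    df.loc[(df[col]<lower_limit), col] = lower_limit
    df.loc[(df[col]>upper_limit), col] = upper_limit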
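The function above caps values at the IQR fences and changes the DataFrame in place. A non-mutating variant can be written with `Series.clip`, as in this minimal sketch; the name `cap_outliers_iqr`, the `k` parameter, and the toy values are assumptions for illustration.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Return a copy of series with values clipped to [q1 - k*iqr, q3 + k*iqr]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy values for illustration only: 100 lies above the upper fence.
values = pd.Series([1, 2, 3, 4, 5, 100])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Note that when a column is mostly one value, the IQR can be 0, in which case both fences collapse onto the quartiles and every other value is capped to them. It is worth checking the IQR of each column before applying this.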

In [45]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [4]:
df2 = pd.read_csv(r'D:\MLOps\salary-class-prediction\artifacts\train.csv')

In [5]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22775 entries, 0 to 22774
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             22775 non-null  int64
 1   workclass       22775 non-null  int64
 2   education_num   22775 non-null  int64
 3   marital_status  22775 non-null  int64
 4   occupation      22775 non-null  int64
 5   relationship    22775 non-null  int64
 6   race            22775 non-null  int64
 7   sex             22775 non-null  int64
 8   capital_gain    22775 non-null  int64
 9   capital_loss    22775 non-null  int64
 10  hours_per_week  22775 non-null  int64
 11  native_country  22775 non-null  int64
 12  income          22775 non-null  int64
dtypes: int64(13)
memory usage: 2.3 MB


In [54]:
df3 = pd.read_csv(r'D:\MLOps\salary-class-prediction\notebook\data\income_cleandata.csv')
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32537 entries, 0 to 32536
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32537 non-null  int64
 1   workclass       32537 non-null  int64
 2   education_num   32537 non-null  int64
 3   marital_status  32537 non-null  int64
 4   occupation      32537 non-null  int64
 5   relationship    32537 non-null  int64
 6   race            32537 non-null  int64
 7   sex             32537 non-null  int64
 8   capital_gain    32537 non-null  int64
 9   capital_loss    32537 non-null  int64
 10  hours_per_week  32537 non-null  int64
 11  native_country  32537 non-null  int64
 12  income          32537 non-null  int64
dtypes: int64(13)
memory usage: 3.2 MB


In [52]:
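# df2 uses underscore-separated column names, so the raw-file name
# 'education.num' raises KeyError here; use 'education_num' instead.
col = 'education_num'
remove_outliers_IQR(col, df2)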
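The raw CSV uses dot-separated column names (`education.num`), while the processed files use underscores (`education_num`), so a name that is valid for one frame can raise a `KeyError` on the other. Renaming the columns once, right after loading, avoids the mismatch. The helper below is a hypothetical sketch of that step, not the project's actual preprocessing code.

```python
import pandas as pd

def normalize_columns(frame):
    """Return a copy whose column names use underscores instead of dots or hyphens."""
    return frame.rename(columns=lambda name: name.replace(".", "_").replace("-", "_"))

# Empty frame with a few of the raw column names, for illustration.
raw = pd.DataFrame(columns=["age", "education.num", "marital.status", "hours.per.week"])
clean = normalize_columns(raw)
print(list(clean.columns))
```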

In [6]:
df2[['age', 'workclass', 'education_num', 'marital_status', 'occupation',
                        'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                        'hours_per_week', 'native_country']].head()

Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,48,5,10,2,2,0,0,1,7688,0,40,38
1,31,3,13,4,11,1,4,1,0,0,40,38
2,50,3,16,2,9,0,4,1,15024,0,50,38
3,22,3,9,2,11,0,4,1,0,0,45,38
4,28,3,9,4,3,1,4,0,0,0,45,38


In [7]:
target_columns = 'income'
drop_columns = [target_columns]
drop_columns

['income']

In [8]:
df2.drop(drop_columns, axis=1)

Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,48,5,10,2,2,0,0,1,7688,0,40,38
1,31,3,13,4,11,1,4,1,0,0,40,38
2,50,3,16,2,9,0,4,1,15024,0,50,38
3,22,3,9,2,11,0,4,1,0,0,45,38
4,28,3,9,4,3,1,4,0,0,0,45,38
...,...,...,...,...,...,...,...,...,...,...,...,...
22770,38,3,15,2,11,5,4,0,0,0,40,38
22771,63,3,13,6,0,1,4,0,0,0,25,38
22772,18,3,7,4,0,3,4,0,0,0,20,38
22773,34,3,11,2,2,0,4,1,0,0,50,38
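`DataFrame.drop` returns a new frame and leaves `df2` unchanged, so the result has to be assigned to be used later. A minimal sketch of splitting features from the target follows; the helper name `split_features_target` and the toy `income` values are illustrative assumptions.

```python
import pandas as pd

def split_features_target(frame, target="income"):
    """Return (X, y): all columns except target, and the target column."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return X, y

# Toy frame for illustration; the income values here are made up.
toy = pd.DataFrame({"age": [48, 31], "education_num": [10, 13], "income": [1, 0]})
X, y = split_features_target(toy)
print(list(X.columns), y.tolist())
```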
