## EDA: Census dataset / "Adult dataset"

A census is an official count or survey, especially of a population. This dataset was extracted from the 1994 US census database (https://archive.ics.uci.edu/ml/datasets/census+income).

Goal: Determine whether a person makes over 50K a year.

In [99]:
import pandas as pd
import numpy as np

# Fields are separated by a comma followed by whitespace; a regex separator requires the Python engine
df = pd.read_csv('data/census.csv', sep=r'\s*,\s+', engine='python')
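
To illustrate what the separator pattern does, here is a minimal sketch on a made-up three-line sample (the inline data is an assumption, not the real file). It also tries `skipinitialspace=True`, a standard `read_csv` option that may give the same result with the faster default parser, provided the only stray whitespace follows the commas.

```python
import io
import pandas as pd

# Made-up sample that mimics the "comma + space" layout of the file
sample = "age, workclass, salary\n36, Private, <=50K\n35, ?, >50K\n"

# Regex separator, as used in the notebook (needs the Python engine)
df_regex = pd.read_csv(io.StringIO(sample), sep=r'\s*,\s+', engine='python')

# Plain comma separator, dropping the whitespace that follows each comma
df_fast = pd.read_csv(io.StringIO(sample), skipinitialspace=True)

print(df_regex.equals(df_fast))
```

If the real file also has whitespace before commas or at line ends, the two approaches could differ, so the comparison is worth running on the actual data.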

In [100]:
df.sample(2)

| | age | workclass | fnlgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24198 | 36 | Private | 116608 | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 1919 | 35 | Private | 225330 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |


In [101]:
print(len(df))

32561


### Checking column names

In [102]:
df.columns

Index(['age', 'workclass', 'fnlgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [103]:
df = df.rename(columns={'fnlgt':'final_weight'})
# final_weight is a sampling weight: like the person weights in surveys such as SIPP,
# it estimates the number of people in the target population that each record represents.

column_names = [col_name.replace('-','_') for col_name in df.columns]
df.columns = column_names
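
The same renaming can also be written with the vectorised string methods on the column index. A minimal sketch, assuming a toy frame that only borrows a few of the column names:

```python
import pandas as pd

# Toy frame with a few of the original column names (no data needed)
toy = pd.DataFrame(columns=['age', 'education-num', 'hours-per-week'])

# Replace hyphens with underscores in every column name at once
toy.columns = toy.columns.str.replace('-', '_', regex=False)

print(list(toy.columns))
```

Underscored names are valid Python identifiers, so columns can then be accessed as attributes (for example `toy.education_num`).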

### Checking column types

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   final_weight    32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


No null values are reported. The data contains both categorical features (dtype object) and numerical features (dtype int64), and the inferred data types look appropriate.
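
The split into numerical and categorical features can also be derived from the dtypes instead of being typed by hand. A minimal sketch on a made-up toy frame (the names `num_cols` and `cat_cols` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy frame with made-up rows, mixing integer and string columns
toy = pd.DataFrame({
    'age': [36, 35],
    'education': ['HS-grad', 'Bachelors'],
    'hours_per_week': [40, 40],
    'salary': ['<=50K', '<=50K'],
})

# Numeric columns, and everything else (treated here as categorical)
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(num_cols)
print(cat_cols)
```

Note that a dtype-based split treats integer-coded categories such as education_num as numerical, so the result may still need a manual review.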

Checking the categorical features:

In [105]:
cat_columns = ['sex','workclass','education','marital_status','occupation','relationship','race', 'salary']

for col in cat_columns:
    print(col+':')
    print(df[col].unique())


sex:
['Male' 'Female']
workclass:
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']
education:
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
marital_status:
['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']
occupation:
['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']
relationship:
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
race:
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
salary:
['<=50K' '>50K']


The columns workclass and occupation contain '?', which acts as a placeholder for missing values.
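
Rather than reading the printed lists by eye, the placeholder can be counted per column. The sketch below uses a made-up toy frame; on the real data the same check could also cover columns that were not printed above, such as native_country.

```python
import pandas as pd

# Toy frame with made-up rows; '?' marks a missing value
toy = pd.DataFrame({
    'age': [36, 35, 50],
    'workclass': ['Private', '?', 'State-gov'],
    'occupation': ['Sales', '?', '?'],
    'race': ['White', 'Black', 'Other'],
})

# Count '?' entries in every non-numeric column
text_cols = toy.select_dtypes(exclude='number').columns
placeholder_counts = (toy[text_cols] == '?').sum()

# Show only the columns that actually contain the placeholder
print(placeholder_counts[placeholder_counts > 0])
```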

In [106]:

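# Encode the target as binary: 1 for '>50K', 0 for '<=50K'
df['salary'] = df['salary'].map({'>50K': 1, '<=50K': 0})
# Treat the '?' placeholder as a missing value
df['workclass'] = df['workclass'].replace('?', np.nan)
df['occupation'] = df['occupation'].replace('?', np.nan)
# Rows with missing values are kept for now; uncomment to drop them
#df = df.dropna()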
print(df.isna().sum())
print(len(df))

age                  0
workclass         1836
final_weight         0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country       0
salary               0
dtype: int64
32561
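
As an alternative to replacing '?' column by column, the placeholder could be declared at load time through the `na_values` parameter of `read_csv`, which applies to every column. A minimal sketch on a made-up inline sample, which also checks that the salary mapping left no unmapped labels:

```python
import io
import pandas as pd

# Made-up sample that mimics the layout of the file
sample = "age, workclass, salary\n36, Private, <=50K\n35, ?, >50K\n"

# Same separator as in the notebook, with '?' parsed as NaN in all columns
toy = pd.read_csv(io.StringIO(sample), sep=r'\s*,\s+', engine='python',
                  na_values='?')

# map() yields NaN for any label missing from the dictionary,
# so a notna() check catches unexpected spellings of the target
toy['salary'] = toy['salary'].map({'>50K': 1, '<=50K': 0})
assert toy['salary'].notna().all()

print(toy.isna().sum())
```

On the real data this would also convert '?' in columns that the notebook does not replace explicitly, so the missing-value counts could differ from the ones shown above.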
