# Decision Tree in Depth

## A. A Basic Decision Tree Model 

In [1]:
# import the libraries 
import pandas as pd 
import numpy as np
import warnings

warnings.filterwarnings('ignore')

In [2]:
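# let us read the census data; the raw file has no header row, so pass header=None
df_census = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)

# let us save this data to a csv file, without writing the placeholder integer column names
df_census.to_csv("../Data/census.csv", index=False, header=False)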

In [3]:
# let us look at the dataset
df_census = pd.read_csv('../Data/census.csv', header=None)
df_census.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


The output shows that the dataframe has no column headers: the columns are just numbered 0 to 14. The column names are available at: `https://archive.ics.uci.edu/dataset/20/census+income`

In [4]:
# assign the column names listed on the dataset page
df_census.columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income',
]
df_census.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
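
The header repair can also be folded into the read itself. The following is a minimal sketch, not the notebook's method: it uses a two-row inline sample (copied from the output above) in place of the real file, and passes `header=None`, `names`, and `skipinitialspace=True` to `pd.read_csv`. The raw file separates fields with a comma followed by a space, so without `skipinitialspace` the string values keep a leading blank. The names `COLUMNS`, `sample`, and `df_sample` are illustrative only.

```python
import io
import pandas as pd

COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income',
]

# Two rows in the raw file's ", "-separated layout, standing in for adult.data.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None: the file has no header row; names: supply the column names;
# skipinitialspace=True: drop the blank that follows each comma.
df_sample = pd.read_csv(sample, header=None, names=COLUMNS, skipinitialspace=True)
print(df_sample.dtypes)
print(repr(df_sample.loc[0, "workclass"]))
```

With the real data, `sample` would be replaced by the file path or URL used earlier in the notebook.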


| Variable Name   | Role     | Type         | Demographic       | Description                                                                                                              | Units | Missing Values |
|------------------|----------|--------------|-------------------|--------------------------------------------------------------------------------------------------------------------------|-------|----------------|
| age             | Feature  | Integer      | Age               | N/A                                                                                                                      |       | no             |
| workclass       | Feature  | Categorical  | Income            | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.                   |       | yes            |
| fnlwgt          | Feature  | Integer      |                   |                                                                                                                          |       | no             |
| education       | Feature  | Categorical  | Education Level   | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, etc.   |       | no             |
| education-num   | Feature  | Integer      | Education Level   |                                                                                                                          |       | no             |
| marital-status  | Feature  | Categorical  | Other             | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.               |       | no             |
| occupation      | Feature  | Categorical  | Other             | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, etc.               |       | yes            |
| relationship    | Feature  | Categorical  | Other             | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.                                                     |       | no             |
| race            | Feature  | Categorical  | Race              | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.                                                             |       | no             |
| sex             | Feature  | Binary       | Sex               | Female, Male.                                                                                                            |       | no             |
| capital-gain    | Feature  | Integer      |                   |                                                                                                                          |       | no             |
| capital-loss    | Feature  | Integer      |                   |                                                                                                                          |       | no             |
| hours-per-week  | Feature  | Integer      |                   |                                                                                                                          |       | no             |
| native-country  | Feature  | Categorical  | Other             | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), etc.                         |       | yes            |
| income          | Target   | Binary       | Income            | >50K, <=50K.                                                                                                             |       | no             |


In [5]:
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
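
Every column reports 32561 non-null values, even though the variable table marks `workclass`, `occupation`, and `native-country` as having missing values. The dataset encodes missing entries as the string `?` (it shows up later as the `workclass_ ?` dummy column), which pandas treats as ordinary text. Below is a small sketch of how those placeholders could be counted. It uses a made-up three-row frame (`toy`) rather than the real data, and assumes the strings still carry the leading space from the raw file.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the leading-space strings in the raw file.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "age": [25, 40, 31],
})

# Pick the non-numeric columns, strip whitespace, and count cells equal to "?".
text_cols = toy.select_dtypes(exclude="number").columns
placeholder_counts = toy[text_cols].apply(lambda s: s.str.strip().eq("?").sum())
print(placeholder_counts)
```

Applied to `df_census`, the same expression would show how many rows per column are really missing.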


In [11]:
# get the columns with the dtype object 
object_col = df_census.select_dtypes(include='object').columns
object_col

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country', 'income'],
      dtype='object')

In [12]:
# let us now see the number of unique values in each object column
df_census[object_col].nunique()

workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    42
income             2
dtype: int64

First, let us drop the `education` column from the dataset, since `education-num` already carries the same information as an integer code.

In [13]:
df_census = df_census.drop(['education'], axis=1)
df_census.head(5)

|   | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
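
Dropping `education` is only lossless if each label corresponds to exactly one `education-num` value and vice versa. Here is a hedged sketch of that check on a hypothetical five-row frame built from values shown earlier. On the real data it would have to run before the drop.

```python
import pandas as pd

# Toy frame (illustrative) with the two education columns only.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "education-num": [13, 13, 9, 7, 13],
})

# Number of distinct codes per label, and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one when every group contains exactly one counterpart.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```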


Next, let us convert all the columns with object dtype to numerical columns. This is done with the `pd.get_dummies` function, which one-hot encodes each category into its own 0/1 column.


In [14]:
df_census = pd.get_dummies(df_census,
                           dtype=int)
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 94 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   age                                         32561 non-null  int64
 1   fnlwgt                                      32561 non-null  int64
 2   education-num                               32561 non-null  int64
 3   capital-gain                                32561 non-null  int64
 4   capital-loss                                32561 non-null  int64
 5   hours-per-week                              32561 non-null  int64
 6   workclass_ ?                                32561 non-null  int64
 7   workclass_ Federal-gov                      32561 non-null  int64
 8   workclass_ Local-gov                        32561 non-null  int64
 9   workclass_ Never-worked                     32561 non-null  int64
 10  workclass_ Private                

In [15]:
df_census.head(5)

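|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | ... | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia | income_ <=50K | income_ >50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |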
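
Because `get_dummies` was called without a `columns` argument, it also encoded the target, producing `income_ <=50K` and `income_ >50K`, two columns that are exact complements of each other. One possible alternative, sketched here on a hypothetical toy frame and not what the notebook does, is to pull the target out as a single 0/1 column first and encode only the features. The names `toy`, `X`, and `y` are illustrative.

```python
import pandas as pd

# Toy frame (made-up rows) with the leading-space strings found in the raw file.
toy = pd.DataFrame({
    "age": [39, 50, 28, 45],
    "sex": [" Male", " Male", " Female", " Female"],
    "income": [" <=50K", " <=50K", " <=50K", " >50K"],
})

# Single binary target: 1 when income is ">50K", else 0.
y = toy["income"].str.strip().eq(">50K").astype(int)

# One-hot encode the remaining feature columns only.
X = pd.get_dummies(toy.drop(columns="income"), dtype=int)
print(X.columns.tolist())
print(y.tolist())
```

Keeping the target separate means that neither income column can end up among the model's input features by accident.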


Let us now store this data as `cleaned_census.csv`

In [None]:
# save next to the raw census.csv written earlier
df_census.to_csv("../Data/cleaned_census.csv", index=False)