# DAT210x - Programming with Python for DS

## Module2 - Lab5

Barry Becker extracted a reasonably clean subset of the 1994 U.S. Census database, with the goal of running predictions to determine whether a person makes over 50K a year. The dataset is hosted on the University of California, Irvine's Machine Learning Repository and includes features such as the person's age, occupation, and hours worked per week.

As clean as the data is, it still isn't quite ready for analysis by SciKit-Learn! Using what you've learned in this chapter, clean up the various columns by encoding them properly, following best practices, so that they're ready to be examined. We've included a modified subset of the dataset at Module2/Datasets/census.data and also have some starter code to get you going, located at Module2/Module2 - Lab5.ipynb.

Load up the dataset and set header label names to: ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification']

- Ensure you use the right command to do this, as there is more than one command! To verify you used the correct one, open the dataset in a text editor like SublimeText or Notepad, and double check your df.head() to ensure the first values match up.
- Make sure any value that needs to be replaced with a NaN is set as such. There are at least three ways to do this. One is much easier than the other two.
- Look through the dataset and ensure all of your columns have appropriate data types. Numeric columns should be float64 or int64, and textual columns should be object.
- Properly encode any ordinal features using the method discussed in the chapter.
- Properly encode any nominal features by exploding them out into new, separate, boolean features.
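
The first task above warns that there is more than one loading command. Here is a minimal sketch of the difference, using a tiny in-memory CSV whose rows are made up for illustration (the variable names `raw`, `wrong` and `right` are hypothetical). Without `header=None`, `read_csv` consumes the first data row as column labels, and assigning `df.columns` afterwards hides the fact that a row was lost.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the lab file:
# a leading index column followed by eight features, no header row.
raw = (
    "0,Bachelors,39,2174,White,0,40,Male,<=50K\n"
    "1,Bachelors,50,?,White,0,13,Male,<=50K\n"
    "2,HS-grad,38,?,White,0,40,Male,<=50K\n"
)
names = ['education', 'age', 'capital-gain', 'race',
         'capital-loss', 'hours-per-week', 'sex', 'classification']

# Default header handling: the first data row is consumed as column labels.
wrong = pd.read_csv(io.StringIO(raw), index_col=0, na_values='?')
wrong.columns = names   # relabels the columns, but the first row is already gone

# header=None keeps every line of the file as data.
right = pd.read_csv(io.StringIO(raw), header=None, index_col=0, na_values='?')
right.columns = names

print(len(wrong), len(right))
print(right.head())
```

Comparing `len()` of the two frames, or the first row of `df.head()` against the raw file, is a quick way to catch the dropped row.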

Import and alias Pandas:

In [2]:
import pandas as pd

As per usual, load up the specified dataset, setting appropriate header labels.

In [13]:
df = pd.read_csv(r'Datasets/census.data', header=None, index_col=0, na_values='?')
df.columns = ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification']
df.head(3)

,education,age,capital-gain,race,capital-loss,hours-per-week,sex,classification
0,Bachelors,39,2174.0,White,0,40,Male,<=50K
1,Bachelors,50,,White,0,13,Male,<=50K
2,HS-grad,38,,White,0,40,Male,<=50K


Excellent.

Now, use basic pandas commands to look through the dataset. Get a feel for it before proceeding!

Do the data types of each column reflect the values you see when you look through the data using a text editor or spreadsheet program? If you see `object` where you expect to see `int64` or `float64`, that is a good indicator that the column contains a string, a missing value, or an erroneous value.
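
To illustrate the point about `object` columns, here is a minimal sketch on a made-up series (the values are hypothetical, not taken from the lab data). A single stray string forces the whole column to `object`. Using `pd.to_numeric` with `errors='coerce'` is one way to find which entries refuse to parse as numbers.

```python
import pandas as pd

# Hypothetical column: numbers with one stray text marker mixed in.
s = pd.Series([2174, 0, '?', 14084])
print(s.dtype)   # the mixed content forces a generic dtype

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_num = pd.to_numeric(s, errors='coerce')

# Entries that were present but failed to parse are the rogue values.
rogue = s[as_num.isna() & s.notna()]
print(rogue.unique())
print(as_num.dtype)
```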

In [14]:
df.dtypes

education          object
age                 int64
capital-gain      float64
race               object
capital-loss        int64
hours-per-week      int64
sex                object
classification     object
dtype: object

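Try using `your_data_frame['your_column'].unique()` or, equivalently, `your_data_frame.your_column.unique()` to see the unique values of each column and identify the rogue values. The attribute form only works when the column name is a valid Python identifier, so a column such as `capital-gain` needs the bracket form.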

If you find any values that should be encoded as NaN, you can convert them with the `na_values` parameter when loading the dataframe, or alternatively use one of the other methods discussed in the reading.
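
The options mentioned above could look like the following sketch. The in-memory CSV is invented for illustration, and the column names `age` and `gain` are placeholders. It is an assumption that these are the same three methods the reading has in mind.

```python
import io
import numpy as np
import pandas as pd

raw = "age,gain\n39,2174\n50,?\n38,?\n"   # hypothetical sample data

# 1) Declare the marker while loading.
a = pd.read_csv(io.StringIO(raw), na_values='?')

# 2) Load as-is, then swap the marker for NaN and fix the dtype by hand.
b = pd.read_csv(io.StringIO(raw))
b['gain'] = b['gain'].replace('?', np.nan).astype(float)

# 3) Load as-is, then coerce anything non-numeric to NaN.
c = pd.read_csv(io.StringIO(raw))
c['gain'] = pd.to_numeric(c['gain'], errors='coerce')

print(a.dtypes)
print(a['gain'].isna().sum(), b['gain'].isna().sum(), c['gain'].isna().sum())
```

The first option needs no follow-up dtype fix, which is probably why the lab calls one way much easier than the others.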

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29536 entries, 0 to 29535
Data columns (total 8 columns):
education         29536 non-null object
age               29536 non-null int64
capital-gain      29532 non-null float64
race              29536 non-null object
capital-loss      29536 non-null int64
hours-per-week    29536 non-null int64
sex               29536 non-null object
classification    29536 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 2.0+ MB


In [15]:
df.describe()

,age,capital-gain,capital-loss,hours-per-week
count,29536.0,29532.0,29536.0,29536.0
mean,38.506094,928.454321,84.957408,40.243872
std,13.811739,6557.886804,397.10775,12.326211
min,17.0,0.0,0.0,1.0
25%,27.0,0.0,0.0,40.0
50%,37.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,45.0
max,90.0,99999.0,4356.0,99.0


Look through your data and identify any potential categorical features. Ensure you properly encode any ordinal and nominal types using the methods discussed in the chapter.

Be careful! Some features can be represented as either categorical or continuous (numerical). If you ever get confused, ask yourself which generally makes more sense: representing the feature with a continuous numeric type, or with a series of categories?

In [17]:
print(df.education.unique())
print(df.race.unique())
print(df.sex.unique())
print(df.classification.unique())

['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' '7th-8th'
 'Doctorate' '5th-6th' '10th' '1st-4th' 'Preschool' '12th']
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
['Male' 'Female']
['<=50K' '>50K']


In [31]:
df.education = df.education.astype(pd.api.types.CategoricalDtype(
    categories=['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th',
                'HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate'], ordered=True))
# The comma between the two labels matters: without it, Python joins the adjacent
# string literals into the single category '<=50K>50K' and every value becomes NaN.
df.classification = df.classification.astype(pd.api.types.CategoricalDtype(
    categories=['<=50K', '>50K'], ordered=True))
# race and sex are nominal, so they get unordered categories.
df.race = df.race.astype('category')
df.sex = df.sex.astype('category')

Lastly, print out your dataframe!

In [32]:
df.head()

,education,age,capital-gain,race,capital-loss,hours-per-week,sex,classification
0,Bachelors,39,2174.0,White,0,40,Male,<=50K
1,Bachelors,50,,White,0,13,Male,<=50K
2,HS-grad,38,,White,0,40,Male,<=50K
3,11th,53,,Black,0,40,Male,<=50K
4,Bachelors,28,0.0,Black,0,40,Female,<=50K
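
The categorical columns still display as text, while the task list asks for ordinal features to be encoded and nominal features to be exploded into boolean columns. One possible way to finish those two tasks is sketched below on a small made-up frame rather than the lab data (`toy` and `order` are hypothetical names). `.cat.codes` turns an ordered categorical into integers, and `pd.get_dummies` creates one indicator column per distinct value. It is an assumption that this matches the chapter's exact method.

```python
import pandas as pd

# Hypothetical miniature frame with one ordinal and one nominal feature.
toy = pd.DataFrame({
    'education': ['HS-grad', 'Masters', 'Bachelors', 'HS-grad'],
    'race': ['White', 'Black', 'White', 'Other'],
})

# Ordinal: declare the order, then replace each label with its integer rank.
order = ['HS-grad', 'Bachelors', 'Masters']
toy['education'] = toy['education'].astype(
    pd.api.types.CategoricalDtype(categories=order, ordered=True)).cat.codes

# Nominal: one indicator column per distinct value, with no implied order.
toy = pd.get_dummies(toy, columns=['race'])

print(toy)
```

Integer codes preserve the ranking of the ordinal feature, whereas the indicator columns avoid suggesting that one race value is "greater" than another.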
