## Lab Assignment 5

Barry Becker extracted a reasonably clean subset of the 1994 U.S. Census database, with the goal of predicting whether a person makes over 50K a year. The dataset is hosted on the University of California, Irvine's Machine Learning Repository and includes features such as the person's age, occupation, and hours worked per week.

As clean as the data is, it still isn't quite ready for analysis by SciKit-Learn! Using what you've learned in this chapter, clean up the various columns by encoding them properly, following best practices, so that they're ready to be examined.

- Load up the dataset and set header label names to: ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification']
 
- Ensure you use the right command to do this, as there is more than one command! To verify you used the correct one, open the dataset in a text editor like SublimeText or Notepad, and double check your df.head() to ensure the first values match up.

- Make sure any value that needs to be replaced with a NaN is set as such. There are at least three ways to do this. One is much easier than the other two.

- Look through the dataset and ensure all of your columns have appropriate data types. Numeric columns should be float64 or int64, and textual columns should be object.

- Properly encode any ordinal features using the method discussed in the chapter.

- Properly encode any nominal features by exploding them out into new, separate, boolean features.

In [1]:
import pandas as pd

In [12]:
df = pd.read_csv('census.txt', sep=',', names = ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification'])
df.head(20)

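| | education | age | capital-gain | race | capital-loss | hours-per-week | sex | classification |
|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 39 | 2174 | White | 0 | 40 | Male | <=50K |
| 1 | Bachelors | 50 | ? | White | 0 | 13 | Male | <=50K |
| 2 | HS-grad | 38 | ? | White | 0 | 40 | Male | <=50K |
| 3 | 11th | 53 | ? | Black | 0 | 40 | Male | <=50K |
| 4 | Bachelors | 28 | 0 | Black | 0 | 40 | Female | <=50K |
| 5 | Masters | 37 | 0 | White | 0 | 40 | Female | <=50K |
| 6 | 9th | 49 | 0 | Black | 0 | 16 | Female | <=50K |
| 7 | HS-grad | 52 | 0 | White | 0 | 45 | Male | >50K |
| 8 | Masters | 31 | 14084 | White | 0 | 50 | Female | >50K |
| 9 | Bachelors | 42 | 5178 | White | 0 | 40 | Male | >50K |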


Do the data types of each column reflect the values you see when you look through the data in a text editor or spreadsheet program? If you see object where you expect to see int64 or float64, that is a good indicator that the column contains a string, a missing value, or an erroneous value.
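One way to track down what is forcing a column to object is to attempt a numeric conversion and look at the entries that fail. The snippet below is a minimal sketch on a small made-up Series (`sample` is hypothetical and only mimics the capital-gain column), not the lab's required approach.

```python
import pandas as pd

# Hypothetical sample that mimics a numeric column polluted by a placeholder.
sample = pd.Series(['2174', '?', '0', '14084'], name='capital-gain')

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(sample, errors='coerce')

# Keep the original strings at the positions where parsing failed.
rogue_values = sample[as_numbers.isna()].unique()
print(rogue_values)
```

Whatever shows up in `rogue_values` is a candidate for the list of strings that should be treated as missing.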

In [6]:
df.dtypes

education         object
age                int64
capital-gain      object
race              object
capital-loss       int64
hours-per-week     int64
sex               object
classification    object
dtype: object

Try using your_data_frame['your_column'].unique() or, equivalently, your_data_frame.your_column.unique() to see the unique values of each column and identify the rogue values.

If you find any values that should be encoded as NaN, you can convert them using the na_values parameter when loading the dataframe. Alternatively, use one of the other methods discussed in the reading.
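For comparison with na_values, here is a rough sketch of two after-the-fact conversions on an already loaded column. The frame `df_demo` is made up for illustration, and these are not necessarily the exact methods the reading has in mind.

```python
import pandas as pd

# Hypothetical frame where '?' marks a missing capital-gain value.
df_demo = pd.DataFrame({'capital-gain': ['2174', '?', '0'],
                        'age': [39, 50, 38]})
col = df_demo['capital-gain']

# Option A: blank out the placeholder explicitly, then parse the rest.
masked = pd.to_numeric(col.mask(col == '?'))

# Option B: let to_numeric coerce every unparseable entry to NaN.
coerced = pd.to_numeric(col, errors='coerce')

print(masked.dtype, coerced.dtype)
```

Option A only touches the placeholder you name, so any other bad string raises an error; Option B silently converts every bad string, which is convenient but can hide surprises.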

In [7]:
df['capital-gain'].unique() 

array(['2174', '?', '0', '14084', '5178', '5013', '2407', '14344',
       '15024', '7688', '34095', '4064', '4386', '7298', '1409', '3674',
       '1055', '3464', '2050', '2176', '594', '6849', '4101', '1111',
       '3411', '2597', '25236', '4650', '9386', '2463', '3103', '10605',
       '2964', '3325', '2580', '3471', '4865', '6514', '1471', '2329',
       '99999', '20051', '2105', '2885', '25124', '10520', '2202', '2961',
       '27828', '6767', '8614', '2228', '1506', '13550', '2635', '5556',
       '4787', '3781', '3137', '3818', '3942', '914', '401', '2829',
       '2977', '4934', '2062', '15020', '1424', '3273', '22040', '4416',
       '10566', '991', '4931', '1086', '7430', '6497', '114', '7896',
       '2346', '3432', '2907', '1151', '2414', '2290', '3418', '15831',
       '41310', '4508', '5455', '2538', '3456', '3908', '1848', '3887',
       '5721', '9562', '6418', '1455', '2036', '1831', '11678', '2936',
       '2993', '7443', '6360', '2354', '1797', '1173', '4687', '2009',

In [30]:
df = pd.read_csv('census.txt', sep=',', names = ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification'],
                na_values = '?')
df['capital-gain'].unique() 
df['education'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       '7th-8th', 'Doctorate', '5th-6th', '10th', '1st-4th', 'Preschool',
       '12th'], dtype=object)

Look through your data and identify any potential categorical features. Ensure you properly encode any ordinal and nominal types using the methods discussed in the chapter.

Be careful! Some features can be represented as either categorical or continuous (numerical). If you ever get confused, think to yourself what makes more sense generally---to represent such features with a continuous numeric type... or a series of categories?
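For an ordinal feature, the order of the categories carries meaning, so the integer codes should follow that order. The sketch below uses `CategoricalDtype` on a shortened, made-up list of education levels (`levels` and `edu` are hypothetical); it is one way to do this on current pandas, where the dtype object carries the categories and the ordering.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shortened, hypothetical subset of levels, listed from lowest to highest.
levels = ['Preschool', 'HS-grad', 'Bachelors', 'Masters']
edu = pd.Series(['Bachelors', 'HS-grad', 'Masters', 'Preschool'])

# An ordered categorical dtype fixes both the categories and their ranking.
edu_type = CategoricalDtype(categories=levels, ordered=True)

# .cat.codes maps each value to its position in `levels`.
codes = edu.astype(edu_type).cat.codes
print(codes.tolist())
```

Values that are not in the category list, including NaN, receive the code -1, so it is worth checking for -1 after encoding.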

In [34]:
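# sex column
# 0 = Male, 1 = Female
category_sex = ['Male', 'Female']
# Newer pandas versions take categories/ordered via a CategoricalDtype rather than astype keywords.
df.sex = df.sex.astype(pd.CategoricalDtype(categories=category_sex, ordered=True)).cat.codes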



In [35]:
# Classification column:
# 0 = <=50K, 1 = >50K
category_classification = ['<=50K', '>50K']
df.classification = df.classification.astype(pd.CategoricalDtype(categories=category_classification, ordered=True)).cat.codes



In [26]:
df.groupby(['education'])['education'].count()

education
10th              933
11th             1175
12th              433
1st-4th           168
5th-6th           333
7th-8th           646
9th               514
Bachelors        5355
Doctorate         413
HS-grad         10501
Masters          1723
Preschool          51
Some-college     7291
Name: education, dtype: int64

In [36]:
# Education levels listed from lowest to highest, so the codes preserve the ordering.
category_education = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate']
df.education = df.education.astype(pd.CategoricalDtype(categories=category_education, ordered=True)).cat.codes

  


In [37]:
df.head(20)

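| | education | age | capital-gain | race | capital-loss | hours-per-week | sex | classification |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 39 | 2174.0 | White | 0 | 40 | 0 | 0 |
| 1 | 10 | 50 | NaN | White | 0 | 13 | 0 | 0 |
| 2 | 8 | 38 | NaN | White | 0 | 40 | 0 | 0 |
| 3 | 6 | 53 | NaN | Black | 0 | 40 | 0 | 0 |
| 4 | 10 | 28 | 0.0 | Black | 0 | 40 | 1 | 0 |
| 5 | 11 | 37 | 0.0 | White | 0 | 40 | 1 | 0 |
| 6 | 4 | 49 | 0.0 | Black | 0 | 16 | 1 | 0 |
| 7 | 8 | 52 | 0.0 | White | 0 | 45 | 0 | 1 |
| 8 | 11 | 31 | 14084.0 | White | 0 | 50 | 1 | 1 |
| 9 | 10 | 42 | 5178.0 | White | 0 | 40 | 0 | 1 |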
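In the output above, race is still stored as text. Since it is a nominal feature, the remaining step from the assignment would be to explode it into separate boolean columns. A minimal sketch with `pd.get_dummies`, using a small made-up frame (`demo` is hypothetical) rather than the real dataset:

```python
import pandas as pd

# Hypothetical frame with one nominal column and one numeric column.
demo = pd.DataFrame({'race': ['White', 'Black', 'White'],
                     'age': [39, 53, 28]})

# Each distinct value of 'race' becomes its own indicator column;
# the original 'race' column is dropped and other columns are kept.
encoded = pd.get_dummies(demo, columns=['race'])
print(encoded.columns.tolist())
```

On the real dataframe, the same call would presumably be `pd.get_dummies(df, columns=['race'])`, producing one indicator column per race value found in the data.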
