# DAT210x - Programming with Python for DS

## Module2 - Lab5

Import and alias pandas and NumPy:

In [2]:
import pandas as pd
import numpy as np

As per usual, load up the specified dataset, setting appropriate header labels.

In [8]:
headers = ['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification']

df = pd.read_csv('Datasets/census.data', names=headers)

df[df == 0] = np.nan

print(df)


          education  age capital-gain                race  capital-loss  \
0         Bachelors   39         2174               White           NaN   
1         Bachelors   50            ?               White           NaN   
2           HS-grad   38            ?               White           NaN   
3              11th   53            ?               Black           NaN   
4         Bachelors   28            0               Black           NaN   
5           Masters   37            0               White           NaN   
6               9th   49            0               Black           NaN   
7           HS-grad   52            0               White           NaN   
8           Masters   31        14084               White           NaN   
9         Bachelors   42         5178               White           NaN   
10     Some-college   37            0               Black           NaN   
11        Bachelors   30            0  Asian-Pac-Islander           NaN   
12        Bachelors   23 

Excellent.

Now, use basic pandas commands to look through the dataset. Get a feel for it before proceeding!

Do the data types of each column reflect the values you see when you look through the data in a text editor or spreadsheet program? If you see `object` where you expect an integer or float type such as `int64` or `float64`, that is a good indicator that the column contains a string, a missing value, or an erroneous value.

In [55]:
print(df.dtypes)


education         object
age                int64
capital-gain      object
race              object
capital-loss       int64
hours-per-week     int64
sex               object
classification    object
dtype: object
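
To pin down which entries force a column such as `capital-gain` to `object`, one option is `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into NaN. The snippet below is a minimal sketch on a small made-up Series; the values are illustrative and are not read from `census.data`.

```python
import pandas as pd

# Hypothetical sample: a numeric column polluted by a '?' placeholder.
# dtype=object mirrors what the notebook reports for capital-gain.
sample = pd.Series(['2174', '?', '0', '14084'], name='capital-gain', dtype=object)

# Values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(sample, errors='coerce')

# Entries that were present but failed to parse are the rogue values
rogue_values = sample[as_numbers.isna() & sample.notna()].unique()

print(as_numbers.dtype)
print(list(rogue_values))
```

Comparing the original column against its coerced version separates genuine placeholders from values that were already missing.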


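Try using `your_data_frame['your_column'].unique()` or, equivalently, `your_data_frame.your_column.unique()` to see the unique values of each column and identify the rogue values. Note that attribute access only works for column names that are valid Python identifiers, so a column like `capital-gain` needs the bracket form.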

If you find any values that should be encoded as NaN, you can convert them either with the `na_values` parameter when loading the dataframe or with one of the other methods discussed in the reading.
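
As a rough illustration of the `na_values` route, the sketch below parses a tiny in-memory CSV instead of the real file. The three rows and the reduced column list are made up for the example; only the `'?'` placeholder is taken from the dataset shown above.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for Datasets/census.data
raw = io.StringIO(
    "Bachelors,39,2174\n"
    "Bachelors,50,?\n"
    "HS-grad,38,0\n"
)

# Treat the '?' placeholder as missing while parsing
sample = pd.read_csv(
    raw,
    names=['education', 'age', 'capital-gain'],
    na_values=['?'],
)

print(sample['capital-gain'].isna().sum())
print(sample.dtypes)
```

With the placeholder handled at load time, the column parses as a numeric type rather than `object`, and a legitimate `0` stays a number.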

In [56]:
for i in headers:
    print(i, df.loc[:, i].unique())


education ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' '7th-8th'
 'Doctorate' '5th-6th' '10th' '1st-4th' 'Preschool' '12th']
age [39 50 38 53 28 37 49 52 31 42 30 23 34 25 32 43 40 54 35 59 56 19 20 45 22
 48 21 24 57 44 18 47 46 41 29 36 79 27 67 33 76 17 55 61 70 64 71 68 51 58
 26 60 90 66 65 77 62 63 80 72 74 69 73 81 78 75 82 83 84 85 88 86 87]
capital-gain ['2174' '?' '0' '14084' '5178' '5013' '2407' '14344' '15024' '7688' '34095'
 '4064' '4386' '7298' '1409' '3674' '1055' '3464' '2050' '2176' '594'
 '6849' '4101' '1111' '3411' '2597' '25236' '4650' '9386' '2463' '3103'
 '10605' '2964' '3325' '2580' '3471' '4865' '6514' '1471' '2329' '99999'
 '20051' '2105' '2885' '25124' '10520' '2202' '2961' '27828' '6767' '8614'
 '2228' '1506' '13550' '2635' '5556' '4787' '3781' '3137' '3818' '3942'
 '914' '401' '2829' '2977' '4934' '2062' '15020' '1424' '3273' '22040'
 '4416' '10566' '991' '4931' '1086' '7430' '6497' '114' '7896' '2346'
 '3432' '2907' '1151' '2414' '2290' '341

Look through your data and identify any potential categorical features. Ensure you properly encode any ordinal and nominal types using the methods discussed in the chapter.

Be careful! Some features can be represented as either categorical or continuous (numerical). If you are unsure, ask yourself which makes more sense in general: representing the feature with a continuous numeric type, or with a series of categories?

In [57]:
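# Ordinal feature: list the education levels from lowest to highest
education_ordered = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate']
# Recent pandas versions no longer accept categories/ordered in astype(); use a CategoricalDtype instead
education_dtype = pd.CategoricalDtype(categories=education_ordered, ordered=True)
df.education = df.education.astype(education_dtype).cat.codes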

df = pd.get_dummies(df, columns=['race'])
print(df.sex.unique())

df = pd.get_dummies(df, columns=['sex'])
print(df.classification.unique())

df = pd.get_dummies(df, columns=['classification'])



['Male' 'Female']
['<=50K' '>50K']
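
When working with ordinal codes like the ones produced for `education`, it can help to check how labels map to integers and how missing entries are encoded. The following is a minimal sketch on made-up values; the shortened level list is an assumption for illustration, not the full list used in the lab.

```python
import pandas as pd

# Hypothetical, shortened list of levels ordered from lowest to highest
levels = ['Preschool', 'HS-grad', 'Bachelors', 'Masters']
level_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

# One entry is missing on purpose
sample = pd.Series(['Bachelors', 'HS-grad', None, 'Masters'])
codes = sample.astype(level_dtype).cat.codes

# Each code is the label's position in the ordered list; missing entries get -1
print(codes.tolist())   # [2, 1, -1, 3]

# Reverse lookup from code to label
mapping = dict(enumerate(levels))
print(mapping[2])       # Bachelors
```

Because missing entries come back as `-1` rather than NaN, it is worth handling them before treating the codes as ordinary numbers.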


Lastly, print out your dataframe!

In [5]:
print(df.head(5))

print("The number of columns now in the dataframe:", len(df.columns))


   education  age capital-gain   race  capital-loss  hours-per-week     sex  \
0  Bachelors   39         2174  White             0              40    Male   
1  Bachelors   50            ?  White             0              13    Male   
2    HS-grad   38            ?  White             0              40    Male   
3       11th   53            ?  Black             0              40    Male   
4  Bachelors   28            0  Black             0              40  Female   

  classification  
0          <=50K  
1          <=50K  
2          <=50K  
3          <=50K  
4          <=50K  
The number of columns now in the dataframe: 8
