In [1]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df.head())

   age          workclass  fnlwgt   education  educationnum  \
0   39          State-gov   77516   Bachelors            13   
1   50   Self-emp-not-inc   83311   Bachelors            13   
2   38            Private  215646     HS-grad             9   
3   53            Private  234721        11th             7   
4   28            Private  338409   Bachelors            13   

         maritalstatus          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capitalgain  capitalloss  hoursperweek   nativecountry   label  
0         2174            0            40   United-States   <=50K  
1         

# Gathering Statistics on Data

A good place to start is to look at your data with a few pandas functions to get a sense of what issues there might be. `describe()` gives you counts and summary statistics for the numeric columns.

In [2]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df.describe())

                age        fnlwgt  educationnum   capitalgain   capitalloss  \
count  32561.000000  3.256100e+04  32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05     10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05      2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04      1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05      9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05     10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05     12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06     16.000000  99999.000000   4356.000000   

       hoursperweek  
count  32561.000000  
mean      40.437456  
std       12.347429  
min        1.000000  
25%       40.000000  
50%       40.000000  
75%       45.000000  
max       99.000000  


For all of our numeric values, we now have the `mean`, the `std`, the `min`, the `max`, and a few different `percentiles`.

**Note:** It is good to remember that the mean value will be influenced more by outliers than the median. Also, you can always square the standard deviation to get the variance.
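The note above can be checked on a small example. This is a minimal sketch using made-up hours values (not taken from the adult dataset): one extreme value moves the mean a lot while the median stays put, and squaring the standard deviation matches `var()`.

```python
import pandas as pd

# Toy data (hypothetical values, not from the adult dataset)
hours = pd.Series([38, 40, 40, 42, 40])
hours_with_outlier = pd.Series([38, 40, 40, 42, 400])

# The outlier pulls the mean upward; the median does not move
print(hours.mean(), hours.median())
print(hours_with_outlier.mean(), hours_with_outlier.median())

# Squaring the standard deviation gives the variance
variance_from_std = hours.std() ** 2
print(variance_from_std, hours.var())
```

Both `std()` and `var()` use the sample formula (dividing by n - 1) by default in pandas, which is why the two values agree.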

You may have noticed that some columns are missing. By default, `describe()` only summarizes the columns that hold numeric data.
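If you also want a summary of the non-numeric columns, `describe()` accepts `include` and `exclude` arguments. The sketch below uses a tiny made-up dataframe (an assumption for illustration) and excludes the numeric columns, so only the text column is summarized with its count, number of unique values, most frequent value, and that value's frequency.

```python
import pandas as pd

# Toy dataframe (hypothetical rows) with one numeric and one text column
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Private"],
})

# Summarize only the non-numeric columns
object_summary = df.describe(exclude="number")
print(object_summary)
```

Passing `include="all"` instead summarizes every column at once, with `NaN` wherever a statistic does not apply to a column's type.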

# Finding the Data Types

In [4]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            32561 non-null  int64 
 1   workclass      32561 non-null  object
 2   fnlwgt         32561 non-null  int64 
 3   education      32561 non-null  object
 4   educationnum   32561 non-null  int64 
 5   maritalstatus  32561 non-null  object
 6   occupation     32561 non-null  object
 7   relationship   32561 non-null  object
 8   race           32561 non-null  object
 9   sex            32561 non-null  object
 10  capitalgain    32561 non-null  int64 
 11  capitalloss    32561 non-null  int64 
 12  hoursperweek   32561 non-null  int64 
 13  nativecountry  32561 non-null  object
 14  label          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In our dataframe, we have two data types, `object` and `int64`. You can think of an `object` as a string value and `int64` as an integer.
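When you only need the types, or want to split the columns by type, `dtypes` and `select_dtypes()` are lighter alternatives to `info()`. A minimal sketch on a made-up dataframe (the rows are hypothetical):

```python
import pandas as pd

# Toy dataframe (hypothetical rows)
df = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hoursperweek": [40, 13],
})

# One dtype per column
print(df.dtypes)

# Split the column names into numeric and non-numeric groups
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```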

# Converting Data Types

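If a column doesn’t seem to have the correct type, you can convert it with one of the pandas conversion functions or with `astype()`:

* `pd.to_numeric()`
* `pd.to_datetime()`
* `astype(str)` (note that `to_string()` renders data as printable text rather than changing a column's type)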
 
For example:
```
df['numeric_column'] = pd.to_numeric(df['string_column'])
```
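Real columns often contain entries that cannot be converted. The sketch below uses a made-up dataframe (the column names `string_column`, `date_text`, and the values are assumptions for illustration) and shows `errors="coerce"`, which turns unparseable entries into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy dataframe (hypothetical values)
df = pd.DataFrame({
    "string_column": ["1", "2", "not available"],
    "date_text": ["2020-01-15", "2020-02-20", "2020-03-25"],
})

# errors="coerce" replaces entries that cannot be parsed with NaN
df["numeric_column"] = pd.to_numeric(df["string_column"], errors="coerce")

# Parse ISO-formatted date strings into datetime values
df["date_column"] = pd.to_datetime(df["date_text"])

# astype(str) turns each value into its string form
df["numeric_as_text"] = df["numeric_column"].astype(str)

print(df.dtypes)
print(df)
```

Without `errors="coerce"`, `pd.to_numeric()` raises a `ValueError` on the third row, which can be the better choice when bad values should stop the pipeline.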

# Finding Unique Values

Another useful step is to look at unique values for columns. Here is an example for the `relationship` column:

In [5]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df['relationship'].unique())

[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']


Now we can see all the unique values above and can check the counts of unique values for `relationship`:

In [6]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df['relationship'].value_counts())

relationship
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: count, dtype: int64


This shows us all the unique values for `relationship` as well as their counts. So, we have a lot of **Husband** values in the `relationship` column, but far fewer **Other-relative** values.
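The `unique()` output earlier shows a leading space in every value (for example `' Husband'`), which is probably a side effect of the `", "` separator in the raw file. A minimal sketch on made-up values shows one way to strip that whitespace and then get shares instead of raw counts:

```python
import pandas as pd

# Toy series (hypothetical values) mimicking the leading spaces seen above
relationship = pd.Series([" Husband", " Wife", " Husband", " Own-child"])

# Remove leading and trailing whitespace from every value
cleaned = relationship.str.strip()
print(cleaned.unique())

# normalize=True returns proportions instead of counts
shares = cleaned.value_counts(normalize=True)
print(shares)
```

Another option, assuming the spaces do come from the separator, is to pass `skipinitialspace=True` to `pd.read_csv()` so the values are clean from the start.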

# Grouping the Data

We can also do these types of counts by specific groups by using the `groupby()` function. This function takes a list of columns by which you would like to group your dataframe. It then performs the requested calculations on each group individually and returns the results by group. Here is an example:

In [7]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df.groupby('relationship')['label'].value_counts(normalize=True))

relationship    label
Husband         <=50K    0.551429
                >50K     0.448571
Not-in-family   <=50K    0.896930
                >50K     0.103070
Other-relative  <=50K    0.962283
                >50K     0.037717
Own-child       <=50K    0.986780
                >50K     0.013220
Unmarried       <=50K    0.936738
                >50K     0.063262
Wife            <=50K    0.524872
                >50K     0.475128
Name: proportion, dtype: float64


What we did above was group by the variable `relationship` and then perform value counts on the variable `label`. For these data, the label is whether you make more than **50k**. We can see above that about **45%** of husbands make more than **50k**, while about **55%** make **50k** or less. We received proportions rather than raw counts because we used the `normalize=True` parameter.
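To compare groups side by side, the grouped proportions can be pivoted into a table with one row per group and one column per label. A minimal sketch on made-up rows (hypothetical data, not the adult dataset):

```python
import pandas as pd

# Toy dataframe (hypothetical rows)
df = pd.DataFrame({
    "relationship": ["Husband", "Husband", "Husband", "Husband", "Wife", "Wife"],
    "label": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Proportion of each label within each relationship group
shares = df.groupby("relationship")["label"].value_counts(normalize=True)

# Move the label level into columns; fill combinations that never occur with 0
table = shares.unstack(fill_value=0)
print(table)
```

Each row of `table` sums to 1, so a single column such as `table[">50K"]` can be read directly as the share of high earners per group.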

You can do many types of calculations on groups using Pandas. For example, here we can see the mean hours worked per week by `workclass`.

In [8]:
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']

train_df = pd.read_csv("adult/adult.data", header=None, names=names)

print(train_df.groupby('workclass')['hoursperweek'].mean())

workclass
?                   31.919390
Federal-gov         41.379167
Local-gov           40.982800
Never-worked        28.428571
Private             40.267096
Self-emp-inc        48.818100
Self-emp-not-inc    44.421881
State-gov           39.031587
Without-pay         32.714286
Name: hoursperweek, dtype: float64


From the above, it looks like federal government workers work slightly more hours than local government workers on average (about 41.4 versus 41.0). The **Never-worked** group averages about 28 hours.
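A group mean can rest on very few rows, so it often helps to compute several statistics per group at once. The sketch below uses made-up rows (hypothetical values) and `agg()` with a list of statistic names:

```python
import pandas as pd

# Toy dataframe (hypothetical rows)
df = pd.DataFrame({
    "workclass": ["Private", "Private", "Federal-gov", "Federal-gov", "Federal-gov"],
    "hoursperweek": [40, 30, 40, 45, 41],
})

# Mean, median, and number of rows for each workclass
summary = df.groupby("workclass")["hoursperweek"].agg(["mean", "median", "count"])
print(summary)
```

The `count` column shows how many rows each mean is based on, which is worth checking before reading much into a small group.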

# Finding the Correlation

Another useful statistic is the correlation. If you need a refresher on correlation, please check out [Wikipedia](https://en.wikipedia.org/wiki/Correlation_coefficient). You can calculate all the pairwise correlations between the numeric columns of your dataframe by using the `corr()` method.

In [11]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult/adult.data", header=None, names=names)

# Calculate correlations                   
print(train_df.corr(numeric_only=True))

                   age    fnlwgt  educationnum  capitalgain  capitalloss  \
age           1.000000 -0.076646      0.036527     0.077674     0.057775   
fnlwgt       -0.076646  1.000000     -0.043195     0.000432    -0.010252   
educationnum  0.036527 -0.043195      1.000000     0.122630     0.079923   
capitalgain   0.077674  0.000432      0.122630     1.000000    -0.031615   
capitalloss   0.057775 -0.010252      0.079923    -0.031615     1.000000   
hoursperweek  0.068756 -0.018768      0.148123     0.078409     0.054256   

              hoursperweek  
age               0.068756  
fnlwgt           -0.018768  
educationnum      0.148123  
capitalgain       0.078409  
capitalloss       0.054256  
hoursperweek      1.000000  


We can quickly see that, among all of these pairs, the highest correlation is between “hours per week” and “education num” (about 0.15), which is still weak. You will notice, though, that since our label is an object, it isn’t included here. Knowing how variables correlate with our label would be useful, so let’s take care of that:

In [13]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult/adult.data", header=None, names=names)

# Convert the string label into 1 when the income is >50K and 0 otherwise
train_df['label_int'] = train_df.label.apply(lambda x: ">" in x).astype(int)

# Calculate correlations                   
print(train_df.corr(numeric_only=True))

                   age    fnlwgt  educationnum  capitalgain  capitalloss  \
age           1.000000 -0.076646      0.036527     0.077674     0.057775   
fnlwgt       -0.076646  1.000000     -0.043195     0.000432    -0.010252   
educationnum  0.036527 -0.043195      1.000000     0.122630     0.079923   
capitalgain   0.077674  0.000432      0.122630     1.000000    -0.031615   
capitalloss   0.057775 -0.010252      0.079923    -0.031615     1.000000   
hoursperweek  0.068756 -0.018768      0.148123     0.078409     0.054256   
label_int     0.234037 -0.009463      0.335154     0.223329     0.150526   

              hoursperweek  label_int  
age               0.068756   0.234037  
fnlwgt           -0.018768  -0.009463  
educationnum      0.148123   0.335154  
capitalgain       0.078409   0.223329  
capitalloss       0.054256   0.150526  
hoursperweek      1.000000   0.229689  
label_int         0.229689   1.000000  


There seems to be a moderate correlation between the label and education num (about 0.34). One thing to note, though, is that our label is categorical, so a correlation coefficient is not a natural fit; our groupby frequencies are probably a better method.

**Note:** Categorical variables take their values from a fixed set of categories, often with no intrinsic order. For example, gender.

Also, keep in mind that these are just pairwise correlations (between two variables at a time) and don’t account for multivariate effects (among several variables at once). You can also calculate the correlation using the `scipy` package, which has the added benefit of p-values. This was discussed in the “Scipy an External Library” lesson.
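As a reminder of what the `scipy` route looks like, here is a minimal sketch using `scipy.stats.pearsonr` on made-up values (the numbers are hypothetical and `scipy` is assumed to be installed). It returns the correlation coefficient together with a p-value.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy dataframe (hypothetical values)
df = pd.DataFrame({
    "educationnum": [9, 10, 13, 13, 14, 16, 7, 9],
    "hoursperweek": [40, 38, 45, 50, 40, 60, 20, 35],
})

# Pearson correlation coefficient and its two-sided p-value
r, p_value = pearsonr(df["educationnum"], df["hoursperweek"])
print(r, p_value)

# The coefficient matches what pandas computes for the same pair
print(df["educationnum"].corr(df["hoursperweek"]))
```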

# Generating Percentiles

Lastly, the `describe()` function of pandas gives some percentiles by default, but it is easy to add more:

In [15]:
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult/adult.data", header=None, names=names)

# Use the describe function to calculate the percentiles specified                     
print(train_df.describe(percentiles=[.01,.05,.95,.99]))

                age        fnlwgt  educationnum   capitalgain   capitalloss  \
count  32561.000000  3.256100e+04  32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05     10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05      2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04      1.000000      0.000000      0.000000   
1%        17.000000  2.718580e+04      3.000000      0.000000      0.000000   
5%        19.000000  3.946000e+04      5.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05     10.000000      0.000000      0.000000   
95%       63.000000  3.796820e+05     14.000000   5013.000000      0.000000   
99%       74.000000  5.100720e+05     16.000000  15024.000000   1980.000000   
max       90.000000  1.484705e+06     16.000000  99999.000000   4356.000000   

       hoursperweek  
count  32561.000000  
mean      40.437456  
std       12.347429  
min        1.000000  
1%         8.000000 

A percentile is the value below which a given percent of the data falls.

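We pass the percentile values we want using the `percentiles` parameter shown above. Note that the 50th percentile (the median) appears in the output even though we did not request it.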
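For a single column, `quantile()` computes the same kind of value directly. A minimal sketch on made-up data (the integers 1 to 100, chosen so the result is easy to verify) that also checks the definition above:

```python
import pandas as pd

# Toy series (hypothetical values): the integers 1 through 100
hours = pd.Series(range(1, 101))

# 95th percentile, using the default linear interpolation
p95 = hours.quantile(0.95)
print(p95)

# Share of values at or below the 95th percentile
share_below = (hours <= p95).mean()
print(share_below)

# Several percentiles at once
print(hours.quantile([0.01, 0.05, 0.95, 0.99]))
```

Because pandas interpolates between neighboring values by default, the percentile does not have to be a value that actually occurs in the data.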