# Building a Balanced Data Set

In [1]:
import pandas as pd
import numpy as np
import os

## Step 1: Inspect the Data


In [2]:
filename = os.path.join(os.getcwd(), "..", "..", "data", "censusData.csv")
df = pd.read_csv(filename)

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income
0,36,State-gov,112074,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Non-Female,0,0,45,United-States,<=50K
1,35,Private,32528,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,White,Non-Female,0,0,45,United-States,<=50K
2,21,Private,270043,Some-college,10,Never-married,Other-service,Own-child,White,Female,0,0,16,United-States,<=50K
3,45,Private,168837,Some-college,10,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,24,Canada,>50K
4,39,Private,297449,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Non-Female,0,0,40,United-States,>50K


In [4]:
df.shape

(7000, 15)

## Step 2: Random Sampling of the Data
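Draw a random 30% sample of the rows. The cell below uses `DataFrame.sample`; choosing index labels with `np.random.choice` and selecting them with `loc` is an equivalent approach.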

In [5]:
percentage = 0.3
num_rows = df.shape[0]

# Draw a random sample of 30% of the rows (without replacement)
df_subset = df.sample(int(percentage*num_rows))
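
The same kind of sample can be drawn by picking index labels with NumPy and selecting them with `loc`. The following is a minimal sketch on a small made-up DataFrame; `df_demo`, `df_demo_subset`, and the seed value are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the census DataFrame (toy data for illustration only).
df_demo = pd.DataFrame({"age": np.arange(20, 30), "income": ["<=50K", ">50K"] * 5})

percentage = 0.3
num_rows = df_demo.shape[0]

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

# Choose index labels without replacement, then select those rows by label.
indices = rng.choice(df_demo.index.to_numpy(), size=int(percentage * num_rows), replace=False)
df_demo_subset = df_demo.loc[indices]
```

Sampling without replacement guarantees that no row appears twice in the subset, which matches the default behavior of `DataFrame.sample`.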



## Step 3: Verifying Imbalance
Is our sample *balanced* with respect to (self-reported) sex? In order to answer that, first we'd like to know how many categories exist for the 'sex_selfID' values in our data.

In [6]:
unique_ssID = df['sex_selfID'].unique()
unique_ssID

array(['Non-Female', 'Female'], dtype=object)

### Calculating the Proportion of Each Class
How many 'Female' examples are in our data sample?

The code cells below use `value_counts()` to count the rows in each `sex_selfID` category, then divide the `Female` count by the total number of rows in `df_subset`. Run the code to display the results. About one third of the sample is `Female`, so the sample is not balanced with respect to self-reported sex (assuming that we want balance for the two classes). The last cell breaks the counts down further by `income`.

In [7]:
counts = df_subset['sex_selfID'].value_counts()
counts

sex_selfID
Non-Female    1401
Female         699
Name: count, dtype: int64

In [8]:
counts['Female']/sum(counts.values)

0.33285714285714285
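
The proportion can also be computed by summing a boolean mask with `np.sum()`, since each `True` counts as 1. This is a small sketch on a made-up column; the `sex` Series is a hypothetical stand-in for `df_subset['sex_selfID']`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df_subset['sex_selfID'].
sex = pd.Series(
    ["Female", "Non-Female", "Non-Female", "Female", "Non-Female", "Non-Female"],
    name="sex_selfID",
)

# Boolean mask: True where the row is 'Female'; True counts as 1 when summed.
is_female = sex == "Female"
prop_mask = np.sum(is_female) / sex.shape[0]

# The same proportion from value_counts with normalize=True.
prop_counts = sex.value_counts(normalize=True)["Female"]
```

Both routes give the same number; `value_counts(normalize=True)` has the advantage of reporting every category at once.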

In [9]:
df_subset.groupby(['sex_selfID', 'income']).size()

sex_selfID  income
Female      <=50K     628
            >50K       71
Non-Female  <=50K     988
            >50K      413
dtype: int64

### Addressing Imbalance: Upsampling the Underrepresented Group

In [10]:
# Group sizes by (sex_selfID, income); groupby sorts the keys, so '<=50K' comes before '>50K'
group_sizes = df_subset.groupby(['sex_selfID', 'income']).size()
low_income_nonfemale, high_income_nonfemale = group_sizes['Non-Female']
class_balance_nonfemale = high_income_nonfemale / low_income_nonfemale

low_income_female, high_income_female = group_sizes['Female']

# Number of (Female, >50K) rows to add so both groups have the same >50K to <=50K ratio
add_sample_size = int(class_balance_nonfemale*low_income_female - high_income_female)
add_sample_size

191
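
To make the arithmetic behind this number explicit, the sketch below rebuilds the group sizes from the `groupby` output printed earlier and looks each one up by label instead of relying on positional unpacking. The names `sizes`, `target_ratio`, and `needed` are illustrative assumptions.

```python
import pandas as pd

# Group sizes copied from the groupby output shown earlier in this notebook.
sizes = pd.Series(
    [628, 71, 988, 413],
    index=pd.MultiIndex.from_tuples(
        [("Female", "<=50K"), ("Female", ">50K"),
         ("Non-Female", "<=50K"), ("Non-Female", ">50K")],
        names=["sex_selfID", "income"],
    ),
)

# Ratio of high- to low-income rows in the reference (Non-Female) group.
target_ratio = sizes.loc[("Non-Female", ">50K")] / sizes.loc[("Non-Female", "<=50K")]

# Rows to add so that Female >50K / Female <=50K matches target_ratio.
needed = int(target_ratio * sizes.loc[("Female", "<=50K")] - sizes.loc[("Female", ">50K")])
```

Label-based lookups keep working if the category order ever changes, whereas tuple unpacking silently swaps the two counts.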

In [11]:
# Subset the original data: exclude entries that are already in our sample:
df_never_sampled = df.drop(labels=df_subset.index, axis=0, inplace=False)

# Filter that subset to include only the type of examples that we want to upsample: Females, higher income
condition = (df_never_sampled['income']=='>50K') & (df_never_sampled['sex_selfID']=='Female')
df_never_sampled_target = df_never_sampled[condition]

# Sample from the resulting set
size=min(add_sample_size, df_never_sampled_target.shape[0])
indices = np.random.choice(df_never_sampled_target.index, size=size, replace=False)

# Append the selected examples to our original sample
rows = df.loc[indices]
df_balanced_subset = pd.concat([df_subset, rows], ignore_index=True)
df_balanced_subset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income
0,34,Private,192900,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Non-Female,0,0,40,United-States,>50K
1,43,Private,483450,9th,5,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,40,Mexico,<=50K
2,31,Private,38223,Bachelors,13,Never-married,Sales,Own-child,White,Non-Female,0,0,45,United-States,<=50K
3,35,Private,220098,HS-grad,9,Married-civ-spouse,Other-service,Wife,White,Female,0,0,40,United-States,>50K
4,33,Private,159303,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
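
The steps in the cell above (exclude rows already sampled, filter to the target group, draw at most as many rows as exist, append) can be wrapped in a reusable function. The following is only a sketch: `top_up_sample`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_up_sample(df_full, df_sample, mask, n_wanted, seed=0):
    """Hypothetical helper: add up to n_wanted unsampled rows that match mask."""
    # Rows of df_full that satisfy the mask and are not already in the sample.
    candidates = df_full[mask].drop(index=df_sample.index, errors="ignore")
    n = min(n_wanted, len(candidates))  # cannot draw more rows than exist
    rng = np.random.default_rng(seed)   # arbitrary seed for reproducibility
    chosen = rng.choice(candidates.index.to_numpy(), size=n, replace=False)
    return pd.concat([df_sample, df_full.loc[chosen]], ignore_index=True)

# Toy data for illustration only.
df_toy = pd.DataFrame({
    "sex_selfID": ["Female", "Female", "Non-Female", "Female", "Non-Female", "Female"],
    "income": [">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
sample_toy = df_toy.loc[[1, 2]]
mask_toy = (df_toy["income"] == ">50K") & (df_toy["sex_selfID"] == "Female")

# Ask for 5 extra rows; only 3 candidates exist, so 3 are added.
result = top_up_sample(df_toy, sample_toy, mask_toy, n_wanted=5)
```

Capping the draw with `min` is what prevents `choice` from raising an error when fewer candidates remain than requested.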


In [12]:
df_balanced_subset.groupby(['sex_selfID', 'income']).size()

sex_selfID  income
Female      <=50K     628
            >50K      253
Non-Female  <=50K     988
            >50K      413
dtype: int64

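The resulting balance is not perfect, but it is better than before: the cell added 182 `Female`, `>50K` rows rather than the 191 requested, which suggests the unsampled data held no more than that.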
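
To quantify the change, the ratio of `>50K` to `<=50K` rows can be recomputed for each group from the sizes printed in this notebook. This is a short sketch; the variable names are illustrative.

```python
# Group sizes copied from the groupby outputs shown in this notebook.
female_low = 628
female_high_before, female_high_after = 71, 253
nonfemale_low, nonfemale_high = 988, 413

# Ratio of high- to low-income rows within each group.
ratio_before = female_high_before / female_low   # about 0.113
ratio_after = female_high_after / female_low     # about 0.403
ratio_target = nonfemale_high / nonfemale_low    # about 0.418
```

After upsampling, the `Female` ratio sits within roughly 0.02 of the `Non-Female` ratio, compared with a gap of roughly 0.3 beforehand.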