# Simple Binary Classification on Adult Dataset

You can use this notebook to try out StickyLand!

To launch StickyLand, click the note icon in the toolbar above.

![](https://i.imgur.com/kQyAEF3.png)

In [None]:
# Install dependencies
%pip install numpy pandas matplotlib scikit-learn

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from collections import Counter
%config InlineBackend.figure_format = 'retina'

## 1. Exploratory Data Analysis

### 1.1. Loading the Dataset

In [None]:
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    sep=', ',
    engine='python',
    header=None
)

column_names = [
    'Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum',
    'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Gender',
    'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Income'
]
df.columns = [n.lower() for n in column_names]

df.shape
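The `sep=', '` argument is longer than one character, so pandas has to fall back to the slower Python parsing engine. A possible alternative, sketched here on a few made-up rows in the same comma-plus-space layout (the rows and the three column names are hypothetical, not taken from the dataset), is to keep the default separator and pass `skipinitialspace=True`:

```python
import io

import pandas as pd

# Hypothetical rows that mimic the "value, value, value" layout of adult.data
raw = "25, Private, <=50K\n47, Self-emp, >50K\n"

sample_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=['age', 'workclass', 'income'],
    skipinitialspace=True,  # drop the blank that follows each comma
)
print(sample_df)
```

With this option the string columns come back without a leading space, so comparisons such as `== '<=50K'` still work.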

### Adult Features

The Adult dataset has 14 features.<br>
The output variable is binary (`income > 50k`).

In [None]:
df.head()

In [None]:
sub_df = df[df['age'] < 20]
sub_df.head()

### Task List [02/22]

- [x] Visualize the adult dataset
    - [x] Histogram of all features
    - [x] Scatter plot of `age` vs. `income`
- [x] Test ML models on new dataset
    - [x] XGBoost
    - [x] Explainable Boosting Machine
- [ ] Share the notebook with Ellie 😊

Also supports $\LaTeX$!

In [None]:
def overlay_hist(df, c):
    """
    Plot overlaid histograms of feature `c` for the two target classes.
    """
    
    num_unique = len(df[c].unique())
    
    if df[c].dtype == 'object':
        counter_1 = Counter(df[c][df['target'] == 1])
        counter_2 = Counter(df[c][df['target'] != 1])

        bar_names = []
        bar_densities_1 = []
        bar_densities_2 = []

        for f in counter_1:
            bar_names.append(f)
            bar_densities_1.append(counter_1[f] / df.shape[0])
            bar_densities_2.append(counter_2[f] / df.shape[0])

        for f in counter_2:
            if f not in counter_1:
                bar_names.append(f)
                bar_densities_1.append(counter_1[f] / df.shape[0])
                bar_densities_2.append(counter_2[f] / df.shape[0])

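        # Label the columns so the legend shows the income classes instead of 0 and 1
        count_df = pd.DataFrame(
            np.c_[bar_densities_2, bar_densities_1],
            index=bar_names,
            columns=['<=50k', '>50k']
        )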
        ax = count_df.plot.bar(alpha=0.5)
        ax.set_title(c)
        ax.figure.autofmt_xdate(rotation=45)

    else:
        plt.hist(df[c][df['target'] == 1], alpha=0.5, density=True, label='>50k', bins=50)
        plt.hist(df[c][df['target'] != 1], alpha=0.5, density=True, label='<=50k', bins=50)
        plt.title(c)
        
    plt.legend(loc='upper right')
    print('Num of unique values: ', num_unique)
    plt.show()
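For categorical features, `overlay_hist` divides each per-class count by the total number of rows. As a rough cross-check, the same table of densities could be built with `pd.crosstab`; the sketch below uses a small hypothetical frame rather than the Adult data:

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target
toy = pd.DataFrame({
    'workclass': ['Private', 'Private', 'State-gov', 'Self-emp'],
    'target': [1, 0, 0, 1],
})

# normalize='all' divides every cell by the total number of rows,
# matching the counter[f] / df.shape[0] computation in overlay_hist
density_table = pd.crosstab(toy['workclass'], toy['target'], normalize='all')
print(density_table)
```

Because both classes share the same denominator, the bars of the two classes together sum to 1 instead of each class summing to 1 on its own.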

### Task List [02/22]

- [x] Visualize the adult dataset
    - [x] Histogram of all features
    - [x] Scatter plot of `age` vs. `income`
- [x] Test ML models on new dataset
    - [x] XGBoost
    - [x] Explainable Boosting Machine
- [x] Share the notebook with Ellie 😊

Also supports $\LaTeX$!

Transform the target variable `income` into a binary variable (`1` for `>50K`, `0` for `<=50K`).

In [None]:
df['target'] = (df['income'] != '<=50K').astype(int)
new_df = df.copy()

### 1.2. Data Engineering

In this section, we delete or transform some features before training the binary classifier.

In [None]:
interested_feature = 'maritalstatus'

In [None]:
overlay_hist(df, interested_feature)

The two income groups are distributed quite differently across this feature.

In [None]:
overlay_hist(df, 'workclass')

In [None]:
overlay_hist(df, 'fnlwgt')

`fnlwgt` stands for "final weight": the census assigns each sample a weight so that people with similar demographic characteristics have similar weights. This feature is not useful for this model, so we drop it.

In [None]:
del new_df['fnlwgt']

In [None]:
overlay_hist(df, 'education')

In [None]:
overlay_hist(df, 'educationnum')

In [None]:
overlay_hist(df, 'maritalstatus')

In [None]:
overlay_hist(df, 'occupation')

In [None]:
overlay_hist(df, 'relationship')

In [None]:
overlay_hist(df, 'race')

In [None]:
overlay_hist(df, 'gender')

In [None]:
overlay_hist(df, 'capitalgain')

In [None]:
overlay_hist(df, 'capitalloss')

The two features `capitalgain` and `capitalloss` are mostly 0. This makes sense, because the census defines capital gain/loss as the profit/loss from asset sales (stocks or real estate), and not everyone realizes a capital gain or loss in a particular year. We can convert these two variables into binary features `has_capitalgain` and `has_capitalloss`.

In [None]:
new_df['has_capitalgain'] = [int(t) for t in df['capitalgain'] != 0]
new_df['has_capitalloss'] = [int(t) for t in df['capitalloss'] != 0]

del new_df['capitalgain']
del new_df['capitalloss']
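The same indicator can also be computed without a Python-level loop. The following is a minimal sketch on a hypothetical column (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical capital-gain values; most rows are zero
toy = pd.DataFrame({'capitalgain': [0, 2174, 0, 14084]})

# ne(0) yields a boolean mask, astype(int) turns it into 0/1
toy['has_capitalgain'] = toy['capitalgain'].ne(0).astype(int)
print(toy)
```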

In [None]:
overlay_hist(df, 'hoursperweek')

Working 40 hours a week is typical in the dataset. Interestingly, people who earn more tend to work longer hours.
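One way to back up a visual impression like this with numbers is to summarize the feature per class. The sketch below runs on hypothetical values, so its output says nothing about the real dataset; on the notebook's data one would presumably pass `df` instead of `toy`:

```python
import pandas as pd

# Hypothetical hours and labels, for illustration only
toy = pd.DataFrame({
    'hoursperweek': [40, 40, 35, 50, 45, 40],
    'target': [0, 0, 0, 1, 1, 1],
})

# Mean and median weekly hours for each income class
summary = toy.groupby('target')['hoursperweek'].agg(['mean', 'median'])
print(summary)
```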

In [None]:
overlay_hist(df, 'nativecountry')

Most people in the dataset list the US as their native country. We can encode this feature as another binary variable, `from_usa`, to reduce the number of levels.

In [None]:
new_df['from_usa'] = [int(t) for t in df['nativecountry'] == 'United-States']
del new_df['nativecountry']

In [None]:
overlay_hist(df, 'income')

The plot shows that the dataset is quite imbalanced: the `<=50K` class is much larger than the `>50K` class.
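The degree of imbalance can be read off as class proportions. A minimal sketch on a hypothetical label column (on the notebook's data this would presumably be `df['target']`):

```python
import pandas as pd

# Hypothetical labels: three negatives, one positive
toy_target = pd.Series([0, 0, 0, 1], name='target')

# normalize=True returns shares instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```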

In [None]:
new_df.head()
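The notebook imports `train_test_split` but stops before splitting the data. Given the class imbalance, one reasonable next step would be a stratified split after one-hot encoding the categorical columns. The sketch below is an assumption about how that could look, not the author's pipeline; the `toy` frame is a hypothetical stand-in for `new_df`, and `test_size=0.2` and `random_state=0` are arbitrary choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for new_df: 15 negatives and 5 positives
toy = pd.DataFrame({
    'age': list(range(20, 40)),
    'gender': ['Male', 'Female'] * 10,
    'target': [0] * 15 + [1] * 5,
})

# One-hot encode the categorical column so every feature is numeric
X = pd.get_dummies(toy.drop(columns='target'))
y = toy['target']

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())
```

Stratifying keeps the minority class represented in the test set, which matters when one class is much rarer than the other.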

## Image Augmentation

In [None]:
# Install dependencies
%pip install imageio albumentations

In [None]:
import imageio
import numpy as np
import matplotlib.pyplot as plt
import cv2
import albumentations as A
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [None]:
def load_random_image():
    image = imageio.imread('https://picsum.photos/200/300')

    image = image[:, :, :3]

    s = 250
    transform = A.Resize(height=s, width=s)
    augmented = transform(image=image)
    image = augmented['image']
    
    return image

In [None]:
def rotate(image):
    """Rotate the image using Albumentations"""
    transform = A.Rotate(limit=(-10, -9), p=1.0)
    augmented = transform(image=image)
    return augmented['image']

def add_noise(image):
    """Add coarse dropout noise using Albumentations"""
    transform = A.CoarseDropout(p=0.50)
    augmented = transform(image=image)
    return augmented['image']

def multiply_hue_saturation(img, factor):
    """Scale the hue and saturation channels by `factor`, clipping to the valid range"""
    # Convert RGB to HSV (note: OpenCV uses hue range [0,179])
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 0] = np.clip(hsv[..., 0] * factor, 0, 179)
    hsv[..., 1] = np.clip(hsv[..., 1] * factor, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def corrupt(image):
    """Corrupt the image by boosting hue and saturation twice"""
    image = multiply_hue_saturation(image, 4)
    image = multiply_hue_saturation(image, 4)
    return image  
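Because `multiply_hue_saturation` clips after multiplying, applying it twice with a factor of 4 pushes most values to the upper bound. The effect of the clipping can be sketched with NumPy alone; `scale_and_clip` is a hypothetical helper that mirrors only the multiply-and-clip step, not the color conversion:

```python
import numpy as np

def scale_and_clip(channel, factor, upper):
    """Multiply a channel by `factor` and clip to [0, upper] (hypothetical helper)."""
    return np.clip(np.asarray(channel, dtype=np.float32) * factor, 0, upper)

# Hypothetical hue values in OpenCV's [0, 179] range
hue = np.array([10, 30, 100])

once = scale_and_clip(hue, 4, 179)
twice = scale_and_clip(once, 4, 179)
print(once)
print(twice)
```

After the second pass, two of the three values sit at the ceiling of 179, which is why `corrupt` produces such a heavily shifted image.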

In [None]:
image = load_random_image()
plt.imshow(image);

In [None]:
image = rotate(image)
plt.imshow(image);

In [None]:
image = add_noise(image)
plt.imshow(image);

In [None]:
image = corrupt(image)
plt.imshow(image);
