
# Homework 3

DUE Nov 15th at 11:59 PM

In [1]:
import pandas as pd
import numpy as np
import torch
import torchvision

## Problem 1

In this problem, you will implement a simple feed-forward neural network using PyTorch, a straightforward, easy-to-pick-up framework for quickly prototyping deep learning models.

PyTorch provides two powerful things. First, a data structure called Tensor (essentially a multi-dimensional array, similar to a NumPy ndarray). Tensors are optimized for matrix calculations and can be loaded onto a GPU. They are also implemented so that it is easy to calculate and pass back chains of gradients, which is extremely useful for backpropagation in a neural network. Second, an internal mechanism called Autograd that tracks the variables involved in a chain of calculations and efficiently computes their gradients via the chain rule when needed. Read more here: https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95
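
To make the Autograd description concrete, here is a minimal sketch. The tensor values and the function (a sum of squares) are chosen purely for illustration. Calling `backward()` on the result fills `x.grad` with the gradient obtained via the chain rule:

```python
import torch

# A tensor that Autograd should track
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Build a small chain of calculations: y = x_1^2 + x_2^2 + x_3^2
y = (x ** 2).sum()

# Backpropagate: Autograd computes dy/dx = 2x and stores it in x.grad
y.backward()
print(x.grad)
```

This is the same mechanism that `loss.backward()` relies on in a training loop, only applied to the network's parameters instead of `x`.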

You will define a neural network class in PyTorch and use the network to learn a classification task on the well-known KDD CUP 99 dataset. You can refer to Problem 2 to see how a network class can be defined, how to use a PyTorch DataLoader, and what a training loop may look like.

There are many great tutorials on PyTorch out there. For example, this video on YouTube explains how to build a simple network in PyTorch quite clearly: https://www.youtube.com/watch?v=oPhxf2fXHkQ

### Part a
Firstly, load and inspect the "**KDD CUP 99**" dataset.

In [2]:
from sklearn.datasets import fetch_kddcup99

X, y = fetch_kddcup99(return_X_y=True, percent10=True)



Split the data into a train set (70%), a validation set (10%), and a test set (20%). Then, create a PyTorch DataLoader for each of the three sets.

You can read about PyTorch's DataLoader from:

*   https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
*   https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader



In [3]:
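from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch.utils.data import DataLoader, TensorDataset

# X is an object array: columns 1-3 (protocol_type, service, flag) hold categorical
# byte strings, so one-hot encode them; the labels in y are mapped to integer class ids
X_num = pd.get_dummies(pd.DataFrame(X), columns=[1, 2, 3]).astype(np.float32).to_numpy()
y_num = LabelEncoder().fit_transform(y)

# split using train_test_split: 70% train, then the remaining 30% into 20% test and 10% validation
X_train, X_test_full, y_train, y_test_full = train_test_split(X_num, y_num, test_size=0.3, random_state=674)
X_test, X_validate, y_test, y_validate = train_test_split(X_test_full, y_test_full, test_size=(1/3), random_state=638)

# standardize the features using statistics from the train set only
scaler = StandardScaler().fit(X_train)

# a DataLoader cannot index a DataFrame row by row, so wrap each split in a TensorDataset
def to_dataset(features, labels):
    features = torch.from_numpy(scaler.transform(features).astype(np.float32))
    return TensorDataset(features, torch.from_numpy(labels).long())

train = to_dataset(X_train, y_train)
test = to_dataset(X_test, y_test)
valid = to_dataset(X_validate, y_validate)

# create the loaders for each of the sets
train_loader = DataLoader(dataset=train, batch_size=100, shuffle=True)
test_loader = DataLoader(dataset=test, batch_size=100, shuffle=False)
valid_loader = DataLoader(dataset=valid, batch_size=100, shuffle=False)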

### Part b 
Create a Python class for our neural network model. The network should have 1 input layer, 1 hidden layer, and 1 output layer. You are free to choose the size of the hidden layer (it may affect the performance). Use ReLU as the activation function (torch.relu).

In [13]:
import torch
import torch.nn as nn

# Every PyTorch network class extends the torch.nn.Module parent class.
# To define a network class, you need to define at least 2 methods: an __init__() method (constructor) and a forward() method
class SimpleNetwork(torch.nn.Module):
    # Create the network class by filling in this block of code

    # Create the constructor. Add any additional arguments as you wish
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNetwork, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.l2 = nn.Linear(hidden_size, num_classes)

    # Define the feed forward function.
    # x is the input example/examples.
    # Add any additional arguments as you wish.
    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        return out

### Part c 
Train the network using the training dataset. Use the SGD optimizer and CrossEntropyLoss. After each epoch, record the current loss and the current training accuracy. The current training accuracy is obtained by evaluating the model on the train dataset. Use the DataLoaders defined in part a to efficiently pass training and testing data.

You can learn about the available optimizers at:
https://pytorch.org/docs/stable/optim.html

You can learn about the available loss functions at:
https://pytorch.org/docs/stable/nn.html#loss-functions
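
One possible way to obtain the accuracy after each epoch is a small helper like the one below. This is only a sketch under assumptions: the name `accuracy` is hypothetical, the loader is assumed to yield `(features, integer labels)` batches, and the model is assumed to output one score per class.

```python
import torch

def accuracy(model, loader):
    """Fraction of examples in `loader` whose arg-max prediction matches the label."""
    model.eval()                      # evaluation mode for layers such as dropout
    correct = 0
    total = 0
    with torch.no_grad():             # no gradients are needed for evaluation
        for data, target in loader:
            pred = model(data).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.size(0)
    return correct / total
```

Inside the epoch loop, the value of `accuracy(model, train_loader)` could be appended to a list after each epoch, and the same helper could be reused with the validation and test loaders.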

In [16]:
import torch.nn as nn
import torch.optim as optim

LEARNING_RATE = 0.01
EPOCHS = 100
input_size = X_train.shape[1]    # one input unit per feature column
hidden_size = 10
num_classes = len(np.unique(y))  # one output unit per distinct label

# initialize the model before the loop
model = SimpleNetwork(input_size, hidden_size, num_classes)

# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)

# record the mean training loss of every epoch
train_losses = []
for epoch in range(EPOCHS):
    # Switch the model to training mode
    model.train()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        # We have to call zero_grad() on the optimizer to remove gradients from the previous data pass.
        # Otherwise, the gradients will be accumulated throughout many passes.
        optimizer.zero_grad()
        # Pass in the data and obtain the output.
        # When you pass the data directly by calling model(data), the model will internally pass the data through the forward() function.
        output = model(data)
        # Compare the output and the ground truth and calculate the loss.
        loss = criterion(output, target)
        # From the calculated loss, call backward() to calculate the gradients for all the parameters in the network.
        loss.backward()
        # Update the parameters according to the gradients.
        optimizer.step()
        running_loss += loss.item()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    train_losses.append(running_loss / len(train_loader))


Plot how the loss, the training accuracy, and the validation accuracy change over the epochs. Is there a point where overfitting occurs? If you cannot spot one, answer no. 

### Part d 
Evaluate the model on the test dataset. Print out the accuracy. Does this accuracy agree with the training accuracy shown on the plot?

## Problem 2

In this problem, we will investigate the effects of various common hyperparameters on the performance of a neural network. In the following cell, you can find a network class already defined for you. You can instantiate networks with different hyperparameters by changing the constructor's arguments.

You are graded based on how you implement and execute the experiments. Since there is some randomness in initializing and training a neural network, there is no guarantee that you will get an expected result for an experiment or that your results will be similar to those of your peers. The expected outcome is that you execute the experiments correctly and that the conclusions you draw are consistent with your results. For each experiment, try to run the code multiple times and record the average results, as we did in Homework 2 (it will take some time to run, as expected when training any neural network).
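
Averaging over repeated runs can be sketched as follows. The helper name `average_curves` and the accuracy numbers are hypothetical and only illustrate the bookkeeping: each inner list stands for the per-epoch accuracies of one run of the same configuration.

```python
from statistics import mean

def average_curves(curves):
    """Average several per-epoch accuracy curves element-wise (one value per epoch)."""
    return [mean(epoch_values) for epoch_values in zip(*curves)]

# Made-up accuracies from three runs, three epochs each (illustrative numbers only)
runs = [
    [0.50, 0.70, 0.80],
    [0.40, 0.60, 0.90],
    [0.60, 0.80, 0.70],
]
print(average_curves(runs))
```

The averaged curve, rather than any single run, is then what gets plotted and compared across configurations.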

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class mnist_network(torch.nn.Module):
    
    def __init__(self, num_hidden_layers=1, layer_size = 100, activation=None):
        super(mnist_network, self).__init__()
        # layers of the network
        self.num_hidden_layers = num_hidden_layers
        self.layer_size = layer_size
        self.activation = activation

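        # compare strings with ==, not `is` (identity checks on string literals are unreliable)
        if self.activation == 'relu':
            self.activation = F.relu
        elif self.activation == 'tanh':
            self.activation = torch.tanh
        elif self.activation == 'sigmoid':
            self.activation = torch.sigmoid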

        self.layers = nn.ModuleList([nn.Linear(784,self.layer_size)])
        for i in range(1, self.num_hidden_layers):
            self.layers.append(nn.Linear(self.layer_size,self.layer_size))
        self.layers.append(nn.Linear(self.layer_size,10))

    def forward(self, x):
        # converting each image into a vector
        batch_size = x.shape[0]
        x = x.reshape(batch_size,-1)
        # rest of the forward pass 
        for i in range(self.num_hidden_layers+1):
            x = self.layers[i](x)
            if(self.activation is not None):
                x = self.activation(x)
        return x

Run the following code to load the MNIST dataset. For the sake of simplicity, we do not have a validation set in this problem.

In [6]:
import torch
from torchvision import datasets, transforms

transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])

dataset1 = datasets.MNIST('../data', train=True, download=True,
                    transform=transform)
dataset2 = datasets.MNIST('../data', train=False,
                    transform=transform)

# DataLoader is a nice tool provided by PyTorch for passing training or testing examples
train_loader = torch.utils.data.DataLoader(dataset1, batch_size=64)
test_loader = torch.utils.data.DataLoader(dataset2, batch_size=1000)



Here is an example of training and testing a model:

In [7]:
import torch.nn as nn
import torch.optim as optim
import itertools

def train(model, criterion, train_loader, optimizer, epoch):
    # Switch the model to training mode (enables training-time behavior of layers such as dropout and batch norm)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We have to call zero_grad() on the optimizer to remove gradients from the previous data pass.
        # Otherwise, the gradients will be accumulated throughout many passes.
        optimizer.zero_grad()
        # Pass in the data and obtain the output.
        # When you pass the data directly by calling model(data), the model will internally pass the data through the forward() function.
        output = model(data)
        # Compare the output and the ground truth and calculate the loss.
        loss = criterion(output, target)
        # From the calculated loss, call backward() to calculate the gradients for all the parameters in the network.
        loss.backward()
        # Update the parameters according to the gradients. 
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, criterion, test_loader):
    # Switch the model to evaluation mode; gradient tracking is disabled separately by torch.no_grad() below
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            test_loss += criterion(output, target).item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

# Number of training epochs
epochs = 5
# Learning rate
lr = 0.01

# Create the model
model = mnist_network(num_hidden_layers=1,layer_size=20,activation='relu')

# Define the training and testing loss
train_criterion = nn.CrossEntropyLoss()
test_criterion = nn.CrossEntropyLoss(reduction='sum')

# Define the optimizer
# We have to specify the learning rate and the parameters that the optimizer should update during training.
# In this case, we specify that the optimizer should update all the parameters from our model.
optimizer = optim.SGD(model.parameters(),lr=lr)

for epoch in range(1, epochs + 1):
        # Training
        train(model, train_criterion, train_loader, optimizer, epoch)
        # Testing
        test(model, test_criterion, test_loader)



Test set: Average loss: 0.3486, Accuracy: 9003/10000 (90%)


Test set: Average loss: 0.2947, Accuracy: 9142/10000 (91%)


Test set: Average loss: 0.2707, Accuracy: 9193/10000 (92%)


Test set: Average loss: 0.2523, Accuracy: 9260/10000 (93%)


Test set: Average loss: 0.2373, Accuracy: 9298/10000 (93%)



### Part a

First, we will investigate the effect of varying the size of the hidden layer. Create 3 one-hidden-layer networks with the sizes of the hidden layers being 5, 20, 50, respectively. We will call these the 5-network, the 20-network, and the 50-network. All networks should use ReLU activation.

Train the 5-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

In [20]:
# Number of training epochs
epochs = 10
# Learning rate
lr = 0.001

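# Create the model
# NOTE: layer_size=20 builds the 20-network; set layer_size=5 to train the 5-network described above
model = mnist_network(num_hidden_layers=1,layer_size=20,activation='relu')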

# Define the training and testing loss
train_criterion = nn.CrossEntropyLoss()
test_criterion = nn.CrossEntropyLoss(reduction='sum')

# Define the optimizer
# We have to specify the learning rate and the parameters that the optimizer should update during training.
# In this case, we specify that the optimizer should update all the parameters from our model.
optimizer = optim.SGD(model.parameters(),lr=lr)

for epoch in range(1, epochs + 1):
        # Training
        train(model, train_criterion, train_loader, optimizer, epoch)
        # Testing
        test(model, test_criterion, test_loader)


Test set: Average loss: 1.8741, Accuracy: 4944/10000 (49%)


Test set: Average loss: 1.3716, Accuracy: 6460/10000 (65%)


Test set: Average loss: 0.9590, Accuracy: 7539/10000 (75%)


Test set: Average loss: 0.7893, Accuracy: 7910/10000 (79%)


Test set: Average loss: 0.7066, Accuracy: 8013/10000 (80%)


Test set: Average loss: 0.6572, Accuracy: 8078/10000 (81%)


Test set: Average loss: 0.6244, Accuracy: 8131/10000 (81%)


Test set: Average loss: 0.6009, Accuracy: 8172/10000 (82%)


Test set: Average loss: 0.5833, Accuracy: 8196/10000 (82%)


Test set: Average loss: 0.5694, Accuracy: 8223/10000 (82%)



Test the trained 5-network on the test data. Print out the accuracy.

Train the 20-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained 20-network on the test data. Print out the accuracy.

Train the 50-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained 50-network on the test data. Print out the accuracy.

Plot the training accuracies over the epochs of the networks on the same figure (there should be 3 line plots/scatter plots). 

What is your conclusion on the effect of varying the hidden layer size on the performance of a neural network trained on the MNIST dataset?

### Part b

Now, we will investigate the effect of varying the number of hidden layers. Create 3 networks with 1, 2, and 3 hidden layers, respectively. The size of all hidden layers should be 20 and the activation function is ReLU. We will call these the 1-network, the 2-network, and the 3-network.

Train the 1-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained 1-network on the test data. Print out the accuracy.

Train the 2-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained 2-network on the test data. Print out the accuracy.

Train the 3-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained 3-network on the test data. Print out the accuracy.

Plot the training accuracies over the epochs of the networks on the same figure (there should be 3 line plots/scatter plots). 

What is your conclusion on the effect of varying the number of hidden layers on the performance of a neural network trained on the MNIST dataset?

### Part c

Next, we will investigate the effects of varying the activation functions on a neural network. Create 3 networks. The first network has Sigmoid activation (Sigmoid-network). The second network has ReLU activation (ReLU-network). The third network has Tanh activation (Tanh-network). All networks have one hidden layer with size 20.

Train the Sigmoid-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained Sigmoid-network on the test data. Print out the accuracy.

Train the ReLU-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained ReLU-network on the test data. Print out the accuracy.

Train the Tanh-network on the MNIST dataset for 10 epochs with learning rate 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained Tanh-network on the test data. Print out the accuracy.

Plot the training accuracies over the epochs of the networks on the same figure (there should be 3 line plots/scatter plots). 

What is your conclusion on the effect of varying the activation functions on the performance of a neural network trained on the MNIST dataset?

### Part d

Finally, we will look into the effect of varying the value of the learning rate on the performance of a neural network. Create a network with one hidden layer of size 20 and ReLU activation.

Train the network on the MNIST dataset for 10 epochs. Set the learning rate to be 0.1. After each epoch, record the current training accuracy of the network. 

Test the trained network on the test data. Print out the accuracy.

Train the network on the MNIST dataset for 10 epochs. Set the learning rate to be 0.01. After each epoch, record the current training accuracy of the network. 

Test the trained network on the test data. Print out the accuracy.

Train the network on the MNIST dataset for 10 epochs. Set the learning rate to 0.001. After each epoch, record the current training accuracy of the network. 

Test the trained network on the test data. Print out the accuracy.

Plot the training accuracies over the epochs of the scenarios on the same figure (there should be 3 line plots/scatter plots). 

What is your conclusion on the effect of varying the learning rate on the performance of a neural network?

## REMARK for Problem 2

You have observed the effects of varying different hyperparameters on the performance of a neural network **on the MNIST dataset**. However, keep in mind that these trends only apply to **the MNIST dataset** and should not be carried over to another problem. There is no single hyperparameter setting that works for all problems. As you work on more problems, you will build up your intuition about the hyperparameters so that you can quickly deploy a good model. For example, people have observed that a learning rate of 0.001 often works well, though this is not always the case.

## Problem 3

Experimenting with **k-anonymity, l-diversity, and t-closeness**. 

Consider a dataset with, for example, 3 ordinary attributes and 1 sensitive attribute. Let the 3 ordinary attributes be Age, Sex, and Education and the sensitive attribute be Income. Each row in this dataset is of the form:

$$
    [Age, Sex, Education, Income]
$$

A hacker is interested in knowing the sensitive attribute Income. When the dataset is designed so that it complies with **k-anonymity**, **l-diversity**, and/or **t-closeness**, the hacker may not be able to retrieve the sensitive information accurately even if he or she somehow figures out the values of the three ordinary attributes. In general, **k-anonymity** is weaker than **l-diversity**, which, in turn, is weaker than **t-closeness**.

By definition, **k-anonymity** means that every combination of Age, Sex, and Education that appears in the table is shared by at least **k** different rows. For example, suppose the hacker knows that the person of interest has Age = 31, Sex = Female, and Education = BS. He or she looks into the data table and finds that there are 3 rows with that combination:

$$
    [Age=31, Sex=Female, Education=BS, Income=300k]
$$
$$
    [Age=31, Sex=Female, Education=BS, Income=70k]
$$
$$
    [Age=31, Sex=Female, Education=BS, Income=20k]
$$

The hacker cannot tell accurately what the income of the person is because it can be any one of the 3 values shown. This particular combination of information has 3-anonymity. If every combination corresponds to at least 3 rows, then the dataset has 3-anonymity.
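
The definition can be sketched in a few lines of plain Python. The helper name `k_anonymity` and the toy rows are hypothetical (they extend the example above with one extra made-up combination and are not taken from any real table): count how many rows share each combination of ordinary attributes and take the smallest count.

```python
from collections import Counter

def k_anonymity(rows, ordinary_attributes):
    """Size of the smallest group of rows sharing the same ordinary-attribute values."""
    groups = Counter(tuple(row[a] for a in ordinary_attributes) for row in rows)
    return min(groups.values())

# Toy rows, made up for illustration
rows = [
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "300k"},
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "70k"},
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "20k"},
    {"Age": 45, "Sex": "Male", "Education": "HS", "Income": "50k"},
    {"Age": 45, "Sex": "Male", "Education": "HS", "Income": "55k"},
]
print(k_anonymity(rows, ["Age", "Sex", "Education"]))
```

Here the first combination is shared by 3 rows and the second by only 2, so the toy dataset as a whole has 2-anonymity. On a pandas DataFrame the same idea can be written as `df.groupby(columns).size().min()`.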

a) Let's look at the dataset **"table.csv"**. Let the sensitive attribute be **education** and the others be ordinary attributes. Calculate the anonymity of the dataset (the value **k**). First, find all the possible combinations of the ordinary attributes that exist in the dataset. After that, determine the anonymity for each combination. The anonymity of the dataset is the smallest anonymity among the combinations.

In [21]:
table = pd.read_csv('table.csv')
table

Unnamed: 0.1,Unnamed: 0,age,education,race,sex
0,0,39,Bachelors,White,Male
1,1,50,Bachelors,White,Male
2,2,38,HS-grad,White,Male
3,3,53,11th,Black,Male
4,4,28,Bachelors,Black,Female
...,...,...,...,...,...
32556,32556,27,Assoc-acdm,White,Female
32557,32557,40,HS-grad,White,Male
32558,32558,58,HS-grad,White,Female
32559,32559,22,HS-grad,White,Male


In [22]:
print('Unique Age Range Counts: '+str(table['age'].value_counts()))
print('Unique Race Counts: '+str(table['race'].value_counts()))
print('Unique Sex Range Counts: '+str(table['sex'].value_counts()))

Unique Age Range Counts: 36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: age, Length: 73, dtype: int64
Unique Race Counts: White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Unique Sex Range Counts: Male      21790
Female    10771
Name: sex, dtype: int64


# COMMENT
- Based on the value counts of the Age column above, two ages (86 and 87) occur only once.
- Since Age is an ordinary attribute, any combination containing one of these ages matches exactly one row, so the k-anonymity of this dataset is 1.

We can improve the **k-anonymity** of the dataset by "suppressing" the ordinary attributes. Suppressing means reducing the resolution of the attribute's value. For this problem, let's suppress Age by replacing the exact age with an age range. For example, instead of leaving age = 32, replace it with age = 30-40. Apply this to **"table.csv"** with the ranges {<20, 20-30, 30-50, >50}. Check if the anonymity improves. 

In [None]:
# define a simple function that can be applied to the table
def age_range(age):
    if age < 20:
        return '<20'
    elif age <= 30:
        return '20-30'
    elif age <= 50:
        return '30-50'
    else:
        return '>50'

table['age'] = table['age'].apply(age_range)  
table.head(10)

Unnamed: 0.1,Unnamed: 0,age,education,race,sex
0,0,30-50,Bachelors,White,Male
1,1,30-50,Bachelors,White,Male
2,2,30-50,HS-grad,White,Male
3,3,>50,11th,Black,Male
4,4,20-30,Bachelors,Black,Female
5,5,30-50,Masters,White,Female
6,6,30-50,9th,Black,Female
7,7,>50,HS-grad,White,Male
8,8,30-50,Masters,White,Female
9,9,30-50,Bachelors,White,Male


In [None]:
# running the same code from earlier to check value counts
print('Unique Age Range Counts: '+str(table['age'].value_counts()))
print('Unique Race Counts: '+str(table['race'].value_counts()))
print('Unique Sex Range Counts: '+str(table['sex'].value_counts()))

Unique Age Range Counts: 30-50    15529
20-30     8915
>50       6460
<20       1657
Name: age, dtype: int64
Unique Race Counts: White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Unique Sex Range Counts: Male      21790
Female    10771
Name: sex, dtype: int64


# comment
- I now see that the smallest value in this dataset is 271 which is the number of people that categorize as an "Other" race.

In [None]:
other = table[table['race'] == 'Other']
print('Unique Age Range Counts: '+str(other['age'].value_counts()))
print('Unique Sex Range Counts: '+str(other['sex'].value_counts()))

Unique Age Range Counts: 20-30    114
30-50    112
>50       27
<20       18
Name: age, dtype: int64
Unique Sex Range Counts: Male      162
Female    109
Name: sex, dtype: int64


# comment
- now, the smallest value count is 18 which corresponds to those who categorize as "other" race and younger than 20 years old

In [None]:
youngest = other[other['age'] == '<20']
print('Unique Sex Range Counts: '+str(youngest['sex'].value_counts()))

Unique Sex Range Counts: Female    13
Male       5
Name: sex, dtype: int64


# comment
- the smallest value now is 5 which corresponds to males under 20 that identify as an "other" race.

In [None]:
youngest[youngest['sex'] == 'Male']

Unnamed: 0.1,Unnamed: 0,age,education,race,sex
9155,9155,<20,HS-grad,Other,Male
11516,11516,<20,12th,Other,Male
11727,11727,<20,11th,Other,Male
17474,17474,<20,7th-8th,Other,Male
18552,18552,<20,Some-college,Other,Male


# comment
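- Looking at the output above, 5 rows remain after filtering down to one specific combination of the 3 ordinary attributes (age `<20`, race `Other`, sex `Male`).
- Hence the k-anonymity found this way is 5. Because this drill-down only follows the smallest count at each step, 5 is an upper bound on k; counting the rows of every combination would confirm it.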

**K-anonymity** is useful, but it fails in many cases. If the rows that share a combination of ordinary attributes have only a few distinct values for the sensitive attribute, then it is not much better than having no anonymity at all. For example, consider:

$$
    [Age=31, Sex=Female, Education=BS, Income=300k]
$$
$$
    [Age=31, Sex=Female, Education=BS, Income=20k]
$$
$$
    [Age=31, Sex=Female, Education=BS, Income=20k]
$$
$$
    [Age=31, Sex=Female, Education=BS, Income=20k]
$$

When **k-anonymity** fails, as in the example above, **l-diversity** comes to the rescue. **L-diversity** requires that the rows sharing a particular combination of ordinary attributes have at least l different values for the sensitive attribute. The above example has only 2-diversity, which is not good. 
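
As a minimal sketch of this definition (the helper name `l_diversity` is hypothetical, and the rows simply restate the four-row example above), collect the distinct sensitive values seen for each combination of ordinary attributes and take the smallest number of distinct values:

```python
from collections import defaultdict

def l_diversity(rows, ordinary_attributes, sensitive):
    """Smallest number of distinct sensitive values within any group of matching rows."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in ordinary_attributes)
        values[key].add(row[sensitive])
    return min(len(v) for v in values.values())

# The four rows from the example above
rows = [
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "300k"},
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "20k"},
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "20k"},
    {"Age": 31, "Sex": "Female", "Education": "BS", "Income": "20k"},
]
print(l_diversity(rows, ["Age", "Sex", "Education"], "Income"))
```

The group contains 4 rows but only 2 distinct incomes, which is why it has 4-anonymity yet only 2-diversity. On a pandas DataFrame, `df.groupby(columns)[sensitive].nunique().min()` expresses the same computation.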

b) Calculate the **l-diversity** of the dataset **"table.csv"**. Follow similar steps as in part a. 

Suppressing an attribute can also improve the **l-diversity** of the dataset. Repeat the suppression as in **part a** and check if the diversity improves. If it does not, consider suppressing age further by using the ranges {<20, 20-50, >50}.

**T-closeness** is even stronger than **l-diversity**. **T-closeness** requires that, for every combination of ordinary attributes, the distribution of the sensitive attribute's values among the corresponding rows be close to the overall distribution of the sensitive attribute's values in the whole dataset. The distance between distributions is calculated using the Earth Mover's Distance (EMD). The dataset has **t-closeness** if no distance exceeds **t**. 

c) Calculate the overall distribution of **education**. Find the **t-closeness** of the dataset (the largest distance between any combination's distribution of **education** and the overall distribution).

You can use **scipy.stats.wasserstein_distance** to calculate the EMD.
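
The calculation can be sketched in plain Python under one loudly stated assumption: the categories of the sensitive attribute are placed at equally spaced, ordered positions 0, 1, 2, ... (how to order education levels is a modelling choice that is not specified here). Under that assumption the one-dimensional EMD equals the sum of absolute differences between the two cumulative distributions, which is the same quantity `scipy.stats.wasserstein_distance` computes when given those positions as values and the two distributions as weights. All names and data below are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def distribution(values, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]

def emd_1d(p, q):
    """EMD between two distributions over the same equally spaced, ordered positions."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def t_closeness(groups, categories):
    """Largest EMD between any group's distribution and the overall distribution."""
    overall = distribution([v for g in groups.values() for v in g], categories)
    return max(emd_1d(distribution(g, categories), overall) for g in groups.values())

# Made-up ordered categories and groups (not taken from table.csv)
categories = ["HS", "BS", "MS"]
groups = {
    ("<20", "Male"): ["HS", "HS"],
    ("20-30", "Female"): ["BS", "BS", "MS", "MS"],
}
print(t_closeness(groups, categories))
```

In this toy case the first group holds only "HS" while the overall distribution is uniform over the three levels, so that group is the farthest from the overall distribution and determines t.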