<a href="https://colab.research.google.com/github/aryanfaghihi/ai-course/blob/master/Tabular.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup


## Make a copy

**Follow the following steps to make your own copy. You will lose your changes if you don't make your own copy!**

From the tooltip above, you should see 

File | Edit | View | Insert | Runtime | Tools | Help

1. Click on **File**
2. Click on **Save a copy in Drive**

## Import

Let's import the packages we need for today. We will be using [`fastai`](https://www.fast.ai/) to create our AI model. This library makes it easy to get started. As the creators of `fastai` have put it:

> `fastai` - Making neural nets uncool again!



In [1]:
from fastai.tabular import *

# Data

## Looking at data

Today we will be exploring a dataset called `Adult Data Set`.

> The aim is to predict whether income exceeds $50K/yr based on census data from the US

The dataset has the following attributes

Attribute name | Attribute type
--- | ---
age | continuous (number)
workclass | 8 categories
fnlwgt | continuous (number)
education | 16 categories
education-num | continuous (number)
marital-status | 7 categories
occupation | 14 categories
relationship | 6 categories
race | 5 categories
sex | 2 categories
capital-gain | continuous (number)
capital-loss | continuous (number)
hours-per-week | continuous (number)
native-country | 41 categories
salary | 2 categories


In [2]:
# Downloading and unzipping the dataset
path = untar_data(URLs.ADULT_SAMPLE)

# Reading the csv dataset
data = pd.read_csv(path/'adult.csv')

---

⭐ Print the first few rows of the data. 

---

Hint: you can use the [`head` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html).

In [3]:
# print the first few rows of `data`
### YOUR CODE HERE



## Categorical vs. Continuous 

This is a very important concept for both Tabular data as well as other Deep Learning models. In short:
* **Categorical** variables contain a *finite* number of categories or distinct groups.
* **Continuous** variables are numeric variables that have an *infinite* number of values.

Examples
* **Categorical**: Race
* **Continuous**: Age






---

⭐ Can you think of any other examples for Categorical and Continuous variables?

---

## Categorical Variables

Ok, let's look at some of the features of categorical data in our dataset. You can use the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to count the number of each category in your categorical data.

In [4]:
data['salary'].value_counts()

<50k     24720
>=50k     7841
Name: salary, dtype: int64

Here, we can see that most people have a salary of less than 50K in the US according to this dataset.

---

⭐ Choose another categorical variable from `data` and count its categories.

---

In [5]:
### Count the number of categories in another categorical variable
### YOUR CODE HERE


# Hint: you can refer to the table above to find another variable name

## Continuous Variables

Let's now look at one of the continuous variables in our dataset.

In [6]:
data['age'].describe()

count    32561.000000
mean        38.581647
std         13.640433
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

Oh wow, ok let's unpack what all of these stats mean. This might seem a little abstract and useless but don't forget that as a Deep Learning engineer, you will spend at least half of your time on your datasets! So knowing how to find the information you need from your data is an essential skill that you must have.

Stat | Meaning
--- | ---
mean | average
std | standard deviation. It measures how much your data is spread around the average.
min | minimum
25% | 25 percentile. 25% of your data values fall below this value
50% | 50 percentile
75% | 75 percentile
max | maximum


---

⭐ Choose another continuous variable from `data` and [`descibre()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) it

---

In [7]:
### Describe() another continuous variable
### YOUR CODE HERE


# Hint: you can refer to the table above to find another variable name

## Summary

* We looked at how we can import a dataset
* We discussed the difference between categorical and continuous variables
* We explored some basic ways of exploring our dataset

# Model

## Training and Validation dataset

Wait, I thought we're ready to train our model! Don't worry, we're almost there, I promise. 

* **Training** dataset is the subset of your data that is used to **train** your model.
* **Validation** dataset is the subset of your data that is set aside to **evaluate** your model's performance after training it

---

⭐ Why is it important to have a validation set that is separate from the training set?

---

In practice, usually we use 80% of the original dataset for **training** and 20% for **validation**.



In [8]:
# Here we are using 80% split
# you can change this number and experiment
training_split = 0.8

# Find the indices for training set
train_idx_start = 0
train_idx_end = int(training_split*len(data))

# Find the indices for validation set
valid_idx_start = train_idx_end
valid_idx_end = len(data)
valid_idx = range(valid_idx_start, valid_idx_end) # this will be useful for us later

print('Split percentage used:', training_split)
print('Training dataset indices', train_idx_start, '-', train_idx_end)
print('Validation dataset indices', valid_idx_start, '-', valid_idx_end)

Split percentage used: 0.8
Training dataset indices 0 - 26048
Validation dataset indices 26048 - 32561


## Dataloader

In this section, we will train a Deep Learning model to predict the output of one of the columns based on the other columns. For example, we can use the data in the `education`, `occupation` and `workclass` to predict the `salary`. This particularly example can be written as:

> `salary` ~ `education` + `occupation` + `workclass`

* **Dependent** variable: `salary`
* **Independent** variables: `education` `occupation` `workclass`

Let's use this example to demonstrate how you can use `fastai` to train a model using these variables. Let's start with assigining these column names into variable to be used later.

In [9]:
# DEPEDENT VARIABLE
dependent_variable = 'salary'

# INDEPENDENT VARIABLES
## Categorical
categorical_variables = ['workclass', 'education', 'occupation']

## Continuous
continuous_variables = []

Notice that `continuous_variables` is empty. This is because none of the independent variables that we chose for this particular example are continuous. However, if you do have continuous variables, you must append them to `continuous_variables` list instead.

Below, we will use the built-in functionality of `fastai` to import the data for our model's use. But why? Up until this point, we've been working with data by using Python's built-in functionality as well as `pandas`. But now that we have to train a Deep Learning model, we have to pass the data into `fastai`. This is done through `dataloaders`. These functions will take care of all the complexity of data pipelines for training your model.

In [10]:
## DATALOADER
dataloader = TabularDataBunch.from_df(path,                             # path to csv file
                                      data,                             # original data
                                      dependent_variable,               # chosen dependent variable
                                      cont_names=continuous_variables,  # continuous variables
                                      cat_names=categorical_variables,  # continuous variables
                                      valid_idx=valid_idx,              # indices for validation set
                                      procs=[Categorify])

You don't really need to know what all of these parameters do. In fact, even actual Deep Learning engineers look these up all the time. It's more important to know the high-level concepts like Training and Validation sets and differences between categorical and continuous variables.

---
🤓 Advanced (only for those interested)

But for those of you who want to know more, listed below is a brief descrpition of each option
* `cat_names`: specify categorical variables (names)
* `cont_names`: specify continuous variablese (names)
* `procs`: list of pre-processes to be done data. Here we are passing `Categorify` which turns the category values into numbers because computers can only work with numbers. There is a range of `procs` which we will cover in this course.
* `valid_idx`: list of indices (rows) for to be used for validation set. This means `fastai` will take care of splitting the data for us.
---

Last thing before we train our model, I promise. 

It is **always** a good practice to look at a batch of your dataloader to make sure everything looks right. That is, make sure the correct columns have been selected and more importantly, the correct dependent variable (or `target`) has been selected.

In [11]:
dataloader.show_batch()

workclass,education,occupation,target
Private,Some-college,#na#,<50k
Private,HS-grad,Exec-managerial,>=50k
Self-emp-not-inc,Assoc-acdm,Sales,>=50k
Private,Bachelors,Exec-managerial,>=50k
Private,12th,Handlers-cleaners,<50k


## Training

Ok so, this is what we have done so far:

- [x] Looked at our dataset
- [x] Learnt the difference between categorical and continuous variables
- [x] Training and validation dataset

We are finally ready to train our model. 

In [12]:
learn = tabular_learner(dataloader, layers=[200,100], metrics=accuracy)
learn.fit(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.468965,0.455689,0.792876,00:04
1,0.469144,0.458092,0.79088,00:04
2,0.45708,0.456143,0.791647,00:04
3,0.457336,0.451468,0.794872,00:04
4,0.456568,0.452055,0.791033,00:04


🎉 We just trained our first model!

You should be getting around 80% accuracy. What does this mean? Well it means that you have been able to predict `salary` correctly 80% of the time using your independent categorical and continuous variables.

Ok but what about `layers`?
> We will cover it throughout the course but for now you can think of them as how *powerful your model is*. The higher these values the better. But of course, with great power comes great resposibility and we will discuss those in the next sections.

What is `epoch`?
> Epoch is defined as the number of times you train your model. In other words, it is the number of times your model *learns* from each data point. We cover this in more detail in the next section. For now you can stick with 5-10 epochs for training your models.

## Testing

Ok so we just covered how you can train your model, and we saw how you can get 80% accuracy. But wouldn't it better if you could just make up your own data and pass it through the model to see what it predicts? That is what we're going to be doing now.

Let's remind ourselves what columns we used and what each of them contains

In [13]:
print('workclass values:\n', data['workclass'].unique())    # display unique categories for workclass
print('education values:\n', data['education'].unique())    # display unique categories for education
print('occupation values:\n', data['occupation'].unique())  # display unique categories for occupation

workclass values:
 [' Private' ' Self-emp-inc' ' Self-emp-not-inc' ' State-gov' ' Federal-gov' ' Local-gov' ' ?' ' Without-pay'
 ' Never-worked']
education values:
 [' Assoc-acdm' ' Masters' ' HS-grad' ' Prof-school' ' 7th-8th' ' Some-college' ' 11th' ' Bachelors' ' Assoc-voc'
 ' 10th' ' 9th' ' Doctorate' ' 12th' ' 1st-4th' ' 5th-6th' ' Preschool']
occupation values:
 [nan ' Exec-managerial' ' Prof-specialty' ' Other-service' ' Handlers-cleaners' ' Craft-repair' ' Adm-clerical'
 ' Sales' ' Machine-op-inspct' ' Transport-moving' ' ?' ' Farming-fishing' ' Tech-support' ' Protective-serv'
 ' Priv-house-serv' ' Armed-Forces']


In [14]:
test_data = data.iloc[0].copy()
# Change the values below based on possible values above
# Explore the changes in model predictions
test_data['workclass'] = 'Without-pay'
test_data['education'] = 'Masters'
test_data['occupation'] = 'Tech-support'

In [15]:
learn.predict(test_data)

(Category tensor(0), tensor(0), tensor([0.7033, 0.2967]))

Let's interpret the output here: 
* `Category tensor(0)` means <50k
* `Category tensor(1)` means >=50k
* `tensor([a, b])`
  * `a` is the probability of <50k salary
  * `b` is the probability of >=50k salary

You noticed that we said **probability**, this is because all deep learning models are based on probability, they can never be 100% sure about any predction. The best we can do is 99.999999...%!

---

⭐ Try changing `workclass` or `education` or `occupation` values above to see how the model prediction changes. See if it makes sense to you. Chances are sometimes you will get values that don't make sense and that just means our model is not perfect and it makes mistakes.

---

# Your model

---

⭐  Now is your turn to use the template below to pick your own categorical and continuous variables to train your own model.

---

This table might come in handy

Variable Name | Variable Type
--- | ---
age | continuous (number)
workclass | 8 categories
fnlwgt | continuous (number)
education | 16 categories
education-num | continuous (number)
marital-status | 7 categories
occupation | 14 categories
relationship | 6 categories
race | 5 categories
sex | 2 categories
capital-gain | continuous (number)
capital-loss | continuous (number)
hours-per-week | continuous (number)
native-country | 41 categories
salary | 2 categories

In [16]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [17]:
# DEPEDENT VARIABLE (don't change this)
dependent_variable = 'salary'

############################
#### PICK YOUR OWN VARIABLES
############################

# INDEPENDENT VARIABLES
## Categorical
categorical_variables = ['education']

## Continuous
continuous_variables = []

############################
#### STOP HERE
############################

## DATALOADER
dataloader = TabularDataBunch.from_df(path,                             # path to csv file
                                      data,                             # original data
                                      dependent_variable,               # chosen dependent variable
                                      cont_names=continuous_variables,  # continuous variables
                                      cat_names=categorical_variables,  # continuous variables
                                      valid_idx=valid_idx,              # indices for validation set
                                      procs=[FillMissing, Categorify, Normalize])


learn = tabular_learner(dataloader, layers=[200,100], metrics=accuracy)
learn.fit(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.49593,0.489674,0.779057,00:04
1,0.490456,0.482583,0.785199,00:04
2,0.498849,0.486408,0.785199,00:04
3,0.494893,0.480733,0.785199,00:04
4,0.500412,0.482035,0.785199,00:04


---

⭐  Once you're confident, you can also play around with `layers` to see the difference in training.

---

# Summary

* Data
  * Different ways of exploring your dataset
  * Categorical Variables
    * finite number of categories or distinct groups
  * Continuous Variables
    * numeric variables that have an infinite number of values
* Model
  * Training dataset
    * A subset of your original dataset used by the model to learn
  * Validation dataset
    * A subset of your original dataset used to assess the true accuracy of your model
  * Dataloader
    * Creates the connection between your data and your model
  * Epochs
    * The number of times your model learns from each data point
  * Testing (inference)
    * Also known as inference which is where you can create your own data to see how your model performs