<a href="https://colab.research.google.com/github/aryanfaghihi/ai-course/blob/master/Tabular.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup


## Make a copy

**Follow the steps below to make your own copy. You will lose your changes if you don't make your own copy!**

In the menu bar above, you should see

File | Edit | View | Insert | Runtime | Tools | Help

1. Click on **File**
2. Click on **Save a copy in Drive**

## Import

Let's import the packages we need for today. We will be using [`fastai`](https://www.fast.ai/) to create our AI model. This library makes it easy to get started. As the creators of `fastai` have put it:

> `fastai` - Making neural nets uncool again!



In [51]:
from fastai.tabular import *

# Data

## Looking at data

Today we will be exploring a dataset called `Adult Data Set`.

> The aim is to predict whether income exceeds $50K/yr based on census data from the US

The dataset has the following attributes

Attribute name | Attribute type
--- | ---
age | continuous (number)
workclass | 8 categories
fnlwgt | continuous (number)
education | 16 categories
education-num | continuous (number)
marital-status | 7 categories
occupation | 14 categories
relationship | 6 categories
race | 5 categories
sex | 2 categories
capital-gain | continuous (number)
capital-loss | continuous (number)
hours-per-week | continuous (number)
native-country | 41 categories
salary | 2 categories


In [52]:
# Downloading and unzipping the dataset
path = untar_data(URLs.ADULT_SAMPLE)

# Reading the csv dataset
data = pd.read_csv(path/'adult.csv')

---

⭐ Print the first few rows of the data. 

---

Hint: you can use the [`head` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html).

In [53]:
# print the first few rows of `data`
### YOUR CODE HERE
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


## Categorical vs. Continuous 

This is a very important concept for tabular data as well as for other Deep Learning models. In short:
* **Categorical** variables take one of a *finite* number of categories or distinct groups.
* **Continuous** variables are numeric variables that can take any value within a range, so the number of possible values is *infinite*.

Examples
* **Categorical**: Race
* **Continuous**: Age
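
One rough way to sort columns into the two groups is to look at the type of their values. The sketch below is only a heuristic, run on a few values copied from the preview above; the helper name `guess_kind` is made up for illustration. Numbers are treated as continuous and everything else as categorical.

```python
# A few values copied from the preview of the dataset above.
rows = [
    {"age": 49, "workclass": "Private", "hours-per-week": 40, "sex": "Female"},
    {"age": 44, "workclass": "Private", "hours-per-week": 45, "sex": "Male"},
    {"age": 38, "workclass": "Self-emp-inc", "hours-per-week": 40, "sex": "Male"},
]

def guess_kind(values):
    """Heuristic (hypothetical helper): all-numeric columns are continuous."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "continuous" if is_number else "categorical"

# Apply the heuristic to every column of the toy rows.
kinds = {name: guess_kind([row[name] for row in rows]) for name in rows[0]}
print(kinds)
```

The heuristic can be wrong: a numeric code such as a postcode is really categorical, so it is worth checking each column by hand.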






---

⭐ Can you think of any other examples for Categorical and Continuous variables?

---

## Categorical Variables

Ok, let's look at some of the categorical data in our dataset. You can use the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method to count how many rows fall into each category of a categorical variable.

In [54]:
data['salary'].value_counts()

<50k     24720
>=50k     7841
Name: salary, dtype: int64

Here we can see that, according to this dataset, most people in the US have a salary of less than 50K.
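
Raw counts are easier to compare once they are turned into proportions. The sketch below redoes that arithmetic in plain Python using the two counts printed above; in pandas, `value_counts(normalize=True)` should give the same fractions directly.

```python
# Counts copied from the `value_counts()` output above.
counts = {"<50k": 24720, ">=50k": 7841}

total = sum(counts.values())  # total number of rows
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Roughly three in four rows fall in the `<50k` group, so a model that always answers `<50k` would already be right about 76% of the time on the full dataset. That is a useful baseline to keep in mind when judging a model's accuracy.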

---

⭐ Choose another categorical variable from `data` and count its categories.

---

In [55]:
### Count the number of categories in another categorical variable
### YOUR CODE HERE


# Hint: you can refer to the table above to find another variable name

## Continuous Variables

Let's now look at one of the continuous variables in our dataset.

In [56]:
data['age'].describe()

count    32561.000000
mean        38.581647
std         13.640433
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

Oh wow, ok let's unpack what all of these stats mean. This might seem a little abstract and useless but don't forget that as a Deep Learning engineer, you will spend at least half of your time on your datasets! So knowing how to find the information you need from your data is an essential skill that you must have.

Stat | Meaning
--- | ---
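count | number of non-missing values
mean | average
std | standard deviation. It measures how much your data is spread around the average.
min | minimum
25% | 25th percentile. 25% of your data values fall below this value.
50% | 50th percentile (the median). 50% of your data values fall below this value.
75% | 75th percentile. 75% of your data values fall below this value.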
max | maximum
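
To make these stats concrete, the sketch below computes them by hand with Python's built-in `statistics` module on a short list of made-up ages (not taken from the dataset). It uses the sample standard deviation and linear interpolation for the percentiles, which, as far as I know, matches what `describe()` does by default.

```python
import statistics

# Made-up ages, already sorted so the percentiles are easy to check by eye.
ages = [22, 25, 31, 38, 45, 52, 60]

mean = statistics.mean(ages)    # average
std = statistics.stdev(ages)    # sample standard deviation
# method="inclusive" interpolates linearly between neighbouring data points
q25, q50, q75 = statistics.quantiles(ages, n=4, method="inclusive")

print(f"mean={mean}, std={std:.2f}")
print(f"min={min(ages)}, 25%={q25}, 50%={q50}, 75%={q75}, max={max(ages)}")
```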


---

⭐ Choose another continuous variable from `data` and [`describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) it.

---

In [57]:
### Describe() another continuous variable
### YOUR CODE HERE


# Hint: you can refer to the table above to find another variable name

## Summary

* We looked at how we can import a dataset
* We discussed the difference between categorical and continuous variables
* We tried some basic ways of summarizing our dataset

# Model

## Training and Validation dataset

Wait, I thought we're ready to train our model! Don't worry, we're almost there, I promise. 

* The **training** dataset is the subset of your data that is used to **train** your model.
* The **validation** dataset is the subset of your data that is set aside to **evaluate** your model's performance after training it.

---

⭐ Why is it important to have a validation set that is separate from the training set?

---

In practice, we usually use 80% of the original dataset for **training** and 20% for **validation**.



In [None]:
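# here we are using 80% for training, but you can change this number and experiment
training_split = 0.8
# number of rows that go into the training set
training_portion = int(len(data) * training_split)
# the remaining rows are set aside for validation
validation_portion = len(data) - training_portion

# the first `training_portion` rows are used for training, the rest for validation
training_set = data.iloc[:training_portion]
validation_set = data.iloc[training_portion:]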
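
Taking the first 80% of rows only works well if the rows are in no particular order. A common alternative is to shuffle the row indices first and then cut. The sketch below shows that idea in plain Python; the seed value 42 is an arbitrary choice, and the row count is the one reported by `describe()` earlier.

```python
import random

n_rows = 32561          # row count reported by `describe()` earlier
training_split = 0.8

# Shuffle the row indices with a fixed seed so the split is repeatable.
indices = list(range(n_rows))
random.Random(42).shuffle(indices)

# The first 80% of the shuffled indices go to training, the rest to validation.
cut = int(n_rows * training_split)
train_idx, valid_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(valid_idx))
```

With a pandas DataFrame, `data.iloc[train_idx]` and `data.iloc[valid_idx]` would then select the two subsets.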

## Architecture

In this section, we will train a Deep Learning model to predict the value of one column based on the other columns. For example, we can use the data in the `education`, `occupation` and `workclass` columns to predict the `salary`. This particular example can be written as:

> `salary` ~ `education` + `occupation` + `workclass`

* **Dependent** variable: `salary`
* **Independent** variables: `education` `occupation` `workclass`

Let's use this example to demonstrate how you can use `fastai` to train a model using these variables. Let's start by assigning these column names to variables to be used later.

In [66]:
dependent_variable = 'salary'
categorical_variables = ['workclass', 'education', 'occupation']
continuous_variables = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

In [67]:
test = TabularList.from_df(data.iloc[800:1000].copy(), cat_names=categorical_variables)

In [68]:
train = (TabularList.from_df(data, cat_names=categorical_variables, procs=[Categorify])
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dependent_variable)
                           .add_test(test)
                           .databunch())

In [69]:
train.show_batch(rows=10)

workclass,education,occupation,target
Private,11th,Craft-repair,<50k
Private,11th,#na#,<50k
Private,Assoc-voc,Exec-managerial,>=50k
Self-emp-inc,11th,Sales,<50k
Private,Bachelors,Other-service,>=50k
Private,10th,Craft-repair,<50k
Private,Some-college,Adm-clerical,>=50k
Private,Bachelors,Sales,>=50k
?,Some-college,?,<50k
Private,Some-college,Exec-managerial,<50k
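
Neural networks work on numbers, so each category has to be turned into an integer before training. The sketch below illustrates the general idea behind a step like `Categorify` in plain Python. It is not fastai's implementation: the helper name `build_codes` and the choice of index 0 for `#na#` are assumptions made for illustration.

```python
def build_codes(values, na_label="#na#"):
    """Map each distinct category to an integer, reserving 0 for missing values."""
    categories = sorted({v for v in values if v is not None})
    mapping = {na_label: 0}
    for code, category in enumerate(categories, start=1):
        mapping[category] = code
    return mapping

# Occupation values from the batch above; None stands in for a missing entry.
occupations = ["Craft-repair", None, "Exec-managerial", "Sales", "Other-service"]

mapping = build_codes(occupations)
encoded = [mapping["#na#" if v is None else v] for v in occupations]

print(mapping)
print(encoded)
```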


In [70]:
learn = tabular_learner(train, layers=[200,100], metrics=accuracy)

In [71]:
learn.fit(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.467374,0.458308,0.815,00:05
1,0.464918,0.464543,0.815,00:05
2,0.464383,0.466167,0.795,00:05
3,0.463887,0.460155,0.82,00:05


KeyboardInterrupt: ignored
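
The `accuracy` column is the fraction of validation rows that the model labels correctly. The sketch below shows that calculation on made-up predictions; the helper `fraction_correct` is illustrative and is not fastai's function.

```python
def fraction_correct(predictions, targets):
    """Share of predictions that match the true labels."""
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)

# Made-up labels for four rows: three predictions match, one does not.
predictions = ["<50k", ">=50k", "<50k", "<50k"]
targets = ["<50k", ">=50k", ">=50k", "<50k"]

print(fraction_correct(predictions, targets))

# The validation set above holds 200 rows (indices 800 to 999),
# so an accuracy of 0.815 corresponds to about 163 correctly labelled rows.
print(round(0.815 * 200))
```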

## Inference

In [None]:
# take the first row of the dataset as an example input
row = data.iloc[0]

In [None]:
learn.predict(row)

(Category >=50k, tensor(1), tensor([0.4402, 0.5598]))
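
The result has three parts: the predicted category, its index, and one probability per class. The sketch below shows how the first two follow from the third, using the numbers printed above. The class order `['<50k', '>=50k']` is an assumption, chosen because it matches index 1 being reported as `>=50k`.

```python
# Probabilities copied from the prediction above; the class order is assumed.
classes = ["<50k", ">=50k"]
probabilities = [0.4402, 0.5598]

# The predicted class is the one with the highest probability.
best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])

print(classes[best_index], best_index, probabilities[best_index])
```

A probability of about 0.56 is only slightly above one half, so the model is not very sure about this row.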