# Homework 04 | WEEK 04 (27.09-04.10.2022) | Machine Learning Zoomcamp

Link to the homework [here](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/04-evaluation/homework.md)

## Dataset

In this homework, we will use Credit Card Data from the book "Econometric Analysis".

Here is a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AER_credit_card_data.csv):

```
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AER_credit_card_data.csv
```

The goal of this homework is to inspect the output of different evaluation metrics by creating a classification model (target column `card`).

### Getting the data

In [1]:
#data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AER_credit_card_data.csv'

In [2]:
#!wget $data -O data-homework-04.csv 

In [3]:
ls

data-homework-04.csv  homework_04.ipynb


### Import the necessary libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Read it with pandas

In [5]:
df = pd.read_csv('data-homework-04.csv')
df.head()

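|   | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|------|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | yes | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | yes | no | 3 | 54 | 1 | 12 |
| 1 | yes | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | no | no | 3 | 34 | 1 | 13 |
| 2 | yes | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | yes | no | 4 | 58 | 1 | 5 |
| 3 | yes | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | no | no | 0 | 25 | 1 | 7 |
| 4 | yes | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | yes | no | 2 | 64 | 1 | 5 |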


Checklist:

* Names of all columns are lowercase and separated by underscores
* Number of missing values in each column
* Number of unique values in each column
* Unique values in each column
* Data type of each column, and whether it makes sense (e.g. `age` stored as a string instead of a number)
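The checklist above can be collected into a single summary table. The following is a minimal sketch on a small hypothetical frame (`toy` and its column names are made up for illustration and are not part of the homework data); it normalizes the column names and then reports dtype, missing count and unique count per column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Card": ["yes", "no", "yes"],
    "Major Cards": [1, 0, 1],
    "Age": [37.7, 33.2, None],
})

# Checklist item 1: lowercase names, spaces replaced by underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Remaining checklist items gathered into one table, one row per column
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "n_unique": toy.nunique(),  # NaN is not counted as a unique value
})
print(summary)
```

The same three calls (`dtypes`, `isnull().sum()`, `nunique()`) are the ones run one by one in the cells that follow.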

In [6]:
df.shape

(1319, 12)

In [7]:
df.columns

Index(['card', 'reports', 'age', 'income', 'share', 'expenditure', 'owner',
       'selfemp', 'dependents', 'months', 'majorcards', 'active'],
      dtype='object')

In [8]:
df.isnull().sum()

card           0
reports        0
age            0
income         0
share          0
expenditure    0
owner          0
selfemp        0
dependents     0
months         0
majorcards     0
active         0
dtype: int64

In [9]:
df.nunique()

card              2
reports          13
age             418
income          431
share          1162
expenditure     981
owner             2
selfemp           2
dependents        7
months          193
majorcards        2
active           35
dtype: int64

`card` is our target variable. Beyond that, `reports`, `owner`, `selfemp`, `dependents` and `majorcards` look like possible candidates for categorical variables.

In [10]:
df['majorcards'].unique()

array([1, 0])

In [11]:
df.dtypes

card            object
reports          int64
age            float64
income         float64
share          float64
expenditure    float64
owner           object
selfemp         object
dependents       int64
months           int64
majorcards       int64
active           int64
dtype: object

`majorcards` takes only the values 0 and 1, so it is effectively a boolean flag (`True`/`False`) even though it is stored as `int64`. Let's look into the homework now.

## Preparation

* Create the target variable by mapping `yes` to 1 and `no` to 0.
* Split the dataset into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.

In [12]:
df.card

0       yes
1       yes
2       yes
3       yes
4       yes
       ... 
1314    yes
1315     no
1316    yes
1317    yes
1318    yes
Name: card, Length: 1319, dtype: object

In [13]:
df.card = (df.card == 'yes').astype(int)
df.card

0       1
1       1
2       1
3       1
4       1
       ..
1314    1
1315    0
1316    1
1317    1
1318    1
Name: card, Length: 1319, dtype: int64

In [14]:
df.head()

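|   | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|------|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 1 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | yes | no | 3 | 54 | 1 | 12 |
| 1 | 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | no | no | 3 | 34 | 1 | 13 |
| 2 | 1 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | yes | no | 4 | 58 | 1 | 5 |
| 3 | 1 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | no | no | 0 | 25 | 1 | 7 |
| 4 | 1 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | yes | no | 2 | 64 | 1 | 5 |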


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [16]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [17]:
len(df),len(df_full_train), len(df_test)

(1319, 1055, 264)
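The second split uses `test_size=0.25` because 25% of the remaining 80% is 0.25 × 0.8 = 20% of the whole dataset, which leaves 60% for training. The sketch below checks this arithmetic on a stand-in frame with the same number of rows (the frame `toy` and its single column `x` are assumptions for illustration, not the homework data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same row count as the dataset (1319)
toy = pd.DataFrame({"x": np.arange(1319)})

# Same two-step split as in the notebook: 80/20, then 75/25 of the 80%
full_train, test = train_test_split(toy, test_size=0.2, random_state=1)
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = {"train": len(train), "val": len(val), "test": len(test)}
shares = {name: round(n / len(toy), 3) for name, n in sizes.items()}
print(sizes)
print(shares)  # roughly 0.6 / 0.2 / 0.2
```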

Next, reset the index of each split and extract the target `card` into separate `y` arrays.

In [18]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_full_train = df_full_train.reset_index(drop=True)

In [19]:
y_train = df_train.card.values
y_val = df_val.card.values
y_test = df_test.card.values
y_full_train = df_full_train.card.values

In [20]:
del df_train['card']
del df_val['card']
del df_test['card']
del df_full_train['card']

In [21]:
df_train.head()

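|   | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 3 | 40.5 | 4.0128 | 0.000299 | 0.0 | no | no | 1 | 12 | 1 | 17 |
| 1 | 1 | 32.33333 | 6.0 | 0.0002 | 0.0 | yes | no | 4 | 18 | 1 | 4 |
| 2 | 1 | 29.16667 | 2.2 | 0.038205 | 69.79333 | no | no | 0 | 49 | 1 | 7 |
| 3 | 1 | 54.66667 | 7.29 | 0.106536 | 647.2067 | yes | no | 2 | 78 | 1 | 9 |
| 4 | 0 | 25.0 | 3.3984 | 0.000353 | 0.0 | yes | no | 2 | 29 | 0 | 4 |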


## Question 1

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that

* For each numerical variable, use it as score and compute AUC with the `card` variable.
* Use the training dataset for that.

If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['expenditure']`).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable: the negative correlation then becomes positive.
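Negating a score reverses the ranking of every pair of observations, so the AUC of the negated variable equals one minus the original AUC. The sketch below illustrates this on synthetic data (the arrays `y` and `x` are generated here purely for illustration and are not taken from the credit card dataset).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic binary target and a score that tends to be lower for the positive class
y = rng.integers(0, 2, size=500)
x = rng.normal(loc=-1.0 * y, scale=1.0)

auc_raw = roc_auc_score(y, x)    # below 0.5: x ranks the classes backwards
auc_neg = roc_auc_score(y, -x)   # negation flips the ranking

print(round(auc_raw, 3), round(auc_neg, 3), round(auc_raw + auc_neg, 3))
```

Because the two values sum to 1, flipping the sign whenever the AUC is below 0.5 is equivalent to reporting `max(auc, 1 - auc)`.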

Which numerical variable (among the following 4) has the highest AUC?

* `reports`
* `dependents`
* `active`
* `share`

In [22]:
feature = ['reports', 'dependents', 'active', 'share']

In [47]:
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score

In [70]:
for f in feature:
    # use the raw feature as a score; flip its sign if it ranks the classes backwards
    score = roc_auc_score(y_train, df_train[f])
    if score < 0.5:
        score = roc_auc_score(y_train, -df_train[f])
    print('%s %.3f' % (f, score))

reports 0.717
dependents 0.533
active 0.604
share 0.989


**Answer to Q1:**

`share` has the highest AUC.