# Homework for week05

> In this homework, we will use the Bank credit scoring dataset from [here](https://www.kaggle.com/datasets/kapturovalexander/bank-credit-scoring/data).

## Data Dictionary

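The columns appear to match the UCI Bank Marketing dataset (`bank.csv`), so the descriptions below follow that dataset's documentation. Treat this mapping as an assumption.

- 'age' : age,
- 'job' : type of job,
- 'marital' : marital status,
- 'education' : level of education,
- 'default' : whether the client has credit in default (yes/no),
- 'balance' : account balance (average yearly balance in the UCI description),
- 'housing' : whether the client has a housing loan (yes/no),
- 'loan' : whether the client has a personal loan (yes/no, not an amount),
- 'contact': contact communication type,
- 'day' : day of the month of the last contact,
- 'month' : month of the last contact,
- 'duration' : duration of the last contact, in seconds,
- 'campaign' : number of contacts made to this client during the current campaign,
- 'pdays' : days since the client was last contacted in a previous campaign (-1 means not previously contacted),
- 'previous' : number of contacts made to this client before the current campaign,
- 'poutcome' : outcome of the previous marketing campaign,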
- 'y' : target

### Questions / Assumptions

What does the target `y` represent? 
- That the account is in default? 
- That the loan is approved?

## Question 1: Install Pipenv 

```bash
micromamba install pipenv -y
```

### qn1 ans: `2023.10.3`

Q: What's the version of pipenv you installed?

A: 
```bash
pipenv --version
pipenv, version 2023.10.3
```

## Question 2: Install Scikit-Learn

Q: What's the first hash for scikit-learn you get in Pipfile.lock?

### qn2 ans: "sha256:0c275a06..."

A: "sha256:0c275a06c5190c5ce00af0acbb61c06374087949f643ef32d355ece12c4db043"
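
`Pipfile.lock` is a JSON file in which regular dependencies sit under the `default` key, each with a `hashes` list. A minimal sketch for reading the first hash programmatically instead of by eye; the helper name `first_hash` is hypothetical, and the usage lines assume `Pipfile.lock` is in the current directory.

```python
import json


def first_hash(lock: dict, package: str, section: str = "default") -> str:
    """Return the first hash listed for `package` in a parsed Pipfile.lock."""
    return lock[section][package]["hashes"][0]


# Usage (assumes Pipfile.lock sits in the current directory):
# with open("Pipfile.lock") as f_in:
#     print(first_hash(json.load(f_in), "scikit-learn"))
```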


## Import packages


```python
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import pickle
from pprint import pprint as pp

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
```


## Question 3

> Write a script for loading these models with pickle
>
> Score this client:
>
> ```json
> {"job": "retired", "duration": 445, "poutcome": "success"}
> ```

```powershell
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/DataTalksClub/machine-learning-zoomcamp/master/cohorts/2023/05-deployment/homework/model1.bin" -outfile "model1.bin"

Invoke-WebRequest -Uri "https://raw.githubusercontent.com/DataTalksClub/machine-learning-zoomcamp/master/cohorts/2023/05-deployment/homework/dv.bin" -outfile "dv.bin"
```
    md5sum model1.bin dv.bin
    8ebfdf20010cfc7f545c43e3b52fc8a1 *model1.bin
    924b496a89148b422c74a62dbc92a4fb *dv.bin

Manual download, `wget`, and PowerShell's `Invoke-WebRequest` all failed. The only download that worked was the `wget` command run in a `git bash` shell, even though the resulting md5 checksums differ from the ones stated in `README.md`:

```bash
$ md5sum model1.bin dv.bin
3f57f3ebfdf57a9e1368dcd0f28a4a14  model1.bin
6b7cded86a52af7e81859647fa3a5c2e  dv.bin
```

```python
df = pd.read_csv('../../data/bank.csv', sep=';')
```

```python
df.info()
```

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4521 entries, 0 to 4520
    Data columns (total 17 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   age        4521 non-null   int64 
     1   job        4521 non-null   object
     2   marital    4521 non-null   object
     3   education  4521 non-null   object
     4   default    4521 non-null   object
     5   balance    4521 non-null   int64 
     6   housing    4521 non-null   object
     7   loan       4521 non-null   object
     8   contact    4521 non-null   object
     9   day        4521 non-null   int64 
     10  month      4521 non-null   object
     11  duration   4521 non-null   int64 
     12  campaign   4521 non-null   int64 
     13  pdays      4521 non-null   int64 
     14  previous   4521 non-null   int64 
     15  poutcome   4521 non-null   object
     16  y          4521 non-null   object
    dtypes: int64(7), object(10)
    memory usage: 600.6+ KB


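```python
df.head()
```

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |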


```python
# features = ['job', 'duration', 'poutcome']
# dicts = df[features].to_dict(orient='records')

# dv = DictVectorizer(sparse=False)
# X = dv.fit_transform(dicts)
# y = df.y.values

# model = LogisticRegression(max_iter=1000).fit(X, y)
```

```python
# output_file = 'model1.bin'
# with open(output_file, 'wb') as f_out:
#     pickle.dump((dv, model), f_out)
```

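```python
def load(input_file):
    """Unpickle and return the single object stored in input_file."""
    with open(input_file, 'rb') as f_in:
        return pickle.load(f_in)

# The vectorizer and the model are stored in separate files
dv = load('dv.bin')
model = load('model1.bin')
```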

```python
# code cell becomes qn3_predict.py
client = {"job": "retired", "duration": 445, "poutcome": "success"}

X = dv.transform([client])
y_pred = model.predict_proba(X)[0, 1]

print(client)
print(y_pred)
```

    {'job': 'retired', 'duration': 445, 'poutcome': 'success'}
    0.9019309332297606


### qn3 ans: 0.902

## Question 4

> Now let's serve this model as a web service
> 
> * Install Flask and gunicorn (or waitress, if you're on Windows)
> * Write Flask code for serving the model
> * Now score this client using `requests`
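
The server side is not shown in this notebook, so here is a minimal sketch of what it might look like. The response keys and port 9696 come from the client code and output in this homework; the names `score_client` and `create_app`, the app name, and the 0.5 decision threshold are assumptions.

```python
import pickle


def load(input_file):
    """Unpickle a single object (same helper as in Question 3)."""
    with open(input_file, "rb") as f_in:
        return pickle.load(f_in)


def score_client(dv, model, client, threshold=0.5):
    """Build the response payload for one client dict.

    The 0.5 threshold is an assumption; the key names mirror the
    response printed in this homework.
    """
    X = dv.transform([client])
    proba = float(model.predict_proba(X)[0, 1])
    return {"approval_probability": proba, "loan approved": bool(proba >= threshold)}


def create_app(dv, model):
    """Build a Flask app exposing POST /predict (hypothetical factory name)."""
    # Imported lazily so score_client stays usable without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask("credit")

    @app.route("/predict", methods=["POST"])
    def predict():
        client = request.get_json()
        return jsonify(score_client(dv, model, client))

    return app


# To serve locally with Flask's development server:
# app = create_app(load("dv.bin"), load("model1.bin"))
# app.run(host="0.0.0.0", port=9696)
```

Keeping the scoring logic in a plain function makes it testable without starting a server. For anything beyond local testing, the app would be run under gunicorn or waitress, as the question suggests.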

```python
# code cell becomes qn4_predict.py
import requests

url = "http://localhost:9696/predict"

client = {"job": "unknown", "duration": 270, "poutcome": "failure"}
response = requests.post(url, json=client).json()

print(response)
```

    {'approval_probability': 0.13968947052356817, 'loan approved': False}


### qn4 ans: 0.140

A: `0.140`

```python
if response['loan approved']:
    print('sending email to', 'some client')
```

## Question 5

Q: So what's the size of this base image?

### qn5 ans: 147 MB

A: `147 MB`

## Question 6

- compose the Dockerfile
- build docker image: `docker build -t hmwk05 .`
- run docker image in container `docker run -it --rm -p 9696:9696 hmwk05:latest`
- in another terminal, test it: `pipenv run python .\qn6_test.py`


```python
# code cell becomes qn6_test.py
import requests

url = "http://localhost:9696/predict"

client = {"job": "retired", "duration": 445, "poutcome": "success"}
response = requests.post(url, json=client).json()

print(response)
```

    {'approval_probability': 0.726936946355423, 'loan approved': True}


Q: What's the probability that this client will get a credit now?

### qn6 ans: 0.730

A: `0.730` (the service returned about 0.727; `0.730` is the closest match)