# Machine Learning Project for Kubernetes

We will build an application that predicts a person's medical insurance charges from six features:

1. Age (ranges from 18 to 64) (int)
2. Sex (male and female)
3. BMI (ranges from 15.96 to 53.13) (float)
4. Children (ranges from 0 to 5) (int)
5. Smoker (yes and no)
6. Region (northwest, northeast, southwest, southeast)


The dataset is available on Kaggle [here](https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download). Download the CSV file and load it into the notebook with pandas, as demonstrated below.

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv('insurance.csv')

In [4]:
data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Special thanks to Miri Choi for providing a dataset that's extremely clean, so we won't have to spend much time on cleaning! If you want to check the dataset's cleanliness for yourself, run this command to view the unique values of each feature:

```python
for col in data.columns:
    print(col)
    print(data[col].unique())
    print("\n")
```
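
Beyond listing unique values, a couple of pandas calls give a quick numeric summary of cleanliness. The following is a minimal sketch on a hypothetical three-row `sample` frame (not the real file) with one value deliberately removed, so the checks have something to find; on the real data you would call the same methods on `data`.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as insurance.csv
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, None],   # one value deliberately missing
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Inspect the dtype of each column to spot the string-valued features
print(sample.dtypes)

# Count fully duplicated rows
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

A clean dataset should report zero missing values in every column; duplicated rows are worth a look but are not necessarily errors.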

Nonetheless, we still need some basic feature engineering, because certain features (sex, smoker and region) are stored as strings, which most regression models cannot consume directly. These are categorical (non-ordinal) variables, so one-hot encoding should do the trick! Remember, ordinal encoding only makes sense when the categories have a natural order (e.g. Gold > Silver > Bronze); for a category such as dogs, cats and rabbits there's no such order, so one-hot encoding is the preferred option.
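
To make the difference concrete, here is a small sketch on a hypothetical `toy` frame (the `pet` and `medal` columns are illustrative only and are not part of the insurance data). It one-hot encodes the unordered column and maps the ordered column to ranks by hand.

```python
import pandas as pd

# Hypothetical toy columns: "pet" has no natural order, "medal" does
toy = pd.DataFrame({
    "pet": ["dog", "cat", "rabbit", "cat"],
    "medal": ["Gold", "Bronze", "Silver", "Gold"],
})

# One-hot encode the unordered column; drop_first=True drops the first
# category alphabetically ("cat"), which becomes the all-zeros baseline
pet_encoded = pd.get_dummies(toy["pet"], prefix="pet", drop_first=True)

# Ordinal encode the ordered column with an explicit rank mapping
medal_rank = {"Bronze": 0, "Silver": 1, "Gold": 2}
medal_encoded = toy["medal"].map(medal_rank)

print(list(pet_encoded.columns))
print(medal_encoded.tolist())
```

Dropping the first category avoids a redundant column: a row with zeros in both `pet_dog` and `pet_rabbit` can only be a cat.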

In [5]:
cols_to_encode = ['sex','smoker','region']
df_encoded = pd.get_dummies(data[cols_to_encode],drop_first=True)
df_not_encoded = data.drop(cols_to_encode, axis=1)
final_df = pd.concat([df_encoded,df_not_encoded],axis=1)

Great, we have a nice, clean dataset. Let's split it into train and test sets and see how some basic regression models perform on it.

In [31]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [32]:
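y = final_df['charges']
X = final_df.drop('charges',axis=1)
# Split first, then fit the scaler on the training rows only so that
# no information from the test set leaks into the scaling
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)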

In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train,y_train)
y_preds = model.predict(X_test)

In [14]:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_preds))

0.7835929767120723


In [16]:
from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor().fit(X_train,y_train)
y_preds_rfr = model2.predict(X_test)
print(r2_score(y_test,y_preds_rfr))

0.8638841461300266


In [18]:
import joblib
joblib.dump(model2, 'rfr_model.sav')

['rfr_model.sav']

In [19]:
from typing import Union

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
    return {"item_id": item_id, "q": q}

The `uvicorn main:app` command below imports the module `main` and looks for an object named `app` in it, so the code above has to be saved as `main.py` in the working directory before the server can load it.

In [23]:
!uvicorn main:app --reload

INFO:     Will watch for changes in these directories: ['/Users/ashwinphilipgeorge/workspace/medium']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [6806] using WatchFiles
ERROR:    Error loading ASGI app. Attribute "app" not found in module "main".
INFO:     Started server process [6830]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [6830]
INFO:     Started server process [6831]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:56999 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0

In [1]:
import joblib

model = joblib.load('rfr_model.sav')

## Full Preprocessing

In [20]:
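def full_preprocessing(data):
    # Mirror the training-time steps: one-hot encode, then scale
    cols_to_encode = ['sex','smoker','region']
    df_encoded = pd.get_dummies(data[cols_to_encode],drop_first=True)
    # axis=1 drops columns (the default axis=0 would look for row labels)
    df_not_encoded = data.drop(cols_to_encode, axis=1)
    # axis=1 joins the two frames side by side instead of stacking rows
    final_df = pd.concat([df_encoded,df_not_encoded],axis=1)
    # Caveat: this refits the scaler on `data`, so the output matches the
    # training scale only when `data` spans the same ranges as the training set
    X_scaled = MinMaxScaler().fit_transform(final_df)
    return X_scaled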

In [21]:
test_series = data.iloc[0].drop('charges')

In [27]:
test_series

age                19
sex            female
bmi              27.9
children            0
smoker            yes
region      southwest
Name: 0, dtype: object

In [33]:
X.iloc[0]

sex_male             0.0
smoker_yes           1.0
region_northwest     0.0
region_southeast     0.0
region_southwest     1.0
age                 19.0
bmi                 27.9
children             0.0
Name: 0, dtype: float64

In [40]:
# Row 0 in the column order of X (dummy columns first, then age, bmi, children)
test_series = [[0,1,0,0,1,19,27.9,0]]

In [42]:
model.predict(test_series)[0]

48316.16197970002
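
Note that the model was trained on min-max scaled features, while the row passed to `predict` above holds raw values. That mismatch may explain why the prediction is so far from the charge recorded for that row (16884.924). One way to avoid it, sketched below as an assumption rather than as the approach used in this notebook, is to bundle the scaler and the regressor in a scikit-learn pipeline so that the same scaling is applied automatically at prediction time. The frame `X_demo` and target `y_demo` are stand-ins built from the first five rows shown earlier, and the `n_estimators` and `random_state` values are chosen only to keep the example small and repeatable.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the encoded training frame X and target y (first five rows only)
columns = ["sex_male", "smoker_yes", "region_northwest", "region_southeast",
           "region_southwest", "age", "bmi", "children"]
X_demo = pd.DataFrame([
    [0, 1, 0, 0, 1, 19, 27.9, 0],
    [1, 0, 0, 1, 0, 18, 33.77, 1],
    [1, 0, 0, 1, 0, 28, 33.0, 3],
    [1, 0, 1, 0, 0, 33, 22.705, 0],
    [1, 0, 1, 0, 0, 32, 28.88, 0],
], columns=columns)
y_demo = pd.Series([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])

# The pipeline fits the scaler and the model together, so predict()
# applies the training-time scaling to any raw row it receives
pipeline = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
pipeline.fit(X_demo, y_demo)

# Raw, unscaled feature values can now be passed directly
raw_row = pd.DataFrame([[0, 1, 0, 0, 1, 19, 27.9, 0]], columns=columns)
prediction = pipeline.predict(raw_row)
print(prediction.shape)
```

A fitted pipeline can be saved with `joblib.dump` in the same way as the model above, which keeps the scaler and the regressor together in one file.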