# Linear Regression for Estimating the Annual Premium of Health Insurance


1. Domain Exploration
- Understand the business process and the customer journey
- Explore and analyze the data sources, the data journey, and the data users
- Identify key exceptions in the business process and key beliefs held by operations/shop-floor stakeholders


2. Data Collection and Exploration
- Collect data from different sources, internal and external
- Build a dataset and define the target attribute
- Perform high-level exploration to assess data quality


3. Data Cleaning
- Handle duplicates, missing values, and outliers
- Handle formatting and units


4. Feature Engineering
- Feature Extraction: Extract new features from existing ones: data modelling, dimensionality reduction, OLTP to OLAP
- Feature Selection: Select the most relevant features
    - Statistical Research: correlation analysis, ANOVA, chi-square
    - Data Visualization: univariate, bivariate, and multivariate


5. Feature Preprocessing
- Encode features
- Scale features


7. Model Development
- Select algorithm, train model
- Evaluate Model

8. Optimize model
- Tune the model and refine the data to get the best-performing model

9. Deploy model
- Deploy the model as an inference pipeline

10. Monitor in production

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()


## 2. Data Collection and Exploration

In [2]:
# load data
df = pd.read_csv("datasets-1/insurance.csv")
df.shape

(1338, 7)

## 3. Data Cleaning

In [6]:
# check duplicates
df.duplicated().sum()

1

In [7]:
# drop the duplicated row
print(df.shape)
df.drop_duplicates(inplace=True)
print(df.shape)


(1338, 7)
(1337, 7)
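
Before dropping, it can help to look at the duplicated rows themselves. The following is a minimal sketch on a small made-up frame (the values are illustrative and are not taken from `insurance.csv`); `toy`, `dupes`, and `deduped` are names introduced only for this example.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up for illustration)
toy = pd.DataFrame({
    "age": [19, 30, 19, 45],
    "bmi": [27.9, 30.6, 27.9, 25.2],
    "charges": [1639.56, 2500.00, 1639.56, 7000.00],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Non-inplace variant: returns a new frame and renumbers the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(toy.shape, deduped.shape)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which may be convenient while exploring.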


In [8]:
# check for missing values
df.isnull().sum()

age         0
Gender      0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [11]:
# Analyze value at the 95th percentile
thresh = df.charges.quantile(0.95)
thresh

41210.04980000002

In [12]:
# replace all values above thresh by max value below thresh
# (single-step .loc assignment avoids chained assignment, which stops working under Copy-on-Write)
print(df.charges.skew())
df.loc[df.charges > thresh, "charges"] = df.charges[df.charges < thresh].max()
print(df.charges.skew())

1.5153909108403483
1.3058517533895801
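
Capping at a quantile can be written without chained assignment in at least two ways. The sketch below uses synthetic right-skewed values as a stand-in for the `charges` column (an assumption; the real data is not used here) and compares a single-step `.loc` assignment with `Series.clip`. The names `demo`, `capped_loc`, and `capped_clip` exist only in this example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a charges column (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"charges": rng.lognormal(mean=9.0, sigma=1.0, size=1000)})

thresh = demo["charges"].quantile(0.95)
skew_before = demo["charges"].skew()

# Option 1: single-step .loc assignment, replacing values above thresh
# with the largest value below thresh
cap = demo.loc[demo["charges"] < thresh, "charges"].max()
capped_loc = demo.copy()
capped_loc.loc[capped_loc["charges"] > thresh, "charges"] = cap

# Option 2: clip at the threshold itself
# (caps at thresh rather than at the largest value below it)
capped_clip = demo["charges"].clip(upper=thresh)

print(skew_before, capped_loc["charges"].skew(), capped_clip.skew())
```

The two options differ slightly in the cap value, so the resulting skew will not be identical; which one is preferable depends on how the capped values are meant to be interpreted.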

In [27]:
df.columns

Index(['age', 'Gender', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [28]:
# take an explicit copy so that later column assignments on x do not touch df
x = df[['age','bmi','children','smoker']].copy()
y = df['charges']
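
The notebook does not show how these four columns were chosen. As an illustration of the correlation analysis and ANOVA named in step 4, here is a minimal sketch on synthetic data where the target depends on `age` and `smoker` by construction. The frame `demo`, the helper `anova_pvalue`, and all values are assumptions made for this example. Chi-square is left out because it applies to a categorical target, while `charges` is continuous.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
})
# Synthetic target: depends on age and smoker, not on region (by construction)
demo["charges"] = (
    250 * demo["age"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 3000, size=n)
)

# Correlation analysis: numeric features against the target
corr = demo[["age", "bmi", "charges"]].corr()["charges"].drop("charges")
print(corr)

# One-way ANOVA: does mean charges differ across the levels of a categorical feature?
def anova_pvalue(frame, cat_col, target="charges"):
    groups = [g[target].to_numpy() for _, g in frame.groupby(cat_col)]
    return f_oneway(*groups).pvalue

for col in ["smoker", "region"]:
    print(col, anova_pvalue(demo, col))
```

A small p-value for a categorical column suggests that group means differ, which is one possible argument for keeping it; the same checks could be run on `df` to see whether `Gender` and `region` carry signal.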

## 5. Feature Preprocessing

In [30]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# encode smoker as integers (LabelEncoder assigns codes in sorted order of the class labels)
smoker_en = LabelEncoder()
x['smoker'] = smoker_en.fit_transform(x['smoker'])

# train test split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=50)
print(x.shape, xtrain.shape, xtest.shape)
print(y.shape, ytrain.shape, ytest.shape)






   age     bmi  children  smoker
0   19  27.900         0       1
1   18  33.770         1       0
2   28  33.000         3       0
3   33  22.705         0       0
4   32  28.880         0       0
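
Step 5 of the outline lists both encoding and scaling, while the cell above only encodes. One way to do both, and to fit the preprocessing on the training split only, is a `ColumnTransformer`. This is a sketch on a synthetic frame with the same column names as `x`; it swaps `LabelEncoder` for `OrdinalEncoder` (which is meant for feature columns) and assumes the `smoker` values are `"no"`/`"yes"`. Plain linear regression does not require scaling, so the scaler is shown only to illustrate the step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Small synthetic frame with the same column names as x (values are made up)
rng = np.random.default_rng(1)
n = 200
demo_x = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["no", "yes"], size=n, p=[0.8, 0.2]),
})

train_x, test_x = train_test_split(demo_x, test_size=0.2, random_state=50)

preprocess = ColumnTransformer([
    # categories given explicitly so that "no" -> 0 and "yes" -> 1
    ("smoker", OrdinalEncoder(categories=[["no", "yes"]]), ["smoker"]),
    ("numeric", StandardScaler(), ["age", "bmi", "children"]),
])

# Fit on the training split only, then reuse the fitted transformer on the test split
train_arr = preprocess.fit_transform(train_x)
test_arr = preprocess.transform(test_x)
print(train_arr.shape, test_arr.shape)
```

The output columns follow the transformer order, so `smoker` comes first, followed by the scaled `age`, `bmi`, and `children`.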


## 6. Applying Machine Learning

In [32]:
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://3.25.126.82:5000/")

In [33]:
mlflow.create_experiment("Insurance-Anshu")

'123848121867164488'

In [34]:
mlflow.set_experiment("Insurance-Anshu")

<Experiment: artifact_location='mlflow-artifacts:/123848121867164488', creation_time=1728041546854, experiment_id='123848121867164488', last_update_time=1728041546854, lifecycle_stage='active', name='Insurance-Anshu', tags={}>

In [35]:
from sklearn.linear_model import LinearRegression

mlflow.sklearn.autolog()


with mlflow.start_run():
    
    # initiate the model object
    model = LinearRegression()

    # train the model with training data
    model.fit(xtrain,ytrain)
    
    ypred = model.predict(xtest)
    from sklearn.metrics import r2_score
    r2 = r2_score(ytest,ypred)
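
The run above computes R² on the test split. A minimal sketch of collecting a few more regression metrics alongside it follows; it uses synthetic data in place of `xtrain`/`xtest` (an assumption), and the names `reg`, `pred`, and `metrics` are introduced only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the insurance features)
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = 5000 + X @ np.array([3000.0, 1500.0, 500.0]) + rng.normal(0, 1000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=50)
reg = LinearRegression().fit(X_tr, y_tr)
pred = reg.predict(X_te)

# R² is unitless; MAE and RMSE are in the units of the target
metrics = {
    "r2": r2_score(y_te, pred),
    "mae": mean_absolute_error(y_te, pred),
    "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
}
print(metrics)
```

Inside an active MLflow run, a dictionary like this could be recorded with `mlflow.log_metrics(metrics)`.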



## Client Code

In [62]:
import requests
import json

data = {"age":[24,], "bmi":[30.0,],"children":[2,],"smoker":[0,]}
data = json.dumps(data)


url= "http://127.0.0.1:5001/predict"
response = requests.post(url,data=data)
print(response.content)

b'{"age": [24], "bmi": [30.0], "children": [2], "smoker": [0], "prediction": 5051.709450479322}'
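
The response body printed above is raw bytes. A small sketch of decoding it and pulling out the prediction is shown below; `parse_prediction` is a hypothetical helper, and the only thing assumed about the server is the response shape visible in the output above.

```python
import json

def parse_prediction(body: bytes) -> float:
    """Decode a JSON response body and return its "prediction" field."""
    payload = json.loads(body.decode("utf-8"))
    if "prediction" not in payload:
        raise KeyError(f"no 'prediction' field in response: {payload}")
    return float(payload["prediction"])

# Body copied from the output above
body = b'{"age": [24], "bmi": [30.0], "children": [2], "smoker": [0], "prediction": 5051.709450479322}'
print(round(parse_prediction(body), 2))  # prints 5051.71
```

With `requests`, `response.raise_for_status()` raises on 4xx/5xx statuses and `response.json()` performs the decoding step. Whether the endpoint also accepts `requests.post(url, json=...)` depends on the server code, which is not shown here.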


In [40]:
import mlflow

mlflow.set_tracking_uri("http://3.25.126.82:5000/")

In [41]:
model_name="insurance"
model_version=1

model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{model_version}")



In [45]:
model.metadata.signature

inputs: 
  ['age': long, 'bmi': double, 'children': long, 'smoker': integer]
outputs: 
  [Tensor('float64', (-1,))]
params: 
  None

In [56]:
# smoker is cast to int32 to match the 'integer' type in the model signature
data = pd.DataFrame({"age":[24,], "bmi":[30.0,],"children":[2,],"smoker":np.array([0,]).astype('int32')})
data

   age   bmi  children  smoker
0   24  30.0         2       0


In [57]:
model.predict(data)

array([5051.70945048])

In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1 non-null      int64  
 1   bmi       1 non-null      float64
 2   children  1 non-null      int64  
 3   smoker    1 non-null      int32  
dtypes: float64(1), int32(1), int64(2)
memory usage: 156.0 bytes
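
The signature lists `smoker` as `integer` (32-bit) and `age`/`children` as `long` (64-bit), which matches the dtypes shown by `data.info()`. A minimal sketch of casting an input frame from a hand-written mapping is given below. The mapping `SIGNATURE_TO_PANDAS` and the helper `cast_to_signature` are assumptions for this example and cover only the three type names seen above; they are not part of the MLflow API.

```python
import pandas as pd

# Hand-written mapping from the type names printed in the signature to pandas dtypes
# (an assumption for this sketch, covering only the three types seen above)
SIGNATURE_TO_PANDAS = {"long": "int64", "double": "float64", "integer": "int32"}

# Column types as printed by model.metadata.signature above
expected = {"age": "long", "bmi": "double", "children": "long", "smoker": "integer"}

def cast_to_signature(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the expected columns, in order, cast to their mapped dtypes."""
    dtypes = {col: SIGNATURE_TO_PANDAS[kind] for col, kind in expected.items()}
    return frame[list(expected)].astype(dtypes)

# bmi is given as an integer here on purpose, to show the cast to float64
raw = pd.DataFrame({"age": [24], "bmi": [30], "children": [2], "smoker": [0]})
typed = cast_to_signature(raw)
print(typed.dtypes)
```

Casting in one place like this keeps the client input consistent with the schema even when the raw values arrive with default pandas dtypes.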


In [64]:

import requests
import json

data = {"age":[24,], "bmi":[30.0,],"children":[2,],"smoker":[0,]}
data = json.dumps(data)
url = "http://3.25.126.82:5001/predict"


response = requests.post(url,data=data)
print(response.content)

b'{"age": [24], "bmi": [30.0], "children": [2], "smoker": [0], "prediction": 5051.709450479324}'
