### 🏥 Business Scenario
You are working as a Data Analyst for a health insurance company.
The company provides insurance policies to thousands of customers every year. Recently, management observed that insurance claim costs are increasing, but they are not sure which customer factors are driving the cost the most.


### 🎯 Business Goal
The company wants to identify the key factors that increase medical insurance charges so that it can:
- Design risk-based premium plans
- Reduce losses caused by high-risk customers
- Create fair pricing strategies for customers

Your task is to use Multiple Linear Regression to support this decision.

### 📁 Dataset Reality (Messy Data)
The dataset contains real-world problems:
- Some important columns are categorical (not numeric)
- Certain customer attributes are text-based
- One column represents geographical regions
- Numerical features have very different value ranges
- Some variables may be strongly related to each other

The data cannot be used directly for regression.

Resources:
- Medical Cost Personal Datasets | Kaggle
- Insurance Forecast by using Linear Regression

### 🧩 Assignment Tasks
#### ✅ Task 1: Business Understanding
Identify:
- The target variable (the company's financial concern)
- The input variables related to customer risk

Explain how your regression model helps the company control rising costs, not just predict them.

#### ✅ Task 2: Data Inspection
Explore the dataset and:
- Separate numeric and categorical columns
- Identify columns that are not directly usable in regression

Report any data quality issues you observe.

#### ✅ Task 3: Data Cleaning & Encoding
Convert categorical variables into numeric form.
Decide how to handle:
- Region information
- Binary attributes like lifestyle indicators

Justify each transformation from a business and modeling perspective.

#### ✅ Task 4: Feature Scaling & Comparability
- Observe differences in value ranges among numeric features.
- Explain why scaling is important when comparing regression coefficients.
- Prepare the data so the effect of each variable can be fairly interpreted.

#### ⚠️ Task 5: The Hidden Trap (Multicollinearity)
Analyze relationships between independent variables.
Identify any highly correlated features (for example, age, BMI, and lifestyle habits).
Explain:
- Why multicollinearity is a problem in Multiple Linear Regression
- How it can confuse business decision-making

Take appropriate steps to fix the issue.

#### ✅ Task 6: Build the Multiple Linear Regression Model
Build a Multiple Linear Regression model using the prepared dataset.
Ensure the model is:
- Interpretable
- Stable
- Suitable for explaining customer cost behavior

#### ✅ Task 7: Model Evaluation
Evaluate model performance using appropriate regression metrics.
Comment on whether the model is reliable enough to support pricing decisions.

### Task 1


Business Objective:
Identify key customer factors driving medical insurance charges
to support risk-based pricing and cost control strategies.

Target Variable:
- charges (medical insurance cost)

Predictor Variables:
- age, sex, bmi, children, smoker, region

Model Used:
- Multiple Linear Regression



In [25]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [26]:
# Load dataset
df = pd.read_csv("insurance.csv")
df.head()


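|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |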


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [28]:
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cat_cols = df.select_dtypes(include=['object']).columns

print("Numerical Columns:", list(num_cols))
print("Categorical Columns:", list(cat_cols))

Numerical Columns: ['age', 'bmi', 'children', 'charges']
Categorical Columns: ['sex', 'smoker', 'region']


In [29]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
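
Missing values are only one kind of data quality issue. Task 2 also asks for other problems, so here is a minimal sketch of two further checks: fully duplicated rows and unexpected category labels. It assumes a small hypothetical frame `toy` with the same column names as `insurance.csv`; the values are invented for illustration and are not results from the real data.

```python
import pandas as pd

# Hypothetical sample with the same columns as insurance.csv (values invented).
toy = pd.DataFrame({
    "age": [19, 18, 28, 28],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 33.0],
    "children": [0, 1, 3, 3],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462, 4449.462],
})

# Fully duplicated rows would be counted twice by the regression.
n_duplicates = int(toy.duplicated().sum())

# Distinct labels per categorical column; typos or mixed casing show up here.
category_levels = {
    col: sorted(toy[col].unique()) for col in ["sex", "smoker", "region"]
}

print("Duplicate rows:", n_duplicates)
print("Category levels:", category_levels)
```

On the real data, the same two checks can be run by replacing `toy` with `df`.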

In [32]:
X = df.drop('charges', axis=1)
y = df['charges']

print("Feature shape:", X.shape)
print("Target shape:", y.shape)
X.columns

Feature shape: (1338, 6)
Target shape: (1338,)


Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region'], dtype='object')

In [34]:
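# One-hot encode the categorical columns and standardize the numeric predictors.
# drop='first' removes one dummy per category to avoid the dummy-variable trap.
ct = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_cols),
        # num_cols also contains the target, so exclude 'charges' by name
        ('num', StandardScaler(), num_cols.drop('charges'))
    ],
    remainder='passthrough'
)

X_transformed = ct.fit_transform(X)
X_transformed = np.array(X_transformed, dtype=float)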

print("Transformed feature shape:", X_transformed.shape)


Transformed feature shape: (1338, 8)
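
Task 5 asks for a multicollinearity check. One common diagnostic is the variance inflation factor (VIF), where VIF_j = 1 / (1 - R_j²) and R_j² comes from regressing feature j on the other features. The sketch below computes it with NumPy only, using the identity that the VIFs are the diagonal of the inverse correlation matrix. The helper name `vif_from_matrix` and the synthetic columns are assumptions made for illustration. Values above roughly 5 to 10 are often treated as a warning sign, but that threshold is a convention, not a rule.

```python
import numpy as np

def vif_from_matrix(features: np.ndarray) -> np.ndarray:
    """Return one VIF per column of a 2-D feature matrix."""
    corr = np.corrcoef(features, rowvar=False)
    # The diagonal of the inverse correlation matrix equals the VIFs.
    return np.diag(np.linalg.inv(corr))

# Synthetic demonstration: column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_from_matrix(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

Applied to this notebook, `vif_from_matrix(X_transformed)` would return one value per encoded column. Because `OneHotEncoder(drop='first')` already removes one dummy per category, the perfect collinearity among dummy columns is avoided before the check is run.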


In [37]:
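# Split the encoded and scaled matrix, not the raw X: X still contains
# string columns, which StandardScaler and LinearRegression cannot handle.
X_train, X_test, y_train, y_test = train_test_split(
    X_transformed, y, test_size=0.2, random_state=42
)
# No second StandardScaler is needed here: the ColumnTransformer above
# has already standardized the numeric columns.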
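
Fitting the encoder and scaler on the full dataset before splitting lets test-set statistics influence the preprocessing. A possible alternative, sketched here under stated assumptions, is to wrap preprocessing and model in a scikit-learn `Pipeline` so that everything is fitted on the training rows only. The frame `toy`, the target `toy_y`, and the coefficients used to generate it are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the insurance data (invented relationship).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
toy_y = (250 * toy["age"] + 20000 * (toy["smoker"] == "yes")
         + rng.normal(0, 1000, size=n))

# Split the raw frame first, then let the pipeline learn the encoding
# and scaling parameters from the training rows only.
train_X, test_X, train_y, test_y = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), ["smoker"]),
        ("num", StandardScaler(), ["age", "bmi"]),
    ])),
    ("model", LinearRegression()),
])
pipe.fit(train_X, train_y)
test_r2 = pipe.score(test_X, test_y)  # R^2 on the held-out rows
print("Held-out R2:", round(test_r2, 3))
```

For ordinary least squares, this changes the fitted predictions very little, but it keeps the evaluation honest and carries over unchanged if the model is later replaced by one that is sensitive to scaling.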

In [38]:
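# Fit ordinary least squares on the training split
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The intercept is the predicted charge when every encoded feature is 0,
# so its value depends on how the predictors were encoded and scaled.
print("Intercept:", regressor.intercept_)
print("Number of coefficients:", len(regressor.coef_))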

Intercept: -11931.219050326681
Number of coefficients: 8


In [41]:
# Evaluate on the held-out test set
# (mean_squared_error and r2_score were imported in the first cell)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE of Multiple linear regression : ", mse)
print("R2 score of the Multiple linear regression : ", r2)

MSE of Multiple linear regression :  33596915.85136145
R2 score of the Multiple linear regression :  0.7835929767120724
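
The MSE above is in squared units of `charges`, which is hard to relate to a premium. Taking its square root gives the RMSE in the same units as `charges`. A single train/test split can also be lucky or unlucky, so k-fold cross-validation is one way to judge stability for Task 7. The sketch below converts the reported MSE and then demonstrates `cross_val_score` on synthetic data. `X_demo` and `y_demo` are invented stand-ins, not the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Test-set MSE printed above, converted to the units of `charges`.
mse_reported = 33596915.85136145
rmse = float(np.sqrt(mse_reported))
print(f"RMSE: {rmse:,.2f}")

# Synthetic stand-in data to show 5-fold cross-validated R^2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print("Fold R2:", np.round(cv_r2, 3), "mean:", round(float(cv_r2.mean()), 3))
```

On the notebook's data, the call would be `cross_val_score(LinearRegression(), X_transformed, y, cv=5, scoring="r2")`. A wide spread between folds would suggest the model is not yet stable enough to support pricing decisions.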
