# Regression
Regression is the OG machine learning algorithm. Linear regression and its derivatives are still widely used for problems such as forecasting and prediction, often with a high level of accuracy. It is a statistical method used to model relationships between variables. We'll cover both linear regression for continuous outcomes and logistic regression for binary classification.

## Linear Regression
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n $$

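- y: the dependent variable (output)
- x1 ... xn: the independent variables (inputs)
- b0 ($\beta_0$): the intercept, i.e. the predicted y when every x is 0
- b1 ... bn ($\beta_1 \ldots \beta_n$): the slope coefficients, one per input variable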

### Intuition:
**Goal:** Find the values of b0, b1, b2, ... that minimize the error between predicted y and actual y
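
One common way to make "minimize the error" precise is the ordinary least squares criterion, which is what scikit-learn's `LinearRegression` minimizes. Writing $\hat{y}_i$ for the prediction on row $i$ out of $m$ rows (the symbols $m$ and $\hat{y}_i$ are introduced here only for illustration):

$$ \min_{\beta_0, \ldots, \beta_n} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_n x_{in} $$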

**Example:** Predict steel price y based on grade x1 and thickness x2


### Simple explanation

| Grade (x1) | Thickness (x2) | Price (y) |
|------------|----------------|-----------|
| 100        | 5              | 500       |
| 100        | 10             | 550       |
| 200        | 5              | 700       |

Our equation becomes

y = b0 + b1*x1 + b2*x2

500 = b0 + b1(100) + b2(5)    # first point
550 = b0 + b1(100) + b2(10)   # second point
700 = b0 + b1(200) + b2(5)    # third point

#### Subtract equation 1 from equation 2 to eliminate b0 and b1:

550 - 500 = b1(100-100) + b2(10-5)
50 = 5b2
b2 = 10        # This means price increases $10 per mm of thickness

#### Subtract equation 1 from equation 3 to eliminate b0 and b2:

700 - 500 = b1(200-100) + b2(5-5)
200 = 100b1
b1 = 2         # This means price increases $2 per grade unit

#### Plug back into any equation to find b0:

500 = b0 + 2(100) + 10(5)
500 = b0 + 200 + 50
b0 = 250

Our final equation
#### Price = 250 + 2*Grade + 10*Thickness
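
The hand calculation above can be double-checked numerically. This is a minimal sketch that treats the three data points as a 3x3 linear system and solves it with NumPy; the names `A`, `prices` and the extra steel sheet (grade 150, thickness 8) are made up for illustration.

```python
import numpy as np

# Each row is [1, grade, thickness]; the leading 1 multiplies the intercept b0
A = np.array([
    [1.0, 100.0, 5.0],
    [1.0, 100.0, 10.0],
    [1.0, 200.0, 5.0],
])
prices = np.array([500.0, 550.0, 700.0])

# Three equations, three unknowns: solve A @ [b0, b1, b2] = prices exactly
b0, b1, b2 = np.linalg.solve(A, prices)
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, b2 = {b2:.1f}")

# Apply the fitted equation to a hypothetical sheet (grade 150, thickness 8)
predicted_price = b0 + b1 * 150 + b2 * 8
print(f"Predicted price: {predicted_price:.1f}")
```

With real data there are usually many more rows than unknowns and no exact solution exists, which is why regression minimizes an error measure instead of solving the equations directly.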

Let's use a dataset to illustrate linear regression.

In [None]:
!pip install numpy pandas matplotlib scikit-learn seaborn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

### Data Loading and pre-processing
First, let's load our health insurance dataset (`insurance.csv`) and prepare it for analysis.


In [3]:
df = pd.read_csv('insurance.csv')

In [8]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [None]:
# First step is to examine the data available
print(f"Columns of dataset are {df.columns}")

# Get an overview of summary statistics
print(df.describe())
print(df.info())

### Univariate analysis
Observe the distribution and count of different variables

- df[['age','bmi','children']].hist(bins=20, figsize=(10,8)): Creates histograms for the numerical columns age, bmi, and children with 20 bins and a figure size of (10, 8). 'sex' is still a text column at this point, so it is covered by the count plots instead.
- plt.show(): Displays the plots.
- sns.countplot(x=col, data=df): Creates a count plot showing the frequency of unique values for each categorical column.

In [None]:

df[['age','bmi','children']].hist(bins=20,figsize=(10,8))
plt.show()

for col in ('sex','smoker','region'):
    sns.countplot(x=col,data = df)
    plt.title(f'Count plot for {col}')
    plt.show()

### Bivariate analysis
Observe the correlation among various variables

In [None]:
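# Let's try to compute the correlation matrix across all variables.
# Note: in recent pandas versions (2.0+) this raises a ValueError, because df still contains text columns
corr = df.corr()
print(corr)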

In [None]:
# We are getting an error because correlation works only among numeric columns
# Selecting only the numeric columns
numeric_cols = df.select_dtypes(include='number')
corr = numeric_cols.corr()

# Visualize the correlation matrix corr with annotated correlation values
# and a color gradient from blue (negative) to red (positive).
# vmin and vmax pin the color scale to the full correlation range of -1 to 1.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap (Numeric Columns Only)')
plt.show()
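
For intuition about what each cell of the heatmap holds: `DataFrame.corr()` computes the Pearson correlation coefficient by default, which is the covariance of two columns divided by the product of their standard deviations. The sketch below reproduces it by hand on a few made-up age and charge values (not taken from `insurance.csv`); the helper name `pearson_r` is hypothetical.

```python
import numpy as np

# Made-up ages and charges, for illustration only
ages = np.array([19.0, 25.0, 33.0, 41.0, 52.0, 60.0])
charges = np.array([1800.0, 2500.0, 4100.0, 6200.0, 9800.0, 12500.0])

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(ages, charges)
r_numpy = np.corrcoef(ages, charges)[0, 1]  # NumPy's built-in equivalent
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.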

However, we don't want to leave the non-numeric variables out of the correlation analysis. This brings us to categorical variables, which take a discrete set of values. To use them, we convert the categories to numbers that the model can work with.

We have 2 options:
1. Label encoding: Converts each category into a unique integer, e.g. 'female' -> 0, 'male' -> 1 (scikit-learn's `LabelEncoder` assigns the integers in sorted order). This is easy to implement and works for binary or small categories, but it can introduce an order where none exists.
2. One-hot encoding: Creates a new column for each category and assigns 1 if the category is present, else 0. E.g. for 'sex', a male row gets 'sex_male' = 1 and 'sex_female' = 0.

Now let's transform the categorical variables 'sex', 'region', 'smoker' and 'children'. For simplicity we'll use the label encoder, as none of them has many distinct values. Note that 'children' is already stored as a number, and that 'region' has no natural order, so its integer codes should be interpreted with care.

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for col in ['sex', 'region', 'smoker','children']:
    df[col] = le.fit_transform(df[col])
print(df.head(5))
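
For comparison, here is a rough sketch of how the two encoding options differ on a tiny made-up table (the `toy` rows are not taken from `insurance.csv`). Label-style codes are produced with pandas category codes, which are assigned in sorted order just like `LabelEncoder`, and one-hot columns with `pd.get_dummies`.

```python
import pandas as pd

# Toy rows, made up for illustration
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'region': ['southwest', 'northeast', 'southeast', 'northeast'],
})

# Label-style encoding: one integer per category, assigned in sorted order
label_encoded = toy.copy()
for col in ['sex', 'region']:
    label_encoded[col] = label_encoded[col].astype('category').cat.codes
print(label_encoded)

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy, columns=['sex', 'region'], dtype=int)
print(one_hot)
```

The label-encoded 'region' column implies northeast < southeast < southwest, which has no real meaning; the one-hot version avoids that at the cost of extra columns.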


In [None]:
# Now lets get the entire correlation heatmap
corr = df.corr()
sns.heatmap(corr,annot=True,cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

### Normalization
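Normalization is the process of scaling numerical values to a specific range (usually [0, 1]). It is often used when:

- Different numeric columns have very different ranges (e.g., age ranges from 18–64, while charges range from about 1,100 to 63,800).
- The model relies on distance calculations or on regularization (e.g., KNN, or ridge and lasso regression), which can otherwise be biased towards columns with larger ranges. Plain least-squares linear regression makes equivalent predictions with or without feature scaling, but scaling makes the coefficients easier to compare.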
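
As a minimal sketch of what min-max normalization does to a single column, the formula is `x_scaled = (x - min) / (max - min)`. The example ages below are chosen to span the range reported for this dataset; the variable names are for illustration only.

```python
import numpy as np

# A few ages spanning the dataset's reported range (18 to 64)
ages = np.array([18.0, 27.0, 39.0, 51.0, 64.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)

# The scaling can be undone, which is how scaled values map back to real units
ages_restored = ages_scaled * (ages.max() - ages.min()) + ages.min()
print(ages_restored)
```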

In [53]:
numeric_cols = ['age', 'bmi', 'charges']

# Display statistics before normalization
print("Before Normalization:")
print(df[numeric_cols].describe())

Before Normalization:
               age          bmi       charges
count  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397  13270.422265
std      14.049960     6.098187  12110.011237
min      18.000000    15.960000   1121.873900
25%      27.000000    26.296250   4740.287150
50%      39.000000    30.400000   9382.033000
75%      51.000000    34.693750  16639.912515
max      64.000000    53.130000  63770.428010


In [None]:
# Perform normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_norm = df.copy() # Create a copy for normalized dataframe
# fit_transform: Calculates the minimum and maximum values for each column (from fit()) and scales the values (from transform()) to a 0-1 range.
df_norm[numeric_cols] = scaler.fit_transform(df_norm[numeric_cols])

In [None]:
# df_norm is already a DataFrame (a copy of df with the numeric columns rescaled), so no conversion is needed
print("After Normalization:")
print(df_norm.describe())

### Linear Regression 
Finally we can proceed with the regression. We will use scikit-learn, a library widely used for statistical and ML algorithms. As with any machine learning problem, we need to split the data into a training set, on which the model learns its coefficients, and a test set, on which the model's accuracy is evaluated.

You will see how the entire regression can be done in a few lines of code. The same dataset can also be used for other types of regression models, which we will cover later; a short overview is given here.

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the error metrics used to evaluate the model's predictions on the test set
from sklearn.metrics import mean_squared_error, mean_absolute_error

# We need to use the independent variables in the X-matrix and dependent variable in the Y matrix
y=df_norm['charges']
X = df_norm.drop(columns =['charges','smoker']) # Why did we drop 'smoker'?

# Split the data into 80-20 among training and test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

Train Linear regression model

In [56]:
lr = LinearRegression() # Create an instance of linear regression model
lr.fit(X_train,y_train) # X_train is features and y_train is target value
y_pred = lr.predict(X_test)

Check Variable Dependence (Coefficients)
Linear regression provides coefficients for each feature, showing their impact on the prediction

In [None]:
# Create a DataFrame of feature names and their coefficients
# X.columns: The feature names (column names of X).
# lr.coef_: The corresponding coefficients learned by the LinearRegression model after training.
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': lr.coef_}).sort_values(by='Coefficient', ascending=False)
print(coefficients)

# Visualize feature importance
sns.barplot(x='Coefficient', y='Feature', data=coefficients)
plt.title('Feature Importance in Linear Regression')
plt.show()

Small digression: how to quickly create bar charts

In [None]:
# Sample data (stored in its own variable so the insurance DataFrame `df` is not overwritten)
bar_df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D'],
    'value': [10, 15, 7, 40]
})

# Barplot
sns.barplot(x='category', y='value', data=bar_df, palette='Blues')
#plt.yscale('log')
plt.title('Barplot of Categories vs Values')
plt.show()

### Mean Squared Error (MSE) and Mean Absolute Error (MAE)

---

#### **Mean Squared Error (MSE)**
The Mean Squared Error (MSE) is defined as:
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
- $y_i$: Actual value.
- $\hat{y}_i$: Predicted value.
- $n$: Number of data points.

**Key Points:**
- MSE calculates the **average squared difference** between actual and predicted values.
- The squaring penalizes large errors more, making MSE sensitive to large outliers.
- The output is in **squared units** of the target variable.

---

#### **Mean Absolute Error (MAE)**
The Mean Absolute Error (MAE) is defined as:
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
- $|y_i - \hat{y}_i|$: Absolute difference between actual and predicted values.

**Key Points:**
- MAE calculates the **average absolute difference** between actual and predicted values.
- MAE is more **robust to outliers** compared to MSE.
- The output is in the **same units** as the target variable.

---

### **Differences Between MSE and MAE**

| **Metric** | **Description**            | **Sensitivity to Outliers**                      | **Units**                            |
|------------|----------------------------|--------------------------------------------------|--------------------------------------|
| **MSE**    | Average of squared errors  | High (penalizes large errors)                    | Squared units (e.g., dollars squared) |
| **MAE**    | Average of absolute errors | Low (errors count in proportion to their size)   | Same units as target (e.g., dollars) |

---

### **Conclusion:**
- Use **MSE** when you want to penalize large errors more heavily (e.g., cost-sensitive applications).
- Use **MAE** when you want a simpler, interpretable metric that is not as sensitive to outliers.
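
The difference in outlier sensitivity can be seen on a tiny made-up example. In this sketch the two prediction sets differ only in the last value, which is off by 100 instead of 10; the helper names `mean_abs_err` and `mean_sq_err` are hypothetical.

```python
import numpy as np

actual = np.array([100.0, 120.0, 80.0])
pred_close = np.array([90.0, 130.0, 70.0])     # every prediction is off by 10
pred_outlier = np.array([90.0, 130.0, 180.0])  # the last prediction is off by 100

def mean_abs_err(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_sq_err(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# How much does each metric grow when one large error is introduced?
mae_growth = mean_abs_err(actual, pred_outlier) / mean_abs_err(actual, pred_close)
mse_growth = mean_sq_err(actual, pred_outlier) / mean_sq_err(actual, pred_close)
print(f"MAE grows {mae_growth:.0f}x, MSE grows {mse_growth:.0f}x")
```

A single bad prediction multiplies MAE by 4 here but multiplies MSE by 34, which is what "penalizes large errors more" means in practice.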


### Understanding errors using numpy

In [None]:
# Sample data (separate names so the regression's y_pred is not overwritten)
y_true_demo = np.array([100, 120, 80])  # Actual values
y_pred_demo = np.array([90, 130, 75])   # Predicted values

# Calculate MAE manually
mae_demo = np.mean(np.abs(y_true_demo - y_pred_demo))

# Calculate MSE manually
mse_demo = np.mean((y_true_demo - y_pred_demo)**2)

# Calculate RMSE
rmse_demo = np.sqrt(mse_demo)

print(f"MAE: {mae_demo:.2f}")    # 8.33
print(f"MSE: {mse_demo:.2f}")    # 75.00
print(f"RMSE: {rmse_demo:.2f}")  # 8.66

### Coming back to our regression

In [58]:
# Calculate errors
mse = mean_squared_error(y_test, y_pred)  # Mean Squared Error
mae = mean_absolute_error(y_test, y_pred)  # Mean Absolute Error

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")


Mean Squared Error (MSE): 0.033253598577450326
Mean Absolute Error (MAE): 0.14544994321069074
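
These errors are in normalized units, which are hard to interpret. As a rough sketch, and assuming `charges` was min-max scaled over the full column as done earlier, they can be converted back to dollars using the min and max of `charges` from the "Before Normalization" summary: MAE scales with the range, and MSE with the range squared.

```python
import numpy as np

# Min and max of 'charges' copied from the "Before Normalization" summary
charges_min = 1121.8739
charges_max = 63770.42801
charges_range = charges_max - charges_min

# Errors reported above, in normalized units
mse_norm = 0.033253598577450326
mae_norm = 0.14544994321069074

# Undo the scaling: MAE is multiplied by the range, RMSE likewise
mae_dollars = mae_norm * charges_range
rmse_dollars = np.sqrt(mse_norm) * charges_range
print(f"MAE  is roughly {mae_dollars:,.0f} dollars")
print(f"RMSE is roughly {rmse_dollars:,.0f} dollars")
```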


In [None]:
# Remember charges were normalized
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(y_test, kde=True)
plt.title('Distribution of Charges (Test Data)')
plt.show()


Exercise: Now let's see how including 'smoker' as an independent variable impacts the error.

Solution: Look at this only once you have done the exercise.

In [62]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Features (X) and Target (y)
X = df_norm.drop(columns=['charges'])  # Include all features (including label-encoded `smoker`)
y = df_norm['charges']  # Target variable (normalized `charges`)

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on test data
y_pred = lr.predict(X_test)

# Calculate errors
mse = mean_squared_error(y_test, y_pred)  # Mean Squared Error
mae = mean_absolute_error(y_test, y_pred)  # Mean Absolute Error
r2 = r2_score(y_test, y_pred)  # R-squared: share of the variance in charges explained by the model


print("Linear Regression Results (with `smoker`):")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R-squared Score (R²): {r2:.4f}")


Linear Regression Results (with `smoker`):
Mean Squared Error (MSE): 0.0086
Mean Absolute Error (MAE): 0.0668
R-squared Score (R²): 0.7833
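
The R-squared score in the output has not been defined so far. As a short sketch, it compares the model's squared error with that of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sample values below are made up for illustration and are not from the insurance data.

```python
import numpy as np

y_actual = np.array([100.0, 120.0, 80.0])  # Actual values
y_hat = np.array([90.0, 130.0, 75.0])      # Predicted values

# Residual sum of squares: error left over after using the model
ss_res = np.sum((y_actual - y_hat) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(f"R-squared: {r2_manual:.4f}")
```

An R² of 1 means perfect predictions, while 0 means the model does no better than predicting the mean, so the 0.7833 above says the model with `smoker` explains roughly 78% of the variance in the test-set charges.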



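Let's quickly check how Random Forest performs on the same dataset. As in the first linear model, 'smoker' is left out of the features here, so the results are comparable with that first run.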

In [61]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Features (X) and Target (y)
X = df_norm.drop(columns=['charges', 'smoker'])  # All features except the target and `smoker`
y = df_norm['charges']  # Normalized charges (target variable)

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf.predict(X_test)

# Calculate errors
mse_rf = mean_squared_error(y_test, y_pred_rf)  # Mean Squared Error
mae_rf = mean_absolute_error(y_test, y_pred_rf)  # Mean Absolute Error

# Print results
print(f"Random Forest Regression Results (Normalized `charges`):")
print(f"Mean Squared Error (MSE): {mse_rf:.4f}")
print(f"Mean Absolute Error (MAE): {mae_rf:.4f}")


Random Forest Regression Results (Normalized `charges`):
Mean Squared Error (MSE): 0.0378
Mean Absolute Error (MAE): 0.1493
