### Problem Statement 1: Single Feature Linear Regression

**Business Problem:**  
"As an e-commerce analyst, predict a customer's Yearly Amount Spent based solely on their Length of Membership. This will help us understand whether customer loyalty (tenure) translates directly into revenue."

## Step 1: Load and Understand Your Data

In [None]:
# Load the data from a CSV file and print the first 5 rows as comma-separated values
import pandas as pd
data = pd.read_csv('../linear_regression_problems/data/Ecommerce_Customers.csv')
#print(data.head().to_json(orient='records', lines=True))
print(data.head().to_csv(index=False, sep=','))

print("\n data types", data.dtypes)


Email,Address,Avatar,Avg. Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent
mstephenson@fernandez.com,"835 Frank Tunnel
Wrightmouth, MI 82180-9605",Violet,34.49726772511229,12.655651149166752,39.57766801952616,4.082620632952961,587.9510539684005
hduke@hotmail.com,"4547 Archer Common
Diazchester, CA 06566-8576",DarkGreen,31.926272026360156,11.109460728682564,37.268958868297744,2.66403418213262,392.2049334443264
pallen@yahoo.com,"24645 Valerie Unions Suite 582
Cobbborough, DC 99414-7564",Bisque,33.000914755642675,11.330278057777512,37.11059744212085,4.104543202376424,487.54750486747207
riverarebecca@gmail.com,"1414 David Throughway
Port Jason, OH 22070-1220",SaddleBrown,34.30555662975554,13.717513665142508,36.72128267790313,3.1201787827480914,581.8523440352178
mstephens@davidson-herman.com,"14023 Rodriguez Passage
Port Jacobville, PR 37242-1057",MediumAquaMarine,33.33067252364639,12.795188551078114,37.53665330059473,4.446308318351435,599.4060920457634




In [30]:
# Keep only two columns: 'Length of Membership' (feature) and 'Yearly Amount Spent' (target)
refined_df = data[['Length of Membership', 'Yearly Amount Spent']]
print(refined_df.head())

   Length of Membership  Yearly Amount Spent
0              4.082621           587.951054
1              2.664034           392.204933
2              4.104543           487.547505
3              3.120179           581.852344
4              4.446308           599.406092


## Step 2: Prepare Your Features and Target, and Visualize the Relationship

In [31]:
x_train = refined_df[['Length of Membership']]  # Double brackets keep it as DataFrame
y_train = refined_df['Yearly Amount Spent']     # Single bracket makes it a Series

# Check shapes
print(f"X shape: {x_train.shape}")  # Should be (500, 1) - 500 rows, 1 column
print(f"y shape: {y_train.shape}")  # Should be (500,) - 500 values

X shape: (500, 1)
y shape: (500,)


In [32]:
import plotly.express as px
import plotly.graph_objects as go

# Create interactive scatter plot using plotly
fig = px.scatter(
    refined_df,
    x='Length of Membership', 
    y='Yearly Amount Spent',
    title='Relationship between Membership Length and Yearly Spending'
)
fig.update_traces(marker=dict(size=8, opacity=0.6))
fig.show()

# What to look for:
# - Is there an upward trend? (positive correlation)
# - How scattered are the points? (strength of relationship)
# - Any outliers? (points far from the pattern)
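
The visual check above can be backed by a number: the Pearson correlation coefficient. The sketch below is a minimal illustration in plain Python, assuming only the five rows printed by `refined_df.head()` earlier. Five points are far too few for a reliable estimate; in the notebook itself, calling `refined_df.corr()` on all 500 rows would be the natural choice. The helper name `pearson_r` is made up for this example.

```python
import math

# First five rows from refined_df.head() above; an illustrative sample only.
membership = [4.082621, 2.664034, 4.104543, 3.120179, 4.446308]
spent = [587.951054, 392.204933, 487.547505, 581.852344, 599.406092]

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(membership, spent)
print(f"Pearson r on the 5-row sample: {r:.3f}")
```

A value close to +1 means a strong upward trend, close to 0 means no linear trend, and close to -1 means a strong downward trend.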

## Step 3: Split Data for Training and Testing

In [33]:
from sklearn.model_selection import train_test_split

# Why split? 
# - Training set: Model learns from this (like studying from textbook)
# - Testing set: Model is evaluated on this (like taking an exam)
# - This prevents "memorization" (overfitting)
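# Read the full target straight from refined_df: y_train is overwritten by the
# split below, so reusing it as input would break this cell when it is re-run.
y_full = refined_df['Yearly Amount Spent']
if not x_train.empty and not y_full.empty:
    X_train, X_test, y_train, y_test = train_test_split(
        x_train, y_full,
        test_size=0.2,      # 20% for testing, 80% for training
        random_state=42     # Makes split reproducible (same split every time)
    )
    print(f"Training samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")
else:
    print("Training data is empty, cannot split.")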


Training samples: 400
Testing samples: 100


<br>

## Step 4: Create and Train the Model

In [None]:
from sklearn.linear_model import LinearRegression

# Linear Regression finds the best line: y = mx + b
# m = slope (coefficient), b = intercept

# Create an instance of the model
model = LinearRegression()

# Train the model (find the best line)
# fit() method finds optimal m and b values
model.fit(X_train, y_train)

# Extract the equation parameters
slope = model.coef_[0]  # How much y increases when x increases by 1
intercept = model.intercept_  # Where line crosses y-axis

print("Linear Regression Model Trained.")
print(f"Slope (m): {slope}")
print(f"Intercept (b): {intercept}")
print("Equation: Yearly Spent = {:.2f} × Membership Years + {:.2f}".format(slope, intercept))
print(f"Interpretation: Each year of membership adds ${slope:.2f} in yearly spending")


Linear Regression Model Trained.
Slope (m): 64.64010065386708
Intercept (b): 271.3521128033931
Equation: Yearly Spent = 64.64 × Membership Years + 271.35
Interpretation: Each year of membership adds $64.64 in yearly spending
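
The fitted equation can also be applied by hand, which makes the meaning of the slope and intercept concrete. The sketch below copies the two coefficients from the output above; the membership lengths and the helper `predict_spend` are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the training output above.
slope = 64.64010065386708
intercept = 271.3521128033931

def predict_spend(membership_years):
    """Apply the fitted line: spend = slope * years + intercept."""
    return slope * membership_years + intercept

# Hypothetical customers, used only to illustrate the equation.
for years in (1.0, 3.0, 5.0):
    print(f"{years:.1f} years -> ${predict_spend(years):.2f}")

# Two customers one year apart always differ by exactly the slope.
per_year = predict_spend(4.0) - predict_spend(3.0)
print(f"Difference per extra year: ${per_year:.2f}")
```

The intercept is the predicted spending at zero years of membership. Predictions for membership lengths outside the range seen in training are extrapolations and should be treated with caution.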


<br>

## Step 5: Make Predictions

In [38]:
# Predict on training set (to check if model learned)
train_predictions = model.predict(X_train)

# Predict on testing set (to evaluate performance)
test_predictions = model.predict(X_test)

# Compare predictions to actual values
comparison_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': test_predictions,
    'Difference': y_test - test_predictions
})

print(comparison_df.head(10))

         Actual   Predicted  Difference
361  401.033135  493.362399  -92.329264
73   534.777188  520.318554   14.458634
374  418.602742  545.316005 -126.713263
155  503.978379  461.485200   42.493179
104  410.069611  492.993962  -82.924351
394  557.608262  515.237210   42.371052
377  538.941975  485.320564   53.621410
124  514.336558  519.193226   -4.856668
68   408.620188  451.020612  -42.400424
450  475.015407  518.667117  -43.651710


<br>

## Step 6: Evaluate Model Performance

In [45]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Key Metrics Explained:
# MAE: Average absolute error (in dollars) - easier to interpret
# MSE: Penalizes large errors more
# RMSE: Square root of MSE (back to dollar units)
# R²: Share of variance the model explains (1 is perfect, 0 is no better than
#     always predicting the mean; it can be negative on test data)

# Calculate metrics for test set
mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, test_predictions)

print("Model Performance Metrics:")
print(f"MAE : ${mae:.2f} - On average, predictions are off by this amount")
print(f"RMSE: ${rmse:.2f} - Root Mean Squared Error")
print(f"R² Score: {r2:.3f} - Model explains {r2*100:.1f}% of variance")

# Is this good? It depends on the use case.
# R² > 0.7 is a common rule of thumb for business predictions, not a hard threshold
# MAE should be small relative to average spending ($499)

Model Performance Metrics:
MAE : $37.84 - On average, predictions are off by this amount
RMSE: $46.50 - Root Mean Squared Error
R² Score: 0.563 - Model explains 56.3% of variance
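
To see what the three metrics measure, they can be computed directly from their definitions. The sketch below is a plain-Python illustration that assumes only the ten rows printed in the Step 5 comparison table, so its numbers will differ from the full 100-row test-set results above. The function name `regression_metrics` is made up for this example.

```python
import math

# First ten test rows from the Step 5 comparison table; not the full test set.
actual = [401.033135, 534.777188, 418.602742, 503.978379, 410.069611,
          557.608262, 538.941975, 514.336558, 408.620188, 475.015407]
predicted = [493.362399, 520.318554, 545.316005, 461.485200, 492.993962,
             515.237210, 485.320564, 519.193226, 451.020612, 518.667117]

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e ** 2 for e in errors)           # squared error of the model
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # squared error of the mean
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```

On this ten-row slice R² comes out negative, because the model's squared error here exceeds that of simply predicting the slice's own mean. That illustrates two points: R² is not confined to the 0 to 1 range on held-out data, and a handful of rows is not a reliable basis for judging a model.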


<br>

## Step 7: Visualize the Model Fit

In [52]:

import numpy as np
# Create interactive plot with regression line
fig = go.Figure()

# Add scatter plot of actual data
fig.add_trace(go.Scatter(
    x=X_test['Length of Membership'], 
    y=y_test,
    mode='markers',
    name='Actual Data',
    marker=dict(size=8, opacity=0.6)
))

# Add regression line
# Build the x values as a DataFrame with the training column name, so predict()
# receives the same feature names the model was fitted with (avoids a warning)
x_range = np.linspace(X_train['Length of Membership'].min(), X_train['Length of Membership'].max(), 100)
x_range_df = pd.DataFrame({'Length of Membership': x_range})
y_line = model.predict(x_range_df)
fig.add_trace(go.Scatter(
    x=x_range, 
    y=y_line,
    mode='lines',
    name='Regression Line',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title='Linear Regression: Membership Length vs Yearly Spending',
    xaxis_title='Length of Membership',
    yaxis_title='Yearly Spending ($)',
    hovermode='closest',
)
fig.show()



## Step 8: Residual Analysis
- Residuals are the differences between the observed and predicted values. If the model captures the underlying pattern well, the residuals scatter randomly around zero with no visible structure; a curve or funnel shape suggests the straight line is missing something in the data.

In [53]:
import plotly.graph_objects as go

# Calculate residuals
residuals = y_test - test_predictions

# Create residual plot
fig = go.Figure()

# Add residuals as scatter points
fig.add_trace(go.Scatter(
    x=test_predictions,  # Predicted values on x-axis
    y=residuals,         # Residuals on y-axis
    mode='markers',
    marker=dict(size=8, opacity=0.6),
    name='Residuals'
))

# Add a horizontal line at y=0
fig.add_hline(y=0, line_dash="dash", line_color="red")

# Update layout for better visualization
fig.update_layout(
    title='Residual Plot: Checking Model Assumptions',
    xaxis_title='Predicted Values',
    yaxis_title='Residuals',
    hovermode='closest'
)

# Show the plot
fig.show()
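
The residual plot is judged by eye, so a few summary numbers can complement it. The sketch below is a plain-Python illustration that assumes only the ten residuals listed in the `Difference` column of the Step 5 table; in the notebook, the same checks would apply to the full `residuals` Series.

```python
import statistics

# Residuals (actual - predicted) for the first ten test rows shown in Step 5.
residuals = [-92.329264, 14.458634, -126.713263, 42.493179, -82.924351,
             42.371052, 53.621410, -4.856668, -42.400424, -43.651710]

mean_residual = statistics.mean(residuals)   # near 0 suggests no systematic bias
spread = statistics.stdev(residuals)         # typical size of a miss
n_over = sum(r < 0 for r in residuals)       # rows where the model predicted too high
n_under = sum(r > 0 for r in residuals)      # rows where the model predicted too low
worst = max(residuals, key=abs)              # largest miss in either direction

print(f"Mean residual: {mean_residual:.2f}")
print(f"Std deviation: {spread:.2f}")
print(f"Over-predictions: {n_over}, under-predictions: {n_under}")
print(f"Largest miss: {worst:.2f}")
```

A mean far from zero, or a strong imbalance between over- and under-predictions, would hint at systematic bias, although ten rows are too few to draw a conclusion.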

## Further Questions for Exploration
- Is this good performance? How do we decide?
- Can we improve our linear regression model while keeping the same feature? How?
- Can we improve our model by adding more features? Which features, and how?
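
One possible way to approach the first question is to compare the model against a naive baseline that always predicts the mean. The sketch below uses only numbers already reported in Step 6 (MAE, RMSE, R², and the approximate $499 average spending). It relies on the definition R² = 1 - MSE_model / MSE_baseline, where the baseline predicts the test-set mean. It is one lens among several, not a definitive test of model quality.

```python
import math

# Values copied from the Step 6 output and comments.
mae = 37.84
rmse = 46.50
r2 = 0.563
mean_spend = 499.0  # approximate average yearly spending quoted in Step 6

# Typical error as a fraction of typical spending.
relative_mae = mae / mean_spend

# Since R² = 1 - MSE_model / MSE_baseline, the RMSE of the mean-only
# baseline can be recovered from the model's RMSE and R².
baseline_rmse = rmse / math.sqrt(1 - r2)
rmse_reduction = 1 - rmse / baseline_rmse

print(f"MAE relative to mean spending: {relative_mae:.1%}")
print(f"Implied baseline RMSE: ${baseline_rmse:.2f}")
print(f"RMSE reduction vs baseline: {rmse_reduction:.1%}")
```

Whether the resulting reduction is good enough depends on how the predictions will be used, which is a business judgment rather than a statistical one.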