Section 1: Predicting used car prices
We’ll be using the cars.csv data set for this section of the exercise. The data set covers the characteristics and prices for used cars sold in India. We are interested in predicting the price of a car given some characteristics. We will attempt to build a linear regression model of Price. We are going to work on filling in the missing data that we previously dropped. 

1.	Transform Price so that it looks more normal, produce histograms of the variable before and after transformation
2.	How many values are missing for Power and Engine?
3.	Which column has the most missing values and what should we do about it?
4.	Build a model of transformed price based on Power, Mileage, Kilometers Driven, and Year, how much variance is explained?
5.	How many rows were used to train the model?
6.	Fill the missing values in Power and Mileage with their respective means and rebuild the model. Now how much variance is explained?
7.	How many rows were used to train the model?
8.	Impute the missing data using MICE and rebuild the model
MICE documentation: https://www.statsmodels.org/dev/generated/statsmodels.imputation.mice.MICE.html 
9.	How have the parameter estimates changed from step 4?
10.	Plot the distribution of Power with and without MICE
11.	Plot the distribution of Engine with and without MICE



In [None]:
import pandas as pd

#load cars.csv and customers.csv into pandas dataframes
cars_df = pd.read_csv("cars.csv")
customers_df = pd.read_csv("customers.csv")

#print the first few rows of each dataframe (cars & customers)
print(cars_df.head())
print(customers_df.head())

#generate some summary info (metadata, for example) about the dataframes
print(f"Cars Shape: {cars_df.shape} (Rows, Columns)")
print(f"Customers Shape: {customers_df.shape} (Rows, Columns)")

#get the data types and non-null counts
print("Cars info:")
print(cars_df.info())
print("Customers info:")
print(customers_df.info())

#get the summary statistics for both dataframes
print("Cars Summary Statistics for numeric columns")
print(cars_df.describe())

print("Customers Summary Statistics for numeric columns")
print(customers_df.describe())

#check for missing data (by counting total missing values by column)
print("Missing Cars data:")
print(cars_df.isnull().sum())

print("Missing Customers data:")
print(customers_df.isnull().sum())

#count unique values per column (might be helpful)
print("Unique Values per Cars column")
print(cars_df.nunique())

print("Unique Values per Customers column")
print(customers_df.nunique())


In [None]:
#1.	Transform Price so that it looks more normal, produce histograms of the variable before and after transformation

#bring in some new libraries we need
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#clean and convert price from the Cars dataframe
cars_df["Price"] = pd.to_numeric(cars_df["Price"], errors="coerce")
cars_df = cars_df.dropna(subset=["Price"])  #drop rows with missing Price

#plot original price histogram
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(cars_df["Price"], bins=30, kde=True)
plt.title("Original Price Distribution")
plt.xlabel("Price")

#log-transform price to create a more normal-looking distribution
cars_df["Log_Price"] = np.log(cars_df["Price"])

#plot log-transformed price histogram
plt.subplot(1, 2, 2)
sns.histplot(cars_df["Log_Price"], bins=30, kde=True, color="orange")
plt.title("Log-Transformed Price Distribution")
plt.xlabel("Log(Price)")

#display the price distribution histograms side-by-side
plt.tight_layout()
plt.show()

2. How many values are missing for Power and Engine?

The first cell above already counts the null values in every column (cars_df.isnull().sum()). It shows 36 missing values in Power and 36 missing values in Engine.
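
As a hedged aside, the count of missing values can depend on whether it is taken before or after the text columns are converted to numbers. The sketch below uses a hypothetical toy column; the values and the "null bhp" placeholder are assumptions for illustration, not rows taken from cars.csv.

```python
import pandas as pd

# Hypothetical toy column; the values and the "null bhp" placeholder are illustrative only.
toy = pd.DataFrame({"Power": ["58.16 bhp", "null bhp", None, "126.2 bhp"]})

# Missing values visible before any cleaning (only true NaN/None entries).
missing_raw = int(toy["Power"].isna().sum())

# Strip the unit, then coerce: anything that is not a number becomes NaN.
power_num = pd.to_numeric(
    toy["Power"].str.replace(" bhp", "", regex=False), errors="coerce"
)
missing_clean = int(power_num.isna().sum())

print(f"Missing before cleaning: {missing_raw}")
print(f"Missing after coercion:  {missing_clean}")
```

If the two counts differ on the real data, the second is the one that matters for modeling, since the regression only sees the numeric column.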


In [None]:
#3. Which column has the most missing values and what should we do about it?


In [None]:
num_rows = cars_df.shape[0]

print(f"Missing values in cars dataframe out of {num_rows} rows:")
print(cars_df.isnull().sum())

#with this simple pandas df test I can display the null values for each column and visually see which has the highest count of missing values

In [None]:
#while I can visually see the highest value above, I thought I'd attempt to determine it programmatically
#furthermore, I'll make a function to do it for any given dataframe and also check it for the customers df

def report_max_missing(df, name="DataFrame"):
    """
    Prints the column with the most missing (NaN) values in the given DataFrame.
    Parameters:
        df (DataFrame): A pandas DataFrame
        name (str): A name for the DataFrame (for labeling in the output)
    """
    missing_counts = df.isnull().sum()
    max_missing_col = missing_counts.idxmax()
    max_missing_count = missing_counts.max()

    print(f"\n[{name}]")
    print(f"Max missing count: {max_missing_count}")
    print(missing_counts)
    #if missing_counts == 0 [wrong usage]:
    #if missing_counts.empty: [wrong thing to test]
    if max_missing_count == 0:
        print(f"\n{name}: No missing values found")
    else:
        print(f"\n{name}: Column with Most Missing Values:")
        print(f"→ Column: {max_missing_col} ({max_missing_count} missing)")
    
#perform the function on both datasets - cars and customers
report_max_missing(cars_df, "Cars Dataset")
report_max_missing(customers_df, "Customers Dataset")

In [None]:
#4.	Build a model of transformed price based on Power, Mileage, Kilometers Driven, and Year, how much variance is explained?

import statsmodels.api as sm

#reload the Cars data since we're transforming from scratch
cars_df = pd.read_csv("cars.csv")

#clean and convert necessary columns
cars_df["Price"] = pd.to_numeric(cars_df["Price"], errors="coerce")
cars_df["Power"] = cars_df["Power"].str.replace(" bhp", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" kmpl", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" km/g", "", regex=False)
cars_df["Engine"] = cars_df["Engine"].str.replace(" CC", "", regex=False)

cars_df["Power"] = pd.to_numeric(cars_df["Power"], errors="coerce")
cars_df["Mileage"] = pd.to_numeric(cars_df["Mileage"], errors="coerce")
cars_df["Engine"] = pd.to_numeric(cars_df["Engine"], errors="coerce")

#drop rows where any of the key variables are missing
#df_model = cars_df.dropna(subset=["Price", "Power", "Mileage", "Kilometers_Driven", "Year"])
    #I needed to change this to create a "copy" so I don't get a warning on the Price code that follows
df_model = cars_df.dropna(subset=["Price", "Power", "Mileage", "Kilometers_Driven", "Year"]).copy()

#transform Price by taking the log to normalize
df_model["Log_Price"] = np.log(df_model["Price"])

#set up regression features and target
X = df_model[["Power", "Mileage", "Kilometers_Driven", "Year"]]
y = df_model["Log_Price"]

#add constant (intercept)
X = sm.add_constant(X)

#fit linear regression model
model = sm.OLS(y, X).fit()
first_R2 = model.rsquared

#R-squared
print(f"R-squared (Variance Explained): {first_R2:.4f}")

#we'll need these later for the MICE comparison
model4_params_df = pd.DataFrame({'Model 4 Coefficients': model.params.round(4)})



About 82% of the variability in the log-transformed car prices is explained by Power, Mileage, Kilometers Driven, and Year. That is a strong R-squared and indicates a good fit.
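
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy on synthetic data; the data and coefficient values are assumptions for illustration and have nothing to do with cars.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends linearly on two predictors.
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column, solved via least squares.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
residuals = y - X_design @ beta
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.4f}")
```

This is the same quantity that statsmodels reports as model.rsquared for an OLS fit with an intercept.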

In [None]:
#5.	How many rows were used to train the model?

#since we already cleaned the data, I need only to count the number of rows in the dataframe
num_rows = df_model.shape[0]
print(f"Number of rows used to train the model: {num_rows}")


In [None]:
#6.	Fill the missing values in Power and Mileage with their respective means and rebuild the model. Now how much variance is explained?

#reload the dataset since we're transforming anew
cars_df = pd.read_csv("cars.csv")

#how many rows did we read in
num_rows = cars_df.shape[0]
print(f"\nNumber of rows read in from file: {num_rows}:")


#clean and strip units from strings, as before
cars_df["Power"] = cars_df["Power"].str.replace(" bhp", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" kmpl", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" km/g", "", regex=False)

#convert to numeric (invalid values will be NaN)
cars_df["Power"] = pd.to_numeric(cars_df["Power"], errors="coerce")
cars_df["Mileage"] = pd.to_numeric(cars_df["Mileage"], errors="coerce")

num_rows = cars_df.shape[0]
print(f"\nHow many rows after null conversions: {num_rows}:")

#calculate means, but skip missing values
mean_power = cars_df["Power"].mean(skipna=True)
mean_mileage = cars_df["Mileage"].mean(skipna=True)

#now, go back through and replace missing values with the means we just calced
#cars_df["Power"].fillna(mean_power, inplace=True) #- got future warning on this code
cars_df.fillna({"Power": mean_power}, inplace=True)

#cars_df["Mileage"].fillna(mean_mileage, inplace=True) #- got future warning on this code
cars_df.fillna({"Mileage": mean_mileage}, inplace=True)
num_rows = cars_df.shape[0]
print(f"\nHow many rows after replacing nulls with means: {num_rows}:")

#show confirmation and results
print(f"Mean Power used for imputation: {mean_power:.2f} bhp")
print(f"Mean Mileage used for imputation: {mean_mileage:.2f} kmpl")

# Optional: Check if any missing values remain
print("\nMissing values after replacing missing values with means:")
print(cars_df[["Power", "Mileage"]].isna().sum())

df_model = cars_df.dropna(subset=["Price", "Power", "Mileage", "Kilometers_Driven", "Year"]).copy()

#log-transform the target variable (Price)
df_model["Log_Price"] = np.log(df_model["Price"])

#define predictors and add constant for intercept
X = df_model[["Power", "Mileage", "Kilometers_Driven", "Year"]]
X = sm.add_constant(X)
y = df_model["Log_Price"]

#build the linear regression model
model = sm.OLS(y, X).fit()

#print R-squared (variance explained) and model summary
print(f"R-squared (Variance Explained): {model.rsquared:.4f}")
print("\nModel Coefficients:")
print(model.params)

# Optional: preview predictions and residuals
#df_model["Predicted_Log_Price"] = model.predict(X)
#df_model["Residuals"] = df_model["Log_Price"] - df_model["Predicted_Log_Price"]

#print("\nFirst 5 Predictions:")
#print(df_model["Predicted_Log_Price"].head())

#print("\nFirst 5 Residuals:")
#print(df_model["Residuals"].head())

num_rows = df_model.shape[0]
print(f"Number of rows used to train the new model: {num_rows}")
print(f"R-squared before replacing missing with averages: {first_R2}")
print(f"R-squared after replacing missing with averages: {model.rsquared}")


The variance explained is very similar. R-squared is slightly lower when the missing values are replaced with the mean of the valid values.
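
One plausible reason for the small drop is that mean imputation gives every filled row the same value, which compresses the spread of the predictor while adding rows whose target still varies. A minimal sketch on a synthetic column (the distribution and the 20% missing rate are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic column (assumption for illustration) with roughly 20% of values removed at random.
power = pd.Series(rng.normal(loc=110, scale=50, size=1000))
power_missing = power.mask(rng.random(1000) < 0.2)

# Mean imputation: every missing entry receives the same value.
power_filled = power_missing.fillna(power_missing.mean())

std_observed = power_missing.std()  # pandas skips NaN by default
std_filled = power_filled.std()

print(f"Std of observed values:    {std_observed:.2f}")
print(f"Std after mean imputation: {std_filled:.2f}")
```

The mean is unchanged by construction, but the standard deviation after filling is smaller than that of the observed values.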

In [None]:
#7.	How many rows were used to train the model?

#since we already cleaned the data, I need only to count the number of rows in the dataframe
num_rows = df_model.shape[0]
print(f"Number of rows used to train the new model: {num_rows}")

All rows in the table were used, because the missing values in Power and Mileage were replaced with the means of the valid values, so no rows were dropped for those columns.

In [None]:

#8.	Impute the missing data using MICE and rebuild the model

from statsmodels.imputation.mice import MICEData

#reload and clean data (AGAIN)
cars_df = pd.read_csv("cars.csv")

cars_df["Power"] = cars_df["Power"].str.replace(" bhp", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" kmpl", "", regex=False)
cars_df["Mileage"] = cars_df["Mileage"].str.replace(" km/g", "", regex=False)
cars_df["Engine"] = cars_df["Engine"].str.replace(" CC", "", regex=False)

cars_df["Power"] = pd.to_numeric(cars_df["Power"], errors="coerce")
cars_df["Mileage"] = pd.to_numeric(cars_df["Mileage"], errors="coerce")
cars_df["Engine"] = pd.to_numeric(cars_df["Engine"], errors="coerce")

cars_df = cars_df.dropna(subset=["Price", "Year", "Kilometers_Driven"]).copy()
cars_df["Log_Price"] = np.log(cars_df["Price"])

#for relevant columns
model_cols = ["Log_Price", "Power", "Mileage", "Kilometers_Driven", "Year"]
mice_data = MICEData(cars_df[model_cols])

#OLS on imputed data
imp_df = mice_data.next_sample()
X = sm.add_constant(imp_df[["Power", "Mileage", "Kilometers_Driven", "Year"]])
y = imp_df["Log_Price"]
model = sm.OLS(y, X).fit()

#results
print("✅ Regression with MICE-imputed data:\n")
print(model.summary())

In [None]:
#9.	How have the parameter estimates changed from step 4?
mice_params_df = pd.DataFrame({'MICE Coefficients': model.params.round(4)})


print(model4_params_df)
print(mice_params_df)


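I see almost no change between the two sets of coefficients, which makes me wonder whether I did something wrong. One likely explanation is that only a small share of rows have missing Power or Mileage, so imputing them adds little information beyond the complete rows used in step 4. Note also that this model is fit on a single imputed sample from MICEData rather than pooled over several imputations.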

In [None]:
#10. Plot the distribution of Power with and without MICE

#reloading and recleaning to assure the proper data is being used for the plots

df_raw = pd.read_csv("cars.csv")

#clean
df_raw["Power"] = df_raw["Power"].str.replace(" bhp", "", regex=False)
df_raw["Power"] = pd.to_numeric(df_raw["Power"], errors="coerce")

#before imputation
power_before = df_raw["Power"].copy()

#prepare MICE data
df_for_mice = df_raw[["Power", "Mileage", "Kilometers_Driven", "Year", "Price"]].copy()
df_for_mice["Mileage"] = df_for_mice["Mileage"].str.replace(" kmpl", "", regex=False)
df_for_mice["Mileage"] = df_for_mice["Mileage"].str.replace(" km/g", "", regex=False)
df_for_mice["Mileage"] = pd.to_numeric(df_for_mice["Mileage"], errors="coerce")
df_for_mice = df_for_mice.dropna(subset=["Price", "Year", "Kilometers_Driven"]).copy()
df_for_mice["Log_Price"] = np.log(df_for_mice["Price"])

#apply MICE imputation
mice_data = MICEData(df_for_mice[["Power", "Mileage", "Kilometers_Driven", "Year", "Log_Price"]])
df_mice = mice_data.next_sample()

#after imputation
power_after = df_mice["Power"]

#plot both distributions, with side-by-side histograms
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

#before MICE
sns.histplot(power_before, ax=axes[0], color="skyblue", kde=True, bins=30)
axes[0].set_title("Power Distribution: Before MICE")
axes[0].set_xlabel("Power (bhp)")
axes[0].set_ylabel("Count")  #histplot plots counts by default, not densities
axes[0].grid(True)

#after MICE
sns.histplot(power_after, ax=axes[1], color="orange", kde=True, bins=30)
axes[1].set_title("Power Distribution: After MICE")
axes[1].set_xlabel("Power (bhp)")
axes[1].grid(True)

plt.tight_layout()
plt.show()

As with the coefficients above, the side-by-side histograms show no visible difference in the distribution of Power before and after MICE.

In [None]:
#11. Plot the distribution of Engine with and without MICE


#reload and clean the dataset
df_raw = pd.read_csv("cars.csv")

#clean Engine column
df_raw["Engine"] = df_raw["Engine"].str.replace(" CC", "", regex=False)
df_raw["Engine"] = pd.to_numeric(df_raw["Engine"], errors="coerce")

#save Engine before imputation
engine_before = df_raw["Engine"].copy()

#clean other required columns
df_raw["Power"] = df_raw["Power"].str.replace(" bhp", "", regex=False)
df_raw["Power"] = pd.to_numeric(df_raw["Power"], errors="coerce")
df_raw["Mileage"] = df_raw["Mileage"].str.replace(" kmpl", "", regex=False)
df_raw["Mileage"] = df_raw["Mileage"].str.replace(" km/g", "", regex=False)
df_raw["Mileage"] = pd.to_numeric(df_raw["Mileage"], errors="coerce")

#drop rows required for model training
df_for_mice = df_raw.dropna(subset=["Price", "Year", "Kilometers_Driven"]).copy()
df_for_mice["Log_Price"] = np.log(df_for_mice["Price"])

#prepare MICE input
mice_data = MICEData(df_for_mice[["Engine", "Power", "Mileage", "Kilometers_Driven", "Year", "Log_Price"]])
df_mice = mice_data.next_sample()

#save Engine after MICE
engine_after = df_mice["Engine"]

#plot side-by-side histograms
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

#before MICE
sns.histplot(engine_before, ax=axes[0], color="lightblue", kde=True, bins=30)
axes[0].set_title("Engine Distribution: Before MICE")
axes[0].set_xlabel("Engine (CC)")
axes[0].set_ylabel("Density")
axes[0].grid(True)

#after MICE
sns.histplot(engine_after, ax=axes[1], color="coral", kde=True, bins=30)
axes[1].set_title("Engine Distribution: After MICE")
axes[1].set_xlabel("Engine (CC)")
axes[1].grid(True)

plt.tight_layout()
plt.show()

Section 2: Predicting customer spending
We’ll be using the customers.csv data set for this exercise. The data set covers the demographic characteristics of some customers and the amount they spent over the past year at an online retailer. For this exercise it is recommended to use the sklearn packages for linear regression, ridge, and lasso. Sklearn documentation linked below.
Linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html 
Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
In order to interact the categorical variables you will need to dummy code them and manually multiply, an example is given below. 
customerDf = pd.get_dummies(data=customerDf, columns=['sex', 'race'], prefix=['sex','race'])
customerDf["Hispanic_Male"] = np.multiply(customerDf["race_hispanic"],customerDf["sex_male"])


1.	Build a linear regression with all the dependent variables and two way interactions between sex and race, consider the other category for race and sex to be the reference category and treat it appropriately
2.	Build ridge models with various values for alpha. Create a chart showing how the coefficients change with alpha values
3.	Build lasso models with various values for alpha. Create a chart showing how the coefficients change with alpha values
4.	Compare the coefficients from linear regression, ridge, and lasso (select an alpha value using your chart)
5.	Compare the R2 from lr, ridge, and lasso
6.	Which model would you choose, and why?
7.	Which variables are dropped from the chosen model that were not dropped in linear regression?
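
For question 1, one way to make "other" the reference category is to dummy code every level and then drop the "other" columns explicitly. The sketch below uses hypothetical toy rows; the category labels are assumptions modeled on the task description, not values read from customers.csv.

```python
import pandas as pd

# Hypothetical toy rows; category labels are assumptions modeled on the task description.
toy = pd.DataFrame({
    "sex": ["male", "female", "other", "male", "female"],
    "race": ["white", "black", "other", "hispanic", "other"],
    "spend": [120.0, 95.0, 80.0, 110.0, 70.0],
})

# Dummy code every level, then drop the 'other' columns so that 'other' is the reference.
dummies = pd.get_dummies(toy, columns=["sex", "race"], prefix=["sex", "race"], dtype=int)
dummies = dummies.drop(columns=["sex_other", "race_other"])

# Two-way interactions: multiply each remaining sex dummy by each remaining race dummy.
sex_cols = [c for c in dummies.columns if c.startswith("sex_")]
race_cols = [c for c in dummies.columns if c.startswith("race_")]
for s in sex_cols:
    for r in race_cols:
        dummies[f"{s}_x_{r}"] = dummies[s] * dummies[r]

print(dummies.columns.tolist())
```

By contrast, drop_first=True drops the first level in sorted order, which need not be "other".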


In [None]:
#1. Build a linear regression with all the dependent variables and two way interactions between sex and race,
#   consider the other category for race and sex to be the reference category and treat it appropriately

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

#load the dataset
df = pd.read_csv("customers.csv")

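#dummy encode 'sex' and 'race'; note that drop_first=True drops the first level in sorted order,
#which becomes the reference category (not necessarily 'other')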
df = pd.get_dummies(df, columns=['sex', 'race'], prefix=['sex', 'race'], drop_first=True)

#create interaction terms between sex and race (e.g., male * each race)
#for example: sex_male * race_black, sex_male * race_white, sex_male * race_hispanic (if present)
interaction_terms = []
if 'sex_male' in df.columns:
    for col in df.columns:
        if col.startswith('race_'):
            interaction_col = f"{col}_x_sex_male"
            df[interaction_col] = df[col] * df['sex_male']
            interaction_terms.append(interaction_col)

#prepare X and y, excluding the target and any columns not needed in model
target = "spend"
X_cols = [col for col in df.columns if col != target]
X = df[X_cols]
y = df[target]

#fit linear regression model
model = LinearRegression()
model.fit(X, y)

#print coefficients and R^2
coefficients = pd.Series(model.coef_, index=X.columns)
intercept = model.intercept_
r2 = model.score(X, y)

print("Intercept:", intercept)
print("\nCoefficients:\n", coefficients)
print("\nR-squared:", round(r2, 4))

In [None]:
#2.	Build ridge models with various values for alpha. Create a chart showing how the coefficients change with alpha values

import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt

#reload the dataset
df = pd.read_csv("customers.csv")

#dummy encode sex and race (drop_first=True for reference category)
df = pd.get_dummies(df, columns=['sex', 'race'], prefix=['sex','race'], drop_first=True)

#create interaction terms: sex_male * race_*
interaction_terms = []
if 'sex_male' in df.columns:
    for col in df.columns:
        if col.startswith('race_'):
            interaction_col = f"{col}_x_sex_male"
            df[interaction_col] = df[col] * df['sex_male']
            interaction_terms.append(interaction_col)

#define predictors (X) and target (y)
target = "spend"
X_cols = [col for col in df.columns if col != target]
X = df[X_cols]
y = df[target]

#standardize X (optional, but common for Ridge/Lasso)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Try a range of alphas and record coefficients
alphas = np.logspace(-2, 4, 50)
coefs = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_scaled, y)
    coefs.append(ridge.coef_)

#plotting coefficient paths
plt.figure(figsize=(12, 6))
coefs = np.array(coefs)

for i in range(coefs.shape[1]):
    plt.plot(alphas, coefs[:, i], label=X.columns[i])

plt.xscale("log")
plt.xlabel("Alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge Coefficients vs. Alpha")
plt.axhline(0, color='black', linestyle='--', linewidth=1)
plt.legend(loc="upper left", bbox_to_anchor=(1.05, 1), fontsize="small")  #anchor the legend outside the axes
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
from sklearn.linear_model import RidgeCV

ridge_alphas = [0.01, 0.1, 1, 10, 100]
ridge_cv = RidgeCV(alphas=ridge_alphas, store_cv_results=True)
ridge_cv.fit(X_scaled, y)
print("Best Ridge alpha:", ridge_cv.alpha_)

#I didn't end up using this value as it's off by a factor of 100 (I'm still trying to understand why)

In [None]:
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10], cv=5, max_iter=10000)
lasso_cv.fit(X_scaled, y)
print("Best Lasso alpha:", lasso_cv.alpha_)

#I didn't end up using this value as it's off by a factor of 100 (I'm still trying to understand why)

In [None]:
#3.	Build lasso models with various values for alpha. Create a chart showing how the coefficients change with alpha values

import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

#reload the dataset
df = pd.read_csv("customers.csv")

#encode 'sex' and 'race' (drop_first=True to set reference categories)
df = pd.get_dummies(df, columns=['sex', 'race'], prefix=['sex', 'race'], drop_first=True)

#create interaction terms: sex_male * race_*
interaction_terms = []
if 'sex_male' in df.columns:
    for col in df.columns:
        if col.startswith('race_'):
            interaction_col = f"{col}_x_sex_male"
            df[interaction_col] = df[col] * df['sex_male']
            interaction_terms.append(interaction_col)

#set up X and y
target = "spend"
X_cols = [col for col in df.columns if col != target]
X = df[X_cols]
y = df[target]

#standardize X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#Lasso: Loop over alphas and store coefficients
alphas = np.logspace(-2, 1, 50)  # from 0.01 to 10
coefs = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_scaled, y)
    coefs.append(lasso.coef_)

#plot coefficient paths
plt.figure(figsize=(12, 6))
coefs = np.array(coefs)

for i in range(coefs.shape[1]):
    plt.plot(alphas, coefs[:, i], label=X.columns[i])

plt.xscale("log")
plt.xlabel("Alpha (log scale)")
plt.ylabel("Coefficient Value")
plt.title("Lasso Coefficients vs. Alpha")
plt.axhline(0, color='black', linestyle='--', linewidth=1)
plt.legend(loc="upper left", bbox_to_anchor=(1.05, 1), fontsize="small")  #anchor the legend outside the axes
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
#4.	Compare the coefficients from linear regression, ridge, and lasso (select an alpha value using your chart)

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

#reload data
df = pd.read_csv("customers.csv")

#dummy encode 'sex' and 'race' and create interaction terms
df = pd.get_dummies(df, columns=['sex', 'race'], prefix=['sex','race'], drop_first=True)

if 'sex_male' in df.columns:
    for col in df.columns:
        if col.startswith('race_'):
            df[f'{col}_x_sex_male'] = df[col] * df['sex_male']

#set up X and y
target = "spend"
X_cols = [col for col in df.columns if col != target]
X = df[X_cols]
y = df[target]

#standardize predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#fit all models to be compared
linear_model = LinearRegression().fit(X_scaled, y)
ridge_model = Ridge(alpha=10).fit(X_scaled, y)   # <- Picked alpha=10 from ridge plot
lasso_model = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)  # <- Picked alpha=0.1 from lasso plot

#create comparison DataFrame
coef_comparison = pd.DataFrame({
    'Feature': X.columns,
    'Linear': linear_model.coef_,
    'Ridge (α=10)': ridge_model.coef_,
    'Lasso (α=0.1)': lasso_model.coef_,
})

#round and sort by absolute size of Linear Regression coef
coef_comparison = coef_comparison.round(4).set_index("Feature")
coef_comparison = coef_comparison.reindex(coef_comparison['Linear'].abs().sort_values(ascending=False).index)

print(coef_comparison)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

#prepare the data
df = pd.read_csv("customers.csv")

#encode categorical variables
df = pd.get_dummies(df, columns=['sex', 'race'], prefix=['sex','race'], drop_first=True)

#create interaction term
if 'sex_male' in df.columns:
    for col in df.columns:
        if col.startswith('race_'):
            df[f'{col}_x_sex_male'] = df[col] * df['sex_male']

#setup X and y
target = 'spend'
X = df.drop(columns=[target])
y = df[target]

#standardize X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#fit models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=10)': Ridge(alpha=10),
    'Lasso (α=0.1)': Lasso(alpha=0.1, max_iter=10000)
}

coef_df = pd.DataFrame(index=X.columns)

for name, model in models.items():
    model.fit(X_scaled, y)
    coef_df[name] = model.coef_

#plot the coefficients
coef_df.plot(kind='bar', figsize=(12, 6))
plt.title('Model Coefficient Comparison')
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.xticks(rotation=45, ha='right')
plt.legend(loc='best')
plt.tight_layout()
plt.grid(True)
plt.show()

In [None]:
#5.	Compare the R2 from lr, ridge, and lasso

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score

#Linear Regression
linear_model = LinearRegression().fit(X_scaled, y)
y_pred_linear = linear_model.predict(X_scaled)
r2_linear = r2_score(y, y_pred_linear)

#Ridge Regression (choose alpha based on your ridge plot)
ridge_model = Ridge(alpha=10).fit(X_scaled, y)
y_pred_ridge = ridge_model.predict(X_scaled)
r2_ridge = r2_score(y, y_pred_ridge)

#Lasso Regression (choose alpha based on your lasso plot)
lasso_model = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
y_pred_lasso = lasso_model.predict(X_scaled)
r2_lasso = r2_score(y, y_pred_lasso)

#print R² comparison
print("R² Comparison:")
print(f"Linear Regression R²: {r2_linear:.4f}")
print(f"Ridge Regression R²:  {r2_ridge:.4f}  (alpha=10)")
print(f"Lasso Regression R²:  {r2_lasso:.4f}  (alpha=0.1)")

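The R-squared values across all three models are nearly identical: each explains about 87% of the variance in customer spending. The LR model has the highest in-sample R-squared by construction, because ordinary least squares minimizes the residual sum of squares on the data it is fit to.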


#6.	Which model would you choose, and why?

Given the high R-squared across all models and the small differences in performance, there is no clear sign of overfitting. Therefore, I choose Ridge regression: it gives nearly the same accuracy as LR and retains all features (unlike Lasso), which may end up being desirable.
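
The R-squared values above are computed on the same rows used for fitting, so they cannot by themselves reveal overfitting. A hedged sketch of a cross-validated comparison follows; it uses synthetic data (an assumption for illustration) and reuses the alpha values chosen above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): 6 predictors, only 3 carry signal.
n = 400
X = rng.normal(size=(n, 6))
y = 5.0 + X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

# Scaling sits inside each pipeline so it is refit on every training fold.
candidates = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge (alpha=10)": make_pipeline(StandardScaler(), Ridge(alpha=10)),
    "Lasso (alpha=0.1)": make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000)),
}

cv_r2 = {
    name: cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    for name, pipe in candidates.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean 5-fold R² = {score:.4f}")
```

On the real data, X and y would be the unscaled predictors and spend; a large gap between in-sample and cross-validated R-squared would be the sign of overfitting to look for.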


#7.	Which variables are dropped from the chosen model that were not dropped in linear regression?

I don't see that any variables were dropped from Ridge that were not dropped in LR, although the question seems to imply there should have been. Ridge shrinks coefficients toward zero but does not set any of them exactly to zero. Here are the features that Ridge pulled slightly closer to zero to help prevent overfitting:

Variable                 Amount Shrunk
race_white_x_sex_male    0.0761
age                      0.0509
race_black_x_sex_male    0.0463
race_hispanic_x_sex_male 0.0459
race_other_x_sex_male    0.0402
sex_other                0.0050
income                   0.0021
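
To check for dropped variables programmatically, one option is to look for coefficients that are exactly zero. The sketch below contrasts Ridge and Lasso on synthetic data; the feature names f0 to f4 and the alpha values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): only f0 and f1 affect y.
n = 200
features = [f"f{i}" for i in range(5)]
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=features)
y = 3.0 * X["f0"] + 2.0 * X["f1"] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=10).fit(X, y)
lasso = Lasso(alpha=0.5, max_iter=10000).fit(X, y)

coefs = pd.DataFrame({"Ridge": ridge.coef_, "Lasso": lasso.coef_}, index=features)

# A variable is "dropped" when its coefficient is exactly zero.
dropped_by_ridge = coefs.index[coefs["Ridge"] == 0].tolist()
dropped_by_lasso = coefs.index[coefs["Lasso"] == 0].tolist()

print(coefs.round(4))
print("Dropped by Ridge:", dropped_by_ridge)
print("Dropped by Lasso:", dropped_by_lasso)
```

Applied to the fitted customer models, the same comparison would list which features, if any, Lasso removes while Ridge and LR keep them.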




In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Define target and predictors
target_column = "spend"  # Correct column name
X = df.drop(columns=[target_column])
y = df[target_column]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
linear_coef = pd.Series(lr_model.coef_, index=X.columns)

# Ridge regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_coef = pd.Series(ridge_model.coef_, index=X.columns)

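# Compare shrinkage on absolute values: positive means Ridge moved the coefficient toward zero
# (a plain difference would misclassify negative coefficients)
shrinkage = linear_coef.abs() - ridge_coef.abs()
shrunk_coefficients = shrinkage[shrinkage > 0]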

# Create comparison table
shrinkage_table = pd.DataFrame({
    "Variable": shrunk_coefficients.index,
    "Linear Regression": linear_coef[shrunk_coefficients.index].values,
    "Ridge": ridge_coef[shrunk_coefficients.index].values,
    "Amount Shrunk": shrunk_coefficients.values
}).sort_values(by="Amount Shrunk", ascending=False).reset_index(drop=True)

shrinkage_table