# Comparing KNN and Linear Regression Models on Advertising Data


In this notebook, we compare two regression models, K-Nearest Neighbors (KNN) and Linear Regression, for predicting sales from TV advertising spend. We visualize the data, fit both models, and compare their mean squared error.


## 1. Cloning the Repository and Loading Data

First, we need to clone the repository that contains the dataset we'll be using.

In [None]:
!git clone https://github.com/cesarlegendre/credit_scoring_7904_Q4_2024



## 2. Importing Necessary Libraries


We start by importing all the necessary libraries for data manipulation, visualization, and modeling.




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error  # regression metric used later in the notebook
import warnings
warnings.filterwarnings('ignore')

# Load data
path = 'credit_scoring_7904_Q4_2024/data_sets/advertising/data.csv'
df_base = pd.read_csv(path)
df_base

## 2. Data Exploration and Visualization
### Plotting TV Advertising Spend vs. Sales
We want to visualize the relationship between TV advertising spend and sales. To keep the plots readable, each cell below draws its own random sample of 10 data points, so the points differ from plot to plot.

In [None]:

plt.figure(figsize=(10, 6))
df_sample = df_base.sample(10)
plt.scatter(df_sample['TV'], df_sample['Sales'], marker='X', s=50)  # s controls the size of the markers
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000 $')
plt.title('TV Advertising Spend vs. Sales in 1000$')
plt.show()


In [None]:

plt.figure(figsize=(10, 6))
df_sample = df_base.sample(10)
plt.scatter(df_sample['TV'], df_sample['Sales'], marker='X', s=100)  # s controls the size of the markers
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('TV Advertising Spend vs. Sales')
plt.axvline(x=175, color='red', linestyle='--')
plt.show()


In [None]:

plt.figure(figsize=(10, 6))
df_sample = df_base.sample(10)
plt.scatter(df_sample['TV'], df_sample['Sales'], marker='X', s=100)  # s controls the size of the markers
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('TV Advertising Spend vs. Sales')
plt.axvline(x=115, color='red', linestyle='--')
plt.show()


## 3. Plotting with a Horizontal Line at Average Sales
Let's add a horizontal line representing the average sales from our sample.

In [None]:

plt.figure(figsize=(10, 6))
df_sample = df_base.sample(10)
plt.scatter(df_sample['TV'], df_sample['Sales'], marker='X', s=100)  # s controls the size of the markers
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('TV Advertising Spend vs. Sales')
average_sales = df_sample['Sales'].mean()
plt.axhline(y=average_sales, color='red', linestyle='--', label=f'Average Sales: {average_sales:.2f}')
plt.legend()
plt.show()
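
The average line can be read as the simplest possible predictor: one constant for every observation. The sketch below uses invented toy numbers (not the advertising data) and a hypothetical helper `mse` to illustrate that shifting the constant away from the mean increases the squared error.

```python
# Toy sales figures, invented for illustration (not taken from the advertising data).
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

mean_sales = sum(sales) / len(sales)

# Predict the same constant for every observation.
mse_at_mean = mse(sales, [mean_sales] * len(sales))
mse_shifted = mse(sales, [mean_sales + 1.0] * len(sales))

print(f"mean = {mean_sales:.2f}")
print(f"MSE at mean = {mse_at_mean:.3f}, MSE at mean + 1 = {mse_shifted:.3f}")
```

Any model fitted later should beat this constant baseline to be worth using.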


## 4. K-Nearest Neighbors (KNN) Regression
### Fitting KNN on Sample Data

In [None]:

from sklearn.neighbors import KNeighborsRegressor

# Sample data
df_sample = df_base.sample(10)
X = df_sample[['TV']]
y = df_sample['Sales']

# Create and train the KNN model
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X, y)

# Generate predictions
X_pred = np.linspace(X['TV'].min(), X['TV'].max(), 50).reshape(-1, 1)
y_pred = knn.predict(X_pred)


# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Real Values', marker='X',  s=100)
plt.plot(X_pred, y_pred, label='Predicted Values', color='red')
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('KNN Regression for Sales Prediction')
plt.legend()
plt.show()
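
To make the step-shaped prediction line easier to interpret, here is a minimal pure-Python sketch of what a KNN regressor computes for a single feature. The function `knn_predict` and the toy data are hypothetical and written for illustration; scikit-learn's `KNeighborsRegressor` is more general and may break distance ties differently.

```python
def knn_predict(x_train, y_train, x_new, k=1):
    """Predict y at x_new as the average y of the k closest training points."""
    # Rank training indices by distance to x_new (one feature, so absolute difference).
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

# Toy data, invented for illustration.
tv = [10.0, 50.0, 100.0, 150.0, 200.0]
sales = [5.0, 8.0, 12.0, 15.0, 19.0]

pred_k1 = knn_predict(tv, sales, 60.0, k=1)  # copies the single nearest point (tv=50)
pred_k3 = knn_predict(tv, sales, 60.0, k=3)  # averages the three nearest (tv=50, 100, 10)
print(pred_k1, pred_k3)
```

With `k=1` the prediction jumps whenever the nearest training point changes, which is why the red line in the plot looks like a staircase. Larger `k` averages more neighbors and gives a smoother curve.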


## 5. Fitting KNN on Full Data
Now, we'll fit the KNN regressor on the entire dataset for a more comprehensive model.

In [None]:

from sklearn.neighbors import KNeighborsRegressor

# Use the full dataset (no sampling)
df_sample = df_base
X = df_sample[['TV']]
y = df_sample['Sales']

# Create and train the KNN model
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X, y)

# Generate predictions
X_pred = np.linspace(X['TV'].min(), X['TV'].max(), 200).reshape(-1, 1)
y_pred = knn.predict(X_pred)


# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Real Values', marker='X',  s=100)
plt.plot(X_pred, y_pred, label='Predicted Values', color='red')
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('KNN Regression for Sales Prediction')
plt.legend()
plt.show()

## 6. Calculating Mean Squared Error (MSE) for KNN
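We evaluate the KNN model by calculating the Mean Squared Error (MSE) between the actual and predicted sales. Note that this is a training error: the model is scored on the same rows it was fitted on. With `n_neighbors=1`, each training point is its own nearest neighbor, so the MSE will be close to zero (exactly zero unless several rows share a TV value but differ in sales) and says little about how well the model predicts new data.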

In [None]:

from sklearn.metrics import mean_squared_error

# Calculate MSE

mse_knn = mean_squared_error(y, knn.predict(X))
print(f"Mean Squared Error (MSE): {mse_knn}")
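
The gap between training error and error on unseen points can be illustrated with a small pure-Python sketch. The helpers `mse` and `one_nn` and all numbers below are hypothetical toy values, not results from the advertising data. On the real dataset, the `train_test_split` function imported earlier could be used to hold out rows in the same spirit.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def one_nn(x_train, y_train, x_new):
    """1-nearest-neighbor prediction for a single feature."""
    i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x_new))
    return y_train[i]

# Toy data, invented for illustration; all training x values are distinct.
x_train = [10.0, 50.0, 100.0, 150.0, 200.0]
y_train = [5.0, 8.0, 12.0, 15.0, 19.0]
x_test = [40.0, 120.0, 180.0]
y_test = [7.0, 13.0, 16.0]

# Scoring on the training rows: every point is its own nearest neighbor.
train_mse = mse(y_train, [one_nn(x_train, y_train, x) for x in x_train])
# Scoring on held-out rows: predictions are copied from a neighboring point.
test_mse = mse(y_test, [one_nn(x_train, y_train, x) for x in x_test])

print(f"training MSE = {train_mse}, held-out MSE = {test_mse:.3f}")
```

The training MSE is zero by construction, while the held-out MSE is not, so only the latter reflects predictive accuracy.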


## 7. Linear Regression
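### Fitting Linear Regression on Sample Data
We fit a linear regression model on a new random sample of 10 data points. Because `df_base.sample(10)` is called again, these are not necessarily the same points used for KNN above.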

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Sample data
df_sample = df_base.sample(10)
X = df_sample[['TV']]
y = df_sample['Sales']

# Create and train the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X, y)

# Generate predictions
X_pred = np.linspace(X['TV'].min(), X['TV'].max(), 50).reshape(-1, 1)
y_pred = linear_reg.predict(X_pred)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Real Values', marker='X', s=100)
plt.plot(X_pred, y_pred, label='Predicted Values', color='red')
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('Linear Regression for Sales Prediction')
plt.legend()
plt.show()
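
For a single feature, the fitted line has a closed-form solution. The sketch below is a pure-Python illustration with a hypothetical helper `fit_line` and invented toy data. In scikit-learn, the corresponding values are exposed as `linear_reg.coef_[0]` and `linear_reg.intercept_` after fitting.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy data, invented for illustration; it lies exactly on sales = 7 + 0.05 * tv.
tv = [0.0, 100.0, 200.0, 300.0]
sales = [7.0, 12.0, 17.0, 22.0]

slope, intercept = fit_line(tv, sales)
print(slope, intercept)
```

Unlike KNN, the model is summarized by just these two numbers, which is why its prediction is a straight line regardless of how many points are used.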




## 8. Fitting Linear Regression on Full Data
Now, we'll fit the linear regression model on the entire dataset.

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Use the full dataset (no sampling)
df_sample = df_base
X = df_sample[['TV']]
y = df_sample['Sales']

# Create and train the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X, y)

# Generate predictions
X_pred = np.linspace(X['TV'].min(), X['TV'].max(), 50).reshape(-1, 1)
y_pred = linear_reg.predict(X_pred)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Real Values', marker='X', s=100)
plt.plot(X_pred, y_pred, label='Predicted Values', color='red')
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales in 1000')
plt.title('Linear Regression for Sales Prediction')
plt.legend()
plt.show()


## 9. Calculating Mean Squared Error (MSE) for Linear Regression
We evaluate the linear regression model by calculating the MSE on the same full dataset it was fitted on, so this is also a training error.

In [None]:

from sklearn.metrics import mean_squared_error

# Calculate MSE

mse_lr = mean_squared_error(y, linear_reg.predict(X))
print(f"Mean Squared Error (MSE): {mse_lr}")  # fixed: the original referenced an undefined name `mse`


In [None]:
print(f"Mean Squared Error (MSE) of KNN Model: {mse_knn:.2f}")
print(f"Mean Squared Error (MSE) of Linear Regression Model: {mse_lr:.2f}")
