## Sales Prediction Notebook

This notebook demonstrates a machine learning workflow to predict product sales based on advertising spending on TV, radio, and newspapers.
The process includes data loading, exploratory data analysis, data cleaning, model training, and evaluation.

## Importing Libraries

We import the libraries needed for data manipulation, analysis, and visualization, along with the scikit-learn classes and functions used for splitting the data, fitting the model, and evaluating it.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import joblib

## Exploratory Data Analysis

In this section, we load the dataset and perform an initial analysis to understand its structure, size, and statistical properties.

In [None]:
df = pd.read_csv(r"Advertising.csv")
df.head()

In [None]:
print("Shape of Dataset:")
df.shape

In [None]:
print("Info of Dataset: ")

df.info()

In [None]:
print("Statistical analysis of numerical columns: \n")
df.describe()

In [None]:
print("Duplicated rows: ")
df.duplicated().sum()

In [None]:
print("Null values in each column\n")
df.isnull().sum()

## Data Cleaning and Preprocessing

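The dataset contains an 'Unnamed: 0' column, which is only a row index and carries no information for our analysis, so we drop it before modeling. Duplicate rows and null values were already checked in the previous section. We then inspect correlations and distributions, drop the Newspaper column, and add a TV and Radio interaction feature.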

In [None]:
df = df.drop(columns=["Unnamed: 0"])  # row-index column, not a feature

In [None]:
df.head()
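
An alternative to dropping the index column after loading is to tell pandas at read time that the first column is the index. The sketch below uses a small in-memory CSV as a hypothetical stand-in for `Advertising.csv` (the three rows and the name `demo_df` are illustrative only), assuming the real file has the same unnamed leading column.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Advertising.csv. The empty first header
# is what pandas would otherwise expose as an "Unnamed: 0" column.
csv_text = """,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# index_col=0 turns the first column into the DataFrame index,
# so it never appears among the data columns.
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo_df.columns.tolist())
```

With this approach no separate `drop` call is needed, at the cost of keeping a 1-based index on the frame.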

In [None]:
num_df = df.select_dtypes(include=["float64"])  # keep only the float-valued columns
corr_matrix = num_df.corr()
sns.heatmap(corr_matrix, annot = True)
plt.title("Linear Relationship with Sales",color="purple",fontsize=15)
plt.show()
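
A heatmap shows every pairwise correlation, but for feature selection it can be easier to read a ranked list of correlations with the target alone. The following is a minimal sketch on synthetic data (the frame `demo` and its coefficients are assumptions made up for illustration, not the notebook's dataset), in which Sales depends on TV and Radio but not on Newspaper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising budgets (illustrative ranges, not real data).
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
# Synthetic target: driven by TV and Radio only, plus Gaussian noise.
demo["Sales"] = 0.05 * demo["TV"] + 0.2 * demo["Radio"] + rng.normal(0, 1, n)

# Pearson correlation of each feature with Sales, strongest (by magnitude) first.
corr_with_sales = (
    demo.corr()["Sales"].drop("Sales").sort_values(key=abs, ascending=False)
)
print(corr_with_sales)
```

On the real data, the same expression applied to `num_df` would give the ranking that the heatmap's Sales row encodes visually.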

In [None]:
num_cols = num_df.columns

# For each numeric column, show its distribution (left) and a box plot (right).
for col in num_cols:
    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    sns.histplot(df[col], bins=5, kde=True)
    plt.title("Distribution Plot", color="purple", fontsize=12)

    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])  # pass the column explicitly as x
    plt.title("Outliers Visualization", color="purple", fontsize=12)

    plt.tight_layout()
    plt.show()
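
Box plots flag outliers visually; to get a number per column, the same idea can be written out with the conventional 1.5 × IQR rule. This is a sketch only: the helper `iqr_outlier_count` and the sample series are hypothetical and not part of the original notebook.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())


# Illustrative series with one obviously extreme value.
demo_series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(iqr_outlier_count(demo_series))  # 1 (only the value 100 lies outside the fences)
```

Applied to the notebook's data, this could be called as `iqr_outlier_count(df[col])` inside the loop over `num_cols`.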

In [None]:
sns.pairplot(num_df) 
plt.show()

In [None]:
df = df.drop(columns=["Newspaper"])  # dropped: shows no significant effect on Sales

In [None]:
df["tv_radio_interaction"] = df["TV"] * df["Radio"]  # interaction feature: TV and Radio both have a strong effect on Sales
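
To illustrate why a product term can help a linear model, the sketch below fits two regressions on synthetic data in which TV and Radio spending reinforce each other. The data-generating formula and all variable names here are assumptions for the demo; the comparison is in-sample and says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)

# Synthetic target with a built-in TV x Radio synergy (an assumption of this demo).
sales = 3 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(0, 0.5, n)

X_additive = np.column_stack([tv, radio])              # main effects only
X_interact = np.column_stack([tv, radio, tv * radio])  # plus the interaction term

r2_additive = LinearRegression().fit(X_additive, sales).score(X_additive, sales)
r2_interact = LinearRegression().fit(X_interact, sales).score(X_interact, sales)
print(f"additive R^2: {r2_additive:.3f}, with interaction R^2: {r2_interact:.3f}")
```

A purely additive model cannot represent the case where the effect of one budget depends on the level of the other; the extra column lets ordinary least squares capture it while remaining linear in its parameters.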

In [None]:
x = df.drop(["Sales"],axis = 1)
y = df["Sales"]

## Model Training and Evaluation

We will now train a linear regression model on the preprocessed data and evaluate its performance on a held-out test set using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
lr = LinearRegression()
lr.fit(x_train, y_train)

In [None]:
y_pred = lr.predict(x_test)

In [None]:
lr.score(x_test, y_test) * 100  # R-squared on the test set, as a percentage

In [None]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)*100

In [None]:
print("MSE: ", mse)
print("RMSE: ", rmse)
print("MAE: ", mae)
print("R2 Score (%): ", r2)
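
To make the metrics above concrete, here is a small worked example that computes each one by hand with NumPy and checks it against scikit-learn. The four-point arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny illustrative example (not taken from the dataset).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_hat                       # [0.5, 0.0, -1.0, 0.5]
mse_manual = np.mean(errors ** 2)             # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
rmse_manual = np.sqrt(mse_manual)             # same units as the target
mae_manual = np.mean(np.abs(errors))          # (0.5 + 0 + 1 + 0.5) / 4 = 0.5

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 20.0
r2_manual = 1 - ss_res / ss_tot                   # 1 - 1.5 / 20 = 0.925

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

RMSE and MAE are in the units of the target, which makes them easier to interpret than MSE; R-squared is unitless and compares the model against always predicting the mean.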

In [None]:
print("Coefficients: ",lr.coef_)
print("Intercept: ",lr.intercept_)
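
The raw `coef_` array is easier to read when each value is paired with its feature name. The sketch below shows one way to do that on a noise-free synthetic example (`X_demo`, `y_demo`, and `model_demo` are hypothetical names, and the true coefficients are chosen by us), so the fitted values can be checked against known ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 100),
    "Radio": rng.uniform(0, 50, 100),
})
# Noise-free synthetic target with known intercept 2.0 and slopes 0.05 and 0.1.
y_demo = 2.0 + 0.05 * X_demo["TV"] + 0.1 * X_demo["Radio"]

model_demo = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with the column it belongs to.
coef_table = pd.Series(model_demo.coef_, index=X_demo.columns, name="coefficient")
print(coef_table)
print("intercept:", model_demo.intercept_)
```

In the notebook, the equivalent would be `pd.Series(lr.coef_, index=x.columns)`. Each coefficient is the change in predicted Sales per one-unit increase in that feature with the others held fixed, which is harder to interpret once an interaction column is present.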

In [None]:
# --- Plot 1: TV Advertising vs Sales ---
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)  # left panel
sns.regplot(x='TV', y='Sales', data=df,
            scatter_kws={'alpha':0.6, 'color':'#2ecc71'}, # Greenish color
            line_kws={'color':'#27ae60', 'linestyle':'-', 'linewidth':2})
plt.title('TV Advertising Budget vs. Sales')
plt.xlabel('TV Advertising Budget ($)')
plt.ylabel('Sales ($)')
plt.grid(True, linestyle='--', alpha=0.7)

# --- Plot 2: Radio Advertising vs Sales ---
plt.subplot(1, 2, 2)  # right panel
sns.regplot(x='Radio', y='Sales', data=df,
            scatter_kws={'alpha':0.6, 'color':'#3498db'}, # Blue color
            line_kws={'color':'#2980b9', 'linestyle':'-', 'linewidth':2})
plt.title('Radio Advertising Budget vs. Sales')
plt.xlabel('Radio Advertising Budget ($)')
plt.ylabel('Sales ($)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()  # display both panels

In [None]:
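from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full dataset; the default score for a regressor is R-squared.
scores = cross_val_score(lr, x, y, cv=5)

print("Cross Validation Scores:", scores)
print("Average CV Score (%):", scores.mean() * 100)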
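
When `cv` is an integer and the estimator is a regressor, scikit-learn splits the rows into consecutive, unshuffled folds. If the file happens to be ordered in some way, shuffling the folds can give a fairer estimate. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo`, and the chosen seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
X_demo = rng.uniform(0, 100, size=(100, 2))
# Synthetic linear target with a little Gaussian noise.
y_demo = 1.0 + 0.5 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1, 100)

# Explicit splitter: 5 folds, shuffled once with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("Fold R^2 scores:", cv_scores)
print("Mean R^2:", cv_scores.mean())
```

In the notebook this would amount to passing `cv=cv` instead of `cv=5`. Note that `cross_val_score` clones and refits the estimator on each fold, so the already-fitted `lr` is not modified.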

In [None]:
import pickle

# Save the trained model (`lr`) to disk
with open('sales_prediction_model.pkl', 'wb') as file:
    pickle.dump(lr, file)
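
A saved model is only useful if it can be loaded back and produces the same predictions. The following round-trip sketch uses a tiny hypothetical model and a temporary directory (the names `model_demo`, `restored`, and `demo_model.pkl` are illustrative) rather than the notebook's `sales_prediction_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative model: y = 2 * x.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model_demo = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Write the fitted model to disk ...
    with open(path, "wb") as f:
        pickle.dump(model_demo, f)

    # ... and read it back as a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust, and load them with a scikit-learn version compatible with the one used to save them.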