# Introduction

Sales prediction is a critical task for businesses aiming to optimize their advertising strategies and maximize revenue. By analyzing historical advertising spend across different platforms such as TV, Radio, and Newspaper, businesses can forecast future sales and make data-driven decisions. Machine learning provides powerful tools to model these relationships and predict outcomes based on past data, helping businesses allocate resources efficiently.

In this project, we use Python and machine learning techniques to predict sales based on advertising expenditures, enabling informed decisions on budget allocation and marketing strategies.

# Problem Statement

The primary goal of this project is to predict the sales of a product from its advertising budgets across TV, Radio, and Newspaper.

Businesses often face the challenge of deciding how much to spend on each advertising channel to maximize sales. By building a predictive model, we aim to:

- Understand the influence of each advertising channel on sales.
- Provide insights that guide budget allocation decisions.
- Forecast future sales to support marketing and business strategies.

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')


# Load Dataset

In [None]:
data = pd.read_csv('/kaggle/input/advertising-dataset/advertising.csv')  # replace with your dataset path
data.head()

# Exploring Data

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.describe()

# Data Visualisation

In [None]:
# Pairplot
sns.pairplot(data)
plt.show()

In [None]:
# Correlation heatmap
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()

### Inference: Correlation Analysis

From the correlation heatmap:

- **TV and Sales (0.9):** Very strong positive correlation. Sales rise closely in step with TV advertising spend.
- **Radio and Sales (0.35):** Moderate positive correlation. Radio spend is associated with higher sales, but far less strongly than TV.
- **Newspaper and Sales (0.16):** Weak positive correlation. Newspaper spend shows little linear relationship with sales.

**Feature independence:**
- TV & Radio: 0.055 → almost no correlation
- TV & Newspaper: 0.057 → almost no correlation
- Radio & Newspaper: 0.35 → mild correlation

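**Inference:**  
TV and Radio are the channels most strongly associated with sales, while Newspaper shows little association. Correlation alone does not establish causation, but it does indicate which features carry predictive signal. The features are only weakly correlated with one another, so multicollinearity is unlikely to distort the coefficients of a linear regression model.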
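
The low pairwise correlations can be cross-checked with variance inflation factors (VIF), which measure how well each feature is explained by all the others together. The following is a minimal sketch using only NumPy and pandas; the helper name `vif_table` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor of each column.

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on the remaining columns (with an intercept).
    """
    vifs = {}
    for col in features.columns:
        target = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column so the auxiliary regression is not forced through zero
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # A perfectly collinear column has R2 = 1, i.e. an infinite VIF
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")
```

In this notebook it could be called as `vif_table(data[['TV', 'Radio', 'Newspaper']])`. Values close to 1 indicate little multicollinearity; values above roughly 5 to 10 are a common rule-of-thumb warning sign.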


# Data Preparation

In [None]:
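# Features: advertising spend per channel; target: sales
X = data[['TV', 'Radio', 'Newspaper']]
y = data['Sales']

# Hold out 20% of the rows for testing; random_state fixes the split for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)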

# Model Training

In [None]:
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

# Dictionary to store results
results = {}

# Train, predict, and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {"RMSE": rmse, "R2": r2}
    
# Show results
results_df = pd.DataFrame(results).T
results_df
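
The scores above come from a single train/test split, so the ranking of the models may shift with a different split, especially if the dataset is small. K-fold cross-validation gives a more stable comparison. A possible sketch follows; the helper name `cv_rmse` and its default arguments are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score


def cv_rmse(models: dict, X, y, n_splits: int = 5, seed: int = 42) -> pd.DataFrame:
    """Return the mean and standard deviation of k-fold RMSE for each model."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = {}
    for name, model in models.items():
        # cross_val_score fits a fresh clone of the model on every fold
        scores = cross_val_score(model, X, y, cv=folds,
                                 scoring="neg_root_mean_squared_error")
        rmse = -scores  # scikit-learn negates errors so that higher is always better
        rows[name] = {"CV RMSE mean": rmse.mean(), "CV RMSE std": rmse.std()}
    return pd.DataFrame(rows).T
```

Calling `cv_rmse(models, X, y)` returns one row per model. A difference in mean RMSE that is small compared with the standard deviation across folds should not be read as a clear win for either model.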


# Model Visualization

In [None]:
plt.figure(figsize=(18,5))

for i, (name, model) in enumerate(models.items()):
    y_pred = model.predict(X_test)
    
    plt.subplot(1, 3, i+1)
    plt.scatter(y_test, y_pred)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('Actual Sales')
    plt.ylabel('Predicted Sales')
    plt.title(f'{name}: Actual vs Predicted')
    
plt.tight_layout()
plt.show()


# Inferences:

**Key Takeaways:**
- Linear Regression provides a good baseline with interpretable coefficients.
- Random Forest and Gradient Boosting improve prediction accuracy and handle non-linear relationships.
- Gradient Boosting achieves the lowest RMSE and highest R², making it the most accurate model for sales forecasting.
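
The interpretability of Linear Regression comes from its fitted coefficients, which can be read directly off the model. A minimal sketch, assuming the model has already been fitted as in the training cell; `coefficient_table` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def coefficient_table(model: LinearRegression, feature_names) -> pd.Series:
    """Return the fitted intercept and per-feature coefficients as a labeled Series."""
    values = {"Intercept": float(model.intercept_)}
    for name, coef in zip(feature_names, model.coef_):
        values[name] = float(coef)
    return pd.Series(values, name="Coefficient")
```

For example, `coefficient_table(models["Linear Regression"], X.columns)` lists one coefficient per channel. Each coefficient is the change in predicted sales for a one-unit increase in that channel's budget with the other budgets held fixed.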

**Inferences:**
- The correlation analysis points to TV and Radio as the strongest predictors of sales.
- Advanced models are recommended when prediction accuracy is critical.
- Linear Regression is still valuable for interpretability and understanding feature impacts.
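
Whether each fitted model actually relies mainly on TV and Radio can be checked with permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance`; the helper name `importance_table` and its defaults are assumptions for illustration.

```python
import pandas as pd
from sklearn.inspection import permutation_importance


def importance_table(models: dict, X_eval: pd.DataFrame, y_eval,
                     n_repeats: int = 20, seed: int = 42) -> pd.DataFrame:
    """Return the mean drop in R² when each feature is shuffled, one column per model."""
    columns = {}
    for name, model in models.items():  # every model must already be fitted
        result = permutation_importance(model, X_eval, y_eval, scoring="r2",
                                        n_repeats=n_repeats, random_state=seed)
        columns[name] = pd.Series(result.importances_mean, index=X_eval.columns)
    return pd.DataFrame(columns)
```

With the notebook's objects this would be `importance_table(models, X_test, y_test)`. A feature whose importance is near zero for a given model contributes little to that model's predictions on the test set.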