# **Sales Forecasting with Linear Regression**  

## **1. Importing Libraries**  
We use the following libraries for data manipulation, visualization, and modeling:  
- `pandas` for data processing  
- `numpy` for numerical operations  
- `seaborn` and `matplotlib` for data visualization  
- `sklearn` for model training and evaluation  

## **2. Loading and Inspecting the Dataset**  
- Load the dataset from an Excel file  
- Display dataset information to check data types and missing values  

## **3. Data Cleaning and Preprocessing**  
- Replace the placeholder `'-'`, which marks missing entries, with `NaN`  
- Convert numeric columns (`Quantity`, `Discount`, `Price`) to the correct data type  
- Fill missing values with the mean of each column  
- Convert the `Date` column to a datetime format, drop rows whose date cannot be parsed, and extract time-based features (`Year`, `Month`, `DayOfWeek`)  
- Remove invalid records where `Quantity` is not greater than zero  
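
The cleaning steps above can be sketched on a small toy table. The data below is hypothetical and stands in for the Excel file; only the column names (`Date`, `Quantity`, `Discount`, `Price`) come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; '-' marks a missing entry, as described above.
raw = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-03", "not a date", "2023-02-10"],
    "Quantity": [5, "-", 3, 0],
    "Discount": [0.1, 0.2, "-", 0.0],
    "Price": [10.0, "-", 30.0, 20.0],
})

# Mark placeholders as NaN, then coerce to numbers and fill gaps with the column mean.
clean = raw.replace("-", np.nan)
for col in ["Quantity", "Discount", "Price"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")
    clean[col] = clean[col].fillna(clean[col].mean())

# Parse dates, drop rows that fail to parse, and derive calendar features.
clean["Date"] = pd.to_datetime(clean["Date"], errors="coerce")
clean = clean.dropna(subset=["Date"]).copy()
clean["Year"] = clean["Date"].dt.year
clean["Month"] = clean["Date"].dt.month
clean["DayOfWeek"] = clean["Date"].dt.dayofweek  # Monday = 0

# Keep only rows with a positive quantity.
clean = clean[clean["Quantity"] > 0]
print(clean)
```

In this toy run the row with the unparseable date and the row with `Quantity` equal to zero are both removed. Note that the mean used for filling is computed over all rows, including ones that are dropped later.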

## **4. Exploratory Data Analysis (EDA)**  
- Compute and visualize the correlation matrix using a heatmap  

## **5. Splitting Data for Training and Testing**  
- Features: `Discount`, `Price`, `Month`, `DayOfWeek`  
- Target variable: `Quantity`  
- Split into training (80%) and testing (20%) sets  
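
As a rough illustration of the 80/20 split, the sketch below applies `train_test_split` to ten rows of randomly generated, hypothetical data that reuse the feature and target names listed above. Fixing `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 random rows with the notebook's column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Discount": rng.uniform(0, 0.3, 10),
    "Price": rng.uniform(5, 50, 10),
    "Month": rng.integers(1, 13, 10),
    "DayOfWeek": rng.integers(0, 7, 10),
    "Quantity": rng.integers(1, 20, 10),
})

X = toy[["Discount", "Price", "Month", "DayOfWeek"]]
y = toy["Quantity"]

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

The rows are shuffled before splitting, so the test set is a random sample rather than the most recent dates.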

## **6. Training a Linear Regression Model**  
- Fit a linear regression model to the training data  

## **7. Model Evaluation**  
- Calculate evaluation metrics:  
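  - **Mean Absolute Error (MAE)**: The average absolute difference between actual and predicted values  
  - **Root Mean Squared Error (RMSE)**: The square root of the average squared difference; it penalizes large errors more heavily than MAE and is in the same units as `Quantity`  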
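
To make the two metrics concrete, here is a small worked example using plain NumPy. The actual and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted quantities.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_hat = np.array([11.0, 10.0, 8.0, 19.0])

errors = y_hat - y_true                # [1, -2, 0, 4]
mae = np.mean(np.abs(errors))          # (1 + 2 + 0 + 4) / 4 = 1.75
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 4 + 0 + 16) / 4) = sqrt(5.25)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, as the single error of 4 shows here.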

## **8. Model Coefficients**  
- Display the learned coefficients for each feature  
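
One way to read the coefficients: each is the expected change in the target for a one-unit increase in that feature, with the other features held fixed. The sketch below fits a model to noise-free synthetic data generated from assumed coefficients (30 for `Discount`, -0.2 for `Price`, intercept 20), so the fit should recover them. These numbers are illustrative and are not results from the sales data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features drawn at random.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Discount": rng.uniform(0, 0.3, 50),
    "Price": rng.uniform(5, 50, 50),
})

# Assumed ground truth, with no noise added.
y_demo = 20 + 30 * X_demo["Discount"] - 0.2 * X_demo["Price"]

demo_model = LinearRegression().fit(X_demo, y_demo)
demo_coefs = pd.DataFrame({"Coefficient": demo_model.coef_}, index=X_demo.columns)
print(demo_coefs)
print("Intercept:", demo_model.intercept_)
```

Because the features are on different scales, the raw magnitudes of the coefficients are not directly comparable to one another.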

## **9. Visualizing Predictions**  
- Plot actual vs predicted values with a scatter plot  
- Include a reference diagonal line (`y=x`) to assess prediction accuracy  


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

file_path = r"./sales/sales23-1.xlsx"
df = pd.read_excel(file_path)

print("Dataset Information:")
df.info()  # info() prints its report directly and returns None

# Treat the '-' placeholder as a missing value
df.replace('-', np.nan, inplace=True)

numeric_cols = ["Quantity", "Discount", "Price"]
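for col in numeric_cols:
    # Coerce non-numeric entries to NaN, then fill gaps with the column mean.
    # Assign the result back: fillna(inplace=True) on df[col] may act on a copy.
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].mean())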

df["Date"] = pd.to_datetime(df["Date"], errors='coerce')
df.dropna(subset=["Date"], inplace=True)
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek

df = df[df["Quantity"] > 0]

plt.figure(figsize=(8, 6))
numeric_df = df.select_dtypes(include=["number"])
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation between Variables")
plt.show()

X = df[["Discount", "Price", "Month", "DayOfWeek"]]
y = df["Quantity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("\nModel Evaluation:")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")

coefficients = pd.DataFrame(model.coef_, index=X.columns, columns=["Coefficient"])
print("\nModel Coefficients:")
print(coefficients)

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Linear Regression)")
plt.show()
