# Week 8 Participation Activity – Internet Sales Forecast (Improved)

This notebook uses the **Internet Sales** dataset to:

1. Load and explore the data.
2. Aggregate yearly Internet Sales and build a simple forecast for the next five years.
3. Use summary statistics to identify which product has generated the highest total sales in the dataset.

The code has been refactored for clarity, structure, and readability.


In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() also works outside a notebook cell


## 1. Load Internet Sales data

In [None]:
# Path to the Internet Sales dataset.
# Update this path if your CSV is stored elsewhere.
data_path = '/content/drive/MyDrive/MSSP607/e.Data/InternetSales.csv'

# Read the CSV. The file may not be UTF-8 encoded, so we read it as Latin-1,
# which accepts any byte sequence.
df = pd.read_csv(data_path, encoding='latin1')

print('Shape of raw data:', df.shape)
print('Columns:')
print(df.columns)
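
If the encoding of the CSV is not known in advance, one option is to try UTF-8 first and fall back to Latin-1 only when decoding fails. The sketch below is a minimal illustration, not part of the original notebook; the helper name `read_csv_with_fallback` and its `encodings` parameter are hypothetical.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=('utf-8', 'latin1')):
    """Try each encoding in order and return the first DataFrame that parses.

    Hypothetical helper: the name and the default encoding order are
    assumptions made for illustration.
    """
    if not encodings:
        raise ValueError('At least one encoding is required.')
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding.
            last_error = err
    # Every candidate failed, so surface the most recent decoding error.
    raise last_error
```

Calling `read_csv_with_fallback(data_path)` would then replace the direct `pd.read_csv` call. Because Latin-1 never fails to decode, keeping it last means the function always returns a DataFrame for a well-formed CSV, although accented characters may be wrong if the file actually uses a different encoding.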

## 2. Preprocess data and compute yearly sales

In [None]:
# We use OrderDate as the transaction date and SalesAmount as the Internet Sales measure.

# Convert OrderDate to datetime
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

# Extract year
df['Year'] = df['OrderDate'].dt.year

# Aggregate yearly Internet Sales
yearly_sales = df.groupby('Year', as_index=False)['SalesAmount'].sum()

print('Yearly Internet Sales:')
display(yearly_sales)
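
Yearly totals are only comparable when each year covers a full twelve months. As a rough check, the sketch below counts how many distinct calendar months contain at least one order in each year. It runs on a small made-up sample; the helper name `months_covered_per_year` is hypothetical, and whether the real dataset contains partial years is not known from this notebook.

```python
import pandas as pd

# Made-up sample rows standing in for the real Internet Sales data.
sample = pd.DataFrame({
    'OrderDate': pd.to_datetime([
        '2011-11-15', '2011-12-20', '2012-01-05', '2012-06-30', '2012-12-01',
    ]),
    'SalesAmount': [100.0, 150.0, 200.0, 250.0, 300.0],
})


def months_covered_per_year(frame, date_col='OrderDate'):
    """Count the distinct months with at least one order in each year."""
    dates = frame[date_col]
    return dates.dt.month.groupby(dates.dt.year).nunique()


coverage = months_covered_per_year(sample)
print(coverage)
```

On the real data this would be called as `months_covered_per_year(df)` after `OrderDate` has been converted to datetime. Years with fewer than 12 months covered may understate annual sales and could be excluded before fitting a trend.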

## 3. Simple 5-year forecast using a linear trend

In [None]:
# Use a simple linear regression (via numpy.polyfit) on Year vs. SalesAmount.

years = yearly_sales['Year'].values
sales = yearly_sales['SalesAmount'].values

# Fit a first-degree polynomial: SalesAmount ≈ a * Year + b
coefficients = np.polyfit(years, sales, 1)
trend = np.poly1d(coefficients)

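# Print the fitted line in a readable form (slope first, then intercept)
print(f'Trend function: SalesAmount ≈ {coefficients[0]:,.2f} * Year + ({coefficients[1]:,.2f})')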

# Forecast the next 5 years beyond the last observed year
last_year = years.max()
future_years = np.arange(last_year + 1, last_year + 6)
forecast_values = trend(future_years)

forecast_df = pd.DataFrame({
    'Year': future_years,
    'ForecastSalesAmount': forecast_values
})

print('5-year forecast:')
display(forecast_df)
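
Before relying on the extrapolation, it can help to see how well a straight line describes the historical points. The sketch below computes the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, for a first-degree `np.polyfit` fit. The function name `linear_trend_r_squared` and the example numbers are assumptions for illustration, not values from the dataset.

```python
import numpy as np


def linear_trend_r_squared(x, y):
    """Return R^2 of a first-degree polynomial fit of y on x.

    Hypothetical helper. Assumes y is not constant; otherwise the
    total sum of squares is zero and R^2 is undefined.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    ss_res = np.sum((y - fitted) ** 2)    # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up yearly totals that lie exactly on a line.
example_years = [2011, 2012, 2013, 2014]
example_sales = [10.0, 20.0, 30.0, 40.0]
r_squared = linear_trend_r_squared(example_years, example_sales)
print(f'R^2 of linear trend: {r_squared:.3f}')
```

In the notebook this would be called as `linear_trend_r_squared(years, sales)`. With only a handful of yearly observations, even a high value says little about how the trend will hold over the next five years.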

In [None]:
# Plot historical and forecasted Internet Sales

plt.figure(figsize=(8, 5))

# Historical
plt.plot(yearly_sales['Year'], yearly_sales['SalesAmount'], marker='o', label='Historical Sales')

# Forecast
plt.plot(forecast_df['Year'], forecast_df['ForecastSalesAmount'], marker='o', linestyle='--', label='Forecast (next 5 years)')

plt.xlabel('Year')
plt.ylabel('SalesAmount')
plt.title('Internet Sales: Historical and 5-Year Forecast')
plt.legend()
plt.tight_layout()
plt.show()

## 4. Summary statistics: Which product generates the most sales?

In [None]:
# We use EnglishProductName as the product identifier.
# Aggregate total SalesAmount by product.

product_sales = df.groupby('EnglishProductName', as_index=False)['SalesAmount'].sum()

# Sort products by total SalesAmount descending
product_sales_sorted = product_sales.sort_values('SalesAmount', ascending=False)

print('Top 10 products by total SalesAmount:')
display(product_sales_sorted.head(10))

# The top product
top_product = product_sales_sorted.iloc[0]
top_name = top_product['EnglishProductName']
top_amount = top_product['SalesAmount']

print('\nProduct with the highest total Internet Sales:')
print(f'  Name  : {top_name}')
print(f'  Sales : {top_amount:,.2f}')
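
A total on its own does not show how dominant the top product is. The sketch below expresses each product's total as a share of all sales, using a small made-up sample; the helper name `sales_share_by_product` is hypothetical and the product names are placeholders.

```python
import pandas as pd

# Made-up sample rows; the product names are placeholders.
sample = pd.DataFrame({
    'EnglishProductName': ['Bike A', 'Bike B', 'Bike A', 'Helmet'],
    'SalesAmount': [600.0, 300.0, 400.0, 700.0],
})


def sales_share_by_product(frame, name_col='EnglishProductName',
                           amount_col='SalesAmount'):
    """Return each product's fraction of total sales, largest first."""
    totals = frame.groupby(name_col)[amount_col].sum()
    return (totals / totals.sum()).sort_values(ascending=False)


shares = sales_share_by_product(sample)
print(shares.head(10))
```

Calling `sales_share_by_product(df)` on the real data would give the same ranking as the cell above, with values that sum to 1.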

## 5. Notes

- The forecast is based on a simple linear trend and should be interpreted as a rough extrapolation rather than a precise prediction.
- The product analysis uses historical Internet Sales only. It ranks products by total SalesAmount (revenue) over the whole dataset, not by units sold or by sales in any single year.
