# 📊 Time Series Analysis: Electronics Product Pricing Data

## Overview
This notebook performs comprehensive time series analysis on electronics product pricing data with the goal of understanding discount patterns, detecting seasonality, and preparing data for SARIMA forecasting.

### Analysis Components:
1. **Data Loading and Preprocessing**
2. **Feature Engineering** (discount calculations, temporal features)
3. **Exploratory Data Analysis** (correlations, visualizations)
4. **Seasonality Detection** (monthly and weekly patterns)
5. **Rolling Statistics** (weekly aggregations)
6. **Lag Feature Engineering** (autocorrelation analysis)

### Dataset:
- **Source**: ElectronicsProductsPricingData.csv
- **Time Period**: 2014-2018 (focus on 2017-2018)
- **Key Metric**: Discount percentage over time

---

## Dataset Loading

This section installs the required libraries, imports them, and loads the dataset into a pandas DataFrame. The dataset is the foundation for all subsequent preprocessing and analysis steps.

### Library Installation

**Purpose**: Install required Python packages for time series analysis.

**Libraries needed**:
- **`statsmodels`**: Statistical models for time series (ARIMA, SARIMA, ADF test, ACF/PACF)
- **`pandas`**: Data manipulation and analysis with DataFrames
- **`matplotlib`**: Core plotting library for visualizations
- **`seaborn`**: Statistical visualization built on matplotlib

**Installation command**:
```bash
pip install statsmodels pandas matplotlib seaborn
```

**Note**: This cell is commented out because package installation should typically be done via terminal/command prompt, not within the notebook itself. Uncomment only if running in an environment where pip install in cells is appropriate.

In [None]:
## pip install statsmodels pandas matplotlib seaborn

## 1. Loading & Filtering Data

### 1.1 Load Dataset and Display Info

**Purpose**: Import necessary libraries, load the electronics pricing dataset, and inspect its structure.

**What this code does**:

1. **Import Core Libraries**:
   ```python
   import pandas as pd
   import matplotlib.pyplot as plt
   import seaborn as sns
   ```
   - `pandas`: For data manipulation and analysis
   - `matplotlib.pyplot`: For creating plots and visualizations
   - `seaborn`: For statistical visualizations and attractive default styles

2. **Load CSV File**:
   ```python
   df = pd.read_csv("./ElectronicsProductsPricingData.csv", encoding='latin1')
   ```
   - Reads the CSV file from the current directory
   - `encoding='latin1'`: Handles special characters in product names (accents, symbols)
   - Stores data in DataFrame `df`

3. **Display Dataset Information**:
   ```python
   df.info()
   ```
   - Shows comprehensive dataset summary:
     - Number of rows and columns
     - Column names and their data types
     - Non-null counts (identifies missing values)
     - Memory usage

**Expected Output**:
- DataFrame index range
- List of all columns with their types (int64, float64, object)
- Count of non-null values per column
- Total memory usage

**Key columns to note**:
- Price-related: `prices.amountMin`, `prices.amountMax`, `prices.isSale`
- Date-related: `prices.dateSeen`, `dateAdded`, `dateUpdated`
- Product info: `id`, `brand`, `categories`, `name`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



df = pd.read_csv("./ElectronicsProductsPricingData.csv", encoding='latin1')



df.info()
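
The effect of `encoding='latin1'` can be checked without the real file. The sketch below builds a tiny CSV in memory (the product name is made up) whose accented character is stored as a single Latin-1 byte, then reads it back with the same `encoding` argument used above.

```python
import io

import pandas as pd

# Made-up CSV content encoded as Latin-1; "é" is a single byte in that encoding.
csv_bytes = "id,name\n1,Café speaker\n".encode("latin1")

# Reading with the matching encoding recovers the accented character.
toy = pd.read_csv(io.BytesIO(csv_bytes), encoding="latin1")

print(toy.loc[0, "name"])
```

Reading the same bytes with the default UTF-8 decoding would fail, which is why the encoding is set explicitly.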

### 1.2 Remove Unnecessary Columns

**Purpose**: Clean the dataset by dropping columns that are not needed for price and discount analysis.

**Columns being removed**:

1. **`Unnamed: 26` to `Unnamed: 30`**: Empty columns likely created during data export
2. **`sourceURLs`**: Web source links (not needed for analysis)
3. **`prices.currency`**: Currency type (assumes all prices share one currency)
4. **`keys`**: Internal database keys
5. **`ean`**: European Article Number (barcode)
6. **`prices.shipping`**: Shipping costs (focusing on product price only)
7. **`manufacturerNumber`**: Manufacturer-specific product codes
8. **`upc`**: Universal Product Code (another barcode standard)

**Why remove these columns**:
- **Reduces memory**: Smaller dataset is faster to process
- **Improves clarity**: Easier to see relevant columns
- **Focuses analysis**: Only keeps features needed for time series modeling


In [None]:
## remove the columns that are not needed for the price and discount analysis
unused_columns = ["Unnamed: 26", "Unnamed: 27", "Unnamed: 28", "Unnamed: 29", "Unnamed: 30",
                  "sourceURLs", "prices.currency", "keys", "ean",
                  "prices.shipping", "manufacturerNumber", "upc"]
df.drop(columns=unused_columns, inplace=True)
df.head()
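
The drop above names every column explicitly and raises a `KeyError` if one of them is missing. A more tolerant variant is sketched below on a small made-up frame (`toy` and its values are illustrative only): it finds the `Unnamed:` export columns by prefix and passes `errors="ignore"` so that absent names are skipped.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "id": ["a", "b"],
    "prices.amountMax": [10.0, 20.0],
    "upc": ["111", "222"],
    "Unnamed: 26": [None, None],
    "Unnamed: 27": [None, None],
})

# Columns created by the export all start with "Unnamed:".
unnamed_cols = [c for c in toy.columns if c.startswith("Unnamed:")]

# errors="ignore" skips names that are not present (here: "ean").
cleaned = toy.drop(columns=unnamed_cols + ["upc", "ean"], errors="ignore")

print(list(cleaned.columns))
```

Returning a new frame instead of using `inplace=True` also keeps the original available for comparison.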

### 1.3 Calculate Price Difference and Discount Percentage

**Purpose**: Create new features to quantify discounts for each product.

**What this code does**:

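1. **Remove rows with missing prices**: `dropna` discards rows where `prices.amountMax` or `prices.amountMin` is missing.
2. **Calculate absolute price difference**: `price_difference` is `prices.amountMax` minus `prices.amountMin`.
3. **Calculate discount percentage**: `discount_percent` is `price_difference` divided by `prices.amountMax`, times 100.
4. **Display updated info**: `df.info()` confirms that the two new columns exist.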

**Why discount percentage matters**:
- **Time series target**: This is the primary metric we'll forecast
- **Comparability**: A 20% discount means the same thing on a $10 item as on a $1000 item
- **Business relevance**: Reflects actual promotional strategy

In [None]:

## 1.3 drop rows with missing prices, then add the price difference
## and the discount percentage as new columns

df.dropna(subset=["prices.amountMax", "prices.amountMin"], inplace=True)
df["price_difference"] = df["prices.amountMax"] - df["prices.amountMin"]
df['discount_percent'] = (df['price_difference'] / df['prices.amountMax']) * 100
 
df.info()
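
A small worked example makes the formula concrete. The prices below are made up; the guard against a zero maximum price is an assumption added for illustration and is not part of the cell above.

```python
import numpy as np
import pandas as pd

# Toy prices (made up): a real discount, no discount, and a zero maximum price.
prices = pd.DataFrame({
    "prices.amountMax": [100.0, 50.0, 0.0],
    "prices.amountMin": [80.0, 50.0, 0.0],
})

prices["price_difference"] = prices["prices.amountMax"] - prices["prices.amountMin"]

# Replace a zero maximum with NaN so the division yields NaN instead of 0/0.
safe_max = prices["prices.amountMax"].replace(0, np.nan)
prices["discount_percent"] = prices["price_difference"] / safe_max * 100

print(prices)
```

The first row gives (100 - 80) / 100 * 100, the second row has no discount, and the third row is left undefined rather than producing a misleading number.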


### 1.4 Analyze Product ID Distribution

**Purpose**: Identify which products have multiple price entries (good candidates for time series analysis).


**Explanation**:
- `value_counts()`: Counts occurrences of each unique product ID
- Returns a Series sorted by frequency (descending)
- Shows which products appear most often in the dataset

**What the output tells us**:

- **High counts** (e.g., 150+ entries):
  - Product with extensive price history
  - Multiple observations over time
  - **Excellent for time series modeling**
  - Can detect trends and seasonality

- **Medium counts** (e.g., 20-50 entries):
  - Reasonable historical data
  - Sufficient for basic trend analysis

- **Low counts** (e.g., 1-5 entries):
  - Limited history
  - Not suitable for time series modeling
  - May be new products or single observations

**Why this matters**:
- **Product selection**: Helps choose which products to analyze in detail
- **Data richness**: Products with more entries provide better forecasts
- **Time series viability**: Need sufficient data points for SARIMA models


In [None]:
## count how many price records each product id has
df["id"].value_counts()
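
One way to turn these counts into a shortlist is to keep only ids above a minimum number of records. The sketch below uses made-up ids, and `min_records` is a hypothetical threshold that would need tuning for the real data.

```python
import pandas as pd

# Toy ids (made up): "A" has 4 records, "B" has 2, "C" has 1.
ids = pd.Series(["A", "A", "B", "A", "C", "B", "A"], name="id")

# value_counts() sorts by frequency, most frequent first.
counts = ids.value_counts()

# Hypothetical threshold: keep ids with at least this many records.
min_records = 2
candidates = counts[counts >= min_records]

print(candidates)
```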

### 1.5 Filter for Specific Product Case Study

**Purpose**: Focus on one product with rich historical data and filter for sale periods.



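**What this code does**:

1. **Filter by specific product ID**: keeps only rows whose `id` is `AV1YFZVDvKc47QAVgp7V`.
2. **Filter for sale items only**: keeps rows where `prices.isSale` is `True`.
3. **Display results**: `head()` previews the filtered frame `df_id`.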
  

**Why this approach**:

- **Case study methodology**: Deep dive into one product before generalizing
- **Data richness**: This product has sufficient history for analysis
- **Sale focus**: Understand promotional pricing strategy specifically
- **Simpler modeling**: Single product is easier than multi-product aggregation






In [None]:
## The value counts above show which ids have many price records.
## Select one product with a long history and keep only its sale rows.


df_id = df[df["id"] == "AV1YFZVDvKc47QAVgp7V"]
df_id = df_id[df_id["prices.isSale"] == True]
df_id.head()

### 1.6 Date Format and Temporal Features

Proper date handling is crucial for time series analysis. This section converts string dates to datetime objects and extracts temporal components.

#### Date Parsing and Temporal Feature Engineering

**Purpose**: Convert date strings to datetime objects and extract year components for temporal analysis.

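1. **Convert three date columns to datetime format**: `pd.to_datetime` parses `prices.dateSeen` (stored in the new column `dateSeen`), `dateUpdated`, and `dateAdded`.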
   
   
   **Each date field serves a different purpose**:
   - **`dateSeen`**: When the price was actually observed/scraped
     - Most relevant for understanding consumer-facing prices
     - Best for analyzing real-world pricing trends
   
   - **`dateUpdated`**: When the database record was last modified
     - Shows data freshness
     - May lag behind actual price changes
   
   - **`dateAdded`**: When product first entered the database
     - Useful for understanding product lifecycle
     - Initial pricing analysis
   
   **`errors="coerce"`**: Invalid dates become NaT (Not a Time) instead of raising errors

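2. **Extract year from each datetime column**:
   - `.dt.year`: Accessor for the year component of a datetime
   - Creates the numeric columns `date_year`, `updated_year`, and `added_year` (2014, 2015, 2016, etc.); a column is stored as float when it contains unparsed dates
   - Enables easy year-based filtering and grouping

3. **Check year distribution**:
   - `df["date_year"].value_counts()` shows how many records fall in each year
   - Identifies temporal coverage of dataset
   - Helps decide which years to focus on

4. **Data validation - remove unparseable dates**: drops rows where `dateAdded` or `dateUpdated` became NaT. Rows with an unparseable `dateSeen` are not dropped here; the year filter in Section 2.4 excludes them later.

5. **Remove redundant original column**: drops `prices.dateSeen`, which is now replaced by `dateSeen`.

6. **Display cleaned data**: `df.head()` previews the result.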

**Why datetime conversion matters**:
- **Enables time-based operations**:
  - Sorting chronologically
  - Resampling (daily → weekly → monthly)
  - Date filtering and slicing
  - Time difference calculations





In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## converting the date columns (which are in strings) to datetime format
df["dateSeen"] = pd.to_datetime(df["prices.dateSeen"], errors="coerce")
df["dateUpdated"] = pd.to_datetime(df["dateUpdated"], errors="coerce")
df["dateAdded"] = pd.to_datetime(df["dateAdded"], errors="coerce")

## extracting the year from the date column
df["date_year"] = df["dateSeen"].dt.year
df["updated_year"] = df["dateUpdated"].dt.year
df["added_year"] = df["dateAdded"].dt.year

## show how many records fall in each year (print is needed mid-cell)
print(df["date_year"].value_counts())

## drop rows where dateAdded or dateUpdated could not be parsed
df = df.dropna(subset=['dateAdded', 'dateUpdated'])

## we delete the column prices.dateSeen 
df.drop(columns=['prices.dateSeen'], inplace=True)

df.head()
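
The behaviour of `errors="coerce"` is easy to check on a few made-up strings. The sketch below also covers one assumption worth verifying against the real file: if a `prices.dateSeen` cell holds several comma-separated timestamps, the whole cell becomes NaT unless one timestamp is selected first. Keeping the first entry, as done here, is an illustrative choice, not the notebook's method.

```python
import pandas as pd

# Made-up strings: one clean timestamp, one comma-separated pair, one invalid value.
raw = pd.Series([
    "2017-03-30T06:00:00Z",
    "2017-05-01T00:00:00Z,2017-04-01T00:00:00Z",
    "not a date",
])

# errors="coerce" turns anything unparseable into NaT instead of raising.
naive = pd.to_datetime(raw, errors="coerce", utc=True)

# Assumption: when a cell holds several timestamps, keep only the first one.
first_only = pd.to_datetime(raw.str.split(",").str[0], errors="coerce", utc=True)

print(naive.isna().tolist())
print(first_only.dt.year.tolist())
```

Counting NaT values right after parsing, for example with `df["dateSeen"].isna().sum()`, shows how much data the coercion silently discards.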

## 2. Data Visualization

Comprehensive visualizations to understand pricing patterns, discount trends, and temporal relationships.

### 2.1 Exploratory Data Analysis: Multi-Dimensional Discount Analysis
To identify price volatility and seasonal discounting cycles, we implemented a 2x2 visualization matrix. This aligns with the Week 3-4 Milestone of performing deep exploratory data analysis (EDA) to justify the transition to predictive modeling.

In [None]:
# Create the 2x2 frame
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('DealCatcher EDA: Price & Discount Trends', fontsize=20)

# 1. Scatterplot of Min-Max Prices
sns.scatterplot(ax=axes[0, 0], data=df, x='prices.amountMin', y='prices.amountMax', alpha=0.5)
axes[0, 0].set_title('Scatterplot of Min-Max Prices')

# 2. Discount% vs Date Added
sns.scatterplot(ax=axes[0, 1], data=df, x='dateAdded', y='discount_percent', color='orange')
axes[0, 1].set_title('Discount % vs Date Added')

# 3. Discount% vs Date Last Seen
sns.scatterplot(ax=axes[1, 0], data=df, x='dateSeen', y='discount_percent', color='green')
axes[1, 0].set_title('Discount % vs Date Seen')

# 4. Discount% vs Date Updated
sns.scatterplot(ax=axes[1, 1], data=df, x='dateUpdated', y='discount_percent', color='red')
axes[1, 1].set_title('Discount % vs Date Updated')

# Rotate labels so they don't overlap
for ax in axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

### 2.2 Correlation Heatmap

**Purpose**: Visualize correlations between all numerical variables in the dataset.

**What this code does**:

1. **Select only numerical columns**:
   - Filters DataFrame to keep only numeric data types
   - Excludes strings, dates, and boolean columns
   - Necessary because correlation requires numerical data

2. **Calculate correlation matrix**:
   - `df_heat.corr()` computes the Pearson correlation between all numeric pairs
   - Values range from -1 to +1

3. **Create heatmap visualization**:
   ```python
   plt.figure(figsize=(6, 4))
   sns.heatmap(df_heat.corr(), annot=True, cmap="coolwarm")
   ```
   - `annot=True`: Shows correlation values in each cell
   - `cmap="coolwarm"`: Blue (negative) → White (zero) → Red (positive)

**Why this matters for time series**:
- Identifies multicollinearity issues
- Helps select independent variables
- Reveals hidden relationships
- Guides feature engineering decisions

In [None]:
## Correlation is only defined for numeric data, so keep the numeric columns
## and plot their correlation matrix with seaborn.
df_heat = df.select_dtypes(include=['float64', 'int64'])

## Electronics Products Pricing Data heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(df_heat.corr(), annot=True, cmap="coolwarm")
plt.title("Electronics Products Pricing Data Correlation Heatmap")
plt.show()
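
The numbers behind the heatmap can be inspected directly. The sketch below uses a made-up frame in which one column is an exact multiple of another and a third moves in the opposite direction, so the expected correlations are easy to reason about. Selecting with `include="number"` is a swapped-in alternative to listing `float64` and `int64` explicitly.

```python
import pandas as pd

# Toy data (made up) with known relationships between the columns.
toy = pd.DataFrame({
    "prices.amountMin": [10.0, 20.0, 30.0, 40.0],
    "prices.amountMax": [20.0, 40.0, 60.0, 80.0],   # exactly 2 x amountMin
    "discount_percent": [40.0, 30.0, 20.0, 10.0],   # falls as price rises
    "brand": ["a", "b", "c", "d"],                  # non-numeric, excluded below
})

# Keep every numeric dtype, then compute Pearson correlations (the default).
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

In the real data, `price_difference` and `discount_percent` are both derived from the two price columns, so strong correlations among them are expected by construction.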

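### 2.3 Price Range Visualization

**Purpose**: Visualize the maximum and minimum prices over time.

**What this code does**:
- Sorts the records by `dateUpdated`, because `plt.plot` connects points in row order and unsorted dates make the lines zigzag back and forth across the chart
- Plots `prices.amountMax` and `prices.amountMin` against `dateUpdated`

In [None]:
## sort chronologically so the lines are drawn from left to right
df_sorted = df.sort_values("dateUpdated")

plt.figure(figsize=(10, 6))
plt.plot(df_sorted["dateUpdated"], df_sorted["prices.amountMax"], marker="o", label="Amount Max")
plt.plot(df_sorted["dateUpdated"], df_sorted["prices.amountMin"], marker="x", label="Amount Min")
plt.xlabel("Date")
plt.ylabel("Price")
plt.title("Price Range Over Time")
plt.xticks(rotation=45)
plt.legend()
plt.show()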

### 2.4 Seasonal Discount Analysis (2017-2018)

**Purpose**: Filter data to 2017-2018 and create two plots showing seasonal discount patterns from different date perspectives.

**Part 1: Data Filtering**

**Why filter to 2017-2018?**
- Earlier analysis showed 2014-2015 had sparse data
- 2017-2018 have the highest data density
- Ensures all three date fields are within range
- Focuses on most recent and complete period

**Part 2: Plot 1 - Seasonality by Date Seen**

**What this does**:
- Extracts month (1-12) from dateSeen
- Creates line plot with:
  - X-axis: Months (1=Jan, 12=Dec)
  - Y-axis: Average discount percentage
  - Two lines (one for 2017, one for 2018)
  - `marker="o"`: Circle at each month

**How to interpret**:
- **Peaks**: Months with highest discounts (likely Nov-Dec)
- **Valleys**: Months with lowest discounts (likely Jan-Feb)
- **Line comparison**: Year-over-year consistency

**Part 3: Plot 2 - Seasonality by Date Added**


**What this shows**:
- Discount patterns based on when products were added
- Different perspective from observation dates
- May reveal product launch strategies



**Expected output**: Two line plots showing monthly discount percentage trends with year-over-year comparison

In [None]:
## 2014 and 2015 have very few data points, so keep only rows whose
## date seen, date updated, and date added all fall in 2017 or 2018.
years = [2017, 2018]
df = df[
    df["updated_year"].isin(years)
    & df["date_year"].isin(years)
    & df["added_year"].isin(years)
].copy()  # copy so adding the "month" column below does not raise SettingWithCopyWarning

df["month"] = df["dateSeen"].dt.month
plt.figure(figsize=(10, 6))

sns.lineplot(data=df, x="month", y="discount_percent", hue="date_year", marker="o")
plt.title("Discount Seasonality using Date Seen (2017–2018)")
plt.show()


## repeat the analysis using the month in which each product was added
df_added = df.copy()
df_added["month"] = df_added["dateAdded"].dt.month

plt.figure(figsize=(10, 6))
sns.lineplot(
    data=df_added,
    x="month",
    y="discount_percent",
    hue="added_year",
    marker="o"
)

plt.title("Discount Seasonality using Date Added (2017–2018)")
plt.xlabel("Month")
plt.ylabel("Discount Percent")
plt.legend(title="Year")
plt.show()
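
When several rows share the same month, `sns.lineplot` draws their mean by default. The same monthly averages can be computed explicitly with a `groupby`, which is useful for checking the plotted values. The sketch below uses made-up observations; the level names `year` and `month` are chosen here for readability.

```python
import pandas as pd

# Toy observations (made up): two in Jan 2017, one in Nov 2017,
# one in Jan 2018, two in Nov 2018.
toy = pd.DataFrame({
    "dateSeen": pd.to_datetime([
        "2017-01-05", "2017-01-20", "2017-11-24",
        "2018-01-10", "2018-11-23", "2018-11-26",
    ]),
    "discount_percent": [5.0, 15.0, 30.0, 8.0, 20.0, 40.0],
})

# Group by calendar year and month, then average the discounts in each group.
monthly = toy.groupby(
    [toy["dateSeen"].dt.year.rename("year"),
     toy["dateSeen"].dt.month.rename("month")]
)["discount_percent"].mean()

print(monthly)
```

Adding `.count()` alongside `.mean()` shows how many observations support each point, which matters when some months are sparse.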

### 2.5.1 Create Month × Year Pivot Table

**Purpose**: Reorganize data into a pivot table with months as rows and years as columns.

**What this code does**:

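1. **Set dateSeen as index and sort chronologically**: `df.set_index("dateSeen").sort_index()` produces `df_index_dateSeen`.
2. **Create pivot table**: `pivot_table` averages `discount_percent` for every month and year combination.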


**Parameters explained**:
- `values="discount_percent"`: The metric to aggregate
- `index=...index.month`: Rows = months (1-12)
- `columns=...index.year`: Columns = years (2017, 2018)
- `aggfunc="mean"`: Calculate average discount per cell



**Why this format is useful**:
- **Easy comparison**: See 2017 vs 2018 side-by-side
- **Heatmap ready**: Perfect structure for next visualization
- **Pattern detection**: Scan vertically for seasonal patterns
- **Year-over-year**: Compare horizontally

**Troubleshooting note**:
The error `None of ['prices.dateSeen'] are in the columns` means the code refers to the original column name. That column was converted to `dateSeen` and dropped in Section 1.6, so use `dateSeen` instead.

**Expected output**: A 12×2 table showing average discount percentage for each month in 2017 and 2018

In [None]:
df_index_dateSeen = df.set_index("dateSeen").sort_index()

## Pivot the sorted, date-indexed frame itself so the month and year keys
## line up row by row with the discount values.
## If you see "None of ['prices.dateSeen'] are in the columns", use dateSeen:
## the original column was converted and dropped in Section 1.6.
df_pivot_table = df_index_dateSeen.pivot_table(
    values="discount_percent",
    index=df_index_dateSeen.index.month,
    columns=df_index_dateSeen.index.year,
    aggfunc="mean",
)
print(df_pivot_table)
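
Array-like keys passed to `pivot_table` are matched to rows by position, so they need to come from the same frame, in the same order, as the values being aggregated. A minimal sketch on made-up, deliberately unsorted data (the level names `month` and `year` are added here for readability):

```python
import pandas as pd

# Toy rows (made up), deliberately out of chronological order.
toy = pd.DataFrame({
    "dateSeen": pd.to_datetime(["2018-11-23", "2017-01-05", "2017-11-24", "2017-01-20"]),
    "discount_percent": [20.0, 5.0, 30.0, 15.0],
})

indexed = toy.set_index("dateSeen").sort_index()

# Keys and values both come from `indexed`, so they stay aligned.
pivot = indexed.pivot_table(
    values="discount_percent",
    index=indexed.index.month.rename("month"),
    columns=indexed.index.year.rename("year"),
    aggfunc="mean",
)

print(pivot)
```

Month and year combinations with no observations appear as NaN, as January 2018 does here.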

### 2.5.2 Heatmap of Monthly Discount Patterns

**Purpose**: Visualize the pivot table as a color-coded heatmap for easy pattern recognition.

**What this code does**: draws `df_pivot_table` with `sns.heatmap`, one cell per month and year.

**Parameters**:
- `df_pivot_table`: The month × year data from previous cell
- `annot=True`: Display discount percentages in each cell
- `cmap="coolwarm"`: Color map
  - **Blue** = Lower discounts (cool colors)
  - **Red** = Higher discounts (warm colors)
  - **White** = Mid-range
- `linewidths=0.5`: Thin white lines separating cells


**Expected patterns**:
- **Red cluster**: November-December (Black Friday, Christmas)
- **Blue cluster**: January-February (post-holiday)
- **Consistent columns**: Similar color progression in both years

**Business insights**:
- **Promotional calendar**: When to plan major sales
- **Inventory planning**: Anticipate discount periods
- **Competitive timing**: Align with industry patterns



In [None]:

sns.heatmap(df_pivot_table, annot=True, cmap="coolwarm", linewidths=0.5)

## 3. Feature Engineering Pipeline

### 3.1 Rolling Statistics

Rolling statistics (moving averages) smooth short-term fluctuations and highlight longer-term trends. Essential for:
- Noise reduction
- Trend identification
- Seasonality detection
- Preparing data for forecasting

### Weekly Aggregation and Trend Visualization

**Purpose**: Resample data to weekly frequency and compare discount trends across three date perspectives.

**Part 1: Weekly Resampling (3 versions)**

1. **Weekly average by dateSeen**:
```python
weekly_discount_size_seen = df.resample("W", on="dateSeen")["discount_percent"].mean()
```
- `.resample("W", on="dateSeen")`: Groups by week based on observation date
- `["discount_percent"].mean()`: Averages discounts within each week
- **Use**: Shows actual pricing trends seen by consumers

2. **Weekly average by dateUpdated**:
```python
weekly_discount_size_updated = df.resample("W", on="dateUpdated")["discount_percent"].mean()
```
- Groups by week when records were updated
- **Use**: Shows data freshness patterns

3. **Weekly average by dateAdded**:
```python
weekly_discount_size_added = df.resample("W", on="dateAdded")["discount_percent"].mean()
```
- Groups by week when products entered database
- **Use**: Shows product launch discount strategies

**Why weekly frequency?**
- **Balances detail and noise**:
  - Daily: Too volatile
  - Monthly: Might miss important variations
  - Weekly: Just right for retail patterns


**Part 2: Visualization**

**What the plot shows**:
- **Three overlapping lines**: Each representing different date perspective
- **X-axis**: Time (weekly intervals)
- **Y-axis**: Average discount percentage
- **Markers**: Circle at each week's data point

**How to interpret**:

1. **Lines move together**: Consistent pricing across all dates
2. **Lines diverge**: Different timing in how prices are recorded/updated
3. **Upward trend**: Discounts increasing over time
4. **Downward trend**: Discounts decreasing
5. **Regular peaks/valleys**: Weekly or seasonal patterns
6. **Sharp spikes**: Special sales events



In [None]:
weekly_discount_size_seen = df.resample("W", on="dateSeen")["discount_percent"].mean()
weekly_discount_size_updated = df.resample("W", on="dateUpdated")["discount_percent"].mean()
weekly_discount_size_added = df.resample("W", on="dateAdded")["discount_percent"].mean()

plt.figure(figsize=(12, 6))

sns.lineplot(data=weekly_discount_size_seen, label="Date Seen", marker="o")
sns.lineplot(data=weekly_discount_size_updated, label="Date Updated", marker="o")
sns.lineplot(data=weekly_discount_size_added, label="Date Added", marker="o")
plt.title("Weekly Average Discount Percent (2017–2018)")
plt.xlabel("Week")
plt.ylabel("Average Discount Percent")
plt.show()
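
It helps to see how `resample("W")` assigns dates to weeks. In pandas the `"W"` alias means weeks ending on Sunday, and each bin is labelled with that Sunday. The sketch below uses three made-up observations.

```python
import pandas as pd

# Toy observations (made up).
toy = pd.DataFrame({
    "dateSeen": pd.to_datetime([
        "2017-01-02",  # Monday
        "2017-01-04",  # Wednesday of the same week
        "2017-01-09",  # Monday of the following week
    ]),
    "discount_percent": [10.0, 20.0, 40.0],
})

# Weekly bins end on Sunday; each bin is labelled with its end date.
weekly = toy.resample("W", on="dateSeen")["discount_percent"].mean()

print(weekly)
```

The first two observations fall into the week ending 2017-01-08 and are averaged; the third lands in the week ending 2017-01-15. Weeks without any observation show up as NaN.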

### 3.2 Lag Features 

Lag features represent previous values in a time series. They are crucial for:
- **Autocorrelation analysis**: Understanding temporal dependencies
- **ARIMA/SARIMA modeling**: Determining AR order
- **Machine learning**: Using past to predict future
- **Pattern detection**: Identifying cyclical behavior

#### 3.2.1 Create Lag Features

**Purpose**: Create lagged versions of the discount time series to analyze autocorrelation.

**What this code does**:

1. **Create clean weekly time series**:
- Sets dateSeen as index
- Resamples to weekly frequency
- Calculates average discount per week
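- Removes any NaN values (weeks with no observations); if a week is dropped, a shift of k rows no longer equals exactly k calendar weeks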
- Creates clean time series `ts`

2. **Initialize lag DataFrame with current values**:
- Creates DataFrame with one column: "current"
- Contains the weekly discount percentages
- This represents time **t** (present)

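3. **Create Lag 1 (1 week ago)**:
- `.shift(1)`: Moves all values down by 1 position
- Shows last week's discount
- First value becomes NaN (no previous week)

4. **Create Lags 3, 6, and 12**:
- `.shift(3)`, `.shift(6)`, and `.shift(12)` repeat the same operation for 3, 6, and 12 weeks back
- The first 3, 6, and 12 values of the respective columns become NaN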

**Why these specific lags?**
- **Lag 1**: Immediate persistence (AR(1) component)
- **Lag 3**: ~3-week patterns
- **Lag 6**: ~1.5 month patterns
- **Lag 12**: Quarterly cycles (~3 months)

**Purpose of lag features**:
1. **Autocorrelation**: Do past values predict future?
2. **ARIMA order**: Which lags are significant?
3. **Forecasting**: Use historical data as predictors
4. **Pattern detection**: Find cyclical behavior

**Expected output**: DataFrame with 5 columns showing current week and 4 lagged versions, with NaN in early rows

In [None]:
ts = (df.set_index("dateSeen").resample("W")["discount_percent"].mean()).dropna()
lag_df = pd.DataFrame({"current": ts})

lag_df["lag_1"] = ts.shift(1)
lag_df["lag_3"] = ts.shift(3)
lag_df["lag_6"] = ts.shift(6)
lag_df["lag_12"] = ts.shift(12)
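
The effect of `shift` is easiest to see on a short series. The sketch below uses five made-up weekly values (`ts_toy` and `lags` are illustrative names) and builds two lag columns.

```python
import pandas as pd

# Five made-up weekly values, indexed by consecutive Sundays.
ts_toy = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0],
    index=pd.date_range("2017-01-08", periods=5, freq="W"),
    name="discount_percent",
)

lags = pd.DataFrame({"current": ts_toy})
lags["lag_1"] = ts_toy.shift(1)   # value from one row (one week) earlier
lags["lag_3"] = ts_toy.shift(3)   # value from three rows earlier

print(lags)
```

Each row now pairs a week's value with the values observed one and three weeks before it; the leading rows hold NaN because no earlier value exists.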

#### 3.2.2 Remove Missing Values from Lag Data

**Purpose**: Clean the lag DataFrame by removing rows with NaN values.


**Why this is necessary**:
- The shift() operation creates NaN values:
  - Lag 1: First row has NaN
  - Lag 3: First 3 rows have NaN
  - Lag 6: First 6 rows have NaN
  - Lag 12: First 12 rows have NaN
- `dropna()`: Removes rows where ANY column has NaN
- Result: Only complete cases remain

**Impact**:
- Loses first 12 weeks of data (due to lag_12)
- Trade-off: Smaller dataset but complete feature set
- For ~104 weeks (2 years), ~92 usable rows remain

**Why we need complete cases**:
- Required for correlation calculations
- Necessary for scatter plots
- Most models require complete data
- Ensures fair comparison across all lags

**Expected output**: Cleaned lag_df with no NaN values, ready for analysis

In [None]:
lag_df = lag_df.dropna()
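
The row count after `dropna()` can be checked on a synthetic series. The sketch below assumes a complete run of 104 consecutive weeks, which the real data may not have, and uses placeholder values.

```python
import numpy as np
import pandas as pd

# Synthetic series: 104 consecutive weeks with placeholder values 0..103.
weeks = pd.date_range("2017-01-01", periods=104, freq="W")
ts_toy = pd.Series(np.arange(104, dtype=float), index=weeks)

lags = pd.DataFrame({"current": ts_toy})
for k in (1, 3, 6, 12):
    lags[f"lag_{k}"] = ts_toy.shift(k)

# Rows with NaN in any column are removed; the longest lag decides how many.
complete = lags.dropna()

print(len(lags), len(complete))
```

The longest lag alone determines the loss: 104 rows minus 12 leaves 92, and the first complete row is the 13th week.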

#### 3.2.3 Lag-1 Autocorrelation Plot

**Purpose**: Visualize the relationship between current discount values and values from 1 week ago.

**What this code does**: draws a scatter plot of `lag_1` (x) against `current` (y).

**Components**:
- **Square figure** (6×6 inches): Gives both axes the same length
- **X-axis**: Last week's discount (t-1)
- **Y-axis**: This week's discount (t)
- **Each point**: One week of data

**How to interpret patterns**:

**1. Strong Positive Correlation** (upward diagonal):
```
  High current
       ↑
       |    /
       |   /
       |  /
       | /
       |/________ High lag_1 →
```
- Points cluster along diagonal
- High lag_1 → High current
- **Meaning**: Discounts persist week-to-week
- **SARIMA**: Need AR(1) component

**2. No Correlation** (random cloud):
```
       ↑  • •
       | •  • •
       |  •  •
       |• •   •
       |________ →
```
- Points scattered randomly
- **Meaning**: Last week doesn't predict this week
- **SARIMA**: Try different lags

**3. Negative Correlation** (downward diagonal):
```
       ↑\
       | \
       |  \
       |   \
       |    \
       |________ →
```
- High lag_1 → Low current
- **Meaning**: Alternating pattern
- **Rare** in discount data

**What this tells us**:

- **Strong correlation**: 
  - Discounts are predictable
  - Include AR terms in SARIMA
  - Good 1-week-ahead forecasts possible

- **Weak correlation**:
  - Check other lags (3, 6, 12)
  - May need seasonal components
  - More challenging to forecast



**Expected output**: Scatter plot revealing the strength and direction of week-to-week persistence in discounts

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(lag_df["lag_1"], lag_df["current"])
plt.xlabel("Lag 1 (t-1)")
plt.ylabel("Current (t)")
plt.title("Lag-1 Relationship")
plt.show()
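
The scatter plot can be summarised by a single number, the lag-1 correlation. The sketch below computes it two ways on a synthetic random walk (not the real discount series): with `Series.autocorr`, and as the Pearson correlation of the `current` and `lag_1` columns, which is what the scatter plot displays.

```python
import numpy as np
import pandas as pd

# Synthetic series (not real data): a random walk, which is strongly persistent.
rng = np.random.default_rng(0)
ts_toy = pd.Series(rng.normal(size=100).cumsum())

# Built-in lag-1 autocorrelation.
r1 = ts_toy.autocorr(lag=1)

# Same quantity from the lag table: correlate current values with lag_1.
pairs = pd.DataFrame({"current": ts_toy, "lag_1": ts_toy.shift(1)}).dropna()
r1_manual = pairs["current"].corr(pairs["lag_1"])

print(round(r1, 3), round(r1_manual, 3))
```

This is a plain Pearson correlation of the lagged pairs. ACF routines in statistics libraries use a slightly different estimator, so their lag-1 value may differ a little.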

---

# Summary and Next Steps

### What This Notebook Accomplished:

#### ✅ Data Preparation
- Loaded and cleaned electronics pricing dataset
- Removed 12 unnecessary columns
- Handled missing values strategically
- Converted dates to proper datetime format
- Created discount percentage metric

#### ✅ Feature Engineering
- Calculated price differences and discount percentages
- Extracted temporal features (year, month)
- Created lag features (1, 3, 6, 12 weeks)
- Filtered to high-quality time period (2017-2018)

#### ✅ Exploratory Analysis
- Correlation heatmap of numerical features
- 4-panel pricing and discount dashboard
- Seasonal pattern detection (monthly trends)
- Year-over-year comparison

#### ✅ Time Series Preparation
- Weekly resampling for noise reduction
- Trend visualization across three date perspectives
- Pivot table creation (month × year)
- Seasonality heatmap

#### ✅ Autocorrelation Analysis
- Lag feature creation
- Lag-1 scatter plot
- Temporal dependency visualization



---

### 📊 Next Steps for SARIMA:

1. **Stationarity Testing**:
   - Run Augmented Dickey-Fuller test
   - Apply differencing if needed
   - Confirm stationarity

2. **ACF/PACF Analysis**:
   - Plot autocorrelation functions
   - Identify p, d, q parameters
   - Determine P, D, Q, s for seasonal component

3. **Model Building**:
   - Fit SARIMA(p,d,q)(P,D,Q,s)
   - Test multiple parameter combinations
   - Use AIC/BIC for model selection

4. **Model Validation**:
   - Check residuals (white noise test)
   - Diagnostic plots
   - Cross-validation

5. **Forecasting**:
   - Generate predictions
   - Create confidence intervals
   - Visualize forecasts vs actuals
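
Step 1 mentions differencing. As a minimal illustration of what that operation does, the sketch below applies first differencing to a synthetic series with a pure linear trend; the values are made up, and no stationarity test is run here.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series (made up): starts at 5 and rises by 2 each week.
weeks = pd.date_range("2017-01-01", periods=10, freq="W")
trend = pd.Series(5.0 + 2.0 * np.arange(10), index=weeks)

# First difference: each value minus the previous one; the first entry is NaN.
diff1 = trend.diff().dropna()

print(diff1.tolist())
```

The differenced series is constant, so the linear trend is gone. On real data the result would still need a formal check such as the Augmented Dickey-Fuller test named in step 1.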



---

### 📚 Libraries Used:
- **pandas**: Data manipulation
- **matplotlib**: Visualization
- **seaborn**: Statistical plots
- **statsmodels** (ready for next phase): SARIMA modeling

---



## 4.3 Absolute Percentage