## MSc Data Analytics - Capstone Project

#### Predictive Analysis in the Coffee Market: Using Deep Learning to predict coffee prices.
Student id: 2020274 Clarissa Cardoso


## Introduction
This notebook aims to analyse historical coffee price indices and develop a predictive model for future price trends. The focus is on data from the ICO (International Coffee Organization), particularly the I-CIP composite index, which combines the prices of Colombian Milds, Other Milds, Brazilian Naturals, and Robustas.

### Dataset:
The dataset used in this analysis consists of historical coffee price data, with daily observations for business days. Prices are expressed in cents of USD per lb. 


The data utilized in this project is sourced from the International Coffee Organization's (ICO) Public Market Information, which provides the I-CIP values free of charge.

For the early stages of the experimentation, one year's worth of data was available, from 01Feb23 to 29Feb24; that analysis is in a separate notebook (2020274_capstone_EDA_Models 2.ipynb). In this notebook, recent data from March to September 2024 was added to expand the insights and feed more data points to the modelling stage.


### Objectives:
1. Clean and preprocess the dataset for missing values and inconsistencies.
2. Explore the time-series behavior of coffee prices through visualizations.
3. Implement various forecasting models to predict future price trends, including traditional statistical models (e.g., ARIMA/SARIMA) and deep learning algorithms (e.g., LSTM neural networks).
4. Compare model performance using key metrics (e.g., RMSE, MAE).


### Expected Outcome:
By the end of this notebook, we will identify the best forecasting model for coffee prices and present actionable insights based on the findings.

Forecasting: generate forecasts for future I-CIP values using the best-performing model(s) and visualize the results to facilitate interpretation and decision-making. Forecast horizons (in business days):

- 1 day
- 5 days = 1 week
- 21 days = 1 month
- (63 days = 3 months, i.e. 1 quarter)

### Importing relevant libraries for the project

In [None]:
import keras
import tensorflow as tf

print("Keras version:", keras.__version__)
print("TensorFlow version:", tf.__version__)


## Checking that Keras/TensorFlow are correctly installed

In [None]:
#importing libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd #dataframes 
import numpy as np #linear algebra
import seaborn as sns #visualization
sns.set(color_codes=True)


import plotly.express as px
import plotly.graph_objects as go


import scipy.stats as stats #statistical resources

import matplotlib.pyplot as plt #visualisation 
%matplotlib inline 


from matplotlib import colors
from matplotlib.ticker import PercentFormatter
import matplotlib as mpl

from sklearn.model_selection import train_test_split # importing function to split the data training and test.
from sklearn.preprocessing import MinMaxScaler # Import the MinMaxScaler module from sklearn.preprocessing library
from sklearn.linear_model import LinearRegression # importing to perform linear regression.
from sklearn.metrics import make_scorer, r2_score # Importing from Metrics module
from sklearn.preprocessing import StandardScaler # standardize the data
from sklearn import metrics # Metrics module from scikit-learn
from sklearn.model_selection import GridSearchCV # importing for hyperparameter tuning
from sklearn.metrics import mean_squared_error # importing mse
from scipy.stats import shapiro

from keras.models import Sequential # note: a recent Python update may be causing a dead kernel when importing Keras functions?
from keras.layers import Dense, LSTM, Dropout, GRU, Bidirectional
from keras.optimizers import SGD
import math
from math import sqrt
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from scipy.interpolate import interp1d

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf



# 1. Load data

For the early stages of this experimentation, presented in the first notebook (Models2_copy), one year's worth of data was available, from 01Feb23 to 29Feb24.

This section reviews the original dataset, compiled with data from Feb 2023 to Feb 2024.


A few things were observed when importing the raw files from the data source:

- Column mismatch: the files cannot be assumed to share the same column names and order, and merging DataFrames with different structures can lead to errors.

The data for each month is published separately. Originally, the first four months used the label 'ICO Composite', while the following months used the simpler 'I-CIP' for the same data. Because of this difference, it was not possible to simply merge all DataFrames into one. Data cleaning and manipulation techniques (renaming the columns and reorganising the files chronologically) were adopted to reach the final dataset for the first year of data, loaded as `icip_data` below.
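The relabelling step described above can be sketched as follows. This is a minimal illustration on two made-up monthly tables, not the code used to build `icip_df.csv`: the frame names and price values are hypothetical, and only the two labels 'ICO Composite' and 'I-CIP' come from the raw files.

```python
import pandas as pd

# Hypothetical monthly extracts: the older file labels the index 'ICO Composite',
# the newer one 'I-CIP'. Values are invented for illustration only.
older_month = pd.DataFrame({
    "date": pd.to_datetime(["2023-02-01", "2023-02-02"]),
    "ICO Composite": [170.0, 171.5],
})
newer_month = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02"]),
    "I-CIP": [172.0, 173.5],
})

# Harmonise the label before concatenating; otherwise pandas keeps two
# half-empty columns instead of one continuous series.
older_month = older_month.rename(columns={"ICO Composite": "I-CIP"})

harmonised = (
    pd.concat([older_month, newer_month], ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
print(harmonised)
```

Renaming before `pd.concat` matters because concatenation aligns on column names, so differently labelled columns would otherwise end up side by side with NaN padding.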





In [None]:
# Read the CSV file 
icip_data = pd.read_csv("icip_df.csv")

# View the first 5 rows
icip_data.head()

Since then, the ICO has released additional months, which are included in the main DataFrame here. The period from March to September 2024 is added to feed more data to the models, with the expectation that this could improve the results. These seven new files are sorted in chronological order and given the same labels as the main DataFrame above.


### Importing additional data from March/24 to September/24

In [None]:
import os
# List all the files in the folder
os.listdir("icip_24") 

In [None]:
# Use a for loop to import the CSV files from the folder with fewer commands.

# create an empty list to store dfs
dataframes = []

# path to folder where csv files are (in this case same directory)
folder_path = "icip_24"


# to import CSV starting from the third row, skipping the first two
def import_csv(filepath):
    return pd.read_csv(filepath, skiprows=2)

# Iterate through files in the folder
for file in os.listdir(folder_path):
    if file.endswith(".csv"):  # Only consider CSV files
        file_path = os.path.join(folder_path, file)  # Construct the full file path
        dataframes.append(import_csv(file_path))  # Read CSV and append to list

In [None]:
# Check the length of the list, i.e. how many files were read from the new folder
len(dataframes)

Checking the headers of the files to understand how the features are laid out at this stage, before combining the seven new months with the main DataFrame.

The same issue appears with the header names, as shown below. So this time it was decided to ignore the first 2 rows to avoid the unnamed header and only collect the data.

|   | Unnamed: 0 | Unnamed: 1 | Colombian | Unnamed: 3 | Brazilian | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | NaN | I-CIP | NaN | Other Milds | NaN | Robusta |


In [None]:
#check if order of files correspond with the directory list, testing if loop is working
dataframes[5].head()

In [None]:
print(dataframes)
#list of all dataframes

To continue the project, it is necessary to make 2 adjustments to the files in the second directory:
- Change the date format from " 06-Jun" to '%Y-%m-%d' in the "Unnamed: 0" column, which corresponds to the date, and apply this to all files. Keeping a consistent date format enables a smoother combination of the 2 DataFrames.
- Add `year` and `month` columns so that the new files match the main DataFrame.
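As an alternative sketch of the date conversion, the same result can be obtained in vectorised form with `pd.to_datetime`. The sample values below are hypothetical, the year is supplied by hand because it is not present in the raw files, and `%b` assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical raw values in the style of the monthly files (day-month, no year)
raw_dates = pd.Series([" 06-Jun", "07-Jun", "10-Jun"])
year = 2024  # the year is not in the file, so it is supplied explicitly

# Strip stray whitespace, append the year, then parse day / abbreviated month / year.
# '%b' relies on English month abbreviations.
parsed = pd.to_datetime(raw_dates.str.strip() + f"-{year}", format="%d-%b-%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Stripping the whitespace first guards against values such as " 06-Jun", which would otherwise not match the format string.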

In [None]:
# Inspect one DataFrame before the date transformation is applied
print(dataframes[5].head())

In [None]:
# Function to transform the 'Unnamed: 0' date column for each DataFrame in the list and reorder columns
def transform_date(dataframes, year):
    month_mapping = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
        'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
        'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }
    
    # Iterate over each DataFrame in the list
    for i in range(len(dataframes)):
        df = dataframes[i]
        
        # Print the columns to inspect if 'Unnamed: 0' exists or if the name is different
        print(f"Columns in DataFrame {i}: {df.columns}")
        
        # Check if 'Unnamed: 0' exists, otherwise handle the column name differently
        if 'Unnamed: 0' in df.columns:
            # Build full date strings (YYYY-MM-DD) from values such as '06-Jun';
            # strip() removes any stray whitespace around the raw value
            df['date'] = df['Unnamed: 0'].apply(
                lambda x: '-'.join([str(year), month_mapping[x.strip().split('-')[1]], x.strip().split('-')[0]])
            )
            
            # Convert the 'date' column to datetime format
            df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
            
            # Drop the original 'Unnamed: 0' column
            df.drop(columns=['Unnamed: 0'], inplace=True)
            
            # Reorder columns to place 'date' first
            columns = ['date'] + [col for col in df.columns if col != 'date']
            dataframes[i] = df[columns]  # Replace the DataFrame with the reordered one
        else:
            print(f"'Unnamed: 0' column not found in DataFrame {i}")
    
    return dataframes

# Apply the function to the list of DataFrames
dataframes = transform_date(dataframes, 2024)

# Test: print the first DataFrame to check if the column reordering worked
print(dataframes[0].head())

## Checking the right date format was saved and adding year/month columns to match main df

In [None]:
# Function to add year and month columns to each DataFrame in the list
def add_year_month_columns(dataframes):
    for i in range(len(dataframes)):
        df = dataframes[i]
        
        # Extract the year and month from the 'date' column
        df['year'] = df['date'].dt.year
        df['month'] = df['date'].dt.month
        
        # Replace the DataFrame in the list with the new columns added
        dataframes[i] = df
        
    return dataframes

# Apply the function to the list of DataFrames
dataframes = add_year_month_columns(dataframes)

# checking if transformation worked in the dataframes list:
dataframes[0].head()

### Define chronological order for dataframe

In [None]:
# Define the list of DataFrames in the desired order
dfs_in_order = [dataframes[3],dataframes[2],dataframes[4],dataframes[6],dataframes[5],dataframes[1],dataframes[0]]

# Concatenate the DataFrames
merged_df = pd.concat(dfs_in_order,ignore_index=True)

# Display the merged DataFrame
merged_df
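The hard-coded index order above depends on the order in which `os.listdir` returned the files, which Python does not guarantee. A sketch of an order-independent alternative is shown below, using two small made-up frames (the names `monthly_frames`, `ordered_frames` and `merged_sorted` are illustrative and not part of the notebook).

```python
import pandas as pd

# Hypothetical stand-ins for the monthly DataFrames, deliberately out of order
monthly_frames = [
    pd.DataFrame({"date": pd.to_datetime(["2024-05-01", "2024-05-02"]), "I-CIP": [3.0, 4.0]}),
    pd.DataFrame({"date": pd.to_datetime(["2024-03-01", "2024-03-04"]), "I-CIP": [1.0, 2.0]}),
]

# Option 1: order the frames by their earliest date before concatenating
ordered_frames = sorted(monthly_frames, key=lambda frame: frame["date"].min())

# Option 2: concatenate in any order, then sort the rows by date
merged_sorted = (
    pd.concat(monthly_frames, ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
print(merged_sorted)
```

Either option gives a chronological result regardless of how the files were listed on disk.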

In [None]:
merged_df.info()

The output of `info()` shows that the new data contains 152 observations across 8 columns, from March 2024 to September 2024.
The first column holds the dates in datetime format, followed by each category of coffee as well as the index values as floats. The added year and month of each observation are stored as integers.


### Rename column names prior to merging both datasets

This makes it possible to combine the new data with the original dataset, giving a bigger pool of observations for the modelling part. The final dataset is expected to cover Feb 2023 to Sep 2024.

- df1 = icip_data > contains the original dataset (Feb 2023 - Feb 2024)
- df2 = merged_df > contains the new dataset (Mar 2024 - Sep 2024)



In [None]:
df1 = icip_data
df2 = merged_df


In [None]:
df1

In [None]:
df2.info()

In [None]:
# Ensure that the 'date' column in both df1 and df2 is in datetime format
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# Rename the columns in df2 to match the structure of df1
df2.columns = ['date', 'I-CIP', 'colombian_milds', 'other_milds', 'brazilian_nat', 'robustas', 'year', 'month']

# Concatenate df1 and df2 into a single DataFrame
combined_df = pd.concat([df1, df2], ignore_index=True)

# Sort by the 'date' column to ensure chronological order
combined_df = combined_df.sort_values(by='date').reset_index(drop=True)

# Optionally, save the final DataFrame to a CSV file
combined_df.to_csv('final_combined_data.csv', index=False)

# Test: print the first few rows to verify the result
print(combined_df.head())

In [None]:
combined_df

The combined dataset, now in the correct structure, supports better exploration in the next sections.

# 2. Exploratory Data Analysis


In [None]:
combined_df.info()

### Check for missing values and summary statistics


In [None]:
# Missing values
print(combined_df.isnull().sum())

# Get summary statistics for numerical columns
combined_df.describe()

### Plotting trends over time to begin understanding how this new dataset is presented

## 2.1 Prices plots

###  a. ICO Composite Indicator Price (I-CIP) is the main feature to be used for the predictions


In [None]:
import matplotlib.pyplot as plt

# Plot I-CIP over time
plt.figure(figsize=(10, 6))
plt.plot(combined_df['date'], combined_df['I-CIP'], label='I-CIP')
plt.xlabel('date')
plt.ylabel('I-CIP')
plt.title('I-CIP Trend Over Time')
plt.xticks(rotation=45)
plt.legend()
plt.show()

## b. Comparing the different categories over time:

Each category has a different weight in the calculation of the final composite; the weights are listed in section 2.2.2.


In [None]:
# Plot other categories of coffee over time
plt.figure(figsize=(10, 6))
plt.plot(combined_df['date'], combined_df['I-CIP'], label='I-CIP')
plt.plot(combined_df['date'], combined_df['colombian_milds'], label='Colombian Milds')
plt.plot(combined_df['date'], combined_df['other_milds'], label='Other Milds')
plt.plot(combined_df['date'], combined_df['brazilian_nat'], label='Brazilian Naturals')
plt.plot(combined_df['date'], combined_df['robustas'], label='Robustas')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Coffee Types Trend Over Time')
plt.xticks(rotation=45)
plt.legend()
plt.show()

## c. Compare ICIP to each coffee category over time
Changing the labels of the date axis for easier visualisation (i.e. 2023-03 to MAR 2023).

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Define only the columns you want to plot (excluding the last two columns)
columns_to_plot = ['I-CIP', 'colombian_milds', 'other_milds', 'brazilian_nat', 'robustas']

plt.figure(figsize=(10, 6))

# Iterate over the selected columns and plot each one
for column in columns_to_plot:
    plt.plot(combined_df['date'], combined_df[column], label=column)

# Customize x-axis to show months (use date format for better readability)
plt.xlabel('Month')
plt.ylabel('Price (US cents/lb)')
plt.title('Price Fluctuations of ICO Composite Indicator and Coffee Groups Over Time')
plt.legend()

# Format the x-axis labels to show the month name with better spacing
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=3))  # Shows every 3rd month
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))

plt.xticks(rotation=45)  # Rotate for better readability

# Show plot
plt.show()

### 2.2 Setting the date column as index:


In [None]:
# Set the date as the index of combined_df for building time series in later stages
# Set 'date' column as index
combined_df.set_index('date', inplace=True)

#check output
icip_df = combined_df
icip_df.head()

### 2.2.1 Checking the range of dataset: 


With the dates as index, we can check the range of the dataset:

- The range spans 607 calendar days. However, the data is collected at BUSINESS-DAY frequency, excluding weekends and holidays, which accounts for the difference between the total days (607) and the number of observations (431).

In [None]:
## Checking how many days are present in the dataset

print(f'Dataframe contains prices between {icip_df.index.min()} and {icip_df.index.max()}')
print(f'Total Days = {(icip_df.index.max() - icip_df.index.min()).days} days')

### Once the date is set as index, it is possible to measure the range and frequency of the data.




In [None]:
# making sure the index is set at datetime 
icip_df.index = pd.to_datetime(icip_df.index)

In [None]:
### From the range, confirm the frequency of the index
print(icip_df.index.freq)

A frequency of `None` means pandas treats the dates as an irregular index. Since the frequency is not defined, it is set manually to business days.
This can have a series of benefits:
- Align  data with time-based operations.
- Perform accurate rolling calculations and time series decomposition.
- Handle missing data systematically.
- Use advanced time series models and resampling.


https://pandas.pydata.org/pandas-docs/version/0.16/timeseries.html
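To illustrate what setting the frequency does, here is a minimal sketch on a made-up week of prices with one day missing. The dates and values are hypothetical; the point is that `asfreq("B")` reindexes the series to every business day and fills dates that were absent with NaN.

```python
import pandas as pd

# Hypothetical prices for one week with the Wednesday missing (e.g. a holiday)
dates = pd.to_datetime(["2024-03-04", "2024-03-05", "2024-03-07", "2024-03-08"])
prices = pd.Series([180.0, 181.0, 183.0, 184.0], index=dates, name="I-CIP")
print(prices.index.freq)  # None: pandas has no frequency for this index

# Reindex to business-day frequency; dates absent from the data become NaN rows
prices_b = prices.asfreq("B")
print(prices_b.index.freq)
print(prices_b)
```

A consequence worth keeping in mind: after this step, a holiday no longer shows up as a gap in the index but as a row of NaN values, which has to be handled before modelling.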

In [None]:
icip_df = icip_df.asfreq('B')  # B stands for Business Days

In [None]:
### From the range, confirm the frequency of the index
print(icip_df.index.freq)

### 2.2.1

### a. Checking monthly seasonality



In [None]:
# Extract year and month from the index

# Plot 2023 and 2024 separately as well:
# are the variations in price the same in both years?




icip_df['year'] = icip_df.index.year
icip_df['month'] = icip_df.index.month_name().str[:3]  # This will give  the three-letter month abbreviation.

# Draw Plot
plt.figure(figsize=(12, 7), dpi=80)
sns.boxplot(x='month', y='I-CIP', data=icip_df)

# Set Title
plt.title('Month-wise Box Plot of I-CIP Prices\n(The Seasonality)', fontsize=18)

# Show the plot
plt.show()

In [None]:
icip_df

In [None]:
# Filter data by year
df_2023 = icip_df[icip_df['year'] == 2023]
df_2024 = icip_df[icip_df['year'] == 2024]

# Plot for 2023
plt.figure(figsize=(12, 7))
sns.boxplot(x='month', y='I-CIP', data=df_2023, order=['Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Month-wise Box Plot of I-CIP Prices in 2023\n(The Seasonality)', fontsize=18)
plt.xlabel('Month')
plt.ylabel('I-CIP Price (In US cents/lb)')
plt.show()

# Plot for 2024 (up to September)
plt.figure(figsize=(12, 7))
sns.boxplot(x='month', y='I-CIP', data=df_2024, order=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep'])
plt.title('Month-wise Box Plot of I-CIP Prices in 2024\n(The Seasonality)', fontsize=18)
plt.xlabel('Month')
plt.ylabel('I-CIP Price (In US cents/lb)')
plt.show()

### Year-over-Year Monthly Comparison:

This plot compares the average I-CIP prices in 2023 and 2024 to highlight the differences between the two years.

Overall, 2024 shows higher averages, confirming the upward trend seen in the line plots in item c.

In [None]:
# Monthly average comparison
monthly_avg = icip_df.groupby(['year', 'month'])['I-CIP'].mean().unstack(level=0)
monthly_avg.plot(kind='bar', figsize=(12, 6))
plt.title('Average Monthly I-CIP Prices in 2023 vs 2024')
plt.xlabel('Month')
plt.ylabel('Average I-CIP Price (in US cents/lb)')
plt.show()

###  Monthly Average I-CIP Prices with Regional Harvest Annotations

In [None]:
import calendar

In [None]:
# Set the month order to ensure chronological sorting
month_order = list(calendar.month_abbr[1:])  # ['Jan', 'Feb', ..., 'Dec']
icip_df['month'] = pd.Categorical(icip_df['month'], categories=month_order, ordered=True)

# Group by both year and month to keep month names in chronological order
monthly_avg_df = icip_df.groupby([icip_df.index.year, 'month'])['I-CIP'].mean().unstack(level=0)

# Create the plot with annotations of harvest 
plt.figure(figsize=(14, 7))
monthly_avg_df.plot(kind='bar', color=['skyblue', 'salmon'], ax=plt.gca())
plt.xlabel('Month')
plt.ylabel('Average I-CIP Price (in US cents/lb)')
plt.title('Monthly Average I-CIP Prices by Year with Harvest Annotations')

# Annotate months with regional harvests (approximately)
harvest_annotations = {
    'Jan': 'South America, Africa',
    'Feb': 'South America, Africa',
    'Mar': 'South America',
    'Apr': 'Central America, South America',
    'May': 'Asia',
    'Jun': 'South America, Africa, Asia',
    'Jul': 'Asia, Africa',
    'Aug': 'Asia, Africa',
    'Sep': 'Asia',
    'Oct': 'South America, Africa, Asia',
    'Nov': 'Central America, Africa',
    'Dec': 'South America, Africa'
}

# Add annotations at 45-degree angle for readability
for month_idx, (month, regions) in enumerate(harvest_annotations.items()):
    plt.text(month_idx - 0.15, monthly_avg_df.loc[month].max() + 5, 
             regions, ha='center', rotation=45, color='black', fontsize=8)

# Adjust x-axis labels
plt.xticks(rotation=45)
plt.legend(title="Year", loc="upper right")
plt.show()



# Harvest dates extracted from:
# Source: https://coffeehunter.com/coffee-seasonality/ (accessed on 30/10)
# https://www.fairmountaincoffee.com/category-s/102.htm (accessed on 30/10)

#### Heatmap of Monthly Price Averages 

The heatmap below aims to identify months where prices tend to dip or spike, so that they can be cross-referenced with known harvest periods. The color intensity provides a quick overview of price levels each month.

In [None]:
# Create a DataFrame for monthly averages
monthly_avg_df = icip_df.groupby([icip_df.index.year, icip_df.index.month])['I-CIP'].mean().unstack()

plt.figure(figsize=(12, 6))
sns.heatmap(monthly_avg_df, annot=True, cmap="YlGnBu", fmt=".1f", linewidths=0.5)
plt.title('Monthly I-CIP Price Averages (in US cents/lb)')
plt.xlabel('Month')
plt.ylabel('Year')
plt.show()

## d. Value distribution across categories



###### plot monthly only by category

In [None]:
# copy of the original DataFrame without 'year' and 'month' columns
copy = icip_df.drop(columns=['year', 'month']).copy()

In [None]:
import plotly.graph_objects as go

# Define colors from the Set2 palette
colors = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854']

# Create boxplot traces
box_traces = []
for i, column in enumerate(copy.columns):
    box_trace = go.Box(y=copy[column], name=column, marker=dict(color=colors[i]))
    box_traces.append(box_trace)

# Create layout
layout = go.Layout(title='Boxplot by Group', yaxis=dict(title='Value'), xaxis=dict(title='Variable'))

# Create figure
fig = go.Figure(data=box_traces, layout=layout)

# Show plot
fig.show()

colombian_milds and brazilian_nat have similar median values but different spreads of data, suggesting that the two groups exhibit relatively stable prices.


robustas has the largest range of prices, which could indicate greater market volatility or variability in this coffee category.


I-CIP shows a balanced range, but it’s positioned lower than both colombian_milds and brazilian_nat, which are premium categories.

In [None]:
icip_df.describe()

The last 2 columns (`year` and `month`) were only added to facilitate some of the monthly plots; the `copy` DataFrame created earlier without them is used for further statistical measurements.

In [None]:
copy.describe()

### Comparing mean values between categories


In [None]:
# Bar plot
plt.figure(figsize=(10, 6))
copy.mean().plot(kind='bar', color='skyblue')
plt.title('Mean Values of Groups')
plt.xlabel('Variables')
plt.ylabel('Mean')
plt.xticks(rotation=45)
plt.show()

## 2.2.2 Checking correlation across categories

In [None]:
# Compute the correlation matrix
correlation_matrix = copy.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()


ICO Composite Indicator Price (I-CIP) is calculated using a weighted average of four coffee groups: Colombian Milds, Other Milds, Brazilian Naturals, and Robustas, with each group contributing a specific weight to the calculation:

- Colombian Milds: 12%
- Other Milds: 21%
- Brazilian Naturals: 30%
- Robustas: 37%
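As a worked example of the weighted average, the sketch below applies the weights listed above to one day of group prices. The prices are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not taken from the dataset.

```python
# Weights as listed above
weights = {
    "colombian_milds": 0.12,
    "other_milds": 0.21,
    "brazilian_nat": 0.30,
    "robustas": 0.37,
}

# Hypothetical group prices for a single day, in US cents/lb (not real data)
group_prices = {
    "colombian_milds": 200.0,
    "other_milds": 190.0,
    "brazilian_nat": 180.0,
    "robustas": 150.0,
}

# The weights should add up to 1 for a weighted average
total_weight = sum(weights.values())

# 0.12*200 + 0.21*190 + 0.30*180 + 0.37*150 = 24 + 39.9 + 54 + 55.5
composite = sum(weights[group] * group_prices[group] for group in weights)
print(round(total_weight, 10), round(composite, 2))
```

Because Robustas carry the largest weight, the composite sits closer to the Robusta price than a simple average of the four groups would.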


In [None]:
print(copy.columns)

In [None]:
# Calculate the weighted I-CIP
copy['weighted_I-CIP'] = (0.12 * copy['colombian_milds'] +
                                 0.21 * copy['other_milds'] +
                                 0.30 * copy['brazilian_nat'] +
                                 0.37 * copy['robustas'])

# Compare weighted I-CIP with actual I-CIP
plt.figure(figsize=(10, 6))
plt.plot(copy.index, copy['I-CIP'], label='Actual I-CIP')
plt.plot(copy.index, copy['weighted_I-CIP'], label='Weighted I-CIP', linestyle='--')
plt.xlabel('Date')
plt.ylabel('I-CIP')
plt.title('Actual I-CIP vs Weighted I-CIP')
plt.legend()
plt.show()

#### Lag plots
Understanding the entropy of I-CIP prices and how correlated they are.

The scatter plots below show the relationship between observations and their lags.
"As the lag increases, the correlation between the time series and its lags generally decreases."

Autocorrelation in the data is visible at lag 1 (t+1): a strong linear relationship indicates a high correlation between an observation and its immediate predecessor. A similar pattern is observed at lag 2, with a few data points beginning to drift apart. Lags 3 and 4 are already more spread out, meaning the correlation between values decreases as the interval between lags grows.
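The visual impression from lag plots can be backed by a number. The sketch below computes the lag correlation with `Series.autocorr` on a synthetic random walk, used here only as a stand-in for a price series; the same call could be applied to `icip_df['I-CIP']`, assuming missing values have been dealt with.

```python
import numpy as np
import pandas as pd

# Synthetic random walk standing in for a price series (not I-CIP data)
rng = np.random.default_rng(0)
series = pd.Series(100 + rng.normal(size=500).cumsum())

# Pearson correlation between the series and a lagged copy of itself
lag_correlations = {lag: series.autocorr(lag=lag) for lag in (1, 2, 3, 4, 21)}
for lag, value in lag_correlations.items():
    print(f"lag {lag:>2}: {value:.3f}")
```

A value close to 1 at lag 1 corresponds to the tight diagonal in the first lag plot, while lower values at longer lags correspond to the wider scatter.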

In [None]:
from pandas.plotting import lag_plot
plt.rcParams.update({'ytick.left' : False, 'axes.titlepad':10})

lp = icip_df['I-CIP']

# Plot
fig, axes = plt.subplots(1, 4, figsize=(10,3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(lp, lag=i+1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i+1))

    
fig.suptitle('Lag Plots of I-CIP prices \n(Points get wide and scattered with increasing lag -> lesser correlation)\n', y=1.15)    

plt.show()

In [None]:
icip_df

In [None]:
# Define the number of lags for 1 month
number_of_lags = 21

# Create subplots with 3 columns
fig, axes = plt.subplots(nrows=7, ncols=3, figsize=(15, 20))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Generate a lag plot for each lag
for i in range(1, number_of_lags + 1):
    lag_plot(icip_df['I-CIP'], lag=i, ax=axes[i-1])
    axes[i-1].set_title(f'Lag {i}')

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()

### Rolling average / Rolling standard deviation


Rolling mean: the average of the values within a moving window of previous observations. The mean is computed for each successive window in order, which can significantly reduce noise in time series data.
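A small hand-checkable sketch of the rolling mean, using five made-up values and a window of 3 (the notebook itself uses windows of 5, 21 and 63 business days):

```python
import pandas as pd

# Small made-up series so the arithmetic can be checked by hand
values = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])

# Each output is the mean of the current value and the two before it;
# the first two positions are NaN because the window is not yet full
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean.tolist())
```

The third entry is (10 + 12 + 11) / 3 = 11. The leading NaN values explain why the rolling curves in the plots start later than the original series, and later still for larger windows.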

In [None]:
## Rolling Statistics at different periods
window_sizes = [5, 21, 63]  # A week, a month, a quarter,  (approximately)
data_rolling = icip_df['I-CIP']  

for window in window_sizes:
    rolling_mean = icip_df['I-CIP'].rolling(window=window).mean()
    rolling_std = icip_df['I-CIP'].rolling(window=window).std()
    
    plt.figure(figsize=(14, 5))
    plt.plot(icip_df['I-CIP'].index, icip_df['I-CIP'], label='Original')
    plt.plot(rolling_mean.index, rolling_mean, label=f'Rolling Mean (window={window})')
    plt.plot(rolling_std.index, rolling_std, label=f'Rolling Std Dev (window={window})')
    plt.title(f'Rolling Mean and Standard Deviation (window size = {window})')
    plt.legend()
    plt.show()

### Checking for missing dates to determine the right frequency


In earlier sections it was observed that 100 days were missing from the 365-day window of dates within the dataset. However, since this dataset covers Monday to Friday (business days), it was assumed (even though assumptions should be verified in data science) that those 'missing' dates only corresponded to weekends and/or holidays. The code below extracts the exact dates missing from the full range of business days.



In [None]:
# Use the first and last dates in the data as the bounds of the range
# (adjust these if a different time frame is needed)
start_date = icip_df.index.min()
end_date = icip_df.index.max()  


# Generate a range of business days within this period
business_days = pd.bdate_range(start=start_date, end=end_date)

# Now compare the business_days with your DataFrame's index to find out missing dates
missing_dates = business_days.difference(icip_df.index)

print(f"Total number of expected business days: {len(business_days)}")
print(f"Total number of actual days in data: {icip_df.shape[0]}")
print(f"Total number of missing dates: {len(missing_dates)}")
print("Missing dates are:")
print(missing_dates)

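The code above reports no missing dates in the DataFrame, yet there are still a few NaN values. This is because `asfreq('B')`, applied earlier, already added every business day to the index; dates with no published price now appear as rows filled with NaN rather than as gaps in the index.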

In [None]:

icip_df.info()
print(icip_df.shape)
print(icip_df.isnull().sum())
icip_df.head()

In [None]:
icip_df.isnull().values.any()

In [None]:
nan_df = icip_df.isna()
print(nan_df)

In [None]:
nan_rows = icip_df.isna().any(axis=1)
print(nan_rows)

In [None]:
#filter rows with nan values
nan_rows = icip_df[icip_df.isna().any(axis=1)]

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(icip_df.isna(), cbar=False, cmap="viridis")
plt.show()

In [None]:
# Display the dates with missing data
nan_rows.index

## Missing dates identified
## Interpolate for the time series

Models cannot handle missing data, so the NaN rows need to be filled (e.g. by interpolation) before modelling.
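One possible way to fill such gaps is sketched below on a made-up business-day series. Time-based linear interpolation is an assumption here, since the notebook does not state which method is used; forward fill (`ffill`) would be another common choice.

```python
import numpy as np
import pandas as pd

# Hypothetical business-day series with one holiday gap (NaN)
index = pd.bdate_range("2024-03-04", periods=5)
prices = pd.Series([180.0, 181.0, np.nan, 183.0, 184.0], index=index, name="I-CIP")

# Time-weighted linear interpolation between the neighbouring observations
filled = prices.interpolate(method="time")
print(filled)
```

With evenly spaced neighbours, the gap is filled with the midpoint of the values on either side. The same call could be applied to `icip_df['I-CIP']` before running tests or decompositions that do not accept missing values.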

In [None]:
# Function to print out results in customised manner
from statsmodels.tsa.stattools import kpss
def kpss_test(timeseries):
    print ('Results of KPSS Test:')
    kpsstest = kpss(timeseries, regression='c', nlags="auto")
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','#Lags Used'])
    for key,value in kpsstest[3].items():
        kpss_output['Critical Value (%s)'%key] = value
    print (kpss_output)

# Call the function and run the test

kpss_test(icip_df['I-CIP'])

In [None]:
# Seasonal decompositions with different periods.

periods = [63, 21, 5]  # Quarterly, monthly and weekly (in business days)
# Generate the decomposition plots for each period.
for period in periods:
    decompositions = seasonal_decompose(icip_df['I-CIP'], model='additive', period=period)

    # Plotting the components of the decomposition
    plt.rcParams.update({'figure.figsize': (8,8)})
    print(f"Seasonal Decomposition with Period = {period}")
    decompositions.plot()
    plt.show()

# 3. Modelling