## MSc Data Analytics - Capstone Project

#### Predictive Insights in the Coffee Market: Time Series Models for forecasting coffee prices

Student id: 2020274 Clarissa Cardoso

Introduction
This capstone project aims to apply different time series models to forecast the prices of coffee in the commodity stock market. 
Central to this analysis is the ICO Composite Indicator Price (I-CIP), which is a critical benchmark reflecting global coffee market trends. 
Accurate forecasting of the I-CIP is vital for stakeholders throughout the coffee industry, from producers to investors. 


The data utilized in this project is sourced from the International Coffee Organization's (ICO) Public Market Information, which provides the I-CIP values free of charge.

For the early stages of this experimentation, roughly one year of data was available to collect, from 01 Feb 2023 to 29 Feb 2024.

Objectives:
Model Building: develop and evaluate various time series forecasting models to predict future values of the I-CIP.

- Model Development: build and train various time series forecasting models, including traditional statistical models (e.g., ARIMA/SARIMA) and machine learning algorithms (e.g., LSTM neural networks).
- Model Evaluation: evaluate the performance of each model using appropriate metrics, such as mean absolute error (MAE) and root mean squared error (RMSE), to determine their predictive accuracy.
- Forecasting: generate forecasts for future I-CIP values using the best-performing model(s) and visualize the results to facilitate interpretation and decision-making. Forecast horizons, counted in business days:
    - 1 day
    - 5 days = 1 week
    - 21 days = 1 month
    - (63 days = 3 months, i.e. 1 quarter)
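To make the evaluation metrics and horizons concrete, the sketch below computes MAE and RMSE for a naive persistence forecast (repeat the value seen `horizon` steps earlier). This baseline is an illustration only, not one of the models built in this project; the function name `persistence_errors` and the variable `icip_sample` are hypothetical, and the sample values are the first five I-CIP values shown in the data preview in section 1.

```python
import numpy as np

def persistence_errors(values, horizon):
    """MAE and RMSE of a naive forecast that repeats the value seen `horizon` steps earlier."""
    y = np.asarray(values, dtype=float)
    if horizon < 1 or horizon >= len(y):
        raise ValueError("horizon must be between 1 and len(values) - 1")
    actual = y[horizon:]      # values we are trying to predict
    forecast = y[:-horizon]   # last known value, carried forward by `horizon` steps
    errors = actual - forecast
    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return mae, rmse

# First five I-CIP values from the main dataframe preview (hypothetical variable name)
icip_sample = [171.43, 172.50, 169.47, 171.29, 172.14]
for h in (1, 2):
    mae, rmse = persistence_errors(icip_sample, h)
    print(f"horizon={h}: MAE={mae:.3f}, RMSE={rmse:.3f}")
```

With the full series, the same call could be repeated for horizons of 1, 5, 21 and 63 days, giving a reference error level that any fitted model should improve on.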


In [1]:
import keras
import tensorflow as tf

print("Keras version:", keras.__version__)
print("TensorFlow version:", tf.__version__)


## checking if keras/tensorflow are correctly installed

Keras version: 2.10.0
TensorFlow version: 2.10.0


In [2]:
#importing libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd #dataframes 
import numpy as np #linear algebra
import seaborn as sns #visualization
sns.set(color_codes=True)


import plotly.express as px
import plotly.graph_objects as go


import scipy.stats as stats #statistical resources

import matplotlib.pyplot as plt #visualisation 
%matplotlib inline 


from matplotlib import colors
from matplotlib.ticker import PercentFormatter
import matplotlib as mpl

from sklearn.model_selection import train_test_split # split the data into training and test sets
from sklearn.model_selection import GridSearchCV # hyperparameter tuning
from sklearn.preprocessing import MinMaxScaler # scale features to a fixed range
from sklearn.preprocessing import StandardScaler # standardize the data
from sklearn.linear_model import LinearRegression # linear regression
from sklearn import metrics # metrics module from scikit-learn
from sklearn.metrics import make_scorer, r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from scipy.stats import shapiro
from scipy.interpolate import interp1d

from keras.models import Sequential # is the last Python update causing a dead kernel when importing keras functions?
from keras.layers import Dense, LSTM, Dropout, GRU, Bidirectional
from keras.optimizers import SGD
import math
from math import sqrt

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf



# 1. Load data

This section reviews the original dataset, compiled from the data that was available during the early stages of this experimentation: roughly one year, from 01 Feb 2023 to 29 Feb 2024.




A few things observed when importing the raw files:

- Column mismatch: the import assumes that all files have the same column names and order. This can lead to errors when merging DataFrames with different structures.
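The column mismatch risk can be checked explicitly before any merge. The following is a minimal sketch, assuming the monthly files have already been read into a list of DataFrames; the helper name `find_column_mismatches` and the toy frames are hypothetical and only serve to illustrate the check.

```python
import pandas as pd

def find_column_mismatches(frames):
    """Return the positions of frames whose column names or order differ from the first frame."""
    if not frames:
        return []
    reference = list(frames[0].columns)
    return [i for i, df in enumerate(frames) if list(df.columns) != reference]

# Toy frames for illustration only (not real ICO files)
a = pd.DataFrame({"date": ["01-Mar"], "value": [1.0]})
b = pd.DataFrame({"date": ["01-Apr"], "value": [2.0]})
c = pd.DataFrame({"value": [3.0], "date": ["01-May"]})  # same names, different order

print(find_column_mismatches([a, b, c]))  # positions of frames that differ from the first
```

An empty result means every frame shares the structure of the first one, so a plain `pd.concat` will not silently create extra columns filled with NaN.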



Since then, the ICO has released additional months, from March to September 2024, which will be added to the dataframe to give the models more data.


In [3]:
# Read the CSV file 
icip_data = pd.read_csv("icip_df.csv")

# View the first 5 rows
icip_data.head()

Unnamed: 0,date,I-CIP,colombian_milds,other_milds,brazilian_nat,robustas,year,month
0,2023-02-01,171.43,235.92,223.22,191.65,102.31,2023,2
1,2023-02-02,172.5,237.34,226.26,192.86,102.0,2023,2
2,2023-02-03,169.47,232.24,221.86,188.61,101.52,2023,2
3,2023-02-06,171.29,235.17,224.8,190.77,102.02,2023,2
4,2023-02-07,172.14,235.65,226.72,191.92,102.1,2023,2


### Checking additional data from March/24 to September/24 before combining with the main dataframe

In [4]:
import os
# List all the files in the folder
os.listdir("icip_24") 

['I-CIP_September_2024.csv',
 'I-CIP_August_2024.csv',
 'I-CIP_April_2024.csv',
 'I-CIP_March_2024.csv',
 'I-CIP_May_2024.csv',
 'I-CIP_July_2024.csv',
 'I-CIP_June_2024.csv']

In [5]:
# create a for loop to import the csv files from the folder with fewer commands.

# create an empty list to store dfs
dataframes = []

# path to folder where csv files are (in this case same directory)
folder_path = "icip_24"


# to import CSV starting from the third row, skipping the first two
def import_csv(filepath):
    return pd.read_csv(filepath, skiprows=2)

# Iterate through files in the folder
for file in os.listdir(folder_path):
    if file.endswith(".csv"):  # Only consider CSV files
        file_path = os.path.join(folder_path, file)  # Construct the full file path
        dataframes.append(import_csv(file_path))  # Read CSV and append to list

In [6]:
# check the length of the list, i.e. how many files were read from the new folder
len(dataframes)

7

Checking the header of the files to understand how the features are laid out in this first stage, before combining the 7 new months with the main dataframe.

The same issue appears with the header names: the column labels are spread over several rows. This time it was decided to skip the first 2 rows when reading each file, to avoid the unnamed header and collect only the data. The raw header looks like this:

Unnamed: 0	Unnamed: 1	Colombian	Unnamed: 3	Brazilian	Unnamed: 5
0	NaN	I-CIP	NaN	Other Milds	NaN	Robusta
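An alternative to skipping two rows and keeping the leftover `Unnamed: n` labels is to skip all header rows and pass explicit column names. The sketch below is an assumption-heavy illustration: the three-row header layout in `raw` is reconstructed from the fragments shown above, and the names in `names` are borrowed from the main dataframe on the assumption that the columns appear in the same order. The two data rows are taken from the September file preview.

```python
import io
import pandas as pd

# Reconstructed sample of a monthly file (layout is an assumption, see lead-in)
raw = """,,Colombian,,Brazilian,
,I-CIP,,Other Milds,,Robusta
,,Milds,,Naturals,
02-Sep,241.71,265.38,260.21,239.79,224.69
03-Sep,241.20,264.48,259.50,239.13,224.54
"""

# Assumed mapping to the main dataframe's column names
names = ["date", "I-CIP", "colombian_milds", "other_milds", "brazilian_nat", "robustas"]

# Skip the three header rows entirely and supply the names ourselves
df = pd.read_csv(io.StringIO(raw), skiprows=3, header=None, names=names)
print(df)
```

Naming the columns at read time would avoid a later rename step, at the cost of hard-coding the column order, which should be verified against each file.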


In [7]:
# check that the order of the DataFrames corresponds to the directory list, to test that the loop is working
dataframes[0].head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Milds,Unnamed: 3,Naturals,Unnamed: 5
0,02-Sep,241.71,265.38,260.21,239.79,224.69
1,03-Sep,241.2,264.48,259.5,239.13,224.54
2,04-Sep,245.35,266.69,260.63,241.03,233.05
3,05-Sep,249.73,269.06,268.61,247.93,233.82
4,06-Sep,243.88,262.39,261.53,241.58,229.4


In [8]:
print(dataframes)
#list of all dataframes

[   Unnamed: 0  Unnamed: 1   Milds  Unnamed: 3  Naturals  Unnamed: 5
0      02-Sep      241.71  265.38      260.21    239.79      224.69
1      03-Sep      241.20  264.48      259.50    239.13      224.54
2      04-Sep      245.35  266.69      260.63    241.03      233.05
3      05-Sep      249.73  269.06      268.61    247.93      233.82
4      06-Sep      243.88  262.39      261.53    241.58      229.40
5      09-Sep      247.47  269.31      268.28    246.38      228.99
6      10-Sep      251.18  272.41      271.39    250.57      232.83
7      11-Sep      252.78  272.00      270.98    250.20      237.98
8      12-Sep      255.22  273.63      273.58    252.47      240.75
9      13-Sep      265.67  283.72      283.53    262.78      251.73
10     16-Sep      264.57  282.28      282.63    261.62      250.68
11     17-Sep      269.45  288.40      288.74    267.76      253.35
12     18-Sep      269.47  288.21      288.56    267.49      253.80
13     19-Sep      265.09  283.29      284.33  

To continue the project, it is necessary to make 2 adjustments to the files in the second directory:
- change the date format from "06-Jun" to the '%Y-%m-%d' format in the "Unnamed: 0" column, which corresponds to the date, and apply this to all files. This makes combining the 2 dfs smoother, since all dates then share the same format.
- add year and month columns, so that the structure matches the main df.

In [9]:
# Inspect one of the DataFrames (index 5, July) before transforming the date column
print(dataframes[5].head())

  Unnamed: 0  Unnamed: 1   Milds  Unnamed: 3  Naturals  Unnamed: 5
0     01-Jul      224.77  248.87      245.94    227.53      202.00
1     02-Jul      228.21  251.41      249.08    232.47      204.63
2     03-Jul      225.51  248.18      245.85    228.99      203.10
3     04-Jul      227.04  247.42      246.27    228.97      207.38
4     05-Jul      230.76  251.20      252.03    234.53      208.30


In [10]:
# Function to transform the 'Unnamed: 0' date column for each DataFrame in the list and reorder columns
def transform_date(dataframes, year):
    month_mapping = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
        'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
        'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }
    
    # Iterate over each DataFrame in the list
    for i in range(len(dataframes)):
        df = dataframes[i]
        
        # Print the columns to inspect if 'Unnamed: 0' exists or if the name is different
        print(f"Columns in DataFrame {i}: {df.columns}")
        
        # Check if 'Unnamed: 0' exists, otherwise handle the column name differently
        if 'Unnamed: 0' in df.columns:
            # Apply the transformation to the 'Unnamed: 0' column to create full date strings
            df['Date'] = df['Unnamed: 0'].apply(
                lambda x: '-'.join([str(year), month_mapping[x.split('-')[1]], x.split('-')[0]])
            )
            
            # Convert the 'Date' column to datetime format
            df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
            
            # Drop the original 'Unnamed: 0' column
            df.drop(columns=['Unnamed: 0'], inplace=True)
            
            # Reorder columns to place 'Date' first
            columns = ['Date'] + [col for col in df.columns if col != 'Date']
            dataframes[i] = df[columns]  # Replace the DataFrame with the reordered one
        else:
            print(f"'Unnamed: 0' column not found in DataFrame {i}")
    
    return dataframes

# Apply the function to the list of DataFrames
dataframes = transform_date(dataframes, 2024)

# Test: print the first DataFrame to check if the column reordering worked
print(dataframes[0].head())

Columns in DataFrame 0: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 1: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 2: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 3: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 4: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 5: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
Columns in DataFrame 6: Index(['Unnamed: 0', 'Unnamed: 1', 'Milds', 'Unnamed: 3', 'Naturals',
       'Unnamed: 5'],
      dtype='object')
        Date  Unnamed: 1   Milds  
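The same date conversion can also be expressed without a manual month dictionary, by appending the year to each label and letting pandas parse the abbreviated month name. This is a sketch of an alternative, not the approach used in the cell above; the helper name `parse_day_month` is hypothetical, and the `%b` directive assumes English month abbreviations in the active locale.

```python
import pandas as pd

def parse_day_month(labels, year):
    """Parse 'DD-Mon' labels such as '06-Jun' into datetimes for the given year."""
    # Strip stray whitespace so labels like ' 06-Jun' parse cleanly
    cleaned = pd.Series(labels).astype(str).str.strip()
    # Append the year before parsing so that 29-Feb is valid in leap years
    return pd.to_datetime(cleaned + f"-{year}", format="%d-%b-%Y")

dates = parse_day_month([" 06-Jun", "02-Sep", "13-Sep"], 2024)
print(dates.dt.strftime("%Y-%m-%d").tolist())
```

Because the parsing is strict, any label that does not match the expected pattern raises an error instead of being silently converted.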

## Checking that the right date format was saved and adding year/month columns to match the main df

In [11]:
# Function to add year and month columns to each DataFrame in the list
def add_year_month_columns(dataframes):
    for i in range(len(dataframes)):
        df = dataframes[i]
        
        # Extract the year and month from the 'Date' column
        df['year'] = df['Date'].dt.year
        df['month'] = df['Date'].dt.month
        
        # Replace the DataFrame in the list with the new columns added
        dataframes[i] = df
        
    return dataframes

# Apply the function to the list of DataFrames
dataframes = add_year_month_columns(dataframes)

# checking if transformation worked in the dataframes list:
dataframes[0].head()

Unnamed: 0,Date,Unnamed: 1,Milds,Unnamed: 3,Naturals,Unnamed: 5,year,month
0,2024-09-02,241.71,265.38,260.21,239.79,224.69,2024,9
1,2024-09-03,241.2,264.48,259.5,239.13,224.54,2024,9
2,2024-09-04,245.35,266.69,260.63,241.03,233.05,2024,9
3,2024-09-05,249.73,269.06,268.61,247.93,233.82,2024,9
4,2024-09-06,243.88,262.39,261.53,241.58,229.4,2024,9


In [19]:
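# Define the list of DataFrames in chronological order (March to September).
# The indices follow the os.listdir output shown above, whose order is not guaranteed.
dfs_in_order = [dataframes[3], dataframes[2], dataframes[4], dataframes[6], dataframes[5], dataframes[1], dataframes[0]]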

# Concatenate the DataFrames
merged_df = pd.concat(dfs_in_order,ignore_index=True)

# Display the merged DataFrame
merged_df

Unnamed: 0,Date,Unnamed: 1,Milds,Unnamed: 3,Naturals,Unnamed: 5,year,month
0,2024-03-01,181.39,206.30,204.99,183.44,157.54,2024,3
1,2024-03-04,183.10,209.76,207.30,185.63,157.91,2024,3
2,2024-03-05,181.97,206.67,205.81,183.35,158.63,2024,3
3,2024-03-06,185.77,209.44,208.59,186.12,164.27,2024,3
4,2024-03-07,190.75,215.48,214.71,192.05,167.42,2024,3
...,...,...,...,...,...,...,...,...
147,2024-09-24,269.83,289.78,291.26,267.94,252.30,2024,9
148,2024-09-25,270.09,291.55,293.03,270.32,249.36,2024,9
149,2024-09-26,272.70,296.96,296.29,274.52,249.30,2024,9
150,2024-09-27,268.97,292.23,291.25,269.86,247.45,2024,9
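Since `os.listdir` returns files in arbitrary order, hard-coded list positions may point to different months on another machine. One possible safeguard, sketched below, is to sort the DataFrames by their earliest date before concatenating. The helper name `order_by_first_date` is hypothetical, and the two small frames reuse a few values from the September and March previews for illustration.

```python
import pandas as pd

def order_by_first_date(frames, date_col="Date"):
    """Return the frames sorted by the earliest value in `date_col`."""
    return sorted(frames, key=lambda df: df[date_col].min())

# Two small frames in the "wrong" order, as os.listdir might return them
sep = pd.DataFrame({"Date": pd.to_datetime(["2024-09-02", "2024-09-03"]),
                    "Unnamed: 1": [241.71, 241.20]})
mar = pd.DataFrame({"Date": pd.to_datetime(["2024-03-01", "2024-03-04"]),
                    "Unnamed: 1": [181.39, 183.10]})

merged = pd.concat(order_by_first_date([sep, mar]), ignore_index=True)
print(merged)
print(merged["Date"].is_monotonic_increasing)
```

Checking `is_monotonic_increasing` on the merged date column is a quick way to confirm that the months ended up in chronological order.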


In [24]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        152 non-null    datetime64[ns]
 1   Unnamed: 1  152 non-null    float64       
 2   Milds       152 non-null    float64       
 3   Unnamed: 3  152 non-null    float64       
 4   Naturals    152 non-null    float64       
 5   Unnamed: 5  152 non-null    float64       
 6   year        152 non-null    int64         
 7   month       152 non-null    int64         
dtypes: datetime64[ns](1), float64(5), int64(2)
memory usage: 9.6 KB
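Before the new months can be appended to the main dataframe, the column names need to match. The sketch below shows one way this could be done. The mapping in `COLUMN_MAP` is an assumption, inferred from the raw header fragments and from the column order of the main dataframe, and should be verified against the ICO files. The helper name `append_new_months` is hypothetical, and the sketch assumes the saved index column (`Unnamed: 0`) has already been dropped from the main dataframe.

```python
import pandas as pd

# Assumed mapping from the new files' labels to the main dataframe's names
COLUMN_MAP = {
    "Date": "date",
    "Unnamed: 1": "I-CIP",
    "Milds": "colombian_milds",
    "Unnamed: 3": "other_milds",
    "Naturals": "brazilian_nat",
    "Unnamed: 5": "robustas",
}

def append_new_months(main_df, new_df, column_map=COLUMN_MAP):
    """Rename the new columns, align them to the main frame and append the rows."""
    renamed = new_df.rename(columns=column_map)
    main = main_df.copy()
    main["date"] = pd.to_datetime(main["date"])  # dates read from CSV are strings
    combined = pd.concat([main, renamed[main.columns]], ignore_index=True)
    if combined["date"].duplicated().any():
        raise ValueError("overlapping dates between the two dataframes")
    return combined.sort_values("date").reset_index(drop=True)

# One row from each preview above, for illustration
main = pd.DataFrame({"date": ["2023-02-01"], "I-CIP": [171.43], "colombian_milds": [235.92],
                     "other_milds": [223.22], "brazilian_nat": [191.65], "robustas": [102.31],
                     "year": [2023], "month": [2]})
new = pd.DataFrame({"Date": pd.to_datetime(["2024-03-01"]), "Unnamed: 1": [181.39],
                    "Milds": [206.30], "Unnamed: 3": [204.99], "Naturals": [183.44],
                    "Unnamed: 5": [157.54], "year": [2024], "month": [3]})

combined = append_new_months(main, new)
print(combined)
```

Selecting `renamed[main.columns]` raises a `KeyError` if any expected column is missing, which surfaces a structural mismatch immediately rather than after modelling has started.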


# 2. Data preparation/EDA


# 3. Modelling