# Coffee Price Prediction — Notebook

**Goal:** Build reproducible pipelines and ML models to forecast global coffee futures prices (e.g., ICE Arabica / Yahoo ticker `KC=F`) using market, macro and weather features.

**Data sources (examples to download inside the notebook):**
- ICO (International Coffee Organization) historical indicators and market reports. 
- Yahoo Finance futures ticker `KC=F` (download via `yfinance` or Yahoo historical CSV).
- Kaggle public coffee price datasets as ready CSVs for quick experimentation.

This notebook is structured for production-ready experimentation: data acquisition → EDA → features → models → evaluation → deployment artifacts.

---

## Usage notes
1. Run cells in order. The data download cells need internet access; they use `yfinance`, `requests` or the Kaggle API.
2. On a cloud notebook, enable internet access and, if you use Kaggle datasets, set your Kaggle credentials.
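
Before running the Kaggle cells, it can help to check that credentials are visible to the process. The sketch below assumes the two usual mechanisms of the Kaggle client: the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables, or a `kaggle.json` token file in `~/.kaggle`. The helper name `kaggle_credentials_available` and its `config_dir` parameter are illustrative, not part of any library.

```python
import os
from pathlib import Path

def kaggle_credentials_available(config_dir=None):
    """Return True if Kaggle API credentials appear to be configured.

    config_dir is a hypothetical override for the token directory
    (defaults to ~/.kaggle).
    """
    # Option 1: both environment variables are set and non-empty.
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file exists in the config directory.
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return (base / "kaggle.json").is_file()

if not kaggle_credentials_available():
    print("Kaggle credentials not found; Kaggle download cells will likely fail.")
```

This only checks that credentials are present; it does not validate them against the Kaggle service.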


## Data sources (suggested)

- ICO historical data / indicators (Composite indicator prices, group prices): https://ico.org/historical-data-on-the-global-coffee-trade/  
- Yahoo Finance futures (KC=F) historical: https://finance.yahoo.com/quote/KC%3DF/history/  
- Kaggle coffee price datasets (examples): https://www.kaggle.com/datasets/timmofeyy/coffee-prices-historical-data  

(Use these links to download CSVs or call APIs directly from the notebook.)

In [None]:
# Environment setup (run once)
# !pip install yfinance xgboost nbformat pandas scikit-learn statsmodels tensorflow matplotlib seaborn kaggle optuna openpyxl

# Note: on many managed notebooks yfinance and xgboost are preinstalled.
# openpyxl is needed by pandas.read_excel to open .xlsx workbooks (used for ICO files).
print('Install the required packages if not present: yfinance, xgboost, statsmodels, tensorflow, optuna, kaggle, openpyxl')

In [None]:
# Data ingestion examples
import pandas as pd
import datetime as dt

# 1) Download coffee futures using yfinance (KC=F). Requires internet.
def fetch_coffee_yahoo(start='1990-01-01', end=None):
    import yfinance as yf
    if end is None:
        end = dt.date.today().isoformat()
    ticker = 'KC=F'
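    df = yf.download(ticker, start=start, end=end, progress=False)
    # Some yfinance releases return (field, ticker) MultiIndex columns even for one ticker;
    # keep only the field level so that df['close'] is a Series.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    df = df.rename(columns={'Close':'close','Open':'open','High':'high','Low':'low','Volume':'volume'})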
    df.index = pd.to_datetime(df.index)
    return df

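# 2) Load an ICO file downloaded locally (CSV or Excel workbook)
def load_ico_csv(path):
    if str(path).lower().endswith('.csv'):
        return pd.read_csv(path)
    # ICO workbooks sometimes hold several sheets; sheet_name=None returns a dict of DataFrames
    return pd.read_excel(path, sheet_name=None)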

# 3) Load Kaggle CSV (if downloaded)
def load_kaggle_csv(path):
    return pd.read_csv(path, parse_dates=['Date'], index_col='Date')

# Example usage (uncomment to run with internet):
# coffee = fetch_coffee_yahoo('1980-01-01')
# print(coffee.tail())

## Exploratory Data Analysis (EDA)
- Visualize price series, seasonality, rolling statistics.
- Inspect missing data and contract roll effects (futures continuous contracts need stitching).
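
The two checks above can be sketched in a few lines. The helper `eda_summary` is a hypothetical name, and the demo series is synthetic so the cell runs without the ingestion step; with real data you would pass `coffee['close']`. Note that the "missing" list is built against a plain business-day calendar, so exchange holidays will also show up in it.

```python
import numpy as np
import pandas as pd

def eda_summary(close, window=30):
    """Rolling mean/std of a price series, plus business days absent from its index."""
    stats = pd.DataFrame({
        "close": close,
        "roll_mean": close.rolling(window).mean(),
        "roll_std": close.rolling(window).std(),
    })
    # Every Monday-Friday date between the first and last observation.
    expected = pd.bdate_range(close.index.min(), close.index.max())
    missing_days = expected.difference(close.index)
    return stats, missing_days

# Synthetic stand-in for coffee['close'], with two days removed on purpose.
idx = pd.bdate_range("2024-01-01", periods=60)
rng = np.random.default_rng(0)
demo = pd.Series(150 + rng.normal(0, 1, 60).cumsum(), index=idx).drop(idx[[10, 11]])

stats, missing = eda_summary(demo, window=5)
print(stats.tail())
print("Missing business days:", list(missing.date))
```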

In [None]:
# EDA snippets (requires coffee DataFrame from ingestion cell)
import matplotlib.pyplot as plt
def plot_series(df, col='close', title='Coffee price'):
    plt.figure(figsize=(12,4))
    plt.plot(df[col])
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Price (US cents/lb)')  # KC=F is quoted in US cents per pound
    plt.grid(True)
    plt.show()

# Example: plot_series(coffee)

## Feature engineering
- Lag features (t-1 ... t-30)
- Rolling mean/std (7, 30, 90 days)
- Technical indicators (momentum, RSI, ATR)
- Macro features: USD index, Oil price, S&P500, Shipping costs
- Weather/agronomic indices for origin countries (rainfall anomalies)
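
The technical indicators in the list are not implemented in the feature cell, so here is a minimal sketch of momentum, RSI and ATR using only pandas. It assumes Wilder-style exponential smoothing (`alpha = 1 / period`), which is one common convention; other libraries seed the average differently and will give slightly different values. The function names are illustrative.

```python
import pandas as pd

def momentum(close, period=10):
    """Fractional price change over `period` rows."""
    return close.pct_change(period)

def rsi(close, period=14):
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0)          # up moves, zero otherwise
    loss = (-delta).clip(lower=0)       # down moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

def atr(high, low, close, period=14):
    """Average True Range: smoothed maximum of the three true-range candidates."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    return true_range.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()

# Example usage (after the ingestion cell):
# coffee['rsi14'] = rsi(coffee['close'])
# coffee['atr14'] = atr(coffee['high'], coffee['low'], coffee['close'])
```

All three functions use only past and current rows, so they can be added as features for a next-step target without look-ahead.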

In [None]:
# Feature engineering examples
import numpy as np

def make_features(df, lags=(1,3,7,14,30)):  # tuple default avoids the shared mutable-list pitfall
    data = df.copy()
    data['log_close'] = np.log(data['close'])
    for l in lags:
        data[f'lag_{l}'] = data['log_close'].shift(l)
    # Rolling statistics; windows count rows (trading days), not calendar days
    data['roll7_mean'] = data['log_close'].rolling(7).mean()
    data['roll30_mean'] = data['log_close'].rolling(30).mean()
    data['roll7_std'] = data['log_close'].rolling(7).std()
    data = data.dropna()
    return data

# Example usage:
# feats = make_features(coffee)

## Modeling — Baseline (ARIMA / SARIMAX)
Use statsmodels SARIMAX as a strong statistical baseline. Fit on log prices; include exogenous regressors if available.

In [None]:
# ARIMA baseline (example)
# Note: despite its name, arima_backtest only fits the model; it does not evaluate out of sample.
from statsmodels.tsa.statespace.sarimax import SARIMAX

def arima_backtest(train_series, order=(1,1,1), seasonal_order=(0,0,0,0)):
    model = SARIMAX(train_series, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
    res = model.fit(disp=False)
    return res

# Example usage:
# res = arima_backtest(np.log(coffee['close']).dropna())
# print(res.summary())

## Modeling — Tree-based (XGBoost / LightGBM)
Create a supervised dataset with a sliding window over past observations; the target is the next-day or next-month price.
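
A minimal sketch of that sliding-window construction, assuming a single price series and a fixed forecast horizon in rows. `make_supervised` is a hypothetical helper: each row of `X` holds the current value and the `n_lags - 1` values before it, and `y` is the value `horizon` rows ahead.

```python
import numpy as np
import pandas as pd

def make_supervised(series, n_lags=5, horizon=1):
    """Build (X, y) where X[t] = values at t, t-1, ..., t-n_lags+1 and y[t] = value at t+horizon."""
    X = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(n_lags)})
    y = series.shift(-horizon).rename("target")
    # Drop rows where the window or the target runs off either end of the series.
    valid = X.notna().all(axis=1) & y.notna()
    return X[valid], y[valid]

# Toy series so the mapping is easy to read.
toy = pd.Series(np.arange(10, dtype=float))
X, y = make_supervised(toy, n_lags=3, horizon=1)

# Chronological split: never shuffle time series rows.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]
```

For a next-month target on daily data, `horizon` would be roughly the number of trading days in a month; that choice is left to the reader.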

In [None]:
# XGBoost example (supervised forecasting)
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import xgboost as xgb

def train_xgb(X, y):
    params = {'objective':'reg:squarederror', 'n_estimators':200, 'learning_rate':0.05, 'max_depth':4}
    model = xgb.XGBRegressor(**params)
    model.fit(X, y)
    return model

# Example usage (after building X,y):
# model = train_xgb(X_train, y_train)
# preds = model.predict(X_test)
# print('RMSE', np.sqrt(mean_squared_error(y_test, preds)))

## Modeling — Deep Learning (LSTM)
Use a scaled sequence dataset and a simple LSTM model for comparison. Useful for capturing non-linear temporal patterns.
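
One way to build the scaled sequence dataset is sketched below with plain NumPy. The scaling mirrors what `MinMaxScaler` does, but the statistics are computed on the training rows only so that no information from the test period leaks into training. Both helper names are illustrative, and the target is assumed to be the first feature column at the next time step.

```python
import numpy as np

def minmax_fit_transform(train, test):
    """Scale columns to [0, 1] using min/max from the training rows only."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (train - lo) / scale, (test - lo) / scale

def make_sequences(values, n_timesteps):
    """values: 2D array (time, features). Returns X (samples, timesteps, features) and y."""
    X, y = [], []
    for end in range(n_timesteps, len(values)):
        X.append(values[end - n_timesteps:end])  # window of past rows
        y.append(values[end, 0])                 # next-step value of feature 0
    return np.asarray(X), np.asarray(y)

# Toy data: 10 time steps, 2 features.
values = np.arange(20, dtype=float).reshape(10, 2)
train, test = values[:8], values[8:]
train_s, test_s = minmax_fit_transform(train, test)
X_seq, y_seq = make_sequences(train_s, n_timesteps=3)
print(X_seq.shape, y_seq.shape)
```

Scaled test values can fall outside [0, 1] when prices move beyond the training range, which is expected with this approach.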

In [None]:
# LSTM example (Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler

def build_lstm(n_timesteps, n_features):
    model = Sequential()
    model.add(LSTM(64, input_shape=(n_timesteps, n_features), return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model

# Note: prepare 3D arrays for LSTM (samples, timesteps, features)

## Evaluation & Backtesting
- Use rolling-origin evaluation (walk-forward)
- Report RMSE, MAE, MAPE for holdout windows
- Check economic metrics: P&L of simple hedging strategy using predictions
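
The metrics in the list can be computed as follows. `forecast_metrics` and `directional_pnl` are hypothetical helpers; the P&L function is a toy long/short rule (one unit long when the forecast is above the current price, one unit short otherwise) that ignores transaction costs, margin and contract rolls, so treat it as a sanity check rather than a hedging model. If the model predicts log prices, exponentiate both arrays before computing MAPE.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """RMSE and MAE in price units, MAPE in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / y_true)) * 100),
    }

def directional_pnl(prices, next_preds):
    """Per-step P&L of a one-unit position taken in the direction of the forecast.

    prices: observed prices p_0..p_n; next_preds: forecasts of p_1..p_n made one step earlier.
    """
    prices = np.asarray(prices, dtype=float)
    next_preds = np.asarray(next_preds, dtype=float)
    position = np.sign(next_preds - prices[:-1])  # +1 long, -1 short, 0 flat
    return position * np.diff(prices)

m = forecast_metrics([100, 200, 400], [110, 190, 400])
pnl = directional_pnl([100, 110, 105], [108, 100])
print(m, pnl.sum())
```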

In [None]:
# Walk-forward backtest skeleton
def walk_forward_forecast(X, y, model_fn, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    metrics = []
    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        model = model_fn(X_train, y_train)
        preds = model.predict(X_test)
        metrics.append(np.sqrt(mean_squared_error(y_test, preds)))
    return metrics

# Example usage:
# metrics = walk_forward_forecast(X, y, train_xgb)  # train_xgb already has the (X, y) -> model signature

## Deployment & Next steps
- Package pipeline with `mlflow` or `kedro`.
- Export model artifacts and scaler.
- Build a lightweight dashboard (Streamlit / Dash) to show real-time predictions and sell/hold recommendations.
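
For the artifact export step, a minimal standard-library sketch is shown below. It assumes the model and scaler are picklable Python objects; the file names, the `save_artifacts` / `load_artifacts` helpers and the metadata fields are all illustrative. Libraries such as XGBoost and Keras also provide their own save methods, which are usually preferable for long-term storage, and pickle files should only be loaded from trusted sources.

```python
import json
import pickle
from pathlib import Path

def save_artifacts(model, scaler, out_dir, metadata=None):
    """Write model.pkl, scaler.pkl and metadata.json into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(out / "scaler.pkl", "wb") as f:
        pickle.dump(scaler, f)
    (out / "metadata.json").write_text(json.dumps(metadata or {}, indent=2))
    return out

def load_artifacts(out_dir):
    """Read back the three files written by save_artifacts."""
    out = Path(out_dir)
    with open(out / "model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(out / "scaler.pkl", "rb") as f:
        scaler = pickle.load(f)
    metadata = json.loads((out / "metadata.json").read_text())
    return model, scaler, metadata

# Example usage (after training):
# save_artifacts(model, scaler, "artifacts/xgb_v1", {"ticker": "KC=F", "lags": [1, 3, 7, 14, 30]})
```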

---

### Appendix
- The notebook includes commented instructions to download ICO, Yahoo and Kaggle datasets. Replace file paths or uncomment download cells when running on a machine with internet access.