# Proof of Concept: Linking Climate Data with CFD Prices for Agricultural Commodities

**Objective:** Evaluate the feasibility of predicting CFD (contract for difference) price movements for agricultural commodities (e.g., cotton) from climate data covering the last 20 years, available either monthly or as daily records aggregated to monthly values. This notebook is a preliminary investigation that combines exploratory data analysis with a simple machine learning model.

**Outline:**
1. Importing Libraries
2. Data Acquisition and Loading
3. Data Preprocessing and Aggregation
4. Exploratory Data Analysis (EDA)
5. Feature Engineering (including creation of climatic indices and lag variables)
6. Modeling: Training a Simple Machine Learning Model (e.g., RandomForest)
7. Model Evaluation and Interpretation
8. Conclusion and Next Steps


## 1. Importing Libraries
Import necessary libraries for data analysis, visualization, and modeling.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# You can add more libraries as needed

%matplotlib inline

## 2. Data Acquisition and Loading
Load the climate and CFD price data. For this PoC, you can use synthetic datasets or CSV samples.

*Example files:*
- `climate_data.csv`: Daily or monthly climate variables (temperature, precipitation, extreme events, etc.)
- `price_data.csv`: Historical CFD price series (e.g., cotton) on a monthly basis

In [None]:
# Example: Loading data (replace with your actual file paths or datasets)
climate_df = pd.read_csv('climate_data.csv', parse_dates=['date'])
price_df = pd.read_csv('price_data.csv', parse_dates=['date'])
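
If no sample CSV files are at hand, the synthetic option mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper name `make_synthetic_data`, the date range, the distributions, and the starting price of 70. The price series is a random walk that is deliberately unrelated to the climate series, so it is only useful for checking that the rest of the pipeline runs, not for drawing conclusions.

```python
import numpy as np
import pandas as pd

def make_synthetic_data(start="2004-01-01", end="2023-12-31", seed=0):
    """Return (climate_df, price_df) with the column names used in this notebook."""
    rng = np.random.default_rng(seed)

    # Daily climate series with a yearly temperature cycle plus noise
    days = pd.date_range(start, end, freq="D")
    day_of_year = days.dayofyear.to_numpy()
    climate_df = pd.DataFrame({
        "date": days,
        "temperature": 18 + 10 * np.sin(2 * np.pi * day_of_year / 365.25)
                       + rng.normal(0, 3, len(days)),
        "precipitation": rng.gamma(shape=0.4, scale=6.0, size=len(days)),
        # Binary indicator: roughly 1% of days have a hail event
        "hail": (rng.random(len(days)) < 0.01).astype(int),
    })

    # Monthly price series: a geometric random walk, independent of the climate
    months = pd.date_range(start, end, freq="MS")
    log_returns = rng.normal(0, 0.05, len(months))
    price_df = pd.DataFrame({
        "date": months,
        "price": 70 * np.exp(np.cumsum(log_returns)),
    })
    return climate_df, price_df

# Use these in place of the CSV-based DataFrames when no files are available
climate_df, price_df = make_synthetic_data()
```

Note that the synthetic prices are dated on the first day of each month, which is a common convention but not necessarily the one used by a real price file.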

## 3. Data Preprocessing and Aggregation
Convert the date columns to datetime (if not already done), handle missing values and outliers, aggregate daily data into monthly metrics (e.g., averages, sums, or counts of extreme events), and merge both datasets using the `date` column.

In [None]:
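# Example: Aggregating daily climate data to monthly data
# ('M' labels each month by its last day; pandas >= 2.2 prefers the alias 'ME')
climate_monthly = climate_df.resample('M', on='date').agg({
    'temperature': 'mean',
    'precipitation': 'sum',
    # Assuming 'hail' is a binary indicator, the sum counts the days with hail events
    'hail': 'sum'
}).reset_index()

# Merge climate and price data on the calendar month rather than the exact date:
# the month-end labels produced by resample() only match price_df if its dates
# are month-end dates too (assumes one price row per month)
climate_keyed = climate_monthly.drop(columns='date').assign(
    month=climate_monthly['date'].dt.to_period('M'))
price_keyed = price_df.assign(month=price_df['date'].dt.to_period('M'))
data = pd.merge(price_keyed, climate_keyed, on='month', how='inner').drop(columns='month')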
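
The description above also mentions missing values and outliers, which the cell does not handle. The following is a minimal sketch of one way to treat the daily temperature column before aggregation. The helper name `clean_daily_climate`, the three-day gap limit, and the four-standard-deviation clipping threshold are all assumptions chosen for illustration. Precipitation and hail are left untouched, because filling them in would require a decision about what an unobserved day means.

```python
import numpy as np
import pandas as pd

def clean_daily_climate(df, max_gap_days=3, z_limit=4.0):
    """Fill short temperature gaps and clip extreme temperature values."""
    # Put the rows on a complete daily calendar so missing days show up as NaN
    # (assumes at most one row per date)
    out = df.sort_values("date").set_index("date").asfreq("D")

    # Linear interpolation between valid neighbours, filling at most
    # max_gap_days consecutive NaNs per gap
    temp = out["temperature"].interpolate(limit=max_gap_days, limit_area="inside")

    # Clip values further than z_limit standard deviations from the mean
    mean, std = temp.mean(), temp.std()
    out["temperature"] = temp.clip(mean - z_limit * std, mean + z_limit * std)
    return out.reset_index()

# Small demonstration with one missing day (2020-01-05)
demo = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "temperature": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
    "precipitation": [0.0, 2.5, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0],
    "hail": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
}).drop(index=4)
cleaned = clean_daily_climate(demo)
print(cleaned)
```

In the notebook, `climate_df = clean_daily_climate(climate_df)` would be called before the monthly aggregation.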

## 4. Exploratory Data Analysis (EDA)
Visualize the time series, examine variable distributions, and compute a correlation matrix to identify preliminary relationships.

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(data['date'], data['price'], label='CFD Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('CFD Price Over Time')
plt.legend()
plt.show()

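# Calculate and print the correlation matrix
# (numeric_only=True skips the non-numeric date column, which recent pandas versions no longer drop silently)
corr_matrix = data.corr(numeric_only=True)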
print(corr_matrix)
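
A same-month correlation matrix can hide delayed effects, since weather in one month may only show up in prices some months later. As a rough sketch, the hypothetical helper `lagged_correlations` below correlates monthly price returns with a climate variable shifted by 0 to `max_lag` months. The helper name, the use of percentage returns, and the demonstration data (where the link is built in at a lag of two months) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def lagged_correlations(df, feature, target="price", max_lag=6):
    """Pearson correlation between target returns and `feature` shifted by 0..max_lag rows."""
    returns = df[target].pct_change()
    return pd.Series(
        {lag: returns.corr(df[feature].shift(lag)) for lag in range(max_lag + 1)},
        name=f"corr({target} return, lagged {feature})",
    )

# Demonstration data: returns follow the signal from two rows earlier
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
returns = 0.02 * np.roll(signal, 2) + rng.normal(0, 0.001, n)
returns[:2] = 0.0  # np.roll wraps around, so neutralise the first two rows
demo = pd.DataFrame({
    "precipitation": signal,
    "price": 100 * np.cumprod(1 + returns),
})

lags = lagged_correlations(demo, "precipitation")
print(lags)
```

On the notebook's own data the call would be `lagged_correlations(data, 'precipitation')`, assuming `data` is sorted by date. Testing many lags and variables inflates the chance of a spurious hit, so any strong value is a lead to check rather than a finding.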

## 5. Feature Engineering
Create new features such as drought indices, counts of extreme events, or lag variables. For example, create a one-month lag for precipitation.

In [None]:
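# Sort chronologically so that shift(1) refers to the previous month
data = data.sort_values('date').reset_index(drop=True)
data['precipitation_lag1'] = data['precipitation'].shift(1)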
# Remove rows with NA values resulting from the shift
data = data.dropna()
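
The other feature types mentioned above could be sketched as follows. The helper `add_climate_features`, its column names, and the default lags and window are assumptions. The anomaly is a simple standardized deviation from the calendar-month average, not an established drought index such as SPI. It is computed from the full sample here, which leaks future information into earlier rows; in a proper backtest the averages would be estimated on the training window only.

```python
import numpy as np
import pandas as pd

def add_climate_features(df, lags=(1, 2, 3), window=3):
    """Add lagged values, a rolling precipitation total, and a precipitation anomaly."""
    out = df.sort_values("date").reset_index(drop=True)

    # Lagged copies of the monthly climate variables
    for col in ("temperature", "precipitation"):
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)

    # Rolling precipitation total over `window` months, a crude dryness proxy
    out[f"precip_sum_{window}m"] = out["precipitation"].rolling(window).sum()

    # Deviation from the average of the same calendar month, in standard deviations
    # (full-sample statistics: see the leakage caveat in the text)
    by_month = out.groupby(out["date"].dt.month)["precipitation"]
    out["precip_anomaly"] = (
        (out["precipitation"] - by_month.transform("mean")) / by_month.transform("std")
    )
    return out

# Demonstration on four years of made-up monthly data
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=48, freq="MS"),
    "temperature": rng.normal(20, 5, 48),
    "precipitation": rng.gamma(2.0, 30.0, 48),
})
featured = add_climate_features(demo)
print(featured.head())
```

The first rows contain NaN values from the lags and the rolling window, so a `dropna()` is still needed before modeling.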

## 6. Modeling: Training a Simple Machine Learning Model
Use a simple model (e.g., RandomForest) to predict the CFD price using the climate variables as predictors.

In [None]:
# Define features and target
features = ['temperature', 'precipitation', 'hail', 'precipitation_lag1']
target = 'price'

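# Use TimeSeriesSplit for temporal cross-validation: each fold trains on the
# earliest rows and tests on the block that follows (data must be sorted by date).
# After the loop, model, test_data, y_test and predictions refer to the last fold.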
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(data):
    train_data, test_data = data.iloc[train_index], data.iloc[test_index]
    
    X_train = train_data[features]
    y_train = train_data[target]
    X_test = test_data[features]
    y_test = test_data[target]
    
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    mae = mean_absolute_error(y_test, predictions)
    print(f'RMSE: {rmse:.2f}, MAE: {mae:.2f}')
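
RMSE and MAE values are hard to judge without a reference point. One common reference for price series is a persistence forecast, which predicts each month's price as the previous month's price. The sketch below compares the two per fold. The helper name `compare_with_persistence` and the demonstration data (random-walk prices with unrelated climate columns) are assumptions for illustration. Persistence uses the previous observed price, which the climate-only model does not see, so this is a demanding benchmark rather than a like-for-like comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

def compare_with_persistence(df, features, target="price", n_splits=5):
    """Per-fold MAE of a random forest versus a previous-value (persistence) forecast."""
    prev_value = df[target].shift(1)  # previous row's target, used as the naive forecast
    rows = []
    splitter = TimeSeriesSplit(n_splits=n_splits)
    for fold, (train_idx, test_idx) in enumerate(splitter.split(df), start=1):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(train[features], train[target])
        rows.append({
            "fold": fold,
            "mae_model": mean_absolute_error(test[target], model.predict(test[features])),
            "mae_persistence": mean_absolute_error(test[target], prev_value.iloc[test_idx]),
        })
    return pd.DataFrame(rows)

# Demonstration data: random-walk prices, climate columns carry no information
rng = np.random.default_rng(2)
n = 120
demo = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=n, freq="MS"),
    "temperature": rng.normal(20, 5, n),
    "precipitation": rng.gamma(2.0, 30.0, n),
    "price": 70 * np.exp(np.cumsum(rng.normal(0, 0.05, n))),
})
scores = compare_with_persistence(demo, ["temperature", "precipitation"])
print(scores)
```

If the climate-based model cannot beat persistence on real data, a reasonable next experiment is to predict returns (price changes) instead of price levels, which matches the stated objective of predicting price movements.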

## 7. Model Evaluation and Interpretation
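Visualize the predictions against the actual values for the last cross-validation fold (the cell below reuses `test_data`, `y_test`, `predictions`, and `model` as left by the final loop iteration), and analyze feature importance to understand how much the model relies on each climate variable.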

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(test_data['date'], y_test, label='Actual Price')
plt.plot(test_data['date'], predictions, label='Predicted Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Actual vs Predicted CFD Prices')
plt.legend()
plt.show()

# Display feature importances
importances = model.feature_importances_
for feat, imp in zip(features, importances):
    print(f'{feat}: {imp:.2f}')
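
The `feature_importances_` attribute of a random forest is computed from the training data and can overstate features the model has merely memorized. A complementary check is permutation importance on held-out data, available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below wraps it in a hypothetical helper, `held_out_importance`; the helper name, the MAE-based scoring, and the two-feature demonstration data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def held_out_importance(model, X_test, y_test, n_repeats=20, seed=0):
    """Mean increase in MAE when each feature is shuffled, largest first."""
    result = permutation_importance(
        model, X_test, y_test,
        scoring="neg_mean_absolute_error",
        n_repeats=n_repeats,
        random_state=seed,
    )
    return pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)

# Demonstration data: the target depends on 'signal' only
rng = np.random.default_rng(3)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3.0 * X["signal"] + rng.normal(0, 0.1, 200)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X.iloc[:150], y.iloc[:150])

ranking = held_out_importance(demo_model, X.iloc[150:], y.iloc[150:])
print(ranking)
```

In the notebook, the equivalent call would be `held_out_importance(model, X_test, y_test)`, which evaluates the model from the last fold on that fold's test block.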

## 8. Conclusion and Next Steps
Summarize your findings and the model's predictive capability. Discuss limitations and potential improvements (e.g., incorporating more climatic variables, using hybrid models or LSTM networks). Outline the next steps to transition this PoC into a production-level model, such as integrating streaming data and continuously updating the model.

# End of Notebook

This PoC serves as a starting point to evaluate whether incorporating climate data can improve the prediction of agricultural CFD prices. Further research and model refinement will be necessary to address challenges such as data granularity, external economic factors, and model adaptability over time.