**Capstone 4 - "Fibre" (EUR/USD) Forex Analysis**

**Background**

The EUR/USD (Euro/US Dollar) currency pair is one of the most widely traded pairs in the foreign exchange (forex) market, representing the exchange rate between the Euro, the official currency of the Eurozone, and the US Dollar, the official currency of the United States. The popularity of this pair stems from the fact that it involves two of the largest economies in the world, the Eurozone and the United States, making it a key indicator of global economic health and investor sentiment.

The dataset provided contains financial and trading-related data, including features such as **Trades**, **Gross Profit**, **Gross Loss**, **% Change**, **Profit Factor**, **Winners**, and **Net Profit**. These are key performance indicators (KPIs) typically used to evaluate the success of trading strategies over a period of time. The dataset combines summary statistics for each trading strategy with profit metrics, which provide insight into the profitability and risk of the trades.
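
As a minimal illustration of how these KPIs relate to one another, the sketch below computes them from a hypothetical list of per-trade results. The trade values and the helper name `summarize_trades` are assumptions made for illustration; they are not taken from the dataset, and FX Blue's exact definitions may differ in detail.

```python
def summarize_trades(pnl):
    """Compute common trading KPIs from a list of per-trade profit/loss values."""
    gross_profit = sum(p for p in pnl if p > 0)   # sum of winning trades
    gross_loss = sum(p for p in pnl if p < 0)     # sum of losing trades (negative)
    winners = sum(1 for p in pnl if p > 0)        # number of winning trades
    return {
        "Trades": len(pnl),
        "Winners": winners,
        "Gross profit": gross_profit,
        "Gross loss": gross_loss,
        "Net profit": gross_profit + gross_loss,
        # Profit factor: gross profit divided by the absolute gross loss
        "Profit factor": gross_profit / abs(gross_loss) if gross_loss else float("inf"),
    }

# Hypothetical trade results in account currency (assumption, not real data)
kpis = summarize_trades([120.0, -40.0, 60.0, -20.0, 30.0])
print(kpis)
```

Under this convention, Net Profit is simply the sum of Gross Profit and Gross Loss, which is worth keeping in mind when choosing which columns to use as predictors.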

Given the financial nature of the dataset, it offers an opportunity to apply machine learning methods to derive valuable insights and predictive capabilities. Predictive modeling can aid in identifying trends, minimizing risks, and optimizing decision-making processes in financial trading strategies.

**Objective**

The primary objective of this project is to apply various machine learning methods to analyze and predict financial performance metrics, such as Net Profit, based on features like Trades, Gross Profit, Gross Loss, % Change, and others. Specifically, the focus will be on the following:

1. **Prediction of Financial Outcomes**: Using supervised learning techniques such as **Regression Analysis** (e.g., Linear Regression, K-Nearest Neighbors, Neural Networks) to predict key financial metrics like Net Profit based on historical data.

2. **Exploration of Relationships Between Variables**: Utilize visualizations such as **pairplots** to explore the relationships between features and their impact on profit outcomes. This will help identify key drivers of profitability in the dataset.

3. **Improvement of Financial Forecasting**: Explore advanced machine learning models, including **Neural Networks**, to improve the accuracy of predictions, uncover hidden patterns, and provide actionable insights for optimizing trading strategies.

The overall goal is to enhance predictive accuracy, which can lead to better decision-making, risk management, and profitability in trading and financial strategy optimization.

**Origin of Data**

The dataset was downloaded from www.fxblue.com. I collected the data between 8 Oct 2023 and 4 Aug 2024.

Each "Magic No." identifies an individual Expert Advisor (EA).

The dataset contains 200 rows and 40 columns.

In [None]:
# Importing basic libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

**Loading the Data**

In [None]:
# Load the dataset
data = pd.read_csv('/kaggle/input/forex2023/Forex.csv')

# Display the first few rows of the dataset
data.head()

**EDA and Data Preparation**

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
sns.pairplot(data,hue='Net profit')

In [None]:
corrdf = data.corr(numeric_only=True)  # skip any non-numeric columns
plt.figure(figsize=(40,30))
sns.heatmap(corrdf, annot=True, cmap="coolwarm")

**Remove some of the columns**

In [None]:
# Define the target variable (e.g., 'Net profit')
target = 'Net profit'

# Drop columns that are not used in the analysis (note: 'Net profit' is still present in `features` after this step)
features = data.drop(columns=['Average loser length (hours)', 'Average winner length (hours)', 'Avg pips per trade', 'Net pips', 'Magic', 'Opening balance', 'Closing balance', 'Avg lots','Valley (cash)', 'Valley (pips)','Gross profit', 'Gross loss', '% change', 'Avg win', 'Avg loss', 'Longest (hours)', 'Shortest (hours)', 'Lots traded', 'Consec winners', 'Consec losses', 'Consec profit', 'Consec loss','Winners', 'Losers'])

# Convert all features to numeric if needed (useful if there are categorical variables)
features = features.apply(pd.to_numeric, errors='coerce')

# Drop any rows with missing values (optional, based on your data)
features = features.dropna()

In [None]:
features.head()

In [None]:
features.describe()

In [None]:
features.info()

In [None]:
sns.pairplot(features,hue='Net profit')

In [None]:
corrdf = features.corr()
plt.figure(figsize=(30,20))
sns.heatmap(corrdf, annot=True, cmap="coolwarm")

In [None]:
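# Split into training and testing sets; drop the target from X to avoid leakage
# and take y from `features` so rows stay aligned after dropna()
X_train, X_test, y_train, y_test = train_test_split(
    features.drop(columns=[target]), features[target], test_size=0.2, random_state=42)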

**Linear Regression Model**

In [None]:
# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Get the model's coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

In [None]:
# Make predictions on the test set
y_pred_lr = model.predict(X_test)

In [None]:
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred_lr)
mse = mean_squared_error(y_test, y_pred_lr)
r2 = r2_score(y_test, y_pred_lr)

# Print evaluation results
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

In [None]:
# Coefficients of the model
coefficients = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': model.coef_})

# Display the coefficients
print(coefficients)

In [None]:
# Plotting actual vs predicted values
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_lr, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Linear Regression Model - Actual vs Predicted Values')
plt.show()

**Decision Tree and Random Forest Regressor**

In [None]:
# Initialize the Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)

# Train the model on the training data
dt_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_dt = dt_model.predict(X_test)

In [None]:
# Calculate evaluation metrics for Decision Tree
mae_dt = mean_absolute_error(y_test, y_pred_dt)
mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

# Print evaluation results
print(f"Decision Tree - Mean Absolute Error (MAE): {mae_dt}")
print(f"Decision Tree - Mean Squared Error (MSE): {mse_dt}")
print(f"Decision Tree - R² Score: {r2_dt}")

In [None]:
# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)


In [None]:
# Calculate evaluation metrics for Random Forest
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Print evaluation results
print(f"Random Forest - Mean Absolute Error (MAE): {mae_rf}")
print(f"Random Forest - Mean Squared Error (MSE): {mse_rf}")
print(f"Random Forest - R² Score: {r2_rf}")

In [None]:
# Plotting actual vs predicted values for Random Forest
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_rf, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Random Forest - Actual vs Predicted Values')
plt.show()

**Additional Library Imports**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
# Define the target variable (dependent variable)
target2 = 'Net profit'

# Define the features (independent variables)
features2 = ['Trades', 'Gross profit', 'Gross loss', '% change', 'Profit factor', 'Winners']

# Create feature matrix (X) and target vector (y)
X = data[features2]
y = data[target2]

In [None]:
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transform to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Neural Network Model**

In [None]:
# Initialize the Neural Network model
model = Sequential()

# Add input layer (input_dim should match the number of features)
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))

# Add hidden layers
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))

# Add output layer (since this is regression, no activation function is used here)
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=1)

In [None]:
# Make predictions on the testing data
y_pred_nn = model.predict(X_test)

In [None]:
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_nn)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_nn)

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred_nn)

# Display the evaluation metrics
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

In [None]:
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

In [None]:
# Scatter plot of actual vs predicted values
plt.scatter(y_test, y_pred_nn)
plt.xlabel('Actual Net Profit')
plt.ylabel('Predicted Net Profit')
plt.title('Neural Network - Actual vs Predicted Net Profit')
plt.show()

**KNN Regressor Model**

In [None]:
# Initialize the KNN Regressor model with a specific number of neighbors (e.g., 5)
knn_model = KNeighborsRegressor(n_neighbors=5)

# Train the model on the training data
knn_model.fit(X_train, y_train)

In [None]:
# Make predictions on the testing data
y_pred_knn = knn_model.predict(X_test)

In [None]:
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_knn)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_knn)

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred_knn)

# Display the evaluation metrics
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

In [None]:
# Example: Tuning the number of neighbors
for k in range(1, 11):
    knn_model = KNeighborsRegressor(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    r2 = r2_score(y_test, y_pred)  # score the predictions of the current k
    print(f"Number of Neighbors: {k}, R-squared: {r2}")

In [None]:
# Scatter plot of actual vs predicted values for the k=5 model
plt.scatter(y_test, y_pred_knn)
plt.xlabel('Actual Net Profit')
plt.ylabel('Predicted Net Profit')
plt.title('KNN Regressor Model - Actual vs Predicted Net Profit')
plt.show()

In [None]:
# Evaluation results dictionary to store MSE and R² for each model
results = {
    'Model': ['KNN Regressor', 'Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Neural Network Regressor'],
    'MSE': [],
    'R²': []
}

In [None]:
# Function to evaluate and store the results in the dictionary
def evaluate_model(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    results['MSE'].append(mse)
    results['R²'].append(r2)

In [None]:
# Evaluate each model
evaluate_model('KNN Regressor', y_test, y_pred_knn)
evaluate_model('Linear Regression', y_test, y_pred_lr)
evaluate_model('Decision Tree Regressor', y_test, y_pred_dt)
evaluate_model('Random Forest Regressor', y_test, y_pred_rf)
evaluate_model('Neural Network Regressor', y_test, y_pred_nn)

In [None]:
# Convert results dictionary to DataFrame for easier plotting
results_df = pd.DataFrame(results)

In [None]:
### Bar Chart for R² ###
plt.figure(figsize=(10, 6))
plt.barh(results_df['Model'], results_df['R²'], color=['blue', 'green', 'orange', 'red', 'purple'])
plt.xlabel('R-squared (R²)')
plt.title('Model Comparison - R²')
plt.show()

**Challenges Faced and Problem-Solving Approach**

**1. Data Preparation Challenges**

The data covers nearly one year of trading, and collection is still ongoing. Features were standardized using StandardScaler for models like KNN and Neural Networks, which are sensitive to feature scaling.
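
One way to keep the scaling step consistent is to wrap the scaler and the model in a single scikit-learn pipeline, so that the scaler is fitted on the training split only. The sketch below uses synthetic data as a stand-in for the Forex features; the array shapes, the scales, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
# Target depends on the first two features; the third (largest scale) is pure noise
y_demo = X_demo[:, 0] + X_demo[:, 1] / 100.0 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only and reuses it at predict time
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)
print("Scaled KNN R²:", knn_pipe.score(X_te, y_te))

# Same model without scaling, for comparison
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
print("Unscaled KNN R²:", knn_raw.score(X_te, y_te))
```

Without scaling, the distance computation is dominated by the feature with the largest numeric range, which is why distance-based models such as KNN tend to benefit from standardization.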

**2. Model Selection and Tuning**

**Challenge:** 
Choosing the appropriate machine learning models and finding the right hyperparameters to ensure accurate predictions.

**Approach:**
Selected a diverse set of models, each representing a different type of machine learning algorithm (e.g., Linear Regression, KNN, Decision Tree, Neural Network, and Random Forest).
Used GridSearchCV for hyperparameter tuning to optimize models like KNN, Decision Tree, and Random Forest.
Carefully tuned hyperparameters such as n_neighbors in KNN and max_depth in Decision Tree and Random Forest.
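
A grid search of this kind could look roughly like the sketch below. It runs on synthetic data generated with `make_regression` rather than the Forex dataset, and the parameter ranges are illustrative assumptions, not the values actually used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Forex features (200 rows, 6 features)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled independently
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

# Hypothetical search space; the ranges are illustrative, not tuned values
param_grid = {
    "knn__n_neighbors": list(range(1, 11)),
    "knn__weights": ["uniform", "distance"],
}

# 5-fold cross-validation, scored by R²
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated R²:", search.best_score_)
```

Compared with scoring each candidate on a single test split, cross-validation gives a less noisy estimate on a dataset of only a few hundred rows.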

**3. Model Comparison and Evaluation**

**Challenge:** 
Comparing different models with various strengths and weaknesses required careful evaluation using appropriate metrics.

**Approach:**
Used Mean Squared Error (MSE) and R-squared (R²) to compare the models’ performances. These metrics were chosen because they provide insight into prediction error and the percentage of variance explained by the model.
Visualized the results using bar charts to clearly highlight the differences in model performance.
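
To make the two metrics concrete, the short sketch below computes MSE and R² by hand with NumPy. The actual and predicted values are hypothetical numbers chosen for easy arithmetic, not results from the models above.

```python
import numpy as np

# Hypothetical actual and predicted values (assumption, not model output)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse}, R²: {r2}")
```

Here every prediction is off by 10, so the MSE is 100, and the residual sum of squares (400) is small relative to the total sum of squares (12500), giving an R² of 0.968.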

**4. Neural Network Configuration**

**Challenge:**
Configuring a neural network that could effectively learn from the data was more complex compared to traditional models.

**Approach:**
Designed a feed-forward neural network using Keras, with three hidden layers using ReLU activations and a linear output layer, as is standard for regression tasks.
Experimented with the number of neurons, layers, and epochs to balance model complexity with performance.
Used StandardScaler to ensure that the input data was scaled appropriately for the neural network.

**5. Handling Non-Linearities and Complex Patterns**

**Challenge:**
Some models like Linear Regression might not capture complex relationships in the data.

**Approach:**
Incorporated models like Random Forest and Neural Networks that can handle non-linear patterns and interactions between features.
Decision Trees and Random Forest can automatically capture non-linearities, while the Neural Network model can learn more complex patterns with multiple layers.
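
The difference can be illustrated on a deliberately non-linear toy problem. The sketch below fits both model types to synthetic quadratic data; the data and variable names are assumptions for illustration and say nothing about how the models rank on the Forex dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a purely non-linear (quadratic) relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A straight line cannot follow the parabola; the forest approximates it piecewise
lin = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("Linear Regression R²:", lin.score(X_te, y_te))
print("Random Forest R²:", forest.score(X_te, y_te))
```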

**6. Computational Cost**

**Challenge:** 
Training models like Random Forest and Neural Networks on larger datasets can be computationally expensive and time-consuming.

**Approach:**
Used a sample of the dataset during the initial model training and testing phase to reduce computation time.
Optimized the number of trees and depth in the Random Forest model to balance performance with speed.
Limited the number of epochs for Neural Networks during hyperparameter tuning, then fine-tuned the model using a higher number of epochs once the architecture was optimized.

**7. Model Interpretability**

**Challenge:** 
Interpreting complex models like Neural Networks and Random Forest can be difficult compared to more straightforward models like Linear Regression.

**Approach:**
For Random Forest, used feature importance scores to understand which features contributed the most to the model’s predictions.
For Neural Networks, explored the use of advanced techniques such as SHAP (SHapley Additive exPlanations) for better model interpretability, though this was left as a potential next step.
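
Reading feature importance scores from a fitted Random Forest might look like the sketch below. It uses synthetic data with placeholder column names (`feat_a`, `feat_b`, `feat_c`), which are assumptions for illustration and not columns of the Forex dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the column names are placeholders, not dataset columns
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"])
# Target depends strongly on feat_a, weakly on feat_b, and not at all on feat_c
y_demo = 5.0 * X_demo["feat_a"] + 0.5 * X_demo["feat_b"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances sum to 1 and can be misleading when features are strongly correlated; `sklearn.inspection.permutation_importance` is a commonly used alternative in that case.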

**Recommendations for Improving Model Performance and Robustness**

**1. Enhancing Data Quality**

**Feature Engineering**

- Explore feature interactions that might be relevant, especially for non-linear models like Random Forest or Neural Networks.
- Use domain knowledge to identify potentially useful features that may not be directly available in the raw dataset.
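
As a small sketch of what such engineered features could look like, the example below derives a win rate and one interaction term with pandas. The rows are hypothetical values, and the derived column names (`Win rate`, `Win rate x Profit factor`) are illustrative assumptions rather than features used in this project.

```python
import pandas as pd

# Hypothetical rows using column names from the dataset's feature list
demo = pd.DataFrame({
    "Trades": [50, 120, 80],
    "Winners": [30, 60, 56],
    "Profit factor": [1.8, 0.9, 2.4],
})

# Derived feature: share of trades that were winners
demo["Win rate"] = demo["Winners"] / demo["Trades"]

# Interaction feature: win rate combined with profit factor
demo["Win rate x Profit factor"] = demo["Win rate"] * demo["Profit factor"]

print(demo)
```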

**2. Model Improvement**

**Ensemble Learning**

- Use Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost) to further boost performance, as these algorithms often outperform individual models on structured/tabular data.
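
A gradient boosting model can be tried with the same fit/predict workflow as the other regressors. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, LightGBM, or CatBoost so that no extra packages are needed; the synthetic data and the hyperparameter values are assumptions, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Forex features (200 rows, 6 features)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Illustrative hyperparameters: many shallow trees with a small learning rate
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
gbr.fit(X_tr, y_tr)

y_pred_gbr = gbr.predict(X_te)
print("Gradient Boosting R²:", r2_score(y_te, y_pred_gbr))
```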

**Neural Network Architecture Optimization**

Deep learning models can be optimized further with proper architecture design and tuning.

- Experiment with different neural network architectures (e.g., more layers, wider layers, different activation functions).
- Implement dropout layers and batch normalization to prevent overfitting and speed up training.
- Tune hyperparameters like the learning rate, batch size, and number of neurons in each layer using a tool like Keras Tuner or Optuna.
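
A network with dropout and batch normalization could be structured along the lines of the sketch below. The helper name `build_regularized_model`, the layer sizes, the dropout rate, and the learning rate are all illustrative assumptions, and the model is trained on synthetic data rather than the Forex features.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_regularized_model(n_features, dropout_rate=0.2, learning_rate=1e-3):
    """Feed-forward regressor with batch normalization and dropout (illustrative sizes)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),   # normalizes activations between layers
        layers.Dropout(dropout_rate),  # randomly drops units during training only
        layers.Dense(32, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(1),               # linear output for regression
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mean_squared_error",
    )
    return model

# Synthetic stand-in for six scaled features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)).astype("float32")
y_demo = X_demo.sum(axis=1)

nn = build_regularized_model(n_features=6)
history = nn.fit(X_demo, y_demo, epochs=5, validation_split=0.2, verbose=0)
print("Final validation loss:", history.history["val_loss"][-1])
```

Exposing the dropout rate and learning rate as function arguments makes it straightforward to hand the builder to a tuning tool later.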


