# Data Preprocessing
## Introduction
This notebook loads the raw data, performs initial data cleaning, and writes the result to `processed_data.csv`.

### Folder Structure & File Handling
To make this process replicable, create two folders in the `data` directory: one titled `raw` and the other titled `processed`.
The loading function reads the data from the `raw` folder. The raw file is listed in `.gitignore` so that it is not tracked in version control. With this layout in place, the file paths in this script and the rest of the notebooks should work without modification.
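
The folder layout described above can be created programmatically. The following is a minimal sketch, not part of the original notebook; the helper name `ensure_data_dirs` and its `project_root` argument are hypothetical.

```python
from pathlib import Path


def ensure_data_dirs(project_root):
    """Create data/raw and data/processed under project_root if they are missing.

    Hypothetical helper: the name and signature are illustrative only.
    """
    root = Path(project_root)
    raw_dir = root / "data" / "raw"
    processed_dir = root / "data" / "processed"
    for folder in (raw_dir, processed_dir):
        # parents=True also creates the data/ folder; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
    return raw_dir, processed_dir
```

Assuming the notebooks sit one level below the project root (as the `../data/...` paths used later suggest), calling `ensure_data_dirs("..")` from a notebook would set up both folders.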

### Data Cleaning 
This initial script checks for null values in both of the raw columns provided: `signal` and `equity_curve`. Both columns are first coerced to numeric, so non-numeric entries are counted as nulls, and rows containing nulls are dropped. To measure how well the signals perform, a third column, `equity_returns`, is created at this stage. For each day it holds the return from that day's equity value to the next day's, expressed as a fraction (0.01 means 1%).
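
To make the next-day return concrete, here is a small worked sketch. The three equity values are made up purely so the arithmetic is easy to follow; only the `pct_change().shift(-1)` pattern mirrors the function used in this notebook.

```python
import pandas as pd

# Hypothetical equity values, chosen only for easy arithmetic.
toy = pd.DataFrame({"equity_curve": [100.0, 110.0, 99.0]})

# pct_change() gives the return from the previous row to the current row;
# shift(-1) moves each value up one row, so row t holds the return from t to t+1.
toy["equity_returns"] = toy["equity_curve"].pct_change().shift(-1)

print(toy)
```

The first row comes out at roughly 0.10 (100 to 110), the second at roughly -0.10 (110 to 99), and the last row is `NaN` because there is no following day.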

### Avoiding Look-Ahead Bias
We are aware of look-ahead bias, and equally aware that the `equity_returns` column is needed to calculate trading statistics in a simple, repeatable way. Because each value in that column looks one day ahead, no model or testing strategy will use or have access to `equity_returns` in any decision-making process.
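
One way to enforce this separation is to split the forward-looking column off before any data reaches a model. The sketch below is an illustration under assumptions, not the approach taken in the later notebooks; `OUTCOME_COLUMNS` and `split_inputs_and_outcomes` are hypothetical names.

```python
import pandas as pd

# Columns that contain next-day information and must never feed a decision.
# Hypothetical constant, introduced here for illustration.
OUTCOME_COLUMNS = ["equity_returns"]


def split_inputs_and_outcomes(data):
    """Return (inputs, outcomes): inputs exclude every forward-looking column."""
    inputs = data.drop(columns=OUTCOME_COLUMNS)
    outcomes = data[OUTCOME_COLUMNS]
    return inputs, outcomes
```

A model or strategy would then receive only `inputs`, while `outcomes` is reserved for computing statistics after decisions have been made.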

### Removing Incomplete Records
The script also removes the final record in the data, as it will not have a corresponding next day's equity value for comparison.

### Saving Processed Data
The final part of this script saves the preprocessed data to a CSV file in the `data/processed/` folder for other notebooks to ingest and use.
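
As a quick check that the saved file can be read back unchanged, the following sketch writes a small frame and reloads it. The values are made up, and a temporary directory stands in for `data/processed/` so the example can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows with the same three columns the processed file contains.
frame = pd.DataFrame({
    "signal": [1.0, -1.0],
    "equity_curve": [100.0, 101.5],
    "equity_returns": [0.015, -0.02],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed_data.csv"
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read back.
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```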

In [None]:
import pandas as pd

In [None]:
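def load_and_preprocess_data(file_path):
    """Load the raw CSV, report and drop nulls, and add next-day returns."""
    data = pd.read_csv(file_path)

    # Coerce both raw columns to numeric; unparseable entries become NaN.
    data['signal'] = pd.to_numeric(data['signal'], errors='coerce')
    data['equity_curve'] = pd.to_numeric(data['equity_curve'], errors='coerce')

    null_signals = data['signal'].isna().sum()
    null_equity_curve = data['equity_curve'].isna().sum()

    print(f"Number of null values in 'signal': {null_signals}")
    print(f"Number of null values in 'equity_curve': {null_equity_curve}")

    # Drop rows with nulls; copy so the new column is added to an independent frame.
    data = data.dropna().copy()

    # Return from the current row to the next row (a fraction, not a percent).
    data['equity_returns'] = data['equity_curve'].pct_change().shift(-1)

    # The final row has no next-day value, so its return is NaN;
    # dropna removes exactly that one record.
    data = data.dropna()

    return data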

In [None]:
file_path = "../data/raw/sovereign_quant_developer_assignment_data.csv"
data = load_and_preprocess_data(file_path)

processed_file_path = "../data/processed/processed_data.csv"
data.to_csv(processed_file_path, index=False)

data.info()