# Methodology and Thought Process

This document summarizes how I approached the problem, my thinking at each step, and the rationale behind certain choices. I believe it is more efficient to present a summary before delving into the detailed code and accompanying explanations.

I chose to break down the problem into four parts:
1. **Data Preprocessing**
2. **Understanding the Signal Meaning**
3. **Developing and Testing the Trading Strategy**
4. **Optimizing the Trading Strategy**

## Data Preprocessing

This step was fairly straightforward. After briefly inspecting the CSV file in Excel, I imported the dataset into Python. There are two columns, one titled "signal" and a second titled "equity curve."
   - This second column title threw me for a bit of a loop at first glance. I know "equity curve" to mean the return of a trading strategy over time, while the prompt in the email refers to it as "equity." Because the prompt asks to maximize the returns of the equity column, we treated it as an equity column. For practical purposes the two could be the same thing, so this is just a minor side point.

Next, I converted the signal and equity curve columns to numeric, with any value that could not be converted set to null. We then checked for null values in the data; there were none.
   - Had there been null values, for the purposes of this assignment we would simply have dropped the record. Had this been actual financial data, we would have spent some time deciding whether to smooth the record, remove it, or otherwise fill in a value so as not to have an incomplete dataset.
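
The conversion and null check described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook code: the column names `signal` and `equity_curve`, the helper `coerce_numeric`, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample standing in for the CSV; values are made up.
raw = pd.DataFrame({
    "signal": ["0.12", "-0.40", "n/a", "0.05"],
    "equity_curve": ["100.0", "101.5", "102.0", "bad"],
})

def coerce_numeric(df, columns):
    """Convert the given columns to numeric; unparseable entries become NaN."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

clean = coerce_numeric(raw, ["signal", "equity_curve"])

# Count nulls per column before deciding how to handle them.
null_counts = clean.isna().sum()

# Assignment-level choice: drop incomplete records.
clean = clean.dropna().reset_index(drop=True)
```

On the real dataset `null_counts` would be zero for both columns, so the `dropna` call would be a no-op.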

Finally, we created the equity returns column, which measures the percentage change in equity. It is critical both for determining the meaning of the signal and for tracking the performance of our trading strategy. We dropped the last row of the dataset because it has no value to compare against for the equity returns column.
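
One way to build such a column is sketched below. It assumes the return on each row is the percentage change from that row's equity value to the next row's, which is consistent with the last row having nothing to compare against; the function name `add_forward_returns` and the column names are assumptions.

```python
import pandas as pd

def add_forward_returns(df, price_col="equity_curve"):
    """Add the percentage change from each row to the next (assumed definition)."""
    out = df.copy()
    # Return from this row to the next row: next / current - 1.
    out["equity_returns"] = out[price_col].shift(-1) / out[price_col] - 1
    # The last row has no next value, so its return is NaN; drop it.
    return out.iloc[:-1].reset_index(drop=True)

# Hypothetical three-row example.
demo = pd.DataFrame({
    "signal": [0.1, -0.2, 0.3],
    "equity_curve": [100.0, 110.0, 99.0],
})
with_returns = add_forward_returns(demo)
```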

We also split the data into training and testing datasets for analysis. A note here: I chose to do this at the beginning of each notebook rather than once at preprocessing with the result exported to CSV and reloaded, because this gives me a little more flexibility. You will notice that separating the data into train and test datasets is the first step in most notebooks. All analysis was done on the training dataset, which consists of the first 80% of the data in sequential order.
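
A sequential split of this kind might look like the sketch below. The helper name `sequential_split` and the `train_frac` parameter are assumptions; the key point is that rows are not shuffled, so the test set always comes after the training set.

```python
import pandas as pd

def sequential_split(df, train_frac=0.8):
    """Split without shuffling: the first train_frac of rows train, the rest test."""
    cut = int(len(df) * train_frac)
    return df.iloc[:cut].copy(), df.iloc[cut:].copy()

# Hypothetical ten-row dataset.
demo = pd.DataFrame({"signal": range(10)})
train, test = sequential_split(demo)
```

Because `train_frac` is a parameter, the same helper also supports resizing the test set later on.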

## Understanding The Signal

I started my initial analysis of the signal column by viewing its characteristics. The values were positive and negative numbers between -1 and 1, which made me think that the signal values were logarithmic in nature, especially since that is very common in financial data. My first inclination was to explore the distribution of the signals further, both as a whole and separated into positive and negative values. Both produced bell curve distributions.

Knowing that the values followed a bell curve between -1 and 1 led me to my next goal: understanding how the direction of the signal (positive or negative) relates to equity returns. This would be critical in determining the meaning of the signals, specifically whether negative signals indicate that something is cheap or indicate a negative bias and negative future returns.

To do this, I decided to run a t-test on both the positive and negative signals, with the null hypothesis that positive signals do not lead to positive returns and vice versa for the negative signals.
   - This was the first major breakthrough in understanding the signals; both p-values indicated statistical significance, rejecting the null hypothesis.
   - Therefore, we were able to conclude that positive signals are indicative of positive returns and negative signals are indicative of negative returns.
   - For formulating our trading strategy, we now can begin to create an idea of what we are doing: we will be buying positive signals and selling negative signals, a step in the right direction.
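
A directional test along these lines can be sketched with SciPy. This is one possible setup, not necessarily the exact test used in the notebook: it runs a one-sided, one-sample t-test on the returns that follow positive signals (mean greater than zero) and on those that follow negative signals (mean less than zero). The `alternative` argument requires SciPy 1.6 or later, and the data here is synthetic.

```python
import numpy as np
from scipy import stats

def directional_t_tests(signal, returns):
    """One-sided, one-sample t-tests of mean return by signal direction."""
    signal = np.asarray(signal, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # H0: mean return after a positive signal is <= 0.
    pos = stats.ttest_1samp(returns[signal > 0], popmean=0.0, alternative="greater")
    # H0: mean return after a negative signal is >= 0.
    neg = stats.ttest_1samp(returns[signal < 0], popmean=0.0, alternative="less")
    return {
        "positive": (pos.statistic, pos.pvalue),
        "negative": (neg.statistic, neg.pvalue),
    }

# Synthetic data in which returns lean in the direction of the signal.
rng = np.random.default_rng(0)
sig = rng.uniform(-1, 1, 2000)
ret = 0.01 * sig + rng.normal(0, 0.01, 2000)
results = directional_t_tests(sig, ret)
```

A small p-value on both sides is what supports treating positive signals as buys and negative signals as sells.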

Continuing on, now that we know which direction to buy and sell with regard to the signal direction, we need to understand what the magnitude of the signal means. I felt there were two obvious ways to do this, and I ended up incorporating both, but I'd like to outline them first.
   1. Take the standard deviation of the positive and negative signals, look at the signal values that are more extreme than one standard deviation on both the negative and positive sides, and analyze their returns in the training set.
   2. Perform a quantile analysis, putting the signal values into 10 bins (deciles) to get a more magnified view of the signals and of which signals behave alike. The added benefit is that it also lets you look at the standard deviation and at how signals more extreme than one standard deviation on either side are performing.

When I calculated the standard deviations for both positive and negative signals, they confirmed some of our earlier observations: both sets of signals follow similar bell curve distributions, and their standard deviations are relatively similar.

Next, we created a quantile analysis that broke the signal data into 10 roughly equal-weighted bins, from 0 to 9, with 0 having the most extreme negative values and 9 having the most extreme positive values. We also created a group summary table that provides summary statistics of the signal, equity returns, and cumulative returns of the training data sorted by quantile. The findings were quite interesting.
   1. As the t-test suggested, negative signals were associated with negative returns, and positive signals were associated with positive returns. Given our strategy would be selling negative signals, negative signals having negative returns would generate positive trading results because we would be selling those signals (assuming we can sell short).
   2. We noticed that the more extreme quantiles produce the largest returns. The two most extreme quantiles, 0 and 9, provide the best cumulative returns in the dataset (-1.795054 and 2.041303, respectively).
   3. There is a lot of noise in the data, specifically in quantiles 3-6, the more central quantiles.
   4. The quantiles which have a max signal value above one standard deviation all provide positive returns, which leads us to believe that the standard deviation provides a good basis for determining which quantiles are worthwhile to trade.
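
A group summary table of this shape could be produced roughly as follows. This is a sketch on synthetic data: the column names are assumptions, and cumulative return is taken here as a simple sum of per-row returns, which may differ from how the notebook accumulates them.

```python
import numpy as np
import pandas as pd

def quantile_summary(df, n_bins=10):
    """Bin signals into n_bins equal-count quantiles and summarize each bin."""
    out = df.copy()
    # labels=False yields integer bin ids: 0 = most negative, n_bins - 1 = most positive.
    out["quantile"] = pd.qcut(out["signal"], q=n_bins, labels=False)
    summary = out.groupby("quantile").agg(
        count=("signal", "size"),
        signal_min=("signal", "min"),
        signal_max=("signal", "max"),
        mean_return=("equity_returns", "mean"),
        # Assumption: cumulative return as a plain sum of row returns.
        cumulative_return=("equity_returns", "sum"),
    )
    return out, summary

# Synthetic bell-shaped signals with returns that lean in the signal direction.
rng = np.random.default_rng(1)
signal = np.clip(rng.normal(0, 0.3, 1000), -1, 1)
demo = pd.DataFrame({
    "signal": signal,
    "equity_returns": 0.01 * signal + rng.normal(0, 0.01, 1000),
})
binned, summary = quantile_summary(demo)
```

Comparing `signal_max` (or `signal_min` on the negative side) against the standard deviation of the signals shows which bins sit beyond one standard deviation.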

The final exploration piece we investigated was the win percentage. This can be extremely valuable because it gives us insight into what the magnitude of the numbers actually means when combined with the cumulative returns of the quantiles.
   - The win percentages are surprisingly low. The highest win percentage is in quantile 9, but it is only 51.65%. Interestingly, quantile 0 has the lowest win percentage but the second-highest returns.
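
Win percentage per quantile can be computed as in the sketch below. It assumes a "win" means the return has the same sign as the signal (a long on a positive signal gains, or a short on a negative signal gains); the column names and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def win_percentage_by_quantile(df):
    """Share of rows per quantile where the return moved in the signal's direction."""
    direction = np.sign(df["signal"])
    # Positive product means the trade in the signal's direction made money.
    won = (direction * df["equity_returns"]) > 0
    return won.groupby(df["quantile"]).mean() * 100

# Hypothetical rows already labeled with a quantile.
demo = pd.DataFrame({
    "quantile": [0, 0, 0, 0, 9, 9],
    "signal": [-0.90, -0.80, -0.70, -0.85, 0.80, 0.90],
    "equity_returns": [-0.01, 0.02, -0.03, -0.02, 0.03, -0.01],
})
win_pct = win_percentage_by_quantile(demo)
```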

### How does this all frame our understanding of the signal?
1. The t-test gives us an indication of the direction of the signal, with positive signals indicating future positive returns and negative signals indicating future negative returns.

2. Contrary to initial expectations, the magnitude of the signal does not correspond to an increased win percentage. Instead, it correlates with the size of the returns.
   - This means that the more extreme the magnitude of the signal, the more significant the movement in the direction of the signal. In other words, larger signals are more likely to be associated with larger moves in the market, either positive or negative.
   - The reasoning behind this deduction is that if the more extreme quantiles are not winning significantly more than the other quantiles but have significantly larger returns (sometimes double), they must be winning bigger.
    
## Trading Strategy

With a firm understanding of the signal, constructing a trading strategy became much simpler. We begin with a few assumptions that frame how we approach the problem.

### Assumptions
1. We are looking to predict the move of the very next day, as represented by the equity returns column.
    - If the signal is negative, we expect the equity returns column to be negative.
2. We are not carrying positions.
    - So as not to have to build a complex backtesting framework, we are solely devising a strategy that predicts the direction of the equity returns column for each given record. This makes testing much easier and more streamlined.
3. Because we are not carrying positions, we are able to sell short. Negative signals will be sold, and if the equity returns for that record are negative, that constitutes a winning trade because our model predicted negative returns.
4. To avoid look-ahead bias, the strategy looks only at the signal value of each record to determine its quantile, using quantile boundaries derived from the training data.
5. We use a strategy that trades every signal as a baseline against which to compare our results. The baseline knows enough to sell every negative signal and buy every positive signal, but it does not differentiate between signals.
6. Trading fees are incredibly crippling, especially to the baseline strategy. To demonstrate performance, and because no trading fee amount was provided, we report results for three different fee levels.
    - The fees were charged as a percentage of the equity_curve column for every executed trade and were included in the returns.
    - The fee levels used were 0.0005, 0.0002, and no fees.
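
To make the baseline and fee assumptions concrete, here is a rough sketch. It treats the fee as a cost deducted, in return terms, on every row where a trade is executed, which is one reading of "a percentage of the equity_curve column"; the names `baseline_positions`, `net_trade_returns`, and `FEE_RATES` are hypothetical.

```python
import numpy as np
import pandas as pd

# The three fee levels listed above.
FEE_RATES = {"0.0005": 0.0005, "0.0002": 0.0002, "no fees": 0.0}

def baseline_positions(signal):
    """Baseline: buy (+1) every positive signal, sell (-1) every negative one."""
    return np.sign(signal).astype(int)

def net_trade_returns(position, equity_returns, fee_rate):
    """Per-row strategy return after fees (assumed fee treatment)."""
    gross = position * equity_returns
    # A fee is charged only on rows where a trade is actually taken.
    fees = fee_rate * position.abs()
    return gross - fees

# Hypothetical rows: one long, one short, one with no trade.
signal = pd.Series([0.4, -0.6, 0.0])
returns = pd.Series([0.01, -0.02, 0.03])
position = baseline_positions(signal)
net = net_trade_returns(position, returns, FEE_RATES["0.0005"])
```

Because the baseline pays the fee on every non-zero signal, even a small rate accumulates quickly, which is why fees hit it hardest.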

### Constructing Our Trading Strategy

To construct our trading strategy, we first split our dataset into training and test datasets, and then make a copy of the test dataset for our baseline.
    - As mentioned before, we made our baseline somewhat smart: it knows to trade in the correct direction of the signal, but it trades every signal.
    - This lets us demonstrate whether our more selective model provides better results than simply trading every signal in its direction.

We now must construct our quantiles from the training data set. 
    - A specific note here is that because we observe a bell curve distribution, it gives us insight that as long as we have a sufficient sample size, our observation in the training dataset should apply to the test dataset. 

Once we have the max and min values for the different quantiles from the training dataset, we construct the bins that we feed to the method that assigns trading signals to the test data. We use the standard deviation to determine which bins to assign as buy and sell quantiles. Any quantile with a max greater than one standard deviation in either direction is selected to be traded. This also eliminates any trader bias in determining what to buy or sell.

The method that assigns trading signals looks only at the signal value for each row in the test dataset and assigns it a quantile based on that value. If the assigned quantile is one of the buy or sell quantiles, the row receives the corresponding trading signal. The default signal is 0, meaning stay away.
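
The bin construction and assignment can be sketched as below. Several details are assumptions: the standard deviations are computed separately for positive and negative training signals, a bin is a buy bin if its upper edge exceeds the positive standard deviation, a bin is a sell bin if its lower edge is below the negated negative standard deviation, and the outer edges are opened to infinity so that out-of-range test signals still land in the extreme bins. The function names are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_bins(train_signal, n_bins=10):
    """Derive quantile edges and buy/sell bins from training signals only."""
    _, edges = pd.qcut(train_signal, q=n_bins, labels=False, retbins=True)
    pos_std = train_signal[train_signal > 0].std()
    neg_std = train_signal[train_signal < 0].std()
    lower, upper = edges[:-1], edges[1:]
    buy_bins = [i for i in range(n_bins) if upper[i] > pos_std]
    sell_bins = [i for i in range(n_bins) if lower[i] < -neg_std]
    # Open the outer edges so unseen extremes fall into the end bins.
    open_edges = edges.copy()
    open_edges[0], open_edges[-1] = -np.inf, np.inf
    return open_edges, buy_bins, sell_bins

def assign_positions(test_signal, edges, buy_bins, sell_bins):
    """Map each test signal to a bin, then to +1 (buy), -1 (sell), or 0 (stay away)."""
    bins = pd.cut(test_signal, bins=edges, labels=False)
    values = np.where(bins.isin(buy_bins), 1, np.where(bins.isin(sell_bins), -1, 0))
    return bins, pd.Series(values, index=test_signal.index)

# Synthetic bell-shaped training signals and three hypothetical test signals.
rng = np.random.default_rng(2)
train_signal = pd.Series(np.clip(rng.normal(0, 0.3, 1000), -1, 1))
edges, buy_bins, sell_bins = fit_bins(train_signal)
test_signal = pd.Series([0.95, -0.95, 0.0])
bins, positions = assign_positions(test_signal, edges, buy_bins, sell_bins)
```

Nothing from the test data enters `fit_bins`, which is what keeps the assignment free of look-ahead bias.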

We then calculate the results. We created a similar chart to our training data set for understanding the signal to view the outcomes. 
This gives us an overview of what each quantile has done but does not calculate the overall strategy results. 

Next, we compared trading statistics against our baseline strategy. We provided multiple scenarios with different fee structures because fees were so important in determining the overall success of the strategy. We chose Total Return, Volatility, Sharpe Ratio, Max Drawdown, Total Trades, and Win Percentage. We felt that these statistics gave us a strong understanding of the strategy's performance and a sound basis for comparison.
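
These statistics can be computed from a series of per-row strategy returns roughly as follows. The conventions are assumptions rather than the notebook's definitions: returns are compounded, the risk-free rate is zero, volatility is the per-period standard deviation, and the Sharpe ratio is annualized with 252 periods per year.

```python
import numpy as np
import pandas as pd

def performance_stats(trade_returns, positions, periods_per_year=252):
    """Summary statistics for a series of per-row strategy returns."""
    traded = positions != 0
    equity = (1 + trade_returns).cumprod()       # compounded growth of 1 unit
    drawdown = equity / equity.cummax() - 1      # drop from the running peak
    vol = trade_returns.std()
    if vol > 0:
        sharpe = np.sqrt(periods_per_year) * trade_returns.mean() / vol
    else:
        sharpe = np.nan
    return {
        "total_return": equity.iloc[-1] - 1,
        "volatility": vol,
        "sharpe_ratio": sharpe,
        "max_drawdown": drawdown.min(),
        "total_trades": int(traded.sum()),
        "win_pct": 100 * (trade_returns[traded] > 0).mean(),
    }

# Hypothetical four-row example: two longs, one short, one skipped row.
positions = pd.Series([1, -1, 0, 1])
trade_returns = pd.Series([0.10, -0.05, 0.0, 0.02])
stats_out = performance_stats(trade_returns, positions)
```

Running the same function on the baseline returns and the strategy returns, once per fee level, gives directly comparable rows.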

We also built the trading strategy to be dynamic, so we were able to increase and decrease the size of the test dataset. Due to the bell curve distribution, we did not expect the quantiles to change much with less testing data. This also allowed us to see how the concept of our strategy performed with more data.

## Optimization

With the trading strategy trading every signal within the quantiles it deems profitable, it made sense to look for ways to further improve it. We approached this as a basic classification problem: given a strategy that is profitable trading every signal, could an ML model improve on it by deciding whether or not to take each trade?

To set this up, we first needed to decide what features to engineer for the ML classification models. We can use what we know about the signals to formulate features that are more likely to help us correctly classify trade versus no trade.
    - We know the signals are not mean reverting, so mean-reverting features would most likely not be helpful.
    - The signals have already been used to sort the data. We will only be running the models on records that have signals in our trading buckets.
    - Our analysis showed that positive and negative signals behave differently, so we will treat them differently. Therefore, we will have to train and test our models on the positive and negative quantile datasets separately.
    - We determined that the signals indicate a likelihood of a large move in their signal direction, so momentum-based features would be a viable place to start.
    - Since all the signals we will be looking at are relatively similar, looking at the equity curve and its past history seems a prudent place to start.
    - The best win percentage we observed was below 52%, so we don't need the models to be perfect, just statistically better than selecting every record.

The features I looked at were mostly tied to the past performance of the equity curve values: simple moving average, exponential moving average, their associated crossovers, RSI, MACD, and of course the signal value. This seemed like a viable place to look because these features are more momentum based. One obvious omission that I would address if I had more time is a dedicated momentum feature.
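
Features of this kind can be derived from the equity curve with pandas alone, as in the sketch below. The window lengths (12, 26, 9, 14) are conventional defaults chosen for illustration, not values taken from the notebook, and the RSI here uses simple rolling averages of gains and losses rather than Wilder's smoothing. Every feature uses only current and past rows.

```python
import pandas as pd

def add_features(df, price_col="equity_curve", fast=12, slow=26,
                 macd_signal=9, rsi_window=14):
    """Add moving-average, crossover, MACD, and RSI features (assumed windows)."""
    out = df.copy()
    price = out[price_col]
    out["sma_fast"] = price.rolling(fast).mean()
    out["sma_slow"] = price.rolling(slow).mean()
    out["ema_fast"] = price.ewm(span=fast, adjust=False).mean()
    out["ema_slow"] = price.ewm(span=slow, adjust=False).mean()
    # Crossover flags: 1 when the fast average is above the slow one.
    out["sma_cross"] = (out["sma_fast"] > out["sma_slow"]).astype(int)
    out["ema_cross"] = (out["ema_fast"] > out["ema_slow"]).astype(int)
    out["macd"] = out["ema_fast"] - out["ema_slow"]
    out["macd_signal"] = out["macd"].ewm(span=macd_signal, adjust=False).mean()
    # RSI from rolling mean gain and rolling mean loss.
    delta = price.diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta).clip(lower=0).rolling(rsi_window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out

# Hypothetical steadily rising equity curve.
demo = pd.DataFrame({"equity_curve": pd.Series(range(1, 61), dtype=float)})
feats = add_features(demo)
```

The early rows of the rolling features are NaN until each window fills, so they would need to be dropped or handled before model training.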

I admit I made a mistake in my process and, due to time, was not able to go back and resolve it. The negative side had the lowest win percentage while still producing some of the highest returns. It would have been the right place to begin, as it had the lowest bar for improvement and would most likely have provided the best returns for the time invested. That being said, I started with the positive side and did not have the time I expected to also look at the negative side.

I looked at four machine learning classification models: logistic regression, decision tree, random forest, and SVM. I set this up so that I could do some trial and error on which features to select. I included the logistic regression feature coefficients to see their impact on the model; due to time, that is all I included. To structure the classification, I created a target variable called profitable_trade, set to 1 if the trade was profitable and 0 if not. I then had the models see whether they could improve on the strategy of trading every signal. (Remember, these are the quantiles where we just trade every signal.)

My goal was to find a model whose correct decisions on when to trade and when not to trade (the true positives and true negatives from the confusion matrix) would add up to a number that was larger, by a statistically significant margin, than the number of winning trades from selecting all of the values.

I created a method to check the statistical significance of my findings. For nearly all of the models there was an improvement, but it was not statistically significant, so we did not add them to the strategy.
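
One possible form for such a check is a one-sided z-test on proportions, sketched below using only the standard library. This is an assumption about the approach, not the method from the notebook: it asks whether the model's share of correct calls (true positives plus true negatives) is significantly higher than the win rate obtained by trading every record. The function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def improvement_test(correct_calls, total_records, baseline_win_rate, alpha=0.05):
    """One-sided z-test: is the model's accuracy above the trade-everything win rate?"""
    accuracy = correct_calls / total_records
    # Standard error under the null that accuracy equals the baseline win rate.
    se = sqrt(baseline_win_rate * (1 - baseline_win_rate) / total_records)
    z = (accuracy - baseline_win_rate) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

# Hypothetical counts: 110 correct calls out of 200 against a 51.65% baseline.
z_small, p_small, sig_small = improvement_test(110, 200, 0.5165)

# A much larger, clearly significant hypothetical improvement.
z_big, p_big, sig_big = improvement_test(700, 1000, 0.5165)
```

In the first case accuracy is 55%, above the baseline, but the sample is too small for the difference to be significant, which mirrors the pattern described above.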

## Conclusion

We were able to find a model that beat the baseline model every time, including when trading fees were removed. The Sharpe ratios were extremely high in all cases. The statistics for our strategy were better than those of the baseline model in every case.

Some thoughts on the process: the strategy we employed is rather simple. We took a strong understanding of what the signals indicated and built a simple strategy that was easy to implement and test. I often feel that in trading, the simple strategies have the most merit.

Due to time constraints, I wasn't able to optimize the negative signals, nor was I able to test the optimization from the SVM model that improved prediction for the quantile 9 group. My goal with that notebook was to at least demonstrate some feature engineering beyond what was needed for the trading strategy, along with some classification model work and an explanation of my thought process.

Some areas for future exploration would be 





