# Methodology and Thought Process

This serves as a summary of how I approached the problem, my thought process at each step, and why I made certain choices. I find it much easier to start with a basic summary before reading through blocks of code interleaved with blocks of explanation and analysis.

I chose to break the problem into four parts: data preprocessing, understanding the meaning of the signal, developing and testing the trading strategy, and then further optimizing the trading strategy.

## Data Preprocessing

This step was rather straightforward. After briefly inspecting the CSV in Excel, I chose to import the data set into Python. There are two columns, one titled signal and a second titled equity curve.
    - The second column title threw me for a bit of a loop at first glance: I know an equity curve to be the return of a trading strategy over time, while the prompt in the email refers to the column as equity. Because the prompt asks to maximize the returns of the equity column, we treated it as an equity column. For better or worse, the two could basically be the same thing. Just a minor side point.

At this point in data preprocessing, I convert the data in the equity curve and signal columns to numeric, with any records that cannot be converted set to a null value. We then check for null values in the data; there are none.
    - Had there been null values, we chose for the purposes of this assignment to just drop the record. Had this been actual financial data, we would have spent some time determining whether to smooth the record, remove it, or otherwise create a value so as not to have an incomplete data set.
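
The conversion and null check described above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual code; the column names `signal` and `equity_curve` and the helper `coerce_numeric` are assumptions.

```python
import pandas as pd

# Toy frame standing in for the CSV; the column names are assumed.
raw = pd.DataFrame({
    "signal": ["0.12", "-0.40", "bad", "0.05"],
    "equity_curve": ["100.0", "101.5", "102.0", "n/a"],
})

def coerce_numeric(df, columns):
    """Convert columns to numeric; entries that cannot be parsed become NaN."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

clean = coerce_numeric(raw, ["signal", "equity_curve"])
null_counts = clean.isna().sum()               # number of nulls per column
clean = clean.dropna().reset_index(drop=True)  # drop any incomplete records
```

On the real data set the null counts were all zero, so the `dropna` step would be a no-op there.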

Finally, we create the equity returns column, which measures the percentage change in equity. This will be critical for determining the meaning of the signal and for tracking the performance of our trading strategy. We drop the last row of the data set because it has no value to compare against for the equity returns column.
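
One way to build such a column is sketched below. It assumes the return on each row is the change from that row to the next one (which is why the last row has nothing to compare against); the column names and the helper `add_forward_returns` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "signal": [0.2, -0.1, 0.4, -0.3],
    "equity_curve": [100.0, 110.0, 99.0, 108.9],
})

def add_forward_returns(frame, price_col="equity_curve"):
    """Add the percentage change from each row to the next, then drop the last row."""
    out = frame.copy()
    # pct_change() compares each row with the previous one; shift(-1) moves
    # that value up one row so each row holds the *next* period's return
    # (an assumption about how the returns are aligned with the signal).
    out["equity_returns"] = out[price_col].pct_change().shift(-1)
    # The final row has no following value, so it is dropped.
    return out.iloc[:-1].copy()

df = add_forward_returns(df)
```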

We also split the data into training and testing data sets for analysis. A note here: I chose to do this at the beginning of each notebook, as opposed to doing it once during preprocessing, exporting to CSV, and picking that up at the start of each notebook. I feel this gives me a little more flexibility. You will notice that separating the data into train and test sets is the first step in most notebooks. All analysis was done on the training data, consisting of 80% of the data in sequential order.
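
A sequential split of this kind might look like the sketch below. It assumes the training set is the first 80% of rows and the test set is the remainder, with no shuffling; `sequential_split` is a hypothetical helper name.

```python
import pandas as pd

def sequential_split(frame, train_frac=0.8):
    """Split in row order: the first train_frac of rows train, the rest test."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut].copy(), frame.iloc[cut:].copy()

df = pd.DataFrame({"signal": range(10)})
train, test = sequential_split(df)
```

Keeping the rows in order matters for time-ordered data, since a shuffled split would let the training set see records that come after the test records.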

## Understanding The Signal

I started my analysis of the signal column by just viewing its characteristics. There were positive and negative numbers between -1 and 1, which made me think that the signal values were logarithmic in nature, especially since that is very common in financial data. My first inclination was to further explore the distribution of the signals, both as a whole and separated into positive and negative numbers. They produced bell curve distributions.

Knowing that the numbers were distributed in a bell curve and fell between -1 and 1 led me to my next goal: understanding the relationship between the direction of the signal (positive or negative) and equity returns. This would be critical in determining the meaning of the signals, namely whether a negative signal is indicative of something cheap or of a future negative bias/future negative returns.

To do this, I decided to run a t-test on both the positive and negative signals, with the null hypothesis that positive signals do not lead to positive returns, and vice versa for the negative signals.
    - This was the first major breakthrough in understanding the signals: both p-values indicated statistical significance, rejecting the null hypothesis.
    - From there we were able to conclude that positive signals are indicative of positive returns and negative signals are indicative of negative returns.
    - For formulating our trading strategy, we can now begin to form an idea of what we are doing: we will be buying positive signals and selling negative signals, a step in the right direction.
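
The exact form of the t-test is not spelled out here, so the following is only one plausible reading: a one-sided, one-sample t-test of the mean return against zero, run separately for positive and negative signals. The data is synthetic and exists only to make the sketch runnable.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: returns loosely follow the sign of the signal.
rng = np.random.default_rng(0)
signal = rng.uniform(-1, 1, size=2000)
returns = 0.01 * signal + rng.normal(0, 0.01, size=2000)

pos = returns[signal > 0]
neg = returns[signal < 0]

# H0: mean return after a positive signal is <= 0 (alternative: > 0).
t_pos, p_pos = stats.ttest_1samp(pos, popmean=0.0, alternative="greater")
# H0: mean return after a negative signal is >= 0 (alternative: < 0).
t_neg, p_neg = stats.ttest_1samp(neg, popmean=0.0, alternative="less")

print(f"positive signals: t={t_pos:.2f}, p={p_pos:.3g}")
print(f"negative signals: t={t_neg:.2f}, p={p_neg:.3g}")
```

A small p-value in both tests would support buying positive signals and selling negative ones, which is the conclusion drawn above.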

Continuing on, now that we know the direction of our buys and sells with regard to the signal direction, we need to understand what the magnitude of the numbers means. I felt there were two obvious ways to do this, and I actually ended up incorporating both, but I'd like to outline them first.
    1. Take the standard deviation of the positive and negative signals, look at the signal values that are more extreme than one standard deviation on both the negative and positive side, and analyze their returns in the training set.
    2. Run a quantile analysis, putting the signal values into 10 bins to get a more magnified view of the signals and of which signals are most associated with each other. The added benefit is that it also lets you bring in the standard deviation and see how signals more extreme than 1 standard deviation on either side are performing.

When I calculated the standard deviations for the positive and negative signals, they confirmed some of our earlier observations: both sets of signals follow similar bell curve distributions, and their standard deviations are relatively similar.

We next created a quantile analysis that broke the signal data into 10 roughly equally weighted bins, numbered 0 to 9, with 0 holding the most extreme negative values and 9 the most extreme positive values. We also created a group summary table that provides summary statistics of the signal, equity returns, and cumulative returns of the training data, sorted by quantile. The findings were quite interesting.
    1. As the t-test suggested, negative signals were associated with negative returns and positive signals with positive returns. Given that our strategy sells negative signals, negative returns on negative signals generate a positive trading result (assuming we are able to sell short).
    2. We notice that the more extreme quantiles are significantly profitable: the two most extreme quantiles, 0 and 9, provide the best cumulative returns in the dataset (-0.795054 and 2.041303, respectively).
    3. There is a noticeable amount of noise in the data, specifically in quantiles 3-6, the more central quantiles.
    4. The quantiles which have a max signal value above 1 standard deviation all provide positive returns, which leads us to believe that the standard deviation provides a good basis for determining which quantiles are worthwhile to trade.
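
A 10-bin quantile table like the one described can be sketched with `pd.qcut` and a grouped aggregation. The data here is synthetic, the column names are assumed, and "cumulative return" is taken to be a simple sum of per-row returns, which may differ from how the notebook computes it.

```python
import numpy as np
import pandas as pd

# Synthetic training data, for illustration only.
rng = np.random.default_rng(1)
train = pd.DataFrame({"signal": rng.uniform(-1, 1, size=1000)})
train["equity_returns"] = 0.01 * train["signal"] + rng.normal(0, 0.01, size=1000)

# Ten roughly equal-sized bins; 0 holds the most negative signals, 9 the most positive.
train["quantile"] = pd.qcut(train["signal"], q=10, labels=False)

summary = train.groupby("quantile").agg(
    signal_min=("signal", "min"),
    signal_max=("signal", "max"),
    mean_return=("equity_returns", "mean"),
    cumulative_return=("equity_returns", "sum"),  # assumption: simple sum
    count=("signal", "size"),
)
print(summary)
```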

The final piece we explored was the win percentage. This can be extremely valuable because, combined with the cumulative returns of the quantiles, it gives us insight into what the magnitude of the numbers actually means.
    - The win percentages are strangely low: the max win percentage is in quantile 9 but is only 51.65%. Interestingly, quantile 0 has the lowest win percentage but the second-highest returns.
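
To make the distinction between win percentage and cumulative return concrete, here is a small sketch on made-up numbers. A trade counts as a win when the return has the same sign as the signal; the column names are assumed.

```python
import numpy as np
import pandas as pd

# Made-up rows for two extreme quantiles, for illustration only.
frame = pd.DataFrame({
    "quantile":       [0, 0, 0, 0, 9, 9, 9, 9],
    "signal":         [-0.9, -0.8, -0.85, -0.95, 0.9, 0.8, 0.85, 0.95],
    "equity_returns": [-0.05, 0.01, 0.01, 0.01, 0.04, 0.02, -0.01, -0.01],
})

# Trading in the direction of the signal: shorts profit from negative returns.
frame["trade_return"] = np.sign(frame["signal"]) * frame["equity_returns"]
frame["win"] = frame["trade_return"] > 0

by_quantile = frame.groupby("quantile").agg(
    win_pct=("win", "mean"),
    cumulative_return=("trade_return", "sum"),
)
print(by_quantile)
```

In this toy data quantile 0 wins only one trade in four yet still ends up profitable, because the single win is much larger than the losses. That is the same pattern as a low win percentage alongside strong cumulative returns.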

#### How does this all frame our understanding of the signal?
    1. The t-test gives us an indication of the direction of the signal: positive indicates future positive returns and negative indicates future negative returns.

    2. Contrary to initial expectations, the magnitude of the signal does not correspond to an increased win percentage. Instead, it correlates with the size of the returns.
        - This means that the more extreme the magnitude of the signal, the more significant the movement in the direction of the signal. In other words, larger signals are more likely to be associated with larger moves in the market, either positive or negative.
    
## Trading Strategy

With a firm understanding of the signal, constructing a trading strategy became much simpler. We begin with a few assumptions that frame how we will approach the problem.

### Assumptions
1. We are looking to predict the move of the immediately following day, as represented by the equity returns column.
    - If the signal is negative, we expect the equity returns column to be negative.
2. We are not carrying positions.
    - So as not to have to build a complex backtesting framework, we are solely devising a strategy that looks to predict the direction of the equity returns column for a given record. This makes testing much easier and more streamlined.
3. Because we are not carrying positions, we are able to sell short. Negative signals will be sold, and if the equity returns for that record are negative, that constitutes a winning trade because our model predicted negative returns.
4. To avoid look-ahead bias, the strategy looks only at the signal to determine the quantile, using bins based on the training data.
5. We use a strategy that trades every signal as a baseline against which to compare our results. The baseline strategy knows enough to sell every negative signal and buy every positive signal, but it does not differentiate between signals.
6. Trading fees are incredibly crippling, especially to the baseline strategy. To demonstrate performance, and because no trading fee was provided, we provide results for 3 different trading fees.
    - The fees were a percentage of the equity_curve column for every executed trade and were applied to the returns.
    - The fees used were .0005, .0002, and no fees.
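
The effect of the three fee levels can be illustrated with the sketch below. It makes a simplifying assumption that may not match the notebooks exactly: the fee is treated as a fixed fraction deducted from the return of every executed trade, and rows with no trade earn nothing. The helper name and the sample numbers are hypothetical.

```python
import numpy as np

def net_trade_returns(trade_returns, positions, fee_rate):
    """Deduct a proportional fee from each executed trade (assumed fee model)."""
    trade_returns = np.asarray(trade_returns, dtype=float)
    traded = np.asarray(positions) != 0
    # Rows with position 0 are not traded: no return and no fee.
    return np.where(traded, trade_returns - fee_rate, 0.0)

gross = [0.004, -0.002, 0.001, 0.003]  # direction-adjusted returns (made up)
positions = [1, -1, 0, 1]              # 1 = buy, -1 = sell, 0 = stay away

for fee in (0.0005, 0.0002, 0.0):
    net = net_trade_returns(gross, positions, fee)
    print(f"fee={fee}: net total={net.sum():.4f}")
```

Because the fee is charged per trade, a strategy that trades every signal pays it far more often than a selective one, which is why the baseline suffers most.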

### Constructing Our Trading Strategy

To construct our trading strategy, we first split our dataset into a training and a test dataset, and then make a copy of the test dataset for our baseline.
    - As mentioned before, we made our baseline somewhat smart: it knows to trade in the correct direction of the signal, but it trades every signal.
    - This lets us check whether our more selective model provides better results than just trading every signal in its direction.
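
A baseline of this kind reduces to taking the sign of the signal as the position. The sketch below uses made-up rows and assumed column names.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "signal": [0.3, -0.2, 0.05, -0.7],
    "equity_returns": [0.01, 0.02, -0.01, -0.03],
})

# The baseline works on its own copy of the test data and trades every row:
# buy (+1) on a positive signal, sell (-1) on a negative one.
baseline = test.copy()
baseline["position"] = np.sign(baseline["signal"]).astype(int)
baseline["trade_return"] = baseline["position"] * baseline["equity_returns"]
```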

We now must construct our quantiles from the training data set. 
    - A specific note here: because we observe a bell curve distribution, we expect that, given a sufficient sample size, our observations in the training dataset should carry over to the test dataset.

Once we have the max and min for each quantile from the training dataset, we construct the bins that we feed to our method, which assigns trading signals to the test data. We use the standard deviation to determine which bins we assign as buy and sell quantiles. Any quantile with a max greater than 1 standard deviation in either direction is selected to be traded. This also eliminates any trader bias in determining what to buy or sell.

The method that assigns trading signals looks only at the signal value for each row in the test data set and assigns it a quantile based on that value. If the assigned quantile is one of the buy or sell quantiles, the row receives the corresponding trading signal. The default signal is 0, for stay away.
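
The bin construction and assignment steps can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's implementation: the function names are hypothetical, the standard deviations are computed separately for positive and negative training signals, and on the negative side "max" is read as the most extreme (lowest) edge of the bin.

```python
import numpy as np
import pandas as pd

def fit_bins(train_signal, q=10):
    """Learn quantile edges from the training signal only."""
    _, edges = pd.qcut(train_signal, q=q, retbins=True, labels=False)
    return np.asarray(edges, dtype=float)

def select_quantiles(train_signal, edges):
    """Pick bins whose extreme edge lies beyond 1 standard deviation."""
    pos_std = train_signal[train_signal > 0].std()
    neg_std = train_signal[train_signal < 0].std()
    n_bins = len(edges) - 1
    buy = [i for i in range(n_bins) if edges[i + 1] > pos_std]
    sell = [i for i in range(n_bins) if edges[i] < -neg_std]
    return buy, sell

def assign_trades(test_signal, edges, buy, sell):
    """Map each test signal to a training bin, then to 1 / -1 / 0."""
    # Only interior edges are used, so values outside the training range
    # fall into the first or last bin instead of being dropped.
    bins = np.digitize(test_signal, edges[1:-1], right=True)
    position = np.zeros(len(bins), dtype=int)  # default 0: stay away
    position[np.isin(bins, buy)] = 1
    position[np.isin(bins, sell)] = -1
    return bins, position

train_signal = pd.Series(np.linspace(-1, 1, 1001))  # toy training signal
edges = fit_bins(train_signal)
buy, sell = select_quantiles(train_signal, edges)
bins, position = assign_trades(np.array([-0.9, -0.5, 0.1, 0.5, 1.5]), edges, buy, sell)
```

Nothing from the test set is used to build the edges or choose the buy and sell bins, which is what keeps the assignment free of look-ahead bias.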

We then calculate the results. We created a chart similar to the one built for the training data set during the signal analysis to view the outcomes. This gives us an overview of what each quantile has done but does not calculate the overall strategy results.

Next we compared trading statistics against our baseline strategy. We provided multiple scenarios with different fee structures because fees were so important in determining the overall success of the strategy. We chose Total Return, Volatility, Sharpe Ratio, Max Drawdown, Total Trades, and Win Percentage. We felt that these statistics gave us a strong understanding of the strategy's performance and a sound basis for comparison.
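
For reference, the listed statistics could be computed along the lines of the sketch below. The conventions are assumptions, not necessarily those used in the notebooks: returns are compounded, the risk-free rate is zero, volatility and Sharpe are annualized with 252 periods per year, and `strategy_stats` is a hypothetical helper.

```python
import numpy as np

def strategy_stats(net_returns, positions, periods_per_year=252):
    """Summary statistics for a series of per-period strategy returns."""
    r = np.asarray(net_returns, dtype=float)
    traded = np.asarray(positions) != 0
    # Equity path starting from 1.0, so an initial loss counts as drawdown.
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])
    drawdown = equity / np.maximum.accumulate(equity) - 1.0
    vol = r.std(ddof=1)
    sharpe = np.sqrt(periods_per_year) * r.mean() / vol if vol > 0 else float("nan")
    return {
        "total_return": equity[-1] - 1.0,
        "volatility": vol * np.sqrt(periods_per_year),
        "sharpe": sharpe,  # assumes a zero risk-free rate
        "max_drawdown": drawdown.min(),
        "total_trades": int(traded.sum()),
        "win_pct": float((r[traded] > 0).mean()) if traded.any() else float("nan"),
    }

stats_out = strategy_stats([0.10, -0.50, 0.0, 0.20], [1, -1, 0, 1])
```

Running the same function on the baseline and on the selective strategy, once per fee level, gives a like-for-like comparison table.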

We also built the trading strategy to be dynamic, so we were able to increase and decrease the size of the test dataset. Due to the bell curve distribution, we did not expect the quantiles to change much with less testing data. This also allowed us to see how the concept of our strategy performed with more data.

## Optimization




