## Goal of this notebook
We know from early notebooks that three day momentum models offer the greatest predictive power, then the two day model, then the one day (base) model.

However, how can we be sure of the accuracy of these models, or that they generalize well enough?

In this notebook, we will test these models against each other on random subsets of data, to see which model is able to predict the data most accurately. Our hope, of course, is that the three day model will win out, since it provides th

If there is time, and if the two day model beats the three day model, it may be worth taking the time to build an aggregation of the two, where the three day model will use the two day model for predictions when there is a sequence for which we do not have enough data points (i.e. count_sequence < some tyhreshold). Of course, this may remedy itself when we add more data.

In [1]:
from stock_utils import *

In [2]:
df = pd.DataFrame()
df = df.from_csv('stock_data/nvda.csv')

relative_volumes = get_relative_volume(df, relative_period=relative_period)

two_day_movement_trends = get_trends(movement_categories, 2)
three_day_movement_trends = get_trends(movement_categories, 3)

movement_category_types = ['bd', 'sd', 'sg', 'bg']

## Function Definitions

In [3]:
def get_single_day_probabilities(movement_categories):
    movement_category_types = ['bd', 'sd', 'sg', 'bg']
    single_day_counts = []
    single_day_probabilities = []
    total = 0
    
    for cat in movement_category_types:
        count = count_movement_category(movement_categories, cat)
        single_day_counts.append(count)
        total += count
    
    for count in single_day_counts:
        single_day_probabilities.append(count/total)
        
    return single_day_probabilities

In [4]:
def get_probabilities_after_event(previous_event_category, trends, movement_categories):
    """
    Given an event that occured the previous day, return the probabilities of the next day's
    movement categories conditioned on said event.
    
    Arguments:
      previous_event_category -- The category of the event we observed the previous day 
                                 (or two days in the case of three day momentum)
      trends -- All two (or three) day trends that were observed for this event type.
      movement_categories -- all daily movement categories that were observed
    
    Returns:
      next_day_movement_probabilities -- Probabilities of each of the next day's categories
                                         conditioned on the previous event category
    """
    movement_category_types = ['bd', 'sd', 'sg', 'bg']
    next_day_movement_probabilities = []
    
    for next_day in movement_category_types:
        trend_name = previous_event_category + '_' + next_day
        trend_count = count_trends(trends, trend_name)
            
        trend_total = 0
        for category in movement_category_types:
            trend_total += count_trends(trends, previous_event_category + '_' + category)
            
        trend_prob = trend_count / trend_total
        next_day_movement_probabilities.append(trend_prob)
        
    return next_day_movement_probabilities

In [5]:
def select_data_sample(data, sample_size):
    ## We are going to omit the last sample_size elements, 
    ## so that  if we start the sample towards the end we won't run out of elements
    sub_sample = data[0:-sample_size]
    random_index = random.choice(list(enumerate(sub_sample)))[0]
    return data[random_index:random_index+sample_size]

In [6]:
def get_next_day_probability(probabilities_given_by_model, previous_days):
    if len(previous_days) == 2:
        index = category2index[previous_days[0]] * 4 + category2index[previous_days[1]]
    elif len(previous_days) == 1:
        index = category2index[previous_days[0]]
    elif len(previous_days) == 0:
        return probabilities_given_by_model
    else:
        raise ValueError('So far, only one to three day models are supported. Please set previous_days to a list of length 0 to 2')
    
    return probabilities_given_by_model[index]

In [7]:
def build_model_probabilities(movement_categories, trends, n_day_model):
    movement_category_types = ['bd', 'sd', 'sg', 'bg']
    
    ## Three day model
    if n_day_model == 3:
        three_day_probs = []
        for cat in movement_category_types:
            for cat2 in movement_category_types:
                three_day_probs.append(get_probabilities_after_event(cat + '_' + cat2, trends, movement_categories))
        return three_day_probs
    
    ## Two day model
    elif n_day_model == 2:
        two_day_probs = []
        for cat in movement_category_types:
            two_day_probs.append(get_probabilities_after_event(cat, trends, movement_categories))
        return two_day_probs
    
    ## One day model
    elif n_day_model == 1:
        one_day_probs = get_single_day_probabilities(movement_categories)
        return one_day_probs

    else:
        raise ValueError('So far, only one to three day models are supported. Please set n_day_model between 1 and 3')
        
    return

In [8]:
def random_sample_tests_m1_m2(movement_categories, m1_probs, m1_n_day_model, m2_probs, m2_n_day_model, sample_size=50, n_tries=10000):
    m1_wins = 0
    m2_wins = 0
    n_draws = 0
    
    for a in range(n_tries):
        sample = select_data_sample(movement_categories, sample_size)
        sample_data_probabilities = get_single_day_probabilities(sample)
        
        m1_round_score = 0
        m2_round_score = 0
        
        ## This is so models have enough data to "look back" and predict the following day
        n_lookback_days = max(m1_n_day_model, m2_n_day_model) - 1
        round_length = len(sample) - n_lookback_days 
        
        for i in range(round_length):
            if (n_lookback_days == 1):
                prev_day = sample[i]
                next_day = sample[i+1]
            elif (n_lookback_days == 2):
                day_before_last = sample[i]
                prev_day = sample[i+1]
                next_day = sample[i+2]
            else:
                raise ValueError('This function was meant to test one, two, and three day models against each other.')
                
            if (m1_n_day_model == 1):
                m1_next_day_probs = get_next_day_probability(m1_probs, [])
            elif (m1_n_day_model == 2):
                m1_next_day_probs = get_next_day_probability(m1_probs, [prev_day])
            elif (m1_n_day_model == 3):
                m1_next_day_probs = get_next_day_probability(m1_probs, [day_before_last, prev_day])
            
            if (m2_n_day_model == 1):
                m2_next_day_probs = get_next_day_probability(m2_probs, [])
            elif (m2_n_day_model == 2):
                m2_next_day_probs = get_next_day_probability(m2_probs, [prev_day])
            elif (m2_n_day_model == 3):
                m2_next_day_probs = get_next_day_probability(m2_probs, [day_before_last, prev_day])
        
            ## Weight correct answers on larger movements more heavily
            ## In the case of a tie, don't award any points
            # M1 wins
            if m1_next_day_probs[category2index[next_day]] > m2_next_day_probs[category2index[next_day]]:
                if(next_day == 'bg' or next_day == 'bd'):
                    m1_round_score += 2
                else:
                    m1_round_score += 1
            # M2 wins
            elif m2_next_day_probs[category2index[next_day]] > m1_next_day_probs[category2index[next_day]]:
                if(next_day == 'bg' or next_day == 'bd'):
                    m2_round_score += 2
                else:
                    m2_round_score += 1

        if m1_round_score > m2_round_score:
            m1_wins += 1
        elif m2_round_score > m1_round_score:
            m2_wins += 1
        else:
            n_draws += 1
    
    return m1_wins, m2_wins, n_draws

## Local 2 day model V.S. Local 1 day model
What I mean by **local** is that the model was only trained on the same stock we are testing on.

One day model blows the two day model out of the water. It seems there is not enough data for the two day model to generalize well.

In [9]:
local_two_day_probs = build_model_probabilities(movement_categories, two_day_movement_trends, 2)
local_one_day_probs = build_model_probabilities(movement_categories, [], 1)

In [10]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, local_two_day_probs, 2)
    
print('Local one day model won ' + str(m1_wins) + ' times')
print('Local two day model won ' + str(m2_wins) + ' times')

Local one day model won 8925 times
Local two day model won 962 times


## Local 3 day model V.S. Local 1 day model

In [24]:
local_three_day_probs = build_model_probabilities(movement_categories, three_day_movement_trends, 3)
local_one_day_probs = build_model_probabilities(movement_categories, [], 1)

In [25]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, local_two_day_probs, 2)
    
print('Local one day model won ' + str(m1_wins) + ' times')
print('Local three day model won ' + str(m2_wins) + ' times')

Local one day model won 1693 times
Local three day model won 8106 times


## Composite 2 day model V.S. Composite 1 day model
What I mean by **composite** is that  the model is created by averaging over all the stocks in our stock_data folder.

In this case, the two day model blows the one day model out of the water, which we should expect, given that the two day model provides us with added context and information.

In [11]:
all_two_day_trends, _, all_single_day_category_probs, all_movement_categories = \
    get_trends_all_stocks(1, 2, movement_category_types, n_cats=4)

composite_two_day_probs = build_model_probabilities(all_movement_categories, all_two_day_trends, 2)

In [12]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, all_single_day_category_probs, 1, composite_two_day_probs, 2)

print('Composite one day model won ' + str(m1_wins) + ' times')
print('Composite two day model won ' + str(m2_wins) + ' times')

Composite one day model won 1781 times
Composite two day model won 8113 times


## Composite 2 day model V.S. Local 1 day model
Again, the generalized two-day model wins by a large margin.

In [13]:
composite_two_day_probs = build_model_probabilities(all_movement_categories, all_two_day_trends, 2)
local_one_day_probs = build_model_probabilities(movement_categories, [], 1)

In [14]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, composite_two_day_probs, 2)
    
print('Local one day model won ' + str(m1_wins) + ' times')
print('Composite two day model won ' + str(m2_wins) + ' times')

Local one day model won 1901 times
Composite two day model won 7980 times


## Composite 3 day model V.S. Composite 1 day model
Three day model wins by a lot.

In [15]:
all_three_day_trends, _, _, _ = get_trends_all_stocks(1, 3, movement_category_types, n_cats=4)
composite_three_day_probs = build_model_probabilities(all_movement_categories, all_three_day_trends, 3)

In [16]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, all_single_day_category_probs, 1, composite_three_day_probs, 3)
    
print('Composite one day model won ' + str(m1_wins) + ' times')
print('Composite three day model won ' + str(m2_wins) + ' times')

Composite one day model won 1406 times
Composite three day model won 8458 times


## Composite 3 day model V.S. Local 1 day model
Three day model wins by a lot, again.

In [17]:
local_one_day_probs = build_model_probabilities(movement_categories, [], 1)

In [18]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, composite_three_day_probs, 3)
    
print('Local one day model won ' + str(m1_wins) + ' times')
print('Composite three day model won ' + str(m2_wins) + ' times')

Local one day model won 1199 times
Composite three day model won 8671 times


## Composite 3 day model V.S. Composite 2 day model
Three day model wins.

In [19]:
all_two_day_trends, _, _, _ = get_trends_all_stocks(1, 2, movement_category_types, n_cats=4)
all_three_day_trends, _, _, _ = get_trends_all_stocks(1, 3, movement_category_types, n_cats=4)

In [20]:
composite_two_day_probs = build_model_probabilities(all_movement_categories, all_two_day_trends, 2)
composite_three_day_probs = build_model_probabilities(all_movement_categories, all_three_day_trends, 3)

In [21]:
m1_wins, m2_wins, n_draws = \
    random_sample_tests_m1_m2(movement_categories, composite_two_day_probs, 2, composite_three_day_probs, 3)
    
print('Composite two day model won ' + str(m1_wins) + ' times')
print('Composite three day model won ' + str(m2_wins) + ' times')

Composite two day model won 1542 times
Composite three day model won 8355 times


# All Stocks
Do the above tests for all stocks

In [22]:
g = glob.glob('stock_data/*.csv')

In [23]:
all_two_day_trends, _, all_single_day_category_probs, all_movement_categories = \
      get_trends_all_stocks(1, 2, movement_category_types, n_cats=4)
all_three_day_trends, _, _, _ = get_trends_all_stocks(1, 3, movement_category_types, n_cats=4)
composite_two_day_probs = build_model_probabilities(all_movement_categories, all_two_day_trends, 2)
composite_three_day_probs = build_model_probabilities(all_movement_categories, all_three_day_trends, 3)

for filename in g:
    ticker = ticker_from_csv(filename)
    df = pd.DataFrame()
    df = df.from_csv(filename)
    daily_movements = get_price_movements(df)
    movement_categories = categorize_movements(daily_movements, n_cats=4)

    two_day_movement_trends = get_trends(movement_categories, 2)
    three_day_movement_trends = get_trends(movement_categories, 3)

    local_one_day_probs = build_model_probabilities(movement_categories, [], 1)
    local_two_day_probs = build_model_probabilities(movement_categories, two_day_movement_trends, 2)
    local_three_day_probs = build_model_probabilities(movement_categories, three_day_movement_trends, 3)
    
    print('=-=-=-=-=-=-=-=-=-=-=-=-=-=-=')
    print('             ' + ticker)
    print('=-=-=-=-=-=-=-=-=-=-=-=-=-=-=')
    print()
    
    print('--------------------------------------------------')
    print('Local 1 day model v.s. Local 2 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, local_two_day_probs, 2)
    print('Local one day model won ' + str(m1_wins) + ' times')
    print('Local two day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Local 1 day model v.s. Local 3 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, local_three_day_probs, 3)
    print('Local one day model won ' + str(m1_wins) + ' times')
    print('Local three day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Local 2 day model v.s. Composite 2 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, local_two_day_probs, 2, composite_two_day_probs, 2)
    print('Local two day model won ' + str(m1_wins) + ' times')
    print('Composite two day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Composite 1 day model v.s. Composite 2 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, all_single_day_category_probs, 1, composite_two_day_probs, 2)
    print('Composite one day model won ' + str(m1_wins) + ' times')
    print('Composite two day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Local 1 day model v.s. Composite 2 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, composite_two_day_probs, 2)  
    print('Local one day model won ' + str(m1_wins) + ' times')
    print('Composite two day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Composite 1 day model v.s. Composite 3 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, all_single_day_category_probs, 1, composite_three_day_probs, 3)
    print('Composite one day model won ' + str(m1_wins) + ' times')
    print('Composite three day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Local 1 day model v.s. Composite 3 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, local_one_day_probs, 1, composite_three_day_probs, 3)
    
    print('Local one day model won ' + str(m1_wins) + ' times')
    print('Composite three day model won ' + str(m2_wins) + ' times')
    print()
    
    print('--------------------------------------------------')
    print('Composite 2 day model v.s. Composite 3 day model')
    print('--------------------------------------------------')
    m1_wins, m2_wins, n_draws = \
      random_sample_tests_m1_m2(movement_categories, composite_two_day_probs, 2, composite_three_day_probs, 3)
    
    print('Composite two day model won ' + str(m1_wins) + ' times')
    print('Composite three day model won ' + str(m2_wins) + ' times')
    print()
    
    print()

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
             BAC
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

--------------------------------------------------
Local 1 day model v.s. Local 2 day model
--------------------------------------------------
Local one day model won 6173 times
Local two day model won 3405 times

--------------------------------------------------
Local 1 day model v.s. Local 3 day model
--------------------------------------------------
Local one day model won 2324 times
Local three day model won 7419 times

--------------------------------------------------
Local 2 day model v.s. Composite 2 day model
--------------------------------------------------
Local two day model won 8912 times
Composite two day model won 1014 times

--------------------------------------------------
Composite 1 day model v.s. Composite 2 day model
--------------------------------------------------
Composite one day model won 695 times
Composite two day model won 9286 times

-----------------------------------------

Composite one day model won 1136 times
Composite two day model won 8696 times

--------------------------------------------------
Local 1 day model v.s. Composite 2 day model
--------------------------------------------------
Local one day model won 4567 times
Composite two day model won 5016 times

--------------------------------------------------
Composite 1 day model v.s. Composite 3 day model
--------------------------------------------------
Composite one day model won 1323 times
Composite three day model won 8575 times

--------------------------------------------------
Local 1 day model v.s. Composite 3 day model
--------------------------------------------------
Local one day model won 2481 times
Composite three day model won 7271 times

--------------------------------------------------
Composite 2 day model v.s. Composite 3 day model
--------------------------------------------------
Composite two day model won 1117 times
Composite three day model won 8724 times


=-=-=-=-=-

Composite two day model won 1330 times
Composite three day model won 8525 times


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
             FB
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

--------------------------------------------------
Local 1 day model v.s. Local 2 day model
--------------------------------------------------
Local one day model won 1441 times
Local two day model won 8236 times

--------------------------------------------------
Local 1 day model v.s. Local 3 day model
--------------------------------------------------
Local one day model won 1999 times
Local three day model won 7740 times

--------------------------------------------------
Local 2 day model v.s. Composite 2 day model
--------------------------------------------------
Local two day model won 6237 times
Composite two day model won 3453 times

--------------------------------------------------
Composite 1 day model v.s. Composite 2 day model
--------------------------------------------------
Composite one day model won 1753 times

### Findings
Since the majority of the stocks we have are volatile (high beta), the models built on the avaerage of those stocks model high beta stocks well, but do a poor job on low beta stocks.

Also, some stocks are more likely to follow trends specific to that sock rather than trends seen across the market. It would be worthwhile to pay attention to which models fit particular stocks.

Maybe it would be best if we made different models for stocks that have different betas.
Also, maybe we should build models for different "markets", such as bear, neutral, and bull markets. The time period over which to define these markets can be varied as well.

#### When does the local two day beat the composite two day?
I believe this happens when there are a high frequency of trends that happen in a given time period, but do not occur many times throughout all time periods.