# Helper Functions

### Strategy function

We need a function get_strategy with the parameters share and sp500 (the yearly performance of the share and of the S&P 500). It is applied to every row of the dataset and decides whether the best strategy for that share is to buy, hold, or sell, as defined in the task:

- Buy: the share outperformed the S&P 500 by more than 2.5 percentage points for the year
- Hold: the share performed within +/- 2.5 percentage points of the S&P 500 for the year
- Sell: the share underperformed the S&P 500 by more than 2.5 percentage points for the year

In [1]:
# Function to determine the strategy for one share relative to the S&P 500
# Return codes: 0: buy, 1: hold, 2: sell

def get_strategy(share, sp500):
    # Signed difference in percentage points; positive means the share beat the index
    diff = share - sp500

    if diff > 2.5:
        return 0
    if diff < -2.5:
        return 2
    return 1
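
Because the function has to run for every row of the dataset, a vectorized variant can be handy. The following is only a minimal sketch, not part of the original task: the helper name `label_strategies` and the constants are hypothetical, and it assumes the same codes and the same 2.5-point threshold as get_strategy.

```python
import numpy as np

# Assumption: same return codes and threshold as get_strategy
BUY, HOLD, SELL = 0, 1, 2
THRESHOLD = 2.5  # percentage points


def label_strategies(share, sp500):
    """Hypothetical helper: label whole columns of yearly returns at once."""
    diff = np.asarray(share, dtype=float) - np.asarray(sp500, dtype=float)
    # Conditions are checked in order; rows matching neither fall back to HOLD.
    # NaN differences also fall back to HOLD, because comparisons with NaN are False.
    return np.select([diff > THRESHOLD, diff < -THRESHOLD], [BUY, SELL], default=HOLD)


labels = label_strategies([10.0, 8.0, 7.0, 2.5], [7.0, 10.0, 10.0, 0.0])
print(labels.tolist())
```

The four example rows are labelled buy, hold, sell, and hold (a difference of exactly 2.5 is still a hold). The function accepts any array-like input, so two DataFrame columns could be passed in directly.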

#### Unit Test for Strategy function
It is crucial that get_strategy is correct: it defines the response variable, so any mistake here would mislabel the data and make the ML part that follows worthless.

In [2]:
import unittest

class TestStrategy(unittest.TestCase):
    
    def test_get_strategy(self):
        #hold test cases
        self.assertEqual(get_strategy(1,1), 1)
        self.assertEqual(get_strategy(-1,-1), 1)
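        # boundary: a difference of exactly 2.5 percentage points still counts as hold
        self.assertEqual(get_strategy(2.5,0), 1)
        self.assertEqual(get_strategy(0,2.5), 1)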
        self.assertEqual(get_strategy(-1,1), 1)
        self.assertEqual(get_strategy(1,-1), 1)
        self.assertEqual(get_strategy(0,0), 1)
        self.assertEqual(get_strategy(0,-2.3), 1)
        self.assertEqual(get_strategy(2.3,0), 1)
        self.assertEqual(get_strategy(0,2.3), 1)
        self.assertEqual(get_strategy(10,8), 1)
        self.assertEqual(get_strategy(8,10), 1)
        self.assertEqual(get_strategy(-8,-10), 1)
        
        #buy test cases
        self.assertEqual(get_strategy(10,7), 0)
        self.assertEqual(get_strategy(-3,-6), 0)
        self.assertEqual(get_strategy(2,-1), 0)
        self.assertEqual(get_strategy(3,0), 0)
        
        #sell test cases
        self.assertEqual(get_strategy(7,10), 2)
        self.assertEqual(get_strategy(-6,-3), 2)
        self.assertEqual(get_strategy(-2,1), 2)
        self.assertEqual(get_strategy(0,3), 2)

unittest.main(argv=[''], verbosity=2, exit=False)

test_get_strategy (__main__.TestStrategy) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.003s

OK



In [4]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np


def feature_selection(x, y, thres):
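    """Rank the features by random forest importance and return the most important ones.

    The selected features are printed, plotted, and returned as a list
    that will be used for the ML algorithms.

    Args:
        x: DataFrame of features without NaN values
        y: class labels (the response variable)
        thres: importance threshold as a fraction (e.g. 0.01 for 1 %), since the
            importances are normalized to sum to 1; features with a relative
            importance above this value will be in the output

    Returns:
        List of feature names, ordered by decreasing importance.
    """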
    
    feat_labels = x.columns[:]
    
    # Create Random Forest object, fit data and
    # extract feature importance attributes
    forest = RandomForestClassifier(random_state=1, class_weight='balanced')
    forest.fit(x, y)
    importances = forest.feature_importances_
    
    # Define n as the number of features with an importance above thres
    n = int(np.sum(importances > thres))
    
    # Get cumsum of the n most important features
    feat_imp = np.sort(importances)[::-1]
    sum_feat_imp = np.cumsum(feat_imp)[:n]
    
    # Sort output (by relative importance) and 
    # print top n features
    indices = np.argsort(importances)[::-1]
    for i in range(n):
        print('{0:2d}) {1:7s} {2:6.4f}'.format(i + 1, 
                                           feat_labels[indices[i]],
                                           importances[indices[i]]))
        
    
    # Plot Feature Importance (both cumul., individual)
    plt.figure(figsize=(12, 8))
    plt.bar(range(n), importances[indices[:n]], align='center')
    plt.xticks(range(n), feat_labels[indices[:n]], rotation=90)
    plt.xlim([-1, n])
    plt.xlabel('Feature')
    plt.ylabel('Rel. Feature Importance')
    plt.step(range(n), sum_feat_imp, where='mid',
             label='Cumulative importance')
    plt.legend()  # without a legend the label of the step line is never shown
    plt.tight_layout()
    
    
    # Create a list with the important features for the ML algorithms
    feature_list = [feat_labels[i] for i in indices[:n]]
    
    # return the list of important features
    return feature_list
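
To make the role of thres concrete, the selection step can be looked at in isolation. This is a minimal sketch under stated assumptions: `select_by_importance` is a hypothetical helper, and the feature names and importance values are made up for illustration; they are not results from the dataset.

```python
import numpy as np


def select_by_importance(names, importances, thres):
    """Hypothetical helper mirroring the selection step of feature_selection."""
    importances = np.asarray(importances, dtype=float)
    # Indices of the features, ordered from most to least important
    order = np.argsort(importances)[::-1]
    # Number of features whose importance exceeds the threshold
    n = int(np.sum(importances > thres))
    return [names[i] for i in order[:n]]


# Illustrative values only (they sum to 1, like feature_importances_)
names = ["feat_a", "feat_b", "feat_c", "feat_d"]
importances = [0.10, 0.55, 0.03, 0.32]

selected = select_by_importance(names, importances, thres=0.05)
print(selected)
```

With a threshold of 0.05, feat_c (importance 0.03) is dropped and the remaining three features come back ordered by decreasing importance. Since the importances are fractions of 1, thres must be passed as a fraction as well.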