# Top and Bottom Performing
Let's look at how we might get the top performing stocks for a single period. For this example, we'll look at just a single month of closing prices:

In [2]:
import pandas as pd

month = pd.to_datetime('02/01/2018')
close_month = pd.DataFrame(
    {
        'A': 1,
        'B': 12,
        'C': 35,
        'D': 3,
        'E': 79,
        'F': 2,
        'G': 15,
        'H': 59},
    [month])

close_month

            A   B   C  D   E  F   G   H
2018-02-01  1  12  35  3  79  2  15  59


`close_month` gives us the prices for February 2018 for all the stocks in this universe (A, B, C, ...). Looking at these prices, we can see that the top 2 performing stocks for that month were E and H, with prices of 79 and 59.

To get this using code, we can use the [`Series.nlargest`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Series.nlargest.html) function. This function returns the items with the *n* largest numbers. For the example we just talked about, our *n* is 2.

In [3]:
try:
    # Attempt to run nlargest
    close_month.nlargest(2)
except TypeError as err:
    print('Error: {}'.format(err))

Error: nlargest() missing 1 required positional argument: 'columns'


What happened here? It turns out we're not calling the [`Series.nlargest`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Series.nlargest.html) function; we're actually calling [`DataFrame.nlargest`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.nlargest.html), since `close_month` is a DataFrame. Let's get the Series from the DataFrame using `.loc[month]`, where `month` is the 2018-02-01 index label created above.

`loc` gets rows (or columns) with particular labels from the index.  
`iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).  
`ix` usually tries to behave like `loc` but falls back to behaving like `iloc` if a label is not present in the index. Note that `ix` is deprecated and was removed in pandas 1.0, so prefer `loc` or `iloc`.  
https://stackoverflow.com/questions/31593201/how-are-iloc-ix-and-loc-different
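
To make the distinction concrete, here is a minimal sketch using a cut-down, made-up version of `close_month` (the three tickers and the names `by_label` and `by_position` are illustrative only). Selecting the single row by label with `.loc` or by position with `.iloc` returns the same Series, and a Series is what exposes `Series.nlargest`.

```python
import pandas as pd

month = pd.to_datetime('02/01/2018')
# A cut-down, hypothetical version of close_month with three tickers
close_month = pd.DataFrame({'A': 1, 'B': 12, 'C': 35}, [month])

by_label = close_month.loc[month]    # select the row whose index label is `month`
by_position = close_month.iloc[0]    # select the first row, whatever its label

print(type(close_month).__name__)    # DataFrame
print(type(by_label).__name__)       # Series
print(by_label.equals(by_position))  # True
```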

In [4]:
close_month.loc[month].nlargest(2)

E    79
H    59
Name: 2018-02-01 00:00:00, dtype: int64

Perfect! That gives us the top performing tickers for that month. Now, how do we get the bottom performing tickers? There are two ways to do this. You can use pandas' [`Series.nsmallest`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Series.nsmallest.html) function, or just flip the sign on the prices and then apply [`Series.nlargest`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Series.nlargest.html). Either way is fine. For this course, we'll flip the sign with nlargest. This allows us to reuse any function created with nlargest to get the smallest.

To get the bottom 2 performing tickers from `close_month`, here are both approaches: first `nsmallest`, then the sign flip with `nlargest`.

In [5]:
close_month.loc[month].nsmallest(2)

A    1
F    2
Name: 2018-02-01 00:00:00, dtype: int64

In [6]:
(-1 * close_month).loc[month].nlargest(2)

A   -1
F   -2
Name: 2018-02-01 00:00:00, dtype: int64

That gives us the bottom performing tickers, but not the actual prices. To get this, we can flip the sign from the output of nlargest.

In [7]:
(-1 * close_month).loc[month].nlargest(2) * -1

A    1
F    2
Name: 2018-02-01 00:00:00, dtype: int64
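
As a sketch of the reuse mentioned earlier, the sign flip lets a single `nlargest`-based helper serve both cases. The function names `top_n` and `bottom_n` below are hypothetical and not part of the course code; the prices are the ones from `close_month`.

```python
import pandas as pd

def top_n(prices, n):
    """Return the n largest values of a price Series (hypothetical helper)."""
    return prices.nlargest(n)

def bottom_n(prices, n):
    """Reuse top_n on the negated prices, then flip the sign back."""
    return -1 * top_n(-1 * prices, n)

# Same prices as the close_month row above
prices = pd.Series({'A': 1, 'B': 12, 'C': 35, 'D': 3,
                    'E': 79, 'F': 2, 'G': 15, 'H': 59})

print(list(top_n(prices, 2).index))     # ['E', 'H']
print(list(bottom_n(prices, 2).index))  # ['A', 'F']
```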

Now you've seen how to get the top and bottom performing prices in a single month. Let's see if you can apply this knowledge.
## Quiz
Implement `date_top_industries` to find the top performing closing prices and return their sectors for a single date. The function should return only the [set](https://docs.python.org/3/tutorial/datastructures.html#sets) of sectors, so no duplicates should be returned.

- The number of top performing prices to look at is represented by the parameter `top_n`.
- The `date` parameter is the date to look for the top performing prices in the `prices` DataFrame.
- The sector information for each ticker is located in the `sector` parameter.

For example:
```
                 Prices
               A         B         C         D         E
2013-07-08     2         2         7         2         6
2013-07-09     5         3         6         7         5
...            ...       ...       ...

           Sector
A       "Utilities"       
B       "Health Care"       
C       "Real Estate"
D       "Real Estate"
E       "Information Technology"

Date:  2013-07-09
Top N: 3
```
The set created from the function `date_top_industries` should be the following:
```
{"Utilities", "Real Estate"}
```
*Note: Stocks A and E have the same price for the date, but only A's sector is returned. We'll keep it simple and only take the first occurrences of ties.*
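
The tie-breaking in the note matches the default `keep='first'` behavior of `Series.nlargest`. A minimal sketch using the 2013-07-09 row from the example above (the variable names `row`, `top_first`, and `top_last` are illustrative):

```python
import pandas as pd

# Prices for 2013-07-09 from the example; A and E are tied at 5
row = pd.Series({'A': 5, 'B': 3, 'C': 6, 'D': 7, 'E': 5})

top_first = row.nlargest(3)              # default keep='first': A wins the tie
top_last = row.nlargest(3, keep='last')  # keep='last': E wins the tie instead

print(list(top_first.index))  # ['D', 'C', 'A']
print(list(top_last.index))   # ['D', 'C', 'E']
```

With the default, the sectors looked up for D, C, and A are "Real Estate", "Real Estate", and "Utilities", which collapse to the two-element set shown above.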

In [8]:
#import project_tests
def date_top_industries(prices, sector, date, top_n):
    """
    Get the set of the top industries for the date
    
    Parameters
    ----------
    prices : DataFrame
        Prices for each ticker and date
    sector : Series
        Sector name for each ticker
    date : Date
        Date to get the top performers
    top_n : int
        Number of top performers to get
    
    Returns
    -------
    top_industries : set
        Top industries for the date
    """
    # TODO: Implement Function
    return set(sector.loc[prices.loc[date].nlargest(top_n).index])

#project_tests.test_date_top_industries(date_top_industries)
test_date_top_industries(date_top_industries)

Tests Passed


In [10]:
import pandas as pd
import numpy as np
import scipy.stats as stats

def analyze_returns(net_returns):
    """
    Perform a t-test, with the null hypothesis being that the mean return is zero.
    
    Parameters
    ----------
    net_returns : Pandas DataFrame
        Net returns for each date, stored in a 'return' column
    
    Returns
    -------
    t_value
        t-statistic from t-test
    p_value
        Corresponding p-value
    """
    # TODO: Perform one-tailed t-test on net_returns
    # Hint: You can use stats.ttest_1samp() to perform the test.
    #       However, this performs a two-tailed t-test.
    #       You'll need to divide the p-value by 2 to get the results of a one-tailed p-value.
    null_hypothesis = 0.0
    t, p = stats.ttest_1samp(net_returns['return'], null_hypothesis)
    return t, p/2
    
def test_run(filename='data/net_returns.csv'):
    """Test run analyze_returns() with net strategy returns from a file."""
    # pd.Series.from_csv was removed in pandas 1.0; read_csv returns a DataFrame instead
    net_returns = pd.read_csv(filename, header=0)
    t, p = analyze_returns(net_returns)
    print("t-statistic: {:.3f}\np-value: {:.6f}".format(t, p))

test_run()
#if __name__ == '__main__':
#    test_run()

t-statistic: 0.760
p-value: 0.226606
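
Halving the two-tailed p-value, as `analyze_returns` does, gives the one-tailed p-value for the alternative "mean return > 0" only when the t-statistic is positive; when it is negative, the one-tailed value is `1 - p/2`. The sketch below makes that explicit. The helper name `one_tailed_greater` and the sample returns are made up for illustration and are not taken from `data/net_returns.csv`.

```python
import numpy as np
import scipy.stats as stats

def one_tailed_greater(sample, popmean=0.0):
    """One-sample t-test of H0: mean == popmean against H1: mean > popmean."""
    t, p_two_tailed = stats.ttest_1samp(sample, popmean)
    # Half of the two-tailed p-value lies in each tail, so pick the upper tail.
    p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2
    return t, p_one_tailed

# Hypothetical daily net returns
returns = np.array([0.012, -0.004, 0.007, 0.010, -0.002, 0.005])

t, p = one_tailed_greater(returns)
print("t-statistic: {:.3f}\np-value: {:.6f}".format(t, p))
```

SciPy 1.6 and later also accept `alternative='greater'` in `stats.ttest_1samp`, which should return the one-tailed p-value directly.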


In [7]:
import collections.abc
from collections import OrderedDict
import copy
import pandas as pd
import numpy as np
from datetime import date, timedelta


pd.options.display.float_format = '{:.8f}'.format


def _generate_output_error_msg(fn_name, fn_inputs, fn_outputs, fn_expected_outputs):
    formatted_inputs = []
    formatted_outputs = []
    formatted_expected_outputs = []

    for input_name, input_value in fn_inputs.items():
        formatted_inputs.append('INPUT {}:\n{}\n'.format(
            input_name, str(input_value)))
    for output_name, output_value in fn_outputs.items():
        formatted_outputs.append('OUTPUT {}:\n{}\n'.format(
            output_name, str(output_value)))
    for expected_output_name, expected_output_value in fn_expected_outputs.items():
        formatted_expected_outputs.append('EXPECTED OUTPUT FOR {}:\n{}\n'.format(
            expected_output_name, str(expected_output_value)))

    return 'Wrong value for {}.\n' \
           '{}\n' \
           '{}\n' \
           '{}' \
        .format(
            fn_name,
            '\n'.join(formatted_inputs),
            '\n'.join(formatted_outputs),
            '\n'.join(formatted_expected_outputs))


def _is_equal(x, y):
    is_equal = False

    if isinstance(x, (pd.DataFrame, pd.Series)):
        is_equal = x.equals(y)
    elif isinstance(x, np.ndarray):
        is_equal = np.array_equal(x, y)
    elif isinstance(x, list):
        if len(x) == len(y):
            for x_item, y_item in zip(x, y):
                if not _is_equal(x_item, y_item):
                    break
            else:
                is_equal = True
    else:
        is_equal = x == y

    return is_equal


def project_test(func):
    def func_wrapper(*args):
        result = func(*args)
        print('Tests Passed')
        return result

    return func_wrapper


def generate_random_tickers(n_tickers=None):
    min_ticker_len = 3
    max_ticker_len = 5
    tickers = []

    if not n_tickers:
        n_tickers = np.random.randint(8, 14)

    ticker_symbol_random = np.random.randint(ord('A'), ord('Z')+1, (n_tickers, max_ticker_len))
    # randint's upper bound is exclusive, so add 1 to allow tickers of max_ticker_len
    ticker_symbol_lengths = np.random.randint(min_ticker_len, max_ticker_len + 1, n_tickers)
    for ticker_symbol_rand, ticker_symbol_length in zip(ticker_symbol_random, ticker_symbol_lengths):
        ticker_symbol = ''.join([chr(c_id) for c_id in ticker_symbol_rand[:ticker_symbol_length]])
        tickers.append(ticker_symbol)

    return tickers


def generate_random_dates(n_days=None):
    if not n_days:
        n_days = np.random.randint(14, 20)

    start_year = np.random.randint(1999, 2017)
    start_month = np.random.randint(1, 13)  # upper bound is exclusive, so this covers months 1-12
    start_day = np.random.randint(1, 29)
    start_date = date(start_year, start_month, start_day)

    dates = []
    for i in range(n_days):
        dates.append(start_date + timedelta(days=i))

    return dates


def assert_output(fn, fn_inputs, fn_expected_outputs):
    assert type(fn_expected_outputs) == OrderedDict

    fn_outputs = OrderedDict()
    fn_inputs_passed_in = copy.deepcopy(fn_inputs)
    fn_raw_out = fn(**fn_inputs_passed_in)

    # Check if inputs have changed
    for input_name, input_value in fn_inputs.items():
        passed_in_unchanged = _is_equal(input_value, fn_inputs_passed_in[input_name])

        assert passed_in_unchanged, 'Input parameter "{}" has been modified inside the function. ' \
                                    'The function shouldn\'t modify the function parameters.'.format(input_name)

    if len(fn_expected_outputs) == 1:
        fn_outputs[list(fn_expected_outputs)[0]] = fn_raw_out
    elif len(fn_expected_outputs) > 1:
        assert type(fn_raw_out) == tuple,\
            'Expecting function to return tuple, got type {}'.format(type(fn_raw_out))
        assert len(fn_raw_out) == len(fn_expected_outputs),\
            'Expected {} outputs in tuple, only found {} outputs'.format(len(fn_expected_outputs), len(fn_raw_out))
        for key_i, output_key in enumerate(fn_expected_outputs.keys()):
            fn_outputs[output_key] = fn_raw_out[key_i]

    err_message = _generate_output_error_msg(
        fn.__name__,
        fn_inputs,
        fn_outputs,
        fn_expected_outputs)

    for fn_out, (out_name, expected_out) in zip(fn_outputs.values(), fn_expected_outputs.items()):
        assert isinstance(fn_out, type(expected_out)),\
            'Wrong type for output {}. Got {}, expected {}'.format(out_name, type(fn_out), type(expected_out))

        if hasattr(expected_out, 'shape'):
            assert fn_out.shape == expected_out.shape, \
                'Wrong shape for output {}. Got {}, expected {}'.format(out_name, fn_out.shape, expected_out.shape)
        elif hasattr(expected_out, '__len__'):
            assert len(fn_out) == len(expected_out), \
                'Wrong len for output {}. Got {}, expected {}'.format(out_name, len(fn_out), len(expected_out))

        if type(expected_out) == pd.DataFrame:
            assert set(fn_out.columns) == set(expected_out.columns), \
                'Incorrect columns for output {}\n' \
                'COLUMNS:          {}\n' \
                'EXPECTED COLUMNS: {}'.format(out_name, sorted(fn_out.columns), sorted(expected_out.columns))

        if type(expected_out) in {pd.DataFrame, pd.Series}:
            assert set(fn_out.index) == set(expected_out.index), \
                'Incorrect indices for output {}\n' \
                'INDICES:          {}\n' \
                'EXPECTED INDICES: {}'.format(out_name, sorted(fn_out.index), sorted(expected_out.index))

        try:
            out_is_close = np.isclose(fn_out, expected_out, equal_nan=True)
        except TypeError:
            out_is_close = fn_out == expected_out
        else:
            if isinstance(expected_out, collections.abc.Iterable):
                out_is_close = out_is_close.all()

        assert out_is_close, err_message


In [1]:
from collections import OrderedDict
import pandas as pd
from helper import project_test, generate_random_tickers, generate_random_dates, assert_output


@project_test
def test_date_top_industries(fn):
    tickers = generate_random_tickers(10)
    dates = generate_random_dates(2)

    fn_inputs = {
        'prices': pd.DataFrame(
            [
                [21.050810483942833, 17.013843810658827, 10.984503755486879, 11.248093428369392, 12.961712733997235,
                 482.34539247360806, 35.202580592515041, 3516.5416782257166, 66.405314327318209, 13.503960481087077],
                [15.63570258751384, 14.69054309070934, 11.353027688995159, 475.74195118202061, 11.959640427803022,
                 10.918933017418304, 17.9086438675435, 24.801265417692324, 12.488954191854916, 15.63570258751384]],
            dates, tickers),
        'sector': pd.Series(
            ['ENERGY', 'MATERIALS', 'ENERGY', 'ENERGY', 'TELECOM', 'FINANCIALS',
             'TECHNOLOGY', 'HEALTH', 'MATERIALS', 'REAL ESTATE'],
            tickers),
        'date': dates[-1],
        'top_n': 4}
    fn_correct_outputs = OrderedDict([
        (
            'top_industries',
            set(['ENERGY', 'HEALTH', 'TECHNOLOGY']))])

    assert_output(fn, fn_inputs, fn_correct_outputs)
