# Week 6 - Coding Exercise
This notebook **is the deliverable** for your weekly coding exercises. Below you will find the text of the exercise and the space to write your code. Feel free to **add additional code cells** if needed. 

## Completion Instructions
1. You are allowed to add additional **cells**. 
1. Unless specified otherwise, you can use as many **intermediate steps** as you want to get to the final result of each point. We will only mark the final result.
1. Some exercises will ask you to perform a calculation and assign the result to a variable with a **specific name**. Assigning the result to a variable with the wrong name will result in **0 marks** for that point.
1. Some exercises will ask you to perform a calculation and assign the result to a variable of a **specific type** (number, Series, DataFrame, etc.). Assigning the result to a variable of the wrong type will result in **0 marks** for that point. You can check the type of an object with the function **[`type()`](https://www.codingem.com/type-of-in-python/)**.
1. The **final result** of each point should be **shown on screen**. For example, if you are asked to assign the result of a calculation to a number called `Total`, this number should be visible on screen. If the result is a DataFrame or Series, you should show only a few rows with **[`df.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)** or, if more appropriate, **[`df.tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html)**. If a final result is not shown on screen, it will lead to **0 marks** for that point.
1. You should not "hard-code" numbers into your calculations if this can be avoided. For example if you need to use the "number of columns" in a DataFrame in a calculation, you should use a command/function to calculate the number of columns and not simply count the columns and use `7` (a hard-coded number) in your calculations. Using hard-coded numbers when this is unnecessary may result in **0 marks** for that point. 
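
As an illustration of the last few points, here is a minimal sketch on a toy DataFrame (the names `df`, `n_cols` and `avg_per_column` are hypothetical and not part of the exercise). It checks object types with `type()` and computes the number of columns instead of hard-coding it:

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})

# Check the type of an object before submitting
print(type(df))        # a DataFrame
print(type(df["a"]))   # a Series

# Compute the number of columns instead of typing it by hand
n_cols = df.shape[1]
avg_per_column = df["a"].sum() / n_cols
print(n_cols, avg_per_column)
```

If a column is later added to or removed from `df`, `n_cols` updates automatically, which is why a computed value is preferred over a typed-in number.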


## Submission Instructions
1. Do not change the **name of the file**. Canvas will automatically add your name and student ID to the file.
1. Before submitting the notebook please **check that it runs properly** from top to bottom. To do this, save the file, close it, then re-open it and press the fast-forward button at the top of the notebook or select _Restart and Run All Cells_ from the _Kernel_ menu. You can see a discussion of this in this [video](https://youtu.be/P0NyuTGddPo). If your file has a breaking error that prevents the notebook from running from top to bottom, you will receive a **penalty of 5 marks**. 
___

#### Identification
Please enter your **name** and your **student ID** number in this markdown cell:

* **Student Name:** XXXXXX
* **Student ID:** XXXXXX

A missing name or ID will result in a **penalty of 1 mark**.

___
#### Import Statements
Add to the following cell all the import statements that you need to run the entire notebook. Import statements anywhere else in the notebook will result in a **penalty of 1 mark**.

___
### Exercise CE6.01
Load the file `CE6_Prices.zip` with monthly prices for US stocks, and the files `CE6_BTM.zip` and `CE6_EP12.zip` that contain data on two value indicators (book-to-market and 12-month earnings-to-price ratios) for US stocks. All DataFrames should be indexed by the id of the stock and the date variable. You can choose the names you want for these DataFrames.

Combine the two indicators into a single DataFrame called `indicators`, keeping only the observations where both of them have values (`BTM` should be the first column and `EP12` the second).

Add a third column to the DataFrame `indicators` with the lowest of the two indicators. This new column should be called `min` **[Point 1: 1 mark]**.
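
The general pattern of combining two signals and taking a row-wise minimum can be sketched on toy data as follows. This is only an illustration under assumed names (`btm_toy`, `ep_toy` and `combined` are hypothetical); it does not load the exercise files, and other approaches such as `join` are equally valid:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed by (id, date); the real files are not used here
idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.to_datetime(["2020-01-31", "2020-02-29"])],
    names=["id", "date"],
)
btm_toy = pd.Series([0.5, 0.7, np.nan, 1.2], index=idx, name="BTM")
ep_toy = pd.Series([0.10, np.nan, 0.05, 0.08], index=idx, name="EP12")

# Put the two signals side by side and keep rows where both are present
combined = pd.concat([btm_toy, ep_toy], axis=1).dropna()

# Row-wise minimum across the two indicator columns
combined["min"] = combined[["BTM", "EP12"]].min(axis=1)
print(combined)
```

Passing `axis=1` makes `min` work across columns for each row rather than down each column.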

Using the function `ic_analysis` from the `apmodule` create a Series called `ic_min` with the annual information coefficient for the composite 'min' indicator created above **[Point 2: 1 mark]**. 

The function returns the T-stat and P-Value of the t-test on the average coefficient. The P-Value appears to be zero, but only because it is rounded to three decimal places. Your boss is obsessed with decimal points and wants to know the P-Value more precisely. Therefore, repeat the t-test and assign **the P-Value** at full precision (without rounding to three decimal places) to a variable called `p_val`. This should be a number in __[scientific notation](https://en.wikipedia.org/wiki/Scientific_notation)__ **[Point 3: 1 mark]**. 
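
To see why a rounded P-Value can display as zero, here is a minimal sketch using `scipy.stats.ttest_1samp` on simulated numbers. The sample `toy_ic` is hypothetical and generated at random; it is not the exercise data:

```python
import numpy as np
from scipy import stats

# Hypothetical toy sample standing in for a series of information coefficients
rng = np.random.default_rng(0)
toy_ic = rng.normal(loc=0.05, scale=0.02, size=200)

# One-sample t-test of the mean against zero
result = stats.ttest_1samp(toy_ic, 0)

print(round(result.pvalue, 3))   # rounded value, which hides a very small p-value
print(result.pvalue)             # full-precision float
print(f"{result.pvalue:.4e}")    # explicit scientific-notation formatting
```

Python prints very small floats in scientific notation automatically, so the unrounded value needs no special formatting to appear in that form.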

___
### Exercise CE6.02
Using the `ic_analysis` function from the `apmodule`, generate the monthly information coefficient for the book-to-market value indicator (using the `btm` factor from the `indicators` DataFrame created in the previous exercise). Assign this to an object called `ic_btm`. By default the function returns a Series. Transform this Series into a DataFrame with the same name (`ic_btm`) **[Point 4: 1 mark]**.

Create an object called `avg_ic` with the average information coefficient for every calendar year. The object can be either a DataFrame or a Series **[Point 5: 1 mark]**.
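
As a hedged illustration of the two operations above (turning a Series into a DataFrame and averaging by calendar year), here is a sketch on a made-up monthly series. The names `toy_ic`, `toy_df` and `yearly_avg` are hypothetical, and grouping on the index year is only one of several possible approaches:

```python
import pandas as pd

# Hypothetical monthly series indexed by date, standing in for a monthly IC
dates = pd.date_range("2019-01-01", periods=24, freq="MS")
toy_ic = pd.Series(range(24), index=dates, name="IC", dtype=float)

# A Series becomes a one-column DataFrame with .to_frame()
toy_df = toy_ic.to_frame()
print(type(toy_df))

# Average by calendar year, grouping on the year of the date index
yearly_avg = toy_ic.groupby(toy_ic.index.year).mean()
print(yearly_avg)
```

The result is indexed by year, so each row holds the average of the twelve monthly values in that year.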

Calculate the highest annual average Information Coefficient and assign the result to a variable called `max_ic`. This should be a number or an object (DataFrame or Series) containing a single number **[Point 6: 1 mark]**.

Using a specific Pandas command find a way to extract from `max_ic` the year in which we obtain the highest Information Coefficient. The result should be a number or an object (DataFrame or Series) containing a single number. **ATTENTION:** We have not seen this command in class, so you will have to search for it **[Point 7: 1 mark]**.

___
### Exercise CE6.03
Your boss is interested in exploring the profitability of a "Value-Volatility" Strategy where a value and a low-volatility indicator are combined into a single information signal.
1. Load the data on earnings to price ratios for US stocks contained in `CE6_EP12.zip` into a DataFrame called `ep`.
1. Load the data on total volatility for US stocks contained in `CE6_TVOL.zip` into a DataFrame called `tvol`.
1. Combine these two factors into a single DataFrame called `factors` and keep only the observations where they are both non-missing.
1. Create an additional column in `factors` called `value_vol` containing, for each month, the average of `ep` and `tvol` **[Point 8: 1 mark]**.

Use the `ic_analysis` function from the `apmodule` to generate the monthly information coefficient for the value and the low-volatility factors, as well as the combined value-vol factor. For all calculations use the signals from the DataFrame `factors` and not from the original files. Assign the results to `ic_ep12`, `ic_ivol` and `ic_value_vol`, respectively. Calculate **[Point 9: 1 mark]**:
1. The percentage of months in which the value-vol strategy beats the value strategy (has a higher information coefficient). Assign the result (a number, for example 0.6534... to indicate 65.34%) to `percentage_1`.
1. The percentage of months in which the value-vol strategy beats the low-volatility strategy (has a higher information coefficient). Assign the result (a number, for example 0.6534... to indicate 65.34%) to `percentage_2`.

**ATTENTION:** There are different ways to obtain this result (different commands, formulas, number of intermediate steps...). We will consider them all valid as long as:
1. The calculation is done with python code (no manual counting)
1. The result is numerically correct
1. The result is a number (not a DataFrame or a Series)
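
One possible way to compute such a share is sketched below on invented numbers (`ic_a`, `ic_b`, `beats` and `share` are hypothetical names). It relies on the fact that the mean of a boolean Series is the fraction of `True` values, and it assumes both Series share the same index:

```python
import pandas as pd

# Hypothetical information coefficients for two strategies over four months
ic_a = pd.Series([0.02, 0.05, -0.01, 0.04])
ic_b = pd.Series([0.01, 0.06, -0.03, 0.04])

# True in the months where strategy A has the strictly higher value
beats = ic_a > ic_b

# The mean of a boolean Series is the fraction of True values;
# float() converts the NumPy scalar to a plain Python number
share = float(beats.mean())
print(share)
```

With a strict `>` comparison, months where the two values are equal do not count as a win for either strategy.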

**[Final Challenge Mark!]** 

Your boss is quite impressed by your ability to write python functions to make your job easier, so of course she gives you more work. Below you will find the `ic_analysis` function created in class (this is a simplified version that only returns monthly information coefficients). Write a new function called `ic_analysis_new` based on this one but with the following changes:
1. The new function should not print any result on screen (the average IC, the t-test, percentage of positive and negative, ...).
1. Instead of a DataFrame with the monthly information coefficient, the new function should return a DataFrame with the rolling average information coefficient of the past `N` months, where `N` is an additional argument (input) of the `ic_analysis_new` function. Use `N=3` as the default.

To test this new function, run it on the column `ep12` from the DataFrame `factors`, computing the 2-month rolling IC average. Assign the result to an object called `ic_test` and show on screen the first 5 rows of this object **[Point 10: 1 mark]**.
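
For reference, a rolling average with a default window can be sketched on a toy Series as follows. The helper `rolling_average` and the Series `toy` are hypothetical and deliberately separate from `ic_analysis`; this only shows how `Series.rolling` behaves, not the full function asked for above:

```python
import pandas as pd

def rolling_average(series, N=3):
    """Hypothetical helper: N-period rolling mean of a Series, as a DataFrame."""
    return series.rolling(N).mean().to_frame()

# Toy data, for illustration only
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="IC_toy")

# Override the default window with N=2
out = rolling_average(toy, N=2)
print(out.head())
```

The first `N - 1` rows are `NaN` because a full window of `N` observations is not yet available at those points.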

```python
# Note: this function uses `np` (NumPy) and `stats` (scipy.stats),
# which must already be imported in the import cell above.
def ic_analysis(signal, prices):

    # We capture the name of the factor using the .name attribute to use it later
    signal_name = signal.name

    # We calculate the one-month-ahead log returns for each stock
    future_returns = np.log(prices.groupby('id').shift(-1) / prices).rename('fut_ret')

    # We join the signal with the future return
    data = signal.to_frame().join(future_returns).dropna()
    data['month'] = data.index.get_level_values('date').month

    # We drop the month column because we do not need it anymore
    data.drop(columns=['month'], inplace=True)

    # We calculate the IC (Spearman rank correlation between signal and future return, per date)
    ic = data.groupby('date').corr(method='spearman').iloc[0::2,-1].droplevel(level=1).rename('IC')

    # We print on screen the average IC
    print('Average IC:', round(ic.mean(), 3), '\n')

    # We calculate and print the percentage of positive/negative observations
    sign = np.sign(ic).value_counts() / ic.count()

    print('Percentage of Positive Periods:', round(sign.loc[1], 3), '\n')
    print('Percentage of Negative Periods:', round(sign.loc[-1], 3), '\n')

    # We calculate and print the t-test
    t_test = stats.ttest_1samp(ic, 0)
    print(f'T-Stat: {round(t_test.statistic,3)} P-Value: {round(t_test.pvalue,3)} \n')

    # We return the series with the Information Coefficient for further analysis using the original name of the factor
    return ic.rename('IC_' + signal_name)
```