# Solutions

## About the Data
In this notebook, we will be working with 3 datasets:
- 2018 stock data for Facebook, Apple, Amazon, Netflix, and Google (obtained using the [`stock_analysis` package](https://github.com/fenago/stock-analysis))
- Earthquake data from September 18, 2018 - October 13, 2018 (obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/))
- European Centre for Disease Prevention and Control's (ECDC) [daily number of new reported cases of COVID-19 by country worldwide dataset](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide) collected on September 19, 2020 via [this link](https://opendata.ecdc.europa.eu/covid19/casedistribution/csv)

## Setup
Note that the COVID-19 data will be read in later as part of the solution to exercise 10.

In [None]:
import pandas as pd
import numpy as np

quakes = pd.read_csv('../../lab_04/exercises/earthquakes.csv')
faang = pd.read_csv('../../lab_04/exercises/faang.csv', index_col='date', parse_dates=True)

## Exercise 1
With the `exercises/earthquakes.csv` file, select all the earthquakes in Japan with a magnitude of 4.9 or greater using the `mb` magnitude type.

In [None]:
quakes.query(
    "parsed_place == 'Japan' and magType == 'mb' and mag >= 4.9"
)[['mag', 'magType', 'place']]

## Exercise 2
Create bins for each full number of magnitude (for example, the first bin is (0, 1], the second is (1, 2], and so on) with the `ml` magnitude type and count how many are in each bin.

In [None]:
quakes.query("magType == 'ml'").assign(
    mag_bin=lambda x: pd.cut(x.mag, np.arange(0, 10))
).mag_bin.value_counts()

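To see how the bin edges behave, here is a minimal sketch on made-up magnitudes (the values below are assumptions for illustration and do not come from `earthquakes.csv`). It shows that `pd.cut()` builds right-closed intervals by default, so a magnitude of exactly 1.0 lands in (0, 1] rather than (1, 2].

```python
import numpy as np
import pandas as pd

# Made-up magnitudes for illustration only (not from earthquakes.csv)
mags = pd.Series([0.5, 1.0, 1.2, 2.0, 2.7, 3.0])

# Edges 0, 1, ..., 9 yield nine right-closed bins: (0, 1], (1, 2], ..., (8, 9]
bins = pd.cut(mags, np.arange(0, 10))

# sort=False keeps the bins in interval order instead of ordering by count
bin_counts = bins.value_counts(sort=False)
print(bin_counts)
```

Passing `sort=False` is optional; it only changes the display order, which can make the distribution across bins easier to read.
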
## Exercise 3
Using the `exercises/faang.csv` file, group by the ticker and resample to monthly frequency. Aggregate the open and close prices with the mean, the high price with the max, the low price with the min, and the volume with the sum.

In [None]:
faang.groupby('ticker').resample('1M').agg(
    {
        'open': 'mean',
        'high': 'max',
        'low': 'min',
        'close': 'mean',
        'volume': 'sum'
    }
)

## Exercise 4
Build a crosstab with the earthquake data between the `tsunami` column and the `magType` column. Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magnitude type along the columns.

In [None]:
pd.crosstab(quakes.tsunami, quakes.magType, values=quakes.mag, aggfunc='max')

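The effect of `values` and `aggfunc` can be checked on a tiny, hypothetical frame (the rows below are invented for illustration). Combinations that never occur come back as `NaN` rather than 0, since there is nothing to take the maximum of.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from earthquakes.csv
toy = pd.DataFrame({
    'tsunami': [0, 0, 0, 1, 1],
    'magType': ['mb', 'mb', 'ml', 'mb', 'mww'],
    'mag': [4.2, 5.1, 1.3, 5.6, 7.5],
})

# values + aggfunc replace the default frequency count with an aggregate of mag
max_mag = pd.crosstab(toy.tsunami, toy.magType, values=toy.mag, aggfunc='max')
print(max_mag)
```
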
## Exercise 5
Calculate the rolling 60-day aggregations of the OHLC data by ticker for the FAANG data. Use the same aggregations as exercise 3.

In [None]:
faang.groupby('ticker').rolling('60D').agg(
    {
        'open': 'mean',
        'high': 'max',
        'low': 'min',
        'close': 'mean',
        'volume': 'sum'
    }
)

## Exercise 6
Create a pivot table of the FAANG data that compares the stocks. Put the ticker in the rows and show the averages of the OHLC and volume traded data.

In [None]:
faang.pivot_table(index='ticker')

## Exercise 7
Calculate the Z-scores of each numeric column of Amazon's data (ticker: AMZN) in Q4 2018 using `apply()`.

In [None]:
faang.loc['2018-Q4'].query("ticker == 'AMZN'").drop(columns='ticker').apply(
    lambda x: x.sub(x.mean()).div(x.std())
).head()

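As a sanity check on the Z-score formula, the same lambda can be run on a small made-up frame (the prices below are assumptions, not values from `faang.csv`). After standardizing, each column should have a mean of roughly 0 and a sample standard deviation of roughly 1.

```python
import pandas as pd

# Hypothetical prices for illustration; not taken from faang.csv
prices = pd.DataFrame({
    'open': [10.0, 12.0, 14.0],
    'close': [11.0, 11.0, 14.0],
})

# apply() passes each column as a Series, so the lambda standardizes per column
z_scores = prices.apply(lambda x: x.sub(x.mean()).div(x.std()))

print(z_scores)
print(z_scores.mean().round(10))  # approximately 0 for every column
print(z_scores.std().round(10))   # approximately 1 for every column
```

Note that `std()` in pandas uses the sample standard deviation (`ddof=1`) by default.
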
## Exercise 8
Adding event descriptions:
1. Create a dataframe with three columns: `ticker`, `date`, and `event`.
    1. `ticker` will be `'FB'`.
    2. `date` will be datetimes `['2018-07-25', '2018-03-19', '2018-03-20']`
    3. `event` will be `['Disappointing user growth announced after close.', 'Cambridge Analytica story', 'FTC investigation']`.
2. Set the index to `['date', 'ticker']`
3. Merge this data to the FAANG data with an outer join.

In [None]:
events = pd.DataFrame({
    'ticker': 'FB',
    'date': pd.to_datetime(
         ['2018-07-25', '2018-03-19', '2018-03-20']
    ), 
    'event': [
         'Disappointing user growth announced after close.',
         'Cambridge Analytica story',
         'FTC investigation'
    ]
}).set_index(['date', 'ticker'])

faang.reset_index().set_index(['date', 'ticker']).join(
    events, how='outer'
).sample(10, random_state=0)

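The behavior of the outer join on a `(date, ticker)` index can be sketched with a few invented rows (tickers `AAA`/`BBB`, the prices, and the event labels are all hypothetical). Rows present on only one side are kept, with `NaN` filling the columns from the other side.

```python
import pandas as pd

# Hypothetical prices indexed by (date, ticker); values are made up
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2018-03-19'), 'AAA'), (pd.Timestamp('2018-03-19'), 'BBB')],
    names=['date', 'ticker']
)
prices = pd.DataFrame({'close': [100.0, 200.0]}, index=idx)

# Hypothetical events; the second date has no matching price row
toy_events = pd.DataFrame({
    'ticker': 'AAA',
    'date': pd.to_datetime(['2018-03-19', '2018-03-24']),
    'event': ['event A', 'event B'],
}).set_index(['date', 'ticker'])

# The outer join keeps the union of index entries from both frames
joined = prices.join(toy_events, how='outer')
print(joined)
```

With the default left join, the event without a matching price row would be dropped instead.
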
## Exercise 9
Use the `transform()` method on the FAANG data to represent all the values in terms of the first date in the data. To do so, divide all values for each ticker by the values of the first date in the data for that ticker. This is referred to as an index, and the data for the first date is the base. [More information](https://ec.europa.eu/eurostat/statistics-explained/index.php/Beginners:Statistical_concept_-_Index_and_base_year). When data is in this format, we can easily see growth over time. Hint: `transform()` can take a function name.

In [None]:
faang = faang.reset_index().set_index(['ticker', 'date'])
faang_index = (faang / faang.groupby(level='ticker').transform('first'))

# view 3 rows of the result per ticker
faang_index.groupby(level='ticker').head(3)

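The indexing step can be illustrated with two hypothetical tickers and invented closing prices. `transform('first')` returns a frame with the same shape as the input in which every row holds its group's first value, so the division lines up row by row.

```python
import pandas as pd

# Hypothetical tickers and prices for illustration only
toy = pd.DataFrame(
    {'close': [100.0, 110.0, 50.0, 45.0]},
    index=pd.MultiIndex.from_product(
        [['AAA', 'BBB'], pd.to_datetime(['2018-01-02', '2018-01-03'])],
        names=['ticker', 'date']
    )
)

# Each ticker's first row, broadcast back to the original shape
base = toy.groupby(level='ticker').transform('first')

# The base date becomes 1.0; later values are relative to it
toy_index = toy / base
print(toy_index)
```

Here `AAA` rises to 1.1 times its base while `BBB` falls to 0.9 times its base, which makes the two directly comparable despite their different price levels.
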
# Exercise 10
## Part 1
1. Read in the data in the `exercises/covid19_cases.csv` file
2. Create a `date` column by parsing the `dateRep` column into a datetime
3. Set the `date` column as the index
4. Use the `replace()` method to update all occurrences of `United_States_of_America` and `United_Kingdom` to `USA` and `UK`, respectively
5. Sort the index

In [None]:
covid = pd.read_csv('../../lab_04/exercises/covid19_cases.csv')\
    .assign(date=lambda x: pd.to_datetime(x.dateRep, format='%d/%m/%Y'))\
    .set_index('date')\
    .replace('United_States_of_America', 'USA')\
    .replace('United_Kingdom', 'UK')\
    .sort_index()

## Part 2
For the 5 countries with the most cases (cumulative), find the day with the largest number of cases.

In [None]:
top_five_countries = covid\
    .groupby('countriesAndTerritories').cases.sum()\
    .nlargest(5).index

covid[covid.countriesAndTerritories.isin(top_five_countries)]\
    .groupby('countriesAndTerritories').cases.idxmax()

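The two-step pattern (pick the top groups by total, then find each group's peak day) can be sketched on invented data; the country labels `A`, `B`, `C` and the counts are assumptions for illustration. Because the dates are the index, `idxmax()` returns the date of each country's largest daily count.

```python
import pandas as pd

# Hypothetical daily counts for three made-up countries
toy = pd.DataFrame(
    {
        'countriesAndTerritories': ['A', 'A', 'B', 'B', 'C', 'C'],
        'cases': [5, 20, 7, 3, 1, 0],
    },
    index=pd.to_datetime(['2020-03-01', '2020-03-02'] * 3).rename('date')
)

# Step 1: countries with the largest cumulative totals
top_two = toy.groupby('countriesAndTerritories').cases.sum().nlargest(2).index

# Step 2: for those countries, the index label (date) of the maximum
peak_days = toy[toy.countriesAndTerritories.isin(top_two)]\
    .groupby('countriesAndTerritories').cases.idxmax()
print(peak_days)
```
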
## Part 3
Find the 7-day average change in COVID-19 cases for the last week in the data for the countries found in part 2.

In [None]:
covid\
    .groupby(['countriesAndTerritories', pd.Grouper(freq='1D')]).cases.sum()\
    .unstack(0).diff().rolling(7).mean().last('1W')[top_five_countries]

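To make the `diff().rolling(7).mean()` chain concrete, here is a minimal sketch on a single hypothetical series of daily counts (the numbers are made up). `diff()` leaves the first row as `NaN`, and a 7-row window needs 7 valid values, so the first 7 results are `NaN`.

```python
import pandas as pd

# Hypothetical daily case counts for one country
daily = pd.Series(
    [0, 2, 4, 6, 8, 10, 12, 14, 20],
    index=pd.date_range('2020-03-01', periods=9, freq='D')
)

# Day-over-day change; the first entry has no predecessor and is NaN
change = daily.diff()

# Average of the last 7 daily changes; NaN until 7 valid changes exist
avg_change = change.rolling(7).mean()
print(avg_change)
```
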
## Part 4
Find the first date that each country other than China had cases:

In [None]:
covid.reset_index()\
    .pivot(index='date', columns='countriesAndTerritories', values='cases')\
    .drop(columns='China')\
    .fillna(0)\
    .apply(lambda x: x[x > 0].index.min())\
    .sort_values()\
    .rename(lambda x: x.replace('_', ' '))

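One detail worth checking when looking for a first occurrence: the earliest date with a positive value is the minimum of the filtered *index*, whereas `idxmin()` on the filtered *values* gives the date of the smallest positive count. The sketch below uses invented counts for two hypothetical countries to show that the two can differ.

```python
import pandas as pd

# Hypothetical wide-format counts: one column per made-up country
wide = pd.DataFrame(
    {'A': [0, 3, 1], 'B': [0, 0, 2]},
    index=pd.date_range('2020-01-01', periods=3, freq='D').rename('date')
)

# Earliest date on which each country reported a positive count
first_case = wide.apply(lambda x: x[x > 0].index.min())

# Date of the smallest positive count, which is a different question
smallest_day = wide.apply(lambda x: x[x > 0].idxmin())

print(first_case)
print(smallest_day)
```

For country `A`, the first positive count is on 2020-01-02, but the smallest positive count falls on 2020-01-03.
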
## Part 5
Rank the countries by total cases using percentiles.

In [None]:
covid\
    .pivot_table(columns='countriesAndTerritories', values='cases', aggfunc='sum')\
    .T\
    .transform('rank', method='max', pct=True)\
    .sort_values('cases', ascending=False)\
    .rename(lambda x: x.replace('_', ' '))
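
The `method='max'` and `pct=True` arguments can be illustrated on a few invented totals (the labels and numbers below are hypothetical). With `method='max'`, tied values all receive the highest rank in the tie, and `pct=True` divides each rank by the number of entries.

```python
import pandas as pd

# Hypothetical total case counts; C and D are tied on purpose
totals = pd.Series({'A': 10, 'B': 30, 'C': 20, 'D': 20}, name='cases')

# Ranks are A=1, C=3, D=3, B=4; dividing by 4 entries gives percentiles
pct_rank = totals.rank(method='max', pct=True).sort_values(ascending=False)
print(pct_rank)
```
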