# Homework pandas

<table align="left">
    <tr>
    <td><a href="https://colab.research.google.com/github/airnandez/numpandas/blob/master/exam/2022-exam.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a></td>
    <td><a href="https://mybinder.org/v2/gh/airnandez/numpandas/master?filepath=exam%2F2022-exam.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/>
</a></td>
  </tr>
</table>

*Author: Fabio Hernandez*

*Last updated: 2022-03-31*

*Location:* https://github.com/airnandez/numpandas/exam

--------------------
## Instructions

For this exercise we will use a public dataset curated and made available by [Our World in Data](https://ourworldindata.org) located in [this repository](https://github.com/owid/energy-data). We will use a snapshot of the dataset as of 2022-03-26.

For your convenience, this notebook includes code for downloading the dataset snapshot from its source and loading it into memory as a **pandas** dataframe, along with some cleaning and helper functions. Your mission is to execute the provided cells and to write the code to answer the questions below.

You must not modify the code provided. You must provide additional code for answering the questions asked, following the instructions for each one of them.

When you have finished, please save your notebook in the form of a `.ipynb` file and send it according to the instructions you received by e-mail.

---------------------
## Dependencies

In [None]:
import datetime
import os
import glob

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.__version__

In [None]:
import numpy as np
np.__version__

------
## Download the dataset

Define a helper function for downloading the dataset to a local file:

In [None]:
import requests

def download(url: str, path: str):
    """Download file at url and save it locally at path."""
    with requests.get(url, stream=True) as resp:
        if not resp.ok:
            raise RuntimeError(f'Could not find file at URL {url}')

        # Save text responses in text mode and anything else as raw bytes
        mode, data = 'wb', resp.content
        if 'text/plain' in resp.headers.get('Content-Type', ''):
            mode, data = 'wt', resp.text
        with open(path, mode) as f:
            f.write(data)

Download the dataset file and store it in the directory `../data`, relative to the location of this notebook. If the file has already been downloaded, it is not downloaded again.

In [None]:
# Download files
data_sources = (
    'https://raw.githubusercontent.com/airnandez/numpandas/master/data/owid-energy-data.csv',
)

# Create destination directory
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

for url in data_sources:
    # Build the destination file path from the file name in the URL
    path = os.path.join('..', 'data', os.path.basename(url))

    # If file already exists don't download it again
    if not os.path.isfile(path):
        print(f'downloading {url} to {path}')
        download(url, path)
    else:
        print(f'local file {path} already exists. Skipping download...')

---------------------
## Load the dataset

Load the dataset (i.e. the file `../data/owid-energy-data.csv`) into a **pandas** dataframe. The information about the format and contents of each column is available [here](https://github.com/owid/energy-data/blob/master/owid-energy-codebook.csv). Please make sure you are familiar with that information, which you will need for analysing the data:

In [None]:
path = os.path.join('..', 'data', 'owid-energy-data.csv')
df = pd.read_csv(path)

--------------
## Inspect the dataset

In [None]:
# Inspect the dimensions of the dataframe
rows, columns = df.shape
print(f'This dataframe has {rows:,} rows and {columns:,} columns')

In [None]:
df.sample(10)
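
Beyond sampling rows, it can help to check the column types and the number of missing values before answering the questions. The following is a minimal sketch on a made-up table read from an in-memory CSV. The country names and all values are invented for illustration, and the variable `toy` is hypothetical. The same calls should apply to `df`.

```python
import io
import pandas as pd

# Made-up miniature CSV: the values are illustrative, not taken from the dataset
csv_text = """country,year,population,primary_energy_consumption
Atlantis,2000,1000,10.5
Atlantis,2001,1010,
Lemuria,2000,2000,20.0
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Type inferred by pandas for each column
print(toy.dtypes)

# Number of missing values per column (empty CSV fields are read as NaN)
missing = toy.isna().sum()
print(missing)
```

Missing values matter here because aggregations such as `mean`, `min` and `max` skip them by default.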

-------------------
# Questions (20 points + bonus)

---------------------
## Question N° 1a (4 points)

We want to determine the global energy consumption (expressed in terawatt•hours) in 2019 and compare it to the energy consumption of France in the same year. You must provide the code that assigns values to the variables so that the `response` variable has the correct value.
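
One possible way to extract a single value for a given country and year is to combine two boolean masks. The sketch below uses a made-up table. The helper `value_for`, the country `Atlantis` and all numbers are hypothetical, and it assumes the global aggregate is stored as a row whose `country` is `World`, which you should verify in the real dataset.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'country': ['World', 'Atlantis', 'World', 'Atlantis'],
    'year': [2018, 2018, 2019, 2019],
    'primary_energy_consumption': [100.0, 4.0, 110.0, 5.5],
})

def value_for(frame: pd.DataFrame, country: str, year: int, column: str) -> float:
    """Return the single value of `column` for the given country and year."""
    mask = (frame['country'] == country) & (frame['year'] == year)
    # .item() raises ValueError unless exactly one row matches
    return frame.loc[mask, column].item()

world = value_for(toy, 'World', 2019, 'primary_energy_consumption')
atlantis = value_for(toy, 'Atlantis', 2019, 'primary_energy_consumption')
share = 100.0 * atlantis / world
print(f'{share:.1f}%')
```

Using `.item()` rather than `.iloc[0]` makes the code fail loudly if the selection unexpectedly matches zero or several rows.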

In [None]:
# Your code goes here

world_energy_consumption = ...
france_energy_consumption = ...
france_consumption_share = (france_energy_consumption / world_energy_consumption) * 100.0

response = f"""
In 2019, the global energy consumption was {world_energy_consumption:,.0f} terawatt•hours and the energy consumption in France was {france_energy_consumption:,.0f} terawatt•hours, which is equivalent to {france_consumption_share:,.0f}% of the global energy consumption.
"""
print(response)

## Question N° 1b (6 points)

We want to determine the evolution of the global primary energy consumption (expressed in terawatt•hours) and compare it to the evolution in global population over the same period.

You need to retrieve the minimum and maximum values of global primary energy consumption present in the dataset and the years in which those extremes were reached, then compare the evolution in global consumption against the evolution in global population over the same period.
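
A minimal sketch of how extremes and the years where they occur could be located, assuming the rows have already been restricted to a single entity. The table and its values are made up and the variable names are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003],
    'population': [1000, 1010, 1025, 1040],
    'primary_energy_consumption': [50.0, 48.0, 55.0, 53.0],
})

# Index by year so that idxmin/idxmax return the year directly
by_year = toy.set_index('year')
consumption = by_year['primary_energy_consumption']

# idxmin/idxmax skip missing values by default
year_of_min = consumption.idxmin()
year_of_max = consumption.idxmax()

value_min = consumption.loc[year_of_min]
value_max = consumption.loc[year_of_max]

# Look up another column for those same years
pop_at_min = by_year.loc[year_of_min, 'population']
pop_at_max = by_year.loc[year_of_max, 'population']
print(year_of_min, value_min, pop_at_min)
print(year_of_max, value_max, pop_at_max)
```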

In [None]:
# Your code goes here

# Retrieve the minimum and maximum values of the column 'primary_energy_consumption' in the dataset
global_energy_consumption_min = ...
global_energy_consumption_max = ...

# Retrieve the years where those extreme values were reached
year_min = ...
year_max = ...

# Retrieve the population values for those same years
population_min = ...
population_max = ...

# Compute the evolutions in consumption and in population
consumption_evolution = 100.0 * (global_energy_consumption_max - global_energy_consumption_min) / global_energy_consumption_min
population_evolution = 100.0 * (population_max - population_min) / population_min

response = f"""
The minimum value of global primary energy consumption in the dataset was {global_energy_consumption_min:,.0f} terawatt•hours and was observed in {year_min}.
The maximum value of global primary energy consumption in the dataset was {global_energy_consumption_max:,.0f} terawatt•hours and was reached in {year_max}.
That peak consumption reached in {year_max} is equivalent to an evolution of {consumption_evolution:.0f}% with respect to the global consumption in {year_min}.
Over the same period, the global population evolved by {population_evolution:.0f}% with respect to the global population in {year_min}.
"""
print(response)

## Question N° 2 (5 points)

We want to study how the global energy mix has changed over the last several decades. You are asked to compute the share of energy consumption that comes from 4 sources: coal, oil, gas and nuclear for years 1970, 2000 and 2019.

You must implement the function `get_consumption_share` which must return 4 values, as indicated in the function comments.
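
As an illustration of the general technique, shares can be derived by dividing each component by a total for the selected year. This is a sketch on made-up data. The column names `source_a`, `source_b` and `total` and the helper `shares_for` are hypothetical. The codebook may also document columns that already contain shares, in which case no division is needed.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'year': [1990, 2000],
    'source_a': [30.0, 45.0],
    'source_b': [50.0, 60.0],
    'total': [100.0, 150.0],
})

def shares_for(frame: pd.DataFrame, year: int, columns: list, total_column: str) -> tuple:
    """Return the share (in %) of each column relative to the total for one year."""
    # Take the first row matching the year; raises IndexError if there is none
    row = frame.loc[frame['year'] == year].iloc[0]
    return tuple(100.0 * row[c] / row[total_column] for c in columns)

a_share, b_share = shares_for(toy, 2000, ['source_a', 'source_b'], 'total')
print(f'a: {a_share:.0f}%, b: {b_share:.0f}%')
```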

In [None]:
def get_consumption_share(year: int) -> (float, float, float, float):
    """Return the energy consumption share that comes from sources
    coal, oil, gas and nuclear for the given year.
    """
    # Your code goes here

    coal_share = ...
    oil_share = ...
    gas_share = ...
    nuclear_share = ...
    
    return coal_share, oil_share, gas_share, nuclear_share

for year in (1970, 2000, 2019):
    coal, oil, gas, nuclear = get_consumption_share(year)
    
    response = f"""
Energy consumption that comes from select sources for year {year}:
    coal:      {coal:2.0f}%
    oil:       {oil:2.0f}%
    gas:       {gas:2.0f}%
    nuclear:   {nuclear:2.0f}%
    aggregate: {coal+oil+gas+nuclear:2.0f}%
    """
    print(response)

## Question N° 3 (5 points)

We want to compute the mean annual change (in percentage) of primary energy consumption that comes from renewables. You are asked to compute the mean share of primary energy consumption that comes from renewables in France, Europe, China, United States and Japan over the period 2000-2019:

In [None]:
for country in ('France', 'Europe', 'United States', 'China', 'Japan'):
    # Your code goes here
    
    renewables_mean_share = ...
    print(f'{country:>13}: {renewables_mean_share:2.0f}%')
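
One possible way to average a column over a range of years is to filter the rows with `Series.between` and then take the mean per country. The sketch below uses a made-up table. The column name `share`, the country names and the values are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'country': ['Atlantis'] * 3 + ['Lemuria'] * 3,
    'year': [1999, 2000, 2001] * 2,
    'share': [10.0, 12.0, 14.0, 20.0, 30.0, 40.0],
})

# between() includes both endpoints by default
in_period = toy['year'].between(2000, 2001)

# Mean of the column per country, ignoring missing values
mean_share = toy[in_period].groupby('country')['share'].mean()

for country in ('Atlantis', 'Lemuria'):
    print(f'{country:>13}: {mean_share[country]:2.0f}%')
```

Inside a loop over countries, the same filter can be combined with a `country` mask instead of `groupby`.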

## Bonus question (2 points)

In which years did Norway reach its peak oil and gas production over the period 1970-2019?

You are provided with the function `plot_oil_and_gas`, which plots the evolution of oil and gas production. It works only if you provide the right set of values, which you must extract from the dataframe. The function expects 3 objects of type `numpy.ndarray` (namely `years`, `oil` and `gas`) which contain the values we need to make the plot. Once the plot is displayed you can inspect it to answer the question above.
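
A minimal sketch of how dataframe columns could be turned into NumPy arrays sorted by year, using `Series.to_numpy`. The table, the column name `output` and the country names are made up. The final `np.argmax` lookup is an optional cross-check of what the plot shows.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'country': ['Atlantis', 'Atlantis', 'Lemuria', 'Atlantis'],
    'year': [2001, 2000, 2000, 2002],
    'output': [7.0, 5.0, 9.0, 6.0],
})

# Keep one country and one period, ordered by year
mask = (toy['country'] == 'Atlantis') & toy['year'].between(2000, 2002)
subset = toy[mask].sort_values('year')

# Convert each column to a numpy.ndarray
years = subset['year'].to_numpy()
output = subset['output'].to_numpy()

# Year at which the series peaks (np.argmax does not skip NaN values)
peak_year = years[np.argmax(output)]
print(years, output, peak_year)
```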

In [None]:
import bokeh
import bokeh.models
import bokeh.plotting
bokeh.plotting.output_notebook()

In [None]:
import numpy as np

In [None]:
def plot_oil_and_gas(years: np.ndarray, oil: np.ndarray, gas: np.ndarray):
    """Generate and display a plot with two lines representing the production of
    oil and gas (in terawatt-hours) over the years.
    """
    # Populate the data source
    data = bokeh.models.ColumnDataSource({
        'year':           years,
        'oil_production': oil,
        'gas_production': gas,
    })

    figure = bokeh.plotting.figure(
        title = f'Annual oil and gas production by Norway ({years[0]}-{years[-1]})',
        x_axis_label = 'year',
        y_axis_label = 'terawatt•hours',
        plot_width = 800,
        plot_height = 600,
        background_fill_color = 'whitesmoke',
        background_fill_alpha = 0.8
    )
    figure.xgrid.grid_line_color = None
    figure.toolbar.autohide = True

    # Add tooltips
    figure.add_tools(bokeh.models.HoverTool(
        tooltips = [
            ('year',           '@year'),
            ('oil production', '@oil_production{,.} terawatt-hours'),
            ('gas production', '@gas_production{,.} terawatt-hours'),
        ],
        mode = 'mouse',
    ))

    # Set the title and axis font sizes
    figure.title.text_font_size = "20px"
    figure.xaxis.axis_label_text_font_size = "16px"
    figure.xaxis.major_label_text_font_size = "14px"
    figure.yaxis.axis_label_text_font_size = "16px"
    figure.yaxis.major_label_text_font_size = "14px"

    # Use thousands separator for the Y axis labels
    figure.yaxis.formatter = bokeh.models.formatters.NumeralTickFormatter(format="0,0")

    # Add a line for oil and another line for gas
    oil_color, gas_color = 'LightSeaGreen', 'Crimson'
    line_width, size, alpha = 3, 6, 0.7
    figure.circle(x='year', y='oil_production', source=data, color=oil_color, size=size, alpha=alpha)
    figure.circle(x='year', y='gas_production', source=data, color=gas_color, size=size, alpha=alpha)
    figure.line(x='year', y='oil_production', source=data, line_color=oil_color, line_width=line_width, alpha=alpha, legend_label="OIL")
    figure.line(x='year', y='gas_production', source=data, line_color=gas_color, line_width=line_width, alpha=alpha, legend_label="GAS")

    # Plot the figure
    bokeh.plotting.show(figure)

In [None]:
# Your code goes here

years = ...
oil = ...
gas = ...

plot_oil_and_gas(years, oil, gas)

Over the period 1970-2019, Norway reached its peak of oil production in year **your answer here** and of gas production in year **your answer here**.