# Table of Contents
  - [References](#References)
  - [Introduction to Libraries](#Introduction-to-Libraries)
    - [Why are libraries important?](#Why-are-libraries-important?)
    - [Finding the right Library](#Finding-the-right-Library)
      - [What is Pandas?](#What-is-Pandas?)
      - [Why We Need Pandas with Python](#Why-We-Need-Pandas-with-Python)
      - [How to use a library](#How-to-use-a-library)
        - [Installing a library](#Installing-a-library)
        - [Importing the library](#Importing-the-library)
        - [Accessing its functions](#Accessing-its-functions)
        - [Getting help and inspiration](#Getting-help-and-inspiration)
    - [First step: Data import and exploration](#First-step:-Data-import-and-exploration)
    - [Exercise reading in data](#Exercise-reading-in-data)
    - [Playground](#Playground)
      - [Data exploration](#Data-exploration)
  - [Building the plot from scratch](#Building-the-plot-from-scratch)
    - [Finding the limits](#Finding-the-limits)
    - [Cleaning missing values](#Cleaning-missing-values)
      - [Exercise: Complete Happiness](#Exercise:-Complete-Happiness)
    - [Adding regional indicator](#Adding-regional-indicator)
      - [Exercise: Final Happiness](#Exercise:-Final-Happiness)
    - [Plotting basic scatter plot](#Plotting-basic-scatter-plot)
    - [Making frames per year](#Making-frames-per-year)
      - [Adding slider bar for time scale](#Adding-slider-bar-for-time-scale)
    - [Adding pause-play button](#Adding-pause-play-button)
    - [Using bubble size as a variable](#Using-bubble-size-as-a-variable)
    - [Classify into categories](#Classify-into-categories)
      - [Exercise Frames with category](#Exercise-Frames-with-category)
      - [Bonus Exercise Fixing a library](#Bonus-Exercise-Fixing-a-library)

# References

* [Guide to Animated Bubble Charts](https://www.kaggle.com/code/aashita/guide-to-animated-bubble-charts-using-plotly/notebook) using Plotly.
* World Happiness Report [dataset](https://www.kaggle.com/datasets/unsdsn/world-happiness).
* [Intro to Animations](https://plot.ly/python/animations/) in Plotly.
* Create animations offline: [Stack Overflow](https://stackoverflow.com/questions/45780920/plotly-icreate-animations-offline-on-jupyter-notebook).
* [Adding Sliders](https://plot.ly/python/gapminder-example/) to Animations in Plotly.
* [Bubbly package](https://github.com/AashitaK/bubbly) for plotting interactive and animated bubble charts using Plotly.

# Introduction to Libraries

Welcome!
This Jupyter Notebook is designed to introduce you to the power of Python libraries.
Here in particular we will have a look at data analysis and visualization.
We'll be focusing on two essential libraries:

-  **Pandas**: A versatile library for data manipulation and analysis.
-  **Plotly**: A powerful tool for creating interactive and visually appealing plots.

Through practical examples, you'll learn how to:
- Load and manipulate data using Pandas DataFrames.
- Perform data analysis tasks such as filtering, grouping, and aggregation.
- Create informative and interactive visualizations using Plotly.

## Why are libraries important?

Python libraries are collections of pre-written code that provide reusable functions and tools for specific tasks.
They significantly extend the capabilities of Python, allowing you to perform complex operations without writing everything from scratch.
This saves time and effort, promotes code reusability, and helps you focus on the higher-level logic of your data analysis and visualization workflows.
Have a look at the plot below, generated with Plotly, and imagine having to code every detail of it from scratch! Instead, we rely on libraries to do most of the heavy lifting for us.
However, there is still quite a bit to do before we can reproduce exactly this plot.

In [None]:
import pandas as pd
from tutorial.my_bubbly import bubbleplot 
from plotly.offline import iplot
path = "data/data_exploration"
gapminder_indicators = pd.read_csv(path + '/gapminder.tsv', delimiter='\t')

figure = bubbleplot(dataset=gapminder_indicators, x_column='gdpPercap', y_column='lifeExp', 
    bubble_column='country', time_column='year', size_column='pop', color_column='continent', 
    x_title="GDP per Capita", y_title="Life Expectancy", title='Gapminder Global Indicators',
    x_logscale=True, scale_bubble=3, height=650)
iplot(figure, config={'scrollZoom': True})

In the graph above, the size of each bubble corresponds to the population of the country, and the GDP per capita, the life expectancy, and the name of the country can be seen by hovering the cursor over the bubbles.
Imagine the work you'd have to put in to build this without any libraries.

This animated bubble chart can convey a great deal of information since it can accommodate up to *six variables* in total, namely:
- X-axis (GDP per capita)
- Y-axis (Life Expectancy)
- Bubbles (Countries, can be seen by hovering the cursor over the dots)
- Time (in years)
- Size of bubbles (Population)
- Color of bubbles (Continents, variable can be categorical or numerical)


The plot above was created with the function `bubbleplot` from the module [`bubbly` (bubble charts with plotly)](https://github.com/AashitaK/bubbly); see the references for all source material.
Our goal is to recreate this visualization with a different dataset.
For this, we have preloaded the data file `data/data_exploration/World-happiness-report-updated_2024.csv`, an open dataset that can be found on kaggle.com.

## Finding the right Library

Finding the right library is often tricky: it depends on your project requirements, the time you have available, and your familiarity with the libraries you already know.
Spending a bit of time at the beginning of a project researching options and asking a friend for advice is often time well invested.
For importing and exploring a dataset, the most well-known Python library is Pandas.

### What is Pandas?

[Pandas](https://pandas.pydata.org/docs/index.html) is an open-source data manipulation and analysis library for Python.
It provides powerful data structures and functions designed to make working with structured data intuitive and efficient.
At the heart of Pandas are two primary data structures:

- **DataFrame**:
  A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labelled axes (rows and columns).
  It's similar to a spreadsheet or SQL table and is generally the most commonly used Pandas object.
- **Series**:
  A one-dimensional labelled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

Pandas integrates well with various other Python libraries, such as Matplotlib for plotting and NumPy for numerical computations, making it a central library in the Python data science stack.
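To make the two data structures concrete, here is a minimal sketch. The country values are made up for illustration and do not come from the tutorial's dataset.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label in the index
life_ladder = pd.Series(
    [7.5, 6.1, 4.8],
    index=["Finland", "Chile", "Kenya"],
    name="Life Ladder",
)

# A DataFrame: a table whose columns can hold different data types
df = pd.DataFrame(
    {
        "Country name": ["Finland", "Chile", "Kenya"],
        "year": [2020, 2020, 2020],
        "Life Ladder": [7.5, 6.1, 4.8],
    }
)

print(type(df["Life Ladder"]))    # selecting one column gives a Series
print(df.shape)                   # (number of rows, number of columns)
print(life_ladder["Chile"])       # access a Series value by its label
print(df.loc[0, "Country name"])  # access by row label and column label
print(df.iloc[2, 2])              # access by integer position
```

Selecting a single column of a DataFrame returns a Series, which is why the two structures are usually learned together.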

### Why We Need Pandas with Python

Python, while a powerful programming language, isn't designed specifically for data analysis.
It lacks built-in, high-level data structures and tools that are intuitive and efficient for these tasks.
Here's where Pandas comes in:

- **Data Cleaning and Preparation**:
  Data scientists spend a significant amount of time cleaning and preparing data.
  Pandas simplifies these tasks with built-in functions for filtering, selecting, and manipulating data.
- **Data Analysis**:
  With Pandas, analyzing and exploring data is more straightforward.
  It provides functions for aggregating, summarizing, and transforming data, making it easier to derive insights.
- **Data Visualization**:
  Though Pandas is not a data visualization library, it seamlessly interfaces with Matplotlib for plotting and visualizing data, allowing quick and informative visual analysis.
- **Handling Diverse Data Types**:
  Pandas efficiently handles a variety of data formats, including CSV, Excel files, SQL databases, and HDF5 format, making it a versatile tool for diverse data analysis needs.
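As a rough illustration of the points above, the following sketch computes a per-group average twice: once with plain Python and once with Pandas. The column names `continent` and `lifeExp` follow the Gapminder example, but the rows themselves are invented for this example.

```python
import pandas as pd

# Made-up records, for illustration only
records = [
    {"continent": "Europe", "country": "A", "lifeExp": 80.0},
    {"continent": "Europe", "country": "B", "lifeExp": 82.0},
    {"continent": "Africa", "country": "C", "lifeExp": 60.0},
    {"continent": "Africa", "country": "D", "lifeExp": 64.0},
]

# Plain Python: collect the values per group by hand, then average them
values_per_continent = {}
for row in records:
    values_per_continent.setdefault(row["continent"], []).append(row["lifeExp"])
plain_means = {
    continent: sum(values) / len(values)
    for continent, values in values_per_continent.items()
}

# Pandas: the same aggregation in one line
df = pd.DataFrame(records)
pandas_means = df.groupby("continent")["lifeExp"].mean()

# Filtering rows with a boolean condition
long_lived = df[df["lifeExp"] > 70]

print(plain_means)
print(pandas_means)
print(long_lived)
```

Both approaches give the same averages; the Pandas version is shorter and scales to many columns and groups without extra loops.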

### How to use a library

#### Installing a library

```bash
pip install pandas
```

#### Importing the library

Just like code you have written yourself in a different file, the code for pandas needs to be imported with the statement

```python
import pandas as pd
```

As a prerequisite, the Python package pandas must be installed, e.g. with your favourite package installer such as `pip`.
That is a topic for another tutorial; we have preinstalled all the libraries you will need in this environment.
You will also notice that we immediately rename the library namespace to `pd`, a common convention for well-known libraries that makes typing faster.
We adopt this practice here so that you have seen it before and aren't confused when it appears again.

#### Accessing its functions

As we said, a library is a collection of useful functions and other objects.
Now that we have imported the library, we can use all the objects within it.
As an example, some useful functions in Pandas are:


| Function                         | Description                                                                            | Example                                            |
|:---------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------------------------|
| pd.read_csv()                    | Reads data from a CSV file and creates a `DataFrame`.                                    | pd.read_csv('data.csv')                            |
| pd.DataFrame()                   | Creates a `DataFrame` from various data structures (lists, dictionaries, etc.).          | pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']}) |
| pd.DataFrame.head()                        | Displays the first few rows of the `DataFrame` (default is 5).                           | df.head()                                          |
| pd.DataFrame.tail()                        | Displays the last few rows of the `DataFrame` (default is 5).                            | df.tail(10)                                        |
| pd.DataFrame.info()                        | Provides a concise summary of the `DataFrame`, including data types and non-null values. | df.info()                                          |
| pd.DataFrame.describe()                    | Generates descriptive statistics of the numerical columns in the `DataFrame`.            | df.describe()                                      |
| pd.DataFrame['column_name']                | Selects a specific column by its name.                                                 | df['Name']                                         |
| pd.DataFrame.loc[row_label, col_label]     | Accesses a group of rows and columns by label(s).                                      | df.loc[0, 'Age']                                   |
| pd.DataFrame.iloc[row_index, col_index]    | Accesses a group of rows and columns by integer position(s).                           | df.iloc[2, 1]                                      |
| pd.DataFrame.groupby('column_name')        | Groups rows based on the values in a specified column for aggregation.                 | df.groupby('Category')['Value'].mean()             |
| pd.DataFrame.sort_values(by='column_name') | Sorts the `DataFrame` by the values in one or more columns.                              | df.sort_values(by='Date', ascending=False)         |
| pd.DataFrame.dropna()                      | Removes rows with missing values (`NaN`).                                                | df.dropna()                                        |
| pd.DataFrame.ffill()                       | Fills missing values by propagating the last valid observation forward to the next valid observation. | df.ffill()                                         |

#### Getting help and inspiration

One of the most important skills is being able to look up, understand, and get help on any function of a library.
Usually, we start with some inspiration, as we did above with the plot: someone may have posted something that you would like to reproduce with a twist or a small change.
This is generally a good starting point.
Afterwards, however, you will not have documentation for every function at hand, so you need to be able to find it and understand what a function requires. Sometimes you even need to know more about the inner workings of a function's implementation.

There are several ways to access documentation.
One way, assuming the package is well maintained, is to find its documentation website.
For pandas, this is a great place to find details on functions, see what changed between versions, and explore alternatives to a given function.

[Pandas Documentation](https://pandas.pydata.org/docs/index.html)

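If it is a smaller library without a documentation website, you can list all top-level names (functions, classes, and submodules) with `dir`, as in

```python
print(dir(pd))
```

However, this will only give you the top-level names, not the methods of objects such as `DataFrame`.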

Many environments offer code completion, or you may have an AI copilot to help you find functions.
If you are struggling with a specific function, you can print its signature and docstring with `help(function)` or, in Jupyter, `function?`.
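The same information can also be retrieved programmatically with Python's standard `inspect` module. The sketch below is one possible way to do this; it uses `pd.DataFrame.head` only as an example target.

```python
import inspect

import pandas as pd

# The signature shows the parameter names and their default values
signature = inspect.signature(pd.DataFrame.head)
print(signature)

# The docstring is the text that help() displays; print only its first line
docstring = inspect.getdoc(pd.DataFrame.head)
print(docstring.splitlines()[0])

# dir() also returns private names; filter out those starting with "_"
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names), "public top-level names, for example:", public_names[:5])
```

In a notebook, `help(pd.DataFrame.head)` is usually the quicker route; `inspect` is handy when you want to search or filter the results in code.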

When reading the documentation, you might get overwhelmed.
Keep a lookout for the function parameters, many of which may be optional; good documentation also tends to include an example that gives you a feel for the function.

For most of this tutorial we will give you the information you need about a function. If you want to know more, use these tools to find out.

## First step: Data import and exploration

Just getting your data from a file into a variable you can work with can already be a headache.
How do I read the file, which delimiter do I choose, and what encoding does the file have?

We will use the `pd.read_csv` function from pandas to read "The World Happiness Report", a study of how people rate their happiness in different countries.
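To see how the delimiter and the encoding interact with `pd.read_csv`, here is a small self-contained sketch. It does not touch the tutorial's data file: the rows are invented, the semicolon delimiter is chosen only for illustration, and the text is encoded in memory as latin1.

```python
import io

import pandas as pd

# Made-up content; "ô" is stored as a single byte in latin1
text = "Country name;year;Life Ladder\nCôte d'Ivoire;2020;5.3\nFinland;2020;7.9\n"
raw_bytes = text.encode("latin1")

# sep tells pandas how the columns are separated,
# encoding tells it how to turn the bytes back into characters
df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin1")

print(df)
print(df.dtypes)
```

If the encoding does not match the file, non-ASCII characters such as the "ô" above typically either raise a decoding error or come out garbled, which is a good hint to try a different `encoding` value.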

## Exercise reading in data

In the cell below, write the code that solves the first exercise:

  -  Use `path_to_happiness`, which will be `data/data_exploration/World-happiness-report-updated_2024.csv`, the path to the CSV file to read in.
  -  Read the CSV into a dataframe and return it as a `pd.DataFrame`.
  -  Because of how the `.csv` file is formatted, you must set the encoding to latin1 with `encoding='latin1'`.

In [None]:
%reload_ext tutorial.tests.testsuite

In [None]:
%%ipytest

import pandas as pd
import numpy as np
def solution_read_in_dataframe(path_to_happiness: str) -> pd.DataFrame:
    """
    Reads in a CSV file containing happiness data and returns it as a pandas DataFrame.

    Instructions:
        - Use the `path_to_happiness` which will be `data/data_exploration/World-happiness-report-updated_2024.csv`.
        - Read in the CSV into a DataFrame using `pd.read_csv`.
        - Ensure the encoding is set to 'latin1' as the file is formatted accordingly.

    Args:
        path_to_happiness (str): Path to the CSV file containing the happiness data.

    Returns:
        pd.DataFrame: A DataFrame containing the happiness data.
    """
    # Your code starts here
    return
    # Your code ends here

## Playground


In [None]:
happyness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')
happyness.describe()


Use the playground above to explore the dataset with functions like the following:

| Function                                      | Description                                                                                                | Example                                      |
|:----------------------------------------------|:-----------------------------------------------------------------------------------------------------------|:---------------------------------------------|
| `pd.DataFrame.head()`                         | Displays the first few rows of the DataFrame (default is 5).                                               | `df.head()`                                  |
| `pd.DataFrame.tail()`                         | Displays the last few rows of the DataFrame (default is 5).                                                | `df.tail(10)`                               |
| `pd.DataFrame.info()`                         | Provides a concise summary of the DataFrame, including data types and non-null values.                      | `df.info()`                                  |
| `pd.DataFrame.describe()`                      | Generates descriptive statistics of the numerical columns in the DataFrame.                                  | `df.describe()`                              |
| `pd.DataFrame['column_name']`                 | Selects a specific column by its name.                                                                      | `df['Name']`                                 |
| `pd.DataFrame.loc[row_label, col_label]`      | Accesses a group of rows and columns by label(s).                                                          | `df.loc[0, 'Age']`                            |
| `pd.DataFrame.iloc[row_index, col_index]`      | Accesses a group of rows and columns by integer position(s).                                               | `df.iloc[2, 1]`                              |
| `pd.DataFrame.groupby('column_name')`          | Groups rows based on the values in a specified column for aggregation.                                     | `df.groupby('Category')['Value'].mean()`      |
| `pd.DataFrame.shape`                           | Returns a tuple representing the dimensionality of the DataFrame (number of rows, number of columns).      | `df.shape`                                   |
| `pd.DataFrame.columns`                         | Returns the column labels of the DataFrame.                                                                | `df.columns`                                 |
| `pd.DataFrame.index`                           | Returns the index (row labels) of the DataFrame.                                                           | `df.index`                                   |
| `pd.DataFrame.dtypes`                          | Returns the data type of each column.                                                                      | `df.dtypes`                                  |
| `pd.DataFrame.values`                          | Returns a NumPy representation of the DataFrame.                                                           | `df.values`                                  |
| `pd.DataFrame.nunique()`                       | Returns the number of unique values in each column.                                                        | `df['City'].nunique()`                       |
| `pd.DataFrame['column_name'].value_counts()`   | Returns a Series containing counts of unique values in a column.                                           | `df['Status'].value_counts()`                  |
| `pd.DataFrame.sort_values(by='column_name')`   | Sorts the DataFrame by the values in a specified column.                                                   | `df.sort_values(by='Date')`                    |
| `pd.DataFrame.sort_index()`                    | Sorts the DataFrame by its index.                                                                         | `df.sort_index()`                             |
| `pd.DataFrame.isna().sum()`                    | Returns the number of missing (NaN) values in each column.                                                 | `df.isna().sum()`                             |
| `pd.DataFrame.duplicated().sum()`              | Returns the number of duplicate rows in the DataFrame.                                                      | `df.duplicated().sum()`                       |
| `pd.DataFrame['column_name'].unique()`        | Returns a NumPy array of the unique values in a column.                                                     | `df['Country'].unique()`                      |
| `pd.DataFrame.sample(n=5)`                     | Returns a random sample of items from the DataFrame (default is 1).                                        | `df.sample(n=10)`                             |
| `pd.DataFrame.filter(items=['col1', 'col3'])`  | Subset the dataframe columns based on the specified items (labels).                                       | `df.filter(items=['Product', 'Price'])`       |
| `pd.DataFrame.filter(like='rate', axis=1)`     | Subset the dataframe columns whose labels contain the specified substring (using 'like').                 | `df.filter(like='temp', axis=1)`              |
| `pd.DataFrame.filter(regex='^A', axis=1)`      | Subset the dataframe columns based on the specified regular expression (using 'regex').                   | `df.filter(regex='^ID', axis=1)`               |
| `pd.DataFrame.nlargest(n, 'column_name')`     | Returns the first n rows ordered by columns in descending order.                                          | `df.nlargest(3, 'Revenue')`                   |
| `pd.DataFrame.nsmallest(n, 'column_name')`    | Returns the first n rows ordered by columns in ascending order.                                           | `df.nsmallest(2, 'Cost')`                     |
| `pd.DataFrame.corr(numeric_only=True)`        | Computes pairwise correlation of columns, excluding NA/null values unless the entire row/column is NA.    | `df.corr(numeric_only=True)`                  |
| `pd.DataFrame.cov(numeric_only=True)`         | Computes pairwise covariance of columns, excluding NA/null values.                                         | `df.cov(numeric_only=True)`                   |
| `pd.DataFrame.memory_usage(deep=True)`       | Returns the memory usage of each column in bytes. The `deep=True` argument provides a more accurate estimate.| `df.memory_usage(deep=True)`                  |

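The three variants of `filter` in the table are easy to mix up, so here is a brief sketch of how they differ, together with `nlargest`. The column names and values are made up for illustration; only `Life Ladder` and `Country name` are taken from this tutorial.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame(
    {
        "Country name": ["A", "B", "C"],
        "Life Ladder": [7.0, 5.5, 6.2],
        "Log GDP per capita": [10.8, 9.1, 9.9],
        "Positive affect": [0.8, 0.6, 0.7],
        "Negative affect": [0.2, 0.3, 0.25],
    }
)

# items: keep exactly the listed column labels
by_items = df.filter(items=["Country name", "Life Ladder"])

# like: keep columns whose label contains the substring
by_like = df.filter(like="affect", axis=1)

# regex: keep columns whose label matches the regular expression
by_regex = df.filter(regex="^L", axis=1)

# nlargest: the n rows with the highest values in a column
top2 = df.nlargest(2, "Life Ladder")

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
print(top2["Country name"].tolist())
```

Note that `filter` selects by label, not by the values in the cells; to select rows by value, use boolean indexing such as `df[df["Life Ladder"] > 6]`.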


### Data exploration

After playing around a bit with some functions, this is what an initial exploration of the data set could look like.
Run the next cell.

In [None]:
import pandas as pd

happyness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')

# Assuming your dataframe is loaded into 'df'
df = happyness
print("--- First few rows of the dataframe ---")
print(df.head(2))
print("\n")

print("--- Summary information about the dataframe ---")
df.info()
print("\n")

print("--- Descriptive statistics for numerical columns ---")
print(df.describe())
print("\n")

print("--- Checking for unique values in each column ---")
for col in df.columns:
    unique_values = df[col].nunique()
    print(f"Column '{col}' has {unique_values} unique values.")
print("\n")

print("--- Checking for NaN (missing) values in each column ---")
print(df.isnull().sum())
print("\n")

# Example: Checking the span of a 'year' column (if it exists)
if 'year' in df.columns:
    min_year = df['year'].min()
    max_year = df['year'].max()
    print(f"--- Span of the 'year' column ---")
    print(f"Minimum year: {min_year}")
    print(f"Maximum year: {max_year}")
    print(f"Year range: {max_year - min_year} years")
    print("\n")



# Building the plot from scratch

The plot you saw at the beginning is part of a Plotly tutorial.
Many parts of this tutorial are heavily inspired by this [Kaggle project](https://www.kaggle.com/code/aashita/guide-to-animated-bubble-charts-using-plotly).
You can find many more examples on that website or elsewhere online.
For the source material, see the [References](#References) section.
Let's break down the steps we will go through in this notebook:

- [Finding the limits](#Finding-the-limits)
- [Cleaning missing values](#Cleaning-missing-values)
- [Adding regional indicator](#Adding-regional-indicator)
- [Plotting basic scatter plot](#Plotting-basic-scatter-plot)
- [Making frames per year](#Making-frames-per-year)
- [Adding pause-play button](#Adding-pause-play-button)
- [Using bubble size as a variable](#Using-bubble-size-as-a-variable)
- [Classify into categories](#Classify-into-categories)

## Finding the limits

Before plotting, it is important to know the range and order of magnitude of the data.
In our case, for example, we want an animated plot over several years, so it helps to know for which years we actually have data.
In a dataframe we can use, for example, the `.min()` and `.max()` methods.
Optionally, to understand the distribution of the time values, you can plot the years and check the rough distribution to identify any anomalies or gaps in the data.
This can be done with a histogram or a line plot that visualizes the frequency or trend of the time values over the range.

We use `matplotlib.pyplot` (imported as `plt`) to display the histogram, because its `hist` function does exactly that.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

happiness = pd.read_csv('data/data_exploration/World-happiness-report-updated_2024.csv', encoding='latin1')
years = happiness['year'].unique()
print(f"Unique years in the dataset: {sorted(years)}")

df = happiness
df['year'] = happiness['year'].astype(int)  # Ensure the years are integers

# Determine the minimum and maximum years
min_year = df['year'].min()
max_year = df['year'].max()
number_of_bins = max_year - min_year + 1

# Plot the histogram of the years
plt.figure(figsize=(10, 6))  # Adjust figure size for better readability
plt.hist(df['year'], bins=number_of_bins, edgecolor='grey') # Adjust bins as needed
plt.title('Histogram of Years')
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

## Cleaning missing values

Pandas provides several flexible methods for handling missing data, represented as `NaN`. 
You can identify missing values using `.isna()` or `.isnull()`, and then choose a strategy: `.dropna()` removes rows or columns with missing values, `.fillna()` replaces them, and `.ffill()` propagates the last valid observation forward to fill in the missing values.
For example, `df.ffill()` will replace a `NaN` with the value from the most recent previous row that has a non-`NaN` value.
You can also fill missing values with a specific value (such as the mean, the median, or a constant).
For time series data, you might use interpolation with `.interpolate()` to fill gaps.
The best approach depends on the nature of the data and the goal of your analysis.
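The strategies above can be compared side by side on a tiny example. This is only a sketch on an invented Series with two gaps, not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# A made-up Series with two missing values in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

n_missing = s.isna().sum()      # count the missing values

dropped = s.dropna()            # remove the missing entries
constant = s.fillna(0.0)        # replace with a constant
with_mean = s.fillna(s.mean())  # replace with the mean of the valid values
forward = s.ffill()             # repeat the last valid value
interpolated = s.interpolate()  # linear interpolation between valid values

print("missing:     ", n_missing)
print("dropna:      ", dropped.tolist())
print("fillna(0):   ", constant.tolist())
print("fillna(mean):", with_mean.tolist())
print("ffill:       ", forward.tolist())
print("interpolate: ", interpolated.tolist())
```

Each strategy produces a different result from the same input, which is why the choice should follow from what the missing values mean in your data.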

For this step, we will try to forward-fill the dataframe:
```python
cleaned_happiness = happiness.sort_values(by=['Country name', 'year']).ffill()
```
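One thing to watch out for with a plain `.ffill()` on a sorted dataframe is that values can be carried across the boundary from one country to the next. The sketch below shows this on invented data and contrasts it with a per-country forward fill via `groupby`. This is a side note rather than the approach used in the exercise, where filling the first year with a constant avoids the problem.

```python
import numpy as np
import pandas as pd

# Made-up data: country B has no value for its first year
toy = pd.DataFrame(
    {
        "Country name": ["A", "A", "B", "B"],
        "year": [2005, 2006, 2005, 2006],
        "Life Ladder": [5.0, np.nan, np.nan, 6.0],
    }
).sort_values(by=["Country name", "year"])

# Plain forward fill: B's 2005 value is taken from country A
plain = toy.ffill()

# Per-country forward fill: values never cross a country boundary
grouped = toy.copy()
grouped["Life Ladder"] = toy.groupby("Country name")["Life Ladder"].ffill()

print(plain)
print(grouped)
```

In the grouped version the first year of country B stays `NaN`, which makes the remaining gap visible instead of silently filling it with another country's value.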

### Exercise: Complete Happiness

In this exercise, we want to complete the dataframe with missing values.
Complete the function below to 

1. Fill in missing years for every country (so we have an entry for every year between 2005 and 2023 and every country).
   Do this by initializing a DataFrame with `pd.DataFrame()` with a list.
   Then left-merge the happiness dataframe to it with `pd.merge()`.
2. Fill all missing values in the year 2005 with the value 1.
   Use the `.fillna()` function.
3. Forward-fill all the remaining years with the function `.ffill()`.
   (For a forward fill, the order of the rows matters! Make sure to sort first.)

In [None]:
%reload_ext tutorial.tests.testsuite

In [None]:
%%ipytest

import pandas as pd
import numpy as np
def solution_clean_dataset(happiness_df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the dataset by adding missing year and country values

        1. Add in missing years for every country
        2. Fill the minimum year with values of 1
        3. Forward fill the rest of the years

    Args:
        happiness_df : DataFrame containing the happiness data

    Returns:
        - Cleaned DataFrame with missing values filled
    """
    # Your code starts here
    return 
    # Your code ends here

Data cleaning is crucial in data analysis, but several pitfalls exist.
Here's a summary of common mistakes and how to avoid them:

1. Incorrectly Handling Missing Values.
   Replacing `NaN` values with the mean can be misleading, especially with skewed data.
   Consider using the median or more advanced imputation techniques, and understand the reason for missingness.
   Pandas tools like `fillna()`, `dropna()`, and `interpolate()` are essential here.
2. Removing Outliers Without Investigation.
   Avoid automatically deleting outliers.
   Visualize the data to determine if outliers are genuine extreme values or errors.
   If genuine, they may be important for analysis.  Use boolean indexing with summary statistics to handle them in Pandas.
3. Ignoring Data Types.
   Ensure columns have the correct data type.
   Use `df.info()` to check and convert columns with `pd.to_numeric()`, `pd.to_datetime()`, or `astype()`.
4. Not Handling Duplicates Carefully.
   Investigate the source of duplicate rows before removing them.
   They may indicate data entry errors or represent significant repeated measurements.
   Pandas provides `duplicated()` and `drop_duplicates()` for this purpose.
5. Applying Transformations Incorrectly.
   Scaling data without considering outliers can lead to issues.
   If scaling is necessary, consider robust scalers (like `RobustScaler` from `scikit-learn`) that are less affected by outliers.
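A few of the pitfalls above can be demonstrated on a small invented table. The column names and values are hypothetical and chosen so that the effects are easy to see.

```python
import pandas as pd

# Made-up data: numbers stored as text, one invalid entry, one extreme value
df = pd.DataFrame(
    {
        "income": ["10", "12", "11", "n/a", "500", "12"],
        "city": ["X", "Y", "X", "Y", "X", "Y"],
    }
)

# Pitfall 3 (data types): convert text to numbers; invalid entries become NaN
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Pitfall 1 (missing values): the extreme value pulls the mean far more than the median
mean_income = df["income"].mean()
median_income = df["income"].median()
filled = df["income"].fillna(median_income)

# Pitfall 4 (duplicates): look at all copies of duplicated rows before dropping any
duplicates = df[df.duplicated(keep=False)]

print(f"mean: {mean_income}, median: {median_income}")
print(filled.tolist())
print(duplicates)
```

Here the mean is dominated by the single value 500, while the median stays close to the typical values, so filling the gap with the median distorts the column less.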

## Adding regional indicator

We also want to group the countries. 
To add a regional indicator, we'll need another dataset that maps countries to their respective regions.
We can then merge this data with our main happiness dataframe based on the 'Country name' column.

Let's assume you have a CSV file named `country_region_mapping.csv` in your `data/plotly_intro/` directory with columns `'Country name'` and `'Region indicator'`.
Then we could merge the dataframes on the country name and get a regional indicator for all the countries.

Let's explore the `pd.merge` function for that.
The concept of merging tables comes from SQL joins. If you are not familiar with those, for now keep in mind that we want to do a **left merge**, with the happiness table as the left table and the region mapping as the right table, and that we merge **on** a column they have in common.

```python
pd.merge?

Signature:
pd.merge(
    left: 'DataFrame | Series',
    right: 'DataFrame | Series',
    how: 'MergeHow' = 'inner',
    on: 'IndexLabel | AnyArrayLike | None' = None,
    left_on: 'IndexLabel | AnyArrayLike | None' = None,
    right_on: 'IndexLabel | AnyArrayLike | None' = None,
    left_index: 'bool' = False,
    right_index: 'bool' = False,
    sort: 'bool' = False,
    suffixes: 'Suffixes' = ('_x', '_y'),
    copy: 'bool | None' = None,
    indicator: 'str | bool' = False,
    validate: 'str | None' = None,
) -> 'DataFrame'
```
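To get a feel for what a left merge does, here is a sketch on two invented tables. The `indicator=True` parameter from the signature above adds a `_merge` column that shows which rows found a match. The column names follow the ones assumed in this section; the rows are made up.

```python
import pandas as pd

# Made-up left table (happiness) and right table (region mapping)
happiness = pd.DataFrame(
    {
        "Country name": ["A", "B", "C"],
        "Life Ladder": [7.0, 5.5, 6.2],
    }
)
regions = pd.DataFrame(
    {
        "Country name": ["A", "B", "D"],
        "Region indicator": ["North", "South", "East"],
    }
)

# Left merge: keep every row of the left table, add matching columns from the right
merged = pd.merge(happiness, regions, how="left", on="Country name", indicator=True)
print(merged)

# Rows of the left table without a partner in the right table
unmatched = merged.loc[merged["_merge"] == "left_only", "Country name"].tolist()
print("Countries without a region:", unmatched)
```

Country C is kept with a missing region because it appears only in the left table, while country D is dropped because it appears only in the right table.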

### Exercise: Final Happiness

In this exercise, we want to add the regional indicators to the dataframe:

  1) Merge the `region_df` with `cleaned_happiness_df`.
  2) Fill in missing values with 'Unknown'.

In [None]:
%reload_ext tutorial.tests.testsuite

In [None]:
%%ipytest

import pandas as pd
import numpy as np
def solution_add_regional_indicator(cleaned_happiness_df: pd.DataFrame, region_df: pd.DataFrame) -> pd.DataFrame:
    """Adds a regional indicator to the dataset

        1. Merge the cleaned_happiness_df with region_df on the 'Country name' and 'year' columns
        2. Fill the missing values in the 'Region indicator' column with 'Unknown'

    Args:
        cleaned_happiness_df : DataFrame containing the happiness data
        region_df : DataFrame containing the region data

    Returns:
        - DataFrame with the regional indicator added
    """
    # Your code starts here
    return
    # Your code ends here

## Plotting basic scatter plot

Scatter plots from the Plotly library are the foundation we will be building upon.
They transform our data into a visual representation.
For that to work, we have to follow the exact structure that Plotly expects for its configuration parameters.

We work with the `iplot` function (interactive plot) from the Plotly library, which takes a figure object: a dictionary containing the data, the layout, and the frames.

```python
figure = {
    'data': list[trace],
    'layout': dict,
    'frames': list[frame],
}
frame = {
    'data': list[trace],
    'name': str,
}
```

The `data` entry holds what the plot shows initially, and the `frames` are the animation steps.

Let's first create a simple scatter plot. For that, we fill the figure's `data` with a trace (a `dict`) that contains an array of values for `x`, an array of values for `y`, a `mode` (`'markers'`) and an array of strings for `text`, which is what appears when you hover over a point.

```python
trace = {
    'x': list[int],
    'y': list[int],
    'mode': 'markers',
    'text': list[str],
    'type': 'scatter'
}
```
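
To make the schema concrete, here is a minimal sketch that fills it with made-up values. The helper `make_trace` is hypothetical (it is not part of Plotly or of the tutorial's helper module); it only builds the plain dictionary and checks that the three arrays have the same length.

```python
def make_trace(x_values, y_values, labels):
    """Build a scatter trace dictionary (hypothetical helper for illustration)."""
    if not (len(x_values) == len(y_values) == len(labels)):
        raise ValueError("x, y and text must have the same length")
    return {
        "x": list(x_values),
        "y": list(y_values),
        "mode": "markers",
        "text": list(labels),
        "type": "scatter",
    }

# Made-up values, just to show the shape of the dictionaries
trace = make_trace([0.5, 0.8], [4.2, 7.1], ["Country A", "Country B"])
figure = {"data": [trace], "layout": {}, "frames": []}
print(figure["data"][0]["text"])
```

A figure built this way is an ordinary Python dictionary, so you can inspect it with `print` before handing it to `iplot`.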

In [None]:
# Define the dataset and the columns
from tutorial.data_exploration_helper import get_happiness_data, get_clean_dataset_with_region
from plotly.offline import iplot
dataset = get_clean_dataset_with_region(get_happiness_data())
x_column = 'Freedom to make life choices'
y_column = 'Life Ladder'
description_column = 'Country name'
# time_column = 'year'



# Define figure
figure = {
    'data': [],
    'layout': {},
    'frames': []
}

# Pick one year present in the dataset
year = 2010

# Make the trace
trace = {
    'x': list(dataset.loc[dataset['year'] == year, x_column]), 
    'y': list(dataset.loc[dataset['year'] == year, y_column]),
    'mode': 'markers',
    'text': list(dataset.loc[dataset['year'] == year, description_column]),
    'type': 'scatter',
}

# Use the trace as the figure's only data entry
figure['data'] = [trace]

# Plot the figure
iplot(figure)

## Making frames per year

Next, we want to animate the graph with a slider over the years.
This is essentially the same as making a scatter plot for every year and then adding the slider.
So it makes sense to wrap the trace step from before in a function and then fill the frames with one trace per year.
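
One possible way to build such frames is sketched below with `DataFrame.groupby` on a made-up table. The function name `frames_by_group` and the column names are assumptions for illustration; the tutorial's own cell filters the dataset per year instead, which gives the same kind of result.

```python
import pandas as pd

# Made-up data with the same kind of columns as the tutorial's dataset
toy = pd.DataFrame({
    "year": [2011, 2010, 2010, 2011],
    "x": [0.6, 0.5, 0.7, 0.8],
    "y": [5.0, 4.8, 6.1, 6.3],
    "label": ["A", "A", "B", "B"],
})

def frames_by_group(df, time_column, x_column, y_column, text_column):
    """Return one frame per distinct value of `time_column`, in sorted order."""
    frames = []
    # groupby sorts the group keys by default, so frames come out in time order
    for value, group in df.groupby(time_column):
        trace = {
            "x": group[x_column].tolist(),
            "y": group[y_column].tolist(),
            "mode": "markers",
            "text": group[text_column].tolist(),
            "type": "scatter",
        }
        frames.append({"data": [trace], "name": str(value)})
    return frames

frames = frames_by_group(toy, "year", "x", "y", "label")
print([frame["name"] for frame in frames])
```

Grouping once avoids filtering the full table three times per year, which matters little here but keeps the function short.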

In [None]:
from tutorial.data_exploration_helper import get_happiness_data, get_clean_dataset_with_region, get_scatter_figure
from plotly.offline import iplot

dataset = get_clean_dataset_with_region(get_happiness_data())
x_column = 'Freedom to make life choices'
y_column = 'Life Ladder'
description_column = 'Country name'
# time_column = 'year'
figure = get_scatter_figure(dataset, x_column, y_column, description_column)

def frame_by_year(dataset, year, x_column, y_column, description_column):
    """Make a frame containing a single trace for a given year"""
    # Make a trace
    trace = {
        'x': list(dataset.loc[dataset['year'] == year, x_column]), 
        'y': list(dataset.loc[dataset['year'] == year, y_column]),
        'mode': 'markers',
        'text': list(dataset.loc[dataset['year'] == year, description_column]),
        'type': 'scatter'
    }
    frame = {
        'data': [trace],
        'name': str(year)
    }
    return frame

# Get the years
years = dataset['year'].unique()
# Sort the years
years.sort()


# Build one frame per year
figure['frames'] = [frame_by_year(dataset, year, x_column, y_column, description_column) for year in years]
iplot(figure)

### Adding slider bar for time scale

The slider needs some configuration. Working out exactly which settings you need takes a bit of reading, unless you have an example whose functions you can reuse.
The following is heavily inspired by the module [`bubbly`](https://github.com/AashitaK/bubbly).
It is purely configuration and contains only the year values.
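
The slider and the frames are linked by name: the first entry of a step's `args` lists the frame names to animate to. The sketch below, with hypothetical helpers `make_step` and `unmatched_steps` and a toy figure, shows one way to check that every slider step points at a frame that actually exists.

```python
def make_step(frame_name, duration=300):
    """Slider step that jumps to the frame called `frame_name`."""
    return {
        "args": [
            [frame_name],  # names of the frames to animate to
            {
                "frame": {"duration": duration, "redraw": False},
                "mode": "immediate",
                "transition": {"duration": duration},
            },
        ],
        "label": frame_name,
        "method": "animate",
    }

def unmatched_steps(figure):
    """Return labels of slider steps whose target frame name is missing."""
    frame_names = {frame["name"] for frame in figure.get("frames", [])}
    missing = []
    for slider in figure.get("layout", {}).get("sliders", []):
        for step in slider.get("steps", []):
            targets = step["args"][0]
            if any(str(target) not in frame_names for target in targets):
                missing.append(step["label"])
    return missing

# Toy figure with two frames but three slider steps
toy_figure = {
    "data": [],
    "layout": {"sliders": [{"steps": [make_step(n) for n in ["2010", "2011", "2012"]]}]},
    "frames": [{"data": [], "name": "2010"}, {"data": [], "name": "2011"}],
}
print(unmatched_steps(toy_figure))
```

A check like this can help when the slider moves but the plot does not change, which is often a sign of a mismatch between step targets and frame names.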

In [None]:
from tutorial.data_exploration_helper import full_clean_dataset, get_scatter_figure_with_years
from plotly.offline import iplot

dataset = full_clean_dataset()
x_column = 'Freedom to make life choices'
y_column = 'Life Ladder'
description_column = 'Country name'
figure = get_scatter_figure_with_years(dataset, x_column, y_column, description_column)


years = dataset['year'].unique()
years.sort()

figure['layout']['sliders'] = {
    'args': [
        'slider.value', {
            'duration': 400,
            'ease': 'cubic-in-out'
        }
    ],
    'initialValue': min(years),
    'plotlycommand': 'animate',
    'values': years,
    'visible': True
}
sliders_dict = {
    'active': 0,
    'yanchor': 'top',
    'xanchor': 'left',
    'currentvalue': {
        'font': {'size': 20},
        'prefix': 'Year:',
        'visible': True,
        'xanchor': 'right'
    },
    'transition': {'duration': 300, 'easing': 'cubic-in-out'},
    'pad': {'b': 10, 't': 50},
    'len': 0.9,
    'x': 0.1,
    'y': 0,
    'steps': []
}

def slider_step(year):
    '''Creates a slider step.'''

    step = {
        'args': [
            [year],
            {
                'frame': {'duration': 300, 'redraw': False},
                'mode': 'immediate',
                'transition': {'duration': 300}
            }
        ],
        'label': str(year),
        'method': 'animate'
    }
    return step

sliders_dict['steps'] = [slider_step(year) for year in years]
figure['layout']['sliders'] = [sliders_dict]
iplot(figure)

## Adding pause-play button

We give you the buttons for free! (Run the cell above first.)

In [None]:
figure['layout']['updatemenus'] = [
    {
        'buttons': [
            {
                'args': [None, {'frame': {'duration': 500, 'redraw': False},
                         'fromcurrent': True, 'transition': {'duration': 300, 
                                                             'easing': 'quadratic-in-out'}}],
                'label': 'Play',
                'method': 'animate'
            },
            {
                'args': [[None], {'frame': {'duration':0, 'redraw': False}, 'mode': 'immediate',
                'transition': {'duration': 0}}],
                'label': 'Pause',
                'method': 'animate'
            }
        ],
        'direction': 'left',
        'pad': {'r': 10, 't': 87},
        'showactive': False,
        'type': 'buttons',
        'x': 0.1,
        'xanchor': 'right',
        'y': 0,
        'yanchor': 'top'
    }
]
iplot(figure)

## Using bubble size as a variable

Now we build on the interactive graph above by letting the bubble size represent another variable: the `Log GDP per capita`.
The size of the bubble is controlled by the `marker` attribute of each trace.
The marker should have the format:

```python
trace['marker'] = {
    'sizemode': 'area',
    'sizeref': int,
    'size': list[int]
}
```
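
The value of `sizeref` does not have to be an integer. As a small sketch, Plotly's bubble chart documentation suggests a rule of thumb for `sizemode: 'area'` that derives `sizeref` from the largest size value and the desired maximum marker size in pixels. The helper name `area_sizeref` and the numbers below are made up for illustration.

```python
def area_sizeref(sizes, max_marker_px=40):
    """Rule-of-thumb sizeref for sizemode='area': 2 * max(size) / max_marker_px**2."""
    return 2.0 * max(sizes) / (max_marker_px ** 2)

sizes = [10, 160, 400]  # made-up size values
marker = {
    "sizemode": "area",
    "sizeref": area_sizeref(sizes),
    "size": sizes,
}
print(marker["sizeref"])
```

With these numbers the largest bubble is drawn at roughly the requested maximum size, and all others are scaled relative to it.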

First, we need to make sure that the bubbles neither blow up the plot nor are so tiny that we cannot see them.
Sizes from roughly 1 to 500 work well.
So we take the `Log GDP per capita` and scale it to approximately that range.
We also apply numpy's exponential function `np.exp` to the values, which undoes the compression of the logarithm so that differences between countries become visible in the bubble sizes.
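
As a minimal sketch of mapping values exactly onto a target range, the hypothetical helper `rescale` below applies a linear min-max scaling. This is an illustration with made-up numbers; the tutorial's own cell uses a slightly different formula that only approximates the range.

```python
import numpy as np

def rescale(values, new_min=1.0, new_max=500.0):
    """Linearly map `values` so the smallest becomes new_min and the largest new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # All values identical: avoid dividing by zero
        return np.full_like(values, new_min)
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

log_gdp = np.array([7.0, 9.0, 11.0])  # made-up log GDP values
sizes = rescale(np.exp(log_gdp))      # undo the log first, then rescale
print(sizes.min(), sizes.max())
```

Because the exponential grows quickly, most values end up near the lower end of the range, which is why only the richest countries get very large bubbles.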

In [None]:
from tutorial.data_exploration_helper import get_happiness_data, get_clean_dataset_with_region
import pandas as pd
import numpy as np

complete_happiness_df = get_clean_dataset_with_region(get_happiness_data())

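# Copy the columns so that the assignment below does not act on a view of the original DataFrame
log_gdp_df = complete_happiness_df[['Country name', 'year', 'Log GDP per capita']].copy()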
# get global min without 1
log_gdp_df_without_1 = log_gdp_df[log_gdp_df['Log GDP per capita'] != 1]
global_min_log_gdp_per_country = log_gdp_df_without_1['Log GDP per capita'].min()
# replace 1 with global min
log_gdp_df.loc[log_gdp_df['Log GDP per capita'] == 1, 'Log GDP per capita'] = global_min_log_gdp_per_country
global_max_log_gdp_per_country = log_gdp_df['Log GDP per capita'].max()


resized_log_gdp_df = log_gdp_df.copy()
# Exponentiate and scale so that the largest bubble size is roughly 500
resized_log_gdp_df['Resized Log GDP per capita'] = np.exp(log_gdp_df['Log GDP per capita']) * 500 / (np.exp(global_max_log_gdp_per_country) - np.exp(global_min_log_gdp_per_country))

print(f"Global min log GDP per country: {global_min_log_gdp_per_country}")
print(f"Global max log GDP per country: {global_max_log_gdp_per_country}")

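# Add only the new column to the full dataset (avoids duplicated 'Log GDP per capita' columns)
dataset = pd.merge(complete_happiness_df, resized_log_gdp_df[['Country name', 'year', 'Resized Log GDP per capita']], on=['Country name', 'year'], how='left')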
# Check out year 2010
dataset[dataset['year'] == 2010].head(10)


In [None]:
from tutorial.data_exploration_helper import set_layout, full_clean_dataset
from plotly.offline import iplot


dataset = full_clean_dataset()
x_column = 'Freedom to make life choices'
y_column = 'Life Ladder'
description_column = 'Country name'
time_column = 'year'

# Define the new variable
bubble_size_column = 'Resized Log GDP per capita'
category_column = 'Regional indicator'

# Make the grid (the slider needs it, so define it before the layout)
years = dataset[time_column].unique()
years.sort()

# Set the layout
figure = set_layout(x_title='Freedom to make life choices', y_title='Life Ladder',
            title='Happiness Indicators', x_logscale=False, y_logscale=False, 
            show_slider=True, slider_scale=years, show_button=True, show_legend=False, 
            height=650)
    

# Add the base frame
year = min(years)
trace = {
    'x': list(dataset.loc[dataset['year'] == year, x_column]), 
    'y': list(dataset.loc[dataset['year'] == year, y_column]),
    'mode': 'markers',
    'text': list(dataset.loc[dataset['year'] == year, description_column]),
    'marker': {
        'size': list(dataset.loc[dataset['year'] == year, bubble_size_column]),
        'sizemode': 'area',
        'sizeref': 1,
    },
    'type': 'scatter'
}
figure['data'].append(trace)


def frame_by_year_with_size(dataset, year, x_column, y_column, description_column):
    """Make a frame for a given year, sized by the global `bubble_size_column`"""
    # Make a trace
    trace = {
        'x': list(dataset.loc[dataset['year'] == year, x_column]), 
        'y': list(dataset.loc[dataset['year'] == year, y_column]),
        'mode': 'markers',
        'text': list(dataset.loc[dataset['year'] == year, description_column]),
        'marker': {
            'size': list(dataset.loc[dataset['year'] == year, bubble_size_column]),
            'sizemode': 'area',
            'sizeref': 1,
        },
        'type': 'scatter'
    }
    frame = {
        'data': [trace],
        'name': str(year)
    }
    return frame


# Add time frames
figure['frames'] = [frame_by_year_with_size(dataset, year, x_column, y_column, description_column) for year in years]


# Set the layout once more
figure['layout']['xaxis']['range'] = [0, 1.2]
figure['layout']['yaxis']['range'] = [0, 9]

# Plot the animation
iplot(figure, config={'scrollZoom': True})

## Classify into categories

Now we add a category variable, in our case the `Regional indicator`.
If we split the data into one trace per category, Plotly gives each trace a different colour by default.
The figure structure therefore changes: the data of a figure or a frame is now a list of several traces instead of just one.
So we split the data by year for every frame and then by category for every trace.


```python
figure = {
    'data': list[trace], # Split by category
    'layout': {},
    'frames': list[frame] # Split by year
}

frame = {
    'data': list[trace],
    'name': str,
}
```
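
The structure can be sketched on plain Python records before moving to a DataFrame. The records, the key names and the helper `traces_by_group` below are made up for illustration; the point is only that one frame now carries one named trace per category.

```python
from collections import defaultdict

# Made-up records for a single year
records = [
    {"label": "A", "x": 0.5, "y": 4.8, "group": "North"},
    {"label": "B", "x": 0.7, "y": 6.1, "group": "South"},
    {"label": "C", "x": 0.6, "y": 5.5, "group": "North"},
]

def traces_by_group(rows):
    """Return one scatter trace per distinct value of the 'group' key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["group"]].append(row)
    return [
        {
            "x": [row["x"] for row in members],
            "y": [row["y"] for row in members],
            "mode": "markers",
            "text": [row["label"] for row in members],
            "type": "scatter",
            "name": group,  # the trace name is what the legend shows
        }
        for group, members in sorted(grouped.items())
    ]

frame = {"data": traces_by_group(records), "name": "2010"}
print([trace["name"] for trace in frame["data"]])
```

Sorting the categories keeps the trace order identical across frames, which helps the animation match each trace to the same category from one year to the next.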

### Exercise Frames with category

In the exercise below, complete the function so that it outputs a frame in the above format, that is, a dictionary with `'data'` and `'name'` where

1. `'data'` is a list of traces, where every trace is the subset of the DataFrame belonging to one category of `'Regional indicator'`.
   Each trace should now also have the key `'name'`, which is the category.
2. `'name'` is equal to the year as a string.

Take inspiration from above, but now also make the trace a function of the category.

In [None]:
%reload_ext tutorial.tests.testsuite

In [None]:
%%ipytest

import pandas as pd
import numpy as np

def solution_frames_with_category(dataset: pd.DataFrame, year: int, x_column: str, y_column: str, description_column: str, category_column: str, bubble_size_column: str) -> dict:
    """Make a frame for a given year, with bubble sizes and one trace per category

    Args:
        dataset : DataFrame containing the happiness data
        year : Year to plot
        x_column : Column name for x-axis
        y_column : Column name for y-axis
        description_column : Column name for text
        category_column : Column name used to split the data into traces
        bubble_size_column : Column name for the bubble size

    Returns:
        - Dictionary containing the frame's traces ('data') and its 'name'
    """
    # Your code starts here
    return
    # Your code ends here

So, we have finally generated the same interactive graph with our own dataset.
Below is the full figure again.

In [None]:
from tutorial.data_exploration_helper import load_full_happiness_figure
from plotly.offline import iplot

figure = load_full_happiness_figure()
iplot(figure, config={'scrollZoom': True})

### Bonus Exercise Fixing a library


Huge, well-established libraries like pandas or numpy have many contributors behind them, and a lot of effort goes into finding bugs and mistakes.
However, when browsing for libraries to use, you may also find less well-maintained ones: libraries with only a single author, or ones that haven't been touched in a while.

Here is a direct example: this tutorial was inspired by the [bubbly](https://github.com/AashitaK/bubbly) package.
However, after an update to the pandas library, it is no longer compatible with newer versions of pandas and will throw an error (see the code block below).
So what can you do in that case?

There are several options.
You can inform the author of the problem on GitHub, although they may not have time to fix it.
You can look for a different library, although it might not work exactly the way you wanted.
You can downgrade your pandas library to a compatible version: `pip show pandas` tells you which version you have, and you can uninstall it and reinstall a specific version.
However, this might not be feasible if you need the newer version elsewhere, and it is generally not a pretty solution.
Last but not least, you can try to fix it yourself.
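
For the downgrade option, the installed version can also be checked from Python. The sketch below uses the standard library's `importlib.metadata`; the helper names `installed_version` and `major_version` are made up for illustration, and which pandas version bubbly actually needs is not stated here.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def major_version(version_string):
    """Extract the leading major number, e.g. '2.1.4' -> 2."""
    return int(version_string.split(".")[0])

pandas_version = installed_version("pandas")
if pandas_version is None:
    print("pandas is not installed")
else:
    print(f"pandas {pandas_version} (major version {major_version(pandas_version)})")
```

Comparing the major version against the one a library was written for is often the first hint of where an incompatibility comes from.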

So as an exercise, we exported the bubbly library as a file `bubbly.py` into the folder `data/data_exploration`.
It is quite a short library, so quite manageable.
Try to figure out what exactly the error is, and then fix the library locally by modifying only the file `data/data_exploration/bubbly.py` until the code below runs without errors.

Note: You will need to restart the kernel after changing the file so that the changes are picked up.

(If you are interested in a solution, we have a fixed version in `tutorial/my_bubbly.py`; feel free to check the differences.)

In [None]:
import pandas as pd
# from bubbly.bubbly import bubbleplot
from data.data_exploration.bubbly import bubbleplot 
from plotly.offline import iplot
path = "data/data_exploration"
gapminder_indicators = pd.read_csv(path + '/gapminder.tsv', delimiter='\t')

figure = bubbleplot(dataset=gapminder_indicators, x_column='gdpPercap', y_column='lifeExp', 
    bubble_column='country', time_column='year', size_column='pop', color_column='continent', 
    x_title="GDP per Capita", y_title="Life Expectancy", title='Gapminder Global Indicators',
    x_logscale=True, scale_bubble=3, height=650)
iplot(figure, config={'scrollZoom': True})
