# Visualization
The human brain is excellent at recognizing patterns in visual representations of data; therefore, in this section, we will learn how to visualize data using `pandas` as well as the libraries `matplotlib` and `seaborn` for additional functionality. We will create a variety of visualizations that will help us better understand our data.

## Why is Data Visualization Necessary?

So far, we have focused heavily on summarizing the data using statistics. However, summary statistics alone are not enough to understand the distribution – there are many possible distributions for a given set of summary statistics. Data visualization is necessary to truly understand the distribution (GIF only works in Google Colab).

<div style="text-align: center; margin-top: -10px;">
<img width="50%" src="https://raw.githubusercontent.com/stefmolin/data-morph/main/docs/_static/panda-to-star-eased.gif" alt="Data Morph: panda to star" style="min-width: 300px; margin-bottom: -10px;"/>
<div style="margin: auto 26%;"><small><em>A set of points forming a panda can also form a star without any significant changes to the summary statistics displayed above. (source: <a href="https://stefaniemolin.com/data-morph/stable/index.html">Data Morph</a>)</em></small></div>
</div>
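The same idea can be checked on a much smaller scale. The following is a minimal sketch using only the Python standard library: two made-up samples (not taken from the Olympics data) share the same mean and standard deviation, yet a crude text histogram shows that they are distributed quite differently. The names `unimodal`, `bimodal`, and `text_histogram` are purely illustrative.

```python
from collections import Counter
from statistics import mean, pstdev

# Two hypothetical samples, invented for illustration only.
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]  # values cluster around the middle
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]   # values split into two separate groups


def text_histogram(values):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return "\n".join(f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts))


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    # Both samples report the same mean and (population) standard deviation ...
    print(f"{name}: mean={mean(sample)}, std={pstdev(sample)}")
    # ... but the text histogram reveals two different shapes.
    print(text_histogram(sample))
```

Looking only at the two printed statistics, the samples are indistinguishable; the difference becomes visible only once the values are drawn.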


## Preparation - Reading and Understanding the Data
We're using Kaggle's [Olympics Dataset](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data/data).

**Load the data into a `DataFrame`**

Remember to upload the unzipped CSV to the folder section on the left.

In [None]:
# Your code here

**Familiarize yourself with the dataset. What columns are there, what data types are present? What does a row represent? What time period does the dataset cover?**

In [None]:
# Your code here

## Visualizing with Pandas
The quickest way to create a visualization is to plot directly from `pandas`. This allows you to generate a chart with a single line of code.

Later, we will create more complex representations using `matplotlib` and `seaborn` (two additional packages).

We will cover four of the most common chart types:
- Histogram
- Bar chart
- Line chart
- Scatter plot

In [None]:
df.head()

### Histogram
For example, we can visualize the distribution of age across all modern Olympic participants. To do this, we use a *histogram*.

In [None]:
df.Age.plot(kind="hist")

We can refine this chart by passing additional options (called *arguments*) in the function call. Many of the available arguments are described in the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) of the closely related `DataFrame.hist` function.

In [None]:
df.Age.plot(
    kind="hist",
    bins=77,  # alternative: int(df.Age.max() - df.Age.min())
    grid=True,
    figsize=(12, 6),
    title="Distribution of age across all Olympic participants from 1896-2016",
    xlabel="Age",
    ylabel="Count",
    xticks=list(range(10, 100, 5)),  # alternative: list(range(int(df.Age.min()), int(df.Age.max()), 5))
)
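The commented-out expressions in the cell above hint that the number of bins and the tick positions can be derived from the data instead of being hard-coded. The following is a minimal sketch of that idea on a small, made-up `ages` Series (an assumption for illustration, not the real `Age` column); rounding the first tick down to a multiple of 5 is a design choice of this sketch, not something the original cell does.

```python
import pandas as pd

# Hypothetical ages, invented for illustration (not from the Olympics data).
ages = pd.Series([14, 19, 22, 22, 23, 25, 25, 25, 28, 31, 36, 47], name="Age")

youngest, oldest = int(ages.min()), int(ages.max())

# One bin per year of age between the youngest and the oldest participant.
n_bins = oldest - youngest

# Ticks every 5 years, starting at the multiple of 5 at or below the minimum.
xticks = list(range(youngest // 5 * 5, oldest + 5, 5))

ax = ages.plot(
    kind="hist",
    bins=n_bins,
    xticks=xticks,
    grid=True,
    title="Toy age distribution",
)
```

With bins that are one year wide, each bar corresponds to roughly one age, which makes individual peaks easier to read.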

### Bar Chart
If we want to display frequencies per category, we can use a bar chart:

In [None]:
# Note that we chain several operations one after another.
# This works because each method returns a new pandas object (here a Series).
# We can always call another method on the returned object.
df.Sport.value_counts().sort_values().plot(kind="barh", figsize=(6, 12), grid=True, title="Number of Athletes per Sport (1896-2016)")
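To see what each step of such a chain returns, it can help to split it into intermediate variables. The sketch below does this on a tiny, made-up `sports` Series (the data and the variable names are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical sport entries, invented for illustration.
sports = pd.Series(
    ["Swimming", "Judo", "Swimming", "Rowing", "Swimming", "Judo"],
    name="Sport",
)

# Step 1: count how many rows belong to each sport (returns a Series).
counts = sports.value_counts()

# Step 2: sort the counts in ascending order (returns another Series).
ordered = counts.sort_values()

print(type(counts).__name__)
print(ordered)

# Step 3 would be ordered.plot(kind="barh"), which draws one horizontal bar
# per sport, in the order of the Series.
```

Because every step hands back a pandas object, the three steps can be written as a single chained expression, as in the cell above.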

### Line Chart
Line charts are mainly used when changes over time need to be displayed. The x-axis is usually a time period.

For example, we can examine whether the average height has changed over time.

**Do you have a guess?**

*Note:* This uses a more advanced technique called `.groupby()`, which we did not cover in this tutorial. Essentially, we calculate a value (e.g. the average athlete height) per group (e.g. per year). Don't worry about the details. If you are curious to learn this powerful method, read the [documentation](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.groupby.html) or the [official explanation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), or use an AI assistant of your choice to learn about the concept.
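As a rough illustration of the "one value per group" idea, here is a minimal sketch on a tiny, made-up table (the `toy` DataFrame and its numbers are invented for this example and are not part of the Olympics data).

```python
import pandas as pd

# Hypothetical measurements: two athletes in 1996, three in 2000.
toy = pd.DataFrame(
    {
        "Year": [1996, 1996, 2000, 2000, 2000],
        "Height": [170.0, 180.0, 162.0, 176.0, 190.0],
    }
)

# Split the rows by Year, then compute the mean Height within each group.
mean_height = toy.groupby("Year").Height.mean()
print(mean_height)
```

The result is a Series with one entry per year, indexed by `Year`, which is exactly the shape a line chart needs.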

In [None]:
# This is the data we want to visualize
df.groupby("Year").Height.mean()

There doesn't seem to be a strong trend. The fluctuation in the first year could be due to measurement errors. Otherwise, the variations are marginal.

I had expected that with the increasing inclusion of female athletes, the average height would decrease. This doesn't seem to be the case.

In [None]:
df.groupby("Year").Height.mean().plot(kind="line", title="Average Height over Time")

Next, let's see if the gender ratio has changed over time.

In [None]:
# We'll save our intermediate data
df_gender = df.groupby("Year").Gender.value_counts(normalize=True).reset_index()
df_gender

Indeed, the proportion of women increases to nearly 50%.

In [None]:
# We filter the intermediate data before visualizing
df_gender[df_gender.Gender == "F"].plot(kind="line", x="Year", y="proportion", title="Proportion of Female Athletes at Olympic Games (1896-2016)")
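The steps behind this chart (count per group, normalize, flatten, filter) can be traced on a tiny, made-up table. The sketch below uses a hypothetical `toy` DataFrame. It also names the value column explicitly with `.rename("proportion")`: recent pandas versions appear to choose this name automatically when `normalize=True`, but naming it explicitly is an assumption-free way to get the same column regardless of version.

```python
import pandas as pd

# Hypothetical participants: 1 of 4 is female in 1900, 2 of 4 in 2016.
toy = pd.DataFrame(
    {
        "Year": [1900, 1900, 1900, 1900, 2016, 2016, 2016, 2016],
        "Gender": ["M", "M", "M", "F", "M", "M", "F", "F"],
    }
)

shares = (
    toy.groupby("Year")
    .Gender.value_counts(normalize=True)  # share of each gender within a year
    .rename("proportion")                 # explicit name for the value column
    .reset_index()                        # turn Year and Gender into columns
)

# Keep only the rows for women; this is the table the line chart is drawn from.
female = shares[shares.Gender == "F"]
print(female)
```

Each remaining row holds one year and the share of female participants in that year, so `x="Year"` and `y="proportion"` map directly onto the two axes.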

It is possible that athletes have become taller over time, but that this effect is offset by the growing proportion of women.

To investigate this, we can display the average height by gender over time.

In [None]:
# Again, we save some intermediate transformations
df_gender_height = df.groupby(["Year", "Gender"]).Height.mean().unstack(level="Gender")
df_gender_height.head(10)

Here, we can indeed see a slight upward trend for each gender.

In [None]:
df_gender_height.plot(kind="line", title="Average Height per Gender over Time")
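The call to `.unstack(level="Gender")` is what turns one line into one line per gender: it moves the `Gender` level of the grouped result from the row index into the columns. Here is a minimal sketch on a made-up `toy` table (all names and numbers are invented for illustration).

```python
import pandas as pd

# Hypothetical heights for two Olympic years.
toy = pd.DataFrame(
    {
        "Year": [1996, 1996, 1996, 2000, 2000, 2000],
        "Gender": ["F", "M", "M", "F", "F", "M"],
        "Height": [168.0, 180.0, 184.0, 170.0, 172.0, 185.0],
    }
)

# Long format: one row per (Year, Gender) combination.
long_format = toy.groupby(["Year", "Gender"]).Height.mean()

# Wide format: one row per Year, one column per Gender.
wide_format = long_format.unstack(level="Gender")
print(wide_format)
```

When a DataFrame in this wide format is plotted with `kind="line"`, pandas draws one line per column, which is why the chart above shows a separate line for each gender.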

### Scatter Plot
Scatter plots are especially useful when the distribution of two continuous variables needs to be plotted simultaneously.


In [None]:
df.head()

The plot below helps us understand the expected relationship between height and weight. Additionally, a scatter plot like this makes it easier to identify outliers. Some athletes stand out from the large cluster because they are particularly heavy without being tall, or particularly tall without being heavy.

In [None]:
df.plot(kind="scatter", x="Height", y="Weight")

## Now it's your turn!
Use the techniques shown above to create your own visualizations. Follow these five steps:

1) **Ask the right question:** Every good visualization starts with an interesting question. Think about what you want to learn from the dataset.

2) **Choose the right visualization:** Different visualizations are suitable for different data. How many dimensions do you want to display? Are your variables categorical or continuous? Adapt the type of representation to your data.

3) **Prepare the data correctly:** Most of the time, you need to reshape the raw data to display it effectively. This includes filtering and aggregating the data until it is in the right form for the chart.

4) **Create the visualization correctly:** The majority of the work is already done. Now it's about styling the visualization correctly so that it is easy to read and highlights the important conclusions.

5) **Interpret the visualization correctly:** What does your visualization show? Does it answer your question? Are the results surprising? What other representations might explain the surprising results?


**Ask & answer 3-5 interesting questions using visualizations**

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here