**This notebook is an exercise in the [Data Visualization](https://www.kaggle.com/learn/data-visualization) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/final-project).**

---


Now it's time for you to demonstrate your new skills with a project of your own!

In this exercise, you will work with a dataset of your choosing.  Once you've selected a dataset, you'll design and create your own plot to tell interesting stories behind the data!

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [1]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [2]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex7 import *
print("Setup Complete")

## Step 1: Attach a dataset to the notebook

Begin by selecting a CSV dataset from [Kaggle Datasets](https://www.kaggle.com/datasets).  If you're unsure how to do this, please revisit the instructions in the previous tutorial.

Once you have selected a dataset, click on the **[+ Add data]** option in the top right corner.  This will generate a pop-up window that you can use to search for your chosen dataset.  

![ex6_search_dataset](https://i.imgur.com/cIIWPUS.png)

Once you have found the dataset, click on the **[Add]** button to attach it to the notebook.  You can check that it was successful by looking at the **Data** dropdown menu to the right of the notebook -- look for an **input** folder containing a subfolder that matches the name of the dataset.

<center>
<img src="https://i.imgur.com/nMYc1Nu.png" width=30%><br/>
</center>

You can click on the caret to the left of the name of the dataset to double-check that it contains a CSV file.  For instance, the image below shows that the example dataset contains two CSV files: (1) **dc-wikia-data.csv**, and (2) **marvel-wikia-data.csv**.

<center>
<img src="https://i.imgur.com/B4sJkVA.png" width=30%><br/>
</center>

Once you've uploaded a dataset with a CSV file, run the code cell below **without changes** to receive credit for your work!

In [3]:
# Check for a dataset with a CSV file
step_1.check()

## Step 2: Specify the filepath

Now that the dataset is attached to the notebook, you can find its filepath.  To do this, begin by clicking on the CSV file you'd like to use.  This will open the CSV file in a tab below the notebook.  You can find the filepath towards the top of this new tab.  

![ex6_filepath](https://i.imgur.com/fgXQV47.png)

After you find the filepath corresponding to your dataset, fill it in as the value for `my_filepath` in the code cell below, and run the code cell to check that you've provided a valid filepath.  For instance, in the case of this example dataset, we would set
```
my_filepath = "../input/fivethirtyeight-comic-characters-dataset/dc-wikia-data.csv"
```  
Note that **you must enclose the filepath in quotation marks**; otherwise, the code will return an error.

Once you've entered the filepath, you can close the tab below the notebook by clicking on the **[X]** at the top of the tab.
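
If a filepath is rejected, it can help to confirm that it points to an existing CSV file. The snippet below is a minimal sketch using only the standard library. `looks_like_csv` is a hypothetical helper (it is not part of the course's checking code), and the temporary file is an assumed stand-in for an attached dataset.

```python
import tempfile
from pathlib import Path


def looks_like_csv(filepath):
    """Return True if filepath is an existing file with a .csv extension."""
    path = Path(filepath)
    return path.is_file() and path.suffix.lower() == ".csv"


# Hypothetical stand-in for a dataset file attached under ../input/
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example.csv"
    example_path.write_text("a,b\n1,2\n")

    existing_ok = looks_like_csv(str(example_path))
    missing_ok = looks_like_csv(str(Path(tmp_dir) / "missing.csv"))

print(existing_ok, missing_ok)
```

In the notebook itself, you would pass `my_filepath` to the helper instead of the temporary path.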

In [4]:
# Fill in the line below: Specify the path of the CSV file to read
my_filepath = "../input/covid19s-impact-on-airport-traffic/covid_impact_on_airport_traffic.csv"

# Check for a valid filepath to a CSV file in a dataset
step_2.check()

## Step 3: Load the data

Use the next code cell to load your data file into `my_data`.  Use the filepath that you specified in the previous step.

In [5]:
# Fill in the line below: Read the file into a variable my_data
#my_data = pd.read_csv(my_filepath, index_col="Date", parse_dates=True)
my_data = pd.read_csv(my_filepath)

# Check that a dataset has been uploaded into my_data
step_3.check()

**_After the code cell above is marked correct_**, run the code cell below without changes to view the first five rows of the data.

In [6]:
# Print the first five rows of the data
my_data.head()
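
The loading cell keeps a commented-out alternative that parses dates while reading. The sketch below contrasts the two calls on a small in-memory CSV. The CSV text is made up for illustration; only the column names are taken from the notebook.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the attached file
csv_text = """Date,AirportName,PercentOfBaseline
2020-04-01,Airport A,60
2020-04-02,Airport A,55
"""

# Default read: the Date column is kept as plain strings
plain = pd.read_csv(io.StringIO(csv_text))

# Parse dates at read time and use the Date column as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(plain["Date"].dtype)
print(type(indexed.index).__name__)
```

With the second form, the index is already a `DatetimeIndex`, so no later conversion is needed. The first form keeps `Date` as a regular column, which is convenient when it is passed by name to seaborn.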

## Step 4: Visualize the data

Use the next code cell to create a figure that tells a story behind your dataset.  You can use any chart type (_line chart, bar chart, heatmap, etc_) of your choosing!

### Line Chart

We are going to visualize how the proportion of flights with respect to a baseline has evolved over time for airports of different countries.

In [7]:
# Let's start by checking how many countries appear in the data
my_data["Country"].describe()

We have 4 different countries.

In [8]:
# Let's check the total number of airports
my_data["AirportName"].describe()

There are 28 different airports.

In [9]:
# Let's check how many airports we have for each country.
my_data.groupby("Country")["AirportName"].unique().agg(len)
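
The counts above can also be obtained directly with `nunique()`, which returns the number of distinct values instead of a full summary. This is a minimal sketch on a made-up frame; the airport and country values are placeholders, not rows from the real dataset.

```python
import pandas as pd

# Toy data: placeholder values, column names as in the notebook
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Chile", "Chile"],
    "AirportName": ["Airport A", "Airport A", "Airport B", "Airport C", "Airport C"],
})

# Number of distinct countries and airports overall
n_countries = toy["Country"].nunique()
n_airports = toy["AirportName"].nunique()

# Number of distinct airports within each country
airports_per_country = toy.groupby("Country")["AirportName"].nunique()

print(n_countries, n_airports)
print(airports_per_country)
```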

Since we have data for many years, we are going to plot the evolution only for individual years. 

In [10]:
# The Date column is stored as text (object dtype), not as datetimes.
my_data["Date"].dtype

In [11]:
# Let's change it to datetime
my_data["Date"] = pd.to_datetime(my_data["Date"])
my_data["Date"].dtype

We can choose any country's airports to plot the evolution of the proportion of flights. Let's do this with Canada and 2020.

In [12]:
canada_2020_data = my_data[(my_data["Country"] == "Canada") & (pd.DatetimeIndex(my_data['Date']).year == 2020)]
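
To make the filtering step concrete, here is a sketch on a toy frame. It converts the text dates, then combines two conditions into one boolean mask with `&`. The rows are invented for illustration, and the `.dt` accessor is used here as one way to extract the year from a datetime column.

```python
import pandas as pd

# Toy data: invented rows, column names as in the notebook
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Chile"],
    "Date": ["2019-12-31", "2020-03-16", "2020-03-16"],
    "PercentOfBaseline": [100, 64, 70],
})

# Convert the text dates to datetimes so that date parts can be extracted
toy["Date"] = pd.to_datetime(toy["Date"])

# Each condition yields a boolean Series; parentheses are required around each
mask = (toy["Country"] == "Canada") & (toy["Date"].dt.year == 2020)
canada_2020 = toy[mask]

print(canada_2020)
```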

In [13]:
# Create a plot
plt.figure(figsize=(40,10))
plt.title("Percent of Baseline Flights over Time for Canadian Airports in 2020")
sns.lineplot(x="Date", 
             y="PercentOfBaseline",
             hue="AirportName",
             data=canada_2020_data)
plt.ylabel("Proportion of Flights (Percent)")
plt.xlabel("Date")

# Check that a figure appears below
step_4.check()

### Bar Chart

Let's plot the minimum proportion of flights with respect to a baseline for each airport.

In [14]:
min_percent_per_airport = my_data.groupby("AirportName")["PercentOfBaseline"].min()
min_percent_per_airport.head()
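
As a small illustration of the aggregation, the sketch below computes the per-airport minimum on a toy frame and sorts the result, which is one way to get ordered bars in the chart. The values are invented, and the sorting step is a suggestion that the notebook does not apply.

```python
import pandas as pd

# Toy data: invented values, column names as in the notebook
toy = pd.DataFrame({
    "AirportName": ["Airport A", "Airport A", "Airport B", "Airport B"],
    "PercentOfBaseline": [80, 35, 90, 50],
})

# Minimum per airport, sorted from the lowest to the highest minimum
min_per_airport = (
    toy.groupby("AirportName")["PercentOfBaseline"]
    .min()
    .sort_values()
)

print(min_per_airport)
```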

In [15]:
# Plot a bar chart
plt.figure(figsize=(25,10))
plt.title("Min Proportion of Flights with respect to a baseline for each Airport")
sns.barplot(y=min_percent_per_airport.index, x=min_percent_per_airport, orient="h")
#plt.xticks(rotation=90)
plt.xlabel("Min Proportion of Flights")
plt.ylabel("Airport Name")

### Heatmap

Let's visualize how the proportion of flights is distributed per airport and on each date of a month using a heatmap. The month selected is October.

In [16]:
# Let's review our original data
my_data.head()

In [17]:
# Select the info we want to use
my_data_2d = my_data.loc[:, ["Date", "AirportName", "PercentOfBaseline"]]
my_data_2d["Month"] = pd.DatetimeIndex(my_data['Date']).month
my_data_2d = my_data_2d[my_data_2d["Month"] == 10]
my_data_2d.pop("Month")
my_data_2d.head()

In [18]:
my_data_2d = my_data_2d.pivot(index="Date", columns="AirportName", values="PercentOfBaseline")
my_data_2d.index = pd.to_datetime(my_data_2d.index)
my_data_2d.head()
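
The reshaping from long to wide format can be seen more easily on a toy example. This is a sketch with invented values. Note that `pivot` raises an error if any (Date, AirportName) pair appears more than once; `pivot_table` is a possible alternative in that case because it aggregates duplicates.

```python
import pandas as pd

# Toy long-format data: one row per (date, airport) pair
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-01", "2020-10-01", "2020-10-02", "2020-10-02"]),
    "AirportName": ["Airport A", "Airport B", "Airport A", "Airport B"],
    "PercentOfBaseline": [70, 65, 72, 60],
})

# Wide format: one row per date, one column per airport
wide_df = long_df.pivot(index="Date", columns="AirportName", values="PercentOfBaseline")

print(wide_df)
```

Each cell of the wide frame corresponds to one cell of the heatmap.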

In [19]:
plt.figure(figsize=(30, 15))
plt.title("Proportion of Flights with respect to a baseline for each Airport in October")
# Format the dates as ISO-style labels (YYYY-MM-DD) on the y-axis
sns.heatmap(data=my_data_2d, annot=True, yticklabels=my_data_2d.index.strftime('%Y-%m-%d'), cmap="Purples", cbar_kws={"location": "top", "fraction": 0.1})
plt.xlabel("Airport Name")
plt.ylabel("Date")

### Scatter Plots

We are going to use a different dataset to do scatter plots.

In [20]:
# Load the dataset
books_filepath = "../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv"
books_data = pd.read_csv(books_filepath)

In [21]:
# Show the first entries
books_data.head()

In [24]:
# Let's see how the user rating and the book's price are related
plt.figure(figsize=(15, 15))
plt.title("Relationship between User Rating and book's Price")
sns.scatterplot(x=books_data["User Rating"], y=books_data["Price"])
plt.xlabel("User Rating")
plt.ylabel("Price")

It seems there is no correlation between the Rating and the Price of a book. Let's check that with a regression.

In [25]:
plt.figure(figsize=(15, 15))
plt.title("Relationship between User Rating and book's Price")
sns.regplot(x=books_data["User Rating"], y=books_data["Price"])
plt.xlabel("User Rating")
plt.ylabel("Price")

In fact, it seems that as the rating increases, the price decreases slightly. However, this conclusion is not very reliable, since the dataset contains few low-rated books (it is a top-50 list).
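
A regression line can be complemented with a number. The sketch below computes the Pearson correlation coefficient with `Series.corr` on invented values; it illustrates the call only and says nothing about the real bestsellers data.

```python
import pandas as pd

# Toy data: invented ratings and prices with a downward trend
toy_books = pd.DataFrame({
    "User Rating": [4.0, 4.3, 4.5, 4.7, 4.9],
    "Price": [20, 15, 16, 11, 9],
})

# Pearson correlation: -1 is a perfect negative linear relationship, +1 a perfect positive one
r = toy_books["User Rating"].corr(toy_books["Price"])

print(round(r, 2))
```

A coefficient close to zero would support the impression of no linear relationship, while a clearly negative value would match a downward regression line.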

Is this true for all the genres?

In [43]:
plt.figure(figsize=(15, 15))
plt.title("Relationship between User Rating and book's Price by Genre")
sns.scatterplot(x=books_data["User Rating"], y=books_data["Price"], hue=books_data["Genre"])
plt.xlabel("User Rating")
plt.ylabel("Price")

In [49]:
# Add regression lines
sns.lmplot(
    x="User Rating",
    y="Price",
    hue="Genre",
    data=books_data,
    height=8,
    aspect=2.0,
)

- For fiction books, ratings and prices do not seem to be strongly related.
- For non-fiction books, there does seem to be a slight relationship: lower prices tend to go with slightly higher ratings.

Again, these conclusions are not very reliable, because the dataset contains too few low-rated books.
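
One way to check such per-genre impressions numerically is to compute the correlation separately for each group. This is a sketch on invented values, with the genre labels taken from the notebook; the numbers are constructed so that the two groups differ and do not reflect the real data.

```python
import pandas as pd

# Toy data: invented values, built so the two genres behave differently
toy_books = pd.DataFrame({
    "Genre": ["Fiction"] * 4 + ["Non Fiction"] * 4,
    "User Rating": [4.2, 4.5, 4.7, 4.9, 4.2, 4.5, 4.7, 4.9],
    "Price": [10, 10, 11, 10, 20, 17, 14, 12],
})

# Pearson correlation between rating and price within each genre
corr_by_genre = {
    genre: group["User Rating"].corr(group["Price"])
    for genre, group in toy_books.groupby("Genre")
}

for genre, r in corr_by_genre.items():
    print(genre, round(r, 2))
```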

In [50]:
# Let's plot the relationship between prices and genres in a "categorical scatter plot".
plt.figure(figsize=(15, 15))
plt.title("Relationship between Prices and Genres")
sns.swarmplot(x=books_data["Genre"], y=books_data["Price"])
plt.xlabel("Genre")
plt.ylabel("Price")

It seems that fiction books are a little bit cheaper than non-fiction books.
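
A visual impression like this can be checked against summary statistics per group. The following sketch uses invented prices; with the real data, `books_data` would take the place of the toy frame.

```python
import pandas as pd

# Toy data: invented prices, genre labels as in the notebook
toy_books = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Fiction", "Non Fiction", "Non Fiction", "Non Fiction"],
    "Price": [8, 10, 12, 11, 14, 20],
})

# Mean, median and number of rows per genre
price_summary = toy_books.groupby("Genre")["Price"].agg(["mean", "median", "count"])

print(price_summary)
```

The median is less sensitive than the mean to a few very expensive books, so comparing both can be informative.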

### Histograms

In [68]:
# Let's see how many samples we have for each rating
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
plt.figure(figsize=(10,10))
plt.title("# of ratings")
sns.histplot(x=books_data["User Rating"], color="tomato")
plt.xlabel("User Rating")
plt.ylabel("# of Ratings")

Most ratings are high. This is expected for a list of bestsellers.
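
The counts behind a histogram of discrete values can also be tabulated directly. This is a small sketch with invented ratings, using `value_counts` and sorting by the rating value.

```python
import pandas as pd

# Toy data: invented ratings
ratings = pd.Series([4.5, 4.7, 4.7, 4.8, 4.8, 4.8, 3.9], name="User Rating")

# Count how often each rating occurs, ordered by rating instead of by frequency
rating_counts = ratings.value_counts().sort_index()

print(rating_counts)
```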

In [70]:
# Let's see the distribution of user ratings
# (fill=True replaces the deprecated shade=True)
plt.figure(figsize=(10,10))
plt.title("KDE of ratings")
sns.kdeplot(data=books_data["User Rating"], fill=True)
plt.xlabel("User Rating")
plt.ylabel("Density of Ratings")

In [93]:
# Similarly, let's see the distribution of prices vs. user ratings by Genre
sns.jointplot(
    x=books_data["User Rating"],
    y=books_data["Price"],
    hue=books_data["Genre"],
    kind="kde",
    height=8,
    color="orange"
)

In [101]:
# Let's see how many samples we have for each rating, by genre
plt.figure(figsize=(10,10))
plt.title("# of ratings")
plt.xlabel("User Rating")
plt.ylabel("# of Ratings")

books_data_non_fiction = books_data[books_data["Genre"] == "Non Fiction"]
books_data_fiction = books_data[books_data["Genre"] == "Fiction"]
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(x=books_data_non_fiction["User Rating"], label="Non-Fiction", kde=True)
sns.histplot(x=books_data_fiction["User Rating"], label="Fiction", kde=True)

plt.legend()

In [65]:
# Let's see how many books of each genre we have in the top-50.
# This is plotted as a bar chart rather than a histogram
books_per_genre = books_data.groupby("Genre")["Name"].agg(len)

plt.figure(figsize=(10,10))
plt.title("# of books in top-50 per Genre")
sns.barplot(x=books_per_genre.index, y=books_per_genre, palette="muted")
plt.xlabel("Genre")
plt.ylabel("# of books")

In [80]:
# Let's see the most represented authors in the top-50
books_per_author = books_data.groupby("Author")["Name"].agg(len)

plt.figure(figsize=(25,50))
plt.title("# of books in top-50 per Author")
sns.barplot(y=books_per_author.index, x=books_per_author, orient="h")
plt.ylabel("Author")
plt.xlabel("# of books in TOP-50")
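
With many authors, the full bar chart becomes very tall. One option, sketched below on invented data, is to keep only the most frequent authors before plotting. `value_counts` sorts in descending order, so `head(n)` returns the top `n`; the cutoff of 2 is an arbitrary choice for this toy example.

```python
import pandas as pd

# Toy data: placeholder authors and titles, column names as in the notebook
toy_books = pd.DataFrame({
    "Author": ["Author A", "Author B", "Author A", "Author C", "Author A", "Author B"],
    "Name": ["Book 1", "Book 2", "Book 3", "Book 4", "Book 5", "Book 6"],
})

# Number of rows per author, most frequent first, limited to the top 2
top_authors = toy_books["Author"].value_counts().head(2)

print(top_authors)
```

The resulting Series can be passed to `sns.barplot` in the same way as `books_per_author`.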

## Keep going

Learn how to use your skills after completing the micro-course to create data visualizations in a **[final tutorial](https://www.kaggle.com/alexisbcook/creating-your-own-notebooks)**.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/data-visualization/discussion) to chat with other learners.*