# Introduction to Matplotlib

Matplotlib is a cross-platform data visualization and graphical plotting library for Python. As such, it offers a viable open-source alternative to MATLAB.

Matplotlib and pandas are often used together. pandas integrates Matplotlib, so we can plot a pandas dataframe or series directly by calling its `.plot()` method.

You can create plots in `matplotlib` using the **Artist layer** or the **Scripting layer**:

- **Scripting layer** (procedural method) - using `matplotlib.pyplot` as `plt`

    You can use `plt` and add more elements by calling different functions procedurally; for example, `plt.title(...)` to add a title or `plt.xlabel(...)` to add a label to the x-axis.
    

- **Artist layer** (Object-oriented method) - using an `Axes` instance from Matplotlib

    You can take the `Axes` instance of your current plot and store it in a variable (e.g. `ax`). You can then add more elements by calling methods on `ax`. For example, use `ax.set_title()` instead of `plt.title()` to add a title, or `ax.set_xlabel()` instead of `plt.xlabel()` to add a label to the x-axis.

    This option is often more appropriate and flexible for advanced plots (e.g. when creating multiple plots).
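
As a minimal sketch of the difference, the following cell draws the same line twice, once through each layer. The values are made up purely for illustration and are not taken from the lab's datasets.

```python
import matplotlib.pyplot as plt

# Made-up values, for illustration only
years = [2017, 2018, 2019, 2020]
arrivals = [27, 30, 31, 7]

# Scripting layer: pyplot functions act on the "current" figure and axes
plt.figure()
plt.plot(years, arrivals)
plt.title('Scripting layer')
plt.xlabel('Year')
plt.ylabel('Arrivals (millions)')

# Artist layer: call methods on an explicit Axes object
fig, ax = plt.subplots()
ax.plot(years, arrivals)
ax.set_title('Artist layer')
ax.set_xlabel('Year')
ax.set_ylabel('Arrivals (millions)')

plt.show()
```

Both figures should look the same apart from their titles; the only difference is whether the target axes is implicit (`plt`) or explicit (`ax`).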


### Matplotlib.Pyplot

In this lab, we will mostly use Matplotlib's scripting layer, `matplotlib.pyplot`. It offers a collection of command-style functions, each of which incrementally modifies a Matplotlib figure, e.g. plots a line or sets an axis label.

Let's import `matplotlib` and `matplotlib.pyplot`, as well as pandas and NumPy.


In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

To check that Matplotlib is loaded and to see its version, run the following:


In [None]:
print('Matplotlib: ', mpl.__version__)

### Importing and preparing the data

We'll use the datasets we worked with in the previous notebook.

> ### Datasets:
>
> **[International tourism, number of arrivals](https://data.worldbank.org/indicator/ST.INT.ARVL)** 
>
>This dataset contains the yearly number of inbound tourists for every country. The data on inbound tourists refer to the number of arrivals, not to the number of people traveling. Thus a person who makes several trips to a country during a given period is counted each time as a new arrival.
>    
>
>**[TripAdvisor European restaurants](https://www.kaggle.com/datasets/stefanoleone992/tripadvisor-european-restaurants)**
>
>This dataset includes restaurants with attributes such as location data, average rating, number of reviews, open hours, cuisine types, awards, etc. The dataset combines the restaurants from the main European countries. In the context of this lab, we will work with a subset of the dataset that includes restaurants in Greece.





To import and prepare the international tourism dataset:

In [None]:
# Read the .csv file and store it as a pandas DataFrame
df_tourism = pd.read_csv("./international_tourism.csv")

# drop the columns we don't need
df_tourism.drop(['Country Code', 'Indicator Name', 'Indicator Code'], axis=1, inplace=True)

# give the country column a shorter name
df_tourism.rename(columns={'Country Name': 'Country'}, inplace=True)

# set the country name column as the dataframe's index
df_tourism.set_index('Country', inplace=True)

# delete empty columns
df_tourism.dropna(how='all', axis=1, inplace=True)

# delete empty rows
df_tourism.dropna(how='all', axis=0, inplace=True)

df_tourism
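
To see what the two `dropna(how='all', ...)` calls do, here is a small sketch on a made-up dataframe. The labels and values are illustrative assumptions, not part of the tourism dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: column '2020' is entirely empty, and row 'B' has no values at all
df_demo = pd.DataFrame(
    {'2019': [1.0, np.nan], '2020': [np.nan, np.nan]},
    index=['A', 'B'],
)

# axis=1 drops columns in which every value is NaN
no_empty_cols = df_demo.dropna(how='all', axis=1)

# axis=0 drops rows in which every value is NaN
cleaned = no_empty_cols.dropna(how='all', axis=0)

print(cleaned)
```

With `how='all'`, a row or column is removed only when it contains no data at all, so partially filled rows and columns are kept.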

## Line Plots


A line chart is a type of chart which displays information as a series of data points ('markers') connected by straight line segments. Line charts are commonly used to visualize the behavior of a continuous variable over a time period.

To generate a line plot of inbound tourists over a time period for Greece we first need to retrieve the corresponding series from the data frame:


In [None]:
greece = df_tourism.loc['Greece']

greece

To plot a line chart, we call the `.plot()` method of the series.


In [None]:
greece.plot()

As you can see, the x-axis shows the index of the series (years) and the y-axis shows the series values (inbound tourists to Greece). Let's add a title and label the x- and y-axes using `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` as follows:


In [None]:
greece.plot(kind='line')

plt.title('Inbound tourists to Greece')
plt.ylabel('Number of arrivals')
plt.xlabel('Year')

plt.show()



Notice the dramatic drop in tourists in 2020, which can be attributed to the outbreak of the COVID-19 pandemic.
To highlight this in our plot, we can annotate it with text using the `plt.text(x, y, text)` function. Parameters `x` and `y` are in data coordinates by default. Since the years in the series index are strings, we need to specify `x` as the integer position of the year in the index.

In [None]:
greece.plot(kind='line')

plt.title('Inbound tourists to Greece')
plt.ylabel('Number of tourists')
plt.xlabel('Year')

plt.text(25, 25000000, 'Covid-19 Pandemic Outbreak')

plt.show() 
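
Instead of hard-coding the x position, the position of a year can be looked up from the index. The following is a minimal sketch on a made-up series standing in for `greece`; the values are illustrative assumptions only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up series with string year labels, standing in for `greece`
s_demo = pd.Series([30, 7, 15], index=['2019', '2020', '2021'])

ax = s_demo.plot(kind='line')

# Integer position of the label '2020' in the index
x_pos = s_demo.index.get_loc('2020')

# Place the text at that position, at the height of the 2020 value
ax.text(x_pos, s_demo['2020'], 'Covid-19 Pandemic Outbreak')

plt.show()
```

This keeps the annotation attached to the right year even if the range of years in the series changes.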

### Multiple Lines

Let's now compare the number of inbound tourists in Greece, Italy and France for the years 2010 to 2021.
First, we need to retrieve the data for the required countries and years.


In [None]:
df_cmp = df_tourism.loc[['Greece', 'Italy', 'France'], '2010': '2021']
df_cmp

Then we call `plot()` on the dataframe to plot it:


In [None]:
df_cmp.plot(kind='line')

😔

This is not the plot we wanted.
When plotting a dataframe, pandas puts the index on the x-axis and draws each column as a separate line.

In our first example with just the Greek data, we had a pandas Series with the years as index.
Here, however, we have a dataframe with the 3 countries as index and the years as columns.

To generate the right plot, we need to transpose the dataframe by calling the `transpose()` method, which swaps the rows and columns.

In [None]:
df_cmp = df_cmp.transpose()
df_cmp.head()
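
The effect of transposing is easiest to see on a tiny example. This sketch uses a made-up dataframe with hypothetical labels `A` and `B` in place of the country names.

```python
import pandas as pd

# Made-up data: rows are entities, columns are years (as in df_cmp before transposing)
df_demo = pd.DataFrame(
    {'2019': [1, 2], '2020': [3, 4]},
    index=['A', 'B'],
)

# Swap rows and columns; df_demo.T is an equivalent shorthand
df_demo_t = df_demo.transpose()

print(df_demo_t)
```

After transposing, the years form the index (and therefore the x-axis), and each former row becomes a column, i.e. one line in the plot.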

Now, if we plot our dataframe:

In [None]:
df_cmp.plot(kind='line')

plt.title('Inbound tourists to Greece, Italy and France')
plt.ylabel('Number of tourists')
plt.xlabel('Year')

# plt.legend().set_visible(True)

plt.show()

**Question 1:** Compare the inbound tourists for years 2010-2020, for the 5 countries with most inbound tourists in 2020.

Hint: Use `sort_values(by=[fields], ascending=False, axis=0, inplace=True)`.


In [None]:
# type your solution

## Bar Charts

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific categories being compared, and the other axis represents a measured value.

The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. One disadvantage of vertical bar charts is that they leave little space for the text label of each bar.

With the pandas `.plot()` method we can create a bar plot by setting the `kind` parameter:
- `kind='bar'`: vertical bar chart or column chart
- `kind='barh'`: horizontal bar chart


**Vertical bar chart**

Let's compare the number of tourists in 2020 for Greece, Italy, France and Spain.


In [None]:
df_tourism_med = df_tourism.loc[['Greece', 'Italy', 'France', 'Spain'], '2020']
df_tourism_med

In [None]:
df_tourism_med.plot(kind='bar')

plt.xlabel('Country')
plt.ylabel('Number of Tourists')
plt.title('Inbound Tourists in 2020')

plt.show()

The bar plot above shows the total number of tourists for these countries in 2020.

To rotate the x-axis labels we can set the `rot` parameter:

In [None]:
df_tourism_med.plot(kind='bar', rot=45)

plt.xlabel('Country')
plt.ylabel('Number of Tourists')
plt.title('Inbound Tourists in 2020')

plt.show()

**Horizontal bar chart**

In [None]:
df_tourism_med.plot(kind='barh')

plt.ylabel('Country')
plt.xlabel('Number of Tourists')
plt.title('Inbound Tourists in 2020')

plt.show()
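
Horizontal bars also leave room to print each bar's value next to it. The sketch below uses the Artist layer for this, on made-up values. It assumes Matplotlib 3.4 or newer, where `Axes.bar_label` is available.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only
s_demo = pd.Series([7, 25, 40], index=['A', 'B', 'C'])

# .plot() returns the Axes it drew on
ax = s_demo.plot(kind='barh')

# ax.containers[0] holds the bars that were just drawn;
# bar_label writes each bar's value next to it (assumes Matplotlib 3.4+)
labels = ax.bar_label(ax.containers[0])

ax.set_xlabel('Value')
plt.show()
```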

### Time Series
In the international tourism dataset, even though the years are on the x-axis, we do not need to handle them explicitly as datetime values. However, if our data include a date field, handling its values as datetimes is recommended, as it enables time-based slicing, proper axis labeling, etc.


> #### Dataset: **[Stock Market Dataset](https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)** 
>This dataset includes historical daily prices and volumes of all U.S. stocks and ETFs, containing CSV files for every stock, with values for Date, Open, High, Low, Close, Volume, etc. For this lab, we will use the historical data for the Amazon stock.

Let's load the historical data for the Amazon stock. 

In [None]:
df_stock = pd.read_csv('amzn.csv')
df_stock

Next, we set the Date column as the index:

In [None]:
df_stock.set_index('Date', inplace=True)
df_stock

To generate a line plot with the close prices:

In [None]:
df_stock['Close'].plot()
plt.ylabel('Close Price')
plt.show()

To explicitly handle the x-axis values as datetimes, we can convert the index to datetime:

In [None]:
df_stock.index = pd.to_datetime(df_stock.index)
type(df_stock.index)

Now, let's plot the time series again:

In [None]:
df_stock['Close'].plot()
plt.ylabel('Close Price')
plt.show()

What we did above can be done in one step by calling `pd.read_csv()` with `index_col=0`, which sets the first column of the CSV file (Date) as the index, and with `parse_dates=True`, which ensures that the index values are parsed as dates rather than left as strings.

In [None]:
df_stock = pd.read_csv('amzn.csv', index_col=0, parse_dates=True)
df_stock.head()
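
To illustrate what the datetime index buys us, here is a self-contained sketch that reads a few made-up rows from an in-memory string instead of `amzn.csv`. The column names mirror the stock dataset, but the prices are invented.

```python
import io
import pandas as pd

# Made-up rows, standing in for the contents of a stock price CSV file
csv_text = """Date,Open,Close
2016-12-30,10.0,10.5
2017-01-03,10.6,11.0
2017-01-04,11.1,10.9
"""

df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# The index is now a DatetimeIndex rather than plain strings
print(type(df_demo.index))

# Time-based slicing: select all rows from the year 2017
print(df_demo.loc['2017'])
```

Selecting with a partial date string such as `'2017'` only works because the index holds datetimes.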

**Question 2:** Create a time series plot to visualize the daily Low and High prices in 2017.

In [None]:
# type your solution

Having explicitly set the index as datetime, we can also resample the data to a yearly frequency, calculating the average of close prices within each year:


In [None]:
yearly_avg_close = df_stock['Close'].resample('Y').mean()

yearly_avg_close
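
As a cross-check, the same yearly averages can be computed by grouping on the year of the index. This is a different technique from `resample`, shown here as a sketch on made-up prices. It does not rely on a frequency alias, which can be convenient because recent pandas releases may warn that `'Y'` is deprecated in favour of `'YE'`.

```python
import pandas as pd

# Made-up close prices on a datetime index
idx = pd.to_datetime(['2016-01-04', '2016-06-01', '2017-01-03', '2017-06-01'])
close_demo = pd.Series([10.0, 20.0, 30.0, 50.0], index=idx)

# Group the values by calendar year and average each group
yearly_demo = close_demo.groupby(close_demo.index.year).mean()

print(yearly_demo)
```

The result is indexed by plain integer years rather than by timestamps.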

Now, we can visualize these yearly averages:

In [None]:
yearly_avg_close.plot(kind='line')
plt.ylabel('Average Yearly Close Price')


plt.show()

A bar chart may be preferable to a line chart for visualizing the average price per year, since it presents each year as a discrete value and makes direct comparisons between years easier:

In [None]:
yearly_avg_close.plot(kind='bar')
plt.ylabel('Average Yearly Close Price')

# Show only the year in the x-tick labels instead of the full timestamp
plt.gca().set_xticklabels([x.strftime('%Y') for x in yearly_avg_close.index])


plt.show()

## Area Plots


The **area plot** is a fairly common chart type. Most of the time, area plots are used as stacked area charts.
A stacked area chart is particularly useful when you want to understand both the individual trends of components and how these components contribute to the overall total.

Let's create a stacked area plot for Greece and Cyprus for years 2013-2019:


In [None]:
# transpose the dataframe
df_gr_cy = df_tourism.loc[['Greece', 'Cyprus'], '2013':'2019'].transpose()

df_gr_cy

Area plots are stacked by default. To produce a stacked area plot:

In [None]:
df_gr_cy.plot(kind='area',
              figsize=(10, 5))  # (width, height) in inches
plt.title('Inbound Tourists for Greece and Cyprus')
plt.ylabel('Number of Tourists')
plt.xlabel('Year')

plt.show()

To produce an unstacked plot, we set the `stacked` parameter to `False`. In an unstacked plot every area starts from zero, so each component's values can be read directly from the y-axis and compared independently, at the cost of the areas overlapping.

In [None]:
df_gr_cy.plot(kind='area',
              stacked=False,
              figsize=(10, 5))  # (width, height) in inches

plt.title('Inbound Tourists for Greece and Cyprus')
plt.ylabel('Number of Tourists')
plt.xlabel('Year')

plt.show()

The unstacked plot has a default transparency (alpha value) of 0.5. We can modify this value by passing in the `alpha` parameter.


In [None]:
df_gr_cy.plot(kind='area', 
             alpha=0.2,
             stacked=False,
             figsize=(15, 8))

plt.title('Inbound Tourists for Greece and Cyprus')
plt.ylabel('Number of Tourists')
plt.xlabel('Year')

plt.show()

## Histograms

A histogram is the most commonly used graph to show the frequency distribution of a dataset, i.e. how often each different value occurs in the dataset. A histogram partitions its x-axis into bins, with the y-value of every bin being the number of data points that correspond to it.

We will use a subset of the TripAdvisor Restaurants dataset that contains restaurants in Greece.

Let's import the data:

In [None]:
df_restaurants = pd.read_csv('tripadvisor_restaurants_greece.csv')
df_restaurants

Let's create a histogram for the average restaurant rating. We can easily graph this distribution by passing `kind='hist'` to `plot()`.

In [None]:
df_restaurants['avg_rating'].plot(kind='hist', figsize=(10, 5))

plt.title('Histogram of Average Rating')
plt.ylabel('Number of Restaurants')
plt.xlabel('Average Rating')

plt.show()

The x-axis of this histogram denotes the average rating intervals, while the y-axis represents the count of restaurants falling within each bin.

Using **NumPy**'s `histogram` function, we can get the bin edges and corresponding counts:


In [None]:
counts, bin_ranges = np.histogram(df_restaurants['avg_rating'].dropna())

print(counts) # frequency count
print(bin_ranges) # bin ranges, default = 10 bins

By default, the `histogram` function creates 10 bins.


Using the `bin_ranges` we just calculated, we can make the x-axis ticks match the bin edges:


In [None]:
df_restaurants['avg_rating'].plot(kind='hist', figsize=(8, 5), xticks=bin_ranges)

plt.title('Histogram of Average Rating')
plt.ylabel('Number of Restaurants')
plt.xlabel('Average Rating')
plt.show()
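
The number of bins can also be chosen explicitly. The following sketch uses a handful of made-up ratings (not the restaurant data) to show how the counts and bin edges returned by `np.histogram` relate to each other, and how the same edges can be reused when drawing the histogram with the Artist layer.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ratings, for illustration only
ratings = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0]

# Ask for 4 equal-width bins between the smallest and largest value
counts, edges = np.histogram(ratings, bins=4)

print(counts)  # one count per bin
print(edges)   # one more edge than there are bins

# Reuse the edges so the bars and the ticks line up
fig, ax = plt.subplots()
ax.hist(ratings, bins=edges)
ax.set_xticks(edges)
ax.set_xlabel('Rating')
ax.set_ylabel('Count')

plt.show()
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which includes both.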

## Pie Charts

A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is proportional to the quantity it represents. We can create pie charts by passing `kind='pie'` to `plot()`. This visualization is particularly useful for showing the relative contributions of different categories to a whole.

We will use a subset of the TripAdvisor Restaurants dataset for creating pie charts.


Let's import the data:

In [None]:
df_restaurants = pd.read_csv('tripadvisor_restaurants_greece.csv')
df_restaurants

Let's focus on the 'region' column. This takes the following values, corresponding to regions in Greece:

In [None]:
df_restaurants.region.unique()

To create a pie chart to visualize the percentage of restaurants grouped by region, we need to use the *pandas* `groupby` method to summarize the restaurant data by `region`. The `groupby` method first splits the data into groups based on the values of some fields, and then applies an aggregate function to each group, before combining the results.

In [None]:
# group restaurants by region and get the size of every group
region_counts = df_restaurants.groupby('region').size()

region_counts
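
The split-apply-combine idea behind `groupby` can be seen on a few made-up rows. The column names below mirror the restaurant dataset, but the rows themselves are invented for illustration.

```python
import pandas as pd

# Made-up rows, standing in for the restaurant data
df_demo = pd.DataFrame({
    'restaurant_name': ['a', 'b', 'c', 'd'],
    'region': ['Crete', 'Attica', 'Crete', 'Crete'],
})

# Split the rows by region, then count the rows in each group
counts_demo = df_demo.groupby('region').size()

print(counts_demo)
```

The result is a Series indexed by the group label, which is the shape that `plot(kind='pie')` expects.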

Next, to generate the pie chart, we set `kind='pie'`.

In [None]:
# autopct formats the percentage labels; startangle sets where the first wedge starts
region_counts.plot(kind='pie',
                   figsize=(5, 6),
                   autopct='%1.1f%%',  # label the wedges with percentages
                   startangle=90,      # start at 90° (12 o'clock)
                   )

plt.title('Number of restaurants by Region')
plt.axis('equal') # Sets the pie chart aspect ratio to look like a circle.
plt.ylabel("")
plt.show()


The pie chart is not very clear, with some text overlapping. We can remove the labels and add a legend instead. We can also move the percentage of every segment outside it. Finally, to highlight some segments we can make them stand out using the `explode` parameter.


In [None]:
# offset for each pie segment, as a fraction of the radius; one entry per region in region_counts
explode_list = [0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0.1, 0, 0.1, 0.2]

region_counts.plot(kind='pie',
                   figsize=(15, 6),
                   autopct='%1.1f%%',
                   startangle=90,
                   shadow=False,
                   labels=None,          # hide labels
                   pctdistance=1.1,      # distance of the percentage text from the center, as a fraction of the radius
                   explode=explode_list  # offset ('explode') the highlighted regions
                   )

plt.title('Number of restaurants by Region', y=1.1, fontsize= 14)

plt.axis('equal') 

plt.ylabel("")

# add legend
plt.legend(labels=region_counts.index, loc='lower right') 

plt.show()
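
A hard-coded `explode` list has to match the number and order of the segments exactly. As an alternative, the list can be derived from the data. The sketch below does this for made-up group sizes standing in for `region_counts`; highlighting the two smallest groups is an arbitrary choice made for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up group sizes, standing in for `region_counts`
counts_demo = pd.Series({'A': 50, 'B': 5, 'C': 30, 'D': 2, 'E': 13})

# Labels of the two smallest groups
smallest = counts_demo.nsmallest(2).index

# One offset per segment, in the same order as the series
explode_demo = [0.1 if label in smallest else 0 for label in counts_demo.index]
print(explode_demo)

ax = counts_demo.plot(kind='pie', explode=explode_demo, autopct='%1.1f%%')
ax.set_ylabel('')

plt.show()
```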

**Question 3:** Create a pie plot to visualize the percentage of restaurants per province in Crete.

*Hint: Filter first the restaurants for Crete (region = Crete) and then count by province.*



In [None]:
# type your solution

## Scatter Plots

A scatter plot is a type of plot used to compare variables within a dataset. The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of another variable determining the position on the vertical axis. This plot is particularly useful for visualizing and identifying relationships, trends, and potential correlations between variables.

Let's create a scatter plot to visualize the relationship between the open price of the Amazon stock and its daily volume.
For this, we need to set `kind='scatter'` and set the `x` and `y` parameters to specify the columns that go on the x and y-axis, respectively.



In [None]:
df_stock.plot(kind='scatter', x='Open', y='Volume')

plt.title('Open Price vs Volume')
plt.xlabel('Open Price')
plt.ylabel('Volume')

plt.show()

Every point on the scatterplot corresponds to a single day. Note that some points overlap.

From the plot, we can observe that, in general, lower Open prices tend to coincide with higher volumes. We can model this relationship mathematically using a regression line (line of best fit).


For this, we'll use **NumPy**'s `Polynomial.fit()` method, passing in the x values, the y values and the degree of the polynomial:

In [None]:
from numpy.polynomial import Polynomial

x = df_stock['Open']      
y = df_stock['Volume']

# Degree of polynomial: 1 = linear, 2 = quadratic, ...
polynomial = Polynomial.fit(x, y, deg=1)

# The fitted polynomial
print(polynomial)
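
Note that `Polynomial.fit` fits in a scaled domain, so the printed coefficients are generally not the slope and intercept in the original units. The sketch below uses made-up points that lie exactly on the line y = 2x + 1 to show how `convert()` recovers them.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up points lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2 * x_demo + 1

# Fit a straight line (degree 1)
p_demo = Polynomial.fit(x_demo, y_demo, deg=1)

# convert() maps the fit back to the original x units;
# coefficients are ordered from the lowest degree to the highest
intercept, slope = p_demo.convert().coef

print(intercept, slope)
```

Evaluating the fitted object directly, as in `p_demo(x_demo)`, already accounts for the scaling, so `convert()` is only needed when the coefficients themselves are of interest.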

To plot the regression line on the scatter plot:


In [None]:
df_stock.plot(kind='scatter', x='Open', y='Volume')

plt.title('Open Price vs Volume')
plt.xlabel('Open Price')
plt.ylabel('Volume')

# plot regression line
plt.plot(x, polynomial(x), color='yellow')


plt.show()

## Box Plots

A box plot visually summarizes the distribution of numerical data by displaying the quartiles, median, and any outliers. It provides a concise representation of the data’s spread, central tendency, and variability. This type of plot is particularly useful for identifying outliers and comparing distributions across different groups.

The chart is a standardized way of displaying the distribution of data based on a five-number summary:

**Minimum Value:** The end of the lower whisker represents the smallest data point, excluding any outliers.

**First Quartile (Q1):** The lower edge of the box indicates the 25th percentile, meaning 25% of data points fall below this value.

**Median (Second Quartile, Q2):** The line within the box marks the median of the dataset.

**Third Quartile (Q3):** The upper edge of the box shows the 75th percentile, with 75% of data points lying below this value.

**Maximum Value:** The end of the upper whisker indicates the largest data point, excluding outliers.

![image-2.png](attachment:image-2.png)


We can make a box plot by setting `kind='box'` in `plot()`.

Let's generate a box plot for inbound tourism in Greece.

In [None]:
greece = df_tourism.loc['Greece']

greece

Plot by passing in `kind='box'`.


In [None]:
greece.plot(kind='box', figsize=(10, 7))

plt.title('Box plot of Inbound tourists')
plt.ylabel('Number of Inbound Tourists')

plt.show()

Using the `describe()` method, we can get the actual numbers:

In [None]:
greece.describe()
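
The whisker ends and outliers are not part of the `describe()` output, but they can be derived from the quartiles. The sketch below applies the 1.5 × IQR rule, which is Matplotlib's default for box plots, to a made-up series.

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The whiskers end at the most extreme points inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

print(q1, q3, whisker_low, whisker_high, outliers.tolist())
```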

**Question 4:** Create a box plot to compare the distribution of inbound tourists between Greece, Italy, France and Spain.


In [None]:
# type your solution

## Matplotlib Subplots

In Matplotlib, we can create multiple plots within the same figure using the `plt.subplots(nrows, ncols)` function, which returns a tuple containing a `figure` and an array of `axes`. The `figure` represents the entire plotting area with `nrows` rows and `ncols` columns of individual plots. To plot data on a specific subplot, we pass the appropriate `axes` object to the `ax` parameter of the `df.plot()` method, using `axes[index]` to reference the subplot location. Creating multiple plots in the same figure allows for side-by-side comparisons of different datasets or variables and for analyzing various plot types of the same data within a single visual context.

In [None]:
figure, axes = plt.subplots(1, 2)

figure.suptitle("Tourism in Greece", fontsize=18)

# Box plot (horizontal) on the first subplot
greece.plot(kind='box', vert=False, figsize=(14, 5), ax=axes[0])
axes[0].set_title('Box plot of Inbound tourists in Greece', fontsize=14)
axes[0].set_xlabel('Number of Inbound Tourists')

# Line plot on the second subplot
greece.plot(kind='line', figsize=(14, 5), ax=axes[1])
axes[1].set_title('Inbound Tourists in Greece per year', fontsize=14)
axes[1].set_ylabel('Number of Inbound Tourists')
axes[1].set_xlabel('Years')

plt.show()
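
When the grid has more than one row and more than one column, `axes` is a two-dimensional array that is indexed as `axes[row, col]`. The following sketch uses a made-up series to place four plot types on a 2 × 2 grid.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only
s_demo = pd.Series([1, 3, 2, 5], index=['a', 'b', 'c', 'd'])

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Each subplot is addressed by its row and column
s_demo.plot(kind='line', ax=axes[0, 0], title='line')
s_demo.plot(kind='bar', ax=axes[0, 1], title='bar')
s_demo.plot(kind='box', ax=axes[1, 0], title='box')
s_demo.plot(kind='hist', ax=axes[1, 1], title='hist')

# Adjust the spacing so titles and tick labels do not collide
fig.tight_layout()

plt.show()
```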

**Question 5:** Create a figure with two subplots:

1. A bar chart to visualize the total inbound tourists for periods `2006-2010`, `2011-2015`, `2016-2020` for France, Italy and Spain.
2. A box plot to visualize the distribution for these countries for years `2006 - 2020`.

In [None]:
# type your solution

**Question 6:** Generate a bar chart to visualize the top 20 cuisines in Athens:

In [None]:
# type your solution