# 12_Matplotlib and Seaborn

**[1] Matplotlib**<br>
- Single plot <br>
- Multiple plots <br>
- Secondary axis<br>

**[2] Seaborn**<br>
- X-axis with categorical data: (<code>lineplot</code>), <code>countplot</code>, <code>barplot</code>, <code>heatmap</code><br>
- Numerical data: <code>histplot</code>, <code>scatterplot</code><br>
- Multiple plots: <code>jointplot</code>, <code>pairplot</code>, <code>FacetGrid</code><br>

In [None]:
import pandas as pd

## [1] matplotlib

In [None]:
import matplotlib.pyplot as plt

In [None]:
sales_df = pd.DataFrame({"Sales":[113, 84, 87, 50, 97, 68, 48, 54, 37, 38, 40, 57],
                         "Cumulative_sales":[113, 197, 284, 334, 431, 499, 547, 601, 638, 676, 716, 773]}, 
                        index = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"])
sales_df 

- **Single plot**

In [None]:
#step1: Create a figure and axes
fig, ax = plt.subplots()

#step2: Plot a chart in axes
sales_df.plot(kind = "bar", y = "Sales", ax = ax)

#step3: Format the style
ax.set_title("Monthly sales report", fontsize = 18)
ax.set_xlabel("Month", fontsize = 12)
ax.set_ylabel("Sales", fontsize = 12);
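
The same three steps also work with matplotlib's own plotting methods instead of <code>DataFrame.plot</code>. The cell below is a minimal sketch of that idea; the lists <code>months</code> and <code>sales</code> are stand-ins that copy the first six values of <code>sales_df</code>.

```python
import matplotlib.pyplot as plt

# Stand-in data: the first six months of sales_df
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [113, 84, 87, 50, 97, 68]

# step1: Create a figure and axes
fig, ax = plt.subplots()

# step2: Plot a chart in axes with matplotlib's own bar method
bars = ax.bar(months, sales, label="Sales")

# step3: Format the style
ax.set_title("Monthly sales report", fontsize=18)
ax.set_xlabel("Month", fontsize=12)
ax.set_ylabel("Sales", fontsize=12)
ax.legend()
```

<code>DataFrame.plot</code> is usually shorter when the data is already in a dataframe, while calling <code>ax.bar</code> directly gives finer control over each element.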

- **Multiple subplots**

In [None]:
#step1: Create a figure and axes
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))

#step2: Plot a chart in axes
sales_df.plot(kind = "bar", y = "Sales", ax=ax[0])
sales_df.plot(kind = "line", y = "Cumulative_sales", ax = ax[1], color = "red", marker = "o")

#step3: Format the style
ax[0].set_title("Monthly sales", fontsize = 18)
ax[1].set_title("Cumulative monthly sales", fontsize = 18)
ax[1].set_ylim([0,1000]);
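
With a single row, <code>ax</code> is a one-dimensional array, so <code>ax[0]</code> and <code>ax[1]</code> are enough. When both <code>nrows</code> and <code>ncols</code> are greater than 1, <code>ax</code> becomes a two-dimensional array indexed by row and column. The cell below is a small sketch with made-up numbers to illustrate the indexing.

```python
import matplotlib.pyplot as plt

# A 2 x 2 grid: ax is a 2-D NumPy array of Axes objects
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

# Index a single axes by [row, column]
ax[0, 0].plot([1, 2, 3], [1, 4, 9])
ax[0, 0].set_title("row 0, col 0")

# ax.flat iterates over all axes in row-major order
for i, a in enumerate(ax.flat):
    a.set_xlabel(f"axes {i}")

# Reduce overlap between the labels of neighbouring subplots
fig.tight_layout()
```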

## Exercise.A

In [None]:
import random
random.seed(0)
product_df = pd.DataFrame({"A":[random.randint(100,150) for i in range(6)],"B":[random.randint(20,150) for i in range(6)],
                           "C":[random.randint(120,150) for i in range(6)]}, index = ["Jan","Feb","Mar","Apr","May","Jun"])
product_df

**(A.1) Given the synthetic dataset above, each column represents the monthly sales of products A, B, and C. Create a figure with three bar charts to show the sales data of each product.**<br>
Setting: <code>figsize = (15,4)</code>

- **Secondary y-axis**

In [None]:
#step1: Create a figure and axes
fig, ax1 = plt.subplots(figsize = (12, 3)) 
ax2 = ax1.twinx() 

#step2: Plot a chart in axes
sales_df.plot(kind = "bar", y = "Sales", ax = ax1)
sales_df.plot(kind = "line", y = "Cumulative_sales", ax = ax2, color = "red", marker = "o")

#step3: Format the style
ax1.set_xlabel("Month", fontsize = 18)
ax1.set_ylabel("Sales", fontsize = 18)
ax1.legend(loc="upper left")
ax2.set_ylabel("Cumulative sales", fontsize = 18)
ax2.set_ylim([0,1000])
ax2.legend(loc="upper right");
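
Each axes keeps its own legend, which is why the cell above places two of them. One possible alternative is to collect the legend entries from both axes and draw a single legend. The sketch below uses plain matplotlib calls and a shortened copy of the sales figures; it is one way to do this, not the only one.

```python
import matplotlib.pyplot as plt

# Shortened copy of the sales figures used above
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [113, 84, 87, 50]
cumulative = [113, 197, 284, 334]

fig, ax1 = plt.subplots(figsize=(8, 3))
ax2 = ax1.twinx()  # second y-axis that shares the same x-axis

ax1.bar(months, sales, label="Sales")
ax2.plot(months, cumulative, color="red", marker="o", label="Cumulative sales")

# Collect the legend entries from both axes and draw one combined legend
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc="upper left")
```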

## [2] Seaborn

In [None]:
# !pip install seaborn

In [None]:
import seaborn as sns

### [2.1] X-axis with categorical data

- **Lineplot (wide form)**

In [None]:
product_df_wide = pd.DataFrame({"A":[67, 57, 87, 50, 97, 68],
                                "B":[78, 102, 113, 98, 80, 84]}, 
                                index = ["Jan","Feb","Mar","Apr","May","Jun"])

product_df_wide

In [None]:
sns.lineplot(data = product_df_wide)

- **Lineplot (long form)**<br>
Use <code>hue</code> to specify which categorical column should be used to define the subsets.

In [None]:
product_df_long = pd.DataFrame({"month":["Jan","Feb","Mar","Apr","May","Jun","Jan","Feb","Mar","Apr","May","Jun"],
                        "product":["A","A","A","A","A","A","B","B","B","B","B","B"],
                        "sales":[67, 57, 87, 50, 97, 68, 78, 102, 113, 98, 80, 84]})
product_df_long

In [None]:
sns.lineplot(data = product_df_long, x = "month", y = "sales", hue = "product")
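
A wide-form table does not have to be retyped to get the long form. The sketch below reshapes a shortened copy of <code>product_df_wide</code> with pandas <code>melt</code>; the column names <code>month</code>, <code>product</code> and <code>sales</code> are chosen to match the long-form example above.

```python
import pandas as pd

# Shortened copy of product_df_wide
product_df_wide = pd.DataFrame({"A": [67, 57, 87], "B": [78, 102, 113]},
                               index=["Jan", "Feb", "Mar"])

product_df_long = (
    product_df_wide
    .rename_axis("month")  # name the index so it becomes a column
    .reset_index()
    .melt(id_vars="month", var_name="product", value_name="sales")
)
product_df_long
```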

*(The following charts all use the wine dataset)*


In [None]:
# Select a subset
wine_raw_df = pd.read_csv("../dataset/wine.csv", index_col = [0])
countries = ["Italy", "France", "Spain"]
varieties = ["Red Blend", "Tempranillo", "Chardonnay", "Pinot Noir", "Cabernet Sauvignon"]
wine_df = wine_raw_df.loc[wine_raw_df.country.isin(countries)
                          & (wine_raw_df.price < 200)
                          & wine_raw_df.variety.isin(varieties),
                          ["country", "price", "variety"]]
wine_df

- **Countplot**<br>
It shows the number of occurrences of each category in the variable.

In [None]:
sns.countplot(data = wine_df, x = "country")
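
The bar heights in a count plot are the same numbers that <code>value_counts()</code> returns. The sketch below checks this on a small made-up dataframe (<code>toy_df</code> is a hypothetical name, not part of the wine dataset).

```python
import pandas as pd
import seaborn as sns

# Made-up data for illustration only
toy_df = pd.DataFrame({"country": ["Italy", "France", "Italy", "Spain", "Italy", "France"]})

# Number of occurrences of each category, computed with pandas
counts = toy_df["country"].value_counts()
print(counts)

# The count plot draws one bar per category with these heights
ax = sns.countplot(data=toy_df, x="country")
```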

- **Countplot - Use the parameter <code>hue</code> to divide each bar into sub-bars**

In [None]:
sns.countplot(data = wine_df, x = "country", hue = "variety")

- **Bar plot**<br>
It shows the mean, median, or other statistics for each category.

In [None]:
sns.barplot(data = wine_df, x = "variety", y = "price", estimator = "mean")

- **Heatmap**<br>
A two-dimensional visual representation of data, where colors represent different values.

In [None]:
# data preparation
heatmap_data = pd.crosstab(wine_df["variety"], wine_df["country"])
heatmap_data

In [None]:
sns.heatmap(data = heatmap_data, annot = True, fmt = "d", cmap = "RdYlGn")

## Exercise.B

The sinking of the Titanic is one of the most infamous shipwrecks in history, resulting in the death of 1502 out of 2224 passengers and crew. 
- **survived**: 0 = No, 1 = Yes
- **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **sex**: Sex (male or female)
- **age**: Age
- **sibsp**: Number of siblings / spouses aboard the Titanic
- **parch**: Number of parents / children aboard the Titanic
- **fare**: Passenger fare
- **embarked**: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)    

In [None]:
titanic_df = sns.load_dataset("titanic", dtype = {"survived": object, "pclass":object})

**(B.1) Show the first 10 rows of the dataset.**

**(B.2) Use a count plot to display the number of male and female passengers.**

**(B.3) Use a count plot to display the number of male and female passengers, and use the column <code>survived</code> to divide each bar into two sub-bars.**<br>
Hint: <code>hue</code>

**(B.4) Count the number of surviving and non-surviving passengers in each class.**<br>
Hint: Use <code>crosstab()</code> to create a contingency table based on "pclass" and "survived" columns.

**(B.5) Use the result obtained in (B.4) to draw a heatmap.**<br>
Setting: <code>cmap = "coolwarm"</code>

### [2.2] Numerical data

*(The following charts all use the diabetes dataset)*

In [None]:
# Import data and keep observations with blood pressure greater than 0 (a reading of 0 is not physiologically plausible)
diabetes_df = pd.read_csv("../dataset/diabetes.csv", dtype = {"Outcome":object})
diabetes_df = diabetes_df[diabetes_df.BloodPressure>0]

- **Histogram**

In [None]:
sns.histplot(data = diabetes_df, x = "BloodPressure")

- **Histogram - group by a categorical variable**

In [None]:
sns.histplot(data = diabetes_df, x = "BloodPressure", hue = "Outcome")

- **Scatter plot**

In [None]:
sns.scatterplot(data = diabetes_df, x = "Age", y = "BloodPressure")

- **Scatter plot - group by a categorical variable**

In [None]:
sns.scatterplot(data = diabetes_df, x = "Age", y = "BloodPressure", hue = "Outcome")

### [2.3] Multiple plots 

- **Joint plot** <br>
To show the relationship between two variables and their respective distributions.

In [None]:
sns.jointplot(data = diabetes_df, x = "Age", y = "BloodPressure")

- **Joint plot - group by a categorical variable**

In [None]:
sns.jointplot(data = diabetes_df, x = "Age", y = "BloodPressure", hue = "Outcome", height = 5)

- **Pairplot**

In [None]:
sns.pairplot(data = diabetes_df.loc[:,["BloodPressure", "BMI", "Age"]], height = 2)

## Exercise.C

The "diamonds" dataset contains information about diamond characteristics, including price, carat (weight), cut, color, clarity, depth, table (width), and dimensions such as width, length, and depth. Below are descriptions of some of the variables used in this exercise.

- **price**: Price (USD)
- **carat**: Weight of the diamond 
- **cut**: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- **clarity**: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**(C.1) Load the dataset "diamonds" from seaborn, and display the first 5 rows of the dataset.**<br>
Hint: <code>sns.load_dataset()</code>

**(C.2) Show the distribution of the price.**<br>
Hint: <code>histplot()</code>

**(C.3) Use a scatter plot to show the relationship between <code>carat</code> and <code>price</code> of the "best" clarity diamonds.**<br>
Hint: Select a subset by using <code>diamond_df.clarity == "IF"</code>

**(C.4) Use a joint plot to show the relationship between <code>carat</code> and <code>price</code> of the best clarity diamonds and their individual distributions.**

- **FacetGrid**

In [None]:
# Initialize the grid
g = sns.FacetGrid(data = diabetes_df, col = "Outcome")

In [None]:
# Draw a plot on every facet
g = sns.FacetGrid(data = diabetes_df, col = "Outcome")
g.map(sns.histplot, "BloodPressure")

In [None]:
g = sns.FacetGrid(data = diabetes_df, col = "Outcome")
g.map(sns.kdeplot, "BloodPressure")

In [None]:
g = sns.FacetGrid(data = diabetes_df, col = "Outcome")
g.map(sns.scatterplot, "Age", "BloodPressure")

## Exercise.D

Below are descriptions of some of the variables in the "diamonds" dataset.

- **price**: Price (USD)
- **carat**: Weight of the diamond 
- **cut**: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- **clarity**: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**(D.1) Use the dataframe <code>diamond_df</code> in (C.1). Draw a price histogram for each <code>cut</code> category.**<br>
Hint: <code>col="cut"</code>