# Matplotlib Notebook with Real-World Datasets

In this notebook, we use a real-world dataset to learn how to create common plot types with Matplotlib. We start with basic plots and work up to multi-panel figures, explaining what each plot type is good for and when to use it.

## Importing Libraries

First, let's import the necessary libraries.

In [3]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

## Loading Real-World Dataset

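We will use the **Iris dataset**, a classic machine learning dataset of 150 iris flowers from three species, each described by four measurements in centimeters. It ships with scikit-learn (`sklearn`) and loads as NumPy arrays, so pandas is not needed.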

In [None]:
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
data = iris.data
target = iris.target
target_names = iris.target_names

# Display the shape of the data
print('Data shape:', data.shape)
print('Target shape:', target.shape)
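
Before plotting, it helps to confirm which column holds which measurement and how many samples each species has. The following is a minimal sketch that uses only attributes returned by `load_iris()`; the variable name `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Column order of iris.data: indices 0..3 map to these feature names
for index, name in enumerate(iris.feature_names):
    print(index, name)

# Number of samples per species (class labels are the integers 0, 1, 2)
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(f'{name}: {count} samples')
```

The column indices printed here are the ones used in the slicing expressions throughout the rest of the notebook, for example `data[:, 2]` for petal length.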

## Scatter Plot

**When to use:** Scatter plots are ideal for showing the relationship between two numerical variables.

Let's visualize the relationship between sepal length and sepal width.

In [None]:
# Scatter Plot of Sepal Length vs Sepal Width, colored by species
plt.figure(figsize=(10, 6))
plt.scatter(data[:, 0], data[:, 1], c=target, cmap='viridis', alpha=0.7)
plt.title('Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
# Label the colorbar ticks with species names instead of the raw codes 0, 1, 2
cbar = plt.colorbar(ticks=[0, 1, 2], label='Species')
cbar.set_ticklabels(target_names)
plt.show()
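
Because species is a category and not a continuous quantity, a legend can be easier to read than a colorbar. One possible variant, sketched here under the assumption that one `scatter` call per species is acceptable, plots each group separately so that Matplotlib assigns a distinct color and legend entry to each.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
data, target, target_names = iris.data, iris.target, iris.target_names

fig, ax = plt.subplots(figsize=(10, 6))
for i, name in enumerate(target_names):
    mask = target == i  # boolean mask selecting one species
    ax.scatter(data[mask, 0], data[mask, 1], label=name, alpha=0.7)

ax.set_title('Sepal Length vs Sepal Width')
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.legend(title='Species')
plt.show()
```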

## Line Plot

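**When to use:** Line plots are ideal for visualizing trends over an ordered sequence, such as time.

Let's plot the mean sepal length for each species. Species are categories with no natural order, so the connecting line here only guides the eye; for this kind of comparison a bar plot is usually the better fit.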

In [None]:
# Calculate mean sepal length for each species
mean_sepal_length = [np.mean(data[target == i, 0]) for i in range(3)]

# Line Plot of Mean Sepal Length
plt.figure(figsize=(8, 6))
plt.plot(target_names, mean_sepal_length, marker='o')
plt.title('Mean Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Mean Sepal Length (cm)')
plt.show()
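
A line plot is most meaningful when the x-axis has a real order. As an illustrative sketch (not part of the original analysis), the example below sorts the sepal lengths within each species and plots them against their rank, which gives an ordered sequence for the line to follow. The dictionary `sorted_by_species` is a name introduced here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

fig, ax = plt.subplots(figsize=(8, 6))
sorted_by_species = {}
for i, name in enumerate(iris.target_names):
    # Sort so that the x-axis (rank) is a genuinely ordered sequence
    values = np.sort(iris.data[iris.target == i, 0])
    sorted_by_species[str(name)] = values
    ax.plot(np.arange(1, len(values) + 1), values, label=name)

ax.set_title('Sorted Sepal Length by Species')
ax.set_xlabel('Rank within species (smallest to largest)')
ax.set_ylabel('Sepal Length (cm)')
ax.legend(title='Species')
plt.show()
```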

## Histogram

**When to use:** Histograms are used to represent the distribution of a numeric variable.

Let's look at the distribution of petal lengths.

In [None]:
# Histogram of Petal Length
plt.figure(figsize=(10, 6))
plt.hist(data[:, 2], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Petal Length')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.show()
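
A single histogram over all samples can hide group structure. The sketch below overlays one histogram per species; it assumes shared bin edges (computed here with `np.linspace`) so that the three histograms are directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]

# 21 edges give 20 bins spanning the full range of the data
bin_edges = np.linspace(petal_length.min(), petal_length.max(), 21)

fig, ax = plt.subplots(figsize=(10, 6))
for i, name in enumerate(iris.target_names):
    ax.hist(petal_length[iris.target == i], bins=bin_edges,
            alpha=0.6, edgecolor='black', label=name)

ax.set_title('Distribution of Petal Length by Species')
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Frequency')
ax.legend(title='Species')
plt.show()
```

Passing the same `bin_edges` array to every `hist` call is the design choice that matters here: with `bins=20` each call would choose its own edges from its own subset of the data.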

## Bar Plot

**When to use:** Bar plots are useful for comparing quantities across different categories.

Let's compare the mean petal width among the species.

In [None]:
# Calculate mean petal width for each species
mean_petal_width = [np.mean(data[target == i, 3]) for i in range(3)]

# Bar Plot of Mean Petal Width
plt.figure(figsize=(8, 6))
plt.bar(target_names, mean_petal_width, color=['red', 'green', 'blue'])
plt.title('Mean Petal Width by Species')
plt.xlabel('Species')
plt.ylabel('Mean Petal Width (cm)')
plt.show()
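
A bar of means says nothing about the spread within each group. One way to add that information, sketched here with the sample standard deviation as the error measure (an assumption; standard error or a confidence interval would also be reasonable), is the `yerr` argument of `bar`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
data, target, target_names = iris.data, iris.target, iris.target_names

# Mean and sample standard deviation (ddof=1) of petal width per species
means = np.array([data[target == i, 3].mean() for i in range(3)])
stds = np.array([data[target == i, 3].std(ddof=1) for i in range(3)])

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(target_names, means, yerr=stds, capsize=5,
       color=['red', 'green', 'blue'])
ax.set_title('Mean Petal Width by Species (error bars: 1 standard deviation)')
ax.set_xlabel('Species')
ax.set_ylabel('Mean Petal Width (cm)')
plt.show()
```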

## Box Plot

**When to use:** Box plots are great for comparing distributions and identifying outliers.

Let's compare the distribution of petal lengths across species.

In [None]:
# Box Plot of Petal Length by Species
plt.figure(figsize=(10, 6))
data_to_plot = [data[target == i, 2] for i in range(3)]
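plt.boxplot(data_to_plot)
# Set the tick labels separately: the `labels` keyword of boxplot was renamed
# to `tick_labels` in Matplotlib 3.9, so this form works across versions
plt.xticks([1, 2, 3], target_names)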
plt.title('Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Petal Length (cm)')
plt.show()
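
The points a box plot draws as outliers can also be listed numerically. The sketch below assumes Matplotlib's default whisker setting (`whis=1.5`), under which a point counts as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The dictionary `outliers_by_species` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
data, target, target_names = iris.data, iris.target, iris.target_names

outliers_by_species = {}
for i, name in enumerate(target_names):
    values = data[target == i, 2]  # petal length for one species
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only the values outside the whisker limits
    outliers = values[(values < lower) | (values > upper)]
    outliers_by_species[str(name)] = outliers
    print(f'{name}: limits [{lower:.2f}, {upper:.2f}], outliers {outliers}')
```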

## Heatmap

**When to use:** Heatmaps are used to represent data values in a matrix form, often to visualize correlation matrices.

Let's compute and visualize the correlation matrix for the features.

In [None]:
# Compute correlation matrix
correlation_matrix = np.corrcoef(data.T)

# Plot heatmap
plt.figure(figsize=(8, 6))
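# Fix the color limits to [-1, 1] so the neutral midpoint of 'coolwarm' marks zero correlation
plt.imshow(correlation_matrix, cmap='coolwarm', vmin=-1, vmax=1, interpolation='nearest')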
plt.title('Correlation Matrix Heatmap')
plt.colorbar()
feature_names = iris.feature_names
plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.yticks(range(len(feature_names)), feature_names)
plt.show()
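
Colors alone make it hard to read exact values from a heatmap. A common refinement, sketched here with the Axes-level interface, writes each correlation coefficient into its cell with `ax.text`. The two-decimal format is an arbitrary choice for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
names = iris.feature_names
corr = np.corrcoef(iris.data.T)  # 4 x 4 Pearson correlation matrix
n = len(names)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label='Pearson correlation')

# imshow puts column index on x and row index on y
for row in range(n):
    for col in range(n):
        ax.text(col, row, f'{corr[row, col]:.2f}', ha='center', va='center')

ax.set_xticks(range(n))
ax.set_xticklabels(names, rotation=90)
ax.set_yticks(range(n))
ax.set_yticklabels(names)
ax.set_title('Correlation Matrix Heatmap (annotated)')
fig.tight_layout()
plt.show()
```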

## Pie Chart

**When to use:** Pie charts are useful for showing proportions within a whole.

Let's show the proportion of each species in the dataset.

In [None]:
# Pie Chart of Species Proportion
species_counts = [np.sum(target == i) for i in range(3)]
plt.figure(figsize=(8, 8))
plt.pie(species_counts, labels=target_names, autopct='%1.1f%%', startangle=140)
plt.title('Proportion of Each Species in the Iris Dataset')
plt.axis('equal')
plt.show()

## Subplots

**When to use:** Subplots are used when you want to display multiple plots in a grid layout, making comparisons easier.

Let's create subplots for each feature's distribution.

In [None]:
# Subplots of Feature Distributions
features = data.T
feature_names = iris.feature_names
colors = ['skyblue', 'salmon', 'limegreen', 'violet']

fig, axs = plt.subplots(2, 2, figsize=(12, 8))

# One histogram per feature, filling the 2x2 grid row by row
for ax, values, name, color in zip(axs.flat, features, feature_names, colors):
    ax.hist(values, bins=20, color=color, edgecolor='black')
    ax.set_title(name)

for ax in axs.flat:
    ax.set_xlabel('Measurement (cm)')
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()
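
When the panels are meant to be compared against each other, shared axes keep the scales identical. The sketch below shows one possible layout, with one panel per species for petal length; the choice of feature and the 1x3 grid are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]
bin_edges = np.linspace(petal_length.min(), petal_length.max(), 21)

# sharex/sharey force the same axis limits on every panel
fig, axs = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, i, name in zip(axs, range(3), iris.target_names):
    ax.hist(petal_length[iris.target == i], bins=bin_edges,
            color='skyblue', edgecolor='black')
    ax.set_title(name)
    ax.set_xlabel('Petal Length (cm)')

axs[0].set_ylabel('Frequency')  # shared y-axis, so one label is enough
fig.tight_layout()
plt.show()
```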