# Data Visualization using Python

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

### Five-number summary

- <span style="color:yellow">Minimum ($Q_0$ or 0th percentile)</span>: the lowest data point in the data set excluding any outliers
- <span style="color:yellow">First quartile ($Q_1$ or 25th percentile)</span>: also known as the lower quartile $q_n$(0.25), it is the median of the lower half of the dataset
- <span style="color:yellow">Median ($Q_2$ or 50th percentile)</span>: the middle value in the data set
- <span style="color:yellow">Third quartile ($Q_3$ or 75th percentile)</span>: also known as the upper quartile $q_n$(0.75), it is the median of the upper half of the dataset
- <span style="color:yellow">Maximum ($Q_4$ or 100th percentile)</span>: the highest data point in the data set excluding any outliers

- <span style="color:red">Interquartile range (IQR)</span>: the distance between the upper and lower quartiles:
    $\text{IQR} = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)$

## Box Plot
![boxplot.jpg](attachment:boxplot.jpg)

1. A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a dataset through five summary statistics: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
2. Box plots provide a compact and informative way to visualize the spread, central tendency, and skewness of the data.
3. The box in the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). It contains the middle 50% of the data.
4. The line inside the box represents the median, which is the middle value of the dataset when it is sorted in ascending order.
5. The whiskers, drawn as lines extending from the box, show how far the data spreads beyond the interquartile range. By convention each whisker ends at the most extreme data point that lies within 1.5 × IQR of the box.
6. Outliers, which are data points that fall outside the whiskers, are usually plotted as individual points or small circles.
7. Box plots are particularly useful for comparing the distributions of different groups or variables in a dataset, allowing for easy visual identification of differences in spread, central tendency, and skewness.
8. By providing a summary of key statistical measures, box plots can help identify potential data outliers, investigate data symmetry, and understand the overall shape of the distribution.
9. Box plots are commonly used in exploratory data analysis, data visualization, and statistical inference to gain insights into the characteristics and differences between groups or variables.

Percentile calculation:

1. Sort the dataset in ascending order.
2. Calculate the zero-based index corresponding to the ith percentile, which is (i/100) * (n-1), where n is the number of elements in the sorted dataset.
3. If the index is an integer, the value at that index is the ith percentile.
4. If the index is not an integer, interpolate between the two nearest values to calculate the exact value at the ith percentile.
- interpolated value = value_at_lower_index + (interpolation_factor * (value_at_upper_index - value_at_lower_index))
    -  where interpolation_factor = (desired_percentile_index - lower_index) / (upper_index - lower_index), lower_index is the index rounded down, and upper_index is the index rounded up
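The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not NumPy's implementation: the helper name `percentile_linear` is made up for this example, and the comparison assumes `np.percentile` is used with its default linear interpolation.

```python
import math
import numpy as np

def percentile_linear(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    ordered = sorted(values)                  # step 1: sort ascending
    index = (p / 100) * (len(ordered) - 1)    # step 2: zero-based fractional index
    lower = math.floor(index)
    upper = math.ceil(index)
    if lower == upper:                        # step 3: the index is an integer
        return float(ordered[lower])
    # step 4: interpolate; upper - lower is 1 here, so the factor is index - lower
    factor = (index - lower) / (upper - lower)
    return ordered[lower] + factor * (ordered[upper] - ordered[lower])

sample = [1, 2, 4, 5, 5, 6, 8, 9, 10, 10, 10]
for p in (25, 50, 75):
    # Print the hand-rolled result next to NumPy's for comparison
    print(p, percentile_linear(sample, p), np.percentile(sample, p))
```

For this sample the 25th percentile has index 2.5, so the result lies halfway between the values at positions 2 and 3 of the sorted list.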

In [4]:
data = [1, 2, 4, 5, 5, 6, 8, 9, 10, 10, 10]

# Quartiles (np.percentile interpolates linearly by default)
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)

# Interquartile range: spread of the middle 50% of the data
iqr = q3 - q1

# Whisker boundaries: points beyond these limits count as outliers
upper_whisker = q3 + 1.5 * iqr
lower_whisker = q1 - 1.5 * iqr

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
print("Upper whisker boundary:", upper_whisker)
print("Lower whisker boundary:", lower_whisker)

Q1: 4.5
Q2 (median): 6.0
Q3: 9.5
IQR: 5.0
Upper whisker boundary: 17.0
Lower whisker boundary: -3.0
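The boundaries computed above are limits, not the positions where the whiskers are drawn: each whisker ends at the most extreme data point that still lies inside the boundary. The sketch below derives the whisker ends by hand and compares them with `matplotlib.cbook.boxplot_stats`, the helper Matplotlib uses to compute box plot statistics. The variable names are chosen for illustration only.

```python
import numpy as np
from matplotlib import cbook

data = [1, 2, 4, 5, 5, 6, 8, 9, 10, 10, 10]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Whiskers end at the extreme data points that fall inside the boundaries
inside = [x for x in data if lower_boundary <= x <= upper_boundary]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the boundaries would be drawn as an individual outlier
outliers = [x for x in data if x < lower_boundary or x > upper_boundary]

# The same quantities as computed by Matplotlib (one dict per dataset)
stats = cbook.boxplot_stats(data, whis=1.5)[0]

print("Whisker ends (manual):    ", whisker_low, whisker_high)
print("Whisker ends (matplotlib):", stats["whislo"], stats["whishi"])
print("Outliers:", outliers)
```

Here every value lies inside the boundaries, so the whiskers simply reach the minimum and maximum of the data and no outliers are drawn.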


In [None]:
data = [1, 2, 3, 4, 5, 6, 7]

q1 = np.percentile(data, 25)

Comments for the above boxplot:

- The dataset [1, 2, 3, 4, 5, 6, 7] is symmetrically distributed without any extreme outliers. Therefore, the resulting box plot will have equal whisker lengths.

Comments for the above boxplot:

- The upper whisker stops short of the maximum value of 20 because 20 lies beyond the upper whisker boundary ($Q_3 + 1.5 \times \text{IQR}$) and is therefore drawn as a separate outlier point.
- The lower whisker reaches the minimum value of 1. Unequal whisker lengths like these appear when the dataset contains extreme values or is skewed, so that the data points are distributed asymmetrically around the median.
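To make the effect of an outlier concrete, here is a small sketch. The dataset behind the plot is not shown in this section, so the list `skewed` below is a hypothetical stand-in that has a minimum of 1 and a single extreme value of 20.

```python
from matplotlib import cbook

# Hypothetical dataset (assumption): minimum 1, one extreme value 20
skewed = [1, 2, 3, 4, 5, 6, 7, 20]

stats = cbook.boxplot_stats(skewed, whis=1.5)[0]
upper_boundary = stats["q3"] + 1.5 * stats["iqr"]

print("Q1, Q3, IQR:", stats["q1"], stats["q3"], stats["iqr"])
print("Upper whisker boundary:", upper_boundary)
print("Whisker ends:", stats["whislo"], stats["whishi"])
print("Outliers:", list(stats["fliers"]))
```

For this stand-in data the upper boundary falls between 7 and 20, so the upper whisker ends at 7 and 20 is reported as the only outlier.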

Comments for the above boxplot:

- In this special case no data point lies beyond the whisker boundaries, so no outliers are drawn. The values are spread symmetrically around the median, which results in equal whisker lengths.

### Boxplots with skewness

![Boxplots_with_skewness.png](attachment:Boxplots_with_skewness.png)

### Improve Boxplot using matplotlib and seaborn

Matplotlib
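One possible way to style a box plot with Matplotlib is sketched below. The colours, figure size, and labels are illustrative choices, not the original notebook's settings. In a notebook the figure renders inline; in a script, call `plt.show()` afterwards.

```python
import matplotlib.pyplot as plt

data = [1, 2, 4, 5, 5, 6, 8, 9, 10, 10, 10]

fig, ax = plt.subplots(figsize=(4, 5))
parts = ax.boxplot(
    data,
    patch_artist=True,   # draw the box as a filled patch so it can be coloured
    showmeans=True,      # add a marker for the mean
    boxprops=dict(facecolor="lightsteelblue", edgecolor="black"),
    medianprops=dict(color="darkred", linewidth=2),
    flierprops=dict(marker="o", markerfacecolor="orange", markersize=6),
)
ax.set_xticks([])                            # a single box needs no x tick
ax.set_ylabel("Value")
ax.set_title("Box plot with custom styling")
ax.grid(axis="y", linestyle=":", alpha=0.6)  # light horizontal guide lines
```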

Seaborn
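A comparable sketch with seaborn follows. Overlaying the raw observations with `sns.stripplot` is an optional addition that shows how many points sit behind each part of the box. The column name `value` and the styling are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"value": [1, 2, 4, 5, 5, 6, 8, 9, 10, 10, 10]})

sns.set_theme(style="whitegrid")   # seaborn's built-in grid style
fig, ax = plt.subplots(figsize=(6, 3))

# Horizontal box plot of the single column
sns.boxplot(data=df, x="value", color="lightsteelblue", width=0.4, ax=ax)

# Raw observations drawn on top of the box
sns.stripplot(data=df, x="value", color="black", size=4, alpha=0.6, ax=ax)

ax.set_title("Box plot with the raw observations overlaid")
```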

### Plot the boxplot for all features in the Iris dataset

Use a notched box plot:

A notched box plot is a variation of the standard box plot that includes a notch in the middle of the box. The notch is a graphical representation of the confidence interval around the median. It provides a visual estimate of the uncertainty or variability of the median.
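A notched box plot of the four Iris features might look like the sketch below. It assumes scikit-learn's bundled copy of the dataset (`load_iris(as_frame=True)`). By default Matplotlib is documented to place the notch at roughly median ± 1.57 × IQR / √n; the loop prints that interval for each feature using `cbook.boxplot_stats`.

```python
import matplotlib.pyplot as plt
from matplotlib import cbook
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data                 # DataFrame with the four measurements
values = features.to_numpy()         # one column per feature

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(values, notch=True, patch_artist=True)
ax.set_xticklabels(features.columns, rotation=15)
ax.set_ylabel("Measurement (cm)")
ax.set_title("Notched box plots of the Iris features")

# Notch limits: approximate confidence interval around each median
stats_list = cbook.boxplot_stats(values)
for name, s in zip(features.columns, stats_list):
    print(f"{name}: median={s['med']:.2f}, notch=({s['cilo']:.2f}, {s['cihi']:.2f})")
```

When the notches of two boxes do not overlap, this is commonly read as informal evidence that the two medians differ.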

## Violin Plot

1. A violin plot is a data visualization that combines aspects of a box plot and a kernel density plot. It provides a summary of the distribution of a dataset, displaying the density of the data at different values.
2. Violin plots are useful for understanding the shape, spread, and multimodal nature of the data, as well as comparing distributions between different groups or variables.
3. The width of the violin at any given point represents the estimated density of the data at that value. Wider sections indicate a higher density of data points, while narrower sections indicate a lower density.
4. Inside the violin, there is often a white dot or a line that represents the median of the data.
5. The interquartile range (IQR) is shown by the length of the central box within the violin, measured along the data axis. It contains the middle 50% of the data.
6. The "whiskers" in a violin plot extend from the ends of the box to represent the range of the data, excluding any outliers.
7. Violin plots can also display individual data points or outliers as small points or markers.
8. By providing a visual representation of the data density, violin plots allow for a more nuanced understanding of the distribution compared to traditional box plots.
9. Violin plots are commonly used in exploratory data analysis, data visualization, and statistical comparisons to reveal patterns, differences, and asymmetries in the data distribution. They are especially useful when dealing with complex or multimodal distributions.

### Plot the violin plot for all features in the Iris dataset