<a href="https://colab.research.google.com/github/cloudpedagogy/data-visualisation-python/blob/main/06_Categorical_Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Categorical Data Visualization


##Overview

**Introduction to Categorical Data Visualization using Python Seaborn**

Categorical data visualization is a crucial aspect of data analysis and exploration, especially when dealing with discrete or qualitative data. In many real-world scenarios, data can be grouped into categories or classes, and understanding the patterns and relationships between these categories is essential for making informed decisions.


**Key Features of Seaborn for Categorical Data Visualization:**

1. **Categorical Plots:** Seaborn offers a variety of categorical plots, including bar plots, count plots, point plots, and many more. These plots are specifically designed to showcase the distribution and relationships between different categories in the data.

2. **Colorful Palettes:** Seaborn comes with a broad range of color palettes that are thoughtfully designed to enhance the visual appeal of your plots. These palettes help differentiate between categories and make the plots more informative and engaging.

3. **Statistical Estimations:** Seaborn's categorical plots can automatically compute and display various statistical estimations, such as confidence intervals, standard deviations, and aggregations, to provide a deeper understanding of the data.

4. **Faceting Capabilities:** Seaborn allows you to create faceted categorical plots, where you can visualize multiple subsets of the data side by side or in a grid. This feature is invaluable when you want to compare various aspects of the data across different categories.

5. **Customization Options:** Seaborn offers an array of customization options, enabling you to fine-tune the appearance of your plots. You can customize the colors, styles, labels, and other elements to match your visualization preferences or the specific requirements of your audience.
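
As a small illustration of the palette feature listed above, the sketch below asks Seaborn for a few colors from one of its built-in qualitative palettes. The palette name "Set2" and the number of colors are arbitrary choices made for this example.

```python
import seaborn as sns

# Request three colors from the built-in "Set2" qualitative palette
palette = sns.color_palette("Set2", 3)

# Each entry is an (r, g, b) tuple with components between 0 and 1
for color in palette:
    print(tuple(round(component, 2) for component in color))
```

A palette obtained this way can be passed to most Seaborn plotting functions through their `palette` parameter.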



#Bar plots and count plots


##Bar plots




Bar plots in Seaborn display a numeric value for each level of a categorical variable. By default, `barplot()` sets the height of each bar to the mean of the numeric variable within that category and draws an error bar around it; a different aggregation can be chosen with the `estimator` parameter. When the values passed in are already aggregated, as with the pre-computed counts in the example below, each bar simply shows that value.

Seaborn provides the `barplot()` function to create bar plots. It is a high-level interface to create visually appealing and informative bar plots with minimal code.

Here's an example of creating a bar plot using the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Count the number of people with and without diabetes
diabetes_counts = dataset['Outcome'].value_counts()

# Create a bar plot using Seaborn
sns.set(style="darkgrid")
plt.figure(figsize=(6, 4))
sns.barplot(x=diabetes_counts.index, y=diabetes_counts.values)
plt.title("Number of People with and without Diabetes")
plt.xlabel("Diabetes")
plt.ylabel("Count")
plt.show()


In this example, we first load the Pima Indian Diabetes dataset using Pandas. Next, we call the `value_counts()` method on the 'Outcome' column, which counts how often each unique value occurs and so gives the number of people with and without diabetes.

We then create a bar plot using Seaborn's `barplot()` function. We set the style to "darkgrid" using `sns.set()` to give the plot a grid background. We specify the x-axis values as the unique values in the 'Outcome' column (`diabetes_counts.index`) and the y-axis values as the corresponding counts (`diabetes_counts.values`).

Finally, we add a title, x-label, and y-label to the plot using Matplotlib functions, and display the plot using `plt.show()`.

The resulting bar plot shows the number of people with and without diabetes, where the height of each bar represents the count for each category.


##Count plots



Count plots in Seaborn visualize the frequency of each category of a categorical variable. The y-axis shows the number of occurrences of each category, and the x-axis shows the categories themselves. Count plots are useful for exploring the distribution of categorical variables and identifying the most common categories within a dataset.

Here's an example of using a count plot in Seaborn with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Plot the count of diabetes outcome
sns.countplot(x='Outcome', data=dataset)
plt.title("Count of Diabetes Outcome")
plt.show()


In this example, we use Seaborn to create a count plot to visualize the frequency of the diabetes outcome in the Pima Indian Diabetes dataset.

We first load the dataset using Pandas as before. Then, we use the `sns.countplot()` function from Seaborn to create the count plot. We specify `x='Outcome'` to indicate that we want to plot the count of the 'Outcome' variable.

Finally, we add a title to the plot using `plt.title()` from Matplotlib, and display the plot using `plt.show()`.

The resulting count plot will show the number of occurrences for each category of the 'Outcome' variable, representing the count of individuals with and without diabetes in the dataset.
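
A count plot can also be split by a second categorical variable through the `hue` parameter. The sketch below uses hypothetical toy rows and illustrative age bins (neither comes from the original notebook) to show the table of counts that such a plot draws.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy rows standing in for the 'Age' and 'Outcome' columns
toy = pd.DataFrame({
    "Age": [22, 25, 31, 38, 45, 52, 29, 61],
    "Outcome": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Bin the numeric ages into labelled categories (the bin edges are illustrative)
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[20, 30, 40, 100],
                         labels=["21-30", "31-40", "41+"])

# The table behind the plot: number of rows for each (AgeGroup, Outcome) pair
counts = pd.crosstab(toy["AgeGroup"], toy["Outcome"])
print(counts)

# One group of bars per age group, one bar per outcome within each group
ax = sns.countplot(data=toy, x="AgeGroup", hue="Outcome")
ax.set_title("Outcome counts by age group (toy data)")
plt.show()
```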


#Point plots and factor plots


##Point plots



In Seaborn, a point plot is a categorical plot that shows how a numeric variable varies across the levels of a categorical variable, optionally split by a second categorical variable through `hue`. It draws the estimate of central tendency (the mean by default) of the numeric variable as a point, with error bars that show a 95% confidence interval by default. Point plots are useful for comparing groups or categories and identifying patterns or differences.

Here's an example of using a point plot in Seaborn with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Create a point plot to compare the average glucose levels for different outcomes
sns.pointplot(x="Outcome", y="Glucose", data=dataset, errorbar="sd")

# Set the plot title and axis labels
plt.title("Average Glucose Levels by Outcome")
plt.xlabel("Outcome")
plt.ylabel("Glucose")

# Display the plot
plt.show()


In this example, we use Seaborn to create a point plot that compares the average glucose levels for different outcomes (diabetic or non-diabetic) in the Pima Indian Diabetes dataset.

We use the `sns.pointplot()` function and specify the `x` and `y` variables as "Outcome" and "Glucose", respectively, to determine the grouping and numeric values to plot. The `errorbar` parameter is set to "sd" so that the error bars show one standard deviation around each mean instead of the default 95% confidence interval. On Seaborn versions before 0.12, the equivalent argument is `ci="sd"`.

We then customize the plot by adding a title using `plt.title()`, and setting the labels for the x-axis and y-axis using `plt.xlabel()` and `plt.ylabel()`, respectively.

Finally, we display the plot using `plt.show()`.

The resulting point plot visualizes the average glucose levels for diabetic and non-diabetic individuals, allowing for a comparison between the two groups. The error bars span one standard deviation on either side of each mean, which indicates the spread of glucose values within each group rather than a confidence interval for the mean.
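
A standard deviation and a confidence interval answer different questions: the first describes the spread of the data, the second the uncertainty in the estimated mean. The sketch below computes both on hypothetical toy data. The interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption made here for simplicity; Seaborn's default confidence interval is obtained by bootstrapping, so its numbers will differ slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'Outcome' and 'Glucose' columns
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Glucose": [90, 100, 110, 100, 140, 150, 160, 150],
})

# Per-group mean, sample standard deviation, and group size
summary = toy.groupby("Outcome")["Glucose"].agg(["mean", "std", "count"])

# Standard error of the mean shrinks as the group size grows
summary["sem"] = summary["std"] / np.sqrt(summary["count"])

# Half-width of an approximate 95% confidence interval (normal approximation)
summary["ci95_half_width"] = 1.96 * summary["sem"]

print(summary)
```

With larger groups the confidence interval becomes much narrower than the standard deviation, because the standard error divides by the square root of the group size.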


##Factor plots

Factor plots are Seaborn's general-purpose interface for drawing a categorical plot onto a grid of subplots. The function was originally called `factorplot()`; it was renamed `catplot()` in Seaborn 0.9 and the old name has since been removed, so the example below calls `catplot()`. The Pima Indian dataset contains information about Pima Indian women, including features such as age, number of pregnancies, glucose levels, blood pressure, skinfold thickness, insulin levels, BMI, diabetes pedigree function, and a target variable indicating whether the individual developed diabetes (1) or not (0).

The cell below imports the required libraries (`seaborn`, `pandas`, and `matplotlib.pyplot`), reads the dataset into a pandas DataFrame named `pima_data`, and then creates the plot.



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
pima_data = pd.read_csv(url, names=column_names)

# Creating the factor plot (catplot() replaces factorplot(), and height replaces size)
sns.set(style="whitegrid")
factor_plot = sns.catplot(x="Outcome", y="Glucose", hue="Outcome", data=pima_data, kind="box", height=6, aspect=1.5)
factor_plot.set(xlabel="Diabetes Outcome", ylabel="Glucose Level", title="Factor Plot of Glucose Level and Diabetes Outcome")

# Show the plot
plt.show()

In the code above, we use the `sns.catplot()` function from Seaborn to create the factor plot. We specify the x-axis variable as "Outcome", which represents the target variable indicating the diabetes outcome (0 or 1). The y-axis variable is "Glucose", which represents the glucose levels of the Pima Indian women.

We also use the `hue` parameter to color the box plot by the "Outcome" variable, making it easy to distinguish the glucose levels for diabetic (Outcome=1) and non-diabetic (Outcome=0) individuals. The `kind` parameter is set to "box" to create a box plot, which displays the median, quartiles, and any outliers in the data. The `height` and `aspect` parameters control the size and width-to-height ratio of the figure.

The resulting factor plot will show two side-by-side box plots, one for each diabetes outcome category, allowing us to visually compare the distribution of glucose levels for diabetic and non-diabetic individuals. This type of visualization is useful for understanding how glucose levels differ between the two groups and can provide insights into potential associations with diabetes development.
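
Because this kind of plot is drawn on a grid, it can also split the data into side-by-side panels with the `col` parameter. The following is a minimal sketch on hypothetical toy data; the `AgeGroup` column and its bin edges are invented for illustration and are not part of the original dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy rows standing in for a few columns of the dataset
toy = pd.DataFrame({
    "Age":     [22, 25, 28, 24, 45, 52, 48, 61],
    "Glucose": [90, 100, 140, 150, 110, 120, 160, 170],
    "Outcome": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Derive a categorical column to facet on (the bin edges are illustrative)
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[20, 40, 100], labels=["21-40", "41+"])

# One box-plot panel per age group, arranged in columns
grid = sns.catplot(data=toy, x="Outcome", y="Glucose", col="AgeGroup",
                   kind="box", height=4, aspect=1)
grid.set_axis_labels("Diabetes Outcome", "Glucose Level")
plt.show()
```

Changing `kind` to another value such as "point" or "bar" redraws the same grid with a different plot type.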

#Heatmaps and cluster maps


##Heatmaps



Heatmaps in Seaborn display a table of values as a grid of colored cells, with the color of each cell encoding its value. They are particularly useful for spotting relationships or patterns in a matrix, such as the pairwise correlations between the columns of a dataset.

Seaborn's built-in `heatmap()` function draws a heatmap from any rectangular dataset, such as a pandas DataFrame or a 2-D array.

Here's an example of creating a heatmap using Seaborn with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Calculate the correlation matrix
correlation_matrix = dataset.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap - Pima Indian Diabetes Dataset")
plt.show()


In this example, we use Seaborn to create a correlation heatmap for the Pima Indian Diabetes dataset.

First, we calculate the correlation matrix of the dataset using the `corr()` method, which computes the pairwise correlations between the columns.

Next, we create a heatmap using the `heatmap()` function from Seaborn. We pass the correlation matrix as the data to be visualized. The `annot=True` argument adds annotations to the heatmap, displaying the correlation values in each cell. The `cmap="coolwarm"` argument specifies the color map to be used.

Finally, we customize the plot by setting the figure size, adding a title, and displaying the heatmap using `plt.show()` from the Matplotlib library.

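The resulting heatmap visually represents the correlations between the different features in the Pima Indian Diabetes dataset. With the "coolwarm" color map, red cells mark the highest values in the matrix and blue cells the lowest, with pale colors in between. By default the color scale is fitted to the range of the values being plotted, so the neutral midpoint does not necessarily correspond to a correlation of zero. The annotations provide the actual correlation value in each cell, which helps in interpreting the heatmap.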
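
Two common refinements for correlation heatmaps are fixing the color scale to the full correlation range and hiding the redundant upper triangle. The sketch below applies both to hypothetical random data; the column names "A" to "D" and the way column "D" is constructed are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical random data; column D is built to be negatively related to A
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["A", "B", "C", "D"])
toy["D"] = -0.8 * toy["A"] + rng.normal(scale=0.5, size=50)

corr = toy.corr()

# Boolean mask that is True above the diagonal; masked cells are not drawn
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

plt.figure(figsize=(6, 5))
# vmin, vmax and center pin the color scale so that zero maps to the neutral color
ax = sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
                 cmap="coolwarm", vmin=-1, vmax=1, center=0)
ax.set_title("Correlation heatmap (toy data)")
plt.show()
```

With the scale pinned to the range from -1 to 1, red cells always indicate positive correlations and blue cells negative ones, which makes heatmaps of different datasets directly comparable.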


##Cluster maps



Cluster maps in Seaborn are heatmaps enhanced with hierarchical clustering. They provide a visual representation of the similarity or dissimilarity between samples (rows) and variables (columns) in a dataset. Hierarchical clustering reorders the rows and columns so that similar ones sit next to each other, and dendrograms drawn along the margins show how they were grouped.

Here's an example of using a cluster map with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Remove the 'Outcome' column for clustering analysis
data_for_clustering = dataset.drop('Outcome', axis=1)

# Perform hierarchical clustering and create a cluster map
cluster_map = sns.clustermap(data_for_clustering, cmap='coolwarm', standard_scale=1)

# Display the plot
plt.show()


In this example, we load the Pima Indian Diabetes dataset using Pandas. To create the cluster map, we remove the 'Outcome' column from the dataset since it represents the class labels and is not relevant for clustering analysis.

We use the `clustermap()` function from Seaborn to perform hierarchical clustering on the remaining columns of the dataset. The `cmap='coolwarm'` argument sets the color map for the heatmap, and `standard_scale=1` rescales each column to the range 0 to 1 by subtracting its minimum and dividing by its range. Standardizing by subtracting the mean and dividing by the standard deviation is a separate option, `z_score=1`.

The resulting cluster map is displayed using Matplotlib's `plt.show()` function. The cluster map visualizes the similarity or dissimilarity between the samples (rows) and variables (columns) in the dataset. Similar samples and variables are placed next to each other, and the dendrograms along the margins show the cluster hierarchy; the cell colors encode the scaled values, not cluster membership.

Cluster maps are useful for identifying patterns, similarities, or outliers in the data, and can provide insights into the underlying structure of the dataset.
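
The two column-scaling options of `clustermap()` can be reproduced by hand to see how they differ. The sketch below applies both to hypothetical toy values using plain pandas; it mirrors the arithmetic of `standard_scale=1` and `z_score=1` but does not call `clustermap()` itself.

```python
import pandas as pd

# Hypothetical toy values standing in for two columns of the dataset
toy = pd.DataFrame({
    "Glucose": [80, 100, 120, 180],
    "BMI": [20.0, 25.0, 30.0, 45.0],
})

# Min-max scaling per column, as with standard_scale=1: every column ends up in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-scores per column, as with z_score=1: every column gets mean 0 and standard deviation 1
z_scores = (toy - toy.mean()) / toy.std()

print(min_max)
print(z_scores)
```

Either form of scaling keeps columns with large raw values, such as glucose, from dominating the distance calculations used for clustering.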


#Reflection points

**1. Bar Plots and Count Plots:**
- What are the main differences between bar plots and count plots?
- In what situations would you choose to use a bar plot over a count plot, and vice versa?
- How can you customize the appearance of bar plots and count plots to enhance their visual representation?
- Can you think of any real-world scenarios where bar plots or count plots would be useful for data analysis or visualization?

**2. Point Plots and Factor Plots:**
- How do point plots and factor plots differ in terms of their visual representation and the insights they provide?
- What are some key considerations when choosing between point plots and factor plots for analyzing categorical data?
- Can you explain how to interpret the information presented in point plots and factor plots?
- Are there any limitations or potential pitfalls when using point plots or factor plots that learners should be aware of?

**3. Heatmaps and Cluster Maps:**
- What are the primary purposes and applications of heatmaps and cluster maps in data analysis and visualization?
- How can heatmaps and cluster maps help identify patterns and relationships within a dataset?
- What are some techniques to customize and enhance the interpretation of heatmaps and cluster maps?
- Can you think of specific examples or use cases where heatmaps and cluster maps would be valuable tools for data exploration and decision-making?


#A quiz on Categorical Data Visualization


**Bar Plots and Count Plots:**

1. Which type of plot counts the number of observations in each category directly from the raw data?
   <br>a) Bar Plot
   <br>b) Count Plot
   <br>c) Point Plot
   <br>d) Heatmap

2. In a count plot, what does the height of each bar represent?
   <br>a) The count of unique values in the category
   <br>b) The count of occurrences of each category
   <br>c) The average value of each category
   <br>d) The standard deviation of each category

**Point Plots and Factor Plots:**

3. Point plots show an estimate of a numeric variable for each level of a categorical variable. What type of estimate is shown by default in a point plot?
   <br>a) Mean
   <br>b) Median
   <br>c) Mode
   <br>d) Standard Deviation

4. Factor plots (created with `catplot()` in current versions of Seaborn) are a generalized form of categorical plots. Which parameter is used to specify the kind of factor plot to create?
   <br>a) plot_type
   <br>b) factor_kind
   <br>c) kind
   <br>d) plot_kind

**Heatmaps and Cluster Maps:**

5. Heatmaps are useful for visualizing:
   <br>a) The distribution of a single numerical variable
   <br>b) The distribution of two numerical variables
   <br>c) The distribution of a single categorical variable
   <br>d) The correlation between two or more variables

6. Cluster maps are used to:
   <br>a) Display a cluster of points on a scatter plot
   <br>b) Cluster and visualize similarity between variables or samples
   <br>c) Create clusters of categorical variables for analysis
   <br>d) Plot data points in a hierarchical clustering arrangement

---
**Answers:**

1. b) Count Plot
2. b) The count of occurrences of each category
3. a) Mean
4. c) kind
5. d) The correlation between two or more variables
6. b) Cluster and visualize similarity between variables or samples

---