<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/IDSNlogo.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Pie Charts, Box Plots, Scatter Plots, and Bubble Plots

## Abstract
This comprehensive lab will familiarize you with the powerful data visualization capabilities of Matplotlib. The lab will explore the creation and understanding of Pie Charts, Box Plots, Scatter Plots, and Bubble Plots, providing insights into health data related to diabetes. Through hands-on exercises and real-world datasets, you'll gain the abilities to effectively represent and analyze medical information using Matplotlib's diverse plotting techniques. This lab will prepare you to make informed decisions and advancements in the field of healthcare, particularly in diabetes research and analysis.

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Explore Matplotlib library further
*   Create pie charts, box plots, scatter plots and bubble charts


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1.  [Exploring Datasets with *p*andas](#0)<br>
2.  [Downloading and Prepping Data](#2)<br>
3.  [Visualizing Data using Matplotlib](#4) <br>
4.  [Pie Charts](#6) <br>
5.  [Box Plots](#8) <br>
6.  [Scatter Plots](#10) <br>
7.  [Bubble Plots](#12) <br>

</div>


# Importing Libraries<a id="0"></a>


In [None]:
#Import primary modules.
import numpy as np  # useful for scientific computing in Python
import pandas as pd # primary data structure library

#Importing Matplotlib
#%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

## Importing Data <a id="2"></a>

## The Dataset: Diabetes 130 US hospitals for years 1999-2008<a id="1"></a>
<p>
There are various formats for a dataset: .csv, .json, .xlsx  etc. The dataset can be stored in different places, on your local machine or sometimes online.<br>

In this section, you will learn how to load a dataset into our Jupyter Notebook.<br>

In our case, the Diabetes Dataset is an online source, and it is in a CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/brandao/diabetes" target="_blank">https://www.kaggle.com/datasets/brandao/diabetes</a></li>
    <li>Data type: csv</li>
</ul>

The statistical data obtained from https://www.kaggle.com/datasets/brandao/diabetes under [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/) license.

Dataset output column is "Readmitted" with values "<30",">30","NO". The rest of the columns (44) are input columns.

The Pandas Library is a useful tool that enables us to read various datasets into a dataframe; our Jupyter notebook platforms have a built-in <b>Pandas Library</b> so that all we need to do is import Pandas without installing.

We have already **pre-processed** the data, so we will use the **clean data** saved in CSV format for this lab. The original Diabetes dataset can be fetched from [here](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0IT0EN/diabetic_data.csv).
</p>

<details><summary>DataSet structure</summary>
Column names:
    
1. `Race` - Race. Values: Caucasian, Asian, African American, Hispanic, and other
2. `Gender` - Gender. Values: male, female, and unknown/invalid 
3. `Age` - Age grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100)
4. `Weight` - Weight in pounds 
5. `Time In Hospital` - Integer number of days between admission and discharge
6. `Medical Specialty` - Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values 
7. `Num Lab Procedures` - Number of lab tests performed during the encounter 
8. `Num Procedures` - Number of procedures (other than lab tests) performed during the encounter 
9. `Num Medications` - Number of distinct generic names administered during the encounter 
10. `Number Outpatient` - Number of outpatient visits of the patient in the year preceding the encounter 
11. `Number Emergency` - Number of emergency visits of the patient in the year preceding the encounter 
12. `Number Inpatient` - Number of inpatient visits of the patient in the year preceding the encounter 
13. `Diagnosis1` - The primary diagnosis (coded as first three digits of ICD9); 848 distinct values
14. `Diagnosis2` - Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values
15. `Diagnosis3` - Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values
16. `Number Diagnoses` - Number of diagnoses entered into the system 
17. `Max Glu Serum` - Indicates the range of the result or if the test was not taken. Values: >200, >300, normal, none if not measured
18. `A1c Result` - Indicates the range of the result or if the test was not taken. Values: >8, >7, normal, none if not measured
19. `Metformin` - Indicates whether the drug was prescribed or there was a change in the dosage 
20. `Repaglinide` - Indicates whether the drug was prescribed or there was a change in the dosage 
21. `Nateglinide` - Indicates whether the drug was prescribed or there was a change in the dosage 
22. `Chlorpropamide` - Indicates whether the drug was prescribed or there was a change in the dosage 
23. `Glimepiride` - Indicates whether the drug was prescribed or there was a change in the dosage
24. `Acetohexamide` - Indicates whether the drug was prescribed or there was a change in the dosage 
25. `Glipizide` - Indicates whether the drug was prescribed or there was a change in the dosage 
26. `Glyburide` - Indicates whether the drug was prescribed or there was a change in the dosage 
27. `Tolbutamide` - Indicates whether the drug was prescribed or there was a change in the dosage 
28. `Pioglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
29. `Rosiglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
30. `Acarbose` - Indicates whether the drug was prescribed or there was a change in the dosage 
31. `Miglitol` - Indicates whether the drug was prescribed or there was a change in the dosage 
32. `Troglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
33. `Tolazamide` - Indicates whether the drug was prescribed or there was a change in the dosage 
34. `Examide` - Indicates whether the drug was prescribed or there was a change in the dosage 
35. `Citoglipton` - Indicates whether the drug was prescribed or there was a change in the dosage 
36. `Insulin` - Indicates whether the drug was prescribed or there was a change in the dosage 
37. `Glyburide-metformin` - Indicates whether the drug was prescribed or there was a change in the dosage 
38. `Glipizide-metformin` - Indicates whether the drug was prescribed or there was a change in the dosage 
39. `Glimepiride-pioglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
40. `Metformin-rosiglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
41. `Metformin-pioglitazone` - Indicates whether the drug was prescribed or there was a change in the dosage 
42. `Change` - Indicates if there was a change in diabetic medications (either dosage or generic name)
43. `Diabetes Medication` - Indicates if there was any diabetic medication prescribed
44. **`Readmitted`** - Days to inpatient readmission. Values: <30, >30, No for no record of readmission
</details>


In [None]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0L8IEN/diabetes_new.csv')

print('Data read into a pandas dataframe!')

In [None]:
df.head()

Let's find out how many entries there are in our dataset.


In [None]:
# print the dimensions of the dataframe
print(df.shape)

# Visualizing Data using Matplotlib<a id="4"></a>


# Pie Charts <a id="6"></a>

A `pie chart` is a circular graphic that displays numeric proportions by dividing a circle (or pie) into proportional slices. You are most likely already familiar with pie charts, as they are widely used in business and media. We can create pie charts in Matplotlib by passing in the `kind='pie'` keyword.

Let's use a pie chart to explore the distribution of diabetes cases by race.


Step 1: Gather data.

We will use *pandas* `groupby` method to summarize diabetes cases data by race. The general process of `groupby` involves the following steps:

1.  **Split:** Splitting the data into groups based on some criteria.
2.  **Apply:** Applying a function to each group independently, for example `.sum()`, `.count()`, `.mean()`, `.std()`, `.aggregate()`, or `.apply()`.
3.  **Combine:** Combining the results into a data structure.


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/Mod3Fig4SplitApplyCombine.png" height="400" align="center">
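
The split-apply-combine steps above can be sketched on a tiny example. The toy dataframe below is hypothetical and only stands in for the lab's `Race` column; it also shows that `value_counts()` is a shorter route to the same counts.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab's 'Race' column.
toy = pd.DataFrame({"Race": ["A", "B", "A", "C", "A", "B"]})

# Split by 'Race', apply a row count to each group, combine into a Series.
counts_groupby = toy.groupby("Race").size()

# value_counts() gives the same counts; sort_index() aligns the ordering.
counts_value = toy["Race"].value_counts().sort_index()

print(counts_groupby.to_dict())
```

Adding a `Count` column of ones and summing it, as this lab does, yields the same numbers; `size()` simply skips the helper column.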


In [None]:
# df_race
# Create a copy of the 'Race' column
df_race = pd.DataFrame(df['Race'].copy())

# Add a 'Count' column 
df_race['Count'] = 1

# Filter out rows where 'Race' is missing ('?')
df_race = df_race[df_race['Race'] != '?']

# Group by 'Race' and sum the counts
df_race = df_race.groupby('Race').sum()
df_race

Step 2: Plot the data. We will pass in `kind = 'pie'` keyword, along with the following additional parameters:

*   `autopct` -  is a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be `fmt%pct`.
*   `startangle` - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
*   `shadow` - Draws a shadow beneath the pie (to give a 3D feel).
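
To make the `autopct` format string concrete, here is a minimal sketch of the labels it would produce. The wedge values are hypothetical; the format string is the one used in this lab.

```python
# Hypothetical wedge values; Matplotlib normalizes them to percentages.
values = [50, 30, 20]
total = sum(values)
fmt = '%1.1f%%'  # same format string used for autopct in this lab

# Each label is fmt % pct, where pct is the wedge's share of the total.
labels = [fmt % (100 * v / total) for v in values]
print(labels)
```

The doubled `%%` in the format string is what prints a literal percent sign after the number.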


In [None]:
# autopct adds percentage labels; startangle sets the starting angle
df_race['Count'].plot(kind='pie',
                            figsize=(5, 6),
                            autopct='%1.1f%%', # add in percentages
                            startangle=90,     # start angle 90°
                            shadow=True,       # add shadow      
                            )

plt.title('Distribution of Diabetes Cases by Race')
plt.axis('equal') # Sets the pie chart to look like a circle.
plt.legend(labels=df_race.index, loc='upper left') 


plt.show()

The above visual is not very clear: the numbers and text overlap in some instances. Let's make a few modifications to improve the visuals:

*   Remove the text labels on the pie chart by passing in `labels=None`, and add them as a separate legend using `plt.legend()`.
*   Push out the percentages to sit just outside the pie chart by passing in `pctdistance` parameter.
*   Pass in a custom set of colors for races by passing in `colors` parameter.
*   **Explode** the pie chart to emphasize the lowest three groups (Asian, Hispanic, Other) by passing in `explode` parameter.
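
The `explode` list needs one entry per wedge, in the same order as the data. As a sketch of one way to build it from the counts rather than typing it by hand (the counts below are hypothetical, not the lab's real numbers):

```python
import pandas as pd

# Hypothetical counts indexed by group name (not the lab's real numbers).
counts = pd.Series({"A": 500, "B": 20, "C": 900, "D": 60, "E": 40})

# Offset the three smallest wedges by 0.1 and leave the rest in place.
smallest = counts.nsmallest(3).index
explode_list = [0.1 if name in smallest else 0 for name in counts.index]
print(explode_list)
```

Deriving the list from the data keeps it the right length if a group is added or filtered out.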


In [None]:
colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0, 0.2, 0, 0.1, 0.2] # ratio for each race with which to offset each wedge.

df_race['Count'].plot(kind='pie',
                            figsize=(10, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,         # turn off labels on pie chart
                            pctdistance=1.12,    # the ratio between the center of each pie slice and the start of the text generated by autopct 
                            colors=colors_list,  # add custom colors
                            explode=explode_list # 'explode' the three smallest groups
                            )

# scale the title up by 12% to match pctdistance
plt.title('Distribution of Diabetes Cases by Race', y=1.12, fontsize = 15) 

plt.axis('equal') 

# add legend
plt.legend(labels=df_race.index, loc='upper left', fontsize=7) 

plt.show()

**Question:** Using a pie chart, explore the distribution of number of diabetes cases grouped by gender.

**Note**: You might need to play with the explode values in order to fix any overlapping slice values.


In [None]:
### type your answer here

# Create a copy of the 'Gender' column
df_gender = pd.DataFrame(df['Gender'].copy())

# Add a 'Count' column 
df_gender['Count'] = 1

# Group by 'Gender' and sum the counts
df_gender = df_gender.groupby('Gender').sum()


explode_list = [0.1, 0.1, 0.2] # ratio for each gender with which to offset each wedge.

df_gender['Count'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.3f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,                 # turn off labels on pie chart
                            pctdistance=1.12,            # the ratio between the pie center and start of text label
                            explode=explode_list         # 'explode' lowest genders
                            )

# scale the title up by 12% to match pctdistance
plt.title('Distribution of Diabetes Cases by Gender', y=1.12) 
plt.axis('equal') 

# add legend
plt.legend(labels=df_gender.index, loc='upper left') 

# show plot
plt.show()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:

# Create a copy of the 'Gender' column
df_gender = pd.DataFrame(df['Gender'].copy())

# Add a 'Count' column 
df_gender['Count'] = 1

# Group by 'Gender' and sum the counts
df_gender = df_gender.groupby('Gender').sum()


explode_list = [0.1, 0.1, 0.2] # ratio for each gender with which to offset each wedge.

df_gender['Count'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.3f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,                 # turn off labels on pie chart
                            pctdistance=1.12,            # the ratio between the pie center and start of text label
                            explode=explode_list         # 'explode' lowest genders
                            )

# scale the title up by 12% to match pctdistance
plt.title('Distribution of Diabetes Cases by Gender', y=1.12) 
plt.axis('equal') 

# add legend
plt.legend(labels=df_gender.index, loc='upper left') 

# show plot
plt.show()
```

</details>


# Box Plots <a id="8"></a>

A `box plot` is a way of statistically representing the *distribution* of the data through five main dimensions:

*   **Minimum:** The smallest number in the dataset excluding the outliers.
*   **First quartile:** Middle number between the `minimum` and the `median`.
*   **Second quartile (Median):** Middle number of the (sorted) dataset.
*   **Third quartile:** Middle number between `median` and `maximum`.
*   **Maximum:** The largest number in the dataset excluding the outliers.


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/boxplot_complete.png" width="440" align="center">
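
The five values described above can be computed directly with NumPy. This is a minimal sketch on a hypothetical sample; it assumes the common 1.5 × IQR whisker rule, which is also Matplotlib's default.

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_min, whisker_max = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, median, q3, whisker_min, whisker_max, outliers)
```

Note that `np.percentile` interpolates between data points, so the quartiles need not be values that appear in the sample.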


To make a `boxplot`, we can use `kind='box'` in the `plot` method invoked on a *pandas* series or dataframe.

Let's plot the box plot for the average time in hospital by age of patients.


Step 1: Get the subset of the dataset. Even though we are extracting the data for just one column, we will obtain it as a dataframe. This will help us with calling the `dataframe.describe()` method to view the percentiles.


In [None]:
# Create a df consisting of the 'Age' and 'Time In Hospital' columns
df_box = df[['Age', 'Time In Hospital']]

# Group by 'Age' and find the mean
df_box = df_box.groupby('Age').mean()
df_box

Step 2: Plot by passing in `kind='box'`.


In [None]:
df_box['Time In Hospital'].plot(kind='box', figsize=(8, 6))

plt.title('Box plot of the average time spent in the hospital by age of people with diabetes')
plt.ylabel('Number of days')

plt.show()

We can immediately make a few key observations from the plot above:

1.  The minimum number of days is around 2.5 (min), maximum number of days is around 4.8 (max), and median number of days is around 4.1 (median).
2.  25% of the age groups had an average time in hospital of \~3.6 days or fewer (First quartile).
3.  75% of the age groups had an average time in hospital of \~4.5 days or fewer (Third quartile).

We can view the actual numbers by calling the `describe()` method on the dataframe.


In [None]:
df_box.describe()

One of the key benefits of box plots is comparing the distribution of multiple datasets. Let's analyze number of outpatient and inpatient visits of the patient in the year preceding the encounter using box plots.

**Question:** Compare the distribution of outpatient and inpatient visits of the patients grouped by age.


Step 1: Get the dataset for number of outpatient and inpatient visits and call the dataframe **df_out_in**.

We will use `groupby()` function to group our data by age categories and find mean of inpatient and outpatient visits of the patients. 


In [None]:
### type your answer here
# Create a df_out_in with age, outpatient, and inpatient visits
df_out_in = df[['Age', 'Number Outpatient', 'Number Inpatient']]

# Group by 'Age' and find mean
df_out_in = df_out_in.groupby('Age').mean()

df_out_in

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:
# Create a df_out_in with age, outpatient, and inpatient visits
df_out_in = df[['Age', 'Number Outpatient', 'Number Inpatient']]

# Group by 'Age' and find mean
df_out_in = df_out_in.groupby('Age').mean()

df_out_in
```

</details>


Let's view the percentiles associated with both columns using the `describe()` method.


In [None]:
### type your answer here
df_out_in.describe()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:
df_out_in.describe()
```
</details>


Step 2: Plot data.


In [None]:
### type your answer here
df_out_in.plot(kind='box', figsize=(10, 7))

plt.title('Box plots of mean number of inpatient and outpatient visits of the patients')
plt.ylabel('Mean of visits')

plt.show()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:
df_out_in.plot(kind='box', figsize=(10, 7))

plt.title('Box plots of mean number of inpatient and outpatient visits of the patients')
plt.ylabel('Mean of visits')

plt.show()
```

</details>


We can observe that the median of inpatient visits (\~0.61) is around two times bigger than the median of outpatient visits (\~0.33). The maximum value of outpatient visits (\~0.42) is lower than the minimum value of inpatient visits (\~0.53). Note that these comparisons do not take into account the outliers present in both plots; we will talk about them later.


If you prefer to create horizontal box plots, you can pass the `vert` parameter in the **plot** function and assign it to *False*. You can also specify a different color in case you are not a big fan of the default red color.


In [None]:
# horizontal box plots
df_out_in.plot(kind='box', figsize=(10, 7), color='blue', vert=False)

plt.title('Box plots of mean number of inpatient and outpatient visits of the patients')
plt.ylabel('Type of visits')

plt.show()

**Subplots**

Often we might want to draw multiple plots within the same figure. For example, we might want to perform a side-by-side comparison of the box plot with the line plot of inpatient and outpatient visits.

To visualize multiple plots together, we can create a **`figure`** (overall canvas) and divide it into **`subplots`**, each containing a plot. With **subplots**, we usually work with the **artist layer** instead of the **scripting layer**.

Typical syntax is : <br>

```python
    fig = plt.figure() # create figure
    ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots
```

Where

*   `nrows` and `ncols` are used to notionally split the figure into (`nrows` \* `ncols`) sub-axes,
*   `plot_number` is used to identify the particular subplot that this function is to create within the notional grid. `plot_number` starts at 1, increments across rows first and has a maximum of `nrows` \* `ncols` as shown below.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/Mod3Fig5Subplots_V2.png" width="500" align="center">
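
The row-first numbering described above can be sketched in plain Python. The helper `subplot_position` is hypothetical (it is not part of Matplotlib) and only illustrates how a `plot_number` maps to a cell of the notional grid.

```python
def subplot_position(nrows, ncols, plot_number):
    """Return the zero-based (row, col) of a 1-based plot_number."""
    if not 1 <= plot_number <= nrows * ncols:
        raise ValueError("plot_number must be between 1 and nrows * ncols")
    return divmod(plot_number - 1, ncols)

# In a 2 x 3 grid, numbers 1-3 fill the first row and 4-6 the second.
positions = [subplot_position(2, 3, n) for n in range(1, 7)]
print(positions)
```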


We can then specify which subplot to place each plot in by passing the `ax` parameter to the `plot()` method as follows:


In [None]:
fig = plt.figure() # create figure

ax0 = fig.add_subplot(1, 2, 1) # add subplot 1 (1 row, 2 columns, first plot)
ax1 = fig.add_subplot(1, 2, 2) # add subplot 2 (1 row, 2 columns, second plot). See tip below**

# Subplot 1: Box plot
df_out_in.plot(kind='box', color='blue', vert=False, figsize=(20, 6), ax=ax0) # add to subplot 1
ax0.set_title('Box plots of mean number of inpatient and outpatient visits of the patients')
ax0.set_xlabel('Mean of visits')
ax0.set_ylabel('Type of visits')

# Subplot 2: Line plot
df_out_in.plot(kind='line', figsize=(20, 6), ax=ax1) # add to subplot 2
ax1.set_title('Line plots of mean number of inpatient and outpatient visits of the patients')
ax1.set_ylabel('Mean of visits')
ax1.set_xlabel('Age')

plt.show()

**Tip regarding subplot convention**

In the case when `nrows`, `ncols`, and `plot_number` are all less than 10, a convenience exists such that a 3-digit number can be given instead, where the hundreds represent `nrows`, the tens represent `ncols` and the units represent `plot_number`. For instance,

```python
   subplot(211) == subplot(2, 1, 1) 
```

produces a subaxes in a figure which represents the top plot (i.e. the first) in a 2 rows by 1 column notional grid (no grid actually exists, but conceptually this is how the returned subplot has been positioned).


Let's try something a little more advanced.


**Question:** Create a box plot to visualize the distribution of the number of lab tests performed during the encounter grouped by races and the *age groups* `Young`, `Middle-aged`, and `Senior`.


Step 1: Get the dataset. Use the `pivot_table()` function to set the 'Race' column as the index and the age groups in 'Age' as columns, so that box plots can be created easily with the `plot()` function. The `pivot_table()` function in pandas creates a spreadsheet-style pivot table that aggregates and reshapes data. It allows you to specify the index, columns, and values to rearrange the data accordingly. Name the dataframe **df_box2**.
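
As a minimal sketch of what `pivot_table()` does, consider the hypothetical long-format records below (the column names are illustrative only and do not come from the lab's dataset).

```python
import pandas as pd

# Hypothetical long-format records (column names are illustrative only).
records = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "bucket": ["low", "high", "low", "low", "high"],
    "value": [10, 20, 30, 50, 60],
})

# Rows come from 'group', columns from 'bucket'; cells hold the mean of 'value'
# because aggfunc defaults to the mean.
wide = records.pivot_table(index="group", columns="bucket", values="value")
print(wide)
```

The two `low` records for group `y` are averaged into a single cell, which is why no explicit `groupby` is needed before pivoting.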


In [None]:
### type your answer here

# Create df_box2 dataframe using pivot table
df_box2 = df.pivot_table(columns='Age', index='Race', values='Num Lab Procedures')

# Excluding missing data
df_box2 = df_box2[df_box2.index != '?']

df_box2

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:
# Create df_box2 dataframe using pivot table
df_box2 = df.pivot_table(columns='Age', index='Race', values='Num Lab Procedures')

# Excluding missing data
df_box2 = df_box2[df_box2.index != '?']

df_box2
```

</details>


Step 2: Create new columns containing age categories by taking the mean of the existing age groups. One way to do that:

1.  Define new age group columns by finding a mean of the respective age ranges
2.  Drop the original age columns after merging
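
The two steps above can be sketched on a small hypothetical table. The point of the sketch is the explicit slice ends: once a new column has been appended, an open-ended slice such as `iloc[:, 3:]` would also pick it up.

```python
import pandas as pd

# Hypothetical table with six original columns to merge into two groups.
table = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]],
    columns=["c0", "c1", "c2", "c3", "c4", "c5"],
)
n_original = table.shape[1]

# Slice with explicit end positions so that columns added earlier in this
# block are never pulled into a later mean.
table["first_half"] = table.iloc[:, 0:3].mean(axis=1)
table["second_half"] = table.iloc[:, 3:n_original].mean(axis=1)

# Keep only the merged columns.
merged = table.drop(columns=table.columns[:n_original])
print(merged)
```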


In [None]:
### type your answer here

# Define new age group columns by finding a mean of the respective age ranges
df_box2['Young'] = df_box2.iloc[:, 0:3].mean(axis=1)
df_box2['Middle-aged'] = df_box2.iloc[:, 3:7].mean(axis=1)
df_box2['Senior'] = df_box2.iloc[:, 7:10].mean(axis=1)  # stop at 10 to exclude the new 'Young' and 'Middle-aged' columns

# Drop the original age columns after merging
df_box2 = df_box2.drop(df_box2.columns[0:10], axis=1)
df_box2

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:
# Define new age group columns by finding a mean of the respective age ranges
df_box2['Young'] = df_box2.iloc[:, 0:3].mean(axis=1)
df_box2['Middle-aged'] = df_box2.iloc[:, 3:7].mean(axis=1)
df_box2['Senior'] = df_box2.iloc[:, 7:10].mean(axis=1)  # stop at 10 to exclude the new 'Young' and 'Middle-aged' columns

# Drop the original age columns after merging
df_box2 = df_box2.drop(df_box2.columns[0:10], axis=1)
df_box2
```

</details>


Let's learn more about the statistics associated with the dataframe using the `describe()` method.


In [None]:
### type your answer here
df_box2.describe()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:    
df_box2.describe()
```

</details>


Step 3: Plot the box plots.


In [None]:
### type your answer here
df_box2.plot(kind='box', figsize=(10, 6))

plt.title('Number of lab tests performed during the encounter grouped by age group')

plt.show()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:    
df_box2.plot(kind='box', figsize=(10, 6))

plt.title('Number of lab tests performed during the encounter grouped by age group')

plt.show()
```

</details>


This box plot scans the data and identifies the outliers. In order to be an outlier, the data value must be:<br>

*   larger than Q3 by at least 1.5 times the interquartile range (IQR), or,
*   smaller than Q1 by at least 1.5 times the IQR.

Let's look at 'Young' age category as an example: <br>

*   Q1 (25%) = 41.1667 <br>
*   Q3 (75%) = 42.5135 <br>
*   IQR = Q3 - Q1 = 1.3468 <br>

Using the definition of outlier, any value that is greater than Q3 by 1.5 times IQR will be flagged as outlier.

Outlier > 42.5135 + (1.5 \* 1.3468) <br>
Outlier > 44.5337
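
The threshold arithmetic above can be checked in code. The helper `upper_outlier_threshold` is hypothetical and assumes the 1.5 × IQR rule stated above; the quartile values are the ones quoted for the 'Young' column.

```python
import pandas as pd

def upper_outlier_threshold(values, k=1.5):
    """Return Q3 + k * IQR for a pandas Series (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Reproduce the calculation from the quartiles quoted above.
q1, q3 = 41.1667, 42.5135
threshold = q3 + 1.5 * (q3 - q1)
print(round(threshold, 4))

# The same rule applied to a small hypothetical Series.
demo_threshold = upper_outlier_threshold(pd.Series([1, 2, 3, 4, 5]))
```

In the lab, the helper could be called on a column such as `df_box2['Young']` instead of typing the quartiles by hand.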


In [None]:
# let's check how many entries fall above the outlier threshold 
df_box2 = df_box2.reset_index()
df_box2[df_box2['Young'] > 44.5337]

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:    
df_box2 = df_box2.reset_index()
df_box2[df_box2['Young'] > 44.5337]
```

</details>


The 'Other' race category is flagged as an outlier because its number of lab tests performed during the encounter exceeds 44.5337.

The box plot is an advanced visualization tool, and there are many options and customizations that exceed the scope of this lab. Please refer to [Matplotlib documentation](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDV0101ENSkillsNetwork970-2023-01-01) on box plots for more information.


# Scatter Plots <a id="10"></a>

A `scatter plot` (2D) is a useful method of comparing variables against each other. `Scatter` plots look similar to `line plots` in that they both map independent and dependent variables on a 2D graph. While the data points are connected together by a line in a line plot, they are not connected in a scatter plot. The data in a scatter plot is considered to express a trend. With further analysis using tools like regression, we can mathematically calculate this relationship and use it to predict trends outside the dataset.

Let's start by exploring the following:

Using a `scatter plot`, let's visualize the trend of number of diagnoses among patients with diabetes for all age groups.


Step 1: Get the dataset.


In [None]:
df_scatter = pd.DataFrame(df[['Age', 'Number Diagnoses']].copy())
df_scatter = df_scatter.groupby('Age').mean().reset_index()
df_scatter

Step 2: Plot the data. In `Matplotlib`, we can create a `scatter` plot by passing in `kind='scatter'` as a plot argument. We will also need to pass in the `x` and `y` keywords to specify the columns that go on the x- and the y-axis.


In [None]:
df_scatter.plot(kind='scatter', x='Age', y='Number Diagnoses', figsize=(10, 6), color='darkblue')

plt.title('Average number of diagnoses entered to the system by age')
plt.xlabel('Age')
plt.ylabel('Number of diagnoses')

plt.show()

Notice how the scatter plot does not connect the data points together. We can clearly observe an upward trend in the data: as age increases, the average number of diagnoses increases. We can mathematically analyze this upward trend using a regression line (line of best fit).


So let's try to plot a linear line of best fit, and use it to predict the number of diagnoses for the age of 20.

Step 1: Get the equation of line of best fit. We will use **Numpy**'s `polyfit()` method by passing in the following:

*   `x`: x-coordinates of the data.
*   `y`: y-coordinates of the data.
*   `deg`: Degree of fitting polynomial. 1 = linear, 2 = quadratic, and so on.


The `polyfit()` function from NumPy expects numeric data, but the 'Age' column contains string values, so we need to convert these values to numbers.
We can create a `midpoint_values` dictionary by looping through the age ranges and assigning the midpoint to each one. We then use this dictionary to map the string age ranges in the 'Age' column of the DataFrame to their respective midpoint values.


In [None]:
# Mapping Age ranges to their midpoint values using a loop
midpoint_values = {}
for i in range(0, 100, 10):
    range_str = f'[{i}-{i+10})'
    midpoint_values[range_str] = i + 5

# Convert Age ranges to midpoint values
df_scatter['Age'] = df_scatter['Age'].map(midpoint_values)
df_scatter

Now the input for `polyfit()` function is correct.


In [None]:
x = df_scatter['Age']      # age on x-axis
y = df_scatter['Number Diagnoses']     # number of diagnoses on y-axis
fit = np.polyfit(x, y, deg=1)

fit

The output is an array with the polynomial coefficients, highest powers first. Since we are plotting a linear regression `y= a * x + b`, our output has two elements  `[0.0525231 , 3.78206156]` with the slope in position 0 and intercept in position 1.
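
To see the coefficient order in isolation, here is a minimal sketch on hypothetical points that lie exactly on a known line, so the expected slope and intercept are clear in advance.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1.
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2 * x_demo + 1

# For deg=1 the result unpacks as (slope, intercept): highest power first.
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
print(round(slope, 4), round(intercept, 4))
```

Unpacking into named variables is optional, but it makes later code easier to read than indexing `fit[0]` and `fit[1]`.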

Step 2: Plot the regression line on the `scatter plot`.


In [None]:
df_scatter.plot(kind='scatter', x='Age', y='Number Diagnoses', figsize=(10, 6), color='darkblue')

plt.title('Average number of diagnoses entered to the system by age')
plt.xlabel('Age')
plt.ylabel('Number of diagnoses')

# plot line of best fit
plt.plot(x, fit[0] * x + fit[1], color='red') # recall that x is the Age
plt.annotate('y={0:.4f} x + {1:.4f}'.format(fit[0], fit[1]), xy=(60, 6))

plt.show()

# print out the line of best fit
'Number of diagnoses = {0:.4f} * Age + {1:.4f}'.format(fit[0], fit[1]) 

Using the equation of line of best fit, we can estimate the number of diagnoses of patients who are 20 years old:

```python
Number of diagnoses = 0.0525 * Age + 3.7821
Number of diagnoses = 0.0525 * 20 + 3.7821
Number of diagnoses = 4.8321
```

The resulting number is plausible: because the fitted line slopes upward, the estimate for a 20-year-old patient lies between the fitted values at the age midpoints 15 and 25.
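
The same estimate can be computed with `np.polyval`, which evaluates a polynomial from its coefficients. This sketch uses the rounded coefficients quoted above rather than the full-precision `fit` array.

```python
import numpy as np

# Rounded coefficients quoted above: slope first, then intercept.
coefficients = [0.0525, 3.7821]

# np.polyval evaluates the polynomial at the given x value(s).
prediction = np.polyval(coefficients, 20)
print(round(float(prediction), 4))
```

Passing an array of ages instead of a single number would return one estimate per age.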


**Question**: Create a scatter plot of the average number of lab tests performed during the encounter grouped by age.


**Step 1**: Get the data:

1.  Create a dataframe that consists of the age and number of lab procedures only. Name it **df_scatter2**.
2.  Find a mean of the numbers for each age group and turn the result into a dataframe.
3.  Reset the index in place.
4.  Display the resulting dataframe.


In [None]:
### type your answer here

# Create a dataframe
df_scatter2 = pd.DataFrame(df[['Age', 'Num Lab Procedures']].copy())

# Find a mean by age
df_scatter2 = df_scatter2.groupby('Age').mean()

# Reset the index in place
df_scatter2.reset_index(inplace=True)

# Show the result
df_scatter2

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:      
# Create a dataframe
df_scatter2 = pd.DataFrame(df[['Age', 'Num Lab Procedures']].copy())

# Find a mean by age
df_scatter2 = df_scatter2.groupby('Age').mean()

# Reset the index in place
df_scatter2.reset_index(inplace=True)

# Show the result
df_scatter2
```

</details>


**Step 2**: Generate the scatter plot by plotting the number of lab procedures versus age in **df_scatter2**.


In [None]:
### type your answer here
# generate scatter plot
df_scatter2.plot(kind='scatter', x='Age', y='Num Lab Procedures', figsize=(10, 6), color='darkblue')

# add title and label to axes
plt.title('Average number of lab tests performed during the encounter by age')
plt.xlabel('Age')
plt.ylabel('Number of lab procedures')

# show plot
plt.show()

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:  
    
# generate scatter plot
df_scatter2.plot(kind='scatter', x='Age', y='Num Lab Procedures', figsize=(10, 6), color='darkblue')

# add title and label to axes
plt.title('Average number of lab tests performed during the encounter by age')
plt.xlabel('Age')
plt.ylabel('Number of lab procedures')

# show plot
plt.show()
```

</details>


# Bubble Plots <a id="12"></a>

A `bubble plot` is a variation of the `scatter plot` that displays three dimensions of data (x, y, z). The data points are replaced with bubbles, and the size of each bubble is determined by the third variable `z`, also known as the weight. With `matplotlib`, we can pass an array or scalar to the `s` parameter of `plot(kind='scatter')` to set the weight of each point.
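
A minimal sketch of the idea, using made-up values rather than the diabetes data: three points are drawn with `scatter`, and the array passed to `s` sets each marker's area (in points squared). The variable names and numbers here are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: x and y give the position, z is the third dimension
x_demo = np.array([1, 2, 3])
y_demo = np.array([4, 5, 6])
z_demo = np.array([50, 200, 800])  # marker areas in points^2

fig, ax = plt.subplots(figsize=(6, 4))
bubbles = ax.scatter(x_demo, y_demo, s=z_demo, alpha=0.5, color='green')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Bubble size encodes z')

plt.show()
```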

**Let's start by analyzing the average number of medications taken by African Americans and Asians**.


**Step 1**: Get the data for African Americans and Asians.


In [None]:
# Pivot the dataset so that races become columns and ages become the index; drop the unknown-race column '?'
df_bubble = df.pivot_table(index='Age', columns='Race', values='Num Medications').reset_index().drop(columns='?')

# show resulting dataframe
df_bubble.head()
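
To see what `pivot_table` is doing here, consider a toy dataframe (the values below are invented for illustration). By default `pivot_table` aggregates with the mean, so each cell of the result is the average of `Num Medications` for one combination of age and race.

```python
import pandas as pd

# Toy records with the same column names as the lab's dataframe (invented values)
toy = pd.DataFrame({
    'Age': [10, 10, 10, 20, 20],
    'Race': ['Asian', 'Asian', 'Caucasian', 'Asian', 'Caucasian'],
    'Num Medications': [4, 6, 8, 10, 12],
})

# Rows become ages, columns become races, cells hold the mean number of medications
toy_pivot = toy.pivot_table(index='Age', columns='Race', values='Num Medications').reset_index()

print(toy_pivot)
```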

**Step 2**: Create the normalized weights.

There are several normalization methods in statistics, each with its own use. In this case, we will use [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDV0101ENSkillsNetwork20297740-2021-01-01) to bring all values into the range \[0, 1]. The general formula is:

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%203/images/Mod3Fig3FeatureScaling.png" align="center">

where $X$ is the original value and $X'$ is the corresponding normalized value. The formula sets the max value in the dataset to 1 and the min value to 0. The rest of the data points are scaled to values between 0 and 1 accordingly.
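
The scaling can be wrapped in a small function, which makes the behavior easy to check on known values. This is only a sketch: `min_max_scale` is a hypothetical helper name and the sample values are invented.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a pandas Series to the range [0, 1] (hypothetical helper, not part of the lab)."""
    return (series - series.min()) / (series.max() - series.min())

demo = pd.Series([10.0, 15.0, 20.0, 30.0])  # invented values
scaled = min_max_scale(demo)

print(scaled.tolist())
```

Note that this form divides by `max - min`, so a column in which every value is identical would produce `NaN` rather than a usable weight.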


In [None]:
# normalize African American data
norm_aa = (df_bubble['AfricanAmerican'] - df_bubble['AfricanAmerican'].min()) / (df_bubble['AfricanAmerican'].max() - df_bubble['AfricanAmerican'].min())

# normalize Asian data
norm_as = (df_bubble['Asian'] - df_bubble['Asian'].min()) / (df_bubble['Asian'].max() - df_bubble['Asian'].min())

**Step 3**: Plot the data.

*   To draw two scatter plots on the same figure, we can pass the axes of the first plot to the second plot via the `ax` parameter.
*   We will also pass in the weights using the `s` parameter. Because the normalized weights lie between 0 and 1, the bubbles would be too small to see on the plot. Therefore, we will:
    *   multiply the weights by 2000 to scale them up on the graph, and
    *   add 10 to compensate for the min value (which has a weight of 0 and would therefore remain 0 after scaling by $\times 2000$).
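
The effect of the two adjustments can be checked numerically. In this sketch the normalized weights are invented values spanning [0, 1]; after the transformation, the smallest bubble has size 10 instead of 0 and the largest has size 2010.

```python
import numpy as np

# Invented normalized weights in [0, 1]
norm_demo = np.array([0.0, 0.25, 1.0])

# Same transformation as the one passed to the `s` parameter in this lab
sizes = norm_demo * 2000 + 10

print(sizes)
```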


In [None]:
# AfricanAmerican
ax0 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='AfricanAmerican',
                    figsize=(14, 8),
                    alpha=0.5,  # transparency
                    color='green',
                    s=norm_aa * 2000 + 10,  # pass in weights 
                    )

# Asian
ax1 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='Asian',
                    alpha=0.5,
                    color="blue",
                    s=norm_as * 2000 + 10,
                    ax=ax0
                    )

ax0.set_ylabel('Number of medications')
ax0.set_title('Average number of medications for African Americans and Asians by age')
ax0.legend(['AfricanAmerican', 'Asian'], loc='upper left', fontsize='x-large')

The size of each bubble corresponds to the average number of medications for that age group, compared to the rest of the 0 - 100 age range. The larger the bubble, the more medications are taken by patients of that age group.

In the graph above, we see that the trends are quite similar, but the average number of medications is slightly higher for African American patients.


**Question**: Create bubble plots of the average number of medications taken by Caucasian and Hispanic people. You can use **df_bubble** that we defined and used in the previous example.


Step 1: Normalize the data pertaining to Caucasian and Hispanic people.


In [None]:
### type your answer here
# normalized Caucasian data
norm_ca = (df_bubble['Caucasian'] - df_bubble['Caucasian'].min()) / (df_bubble['Caucasian'].max() - df_bubble['Caucasian'].min())
# normalized Hispanic data
norm_hi = (df_bubble['Hispanic'] - df_bubble['Hispanic'].min()) / (df_bubble['Hispanic'].max() - df_bubble['Hispanic'].min())

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:  
# normalized Caucasian data
norm_ca = (df_bubble['Caucasian'] - df_bubble['Caucasian'].min()) / (df_bubble['Caucasian'].max() - df_bubble['Caucasian'].min())
# normalized Hispanic data
norm_hi = (df_bubble['Hispanic'] - df_bubble['Hispanic'].min()) / (df_bubble['Hispanic'].max() - df_bubble['Hispanic'].min())
```

</details>


Step 2: Generate the bubble plots.


In [None]:
### type your answer here
# Caucasian
ax0 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='Caucasian',
                    figsize=(14, 8),
                    alpha=0.5,                  # transparency
                    color='green',
                    s=norm_ca * 2000 + 10,  # pass in weights 
                   )

# Hispanic
ax1 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='Hispanic',
                    alpha=0.5,
                    color="blue",
                    s=norm_hi * 2000 + 10,
                    ax = ax0
                   )

ax0.set_ylabel('Number of medications')
ax0.set_title('Average number of medications for Caucasian and Hispanic people by age')
ax0.legend(['Caucasian', 'Hispanic'], loc='upper left', fontsize='x-large')

<details><summary>Click here for a sample python solution</summary>

```python
#The correct answer is:  
    
# Caucasian
ax0 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='Caucasian',
                    figsize=(14, 8),
                    alpha=0.5,                  # transparency
                    color='green',
                    s=norm_ca * 2000 + 10,  # pass in weights 
                   )

# Hispanic
ax1 = df_bubble.plot(kind='scatter',
                    x='Age',
                    y='Hispanic',
                    alpha=0.5,
                    color="blue",
                    s=norm_hi * 2000 + 10,
                    ax = ax0
                   )

ax0.set_ylabel('Number of medications')
ax0.set_title('Average number of medications for Caucasian and Hispanic people by age')
ax0.legend(['Caucasian', 'Hispanic'], loc='upper left', fontsize='x-large')
```

</details>


## Conclusions

In conclusion, the use of various visualization tools such as Pie Charts, Box Plots, Scatter Plots, and Bubble Plots can greatly aid healthcare professionals in understanding and communicating diabetes data effectively. Pie Charts can provide a clear overview of the share of people with diabetes, enabling quick comparisons and identifying trends. Box Plots, Scatter Plots, and Bubble Plots can offer deeper insights by showcasing distribution patterns, correlations, and potential outliers, supporting informed decision-making and targeted interventions to address health disparities across different regions.


### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/bohdan_kuno">Bohdan Kuno</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|2023-12-16|01|Bohdan Kuno|Lab created|


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
