# 1. Medical Data Visualizer

## 1.1 About 
This is my second project working with sample data, completed as part of the [freeCodeCamp Data Analysis with Python Certification](https://www.freecodecamp.org/learn/data-analysis-with-python/). I visualize and make calculations from medical examination data using Matplotlib, Seaborn, and Pandas. The dataset values were collected during medical examinations. Each row represents a patient, and the columns hold information such as body measurements, results from various blood tests, and lifestyle choices. I use the dataset to explore the relationship between cardiovascular disease, body measurements, blood markers, and lifestyle choices.

### 1.1.1 Dataset Info

The table below lists all the variables in the dataset and what they represent:


|                    Feature                    |    Variable Type    |   Variable  |                    Value Type                    |
|:---------------------------------------------:|:-------------------:|:-----------:|:------------------------------------------------:|
|                      Age                      |  Objective Feature  |     age     |                    int (days)                    |
|                     Height                    |  Objective Feature  |    height   |                     int (cm)                     |
|                     Weight                    |  Objective Feature  |    weight   |                    float (kg)                    |
|                     Gender                    |  Objective Feature  |    gender   |                 categorical code                 |
|            Systolic blood pressure            | Examination Feature |    ap_hi    |                        int                       |
|            Diastolic blood pressure           | Examination Feature |    ap_lo    |                        int                       |
|                  Cholesterol                  | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
|                    Glucose                    | Examination Feature |     gluc    | 1: normal, 2: above normal, 3: well above normal |
|                    Smoking                    |  Subjective Feature |    smoke    |                      binary                      |
|                 Alcohol intake                |  Subjective Feature |     alco    |                      binary                      |
|               Physical activity               |  Subjective Feature |    active   |                      binary                      |
| Presence or absence of cardiovascular disease |   Target Variable   |    cardio   |                      binary                      |

## 1.2 Execution 
### 1.2.1 Import Required Libraries/Data
First, all necessary Python libraries are imported and the dataset is imported into a data frame.

In [339]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

`df.head(5)` will show us the first 5 rows of the table.

In [340]:
# Import Data
df = pd.read_csv('medical_examination.csv', delimiter = ',')
df.head(5)

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


### 1.2.2 Add 'overweight' Column
One of the tasks is to add an 'overweight' column to the data. To determine whether a person is overweight, first calculate their BMI by dividing their weight in kilograms by the square of their height in meters. If the BMI is greater than 25, the person is overweight. The column uses the value 0 for NOT overweight and the value 1 for overweight.

In [341]:
# Calculate BMI
bmi = df['weight'] / np.power(df['height'] * 0.01, 2)

# Add 'overweight' column
df['overweight'] = np.where(bmi > 25, 1, 0)

To make sure the BMI is calculated correctly and that the "overweight" column has been correctly added, we will print each.

In [342]:
print ("BMI:\n", bmi)
df.head(5)

BMI:
 0        21.967120
1        34.927679
2        23.507805
3        28.710479
4        23.011177
           ...    
69995    26.927438
69996    50.472681
69997    31.353579
69998    27.099251
69999    24.913495
Length: 70000, dtype: float64


Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,overweight
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,0
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,0
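
As a quick hand check of the rule above, the sketch below recomputes the flag for the five rows shown in the preview, using only their `height` and `weight` values. The `sample` frame is a stand-in built for illustration, not the full dataset.

```python
import numpy as np
import pandas as pd

# Heights (cm) and weights (kg) copied from the five preview rows above
sample = pd.DataFrame({
    'height': [168, 156, 165, 169, 156],
    'weight': [62.0, 85.0, 64.0, 82.0, 56.0],
})

# BMI = weight in kg divided by the square of height in meters
bmi = sample['weight'] / (sample['height'] / 100) ** 2

# 1 when BMI is above 25, otherwise 0
sample['overweight'] = np.where(bmi > 25, 1, 0)
print(sample)
```

The resulting flags should agree with the 'overweight' column in the preview.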


### 1.2.3 Normalize The Data
Looking at the 'cholesterol' and 'gluc' rows of the table in 1.1.1, we can see that each variable takes one of **3** values:  
1. Normal
2. Above Normal
3. Well Above Normal
 
We will normalize the data so that 0 always means good and 1 always means bad. If the value of 'cholesterol' or 'gluc' is 1, we set it to 0. If the value is greater than 1, we set it to 1.  

In [343]:
# Normalize 'cholesterol' column
df['cholesterol'] = df['cholesterol'].apply(lambda x: 0 if x == 1 else 1)

# Normalize 'gluc' column
df['gluc'] = df['gluc'].apply(lambda x: 0 if x == 1 else 1)

We will print the first 5 rows to check that the data has been normalized.

In [344]:
df.head(5)

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,overweight
0,0,18393,2,168,62.0,110,80,0,0,0,0,1,0,0
1,1,20228,1,156,85.0,140,90,1,0,0,0,1,1,1
2,2,18857,1,165,64.0,130,70,1,0,0,0,0,1,0
3,3,17623,2,169,82.0,150,100,0,0,0,0,1,1,1
4,4,17474,1,156,56.0,100,60,0,0,0,0,0,0,0


The 'cholesterol' and 'gluc' columns now only show values 0 or 1.
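
The same rule can also be written as a vectorized comparison rather than `apply` with a lambda. The sketch below is an alternative shown on a made-up Series, not the code used in this notebook; the names `levels` and `normalized` are illustrative.

```python
import pandas as pd

# Made-up codes: 1 = normal, 2 = above normal, 3 = well above normal
levels = pd.Series([1, 2, 3, 1, 3])

# True where the code is above 1, then cast the booleans to 0/1
normalized = (levels > 1).astype(int)
print(normalized.tolist())
```

On a large column the comparison runs on the whole Series at once, which is usually faster than calling a Python function per row.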

### 1.2.4 Convert The Data & Create a Chart  
We will convert the data into long format with `pd.melt`, keeping only the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'. The result is the DataFrame used for the cat plot.  

Next, we will use Seaborn's `catplot()` to create a chart that shows the counts of good and bad outcomes for the `cholesterol, gluc, alco, active, overweight and smoke` variables. The dataset is split by 'cardio', so there is one panel for patients with cardio = 0 and one for patients with cardio = 1.  

This will be done within the `draw_cat_plot()` function.

In [345]:
def draw_cat_plot():
    # Create DataFrame for cat plot using `pd.melt`
    df_cat = pd.melt(df, id_vars='cardio', value_vars=['cholesterol', 'gluc', 'alco', 'active', 'overweight', 'smoke'])
    
    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature.
    df_cat['total'] = 1
    df_cat = df_cat.groupby(['cardio', 'variable', 'value'], as_index = False).count()
    
    # Map custom legend labels to the 'value' column
    value_labels = {0: 'No', 1: 'Yes'}
    df_cat['value'] = df_cat['value'].map(value_labels)
    
    # Draw the catplot with 'sns.catplot()'
    plot = sns.catplot(
        data = df_cat, x="variable", y="total", col="cardio", hue='value',
        kind="bar", col_order=[0, 1]
    )
    # Get the figure for the output
    fig = plot.fig
    fig.savefig('catplot.png')
    return fig

<div><img src = 'catplot.png' alt = 'Cat Plot Chart' width = '800px'/></div>
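
To see what the reshaping step produces before it reaches `catplot()`, here is a minimal sketch on a toy frame. The values are made up, and it counts rows with `size()` instead of the helper `total` column, so treat it as an illustration of the idea rather than the exact code above.

```python
import pandas as pd

# Made-up data: four patients, two binary features
toy = pd.DataFrame({
    'cardio': [0, 0, 1, 1],
    'smoke':  [0, 1, 1, 1],
    'alco':   [0, 0, 0, 1],
})

# Wide to long: one row per (patient, feature) pair
long_df = pd.melt(toy, id_vars='cardio', value_vars=['smoke', 'alco'])

# Count how often each value occurs per feature and cardio group
counts = (
    long_df.groupby(['cardio', 'variable', 'value'])
    .size()
    .reset_index(name='total')
)
print(counts)
```

Each row of `counts` corresponds to one bar in the cat plot: a feature, its 0/1 value, and how many patients in that cardio group have it.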

### 1.2.5 Clean The Data & Create a Correlation Matrix
We will now filter out (clean) the following patient segments that represent incorrect data:
- diastolic pressure is higher than systolic (Keep the correct data with `(df['ap_lo'] <= df['ap_hi'])`)
- height is less than the 2.5th percentile (Keep the correct data with `(df['height'] >= df['height'].quantile(0.025))`)
- height is more than the 97.5th percentile
- weight is less than the 2.5th percentile
- weight is more than the 97.5th percentile
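
The percentile rules in this list can be sketched on a toy column as follows. The helper name `keep_central_range` and the heights are made up for illustration; the notebook itself writes the conditions inline.

```python
import pandas as pd

def keep_central_range(frame, column, lower=0.025, upper=0.975):
    """Keep rows whose `column` value lies between the two quantiles (inclusive)."""
    low = frame[column].quantile(lower)
    high = frame[column].quantile(upper)
    return frame[frame[column].between(low, high)]

# Made-up heights in cm; the smallest and largest fall outside the kept range
toy = pd.DataFrame({'height': [140, 150, 160, 165, 170, 175, 180, 190, 200, 210]})
kept = keep_central_range(toy, 'height')
print(kept['height'].tolist())
```

`Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` comparisons used for the real data.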

Next, we will compute the correlation matrix of the cleaned dataset and plot it with Seaborn's `heatmap()`, using a mask to hide the upper triangle so that each pair of variables appears only once.  

This will be done within the `draw_heat_map()` function.

In [346]:
def draw_heat_map():
    # Clean the data
    df_heat = df[
        (df['ap_lo'] <= df['ap_hi']) &
        (df['height'] >= df['height'].quantile(0.025)) &
        (df['height'] <= df['height'].quantile(0.975)) &
        (df['weight'] >= df['weight'].quantile(0.025)) &
        (df['weight'] <= df['weight'].quantile(0.975))
    ]
  
    # Calculate the correlation matrix
    corr = df_heat.corr(method = 'pearson')

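    # Generate a boolean mask for the upper triangle (True = hide the cell).
    # Building it from ones keeps cells with a correlation of exactly 0 masked too.
    mask = np.triu(np.ones_like(corr, dtype=bool))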
    
    # Set up the matplotlib figure
    fig, ax = plt.subplots(figsize = (13, 13))

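    # Draw the heatmap with 'sns.heatmap()' on the axes created above.
    # The result is not assigned to a variable named 'map', which would shadow the built-in.
    sns.heatmap(corr, mask = mask, fmt='.1f', annot = True, square = True, linewidths=1,
               cbar_kws={'shrink': 0.5}, center = 0.08, ax = ax)
  
    # 'fig' from plt.subplots() already holds the heatmap, so save it directly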
    fig.savefig('heatmap.png')
    return fig

<div><img src = 'heatmap.png' alt = 'Heat Map Chart' width = '800px'/></div>
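
The upper-triangle mask is easier to picture on a small matrix. The sketch below uses a made-up three-column frame and builds the mask from an array of `True` values; `visible` is an illustrative name for the cells the heat map would still show.

```python
import numpy as np
import pandas as pd

# Made-up data, only used to get a small 3x3 correlation matrix
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 1, 4, 3], 'c': [4, 3, 2, 1]})
corr = toy.corr()

# True on and above the diagonal; seaborn hides cells where the mask is True
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# Cells that stay visible: everything strictly below the diagonal
visible = corr.mask(mask)
print(visible)
```

Because a correlation matrix is symmetric, hiding the diagonal and the upper triangle removes only duplicated information.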

## 1.3 Conclusions
After completing all the necessary tasks, we can visualize useful information in the `medical_examination.csv` dataset and draw conclusions from it.

### 1.3.1 Cat Plot Discovery
Looking back at section **1.2.4**, we were able to create a bar graph showing the relationship between `cardiovascular disease` and `cholesterol, glucose, alcohol intake, physical activity, overweight, and smoking`.  
<div><img src = 'catplot.png' alt = 'Cat Plot Chart' width = '800px'/></div> 

The middle of both charts *(variables = cholesterol, glucose, overweight)* stood out to me the most. I will dive deeper into these observations below.  

Comparing those without cardiovascular disease (*cardio = 0*) to those with cardiovascular disease (*cardio = 1*) we can infer that:  
- Individuals in the cardiovascular disease group appear to have a higher prevalence of cardiovascular risk factors, such as high cholesterol, high glucose levels, and overweight status.
- Physical activity, alcohol intake, and smoking show little visible difference between cardio = 0 and cardio = 1, which suggests that these lifestyle factors may not be strong differentiating factors for cardiovascular disease in this dataset.
- Having high cholesterol, high glucose levels, or being overweight may therefore be associated with an increased risk of cardiovascular disease.
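
Bar heights compare raw counts, so one way to make such comparisons more concrete is to look at the share of patients with each factor per cardio group. The sketch below shows the idea on made-up data; the numbers are not results from this dataset.

```python
import pandas as pd

# Made-up binary data: four patients per cardio group
toy = pd.DataFrame({
    'cardio':      [0, 0, 0, 0, 1, 1, 1, 1],
    'cholesterol': [0, 0, 0, 1, 0, 1, 1, 1],
    'smoke':       [0, 1, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of patients with the factor
rates = toy.groupby('cardio')[['cholesterol', 'smoke']].mean()

# Difference in share between the cardio = 1 and cardio = 0 groups
diff = rates.loc[1] - rates.loc[0]
print(rates)
print(diff)
```

A large difference for a factor points to an association with the target, while a difference near zero matches the "little visible difference" reading of the chart.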

### 1.3.2 Heat Map Discovery
Looking back at section **1.2.5**, we were able to generate a heat map showing the correlation between variables. A positive number means a positive correlation between the two variables, and a negative number means a negative correlation. The lighter cells represent a strong positive correlation (closer to 1) and the darker cells represent a weak or near-zero correlation (close to 0).
<div><img src = 'heatmap.png' alt = 'Heat Map Chart' width = '800px'/></div>  

I will focus on the cells with the values `0.5, 0.7, and 0.4`.

1. (gender, height) = 0.5
    - A correlation of 0.5 indicates that, on average, individuals of one gender tend to be taller than individuals of the other.   
2. (weight, overweight) = 0.7
    - The value of 0.7 indicates a strong positive relationship between an individual's weight and whether they are considered overweight. This is expected, because the 'overweight' column is derived from BMI, which is itself calculated from weight: on average, individuals with higher weight are more likely to be categorized as "overweight".  
3. (cholesterol, gluc) = 0.4
    - A value of 0.4 suggests that when cholesterol is above normal, glucose is also more likely to be above normal.

We can use this heat map to inspect the correlations between other pairs of variables, which helps us draw better-supported conclusions about the data. 
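
Besides reading cells off the chart, the strongest pairs can also be listed directly from a correlation matrix. This is a sketch on made-up data, with `pairs` as an illustrative name; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: y is exactly 2 * x, z tends to fall as x rises
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [5, 3, 4, 1, 2],
})
corr = toy.corr()

# Keep only the cells strictly below the diagonal so each pair appears once
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().dropna()

# Order the pairs by absolute correlation, strongest first
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a quick scan for light cells can miss.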

### 1.3.3 Final Thoughts
Through this project, I was able to dive deeper into the world of data analytics. I practiced creating charts and heat maps with Python's libraries and drawing meaningful conclusions from them. 