# Exploratory Data Analysis of Heart Disease and Demographic Trends

Heart disease stands as a significant contributor to mortality in both men and women. In this data analysis project, I delve into the heart disease dataset to identify common features among distinct demographic groups, aiming to predict and prevent the development of heart disease. The dataset used for this analysis can be found [here](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/data).

## Project Objective
The primary objective is to define potential factors contributing to the development of heart disease, laying the groundwork for predictive and preventive measures.

## Understanding the Data
To kick off the analysis, I conducted an exploration of the dataset, ensuring data cleanliness where necessary. The dataset comprises 14 attributes, each providing valuable insights into the patient's health profile:

- **Age (age):** Patient’s age in years.
- **Gender (sex):** Patient’s gender. (1 = Male, 0 = Female)
- **Chest Pain Type (cp):** Chest pain type. (Encoded 0-3 in the raw data; relabeled TA, ATA, NAP, ASY during cleaning)
- **Resting Blood Pressure (trestbps):** Resting blood pressure in mm Hg.
- **Serum Cholesterol (chol):** Serum cholesterol in mg/dl.
- **Fasting Blood Sugar (fbs):** Fasting blood sugar > 120 mg/dl. (1 = True, 0 = False)
- **Resting Electrocardiographic Result (restecg):** Resting electrocardiographic result. (Encoded 0-2 in the raw data; relabeled Normal, ST, LVH during cleaning)
- **Maximum Heart Rate (thalach):** Maximum heart rate achieved.
- **Exercise-Induced Angina (exang):** Exercise-induced angina. (0 = No, 1 = Yes)
- **ST Depression (oldpeak):** ST depression induced by exercise relative to rest.
- **Slope of the ST Segment (slope):** Slope of the peak exercise ST segment. (Encoded 0-2 in the raw data; relabeled Up, Flat, Down during cleaning)
- **Number of Major Vessels (ca):** Number of major vessels (0-3) colored by fluoroscopy.
- **Thalassemia (thal):** Thalassemia type. (1 = normal; 2 = fixed defect; 3 = reversible defect)
- **Heart Disease Occurrence (target):** Presence of heart disease. (0 = No, 1 = Yes)

This comprehensive exploration sets the stage for a detailed analysis of the dataset, uncovering patterns and relationships within the data.


In [None]:
-- Table preview
SELECT *
FROM heart_disease
LIMIT 10;

| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 52  | 1   | 0  | 125      | 212  | 0   | 1       | 168     | 0     | 1       | 2     | 2  | 3    | 0      |
| 53  | 1   | 0  | 140      | 203  | 1   | 0       | 155     | 1     | 3.1     | 0     | 0  | 3    | 0      |
| 70  | 1   | 0  | 145      | 174  | 0   | 1       | 125     | 1     | 2.6     | 0     | 0  | 3    | 0      |
| 61  | 1   | 0  | 148      | 203  | 0   | 1       | 161     | 0     | 0       | 2     | 1  | 3    | 0      |
| 62  | 0   | 0  | 138      | 294  | 1   | 1       | 106     | 0     | 1.9     | 1     | 3  | 2    | 0      |
| 58  | 0   | 0  | 100      | 248  | 0   | 0       | 122     | 0     | 1       | 1     | 0  | 2    | 1      |
| 58  | 1   | 0  | 114      | 318  | 0   | 2       | 140     | 0     | 4.4     | 0     | 3  | 1    | 0      |
| 55  | 1   | 0  | 160      | 289  | 0   | 0       | 145     | 1     | 0.8     | 1     | 1  | 3    | 0      |
| 46  | 1   | 0  | 120      | 249  | 0   | 0       | 144     | 0     | 0.8     | 2     | 0  | 3    | 0      |
| 54  | 1   | 0  | 122      | 286  | 0   | 0       | 116     | 1     | 3.2     | 1     | 2  | 2    | 0      |


## Data cleaning
In the initial state of the dataset, the columns were ***represented with abbreviations***, making them difficult to interpret. Additionally, all values were encoded as numbers. Despite the absence of missing values in the table, the dataset's readability and interpretability were compromised. To address this, a comprehensive data cleaning process was conducted, involving changes to column names and the **replacement of numeric values with descriptive labels**, thus enhancing the overall clarity of the dataset.

In [2]:
-- -- Replaced 1 with "male" and 0 with "female" in the "sex" column.
-- Step 1: Add a temporary column
ALTER TABLE heart_disease
ADD COLUMN sex_temp VARCHAR;

-- Step 2: Update the temporary column
UPDATE heart_disease
SET sex_temp = CASE
    WHEN sex = 1 THEN 'male'
    ELSE 'female'
END;

-- Step 3: Drop the original column
ALTER TABLE heart_disease
DROP COLUMN sex;

-- Step 4: Rename the temporary column
ALTER TABLE heart_disease
RENAME COLUMN sex_temp TO sex;

-- -- Changed the data type of the "chest_pain_type" column from integer to string and replaced values 0, 1, 2, 3 with 'TA', 'ATA', 'NAP', 'ASY' in the same column.
-- Step 1: Add a temporary column
ALTER TABLE heart_disease
ADD COLUMN chest_pain_temp VARCHAR;

-- Step 2: Update the temporary column
UPDATE heart_disease
SET chest_pain_temp = CASE
    WHEN chest_pain_type = 0 THEN 'TA'
    WHEN chest_pain_type = 1 THEN 'ATA'
    WHEN chest_pain_type = 2 THEN 'NAP'
    WHEN chest_pain_type = 3 THEN 'ASY'
    ELSE chest_pain_type::VARCHAR
END;
-- Step 3: Drop the original column
ALTER TABLE heart_disease
DROP COLUMN chest_pain_type;

-- Step 4: Rename the temporary column
ALTER TABLE heart_disease
RENAME COLUMN chest_pain_temp TO chest_pain_type;

-- -- Next I changed the data type of the "resting_ecg" column from integer to string and replaced the values 0, 1, 2 with their corresponding labels: "Normal," "ST," "LVH"
-- Step 1: Add a temporary column
ALTER TABLE heart_disease
ADD COLUMN resting_ecg_temp VARCHAR;

-- Step 2: Update the temporary column
UPDATE heart_disease
SET resting_ecg_temp = CASE
    WHEN resting_ecg = 0 THEN 'Normal'
    WHEN resting_ecg = 1 THEN 'ST'
    WHEN resting_ecg = 2 THEN 'LVH'
    ELSE resting_ecg::VARCHAR
END;

-- Step 3: Drop the original column
ALTER TABLE heart_disease
DROP COLUMN resting_ecg;

-- Step 4: Rename the temporary column
ALTER TABLE heart_disease
RENAME COLUMN resting_ecg_temp TO resting_ecg;

-- -- Changed the data type of the "slope" column from integer to string and replaced the values 0, 1, 2 with their corresponding labels: "Up" (upsloping), "Flat" (flat), "Down" (downsloping)
-- Step 1: Add a temporary column
ALTER TABLE heart_disease
ADD COLUMN slope_temp VARCHAR;

-- Step 2: Update the temporary column
UPDATE heart_disease
SET slope_temp = CASE
    WHEN slope = 0 THEN 'Up'
    WHEN slope = 1 THEN 'Flat'
    WHEN slope = 2 THEN 'Down'
    ELSE slope::VARCHAR
END;

-- Step 3: Drop the original column
ALTER TABLE heart_disease
DROP COLUMN slope;

-- Step 4: Rename the temporary column
ALTER TABLE heart_disease
RENAME COLUMN slope_temp TO st_slope;

-- -- Changed the data type of the "thal" (thalassemia) column from integer to string and replaced the values 1, 2, 3 with their corresponding labels: 1 = 'normal'; 2 = 'fixed defect'; 3 = 'reversible defect'
-- Step 1: Add a temporary column
ALTER TABLE heart_disease
ADD COLUMN thal_temp VARCHAR;

-- Step 2: Update the temporary column
UPDATE heart_disease
SET thal_temp = CASE
    WHEN thal = 1 THEN 'normal'
    WHEN thal = 2 THEN 'fixed defect'
    WHEN thal = 3 THEN 'reversible defect'
    ELSE thal::VARCHAR
END;

-- Step 3: Drop the original column
ALTER TABLE heart_disease
DROP COLUMN thal;

-- Step 4: Rename the temporary column
ALTER TABLE heart_disease
RENAME COLUMN thal_temp TO thal;
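
The add, update, drop, rename sequence above rewrites the table in place. As a hedged alternative, the sketch below builds a labeled copy with `CREATE TABLE ... AS SELECT`, so the raw codes stay available for checking the mapping. It runs against an in-memory SQLite database through Python's built-in `sqlite3` module purely for illustration. The table name `heart_disease_labeled`, the restriction to six columns, and the choice of three sample rows (copied from the preview table) are assumptions, not part of the original notebook.

```python
import sqlite3

# In-memory SQLite database holding three raw rows copied from the preview table.
# Only six of the 14 columns are kept (an assumption made for brevity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heart_disease "
    "(age INTEGER, sex INTEGER, cp INTEGER, restecg INTEGER, slope INTEGER, thal INTEGER)"
)
conn.executemany(
    "INSERT INTO heart_disease VALUES (?, ?, ?, ?, ?, ?)",
    [(52, 1, 0, 1, 2, 3), (62, 0, 0, 1, 1, 2), (58, 1, 0, 2, 0, 1)],
)

# Build a labeled copy; the raw table is left untouched.
# The code-to-label mappings mirror the CASE expressions in the cleaning cell.
conn.execute(
    """
    CREATE TABLE heart_disease_labeled AS
    SELECT
        age,
        CASE sex WHEN 1 THEN 'male' ELSE 'female' END AS sex,
        CASE cp WHEN 0 THEN 'TA' WHEN 1 THEN 'ATA'
                WHEN 2 THEN 'NAP' WHEN 3 THEN 'ASY'
                ELSE CAST(cp AS TEXT) END AS chest_pain_type,
        CASE restecg WHEN 0 THEN 'Normal' WHEN 1 THEN 'ST' WHEN 2 THEN 'LVH'
                     ELSE CAST(restecg AS TEXT) END AS resting_ecg,
        CASE slope WHEN 0 THEN 'Up' WHEN 1 THEN 'Flat' WHEN 2 THEN 'Down'
                   ELSE CAST(slope AS TEXT) END AS st_slope,
        CASE thal WHEN 1 THEN 'normal' WHEN 2 THEN 'fixed defect'
                  WHEN 3 THEN 'reversible defect'
                  ELSE CAST(thal AS TEXT) END AS thal
    FROM heart_disease
    """
)

labeled_rows = conn.execute(
    "SELECT age, sex, chest_pain_type, resting_ecg, st_slope, thal "
    "FROM heart_disease_labeled ORDER BY age"
).fetchall()
# The raw integer codes are still present in the original table.
raw_sex_codes = [row[0] for row in conn.execute("SELECT sex FROM heart_disease ORDER BY age")]

for row in labeled_rows:
    print(row)
```

Keeping both tables makes it easy to cross-check a label against its code, at the cost of storing the data twice. The in-place approach used in the notebook avoids the duplication.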


In [None]:
-- Overview of Heart Disease Dataset
SELECT *
FROM heart_disease
LIMIT 10;

| age | resting_blood_pressure | cholesterol | fasting_blood_sugar | max_heart_rate | exercise_angina | oldpeak | n_major_vessels | target | sex    | chest_pain_type | resting_ecg | st_slope | thal            |
|-----|------------------------|-------------|----------------------|-----------------|------------------|---------|-----------------|--------|--------|------------------|-------------|----------|-----------------|
| 53  | 130                    | 197         | 1                    | 152             | 0                | 1.2     | 0               | 1      | "male" | "NAP"            | "Normal"    | "Up"     | "fixed defect"  |
| 42  | 136                    | 315         | 0                    | 125             | 1                | 1.8     | 0               | 0      | "male" | "TA"             | "ST"        | "Flat"   | "normal"        |
| 37  | 120                    | 215         | 0                    | 170             | 0                | 0       | 0               | 1      | "female" | "NAP"           | "ST"        | "Down"   | "fixed defect"  |
| 62  | 160                    | 164         | 0                    | 145             | 0                | 6.2     | 3               | 0      | "female" | "TA"            | "Normal"    | "Up"     | "reversible defect" |
| 59  | 170                    | 326         | 0                    | 140             | 1                | 3.4     | 0               | 0      | "male" | "TA"            | "Normal"    | "Up"     | "reversible defect" |
| 61  | 140                    | 207         | 0                    | 138             | 1                | 1.9     | 1               | 0      | "male" | "TA"            | "Normal"    | "Down"   | "reversible defect" |
| 56  | 125                    | 249         | 1                    | 144             | 1                | 1.2     | 1               | 0      | "male" | "TA"            | "Normal"    | "Flat"   | "fixed defect"  |
| 59  | 140                    | 177         | 0                    | 162             | 1                | 0       | 1               | 0      | "male" | "TA"            | "ST"        | "Down"   | "reversible defect" |
| 48  | 130                    | 256         | 1                    | 150             | 1                | 0       | 2               | 0      | "male" | "TA"            | "Normal"    | "Down"   | "reversible defect" |
| 47  | 138                    | 257         | 0                    | 156             | 0                | 0       | 0               | 1      | "male" | "NAP"           | "Normal"    | "Down"   | "fixed defect"  |


## Exploratory Data Analysis
I carried out an exploratory data analysis to examine the variables in the dataset and how they are distributed.

### 1. Investigated the distribution of ages and gender in the dataset:

In [None]:
SELECT
    AVG(age) AS avg_age,
    MIN(age) AS min_age,
    MAX(age) AS max_age,
    COUNT(*) AS total_records,
    (SUM(CASE WHEN sex = 'male' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS male_percentage,
    (SUM(CASE WHEN sex = 'female' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS female_percentage
FROM heart_disease;

| avg_age | min_age | max_age | total_records | male_percentage | female_percentage |
|---------|---------|---------|---------------|------------------|-------------------|
| 54.43   | 29      | 77      | 1025          | 69.56%           | 30.44%            |


The dataset covers a broad age range, from a minimum of 29 years to a maximum of 77 years, with an average age of approximately 54.43 years. The gender distribution is skewed toward males, who make up 69.56% of the records compared to 30.44% for females.

### 2. Analyzed chest pain types to investigate trends by age and gender

In [None]:
SELECT chest_pain_type, COUNT(*) AS count
FROM heart_disease
GROUP BY chest_pain_type;

| chest_pain_type | count |
|-----------------|-------|
| "TA"            | 497   |
| "NAP"           | 284   |
| "ATA"           | 167   |
| "ASY"           | 77    |


"TA" appears to be the most common type (497 records), while "ASY" is the least frequent (77 records).

#### Chest Pain Trends Based on Age

In [None]:
SELECT
    CASE
        WHEN age < 31 THEN '<31' 
        WHEN age BETWEEN 31 AND 40 THEN '31-40'
        WHEN age BETWEEN 41 AND 50 THEN '41-50'
        WHEN age BETWEEN 51 AND 60 THEN '51-60'
        WHEN age BETWEEN 61 AND 70 THEN '61-70'
        ELSE '70+' 
    END AS age_group,
    chest_pain_type,
    COUNT(*) AS pain_type_count
FROM heart_disease
GROUP BY age_group, chest_pain_type
ORDER BY age_group, COUNT(*) DESC;

| age_group | chest_pain_type | pain_type_count |
|-----------|-----------------|------------------|
| "<31"     | "ATA"           | 4                |
| "31-40"   | "NAP"           | 24               |
| "31-40"   | "TA"            | 23               |
| "31-40"   | "ASY"           | 10               |
| "31-40"   | "ATA"           | 7                |
| "41-50"   | "TA"            | 95               |
| "41-50"   | "NAP"           | 81               |
| "41-50"   | "ATA"           | 65               |
| "41-50"   | "ASY"           | 6                |
| "51-60"   | "TA"            | 225              |
| "51-60"   | "NAP"           | 110              |
| "51-60"   | "ATA"           | 69               |
| "51-60"   | "ASY"           | 34               |
| "61-70"   | "TA"            | 147              |
| "61-70"   | "NAP"           | 62               |
| "61-70"   | "ASY"           | 27               |
| "61-70"   | "ATA"           | 16               |
| "70+"     | "NAP"           | 7                |
| "70+"     | "TA"            | 7                |
| "70+"     | "ATA"           | 6                |


The count of 'TA' chest pain rises with age up to the 51-60 group and declines afterward, which also reflects how many patients fall into each group. In patients aged 31-40, 'NAP' and 'TA' are almost equally common, with 24 and 23 occurrences, respectively. In the age group containing the mean age (51-60), 'TA' is roughly twice as frequent as 'NAP', with 225 occurrences compared to 110.
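
Ordering by the text label `age_group` depends on how the database collates strings such as '<31' and '70+'. One way to make the order explicit, sketched here under stated assumptions, is to carry a numeric sort key alongside the label. The example uses Python's built-in `sqlite3` with an in-memory database. The column `group_order`, the subquery alias `binned`, and the eight rows are hypothetical illustration values, not rows from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heart_disease (age INTEGER, chest_pain_type TEXT)")
# Hypothetical illustration rows, not taken from the dataset.
conn.executemany(
    "INSERT INTO heart_disease VALUES (?, ?)",
    [(29, "ATA"), (35, "NAP"), (45, "TA"), (52, "NAP"),
     (55, "TA"), (58, "TA"), (65, "NAP"), (74, "NAP")],
)

# group_order is a numeric key for each bin, so the result is sorted by age
# range rather than by the text of the label.
query = """
SELECT age_group, chest_pain_type, COUNT(*) AS pain_type_count
FROM (
    SELECT
        chest_pain_type,
        CASE
            WHEN age < 31 THEN 1
            WHEN age BETWEEN 31 AND 40 THEN 2
            WHEN age BETWEEN 41 AND 50 THEN 3
            WHEN age BETWEEN 51 AND 60 THEN 4
            WHEN age BETWEEN 61 AND 70 THEN 5
            ELSE 6
        END AS group_order,
        CASE
            WHEN age < 31 THEN '<31'
            WHEN age BETWEEN 31 AND 40 THEN '31-40'
            WHEN age BETWEEN 41 AND 50 THEN '41-50'
            WHEN age BETWEEN 51 AND 60 THEN '51-60'
            WHEN age BETWEEN 61 AND 70 THEN '61-70'
            ELSE '70+'
        END AS age_group
    FROM heart_disease
) AS binned
GROUP BY group_order, age_group, chest_pain_type
ORDER BY group_order, pain_type_count DESC
"""
age_rows = conn.execute(query).fetchall()
for row in age_rows:
    print(row)
```

With the numeric key, the youngest bin always comes first and the oldest last, regardless of the database's collation settings.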

#### Chest Pain Trends Based on Gender

In [None]:
SELECT
    sex,
    chest_pain_type,
    COUNT(*) AS pain_type_count,
    (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM heart_disease)) AS percentage
FROM heart_disease
GROUP BY sex, chest_pain_type
ORDER BY sex, COUNT(*) DESC;

| sex    | chest_pain_type | pain_type_count | percentage |
|--------|------------------|------------------|------------|
| female | "TA"             | 133              | 12.98%     |
| female | "NAP"            | 109              | 10.63%     |
| female | "ATA"            | 57               | 5.56%      |
| female | "ASY"            | 13               | 1.27%      |
| male   | "TA"             | 364              | 35.51%     |
| male   | "NAP"            | 175              | 17.07%     |
| male   | "ATA"            | 110              | 10.73%     |
| male   | "ASY"            | 64               | 6.24%      |


The percentages above are shares of all 1,025 records, so they reflect both the mix of chest pain types and the larger number of males in the dataset. For females, the most common chest pain type is 'TA' with 133 occurrences (12.98%), followed by 'NAP' with 109 (10.63%), 'ATA' with 57 (5.56%), and 'ASY' with 13 (1.27%).

Males show the same ranking. 'TA' is the most common type with 364 occurrences (35.51%), followed by 'NAP' with 175 (17.07%), 'ATA' with 110 (10.73%), and 'ASY' with 64 (6.24%).

Measured within each gender, 'TA' accounts for about 51% of male records (364 of 713) and about 43% of female records (133 of 312), so 'TA' is somewhat more prevalent among males than among females.
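
Because the query divides by the size of the whole dataset, comparing genders is easier when each count is divided by that gender's own total. The sketch below shows one way to compute such within-gender shares. It loads the eight counts from the result table above into a small helper table (`pain_counts`, a hypothetical name) in an in-memory SQLite database via Python's `sqlite3`, rather than querying the full `heart_disease` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Helper table holding the counts reported in the result table above.
conn.execute("CREATE TABLE pain_counts (sex TEXT, chest_pain_type TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO pain_counts VALUES (?, ?, ?)",
    [
        ("female", "TA", 133), ("female", "NAP", 109),
        ("female", "ATA", 57), ("female", "ASY", 13),
        ("male", "TA", 364), ("male", "NAP", 175),
        ("male", "ATA", 110), ("male", "ASY", 64),
    ],
)

# Join each row to its gender total, then express the count as a share of that total.
within_sex = conn.execute(
    """
    SELECT c.sex, c.chest_pain_type, c.n,
           ROUND(c.n * 100.0 / t.total, 2) AS pct_within_sex
    FROM pain_counts AS c
    JOIN (SELECT sex, SUM(n) AS total FROM pain_counts GROUP BY sex) AS t
      ON t.sex = c.sex
    ORDER BY c.sex, c.n DESC
    """
).fetchall()

# Lookup keyed by (sex, chest_pain_type) for convenient access.
pct = {(sex, cp): share for sex, cp, _, share in within_sex}
for row in within_sex:
    print(row)
```

On the full table, the same idea could be expressed with a per-gender subquery or a window function, depending on what the database supports.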

### 3. Examined Cardiovascular Risk Factors
#### Investigated the distribution of resting blood pressure, serum cholesterol, and maximum heart rate achieved

In [None]:
SELECT
    AVG(resting_blood_pressure) AS avg_blood_pressure,
    AVG(cholesterol) AS avg_cholesterol,
    AVG(max_heart_rate) AS avg_heart_rate
FROM heart_disease;

| avg_blood_pressure | avg_cholesterol | avg_heart_rate |
|--------------------|-----------------|-----------------|
| 131.61             | 246.00          | 149.11          |


The average resting blood pressure among individuals in the dataset is 131.61 mm Hg, the average serum cholesterol level is 246.00 mg/dl, and the average maximum heart rate achieved is 149.11 beats per minute. These averages give a baseline for the overall cardiovascular profile of the dataset and serve as reference points for further analysis and interpretation.

### 4. Fasting Blood Sugar and ECG Results:
#### Analyzed the prevalence of fasting blood sugar > 120 mg/dl, where 1 = True, 0 = False

In [None]:
SELECT fasting_blood_sugar, COUNT(*) AS count,
(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM heart_disease)) AS percentage
FROM heart_disease
GROUP BY fasting_blood_sugar;

| fasting_blood_sugar | count | percentage |
|----------------------|-------|------------|
| 0                    | 872   | 85.07%     |
| 1                    | 153   | 14.93%     |


The majority of individuals in the dataset, constituting 85.07%, have fasting blood sugar levels below or equal to 120 mg/dl (coded as 0). A smaller portion, representing 14.93% of the dataset, has fasting blood sugar levels above 120 mg/dl (coded as 1). This distribution provides insights into the prevalence of elevated fasting blood sugar levels within the studied population and is a crucial factor to consider when examining potential relationships with other cardiovascular risk factors.

#### Explored the distribution of resting electrocardiographic results with the prevalence of fasting blood sugar

In [None]:
SELECT
    resting_ecg,
    AVG(CASE WHEN fasting_blood_sugar = 1 THEN 1 ELSE 0 END)*100 AS fasting_blood_sugar_percentage,
    COUNT(*) AS ecg_result_count
FROM heart_disease
GROUP BY resting_ecg
ORDER BY 3 DESC;

| resting_ecg | fasting_blood_sugar_percentage | ecg_result_count |
|-------------|---------------------------------|------------------|
| "ST"        | 11.89%                          | 513              |
| "Normal"    | 18.51%                          | 497              |
| "LVH"       | 0.00%                           | 15               |


For each resting electrocardiographic (ECG) result, the query reports how many records have that result and what percentage of those records have fasting blood sugar above 120 mg/dl. 'ST' (ST-T wave abnormality) is the most common result with 513 records (about 50% of the dataset), followed closely by 'Normal' with 497 records (about 48%). 'LVH' (left ventricular hypertrophy) is rare, with only 15 records. Elevated fasting blood sugar occurs in 18.51% of the 'Normal' group and 11.89% of the 'ST' group, and in none of the 'LVH' records. Because the 'LVH' group is so small, its 0.00% should be read with caution.
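
The result table mixes two different quantities: the size of each ECG group and the rate of elevated fasting blood sugar inside it. The sketch below reports both side by side, adding each group's share of the dataset as a separate column. It uses Python's `sqlite3` with an in-memory database and a hypothetical helper table `ecg_counts`. The per-group counts of elevated readings (61 and 92) are reconstructed from the percentages and group sizes above, which is an assumption rather than a value read from the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Helper table: record counts per ECG result and fasting-blood-sugar flag.
# The split within each group is reconstructed from the reported percentages
# (61/513 = 11.89%, 92/497 = 18.51%), not read from the raw data.
conn.execute("CREATE TABLE ecg_counts (resting_ecg TEXT, fbs_high INTEGER, n INTEGER)")
conn.executemany(
    "INSERT INTO ecg_counts VALUES (?, ?, ?)",
    [("ST", 1, 61), ("ST", 0, 452), ("Normal", 1, 92), ("Normal", 0, 405), ("LVH", 0, 15)],
)

ecg_rows = conn.execute(
    """
    SELECT resting_ecg,
           SUM(n) AS ecg_result_count,
           ROUND(SUM(n) * 100.0 / (SELECT SUM(n) FROM ecg_counts), 2) AS share_of_dataset,
           ROUND(SUM(CASE WHEN fbs_high = 1 THEN n ELSE 0 END) * 100.0 / SUM(n), 2)
               AS fbs_high_pct
    FROM ecg_counts
    GROUP BY resting_ecg
    ORDER BY ecg_result_count DESC
    """
).fetchall()
for row in ecg_rows:
    print(row)
```

Naming the two percentage columns differently makes it harder to mistake the within-group rate for the group's share of the dataset.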

### 4. Exercise-Related Insights:
#### Investigated the occurrence of exercise-induced angina

In [None]:
SELECT
    AVG(CASE WHEN exercise_angina = 1 THEN 1 ELSE 0 END)*100 AS angina_percentage,
    AVG(oldpeak) AS avg_st_depression
FROM heart_disease;

| angina_percentage | avg_st_depression |
|-------------------|-------------------|
| 33.66%            | 1.07              |


The dataset indicates that approximately 33.66% of individuals experienced exercise-induced angina. Additionally, the average ST depression induced by exercise relative to rest (oldpeak) is 1.07. These findings provide valuable insights into the prevalence of exercise-related symptoms and the associated magnitude of ST depression. 

### 5. Heart Disease Severity Indicators: examined the distribution of the slope of the peak exercise ST segment:

In [None]:
SELECT
    st_slope,
    COUNT(*) AS st_slope_count,
    (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM heart_disease)) AS percentage
FROM heart_disease
GROUP BY st_slope;

| st_slope | st_slope_count | percentage |
|----------|----------------|------------|
| "Flat"   | 482            | 47.02%     |
| "Up"     | 74             | 7.22%      |
| "Down"   | 469            | 45.76%     |


The distribution of the slope of the peak exercise ST segment is dominated by two categories. The largest share of individuals (47.02%) exhibit a 'Flat' slope, closely followed by 45.76% with a 'Down' slope. Only 7.22% show an 'Up' slope.

#### Analyzed the number of major vessels colored by fluoroscopy

In [None]:
SELECT n_major_vessels, COUNT(*) AS total
FROM heart_disease
GROUP BY n_major_vessels
ORDER BY 1;

| n_major_vessels | total |
|-----------------|-------|
| 0               | 578   |
| 1               | 226   |
| 2               | 134   |
| 3               | 69    |
| 4               | 18    |


The majority of individuals (578) have 0 major vessels colored by fluoroscopy, followed by 226 individuals with 1 vessel, 134 with 2 vessels, and 69 with 3 vessels. There are also 18 records with a value of 4, which falls outside the 0-3 range given in the attribute description and may warrant a closer look before modeling.
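
A value of 4 sits outside the 0-3 range listed for this attribute, so a simple range check can help quantify how many records are affected. The following is a minimal sketch using Python's `sqlite3` with an in-memory database. The helper table `vessel_counts` is hypothetical and is filled with the counts from the result table above instead of the raw rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Helper table holding the counts from the result table above.
conn.execute("CREATE TABLE vessel_counts (n_major_vessels INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO vessel_counts VALUES (?, ?)",
    [(0, 578), (1, 226), (2, 134), (3, 69), (4, 18)],
)

# Values outside the documented 0-3 range.
out_of_range = conn.execute(
    """
    SELECT n_major_vessels, total
    FROM vessel_counts
    WHERE n_major_vessels NOT BETWEEN 0 AND 3
    ORDER BY n_major_vessels
    """
).fetchall()

# Share of all records that fall outside the documented range, in percent.
out_of_range_pct = conn.execute(
    """
    SELECT ROUND(
        SUM(CASE WHEN n_major_vessels NOT BETWEEN 0 AND 3 THEN total ELSE 0 END)
        * 100.0 / SUM(total), 2)
    FROM vessel_counts
    """
).fetchone()[0]

print(out_of_range, out_of_range_pct)
```

Whether such records should be kept, recoded, or excluded is a modeling decision that this check does not settle.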

### 6. Thalassemia Analysis: explored the distribution of thalassemia types (normal, fixed defect, reversible defect)

In [None]:
SELECT
    thal,
    COUNT(*) AS thal_count
FROM heart_disease
GROUP BY thal;

| thal               | thal_count |
|--------------------|------------|
| "fixed defect"     | 544        |
| "reversible defect"| 410        |
| "normal"           | 71         |


The analysis of thalassemia types reveals the distribution of cases among distinct categories. The most prevalent type is 'fixed defect' with 544 occurrences, followed by 'reversible defect' with 410 cases. Notably, there are 71 cases categorized as 'normal'. 

## Comprehensive Conclusion

### Demographic Insights:
- The dataset comprises a diverse age range, with the average age being 54.43 years.
- Gender distribution shows a higher percentage of males (69.56%) compared to females (30.44%).

### Chest Pain Analysis:
- Chest pain types vary among different age groups. 'TA' is the most common type overall, and its count peaks in the 51-60 age group.
- Detailed breakdowns of chest pain types based on age groups and gender provide valuable insights into potential correlations.

### Cardiovascular Risk Factors:
- Average resting blood pressure is 131.61 mm Hg, average serum cholesterol is 246.00 mg/dl, and the average maximum heart rate achieved is 149.11.
- The majority of the population has fasting blood sugar levels at or below 120 mg/dl (85.07%).

### Resting Electrocardiographic Results:
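- The most common electrocardiographic result is 'ST' (ST-T wave abnormality, 513 records), followed by 'Normal' (497 records) and 'LVH' (15 records). Elevated fasting blood sugar is present in 18.51% of the 'Normal' group and 11.89% of the 'ST' group.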

### Exercise-Related Insights:
- Approximately 33.66% of the population experiences exercise-induced angina.
- The average ST depression induced by exercise relative to rest is 1.07.

### Heart Disease Severity Indicators:
- The distribution of the slope of the peak exercise ST segment shows a significant proportion with a 'Flat' slope (47.02%).

### Number of Major Vessels and Thalassemia Types:
- The majority of individuals have 0 or 1 major vessels colored by fluoroscopy.
- 'Fixed defect' is the most common thalassemia type, followed by 'Reversible defect' and 'Normal.'

This comprehensive analysis provides a foundation for further investigation and modeling to predict and prevent heart disease. The insights gained contribute to a better understanding of the dataset and highlight potential areas for focused research and intervention.
