# [Epi/Biostats 101] Final Project: Heart Attack!

Authors: Andrew O'Connor, Carl Conste  
UC Berkeley, Fall 2023  


In [None]:
# uncomment the line below if you are missing any of the libraries
# !pip install IPython ipywidgets ucimlrepo seaborn pandas numpy matplotlib

In [1]:
import pandas as pd
import numpy as np
from utils import *

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


sns.set(style="dark")
plt.style.use("ggplot")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Project Overview
In this **partner** project, you will explore several aspects of cardiovascular disease (CVD), an umbrella term that covers many different conditions affecting the heart. The project is divided into three required sections and one optional extra-credit section, each offering a different perspective on the topic.

### Section Descriptions
**Section 1**: Epidemiology in Practice will introduce you to a practical application of epidemiological research, providing insights into different types of studies, research methodologies, causal inference techniques and the significance of epidemiology in public health.

**Section 2**: EDA (Exploratory Data Analysis) of CVD (Cardiovascular Disease) takes you into the realm of data analysis, where you will apply your epidemiological knowledge to real-world health datasets. You'll gain practical experience in understanding and interpreting health data.

**Section 3**: Predicting Heart Attacks leverages predictive modeling and machine learning to build a binary classifier that predicts heart attacks. You will bridge the gap between epidemiological insights and data-driven decision-making in healthcare.

**Section 4 (Optional, Extra Credit)**: Life's Essential 8 gives you the opportunity to compare the lived realities of Americans with the behaviors recommended to promote "good" health. Public health officials may recommend and promote certain behaviors, but are these socially and economically feasible? This section lets you explore that question.

This project offers a cohesive journey, combining theory and practical applications, equipping you with valuable skills in epidemiology, data analysis, and predictive modeling. You will explore the fascinating world of heart disease and its implications for public health.

### Grading
- This project will be graded **80% on completion** and **20% on accuracy**.
- There are a total of 20 questions, each worth 5 points.
- There is an optional Extra Credit section, with 2 questions worth 5 points each. Therefore, the maximum score you can receive on the project is 110/100.
- Preliminary questions are meant to help you get started on the questions that follow. They are not graded and are explicitly marked as worth 0 points.

<div class="alert alert-block alert-info">
Key things to note for free response "Your Answer Here" questions:
<ul>
<li>Answer with at most 5 sentences. Some explanations may take longer than others.</li>
<li>The goal of this restriction is to not make you write a full essay or spend too much time on these questions.</li>
</ul>
</div>

---

### Section 1: Epidemiology in Practice

In this section, you will be reading the study ["Cardiovascular disease risk factors in relation to smoking behaviour and history: a population-based cohort study"](https://openheart.bmj.com/content/openhrt/3/2/e000358.full.pdf) (Keto J, Ventola H, Jokelainen J, et al.) and extracting some information based on what we've talked about in class. This study provides a practical opportunity to apply epidemiological and causal inference knowledge to investigate the relationships between cardiovascular disease risk factors and smoking behavior.

Throughout this section, our focus will be on the study's methodology, findings, and the causal relationship between smoking and CVD. By examining the impact of smoking behavior on cardiovascular disease risk, we aim to uncover valuable insights that can inform public health strategies and interventions. Our analysis marks the beginning of our exploration into heart-related issues through an epidemiological lens.

**Preliminary** (0 points): Watch the following 2 videos introducing heart/cardiovascular disease (CVD). After watching, reflect on the following questions to check your understanding (no need to write anything down).  

1. What are some of the causes/risk factors for CVD?
2. What is cholesterol and how does it affect the heart?
3. Why does CVD have such a large impact on Americans?

In [2]:
# run this cell to view the videos
DisplayCVDVideos()

<br>

**Question 1**: [Read](https://open.oregonstate.education/epidemiology/back-matter/appendix-1/) the following research paper [linked here](https://openheart.bmj.com/content/3/2/e000358).  What is the main purpose of this study and how does it contribute to what is already known of the topic?  
Note: Suggested order of reading: Abstract, Introduction, Conclusion, Discussion, (skim) Methods, (skim) Results/Data Analysis  
*Hint*: [Refer to the Abstract, Key Questions (page 1), and Introduction]

___Type your answer here, replacing this text.___

<br>

**Question 2**: What type of epidemiological study was done? Briefly describe the individuals that were selected for this study. Did any drop out of the study? (i.e. what proportion of the original population stayed to the end?)

___Type your answer here, replacing this text.___

<br>

**Question 3**: According to the study, how was smoking status assessed?

___Type your answer here, replacing this text.___

<br>

**Question 4**: Displayed below is Table 1 (page 3). List all variables that come from continuous distributions and mention the process used to standardize the values of the continuous variables.

<div style="text-align: center;">
    <img src="table1.png" width="700" alt="Table 1">
</div>

___Type your answer here, replacing this text.___

<br>

**Question 5**: Suppose we have access to the full database of individuals in the study and assume `Age` follows a $\textrm{Normal}(46.6, 0.36)$ distribution. What is the probability that a randomly selected person from the study is less than 46 years old?  
<br> 

*Hint*: Consider the probability density function (pdf) and cumulative distribution function (cdf) of `Age`. Import and use any scipy library functions from [the documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) that may help you solve this question. 
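
As a rough illustration of how a cdf is evaluated, the sketch below calls `scipy.stats.norm.cdf` for a made-up distribution (mean 50, standard deviation 10), not the `Age` distribution from the question. Note that `scipy.stats.norm` expects the standard deviation as `scale`, so if the second parameter of a normal distribution is given as a variance, take its square root first.

```python
from scipy.stats import norm

# Hypothetical example values (not the ones from Question 5)
mu, sigma = 50, 10

# P(X < 40) for X ~ Normal(mean=50, sd=10)
p_below_40 = norm.cdf(40, loc=mu, scale=sigma)
print(round(p_below_40, 4))  # 0.1587
```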

In [None]:
# Your code here

<br>

**Question 6**: Do you believe this study has the ability to determine a causal relationship between smoking and CVD? Why or why not?

___Type your answer here, replacing this text.___

<br>

**Question 7**: What kind of study is best to determine causality? Compare and contrast that type of study to the study mentioned in the paper.

___Type your answer here, replacing this text.___

<br>

**Question 8**: Suppose we hypothesize that there is a causal relationship between Smoking and CVD. What are two main factors that may affect the relationship between Smoking and CVD? Brainstorm at least 2 variables for each factor.

___Type your answer here, replacing this text.___

---

### Section 2: EDA of CVD

In Section 2, we transition from epidemiological research to the practical world of data analysis. Our focus here is on conducting Exploratory Data Analysis (EDA) to gain a deeper understanding of heart health. To do so, we'll use the UCI Machine Learning Repository's [Heart Disease dataset](https://archive.ics.uci.edu/dataset/45/heart+disease), which lets us explore a range of variables associated with heart health. EDA is a crucial step in any data analysis, as it provides the tools and techniques to uncover patterns, relationships, and potential insights within our data. By the end of this section, you will have practiced data visualization, statistical analysis, and asking and answering meaningful questions based on real-world health data.


Run the cell below to fetch the heart disease data directly from the UCI Machine Learning Repository.

In [None]:
# UCI ML repo - https://archive.ics.uci.edu/dataset/45/heart+disease
heart_disease = Fetch_CVD_Data()

**Preliminary** (0 points): When a function is already defined for you (to abstract away the unnecessary details of loading/processing data), it's not always obvious what it returns. What function would you use to check the return type of the above code?

In [None]:
# Your Code Here

<br>

**Question 9**: Now, let's explore our dataset. First, produce a pairplot of the dataset, which draws a scatterplot comparing each column with every other column. This code may take a minute to run.

The diagonal shows a histogram of each variable, and the other plots are scatter plots showing the relationship between each pair of variables. What general trends do you notice in some of these plots? Are there any variables that you think are relatively correlated with each other, and if so, which ones?

In [None]:
fig = # YOUR CODE HERE

# Cleaning Up the Output
for i, j in zip(*np.triu_indices_from(fig.axes, 1)):
    fig.axes[i, j].set_visible(False)
plt.show()
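
The cleanup loop above relies on `np.triu_indices_from`, which returns the row and column indices of the upper triangle of a 2-D array; the offset `1` skips the diagonal. A small sketch on a stand-in 3x3 array (an assumption for illustration; in the cell above the array is the grid of subplot axes):

```python
import numpy as np

grid = np.zeros((3, 3))  # stand-in for a 3x3 grid of subplot axes

# Indices strictly above the diagonal
rows, cols = np.triu_indices_from(grid, 1)
pairs = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

Hiding those axes leaves only the diagonal and the lower triangle, since the upper triangle repeats the same variable pairs with the axes swapped.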

___Type your answer here, replacing this text.___

<br>

**Question 10**: Let's focus on the relationship between ```age``` and ```num```. Plot a histogram that shows the distribution of age for each value of ```num```. Use the argument ```palette="mako"``` to better see the overlapping bars.

What do you notice about these histograms?

*Hint:* Look into ```sns.histplot```'s ```hue``` argument.

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

Because we're separating individuals with heart disease into their specific levels of heart disease, our histogram becomes a little hard to read, and individuals who don't have heart disease dominate the graph. To address this, we'll add a new column called ```num_normalized```, where any individual whose ```num``` value is at least 1 gets the value 1 and everyone else gets 0.

In [None]:
# DO NOT EDIT THIS CELL
heart_disease['num_normalized'] = (heart_disease['num'] >= 1) * 1
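
To see what the compare-then-multiply idiom in the cell above does, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset). The comparison yields booleans, and multiplying by 1 converts them to 0/1 integers.

```python
import pandas as pd

toy_num = pd.Series([0, 1, 2, 3, 4])  # hypothetical `num` values

# True where the value is at least 1, then converted to 0/1
toy_binary = (toy_num >= 1) * 1
print(toy_binary.tolist())  # [0, 1, 1, 1, 1]
```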

Replot a histogram similar to the one above, but use ```num_normalized``` instead of ```num```. It should now be much easier to read. What do you notice about this new histogram? Do any of these new findings surprise you? Why or why not?

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 11**: Let's look at some other variables whose distributions differ between individuals with and without heart disease. Produce a histogram displaying the distribution of ```thalach``` for the two groups of ```num_normalized```. Explain why you think the distribution of ```thalach``` may differ between those who have heart disease and those who don't.

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 12**: Lastly, let's explore how heart disease differs between sex groups. Plot a histogram displaying the difference between groups. Report your findings. Did you find this difference shocking, or was it more or less expected, and why?

Note: ```sex == 1``` is male, ```sex == 0``` is female

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 13**: Find one more variable in the dataset that we have not explored yet and that shows a distinct divide between groups of ```num_normalized```, meaning you can easily tell from the histogram that certain values of the variable are more likely to belong to one group of ```num_normalized``` than the other. What variable did you pick and why? How did you conclude that this was a good variable to choose for our model?

**NOTE**: Do ***NOT*** choose either ```ca``` or ```thal```, as those columns have missing values and will not be helpful for us in later sections.
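
One way to check which columns contain missing values is `DataFrame.isna().sum()`. The sketch below uses a small made-up frame; the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not real records from the dataset
toy = pd.DataFrame({
    "age": [63, 67, 41],
    "ca": [0.0, np.nan, 2.0],
    "thal": [np.nan, np.nan, 3.0],
})

# Count missing entries in each column
missing_counts = toy.isna().sum()
print(missing_counts.to_dict())  # {'age': 0, 'ca': 1, 'thal': 2}
```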

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 14**: Now that we've explored a little bit of the dataset, why did we choose/have you focus on variables that had distinct divides between groups? How can this be beneficial for us?

___Type your answer here, replacing this text.___

<br>

---

### Section 3: Predicting Heart Attacks

In this section, we shift our focus from analyzing heart-related issues to practical applications of prediction and machine learning. Using your cleaned data from the previous section, our goal is to build a classifier capable of predicting the occurrence of heart attacks. This section marks the culmination of our exploration, combining the insights gained from epidemiology and exploratory data analysis with predictive modeling techniques. By the end of this section, you'll have practical experience in building and evaluating predictive models for a critical health outcome.

**Preliminary** (0 points): Remind yourself of the modeling process. What are the steps? What kind of model metrics are associated with binary classification?

___Type your answer here, replacing this text.___

<br>

**Question 15**: First off, let's choose our model. Since we're trying to perform binary classification, which models would work here, and why? Which ones wouldn't?

___Type your answer here, replacing this text.___

<br>

**Question 16**: Next, choose a model you're comfortable working with. We'd recommend using sklearn's ```LogisticRegression``` due to its intuitiveness and simplicity. Develop a model using the variables ```sex``` and ```age```. Predict ```num_normalized``` and record the accuracy of this model.
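
As a hedged sketch of the general fit/predict/score workflow, the example below trains a `LogisticRegression` on synthetic data with made-up features, not the heart disease dataset. The train/test split, the random seed, and the way the labels are constructed are assumptions for illustration; the assignment does not prescribe them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))                    # two synthetic features
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)  # synthetic binary label

# Hold out a quarter of the rows to estimate accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

toy_accuracy = accuracy_score(y_test, model.predict(X_test))
print(toy_accuracy)
```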

In [None]:
# feel free to add any necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

# YOUR CODE HERE

<br>

**Question 17**: Train a new model, using the same type of model you used in the previous question. This time, use the variables ```age```, ```sex```, ```thalach```, and a variable of your choice (preferably the one you performed EDA on). Record its accuracy score. Did this model perform better or worse than the model using only ```age``` and ```sex```? Why do you think this is?

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 18**: Now, try using all the variables in the dataset except ```ca``` and ```thal```, and record the accuracy score. How does it compare to the accuracy scores of the other models?

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 19**: Create a new model of a different type than your previous model. Train this model using the variables ```age```, ```sex```, ```thalach```, and the same variable of choice from the previous questions. Record its accuracy. Why did you choose this model? How does it compare to your other model using the same variables? Explain possible reasons for the difference (or lack of difference) between the accuracy of the models.

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

**Question 20**: Let's explore your initial model of choice using the four variables ```age```, ```sex```, ```thalach```, and a variable of your choice. Now, examine the labels our model predicts (whether a person has heart disease or not) and compare them with the true labels. Calculate the False Positive Rate and False Negative Rate of our model and report them. What do these values represent?

In the context of our problem, which would be more harmful: misclassifying someone as having heart disease (False Positive) or misclassifying someone as not having heart disease (False Negative)? Why?
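
For reference, the two rates can be read off a confusion matrix. The sketch below uses made-up labels, not model output: the False Positive Rate is FP / (FP + TN) and the False Negative Rate is FN / (FN + TP).

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up true labels
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]  # made-up predictions

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

fpr = fp / (fp + tn)  # share of actual negatives predicted positive
fnr = fn / (fn + tp)  # share of actual positives predicted negative
print(fpr, fnr)  # 0.25 0.5
```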

In [None]:
# YOUR CODE HERE

___Type your answer here, replacing this text.___

<br>

---

### Section 4 (Extra Credit): Life's Essential 8

In this optional section, you will consider [Life's Essential 8](https://www.heart.org/en/healthy-living/healthy-lifestyle/lifes-essential-8), a set of health factors and behaviors identified by the American Heart Association (AHA) as essential for promoting heart health and overall well-being, in light of the social and economic conditions of individuals.

**EC 1** (5 points): [Food deserts](https://foodispower.org/access-health/food-deserts/) can be described as geographic areas where residents’ access to affordable, healthy food options (especially fresh fruits and vegetables) is restricted or nonexistent due to the absence of grocery stores within convenient traveling distance. What populations in the US do you believe are at risk of living in food deserts or currently live in them? What do you expect the heart health of individuals in these populations to be like?  
*Hint*: Think of the US's history with regard to racism, segregation, immigration, policing, city planning and indigenous peoples in the US.

**Your Answer Here**

<br>

**EC 2** (5 points): Mental health undeniably plays a significant role in overall well-being, yet it is not explicitly covered in Life's Essential 8. In what ways do you believe mental health can affect CVD, either directly or indirectly?

**Your Answer Here**

<br>

---

### Submission
Congratulations! You have finished the final project. You should now have a better grasp of the following concepts:

- Strategically reading a research study by extracting the important sections first (unlike reading narratives)
- Identifying study design and types of variables from a research paper
- Conducting exploratory data analysis (EDA) on an unfamiliar dataset
- Creating univariate and bivariate data visualizations to interpret phenomena
- Creating a binary classifier model using Python's sklearn library to predict heart attacks in patients
- Evaluating model performance by considering various common evaluation metrics of ML models

Please submit this assignment to bCourses before **November 28th, 2023**. 