## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional material in an Appendix. You may refer to the Appendix in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Code should be put separately in the code template {-}
Your report should be written in a research-paper-like style. Include code in the report only when something cannot be explained without it; otherwise, leave it out. We will check your code in the code template.

**Delete this section from the report, when using this template.**

## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

In the United States, someone has a stroke every 40 seconds [2]. Because strokes are so prevalent, better stroke protocols and a clearer understanding of what causes strokes are urgently needed. Stroke-related costs in the United States came to 53 billion dollars from 2017 to 2018 [2]. One of the best ways to prevent a stroke is to take preemptive measures such as maintaining a healthy lifestyle through exercise, abstaining from smoking, and managing underlying health issues such as diabetes [3]. High cholesterol and high blood pressure are early warning signs for those more susceptible to stroke [3]. According to the American Heart Association, “Over the past 30 years, stroke incidence among adults 49 and younger has continued to increase in Southern states and the Midwest” [4]. These statistics motivated our team to take a first step toward preventative stroke modeling: a logistic regression that predicts whether a person, given certain characteristics, is more vulnerable to having a stroke.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

To create this model, we used the ‘Stroke Prediction Dataset’, which is available on Kaggle, a data science platform. The dataset includes the following variables: unique id, gender, age, hypertension, heart disease, marital status, work type, place of residence, average glucose level, body mass index (BMI), smoking status, and whether the person had a stroke. Because we are working in a diagnostic setting, we aimed to build a model that minimizes the false negative rate. A false negative, in which a person is told they are not at risk of stroke when they are actually at high risk, would be the worst possible outcome.


## Stakeholders
Who cares? If you are successful, what difference will it make to them?

At-risk patients will benefit the most from our modeling because it will allow them to start preventative measures and lifestyle changes sooner to decrease their risk. Preparing at-risk patients is also central to our mission: giving them information about how to recognize when they are experiencing a stroke matters because taking the next steps promptly is imperative for surviving it. A model that predicts stroke likelihood would also help the family and friends of people predicted to be at risk, because they too could be better prepared if the situation escalates. Raising awareness of possible stroke is our driving goal for patients.

Our second stakeholder group is medical professionals. Medical professionals swear the Hippocratic oath and promise to ‘do no harm’ [5]. With logistic classification models that predict patients' likelihood of stroke, doctors could better address their patients' needs. In addition, if doctors are unsure whether a person is having or has had a stroke, they can use this model to support or challenge their initial assessment. Because medical professionals constantly research different health issues, they can also use this model to find participants for stroke studies and determine whether someone is eligible.



## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

    import pandas as pd
    import numpy as np

    strokedata = pd.read_csv('healthcare-dataset-stroke-data.csv')

    strokedata[['id', 'age', 'avg_glucose_level', 'bmi']].describe()

| | id | age | avg_glucose_level | bmi |
|---|---|---|---|---|
| count | 5110.0 | 5110.0 | 5110.0 | 4909.0 |
| mean | 36517.829354 | 43.226614 | 106.147677 | 28.893237 |
| std | 21161.721625 | 22.612647 | 45.28356 | 7.854067 |
| min | 67.0 | 0.08 | 55.12 | 10.3 |
| 25% | 17741.25 | 25.0 | 77.245 | 23.5 |
| 50% | 36932.0 | 45.0 | 91.885 | 28.1 |
| 75% | 54682.0 | 61.0 | 114.09 | 33.1 |
| max | 72940.0 | 82.0 | 271.74 | 97.6 |


    strokedata['hypertension'].value_counts()

    0    4612
    1     498
    Name: hypertension, dtype: int64

    strokedata['heart_disease'].value_counts()

    0    4834
    1     276
    Name: heart_disease, dtype: int64

    strokedata['stroke'].value_counts()

    0    4861
    1     249
    Name: stroke, dtype: int64

    strokedata_dict = {
        'Variable Name': ['gender', 'ever_married', 'work_type', 'Residence_type',
                          'smoking_status', 'hypertension', 'heart_disease', 'stroke'],
        'Missing Values': ['None'] * 8,
        'Unique Values': [3, 2, 5, 2, 4, 2, 2, 2],
    }
    pd.DataFrame(strokedata_dict)

| | Variable Name | Missing Values | Unique Values |
|---|---|---|---|
| 0 | gender | None | 3 |
| 1 | ever_married | None | 2 |
| 2 | work_type | None | 5 |
| 3 | Residence_type | None | 2 |
| 4 | smoking_status | None | 4 |
| 5 | hypertension | None | 2 |
| 6 | heart_disease | None | 2 |
| 7 | stroke | None | 2 |


    # np.nan pads the shorter columns (the np.NaN alias was removed in NumPy 2.0)
    strokedata_cat_dict2 = {
        'gender': ['Female', 'Male', 'Other', np.nan, np.nan],
        'gender_count': [2994, 2115, 1, np.nan, np.nan],
        'ever_married': ['Yes', 'No', np.nan, np.nan, np.nan],
        'ever_married_count': [3353, 1757, np.nan, np.nan, np.nan],
        'work_type': ['Private', 'Self-employed', 'children', 'Govt_job', 'Never_worked'],
        'work_type_count': [2925, 819, 687, 657, 22],
        'Residence_type': ['Urban', 'Rural', np.nan, np.nan, np.nan],
        'Residence_type_count': [2596, 2514, np.nan, np.nan, np.nan],
        'smoking_status': ['never smoked', 'Unknown', 'formerly smoked', 'smokes', np.nan],
        'smoking_status_count': [1892, 1544, 885, 789, np.nan],
        'hypertension': [0, 1, np.nan, np.nan, np.nan],
        'hypertension_count': [4612, 498, np.nan, np.nan, np.nan],
        'heart_disease': [0, 1, np.nan, np.nan, np.nan],
        'heart_disease_count': [4834, 276, np.nan, np.nan, np.nan],
        'stroke': [0, 1, np.nan, np.nan, np.nan],
        'stroke_count': [4861, 249, np.nan, np.nan, np.nan],
    }
    pd.DataFrame(strokedata_cat_dict2)

| | gender | gender_count | ever_married | ever_married_count | work_type | work_type_count | Residence_type | Residence_type_count | smoking_status | smoking_status_count | hypertension | hypertension_count | heart_disease | heart_disease_count | stroke | stroke_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 2994.0 | Yes | 3353.0 | Private | 2925 | Urban | 2596.0 | never smoked | 1892.0 | 0.0 | 4612.0 | 0.0 | 4834.0 | 0.0 | 4861.0 |
| 1 | Male | 2115.0 | No | 1757.0 | Self-employed | 819 | Rural | 2514.0 | Unknown | 1544.0 | 1.0 | 498.0 | 1.0 | 276.0 | 1.0 | 249.0 |
| 2 | Other | 1.0 | | | children | 687 | | | formerly smoked | 885.0 | | | | | | |
| 3 | | | | | Govt_job | 657 | | | smokes | 789.0 | | | | | | |
| 4 | | | | | Never_worked | 22 | | | | | | | | | | |
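
The summary tables above were typed in by hand from individual `value_counts()` calls. As a hedged alternative, the same kind of summary (missing count, unique count, and top levels) could be computed directly from the data. The sketch below is illustrative only: the helper name `summarize_categoricals`, the `top_n` parameter, and the small `toy` frame are assumptions introduced here, not part of the original analysis.

```python
import pandas as pd


def summarize_categoricals(df, columns, top_n=5):
    """Return one row per column: missing count, unique count, and top levels."""
    rows = []
    for col in columns:
        counts = df[col].value_counts(dropna=True)  # sorted by frequency, descending
        rows.append({
            "Variable Name": col,
            "Missing Values": int(df[col].isna().sum()),
            "Unique Values": int(df[col].nunique(dropna=True)),
            "Top Levels": ", ".join(
                f"{level} ({n})" for level, n in counts.head(top_n).items()
            ),
        })
    return pd.DataFrame(rows)


# Tiny made-up frame standing in for strokedata (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "No", "Yes", None, "Yes"],
})
summary = summarize_categoricals(toy, ["gender", "ever_married"])
print(summary)
```

With the real data, the call would presumably be `summarize_categoricals(strokedata, [...])` with the eight categorical column names listed above.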


The describe() output above shows that bmi has 4,909 non-missing values out of 5,110 rows, so 201 values are missing. Dropping those rows would have discarded a significant portion of the stroke cases in our target variable, so we imputed BMI with K-Nearest Neighbors into a new variable, imputed_bmi, and added a flag variable, bmi_original, that records ("yes" or "no") whether each value was imputed. We also binned glucose to better see non-linear trends with stroke. Finally, we split the data 70/30 into training and test sets.
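
The preparation steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the project. The neighbor features (`age`, `avg_glucose_level`), `k=5`, the quartile glucose bins, the random seed, and the convention that `bmi_original == "yes"` means the value was observed rather than imputed are all choices made here for the example. The KNN step is hand-rolled with NumPy to keep the sketch dependency-light; scikit-learn's `KNNImputer` would be a more typical tool. The data is synthetic.

```python
import numpy as np
import pandas as pd


def knn_impute_bmi(df, feature_cols, k=5):
    """Fill missing bmi with the mean bmi of the k nearest rows that have bmi."""
    out = df.copy()
    missing = out["bmi"].isna().to_numpy()
    # Assumed encoding: "yes" = original (observed) value, "no" = imputed value
    out["bmi_original"] = np.where(missing, "no", "yes")
    # Standardize the features so no single scale dominates the distance
    feats = out[feature_cols].astype(float)
    feats = ((feats - feats.mean()) / feats.std()).to_numpy()
    known_feats = feats[~missing]
    known_bmi = out["bmi"].to_numpy(dtype=float)[~missing]
    imputed = out["bmi"].to_numpy(dtype=float).copy()
    for i in np.flatnonzero(missing):
        dist = np.sqrt(((known_feats - feats[i]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        imputed[i] = known_bmi[nearest].mean()
    out["imputed_bmi"] = imputed
    return out


# Synthetic stand-in for strokedata (not the real dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.uniform(1, 82, n),
    "avg_glucose_level": rng.uniform(55, 270, n),
    "bmi": rng.normal(28, 7, n),
})
toy.loc[toy.index[:20], "bmi"] = np.nan  # pretend 20 BMI values are missing

prepared = knn_impute_bmi(toy, ["age", "avg_glucose_level"])
# Quartile bins for glucose; the bin edges actually used are not stated in the report
prepared["glucose_bin"] = pd.qcut(prepared["avg_glucose_level"], q=4, labels=False)

# 70/30 train/test split
train_df = prepared.sample(frac=0.7, random_state=0)
test_df = prepared.drop(train_df.index)
print(len(train_df), len(test_df))
```

One design point worth noting: fitting the imputation on the full data before splitting lets test rows influence training values, so a stricter workflow would split first and impute using training rows only.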

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic regression model and optimized the false negative rate (FNR) and the F1 score. We wanted to minimize FNR because it is better to overdiagnose people and send them for more screening than to miss people who are at risk of stroke. F1 was important because it captures both precision and recall, so both are valued when training the model. Because we minimized FNR by using a lower classification cutoff, we had difficulty achieving high precision, and this was the main obstacle to a good F1 score.

We had a lot of difficulty finding interactions and transformations that improved our model. Beyond arbitrary transformations and transformations suggested by visualizations, we tried lasso regression and other variable selection methods, but nothing clearly outperformed the base model.

We did not use code from successful Kaggle solutions because many of them rely on machine learning techniques that we have not yet covered, so we tried our best to use variable selection techniques within the scope of this quarter's class.

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

## Other sections *(optional)*

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-}

https://github.com/cara25/slayta_scientists

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Elton John</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Xena Valenzuela</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Sankaranarayanan Balasubramanian</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Chun-Li</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

</body>
</html>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].


[1] https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death

[2] https://www.cdc.gov/stroke/facts.htm

[3] https://www.cdc.gov/stroke/prevention.htm

[4] https://www.npr.org/sections/health-shots/2022/03/14/1086345393/strokes-young-people-hailey-bieber

[5] https://www.health.harvard.edu/blog/first-do-no-harm-201510138421


## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.