# Heart Attack - Kaggle competition

This is an active Kaggle competition for Kudos.
Details: https://www.kaggle.com/competitions/heart-attack-risk-analysis/overview

### IMPORTS

In [1]:
from matplotlib import pyplot as plt
import pandas as pd
from dataprep.eda import create_report
import statistics

AttributeError: module 'numba' has no attribute 'generated_jit'

### GET DATA

In [3]:
df_heart_risk_train = pd.read_csv('data/train.csv')
df_heart_risk_test = pd.read_csv('data/test.csv')

In [7]:
report = create_report(df_heart_risk_train)
report.save('report_heart_attack')

Report has been saved to report_heart_attack.html!


The EDA profiler tells us that we have 26 variables (25 features + 1 target variable of low/high risk) and 7010 patients in rows. There is no missing data and there are no duplicate rows, meaning each record is unique, which is great. We have 16 categorical and 9 numerical variables, plus a geo variable as a bonus (Southern/Northern Hemisphere).
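
The headline numbers from the profiler (rows, columns, missing cells, duplicate rows) can also be cross-checked directly in pandas. The snippet below is a minimal sketch: `summarize_frame` is a hypothetical helper, and the tiny `toy` frame only stands in for `data/train.csv`, with made-up values.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return the basic facts an EDA report usually leads with."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "numerical_columns": int(df.select_dtypes(include="number").shape[1]),
    }


# Made-up stand-in for the training data (illustrative values only);
# rows 0 and 2 are identical on purpose to show the duplicate count.
toy = pd.DataFrame(
    {
        "Age": [54, 61, 54],
        "Sex": ["Male", "Female", "Male"],
        "Cholesterol": [260, 310, 260],
    }
)
summary = summarize_frame(toy)
print(summary)
```

On the real data the same call would be `summarize_frame(df_heart_risk_train)`.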

Looking closer at the data, there seem to be 2 distinct types of measures / features: 
1. __biomarkers__ such as Blood Pressure, Family History, Diabetes, Previous Heart Problems and partially Medication Use, AND
2. __behavioural indicators__ such as Smoking, Obesity, Alcohol Consumption and Physical Activity (Days per week)

The mean __age__ is 54 years with a standard deviation of 21 years and a range of 18-90, which covers the adult age range of the general population well. In terms of __sex__, about two-thirds are men and the rest are women. 

Mean __Cholesterol__ levels are at 260, which is above the 240 threshold usually considered high, although this measure has a high variability (standard deviation = 81) and an almost bimodal distribution (values centering at 250 and 355 with a "valley" at 320). __Blood Pressure__ has to be parsed before analysis, as it is currently stored in a format that is not suitable for analysis (e.g. 159/105). The mean __Heart Rate__ is around 75 (std: 21, range 40-110), which is again in the normal range. Noteworthy is the bimodal distribution of this measure, with centers around 50 and 70 (valley at 72). 
__Diabetes__ is a binary measure, with about two-thirds of the sample having diabetes. __Family History__, __Obesity__ and __Previous Heart Problems__ each take two values: about half of the participants are obese, about half have had previous heart problems, and about half report a heart attack in the family.

A surprisingly large proportion, about 90% of the sample, reports __Smoking__, and most patients (65%) report __Alcohol Consumption__. Despite such high proportions of "unhealthy behaviour", the patients report a surprisingly high mean __Exercise__ of 10 hours per week (std: 5.8, range 0-20 hrs). In terms of __Diet__, there are equal proportions of Healthy, Average and Unhealthy dieters. Although there is no information on which kind of __Medication__ patients take, half of them report using some. __Stress Levels__ are approximately equally distributed among 10 possible categories. In terms of __Sleep__, there is a large variation in the data: the mean is 6 hours, ranging between 0.001 (?!) and 12 hours.

__Income__ centers around 158,000 USD a year, which indicates that the sample is, well, pretty rich: more than twice the average yearly salary in 2023. __BMI__ may also be an important indicator: the mean is 29, with a standard deviation of 6.3, indicating general population levels. In terms of __Triglycerides__, the mean is around 416 (data are approximately equally distributed between 30 and 800), which is high compared to the normal level of less than 150 mg/dL.

__Physical Activity__ is spread evenly between 0 and 7 days a week, and __Sleep Hours Per Day__ likewise ranges between 4 and 10 in almost equal proportions. In terms of __Country__, most patients come from "Otherland", the rest being distributed across the world and across __Continents__, with Europe and Asia being the most represented. As a result, most patients (about 70%) live in the Northern __Hemisphere__. 
Finally, our target variable: 35% have a high __risk for heart attack__ and 65% have a low risk. 
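
The __Blood Pressure__ values mentioned above (e.g. 159/105) could be made usable by splitting them into two numeric columns. This is a minimal sketch, not the final preprocessing: the column name `Blood Pressure` is assumed from the feature name used in this notebook, and `split_blood_pressure`, `Systolic` and `Diastolic` are hypothetical names chosen for illustration.

```python
import pandas as pd


def split_blood_pressure(df: pd.DataFrame, column: str = "Blood Pressure") -> pd.DataFrame:
    """Replace a 'systolic/diastolic' string column with two integer columns."""
    # "159/105" -> two string columns (0 and 1), then cast both to int
    parts = df[column].str.split("/", expand=True).astype(int)
    out = df.drop(columns=[column])
    out["Systolic"] = parts[0]
    out["Diastolic"] = parts[1]
    return out


# Made-up stand-in rows; the real call would pass df_heart_risk_train
toy = pd.DataFrame({"Blood Pressure": ["159/105", "120/80"]})
bp = split_blood_pressure(toy)
print(bp)
```

The same function would have to be applied to both the train and the test frame so that they keep identical columns.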

In summary, the data is of good quality, with no missing or obviously invalid values and no duplicates. However, given the probably artificial nature of the data, not all variables are expected to be associated with the target variable in the modeling, as many of them are uniformly distributed. 
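
Whether a given numerical feature is associated with the target at all can be checked quickly before modeling. The sketch below ranks numerical features by their Pearson correlation with a 0/1 target. It assumes the target is encoded as 0/1, the column name `Heart Attack Risk` is a guess, and `target_correlations` and the toy values are hypothetical.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numerical feature with a 0/1 target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    # Strongest associations first, regardless of sign
    return corr.sort_values(key=abs, ascending=False)


# Made-up data: one feature tracks the target, the other does not
toy = pd.DataFrame(
    {
        "Heart Attack Risk": [0, 0, 1, 1],
        "Informative": [1.0, 2.0, 3.0, 4.0],
        "Uninformative": [5.0, 6.0, 6.0, 5.0],
    }
)
ranking = target_correlations(toy, target="Heart Attack Risk")
print(ranking)
```

Correlations close to zero across the board would be consistent with the suspicion that many of the uniformly distributed variables carry little signal, although a linear correlation cannot rule out non-linear effects.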




In [None]:
###### EXPLORE DATA ######
import ydata_profiling  # importing it registers DataFrame.profile_report()

# df_heart_risk_test.columns
df_heart_risk_train.profile_report()

# Mean and standard deviation of each numerical column
# for column in df_heart_risk_train.select_dtypes('number').columns:
#     mean = df_heart_risk_train[column].mean()
#     std = df_heart_risk_train[column].std()
#     print(f"For {column} the mean is {mean} and the std is {std}.")


# print(df_heart_risk_test.describe())
# train data contains 7010 entries and 26 columns (25+1)
# there is no missing data!
# mean age is 54 years, 18-90 years, 70% men. The sample has high cholesterol levels (mean = 260, above 240 is considered high).
# Heart rate is around 75 (normal: 60-100).


# .shape, .columns, .dtypes
# .info(), .describe(), nunique(), .isna().sum()


# import ydata_profiling
# df_heart_risk_train.profile_report()


###### IMPUTE ######


###### MODELING ######
# pip install -U scikit-learn
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import cross_validate, train_test_split

# Ready X and y
# X = livecode_data[['GrLivArea']]
# y = livecode_data['SalePrice']
#
# # Split into Train/Test
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#
#
#
# # Instantiate model
# model = LinearRegression()
#
# # 5-Fold Cross validate model
# cv_results = cross_validate(model, X, y, cv=5)
# # Rule of thumb:  K = 5   or 10
# # Scores
# print(cv_results['test_score'])
# # Mean of scores
# cv_results['test_score'].mean()
#
# # check model learning curves
# import numpy as np
# from sklearn.model_selection import learning_curve
#
# train_sizes = [25,50,75,100,250,500,750,1000,1150]
#
# # Get train scores (R2), train sizes, and validation scores using `learning_curve`
# train_sizes, train_scores, test_scores = learning_curve(
#     estimator=LinearRegression(), X=X, y=y, train_sizes=train_sizes, cv=5)
#
# # Take the mean of cross-validated train scores and validation scores
# train_scores_mean = np.mean(train_scores, axis=1)
# test_scores_mean = np.mean(test_scores, axis=1)
#
# # plt.plot(train_sizes, train_scores_mean, label = 'Training score')
# # plt.plot(train_sizes, test_scores_mean, label = 'Test score')
# # plt.ylabel('r2 score', fontsize = 14)
# # plt.xlabel('Training set size', fontsize = 14)
# # plt.title('Learning curves', fontsize = 18, y = 1.03)
# # plt.legend()