In [1]:
%%capture
# move to src folder so we can import code
%cd ../src

In [2]:
from common.kaggle import download_competition_data
import config

In [3]:
download_competition_data(config.COMPETITION, config.INPUTS)

Downloading playground-series-s3e2.zip to /run/media/jcarnero/linux-data/kaggle/playground-series-s3e2/input


100%|██████████| 321k/321k [00:00<00:00, 2.75MB/s]


['test.csv', 'train.csv', 'sample_submission.csv']






In this competition we will use data generated by a deep learning model trained on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). We can expect the relationships between variables to be similar to those in the original dataset, but not exactly the same.

We will predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. The columns at our disposal, including the target `stroke`, are:

1. __id__: unique identifier
1. __gender__: "Male", "Female" or "Other"
1. __age__: age of the patient
1. __hypertension__: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
1. __heart_disease__: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
1. __ever_married__: "No" or "Yes"
1. __work_type__: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
1. __Residence_type__: "Rural" or "Urban"
1. __avg_glucose_level__: average glucose level in blood
1. __bmi__: body mass index
1. __smoking_status__: "formerly smoked", "never smoked", "smokes" or "Unknown"*
1. __stroke__: 1 if the patient had a stroke or 0 if not

\*Note: "Unknown" in `smoking_status` means that the information is unavailable for this patient.

The evaluation metric is the area under the ROC curve (AUC).
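To make the metric concrete, here is a minimal sketch of what AUC measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Because AUC depends only on how the scores rank the patients, a submission can contain probabilities rather than hard 0/1 labels.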

# Let's take a look at the data

In [4]:
from pathlib import Path
import pandas as pd

In [5]:
df = pd.read_csv(config.TRAIN_DATA)
df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 28.0 | 0 | 0 | Yes | Private | Urban | 79.53 | 31.1 | never smoked | 0 |
| 1 | 1 | Male | 33.0 | 0 | 0 | Yes | Private | Rural | 78.44 | 23.9 | formerly smoked | 0 |
| 2 | 2 | Female | 42.0 | 0 | 0 | Yes | Private | Rural | 103.0 | 40.3 | Unknown | 0 |
| 3 | 3 | Male | 56.0 | 0 | 0 | Yes | Private | Urban | 64.87 | 28.8 | never smoked | 0 |
| 4 | 4 | Female | 24.0 | 0 | 0 | No | Private | Rural | 73.36 | 28.8 | never smoked | 0 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15304 entries, 0 to 15303
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 15304 non-null  int64  
 1   gender             15304 non-null  object 
 2   age                15304 non-null  float64
 3   hypertension       15304 non-null  int64  
 4   heart_disease      15304 non-null  int64  
 5   ever_married       15304 non-null  object 
 6   work_type          15304 non-null  object 
 7   Residence_type     15304 non-null  object 
 8   avg_glucose_level  15304 non-null  float64
 9   bmi                15304 non-null  float64
 10  smoking_status     15304 non-null  object 
 11  stroke             15304 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 1.4+ MB


In [7]:
df.nunique()

id                   15304
gender                   3
age                    106
hypertension             2
heart_disease            2
ever_married             2
work_type                5
Residence_type           2
avg_glucose_level     3740
bmi                    407
smoking_status           4
stroke                   2
dtype: int64

In [8]:
len(df[df.duplicated()])

0
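Since `id` is unique for every row, a full-row `duplicated()` check cannot report a duplicate even when two patients share every other value. One possible follow-up, not part of the original notebook, is to repeat the check with `id` dropped. The sketch below uses a small made-up frame (`toy` and its values are assumptions) so the difference is visible:

```python
import pandas as pd

# Toy frame with illustrative values, not competition data:
# rows 0 and 2 are identical except for their id
toy = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "age": [28.0, 33.0, 28.0],
        "bmi": [31.1, 23.9, 31.1],
        "stroke": [0, 0, 0],
    }
)

# Full-row check: the unique id hides the repeated patient
full_row_duplicates = int(toy.duplicated().sum())

# Same check on the remaining columns only
feature_duplicates = int(toy.drop(columns="id").duplicated().sum())

print(full_row_duplicates, feature_duplicates)
```

On the real training data the equivalent call would be `df.drop(columns="id").duplicated().sum()`.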

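`age`, `avg_glucose_level` and `bmi` are numerical. The rest are categorical; `hypertension`, `heart_disease` and the target `stroke` are binary flags stored as integers. `id` can be ignored, as it is unique for every row and so should carry no predictive signal.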
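The grouping described here can be written down as explicit column lists for later preprocessing. This is only a sketch: the names `TARGET`, `NUMERICAL`, `CATEGORICAL` and `sample` are hypothetical, and `sample` simply re-types the five rows shown by `df.head()` so the block runs on its own.

```python
import pandas as pd

# The five rows shown by df.head(), re-typed so the example is self-contained
sample = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "gender": ["Male", "Male", "Female", "Male", "Female"],
        "age": [28.0, 33.0, 42.0, 56.0, 24.0],
        "hypertension": [0, 0, 0, 0, 0],
        "heart_disease": [0, 0, 0, 0, 0],
        "ever_married": ["Yes", "Yes", "Yes", "Yes", "No"],
        "work_type": ["Private"] * 5,
        "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
        "avg_glucose_level": [79.53, 78.44, 103.0, 64.87, 73.36],
        "bmi": [31.1, 23.9, 40.3, 28.8, 28.8],
        "smoking_status": [
            "never smoked", "formerly smoked", "Unknown",
            "never smoked", "never smoked",
        ],
        "stroke": [0, 0, 0, 0, 0],
    }
)

TARGET = "stroke"
NUMERICAL = ["age", "avg_glucose_level", "bmi"]
# Everything else is categorical, except the identifier and the target
CATEGORICAL = [c for c in sample.columns if c not in NUMERICAL + ["id", TARGET]]

# Feature matrix without id and target; categoricals get pandas' category dtype
features = sample[NUMERICAL + CATEGORICAL].astype({c: "category" for c in CATEGORICAL})
print(CATEGORICAL)
```

Keeping the lists in one place makes it easy to apply different preprocessing to each group later on.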