# Basic Data Cleaning

This short notebook cleans the data and creates dummy variables for the categorical variables in the dataset. The cleaned data is then stored in a new CSV file called "stroke_clean.csv".

In [2]:
# Importing pandas for data cleaning
import pandas as pd

# Loading in the dataset
stroke = pd.read_csv('stroke.csv')

# View the first 5 rows of the dataset
stroke.head()

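|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |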


There is already an <b>id</b> column, so there is no need for a separate index. Hence we make the <b>id</b> column the index of the dataframe.

In [3]:
# Make the id column the index of the DataFrame
stroke = stroke.set_index('id')

# Make all the column names lowercase
stroke.columns = stroke.columns.str.lower()

## Missing Values

The following cell shows that there are some missing values in the <b>bmi</b> column. Since these make up only $201/5110 \approx 0.04$ (about 4%) of the observations, we remove the affected rows completely.

In [4]:
# Find out where there are missing values
print(stroke.isnull().any())

# Determine the number of missing values
missing_values = stroke['bmi'].isnull().sum()
print(f'\nThere are {missing_values} missing values in the bmi column\n')

# Drop the rows with the missing values
stroke.dropna(inplace=True)

# Inspect the data types of the columns
# (info() prints its report itself and returns None, so no print() is needed)
stroke.info()

gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool

There are 201 missing values in the bmi column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4909 entries, 9046 to 44679
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4909 non-null   object 
 1   age                4909 non-null   float64
 2   hypertension       4909 non-null   int64  
 3   heart_disease      4909 non-null   int64  
 4   ever_married       4909 non-null   object 
 5   work_type          4909 non-null   object 
 6   residence_type     4909 non-null   object 
 7   avg_glucose_level  4909 non-null   float64
 8   bmi                4909 non-null   float64
 9 

We see from the information above that we now have no missing values left. 
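
As a small illustration of the missing-value handling in this section, the following sketch computes the share of missing values per column and then drops the incomplete rows. The DataFrame `toy` and its values are made up for illustration and are not taken from stroke.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from stroke.csv): one of four bmi values is missing
toy = pd.DataFrame({
    'age': [67.0, 61.0, 80.0, 49.0],
    'bmi': [36.6, np.nan, 32.5, 34.4],
})

# Share of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()
print(len(toy), '->', len(toy_clean))
```

Looking at the share rather than the raw count makes it easier to judge whether dropping rows is acceptable for a given column.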

## Dummy Variable Encoding

We also see that the columns <b>gender</b>, <b>ever_married</b>, <b>work_type</b>, <b>residence_type</b>, and <b>smoking_status</b> are of data type <i>object</i>. Let us investigate these columns to see if it is reasonable to convert them to a numeric format.

In [5]:
# Check the values present in the gender column
print(stroke['gender'].value_counts())

# Since there is only one "Other", it cannot tell us anything statistically significant.
# We hence remove this row from the dataset.
stroke.drop(stroke.index[stroke['gender'] == 'Other'], inplace=True)

# Do a dummy encoding of the genders
# (assigning the result back avoids calling replace(..., inplace=True) on a column selection)
stroke['gender'] = stroke['gender'].map({'Female': 1, 'Male': 0})

In [6]:
# Check the values present in the ever_married column
print(stroke['ever_married'].value_counts())

# Do a dummy encoding of the ever_married column
stroke['ever_married'] = stroke['ever_married'].map({'Yes': 1, 'No': 0})

In [7]:
# Check the values present in the work_type column
print(stroke['work_type'].value_counts())

# Since there are 5 categories, we dummy encode them into new columns
# (dtype=int gives 0/1 columns instead of booleans on newer pandas versions)
dummy_variables = pd.get_dummies(stroke['work_type'], dtype=int)
del stroke['work_type']
stroke['govt_job'] = dummy_variables['Govt_job']
stroke['never_worked'] = dummy_variables['Never_worked']
stroke['private'] = dummy_variables['Private']
stroke['self-employed'] = dummy_variables['Self-employed']
stroke['children'] = dummy_variables['children']
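
The column-by-column encoding above can also be sketched in a single call. This is only an alternative illustration, not the approach used in this notebook: the DataFrame `toy` is hypothetical, and the empty `prefix` and `prefix_sep` are chosen so that the new columns are named after the categories alone.

```python
import pandas as pd

# Hypothetical toy data using work_type categories from the notebook
toy = pd.DataFrame({
    'age': [67.0, 12.0, 45.0],
    'work_type': ['Private', 'children', 'Govt_job'],
})

# Replace work_type by one 0/1 column per category present in the data
encoded = pd.get_dummies(
    toy, columns=['work_type'], prefix='', prefix_sep='', dtype=int
)

# Normalize the new column names to lowercase
encoded.columns = encoded.columns.str.lower()
print(encoded)
```

Note that `pd.get_dummies` only creates columns for categories that actually occur, so a category missing from the data (here, for example, "Never_worked") gets no column.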

In [8]:
# Check the values present in the residence_type column
print(stroke['residence_type'].value_counts())

# Do a dummy encoding of the residence_type column
stroke['residence_type'] = stroke['residence_type'].map({'Urban': 1, 'Rural': 0})

In [9]:
# Check the values present in the smoking_status column
print(stroke['smoking_status'].value_counts())

# There are a lot of "Unknown" entries, so dropping all of them is probably not a good idea.
# For now we dummy encode them and decide later how to handle them.
# (dtype=int gives 0/1 columns instead of booleans on newer pandas versions)
dummy_variables = pd.get_dummies(stroke['smoking_status'], dtype=int)
del stroke['smoking_status']
stroke['never_smoked'] = dummy_variables['never smoked']
stroke['formerly_smoked'] = dummy_variables['formerly smoked']
stroke['smokes'] = dummy_variables['smokes']
stroke['unknown_smoker'] = dummy_variables['Unknown']
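
A quick sanity check on a dummy encoding like the one above is that every row belongs to exactly one category. The sketch below shows this on a hypothetical Series `status` with made-up values; it also reports the share of each category, which may help when deciding later what to do with "Unknown".

```python
import pandas as pd

# Hypothetical toy data using the smoking_status categories from the notebook
status = pd.Series(
    ['never smoked', 'Unknown', 'smokes', 'formerly smoked', 'Unknown']
)

dummies = pd.get_dummies(status, dtype=int)

# Every row should have exactly one dummy column set to 1
rows_ok = bool((dummies.sum(axis=1) == 1).all())

# Share of rows per category, including "Unknown"
category_share = dummies.mean()
print(rows_ok)
print(category_share)
```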

## Saving the Cleaned Data

We now save the cleaned data to a CSV file. 

In [6]:
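# Reordering the columns so that the stroke column is last.
# Selecting by name avoids relying on a hard-coded column position.
labels = [column for column in stroke.columns if column != 'stroke']
labels.append('stroke')
stroke = stroke.reindex(columns=labels)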

# Saving the data.
stroke.to_csv('stroke_clean.csv')

# One should load the data with the command:
# stroke_clean = pd.read_csv('stroke_clean.csv', index_col='id')
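
The save-and-load round trip can be sketched as follows. To keep the example self-contained it writes to an in-memory buffer rather than to "stroke_clean.csv", and the DataFrame `toy` is a hypothetical stand-in for the cleaned data.

```python
import io

import pandas as pd

# Hypothetical toy frame indexed by id, mimicking the cleaned data
toy = pd.DataFrame(
    {'age': [67.0, 61.0], 'stroke': [1, 1]},
    index=pd.Index([9046, 51676], name='id'),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Reading with index_col='id' restores the id column as the index
restored = pd.read_csv(buffer, index_col='id')
print(restored)
```

Without `index_col='id'`, the id values would be read back as an ordinary column and pandas would add a fresh integer index.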