# Project 4
## Part 1: ETL Pipeline
#### We have sourced over 5,000 records of demographic and medical data for an ML model that predicts stroke risk from indicators such as gender, age, history of heart disease, and glucose level. The data must be cleaned and transformed before it can be used by the model, as shown below.

In [None]:
# Load the CSV
import pandas as pd
from pathlib import Path
# Read stroke data
file_path = Path("resources/healthcare-dataset-stroke-data.csv")
strokes_df = pd.read_csv(file_path)

# Display data
strokes_df.head()

In [None]:
strokes_df.info()
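
`info()` reports non-null counts per column; a per-column count of missing values can be easier to scan. The sketch below uses a small made-up CSV string (not the real file) and assumes missing BMI values are written as `N/A`, which `pd.read_csv` treats as missing by default.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; the real data is read from disk above.
csv_text = "id,age,bmi\n1,67,36.6\n2,61,N/A\n3,80,32.5\n"

# read_csv converts the literal string 'N/A' to NaN by default.
toy_df = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `strokes_df` should show which columns need imputation.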

In [None]:
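# Fill missing ('N/A') values in 'bmi' with the column mean, rounded to one decimal place
mean_bmi = round(strokes_df['bmi'].mean(), 1)

# Assign the result back: fillna(..., inplace=True) on a column selection warns in pandas 2.x
# and may not modify strokes_df in later versions
strokes_df['bmi'] = strokes_df['bmi'].fillna(mean_bmi)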

strokes_df.head()
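
As a quick check of the mean-fill step, here is a minimal sketch on a toy frame. The name `toy_df` and its values are made up for illustration and are not taken from the stroke dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for strokes_df; values are illustrative only.
toy_df = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, np.nan]})

# mean() skips NaN by default, so this averages only 20.0 and 30.0.
toy_mean_bmi = round(toy_df["bmi"].mean(), 1)

# Assign the filled column back to the frame.
toy_df["bmi"] = toy_df["bmi"].fillna(toy_mean_bmi)

print(toy_df["bmi"].tolist())  # [20.0, 25.0, 30.0, 25.0]
```

Assigning the filled column back, rather than filling in place on a column selection, keeps the result independent of how pandas handles copies and views.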

In [None]:
# Replace 'Unknown' smoking status with 'never smoked' for children aged 10 and under
child_unknown = (strokes_df['age'] <= 10) & (strokes_df['smoking_status'] == 'Unknown')
strokes_df.loc[child_unknown, 'smoking_status'] = 'never smoked'
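
The conditional replacement above can be sketched on a toy frame to confirm which rows change. The data and the name `toy_df` are hypothetical. Note that the `<= 10` comparison includes children who are exactly 10.

```python
import pandas as pd

# Toy frame standing in for strokes_df; values are illustrative only.
toy_df = pd.DataFrame({
    "age": [4.0, 10.0, 11.0, 52.0],
    "smoking_status": ["Unknown", "Unknown", "Unknown", "smokes"],
})

# Select rows for children aged 10 and under whose status is 'Unknown'.
child_unknown = (toy_df["age"] <= 10) & (toy_df["smoking_status"] == "Unknown")

# Overwrite only the selected rows.
toy_df.loc[child_unknown, "smoking_status"] = "never smoked"

print(toy_df["smoking_status"].tolist())
# ['never smoked', 'never smoked', 'Unknown', 'smokes']
```

The 11-year-old keeps `'Unknown'`, and statuses other than `'Unknown'` are left untouched regardless of age.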

In [None]:
# Drop 'id'
strokes_df.drop(columns='id', inplace=True)
strokes_df.head()

In [None]:
# Reset to a default integer index, discarding the old one
strokes_df = strokes_df.reset_index(drop=True)
strokes_df

In [None]:
# Standardize capitalization and spacing for column names
strokes_df.rename(columns={'gender': 'Gender', 'age': 'Age', 'hypertension':'Hypertension', 
                           'heart_disease':'Heart Disease', 'ever_married':'Ever Married', 'work_type':'Work Type',
                           'Residence_type': 'Residence Type', 'avg_glucose_level': 'Average Glucose Level',
                           'bmi': 'BMI', 'smoking_status': 'Smoking Status', 'stroke': 'Stroke'}, inplace=True)

strokes_df.columns

In [None]:
# Lowercase 'Unknown' so it matches the capitalization of the other smoking status values
strokes_df['Smoking Status'] = strokes_df['Smoking Status'].replace('Unknown', 'unknown')
strokes_df.head(10)

In [None]:
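# Save the cleaned data; index=False avoids writing the row index as an extra unnamed column
strokes_df.to_csv('data/stroke_data.csv', index=False)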
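
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy frame, writing to in-memory buffers instead of disk. The frame and variable names are made up for illustration.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data; values are illustrative only.
toy_df = pd.DataFrame({"Age": [67.0, 61.0], "Stroke": [1, 1]})

# Write once with the default settings and once with index=False.
with_index = io.StringIO()
toy_df.to_csv(with_index)
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)

# Rewind the buffers and read both versions back.
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Age', 'Stroke']
print(cols_without)  # ['Age', 'Stroke']
```

If the model code reads the CSV with a plain `pd.read_csv`, writing without the index keeps that stray column out of the feature set.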