![image](https://analyticsindiamag.com/wp-content/uploads/2020/04/Screenshot-2020-04-15-at-10.08.12-AM.png)

## Business Problem Understanding

- According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.

- This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

- Each row in the data provides relevant information about one patient.

## Attribute Information
- 1) id: unique identifier
- 2) gender: "Male", "Female" or "Other"
- 3) age: age of the patient
- 4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- 5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- 6) ever_married: "No" or "Yes"
- 7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- 8) Residence_type: "Rural" or "Urban"
- 9) avg_glucose_level: average glucose level in blood
- 10) bmi: body mass index
- 11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- 12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



## Data Collection/Data Import

**The dataset is available at the Google Drive link below.**
**Use gdown to download it directly in the Colab environment.**

!gdown https://drive.google.com/uc?id=1vs0cmeKYeht_d07C1HvHFUIxD-IfEACL


In [5]:
## write your code here


Downloading...
From: https://drive.google.com/uc?id=1vs0cmeKYeht_d07C1HvHFUIxD-IfEACL
To: /media/biswash/New Volume/fm 024/video/Full/Module 2; Data Wrangling/Unit 3 ; Class Recording Nepali/Chapter 1; Data Wrangling with Pandas/Resources/stroke_data.csv
100%|█████████████████████████████████████████| 484k/484k [00:03<00:00, 129kB/s]


## Importing Necessary libraries

In [3]:
import os
import gdown
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



## Importing Data From CSV File Using Pandas

In [None]:
## write your code here
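
A minimal sketch of the import step. It assumes the downloaded file is saved as `stroke_data.csv` in the working directory (the filename shown in the download log above). So that the example runs without the file, it reads a few made-up rows from an in-memory string instead.

```python
import io
import pandas as pd

# Made-up rows in a layout similar to the dataset, used here only so the
# example runs without the downloaded file.
sample_csv = io.StringIO(
    "id,gender,age,hypertension,stroke\n"
    "1,Male,67,0,1\n"
    "2,Female,61,0,1\n"
    "3,Female,49,1,0\n"
)
df = pd.read_csv(sample_csv)

# With the real file (assumed name and location), the call would be:
# df = pd.read_csv("stroke_data.csv")
print(df)
```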


## Data Understanding:

### Print the first five rows of the pandas dataframe

In [None]:
## write your code here


## Print the last five rows of the pandas dataframe

In [None]:
## write your code here


## What is the shape of the dataset?

In [None]:
## write your code here


## What are the names of the columns in the dataframe?

In [None]:
## write your code here


### What are the datatypes of each feature in the dataset?

In [None]:
## write your code here
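
The data-understanding questions above (first rows, last rows, shape, column names, datatypes) can be sketched as follows. The DataFrame here is a made-up stand-in for the stroke data; with the real data, `df` would be the frame returned by `pd.read_csv`.

```python
import pandas as pd

# Toy stand-in for the stroke DataFrame (values are made up).
df = pd.DataFrame({
    "id": range(1, 9),
    "gender": ["Male", "Female"] * 4,
    "age": [67.0, 61.0, 80.0, 49.0, 79.0, 81.0, 74.0, 69.0],
    "stroke": [1, 1, 1, 0, 0, 1, 0, 0],
})

first_five = df.head()               # first five rows
last_five = df.tail()                # last five rows
n_rows, n_cols = df.shape            # (number of rows, number of columns)
column_names = df.columns.tolist()   # column labels as a plain list
column_types = df.dtypes             # one dtype per column

print(first_five, last_five, sep="\n\n")
print(n_rows, n_cols)
print(column_names)
print(column_types)
```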


## Descriptive Statistics

Descriptive statistics involve a set of summary measures that provide a snapshot of the dataset's characteristics. These measures help us understand the distribution, central tendency, and variability within the data.

- Mean: The average value of the data.
- Median: The middle value when the data is sorted.
- Mode: The most frequently occurring value.
- Range: The difference between the maximum and minimum values.
- Standard Deviation: A measure of how far values typically lie from the mean, expressed in the same units as the data.

These statistics provide a preliminary understanding of the dataset, which is valuable for subsequent analysis and decision-making.



### How to see the descriptive statistics of a dataset?

In [None]:
## write your code here
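
One possible answer, sketched on made-up data: `describe()` gives the summary table in one call, and the individual measures listed above are available as separate methods.

```python
import pandas as pd

# Toy data (made up) with one numerical and one categorical column.
df = pd.DataFrame({
    "age": [20.0, 35.0, 35.0, 50.0, 60.0],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

summary = df.describe()                    # numerical columns only by default
summary_all = df.describe(include="all")   # also summarizes categorical columns

mean_age = df["age"].mean()
median_age = df["age"].median()
mode_age = df["age"].mode()[0]             # mode() returns a Series; take the first value
range_age = df["age"].max() - df["age"].min()
std_age = df["age"].std()                  # sample standard deviation (ddof=1)

print(summary)
print(mean_age, median_age, mode_age, range_age, std_age)
```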


## How to select gender column from the pandas dataframe?

In [None]:
## write your code here


### How to select multiple columns : age, gender and bmi?

In [None]:
## write your code here
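
A short sketch of column selection on a made-up frame: single brackets with one name return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Toy data (made up) with a few of the dataset's columns.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 49.0],
    "bmi": [36.6, 28.1, 34.4],
    "stroke": [1, 1, 0],
})

gender = df["gender"]                    # one column name -> Series
subset = df[["age", "gender", "bmi"]]    # list of names -> DataFrame, in the given order

print(type(gender).__name__, type(subset).__name__)
```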


## How to select the 7th row of the pandas dataframe?


In [None]:
## write your code here


## How to select the 4th column from the pandas dataframe?


In [None]:
## write your code here


## How to select 20th to 30th row and 3rd to 7th column in pandas dataframe?

In [None]:
## write your code here


## How to select 3rd and 100th row & 4th and 10th column in a pandas dataframe?

In [None]:
## write your code here
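
The position-based questions above can be sketched with `iloc`, which counts from 0 and excludes the end of a slice. The frame below reuses the 12 column names from the attribute list, but its values are placeholder integers, not patient data.

```python
import numpy as np
import pandas as pd

# 100 rows x 12 columns of placeholder numbers under the dataset's column names.
columns = ["id", "gender", "age", "hypertension", "heart_disease",
           "ever_married", "work_type", "Residence_type",
           "avg_glucose_level", "bmi", "smoking_status", "stroke"]
df = pd.DataFrame(np.arange(1200).reshape(100, 12), columns=columns)

row_7 = df.iloc[6]                   # 7th row -> position 6
col_4 = df.iloc[:, 3]                # 4th column -> position 3
block = df.iloc[19:30, 2:7]          # rows 20 to 30, columns 3 to 7 (slice end is exclusive)
picked = df.iloc[[2, 99], [3, 9]]    # 3rd and 100th rows, 4th and 10th columns

print(block.shape, picked.shape)
```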


## Select only those rows with gender 'Male'

In [None]:
## write your code here


## Select all those rows which have avg_glucose_level greater than 100 and columns gender, age, bmi and avg_glucose_level

In [None]:
## write your code here


## Select all females who are older than 50

In [None]:
## write your code here
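
The three filtering exercises above can be sketched with boolean masks. The data is made up; the column names follow the attribute list.

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "age": [67.0, 61.0, 49.0, 42.0, 79.0],
    "avg_glucose_level": [228.69, 202.21, 95.12, 83.41, 174.12],
    "bmi": [36.6, 28.1, 34.4, 25.3, 24.0],
})

# Rows where gender is 'Male'.
males = df[df["gender"] == "Male"]

# .loc takes a row mask and a list of columns in one step.
high_glucose = df.loc[df["avg_glucose_level"] > 100,
                      ["gender", "age", "bmi", "avg_glucose_level"]]

# Combine conditions with & and wrap each one in parentheses.
females_over_50 = df[(df["gender"] == "Female") & (df["age"] > 50)]

print(len(males), len(high_glucose), len(females_over_50))
```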


## Data Wrangling

- Data Inspection
  - Checking Duplicate Entries
  - Checking Missing Values
  - Checking standard format
  - Checking data entry typos and errors
- Data Cleaning
  - Removing Duplicates
  - Handling Missing Values
  - Standardizing Formats
  - Correcting Errors
- Data Transformation
  - Feature Engineering
  - Normalization/Scaling
  - One-Hot Encoding
- Data Integration
- Data Reduction
- Data Formatting
- Data Enrichment
- Data Validation
- Documentation
- Exploratory Data Analysis (EDA)


### Checking Duplicate Entries
- Check whether duplicate entries are present.
- If they are, find how many duplicate entries there are.

In [None]:
## write your code here


In [None]:
## write your code here


## Remove Duplicate Entries

- Remove all rows that are duplicates of an earlier row

In [None]:
## write your code here


In [None]:
## write your code here
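
Both the duplicate check and the removal can be sketched on a made-up frame as follows. Keeping the first occurrence and resetting the index are choices made for this sketch, not requirements of the exercise.

```python
import pandas as pd

# Toy data (made up): rows 2 and 4 repeat the rows just before them.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "age": [67.0, 61.0, 61.0, 49.0, 49.0],
})

has_duplicates = df.duplicated().any()   # True if any row repeats an earlier one
n_duplicates = df.duplicated().sum()     # number of repeated rows

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(has_duplicates, n_duplicates, len(deduped))
```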


## Checking Missing Values
- Find missing (NaN) values in the dataset
- Find the columns that have missing values, along with how many values are missing in each

In [None]:
## write your code here
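
A minimal sketch of counting missing values per column, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data (made up) with a few gaps.
df = pd.DataFrame({
    "age": [67.0, 61.0, 49.0, 42.0],
    "ever_married": ["Yes", None, "No", "Yes"],
    "bmi": [36.6, np.nan, np.nan, 25.3],
})

missing_counts = df.isna().sum()                             # missing values per column
columns_with_missing = missing_counts[missing_counts > 0]    # only the affected columns

print(columns_with_missing)
```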


## Visualize missing values using heatmaps

In [None]:
## write your code here
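
One common way to draw such a heatmap is to pass the True/False table from `isna()` to seaborn. The data below is made up, and the title is only illustrative.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made up) with a few gaps.
df = pd.DataFrame({
    "age": [67.0, 61.0, 49.0, 42.0],
    "avg_glucose_level": [228.69, np.nan, 95.12, 83.41],
    "bmi": [36.6, np.nan, np.nan, 25.3],
})

# Each missing cell shows up in a contrasting colour; rows are patients, columns are features.
ax = sns.heatmap(df.isna(), cbar=False)
ax.set_title("Missing values by column")
plt.show()
```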


## Handling Missing Values

- Handle missing values in the ever_married, avg_glucose_level and weight_in_kg columns

In [None]:
## write your code here


In [None]:
## write your code here


In [None]:
## write your code here



In [None]:
## write your code here
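
A sketch of one common imputation strategy, not necessarily the one intended for this exercise: fill the categorical column with its most frequent value and the numerical columns with their median. The values are made up; `weight_in_kg` is assumed to exist in the working DataFrame because the exercise text mentions it.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one gap per column.
df = pd.DataFrame({
    "ever_married": ["Yes", None, "No", "Yes"],
    "avg_glucose_level": [228.69, np.nan, 95.12, 83.41],
    "weight_in_kg": [80.0, 65.0, np.nan, 72.0],
})

# Categorical column: fill with the most frequent value (mode).
df["ever_married"] = df["ever_married"].fillna(df["ever_married"].mode()[0])

# Numerical columns: fill with the median, which is less sensitive to outliers than the mean.
for col in ["avg_glucose_level", "weight_in_kg"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```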



## Checking missing values for weight_in_kg and bmi columns

In [None]:
## write your code here


In [None]:
## write your code here


## Check the relationship between bmi and height_in_m to see whether it can be used to fill missing bmi values (use a scatterplot to inspect the relationship visually)


In [None]:
## write your code here


In [None]:
## write your code here


In [None]:
## write your code here


In [None]:
## write your code here
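
If the working DataFrame holds both `weight_in_kg` and `height_in_m` (an assumption based on the column names mentioned above), missing `bmi` values can be recomputed from the definition BMI = weight / height². A hedged sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy data (made up); only the first patient has a recorded bmi.
df = pd.DataFrame({
    "weight_in_kg": [80.0, 65.0, 72.0],
    "height_in_m": [1.80, 1.60, 1.75],
    "bmi": [24.7, np.nan, np.nan],
})

# BMI is defined as weight in kilograms divided by height in metres squared.
computed_bmi = df["weight_in_kg"] / df["height_in_m"] ** 2

# Only the missing entries are replaced; recorded values are kept as they are.
df["bmi"] = df["bmi"].fillna(computed_bmi.round(1))

print(df)
```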


## Exploratory Data Analysis

- Univariate Analysis: Studying one variable at a time
- Bivariate Analysis: Studying two variables at a time
- Multivariate Analysis: Studying multiple variables at a time
- We need to investigate each feature properly

In [None]:
## write your code here


## id feature

In [None]:
## write your code here


## gender

In [None]:
## write your code here (check dtypes first)


In [None]:
## write your code here


In [None]:
## write your code here for calculating frequency count of gender column


In [None]:
## write your code here
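
A minimal sketch of the frequency count for a categorical column, using made-up values:

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Other", "Female"]})

counts = df["gender"].value_counts()                 # absolute frequencies, most frequent first
shares = df["gender"].value_counts(normalize=True)   # relative frequencies (sum to 1)

print(counts)
print(shares)
```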


## Create Piechart Or Bargraph For Univariate Analysis Of Categorical Feature

In [None]:
## write your code here
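
A possible sketch that draws both chart types side by side from the frequency counts, using pandas' built-in plotting on made-up data. The titles and figure size are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (made up).
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Other", "Female", "Male"]})
counts = df["gender"].value_counts()

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=ax_bar, title="gender (counts)")
counts.plot(kind="pie", ax=ax_pie, autopct="%1.1f%%", title="gender (share)")
ax_pie.set_ylabel("")   # drop the default y-axis label on the pie chart
plt.tight_layout()
plt.show()
```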


## smoking status

In [None]:
## write your code here


In [None]:
## write your code here


In [None]:
## write your code here


In [None]:
## write your code here


## Plot figure (Barchart)

In [None]:
## write your code here (use seaborn)
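
With seaborn, `countplot` computes the frequencies and draws the bars in one call. The sketch below uses made-up values; ordering the bars by frequency is optional.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made up).
df = pd.DataFrame({"smoking_status": [
    "never smoked", "smokes", "Unknown", "never smoked",
    "formerly smoked", "never smoked", "Unknown",
]})

# Order the bars from most to least frequent.
order = df["smoking_status"].value_counts().index
ax = sns.countplot(data=df, x="smoking_status", order=order)
ax.set_title("Smoking status")
plt.show()
```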


## hypertension

In [None]:
## write your code here


In [None]:
## write your code here (show graph)


## stroke feature

In [None]:
## write your code here


In [None]:
## write your code here(piechart)


# Bivariate Analysis
## Are patients with hypertension more likely to have a stroke? (`pd.crosstab` function)

In [None]:
## write your code here
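
A sketch of the cross-tabulation on made-up data. The name `stroke_hypertension_df` matches the variable used in the chi-square hint; `normalize="index"` is added here to express each row as a stroke rate per group.

```python
import pandas as pd

# Toy data (made up): 6 patients without hypertension, 4 with.
df = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Counts of stroke outcomes within each hypertension group.
stroke_hypertension_df = pd.crosstab(df["hypertension"], df["stroke"])

# Each row divided by its total: the share of stroke / no stroke per group.
stroke_rate = pd.crosstab(df["hypertension"], df["stroke"], normalize="index")

print(stroke_hypertension_df)
print(stroke_rate)
```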


## Hypothesis Testing (Chisquare test for Independence)


Hint: `chi2, p, dof, expected = chi2_contingency(stroke_hypertension_df)`, where `stroke_hypertension_df` is the contingency table of hypertension against stroke.

In [None]:
from scipy.stats import chi2_contingency

In [None]:
# Perform Chi-square test
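
A hedged sketch of the test on a made-up contingency table. H0 is that hypertension and stroke are independent. Note that for a 2x2 table SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = hypertension (0/1), columns = stroke (0/1).
stroke_hypertension_df = pd.DataFrame(
    [[90, 10], [30, 20]],
    index=pd.Index([0, 1], name="hypertension"),
    columns=pd.Index([0, 1], name="stroke"),
)

chi2, p, dof, expected = chi2_contingency(stroke_hypertension_df)
print("chi2:", chi2, "p-value:", p, "dof:", dof)
print("expected counts under independence:\n", expected)

alpha = 0.05  # significance level
if p < alpha:
    print("Reject H0: hypertension and stroke appear to be associated")
else:
    print("Fail to reject H0: no evidence of an association")
```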



## Group Barplot

In [None]:
## write your code here


In [None]:
# Plot using Seaborn
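
One way to build a grouped bar plot is to reshape the cross-tabulation into long format (one row per group combination) and pass it to `sns.barplot` with `hue`. The data is made up, and casting `stroke` to string is a choice made here so seaborn treats it as a category.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made up).
df = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Wide table of counts -> long format with one row per (hypertension, stroke) pair.
counts_long = (
    pd.crosstab(df["hypertension"], df["stroke"])
    .reset_index()
    .melt(id_vars="hypertension", var_name="stroke", value_name="count")
)
counts_long["stroke"] = counts_long["stroke"].astype(str)

ax = sns.barplot(data=counts_long, x="hypertension", y="count", hue="stroke")
ax.set_title("Stroke counts by hypertension status")
plt.show()
```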



## heart disease

In [None]:
## write your code here


In [None]:
## write your code here


## Hypothesis Testing (Chisquare test for Independence)


In [None]:
# Perform Chi-square test



## Group Bar plot

In [None]:
## write your code here with long format table


In [None]:
## write your code here


## Numerical Features

In [None]:
# select numerical features
## write your code here
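
A minimal sketch using `select_dtypes` on made-up data. Binary flags such as `hypertension` are stored as integers, so they are selected as well and may need to be set aside by hand.

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "gender": ["Male", "Female"],
    "age": [67.0, 61.0],
    "hypertension": [0, 1],
    "bmi": [36.6, 28.1],
})

numerical_df = df.select_dtypes(include="number")   # keeps int and float columns
numerical_columns = numerical_df.columns.tolist()

print(numerical_columns)
```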


## age column

In [None]:
## write your code here (for histogram)


In [None]:
## write your code here (for kde plot)


In [None]:
## write your code here (for outlier analysis using boxplot)


## Bmi column

In [None]:
## write your code here (for histogram)


In [None]:
## write your code here (for kde plot)

In [None]:
## write your code here (for boxplot outlier analysis)


### Hypothesis Test For Normality


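from scipy.stats import kstest

# Perform Kolmogorov-Smirnov test
# kstest(..., 'norm') compares against the standard normal N(0, 1), so the
# raw bmi values are standardized first (missing values are dropped).
# Because mean and std are estimated from the same sample, the p-value is approximate.
bmi = final_df['bmi'].dropna()
statistic, pvalue = kstest((bmi - bmi.mean()) / bmi.std(), 'norm')

# Print the result
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", pvalue)

# Interpret the results
alpha = 0.05  # Significance level

if pvalue > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")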


In [None]:
from scipy.stats import kstest, shapiro
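
The import above also brings in `shapiro`, the Shapiro-Wilk test, which is another common normality check. A hedged sketch follows; synthetic right-skewed data stands in for the `bmi` column, so the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic, clearly non-normal sample standing in for a real column.
rng = np.random.default_rng(42)
skewed_sample = rng.exponential(scale=5.0, size=500)

statistic, pvalue = shapiro(skewed_sample)
print("Shapiro-Wilk statistic:", statistic)
print("p-value:", pvalue)

alpha = 0.05  # significance level
looks_gaussian = pvalue > alpha
if looks_gaussian:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")
```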

## Scatterplots

In [None]:
## write your code here
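
A minimal scatterplot sketch with seaborn on made-up values. Plotting `age` against `avg_glucose_level` and colouring by `stroke` is one illustrative choice of variables.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made up).
df = pd.DataFrame({
    "age": [25.0, 40.0, 52.0, 61.0, 67.0, 79.0],
    "avg_glucose_level": [85.3, 92.1, 110.4, 202.2, 228.7, 174.1],
    "stroke": [0, 0, 0, 1, 1, 1],
})

ax = sns.scatterplot(data=df, x="age", y="avg_glucose_level", hue="stroke")
ax.set_title("age vs avg_glucose_level")
plt.show()
```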


## Correlation Plots and Heatmaps

In [None]:
## write your code here
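
A sketch of a correlation heatmap on made-up data. Only numerical columns are kept before calling `corr()`, which computes Pearson correlation by default; the colour map and value range are presentation choices.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data (made up).
df = pd.DataFrame({
    "age": [25.0, 40.0, 52.0, 61.0, 67.0, 79.0],
    "avg_glucose_level": [85.3, 92.1, 110.4, 202.2, 228.7, 174.1],
    "bmi": [22.1, 27.5, 30.2, 28.1, 36.6, 24.0],
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Pairwise Pearson correlation between the numerical columns only.
corr = df.select_dtypes(include="number").corr()

ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
plt.show()
```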
