# Assignment Redi School Machine Learning Class
- Register on Kaggle
- Find Open Datasets and Machine Learning Projects | Kaggle
- Download a simple dataset (less than 5 MB) and tell a story with your data.
- Use Seaborn or Matplotlib.
- Each person gets 2 minutes in the next class to tell a story with their data.

### Install Packages

In [1]:
%pip install pandas numpy seaborn matplotlib plotly

Note: you may need to restart the kernel to use updated packages.




## Overview
This notebook explores a lung cancer survey dataset with the goal of uncovering how age, gender, smoking, alcohol consumption, and reported symptoms relate to a lung cancer diagnosis.
We'll use visualizations to tell the story of how these features relate to the occurrence of lung cancer.
The file loaded below, 'lung cancer survey.csv', holds survey responses indexed 0 to 308. Each row has 15 features (GENDER, AGE, and 13 two-valued items coded 1 or 2) plus the LUNG_CANCER label, along with a leftover index column, 'Unnamed: 0'. The dataset can also serve as a starting point for predictive machine-learning models.

### Import and Validate Packages

In [2]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Set style for better visuals
sns.set(style="whitegrid")

ImportError: DLL load failed while importing _c_internal_utils: The specified module could not be found.
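
The import step can also be validated explicitly, one package at a time. The following is a minimal sketch, not part of the original notebook: `check_packages` is a hypothetical helper that tries each import separately and reports the version, so a single broken package (such as the failing import above) is easy to spot.

```python
import importlib


def check_packages(names):
    """Return {name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Covers missing packages as well as "DLL load failed" errors
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


report = check_packages(["numpy", "pandas", "seaborn", "matplotlib"])
for name, version in report.items():
    print(f"{name}: {version if version else 'NOT IMPORTABLE'}")
```

If a package is reported as not importable, reinstalling it and restarting the kernel is usually the first thing to try.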

#### Load the Data and Display the First Few Rows

In [9]:
df = pd.read_csv('lung cancer survey.csv')
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


In [6]:
df.tail()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
304,F,56,1,1,1,2,2,2,1,1,2,2,2,2,1,YES
305,M,70,2,1,1,1,1,2,2,2,2,2,2,1,2,YES
306,M,58,2,1,1,1,1,1,2,2,2,2,1,1,2,YES
307,M,67,2,1,2,1,1,2,2,1,2,2,2,1,2,YES
308,M,62,1,1,1,2,1,2,2,2,2,1,1,2,1,YES


#### Explore and understand the Data
Let's start by inspecting the columns and general information about the dataset, such as the data types, presence of missing values, and summary statistics.

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

#### Handle Missing Values
It's important to check for any missing values, as they may skew our analysis or affect our visualizations.

In [None]:
df.isnull().sum()

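#### Find Duplicates
Duplicates can introduce bias in our analysis, so we list any duplicate rows here. The index column 'Unnamed: 0' is unique per row, so it is left out of the comparison.

In [None]:
# Flag rows whose answers repeat an earlier row, ignoring the index column
answers = df.drop(columns=['Unnamed: 0'], errors='ignore')
duplicates = df[answers.duplicated()]
duplicates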
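
A saved index column can hide duplicates, because it makes every full row unique. The sketch below uses a small made-up table (not the survey data) to show the difference between checking all columns and checking the answers only.

```python
import pandas as pd

# Toy data: rows 1 and 2 hold the same answers, but the saved index
# column ("Unnamed: 0") makes every full row unique.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "GENDER": ["M", "F", "F"],
    "AGE": [69, 59, 59],
    "LUNG_CANCER": ["YES", "NO", "NO"],
})

# Comparing full rows finds nothing; comparing answers finds the repeat
naive = toy.duplicated().sum()
answers_only = toy.drop(columns=["Unnamed: 0"]).duplicated().sum()
print(naive, answers_only)

# Keep only the first occurrence of each set of answers
deduped = toy.loc[~toy.drop(columns=["Unnamed: 0"]).duplicated()]
print(deduped)
```

Whether repeated answers from different respondents should actually be dropped is a judgment call: in a short survey, two people can legitimately give identical answers.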

Recode the two-valued survey columns from 1/2 to 0/1 (1 becomes 0, 2 becomes 1)

In [27]:
# Two-valued survey columns to recode (AGE is a number of years, so it is left out)
# Note: 'FATIGUE ' and 'ALLERGY ' are spelled with a trailing space here;
# adjust them if your CSV header has none.
columns_to_convert = ['SMOKING', 'YELLOW_FINGERS', 'ANXIETY',
                      'PEER_PRESSURE', 'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ',
                      'WHEEZING', 'ALCOHOL CONSUMING', 'COUGHING',
                      'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY',
                      'CHEST PAIN']

# Replace 1 with 0 and 2 with 1 in the specified columns
df[columns_to_convert] = df[columns_to_convert].replace({1: 0, 2: 1})
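
The recode can be tried on a small table first. This is a minimal sketch on made-up values (not the survey data): it strips stray whitespace from the column names, then recodes every column except AGE, so a real age of 1 or 2 would not be altered.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [69, 2, 59],      # a hypothetical age of 2 must survive the recode
    "SMOKING": [1, 2, 1],
    "FATIGUE ": [2, 2, 1],   # trailing space in the name, as in the list above
})

# Normalize column names so later cells do not depend on stray spaces
toy.columns = toy.columns.str.strip()

# Recode only the two-valued columns; AGE is left as it is
binary_cols = [c for c in toy.columns if c != "AGE"]
toy[binary_cols] = toy[binary_cols].replace({1: 0, 2: 1})
print(toy)
```

Stripping the names is optional, but if it is done, every later cell has to use the stripped names too.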

In [None]:
df.head()

Convert Gender F to 0 and M to 1

In [33]:
# Replace F with 0 and M with 1 in the GENDER column
df['GENDER'] = df['GENDER'].replace({'F': 0, 'M': 1})

In [None]:
df.head()

Convert LUNG_CANCER NO to 0 and YES to 1

In [37]:
# Replace NO with 0 and YES with 1 in the LUNG_CANCER column
df['LUNG_CANCER'] = df['LUNG_CANCER'].replace({'NO': 0, 'YES': 1})

In [None]:
df.head()

In [None]:
df.shape

### Tell the Story of the Data

#### Heatmap
- Heatmaps help visualize relationships between variables.

In [None]:
# Correlation heatmap of all columns (the categorical columns were recoded to 0/1 above)
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Between Variables', fontsize=16)
plt.show()

#### Does Age or Gender Affect Lung Cancer?

##### Distribution of Lung Cancer by Age

In [None]:
# Distribution of patients by age, color-coded by presence of Lung Cancer
plt.figure(figsize=(10,6))
sns.histplot(data=df, x='AGE', hue='LUNG_CANCER', kde=True)
plt.title('Age Distribution of Patients with and without Lung Cancer')
plt.xlabel('Age')
plt.ylabel('Number of Patients')
plt.show()

##### Interpretation:
- Younger age groups (20-40 years): if lung cancer is rare at these ages, the bars for respondents without lung cancer should clearly dominate.
- Middle age groups (40-60 years): a rising count of respondents with lung cancer, relative to those without, would indicate that cases start to increase here.
- Older age groups (60-80 years): a peak in the distribution of respondents with lung cancer, or a narrowing gap between the two groups, would indicate a higher prevalence among older respondents.
- Taken together, the chart shows whether lung cancer becomes more common as age increases, particularly from middle age onwards.
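
The age bands discussed above can be turned into numbers rather than read off the histogram. This is a minimal sketch on made-up toy values (not the survey data): it bins AGE with `pd.cut` and computes the group size and the share of lung cancer cases per band. The bin edges 20/40/60/80 and the toy rows are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [25, 35, 45, 55, 65, 75],
    "LUNG_CANCER": [0, 0, 1, 0, 1, 1],   # already recoded to 0/1
})

# Right-closed bins: (20, 40], (40, 60], (60, 80]
bands = pd.cut(toy["AGE"], bins=[20, 40, 60, 80],
               labels=["20-40", "40-60", "60-80"])

# The mean of a 0/1 column is the share of cases in each band
summary = toy.groupby(bands, observed=True)["LUNG_CANCER"].agg(["size", "mean"])
print(summary)
```

Reporting the group size next to the share matters: a band with only a handful of respondents can show an extreme share by chance.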

##### Gender and Lung Cancer

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
ax = sns.countplot(x='GENDER', hue='LUNG_CANCER', data=df, palette='coolwarm')
plt.title('Lung Cancer by Gender', fontsize=16)
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Update the legend labels
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, ['No', 'Yes'], title='LUNG_CANCER', loc='upper right')

# Customize x-axis labels
ax.set_xticklabels(['Female', 'Male'])

plt.show()

##### Interpretation:
- Females: compare the bar for respondents with lung cancer against the bar for those without.
- Males: make the same comparison. The two gender groups need not be the same size, so raw counts alone can mislead.
- The gender with the larger share of "Yes" responses, not the larger raw count, has the higher prevalence of lung cancer in this dataset.
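
Counts and rates can point in different directions when the groups differ in size. The sketch below uses made-up toy values (not the survey data) and `pd.crosstab` to put the raw counts next to the per-gender shares.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": [0, 0, 0, 1, 1, 1, 1, 1],        # 0: female, 1: male
    "LUNG_CANCER": [1, 1, 0, 1, 1, 1, 0, 0],   # 0: no, 1: yes
})

# Raw counts, as shown by the countplot
counts = pd.crosstab(toy["GENDER"], toy["LUNG_CANCER"])

# Row-normalized shares: each gender's row sums to 1
rates = pd.crosstab(toy["GENDER"], toy["LUNG_CANCER"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy table the male group has more cases in absolute terms but a lower share of cases, which is why the rates are the safer basis for a statement about prevalence.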

#### How Does Chest Pain Relate to Lung Cancer?

In [None]:
# Count plot of reported chest pain vs lung cancer status
plt.figure(figsize=(8,5))
sns.countplot(x='CHEST PAIN', hue='LUNG_CANCER', data=df)
plt.title('Chest Pain and Lung Cancer')
plt.xlabel('Chest Pain (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()

##### Interpretation:
- CHEST PAIN is a two-valued answer in this survey (recoded to 0/1 above), not a chest pain type, so the chart has four bars: two answers by two lung cancer outcomes.
- If respondents with lung cancer make up a larger share of those reporting chest pain than of those not reporting it, chest pain is associated with lung cancer in this dataset.

#### Alcohol Consumption and Smoking as Predictors

In [None]:
# Scatter plot of age vs alcohol consumption, colored by lung cancer status
plt.figure(figsize=(10,6))
sns.scatterplot(x='AGE', y='ALCOHOL CONSUMING', hue='LUNG_CANCER', data=df)
plt.title('Age vs Alcohol Consumption by Lung Cancer Status')
plt.xlabel('Age')
plt.ylabel('Alcohol Consuming')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='AGE', y='SMOKING', hue='LUNG_CANCER', data=df, palette='coolwarm')
plt.title('Age vs Smoking by Lung Cancer Status', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Smoking', fontsize=12)
plt.legend(title='Lung Cancer')
plt.show()

##### Interpretation:
- These plots attempt to visualize any potential relationship between age, a two-valued habit (alcohol consumption or smoking), and the presence of lung cancer. Because the y-axis takes only two values, many points overlap, and from these plots there doesn't seem to be a straightforward relationship between these variables and lung cancer.
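
For 0/1 columns, a grouped mean is often easier to read than a scatter plot. This is a minimal sketch on made-up toy values (not the survey data): the mean of each habit column per lung cancer status is the share of respondents in that group who reported the habit.

```python
import pandas as pd

toy = pd.DataFrame({
    "SMOKING": [1, 1, 0, 0, 1, 0],
    "ALCOHOL CONSUMING": [1, 0, 0, 0, 1, 1],
    "LUNG_CANCER": [1, 1, 1, 0, 0, 0],
})

# Share of respondents reporting each habit, split by lung cancer status
share = toy.groupby("LUNG_CANCER")[["SMOKING", "ALCOHOL CONSUMING"]].mean()
print(share.round(2))
```

On the real data, `share.T.plot(kind="bar")` would draw the same comparison as a grouped bar chart.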

## Conclusion
This analysis looked at three questions about the lung cancer survey data:
- **Age**: whether lung cancer cases become more frequent in older respondents, particularly from middle age onwards.
- **Gender**: whether there is a **gender disparity** in lung cancer prevalence.
- **Chest pain**: whether reporting chest pain is associated with a higher share of lung cancer cases.

In [None]:
# Correlation heatmap excluding the leftover index column 'Unnamed: 0'
plt.figure(figsize=(12,8))
sns.heatmap(df.drop(columns=['Unnamed: 0']).corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Between Variables', fontsize=16)
plt.show()

In [None]:
# Boxplot of age by gender, split by lung cancer status
# (AGE is the only continuous column in this dataset)
plt.figure(figsize=(10,6))
sns.boxplot(x='GENDER', y='AGE', hue='LUNG_CANCER', data=df, palette='coolwarm')
plt.title('Age by Gender and Lung Cancer Status', fontsize=16)
plt.xlabel('Gender (0: Female, 1: Male)', fontsize=12)
plt.ylabel('Age', fontsize=12)
plt.show()

In [None]:
import plotly.express as px

# Interactive scatter plot of Age vs Smoking and Lung Cancer
fig = px.scatter(df, x="AGE", y="SMOKING", color="LUNG_CANCER", 
                 title="Age vs Smoking with Lung Cancer",
                 labels={'LUNG_CANCER':'Lung Cancer Status'})
fig.show()