<a href="https://colab.research.google.com/github/anmol-iisc/psa_dspi_rmdn12/blob/main/Assignment2_1_25Sep_0545PM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Assignment 2: Exploratory Data Analysis on the Diabetes Dataset

## Background
Diabetes is one of the most prevalent chronic diseases and creates significant burdens on healthcare systems. The **Diabetes 130-US hospitals dataset** contains over **100,000 records** of hospital admissions for diabetic patients across **130 U.S. hospitals (1999–2008)**. It includes more than **50 attributes**, covering demographics, admission details, diagnoses, lab results, medications, and hospital outcomes.  

Like many real-world healthcare datasets, it presents challenges such as missing values, categorical codes (ICD-9 diagnoses, drug prescriptions), imbalances in outcome variables (e.g., readmission), and inconsistent data formats. Conducting an in-depth **Exploratory Data Analysis (EDA)** will help uncover insights into patient demographics, treatment outcomes, and factors influencing readmission rates.  

---

## Objectives
The goal of this assignment is to conduct a comprehensive **qualitative and quantitative EDA** of the Diabetes dataset. You will:  

- **Understand** the dataset structure, including variables, data types, and overall composition.  
- **Detect and address** data quality issues, such as missing values, duplicates, and inconsistent categories.  
- **Conduct univariate, bivariate, and multivariate analysis** using both qualitative and quantitative EDA, supported by visualization techniques (bar charts, count plots, stacked charts, heatmaps) and statistical methods (descriptive statistics, distribution analysis, and exploration of variable relationships).  
- **Visualize** patterns and correlations to uncover key insights and trends.  
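
As a rough illustration of the three analysis levels named above, the following sketch computes one summary of each kind with pandas. The small `toy` frame is a made-up stand-in (an assumption for illustration, not real records); only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: column names follow the dataset,
# values are invented purely for illustration.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "readmitted": ["NO", "NO", ">30", "NO", ">30", "NO"],
    "time_in_hospital": [1, 3, 2, 2, 1, 4],
    "num_medications": [5, 12, 9, 8, 6, 15],
    "num_lab_procedures": [20, 45, 30, 33, 25, 50],
})

# Univariate: distribution of a single categorical variable, as proportions.
gender_share = toy["gender"].value_counts(normalize=True)

# Bivariate: cross-tabulation of two categorical variables,
# normalized so that each row sums to 1.
readmit_by_gender = pd.crosstab(toy["gender"], toy["readmitted"], normalize="index")

# Multivariate: pairwise correlations among several numeric variables.
numeric_cols = ["time_in_hospital", "num_medications", "num_lab_procedures"]
corr = toy[numeric_cols].corr()

print(gender_share)
print(readmit_by_gender)
print(corr)
```

Each of these tables maps naturally onto one of the chart types listed above: a bar chart for `gender_share`, a stacked bar chart for `readmit_by_gender`, and a heatmap (for example `sns.heatmap(corr, annot=True)`) for `corr`.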

---

## Submission Instructions
- This is an **individual assignment** and carries **5% of the total marks** of this course.
- You should not discuss your progress or solutions with anyone else.  
- Use this **template Jupyter Notebook and dataset** for your work.  

At the beginning of the notebook, clearly state:  
1. **Your Name and Roll Number**  
2. **Python environment details** (version, libraries used).  
3. Whether you are running locally or on **Google Colab** (*preferred, to ensure all students work in a consistent environment*).  

Additional guidelines:  
- Ensure that your notebook is **fully reproducible**: every cell must run sequentially without errors to regenerate the results.  
- Submit a **single completed notebook file (.ipynb)** before the deadline.  

---


## Dataset Source
- Source of data: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008
- It represents **hospital admissions of diabetic patients** across 130 U.S. hospitals from 1999 to 2008.
- Read the **Variables Table** in the link above, which provides details of individual variables such as type, role, and description.

## Dataset Overview: `Diabetes_dataset.csv`
- **Number of records:** 101,766 hospital admissions  
- **Number of features:** 53  
- **Unit of analysis:** Each row represents **one hospital admission** (a patient can have multiple admissions).  
- **Target variable (Goal):** `readmitted`. The goal is to predict whether a patient is **readmitted within 30 days** of discharge.  
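
The two points above (several rows per patient, and a 30-day readmission target) can be sketched as follows. This is a minimal example on invented rows. It assumes that `readmitted` uses the labels `"NO"`, `">30"`, and `"<30"` as described in the UCI documentation; the name `readmit_30d` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: column names follow the dataset,
# values are invented purely for illustration.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [10, 10, 20, 30, 30],
    "readmitted": ["NO", "<30", ">30", "NO", "<30"],
})

# One row is one admission, so a patient_nbr may appear more than once.
admissions_per_patient = toy.groupby("patient_nbr")["encounter_id"].count()

# Hypothetical binary target: 1 only when readmission happened within 30 days.
# Assumes the "<30" label marks those cases.
toy["readmit_30d"] = (toy["readmitted"] == "<30").astype(int)

print(admissions_per_patient)
print("Share readmitted within 30 days:", toy["readmit_30d"].mean())
```

Because the same patient can contribute several rows, per-admission and per-patient summaries may differ; it is worth stating which one an interpretation refers to.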

###  Key Features with Mappings
- **Admission Type**: `admission_type_id` → mapped to `admission_type_desc`  
- **Discharge Disposition**: `discharge_disposition_id` → mapped to `discharge_desc`  
- **Admission Source**: `admission_source_id` → mapped to `admission_source_desc`  
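
One way to sanity-check such an ID-to-description mapping is to confirm that every ID carries at most one description and to list IDs whose description is missing. The sketch below uses invented rows with the dataset's column names; the description values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the real data: column names follow the dataset,
# values are invented purely for illustration.
toy = pd.DataFrame({
    "admission_type_id": [1, 1, 6, 6, 3],
    "admission_type_desc": ["Emergency", "Emergency", None, None, "Elective"],
})

# Distinct descriptions per ID; dropna=False counts a missing description
# as its own value, so an ID mapped to both text and NaN would show up as 2.
desc_per_id = toy.groupby("admission_type_id")["admission_type_desc"].nunique(dropna=False)

# IDs mapped to more than one description would indicate an inconsistent merge.
inconsistent_ids = desc_per_id[desc_per_id > 1]

# IDs whose description is missing altogether.
unmapped_ids = sorted(
    toy.loc[toy["admission_type_desc"].isna(), "admission_type_id"].unique().tolist()
)

print(desc_per_id)
print("Inconsistent IDs:", list(inconsistent_ids.index))
print("IDs without a description:", unmapped_ids)
```

The same pattern applies to the `discharge_disposition_id` and `admission_source_id` pairs by swapping the column names.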

---

##  Important Notes
- **Missing values:** Several features contain missing values, often recorded as `"?"` (not as `NaN`).   
- **ID columns:** Both numeric IDs and their human-readable descriptions are included after merging. IDs may be useful for re-checking or remapping, while descriptions are helpful for analysis.  
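
Because `"?"` is an ordinary string, pandas does not treat it as missing: `df.info()` reports every value as non-null. A minimal sketch of one way to surface these values, using invented rows with the dataset's column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names follow the dataset,
# values are invented purely for illustration.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 2],
})

# Before conversion, no value counts as missing.
print(toy.isna().sum())

# Convert the "?" placeholder to a real missing value.
cleaned = toy.replace("?", np.nan)

# Share of missing values per column, largest first.
missing_share = cleaned.isna().mean().sort_values(ascending=False)
print(missing_share)
```

An alternative is to do the conversion at load time with `pd.read_csv("Diabetes_dataset.csv", na_values="?")`. Either way, the missing share per column is a useful input when deciding whether to drop, impute, or keep a feature.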


**Note:** Do not just run plots — write down **interpretations** of your findings.  
The goal is to practice *thinking about the data*, not just generating charts.


# 📌 Preamble: Environment Details

Before starting the assignment, please fill in the following:

- **Name**: <Your Name>  
- **Roll Number**: <Your Roll Number>  
- **Environment**: Local Machine / Google Colab (choose one)  

Run the code cell below to print your Python and library versions.


In [9]:
# Preamble: Environment Check

import sys
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns

print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)

Python version: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
Pandas version: 2.2.2
NumPy version: 2.0.2
Matplotlib version: 3.10.0
Seaborn version: 0.13.2


## Load and Inspect the dataset structure

In [10]:
# Load the dataset; "?" placeholders are read as plain strings at this stage
df = pd.read_csv("Diabetes_dataset.csv")

In [11]:
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | admission_type_desc | discharge_desc | admission_source_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | NO | | Not Mapped | Physician Referral |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | No | No | No | Ch | Yes | >30 | Emergency | Discharged to home | Emergency Room |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | Yes | NO | Emergency | Discharged to home | Emergency Room |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | Ch | Yes | NO | Emergency | Discharged to home | Emergency Room |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | No | No | No | Ch | Yes | NO | Emergency | Discharged to home | Emergency Room |


In [12]:
df.shape

(101766, 53)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [14]:
df.describe()

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 |
| mean | 165201600.0 | 54330400.0 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.33973 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 102640300.0 | 38696360.0 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.9336 |
| min | 12522.0 | 135.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 84961190.0 | 23413220.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 50% | 152389000.0 | 45505140.0 | 1.0 | 1.0 | 7.0 | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 75% | 230270900.0 | 87545950.0 | 3.0 | 4.0 | 7.0 | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 443867200.0 | 189502600.0 | 8.0 | 28.0 | 25.0 | 14.0 | 132.0 | 6.0 | 81.0 | 42.0 | 76.0 | 21.0 | 16.0 |


## Data Pre-processing (cleaning, transformation, handling missing values, etc.)

## Univariate Analysis

## Bivariate Analysis

## Multivariate Analysis

## Summarize insights