### **Dataset:** Medical Appointment No-Shows

# Project Definition and Scope

The goal of this project is to build a machine learning model to predict whether a patient will show up for their scheduled medical appointment. The dataset used comes from a Brazilian public health system and contains 110,527 medical appointments, each with various demographic, medical, and scheduling-related features.

Missed medical appointments — known as *no-shows* — are a significant issue for healthcare providers. They result in lost time, wasted resources, and delayed treatment. Accurately predicting no-shows allows clinics to proactively intervene, such as by sending reminders, rescheduling, or double-booking slots.

###  Objective
- Analyze and preprocess the Medical Appointment No-Shows dataset.
- Perform feature engineering to create meaningful predictors.
- Train and evaluate multiple machine learning models.
- Identify the best-performing model for predicting appointment no-shows.

###  Target Definition
The original `No-show` column contains:
- `"No"` → Patient **showed up**
- `"Yes"` → Patient **did not show up**

To reduce confusion, we convert this column to binary (the preprocessing code keeps the original column name `No-show`):
- `0` → Patient showed up
- `1` → Patient missed the appointment (positive class)

This binary label will serve as the target variable for the classification models.
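
As a minimal sketch of this mapping, assuming a small hypothetical DataFrame `toy` in place of the real data, the conversion could look like the following. The frame, its values, and the column name `target_binary` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 'No-show' column
toy = pd.DataFrame({"No-show": ["No", "Yes", "No", "No"]})

# Map the string labels to the binary target: 1 = missed, 0 = showed up
toy["target_binary"] = toy["No-show"].map({"No": 0, "Yes": 1})

# Any label outside {"No", "Yes"} would become NaN, so check for that explicitly
assert toy["target_binary"].notna().all()

print(toy["target_binary"].tolist())
```

Using an explicit dictionary keeps the direction of the encoding visible, which matters here because `"Yes"` means the patient did not attend.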


# Data Collection

- The dataset can be downloaded online or imported in code from Kaggle.

In [None]:
import pandas as pd

# Set the path of the file
data_path = "/content/KaggleV2-May-2016.csv"

# Load the data into a pandas DataFrame
data = pd.read_csv(data_path)

# Display the shape of the DataFrame
print(f"Data loaded successfully with shape: {data.shape}")

# Print a view of the dataset
data

## Basic Overview of the Dataset

- Determine the number of instances (rows) and columns (features)




In [None]:
# 1. Determine the number of instances (rows) and columns (features)
num_rows, num_cols = data.shape
print(f"Number of instances (rows): {num_rows}")
print(f"Number of features (columns): {num_cols}")

- Display the first 5 rows

In [None]:
# 2. Display the first 5 rows of the dataset
print("\nFirst 5 rows of the dataset:")
data.head()

- Display the last 5 rows

In [None]:
# 3. Display the last 5 rows of the dataset
print("\nLast 5 rows of the dataset:")
data.tail()

## General Dataset Information

In [None]:
# General information about the dataset
data.info()

print("\n\n")

# Summary statistics for numerical columns
data.describe()

###  Dataset Overview and Summary Statistics

The dataset contains **110,527 records** and **14 columns**, with **no missing values** across any of the features.

#### Data Types
- **Float64**: 1 column (`PatientId`)
- **Int64**: 8 columns (`AppointmentID`, `Age`, `Scholarship`, `Hipertension`, `Diabetes`, `Alcoholism`, `Handcap`, `SMS_received`)
- **Object (string)**: 5 columns (`Gender`, `ScheduledDay`, `AppointmentDay`, `Neighbourhood`, `No-show`)

####  Summary Statistics Highlights (`.describe()`):
- **Age** ranges from **-1 to 115**. A minimum age of `-1` is invalid and will require cleaning.
- **Binary columns** (0 or 1): `Scholarship`, `Hipertension`, `Diabetes`, `Alcoholism`, `SMS_received`  
  These can be treated as categorical indicators.
- **Handcap** ranges from 0 to 4, but most values are 0 — treatable as ordinal or binary after inspection.
- **High cardinality column**:  
  `PatientId` has many unique values, likely acting as an identifier and not useful for prediction.  
  `AppointmentID` also shows wide numeric spread and can be excluded from modeling.

####  Notes
- The `ScheduledDay` and `AppointmentDay` columns are currently stored as text and will need to be converted to datetime for further analysis.
- The `No-show` column contains `"Yes"`/`"No"` and will be converted to binary (1 = missed, 0 = showed up).

Overall, the dataset is clean in terms of null values but contains some anomalies (like negative age) and high-cardinality identifiers that must be addressed before modeling.


## Data quality checks

In [None]:
# Check for missing/null values
missing_values = data.isnull().sum()
print("Missing values per column:\n")
print(missing_values)

# Check for negative age values
negative_ages = data[data['Age'] < 0]
print(f"\nNumber of rows with negative age: {len(negative_ages)}")

# Check unique values in 'Handcap'
print("\nUnique values in 'Handcap':", data['Handcap'].unique())

# Check for values greater than expected (e.g., > 4)
out_of_range_handcap = data[data['Handcap'] > 4]
print(f"Rows with out-of-range 'Handcap' values (>4): {len(out_of_range_handcap)}")
out_of_range_handcap.head()

# Check for exact duplicate rows
duplicate_count = data.duplicated().sum()
print(f"\nNumber of exact duplicate rows: {duplicate_count}")


### Data Quality Checks Summary

As observed in the earlier `.info()` step, there are no missing values in any of the 14 columns — confirmed again using `.isnull().sum()`.

- Age: One row contains an invalid value of `-1`, confirming the previously noted anomaly.
- Handcap: Values range from 0 to 4. No out-of-range values (>4) are present.
- Duplicate Rows: There are no exact duplicate rows in the dataset.

These checks confirm the dataset is structurally clean, with only one clear anomaly (negative age) that should be handled during preprocessing.


## Feature Uniqueness Exploration
- Categorical Features

In [None]:
# Quick view of unique values per column (printed, since only a cell's last expression is displayed)
print(data.nunique())

# View unique values for selected categorical features
for col in ['Gender', 'Neighbourhood', 'No-show']:
    print(f"\n{col} - unique values:\n{data[col].value_counts()}")



While previous steps focused primarily on numerical features, this step inspects the unique values of key **categorical variables**.

- **Gender**: Two categories — `F` (female) and `M` (male), with females making up the majority.
- **Neighbourhood**: Contains 81 unique values, indicating high cardinality. If kept as a feature, it would require encoding (e.g., one-hot or target encoding); in this notebook it is dropped during preprocessing.
- **No-show**: Two values — `"No"` (patient showed up) and `"Yes"` (patient missed). These will be converted to binary (`0` and `1`) for modeling.

This step is essential to confirm category distributions, detect imbalances, and inform appropriate encoding strategies for categorical features.
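
To make the two encoding options concrete, here is a small sketch on made-up data. The neighbourhood names, the `missed` column, and the `Neighbourhood_te` column are hypothetical, and the target encoding shown is the simplest mean-based variant rather than a recommended production setup.

```python
import pandas as pd

# Hypothetical toy data; neighbourhood names and labels are made up
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B", "B", "B", "C"],
    "missed": [1, 0, 0, 0, 1, 1],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["Neighbourhood"], prefix="Neighbourhood")

# Target (mean) encoding: replace each category with its observed no-show rate.
# In practice the rates should be computed on training data only to avoid leakage.
rates = toy.groupby("Neighbourhood")["missed"].mean()
toy["Neighbourhood_te"] = toy["Neighbourhood"].map(rates)

print(one_hot.shape)
print(rates.to_dict())
```

With 81 categories, one-hot encoding would add 81 columns (80 with one dropped), while target encoding adds a single numeric column at the cost of a higher leakage risk.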


## Target Variable Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Check value counts for the original 'No-show' column
print("Value counts:\n")
print(data['No-show'].value_counts())

print("\n\n")

# Plot class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='No-show', data=data, palette='pastel')
plt.title("No-show Class Distribution")
plt.xlabel("No-show (Yes = Missed, No = Attended)")
plt.ylabel("Count")
plt.show()

### Target Variable Analysis – No-show Distribution

The `No-show` column shows a clear class imbalance:

- **"No"** (patient showed up): 88,208 instances (~80%)
- **"Yes"** (patient missed appointment): 22,319 instances (~20%)

This imbalance is also visible in the bar chart.

The imbalance will be considered when selecting evaluation metrics and will be addressed during preprocessing by applying SMOTE to the training set.
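
A short worked calculation shows why accuracy alone is a weak metric at this class ratio. It uses the class counts reported above and a hypothetical baseline that always predicts "showed up".

```python
# Class counts reported above
showed_up = 88_208   # "No"
missed = 22_319      # "Yes"
total = showed_up + missed

# Hypothetical baseline: always predict "showed up" (class 0)
baseline_accuracy = showed_up / total   # fraction of correct predictions
baseline_recall = 0 / missed            # no missed appointment is ever flagged

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall on no-shows: {baseline_recall:.3f}")
```

The baseline reaches roughly 80% accuracy while detecting no missed appointments at all, which is why recall, precision, or F1 on the positive class are more informative here.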


# Feature Engineering

To enhance the predictive power of the dataset, several new features were engineered based on appointment scheduling dates and time intervals. These features aim to capture behavioral patterns and scheduling dynamics that may influence whether a patient attends their appointment.

### DaysBetween  
**Description**: Number of days between when the appointment was scheduled and when it was held.  
**Purpose**: Captures the waiting time for each patient, which may influence their likelihood of showing up. Longer delays may increase the risk of no-shows.


In [None]:
# Convert to datetime (if not already)
data['ScheduledDay'] = pd.to_datetime(data['ScheduledDay']).dt.date
data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay']).dt.date

# Convert back to datetime (optional, for subtraction)
data['ScheduledDay'] = pd.to_datetime(data['ScheduledDay'])
data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay'])

# Now calculate DaysBetween
data['DaysBetween'] = (data['AppointmentDay'] - data['ScheduledDay']).dt.days

data['DaysBetween']

#### DaysBetween Calculation – Explanation

1. **Convert to Date Only**:  
   `ScheduledDay` and `AppointmentDay` are first converted to contain only the date (removing the time component) to avoid negative day differences caused by time-of-day differences.

2. **Reconvert to Datetime**:  
   The `.dt.date` result (plain Python `date` objects) is converted back to a pandas datetime column so that the subtraction yields timedeltas with a `.dt.days` accessor.

3. **Calculate DaysBetween**:  
   The difference in days between the appointment date and the scheduling date is computed using `.dt.days`.

This ensures that appointments scheduled and held on the same calendar day have `DaysBetween = 0`, rather than a spurious `-1` caused by the time of day.
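
The effect can be reproduced on a single hypothetical row. The timestamps below are made up, and the sketch assumes the appointment is stored at midnight while the booking carries a time of day. It uses `.dt.normalize()` as an alternative to the `.dt.date` round trip. This is a substitution for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical same-day booking: scheduled at 18:38, appointment stored at midnight
scheduled = pd.Series(pd.to_datetime(["2016-04-29 18:38:08"]))
appointment = pd.Series(pd.to_datetime(["2016-04-29 00:00:00"]))

# Subtracting full timestamps gives a negative timedelta, whose day component is -1
raw_days = (appointment - scheduled).dt.days

# Dropping the time of day first (normalize() sets the time to midnight) gives 0
clean_days = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(raw_days.iloc[0], clean_days.iloc[0])
```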


### ScheduledWeekday  
**Description**: Indicates the day of the week (0 = Monday, ..., 6 = Sunday) on which the appointment was scheduled.  
**Purpose**: Helps capture booking behavior patterns. Patients scheduling appointments earlier in the week may be more intentional or organized, which could influence attendance rates.


In [None]:
# ScheduledWeekday: weekday when the appointment was scheduled
data['ScheduledWeekday'] = data['ScheduledDay'].dt.dayofweek

data['ScheduledWeekday']

### AppointmentWeekday  
**Description**: Indicates the day of the week on which the appointment is scheduled to occur (0 = Monday, ..., 6 = Sunday).  
**Purpose**: No-show behavior may vary by day. For instance, patients may be more likely to skip appointments on Mondays or Fridays due to routine disruptions or long weekends.


In [None]:
# AppointmentWeekday: weekday when the appointment takes place
data['AppointmentWeekday'] = data['AppointmentDay'].dt.dayofweek

data['AppointmentWeekday']

### IsWeekendAppointment  
**Description**: A boolean feature indicating whether the appointment is scheduled on a weekend (Saturday = 5, Sunday = 6).  
**Purpose**: Weekend appointments may affect attendance patterns due to differences in availability, responsibilities, or access to transportation compared to weekdays.


In [None]:
# IsWeekendAppointment: True if appointment is on Saturday or Sunday
data['IsWeekendAppointment'] = data['AppointmentWeekday'].isin([5, 6])

data['IsWeekendAppointment']

### Feature Engineering – Pros and Cons

**Pros:**
- **Domain-driven**: Features like `DaysBetween` and `AppointmentWeekday` are based on real-world patient scheduling behavior, making them meaningful and interpretable.
- **Temporal and behavioral context**: The features capture both how far in advance appointments are booked and patterns across different days of the week.
- **Model compatibility**: A mix of continuous and categorical features allows flexibility across various machine learning models.

**Cons:**
- **Redundancy risk**: Some features (e.g., `AppointmentWeekday` and `IsWeekendAppointment`) may convey overlapping information and require correlation checks.
- **Encoding overhead**: Categorical features such as weekdays must be encoded before modeling, adding extra preprocessing steps.
- **No historical behavior**: The features do not include patient history (e.g., past no-shows), which could be a strong predictor but is not available in this dataset.
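
The redundancy risk can be checked directly. The sketch below uses hypothetical weekday codes rather than the real data and tests whether the weekend flag is fully determined by the weekday.

```python
import pandas as pd

# Hypothetical weekday codes (0 = Monday, ..., 6 = Sunday)
toy = pd.DataFrame({"AppointmentWeekday": [0, 1, 2, 3, 4, 5, 6, 5, 0, 2]})
toy["IsWeekendAppointment"] = toy["AppointmentWeekday"].isin([5, 6])

# The weekend flag is a deterministic function of the weekday:
# every weekday value maps to exactly one flag value
flags_per_weekday = toy.groupby("AppointmentWeekday")["IsWeekendAppointment"].nunique()
is_redundant = bool((flags_per_weekday == 1).all())

# A cross-tabulation makes the overlap visible
print(pd.crosstab(toy["AppointmentWeekday"], toy["IsWeekendAppointment"]))
print("Flag fully determined by weekday:", is_redundant)
```

When the flag is fully determined by another feature, it adds no new information for models that can split on the weekday itself, although it may still help linear models.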


# Data Preprocessing

In [None]:
# Make a working copy of the data to preserve the original dataset
df = data.copy(deep=True)

- Convert Dates to datetime
- Convert Target Variable to Binary
- Remove Anomalies
- Drop ID Columns and Neighbourhood

In [None]:
# Convert ScheduledDay and AppointmentDay to datetime (remove time)
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay']).dt.date
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay']).dt.date

# Convert back to full datetime to allow subtraction
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

# Replace string labels with binary values (0 = showed up, 1 = missed)
df['No-show'] = df['No-show'].map({'No': 0, 'Yes': 1})

# Remove invalid age values (copy so later in-place edits do not act on a slice)
df = df[df['Age'] >= 0].copy()

# Remove identifiers and the high-cardinality Neighbourhood column
df.drop(['PatientId', 'AppointmentID', 'Neighbourhood'], axis=1, inplace=True)

- Normalize Continuous Features
- Encode Binary Categorical Features
- One-Hot Encode Multi-class Categorical Features

In [None]:
from sklearn.preprocessing import MinMaxScaler

# -----
# Normalizing Continuous Features

# Create a scaler instance
scaler = MinMaxScaler()

# Normalize 'Age' and 'DaysBetween' (apply after DaysBetween has been created)
# Plain column assignment lets the integer columns become float without dtype warnings
df[['Age', 'DaysBetween']] = scaler.fit_transform(df[['Age', 'DaysBetween']])

# -----
# Encode Binary Categorical Features

from sklearn.preprocessing import LabelEncoder

# List of binary categorical features
binary_cols = ['Gender', 'Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'SMS_received', 'IsWeekendAppointment']

# Apply label encoding; classes are sorted, so 0/1 stay as they are,
# False -> 0, True -> 1, and Gender 'F' -> 0, 'M' -> 1
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])


# -----
# One-Hot Encode Multi-class Categorical Features

from sklearn.preprocessing import OneHotEncoder

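# Drop the raw date columns; DaysBetween and the weekday features already capture them
df.drop(columns=['ScheduledDay', 'AppointmentDay'], inplace=True)

# Columns to one-hot encode (the weekday features)
onehot_cols = ['ScheduledWeekday', 'AppointmentWeekday']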

# Initialize encoder
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform
encoded = encoder.fit_transform(df[onehot_cols])

# Create DataFrame with encoded columns
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(onehot_cols), index=df.index)

# Drop original columns and append encoded ones
df.drop(columns=onehot_cols, inplace=True)
df = pd.concat([df, encoded_df], axis=1)

# Move the target variable 'No-show' to the end of the DataFrame
target = df.pop('No-show')
df['No-show'] = target

df

### SMOTE Application

In this step, we split the dataset into training and testing sets using an 80/20 ratio with stratification to preserve the original class distribution.

SMOTE (Synthetic Minority Oversampling Technique) is applied **only to the training set** to generate synthetic examples of the minority class (missed appointments). Resampling after the split prevents data leakage: no synthetic sample is derived from test rows, and the test set contains only real appointments.
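
The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that idea on hypothetical 2-D points. The helper `synthesize` is made up for this example and is not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synthesize(points, k=2, rng=rng):
    """Create one synthetic point: pick a sample, pick one of its k nearest
    minority neighbours, and interpolate at a random position between them."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]      # skip the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                           # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([synthesize(minority) for _ in range(5)])
print(synthetic)
```

Because each synthetic point lies on a line segment between two real minority samples, applying this to test rows would blend test information into new samples, which is the leakage the split order avoids.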

By training on a balanced dataset and evaluating on the original, imbalanced test set, we ensure a fair and realistic assessment of model performance.
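
The same reasoning applies to any step that learns parameters from data, including min-max scaling. In the preprocessing cell the scaler is fit on the full dataset before the split. A stricter variant, sketched here on hypothetical random data, fits it on the training split only and reuses it on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (standing in for Age and DaysBetween) and binary target
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 2)).astype(float)
y = np.array([0] * 160 + [1] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Test values may fall slightly outside the 0 to 1 range with this approach, which is expected: the scaler has never seen them.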


In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Separate features and target
X = df.drop('No-show', axis=1)
y = df['No-show']

# 2. Split into training and testing sets (stratify to maintain imbalance in test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Apply SMOTE to training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# 4. Visualize class balance in the resampled training set
sns.countplot(x=y_train_smote)
plt.title("Class Distribution After SMOTE (Training Set)")
plt.xlabel("No-show (0 = Showed up, 1 = Missed)")
plt.ylabel("Count")
plt.show()

print("\n\n\n")

# Original training and testing set sizes
print(f"Original training set size: {X_train.shape[0]} rows")
print(f"Original testing set size: {X_test.shape[0]} rows")

# After SMOTE
print(f"Training set size after SMOTE: {X_train_smote.shape[0]} rows")


The training set was balanced using SMOTE to address class imbalance. The original training set had 88,420 samples, which increased to 141,130 after synthetic minority-class examples were added. The two classes (0 = showed up, 1 = missed) now have equal counts, as shown in the bar plot, so the model is no longer trained on data dominated by the majority class.
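
The reported sizes can be broken down with simple arithmetic. This sketch assumes SMOTE ran with its default setting, which oversamples the minority class until both classes have the same number of rows and leaves the majority class untouched.

```python
# Figures reported above
train_before = 88_420
train_after = 141_130

# Assumption: default SMOTE balances both classes to the majority-class count
per_class_after = train_after // 2               # rows per class after SMOTE
majority_before = per_class_after                # majority class is left untouched
minority_before = train_before - majority_before
synthetic_added = train_after - train_before

print(f"Rows per class after SMOTE: {per_class_after}")
print(f"Minority rows before SMOTE: {minority_before}")
print(f"Synthetic rows added:       {synthetic_added}")
```

Under that assumption, about three out of four minority rows in the resampled training set are synthetic, which is worth keeping in mind when interpreting training-set metrics.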