# Feature Engineering

This notebook focuses on creating **high-quality engineered features** for predicting patient no-shows.

Key goals:
- Create time-based features
- Encode categorical variables
- Prepare a clean modeling dataset

**Important:** On tabular problems like this one, well-designed features often matter more than the choice of model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

In [2]:
# Load dataset
data_path = "../data/raw/KaggleV2-May-2016.csv"
df = pd.read_csv(data_path)

df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


## Target Variable Encoding

- `No-show = Yes` → patient **missed** the appointment (1)
- `No-show = No` → patient **attended** (0)

In [3]:
df['no_show'] = df['No-show'].map({'No': 0, 'Yes': 1})

df['no_show'].value_counts(normalize=True)

no_show
0    0.798067
1    0.201933
Name: proportion, dtype: float64
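`Series.map` silently produces `NaN` for any label missing from the dictionary, so a small guard can be worth adding around the mapping. The following is a minimal sketch on a toy series; the `encode_target` helper is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_target(labels: pd.Series) -> pd.Series:
    """Map 'No'/'Yes' to 0/1 and fail loudly on any other label."""
    encoded = labels.map({'No': 0, 'Yes': 1})
    if encoded.isna().any():
        # Collect the labels that the dictionary did not cover
        bad = labels[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected No-show labels: {bad}")
    return encoded.astype(int)

# Toy data for illustration only
toy = pd.Series(['No', 'Yes', 'No', 'No'], name='No-show')
encoded = encode_target(toy)
print(encoded.tolist())
```

On the real data this would be called as `encode_target(df['No-show'])`.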

## Date & Time Features

We extract:
- Days between scheduling and appointment
- Appointment weekday
- Weekend flag

In [4]:
# Convert date columns
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

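# Days between scheduling and appointment.
# Note: ScheduledDay includes a time of day while AppointmentDay is recorded
# at midnight, so same-day bookings come out as -1 here.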
df['days_between'] = (df['AppointmentDay'] - df['ScheduledDay']).dt.days

# Appointment weekday
df['appointment_weekday'] = df['AppointmentDay'].dt.weekday

# Weekend flag
df['is_weekend'] = df['appointment_weekday'].isin([5, 6]).astype(int)

df[['days_between', 'appointment_weekday', 'is_weekend']].head()

| | days_between | appointment_weekday | is_weekend |
|---|---|---|---|
| 0 | -1 | 4 | 0 |
| 1 | -1 | 4 | 0 |
| 2 | -1 | 4 | 0 |
| 3 | -1 | 4 | 0 |
| 4 | -1 | 4 | 0 |
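The `-1` values above come from the timestamps themselves: `ScheduledDay` carries a time of day, while `AppointmentDay` is recorded at midnight, so a same-day booking yields a negative timedelta whose `.dt.days` is `-1`. One possible alternative is to compare calendar dates only. This is a sketch on two toy rows, and the column name `days_between_calendar` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration: one same-day booking, one booked two days ahead
toy = pd.DataFrame({
    'ScheduledDay': pd.to_datetime(['2016-04-29T18:38:08Z', '2016-04-27T08:36:51Z']),
    'AppointmentDay': pd.to_datetime(['2016-04-29T00:00:00Z', '2016-04-29T00:00:00Z']),
})

# Raw timestamp difference, as computed in the cell above
toy['days_between'] = (toy['AppointmentDay'] - toy['ScheduledDay']).dt.days

# Calendar-day difference: drop the time of day from both columns first
toy['days_between_calendar'] = (
    toy['AppointmentDay'].dt.normalize() - toy['ScheduledDay'].dt.normalize()
).dt.days

print(toy[['days_between', 'days_between_calendar']])
```

With calendar-day differences, a remaining negative value would mean an appointment dated before it was scheduled, which is worth inspecting separately.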


## Demographic Features

- Gender encoding
- Age cleanup

In [6]:
# Gender encoding
df['gender'] = df['Gender'].map({'F': 0, 'M': 1})

# Fix invalid ages (negative values)
df.loc[df['Age'] < 0, 'Age'] = np.nan

df[['Age', 'gender']].describe()

| | Age | gender |
|---|---|---|
| count | 110527.0 | 110527.0 |
| mean | 37.088883 | 0.350023 |
| std | 23.11019 | 0.476979 |
| min | 0.0 | 0.0 |
| 25% | 18.0 | 0.0 |
| 50% | 37.0 | 0.0 |
| 75% | 55.0 | 1.0 |
| max | 115.0 | 1.0 |
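Replacing invalid ages with `NaN` leaves missing values that many estimators will not accept. One common option, shown here only as a sketch on toy data, is median imputation plus an indicator column. The `age_missing` column is a hypothetical name, not something the notebook creates.

```python
import pandas as pd

# Toy data for illustration: one impossible (negative) age
toy = pd.DataFrame({'Age': [62, 56, -1, 8, 56]})

# Mark negative ages as missing; mask() upcasts the column to float
toy['Age'] = toy['Age'].mask(toy['Age'] < 0)

# Record which rows were missing before filling them in
toy['age_missing'] = toy['Age'].isna().astype(int)

# Fill missing ages with the median of the valid ones
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

print(toy)
```

In a train/test setup the median would normally be computed on the training split only.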


## Chronic Disease Flags

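Medical condition indicators. Most are binary (0/1), but `Handcap` also takes values above 1, as the counts below show:
- Hypertension (column `Hipertension`, spelled as in the dataset)
- Diabetes (`Diabetes`)
- Alcoholism (`Alcoholism`)
- Handicap (column `Handcap`)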

In [10]:
chronic_features = ['Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']

df[chronic_features].value_counts()

Hipertension  Diabetes  Alcoholism  Handcap
0             0         0           0          84115
1             0         0           0          13663
              1         0           0           5885
0             0         1           0           1922
              1         0           0           1341
              0         0           1           1088
1             0         1           0           1042
                        0           1            541
              1         0           1            304
                        1           0            243
0             0         0           2             97
              1         1           0             75
1             1         0           2             41
              0         0           2             39
0             1         0           1             39
              0         1           1             31
1             0         1           1             26
              1         1           1             12
(output truncated)
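Because the counts above include rows where `Handcap` is `2`, the column is not a strict 0/1 flag. If a binary indicator is wanted, one simple option is to threshold it. This is a sketch on toy data, and `has_handicap` is a hypothetical column name.

```python
import pandas as pd

# Toy data for illustration, including a value above 1
toy = pd.DataFrame({'Handcap': [0, 1, 2, 0, 1]})

# Collapse any non-zero level into a single binary flag
toy['has_handicap'] = (toy['Handcap'] > 0).astype(int)

print(toy)
```

Keeping the original column as an ordinal feature is an equally reasonable choice; which works better is an empirical question.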

## SMS Reminder Feature

- `1` → patient received SMS
- `0` → no SMS reminder

In [11]:
df['SMS_received'].value_counts()

SMS_received
0    75045
1    35482
Name: count, dtype: int64
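A quick way to see whether the reminder flag carries signal is to compare the no-show rate across the two groups. The sketch below uses made-up toy rows purely to show the mechanics; its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 1, 1, 1, 1, 0],
    'no_show':      [0, 1, 0, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 target within each group is that group's no-show rate
rate_by_sms = toy.groupby('SMS_received')['no_show'].mean()

print(rate_by_sms)
```

The same `groupby` applied to `df` would give the observed rates, which still need careful reading because reminders are presumably not sent at random.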

## Neighborhood Encoding

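`Neighbourhood` is a high-cardinality categorical feature.
We use **frequency encoding** (each neighbourhood is replaced by its share of all rows) instead of one-hot encoding, which would add one column per neighbourhood.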

In [12]:
# Frequency encoding for neighborhood
neighborhood_freq = df['Neighbourhood'].value_counts(normalize=True)
df['neighborhood_freq'] = df['Neighbourhood'].map(neighborhood_freq)

df[['Neighbourhood', 'neighborhood_freq']].head()

| | Neighbourhood | neighborhood_freq |
|---|---|---|
| 0 | JARDIM DA PENHA | 0.035077 |
| 1 | JARDIM DA PENHA | 0.035077 |
| 2 | MATA DA PRAIA | 0.005827 |
| 3 | PONTAL DE CAMBURI | 0.000624 |
| 4 | JARDIM DA PENHA | 0.035077 |
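The cell above computes frequencies on the full dataset. When a held-out split is used, it is usually safer to learn the frequencies on the training rows and reuse them elsewhere. A minimal sketch, assuming two hypothetical helpers and toy data:

```python
import pandas as pd

def fit_frequency_encoding(train_col: pd.Series) -> pd.Series:
    """Learn each category's share of rows from the training column."""
    return train_col.value_counts(normalize=True)

def apply_frequency_encoding(col: pd.Series, freq: pd.Series) -> pd.Series:
    """Replace categories by learned frequencies; unseen categories get 0.0."""
    return col.map(freq).fillna(0.0)

# Toy splits for illustration only
train = pd.Series(['JARDIM DA PENHA', 'JARDIM DA PENHA', 'MATA DA PRAIA', 'JARDIM DA PENHA'])
test = pd.Series(['MATA DA PRAIA', 'PONTAL DE CAMBURI'])

freq = fit_frequency_encoding(train)
train_enc = apply_frequency_encoding(train, freq)
test_enc = apply_frequency_encoding(test, freq)

print(test_enc.tolist())
```

Filling unseen categories with `0.0` is one convention among several; it treats a new neighbourhood as "never observed in training".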


## Feature Selection for Modeling

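We keep only the engineered features and the raw columns used as-is, dropping identifiers (`PatientId`, `AppointmentID`) and the original text and date columns. Note that `Scholarship` is not included in this list.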

In [13]:
feature_cols = [
    'Age',
    'gender',
    'days_between',
    'appointment_weekday',
    'is_weekend',
    'SMS_received',
    'Hipertension',
    'Diabetes',
    'Alcoholism',
    'Handcap',
    'neighborhood_freq'
]

X = df[feature_cols]
y = df['no_show']

X.head()

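| | Age | gender | days_between | appointment_weekday | is_weekend | SMS_received | Hipertension | Diabetes | Alcoholism | Handcap | neighborhood_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62.0 | 0 | -1 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0.035077 |
| 1 | 56.0 | 1 | -1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.035077 |
| 2 | 62.0 | 0 | -1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005827 |
| 3 | 8.0 | 0 | -1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000624 |
| 4 | 56.0 | 0 | -1 | 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0.035077 |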


## Final Dataset Check

In [14]:
print("Feature matrix shape:", X.shape)
print("Target distribution:\n", y.value_counts(normalize=True))

Feature matrix shape: (110527, 11)
Target distribution:
 no_show
0    0.798067
1    0.201933
Name: proportion, dtype: float64
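Beyond shape and class balance, a few automated checks can catch problems before modeling. The sketch below uses a hypothetical `check_feature_matrix` helper on toy data; it is one possible set of checks, not an exhaustive validation.

```python
import numpy as np
import pandas as pd

def check_feature_matrix(X: pd.DataFrame, y: pd.Series) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(X) != len(y):
        problems.append(f"row mismatch: X has {len(X)}, y has {len(y)}")
    # Columns that are not numeric would need encoding first
    non_numeric = X.select_dtypes(exclude='number').columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # Columns that still contain missing values
    missing = X.columns[X.isna().any()].tolist()
    if missing:
        problems.append(f"columns with missing values: {missing}")
    # The target is expected to contain only 0 and 1
    if not set(y.unique()) <= {0, 1}:
        problems.append("target contains values other than 0 and 1")
    return problems

# Toy data for illustration: one missing age
X_toy = pd.DataFrame({'Age': [62.0, np.nan], 'gender': [0, 1]})
y_toy = pd.Series([0, 1])

problems = check_feature_matrix(X_toy, y_toy)
print(problems)
```

On the notebook's data this would be called as `check_feature_matrix(X, y)`.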


## Summary

### Engineered Features:
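- Scheduling-to-appointment delay (`days_between`)
- Weekday and weekend flags (`appointment_weekday`, `is_weekend`)
- Encoded demographics (`gender`, cleaned `Age`)
- Chronic disease indicators (used as provided)
- Neighborhood frequency encoding (`neighborhood_freq`)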