# Patient Overlap and Data Leakage

Patient overlap in medical data is one instance of a more general machine learning problem called **data leakage**. To identify patient overlap in this week's graded assignment, you'll check whether a patient's ID appears in both the training set and the test set. You should also verify that there is no patient overlap between the training and validation sets, which is what you'll do here.

Below is a simple example showing how you can check for and remove patient overlap in your training and validation sets.

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

## 1. Data

### 1.1 Loading the Data

First, you'll read in your training and validation datasets from CSV files. Run the next cells to read these CSVs into `pandas` dataframes and preview the first few rows of each.

In [3]:
train_df = pd.read_csv('data/train-small.csv')
valid_df = pd.read_csv('data/valid-small.csv')

print(f'There are {train_df.shape[0]} rows & {train_df.shape[1]} features in the TRAINING set')
print(f'There are {valid_df.shape[0]} rows  & {valid_df.shape[1]} features in the VALIDATION set')

There are 1000 rows & 16 features in the TRAINING set
There are 109 rows  & 16 features in the VALIDATION set


In [4]:
train_df.head(3)

Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00008270_015.png,0,0,0,0,0,0,0,0,0,0,0,8270,0,0,0
1,00029855_001.png,1,0,0,0,1,0,0,0,1,0,0,29855,0,0,0
2,00001297_000.png,0,0,0,0,0,0,0,0,0,0,0,1297,1,0,0


In [5]:
valid_df.head(3)

Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00027623_007.png,0,0,0,1,1,0,0,0,0,0,0,27623,0,0,0
1,00028214_000.png,0,0,0,0,0,0,0,0,0,0,0,28214,0,0,0
2,00022764_014.png,0,0,0,0,0,0,0,0,0,0,0,22764,0,0,0


### 1.2 Extracting Patient IDs

By running the next two cells, you will do the following:
1. Extract patient IDs from the training and validation sets

In [6]:
train_ids = train_df.PatientId.values
valid_ids = valid_df.PatientId.values

### 1.3 Comparing Patient IDs for Train & Validation Sets

2. Convert these arrays of IDs into `set` objects for easy comparison
3. Identify patient overlap as the intersection of the two sets

In [8]:
intersection = set(train_ids).intersection(set(valid_ids))
print(f'There are {len(intersection)} intersections between train & validation sets')
print(f'The intersections are: {intersection}')

There are 11 intersections between train & validation sets
The intersections are: {20290, 27618, 9925, 10888, 22764, 19981, 18253, 4461, 28208, 8760, 7482}
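
The same comparison can also be made without leaving NumPy. The sketch here uses made-up patient IDs (not the course data) and `np.intersect1d`, which returns the sorted unique values common to both arrays. The `toy_*` names are assumptions introduced only for illustration.

```python
import numpy as np

# Toy patient IDs, made up for illustration (not taken from the course CSVs)
toy_train_ids = np.array([101, 102, 102, 103, 104])
toy_valid_ids = np.array([104, 105, 102, 106])

# Set-based comparison, mirroring the approach used in this notebook
set_overlap = set(toy_train_ids).intersection(set(toy_valid_ids))

# NumPy equivalent: sorted, de-duplicated IDs present in both arrays
np_overlap = np.intersect1d(toy_train_ids, toy_valid_ids)

print(f'Set overlap: {sorted(int(i) for i in set_overlap)}')
print(f'NumPy overlap: {np_overlap.tolist()}')
```

Both approaches find the same IDs; the set version is order-agnostic, while `np.intersect1d` always returns a sorted array.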


### 1.4 Identifying & Removing Overlapping Patients

Run the next two cells to do the following:
1. Create lists of the overlapping row numbers in both the training and validation sets.
2. Drop the overlapping patient records from the validation set.

**Note:** You could also choose to drop them from the training set instead.

In [10]:
train_overlap = []
valid_overlap = []

for patient_id in intersection:
    # Collect the row indices in each set that belong to this overlapping patient
    train_overlap.extend(train_df.index[train_df['PatientId'] == patient_id].tolist())
    valid_overlap.extend(valid_df.index[valid_df['PatientId'] == patient_id].tolist())

print(f'Train overlapping patients: {train_overlap}')
print(f'Validation overlapping patients: {valid_overlap}')

Train overlapping patients: [306, 186, 797, 98, 408, 917, 327, 913, 10, 51, 276]
Validation overlapping patients: [104, 88, 65, 13, 2, 41, 56, 70, 26, 75, 20, 52, 55]
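
The validation list above contains 13 row indices for only 11 overlapping patient IDs, which indicates that at least one patient contributes more than one image. A minimal sketch of how you might count rows per overlapping patient, using a made-up dataframe and a hypothetical `toy_intersection` set rather than the course data:

```python
import pandas as pd

# Toy validation frame, made up for illustration: patient 7 has two images
toy_valid = pd.DataFrame({
    'Image': ['p7_000.png', 'p7_001.png', 'p8_000.png', 'p9_000.png'],
    'PatientId': [7, 7, 8, 9],
})
toy_intersection = {7, 9}  # hypothetical overlapping patient IDs

# Rows belonging to overlapping patients
overlap_rows = toy_valid[toy_valid['PatientId'].isin(toy_intersection)]

# Number of rows (images) per overlapping patient
rows_per_patient = overlap_rows['PatientId'].value_counts()

print(f'{len(toy_intersection)} overlapping patients, {len(overlap_rows)} overlapping rows')
print(rows_per_patient)
```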


In [11]:
valid_df.drop(valid_overlap, inplace=True)
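
As an alternative to building index lists in a loop, the overlapping rows can be filtered out with a boolean mask. This is a sketch on made-up data, assuming the same `PatientId` column name; the `toy_*` variables are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frames, made up for illustration (not the course data)
toy_train = pd.DataFrame({'Image': ['a.png', 'b.png', 'c.png'],
                          'PatientId': [1, 2, 3]})
toy_valid = pd.DataFrame({'Image': ['d.png', 'e.png', 'f.png', 'g.png'],
                          'PatientId': [3, 4, 3, 5]})

# Patient IDs present in both sets
toy_overlap = set(toy_train['PatientId']).intersection(toy_valid['PatientId'])

# True for every validation row whose patient also appears in training
is_overlap = toy_valid['PatientId'].isin(toy_overlap)

# Keep only the non-overlapping rows; this returns a new frame
# instead of modifying toy_valid in place
toy_valid_clean = toy_valid[~is_overlap]

print(toy_valid_clean)
```

Unlike `drop(..., inplace=True)`, this leaves the original dataframe untouched, which makes the cell safe to rerun.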

In [12]:
train_overlap = []
valid_overlap = []

for patient_id in intersection:
    # Recompute the overlapping row indices now that the validation rows are dropped
    train_overlap.extend(train_df.index[train_df['PatientId'] == patient_id].tolist())
    valid_overlap.extend(valid_df.index[valid_df['PatientId'] == patient_id].tolist())

print(f'Train overlapping patients: {train_overlap}')
print(f'Validation overlapping patients: {valid_overlap}')

Train overlapping patients: [306, 186, 797, 98, 408, 917, 327, 913, 10, 51, 276]
Validation overlapping patients: []


### 1.5 Sanity Check

Check that everything worked as planned by rerunning the patient ID comparison between train and validation sets. When you run the next two cells you should see that there are now fewer records in the validation set and that the overlap problem has been removed!

In [13]:
# Extract patient IDs for the validation set
ids_valid = valid_df.PatientId.values
# Create a set of the validation IDs to identify the unique IDs
ids_valid_set = set(ids_valid)
print(f'There are {len(ids_valid_set)} unique Patient IDs in the validation set')

There are 86 unique Patient IDs in the validation set


In [15]:
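# Identify patient overlap by looking at the intersection between the sets
# of training and validation patient IDs (train_ids was extracted in section 1.2;
# train_overlap holds row indices, not patient IDs, so it must not be used here)
patient_overlap = list(set(train_ids).intersection(ids_valid_set))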
n_overlap = len(patient_overlap)
print(f'There are {n_overlap} Patient IDs in both the training and validation sets')

There are 0 Patient IDs in both the training and validation sets
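
The check in this section can be wrapped in a small reusable function. The sketch here is one possible shape for such a helper; the name `has_patient_overlap` and its `patient_col` parameter are assumptions made for illustration, and the dataframes are toy data.

```python
import pandas as pd

def has_patient_overlap(df1, df2, patient_col='PatientId'):
    """Return True if any patient ID appears in both dataframes."""
    ids_1 = set(df1[patient_col])
    ids_2 = set(df2[patient_col])
    return len(ids_1.intersection(ids_2)) > 0

# Toy frames, made up for illustration (not the course data)
leaky_train = pd.DataFrame({'PatientId': [0, 1, 2]})
leaky_valid = pd.DataFrame({'PatientId': [2, 3, 4]})
clean_valid = pd.DataFrame({'PatientId': [3, 4, 5]})

print(has_patient_overlap(leaky_train, leaky_valid))  # True: patient 2 is in both
print(has_patient_overlap(leaky_train, clean_valid))  # False: no shared IDs
```

Comparing patient IDs directly, rather than row indices, is what makes the check meaningful: row indices from different dataframes can coincide without the patients being the same.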

