## Integrative Data Analysis Tutorial 4: Data Leakage

You are a reviewer at a science journal, tasked with evaluating three submitted studies that use machine learning (ML) approaches.
In all three studies, the results look suspiciously good. Can you spot the data leakage? Fix it and determine the performance metrics without leakage.



### First Study

The first study proposes a model to predict house prices from attributes of the houses.

In [7]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import train_test_split

# load the dataset
data = pd.read_csv('house_price_data.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Num_rooms       100 non-null    int64  
 1   Square_footage  100 non-null    int64  
 2   Location        100 non-null    object 
 3   Listing_year    100 non-null    int64  
 4   Price           100 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB


In [None]:


# Define features and target variable
X = data.drop(columns='Price')
y = data['Price']

# One-hot encode the 'Location' column using pd.get_dummies
X = pd.get_dummies(X, columns=['Location'], drop_first=True)  # Drop first to avoid multicollinearity

y_scaled = StandardScaler().fit_transform(y.values.reshape(-1, 1)).flatten()

# Define the Ridge regression model
model = Ridge()
model = model.fit(X, y_scaled)


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_scaled, test_size=0.3, random_state=42)
preds = model.predict(X_test)  # predict only on the test set
mse = mean_squared_error(y_test, preds)
print(f"Mean squared error: {mse:.2f}")


Mean squared error: 0.73
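
One possible source of leakage in this study is the order of operations: the model is fitted on the full dataset, and the target scaler sees every price, before the train/test split happens. The sketch below shows a leakage-free ordering under stated assumptions. It uses a synthetic stand-in (`demo`, with made-up values and hypothetical location labels) rather than `house_price_data.csv`, so the MSE it prints is not the corrected result for the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for house_price_data.csv (assumption: all values are made up)
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Num_rooms": rng.integers(1, 8, n),
    "Square_footage": rng.integers(500, 3500, n),
    "Location": rng.choice(["Urban", "Suburban", "Rural"], n),  # hypothetical labels
    "Listing_year": rng.integers(2000, 2023, n),
})
demo["Price"] = (
    50_000
    + 150 * demo["Square_footage"]
    + 10_000 * demo["Num_rooms"]
    + rng.normal(0, 20_000, n)
)

# One-hot encoding does not use the target, so it is done before the split here
X = pd.get_dummies(demo.drop(columns="Price"), columns=["Location"],
                   drop_first=True, dtype=float)
y = demo["Price"]

# 1. Split first, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Fit the target scaler on the training prices only, then apply it to both sets
y_scaler = StandardScaler().fit(y_train.to_numpy().reshape(-1, 1))
y_train_scaled = y_scaler.transform(y_train.to_numpy().reshape(-1, 1)).ravel()
y_test_scaled = y_scaler.transform(y_test.to_numpy().reshape(-1, 1)).ravel()

# 3. Fit the model on the training set only
model = Ridge().fit(X_train, y_train_scaled)

# 4. Evaluate on the held-out test set
preds = model.predict(X_test)
mse = mean_squared_error(y_test_scaled, preds)
print(f"Mean squared error (no leakage, synthetic data): {mse:.2f}")
```

To apply the same ordering to the study, replace `demo` with the DataFrame loaded from `house_price_data.csv`; the column names are taken from the `data.info()` output above.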


### Second Study

The second study proposes using a LogisticRegression classification model to help quickly diagnose liver disease in patients arriving at a hospital. The data comes from the hospital's historical medical records.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# Load the dataset
data = pd.read_csv('diagnosis_data.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        2000 non-null   int64  
 1   Prednisolone_administered  2000 non-null   int64  
 2   BloodPressure              2000 non-null   int64  
 3   BMI                        2000 non-null   float64
 4   Cholesterol                2000 non-null   int64  
 5   Amoxicillin_administered   2000 non-null   int64  
 6   Diagnosis_LiverDisease     2000 non-null   int64  
 7   Ibuprofen_administered     2000 non-null   int64  
dtypes: float64(1), int64(7)
memory usage: 125.1 KB


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Define features and target variable
X = data.drop(columns='Diagnosis_LiverDisease')
y = data['Diagnosis_LiverDisease']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

# Scale the data
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# Define and fit the model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.97
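
A candidate for leakage here is the set of `*_administered` columns: if a drug was prescribed in response to the diagnosis, that information would not exist for an incoming patient, yet it gives the label away during training. Which of these columns (if any) actually leaks cannot be determined from the column names alone, so the sketch below is only an illustration. It builds a synthetic dataset (`demo`) in which `Prednisolone_administered` is assumed to follow the diagnosis 95% of the time, then compares a model using all columns against one restricted to features available at admission.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diagnosis_data.csv (assumption: all values are made up)
rng = np.random.default_rng(1)
n = 2000
diagnosis = rng.integers(0, 2, n)
demo = pd.DataFrame({
    "Age": rng.integers(18, 90, n),
    "BloodPressure": rng.integers(90, 180, n),
    "BMI": rng.normal(27, 4, n).round(1),
    "Cholesterol": rng.integers(150, 300, n),
    # Assumption for illustration: this drug is given mostly AFTER a positive diagnosis
    "Prednisolone_administered": np.where(rng.random(n) < 0.95, diagnosis, 1 - diagnosis),
    "Amoxicillin_administered": rng.integers(0, 2, n),
    "Ibuprofen_administered": rng.integers(0, 2, n),
    "Diagnosis_LiverDisease": diagnosis,
})


def evaluate(features: pd.DataFrame) -> float:
    """Train on 70% of the rows and return accuracy on the remaining 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, demo["Diagnosis_LiverDisease"], test_size=0.3, random_state=43
    )
    # The pipeline fits the scaler on the training rows only
    clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


all_features = demo.drop(columns="Diagnosis_LiverDisease")
treatment_cols = [c for c in all_features.columns if c.endswith("_administered")]
admission_features = all_features.drop(columns=treatment_cols)

acc_leaky = evaluate(all_features)
acc_clean = evaluate(admission_features)
print(f"Accuracy with treatment columns:    {acc_leaky:.2f}")
print(f"Accuracy without treatment columns: {acc_clean:.2f}")
```

Dropping every treatment column is the conservative choice in this sketch; with the real records, a reviewer would check for each drug whether it is administered before or after the diagnosis.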


### Third Study

The third study investigates a model that predicts the efficacy of a drug using various patient characteristics (age, weight, blood pressure, etc.). The study includes data from multiple measurements taken for each patient.

In [None]:
data = pd.read_csv('drug_efficacy_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Patient_ID      50 non-null     int64  
 1   Age             50 non-null     int64  
 2   Weight          50 non-null     int64  
 3   Blood_Pressure  50 non-null     int64  
 4   Heart_Rate      50 non-null     int64  
 5   Drug_Dosage     50 non-null     int64  
 6   Drug_Efficacy   50 non-null     float64
dtypes: float64(1), int64(6)
memory usage: 2.9 KB


In [None]:


# Define features and target variable
X = data.drop(columns=['Drug_Efficacy', 'Patient_ID'])
y = data['Drug_Efficacy']


# Randomly split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model on the randomized training data
model = Ridge(alpha=1)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")







Mean Squared Error: 0.19
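
Because each patient contributes several measurements, a purely random row split can place the same patient in both the training and the test set, which is a plausible source of leakage in this study. A group-aware split keeps all rows of a patient on one side. The sketch below uses scikit-learn's `GroupShuffleSplit` on a synthetic stand-in (`demo`, 10 hypothetical patients with 5 visits each and a made-up per-patient effect), so its MSE is not the corrected result for the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for drug_efficacy_data.csv (assumption: all values are made up)
rng = np.random.default_rng(2)
n_patients, n_visits = 10, 5
n_rows = n_patients * n_visits
demo = pd.DataFrame({
    "Patient_ID": np.repeat(np.arange(n_patients), n_visits),
    # Age and weight are constant per patient across visits
    "Age": np.repeat(rng.integers(20, 80, n_patients), n_visits),
    "Weight": np.repeat(rng.integers(50, 110, n_patients), n_visits),
    "Blood_Pressure": rng.integers(100, 160, n_rows),
    "Heart_Rate": rng.integers(55, 100, n_rows),
    "Drug_Dosage": rng.integers(10, 100, n_rows),
})
# Hypothetical per-patient effect that makes rows of one patient resemble each other
patient_effect = np.repeat(rng.normal(0, 1, n_patients), n_visits)
demo["Drug_Efficacy"] = (
    0.02 * demo["Drug_Dosage"] + patient_effect + rng.normal(0, 0.2, n_rows)
)

X = demo.drop(columns=["Drug_Efficacy", "Patient_ID"])
y = demo["Drug_Efficacy"]
groups = demo["Patient_ID"]

# Split by patient: every row of a given patient lands in exactly one of the two sets
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Sanity check: no patient appears on both sides of the split
train_patients = set(groups.iloc[train_idx])
test_patients = set(groups.iloc[test_idx])
assert train_patients.isdisjoint(test_patients)

model = Ridge(alpha=1).fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Mean Squared Error (grouped split, synthetic data): {mse:.2f}")
```

With only a handful of patients, a single grouped split gives a noisy estimate; `GroupKFold` from the same module is an alternative that averages over several patient-wise folds.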
