## ML Workflow

In [10]:
# Day 1: ML Workflow & Train/Test Split
# Summary: learn the workflow, why we split the data, how to make results reproducible, and how to avoid leakage.

### Imports and synthetic data (code)

In [3]:
# 1) Imports: pandas & numpy for data, sklearn helper for splitting
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [6]:
# 2) Making random numbers reproducible
np.random.seed(0)

In [7]:
# 3) Creating a small synthetic dataset (100 rows) with two features
n = 100
X = pd.DataFrame({
    "sqft": np.random.normal(1500, 300, size=n),   # continuous feature
    "age": np.random.randint(0, 30, size=n)        # integer feature
})

In [8]:
# 4) Creating a synthetic target using a known linear formula + noise
y = 100 * X["sqft"] - 500 * X["age"] + np.random.normal(0, 10000, size=n)

# 5) Showing first rows to inspect
X.head(), y.head()

(          sqft  age
 0  2029.215704    0
 1  1620.047163    4
 2  1793.621395   27
 3  2172.267960   27
 4  2060.267397   25,
 0    217736.703133
 1    176597.482658
 2    165863.535643
 3    218785.179604
 4    194571.948938
 dtype: float64)

### Train/Test split (code)

In [9]:
# 6) Splitting the dataset into train (80%) and test (20%) sets; random_state fixes the shuffle

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7) Printing shapes to confirm split counts

print("X shape:", X.shape)
print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)


X shape: (100, 2)
X_train: (80, 2) X_test: (20, 2)
y_train: (80,) y_test: (20,)
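The reproducibility claim can be checked directly. The sketch below rebuilds a similar synthetic dataset under illustrative names (`X_demo`, `y_demo`, which are assumptions and not part of the notebook above) and confirms two things: the same `random_state` selects the same test rows, and train and test never share a row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data shaped like the notebook's dataset (names are hypothetical)
np.random.seed(0)
n_demo = 100
X_demo = pd.DataFrame({
    "sqft": np.random.normal(1500, 300, size=n_demo),
    "age": np.random.randint(0, 30, size=n_demo),
})
y_demo = 100 * X_demo["sqft"] - 500 * X_demo["age"]

# Two splits with the same random_state should select identical rows
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = a_test.index.equals(b_test.index)

# Train and test are disjoint: no row index appears in both
overlap = set(a_train.index) & set(a_test.index)

print("Same test rows with same random_state:", same_split)
print("Rows shared by train and test:", len(overlap))
```

Changing `random_state` (or leaving it unset) would generally give a different shuffle, which is why a fixed value is useful when comparing runs.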


### Data leakage example (bad vs good) (code)

In [11]:
from sklearn.preprocessing import StandardScaler

# Bad: fit scaler on the entire dataset -> test-set statistics leak into the scaler
scaler_bad = StandardScaler()
X_scaled_bad = scaler_bad.fit_transform(X)

# Good: fit scaler only on train, then transform test
scaler_good = StandardScaler()
X_train_scaled = scaler_good.fit_transform(X_train)  # learn parameters from train
X_test_scaled = scaler_good.transform(X_test)        # apply same transform to test

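# Print means of the scaled data. Both are ~0 because each scaler is checked on the
# data it was fit on; the leak is in the fitted parameters (scaler_bad.mean_ uses test rows).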
print("Bad scaler means:", np.round(X_scaled_bad.mean(axis=0), 3))
print("Good scaler train means (approx 0):", np.round(X_train_scaled.mean(axis=0), 3))


Bad scaler means: [-0.  0.]
Good scaler train means (approx 0): [ 0. -0.]
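Because both printed means are roughly zero, the output above does not make the leak visible by itself. A minimal sketch of where the difference shows up, using illustrative names (`X_demo`, `scaler_all`, `scaler_tr` are assumptions, not from the cells above): compare the parameters each scaler learned, and look at the test set after scaling it with train-only statistics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data shaped like the notebook's dataset (names are hypothetical)
np.random.seed(0)
X_demo = pd.DataFrame({
    "sqft": np.random.normal(1500, 300, size=100),
    "age": np.random.randint(0, 30, size=100),
})
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler_all = StandardScaler().fit(X_demo)  # leaky: sees the test rows
scaler_tr = StandardScaler().fit(X_tr)     # clean: sees train rows only

# The learned means differ because the leaky scaler also averaged over test rows
print("Means learned from all rows:  ", np.round(scaler_all.mean_, 2))
print("Means learned from train only:", np.round(scaler_tr.mean_, 2))

# Test data scaled with train-only statistics is generally not centered at exactly 0
X_te_scaled = scaler_tr.transform(X_te)
print("Test means after train-only scaling:", np.round(X_te_scaled.mean(axis=0), 3))
```

Non-zero test means here are expected and harmless: they reflect that the test set is treated as unseen data rather than as part of the fitting step.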


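**Takeaway**: Always split BEFORE fitting any preprocessing step, and fit that step on the training set only. A scikit-learn `Pipeline` enforces this order automatically, because `fit` touches only the data you pass to it.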
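One way to follow the pipeline advice is sketched below. It assumes a `LinearRegression` model, which the notebook does not use, and illustrative variable names (`X_demo`, `y_demo`, `model`); treat it as an example of the pattern rather than a prescribed setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data built with the same formula as the notebook (names are hypothetical)
np.random.seed(0)
n_demo = 100
X_demo = pd.DataFrame({
    "sqft": np.random.normal(1500, 300, size=n_demo),
    "age": np.random.randint(0, 30, size=n_demo),
})
y_demo = 100 * X_demo["sqft"] - 500 * X_demo["age"] + np.random.normal(0, 10000, size=n_demo)

# Split first, then let the pipeline handle preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler inside the pipeline is fit during model.fit, on the training rows only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# score() only transforms the test rows with the already-fitted scaler, then predicts
r2 = model.score(X_te, y_te)
print("Test R^2:", round(r2, 3))
```

Because scaling and fitting live in one object, the same pipeline can be passed to utilities such as `cross_val_score`, which refit the scaler inside each fold and so avoid leakage there as well.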
