### Project 2: Using a Pipeline for Data Preprocessing and Modeling
#### Objective: Create a scikit-learn `Pipeline` that first scales the data and then trains a `LogisticRegression` model.
#### Technique: Use `from sklearn.pipeline import Pipeline` and define a list of steps such as `[('scaler', StandardScaler()), ('logreg', LogisticRegression())]`.
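
The step names in that list are more than labels. As a minimal sketch (the variable name `pipe` is only illustrative), the snippet below builds the same two-step pipeline without fitting it and shows where the names `'scaler'` and `'logreg'` reappear: in `steps`, in `named_steps`, and as prefixes of the parameter names returned by `get_params()`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative, unfitted pipeline; `pipe` is a name chosen for this sketch
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression()),
])

# The steps are stored in order as (name, estimator) tuples
step_names = [name for name, _ in pipe.steps]
print("Step names:", step_names)

# Each step can be looked up by its name
print("Scaler step:", type(pipe.named_steps['scaler']).__name__)

# Step parameters are exposed as <step name>__<parameter>
logreg_params = sorted(k for k in pipe.get_params() if k.startswith('logreg__'))
print("Some logreg parameters:", logreg_params[:3])
```

The `<step name>__<parameter>` form is what hyperparameter search tools expect when a pipeline is tuned as a whole.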

##### Step 1: Import necessary libraries

In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

##### Step 2: Generate a binary classification dataset

In [8]:
X, y = make_classification(
    n_samples=500,       # number of data points
    n_features=2,        # use 2 features for simplicity
    n_informative=2,     # both features are useful
    n_redundant=0,       # no redundant features
    random_state=42
)
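
Before moving on, it can help to confirm what `make_classification` returned. The following sketch repeats the call above and prints the array shapes and the number of samples per class. Using `np.bincount` for the class counts is a choice made here, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the cell above
X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)

print("X shape:", X.shape)   # (n_samples, n_features)
print("y shape:", y.shape)   # one label per sample

# Count how many samples fall into each class (labels are 0 and 1)
class_counts = np.bincount(y)
print("Samples per class:", class_counts)
```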

##### Step 3: Split data into training and testing sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
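
With `test_size=0.2`, 400 of the 500 samples go to training and 100 to testing. The sketch below checks those sizes and also shows the optional `stratify=y` argument. The original split does not use it; it is included only as an assumption about what a reader might want in order to keep the class proportions similar in both subsets. The `_s` suffixed names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2,
    n_redundant=0, random_state=42,
)

# Split exactly as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Optional variant (not in the original): preserve class proportions
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Share of class 1, plain split:     ", y_test.mean())
print("Share of class 1, stratified split:", y_test_s.mean())
```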

##### Step 4: Create a pipeline that scales then trains

In [10]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),             # Pipeline step 1: standardize the features
    ('logreg', LogisticRegression())          # Pipeline step 2: fit logistic regression
])
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
# The pipeline automatically scales the data before training

# Predict on test data
y_pred = pipeline.predict(X_test)
# .predict() can be called directly; the pipeline scales X_test with the training-set statistics first

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Predicted labels:", y_pred)
print("Actual labels:", y_test)
print("Model Accuracy:", round(accuracy * 100, 2), "%")

# View the pipeline steps
print("\nPipeline Steps:")
print(pipeline.named_steps)

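# Why use a Pipeline:
# - Avoids data leakage: the scaler is fit on the training data only
# - Keeps preprocessing and modeling in one clean, reusable object
# - Works with grid search: a param_grid can tune step parameters (e.g. logreg__C) later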

Predicted labels: [1 1 1 0 1 1 1 1 0 1 0 1 0 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1
 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 1
 0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0 1 0]
Actual labels: [1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 1
 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 1
 0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1 0]
Model Accuracy: 88.0 %

Pipeline Steps:
{'scaler': StandardScaler(), 'logreg': LogisticRegression()}
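
Two of the benefits noted in the comments above can be checked directly. The sketch below regenerates the same data, then (1) compares the pipeline's predictions with a hand-written version that fits `StandardScaler` on the training set only, and (2) runs a small `GridSearchCV` over the pipeline. The grid values for `logreg__C`, the 5-fold setting, and names such as `manual_pred` and `search` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as in the steps above
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('logreg', LogisticRegression())])
pipeline.fit(X_train, y_train)
pipe_pred = pipeline.predict(X_test)

# (1) Manual equivalent: fit the scaler on the training data only,
#     then reuse its statistics to transform the test data
scaler = StandardScaler().fit(X_train)
manual_model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))
same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Same predictions as manual scaling:", same_predictions)

# (2) Grid search over the regularization strength of the 'logreg' step
#     (illustrative grid; parameter names use <step name>__<parameter>)
param_grid = {'logreg__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
test_score = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy of best pipeline:", test_score)
```

Inside each cross-validation fold, the scaler is refit on that fold's training portion, which is why tuning through the pipeline does not leak validation data into the scaling step.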
