<a href="https://colab.research.google.com/github/asupraja3/py-ml-toolkit-collab/blob/main/ScikitLearn_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 What is Scikit-learn?

**Scikit-learn** is a free, open-source machine learning library for Python that provides simple and efficient tools for data mining and data analysis. It's built on top of **NumPy**, **SciPy**, and **matplotlib**, and is one of the most widely used ML libraries in the Python ecosystem.

---

## 🔧 Key Features

- **Classification** – Identifying labels or categories  
  *Example: Email spam detection, disease diagnosis*

- **Regression** – Predicting continuous values  
  *Example: Predicting house prices, stock prices*

- **Clustering** – Grouping similar data points without labels  
  *Example: Customer segmentation using KMeans*

- **Dimensionality Reduction** – Reducing the number of features  
  *Example: PCA (Principal Component Analysis)*

- **Model Selection** – Tools for choosing and validating models  
  *Example: Cross-validation, GridSearchCV*

- **Preprocessing** – Scaling, transforming, and encoding data  
  *Example: StandardScaler, LabelEncoder, OneHotEncoder*

---

## 📚 Use Cases

- Predicting housing prices using regression
- Detecting fraudulent transactions using classification
- Segmenting customers with clustering
- Compressing data features with PCA
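
Clustering and PCA appear in the lists above but are not demonstrated elsewhere in this notebook. The following is a minimal sketch, assuming the built-in iris data as a stand-in for customer features; the choice of 3 clusters and 2 components is illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the classic iris measurements (150 rows, 4 features); labels are ignored here
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group rows into 3 clusters without using any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_scaled)

# Dimensionality reduction: compress 4 features into 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Cluster ids of first 5 rows:", cluster_ids[:5])
print("Reduced shape:", X_2d.shape)
print("Share of variance kept:", pca.explained_variance_ratio_.sum())
```

Both estimators follow the same fit/transform pattern as the supervised models later in the notebook.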

---

## 💡 Why Use Scikit-learn?

- ✅ Beginner-friendly API
- ✅ Consistent model syntax (.fit, .predict, .score)
- ✅ Excellent documentation & community support
- ✅ Integrates well with pandas, NumPy, and Jupyter
- ✅ Great for fast prototyping and academic research
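
The "consistent model syntax" point can be sketched as a loop that treats three different classifiers identically. The model choices and hyperparameters here are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three unrelated algorithms, one shared interface
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                      # same call for every estimator
    preds = model.predict(X_te)                # hard class labels
    results[name] = model.score(X_te, y_te)    # mean accuracy for classifiers
    print(f"{name}: accuracy={results[name]:.2f}, first predictions={preds[:3]}")
```

Swapping one algorithm for another only changes the constructor line, which is what makes quick prototyping practical.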

---




# **1. Imports and Setup**
Interview Context: AI Engineers need to know the standard stack (numpy, pandas, sklearn).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, make_regression, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

print("Libraries loaded successfully. Ready for Financial Modeling.")

Libraries loaded successfully. Ready for Financial Modeling.


# **2: Dataset Loading (The Basics + Financial Context)**
Topic: load_iris, train_test_split

Interview Context: You were asked about load_iris (a classic benchmark), but in a Citadel interview, you'd deal with tabular financial data.

In [2]:
# 1. The requested 'load_iris' practice (Standard Interview Check)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print(f"Iris Data Shape: {X_iris.shape}")

# 2. FINANCIAL SCENARIO: Credit Application Data (Citadel Context)
# make_classification generates synthetic numbers; treat the 4 columns as stand-ins for
# [Credit Score, Annual Income, Debt-to-Income Ratio, Years Employed]
# Target: 1 (Default/Reject), 0 (Repay/Approve)
X_fin, y_fin = make_classification(n_samples=1000, n_features=4, random_state=42)

# Splitting the data (Crucial to prevent data leakage)
# test_size=0.2 holds out 20% of the rows for final evaluation; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(X_fin, y_fin, test_size=0.2, random_state=42)

print("Financial Data split complete.")
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

Iris Data Shape: (150, 4)
Financial Data split complete.
Training samples: 800
Testing samples: 200
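
One option the split above does not use is `stratify`. When classes are imbalanced, as default data usually is, a plain random split can leave the test set with a different default rate than the training set. A minimal sketch, assuming a synthetic 90/10 class mix chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (illustrative proportions)
X, y = make_classification(
    n_samples=1000, n_features=4, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class ratio nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.3f}")
print(f"Positive rate in test:  {test_rate:.3f}")
```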


# **3: Preprocessing (Scaling & Encoding)**
Topic: Scaling (StandardScaler), Encoding

Interview Context: Algorithms like KNN and SVM calculate "distance." If Income is 100,000 and Credit Score is 700, the Income variable will dominate the model purely because the number is bigger. Scaling fixes this.

In [3]:
# 1. Scaling (Standardization)
# We fit the scaler ONLY on training data to avoid looking into the future (Data Leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data Scaled (Mean approx 0, Variance approx 1)")
print(f"Sample scaled feature row: {X_train_scaled[0]}")

# 2. Encoding (For categorical data)
# Scenario: Converting 'Sector' (Tech, Energy, Finance) into numbers
sectors = pd.DataFrame({'sector': ['Tech', 'Energy', 'Finance', 'Tech']})
encoder = OneHotEncoder(sparse_output=False)
sectors_encoded = encoder.fit_transform(sectors)

print("\nOne-Hot Encoded Sectors:")
print(sectors_encoded)

Data Scaled (Mean approx 0, Variance approx 1)
Sample scaled feature row: [-0.48469871  0.40799652 -0.15113754  0.2878111 ]

One-Hot Encoded Sectors:
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
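
To make the "Income dominates the distance" point concrete, here is a small sketch with five hypothetical applicants (all values invented for illustration). In raw units, applicant A looks closer to B, whose credit score is 300 points lower, than to C, whose score is identical. After standardization the ordering flips.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants: [annual income, credit score]
applicants = np.array([
    [100_000.0, 700.0],   # A
    [100_500.0, 400.0],   # B: similar income to A, very different credit score
    [105_000.0, 700.0],   # C: same credit score as A, somewhat higher income
    [60_000.0, 650.0],
    [150_000.0, 780.0],
])


def dist(rows, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(rows[i] - rows[j]))


# Raw units: income differences swamp credit-score differences
raw_ab, raw_ac = dist(applicants, 0, 1), dist(applicants, 0, 2)

# Standardized units: each feature contributes on a comparable scale
scaled = StandardScaler().fit_transform(applicants)
scaled_ab, scaled_ac = dist(scaled, 0, 1), dist(scaled, 0, 2)

print(f"Raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"Scaled: d(A,B)={scaled_ab:.2f}  d(A,C)={scaled_ac:.2f}")
```

A KNN model on the raw data would therefore treat B as A's nearest neighbor, which is the failure scaling is meant to prevent.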


# **4: Regression Models (CDS Pricing)**
Topic: Linear, Ridge, Lasso

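Interview Context: Ridge and Lasso are "Regularization" techniques. They reduce overfitting by penalizing large coefficients: Ridge adds an L2 penalty that shrinks coefficients toward zero, while Lasso adds an L1 penalty that can set some coefficients to exactly zero.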

In [4]:
# Creating continuous data for Regression (CDS Price Prediction)
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
# random_state fixes the split so the scores below are reproducible
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# 1. Linear Regression (Base model)
lr = LinearRegression()
lr.fit(X_train_reg, y_train_reg)
print(f"Linear Regression Score: {lr.score(X_test_reg, y_test_reg):.4f}")

# 2. Ridge (L2 Regularization - handles multicollinearity in financial markers)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_reg, y_train_reg)
print(f"Ridge Regression Score: {ridge.score(X_test_reg, y_test_reg):.4f}")

# 3. Lasso (L1 Regularization - Feature selection)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_reg, y_train_reg)
print(f"Lasso Regression Score: {lasso.score(X_test_reg, y_test_reg):.4f}")
print("Note: Lasso is useful to identify which economic indicators actually matter.")

Linear Regression Score: 1.0000
Ridge Regression Score: 1.0000
Lasso Regression Score: 1.0000
Note: Lasso is useful to identify which economic indicators actually matter.
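
The scores above cannot separate the three models because the synthetic data is nearly noise-free. The practical difference shows up in the coefficients. A minimal sketch, assuming a synthetic dataset where only 3 of 10 features carry signal (the `alpha` values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients but keeps them non-zero; Lasso can zero them out
ridge_nonzero = int(np.sum(ridge.coef_ != 0))
lasso_nonzero = int(np.sum(lasso.coef_ != 0))

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print(f"Non-zero coefficients: ridge={ridge_nonzero}, lasso={lasso_nonzero}")
```

Features whose Lasso coefficient is exactly zero are the ones the model has effectively dropped.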


# **5: Classification Models (Credit Approval)**
Topic: Logistic, KNN, SVM, Decision Tree

Scenario: Classifying a trade or loan application as "Safe" (0) or "Risky" (1).

In [5]:
# Using X_train_scaled from Cell 3 and y_train from Cell 2

# 1. Logistic Regression (Simple, interpretable probabilities)
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

# 2. KNN (Finds similar historical loans)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# 3. SVM (Finds the best boundary/margin between Safe and Risky)
svm = SVC(kernel='linear')
svm.fit(X_train_scaled, y_train)

# 4. Decision Tree (Rule-based: If Score > 700 AND Income > 50k...)
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train_scaled, y_train)

print("All classification models trained successfully.")

All classification models trained successfully.


# **6: Model Evaluation**
Topic: Accuracy, Confusion Matrix, Classification Report

Interview Context: Accuracy alone is dangerous in finance. If 99% of loans are safe, a model that simply guesses "Safe" every time scores 99% accuracy yet catches none of the risky loans. Precision and Recall expose that failure.

In [6]:
# Make predictions using SVM
y_pred = svm.predict(X_test_scaled)

# 1. Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")

# 2. Confusion Matrix
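# Rows are actual classes, columns are predicted classes (1 = Default/Reject is the positive class):
# [[True Neg (Repayer correctly approved), False Pos (Repayer wrongly rejected)],
#  [False Neg (Defaulter wrongly approved), True Pos (Defaulter correctly rejected)]]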
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# 3. Classification Report (Precision, Recall, F1-Score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.88

Confusion Matrix:
[[92  9]
 [15 84]]

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.91      0.88       101
           1       0.90      0.85      0.88        99

    accuracy                           0.88       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.88      0.88      0.88       200
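
The dataset above is roughly balanced, so it does not show the accuracy trap described at the top of this section. The sketch below uses a hypothetical portfolio of 990 safe loans and 10 risky ones, with a "model" that always answers Safe.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical portfolio: 990 safe loans (0) and 10 risky ones (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that predicts Safe for every loan
y_always_safe = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_safe)
# zero_division=0 returns 0 instead of warning when no positives are predicted
rec = recall_score(y_true, y_always_safe, zero_division=0)
prec = precision_score(y_true, y_always_safe, zero_division=0)

print(f"Accuracy:  {acc:.2f}")
print(f"Recall:    {rec:.2f}")
print(f"Precision: {prec:.2f}")
```

Accuracy looks excellent while recall on the risky class is zero, which is why the classification report matters more than the headline number.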



# **7: Cross-Validation & GridSearchCV**
Topic: Hyperparameter Tuning

Interview Context: How do you know k=5 is best for KNN? Or C=1.0 is best for SVM? You don't. You search for it.

In [7]:
# 1. Cross Validation
# Splits the data into 5 folds, trains on 4, validates on the remaining 1, and rotates 5 times.
scores = cross_val_score(svm, X_train_scaled, y_train, cv=5)
print(f"Cross-Validation Scores: {scores}")
print(f"Average CV Score: {scores.mean():.2f}")

# 2. GridSearchCV (Automated tuning)
# We want to find the best 'C' (regularization) and 'kernel' for SVM
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train_scaled, y_train)

print(f"\nBest Parameters found: {grid.best_params_}")
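# Note: best_score_ is the mean cross-validated accuracy of the best parameter set, not a test-set score
print(f"Best Estimator Accuracy: {grid.best_score_:.2f}")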

Cross-Validation Scores: [0.8375  0.89375 0.85625 0.81875 0.9    ]
Average CV Score: 0.86

Best Parameters found: {'C': 10, 'kernel': 'rbf'}
Best Estimator Accuracy: 0.87


# **8: Pipelines (The Production Standard)**
Topic: Pipeline

Interview Context: In production (Citadel systems), you never manually scale data then pass it to a model. You wrap them together. This ensures that when new real-time data comes in, it is automatically scaled using the exact same logic as training.

In [8]:
# Define the pipeline steps:
# Step 1: Scale the data
# Step 2: Apply the Classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit the WHOLE pipeline on raw data
pipeline.fit(X_train, y_train)

# Predict on raw test data (Pipeline handles scaling internally)
pipe_score = pipeline.score(X_test, y_test)

print(f"Pipeline Accuracy: {pipe_score:.2f}")
print("System ready for deployment.")

Pipeline Accuracy: 0.89
System ready for deployment.
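
A pipeline can also be passed to GridSearchCV, so the scaler is re-fit inside every cross-validation fold instead of once on all training rows. The sketch below assumes synthetic data and an illustrative grid; the step names `scaler` and `svc` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)  # scaling statistics are learned per fold, from training folds only

test_accuracy = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.2f}")
print(f"Held-out test accuracy: {test_accuracy:.2f}")
```

Compared with scaling first and searching afterwards, this keeps validation folds from influencing the scaler during tuning.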


## **1: Pipeline and ColumnTransformer**
Topic: Handling mixed data types (Numbers + Categories) simultaneously.

Interview Perspective: Real-world financial data is messy. You have "Trade Amount" (Number) and "Exchange ID" (Category). You cannot treat them the same. ColumnTransformer allows you to apply different preprocessing to different columns in parallel.

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

# Sample Financial Data: [Trade Amount, Risk Score, Exchange ID]
data = pd.DataFrame({
    'trade_amount': [10000, 50000, 2000, 15000],
    'risk_score': [0.1, 0.8, 0.2, 0.4],
    'exchange': ['NYSE', 'NASDAQ', 'NYSE', 'LSE']
})

# Define transformers
# 1. Numeric pipeline: Fill missing values -> Scale
num_trans = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# 2. Categorical pipeline: Fill missing -> OneHotEncode
cat_trans = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder())
])

# Combine them using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_trans, ['trade_amount', 'risk_score']),
        ('cat', cat_trans, ['exchange'])
    ])

# Apply transformations
processed_data = preprocessor.fit_transform(data)
print("Processed Data Shape (Rows, Cols):", processed_data.shape)
print("Data is now ready for the model.")

Processed Data Shape (Rows, Cols): (4, 5)
Data is now ready for the model.


## **2: Preprocessing Module (Scalers & Encoders)**
Topic: StandardScaler, MinMaxScaler, OneHotEncoder.

Interview Perspective:

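- StandardScaler: Rescales each feature to mean 0 and variance 1. A good default for scale-sensitive models (regularized Linear/Logistic Regression, SVM, KNN). It does not require normally distributed data, but extreme outliers distort the mean and standard deviation it relies on.

- MinMaxScaler: Best for neural networks or features that must stay in a bounded range (0 to 1).

- OneHot: Essential for converting text labels to binary vectors.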

In [10]:
from sklearn.preprocessing import MinMaxScaler

# Feature: [Asset Price, Daily Volume]
market_data = [[150.0, 2000000], [155.0, 2500000], [148.0, 1800000]]

# 1. StandardScaler (Mean=0, Variance=1)
# A common default for scale-sensitive models; note that extreme outliers skew the mean and std it uses
std_scaler = StandardScaler()
print("Standard Scaled:\n", std_scaler.fit_transform(market_data))

# 2. MinMaxScaler (Scales to range [0, 1])
# Used for Neural Networks or image data
min_max = MinMaxScaler()
print("\nMinMax Scaled:\n", min_max.fit_transform(market_data))

# 3. OneHotEncoder (Categorical -> Binary)
# Used for 'Buy', 'Sell', 'Hold' signals
signals = [['Buy'], ['Sell'], ['Hold'], ['Buy']]
enc = OneHotEncoder(sparse_output=False)
print("\nOneHot Encoded Signals:\n", enc.fit_transform(signals))

Standard Scaled:
 [[-0.33968311 -0.33968311]
 [ 1.35873244  1.35873244]
 [-1.01904933 -1.01904933]]

MinMax Scaled:
 [[0.28571429 0.28571429]
 [1.         1.        ]
 [0.         0.        ]]

OneHot Encoded Signals:
 [[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


## **3: Model Training (.fit, .predict, .predict_proba)**
Topic: The core training API.

Interview Perspective: predict() gives you a hard class (0 or 1). predict_proba() gives you the confidence. In finance, we care about confidence. If the model says "Fraud" with 51% confidence, we might ignore it. If 99%, we block the card.

In [11]:
from sklearn.linear_model import LogisticRegression

# Synthetic Data: [Debt Ratio, Credit Score] -> [0: No Default, 1: Default]
# Named X_credit / y_credit so the X_train / y_train split from Cell 2 is not overwritten
X_credit = [[0.1, 800], [0.8, 500], [0.2, 750], [0.9, 450]]
y_credit = [0, 1, 0, 1]
X_new_application = [[0.85, 480]] # A risky application

# 1. .fit() -> Trains the model
model = LogisticRegression()
model.fit(X_credit, y_credit)

# 2. .predict() -> Hard classification
prediction = model.predict(X_new_application)
print(f"Hard Prediction (0=Safe, 1=Default): {prediction[0]}")

# 3. .predict_proba() -> Probability of each class
# Output format: [Prob of Class 0, Prob of Class 1]
probs = model.predict_proba(X_new_application)
risk_percentage = probs[0][1] * 100
print(f"Risk Probability: {risk_percentage:.2f}%")

Hard Prediction (0=Safe, 1=Default): 1
Risk Probability: 99.99%
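
The "51% vs 99%" idea amounts to applying your own threshold to the output of predict_proba. A minimal sketch on synthetic data; `BLOCK_THRESHOLD` is a hypothetical policy value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression().fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
p_risky = clf.predict_proba(X_te)[:, 1]

BLOCK_THRESHOLD = 0.90  # hypothetical policy value

# A 0.5 cutoff reproduces the usual hard prediction for a binary model (barring exact ties)
flag_default = (p_risky > 0.5).astype(int)
# A stricter cutoff only flags cases the model is very confident about
flag_strict = (p_risky >= BLOCK_THRESHOLD).astype(int)

print(f"Flagged at 0.50: {flag_default.sum()} of {len(p_risky)}")
print(f"Flagged at {BLOCK_THRESHOLD:.2f}: {flag_strict.sum()} of {len(p_risky)}")
```

Raising the threshold trades recall for precision: fewer cases are flagged, and those that are flagged carry more confidence.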


## **4: Model Persistence (Saving/Loading)**
Topic: joblib vs pickle.

Interview Perspective: Training takes hours; prediction takes milliseconds. You train once, save the file, and load it into the production server. joblib is preferred in Scikit-Learn because it handles large numpy arrays more efficiently than Python's built-in pickle.

In [12]:
import joblib

# 1. Save the model to disk
# This writes the file 'credit_risk_model.pkl' to the current working directory
joblib.dump(model, 'credit_risk_model.pkl')
print("Model saved to disk.")

# ... System Restart / Transfer to Production Server ...

# 2. Load the model from disk (only load files you trust: unpickling can execute arbitrary code)
loaded_model = joblib.load('credit_risk_model.pkl')

# Verify it still works
check = loaded_model.predict([[0.1, 800]])
print(f"Loaded Model Prediction check: {check[0]}")

Model saved to disk.
Loaded Model Prediction check: 0
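
Combining this with the Pipelines section, it is usually safer to persist the whole pipeline rather than the bare model, so the fitted scaler travels with it. A minimal sketch, assuming synthetic data; the file name is hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Scaler and classifier are fitted together and saved as one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "credit_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw (unscaled) inputs, just like the original
same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

Saved models are generally tied to the library version that produced them, so it is worth recording the scikit-learn version next to the file.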


## **5: Working with Sparse Matrices**
Topic: Handling data that is mostly zeros.

Interview Perspective: In NLP (processing news headlines for sentiment) or when OneHotEncoding thousands of stock tickers, your matrix is 99% zeros. Storing every zero wastes RAM. Sparse matrices store only the non-zero values and their coordinates.

Scenario: Analyzing Bloomberg News headlines (Bag of Words).

In [13]:
from scipy import sparse
import numpy as np

# Create a dense matrix (Standard format)
# Imagine this is word counts for 3 documents
dense_matrix = np.array([
    [0, 0, 1, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0]
])

print(f"Dense size in bytes: {dense_matrix.nbytes}")

# Convert to Compressed Sparse Row (CSR) format
# Many scikit-learn estimators accept this format directly
sparse_matrix = sparse.csr_matrix(dense_matrix)

print(f"Sparse (CSR) matrix:\n{sparse_matrix}")
print("Note: Only non-zero elements are stored to save memory.")

Dense size in bytes: 96
Sparse (CSR) matrix:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2 stored elements and shape (3, 4)>
  Coords	Values
  (0, 2)	1
  (1, 1)	2
Note: Only non-zero elements are stored to save memory.
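
For the Bag of Words scenario, scikit-learn's CountVectorizer produces a sparse matrix directly, and it can be passed to an estimator without densifying. The headlines and sentiment labels below are made up for illustration and are far too few to train a meaningful model.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up headlines and sentiment labels (1 = positive, 0 = negative)
headlines = [
    "stocks rally as earnings beat forecasts",
    "markets plunge on recession fears",
    "tech shares surge after strong earnings",
    "bank shares tumble amid credit fears",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(headlines)  # SciPy sparse matrix of word counts

# Memory used by the CSR arrays vs. the equivalent dense array
sparse_bytes = X_counts.data.nbytes + X_counts.indices.nbytes + X_counts.indptr.nbytes
dense_bytes = X_counts.toarray().nbytes
print("Is sparse:", sparse.issparse(X_counts), "| shape:", X_counts.shape)
print(f"Sparse bytes: {sparse_bytes}, dense bytes: {dense_bytes}")

# The sparse matrix goes straight into the estimator
clf = LogisticRegression().fit(X_counts, labels)
pred = clf.predict(vectorizer.transform(["earnings beat forecasts"]))
print("Predicted label:", pred[0])
```

With a realistic vocabulary of tens of thousands of words, the gap between the two byte counts grows much larger than in this toy case.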
