# Scikit-learn
Scikit-learn is an open-source Python library for **machine learning**.

It provides tools for **data preprocessing, model building, evaluation, and model selection**; trained models can be saved and reloaded with `joblib`.


In [1]:
# !pip install scikit-learn
import sklearn
print('Scikit-learn version:', sklearn.__version__)

Scikit-learn version: 1.6.1


## 1. Working with Datasets
Scikit-learn provides many built-in datasets for practice.

In [2]:
from sklearn.datasets import load_iris, load_digits, fetch_california_housing
from sklearn.model_selection import train_test_split

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

Train shape: (120, 4) Test shape: (30, 4)
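
The `stratify=y` argument in the split above keeps the class proportions the same in both subsets. The following is a small sketch that checks this by counting labels per class; the variable names (`X_demo`, `y_tr`, and so on) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris has 50 samples for each of its 3 classes
X_demo, y_demo = load_iris(return_X_y=True)

# Same split settings as above; only the labels are kept for counting
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = np.bincount(y_tr)  # samples per class in the training set
test_counts = np.bincount(y_te)   # samples per class in the test set
print('Train class counts:', train_counts)
print('Test class counts:', test_counts)
```

With stratification each class contributes 40 training and 10 test samples, which matches the per-class support of 10 in the classification report later on.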


## 2. Preprocessing and Pipelines
Many models train more reliably when numeric features are scaled and categorical features are encoded first.

- **StandardScaler** → standardize features (zero mean, unit variance)
- **OneHotEncoder** → encode categorical data
- **ColumnTransformer** → apply different preprocessing to different columns
- **Pipeline** → chain preprocessing + model

In [3]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np

# Example: pipeline with scaler + logistic regression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipe.fit(X_train, y_train)
print('Accuracy:', pipe.score(X_test, y_test))

Accuracy: 0.9333333333333333
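
Iris has only numeric features, so the cell above does not exercise `OneHotEncoder` or `ColumnTransformer`. As a minimal sketch of how they combine, the example below uses a hypothetical toy array (`X_toy`) with two numeric columns and one column of category codes; the data and names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: columns 0 and 1 are numeric, column 2 holds category codes
X_toy = np.array([
    [1.0, 200.0, 0],
    [2.0, 180.0, 1],
    [3.0, 220.0, 2],
    [4.0, 210.0, 1],
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0, 1]),                     # scale numeric columns
        ('cat', OneHotEncoder(handle_unknown='ignore'), [2]),  # one-hot encode the codes
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = preprocess.fit_transform(X_toy)
print('Transformed shape:', X_out.shape)  # 2 scaled columns + 3 one-hot columns
```

A `ColumnTransformer` like this can be used as the first step of a `Pipeline`, in place of the single `StandardScaler` above.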


## 3. Supervised Learning
### Classification Example (Logistic Regression)
We evaluate the pipeline fitted above (`StandardScaler` + Logistic Regression) on the Iris test set.

In [4]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipe.predict(X_test)
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))

Confusion Matrix:
 [[10  0  0]
 [ 0  9  1]
 [ 0  1  9]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      0.90      0.90        10
           2       0.90      0.90      0.90        10

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30
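
The per-class numbers in the report follow directly from the confusion matrix. As a worked sketch, the matrix printed above is copied into NumPy by hand and the metrics are recomputed; rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [0, 9, 1],
               [0, 1, 9]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by predicted count per class
recall = true_positives / cm.sum(axis=1)     # divide by true count per class
accuracy = true_positives.sum() / cm.sum()   # correct predictions / all predictions

print('Precision:', precision)
print('Recall:', recall)
print('Accuracy:', accuracy)
```

This reproduces the 0.90 precision and recall for classes 1 and 2 and the overall accuracy of 28/30, about 0.93.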



### Regression Example (Ridge Regression)
We use Ridge Regression on the California Housing dataset.

In [5]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Load California Housing; X, y and the train/test split now refer to this dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Ridge regression (linear regression with an L2 penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = ridge.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2 Score:', r2_score(y_test, y_pred))

MSE: 0.5558034669932191
R2 Score: 0.575854961144014
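
To make the two metrics concrete, here is a small sketch that computes them by hand and compares the results with scikit-learn. The target and prediction arrays are made-up values, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('MSE (manual vs sklearn):', mse_manual, mean_squared_error(y_true, y_hat))
print('R2 (manual vs sklearn):', r2_manual, r2_score(y_true, y_hat))
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.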


## 4. Unsupervised Learning
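### Clustering with KMeans
Here `X` still holds the California Housing features from the regression example, not Iris. The features are unscaled, so the large-valued population column has the most influence on the clusters.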

In [6]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print('Cluster Centers:', kmeans.cluster_centers_[:2])

Cluster Centers: [[ 3.89652823e+00  3.08127517e+01  5.53773604e+00  1.10830438e+00
   9.43137651e+02  2.85455143e+00  3.57652953e+01 -1.19683919e+02]
 [ 4.15307075e+00  1.47826087e+01  5.46589053e+00  1.07384122e+00
   6.13240711e+03  7.37675093e+00  3.51687945e+01 -1.19051206e+02]]
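
KMeans relies on Euclidean distances, so features with large ranges (such as population, the fifth value in each center above) can dominate the result. One common remedy is to standardize the features first. The sketch below does this on Iris rather than the housing data, so that it runs without a download; `cluster_pipe` and the explicit `n_init=10` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, _ = load_iris(return_X_y=True)

# Standardize the features, then cluster in the scaled space
cluster_pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = cluster_pipe.fit_predict(X_iris)

print('Cluster sizes:', np.bincount(labels))
```

The cluster labels are arbitrary integers, so they are not expected to line up with the Iris class indices.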


### Dimensionality Reduction with PCA
PCA projects the eight California Housing features in `X` onto two components.

In [7]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print('Original shape:', X.shape, 'Reduced shape:', X_pca.shape)

Original shape: (20640, 8) Reduced shape: (20640, 2)
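
The reduced shape alone does not say how much information the two components keep. As a hedged sketch, the example below standardizes Iris (used here so that no download is needed) and inspects `explained_variance_ratio_`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_iris)

pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X_scaled)

variance_kept = pca_demo.explained_variance_ratio_.sum()
print('Explained variance ratio:', pca_demo.explained_variance_ratio_)
print('Total variance kept:', variance_kept)
```

Each entry of `explained_variance_ratio_` is the fraction of the total variance captured by that component, so the sum indicates how much is retained after the reduction.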


## 5. Model Selection
### Cross-validation & Hyperparameter Tuning

In [8]:
from sklearn.model_selection import cross_val_score, GridSearchCV

# Use the iris dataset for cross-validation with Logistic Regression
scores = cross_val_score(LogisticRegression(max_iter=200), iris.data, iris.target, cv=5)
print('CV Scores:', scores)
print('Mean Accuracy:', np.mean(scores))

# Grid search over Ridge's alpha on the California Housing training split
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=3)
grid.fit(X_train, y_train)
print('Best Params:', grid.best_params_)

CV Scores: [0.96666667 1.         0.93333333 0.96666667 1.        ]
Mean Accuracy: 0.9733333333333334
Best Params: {'alpha': 0.1}
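
Grid search also works on a whole `Pipeline`, which keeps the scaler inside each cross-validation fold. The sketch below assumes the step names `scaler` and `clf` (as in the earlier pipeline) and an illustrative grid for `C`; step parameters are addressed as `<step>__<parameter>`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

search_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

# 'clf__C' targets the C parameter of the 'clf' step
pipe_grid = {'clf__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(search_pipe, pipe_grid, cv=5)
search.fit(X_iris, y_iris)

print('Best Params:', search.best_params_)
print('Best CV Accuracy:', search.best_score_)
```

Because the scaler is refit on the training part of every fold, the validation folds do not leak into the preprocessing statistics.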


## 6. Model Persistence
Save and load trained models using `joblib`. Only load files from sources you trust, because unpickling can execute arbitrary code.

In [9]:
import joblib

joblib.dump(ridge, 'ridge_model.pkl')
loaded = joblib.load('ridge_model.pkl')
print('Loaded model R2:', loaded.score(X_test, y_test))

Loaded model R2: 0.575854961144014
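
A quick way to check a save/load round trip is to compare predictions before and after. The sketch below is self-contained: it trains a small Iris classifier and writes to a temporary directory, and the file name `iris_model.joblib` is only an example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X_iris, y_iris)

# Write to a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'iris_model.joblib')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(model.predict(X_iris), restored.predict(X_iris))
print('Predictions match:', same_predictions)
```

Saved models are generally tied to the scikit-learn version that created them, so it helps to record `sklearn.__version__` alongside the file.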


# **Fin.**