# Introduction to Scikit-learn (sklearn)

This notebook provides a comprehensive introduction to scikit-learn, covering its basic concepts, key features, and common usage patterns.

## What is Scikit-learn?
Scikit-learn is one of the most widely used machine learning libraries for Python. It:
- Provides simple and efficient tools for data mining and data analysis
- Is accessible to everybody and reusable in various contexts
- Is built on NumPy, SciPy, and matplotlib
- Is open source and commercially usable under the BSD license

<img src="images/scikit-learn-logo.png" alt="scikit-learn logo" width="400"/>

## 1. Setup and Installation

First, make sure the required packages are installed (for example with `pip install numpy pandas matplotlib seaborn scikit-learn`). Then run the following cell to import them:

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, model_selection, datasets, metrics, linear_model

## 2. Basic Concepts

Scikit-learn is built on a few key concepts:

1. **Estimators**: Objects that learn from data through a `fit` method
2. **Transformers**: Estimators that also transform data through a `transform` method
3. **Predictors**: Estimators that also make predictions through a `predict` method

Let's see these concepts in action with a simple example:

In [None]:
# Create sample data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

In [None]:
# Create a simple linear regression model (training happens in the next cell)
model = linear_model.LinearRegression()

In [None]:
# Train the model
model.fit(X, y)

In [None]:
# Make predictions
model.predict([[5]])

In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Data points")
plt.plot(X, model.predict(X), color="red", label="Linear regression")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Simple Linear Regression Example")
plt.legend()
plt.show()
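
The example above only exercises the predictor side of the API. As a minimal sketch (not part of the original example), the snippet below puts a transformer, `StandardScaler`, next to the `LinearRegression` predictor to show that both follow the same `fit` convention. The names `X_demo`, `y_demo`, `demo_scaler`, and `demo_reg` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data with the same shape as the example above
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit() learns statistics, transform() applies them
demo_scaler = StandardScaler()
demo_scaler.fit(X_demo)
X_demo_scaled = demo_scaler.transform(X_demo)

# Predictor: fit() learns parameters, predict() produces outputs
demo_reg = LinearRegression()
demo_reg.fit(X_demo, y_demo)
demo_prediction = demo_reg.predict(np.array([[5.0]]))

print("Learned mean:", demo_scaler.mean_)
print("Mean of scaled column:", X_demo_scaled.mean(axis=0))
print("Prediction for x=5:", demo_prediction)
```

Both objects are estimators because they implement `fit`; what they expose afterwards (`transform` or `predict`) determines their role.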

## 3. Core Features

### 3.1 Data Preprocessing

Scikit-learn provides various tools for preprocessing data. Let's explore some common preprocessing techniques:

In [None]:
# Create sample data
data = pd.DataFrame(
    {
        "feature1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "feature2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        "category": ["A", "C", "A", "D", "E", "B", "A", "F", "A", "B"],
    }
)

data

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=data, x="feature1", y="feature2", hue="category", style="category", s=100
)
x_vals = data["feature1"]
y_vals = data["feature2"]
plt.plot(x_vals, y_vals, color="red", linestyle="--", label="y")
plt.title("Scatter Plot of Feature1 vs Feature2 by Category")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(title="Category")
plt.show()

### 3.1.1 Standardization (Standard Scaler)

The `StandardScaler` in scikit-learn standardizes features by removing the mean and scaling to unit variance. In other words, it transforms the data so that each feature has a mean of 0 and a standard deviation of 1. Standardization is particularly useful when features have different units or ranges, because it puts them on a comparable scale so that no feature dominates the model simply because its values are larger.

Here's how `StandardScaler` works:
- **Mean removal**: Centers the data around zero.
- **Variance scaling**: Scales the data so that it has a standard deviation of 1.

### Example
If `X` is your data:
```python
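from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # learns the per-feature mean and standard deviation
X_scaled = scaler.fit_transform(X)  # fits on X, then returns the standardized values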
```

After scaling, `X_scaled` will have standardized values, which can improve the performance of algorithms sensitive to feature scales, such as gradient-based methods.
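
To make the two steps concrete, here is a small sketch that standardizes an array by hand and compares the result with `StandardScaler`. The array `X_raw` is made up for illustration. The manual version uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative two-feature array with very different ranges
X_raw = np.array(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
)

# Manual standardization: z = (x - mean) / std, computed per column
col_mean = X_raw.mean(axis=0)
col_std = X_raw.std(axis=0, ddof=0)  # population standard deviation
X_manual = (X_raw - col_mean) / col_std

# The same transformation through scikit-learn
X_sklearn = StandardScaler().fit_transform(X_raw)

print("Manual and sklearn results match:", np.allclose(X_manual, X_sklearn))
print("Column means after scaling:", X_sklearn.mean(axis=0))
print("Column standard deviations after scaling:", X_sklearn.std(axis=0))
```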

In [None]:
scaler = preprocessing.StandardScaler()
numerical_features = ["feature1", "feature2"]
scaled_data = scaler.fit_transform(data[numerical_features])

scaled_data_df = pd.DataFrame(scaled_data, columns=numerical_features)
scaled_data_df

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=scaled_data_df, x="feature1", y="feature2", s=100)
x_vals = scaled_data_df["feature1"]
y_vals = scaled_data_df["feature2"]
plt.plot(x_vals, y_vals, color="red", linestyle="--", label="Trend line")
plt.title("Scatter Plot of Scaled Feature1 vs Feature2")
plt.xlabel("Feature 1 (standardized)")
plt.ylabel("Feature 2 (standardized)")
plt.legend()
plt.show()

### 3.1.2 One-Hot Encoding

In [None]:
encoder = preprocessing.OneHotEncoder()
encoded_categories = encoder.fit_transform(data[["category"]])

In [None]:
encoded_categories.toarray()

In [None]:
encoded_categories_df = pd.DataFrame(
    encoded_categories.toarray(), columns=encoder.get_feature_names_out(["category"])
)


encoded_categories_df

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(encoded_categories_df, annot=True, cmap="YlGnBu", cbar=True)
plt.title("Category One-Hot Encoding Heatmap")
plt.xlabel("Categories")
plt.ylabel("Samples")
plt.show()

### 3.2 Model Selection and Evaluation

Let's explore how to split data, train models, and evaluate their performance:

### 3.2.1 Generating a Synthetic Dataset

- **`datasets.make_classification`**: This function generates a synthetic dataset with:
  - `n_samples=1000`: 1000 samples (data points).
  - `n_features=4`: 4 features (variables or predictors).
  - `n_classes=2`: 2 target classes (binary classification).
  - `random_state=42`: Ensures reproducibility of the dataset generation by fixing the random seed.
- **`X`**: Features of the dataset (input variables).
- **`y`**: Target variable (labels or outcomes).

In [None]:
# Create a synthetic classification dataset
X, y = datasets.make_classification(
    n_samples=1000, n_features=4, n_classes=2, random_state=42
)

X.shape, y.shape

In [None]:
sample_data = pd.DataFrame(X, columns=["x1", "x2", "x3", "x4"])
sample_data["y"] = y
sample_data.head()

In [None]:
# Pair plot
sns.pairplot(sample_data, hue="y", diag_kind="hist", corner=True)
plt.show()

- **`train_test_split`**: Splits the data into training and test sets.
  - `X` and `y` are the input features and target labels, respectively.
  - `test_size=0.2`: Allocates 20% of the data for testing and the remaining 80% for training.
  - `random_state=42`: Ensures reproducibility of the split.

  After this step:
  - **`X_train`**: 80% of the feature data for training.
  - **`X_test`**: 20% of the feature data for testing.
  - **`y_train`**: 80% of the target labels for training.
  - **`y_test`**: 20% of the target labels for testing.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)
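
As an optional variation that this notebook does not otherwise use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions roughly equal in both subsets. The sketch below regenerates the same synthetic data under separate, illustrative names (`X_demo`, `y_demo`, `X_tr`, and so on) so that it does not overwrite the variables above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as above, stored under illustrative names
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_classes=2, random_state=42
)

# stratify=y_demo asks for the same class balance in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train/test shapes:", X_tr.shape, X_te.shape)
print("Fraction of class 1 (all, train, test):")
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Stratification matters most when classes are imbalanced; on a roughly balanced dataset like this one the effect is small.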

- **`LogisticRegression()`**: Creates an instance of the Logistic Regression model (a classifier).
- **`fit(X_train, y_train)`**: Trains the model on the training data. The model learns the relationship between `X_train` (input features) and `y_train` (target labels).

In [None]:
# Train a logistic regression model
clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)

- **`predict(X_test)`**: Uses the trained model (`clf`) to predict the target labels (`y_pred`) for the test data (`X_test`).

In [None]:
# Make predictions
y_pred = clf.predict(X_test)

In [None]:
for x_, y_ in zip(X_test[:5], y_test[:5]):
    print(f"Input: {x_} Actual: ({y_}) -> Prediction: {clf.predict([x_])[0]}")
    print("-----")
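
`predict` returns hard class labels. As a side note that goes slightly beyond the cells above, `LogisticRegression` also exposes `predict_proba`, which returns one probability per class. The sketch below, using illustrative `_demo` names and a freshly fitted model, checks that the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild the same kind of data and model under illustrative names
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_classes=2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_clf = LogisticRegression().fit(X_tr, y_tr)

labels = demo_clf.predict(X_te)       # hard labels, shape (n_samples,)
proba = demo_clf.predict_proba(X_te)  # probabilities, shape (n_samples, 2)

# The predicted label is the class with the largest probability
labels_from_proba = demo_clf.classes_[proba.argmax(axis=1)]

print("First five probability rows:")
print(proba[:5])
print("Labels agree:", np.array_equal(labels, labels_from_proba))
```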

- **`metrics.classification_report`**: This function computes several classification metrics:
  - **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
  - **Recall**: The ratio of correctly predicted positive observations to all actual positives.
  - **F1-Score**: The harmonic mean of precision and recall.
  - **Support**: The number of actual occurrences of each class in the true labels passed to the function (here, `y_test`).

In [None]:
# Evaluate performance
print(metrics.classification_report(y_test, y_pred))
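
To see where the numbers in the report come from, the following sketch computes precision, recall, and F1 for the positive class by hand and compares them with scikit-learn's metric functions. The label arrays `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative true and predicted labels
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Counts for the positive class (label 1)
tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # true positives
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # false positives
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = np.sum(y_true_demo == 1)  # number of true samples of class 1

print("Manual:  ", precision, recall, f1, support)
print(
    "sklearn: ",
    precision_score(y_true_demo, y_pred_demo),
    recall_score(y_true_demo, y_pred_demo),
    f1_score(y_true_demo, y_pred_demo),
)
```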

- **`cross_val_score(clf, X, y, cv=5)`**: Performs k-fold cross-validation (with `cv=5`, i.e., 5 folds) to evaluate the model’s performance.
  - The data (`X`, `y`) is split into 5 different subsets (folds), and for each fold, the model is trained on the other 4 folds and evaluated on the remaining fold.
  - **`cv_scores`**: An array with one score per fold (accuracy by default for a classifier such as `LogisticRegression`).
  
- **`cv_scores.mean()`**: The average performance score across the 5 folds.
- **`cv_scores.std()`**: The standard deviation of the cross-validation scores, which provides insight into the model's performance consistency.
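- **`cv_scores.std() * 2`**: Two standard deviations of the fold scores, reported as a rough spread around the mean. It is often read as an approximate 95% interval under a normality assumption, but with only 5 folds it is a heuristic rather than a formal confidence interval.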

In [None]:
# Cross-validation
cv_scores = model_selection.cross_val_score(clf, X, y, cv=5)
print(f"\nCross-validation scores: {cv_scores}")
print(f"Average CV score: {cv_scores.mean():.3f} ± {cv_scores.std()*2:.3f}")
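
The fold-by-fold procedure can also be written out by hand. The sketch below is an illustration rather than scikit-learn's internal code: it relies on the documented behaviour that, for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, and it uses illustrative `_demo` names to avoid overwriting the variables above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_classes=2, random_state=42
)
base_clf = LogisticRegression()

# Manual 5-fold loop: train on 4 folds, score on the held-out fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_clf = clone(base_clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_clf.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# Reference result from the helper function
reference_scores = cross_val_score(base_clf, X_demo, y_demo, cv=5)

print("Manual scores:   ", manual_scores)
print("cross_val_score: ", reference_scores)
```

Cloning the estimator for each fold ensures that no fitted state leaks from one fold into the next.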

### 3.3 Pipelines

A `Pipeline` chains preprocessing steps and a final estimator into a single estimator, so that calling `fit` or `predict` runs every step in order:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline
pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# Train and evaluate the pipeline
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

print("Pipeline Classification Report:")
print(metrics.classification_report(y_test, pipeline_pred))
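
Because a pipeline is itself an estimator, it can be passed to `cross_val_score` directly. In the sketch below (illustrative names, same synthetic data), the scaler is re-fitted on the training portion of each fold, so statistics from the held-out fold never influence preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_classes=2, random_state=42
)

demo_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# Scaling and classification are both re-fitted inside every fold
pipeline_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(f"Pipeline CV accuracy: {pipeline_scores.mean():.3f}")

# After fitting, individual steps remain accessible by name
demo_pipeline.fit(X_demo, y_demo)
scaler_means = demo_pipeline.named_steps["scaler"].mean_
print("Per-feature means learned by the scaler:", scaler_means)
```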