**Scikit-learn** is a widely used Python library for machine learning that provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and matplotlib and offers algorithms and utilities for both supervised and unsupervised learning.

In [None]:
%pip install scikit-learn

Datasets

Scikit-learn provides several small sample datasets (like the famous Iris dataset) and utilities to generate or load custom datasets.

In [None]:
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Features and target
X = iris.data
y = iris.target

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)

The Iris dataset contains 150 samples of iris flowers, each described by four features (sepal length, sepal width, petal length, petal width). The target is the species of the flower (setosa, versicolor, or virginica).
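To make the description above concrete, here is a minimal sketch that reads the feature and species names from the object returned by load_iris() and counts the samples per species. The variable names (feature_names, target_names, class_counts) are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measured features and of the three species
feature_names = list(iris.feature_names)
target_names = [str(name) for name in iris.target_names]
print(feature_names)
print(target_names)

# Number of samples that belong to each species (indexed by class id)
class_counts = np.bincount(iris.target)
print(dict(zip(target_names, class_counts.tolist())))
```

The classes are perfectly balanced, which is one reason plain accuracy is a reasonable metric for this dataset.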

Train-Test Split

In machine learning, it's essential to split the data into training and testing sets to evaluate model performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (120, 4), (30, 4)
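A plain random split does not guarantee that every class is equally represented in the test set. As a hedged variation, the sketch below passes stratify=y so that both subsets keep the class proportions of the full dataset. The names with the _s suffix are illustrative and chosen so that they do not overwrite the split used in the rest of this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count the samples of each class on both sides of the split
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```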

Preprocessing

Preprocessing is a crucial step in machine learning: it puts the data into a consistent form that models can use effectively. Scikit-learn includes utilities for scaling and normalizing features and for encoding categorical variables.

Standardization (scaling): This process rescales each feature so that it has zero mean and unit variance.

Example:

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled[:5])  # First 5 standardized samples

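StandardScaler() scales each feature to have a mean of 0 and a standard deviation of 1. The scaler is fit on the training data only, and the same training statistics are then applied to the test data with transform(), so no information from the test set leaks into training. Scaling matters for models that are sensitive to the scale of their input features (e.g., SVM, k-NN).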
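As a sanity check on what standardization does, the following sketch recomputes the transformation by hand, (x - mean) / std using the training statistics, and compares it with the StandardScaler output. The variable manual is an illustrative name, and the split mirrors the one used earlier.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Same computation by hand: subtract the training mean and divide by the
# training standard deviation (population form, ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_scaled, manual))
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))

# The test data reuses the training statistics, so its per-feature mean
# is close to 0 but generally not exactly 0
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0).round(3))
```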



Supervised Learning Algorithms

Scikit-learn provides many algorithms for supervised learning, which is the task of learning a function from labeled training data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

LogisticRegression() is used for classification, and accuracy_score() returns the fraction of test samples whose predicted label matches the true label.
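Beyond hard class labels, a fitted LogisticRegression can also report class probabilities. The sketch below, which repeats the earlier split and scaling so that it runs on its own, shows that predict_proba() returns one row per sample whose entries sum to 1, and that predict() corresponds to the most probable class. The name proba is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# One row per test sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test_scaled)
print(proba[:3].round(3))

# predict() picks the class with the highest probability
y_pred = model.predict(X_test_scaled)
print(np.array_equal(y_pred, model.classes_[proba.argmax(axis=1)]))
```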

Unsupervised Learning Algorithms

Unsupervised learning involves learning patterns from unlabeled data. Scikit-learn provides tools for clustering, dimensionality reduction, and more.

Example of K-Means Clustering:

Here, K-Means is used to cluster the data into 3 groups. Clustering is unsupervised, meaning we don't need target labels.

In [None]:
from sklearn.cluster import KMeans

# Perform K-Means clustering (n_init is set explicitly because its
# default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

# Cluster centers and labels
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:\n", kmeans.labels_)
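Although clustering does not use labels, Iris happens to have them, so the clusters can be compared with the true species after the fact. One possible way to do this, sketched below, is the adjusted Rand index from sklearn.metrics, which ignores how the cluster ids are numbered. The names cluster_labels and ari are illustrative.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = datasets.load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# The adjusted Rand index compares two groupings regardless of label
# numbering: 1.0 means identical groupings, values near 0 mean the
# agreement is no better than chance
ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

The labels are used here only for evaluation; K-Means itself never sees them.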

Model Evaluation

Scikit-learn provides several ways to evaluate a model's performance, including metrics for classification and regression.

Commonly used metrics include accuracy, precision, recall, F1-score, and the confusion matrix for classification, and mean squared error (MSE) for regression.
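Mean squared error applies to regression rather than classification, so it does not fit the Iris classifier used in this notebook. As a small stand-alone sketch with made-up values (y_true_reg and y_pred_reg are hypothetical), MSE is the average of the squared differences between true and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical regression targets and predictions, for illustration only
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_reg = np.array([2.5, 0.0, 2.0, 8.0])

# Library call and the same quantity computed by hand
mse = mean_squared_error(y_true_reg, y_pred_reg)
manual_mse = np.mean((y_true_reg - y_pred_reg) ** 2)

print(mse, manual_mse)  # both are 0.375
```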

Example of Confusion Matrix and Classification Report:

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Classification Report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

The confusion matrix shows how many samples of each class were classified correctly or confused with another class, while the classification report provides per-class metrics such as precision, recall, and F1-score.

Confusion Matrix:

In scikit-learn, rows correspond to true classes and columns to predicted classes.
The diagonal elements count correct classifications. For example, cm[0, 0] is the number of class 0 samples correctly predicted as class 0 (the true positives for class 0).
The off-diagonal elements count misclassifications. For example, cm[0, 1] is the number of class 0 samples incorrectly predicted as class 1 (false negatives for class 0, and false positives for class 1).
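To connect the matrix with the metrics in the classification report, the sketch below derives per-class precision and recall directly from a confusion matrix and checks them against precision_score() and recall_score(). The labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a three-class problem
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]

cm_demo = confusion_matrix(true_labels, pred_labels)
print(cm_demo)

# Diagonal entries are the correctly classified samples of each class
true_pos = np.diag(cm_demo)

# Recall: correct predictions divided by the row sum (all true members)
recall = true_pos / cm_demo.sum(axis=1)
# Precision: correct predictions divided by the column sum (all predictions)
precision = true_pos / cm_demo.sum(axis=0)

print("Recall:", recall)
print("Precision:", precision)

# The library functions give the same per-class values
print(np.allclose(recall, recall_score(true_labels, pred_labels, average=None)))
print(np.allclose(precision, precision_score(true_labels, pred_labels, average=None)))
```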

Cross-Validation

Cross-validation is a technique for assessing how a model performs on different subsets of the data. The most common form is k-fold cross-validation, which divides the dataset into k folds and trains the model k times, each time holding out a different fold for validation and training on the remaining k-1 folds.

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)

print("Cross-validation scores:", cv_scores)
print("Average CV score:", cv_scores.mean())
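To show what happens inside cross_val_score(), here is a rough sketch of the same procedure written as an explicit loop. It relies on the documented behavior that an integer cv with a classifier uses stratified folds. The names clf, folds, manual_scores, and auto_scores are illustrative, and the unscaled data is used only to keep the example short.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Five stratified folds, the splitter used for classifiers when cv=5
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    # Train a fresh copy of the model on four folds
    fold_model = clone(clf)
    fold_model.fit(X[train_idx], y[train_idx])
    # Score (accuracy) on the held-out fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(clf, X, y, cv=5)

print("Manual loop:    ", np.round(manual_scores, 4))
print("cross_val_score:", np.round(auto_scores, 4))
```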

Pipelines

A Pipeline is a tool that allows you to chain multiple steps together (e.g., preprocessing + model) so that you can fit and evaluate everything in one go. This ensures that preprocessing parameters (like the scaler's mean and variance) are learned only from the training data, including within each fold during cross-validation, and are then applied unchanged to validation or test data.

In [None]:
from sklearn.pipeline import Pipeline

# Define a pipeline with scaling and model training
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200))
])

# Train the model using the pipeline
pipeline.fit(X_train, y_train)

# Predict using the pipeline
y_pred = pipeline.predict(X_test)
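Because a pipeline behaves like a single estimator, it can be passed directly to cross_val_score(). In the sketch below the scaler is refit inside every fold on that fold's training portion, which is the leakage-free setup described above. The names pipe_scores and test_accuracy are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200))
])

# Pass the raw (unscaled) training data: scaling happens inside each fold
pipe_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Pipeline CV scores:", pipe_scores)

# Fit on the full training set and score once on the held-out test set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```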

Grid Search for Hyperparameter Tuning

Grid Search helps in finding the best hyperparameters for a model by exhaustively searching over a specified parameter grid.

In [None]:
from sklearn.model_selection import GridSearchCV

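# Define parameter grid
# ('newton-cg' is used instead of 'liblinear' because recent scikit-learn
# releases deprecate 'liblinear' for multiclass targets such as Iris)
param_grid = {'C': [0.1, 1, 10], 'solver': ['lbfgs', 'newton-cg']}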

# Perform grid search with cross-validation
grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

# Best hyperparameters
print("Best parameters:", grid_search.best_params_)
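Grid search also combines with pipelines. As a hedged sketch, the example below tunes only C of the classifier step, addressing it with the step__parameter naming convention ('classifier__C'). The names pipe, param_grid_pipe, and search are illustrative, and the grid values repeat those used above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200))
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_pipe = {'classifier__C': [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid_pipe, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)

# After fitting, the search object predicts with the best pipeline,
# refit on the whole training set
print("Test accuracy:", search.score(X_test, y_test))
```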

Dimensionality Reduction

Scikit-learn provides dimensionality reduction techniques like Principal Component Analysis (PCA), which reduce the number of features while preserving as much of the data's variance as possible.

In [None]:
from sklearn.decomposition import PCA

# Reduce the data to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced data shape:", X_reduced.shape)
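How much variance the reduced representation keeps can be read from the fitted PCA object. The sketch below prints explained_variance_ratio_, the fraction of total variance captured by each component. The name ratios is illustrative.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

X, y = datasets.load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component,
# sorted from the most to the least informative component
ratios = pca.explained_variance_ratio_
print("Explained variance ratio:", ratios.round(4))
print("Total variance retained:", ratios.sum().round(4))
```

For Iris, the first two components retain well over 95% of the variance, which is why a 2-D PCA plot of this dataset is a common visualization.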