Scikit-learn is an open-source machine learning library for Python that provides simple and efficient tools for data analysis and modeling. It is built on [NumPy](https://www.geeksforgeeks.org/introduction-to-numpy/), [SciPy](https://www.geeksforgeeks.org/scipy-integration/), and [Matplotlib](https://www.geeksforgeeks.org/python-introduction-matplotlib/), and supports tasks such as classification, regression, clustering, and dimensionality reduction:

*   [**Classification**](https://www.geeksforgeeks.org/getting-started-with-classification/)**:** Classification involves teaching a computer to categorize things. For example, a model could be built to determine whether an email is spam or not.
    
*   [**Regression**](https://www.geeksforgeeks.org/ml-classification-vs-regression/)**:** Regression involves predicting a continuous numeric value from input features. For instance, a model could predict house prices using factors like location, size, and age.
    
*   [**Clustering**](https://www.geeksforgeeks.org/clustering-in-machine-learning/)**:** Clustering involves finding patterns in data and grouping similar items together. For example, customers could be segmented into different groups based on their shopping habits.
    
*   [**Dimensionality Reduction**](https://www.geeksforgeeks.org/dimensionality-reduction/)**:** Dimensionality reduction represents data with fewer features while keeping most of the useful information and discarding noise. This is useful when a dataset has many features that are redundant or irrelevant.

Features of Scikit-Learn
------------------------

Scikit-learn is a versatile tool for machine learning tasks, offering features that cover most stages of the data science pipeline. Let's examine its key features:

### **Supervised Learning**

*   **Classification:** Algorithms for predicting categorical labels, including [logistic regression](https://www.geeksforgeeks.org/understanding-logistic-regression/), [decision trees](https://www.geeksforgeeks.org/decision-tree-introduction-example/), [random forests](https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/), [support vector machines (SVMs)](https://www.geeksforgeeks.org/support-vector-machine-algorithm/) and [gradient boosting](https://www.geeksforgeeks.org/ml-gradient-boosting/).
    
*   **Regression:** Algorithms for predicting continuous outputs, including [linear regression](https://www.geeksforgeeks.org/ml-linear-regression/), support vector regression, and [decision tree regression](https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/).
    

### **Unsupervised Learning**

*   **Clustering:** Techniques for grouping data points into similar clusters, including [K-means clustering](https://www.geeksforgeeks.org/k-means-clustering-introduction/), [DBSCAN](https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/), and [hierarchical clustering](https://www.geeksforgeeks.org/hierarchical-clustering/).
    
*   **Dimensionality Reduction:** Methods for reducing the number of features in your data, such as [principal component analysis (PCA)](https://www.geeksforgeeks.org/principal-component-analysis-pca/).
    

### **Data Preprocessing**

*   [**Data Splitting**](https://www.geeksforgeeks.org/splitting-data-for-machine-learning-models/)**:** Functions to split your data into training and testing sets for model evaluation.
    
*   [**Feature Scaling**](https://www.geeksforgeeks.org/ml-feature-scaling-part-2/)**:** Techniques for normalizing the scale of your features.
    
*   [**Feature Selection**](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/)**:** Methods to identify and select the most relevant features for your model.
    
*   [**Feature Extraction**](https://www.geeksforgeeks.org/difference-between-feature-selection-and-feature-extraction/)**:** Tools to create new features from existing ones, such as text vectorization for natural language processing tasks.
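
The four preprocessing steps listed above can be sketched together as follows. This is a minimal illustration rather than a recommended pipeline: the Iris dataset, the 25% test split, the choice of `SelectKBest` with `k=2`, and the three-sentence toy corpus are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Data splitting: hold out 25% of the rows for testing (assumed split size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature scaling: learn mean and std on the training set only, then reuse them
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection: keep the 2 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2).fit(X_train_scaled, y_train)
X_train_selected = selector.transform(X_train_scaled)

# Feature extraction: turn raw text into a bag-of-words count matrix
docs = ["free prize inside", "meeting at noon", "free meeting"]  # toy corpus
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print("Train/test shapes:", X_train.shape, X_test.shape)
print("Shape after selection:", X_train_selected.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
```

Fitting the scaler and the selector on the training split only keeps information from the test set out of the preprocessing step.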
    

### **Model Evaluation**

*   **Metrics:** Functions to calculate performance metrics like accuracy, [precision, recall](https://www.geeksforgeeks.org/metrics-for-machine-learning-model/), and [F1-score](https://www.geeksforgeeks.org/f1-score-in-machine-learning/) for classification models, and mean squared error (MSE) for regression models.
    
*   **Model Selection:** Tools for selecting the best model hyperparameters through techniques like [grid search](https://www.geeksforgeeks.org/comparing-randomized-search-and-grid-search-for-hyperparameter-estimation-in-scikit-learn/) and randomized search.
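
As a rough sketch of how metrics and grid search fit together, the example below tunes a support vector classifier and then reports the metrics named above on held-out data. The breast cancer dataset, the parameter grid, and the use of a scaling pipeline are illustrative assumptions, not part of the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# make_pipeline names each step after its lowercased class name ("svc" here)
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: every combination is scored with 5-fold cross-validation
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# After fitting, the search object predicts with the best model it found
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of parameter combinations instead of trying all of them.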
    

### **Additional Features**

*   **Built-in datasets:** Scikit-learn ships with several small sample datasets, such as Iris and Digits, for experimentation and learning purposes.
    
*   **Easy to Use API:** Scikit-learn is known for its consistent and user-friendly API, making it accessible to both beginners and experienced data scientists.
    
*   **Open Source:** Scikit-learn is an open-source library with a large and active community, ensuring continuous development and support.
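
To illustrate the built-in datasets and the consistent API, the sketch below trains three different classifiers with exactly the same `fit` and `score` calls. The Wine dataset and the particular models are arbitrary choices for the example, and the scores depend on the split.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A bundled sample dataset: no download is needed
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)

# Different algorithms, same interface
models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # every estimator has fit()
    scores[name] = model.score(X_test, y_test)  # classifiers report accuracy
    print(name, "accuracy:", scores[name])
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.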

### Classification – Logistic Regression Algorithm Example

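Logistic Regression is a classification algorithm that estimates the probability of each class. In its basic form it models a binary outcome, but scikit-learn's `LogisticRegression` also handles multiclass problems such as the three-class Iris dataset used below. It's used for problems like spam detection, medical diagnosis, and credit scoring, and is chosen for its simplicity, interpretability, and effectiveness on linearly separable datasets.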

```python
# Import the required modules
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

Output:

```
Accuracy: 1.0
```
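
Since logistic regression estimates probabilities, it can be useful to inspect them directly instead of only the predicted label. The sketch below is one way to do that with `predict_proba`; wrapping the scaler and the model in a pipeline and looking at the first three test rows are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline scales the data and fits the classifier in one step
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test[:3])
predicted = model.predict(X_test[:3])

print("Class probabilities:")
print(proba.round(3))
print("Predicted classes:", predicted)
```

The predicted class for each row is the one with the highest probability.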


### Classification – KNN Classifier Algorithm Example

K-Nearest Neighbors (KNN) algorithm classifies data points based on the majority class of their nearest neighbors. It’s useful for simple classification tasks, particularly when data is not linearly separable or when decision boundaries are complex. It’s used in recommendation systems, handwriting recognition, and medical diagnosis.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset and hold out 20% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Initialize the KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn.fit(X_train, y_train)

# Make predictions on the test data
predictions = knn.predict(X_test)

# Evaluate the model (score() returns the mean accuracy on the test set)
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)
```

Output:

```
Accuracy: 1.0
```
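
The number of neighbors is a hyperparameter, and a single train/test split on a small dataset can make any value look good. One possible way to compare values is 5-fold cross-validation, sketched below; the candidate values of `k` are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()
    print("k =", k, "mean accuracy:", round(cv_scores[k], 3))

# Pick the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("Best k:", best_k)
```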


### Regression – Linear Regression Algorithm Example

Linear Regression fits a linear model to observed data points, predicting continuous outcomes based on input features. It’s used when exploring relationships between variables and making predictions. Applications include economics, finance, engineering, and social sciences.

Note: the code below is shown for illustration only. The California Housing dataset could not be fetched in this environment, so no output is shown.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
# (downloaded on first use, so this call needs internet access)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Make predictions on the test data
predictions = lr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
```
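
Because `fetch_california_housing` has to download its data, the same workflow can be tried offline with a dataset that ships with scikit-learn. The sketch below swaps in the bundled Diabetes dataset (`load_diabetes`), which is a substitution made for this example and not the dataset used above; it also reports R² alongside the MSE.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Bundled regression dataset: 442 samples, 10 features, no download required
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the held-out data and evaluate
predictions = lr.predict(X_test)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R^2 score:", r2)
print("Number of coefficients:", lr.coef_.shape[0])
```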

### Clustering – KMeans Algorithm Example

The KMeans algorithm partitions data into k clusters by assigning each point to the nearest cluster center. It's used for unsupervised clustering tasks like customer segmentation, image compression, and anomaly detection, and is a good fit when the structure of the data is unknown but grouping is desired.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the Iris dataset
iris = load_iris()

# Initialize the KMeans clustering model
# (no random_state is set, so cluster numbering may differ between runs)
kmeans = KMeans(n_clusters=3)

# Fit the model to the data
kmeans.fit(iris.data)

# Get the cluster labels
cluster_labels = kmeans.labels_

print("Cluster Labels:", cluster_labels)
```

Output:

```
Cluster Labels: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]
```
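
Cluster IDs are arbitrary, so they cannot be compared with the true species labels one by one. As a sketch of one way to check how well the clusters line up with the known classes, the example below uses the adjusted Rand index, which ignores label numbering. Fixing `random_state` and `n_init` is an assumption added here for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# Fixed seed and 10 restarts so that repeated runs give the same clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(iris.data)

# Adjusted Rand index: 1.0 means identical groupings, about 0.0 means random
ari = adjusted_rand_score(iris.target, cluster_labels)

print("Cluster centers shape:", kmeans.cluster_centers_.shape)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
print("Adjusted Rand index:", ari)
```

In real clustering tasks the true labels are usually unavailable, so a comparison like this is mainly useful for learning and benchmarking.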




### Dimensionality Reduction – PCA Example

PCA (Principal Component Analysis) reduces the dimensionality of data by projecting it onto a small number of new axes, the principal components, which are linear combinations of the original features that capture the most variance. It's used for visualizing high-dimensional data, noise reduction, and speeding up machine learning algorithms, and is commonly applied in image processing, genetics, and finance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load the digits dataset (8x8 images flattened to 64 features)
digits = load_digits()

# Initialize PCA to keep the first 2 principal components
pca = PCA(n_components=2)

# Fit PCA and project the data onto those components
reduced_data = pca.fit_transform(digits.data)

print("Original data shape:", digits.data.shape)
print("Reduced data shape:", reduced_data.shape)
```

Output:

```
Original data shape: (1797, 64)
Reduced data shape: (1797, 2)
```
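
Reducing 64 features to 2 discards information, and the `explained_variance_ratio_` attribute shows how much variance the kept components retain. The sketch below also passes a fraction as `n_components` so that PCA chooses the number of components itself; the 95% threshold is an arbitrary value picked for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Two components, as in the example above
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(digits.data)
ratios = pca.explained_variance_ratio_

print("Variance explained per component:", ratios)
print("Total variance explained:", ratios.sum())

# A float between 0 and 1 keeps just enough components to reach that
# share of the variance (here 95%, an assumed threshold)
pca_95 = PCA(n_components=0.95, svd_solver="full")
reduced_95 = pca_95.fit_transform(digits.data)

print("Components needed for 95% variance:", pca_95.n_components_)
```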


**Reference**: https://www.geeksforgeeks.org/what-is-python-scikit-library/