# Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides a wide range of tools and algorithms for data preprocessing, feature selection, model training, and evaluation. It's built on top of NumPy, SciPy, and Matplotlib, and offers a consistent, user-friendly interface for machine learning tasks. The examples below all use the Iris dataset, so we start by loading it and holding out 20% of the samples as a test set:

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()

# Split the dataset into features (X) and target variable (y)
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the train and test sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
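
The split above is purely random. As a minimal sketch of one common variation, the cell below passes `stratify=y` to `train_test_split` so that each class keeps the same proportion in both sets. The variable names ending in `_s` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays
X, y = load_iris(return_X_y=True)

# stratify=y asks for the class proportions of y to be preserved in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each split
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Stratification tends to matter most for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.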


### 1. Data Preparation:
Scikit-learn provides various tools for data preprocessing, such as scaling, encoding categorical variables, and splitting data into training and testing sets. Here's an example of scaling numeric features using the StandardScaler, which is fit on the training set only and then applied to both the training and test sets:

In [2]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

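# Fit the scaler on the training data only, so no test-set statistics leak in
scaler.fit(X_train)

# Transform both sets using the mean and standard deviation learned from the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)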
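
To make the effect of the scaler visible, here is a small self-contained sketch that repeats the split and scaling, then checks the column statistics. It uses `fit_transform` as a shorthand for calling `fit` and `transform` on the training data; the names `train_means` and `train_stds` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation and applies them in one step
X_train_scaled = scaler.fit_transform(X_train)
# The test set is transformed with the statistics learned from the training set
X_test_scaled = scaler.transform(X_test)

# After scaling, each training column should have mean ~0 and standard deviation ~1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Training means:", np.round(train_means, 6))
print("Training stds:", np.round(train_stds, 6))

# The test columns are only approximately centered, because they were not used for fitting
print("Test means:", np.round(X_test_scaled.mean(axis=0), 3))
```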



### 2. Model Training and Evaluation:
Scikit-learn supports a wide range of machine learning algorithms for classification, regression, clustering, and more. Here's an example of training a logistic regression classifier on the original (unscaled) features and evaluating its performance on the test set using accuracy:

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a LogisticRegression object (max_iter is raised above the default to give the solver more iterations on unscaled data)
classifier = LogisticRegression(max_iter=300)

# Train the model
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)


1.0
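
Accuracy is a single number, so it can hide which classes are confused with each other. As a hedged sketch of a more detailed evaluation, the cell below retrains the same classifier and prints a confusion matrix and a per-class report using `confusion_matrix` and `classification_report` from `sklearn.metrics`. Passing `labels=[0, 1, 2]` explicitly is a choice made here so that the matrix always has one row per Iris class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

classifier = LogisticRegression(max_iter=300)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1-score for each class
report = classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=iris.target_names
)
print(report)
```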


### 3. Model Selection and Hyperparameter Tuning:
Scikit-learn provides tools for model selection and hyperparameter tuning, such as cross-validation and grid search. Here's an example of performing a grid search to find the best hyperparameters for a support vector machine (SVM) classifier:

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the parameter grid
param_grid = {'C': [1, 10, 100], 'gamma': [0.1, 0.01, 0.001]}

# Create an SVC object
svm = SVC()

# Perform grid search with cross-validation
grid_search = GridSearchCV(svm, param_grid, cv=5)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and the mean cross-validated accuracy they achieved
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print(best_params)
print(best_score)


{'C': 100, 'gamma': 0.01}
0.9583333333333334
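
The score reported by the grid search is a cross-validation average computed on the training data, not a test-set result. The sketch below shows, under the same split as above, how plain cross-validation looks with `cross_val_score` and how the tuned model can then be scored on the held-out test set. The single configuration `SVC(C=1, gamma=0.1)` is just one arbitrary point from the grid, picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation of one fixed configuration: one accuracy value per fold
cv_scores = cross_val_score(SVC(C=1, gamma=0.1), X_train, y_train, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

# Grid search repeats that procedure for every combination in the grid
param_grid = {"C": [1, 10, 100], "gamma": [0.1, 0.01, 0.001]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# With the default refit=True, the best configuration is retrained on all
# training data, so the fitted search object can score unseen data directly
test_accuracy = grid_search.score(X_test, y_test)
print("Held-out test accuracy:", test_accuracy)
```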


### 4. Feature Selection:
Scikit-learn provides feature selection techniques to identify the most relevant features for a given task. Here's an example of using recursive feature elimination (RFE) with a random forest classifier:

In [5]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier object
# (no random_state is set, so the fitted forest can differ slightly between runs)
classifier = RandomForestClassifier()

# Create an RFE object
rfe = RFE(classifier, n_features_to_select=2)

# Fit RFE to the data
rfe.fit(X_train, y_train)

# Convert X_train to a pandas DataFrame
X_train_df = pd.DataFrame(X_train, columns=iris.feature_names)

# Get the selected features
selected_features = X_train_df.columns[rfe.support_]
print(selected_features)


Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
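
Beyond the boolean mask in `support_`, a fitted RFE object also exposes `ranking_`, where selected features have rank 1. The sketch below prints that ranking and, for comparison, runs a simpler univariate selector, `SelectKBest` with the `f_classif` score. Fixing `random_state=0` on the forest is an assumption added here for repeatability and is not part of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Recursive feature elimination: repeatedly drop the least important feature
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=2)
rfe.fit(X_train, y_train)

# Rank 1 marks a selected feature; higher ranks were eliminated earlier
for name, rank in zip(iris.feature_names, rfe.ranking_):
    print(f"{name}: rank {rank}")

# transform keeps only the selected columns
X_train_reduced = rfe.transform(X_train)
print("Reduced shape:", X_train_reduced.shape)

# Univariate alternative: score each feature independently with an ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_train, y_train)
kbest_features = [
    name for name, keep in zip(iris.feature_names, selector.get_support()) if keep
]
print("SelectKBest features:", kbest_features)
```

RFE accounts for how features interact inside the model, while `SelectKBest` scores each feature in isolation, so the two methods are not guaranteed to agree on other datasets.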


### 5. Pipeline:
Scikit-learn allows you to chain data processing and modeling steps into a single pipeline object to streamline the workflow. In the example below, the make_pipeline function creates a pipeline with two steps: data scaling using StandardScaler and logistic regression using LogisticRegression. Calling fit or predict on the pipeline runs both steps in order, which makes it easier to manage data preprocessing and model training together:

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a pipeline
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# Train the model using the pipeline
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)


1.0
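
Because a pipeline behaves like a single estimator, it can be passed to tools such as `cross_val_score`, in which case the scaler is refit on the training portion of every fold. The following sketch illustrates that and shows how to reach an individual step through `named_steps`; `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated automatically from the class names
step_names = list(pipeline.named_steps)
print("Steps:", step_names)

# Cross-validate the whole pipeline: scaling is learned inside each fold
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Fold accuracies:", scores)

# After fitting, individual steps and their learned attributes are accessible
pipeline.fit(X_train, y_train)
scaler_means = pipeline.named_steps["standardscaler"].mean_
print("Means learned by the scaler:", scaler_means)
```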


These examples highlight some common tasks and functionalities provided by scikit-learn. It's a powerful library with extensive documentation and a wide range of algorithms and tools to support various machine learning tasks. Feel free to explore the official scikit-learn documentation for more information and examples.