# Scikit-learn Course: Learn Machine Learning from Scratch
Welcome to the scikit-learn course! This notebook will guide you through the basics and advanced features of the scikit-learn library, a powerful Python framework for machine learning. Each section includes code snippets and explanations to help you understand and apply machine learning concepts using scikit-learn.

## What is scikit-learn?
Scikit-learn is an open-source Python library that provides simple and efficient tools for data mining and data analysis. It is built on top of NumPy, SciPy, and matplotlib, and is widely used for implementing machine learning algorithms.

## What will you learn?
- How to install and import scikit-learn
- Loading datasets
- Data preprocessing
- Building and training models
- Evaluating models
- Hyperparameter tuning
- Pipelines and advanced topics

Let's get started!

## 1. Installing and Importing scikit-learn
To use scikit-learn, first install it with pip, then import it in your Python code.

- `pip install scikit-learn` installs the library.
- `import sklearn` imports the main package. Note that the import name is `sklearn`, while the package name on pip is `scikit-learn`. Usually, you import specific modules, for example `from sklearn import datasets`.

In [1]:
# Install scikit-learn (uncomment the line below if not already installed)
# !pip install scikit-learn numpy pandas
import sklearn
import numpy as np
import pandas as pd
print('scikit-learn version:', sklearn.__version__)

scikit-learn version: 1.6.1


## 2. Loading Datasets
Scikit-learn comes with several built-in datasets for practice, such as the iris and digits datasets. You can also load your own data from CSV or Excel files.

- `sklearn.datasets.load_iris()` loads the famous iris flower dataset.
- `sklearn.datasets.load_digits()` loads handwritten digits data.
- For your own data, use pandas: `pd.read_csv('yourfile.csv')`.

In [2]:
from sklearn import datasets
iris = datasets.load_iris()
print('Keys of iris dataset:', iris.keys())
print('Feature names:', iris.feature_names)
print('Target names:', iris.target_names)
print('First 5 rows of data:')
print(iris.data[:5])

Keys of iris dataset: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 5 rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
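
For tabular exploration it can help to load the same dataset as a pandas DataFrame. The following is a minimal sketch, assuming scikit-learn 0.23 or later (where `load_iris` accepts `as_frame=True`); the names `iris_bunch`, `iris_df`, and the `species` column are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_bunch = load_iris(as_frame=True)
iris_df = iris_bunch.frame  # four feature columns plus a 'target' column

# Map the integer labels to readable species names (illustrative column name)
label_to_name = dict(enumerate(iris_bunch.target_names))
iris_df["species"] = iris_df["target"].map(label_to_name)

print(iris_df.shape)
print(iris_df["species"].value_counts())
```

The same DataFrame methods apply to data loaded with `pd.read_csv`, so this form is a convenient way to practice on a built-in dataset.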


## 3. Splitting Data into Training and Test Sets
To evaluate a machine learning model, you need to split your data into a training set (to train the model) and a test set (to evaluate it).
Scikit-learn provides `train_test_split` for this purpose.

- `train_test_split` randomly splits data into training and test sets.
- Typical split: 70-80% for training, 20-30% for testing.

In [3]:
from sklearn.model_selection import train_test_split
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)

Training set size: (120, 4)
Test set size: (30, 4)
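
A purely random split does not guarantee that each class is equally represented in the test set. One common option is the `stratify` argument of `train_test_split`, which preserves the class proportions in both subsets. The sketch below is self-contained and uses illustrative variable names (`X_demo`, `X_tr`, and so on) so that it does not overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions of y_demo in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many samples of each class ended up in each subset
print("Class counts in training set:", np.bincount(y_tr))
print("Class counts in test set:", np.bincount(y_te))
```

Because iris has 50 samples per class, a stratified 80/20 split should place 40 samples of each class in the training set and 10 in the test set.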


## 4. Training a Simple Model: k-Nearest Neighbors (KNN)
Let's train a simple classifier using the k-Nearest Neighbors algorithm. KNN predicts the class of a data point by looking at the k training points closest to it and taking a majority vote over their classes.

- `KNeighborsClassifier` is used for classification tasks.
- `fit()` trains the model on the training data.
- `predict()` makes predictions on new data.

In [4]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print('Predicted labels:', y_pred)

Predicted labels: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
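
To make the idea behind KNN concrete, here is a small NumPy sketch of the majority-vote rule. It is a teaching illustration under simplifying assumptions (Euclidean distance, brute-force search, non-negative integer labels), not how `KNeighborsClassifier` is implemented; the function `knn_predict` and the toy arrays are hypothetical names introduced for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict labels for X_new by majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Labels of those neighbors, shape (n_new, k)
    votes = y_train[nearest]
    # Most frequent label per row (ties go to the smallest label)
    return np.array([np.bincount(row).argmax() for row in votes])

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(toy_pred)  # [0 1]
```

The first query point sits inside the class 0 cluster and the second inside the class 1 cluster, so all three neighbors agree in each case.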


## 5. Evaluating the Model
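After training a model, you need to evaluate its performance. The most common metric for classification is accuracy, the proportion of predictions that are correct. Accuracy can be misleading when the classes are imbalanced, which is one reason to inspect the confusion matrix as well.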

- `accuracy_score` computes the accuracy of the model.
- You can also use a confusion matrix to see details of correct and incorrect predictions.

In [5]:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)

Accuracy: 1.0
Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
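
The relationship between the two metrics can be checked by hand: in scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions and accuracy is the diagonal sum divided by the total. The sketch below uses a small hand-written set of labels (`y_true_demo`, `y_pred_demo`, chosen only for illustration) that includes a few mistakes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels with two misclassifications
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Accuracy = correct predictions (diagonal) / all predictions
manual_accuracy = np.trace(cm_demo) / cm_demo.sum()
print("Manual accuracy:", manual_accuracy)
print("accuracy_score: ", accuracy_score(y_true_demo, y_pred_demo))

# Per-class recall: diagonal entry divided by the row total
print("Recall per class:", np.diag(cm_demo) / cm_demo.sum(axis=1))
```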


## 6. Feature Scaling and Preprocessing
Many machine learning algorithms perform better when features are on a similar scale. Scikit-learn provides tools for scaling and preprocessing data.

- `StandardScaler` standardizes features by removing the mean and scaling to unit variance.
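- Always fit the scaler on the training data only, then use that fitted scaler to transform both the training and test data. Fitting on the test data would leak information about it into the model.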

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('First 5 rows of scaled training data:')
print(X_train_scaled[:5])

First 5 rows of scaled training data:
[[-1.47393679  1.20365799 -1.56253475 -1.31260282]
 [-0.13307079  2.99237573 -1.27600637 -1.04563275]
 [ 1.08589829  0.08570939  0.38585821  0.28921757]
 [-1.23014297  0.75647855 -1.2187007  -1.31260282]
 [-1.7177306   0.30929911 -1.39061772 -1.31260282]]
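
Standardization replaces each value x with (x - mean) / std, computed per feature column. As a quick check, the sketch below compares `StandardScaler` with the same calculation in NumPy on a small made-up array (`X_small` is illustrative data, not from the iris dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X_small = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0],
                    [4.0, 500.0]])

scaler_demo = StandardScaler()
X_sklearn = scaler_demo.fit_transform(X_small)

# Manual version: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, NumPy's default)
X_manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print("Matches manual computation:", np.allclose(X_sklearn, X_manual))
print("Column means after scaling:", X_sklearn.mean(axis=0))
print("Column stds after scaling: ", X_sklearn.std(axis=0))
```

The learned statistics are stored on the fitted scaler as `mean_` and `scale_`, which is what allows the same transformation to be reapplied to the test set.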


## 7. Pipelines: Combining Preprocessing and Modeling
Scikit-learn pipelines allow you to chain preprocessing and modeling steps together, making your code cleaner and less error-prone.

- `Pipeline` lets you define a sequence of steps (e.g., scaling, then classification).
- This ensures that all steps are applied consistently: `fit` learns the preprocessing parameters from the training data only, and `predict` reuses them on new data.

In [7]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3))
])
pipeline.fit(X_train, y_train)
y_pred_pipe = pipeline.predict(X_test)
accuracy_pipe = accuracy_score(y_test, y_pred_pipe)
print('Pipeline accuracy:', accuracy_pipe)

Pipeline accuracy: 1.0
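
One way to see what a pipeline does is to compare it with the equivalent manual steps. The sketch below rebuilds the same split and model with illustrative variable names and checks that both routes give identical predictions; it also shows `named_steps`, which gives access to the fitted components.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Route 1: the pipeline handles scaling and classification together
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe_pred = pipe.fit(X_tr, y_tr).predict(X_te)

# Route 2: the same steps written out by hand
manual_scaler = StandardScaler().fit(X_tr)  # fit on training data only
manual_knn = KNeighborsClassifier(n_neighbors=3)
manual_knn.fit(manual_scaler.transform(X_tr), y_tr)
manual_pred = manual_knn.predict(manual_scaler.transform(X_te))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Same predictions:", same_predictions)

# Fitted steps remain accessible by name
print("Scaler means learned inside the pipeline:", pipe.named_steps["scaler"].mean_)
```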


## 8. Hyperparameter Tuning with GridSearchCV
Choosing the best parameters for your model can improve its performance. Scikit-learn provides `GridSearchCV` to search for the best hyperparameters automatically.

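- `GridSearchCV` tries every combination of parameters in the grid and selects the one with the best cross-validation score.
- You define a parameter grid and the model to optimize. When the model is a pipeline, parameter names take the form `<step name>__<parameter>`, such as `knn__n_neighbors`.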

In [8]:
from sklearn.model_selection import GridSearchCV
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9]}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validation score:', grid.best_score_)

Best parameters: {'knn__n_neighbors': 3}
Best cross-validation score: 0.95
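
A grid can cover more than one parameter, and the tuned model can then be checked on the held-out test set. The sketch below is a self-contained variation that adds the `weights` parameter of `KNeighborsClassifier` to the grid; the variable names (`search`, `param_grid_demo`, and so on) are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 5 values x 2 values = 10 candidate settings, each scored with 5-fold CV
param_grid_demo = {
    "knn__n_neighbors": [1, 3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid_demo, cv=5)
search.fit(X_tr, y_tr)

n_candidates = len(search.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best parameters:", search.best_params_)

# With the default refit=True, the best setting is refit on all of X_tr,
# so the search object can score the held-out test set directly
test_score = search.score(X_te, y_te)
print("Test accuracy of the best model:", test_score)
```

The test set plays no part in the search itself, so the final score is a fairer estimate of performance than `best_score_`.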


## 9. Trying Different Models: Classification and Regression
Scikit-learn supports many algorithms for both classification and regression tasks. You can easily switch between models by changing the estimator in your pipeline.

- For classification: `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`, etc.
- For regression: `LinearRegression`, `DecisionTreeRegressor`, `RandomForestRegressor`, etc.

Let's see an example with Logistic Regression.

In [9]:
from sklearn.linear_model import LogisticRegression
logreg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])
logreg_pipeline.fit(X_train, y_train)
y_pred_logreg = logreg_pipeline.predict(X_test)
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print('Logistic Regression accuracy:', accuracy_logreg)

Logistic Regression accuracy: 1.0
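
The same pattern carries over to regression. As a hedged sketch, the example below swaps in `LinearRegression` and the built-in diabetes dataset (`load_diabetes`), which is not used elsewhere in this notebook, and evaluates with regression metrics instead of accuracy.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in regression dataset: 10 features, continuous target
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_tr, X_reg_te, y_reg_tr, y_reg_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Same pipeline structure as before, with a regressor as the final step
linreg_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])
linreg_pipeline.fit(X_reg_tr, y_reg_tr)
y_reg_pred = linreg_pipeline.predict(X_reg_te)

# Regression metrics: mean squared error and R^2 (1.0 is a perfect fit)
reg_mse = mean_squared_error(y_reg_te, y_reg_pred)
reg_r2 = r2_score(y_reg_te, y_reg_pred)
print("Mean squared error:", reg_mse)
print("R^2 score:", reg_r2)
```

For regressors, `score()` returns R^2 rather than accuracy, so `linreg_pipeline.score(X_reg_te, y_reg_te)` would give the same value as `r2_score` here.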


## 10. Model Evaluation with Cross-Validation
Cross-validation is a technique to assess how well your model generalizes to unseen data. It splits the data into multiple parts (folds), trains the model on all but one fold, tests it on the held-out fold, and repeats this so that each fold serves as the test set once.

- `cross_val_score` computes scores for each fold and returns the results.
- Commonly used with 5 or 10 folds.

In [10]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg_pipeline, X, y, cv=5)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', scores.mean())

Cross-validation scores: [0.96666667 1.         0.93333333 0.9        1.        ]
Mean cross-validation score: 0.9600000000000002
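
To show what `cross_val_score` does internally, the sketch below writes the fold loop out by hand and compares the result with the library call. It assumes stratified folds without shuffling, which is what scikit-learn uses for a classifier when `cv` is given as an integer; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    # A fresh, unfitted copy for every fold so no state carries over
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print("Manual scores: ", manual_scores)
print("Library scores:", library_scores)
```

Because the scaler is inside the pipeline, it is refit on the training portion of every fold, so the held-out fold never influences the scaling.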


## 11. Saving and Loading Models
After training a model, you may want to save it for later use. A common way to do this is with `joblib`, a separate library that is installed as a dependency of scikit-learn.

- `joblib.dump()` saves the model to a file.
- `joblib.load()` loads the model from a file.

In [None]:
import joblib
# Save the trained logistic regression pipeline
joblib.dump(logreg_pipeline, 'logreg_pipeline.joblib')
# Load the model back
loaded_model = joblib.load('logreg_pipeline.joblib')
print('Loaded model accuracy:', loaded_model.score(X_test, y_test))
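
A quick way to confirm that a saved model survived the round trip is to compare predictions before and after loading. The sketch below writes to a temporary directory so it leaves no files behind; the file name `model_demo.joblib` and the variable names are illustrative. Note that joblib files are based on pickle, so only load files from sources you trust.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_demo.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should behave exactly like the original
round_trip_ok = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", round_trip_ok)
```

Saved models are generally only expected to load reliably under the same scikit-learn version that created them, so it is worth recording the version alongside the file.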

## Summary
You have now learned the basics of scikit-learn, including how to load data, preprocess it, train and evaluate models, use pipelines, tune hyperparameters, and save/load models.

## Next Steps
- Explore more datasets and algorithms in scikit-learn.
- Try regression tasks with `LinearRegression` or `RandomForestRegressor`.
- Learn about advanced topics like ensemble methods, feature selection, and custom transformers.
- Read the [scikit-learn documentation](https://scikit-learn.org/stable/documentation.html) for more details and examples.

Happy Learning!