<div style="background-color:#daee8420; line-height:1.5; text-align:center;border:2px solid black;">
    <div style="color:#7B242F; font-size:24pt; font-weight:700;">The Ultimate Machine Learning Mastery Course with Python</div>
</div>

---
### **Course**: The Ultimate Machine Learning Course with Python  
#### **Chapter**: Machine Learning with Python Frameworks
##### **Lesson**: Scikit-Learn Framework

###### **Author:** Dr. Saad Laouadi   
###### **Copyright:** Dr. Saad Laouadi    

---

## License

**This material is intended for educational purposes only and may not be used directly in courses, video recordings, or similar without prior consent from the author. When using or referencing this material, proper credit must be attributed to the author.**

```text
#**************************************************************************
#* (C) Copyright 2024 by Dr. Saad Laouadi. All Rights Reserved.           *
#**************************************************************************                                                                    
#* DISCLAIMER: The author has used their best efforts in preparing        *
#* this content. These efforts include development, research,             *
#* and testing of the theories and programs to determine their            *
#* effectiveness. The author makes no warranty of any kind,               *
#* expressed or implied, with regard to these programs or                 *
#* to the documentation contained within. The author shall not            *
#* be liable in any event for incidental or consequential damages         *
#* in connection with, or arising out of, the furnishing,                 *
#* performance, or use of these programs.                                 *
#*                                                                        *
#* This content is intended for tutorials, online articles,               *
#* and other educational purposes.                                        *
#**************************************************************************
```

# Introduction to scikit-learn

**Scikit-learn** is one of the most prominent and widely used general-purpose machine learning libraries for Python. It offers a robust set of tools for data analysis, preprocessing, and building predictive models. Designed with simplicity and efficiency in mind, scikit-learn is a versatile library that caters to both beginner and advanced machine learning practitioners.

### Why Use scikit-learn?

scikit-learn is highly regarded due to its ease of use, comprehensive documentation, and the wide range of machine learning algorithms it offers. Whether you are working on classification, regression, clustering, or dimensionality reduction, scikit-learn provides well-optimized and standardized methods to streamline your workflow.

Key benefits of scikit-learn include:
- **User-friendly API**: Simple and consistent interfaces for performing machine learning tasks.
- **Comprehensive set of algorithms**: From classical machine learning techniques to advanced methods for supervised and unsupervised learning.
- **Efficiency**: Core routines are built on NumPy, SciPy, and Cython, so they perform well on datasets that fit in memory.
- **Interoperability**: Easily integrates with other Python libraries like NumPy, SciPy, and Pandas.

---

### Core Submodules in scikit-learn

scikit-learn is organized into several submodules, each focusing on different aspects of machine learning and data processing. Below is a breakdown of its core functionality:

#### 1. **Classification**
Classification algorithms are used to assign input data into discrete categories. scikit-learn offers a wide range of algorithms, including:
- **Support Vector Machines (SVM)**: Powerful for classification in both linear and non-linear settings.
- **k-Nearest Neighbors (KNN)**: Simple and intuitive, KNN classifies new data points based on the majority class of its neighbors.
- **Random Forests**: A popular ensemble method that combines multiple decision trees to improve prediction accuracy.
- **Logistic Regression**: A classical algorithm widely used for binary classification problems.
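
As a minimal sketch of how these classifiers are used, the example below trains all four on the built-in Iris dataset and reports test accuracy. The dataset choice, the 75/25 split, and the hyperparameters are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Load features and labels, then hold out 25% of the samples for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters below are illustrative defaults, not tuned values.
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Every classifier exposes the same fit / score interface.
test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    test_accuracy[name] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Because all estimators share the same interface, swapping one algorithm for another only requires changing the object that is constructed.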

#### 2. **Regression**
Regression algorithms predict continuous values based on input data. Key algorithms in scikit-learn include:
- **Lasso Regression**: Linear regression with L1 regularization, which can shrink some coefficients to exactly zero and therefore also acts as a form of feature selection.
- **Ridge Regression**: Linear regression with L2 regularization, which shrinks coefficients toward zero without eliminating them, helping to prevent overfitting.
- **Support Vector Regression (SVR)**: The regression counterpart of SVM, effective for both linear and non-linear regression tasks.
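
The following sketch fits the three regressors on the built-in diabetes dataset and counts how many coefficients each linear model sets to exactly zero. The values `alpha=1.0` and `C=100.0` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# alpha controls the regularization strength (illustrative value).
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# SVR with an RBF kernel can model non-linear relationships.
svr = SVR(kernel="rbf", C=100.0).fit(X, y)

# L1 regularization can zero out coefficients; L2 only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients equal to zero:", n_zero_lasso)
print("Ridge coefficients equal to zero:", n_zero_ridge)
print("SVR predictions for the first 3 samples:", svr.predict(X[:3]))
```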

#### 3. **Clustering**
Clustering is an unsupervised learning technique used to identify inherent groupings in data. scikit-learn provides:
- **k-Means**: One of the most popular clustering algorithms that partitions the data into k distinct clusters based on centroids.
- **Spectral Clustering**: A powerful clustering method that uses graph theory and spectral decomposition techniques to detect clusters.
- **DBSCAN**: A density-based clustering algorithm ideal for finding arbitrarily shaped clusters.
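
A short sketch of two of these algorithms on synthetic data is shown below. The generated datasets and the parameters (`n_clusters=3`, `eps=0.3`, `min_samples=5`) are assumptions chosen for illustration only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs, make_moons

# k-Means on three well-separated synthetic blobs.
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
blob_labels = kmeans.fit_predict(X_blobs)
print("k-Means centroids:\n", kmeans.cluster_centers_)

# DBSCAN on two interleaving half-moons, a shape k-Means handles poorly.
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.3, min_samples=5)
moon_labels = dbscan.fit_predict(X_moons)  # a label of -1 marks noise points
print("DBSCAN labels found:", sorted(set(moon_labels)))
```

Note that k-Means needs the number of clusters up front, while DBSCAN infers it from the density parameters.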

#### 4. **Dimensionality Reduction**
Dimensionality reduction is useful for simplifying large datasets by reducing the number of features while preserving essential patterns. scikit-learn includes:
- **Principal Component Analysis (PCA)**: A fundamental technique for reducing the dimensionality of datasets while retaining as much variance as possible.
- **Feature Selection**: Techniques like Recursive Feature Elimination (RFE) and SelectKBest are used to select the most relevant features.
- **Matrix Factorization**: Algorithms like Non-Negative Matrix Factorization (NMF) are provided to reduce dimensions by factorizing matrices into lower-dimensional representations.
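
The sketch below contrasts PCA, which builds new features as combinations of the originals, with SelectKBest, which keeps a subset of the original features. Reducing the Iris data to two features is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# PCA: project the data onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())

# SelectKBest: keep the 2 original features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Indices of selected features:", selector.get_support(indices=True))
```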

#### 5. **Model Selection**
Model selection is crucial for optimizing machine learning models. scikit-learn offers several utilities to facilitate model tuning and evaluation, including:
- **Grid Search**: A method for exhaustively searching over specified hyperparameter values to find the optimal model.
- **Cross-Validation**: A technique for assessing how well a model generalizes to an independent dataset by training and testing it multiple times with different subsets of data.
- **Metrics**: A rich collection of performance metrics in the `sklearn.metrics` module, including accuracy, precision, recall, F1-score, ROC-AUC, and more.
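
As a rough sketch, cross-validation and grid search can be combined as follows. The estimator (`SVC`), the parameter grid, and the 5-fold setting are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per held-out fold.
cv_scores = cross_val_score(SVC(), X, y, cv=5)
print("Fold accuracies:", cv_scores)

# Grid search: evaluate every parameter combination with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on the full data with the best parameters found.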

#### 6. **Preprocessing**
Preprocessing tools are essential for preparing data before feeding it into machine learning algorithms. scikit-learn provides:
- **Feature Scaling**: Methods like MinMaxScaler and StandardScaler to normalize or standardize your data for better model performance.
- **Imputation**: Techniques for handling missing data, such as mean, median, or constant value imputation.
- **Encoding Categorical Features**: Utilities like OneHotEncoder and OrdinalEncoder convert categorical features into numerical formats suitable for model training. LabelEncoder does the same for the target labels.
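
The three preprocessing steps above can be sketched on a tiny hand-made array. The toy values and the mean-imputation strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numeric data with one missing value (np.nan).
X_num = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]])

# Imputation: replace the missing value with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# Scaling: give each column zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# Encoding: one binary column per category (categories are sorted).
X_cat = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X_cat).toarray()  # convert sparse result to dense

print(X_imputed)
print(X_scaled)
print(encoder.categories_)
print(X_onehot)
```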

---

### Educational Purpose

This guide to scikit-learn is designed to provide students and aspiring data scientists with a structured understanding of how to approach machine learning tasks. By breaking down the various submodules and algorithms, learners can gain insight into the fundamental building blocks of machine learning. 

Here’s how to maximize your learning:
1. **Hands-On Practice**: Experiment with different algorithms using real-world datasets. The best way to solidify your understanding is by applying concepts through coding.
2. **Understand the Theory**: While scikit-learn makes machine learning implementation easy, it's essential to grasp the theory behind each algorithm to make informed decisions about which methods to use and when.
3. **Experiment with Model Selection**: Use scikit-learn’s powerful tools like Grid Search and Cross-Validation to tune your models and find the best hyperparameters.
4. **Preprocess Data Effectively**: Learn to handle raw data using scikit-learn’s preprocessing tools to ensure your models perform well in practice.

---

### Conclusion

scikit-learn is a versatile and efficient library that covers a wide range of machine learning tasks, from preprocessing to model selection. Its simplicity and powerful features make it an excellent choice for both beginners and experienced professionals looking to implement machine learning solutions. By mastering the core functionalities of scikit-learn, you will be equipped to tackle various real-world data science problems with confidence.

---

### References
- [scikit-learn Official Documentation](https://scikit-learn.org/stable/documentation.html)
- Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*

### Installing scikit-learn
You can install the package with either of the following commands:
```bash
pip install scikit-learn

# Or use the conda package manager
conda install scikit-learn
```
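
One simple way to confirm that the installation worked is to import the package in Python and print its version. Note that the package is imported under the name `sklearn`. The variable name `installed_version` is only illustrative.

```python
import sklearn

# If this import succeeds, scikit-learn is installed in the active environment.
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```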

### Importing Modules from scikit-learn

The package is installed as scikit-learn but imported under the name `sklearn`. To import a function or a submodule, use this syntax:
```python 
from sklearn import linear_model
```

In [1]:
from sklearn import linear_model

### Example of Importing datasets 

In [2]:
from sklearn import datasets

In [3]:
print(dir(datasets))

['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__getattr__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_arff_parser', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups', 'clear_data_home', 'dump_svmlight_file', 'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 'get_data_home', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_f

### Checking the Content of datasets

In [4]:
print([data for data in dir(datasets) if not data.startswith('_')])

['clear_data_home', 'dump_svmlight_file', 'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 'get_data_home', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_friedman2', 'make_friedman3', 'make_gaussian_quantiles', 'make_hastie_10_2', 'make_low_rank_matrix', 'make_moons', 'make_multilabel_classification', 'make_regression', 'make_s_curve', 'make_sparse_coded_signal', 'make_sparse_spd_matrix', 'make_sparse_uncorrelated', 'make_spd_matrix', 'make_swiss_roll', 'textwrap']


## Scikit-Learn Data Type

We are going to load one of the built-in datasets from the __sklearn__ package and check its type.

In [5]:
diabetes = datasets.load_diabetes()

In [6]:
print(dir(diabetes))

['DESCR', 'data', 'data_filename', 'data_module', 'feature_names', 'frame', 'target', 'target_filename']


In [7]:
type(diabetes)

sklearn.utils._bunch.Bunch

The dataset loaders in `sklearn.datasets` return a __Bunch__ object, a dictionary subclass whose keys can also be accessed as attributes.

### The Content of a Bunch

  - **DESCR**: description of the dataset.
  - **data**: values of the independent variables (features).
  - **feature_names**: names of the independent variables.
  - **target**: values of the dependent variable.
  - **data_filename**, **target_filename**: the files that store the data and the target on the local system.

In [8]:
print(diabetes.keys())

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])
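
Because a Bunch behaves like a dictionary, each entry can be read either by key or as an attribute. A minimal sketch, reloading the dataset so the snippet is self-contained:

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# A Bunch is a dict subclass, so the usual dict operations work.
print(isinstance(diabetes, dict))
print("target" in diabetes)

# Key access and attribute access return the very same object.
data_by_key = diabetes["data"]
data_by_attr = diabetes.data
print(data_by_key is data_by_attr)
```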


### Dataset Description 

You can get information about a built-in dataset by printing its __DESCR__ attribute.

In [9]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

### Feature Names

In [10]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

### Data

In [11]:
diabetes.data

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]])

In [12]:
diabetes.data.shape

(442, 10)

### The target Variable

In [13]:
diabetes.target.shape

(442,)

## Setting the Data for Building Models

In [14]:
import pandas as pd 

In [15]:
X = diabetes.data
y = diabetes.target

In [16]:
diabetes_df = pd.DataFrame(X, columns = diabetes.feature_names)

In [17]:
diabetes_df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [18]:
diabetes_df.tail(10)

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 432 | 0.009016 | -0.044642 | 0.055229 | -0.00567 | 0.057597 | 0.044719 | -0.002903 | 0.023239 | 0.055686 | 0.106617 |
| 433 | -0.02731 | -0.044642 | -0.060097 | -0.02977 | 0.046589 | 0.01998 | 0.122273 | -0.039493 | -0.051404 | -0.009362 |
| 434 | 0.016281 | -0.044642 | 0.001339 | 0.008101 | 0.005311 | 0.010899 | 0.030232 | -0.039493 | -0.045424 | 0.032059 |
| 435 | -0.01278 | -0.044642 | -0.023451 | -0.040099 | -0.016704 | 0.004636 | -0.017629 | -0.002592 | -0.03846 | -0.038357 |
| 436 | -0.05637 | -0.044642 | -0.074108 | -0.050427 | -0.02496 | -0.047034 | 0.09282 | -0.076395 | -0.061176 | -0.046641 |
| 437 | 0.041708 | 0.05068 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 438 | -0.005515 | 0.05068 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 |
| 439 | 0.041708 | 0.05068 | -0.015906 | 0.017293 | -0.037344 | -0.01384 | -0.024993 | -0.01108 | -0.046883 | 0.015491 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.02656 | 0.044529 | -0.02593 |
| 441 | -0.045472 | -0.044642 | -0.07303 | -0.081413 | 0.08374 | 0.027809 | 0.173816 | -0.039493 | -0.004222 | 0.003064 |


Alternatively, passing `return_X_y=True` to the loader returns the features and the target in one step, as a tuple:
```python
X, y = datasets.load_diabetes(return_X_y=True)
```
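
With `X` and `y` available, a common next step before building a model is to hold out part of the data for testing. The sketch below assumes an 80/20 split and a fixed `random_state`, both arbitrary choices. It also shows the loader's `as_frame=True` option, which returns a pandas DataFrame that includes the target column.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Features and target in one call.
X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the samples for evaluating a model later on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training set:", X_train.shape, "Test set:", X_test.shape)

# as_frame=True returns pandas objects; .frame holds features plus target.
diabetes_frame = datasets.load_diabetes(as_frame=True).frame
print(diabetes_frame.columns.tolist())
```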