# Python Tutorial: Scikit-Learn (sklearn)

Scikit-learn (sklearn) is a popular machine learning library in Python that provides various tools for building and applying machine learning models. 
    

## Official Documentation

https://scikit-learn.org/stable/


<details>
<summary><b>Overview</b></summary>

Sklearn offers a wide range of models, which can be broadly categorized into the following types based on the learning task:

1. **Supervised Learning Models:**
   In supervised learning, the algorithm learns from labeled data, meaning the input data is accompanied by corresponding output labels. Supervised learning models in sklearn can be further divided into two subcategories:

   - **Classification Models:** These models are used for predicting categorical labels. Examples include Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), etc.
   
   - **Regression Models:** Regression models are used for predicting continuous values. Examples include Linear Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR), etc.

2. **Unsupervised Learning Models:**
   In unsupervised learning, the algorithm learns patterns and structures from unlabeled data. Unsupervised learning models in sklearn include:

   - **Clustering Models:** These models are used for grouping similar data points into clusters based on some similarity measure. Examples include K-Means, Hierarchical Clustering, DBSCAN, etc.
   
   - **Dimensionality Reduction Models:** These models are used for reducing the number of features in the data while preserving important information. Examples include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc.

3. **Semi-Supervised Learning Models:**
   Semi-supervised learning combines both labeled and unlabeled data to improve learning accuracy. Sklearn provides some semi-supervised learning algorithms, including LabelPropagation and LabelSpreading.

4. **Model Selection and Evaluation:**
   Sklearn also provides tools for model selection and evaluation, including:

   - **Cross-Validation:** Techniques like k-fold cross-validation, which splits the dataset into k subsets and trains the model k times, each time using a different subset as the test set.
   
   - **Model Evaluation Metrics:** Sklearn offers various metrics to evaluate model performance, such as accuracy, precision, recall, F1-score for classification, and mean squared error, R-squared for regression.
   
   - **Hyperparameter Tuning:** GridSearchCV and RandomizedSearchCV are used for finding the best hyperparameters for a model by exhaustively searching through a specified parameter grid or randomly sampling from a parameter distribution.

5. **Ensemble Methods:**
   Ensemble methods combine multiple individual models to improve performance. Sklearn provides ensemble methods like Random Forest, Gradient Boosting, AdaBoost, etc.

These are some of the major model types and functionalities offered by sklearn. Each model type has its strengths and weaknesses, and choosing the right model depends on the specific problem at hand and the characteristics of the dataset.

</details>
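
As a minimal sketch of the cross-validation and hyperparameter tuning tools described in the overview, the snippet below scores a k-Nearest Neighbors classifier with 5-fold cross-validation and then searches a small grid with `GridSearchCV`. The choice of the Iris dataset, of kNN, and of the `n_neighbors` values in the grid are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: five train/test rounds, one accuracy score per fold
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

# Exhaustive search over a small, arbitrarily chosen grid of n_neighbors values
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of candidates instead of trying every combination, which tends to matter once the grid becomes large.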

## Installation
  
You can install scikit-learn using pip:


In [None]:
pip install scikit-learn


## Example 1: Linear Regression

Linear Regression is a simple and widely used supervised learning algorithm for predicting continuous values. It establishes a linear relationship between the independent variables (features) and the dependent variable (target). In sklearn, linear regression is implemented in the `LinearRegression` class.


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate some random data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)

# Visualize the data
plt.scatter(X, y)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Random Data for Linear Regression')
plt.show()

# Create a Linear Regression model
model = LinearRegression()

# Train the model
# `X` is a 2-dimensional array-like of shape (n_samples, n_features) containing the training data.
# `y` holds the target values for `X`. Here it is a column of shape (100, 1); a 1-dimensional
# array of shape (n_samples,) also works and is the more common form.
model.fit(X, y)

# Make predictions
# `X_new` is a 2-dimensional array-like object containing the test data (features).
# `y_pred` will contain the predicted values corresponding to `X_new`.
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)

# Visualize the linear regression line
plt.scatter(X, y)
plt.plot(X_new, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()
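
To check what the model actually learned, the fitted slope and intercept can be compared with the values used to generate the data (slope 3, intercept 0). The sketch below regenerates the same synthetic data, flattens `y` to one dimension, and reports two regression metrics. Scoring on the training data itself is a simplification made here for brevity; a held-out test set would give a more honest estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Same synthetic data as above: y = 3x + Gaussian noise, flattened to 1-D
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = (3 * X + np.random.randn(100, 1)).ravel()

model = LinearRegression().fit(X, y)

# Learned parameters; the data was generated with slope 3 and intercept 0
slope = model.coef_[0]
intercept = model.intercept_
print("Slope:", slope)
print("Intercept:", intercept)

# Metrics computed on the training data (illustrative only)
y_fit = model.predict(X)
mse = mean_squared_error(y, y_fit)
r2 = r2_score(y, y_fit)
print("MSE:", mse)
print("R^2:", r2)
```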


## Example 2: Classification with Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression tasks. In the context of classification, SVM aims to find the optimal hyperplane that best separates different classes in the feature space. Sklearn provides an implementation of SVM for classification tasks in the `SVC` (Support Vector Classification) class.


<details>
<summary><b>How SVM Classification Works</b></summary>
    
1. **Maximum Margin Classifier:**
   SVM seeks to find the hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the closest data points (support vectors) from each class. This hyperplane effectively creates decision boundaries between different classes.

2. **Kernel Trick:**
   SVM can handle data that is not linearly separable by using the kernel trick. Instead of explicitly mapping the input features into a higher-dimensional space, a kernel function computes, directly from the original inputs, the dot product the points would have in that space. The classifier therefore behaves as if the data had been transformed, without ever computing the transformation.

3. **Soft Margin Classification:**
   In cases where the data is not linearly separable, SVM allows for soft margin classification. This means that some misclassification of training examples is tolerated to achieve a wider margin and better generalization to unseen data. The regularization parameter `C` controls the trade-off between maximizing the margin and minimizing the classification error: a small `C` tolerates more margin violations, while a large `C` penalizes them more heavily.
 
</details>    
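
The effect of the kernel and of `C` can be explored with a small experiment. The sketch below uses `make_moons` to generate two interleaving half-circles, compares a linear kernel with the RBF kernel, and then counts support vectors for several values of `C`. The dataset, noise level, and the specific `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Compare a linear kernel with the RBF kernel (the SVC default)
kernel_scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    kernel_scores[kernel] = clf.score(X_test, y_test)
    print(f"kernel={kernel}: test accuracy = {kernel_scores[kernel]:.3f}")

# Vary C and count the support vectors the fitted model keeps
support_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    support_counts[C] = int(clf.n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors")
```

The exact numbers depend on the data, so it is worth rerunning with a different `noise` or `random_state` before drawing conclusions.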

<details>
<summary><b>The fit() method explained</b></summary>
    
In scikit-learn, the `fit()` method is used to train a machine learning model on a given dataset. Here's a breakdown of what the `fit()` method does:

1. **Model Initialization**: Before calling `fit()`, you create an instance of a scikit-learn estimator (model) object, specifying any hyperparameters that control the model's behavior.

2. **Training Data**: You pass the training data (features and corresponding labels, if applicable) to the `fit()` method. This data is used to train the model, meaning the model adjusts its internal parameters to minimize the difference between its predictions and the true labels (in supervised learning) or to capture patterns in the data (in unsupervised learning).

3. **Training Process**: The `fit()` method triggers the training algorithm specific to the chosen estimator. During training, the model learns from the provided data by updating its internal parameters based on the optimization objective and the patterns observed in the data.

4. **Model Adaptation**: As the model is trained, its internal parameters are adjusted to fit the training data. For example, in linear regression, the coefficients are calculated to minimize the residual sum of squares, while in decision trees, the splits are determined to maximize information gain or minimize impurity.

5. **Return Value**: The `fit()` method typically returns the trained estimator object itself. This allows you to chain method calls or access attributes of the trained model for further analysis or prediction.

6. **Optional Parameters**: Hyperparameters that control training, such as the number of iterations, convergence tolerance, or regularization strength, are set when the estimator is constructed (or later through `set_params()`), not in `fit()`. The `fit()` method itself takes the data plus, for some estimators, data-dependent extras such as `sample_weight`.

7. **Error Handling**: If there are any issues during the training process, such as invalid input data or numerical instability, the `fit()` method may raise exceptions or issue warnings to alert the user.

In summary, the `fit()` method in scikit-learn is the primary interface for training machine learning models. It encapsulates the training logic specific to each estimator and is a fundamental step in the machine learning workflow.

</details>    
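
A short sketch of the points above, using `Ridge` on a tiny hand-made dataset (both are arbitrary choices for illustration): the hyperparameter goes to the constructor, predicting before fitting raises `NotFittedError`, `fit()` returns the estimator itself, and the learned parameters appear as attributes whose names end in an underscore.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import Ridge

# Toy data that follows y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Hyperparameters (here the regularization strength alpha) go to the constructor
model = Ridge(alpha=1.0)

# Predicting before fitting raises NotFittedError
try:
    model.predict(X)
    raised = False
except NotFittedError:
    raised = True
print("Raised NotFittedError before fit:", raised)

# fit() returns the estimator itself, so calls can be chained
returned = model.fit(X, y)
print("fit() returned the same object:", returned is model)

# Learned parameters are stored in attributes ending with an underscore
print("coef_:", model.coef_, "intercept_:", model.intercept_)

# Chained form: construct, fit, and predict in one expression
y_pred = Ridge(alpha=1.0).fit(X, y).predict(np.array([[5.0]]))
print("Prediction for x=5:", y_pred)
```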

In [None]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load data and populate X and y
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SVM classifier (default RBF kernel)
svm_model = SVC()

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
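
Accuracy is a single number, so it can hide which classes are being confused. As a hedged sketch, the snippet below repeats the same split and model and prints a confusion matrix and a per-class report with precision, recall, and F1-score.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
y_pred = SVC().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```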


## Exercise 1: 

Load the Breast Cancer dataset from `sklearn.datasets`.


In [None]:
# Solution


## Exercise 2: 

Train a Decision Tree classifier on the Breast Cancer dataset and evaluate its performance.


<details>
<summary><b>Overview</b></summary>
    
The DecisionTreeClassifier in scikit-learn (sklearn) is a popular machine learning model used for classification tasks. It builds a decision tree from the training data, which can be visualized as a flowchart-like structure. At each node of the tree, the algorithm makes a decision based on a feature's value, leading to a splitting of the data into different branches. This process continues recursively until a stopping criterion is met, such as reaching a maximum depth, minimum samples per leaf, or other user-defined conditions.

Here's a breakdown of some key aspects:

1. **Splitting Criterion**: The decision tree algorithm determines the best way to split the data at each node. Common splitting criteria include Gini impurity and entropy, which measure the impurity of a node (i.e., how mixed the classes are).

2. **Tree Pruning**: Decision trees have a tendency to overfit the training data, capturing noise along with the underlying patterns. Pruning techniques are used to prevent this by removing or never growing parts of the tree that do not provide significant predictive power. In scikit-learn, growth can be limited up front with parameters like `max_depth`, `min_samples_split`, and `min_samples_leaf`, and a grown tree can be cut back with minimal cost-complexity pruning through `ccp_alpha`.

3. **Handling Categorical and Numeric Data**: DecisionTreeClassifier in scikit-learn splits numeric features with binary threshold tests (feature value less than or equal to a threshold). It does not handle categorical features natively: they have to be converted to numbers first, for example with one-hot encoding.

4. **Performance Metrics**: Once trained, the DecisionTreeClassifier can be used to make predictions on new data. Common performance metrics for classification tasks include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).

5. **Feature Importance**: Decision trees also provide a measure of feature importance, indicating which features were most influential in making decisions. This can be useful for feature selection and understanding the data.

Overall, DecisionTreeClassifier is a versatile and widely used algorithm due to its simplicity and interpretability. In scikit-learn it works on numeric inputs, so categorical features need to be encoded first. It's also important to tune its parameters carefully to prevent overfitting and achieve good performance.
    
</details>    
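
To see these aspects in code without giving away the exercise, the sketch below uses the Wine dataset instead of Breast Cancer. It compares a tree grown with default settings against a constrained one and lists the most important features of the constrained tree. The specific values for `max_depth`, `min_samples_leaf`, and `ccp_alpha` are assumptions picked for illustration, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=0
)

# Tree grown with default settings (no depth or leaf-size limit)
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)

# Constrained tree: depth limit, minimum leaf size, and cost-complexity pruning
small_tree = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_leaf=5,
    ccp_alpha=0.01,
    random_state=0,
)
small_tree.fit(X_train, y_train)

for name, tree in [("default", full_tree), ("constrained", small_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )

# Impurity-based feature importances, largest first
ranked = sorted(
    zip(wine.feature_names, small_tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, importance in ranked[:3]:
    print(f"{feature}: {importance:.3f}")
```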

In [None]:
# Solution


## Exercise 3: 

Use K-Means clustering on the Iris dataset and visualize the clusters.


<details>
<summary><b>Overview</b></summary>
    
The K-means clustering algorithm is a popular unsupervised learning method used for partitioning a dataset into K distinct, non-overlapping clusters. Here's how it works:

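1. **Initialization**: The algorithm starts by choosing K initial cluster centroids (points in the feature space), for example at random within the domain of the data. These centroids represent the centers of the initial clusters. Scikit-learn's `KMeans` uses the `k-means++` scheme by default, which spreads the initial centroids apart.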

2. **Assignment Step**: Each data point in the dataset is assigned to the nearest centroid, forming K clusters. The "nearest" centroid is typically determined using a distance metric such as Euclidean distance.

3. **Update Step**: After assigning each data point to a cluster, the centroids are updated by computing the mean of all the data points assigned to each cluster. This moves the centroids to new locations within the feature space.

4. **Iteration**: Steps 2 and 3 are repeated iteratively until a stopping criterion is met. Common stopping criteria include a maximum number of iterations or when the centroids no longer change significantly between iterations.

5. **Convergence**: Eventually, the centroids converge to stable positions, and the algorithm stops. At this point, the data points are clustered in a way that locally minimizes the sum of squared distances between each data point and its corresponding centroid, known as the "within-cluster sum of squares" or "inertia." The result is a local optimum and not necessarily the best possible clustering.

6. **Final Clustering**: The final clusters are formed based on the converged centroids.

Key considerations and characteristics of K-means clustering:

- **Number of Clusters (K)**: One of the critical parameters in K-means clustering is the number of clusters, K. It's often determined using domain knowledge or by using techniques such as the elbow method or silhouette analysis to find the optimal value of K.

- **Cluster Centroids**: The centroids represent the centers of the clusters and are updated iteratively to minimize the within-cluster sum of squares.

- **Initialization Sensitivity**: K-means clustering is sensitive to the initial positions of the centroids. Different initializations may lead to different final clustering results.

- **Scalability**: K-means clustering is computationally efficient and scales well with large datasets. However, it may struggle with high-dimensional data or clusters of varying sizes and densities.

- **Assumptions**: K-means assumes that clusters are spherical and of similar size, which may not always hold true in real-world datasets. As a result, it may produce suboptimal results for non-linear or irregularly shaped clusters.

Overall, K-means clustering is widely used for data exploration, pattern recognition, and segmentation tasks, but it's essential to interpret the results carefully and consider its limitations.

</details>    
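
The steps above can be sketched in plain NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name `kmeans`, its parameters, and the three-blob toy data are all hypothetical, and the initialization simply picks K data points at random.

```python
import numpy as np


def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of the points in each cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and inertia using the final centroids
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = float((distances[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, inertia


# Hypothetical toy data: three Gaussian blobs in two dimensions
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in blob_centers])

centroids, labels, inertia = kmeans(X, k=3)
print("Centroids:\n", centroids)
print("Cluster sizes:", np.bincount(labels, minlength=3))
print("Inertia:", inertia)
```

Changing `seed` changes the starting centroids and can change the result, which is the initialization sensitivity noted above. In practice `sklearn.cluster.KMeans` is the better choice; it exposes the same quantities as `cluster_centers_`, `labels_`, and `inertia_`.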

In [None]:
# Solution


## Summary

Scikit-learn is a powerful library for machine learning in Python, offering a wide range of algorithms and tools for various tasks. By following this tutorial and practicing the exercises, you'll gain a good understanding of how to use scikit-learn effectively for building and evaluating machine learning models.


<details>
<summary><b>Instructor Notes</b></summary>

I wanted to have the sklearn datasets already downloaded if the class is unable to get the data. There were a few changes made to get everything to work.
    
The 'species' column was not showing up. I followed this stackoverflow.com link.
    
https://stackoverflow.com/questions/69821857/iris-dataset-not-showing-species-column
    
How to create the species column from target and target_names columns?
    
You just need a dict mapping to replace 0 by 'setosa', 1 by 'versicolor' and 2 by 'virginica'. Use enumerate to create a list of tuples [(0, 'setosa'), (1, 'versicolor'), (2, 'virginica')], then dict to convert it to a dictionary.

Now Series.map will map the corresponding values.
    
Then I was getting the following error:
    
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
    
From this link:
    
https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected   
    
    
When loading the data from the csv, pass `y_train.values.ravel()` to `fit()` to avoid this warning.
    
The problem was that the labels were a single column of shape (n_samples, 1), while a 1d array of shape (n_samples,) was expected. `ravel()` flattens the column.
    
Here is the code to save the dataset to a csv:
    
```python
# This is how the dataset was originally loaded from sklearn and saved to a csv

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build the 'species' column by mapping each integer target to its name
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns= iris['feature_names'] + ['target']).astype({'target': int}) \
       .assign(species=lambda x: x['target'].map(dict(enumerate(iris['target_names']))))

df.to_csv('iris.csv', header = True, index = False)
```
    
Here is the code to read the csv:
    
```python    
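# Read from csv and populate X and y
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv('iris.csv')

X = df[['sepal length (cm)', 'sepal width (cm)','petal length (cm)', 'petal width (cm)']]
y = df[['species']]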

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
svm_model = SVC()

# Flatten the labels with y_train.values.ravel() to avoid the column-vector warning
svm_model.fit(X_train, y_train.values.ravel())

# Make predictions
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
```

</details>