## 🎯 Project Goal:
Create a machine learning pipeline to classify a toy dataset using scikit-learn. We'll walk through each step, from data preprocessing to model evaluation.

### 1. 🧰 Step 1: Import Libraries
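**Description:** We import the libraries needed to build and evaluate the pipeline: scikit-learn for the dataset, preprocessing, model, and metrics, and pandas for tabular data handling. The pipeline steps below operate directly on the NumPy arrays returned by `load_iris`, so pandas is not strictly required for them.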

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

### 2. 📊 Step 2: Load the Toy Dataset
**Description:** We'll use the famous Iris dataset, a small dataset with 150 samples of iris flowers, categorized into three species. This dataset is included in scikit-learn.

In [2]:
iris = load_iris()
X = iris.data
y = iris.target

In [9]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

- **Dataset Info:** The Iris dataset consists of 4 features (sepal length, sepal width, petal length, petal width) and a target variable with three classes (Setosa, Versicolor, Virginica).
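
To make the dataset description above concrete, here is a minimal sketch that loads the data into a pandas DataFrame for inspection. The names `df` and the `species` column are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One column per feature, using the feature names shipped with the dataset.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Translate the integer targets (0, 1, 2) into species names via NumPy indexing.
# "species" is a hypothetical column name chosen for this example.
df["species"] = iris.target_names[iris.target]

print(df.shape)                      # rows x columns (4 features + 1 label)
print(df["species"].value_counts())  # number of samples per class
```

The class counts show that each species has the same number of samples, which is why the dataset is described as balanced.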

### 3. 🔀 Step 3: Split the Data
**Description:** We'll split the dataset into training and testing sets to evaluate the model's performance later.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

- **Why Split?:** Splitting ensures the model is evaluated on unseen data, helping to gauge its generalization ability.
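
As a quick check of what the split produces, the sketch below prints the resulting shapes and the per-class counts in the test set. The second call, with `stratify=y`, is an optional variant that the notebook does not use; it is shown only to illustrate how class proportions can be preserved, and `y_test_strat` is a name made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the notebook: 30% held out, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not part of the notebook):
# stratify=y keeps the class proportions identical in both splits.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("unstratified test counts:", np.bincount(y_test))
print("stratified test counts:  ", np.bincount(y_test_strat))
```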

### 4. 🛠️ Step 4: Create a Pipeline
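**Description:** The pipeline simplifies the workflow by chaining multiple steps, including preprocessing and model training, into a single object. Because the preprocessing steps are fitted only on the data passed to `fit`, the transformations learned from the training set are reapplied unchanged to the test set at prediction time, which keeps test data from leaking into preprocessing.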

In [4]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),   # Step 1: Scale the data
    ('pca', PCA(n_components=2)),   # Step 2: Reduce dimensions
    ('svc', SVC(kernel='linear'))   # Step 3: Train a Support Vector Classifier
])

#### Pipeline Explanation:
- **Scaler:** Standardizes each feature to a mean of 0 and a standard deviation of 1. PCA and SVMs are both sensitive to feature scale, so this keeps features with larger numeric ranges from dominating.
- **PCA:** Projects the four standardized features onto the 2 directions of greatest variance, which makes the data easy to plot at the cost of discarding some information.
- **SVC:** A Support Vector Classifier with a linear kernel, a reasonable choice for small to medium-sized datasets such as this one.
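
The effect of the two preprocessing steps can be checked in isolation. The sketch below applies them to the full dataset purely for illustration (the pipeline itself fits them on the training split only) and reports how much variance the two retained components explain.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize: each column should end up with mean ~0 and standard deviation ~1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("means:", X_scaled.mean(axis=0).round(6))
print("stds: ", X_scaled.std(axis=0).round(6))

# Reduce the 4 standardized features to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("reduced shape:", X_reduced.shape)

# Fraction of the total variance kept by each of the 2 components.
print("explained variance ratio:", pca.explained_variance_ratio_)
```

If the explained variance ratios sum to a value close to 1, little information is lost by dropping the remaining components.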

### 5. 🚀 Step 5: Train the Model
**Description:** We'll train the pipeline using the training data. The pipeline automatically applies each step in sequence.

In [5]:
pipeline.fit(X_train, y_train)

- **Training:** During training, the pipeline first scales the data, then applies PCA, and finally trains the SVC on the transformed data.
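
To show what the pipeline does in sequence, here is a sketch that performs the same three steps by hand and compares the predictions with those of the pipeline. The variable names (`manual_pred`, `same`, and so on) are illustrative; the point is only that each transformer is fitted on the training data and then reused on the test data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline from the notebook.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svc", SVC(kernel="linear")),
])
pipeline.fit(X_train, y_train)

# The same steps written out manually.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)      # fit on training data only
pca = PCA(n_components=2)
X_train_reduced = pca.fit_transform(X_train_scaled)  # fit on scaled training data
svc = SVC(kernel="linear").fit(X_train_reduced, y_train)

# At prediction time the fitted transformers are applied, not refitted.
manual_pred = svc.predict(pca.transform(scaler.transform(X_test)))

same = np.array_equal(manual_pred, pipeline.predict(X_test))
print("manual steps match pipeline:", same)
```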

### 6. 🧪 Step 6: Make Predictions
**Description:** We'll use the trained model to predict the species of the flowers in the test set.

In [6]:
y_pred = pipeline.predict(X_test)
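
The predictions are integer class labels. A minimal sketch, assuming the same pipeline and split as above, of how they could be mapped back to species names (`predicted_species` and `true_species` are names chosen for this example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svc", SVC(kernel="linear")),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Index the array of class names with the integer labels to get readable output.
predicted_species = iris.target_names[y_pred]
true_species = iris.target_names[y_test]

# Show the first few test samples side by side.
for true_name, pred_name in zip(true_species[:5], predicted_species[:5]):
    print(f"true: {true_name:<12} predicted: {pred_name}")
```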

### 7. 📈 Step 7: Evaluate the Model
**Description:** Finally, we'll evaluate the model's performance using accuracy as the metric.

In [7]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 93.33%


- **Why Accuracy?:** Accuracy is a simple and intuitive metric to evaluate classification performance, especially for balanced datasets like Iris.
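
Accuracy is a single number, so it does not show which classes are being confused. As an optional complement (not part of the original notebook), the sketch below uses scikit-learn's `confusion_matrix` and `classification_report` on the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svc", SVC(kernel="linear")),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=iris.target_names
))

# Accuracy equals the diagonal of the confusion matrix divided by the total.
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
```

Off-diagonal entries in the matrix show which species the model mixes up.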