
# Installing the scikit-learn Package


In [12]:
%pip install scikit-learn
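
To confirm that the installation worked, a quick check is to import the package and print its version. This is only a minimal sketch; the version printed depends on your environment.

```python
import sklearn

# Print the installed scikit-learn version to confirm the import works
print("scikit-learn version:", sklearn.__version__)
```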



# Dataset Loading

A collection of data is called a **dataset**.  
It mainly consists of the following two components:

## 1. Features
- The variables (columns) of the data are called **features**.  
- They are also known as:
  - Predictors  
  - Inputs  
  - Attributes  
- Features are used as input to make predictions.

## 2. Response
- The **response** is the output variable.  
- Its value depends on the feature variables.  
- It is also known as:
  - Target  
  - Label  
  - Output  
- The response represents the result we want to predict.

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])


Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 10 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
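
As a quick check on how features and response fit together, the sketch below prints the shape of both arrays and counts the samples per class. The variable name `class_counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# X holds one row per sample and one column per feature
print("X shape:", X.shape)
# y holds one class label per sample
print("y shape:", y.shape)

# Count how many samples belong to each class
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")
```

For the Iris dataset this reports 150 samples, 4 features, and 50 samples in each of the 3 classes.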


# Preprocessing the Data

We usually deal with a large amount of data, and most of it is in **raw form**.  
Before giving this data as input to **machine learning algorithms**, we need to  
convert it into a **meaningful and usable form**.

This process is called **data preprocessing**.  
Scikit-learn provides the **`sklearn.preprocessing`** module for this purpose.

## Common Preprocessing Techniques

- **Binarization**  
  Converts numerical data into binary values (0 or 1) based on a given threshold.

- **Mean Removal**  
  Removes the mean from the data so that each feature has a mean value of zero.

- **Scaling**  
  Rescales each feature so that its values fall within a specific range, such as [0, 1].

- **Normalization (L1 / L2)**  
  Rescales each sample (row) so that its feature vector has a unit L1 or L2 norm.


In [None]:
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

# Display the per-feature mean and standard deviation of the input data
print("Mean = ", input_data.mean(axis=0))
print("Stddeviation = ", input_data.std(axis=0))

# Standardize each feature to zero mean and unit standard deviation
data_scaled = preprocessing.scale(input_data)

print("Mean_removed = ", data_scaled.mean(axis=0))
print("Stddeviation_removed = ", data_scaled.std(axis=0))


Mean =  [ 1.75  -1.275  2.2  ]
Stddeviation =  [2.71431391 4.20022321 4.69414529]
Mean_removed =  [1.11022302e-16 0.00000000e+00 0.00000000e+00]
Stddeviation_removed =  [1. 1. 1.]
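
The other techniques from the list (binarization, scaling, and normalization) can be sketched on the same array. This is a minimal illustration rather than a recommended pipeline; the threshold of 0.5 and the [0, 1] range are assumptions chosen only for the example.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array(
    [[2.1, -1.9, 5.5],
     [-1.5, 2.4, 3.5],
     [0.5, -7.9, 5.6],
     [5.9, 2.3, -5.8]]
)

# Binarization: values strictly greater than the threshold become 1, others 0
# (the threshold 0.5 is an illustrative choice)
data_binarized = preprocessing.Binarizer(threshold=0.5).fit_transform(input_data)
print("Binarized:\n", data_binarized)

# Scaling: rescale each feature (column) to the range [0, 1]
data_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)
print("Min-max scaled:\n", data_minmax)

# Normalization: rescale each sample (row) to unit L1 or L2 norm
data_l1 = preprocessing.normalize(input_data, norm="l1")
data_l2 = preprocessing.normalize(input_data, norm="l2")
print("L1 normalized:\n", data_l1)
print("L2 normalized:\n", data_l2)
```

Note that scaling works per column (feature), while normalization works per row (sample).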


# Splitting the Dataset

To check the accuracy of our model, we split the dataset into two parts:

- **Training Set**
- **Testing Set**

The **training set** is used to train the machine learning model,  
while the **testing set** is used to test the performance of the model.

After splitting the dataset, we can evaluate how well our model performs  
on unseen data.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(105, 4)
(45, 4)
(105,)
(45,)
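
A plain random split does not guarantee that every class is equally represented in both parts. One option, sketched here as an assumption rather than something the split above requires, is to pass `stratify=y` so that the class proportions are preserved.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print("Training class counts:", np.bincount(y_train))
print("Testing class counts:", np.bincount(y_test))
```

Because Iris has 50 samples per class, the stratified split puts 35 samples of each class in the training set and 15 in the testing set.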


# Train the Model

Next, we use our dataset to train a **prediction model**.

**Scikit-learn** provides a wide range of  
**Machine Learning (ML) algorithms** with a consistent interface for:

- Model fitting  
- Prediction  
- Accuracy evaluation  
- Recall and other performance metrics

In [None]:
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset: 60% for training, 40% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# KNN classifier using the 3 nearest neighbors
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)

# Predictions on test data
y_pred = classifier_knn.predict(X_test)

# Accuracy
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Sample prediction
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)

# Map predicted class indices to species names (as plain Python strings)
pred_species = [str(iris.target_names[p]) for p in preds]
print("Predictions:", pred_species)


Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']
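
The consistent interface mentioned above means that a different estimator can be trained and evaluated with the same `fit` and `predict` calls, and that metrics other than accuracy come from the same `metrics` module. The sketch below illustrates this; `DecisionTreeClassifier` is an illustrative second model that does not appear in the text above, and the `results` dictionary is a name introduced only for this example.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)

# Both estimators expose the same fit/predict methods
# (the decision tree is an illustrative second model)
models = {
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        # Macro recall averages the per-class recall with equal weight
        "recall": metrics.recall_score(y_test, y_pred, average="macro"),
        # Rows are true classes, columns are predicted classes
        "confusion": metrics.confusion_matrix(y_test, y_pred),
    }
    print(name)
    print("  Accuracy:", results[name]["accuracy"])
    print("  Macro recall:", results[name]["recall"])
    print("  Confusion matrix:\n", results[name]["confusion"])
```

The diagonal of each confusion matrix counts the correctly classified test samples, so dividing its sum by the number of test samples gives the accuracy.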


# Model Persistence

Once the model is trained, it is desirable to **persist (save)** it  
so that we can reuse it later for prediction without retraining.

Model persistence can be achieved using the **dump** and **load**  
functions provided by the **joblib** package.

In [None]:
import joblib

# Save the trained classifier to disk
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')

# Load it back later without retraining
classifier_knn = joblib.load('iris_classifier_knn.joblib')
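
To check that a reloaded model behaves like the original, one approach is to compare their predictions. The following self-contained sketch trains a small KNN model, saves it to a temporary directory (an assumption made so the example leaves no files behind), reloads it, and compares the outputs. The names `model`, `restored`, and `model_path` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_classifier_knn.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(iris.data), restored.predict(iris.data))
print("Predictions match after reload:", same)
```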