Project 1: Classification of Iris Flowers
Input: Iris.csv data set
Project: Building different classification models, followed by validation and
performance evaluation of the models

Step 1: Import all necessary libraries
    The following libraries are imported in this project:
    pandas: Used to read and manipulate CSV data.
    numpy: For fast and efficient numerical processing of data.
    sklearn.datasets: To load the built-in datasets shipped with scikit-learn.
    sklearn.model_selection.train_test_split: Used to split data into training and testing sets.
    sklearn.preprocessing: For feature scaling/normalization.
    sklearn.linear_model.LogisticRegression: A common classification algorithm.
    sklearn.svm.SVC: Support Vector Machine classifier.
    sklearn.ensemble.RandomForestClassifier: Random Forest classifier.
    sklearn.neighbors.KNeighborsClassifier: k-Nearest Neighbours classifier.
    sklearn.tree.DecisionTreeClassifier: Decision Tree classifier.
    sklearn.neural_network.MLPClassifier: Multi-Layer Perceptron classifier.
    sklearn.ensemble.GradientBoostingClassifier: Gradient Boosting classifier.
    sklearn.metrics.accuracy_score: To calculate model accuracy.
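
The list above corresponds roughly to the import statements below. This is a sketch that assumes a reasonably recent scikit-learn installation; the `classifiers` dictionary is a hypothetical convenience for looping over the models later and is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Hypothetical helper: map a readable name to each classifier class
classifiers = {
    "Logistic Regression": LogisticRegression,
    "SVM": SVC,
    "Random Forest": RandomForestClassifier,
    "k-NN": KNeighborsClassifier,
    "Decision Tree": DecisionTreeClassifier,
    "MLP": MLPClassifier,
    "Gradient Boosting": GradientBoostingClassifier,
}
print(list(classifiers))
```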

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
# Step 1: Load the Iris dataset (it is a classic builtin dataset)
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['target'] = iris.target
# Note: the data comes from scikit-learn's built-in loader here, not from a local "Iris.csv" file
X, y = iris.data, iris.target
# Convert to DataFrame for better processing
df = pd.DataFrame(data=X, columns=iris.feature_names)
df['target'] = y
# Preview the dataset: It is required as a customary step!
#print("Top 5 rows of the dataset:")
#print(df.head())
#print("Bottom 5 rows of the dataset:")
#print(df.tail())
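#print("The columns present in the data frame:")
#print(df.columns)
#print("The information about the attributes:")
# df.info() prints its report directly and returns None, so print() adds a trailing "None" to the output
print(df.info())
#print("To check if the null entries are there (count per column)")
#print(df.isnull().sum())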
#print("The statistical information about the data")
# print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
None
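
The summary above shows no null entries and an integer `target` column. As a small sketch of two follow-up checks, the code below counts nulls per column and attaches readable class names taken from `iris.target_names`. The column name `species` is an assumption introduced for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Map the integer codes 0, 1, 2 to the names stored in iris.target_names
# ("species" is a hypothetical column name, not used elsewhere in the notebook)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

null_counts = df.isnull().sum()              # number of missing values per column
class_counts = df["species"].value_counts()  # number of samples per class

print(null_counts)
print(class_counts)
```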


Step 2: Split the data set into two parts: "Training set" and "Test set"
The following function is used:
    train_test_split, imported from sklearn.model_selection
    The "Training set" is used to train a model and the "Test set" is used to test it

In [3]:
from sklearn.model_selection import train_test_split
print("Import of \"Train-Test-Split-Selection\" library is successful")

# Split the dataset into training and testing sets: 67% for training and 33% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, train_size=0.67)

# Note 1: The data (feature columns and target column) are kept as separate variables (X for features, y for target labels)
# Note 2: No random_state is passed above, so the split differs on every run; passing a fixed seed (random_state=42 is a popular choice) makes the experiment reproducible

print("\nTrain and test data shapes:")
print("X_train:", X_train.shape, "X_test:", X_test.shape)


Import of "Train-Test-Split-Selection" library is successful

Train and test data shapes:
X_train: (100, 4) X_test: (50, 4)
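
To make the split repeatable, a seed can be passed explicitly. The sketch below also uses `stratify=y`, which is an addition not present in the cell above, so that the three classes stay roughly balanced in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same proportions as above, plus a fixed seed and stratification (both are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```

With a fixed `random_state`, rerunning the cell yields the same rows in each part, so accuracy figures from different runs can be compared directly.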


Step 3: Preprocessing
    (a) Handling null entries, if applicable
    (b) Scaling (to put all values on a normalized scale)
        
        Several transformers are available for scaling: StandardScaler, MinMaxScaler, Normalizer, etc. Use any one. (PolynomialFeatures, also demonstrated below, expands the feature set rather than scaling it.)
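
The Iris data has no null entries, so step (a) is a no-op here. For data that does contain gaps, one common option is mean imputation, sketched below with `sklearn.impute.SimpleImputer`. The toy arrays are made-up values for illustration, not Iris data, and mean imputation is only one of several possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing entries (made-up values)
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan],
                        [5.0, 30.0]])
X_test_toy = np.array([[np.nan, 40.0]])

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train_toy)

# Reuse the training means to fill gaps in the test data
X_test_filled = imputer.transform(X_test_toy)

print("Learned column means:", imputer.statistics_)
print(X_train_filled)
print(X_test_filled)
```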

In [4]:
from sklearn.preprocessing import StandardScaler

### Tutorial to learn the basics of scaler-based normalization

# Create a dictionary holding the toy training data
data1 = {'A': [2, 4, 5, 6, 7, 8, 9], 'B': [60, 70, 90, 10, 30, 40, 50]}
# Create a dictionary holding the toy testing data
data2 = {'A': [1, 6, 3], 'B': [80, 40, 20]}

# Convert dictionaries to pandas DataFrame
X_train_ = pd.DataFrame(data1)
X_test_ = pd.DataFrame(data2)

# Create a StandardScaler object for normalization
scaler = StandardScaler()

# Fit and transform the training data (learn scaling parameters from training data)
X_train_scaled_ = scaler.fit_transform(X_train_)
print("Normalized training dataset...\n")
print(X_train_scaled_)  # Using print instead of display() for general compatibility

# Transform the testing data using the parameters already learned from training data
X_test_scaled_ = scaler.transform(X_test_)  # transform() only: do not refit on test data
print("\nNormalized testing dataset...\n")
print(X_test_scaled_)  # Using print instead of display()

Normalized training dataset...

[[-1.72849788  0.40824829]
 [-0.83223972  0.81649658]
 [-0.38411064  1.63299316]
 [ 0.06401844 -1.63299316]
 [ 0.51214752 -0.81649658]
 [ 0.9602766  -0.40824829]
 [ 1.40840568  0.        ]]

Normalized testing dataset...

[[-2.17662696  1.22474487]
 [ 0.06401844 -0.40824829]
 [-1.2803688  -1.22474487]]
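
The numbers above can be reproduced by hand, which makes clear what `fit()` learns. The sketch below recomputes the scaling as (x - mean) / std using the training mean and the population standard deviation (`ddof=0`), and compares the result with `StandardScaler`. The names `manual_train` and `manual_test` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train_ = pd.DataFrame({'A': [2, 4, 5, 6, 7, 8, 9], 'B': [60, 70, 90, 10, 30, 40, 50]})
X_test_ = pd.DataFrame({'A': [1, 6, 3], 'B': [80, 40, 20]})

# Parameters learned from the training data only
mean = X_train_.mean()
std = X_train_.std(ddof=0)  # population standard deviation, as used by StandardScaler

# Apply the same training parameters to both sets
manual_train = ((X_train_ - mean) / std).to_numpy()
manual_test = ((X_test_ - mean) / std).to_numpy()

scaler = StandardScaler().fit(X_train_)
print(np.allclose(manual_train, scaler.transform(X_train_)))
print(np.allclose(manual_test, scaler.transform(X_test_)))
```

Because the test rows are scaled with the training mean and standard deviation, test values can fall outside the range seen in the scaled training data, as the first test row does.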


In [5]:
# Handling missing values: There are no missing values

# Normalization of training and testing data

'''
Note: For normalization, sklearn provides two methods: fit_transform() and transform().
    - fit_transform() is applied to training data, whereas transform() is applied to testing data.
    - fit_transform() is a combination of:
        - fit(): calculates the necessary transformation parameters from the training data (e.g., min, max, mean, standard deviation).
        - transform(): applies the transformation to the data using the parameters learned from the training data.
    - The two methods are available on all the normalization classes in sklearn used below.
'''

# Import scaling methods for normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, PolynomialFeatures

# Standard Scaling (Standardization)
scaler = StandardScaler()  # Create a StandardScaler object
X_train_scaled = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test_scaled = scaler.transform(X_test)  # Apply transform() to test data using learned parameters
print("Standard Scaled Data (First 5 rows):\n", X_train_scaled[:5])  # Show first 5 rows

# Min-Max Scaling (Normalization to range [0,1])
minmax_scaler = MinMaxScaler()  # Create a MinMaxScaler object
X_train_minmax = minmax_scaler.fit_transform(X_train)  # Fit and transform the training data
X_test_minmax = minmax_scaler.transform(X_test)  # Apply transform() to test data using learned parameters
print("\nMin-Max Scaled Data (First 5 rows):\n", X_train_minmax[:5])  # Show first 5 rows

# L2 Normalization (scaling each row, i.e. each sample, to unit norm)
normalizer = Normalizer()  # Create a Normalizer object (default norm='l2')
X_train_normalized = normalizer.fit_transform(X_train)  # Normalizer is stateless: fit() learns nothing from the data
X_test_normalized = normalizer.transform(X_test)  # Each test row is rescaled by its own norm
print("\nNormalized Data (First 5 rows):\n", X_train_normalized[:5])  # Show first 5 rows

# Polynomial Feature Transformation (expanding features up to the given degree; this is feature expansion, not scaling)
poly = PolynomialFeatures(degree=2, include_bias=False)  # Create a PolynomialFeatures object (degree=2)
X_train_poly = poly.fit_transform(X_train)  # fit() only records the input feature layout; transform() builds the new columns
X_test_poly = poly.transform(X_test)  # Apply the same expansion to the test data
print("\nPolynomial Features (First 5 rows):\n", X_train_poly[:5])  # Show first 5 rows


Standard Scaled Data (First 5 rows):
 [[-0.70686204  2.40241739 -1.17067786 -1.33850041]
 [-1.0608831  -1.54739844 -0.18183687 -0.18240601]
 [-1.0608831   0.07899632 -1.17067786 -1.33850041]
 [-0.11682695  3.09944372 -1.17067786 -0.95313561]
 [ 0.59121516 -0.61803001  0.80700412  0.45986866]]

Min-Max Scaled Data (First 5 rows):
 [[0.26470588 0.86363636 0.08474576 0.        ]
 [0.17647059 0.09090909 0.38983051 0.375     ]
 [0.17647059 0.40909091 0.08474576 0.        ]
 [0.41176471 1.         0.08474576 0.125     ]
 [0.58823529 0.27272727 0.69491525 0.58333333]]

Normalized Data (First 5 rows):
 [[0.76578311 0.60379053 0.22089897 0.0147266 ]
 [0.75916547 0.37183615 0.51127471 0.15493173]
 [0.81803119 0.51752994 0.25041771 0.01669451]
 [0.77381111 0.59732787 0.2036345  0.05430253]
 [0.72366005 0.32162669 0.58582004 0.17230001]]

Polynomial Features (First 5 rows):
 [[5.200e+00 4.100e+00 1.500e+00 1.000e-01 2.704e+01 2.132e+01 7.800e+00
  5.200e-01 1.681e+01 6.150e+00 4.100e-01 2.250e+00 