## Iris Dataset

The Iris dataset is a classic benchmark in machine learning and statistics, most often used to demonstrate supervised classification.

It contains four measurements for each of a set of iris flowers drawn from three species.

The dataset has five columns:

1. **Sepal Length**: Length of the sepals (in centimeters).
2. **Sepal Width**: Width of the sepals (in centimeters).
3. **Petal Length**: Length of the petals (in centimeters).
4. **Petal Width**: Width of the petals (in centimeters).
5. **Species**: The species of the iris flower. This is the target variable and takes one of three values:
   - Iris-setosa
   - Iris-versicolor
   - Iris-virginica

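The dataset consists of 150 observations (samples), each representing one iris flower, with 50 observations per species. In the `iris.csv` file used below, the columns are named `sepal.length`, `sepal.width`, `petal.length`, `petal.width`, and `variety`, and the species labels appear as `Setosa`, `Versicolor`, and `Virginica`.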







## Importing the Required Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Loading the Dataset

In [None]:
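dataset = pd.read_csv("iris.csv")
# Features: every column except the last (the four measurements)
X = dataset.iloc[:, :-1].values
# Target: the last column (the species label)
y = dataset.iloc[:, -1].values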

In [None]:
print(dataset.columns)

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')


In [None]:
print(X.shape)

(150, 4)


In [None]:
unique_values = dataset["variety"].unique()
print(unique_values)

['Setosa' 'Versicolor' 'Virginica']


## Taking Care of Missing Data

In [None]:
# Check for null values in any column
null_values = dataset.isnull().sum()
print(null_values)

sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64


In [None]:
# isna() is an alias of isnull(), so this repeats the check above and gives the same counts
nan_values = dataset.isna().sum()
print(nan_values)

sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64


In [None]:
# Count exact zeros in each column; a zero measurement would be implausible and could be a placeholder for missing data
zero_values = (dataset == 0).sum()
print(zero_values)

sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64


In [None]:
# None of the checks above found nulls or zeros, so this copy of the Iris dataset has no missing values to handle.
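
Had any of the checks above reported a non-zero count, the gaps would need handling before training. One common, simple option is to replace each missing value with the mean of its column, which is what scikit-learn's `SimpleImputer(strategy="mean")` does. The plain-Python sketch below only illustrates the idea: the helper `impute_with_mean` and the sample column are hypothetical and are not taken from `iris.csv`.

```python
from statistics import mean


def impute_with_mean(column):
    """Return a copy of `column` with None entries replaced by the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical sepal-length column with one gap (illustrative values only)
sepal_length = [4.6, 5.7, None, 4.8, 4.4]
filled = impute_with_mean(sepal_length)
print(filled)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is best treated as a baseline rather than a default.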

## Encoding Categorical Data - the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder

# Map the species names in y to integer class labels
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(np.unique(y))

[0 1 2]
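
To make the encoding less of a black box, here is a minimal plain-Python sketch of the mapping that `LabelEncoder` produces: the distinct labels are sorted and numbered from 0. The function `fit_label_mapping` and the short `species` list are hypothetical names used only for illustration, and this is not scikit-learn's implementation.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Illustrative labels using the three species names found in the dataset
species = ["Virginica", "Setosa", "Versicolor", "Setosa"]

mapping = fit_label_mapping(species)
encoded = [mapping[s] for s in species]

# Invert the mapping to recover the original names from the integers
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[i] for i in encoded]

print(mapping)
print(encoded)
print(decoded)
```

In the notebook itself, `le.classes_` holds the sorted label names and `le.inverse_transform` converts integer predictions back to species names.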


## Splitting the Dataset into Training Set and Test Set

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.shape)
print(X_train[:5, :])

(120, 4)
[[4.6 3.6 1.  0.2]
 [5.7 4.4 1.5 0.4]
 [6.7 3.1 4.4 1.4]
 [4.8 3.4 1.6 0.2]
 [4.4 3.2 1.3 0.2]]


In [None]:
print(X_test.shape)
print(X_test[:5, :])

(30, 4)
[[6.1 2.8 4.7 1.2]
 [5.7 3.8 1.7 0.3]
 [7.7 2.6 6.9 2.3]
 [6.  2.9 4.5 1.5]
 [6.8 2.8 4.8 1.4]]


In [None]:
print(y_train[:10])

[0 0 1 0 0 2 1 0 0 0]


In [None]:
print(y_test[:10])

[1 0 2 1 1 0 1 2 1 1]
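
The split above is purely random, so the train and test sets are not guaranteed to keep the 50/50/50 class balance; six of the first ten test labels printed above are class 1, for example. `train_test_split` accepts a `stratify=y` argument that preserves class proportions. The sketch below shows the idea in plain Python, assuming integer labels like those produced by the encoder. The function `stratified_split` is a hypothetical helper, not scikit-learn's algorithm, and it will not reproduce scikit-learn's shuffling.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with each class split in the same proportion."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Labels shaped like the encoded Iris target: 50 rows per class
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 50 rows per class and `test_size=0.2`, each class contributes 10 rows to the test set and 40 to the training set.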


## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

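scaler = StandardScaler()

# Learn each feature's mean and standard deviation from the training set only
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set so no test information leaks into the scaling
X_test = scaler.transform(X_test)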

In [None]:
print(X_train[:5, :])

[[-1.47393679  1.20365799 -1.56253475 -1.31260282]
 [-0.13307079  2.99237573 -1.27600637 -1.04563275]
 [ 1.08589829  0.08570939  0.38585821  0.28921757]
 [-1.23014297  0.75647855 -1.2187007  -1.31260282]
 [-1.7177306   0.30929911 -1.39061772 -1.31260282]]


In [None]:
print(X_test[:5, :])

[[ 0.35451684 -0.58505976  0.55777524  0.02224751]
 [-0.13307079  1.65083742 -1.16139502 -1.17911778]
 [ 2.30486738 -1.0322392   1.8185001   1.49058286]
 [ 0.23261993 -0.36147005  0.44316389  0.4227026 ]
 [ 1.2077952  -0.58505976  0.61508092  0.28921757]]
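
As a rough illustration of what the scaler computes, each feature is transformed as z = (x - mean) / std, where the mean and the population standard deviation come from the training data alone. The plain-Python sketch below applies this to one column. The helpers `fit_scaler` and `transform` are hypothetical, and the inputs are only the sepal lengths of the five rows printed earlier, so the results will not match the notebook output, which is based on all 120 training rows.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize a column using previously fitted statistics."""
    return [(x - mu) / sigma for x in column]


# Sepal lengths from the first five printed rows of X_train and X_test
train_col = [4.6, 5.7, 6.7, 4.8, 4.4]
test_col = [6.1, 5.7, 7.7, 6.0, 6.8]

# Fit on the training column only, then apply the same statistics to both
mu, sigma = fit_scaler(train_col)
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1 by construction; the test column generally does not, because it is scaled with the training statistics.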
