### 1. Locate open-source data from the web (e.g. https://www.kaggle.com).
### 2. Provide a clear description of the data and its source (i.e., URL of the web site).
### 3. Load the dataset into a pandas data frame.
### 4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use describe() to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
### 5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
### 6. Turn categorical variables into quantitative variables in Python.

In [1]:
print("The dataset contains 150 rows with 6 columns:\n"
      "Id (row identifier)\n"
      "Sepal Length (cm)\n"
      "Sepal Width (cm)\n"
      "Petal Length (cm)\n"
      "Petal Width (cm)\n"
      "Species (Setosa, Versicolor, Virginica)")


The dataset contains 150 rows with 6 columns:
Id (row identifier)
Sepal Length (cm)
Sepal Width (cm)
Petal Length (cm)
Petal Width (cm)
Species (Setosa, Versicolor, Virginica)


In [17]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [18]:
df = pd.read_csv("iris.csv")

In [21]:
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
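
Step 4 also asks for the dimensions of the data frame. A minimal sketch of that check follows; `sample` is a hypothetical stand-in built from the five rows printed above, so the block runs without `iris.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data: the five rows printed by df.head().
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 5,
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
```

Calling `df.shape` on the full data frame should give `(150, 6)`, consistent with the 150 entries and 6 columns reported by `df.info()`.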


In [23]:
print("\nDataset Information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()



Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [25]:
print("\nMissing values in each column:")
print(df.isnull().sum())


Missing values in each column:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
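
This dataset has no missing values, so nothing needs to be cleaned here. For comparison, the sketch below shows what the same check looks like when gaps exist, and one possible way to handle them. The `toy` frame and the choice of median filling are illustrative assumptions, not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement and one missing label.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", None],
})

# Same check as above: count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# One possible policy: drop rows with no label, then fill numeric gaps with the median.
cleaned = toy.dropna(subset=["Species"]).copy()
cleaned["SepalLengthCm"] = cleaned["SepalLengthCm"].fillna(cleaned["SepalLengthCm"].median())
print(cleaned)
```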


In [27]:
print("\nStatistical Summary:")
print(df.describe())


Statistical Summary:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


In [29]:
print("\nData Types of each column:")
print(df.dtypes)



Data Types of each column:
Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
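
The numeric columns already have suitable types, so no conversion is strictly required. Step 5 asks for type conversions where they are needed, so here is a small sketch of two common ones. The `toy` frame, including the `Id` values stored as text, is a hypothetical example.

```python
import pandas as pd

# Hypothetical frame: Id was read as text, Species is a plain object column.
toy = pd.DataFrame({
    "Id": ["1", "2", "3"],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Text that holds numbers -> numeric dtype
toy["Id"] = pd.to_numeric(toy["Id"])

# Repeated string labels -> pandas categorical dtype (similar to an R factor)
toy["Species"] = toy["Species"].astype("category")

print(toy.dtypes)
```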


In [31]:
print("\nUnique values in 'Species' column before Label Encoding:")
print(df['Species'].unique())


Unique values in 'Species' column before Label Encoding:
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']


In [33]:
# Replace species names with integer codes; LabelEncoder assigns codes in sorted label order
le = LabelEncoder()
df['Species'] = le.fit_transform(df['Species'])

In [35]:
print("\nData after Label Encoding (first 5 rows):")
print(df.head())


Data after Label Encoding (first 5 rows):
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1            5.1           3.5            1.4           0.2        0
1   2            4.9           3.0            1.4           0.2        0
2   3            4.7           3.2            1.3           0.2        0
3   4            4.6           3.1            1.5           0.2        0
4   5            5.0           3.6            1.4           0.2        0
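
After encoding, the species names no longer appear in the data frame, so it helps to record which integer stands for which label. The sketch below does this on a short hypothetical list of labels rather than the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, using the three species names found above.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ stores the original labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
decoded = [str(label) for label in le.inverse_transform(codes)]
print(decoded)
```

Integer codes imply an ordering that the species do not really have. If that matters for a later model, one-hot encoding (for example with `pd.get_dummies`) is a common alternative.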


In [38]:
# Scale the four measurement columns to the [0, 1] range; Id and Species are left unchanged
feature_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])



In [39]:
print("\nData after Normalization (first 5 rows):")
print(df.head())


Data after Normalization (first 5 rows):
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1       0.222222      0.625000       0.067797      0.041667        0
1   2       0.166667      0.416667       0.067797      0.041667        0
2   3       0.111111      0.500000       0.050847      0.041667        0
3   4       0.083333      0.458333       0.084746      0.041667        0
4   5       0.194444      0.666667       0.067797      0.041667        0
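
Min-max scaling maps each value to (x - min) / (max - min). As a sanity check, the sketch below applies that formula by hand to the first five `SepalLengthCm` values, using the minimum (4.3) and maximum (7.9) reported by `describe()`. The variable names are illustrative.

```python
import pandas as pd

# Column minimum and maximum for SepalLengthCm, taken from the describe() output.
col_min, col_max = 4.3, 7.9

# First five SepalLengthCm values before normalization.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

# Min-max formula applied directly, without MinMaxScaler.
scaled = (sepal_length - col_min) / (col_max - col_min)
print(scaled.round(6).tolist())
```

The results should agree with the `SepalLengthCm` column in the normalized output, for instance (5.1 - 4.3) / (7.9 - 4.3) ≈ 0.222222 for the first row.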
