# Scenario: Analyzing the Iris Flower Data Set

You are a data scientist working on a project involving the classic Iris Flower Data Set, originally published in 1936. This dataset is a well-known resource in the machine learning community, frequently used for testing algorithms and demonstrating data analysis techniques. The data is sourced from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/), with the file name `iris.data`.

## Objectives

The goal is to load the dataset into a pandas DataFrame and prepare it for analysis. Before proceeding, let's review the structure of the data.

## Data Description

The Iris Flower Data Set includes measurements for three species of iris flowers: Iris Setosa, Iris Versicolor, and Iris Virginica. Each species has 50 samples, for a total of 150 instances.

### Attributes

The dataset consists of five attributes, which are:

- **Sepal Length (cm)**
- **Sepal Width (cm)**
- **Petal Length (cm)**
- **Petal Width (cm)**
- **Class**: The species of the iris plant

## Data Format

- The attributes are separated by commas.
- There is no header row in the data file, meaning that you will need to add column names manually when creating the DataFrame.
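
To illustrate why the missing header row matters, here is a minimal sketch that parses three records from the file held in an in-memory string. The names `raw`, `wrong`, and `right` are illustrative only and are not used elsewhere in this notebook.

```python
import io

import pandas as pd

# Three records in the same comma-separated layout as iris.data (no header line).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

columns = ["sepal length", "sepal width", "petal length", "petal width", "class"]

# Default behaviour: the first record is consumed as the header row.
wrong = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column labels.
right = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(wrong), list(wrong.columns))
print(len(right), list(right.columns))
```

With the defaults, one record is silently lost and its values become the column labels, so passing `header=None` together with `names` is the safer choice for this file.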

## Steps to Load and Prepare the Data

1. **Read the Data File**: Load the `iris.data` file from the provided URL.
2. **Create a DataFrame**: Convert the loaded data into a pandas DataFrame.
3. **Add Column Names**: Assign appropriate names to the columns as described above.

In [1]:
# Load the Iris data set into a DataFrame
import numpy as np
import pandas as pd

iris_data = pd.read_csv(
    'iris.data',
    sep=',',
    header=None,  # the file has no header row
    names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class'],
)
iris_data

|     | sepal length | sepal width | petal length | petal width | class |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [2]:
iris_data['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [3]:
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal length  150 non-null    float64
 1   sepal width   150 non-null    float64
 2   petal length  150 non-null    float64
 3   petal width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
# info() reports no null values: 5 columns, of which 4 are numeric and 1 holds strings.
# The data set has 3 classes with 50 samples each.

# The class column contains repeating values, so it can serve as the key for grouping the data.

iris_data_grouped = iris_data.groupby('class')
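
The claims in the comments above (no missing values, equal class sizes) can be checked directly rather than read off `info()`. The sketch below uses a hypothetical six-row excerpt named `sample` instead of the full file, so the counts are 2 per class rather than 50.

```python
import pandas as pd

# Hypothetical excerpt: six rows shown in this notebook, not the full 150-row file.
sample = pd.DataFrame(
    {
        "petal length": [1.4, 1.4, 4.7, 4.5, 5.2, 5.0],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows for each distinct class label.
rows_per_class = sample["class"].value_counts()

print(missing_per_column)
print(rows_per_class)
```

Running the same two calls on `iris_data` should report zero missing values and 50 rows for each class, assuming the file was loaded as in cell 1.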

In [5]:
iris_data_grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024B53782C40>

In [6]:
# Useful GroupBy attributes and methods: groups, get_group(), mean(), sum()
iris_data_grouped.groups

{'Iris-setosa': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'Iris-versicolor': [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], 'Iris-virginica': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]}

In [7]:
# Examine one group, say Iris-versicolor, with get_group()

iris_data_grouped.get_group('Iris-versicolor')

|     | sepal length | sepal width | petal length | petal width | class |
| --- | --- | --- | --- | --- | --- |
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
| 56 | 6.3 | 3.3 | 4.7 | 1.6 | Iris-versicolor |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
| 58 | 6.6 | 2.9 | 4.6 | 1.3 | Iris-versicolor |
| 59 | 5.2 | 2.7 | 3.9 | 1.4 | Iris-versicolor |
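
Besides `groups` and `get_group()`, a GroupBy object can be iterated directly. The following sketch assumes a hypothetical six-row excerpt named `sample` that keeps the original row labels; it is meant as an illustration, not as part of the analysis.

```python
import pandas as pd

# Hypothetical excerpt of rows shown in this notebook, with their original labels.
sample = pd.DataFrame(
    {
        "petal length": [1.4, 1.4, 4.7, 4.5, 4.9, 5.2],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica",
        ],
    },
    index=[0, 1, 50, 51, 52, 145],
)
grouped = sample.groupby("class")

# Iterating yields (key, sub-DataFrame) pairs, one per class.
rows_per_group = {name: len(frame) for name, frame in grouped}

# get_group(key) returns the sub-DataFrame for a single key.
versicolor = grouped.get_group("Iris-versicolor")

# groups maps each key to the row labels that belong to it.
versicolor_labels = list(grouped.groups["Iris-versicolor"])

print(rows_per_group)
print(versicolor.index.tolist())
```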


In [8]:
iris_data_grouped.mean()

| class | sepal length | sepal width | petal length | petal width |
| --- | --- | --- | --- | --- |
| Iris-setosa | 5.006 | 3.418 | 1.464 | 0.244 |
| Iris-versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| Iris-virginica | 6.588 | 2.974 | 5.552 | 2.026 |


In [9]:
iris_data_grouped.describe()

| class | sepal length count | sepal length mean | sepal length std | sepal length min | sepal length 25% | sepal length 50% | sepal length 75% | sepal length max | sepal width count | sepal width mean | ... | petal length 75% | petal length max | petal width count | petal width mean | petal width std | petal width min | petal width 25% | petal width 50% | petal width 75% | petal width max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Iris-setosa | 50.0 | 5.006 | 0.35249 | 4.3 | 4.8 | 5.0 | 5.2 | 5.8 | 50.0 | 3.418 | ... | 1.575 | 1.9 | 50.0 | 0.244 | 0.10721 | 0.1 | 0.2 | 0.2 | 0.3 | 0.6 |
| Iris-versicolor | 50.0 | 5.936 | 0.516171 | 4.9 | 5.6 | 5.9 | 6.3 | 7.0 | 50.0 | 2.77 | ... | 4.6 | 5.1 | 50.0 | 1.326 | 0.197753 | 1.0 | 1.2 | 1.3 | 1.5 | 1.8 |
| Iris-virginica | 50.0 | 6.588 | 0.63588 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 | 50.0 | 2.974 | ... | 5.875 | 6.9 | 50.0 | 2.026 | 0.27465 | 1.4 | 1.8 | 2.0 | 2.3 | 2.5 |


In [10]:
iris_data_grouped.median()

| class | sepal length | sepal width | petal length | petal width |
| --- | --- | --- | --- | --- |
| Iris-setosa | 5.0 | 3.4 | 1.5 | 0.2 |
| Iris-versicolor | 5.9 | 2.8 | 4.35 | 1.3 |
| Iris-virginica | 6.5 | 3.0 | 5.55 | 2.0 |


Each aggregate operation returns a new DataFrame whose index is the grouping key, which in our case is 'class'.
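
Because the grouping key becomes the index of the aggregated result, individual classes can be looked up with `.loc`, and the key can be turned back into an ordinary column when that is more convenient. This is a minimal sketch on a hypothetical four-row excerpt named `sample`, not on the full data set.

```python
import pandas as pd

# Hypothetical excerpt: four rows shown in this notebook.
sample = pd.DataFrame(
    {
        "sepal length": [5.1, 4.9, 7.0, 6.4],
        "petal length": [1.4, 1.4, 4.7, 4.5],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
        ],
    }
)

# The aggregated frame is indexed by the grouping key.
means = sample.groupby("class").mean()
setosa_petal_mean = means.loc["Iris-setosa", "petal length"]

# Two ways to get the key back as a regular column.
flat = means.reset_index()
flat_direct = sample.groupby("class", as_index=False).mean()

print(means.index.name)
print(list(flat.columns))
```

Whether to keep the key as the index or as a column is a matter of what comes next: label-based lookups favor the index, while merging or plotting often favors a plain column.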

In [11]:
# Find the row label of the flower with the longest petal in each class:

iris_data_grouped['petal length'].idxmax()

class
Iris-setosa         24
Iris-versicolor     83
Iris-virginica     118
Name: petal length, dtype: int64

In [12]:
# Return the full record of the flower with the longest petal in one group
def longest_petal(group):
    return group.loc[group['petal length'].idxmax()]

iris_data_grouped.apply(longest_petal)

| class | sepal length | sepal width | petal length | petal width | class |
| --- | --- | --- | --- | --- | --- |
| Iris-setosa | 4.8 | 3.4 | 1.9 | 0.2 | Iris-setosa |
| Iris-versicolor | 6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor |
| Iris-virginica | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
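
The same records can also be selected without `apply`, by passing the labels returned by `idxmax()` straight to `.loc`. The sketch below assumes a hypothetical nine-row excerpt named `sample`; the result keeps the original row labels as its index instead of the class names. Recent pandas releases may also emit a warning when `apply` operates on the grouping column, which this variant sidesteps.

```python
import pandas as pd

# Hypothetical excerpt of rows shown in this notebook, with their original labels.
sample = pd.DataFrame(
    {
        "petal length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 5.2, 5.0, 5.4],
        "class": (
            ["Iris-setosa"] * 3
            + ["Iris-versicolor"] * 3
            + ["Iris-virginica"] * 3
        ),
    },
    index=[0, 2, 3, 50, 51, 52, 145, 146, 148],
)

# Row label of the longest petal within each class.
longest_idx = sample.groupby("class")["petal length"].idxmax()

# Use those labels to pull the full records from the original frame.
longest_rows = sample.loc[longest_idx]

print(longest_rows)
```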


In [13]:
# Select a single column of the original DataFrame while grouping:
q = iris_data.groupby('class')['petal length']

In [14]:
q.mean()  # mean petal length for each class

class
Iris-setosa        1.464
Iris-versicolor    4.260
Iris-virginica     5.552
Name: petal length, dtype: float64
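
The result above is a Series because a single column label was used for the selection. As a small illustration on a hypothetical four-row excerpt named `sample`, selecting with a list of labels keeps the result as a DataFrame instead.

```python
import pandas as pd

# Hypothetical excerpt: four rows shown in this notebook.
sample = pd.DataFrame(
    {
        "sepal length": [5.1, 4.9, 7.0, 6.4],
        "petal length": [1.4, 1.4, 4.7, 4.5],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
        ],
    }
)

# A single label selects one column: aggregations return a Series.
mean_series = sample.groupby("class")["petal length"].mean()

# A list of labels keeps a frame: aggregations return a DataFrame.
mean_frame = sample.groupby("class")[["petal length"]].mean()

print(type(mean_series).__name__)
print(type(mean_frame).__name__)
```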

In [15]:
iris_data_grouped.mean()

| class | sepal length | sepal width | petal length | petal width |
| --- | --- | --- | --- | --- |
| Iris-setosa | 5.006 | 3.418 | 1.464 | 0.244 |
| Iris-versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| Iris-virginica | 6.588 | 2.974 | 5.552 | 2.026 |


In [16]:
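# Multiple aggregations at the same time:
# select the columns first, then pass the list of aggregations.
# Passing the functions themselves (min, np.mean, max) works here, but newer
# pandas versions may warn and recommend the string names instead.

iris_data.groupby('class')[['sepal length', 'petal length']].aggregate([min, np.mean, max])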

| class | sepal length min | sepal length mean | sepal length max | petal length min | petal length mean | petal length max |
| --- | --- | --- | --- | --- | --- | --- |
| Iris-setosa | 4.3 | 5.006 | 5.8 | 1.0 | 1.464 | 1.9 |
| Iris-versicolor | 4.9 | 5.936 | 7.0 | 3.0 | 4.26 | 5.1 |
| Iris-virginica | 4.9 | 6.588 | 7.9 | 4.5 | 5.552 | 6.9 |


In [17]:
# Same aggregation, with the functions given by name
iris_data.groupby('class')[['sepal length', 'petal length']].aggregate(['min', 'mean', 'max'])

| class | sepal length min | sepal length mean | sepal length max | petal length min | petal length mean | petal length max |
| --- | --- | --- | --- | --- | --- | --- |
| Iris-setosa | 4.3 | 5.006 | 5.8 | 1.0 | 1.464 | 1.9 |
| Iris-versicolor | 4.9 | 5.936 | 7.0 | 3.0 | 4.26 | 5.1 |
| Iris-virginica | 4.9 | 6.588 | 7.9 | 4.5 | 5.552 | 6.9 |
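
When the two-level column header produced by a list of aggregations is awkward to work with, pandas' named aggregation offers a flat alternative. The sketch below is one possible variant on a hypothetical nine-row excerpt named `sample`; the output column names (`petal_min`, `petal_mean`, `petal_max`) are made up for illustration.

```python
import pandas as pd

# Hypothetical excerpt of rows shown in this notebook.
sample = pd.DataFrame(
    {
        "petal length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 5.2, 5.0, 5.4],
        "class": (
            ["Iris-setosa"] * 3
            + ["Iris-versicolor"] * 3
            + ["Iris-virginica"] * 3
        ),
    }
)

# Named aggregation: output_name=(source column, aggregation name).
summary = sample.groupby("class").agg(
    petal_min=("petal length", "min"),
    petal_mean=("petal length", "mean"),
    petal_max=("petal length", "max"),
)

print(summary)
```

Each keyword becomes one flat column in the result, so later steps can refer to `summary["petal_max"]` rather than to a `('petal length', 'max')` pair.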
