# Fisher’s Iris data set 

Fisher's Iris data set, also known as the Iris flower data set, was published by the British statistician and biologist Ronald Fisher in 1936 ([Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)).

The data set consists of 150 records, 50 for each of three Iris species (Iris setosa, Iris versicolor and Iris virginica). Four attributes were measured for each record: the length and the width of the sepals and petals, in centimeters.
![title](Images/flowers.png)
(Sources: [1](https://commons.wikimedia.org/wiki/Category:Iris_setosa#/media/File:Irissetosa1.jpg), [2](https://en.wikipedia.org/wiki/Iris_flower_data_set#/media/File:Iris_versicolor_3.jpg), [3](https://en.wikipedia.org/wiki/Iris_flower_data_set#/media/File:Iris_virginica.jpg), Licenses: Public Domain)

<br>

## Saving original data set
***

In [1]:
from urllib.request import urlretrieve
import pandas as pd

# Assign the URL of the file. Idea taken from https://gist.github.com/curran/a08a1080b88344b0c8a7
iris = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# Save a local copy. The file name 'iris.data' is an assumption;
# without a file name, urlretrieve only writes to a temporary file.
urlretrieve(iris, 'iris.data')

# Read the file into a DataFrame and assign column names (the file has no header row)
df = pd.read_csv(iris, sep=',', names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])
# Print the first 10 rows of the DataFrame
print(df.head(10))

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
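
The `names` argument matters because `iris.data` has no header row. The following is a minimal sketch of that behaviour; it assumes a two-line in-memory sample read through `io.StringIO` instead of the real file, so it runs without a download.

```python
import io
import pandas as pd

# Two records in the same format as iris.data (no header line).
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Without names, pandas uses the first record as the header, so one row is lost.
without_names = pd.read_csv(io.StringIO(sample))

# With names, every line is read as data and the columns get readable labels.
with_names = pd.read_csv(io.StringIO(sample), names=columns)

print(len(without_names), len(with_names))
```

On the full file the same mistake would leave 149 rows instead of 150, which is one reason to check the row count after loading.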


In [2]:
#Checking that whole data set has been imported (150 rows)
len(df)

150
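
Besides the total row count, it can be useful to confirm that the records are split evenly between the species. The sketch below uses `value_counts` on a small hand-typed sample of six records (an assumption made so the example is self-contained); on the full `df` each species should show 50.

```python
import io
import pandas as pd

# Six records in iris.data format, two per species (illustrative sample only).
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
sample_df = pd.read_csv(io.StringIO(sample), names=columns)

# Count how many records belong to each species.
class_counts = sample_df["class"].value_counts()
print(class_counts)

# The data set is balanced if every species has the same number of records.
balanced = bool(class_counts.nunique() == 1)
print("Balanced:", balanced)
```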

<br>

## Looking at data

***
Before we start analysing the data set, let's have a closer look at the data using pandas functionality.

We can use the <span style="color:SlateBlue; font-weight:bold;">columns</span> attribute to show the column labels of the DataFrame. <br>
Next, the <span style="color:SlateBlue; font-weight:bold;">index</span> attribute shows RangeIndex(start, stop, step); this index was assigned automatically when the CSV file was read and df was created.<br>
The <span style="color:SlateBlue; font-weight:bold;">ndim</span> attribute shows the number of axes/dimensions of the data set.<br>
The <span style="color:SlateBlue; font-weight:bold;">shape</span> attribute returns a tuple holding the number of rows (index 0) and columns (index 1) in the data set.<br>
The <span style="color:SlateBlue; font-weight:bold;">size</span> attribute shows the total number of elements in the DataFrame (150 rows x 5 columns = 750).<br>
The <span style="color:SlateBlue; font-weight:bold;">dtypes</span> attribute shows the data type of each column in the DataFrame.

In [3]:
print("The column labels of the iris DataFrame are: ", *df.columns, sep = "   ")
print(" The index of the DataFrame is: ", df.index, "\n")
print(f"The iris DataFrame has {df.ndim} dimensions")
print(f"The iris data set has {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The iris DataFrame has {df.size} elements in total", "\n")
print("The data types of iris DataFrame are as follows:")
print(df.dtypes)

The column labels of the iris DataFrame are:    sepal_length   sepal_width   petal_length   petal_width   class
 The index of the DataFrame is:  RangeIndex(start=0, stop=150, step=1) 

The iris DataFrame has 2 dimensions
The iris data set has 150 rows and 5 columns
The iris DataFrame has 750 elements in total 

The data types of iris DataFrame are as follows:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class            object
dtype: object
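
The relationship between `shape`, `size` and `ndim` can be checked on a tiny frame. This is a rough sketch only: the three-row DataFrame `small` is made up for illustration, and `select_dtypes` is used here as one possible way to separate the measurement columns from the label column.

```python
import pandas as pd

# A made-up three-row frame with two numeric columns and one label column.
small = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# shape is a (rows, columns) tuple; size is their product; ndim is 2 for any DataFrame.
rows, cols = small.shape
print(rows, cols, small.size, small.ndim)

# Split the columns by data type: numeric measurements versus text labels.
numeric_cols = small.select_dtypes(include="number").columns.tolist()
label_cols = small.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, label_cols)
```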


Now let's check whether there are any missing values in our data. We can do this easily with the <span style="color:SlateBlue; font-weight:bold;">isnull</span> method, which returns True or False for each observation. Because Boolean values are treated as 1 (True) or 0 (False), we can combine it with the **sum** function to count the True values in each column instead of printing every True or False value. The result confirms that there are no missing values in this data set.

In [10]:
print("The number of null or missing values in the iris dataframe for each column: ")
print(df.isnull().sum())

The number of null or missing values in the iris dataframe for each column: 
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
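
Since the iris data has no gaps, the counting is easier to see on a frame that does. The sketch below uses invented values with a few deliberately missing entries (the `demo` frame is hypothetical and not part of the iris data).

```python
import numpy as np
import pandas as pd

# Invented values with deliberately missing entries.
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "sepal_width": [3.5, np.nan, np.nan],
    "class": ["Iris-setosa", None, "Iris-setosa"],
})

# isnull gives a Boolean frame of the same shape: True where a value is missing.
mask = demo.isnull()

# Summing treats True as 1 and False as 0, giving a count per column.
missing_per_column = mask.sum()
print(missing_per_column)

# Summing a second time collapses the per-column counts into one total.
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```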


The <span style="color:SlateBlue; font-weight:bold;">info</span> method produces a concise summary of a DataFrame. It prints information including the index dtype, the column dtypes, the non-null counts and the memory usage.<br>
The <span style="color:SlateBlue; font-weight:bold;">count</span> method counts non-NA cells for each column or row.<br>
The <span style="color:SlateBlue; font-weight:bold;">unique</span> method returns the unique values, in order of appearance, using a hash table. If we use it on the 'class' column it will show which species of Iris flower are in our data set.

In [13]:
print(f"A concise summary of the iris DataFrame: \n")
df.info()
print(f"\n The number of non-NA cells for each column or row are: {df.count()}")
species_type =df['class'].unique()
print("\n The following are the three class or species types of iris in the data set \n",*species_type, sep = " ")

A concise summary of the iris DataFrame: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

 The number of non-NA cells for each column or row are: sepal_length    150
sepal_width     150
petal_length    150
petal_width     150
class           150
dtype: int64

 The following are the three class or species types of iris in the data set 
 Iris-setosa Iris-versicolor Iris-virginica
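
`unique` lists the species, while the related `nunique` method returns how many there are. A minimal sketch, assuming a short hand-made Series of class labels in place of `df['class']`:

```python
import pandas as pd

# A short hand-made column of class labels (stand-in for df['class']).
classes = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
    name="class",
)

# unique returns the distinct labels in order of first appearance.
species = classes.unique()
print(*species)

# nunique returns only the number of distinct labels.
n_species = classes.nunique()
print(n_species)
```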
