# Higher Diploma in Science in Computing (Data Analytics)

**Module:** Programming and Scripting  
**Lecturer:** Andrew Beatty  
**Author:** Elaine R. Cazetta  

---  

## Project: Iris Dataset Analysis

[1]  
![Iris Flowers](https://miro.medium.com/v2/resize:fit:720/format:webp/0*11IwZmSKXw77eYz5)  

---  

### About the project: 

This project was developed as part of the *Programming and Scripting* module. It aims to analyze the famous [Fisher's Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris) [2], apply basic data processing and visualization techniques, and present meaningful insights using Python.

The project requirements are as follows:

1. Research the dataset online and write a summary about it in a README file.
2. Download the dataset and add it to a GitHub repository.
3. Write a program called 'analysis.py' that:
   - Outputs a summary of each variable to a single text file,
   - Saves a histogram of each variable as a PNG file, and
   - Outputs a scatter plot of each pair of variables.
4. Perform any other appropriate analysis, including the creation of a Jupyter Notebook.

The analysis performed in this notebook was created to meet the project requirements. It reproduces the same analysis done in the [analysis.py](https://github.com/elainecazetta/pands-project/blob/main/analysis.py) file, with additional insights in accordance with item 4 above.


---  


### About the dataset:

The [Iris dataset](https://doi.org/10.24432/C56C76) was introduced in 1936 by British statistician and biologist Ronald A. Fisher [4]. It is one of the best-known datasets in data science and is often used for learning and testing classification techniques. It is popular because it is small, clean, and easy to visualize, which makes it well suited to beginners exploring data analysis and machine learning [3].

The dataset includes 150 samples, each representing a single iris flower. For each sample, four features were recorded in centimeters [2]:

- *Sepal length*
- *Sepal width*
- *Petal length*
- *Petal width*

These measurements are used to identify the species of each flower. The dataset covers three distinct iris species:

- *Iris setosa*
- *Iris versicolor*
- *Iris virginica*

The clear structure and balanced number of samples per class (50 of each) make this dataset especially useful for practicing supervised classification, as well as unsupervised techniques such as clustering. It is also a common choice for visual demonstrations in data exploration.
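The figures quoted in this section can be checked against the copy of the dataset bundled with scikit-learn. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are samples (flowers), columns are the measured features
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# Feature and species names as stored in the dataset object
print(iris.feature_names)
print([str(name) for name in iris.target_names])

# Number of samples per species; iris.target holds the labels 0, 1, 2
class_counts = np.bincount(iris.target)
print({str(name): int(count) for name, count in zip(iris.target_names, class_counts)})
```

If the counts are equal for all three species, the classes are balanced as described.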


---  


### Libraries:

The following libraries are required to run the code in this notebook:   

- [pandas](https://pandas.pydata.org/) – for data manipulation and analysis [5]  
- [numpy](https://numpy.org/) – for numerical operations and working with arrays [6]   
- [matplotlib](https://matplotlib.org/) – for creating basic data visualizations such as histograms and scatter plots [7]   
- [scikit-learn](https://scikit-learn.org/stable/) – provides access to the Iris dataset [8]   

In [21]:
# Importing the libraries:

# For data manipulation and analysis
import pandas as pd 

# For numerical operations and working with arrays
import numpy as np 

# For creating data visualizations
import matplotlib.pyplot as plt 

# To load the Iris dataset
from sklearn.datasets import load_iris 


--- 

### Loading the dataset:

The line ***iris = load_iris()*** loads the Iris dataset from the *scikit-learn* library [8] into this notebook. The command ***print(iris)*** displays the full dataset object, including the data and metadata such as feature names, target names, and a description. The commands ***print(iris.data[:5])*** and ***print(iris.data[-5:])*** display the first and last five rows of the raw data.

In [22]:
# Load the Iris dataset from sklearn
iris = load_iris()
print(iris)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [23]:
# First 5 rows of the dataset [9]
print(iris.data[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [24]:
# Last 5 rows of the dataset [9]
print(iris.data[-5:])

[[6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
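Printing the whole object produces a very long output. As an alternative sketch, the individual pieces of metadata can be read one at a time: `load_iris()` returns a dictionary-like `Bunch`, so its contents are available both as keys and as attributes. This cell is not part of the original analysis and reloads the data so that it runs on its own.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the fields stored in the dataset object
print(list(iris.keys()))

# Metadata: feature (column) names and species names
print(iris.feature_names)
print(iris.target_names)

# Show only a short excerpt of the long description text
print(iris.DESCR[:200])

# The same slicing used on iris.data also works on the labels:
# first five and last five species codes
first_labels = iris.target[:5]
last_labels = iris.target[-5:]
print(first_labels, last_labels)
```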



To improve readability and add column names, the raw data is converted into a Pandas DataFrame [10], as shown below:


In [25]:
# Convert the dataset into a Pandas DataFrame for better visualization
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add a 'species' column to identify each row
df['species'] = iris.target

# Map the numeric species labels (0, 1, 2) to their corresponding species names
species_map = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
df['species'] = df['species'].replace(species_map)
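An alternative way to obtain an equivalent table is to let scikit-learn build the DataFrame. The sketch below assumes scikit-learn 0.23 or later, where `load_iris` accepts `as_frame=True`; the name `df_alt` is hypothetical and is used only to avoid overwriting `df`.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
iris_frame = load_iris(as_frame=True)

# .frame holds the four feature columns plus a numeric 'target' column
df_alt = iris_frame.frame.copy()

# Build the label-to-name mapping from the dataset itself
# instead of typing the species names by hand
species_map_alt = {code: str(name) for code, name in enumerate(iris_frame.target_names)}
df_alt['species'] = df_alt['target'].map(species_map_alt)

# Drop the numeric label column now that the names are available
df_alt = df_alt.drop(columns='target')

print(df_alt.head())
```

Deriving the mapping from `target_names` keeps the labels consistent with the dataset if the order of the classes ever differs from what is expected.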


The commands ***df.head()*** and ***df.tail()*** display the first and last five rows of the dataset in a formatted table, making it easier to read and interpret. This is especially useful during data exploration for quickly checking the structure and column names and for spotting any unusual values.


In [26]:
# Displays the first five rows of the DataFrame
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [27]:
# Displays the last five rows of the DataFrame
df.tail()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |



---  

### Exploring the dataset:

This section performs basic data exploration and looks for potential issues within the dataset, such as missing or blank values. It also examines the dataset's structure, such as its shape (the number of rows and columns), and computes summary statistics for the features to better understand the distribution of the data.


The first step in this analysis is checking for missing or null values in the dataset. This is crucial because missing values can distort summary statistics and cause errors in later steps. Below is the code that counts the missing values in each column of the dataset:


In [35]:
# Check for missing values
missing_values = df.isnull().sum()

# Display the result: number of missing values in each column
print(missing_values)

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64
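Alongside the per-column count, the shape, column types, and blank strings mentioned at the start of this section can be checked in a few lines. This is a minimal sketch rather than part of the original analysis; it rebuilds `df` so that the cell runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this cell is self-contained
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Shape: (number of rows, number of columns)
print(df.shape)

# Data type of each column: four numeric columns and one text column
print(df.dtypes)

# Single True/False answer: is any value missing anywhere in the table?
any_missing = bool(df.isnull().values.any())
print(any_missing)

# Blank strings are not treated as missing by isnull(), so count them separately
blank_species = int((df['species'].str.strip() == '').sum())
print(blank_species)
```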


Once it is confirmed that there are no missing values, the next step is to generate summary statistics for the dataset. This summary helps to understand the distribution of the numerical features and to check for significant outliers or unusual patterns. The following code generates the summary for all features in the dataset, excluding the species column [11], since it is categorical and numeric statistics do not apply to it:

In [28]:
# Summary statistics for the numeric columns only (species is categorical)
df.drop(columns='species').describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
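The quartiles in this summary can also be used to flag possible outliers. The sketch below applies Tukey's 1.5 × IQR rule, a common convention that is an assumption here and not a method taken from the original analysis; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# First and third quartiles and the interquartile range of each feature
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: values more than 1.5 * IQR outside the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (features < lower) | (features > upper)

# Number of flagged values per feature
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

A flagged value is not necessarily an error. It only marks a measurement that lies far from the bulk of the data and may deserve a closer look.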


To enhance this analysis further, the data can be grouped by species and the same summary statistics computed for each species separately. This makes it possible to compare the features of the different species in more detail:


In [33]:
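# Compute the summary statistics for each species separately
grouped = df.groupby('species').describe()

# Transpose so that species become columns and statistics become rows
transposed_grouped = grouped.T
print(transposed_grouped)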

species                     setosa  versicolor  virginica
sepal length (cm) count  50.000000   50.000000  50.000000
                  mean    5.006000    5.936000   6.588000
                  std     0.352490    0.516171   0.635880
                  min     4.300000    4.900000   4.900000
                  25%     4.800000    5.600000   6.225000
                  50%     5.000000    5.900000   6.500000
                  75%     5.200000    6.300000   6.900000
                  max     5.800000    7.000000   7.900000
sepal width (cm)  count  50.000000   50.000000  50.000000
                  mean    3.428000    2.770000   2.974000
                  std     0.379064    0.313798   0.322497
                  min     2.300000    2.000000   2.200000
                  25%     3.200000    2.525000   2.800000
                  50%     3.400000    2.800000   3.000000
                  75%     3.675000    3.000000   3.175000
                  max     4.400000    3.400000   3.800000
petal length (
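When only a couple of statistics are needed for the comparison, a narrower table can be easier to scan than the full `describe()` output. This is a minimal sketch that rebuilds `df` so it runs on its own; `species_summary` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this cell is self-contained
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Mean and standard deviation of every feature, one row per species
species_summary = df.groupby('species').agg(['mean', 'std'])
print(species_summary.round(3))

# A single statistic can be selected from the two-level column index
mean_petal_length = species_summary[('petal length (cm)', 'mean')]
print(mean_petal_length)
```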


---


### Data Visualization:
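The project requirements call for a histogram of each variable saved as a PNG file and a scatter plot of each pair of variables. The following is a minimal sketch of one way to produce them and is not the code from `analysis.py`. The output folder `plots`, the file names, the helper `slug`, and the choice of 15 bins are all assumptions made for illustration.

```python
import itertools
import re
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this cell is self-contained
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Series(iris.target).map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Hypothetical output folder; not taken from the project repository
output_dir = Path('plots')
output_dir.mkdir(exist_ok=True)


def slug(name):
    """Turn 'sepal length (cm)' into 'sepal_length_cm' for use in file names."""
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')


saved_files = []

# One histogram per variable, saved as a PNG file
for feature in iris.feature_names:
    fig, ax = plt.subplots()
    ax.hist(df[feature], bins=15, edgecolor='black')
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')
    ax.set_title(f'Histogram of {feature}')
    path = output_dir / f'hist_{slug(feature)}.png'
    fig.savefig(path)
    plt.close(fig)  # free the figure once it has been written to disk
    saved_files.append(path)

# One scatter plot per pair of variables, with one color per species
for x_feature, y_feature in itertools.combinations(iris.feature_names, 2):
    fig, ax = plt.subplots()
    for species, group in df.groupby('species'):
        ax.scatter(group[x_feature], group[y_feature], label=species)
    ax.set_xlabel(x_feature)
    ax.set_ylabel(y_feature)
    ax.legend(title='species')
    path = output_dir / f'scatter_{slug(x_feature)}_vs_{slug(y_feature)}.png'
    fig.savefig(path)
    plt.close(fig)
    saved_files.append(path)

# Four histograms plus one scatter plot for each of the six feature pairs
print(len(saved_files))
```

Coloring the scatter plots by species shows how well each pair of features separates the three classes.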

---  

### References:  
[1] Image credit: medium.com. Available at: https://3tw.medium.com/scikit-learn-the-iris-dataset-and-machine-learning-the-journey-to-a-new-skill-c8d2f537e087 [Accessed 06 May 2025]  
[2] R. Fisher. Iris, UCI Machine Learning Repository, 1936. Available at: https://doi.org/10.24432/C56C76 [Accessed 04 May 2025]  
[3] Kelleher, Curran. The Iris Dataset Explained. Available at: https://gist.github.com/curran/a08a1080b88344b0c8a7 [Accessed 05 May 2025]  
[4] Iris flower data set - Wikipedia. Available at: https://en.wikipedia.org/wiki/Iris_flower_data_set [Accessed 06 May 2025]  
[5] pandas. Available at: https://pandas.pydata.org/ [Accessed 08 May 2025]  
[6] NumPy. Available at: https://numpy.org/ [Accessed 08 May 2025]  
[7] Matplotlib. Available at: https://matplotlib.org/ [Accessed 08 May 2025]  
[8] scikit-learn. Available at: https://scikit-learn.org/stable/ [Accessed 06 May 2025]  
[9] Python - Slicing Strings. Available at: https://www.w3schools.com/python/python_strings_slicing.asp [Accessed 02 Mar 2025]  
[10] Concept of dataframes. Available at: https://chatgpt.com/share/680106b3-f308-8007-b590-e54e5784049b [Accessed 14 Mar 2025]  
[11] Excluding a column from a DataFrame. Available at: https://chatgpt.com/share/6820aac2-2a04-8007-a03f-d6298dee5674 [Accessed 11 May 2025]


---  

## End

---  