# Higher Diploma in Science in Computing (Data Analytics)

**Module:** Programming and Scripting  
**Lecturer:** Andrew Beatty  
**Author:** Elaine R. Cazetta  

---  

## Project: Iris Dataset Analysis

[1]  
![Iris Flowers](https://miro.medium.com/v2/resize:fit:720/format:webp/0*11IwZmSKXw77eYz5)  

---  

### About the project: 

This project was developed as part of the *Programming and Scripting* module. It aims to analyze the famous [Fisher's Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris) [2], apply basic data processing and visualization techniques, and present meaningful insights using Python.

The project requirements are as follows:

1. Research the dataset online and write a summary about it in a README file.
2. Download the dataset and add it to a GitHub repository.
3. Write a program called 'analysis.py' that:
   - Outputs a summary of each variable to a single text file,
   - Saves a histogram of each variable as a PNG file, and
   - Outputs a scatter plot of each pair of variables.
4. Perform any other appropriate analysis, including the creation of a Jupyter Notebook.
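
As a rough illustration of requirement 3, the sketch below writes a summary file, one histogram per variable, and one scatter plot per pair of variables. It is a minimal sketch, not the contents of the actual `analysis.py`: the `output` folder, the file names, the `slug` helper, and the choice to save the scatter plots as PNG files are all assumptions made for this example.

```python
from itertools import combinations
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: save figures without opening windows
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Hypothetical output folder; the real analysis.py may use different paths
out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)


def slug(name):
    """Hypothetical helper: 'sepal length (cm)' -> 'sepal_length'."""
    return name.replace(" (cm)", "").replace(" ", "_")


# 1. Summary of each variable, written to a single text file
(out_dir / "summary.txt").write_text(df.describe().to_string())

# 2. One histogram per variable, each saved as a PNG file
for column in df.columns:
    fig, ax = plt.subplots()
    ax.hist(df[column], bins=20)
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
    fig.savefig(out_dir / f"hist_{slug(column)}.png")
    plt.close(fig)

# 3. One scatter plot per pair of variables, coloured by species code
for x, y in combinations(df.columns, 2):
    fig, ax = plt.subplots()
    ax.scatter(df[x], df[y], c=iris.target)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    fig.savefig(out_dir / f"scatter_{slug(x)}_vs_{slug(y)}.png")
    plt.close(fig)
```

With four variables, this produces four histograms and six scatter plots (one for each unordered pair).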

This notebook was created to meet the project requirements. It reproduces the analysis in the [analysis.py](https://github.com/elainecazetta/pands-project/blob/main/analysis.py) file and adds further insights in accordance with item 4 above.


---  


### About the dataset:

The [Iris dataset](https://doi.org/10.24432/C56C76) was introduced in 1936 by the British statistician and biologist Ronald A. Fisher [4]. It is one of the best-known datasets in data science and is often used for learning and testing classification techniques. It is popular because it is small, clean, and easy to visualize, which makes it well suited to beginners exploring data analysis and machine learning [3].

The dataset includes 150 entries, each representing a single iris flower. For each sample, four features were recorded in centimeters [2]:

- *Sepal length*
- *Sepal width*
- *Petal length*
- *Petal width*

These measurements are used to identify the species of each flower. The dataset covers three distinct iris species:

- *Iris Setosa*  
- *Iris Versicolor*   
- *Iris Virginica* 

The clear structure and balanced number of samples per class (50 of each) make this dataset especially useful for practicing supervised classification. It is also widely used to illustrate unsupervised techniques such as clustering, and it is a common choice for visual demonstrations in data exploration.
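
The figures quoted above (150 samples, four features, 50 samples per species) can be checked directly against the copy of the dataset shipped with scikit-learn. The following is a small sketch; the variable names `n_samples`, `n_features`, and `counts` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Number of rows (samples) and columns (features)
n_samples, n_features = iris.data.shape
print(n_samples, n_features)  # 150 4

# Count how many samples belong to each species
counts = {
    str(name): int(n)
    for name, n in zip(iris.target_names, np.bincount(iris.target))
}
print(counts)  # {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```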


---  


### Libraries:

The following libraries are required to run the code in this notebook:   

- [pandas](https://pandas.pydata.org/) – for data manipulation and analysis [5]  
- [numpy](https://numpy.org/) – for numerical operations and working with arrays [6]   
- [matplotlib](https://matplotlib.org/) – for creating basic data visualizations such as histograms and scatter plots [7]   
- [scikit-learn](https://scikit-learn.org/stable/) – provides access to the Iris dataset [8]   

In [3]:
# Importing the libraries:

# For data manipulation and analysis
import pandas as pd 

# For numerical operations and working with arrays
import numpy as np 

# For creating data visualizations
import matplotlib.pyplot as plt 

# To load the Iris dataset
from sklearn.datasets import load_iris 


--- 

### Loading the dataset:

The line ***iris = load_iris()*** loads the Iris dataset from the *sklearn* library [8] into this notebook. The command ***print(iris)*** displays the full dataset, including metadata such as feature names, target names, and a description. The commands ***print(iris.data[:5])*** and ***print(iris.data[-5:])*** display the first and last five rows of the raw data.

In [4]:
# Load the Iris dataset from sklearn
iris = load_iris()
print(iris)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [5]:
# First 5 rows of the dataset [9]
print(iris.data[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [6]:
# Last 5 rows of the dataset [9]
print(iris.data[-5:])

[[6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
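
Because printing the whole object produces very long output, it can be easier to look at the metadata one piece at a time. The sketch below assumes only the standard attributes of the object returned by `load_iris()`, which behaves like a dictionary.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# List the available keys instead of printing the entire object
print(list(iris.keys()))

# Column names that describe each value in a row of iris.data
print(iris.feature_names)

# Species names; the position of each name matches the codes in iris.target
print(iris.target_names)

# Class codes for the first five rows
print(iris.target[:5])
```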



To improve readability and add column names, the raw data is converted into a Pandas DataFrame, as shown below (first and last five rows):

In [7]:
# Convert the dataset into a Pandas DataFrame for better visualization
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add a 'species' column holding each row's class code
# (0 = setosa, 1 = versicolor, 2 = virginica)
df['species'] = iris.target

# Display the DataFrame; the notebook shows the number of rows and columns at the end
df

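|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|-----|-------------------|------------------|-------------------|------------------|---------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |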
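
The `species` column above stores numeric codes rather than names. One possible way to make the table easier to read is to add a column with the species names, using `pd.Categorical.from_codes` to translate the codes through `iris.target_names`. This is a sketch only; the `species_name` column is a hypothetical addition and is not part of the original DataFrame.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Hypothetical extra column: translate codes 0, 1, 2 into species names
df['species_name'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Show the code and the name side by side for the first and last rows
print(df[['species', 'species_name']].iloc[[0, -1]])
```

Keeping the numeric column alongside the names means later code that expects integer labels continues to work.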



---  

### Exploring the dataset:



---  

### Data Visualization:

---  

### References:  
[1] Image, credits: medium.com. Available at: https://3tw.medium.com/scikit-learn-the-iris-dataset-and-machine-learning-the-journey-to-a-new-skill-c8d2f537e087 [Accessed 06 May 2025]  
[2] R. Fisher. Iris, UCI Machine Learning Repository, 1936. Available at: https://doi.org/10.24432/C56C76 [Accessed 04 May 2025]  
[3] Kelleher, Curran. The Iris Dataset Explained. Available at: https://gist.github.com/curran/a08a1080b88344b0c8a7 [Accessed 05 May 2025]  
[4] Iris Flower data set - Wikipedia. Available at: https://en.wikipedia.org/wiki/Iris_flower_data_set [Accessed 06 May 2025]  
[5] pandas. Available at: https://pandas.pydata.org/ [Accessed 08 May 2025]  
[6] NumPy. Available at: https://numpy.org/ [Accessed 08 May 2025]  
[7] Matplotlib. Available at: https://matplotlib.org/ [Accessed 08 May 2025]  
[8] scikit-learn. Available at: https://scikit-learn.org/stable/ [Accessed 06 May 2025]  
[9] Python - Slicing Strings. W3Schools. Available at: https://www.w3schools.com/python/python_strings_slicing.asp [Accessed 02 Mar 2025]

---  

## End

---  