# Iris Project - Supplementary Notebook
24-25: 4122 -- PROGRAMMING AND SCRIPTING : Project

The objective of this Jupyter notebook is to demonstrate some of the functionality of the analysis.py module and to comment on the Iris data set.

Some of the analysis.py functions called in this notebook take a parameter that controls whether the function writes its output to a file or displays it. The default behaviour is to write to a file, but there is a write-to-console option, which prints the text or shows the plot in the console or, in this case, in the Jupyter notebook.
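
The file-or-notebook switch described above might look roughly like the following minimal sketch. The helper name `emit_text`, its parameters and the output file name are hypothetical and are not taken from analysis.py; only the idea of a `to_console` flag that defaults to writing a file comes from this notebook. Whether analysis.py also writes the file when `to_console=True` is not assumed here.

```python
from pathlib import Path
import tempfile


def emit_text(text, filename, to_console=False):
    """Write text to a file by default, or print it when to_console is True.

    Hypothetical helper for illustration only. It returns 0 on success,
    mirroring the return-code convention used in this notebook.
    """
    if to_console:
        print(text)
    else:
        Path(filename).write_text(text, encoding="utf-8")
    return 0


# Default behaviour: write to a file (a temporary location is used here)
out_file = Path(tempfile.gettempdir()) / "example_summary.txt"
rc_file = emit_text("example summary line", out_file)

# to_console=True: print instead, which in a notebook shows the text in the cell output
rc_console = emit_text("example summary line", out_file, to_console=True)
```

Keeping the switch as a keyword argument with a default means a script run from the command line and a notebook cell can call the same function.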


## Requirements

1. Research the data set online and write a summary about it in your README.
2. Download the data set and add it to your repository.
3. Write a program called analysis.py that:
    1. Outputs a summary of each variable to a single text file,
    2. Saves a histogram of each variable to png files, and
    3. Outputs a scatter plot of each pair of variables.
    4. Performs any other analysis you think is appropriate.

## References

- GitHub Copilot. (n.d.). *GitHub Copilot*. A code completion tool that uses machine learning to suggest code snippets and functions based on context.

*Note: The source code of each analysis.py function called in this notebook is also shown, using `inspect.getsource`.*


### Import some standard libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import inspect

### Import python module analysis.py 

This follows the DRY (Don't Repeat Yourself) principle: the code lives in the analysis.py module rather than being repeated in the notebook, which also makes the module easy to test.

In [None]:
import analysis as an

### Load Iris Data Set

In [None]:
# load iris data set and check return code
# return code 0 means success
return_code, df_iris = an.load_data(an.config)
if return_code != 0:
    print(f"Error loading data : {return_code}")
else:
    print("Data loaded successfully")

**Code**

In [None]:
print(inspect.getsource(an.load_data))

### Convert Iris Data Frame to Metric Data Frame

The data frame is melted so that each row holds a single feature value. This is helpful when summarising the data.

It converts from the wide format:

| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|---------|
| 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 4.9          | 3.0         | 1.4          | 0.2         | setosa  |

To a metric data frame:

| feature      | value | species |
|--------------|-------|---------|
| sepal_length | 5.1   | setosa  |
| sepal_length | 4.9   | setosa  |
| sepal_width  | 3.5   | setosa  |
| sepal_width  | 3.0   | setosa  |
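
The conversion shown in the two tables above can be sketched with `pandas.DataFrame.melt`. This is a minimal illustration using the two rows from the first table; it is not necessarily how `convert_to_metrics_df` is implemented in analysis.py.

```python
import pandas as pd

# The two rows from the wide-format table above
df_wide = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
    "species": ["setosa", "setosa"],
})

# Keep species as the identifier; every other column becomes (feature, value) rows
df_melt = df_wide.melt(id_vars="species", var_name="feature", value_name="value")

# Reorder the columns to match the metric table above
df_melt = df_melt[["feature", "value", "species"]]

print(df_melt.head(4))
```

Each of the four measurement columns contributes one row per flower, so two flowers produce eight rows.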



In [None]:
# convert to a metric dataframe
return_code = an.convert_to_metrics_df(an.config)
if return_code != 0:
    print(f"Error converting data to metrics dataframe : {return_code}")
else:
    print("Data converted to metrics dataframe successfully")
print('Melted data frame head:')
an.config['df_iris_melt'].head()

**Code**

In [None]:
print(inspect.getsource(an.convert_to_metrics_df))

### Load Summary Data Set

This converts the melted iris data frame to a summary data frame. It contains the mean, max, min, standard deviation, median, Q25 and Q75 for each species and each feature.
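
A summary of this kind could be built with a pandas group-by, as in the sketch below. The values are made up for the example, and the column names (`species`, `feature`, `value`) and the helper names `q25` and `q75` are assumptions; the actual `load_summary` function may work differently.

```python
import pandas as pd

# Small melted frame with made-up values, one feature and two species
df_melt = pd.DataFrame({
    "species": ["setosa"] * 4 + ["virginica"] * 4,
    "feature": ["petal_length"] * 8,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})


def q25(series):
    """25th percentile (lower quartile) of a Series."""
    return series.quantile(0.25)


def q75(series):
    """75th percentile (upper quartile) of a Series."""
    return series.quantile(0.75)


# One row per (species, feature) with the seven summary statistics
df_summary = (
    df_melt.groupby(["species", "feature"])["value"]
    .agg(mean="mean", max="max", min="min", std="std",
         median="median", q25=q25, q75=q75)
    .reset_index()
)

print(df_summary)
```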

In [None]:
# create a summary dataframe
return_code = an.load_summary(an.config)
if return_code != 0:
    print(f"Error creating summary dataframe : {return_code}")
else:
    print("Summary dataframe created successfully")

an.config['df_summary'].head()

**Code**

In [None]:
print(inspect.getsource(an.load_summary))

## Run generate report 

This will display the report in the notebook and also save it to a file.


In [None]:
return_code = an.generate_report(an.config, to_console=True)
if return_code != 0:
    print(f"Error generating report : {return_code}")
else:
    print("Report generated successfully")

**Code**

In [None]:
print(inspect.getsource(an.generate_report))

## Plot histogram of the data
This will display the histograms in the notebook and also save them to a file, using the generate_histograms_combined function in the analysis.py module.

The histograms show that there is a distinct difference between Iris setosa and the other two species (versicolor and virginica) with respect to petal length and petal width. The sepal length and width do not show an obvious difference between the three species.

*Note: There is an alternative histogram function which saves each histogram as a separate file.*

In [None]:
an.generate_histograms_combined(an.config, to_console=True)

**Code**

In [None]:
print(inspect.getsource(an.generate_histograms_combined))

## Plot Scatterplot of the data
This will display the scatter plot in the notebook and also save it to a file.

There is a distinct relationship between petal length and petal width, as well as with the species. This may imply that only one of the two features needs to be used, or possibly that the petal width and length can be combined into a new feature, such as an approximate surface area (assuming either a rectangle or an ellipse). The formula for the area of an ellipse is pi * a * b, where a and b are the semi-major and semi-minor axes (https://www.cuemath.com/geometry/area-of-an-ellipse/). The formula for the area of a rectangle is a * b, where a and b are the side lengths. An alternative is the approximate perimeter of an ellipse, pi * sqrt(2 * (a**2 + b**2)) (https://www.cuemath.com/measurement/perimeter-of-ellipse/).

References:
- [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) 
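
The derived petal features suggested above could be computed as in the following sketch. The three rows are illustrative, and the new column names are hypothetical. Treating half the petal length and half the petal width as the semi-axes of the ellipse is an assumption made here, not something taken from analysis.py.

```python
import numpy as np
import pandas as pd

# Illustrative rows, one per species
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Assumption: the semi-axes are half the measured length and width
a = df["petal_length"] / 2
b = df["petal_width"] / 2

# Rectangle area uses the full length and width as side lengths
df["petal_area_rect"] = df["petal_length"] * df["petal_width"]

# Ellipse area: pi * a * b
df["petal_area_ellipse"] = np.pi * a * b

# Approximate ellipse perimeter: pi * sqrt(2 * (a^2 + b^2))
df["petal_perimeter_ellipse"] = np.pi * np.sqrt(2 * (a**2 + b**2))

print(df)
```

With these definitions the ellipse area is always pi / 4 times the rectangle area, so the two area features carry the same information for separating species; only the scale differs.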

In [None]:
an.generate_scatter_plot(an.config, to_console=True)

**Code**

In [None]:
# print the source code of the generate_scatter_plot function
lines = inspect.getsource(an.generate_scatter_plot)
print(lines)

## Plot Box plot of the data
This will display the boxplot in the notebook and also save it to a file.

The box plot shows the summary metrics in a diagram, so that the data can be easily visualised. The box spans Q25 to Q75 with the median marked inside it, and by default each whisker extends to the furthest data point within 1.5 * IQR of the box edge. Anything beyond the whiskers (above or below) is drawn as an outlier.

What is interesting is the separation between the species. The sepal measurements overlap more than the petal length and width.
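
The box, whisker and outlier rule described above can be worked through by hand. The sketch below uses made-up values with one extreme point and the common 1.5 * IQR rule; it illustrates the rule only and is not code from analysis.py.

```python
import numpy as np

# Made-up measurements with one obviously extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])

# Box: lower quartile, median and upper quartile
q25, median, q75 = np.percentile(values, [25, 50, 75])
iqr = q75 - q25

# Fences sit 1.5 * IQR beyond the box edges
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q25} to {q75}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```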

In [None]:
an.generate_box_plot(an.config, to_console=True)

**Code**

In [None]:
print(inspect.getsource(an.generate_box_plot))

## Box Plot of iris data set - Common X axis
This will display the boxplot in the notebook and also save it to a file.
This is a box plot, similar to the one above, for each species of iris and each feature, with a common x axis. This makes it easier to compare the species and features proportionally.

The species setosa stands out as distinct from the other two species with respect to petal length and petal width. There is also separation between the other two species, but it is less distinct. There is crossover between those two species, especially with respect to outliers.

In [None]:
return_code = an.generate_box_plot_II(an.config, to_console=True)
if return_code != 0:
    print(f"Error generating box plot II : {return_code}")
else:
    print("Box plot II generated successfully")


**Code**

In [None]:
print(inspect.getsource(an.generate_box_plot_II))

## Violin Plot of iris data set - Common X axis
This will display the violin plot in the notebook and also save it to a file.  


A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of data points after grouping by one (or more) variables. Unlike a box plot, each violin is drawn using a kernel density estimate of the underlying distribution. The width of the violin indicates the "density" of the points at different values: the wider the violin, the more points there are at that value.

- https://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot
- Google Gemini - What is a violin Plot
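
The kernel density estimate behind each violin can be sketched by hand with NumPy. The function name `gaussian_kde_1d`, the sample values and the fixed bandwidth of 0.1 are all assumptions made for illustration; seaborn selects its own bandwidth, so the exact shape in the plot will differ.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


# Made-up measurements clustered around 1.5
samples = np.array([1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7])
grid = np.linspace(0.5, 2.5, 401)
density = gaussian_kde_1d(samples, grid, bandwidth=0.1)

# The violin is widest where the density is highest
peak = grid[np.argmax(density)]
area = density.sum() * (grid[1] - grid[0])
print(f"density peaks near {peak:.2f}, area under curve {area:.3f}")
```

The half-width of a violin at a given value is proportional to this density, which is why regions with many points look fatter.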


In [None]:
return_code = an.generate_box_plot_II(an.config, to_console=True, kind='violin')
if return_code != 0:
    print(f"Error generating violin plot : {return_code}")
else:
    print("Violin plot generated successfully")


## Boxen Plot of iris data set - Common X axis
This will display the boxen plot in the notebook and also save it to a file.
This is a boxen plot, similar to the plots above, for each species of iris and each feature, with a common x axis. This makes it easier to compare the species and features proportionally.

The boxen plot shows more quantiles than a box plot and so gives more detail in the tails and outliers. It is intended for large data sets, so the Iris data set may be too small for this plot to be useful.

It does show that there is generally separation between the species for the petal length and width. The outliers show that there is overlap between species. The outliers could potentially be identified as a different species if a categorising algorithm were used.
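
The extra quantiles mentioned above follow the letter-value idea: after the quartiles come the eighths, the sixteenths, and so on, each pair reaching further into the tails. The sketch below is a simplified illustration with made-up data. The function name `letter_values` and the fixed `depth` are assumptions, and seaborn decides for itself how many levels to draw.

```python
import numpy as np


def letter_values(values, depth=3):
    """Return (tail fraction, lower quantile, upper quantile) for each level.

    Level 1 is the quartiles (1/4), level 2 the eighths (1/8), and so on.
    """
    values = np.asarray(values, dtype=float)
    levels = []
    for k in range(1, depth + 1):
        tail = 0.5 ** (k + 1)  # 1/4, 1/8, 1/16, ...
        lower, upper = np.quantile(values, [tail, 1 - tail])
        levels.append((tail, float(lower), float(upper)))
    return levels


# 17 made-up values: 1, 2, ..., 17
values = np.arange(1, 18)
levels = letter_values(values, depth=3)

for tail, lower, upper in levels:
    print(f"tail {tail:.4f}: box from {lower} to {upper}")
```

Each deeper level is estimated from fewer points, which is why this kind of plot suits large data sets better than a sample of 50 flowers per species.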

In [None]:
return_code = an.generate_box_plot_II(an.config, to_console=True, kind='boxen')
if return_code != 0:
    print(f"Error generating boxen plot : {return_code}")
else:
    print("Boxen plot generated successfully")