# Principles of Data Analytics - Tasks


| Course   | ATU - Principles of Data Analytics  |
|----------|----------|
| Project  | PODA Course Project  |
| Date     | 2025-05-01  |
| Instructor | Dr. Ian McLoughlin  |
| Author  | Clyde Watts  |



## Imports

The following imports are approved: numpy, scipy, matplotlib, seaborn, pandas, statsmodels, and scikit-learn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# scikit-learn provides the Iris data set (sklearn.datasets)
import sklearn

# Task 1: Source the Data Set

Import the Iris data set from the `sklearn.datasets` module.  
Explain, in your own words, what the `load_iris()` function returns.






### Task 1 : Answer

#### Summary

The `load_iris()` function from scikit-learn returns a built-in toy data set of iris flowers. The data set consists of 150 records evenly distributed among three iris species ($3 \times 50$): Setosa, Versicolour, and Virginica. The data has 4 features, sepal length and width and petal length and width in cm, and the target is the iris species.

The `load_iris()` function returns a "bunch" object, which is a dictionary-like object holding the features, the target, and the associated metadata.
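
To make "dictionary-like" concrete, here is a minimal sketch of a dictionary whose keys can also be read as attributes. `MiniBunch` is a hypothetical stand-in written for this note; it is not scikit-learn's own `Bunch` class, and the values in it are typed in by hand.

```python
# MiniBunch is a hypothetical stand-in, not scikit-learn's Bunch class.
class MiniBunch(dict):
    """A dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


toy = MiniBunch(
    feature_names=["sepal length (cm)", "sepal width (cm)"],
    target_names=["setosa", "versicolor", "virginica"],
)

# Key access and attribute access return the same object.
same_object = toy["target_names"] is toy.target_names
print(sorted(toy.keys()))
print(same_object)
```

This is why the notebook can use `bunch_iris['data']` and `bunch_iris.data` interchangeably.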


***Image for Iris Flowers***

For those who are not gardeners, the links below show the parts of an iris flower and the three species.

- [Image: Iris diagram with petals and sepals](https://www.life.illinois.edu/help/digitalflowers/Iridaceae/9.htm)
- [Image: Iris-Setosa](https://en.wikipedia.org/wiki/Iris_setosa#/media/File:Irissetosa1.jpg)
- [Image: Iris-Versicolour](https://upload.wikimedia.org/wikipedia/commons/2/27/Blue_Flag%2C_Ottawa.jpg)
- [Image: Iris-Virginica](https://en.wikipedia.org/wiki/Iris_virginica#/media/File:Iris_virginica_2.jpg)

### References

https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.load_iris.html

https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset





In [None]:
# Task 1: Import the data
# Load the iris dataset from the sklearn.datasets module using the load_iris() function.
# The iris dataset is a dictionary-like object that holds all the data and some metadata about the data.
import sklearn.datasets as datasets


Load the iris data set

In [None]:
# Load the iris dataset from sklearn.datasets module
# as_frame=True loads the data as a pandas DataFrame
bunch_iris = datasets.load_iris(as_frame=True)

Display the keys of the "bunch" data set.

The bunch behaves like a dictionary, so its keys can be listed with `keys()`.


In [None]:
print("Iris Dataset Keys:")
for key in bunch_iris.keys():
    print(f"Key: {key}")


Iris Data Set Feature Names

In [None]:
# Convert feature_names to DataFrame
# Easier to read
df_feature_name = pd.DataFrame(bunch_iris['feature_names'])
df_feature_name

Iris Dataset Target Names  
The target names are the 3 species of iris flowers.

In [None]:
# convert target names to dataframe for easier reading
df_target_name = pd.DataFrame(bunch_iris['target_names'])
print(df_target_name)

In [None]:
#bunch_iris = datasets.load_iris(as_frame=True)
#print("Bunch iris: ", bunch_iris)
print("Keys: ", bunch_iris.keys())
print("Data: ", bunch_iris['data'].head())
print("Target: ", bunch_iris['target'].head())



In [None]:
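# info() prints directly; a cell only displays its last expression,
# so the shape and describe() results are printed explicitly
bunch_iris.data.info()
print(bunch_iris.data.shape)
print(bunch_iris.data.describe())
bunch_iris.data.head()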

In [None]:
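# info() prints directly; a cell only displays its last expression,
# so the shape and describe() results are printed explicitly
bunch_iris.target.info()
print(bunch_iris.target.shape)
print(bunch_iris.target.describe())
bunch_iris.target.head()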


# Task 2: Explore the Data Structure

Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes.  


## Task 2 : Answer

### Summary

The iris data set is returned as a "bunch" object. The bunch object is a dictionary-like object whose main keys are:

- data : The feature data set
- target : The target data set
- target_names : The target names
- feature_names : The feature names
- DESCR : The description of the data set





**Description of the data set (DESCR)**

The description(*DESCR*) describes and defines the contents of the iris data set.

In [None]:
# This explains the data set
print(bunch_iris.DESCR);

### Shape of data set

The data portion of the iris (flower) data set has 4 feature columns (sepal length and width, and petal length and width, in cm) and 150 rows, 50 for each of the 3 iris species. The target, which identifies the iris species, is a series of 150 rows in the same order as the data. The target encodes the species as an integer, decoded as follows.

**Data - Features**


| Column Number | Column Name      | Description           |
|---------------|------------------|-----------------------|
| 0             | Sepal Length     | Sepal length in cm    |
| 1             | Sepal Width      | Sepal width in cm     |
| 2             | Petal Length     | Petal length in cm    |
| 3             | Petal Width      | Petal width in cm     |


**Target - Species**

| Column Number | Column Name      | Description                      |
|---------------|------------------|----------------------------------|
| 0             | Species          | Integer code in the range 0 to 2 |

***Mapping of Species***

| Value | Species          |
|-------|------------------|
| 0     | Iris Setosa      |
| 1     | Iris Versicolour  |
| 2     | Iris Virginica    |





In [None]:
print("Shape of the data: ")
print("------------------")
# a bare expression in the middle of a cell is not displayed, so print it
print(bunch_iris.data.shape)
print("Info of the data: ")
print("------------------")
bunch_iris.data.info()
print("Shape of the target: ")
print("------------------")
print(bunch_iris.target.shape)
print("Info of the target: ")
print("------------------")
bunch_iris.target.info()

### First and last 5 rows of the data set

The data set in the "bunch" object is split into "data" (the features) and "target" (the species).


**Data ( Features ) First 5 Rows**

In [None]:
bunch_iris.data.head(5)

**Target ( Species ) First 5 Rows**

In [None]:
# convert the Series to a DataFrame, which is easier to read in the notebook
bunch_iris.target.head(5).to_frame(name='target')

**Data ( Features ) Last 5 Rows**

In [None]:
bunch_iris.data.tail()

**Target ( Species ) Last 5 Rows**

In [None]:
bunch_iris.target.tail(5).to_frame(name='target')

### Feature Names

The feature names are stored as metadata under the `feature_names` key of the iris data set.


#### Notes

##### What is a sepal?

The outer parts of the flower (often green and leaf-like) that enclose a developing bud. Sepals are usually green, but in some species they may be brightly colored or resemble petals. Sepals help to protect the flower bud.


##### What is a petal?

The parts of a flower that are often conspicuously colored. They are often involved in attracting pollinators, or are used to attract animals that help the plant to spread its seeds.

Source: https://www.amnh.org/learn-teach/curriculum-collections/biodiversity-counts/plant-identification/plant-morphology/parts-of-a-flower

In [None]:
bunch_iris.feature_names

Check that the target is as specified in the dataset specification. 

*Trust but verify.*

In [None]:
# Count of each target
# This shows the distribution of the target
bunch_iris.target.groupby(bunch_iris.target).count().to_frame(name='count').reset_index()
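
The same check can be sketched with NumPy alone. The array below is an assumed stand-in for `bunch_iris.target` (three classes with 50 rows each, as the data set description states), so the sketch runs without loading the data.

```python
import numpy as np

# Assumed stand-in for bunch_iris.target: three classes with 50 rows each.
target_codes = np.repeat([0, 1, 2], 50)

# np.unique returns the distinct codes and how often each one occurs.
codes, counts = np.unique(target_codes, return_counts=True)
class_counts = {int(code): int(count) for code, count in zip(codes, counts)}
print(class_counts)
```

On the pandas side, `bunch_iris.target.value_counts()` is a shorter route to the same counts as the `groupby` above.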

In [None]:
bunch_iris.target_names.tolist()

# Task 3: Summarize the Data

For each feature in the dataset, calculate and display:  

- mean
- minimum
- maximum
- standard deviation
- median

## Task 3 : Answer

### Summary

Create a single dataframe containing all the features (sepal length and width, and petal length and width) and the associated species. Then calculate the mean, minimum, maximum, standard deviation, and median for each feature.
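
As a small worked example of the five statistics, the sketch below uses Python's standard `statistics` module on a handful of illustrative measurements in cm (typed in by hand, not read from the data set). It also shows the two common definitions of standard deviation: pandas' `std` divides by n - 1 by default (sample), whereas `numpy.std` divides by n by default (population), so the two libraries can report slightly different values for the same column.

```python
import statistics

# Illustrative measurements in cm (hand-typed for this sketch).
values = [5.1, 4.9, 4.7, 4.6, 5.0]

summary = {
    "mean": statistics.mean(values),
    "min": min(values),
    "max": max(values),
    "std (sample, n-1)": statistics.stdev(values),
    "std (population, n)": statistics.pstdev(values),
    "median": statistics.median(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```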



In [None]:
# Combine the data and target into one dataframe (technique from a DataCamp course)
df_iris = pd.concat([bunch_iris.data, bunch_iris.target], axis=1)
# get a dictionary of the target values and their corresponding names
# create a lookup table for the target values and their name
target_name_map = {i: name for i, name in enumerate(bunch_iris.target_names)}
# map the target values to the target names
# substitute the surrogate key with the target name
df_iris['target'] = df_iris['target'].map(target_name_map)
# run describe for each column

In [None]:
# describe() shows the count, mean, std, min, 25%, 50%, 75% and max
# for each numeric column; the 50% row is the median, although it is not labelled as such
df_iris.describe(include='all')
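
The `50%` row reported by `describe()` is the median. A minimal sketch on a small hand-typed frame (the column names mirror the iris features, the values are illustrative) confirms that the two agree:

```python
import pandas as pd

# Small illustrative frame; the column names mirror the iris features.
df_demo = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
})

described = df_demo.describe()
medians = df_demo.median()

# The "50%" row and median() hold the same values for every column.
rows_match = bool((described.loc["50%"] == medians).all())
print(described.loc[["mean", "min", "max", "std", "50%"]])
print(rows_match)
```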


Display the mean, minimum, maximum, standard deviation, and median of all the species combined for each feature.

In [None]:
# get the mean, min, max, std, and median of all the features at once
# agg accepts a list of functions to apply to each column;
# unlike describe(), it labels the median explicitly
stats = ['mean', 'min', 'max', 'std', 'median']
df_iris.agg({feature: stats for feature in bunch_iris.feature_names})

### Summary at species level

This summary is for each species: Setosa, Versicolour, and Virginica. The mean, minimum, maximum, standard deviation, and median of each feature are calculated for each species.
This is done by melting the data frame into long format and then grouping by species and feature.
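
The melt-then-group idea can be sketched on a tiny made-up table. The values and the `wide`, `long`, and `summary` names are illustrative only; the notebook's own cells apply the same steps to `df_iris`.

```python
import pandas as pd

# Tiny made-up wide table: one row per flower, one column per feature.
wide = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal length (cm)": [5.0, 5.2, 6.4, 6.6],
    "petal length (cm)": [1.4, 1.6, 5.5, 5.7],
})

# Long format: one row per (flower, feature) pair.
long = wide.melt(id_vars="species", var_name="feature", value_name="value")

# A single groupby then covers every species/feature combination.
summary = long.groupby(["species", "feature"])["value"].agg(["mean", "median"])
print(long)
print(summary)
```

The long table has one `value` column, so the list of statistics only has to be written once instead of once per feature.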

In [None]:
# use melt to convert the dataframe to long format
# this is a classic metric table format
# makes it easier to workout the mean, min, max, std, and median of the species and features
# id_vars is the column to keep as is - target setosa, versicolor, virginica
# var_name is the new column name for the features
# value_name is the new column name for the values
df_iris_melt = df_iris.melt(id_vars='target', var_name='feature', value_name='value')
# rename target to species
df_iris_melt.rename(columns={'target':'species'}, inplace=True)
# check the first few rows of the melted dataframe
df_iris_melt.head()

In [None]:
# datacamp course - picked up how to do this
# used similar technique for daily store , sku metric table about 5B rows
df_iris_species_summary = df_iris_melt.groupby(['species','feature'])\
    .agg({'value':['mean','min','max','std','median']})\
    .reset_index()
# rename columns
df_iris_species_summary.columns = ['species','feature','mean','min','max','std','median']
print(df_iris_species_summary['species'].unique())



Summary of Setosa

In [None]:
# get setosa summary and then transpose
# this is to make it easier to read
df_setosa = df_iris_species_summary[df_iris_species_summary['species']=='setosa'].drop(columns='species').transpose()
# use the first row (the feature names) as the column headers, then drop that row
df_setosa = df_setosa.rename(columns=df_setosa.iloc[0]).drop(df_setosa.index[0])
df_setosa


Summary of Versicolour

In [None]:
# get versicolor summary and then transpose
# this is to make it easier to read
df_versicolor = df_iris_species_summary[df_iris_species_summary['species']=='versicolor'].drop(columns='species').transpose()
# use the first row (the feature names) as the column headers, then drop that row
df_versicolor = df_versicolor.rename(columns=df_versicolor.iloc[0]).drop(df_versicolor.index[0])
df_versicolor

Summary of Virginica


In [None]:
# get virginica summary and then transpose
# this is to make it easier to read
df_virginica = df_iris_species_summary[df_iris_species_summary['species']=='virginica'].drop(columns='species').transpose()
# use the first row (the feature names) as the column headers, then drop that row
df_virginica = df_virginica.rename(columns=df_virginica.iloc[0]).drop(df_virginica.index[0])
df_virginica

# Task 4: Visualize Features

Plot histograms for each feature using `matplotlib`.  
Add appropriate titles and axis labels.  



## Task 4 : Answer

The first plot is a histogram of each feature in the data set; the second plot is a histogram of each feature by species, with each species in a different colour.


### References
[Matplotlib Histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)


In [None]:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# Set the title of the figure
fig.suptitle('Iris Species Feature Histogram Plot')
# loop through the features and plot the histogram for each feature
features = df_iris.columns[:-1]
# learned this from datacamp course - after doing the below
axflat = ax.flatten()
for i, feature in enumerate(features):
    # get the axes for this feature
    telecaster = axflat[i] # guitars are called axes
    # set the title of the plot
    title = f'{feature.replace(" (cm)", "")}'
    telecaster.set_title(title, fontsize=12, fontweight='bold')
    # plot the histogram of this feature for all species combined
    telecaster.hist(df_iris[feature], color='gray',bins='auto', rwidth=0.85)
    # set the x and y labels
    telecaster.set_xlabel('cm')
    telecaster.set_ylabel('Count')
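
The `bins='auto'` argument hands the choice of bin edges to NumPy, which chooses between the Sturges and Freedman-Diaconis estimators. The sketch below shows the edges and counts that this rule produces, using `numpy.histogram_bin_edges` on a few illustrative measurements (hand-typed, not the notebook's data).

```python
import numpy as np

# Illustrative measurements in cm (not taken from the notebook's output).
values = np.array([4.3, 4.6, 4.9, 5.0, 5.0, 5.1, 5.4, 5.8, 6.3, 7.9])

# The same bin rule that hist(..., bins='auto') asks for.
edges = np.histogram_bin_edges(values, bins="auto")
counts, _ = np.histogram(values, bins=edges)

print("bin edges:", np.round(edges, 2))
print("counts   :", counts)
```

Every value falls into exactly one bin, so the counts always add up to the number of observations.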


This histogram displays each species in a different colour. This is much more useful for viewing and understanding how the data is distributed among the species.  
It shows a clear difference between the species in petal length and width. The sepal length and width do not show as much distinction between the species.

In [None]:
# Plot in 4 subplots the histograms of the features of each species of iris
plt.style.use('ggplot')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
# set overall title
fig.suptitle('Iris Species Feature Histogram Plot')
# loop through each feature and plot its histogram in the 2 x 2 subplots
# each species is drawn as a separate, semi-transparent layer
# the loop keeps the code compact
axflat = ax.flatten()
for i, feature_name in enumerate(df_iris.columns[:-1]):
    for species in bunch_iris.target_names:
        # plot the histogram for each species
        stratocaster = axflat[i]
        stratocaster.hist(df_iris[df_iris['target'] == species][feature_name], alpha=0.5, label=species,histtype='stepfilled', bins='auto', rwidth=0.85)
    stratocaster.set_title(feature_name.replace(" (cm)", ""), fontsize=12, fontweight='bold')
    stratocaster.legend()
    stratocaster.set_xlabel('cm')
    stratocaster.set_ylabel('Count')
    stratocaster.set_xlim(df_iris[feature_name].min(), df_iris[feature_name].max())
    stratocaster.grid(True)
plt.tight_layout()

# Task 5: Investigate Relationships

Choose any two features from the data set and create a scatter plot of them.  
Color-code the three different classes of the scatter plot points.


## Task 5 : Answer

Sepal width and sepal length, and petal width and petal length, are chosen for the scatter plots. The points are color-coded by species (Setosa, Versicolour, Virginica).


In [None]:
# Create 2 x 
# Set plot style
plt.style.use('ggplot')
# create a figure and axis
fig , ax = plt.subplots(nrows = 2, ncols = 1, figsize=(10, 10))
# Sepal length vs Sepal width scatter plot
ax[0].scatter(bunch_iris.data['sepal length (cm)'], bunch_iris.data['sepal width (cm)']\
    , c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
ax[0].set_xlabel('sepal length (cm)')
ax[0].set_ylabel('sepal width (cm)')
ax[0].set_title('sepal length vs sepal width')
# put legend for the targets values and colors
ax[0].legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
# Petal length vs Petal width scatter plot
# plot petal length vs petal width, with each species in a different color
ax[1].scatter(bunch_iris.data['petal length (cm)'], bunch_iris.data['petal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
ax[1].set_xlabel('petal length (cm)')
ax[1].set_ylabel('petal width (cm)')
ax[1].set_title('petal length vs petal width')
# put legend for the targets values and colors
ax[1].legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)]);


# Task 6: Analyze Relationship

Use `numpy.polyfit` to add a regression line to the scatter plot from Task 5.




## Task 6 : Answer


The `polyfit` function performs a least-squares polynomial fit. I last used this technique for the iodine absorption curve of an EDTA titration in a chemistry lab (2nd year chemistry, WITS, 1984; chemistry was too hard, so I switched to applied maths/operations research, which was easier). The `polyfit` function is applied to both petal length and width and sepal length and width, for each species separately and for all species together.

Visually, the sepal length and width for all the species together do not seem to have a good linear correlation, but separately there does appear to be some linear relationship. The petal length and width show a linear relationship both overall and separately.
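
One way to put a number on this visual impression is the Pearson correlation coefficient. The sketch below is an optional check, not part of the task: it reloads the data as plain NumPy arrays and uses `numpy.corrcoef`; the `pearson_r` helper is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()  # NumPy arrays when as_frame is not set
X, y = iris.data, iris.target
# Column order follows iris.feature_names:
# sepal length, sepal width, petal length, petal width.


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_sepal_all = pearson_r(X[:, 0], X[:, 1])
r_petal_all = pearson_r(X[:, 2], X[:, 3])
print(f"all species: sepal r = {r_sepal_all:.2f}, petal r = {r_petal_all:.2f}")

# Repeat the sepal check within each species.
r_sepal_by_species = {}
for code, name in enumerate(iris.target_names):
    mask = y == code
    r_sepal_by_species[str(name)] = pearson_r(X[mask, 0], X[mask, 1])
    print(f"{name}: sepal r = {r_sepal_by_species[str(name)]:.2f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.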



- NumPy polyfit documentation: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
- How to plot a polyfit trend line in a scatter plot: https://stackoverflow.com/questions/46627629/how-to-use-numpy-polyfit-to-plot-trend
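
Before applying `polyfit` to the iris measurements, a minimal sketch on made-up points that lie exactly on a known line shows what the function returns. With `deg=1` the coefficients come back highest power first, so the slope precedes the intercept.

```python
import numpy as np

# Points that lie exactly on the line y = 2x + 1 (made-up data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# deg=1 asks for a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# poly1d wraps the coefficients in a callable polynomial.
line = np.poly1d([slope, intercept])
predicted = line(5.0)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, y(5)={predicted:.2f}")
```

Real measurements do not lie exactly on a line, so the fitted coefficients are then the ones that minimise the sum of squared vertical distances to the points.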



### Polyfit of sepal length and width

#### Conclusion

The initial attempt at fitting one regression line through all the data was not successful: there is no obvious linear relationship between sepal length and sepal width across the whole data set.  
When the species are separated out, the setosa data shows a clear linear relationship between sepal length and sepal width. The versicolor and virginica data are more spread out, so any linear relationship is weaker.  
A separate regression line is therefore fitted for each species.

In [None]:
# numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
# for sepal length vs sepal width
# Do a polynomial fit between sepal length and sepal width
# Degree 1 means linear
trend = np.polyfit(x = df_iris['sepal length (cm)'], y = df_iris['sepal width (cm)'], deg = 1)
# do a trend line for each species of iris
trend_setosa = np.polyfit(x = df_iris[df_iris['target'] == 'setosa']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'setosa']['sepal width (cm)'], deg = 1)
trend_versicolor = np.polyfit(x = df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'versicolor']['sepal width (cm)'], deg = 1)
trend_virginica = np.polyfit(x = df_iris[df_iris['target'] == 'virginica']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'virginica']['sepal width (cm)'], deg = 1)

# wrap each set of coefficients in a callable polynomial - this is the (linear) trend line
trendpoly = np.poly1d(trend)
trendpoly_setosa = np.poly1d(trend_setosa)
trendpoly_versicolor = np.poly1d(trend_versicolor)
trendpoly_virginica = np.poly1d(trend_virginica)

plt.scatter(df_iris['sepal length (cm)'], df_iris['sepal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('sepal length vs sepal width')
# add a legend that maps each color to its species
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
# plot the trend line for all the data and then each species
plt.plot(df_iris['sepal length (cm)'], trendpoly(df_iris['sepal length (cm)']), color = 'black')
plt.plot(df_iris[df_iris['target'] == 'setosa']['sepal length (cm)'], trendpoly_setosa(df_iris[df_iris['target'] == 'setosa']['sepal length (cm)']), color = 'red')
plt.plot(df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)'], trendpoly_versicolor(df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)']), color = 'green')
plt.plot(df_iris[df_iris['target'] == 'virginica']['sepal length (cm)'], trendpoly_virginica(df_iris[df_iris['target'] == 'virginica']['sepal length (cm)']), color = 'blue')
plt.show()


The polyfit documentation notes that `numpy.polyfit` belongs to the older polynomial API and recommends the newer `numpy.polynomial` package, for example `Polynomial.fit`, for new code.
- [polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html)
- [polynomial fit](https://numpy.org/doc/stable/reference/routines.polynomials-package.html#module-numpy.polynomial)

The cell below repeats the polyfit example above using `Polynomial.fit` instead. The `Polynomial` class is more versatile.
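
One difference between the two APIs is easy to trip over, so here is a minimal sketch on made-up points. `Polynomial.fit` works in a scaled domain, so its raw coefficients are not the intercept and slope in the original units; `convert()` maps them back, ordered lowest power first, which is the reverse of `np.polyfit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up points on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Polynomial.fit works in a scaled domain, so its raw coefficients
# are not the intercept and slope in the original x units.
fitted = Polynomial.fit(x, y, deg=1)

# convert() maps the series back to the unscaled domain;
# coefficients are then ordered lowest power first.
intercept, slope = fitted.convert().coef

# np.polyfit returns the same line, highest power first.
old_slope, old_intercept = np.polyfit(x, y, deg=1)
print(f"Polynomial.fit: y = {intercept:.2f} + {slope:.2f}x")
print(f"np.polyfit    : y = {old_intercept:.2f} + {old_slope:.2f}x")
```

This is why the cells in this notebook call `convert()` before reading `coef[0]` and `coef[1]` for the equation labels.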

In [None]:
# Use the Polynomial.fit function from numpy
#polynomial.polynomial.Polynomial.fit
# used github copilot prompt to get me started
from numpy.polynomial import Polynomial
# TODO - add a function to do this
# Split into x and y - it could be done in the plot function but this is more readable
x_sepal_length = df_iris['sepal length (cm)']
y_sepal_width = df_iris['sepal width (cm)']
# now x and y for each species
x_sepal_length_setosa = df_iris[df_iris['target'] == 'setosa']['sepal length (cm)']
y_sepal_width_setosa = df_iris[df_iris['target'] == 'setosa']['sepal width (cm)']
x_sepal_length_versicolor = df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)']
y_sepal_width_versicolor = df_iris[df_iris['target'] == 'versicolor']['sepal width (cm)']
x_sepal_length_virginica = df_iris[df_iris['target'] == 'virginica']['sepal length (cm)']
y_sepal_width_virginica = df_iris[df_iris['target'] == 'virginica']['sepal width (cm)']
# Use Polynomial fit function from numpy for sepal length vs sepal width , for all the data and then each species
# get the polynomial fit for all the data and then each specie
poly_sepal,poly_sepal_residual = Polynomial.fit(x = x_sepal_length, y = y_sepal_width, deg = 1 , full=True)
poly_sepal_setosa,poly_sepal_setosa_residual = Polynomial.fit(x = x_sepal_length_setosa, y = y_sepal_width_setosa, deg = 1 , full=True)
poly_sepal_versicolor,poly_sepal_versicolor_residual = Polynomial.fit(x = x_sepal_length_versicolor, y = y_sepal_width_versicolor, deg = 1 , full=True)
poly_sepal_virginica,poly_sepal_virginica_residual = Polynomial.fit(x = x_sepal_length_virginica, y = y_sepal_width_virginica, deg = 1 , full=True)
# get the polynomial fit for all the data
# get the equations for all the data and then each species
poly_sepal_equation = poly_sepal.convert()
poly_sepal_setosa_equation = poly_sepal_setosa.convert()
poly_sepal_versicolor_equation = poly_sepal_versicolor.convert()
poly_sepal_virginica_equation = poly_sepal_virginica.convert()
# print the polynomial equations for all the data and then each species
print("Polynomial equation: ", poly_sepal_equation)
print("Polynomial equation setosa: ", poly_sepal_setosa_equation)
print("Polynomial equation versicolor: ", poly_sepal_versicolor_equation)
print("Polynomial equation virginica: ", poly_sepal_virginica_equation)
# full=True also returns diagnostics: [residuals, rank, singular values, rcond]
# TODO Store in a summary dataframe
#print("Residuals: ", poly_sepal_residual[0])
#print("Rank: ", poly_sepal_residual[1])
#print("Singular values: ", poly_sepal_residual[2])
#print("rcond: ", poly_sepal_residual[3])
# get the linspace of x values for all ,and then each species
# this is to plot the trend line
poly_sepal_linspace = poly_sepal.linspace(n=len(x_sepal_length))
poly_sepal_setosa_linspace = poly_sepal_setosa.linspace(n=len(x_sepal_length_setosa))
poly_sepal_versicolor_linspace = poly_sepal_versicolor.linspace(n=len(x_sepal_length_versicolor))
poly_sepal_virginica_linspace = poly_sepal_virginica.linspace(n=len(x_sepal_length_virginica))

# set the style of the plot
plt.style.use('ggplot')
fig , ax = plt.subplots(nrows = 2, ncols = 2, figsize=(10, 10))
# plot the data
ax[0,0].scatter(x=x_sepal_length, y=y_sepal_width, c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
# label each trend line so that legend() has entries to show
ax[0,0].plot(poly_sepal_linspace[0], poly_sepal_linspace[1], color = 'black', label='all species')
ax[0,0].plot(poly_sepal_setosa_linspace[0], poly_sepal_setosa_linspace[1], color = 'red', label='setosa')
ax[0,0].plot(poly_sepal_versicolor_linspace[0], poly_sepal_versicolor_linspace[1], color = 'green', label='versicolor')
ax[0,0].plot(poly_sepal_virginica_linspace[0], poly_sepal_virginica_linspace[1], color = 'blue', label='virginica')
# github copilot prompt to get me started
equation_text = f"$y = {poly_sepal_equation.coef[0]:.2f} + {poly_sepal_equation.coef[1]:.2f}x$"
ax[0,0].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='black',
             horizontalalignment='left', verticalalignment='top')
# add axes title
ax[0,0].set_title('sepal length vs sepal width (all)')
ax[0,0].set_xlabel('sepal length (cm)')
ax[0,0].set_ylabel('sepal width (cm)')

ax[0,0].legend()
# Now do setosa
ax[0,1].scatter(x=x_sepal_length_setosa, y=y_sepal_width_setosa, c='red')
ax[0,1].plot(poly_sepal_setosa_linspace[0], poly_sepal_setosa_linspace[1], color = 'red')
# get the equation for setosa
equation_text = f"$y = {poly_sepal_setosa_equation.coef[0]:.2f} + {poly_sepal_setosa_equation.coef[1]:.2f}x$"
ax[0,1].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='red',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[0,1].set_title('sepal length vs sepal width (setosa)')
ax[0,1].set_xlabel('sepal length (cm)')
ax[0,1].set_ylabel('sepal width (cm)')
# Now do versicolor
ax[1,0].scatter(x=x_sepal_length_versicolor, y=y_sepal_width_versicolor, c='green')
ax[1,0].plot(poly_sepal_versicolor_linspace[0], poly_sepal_versicolor_linspace[1], color = 'green')
# get the equation for versicolor
equation_text = f"$y = {poly_sepal_versicolor_equation.coef[0]:.2f} + {poly_sepal_versicolor_equation.coef[1]:.2f}x$"
ax[1,0].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='green',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[1,0].set_title('sepal length vs sepal width (versicolor)')
ax[1,0].set_xlabel('sepal length (cm)')
ax[1,0].set_ylabel('sepal width (cm)')
# Now do virginica
ax[1,1].scatter(x=x_sepal_length_virginica, y=y_sepal_width_virginica, c='blue')
ax[1,1].plot(poly_sepal_virginica_linspace[0], poly_sepal_virginica_linspace[1], color = 'blue')
# get the equation for virginica
equation_text = f"$y = {poly_sepal_virginica_equation.coef[0]:.2f} + {poly_sepal_virginica_equation.coef[1]:.2f}x$"
ax[1,1].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='blue',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[1,1].set_title('sepal length vs sepal width (virginica)')
ax[1,1].set_xlabel('sepal length (cm)')
ax[1,1].set_ylabel('sepal width (cm)')
# set the overall title
fig.suptitle('Iris Data Set - Polynomial Fit of Sepal Length vs Sepal Width - Scatter and Trend Line')
plt.tight_layout()


In [None]:
# Use the Polynomial.fit function from numpy
#polynomial.polynomial.Polynomial.fit
# used github copilot prompt to get me started
from numpy.polynomial import Polynomial
# TODO - add a function to do this
# Split into x and y - it could be done in the plot function but this is more readable
x_petal_length = df_iris['petal length (cm)']
y_petal_width = df_iris['petal width (cm)']
# now x and y for each species
x_petal_length_setosa = df_iris[df_iris['target'] == 'setosa']['petal length (cm)']
y_petal_width_setosa = df_iris[df_iris['target'] == 'setosa']['petal width (cm)']
x_petal_length_versicolor = df_iris[df_iris['target'] == 'versicolor']['petal length (cm)']
y_petal_width_versicolor = df_iris[df_iris['target'] == 'versicolor']['petal width (cm)']
x_petal_length_virginica = df_iris[df_iris['target'] == 'virginica']['petal length (cm)']
y_petal_width_virginica = df_iris[df_iris['target'] == 'virginica']['petal width (cm)']
# Use Polynomial fit function from numpy for petal length vs petal width , for all the data and then each species
# get the polynomial fit for all the data and then each specie
poly_petal,poly_petal_residual = Polynomial.fit(x = x_petal_length, y = y_petal_width, deg = 1 , full=True)
poly_petal_setosa,poly_petal_setosa_residual = Polynomial.fit(x = x_petal_length_setosa, y = y_petal_width_setosa, deg = 1 , full=True)
poly_petal_versicolor,poly_petal_versicolor_residual = Polynomial.fit(x = x_petal_length_versicolor, y = y_petal_width_versicolor, deg = 1 , full=True)
poly_petal_virginica,poly_petal_virginica_residual = Polynomial.fit(x = x_petal_length_virginica, y = y_petal_width_virginica, deg = 1 , full=True)
# get the polynomial fit for all the data
# get the equations for all the data and then each species
poly_petal_equation = poly_petal.convert()
poly_petal_setosa_equation = poly_petal_setosa.convert()
poly_petal_versicolor_equation = poly_petal_versicolor.convert()
poly_petal_virginica_equation = poly_petal_virginica.convert()
# print the polynomial equations for all the data and then each species
print("Polynomial equation: ", poly_petal_equation)
print("Polynomial equation setosa: ", poly_petal_setosa_equation)
print("Polynomial equation versicolor: ", poly_petal_versicolor_equation)
print("Polynomial equation virginica: ", poly_petal_virginica_equation)
# full=True also returns diagnostics: [residuals, rank, singular values, rcond]
# TODO Store in a summary dataframe
#print("Residuals: ", poly_petal_residual[0])
#print("Rank: ", poly_petal_residual[1])
#print("Singular values: ", poly_petal_residual[2])
#print("rcond: ", poly_petal_residual[3])
# get the linspace of x values for all ,and then each species
# this is to plot the trend line
poly_petal_linspace = poly_petal.linspace(n=len(x_petal_length))
poly_petal_setosa_linspace = poly_petal_setosa.linspace(n=len(x_petal_length_setosa))
poly_petal_versicolor_linspace = poly_petal_versicolor.linspace(n=len(x_petal_length_versicolor))
poly_petal_virginica_linspace = poly_petal_virginica.linspace(n=len(x_petal_length_virginica))

# set the style of the plot
plt.style.use('ggplot')
fig , ax = plt.subplots(nrows = 2, ncols = 2, figsize=(10, 10))
# plot the data
ax[0,0].scatter(x=x_petal_length, y=y_petal_width, c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
# label each trend line so that legend() has entries to show
ax[0,0].plot(poly_petal_linspace[0], poly_petal_linspace[1], color = 'black', label='all species')
ax[0,0].plot(poly_petal_setosa_linspace[0], poly_petal_setosa_linspace[1], color = 'red', label='setosa')
ax[0,0].plot(poly_petal_versicolor_linspace[0], poly_petal_versicolor_linspace[1], color = 'green', label='versicolor')
ax[0,0].plot(poly_petal_virginica_linspace[0], poly_petal_virginica_linspace[1], color = 'blue', label='virginica')
# github copilot prompt to get me started
equation_text = f"$y = {poly_petal_equation.coef[0]:.2f} + {poly_petal_equation.coef[1]:.2f}x$"
ax[0,0].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='black',
             horizontalalignment='left', verticalalignment='top')
# add axes title
ax[0,0].set_title('petal length vs petal width (all)')
ax[0,0].set_xlabel('petal length (cm)')
ax[0,0].set_ylabel('petal width (cm)')

ax[0,0].legend()
# Now do setosa
ax[0,1].scatter(x=x_petal_length_setosa, y=y_petal_width_setosa, c='red')
ax[0,1].plot(poly_petal_setosa_linspace[0], poly_petal_setosa_linspace[1], color = 'red')
# get the equation for setosa
equation_text = f"$y = {poly_petal_setosa_equation.coef[0]:.2f} + {poly_petal_setosa_equation.coef[1]:.2f}x$"
ax[0,1].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='red',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[0,1].set_title('petal length vs petal width (setosa)')
ax[0,1].set_xlabel('petal length (cm)')
ax[0,1].set_ylabel('petal width (cm)')
# Now do versicolor
ax[1,0].scatter(x=x_petal_length_versicolor, y=y_petal_width_versicolor, c='green')
ax[1,0].plot(poly_petal_versicolor_linspace[0], poly_petal_versicolor_linspace[1], color = 'green')
# get the equation for versicolor
equation_text = f"$y = {poly_petal_versicolor_equation.coef[0]:.2f} + {poly_petal_versicolor_equation.coef[1]:.2f}x$"
ax[1,0].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='green',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[1,0].set_title('petal length vs petal width (versicolor)')
ax[1,0].set_xlabel('petal length (cm)')
ax[1,0].set_ylabel('petal width (cm)')
# Now do virginica
ax[1,1].scatter(x=x_petal_length_virginica, y=y_petal_width_virginica, c='blue')
ax[1,1].plot(poly_petal_virginica_linspace[0], poly_petal_virginica_linspace[1], color = 'blue')
# get the equation for virginica
equation_text = f"$y = {poly_petal_virginica_equation.coef[0]:.2f} + {poly_petal_virginica_equation.coef[1]:.2f}x$"
ax[1,1].annotate(equation_text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=12, color='blue',
                horizontalalignment='left', verticalalignment='top')
# add axes title
ax[1,1].set_title('petal length vs petal width (virginica)')
ax[1,1].set_xlabel('petal length (cm)')
ax[1,1].set_ylabel('petal width (cm)')
# set the overall title
fig.suptitle('Iris Data  Set - Polynomial Fit of petal Length vs petal Width - Scatter and Trend Line')
plt.tight_layout()

### Polyfit of petals length and width

In [None]:
#numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
#for petal length vs petal width
# Do a polynomial fit between petal length and petal width
# Degree 1 means linear
trend = np.polyfit(x = bunch_iris.data['petal length (cm)'], y = bunch_iris.data['petal width (cm)'], deg = 1)
print(trend)
# create a polynomial function - this is the trend line
# this is a linear trend line
trendpoly = np.poly1d(trend) # create a polynomial function
print(trendpoly)
plt.scatter(bunch_iris.data['petal length (cm)'], bunch_iris.data['petal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('petal length vs petal width')
# put legend for the targets values and colors
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
plt.plot(bunch_iris.data['petal length (cm)'], trendpoly(bunch_iris.data['petal length (cm)']), color = 'black')
plt.show()


# Task 7: Analyze Class Distributions

Create box-plots of the petal lengths for each of the three classes.


## Task 7 : Answer

### Description

The box plot shows the distribution of petal lengths for the three iris species. Each box spans the interquartile range (IQR), from the 25th to the 75th percentile, and the line inside the box marks the median. The whiskers extend to the most extreme data points that lie within 1.5 × IQR of the box edges; points beyond the whiskers are drawn individually as outliers. The mean is not drawn by default. A box plot is a compact way to compare the centre and spread of several groups.

### Conclusion

The setosa species has the most distinct petal length, and it is easy to identify setosa by looking at the petal length alone. The other two species, Versicolour and Virginica, overlap in petal length, so it is not as easy to determine the species from petal length alone.
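The overlap described in this conclusion can also be checked numerically by comparing the range of each species. The snippet below is a minimal sketch that uses small hypothetical samples rather than the real iris measurements; the helper name `ranges_overlap` is made up for illustration.

```python
import numpy as np

# Hypothetical petal lengths (cm) for illustration only, not the real iris data
samples = {
    "setosa": np.array([1.3, 1.4, 1.5, 1.7, 1.9]),
    "versicolor": np.array([3.3, 4.0, 4.5, 4.7, 5.0]),
    "virginica": np.array([4.8, 5.1, 5.6, 6.0, 6.7]),
}

def ranges_overlap(a, b):
    """Return True if the [min, max] intervals of a and b intersect."""
    return bool(a.min() <= b.max() and b.min() <= a.max())

overlap_setosa_versicolor = ranges_overlap(samples["setosa"], samples["versicolor"])
overlap_versicolor_virginica = ranges_overlap(samples["versicolor"], samples["virginica"])

print("setosa / versicolor overlap:", overlap_setosa_versicolor)
print("versicolor / virginica overlap:", overlap_versicolor_virginica)
```

With the notebook's own data the same check could be applied to the petal length column of each species in `df_petal`.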


#### Notes

Box plots are basically a plot of the median and quartiles: the box is the interquartile range between the 0.25 and 0.75 quantiles, and the whiskers reach out to the furthest data points within 1.5 × IQR of the box.
We are going to plot the petal lengths of each of the classes, that is, the iris species Setosa, Versicolour and Virginica.
We will be using the seaborn library to create the box plots.
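To make the box plot quantities concrete, here is a small sketch that computes them by hand with numpy. The sample values are hypothetical, and the 1.5 × IQR rule is the usual default rather than something specific to this notebook.

```python
import numpy as np

# Hypothetical sample for illustration, with one unusually large value
values = np.array([1.0, 1.3, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 3.0])

q25, median, q75 = np.quantile(values, [0.25, 0.5, 0.75])
iqr = q75 - q25

# Fences: points outside these limits are drawn as outliers
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q25={q25:.2f} median={median:.2f} Q75={q75:.2f} IQR={iqr:.2f}")
print(f"whiskers: {whisker_low:.2f} to {whisker_high:.2f}, outliers: {outliers}")
```

Note that the whiskers end at actual data points (1.0 and 1.9 here), not at the fence values themselves.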

### References

- Course Notes - Ian McLoughlin : https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/materials/boxplots.ipynb
- Seaborn Box Plots : https://seaborn.pydata.org/generated/seaborn.boxplot.html
- https://www.geeksforgeeks.org/how-to-set-the-hue-order-in-seaborn-plots/#5-example-withseabornboxplot
- Datacamp Seaborn Course 
- Datacamp Numpy Course - melts

Create a dataframe of petal lengths for each of the three classes, Setosa, Versicolour and Virginica.

In [None]:
# A data frame with the petal length and class
df_petal = df_iris[[ 'target','petal length (cm)']]
df_petal.sample(5)


#### Summary Data
For each of the classes (Setosa, Versicolour and Virginica), calculate the mean, max, min, standard deviation, median and quantiles of the petal lengths and display them in a table.
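As a cross-check on a hand-built summary table, pandas can produce most of these statistics in one call with `describe()`. The sketch below uses a small hypothetical dataframe (`df_demo` and its column names are made up for illustration) rather than the iris data.

```python
import pandas as pd

# Hypothetical two-class data for illustration
df_demo = pd.DataFrame({
    "target": ["a"] * 5 + ["b"] * 5,
    "length": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per group
summary = df_demo.groupby("target")["length"].describe()
# interquartile range from the quartile columns
summary["IQR"] = summary["75%"] - summary["25%"]
# transpose so that each class becomes a column
summary = summary.transpose()
print(summary)
```

The custom aggregation in the next cell gives more control over the names and order of the rows, which is a reasonable trade-off for a report.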

In [None]:
def quantile_25(x):
    """ input: x is a pandas series or numpy array and return the 25th quantile
    Note: could be done by a lambda function but this is more readable
    """
    return np.quantile(x, 0.25)

def quantile_75(x):
    """ input: x is a pandas series or numpy array and return the 75th quantile
    Note: could be done by a lambda function but this is more readable
    """
    return np.quantile(x, 0.75)

# Group by target and get the mean, min, max, std, median, 25th and 75th quantile
df_iris_species_summary_stats = df_petal.groupby(['target']).agg({
	'petal length (cm)': ['mean', 'min', 'max', 'std', 'median', quantile_25, quantile_75]})


# Rename the quantile columns
#  relates to box plot < -- |Q25 | Median | Q75 | -- >
df_iris_species_summary_stats.columns = ['mean', 'min', 'max', 'std', 'median', 'Q25', 'Q75']

 # Add IQR column - Interquartile range is the difference between the 75th and 25th quantile
df_iris_species_summary_stats['IQR'] = df_iris_species_summary_stats['Q75'] - df_iris_species_summary_stats['Q25']

# Add outlier columns
df_iris_species_summary_stats['outlier_min'] = df_iris_species_summary_stats['Q25'] - 1.5 * df_iris_species_summary_stats['IQR']
df_iris_species_summary_stats['outlier_max'] = df_iris_species_summary_stats['Q75'] + 1.5 * df_iris_species_summary_stats['IQR']

#print(df_iris_species_summary_stats.info())

# rotate the df_iris_species_summary_stats
# I want the species to be the columns
# This is easier to read
# Transpose the data frame
df_iris_species_summary_stats = df_iris_species_summary_stats.transpose()

# change the order of the columns to be setosa, versicolor, virginica
df_iris_species_summary_stats = df_iris_species_summary_stats[['setosa', 'versicolor', 'virginica'] ]
# sort rows in a specific order
df_iris_species_summary_stats = df_iris_species_summary_stats.reindex(['min', 'outlier_min', 'Q25', 'median', 'mean', 'Q75', 'outlier_max', 'max', 'std', 'IQR'])
df_iris_species_summary_stats


**Plot the box plots of the petal length of each of the three species**

The box plot has the following features

- the outliers are shown as individual markers
- the median is shown as a line in the box
- the mean is not drawn by default; passing `showmeans=True` adds a marker for it
- the box is the interquartile range ( IQR ) between the 0.25 and 0.75 quantiles
- the whiskers extend to the furthest data points within 1.5 × IQR of the box
- the box plot is a good way to show the distribution of the data and the spread of the data



In [None]:
# Plot the seaborn box plot
ax = sns.boxplot(x='petal length (cm)', y='target', data=df_petal, palette="pastel", hue='target',
                 hue_order=['setosa','versicolor','virginica'])
ax.title.set_text('Petal Length vs Species')
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Species')
ax.legend(title='Species', loc='upper right', labels=['setosa','versicolor','virginica'])

In [None]:
# Plot the seaborn box plot of the sepal length for each species
df_sepal = df_iris[[ 'target','sepal length (cm)']]
ax = sns.boxplot(x='sepal length (cm)', y='target', data=df_sepal, palette="pastel", hue='target',
                 hue_order=['setosa','versicolor','virginica'])
ax.title.set_text('Sepal Length vs Species')
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Species')
ax.legend(title='Species', loc='upper right', labels=['setosa','versicolor','virginica']);

**Box Plots of the petal width and length , and sepal length and width of all three species of iris flowers**

The measurement axis uses the same scale in all the plots, so the features can be compared visually against each other.
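The faceted plot in the next cell relies on a long-format dataframe, `df_iris_melt`, with `species`, `feature` and `value` columns. How that dataframe was actually built is not shown in this part of the notebook, so the following is only a sketch of one common way to get that shape, using `melt` on a small hypothetical wide table.

```python
import pandas as pd

# Hypothetical wide table for illustration: one row per flower
df_wide = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "petal_length": [1.4, 5.5],
    "petal_width": [0.2, 2.0],
})

# melt to long format: one row per (flower, feature) pair
df_long = df_wide.melt(id_vars="species", var_name="feature", value_name="value")
print(df_long)
```

Each measurement column becomes a value in the `feature` column, which is what lets `catplot` draw one panel per feature with `col="feature"`.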

In [None]:
sns.set_style("darkgrid")
g = sns.catplot(data=df_iris_melt,kind="box",x="species",y="value",hue='species',col="feature",col_wrap=2,sharex=True
                 ,height=5, aspect=1.5, palette="pastel", hue_order=['setosa','versicolor','virginica'])
# how to set the title for each subplot - github copilot assisted with adjustment
# Adjust the top space for the title and increase spacing between subplots
plt.subplots_adjust(top=0.9, wspace=0.3, hspace=0.4)
# set grid on each subplot
for ax in g.axes.flat:
    ax.grid(True)
    # get the title of each subplot
    # This could be done in one line, but for clarity multiple lines are used
    title = ax.get_title()  # this gets the title of each subplot
    # remove the prefix "feature = " from the title
    # and replace "_" with " "
    title = title.replace("feature = ", "").replace("_"," ")
    # now capitalize the first letter of each word
    title = title.title()
    # set the title of each subplot 
    ax.set_title(title)
    # set the x and y labels
    ax.set_ylabel("Measurement (cm)")
    ax.set_xlabel("Species")
    ax.legend(title='Species', loc='upper left', labels=['setosa','versicolor','virginica'])
# set the overall title
plt.suptitle("Boxplot of features by species");
# add a legend


# Task 8: Compute Correlations

Calculate the correlation coefficients between the features.  The dataframe method `corr()` returns a correlation matrix.


## Task 8 : Answer
### Summary
There are a number of ways of calculating correlation coefficients. The default, the Pearson method, is used here; the alternatives are Kendall and Spearman. The correlation coefficient is a measure of how closely two features are linearly related. Values of 1 and -1 mean that the two features are perfectly related (be suspicious of 1 and -1: if a feature is perfectly related to another feature, then one of them is redundant), while a value of 0 means there is no linear relationship between the two features.
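The difference between the Pearson and Spearman methods shows up when a relationship is monotonic but not a straight line. The sketch below uses made-up data and computes Spearman as the Pearson correlation of the ranks, which is valid here only because the values contain no ties; the helper `ranks` is an illustrative name.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but not linear

# Pearson measures how close the points are to a straight line
pearson_r = np.corrcoef(x, y)[0, 1]

def ranks(a):
    """Rank the values from 0 to n-1 (assumes there are no ties)."""
    return np.argsort(np.argsort(a))

# Spearman is the Pearson correlation of the ranks
spearman_r = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Spearman reaches 1 because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a line.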

The heatmap shows the "strength" of the correlation in a visual matrix of the features. All the features are on both the x and y axes. ( I use a heatmap to find compute hotspots in cpu usage during database daily batch processing )

Display the results as a heatmap using `matplotlib`.  

**Correlation Coefficients equation**

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
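The equation above translates almost term for term into numpy. This is a small sketch on made-up numbers, with `np.corrcoef` used as a cross-check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# numerator and denominator of the Pearson formula
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# cross-check with numpy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```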


### References

- [Pandas Correlation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)  
- [Wiki Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)  
- [Seaborn Heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)  
- [Correlation Coefficient](https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/)


**Correlation Coefficients**

Calculate the correlation coefficients between the features for all species together, and for each of the species separately.

In [None]:
df_corr = df_iris[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].corr(method='pearson')
# get correlation matrix for each species
df_setosa_corr = df_iris[df_iris['target'] == 'setosa'][['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].corr(method='pearson')
df_versicolor_corr = df_iris[df_iris['target'] == 'versicolor'][['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].corr(method='pearson')
df_virginica_corr = df_iris[df_iris['target'] == 'virginica'][['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].corr(method='pearson')
# inspect the setosa correlation matrix
# info() prints its report itself and returns None, so it is not wrapped in print()
df_setosa_corr.info()
df_setosa_corr

**Plot a heatmap of the correlation coefficients**

The heatmap shows the correlation between each pair of features. The main diagonal is 1.0, which is each feature compared with itself, and the off-diagonal cells show the correlation between two different features. The closer a value is to 1.0 or -1.0, the stronger the correlation; the closer to zero, the weaker the correlation.
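Two properties of a correlation matrix are worth checking before plotting: it is symmetric and its diagonal is all ones, so half of the heatmap is redundant. The following sketch uses random data (not the iris measurements) to confirm this and to build a boolean mask for the upper triangle.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))  # hypothetical data: 50 rows, 4 features

# rowvar=False treats each column as a feature, giving a 4 x 4 matrix
corr = np.corrcoef(data, rowvar=False)

is_symmetric = np.allclose(corr, corr.T)
diagonal_is_one = np.allclose(np.diag(corr), 1.0)

# Boolean mask that is True strictly above the main diagonal
upper_mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

print(is_symmetric, diagonal_is_one)
print(upper_mask)
```

A mask like this can be passed to seaborn's `heatmap` through its `mask` parameter, which leaves the cells where the mask is True blank.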


In [None]:
# First attempt at plotting the correlation matrix using a heatmap
fig, ax = plt.subplots(figsize=(8, 8))
# imshow plots a grid of data represented as colors ( heatmap)
# I use a heatmap to determine if there is cpu exhaustion using a teradata vantage database engine
# usually indicates that there are non-performant queries , or too much running at the same time
# keep the returned image so that the color bar can be built from it
im = ax.imshow(df_corr, cmap='coolwarm', interpolation='nearest')
# display the value in the heatmap
# ndenumerate enumerates the array and returns the index and value
for (i, j), val in np.ndenumerate(df_corr):
    # display value at point i,j in the heatmap, rounded to 2 decimal places
    ax.text(j, i, round(val, 2), ha='center', va='center', color='yellow', fontsize=12)
# set the x and y ticks to be the column names of the correlation matrix
ax.set_xticks(np.arange(len(df_corr.columns)))
ax.set_yticks(np.arange(len(df_corr.index)))
ax.set_xticklabels(df_corr.columns, rotation=45, ha='right')
ax.set_yticklabels(df_corr.index)
# display the color bar on the right
fig.colorbar(im, ax=ax, shrink=0.8, label='Correlation Coefficient')
# set the title of the heatmap
ax.set_title('Correlation Matrix Heatmap for All Iris Species', fontsize=16, fontweight='bold')

# switch off pesky grid lines
ax.grid(False);


This shows a heatmap of the correlation coefficients between the features, for all species and then separately for each species.


In [None]:
# list of correlation dataframes
coef_list = [df_corr,df_setosa_corr, df_versicolor_corr, df_virginica_corr]
# list of titles
titles = ['All Species', 'Setosa', 'Versicolor', 'Virginica']
cmap_list = ['Spectral', 'Spectral', 'Spectral', 'Spectral']
# Seaborn heatmap
fig , axes = plt.subplots(2,2,figsize=(8, 8))
# Add figure title
fig.suptitle('Correlation Heatmap For Iris Dataset , all species and each species', fontsize=16, fontweight='bold')

# loop through each correlation matrix and plot
for i, ax in enumerate(axes.flatten()):
    ax.set_title(titles[i])
    # get the correlation matrix
    df = coef_list[i]
    # only show tick labels on the outside of the grid:
    # x labels on the bottom row, y labels on the left column
    show_x_labels = df.columns if i >= 2 else False
    show_y_labels = (i % 2 == 0)
    sns.heatmap(df, annot=True, cmap=cmap_list[i], center=0, ax=ax, xticklabels=show_x_labels, yticklabels=show_y_labels, cbar=True,
                 fmt='.2f', linewidths=0.5, linecolor='black', square=True, annot_kws={"size": 10})

  
 



# Task 9: Fit a Simple Linear Regression

For your two features in Task 5, calculate the coefficient of determination $R^2$.  
Re-create the plot from Task 6 and annotate it with the $R^2$ value.

## Task 9 : Answer
### Summary

**Coefficient of Determination**

[Wikipedia - Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)  
In statistics, the coefficient of determination, denoted $R^2$ or $r^2$ and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

[Copilot] It is a measure of how well the regression line fits the data. For a least-squares line with an intercept, the value of $R^2$ is between 0 and 1. A value of 0 means that the regression line explains none of the variation in the data, while a value of 1 means that the regression line fits the data perfectly.


### References

- [Wikipedia Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)

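The calculation in outline, where `x` is the petal length, `y` is the petal width and `y_pred` is the value of the regression line at each `x`:

- `r2 = 1 - (np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2))`

Determine the equation of the regression line using polyfit:

- `p = np.polyfit(x, y, 1)`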
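The outline above can be turned into a short, runnable sketch. The data below is made up for illustration, and `r_squared` is a hypothetical helper name rather than a numpy function. For a straight-line fit the result should equal the square of the Pearson correlation coefficient, which is used as a cross-check.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination from observed and predicted values."""
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.polyfit returns the coefficients with the highest power first
slope, intercept = np.polyfit(x, y, 1)

# predictions are evaluated at the observed x values
y_pred = slope * x + intercept

r2 = r_squared(y, y_pred)
r = np.corrcoef(x, y)[0, 1]
print(f"y = {intercept:.2f} + {slope:.2f}x, R^2 = {r2:.2f}, r^2 = {r ** 2:.2f}")
```

The key detail is that `y_pred` has one prediction per observation, taken at that observation's own `x`.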

In [None]:
# Use the polynomial fit function from numpy
#polynomial.polynomial.Polynomial.fit
# used github copilot prompt to get me started
from numpy.polynomial import Polynomial
# github copilot prompt to get me started
# how to display an equation in latex format
# https://stackoverflow.com/questions/12345678/how-to-display-latex-in-jupyter-notebook
from sympy import symbols, latex 
from IPython.display import display, Math
# TODO - add a function to do this
# Split into x and y - it could be done in the plot function but this is more readable
x_petal_length = df_iris['petal length (cm)']
y_petal_width = df_iris['petal width (cm)']
poly_petal,poly_petal_residual = Polynomial.fit(x = x_petal_length, y = y_petal_width, deg = 1 , full=True)

poly_petal_equation = poly_petal.convert()
print(" Petal Length vs Petal Width Polyfit metadata")
print("------------------------------------------------------------")
# print the polynomial equations for all the data and then each species
#print("Polynomial equation: ", poly_petal_equation)
latex_poly = latex(f"y={str(poly_petal_equation)}")
display(Math(latex_poly))
# get the diagnostics returned by full=True: [resid, rank, singular values, rcond]
# resid is the sum of squared residuals of the least-squares fit
# TODO Store in a summary dataframe
print("Sum of squared residuals: ", poly_petal_residual[0])
print("Rank: ", poly_petal_residual[1])
print("Singular values: ", poly_petal_residual[2])
print("rcond: ", poly_petal_residual[3])
# get the linspace of x values for all ,and then each species
# this is to plot the trend line
# Polynomial.linspace returns n evenly spaced x values across the fitted domain
# (minimum to maximum petal length) and the fitted y values at those points
poly_petal_linspace = poly_petal.linspace(n=len(x_petal_length))

Plot a scatter plot with a regression line

In [None]:
fig , ax = plt.subplots(nrows = 1, ncols = 1, figsize=(3, 3))
# plot the data , change the color based on species
ax.scatter(x=x_petal_length, y=y_petal_width, c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
ax.plot(poly_petal_linspace[0], poly_petal_linspace[1], color = 'black')
ax.set_title('Petal Length vs Petal Width (all)')
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Petal Width (cm)')
# set the overall title
plt.suptitle('Iris Data  Set - Polynomial Fit of Petal Length vs Petal Width - Scatter and Trend Line')
plt.tight_layout()

Get the mean of x and y

The equation for the mean of **x** is:

$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} $



The equation for the mean of **y** is:


$ \bar{y} = \frac{\sum_{i=1}^n y_i}{n} $

Note :  

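The linspace means are close to, but not the same as, the means of the petal x and y values. This is expected on both axes: `linspace` returns evenly spaced x values between the minimum and maximum petal length, not the measured petal lengths, so its x mean is the midpoint of the range rather than the mean of the data, and its y mean is the value of the fitted line at that midpoint.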
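The difference between the linspace points and the measured data can be seen on a deliberately lopsided example. This is a sketch on made-up numbers, where most x values sit at the low end of the range.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical, unevenly spread x values; y lies exactly on a line
x = np.array([1.0, 1.0, 1.0, 2.0, 10.0])
y = 2.0 * x + 1.0

fit = Polynomial.fit(x, y, deg=1)

# linspace gives evenly spaced points across the domain [x.min(), x.max()]
xs, ys = fit.linspace(n=len(x))

# fit(x) evaluates the fitted line at the measured x values instead
y_at_data = fit(x)

print("data x mean:    ", x.mean())
print("linspace x mean:", xs.mean())
print("data y mean:    ", y.mean())
print("linspace y mean:", ys.mean())
```

The linspace output is well suited to drawing the trend line, while calculations that compare predictions with observations need the line evaluated at the measured x values, as `fit(x)` does.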

In [None]:
x_petal_mean = x_petal_length.mean()
y_petal_mean = y_petal_width.mean()
# also get the mean from poly_petal_linspace
x_poly_petal_linspace_mean = poly_petal_linspace[0].mean()
y_poly_petal_linspace_mean = poly_petal_linspace[1].mean()

print(f"Mean Petal Length X: {x_petal_mean:.2f} cm")
print(f"Mean Petal Width Y: {y_petal_mean:.2f} cm")
print(f"Mean Petal Length from poly_petal_linspace X: {x_poly_petal_linspace_mean:.2f} cm")
print(f"Mean Petal Width from poly_petal_linspace Y: {y_poly_petal_linspace_mean:.2f} cm")

Now work out the residual sum of squares ( SSres ) 

$ SSres = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $

where $\hat{y}_i$ is the value predicted by the regression line for $x_i$. This can be viewed as the sum of the squared errors between the actual y and predicted y values. The smaller the SSres, the better the fit.

In [None]:
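# get the residual then square it and sum it
# the prediction is evaluated at the measured petal lengths, not at the evenly spaced linspace points
SSres = ((y_petal_width - poly_petal(x_petal_length.to_numpy())) ** 2).sum()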
# the square means the cm is cm^2 
print(f"Residual Sum of Squares: {SSres:.2f} cm^2")

Now work out the total sum of squares ( SStot ) 

$ SStot = \sum_{i=1}^n (y_i - \bar{y})^2 $

This can be viewed as the sum of the squared differences between the actual y values and their mean $\bar{y}$, that is, the total variation in y before any model is fitted.
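For a least-squares line with an intercept, the total sum of squares splits into the part left over as residuals (SSres) and the part explained by the line. The sketch below checks this on made-up data; the name `ss_reg` for the explained part is an assumption of this example, not something used elsewhere in the notebook.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

fit = Polynomial.fit(x, y, deg=1)
y_pred = fit(x)  # predictions at the observed x values

ss_res = ((y - y_pred) ** 2).sum()         # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()       # total variation
ss_reg = ((y_pred - y.mean()) ** 2).sum()  # variation explained by the line

print(f"SSres={ss_res:.2f} SSreg={ss_reg:.2f} SStot={ss_tot:.2f}")
```

Because the three quantities add up, dividing SSres by SStot gives the unexplained share of the variation.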

In [None]:
# total sum of squares
SStot = ((y_petal_width - y_petal_mean) ** 2).sum()
# the square means the cm is cm^2
print(f"Total Sum of Squares: {SStot:.2f} cm^2")


Now tie this all together as the coefficient of determination ( $R^2$ )

$ R^2 = 1 - \frac{SSres}{SStot} $

*Wikipedia - Coefficient of Determination*

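In the best case, the modeled values exactly match the observed values, which results in $SSres=0$
and $R^2 = 1$. A baseline model, which always predicts $\bar{y}$, will have $R^2 = 0$.

$R^2$ is the proportion of the variation in petal width that is explained by petal length, so a value of $R^2 = 0.8$ means that 80% of the variation is explained by the regression line; it is not an 80% correlation. For a straight-line fit, $R^2$ is the square of the Pearson correlation coefficient from Task 8, which gives a useful cross-check. The strong relationship is borne out by the visual plots of the two features.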

In [None]:
R2 = 1 - ( SSres / SStot)
print(f"R^2: {R2:.2f}")

Re-create the plot from Task 6 and annotate it with the $R^2$ value.

In [None]:
fig , ax = plt.subplots(nrows = 1, ncols = 1, figsize=(8, 8))
# plot the data , change the color based on species
ax.scatter(x=x_petal_length, y=y_petal_width, c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
ax.plot(poly_petal_linspace[0], poly_petal_linspace[1], color = 'black');
ax.set_xlabel('Petal Length (cm)');
ax.set_ylabel('Petal Width (cm)');
# set the overall title
plt.suptitle('Iris Data  Set - Polynomial Fit of Petal Length vs Petal Width - Scatter and Trend Line');
plt.tight_layout();
# Annotate the polynomial equation - github copilot prompt to get me started

# find mid point of the x and y axis limits
x_mid = (ax.get_xlim()[0] + ax.get_xlim()[1]) / 2
y_mid = (ax.get_ylim()[0] + ax.get_ylim()[1]) / 2
print(f"x_mid: {x_mid:.2f} cm")
print(f"y_mid: {y_mid:.2f} cm")
# add the equation to the plot
# a raw f-string is used so that the backslash in \quad is not treated as an escape sequence
anot_str = rf"$y = {poly_petal_equation.coef[0]:.2f} + {poly_petal_equation.coef[1]:.2f}x \quad R^2 = {R2:.2f}$"
ax.annotate(anot_str, xy=(x_mid, y_mid),
            xycoords='data', fontsize=13, color='violet',
            horizontalalignment='right', verticalalignment='top',
            bbox=dict(boxstyle="round,pad=0.3", edgecolor='black', facecolor='white'))
# add a legend for the species - github copilot prompt to get me started
# Add a legend for the species and regression line
legend_elements = [
    plt.Line2D([0], [0], marker='o', color='w', label='Setosa', markerfacecolor='r', markersize=10),
    plt.Line2D([0], [0], marker='o', color='w', label='Versicolor', markerfacecolor='g', markersize=10),
    plt.Line2D([0], [0], marker='o', color='w', label='Virginica', markerfacecolor='b', markersize=10),
    plt.Line2D([0], [0], color='black', lw=2, label='Regression Line') # Add regression line to legend
]
ax.legend(handles=legend_elements, title='Species', loc='upper left');



# Task 10: Too Many Features

Use `seaborn` to create a `pairplot` of the data set.  
Explain, in your own words, what the `pairplot` depicts. 

## Task 10 : Answer
### Summary

The pairplot shows a grid with every feature on both the x and y axes. Each off-diagonal cell is a scatter plot comparing one pair of features. There are a number of variations of a pairplot: scatter, reg (regression), kde (kernel density estimation - the cool one) and hist (histogram). The plots on opposite sides of the diagonal show the same pair of features with the axes swapped, a sort of mirror image. Taking advantage of this layout, one side can show a variation of the same data, for example with or without KDE contours. The diagonal plots show the histogram or KDE of each individual feature.

Visually inspecting the pairplots shows that petal length and width give a distinct separation between the setosa species and the other two species. The other two species (virginica and versicolor) show some separation, but there is some crossover between them. Sepal width and length show less distinction between species than the petals: setosa is still distinctive, but less so than with the petals, and the other two are intermixed.

If this is a classification problem then the petal features are the best option for predicting the species, in which case the sepals may not be needed. An alternative view is that, since petal width and petal length are correlated, both may not be needed. There may be insights we can get by removing one of the petal features and doing additional analysis. Thus which features are redundant depends on the question being asked.

Since both sepals and petals distinguish setosa well, petals look like the better predictor of the setosa species, and the sepals of the other two species are very intermixed, we may be able to exclude the sepal features if we want to reduce the number of features.
This will need to be investigated further. One option is to use clustering to predict the species with sepals and with petals, and compare the results.

  


### References

https://seaborn.pydata.org/generated/seaborn.pairplot.html

Plot pairwise relationships in a dataset.

By default, this function will create a grid of Axes such that each numeric variable in data will by shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.

It is also possible to show a subset of variables or plot different variables on the rows and columns.

This is a high-level interface for PairGrid that is intended to make it easy to draw a few common styles. You should use PairGrid directly if you need more flexibility.

https://en.wikipedia.org/wiki/Univariate_distribution

In statistics, a univariate distribution is a probability distribution of only one random variable. This is in contrast to a multivariate distribution, the probability distribution of a random vector (consisting of multiple random variables).

Gemini - a univariate distribution to a high school student   


Kernel Density Estimation:
In statistics and data analysis, KDE stands for "kernel density estimation." This is a method used to estimate the probability density function of a random variable. Essentially, it's a way to smooth out a histogram to get a continuous curve that represents the distribution of the data.
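The idea behind KDE can be sketched in a few lines of numpy: place a small Gaussian "bump" on every data point and average them. The sample and the bandwidth below are assumptions chosen by hand for illustration; plotting libraries normally pick the bandwidth automatically.

```python
import numpy as np

data = np.array([1.0, 1.5, 2.0, 4.0, 4.5])  # hypothetical sample
bandwidth = 0.5                              # assumed smoothing width

# evaluation grid, wide enough to contain the tails of every bump
grid = np.linspace(-3.0, 9.0, 1201)
step = grid[1] - grid[0]

# one Gaussian bump per data point, evaluated at every grid position
z = (grid[:, None] - data[None, :]) / bandwidth
bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# the density estimate is the average of the bumps
density = bumps.mean(axis=1)

# a density should integrate to (approximately) 1
area = density.sum() * step
peak_x = grid[np.argmax(density)]
print(f"area under curve: {area:.4f}, highest density near x = {peak_x:.2f}")
```

A larger bandwidth gives a smoother curve that can hide separate clusters, while a smaller one follows the individual points more closely.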


In [None]:
# Use seaborn pairplot to visualize the data
# use df_iris and hue='target' to color the data points by target
g = sns.pairplot(df_iris,hue='target', kind='scatter')
g.figure.suptitle("Pairplot of Iris Data Set", y=1.02);
g.map_lower(sns.kdeplot, levels=4, color=".2")
# Add regression line to the lower triangle of the pairplot
# Use the map_lower method to add regression line to the lower triangle
# This is done by using the sns.regplot function
g.map_lower(sns.regplot, scatter=False, color=".2")




## END