# Principles of Data Analytics - Tasks



## Imports

The following imports are approved: numpy, scipy, matplotlib, seaborn, pandas, statsmodels, scikit-learn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# import sklearn
import sklearn

# Task 1: Source the Data Set

Import the Iris data set from the `sklearn.datasets` module.  
Explain, in your own words, what the `load_iris()` function returns.






### Task 1 : Answer

#### Summary

The scikit-learn `load_iris()` function returns a small built-in data set of iris flowers. The data set consists of 150 records evenly distributed among three iris species: Setosa, Versicolour and Virginica. Each record has 4 features, sepal length and width and petal length and width in cm, plus the iris species.

`load_iris()` returns several related "data sets": `data` holds the features, `target` holds a surrogate key for the iris species of each record, `target_names` lists the species names in surrogate-key order, and `feature_names` lists the feature names.



***Image for Iris Flowers***

For those who are not gardeners, the images are linked below.

**Iris diagram with sepals and petals**

https://www.life.illinois.edu/help/digitalflowers/Iridaceae/9.htm

**Iris-Setosa**

https://en.wikipedia.org/wiki/Iris_setosa#/media/File:Irissetosa1.jpg

**Iris-Versicolour**

https://upload.wikimedia.org/wikipedia/commons/2/27/Blue_Flag%2C_Ottawa.jpg

**Iris-Virginica**

https://en.wikipedia.org/wiki/Iris_virginica#/media/File:Iris_virginica_2.jpg

### References
https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.load_iris.html

https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset





In [None]:
# Task 1: Import the data
# Load the iris dataset from the sklearn.datasets module using the load_iris() function.
# The iris dataset is a dictionary-like object that holds all the data and some metadata about the data.
import sklearn.datasets as datasets


Run `help(datasets.load_iris)` to get more information on the function.


In [None]:
help(datasets.load_iris);

Load the iris data set

In [None]:
bunch_iris = datasets.load_iris()
# Print the shape of the feature data and the keys of the bunch
print("Shape of the data: ", bunch_iris.data.shape)
print("Keys of the bunch: ", bunch_iris.keys())



Feature Names

In [None]:
# Convert feature_names to DataFrame
# Easier to read
df_feature_name = pd.DataFrame(bunch_iris['feature_names'])
df_feature_name

Target Names

In [None]:
# convert target names to dataframe for easier reading
df_target_name = pd.DataFrame(bunch_iris['target_names'])
df_target_name

In [None]:
bunch_iris = datasets.load_iris(as_frame=True)
print("Bunch iris: ", bunch_iris)
print("Keys: ", bunch_iris.keys())
print("Feature names: ", bunch_iris['feature_names'])
print("Target names: ", bunch_iris['target_names'])
print("Data: ", bunch_iris['data'].head())
print("Target: ", bunch_iris['target'].head())



In [None]:
# Only the last expression in a cell is displayed, so print the rest explicitly
bunch_iris.data.info()
print(bunch_iris.data.shape)
print(bunch_iris.data.describe())
bunch_iris.data.head()

In [None]:
# Only the last expression in a cell is displayed, so print the rest explicitly
bunch_iris.target.info()
print(bunch_iris.target.shape)
print(bunch_iris.target.describe())
bunch_iris.target.head()


# Task 2: Explore the Data Structure

Print and explain the shape of the data set, the first and last 5 rows of the data, the feature names, and the target classes.  


## Task 2 : Answer

### Summary

The iris data set is returned as a "bunch" object. The bunch object is a dictionary-like object with, among others, the following keys:

- data : The feature data set
- target : The target data set
- target_names : The target names
- feature_names : The feature names
- DESCR : The description of the data set
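A minimal sketch of how the keys listed above can be inspected. It assumes the default `as_frame=False` call, so `data` and `target` are NumPy arrays rather than pandas objects; the names `iris` and `same_object` are only illustrative.

```python
from sklearn.datasets import load_iris

# Load the data as NumPy arrays (the default, as_frame=False)
iris = load_iris()

# A Bunch is a dict subclass, so key access and attribute access
# return the very same object
same_object = iris["data"] is iris.data

print(sorted(iris.keys()))
print(same_object)
print(iris.data.shape, iris.target.shape)
print(list(iris.feature_names))
print(list(iris.target_names))
```

Either access style works; the notebook mostly uses attribute access (`bunch_iris.data`).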


**Description of the data set (DESCR)**

In [None]:
# This explains the data set
print(bunch_iris.DESCR);

### Shape of data set

The iris (flower) data set has 150 rows and 4 feature columns (sepal length and width, and petal length and width, in cm). There are 50 rows for each of the 3 iris species.

In [None]:
# A bare expression is only displayed when it is the last line of a cell,
# so the shapes are printed explicitly
print("Shape of the data: ")
print("------------------")
print(bunch_iris.data.shape)
print("Info of the data: ")
print("------------------")
bunch_iris.data.info()
print("Shape of the target: ")
print("------------------")
print(bunch_iris.target.shape)
print("Info of the target: ")
print("------------------")
bunch_iris.target.info()

### First and last 5 rows of the data set

The data set in the "bunch" object is split into `data` (the features) and `target` (the species).


**Data ( Features ) First 5 Rows**

In [None]:
bunch_iris.data.head()

**Data ( Features ) Last 5 Rows**

In [None]:
bunch_iris.data.tail()

**Target ( Species ) First 5 Rows**

In [None]:
bunch_iris.target.head()

**Target ( Species ) Last 5 Rows**

In [None]:
bunch_iris.target.tail()

### Feature Names

The feature names are stored as metadata in the iris data set under `feature_names`.


#### Notes

##### What is a sepal?

The outer parts of the flower (often green and leaf-like) that enclose a developing bud. Sepals are usually green, but in some species they may be brightly colored or resemble petals. Sepals help to protect the flower bud.


##### What is a petal?

The parts of a flower that are often conspicuously colored. They are often involved in attracting pollinators, or are used to attract animals that help the plant to spread its seeds.

 https://www.amnh.org/learn-teach/curriculum-collections/biodiversity-counts/plant-identification/plant-morphology/parts-of-a-flower

In [None]:
bunch_iris.feature_names

### Target Classes

The target classes are the iris species: Setosa, Versicolour and Virginica. Their names are stored as metadata in the iris data set under `target_names`.

The target part of the iris data set holds 150 records, each a surrogate key for the associated iris species.
There are 3 target classes, 0, 1 and 2, which are the surrogate keys for Setosa, Versicolour and Virginica respectively.
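The surrogate-key idea can be sketched with NumPy indexing. This is an illustration only, assuming the default array form of `load_iris()`; the names `iris`, `species`, `codes` and `counts` are not from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target holds the integer codes 0, 1, 2; target_names holds the labels
# in the same order, so indexing one with the other decodes every row
species = iris.target_names[iris.target]

# Count how many rows carry each code
codes, counts = np.unique(iris.target, return_counts=True)
for code, count in zip(codes, counts):
    print(code, iris.target_names[code], count)
```

The same decoding is done later in the notebook with a dictionary and `Series.map`, which is the more natural choice once the data is in a DataFrame.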



In [None]:
# Count of each target
# This shows the distribution of the target
bunch_iris.target.groupby(bunch_iris.target).count()

In [None]:
bunch_iris.target_names

# Task 3: Summarize the Data

For each feature in the dataset, calculate and display:  

- mean
- minimum
- maximum
- standard deviation
- median

## Task 3 : Answer

Create a single dataframe with the results for all features and all species.
The target class is converted to the species name in the final dataframe.

In [None]:
# Combine the data and target into one dataframe
df_iris = pd.concat([bunch_iris.data, bunch_iris.target], axis=1)
# get a dictionary of the target values and their corresponding names
target_name_map = {i: name for i, name in enumerate(bunch_iris.target_names)}
# map the target values to the target names
df_iris['target'] = df_iris['target'].map(target_name_map)
df_iris.head()


Display the mean, minimum, maximum, standard deviation, and median of all the species combined for each feature.

In [None]:
# get the mean, min, max, std, and median of the features all at once
df_iris.agg({'sepal length (cm)':['mean','min','max','std','median'],'sepal width (cm)':['mean','min','max','std','median'],'petal length (cm)':['mean','min','max','std','median'],'petal width (cm)':['mean','min','max','std','median']})

Group the data by species and display the mean, minimum, maximum, standard deviation, and median of each feature for each species.

In [None]:
# Group by species (target), then compute the mean, min, max, std and median
# of every feature within each group
df_iris_species_summary_stats = df_iris.groupby('target').agg(['mean','min','max','std','median'])
df_iris_species_summary_stats

## Task 4: Visualize Features

Plot histograms for each feature using `matplotlib`.  
Add appropriate titles and axis labels.  



In [None]:
# Plot in 4 subplots the histograms of the features
fig1, axs1 = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
# set overall title
fig1.suptitle('Iris Data Set')
# loop through each feature and plot its histogram in the 2 x 2 grid
# sepals on the top row, petals on the bottom row
# the loop keeps the code compact
for i, feature_name in enumerate(df_iris.columns[:-1]):
    x = i // 2
    y = i % 2
    axs1[x, y].hist(df_iris[feature_name])
    axs1[x, y].set_title(feature_name)
    axs1[x, y].set_xlabel(feature_name)
    axs1[x, y].set_ylabel('count')
plt.tight_layout()
plt.show()

A more informative approach is to plot a histogram of each feature for each species.

In [None]:
# Plot in 4 subplots the histograms of the features of each species of iris
fig2, axs2 = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
# set overall title
fig2.suptitle('Iris Data Set')
# loop through each feature and plot its histogram in the 2 x 2 grid
# sepals on the top row, petals on the bottom row
# the loop keeps the code compact
for i, feature_name in enumerate(df_iris.columns[:-1]):
    x = i // 2
    y = i % 2
    for species in df_iris['target'].unique():
        axs2[x, y].hist(df_iris[df_iris['target'] == species][feature_name], alpha=0.5, label=species, histtype='stepfilled')
    axs2[x, y].set_title(feature_name)
    axs2[x, y].set_xlabel(feature_name)
    axs2[x, y].set_ylabel('count')
    axs2[x, y].legend()
plt.tight_layout()
plt.show()

## Task 5: Investigate Relationships

Choose any two features from the data set and create a scatter plot of them.  
Color-code the three different classes of the scatter plot points.


In [None]:

# plot sepal length vs sepal width, with each species in a different color
#print([['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.scatter(bunch_iris.data['sepal length (cm)'], bunch_iris.data['sepal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('sepal length vs sepal width')
# put legend for the targets values and colors
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
plt.show()
# plot petal length vs petal width, with each species in a different color
plt.scatter(bunch_iris.data['petal length (cm)'], bunch_iris.data['petal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('petal length vs petal width')
# put legend for the targets values and colors
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
plt.show()



## Task 6: Analyze Relationship

Use `numpy.polyfit` to add a regression line to the scatter plot from Task 5.

The `polyfit` function performs a least squares polynomial fit. I last used this technique for the iodine absorption curve of an EDTA titration in a chemistry lab (2nd year, WITS, 1984).

NumPy polyfit documentation: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html  
How to plot polyfit in a scatter plot: https://stackoverflow.com/questions/46627629/how-to-use-numpy-polyfit-to-plot-trend
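To make "least squares fit" concrete, here is a small sketch that compares `np.polyfit` at degree 1 with the textbook closed-form slope and intercept. The sample values are made up for illustration and are not iris measurements.

```python
import numpy as np

# Small made-up sample (hypothetical values, not taken from the iris data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Degree 1 asks polyfit for a straight line; the coefficients come back
# highest power first, i.e. [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Closed-form least squares solution for comparison
slope_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_manual = y.mean() - slope_manual * x.mean()

# poly1d turns the coefficients into a callable line
line = np.poly1d([slope, intercept])

print(slope, slope_manual)
print(intercept, intercept_manual)
print(line(6.0))
```

Both routes should agree to floating-point precision, which is why a degree-1 `polyfit` can be read as an ordinary linear regression.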



### Using polyfit to do linear regression on the sepal length and sepal width

#### Conclusion

TODO: tidy up the comments.

The initial attempt at fitting a single regression line through all the data was not successful: there is no obvious linear relationship between sepal length and sepal width across the whole data set.  
Once the species are separated out, the setosa data shows a clear linear relationship between sepal length and sepal width. The versicolor and virginica data are more spread out, so their linear relationship is weaker.  
The plot below therefore fits a regression line for each species as well as one for all the data.

In [None]:
# numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
# for sepal length vs sepal width
# Do a polynomial fit between sepal length and sepal width
# Degree 1 means linear
trend = np.polyfit(x = df_iris['sepal length (cm)'], y = df_iris['sepal width (cm)'], deg = 1)
trend_serosa = np.polyfit(x = df_iris[df_iris['target'] == 'setosa']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'setosa']['sepal width (cm)'], deg = 1)
trend_versicolor = np.polyfit(x = df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'versicolor']['sepal width (cm)'], deg = 1)
trend_virginica = np.polyfit(x = df_iris[df_iris['target'] == 'virginica']['sepal length (cm)'], y = df_iris[df_iris['target'] == 'virginica']['sepal width (cm)'], deg = 1)
# do a trend line for each species of iris

# create a polynomial function - this is the trend line
# this is a linear trend line
trendpoly = np.poly1d(trend) # create a polynomial function
trendpoly_serosa = np.poly1d(trend_serosa) # create a polynomial function
trendpoly_versicolor = np.poly1d(trend_versicolor) # create a polynomial function
trendpoly_virginica = np.poly1d(trend_virginica) # create a polynomial function

plt.scatter(df_iris['sepal length (cm)'], df_iris['sepal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('sepal length vs sepal width')
# put legend for the targets values and colors
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
# plot the trend line for all the data and then each species
plt.plot(df_iris['sepal length (cm)'], trendpoly(df_iris['sepal length (cm)']), color = 'black')
plt.plot(df_iris[df_iris['target'] == 'setosa']['sepal length (cm)'], trendpoly_serosa(df_iris[df_iris['target'] == 'setosa']['sepal length (cm)']), color = 'red')
plt.plot(df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)'], trendpoly_versicolor(df_iris[df_iris['target'] == 'versicolor']['sepal length (cm)']), color = 'green')
plt.plot(df_iris[df_iris['target'] == 'virginica']['sepal length (cm)'], trendpoly_virginica(df_iris[df_iris['target'] == 'virginica']['sepal length (cm)']), color = 'blue')
plt.show()


### Polyfit of petals length and width

In [None]:
# numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
# for petal length vs petal width
# Do a polynomial fit between petal length and petal width
# Degree 1 means linear
trend = np.polyfit(x = bunch_iris.data['petal length (cm)'], y = bunch_iris.data['petal width (cm)'], deg = 1)
print(trend)
# create a polynomial function - this is the trend line
# this is a linear trend line
trendpoly = np.poly1d(trend) # create a polynomial function
print(trendpoly)
plt.scatter(bunch_iris.data['petal length (cm)'], bunch_iris.data['petal width (cm)'], c=[['r', 'g', 'b'][i] for i in bunch_iris.target])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('petal length vs petal width')
# put legend for the targets values and colors
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='setosa', markerfacecolor='r', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='versicolor', markerfacecolor='g', markersize=10),
                    plt.Line2D([0], [0], marker='o', color='w', label='virginica', markerfacecolor='b', markersize=10)])
plt.plot(bunch_iris.data['petal length (cm)'], trendpoly(bunch_iris.data['petal length (cm)']), color = 'black')
plt.show()


## Task 7: Analyze Class Distributions

Create box-plots of the petal lengths for each of the three classes.


#### Description

The box plot shows the distribution of the petal lengths of the 3 iris species. It summarises the spread of the data with the median and the quartiles: the box spans the interquartile range (IQR), the line inside the box is the median, and each whisker extends to the most extreme data point within 1.5 * IQR of the box. Points beyond the whiskers are drawn individually as outliers. The mean is not drawn by default.
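The quantities behind a box plot can be computed by hand. The sketch below uses a made-up sample with one obvious outlier and assumes the common 1.5 * IQR whisker rule; all variable names are illustrative.

```python
import numpy as np

# Made-up sample with one obvious outlier (hypothetical values)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Quartiles and interquartile range: these define the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whisker ends are actual data points, so they usually sit inside the fences rather than exactly on them.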

### Conclusion

The setosa species has the most distinct petal length, so it can be identified from the petal length alone. The other two species, Versicolour and Virginica, overlap in petal length, so it is not as easy to tell them apart by petal length alone.


#### Notes

Box plots are basically a plot of the median and quartiles: the box is the interquartile range between the 0.25 and 0.75 quantiles, and the whiskers reach the furthest data points within 1.5 * IQR of the box.
We are going to plot the petal lengths of each of the classes, that is, the iris species Setosa, Versicolour and Virginica.
We will be using the seaborn library to create the box plots.

### References

- Course Notes - Ian McLoughlin : https://github.com/ianmcloughlin/principles_of_data_analytics/blob/main/materials/boxplots.ipynb
- Seaborn Box Plots : https://seaborn.pydata.org/generated/seaborn.boxplot.html
- https://www.geeksforgeeks.org/how-to-set-the-hue-order-in-seaborn-plots/#5-example-withseabornboxplot

Create a dataframe of petal lengths for each of the three classes: Setosa, Versicolour and Virginica.

In [None]:
# A data frame with the petal length and class
df_petal = df_iris[[ 'target','petal length (cm)']]
df_petal.sample(5)


#### Summary Data
For each of the classes (Setosa, Versicolour and Virginica), the mean, max, min, standard deviation, median and quartiles of the petal lengths are displayed in a table.

In [None]:
def quantile_25(x):
    """Return the 25th percentile (first quartile) of x.

    x is a pandas Series or NumPy array.
    Note: could be done with a lambda function, but this is more readable.
    """
    return np.quantile(x, 0.25)

def quantile_75(x):
    """Return the 75th percentile (third quartile) of x.

    x is a pandas Series or NumPy array.
    Note: could be done with a lambda function, but this is more readable.
    """
    return np.quantile(x, 0.75)

# Group by target and get the mean, min, max, std, median, 25th and 75th quantile
df_iris_species_summary_stats = df_petal.groupby(['target']).agg({
	'petal length (cm)': ['mean', 'min', 'max', 'std', 'median', quantile_25, quantile_75]})


# Rename the quantile columns
#  relates to box plot < -- |Q25 | Median | Q75 | -- >
df_iris_species_summary_stats.columns = ['mean', 'min', 'max', 'std', 'median', 'Q25', 'Q75']

 # Add IQR column - Interquartile range is the difference between the 75th and 25th quantile
df_iris_species_summary_stats['IQR'] = df_iris_species_summary_stats['Q75'] - df_iris_species_summary_stats['Q25']

# Add outlier columns
df_iris_species_summary_stats['outlier_min'] = df_iris_species_summary_stats['Q25'] - 1.5 * df_iris_species_summary_stats['IQR']
df_iris_species_summary_stats['outlier_max'] = df_iris_species_summary_stats['Q75'] + 1.5 * df_iris_species_summary_stats['IQR']

#print(df_iris_species_summary_stats.info())

# rotate the df_iris_species_summary_stats
# I want the species to be the columns
# This is easier to read
# Transpose the data frame
df_iris_species_summary_stats = df_iris_species_summary_stats.transpose()

# change the order of the columns to be setosa, versicolor, virginica
df_iris_species_summary_stats = df_iris_species_summary_stats[['setosa', 'versicolor', 'virginica'] ]
# sort rows in a specific order
df_iris_species_summary_stats = df_iris_species_summary_stats.reindex(['min', 'outlier_min', 'Q25', 'median', 'mean', 'Q75', 'outlier_max', 'max', 'std', 'IQR'])
df_iris_species_summary_stats


In [None]:


ax = sns.boxplot(x='petal length (cm)', y='target', data=df_petal, palette="pastel",hue='target',hue_order=['setosa','versicolor','virginica'])
ax.title.set_text('Petal Length vs Species')
ax.set_xlabel('Petal Length (cm)')
ax.set_ylabel('Species')
ax.legend(title='Species', loc='upper right', labels=['setosa','versicolor','virginica'])
plt.show()

## Task 8: Compute Correlations

Calculate the correlation coefficients between the features.  
Display the results as a heatmap using `matplotlib`.  


### Correlation Coefficients

For the heatmap, the correlation coefficient is calculated for each pair of features (sepal length, sepal width, petal length, petal width). The correlation coefficient measures the strength and direction of the linear relationship between two features. It is a value between -1 and 1: a value of 1 means a perfect positive linear relationship, a value of -1 means a perfect negative linear relationship, and a value of 0 means no linear relationship.
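As a small illustration of the definition, the sketch below computes the Pearson correlation coefficient by hand and compares it with `np.corrcoef`. The paired values are made up and are not iris measurements.

```python
import numpy as np

# Made-up paired measurements (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: the co-variation of x and y, scaled by the spread of each
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# np.corrcoef returns the full 2 x 2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

# A perfectly linear, decreasing relationship gives r = -1
r_negative = np.corrcoef(x, -3.0 * x + 10.0)[0, 1]

print(r_manual, r_numpy, r_negative)
```

`DataFrame.corr()`, used in the next cell, applies the same calculation to every pair of columns at once.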

There will also be a heatmap for each of the species

In [None]:
df_corr = df_iris[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].corr()
df_corr

#### Plot a matplotlib heatmap of the correlation coefficients

A heatmap is a matrix of coloured boxes with all the features on both the x and y axes. The colour of each box represents the strength of the correlation between the two features.

In [None]:
plt.figure(figsize=(8, 8)) 
# plots the color map of the correlation matrix - the color is based on the value of the correlation
plt.imshow(df_corr, cmap='coolwarm', interpolation='nearest') 
plt.colorbar(label='Correlation Coefficient')
# np.arange(len(df_corr.columns)) returns evenly spaced integers: 0, 1, 2, 3 for the 4 features
plt.xticks(np.arange(len(df_corr.columns)), df_corr.columns, rotation=45, ha='right')
plt.yticks(np.arange(len(df_corr.index)), df_corr.index)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

### Correlation Coefficients

Note: this section does not answer the task. The task asks for the correlation coefficients between all 4 features, not for a single pair of features. The correlation coefficient is a measure of the strength and direction of a linear relationship between two variables. It ranges from -1 to 1. A value of 1 indicates a perfect positive linear relationship between two variables, while a value of -1 indicates a perfect negative linear relationship. A value of 0 indicates no linear relationship between the two variables.

TODO: remove this section. It is incorrect for the task, but it shows some interesting ways of doing it.

In [None]:
# reset plot
# 
plt.clf()
# We will first calculate the coefficient for petals , for all and then for each species
corr_coef_petal_all = df_iris['petal length (cm)'].corr(df_iris['petal width (cm)'])
# now do each of the species
corr_coef_petal_serosa = df_iris[df_iris['target'] == 'setosa']['petal length (cm)'].corr(df_iris[df_iris['target'] == 'setosa']['petal width (cm)'])
corr_coef_petal_versicolor = df_iris[df_iris['target'] == 'versicolor']['petal length (cm)'].corr(df_iris[df_iris['target'] == 'versicolor']['petal width (cm)'])
corr_coef_petal_virginica = df_iris[df_iris['target'] == 'virginica']['petal length (cm)'].corr(df_iris[df_iris['target'] == 'virginica']['petal width (cm)'])
#print(f"Petal Correlation Coefficient for all species : {corr_coef_petal_all:.2f}")
#print(f"Petal Correlation Coefficient for setosa      : {corr_coef_petal_serosa:.2f}")
#print(f"Petal Correlation Coefficient for versicolor  : {corr_coef_petal_versicolor:.2f}")
#print(f"Petal Correlation Coefficient for virginica   : {corr_coef_petal_virginica:.2f}")
# plot 4 plots all , setosa, versicolor, virginica
# so i can visualise the coefficients
# 2 x 2 plot but wider than tall
fig , ax = plt.subplots(nrows=2, ncols=2,figsize=(10,10))
fig.suptitle('Petal Length vs Petal Width')
# All species
ax[0,0].scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'] , c=[['r', 'g', 'b'][i] for i in bunch_iris.target],
                 label = ['setosa','versicolor','virginica'],alpha=0.1)
ax[0,0].set_title('All Species')
# Setosa
ax[0,1].scatter(df_iris[df_iris['target'] == 'setosa']['petal length (cm)'], df_iris[df_iris['target'] == 'setosa']['petal width (cm)'],c='r',alpha=0.1)
ax[0,1].set_title('Setosa')
# Versicolor
ax[1,0].scatter(df_iris[df_iris['target'] == 'versicolor']['petal length (cm)'], df_iris[df_iris['target'] == 'versicolor']['petal width (cm)'],c='g',alpha=0.1)
ax[1,0].set_title('Versicolor')
# Virginica
ax[1,1].scatter(df_iris[df_iris['target'] == 'virginica']['petal length (cm)'], df_iris[df_iris['target'] == 'virginica']['petal width (cm)'],c='b',alpha=0.1)
ax[1,1].set_title('Virginica')
# add the correlation coefficient to the plots
# Add correlation coefficient to the plots for all species
# at first data point
ax[0,0].text(df_iris['petal length (cm)'].min(), df_iris['petal width (cm)'].min(), f"Correlation Coefficient : {corr_coef_petal_all:.2f}")
# Add correlation coefficient to the plots for setosa
# at first data point
ax[0,1].text(df_iris[df_iris['target'] == 'setosa']['petal length (cm)'].min(), df_iris[df_iris['target'] == 'setosa']['petal width (cm)'].min(), f"Correlation Coefficient : {corr_coef_petal_serosa:.2f}")
# Add correlation coefficient to the plots for versicolor
# at first data point
ax[1,0].text(df_iris[df_iris['target'] == 'versicolor']['petal length (cm)'].min(), df_iris[df_iris['target'] == 'versicolor']['petal width (cm)'].min(), f"Correlation Coefficient : {corr_coef_petal_versicolor:.2f}")
# Add correlation coefficient to the plots for virginica
# at first data point
ax[1,1].text(df_iris[df_iris['target'] == 'virginica']['petal length (cm)'].min(), df_iris[df_iris['target'] == 'virginica']['petal width (cm)'].min(), f"Correlation Coefficient : {corr_coef_petal_virginica:.2f}")

# adjust the space between the plots
plt.show()

# now do a heatmap of the correlations between all 4 features
# correlation_matrix was never defined, so build it from the feature columns
correlation_matrix = df_iris.drop(columns='target').corr()
plt.figure(figsize=(8, 6))  # Start a fresh figure; adjust size as needed
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')  # Use a colormap
plt.colorbar(label='Correlation Coefficient')  # Add a colorbar
plt.xticks(np.arange(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=45, ha='right')  # Rotate x-axis labels
plt.yticks(np.arange(len(correlation_matrix.index)), correlation_matrix.index)  # Set y-axis labels
plt.title('Correlation Heatmap')
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()





## Task 9: Fit a Simple Linear Regression

For your two features in Task 5, calculate the coefficient of determination $R^2$.  
Re-create the plot from Task 6 and annotate it with the $R^2$ value.

### References

- https://en.wikipedia.org/wiki/Coefficient_of_determination
- Nice explanation: https://medium.com/@muhammadsohaib3434/r-squared-r%C2%B2-6582386b8821


### Gemini AI 

*Prompt: explain the coefficient of determination to a high school student*

The coefficient of determination is a measure of how well the regression line fits the data. It is a value between 0 and 1, with 1 indicating a perfect fit. The coefficient of determination is calculated as the square of the correlation coefficient, and it represents the proportion of the variance in the dependent variable that is predictable from the independent variable. In other words, it tells you how much of the variation in the dependent variable can be explained by the independent variable. A coefficient of determination of 0 means that the independent variable does not explain any of the variation in the dependent variable, while a coefficient of determination of 1 means that the independent variable explains all of the variation in the dependent variable.

Alright, let's break down the coefficient of determination (often called R-squared) in a way that's easy to understand.

Imagine you're trying to predict something, like how well someone will do on a test. You might look at how much they studied.

Here's the basic idea:

* **You've got some data:** You've collected information on how many hours a bunch of students studied, and what scores they got on the test.
* **You make a prediction:** You draw a line (or use a more complex model) that you think best shows the relationship between studying and test scores. This line is your prediction.
* **But how good is your prediction?** That's where R-squared comes in.

Here's how to think about it:

* **R-squared tells you how much of the variation in your results your prediction explains.**
    * Think of "variation" as how spread out the test scores are. Some students did really well, some did poorly, and most were in the middle.
    * R-squared tells you what percentage of that spread your study-hours prediction can account for.
* **R-squared is a number between 0 and 1 (or 0% and 100%).**
    * If R-squared is 1 (or 100%), it means your prediction is perfect. Every student's score perfectly matched what your line predicted.
    * If R-squared is 0 (or 0%), it means your prediction is useless. Knowing how much someone studied tells you nothing about their test score.
    * If R-squared is, say, 0.7 (or 70%), it means that 70% of the variation in test scores can be explained by how much the students studied. The other 30% is due to other factors (like how well they slept, or how good they are at taking tests).

**In simpler terms:**

* R-squared is like a grade for your prediction. A higher R-squared means your prediction is better.
* It tells you how well your model "fits" the data.

**Key takeaway:**

* R-squared helps you understand how reliable your predictions are.



In [None]:
## We will use pandas to determine the coefficient of determination
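The coefficient of determination described above can be sketched in a few lines. Note that this example uses NumPy rather than pandas, and made-up values rather than the iris columns; the helper name `r_squared` is hypothetical and not part of any library.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of a straight-line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)  # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot

# Made-up sample (hypothetical values, not the iris measurements)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r2 = r_squared(x, y)
r = np.corrcoef(x, y)[0, 1]

# For a simple linear regression, R^2 equals the squared correlation coefficient
print(f"R^2 = {r2:.3f}, r^2 = {r ** 2:.3f}")
```

With the notebook's data, the same helper could be called as `r_squared(df_iris['sepal length (cm)'], df_iris['sepal width (cm)'])`, and the result placed on the Task 6 plot with `plt.text`. In pandas, squaring the output of `Series.corr` gives the same number for a straight-line fit.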


## Task 10: Too Many Features

Use `seaborn` to create a `pairplot` of the data set.  
Explain, in your own words, what the `pairplot` depicts. 


#### References
https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
# Use seaborn pairplot to visualize the data
# use df_iris and hue='target' to color the data points by target
sns.pairplot(df_iris,hue='target')

I used a regression pairplot (`kind='reg'`) because I was interested in the output.
It draws a regression line through each pair of features.
This may indicate that there is a relationship between length and width, which makes sense.
Is there a clustering equivalent, something like k-means?

In [None]:
sns.pairplot(df_iris,hue='target',kind='reg')

### PairPlot Using KDE



In [None]:
sns.pairplot(df_iris,hue='target',kind='kde')

## END