# Programming and Scripting Project 
## An Analysis of the Iris Dataset
**Author: A O'Connor**
*****
<p align ="center"><img src="https://storage.googleapis.com/kaggle-datasets-images/19/19/default-backgrounds/dataset-card.jpg" /></p> 

## Introduction
This repository contains my analysis of the [Iris dataset](https://archive.ics.uci.edu/dataset/53/iris). The data set is available on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/53/iris), or the raw CSV file can be found in the [Seaborn Data Repository](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv) on GitHub.

### The Iris Data Set
The iris dataset presents data on three different species of the Iris flower, and is often used in statistics and data science as a training data set. The data set contains information on the following four features from 150 different samples of iris flower:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

The flowers are categorized into 3 different species, listed below:

- Iris setosa
- Iris versicolor
- Iris virginica

More information on this classic data set can be found in this [paper](https://www.semanticscholar.org/paper/The-iris-data-set%3A-In-search-of-the-source-of-Unwin-Kleinman/4599862ea877863669a6a8e63a3c707a787d5d7e) published in Significance in 2021. The paper, titled "*The iris data set: In search of the source of virginica*", provides a neat overview of the data set that will be explored in this repository, and its significance in the data analysis and statistics community since its first use as an example data set by [R. A. Fisher](https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x) in 1936. 
******


## About this project
This repository contains a Python script and a Jupyter notebook with my own analysis of the Iris data set, for the purposes of the Programming and Scripting module I am taking as part of a course in Computer Science at ATU. The aim of this project is to demonstrate how Python can be used to analyse large datasets, and to use this analysis to provide a coherent overview of the Iris dataset. Detailed explanations and code comments are provided throughout to demonstrate understanding of the principles behind data analytics using Python. 

## Get Started
To get started with this project, you can open the notebook in Google Colab by clicking on the link below:

<br>

<a target="_blank" href="https://colab.research.google.com/github/a-o-connor/pands-project/blob/main/iris_analysis.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<br>

**Overview of notebook contents:**
1. Overview of the Data 
    - The first section of the notebook contains an overview of the data contained in the Iris data set. 
    - An interpretation of the types of variables within the data frame is included in this section. A summary of the ca 
2. Exploratory Data Analysis
    - The second section of the notebook includes exploratory data analysis of the Iris data, in order to get an overview of the data, to identify anomalies and to ensure the data set has been loaded in correctly. 
    - Histograms of each of the numeric variables in the data frame were made in order to visualise and explore the distribution of each of the variables. 
    - Bar charts and box plots were plotted to explore the differences in iris attributes between species.
3. Correlations
    - The third section of the notebook contains an analysis of the correlations within the data set. 
    - Bivariate linear regression was used to explore the relationship between each pair of numeric variables. 
    - A Student's *t*-test was used to compare the mean petal length of the versicolor and virginica species, and a oneway ANOVA was used to compare sepal length across the three species.  

## 1. Overview of the Data
An initial overview of the data is provided in the two tables below.\
Initially the Pandas DataFrame ``.describe()`` method was used to generate the summary statistics of the continuous, numerical variables in the dataset. The function returns a dataframe, which was converted into Markdown formatting using the Pandas ``.to_markdown()`` function. 
- [W3 Schools Pandas Tutorial: Pandas describe() Method](https://www.w3schools.com/python/pandas/ref_df_describe.asp#:~:text=Definition%20and%20Usage,std%20%2D%20The%20standard%20deviation.)
- [Pandas Documentation: to_markdown()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_markdown.html)
- [Stack Overflow Question: How to control float formatting in the Pandas to_markdown](https://stackoverflow.com/questions/66236289/how-do-you-control-float-formatting-when-using-dataframe-to-markdown-in-pandas )
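
As a minimal sketch of this step, the snippet below runs ``.describe()`` and ``.to_markdown()`` on a nine-row excerpt of the data (the first three rows of each species), so that it runs without downloading the full CSV. The variable names are illustrative and are not taken from analysis.py. ``.to_markdown()`` relies on the optional ``tabulate`` package, so a plain-text fallback is included in case it is not installed.

```python
import pandas as pd

# Nine-row excerpt of the Iris data: the first three rows of each species.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.DataFrame(rows, columns=columns)

# describe() summarises the numeric columns only; "species" is skipped.
summary = iris.describe()

try:
    # floatfmt is passed through to tabulate to control the decimal places.
    print(summary.to_markdown(floatfmt=".3f"))
except ImportError:
    # Fallback when the optional tabulate dependency is missing.
    print(summary.round(3).to_string())
```

On the full 150-row data set the same two calls should produce a table like the one shown in this section.
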
                                    
**Summary statistics of the Iris Dataset:**                     
                     
|       |   sepal_length |   sepal_width |   petal_length |   petal_width |
|:------|---------------:|--------------:|---------------:|--------------:|
| count |        150.000 |       150.000 |        150.000 |       150.000 |
| mean  |          5.843 |         3.054 |          3.759 |         1.199 |
| std   |          0.828 |         0.434 |          1.764 |         0.763 |
| min   |          4.300 |         2.000 |          1.000 |         0.100 |
| 25%   |          5.100 |         2.800 |          1.600 |         0.300 |
| 50%   |          5.800 |         3.000 |          4.350 |         1.300 |
| 75%   |          6.400 |         3.300 |          5.100 |         1.800 |
| max   |          7.900 |         4.400 |          6.900 |         2.500 |                     
                     
The Iris dataset contains one categorical variable: the species. Each flower is defined as one of three species of Iris: setosa, versicolor or virginica. The mean of each attribute grouped by species was found using the Pandas ``.groupby()`` function. This function returns a GroupBy object. This [Real Python tutorial](https://realpython.com/pandas-groupby/#:~:text=You%20call%20.,a%20single%20column%20name%20to%20.) on using Pandas GroupBy and the **split-apply-combine** method was very helpful for understanding how Pandas GroupBy works and how to manipulate the returned GroupBy object.\
**Split-apply-combine** refers to the following 3 steps often used in manipulation of Pandas dataframes: 
1. Split the data frame into groups.
2. Apply some function across the groups. 
3. Combine the returned results into a different dataframe. 

The Pandas ``.mean()`` function was applied to return the mean of each attribute by species. This function returns a Pandas series, which was converted to a list using a list wrapper in order to store it in a dictionary object. 
The species means were stored in a dictionary object. However, in order to apply the ``.to_markdown()`` function mentioned earlier to generate the table below, the dictionary had to be converted back into a Pandas data frame using ``.from_dict()``. The Pandas ``.unique()`` function was executed on the species column to return a NumPy array containing the three species of Iris in the dataframe. This was assigned the variable name "species", which was passed as the column names of the new Pandas data frame.  
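
The split-apply-combine steps can be sketched as follows. This is a shorter route than the dictionary approach described above, not a copy of it: the grouped means are simply transposed so that each species becomes a column. The nine-row excerpt (first three rows of each species) and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Nine-row excerpt of the Iris data: the first three rows of each species.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.DataFrame(rows, columns=columns)

# Split by species, apply mean() to every numeric column, combine into one frame.
species_means = iris.groupby("species").mean()

# Transpose so that attributes are rows and species are columns.
table = species_means.T
print(table.round(3))
```
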

**Species Means in the Iris Dataset:**                    
                     
|              |   setosa |   versicolor |   virginica |
|:-------------|---------:|-------------:|------------:|
| Sepal Length |    5.006 |        5.936 |       6.588 |
| Sepal Width  |    3.418 |        2.77  |       2.974 |
| Petal Length |    1.464 |        4.26  |       5.552 |
| Petal Width  |    0.244 |        1.326 |       2.026 |


The summary tables and a brief description of each was saved to "textfile_summary_of_variables.txt", which can be found in this repository.  
- [Real Python: Working With Files](https://realpython.com/working-with-files-in-python/ )

## 2. Exploratory Data Analysis
- EDA is often the first step in analysis of large data sets in order to get an overview of the data, to identify anomalies and ensure the data set has been loaded in correctly.   
- Histograms of each of the numeric variables in the data frame were made in order to visualise and explore the distribution of each of the numeric variables.
- Histograms are a good way to present continuous data, as they provide a visualisation of:
    - Where the distribution is centered
    - The spread of the distribution
    - The shape of the distribution
- The histograms were generated using Matplotlib. This [Python Tutorial](https://realpython.com/python-matplotlib-guide/) on the object oriented (stateless) approach to using Matplotlib provides a helpful overview of the Matplotlib figure and axes object hierarchy, which can get confusing, and how to use the subplots notation. 
- A function named "histogram" was defined in the script to take an x-value (a numeric variable to be plotted) as a keyword argument, and save an image file of the histogram of that variable to the current working directory.
- A histogram of each of the numeric variables in the Iris dataset can be found in this repository.  
    - **Sepal Width** displays an approximately normal distribution. 
    - **Petal Width** and **Sepal Length** both display a trimodal distribution. 
    - **Petal Length** displays a bimodal distribution. 
- For the variables displaying bimodal and trimodal distributions, a histogram displaying the distributions separated by species was plotted.
- These plots are saved as image files titled Distribution by Species for each variable. 
- From the overlay of each variable's histogram separated by species, it becomes clear that the bimodal and trimodal distributions observed within the dataset were due to the different modes for each species. 
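
A histogram helper along the lines described above might look like the sketch below. It uses Matplotlib's object-oriented interface. The signature, the ``out_dir`` parameter, the file name pattern and the bin count are assumptions made for illustration and may differ from the function in analysis.py. The petal lengths are a nine-value excerpt (first three rows of each species), and the image is written to a temporary folder here rather than the current working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def histogram(data, x_value, out_dir="."):
    """Plot a histogram of one numeric column and save it as a PNG file."""
    fig, ax = plt.subplots()
    ax.hist(data[x_value], bins=10, edgecolor="black")
    ax.set_xlabel(x_value)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Histogram of {x_value}")
    path = os.path.join(out_dir, f"histogram_{x_value}.png")
    fig.savefig(path)
    plt.close(fig)  # release the figure once it has been saved
    return path


# Petal lengths from the first three rows of each species.
sample = pd.DataFrame({"petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]})
saved_path = histogram(sample, x_value="petal_length", out_dir=tempfile.mkdtemp())
print(saved_path)
```
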

## 3. Bivariate Analysis: Correlations
1. **Scatterplot** of each pair of variables
- The initial step taken in the bivariate analysis of the Iris data set, in order to assess the correlations between the variables, was to plot a scatterplot of each pair of continuous numeric variables. 
- A scatterplot function was defined, taking 3 keyword arguments: an x_value, a y_value and a colour, that returns a scatterplot of the variable input as the y_value versus the variable input as the x_value, with the points coloured by the variable input as the colour kwarg. 
- For each scatterplot generated, a line of best fit was also plotted over the scatterplot. 
2. **Least Squares Polynomial Fit**
- The line of best fit was generated using [NumPy's ``polyfit``](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) function to perform a Least Squares polynomial fit on the two variables. 
- The ``polyfit`` function takes two arrays, which were indexed from the Pandas dataframe using the x_value and y_value keyword arguments taken in by the function, and the desired degree of the polynomial fit. For a degree of 1 it returns two polynomial coefficients, the slope (m) and the intercept (c), which were used to draw a $ y = mx + c $ line over the scatterplot. 
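
A minimal sketch of the degree-1 fit is shown below, assuming petal length as the x_value and petal width as the y_value. The nine data points are the first three rows of each species, and the variable names are illustrative.

```python
import numpy as np

# Petal length (x) and petal width (y) for the first three rows of each species.
petal_length = np.array([1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])
petal_width = np.array([0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1])

# A degree-1 least squares fit returns the slope first, then the intercept.
m, c = np.polyfit(petal_length, petal_width, 1)

# Evaluate y = mx + c at each x to obtain the line of best fit.
fitted = m * petal_length + c
print(f"slope = {m:.3f}, intercept = {c:.3f}")
```

The ``fitted`` values can then be passed to Matplotlib's ``ax.plot()`` together with the x values to draw the line over the scatterplot.
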

3. **Correlation Coefficient, R Square, and Observed Significance Probabilities** 

The plot below displays a heatmap of the correlations in the Iris Data Set.

<br>
<img src="Heatmap of Correlations between Variables.png">
<br>

- The R-value for each line of best fit, as well as the observed significance probability and the R Square, was reported on each plot. 
- This was generated using the Pandas ``.corr()`` function to generate a correlation matrix.  ([W3 Schools.](https://www.w3schools.com/python/pandas/pandas_correlations.asp)) 
- This function takes a dataframe, and returns a correlation matrix with the R values for the correlation between each of the numeric variables.
    - This correlation matrix describes the relationship between each numeric column in the data frame, represented as an R value ranging from -1 to +1.
    - Negative numbers represent a negative correlation between the two variables (i.e. as one number increases the other decreases), and positive numbers represent a positive correlation. 
    - The closer the R value is to zero, the less strong the correlation.
- The correlation matrix is a Pandas data frame that can be indexed using ``.loc`` ([W3 Schools](https://www.w3schools.com/python/pandas/pandas_dataframes.asp)) to find the correlation coefficient between the x_value and y_value using the arguments taken by the function. 
- This correlation coefficient was reported on each scatterplot as a label for the line of best fit. 
- Underneath each plot, a textbox was drawn, following this [Matplotlib demo](https://matplotlib.org/stable/gallery/text_labels_and_annotations/fancytextbox_demo.html#sphx-glr-gallery-text-labels-and-annotations-fancytextbox-demo-py) to generate a neat and aesthetic textbox. 
- Both the *p*-value and the R^2 value for the linear regression fit of the two variables are reported underneath each plot, using [Scipy's linear regression function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html), ``stats.linregress``, to calculate the observed significance probability between each pair of variables. 
- The function takes two arrays (once again, these were indexed from the Pandas dataframe using the x_value and y_value arguments taken in by the function) and returns a result object containing, among other values, the slope, intercept, R value and *p*-value. 
- In the Iris data set, the following correlations are worth noting: 
    - Petal width is positively correlated with sepal length, with an R value of + 0.82, an R Square of 0.67, and a *p*-value $\le$ 0.05.
    - Petal width is positively correlated with petal length, with an R value of + 0.96, an R Square of 0.93, and a *p*-value $\le$ 0.05.
    - A linear regression model’s R Square value describes the proportion of variance explained by the model.
        - A value of 1 means that all of the variance in the data is explained by the model, and the model fits the data well. 
        - A value of 0 means that none of the variance is explained by the model.  
    - *p*-value $\le$ 0.05: A calculated significance probability of less than 0.05 indicates that the correlation is statistically significant at the 5 % significance level.
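
The sketch below shows how the R value, R Square and *p*-value for one pair of variables could be obtained, assuming petal length and petal width as the pair. It uses a nine-row excerpt (first three rows of each species), so the numbers it prints will differ from those reported above for the full data set. The variable names are illustrative.

```python
import pandas as pd
from scipy import stats

# Petal measurements for the first three rows of each species.
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
})

# Correlation matrix of the numeric columns, indexed with .loc to get one R value.
corr_matrix = df.corr()
r = corr_matrix.loc["petal_length", "petal_width"]

# linregress returns the slope, intercept, R value and p-value of the linear fit.
result = stats.linregress(df["petal_length"], df["petal_width"])
r_squared = result.rvalue ** 2

print(f"R = {r:.2f}, R Square = {r_squared:.2f}, p-value = {result.pvalue:.2e}")
```

The R value taken from the correlation matrix and the ``rvalue`` returned by ``linregress`` are the same Pearson coefficient, so either can be squared to obtain R Square.
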

4. **Bar Charts and Box Plots by Species:**
- In order to explore how each iris attribute measured differed between the species, initially a bar chart was plotted that displayed the species mean of each variable. 
- This [tutorial on Matplotlib](https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py) was followed to generate the bar chart, with the mean of each numeric attribute by species.
- [Seaborn](https://seaborn.pydata.org/generated/seaborn.boxplot.html) boxplots and swarmplots were then used to explore the differences in petal length and sepal length between species as these seemed to vary the most. 
- Box plots are a useful tool in determining the spread of data within a group, and to identify potential outliers. 
- The boxes represent the interquartile range. The box limits represent the spread of the central 50% of the data. 
- In the boxplot of petal length by species, the Virginica species has the longest box, indicating a greater variance (or, wider spread) of the data points within the group. There is a wider spread of petal lengths within the Virginica species of Iris.   
- The horizontal lines in the boxes are the median line. The group median petal length for the Setosa species is the lowest, and the least variance is seen in this group.  
    - The overlaid swarmplot points are coloured by petal width. The setosa species also shows the shortest petal width of the three species measured.
- Versicolor and virginica both have a higher median petal length than the setosa species of iris. 
- The whiskers extend to the furthest data point that lies within $1.5\times$ IQR of the edges of the box (the lower and upper quartiles). 
- Outliers are represented by points that fall outside the boxplot whiskers. The versicolor species has one outlier: one plant has a petal length markedly smaller than others of that species measured.
- In the boxplot of sepal length by species, all three species measured show different median sepal lengths, with a wide spread of data points in each group. 
- The swarmplot points are coloured by sepal width. The setosa species of flower, while having the shortest median sepal length measured, seems to have a larger sepal width. There appears to be an equal spread of sepal widths across the versicolor and virginica species. 
- One outlier was identified: one plant in the virginica species group had a markedly shorter sepal length than others of that species. 
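
The $1.5\times$ IQR rule used for the whiskers can be worked through by hand. The values below are made up for illustration and are not taken from the Iris data.

```python
import numpy as np

# Hypothetical measurements, chosen so that one point falls below the lower fence.
values = np.array([3.0, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.9, 5.0])

# The box spans the lower quartile (Q1) to the upper quartile (Q3).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points further than 1.5 x IQR from the box edges are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The whiskers stop at the most extreme points that are still inside the fences.
print("whiskers:", inside.min(), "to", inside.max())
print("outliers:", outliers)
```
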

5. **Independent Samples Student's *t*-test** for Species vs. Petal Length 
- A [Student's *t*-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test) is used to explore how the distribution of a response (in this case, petal length) differs across groups (here, the two groups are two different species of Iris, versicolor and virginica). 
- From the boxplot of petal length by species, it was evident that members of the setosa species had a much smaller petal length than the versicolor and virginica species of flower. However, the difference in petal length between the virginica and versicolor species was less obvious from the box plot alone.  
- In order to ascertain whether there is a statistically significant difference in the mean petal length between the virginica and versicolor species of flower, an independent samples Student's *t*-test was carried out.
- Scipy's ``stats.ttest_ind_from_stats`` function was used to perform the *t*-test. 
- This function takes the mean, standard deviation and number of observations of each of the two groups, assumes equal variance in the two groups by default, and returns the test statistic and the *p*-value. 
- This Data Camp tutorial, [An Introduction to Python T-Tests](https://www.datacamp.com/tutorial/an-introduction-to-python-t-tests), was used to aid in interpretation of the results. 
- The *p*-value determined by the test is imported from analysis.py and reported in the output from the code cell below:

```python
import analysis
print(f"The observed significance probability of the difference in mean petal length between the Virginica and Versicolor species of flower is {analysis.pvalue_petal_length:.2e}.")
# https://realpython.com/how-to-python-f-string-format-float/
```

```
The observed significance probability of the difference in mean petal length between the Virginica and Versicolor species of flower is 3.18e-22.
```


- $\alpha$ = 0.05: This is the significance level, i.e. the probability of incorrectly rejecting the null hypothesis when it is in fact true. 
- *p*-value $\le$ 0.05: A calculated significance probability less than the predetermined significance level indicates that the null hypothesis should be rejected in favour of the alternative hypothesis that there is a difference in the group means.   
- From the t-test performed, there is a statistically significant difference in mean petal length between the Virginica and Versicolor species of flower, at the 5 % significance level. 
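
A minimal sketch of the test is given below, assuming the summary statistics are computed with NumPy before being passed to ``stats.ttest_ind_from_stats``. The petal lengths are the first five rows of each species rather than the full groups, so the printed *p*-value will not match the one reported above. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Petal lengths for the first five versicolor and virginica rows.
versicolor = np.array([4.7, 4.5, 4.9, 4.0, 4.6])
virginica = np.array([6.0, 5.1, 5.9, 5.6, 5.8])

# The function works from summary statistics: mean, sample standard
# deviation (ddof=1) and number of observations for each group.
result = stats.ttest_ind_from_stats(
    mean1=versicolor.mean(), std1=versicolor.std(ddof=1), nobs1=len(versicolor),
    mean2=virginica.mean(), std2=virginica.std(ddof=1), nobs2=len(virginica),
    equal_var=True,
)

# Cross-check against the test run directly on the raw observations.
check = stats.ttest_ind(versicolor, virginica, equal_var=True)

print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.2e}")
```
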

6. **Oneway ANOVA** for Species vs. Sepal Length
- Oneway analysis is used to determine how numerical responses vary between categorical groups of data. The Student's *t*-test performed earlier is a type of oneway analysis restricted to two groups.
- This [chapter on Oneway ANOVA](https://www.biostathandbook.com/onewayanova.html) from the Handbook of Biological Statistics was used to interpret the output from the test. 
- A Oneway ANOVA was performed in order to determine the statistical significance of the differences in sepal length between the 3 species.
- Scipy's ``stats.f_oneway`` function was used to perform the analysis. The function takes two or more array-like objects, one per group, therefore the Pandas ``.get_group()`` function was used to return the sepal length values for each species. Once again, this tutorial on manipulation of GroupBy objects on [Real Python](https://realpython.com/pandas-groupby/#:~:text=You%20call%20.,a%20single%20column%20name%20to%20.) was a useful tool in determining how to pass the array-like objects to Scipy's ``f_oneway`` test. 
- The *p*-value determined by the test is imported from analysis.py and reported in the output from the code cell below:

```python
print(f"The observed significance probability of the difference in mean sepal length between the 3 species of flowers is {analysis.pvalue_sepal_length:.2e}.")
```

```
The observed significance probability of the difference in mean sepal length between the 3 species of flowers is 1.67e-31.
```


- *p*-value $\le$ 0.05: From the Oneway ANOVA performed, there is a statistically significant difference in sepal lengths between the 3 species of flower, at the 5 % significance level.
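
The sketch below shows one way of passing the per-species groups to ``stats.f_oneway`` using ``.get_group()``. It uses the sepal lengths of the first three rows of each species, so the printed *p*-value will differ from the one reported above for the full data set. The variable names are illustrative.

```python
import pandas as pd
from scipy import stats

# Sepal lengths for the first three rows of each species.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
})

# Split the sepal lengths by species, then pull out one Series per group.
grouped = iris.groupby("species")["sepal_length"]
samples = [grouped.get_group(name) for name in iris["species"].unique()]

# f_oneway accepts each group as a separate positional argument.
result = stats.f_oneway(*samples)
print(f"F = {result.statistic:.3f}, p-value = {result.pvalue:.2e}")
```
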

### Bivariate analysis summary
The plot below displays a matrix of scatterplots for each pair of variables in the Iris data set, coloured by species. This matrix was built using Seaborn's pairplot graphing function, and provides a neat overview of the correlations between the continuous, numeric data in the Iris data set. There seems to be clustering of data points based on species, and some separation of different groups of data. Multivariate data analysis methods can be applied to achieve a better understanding of this phenomenon, and capture the interactions that occur between variables in a multivariate data set.
<br>

<img src="Scatterplot Matrix For Each Pair Of Variables in the Iris Data Set.png" height = "800"/>

## 4. Multivariate Data Analysis
While the bivariate analysis carried out thus far was successful in identifying correlations between each of the variables, with this type of analysis it is often not possible to visualize so many variables at once. Analysis in a univariate or bivariate manner cannot capture interactions occurring between variables that might be contributing to variation in the data. In order to appropriately analyze the entire data set, multivariate analysis through PCA was carried out.
### Principal Component Analysis (PCA)
PCA is a dimension reduction tool that takes multiple variables and transforms them into a smaller set of variables (known as principal components) that still contain most of the information in the data set.
- A principal component is a linear combination of the original set of variables: Sepal length, sepal width, petal length and petal width.
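- The weight that each variable contributes to a principal component is its loading, which is an element of that component's eigenvector. The sign of the loading (negative or positive) indicates the direction in which that variable contributes to the PC, while the eigenvalue of a component gives the amount of variance that the component explains. 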
- The variables are combined in such a way that the most information, or variation, contained in the data is described by the least number of dimensions.
- The original variables in the data frame are reduced to the same number of principal components, such that the greatest variance in the data is explained by the first few principal components, i.e. the first principal component will explain most of the variance in the data set, followed by PC2, followed by PC3 and so on.
- In this way, a data frame is generated where most of the variation in the data is contained within the first two or three variables, which is much easier to visualise and analyse. 
    - [Data Camp tutorial on PCA in Python](https://www.datacamp.com/tutorial/principal-component-analysis-in-python)





#### PCA with Scikit Learn: The code
- PCA was carried out on the data set using the Scikit learn library.
    - [Scikit Learn PCA example with Iris Data-set](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html)
- The first step was to generate a PCA object with 5 components.  
- This PCA object was used to reduce the 5 variables in our data to 5 principal components. 
- The next step was to apply the ``fit_transform()`` method to the PCA object to find the scores of each sample in the Iris data frame. The method can be applied to a numeric dataframe to determine the scores of each sample in the data set on each of the principal components. 
- This method returns a NumPy array with a row for each sample and a column for each principal component, where each entry is the score of that sample on that PC. 
- A 2D scores plot, coloured by species, was generated for principal component 1 and 2 by indexing the returned NumPy array. 
- This [tutorial on generating custom legends in Matplotlib](https://python-graph-gallery.com/custom-legend-with-matplotlib/), from the Python Graph Gallery, was followed in order to colour each point on the scores plot by species. 
- The next step was to generate a loadings matrix. 
- This tutorial on Github on [How to compute PCA loadings and the loading matrix](https://scentellegher.github.io/machine-learning/2020/01/27/pca-loadings-sklearn.html#:~:text=Loadings%20with%20scikit%2Dlearn&text=The%20columns%20of%20the%20dataframe,to%20the%20corresponding%20principal%20component) was used.
- The loadings matrix describes the magnitude of the contribution of individual variables toward each component. 
- To compute the loadings of each variable, the ``components_`` of the PCA object must be accessed.
- These were converted into a Pandas dataframe that could be indexed to generate a loadings plot. 
- Each point in the loadings plot was labelled with the name of the variable it pertained to, by applying the function ``enumerate()`` to the list of column names in the original data frame. This yields pairs of an index and a list item, which were assigned to the variables ``i`` and ``column`` in a for loop in order to annotate each point in the scatterplot individually. 
- [This Matplotlib tutorial](https://matplotlib.org/2.0.2/examples/mplot3d/text3d_demo.html) on use of the ``annotate()`` function to label each point in a scatterplot in Matplotlib was used to label each variable in the Loadings plot. 
- The [Matplotlib user guide](https://matplotlib.org/stable/users/explain/text/annotations.html#sphx-glr-users-explain-text-annotations-py) on annotation and text in Matplotlib was also very useful in generating this code.
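
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code in analysis.py: it loads the copy of the Iris data bundled with Scikit learn and uses only the four numeric measurements, so it fits four components and its numbers will not match the five-component output reported in this README. The loadings here are the raw ``components_`` values; some definitions additionally scale them by the square root of the explained variance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: the bundled copy of the Iris data, numeric measurements only.
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)

# One principal component per original variable.
pca = PCA(n_components=X.shape[1])

# Scores: one row per sample, one column per principal component.
scores = pca.fit_transform(X)

# Loadings: components_ has one row per PC, so transpose to get variables as rows.
pc_names = [f"PC{i + 1}" for i in range(pca.n_components_)]
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=pc_names)

# Proportion of variance explained by each component (the scree values).
scree = pd.Series(pca.explained_variance_ratio_, index=pc_names)

# enumerate() yields an index and a column name, as used when labelling points.
for i, column in enumerate(X.columns):
    print(i, column, round(loadings.iloc[i, 0], 3), round(loadings.iloc[i, 1], 3))
print(scree.round(4))
```

In a plotting script, the body of the ``for`` loop would call Matplotlib's ``ax.annotate()`` with the variable name and its PC1 and PC2 loadings instead of printing them.
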

#### PCA: The Output
<img src="Iris Principal Component Analysis.png"/>

**Scores plot:**
- The scores plot is a 2 dimensional representation of samples across the first two principal components, with each point on the plot representing a sample.
- Samples that are similar will group (cluster) together.
- Exploration of how samples (or clusters of samples) are separated along the PCs reveals which variables are influencing the separation of different groups.   
- If samples are separated along the PC1 axis, that means whatever variable has a high loading on PC1 contributes to the variation between the samples.

**Loadings plot:** 
- The Loadings plot describes the magnitude of the contribution of individual variables toward each component
- The closer the variable is to the origin (loading of 0 on both PC1 and PC2), the less important that variable is for explaining the variance in the data.
- Variables that are clustered together in the loadings plot are positively correlated.
    - In the loadings plot above, sepal width and sepal length are positively correlated, and contribute strongly to variation along the PC2 axis. 
- Variables that are located on the opposite sides of the origin are inversely correlated to each other.
    - In the loadings plot above, sepal width is inversely correlated with petal length, and petal length contributes strongly to PC1. 
- Interpretation of where each sample falls on the scores plot and the variables lying in the corresponding area in the loadings plot can reveal which variables are contributing to the variation in that cluster of samples.  
- Samples that have negative values for PC1 on the scores plot will have relatively higher values for variables on the negative side of the PC1 axis on the loadings plot, and conversely samples that have positive values along PC1 on the scores plot will have relatively higher values for variables contributing positively to PC1. 
- In this case, there are no variables on the negative side of the PC1 axis; however, petal length contributes strongly to the positive end of the PC1 axis.
    - The setosa species samples cluster toward the negative end of the PC1 axis. These flowers have relatively shorter petal length.
    - The cluster of samples for the versicolor species falls further along the PC1 axis. This species has relatively longer petal length. 
    - The virginica species cluster falls furthest along the PC1 axis. Samples in this cluster display the longest petal length.  
- Sepal width has a loading very close to 0 on PC1, which suggests that sepal width contributes little to the separation of the species along the PC1 axis.

**Scree plot:**
- The scree plot shows how much variance in the data set (i.e. how much information) is explained by each of the principal components. 
- The scree dataframe imported from analysis.py is printed below for a more precise look at the variance explained by each of the principal components. 
- PC1 captures 92.3 % of the information comprised in the data. 
- PC2 captures 4.8 % of the information comprised in the data.
- Together, PC1 and PC2 explain 97.1 % of the variance in the data.


```python
import analysis
print(analysis.scree)
```

```
          0
1  0.922640
2  0.048104
3  0.018300
4  0.007001
5  0.003955
```
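
The cumulative variance explained can be checked directly from the scree values printed above. The sketch below simply copies those five numbers into a NumPy array; the variable names are illustrative.

```python
import numpy as np

# Proportion of variance explained by each PC, copied from the printed scree output.
explained = np.array([0.922640, 0.048104, 0.018300, 0.007001, 0.003955])

# The running total shows how much variance the first k components explain together.
cumulative = np.cumsum(explained)

for k, total in enumerate(cumulative, start=1):
    print(f"First {k} PC(s): {total:.1%}")
```
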

