#### Project for Fundamentals of Data Analytics

![image.png](attachment:image.png)

##### Project Brief: 
- The project is to create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher. 
-  In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.
-  Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.
-  Select, demonstrate, and explain the most appropriate plot(s) for each variable.
- The notebook should follow a cohesive narrative about the data set. 

##### Contents

1. Introduction
2. Classification of variables
3. Summary Statistics to describe each variable
4. Plotting and Analysis of Data
5. Conclusion

##### 1. Introduction

The Iris flower dataset, originally introduced by Ronald A. Fisher in 1936, is a cornerstone in the field of data analysis and machine learning. This dataset features measurements of sepal and petal lengths and widths for three species of iris flowers (Setosa, Versicolor, and Virginica).

There are 50 samples for each species, making a total of 150 samples. The measurements are in centimeters and consist of sepal length, sepal width, petal length, and petal width.

The data set is often used for statistical analysis, visualization, and machine learning algorithms, such as classification and clustering. It is also used as a benchmark data set for evaluating new methods and algorithms.
Fisher's Iris data set is considered a classic example of exploratory data analysis and is widely used in data science education and research.

References: https://www.angela1c.com/projects/iris_project/the-iris-dataset/

For the purpose of this assignment, I will address each element of the brief above, exploring and presenting the data in the Iris dataset in order to improve my data analysis skills.

##### Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

*Pandas* is needed in the first section, which imports the data.

*Matplotlib.pyplot* provides the plotting functions needed to create the histograms.

Although I began with *matplotlib.pyplot* alone, further research showed that the *seaborn* library produced clearer plots, so I use it for the later visualisations.


##### Importing the Dataset

In [None]:
#importing the dataset
iris_data = pd.read_csv('iris.csv') 
#Displaying the information about the Data set
iris_data.info()
#The first few rows of the dataset
iris_data.head()

Here we see the following useful information:

Five Data columns: Indicates that the DataFrame has a total of 5 columns.

The columns are listed with their respective information: sepal.length, sepal.width, petal.length, petal.width: The names of the numeric columns, each with 150 non-null entries.

variety: The name of the categorical column, representing the species variety. It has 150 non-null entries and is of data type object, typically indicating a string or mixed data type.

Below, the output shows a number of summary statistics that are useful to review before presenting and analysing the data.

In [None]:
#Gives statistics about the Data
iris_data.describe()

Now the code checks if the data has any null values (it does not).

In [None]:
# checking for null values in the Data
iris_data.isnull().sum()

##### 2. Classification of Variables

The Iris Dataset includes measurements of the length and width of sepals and petals for 50 samples each of three Iris species—namely, Iris setosa, Iris virginica, and Iris versicolor. These measurements served as the basis for constructing a linear discriminant model designed to classify the different species. Reference: http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html#:~:text=The%20Iris%20Dataset%20contains%20four,model%20to%20classify%20the%20species. 

Nominal (Species/variety):
Here we see different types of iris flowers: Setosa, Versicolor, and Virginica. This variable is nominal because it is based on name only, with no type of flower being greater or lesser than another. In Python these are categorical variables.

Numeric Variables:
With the variety of Iris classified as nominal, the rest of the data is numerical (Sepal Length, Sepal Width, Petal Length, Petal Width). These are numbers we can measure, compare and analyse. In Python these are numerical variables.

Reference used: https://statistics.laerd.com/statistical-guides/types-of-variable.php
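
The split between nominal and numeric variables described above can also be checked in code. The following is a minimal sketch, assuming a small hand-typed sample laid out like `iris.csv` (the `sample` DataFrame and its six rows are for illustration only, not the full data set). It stores the variety column as an unordered pandas categorical and then separates the columns by type.

```python
import pandas as pd

# Small hand-typed sample in the same layout as iris.csv (illustration only).
sample = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal.length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal.width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

# Store the nominal variable as an unordered pandas categorical.
sample["variety"] = sample["variety"].astype("category")

# Split the columns by type: numeric measurements vs. categorical labels.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
# An unordered categorical matches the nominal scale: no species ranks above another.
print("Is variety ordered?", sample["variety"].cat.ordered)
```

The same calls should work on `iris_data` once it has been loaded, provided the column names match those used here.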

##### 3. Summary Statistics to describe each variable

Brief: Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.

Building on section one, which imported the dataset and gave general information about it, and section two, which classified the variables, it is now appropriate to apply suitable summary statistics to the DataFrame.

Numeric Variables: 

For the numeric variables I believe it is important to derive the following from the data: mean, median, standard deviation, minimum and maximum.

Species/Variety:

For the categorical variable I believe that count and mode may be the only two summary statistics that can be meaningfully applied to this data.

In [None]:
import pandas as pd

#Loading dataset
iris = pd.read_csv('iris.csv')

#Basic statistics for Sepal Length
print('Sepal Length Statistics:')
print('Min:', iris['sepal.length'].min())
print('Max:', iris['sepal.length'].max())
print('Mean:', iris['sepal.length'].mean())
print('Median:', iris['sepal.length'].median())
print('Std Dev:', iris['sepal.length'].std())

#Overall summary statistics
print('\nOverall Summary Statistics:')
print(iris.describe())

#Mean Sepal Length by Species
print('\nMean Sepal Length by Species:')
print(iris.groupby('variety')['sepal.length'].mean())

#Mean Sepal and Petal Length by Species
print('\nMean Sepal & Petal Length by Species:')
print(iris.groupby('variety')[['sepal.length', 'petal.length']].mean())

#Summary statistics of Sepal Length by Species
print('\nSummary Statistics of Sepal Length by Species:')
print(iris.groupby('variety')['sepal.length'].describe())

#Counts for different species
print('\nCounts for Different Species:')
print(iris['variety'].value_counts())

References used: 

https://shiny.abdn.ac.uk/Stats/R_Python_tutorial/example-summary-statistics-in-python.html
https://akshay-a.medium.com/descriptive-statistics-with-pandas-on-iris-data-beginner-bbc4422597ea
https://www.geeksforgeeks.org/python-basics-of-pandas-using-iris-dataset/ 
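
The statistics chosen above can also be gathered in a single step. This is a minimal sketch, assuming a small hand-typed sample in the same layout as `iris.csv` (the `sample` DataFrame is illustrative only). It uses `groupby(...).agg(...)` for the numeric columns and `value_counts()` and `mode()` for the categorical column.

```python
import pandas as pd

# Small hand-typed sample in the same layout as iris.csv (illustration only).
sample = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal.length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal.width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

# Numeric variables: mean, standard deviation, minimum and maximum per species.
numeric_summary = sample.groupby("variety").agg(["mean", "std", "min", "max"])
print(numeric_summary)

# Categorical variable: count per category and the mode (most frequent value).
counts = sample["variety"].value_counts()
modes = sample["variety"].mode().tolist()
print(counts)
print("Mode(s):", modes)
```

Because the full data set has 50 samples of each species, `mode()` would return all three varieties there as well, which is worth bearing in mind when interpreting it.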

##### 4. Plotting and Analysis of Data

In this section I will plot the data as appropriately as I can in order to analyse it. To do this, I first needed to import the libraries required for the task. As Pandas was already imported earlier in this notebook to load the iris dataset, I did not need to import it again. References: https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/

The libraries needed are as follows: 

*Matplotlib.pyplot* provides the plotting functions needed to create the histograms.

*Seaborn* is used in addition to Matplotlib.pyplot for the further visualisations of the data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

a.	Histograms of Variables:

I first used histograms to visualize the distribution of each variable and to quickly understand the spread and central tendency of each feature (sepal length, sepal width, petal length, petal width). I hoped this might identify patterns, or a lack of patterns, in the data. I used a loop to create a histogram for each column in order to keep the code tidy.

In [None]:
#histograms of variables
for col in iris_data.columns[:-1]:
    plt.hist(iris_data[col], bins=50)
    plt.title(col)
    plt.xlabel('measurement (cm)')
    plt.ylabel('frequency')
    plt.show()

References:  

https://stackoverflow.com/questions/62118646/i-loop-through-data-frame-graph-histogram-for-each-column-use-column-name-as-g 

https://datatofish.com/plot-histogram-python/
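
To make clear what a histogram actually computes, the sketch below counts how many values fall into each bin using `numpy.histogram`. It assumes a handful of hand-typed petal lengths (illustrative values only) and an arbitrary choice of three bins. With 150 observations spread over 50 bins there are only about three values per bin on average, so it may be worth experimenting with fewer bins.

```python
import numpy as np
import pandas as pd

# A few hand-typed petal lengths in cm (illustration only, not the full data set).
petal_length = pd.Series([1.4, 1.4, 4.7, 4.5, 6.0, 5.1], name="petal.length")

# np.histogram splits the range min..max into equal-width bins
# and counts how many values fall into each one.
counts, edges = np.histogram(petal_length, bins=3)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f} cm: {n} value(s)")
```

`plt.hist` performs the same binning before drawing the bars, so the `bins` argument controls the trade-off between detail and noise in the plots above.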

b.	Scatter Plots:

I used scatter plots to help see the relationship between two variables. This once again helps to find patterns (if any) and to see any groups within the data. I also incorporated hue into the scatter plots to distinguish the three species.

In [None]:
#Petal Width vs Petal Length
sns.scatterplot(x='petal.length', y='petal.width', hue='variety', data=iris_data)
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

In [None]:
#Sepal Width vs Sepal Length
sns.scatterplot(x='sepal.length', y='sepal.width', hue='variety', data=iris_data)
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

References: https://www.w3schools.com/python/matplotlib_scatter.asp 

c.	Pairplot:

A pairplot displays scatter plots for every pair of numeric features and, on the diagonal, the distribution of each individual feature. Pairplots help to quickly visualize relationships between all pairs of variables.

In [None]:
#Pairplot of all four measurements, coloured by species
sns.pairplot(iris_data, hue='variety', height=2)
plt.show()

References: https://seaborn.pydata.org/generated/seaborn.pairplot.html

Analysis of Data

First, I noticed that sepal sizes differ across the three species: sepal lengths are largest in the Virginica species and shortest in the Setosa species. Setosa, however, has the larger sepal width.

Secondly, I studied the petal dimensions. The Setosa species has the smallest petals in terms of length, whereas the Virginica species has the longest petals. Versicolor appears to be somewhere in between in terms of size.

Thirdly, the pairplot shows that petal length and petal width are correlated, which may indicate that longer petals are generally wider.

Finally, one of the most prominent findings is that Setosa appears to be distinct from the other two species in its widths and lengths, whereas Versicolor and Virginica are quite similar in size.
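
The visual impressions above can be backed up with numbers. The following is a minimal sketch, assuming a small hand-typed sample in the same layout as `iris.csv` (the `sample` DataFrame is illustrative only, so the exact figures will differ on the full data set). It computes the Pearson correlation between petal length and petal width, and the mean petal dimensions for each species.

```python
import pandas as pd

# Small hand-typed sample in the same layout as iris.csv (illustration only).
sample = pd.DataFrame({
    "petal.length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal.width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

# Pearson correlation between petal length and petal width
# (values close to 1 mean longer petals tend to be wider).
petal_corr = sample["petal.length"].corr(sample["petal.width"])
print("Correlation of petal length and width:", round(petal_corr, 3))

# Mean petal dimensions per species, to compare the three groups directly.
species_means = sample.groupby("variety")[["petal.length", "petal.width"]].mean()
print(species_means)
```

Running the same two calls on `iris_data` would show whether the pattern seen in the plots holds across all 150 samples.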
