Using exploratory data analysis (EDA), I have tried to explore the Iris data set, one of the oldest data sets, introduced in 1936. For information on the Iris data set, read this.
- EDA is the task of analyzing the given data using tools from statistics, linear algebra, plotting, and so on.
- It is the first thing to do before building a machine learning model.
- Here, we explore the given data as thoroughly as we can, hence the name "exploratory".
Note: While exploring the data, one should also try to clean it using data cleaning techniques such as deduplication, removing extra spaces, converting text to the proper case, and spell checking. To learn about some data cleaning and pre-processing techniques, check my notebook here.
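
As a rough illustration of the cleaning steps mentioned in the note, here is a minimal sketch using pandas. The toy records and the helper name `clean_text_column` are made up for this example and are not taken from the notebook; spell checking is left out because it needs an external dictionary.

```python
import pandas as pd

# Hypothetical toy records, invented only to illustrate the cleaning steps.
raw = pd.DataFrame({
    "species": ["  iris setosa", "Iris Setosa ", "IRIS  VIRGINICA", "iris versicolor"],
    "petal_length": [1.4, 1.4, 6.0, 4.5],
})


def clean_text_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Normalize whitespace and case in one text column, then drop duplicate rows."""
    cleaned = df.copy()
    cleaned[column] = (
        cleaned[column]
        .str.strip()                           # remove leading/trailing spaces
        .str.replace(r"\s+", " ", regex=True)  # collapse repeated inner spaces
        .str.title()                           # convert to a consistent case
    )
    # Deduplicate only after normalizing, so near-identical rows are caught.
    return cleaned.drop_duplicates().reset_index(drop=True)


cleaned = clean_text_column(raw, "species")
print(cleaned)
```

Normalizing the text before calling `drop_duplicates` matters here: the first two rows differ only in spacing and case, so they collapse into one row only after the cleanup.
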
- The Iris data set used for this problem is a balanced data set.
- That is, each species of flower in the data set has an equal number of data points, as given below:
  - Iris Virginica : 50
  - Iris Versicolor : 50
  - Iris Setosa : 50
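
The class counts listed above can be checked with a short sketch. It assumes the copy of the data that ships with scikit-learn (`load_iris`) rather than the file used in the original analysis, and the column name `species` is chosen here for illustration.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the data, not the author's own file.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Replace the integer class labels (0, 1, 2) with the species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Count the rows per species; a balanced data set has the same count for every class.
counts = df["species"].value_counts()
print(counts)

is_balanced = counts.nunique() == 1
print("Balanced:", is_balanced)
```

With this copy of the data, each of the three species should show 50 rows, matching the list above.
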
The following features of the flowers are used for the analysis:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
- Univariate Analysis
  - An analysis performed on a single variable (one feature) is known as a univariate analysis.
  - The following plots fall under univariate analysis:
    - PDFs (probability density functions)
    - CDFs (cumulative distribution functions)
    - Box Plots
    - Violin Plots
- Bivariate Analysis
  - An analysis performed on two variables (two features) is known as a bivariate analysis.
  - The following plots fall under bivariate analysis:
    - Pair-Plots
    - Scatter Plots
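
The plot types listed above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses scikit-learn's bundled copy of the data, plain matplotlib and `pandas.plotting.scatter_matrix` rather than whatever plotting library the original notebook used, shortened column names, and a hypothetical helper `empirical_pdf_cdf` that estimates the PDF as the fraction of points per histogram bin.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the data, with shortened column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))


def empirical_pdf_cdf(values, bins=10):
    """Return bin edges, the fraction of points per bin (PDF estimate), and its running sum (CDF)."""
    counts, edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return edges, pdf, cdf


# Univariate analysis: one feature at a time, split by species.
feature = "petal_length"
species = sorted(df["species"].unique())
groups = [df.loc[df["species"] == s, feature].to_numpy() for s in species]

fig_uni, axes = plt.subplots(1, 3, figsize=(15, 4))
for name, values in zip(species, groups):
    edges, pdf, cdf = empirical_pdf_cdf(values)
    axes[0].plot(edges[1:], pdf, label=f"{name} PDF")
    axes[0].plot(edges[1:], cdf, linestyle="--", label=f"{name} CDF")
axes[0].set_xlabel(feature)
axes[0].legend()

axes[1].boxplot(groups)                       # box plot per species
axes[2].violinplot(groups, showmedians=True)  # violin plot per species
for ax in axes[1:]:
    ax.set_xticks(range(1, len(species) + 1))
    ax.set_xticklabels(species)
    ax.set_ylabel(feature)

# Bivariate analysis: a scatter plot of two features, colored by species.
fig_bi, ax_bi = plt.subplots()
for name in species:
    subset = df[df["species"] == name]
    ax_bi.scatter(subset["petal_length"], subset["petal_width"], label=name)
ax_bi.set_xlabel("petal_length")
ax_bi.set_ylabel("petal_width")
ax_bi.legend()

# Pair plot: every feature against every other feature, colored by class label.
pair_axes = pd.plotting.scatter_matrix(
    df.drop(columns="species"), c=iris.target, figsize=(8, 8), diagonal="hist"
)

# Call plt.show() here to display the figures when running this as a script.
```

The univariate panels look at `petal_length` alone, while the scatter plot and the pair plot each compare two features at a time; swapping `feature` for another column name repeats the univariate view for that feature.
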