This repository contains an exploratory data analysis (EDA) of the famous Iris dataset. The Iris dataset is a classic dataset in the field of machine learning and is often used for classification tasks. In this analysis, we will explore the dataset, visualize its features, and gain insights into the relationships between different species of Iris flowers.
The Iris dataset consists of three species of Iris flowers (Setosa, Versicolor, and Virginica), each with four features: sepal length, sepal width, petal length, and petal width. The dataset contains 150 samples, with 50 samples per species.
- Loaded the Iris dataset using a Python library (e.g., Pandas).
- Displayed basic information about the dataset, such as the number of samples, features, and species.
- Calculated summary statistics (mean, median, standard deviation, min, max) for each feature.
- Identified potential outliers based on summary statistics.
- Created histograms to visualize the distribution of each feature for all species.
- Plotted scatter plots to visualize relationships between different pairs of features.
- Utilized box plots to identify potential outliers and compare feature distributions across species.
- Constructed a heatmap to visualize the correlation between different features.
- Explored how features correlate with each other to identify potential patterns.
- Visualized the distribution of each feature across species using box plots.
- Calculated the mean and standard deviation of each feature for each species.
- Setosa species generally has smaller sepal lengths and widths compared to Versicolor and Virginica.
- Virginica species tends to have larger petal lengths and widths.
- Sepal length and sepal width are not strongly correlated, while petal length and petal width show a strong positive correlation.
- Potential outliers were identified in the sepal width feature of the Setosa species.
- Further analysis could involve applying machine learning algorithms for species classification.
- Outliers could be investigated in more detail to determine whether they are genuine data points or measurement errors.
- Additional visualizations, such as 3D scatter plots, could be explored to gain deeper insights.
iris_dataset.csv
: The original Iris dataset in CSV format.edairisdataset.ipynb
: Jupyter Notebook containing the complete EDA process.