Iris has become an extremely popular dataset for beginners learning the basics of visualization and machine learning, and unfortunately most projects built on it end up reinventing the wheel. My goal for this project was to showcase a new approach to visualizing this data and to discover new ways of researching something that has been done millions of times before. First, I wanted to explore the Plotly library and see whether, by manipulating the data the right way, I could create unique graphs that tell a story. Second, I wanted to develop 2D representations of a classification algorithm and its decision boundary.
I downloaded the Iris Species dataset from Kaggle, where it is provided by UCI Machine Learning. The Iris dataset was used in R.A. Fisher's classic 1936 paper "The Use of Multiple Measurements in Taxonomic Problems", and can also be found on the UCI Machine Learning Repository.
- Creating individual and combined custom Scatter plots to showcase clusters of Sepal and Petal measurements
- Creating a Coordinates plot to demonstrate a new way of looking at the distribution of measurements, with each point passing through established range windows
- Transforming the data, establishing 25%-75% IQR ranges, and identifying normal and outlier observations via a Categorical Coordinates plot
- Determining the best features, number of nearest neighbors, and boundary points for ML KNN classification and its 2D representation (taking overfitting and underfitting results into account)
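
The IQR step in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `iqr_labels`, the `whisker` parameter, and the sample values are all assumptions made for the example. It also assumes that "normal" means a value falls inside the 25%-75% range itself; setting `whisker=1.5` switches to the more common Tukey fences.

```python
from statistics import quantiles


def iqr_labels(values, whisker=0.0):
    """Label each value "normal" or "outlier" relative to the 25%-75% range.

    Hypothetical helper for illustration only.
    whisker=0.0 keeps only values inside [Q1, Q3];
    whisker=1.5 widens the window to the usual Tukey fences.
    """
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return ["normal" if low <= v <= high else "outlier" for v in values]


# Made-up measurements, not rows from the Iris dataset.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
labels = iqr_labels(sample)
```

Labels like these could then serve as the categorical dimension of a plot, applied to one measurement column per species.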