Multivariate Statistics and Machine Learning

Project Objective

This report simplifies the provided dataset for further analysis through clustering, specifically with the K-means algorithm. Because the algorithm requires the number of clusters K as user input, the clustering is run over a range of K values, and the sums of squares of the resulting fits are compared to find the optimal K.
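The comparison across a range of K can be sketched in base R as follows. This is a minimal illustration, not the code from the report: the data are hypothetical simulated groups rather than the penguin measurements, and the range 1 to 8, `nstart = 25` and `iter.max = 50` are assumed values.

```r
set.seed(42)

# Hypothetical stand-in data (not the penguin measurements):
# three groups of 50 points in 4 dimensions, scaled to unit variance.
X <- scale(rbind(
  matrix(rnorm(50 * 4, mean = 0), ncol = 4),
  matrix(rnorm(50 * 4, mean = 4), ncol = 4),
  matrix(rnorm(50 * 4, mean = 8), ncol = 4)
))

# Fit K-means for each K and record within- and between-group sums of squares.
k_values <- 1:8
ss <- as.data.frame(t(sapply(k_values, function(k) {
  fit <- kmeans(X, centers = k, nstart = 25, iter.max = 50)
  c(k = k, within = fit$tot.withinss, between = fit$betweenss)
})))
print(round(ss, 1))

# Plot both curves against K; the plot is written to a temporary PDF.
out_file <- file.path(tempdir(), "ss_sketch.pdf")
pdf(out_file)
matplot(ss$k, as.matrix(ss[, c("within", "between")]), type = "b",
        pch = 1:2, lty = 1, col = c("red", "blue"),
        xlab = "Number of clusters K", ylab = "Sum of squares")
legend("right", legend = c("Within-group SS", "Between-group SS"),
       pch = 1:2, lty = 1, col = c("red", "blue"))
invisible(dev.off())
```

The K at which the within-group curve stops dropping sharply (the "elbow") is the usual candidate for the optimal number of clusters.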

-- Project Status: [Completed]

Methods used

Multivariate statistics
K-means clustering
Sum of squares (SS) analysis

Technologies

LaTeX
R

Project Description

The data used for this project are the Palmer Penguins data, restricted to four variables: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. With n = 333 observations, the first section of the report explores the dataset through descriptive statistics to choose the user-input number of clusters K. The covariance matrix is also obtained and studied so that the data can be scaled before clustering.
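A minimal sketch of this exploratory step is shown below. It assumes the data are read from the `palmerpenguins` R package, which may not be how the report loads them. If the package is not installed, a synthetic stand-in with the same column names (hypothetical values) is used so that the example still runs.

```r
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  # Assumption: rows with missing values are dropped.
  penguins <- na.omit(as.data.frame(palmerpenguins::penguins))
} else {
  # Synthetic stand-in with hypothetical values, used only if the package is missing.
  set.seed(1)
  penguins <- data.frame(
    bill_length_mm    = rnorm(333, 44, 5.5),
    bill_depth_mm     = rnorm(333, 17, 2),
    flipper_length_mm = rnorm(333, 201, 14),
    body_mass_g       = rnorm(333, 4200, 800)
  )
}

X <- penguins[, vars]

# Descriptive statistics for each variable.
print(summary(X))

# Covariance matrix: body_mass_g is in grams, so its variance dominates
# the other variables unless the data are scaled.
S <- cov(X)
print(round(S, 2))

# Centre each variable and scale it to unit standard deviation.
X_scaled <- scale(X)
```

After scaling, the covariance matrix of `X_scaled` equals the correlation matrix of `X`, so no single variable dominates the Euclidean distances used by K-means.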

The second section explains the methods used to cluster the data, with an overview of the algorithm and the R commands used. It also details how the optimal number of clusters is identified, as a reference for the discussion in the report.
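As an illustration of the algorithm, the sketch below writes out the two alternating steps of K-means (assign each point to its nearest centre, then recompute each centre as its cluster mean) and compares the result with base R's `kmeans()`. The helper `lloyd()` and the simulated data are hypothetical and are not taken from the report. Note that `kmeans()` uses the Hartigan-Wong algorithm by default; `algorithm = "Lloyd"` is requested here only so that the two runs are comparable.

```r
set.seed(42)

# Hypothetical stand-in data: three groups of 50 points in 4 dimensions.
X <- scale(rbind(
  matrix(rnorm(50 * 4, mean = 0), ncol = 4),
  matrix(rnorm(50 * 4, mean = 4), ncol = 4),
  matrix(rnorm(50 * 4, mean = 8), ncol = 4)
))

# Hypothetical helper: plain Lloyd iterations (no handling of empty clusters).
lloyd <- function(X, centers, max_iter = 100) {
  K <- nrow(centers)
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centre (n x K).
    d2 <- sapply(seq_len(K), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    # Assignment step: index of the nearest centre for each point.
    cluster <- max.col(-d2, ties.method = "first")
    # Update step: each centre becomes the mean of its assigned points.
    new_centers <- t(sapply(seq_len(K), function(j) {
      colMeans(X[cluster == j, , drop = FALSE])
    }))
    if (all(abs(new_centers - centers) < 1e-12)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers,
       tot.withinss = sum((X - centers[cluster, ])^2))
}

# Start both runs from the same three data points (one per simulated group).
init <- X[c(1, 51, 101), ]
manual <- lloyd(X, init)
fit <- kmeans(X, centers = init, algorithm = "Lloyd", iter.max = 100)

print(table(manual = manual$cluster, kmeans = fit$cluster))
print(c(manual = manual$tot.withinss, kmeans = fit$tot.withinss))
```

In practice a call such as `kmeans(X, centers = 3, nstart = 25)` with several random starts is the usual choice, since a single run can stop at a local optimum.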

The third and final section of the report discusses the results of the methods from the previous section, using K = 3, which was chosen by inspecting the scatterplots of the variables against one another. The cluster counts are compared with two known groupings provided in the dataset, L.species and L.islands, to see how the variables affect the clustering. These counts are then visualised through scatterplots of two of the variables, where colour indicates the K-means cluster and the point character indicates the known group. The clustering matches the species classification more closely than the island grouping, since fewer samples are misclassified. The discussion concludes with a plot of the within-group and between-group variation against the number of clusters, which indicates that the chosen K is the optimal number of clusters.
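The comparison of cluster counts with the known groupings can be sketched with a cross-tabulation, as below. This is an illustration under assumptions: the data are taken from the `palmerpenguins` package if it is installed (otherwise a synthetic stand-in with hypothetical values is generated), the labels are read from its `species` and `island` columns rather than the report's `L.species` and `L.islands` objects, and `mismatch()` is a hypothetical helper that counts observations outside the majority label of their cluster.

```r
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  penguins <- na.omit(as.data.frame(palmerpenguins::penguins))
} else {
  # Synthetic stand-in with hypothetical values, used only if the package is missing.
  set.seed(1)
  n <- 333
  species <- factor(sample(c("Adelie", "Chinstrap", "Gentoo"), n, replace = TRUE))
  shift <- as.integer(species)
  penguins <- data.frame(
    species           = species,
    island            = factor(sample(c("Biscoe", "Dream", "Torgersen"), n, replace = TRUE)),
    bill_length_mm    = rnorm(n, 35 + 5 * shift, 3),
    bill_depth_mm     = rnorm(n, 20 - 1.5 * shift, 1),
    flipper_length_mm = rnorm(n, 180 + 12 * shift, 6),
    body_mass_g       = rnorm(n, 3000 + 600 * shift, 400)
  )
}

set.seed(123)
fit <- kmeans(scale(penguins[, vars]), centers = 3, nstart = 25)

# Counts of each known group within each K-means cluster.
tab_species <- table(cluster = fit$cluster, species = penguins$species)
tab_island  <- table(cluster = fit$cluster, island = penguins$island)
print(tab_species)
print(tab_island)

# Hypothetical helper: observations not in the majority label of their cluster.
mismatch <- function(tab) sum(tab) - sum(apply(tab, 1, max))
print(c(species = mismatch(tab_species), island = mismatch(tab_island)))

# Scatterplot of two variables: colour = K-means cluster, point character = species.
out_file <- file.path(tempdir(), "clusters_sketch.pdf")
pdf(out_file)
plot(penguins$bill_length_mm, penguins$flipper_length_mm,
     col = fit$cluster, pch = as.integer(penguins$species),
     xlab = "bill_length_mm", ylab = "flipper_length_mm")
invisible(dev.off())
```

A smaller mismatch count for one grouping suggests that the clusters follow that grouping more closely.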

The final report can be read here.

Needs of this project

Machine learning
Writeup

Author

Nurfahimah Mohd Ghazali

