# CLASSIFYING THE AGE GROUP OF ABALONE FROM PHYSICAL MEASUREMENTS

Data Source: UCI Machine Learning Repository

## Objective of the Analysis
Abalone is considered the most expensive seafood in the world. The price of abalone is highly correlated with its age. Researchers use the number of rings +/- 1.5 years to estimate the age of abalone as the ring is formed once a year. Researchers must cut the shell, stain it, examine the shell sample through a microscope and then count the number of rings. This process of determining the age of abalone can be very cumbersome and costly. Our goal in this study is to classify the age group of abalone based on physical measurements, which can be easily obtained at a lower cost. 

The following multivariate techniques will be applied in our analysis: 
Dimensionality Reduction: Principal Component Analysis

Classification: k Nearest Neighbor, Linear Discriminant Analysis, Quadratic Discriminant Analysis


In [1]:
proc import 
datafile = '~/Abalone.txt'
out = Abalone
dbms = dlm
replace;
delimiter = ',';

SAS Connection established. Subprocess id is 3000



In [2]:
proc contents data=abalone;
run;

0,1,2,3
Data Set Name,WORK.ABALONE,Observations,4177
Member Type,DATA,Variables,9
Engine,V9,Indexes,0
Created,05/20/2020 01:46:50,Observation Length,72
Last Modified,05/20/2020 01:46:50,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information,Engine/Host Dependent Information.1
Data Set Page Size,65536
Number of Data Set Pages,5
First Data Page,1
Max Obs per Page,908
Obs in First Data Page,866
Number of Data Set Repairs,0
Filename,/tmp/SAS_work0F2C00000BB8_localhost.localdomain/abalone.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,144279

Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Informat
3,DIAMETER,Num,8,BEST12.,BEST32.
4,HEIGHT,Num,8,BEST12.,BEST32.
2,LENGTH,Num,8,BEST12.,BEST32.
9,RINGS,Num,8,BEST12.,BEST32.
1,SEX,Char,1,$1.,$1.
8,SHELLWEIGHT,Num,8,BEST12.,BEST32.
6,SHUCKEDWEIGHT,Num,8,BEST12.,BEST32.
7,VISCERAWEIGHT,Num,8,BEST12.,BEST32.
5,WHOLEWEIGHT,Num,8,BEST12.,BEST32.


The Abalone dataset [1] obtained from the UCI Machine Learning Repository, contains 4,177 observations, with 9 different variables. For the purpose of this study, we will exclude the SEX variable and only use the physical numerical measures as the predictors variables for our statistical model. All variables used in this analysis are measured on a continuous scale.


In [3]:
proc sgplot data=abalone;
  histogram rings;
  density rings;
run;

The number of rings varies from 1 to 29. We decided to partition this variable into three groups using the following rules: 

RINGS < 9 => ageRings: 1

9 <= RINGS <= 10 => ageRings: 2

RINGS > 10 => ageRings: 3

After the partition process, each age group contains approximately 33% of the dataset.

In [7]:
/* Drop variable 'SEX' */
data Abalone2;
set Abalone;
keep LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT RINGS;
RUN;

/* Define the age group */
data Abalone3;
set Abalone2;
if RINGS <9 then ageRINGS = 1;
if RINGS <= 10 and RINGS >= 9 then ageRINGS = 2;
if RINGS > 10 then ageRINGS = 3;

/* Distribution of ageRings */
PROC FREQ data=abalone3;
tables agerings;
run;

ageRINGS,Frequency,Percent,Cumulative Frequency,Cumulative Percent
1,1407,33.68,1407,33.68
2,1323,31.67,2730,65.36
3,1447,34.64,4177,100.0


# Data Description
## Univariate Analysis and Bivariate Analysis

In [9]:
proc corr data= abalone2 plots=matrix(histogram);
var LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT RINGS;
run;

0,1
8 Variables:,LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT RINGS

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Variable,N,Mean,Std Dev,Sum,Minimum,Maximum
LENGTH,4177,0.52399,0.12009,2189.0,0.075,0.815
DIAMETER,4177,0.40788,0.09924,1704.0,0.055,0.65
HEIGHT,4177,0.13952,0.04183,582.76,0.0,1.13
WHOLEWEIGHT,4177,0.82874,0.49039,3462.0,0.002,2.8255
SHUCKEDWEIGHT,4177,0.35937,0.22196,1501.0,0.001,1.488
VISCERAWEIGHT,4177,0.18059,0.10961,754.3395,0.0005,0.76
SHELLWEIGHT,4177,0.23883,0.1392,997.5965,0.0015,1.005
RINGS,4177,9.93368,3.22417,41493.0,1.0,29.0

"Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 4177 Prob > |r| under H0: Rho=0"
Unnamed: 0_level_1,LENGTH,DIAMETER,HEIGHT,WHOLEWEIGHT,SHUCKEDWEIGHT,VISCERAWEIGHT,SHELLWEIGHT,RINGS
LENGTH,1.00000,0.98681 <.0001,0.82755 <.0001,0.92526 <.0001,0.89791 <.0001,0.90302 <.0001,0.89771 <.0001,0.55672 <.0001
DIAMETER,0.98681 <.0001,1.00000,0.83368 <.0001,0.92545 <.0001,0.89316 <.0001,0.89972 <.0001,0.90533 <.0001,0.57466 <.0001
HEIGHT,0.82755 <.0001,0.83368 <.0001,1.00000,0.81922 <.0001,0.77497 <.0001,0.79832 <.0001,0.81734 <.0001,0.55747 <.0001
WHOLEWEIGHT,0.92526 <.0001,0.92545 <.0001,0.81922 <.0001,1.00000,0.96941 <.0001,0.96638 <.0001,0.95536 <.0001,0.54039 <.0001
SHUCKEDWEIGHT,0.89791 <.0001,0.89316 <.0001,0.77497 <.0001,0.96941 <.0001,1.00000,0.93196 <.0001,0.88262 <.0001,0.42088 <.0001
VISCERAWEIGHT,0.90302 <.0001,0.89972 <.0001,0.79832 <.0001,0.96638 <.0001,0.93196 <.0001,1.00000,0.90766 <.0001,0.50382 <.0001
SHELLWEIGHT,0.89771 <.0001,0.90533 <.0001,0.81734 <.0001,0.95536 <.0001,0.88262 <.0001,0.90766 <.0001,1.00000,0.62757 <.0001
RINGS,0.55672 <.0001,0.57466 <.0001,0.55747 <.0001,0.54039 <.0001,0.42088 <.0001,0.50382 <.0001,0.62757 <.0001,1.00000


The minimum value of RINGS is 1 whereas the maximum value is 29, indicating that some abalones are very young and some are very old. Thus it is reasonable to classify the RINGS variable into three age groups. Minimum value of HEIGHT is 0, which does not make sense. We noticed that there are only two observations with such value. This could be a typo that happened during the data entry process. 

In [13]:
proc sgscatter data=abalone3;
  matrix LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT;
run;

Based on the Correlation matrix and the Scatterplot, we noticed that Rings variable has the highest correlation with Shell Weight (0.62757), followed by Diameter (0.57466), Height (0.55747) and Length (0.55672). All variables are highly correlated with each other with the exception of Rings. For all variables except Rings, Height is least correlated with other variables. In this study, we will use Principal Component Analysis as a remedial solution for multicollinearity and high dimensionality. 

# DIMENSIONALITY REDUCTION - PRINCIPAL COMPONENT ANALYSIS (PCA)
## S matrix vs R matrix

Principal Component Analysis (PCA) is a dimensionality reduction technique that is often used for datasets with a large number of variables as it helps the user distinguish between the independent variables that are the most important for the prediction versus the variables that are not so important. This technique is beneficial to multivariate analysis because it removes the need to understand the relationship between each variable, reduces the noise within your data and also reduces the risk of overfitting your model.


In [12]:
proc princomp data = abalone3 cov out=result;
var LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT;
run;

0,1
Observations,4177
Variables,7

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Unnamed: 0_level_1,LENGTH,DIAMETER,HEIGHT,WHOLEWEIGHT,SHUCKEDWEIGHT,VISCERAWEIGHT,SHELLWEIGHT
Mean,0.5239920996,0.4078812545,0.1395163993,0.8287421594,0.3593674886,0.1805936079,0.2388308595
StD,0.1200929126,0.0992398661,0.0418270566,0.4903890182,0.221962949,0.1096142503,0.1392026695

Covariance Matrix,Covariance Matrix,Covariance Matrix,Covariance Matrix,Covariance Matrix,Covariance Matrix,Covariance Matrix,Covariance Matrix
Unnamed: 0_level_1,LENGTH,DIAMETER,HEIGHT,WHOLEWEIGHT,SHUCKEDWEIGHT,VISCERAWEIGHT,SHELLWEIGHT
LENGTH,0.0144223076,0.011760825,0.0041569119,0.0544907081,0.0239349454,0.0118872298,0.015007172
DIAMETER,0.011760825,0.009848551,0.0034605472,0.045038182,0.0196742019,0.0097872955,0.0125066369
HEIGHT,0.0041569119,0.0034605472,0.0017495027,0.0168034708,0.0071948868,0.0036601674,0.0047588999
WHOLEWEIGHT,0.0544907081,0.045038182,0.0168034708,0.2404813892,0.1055180319,0.0519461632,0.0652158684
SHUCKEDWEIGHT,0.0239349454,0.0196742019,0.0071948868,0.1055180319,0.0492675507,0.0226749006,0.0272709563
VISCERAWEIGHT,0.0118872298,0.0097872955,0.0036601674,0.0519461632,0.0226749006,0.0120152839,0.0138495613
SHELLWEIGHT,0.015007172,0.0125066369,0.0047588999,0.0652158684,0.0272709563,0.0138495613,0.0193773832

0,1
Total Variance,0.3471619684

Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix,Eigenvalues of the Covariance Matrix
Unnamed: 0_level_1,Eigenvalue,Difference,Proportion,Cumulative
1,0.33817073,0.3342067,0.9741,0.9741
2,0.00396403,0.00105632,0.0114,0.9855
3,0.00290771,0.00185281,0.0084,0.9939
4,0.0010549,0.00056524,0.003,0.9969
5,0.00048966,6.288e-05,0.0014,0.9983
6,0.00042679,0.00027865,0.0012,0.9996
7,0.00014814,,0.0004,1.0

Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors,Eigenvectors
Unnamed: 0_level_1,Prin1,Prin2,Prin3,Prin4,Prin5,Prin6,Prin7
LENGTH,0.193156,0.350069,0.655436,0.038785,-0.155845,0.000561,-0.620285
DIAMETER,0.159552,0.318821,0.505473,-0.01806,-0.074836,-0.030203,0.78138
HEIGHT,0.059283,0.134752,0.08608,-0.004683,0.924448,-0.337705,-0.047395
WHOLEWEIGHT,0.842619,0.018824,-0.31147,0.127977,-0.167979,-0.384695,-0.006248
SHUCKEDWEIGHT,0.371959,-0.703432,0.337272,-0.353767,0.162444,0.318403,0.012573
VISCERAWEIGHT,0.182251,0.012948,-0.025061,0.762978,0.207282,0.582881,0.033733
SHELLWEIGHT,0.228349,0.512161,-0.309994,-0.523912,0.133925,0.543987,-0.033322


PCA will be performed using the 7 independent variables: LENGTH, DIAMETER, HEIGHT, WHOLE, WEIGHT, SHUCKEDWEIGHT, VISCERA, WEIGHT, SHELL, and WEIGHT. Total variation is the trace of the covariance matrix, this is the sum of all variances of the individual variables. The total variation of this dataset is .3471

The variances of the variables are not consistent - namely WHOLEWEIGHT has a variance of .24 (approximately 70% of the total variation), while the remaining 6 variables all have small variances under .05. Similarly, the first PC of S accounts for approximately 97% of the variance, whereas it would take 3 PCs of R to account for the same variance (see below). With the large variance of WHOLEWEIGHT (.24), it is expected that WHOLEWEIGHT will account for most of the variance noted in the first PC. This pattern is also reflected in the eigenvector of the PC with the coefficient of WHOLEWEIGHT (.843) dominating the coefficients of the other variables. 

Given this information, we should extract the components from the R matrix rather than the S matrix. The correlation matrix of the eigenvalues  shows that the first two eigenvalues account for 95% of the variance, giving us a 5% lost in variance which satisfies our requirement.


In [14]:
/* Apply PCA using R matrix, create a new dataset with 2 PCs*/
proc princomp data = abalone3 out=NewData n=2;
var LENGTH DIAMETER HEIGHT WHOLEWEIGHT SHUCKEDWEIGHT VISCERAWEIGHT SHELLWEIGHT;
run;

0,1
Observations,4177
Variables,7

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Unnamed: 0_level_1,LENGTH,DIAMETER,HEIGHT,WHOLEWEIGHT,SHUCKEDWEIGHT,VISCERAWEIGHT,SHELLWEIGHT
Mean,0.5239920996,0.4078812545,0.1395163993,0.8287421594,0.3593674886,0.1805936079,0.2388308595
StD,0.1200929126,0.0992398661,0.0418270566,0.4903890182,0.221962949,0.1096142503,0.1392026695

Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix,Correlation Matrix
Unnamed: 0_level_1,LENGTH,DIAMETER,HEIGHT,WHOLEWEIGHT,SHUCKEDWEIGHT,VISCERAWEIGHT,SHELLWEIGHT
LENGTH,1.0,0.9868,0.8276,0.9253,0.8979,0.903,0.8977
DIAMETER,0.9868,1.0,0.8337,0.9255,0.8932,0.8997,0.9053
HEIGHT,0.8276,0.8337,1.0,0.8192,0.775,0.7983,0.8173
WHOLEWEIGHT,0.9253,0.9255,0.8192,1.0,0.9694,0.9664,0.9554
SHUCKEDWEIGHT,0.8979,0.8932,0.775,0.9694,1.0,0.932,0.8826
VISCERAWEIGHT,0.903,0.8997,0.7983,0.9664,0.932,1.0,0.9077
SHELLWEIGHT,0.8977,0.9053,0.8173,0.9554,0.8826,0.9077,1.0

Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix,Eigenvalues of the Correlation Matrix
Unnamed: 0_level_1,Eigenvalue,Difference,Proportion,Cumulative
1,6.35511203,6.07567967,0.9079,0.9079
2,0.27943236,,0.0399,0.9478

Eigenvectors,Eigenvectors,Eigenvectors
Unnamed: 0_level_1,Prin1,Prin2
LENGTH,0.383251,0.037865
DIAMETER,0.383573,0.065323
HEIGHT,0.348144,0.866836
WHOLEWEIGHT,0.390673,-0.233271
SHUCKEDWEIGHT,0.378188,-0.348011
VISCERAWEIGHT,0.381513,-0.252903
SHELLWEIGHT,0.378922,-0.058375


The absolute value of the coefficients within the principal component shows the contribution of the corresponding variable. From the equation, you can see that the coefficients of the first PC are all positively correlated, indicating z1 will increase if all 7 variables increase. Within the second PC, we have a mixture of positive and negative coefficients, HEIGHT dominates this PC and thus makes this PC variable-specific. 

## Outliers Detection and Visualization 
Now that we have lower dataset representation, we can easily visualize the distribution of predictor variables. We can see that there is a presence of outliers in this dataset. 


In [15]:
/* Box plot */
proc sgplot data=NewData;
  vbox prin1;
run;
proc sgplot data=NewData;
  vbox prin2;
run;


We apply Tukey’s rule to find outliers based on the quartiles of the data: the first quartile Q1 is the value >= ¼ of the data, the second quartile Q2 or the median is the value >= ½ of the data, and the third quartile Q3 is the value >= ¾ of the data. The interquartile range (IQR) can be defined as the difference between Q3 and Q1. According to Tukey’s rule, the outliers are values that are more than 1.5 times the IQR from the quartiles - either below Q1 - 1.5IQR or above Q3+1.5IQR. Below are the boxplots for PC1 and PC2 after the outliers were removed.  


In [16]:
/* Outliers detection */
proc univariate data = NewData;
var prin1 prin2;
output out=boxStats p25=q1_1 q1_2 p75=q3_1 q3_2  qrange = iqr1 iqr2;
run; 
data _null_;
set boxStats;
call symput ('Q1_1',q1_1);
call symput ('Q1_2',q1_2);
call symput ('Q3_1',q3_1);
call symput ('Q3_2',q3_2);
call symput ('iqr1', iqr1);
call symput ('iqr2', iqr2);
run; 
data trimmed;
set NewData;
    if (prin1 ge &q1_1 - 1.5 * &iqr1) and (prin1 le &q3_1 + 1.5 * &iqr1); 
    if (prin2 ge &q1_2 - 1.5 * &iqr2) and (prin2 le &q3_1 + 1.5 * &iqr2); 
run; 


Moments,Moments.1,Moments.2,Moments.3
N,4177,Sum Weights,4177.0
Mean,0,Sum Observations,0.0
Std Deviation,2.52093475,Variance,6.35511203
Skewness,0.05890147,Kurtosis,-0.407421
Uncorrected SS,26538.9479,Corrected SS,26538.9479
Coeff Variation,.,Std Error Mean,0.03900582

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,0.000000,Std Deviation,2.52093
Median,0.087688,Variance,6.35511
Mode,.,Range,15.17533
,,Interquartile Range,3.63015

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,0.0,Pr > |t|,1.0
Sign,M,52.5,Pr >= |M|,0.1076
Signed Rank,S,-8982.5,Pr >= |S|,0.9083

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,8.7584395
99%,5.8816182
95%,4.0530179
90%,3.1810749
75% Q3,1.7814235
50% Median,0.0876883
25% Q1,-1.8487217
10%,-3.4046821
5%,-4.1516777
1%,-5.2794434

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
-6.41689,237,7.49574,3716
-5.98681,239,7.68705,892
-5.87074,238,7.73479,1210
-5.84073,720,8.49362,1418
-5.83877,2115,8.75844,1764

Moments,Moments.1,Moments.2,Moments.3
N,4177,Sum Weights,4177.0
Mean,0,Sum Observations,0.0
Std Deviation,0.52861362,Variance,0.27943236
Skewness,14.7020918,Kurtosis,578.409956
Uncorrected SS,1166.90955,Corrected SS,1166.90955
Coeff Variation,.,Std Error Mean,0.00817911

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,0.000000,Std Deviation,0.52861
Median,0.003767,Variance,0.27943
Mode,.,Range,24.01567
,,Interquartile Range,0.45007

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,0.0,Pr > |t|,1.0
Sign,M,21.5,Pr >= |M|,0.5158
Signed Rank,S,15123.5,Pr >= |S|,0.8462

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,20.8182812
99%,1.01084242
95%,0.62875694
90%,0.47385354
75% Q3,0.22702904
50% Median,0.00376671
25% Q1,-0.22304266
10%,-0.48069421
5%,-0.68303161
1%,-1.14853347

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
-3.19739,1175,1.73515,2180
-2.26522,1258,1.87795,2178
-1.96651,3997,2.08818,507
-1.85332,1763,5.29176,1418
-1.84859,2812,20.81828,2052


In [17]:
/* Plot the trimmed dataset */
proc sgplot data=trimmed;
  vbox prin1;
run;
proc sgplot data=trimmed;
  vbox prin2;
run;
proc contents data=trimmed;
run;

0,1,2,3
Data Set Name,WORK.TRIMMED,Observations,4068
Member Type,DATA,Variables,11
Engine,V9,Indexes,0
Created,05/20/2020 02:29:04,Observation Length,88
Last Modified,05/20/2020 02:29:04,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information,Engine/Host Dependent Information.1
Data Set Page Size,65536
Number of Data Set Pages,6
First Data Page,1
Max Obs per Page,743
Obs in First Data Page,706
Number of Data Set Repairs,0
Filename,/tmp/SAS_work0F2C00000BB8_localhost.localdomain/trimmed.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,144308

Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes,Alphabetic List of Variables and Attributes
#,Variable,Type,Len,Format,Informat
2,DIAMETER,Num,8,BEST12.,BEST32.
3,HEIGHT,Num,8,BEST12.,BEST32.
1,LENGTH,Num,8,BEST12.,BEST32.
10,Prin1,Num,8,,
11,Prin2,Num,8,,
8,RINGS,Num,8,BEST12.,BEST32.
7,SHELLWEIGHT,Num,8,BEST12.,BEST32.
5,SHUCKEDWEIGHT,Num,8,BEST12.,BEST32.
6,VISCERAWEIGHT,Num,8,BEST12.,BEST32.
4,WHOLEWEIGHT,Num,8,BEST12.,BEST32.


The dataset now has 4068 observations, meaning 109 outliers were removed.


In [21]:
/* PROC PLOT data=trimmed;
plot prin2*prin1=ageRINGS;
run; */

We then plotted the first two PCs against each other and noticed that the age group of abalone actually related with the PC1. Group 1 (young abalone) tends to fall around the far left region of the plot whereas Group 3 (old abalone) falls around the far right region. The middle region is the mix of Group 2 and Group 3, which indicates that it might be hard to distinguish between Group 2 and Group 3. 

# CLASSIFICATION - k NEAREST NEIGHBORS (KNN)

Classification is a statistical method to identify which group membership the sampling unit belongs to. KNN is a nonparametric classification procedure, which does not make any assumption on the underlying data distribution. KNN, however, does make an assumption that similar points share similar labels. KNN is a very simple yet powerful model because in practice most of the data does not obey the typical normality assumption. However, the procedure requires equal covariances for every group assumptions. The major drawbacks of KNN are its low efficiency because we need to calculate the distance to all other points and its dependency on the selection of an appropriate value for k [2]. As mentioned above, KNN assumes that similar points share similar class labels. In higher dimensional spaces, points that are drawn from a probability distribution are not very close together. Thus we have used PCA to reduce the dimensionality of the dataset.


We will use error rate to evaluate the ability of a classification procedure to predict group membership. The procedure assumes that the sampling unit being classified actually belongs to the considered population. If the sampling unit is wrongly classified, then it is considered an error. The Apparent Error Rate usually underestimates the actual error because the dataset used to compute the classification functions is also used to evaluate them. Thus we use the error rate from the holdout method to assess how well our statistical procedure will generalize to an independent dataset. For KNN classifiers, the observation being classified is excluded from the k nearest neighbors of that observation. We noted that KNN relies on distance metric. The better that metric reflects group label similarity, the better the classified will be. So we will try different distance metrics and pick the one that produces the lowest error rate. 

## KNN - default setting using pooled covariance matrix

In [22]:
/* Apply KNN  - METRIC=FULL Default*/

proc discrim data=trimmed method=npar k=100 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar k=200 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar k=300 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar k=400 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar k=500 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1028 75.26,247 18.08,91 6.66,1366 100.00
2,273 20.98,645 49.58,383 29.44,1301 100.00
3,138 9.85,434 30.98,829 59.17,1401 100.00
Total,1439 35.37,1326 32.60,1303 32.03,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2474,0.5042,0.4083,0.3866
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1024 74.96,250 18.30,92 6.73,1366 100.00
2,278 21.37,628 48.27,395 30.36,1301 100.00
3,139 9.92,449 32.05,813 58.03,1401 100.00
Total,1441 35.42,1327 32.62,1300 31.96,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2504,0.5173,0.4197,0.3958
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1021 74.74,260 19.03,85 6.22,1366 100.00
2,277 21.29,636 48.89,388 29.82,1301 100.00
3,139 9.92,440 31.41,822 58.67,1401 100.00
Total,1437 35.32,1336 32.84,1295 31.83,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2526,0.5111,0.4133,0.3923
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1020 74.67,261 19.11,85 6.22,1366 100.00
2,279 21.45,625 48.04,397 30.51,1301 100.00
3,140 9.99,443 31.62,818 58.39,1401 100.00
Total,1439 35.37,1329 32.67,1300 31.96,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2533,0.5196,0.4161,0.3963
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1030 75.40,255 18.67,81 5.93,1366 100.00
2,278 21.37,641 49.27,382 29.36,1301 100.00
3,144 10.28,441 31.48,816 58.24,1401 100.00
Total,1452 35.69,1337 32.87,1279 31.44,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.246,0.5073,0.4176,0.3903
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1026 75.11,259 18.96,81 5.93,1366 100.00
2,281 21.60,630 48.42,390 29.98,1301 100.00
3,142 10.14,451 32.19,808 57.67,1401 100.00
Total,1449 35.62,1340 32.94,1279 31.44,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2489,0.5158,0.4233,0.396
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1027 75.18,264 19.33,75 5.49,1366 100.00
2,287 22.06,653 50.19,361 27.75,1301 100.00
3,148 10.56,466 33.26,787 56.17,1401 100.00
Total,1462 35.94,1383 34.00,1223 30.06,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2482,0.4981,0.4383,0.3948
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1026 75.11,266 19.47,74 5.42,1366 100.00
2,289 22.21,650 49.96,362 27.82,1301 100.00
3,147 10.49,470 33.55,784 55.96,1401 100.00
Total,1462 35.94,1386 34.07,1220 29.99,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2489,0.5004,0.4404,0.3966
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1029 75.33,264 19.33,73 5.34,1366 100.00
2,291 22.37,662 50.88,348 26.75,1301 100.00
3,149 10.64,473 33.76,779 55.60,1401 100.00
Total,1469 36.11,1399 34.39,1200 29.50,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2467,0.4912,0.444,0.3939
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1029 75.33,266 19.47,71 5.20,1366 100.00
2,291 22.37,658 50.58,352 27.06,1301 100.00
3,148 10.56,478 34.12,775 55.32,1401 100.00
Total,1468 36.09,1402 34.46,1198 29.45,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2467,0.4942,0.4468,0.3959
Priors,0.3333,0.3333,0.3333,


Although we started off with k = 100, we understood that many other values could possibly be used for this classifier. We noticed that the error rate decreased to 0.3811 and 0.3803 as we increased the number of  from 100 to 200 and 300 but slightly increased to 0.3908 and 0.3897 as k increased to 400 and 500.  The cross-validation error rate indicates that there should be a balance between  and the error rate. The higher  does not necessarily improve the model. The 200-nearest neighbor classifier is the best performing model with the lowest error rate of 0.3811.  

## KNN - IDENTITY metric using Euclidean distance

In [23]:
/* Apply KNN  - METRIC=IDENTITY*/
proc discrim data=trimmed method=npar metric=identity k=100 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=identity k=200 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=identity k=300 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=identity k=400 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=identity k=500 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1032 75.55,263 19.25,71 5.20,1366 100.00
2,271 20.83,685 52.65,345 26.52,1301 100.00
3,141 10.06,473 33.76,787 56.17,1401 100.00
Total,1444 35.50,1421 34.93,1203 29.57,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2445,0.4735,0.4383,0.3854
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1029 75.33,267 19.55,70 5.12,1366 100.00
2,276 21.21,676 51.96,349 26.83,1301 100.00
3,142 10.14,480 34.26,779 55.60,1401 100.00
Total,1447 35.57,1423 34.98,1198 29.45,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2467,0.4804,0.444,0.3904
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1047 76.65,254 18.59,65 4.76,1366 100.00
2,294 22.60,666 51.19,341 26.21,1301 100.00
3,163 11.63,473 33.76,765 54.60,1401 100.00
Total,1504 36.97,1393 34.24,1171 28.79,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2335,0.4881,0.454,0.3919
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1046 76.57,255 18.67,65 4.76,1366 100.00
2,297 22.83,659 50.65,345 26.52,1301 100.00
3,166 11.85,475 33.90,760 54.25,1401 100.00
Total,1509 37.09,1389 34.14,1170 28.76,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2343,0.4935,0.4575,0.3951
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1056 77.31,263 19.25,47 3.44,1366 100.00
2,310 23.83,659 50.65,332 25.52,1301 100.00
3,184 13.13,482 34.40,735 52.46,1401 100.00
Total,1550 38.10,1404 34.51,1114 27.38,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2269,0.4935,0.4754,0.3986
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1053 77.09,266 19.47,47 3.44,1366 100.00
2,309 23.75,655 50.35,337 25.90,1301 100.00
3,185 13.20,486 34.69,730 52.11,1401 100.00
Total,1547 38.03,1407 34.59,1114 27.38,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2291,0.4965,0.4789,0.4015
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1055 77.23,271 19.84,40 2.93,1366 100.00
2,309 23.75,680 52.27,312 23.98,1301 100.00
3,185 13.20,536 38.26,680 48.54,1401 100.00
Total,1549 38.08,1487 36.55,1032 25.37,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2277,0.4773,0.5146,0.4065
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1053 77.09,273 19.99,40 2.93,1366 100.00
2,310 23.83,677 52.04,314 24.14,1301 100.00
3,187 13.35,541 38.62,673 48.04,1401 100.00
Total,1550 38.10,1491 36.65,1027 25.25,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2291,0.4796,0.5196,0.4095
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1061 77.67,272 19.91,33 2.42,1366 100.00
2,314 24.14,684 52.57,303 23.29,1301 100.00
3,192 13.70,545 38.90,664 47.39,1401 100.00
Total,1567 38.52,1501 36.90,1000 24.58,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2233,0.4743,0.5261,0.4079
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1060 77.60,273 19.99,33 2.42,1366 100.00
2,315 24.21,679 52.19,307 23.60,1301 100.00
3,193 13.78,550 39.26,658 46.97,1401 100.00
Total,1568 38.54,1502 36.92,998 24.53,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.224,0.4781,0.5303,0.4108
Priors,0.3333,0.3333,0.3333,


The 300-nearest neighbor classifier is the best performing model and yields the lowest error rate of 0.3892.

## KNN - DIAGONAL metric using the diagonal matrix of the pooled covariance matrix 


In [24]:
proc discrim data=trimmed method=npar metric=diagonal pool=yes k=100 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=diagonal pool=yes k=200 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run; 

proc discrim data=trimmed method=npar metric=diagonal pool=yes k=300 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=diagonal pool=yes k=400 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

proc discrim data=trimmed method=npar metric=diagonal pool=yes k=500 crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1024 74.96,252 18.45,90 6.59,1366 100.00
2,270 20.75,645 49.58,386 29.67,1301 100.00
3,134 9.56,434 30.98,833 59.46,1401 100.00
Total,1428 35.10,1331 32.72,1309 32.18,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2504,0.5042,0.4054,0.3867
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1023 74.89,255 18.67,88 6.44,1366 100.00
2,276 21.21,627 48.19,398 30.59,1301 100.00
3,140 9.99,436 31.12,825 58.89,1401 100.00
Total,1439 35.37,1318 32.40,1311 32.23,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2511,0.5181,0.4111,0.3934
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1020 74.67,258 18.89,88 6.44,1366 100.00
2,282 21.68,627 48.19,392 30.13,1301 100.00
3,133 9.49,438 31.26,830 59.24,1401 100.00
Total,1435 35.28,1323 32.52,1310 32.20,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2533,0.5181,0.4076,0.393
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1018 74.52,259 18.96,89 6.52,1366 100.00
2,286 21.98,618 47.50,397 30.51,1301 100.00
3,134 9.56,447 31.91,820 58.53,1401 100.00
Total,1438 35.35,1324 32.55,1306 32.10,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2548,0.525,0.4147,0.3981
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1030 75.40,255 18.67,81 5.93,1366 100.00
2,281 21.60,646 49.65,374 28.75,1301 100.00
3,140 9.99,459 32.76,802 57.24,1401 100.00
Total,1451 35.67,1360 33.43,1257 30.90,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.246,0.5035,0.4276,0.3923
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1026 75.11,259 18.96,81 5.93,1366 100.00
2,282 21.68,637 48.96,382 29.36,1301 100.00
3,140 9.99,463 33.05,798 56.96,1401 100.00
Total,1448 35.59,1359 33.41,1261 31.00,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2489,0.5104,0.4304,0.3966
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1033 75.62,258 18.89,75 5.49,1366 100.00
2,288 22.14,659 50.65,354 27.21,1301 100.00
3,147 10.49,477 34.05,777 55.46,1401 100.00
Total,1468 36.09,1394 34.27,1206 29.65,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2438,0.4935,0.4454,0.3942
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1030 75.40,261 19.11,75 5.49,1366 100.00
2,290 22.29,653 50.19,358 27.52,1301 100.00
3,149 10.64,479 34.19,773 55.17,1401 100.00
Total,1469 36.11,1393 34.24,1206 29.65,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.246,0.4981,0.4483,0.3974
Priors,0.3333,0.3333,0.3333,

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1033 75.62,260 19.03,73 5.34,1366 100.00
2,290 22.29,666 51.19,345 26.52,1301 100.00
3,149 10.64,470 33.55,782 55.82,1401 100.00
Total,1472 36.18,1396 34.32,1200 29.50,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2438,0.4881,0.4418,0.3912
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1032 75.55,261 19.11,73 5.34,1366 100.00
2,290 22.29,659 50.65,352 27.06,1301 100.00
3,149 10.64,471 33.62,781 55.75,1401 100.00
Total,1471 36.16,1391 34.19,1206 29.65,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.2445,0.4935,0.4425,0.3935
Priors,0.3333,0.3333,0.3333,


The 200-nearest neighbor classifier is the best performing model and yields the lowest error rate of 0.3813.

# CLASSIFICATION - LINEAR DISCRIMINANT ANALYSIS AND QUADRATIC DISCRIMINANT ANALYSIS

Linear Discriminant Analysis (LDA) assumes equal covariance for every group. Quadratic Discriminant Analysis (QDA), on the other hand, does not assume the equal covariances for every group. Decision rule for LDA is to classify the sampling unit into the largest square distance whereas decision rule for QDA is to classify the sampling unit into a group with the smallest square distance.

In [27]:
proc discrim data=trimmed pool=no crossvalidate;
class ageRINGS;
var Prin1 Prin2;
run;

0,1,2,3
Total Sample Size,4068,DF Total,4067
Variables,2,DF Within Classes,4065
Classes,3,DF Between Classes,2

0,1
Number of Observations Read,4068
Number of Observations Used,4068

Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information,Class Level Information
ageRINGS,Variable Name,Frequency,Weight,Proportion,Prior Probability
1,1,1366,1366,0.335792,0.333333
2,2,1301,1301,0.319813,0.333333
3,3,1401,1401,0.344395,0.333333

Within Covariance Matrix Information,Within Covariance Matrix Information,Within Covariance Matrix Information
ageRINGS,Covariance Matrix Rank,Natural Log of the Determinant of the Covariance Matrix
1,2,-1.40126
2,2,-1.11515
3,2,-0.49066

Generalized Squared Distance to ageRINGS,Generalized Squared Distance to ageRINGS,Generalized Squared Distance to ageRINGS,Generalized Squared Distance to ageRINGS
From ageRINGS,1,2,3
1,-1.40126,1.49374,3.97564
2,0.82234,-1.11515,-0.14646
3,2.39399,-0.71074,-0.49066

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1015 74.30,281 20.57,70 5.12,1366 100.00
2,270 20.75,615 47.27,416 31.98,1301 100.00
3,133 9.49,433 30.91,835 59.60,1401 100.00
Total,1418 34.86,1329 32.67,1321 32.47,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.257,0.5273,0.404,0.3961
Priors,0.3333,0.3333,0.3333,

Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS,Number of Observations and Percent Classified into ageRINGS
From ageRINGS,1,2,3,Total
1,1015 74.30,281 20.57,70 5.12,1366 100.00
2,270 20.75,613 47.12,418 32.13,1301 100.00
3,133 9.49,434 30.98,834 59.53,1401 100.00
Total,1418 34.86,1328 32.65,1322 32.50,4068 100.00
Priors,0.33333,0.33333,0.33333,

Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS,Error Count Estimates for ageRINGS
Unnamed: 0_level_1,1,2,3,Total
Rate,0.257,0.5288,0.4047,0.3968
Priors,0.3333,0.3333,0.3333,


QDA model yields a lower error rate of 0.3968 using the hold out method. 

Among 17 models, the 200-nearest neighbor classifier using pooled covariance matrix to calculate the squared distance is the best performing model and yields the lowest error rate of 0.3811. We noticed all 17 models had a hard time distinguishing between Group 2 and Group 3, meaning observations belong to Group 2 were misclassified to Group 3 and vice versa. This is in line with our observation that it will be hard to separate Group 2 and Group 3.  

# Conclusions

This project studied classification through supervised learning algorithms by applying k-nearest neighbors to predict the age group of abalones. We used PCA to transform the dataset into lower dimension, remove the outliers and understand the distribution of the observations. We ran KNN models using three different distance metrics with  ranging from 100 to 500, LDA and QDA. Error rate from the holdout method was used to evaluate different models. We achieved the lowest error with the 200-nearest neighbor model using the pooled covariance matrix. 

[1] Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) “The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H.rubra) from the North Coast and Islands of Bass Strait”, Sea Fisheries Division, Technical
Report No. 48 .

[2] G. Guo, H. Wang, D. Bell, Y. Bi, K. Greer, R. Meersman, Z. Tari, D. C. Schmidt, "KNN model-based approach in classification" in On the Move to Meaningful Internet Systems 2003: CoopIS DOA and ODBASE, Berlin, Germany:Springer, vol. 2888, pp. 986-996, 2003.
