# Bean Classification Project Proposal 

In [1]:
install.packages("cowplot")
install.packages("kknn")
library(kknn)

In [2]:
library(GGally)

Loading required package: ggplot2

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



In [3]:
# Import libraries
install.packages("tidymodels")
install.packages("themis")
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 16)
library(readxl)
library(ggplot2)
library(cowplot)

In [None]:
library(themis)

# Introduction

This dataset was extracted from the article by Koklu and Ozkan, “Multiclass classification of dry beans using computer vision and machine learning techniques” (2020). Koklu and Ozkan (2020) explored the use of computer vision and machine learning techniques to classify dry beans into different classes. Using the study's dataset, this report also attempts to classify dry beans by type based on the extracted features. This work is relevant to the food and agriculture industries, as it could help automate the classification of dry beans, which is currently done manually and is time-consuming.

**Research Question**: Can we accurately predict the bean type in an image based on the predictors Area and Roundness?

This dataset is based on 13,611 images of 7 types of individual dry beans with similar features. Each image was analyzed for 16 features of the bean (4 "shape factors" and 12 structural/geometric features) (Koklu & Ozkan, 2020). While the dataset uses up to 12 predictors to classify the data points, we will try to reduce the number of predictors and see whether the accuracy remains as high, or nearly as high, as with the full set of predictors.

We will be using the K-nearest neighbours classification to predict "Class" using the mentioned predictors. K-nearest neighbours finds the K-closest data points to the input sample and assigns the most common class label amongst them. The accuracy of the model can be improved by tuning the value of K.
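As a minimal sketch of this idea, the base R function below classifies one new point by majority vote among its K nearest neighbours. The function name `knn_predict_one` and the toy data are hypothetical and are not part of the analysis; the report itself fits the model with `kknn` through tidymodels.

```r
# Classify a single new observation by majority vote among its k nearest
# neighbours (Euclidean distance). Illustrative only; names are hypothetical.
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most common label among them (ties go to the first label in table order)
  names(which.max(table(nearest)))
}

# Toy data: two well-separated groups
train_x <- data.frame(x1 = c(0, 0.1, 0.2, 5, 5.1, 5.2),
                      x2 = c(0, 0.2, 0.1, 5, 5.2, 5.1))
train_y <- c("A", "A", "A", "B", "B", "B")

knn_predict_one(train_x, train_y, new_x = c(0.15, 0.1), k = 3)
knn_predict_one(train_x, train_y, new_x = c(4.9, 5.0), k = 3)
```

The first call lands in group "A" and the second in group "B", because all three nearest neighbours of each new point carry that label.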


Our variables of interest (Koklu & Ozkan, 2020):
- **Area**: The area of a bean zone and the number of pixels within its boundaries.
- **Roundness**: Calculated as $\frac{4\pi A}{P^2}$, where $A$ is the area and $P$ is the perimeter of the bean.
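
To build some intuition for the roundness formula, the short sketch below evaluates it for two simple shapes. The helper name `roundness_of` is hypothetical; in the dataset, the value is already provided in the `roundness` column.

```r
# Roundness as defined above: 4 * pi * A / P^2
roundness_of <- function(area, perimeter) {
  4 * pi * area / perimeter^2
}

# A circle of radius 1: A = pi, P = 2 * pi, so roundness is 1
circle_roundness <- roundness_of(area = pi, perimeter = 2 * pi)

# A square of side 1: A = 1, P = 4, so roundness is pi / 4
square_roundness <- roundness_of(area = 1, perimeter = 4)

c(circle = circle_roundness, square = square_roundness)
```

A perfect circle gives the maximum value of 1, and less circular shapes give smaller values.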

# Methods

**Loading Data**: The data is distributed as a zip file on the original web source. We first create a temporary file path with `tempfile()` and store it in `temp`, which holds the downloaded zip file until it is extracted. Next, we use `download.file()` to download the zip file and save it to `temp`. The `unzip()` function extracts the xlsx file from the downloaded zip file and returns its file path, which we store in `beanzip`. Finally, `read_excel()` reads the xlsx file into a data frame that we name `bean`.


In [None]:
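# Download the zipped dataset to a temporary file
beanurl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00602/DryBeanDataset.zip"
temp <- tempfile()
download.file(beanurl, temp, mode = "wb")  # binary mode keeps the zip intact on Windows
# Extract only the Excel file and read it into a data frame
beanzip <- unzip(temp, "DryBeanDataset/Dry_Bean_Dataset.xlsx")
bean <- read_excel(beanzip)
bean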

*Table 1: Raw Bean dataset*

**Selecting variables & data wrangling & cleaning**: We explore the data and select the variables for prediction. The chosen variables should produce distinct clusters by bean type, which helps in classifying the data points.

To do so, we create a scatterplot matrix with the `ggpairs()` function to explore the relationships between all variables, with colors based on the "Class" variable. The `columns` argument lists the variables included in the matrix, and the `lower` argument specifies the type of plot drawn below the diagonal. Here, "continuous" panels are drawn as scatterplots ("points") and "combo" panels as dot plots without facets ("dot_no_facet"), which are easier to read.

We then use `plot_grid()` to arrange the individual panels from the lower triangle of the matrix in a grid.

In [None]:
matrix_corr_bean <- ggpairs(bean, mapping = aes(color = Class), columns = c("Area", "Perimeter", "MajorAxisLength","MinorAxisLength","AspectRation","Eccentricity",
                                                                            "ConvexArea","EquivDiameter","Extent","Solidity","roundness"),
                           lower = list(continuous = wrap("points", size = 0.4), combo = wrap("dot_no_facet", alpha = 0.4)))+
                            theme(text = element_text (size = 20))
matrix_corr_bean <- matrix_corr_bean + theme(axis.text = element_text(size = 20))
options(repr.plot.width = 30, repr.plot.height = 100)

# Collect every panel below the diagonal (row i, column j with j < i), row by row
lower_panels <- list()
for (i in 2:11) {
  for (j in 1:(i - 1)) {
    lower_panels[[length(lower_panels) + 1]] <- matrix_corr_bean[i, j]
  }
}

# 55 panels in 3 columns; plot_grid() works out the number of rows
plot_grid(plotlist = lower_panels, ncol = 3)

*Figure 0: Structural features of beans*

The `ggpairs()` function helps us visualize the relationships between variables and bean class. It can help us identify correlations, outliers, and class separation (see *Figure 0*).

To choose the predictors for the analysis, we looked for the pair of variables whose scatterplot separates the bean classes into the most distinct clusters. Based on *Figure 0*, we select the variables Area and Roundness.

Then, we create a table only with the variables we're interested in: Area, roundness, and Class. 

In [4]:
select_bean_var <- bean |>
                select("Area","roundness","Class")
select_bean_var 


*Table 2: Bean data frame with chosen variable*

**Visualize plot**: All bean types are plotted by area and roundness to better understand the dataset.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
bean_plot <- select_bean_var |>
  ggplot(aes(x = Area, y = roundness, color = Class)) +
  geom_point(size = 0.2) +
  labs(x = "Area", 
       y = "Roundness",
       color = "Type") +
  ggtitle("Figure 1: Area and Roundness of all bean types")+
  theme(text = element_text(size = 12))+
  guides(colour = guide_legend(override.aes = list(size=2)))
bean_plot

As shown in *Figure 1*, the Barbunya class does not form a distinct cluster and overlaps heavily with other classes. This may result in an inaccurate model. Thus, we remove the Barbunya class, which should enable the KNN model to better distinguish between the remaining classes and improve the overall accuracy of the model. The "Class" column is also converted to a factor type for this classification model.

In [None]:
select_bean <- select_bean_var|>
                filter(Class != "BARBUNYA")|>
                mutate(Class = as.factor(Class))
select_bean

*Table 3: Bean dataframe with Barbunya filtered out*

In [None]:
bean_plot2 <- select_bean |>
  ggplot(aes(x = Area, y = roundness, color = Class)) +
  geom_point(size = 0.2) +
  labs(x = "Area", 
       y = "Roundness",
       color = "Type") +
  ggtitle("Figure 2: Area and Roundness of bean types (Barbunya filtered)")+
  theme(text = element_text(size = 12))
bean_plot2

We graph the data again, and we can see that the clusters in *Figure 2* are cleaner.

Next, we split the dataset into a training set and a testing set. The training set is used to build the model, and the testing set is used to evaluate its predictions. Holding out the testing set lets us check whether the model generalizes to data it has not seen.
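
The idea of a stratified 75/25 split can be sketched in base R as follows. This is only an illustration on made-up data (the `toy` data frame and the seed are assumptions), not a description of how `initial_split()` works internally.

```r
set.seed(1)  # arbitrary seed for this toy example
toy <- data.frame(value = rnorm(40),
                  Class = rep(c("A", "B"), times = c(30, 10)))

# Sample 75% of the row indices within each class, so that both parts
# keep roughly the same class proportions
rows_by_class <- split(seq_len(nrow(toy)), toy$Class)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.75 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

table(toy_train$Class)
table(toy_test$Class)
```

Sampling within each class keeps a rare class from ending up almost entirely in one of the two sets.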

In [None]:
##Creating training and testing dataset

set.seed(2022)
bean_split <- initial_split(select_bean, prop = 0.75, strata = Class)
bean_train <- training(bean_split)
bean_test <- testing(bean_split)

Next, we plot the training dataset as a scatterplot. *Figure 3* is very similar to *Figure 2*, which indicates that the random split preserved the overall distribution of the data.

In [None]:
area_round_plot <- bean_train |>
  ggplot(aes(x = Area, y = roundness, color = Class)) +
  geom_point(size = 0.2) +
  labs(x = "Area", 
       y = "Roundness",
       color = "Type") +
  ggtitle("Figure 3: Area and roundness of bean types in training dataset")+
  theme(text = element_text(size = 12))
area_round_plot

From *Figure 3*, we can see that Area and roundness are measured on very different scales. Therefore, we will scale the data, which is handled in the `workflow()`. Next, we summarize the training dataset to see the distribution of bean types. We group the beans by type (Class), count the beans in each group, and create a new column called *percentage_dist* with the percentage of each bean type in the training dataset.

In [None]:
#Summarize training dataset
bean_class_dist <- bean_train |>
                group_by(Class)|>
                summarize(count = n()) |>
                mutate(percentage_dist = 100*count/nrow(bean_train))
bean_class_dist

*Table 4: Distribution of bean types in training set*

*Table 4* and *Figure 4* show the distribution of bean types in the training set.

In [None]:
bean_class_dist_plot <- bean_class_dist |>
                        ggplot(aes(x=Class, y = count))+
                        geom_bar(stat = "identity")+
                        labs(x= "Type of beans",
                             y = "Number of beans")+
                        ggtitle("Figure 4: Distribution of bean type in the training dataset")+
                        theme(text = element_text(size = 12))
bean_class_dist_plot

**Upsampling**: *Table 4* and *Figure 4* show the number of data points for each bean type. There is a big difference between the number of Dermason data points and the number for the other bean types. Therefore, we upsample the training dataset so that each Class of bean has comparable voting power when classifying the testing dataset. This is done with the `recipe()` and `step_upsample()` functions.
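
Conceptually, upsampling draws extra rows, with replacement, from the smaller classes until every class is as large as the biggest one. The base R sketch below shows this on made-up data; the `toy` data frame and the seed are assumptions, and the analysis itself relies on `step_upsample()` from themis.

```r
set.seed(1)  # arbitrary seed for this toy example
toy <- data.frame(value = 1:12,
                  Class = rep(c("A", "B"), times = c(9, 3)))

# Target size: the count of the largest class
target_n <- max(table(toy$Class))

# For each class, keep every original row and draw extra rows with
# replacement until the class reaches target_n
upsampled <- do.call(rbind, lapply(split(toy, toy$Class), function(df) {
  extra <- sample.int(nrow(df), size = target_n - nrow(df), replace = TRUE)
  df[c(seq_len(nrow(df)), extra), ]
}))

table(upsampled$Class)
```

After this step both classes have 9 rows, with the rows of class "B" repeated as needed.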

**Scaling:** All data is scaled to prevent features with large values from dominating the decision process. The scaling is later performed as part of the classifier-building process in the `workflow()`.

Next, we plot the data again to see what the upsampled dataset looks like. Each cluster should now have approximately the same number of data points.

In [None]:
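## Upsampling the training data:

bean_data_training_scaled_recipe <- recipe(Class ~., data = bean_train) |>
                        step_upsample(Class)|>  # name the class column to balance
                        prep()

# step_upsample() is skipped when baking new data, so bake with
# new_data = NULL to get the upsampled training set
final_bean_data <- bake(bean_data_training_scaled_recipe, new_data = NULL)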

area_round_plot_scaled <- final_bean_data |>
  ggplot(aes(x = Area, y = roundness, color = Class)) +
  geom_point(size = 0.2) +
  labs(x = "Area", 
       y = "Roundness",
       color = "Type") +
  ggtitle("Figure 5: Area and roundness in training dataset (upsampled)")+
  theme(text = element_text(size = 12))
area_round_plot_scaled

**Summary of the data set**: Next, we look at the statistical distribution of our chosen variables, in table and boxplot form. To understand the distribution, we calculate basic statistics such as the minimum, maximum, and standard deviation. To do this, we group the beans by type (Class) and create a new data frame with one statistic per column.


In [None]:
#Statistical Distribution of predictor based on class
features_dist_by_class <- final_bean_data |>
                group_by(Class)|>
                summarize(max_area = max(Area, na.rm = TRUE),
                          min_area = min(Area, na.rm = TRUE),
                          std_dev_area = sd(Area, na.rm = TRUE),
                          max_roundness = max(roundness, na.rm = TRUE),
                          min_roundness = min(roundness, na.rm = TRUE),
                         std_dev_roundness = sd(roundness, na.rm = TRUE))
features_dist_by_class

*Table 5: Statistical distribution of each predictor based on their class*

Next, we create *Table 6* to explore the statistical distribution of "Area" and "roundness". First, we select these two variables and reshape the data using `pivot_longer()` to make it tidy. We then group the data by the "Features" column and calculate summary statistics for each feature, including the mean, minimum, maximum, and standard deviation.

In [None]:
#Statistical Distribution in Features of varieties of dry bean
features_dist <- bean_train |>
            select(Area,roundness) |>
            pivot_longer(cols= Area:roundness,
                         names_to = "Features",
                         values_to = "values") |>
            group_by(Features) |>
            summarize(Mean = mean(values, na.rm = TRUE),
                      Min = min(values, na.rm = TRUE),
                      Max = max(values, na.rm = TRUE),
                     Std_Deviation = sd(values, na.rm = TRUE))
features_dist

*Table 6: Statistical distribution of the predictors in the training dataset*

Then, to learn more about the distribution, we graph the area distribution for each type of bean. 

In [None]:

area_box_plot <- final_bean_data |>
                ggplot(aes(x = Class, y = Area))+
                geom_boxplot()+
                xlab("Type of beans")+
                ylab("Area")+
                ggtitle("Figure 6: Area Distribution for each type of bean")+
                coord_flip()
area_box_plot

In *Figure 6*, the boxes, which represent the middle 50% of each bean type, have little overlap with each other (Krzywinski & Altman, 2014). We can also see that 5 of the 6 bean types (Sira, Seker, Horoz, Cali, and Bombay) have outliers with areas larger than usual for their type, shown by the data points beyond the upper whiskers (Krzywinski & Altman, 2014). These outliers may affect the accuracy of the classifier, as they overlap with the area measurements of other bean types. Of all the bean types, Bombay has the most distinct area distribution, so we expect the classifier to predict Bombay better than the other types. The area distributions of Sira and Seker overlap, so we also expect this to affect classifier accuracy.

Then, we graph a box plot showing the roundness distribution for each type of bean.

In [None]:
roundness_box_plot <- final_bean_data |>
                ggplot(aes(x = Class, y = roundness))+
                geom_boxplot()+
                xlab("Type of beans")+
                ylab("Roundness")+
                ggtitle("Figure 7: Roundness distribution for each type of bean")+
                coord_flip()
roundness_box_plot

In *Figure 7*, we can see that the middle 50% of the roundness distributions of most bean types do not overlap. However, there are many outliers that cause the roundness distributions of the bean types to overlap. This overlap matches the overlap between bean types that we see in Figures 1, 2, 3, and 5. At the same time, because the middle 50% of each type mostly does not overlap, as seen in *Figure 7*, we can also see distinct clusters in Figures 1, 2, 3, and 5.

# Building the Classification Model

Now we can start building the classifier. Before doing so, we need to create a scaling and centering recipe on the training dataset to ensure that all predictors are standardized, so that predictors with larger scales do not have an outsized effect. We set the seed to 2022 so that the code is reproducible and produces consistent results.
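
The effect of scaling and centering can be illustrated with base R's `scale()` on a few made-up values (the `toy` data frame below is an assumption, not taken from the dataset). Dividing by the standard deviation and then subtracting the mean, as the recipe does, gives the same result as the usual standardization $(x - \bar{x}) / s$.

```r
# Toy predictors on very different scales (values are made up)
toy <- data.frame(Area = c(30000, 45000, 60000, 75000),
                  roundness = c(0.80, 0.85, 0.90, 0.95))

# Standardize: subtract the column mean, divide by the column standard deviation
standardized <- as.data.frame(scale(toy))

standardized
colMeans(standardized)    # both approximately 0
sapply(standardized, sd)  # both 1
```

After standardization, both columns contribute on a comparable scale to the Euclidean distances used by K-nearest neighbours.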

In [None]:
set.seed(2022) # DO NOT REMOVE

bean_report_recipe <- recipe(Class ~., data = final_bean_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
bean_report_recipe

Next, we tune our model using cross-validation so that we can choose the optimal number of neighbours, K. First, we specify the model for cross-validation, where `neighbors = tune()` indicates that the accuracy of the model will be tested across a range of K values. Cross-validation randomly divides the training set into a specified number of smaller sets (folds) of equal size. It then holds out one fold as a validation set and trains the classifier on the remaining folds. This process is repeated until every fold has been used as the validation set once (Arlot and Celisse, 2010). Here, we use cross-validation across a range of numbers of neighbours and choose the number with the best accuracy estimate.
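
The fold rotation described above can be sketched in base R. The row count `n`, the number of folds `v`, and the seed are made-up values for illustration; the analysis itself uses `vfold_cv()` and `tune_grid()`.

```r
set.seed(1)  # arbitrary seed for this toy example
n <- 20      # number of training rows in this toy example
v <- 5       # number of folds

# Randomly assign each row to one of v folds of equal size
fold_id <- sample(rep(seq_len(v), length.out = n))

for (fold in seq_len(v)) {
  validation_rows <- which(fold_id == fold)
  training_rows   <- which(fold_id != fold)
  # A real run would fit the model on training_rows and score it on
  # validation_rows; here we only print the sizes of the two parts.
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Each row is used for validation exactly once, and the accuracy estimate for a given K is the mean of the per-fold accuracies.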

In [None]:
set.seed(2022) # DO NOT REMOVE

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

Now we use `vfold_cv()` to split the training data into 10 folds for cross-validation. `gridvals` sets the range of K values that we would like to cross-validate. After running this code a few times to narrow down the range, we decided to test K values only up to 30.

In [None]:
set.seed(2022) # DO NOT REMOVE
bean_vfold <- vfold_cv(final_bean_data, v = 10, strata = Class)
gridvals <- tibble(neighbors = seq(1,30,by=1))

Now we cross-validate multiple values of K on the training dataset. The function `tune_grid()` fits the model for each value in the grid. After the cross-validation is run, we collect the accuracies calculated for each value of K.

In [None]:
set.seed(2022) # DO NOT REMOVE

knn_results <- workflow() |>
  add_recipe(bean_report_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = bean_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

Now we graph the number of neighbours against the mean accuracy across the 10 folds to find the best K to use.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)

# Plot k values against their respective accuracies and choose optimal k value
cross_val_plot <- knn_results |> 
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    ggtitle("Figure 9: K-Neighbours and their accuracy estimates")+
    scale_x_continuous(breaks = seq(1,30, by = 1))
cross_val_plot

From *Figure 9*, we can see that with a dataset as large as the bean dataset, a relatively large number of neighbours is needed to predict the data accurately. The highest accuracies occur at around 17-19 neighbours. K = 18 is a good choice, since the accuracy does not fluctuate much between 17 and 19 neighbours.

Now that we have a good estimate of which K yields the highest accuracy, we build our classifier with that value, in our case 18 neighbours. We first build the model specification using the best K. Then we build the workflow with this specification, reusing the recipe that we created before.

In [None]:
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
  set_engine("kknn") |>
  set_mode("classification")

bean_fit <- workflow() |>
  add_recipe(bean_report_recipe) |>
  add_model(knn_best_spec) |>
  fit(data = final_bean_data)

Now we use the classifier that we just created to predict the classes in our test dataset.

In [None]:
# Get the prediction column
bean_predictions <- predict(bean_fit, bean_test) |> 
    bind_cols(bean_test)

Next, we calculate the accuracy of the classifier by comparing the predictions to the test set. The `metrics()` function calculates the performance of the model's predictions given the true Class and the predicted class labels. `head(1)` then returns the first row of the data frame, which contains the accuracy metric and its estimate.

In [None]:
bean_acc <- bean_predictions |> 
    metrics(truth = Class, estimate = .pred_class) |> 
    select(.metric, .estimate) |> 
    head(1)
bean_acc

*Table 7: Accuracy estimate of our model against the test set*

As we can see from *Table 7*, the k-nearest neighbors (k-NN) model has an estimated accuracy of 0.8812622.

**Visualizations of the analysis**: Now, we build a confusion matrix that compares the predictions to the true values. The confusion matrix shown below helps us interpret the results and communicate the findings effectively. It is also used to evaluate the performance of our model, as the accuracy can be calculated from it.

In [None]:
bean_cm <- bean_predictions |> 
    conf_mat(truth = Class, estimate = .pred_class)
bean_cm

*Table 8: Confusion matrix of the predicted class and the truth class*

From *Table 8*, we can work out the number of true positives, true negatives, false positives, and false negatives for each class label. The diagonal of the matrix represents the correctly classified instances for each class label. For example, consider the first row, the BOMBAY class: all 141 instances that were predicted to belong to the BOMBAY class were actually labeled as BOMBAY, so every BOMBAY prediction was correct. To calculate the accuracy of the model, we divide the sum of the diagonal of the matrix by the total number of observations, which is 2709/3074 = 0.8812622, consistent with our results from *Table 7*.
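
The same calculation can be sketched on a small confusion matrix in base R. The 3-class counts below are made up for illustration and are not the values from *Table 8*; only the final line uses the totals quoted in the text.

```r
# Toy 3-class confusion matrix (counts are made up): rows are predictions,
# columns are true classes
cm <- matrix(c(50,  2,  0,
                3, 40,  5,
                0,  4, 46),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Truth = c("A", "B", "C")))

# Overall accuracy: correctly classified (diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)
accuracy

# Share of each predicted class that was correct (row-wise)
per_class_correct <- diag(cm) / rowSums(cm)
per_class_correct

# The totals quoted in the text
reported_accuracy <- 2709 / 3074
reported_accuracy
```

The row-wise shares correspond to the per-class percentages discussed for individual bean types, while the overall accuracy pools all classes together.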

Finally, we want a visual comparison of true class vs predicted class. 

In [None]:

bean_predictions_plot <- bean_predictions |>
    ggplot(aes(x = Area, y = roundness, color = .pred_class)) +
    geom_point(size = 0.3) +
    labs(x = "Area of Bean", y = "Roundness of Bean", color = "Type of Bean") +
    ggtitle("Figure 10: Predictions of Bean Type")

bean_trueclass_plot <- bean_predictions |>
    ggplot(aes(x = Area, y = roundness, color = Class)) +
    geom_point(size = 0.3) +
    labs(x = "Area of Bean", y = "Roundness of Bean", color = "Type of Bean") +
    ggtitle("Figure 11: True Class of Beans")

plot_grid(bean_predictions_plot, bean_trueclass_plot, ncol = 2)

We can see that *Figure 10* and *Figure 11* are very similar, which is consistent with our model accuracy of 88.13%. 

# Discussion

**Results**

With the help of a classification model, we expected to find a well-performing model that can predict the type of beans based on their roundness and area with at least 85% accuracy. From *Table 7*, the k-nearest neighbors (k-NN) model has an estimated accuracy of 0.8812622. This means that the model correctly classified 88.13% of the beans in the test set based on their area and roundness. This is a relatively high accuracy, as the classes are similar in appearance and difficult to differentiate based on these features alone. While this estimated accuracy is high, using all the predictors proposed in the paper that the dataset is based on, the KNN algorithm produces a classifier with a higher accuracy estimate (92.52% in the paper).

Looking at each individual class prediction in *Table 8*, we can see that the classifier predicts the Bombay bean type correctly 100% of the time. The classifier predicts the Sira bean type the least accurately: dividing the number of correct Sira predictions by the total number of Sira predictions in *Table 8*, we find that only 82% of them are correct, whereas more than 84% of the predictions for every other bean type are correct. We can see the same result by comparing *Figure 10* and *Figure 11*, where the classifier struggles with Sira beans whose roundness and area measurements are similar to those of the Horoz bean type.

The model correctly classifies 88.13% of the test data. To improve the accuracy of the model, we could collect more data and enlarge the training set. Using more data to train the model can help it learn and generalize better to new data. We could also add more predictor variables to the model and explore how that changes its accuracy.

**Findings Impact**

This classification model can impact the food and agriculture industry. 

An automated system can be created to classify the type of bean based on physical appearance. This is especially useful when food and agriculture companies collect different types of beans together, dry them, and want to package them by type. The model we built can help streamline this process, improve quality control, and reduce the time people spend manually separating the beans.

**Future research questions**
- Can other variables such as major axis length or the perimeter be good predictors of the type? And are they better predictors than roundness and area?
- Can the classification model be used for different types of crops such as rice? 

# References:

- Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.

- Koklu, M., & Ozkan, I. A. (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, 105507. https://doi.org/10.1016/j.compag.2020.105507

- Krzywinski, M., & Altman, N. (2014). Visualizing samples with box plots: use box plots to illustrate the spread and differences of samples. Nature Methods, 11(2), 119+. Gale OneFile: Health and Medicine. link.gale.com/apps/doc/A361242515/HRCA?u=ubcolumbia&sid=bookmark-HRCA&xid=0db0fe06 (accessed 11 Mar. 2023).
