## [Predicting Iris Flower Type Based on Petal and Sepal Measurements](https://archive.ics.uci.edu/dataset/53/iris)

 ### Introduction:   
Our project will focus on ....

 ### Methods & Results
As mentioned in the introduction, the goal of our project is to determine which group of measurements, petal or sepal, has the most influence on predicting the type of an iris flower among three types: Iris virginica, Iris setosa, and Iris versicolor. Since this is a predictive question with a categorical outcome, we will use classification to answer it.

First, we load the necessary packages into R and make sure the `kknn` package required for our classification process is installed. We also set our seed to `1382` for reproducibility.

In [None]:
# KM - Run cell before starting workspace - loads necessary packages
set.seed(1382)
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
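# Install kknn (the engine used by our KNN models) only if it is missing
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")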

We read the data set into R using the `read_csv` function, since the file contains comma-separated values. The original file has no header row naming the columns, so we rename the columns in a later step by referring to the source website. We then print the data set to see what it looks like.

In [None]:
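# The file has no header row, so we stop read_csv from using the first observation as column names
iris_data <- read_csv("data/iris.data", col_names = FALSE)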
iris_data
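
Because the file has no header row, how the reader treats the first line matters. The sketch below uses base R's `read.csv` on three sample lines written in the same format as the data file (the `csv_text` string is an illustration, not the full data set) to show that treating the first line as a header silently drops one observation.

```r
# Three sample lines in the same format as the headerless data file (illustrative only)
csv_text <- paste(
  "5.1,3.5,1.4,0.2,Iris-setosa",
  "4.9,3.0,1.4,0.2,Iris-setosa",
  "7.0,3.2,4.7,1.4,Iris-versicolor",
  sep = "\n"
)

# Treating the first line as a header turns the first observation into column names
with_header_guess <- read.csv(text = csv_text, header = TRUE)

# Declaring that there is no header keeps every observation
no_header <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("sepal_length", "sepal_width", "petal_length", "petal_width", "class")
)

print(nrow(with_header_guess))
print(nrow(no_header))
```

The same idea applies to `read_csv`, where the `col_names` argument controls whether the first line is read as a header.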

After importing the dataset from the provided link, we observed that the columns lacked descriptive names, making it challenging to interpret and analyze the data effectively. In response, we took a step in our data wrangling and cleaning process by assigning meaningful names to each column based on the information available on the source website, using the `setNames` function.

Simultaneously, we noted that the rows were originally organized based on the iris types, where all rows corresponding to one type of iris flower were grouped together, followed by rows of the second type, and so on. However, this organization could introduce bias when splitting the data for training and testing purposes.

To address this concern, we randomize the order of the rows with `slice(sample(n()))`, which reorders the rows according to a random permutation of the row indices. Shuffling removes the grouping by iris type; the balanced representation of all three types in the training and testing sets comes from stratifying the later split on `class`. Together with renaming the columns, this gives us a more reliable starting point for our exploration of the Iris dataset.

Furthermore, we noticed that the `class` column contained values with the `Iris-` prefix (e.g., `Iris-versicolor`, `Iris-setosa`). To make the data clearer and simpler, we decided to clean the values of the `class` column by removing the `Iris-` prefix.

We accomplished this using the `mutate` function from the `dplyr` package, combined with the `gsub` function. The `gsub` function replaces every occurrence of the `Iris-` prefix with an empty string, leaving only the type of iris flower in the `class` column.

Finally, since we are predicting the type of iris flower, we converted `class` with the `as_factor` function so that it is treated as a factor in our classification models.

We named the final data set `iris_clean`.

In [None]:
set.seed(1402)
iris_clean <- iris_data |>
    setNames(c("sepal_length", "sepal_width", "petal_length", "petal_width", "class")) |>
    slice(sample(n())) |>
    mutate(class = gsub("Iris-", "", class)) |>
    mutate(class = as_factor(class))
iris_clean
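
The label cleaning described above can be checked in isolation. The following is a minimal base R sketch on a small hypothetical vector (`raw_class` is made up for illustration). It uses base `factor()` as a stand-in for `as_factor`; note that `factor()` sorts the levels alphabetically, while `as_factor` from `forcats` keeps them in order of first appearance.

```r
# Hypothetical labels in the same style as the raw `class` column
raw_class <- c("Iris-versicolor", "Iris-setosa", "Iris-virginica", "Iris-setosa")

# gsub() replaces every match of the pattern "Iris-" with an empty string
clean_class <- gsub("Iris-", "", raw_class)

# Convert the cleaned labels to a categorical variable
class_factor <- factor(clean_class)

print(clean_class)
print(levels(class_factor))
```

Anchoring the pattern as `"^Iris-"` would restrict the replacement to the start of each label, which can be a safer choice if the text could contain the same string elsewhere.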

#### Visualization

Following our research on the topic, we decided to plot `petal_width` against `sepal_width` to see whether the scatterplot shows a pattern based on the type of flower. We also plotted `petal_length` against `sepal_length` for the same purpose.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)

# Scatterplot visualization of petal vs. sepal width:
petal_vs_sepal_width <- iris_clean |>
    ggplot(aes(x = petal_width, y = sepal_width, color = class)) +
    geom_point(alpha = 0.6) +
    labs(x = "Petal Width (cm)",
         y = "Sepal Width (cm)",
         color = "Iris Type") +
    ggtitle("Petal width vs. Sepal width")
petal_vs_sepal_width

# Scatterplot visualization of petal vs. sepal length:
petal_vs_sepal_length <- iris_clean |>
    ggplot(aes(x = petal_length, y = sepal_length, color = class)) +
    geom_point(alpha = 0.6) +
    labs(x = "Petal Length (cm)",
         y = "Sepal Length (cm)",
         color = "Iris Type") +
    ggtitle("Petal length vs. Sepal length")
petal_vs_sepal_length

To explore the relationships between petal and sepal measurements in our dataset, we created scatterplots comparing `petal_width` vs. `sepal_width` and `petal_length` vs. `sepal_length`. In both plots, points from the same iris type tended to cluster together in different regions. This suggests a relationship between the petal and sepal measurements and the type of iris flower.

To leverage this observed pattern, we have decided to design two distinct prediction models. One model will focus on utilizing petal length and width as features, while the other model will use sepal length and width. This segmentation allows us to investigate the contributions of both petal and sepal measurements independently in predicting the iris flower type, providing insights into which set of features plays a more significant role in classification.

To determine the optimal parameter `k` for our `kknn` classification models, we will perform cross-validation. This process helps ensure that our models generalize well to new, unseen data. Once we have identified the best k values, we will train our kknn models on both sets of features and evaluate their accuracies.

By comparing the accuracies of the two models, we aim to quantify the relative importance of sepal measurements versus petal measurements in determining the iris flower type among the three classes: `virginica`, `setosa`, and `versicolor`. This lets us assess the predictive power of each set of measurements separately.

#### Expected Outcomes and Significance:  
We hypothesize that the (sepal or petal, we have to choose this based on research) predictors will give the highest accuracy when used in a model that predicts the type of iris flower. Literature suggests that....

#### Classification:
To start the classification process, we first divide our data into training and testing sets, so that we can train our models on the training set and then evaluate their accuracy on the testing set.

In [None]:
set.seed(1402)  # Set your desired seed for reproducibility

# Split data
pre_split_iris <- iris_clean
iris_split <- initial_split(pre_split_iris, prop = 0.7, strata = class)

# Shuffle the split using slice_sample
iris_train <- slice_sample(training(iris_split), prop = 1)
iris_test <- slice_sample(testing(iris_split), prop = 1)

# Display the head of the training and testing set
head(iris_train)
head(iris_test)
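
To illustrate what stratifying on the class label achieves, here is a minimal base R sketch of a 70/30 split that samples within each class. It uses R's built-in `iris` data frame (whose label column is named `Species`) as a stand-in for `iris_clean`, and it is only an illustration of the idea, not how `initial_split` is implemented.

```r
set.seed(1402)
data(iris)  # built-in data set, used here as a stand-in for iris_clean

# Group the row indices by class label
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Within each class, sample 70% of the rows for training
train_idx <- unlist(lapply(rows_by_species, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Each class contributes the same share to both sets
print(table(train$Species))
print(table(test$Species))
```

Because the sampling happens inside each class, the class proportions of the full data set carry over to both the training and the testing set.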


#### Now we start creating our models, training them, and obtaining their accuracies.
### Petal KNN Classification:
#### Predictors:
* Petal Width
* Petal Length

### 1. First we perform cross-validation to obtain the best K
##### 1.1. Select the desired columns from `iris_train` and `iris_test`

In [None]:
#1.1
iris_train_petal <- iris_train |>
    select(class, petal_width, petal_length)

iris_test_petal <- iris_test |>
    select(class, petal_width, petal_length)

head(iris_train_petal)
head(iris_test_petal)

##### 1.2. Split `iris_train_petal` into folds for cross-validation using `vfold_cv`.
##### 1.3. Create the `recipe` with the correct predictors and pre-process the data.
##### 1.4. Create the KNN model specification, using the appropriate engine and mode, and set `neighbors = tune()`.
##### 1.5. Add the `recipe` and the model to a `workflow`.

In [None]:
#1.2
petal_vfold <- vfold_cv(iris_train_petal, v = 5, strata = class)

#1.3
# Standardize the predictors so that each contributes on the same scale to the distance
petal_recipe <- recipe(class ~ ., data = iris_train_petal) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
petal_recipe

#1.4
petal_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
petal_spec

#1.5
petal_workflow <- workflow() |>
    add_recipe(petal_recipe) |>
    add_model(petal_spec)
petal_workflow

##### 1.6. Make a `tibble` called `k_vals_petal` that sets the range and step size of the $K$ values we want to cross-validate.
##### 1.7. Use the `tune_grid` function on our train/validation splits to estimate the classifier accuracy for each $K$, then collect the metrics and filter for `accuracy`.
##### 1.8. Plot `accuracy` vs. `K` to find the $K$ with the highest accuracy estimate that doesn’t change much when $K$ is changed to a nearby value.
##### Notes to keep in mind for this section:
* We picked the range 40-80 through a series of background steps:
    * First, we used the formula `sqrt(# of rows)/2` to get a rough idea of what $K$ could be.
    * The result was 41, so we made two ranges, one from 1-40 and one from 41-80, and plotted `K` vs. `Accuracy` for both.
    * The resulting graphs showed that the range to use is 41-81, because that range contains the $K$ with the maximum `accuracy`.
    * The code for the cross-validation graphs of 1-40 and 41-59 was removed to avoid wasting memory.

In [None]:
options(repr.plot.height = 6, repr.plot.width = 6)
#1.6
k_vals_petal <- tibble(neighbors = seq(1, 150, by = 2))
#1.7
accurasies_petal <- petal_workflow |>
    tune_grid(resamples = petal_vfold, grid = k_vals_petal) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
head(accurasies_petal)
#1.8
cross_val_plot_petal <- accurasies_petal |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "k (Number of Neighbors)",
         y = "Petal Measurements' Accuracy",
         title = "K vs. Accuracy") +
    theme(text = element_text(size = 12))
#scale_x_continuous(breaks = seq(40, 80, by = 2)) # adjusting the x-axis
cross_val_plot_petal
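
To make the cross-validation steps more concrete, the following base R sketch re-creates the idea by hand: an unweighted k-nearest-neighbors vote, 5 folds, and a mean accuracy per value of k. It is a simplified illustration under several assumptions, not the `tidymodels` implementation: it uses R's built-in `iris` data frame instead of `iris_train_petal`, the helper `knn_predict` is a hypothetical function written for this example, the folds are not stratified, the predictors are not standardized, and the grid `k_grid` is kept small so that k stays below the number of rows available in each training fold.

```r
set.seed(1402)
data(iris)  # built-in data set, used here as a stand-in for the training data
x <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
y <- iris$Species

# Hypothetical helper: unweighted k-NN, majority vote among the k closest rows
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Randomly assign each row to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(x)))
k_grid <- seq(1, 15, by = 2)

# For each k, predict every held-out fold from the other four and average the accuracy
cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(1:5, function(f) {
    held_out <- which(folds == f)
    preds <- sapply(held_out, function(i) {
      knn_predict(x[-held_out, , drop = FALSE], y[-held_out], x[i, ], k)
    })
    mean(preds == as.character(y[held_out]))
  })
  mean(fold_acc)
})

results <- data.frame(neighbors = k_grid, mean = cv_accuracy)
print(results)

# Value of k with the highest mean accuracy (the first one in case of ties)
best_k <- results$neighbors[which.max(results$mean)]
print(best_k)
```

The `results` data frame plays the same role as the filtered accuracy table above: one row per candidate k, with the mean accuracy across the folds.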