# Exploration using Tidymodels

Load the `tidymodels` library along with `readr`, which is used below to read the data:

In [None]:
library(tidymodels)
library(readr)


── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

✔ broom        1.0.8     ✔ recipes      1.2.1
✔ dials        1.4.0     ✔ rsample      1.3.0
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.5.2     ✔ tidyr        1.3.1
✔ infer        1.0.8     ✔ tune         1.3.0
✔ modeldata    1.4.0     ✔ workflows    1.2.0
✔ parsnip      1.3.1     ✔ workflowsets 1.1.0
✔ purrr        1.0.4     ✔ yardstick    1.3.2

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()


Attaching package: 'readr'

The following object is masked from 'package:yardstick':

    spec

The following object is masked from 'package:scales':

    col_factor

Load the full dataset from the Surfing for Science project and clean the variable names so they use `snake_case`:

In [None]:
plastics <- read_csv("data/surfingforscience_240325.csv") |>
  janitor::clean_names()


Rows: 113404 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ColorNature, Color_Name, Color_Category, Group, RF_Group, Cruise_N...
dbl (63): Area, Mean, StdDev, Mode, Min, Max, X, Y, XM, YM, Perim., BX, BY, ...
lgl  (2): RF_use, Modified

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
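
As a small illustration of what `janitor::clean_names()` does to column names like those listed in the column specification, the sketch below applies it to a toy data frame. The object `toy` and its values are invented for the example; only the column names are taken from the output above.

```r
library(janitor)

# Toy data frame: column names copied from the column specification,
# values invented for illustration only
toy <- data.frame(
  ColorNature = "natural",
  Color_Name  = "white",
  RF_Group    = "fragment",
  Perim.      = 12.3,
  check.names = FALSE
)

# clean_names() converts every column name to snake_case
toy_clean <- clean_names(toy)
names(toy_clean)
# "color_nature" "color_name" "rf_group" "perim"
```

This is why later cells refer to columns such as `rf_use`, `group` and `rf_group` in lower case.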

The aim of this notebook is to evaluate the performance of different models, following the [tidymodels tutorial](https://www.tidymodels.org/learn/). To evaluate performance we need plastics that have been classified manually, so that the AI's classification can be compared with that of a trained human observer. A *good* model is one whose results closely match those of the trained human.

Subset the data to get all plastics evaluated by a human observer.

In [None]:
plastics_manual <- plastics |> filter(rf_use == FALSE)


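Overall, the model currently in use had a 58.9% success rate: its prediction (`rf_group`) matched the human label (`group`) for 33,857 of the 57,513 manually evaluated plastics.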

In [None]:
plastics_manual |>
  count(rf_success = group == rf_group) |>
  mutate(prop = n / sum(n))


# A tibble: 2 × 3
  rf_success     n  prop
  <lgl>      <int> <dbl>
1 FALSE      23656 0.411
2 TRUE       33857 0.589
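
The same agreement rate can also be computed with `yardstick`, and broken down by plastic type to see where the current model struggles. The sketch below is a minimal illustration on a toy tibble: the object `toy` and its class labels are invented, and it assumes that `group` and `rf_group` are converted to factors sharing the same levels, which `accuracy()` requires.

```r
library(dplyr)
library(yardstick)

# Toy stand-in for plastics_manual; labels are invented for illustration
toy <- tibble(
  group    = c("fragment", "fragment", "film", "film", "foam", "foam"),
  rf_group = c("fragment", "film",     "film", "film", "foam", "fragment")
)

# yardstick's class metrics expect factors with identical levels
lvls <- sort(unique(c(toy$group, toy$rf_group)))
toy <- toy |>
  mutate(
    group    = factor(group, levels = lvls),
    rf_group = factor(rf_group, levels = lvls)
  )

# Overall agreement between the human label (truth) and the model (estimate)
toy_acc <- toy |> accuracy(truth = group, estimate = rf_group)

# Agreement within each human-assigned class
toy_by_group <- toy |>
  group_by(group) |>
  summarise(agreement = mean(group == rf_group))
```

Here `toy_acc$.estimate` is 4/6, while `toy_by_group` shows that the agreement is not uniform across classes, which a single overall rate hides.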

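Before fitting new models, the data are split into training and testing sets so that performance estimates are not inflated by overfitting. Setting the argument `strata = group` makes the split stratified, so that each type of plastic appears in similar proportions in both sets. Because no `prop` is given, `initial_split()` uses its default and assigns 3/4 of the rows to training.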

In [None]:
plastics_split <- plastics_manual |> initial_split(strata = group)
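
To show how a stratified split is typically used, the sketch below runs `initial_split()` on a toy tibble, extracts both partitions with `training()` and `testing()`, and compares the class proportions. The object `toy`, its labels and counts, and the seed value are all invented for illustration and are not part of the original analysis.

```r
library(dplyr)
library(rsample)

set.seed(123)  # arbitrary seed, used only to make this sketch reproducible

# Toy data with imbalanced classes; labels and counts are invented
toy <- tibble(
  id    = 1:1000,
  group = rep(c("fragment", "film", "foam"), times = c(600, 250, 150))
)

# Stratified split with the default prop = 3/4
toy_split <- initial_split(toy, strata = group)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions in each partition should be close to each other
prop_train <- toy_train |> count(group) |> mutate(prop = n / sum(n))
prop_test  <- toy_test  |> count(group) |> mutate(prop = n / sum(n))
```

Since the split is random, calling `set.seed()` before `initial_split()` in the notebook would likewise make `plastics_split` reproducible between runs.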
