DSCI100/002 Group 5

Decoding Legendary Pokemon 

Introduction:
Pokemon is an entertainment franchise set in a make-believe world where people catch, train, and battle creatures known as Pokemon. Pokemon are created with various statistics, appearances, and personalities. In the games, the player builds teams and travels across various regions to battle other trainers and wild Pokemon. Every Pokemon has set statistics that determine how powerful it is, and the highly desirable "Legendary Pokemon" have much higher statistics than most non-legendaries.
Our predictive question is whether a hypothetical Pokemon that we create with random statistics should be considered legendary. We will analyze selected variables of all the Pokemon up to the sixth generation to determine what constitutes a legendary, and then apply our findings to our new Pokemon to determine whether it is legendary or not. For our data, we are using the 'Pokemon with Stats' dataset from Kaggle, which contains the name of each Pokemon, all of its statistics, and a classification of either legendary or not legendary. (source: https://www.kaggle.com/datasets/abcsds/pokemon)

Methods:
A classification model will be built using the variables attack, special attack, defense, special defense, and speed. These variables were chosen because they are the five traits that determine the overall strength of a Pokemon, and there is typically a large difference in these values between legendary Pokemon and normal Pokemon, so a K-nearest neighbors classifier should be able to predict the class accurately.
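To make the classifier idea concrete, here is a minimal base R sketch of an unweighted K-nearest neighbors vote. It is an illustration only, not the tidymodels implementation used in this analysis; the function `knn_predict` and the toy values are hypothetical and are not taken from the dataset.

```r
# Minimal K-nearest neighbors classifier in base R (illustrative sketch only).
# train_x: numeric matrix of predictors; train_y: factor of class labels;
# new_x: numeric vector for one new observation; k: number of neighbors.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new observation to every training row
  diffs <- sweep(train_x, 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Unweighted majority vote among those labels
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two stats per Pokemon, not taken from the dataset
toy_x <- matrix(c( 50,  55,
                   60,  50,
                   55,  65,
                  130, 120,
                  125, 135,
                  140, 130),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("Attack", "Speed")))
toy_y <- factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))

# Classify a made-up Pokemon with high stats
knn_predict(toy_x, toy_y, new_x = c(128, 125), k = 3)
```

In practice the predictors should be standardized first, so that no single stat dominates the distance.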
One way the results will be visualized is with a box plot, where the y axis is the sum of all the predictor variables, representing the overall strength of the Pokemon. There will be four boxes. Two boxes represent Pokemon with known classes, one for legendary and one for non-legendary. The other two represent Pokemon with unknown classes, one for those the model predicts to be legendary and one for those it predicts to be non-legendary. This will allow the reader to compare the typical stats of the different classes of Pokemon and to see whether the predicted classes match the known classifications.
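The planned four-box layout could be sketched as follows in base R. The data frame `toy` and all of its values are hypothetical placeholders, not results from this analysis; the same grouping would also work with `geom_boxplot()` from ggplot2. Note that a box plot marks the median of each group rather than the mean.

```r
# Hypothetical toy data (not from the dataset): total stats, with either a
# known label or a model prediction for Pokemon whose class is unknown.
toy <- data.frame(
  total_stats = c(310, 405, 525, 480, 680, 700, 600, 330, 450, 670, 690, 580),
  source      = rep(c("Known", "Predicted"), each = 6),
  legendary   = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
                  TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Combine the two columns into one four-level grouping variable
toy$group <- paste(toy$source,
                   ifelse(toy$legendary, "legendary", "non-legendary"))

# One box per group; the y axis is the summed stats
box_stats <- boxplot(total_stats ~ group, data = toy,
                     xlab = "Class", ylab = "Total stats")

# Group medians, in the same order as the boxes
box_stats$stats[3, ]
```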

In [1]:
#install.packages("themis")

In [2]:
#loading required libraries for the analysis. Install the packages if they are not previously installed.
library(rvest)
library(tidymodels)
library(themis)
library(tidyverse)
set.seed(100)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ recipes      1.0.6
✔ dials        1.0.0     ✔ rsample      1.0.0
✔ dplyr        1.1.2     ✔ tibble       3.2.1
✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
✔ infer        1.0.2     ✔ tune         1.0.0
✔ modeldata    1.0.0     ✔ workflows    1.0.0
✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
✔ purrr        1.0.1     ✔ yardstick    1.0.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()

In [3]:
#Reading web data using GitHub generated URL link

url <- "https://raw.githubusercontent.com/dlee03/DSCI_group_project/main/Pokemon.csv" 

pokemon2 <- read_csv(url)
colnames(pokemon2) <- make.names(colnames(pokemon2)) 
head(pokemon2)

[1mRows: [22m[34m800[39m [1mColumns: [22m[34m13[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): Name, Type 1, Type 2
[32mdbl[39m (9): #, Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation
[33mlgl[39m (1): Legendary

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


X.,Name,Type.1,Type.2,Total,HP,Attack,Defense,Sp..Atk,Sp..Def,Speed,Generation,Legendary
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


Table 1. Pokemon dataset read directly from the URL.

In [9]:
#select the name, the six base stats, and the Legendary label

pokemon_selected <- pokemon2 |>
    select(Name, HP, Attack, Defense, Sp..Atk, Sp..Def, Speed, Legendary)
head(pokemon_selected)

Name,HP,Attack,Defense,Sp..Atk,Sp..Def,Speed,Legendary
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>
Bulbasaur,45,49,49,65,65,45,False
Ivysaur,60,62,63,80,80,60,False
Venusaur,80,82,83,100,100,80,False
VenusaurMega Venusaur,80,100,123,122,120,80,False
Charmander,39,52,43,60,50,65,False
Charmeleon,58,64,58,80,65,80,False


Table 2. Selected Name, HP, Attack, Defense, Special Attack, Special Defense, Speed, and Legendary columns.

In [10]:
#count the total number of observations
pokemon_counts <- pokemon_selected |> 
    summarize(counts = n())
head(pokemon_counts)

counts
<int>
800


pokemon_counts is used to count the number of rows/observations present in the data frame. 

In [11]:
sum(is.na(pokemon_selected)) #check if there is any missing data in the dataframe

In [12]:
#convert the logical Legendary variable to the factor datatype
pokemon_data <- pokemon_selected |>
    mutate(Legendary = as_factor(Legendary)) 

In [15]:
#add a total_stats column: the row-wise sum of the stat columns HP through Speed
stats <- pokemon_data |>
    select(HP:Speed)
pokemon_data <- pokemon_data |>
    mutate(total_stats = rowSums(stats))

In [16]:
#make training data set
pokemon_split <- initial_split(pokemon_data, prop = 0.75, strata = Legendary)
pokemon_train <- training(pokemon_split)
pokemon_test <- testing(pokemon_split)

A training dataset and a testing dataset are made. We will use the training set to build the model and the testing set to check its accuracy. We also converted "Legendary" into a factor so that it can serve as the class label for the classifier.
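The `strata = Legendary` argument matters because legendary Pokemon are rare. A rough base R sketch of what a stratified 75/25 split aims for is shown below. The data frame `toy` is hypothetical, and this is a simplification of the idea, not the rsample implementation.

```r
set.seed(100)

# Hypothetical labels: 8 legendary out of 100 observations
toy <- data.frame(id = 1:100,
                  Legendary = factor(rep(c(FALSE, TRUE), c(92, 8))))

# Sample 75% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$Legendary)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets keep the same share of legendary Pokemon
prop.table(table(toy_train$Legendary))
prop.table(table(toy_test$Legendary))
```

Without stratification, a random split could leave very few legendary Pokemon in one of the two sets.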

In [17]:
#table of the mean stats of the training set pokemon, and the number of pokemon used in the training set.
nrow(pokemon_train)
mean_stats_table <- pokemon_train |> 
    summarize(across(Attack:Speed, mean)) |>
    add_column(n_training_pokemon = 600)
mean_stats_table

Attack,Defense,Sp..Atk,Sp..Def,Speed,n_training_pokemon
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
79.845,74.08167,72.98833,70.38167,68.60833,600


Table 3. Mean of each stat for the training set Pokemon, together with the number of Pokemon in the training set. The model uses the training set statistics to predict legendary status.

In [18]:
#create a recipe and model for the analysis

pokemon_recipe <- recipe(Legendary ~ total_stats, data = pokemon_train) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors()) |>
                    step_upsample(Legendary, over_ratio = 1, skip = TRUE)

pokemon_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                set_engine("kknn") |>
                set_mode("classification")

Recipe, workflow and fit will be used to train the classifier.
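As a rough picture of what the recipe steps do, the base R sketch below standardizes a predictor and upsamples the minority class to a 1:1 ratio. The data frame `toy` is hypothetical, and this is a simplification, not the recipes or themis implementation. In the recipe, `skip = TRUE` means the upsampling is applied only when the model is trained, not when new data are predicted.

```r
set.seed(100)

# Hypothetical toy training data, not taken from the dataset
toy <- data.frame(
  total_stats = c(300, 400, 500, 450, 350, 420, 680, 700),
  Legendary   = factor(c(rep("FALSE", 6), rep("TRUE", 2)))
)

# Scale and center using the training mean and standard deviation
mu    <- mean(toy$total_stats)
sigma <- sd(toy$total_stats)
toy$total_scaled <- (toy$total_stats - mu) / sigma

# Upsample the minority class with replacement until both classes are equal
counts        <- table(toy$Legendary)
minority      <- names(counts)[which.min(counts)]
minority_rows <- which(toy$Legendary == minority)
extra <- minority_rows[sample.int(length(minority_rows),
                                  size = max(counts) - min(counts),
                                  replace = TRUE)]
toy_balanced <- toy[c(seq_len(nrow(toy)), extra), ]

table(toy_balanced$Legendary)
```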

In [None]:
#cross-validate and tune the model using a 10-fold cross-validation
pokemon_vfold <- vfold_cv(pokemon_train, v = 10, strata = Legendary)

kvals <- tibble(neighbors = seq(from = 1, to = 50, by = 1))

pokemon_k_test <- workflow() |>
    add_recipe(pokemon_recipe) |>
    add_model(pokemon_spec) |>
    tune_grid(resamples = pokemon_vfold, grid = kvals) |>
    collect_metrics()

accuracies <- pokemon_k_test |>
    filter(.metric == "accuracy")
head(accuracies)

k_val_plot <-  ggplot(accuracies, aes(x = neighbors, y = mean))+
       geom_point() +
       geom_line() +
       labs(x = "Neighbors", y = "Accuracy Estimate") +
       ggtitle ("Accuracies vs. K")+
       theme(text = element_text(size =20))+ 
       scale_x_continuous(breaks = seq(0, 50, by = 5)) +  
       scale_y_continuous(limits = c(0.4, 1.0)) 
k_val_plot

Table 4. Cross-validation accuracy estimates on the training set data for each value of K.

Figure 1. Accuracy estimate vs. number of neighbors (K). Here we are using K = 3 because there is a smaller slope (decrease in accuracy) between K = 2 and K = 3 than between K = 3 and K = 4. This indicates that K = 3 offers a high accuracy estimate while using few neighbors.
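The slope comparison can also be checked numerically. The sketch below uses a hypothetical table `accuracies_toy` with the same `neighbors` and `mean` columns as the output of `collect_metrics()`. Its values are made up for illustration and are not the results of this analysis.

```r
# Hypothetical accuracy estimates, for illustration only; these are NOT the
# results of this analysis.
accuracies_toy <- data.frame(
  neighbors = 1:6,
  mean      = c(0.90, 0.93, 0.92, 0.88, 0.87, 0.86)
)

# Change in estimated accuracy when moving from each K to the next
changes <- data.frame(
  from_k = head(accuracies_toy$neighbors, -1),
  to_k   = tail(accuracies_toy$neighbors, -1),
  change = diff(accuracies_toy$mean)
)
changes

# K with the highest estimated accuracy, as a simple cross-check
best_k <- accuracies_toy$neighbors[which.max(accuracies_toy$mean)]
best_k
```

Replacing `accuracies_toy` with the real `accuracies` table would give the corresponding numbers for this model.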

In [None]:
#create a new model with k = 3
pokemon_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
                set_engine("kknn") |>
                set_mode("classification")

pokemon_fit <- workflow() |>
    add_recipe(pokemon_recipe) |>
    add_model(pokemon_best_spec) |>
    fit(data = pokemon_train)
pokemon_fit

Recipe, workflow and fit will be used again to train the classifier using the training dataset.

In [None]:
pokemon_predictions <- predict(pokemon_fit, pokemon_train) |>
                     bind_cols(pokemon_train) 
head(pokemon_predictions)    

pokemon_metrics <- pokemon_predictions |>
                 metrics(truth = Legendary, estimate = .pred_class) |>
                 filter(.metric == "accuracy")
pokemon_metrics

pokemon_conf_mat <- pokemon_predictions |>
                  conf_mat(truth = Legendary, estimate = .pred_class)

pokemon_conf_mat

Table 5. Prediction of pokemon class (legendary vs. non legendary). 

Table 6. The accuracy on the training dataset is computed and a confusion matrix is created. Here __ ( ) predicted correctly and ___ ( ) predicted incorrectly. This gives us an accuracy of 86.66% for our model.
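For reference, accuracy is the share of predictions on the diagonal of the confusion matrix. The base R sketch below shows the arithmetic on hypothetical labels (`truth` and `pred` are made up, not output from this model). It also computes precision and recall for the legendary class, which can be more informative than accuracy when one class is rare.

```r
# Hypothetical true and predicted labels, for illustration only
lv    <- c("FALSE", "TRUE")
truth <- factor(c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
                  "TRUE", "TRUE", "TRUE", "TRUE"), levels = lv)
pred  <- factor(c("FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE",
                  "TRUE", "TRUE", "TRUE", "FALSE"), levels = lv)

# Confusion matrix: rows are predictions, columns are the true classes
cm <- table(Prediction = pred, Truth = truth)
cm

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Recall: share of truly legendary Pokemon that were predicted legendary
recall <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])

# Precision: share of predicted legendary Pokemon that truly are legendary
precision <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])

c(accuracy = accuracy, recall = recall, precision = precision)
```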

Discussion: 


In [None]:
Citations:
