
In [1]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages (attaches ggplot2, dplyr, etc.)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


This is a follow-up to the first penguin notebook in R. This notebook is also written in R, but focuses on machine learning.

Some goals for this notebook:
 * Based on culmen length/culmen depth/flipper length/body mass, can we determine 
      - species?
      - island?
 * Given island/species/culmen length/culmen depth/flipper length/body mass, can we determine
      - sex?
 * Given island/species/stage/date egg, can we determine
      - clutch completion?
 * Given culmen length, can we determine
      - culmen depth?
      - flipper length?
      - body mass?
      - what if species was also a feature?
 * Given flipper length, can we determine
      - body mass?
      - what if species was also a feature?
 * Given species/island/clutch completion/date egg, can we determine
      - stage?
          - this one is probably not going to go well

In [2]:
penguins_size <- read.csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv', header=TRUE)
head(penguins_size)

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<chr>` |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
| 4 | Adelie | Torgersen | NA | NA | NA | NA | NA |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | FEMALE |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | MALE |


In [3]:
penguin_lengths <- penguins_size[c("species","island","culmen_length_mm", "culmen_depth_mm", "flipper_length_mm")]
head(penguin_lengths)

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 |
| 4 | Adelie | Torgersen | NA | NA | NA |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 |


Let's drop every row that has an NA in any of the selected columns.

In [4]:
dataset <- penguin_lengths[complete.cases(penguin_lengths), ]
head(dataset)

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 |

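To make the effect of `complete.cases()` concrete, here is a minimal sketch on a tiny data frame. The `toy` values are invented for illustration and are not taken from the penguin data.

```r
# Toy data frame (hypothetical values) with missing entries in rows 2 and 4
toy <- data.frame(
  species           = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  culmen_length_mm  = c(39.1, NA, 46.1, 50.0),
  flipper_length_mm = c(181L, NA, 211L, NA)
)

# Count missing values per column before dropping anything
na_counts <- colSums(is.na(toy))
print(na_counts)

# complete.cases() is TRUE only for rows with no NA in any column
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```

The surviving rows keep their original row names, which is why the penguin output above jumps from row 3 straight to row 5.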

In [5]:
# Machine Learning Library
library(caret)

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

The following object is masked from ‘package:httr’:

    progress

In [6]:
dummy <- dummyVars(" ~ .", data=dataset)
dataset_with_dummies <- data.frame(predict(dummy, newdata=dataset))
head(dataset_with_dummies)

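| | speciesAdelie | speciesChinstrap | speciesGentoo | islandBiscoe | islandDream | islandTorgersen | culmen_length_mm | culmen_depth_mm | flipper_length_mm |
|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 39.1 | 18.7 | 181 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 39.5 | 17.4 | 186 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 40.3 | 18.0 | 195 |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | 36.7 | 19.3 | 193 |
| 6 | 1 | 0 | 0 | 0 | 0 | 1 | 39.3 | 20.6 | 190 |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 | 38.9 | 17.8 | 181 |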

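The output above shows one 0/1 indicator column per level of `species` and `island`. As a rough illustration of the same idea without caret, the sketch below one-hot encodes a single column with base R's `model.matrix()`. The `toy` data are made up, and this is a comparison sketch, not what `dummyVars()` does internally.

```r
# Toy data (hypothetical values), one categorical and one numeric column
toy <- data.frame(
  species          = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  culmen_length_mm = c(39.1, 46.4, 45.8, 36.7)
)

# "- 1" removes the intercept so every species level gets its own 0/1 column
one_hot <- model.matrix(~ species - 1, data = toy)
print(one_hot)

# Each row has exactly one indicator set to 1
print(rowSums(one_hot))
```

Because the indicators for one variable always sum to 1 across a row, they carry the same information as a constant column, which matters once they are used alongside an intercept in a linear model.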

In [7]:
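# Add a row id so the test set can be derived as "everything not in train"
dataset_with_dummies$id <- seq_len(nrow(dataset_with_dummies))

# Use 70% of the dataset as the training set and the remaining 30% as the test set.
# The sample is random, so call set.seed() beforehand if the split must be reproducible.
train <- dataset_with_dummies %>% dplyr::sample_frac(0.70)
test  <- dplyr::anti_join(dataset_with_dummies, train, by = 'id')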
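
The split above draws a different random sample on every run. A minimal base R sketch of a reproducible 70/30 split follows; the seed value and the `toy` data are arbitrary assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, assumed here only to make the sketch repeatable

# Toy data (hypothetical values) with an id column, as in the notebook
toy <- data.frame(id = 1:10, flipper_length_mm = seq(180, 225, by = 5))

# Draw 70% of the row indices without replacement for training
train_idx <- sample(seq_len(nrow(toy)), size = round(0.70 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets are disjoint and together cover every row
print(nrow(toy_train))
print(nrow(toy_test))
print(length(intersect(toy_train$id, toy_test$id)))
```

Indexing with `-train_idx` plays the role of the `anti_join()` on `id`: the test set is whatever the training sample did not take.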

In [8]:
head(train)

| | speciesAdelie | speciesChinstrap | speciesGentoo | islandBiscoe | islandDream | islandTorgersen | culmen_length_mm | culmen_depth_mm | flipper_length_mm | id |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 235 | 0 | 0 | 1 | 1 | 0 | 0 | 45.8 | 14.6 | 210 | 234 |
| 171 | 0 | 1 | 0 | 0 | 1 | 0 | 46.4 | 18.6 | 190 | 170 |
| 159 | 0 | 1 | 0 | 0 | 1 | 0 | 46.1 | 18.2 | 178 | 158 |
| 110 | 1 | 0 | 0 | 1 | 0 | 0 | 43.2 | 19.0 | 197 | 109 |
| 262 | 0 | 0 | 1 | 1 | 0 | 0 | 49.6 | 16.0 | 225 | 261 |
| 321 | 0 | 0 | 1 | 1 | 0 | 0 | 48.5 | 15.0 | 219 | 320 |


In [9]:
head(test)

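| | speciesAdelie | speciesChinstrap | speciesGentoo | islandBiscoe | islandDream | islandTorgersen | culmen_length_mm | culmen_depth_mm | flipper_length_mm | id |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | 36.7 | 19.3 | 193 | 4 |
| 9 | 1 | 0 | 0 | 0 | 0 | 1 | 34.1 | 18.1 | 193 | 8 |
| 10 | 1 | 0 | 0 | 0 | 0 | 1 | 42.0 | 20.2 | 190 | 9 |
| 11 | 1 | 0 | 0 | 0 | 0 | 1 | 37.8 | 17.1 | 186 | 10 |
| 12 | 1 | 0 | 0 | 0 | 0 | 1 | 37.8 | 17.3 | 180 | 11 |
| 19 | 1 | 0 | 0 | 0 | 0 | 1 | 34.4 | 18.4 | 184 | 18 |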


In [10]:
lr_model <- lm(flipper_length_mm ~ speciesAdelie + speciesChinstrap + speciesGentoo +
                 islandBiscoe + islandDream + islandTorgersen +
                 culmen_length_mm + culmen_depth_mm, data = train)
summary(lr_model) # Review the results


Call:
lm(formula = flipper_length_mm ~ speciesAdelie + speciesChinstrap + 
    speciesGentoo + islandBiscoe + islandDream + islandTorgersen + 
    culmen_length_mm + culmen_depth_mm, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.1124  -3.0544   0.0818   3.3901  15.7357 

Coefficients: (2 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      156.4350     6.5149  24.012  < 2e-16 ***
speciesAdelie    -29.4564     2.4430 -12.058  < 2e-16 ***
speciesChinstrap -31.6124     2.0657 -15.304  < 2e-16 ***
speciesGentoo          NA         NA      NA       NA    
islandBiscoe      -2.5801     1.3239  -1.949   0.0525 .  
islandDream       -0.5917     1.3354  -0.443   0.6581    
islandTorgersen        NA         NA      NA       NA    
culmen_length_mm   0.7188     0.1454   4.945 1.46e-06 ***
culmen_depth_mm    1.9771     0.3721   5.313 2.53e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In [11]:
predictor_cols <- c("speciesAdelie", "speciesChinstrap", "speciesGentoo", "islandBiscoe", "islandDream", "islandTorgersen", "culmen_length_mm", "culmen_depth_mm")
predictions <- predict(lr_model, newdata = test[predictor_cols])
predictions

“prediction from a rank-deficient fit may be misleading”

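The `NA` coefficients for `speciesGentoo` and `islandTorgersen`, and the rank-deficient warning, arise because the indicator columns for each variable sum to 1 and so duplicate the intercept. One common remedy is to leave one indicator per variable out of the formula, making that level the reference. The sketch below demonstrates this on synthetic data; the seed, sample size, and coefficients are made up and are not the penguin data.

```r
set.seed(1)  # arbitrary seed; the data below are synthetic

n <- 60
toy <- data.frame(
  species          = rep(c("Adelie", "Chinstrap", "Gentoo"), each = n / 3),
  culmen_length_mm = rnorm(n, mean = 44, sd = 5)
)
toy$flipper_length_mm <- 150 + 0.7 * toy$culmen_length_mm +
  ifelse(toy$species == "Gentoo", 25, 0) + rnorm(n, sd = 3)

# One-hot columns for all three species, as in the notebook
toy$speciesAdelie    <- as.numeric(toy$species == "Adelie")
toy$speciesChinstrap <- as.numeric(toy$species == "Chinstrap")
toy$speciesGentoo    <- as.numeric(toy$species == "Gentoo")

# All three indicators plus an intercept: one coefficient cannot be estimated
fit_all <- lm(flipper_length_mm ~ speciesAdelie + speciesChinstrap +
                speciesGentoo + culmen_length_mm, data = toy)
print(coef(fit_all))

# Dropping one indicator (Gentoo becomes the reference level) gives a full-rank fit
fit_ref <- lm(flipper_length_mm ~ speciesAdelie + speciesChinstrap +
                culmen_length_mm, data = toy)
print(coef(fit_ref))

# Both fits produce the same fitted values; only the parameterisation differs
print(all.equal(unname(fitted(fit_all)), unname(fitted(fit_ref))))
```

In this sketch the two models agree on their fitted values, so the redundancy costs nothing in accuracy, but the reduced formula avoids the warning. caret's `dummyVars()` also accepts `fullRank = TRUE`, which drops one level per factor when the dummy columns are created.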

In [12]:
test$flipper_length_mm

In [13]:
# Strip the row-name labels so only the numeric predictions remain
predictions <- as.vector(predictions)
predictions

In [14]:
abs_errors <- abs(predictions - test$flipper_length_mm)
abs_errors

In [15]:
max(abs_errors)

In [16]:
mean(abs_errors)

In [17]:
# Mean squared error on the test set (in mm^2)
mse <- mean((predictions - test$flipper_length_mm)^2)
mse
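
The last few cells compute the maximum absolute error, the mean absolute error, and the mean squared error one at a time. As a sketch, they can be bundled into one helper that also reports the root mean squared error, which is in the same units (mm) as flipper length. `regression_errors` is a hypothetical helper written for this example, not a function from caret or base R, and the numbers below are made up.

```r
# Hypothetical helper that bundles the error metrics used in this notebook
regression_errors <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  residuals <- predicted - actual
  c(
    max_abs_error = max(abs(residuals)),
    mae           = mean(abs(residuals)),
    mse           = mean(residuals^2),
    rmse          = sqrt(mean(residuals^2))
  )
}

# Made-up values for illustration only
actual    <- c(190, 200, 210, 220)
predicted <- c(192, 197, 210, 225)
errors <- regression_errors(actual, predicted)
print(errors)
```

With the notebook's objects, the call would be `regression_errors(test$flipper_length_mm, predictions)`.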