#####  Project Proposal Group 35

# Classifying the Presence or Absence of the European Tree Frog Based on Environmental Factors in Poland


## Introduction

The European Tree Frog (*Hyla arborea*) is a common amphibian species found throughout Europe, including Poland. The presence or absence of amphibian species, including the European Tree Frog, is limited by several environmental factors: the number and type of water reservoirs, the use of each water reservoir, the presence of vegetation and fishing, and how well maintained the water reservoirs are.

**Question:** Can we use environmental factors to predict the presence or absence of tree frogs in a given area?

**Dataset:** The dataset used to answer this question is `amphibians.csv`, taken from the UCI Machine Learning Repository. The data was originally gathered from an environmental impact assessment in Poland for the preparation of two upcoming road projects.

## Preliminary Exploratory Data Analysis

Below we demonstrate that the full data set can be read into R.

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [9]:
amphibians <- read_delim("data/amphibians.csv", delim = ";", skip = 1)

Parsed with column specification:
cols(
  .default = col_double(),
  Motorway = col_character()
)

See spec(...) for full column specifications.



In [10]:
head(amphibians)

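| ID | Motorway | SR | NR | TR | VR | SUR1 | SUR2 | SUR3 | UR | ⋯ | BR | MR | CR | Green frogs | Brown frogs | Common toad | Fire-bellied toad | Tree frog | Common newt | Great crested newt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | A1 | 600 | 1 | 1 | 4 | 6 | 2 | 10 | 0 | ⋯ | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | A1 | 700 | 1 | 5 | 1 | 10 | 6 | 10 | 3 | ⋯ | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | A1 | 200 | 1 | 5 | 1 | 10 | 6 | 10 | 3 | ⋯ | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | A1 | 300 | 1 | 5 | 0 | 6 | 10 | 2 | 3 | ⋯ | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | A1 | 600 | 2 | 1 | 4 | 10 | 2 | 6 | 0 | ⋯ | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 6 | A1 | 200 | 1 | 5 | 1 | 6 | 6 | 10 | 1 | ⋯ | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |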


In [11]:
tail(amphibians)

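| ID | Motorway | SR | NR | TR | VR | SUR1 | SUR2 | SUR3 | UR | ⋯ | BR | MR | CR | Green frogs | Brown frogs | Common toad | Fire-bellied toad | Tree frog | Common newt | Great crested newt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 184 | S52 | 4000 | 1 | 12 | 4 | 2 | 6 | 10 | 0 | ⋯ | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 185 | S52 | 2300 | 1 | 12 | 3 | 2 | 2 | 1 | 0 | ⋯ | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 186 | S52 | 300 | 1 | 14 | 2 | 7 | 10 | 2 | 0 | ⋯ | 5 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 187 | S52 | 500 | 1 | 1 | 4 | 1 | 10 | 2 | 0 | ⋯ | 5 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 188 | S52 | 300 | 1 | 12 | 3 | 2 | 1 | 6 | 0 | ⋯ | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 189 | S52 | 300 | 1 | 12 | 3 | 2 | 6 | 10 | 0 | ⋯ | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |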


In [12]:
glimpse(amphibians)

Rows: 189
Columns: 23
$ ID                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Motorway             <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", …
$ SR                   <dbl> 600, 700, 200, 300, 600, 200, 500, 700, 750, 200…
$ NR                   <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, …
$ TR                   <dbl> 1, 5, 5, 5, 1, 5, 5, 5, 5, 12, 1, 14, 1, 1, 1, 1…
$ VR                   <dbl> 4, 1, 1, 0, 4, 1, 0, 2, 1, 4, 4, 2, 1, 1, 3, 3, …
$ SUR1                 <dbl> 6, 10, 10, 6, 10, 6, 6, 10, 6, 2, 2, 1, 2, 2, 2,…
$ SUR2                 <dbl> 2, 6, 6, 10, 2, 6, 6, 6, 1, 7, 7, 2, 6, 6, 10, 1…
$ SUR3                 <dbl> 10, 10, 10, 2, 6, 10, 10, 9, 2, 6, 1, 7, 10, 10,…
$ UR                   <dbl> 0, 3, 3, 3, 0, 1, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 

Our finalized training set will have 7 variables (columns): six predictors and one classification variable, the presence or absence of the European Tree Frog. Although all six predictors are categorical, each is represented by numerical codes, with every code corresponding to one category of that predictor. We will summarize this training set in a table that displays how many times each category is observed for a given predictor.

To visualize this, we will create a bar chart of the frequency of each category for each predictor. Count will be shown on the y-axis and will be plotted against the six predictors on the x-axis.
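The summary table and bar chart described above could be produced along the following lines. This is a minimal sketch, not the final analysis: `toy_train` is a hypothetical stand-in with invented values, and showing one panel per predictor with `facet_wrap()` is an assumption about how the six predictors would share the x-axis.

```r
library(tidyverse)

# Toy stand-in for the training set (values are invented for illustration only)
toy_train <- tibble(
  NR = c(1, 1, 2, 1, 1, 2),
  TR = c(1, 5, 5, 12, 1, 14),
  VR = c(4, 1, 1, 0, 4, 2),
  UR = c(0, 3, 3, 0, 0, 1),
  FR = c(0, 1, 4, 0, 0, 2),
  MR = c(0, 0, 0, 1, 0, 0),
  `Tree frog` = c(0, 0, 1, 0, 1, 0)
)

# Summary table: one row per (predictor, category code) pair with its frequency
level_counts <- toy_train %>%
  pivot_longer(cols = -`Tree frog`, names_to = "predictor", values_to = "level") %>%
  count(predictor, level)

# Bar chart: count on the y-axis, category code on the x-axis, one panel per predictor
level_plot <- ggplot(level_counts, aes(x = factor(level), y = n)) +
  geom_col() +
  facet_wrap(vars(predictor), scales = "free_x") +
  labs(x = "Category code", y = "Count")
```

Reshaping to long format first means a single `count()` call covers all six predictors, and the same table feeds the plot directly.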

## Methods

Not much data wrangling will be needed to conduct the analysis, as the dataset is already in a tidy csv format. However, certain columns will need to be removed from the data set because they are not directly relevant to the question being asked. The columns that will be removed are ID, Motorway, SR, SUR1/2/3, OR, RR, BR, CR, green frogs, brown frogs, common toad, fire-bellied toad, common newt, and great crested newt.
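As a rough illustration of this step, the sketch below keeps the six predictors and the tree frog column rather than listing every column to drop. The helper `keep_model_columns()`, the renamed column `tree_frog`, the `absent`/`present` labels, and the toy rows are all hypothetical and only serve to show the shape of the result.

```r
library(tidyverse)

predictors <- c("NR", "TR", "VR", "UR", "FR", "MR")

# Hypothetical helper: keep the predictors plus the label, and make the label a factor
keep_model_columns <- function(df) {
  df %>%
    select(all_of(predictors), tree_frog = `Tree frog`) %>%
    mutate(tree_frog = factor(tree_frog, levels = c(0, 1),
                              labels = c("absent", "present")))
}

# Toy rows with a few of the real column names (values are invented)
toy <- tibble(
  ID = 1:3, SR = c(600, 700, 200),
  NR = c(1, 1, 1), TR = c(1, 5, 5), VR = c(4, 1, 1),
  UR = c(0, 3, 3), FR = c(0, 1, 4), MR = c(0, 0, 0),
  `Common toad` = c(0, 1, 1), `Tree frog` = c(0, 0, 1)
)

model_data <- keep_model_columns(toy)
```

Selecting by name with `all_of()` fails loudly if an expected column is missing, which is a useful check when the same helper is later applied to the full `amphibians` data frame.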

We will conduct our data analysis using a K-nearest neighbours classifier with the variables NR, TR, UR, VR, FR, and MR as predictors. We will split our data set into a training set and a testing set, then use the training set with cross-validation to train our classifier and tune for an optimal K value. We will then apply this classifier to the testing set to determine its accuracy. This accuracy measurement will represent our ability to predict the presence or absence of European Tree Frogs in a given area using environmental factors.
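The split, tuning, and evaluation steps can be sketched as follows. The proposal does not name an implementation, so this example uses `knn()` from the `class` package as a stand-in. The data are simulated, and the seed, 75/25 split, 5 folds, and grid of K values are all assumptions chosen for illustration.

```r
library(class)  # provides knn()

set.seed(35)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: 120 sites with six integer-coded predictors (invented)
n <- 120
sim <- data.frame(
  NR = sample(1:3, n, replace = TRUE),
  TR = sample(c(1, 5, 12, 14), n, replace = TRUE),
  VR = sample(0:4, n, replace = TRUE),
  UR = sample(c(0, 1, 3), n, replace = TRUE),
  FR = sample(0:4, n, replace = TRUE),
  MR = sample(0:2, n, replace = TRUE)
)
# Simulated label loosely tied to VR so there is some signal to find
sim$tree_frog <- factor(ifelse(sim$VR + rnorm(n) > 2.5, "present", "absent"),
                        levels = c("absent", "present"))

predictors <- c("NR", "TR", "VR", "UR", "FR", "MR")

# Split into training and testing sets (75/25 is an assumption)
train_idx <- sample(seq_len(n), size = round(0.75 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Standardize both sets using training-set means and standard deviations
mu   <- colMeans(train[, predictors])
sdev <- apply(train[, predictors], 2, sd)
train_x <- scale(train[, predictors], center = mu, scale = sdev)
test_x  <- scale(test[, predictors],  center = mu, scale = sdev)

# 5-fold cross-validation on the training set for each candidate K
# (for brevity, scaling is computed once on the whole training set)
folds  <- sample(rep(1:5, length.out = nrow(train)))
k_grid <- seq(1, 15, by = 2)
cv_accuracy <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    pred <- knn(train_x[folds != f, , drop = FALSE],
                train_x[folds == f, , drop = FALSE],
                cl = train$tree_frog[folds != f], k = k)
    mean(pred == train$tree_frog[folds == f])
  }))
})
best_k <- k_grid[which.max(cv_accuracy)]

# Classify the held-out test set once with the chosen K
test_pred     <- knn(train_x, test_x, cl = train$tree_frog, k = best_k)
test_accuracy <- mean(test_pred == test$tree_frog)
```

The test set is touched only in the last two lines, so `test_accuracy` is not influenced by the choice of K.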

We will visualize our results with a single bar chart. Count will be shown on the y-axis, and will be plotted against the six predictors on the x-axis. Color will be used to denote the presence or absence of the European Tree Frog for each of the predictors. We will also include a stacked bar chart comparing our predictions with the true presence or absence of European Tree Frogs.
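A stacked bar chart of predictions against the truth might look like the sketch below. The `results` tibble holds invented labels in place of real classifier output, and the column names `truth` and `predicted` are hypothetical.

```r
library(tidyverse)

lvls <- c("absent", "present")

# Invented true and predicted labels standing in for real test-set results
results <- tibble(
  truth     = factor(c("absent", "absent", "present", "present", "absent", "present"),
                     levels = lvls),
  predicted = factor(c("absent", "present", "present", "absent", "absent", "present"),
                     levels = lvls)
)

# Count each (truth, predicted) combination
confusion_counts <- results %>%
  count(truth, predicted)

# Stacked bars: one bar per true class, split by predicted class
prediction_plot <- ggplot(confusion_counts, aes(x = truth, y = n, fill = predicted)) +
  geom_col() +
  labs(x = "True presence of tree frog", y = "Count", fill = "Predicted")
```

Within each bar, the segment whose fill matches the x-axis label counts the correct predictions, so misclassifications are visible at a glance.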

## Expected Outcomes and Significance

By the end of our analysis, we expect to be able to determine how each environmental factor impacts the presence of the European Tree Frog. We predict that with a given set of the environmental factors we used as predictors, we will be able to classify the presence or absence of European Tree Frogs with a fairly high degree of accuracy.

These findings could greatly help environmental impact assessments in the consideration and planning of future infrastructure projects in Poland.

Further questions this could lead to include determining which predictor has the most significant impact on the presence of the tree frog, as well as whether our findings could yield similar results for other species of amphibians.