/
MammalPredictor.Rmd
2745 lines (2048 loc) · 131 KB
/
MammalPredictor.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Mammal Predictor"
author: "Ryan J. Cooper"
date: "11/1/2020"
output:
pdf_document:
toc: yes
toc_depth: '3'
html_document:
df_print: paged
toc: yes
toc_depth: 3
editor_options:
chunk_output_type: inline
---
```{r knit setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE, cache=TRUE)
```
# Introduction
This project was developed for the HarvardX Data Science Professional Certificate program, Machine Learning Capstone Project taught by Rafael Irizarry, Ph.D. and offered through the EDX online learning platform. The skills demonstrated in this report are based on lessons in this course and the accompanying text, an Introduction to Data Science. (Irizarry, 2020)
Data available from the Ecological Society of America (ESA) PanTHERIA dataset (Jones, 2016) contains descriptive attributes concerning over 5,400 individual mammal species. Additional data derived from The Global Biodiversity Information Facility (GBIF, 2020) contains extended attributes and details that will support exploration of the Pantheria dataset. These data sources will be used in the construction and training of a machine learning classification model which will try to correctly classify mammals into their most likely taxonomy given a set of identifying physical, behavioral, reproductive and life history characteristics.
* Physical characteristics include body mass, head + body length, and forearm length.
* Behavioral characteristics include population density, diet breadth and trophic level, geographic locale, and activity cycle.
* Reproductive and life history characteristics include longevity, gestation, and weaning ages.
The data set also contains data related to many other characteristics, but many of the columns do not have very complete data. The combination of these various predictors should be very informative about an animals taxonomy. Greater numbers of available predictors to train on generally should increase accuracy of predictions. However, with over 50 possible original columns available, it is necessary to identify and utilize the predictors that are most relevant, and select predictors for which there is a substantial amount of data available.
To demonstrate the concept of a classification and regression tree, we will first implement the RPart package - which enable the visualization of a single decision tree using conditional probability to select the relevant variables and cut off points for each split of the tree. This should give us a good sense of how a single tree would work, and provide some insight into how the model will perform when using random forests.
After demonstration of the basic concept, we will shift to a random forests approach that will go beyond the CART concept by combining many trees, enabling individual trees to work together, providing greater overall accuracy, but with some loss of transparency. Random forests can be surprisingly accurate in multiple outcome classification problems.
## Primer on Taxonomy
Biological taxonomy is the science of classification of living organisms by related characteristics. Every organism belongs to a tree of taxa starting with the _domain_ and _kingdom_ at the top, with each successive (lower) level being more specific/descriptive, and generally having fewer species in it. The most specific taxonomic level in general use is *species*.
The *species* is expressed as a binomial (two word) title which incorporates the *genus* into the title. The *species* name will formally have the first word capitalized and the second word in lower case, which is the word that differentiates this species from others in the same *genus*.
For example, a blue whale is of the *genus:* Balaenoptera, *species:* _Balaenoptera musculus_. The last word of the species name is also referred to as the _specific epithet_. In text, the species is traditionally italicized. However, for the purpose of this study, species names will not always be italicized. In some references the _genus_ part of the binomial may be shortened to a single initial, for example, _Balaenoptera musculus_ may be shortened to _B. musculus_.
The 8 main levels or _Ranks_ of taxonomy are:
Domain > Kingdom > Phylum > Class > Order > Family > Genus > Species
Example Taxonomy of a Blue Whale:
Eukarya > Animalia > Chordata > Mammalia > Cetacea > Balaenopteridea > Balaenoptera > musculus
All data in the ESA dataset is concerning animals from the *class* Mammalia (mammals). The focus of this project will be to generally identify the remaining taxonomic ranks: *order, family, or genus* given a limited number of variables available for prediction.
Please note, since the advent of widespread genetic testing, some re-organization of taxa has been occurring to better represent *monophyletic groups* or *clades* which share a common ancestor. One example of this is the order _Soricomorpha_, which appears in the Pantheria dataset, but has been partially reclassified since. The Pantheria dataset was not selected for relevance and currency - but because its hierarchical structure, sparseness, and imbalanced classes makes it an interesting subject for demonstration of random forest models.
## Strengths of CART & Random Forests in Taxonomy
If the purpose of taxonomy is to differentiate species by combinations of unique markers, then a CART / RF approach should work well. Some individual characteristics are very predictive, when classes are very distinct such as the Cetacea; a mammal with a body mass of greater than 100,000,000 grams is *always* going to be a whale.
Other characteristics like the trophic level, terrestriality, diet breadth, or gestational age may be less predictive, unless these characteristics are used in conjunction with other characteristics. A CART / Random Forest approach should reveal which variables are most important, and provide good prediction accuracy by finding the combinations of observations that most efficiently answer the question - which order/family/genus does this mammal belong to?
## Limits & Challenges
This model is not intended as a species-level classification tool, as it contains only one row with median values for any given species. A reasonable population of data is needed for any outcome class.
The data is incomplete and any analysis performed may suffer from the _"curse of dimensionality"_ - of the characteristics selected, many observations are present on only a portion of the rows. As the number of columns increases, the data becomes relatively sparse.
The number of species per taxonomical order varies greatly, with some orders having only one or two sub-families, genus, or species, and some having many individual species. This results in _unbalanced classes_, which may impact the accuracy of the final model. Predicting Chiroptera (bats) or Rodentia (rats, mice) has a much greater chance of being correct vs. guessing an obscure order with only a few species.
## Model Goals
Since there are many possible outcomes, it is necessary to frame the problem carefully: What are we trying to accomplish - what are the goals of the model?
* Accuracy: A measure of overall accuracy is one of the goals - we would like a model that predicts the correct order or the correct family most of the time.
* Diversity of Outcomes: We would like a model that will predict a diverse number of orders. Given the presence of many imbalanced classes, we could make a model that predicts Rodentia or Carnivora every time, and it would be correct much of the time. But what good is a model that looks at an elephant and predicts it is a rat or a wolf? We must balance the need to include minority classes. Increasing the selection of minority classes can decrease accuracy - So accuracy and diversity of outcomes are two competing forces in this model.
* Balance: We would like a model that balances precision and recall. The F1 score is a good measure of this. We will record the average of the balanced F1 score for predicted class. If 8 classes are predicted there will be 8 F1 scores. The mean of those scores will be recorded and considered during model analysis.
* Recursion: Given that there are so many families - the number of possible _family_ or _genus_ outcomes is too great without segmentation or recursion. We can determine the most likely order, then filter the families and re-run the model using a more focused set of data. This could be done again to predict at the genus level as well.
* Flexibility: We would like a classification model that can be flexible and take a range of inputs with as little or as much data as is available. This may necessitate a more complex approach to modeling that can adapt to a variable number of inputs, and handle missing data effectively.
# Data Setup
In this section we will load data, libraries, and packages, and set up the data for analysis.
## Load Packages
Several packages and external libraries are used by this project. Key utility packages include GGPlot2 for visualizations, dplyr for data manipulation, and the caret package for various functions. The randomforest and rpart packages contain key features used in the model construction and testing process.
```{r data setup, warning=FALSE, message = FALSE}
# Note: this process could take a couple of minutes
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(dslabs)) install.packages("dslabs", repos = "http://cran.us.r-project.org")
if(!require(rpart.plot)) install.packages("rpart.plot", repos = "http://cran.us.r-project.org")
if(!require(rgbif)) install.packages("rgbif", repos = "http://cran.us.r-project.org")
if(!require(corrr)) install.packages("corrr", repos = "http://cran.us.r-project.org")
if(!require(splitstackshape)) install.packages("splitstackshape", repos = "http://cran.us.r-project.org")
if(!require(mice)) install.packages("mice", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
library(randomForest)
library(mice)
library(tidyverse)
library(dslabs)
library(caret)
library(dplyr)
library(rpart.plot)
library("rgbif")
library(corrr)
library(splitstackshape)
library(data.table)
```
## Load & Process Mammal Data
In this section we will preprocess the data from the ESA Pantheria dataset. The data contains over 50 different variables. For the purposes of this study, we're selecting a subset of variables where observations are present for most rows. The variables selected will include the taxonomic data and numerous predictors including physical, behavioral, reproductive, and life history characteristics. The columns with enough observations will be copied into a new data frame and renamed for easier readability, and the data types and values will be cleaned up to work correctly with R.
```{r load data}
# Load data
#METADATA: http://esapubs.org/archive/ecol/E090/184/metadata.htm
#there are two data sets with overlapping data, so, I am using the larger, WR05 version.
#zooWR93 <- read.table(
#"http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR93_Aug2008.txt",
#sep="\t", header=TRUE)
#setwd("D:/RProjects/Zoo/")
reload=TRUE
if(reload){
zoo <- read.table("PanTHERIA_1-0_WR05_Aug2008.txt", sep="\t", header=TRUE)
}
#replace -999 with NA for correct handling in R
#see missing data section for more details on missing data handling
zoo[zoo==-999]<-NA
totalrows <- nrow(zoo)
```
There are `r totalrows` total original rows of data in the Pantheria file.
## Vernacular Name Lookup Function for Species Binomials
Here we set up a lookup function to find the vernacular / common name for a species binomial.
```{r get vernacular names, fig.width=12,fig.height=8}
#function to return a vernacular/common name for each species from GBIF
#Note that this can't run inside a loop or DPLYR mutate- it will cause a bad request error. Names may ONLY be looked up one at a time.
getvernacularname <- function(x){
speckey <- rgbif::name_backbone(name=x)$speciesKey
speckey <- as.numeric(speckey)
vernac <- rgbif::name_usage(key=speckey, data='vernacularNames')
vnames <- vernac[[2]] %>%
filter(language=='eng') %>%
select(vernacularName)
vx <- unique(vnames)
vname <- vx[which.max(tabulate(match(vnames, vx)))] %>%
slice(1:1) %>%
pull(vernacularName)
vname
}
df <- data.frame()
#example usage
lookupbinomial <- 'Balaenoptera musculus'
vernac <- getvernacularname(lookupbinomial)
lookup <- data.frame(lookup=paste('The common name for',lookupbinomial,'is',vernac))
df <- df %>% bind_rows(lookup)
lookupbinomial <- 'Rattus rattus'
vernac <- getvernacularname(lookupbinomial)
lookup <- data.frame(lookup=paste('The common name for',lookupbinomial,'is',vernac))
df <- df %>% bind_rows(lookup)
lookupbinomial <- 'Canis lupus'
vernac <- getvernacularname(lookupbinomial)
lookup <- data.frame(lookup=paste('The common name for',lookupbinomial,'is',vernac))
df <- df %>% bind_rows(lookup)
df %>% knitr::kable()
```
This function has returned the correct common names of the various species binomials we have passed in.
## Vernacular Name Index for Orders
We will also set up an index of common names for the taxonomical orders in the Mammalia family. Again, this is done to make the exploration of the data easier to understand.
```{r order index}
order_info <-data.frame()
order_desc <- data.frame(taxo_order='Afrosoricida',taxo_order_desc='Tenrecs')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Artiodactyla',taxo_order_desc='Even-toed ungulates')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Carnivora',taxo_order_desc='Bears & Wolves')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Cetacea',taxo_order_desc='Whales & Dolphins')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Chiroptera',taxo_order_desc='Bats')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Cingulata',taxo_order_desc='Armadillos & Anteaters')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Dasyuromorphia',taxo_order_desc='Carnivorous Marsupials')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Dermoptera',taxo_order_desc='Flying Lemurs')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Didelphimorphia',taxo_order_desc='Opossums')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Diprotodontia',taxo_order_desc='Herbivorous Marsupials')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Erinaceomorpha',taxo_order_desc='Hedgehogs')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Hyracoidea',taxo_order_desc='Hydraxes')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Lagomorpha',taxo_order_desc='Rabbits & Hares')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Macroscelidea',taxo_order_desc='Elephant Shrews')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Microbiotheria',taxo_order_desc='Monito del montes')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Monotremata',taxo_order_desc='Platypus & Echidnas')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Notoryctemorphia',taxo_order_desc='Marsupial Mole')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Paucituberculata',taxo_order_desc='Shrew Opossums')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Peramelemorphia',taxo_order_desc='Bandicoots')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Perissodactyla',taxo_order_desc='Odd-toed Ungulates')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Pholidota',taxo_order_desc='Pangolins')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Pilosa',taxo_order_desc='Sloths & Anteaters')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Primates',taxo_order_desc='Apes & Monkeys')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Proboscidea',taxo_order_desc='Elephants')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Rodentia',taxo_order_desc='Rats & Mice')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Scandentia',taxo_order_desc='Treeshrews')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Sirenia',taxo_order_desc='Manatees & Dugongs')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Soricomorpha',taxo_order_desc='Moles & Shrews')
order_info <- order_info %>% bind_rows(order_desc)
order_desc <- data.frame(taxo_order='Tubulidentata',taxo_order_desc='Aardvarks')
order_info <- order_info %>% bind_rows(order_desc)
order_info <- order_info %>% mutate(taxo_order = as.factor(taxo_order))
order_info <- order_info %>% mutate(taxo_order_desc = as.factor(taxo_order_desc))
order_info %>% slice(1:10) %>% knitr::kable()
```
## Data Wrangling & Feature Selection
In this section, we are mutating the data to be more friendly to work with. We will drop some of the fields that are not going to be used in this analysis project, and combine some of the fields into new fields which will contain a complete heirarchy of classes. Some of the data in the Pantheria datase was estimated / extrapolated - these fields are marked by the EXT. These will be coalesced with the actual values to provide the greatest final number of measurements.
```{r data wrangling}
#create a subset of teh original data with the variables we are interested in, with the correct data type formats
tinyzoo <- zoo %>%
dplyr::select(
taxo_order = MSW05_Order,
taxo_family = MSW05_Family,
taxo_genus = MSW05_Genus,
taxo_species = MSW05_Species,
headlen_mm=X13.1_AdultHeadBodyLen_mm,
forearmlen_mm=X8.1_AdultForearmLen_mm,
mass_grams=X5.1_AdultBodyMass_g,
mass_grams2=X5.5_AdultBodyMass_g_EXT,
litter_size=X15.1_LitterSize,
litters_peryear=X16.1_LittersPerYear,
litters_peryear2=X16.2_LittersPerYear_EXT,
gestation_days= X9.1_GestationLen_d,
longevity_months= X17.1_MaxLongevity_m,
diet_breadth=X6.1_DietBreadth,
pop_density= X21.1_PopulationDensity_n.km2,
trophic_level= X6.2_TrophicLevel,
terrestriality= X12.2_Terrestriality,
sex_maturity= X23.1_SexualMaturityAge_d,
wean_age= X25.1_WeaningAge_d,
activity_cycle= X1.1_ActivityCycle,
range_km2 = X26.1_GR_Area_km2,
max_lat = X26.2_GR_MaxLat_dd,
min_lat = X26.3_GR_MinLat_dd,
max_lng =X26.5_GR_MaxLong_dd,
min_lng =X26.6_GR_MinLong_dd,
mid_lat = X26.4_GR_MidRangeLat_dd,
mid_lng =X26.7_GR_MidRangeLong_dd,
references=References
) %>%
mutate(
data_set = 'PanTHERIA_WR05_Aug2008.txt',
taxo_class = as.factor('Mammalia'),
taxo_order = as.factor(taxo_order),
taxo_family = as.factor(taxo_family),
taxo_genus = as.factor(taxo_genus),
taxo_species = as.factor(taxo_species),
taxo_binomial = as.factor(paste(taxo_genus, taxo_species)),
taxo_order_fam = as.factor(paste(taxo_order,'>', taxo_family)),
taxo_order_fam_genus = as.factor(paste(taxo_order,'>', taxo_family,'>', taxo_genus)),
taxo_all = as.factor(paste(taxo_class,'>',taxo_order,'>', taxo_family,'>', taxo_genus, '>', taxo_species)),
headlen_mm = as.numeric(headlen_mm),
forearmlen_mm = as.numeric(forearmlen_mm),
mass_grams = as.numeric(coalesce(mass_grams,mass_grams2)),
litter_size = as.numeric(litter_size),
litters_peryear = as.numeric(coalesce(litters_peryear,litters_peryear2)),
gestation_days = as.numeric(gestation_days),
longevity_months = as.numeric(longevity_months),
diet_breadth=as.numeric(diet_breadth),
pop_density= as.numeric(pop_density),
trophic_level= as.numeric(trophic_level),
terrestriality= as.numeric(terrestriality),
sex_maturity= as.numeric(sex_maturity),
activity_cycle= as.numeric(activity_cycle),
range_km2 = as.numeric(range_km2),
max_lat = as.numeric(max_lat),
min_lat = as.numeric(min_lat),
max_lng = as.numeric(max_lng),
min_lng = as.numeric(min_lng),
mid_lat = as.numeric(mid_lat),
mid_lng = as.numeric(mid_lng),
wean_age= as.numeric(wean_age),
marine = ifelse(taxo_order == 'Cetacea' | taxo_order == 'Sirenia',1,0),
flying = ifelse(taxo_order == 'Chiroptera',1,0),
) %>%
left_join(order_info, by="taxo_order") %>%
select(data_set,
taxo_class,
taxo_order,
taxo_order_desc,
taxo_family,
taxo_genus,
taxo_order_fam,
taxo_order_fam_genus,
taxo_species,
taxo_binomial,
taxo_all,
references,
headlen_mm,
forearmlen_mm,
mass_grams,
litter_size,
litters_peryear,
gestation_days,
longevity_months,
diet_breadth,
pop_density,
trophic_level,
terrestriality,
sex_maturity,
wean_age,
activity_cycle,
range_km2,
mid_lat,
mid_lng,
max_lat,
min_lat,
max_lng,
min_lng,
marine,
flying
) %>%
arrange(taxo_order,taxo_family,taxo_genus,taxo_species)
```
# Analysis
In this section we will analyze the Pantheria data to thoroughly understand the distribution, completeness, and other properties of the data.
```{r map orders, fig.width=16,fig.height=8}
world <- map_data("world")
mapdata <- tinyzoo %>% mutate(order_info = paste(taxo_order,"\n",taxo_order_desc,"\n"))
wmap <- ggplot(world, aes(long, lat)) +
geom_point(size = .1, show.legend = FALSE) +
geom_point(data = mapdata, alpha=.7,
mapping = aes(
size = range_km2,
x = mid_lng,
y = mid_lat,
colour = order_info)) +
coord_quickmap() +
labs(title = "All Orders Geographic Distributions",
caption = "Data source: Pantheria")
ggsave("worldmap.png",plot = wmap,width=16,height=8)
```
```{r, fig.width=14,fig.height=12}
rmap <- ggplot(world, aes(long, lat)) +
geom_point(size = .1, show.legend = FALSE) +
geom_point(data = tinyzoo %>% filter(taxo_order %in% c('Afrosoricida','Chiroptera','Primates','Lagomorpha','Rodentia','Carnivora','Proboscidea','Diprotodontia')) %>% mutate(order_info = paste(taxo_order,"\n",taxo_order_desc,"\n")), alpha=.7,
mapping = aes(
size = range_km2,
x = mid_lng,
y = mid_lat,
colour = order_info)) +
coord_quickmap() + facet_wrap(taxo_order ~ ., ncol = 2) +
labs(title = "Selected Orders Geographic Distributions",
caption = "Data source: Pantheria")
ggsave("regionmap.png",plot = rmap,width=14,height=12)
```
![world map](worldmap.png)
![region map](regionmap.png)
This map shows the vast scale of the Pantheria database - it contains data related to thousands of species from all over the world. The dataset contains a latitude/longitude (geocode) for the center point of each species' known range, which have been plotted on these maps.
Some orders like Diprotodontia (Marsupials) are found only in Australia - Dermoptera (Flying Lemurs) in Indonesia & the Phillipines, Proboscidea (Elephants) in Africa and India & Tubulidentata (Aardvarks) which occur only in Sub Saharan africa.
On the other hand, some species like Chiroptera (Bats) Carnivora (Meat Eaters), Rodentia (Rats & Mice), and Lagomorpha (Rabbits) are found all over the globe. Cetacea (Whales) and Sirenia (Manatees & Dugongs) have no geocodes or range data recorded, presumably because they are marine mammals.
## Examining Data for Completeness
Now that we've created a consolidated data set, we will analyze the details, examine the data for completeness, and try to determine what machine learning methods would be most appropriate to produce the most accurate predictions. The data used in this project comes from a scientific database which was derived from many sources.
The original source of each data reference may be found by accessing the metadata link in the references section and finding the corresponding numeric codes. The values provided by Pantheria are already highly reduced and, in some cases, the values are the product of a regression model. The machine learning model we are building already has a lot of the data collection work "baked in" and we have just one row per species.
```{r missing values list, fig.width=12,fig.height=8}
#review how many rows have NA values
missing_list <- data.frame(
headlen_mm_missing = sum(is.na(tinyzoo$headlen_mm)),
forearmlen_mm_missing = sum(is.na(tinyzoo$forearmlen_mm)),
mass_grams_missing = sum(is.na(tinyzoo$mass_grams)),
litter_size_missing = sum(is.na(tinyzoo$litter_size)),
litters_peryear_missing = sum(is.na(tinyzoo$litters_peryear)),
gestation_days_missing = sum(is.na(tinyzoo$gestation_days)),
longevity_months_missing = sum(is.na(tinyzoo$longevity_months)),
diet_breadth_missing = sum(is.na(tinyzoo$diet_breadth)),
pop_density_missing = sum(is.na(tinyzoo$pop_density)),
trophic_level_missing = sum(is.na(tinyzoo$trophic_level)),
terrestriality_missing = sum(is.na(tinyzoo$terrestriality)),
sex_maturity_missing = sum(is.na(tinyzoo$sex_maturity)),
wean_age_missing = sum(is.na(tinyzoo$wean_age)),
range_km2_missing = sum(is.na(tinyzoo$range_km2)),
mid_lat_missing = sum(is.na(tinyzoo$mid_lat)),
mid_lng_missing = sum(is.na(tinyzoo$mid_lng))
)
missing_list = t(missing_list)
missing_list %>% knitr::kable()
```
```{r missing values, fig.width=12,fig.height=8}
#prepare data for a heatmap grid
missingpcts <- tinyzoo %>%
select(-taxo_family,-taxo_genus,-taxo_species,-taxo_binomial, -taxo_class,-taxo_order_desc,-taxo_order_fam,-taxo_order_fam_genus,-taxo_all) %>%
group_by(taxo_order) %>%
summarize_all(.funs = funs('NA' = sum(is.na(.))/n()))
row.names(missingpcts) <- missingpcts$taxo_order
heatmapdata <- missingpcts %>%
rownames_to_column() %>%
gather(colname, value, -rowname)
r <- 1
#render grid & annotations
ggplot(heatmapdata, aes(x = rowname, y = colname, alpha = value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="none") +
annotate("path",color="red",
x=5+r*cos(seq(0,2*pi,length.out=100)),
y=5+r*sin(seq(0,2*pi,length.out=100))) +
annotate("text",x=10,y=5,color="red",label="Only Chiroptera have forearm length") +
annotate("path",color="red",
x=4+r*cos(seq(0,2*pi,length.out=100)),
y=17+5*sin(seq(0,2*pi,length.out=100))) +
annotate("path",color="red",
x=27+r*cos(seq(0,2*pi,length.out=100)),
y=17+5*sin(seq(0,2*pi,length.out=100))) +
annotate("text",x=10,y=20,color="red",label="Cetacea + Sirenia missing geographic fields") +
labs(title = "Missing Data by Order",
caption = "Data source: Pantheria")
```
\newpage
The dataset contains `r totalrows ` total rows with columns containing the following data points. The taxonomy follows the convention set in Wilson & Reeder's Mammal Species of the World. A Taxonomic and Geographic Reference (3rd ed), published 2005.
Class data:
* taxo_order - the taxonomical order
* taxo_family - the taxonomical family
* taxo_genus - the taxonomical genus
* taxo_species - the taxonomical species
Commonly recorded attributes:
* headlen_mm - Head & body length in mm
* mass_grams - Mass in grams
* litter_size - Number of offspring per litter
* litters_peryear - Number of litters per pear
* gestation_days - Length of gestational period in days
* longevity_months - Lifespan in months
* diet_breadth - Number of types of food sources
* pop_density - Number of individuals per square km
* trophic_level - Level in the food chain
* sex_maturity - Days to maturity
* wean_age - Days to weaning
* activity cycle - Diurnal (active during the day) / Nocturnal (active at night) etc.
Chiroptera (Bats) only:
* forearmlen_mm - Length of forearm (wing) - a key indicator in bats
Land dwellers only have geo coordinates & terrestriality:
* terrestriality - Above ground/under ground dwelling
* min_lat - Minimum latitude
* max_lat - Maximum latitude
* min_lng - Minimum longitude
* max_lng - Maximum longitude
* mid_lat - Mid range latitude
* mid_lng - Mid range longitude
* range_km2 - Range in square km
The column with the fewest missing entries is body mass. Aside from the geocodes and forearm length, the column with the most missing entries is population density.
The patterns of missingness may be influenced by differences in the data collection processes for various traits. It is probably easier to ascertain certain physical values like mass (which can be measured fairly instantly) vs something like population density or longevity - which requires observation of a group, long periods of time, or complex studies to observe and record. It also makes sense that certain characteristics would not be as widely recorded for orders where it is not easily measured or not as useful, or not applicable. In particuler - flying mammals Chiroptera (Bats) and marine mammals - Cetacea (Whales) and Sirenia (Manatees & Dugongs) have some key differences in which columns are commonly recorded.
```{r complete rows, warning=FALSE,fig.width=12,fig.height=8}
#review how many rows have ALL nonnegative values
#this would limit the number of possible matches if we use a complete case apporoach
complete_rows <- tinyzoo %>% filter(
headlen_mm > 0 &
mass_grams > 0 &
activity_cycle > 0 &
litter_size > 0 &
litters_peryear > 0 &
gestation_days > 0 &
longevity_months > 0 &
diet_breadth > 0 &
pop_density > 0 &
trophic_level > 0 &
sex_maturity > 0 &
wean_age > 0
)
complete_count <- nrow(complete_rows)
```
Only `r complete_count` species have all 12 common characteristics recorded. Considering the relatively small number of _complete rows_ in the Pantheria data, a _complete case analysis_ would be of limited use. This approach could only incorporate a small fraction of the total observations, and the number of possible outcome classifications would include very few outcomes. This method would also require more complete data to be provided to make a prediction.
```{r empty rows, warning=FALSE,fig.width=12,fig.height=8}
#find rows that have no predictor values
empty_rows <- tinyzoo %>% filter(
is.na(headlen_mm) &
is.na(mass_grams) &
is.na(litter_size) &
is.na(litters_peryear) &
is.na(gestation_days) &
is.na(longevity_months) &
is.na(diet_breadth) &
is.na(trophic_level) &
is.na(pop_density) &
is.na(sex_maturity) &
is.na(wean_age) &
is.na(activity_cycle)
)
#record number of empty rows
empty_count <- nrow(empty_rows)
```
We will remove any rows that are missing all of the 12 common predictors. The original dataset contains `r empty_count ` empty rows that will be removed.
```{r}
#remove empty rows from tinyzoo
tinyzoo <- tinyzoo %>% anti_join(empty_rows,by="taxo_binomial")
tinyzoo$taxo_order <- factor(tinyzoo$taxo_order)
tinyzoo$taxo_order <- droplevels(tinyzoo$taxo_order)
```
Now we will examine the pattern of missingness. We must determine if most rows are missing the same columns, and if any discernable patterns are present in the missing data.
```{r mice, message=FALSE,warning=FALSE,echo=FALSE,fig.width=6,fig.height=12}
library(mice)
#make a dataframe with just the predictors
predictors_only <- tinyzoo %>% select(
headlen_mm,
forearmlen_mm,
mass_grams,
litter_size,
litters_peryear,
gestation_days,
longevity_months,
diet_breadth,
pop_density,
trophic_level,
terrestriality,
sex_maturity,
wean_age,
activity_cycle,
range_km2,
mid_lat,
mid_lng
)
##Inspect missingness of first 100 rows in the training adata
missing <- md.pattern(predictors_only %>% sample_n(100),rotate.names=TRUE,plot=TRUE)
#missing
#missingness <- md.pattern(predictors_only,rotate.names=TRUE,plot=FALSE)
#missingness
```
Selecting 100 random rows, there are many missing values (red cells), and many different patterns of completedness in data (blue cells). The number on the left is the number of rows with that pattern of missingness. The number on the right is how many columns are missing. Missingness can be described in any of the following ways: MAR - Missing at Random, MCAR - Missing Completely at Random, or MNAR - Not Missing at Random (Rubin, 1976). For data to be considered MCAR - it must have approximately the same probability of being missing across all classes - but this is not the case here - some classes are missing greater portions of data in specific fields or groups of fields. Within these calsses, the data appears to be missing somewhat at random, so this would be MAR. Modern missing data methods usually start from the MAR assumption. (Van Buuren, 2018)
In this dataset, MAR is probably the most applicable classification, but there is also a case that some of the data is MNAR. Some variables may simply not be applicable to all species. In most classes, the missingness of rows appears to be fairly random, and is likely a natural consequence of the data collection method - this data is compiled from over 3000 reference sources. Given the fact that data did not all come from one researcher or one study - that implies that a certain amount of missingness would be expected on traits that had not been studied extensively, particularly concerning obscure or hard-to-find species. In these cases the data could also be considered MNAR as it is reflecting aspects of the data collection process - and the missingness is not at random, but a consequence of the survey methodology or other identifiable factors. For the purpose of this report, we will assume the data is MAR, but we will also try to identify and deal with data that is missing in a discernable non-random pattern.
The default approach to most regression techniques would be to omit any rows in the training data which are missing any values - This is known as _Complete Case Analysis_ or may be called _List Wise Deletion_, as it is omitting entire rows from the list of training data. (Gelman, 2006) However this is not a practical approach when so few rows are complete. We will instead attempt to replace NA values with other imputed values in the training data, using various imputation strategies. We will then apply _Available Case Analysis_ or _Pair-Wise Deletion_ - deleting columns in the training data, to match the same columns supplied in each row of test data. This will require construction of a complex model that will pair each set of available predictor columns with only those matching data points in the training data, either real, or imputed.
Imputation of missing values could improve the number of possible outcomes, by simulating data points that were not recorded for any individual species within a given taxa. Since taxonomical classification by its very nature attempts to group similar organisms, it should be reasonable to assume that a particular species' characteristics (if missing) could be imputed based on an average of other species in the same genus, family, or order. This approach should work if the genus, family or order contains relatively similar animals - but may fail in recognizing correlations between specific observed traits, and other missing traits on the same row of data. Specifically - using an average head-body length from the genus or family, may be less accurate than say, estimating the head-body length based on a knowm, highly correlated variable such as mass. Imputation can be a complex, and subject to many combinations of methods, even within one dataset. We will experiment with how well the system performs with various imputation methods applied.
\newpage
## Examining Species Counts
The charts below will examine the count, distribution and ranges of values in our data. In this section we will consider the number of species per order.
```{r examine order data points, fig.width=12,fig.height=6}
#create plot of species count by orders
plot_ord <- tinyzoo %>%
group_by(taxo_order) %>%
dplyr::summarize(group_count = n()) %>%
ggplot(aes(x=taxo_order,y=group_count, fill=taxo_order)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="none") +
labs(title = "Number of Species per Order",
caption = "Data source: Pantheria")
plot_ord
```
```{r, fig.width=12,fig.height=12}
#summarize orders
all_ord <- tinyzoo %>% group_by(taxo_order) %>%
dplyr::summarize(group_count = n())
#summarize median count of species per order
median_group_count <- median(all_ord$group_count)
#create a table of minority orders
minority_ord <- tinyzoo %>% group_by(taxo_order) %>%
dplyr::summarize(group_count = n()) %>%
filter(group_count < median_group_count)
#how many are there?
minority_groups <- nrow(minority_ord)
```
The median species count is `r median_group_count` species per order. We will consider any groups with fewer than the median group count, to be minority classes. This will be the basis for oversampling the training set. Of 29 orders in the data, `r minority_groups` could be considered minority groups.
We are now performing the same analysis as above, but at the family level. Since there are so many outcomes, we will need to filter this chart to a subset of orders. We will just examine the 3 largest orders - Rodentia, Chiroptera and Primates.
```{r examine family, fig.width=16,fig.height=6}
#create families species count plot
plot_fam <- tinyzoo %>% filter(taxo_order %in% c('Rodentia','Chiroptera','Primates')) %>%
group_by(taxo_order, taxo_order_fam) %>%
dplyr::summarize(group_count = n()) %>%
ggplot(aes(x=taxo_order_fam,y=group_count, fill=taxo_order)) +
geom_col() +
facet_wrap(taxo_order ~ ., scales = "free") +
theme(axis.text.x = element_text(size=8, angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="none") +
labs(title = "Number of Species per Family - Orders: Chiroptera, Primates, Rodentia",
caption = "Data source: Pantheria")
plot_fam
```
Clearly the classes are very imbalanced. Some have one or two species, and others have hundreds.
```{r examine family 3, fig.width=12,fig.height=12}
#table of group counts
all_fam <- tinyzoo %>%
group_by(taxo_order_fam) %>%
dplyr::summarize(group_count = n())
#med group count
median_fam_count <- median(all_fam$group_count)
#select minority classes
minority_fam <- tinyzoo %>%
group_by(taxo_order_fam) %>%
dplyr::summarize(group_count = n()) %>%
filter(group_count < median_fam_count) %>%
mutate(minority = TRUE) %>%
ungroup()
```
The median number of families per order across all species is: `r median_fam_count`. We may also use this detail as the basis for some imputation strategies to be explored later in this report.
We are now performing the same analysis as above, but at the genus level.
```{r examine family 2, fig.width=16,fig.height=6}
#create genus species counts plot
plot_fam <- tinyzoo %>% filter(taxo_order == 'Chiroptera' & taxo_family == 'Vespertilionidae') %>%
group_by(taxo_order, taxo_genus) %>%
dplyr::summarize(group_count = n()) %>%
ggplot(aes(x=taxo_genus,y=group_count, fill=taxo_genus)) +
geom_col() +
facet_wrap(taxo_order ~ ., scales = "free") +
theme(axis.text.x = element_text(size=8, angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="none") +
labs(title = "Number of Species per Genus - Order: Chiroptera - Family: Vespertilionidae",
caption = "Data source: Pantheria")
plot_fam
```
Once again we can see that there is significant imbalance in the number of genus groups per family.
```{r}
#create a table of all genus types for reference
all_genus <- tinyzoo %>%
group_by(taxo_order_fam_genus) %>%
dplyr::summarize(group_count = n())
```
## Visualization of Trait Distributions
In the following plots we will examine the taxonomic data to visualize how the distributions vary between each order. First we will examine some of the raw data, to make sure the values agree with general knowledge.
```{r getminmax, warning=FALSE, message=FALSE, fig.width=12,fig.height=8}
#getminmax function to compare species with min/max values for a given metric
getminmax <- function(data, col) {
topspecies <- data %>%
arrange(desc(!!sym(col))) %>%
slice(1:1) %>%
mutate(species_binom = paste(taxo_genus,taxo_species)) %>%
pull(species_binom)
botspecies <- data %>%
arrange(!!sym(col)) %>%
slice(1:1) %>%
mutate(species_binom = paste(taxo_genus,taxo_species)) %>%
pull(species_binom)
topstat <- data %>%
arrange(desc(!!sym(col))) %>%
slice(1:1) %>%
mutate(species_binom = paste(taxo_genus,taxo_species)) %>%
pull(sym(col))
botstat <- data %>%
arrange(!!sym(col)) %>%
slice(1:1) %>%
mutate(species_binom = paste(taxo_genus,taxo_species)) %>%
pull(sym(col))
checkrow <- data.frame(
stat=col,
#maxspecies=topspecies,
maxspecies_vernacular=getvernacularname(topspecies),
maxstat=topstat,
#minspecies=botspecies,
minspecies_vernacular=getvernacularname(botspecies),
minstat=botstat
)
checkrow
}
spectable <- data.frame()
checkrow <- getminmax(tinyzoo,c("mass_grams"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("headlen_mm"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo %>% filter(flying == 1),c("forearmlen_mm"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("litter_size"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("litters_peryear"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("longevity_months"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("gestation_days"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("wean_age"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("pop_density"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("trophic_level"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("terrestriality"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("diet_breadth"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("activity_cycle"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("range_km2"))
spectable <- spectable %>% bind_rows(checkrow)
checkrow <- getminmax(tinyzoo,c("mid_lat"))
spectable <- spectable %>% bind_rows(checkrow)
spectable %>% knitr::kable()
```
The table above examines the data based on the minimum and maximum of each variable. Most of the details in the the table seem to agree with general knowledge. Mass and head size reveal the huge Blue whale vs. tiny, lightweight Bumblebee Bats. Examining Litter Size, Litters per year, Gestation, and Weaning age - Large mammals including Elephants and Sperm Whales, which reproduce infrequent small litters (K-selected species) appear opposite diminuitive Mice, Tenrec, and Echidnas (r-selected species) which are much more prolific in reproductive frequency. Chimpanzees are another K-selected species with a long weaning age vs. the very fast reproducing Viscacha, that wean their young for a very short period. Population density statistics indicate the Hooded Seal as the most solitary vs highly communal Water Voles.
The following box plots and scatter plots show considerable diversity in the distributions, types, and ranges of the various data points for each taxonomical _order_. There are 29 orders represented in our dataset. The box plots indicate the inter-quartile ranges of each "creature feature" including continuous-valued variables like mass & head size, as well as discrete-valued variables like terrestriality, diet breadth, and activity cycle. The aim of a machine learning model would be to find patterns in these features that are predictive in a way that's useful.
In general - determing the taxonomical order is not terribly difficult - For example - sharp canine teeth mark Carnivora. Mammals with wings are Chiroptera. Hooved animals with even number of toes are always Perissodactyla. A classification tool determining orders may be of limited value - though it could be useful in some cases where two orders may contain similar species. For example, Scandentia, Soricomorpha, Paucituberculata, Macroscelidea all contain shrews. So we might expect to see more confusion between these similar orders.
```{r viz orders, fig.width=12,fig.height=12}
# Plot distribution of each predictor stratified by order
tinyzoolog <- tinyzoo %>% mutate(mass_grams_log10 = log10(mass_grams),
headlen_mm_log10 = log10(headlen_mm),
pop_density_log10 = log10(pop_density))
#create some log10 transforms
tinyzoolog %>% gather(varz, length,
mass_grams_log10,
headlen_mm_log10,
litter_size,
litters_peryear,
gestation_days,
longevity_months,
diet_breadth,
pop_density_log10,
trophic_level,
terrestriality,
sex_maturity,
wean_age,
activity_cycle,
range_km2,
mid_lat,
mid_lng
) %>%
ggplot(aes(paste(taxo_order),
length,
fill = paste(taxo_order))) +
geom_boxplot() +
facet_wrap(~varz, scales = "free", ncol=2) +
theme(axis.text.x = element_blank()) +
theme(legend.position="bottom",axis.text.x = element_blank()) +
labs(title = "Distributions of Physical, Behavioral and Reproductive Characteristics", subtitle = "Distribution of median observed values for each species grouped by order", caption = "Data source: Pantheria")
```
The difference between families are much more specific, and can be used to differentiate more similar species. We will examine some characteristics of families within the largest order, Rodentia.
```{r viz families, fig.width=12,fig.height=12}
# Plot distribution of each predictor stratified by family
tinyzoolog %>% filter(taxo_order == 'Rodentia') %>% gather(varz, length,
mass_grams_log10,
headlen_mm_log10,
litter_size,
litters_peryear,
gestation_days,
longevity_months,
diet_breadth,
pop_density_log10,
trophic_level,
terrestriality,
sex_maturity,
wean_age,
activity_cycle,
range_km2,
mid_lat,
mid_lng) %>%
ggplot(aes(paste(taxo_family),
length,
fill = paste(taxo_family))) +
geom_boxplot() +
facet_wrap(~varz, scales = "free", ncol=2) +
theme(legend.position="bottom",axis.text.x = element_blank()) +
labs(title = "Distributions of Physical, Behavioral and Reproductive Characteristics - Order: Rodentia", subtitle = "Distribution of median observed values for each species grouped by family", caption = "Data source: Pantheria")
```
Predicting the family correctly could be more helpful in distinguishing fairly similar species - for example voles (Cricetidae), Rats (Muridae), and Gophers (Geomyidae) are all similar families in order Rodentia.
```{r viz genus, fig.width=12,fig.height=12}
# Plot distribution of each predictor stratified by genus
tinyzoolog %>% filter(taxo_order == 'Rodentia' & taxo_family == 'Sciuridae') %>% gather(varz, length,
mass_grams_log10,
headlen_mm_log10,
litter_size,
litters_peryear,
gestation_days,
longevity_months,
diet_breadth,
pop_density_log10,
trophic_level,
terrestriality,
sex_maturity,
wean_age,
activity_cycle,
range_km2,
mid_lat,
mid_lng) %>%
ggplot(aes(paste(taxo_genus),
length,
fill = paste(taxo_genus))) +
geom_boxplot() +
facet_wrap(~varz, scales = "free", ncol=2) +
theme(legend.position="bottom",axis.text.x = element_blank()) +
labs(title = "Distributions of Physical, Behavioral and Reproductive Characteristics - Order: Rodentia, Family: Sciuridae", subtitle = "Distribution of median observed values for each species grouped by genus", caption = "Data source: Pantheria")
```
And finally, looking at one family of Rodents - we can see that the data becomes more sparse at the genus level, with many columns containing missing data. There also appear to be many genus's that have only one value recorded for all rows (as indicated by a horizontal bar). The genus is the most specific rank that would be practical to predict in a machine learning context given this data set.
These charts suggest that a recursive model might also be practical. There are too many classes of genus to predict that right off the bat - so the first model will train on all data to predict the order, then retrain once an order has been predicted, with data that only includes the families within that one order, then again at the family level to predict the genus. Orders with more complete data should produce more accurate results when recursed.
## Examining Correlations
The charts below visualize some of the most highly correlated variables and how their distributions are grouped at the order level. We will consider how closely the various predictors are correlated.
```{r examine correlations, fig.width=12,fig.height=8}
library(corrr)
library(gridExtra)
library(grid)
#create a dataset with just the predictors
tzvars <- tinyzoo %>% select(mass_grams,headlen_mm,litter_size,litters_peryear,gestation_days,wean_age,sex_maturity,longevity_months,pop_density,diet_breadth,terrestriality,trophic_level,activity_cycle,range_km2,mid_lat,mid_lng)
#examine correlations
cor_tbl <- corrr::correlate(tzvars)
#show the table
cor_tbl
```
This table demonstrates that there are strong correlations between certain sets of values:
* Longevity and sex maturity: 78%
* Mass and head length: 76%
* Weaning age and sex maturity: 74%
* Sex maturity and gestation days: 72%
* Longevity and gestation days: 70%
* Longevity and head length: 67%
* Weaning age and gestation days: 55%
* Litter size and gestation days: -55%
These correlations suggest that some fields may be candidates for use of linear regression to impute some missing values, since the value of one know variable may be highly correlated with teh value of an unknown one. The following scatter plots have been faceted for easier readability, but they are still somewhat crowded. The convex hull shows the outline of each species x/y distribution of two highly correlated variables.
```{r getpointplot, fig.width=12,fig.height=16}
#function to display correlated variables scatterplot with convex hull
getpointplot <- function(data, col1, col2, colgroup, loglist, opac, desc) {
hull <- data %>%
filter(!is.na(!!sym(col1)) & !is.na(!!sym(col2))) %>%