## Exploring the data

In [7]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Questions to think about:

**1. Try and gauge the scope of the database. How many types of animals? How many types of variables? What are the measurements? Are there variables that are required for each data point? Try to explain in a programmatic way.**

While exploring the database, a vital question came to me: is it possible to aggregate ALL of the data from MorphoSource? Probably. However, the sheer amount of data available would most likely overburden my computer's memory, and unless there's a way that I'm not aware of, I would have to add files to the cart and download them one by one (which would be very tedious and could result in missed datasets).

The scope of the data spans many areas, and there are most likely over 10,000 specimens (and growing). There is data not only about the specimen/organism itself, but also about its body parts (e.g., teeth).

In the context of the database that I'm currently working with, "Project: oVert: UW - CT Scan all Fishes" from Cornell University's Museum of Vertebrates, the scope of the data is basically all the known fish species in the world (or at least that is what the team is trying to achieve). As of right now, there are around 8000 recorded and scanned fish in their dataset, labeled with around 20 variables. The majority of the column labels (from the provided Google Doc) pertain mainly to the type of scanner, dates, times, and other information not related to the actual fish. The dataset does provide the general categorizations (i.e. Genus, Species), and a few of the fishes have extra links that provide additional information. See [here](https://docs.google.com/spreadsheets/d/1TUqJJNPFdAEjncQXJ8D6iX0ootl5EfiNNHEE9kPgQpE/edit?ts=5702a468#gid=0)


In [8]:
file = Table.read_table("Fish_scans.csv")  # Read in the file

# Group by species name
Species_grouped = file.group("Species")

# The number of rows in the grouped table corresponds to the number of unique species names recorded.
# After skimming the data, I filtered out the entries with unspecified species names.
# Note: I may have missed some because unknown species are not labeled consistently.
# This is another thing to account for when looking at this dataset (unstandardized naming inputs).
filter_count = (Species_grouped
                .where("Species", are.not_equal_to("nan"))
                .where("Species", are.not_equal_to("not sure"))
                .where("Species", are.not_equal_to("unsure"))
                .where("Species", are.not_equal_to("tbd")))

# Number of unique fish species
filter_count.num_rows

3565
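
The chained filters above only catch exact matches, so a variant such as "Unsure" or " TBD " would slip through. Below is a minimal sketch of a case- and whitespace-insensitive check in plain Python. The placeholder set and the sample values are illustrative assumptions, not values taken from the spreadsheet.

```python
# Hypothetical placeholder labels; extend this set after skimming the real column.
PLACEHOLDERS = {"", "nan", "not sure", "unsure", "tbd", "??"}

def is_known(label):
    """Return True if a label looks like a real name rather than a placeholder."""
    return str(label).strip().lower() not in PLACEHOLDERS

# Made-up sample values standing in for the "Species" column.
sample_species = ["mykiss", "Unsure", " TBD ", "nan", "clarkii", "mykiss"]

# Distinct, cleaned-up names that survive the placeholder check.
known_species = sorted({s.strip() for s in sample_species if is_known(s)})
print(known_species)  # ['clarkii', 'mykiss']
```

Keeping the placeholders in one set also makes it easy to reuse the same check for the family and genus columns.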

In [9]:
Family_grouped = file.group("family")

#filtered out unknowns
filtered_family = Family_grouped.where("family", are.not_equal_to("??")).where("family", are.not_equal_to("nan"))
# Family_grouped.show() #used this to look through/skim dataset since it's in alphabetical order here

#unique families
filtered_family.num_rows

390

In [10]:
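Genus_grouped = file.group("Genus")

# Filter out spreadsheet error values and missing entries
filtered_genus = Genus_grouped.where("Genus", are.not_equal_to("#VALUE!")).where("Genus", are.not_equal_to("nan"))
# Filter out a stray numeric entry ("21") that is not a genus name
filtered_anomaly = filtered_genus.where("Genus", are.not_equal_to("21"))
# Unique genera
filtered_anomaly.num_rows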

1473
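
One caveat on the species count: if the "Species" column holds only the species epithet, grouping on it alone merges species from different genera that share an epithet. Counting distinct (genus, species) pairs avoids this. A small sketch with a handful of sample rows (chosen for illustration, not read from the file):

```python
# Sample (genus, species) rows; the real table has one row per scanned specimen.
rows = [
    ("Oncorhynchus", "clarkii"),
    ("Amphiprion", "clarkii"),    # same epithet, different genus
    ("Oncorhynchus", "mykiss"),
    ("Oncorhynchus", "mykiss"),   # a second specimen of the same species
]

# Distinct epithets vs. distinct genus + species combinations.
distinct_epithets = {species for _, species in rows}
distinct_binomials = {(genus, species) for genus, species in rows}

print(len(distinct_epithets))   # 2
print(len(distinct_binomials))  # 3
```

In the `datascience` library, `Table.groups` with a list of labels such as `["Genus", "Species"]` is meant for this kind of multi-column grouping.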

**2. Data Quality: Are there ways to assess the quality of this data? For example, if there are geographic points, does it make sense that whales were found in the middle of Iowa? What are some assumptions that you have to make about this data? Is there clear documentation on the standards that are required to input data into this database?**

Honestly, I would say that the quality of the data can be improved, especially since there are a good number of "N/A" and/or "None" inputs (as briefly analyzed in the previous post). I would also prefer that they provide some sort of key explaining what the column labels refer to. I would also like to know where the team gathered these fishes, at least in the Google Doc. (The "spec. image" column does link to a photo of the actual fish that was scanned.) The whereabouts and other specifics about a fish can be found by clicking into the individual fish scan, but this is tedious if I want to compare certain fishes with others. From the context of their description, I will assume that their primary goal is simply to scan all the fishes and not to provide any analysis of the data. Still, their collection can be helpful for research on questions such as how many types of fish are in a certain genus.

**3. Completeness: requires that a particular column, element or class of data is populated and does not feature null values or values in place of nulls (e.g. N/As).**

Not fully complete! See the brief analysis below: in each category, fewer than 1% of the distinct labels are unknown placeholders.


In [11]:
# Percentage of distinct labels that are unknown placeholders, for each category
unk1 = (Family_grouped.num_rows - filtered_family.num_rows) / Family_grouped.num_rows * 100
unk2 = (Species_grouped.num_rows - filter_count.num_rows) / Species_grouped.num_rows * 100
unk3 = (Genus_grouped.num_rows - filtered_anomaly.num_rows) / Genus_grouped.num_rows * 100
# with_columns (plural) is the Table method for adding several columns at once
briefUnk = Table().with_columns("Category", make_array("Family", "Species", "Genus"),
                                "Unknown percentage", make_array(unk1, unk2, unk3))
briefUnk

| Category | Unknown percentage |
|---|---|
| Family | 0.510204 |
| Species | 0.112076 |
| Genus | 0.203252 |
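
Note that these percentages are shares of distinct labels (the family figure works out to 2 placeholder labels out of 392), not shares of specimen rows. A row-level measure weights each placeholder by how many specimens carry it. A minimal sketch comparing the two, using made-up values and a hypothetical placeholder set:

```python
from collections import Counter

PLACEHOLDERS = {"", "nan", "??"}  # hypothetical placeholder labels

def percent_unknown_rows(values):
    """Percentage of rows whose label is a placeholder."""
    counts = Counter(str(v).strip().lower() for v in values)
    unknown = sum(n for label, n in counts.items() if label in PLACEHOLDERS)
    return 100 * unknown / sum(counts.values())

def percent_unknown_labels(values):
    """Percentage of distinct labels that are placeholders."""
    labels = {str(v).strip().lower() for v in values}
    return 100 * len(labels & PLACEHOLDERS) / len(labels)

# Made-up stand-in for the "family" column: 10 rows, 3 of them unknown.
sample_family = ["Salmonidae"] * 5 + ["Cyprinidae"] * 2 + ["??"] * 2 + ["nan"]

print(percent_unknown_rows(sample_family))    # 30.0
print(percent_unknown_labels(sample_family))  # 50.0
```

The two measures can differ a lot in either direction, so it is worth reporting which one is meant.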


**4. Accuracy: the hardest dimension to test for as this often requires some kind of manual checking by a Subject Matter Expert (SME).**

I noticed that when photographing the fishes/collecting the scans, they used a sort of symbol, either an "X" or some other shape. This was most likely used to create a standard size comparison for each fish and to maintain consistency/accuracy for future reproduction. However, without a key/description, I can't be absolutely certain what each column label represents. Moreover, some of the recorded data is missing this input value, which again points to the incompleteness of parts of the data.

**5. What are the variables that are most interesting to you? At some point you will need to refine the scope of your project. You likely cannot explore ALL the data. Are there questions about the data that are particularly interesting to you? Questions can either be about the quality of the data or of biological significance.**

I would love to figure out where each species of fish is found (and add that to the data), since location is not included in the provided data, and see where, geographically, the majority of a specific species or genus is most commonly found. I realize that most (if not all) of the fish data have been imported from iDigBio, so I could look into that and gather geographical data that corresponds to the data provided by MorphoSource. And in response to the quality of the data, perhaps I could find out and propose what some of the missing values are.

Another thing I would really love to learn is how to create visualizations of the data. It might be interesting to create a map showing where the fish species are found, with larger clusters in the more populated areas of the map. I want to see how else I can visualize the data as well.

On a side note, I'm curious about when the majority of the information was recorded/scanned. That is, were most of the fishes scanned during 2018 or earlier, or was the data collection consistent throughout the years?
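
One way to answer this is to tally scans by year. The sketch below uses only the standard library. The date strings and their `MM/DD/YYYY` format are assumptions, since the actual date column has not been inspected here.

```python
from collections import Counter
from datetime import datetime

# Made-up scan dates; the real column name and date format may differ.
scan_dates = ["03/14/2016", "11/02/2017", "01/20/2018", "06/05/2018", "n/a"]

def year_of(text, fmt="%m/%d/%Y"):
    """Return the year of a date string, or None if it does not parse."""
    try:
        return datetime.strptime(text.strip(), fmt).year
    except ValueError:
        return None

# Count scans per year, skipping entries that are not valid dates.
scans_per_year = Counter(y for y in map(year_of, scan_dates) if y is not None)
for year in sorted(scans_per_year):
    print(year, scans_per_year[year])
```

The resulting counts could then be drawn as a bar chart to see whether scanning was concentrated in one year or spread out.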

**6. Reiterate what skills you are particularly interested in learning. Do you see a clear path from this database to level up on those skills?**

I'm particularly interested in learning different visualization techniques. From this database, I'm hoping I can aggregate the data and try out different representations of the (limited) data provided. Because there is only a limited amount of data pertaining to the fishes themselves, though, I may have to look for another dataset to continue my research. Regardless, I do want to play around with matplotlib (especially how to import and use its Basemap toolkit so that I could potentially create a symbol map for a particular fish species) and D3.js (a JavaScript library used to manipulate documents and create visualizations).

Another thing I might want to try is creating a k-NN classifier (or a related tool) to fill in the missing values in the data. But this might be hard since there's not much data about the physical appearance of the fish itself, so it will be difficult to collect features to test this approach.
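
To make the k-NN idea concrete, here is a minimal sketch in plain Python. The two numeric features and the family labels are entirely made up, since the dataset currently has no such measurements.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # train is a list of (features, label) pairs; distance is Euclidean.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (body length in cm, body depth in cm).
train = [
    ((30.0, 7.0), "Family A"),
    ((32.0, 8.0), "Family A"),
    ((28.0, 6.5), "Family A"),
    ((10.0, 6.0), "Family B"),
    ((11.0, 6.5), "Family B"),
    ((9.0, 5.5), "Family B"),
]

print(knn_predict(train, (29.0, 7.2)))  # Family A
print(knn_predict(train, (10.5, 6.1)))  # Family B
```

With real measurements on different scales, the features would usually be standardized first so that no single feature dominates the distance.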

**7. If you are having a hard time understanding how to handle the data, is there a clear path for learning how? Is there something that could be done to the data on the database side that would make your life easier when using this data? Do you wish it was in json over XML? Do you wish that there was a tool in Python that would connect to the database? Did you find the documentation incredibly hard to follow? What are some things you googled that helped you? What are the things you googled that had no answer but wish there was?**

I would say that cleaning up some of the data would definitely be very helpful for seeing the different types of data and their relationships. I also had to look up some of the scanner data since I wasn't familiar with the types of tools that they used to record the data. As I mentioned previously, a key/documentation would be nice to have so that I know exactly what each column label refers to. I also wish there was more information about the fish besides its genus/family/etc. For example, info on the weight, color, size, etc. might be helpful.
