## How I cleaned the data for my example Bokeh app, VirtualDive
**NOTE: This notebook is optional. Cleaned data files are already provided in the data folder of this repo, so it does not need to be run before the Bokeh notebook.**

1) Let's import our packages! For this part, all I used was pandas and NumPy:

In [1]:
import pandas as pd
import numpy as np

2) Let's load our data into a couple of pandas DataFrames! ReefLocations originally comes from http://reefbase.org/ and the SEDAC 2000 population density data originally comes from https://neo.sci.gsfc.nasa.gov/view.php?datasetId=SEDAC_POP . I've re-uploaded both to GitHub so that the files don't move or change in the future.

In [11]:
reeflocations=pd.read_csv("https://raw.githubusercontent.com/mistergroot/VirtualDive/master/data/ReefLocations.csv")
pop = pd.read_csv("https://raw.githubusercontent.com/mistergroot/VirtualDive/master/data/SEDAC_POP_2000-01-01_rgb_1440x720.SS.CSV")

3) Let's see what we have to work with:

In [4]:
reeflocations.head(4)

Unnamed: 0,ID,REGION,SUBREGION,COUNTRY,LOCATION,LAT,LON,REEF_SYSTEM,REEF_TYPE,REEF_NAME,WATER_DEPTH,ISLAND_NAME,PROTECTED,TOURISM,COUNTRY_CODE,SIZE
0,62,Pacific,Southwest Pacific,Fiji,,-16.0,-179.98333,Vanua Levu,Fringing,Cikobia,,Vanua Levu,0.0,0,FJI,3
1,4475,Pacific,Southwest Pacific,Fiji,,-17.5,-179.95,Vanua Balavu,Barrier,Daku Barrier Reef,,,0.0,0,FJI,3
2,4457,Pacific,Southwest Pacific,Fiji,,-16.66667,-179.83333,Taveuni,Fringing,Korolevu,,,0.0,0,FJI,3
3,4459,Pacific,Southwest Pacific,Fiji,,-16.73333,-179.83333,Taveuni,Fringing,Viubani,,,0.0,0,FJI,3


In [3]:
pop.head(4)

Unnamed: 0,lat/lon,-179.875,-179.625,-179.375,-179.125,-178.875,-178.625,-178.375,-178.125,-177.875,...,177.625,177.875,178.125,178.375,178.625,178.875,179.125,179.375,179.625,179.875
0,89.875,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
1,89.625,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
2,89.375,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
3,89.125,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0


4) Let's start with the pop data. It looks pretty ugly. It's a gridded dataset where the top row holds longitudes and the left column holds latitudes. I don't like that the latitudes sit in an ordinary column instead of the index, so let's change that. Setting the index also leaves the name 'lat/lon' on it, which looks weird, so let's get rid of that at the same time:

In [12]:
pop = pop.set_index('lat/lon')
pop.index.name = None  # clear the index label (`del pop.index.name` fails on recent pandas)

In [13]:
pop.head(2)

Unnamed: 0,-179.875,-179.625,-179.375,-179.125,-178.875,-178.625,-178.375,-178.125,-177.875,-177.625,...,177.625,177.875,178.125,178.375,178.625,178.875,179.125,179.375,179.625,179.875
89.875,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
89.625,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,...,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0,99999.0
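
To make the index step easier to follow, here is a minimal sketch on a tiny made-up grid. The frame `toy` and its values are hypothetical and only mimic the layout of the population file; they are not taken from the real data.

```python
import pandas as pd

# Toy 2x2 grid laid out like the population file: the first column holds
# latitudes and the remaining column headers are longitudes.
# All values here are made up for illustration.
toy = pd.DataFrame({
    "lat/lon": [89.875, 89.625],
    "-179.875": [99999.0, 12.0],
    "-179.625": [300.0, 99999.0],
})

toy = toy.set_index("lat/lon")  # latitudes become the row index
toy.index.name = None           # drop the 'lat/lon' label from the index

print(toy)
```

Note that the longitude headers stay as strings here, just as they do when a CSV header row is read in.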


In [None]:
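# Reshape the grid into one row per (lat, lon) cell, then keep only cells with
# a value above 250 and below the 99999 fill value
data = pop.stack().to_frame('item').query('250 < item < 99999')

In [None]:
# Move the (lat, lon) index levels back into ordinary columns
data = data.reset_index(drop=False)

In [None]:
# Give the columns readable names
data.rename(columns={'level_0':'lat','level_1':'lon','item':'popdens'}, inplace=True)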
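
The three cells above turn the wide grid into a long table with one row per grid cell. Here is a small sketch of the same idea on a made-up grid; `grid`, `tidy`, and all values are hypothetical. The final `astype(float)` is an extra step that is not in the notebook. I added it because column headers read from a CSV are strings, so it may be worth doing if you need numeric longitudes.

```python
import pandas as pd

# Made-up 2x3 grid: index = latitudes, columns = longitudes (as strings,
# the way they arrive from a CSV header). 99999.0 plays the fill value.
grid = pd.DataFrame(
    {
        "-179.875": [99999.0, 120.0],
        "-179.625": [300.0, 99999.0],
        "-179.375": [800.0, 260.0],
    },
    index=[89.875, 89.625],
)

tidy = (
    grid.stack()                      # one row per (lat, lon) cell
        .to_frame("item")
        .query("250 < item < 99999")  # drop low values and the fill value
        .reset_index()                # index levels become level_0 / level_1
        .rename(columns={"level_0": "lat", "level_1": "lon", "item": "popdens"})
)

# Optional extra step (not in the notebook): make longitudes numeric.
tidy["lon"] = tidy["lon"].astype(float)

print(tidy)
```

Only the three cells with values strictly between 250 and 99999 survive the filter in this toy example.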

In [14]:
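# convert_objects() was removed in pandas 1.0; pd.to_numeric is the replacement.
# Only PROTECTED needs to be numeric for the steps below, so only it is converted here.
reeflocations['PROTECTED'] = pd.to_numeric(reeflocations['PROTECTED'], errors='coerce')
# Treat missing values as "not protected" and store the flag as an integer
reeflocations['PROTECTED'] = reeflocations['PROTECTED'].fillna(0)
reeflocations['PROTECTED'] = reeflocations['PROTECTED'].astype(np.int64)
# Swap the 1/0 flag for readable labels
reeflocations["PROTECTED"] = reeflocations["PROTECTED"].replace(1, "Yes")
reeflocations["PROTECTED"] = reeflocations["PROTECTED"].replace(0, "No")
# Split the reefs by protection status
protected = reeflocations[reeflocations["PROTECTED"] == "Yes"]
unprotected = reeflocations[reeflocations["PROTECTED"] == "No"]
# Save the cleaned files for the Bokeh notebook
reeflocations.to_csv("../data/reefloc.csv")
protected.to_csv("../data/protected.csv")
unprotected.to_csv("../data/unprotected.csv")
data.to_csv("../data/popdata.csv")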
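
Here is a small sketch of the PROTECTED clean-up on a made-up frame, in case you want to try the logic without downloading anything. `reefs` and its rows are hypothetical. This sketch uses `Series.map` instead of two `replace` calls, which is a swapped-in technique: with `map`, any value other than 0 or 1 would become NaN, so it assumes the flag only ever holds those two values.

```python
import numpy as np
import pandas as pd

# Made-up reef table; NaN stands for a reef with no protection info.
reefs = pd.DataFrame({
    "REEF_NAME": ["A", "B", "C"],
    "PROTECTED": [1.0, np.nan, 0.0],
})

# Coerce to numeric, treat missing as 0, and store as an integer flag.
flag = (
    pd.to_numeric(reefs["PROTECTED"], errors="coerce")
      .fillna(0)
      .astype(np.int64)
)

# Map the flag to readable labels (assumes only 0 and 1 occur).
reefs["PROTECTED"] = flag.map({1: "Yes", 0: "No"})

# Split by protection status.
protected = reefs[reefs["PROTECTED"] == "Yes"]
unprotected = reefs[reefs["PROTECTED"] == "No"]

print(reefs)
```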

