In [1]:
import pandas as pd

In [2]:
data_dir = "./data/raw_data/"
save_dir = "./data/preprocessed_data/"

Read in the data

In [3]:
cv_table = pd.read_csv(data_dir + "cv_table.csv")
ndr_table = pd.read_csv(data_dir + "ndr_table.csv")
poll_table = pd.read_csv(data_dir + "poll_table.csv")

latlong_cv = pd.read_csv(data_dir + "latlong_cv.csv")
latlong_ndr = pd.read_csv(data_dir + "latlong_ndr.csv")
latlong_pollination = pd.read_csv(data_dir + "latlong_pollination.csv")


We have three tables of data, one for each contribution of nature. We also have three tables relating each data point to a longitude and a latitude.

Let's look at the tables first!

# cv_table

In [4]:
cv_table.describe(include = 'all')

Unnamed: 0,fid,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5
count,686665.0,686665.0,686665.0,686665.0,686665.0,686665.0,686665.0,686665.0,686665.0
mean,343333.0,2.48269,2.822749,2.982682,3.139066,485.270589,586.541642,605.606714,608.113528
std,198223.255633,0.511879,0.573237,0.618028,0.644773,2772.84894,3590.280849,3368.830041,3719.613422
min,1.0,1.030057,1.237034,1.29779,1.311983,0.0,0.0,0.0,0.0
25%,171667.0,2.082644,2.376177,2.492883,2.667168,0.510938,0.423553,0.455772,0.44583
50%,343333.0,2.440013,2.777381,2.935599,3.107233,21.314615,21.012911,26.209813,21.459414
75%,514999.0,2.817269,3.259844,3.419952,3.619904,156.068106,153.478002,184.031578,159.716874
max,686665.0,4.291871,4.817462,5.0,5.0,236893.117288,270175.660403,247236.351206,270048.14108


This table describes the coastal risk mitigation by natural habitats.
Some explanations of the columns follow:

* **fid** - the id of the geographical location that the data describes. This id can be converted to lon/lat using the latlong table.  
* **UN_cur** - the **current** (2015) unmet need of the coastal risk mitigation in this particular area. If this figure is negative then it means that we have enough, and even a surplus, in this particular area. 
* **UN_ssp1** - the unmet need of the coastal risk mitigation in the year 2050, given that scenario **ssp1** is realized.
* **ssp1, ssp3, ssp5** - the three future scenarios. It is not entirely clear which scenario is which. ssp1 is most likely the sustainability scenario, judging purely from the observation that the unmet need in this column is lower than for the other scenarios. The other two scenarios are fossil fuels and regional rivalry.
* **pop_cur** - the **current** (2015) number of people in this geographic region who are affected by coastal risk mitigation. If the unmet need is high and the population is high then a lot of people are negatively affected.
* **pop_sspX** - same as the above, but in the year 2050, for the three different future scenarios.
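
The scenario guess above rests on comparing the unmet-need columns. A minimal sketch of that comparison follows; the frame `toy` and its values are invented for illustration, and on the real data one would use `cv_table` in its place.

```python
import pandas as pd

# Toy stand-in for cv_table; the numbers are invented for illustration.
toy = pd.DataFrame({
    "UN_cur":  [2.0, 2.5, 3.0],
    "UN_ssp1": [2.4, 2.8, 3.2],
    "UN_ssp3": [2.6, 3.0, 3.4],
    "UN_ssp5": [2.7, 3.1, 3.6],
})

scenario_cols = ["UN_ssp1", "UN_ssp3", "UN_ssp5"]

# Mean unmet need per scenario, and the scenario with the lowest mean.
scenario_means = toy[scenario_cols].mean()
lowest_scenario = scenario_means.idxmin()

print(scenario_means)
print("Lowest mean unmet need:", lowest_scenario)
```

A lower mean is only a hint about which scenario is the sustainability one; it does not confirm the labelling.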

# ndr_table

In [5]:
ndr_table.describe(include = 'all')

Unnamed: 0,fid,country,region,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,rurpopcur,rurpopssp1,rurpopssp3,rurpopssp5
count,25243.0,25243,25243,25243.0,25243.0,25243.0,25243.0,25243.0,25243.0,25243.0,25243.0
unique,,228,8,,,,,,,,
top,,Antarctica,UNKNOWN,,,,,,,,
freq,,6511,6557,,,,,,,,
mean,31136.198669,,,2847848.0,2982161.0,3464451.0,2888546.0,169300.7,100850.5,232859.4,100378.9
std,20508.712007,,,8093390.0,9765498.0,10729920.0,8570622.0,691964.7,407131.6,981792.5,403094.7
min,2262.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13083.5,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,25299.0,,,2304.164,734.3517,721.7091,717.5778,0.0,0.0,0.0,0.0
75%,56810.5,,,2418531.0,1757585.0,2092982.0,1920021.0,44799.87,24824.87,42305.56,25924.68


This table describes the improvement of water quality through the filtering of pollutants (in our case probably excess nutrients, which degrade water quality) when water passes through natural habitats such as forests and wetlands.

The columns are the same as in the previous table, although `pop` is replaced by `rurpop`. Charlotte has confirmed that this is still the same metric though.

We also have two extra columns `country` and `region`. We will most probably not use these, especially as we don't actually have them for each table.

One thing we can immediately observe is that the count (number of entries) is much lower: the previous table had 686665 entries, while this one has only 25243. This needs to be investigated. Does this mean that we don't have data for the whole earth for this particular contribution of nature?

# poll_table

In [6]:
poll_table.describe(include = 'all')

Unnamed: 0,fid,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5
count,9160.0,9160.0,9160.0,9160.0,9160.0,9160.0,9160.0,9160.0,9160.0
mean,25270.656441,96306.18,89728.63,93758.06,99902.51,257479.9,399387.1,465965.3,360315.9
std,10032.053267,434821.6,413617.0,411384.6,433524.4,1004224.0,1546353.0,1681148.0,1450151.0
min,8845.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16672.0,2.088596,2.046068,9.315915,9.55,0.0,0.0,0.0,0.0
50%,22971.5,1027.879,978.1146,1492.83,1594.218,622.3414,3940.41,3460.522,0.0
75%,33446.25,31073.08,28270.17,34398.39,34747.84,126399.4,187440.6,227624.3,146889.4
max,51951.0,23220620.0,22495340.0,23218140.0,22702230.0,23387770.0,37782320.0,32826860.0,32225640.0


This table describes the pollination of crops by pollinators (e.g. bees) that live in natural habitats.

The columns are the same as previously.

Here, though, we have an even lower count. What does this mean? One hypothesis is that the resolution of this data is simply not as high.

# latlong_cv

In [7]:
latlong_cv.describe(include = 'all')

Unnamed: 0,fid,lat,lng
count,686665.0,686665.0,686665.0
mean,343333.0,9.346826,23.706301
std,198223.255633,32.202816,102.750279
min,1.0,-58.48,-180.0
25%,171667.0,-10.6,-73.07
50%,343333.0,9.6,12.64
75%,514999.0,35.45,123.25
max,686665.0,62.8,180.0


This looks good, just a simple mapping from the **fid** to the coordinates. We have the same number of mappings as we have entries in the `cv_table`.

# latlong_ndr

In [8]:
latlong_ndr.describe(include = 'all')

Unnamed: 0,fid,lat,lng
count,64800.0,64800.0,64800.0
mean,32399.5,0.0,0.0
std,18706.293059,51.961123,103.923449
min,0.0,-89.5,-179.5
25%,16199.75,-44.75,-89.75
50%,32399.5,0.0,0.0
75%,48599.25,44.75,89.75
max,64799.0,89.5,179.5


Here the number of entries is actually higher than the number of entries in the `ndr_table`. This is a bit strange. What does it mean?

Moreover, why do we even need different mappings for the different contributions of nature? Couldn't they all share the same mapping between id and coordinates? The answer is probably that, as suggested before, the world map resolution differs between nature's contributions, with coastal risk having the most detailed data.
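
The `describe` output above (64800 rows, latitudes from -89.5 to 89.5, longitudes from -179.5 to 179.5) is consistent with the cell centres of a regular 1-degree global grid, since 180 × 360 = 64800. The sketch below builds such a grid to compare against; that `latlong_ndr` was actually constructed this way is an assumption, not something the data source confirms.

```python
import numpy as np
import pandas as pd

# Cell centres of a regular 1-degree global grid. This is an assumed
# construction, used only to compare against the latlong_ndr summary.
lats = np.arange(-89.5, 90.0, 1.0)    # 180 latitude bands
lngs = np.arange(-179.5, 180.0, 1.0)  # 360 longitude bands

grid = pd.DataFrame(
    [(lat, lng) for lat in lats for lng in lngs],
    columns=["lat", "lng"],
)

print(len(grid))
print(grid.describe())
```

If the summary statistics printed here match those of `latlong_ndr`, the lookup table most likely covers every cell of the globe, including cells for which the data table has no row.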

# latlong_pollination

In [9]:
latlong_pollination.describe(include = 'all')

Unnamed: 0,fid,lat,lng
count,64800.0,64800.0,64800.0
mean,32399.5,0.0,0.0
std,18706.293059,51.961123,103.923449
min,0.0,-89.5,-179.5
25%,16199.75,-44.75,-89.75
50%,32399.5,0.0,0.0
75%,48599.25,44.75,89.75
max,64799.0,89.5,179.5


Once again the same thing: the number of entries is the same as in the `latlong_ndr` table, 64800, while the number of entries in the `poll_table` is just 9160...
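
One way to see how much of a lookup table is actually covered by a data table is a left join with `indicator=True`. This is a minimal sketch on invented toy frames (`toy_latlong` and `toy_data` are hypothetical stand-ins for `latlong_pollination` and `poll_table`).

```python
import pandas as pd

# Toy stand-ins: a full lookup grid and a data table covering part of it.
toy_latlong = pd.DataFrame({
    "fid": [0, 1, 2, 3, 4],
    "lat": [0.5, 0.5, 1.5, 1.5, 2.5],
    "lng": [0.5, 1.5, 0.5, 1.5, 0.5],
})
toy_data = pd.DataFrame({"fid": [1, 3], "UN_cur": [10.0, 0.0]})

# Keep every grid cell; the _merge column records whether data exists for it.
joined = toy_latlong.merge(toy_data, on="fid", how="left", indicator=True)
coverage = joined["_merge"].value_counts()
n_missing = int((joined["_merge"] == "left_only").sum())

print(coverage)
print("Grid cells without data:", n_missing)
```

Plotting the `lat`/`lng` of the `left_only` rows would then show where on the map the data table has no entries.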



# Actual data

We also take a look at the raw rows, just to get a quick feel for the figures:

In [10]:
cv_table.head()

Unnamed: 0,fid,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5
0,1,3.282099,3.684031,3.684031,4.135186,0.0,0.0,0.0,0.0
1,2,2.924018,3.282099,3.282099,3.684031,0.0,0.0,0.0,0.0
2,3,2.924018,3.282099,3.282099,3.684031,0.0,0.0,0.0,0.0
3,4,2.924018,3.282099,3.282099,3.684031,0.0,0.0,0.0,0.0
4,5,2.924018,3.282099,3.282099,3.684031,0.0,0.0,0.0,0.0


Nothing surprising here! The ids start at 1 and count up. The first five locations show an unmet need both currently and under all three future scenarios. We also see that nobody in these locations is directly affected by changes to the natural habitats that mitigate coastal risk. This could be because nobody lives there, or perhaps simply because there is no coast in this area.

Let's look at the `ndr_table`.

In [11]:
ndr_table.sample(5)

Unnamed: 0,fid,country,region,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,rurpopcur,rurpopssp1,rurpopssp3,rurpopssp5
16104,36205,Democratic Republic of the Congo,Africa,1674984.0,691177.5,811586.2,814407.7,264893.4,230902.5781,700097.4,230909.9063
6441,13262,Russia,Eurasia,404589.9,401216.3,415218.3,393199.5,11851.13,7582.272461,9816.912,8103.152832
15470,34174,Papua New Guinea,Oceania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23819,63376,Antarctica,UNKNOWN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14194,30058,Ghana,Africa,1167166.0,1883950.0,1306541.0,4207906.0,1268533.0,97216.5625,1231804.0,95833.23438


We note that the **fid** values exceed the number of rows in the table (25243), so **fid** is not a consecutive row index here.
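
A quick way to quantify this is to compare the range of ids with the row count. The sketch below uses a toy frame with invented ids; on the real data one would use `ndr_table` instead.

```python
import pandas as pd

# Toy stand-in for ndr_table; the ids are invented for illustration.
toy = pd.DataFrame({"fid": [2262, 2263, 2270, 3001]})

n_rows = len(toy)
fid_is_unique = toy["fid"].is_unique

# Number of ids between min and max that do not appear in the table.
fid_span = toy["fid"].max() - toy["fid"].min() + 1
n_gaps = fid_span - n_rows

print("rows:", n_rows, "unique ids:", fid_is_unique, "missing ids in range:", n_gaps)
```

Unique ids with many gaps would suggest that the table holds a subset of a larger id space rather than its own numbering.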

Same for `poll_table`:

In [12]:
poll_table.sample(5)

Unnamed: 0,fid,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5
6668,32621,4168.757878,4257.722532,4393.030225,4392.279882,93224.70313,133665.0313,234833.5,114015.7266
7174,34691,47.156799,45.83114,53.051223,93.80918,234947.2813,255341.6406,304156.8125,253622.1563
7706,37185,171.682375,165.798171,288.975228,183.196205,92237.45313,0.0,125735.5,0.0
3446,19340,81.112525,67.935944,50.390224,81.112525,30059.59766,22724.45508,27804.46094,22570.26758
7075,34322,0.0,0.0,0.0,0.0,7878.15332,7933.297363,9833.271484,7882.144531


And as an example of a lat/long table, let's check out `latlong_cv`:

In [13]:
latlong_cv.sample(5)

Unnamed: 0,fid,lat,lng
363092,363093,10.24,-64.58
466735,466736,25.03,-108.04
538601,538602,39.1,141.9
568288,568289,47.63,-59.26
341285,341286,9.28,80.79


# Preprocessing

The next step is to perform the preprocessing (if needed) and save the data in new files.
This might need to be done in several iterations, as we don't yet know exactly how we will use the data. One thing that will be useful to have done already is to join the lat/long tables with the data tables.

Start with `cv_table`:

In [14]:
cv_table_out = pd.merge(cv_table, latlong_cv, on='fid')
cv_table_out.sample()

Unnamed: 0,fid,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5,lat,lng
34493,34494,2.58734,2.904191,2.904191,3.259844,0.013091,0.0,0.0,0.0,-48.63,-75.58
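
`pd.merge` performs an inner join by default, so rows whose `fid` has no match in the lookup table are dropped without warning. A minimal sketch of guarding against that, using the `validate` argument and a row count check on invented toy frames (`toy_data` and `toy_latlong` are hypothetical stand-ins for the real tables):

```python
import pandas as pd

# Toy stand-ins for a data table and its lat/long lookup table.
toy_data = pd.DataFrame({"fid": [1, 2, 3], "UN_cur": [2.9, 3.2, 2.5]})
toy_latlong = pd.DataFrame({
    "fid": [1, 2, 3],
    "lat": [10.0, 10.1, 10.2],
    "lng": [-64.5, -64.6, -64.7],
})

# validate="one_to_one" raises if fid is duplicated on either side.
merged = pd.merge(toy_data, toy_latlong, on="fid", validate="one_to_one")

# An inner join silently drops unmatched rows, so compare the row counts.
assert len(merged) == len(toy_data), "some fids had no coordinates"

print(merged)
```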


The `fid` column is no longer needed, so we drop it.

In [15]:
cv_table_out.drop(columns=['fid'], inplace=True)
cv_table_out.sample()

Unnamed: 0,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5,lat,lng
671814,2.685377,3.014233,3.224968,3.224968,0.0,0.0,0.0,0.0,60.86,-47.93


Looks good! Let's do the same for `ndr_table` and `poll_table`:

In [16]:
ndr_table_out = pd.merge(ndr_table, latlong_ndr, on ='fid')

# Remove the country and region columns, which we will probably not use, and rename columns
ndr_table_out.drop(columns=['fid', 'country', 'region'], inplace=True) 
ndr_table_out.rename(columns={"rurpopcur": "pop_cur", 
                              "rurpopssp1": "pop_ssp1", 
                              "rurpopssp3": "pop_ssp3",
                              "rurpopssp5": "pop_ssp5"}, inplace=True)

ndr_table_out.sample()

Unnamed: 0,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5,lat,lng
16603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-16.5,-150.5


In [17]:

poll_table_out = pd.merge(poll_table, latlong_pollination, on='fid')
poll_table_out.drop(columns=['fid'], inplace=True)

poll_table_out.sample()

Unnamed: 0,UN_cur,UN_ssp1,UN_ssp3,UN_ssp5,pop_cur,pop_ssp1,pop_ssp3,pop_ssp5,lat,lng
3082,11205.57207,10341.60583,11426.92235,11698.08962,1213828.0,1370219.75,1067409.25,1506498.25,38.5,15.5


Now we are ready to save the files!

# Saving data

In [20]:
cv_table_out.to_csv(save_dir + "cv_table_preprocessed.csv.gz", compression='gzip')
ndr_table_out.to_csv(save_dir + "ndr_table_preprocessed.csv.gz", compression='gzip')
poll_table_out.to_csv(save_dir + "poll_table_preprocessed.csv.gz", compression='gzip')
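
When loading these files later, note that `to_csv` as called above also writes the row index as an unnamed first column. A minimal round-trip sketch on a toy frame, assuming the index should be restored with `index_col=0` (the file name and values here are invented; `read_csv` infers gzip compression from the `.gz` extension):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one of the preprocessed tables.
toy_out = pd.DataFrame({
    "UN_cur": [2.5, 3.0],
    "lat": [10.24, 25.03],
    "lng": [-64.58, -108.04],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_preprocessed.csv.gz")
    toy_out.to_csv(path, compression="gzip")
    # index_col=0 restores the saved index instead of adding an extra column.
    reloaded = pd.read_csv(path, index_col=0)

# Raises if the reloaded frame differs from the original (within tolerance).
pd.testing.assert_frame_equal(reloaded, toy_out)
print(reloaded)
```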