# Group 9 Project Data Pre-Processing Description - Ocean Acidification

### We plan to research how increased ocean acidification affects the broader coastal ecosystem. This can be broken down into two specific research questions:
#### 1) Is the increase in ocean acidity associated with a net decrease in coastal biodiversity?
#### 2) What are the socio-economic and health-related impacts of ocean acidification?
#### By socio-economic impacts, we mean the societal and human consequences of changes in ocean and marine health, such as impacts on the fishing industry and on coastal resilience efforts.

In [14]:
import pandas as pd

codap_df = pd.read_csv(r'CODAP_NA_v2021.csv', low_memory=False) # placed into group 9 project directory
zooplankton_df = pd.read_excel(r'BATS_zooplankton.xlsx') # placed into group 9 project directory
display(codap_df) #easier to view vs. print()
display(zooplankton_df)

Unnamed: 0,Accession,EXPOCODE,Cruise_flag,Cruise_ID,Observation_type,Profile_number,Station_ID,Cast_number,Niskin_ID,Niskin_flag,...,Nitrate,Nitrate_flag,Nitrite,Nitrite_flag,Nitrate_and_Nitrite,Nitrate_and_Nitrite_flag,recommended_Nitrate_and_Nitrite,recommended_Nitrate_and_Nitrite_flag,Ammonium,Ammonium_flag
0,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,n.a.,...,[umol/kg],n.a.,[umol/kg],n.a.,[umol/kg],n.a.,[umol/kg],n.a.,[umol/kg],n.a.
1,144549,33HQ20080329,B,HLY0802,Niskin,1,1,5,1,2,...,25.8,2,0.19,2,25.99,2,25.99,2,0.14,2
2,144549,33HQ20080329,B,HLY0802,Niskin,1,1,5,2,2,...,25.6,2,0.2,2,25.8,2,25.8,2,0.13,2
3,144549,33HQ20080329,B,HLY0802,Niskin,1,1,5,3,2,...,24,2,0.15,2,24.15,2,24.15,2,0.41,2
4,144549,33HQ20080329,B,HLY0802,Niskin,1,1,5,4,2,...,22.6,2,0.17,2,22.77,2,22.77,2,-999,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28202,208230,3.3222E+11,A,SH1709,Niskin,3391,HB07,1,7,2,...,-999,9,-999,9,-999,9,-999,9,-999,9
28203,208230,3.3222E+11,A,SH1709,Niskin,3391,HB07,1,8,2,...,-999,9,-999,9,-999,9,-999,9,-999,9
28204,208230,3.3222E+11,A,SH1709,Niskin,3391,HB07,1,9,2,...,-999,9,-999,9,-999,9,-999,9,-999,9
28205,208230,3.3222E+11,A,SH1709,Niskin,3391,HB07,1,10,2,...,-999,9,-999,9,-999,9,-999,9,-999,9


Unnamed: 0,% /BATS zooplankton Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,% BATS66a (April 1994) to BATS 368 (Feb 2020),,,,,,,,,,...,,,,,,,,,,
1,% Cruise # is listed as 5 digit ID.,,,,,,,,,,...,,,,,,,,,,
2,% First digit is cruise type.,,,,,,,,,,...,,,,,,,,,,
3,% 1 = BATS core,,,,,,,,,,...,,,,,,,,,,
4,% 2 = BATS bloom A,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6684,10368,20200228.0,4.0,31.0,40.581,64.0,11.449,2243.0,2325.0,42.0,...,2120.2,411.4,3.38,0.66,16.49,2.9,675.69,131.11,3298.78,580.85
6685,10368,20200228.0,4.0,31.0,40.581,64.0,11.449,2243.0,2325.0,42.0,...,1941.6,408.0,3.09,0.65,16.49,2.9,618.77,130.03,3298.78,580.85
6686,10368,20200228.0,4.0,31.0,40.581,64.0,11.449,2243.0,2325.0,42.0,...,2535.4,615.4,4.04,0.98,16.49,2.9,808.01,196.12,3298.78,580.85
6687,10368,20200228.0,4.0,31.0,40.581,64.0,11.449,2243.0,2325.0,42.0,...,118.0,312.8,0.19,0.50,16.49,2.9,37.61,99.69,3298.78,580.85


No major data are missing from either dataset, although individual missing values are coded as -999. Each dataset contains exclusively raw measurement data for several oceanographic variables, the most important being:
**CODAP_NA_v2021**: 
This dataset includes discrete measurements from 2003 to 2018. An image showing measurement coverage is included in the project directory. Geographically, the data cover the U.S. West Coast, U.S. East Coast, Gulf of Mexico, Gulf of Alaska, Bering Sea, North Atlantic Ocean, and North Pacific Ocean. The data originate from the NOAA Ocean Acidification Program (OAP). The metadata is available [here](https://www.ncei.noaa.gov/data/oceans/ncei/ocads/metadata/0219960.html).
* General locational data
    - dates, lat, long, cruise identification numbers
* Dissolved inorganic carbon (DIC)
    - carbon deposits can either be organic or inorganic and dissolved or particulate
* Total alkalinity
* pH
* Continuous and discrete seawater pCO2
    - dissolved CO2 concentrations
* Carbonate concentrations 
    - measured with a spectrophotometer
* Aragonite & calcite saturation
    - saturation states of the two calcium carbonate minerals, an alternative indicator to dissolved carbonate concentrations
* Revelle factor
    - ratio of the relative change in CO2 to the relative change in total dissolved inorganic carbon (DIC)
* Oxygen concentration
* Apparent oxygen utilization 
    - difference between the equilibrium saturation concentration of oxygen in water with the same physical and chemical properties and the measured dissolved oxygen concentration
* Various dissolved chemical concentrations important for biological life
    - silicate, phosphate, nitrate, nitrite, ammonium
* CTD measurements 
    - conductivity, temperature and depth measurements, includes salinity and pressure 
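
For reference, the derived quantities in the list above are conventionally defined as shown below. These are the standard textbook forms, included here on the assumption that the CODAP columns follow them; the dataset's metadata remains the authority on how each value was actually computed.

$$
\Omega = \frac{[\mathrm{Ca^{2+}}]\,[\mathrm{CO_3^{2-}}]}{K'_{sp}}, \qquad
R = \frac{\partial \ln p\mathrm{CO_2}}{\partial \ln \mathrm{DIC}}, \qquad
\mathrm{AOU} = [\mathrm{O_2}]_{\mathrm{sat}} - [\mathrm{O_2}]_{\mathrm{measured}}
$$

Here $\Omega$ is the saturation state of aragonite or calcite (with $K'_{sp}$ the solubility product of the respective mineral), $R$ is the Revelle factor evaluated at constant alkalinity, and $[\mathrm{O_2}]_{\mathrm{sat}}$ is the oxygen concentration at equilibrium with the atmosphere for the sample's temperature and salinity.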

**BATS_zooplankton**
This dataset is sourced from the Bermuda Institute of Ocean Sciences and primarily measures zooplankton biomass. Data were collected from 1994 to 2020 by discrete sampling. One of the major hypotheses that this study aims to test is: *changes in seawater CO2-carbonate chemistry and ocean acidification indicators at BATS are the longest record in the global ocean and comparable to the six other globally distributed time-series (Bates et al., 2012; 2014)*, so this dataset is relevant to our data analysis and visualization focus. The metadata for this dataset is located in the Excel file itself.
* General locational data
* Max depth of sampling
* Volume of water sampled
    - volume of water sampled has significant influence on the concentration of ions and other carbonate-related minerals that might have an influence on zooplankton concentrations
* Weight of zooplankton biomass
    - wet weight and dry weight, ratios of weight-to-volume, total weights accounting for all size fractions of zooplankton

### Summary Statistics and other notes:

In both datasets, missing values are coded as "-999" and must be removed, preferably by dropping every row that contains one, since we plan to use all of the important variables in our statistical analysis. We plan to do this by:

> import numpy as np  
> df = df.replace(-999, np.nan).dropna(axis=0)   

Here, we are removing all rows in our dataframe df that contain either -999 or entirely missing values (np.nan). Alternatively, we could replace all -999 values with "0" to signify missing data, although a zero would then be indistinguishable from a true zero measurement and would pull averages downward:

> df2 = df.replace(-999, 0)  

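The row-removal approach above can be sketched on a small frame that mimics the CODAP preview, where the first row holds units and every value is therefore read in as a string. Under that assumption, a numeric `-999` would not match the string `"-999"`, so the sketch converts to numbers first. The three-column frame and its values are illustrative stand-ins, not the real file.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the CODAP frame: row 0 holds units, so
# pandas stores every column as strings (dtype object).
raw = pd.DataFrame({
    "Year": ["n.a.", "2008", "2008", "2017"],
    "Phosphate": ["[umol/kg]", "1.98", "-999", "1.72"],
    "Nitrate": ["[umol/kg]", "25.8", "24", "-999"],
})

# Drop the units row, then convert every column to a numeric dtype.
data = raw.iloc[1:]
numeric = data.apply(pd.to_numeric, errors="coerce")

# Mark the -999 sentinel as missing.
cleaned = numeric.replace(-999, np.nan)

# Keep only rows that are complete for the variables of interest.
complete_rows = cleaned.dropna(subset=["Phosphate", "Nitrate"])
print(complete_rows)
```

Passing `subset=` to `dropna` limits the check to the columns used in the analysis, which may retain more rows than dropping on every column of the full dataset.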
#### CODAP_NA_v2021:
We plan to remove all the "flag" columns and most of the locational data, except year and maybe month. When visualizing our data in GIS, cruise IDs or station IDs might also be relevant if we want to map out where the data was collected.
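
One possible way to drop the flag columns is to select on the column-name suffix. This is a minimal sketch on a made-up frame; it assumes every flag column ends in `_flag`, as the columns visible in the preview do, and the `Year_UTC` column name is a placeholder for whichever year column the real file provides.

```python
import pandas as pd

# Made-up sample; column names follow the pattern seen in the preview,
# except "Year_UTC", which is an assumed placeholder.
sample = pd.DataFrame({
    "Cruise_ID": ["HLY0802", "SH1709"],
    "Cruise_flag": ["B", "A"],
    "Year_UTC": [2008, 2017],
    "Nitrate": [25.8, -999],
    "Nitrate_flag": [2, 9],
})

# Boolean mask that is True for every column whose name ends in "_flag".
is_flag = sample.columns.str.endswith("_flag")

# Keep only the columns that are not flags.
no_flags = sample.loc[:, ~is_flag]
print(list(no_flags.columns))
```

In the preview, -999 values appear alongside a flag of 9, so it may be worth filtering on the flags before discarding them.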

To generate summary statistics, we would use a method to subset a column, replace missing values, and then summarize. An example is listed below.

#### BATS_zooplankton:
We plan to remove the metadata above the dataset to make displaying it in Python easier.
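
A hedged sketch of one way to do this: the preview shows that the metadata lines in the first column begin with `%`, so those rows can be filtered out after loading. The frame below is a small stand-in built from values in the preview, not the real spreadsheet.

```python
import pandas as pd

# Stand-in for the loaded BATS sheet: comment rows start with "%",
# data rows hold numbers.
raw = pd.DataFrame({
    "% /BATS zooplankton Data": [
        "% BATS66a (April 1994) to BATS 368 (Feb 2020)",
        "% Cruise # is listed as 5 digit ID.",
        10368,
        10368,
    ],
    "Unnamed: 1": [None, None, 20200228.0, 20200228.0],
    "Unnamed: 2": [None, None, 4.0, 4.0],
})

# Flag rows whose first cell begins with the comment marker.
is_comment = raw.iloc[:, 0].astype(str).str.startswith("%")

# Keep the data rows and convert them to numeric dtypes.
data = raw.loc[~is_comment].reset_index(drop=True)
data = data.apply(pd.to_numeric, errors="coerce")
print(data)
```

If the number of metadata rows is known, `pd.read_excel(..., skiprows=n)` would be an alternative; the value of `n` would need to be checked against the file.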

In [55]:
### Example method to calculate summary statistics step-by-step:

import numpy as np

test_df = codap_df.copy() #copies the original df so the example does not modify it
test_df['Phosphate'] = test_df['Phosphate'].replace('-999','0') #replaces all -999 values (stored as strings) with 0
display(test_df['Phosphate']) 

df_droprow = test_df.drop(test_df.index[0]) #drops unnecessary first row with string values
display(df_droprow['Phosphate'])

df_float = df_droprow['Phosphate'].astype(str).astype(float) #converts pandas dtype object to string to float
display(df_float)

print("The mean of phosphate column is:", df_float.mean()) 
#calculates average for column "Phosphate"; .mean() skips NaN values by default (skipna=True), but the zeros that replaced -999 are included in the average

0        [umol/kg]
1             1.98
2             1.96
3             1.87
4             1.79
           ...    
28202            0
28203            0
28204            0
28205            0
28206            0
Name: Phosphate, Length: 28207, dtype: object

1        1.98
2        1.96
3        1.87
4        1.79
5        1.72
         ... 
28202       0
28203       0
28204       0
28205       0
28206       0
Name: Phosphate, Length: 28206, dtype: object

1        1.98
2        1.96
3        1.87
4        1.79
5        1.72
         ... 
28202    0.00
28203    0.00
28204    0.00
28205    0.00
28206    0.00
Name: Phosphate, Length: 28206, dtype: float64

The mean of phosphate column is: 0.8398234418208971
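
Because the example replaces -999 with 0 before averaging, the reported mean includes those zeros. The following sketch, on four made-up values, compares that approach with marking the sentinel as missing instead; it illustrates the difference and does not recompute the real column.

```python
import numpy as np
import pandas as pd

# Made-up phosphate values stored as strings, two of them missing.
phosphate = pd.Series(["1.98", "1.96", "-999", "-999"], name="Phosphate")
values = pd.to_numeric(phosphate, errors="coerce")

# Zero-filling keeps the missing rows in the denominator.
mean_zero_filled = values.replace(-999, 0).mean()

# NaN-filling lets .mean() skip the missing rows entirely.
mean_nan = values.replace(-999, np.nan).mean()

print("zero-filled mean:", mean_zero_filled)
print("NaN-aware mean:", mean_nan)
```

On this toy input the zero-filled mean is half the NaN-aware mean, since half of the values are missing.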
