<a href="https://colab.research.google.com/github/WRFitch/fyp/blob/main/src/fyp_data_import_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Import Pipeline

This code is still under construction, so it is rough in places.

### TODO
- Import the CO2 dataset.
- Figure out a way of iterating through existing images and displaying the area currently covered by my dataset on a map.
- Define and import other regions of interest. Stick to cities and suburbs for now, since they will have the best health data. Extending this to rural or rocky areas would be an increase in scope.
- Figure out how accurate the image exports are:
  - Are the points definitely centred on the given coordinates?
  - Is there a way of standardising lighting?
- Index all existing lat/long exports (the exported files) in one CSV, so we're not constantly querying the filesystem.
- Move data processing methods into their own pipeline.
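
The file-indexing item above could be sketched roughly as follows. This is only an illustration, not part of `fyputil`: the function name `index_exports` and the output column names are hypothetical, and it assumes exported images are named `<longitude>_<latitude>.png`, as the filtering cell later in this notebook implies.

```python
import csv
from pathlib import Path


def index_exports(img_dir, index_csv, extension=".png"):
    """Write one CSV row (longitude, latitude, filename) per exported image.

    Assumes files are named "<longitude>_<latitude><extension>"; files that
    do not match that pattern are skipped. (Hypothetical helper.)
    """
    rows = []
    for path in sorted(Path(img_dir).glob(f"*{extension}")):
        try:
            lon_str, lat_str = path.stem.split("_")
            rows.append((float(lon_str), float(lat_str), path.name))
        except ValueError:
            continue  # not a "<lon>_<lat>" filename, so ignore it

    with open(index_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["longitude", "latitude", "filename"])
        writer.writerows(rows)
    return rows
```

Later cells could then read this one CSV instead of calling `os.path.isfile` once per coordinate pair.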

## Setup
*   Import necessary libraries
*   Import fyputil module
*   Set up Earth Engine authentication and mount Google Drive


In [1]:
import ee
import os
import pandas as pd 

from google.colab import drive
from osgeo import gdal
from pprint import pprint

In [None]:
ee.Authenticate()
ee.Initialize()
drive.mount('/content/drive')

In [None]:
%rm -rf /content/fyp

In [None]:
# Import FYP repo so we can access fyputil common library 
%cd /content
!git clone https://github.com/WRFitch/fyp.git

# Import fyputil library
%cd fyp/src/fyputil
import constants as c
import ee_constants as eec
import ee_utils as eeutil
import fyp_utils as fyputil
%cd /content

# Dataset import

### Import the following datasets into Google Drive

*   [Sentinel-2 Satellite photography](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR)
*   [Sentinel-5 Precursor Data](https://developers.google.com/earth-engine/datasets/catalog/sentinel)
  *   [Carbon Monoxide](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_CO)
  *   [Formaldehyde](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_HCHO)
  *   [Nitrogen Dioxide](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_NO2)
  *   [Ozone](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_O3)
  *   [Sulphur Dioxide](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_SO2)
  *   [Methane](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_CH4)
*   [ODIAC Fossil Fuel CO2 Emissions](https://db.cger.nies.go.jp/dataset/ODIAC/DL_odiac2019.html)

### Visualise Data

In [None]:
eec.map 

### Export Data

Data is exported as CSV tables and GeoTIFF images.

#### Exporting CSVs

This method of getting the data is crude, but it does exactly what I need: each greenhouse gas image is exported as a table named after its first band.

In [None]:
# Only once this is completed can you move forward and get pictures from these spreadsheets.
for ghg_img in eec.ghg_imgs:
  csv_name = ghg_img.getInfo().get('bands')[0].get('id')
  print(csv_name)
  #eeutil.exportTableFromImage(ghg_img, eec.south_east, 1000, c.export_dir, csv_name)

#### Getting Images From CSV Data

In [None]:
eeutil.getImgsFromCsv(f"{c.data_dir}/{c.SO2_band}.csv", eec.s2_img)

In [None]:
#pprint(ee.batch.Task.list())

# Data processing

In [None]:
fyputil.geotiffToPng("big_geotiff", f"{c.export_dir}/png_224", rm_artifacts=False)
fyputil.moveFilesByExtension("big_geotiff", f"{c.export_dir}/geotiff_224", ".tif")
fyputil.moveFilesByExtension("big_geotiff", f"{c.export_dir}/png_224", ".png")
fyputil.rmConversionArtifacts("big_geotiff", rmTif=False, rmXml=True)

In [None]:
fyputil.geotiffToPng(f"{c.export_dir}/geotiff_224", f"{c.export_dir}/png_224", rm_artifacts=False)

In [None]:
fyputil.moveFilesByExtension(f"{c.export_dir}/geotiff_224", f"{c.export_dir}/png_224", ".png")
fyputil.rmConversionArtifacts(f"{c.export_dir}/geotiff_224", rmTif=False, rmXml=True)

In [None]:
# Cleaning up if things go a bit wrong
fyputil.moveFilesByExtension(f"{c.export_dir}/geotiff", f"{c.export_dir}/png", ".png")
fyputil.rmConversionArtifacts(f"{c.export_dir}/geotiff", rmTif=False, rmXml=True)
fyputil.moveFilesByExtension(c.geotiff_dir, f"{c.export_dir}/geotiff", ".tif")
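
For readers without the `fyputil` source to hand, the sketch below shows what a move-by-extension step like the ones above might look like using only the standard library. It is an assumption about the behaviour, not the actual `fyputil.moveFilesByExtension` implementation, and the name `move_files_by_extension` is hypothetical.

```python
import shutil
from pathlib import Path


def move_files_by_extension(src_dir, dest_dir, extension):
    """Move every file in src_dir whose suffix equals `extension` into dest_dir.

    Hypothetical stand-in for illustration; returns the moved file names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materialises the listing before any file is moved
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and path.suffix == extension:
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved
```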

### Incorporate all CSVs into one

This could be more efficient, but it's only processed once at the minute.

In [None]:
# Run the same coordinate parsing over every per-gas CSV
for band in (c.CO_band, c.HCHO_band, c.NO2_band, c.O3_band, c.SO2_band, c.CH4_band):
  fyputil.parseCsvCoords(f"{c.data_dir}/se-{band}.csv")

In [None]:
# Parse CSVs into pandas dataframes
# TODO rewrite so we aren't deleting columns directly - do it properly! Incorporate these into one csv export in the 
#      output pipeline 
co_df = pd.read_csv(f"{c.data_dir}/se-{c.CO_band}.csv")
del co_df[".geo"]
hcho_df = pd.read_csv(f"{c.data_dir}/se-{c.HCHO_band}.csv")
del hcho_df[".geo"]
no2_df = pd.read_csv(f"{c.data_dir}/se-{c.NO2_band}.csv")
del no2_df[".geo"]
o3_df = pd.read_csv(f"{c.data_dir}/se-{c.O3_band}.csv")
del o3_df[".geo"]
so2_df = pd.read_csv(f"{c.data_dir}/se-{c.SO2_band}.csv")
del so2_df[".geo"]
ch4_df = pd.read_csv(f"{c.data_dir}/se-{c.CH4_band}.csv")
del ch4_df[".geo"]

# Incorporate the individual CSVs into one GHG dataframe.
# TODO fix this so we aren't repeating the same thing over and over
mrg_params = ['longitude', 'latitude']
# An inner join keeps only rows whose (longitude, latitude) pair appears in both
# frames, i.e. the intersection, so every remaining row has a value for each gas.
mrg_type = 'inner'

intersect = pd.merge(so2_df, ch4_df, how=mrg_type, on=mrg_params)
intersect = pd.merge(intersect, co_df, how=mrg_type, on=mrg_params)
intersect = pd.merge(intersect, hcho_df, how=mrg_type, on=mrg_params)
intersect = pd.merge(intersect, no2_df, how=mrg_type, on=mrg_params)
intersect = pd.merge(intersect, o3_df, how=mrg_type, on=mrg_params)

print(intersect.shape)
intersect.iloc[0:4] 
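
One way to avoid repeating the merge line by line is to fold the list of dataframes with `functools.reduce`. The sketch below uses toy data with made-up values, and `merge_on_coords` is a hypothetical name; it assumes each table has `longitude` and `latitude` columns and no other column names in common.

```python
from functools import reduce

import pandas as pd

MERGE_KEYS = ["longitude", "latitude"]


def merge_on_coords(frames):
    """Inner-join a list of dataframes on their shared coordinate columns."""
    return reduce(
        lambda left, right: pd.merge(left, right, how="inner", on=MERGE_KEYS),
        frames,
    )


# Toy stand-ins for the per-gas tables (values are made up for illustration).
so2 = pd.DataFrame({"longitude": [0.1, 0.2, 0.3],
                    "latitude": [51.0, 51.1, 51.2],
                    "SO2": [1.0, 2.0, 3.0]})
ch4 = pd.DataFrame({"longitude": [0.2, 0.3, 0.4],
                    "latitude": [51.1, 51.2, 51.3],
                    "CH4": [10.0, 20.0, 30.0]})
co = pd.DataFrame({"longitude": [0.3, 0.2],
                   "latitude": [51.2, 51.1],
                   "CO": [300.0, 200.0]})

# Only the coordinate pairs present in all three tables survive the join.
merged = merge_on_coords([so2, ch4, co])
print(merged)
```

If the real tables share a non-key column such as `system:index`, dropping it alongside `.geo` before merging would avoid the `_x`/`_y` suffixes that the next cell cleans up.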

In [None]:
# Hacky again, but it'll do for now
del intersect["system:index_x"]
del intersect["system:index_y"]
intersect.iloc[0:4] 

In [None]:
raw_ghg_df = intersect.copy()

for index, row in intersect.iterrows():
  coords = (row.longitude, row.latitude)
  #print(coords)
  filepath = f"{c.data_dir}/png_224/{coords[0]}_{coords[1]}.png"
  if not os.path.isfile(filepath):
    print(f"dropping {filepath} from row {index}")
    # TODO implement this in a way that doesn't suck. 
    raw_ghg_df = raw_ghg_df.drop(index=index)
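
A possible alternative to dropping rows one at a time is to build a boolean mask and filter once. This is a sketch under the same `<longitude>_<latitude>.png` naming assumption as the loop above; `keep_rows_with_images` is a hypothetical helper, not something `fyputil` provides.

```python
import os

import pandas as pd


def keep_rows_with_images(df, img_dir, extension=".png"):
    """Return a copy of df with only the rows whose image file exists in img_dir.

    Expects "longitude" and "latitude" columns; file names are assumed to be
    "<longitude>_<latitude><extension>".
    """
    exists = [
        os.path.isfile(os.path.join(img_dir, f"{lon}_{lat}{extension}"))
        for lon, lat in zip(df["longitude"], df["latitude"])
    ]
    # Align the mask with df's index so filtering works after earlier drops
    mask = pd.Series(exists, index=df.index)
    return df[mask].copy()
```

In this notebook the call would be roughly `raw_ghg_df = keep_rows_with_images(intersect, f"{c.data_dir}/png_224")`.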

In [None]:
print(intersect.shape)
print(raw_ghg_df.shape)
raw_ghg_df.iloc[0:10]

In [None]:
intersect.to_csv(f"{c.data_dir}/se-ghgs.csv")