<a href="https://colab.research.google.com/github/WRFitch/fyp/blob/main/src/fyp_ai_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Training

This notebook imports and trains new neural networks with fastai, using the imported greenhouse gas (GHG) and satellite photography dataset.

All notebooks in this project are development environments rather than scripts that produce the end product when run from top to bottom. Some code blocks and documentation therefore exist purely for developer convenience.

## Notebook Setup
- Install & import necessary libraries
- Mount drive

In [None]:
# The fastai version preinstalled on Colab is sometimes wrong, so we reinstall it
# without the pip cache. Reinstalling and then restarting the runtime should fix
# most major issues, including CUDA out-of-memory (OOM) errors.
!pip uninstall -y fastai
!pip install -U --no-cache-dir fastai
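
After the reinstall and a runtime restart, it can be useful to confirm which fastai version pip actually installed. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of fastai or fyputil.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the fastai version string, or None when fastai is not installed.
print("fastai:", installed_version("fastai"))
```

This reads the package metadata on disk, so it reflects what pip installed even if an older copy of the module is still loaded in the current runtime.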

In [None]:
import pandas as pd

from fastai.vision.all import *
from google.colab import drive

drive.mount('/content/drive')

In [None]:
%rm -rf /content/fyp

In [None]:
%cd /content
!git clone https://github.com/WRFitch/fyp.git

In [None]:
# Import fyputil library
%cd /content/fyp/src/fyputil
import constants as c
import fyp_utils as fyputil
%cd /content

## Data Setup

In [5]:
ghg_df = pd.read_csv(c.ghg_csv)
ghg_df = fyputil.normGhgDf(ghg_df)

In [17]:
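# Thin wrappers around fyputil functions that bind the global ghg_df, so each
# can be passed to the DataBlock as a single-argument callable.

# All GHG readings for the image at img_path
def getGhgsAsArr(img_path):
  return fyputil.getGhgsAsArr(img_path, ghg_df)

# Regression target: the first element of the GHG array, taken here to be CO
def getCO(img_path):
  return getGhgsAsArr(img_path)[0]

# True if the image at path has a matching row in ghg_df
def imgIsInDf(path):
  return fyputil.imgIsInDf(path, ghg_df)

# Only keep image files that have GHG readings
def getGhgImgs(path):
  return get_image_files(path).filter(imgIsInDf)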

# TODO implement multiple transforms pipeline
# TODO revisit image normalisation
# TODO ensure default random data splitter is ok (80/20 train/validation split)

ghg_block = DataBlock(
    blocks = (ImageBlock, RegressionBlock),
    get_items = getGhgImgs,
    get_y = getCO,
    item_tfms = Resize(224),
    splitter  = RandomSplitter()
)

# Testing large batches to avoid overfitting. No bs is passed here, so fastai's
# default batch size applies.
ghg_dl = ghg_block.dataloaders(c.big_png_dir)
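
The TODO about the default splitter can be reasoned about with a small sketch of what a random 80/20 split does. This is a standard-library imitation, not fastai's implementation; `random_split` is a hypothetical name and the `seed` value is arbitrary.

```python
import random
from typing import List, Optional, Sequence, Tuple


def random_split(items: Sequence, valid_pct: float = 0.2,
                 seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Shuffle item indices and hold out `valid_pct` of them for validation."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))
    # The first `cut` shuffled indices form the validation set, the rest train.
    return idxs[cut:], idxs[:cut]


train_idx, valid_idx = random_split(range(10), seed=42)
print(len(train_idx), len(valid_idx))  # 8 2
```

Without a fixed seed the validation set changes on every run, which makes results from separate training sessions harder to compare. `RandomSplitter` also accepts a `seed` argument for the same reason.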

### Datablock/loader evaluation

In [None]:
ghg_dl.show_batch(nrows=9, max_n=9, figsize = (50,50))

In [None]:
ghg_block.summary(c.big_png_dir)

In [None]:
bigimgs = get_image_files(c.big_png_dir)
len(bigimgs)

## Training

### Image Recognition and Feature Extraction

- Train an image-based predictor to estimate greenhouse gas concentrations from an image of a 1 km square of land.
  - Transfer an ImageNet-pretrained predictor and work top-down
  - Start by predicting one GHG and expand from there
- Use the image predictor to extract a basic feature set by slicing the network at different points. The idea is to limit the amount of data going into the tabular recommender while transferring as much useful information as possible. We want to implicitly extract the GHG-emitting features of each image without losing any detail, as a form of convolutional preprocessing.
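
The idea of slicing the network at different points can be illustrated with a toy, framework-free sketch. Everything here is hypothetical: the "layers" are plain functions standing in for convolution, pooling and the regression head, and `features_at` is not a fastai or fyputil function.

```python
from typing import Callable, List

Features = List[float]
Layer = Callable[[Features], Features]


def pool_pairs(xs: Features) -> Features:
    """Halve the vector by averaging neighbouring values (a stand-in for pooling)."""
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]


def relu(xs: Features) -> Features:
    """Zero out negative activations."""
    return [max(0.0, x) for x in xs]


def regress(xs: Features) -> Features:
    """Collapse everything to one number (a stand-in for the regression head)."""
    return [sum(xs) / len(xs)]


def features_at(layers: List[Layer], cut: int, x: Features) -> Features:
    """Run only the first `cut` layers and return the intermediate activations."""
    for layer in layers[:cut]:
        x = layer(x)
    return x


toy_net: List[Layer] = [pool_pairs, relu, pool_pairs, relu, regress]
pixels = [0.2, -0.4, 0.9, 0.1, -0.3, 0.5, 0.7, 0.6]

# Print the size of the feature vector obtained at each cut point.
for cut in (2, 4, 5):
    print(cut, len(features_at(toy_net, cut, pixels)))
```

Cutting after two layers yields four features, after four layers two features, and running the whole network yields the single prediction. The trade-off for the tabular recommender is the same in a real network: an earlier cut passes on more data, a later cut passes on less but more task-specific data.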


In [84]:
# TODO experiment with variable floating-point accuracy 
# TODO experiment with smaller networks
# TODO experiment with batch normalisation
# TODO experiment with adding a 2-layer head to the network to ensure decent conversions 
learn = cnn_learner(ghg_dl, resnet152, y_range=(0, 100), metrics=rmse)
name = "fresh learner"
learn.save(name)

Path('models/fresh learner.pth')
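
The `y_range=(0, 100)` argument constrains predictions to that interval. As far as the fastai documentation describes it, this is done by passing the final activation through a sigmoid scaled to the range. A minimal standard-library sketch of that mapping follows; it re-implements the formula for illustration and is not the library code.

```python
import math


def sigmoid_range(x: float, low: float, high: float) -> float:
    """Map any real x into the open interval (low, high) with a scaled sigmoid."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low


# Raw network outputs far from zero approach, but never reach, the bounds.
for raw in (-5.0, 0.0, 5.0):
    print(raw, round(sigmoid_range(raw, 0, 100), 2))
```

Because the sigmoid only approaches its limits, a target of exactly 0 or 100 cannot be predicted exactly. If the normalised GHG values include the extremes, a slightly wider range may be worth testing.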

In [30]:
learn.load(name)

<fastai.learner.Learner at 0x7efe43101990>

In [81]:
learn.freeze()

In [None]:
# FROZEN
learn.lr_find()

In [90]:
learn.unfreeze()

In [None]:
# UNFROZEN 
learn.lr_find()

In [21]:
# TODO When cutting release branch, update these to be the actual learning rates used for training the final branch. 
lr = 0.0012

In [None]:
# Fit the head (the only trainable layer group while the learner is frozen) before
# unfreezing, to get the network halfway there. Overfitting at this stage is not really a problem.
learn.fit(4, 7.585775892948732e-05)

In [None]:
learn.save("save1")

In [None]:
learn.load("save1")

In [34]:
lrs = slice(0.003, 0.1)
lrs2 = slice(1e-4, 3e-3)
lrs3 = slice(1e-6, 1e-3)
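
When a `slice(start, stop)` is passed as the learning rate, fastai (as far as its documentation describes) gives the earliest parameter group `start`, the last group `stop`, and spaces the groups in between by a constant multiplier. The sketch below re-implements that spacing for illustration; it is named after the fastai helper it imitates but is not the library code, and the count of three parameter groups is an assumption.

```python
from typing import List


def even_mults(start: float, stop: float, n: int) -> List[float]:
    """Return n learning rates spaced by a constant multiplier from start to stop."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]


# Assumption: three parameter groups (early body, late body, head).
for lr in even_mults(1e-4, 3e-3, 3):
    print(f"{lr:.2e}")
```

The effect is that pretrained early layers change slowly while the newly added head trains fastest.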

In [None]:
learn.fit_one_cycle(5, 0.0005)
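
Unlike `learn.fit`, which keeps the learning rate constant, `fit_one_cycle` warms the learning rate up to the given maximum and then anneals it towards zero. A minimal sketch of that learning-rate curve is below. The defaults `pct_start=0.25`, `div=25` and `div_final=1e5` are taken from the fastai documentation and should be checked against the installed version; `cos_anneal` and `one_cycle_lr` are illustrative names.

```python
import math


def cos_anneal(start: float, end: float, pos: float) -> float:
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pct: float, lr_max: float, pct_start: float = 0.25,
                 div: float = 25.0, div_final: float = 1e5) -> float:
    """Learning rate at fraction `pct` (0..1) of training under a one-cycle schedule."""
    if pct < pct_start:
        # Warm-up phase: lr_max / div up to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: lr_max down to lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


# Sample the schedule for the maximum learning rate used above.
for pct in (0.0, 0.25, 0.5, 1.0):
    print(pct, f"{one_cycle_lr(pct, 0.0005):.2e}")
```

Under these assumptions the value passed to `fit_one_cycle` is the peak, not the average, so it is not directly comparable with a rate passed to `fit`.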

In [None]:
learn.save("save2")

In [None]:
learn.load("save2")

In [None]:
learn.save("save3")

In [None]:
learn.load("save3")

## Evaluate Model Performance 

### Plot results 

In [None]:
learn.validate()

In [None]:
# ds_idx selects the dataset: 0 = training, 1 = validation
learn.show_results(ds_idx=1, nrows=9, max_n=9, figsize = (50,50))

### In-place testing

In [68]:
import pandas as pd 
pred_df = pd.read_csv(f"{c.data_dir}/CO_column_number_density.csv")
predcol = "CO_pred"
# Initialise as float so predictions are not written into an integer column
pred_df[predcol] = 0.0

In [71]:
spare = pred_df.copy()

In [None]:
pred_df

In [61]:
pred_df.iloc[1,5]

0

In [70]:
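for idx, row in pred_df.iterrows():
  coords = (row.longitude, row.latitude)
  # Skip rows with no exported image; their prediction keeps its initial value
  if not fyputil.imgExported(coords): continue
  filepath = fyputil.getFilepath(coords)
  # Write by label rather than by hard-coded column position
  pred_df.at[idx, predcol] = float(learn.predict(filepath)[0][0])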

In [None]:
pred_df["err"] = pred_df[c.CO_band] - pred_df[predcol]

In [None]:
# Named co_rmse so it does not overwrite fastai's rmse metric used by the learner
co_rmse = math.sqrt((pred_df["err"] ** 2).mean())
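
In the prediction loop above, rows without an exported image keep their initial prediction and therefore still contribute to the error. A sketch of an RMSE that skips such rows is below. It uses plain lists and `None` as a hypothetical marker for "no prediction"; the sample numbers are made up for illustration.

```python
import math
from typing import Optional, Sequence


def rmse_skipping_missing(actual: Sequence[float],
                          predicted: Sequence[Optional[float]]) -> float:
    """RMSE over the rows that have a prediction; None marks 'no prediction'."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted) if p is not None]
    if not squared:
        raise ValueError("no rows with a prediction")
    return math.sqrt(sum(squared) / len(squared))


actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, None, 33.0, 40.0]  # second row: no exported image
print(rmse_skipping_missing(actual, predicted))
```

In pandas, initialising the prediction column with NaN rather than a number has a similar effect, because `Series.mean` skips NaN values by default.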

### Export the model

In [None]:
# Export model so we can use it for other things. Note: this kills the model
# TODO find better naming convention
new_model = "080321_resnet152_10k-imgs_fit-one-cycle"
learn.export(f"{c.model_dir}/{new_model}.pkl")

In [None]:
# Re-import the model just exported to check that the export process has not broken it.
imported_learner = load_learner(f"{c.model_dir}/{new_model}.pkl")

In [None]:
# Predict from imported learner
imported_learner.predict(f"{c.png_dir}/-0.73212695655741_51.2533785354393.png")
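
The hard-coded path above suggests that images are named `<longitude>_<latitude>.png`. That convention is an assumption inferred from this one filename; the real logic lives in `fyputil.getFilepath`, which is not shown here. Under that assumption, converting between coordinates and filenames could look like the following sketch, where both helper names are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def coords_to_filename(coords: Tuple[float, float]) -> str:
    """Build '<longitude>_<latitude>.png' (assumed naming convention)."""
    longitude, latitude = coords
    return f"{longitude}_{latitude}.png"


def filename_to_coords(path: str) -> Tuple[float, float]:
    """Recover (longitude, latitude) from a filename in the assumed convention."""
    lon_text, lat_text = Path(path).stem.split("_")
    return float(lon_text), float(lat_text)


example = "-0.73212695655741_51.2533785354393.png"
print(filename_to_coords(example))
```

A helper like this would let the test image be chosen by coordinates from the dataframe instead of by a pasted filename.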

#### Notes on Image Predictions

A lower learning rate appears to slow training but lead to more sophisticated predictions. A deeper network also appears to increase sophistication.

Effectively, this network recognises certain features of high-GHG land. Depending on sophistication, this may include airports, power plants, or other rare features, as well as recognising different types of wilderness or residential districts. This will be used to extract a feature set for a tabular recommender, which can then be used to find more accurate readings. 