<a href="https://colab.research.google.com/github/WRFitch/fyp/blob/main/src/fyp_ai_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI data analysis

### Setup
- Install & import necessary libraries
- Mount drive
- Import and define handy variables 

In [None]:
# The fastai version preinstalled on Colab is sometimes wrong, so we reinstall it
# without the pip cache. Reinstalling and restarting the runtime should fix any major issues.
!pip uninstall -y fastai
!pip install -U --no-cache-dir fastai

In [None]:
import pandas as pd

#from fastai import *
#from fastai.vision import *
from fastai.vision.all import *
from google.colab import drive

drive.mount('/content/drive')

In [None]:
%rm -rf /content/fyp

In [None]:
%cd /content
!git clone https://github.com/WRFitch/fyp.git

In [None]:
# Import fyputil library
%cd /content/fyp/src/fyputil
import constants as c
import fyp_utils as fyputil
%cd /content

#### Test Image Import

(Not strictly necessary, but nice to have) 

In [None]:
# Import data from Google Drive.
# This is getting really slow. Is there too much data? If so, slice to use only
# every tenth image, so we still get a decently stratified set.
#imgs = get_image_files(c.png_dir)
#print(len(imgs))
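
The "every tenth image" idea from the comment above can be sketched in plain Python. This is only an illustration: `every_nth` is a hypothetical helper, and the file names are placeholders rather than real dataset paths. Sorting first makes the selection repeatable between runs.

```python
from pathlib import Path

def every_nth(items, step=10):
    """Return every `step`-th item of the sorted input (a systematic sample)."""
    return sorted(items)[::step]

# Placeholder paths standing in for the output of get_image_files(...)
paths = [Path(f"img_{i:03d}.png") for i in range(95)]
subset = every_nth(paths)
print(len(paths), len(subset))
```

The same `[::10]` slice should also work directly on the list returned by `get_image_files`.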

In [None]:
#big_imgs = get_image_files(c.big_png_dir)
#print(len(big_imgs))

### Data Setup

In [None]:
ghg_df = pd.read_csv(c.ghg_csv)
ghg_df = fyputil.normGhgDfProperly(ghg_df)

In [None]:
def getGhgsAsArr(img_path):
  return fyputil.getGhgsAsArr(img_path, ghg_df)

def getGhgImgs(path):
  return get_image_files(path).filter(fyputil.imgIsInDf)

In [None]:
# TODO implement multiple transforms pipeline
# TODO revisit image normalisation

ghg_block = DataBlock(
    blocks = (ImageBlock, RegressionBlock),
    get_items = getGhgImgs,
    get_y = getGhgsAsArr,
    item_tfms = Resize(224)
)

ghg_dl = ghg_block.dataloaders(c.big_png_dir)

In [None]:
ghg_dl.show_batch(nrows=9, max_n=9, figsize = (50,50))

In [None]:
ghg_block.summary(c.big_png_dir)

## Training

### Image Recognition and Feature Extraction

- Train an image-based predictor to estimate greenhouse gas concentrations from a 1 km square of land.
  - Transfer an ImageNet-pretrained predictor to work on top-down imagery.
  - Start by predicting one GHG and expand from there.
- Use the image predictor to extract a basic feature set by slicing the network at different points. The idea is to limit the amount of data going into the tabular recommender while transferring as much useful information as possible. We want to implicitly extract the GHG-emitting features of each image while losing as little detail as possible, as a form of convolutional preprocessing.
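
To make the "slicing the network at different points" idea concrete, here is a dependency-free toy. It is not the real model: `toy_body`, `relu` and `avg_pool_pairs` are hypothetical stand-ins for trained layers, chosen only to show that cutting the pipeline later yields fewer, more compressed features.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]

def relu(xs: List[float]) -> List[float]:
    """Zero out negative values."""
    return [max(0.0, x) for x in xs]

def avg_pool_pairs(xs: List[float]) -> List[float]:
    """Halve the vector length by averaging adjacent pairs."""
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]

def run_layers(layers: List[Layer], xs: List[float]) -> List[float]:
    """Apply the layers in order and return the final activations."""
    for layer in layers:
        xs = layer(xs)
    return xs

# Hypothetical stand-in for the body of a trained network
toy_body: List[Layer] = [relu, avg_pool_pairs, relu, avg_pool_pairs, avg_pool_pairs]

pixels = [0.5, -1.0, 2.0, 3.0, -0.5, 1.5, 4.0, 0.0]

# Cutting the pipeline at different depths trades detail for compactness
for cut in (1, 2, 4, 5):
    features = run_layers(toy_body[:cut], pixels)
    print(f"cut after layer {cut}: {len(features)} features -> {features}")
```

With a real PyTorch model the same slicing syntax applies to `nn.Sequential` containers, so the choice of cut point becomes the knob that controls how much data reaches the tabular recommender.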


In [None]:
# Uses a regression approach.
# TODO analyse metrics. It doesn't seem to matter which one is used so long as
#      everything is evaluated equally, but I'd like to be sure - ask Allan on Monday.
# TODO Further experimentation with resnet depth is necessary. resnet34 provides _ok_
#      predictions; deeper is usually better but takes longer to train. While
#      I'm iterating on the design, training speed matters. Once I'm at a stage
#      where I can export my model and use it as is, I'll take the time to train
#      a much larger network.
learn = cnn_learner(ghg_dl, resnet152, y_range=(0, 100), metrics=rmse)
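
In fastai, `y_range=(0, 100)` passes the raw output of the final layer through a scaled sigmoid, so predictions always stay inside that interval. A minimal plain-Python sketch of the transformation follows; the function is a standalone re-implementation for illustration, not fastai's own code.

```python
import math

def sigmoid_range(x: float, low: float, high: float) -> float:
    """Squash x into the open interval (low, high) with a scaled sigmoid."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Raw network outputs (illustrative values) mapped into the (0, 100) range
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(sigmoid_range(raw, 0, 100), 3))
```

One consequence is that targets sitting exactly at 0 or 100 can only be approached, never reached, so the range is usually chosen slightly wider than the data.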

In [None]:
name = "learner test"
learn.save(name)

In [None]:
learn.load(name)

In [None]:
# TODO examine 3d representation of problem space re: local optima 
learn.lr_find()

In [None]:
lr = 0.05
finelr = 0.0019
xfinelr = 0.0001
xxfinelr = 2e-5

In [None]:
# epochs = 5
learn.fine_tune(2, lr)

In [None]:
# Saving mid-training, so I can figure out a decent training pathway
learn.save("mid-training")

In [None]:
learn.load("mid-training")

In [None]:
learn.fine_tune(5, finelr)

In [None]:
learn.save("fine-tuning")

In [None]:
learn.fine_tune(5, xfinelr)

In [None]:
learn.save("xfine-tuning")

In [None]:
learn.load("xfine-tuning")

In [None]:
# At this point, further training doesn't seem to make any difference.
# There appears to be a point of diminishing returns, where the remaining rmse
# is just the error inherent in the given data.
learn.fine_tune(10, xxfinelr)

## Evaluate Model Performance 

### Plot results 

In [None]:
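# ds_idx=1 selects the validation set; passing the whole DataLoaders object as
# `dl` would override that choice, so it is left out here.
learn.show_results(ds_idx=1, nrows=9, max_n=9, figsize=(50, 50))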

In [None]:
# fastai v2 moved show_install out of fastai.utils.collect_env
from fastai.test_utils import show_install
show_install(True)

In [None]:
interp = Interpretation.from_learner(learn)

### Plot model statistics 

#### Plot layer stats

This shows the mean, standard deviation, and percentage of near-zero activations for a given layer, which helps identify areas of the network that require further analysis.
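
As a rough sketch of what these three statistics measure, the snippet below computes them for a single list of activations in plain Python. The activation values are made up, `layer_stats` is a hypothetical helper, and the 0.05 "near zero" threshold and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import fmean, stdev

def layer_stats(activations, near_zero_threshold=0.05):
    """Return mean, std, and the fraction of activations at or below the threshold."""
    near_zero = sum(1 for a in activations if a <= near_zero_threshold)
    return {
        "mean": fmean(activations),
        "std": stdev(activations),
        "pct_near_zero": near_zero / len(activations),
    }

# Illustrative post-ReLU activations: mostly dead, with a few strong responses
activations = [0.0, 0.0, 0.01, 0.5, 1.0, 0.0, 0.02, 2.0]
stats = layer_stats(activations)
print(stats)
```

A layer whose near-zero fraction stays very high throughout training is contributing little, which is the kind of area worth inspecting further.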

In [None]:
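# learn.activation_stats only exists if the learner was created with the
# ActivationStats callback, e.g. cbs=ActivationStats(with_hist=True)
learn.activation_stats.plot_layer_stats(151)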

In [None]:
learn.recorder.plot_sched()

In [None]:
# color_dim needs the histograms collected by ActivationStats(with_hist=True)
learn.activation_stats.color_dim(-4)

### Export the model

In [None]:
# Export model so we can use it for other things
# Note - this kills the model 
# TODO find a better naming convention (this name says resnet34, but the learner above is a resnet152)
new_model = "mrghg_230221_resnet34"
learn.export(f"{c.model_dir}/{new_model}.pkl")

In [None]:
# Import an exported model to check that it hasn't broken in the export process.
imported_learner = load_learner(f"{c.model_dir}/{c.model_name}.pkl")

In [None]:
# Predict from imported learner
imported_learner.predict(f"{c.png_dir}/-0.73212695655741_51.2533785354393.png")

#### Notes on Image Predictions

A lower learning rate appears to train more slowly but reach more sophisticated conclusions. Sophistication also appears to arise from a deeper network, but I'm hitting a wall at an rmse of roughly 0.6.
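
For reference, rmse is the square root of the mean squared difference between predictions and targets, so it is expressed in the same units as the (normalised) GHG values. The sketch below uses invented numbers purely to show the arithmetic; they are not results from this model, and `rmse_plain` is a hypothetical helper named to avoid clashing with fastai's `rmse` metric.

```python
import math

def rmse_plain(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented values for illustration only
predicted = [10.0, 12.5, 9.0, 11.0]
measured = [10.5, 12.0, 10.0, 11.0]
print(rmse_plain(predicted, measured))
```

Because the errors are squared before averaging, a few badly wrong predictions raise the score more than many slightly wrong ones.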

---

Currently, the networks have trouble picking up the more subtle characteristics of the images, which exposes some flaws in my work. The network will need supplemental information to predict greenhouse gas concentrations accurately. This may include the following:
- **Latitude/Longitude.** Geography may affect predictions: all the images in my current dataset are near London, meaning they have far more greenhouse gases than most places. Encoding a knowledge of city geography into a neural net may take some work.
- **Property Value.** How valuable is this land? This could go some way to encoding city dynamics, as well as explaining where the land might be. If land is rural but valuable, it's likely to be near major cities or airports.
- **Nearby GHG Values.** Combined with wind direction, an understanding of the source and direction of airflow may describe how areas inherit GHGs from elsewhere. An example of this would be the high concentration of NO<sub>2</sub> north of Heathrow Airport, which may be caused by common flight patterns heading north.
- **Wind Direction.** See above.
- **Land Use.** Depending on detail, this may help alleviate the "grey field/massive factory" issue described in my log. By knowing that certain areas are rural, residential, or industrial, we can limit errors that come from inferring purely visual information. If we can specifically define what a large grey box is doing, we can also come to more developed conclusions about its purpose. A recycling center, an oil refinery, and a brewery may all look similar from above, but information about what they _are_ will make the neural network less likely to confuse them.
- **Population Density/Economic Output.** This will work in a similar way to property value, where we can predict human activity and its effects on greenhouse gases. Economic output may have a complex relationship to GHG emissions that cannot be easily represented, depending on the form of industry. For example, an eco-tourist attraction may rely on its low carbon footprint for survival, whereas a petrol station relies on high carbon output.
- **Land Height**
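
For the latitude/longitude idea, the coordinates may already be recoverable from the image filenames. The sketch below assumes the names follow a `<longitude>_<latitude>.png` pattern, which is inferred from the example file used in the prediction cell (the values are consistent with a location near London in that order) rather than confirmed anywhere in this notebook. `coords_from_filename` is a hypothetical helper.

```python
from pathlib import Path
from typing import Tuple

def coords_from_filename(img_path) -> Tuple[float, float]:
    """Parse (longitude, latitude) from a name like '<lon>_<lat>.png'.

    The lon-then-lat order is an assumption based on one example filename.
    """
    stem = Path(img_path).stem  # drops the directory and the .png suffix
    lon_text, lat_text = stem.split("_")
    return float(lon_text), float(lat_text)

lon, lat = coords_from_filename("png/-0.73212695655741_51.2533785354393.png")
print(lon, lat)
```

These two numbers could then be added as extra columns alongside the extracted image features when building the tabular dataset.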

Effectively, this network recognises certain features of high-GHG land. Depending on sophistication, this may include airports, power plants, or other rare features, as well as recognising different types of wilderness or residential districts. This will be used to extract a feature set for a tabular recommender, which can then be used to find more accurate readings. 