
# ADS Project 1: The Deluge - Flood Risk in the UK

[James Percival](https://www.imperial.ac.uk/people/j.percival),
[Parastoo Salah](https://www.imperial.ac.uk/people/p.salah)
and
[Marijan Beg](https://www.imperial.ac.uk/people/marijan.beg)


## Synopsis:

Given the global and UK distribution of human habitation, flooding is one of the most common and impactful natural disasters. [Analysis of satellite data](https://doi.org/10.1038/s41586-021-03695-w) coupled with [predictions on future changes](https://www.ipcc.ch/report/ar5/syr/) to the climate suggests that flooding will only become more frequent and severe.

Flooding can occur from a number of sources:

- Surface water (from rainfall run-off, or from the fresh or dirty water systems)
- Rivers, due to blockage or excess rainfall
- Seas, due to sea level rise or storm surge.

![Flooding in York, UK](images/York_Floods_2015.jpg)
_picture taken by [J. Scott](https://flickr.com/photos/60539443@N05/23675629939), used under [CC-BY](https://creativecommons.org/licenses/by/2.0/deed.en)_

Currently, flood risk from these sources in the UK is assessed in a number of ways using topographic data, large, complex numerical models and a great degree of human experience. It would be useful to have rapid tools leveraging ML techniques to update these risk assessments piecemeal as new data is obtained.

The purpose of this exercise is to:

  (a) develop prediction/classification routines for flood probability and impact for the UK, and
  
  (b) use this tool, together with rainfall and river level data to provide a holistic tool to assess and analyse flood risk.


## Problem definition

### Datasets

Sample datasets have been provided for you in the `flood_tool/resources` folder. These represent a subset of the potential test data which will be applied to your model.

#### Postcode data

Three `.csv` files, `postcodes_labelled.csv`, `postcodes_unlabelled.csv` and `sector_data.csv` deal with information indexed by postcode, or postcode sector.

The fully labelled `postcodes_labelled.csv` data provides a sample of labelled data for postcodes in a region of England. The column headings are:

- `postcode`: The full unit postcode for the row.
- `sector`: The postcode sector for the row.
- `easting`: The OS easting (in m) for the centroid of this postcode.
- `northing`: The OS northing (in m) for the centroid of this postcode.
- `soilType`: The typical soil type for the postcode, as a category.
- `elevation`: The height (in m) above sea level for the centroid of this postcode.
- `localAuthority`: The Local Authority governing this postcode.
- `riskLabel`: The probability class for flood risk (from rivers and seas) for the postcode.
- `medianPrice`: A typical house price (in £) for this postcode.
- `historicFlooding`: A binary flag indicating whether this postcode has experienced major flooding since 1949.

The postcode format in the file uses numbers and capital letters, with a single space in the middle separating the district from the sector/unit label. This should be treated as the canonical format for postcodes in this exercise (and the one you should output), but it is not guaranteed that all postcodes you encounter will be in this format. Taking the postcodes "SW7 2AZ" and "SW1A 1AA" as examples, a non-exhaustive list of alternative formats includes:

```
SW7 2AZ, SW1A 1AA
sw7 2az, sw1a 1aa
SW72AZ, SW1A1AA
SW7 2AZ, SW1A1AA
SW7  2AZ, SW1A 1AA  
```

Note that the last two versions have a fixed length (of 7 and 8 characters respectively), which makes them a common format in some kinds of database.
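
One possible way to bring the variants above back to the canonical format is sketched below. The helper name `normalise_postcode` is hypothetical (it is not part of the provided `flood_tool` skeleton), and the pattern only covers standard unit postcodes, so special cases would need extra handling.

```python
import re

# The inward part (sector digit plus two-letter unit) is always the last
# three characters of a standard unit postcode.
_POSTCODE_RE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)([0-9][A-Z]{2})$")


def normalise_postcode(raw: str) -> str:
    """Return `raw` in the canonical form, e.g. 'SW7 2AZ'.

    Hypothetical helper for illustration only.
    """
    # Remove all whitespace and upper-case, so every variant collapses
    # to the same compact string.
    compact = re.sub(r"\s+", "", raw).upper()
    match = _POSTCODE_RE.match(compact)
    if match is None:
        raise ValueError(f"Not a recognisable unit postcode: {raw!r}")
    outward, inward = match.groups()
    # Rejoin the two parts with exactly one space.
    return f"{outward} {inward}"


variants = ["SW7 2AZ", "sw7 2az", "SW72AZ", "SW7  2AZ", "SW1A1AA", "sw1a 1aa"]
canonical = [normalise_postcode(v) for v in variants]
```

Stripping all whitespace before splitting means the fixed-length database formats and the lower-case forms are handled by the same code path.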

The approximate probabilities of flooding for the ten classes (in terms of the likelihood of at least one event in a given year) can be assumed to be:

| Class | Probability of a flood event in a given year |
|:-----:|:---------:|
| 10 | 5% |
| 9 | 4% |
| 8 | 3% |
| 7 | 2% |
| 6 | 1.5% |
| 5 | 1% |
| 4 | 0.5% |
| 3 | 0.3% |
| 2 | 0.2% |
| 1 | 0.1% |

So the lowest risk class expects one event in 1,000 years (or longer) and the highest risk class expects one event in 20 years (or sooner).
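
As a minimal sketch, the class-to-probability table can be held as a lookup, from which the expected interval between events follows as the reciprocal of the annual probability. The names `CLASS_TO_PROBABILITY`, `annual_probability` and `return_period_years` are illustrative assumptions, not part of the provided skeleton.

```python
# Annual flood probability for each risk class, copied from the table above.
CLASS_TO_PROBABILITY = {
    1: 0.001, 2: 0.002, 3: 0.003, 4: 0.005, 5: 0.01,
    6: 0.015, 7: 0.02, 8: 0.03, 9: 0.04, 10: 0.05,
}


def annual_probability(risk_class: int) -> float:
    """Probability of at least one flood event in a given year."""
    return CLASS_TO_PROBABILITY[risk_class]


def return_period_years(risk_class: int) -> float:
    """Expected number of years per event: the reciprocal of the probability."""
    return 1.0 / CLASS_TO_PROBABILITY[risk_class]


lowest = return_period_years(1)    # roughly 1,000 years
highest = return_period_years(10)  # roughly 20 years
```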

The `postcodes_unlabelled.csv` file in the `flood_tool/resources` directory provides an example of the input format to expect for the raw unlabelled data for which you must make predictions. This shares its columns with the first 6 of the `postcodes_labelled.csv` file columns, but doesn't have Local Authority information, a risk class, house price data, or information on historic flooding.

The `sector_data.csv` file contains information on the number of people (the headcount) and households in each postcode sector, as well as the number of postcode units in each sector. This is a component of the impact factors involved in a flooding event.

The `district_data.csv` file contains information on the number of pets in each postcode district. Additional similar data (or similar data at the sector level) can be added if it will improve the quality of your predictions.

#### Rain, river and tidal data

The `typical_day.csv` and `wet_day.csv` files provide examples of UK rainfall, tide and river level information taken from UK Environment Agency data. The columns are:

- `dateTime`: The time for the reference.
- `stationReference`: The short code for the station.
- `parameter`: The property being measured.
- `qualifier`: Additional information on the measurement.
- `unitName`: The unit of measurement, either millimeters (`mm`) for rain data or meters above a notional stage depth (`mASD`) for river data.
- `value`: The actual measurement.

The rainfall data is primarily from tipping bucket rain gauges, which capture the height of the water column that has fallen at a given location over the 15-minute measurement period. River data is the instantaneous height of the river, with 0m being a "standard" height for the measurement site, and tidal data is with reference to the long term mean sea level at Newlyn.

As a point of reference, one typical scale for rainfall is

| Rainfall | Classifier |
|:--------:|:----------:|
| less than 2 mm per hour | slight |
| 2 mm to 4 mm per hour | moderate |
| 4 mm to 50 mm per hour | heavy |
| more than 50 mm per hour | violent |

although for flood risk both intensity and total quantity are factors. River levels vary naturally, but significant changes in water level and high existing water levels are both significant risk factors.
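
The scale above can be applied to the gauge readings with a short helper such as the sketch below. It assumes each rainfall `value` is the depth in mm accumulated over one 15-minute period, so it is first converted to mm per hour. The function names are hypothetical, and the treatment of readings exactly on a boundary (2, 4 or 50 mm per hour) is a choice made here, since the table does not specify it.

```python
def rainfall_rate_mm_per_hour(depth_mm: float, period_minutes: float = 15.0) -> float:
    """Convert a depth accumulated over `period_minutes` into mm per hour."""
    return depth_mm * 60.0 / period_minutes


def classify_rainfall(rate_mm_per_hour: float) -> str:
    """Map an hourly rainfall rate onto the slight/moderate/heavy/violent scale.

    Boundary handling is an assumption: exactly 2 counts as moderate,
    exactly 4 and exactly 50 count as heavy.
    """
    if rate_mm_per_hour < 2.0:
        return "slight"
    if rate_mm_per_hour < 4.0:
        return "moderate"
    if rate_mm_per_hour <= 50.0:
        return "heavy"
    return "violent"


# A gauge reporting 0.75 mm in one 15-minute period is raining at 3 mm per hour.
label = classify_rainfall(rainfall_rate_mm_per_hour(0.75))
```

Because total quantity also matters for flood risk, a rate-based label like this is only one input and would normally be combined with accumulated rainfall over a longer window.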

The `stations.csv` file contains additional information on the stations reporting data in the previous two files, namely the station name, latitude and longitude, as well as maximums, minimums and 95% data limits for the river stage measurements. More information is available via the Environment Agency API at a URI in the form

```
https://environment.data.gov.uk/flood-monitoring/id/stations/{stationReference}
```
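
A minimal sketch of building and querying this URI with the standard library is given below. The helper names are hypothetical, the station reference `E1234` is a made-up placeholder, and the structure of the returned JSON is not described here, so it should be inspected before relying on any particular field.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

STATION_URL = "https://environment.data.gov.uk/flood-monitoring/id/stations/{ref}"


def station_url(station_reference: str) -> str:
    """Fill the station reference into the URI template shown above."""
    # Percent-encode the reference in case it contains unusual characters.
    return STATION_URL.format(ref=quote(station_reference, safe=""))


def fetch_station(station_reference: str, timeout: float = 10.0) -> dict:
    """Download and parse the station record (needs network access)."""
    with urlopen(station_url(station_reference), timeout=timeout) as response:
        return json.load(response)


# "E1234" is a placeholder, not a known station reference.
example_url = station_url("E1234")
```

Keeping the URL construction separate from the download makes it easy to test offline and to fall back to the `.csv` files when no connection is available.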

### Definition of risk

For this project flood risk is defined by combining both probability of a flooding event and the impact of an event (for which property value is a proxy). You may use a risk score defined as 

$$ R := 0.05 \times (\textrm{total property value}) \times(\textrm{flood probability}) $$

Here 0.05 is an (arbitrary) estimate of the value lost when a flood affects a property. Potential additional considerations are the number of households impacted and the extent of the local area in which flooding appears likely.
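
The risk score above translates directly into code. The sketch below uses hypothetical names, and the example figures (a total property value of £250,000 and a 5% annual flood probability) are made up for illustration.

```python
import math

# Arbitrary estimate of the fraction of value lost when a flood affects a property.
DAMAGE_FRACTION = 0.05


def risk_score(total_property_value: float, flood_probability: float) -> float:
    """R := 0.05 x (total property value) x (flood probability)."""
    return DAMAGE_FRACTION * total_property_value * flood_probability


# Illustrative values only: 0.05 * 250,000 * 0.05 is about 625.
example_risk = risk_score(250_000.0, 0.05)
assert math.isclose(example_risk, 625.0)
```

How the total property value for a postcode is assembled (for example from a typical price and a household count) is left open by the definition and is a modelling decision.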


## Challenge

Your task is to develop a Python program with two main features: 

1. The ability to classify flood risk for UK postcodes & locations based on a subset of labelled data.
2. The ability to visualise and analyse rainfall data in conjunction with the above tool to present risk information to the user.

In the following, we describe the functionality that we would like you to incorporate into these features. 

### Risk Tool

#### Core functionality

Your tool must:

1. Provide at least one classifier/regressor for postcodes in England into a ten-class flood probability scale based on provided labelled samples.
2. Provide a regression tool for median house price for postcodes in England, given sampled data.
3. Provide a classifier for historic flooding for postcodes in England, given labelled sampled data.
4. Provide a regression tool & a classifier taking in an arbitrary location and predicting the Local Authority and flood risk.
5. Calculate a predicted risk for input postcodes.

Class method interfaces for this core functionality have been specified in the skeleton `flood_tool` package. These core interfaces will be used during the automated scoring runs during the week, and should only be changed if requested by an instructor.


#### Additional functionality

You can also develop a simple method for a user to interact with your tool (e.g., a Jupyter notebook, command-line arguments or example Python scripts) and document its usage appropriately.


### Data visualiser

#### Core functionality

Your visualiser must present the information required for the previous section. It should also use rainfall and river data, in the format provided in the example `.csv` files (and if you choose, via the API) to indicate potential areas at risk (i.e. with rainfall, river levels or tides significantly above normal), as well as the potential impact of a flooding event.

#### Additional functionality

You may extend your visualiser in any appropriate direction (if you're unsure, please consult with an instructor). Some potential directions include:
1. Adding the ability to interact with live rain data (you must still have an offline mode)
2. Adding additional data sources


## Assessment

The **deadline** for software submission is **Friday 24th November, 12:00 pm (Noon) GMT**.

### Software (70 marks)

Your software will be assessed primarily on functionality (**30 marks**), performance (**20 marks**) and sustainability (**20 marks**). Here sustainability can be interpreted as how easy it would be for a new group of students to maintain or extend your repository.

Functionality (**30 marks**): Your software will be scored partly on its ability to perform predictions on unseen training and test data, and partly on how simple it would be to include additional data. The automated tests will:

  1. Score your classification of flood probabilities, using an approach based on the metric described below.
  2. Score your regression routines for median house prices for UK postcodes, using an approach based on the root mean square error.
  3. Score your classification of historic flooding, using an approach based on the recall and precision of your classifier (i.e. the F1 score).
  4. Score your classification of the correct local authority for a given location, using an approach based on the accuracy of your classifier.
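
For local checking, the standard definitions of the metrics named above can be sketched in plain Python as follows. This is not the official scoring code, whose exact form is not given here, and the function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Equivalent functions are available in `sklearn.metrics` (`mean_squared_error`, `f1_score`, `accuracy_score`), which would normally be preferred in the project itself.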

Indicative scores for the automated parts of the functionality and performance will be computed for these tests at two or three points during the week of the project. Note that the marks for Functionality and Performance will be based on these scores (i.e., higher score implies higher mark), but not necessarily in an exact or linear mapping.

||1|2|3|4|5|6|7|8|9|10|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|1|100| 80| 60| 60| 30| 0| -30| -600| -1800| -2400|
|2|80| 100| 80| 90| 60| 30| 0| -300| -1200| -1800|
|3|60| 80| 100| 120| 90| 60| 30| 0|  -600| -1200|
|4|40| 60| 80|  150| 120| 90| 60| 300| 0|-600|
|5|20| 40| 60| 120| 150| 120| 90| 600| 600| 0|
|6|0| 20| 40| 90| 120| 150| 120| 900| 1200| 600|
|7|-20| 0| 20| 60| 90| 120| 150| 1200| 1800| 1200|
|8|-40| -20| 0| 30| 60| 90| 120| 1500| 2400| 1800|
|9|-60| -40| -20| 0| 30| 60| 90| 1200| 3000| 2400|
|10|-80| -60| -40| -30| 0| 30| 60| 900| 2400| 3000|

The visualiser functionality will be assessed manually at the end of the week, and will be scored based on the following criteria:
   - The ability to visualise/analyse the data provided in the sample files in a meaningful way.
   - The ability to visualise/use the data provided in the sample files in conjunction with the flood risk tool.
   - The ability to visualise/analyse live data returned through your tool (if you choose to implement this).


Sustainability (**20 marks**): As with all software projects, you should employ all the elements of best practice in software development that you have learned so far. A GitHub repository will be created for your project to host your software. The quality and sustainability of your software and its documentation will be assessed based on your final repository and how it evolves during the week. Specific attention will be given to the following elements:

1. Installation and usage instructions
2. Documentation (in HTML / PDF format). Note that a template Sphinx documentation framework has been provided for you as a starting point for your documentation.
3. Coding style
4. Quality and coverage of automatic testing framework
5. General repository usage
6. Licensing

Please refer to the module handbook for more information about the assessment of software quality.

### Presentation & One page Flyer (20 marks)

Your project will also be assessed on the basis of a one page report & 15-minute video presentation that you must upload to your assigned group channel on Teams before the deadline of **Friday 24th November, 4:00 pm GMT**.

#### Report

Your report should be up to 1 page of text (1,800 characters) in no smaller than 11pt text, along with up to 4 additional images (no more than 3 pages in total). It should be uploaded in .pdf form to your final repository, with the filename `report.pdf`. You will not be penalised if your report is shorter than these limits, providing it covers the required details. You can use any software you wish to produce it.

The report should cover the following questions:

1. What regression/classification method(s) did you use? Include technical details and/or references. Which methods did you investigate & reject?
2. Which features in the data did you find were important for your regressions? Which were unimportant? Would it be worthwhile including historic flooding data in the labelled model?
3. Demonstrate your data visualisation/analysis software applied to the "wet day" data, or to another significant event.

#### Presentation

Your presentation should be approximately 15 minutes long, and cover broadly similar details to the report (you can include some information in one but not the other). All team members should aim to contribute to the presentation, but not all need to be recorded on camera/microphone.

You can record your presentation in any software that you like. If in doubt, MS Teams will work reasonably, and is readily available to you.


### Teamwork (peer assessment; 10 marks)

After the presentations, you will complete a self-evaluation of your group's performance. This will inform the teamwork component of your mark. Please refer to the module guidelines for more information about the assessment of teamwork.

## Technical requirements

* You should use the assigned GitHub repository exclusively for your project
* Your software must be written to work correctly in Python 3.11
* You are free to import anything from the [standard python libraries](https://docs.python.org/3.11/library/index.html) as well as `numpy`, `matplotlib`, `pandas`, `scipy`, `mpltools`, `sklearn`, and `sympy` (see the `requirements.txt` for the full list of preapproved packages already in the environment. Packages from the standard library, e.g. the `os` module, don't need to be listed). You should submit a request for any other packages you wish to use. Requests to use alternative packages should be submitted by 12 noon GMT on Thursday, and will be announced to all groups.
* You have been given some geographic mapping examples using [folium](https://python-visualization.github.io/folium/latest/), but can request to use an alternative mapping package if you've used it previously. Requests should respect the 12 noon GMT deadline on Thursday mentioned above.
* You are not allowed to import other third-party Python packages without authorization (if in doubt, please check with one of the instructors).
* You can assume that the users of your software will have `pytest` installed, so this does not need to be part of your pre-requisites.
* You should use GitHub Actions for any automated testing that you decide to implement. Example workflows are given in your template repository.
* You do not need to make a Graphical User Interface for your software (although you can if you choose): the program can be designed to run via the command line as a standalone tool, interactively in a Python/IPython environment, or via examples in a notebook.