## Notebook Submission Template

This notebook is one of the mandatory deliverables when you submit your solution. Its structure follows the WDL evaluation criteria and it has dedicated cells where you should add information. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work. Make sure to list all the datasets used besides the ones provided.

Instructions:
1. 🧱 Create a separate copy of this template and **do not change** the predefined structure
2. 👥 Fill in the Authors section with the name of each team member
3. 💻 Develop your code - make sure to add comments and save all the output you want the jury to see. Your code **must be** runnable!
4. 📄 Fill in all the text sections
5. 🗑️ Remove this section (‘Notebook Submission Template’) and any instructions inside other sections
6. 📥 Export as HTML and make sure all the visualisations are visible.
7. ⬆️ Upload the .ipynb file to the submission platform.


## 🎯 Challenge
Predict Waste Production for its Reduction


## 👥 Authors

* Claire Benard
* Diego Arenas
* Natalie Muenter
* Tom Constant
* Tom Wagstaff


## 💻 Development
Start coding here! 🐱‍🏍

Create the necessary subsections (e.g. EDA, different experiments, etc.) and markdown cells to include descriptions of your work where you see fit. Comment your code.

All new subsections must start with three hash characters.

Pro-tip 1: Don't forget to make the jury's life easier. Remove any unnecessary prints before submitting the work. Hide any long output cells (from training a model for example). For each subsection, have a quick introduction (justifying what you are about to do) and conclusion (results you got from what you did). 

Pro-tip 2: Have many similar graphs which all tell the same story? Add them to the appendix and show only a couple of examples, mentioning that all the others are in the appendix.

### 🦺 Set up

In [1]:
import pandas as pd

### 🧼 Data import and cleaning

Let's import the **clean data**. Details of how the data was cleaned are available in the appendix but here is a summary of the issues we identified and the decisions we made:

1) The dataset was supposed to have `Load.ID` as a primary key, but this wasn't the case, so we removed duplicates and kept the first row of each group that shared a `Load.ID` (44 rows).

2) We removed rows with a missing `Load.Weight`. 99.6% of them were in `SWEEPING`, which we excluded from the analysis anyway.

3) The time series is incomplete for some load types. We kept data from **2005** onwards.

4) `Load.Type` categories don't seem mutually exclusive.

This could be because recycling streams were split into different categories over time, or because the data collection process changed. We kept the categories making up most of the waste, namely: **Brush, Bulk, Dead animal, Garbage collections, Litter, Mixed Litter, Recycled Metal, Recycling - single stream, Tires and Yard Trimming.**

5) There were some data errors and extreme values that needed to be removed from the analysis. We used an **isolation forest** to identify outliers.

6) We grouped the data by day, load type and route number.

7) We merged in the population data.
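
The cleaning steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration on invented toy data, not the code used for the analysis (that lives in the appendix): the `Load.Time` column name, the `clean_and_aggregate` helper, the left join on `year` and all toy values are assumptions, and step 5 (outlier flagging with an isolation forest) is left out.

```python
import pandas as pd

# Toy stand-in for the raw records; every value here is invented for illustration.
# `Load.Time` is a hypothetical name for the timestamp column.
raw = pd.DataFrame({
    "Load.ID": [1, 1, 2, 3, 4, 5, 6],
    "Load.Time": pd.to_datetime([
        "2005-01-03 08:00", "2005-01-03 08:00", "2005-01-03 09:30",
        "2005-01-03 11:00", "2004-12-30 10:00", "2005-01-04 07:45",
        "2005-01-04 08:15",
    ]),
    "Load.Type": ["BRUSH", "BRUSH", "BRUSH", "BULK", "BRUSH", "SWEEPING", "BULK"],
    "Route.Number": ["BR05", "BR05", "BR05", "BU27", "BR05", "SW01", "BU27"],
    "Load.Weight": [5000.0, 5000.0, 3000.0, 4620.0, 2500.0, None, 4000.0],
})

# Toy population table keyed by year (assumed layout).
population = pd.DataFrame({"year": [2005], "total_pop": [700407.0]})


def clean_and_aggregate(raw, population, kept_types, start_year=2005):
    """Hypothetical helper mirroring steps 1-4, 6 and 7 of the summary."""
    # 1) Enforce Load.ID as a key: keep the first row of each Load.ID group.
    df = raw.drop_duplicates(subset="Load.ID", keep="first")
    # 2) Drop rows with a missing weight.
    df = df.dropna(subset=["Load.Weight"])
    # 3) Keep records from start_year onwards.
    df = df[df["Load.Time"].dt.year >= start_year]
    # 4) Keep only the selected load types.
    df = df[df["Load.Type"].isin(kept_types)]
    # 6) Aggregate by day, load type and route number.
    df = df.assign(date=df["Load.Time"].dt.normalize())
    daily = df.groupby(
        ["date", "Load.Type", "Route.Number"], as_index=False
    ).agg(nb_loads=("Load.ID", "count"), daily_weight=("Load.Weight", "sum"))
    # 7) Merge in the population; a left join keeps days without population data.
    daily["year"] = daily["date"].dt.year
    return daily.merge(population, on="year", how="left")


daily = clean_and_aggregate(raw, population, kept_types={"BRUSH", "BULK"})
print(daily)
```

A left join is used in the last step so that days without a matching population figure are kept with a missing value rather than dropped.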

In [17]:
# Load the cleaned dataset (the cleaning steps are detailed in the appendix)
data = pd.read_csv('../data/clean_waste_data.csv')
data

| Unnamed: 0 | date | year | month | wday | Load.Type | Route.Type | Dropoff.Site | Route.Number | outlier | total_pop | annualised_growth | nb_loads | daily_weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-03 | 2005 | 1 | Mon | BRUSH | BRUSH | HORNSBY BEND | BR05 | normal | 700407.0 | 1.20% | 2 | 8000.0 |
| 1 | 2005-01-03 | 2005 | 1 | Mon | BRUSH | BRUSH | HORNSBY BEND | BR05 | outlier | 700407.0 | 1.20% | 1 | 9400.0 |
| 2 | 2005-01-03 | 2005 | 1 | Mon | BRUSH | BRUSH | HORNSBY BEND | BRPN01 | outlier | 700407.0 | 1.20% | 1 | 3140.0 |
| 3 | 2005-01-03 | 2005 | 1 | Mon | BRUSH | BRUSH | HORNSBY BEND | BRPS01 | normal | 700407.0 | 1.20% | 1 | 5640.0 |
| 4 | 2005-01-03 | 2005 | 1 | Mon | BULK | BULK | STEINER LANDFILL | BU27 | normal | 700407.0 | 1.20% | 1 | 4620.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 362972 | 2021-07-11 | 2021 | 7 | Sun | YARD TRIMMING | YARD TRIMMINGS | HORNSBY BEND | YM02 | normal | | | 2 | 33220.0 |
| 362973 | 2021-07-11 | 2021 | 7 | Sun | YARD TRIMMING | YARD TRIMMINGS | HORNSBY BEND | YM04 | normal | | | 1 | 19640.0 |
| 362974 | 2021-07-11 | 2021 | 7 | Sun | YARD TRIMMING | YARD TRIMMINGS | HORNSBY BEND | YM04 | outlier | | | 1 | 960.0 |
| 362975 | 2021-07-11 | 2021 | 7 | Sun | YARD TRIMMING | YARD TRIMMINGS | HORNSBY BEND | YM05 | normal | | | 1 | 19460.0 |
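
Because the file is read without any parsing options, `date` and `annualised_growth` (e.g. `1.20%`) come in as text. One possible way to convert them before modelling is sketched below, using two rows copied from the preview. It assumes the blank cells in the 2021 rows are missing values; `tidy_types`, `growth_rate` and `is_outlier` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the preview; blank cells are assumed to be missing values.
sample = pd.DataFrame({
    "date": ["2005-01-03", "2021-07-11"],
    "outlier": ["normal", "outlier"],
    "total_pop": [700407.0, None],
    "annualised_growth": ["1.20%", None],
    "daily_weight": [8000.0, 960.0],
})


def tidy_types(df):
    """Hypothetical helper: convert text columns into analysis-friendly types."""
    out = df.copy()
    # Parse ISO-formatted date strings into datetimes.
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # Turn "1.20%" into the fraction 0.012; missing values stay missing.
    out["growth_rate"] = (
        pd.to_numeric(out["annualised_growth"].str.rstrip("%"), errors="coerce") / 100
    )
    # Boolean flag derived from the isolation-forest label column.
    out["is_outlier"] = out["outlier"].eq("outlier")
    return out


tidy = tidy_types(sample)
print(tidy.dtypes)
```

The same helper could be applied to the full frame with `tidy_types(data)`.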


## 🖼️ Visualisations
Copy the most important visualisations here (graphs, charts, maps, images, etc.). You can refer to them in the Executive Summary.

Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.

## 👓 References
List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.

## ⏭️ Appendix
Add here any code, images or text that you still find relevant, but that was too long to include in the main report. This section is optional.
