![intro_cover](image_folder/intro_new.PNG)

# Table of Contents

* Business Pain Points
* Introduction to Geospatial Analysis
* Competition Dataset
* Q&A

# Business Pain Points

## Case Study:

As a new intern at a logistics company in London, you have been assigned by your supervisor to a new data science project that aims to reduce waste (time and fuel expenses) by <u>optimising the route of goods delivery</u>. The first step towards that goal is to develop a <u>**model for predicting travel time**</u> (which corresponds to monetary value) between two points in London.

![travelling_salesman_problem](image_folder/case_study1.jpg)
Credit: OptimoRoute (https://optimoroute.com/travelling-salesman-problem/)

## Objective

To perform data wrangling and develop a machine learning model for estimating the time required to travel between two points in a city.

![google_map](image_folder/case_study3.PNG)
Credit: Google Maps

# Introduction to Geospatial Analysis

## History of Geographic Information Systems (GIS)

|||
|:--:|:--:|
|**1854:** Paper-based mapping analysis of cholera clusters in London, England, by Dr. John Snow. The map showed that cholera was spreading through the water supply, not through the air as had been speculated.|![Dr. John](image_folder/John-Snow.PNG) ![cholera map](image_folder/Cholera-Map2.PNG)|
|**Before the 1960s, The GIS Dark Ages:** Physical maps, which created bottlenecks in analysis (e.g. area and distance measurements, coarse and inaccurate data) | ![physical map](image_folder/physical_map.PNG)|
|**1960 - 75, GIS Pioneering:** Advancements in technology, and the first computerized GIS, developed by Roger Tomlinson for the Canadian government| ![roger_tomlinson](image_folder/Roger-Tomlinson.PNG)|
|**1975 - 90, GIS Software Commercialization:** Development of the first computer map-making software by the Harvard Laboratory for Computer Graphics, and software commercialization by the consulting firm Environmental Systems Research Institute, Inc. (Esri)| ![GIS_commercial](image_folder/GIS-Software-Box.PNG)|
|**1990 - 2010, User Proliferation:** Cheaper, faster and more powerful computers, multiple software options, data availability, satellite launches and the integration of remote sensing technology enabled users to take full advantage of GIS and to recognize the importance of spatial analysis | ![interaction](image_folder/map-interaction.jpg)|
|**2010 - Now, The Open Source Explosion:** Even better technologies, GIS data becoming freely accessible worldwide, and more collaboration and commercialization of GIS products| ![open_source](image_folder/open-mapping.png)|

Credit: GIS Geography (https://gisgeography.com/history-of-gis/)

## Motivation behind Geospatial Analysis

<font size="10"><center>  **_"MOBILITY"_**  </center></font>

* **Location intelligence** - _Deriving insights from location_ (https://www.forbes.com/sites/louiscolumbus/2018/02/11/what-new-in-location-intelligence-for-2018/?sh=42acc8b114b5)
* **Location-based marketing** - _Evaluating catchment areas_ (https://www.simplybusiness.co.uk/knowledge/articles/2010/10/2010-10-25-what-starbucks-can-teach-your-business-about-location-based-marketing/)
* **Assessing accessibilities** - _Evaluating accessibility of an area_ (https://www.walkscore.com/methodology.shtml)
* **Travel Times** - _Goods transportation_

**Industry:** Geographic Information System (GIS)

**Examples close to us:**

* Navigation applications
* Ministry of Health Malaysia

**Other application areas include:**
* Telco
* Accident analysis
* Urban planning
* Environmental impact analysis
* and many more! (https://nobelsystemsblog.com/gis-data-business/, https://gisgeography.com/what-gis-geographic-information-systems/)

## Travel Time Prediction

* Storing a complete travel time matrix - _Not practical!_
* Creating a travel time prediction model - _From historical data_
    * Source location
    * Destination location
    * Date

* **Data preparation**
    * Subsetting
    * Null values
    * Feature Engineering (Logical & relevant)
    * Visualization will be helpful

* **Modelling**
    * Machine learning models
    * Assign random_state
    
* **Model Evaluation**
    * RMSE
    * Visualization will be helpful!
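
The evaluation step above can be sketched in plain Python. RMSE is the square root of the mean squared difference between the actual and predicted travel times. The travel times below are made up for illustration, and the helper names `rmse` and `seeded_split` are assumptions of this sketch rather than part of the competition material. The seeded split illustrates the same idea as assigning `random_state` in a machine learning library: fixing the seed makes the result reproducible.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def seeded_split(rows, validation_fraction=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split into train/validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Made-up mean travel times in seconds, for illustration only.
travel_times = [600.0, 900.0, 1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 2700.0]

train, validation = seeded_split(travel_times)

# Baseline "model": always predict the mean of the training labels.
baseline = sum(train) / len(train)
baseline_predictions = [baseline] * len(validation)

print(f"Baseline RMSE: {rmse(validation, baseline_predictions):.1f} seconds")
```

A constant baseline like this gives a reference RMSE that any trained model should be able to beat.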

|||
|:--:|:--:|
|![map](image_folder/map.PNG)|![visualization_of_error](image_folder/random_forest_scatterplot.PNG)|

# Competition Dataset

## Dataset Provided

1)	**london.json**: The boundaries of London in geospatial (GeoJSON) format, including the zone IDs used in the other datasets.

![geojson](image_folder/dataset1.PNG)

![zones](image_folder/dataset2.PNG)
Credit: GIS StackExchange (https://gis.stackexchange.com/questions/118223/merge-geojson-polygons-with-wgs84-coordinate)
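
One possible way to turn zone boundaries into model features is to reduce each zone to an approximate centre point. The sketch below uses only the standard library and a tiny inline stand-in for london.json. The property name `zone_id`, the coordinates, and the helper names are hypothetical, so inspect the real file for its actual keys. Averaging the vertices is a rough approximation, not a true area-weighted centroid.

```python
import json

# A tiny stand-in for london.json. The property name "zone_id" and the
# coordinates are hypothetical; check the real file for the actual keys.
sample_geojson = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"zone_id": "1"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 51.0], [0.2, 51.0], [0.2, 51.2], [0.0, 51.2], [0.0, 51.0]]]
      }
    }
  ]
}
"""


def outer_ring_points(geometry):
    """Collect the outer-ring vertices of a Polygon or MultiPolygon geometry."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        raise ValueError(f"Unsupported geometry type: {geometry['type']}")
    points = []
    for polygon in polygons:
        outer_ring = polygon[0]
        # GeoJSON rings repeat the first vertex at the end; skip the repeat.
        points.extend(outer_ring[:-1])
    return points


def approximate_centre(geometry):
    """Average the outer-ring vertices as a rough (longitude, latitude) centre."""
    points = outer_ring_points(geometry)
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lon, lat


collection = json.loads(sample_geojson)
zone_centres = {
    feature["properties"]["zone_id"]: approximate_centre(feature["geometry"])
    for feature in collection["features"]
}
print(zone_centres)
```

With a centre point per zone, features such as the straight-line distance between source and destination zones become straightforward to derive.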

2)	**training_WeeklyAggregate.csv**: Contains the arithmetic mean of travel times, aggregated over the first quarter of 2020, between randomly selected zone pairs in London.
-	sourceid – Source location ID as per london.json
-	dstid – Destination location ID as per london.json
-	dow – Day of the week, where 1: Monday, 2: Tuesday and so on.
-	mean_travel_time (label) – The average travel time, in seconds, for the shortest-distance car journey from the source location to the destination on the given day of the week.

3)	**testing_dataset.csv**: Same structure as the training dataset, but without labels.

4)	**sample_submission.csv**: An example of a submission file.
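
A first look at the training file might follow the sketch below, which assumes only the column layout described above. The rows are made up and read from an in-memory string so that the example runs on its own. The `is_weekend` and `mean_travel_time_min` columns are illustrative feature ideas, and the weekend rule assumes the day numbering continues to 7 for Sunday.

```python
import io

import pandas as pd

# Made-up rows that follow the column layout described above.
sample_csv = io.StringIO(
    "sourceid,dstid,dow,mean_travel_time\n"
    "1,2,1,600.0\n"
    "1,2,6,480.0\n"
    "2,3,7,\n"
    "3,1,3,900.0\n"
)

# In the notebook this would be pd.read_csv("training_WeeklyAggregate.csv").
df = pd.read_csv(sample_csv)

# Count missing values per column before deciding how to handle them.
print(df.isnull().sum())

# Here rows without a label are simply dropped; other strategies are possible.
df = df.dropna(subset=["mean_travel_time"]).copy()

# Assumption: dow runs from 1 (Monday) to 7 (Sunday), so 6 and 7 are the weekend.
df["is_weekend"] = df["dow"].isin([6, 7])

# Convert seconds to minutes for easier reading in plots.
df["mean_travel_time_min"] = df["mean_travel_time"] / 60

print(df)
```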


## Limitations of the Dataset

* Weekly-aggregated, with daily granularity
* Period: Q1, 2020
* Discrepancies with other web mapping platforms
* Small dataset (does not cover all source and destination pairs)
* Not date-specific
* Aggregated by zones

![zones](image_folder/dataset2.PNG)
Credit: GIS StackExchange (https://gis.stackexchange.com/questions/118223/merge-geojson-polygons-with-wgs84-coordinate)

Only the datasets provided may be used in this competition. No additional datasets are allowed.

You can submit your prediction results once per day throughout the competition period; the organizing team will inform you of the RMSE value within 24 hours. On the final submission day, you are expected to submit both your final working code (Jupyter notebook) and the final prediction CSV files.

The organizing team will inspect your code to ensure that your final submitted result is reproducible.

import pandas

# Q&A