![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Santa Visiting Homes in Strathcona County

There are a lot of homes in Strathcona County and Santa's internal GPS is malfunctioning. We [think](https://www.sciencealert.com/turns-out-we-have-no-idea-why-the-northern-lights-wreak-havoc-on-our-satellite-technology) the GPS interference is due to strong aurora borealis (Northern Lights) activity, which is a result of intense solar storms. Luckily, [Strathcona County's Open Data Portal](https://data.strathcona.ca/) includes location data for homes in the county.

That’s where you, as a data scientist, come in. You have the data, and you need to reprogram Santa's GPS to figure out how to visit homes in the county on Christmas Eve as efficiently as possible.

## Getting Ready

This section sets up many things behind the scenes that are required for the rest of this notebook. Most of the code cells in this section are ready to run, so you won't have to make any modifications. You don't need to understand every task carried out by the code cells in this section to complete the challenges. However, feel free to ask mentors about anything that makes you curious.

`▸Run` the cell below to download the required Python libraries. It may take a few minutes for the cell to finish running.

In [None]:
%pip install -q pyodide_http plotly haversine folium
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import plotly.express as px
import haversine as hs
import folium
from folium.plugins import FastMarkerCluster
print('Setup Complete')

## Getting Data About Homes

Next we will: 

1. Retrieve the data from the Strathcona County Open Data Portal.
2. Put the data in a dataframe named `home_data`. Think of a dataframe as a powerful spreadsheet.
3. Have a look at the data.

In [None]:
data_url = 'https://data.strathcona.ca/api/views/fdr6-tu3d/rows.csv?accessType=DOWNLOAD'
home_data = pd.read_csv(data_url)
home_data

We can also look at just the column names.

In [None]:
home_data.columns

### Data Cleaning

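You may notice that some rows have `NaN` (not a number) for their locations. We're going to remove these from our dataset. Note that `dropna()` removes every row that has a missing value in any column, not only in the location columns.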

In [None]:
home_data = home_data.dropna()
home_data
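To see what `dropna()` does, here is a minimal sketch on a tiny made-up dataframe. The `Notes` column and all of the values are hypothetical and are not part of the county data. It compares dropping rows with any missing value against dropping only rows that are missing a location, using the `subset` argument.

```python
import pandas as pd

# Toy data for illustration only; 'Notes' is a hypothetical column.
toy_homes = pd.DataFrame({
    'Latitude':  [53.53, None, 53.61],
    'Longitude': [-113.30, -113.25, -113.10],
    'Notes':     ['corner lot', 'no location', None],
})

# Default behaviour: drop a row if ANY column is missing.
dropped_any = toy_homes.dropna()

# Drop a row only if its location is missing.
dropped_location = toy_homes.dropna(subset=['Latitude', 'Longitude'])

print(len(dropped_any), 'row(s) left after dropna()')
print(len(dropped_location), 'row(s) left after dropna(subset=...)')
```

If you find that `dropna()` removes more homes than you expected, the `subset` form may be worth trying on `home_data`.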

## Visualizing Home Locations

Let's use folium to plot the home locations in our dataframe on an interactive map.

In [None]:
m = folium.Map(location=[53.5701, -113.0741], zoom_start=10)
m.add_child(FastMarkerCluster(home_data[['Latitude', 'Longitude']].values.tolist()))
display(m)

## Counting Homes

That's a lot of homes for Santa to visit, and this is just in Strathcona County. To find out how many homes are in the data set, we can use `.shape`, which gives the number of rows followed by the number of columns.

In [None]:
home_data.shape

## Calculating Travel Time

We can approximate Santa's travel time using the equation $t = \frac{d}{v}$ where $t$ is time, $d$ is distance, and $v$ is speed or velocity.

Start by assuming that Santa can travel close to the speed of sound, or about 300 meters per second, and that he spends about 30 seconds in each home. Feel free to change the default values for `flight_speed` and `time_per_home`.
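As a quick check of the formula, here is a small worked example for a single trip between two homes. The 6000-meter distance is a made-up value chosen for easy arithmetic; the speed and time per home are the defaults mentioned above.

```python
flight_speed = 300     # meters per second (default from the text)
time_per_home = 30     # seconds spent in the home (default from the text)
hop_distance = 6000    # meters; hypothetical distance to the next home

# t = d / v
travel_time = hop_distance / flight_speed

# flying time plus the stop inside the home
time_for_this_home = travel_time + time_per_home

print('Flying takes', travel_time, 'seconds')
print('Flying plus the visit takes', time_for_this_home, 'seconds')
```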

In [None]:
def calculate_required_time(travel_distance, flight_speed=300, time_per_home=30, number_of_homes=1):
    print('Santa will travel at', flight_speed, 'm/s and spend', time_per_home, 'seconds per home.')
    # flying time plus the time spent inside every home
    time_required = travel_distance / flight_speed + time_per_home * number_of_homes
    return time_required

total_distance = 0
previous_location = (53.5701, -113.0741) # starting from the middle of Strathcona County
for index, row in home_data.iterrows():
    current_location = (row['Latitude'], row['Longitude'])
    # great-circle distance in meters from the previous stop to this home
    travel_distance = hs.haversine(previous_location, current_location, unit='m')
    total_distance = total_distance + travel_distance
    previous_location = current_location

print(total_distance, 'meters')
required_time = calculate_required_time(total_distance, 300, 30, number_of_homes=len(home_data))
print(required_time, 'seconds required, which is about', required_time/3600, 'hours.')

That seems like a long time, more than a week. You can of course change the values passed to the `calculate_required_time` function so Santa travels faster or spends less time in each home.
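The distance for each step of the loop above comes from `hs.haversine`, which measures the great-circle distance between two latitude and longitude points. For the curious, here is a minimal plain-Python sketch of the haversine formula. It assumes a spherical Earth with a mean radius of 6,371 km, so its results may differ slightly from the library's. The function name `haversine_m` is made up for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters

def haversine_m(point_a, point_b):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

centre = (53.5701, -113.0741)            # starting point used in this notebook
one_degree_north = (54.5701, -113.0741)  # same longitude, one degree further north

print(haversine_m(centre, one_degree_north), 'meters')
```

One degree of latitude works out to roughly 111 km, which is a handy way to sanity-check distances on the map.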

## Visualizing the Path

A better way to decrease the travel time, though, would be to visit homes in an optimal order. We will visualize the path using Plotly Express. Right now we just have Santa visiting homes in the order they are listed in the data.

In [None]:
px.line(home_data, x='Longitude', y='Latitude')

Looking at just the first 50 homes with `home_data.head(50)`, we can see that this is not an efficient path.

In [None]:
px.line(home_data.head(50), x='Longitude', y='Latitude')

### Travelling Salesman Problem

Optimizing Santa's travel path is a version of the classic [travelling salesman problem](https://simple.wikipedia.org/wiki/Travelling_salesman_problem), which is a very hard mathematical problem to solve.

No efficient method for finding the best possible route is known. The problem is closely tied to the P vs NP question, and there is a [$1,000,000 prize](http://www.claymath.org/millennium-problems/p-vs-np-problem) available to anyone who resolves that question.
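Even without a perfect answer, simple rules of thumb can shorten a route. The following is a minimal sketch of one such rule, the nearest-neighbour heuristic: always fly to the closest home that has not been visited yet. This notebook does not use it, and it does not guarantee the best route. The toy points, the straight-line distances, and the function names are all made up for illustration.

```python
import math

def nearest_neighbour_order(points, start=0):
    """Greedy heuristic: from each stop, move to the closest unvisited point."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nearest = min(unvisited, key=lambda i: math.dist(last, points[i]))
        order.append(nearest)
        unvisited.remove(nearest)
    return order

def path_length(points, order):
    """Total straight-line length of a path that visits points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

# Toy (x, y) points, not real home locations.
toy_points = [(0, 0), (5, 5), (1, 0), (6, 5), (2, 0)]

listed_order = list(range(len(toy_points)))
greedy_order = nearest_neighbour_order(toy_points)

print('Listed order length:', path_length(toy_points, listed_order))
print('Greedy order:', greedy_order)
print('Greedy order length:', path_length(toy_points, greedy_order))
```

On these toy points the greedy route is much shorter than the listed order, but on other data it can still be far from the best possible route.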

### Filtering Data

Assuming that you haven't solved the travelling salesman problem already, we'll try to optimize Santa's route by eliminating some homes. Let's see what data categories are available to us in our `home_data` dataframe:

In [None]:
home_data.columns

There are some columns that might be interesting for our purposes, such as `'FIREPL'` (fireplace). To get only the homes that have a fireplace, you can run the following cell to create a new dataframe called `home_data_filtered`.

In [None]:
condition = home_data['FIREPL']==True
home_data_filtered = home_data[condition]
home_data_filtered.shape

You can also combine two conditions like this. Try it yourself:

---

`condition1 = home_data['FIREPL'] == True`

`condition2 = home_data['2021 Assessment'] < 1000000`

`home_data_filtered = home_data[(condition1) & (condition2)]`

---

`&` means **and**

`|` means **or**
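To see the difference between `&` and `|`, here is a minimal sketch on a tiny made-up dataframe. It borrows the `FIREPL` and `2021 Assessment` column names from above, but the values are invented and the real columns may be stored differently.

```python
import pandas as pd

# Toy values for illustration only.
toy_homes = pd.DataFrame({
    'FIREPL': [True, True, False, False],
    '2021 Assessment': [450000, 1200000, 380000, 2500000],
})

has_fireplace = toy_homes['FIREPL'] == True
under_million = toy_homes['2021 Assessment'] < 1000000

# & keeps rows where BOTH conditions are true
both = toy_homes[has_fireplace & under_million]

# | keeps rows where AT LEAST ONE condition is true
either = toy_homes[has_fireplace | under_million]

print(len(both), 'home(s) match both conditions')
print(len(either), 'home(s) match at least one condition')
```

Remember to wrap each condition in parentheses if you write the comparisons inline, since `&` and `|` are evaluated before `==` and `<`.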

### Sorting Data

Ordering the data by longitude might also help. Notice that we create a new dataframe called `home_data_sorted`.

`home_data_sorted = home_data_filtered.sort_values(by=['Longitude'])`

In [None]:
home_data_sorted = home_data_filtered.sort_values(by=['Longitude'])
px.line(home_data_sorted, x='Longitude', y='Latitude')
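As a rough check of whether sorting helps, here is a minimal sketch that compares path lengths before and after sorting a few made-up points. It measures straight-line steps in degrees, which is only a stand-in for real distance. The helper name `rough_path_length` and the coordinates are hypothetical.

```python
import math
import pandas as pd

# Toy coordinates for illustration only.
toy_homes = pd.DataFrame({
    'Latitude':  [53.50, 53.60, 53.52, 53.58],
    'Longitude': [-113.30, -113.10, -113.25, -113.15],
})

def rough_path_length(df):
    """Sum of straight-line steps (in degrees) when visiting rows in order."""
    coords = list(zip(df['Longitude'], df['Latitude']))
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

# sort_values returns a new dataframe and leaves toy_homes unchanged
toy_sorted = toy_homes.sort_values(by=['Longitude'])

print('Listed order:', rough_path_length(toy_homes))
print('Sorted by longitude:', rough_path_length(toy_sorted))
```

Sorting on one coordinate shortens the path here, but Santa can still zigzag north and south between homes that share a similar longitude.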

## Graphing Data

Here is an example graph to help you visualize the data and make decisions about which homes to include in Santa's route.

In [None]:
px.scatter(home_data, x='BLDG', y='2021 Assessment', color='FIREPL')

You can continue your own analysis in the [next notebook](santa-challenge.ipynb).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)