In [1]:
from geopy.distance import geodesic
from scipy.spatial import distance

import networkx as nx
import numpy as np
import pandas as pd
import random
import time
import warnings
warnings.filterwarnings('ignore')

## Table of Contents
[Brief Description of Model](#1)

[Initial Data and Setting "User-defined" Variables](#2)

[Random Route (Baseline Model)](#3)

[Optimal Route](#4)

[Find Nearest Attraction](#5)

[Road Trip Example](#6)
- [Random Route Example](#7)
- [Optimal Route Example](#8)
- [Nearest Attraction Example](#9)

[Conclusion](#10)

## Brief Description of Model <a id=1></a>

In a nutshell, the model does the following:

- Approximates the shortest round trip through every city on a road trip itinerary, using straight-line (geodesic) distances between cities, and estimates how long that trip would take.
- Compares that route against a randomly ordered route through the same cities, which serves as the baseline.
- Determines which city on the itinerary is closest to one of the ten popular attractions listed in the first notebook, and the approximate distance between the two.

## Initial Data and Setting "User-defined" Variables <a id=2></a>

In [2]:
california_cities = pd.read_csv('cal_cities_lat_long.csv')

california_cities.columns = ['city', 'latitude', 'longitude']

california_cities['coords'] = list(zip(california_cities['latitude'], california_cities['longitude']))

During a road trip, travelers will want to take breaks from time to time. With more time to work on this project, we would have the website ask users how often they expect to take a break and how long each break would last. We might also acquire traffic data to help the model estimate more accurately how many miles per hour a user can expect to travel.

Given the time constraints of this project, our prototype assumes that a user takes a 10-minute break after every 2 hours of driving. It also assumes an average speed of approximately 55 miles per hour, based on our guess at the typical speed limit.

In [3]:
break_interval = 120
break_time = 10

break_percentage = break_time / break_interval
average_speed = 55

In [4]:
def calculate_travel_time(total_distance):
    est_travel_time = (total_distance / average_speed) * (1 + break_percentage)
    return est_travel_time
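
To make the formula concrete, here is a small worked sketch. The helper names `proportional_travel_time` and `discrete_break_travel_time` are hypothetical and not part of the notebook. The first mirrors `calculate_travel_time`; the second is an assumed alternative that counts whole breaks instead of spreading break time evenly, shown only for comparison.

```python
import math

AVERAGE_SPEED_MPH = 55       # same assumption as average_speed above
BREAK_INTERVAL_MIN = 120     # minutes of driving between breaks
BREAK_TIME_MIN = 10          # minutes per break


def proportional_travel_time(total_distance):
    """Hours of travel when break time is spread evenly over the drive.

    Mirrors calculate_travel_time: every hour of driving costs an extra
    10/120 of an hour in breaks.
    """
    driving_hours = total_distance / AVERAGE_SPEED_MPH
    return driving_hours * (1 + BREAK_TIME_MIN / BREAK_INTERVAL_MIN)


def discrete_break_travel_time(total_distance):
    """Hours of travel when breaks are counted as whole 10-minute stops.

    Hypothetical variant: a break is taken after each full 2 hours of
    driving, except when the trip ends exactly at that point.
    """
    driving_minutes = total_distance / AVERAGE_SPEED_MPH * 60
    breaks = max(math.ceil(driving_minutes / BREAK_INTERVAL_MIN) - 1, 0)
    return (driving_minutes + breaks * BREAK_TIME_MIN) / 60


# 110 miles is exactly 2 hours of driving at 55 mph.
print(round(proportional_travel_time(110), 2))    # 2.17
print(round(discrete_break_travel_time(110), 2))  # 2.0
```

The proportional estimate is never lower than the whole-break count, so it errs slightly on the long side for short legs while staying simpler to compute.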

In [5]:
california_cities

|  | city | latitude | longitude | coords |
|---|---|---|---|---|
| 0 | Adelanto | 34.582769 | -117.409214 | (34.582769, -117.409214) |
| 1 | Agoura Hills | 34.153339 | -118.761675 | (34.153339, -118.761675) |
| 2 | Alameda | 37.765206 | -122.241636 | (37.765206, -122.241636) |
| 3 | Albany | 37.886869 | -122.297747 | (37.886869, -122.297747) |
| 4 | Alhambra | 34.095286 | -118.127014 | (34.095286, -118.127014) |
| ... | ... | ... | ... | ... |
| 454 | Woodland | 38.678517 | -121.773297 | (38.678517, -121.773297) |
| 455 | Yorba Linda | 33.888625 | -117.813111 | (33.888625, -117.813111) |
| 456 | Yreka | 41.735419 | -122.634472 | (41.735419, -122.634472) |
| 457 | Yuba City | 39.140447 | -121.616911 | (39.140447, -121.616911) |


The next cell takes a random sample of 10 cities to simulate a road trip being planned.

In [6]:
cities_sample = california_cities.sample(n=10)
coordinates = list(zip(cities_sample['latitude'], cities_sample['longitude']))

## Random Route (Baseline Model) <a id=3></a>

A good way to assess the quality of the model is to compare it to a different model that chooses the route of the road trip randomly.

In [7]:
def calculate_random_route(route_coords):
    start_time = time.time()
    df = route_coords

    random_route = df.sample(frac=1).reset_index(drop=True)
    random_route = pd.concat([random_route, random_route.iloc[[0]]], ignore_index=True)

    def calculate_distance(coord1, coord2):
        return geodesic(coord1, coord2).miles

    distances = []
    for i in range(len(random_route) - 1):
        dist = calculate_distance((random_route.loc[i, 'latitude'], random_route.loc[i, 'longitude']),
                                (random_route.loc[i+1, 'latitude'], random_route.loc[i+1, 'longitude']))
        distances.append(dist)

    rolling_distances = pd.Series(distances).cumsum()

    # Record each leg's destination, then drop the duplicated start row so that
    # every remaining row describes exactly one leg of the round trip.
    random_route['next_city'] = random_route['city'].shift(-1)
    random_route = random_route.iloc[:-1].copy()
    random_route['coords'] = list(zip(random_route['latitude'], random_route['longitude']))
    random_route['distance_to_next'] = distances
    random_route['segment_travel_time'] = random_route['distance_to_next'].apply(calculate_travel_time)
    random_route['rolling_travel_time'] = random_route['segment_travel_time'].cumsum()
    # Distance already covered on arrival at each city (0 at the start).
    random_route['total_rolling_distance'] = [0] + rolling_distances.tolist()[:-1]

    # Copy the selected columns so the rounding below does not write to a view.
    final_df = random_route[['city',
                             'coords',
                             'next_city',
                             'distance_to_next',
                             'total_rolling_distance',
                             'segment_travel_time',
                             'rolling_travel_time']].copy()
    final_df['distance_to_next'] = final_df['distance_to_next'].round(0)
    final_df['total_rolling_distance'] = final_df['total_rolling_distance'].round(0)
    final_df['segment_travel_time'] = final_df['segment_travel_time'].round(2)
    final_df['rolling_travel_time'] = final_df['rolling_travel_time'].round(2)

    
    print('Random Road Trip Route')

    display(final_df)

    print(f'Total Distance: {round(rolling_distances.iloc[-1])} miles')
    print(f'Estimated Travel Time: {round(calculate_travel_time(rolling_distances.iloc[-1]))} hours')
    end_time = time.time()

    # final_df.to_json('random_route_5.json', orient='records', lines=True)

    print(f'Runtime for Random Route: {round(end_time - start_time, 2)} seconds')
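
A single shuffle is a noisy baseline, since one draw may happen to be unusually short or long. A rough sketch of how the baseline could be averaged over many shuffles follows. It uses only the standard library, so `haversine_miles` (a spherical great-circle distance) stands in for geopy's `geodesic` as an assumption and will differ from it slightly. The function names and the trial count are hypothetical; the coordinates are five of the cities from the example itinerary later in this notebook.

```python
import math
import random

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation


def haversine_miles(coord1, coord2):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, coord1)
    lat2, lon2 = map(math.radians, coord2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def round_trip_miles(route):
    """Total length of the route, including the leg back to the first stop."""
    return sum(haversine_miles(route[i], route[(i + 1) % len(route)])
               for i in range(len(route)))


def mean_random_baseline(coords, trials=1000, seed=0):
    """Average round-trip length over `trials` random visiting orders."""
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    total = 0.0
    for _ in range(trials):
        shuffled = coords[:]
        rng.shuffle(shuffled)
        total += round_trip_miles(shuffled)
    return total / trials


sample_coords = [
    (36.596339, -119.450403),  # Reedley
    (34.116119, -118.150350),  # South Pasadena
    (35.489417, -120.670725),  # Atascadero
    (40.598186, -124.157275),  # Fortuna
    (37.123000, -120.260175),  # Chowchilla
]
print(f'Mean random round trip: {mean_random_baseline(sample_coords):.0f} miles')
```

Comparing the optimized route against this average, rather than against one shuffle, makes the reported savings less dependent on luck.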

## Optimal Route <a id=4></a>

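Note that both the optimal route function and the random route function calculate not only the distance and time needed to travel from one city to the next, but also the rolling (cumulative) distance and time. The route itself comes from NetworkX's `traveling_salesman_problem` approximation, so it is near-optimal rather than guaranteed to be the shortest possible.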

In [8]:
def calculate_optimal_route(df):
    coords_sample = df
    start_time = time.time()
    def calculate_distance_matrix(coords_sample):
        num_coords = len(coords_sample)
        distance_matrix = [[0] * num_coords for _ in range(num_coords)]

        for i in range(num_coords):
            for j in range(i + 1, num_coords):
                distance = geodesic(coords_sample[i], coords_sample[j]).miles
                distance_matrix[i][j] = distance
                distance_matrix[j][i] = distance
                
        return distance_matrix

    distance_matrix = calculate_distance_matrix(coords_sample)

    G = nx.Graph()

    # add nodes
    for i, coord in enumerate(coords_sample):
        G.add_node(i, pos=coord)

    # add edges with weights (the graph is undirected, so one call per pair is enough)
    for i in range(len(coords_sample)):
        for j in range(i + 1, len(coords_sample)):
            G.add_edge(i, j, weight=distance_matrix[i][j])

    tsp_path = nx.approximation.traveling_salesman_problem(G, weight='weight')


    total_distance = 0
    segment_distances = [0]
    for i in range(len(tsp_path) - 1):
        distance_traveled = G[tsp_path[i]][tsp_path[i + 1]]['weight']
        segment_distances.append(distance_traveled)
        total_distance += distance_traveled
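    # traveling_salesman_problem returns a closed cycle by default (the last node
    # repeats the first), so the loop above already includes the return leg.
    # This guard only adds a closing leg if the returned path is open.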
    if G.has_edge(tsp_path[-1], tsp_path[0]):
        distance_traveled = G[tsp_path[-1]][tsp_path[0]]['weight']
        segment_distances.append(distance_traveled)
        total_distance += distance_traveled

    # Map node indices to location names
    route_coords = [coords_sample[node] for node in tsp_path]
    route_names = []
    for coord in route_coords:
        city_name = california_cities[california_cities['coords'] == coord]['city'].values[0]
        route_names.append(city_name)


    segment_travel_times = [calculate_travel_time(dist) for dist in segment_distances]
    rolling_travel_time = np.cumsum(segment_travel_times)
    rolling_distance = np.cumsum(segment_distances)

    optimal_route = pd.DataFrame({'city': route_names,
                                'coordinates': route_coords,
                                'segment_distance': segment_distances,
                                'rolling_distance': rolling_distance,
                                'segment_travel_time': segment_travel_times,
                                'rolling_travel_time': rolling_travel_time
                                })

    optimal_route['segment_distance'] = optimal_route['segment_distance'].round()
    optimal_route['rolling_distance'] = optimal_route['rolling_distance'].round()
    optimal_route['segment_travel_time'] = optimal_route['segment_travel_time'].round(2)
    optimal_route['rolling_travel_time'] = optimal_route['rolling_travel_time'].round(2)

    # Output the route and total distance
    print('Optimal Road Trip Route')
    display(optimal_route)
    print(f'Total Distance: {round(total_distance)} miles')

    print('Estimated Travel Time:', round(calculate_travel_time(total_distance)), 'hours')
    end_time = time.time()

    # optimal_route.to_json('optimal_route_5.json', orient='records', lines=True)
    print(f'Runtime for Optimized Route: {round(end_time - start_time, 2)} seconds')
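
Because `traveling_salesman_problem` is a heuristic, its route is an approximation rather than a guaranteed optimum. For small itineraries, an exhaustive search can show how close it comes. The sketch below is an assumed, standard-library-only check and not part of the notebook: `brute_force_route` and `route_length` are hypothetical helpers, and the toy distance matrix is made up for illustration. It fixes city 0 as the start and tries every ordering of the rest, which is only practical up to roughly ten cities.

```python
from itertools import permutations


def route_length(order, distance_matrix):
    """Length of the closed tour that visits `order` and returns to its start."""
    return sum(distance_matrix[order[i]][order[(i + 1) % len(order)]]
               for i in range(len(order)))


def brute_force_route(distance_matrix):
    """Return (best_order, best_length) over all tours starting at city 0."""
    n = len(distance_matrix)
    best_order, best_length = None, float('inf')
    for rest in permutations(range(1, n)):
        order = (0,) + rest
        length = route_length(order, distance_matrix)
        if length < best_length:
            best_order, best_length = order, length
    return list(best_order), best_length


# Toy symmetric matrix (miles) for four cities; values are illustrative only.
toy_matrix = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
order, length = brute_force_route(toy_matrix)
print(order, length)  # [0, 1, 3, 2] 80
```

Passing in the distance matrix built inside `calculate_optimal_route` instead of `toy_matrix` would give a reference length to compare with `total_distance`; for ten cities that means 9! = 362,880 candidate tours.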

## Find Nearest Attraction <a id=5></a>

As noted at the end of the first notebook, the model also finds which of ten popular attractions lies closest to any city on a user's road trip itinerary.

With more time, one improvement would be to report the closest itinerary city for <u>EACH</u> of the ten attractions. Users would then get a fuller picture of which attractions they might have time to visit.

In [9]:
attractions_names = ['Alcatraz Island', 'Balboa Park', 'Disneyland', 'Hearst Castle', 'Joshua Tree National Park',
                     'Legoland California', 'Malibu Beach', 'Redwoods National Park', 'Universal Studios Hollywood', 'Yosemite National Park']

attractions_latitudes = [37.8265991, 32.7325629, 33.8120294, 35.6852218, 33.8875175, 33.1264746, 34.0318786, 37.8488593, 34.1373322, 37.6727756]
attractions_longitudes = [-122.4228001, -117.1472597, -117.9190063, -121.1679822, -115.8082581, -117.3113757, -118.6880654, -119.5570877, -118.3532224, -119.7282411]

attractions_df = pd.DataFrame({'attraction': attractions_names, 'latitude': attractions_latitudes, 'longitude': attractions_longitudes})

display(attractions_df)

|  | attraction | latitude | longitude |
|---|---|---|---|
| 0 | Alcatraz Island | 37.826599 | -122.4228 |
| 1 | Balboa Park | 32.732563 | -117.14726 |
| 2 | Disneyland | 33.812029 | -117.919006 |
| 3 | Hearst Castle | 35.685222 | -121.167982 |
| 4 | Joshua Tree National Park | 33.887518 | -115.808258 |
| 5 | Legoland California | 33.126475 | -117.311376 |
| 6 | Malibu Beach | 34.031879 | -118.688065 |
| 7 | Redwoods National Park | 37.848859 | -119.557088 |
| 8 | Universal Studios Hollywood | 34.137332 | -118.353222 |
| 9 | Yosemite National Park | 37.672776 | -119.728241 |


In [10]:
def calculate_nearest_attractions(df):
    start_time = time.time()
    coords_sample_df = df
    # beginning with an empty distance matrix
    distance_matrix_interesting = np.zeros((len(coords_sample_df), len(attractions_df)))

    # iterating through the coords_sample_df (sample of city coordinates, simulating a planned trip)
    # and the attractions_df to find the distance between all locations
    for i in range(len(coords_sample_df)):
        for j in range(len(attractions_df)):
            city_coords = (coords_sample_df.iloc[i]['latitude'], coords_sample_df.iloc[i]['longitude'])
            place_coords = (attractions_df.iloc[j]['latitude'], attractions_df.iloc[j]['longitude'])
            # using  geopy.distance.geodesic to calculate the distance between lat/lon in miles
            distance_matrix_interesting[i, j] = geodesic(city_coords, place_coords).miles

    # creating the dataframe of the distance matrix for easier inspection
    attraction_distance_df = pd.DataFrame(distance_matrix_interesting, index=coords_sample_df['city'], columns=attractions_df['attraction'])

    display(attraction_distance_df.round())

    # when axis=None is specified in dataframe.min, it finds the lowest value in both axes
    best_match = attraction_distance_df.min(axis=None)

    # function to iterate through the provided dataframe looking for the provided value
    def search_dataframe(df, value):
        found_item = []
        for column in df.columns:
            for index in df.index:
                if df.loc[index, column] == value:
                    found_item.append((column, index, value))
                    print('The closest attraction on the trip is', column)
                    print(f'It is approximately {round(value)} miles away from {index}')
        if not found_item:
            print('Value not found in dataframe')

    search_dataframe(attraction_distance_df, best_match)
    end_time = time.time()
    print(f'Runtime for Attraction Search: {round(end_time - start_time, 2)} seconds')
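
The improvement suggested before this function, reporting the closest itinerary city for each attraction, could be sketched as follows. This assumes the city-to-attraction distances have already been computed. The `nearest_city_per_attraction` helper is hypothetical, and the distances are a subset of the rounded values from the example table later in this notebook.

```python
def nearest_city_per_attraction(distances):
    """Map each attraction to its (closest_city, miles) pair.

    `distances` is a nested dict: distances[city][attraction] -> miles.
    """
    nearest = {}
    for city, row in distances.items():
        for attraction, miles in row.items():
            # Keep the city with the smallest distance seen so far.
            if attraction not in nearest or miles < nearest[attraction][1]:
                nearest[attraction] = (city, miles)
    return nearest


sample_distances = {
    'Reedley': {'Disneyland': 211, 'Hearst Castle': 115, 'Yosemite National Park': 76},
    'South Pasadena': {'Disneyland': 25, 'Hearst Castle': 203, 'Yosemite National Park': 261},
    'Atascadero': {'Disneyland': 195, 'Hearst Castle': 31, 'Yosemite National Park': 159},
}

for attraction, (city, miles) in nearest_city_per_attraction(sample_distances).items():
    print(f'{attraction}: {city} ({miles} miles)')
```

With the `attraction_distance_df` built inside `calculate_nearest_attractions`, calling `attraction_distance_df.idxmin()` together with `attraction_distance_df.min()` should give the same per-attraction answer without the explicit loop.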

## Road Trip Example <a id=6></a>

Let's now pretend that a road trip has been planned. Suppose we want to visit the following ten cities:

In [11]:
cities_sample['city']

328            Reedley
400     South Pasadena
224             Lomita
375        Santa Maria
18          Atascadero
177              Huron
235    Manhattan Beach
458            Yucaipa
138            Fortuna
70          Chowchilla
Name: city, dtype: object

Now that we have our cities picked, we will throw caution to the wind and pick a route at random!

Afterwards, we will optimize that route in order to minimize the distance traveled. 

Additionally, we will try to find the closest of our top 10 favorite attractions to any of the cities we will be in.

### Random Route Example <a id=7></a>

In [12]:
calculate_random_route(cities_sample)

Random Road Trip Route


|  | city | coords | next_city | distance_to_next | total_rolling_distance | segment_travel_time | rolling_travel_time |
|---|---|---|---|---|---|---|---|
| 0 | Atascadero | (35.489417, -120.670725) | Lomita | 178.0 | 0.0 | 3.51 | 3.51 |
| 1 | Lomita | (33.792239, -118.315072) | Chowchilla | 254.0 | 178.0 | 5.01 | 8.52 |
| 2 | Chowchilla | (37.123, -120.260175) | Fortuna | 319.0 | 432.0 | 6.28 | 14.8 |
| 3 | Fortuna | (40.598186, -124.157275) | Manhattan Beach | 561.0 | 751.0 | 11.04 | 25.84 |
| 4 | Manhattan Beach | (33.884736, -118.410908) | Huron | 186.0 | 1312.0 | 3.67 | 29.51 |
| 5 | Huron | (36.202731, -120.102917) | Yucaipa | 229.0 | 1498.0 | 4.51 | 34.02 |
| 6 | Yucaipa | (34.033625, -117.043086) | Santa Maria | 204.0 | 1727.0 | 4.01 | 38.03 |
| 7 | Santa Maria | (34.953033, -120.435719) | Reedley | 126.0 | 1931.0 | 2.48 | 40.52 |
| 8 | Reedley | (36.596339, -119.450403) | South Pasadena | 186.0 | 2057.0 | 3.67 | 44.18 |
| 9 | South Pasadena | (34.116119, -118.15035) | Atascadero | 172.0 | 2243.0 | 3.38 | 47.56 |


Total Distance: 2415 miles
Estimated Travel Time: 48 hours
Runtime for Random Route: 0.03 seconds


### Optimal Route Example <a id=8></a>

In [13]:
calculate_optimal_route(coordinates)

Optimal Road Trip Route


|  | city | coordinates | segment_distance | rolling_distance | segment_travel_time | rolling_travel_time |
|---|---|---|---|---|---|---|
| 0 | Reedley | (36.596339, -119.450403) | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Chowchilla | (37.123, -120.260175) | 58.0 | 58.0 | 1.14 | 1.14 |
| 2 | Fortuna | (40.598186, -124.157275) | 319.0 | 376.0 | 6.28 | 7.42 |
| 3 | Yucaipa | (34.033625, -117.043086) | 598.0 | 975.0 | 11.78 | 19.2 |
| 4 | South Pasadena | (34.116119, -118.15035) | 64.0 | 1039.0 | 1.26 | 20.46 |
| 5 | Manhattan Beach | (33.884736, -118.410908) | 22.0 | 1060.0 | 0.43 | 20.89 |
| 6 | Lomita | (33.792239, -118.315072) | 8.0 | 1069.0 | 0.17 | 21.05 |
| 7 | Santa Maria | (34.953033, -120.435719) | 145.0 | 1214.0 | 2.86 | 23.91 |
| 8 | Atascadero | (35.489417, -120.670725) | 39.0 | 1253.0 | 0.77 | 24.69 |
| 9 | Huron | (36.202731, -120.102917) | 59.0 | 1312.0 | 1.15 | 25.84 |


Total Distance: 1357 miles
Estimated Travel Time: 27 hours
Runtime for Optimized Route: 0.02 seconds


### Nearest Attraction Example <a id=9></a>

In [14]:
calculate_nearest_attractions(cities_sample)

| city | Alcatraz Island | Balboa Park | Disneyland | Hearst Castle | Joshua Tree National Park | Legoland California | Malibu Beach | Redwoods National Park | Universal Studios Hollywood | Yosemite National Park |
|---|---|---|---|---|---|---|---|---|---|---|
| Reedley | 185.0 | 297.0 | 211.0 | 115.0 | 278.0 | 268.0 | 182.0 | 87.0 | 180.0 | 76.0 |
| South Pasadena | 350.0 | 112.0 | 25.0 | 203.0 | 135.0 | 84.0 | 31.0 | 269.0 | 12.0 | 261.0 |
| Lomita | 361.0 | 100.0 | 23.0 | 208.0 | 144.0 | 74.0 | 27.0 | 288.0 | 24.0 | 279.0 |
| Santa Maria | 227.0 | 243.0 | 164.0 | 65.0 | 274.0 | 219.0 | 118.0 | 206.0 | 131.0 | 192.0 |
| Atascadero | 188.0 | 277.0 | 195.0 | 31.0 | 298.0 | 252.0 | 151.0 | 174.0 | 161.0 | 159.0 |
| Huron | 170.0 | 293.0 | 206.0 | 70.0 | 291.0 | 265.0 | 170.0 | 117.0 | 173.0 | 103.0 |
| Manhattan Beach | 353.0 | 108.0 | 29.0 | 200.0 | 150.0 | 82.0 | 19.0 | 281.0 | 18.0 | 271.0 |
| Yucaipa | 399.0 | 90.0 | 53.0 | 261.0 | 72.0 | 64.0 | 94.0 | 298.0 | 75.0 | 293.0 |
| Fortuna | 213.0 | 667.0 | 581.0 | 376.0 | 652.0 | 639.0 | 544.0 | 311.0 | 548.0 | 312.0 |
| Chowchilla | 128.0 | 350.0 | 264.0 | 111.0 | 336.0 | 322.0 | 231.0 | 63.0 | 232.0 | 48.0 |


The closest attraction on the trip is Universal Studios Hollywood
It is approximately 12 miles away from South Pasadena
Runtime for Attraction Search: 0.03 seconds


Wonderful! Not only does the model work, but its runtime is also extremely short!

The run shown above can be summarized as follows:

A random route to the ten cities and back covers 2415 miles, which would take about 48 hours (including breaks)!

When we used the optimizer to approximate the shortest route, we cut travel distance and time by about 44%! We ended up with 1357 miles traveled across 27 hours.

One of the ten cities, South Pasadena, is only about 12 miles away from Universal Studios Hollywood! That is definitely worth the short detour!

## Conclusion <a id=10></a>

Overall, the team thinks the prototype has the potential to grow into a fantastic tool for road trip enthusiasts! In addition to the improvements already mentioned throughout the notebook, it would be interesting to investigate what supplemental data the website could provide about each city on a road trip itinerary.

So far we have the violent crime and Chipotle data, and it might be nice to add data on the cost of accommodation (e.g. hotel, Airbnb) in each city. This data would help users make informed decisions about where to stay overnight. Also, more restaurant data wouldn't hurt! Perhaps data about the locations of both cheap fast-food restaurants and high-end restaurants?