# Best Summer Sightseeing Roadtrip Project

# Introduction

Traveling across the United States offers a wealth of iconic landmarks and breathtaking natural attractions, but planning an efficient route between multiple destinations can be a real challenge. This project aims to simplify that process by using data to build an optimized travel route across 10 of the most popular U.S. landmarks. The first objective was to establish a baseline route by visiting the destinations in random order. The total travel distance of this baseline served as the reference point for optimization. An optimized route was then generated with a nearest-neighbor heuristic that greedily reduces the overall distance. This is a heuristic rather than a machine learning model, so the result is not guaranteed to be the shortest possible route.

The dataset was manually compiled to support a road trip planning project focused on identifying and optimizing travel routes between popular destinations. It includes information on 10 well-known landmarks and natural attractions across the United States. For each destination, the dataset records the name, state, latitude, longitude, and category (e.g., National Park or Natural Wonder), for a total of 10 rows and 5 columns.

As the dataset was curated manually and is relatively small in size, no preprocessing or data cleaning was necessary. Additional sections of this report provide detailed descriptions of the modeling approach, visualizations used to support the analysis, and key conclusions drawn from the project.

In [1]:
!pip install geopy






In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import random
import plotly.express as px 
import plotly.graph_objects as go

# Import geopy for distance calculations
from geopy.distance import geodesic

## Data Preprocessing

In [3]:
# Load the dataset (the default read_csv settings are sufficient for this file)
data = pd.read_csv('data.csv')

# Display the full dataset (it has only 10 rows)
display(data)

Unnamed: 0,name,state,latitude,longitude,category
0,Grand Canyon National Park,Arizona,36.11,-112.11,National Park
1,Yosemite National Park,California,37.87,-119.54,National Park
2,Yellowstone National Park,Wyoming,44.43,-110.59,National Park
3,Zion National Park,Utah,37.3,-113.03,National Park
4,Mount Rushmore,South Dakota,43.88,-103.46,Monument
5,Great Smoky Mountains,Tennessee,35.65,-83.51,Natural Wonder
6,Statue of Liberty,New York,40.69,-74.04,Historical Landmark
7,Arches National Park,Utah,38.73,-109.59,National Park
8,Niagara Falls,New York,43.1,-79.04,Natural Wonder
9,Golden Gate Bridge,California,37.82,-122.48,Landmark


In [4]:
# Determining the size of the DataFrame
n_rows, n_cols = data.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns.")

The DataFrame has 10 rows and 5 columns.


In [5]:
# Display informative summary of the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       10 non-null     object 
 1   state      10 non-null     object 
 2   latitude   10 non-null     float64
 3   longitude  10 non-null     float64
 4   category   10 non-null     object 
dtypes: float64(2), object(3)
memory usage: 532.0+ bytes


In [6]:
# Display descriptive statistics of the DataFrame
display(data.describe())

Unnamed: 0,latitude,longitude
count,10.0,10.0
mean,39.558,-102.739
std,3.249933,17.415909
min,35.65,-122.48
25%,37.43,-112.8
50%,38.3,-110.09
75%,42.4975,-88.4975
max,44.43,-74.04


In [7]:
# Extracting relevant columns for location data
locations = data[['name', 'latitude', 'longitude']]

## Modeling

### First Approach

#### Baseline Route

In [8]:
# Create random route
random_route = locations.sample(frac=1, random_state=7).reset_index(drop=True)

In [9]:
# Compute total distance
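def total_distance(route):
    """Sum the geodesic length (km) of consecutive legs; no return leg to the start."""
    distance = 0
    for i in range(len(route) - 1):
        # Label-based .loc works here because random_route has a fresh 0..n-1 index
        start = (route.loc[i]['latitude'], route.loc[i]['longitude'])
        end = (route.loc[i + 1]['latitude'], route.loc[i + 1]['longitude'])
        distance += geodesic(start, end).km
    return distance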

baseline_distance = total_distance(random_route)
print(f"Baseline Distance(km): {baseline_distance:.0f} km")

Baseline Distance(km): 13005 km


#### Optimized Route

In [10]:
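# Redefine total_distance with positional (.iloc) lookups: the route DataFrame
# built by nearest_neighbor() below does not have a clean 0..n-1 index.
def total_distance(route):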
    distance = 0
    for i in range(len(route) - 1):
        start = (route.iloc[i]['latitude'], route.iloc[i]['longitude'])
        end = (route.iloc[i + 1]['latitude'], route.iloc[i + 1]['longitude'])
        distance += geodesic(start, end).km
    return distance

def nearest_neighbor(locations):
    """Greedy nearest-neighbor tour that starts at the first row of `locations`."""
    start_time = time.perf_counter()

    # Work on a copy so the caller's DataFrame is left untouched
    unvisited = locations.copy().reset_index(drop=True)
    route = []

    # Start at the first row and remove it from the unvisited pool
    current_location = unvisited.iloc[0]
    route.append(current_location)
    unvisited = unvisited.drop(index=0).reset_index(drop=True)

    while not unvisited.empty:
        min_distance = float('inf')
        nearest_index = None

        # Find the unvisited location closest to the current one
        for index, location in unvisited.iterrows():
            dist = geodesic(
                (current_location['latitude'], current_location['longitude']),
                (location['latitude'], location['longitude'])
            ).km
            if dist < min_distance:
                min_distance = dist
                nearest_index = index

        # Move there; the index is reset each time so labels match positions
        current_location = unvisited.iloc[nearest_index]
        route.append(current_location)
        unvisited = unvisited.drop(index=nearest_index).reset_index(drop=True)

    route_df = pd.DataFrame(route)
    duration = time.perf_counter() - start_time
    return route_df, total_distance(route_df), duration

# Run it
route, distance, duration = nearest_neighbor(locations)

print(f"Optimized Distance: {distance:.0f} km")
print(f"Time taken: {duration:.2f} seconds")


Optimized Distance: 8344 km
Time taken: 0.04 seconds


### Second Approach

This approach first extracts the latitude and longitude columns as a plain list of (lat, lon) tuples.

In [11]:
# Extract latitude & longitude into a list of tuples
locs = list(zip(data['latitude'], data['longitude']))

#### Baseline Route

The code below shuffles the locations into a random order, then calculates and prints the total distance of that random route.

In [12]:
# Create a random route
random.seed(1)                       
route_rand = locs.copy()     
random.shuffle(route_rand)

# Calculate the distance of the random route
rand_km = sum(geodesic(route_rand[i], route_rand[i+1]).km
    for i in range(len(route_rand) - 1))
print(f"Random route ≈ {rand_km:.0f} km")

Random route ≈ 14130 km
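
The two baselines in this report (13005 km with `random_state=7` and 14130 km with `random.seed(1)`) differ because a single shuffle depends on the seed. The following is a minimal sketch of a steadier baseline that averages the route length over many shuffles. It makes two assumptions that are not part of the original notebook: it uses a spherical-Earth haversine distance in place of geopy's `geodesic` (so values will differ slightly), and `mean_random_km` is a hypothetical helper name.

```python
import math
import random

# Coordinates copied from the dataset table above, in row order
LOCS = [
    (36.11, -112.11), (37.87, -119.54), (44.43, -110.59), (37.30, -113.03),
    (43.88, -103.46), (35.65, -83.51), (40.69, -74.04), (38.73, -109.59),
    (43.10, -79.04), (37.82, -122.48),
]

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance on a sphere (an approximation, not geopy's geodesic)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def route_km(route):
    """Length of an open path (no return leg to the start)."""
    return sum(haversine_km(route[i], route[i + 1]) for i in range(len(route) - 1))

def mean_random_km(locs, n_trials=1000, seed=0):
    """Hypothetical helper: average length of `n_trials` shuffled routes."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    total = 0.0
    for _ in range(n_trials):
        route = locs[:]
        rng.shuffle(route)
        total += route_km(route)
    return total / n_trials

print(f"Mean random route: about {mean_random_km(LOCS):.0f} km")
```

Comparing the optimized route against this average, rather than against one particular shuffle, makes the reported improvement less dependent on the choice of seed.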


#### Optimized Route

The code below implements a simple nearest-neighbor tour: start at the first point, repeatedly move to the closest unvisited location, then sum and print the total distance of the resulting route.

In [13]:
# Nearest-neighbor algorithm
unvisited = locs.copy()
route_nn = [unvisited.pop(0)]

while unvisited: 
    last_point = route_nn[-1] 
    idx = min(
        range(len(unvisited)),
        key=lambda i: geodesic(last_point, unvisited[i]).km
    )
    route_nn.append(unvisited.pop(idx))

nn_km = sum(geodesic(route_nn[i], route_nn[i+1]).km
    for i in range(len(route_nn) - 1))
print(f"Nearest‐neighbor ≈ {nn_km:.0f} km")  

Nearest‐neighbor ≈ 8344 km


### Model Description

To find a short route, that is, a visiting order that keeps the total distance low, we created two models: a baseline model and an optimized model. The baseline model builds a route by selecting the next location at random, and it serves as the reference point against which the optimized model is compared. We used Python's `random` module to shuffle the order of the locations and the `geopy` library to calculate the total distance of the route.

To solve the route optimization problem, we created a model that applies the nearest-neighbor algorithm to determine a travel path that reduces the total distance between a series of geographic locations. The core idea of the algorithm is to start from an initial point and then repeatedly move to the nearest unvisited location until all points have been visited. This greedy rule is fast, but it does not guarantee the shortest possible route. The distance between each pair of locations is calculated with the `geodesic` function from `geopy`, which measures the distance between two points on an ellipsoidal model of the Earth from their latitude and longitude coordinates.

We experimented with two different implementations of this algorithm. The first used a pandas-based approach, where each location was stored as a row in a DataFrame. This method was helpful when working with extra information such as name, state, or category, because pandas makes it easier to organize and view that kind of data. The second used a list-based approach, where each location was stored as a tuple of (latitude, longitude). The route was constructed by removing the nearest point from a list of unvisited points and appending it to the route at each step. This method was much simpler and faster, especially because the dataset was small and the tuples carried no extra details. Both implementations produced the same total distance of 8344 km.

After testing both methods, we chose the list-based version as the final approach. It was easier to read, ran quickly, and fit better with the size and simplicity of the dataset.
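
Because the nearest-neighbor rule is greedy, its result depends on where the tour starts. The sketch below is not the notebook's implementation: it reruns the same rule from every possible starting stop and lists the resulting totals. It assumes a spherical haversine distance instead of `geodesic`, and the helper names (`haversine_km`, `nearest_neighbor`, `path_km`) are illustrative.

```python
import math

# Names and coordinates copied from the dataset table above, in row order
PLACES = [
    ("Grand Canyon National Park", 36.11, -112.11),
    ("Yosemite National Park", 37.87, -119.54),
    ("Yellowstone National Park", 44.43, -110.59),
    ("Zion National Park", 37.30, -113.03),
    ("Mount Rushmore", 43.88, -103.46),
    ("Great Smoky Mountains", 35.65, -83.51),
    ("Statue of Liberty", 40.69, -74.04),
    ("Arches National Park", 38.73, -109.59),
    ("Niagara Falls", 43.10, -79.04),
    ("Golden Gate Bridge", 37.82, -122.48),
]

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance on a sphere (an approximation of geodesic)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def nearest_neighbor(points, start=0):
    """Return the greedy visiting order as a list of indices into `points`."""
    order = [start]
    unvisited = set(range(len(points))) - {start}
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: haversine_km(last, points[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def path_km(points, order):
    """Length of the open path that visits `points` in the given order."""
    return sum(haversine_km(points[a], points[b]) for a, b in zip(order, order[1:]))

points = [(lat, lon) for _, lat, lon in PLACES]
results = {
    PLACES[s][0]: path_km(points, nearest_neighbor(points, s))
    for s in range(len(points))
}
# List the starting stops from shortest to longest resulting route
for name, km in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{km:7.0f} km  starting at {name}")
```

Trying every start multiplies the work by the number of stops, which is negligible for 10 locations, and it gives a quick check on how much the fixed starting point influences the reported route.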

## Exploratory Data Analysis

### Data Visualization

In [14]:
# Calculate leg distances for bar chart
leg_dists = []
for i in range(len(route) - 1):
    start = (route.iloc[i]['latitude'], route.iloc[i]['longitude'])
    end = (route.iloc[i + 1]['latitude'], route.iloc[i + 1]['longitude'])
    d = geodesic(start, end).km
    leg_dists.append({'Leg': f"{route.iloc[i]['name']} → {route.iloc[i+1]['name']}", 'Distance_km': d})

# Create a DataFrame for leg distances
leg_dists_df = pd.DataFrame(leg_dists)

In [15]:
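# Map visualization. Both route lines are closed back to their first stop,
# while the reported totals cover only the open path (no return leg).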
fig_map = go.Figure()

# Markers for all locations
fig_map.add_trace(go.Scattermapbox(
    lat=locations['latitude'],
    lon=locations['longitude'],
    mode='markers',
    marker=dict(size=8, color='blue'),
    text=locations['name'],
    name='Locations'
))

# Optimized route
fig_map.add_trace(go.Scattermapbox(
    lat=route['latitude'].tolist() + [route['latitude'].iloc[0]],
    lon=route['longitude'].tolist() + [route['longitude'].iloc[0]],
    mode='lines+markers',
    marker=dict(size=6, color='red'),
    line=dict(width=2, color='red'),
    name='Optimized Route'
))

# Random route
fig_map.add_trace(go.Scattermapbox(
    lat=random_route['latitude'].tolist() + [random_route['latitude'].iloc[0]],
    lon=random_route['longitude'].tolist() + [random_route['longitude'].iloc[0]],
    mode='lines+markers',
    marker=dict(size=6, color='green'),
    line=dict(width=2, color='green'),
    name='Random Route'
))

fig_map.update_layout(
    mapbox_style='open-street-map',
    mapbox_center={"lat": locations['latitude'].mean(), "lon": locations['longitude'].mean()},
    mapbox_zoom=4,
    margin=dict(t=80, b=40), 
    title={
        'text': 'Routes Comparison',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 20}
    },
    annotations=[
        dict(
            text=f"Initial: {baseline_distance:.0f} km | Optimized: {distance:.0f} km | Time: {duration:.3f} sec",
            showarrow=False,
            xref='paper',
            yref='paper',
            x=0.5,
            y=1.05, 
            xanchor='center',
            yanchor='bottom',
            font=dict(size=12)
        )
    ]
)
fig_map.show()


The map above compares two travel routes through the 10 selected landmarks and natural attractions across the United States: a randomly ordered route and the optimized (nearest-neighbor) route.

The route comparison clearly demonstrates the impact of optimization on travel distance:
- The randomly generated route covered a total distance of 13,005.42 km.
- The optimized route reduced the travel distance to 8,344.50 km.

This is a reduction of approximately 4,661 km, an efficiency gain of over 35%. The optimization completed in just 0.04 seconds on this 10-stop dataset, and it shows how route optimization can help save time, fuel, and money in real-life travel or delivery planning.
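
The reduction quoted above follows directly from the two reported totals. As a small worked check (the variable names are illustrative, and the totals are the figures reported in this section):

```python
baseline_km = 13005.42   # random route total, as reported above
optimized_km = 8344.50   # optimized route total, as reported above

saved_km = baseline_km - optimized_km
saved_pct = saved_km / baseline_km * 100

print(f"Saved about {saved_km:.0f} km ({saved_pct:.1f}% of the baseline)")
```

The percentage is taken relative to the baseline, so a different random seed would change both the baseline and the apparent gain.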

In [16]:
# Bar Chart (Distance per Leg)

fig_bar = px.bar(
    leg_dists_df,
    x='Leg',
    y='Distance_km',
    color='Distance_km',
    color_continuous_scale=px.colors.sequential.Oranges,
    hover_data={'Distance_km': ':.1f'},
    title='Optimized Route: Distance per Leg',
    labels={'Distance_km': 'Distance (km)'},
).update_layout(
    title={'text': "Optimized Route: Distance per Leg (Stop)", 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top', 'font': {'size': 20}},
    xaxis_tickangle=-45, 
    margin=dict(t=40, b=150),
    )
fig_bar.show()


The bar chart above shows the distance of each leg in the optimized travel route. The X-axis lists each leg of the trip, while the Y-axis shows its distance in kilometers. The bars are color-coded by distance, with lighter colors indicating shorter legs and darker shades indicating longer ones. This makes it easy to spot which parts of the journey are the longest.

Key Observations
- The longest leg in the optimized route is Golden Gate Bridge → Great Smoky Mountains, covering over 3,000 km. This is also visually confirmed by the darkest bar.
- The shortest leg is Grand Canyon National Park → Zion National Park, with a distance under 300 km.
- Every other leg is roughly 1,500 km or shorter, which reflects efficient routing between nearby landmarks.
- Only one leg (Golden Gate Bridge → Great Smoky Mountains) stands out as a long-distance jump. Some long west-to-east leg is unavoidable given the geographic locations, although the greedy nearest-neighbor order does not necessarily pick the shortest one.

Conclusions

The bar chart shows how distance is distributed across each leg of the optimized route. It highlights that while most stops are relatively close to one another, certain legs are significantly longer, possibly due to unavoidable geographic separation.

This kind of analysis is helpful in planning rest stops, fuel usage, and time allocation. It reinforces the importance of optimization in managing long-distance travel effectively — especially when some destinations are far apart.

In [17]:
# Donut Chart (Share per Leg)
fig_donut = go.Figure(go.Pie(
    labels=leg_dists_df['Leg'],
    values=leg_dists_df['Distance_km'],
    hole=0.4,
    sort=False,
))
fig_donut.update_layout(
    title={'text': "Optimized Route: Distance Share per Leg (Stop)", 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top', 'font': {'size': 20}},
    margin=dict(l=40, r=40, t=40, b=40),
    legend=dict(
        x=1.02,
        y=0.5,
        xanchor='left',
        yanchor='middle')
)
fig_donut.show()



The donut chart above shows the percentage of the total travel distance that each leg contributes to the optimized route. Each segment is labeled and color-coded to show how much of the total journey it takes up. This helps visualize which legs are the longest and how the distance is distributed across the entire route.

Key Observations
- The largest share of the total distance (41.5%) comes from the leg Golden Gate Bridge → Great Smoky Mountains, confirming it as the longest and most significant stretch of the trip.

- The shortest leg, Grand Canyon National Park → Zion National Park, contributes only 1.86% of the total distance.

- Most other legs fall between 4% and 18%, showing that the rest of the trip is more evenly spread out in terms of distance.

- The second-largest leg is Mount Rushmore → Yosemite National Park, which accounts for 18.1% of the total distance.

Conclusion

This chart clearly shows that the overall travel distance is not evenly distributed. While most legs cover moderate distances, a single leg (Golden Gate Bridge → Great Smoky Mountains) accounts for over 40% of the total journey. This emphasizes how a single long stretch can significantly impact the total route length.

Understanding distance share helps in budgeting, fuel planning, and time management, especially when planning long trips. It also reinforces the value of optimized route planning to minimize the impact of longer unavoidable legs.

In [18]:
# Cumulative Distance Chart
cumulative = np.cumsum(leg_dists_df['Distance_km'])
fig_cumulative = go.Figure()
fig_cumulative.add_trace(go.Scatter(
    x=leg_dists_df['Leg'],
    y=cumulative,
    mode="lines+markers",
    line=dict(width=3, color='orange'),
    marker=dict(size=8)
))
fig_cumulative.update_layout(
    title= {
        'text': "Cumulative Distance by Leg (Stop)",
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 20}
    },
    xaxis_title="Leg",
    yaxis_title="Distance (km)",
    margin=dict(l=40, r=20, t=40, b=40))

fig_cumulative.show()


The line graph above shows the growing total distance (km) traveled as the trip progresses through multiple U.S. landmarks and national parks.

Key Observations
- The journey begins at Grand Canyon National Park, and the cumulative distance is still small after the short first leg to Zion National Park.
- There is a steady increase in cumulative distance as the route continues through Zion, Arches, and Yellowstone National Parks to Mount Rushmore. The climb here is consistent but moderate.
- The curve steepens on the Mount Rushmore → Yosemite National Park leg, while the following Yosemite → Golden Gate Bridge leg adds very little.
- The steepest rise comes on the Golden Gate Bridge → Great Smoky Mountains leg, a geographical leap from the Pacific coast to Tennessee. This is the largest increase between any two points.
- After the Great Smoky Mountains, the route continues to Niagara Falls and finally the Statue of Liberty, with the total distance reaching about 8,300 km.
- These final legs start from a higher base distance but add relatively small increases compared to the Golden Gate Bridge → Great Smoky Mountains leg.

Conclusion:
The cumulative distance graph highlights a cross-country journey that starts in the western U.S. and ends in the northeast. While most travel segments build distance gradually, the leg from the Golden Gate Bridge to the Great Smoky Mountains stands out as a major jump, marking the crossing from the West Coast to the eastern U.S. Overall, the graph tracks how each leg adds to the journey, culminating in a total distance of over 8,000 km and showing the wide geographic spread of the route.
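
To check which leg produces the largest jump in the cumulative curve, the legs can be recomputed from the coordinates in the dataset table. This is a rough sketch under two assumptions: the stop order is the one described in this report, and distances use a spherical haversine formula rather than `geodesic`, so the totals will differ slightly from the reported 8,344 km.

```python
import math
from itertools import accumulate

# Stops in the order described in this report, with coordinates from the dataset table
ROUTE = [
    ("Grand Canyon National Park", 36.11, -112.11),
    ("Zion National Park", 37.30, -113.03),
    ("Arches National Park", 38.73, -109.59),
    ("Yellowstone National Park", 44.43, -110.59),
    ("Mount Rushmore", 43.88, -103.46),
    ("Yosemite National Park", 37.87, -119.54),
    ("Golden Gate Bridge", 37.82, -122.48),
    ("Great Smoky Mountains", 35.65, -83.51),
    ("Niagara Falls", 43.10, -79.04),
    ("Statue of Liberty", 40.69, -74.04),
]

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance on a sphere (an approximation of geodesic)."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

# One (label, km) pair per leg between consecutive stops
legs = [
    (f"{a[0]} -> {b[0]}", haversine_km(a[1], a[2], b[1], b[2]))
    for a, b in zip(ROUTE, ROUTE[1:])
]
cumulative = list(accumulate(km for _, km in legs))

for (label, km), total in zip(legs, cumulative):
    print(f"{label:52s} +{km:6.0f} km   total {total:6.0f} km")

# The largest single increase in the cumulative curve is the longest leg
largest_label, largest_km = max(legs, key=lambda leg: leg[1])
print(f"Largest single increase: {largest_label}")
```

Since each point on the cumulative curve is the running sum of the legs so far, the steepest segment of the line always corresponds to the longest individual leg.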


In [19]:
# Travel time (assuming 60 km/h)
leg_dists_df['Time_h'] = leg_dists_df['Distance_km'] / 60
fig_time = px.bar(
    leg_dists_df,
    x='Leg',
    y='Time_h',
    color='Time_h',
    color_continuous_scale=px.colors.sequential.Oranges,
    hover_data={'Time_h': ':.1f'},
    labels={'Time_h': 'Time (h)'},
    title='Estimated Travel Time per Leg'
).update_layout(
    xaxis_tickangle=-45,
    margin=dict(t=40, b=120),
    title={'text': 'Estimated Travel Time per Leg (Stop)', 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top', 'font': {'size': 20}}
)

fig_time.show()

The bar chart above shows the estimated travel time (in hours) between consecutive destinations on the optimized route, assuming an average speed of 60 km/h applied to the straight-line (geodesic) distance of each leg.

Key Observations:
- The first few legs (Grand Canyon → Zion, Zion → Arches, and Arches → Yellowstone) take relatively little time, each under 12 hours, reflecting short distances between stops in the western U.S.
- Yellowstone → Mount Rushmore is a moderate leg at around 10 hours, while Mount Rushmore → Yosemite is considerably longer at roughly 25 hours, consistent with its 18.1% share of the total distance.
- The leg from Golden Gate Bridge → Great Smoky Mountains stands out dramatically, requiring nearly 60 hours of travel. It is the longest by far and visually dominates the chart, highlighting a major cross-country leap from California to the southeastern U.S.
- Following that, travel times decrease again, with Great Smoky Mountains → Niagara Falls and Niagara Falls → Statue of Liberty taking roughly 8 to 15 hours each.

Conclusion:
This chart reveals that while most legs of the journey are manageable in under a day of driving, the Golden Gate Bridge → Great Smoky Mountains segment is a major outlier and logistical challenge. This leg alone could significantly impact scheduling, rest requirements, and travel fatigue. It might benefit from being broken up with intermediate stops to balance the trip more evenly. The visualization makes clear how much travel times vary from leg to leg.
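
The per-leg times can be cross-checked against the distance shares quoted for the donut chart. The sketch below converts those shares into kilometers and hours, using only figures stated in this report (the 8,344.50 km total, the three quoted shares, and the 60 km/h speed). The variable names are illustrative.

```python
total_km = 8344.50   # optimized route total, as reported earlier
speed_kmh = 60       # average speed assumed in this report

# Shares of the total distance quoted in the donut chart observations (percent)
shares_pct = {
    "Golden Gate Bridge -> Great Smoky Mountains": 41.5,
    "Mount Rushmore -> Yosemite National Park": 18.1,
    "Grand Canyon National Park -> Zion National Park": 1.86,
}

# Convert each share to a distance, then to a travel time at the assumed speed
leg_km = {leg: total_km * pct / 100 for leg, pct in shares_pct.items()}
leg_hours = {leg: km / speed_kmh for leg, km in leg_km.items()}

for leg in shares_pct:
    print(f"{leg}: about {leg_km[leg]:.0f} km, about {leg_hours[leg]:.1f} h")
```

Because the times are a fixed multiple of the distances, the travel-time chart has exactly the same shape as the distance-per-leg chart. A real itinerary would need road distances and realistic speeds instead.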

In [20]:
# Distance Improvement Indicator
improvement = (baseline_distance - distance) / baseline_distance * 100
fig_indicator = go.Figure(go.Indicator(
    mode="number+delta",
    value=distance,
    delta={"reference": baseline_distance, "relative": True, "valueformat": ".1%"},
    number={"suffix": " km", "font": {"size": 36}}
))
fig_indicator.update_layout(margin={"t":50,"b":0,"l":0,"r":0},
                            title={'text': "Optimized Route vs. Baseline Route<br><span style='font-size:0.7em;color:black'>Total Distance (km)</span>", 'x': 0.5, 'xanchor': 'center', 'font': {'size': 20}},)

fig_indicator.show()

In [21]:
# Calculate hours saved by optimization
average_speed = 60  # km/h

optimized_time = distance / average_speed 
random_time = baseline_distance / average_speed

hours_saved = random_time - optimized_time
print(f"Hours saved by optimization: {hours_saved:.0f} hours")

Hours saved by optimization: 78 hours


In [22]:
# Bar Chart for Time Comparison
fig = go.Figure()
fig.add_trace(go.Bar(
    x=["Random Route", "Optimized Route"],
    y=[random_time, optimized_time],
    text=[f"{random_time:.0f} h", f"{optimized_time:.0f} h"],
    textposition="auto",
    marker_color=["#FFA500", "#FF8C00"]
))

#  Add a text annotation for hours saved
fig.add_annotation(
    text=f"Hours saved: {hours_saved:.0f} h",
    xref="paper", yref="paper",
    x=1, y=1,
    showarrow=False,
    font=dict(size=14, color="black"),
    align="center"
)

# Update layout for the bar chart
fig.update_layout(
    title={'text': "Total Travel Time Comparison", 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top', 'font': {'size': 20}},
    yaxis_title="Time (hours)",
    xaxis_title="Routes",
    bargap=0.4,
    template="plotly_white"
)

fig.show()

The bar chart above shows that, by following the optimized route through the 10 selected destinations, travelers could save about 78 hours of driving time compared to the random route (at the assumed average speed of 60 km/h), a significant reduction in both time and fatigue.

# Overall Conclusion

This project explored how travel routes between popular U.S. landmarks and natural attractions can be optimized for distance and efficiency. Comparing a randomly generated route with an optimized one showed that the optimized route reduced the total distance by over 4,600 km, or approximately 35%. This finding shows the importance and potential impact of optimization techniques in real-life travel planning, logistics, and delivery routing.

Visualizations like bar graphs and donut charts helped break down the route step-by-step, making it easy to spot which parts of the trip involved the most travel. For example, the stretch between the Golden Gate Bridge and Great Smoky Mountains made up over 40% of the entire journey. Insights like these can guide decisions around where to plan longer breaks, how much fuel might be needed for each segment, and how the overall travel load is distributed. Based on an average speed of 60 km/h, the optimized route could save travelers up to 78 hours of driving time compared to the random route — a significant reduction in both time and fatigue.

For future development, several enhancements can increase the value and realism of the project:
- Allow users to select a subset of locations from the available 10 (or more) and generate an optimal route based on their personal preferences.
- Attach additional travel data such as nearby hotels, restaurants, gas stations, and tourist information using public APIs like Google Places or Yelp Fusion.
- Expand the dataset to include more attractions across all U.S. regions, enabling longer or themed trips (e.g., only national parks or only historical sites).
- Incorporate real-time data such as live traffic, weather conditions, or road closures for smarter route planning.
- Optimize for different goals, such as shortest time, most scenic path, or lowest cost, using multi-objective optimization methods.
- Create a web-based dashboard or app (using tools like Streamlit, Dash, or Flask) where users can interactively choose destinations and view the updated optimized route with stats, maps, and tips.
- Add features for eco-conscious travelers, such as calculating carbon footprints or suggesting EV charging stations along the way.
- Provide a personalized itinerary generator, complete with travel times, estimated arrival dates, and rest stop suggestions.

By implementing these improvements, the project can evolve from a route optimization model into a full-featured travel planning tool: practical, user-centric, and ready for real-world use.

The entire modeling process was built using Python and open-source libraries such as pandas, geopy, and random, making the approach fully reproducible, transparent, and easy to customize for other datasets or use cases. While still in its basic stage, this project sets the foundation for building a smarter and more personalized travel planning system — one that can grow with added features, real-time data, and user-focused design.

