# Task 1: Distribution of Cuisine Types in Newcastle
## 1.1 Original Code

In [1]:
import pandas as pd
import plotly.express as px
from haversine import calculate_distance

df = pd.read_csv("Restaurants.csv")

# Define cuisine types based on columns
cuisines = ["british", "italian", "indian", "chinese"]

def determine_cuisine(row):
    # Combine multiple cuisine flags into a readable label
    matches = [col.capitalize() for col in cuisines if row[col] == 1]
    return ", ".join(matches) if matches else "Unspecified"

df["cuisine_type"] = df.apply(determine_cuisine, axis=1)
df.drop(columns=cuisines, inplace=True)

ref_point = (54.979359, -1.614639)

# Calculate distance from reference point using the Haversine formula
distances = []
for index, row in df.iterrows():
    restaurant_loc = (row["lat"], row["long"])
    dist = calculate_distance(ref_point, restaurant_loc)
    distances.append(dist)

df["distance_to_uni"] = distances

# Prepare cuisine list for counting by separating combined labels
df["cuisine_list"] = df["cuisine_type"].str.split(", ")
df_exploded = df.explode("cuisine_list")

# Filter restaurants within 1 km radius
df_nearby = df_exploded[df_exploded["distance_to_uni"] <= 1]

cuisine_counts = df_nearby["cuisine_list"].value_counts()

# Create bar chart
fig = px.bar(
    x=cuisine_counts.index,
    y=cuisine_counts.values,
    color=cuisine_counts.index,
    labels={'x': 'Cuisine Type', 'y': 'Number of Restaurants'},
    title="Number of Restaurants within 1km of Newcastle University"
)
fig.show()


## 1.2 AI Generated Code
Source: OpenAI ChatGPT (2025). ChatGPT response to Amelia Bademosi, 22 March. Available at: https://chat.openai.com

My query: I have this data frame in pandas: `df = pd.read_csv("Restaurants.csv")
df = ['name', 'address', 'city', 'city_id', 'lat', 'long', 'excellent_count', verygood_count', 'average_count', 'poor_count', 'terrible_count', 'british', 'italian', 'indian', 'chinese']` with each row containing a 1 or 0 in the columns "british","italian","indian","chinese" to define cuisine type.

Can you provide python code given the "lat" and "long" of each restaurant and a reference latitude and longitude that calculates the distribution of restaurants within a 1km radius of the reference point by cuisine type using the following haversine function?[uploaded the file haversine.py]

In [2]:
import pandas as pd
from haversine import calculate_distance

# Load your dataset
df = pd.read_csv("Restaurants.csv")

# Reference point
reference_point = (54.979359, -1.614639)

# Calculate distance from reference point for each restaurant
def is_within_radius(row, ref_point, radius_km=1):
    restaurant_point = (row["lat"], row["long"])
    distance = calculate_distance(restaurant_point, ref_point)
    return distance <= radius_km

# Filter restaurants within 1 km
df["within_1km"] = df.apply(is_within_radius, axis=1, ref_point=reference_point)
df_nearby = df[df["within_1km"]]

# Sum the counts by cuisine type
cuisine_distribution = df_nearby[["british", "italian", "indian", "chinese"]].sum()

# Display the distribution
print("Cuisine Distribution within 1km:")
print(cuisine_distribution)


Cuisine Distribution within 1km:
british    70
italian    32
indian      9
chinese    36
dtype: int64


## 1.3 Critical Comparison

My original code and the AI-generated code approach the task differently. In 1.1, I calculated distances with an explicit `for` loop over `iterrows()` and then filtered restaurants by distance. The AI version instead used `apply` with a helper function that returns whether each restaurant lies within the radius. Note that `apply` with `axis=1` still calls the helper once per row, so it is more concise and idiomatic than my loop rather than truly vectorized. While my original version gave me full control and was easy to understand, it was more verbose and less efficient, especially for larger datasets. The AI's `apply` approach is generally faster than `iterrows()` and more aligned with common pandas practice. However, the AI version lacked a full cuisine assignment system and did not visualize the results, simply printing the counts. I will improve my code by combining the strengths of both approaches: (1) retaining my more complete cuisine categorization and bar chart visualization, and (2) integrating the AI's more compact distance calculation using `apply`.
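For comparison, a genuinely vectorized distance calculation operates on whole coordinate arrays at once. The sketch below is not the `calculate_distance` function from `haversine.py` (which is not shown in this report); `haversine_km` is a hypothetical helper that assumes coordinates in decimal degrees and an Earth radius of 6371 km, and the two test points are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

def haversine_km(ref_point, lats, lons):
    """Distance in km from one (lat, lon) reference point to arrays of points."""
    lat1, lon1 = np.radians(ref_point[0]), np.radians(ref_point[1])
    lat2 = np.radians(np.asarray(lats, dtype=float))
    lon2 = np.radians(np.asarray(lons, dtype=float))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

ref_point = (54.979359, -1.614639)
# Two made-up test points: the reference itself and one 0.01 degrees north
lats = np.array([54.979359, 54.989359])
lons = np.array([-1.614639, -1.614639])
distances = haversine_km(ref_point, lats, lons)
within_1km = distances <= 1  # boolean mask, one entry per point
```

With a DataFrame, the same helper could be called as `haversine_km(ref_point, df["lat"].to_numpy(), df["long"].to_numpy())`, replacing the per-row `apply`.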

## 1.4 Final Version

In [3]:
import pandas as pd
import plotly.express as px
from haversine import calculate_distance

df = pd.read_csv("Restaurants.csv")

# Define cuisine types based on columns
cuisines = ["british", "italian", "indian", "chinese"]

def determine_cuisine(row):
    # Combine multiple cuisine flags into a readable label
    matches = [col.capitalize() for col in cuisines if row[col] == 1]
    return ", ".join(matches) if matches else "Unspecified"

df["cuisine_type"] = df.apply(determine_cuisine, axis=1)
df.drop(columns=cuisines, inplace=True)

ref_point = (54.979359, -1.614639)

# Calculate distance from reference point using the Haversine formula
df["distance_to_uni"] = df.apply(
    lambda row: calculate_distance(
        ref_point, (row["lat"], row["long"])), axis=1)

# Prepare cuisine list for counting by separating combined labels
df["cuisine_list"] = df["cuisine_type"].str.split(", ")
df_exploded = df.explode("cuisine_list")

# Filter restaurants within 1 km radius
df_nearby = df_exploded[df_exploded["distance_to_uni"] <= 1]

cuisine_counts = df_nearby["cuisine_list"].value_counts()

fig = px.bar(
    x=cuisine_counts.index,
    y=cuisine_counts.values,
    color=cuisine_counts.index,
    labels={"x": "Cuisine Type", "y": "Number of Restaurants"},
    title="Cuisine Types within 1km of Newcastle University"
)
fig.show()


### Code Commentary


This script processes restaurant data to identify the distribution of cuisine types within a 1 km radius of Newcastle University. The dataset is first cleaned and simplified by converting cuisine columns with dummy variables into readable string labels using the `determine_cuisine` function (L10–12). These columns are dropped to improve clarity (L13). The script then calculates the distance between each restaurant and the university using a custom Haversine function (L18–20), and stores the result in a new column. To ensure accurate cuisine counts—especially for multi-cuisine restaurants—the script splits and explodes the cuisine labels so that each type can be counted individually (L23–24). Restaurants within 1 km are filtered (L27), and the frequency of each cuisine is calculated (L29). The results are visualized using a bar chart to summarize cuisine distribution (L31–36).
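The split-and-explode step can be illustrated on a toy frame. The three rows below are made up for illustration; only the column names `cuisine_type` and `cuisine_list` follow the script.

```python
import pandas as pd

# Toy data (made up): one multi-cuisine row and one unlabeled row
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cuisine_type": ["British, Italian", "British", "Unspecified"],
})

# Split "British, Italian" into ["British", "Italian"], then give each its own row
toy["cuisine_list"] = toy["cuisine_type"].str.split(", ")
exploded = toy.explode("cuisine_list")

# Count cuisine labels, not unique restaurants
counts = exploded["cuisine_list"].value_counts()
```

Here three restaurants become four rows, so the bar chart counts cuisine labels: a restaurant serving two cuisines contributes to two bars.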

**Observations:** From the findings, the most common label is “Unspecified,” indicating missing or ambiguous data. Among those with clear labels, British cuisine is most prevalent, followed by Chinese, Italian, and Indian. This skew toward “Unspecified” points to limitations in the data quality. A future enhancement would be to refine the classification logic or enrich the dataset for greater accuracy in mapping the local food landscape.


# Task 2: Top-Rated Restaurants 
## 2.1 Original Code

In [4]:
import pandas as pd
import plotly.express as px

df2 = df.copy()
df2["total_reviews"] = df2[[
    "excellent_count",
    "verygood_count",
    "average_count", 
    "poor_count", 
    "terrible_count"
]].sum(axis=1)
df2.drop(df2[df2["total_reviews"]<10].index, inplace=True)

def weighted_average(row):
    weighted_sum = (
        row["excellent_count"]*5 +
        row["verygood_count"]*4 + 
        row["average_count"]*3 + 
        row["poor_count"]*2 + 
        row["terrible_count"])
    avg = weighted_sum / row["total_reviews"]
    return avg

df2["avg_rating"] = df2.apply(weighted_average,axis=1)

# Keep asking until valid distance
while True:
    try:
        preferred_dist = float(input(
            "How far do you want to travel from Newcastle University? "
        ))
        if preferred_dist <= 0:
            print("Please enter a positive number for distance.")
            continue
        break  # valid input
    except ValueError:
        print("Invalid input. Please enter a number.")
        
# Keep asking until valid cuisine
cuisines = ["British", "Italian", "Indian", "Chinese"]
while True:
    preferred_cuisine = input(
        "Would you like British, Italian, Chinese, or Indian cuisine? "
    ).strip().capitalize()
    if preferred_cuisine in cuisines:
        break  # valid input
    else:
        print("Invalid cuisine.\n"
              "Please choose from British, Italian, Chinese, or Indian.")

filtered_df = df2[
    (df2["distance_to_uni"] <= preferred_dist) & 
    (df2["cuisine_type"] == preferred_cuisine)
]

filtered_df = filtered_df.sort_values(by = 'avg_rating', 
                                      ascending = False).head(5)
if not filtered_df.empty:
    display(filtered_df[[
        "name","distance_to_uni",
        "cuisine_type",
        "avg_rating"
    ]])
else:
    print("No matching restaurants found.\n"
          "Try a different distance or cuisine.")
    
# Calculate zoom automatically
zoom_level = max(6, 13 - (preferred_dist)/6)
fig = px.scatter_map(
    filtered_df, lat="lat",
    lon="long", 
    color="avg_rating", 
    size="avg_rating",
    color_continuous_scale=px.colors.diverging.Temps_r, 
    size_max=10, 
    zoom=zoom_level, 
    height=600
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()


                     name  distance_to_uni cuisine_type  avg_rating
2665         Deli Central         1.227998      British    4.951220
2394        Camber Coffee         0.766761      British    4.833333
2037    Thyme Square Cafe         3.103879      British    4.814815
2321  Monsieur Crepe Cafe         2.560268      British    4.814815
2746       Starks Kitchen         2.515707      British    4.806723


## 2.2 AI Generated Code
Source: OpenAI ChatGPT (2025). ChatGPT response to Amelia Bademosi, 28 April. Available at: https://chat.openai.com

My query:
Given `df = df = ['name', 'address', 'city', 'city_id', 'lat', 'long', 'excellent_count', 'verygood_count', 'average_count', 'poor_count', 'terrible_count', 'cuisine_type', 'distance_to_uni']` find the average review for each restaurant with at least 10 reviews. Then let the user select a distance from the reference point and cuisine type. Display the restaurant name, distance from reference point, cuisine type and Average rating of the top 5 restaurants that match the user input

In [5]:
import pandas as pd

df2 = df.copy()

# Step 1: Calculate total reviews and average rating
review_columns = ['excellent_count', 'verygood_count', 'average_count', 'poor_count', 'terrible_count']
weights = {'excellent_count': 5, 'verygood_count': 4, 'average_count': 3, 'poor_count': 2, 'terrible_count': 1}

df2['total_reviews'] = df2[review_columns].sum(axis=1)
df2['rating'] = sum(df[col] * weight for col, weight in weights.items()) / df2['total_reviews']

# Step 2: Filter restaurants with at least 10 reviews
df_filtered = df2[df2['total_reviews'] >= 10]

# Step 3: User input for max distance and cuisine type
try:
    max_distance = float(input("Enter maximum distance from reference point (in km): "))
    selected_cuisine = input("Enter desired cuisine type (e.g., italian, indian): ").strip().lower()

    # Step 4: Filter by user criteria
    matching = df_filtered[
        (df_filtered['distance_to_uni'] <= max_distance) &
        (df_filtered['cuisine_type'].str.lower() == selected_cuisine)
    ]

    # Step 5: Select top 5 by average rating
    top_5 = matching.sort_values(by='rating', ascending=False).head(5)

    # Step 6: Display results
    print("\nTop 5 matching restaurants:")
    print(top_5[['name', 'distance_to_uni', 'cuisine_type', 'rating']])
except Exception as e:
    print(f"An error occurred: {e}")



Top 5 matching restaurants:
                     name  distance_to_uni cuisine_type    rating
2665         Deli Central         1.227998      British  4.951220
2394        Camber Coffee         0.766761      British  4.833333
2037    Thyme Square Cafe         3.103879      British  4.814815
2321  Monsieur Crepe Cafe         2.560268      British  4.814815
2746       Starks Kitchen         2.515707      British  4.806723


## 2.3 Critical Comparison

My original code in 2.1 already included strong input validation, using `while` loops for both the distance and cuisine inputs and a `try`/`except ValueError` block for the distance. I also visualized the matching restaurants on a map, with the zoom level scaled from the user's entered distance using a simple formula.

The AI-generated code in 2.2 was more compact in some areas, such as calculating the average rating with a dictionary-weighted sum and displaying a simple table of the results. However, it wrapped all input handling in a single broad `try`/`except Exception` block instead of validating each input specifically, so one invalid entry ends the program rather than prompting again.

For Task 2.4, I will keep the input validation and interactive map from my original code. From the AI version, I will adopt the dictionary-based weighted sum for the average rating and print a table of selected columns, instead of just using `display()`, for better readability and efficiency.
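One way to make the validation logic easier to test is to separate parsing from prompting. This is a sketch only: `parse_distance`, `parse_cuisine`, and `ask` are hypothetical helper names, not part of either version above.

```python
import math

VALID_CUISINES = ["British", "Italian", "Indian", "Chinese"]

def parse_distance(text):
    """Return a positive, finite distance in km, or None if the text is invalid."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) and value > 0 else None

def parse_cuisine(text):
    """Return the normalised cuisine name, or None if it is not an accepted option."""
    cuisine = text.strip().capitalize()
    return cuisine if cuisine in VALID_CUISINES else None

def ask(prompt, parser, error_message):
    """Keep prompting until `parser` accepts the input (mirrors the while loops)."""
    while True:
        result = parser(input(prompt))
        if result is not None:
            return result
        print(error_message)
```

A call such as `ask("How far? ", parse_distance, "Please enter a positive number.")` would then behave like the distance loop, while the two parsers can be checked without any user interaction.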


## 2.4 Final Version

In [6]:
import pandas as pd
import plotly.express as px

df2 = df.copy()

# Calculate total reviews
review_columns = [
    "excellent_count", 
    "verygood_count", 
    "average_count", 
    "poor_count", 
    "terrible_count"
]

weights = {
    "excellent_count": 5,
    "verygood_count": 4,
    "average_count": 3,
    "poor_count": 2,
    "terrible_count": 1
}

df2["total_reviews"] = df2[review_columns].sum(axis=1)
df2 = df2[df2["total_reviews"] >= 10].copy()

df2["rating"] = sum(
    df2[col] * weight for col, weight in weights.items()
) / df2["total_reviews"]

# Get valid user input
cuisines = ["British", "Italian", "Indian", "Chinese"]
while True:
    try:
        preferred_dist = float(input(
            "How many km do you want to travel from Newcastle University? "
        ))
        if preferred_dist <= 0:
            print("Please enter a positive number for distance.")
            continue
        break
    except ValueError:
        print("Invalid input. Please enter a number.")

while True:
    preferred_cuisine = input(
        "Would you like British, Italian, Chinese, or Indian cuisine? "
    ).strip().capitalize()
    if preferred_cuisine in cuisines:
        break
    else:
        print(
            "Invalid cuisine.\n"
            "Please choose from British, Italian, Chinese, or Indian.")

# Filter and sort results
filtered_df = df2[
    (df2["distance_to_uni"] <= preferred_dist) &
    (df2["cuisine_type"].str.lower() == preferred_cuisine.lower())
]

top5 = filtered_df.sort_values(by = "rating",
                               ascending = False).head(5)

# Display results and map
if not top5.empty:
    print("\nTop 5 matching restaurants:")
    print(top5[["name", "distance_to_uni", "rating"]])
    
    zoom_level = max(6, 13 - (preferred_dist)/6)
    fig = px.scatter_map(
        top5,
        lat="lat",
        lon="long",
        color="rating",
        size="rating",
        color_continuous_scale=px.colors.diverging.Temps_r,
        size_max=10,
        zoom=zoom_level,
        height=600,
        hover_name="name",
        hover_data={
            "distance_to_uni": ":.2f",
            "rating": False,
            "cuisine_type": False,
            "lat": False,
            "long": False,
        }
    )
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()
else:
    print(
    "No matching restaurants found.\n"
    "Try a different distance or cuisine."
    )



Top 5 matching restaurants:
                     name  distance_to_uni    rating
2665         Deli Central         1.227998  4.951220
2394        Camber Coffee         0.766761  4.833333
2037    Thyme Square Cafe         3.103879  4.814815
2321  Monsieur Crepe Cafe         2.560268  4.814815
2746       Starks Kitchen         2.515707  4.806723


### Code Commentary
This section implements an interactive program that recommends the top 5 highest-rated restaurants based on the user's preferred cuisine and travel distance from Newcastle University. The code first calculates a weighted average rating for each restaurant using a scoring system from 1 (terrible) to 5 (excellent) (L7–L12). To ensure rating reliability, only restaurants with at least 10 total reviews are included (L11).
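The weighted average can be checked by hand on a single restaurant. The sketch below uses the same 1 to 5 weights and 10-review threshold; the review counts and the helper name `weighted_rating` are made up for illustration.

```python
# Same 1-5 weights as the `weights` dictionary in the final version
WEIGHTS = {"excellent_count": 5, "verygood_count": 4, "average_count": 3,
           "poor_count": 2, "terrible_count": 1}
MIN_REVIEWS = 10

def weighted_rating(counts):
    """Return the weighted mean rating, or None if there are too few reviews."""
    total = sum(counts[col] for col in WEIGHTS)
    if total < MIN_REVIEWS:
        return None
    return sum(counts[col] * w for col, w in WEIGHTS.items()) / total

# Hypothetical review counts, made up for illustration
example = {"excellent_count": 6, "verygood_count": 2, "average_count": 1,
           "poor_count": 1, "terrible_count": 0}
rating = weighted_rating(example)  # (6*5 + 2*4 + 1*3 + 1*2 + 0*1) / 10
```

For these counts the weighted sum is 43 over 10 reviews, giving a rating of 4.3.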

User input is handled through two while loops with input validation: one for travel distance (L16–L24) and another for cuisine preference (L26–L31). Accepted cuisines are limited to British, Italian, Indian, and Chinese (L15) to avoid typos or data mismatches. The filtered dataset (L34) matches restaurants within the specified distance and cuisine. It is then sorted by average rating, and the top 5 are selected (L35). If results are found, a table is printed (L39–L40) and a `plotly.express` interactive map is generated (L43–L46) to visually locate the top options. The zoom level dynamically adjusts based on the user’s distance input (L42). If no matches are found, the program provides a clear message (L48), improving usability.
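The zoom formula can be read as "start at zoom 13 and lose one level for every 6 km, but never go below 6". A small sketch, with the hypothetical function name `zoom_for_distance` and the constants exposed as parameters:

```python
def zoom_for_distance(distance_km, max_zoom=13, min_zoom=6, km_per_level=6):
    """Lower the zoom by one level per `km_per_level` km, never below `min_zoom`."""
    return max(min_zoom, max_zoom - distance_km / km_per_level)

zoom_near = zoom_for_distance(3)    # 13 - 0.5
zoom_mid = zoom_for_distance(20)    # 13 - 20/6
zoom_far = zoom_for_distance(60)    # 13 - 10 = 3, clamped to 6
```

With these constants the lower bound takes effect for any distance of 42 km or more.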

**Observations:** In the test run, a user requesting British cuisine within 20 km of Newcastle University received five strong matches, with all restaurants scoring above 4.80. Notably, some of the highest-rated results (e.g., Thyme Square Cafe, Monsieur Crepe Cafe) were located more than 2.5 km from the university, suggesting that high-quality establishments are not concentrated near the university. The zoom function enhanced map clarity, and the average rating filter effectively prioritized well-reviewed venues. Overall, the program balances precision filtering with user experience and produces actionable results.

# Task 3: Average Ratings by Cuisine Type 
## 3.1 Original Code

In [7]:
import pandas as pd
import plotly.express as px
from haversine import calculate_distance

df3 = df2.copy()
df3["cuisine_type"] = df3["cuisine_type"].apply(
    lambda x: [c.strip() for c in x.split(",")]
)

df3 = df3.explode("cuisine_type")

# filter restaurants in liverpool and york
liverpool_df = df3[df3["city"] == "liverpool"].copy()
york_df = df3[df3["city"] == "york"].copy()

liverpool = (53.39853, -2.97324)
york = (53.96428, -1.07726)

# calculate distance to reference points
liverpool_df["distance_to_ref"] = liverpool_df.apply(
    lambda row: calculate_distance((row["lat"], row["long"]), liverpool),
    axis=1
)

york_df["distance_to_ref"] = york_df.apply(
    lambda row: calculate_distance((row["lat"], row["long"]), york),
    axis=1
)

# filter for restaurants within 1km radius
liverpool_df = liverpool_df[liverpool_df["distance_to_ref"] <= 1]
york_df = york_df[york_df["distance_to_ref"] <= 1]

# combine both cities into one dataframe
df3 = pd.concat([liverpool_df, york_df], ignore_index=True, sort=False)

# group by city and cuisine, then calculate mean rating
df3 = df3.groupby(["city","cuisine_type"]
                 )["rating"].mean().reset_index()

fig = px.bar(
    df3,
    x="cuisine_type",
    y="rating",
    color="city",
    barmode="group",
    labels={
        "cuisine_type": "Cuisine Type",
        "rating": "Average Rating",
        "city": "City"
    },
    title="Average Restaurant Ratings by Cuisine Type (Within 1km)"
)
fig.show()



## 3.2 AI Generated Code
Source: OpenAI ChatGPT (2025). ChatGPT response to Amelia Bademosi, 3 May. Available at: https://chat.openai.com

My query: I have a pandas DataFrame called df2 with the columns `lat`, `long`, `rating`, and `cuisine_list`, where `cuisine_list` is a list of cuisine types for each restaurant. I want to calculate the average rating by cuisine type for all restaurants within a 1 km radius of two specific reference points. Use only pandas and my `calculate_distance` function from `haversine.py`. Note that `cuisine_list` contains lists, so you’ll need to explode it before grouping.

In [8]:
import pandas as pd
from haversine import calculate_distance

# Define your two reference locations
location_1 = (53.39853, -2.97324)
location_2 = (53.96428, -1.07726)

# Function to check distance from reference point
def is_within_1km(row, ref_point):
    return calculate_distance((row['lat'], row['long']), ref_point) <= 1.0

# Prepare results
results = []

for idx, location in enumerate([location_1, location_2], start=1):
    # Filter by distance
    df_temp = df2[df2.apply(lambda row: is_within_1km(row, location), axis=1)].copy()

    # Explode cuisine_list if it's a list of cuisines
    df_temp = df_temp.explode('cuisine_list')

    # Now safely group by cuisine and calculate average rating
    avg_rating_by_cuisine = df_temp.groupby('cuisine_list')['rating'].mean().reset_index()
    avg_rating_by_cuisine['reference_location'] = f'Location {idx}'
    results.append(avg_rating_by_cuisine)

# Combine and display
final_result = pd.concat(results, ignore_index=True)

print("Average rating by cuisine type within 1 km of each location:")
print(final_result)



Average rating by cuisine type within 1 km of each location:
  cuisine_list    rating reference_location
0      British  4.213365         Location 1
1      Chinese  3.739602         Location 1
2       Indian  4.074315         Location 1
3      Italian  4.107111         Location 1
4  Unspecified  3.993612         Location 1
5      British  4.062926         Location 2
6      Chinese  3.788139         Location 2
7       Indian  4.276596         Location 2
8      Italian  4.067629         Location 2
9  Unspecified  4.120648         Location 2


## 3.3 Critical Comparison
Both approaches calculated average restaurant ratings by cuisine type within a 1 km radius of two selected locations, but only my original code visualized them; the AI version printed a table. Key differences also emerged in structure and clarity. My original code used a direct and readable pandas groupby and mean pattern to compute average ratings, which was efficient and simple. It also filtered the dataset by city and distance early on, improving readability. However, it was less robust to missing cuisine values and repeated the same distance calculation and filtering code for each city.

The AI-generated version introduced reusable functions to modularize the distance filtering and cuisine rating calculations. This improved scalability and reduced redundancy. For the final version, I retained my clear location filtering logic and intuitive groupby structure. From the AI version, I adopted modular functions. This hybrid approach enhances maintainability and visualization while preserving clarity and efficiency from my original implementation.

## 3.4 Final Version

In [9]:
import pandas as pd
import plotly.express as px
from haversine import calculate_distance

df3 = df2.copy()
df3["cuisine_type"] = df3["cuisine_type"].apply(
    lambda x: [c.strip() for c in str(x).split(",")]
)
df3 = df3.explode("cuisine_type")

# Function to filter restaurants within 1 km
def filter_within_radius(df, location, radius_km=1):
    df_filtered = df.copy()
    df_filtered["distance"] = df_filtered.apply(
        lambda row: calculate_distance((row['lat'], row['long']), location),
        axis=1
    )
    return df_filtered[df_filtered["distance"] <= radius_km].copy()

# Reference coordinates
liverpool = (53.39853, -2.97324)
york = (53.96428, -1.07726)

liverpool_df = filter_within_radius(df3, liverpool)
york_df = filter_within_radius(df3, york)

# Tag city before combining
liverpool_df["city"] = "Liverpool"
york_df["city"] = "York"

# Combine and compute mean ratings
combined_df = pd.concat([liverpool_df, york_df], ignore_index=True)
grouped_df = combined_df.groupby(["city", "cuisine_type"]
                                )["rating"].mean().reset_index()

fig = px.bar(
    grouped_df,
    x="cuisine_type",
    y="rating",
    color="city",
    barmode="group",
    labels={
        "cuisine_type": "Cuisine Type",
        "rating": "Average Rating",
        "city": "City"
    },
    title="Average Restaurant Ratings by Cuisine Type (Within 1km)"
)
fig.show()


### Code Commentary
This section evaluates average restaurant ratings by cuisine type within a 1 km radius of Liverpool and York. The dataset is first copied and cleaned (L4–6), where multi-cuisine entries are split and expanded using explode() to ensure each cuisine is treated independently for accurate aggregation.

The filter_within_radius() function (L8–11) uses the Haversine formula to compute great-circle distance between restaurants and a reference location. This modular structure makes the logic reusable and clear. Each city’s filtered results are labeled and merged (L13–21), and groupby() is applied to calculate average ratings by cuisine and city (L22). A plotly.express bar chart (L24–33) is used for its interactive and reader-friendly format, with grouped bars allowing direct visual comparison.
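The explode-then-group pattern can be seen on a toy frame. The rows and ratings below are made up for illustration; only the column names follow the final version.

```python
import pandas as pd

# Made-up ratings for illustration; one restaurant serves two cuisines
toy = pd.DataFrame({
    "city": ["Liverpool", "Liverpool", "York"],
    "cuisine_type": ["British, Italian", "British", "Italian"],
    "rating": [4.0, 5.0, 3.5],
})

# Split multi-cuisine labels so each cuisine gets its own row
toy["cuisine_type"] = toy["cuisine_type"].str.split(", ")
exploded = toy.explode("cuisine_type")

# Mean rating per (city, cuisine) pair
grouped = (
    exploded.groupby(["city", "cuisine_type"])["rating"]
    .mean()
    .reset_index()
)
```

The "British, Italian" restaurant contributes its 4.0 rating to both the British and the Italian averages for Liverpool, which is the behaviour the commentary describes as treating each cuisine independently.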

**Observations:** Indian cuisine in York received the highest average rating (4.28), while British cuisine led in Liverpool (4.21). Italian cuisine performed consistently in both cities (4.11 in Liverpool, 4.07 in York), suggesting broad appeal. Chinese cuisine had the lowest ratings overall, particularly in Liverpool (3.74), possibly reflecting service, taste perception, or expectations. “Unspecified” restaurants were surprisingly well-rated, especially in York (4.12), which may indicate hidden gems or data entry ambiguity. These differences suggest localized taste preferences and varying restaurant quality by cuisine and location.

# Task 4: Distance Matrix 
## 4.1 Original Code

In [10]:
import numpy as np
import time
from haversine import calculate_distance
from numba import njit, prange

# Plain (non-parallel) distance matrix
def distance_matrix(latitudes, longitudes):
    """
    Constructs a symmetric distance matrix using the Haversine formula.
    Each element (i, j) represents the distance between restaurant i and j.
    """
    n = len(latitudes)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = calculate_distance(
                (latitudes[i], longitudes[i]),
                (latitudes[j], longitudes[j])
            )
            matrix[i, j] = dist
            matrix[j, i] = dist
    return matrix

# Numba-compatible Haversine function
@njit
def calculate_distance2(x, y):
    """
    Optimized haversine formula for Numba.
    Input: two (lat, lon) pairs as tuples.
    Output: great-circle distance in kilometers.
    """
    lat1, lon1 = np.radians(x[0]), np.radians(x[1])
    lat2, lon2 = np.radians(y[0]), np.radians(y[1])
    
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat /2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon /2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371 * c  # Radius of Earth in km

# Parallelized version using Numba
@njit(parallel=True)
def parallelized_matrix(latitudes, longitudes):
    """
    Computes symmetric distance matrix in parallel using Numba.
    Fills only upper triangle, mirrors to lower to avoid redundant work.
    """
    n = len(latitudes)
    matrix = np.zeros((n, n))
    for i in prange(n):
        for j in range(i + 1, n):
            dist = calculate_distance2(
                (latitudes[i], longitudes[i]),
                (latitudes[j], longitudes[j])
            )
            matrix[i, j] = dist
            matrix[j, i] = dist
    return matrix

# Filter Newcastle restaurants
df4 = df2[df2["city"] == "newcastleupontyne"].copy()

names = df4['name'].to_numpy()
latitudes = df4['lat'].to_numpy().astype(float)
longitudes = df4['long'].to_numpy().astype(float)

# Compute standard (non-parallel) distance matrix
newcastle_matrix = distance_matrix(latitudes, longitudes)

# Display submatrix (illustrative)
submatrix = newcastle_matrix[1:5, 1:5]
print(submatrix)

# Compute matrices for all cities using a given function
cities = np.unique(df2['city'])

def full_df(function):
    """
    Computes a distance matrix for each city using the provided function.
    Returns: dict with city names as keys and distance matrices as values.
    """
    result = {}
    for city in cities:
        city_df = df2[df2['city'] == city]
        lat = city_df['lat'].to_numpy().astype(float)
        lon = city_df['long'].to_numpy().astype(float)
        result[city] = function(lat, lon)
    return result

# Compare execution times
start = time.time()
all_matrices1 = full_df(distance_matrix)
time1 = round(time.time() - start, 2)
print("Original approach time:", time1, "seconds")

start = time.time()
all_matrices2 = full_df(parallelized_matrix)
time2 = round(time.time() - start, 2)
print("Parallelized approach time:", time2, "seconds")


[[0.         4.62586357 3.84216705 3.9535909 ]
 [4.62586357 0.         0.78764011 0.8465605 ]
 [3.84216705 0.78764011 0.         0.55286682]
 [3.9535909  0.8465605  0.55286682 0.        ]]
Original approach time: 97.1 seconds
Parallelized approach time: 3.08 seconds


## 4.2 AI Generated Code
Source: OpenAI ChatGPT (2025). ChatGPT response to Amelia Bademosi, 10 May. Available at: https://chat.openai.com

My query:
I need to calculate distance matrices between restaurants using the Haversine.py formula. Write a NumPy- and Numba-compatible Python function that constructs a symmetric distance matrix from two arrays: latitudes and longitudes. The matrix should store the great-circle distance between each pair of points using parallel processing (Numba’s prange and njit). Time how long it takes to process. Do not use pandas or any libraries outside NumPy and Numba.



In [11]:
@njit
def haversine_numba(lat1, lon1, lat2, lon2):
    """
    Efficient implementation of the haversine formula using scalar floats.
    """
    R = 6371.0  # Earth radius in kilometers

    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)

    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    return R * c

@njit(parallel=True)
def build_distance_matrix(lat, lon):
    """
    Fully vectorized and parallelized distance matrix construction.
    Optimized for performance on large inputs.
    """
    n = lat.shape[0]
    result = np.empty((n, n), dtype=np.float64)

    for i in prange(n):
        for j in range(i, n):
            d = haversine_numba(lat[i], lon[i], lat[j], lon[j])
            result[i, j] = d
            result[j, i] = d  # use symmetry
    return result

In [12]:

# Testing code using the full_df function
start = time.time()
all_matrices3 = full_df(build_distance_matrix)
time3 = round(time.time() - start, 2)
print("Parallelized approach time:", time3, "seconds")

Parallelized approach time: 1.35 seconds


## 4.3 Critical Comparison
Both my original implementation and the AI-generated solution successfully construct symmetric distance matrices using the Haversine formula and Numba parallelization. However, each takes a distinct approach to function structure, optimization, and clarity. My original code separates concerns into three key functions: a plain loop-based distance_matrix, a Numba-compatible calculate_distance2, and a parallelized parallelized_matrix (L8–27). This modular structure makes the logic easy to follow and allows comparison between sequential and parallel methods. I also built a reusable full_df() wrapper to process distance matrices for each city, enabling broader benchmarking (L43–54). The results showed a significant performance improvement using the parallelized version: from 97.1 seconds down to 3.08 seconds in the run shown above.
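One caveat when reading these timings: Numba compiles a decorated function the first time it is called with a given set of argument types, so a single timed run of a Numba function likely includes that one-off compilation cost. A common pattern is to make an untimed warm-up call first. The helper below is a hypothetical sketch (`time_call` is not part of the notebook) and uses a plain Python stand-in workload so that it runs without Numba installed.

```python
import time

def time_call(function, *args, warmup=True):
    """Return (result, seconds) for one call, optionally after a warm-up call."""
    if warmup:
        function(*args)  # for a Numba function this triggers compilation
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs without Numba installed
def sum_of_squares(n):
    return sum(i * i for i in range(n))

result, seconds = time_call(sum_of_squares, 10_000)
```

Applied to the benchmark, `time_call(full_df, parallelized_matrix)` would report the steady-state cost, at the price of running the computation twice.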

The AI-generated version goes further in optimization. It eliminates tuple unpacking inside the loop by using scalar inputs (lat1, lon1, etc.), reducing overhead in haversine_numba. Its build_distance_matrix function uses explicit nested loops compiled by Numba (not NumPy vectorization, despite the wording of its docstring) and writes directly into an array allocated with np.empty(), which skips the zero-initialization that np.zeros() performs. As a result, it achieved an even faster runtime of approximately 1.35 seconds in the run shown above. For my final code, I will retain my modular structure and city-level benchmarking logic while replacing my inner parallelized function with the AI’s build_distance_matrix() for maximum performance and cleaner syntax.
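For contrast, a construction with no explicit Python loops can be written with NumPy broadcasting. This is a sketch for comparison only, not the method used in either version; `distance_matrix_broadcast` is a hypothetical name and the three coordinates are made up.

```python
import numpy as np

def distance_matrix_broadcast(lat, lon, radius_km=6371.0):
    """Pairwise haversine distances (km) via NumPy broadcasting, no explicit loops."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]   # n x n matrix of latitude differences
    dlon = lon[:, None] - lon[None, :]   # n x n matrix of longitude differences
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Three made-up points near Newcastle, for illustration only
lat = np.array([54.9794, 54.9700, 54.9900])
lon = np.array([-1.6146, -1.6000, -1.6300])
matrix = distance_matrix_broadcast(lat, lon)
```

The trade-off is memory: broadcasting creates several temporary n × n arrays and computes both triangles, whereas the loop-based versions fill the upper triangle and mirror it.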

## 4.4 Final Version

In [13]:
import numpy as np
import time
from numba import njit, prange

# Numba-compatible Haversine function with scalar inputs
@njit
def haversine_numba(lat1, lon1, lat2, lon2):
    R = 6371.0
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)
    a = np.sin(dlat /2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon /2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return R * c

# Parallelized distance matrix with Numba 
@njit(parallel=True)
def build_distance_matrix(lat, lon):
    n = lat.shape[0]
    result = np.empty((n, n), dtype=np.float64)
    for i in prange(n):
        for j in range(i, n):
            d = haversine_numba(lat[i], lon[i], lat[j], lon[j])
            result[i, j] = d
            result[j, i] = d  # use symmetry
    return result

# Compute matrices for all cities using any given function
def full_df(function):
    result = {}
    for city in np.unique(df2['city']):
        city_df = df2[df2['city'] == city]
        lat = city_df['lat'].to_numpy().astype(float)
        lon = city_df['long'].to_numpy().astype(float)
        result[city] = function(lat, lon)
    return result

# Parallel version (full dataset)
start = time.time()
all_cities_matrices = full_df(build_distance_matrix)
time4 = round(time.time() - start, 2)
print("Parallelized version time:", time4, "seconds")


Parallelized version time: 1.24 seconds


In [14]:
import plotly.graph_objects as go

times = [time1, time2, time3, time4]
labels = [
    "Plain Python (Loop)",
    "Numba Parallel (Tuple Input)",
    "AI Version (Scalar Input)",
    "Final Version"
]

fig = go.Figure(data=[
    go.Bar(
        x=labels, 
        y=times,
        text=[f"{t:.2f}s" for t in times],
        textposition='outside')
])

fig.update_layout(
    title="Execution Time Comparison: Distance Matrix Construction Methods",
    xaxis_title="Implementation",
    yaxis_title="Time (seconds)"
)

fig.show()


### Code Commentary

This final version constructs city-level distance matrices between restaurants using a parallelized approach with Numba. The central task is to calculate the pairwise great-circle distances between restaurants from their geographic coordinates using the Haversine formula. The calculation is implemented in `haversine_numba` (L5–13) using scalar float inputs, which reduces overhead compared to tuple-based inputs. Each call converts the two latitudes and the coordinate differences to radians before applying the formula. These conversions are not precomputed outside the loop, so converting the `lat` and `lon` arrays to radians once in `build_distance_matrix` would be a possible further optimization.
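
As a quick sanity check of the formula, the sketch below re-implements the same Haversine steps with the standard `math` module (no Numba) and evaluates a few cases whose answers are easy to reason about. The name `haversine_km` and the second coordinate pair are assumptions for illustration; the first pair is the reference point used in Task 1.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius_km * c

# A point is 0 km from itself
same_point = haversine_km(54.979359, -1.614639, 54.979359, -1.614639)

# One degree of latitude along a meridian equals radius * (pi / 180)
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)

# Distance does not depend on the direction of travel
forward = haversine_km(54.979359, -1.614639, 53.4084, -2.9916)
backward = haversine_km(53.4084, -2.9916, 54.979359, -1.614639)

print(same_point, round(one_degree_lat, 2), round(forward, 2), round(backward, 2))
```

The symmetry shown in the last two calls is the property that lets the matrix builder compute only one triangle and mirror it.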

The `build_distance_matrix` function (L16–24) iterates over restaurant pairs in parallel using `prange()`. Only the diagonal and upper triangle of the matrix are computed directly, and symmetry is used to mirror each value into the lower triangle. This roughly halves the number of Haversine evaluations, from n² to n(n+1)/2, although the asymptotic complexity remains O(n²). The result is a fully populated, symmetric distance matrix for each city. The `full_df` function (L27–33) applies the matrix-building function to each unique city in the dataset. Because each city is processed as its own subset, distances are only computed between restaurants in the same city.
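
To make the saving from symmetry concrete, the sketch below counts how many distance evaluations the `i <= j` loop pattern performs for a given number of points. The helper `count_pair_evaluations` is hypothetical and only mirrors the loop bounds; it does not compute any distances.

```python
def count_pair_evaluations(n):
    """Count Haversine calls made by the i <= j loop pattern for n points."""
    calls = 0
    for i in range(n):
        for j in range(i, n):  # same bounds as build_distance_matrix
            calls += 1
    return calls

n = 1000
upper_triangle_calls = count_pair_evaluations(n)  # equals n * (n + 1) / 2
full_matrix_calls = n * n                         # every (i, j) combination

print(upper_triangle_calls, full_matrix_calls)
```

Note that row `i` does `n - i` evaluations, so early rows carry more work than late ones; how evenly `prange` spreads that across threads depends on Numba's scheduling.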

**Observations:** In testing, this implementation completed the distance matrix generation for all cities in 1.24 seconds, a dramatic improvement over the plain nested-loop method, which took over 60 seconds. This illustrates the significant benefits of Numba compilation and parallel execution, particularly for large datasets with many pairwise operations.

# Task 5: Dispersion of Restaurants 
## 5.1 Original Code

In [15]:
import numpy as np
import plotly.express as px

all_cities = []
top_25_medians = []
bottom_25_medians = []

for city_name in all_cities_matrices:
    city_rows = df2[df2['city'] == city_name].reset_index(drop=True)
    
    # Skip if not enough restaurants
    if len(city_rows) < 4:
        continue

    ratings = city_rows['rating'].values
    distance_matrix = all_cities_matrices[city_name]

    # Calculate top and bottom thresholds
    top_cut = np.percentile(ratings, 75)
    bottom_cut = np.percentile(ratings, 25)

    # Find indices for top 25% and bottom 25%
    top_group = []
    bottom_group = []

    for i in range(len(ratings)):
        if ratings[i] >= top_cut:
            top_group.append(i)
        elif ratings[i] <= bottom_cut:
            bottom_group.append(i)

    # Skip if not enough restaurants in either group
    if len(top_group) < 2 or len(bottom_group) < 2:
        continue

    # Calculate median distance for top group
    top_distances = []
    for i in range(len(top_group)):
        for j in range(i + 1, len(top_group)):
            a = top_group[i]
            b = top_group[j]
            top_distances.append(distance_matrix[a, b])

    # Calculate median distance for bottom group
    bottom_distances = []
    for i in range(len(bottom_group)):
        for j in range(i + 1, len(bottom_group)):
            a = bottom_group[i]
            b = bottom_group[j]
            bottom_distances.append(distance_matrix[a, b])

    all_cities.append(city_name)
    top_25_medians.append(round(np.median(top_distances), 2))
    bottom_25_medians.append(round(np.median(bottom_distances), 2))

# Compare difference and sort
differences = []
for i in range(len(all_cities)):
    diff = round(bottom_25_medians[i] - top_25_medians[i], 2)
    differences.append((
        all_cities[i], 
        top_25_medians[i], 
        bottom_25_medians[i],
        diff
    ))

differences.sort(key=lambda x: x[3], reverse=True)
top_15 = differences[:15]

# Prepare for plot
city_labels = [x[0] for x in top_15]
top_values = [x[1] for x in top_15]
bottom_values = [x[2] for x in top_15]

fig = px.bar(
    x=city_labels * 2,
    y=top_values + bottom_values,
    color=["Top 25%"] * 15 + ["Bottom 25%"] * 15,
    barmode="group",
    labels={"x": "City", "y": "Median Distance (km)", "color": "Group"},
    title="Distances of Top and Bottom Rated Restaurants (Top 15 Cities)"
)
fig.update_layout(xaxis_tickangle=-45, height=500)
fig.show()


## 5.2 AI Generated Code
Source: OpenAI ChatGPT (2025). ChatGPT response to Amelia Bademosi, 12 May. Available at: https://chat.openai.com

My Query:
I have a NumPy-based distance matrix for each city, stored in a dictionary called `all_cities_matrices` (`all_cities_matrices['liverpool']` gives me the matrix for Liverpool). I also have a pandas DataFrame called `df2` with columns `city` and `rating`. I want to compare the clustering of top-rated vs bottom-rated restaurants in each city. Identify the top 25% and bottom 25% of restaurants in each city based on their `rating`. Use the distance matrix to calculate the median distance between restaurants within each of those groups.

In [16]:
import numpy as np

# Convert df2 to structured NumPy arrays (assuming it’s small enough to fit in memory)
cities = np.array(df2['city'].tolist())
ratings = np.array(df2['rating'].tolist())

unique_cities = np.unique(cities)
results = {}

for city in unique_cities:
    # Get indices for this city
    city_indices = np.where(cities == city)[0]
    city_ratings = ratings[city_indices]
    distance_matrix = all_cities_matrices[city.lower()]  # Assuming key is lowercase

    # Get thresholds
    q75 = np.percentile(city_ratings, 75)
    q25 = np.percentile(city_ratings, 25)

    # Top 25% and Bottom 25% index masks
    top_mask = city_ratings >= q75
    bottom_mask = city_ratings <= q25

    # Indices within the distance matrix
    top_indices = np.where(top_mask)[0]
    bottom_indices = np.where(bottom_mask)[0]

    # Helper to compute median pairwise distance in group
    def median_distance(group_indices):
        if len(group_indices) < 2:
            return np.nan  # Not enough points to form a pair
        dist_submatrix = distance_matrix[np.ix_(group_indices, group_indices)]
        triu_indices = np.triu_indices_from(dist_submatrix, k=1)
        pairwise_distances = dist_submatrix[triu_indices]
        return np.median(pairwise_distances)

    # Store results
    results[city] = {
        'top_25_median_dist': median_distance(top_indices),
        'bottom_25_median_dist': median_distance(bottom_indices)
    }

# Display result
for city, metrics in results.items():
    print(f"{city}: Top 25% median = {metrics['top_25_median_dist']:.2f} km, "
          f"Bottom 25% median = {metrics['bottom_25_median_dist']:.2f} km")


basildon: Top 25% median = 2.54 km, Bottom 25% median = 2.54 km
basingstoke: Top 25% median = 2.59 km, Bottom 25% median = 2.05 km
bath: Top 25% median = 0.83 km, Bottom 25% median = 0.92 km
bedford: Top 25% median = 2.50 km, Bottom 25% median = 2.61 km
birkenhead: Top 25% median = 1.53 km, Bottom 25% median = 2.01 km
birmingham: Top 25% median = 5.90 km, Bottom 25% median = 6.16 km
blackburn: Top 25% median = 3.15 km, Bottom 25% median = 1.46 km
blackpool: Top 25% median = 1.76 km, Bottom 25% median = 2.12 km
bolton: Top 25% median = 2.78 km, Bottom 25% median = 2.22 km
bournemouth: Top 25% median = 2.55 km, Bottom 25% median = 2.89 km
bradford: Top 25% median = 4.41 km, Bottom 25% median = 2.66 km
brighton: Top 25% median = 1.30 km, Bottom 25% median = 1.42 km
bristol: Top 25% median = 2.24 km, Bottom 25% median = 2.95 km
burnley: Top 25% median = 2.80 km, Bottom 25% median = 2.11 km
cambridge: Top 25% median = 1.77 km, Bottom 25% median = 2.24 km
canterbury: Top 25% median = 0.60 km

## 5.3 Critical Comparison
The AI-generated code provides a clean, modular approach that uses NumPy indexing and a helper function to calculate the median distance between top and bottom rated restaurants in each city. It converts the DataFrame columns to NumPy arrays (`cities`, `ratings`), then uses boolean masks and `np.where()` to isolate restaurants by city and rating group. The distance calculation is handled by the `median_distance` helper, which improves reusability and reduces repetition.
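
One detail worth checking in the mask-based approach is how the inclusive thresholds (`>=` and `<=`) behave when many ratings are tied. The sketch below uses a made-up ratings array (an assumption, not data from `df2`) in which the 25th and 75th percentiles coincide, so the same restaurants satisfy both masks.

```python
import numpy as np

# Hypothetical ratings for one small city (heavily tied on purpose)
ratings = np.array([4.0, 4.0, 4.0, 4.0, 4.5, 3.5])

q75 = np.percentile(ratings, 75)
q25 = np.percentile(ratings, 25)

# Same inclusive comparisons as the mask-based code
top_idx = np.where(ratings >= q75)[0]
bottom_idx = np.where(ratings <= q25)[0]

# Positions that land in both groups
overlap = np.intersect1d(top_idx, bottom_idx)

print(q25, q75, top_idx, bottom_idx, overlap)
```

An `if`/`elif` chain that tests the top threshold first, as in my original loop, assigns each tied restaurant to the top group only, so the two approaches can produce different groups on tied data.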

In contrast, my original code takes a more step-by-step and manual approach. I used explicit for loops to build the top and bottom groups and manually looped over index pairs to collect distances. This made the process easier for me to understand and aligned better with the structure of my earlier tasks. While my version is more verbose and less optimized, it is readable and transparent.

From the AI code, I learned the value of simplifying repeated logic into a helper function. For my final version, I will refactor the median distance section to remove duplication while keeping the rest of my logic as-is. This keeps my code consistent with earlier tasks and easier for a general reader to follow, while still improving structure and clarity.

## 5.4 Final Version

In [17]:
import numpy as np
import plotly.express as px

# Function to calculate median distance within a group of indices
def median_distance(group_indices, distance_matrix):
    if len(group_indices) < 2:
        return np.nan
    
    # Get only distances between different restaurants
    submatrix = distance_matrix[np.ix_(group_indices, group_indices)]
    upper_tri = np.triu_indices(len(group_indices), k=1)
    pairwise_distances = submatrix[upper_tri]
    
    return round(np.median(pairwise_distances), 2)

# Lists to store city names and median distances
all_cities = []
top_25_medians = []
bottom_25_medians = []

# Loop through each city
for city_name in all_cities_matrices:
    city_rows = df2[df2['city'] == city_name].reset_index(drop=True)

    if len(city_rows) < 4:
        continue  # Not enough data

    ratings = city_rows['rating'].values
    distance_matrix = all_cities_matrices[city_name]

    top_cut = np.percentile(ratings, 75)
    bottom_cut = np.percentile(ratings, 25)

    # Get indices for top and bottom 25%
    top_group = [i for i in range(len(ratings)) if ratings[i] >= top_cut]
    bottom_group = [i for i in range(len(ratings)) if ratings[i] <= bottom_cut]

    if len(top_group) < 2 or len(bottom_group) < 2:
        continue

    # Use helper function for cleaner median calculation
    top_median = median_distance(top_group, distance_matrix)
    bottom_median = median_distance(bottom_group, distance_matrix)

    all_cities.append(city_name)
    top_25_medians.append(top_median)
    bottom_25_medians.append(bottom_median)

# Create list of city results and sort by difference
differences = []
for i in range(len(all_cities)):
    diff = round(bottom_25_medians[i] - top_25_medians[i], 2)
    differences.append((
        all_cities[i], 
        top_25_medians[i], 
        bottom_25_medians[i], 
        diff
    ))

differences.sort(key=lambda x: x[3], reverse=True)
top_15 = differences[:15]

# Prepare values for plotting
city_labels = [x[0] for x in top_15]
top_values = [x[1] for x in top_15]
bottom_values = [x[2] for x in top_15]

# Plot bar chart
fig = px.bar(
    x=city_labels * 2,
    y=top_values + bottom_values,
    # Use the actual number of cities so the lists stay aligned if fewer than 15 qualify
    color=["Top 25%"] * len(city_labels) + ["Bottom 25%"] * len(city_labels),
    barmode="group",
    labels={"x": "City", "y": "Median Distance (km)", "color": "Group"},
    title="Median Pairwise Distance Within Top and Bottom Rated Groups (Top 15 Cities)"
)
fig.update_layout(xaxis_tickangle=-45, height=500)
fig.show()


### Code Commentary
This code compares the clustering of top-rated and bottom-rated restaurants across UK cities using distance matrices. For each city in `all_cities_matrices`, restaurant ratings are extracted from `df2` (L13–14) and used to compute the 75th and 25th percentiles (L19–20). These thresholds define the top 25% and bottom 25% of restaurants, based on their average ratings. Restaurants meeting each threshold are stored as index lists using list comprehensions (L23–24). To calculate how close these restaurants are to each other, I defined a function `median_distance()` (L4–11). This function uses `np.ix_()` to extract a square distance submatrix for the group, and `np.triu_indices()` to ignore self-comparisons and duplicates. The result is a 1D array of all unique pairwise distances, and I return the median. This median value is computed separately for the top-rated and bottom-rated groups (L29–30). Results for each city are stored in lists, and later compiled into a list of tuples including the difference between bottom and top medians (L36–38). The list is sorted by this difference in descending order (L40), and the 15 cities with the largest difference are visualized with Plotly.
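
The indexing steps inside `median_distance()` can be traced on a small example. The 4 × 4 matrix and the group positions below are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical symmetric distance matrix (km) for four restaurants
distance_matrix = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 4.0, 5.0],
    [2.0, 4.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
])
group = [0, 2, 3]  # positions of the restaurants in one rating group

# np.ix_ selects the rows AND columns for the group, giving a 3 x 3 block
submatrix = distance_matrix[np.ix_(group, group)]

# k=1 skips the diagonal, so only pairs of different restaurants remain
rows, cols = np.triu_indices(len(group), k=1)
pairwise = submatrix[rows, cols]

median_km = float(np.median(pairwise))
print(pairwise, median_km)
```

The three unique pairs here are (0, 2), (0, 3) and (2, 3), with distances 2, 3 and 6 km, so the median is 3 km.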

**Observations:** The chart shows the 15 cities with the largest bottom-minus-top difference, so it highlights cities where the top-rated restaurants are more clustered, shown by a smaller median distance. This pattern is partly a result of that selection: in the per-city output printed in section 5.2, nine of the fifteen complete entries show a smaller median for the top group, five show the reverse (for example blackburn and bradford), and one is tied. This suggests that higher-rated venues often, but not always, sit near each other, likely in central or high-traffic areas, while lower-rated restaurants are more often spread out, possibly reflecting less commercial or suburban locations. With that caveat, the trend is consistent with the idea that quality restaurants tend to cluster in competitive or desirable areas. This insight could be useful for market analysis or restaurant expansion strategies, especially in identifying high-performance clusters.
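
A simple way to check how widespread the pattern is, rather than relying on the top-15 chart alone, is to tally the direction of the difference for every city. The sketch below does this for the fifteen complete rows of the printed output in section 5.2; canterbury is left out because its line is cut off. The variable names are my own.

```python
# (top_25_median_km, bottom_25_median_km) copied from the output in section 5.2
medians = {
    "basildon": (2.54, 2.54),
    "basingstoke": (2.59, 2.05),
    "bath": (0.83, 0.92),
    "bedford": (2.50, 2.61),
    "birkenhead": (1.53, 2.01),
    "birmingham": (5.90, 6.16),
    "blackburn": (3.15, 1.46),
    "blackpool": (1.76, 2.12),
    "bolton": (2.78, 2.22),
    "bournemouth": (2.55, 2.89),
    "bradford": (4.41, 2.66),
    "brighton": (1.30, 1.42),
    "bristol": (2.24, 2.95),
    "burnley": (2.80, 2.11),
    "cambridge": (1.77, 2.24),
}

# A smaller median distance means the group is more tightly clustered
top_tighter = [c for c, (top, bottom) in medians.items() if top < bottom]
bottom_tighter = [c for c, (top, bottom) in medians.items() if top > bottom]
tied = [c for c, (top, bottom) in medians.items() if top == bottom]

print(len(top_tighter), len(bottom_tighter), len(tied))
```

Running the same tally over the full `differences` list would show whether the trend holds across all cities and not only in this alphabetical sample.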