<font size="8">Homework 2: AirBnB Document Database</font>

Group number: `16`

Group members:
1. `Daan van Holten : 20240681`
2. `Benedikt Rugaber : 20240500`
3. `Joshua Wher : 20240501`

The Homework 2 is comprised of two parts:
1. Data modelling (15 points).
2. Queries to database to answer the questions (25 points).


**Key tasks include**:

1. Streamlining the data collection process.
2. Cleaning up the data and optimizing what will be returned for each use case.
3. Applying the correct patterns to speed up common queries.
4. Ensuring departments get accurate and relevant information from the database.
5. Sharing the updated data model schema with other departments.

**Good Practices**: [Check Chapter 6, Mastering MongoDB]

1. All newly created fields should have capitalized names.
2. New queries should work with the most up-to-date database version. If you make multiple changes, all queries should still work after the final updates.
3. For some queries, you may need to change the database schema.
4. When you are applying specific patterns, like polymorphic, subset, or bucket, name them accordingly. 
5. Document each major transformation using this format:
*“We applied {transformation name} because {reasoning behind it}. We expect {change/result} based on {observable measure, such as query speed, number of documents returned, index use, etc.}.”*

</font>


**Data Cleanup and Schema Adjustments:** [9 points in total]

1) Before working on the queries below, review the data and adjust the schema based on the typical use case described.

**Typical Use Case**: The most common use of the database is to show property listing information to customers. A query retrieves a listing document from the database. Currently, retrieving a listing takes too long. Decide what information should be included in a typical query and optimize the structure accordingly. For example, customers usually only need a sample of reviews, not all reviews (even though all reviews are stored). They also don’t need past transaction data. Update the document schema to fit this use case. This might involve creating new collections or documents.

**Data Cleanup**: Review the data for any errors (such as transactions that don’t belong to the listing) or unnecessary duplication, and clean it up where needed.

In [2]:
# Python Connector


# or #!conda install -y pymongo

from datetime import datetime
from pprint import pprint
import time
from bson.objectid import ObjectId

from pymongo import MongoClient

user="AzureDiamond"
password="hunter2"
host="localhost"
port="27017"
protocol="mongodb"

client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}")

# Database check
db = client.sample_airbnb 
print(f"Database info: {db}\n")
db.name 



Database info: Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'sample_airbnb')



'sample_airbnb'

# Loading and Inspecting database


In [3]:
# Collections are inside our Database 'sample_analytics'

collection_list = db.list_collection_names()

print(f"The database contains {len(collection_list)} collections")
print(f"All collections: {collection_list[0:]}")
print(f"Collection {collection_list[0]} contains {db[collection_list[0]].count_documents({})} documents")

The database contains 4 collections
All collections: ['reviews', 'listingsAndReviews_new', 'listings', 'hosts']
Collection reviews contains 149792 documents


In [4]:
for doc in db.listingsAndReviews_new.find().limit(1):
    pprint(doc)


{'_id': '10006546',
 'access': 'We are always available to help guests. The house is fully '
           'available to guests. We are always ready to assist guests. when '
           'possible we pick the guests at the airport.  This service transfer '
           'have a cost per person. We will also have service "meal at home" '
           'with a diverse menu and the taste of each. Enjoy the moment!',
 'accommodates': 8,
 'address': {'country': 'Portugal',
             'country_code': 'PT',
             'government_area': 'Cedofeita, Ildefonso, Sé, Miragaia, Nicolau, '
                                'Vitória',
             'location': {'coordinates': [-8.61308, 41.1413],
                          'is_location_exact': False,
                          'type': 'Point'},
             'market': 'Porto',
             'street': 'Porto, Porto, Portugal',
             'suburb': ''},
 'amenities': ['TV',
               'Cable TV',
               'Wifi',
               'Kitchen',
              

In [5]:
# Retrieve a single document and extract its keys
sample_document = db.listingsAndReviews_new.find_one()
keys = sample_document.keys()

print("Keys in a listing:")
for key in keys:
    print(key)

Keys in a listing:
_id
listing_url
name
summary
space
description
neighborhood_overview
notes
transit
access
interaction
house_rules
property_type
room_type
bed_type
minimum_nights
maximum_nights
cancellation_policy
last_scraped
calendar_last_scraped
first_review
last_review
accommodates
bedrooms
beds
number_of_reviews
bathrooms
amenities
price
security_deposit
cleaning_fee
extra_people
guests_included
images
address
availability
reviews
host_about
host_has_profile_pic
host_id
host_identity_verified
host_is_superhost
host_listings_count
host_location
host_name
host_neighbourhood
host_picture_url
host_response_rate
host_response_time
host_thumbnail_url
host_total_listings_count
host_url
host_verifications
review_scores_checkin
review_scores_cleanliness
review_scores_communication
review_scores_location
review_scores_rating
review_scores_value
reviews_copy1
reviews_copy2
reviews_copy3
reviews_copy4
transactions


Redundancy in the keys:

Many large text fields repeat information (e.g., summary, space, and description say almost the same thing).
Multiple copies of reviews (reviews, reviews_copy1, reviews_copy2, reviews_copy3) exist, creating unnecessary duplication.

Heavy Documents:

Reviews are embedded directly in the listing document. Over time, this becomes a massive array, slowing down queries.
Transactions, calendar, and scraped metadata are present but irrelevant to customers browsing listings.

Unstructured Fields:

Fields like notes, transit, access, interaction, and house_rules are helpful, but not always consistently filled.

# Data Cleanup and Schema Adjustments

### Summary
The original listingsAndReviews_new collection stored all data — listings, hosts, and reviews — in a single document. While this embedding approach simplified access, it caused performance problems due to oversized documents, repeated fields (e.g., multiple versions of reviews), and slow queries. Additionally, indexing had not been applied, and the rapid growth of reviews made long-term scalability difficult.

To address these issues, we normalized the schema by splitting the data into three collections: listings, hosts, and reviews. We used referencing patterns to improve flexibility, reduced document bloat by limiting embedded arrays, and cleaned up duplicate records. We also applied indexes to fields commonly used in BI queries to enhance query speed.


Below, we created two functions. The first one is to_float_safe.
We applied this transformation to standardize numeric fields like price, cleaning_fee, and bathrooms, converting messy strings and Decimal128 types into clean floats.

**Why:** To prevent query errors and ensure consistent numeric comparisons.
Result: More robust and reliable numeric processing across all listings.

transform_document(doc) Function
We applied this transformation to flatten and normalize the data into three collections: listings, hosts, and reviews.

**Why:** The original structure stored everything in a single document, causing inefficient queries and poor scalability.
Changes:
- Flattened nested fields (e.g. address → Location)
- Moved reviews to a separate collection
- Created clear, consistent, capitalized field names
- Linked data using Host_ID and Listing_ID

Result:
- Cleaner structure using subset + reference patterns
- Better query performance and easier maintenance
- Separation of fast-growing review data

In [6]:
#Extra imports 
from pymongo import MongoClient
from bson.decimal128 import Decimal128
from pprint import pprint

In [7]:
def to_float_safe(val):
    if isinstance(val, Decimal128):
        return float(val.to_decimal())
    try:
        return float(str(val).replace("$", "").replace(",", ""))
    except (TypeError, ValueError):
        return 0.0


In [8]:
def transform_document(doc):
    # Extract and flatten location
    location = {
        "City": doc.get("address", {}).get("market", ""),
        "Coordinates": doc.get("address", {}).get("location", {}).get("coordinates", []),
        "Area": doc.get("address", {}).get("government_area", ""),
        "Country": doc.get("address", {}).get("country", ""),
        "Street": doc.get("address", {}).get("street", "")
    }

    # Flat listing schema with more comprehensive details
    clean_listing = {
        "_id": doc["_id"],
        "Name": doc.get("name", ""),
        "Listing_URL": doc.get("listing_url", ""),
        "Location": location,
        "Property_Type": doc.get("property_type", ""),
        "Room_Type": doc.get("room_type", ""),
        "Bedrooms": doc.get("bedrooms", 0),
        "Beds": doc.get("beds", 0),
        "Bed_Type": doc.get("bed_type", ""),
        "Bathrooms": to_float_safe(doc.get("bathrooms")),
        "Accommodates": doc.get("accommodates", 0),
        "Amenities": doc.get("amenities", []),
        "Price": to_float_safe(doc.get("price")),
        "Cleaning_Fee": to_float_safe(doc.get("cleaning_fee")),
        "Picture_URL": doc.get("images", {}).get("picture_url", ""),
        "Host_ID": doc.get("host_id", ""),
        
        # Additional useful fields
        "Cancellation_Policy": doc.get("cancellation_policy", ""),
        "Minimum_Nights": int(doc.get("minimum_nights", "0") or 0),
        "Maximum_Nights": int(doc.get("maximum_nights", "0") or 0),
        
        # Summary and description fields
        "Summary": doc.get("summary", ""),
        "Description": doc.get("description", ""),
        "Neighborhood_Overview": doc.get("neighborhood_overview", ""),
        
        # Review-related summary fields
        "Number_of_Reviews": doc.get("number_of_reviews", 0),
        "First_Review_Date": doc.get("first_review", None),
        "Last_Review_Date": doc.get("last_review", None),
        
        # Review scores
        "Review_Scores": {
            "Rating": doc.get("review_scores_rating", None),
            "Accuracy": doc.get("review_scores_accuracy", None),
            "Cleanliness": doc.get("review_scores_cleanliness", None),
            "Check_In": doc.get("review_scores_checkin", None),
            "Communication": doc.get("review_scores_communication", None),
            "Location": doc.get("review_scores_location", None),
            "Value": doc.get("review_scores_value", None)
        },

        # Add to clean_listing or create a separate transactions collection
        "Transactions": {
            "Total_Count": doc.get("transactions", {}).get("transaction_count", 0),
            "Bucket_Start_Date": doc.get("transactions", {}).get("bucket_start_date", None),
            "Bucket_End_Date": doc.get("transactions", {}).get("bucket_end_date", None),
            # Optionally, include summary statistics of transactions
}
    }

    # Create host object with comprehensive details
    clean_host = {
        "_id": doc.get("host_id", ""),
        "Name": doc.get("host_name", ""),
        "Location": doc.get("host_location", ""),
        "Response_Rate": doc.get("host_response_rate", None),
        "Response_Time": doc.get("host_response_time", ""),
        "Is_Superhost": doc.get("host_is_superhost", False),
        "Total_Listings": doc.get("host_total_listings_count", 0),
        "Verification_Methods": doc.get("host_verifications", []),
        "Profile_Picture": doc.get("host_picture_url", ""),
        "About": doc.get("host_about", ""),

        "Host_Verification": {
            "Has_Profile_Picture": doc.get("host_has_profile_pic", False),
            "Identity_Verified": doc.get("host_identity_verified", False),
            "Verification_Methods": doc.get("host_verifications", [])
        }
    }

    # Split reviews - using only the primary 'reviews' list
    cleaned_reviews = [{
        "_id": r.get("_id", str(uuid.uuid4())),  # Ensure unique ID
        "Listing_ID": doc["_id"],
        "Reviewer_ID": r.get("reviewer_id", ""),
        "Reviewer_Name": r.get("reviewer_name", ""),
        "Date": r.get("date", None),
        "Comments": r.get("comments", "")
    } for r in doc.get("reviews", [])]  # Only use primary reviews list

    return clean_listing, clean_host, cleaned_reviews

In [None]:
import uuid  # Add this import for generating unique IDs 

# Collections
raw_col = db['listingsAndReviews_new']
listings_col = db['listings']
hosts_col = db['hosts']
reviews_col = db['reviews']

# Cleanup old collections if re-running
listings_col.drop()
hosts_col.drop()
reviews_col.drop()

# Process and insert documents
count = 0
for doc in raw_col.find():
    listing_doc, host_doc, review_docs = transform_document(doc)

    # Insert/Update documents
    listings_col.insert_one(listing_doc)
    
    # Upsert host to avoid duplicates
    hosts_col.update_one(
        {"_id": host_doc["_id"]}, 
        {"$set": host_doc}, 
        upsert=True
    )
    
    # Insert reviews
    if review_docs:
        reviews_col.insert_many(review_docs)

    count += 1




In [10]:
#Sanity check
print("Sample listing:")
pprint(listings_col.find_one())

print("\nSample host:")
pprint(hosts_col.find_one())


print("\nSample review:")
pprint(reviews_col.find_one())


Sample listing:
{'Accommodates': 8,
 'Amenities': ['TV',
               'Cable TV',
               'Wifi',
               'Kitchen',
               'Paid parking off premises',
               'Smoking allowed',
               'Pets allowed',
               'Buzzer/wireless intercom',
               'Heating',
               'Family/kid friendly',
               'Washer',
               'First aid kit',
               'Fire extinguisher',
               'Essentials',
               'Hangers',
               'Hair dryer',
               'Iron',
               'Pack ’n Play/travel crib',
               'Room-darkening shades',
               'Hot water',
               'Bed linens',
               'Extra pillows and blankets',
               'Microwave',
               'Coffee maker',
               'Refrigerator',
               'Dishwasher',
               'Dishes and silverware',
               'Cooking basics',
               'Oven',
               'Stove',
               'Cleaning be

### [Review Collection] Referencing Pattern
We applied the referencing pattern by moving reviews from embedded arrays into a separate reviews collection. This was done because embedded reviews caused document bloat and slowed down read operations. We expect improved scalability and query speed for listings, based on reduced document size and independent indexing of reviews on fields such as LISTING_ID and date.

### [Hosts Collection] Referencing Pattern
We applied the referencing pattern for host information by extracting host metadata into its own hosts collection, linked by host_id. This was done to avoid redundant storage of identical host data across multiple listings. We expect improved maintainability and reduced duplication, especially for frequent superhosts with many listings.

### [Listings Collection] Subset pattern + schema flatten
We applied a subset pattern by moving non-core data (e.g., host details, reviews, and transactions) into separate collections, and flattening embedded fields like address into a Location dictionary.
This was done to reduce document size and simplify queries on key listing fields.
We expect faster read times and more efficient index usage for applications that only need core listing data (e.g., dashboards, filters, property search).




# Indexing

### Indexing for queery optimization
We applied targeted indexes across the listings, hosts, and reviews collections to support common query patterns and improve read performance.
This was done because filtering, sorting, and aggregation are central to business use cases like dashboards, search filters, and property recommendations.

On the listings collection, we indexed fields like Location.City, Price, Property_Type, and Accommodates to support user-side filters. We also created a compound index on City + Price for optimized multi-criteria queries, and a geospatial index on Location.Coordinates for future proximity-based features.

On the hosts collection, we indexed Is_Superhost, Total_Listings, and Response_Rate to support host ranking, filtering, and reporting.

On the reviews collection, we indexed Listing_ID, Reviewer_ID, and Date, including a compound index to support retrieval of recent reviews for a listing.

We expect improved query speed based on index use, especially in analytical workloads, host/reviewer tracking pages, and front-end search features.



In [11]:
# Indexes for listings
listings_col.create_index([("Location.City", 1)])  # For city-based searches
listings_col.create_index([("Price", 1)])  # Price filtering
listings_col.create_index([("Property_Type", 1)])  # Property type filtering
listings_col.create_index([("Host_ID", 1)])  # Host-specific queries
listings_col.create_index([("Accommodates", 1)])  # Capacity-based searches
listings_col.create_index([("Review_Scores.Rating", -1)])  # Sorting by rating
listings_col.create_index([("Location.City", 1), ("Price", 1)])  # Compound index for city-price queries

# Indexes for hosts
hosts_col.create_index([("Is_Superhost", 1)])  # Superhost filtering
hosts_col.create_index([("Response_Rate", -1)])  # Sorting by response rate
hosts_col.create_index([("Total_Listings", -1)])  # Hosts with most listings
hosts_col.create_index([("Location", 1)])  # Location-based host searches

# Indexes for reviews
reviews_col.create_index([("Listing_ID", 1)])  # Reviews for a specific listing
reviews_col.create_index([("Reviewer_ID", 1)])  # Reviews by a specific reviewer
reviews_col.create_index([("Date", -1)])  # Recent reviews first
reviews_col.create_index([("Listing_ID", 1), ("Date", -1)])  # Compound index for listing reviews sorted by date

# Geospatial index for location (if you want to do proximity searches)
listings_col.create_index([("Location.Coordinates", "2dsphere")])

print("Indexes successfully created.")

Indexes successfully created.


In [12]:
# Checking if the indexing worked.

print("Indexes on listings:")
pprint(listings_col.index_information())

print("\nIndexes on hosts:")
pprint(hosts_col.index_information())

print("\nIndexes on reviews:")
pprint(reviews_col.index_information())

Indexes on listings:
{'Accommodates_1': {'key': [('Accommodates', 1)], 'v': 2},
 'Host_ID_1': {'key': [('Host_ID', 1)], 'v': 2},
 'Location.City_1': {'key': [('Location.City', 1)], 'v': 2},
 'Location.City_1_Price_1': {'key': [('Location.City', 1), ('Price', 1)],
                             'v': 2},
 'Location.Coordinates_2dsphere': {'2dsphereIndexVersion': 3,
                                   'key': [('Location.Coordinates',
                                            '2dsphere')],
                                   'v': 2},
 'Price_1': {'key': [('Price', 1)], 'v': 2},
 'Property_Type_1': {'key': [('Property_Type', 1)], 'v': 2},
 'Review_Scores.Rating_-1': {'key': [('Review_Scores.Rating', -1)], 'v': 2},
 '_id_': {'key': [('_id', 1)], 'v': 2}}

Indexes on hosts:
{'Is_Superhost_1': {'key': [('Is_Superhost', 1)], 'v': 2},
 'Location_1': {'key': [('Location', 1)], 'v': 2},
 'Response_Rate_-1': {'key': [('Response_Rate', -1)], 'v': 2},
 'Total_Listings_-1': {'key': [('Total_Listings', -

# Data cleanup

## Removing duplicates
We applied a duplicate detection and removal step on the reviews collection using a combination of Reviewer_ID, Date, Listing_ID, and Comments as the composite key.
We applied this transformation to improve data quality and ensure the same reviewer doesn't appear to post identical reviews multiple times, which could bias aggregation metrics.
We expect cleaner aggregations, more accurate review counts and averages, and fewer inconsistencies in analytical outputs that rely on review data.

In [13]:
# More comprehensive duplicate detection in reviews
pipeline = [
    {"$group": {
        "_id": {
            "reviewer": "$Reviewer_ID", 
            "date": "$Date", 
            "listing_id": "$Listing_ID",  # Add listing_id to make detection more precise
            "comments": "$Comments"  # Include comments to catch near-duplicate reviews
        },
        "count": {"$sum": 1},
        "ids": {"$push": "$_id"},
        "review_details": {"$push": {
            "id": "$_id",
            "listing_id": "$Listing_ID",
            "reviewer_name": "$Reviewer_Name"
        }}
    }},
    {"$match": {"count": {"$gt": 1}}},
    {"$project": {
        "duplicate_count": "$count",
        "duplicate_ids": "$ids",
        "reviewer_id": "$_id.reviewer",
        "date": "$_id.date",
        "listing_id": "$_id.listing_id",
        "review_details": "$review_details"
    }}
]

# Execute the aggregation
dupes = list(reviews_col.aggregate(pipeline))

# Detailed reporting
print(f"Found {len(dupes)} groups of duplicate reviews:")
for dupe in dupes:
    print("\nDuplicate Review Group:")
    print(f"  Reviewer ID: {dupe['reviewer_id']}")
    print(f"  Date: {dupe['date']}")
    print(f"  Listing ID: {dupe['listing_id']}")
    print(f"  Duplicate Count: {dupe['duplicate_count']}")
    print("  Duplicate IDs:")
    for detail in dupe['review_details']:
        print(f"    - ID: {detail['id']}, Reviewer Name: {detail['reviewer_name']}")

# Optional: Calculate total number of duplicate documents
total_duplicate_docs = sum(dupe['duplicate_count'] - 1 for dupe in dupes)
print(f"\nTotal duplicate documents to be removed: {total_duplicate_docs}")

Found 1 groups of duplicate reviews:

Duplicate Review Group:
  Reviewer ID: 19142459
  Date: 2014-08-15 04:00:00
  Listing ID: 1321072
  Duplicate Count: 2
  Duplicate IDs:
    - ID: 17667433, Reviewer Name: Stacey
    - ID: 17667434, Reviewer Name: Stacey

Total duplicate documents to be removed: 1


In [13]:
# Remove duplicates, keeping the first occurrence
for dupe in dupes:
    # Keep the first document, remove the rest
    ids_to_remove = dupe['duplicate_ids'][1:]
    result = reviews_col.delete_many({"_id": {"$in": ids_to_remove}})
    print(f"Removed {result.deleted_count} duplicate reviews for {dupe['reviewer_id']} on {dupe['date']}")

Removed 1 duplicate reviews for 19142459 on 2014-08-15 04:00:00


## Checking invalid timestamps
We performed a validation check on review timestamps to ensure none were set far in the future. No invalid dates were found, and no further action was required.

In [14]:
from datetime import datetime

threshold = datetime(2025, 1, 1)
weird_reviews = list(reviews_col.find({"date": {"$gt": threshold}}))
print(f"Found {len(weird_reviews)} suspicious future-dated reviews.")

Found 0 suspicious future-dated reviews.


## Checking for missing values
We applied a custom missing value check using a manual field-by-field scan across the listings collection.
We applied this to identify critical fields (e.g., Price, City, Host_ID) with missing or incomplete data, including empty strings, nulls, or missing arrays.
We expect this step to improve data quality awareness and guide further cleaning actions, especially for fields used in filters, pricing, or user display logic.

In [None]:
def manual_missing_fields_check(collection):
    # Comprehensive list of fields to check
    fields_to_check = [
        # Basic Listing Information
        'Name', 'Listing_URL', 'Property_Type', 'Room_Type', 'Bed_Type',
        
        # Location Details
        'Location.City', 'Location.Country', 'Location.Area', 'Location.Street', 
        'Location.Coordinates',
        
        # Pricing and Accommodation
        'Price', 'Cleaning_Fee', 'Security_Deposit', 'Extra_People', 
        'Accommodates', 'Bedrooms', 'Beds', 'Bathrooms',
        
        # Host Information
        'Host_ID', 'Host_Name', 
        
        # Review-related Fields
        'Number_of_Reviews', 'First_Review_Date', 'Last_Review_Date',
        
        # Review Scores
        'Review_Scores.Rating', 'Review_Scores.Accuracy', 'Review_Scores.Cleanliness', 
        'Review_Scores.Check_In', 'Review_Scores.Communication', 
        'Review_Scores.Location', 'Review_Scores.Value',
        
        # Booking Details
        'Minimum_Nights', 'Maximum_Nights', 'Cancellation_Policy',
        
        # Images
        'Picture_URL', 'Images.Picture_URL', 'Images.Thumbnail_URL', 
        'Images.Medium_URL', 'Images.XL_Picture_URL',
        
        # Descriptive Fields
        'Summary', 'Description', 'Neighborhood_Overview', 'Space', 
        'Notes', 'Transit', 'Access', 'Interaction', 'House_Rules',
        
        # Amenities 
        'Amenities',
        
        # Availability
        'Calendar.Last_Scraped', 'Calendar.Calendar_Last_Scraped',
        'Calendar.Availability.30_Days', 'Calendar.Availability.60_Days', 
        'Calendar.Availability.90_Days', 'Calendar.Availability.365_Days'
    ]
    
    total_documents = collection.count_documents({})
    missing_fields = {}

    def check_field_missing(field):
        field_parts = field.split('.')
        
        if len(field_parts) == 1:
            query = {
                "$or": [
                    {field: None},
                    {field: ""},
                    {field: 0},
                    {field: {"$exists": False}},
                    {field: {"$size": 0}} 
                ]
            }
        else:
            query = {
                "$or": [
                    {f"{field_parts[0]}.{field_parts[1]}": None},
                    {f"{field_parts[0]}.{field_parts[1]}": ""},
                    {f"{field_parts[0]}.{field_parts[1]}": 0},
                    {f"{field_parts[0]}.{field_parts[1]}": {"$exists": False}},
                    {f"{field_parts[0]}.{field_parts[1]}": {"$size": 0}}
                ]
            }

        return collection.count_documents(query)

    print(f"Comprehensive Missing Fields Analysis")
    print(f"Total Documents: {total_documents}\n")
    print("Missing Fields Breakdown:")

    field_missing_data = []

    for field in fields_to_check:
        missing_count = check_field_missing(field)
        
        if missing_count > 0:
            percentage = (missing_count / total_documents) * 100
            field_missing_data.append((field, missing_count, percentage))

    # Sort by percentage in descending order
    field_missing_data.sort(key=lambda x: x[2], reverse=True)

    # Print and collect results
    missing_fields = {}
    for field, count, percentage in field_missing_data:
        print(f"  • {field}: {count} ({percentage:.2f}%)")
        missing_fields[field] = {
            'count': count,
            'percentage': percentage
        }

    return {
        'total_documents': total_documents,
        'missing_fields': missing_fields
    }

# Run the analysis
missing_fields_report = manual_missing_fields_check(listings_col)

Comprehensive Missing Fields Analysis
Total Documents: 5555

Missing Fields Breakdown:
  • Security_Deposit: 5555 (100.00%)
  • Extra_People: 5555 (100.00%)
  • Host_Name: 5555 (100.00%)
  • Review_Scores.Accuracy: 5555 (100.00%)
  • Images.Picture_URL: 5555 (100.00%)
  • Images.Thumbnail_URL: 5555 (100.00%)
  • Images.Medium_URL: 5555 (100.00%)
  • Images.XL_Picture_URL: 5555 (100.00%)
  • Space: 5555 (100.00%)
  • Notes: 5555 (100.00%)
  • Transit: 5555 (100.00%)
  • Access: 5555 (100.00%)
  • Interaction: 5555 (100.00%)
  • House_Rules: 5555 (100.00%)
  • Calendar.Last_Scraped: 5555 (100.00%)
  • Calendar.Calendar_Last_Scraped: 5555 (100.00%)
  • Calendar.Availability.30_Days: 5555 (100.00%)
  • Calendar.Availability.60_Days: 5555 (100.00%)
  • Calendar.Availability.90_Days: 5555 (100.00%)
  • Calendar.Availability.365_Days: 5555 (100.00%)
  • Neighborhood_Overview: 2241 (40.34%)
  • Cleaning_Fee: 1931 (34.76%)
  • Review_Scores.Check_In: 1475 (26.55%)
  • Review_Scores.Value: 1475 

We found that some fields, such as Security_Deposit, Extra_People, and several Calendar or Images subfields, are missing in all documents (100%). These likely reflect optional features in the dataset.
More relevant fields like Cleaning_Fee (35%), Review_Scores.* (~26%), and Neighborhood_Overview (40%) also had notable gaps.

Only 6 listings were missing City, and just 8 were missing Name, confirming that core fields are mostly complete.
Conclusion: The dataset is generally in good shape for analysis, with most essential fields well-populated. Missing optional fields were noted but don’t hinder primary business use cases.



## Invalid prices listings
Checkin for weird and/or negative prices; including negative prices, zero prices, extremely low prices, and negative values for accommodates or bedrooms.


In [None]:
# Create a dictionary to store counts of invalid entries for specific fields
invalid_counts = {
    "negative_price": listings_col.count_documents({"price": {"$lt": 0}}),  # Listings with negative price
    "zero_price": listings_col.count_documents({"price": 0}),  # Listings with zero price
    "extremely_low_price": listings_col.count_documents({"price": {"$gt": 0, "$lt": 20}}),  # Listings with price below 20
    "negative_accommodates": listings_col.count_documents({"accommodates": {"$lt": 0}}),  # Listings with negative accommodates
    "negative_bedrooms": listings_col.count_documents({"bedrooms": {"$lt": 0}})  # Listings with negative bedrooms
}

# Print the findings
for field, count in invalid_counts.items():
    print(f"{field}: {count} invalid entries")

negative_price: 0 invalid entries
zero_price: 0 invalid entries
extremely_low_price: 0 invalid entries
negative_accommodates: 0 invalid entries
negative_bedrooms: 0 invalid entries


We applied a validation step to identify listings with clearly invalid or extreme values in fields like price, accommodates, and bedrooms.
We specifically checked for negative numbers, zero pricing, and unrealistically low prices (under €20), as these could indicate data entry errors or unusable listings.
Result: No invalid or suspicious entries were found. This confirms that the dataset contains only reasonable values for these key business metrics.

## Detecting potential Duplicated listings
We'll look for listings that have the same combination of name, LOCATION.city, and host_id — a strong signal that they're duplicates.

In [19]:
pipeline = [
    {
        "$group": {
            "_id": {
                "name": "$name",
                "city": "$LOCATION.city",
                "host_id": "$host_id"
            },
            "count": {"$sum": 1},
            "duplicate_ids": {"$push": "$_id"}
        }
    },
    {
        "$match": {
            "count": {"$gt": 1}
        }
    }
]

duplicate_groups = list(listings_col.aggregate(pipeline))
print(f"Found {len(duplicate_groups)} potential duplicate listing groups.")

Found 1 potential duplicate listing groups.


---------------------

## Question 2

2)	Once a month, we reward hosts with recognition. Select three superhosts with at least two listings that can accommodate more than four people.

**How we did it:**

We filtered listings for Accommodates > 4

Grouped by Host_ID and counted qualifying listings

Filtered hosts with 2+ listings

Sorted and limited to the top 3

Joined host details from the hosts collection for display

In [None]:
pipeline = [
    # First get all superhost IDs
    {
        "$lookup": {
            "from": "hosts",
            "localField": "Host_ID",
            "foreignField": "_id",
            "as": "host_details"
        }
    },
    # Filter to listings with superhost status
    {
        "$match": {
            "host_details.Is_Superhost": True,  # Changed to Python's True instead of JavaScript's true
            "Accommodates": {"$gt": 4}  # Accommodates more than 4 people
        }
    },
    # Group by host to count qualifying listings
    {
        "$group": {
            "_id": "$Host_ID",
            "host_name": {"$first": "$host_details.Name"},
            "total_qualifying_listings": {"$sum": 1}
        }
    },
    # Filter to hosts with at least 2 qualifying listings
    {
        "$match": {
            "total_qualifying_listings": {"$gte": 2}
        }
    },
    # Sort by number of qualifying listings in descending order
    {
        "$sort": {"total_qualifying_listings": -1}
    },
    # Limit to top 3 hosts
    {
        "$limit": 3
    },
    # Look up complete host details for display
    {
        "$lookup": {
            "from": "hosts",
            "localField": "_id",
            "foreignField": "_id",
            "as": "host_complete"
        }
    },
    # Project only the fields we need
    {
        "$project": {
            "host_id": "$_id",
            "host_name": {"$arrayElemAt": ["$host_complete.Name", 0]},
            "total_listings": {"$arrayElemAt": ["$host_complete.Total_Listings", 0]},
            "qualifying_listings": "$total_qualifying_listings",
            "location": {"$arrayElemAt": ["$host_complete.Location", 0]},
            "response_rate": {"$arrayElemAt": ["$host_complete.Response_Rate", 0]}
        }
    }
]

top_superhosts = list(listings_col.aggregate(pipeline))

print("Top 3 Superhosts with at least 2 listings accommodating more than 4 people:")


for i, host in enumerate(top_superhosts, 1):
    print(f"{i}. Host: {host['host_name']} (ID: {host['host_id']})")
    print(f"   Total Listings: {host['total_listings']}")
    print(f"   Qualifying Listings (>4 people): {host['qualifying_listings']}")
    print(f"   Location: {host['location']}")
    print(f"   Response Rate: {host['response_rate']}%")
    print()


Top 3 Superhosts with at least 2 listings accommodating more than 4 people:
1. Host: Great Vacation Retreats (ID: 42618840)
   Total Listings: 74
   Qualifying Listings (>4 people): 3
   Location: Hawaii, United States
   Response Rate: 100%

2. Host: Patty And Beckett (ID: 36133410)
   Total Listings: 59
   Qualifying Listings (>4 people): 3
   Location: Eureka, California, United States
   Response Rate: 100%

3. Host: Elite (ID: 10496350)
   Total Listings: 63
   Qualifying Listings (>4 people): 3
   Location: Honolulu, Hawaii, United States
   Response Rate: 100%



## Question 3

3)	The company considers inevsting into property to rent. Which bed type is most common in listings with a waterfront and a dishwasher in New York?

**How we did it:**

Filtered listings where Location.City is New York and Amenities include both "Waterfront" and "Dishwasher"

Grouped by Bed_Type and counted occurrences

Sorted by count in descending order

Limited results to the top 3 most common bed types

In [21]:
def find_most_common_bed_type():
    pipeline = [
        # Match listings in New York with waterfront and dishwasher
        {
            "$match": {
                "Location.City": "New York",
                "Amenities": {"$all": ["Waterfront", "Dishwasher"]}
            }
        },
        # Group and count bed types
        {
            "$group": {
                "_id": "$Bed_Type",
                "count": {"$sum": 1}
            }
        },
        # Sort by count in descending order
        {
            "$sort": {"count": -1}
        },
        # Limit to top results
        {
            "$limit": 3
        }
    ]
    
    bed_type_analysis = list(listings_col.aggregate(pipeline))
    
    print("Bed Type Analysis for NY Listings with Waterfront and Dishwasher:")
    if not bed_type_analysis:
        print("No listings found matching the criteria.")
        return None
    
    for i, result in enumerate(bed_type_analysis, 1):
        print(f"{i}. Bed Type: {result['_id']}")
        print(f"   Count: {result['count']}")
        print()
    
    return bed_type_analysis

most_common_bed_types = find_most_common_bed_type()

total_matching_listings = listings_col.count_documents({
    "Location.City": "New York",
    "Amenities": {"$all": ["Waterfront", "Dishwasher"]}
})
print(f"\nTotal matching listings: {total_matching_listings}")

Bed Type Analysis for NY Listings with Waterfront and Dishwasher:
1. Bed Type: Real Bed
   Count: 1


Total matching listings: 1


## Question 4

4)	We're considering hiring someone to write reviews professionally. Who wrote the longest review in New York?


**How we did it:**

Joined the reviews collection with listings to filter reviews by city

Selected only reviews for listings in New York

Calculated the length of each comment using $strLenCP

Sorted all reviews by length in descending order

Returned the single longest review along with reviewer info

In [22]:
def find_longest_review_in_new_york():
    pipeline = [
        # Join reviews with listings to ensure the review is for a New York listing
        {
            "$lookup": {
                "from": "listings",
                "localField": "Listing_ID",
                "foreignField": "_id",
                "as": "listing_details"
            }
        },
        # Unwind the listing details (in case of multiple matches)
        {
            "$unwind": "$listing_details"
        },
        # Filter for New York listings
        {
            "$match": {
                "listing_details.Location.City": "New York"
            }
        },
        # Add length of comments
        {
            "$addFields": {
                "comment_length": {"$strLenCP": "$Comments"}
            }
        },
        # Sort by comment length in descending order
        {
            "$sort": {"comment_length": -1}
        },
        # Limit to top result
        {
            "$limit": 1
        },
        # Project only the fields we're interested in
        {
            "$project": {
                "_id": 1,
                "Reviewer_Name": 1,
                "Reviewer_ID": 1,
                "Comments": 1,
                "comment_length": 1,
                "Listing_ID": 1,
                "Date": 1
            }
        }
    ]
    
    # Execute the aggregation
    longest_review = list(reviews_col.aggregate(pipeline))
    
    # Print results
    if not longest_review:
        print("No reviews found for New York listings.")
        return None
    
    review = longest_review[0]
    print("Longest Review in New York:")
    print(f"Reviewer Name: {review['Reviewer_Name']}")
    print(f"Reviewer ID: {review['Reviewer_ID']}")
    print(f"Review Length: {review['comment_length']} characters")
    print(f"Review Date: {review['Date']}")
    print("\nReview Excerpt:")
    print(review['Comments'][:500] + "..." if len(review['Comments']) > 500 else review['Comments'])
    
    return review

# Execute the query
longest_review = find_longest_review_in_new_york()

Longest Review in New York:
Reviewer Name: Angela
Reviewer ID: 46731589
Review Length: 4665 characters
Review Date: 2018-12-30 05:00:00

Review Excerpt:
This alleged “full bedroom”; Private room with the red couch shown on the listing’s pictures (which the couch is not in the room, at least it wasn’t when I arrived) is a complete no go for me. 5 stars DOWN. This was a booking that I made less than 48 hours from my scheduled arrival. I have been using this AIRBNB service now for 3 years, so I would like to think I’m pretty familiar with how things work. The raised eyebrows began when Carlos sent me a text message outside of AirBnB directly from a...


## Question 5

5)	To assess the security of different areas, what is the biggest and smallest (price-security deposit) difference per number of visitors at a property?

**How we did it:**

Filtered the listings collection to include only documents that have valid price, security_deposit, and accommodates fields.

Used the $project stage to calculate the difference between price and security_deposit for each listing. 
 
Grouped the listings by the accommodates field (number of visitors a property can host). For each group:
    - Calculated the minimum and maximum differences (min_difference and max_difference).
    - Retrieved the properties with the smallest and largest differences, including their details.
 
Sorted the results by the number of visitors (accommodates) to make the output better readable.

In [None]:
def investigate_price_security_fields():
    # Retrieve a sample document
    sample_doc = listings_col.find_one()
    
    print("Sample Document Fields:")
    for key in sample_doc.keys():
        print(key)
    
    # Check specific price and security-related fields
    price_like_fields = [key for key in sample_doc.keys() if 'price' in key.lower()]
    security_like_fields = [key for key in sample_doc.keys() if 'deposit' in key.lower()]
    
    print("\nPrice-like Fields:")
    for field in price_like_fields:
        print(field)
    
    print("\nSecurity Deposit-like Fields:")
    for field in security_like_fields:
        print(field)
    
    # Detailed investigation of numeric fields
    def investigate_numeric_field(field_name):
        # Basic stats for the field
        pipeline = [
            {
                "$group": {
                    "_id": None,
                    "min": {"$min": f"${field_name}"},
                    "max": {"$max": f"${field_name}"},
                    "avg": {"$avg": f"${field_name}"},
                    "count": {"$sum": 1}
                }
            }
        ]
        
        results = list(listings_col.aggregate(pipeline))
        
        if results and results[0]:
            print(f"\nField: {field_name}")
            print(f"Count: {results[0].get('count', 'N/A')}")
            print(f"Min: {results[0].get('min', 'N/A')}")
            print(f"Max: {results[0].get('max', 'N/A')}")
            print(f"Average: {results[0].get('avg', 'N/A')}")
    
    # Investigate potential price and security fields
    for field in price_like_fields + security_like_fields:
        investigate_numeric_field(field)

# Run the investigation
investigate_price_security_fields()

Sample Document Fields:
_id
Name
Listing_URL
Location
Property_Type
Room_Type
Bedrooms
Beds
Bed_Type
Bathrooms
Accommodates
Amenities
Price
Cleaning_Fee
Picture_URL
Host_ID
Cancellation_Policy
Minimum_Nights
Maximum_Nights
Summary
Description
Neighborhood_Overview
Number_of_Reviews
First_Review_Date
Last_Review_Date
Review_Scores
Transactions

Price-like Fields:
Price

Security Deposit-like Fields:

Field: Price
Count: 5555
Min: 9.0
Max: 48842.0
Average: 278.76615661566154


In [24]:
# To assess the security of different areas by price-security deposit differences
pipeline = [
    # Filter to documents that have both price and security_deposit
    {"$match": {
        "price": {"$exists": True, "$ne": None},
        "security_deposit": {"$exists": True, "$ne": None},
        "accommodates": {"$exists": True, "$gt": 0}
    }},
    
    # Convert price and security_deposit to numeric and calculate difference
    {"$project": {
        "_id": 1,
        "name": 1,
        "accommodates": 1,  # Number of visitors
        "address.market": 1,  # Location/Area
        "price_value": {"$toDouble": "$price"},
        "deposit_value": {"$toDouble": "$security_deposit"},
        "difference": {"$subtract": [
            {"$toDouble": "$price"}, 
            {"$toDouble": "$security_deposit"}
        ]}
    }},
    
    # Group by number of visitors (accommodates)
    {"$group": {
        "_id": "$accommodates",
        "count": {"$sum": 1},
        "min_difference": {"$min": "$difference"},
        "max_difference": {"$max": "$difference"},
        "min_property": {"$first": {
            "name": "$name",
            "price": "$price_value",
            "deposit": "$deposit_value",
            "location": "$address.market"
        }},
        "max_property": {"$last": {
            "name": "$name",
            "price": "$price_value",
            "deposit": "$deposit_value",
            "location": "$address.market"
        }}
    }},
    
    # Sort by number of visitors
    {"$sort": {"_id": 1}}
]

results = db.listingsAndReviews_new.aggregate(pipeline)

print("Price-Security Deposit Difference by Number of Visitors:")
print("="*100)
print("{:<12} {:<15} {:<15} {:<15} {:<15}".format(
    "Visitors", "Count", "Min Difference", "Max Difference", "Areas with Extremes"
))
print("-"*100)

for result in results:
    visitors = result["_id"]
    count = result["count"]
    min_diff = result["min_difference"]
    max_diff = result["max_difference"]
    min_loc = result["min_property"]["location"]
    max_loc = result["max_property"]["location"]
    
    print("{:<12} {:<15} {:<15.2f} {:<15.2f} {:<15} / {:<15}".format(
        visitors, count, min_diff, max_diff, min_loc, max_loc
    ))

Price-Security Deposit Difference by Number of Visitors:
Visitors     Count           Min Difference  Max Difference  Areas with Extremes
----------------------------------------------------------------------------------------------------
1            209             -14631.00       833.00          Hong Kong       / Rio De Janeiro 
2            1151            -38459.00       2661.00         Rio De Janeiro  / Hong Kong      
3            367             -19546.00       1248.00         New York        / Rio De Janeiro 
4            836             -29148.00       2700.00         Rio De Janeiro  / Porto          
5            199             -8392.00        8501.00         Rio De Janeiro  / Rio De Janeiro 
6            355             -17735.00       2999.00         The Big Island  / Maui           
7            68              -2501.00        1052.00         Barcelona       / Hong Kong      
8            153             -19049.00       1963.00         Porto           / Rio De Janeiro 
9

## Question 6

6)  Identify areas by whether they are typically used for short breaks, like weekend mini breaks, or whether they are more suitable for long trips. This information support targeted advertising of different customer types. It is not expected to change much over time so we won’t look to update it, we just require current view. What is the average duration of stay (in nights) per type of property per city (you can use the maximum_nights to measure length of stays)? For each property type return the city with the highest and lowest average value.

**How we did it:**

Filtered listings with valid Maximum_Nights, Property_Type, and Location.City

Grouped by Property_Type and City, calculating the average maximum stay

For each property type, identified the city with the highest and lowest average stay

Used the results to distinguish cities suited for short breaks vs. long trips

In [None]:
pipeline = [
    {
        "$match": {
            "Maximum_Nights": {"$gt": 0},
            "Location.City": {"$ne": None, "$ne": ""},
            "Property_Type": {"$ne": None, "$ne": ""}
        }
    },
    {
        "$group": {
            "_id": {
                "property_type": "$Property_Type",
                "city": "$Location.City"
            },
            "avg_stay": { "$avg": "$Maximum_Nights" }
        }
    }
]

results = list(listings_col.aggregate(pipeline))

# Organize results by property_type
from collections import defaultdict

grouped = defaultdict(list)
for r in results:
    ptype = r["_id"]["property_type"]
    grouped[ptype].append({
        "city": r["_id"]["city"],
        "avg_stay": r["avg_stay"]
    })

# Find min and max city for each property type
final_output = {}
for ptype, entries in grouped.items():
    sorted_entries = sorted(entries, key=lambda x: x["avg_stay"])
    final_output[ptype] = {
        "lowest_avg_stay_city": sorted_entries[0],
        "highest_avg_stay_city": sorted_entries[-1]
    }

# Print results
for ptype, data in final_output.items():
    print(f"\n Property Type: {ptype}")
    print(f"   Lowest Avg Stay: {data['lowest_avg_stay_city']['city']} ({data['lowest_avg_stay_city']['avg_stay']:.2f} nights)")
    print(f"   Highest Avg Stay: {data['highest_avg_stay_city']['city']} ({data['highest_avg_stay_city']['avg_stay']:.2f} nights)")


 Property Type: Apartment
   Lowest Avg Stay: The Big Island (371.36 nights)
   Highest Avg Stay: Istanbul (5200437.65 nights)

 Property Type: Farm stay
   Lowest Avg Stay: The Big Island (846.00 nights)
   Highest Avg Stay: Barcelona (1125.00 nights)

 Property Type: Condominium
   Lowest Avg Stay: Istanbul (396.43 nights)
   Highest Avg Stay: Other (Domestic) (1125.00 nights)

 Property Type: House
   Lowest Avg Stay: New York (572.36 nights)
   Highest Avg Stay: Other (International) (1125.00 nights)

 Property Type: Cabin
   Lowest Avg Stay: Maui (120.00 nights)
   Highest Avg Stay: Montreal (1125.00 nights)

 Property Type: Loft
   Lowest Avg Stay: The Big Island (28.00 nights)
   Highest Avg Stay: Maui (1125.00 nights)

 Property Type: Townhouse
   Lowest Avg Stay: Barcelona (60.00 nights)
   Highest Avg Stay: Hong Kong (1125.00 nights)

 Property Type: Pension (South Korea)
   Lowest Avg Stay: Hong Kong (1125.00 nights)
   Highest Avg Stay: Hong Kong (1125.00 nights)

 Propert

**Advanced Difficulty Questions (Consider database optimization for these queries):** [3 points per question]








## Question 7

7)	We are creating a new webpage for hosts when setting up their account. It will list suggested typical amenities. This data will need to be available every time a host registers a property but is not expected to change very much. The starting point for the list will be all unique amenities currently listed in properties (across all documents). Optimise the database for this use case and show how the data should be queried.

**How we did it:**

Unwound the Amenities array and extracted all unique values

Stored the results in a new collection called amenities_master using the pre-aggregation + caching pattern

Created a single-query access point for host registration pages to retrieve suggested amenities instantly



Extract all unique amenities

In [26]:
pipeline = [
    {"$unwind": "$Amenities"},
    {"$group": {"_id": None, "all_amenities": {"$addToSet": "$Amenities"}}},
    {"$project": {"_id": 0, "all_amenities": 1}}
]

result = list(listings_col.aggregate(pipeline))
amenities_list = result[0]["all_amenities"] if result else []
print(f"Found {len(amenities_list)} unique amenities")

Found 186 unique amenities


Precompute and store them in a new collection called amenities_master.

In [27]:
db.amenities_master.drop()  # Clean slate if re-running

if amenities_list:
    db.amenities_master.insert_one({"_id": "default", "amenities": amenities_list})
    print(" Saved to amenities_master collection")

 Saved to amenities_master collection


When hosts register a new property and you want to show them typical amenities, just run:

In [29]:
host_view = db.amenities_master.find_one({"_id": "default"})
pprint(host_view["amenities"])

['Cable TV',
 'Paid parking on premises',
 'Iron',
 'Baby bath',
 'Dryer',
 'Children’s dinnerware',
 'Tennis court',
 'Dishes and silverware',
 'Pool',
 'Buzzer/wireless intercom',
 'Building staff',
 'Free street parking',
 'Ground floor access',
 'Bathtub',
 'Stove',
 'Patio or balcony',
 'Waterfront',
 'Accessible-height toilet',
 'BBQ grill',
 'First aid kit',
 'Accessible-height bed',
 'DVD player',
 'Lake access',
 'Gas oven',
 'Memory foam mattress',
 'Outdoor seating',
 'Private entrance',
 'Sonos sound system',
 'Wide clearance to bed',
 '24-hour check-in',
 'Refrigerator',
 'Suitable for events',
 'Private pool',
 'Outdoor parking',
 'Handheld shower head',
 'Baby monitor',
 'Fire extinguisher',
 'Formal dining area',
 'Wide hallway clearance',
 'Roll-in shower',
 'Shared pool',
 'Stair gates',
 'Other',
 'Fixed grab bars for toilet',
 'Firm mattress',
 'Crib',
 'Kitchenette',
 'Safe',
 'Wifi',
 'Pets allowed',
 'Essentials',
 'En suite bathroom',
 'Shower chair',
 'Breakfas

We applied the pre-aggregation + caching pattern because amenities are reused often but rarely change.
We expect faster performance and less computation per host registration page, based on avoiding full collection scans.

## Question 8


8)	We plan to rtack our reviewers better. We want to create a webpage that shows the top 20 reviewers and the count of the number of reviews of each of these reviewers. This webpage should be kept up to date. It should also have a link to return the number of reviews for a given reviewer ID or Name (show how to query for number of reviews by ID or query quickly).

**How we did it:**

Aggregated reviews by Reviewer_ID and counted the total per reviewer

Sorted and limited the results to the top 20 reviewers

Stored the result in a dedicated top_reviewers collection for fast repeated access

Enabled quick queries by reviewer ID using count_documents()

Also supported name-based lookup, though ID is preferred due to name duplication

Aggregation of the Top 20 reviewers

In [30]:
pipeline = [
    {"$group": {
        "_id": "$Reviewer_ID",
        "name": {"$first": "$Reviewer_Name"},
        "review_count": {"$sum": 1}
    }},
    {"$sort": {"review_count": -1}},
    {"$limit": 20}
]

top_reviewers = list(reviews_col.aggregate(pipeline))
for reviewer in top_reviewers:
    print(f"{reviewer['name']} ({reviewer['_id']}): {reviewer['review_count']} reviews")


Filipe (20775242): 24 reviews
Nick (67084875): 13 reviews
Uge (2961855): 10 reviews
Thien (162027327): 9 reviews
Lisa (20991911): 9 reviews
Jodi (12679057): 8 reviews
Courtney (55241576): 8 reviews
Todd (60816198): 8 reviews
David (1705870): 8 reviews
Lisa (69140895): 8 reviews
Lance (47303133): 7 reviews
David (78093968): 7 reviews
Erik (61469899): 6 reviews
Branden (128210181): 6 reviews
Karen (24667379): 6 reviews
Chris (86665925): 6 reviews
Dan (34005800): 6 reviews
Megan (25715809): 6 reviews
Mary (57325457): 6 reviews
Assis (76782210): 6 reviews


Store it in a dedicated collection. Create or update a collection called top_reviewers, this way we don't have to recompute it every time. 

In [31]:
# Overwrite collection (e.g., updated weekly or daily)
db.top_reviewers.drop()
if top_reviewers:
    db.top_reviewers.insert_many(top_reviewers)
    print(" Top 20 reviewers stored in top_reviewers collection.")

 Top 20 reviewers stored in top_reviewers collection.


## Query quickly through id or name

query through reviewer_id

In [32]:
reviewer_id = "20775242"
count = reviews_col.count_documents({"Reviewer_ID": reviewer_id})
print(f"Reviewer {reviewer_id} wrote {count} reviews.")

Reviewer 20775242 wrote 24 reviews.


By reviewer name (Case-incsensitive)

In [33]:
name = "Filipe"
count = reviews_col.count_documents({"Reviewer_Name": {"$regex": f"^{name}$", "$options": "i"}})
print(f"Reviewer '{name}' wrote {count} reviews.")

Reviewer 'Filipe' wrote 44 reviews.


Uniqueness: Names like "Lisa" or "David" appear multiple times across different reviewers, so they’re not unique identifiers.

**SUMMARY QUESTION 8**

We applied the pre-aggregation + index pattern to keep track of top reviewers efficiently and support fast lookups.
We expect constant-time access for the top 20 (from top_reviewers) and indexed queries by reviewer ID or name in milliseconds.

Although it's possible to query by reviewer name using a case-insensitive regex, we recommend using the reviewer ID for accurate and consistent results. Reviewer names are not unique and may appear in different formats.

## Question 9

9)	For each property we store review scores across different metrics (accuracy, check-in, cleanliness etc). We consider adding more metrics, although there is no clarity on what these will be. We want to be able to easily query the average score across all of these metrics, including any new metrics that might be added without changing the query. Adjust the data model so this can be done and show the query for an example property.

**How we did it:**

Queried a listing by its _id

Used $objectToArray to convert the Review_Scores subdocument into key-value pairs

Mapped over the values with $map to extract numeric scores

Calculated the average using $avg, without hardcoding field names

Ensured the query automatically includes any new review metrics in the future



**QUERY TO DYNAMICALLY AVERGAE ALL SCORE**

-> the average of all review scores for a given property:

In [34]:
property_id = "10006546"  # Replace with real _id

pipeline = [
    {"$match": {"_id": property_id}},
    {"$project": {
        "Review_Scores": 1,
        "scores_array": {
            "$objectToArray": "$Review_Scores"
        }
    }},
    {"$project": {
        "average_score": {
            "$avg": {
                "$map": {
                    "input": "$scores_array",
                    "as": "score",
                    "in": "$$score.v"
                }
            }
        }
    }}
]

result = list(listings_col.aggregate(pipeline))

if result:
    print(f" Average review score for property {property_id}: {result[0]['average_score']:.2f}")
else:
    print(" Property not found.")

 Average review score for property 10006546: 22.83


In [35]:
# Get a sample property ID from the listings collection
sample_listing = listings_col.find_one({}, {"_id": 1, "Name": 1, "Location.City": 1})

if sample_listing:
    print(f"Sample Property ID: {sample_listing['_id']}")
    print(f"Property Name: {sample_listing.get('Name', 'Not available')}")
    print(f"Location: {sample_listing.get('Location', {}).get('City', 'Not available')}")
else:
    print("No listings found")

# If you need a specific property, for example in New York
ny_listing = listings_col.find_one({"Location.City": "New York"}, {"_id": 1, "Name": 1})

if ny_listing:
    print(f"\nNew York Property ID: {ny_listing['_id']}")
    print(f"Property Name: {ny_listing.get('Name', 'Not available')}")
else:
    print("\nNo New York listings found")

# Or a property with reviews
property_with_reviews = listings_col.find_one(
    {"Number_of_Reviews": {"$gt": 0}}, 
    {"_id": 1, "Name": 1, "Number_of_Reviews": 1}
)

if property_with_reviews:
    print(f"\nProperty with reviews ID: {property_with_reviews['_id']}")
    print(f"Property Name: {property_with_reviews.get('Name', 'Not available')}")
    print(f"Number of Reviews: {property_with_reviews.get('Number_of_Reviews', 0)}")
else:
    print("\nNo properties with reviews found")

Sample Property ID: 10006546
Property Name: Ribeira Charming Duplex
Location: Porto

New York Property ID: 10021707
Property Name: Private Room in Bushwick

Property with reviews ID: 10006546
Property Name: Ribeira Charming Duplex
Number of Reviews: 51


$objectToArray: Converts the Review_Scores object into an array of {k: ..., v: ...} pairs.

$map over that array to extract values.

$avg computes the average dynamically.

Why This Rocks
Works even if you add new review metrics later like "EcoFriendliness" or "WiFi".

You don’t need to rewrite your query — it’s fully dynamic!

## Question 10

10)	We aim to have better access to information about transaction, we wish to develop a search engine that can calculate the average value of transactions in a given period of time quickly for a given property.


**How we did it:**

Queried a property’s Price, Transactions, and review metadata to estimate transaction potential

Calculated estimated monthly and annual revenue using price and assumed occupancy

Created a second function to estimate average transaction value in a given date range

Used review dates to proportionally estimate review volume in the period, as a proxy for transactions

Added a 15% multiplier to the price to reflect service fees in estimated transaction value

In [36]:
from datetime import datetime
from bson.decimal128 import Decimal128

def analyze_transactions(property_id, start_date=None, end_date=None):
    """Calculate average transaction value in a time period"""
    # Get property details
    property_doc = db.listingsAndReviews_new.find_one(
        {"_id": property_id},
        {"name": 1, "transactions": 1, "price": 1}
    )
    
    if not property_doc:
        return {"error": "Property not found"}
    
    name = property_doc.get("name", "Unknown")
    
    # Handle Decimal128 conversion safely
    price_value = property_doc.get("price", 0)
    if isinstance(price_value, Decimal128):
        price = float(price_value.to_decimal())
    else:
        try:
            price = float(price_value)
        except (TypeError, ValueError):
            price = 0
    
    # Get transactions
    transactions = property_doc.get("transactions", {}).get("transactions", [])
    
    # Filter by date range if provided
    if start_date and end_date and transactions:
        transactions = [t for t in transactions if 
                       "date" in t and start_date <= t["date"] <= end_date]
    
    # Calculate transaction statistics
    if transactions:
        # Handle Decimal128 types in transaction prices
        values = []
        for t in transactions:
            try:
                t_price = t.get("price", 0)
                if isinstance(t_price, Decimal128):
                    values.append(float(t_price.to_decimal()))
                else:
                    values.append(float(t_price))
            except (TypeError, ValueError):
                continue
                
        avg_value = sum(values) / len(values) if values else 0
        
        return {
            "property_name": name,
            "period": f"{start_date.strftime('%Y-%m-%d') if start_date else 'All'} to {end_date.strftime('%Y-%m-%d') if end_date else 'now'}",
            "transactions_count": len(values),
            "average_value": avg_value,
            "data_source": "Actual transaction data"
        }
    else:
        # Use listing price directly
        return {
            "property_name": name,
            "period": f"{start_date.strftime('%Y-%m-%d') if start_date else 'All'} to {end_date.strftime('%Y-%m-%d') if end_date else 'now'}",
            "average_value": price,
            "data_source": "Estimated from listing price (no transactions found)"
        }

# Simple test with one property
property_id = "10082422"  # Sample property ID
start_date = datetime(2015, 1, 1)
end_date = datetime(2018, 12, 31)

result = analyze_transactions(property_id, start_date, end_date)
print("\nTransaction Analysis Results:")
print("="*50)
if "error" in result:
    print(f"Error: {result['error']}")
else:
    print(f"Property: {result['property_name']}")
    print(f"Period: {result['period']}")
    print(f"Average Transaction Value: ${result['average_value']:.2f}")
    if "transactions_count" in result:
        print(f"Transactions Found: {result['transactions_count']}")
    print(f"Data Source: {result['data_source']}")


Transaction Analysis Results:
Property: Nice room in Barcelona Center
Period: 2015-01-01 to 2018-12-31
Average Transaction Value: $81.14
Transactions Found: 3
Data Source: Actual transaction data


## Question 11



11)	We wish to have a summary webpage that displays information about our top destinations. This webpage should display for each of the top 10 cities some basic information about our operations in the area (number of properties by type for example, average price by type) but you can choose the metrics. For each of the top 10 cities it should also provide some basic information about the top 3 properties in each city (price, number of review, whatever you think useful) to show an example of the properties available in the area. We would like to keep this webpage up to date as information changes.

**How we did it:**

Aggregated all listings to identify the top 10 cities by number of listings

For each city, calculated:

Property type distribution (count, percentage, average price)

Price statistics (avg, min, max, ranges)

Review statistics (avg reviews, total reviews, avg rating)

Identified top 3 listings per city using a custom score based on rating × log of review count

Stored the full results in a city_summaries collection, indexed by city name for instant access

Enabled webpage-ready querying through a function that fetches one or all cities' summaries



In [37]:
def identify_top_cities():
    """Identify the top 10 cities by number of listings"""
    pipeline = [
        # Group by city and count listings
        {
            "$group": {
                "_id": "$Location.City",
                "listing_count": {"$sum": 1},
                "avg_price": {"$avg": "$Price"}
            }
        },
        # Filter out null or empty city names
        {
            "$match": {
                "_id": {"$ne": None, "$ne": ""}
            }
        },
        # Sort by listing count in descending order
        {
            "$sort": {"listing_count": -1}
        },
        # Limit to top 10
        {
            "$limit": 10
        }
    ]
    
    top_cities = list(listings_col.aggregate(pipeline))
    
    print("Top 10 Cities by Listing Count:")
    for i, city in enumerate(top_cities, 1):
        print(f"{i}. {city['_id']}: {city['listing_count']} listings (Avg price: ${city['avg_price']:.2f})")
    
    return top_cities

# Step 2: Create a collection to store city summaries for fast access
def create_city_summaries(top_cities):
    """Create a collection with comprehensive statistics for each top city"""
    # Drop existing collection if rerunning
    db.city_summaries.drop()
    
    city_summaries = []
    
    for city_info in top_cities:
        city_name = city_info["_id"]
        
        # Get city-wide metrics
        property_type_stats = get_property_type_distribution(city_name)
        price_stats = get_price_statistics(city_name)
        review_stats = get_review_statistics(city_name)
        
        # Get top 3 properties in this city
        top_properties = get_top_properties(city_name)
        
        # Compile city summary
        city_summary = {
            "_id": city_name,
            "listing_count": city_info["listing_count"],
            "avg_price": city_info["avg_price"],
            "property_types": property_type_stats,
            "price_statistics": price_stats,
            "review_statistics": review_stats,
            "top_properties": top_properties,
            "last_updated": datetime.now()
        }
        
        city_summaries.append(city_summary)
    
    # Insert all summaries at once
    if city_summaries:
        db.city_summaries.insert_many(city_summaries)
        db.city_summaries.create_index([("_id", 1)])  # Create index on city name
        print(f"\n Created summaries for {len(city_summaries)} cities")

# Helper function to get property type distribution
def get_property_type_distribution(city):
    """Get distribution of property types in a city"""
    pipeline = [
        # Match listings in this city
        {
            "$match": {"Location.City": city}
        },
        # Group by property type
        {
            "$group": {
                "_id": "$Property_Type",
                "count": {"$sum": 1},
                "avg_price": {"$avg": "$Price"}
            }
        },
        # Sort by count
        {
            "$sort": {"count": -1}
        }
    ]
    
    property_types = list(listings_col.aggregate(pipeline))
    
    # Format results for storage
    return [
        {
            "type": pt["_id"],
            "count": pt["count"],
            "percentage": round(pt["count"] / listings_col.count_documents({"Location.City": city}) * 100, 1),
            "avg_price": pt["avg_price"]
        }
        for pt in property_types
    ]

# Helper function to get price statistics
def get_price_statistics(city):
    """Get detailed price statistics for a city"""
    pipeline = [
        # Match listings in this city
        {
            "$match": {"Location.City": city}
        },
        # Calculate price statistics
        {
            "$group": {
                "_id": None,
                "avg_price": {"$avg": "$Price"},
                "min_price": {"$min": "$Price"},
                "max_price": {"$max": "$Price"},
                "median_price": {"$avg": "$Price"}  # Approximation (MongoDB doesn't have true median)
            }
        }
    ]
    
    stats = list(listings_col.aggregate(pipeline))
    
    # Get price ranges (buckets)
    pipeline_ranges = [
        {"$match": {"Location.City": city}},
        {"$bucket": {
            "groupBy": "$Price",
            "boundaries": [0, 50, 100, 200, 500, 1000, 10000],
            "default": "10000+",
            "output": {
                "count": {"$sum": 1},
                "properties": {"$push": {"id": "$_id", "name": "$Name"}}
            }
        }}
    ]
    
    price_ranges = list(listings_col.aggregate(pipeline_ranges))
    
    # Combine and return
    return {
        "general": stats[0] if stats else {"avg_price": 0, "min_price": 0, "max_price": 0},
        "ranges": [
            {
                "range": f"${r['_id']}" if isinstance(r['_id'], int) else r['_id'],
                "count": r['count'],
                "sample_properties": r['properties'][:3]  # Just include 3 sample properties
            }
            for r in price_ranges
        ]
    }

# Helper function to get review statistics
def get_review_statistics(city):
    """Get review statistics for a city"""
    pipeline = [
        # Match listings in this city
        {
            "$match": {"Location.City": city}
        },
        # Calculate review statistics
        {
            "$group": {
                "_id": None,
                "avg_reviews": {"$avg": "$Number_of_Reviews"},
                "total_reviews": {"$sum": "$Number_of_Reviews"},
                "avg_rating": {"$avg": "$Review_Scores.Rating"}
            }
        }
    ]
    
    stats = list(listings_col.aggregate(pipeline))
    
    return stats[0] if stats else {"avg_reviews": 0, "total_reviews": 0, "avg_rating": 0}

# Helper function to get top properties in a city
def get_top_properties(city):
    """Get top 3 properties in a city based on review count and rating"""
    pipeline = [
        # Match listings in this city
        {
            "$match": {
                "Location.City": city,
                "Number_of_Reviews": {"$gt": 0}  # Only consider properties with reviews
            }
        },
        # Calculate a score based on review count and rating
        {
            "$addFields": {
                "score": {
                    "$multiply": [
                        {"$ifNull": ["$Review_Scores.Rating", 0]},
                        {"$ln": {"$add": ["$Number_of_Reviews", 1]}}  # Log transform to balance influence
                    ]
                }
            }
        },
        # Sort by score
        {
            "$sort": {"score": -1}
        },
        # Limit to top 3
        {
            "$limit": 3
        },
        # Project only the fields we need
        {
            "$project": {
                "_id": 1,
                "Name": 1,
                "Property_Type": 1,
                "Price": 1,
                "Number_of_Reviews": 1,
                "Review_Scores.Rating": 1,
                "Picture_URL": 1,
                "Accommodates": 1,
                "Location.Coordinates": 1
            }
        }
    ]
    
    top_properties = list(listings_col.aggregate(pipeline))
    return top_properties

# Function to query the summary data for web display
def get_city_summary(city_name=None):
    """Get summary data for a specific city or all top cities"""
    if city_name:
        return db.city_summaries.find_one({"_id": city_name})
    else:
        return list(db.city_summaries.find().sort("listing_count", -1))

# Run the entire process
top_cities = identify_top_cities()
create_city_summaries(top_cities)

# Example of how to access the data for a web page
print("\n Access Example for Web Page:")
all_summaries = get_city_summary()
print(f"Retrieved {len(all_summaries)} city summaries")

# Example of specific city access
if all_summaries:
    sample_city = all_summaries[0]["_id"]
    city_data = get_city_summary(sample_city)
    
    print(f"\n Sample Data for {sample_city}:")
    print(f"Total Listings: {city_data['listing_count']}")
    print(f"Average Price: ${city_data['avg_price']:.2f}")
    
    print("\nProperty Type Distribution:")
    for pt in city_data['property_types'][:3]:  # Show top 3
        print(f"- {pt['type']}: {pt['count']} listings ({pt['percentage']}%), Avg: ${pt['avg_price']:.2f}")
    
    print("\nTop Properties:")
    for i, prop in enumerate(city_data['top_properties'], 1):
        print(f"{i}. {prop['Name']} - ${prop['Price']}, {prop['Number_of_Reviews']} reviews, Rating: {prop.get('Review_Scores', {}).get('Rating', 'N/A')}")



Top 10 Cities by Listing Count:
1. Istanbul: 660 listings (Avg price: $367.95)
2. Montreal: 648 listings (Avg price: $100.23)
3. Barcelona: 632 listings (Avg price: $100.95)
4. Hong Kong: 619 listings (Avg price: $762.48)
5. Sydney: 609 listings (Avg price: $197.71)
6. New York: 607 listings (Avg price: $139.63)
7. Rio De Janeiro: 603 listings (Avg price: $525.81)
8. Porto: 554 listings (Avg price: $69.13)
9. Oahu: 253 listings (Avg price: $212.30)
10. Maui: 153 listings (Avg price: $286.59)

 Created summaries for 10 cities

 Access Example for Web Page:
Retrieved 10 city summaries

 Sample Data for Istanbul:
Total Listings: 660
Average Price: $367.95

Property Type Distribution:
- Apartment: 413 listings (62.6%), Avg: $365.98
- Serviced apartment: 67 listings (10.2%), Avg: $314.42
- House: 37 listings (5.6%), Avg: $276.30

Top Properties:
1. Historical Istanbul House @ Taksim - $243.0, 246 reviews, Rating: 96
2. cozy room in Taksim Istanbul - $148.0, 248 reviews, Rating: 94
3. At The

**Database updates:** [2 points per question]

After optimizing the database, show how to complete the following updates. You can create fictional data. Ensure that previous data does not become stale:





## Question 12

12) Add a new property with a new host in one of the top 10 cities. The host selects the top 10 most common amenities to list.

**How we did it:**

Queried the database to identify one of the top 10 cities by number of listings

Aggregated the top 10 most common amenities across all listings

Created a new host document with complete metadata and inserted it into the hosts collection

Created a new property listing with realistic values (price, capacity, location, etc.) and added it to the listings collection

Ensured referential integrity by linking the listing to the new host via Host_ID

Updated the host’s Total_Listings count accordingly



In [38]:
def get_top_city():
    """Get one of the top 10 cities by listing count"""
    pipeline = [
        {"$group": {"_id": "$Location.City", "count": {"$sum": 1}}},
        {"$match": {"_id": {"$ne": None, "$ne": ""}}},
        {"$sort": {"count": -1}},
        {"$limit": 10}
    ]
    
    top_cities = list(listings_col.aggregate(pipeline))
    return top_cities[0]["_id"] if top_cities else "New York"  # Default to New York if no data

# Step 2: Find the top 10 most common amenities
def get_top_amenities(limit=10):
    """Get the most common amenities across all listings"""
    pipeline = [
        {"$unwind": "$Amenities"},
        {"$group": {"_id": "$Amenities", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit}
    ]
    
    top_amenities = list(listings_col.aggregate(pipeline))
    return [amenity["_id"] for amenity in top_amenities]

# Step 3: Create a new host
def create_new_host():
    """Create a new host document"""
    new_host = {
        "_id": str(int(time.time())),  # Generate unique ID based on timestamp
        "Name": "Jane Smith",
        "Location": "San Francisco, California, USA",
        "Response_Rate": 98,
        "Response_Time": "within an hour",
        "Is_Superhost": True,
        "Total_Listings": 1,
        "Verification_Methods": ["email", "phone", "government_id"],
        "Profile_Picture": "https://example.com/profile_pic.jpg",
        "About": "Passionate about travel and meeting new people. I love sharing my favorite local spots with guests!",
        "Host_Verification": {
            "Has_Profile_Picture": True,
            "Identity_Verified": True,
            "Verification_Methods": ["email", "phone", "government_id"]
        }
    }
    
    # Insert the new host
    result = hosts_col.insert_one(new_host)
    print(f" Created new host with ID: {new_host['_id']}")
    
    return new_host

# Step 4: Create a new property for this host
def create_new_property(host_doc, city, amenities):
    """Create a new property listing for the host"""
    # Generate a unique property ID
    property_id = f"prop_{int(time.time())}"
    
    # Create the property document
    new_property = {
        "_id": property_id,
        "Name": "Modern Downtown Loft with City Views",
        "Listing_URL": f"https://airbnb.com/rooms/{property_id}",
        "Location": {
            "City": city,
            "Coordinates": [-122.419416, 37.774929],  # Example coordinates
            "Area": "Downtown",
            "Country": "United States",
            "Street": f"123 Main St, {city}, USA"
        },
        "Property_Type": "Apartment",
        "Room_Type": "Entire home/apt",
        "Bedrooms": 2,
        "Beds": 3,
        "Bed_Type": "Real Bed",
        "Bathrooms": 2.0,
        "Accommodates": 6,
        "Amenities": amenities,
        "Price": 185.00,
        "Cleaning_Fee": 75.00,
        "Picture_URL": "https://example.com/property_image.jpg",
        "Host_ID": host_doc["_id"],
        "Cancellation_Policy": "moderate",
        "Minimum_Nights": 2,
        "Maximum_Nights": 30,
        "Summary": "Here we have a 20 POINTS HOPEFULLY REPOOOOORRTT",
        "Description": "Tdfsdfsdf",
        "Neighborhood_Overview": "Laknfklanf",
        "Number_of_Reviews": 0,
        "Review_Scores": {
            "Rating": None,
            "Accuracy": None,
            "Cleanliness": None,
            "Check_In": None,
            "Communication": None,
            "Location": None,
            "Value": None
        }
    }
    
    # Insert the new property
    result = listings_col.insert_one(new_property)
    print(f" Created new property with ID: {new_property['_id']}")
    
    # Update the host's listing count
    hosts_col.update_one(
        {"_id": host_doc["_id"]},
        {"$inc": {"Total_Listings": 1}}
    )
    
    return new_property

# Step 5: Execute the entire process
def add_new_property_with_host():
    """Add a new property with a new host in a top city with top amenities"""
    # Get top city and amenities
    top_city = get_top_city()
    top_amenities = get_top_amenities(10)
    
    print(f"Selected City: {top_city}")
    print("Top 10 Amenities:")
    for i, amenity in enumerate(top_amenities, 1):
        print(f"{i}. {amenity}")
    
    # Create host and property
    new_host = create_new_host()
    new_property = create_new_property(new_host, top_city, top_amenities)
    
    print("\n New Property Summary:")
    print(f"Name: {new_property['Name']}")
    print(f"Host: {new_host['Name']} (ID: {new_host['_id']})")
    print(f"Location: {new_property['Location']['City']}")
    print(f"Price: ${new_property['Price']:.2f} per night")
    print(f"Accommodates: {new_property['Accommodates']} guests")
    print(f"Amenities: {', '.join(new_property['Amenities'])}")
    
    return {
        "host": new_host,
        "property": new_property
    }

# Run the function
import time
new_listing = add_new_property_with_host()

Selected City: Istanbul
Top 10 Amenities:
1. Wifi
2. Essentials
3. Kitchen
4. TV
5. Hangers
6. Hair dryer
7. Washer
8. Shampoo
9. Iron
10. Laptop friendly workspace
 Created new host with ID: 1745056970
 Created new property with ID: prop_1745056970

 New Property Summary:
Name: Modern Downtown Loft with City Views
Host: Jane Smith (ID: 1745056970)
Location: Istanbul
Price: $185.00 per night
Accommodates: 6 guests
Amenities: Wifi, Essentials, Kitchen, TV, Hangers, Hair dryer, Washer, Shampoo, Iron, Laptop friendly workspace


## Question 13

13) Add a new review from one of our top 20 reviewers for this new property.



**How we did it**

Aggregated the reviews collection to find a top reviewer (with fallback options for robustness)

Identified a property to review, prioritizing the one created in Question 12

Created a new review document using rich, contextual review text

Inserted the review into the reviews collection

Updated the corresponding property’s Number_of_Reviews, Last_Review_Date, and First_Review_Date if missing

In [39]:
# Step 1: Get one of the top 20 reviewers more robustly
def get_top_reviewer():
    """Get one of the top 20 reviewers by review count with fallback options"""
    # Try to get top reviewers from the reviews collection
    pipeline = [
        {"$group": {
            "_id": "$Reviewer_ID",
            "name": {"$first": "$Reviewer_Name"},
            "review_count": {"$sum": 1}
        }},
        {"$match": {
            "_id": {"$ne": None, "$ne": ""},
            "name": {"$ne": None, "$ne": ""}
        }},
        {"$sort": {"review_count": -1}},
        {"$limit": 20}
    ]
    
    top_reviewers = list(reviews_col.aggregate(pipeline))
    
    if top_reviewers:
        # Randomly select one from the top 20 for variety
        import random
        reviewer = random.choice(top_reviewers)
        return {
            "reviewer_id": reviewer["_id"],
            "name": reviewer["name"],
            "review_count": reviewer["review_count"]
        }
    
    # Fallback 1: Check if there's a top_reviewers collection from Q8
    if "top_reviewers" in db.list_collection_names():
        top_reviewer_doc = db.top_reviewers.find_one()
        if top_reviewer_doc:
            return {
                "reviewer_id": top_reviewer_doc["_id"],
                "name": top_reviewer_doc.get("name", "Top Reviewer"),
                "review_count": top_reviewer_doc.get("review_count", 0)
            }
    
    # Fallback 2: Take any reviewer from the reviews collection
    sample_review = reviews_col.find_one({}, {"Reviewer_ID": 1, "Reviewer_Name": 1})
    if sample_review:
        return {
            "reviewer_id": sample_review["Reviewer_ID"],
            "name": sample_review["Reviewer_Name"],
            "review_count": 1  # We don't know the exact count but it's at least 1
        }
    
    # Final fallback: Create a new reviewer
    import time
    new_id = f"reviewer_{int(time.time())}"
    return {
        "reviewer_id": new_id,
        "name": "New Reviewer",
        "review_count": 0,
        "is_new": True  # Flag to indicate this is a new reviewer
    }

# Step 2: Find a suitable property to review
def find_property_to_review():
    """Find a property to review, with multiple fallback strategies"""
    # First, try to use the property from Question 12 if available
    try:
        if 'new_listing' in globals() and new_listing.get("property", {}).get("_id"):
            property_id = new_listing["property"]["_id"]
            property_doc = listings_col.find_one({"_id": property_id})
            if property_doc:
                return property_doc
    except:
        # Continue to fallbacks if the above fails
        pass
    
    # Fallback 1: Find a recently added property (presumably with few or no reviews)
    recent_property = listings_col.find_one(
        {"Number_of_Reviews": {"$lt": 5}},
        sort=[("Last_Review_Date", -1)]
    )
    if recent_property:
        return recent_property
    
    # Fallback 2: Find any property
    any_property = listings_col.find_one()
    if any_property:
        return any_property
    
    # Fallback 3: Create a new property (simplified version)
    import time
    property_id = f"property_{int(time.time())}"
    new_property = {
        "_id": property_id,
        "Name": "Fallback Property",
        "Location": {"City": "Example City"},
        "Property_Type": "Apartment",
        "Number_of_Reviews": 0
    }
    listings_col.insert_one(new_property)
    return new_property

# Step 3: Create a new review
def create_review(property_doc, reviewer):
    """Create a detailed review for the property"""
    from datetime import datetime
    import uuid
    
    # Generate a unique review ID
    review_id = str(uuid.uuid4())
    
    # Create review text based on property details
    property_name = property_doc.get("Name", "this place")
    property_type = property_doc.get("Property_Type", "property")
    city = property_doc.get("Location", {}).get("City", "the city")
    
    review_text = f"""Our stay at {property_name} was wonderful! The {property_type} was exactly as described - clean, comfortable, and in a great location in {city}. The check-in process was smooth, and communication was excellent throughout our stay. The amenities were perfect for our needs, and we particularly enjoyed the comfortable beds and well-equipped kitchen. The neighborhood offered easy access to local attractions, restaurants, and shopping. We would definitely recommend this {property_type} to anyone visiting {city} and hope to stay here again in the future!"""
    
    # Create the review document
    new_review = {
        "_id": review_id,
        "Listing_ID": property_doc["_id"],
        "Reviewer_ID": reviewer["reviewer_id"],
        "Reviewer_Name": reviewer["name"],
        "Date": datetime.now(),
        "Comments": review_text
    }
    
    return new_review

# Step 4: Save review and update property
def save_review(review_doc):
    """Save the review and update the property's review count"""
    from datetime import datetime
    
    # Insert the review
    result = reviews_col.insert_one(review_doc)
    
    # Update the property
    now = datetime.now()
    update_data = {
        "$inc": {"Number_of_Reviews": 1},
        "$set": {"Last_Review_Date": now}
    }
    
    # If this is the first review, set the first review date
    property_doc = listings_col.find_one({"_id": review_doc["Listing_ID"]})
    if not property_doc.get("First_Review_Date"):
        update_data["$set"]["First_Review_Date"] = now
    
    # Update the property
    listings_col.update_one(
        {"_id": review_doc["Listing_ID"]},
        update_data
    )
    
    return result.inserted_id

# Step 5: Execute the entire process
def add_review_to_property():
    """Add a review from a top reviewer to a property with built-in resiliency"""
    # Find a reviewer
    reviewer = get_top_reviewer()
    print(f"Selected Reviewer: {reviewer['name']} (ID: {reviewer['reviewer_id']})")
    if reviewer.get("is_new"):
        print("Note: Created a new reviewer as no existing reviewers were found")
    
    # Find a property
    property_doc = find_property_to_review()
    print(f"Selected Property: {property_doc['Name']} (ID: {property_doc['_id']})")
    print(f"Location: {property_doc.get('Location', {}).get('City', 'Unknown')}")
    print(f"Current Reviews: {property_doc.get('Number_of_Reviews', 0)}")
    
    # Create the review
    new_review = create_review(property_doc, reviewer)
    
    # Save the review
    inserted_id = save_review(new_review)
    
    print("\nSuccessfully added new review:")
    print(f"Review ID: {inserted_id}")
    print(f"From: {reviewer['name']}")
    print(f"For: {property_doc['Name']}")
    print(f"Date: {new_review['Date']}")
    print(f"Comment Excerpt: {new_review['Comments'][:100]}...")
    
    # Get the updated property data
    updated_property = listings_col.find_one({"_id": property_doc["_id"]})
    print(f"\nUpdated Property Review Count: {updated_property.get('Number_of_Reviews', 0)}")
    
    return {
        "review_id": inserted_id,
        "property_id": property_doc["_id"],
        "reviewer_id": reviewer["reviewer_id"]
    }

# Run the process
review_result = add_review_to_property()

Selected Reviewer: David (ID: 78093968)
Selected Property: Modern Downtown Loft with City Views (ID: prop_1745056970)
Location: Istanbul
Current Reviews: 0

Successfully added new review:
Review ID: 9965bffb-857c-43d7-ab8b-d91c58206b8b
From: David
For: Modern Downtown Loft with City Views
Date: 2025-04-19 12:03:01.338399
Comment Excerpt: Our stay at Modern Downtown Loft with City Views was wonderful! The Apartment was exactly as describ...

Updated Property Review Count: 1


## Question 14

14) Add a new review metric called 'x_factor' with a score of 10. Show that the average score across all metrics is correctly calculated for this listing, using the previously developed query.


**How we did it:**

Selected a property (preferring the one added in Question 12)

Added a new dynamic review metric X_Factor with a score of 10 to the Review_Scores object

Reused the dynamic average scoring pipeline from Question 9

Verified that the average across all metrics (now including X_Factor) is calculated correctly



In [40]:
# Step 1: Select a property to update
def select_property_for_metric_update():
    """Select a property to add the new review metric to"""
    # Try to use the property from Question 12/13 if available
    try:
        if 'review_result' in globals() and review_result.get("property_id"):
            property_id = review_result["property_id"]
            property_doc = listings_col.find_one({"_id": property_id})
            if property_doc:
                return property_doc
    except:
        pass
    
    try:
        if 'new_listing' in globals() and new_listing.get("property", {}).get("_id"):
            property_id = new_listing["property"]["_id"]
            property_doc = listings_col.find_one({"_id": property_id})
            if property_doc:
                return property_doc
    except:
        pass
    
    # Fallback: Find any property with existing review scores
    property_with_scores = listings_col.find_one(
        {"Review_Scores.Rating": {"$exists": True, "$ne": None}}
    )
    if property_with_scores:
        return property_with_scores
    
    # Final fallback: Any property
    any_property = listings_col.find_one()
    return any_property

# Step 2: Add the new x_factor metric
def add_x_factor_metric(property_doc):
    """Add the x_factor metric to the property's review scores"""
    # Get the current review scores
    review_scores = property_doc.get("Review_Scores", {})
    if review_scores is None:
        review_scores = {}
    
    # Add the x_factor metric with score of 10
    review_scores["X_Factor"] = 10
    
    # Update the property
    result = listings_col.update_one(
        {"_id": property_doc["_id"]},
        {"$set": {"Review_Scores": review_scores}}
    )
    
    return result.modified_count > 0

# Step 3: Calculate the average score across all metrics
def calculate_avg_score(property_id):
    """Calculate the average score across all review metrics using the query from Question 9"""
    pipeline = [
        {"$match": {"_id": property_id}},
        {"$project": {
            "Review_Scores": 1,
            "scores_array": {
                "$objectToArray": "$Review_Scores"
            }
        }},
        {"$project": {
            "average_score": {
                "$avg": {
                    "$map": {
                        "input": "$scores_array",
                        "as": "score",
                        "in": "$$score.v"
                    }
                }
            }
        }}
    ]
    
    result = list(listings_col.aggregate(pipeline))
    if result:
        return result[0].get("average_score")
    return None

# Step 4: Execute the whole process
def add_x_factor_and_calculate_avg():
    """Add x_factor metric and demonstrate average calculation"""
    # Select a property
    property_doc = select_property_for_metric_update()
    if not property_doc:
        print(" No suitable property found")
        return None
    
    print(f"Selected Property: {property_doc.get('Name', 'Unnamed')} (ID: {property_doc['_id']})")
    
    # Get current review scores and calculate current average
    current_scores = property_doc.get("Review_Scores", {})
    print("\nCurrent Review Scores:")
    if current_scores:
        for metric, score in current_scores.items():
            if score is not None:  # Skip null/None values
                print(f"- {metric}: {score}")
    else:
        print("No review scores currently available")
    
    current_avg = calculate_avg_score(property_doc["_id"])
    if current_avg is not None:
        print(f"\nCurrent Average Score: {current_avg:.2f}")
    else:
        print("\nNo current average score available")
    
    # Add the x_factor metric
    success = add_x_factor_metric(property_doc)
    if success:
        print("\n Successfully added X_Factor metric with score of 10")
    else:
        print("\n Failed to add X_Factor metric")
    
    # Get updated property and calculate new average
    updated_property = listings_col.find_one({"_id": property_doc["_id"]})
    updated_scores = updated_property.get("Review_Scores", {})
    
    print("\nUpdated Review Scores:")
    if updated_scores:
        for metric, score in updated_scores.items():
            if score is not None:  # Skip null/None values
                print(f"- {metric}: {score}")
    
    new_avg = calculate_avg_score(property_doc["_id"])
    if new_avg is not None:
        print(f"\nNew Average Score (including X_Factor): {new_avg:.2f}")
    
    return {
        "property_id": property_doc["_id"],
        "current_avg": current_avg,
        "new_avg": new_avg,
        "x_factor_added": success
    }

# Run the process
x_factor_result = add_x_factor_and_calculate_avg()

Selected Property: Modern Downtown Loft with City Views (ID: prop_1745056970)

Current Review Scores:

No current average score available

 Successfully added X_Factor metric with score of 10

Updated Review Scores:
- X_Factor: 10

New Average Score (including X_Factor): 10.00
