# House Hunters - HGTV

"House Hunters" is a television show in which families are shown several homes and then choose one to purchase. Each episode is set in a specific city.

Since its debut in 1999, the show has aired over 3,200 episodes. However, there is no official database that specifies the locations featured in each episode. 

**The episode descriptions do not explicitly state their locations, so GPT is used here to infer them.**

Utilizing the database created in this notebook, users can now locate episodes specific to a certain city. This is particularly useful for people considering relocating, as it provides them with actual video tours of neighborhoods, offering a glimpse into the housing market of the city.

The actual episodes are not available here, but they can be accessed on various streaming platforms.

All the data presented here was sourced from: https://www.hgtv.com/shows/house-hunters/episodes/


<div class='tableauPlaceholder' id='viz1706735029553' style='position: relative'><noscript><a href='#'><img alt='Dashboard 2 (3) ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;HH&#47;HH_2024&#47;Dashboard23&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='HH_2024&#47;Dashboard23' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;HH&#47;HH_2024&#47;Dashboard23&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1706735029553');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='977px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

To interact with the data, use the link below:

https://public.tableau.com/views/HH_2024/Dashboard23?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link

## Overview:
- Retrieve data from the official website.
- Utilize OpenAI's GPT to infer locations based on episode descriptions.
- Clean the data.
- Visualize.


-------------

Imports and OpenAI Key

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from openai import OpenAI

In [None]:
# Read the OpenAI API key from a local text file
def get_api_key(file_path):
    with open(file_path, 'r') as file:
        return file.read().strip()

api_key = get_api_key('config.txt') 
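
As an alternative to a key file alone, the key can also be looked up in an environment variable first. The sketch below is only an illustration: the helper name `load_api_key` and its parameters are hypothetical, and `OPENAI_API_KEY` is assumed as the variable name.

```python
import os
from pathlib import Path


def load_api_key(file_path="config.txt", env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(file_path)
    if path.is_file():
        return path.read_text().strip()
    # Neither source provided a key, so stop early with a clear message
    raise RuntimeError(f"No API key found in ${env_var} or {file_path}")
```

Calling `load_api_key()` in place of `get_api_key('config.txt')` would keep the rest of the notebook unchanged.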

## Retrieve data from the official website
- Navigate through all the pages to retrieve information on seasons, episodes, titles, and descriptions.

In [None]:
# Start URL
url = 'https://www.hgtv.com/shows/house-hunters/episodes/1a'

all_data = []

while True:
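    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page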
    soup = BeautifulSoup(response.content, 'html.parser')

    episode_info = soup.find_all('span', class_='m-EpisodeCard__a-AssetInfo')
    description_elements = soup.find_all('p', class_='m-EpisodeCard__a-Description')

    for ep, desc in zip(episode_info, description_elements):
        text = ep.text.split(',')
        if len(text) == 2:
            season, episode = text
            season = season.split(' ')[1]  # Extract season number
            episode = episode.split(' ')[2]  # Extract episode number
            description = desc.text.strip()
            all_data.append({'Season': season, 'Episode': episode, 'Description': description})

    # Find the "Next" button link
    next_button = soup.find('a', class_='o-Pagination__a-Button o-Pagination__a-NextButton')
    if next_button and 'href' in next_button.attrs:
        url = 'https:' + next_button['href']
        # Stop if the final URL is reached
        if url.endswith('/episodes/246'):
            break
    else:
        break

# Create DataFrame
df = pd.DataFrame(all_data)

df
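
The scraper above extracts the season and episode numbers by splitting on commas and spaces, which depends on exact spacing. A regular expression is one possible way to make that step more tolerant. This is a minimal sketch, assuming the card text looks like "Season 12, Episode 3"; the name `parse_asset_info` is hypothetical.

```python
import re

# Matches strings such as "Season 12, Episode 3"; the exact label text is an assumption
ASSET_INFO_RE = re.compile(r"Season\s+(\d+)\s*,\s*Episode\s+(\d+)", re.IGNORECASE)


def parse_asset_info(text):
    """Return (season, episode) as integers, or None if the text does not match."""
    match = ASSET_INFO_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_asset_info("Season 12, Episode 3"))  # (12, 3)
print(parse_asset_info("Special"))               # None
```

Returning `None` for unmatched text mirrors the `len(text) == 2` check in the loop, which silently skips cards that do not fit the expected pattern.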

Save the raw data to an .xlsx file as a backup

In [None]:
# Save the DataFrame to an Excel file
file_path = "full_data_HH.xlsx"

df.to_excel(file_path, index=False)

## OpenAI's GPT to Infer locations based on episode descriptions

- Define the base prompt.
- Define the answer format.
- Provide the episode title and description.

In [None]:
# Initialize the OpenAI client with your API key (api_key must be passed as a keyword argument)
client = OpenAI(api_key=api_key)

# Function to query GPT-3.5 with the new API
def query_gpt3(title, description, counter):
    prompt = (
        f"In this scenario, you are presented with a description of a TV show focused on house buyers. Your task is to discern, based on the provided information, the city and state (if applicable) in which the show is likely set. Your response should adhere to the format 'city, state, country.'\n"
        f"In cases where the city's identity remains uncertain, please indicate 'unknown' for the city. If you can identify the city but not the state or country, indicate the city's name, followed by 'unknown' for the state and country. It is crucial to maintain the correct order of 'city, state, country.'\n"
        f"Please provide a confident response regarding the city, and if any uncertainty persists, use 'unknown' for the city. Just give the answer in the desired format, no explanations:\n"
        f"title of episode: {title}\n"
        f"description of episode: {description}\n"
    )

    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo"
    )

    # Extracting the response text
    response_text = response.choices[0].message.content

    # Splitting the response text into city, state, and country
    split_response = response_text.split(", ")
    city = split_response[0] if len(split_response) > 0 else "unknown"
    state = split_response[1] if len(split_response) > 1 else "unknown"
    country = split_response[2] if len(split_response) > 2 else "unknown"

    # Print the result for tracking
    print(f"Response {counter}: {city}, {state}, {country}")

    return city, state, country

# Load the Excel file
df = pd.read_excel("full_data_HH.xlsx")

# Process each row and store responses
cities = []
states = []
countries = []
counter = 1
for index, row in df.iterrows():
    # The raw file has no "City" column on the first pass, so a missing value counts as unknown
    city = row.get('City', 'unknown')
    if city == "unknown":  # Query GPT only for rows without a known city
        # The scraper above does not store a "Title" column; fall back to an empty title
        title = row.get('Title', '')
        description = row['Description']
        city, state, country = query_gpt3(title, description, counter)
    else:
        # If the city is already known, keep the existing values from the DataFrame
        state = row['State']
        country = row['Country']
        print("OK")
    
    cities.append(city)
    states.append(state)
    countries.append(country)
    counter += 1

# Add responses to DataFrame
df['City'] = cities
df['State'] = states
df['Country'] = countries

# Optional: Save the DataFrame to a new Excel file
df.to_excel("gpt3_round_1.xlsx", index=False)
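
Because the model's reply is free text, splitting on `", "` can misplace fields when the reply has a trailing period, missing spaces, or fewer than three parts. The function below sketches a more defensive parser. The name `parse_location` is hypothetical, and it assumes the reply keeps the 'city, state, country' order requested in the prompt.

```python
def parse_location(response_text):
    """Split a 'city, state, country' reply into exactly three cleaned fields."""
    stripped = response_text.strip()
    # Use only the first line in case the model adds an explanation afterwards
    first_line = stripped.splitlines()[0] if stripped else ""
    # Trim whitespace and stray quotes around each comma-separated field
    parts = [part.strip(" '\"") for part in first_line.split(",")]
    parts = [part or "unknown" for part in parts]
    # Pad short replies with "unknown" and ignore anything past the third field
    city, state, country = (parts + ["unknown"] * 3)[:3]
    # Drop the trailing period that the prompt's format example tends to invite
    country = country.rstrip(".") or "unknown"
    return city, state, country


print(parse_location("Austin, Texas, United States."))
print(parse_location("Paris,France"))
```

The two calls print `('Austin', 'Texas', 'United States')` and `('Paris', 'France', 'unknown')`.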


## Clean the data
- Strip stray punctuation from the country column.
- Hard-code fixes for common mistakes in the GPT output.
- Make another GPT pass over rows whose city is still unknown.

In [None]:
import pandas as pd

# Read the Excel file
input_file = "gpt3_round_1.xlsx"
output_file = "gpt3_responses_full_revised.xlsx"
df = pd.read_excel(input_file)

# 1. Remove '.' from the "Country" column (regex=False so '.' is treated literally, not as a wildcard)
df['Country'] = df['Country'].str.replace('.', '', regex=False)

# 2. Replace "USA" with "United States" in the "Country" column
df['Country'] = df['Country'].replace('USA', 'United States')

# 3. Replace "country" with "unknown" in the "Country" column
df['Country'] = df['Country'].replace('country', 'unknown')

# 4. Replace "D.C." or "DC" with "Washington D.C." in the "State" column
df['State'] = df['State'].replace(['D.C.', 'DC'], 'Washington D.C.')

# 5. Replace "Washington State" with "Washington" in the "State" column
df['State'] = df['State'].replace('Washington State', 'Washington')

# 6. Replace "Ill." or "Illinois,United States." with "Illinois" in the "State" column
df['State'] = df['State'].replace(['Ill.', 'Illinois,United States.'], 'Illinois')

# 7. Replace "New York City" with "New York" in the "State" column
df['State'] = df['State'].replace('New York City', 'New York')

# 8. Replace "Maryland-Virginia" with "Maryland" in the "State" column
df['State'] = df['State'].replace('Maryland-Virginia', 'Maryland')

# 9. Replace "Ga." with "Georgia" in the "State" column
df['State'] = df['State'].replace('Ga.', 'Georgia')

# 10. Replace "state" with "unknown" in the "State" column
df['State'] = df['State'].replace('state', 'unknown')

# Save the revised data to a new Excel file
df.to_excel(output_file, index=False)

print("Revised data saved to", output_file)
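
The ten replacements above can also be collected into lookup tables, which makes it easier to add new fixes as they turn up. The sketch below uses the same mappings as the cell above. The names `STATE_FIXES`, `COUNTRY_FIXES`, `normalize_state`, and `normalize_country` are hypothetical.

```python
# Replacement tables taken from the cleaning steps above
STATE_FIXES = {
    "D.C.": "Washington D.C.",
    "DC": "Washington D.C.",
    "Washington State": "Washington",
    "Ill.": "Illinois",
    "Illinois,United States.": "Illinois",
    "New York City": "New York",
    "Maryland-Virginia": "Maryland",
    "Ga.": "Georgia",
    "state": "unknown",
}
COUNTRY_FIXES = {"USA": "United States", "country": "unknown"}


def normalize_state(value):
    """Return the corrected state name, or the value unchanged if no fix applies."""
    return STATE_FIXES.get(value, value)


def normalize_country(value):
    """Remove periods first (step 1), then apply the country fixes (steps 2 and 3)."""
    cleaned = value.replace(".", "")
    return COUNTRY_FIXES.get(cleaned, cleaned)


print(normalize_state("Ga."))       # Georgia
print(normalize_country("U.S.A."))  # United States
```

With pandas, the same tables could be applied in one call each, for example `df['State'] = df['State'].replace(STATE_FIXES)`.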


Use GPT again on rows where the city is still "unknown"

In [None]:
# Initialize the OpenAI client with your API key (api_key must be passed as a keyword argument)
client = OpenAI(api_key=api_key)

# Function to query GPT-3.5 with the new API
def query_gpt3(title, description, counter):
    prompt = (
        f"In this scenario, you are presented with a description of a TV show focused on house buyers. Your task is to discern, based on the provided information, the city and state (if applicable) in which the show is likely set. Your response should adhere to the format 'city, state, country.'\n"
        f"In cases where the city's identity remains uncertain, please indicate 'unknown' for the city. If you can identify the city but not the state or country, indicate the city's name, followed by 'unknown' for the state and country. It is crucial to maintain the correct order of 'city, state, country.'\n"
        f"Please provide a confident response regarding the city, and if any uncertainty persists, use 'unknown' for the city. Just give the answer in the desired format, no explanations:\n"
        f"title of episode: {title}\n"
        f"description of episode: {description}\n"
    )

    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo"
    )

    # Extracting the response text
    response_text = response.choices[0].message.content

    # Splitting the response text into city, state, and country
    split_response = response_text.split(", ")
    city = split_response[0] if len(split_response) > 0 else "unknown"
    state = split_response[1] if len(split_response) > 1 else "unknown"
    country = split_response[2] if len(split_response) > 2 else "unknown"

    # Print the result for tracking
    print(f"Response {counter}: {city}, {state}, {country}")

    return city, state, country

# Load the Excel file
df = pd.read_excel("gpt3_responses_full_revised.xlsx")

# Process each row and store responses
cities = []
states = []
countries = []
counter = 1
for index, row in df.iterrows():
    city = row['City']  # Get the value in the "City" column
    if city == "unknown":  # Check if the city is unknown
        title = row.get('Title', '')  # Fall back to an empty title if the column is missing
        description = row['Description']
        city, state, country = query_gpt3(title, description, counter)
    else:
        # If the city is not "unknown," use the existing values from the DataFrame
        state = row['State']
        country = row['Country']
        print("OK")
    
    cities.append(city)
    states.append(state)
    countries.append(country)
    counter += 1

# Add responses to DataFrame
df['City'] = cities
df['State'] = states
df['Country'] = countries

Save the final result as a backup

In [None]:
df.to_excel("gpt3_round_3.xlsx", index=False)
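
Once the locations are filled in, finding the episodes for a given city is a simple filter. The sketch below works on plain dictionaries so it runs without the Excel file. The function name `episodes_for_city` is hypothetical, and the sample rows are placeholders for illustration only, not values from the real dataset.

```python
def episodes_for_city(rows, city, state=None):
    """Return rows whose City (and optionally State) match, ignoring case."""
    wanted_city = city.strip().lower()
    wanted_state = state.strip().lower() if state else None
    matches = []
    for row in rows:
        if str(row.get("City", "")).strip().lower() != wanted_city:
            continue
        if wanted_state and str(row.get("State", "")).strip().lower() != wanted_state:
            continue
        matches.append(row)
    return matches


# Placeholder rows for illustration only; they are not taken from the real dataset
sample_rows = [
    {"Season": 1, "Episode": 1, "City": "Austin", "State": "Texas"},
    {"Season": 1, "Episode": 2, "City": "Portland", "State": "Oregon"},
    {"Season": 2, "Episode": 5, "City": "Portland", "State": "Maine"},
]
print(episodes_for_city(sample_rows, "portland", state="Oregon"))
```

To run this against the notebook's results, `df.to_dict("records")` would supply the rows. Passing the state helps separate cities that share a name.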

## Visualize

The link below is the best way to visualize and interact with the data. It leverages Tableau Public.

https://public.tableau.com/views/HH_2024/Dashboard23?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link
