# Predicting Airbnb Listing Prices in Hawaii

## Project Topic
This project aims to predict the price of Airbnb listings in Hawaii, which makes it a regression problem. I will consider the factors that make up an Airbnb listing, such as location, number of bedrooms, property type, and reviews, to determine which of them matter most for the pricing of Airbnb listings in Hawaii.

## Project Goal
My family is from Hawaii, and I love to visit the islands as much as I can. I hope to create a tool that helps people determine fair pricing for lodging in Hawaii. To achieve this, I will identify the most important factors that determine Airbnb pricing and ultimately build a model that can predict or suggest prices for Airbnb hosts and guests. Hosts can use the model to price their listings competitively, while guests can check that they are getting a fair deal.

## Data Description and Citation
I will be using the Inside Airbnb: Hawaii dataset, available on Inside Airbnb. Inside Airbnb is updated regularly and provides comprehensive information about Airbnb listings in various cities around the world. The Hawaii dataset specifically covers listings across the Hawaiian islands. It consists of approximately 28,000 Airbnb listings in Hawaii, from June 9th, 2022 to September 12th, 2022. The attributes include the listing ID, name, host ID, host name, island, neighbourhood, latitude, longitude, room type, price, minimum nights, number of reviews, last review date, reviews per month, calculated host listings count, and availability in the year. The exact size of the table is 28,580 rows by 18 columns. Below is the first entry in the dataset, shown as a sample of the attributes.

In [13]:
import pandas as pd

df = pd.read_csv('listings.csv')

#Display the first row transposed for visibility
print(df.head(1).transpose())

                                                                        0
id                                                                   5269
name                            Upcountry Hospitality in the 'Auwai Suite
host_id                                                              7620
host_name                                                       Lea & Pat
neighbourhood_group                                                Hawaii
neighbourhood                                                South Kohala
latitude                                                          20.0274
longitude                                                        -155.702
room_type                                                 Entire home/apt
price                                                                 149
minimum_nights                                                          5
number_of_reviews                                                      24
last_review                           

*Get the Data.* Inside Airbnb. (2023). Detailed Listings data for Hawaii, United States. Retrieved July 5, 2023, from http://insideairbnb.com/get-the-data/.

## Data Cleaning and EDA
For part 1 of the project, I will handle the cleaning steps that are immediately obvious. A handful of things need to be done to ensure a clean, consistent, and efficient data sample.

### Remove Duplicate Entries
To be safe, I will start by removing any duplicate entries in the dataframe. This ensures that each listing is counted once and carries equal weight in later analysis.

In [14]:
# If 'id' column exists, drop duplicate rows based on 'id'
if 'id' in df.columns:
    before = df.shape[0]

    df = df.drop_duplicates(subset='id')

    after = df.shape[0]

    if before > after:
        print(f"Duplicates removed: {before - after}")
    else:
        print("No duplicates found")

No duplicates found


### Attribute Removal
I will remove unnecessary attributes from the dataframe. Each removed attribute is listed below with a brief explanation of why it is unnecessary.
##### id
Example: 5269

This is the unique ID number of each Airbnb listing. It is meaningless for any calculations I will be doing, and is likely only useful for Airbnb's internal use.
##### host_id
Example: 7620

This is the unique ID number of each Airbnb host. It is meaningless for any calculations I will be doing, and is likely only useful for Airbnb's internal use.
##### license
Example: 119-269-5808-01R

This likely identifies the license held by each property for legal purposes. It does not contain information useful for determining the pricing of Airbnb listings.

In [15]:
# Remove unnecessary data attributes.
# errors='ignore' skips any column that is not present in the dataframe.
columns_to_drop = ['id', 'host_id', 'license']
df = df.drop(columns=columns_to_drop, errors='ignore')

print(df.head(1).transpose())

                                                                        0
name                            Upcountry Hospitality in the 'Auwai Suite
host_name                                                       Lea & Pat
neighbourhood_group                                                Hawaii
neighbourhood                                                South Kohala
latitude                                                          20.0274
longitude                                                        -155.702
room_type                                                 Entire home/apt
price                                                                 149
minimum_nights                                                          5
number_of_reviews                                                      24
last_review                                                    2022-07-13
reviews_per_month                                                    0.17
calculated_host_listings_count        
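
One way to sanity-check that a column behaves like an identifier before dropping it is to look at how many distinct values it holds relative to the number of rows. The sketch below is only an illustration under that assumption: `uniqueness_ratio` is a hypothetical helper, and the toy frame is made up rather than drawn from the dataset.

```python
import pandas as pd


def uniqueness_ratio(frame: pd.DataFrame) -> pd.Series:
    """Return distinct-value count divided by row count for each column.

    Hypothetical helper: a ratio near 1.0 suggests an identifier-like
    column that carries little signal for a price model.
    """
    return (frame.nunique(dropna=False) / len(frame)).sort_values(ascending=False)


# Toy data for illustration only (not real listings)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 10, 11, 12],
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt"],
})

print(uniqueness_ratio(toy))
```

A high ratio is a hint rather than a rule: continuous attributes such as latitude can also be nearly unique per row while still being informative.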

### Further Data Removal Considerations
There may be other attributes I choose to remove later, depending on how useful they prove to be. For example, I may eventually remove the latitude and longitude attributes, but for now I will keep them in case I find a way to use them for geographic visualizations and predictions.
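
As one possible use of latitude and longitude, the sketch below derives a distance feature with the haversine (great-circle) formula. It is an illustration under stated assumptions, not a step this project has committed to: the function name `haversine_km`, the column name `dist_to_ref_km`, and the reference point (approximate coordinates near Waikiki) are all hypothetical choices.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical reference point: approximate coordinates near Waikiki, Oahu
REF_LAT, REF_LON = 21.28, -157.83

# Toy rows for illustration; the first uses the sample listing's coordinates
toy = pd.DataFrame({
    "latitude": [20.0274, 21.28],
    "longitude": [-155.702, -157.83],
})
toy["dist_to_ref_km"] = haversine_km(
    toy["latitude"], toy["longitude"], REF_LAT, REF_LON
)

print(toy)
```

Because the function is written with NumPy operations, it accepts whole columns at once, so no explicit loop over listings is needed.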