# Phase 2 - INFO 2950 Project

## Potential Research Questions 
(Listed in perceived order of significance)

### - What do guest reviews tell us about the population's preferred listing type?
### - Does the minimum number of nights on a listing relate to price/availability?
Note: Under New York State law, it is illegal to rent out a permanent residence dwelling for fewer than 30 days without the owner or resident present. Airbnb is frequently under fire for letting its hosts do exactly that.
### - Do hosts who have been with Airbnb longer have better reviews?
### - Do hosts with multiple listings keep them concentrated in one location or spread out across the city?

```python
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load cleaned data
airbnb = pd.read_csv("/Users/eliseburdette/Desktop/airbnb.csv")
airbnbrev = pd.read_csv("/Users/eliseburdette/Desktop/airbnbreviews.csv")
```

## Data Description

- This data is a collection of more than 45,000 observed Airbnb listings in New York City. The attributes include: the ID numbers of the listing and its host; the date the host joined Airbnb; the number of listings each host has; the borough in which the listing is located and its latitude/longitude; the property type and room type of the listing; how many guests the listing can accommodate, along with its bedroom and bed counts; the price of the listing; the minimum and maximum number of nights a guest may stay; how many of the next 30, 60, 90, and 365 days the host has made the listing available for rent; the dates of the first and latest reviews; the total guest review score; guest review scores for cleanliness, location, and value; and the average number of reviews per month.
- The dataset was created by Inside Airbnb, an independent project that aims to add data to the debates surrounding the ethics and legality of Airbnb and its methods. Inside Airbnb is not affiliated with Airbnb, but sources all of its data from publicly available information on the Airbnb website. The dataset was created to answer fundamental questions about "how Airbnb is really being used in cities around the world." Inside Airbnb seeks to make Airbnb's data more accessible and open to scrutiny so that the public's understanding of the company does not depend solely on Airbnb's own claims.
- Inside Airbnb is personally funded by Murray Cox, an Australian community activist and technologist who is its founder and chief data activist. The dataset and its collection are also supported by community donations. Murray Cox describes Inside Airbnb as an "activist platform... to help cities and communities respond to the threat of Airbnb on residential neighborhoods throughout the world." Inside Airbnb is not endorsed by any of Airbnb's competitors.
- All observed data came directly from Airbnb's publicly available information, including current listings, the availability calendar for the next 365 days, and the reviews for each listing. Thus, what Airbnb chooses to make public strongly influences what Inside Airbnb records. Inside Airbnb claims that all data is verified. Its data is a snapshot of the listings available at a particular time: a listing is recorded when Inside Airbnb observes it on the Airbnb site, not when the listing is created, so there is likely some lag between the two.
- Inside Airbnb discloses that it has already performed some verification, cleansing, and aggregation on the data it provides. It uses an occupancy model called the "San Francisco Model," which can be used to estimate metrics not provided by the Airbnb site. The methodology of the model is outlined explicitly in Inside Airbnb's disclaimers. Our own preprocessing included deleting attributes that were not relevant to our research questions (such as a listing's unique name or the description written by the host). We also created a second data frame that includes only listings with at least one review. This allowed us to drop nearly 10,000 observations that would not contribute to any analysis of guest reviews.
- All data was collected from Airbnb's publicly available listings. Hosts creating the listings may or may not be aware of Inside Airbnb's data collection, but all listings are uploaded and specified voluntarily by hosts in accordance with Airbnb's policies. To upload a listing, hosts agree that Airbnb will make the listing data publicly available, which hosts likely believe is primarily for the purpose of finding guests.
- Raw Data Cornell Box Link: https://cornell.box.com/s/q60hl086lws4nro5pf8snqwc7pab7q5w
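
The two preprocessing steps described above (dropping unused attributes and building a reviews-only data frame) can be sketched roughly as follows. This is a minimal illustration on a toy table, not our actual cleaning script: the column names `name`, `description`, and `number_of_reviews` are assumptions chosen to mirror the attributes described above, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the raw listings table. The column names and values are
# hypothetical and only mirror the attributes described in this section.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Cozy loft", "Sunny room", "Quiet studio", "Big flat"],
    "description": ["text", "text", "text", "text"],
    "price": [120, 65, 90, 210],
    "number_of_reviews": [12, 0, 3, 0],
})

# Step 1: drop free-text attributes that the research questions do not use.
unused_columns = ["name", "description"]
listings = raw.drop(columns=unused_columns)

# Step 2: build a second data frame with only the listings that have
# at least one review, for any analysis that relies on guest reviews.
reviewed = listings[listings["number_of_reviews"] > 0].reset_index(drop=True)

print(f"{len(listings)} listings, {len(reviewed)} with at least one review")
```

Keeping the full table and the reviews-only table as separate data frames lets questions about price or availability use every listing, while review-based questions use only the rows where a review score exists.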

## Data Limitations
- Inside Airbnb collects its data directly from Airbnb. The Airbnb calendar for a listing does not differentiate between a booked night and a night made unavailable by the host, so Inside Airbnb has to count booked nights as unavailable nights. This understates the availability attributes of our data, since a popular listing that is heavily booked will appear to be unavailable. A listing with a monthly availability of 10 should indicate that the host is only renting out the unit for 10 days out of the month. However, a listing that the host offers for 30 days per month but that is already booked for 20 of them will show the same monthly availability of 10. This means we have to take the availability attributes with a grain of salt and perhaps cross-reference them with other attributes before drawing any conclusions about the booking history of a listing. The availability metric also depends on hosts keeping their listings' calendars updated.
- Leaving reviews is optional for guests, so if review count is used as an indicator of booking activity, it will not equal the number of actual bookings. Because not all guests leave a review, a lot of data is also missing from the review scores themselves. Say listings A and B both had ten bookings. If only one of A's guests left a review, worth 7 points, while the other nine guests would have given 10 points but chose not to leave a review, then listing A would show a total review score of 7. Had the other nine guests left reviews, its total score would have been 9.7. If listing B received ten 8-point reviews, on the other hand, its total review score would be 8. Comparing listings A and B on the data's total review score would mislead us into assuming that listing B was preferable to A. The data is limited in that we only know the ratings of the guests who took the time to leave a review, and these are not guaranteed to reflect the average experience of every guest. All attributes involving reviews are affected by this limitation, but since we cannot track down missing reviews, we cannot know to what extent.
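
The arithmetic behind both limitations can be checked with a short sketch. The helper functions `shown_availability` and `mean_score` are hypothetical names introduced only for this illustration, and the numbers are the ones from the examples above, not values from the dataset.

```python
def shown_availability(days_offered, days_booked):
    """Days that still appear available on the calendar for one month."""
    return days_offered - days_booked


def mean_score(scores):
    """Average of the review scores that were actually submitted."""
    return sum(scores) / len(scores)


# Availability: a rarely offered listing and a popular listing look identical.
rarely_offered = shown_availability(days_offered=10, days_booked=0)
popular = shown_availability(days_offered=30, days_booked=20)
print(rarely_offered, popular)  # both are 10

# Reviews: listing A has one submitted review of 7 and nine unsubmitted 10s.
a_observed = mean_score([7])
a_if_all_reviewed = mean_score([7] + [10] * 9)
# Listing B has ten submitted reviews of 8.
b_observed = mean_score([8] * 10)

print(a_observed, a_if_all_reviewed, b_observed)
```

On the observed scores, B (8) ranks above A (7), even though A would score 9.7 if every guest had left a review. The dataset only ever contains the observed values, so this gap cannot be measured directly.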

## Questions for Reviewers
- 