## Seattle AirBNB dataset

## Questions of interest: Given a reviewer, how will they rate each of the houses?
- Can we predict price?
- Can we predict reviews for a house?
- Pick a reviewer at random and predict their rating for any house.
- How many reviews from that user do we need? Plot by number of reviews.


In [2]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [6]:
#load data
calendar = pd.read_csv('seattle_calendar.csv')
listings = pd.read_csv('seattle_listings.csv')
reviews = pd.read_csv('seattle_reviews.csv')

In [7]:
#overview of shape and composition calendar
print(calendar.shape)
print(calendar.dtypes)
print(calendar.head(10))

(1393570, 4)
listing_id     int64
date          object
available     object
price         object
dtype: object
   listing_id        date available   price
0      241032  2016-01-04         t  $85.00
1      241032  2016-01-05         t  $85.00
2      241032  2016-01-06         f     NaN
3      241032  2016-01-07         f     NaN
4      241032  2016-01-08         f     NaN
5      241032  2016-01-09         f     NaN
6      241032  2016-01-10         f     NaN
7      241032  2016-01-11         f     NaN
8      241032  2016-01-12         f     NaN
9      241032  2016-01-13         t  $85.00


In [65]:
calendar.date.nunique()

365

For every house, we have the price and availability for each of the 365 days in the calendar.

Data quality issues in the calendar df:
    - convert date to datetime
    - convert available to boolean
    - convert price to int/float
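
The three calendar fixes listed above could be sketched as follows. This is a minimal sketch on a tiny stand-in frame rather than the full `calendar` df; the helper name `clean_calendar` and the sample rows (including the `$1,250.00` price with a thousands separator) are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Tiny stand-in with the same column layout as the calendar df (illustrative rows).
sample_calendar = pd.DataFrame({
    "listing_id": [241032, 241032, 241032],
    "date": ["2016-01-04", "2016-01-05", "2016-01-06"],
    "available": ["t", "t", "f"],
    "price": ["$85.00", "$1,250.00", None],
})

def clean_calendar(df):
    """Hypothetical helper: return a copy with typed date, available and price."""
    out = df.copy()
    # Dates are stored as 'YYYY-MM-DD' strings.
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # 't'/'f' flags become real booleans.
    out["available"] = out["available"].map({"t": True, "f": False})
    # Strip the currency symbol and any thousands separator, then parse as float.
    # Missing prices (unavailable days) stay NaN.
    out["price"] = pd.to_numeric(
        out["price"].str.replace(r"[$,]", "", regex=True)
    )
    return out

calendar_clean = clean_calendar(sample_calendar)
print(calendar_clean.dtypes)
```

Price is kept as float rather than int here so that the NaN values on unavailable days survive the conversion.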

In [66]:
#overview of shape and composition reviews
print(reviews.shape)

(84849, 6)


In [67]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     84849 non-null  int64 
 1   id             84849 non-null  int64 
 2   date           84849 non-null  object
 3   reviewer_id    84849 non-null  int64 
 4   reviewer_name  84849 non-null  object
 5   comments       84831 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.9+ MB


In [124]:
#return counts greater than 5
reviews['reviewer_id'].value_counts().loc[lambda x : x>5]

206203      67
15121499    32
5775807     19
2734499     19
29590276    18
            ..
423381       6
594768       6
19235033     6
13120075     6
63815        6
Name: reviewer_id, Length: 160, dtype: int64

In [129]:
reviews.comments[16]

'Despite our late booking request, Rachel & Jon were very responsive and helpful over email. It was a great place to stay - the location was ideal, the house was clean, well-furnished, the room was cozy, and the cat made good company. Overall, a lovely experience and I would definitely recommend the Farmhouse! '

For every house, we have the reviews received and who wrote them.

Data quality issues in the reviews df:
    - convert date to datetime

In [10]:
#overview of shape and composition listings
print(listings.shape)
print(listings.dtypes)

(3818, 92)
id                                    int64
listing_url                          object
scrape_id                             int64
last_scraped                         object
name                                 object
                                     ...   
cancellation_policy                  object
require_guest_profile_picture        object
require_guest_phone_verification     object
calculated_host_listings_count        int64
reviews_per_month                   float64
Length: 92, dtype: object


In [18]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 92 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   listing_url                       3818 non-null   object 
 2   scrape_id                         3818 non-null   int64  
 3   last_scraped                      3818 non-null   object 
 4   name                              3818 non-null   object 
 5   summary                           3641 non-null   object 
 6   space                             3249 non-null   object 
 7   description                       3818 non-null   object 
 8   experiences_offered               3818 non-null   object 
 9   neighborhood_overview             2786 non-null   object 
 10  notes                             2212 non-null   object 
 11  transit                           2884 non-null   object 
 12  thumbn

The columns can be divided into sections: host, house description, neighbourhood, and reviews.

Quality issues in the listings df:
    - drop the license column
    - drop neighbourhood; work with the neighbourhood_group_cleansed and neighbourhood_cleansed columns
    - drop square_feet
    - neighborhood_overview: fill in from listings with the same neighbourhood_cleansed
    - space: create a dummy variable or a word count
    - notes: create a dummy variable or a word count
    - transit: create a dummy variable
    - thumbnail_url: create a dummy variable
    - medium_url: create a dummy variable
    - xl_picture_url: create a dummy variable
    - host_about: create a word count
    - host_acceptance_rate, host_response_rate and host_response_time: fill with the mode
    - host_is_superhost: convert to bool
    - weekly_price and monthly_price: convert to int; many NaNs, how do they relate to price?
    - security_deposit, cleaning_fee: fill with 0
    - first_review and last_review: convert to datetime
    - review_scores: how to deal with houses without reviews? Fill with the mean?
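
A few of the listings fixes above (word count, dummy variable, fill with mode, bool conversion, fill with 0) might look like the sketch below. It runs on a small stand-in frame with made-up rows; the new column names `space_word_count` and `has_transit` are hypothetical, and it assumes fee columns are stored as currency strings like the calendar prices.

```python
import pandas as pd

# Small stand-in for a handful of listings columns (illustrative rows only).
sample_listings = pd.DataFrame({
    "space": ["Cozy room near the park", None, "Whole house"],
    "transit": [None, "Bus line one block away", None],
    "host_response_time": ["within an hour", None, "within an hour"],
    "host_is_superhost": ["t", "f", "f"],
    "cleaning_fee": ["$40.00", None, "$25.00"],
})

out = sample_listings.copy()

# Word count for a free-text column; missing text counts as zero words.
out["space_word_count"] = out["space"].fillna("").str.split().str.len()

# Dummy variable: 1 when the field was filled in, 0 otherwise.
out["has_transit"] = out["transit"].notna().astype(int)

# Fill a categorical column with its most frequent value.
mode_value = out["host_response_time"].mode().iloc[0]
out["host_response_time"] = out["host_response_time"].fillna(mode_value)

# 't'/'f' flags to booleans (note: a missing flag would become False here).
out["host_is_superhost"] = out["host_is_superhost"] == "t"

# Fees: strip currency formatting, then treat a missing fee as 0.
out["cleaning_fee"] = pd.to_numeric(
    out["cleaning_fee"].str.replace(r"[$,]", "", regex=True)
).fillna(0.0)

print(out)
```

The same word-count and dummy patterns would carry over to notes, host_about and the URL columns by swapping the column name.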

In [None]:
It would be interesting to create a reviewer table with the average review for each customer. There is very little information about reviewers: only 8% booked more than once.
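
One possible shape for such a reviewer table is sketched below, again on a made-up stand-in frame rather than the real `reviews` df. It uses the number of reviews per `reviewer_id` as a proxy for bookings, which is an assumption; the aggregate column names (`n_reviews`, `n_listings`, `first_review`, `last_review`) are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the reviews df: reviewer 10 wrote three reviews,
# reviewers 20 and 30 wrote one each.
sample_reviews = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2],
    "reviewer_id": [10, 10, 10, 20, 30],
    "date": ["2015-07-19", "2015-08-02", "2015-09-10", "2015-07-20", "2015-08-05"],
})
sample_reviews["date"] = pd.to_datetime(sample_reviews["date"])

# One row per reviewer, with simple activity summaries.
reviewer_table = (
    sample_reviews.groupby("reviewer_id")
    .agg(
        n_reviews=("listing_id", "size"),
        n_listings=("listing_id", "nunique"),
        first_review=("date", "min"),
        last_review=("date", "max"),
    )
    .sort_values("n_reviews", ascending=False)
)

# Share of reviewers who left more than one review.
repeat_share = (reviewer_table["n_reviews"] > 1).mean()

print(reviewer_table)
print(f"Share of reviewers with more than one review: {repeat_share:.1%}")
```

Run against the full reviews df, `repeat_share` would give a figure to compare with the 8% mentioned above.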

How did they rate their stays?

In [130]:
listings.description[0]

"Make your self at home in this charming one-bedroom apartment, centrally-located on the west side of Queen Anne hill.   This elegantly-decorated, completely private apartment (bottom unit of a duplex) has an open floor plan, bamboo floors, a fully equipped kitchen, a TV,  DVD player, basic cable, and a very cozy bedroom with a queen-size bed. The unit sleeps up to four (two in the bedroom and two on the very comfortable fold out couch, linens included) and includes free WiFi and laundry. The apartment opens onto a private deck, complete with it's own BBQ, overlooking a garden and a forest of black bamboo.    The Apartment is perfectly-located just one block from the bus lines where you can catch a bus and be downtown Seattle in fifteen minutes or historic Ballard in ten or a quick five-minute walk will bring you to Whole Foods and Peet's Coffee or take a fifteen minute walk to the top of Queen Anne Hill where you will find a variety of eclectic shops, bars, and restaurants. There is n

### Wrangle data

#### reviews df