##### Lauren Forando
##### April 27, 2023
##### EDA - Single and Pairwise
##### Problem Set 6

In [1]:
%matplotlib inline 

In [2]:
import xlrd
import os 
import sqlite3
import csv
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set(style="whitegrid")

##### Create Dataframe with variables of interest

In [3]:
db_path = './datawarehouse.db'
con = sqlite3.connect(db_path)

# read_sql_query already returns a DataFrame, so the columns are selected
# in the desired order directly in the query
df = pd.read_sql_query('''SELECT room_type, price, instant_bookable, property_type
FROM listings''', con)
df

Unnamed: 0,room_type,price,instant_bookable,property_type
0,Private room,$20.00,f,Private room in townhouse
1,Entire home/apt,$185.00,f,Entire townhouse
2,Entire home/apt,$221.00,f,Entire rental unit
3,Entire home/apt,$142.00,t,Entire guest suite
4,Entire home/apt,$398.00,t,Entire rental unit
...,...,...,...,...
8623,Entire home/apt,$70.00,f,Entire condo
8624,Entire home/apt,$253.00,t,Entire serviced apartment
8625,Entire home/apt,$95.00,t,Entire condo
8626,Entire home/apt,$180.00,t,Entire serviced apartment


### Single Variable EDA

In [4]:
# this changes price from currency to a float
df["price"] = df["price"].replace("[$,]", "", regex=True).astype(float)
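
As a quick check of the conversion above, here is a minimal sketch on a few sample strings. The first two values come from the table output above; the comma-formatted value is an assumption about how prices of $1,000 or more are stored in the raw data.

```python
import pandas as pd

# Sample price strings; "$1,903.00" is an assumed format for large prices.
prices = pd.Series(["$20.00", "$185.00", "$1,903.00"])

# Strip dollar signs and thousands separators, then cast to float.
cleaned = prices.replace("[$,]", "", regex=True).astype(float)
print(cleaned.tolist())
```

Inside the character class `[$,]`, the `$` is treated as a literal dollar sign rather than an end-of-string anchor, so it does not need escaping.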

### Room Type

In [5]:
df.room_type.value_counts()

Entire home/apt    6455
Private room       2028
Shared room         123
Hotel room           22
Name: room_type, dtype: int64

Roughly 74.8% of the listings are entire homes or apartments, 23.5% are private rooms, 1.4% are shared rooms, and less than 1% (about 0.3%) are hotel rooms.
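
The percentages can be reproduced from the counts rather than computed by hand. This is a small sketch that rebuilds the counts from the output above; on the full dataframe, `df.room_type.value_counts(normalize=True)` should give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
room_counts = pd.Series(
    {"Entire home/apt": 6455, "Private room": 2028, "Shared room": 123, "Hotel room": 22}
)

# Share of each room type as a percentage of all listings, rounded to one decimal.
room_share = (room_counts / room_counts.sum() * 100).round(1)
print(room_share)
```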

### Instant Bookable

In [6]:
df.instant_bookable.value_counts()

f    5767
t    2861
Name: instant_bookable, dtype: int64

Approximately 67% of the observed listings (5,767 of 8,628) cannot be booked instantly, whereas 33% (2,861) can.

### Property Type

In [7]:
df.property_type.value_counts()

Entire rental unit                    2819
Entire home                            935
Private room in home                   746
Entire condo                           673
Entire serviced apartment              655
Entire townhouse                       630
Entire guest suite                     552
Private room in rental unit            506
Private room in townhouse              354
Room in boutique hotel                 120
Room in hotel                           98
Private room in condo                   82
Entire guesthouse                       79
Shared room in rental unit              50
Private room in bed and breakfast       45
Private room in guest suite             42
Entire loft                             37
Shared room in townhouse                29
Shared room in home                     21
Room in aparthotel                      19
Shared room in hostel                   17
Entire vacation home                    17
Entire bungalow                         11
Room in hos

Although property type is a categorical variable, many of its values are specific names that apply to only one or a few listings, which produces a large number of sparsely populated categories.
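
One way to quantify how sparse the categories are is to count those below a cutoff. The sketch below uses only a handful of counts copied from the output above, and the cutoff of 50 listings is a hypothetical choice, not something taken from the analysis.

```python
import pandas as pd

# A few counts copied from the output above (not the full list).
type_counts = pd.Series({
    "Entire rental unit": 2819,
    "Entire home": 935,
    "Entire loft": 37,
    "Shared room in hostel": 17,
    "Entire bungalow": 11,
})

threshold = 50  # hypothetical cutoff for a "rare" category
rare = type_counts[type_counts < threshold]

print(f"{len(rare)} of {len(type_counts)} sampled categories have fewer than {threshold} listings")
print(f"They cover {rare.sum() / type_counts.sum():.1%} of the sampled listings")
```

On the real data, `type_counts` would be `df.property_type.value_counts()`.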

### Pairwise EDA

#### Room Type vs Price

In [8]:
from tabulate import tabulate

# Summary statistics of price for each room type
grouped_price_room_type = df[['room_type', 'price']].groupby('room_type')
grouped_price_room_type_stats = grouped_price_room_type['price'].describe()

table = tabulate(grouped_price_room_type_stats,
                 headers=('Room Type', 'Count', 'Mean', 'Stdev', 'Min', '25%', '50%', '75%', 'Max'),
                 tablefmt='fancy_grid')
print(table)
print('\n')

╒═════════════════╤═════════╤══════════╤══════════╤═══════╤═══════╤═══════╤═══════╤═══════╕
│ Room Type       │   Count │     Mean │    Stdev │   Min │   25% │   50% │   75% │   Max │
╞═════════════════╪═════════╪══════════╪══════════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ Entire home/apt │    6455 │ 202.196  │ 238.344  │    10 │   105 │   150 │ 228   │  7500 │
├─────────────────┼─────────┼──────────┼──────────┼───────┼───────┼───────┼───────┼───────┤
│ Hotel room      │      22 │  59.9545 │ 106.886  │     0 │    25 │    27 │  29   │   489 │
├─────────────────┼─────────┼──────────┼──────────┼───────┼───────┼───────┼───────┼───────┤
│ Private room    │    2028 │ 123.777  │ 167.238  │    20 │    52 │    75 │ 114   │  2000 │
├─────────────────┼─────────┼──────────┼──────────┼───────┼───────┼───────┼───────┼───────┤
│ Shared room     │     123 │  54.0244 │  25.0882 │    16 │    35 │    45 │  63.5 │   140 │
╘═════════════════╧═════════╧══════════╧══════════╧═══════╧═══════╧═══════╧═══════╧═══════╛

Entire homes/apartments are on average the most expensive ($202.20), followed by private rooms ($123.78). The least expensive stay on average is a shared room ($54.02). The price range for entire homes/apartments is much wider than for the other room types, running from $10 to $7,500, while 75% of those listings cost $228 or less. This gap between the upper quartile and the maximum suggests that the dataset contains some extreme values.
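
To make "extreme" more concrete, one common rule of thumb is Tukey's fence: values above Q3 + 1.5 × IQR are flagged as potential outliers. This rule is not part of the original analysis; the sketch below simply applies it to the quartiles reported for entire homes/apartments.

```python
# Quartiles and maximum for "Entire home/apt", copied from the table above.
q1, q3, max_price = 105.0, 228.0, 7500.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # Tukey's rule-of-thumb upper bound

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print("max exceeds fence:", max_price > upper_fence)
```

Under this rule, any entire home/apartment priced above the fence would be worth a closer look before modeling.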


#### Instant Bookable vs Price

In [9]:
# Summary statistics of price by instant-bookable flag
grouped_price_instant_bookable = df[['instant_bookable', 'price']].groupby('instant_bookable')
grouped_price_instant_bookable_stats = grouped_price_instant_bookable['price'].describe()

table2 = tabulate(grouped_price_instant_bookable_stats,
                  headers=('Instant Bookable', 'Count', 'Mean', 'Stdev', 'Min', '25%', '50%', '75%', 'Max'),
                  tablefmt='fancy_grid')
print(table2)
print('\n')

╒════════════════════╤═════════╤═════════╤═════════╤═══════╤═══════╤═══════╤═══════╤═══════╕
│ Instant Bookable   │   Count │    Mean │   Stdev │   Min │   25% │   50% │   75% │   Max │
╞════════════════════╪═════════╪═════════╪═════════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ f                  │    5767 │ 176.207 │ 246.786 │     0 │    84 │   126 │   198 │  7500 │
├────────────────────┼─────────┼─────────┼─────────┼───────┼───────┼───────┼───────┼───────┤
│ t                  │    2861 │ 191.532 │ 171.252 │    16 │    89 │   143 │   240 │  1903 │
╘════════════════════╧═════════╧═════════╧═════════╧═══════╧═══════╧═══════╧═══════╧═══════╛




The average price of an Airbnb listing is higher for listings that are instantly bookable ($191.53) than for those that are not ($176.21). Interestingly, the minimum price for an instantly bookable stay is $16, whereas the minimum for non-instantly bookable stays is $0, which more likely reflects a missing or erroneous price than a genuinely free listing.
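
A quick way to inspect such zero-priced rows is to filter them out and recompute the minimum. The sketch below uses a toy dataframe with illustrative values standing in for the real listings table.

```python
import pandas as pd

# Toy rows standing in for the real listings table (values are illustrative).
toy = pd.DataFrame({
    "instant_bookable": ["f", "f", "t", "t"],
    "price": [0.0, 126.0, 16.0, 143.0],
})

# Rows with a price of zero, which are candidates for data-quality review.
zero_priced = toy[toy["price"] == 0]
print(zero_priced)

# Minimum price per group once zero-priced rows are excluded.
positive = toy[toy["price"] > 0]
min_by_group = positive.groupby("instant_bookable")["price"].min()
print(min_by_group)
```

Whether to drop, impute, or keep these rows is a judgment call that depends on what a zero price means in the source data.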


#### Property Type vs Price

In [10]:
property_type_df = df[['property_type','price']]
#property_type_df

In [11]:
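# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
df_2 = property_type_df.copy()
# '0' is a placeholder for rows that no rule below matches
df_2['property_type_2'] = '0'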

df_2.loc[df_2["property_type"].str.contains('condo|loft'), "property_type_2"] = "Condo"
df_2.loc[df_2["property_type"].str.contains('apartment'), "property_type_2"] = "Apartment"
df_2.loc[df_2["property_type"].str.contains('townhouse'), "property_type_2"] = "Townhouse"
df_2.loc[df_2["property_type"].str.contains('home|place'), "property_type_2"] = "Home"
df_2.loc[df_2["property_type"].str.contains('guest suite|guesthouse'), "property_type_2"] = "Guest Suite/Guesthouse"
df_2.loc[df_2["property_type"].str.contains('rental'), "property_type_2"] = "Rental Unit"
df_2.loc[df_2["property_type"].str.contains('hotel|resort'), "property_type_2"] = "Hotel Room"
df_2.loc[df_2["property_type"].str.contains('hostel'), "property_type_2"] = "Hostel"
df_2.loc[df_2["property_type"].str.contains('villa|bungalow'), "property_type_2"] = "Villa/Bungalow"
df_2.loc[df_2["property_type"].str.contains('Cottage|cottage|Boat|Camper/RV|Campsite|Castle|Floor|Houseboat|Tent|Tower'), "property_type_2"] = "Unique Stay"
df_2.loc[df_2["property_type"].str.contains('Tiny|tiny'), "property_type_2"] = "Tiny Home"
df_2.loc[df_2["property_type"].str.contains('Breakfast|breakfast'), "property_type_2"] = "Bed & Breakfast"
df_2.loc[df_2["property_type"].str.contains('particular|aparthotel'), "property_type_2"] = "Other"
# Anything still carrying the '0' placeholder falls into "Other"
df_2.loc[df_2["property_type_2"] == '0', "property_type_2"] = "Other"

df_2.loc[df_2["property_type"].str.contains('Entire'), "property_type"] = "Entire Space"
df_2.loc[df_2["property_type"].str.contains('Private|private|apartment|breakfast'), "property_type"] = "Private Room"
df_2.loc[df_2["property_type"].str.contains('Shared|shared'), "property_type"] = "Shared Room"
df_2.loc[df_2["property_type"].str.contains('Hotel|hotel'), "property_type"] = "Hotel"
df_2.loc[df_2["property_type"].str.contains('Hostel|hostel'), "property_type"] = "Hostel"
df_2.loc[df_2["property_type"].str.contains('Boat|Camper/RV|Campsite|Casa particular|Castle|Floor|Houseboat|Tent|Tiny home|Tower'), "property_type"] = "Unique Stay"
#df_2

In [12]:
grouped_price_property_type = df_2[['property_type', 'price', 'property_type_2']].groupby('property_type')
grouped_price_property_type_stats = grouped_price_property_type['price'].describe().reset_index()
#grouped_price_property_type_stats

In [13]:
grouped_price_property_type_2 = df_2[['property_type', 'price', 'property_type_2']].groupby('property_type_2')
grouped_price_property_type_stats_2 = grouped_price_property_type_2['price'].describe().reset_index()
grouped_price_property_type_stats_2

Unnamed: 0,property_type_2,count,mean,std,min,25%,50%,75%,max
0,Apartment,660.0,224.772727,98.248577,56.0,159.0,212.0,279.0,716.0
1,Bed & Breakfast,52.0,218.980769,190.636446,25.0,75.0,141.0,300.0,799.0
2,Condo,800.0,148.4375,110.485938,29.0,88.0,125.0,175.0,1425.0
3,Guest Suite/Guesthouse,684.0,122.173977,86.339114,29.0,85.0,105.0,133.5,1237.0
4,Home,1726.0,218.38007,388.024673,10.0,65.0,120.0,250.0,7500.0
5,Hostel,33.0,84.757576,75.380713,0.0,25.0,73.0,90.0,313.0
6,Hotel Room,222.0,382.432432,304.298871,0.0,197.0,302.0,459.0,1903.0
7,Other,28.0,247.714286,123.517237,58.0,174.0,254.0,298.0,482.0
8,Rental Unit,3375.0,151.270519,122.768793,20.0,88.0,126.0,178.0,3000.0
9,Tiny Home,3.0,129.666667,34.588052,92.0,114.5,137.0,148.5,160.0


I grouped the property types into broader categories (for example, "Apartment" includes listings for an entire apartment, a private room in an apartment, and a shared room in an apartment). There were originally 53 different categories, and I was able to pare that down to about 12 groups. Unique stays include boats, houseboats, campsites, castles, tents, campers/RVs, cottages, towers, and a few other listing types. Based on the results, hotel rooms have the highest average price ($382.43) and hostels the lowest ($84.76). The price ranges for home stays ($10 to $7,500) and townhouse stays ($20 to $4,357) are very wide, which suggests that the dataset contains some extreme values. It is worth examining how this variable interacts with others, such as number of rooms and location, to see how they influence the overall price of a listing.
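
The grouping above depends on rule order: each later `.loc` assignment overrides an earlier one when both patterns match. A compact way to express the same idea is an ordered list of (pattern, label) pairs. This is only a sketch with a subset of the rules; the `RULES` list and `categorize` helper are hypothetical names, and matching is case-insensitive here, unlike the original cell.

```python
import pandas as pd

# Ordered (pattern, label) rules; a later match overrides an earlier one,
# mirroring the sequence of .loc assignments in the cell above (subset only).
RULES = [
    ("condo|loft", "Condo"),
    ("apartment", "Apartment"),
    ("home|place", "Home"),
    ("hotel|resort", "Hotel Room"),
    ("hostel", "Hostel"),
    ("tiny", "Tiny Home"),
    ("aparthotel", "Other"),
]

def categorize(property_type: pd.Series) -> pd.Series:
    """Map raw property types to broad groups; unmatched rows become 'Other'."""
    result = pd.Series("Other", index=property_type.index)
    for pattern, label in RULES:
        mask = property_type.str.contains(pattern, case=False, regex=True)
        result[mask] = label
    return result

sample = pd.Series([
    "Entire condo",
    "Entire serviced apartment",
    "Room in aparthotel",
    "Shared room in hostel",
    "Tiny home",
])
groups = categorize(sample)
print(groups.tolist())
```

Keeping the rules in one list makes the override order explicit: "Room in aparthotel" first matches `hotel` and is then reassigned by the later `aparthotel` rule, and "Tiny home" first matches `home` before the `tiny` rule takes over.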