# airbnb.db EDA - Reviews Table

The `reviews` table contains data about the experience of guests who have rented a given Airbnb listing. Let's take a look at the variables that each entry in the `reviews` table contains:

In [14]:
import sqlite3
import pandas as pd

# Establish a connection to the database
connection = sqlite3.connect('airbnb.db')
cursor = connection.cursor()

In [21]:
# Read sqlite query results into a pandas DataFrame
airbnbListings = pd.read_sql_query("SELECT * from reviews", connection)

# List the columns in the reviews table
print(airbnbListings.columns)

Index(['id', 'reviewer_name', 'reviewer_id', 'date', 'comments', 'listing_id'], dtype='object')


Looking at the columns in the reviews table, each entry appears similar to the kind of review you might find on Amazon or Yelp. Each review has an associated reviewer (reviewer_name and reviewer_id), as well as the date it was posted, which indicates how recent it is. Lastly, there is the comment itself, which describes the reviewer's experience, and the listing_id, which points to the Airbnb listing they rented.
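The column names can also be read from the database itself rather than from the DataFrame. The sketch below uses SQLite's `PRAGMA table_info`, run against a hypothetical in-memory stand-in so that it executes on its own; the declared column types in the `CREATE TABLE` statement are assumptions, not the real schema of airbnb.db.

```python
import sqlite3

# Hypothetical in-memory stand-in for airbnb.db so the sketch runs on its own.
# The declared column types are assumptions, not the real schema.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE reviews ("
    "id INTEGER, reviewer_name TEXT, reviewer_id INTEGER, "
    "date TEXT, comments TEXT, listing_id INTEGER)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = demo.execute("PRAGMA table_info(reviews)").fetchall()
for cid, name, col_type, notnull, default, pk in schema:
    print(f"{name:<15} {col_type}")

demo.close()
```

Against the real database, the same `PRAGMA` query could be run on the existing `connection` object instead of `demo`.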

Now that we have verified the relevance of the columns in the table, let's take a look at the overall characteristics of the data.

In [22]:
airbnbListings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330237 entries, 0 to 330236
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             330237 non-null  int64 
 1   reviewer_name  330237 non-null  object
 2   reviewer_id    330237 non-null  int64 
 3   date           330237 non-null  object
 4   comments       330237 non-null  object
 5   listing_id     330237 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 15.1+ MB


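There are 330237 records (or observations) and every column has 330237 non-null values, so there are no missing values. Three columns are stored as integers (id, reviewer_id, listing_id) and three as object (reviewer_name, date, comments), which here means strings: a name, a timestamp, and free text. Let's look at the first few rows: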
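One caveat is that `info()` only counts true nulls, so an empty or whitespace-only string still registers as non-null. A minimal sketch of how both could be checked, using a toy frame with made-up blank comments (the `sample` DataFrame and its values are hypothetical, not rows from the table):

```python
import pandas as pd

# Toy frame with hypothetical values; two comments are blank on purpose.
sample = pd.DataFrame({
    "id": [2215, 2373, 2945],
    "reviewer_name": ["Lisa", "Kevin", "Bernhard"],
    "comments": ["Staying with Heather and Vasa was a delight.", "   ", ""],
})

# Count true nulls per column (this is what info() reports on).
null_counts = sample.isna().sum()

# Count strings that are empty or whitespace-only, which info() treats as non-null.
text_cols = ["reviewer_name", "comments"]
blank_counts = sample[text_cols].apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```

On the real data, the same two checks could be applied to `airbnbListings` with `text_cols` set to the three object columns.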

In [23]:
airbnbListings.head()

|   | id | reviewer_name | reviewer_id | date | comments | listing_id |
|---|---|---|---|---|---|---|
| 0 | 2215 | Lisa | 8235 | 2009-05-10 00:00:00 | Staying with Heather and Vasa was a delight. ... | 3943 |
| 1 | 2373 | Kevin | 13733 | 2009-05-14 00:00:00 | Beautiful old home, just like I remember my gr... | 4197 |
| 2 | 2945 | Bernhard | 16430 | 2009-05-21 00:00:00 | A great place to stay for travelers with limit... | 4197 |
| 3 | 3253 | Karen | 15892 | 2009-05-28 00:00:00 | Staying with Heather and Vasa was great. Their... | 3943 |
| 4 | 3569 | Brad | 19454 | 2009-06-04 00:00:00 | I had to fly in on short notice for an intervi... | 3943 |


What I find most interesting in the DataFrame preview is that reviewer_name is just the reviewer's first name. This can be slightly confusing, since several people may share the same name while having different reviewer_id values. Another point of interest is that date is stored with a resolution down to the second, yet every value shown has a time of 00:00:00, so it appears to be recorded only to the day. Lastly, it will be hard to gauge the significance of each comment. What is the character limit? Can we make sense of a comment's sentiment? How can we tell whether it is positive or negative?
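Some of these questions can be probed directly. The sketch below rebuilds the five preview rows by hand (the `preview` frame is a stand-in, and its comments are the truncated strings from the preview, so the lengths it reports are not real comment lengths). It checks whether every date falls on midnight, whether any first name maps to more than one reviewer_id, and how long the comments are.

```python
import pandas as pd

# The five preview rows, retyped by hand. Comments are the truncated
# preview strings, so their lengths are illustrative only.
preview = pd.DataFrame({
    "reviewer_name": ["Lisa", "Kevin", "Bernhard", "Karen", "Brad"],
    "reviewer_id": [8235, 13733, 16430, 15892, 19454],
    "date": [
        "2009-05-10 00:00:00", "2009-05-14 00:00:00", "2009-05-21 00:00:00",
        "2009-05-28 00:00:00", "2009-06-04 00:00:00",
    ],
    "comments": [
        "Staying with Heather and Vasa was a delight. ...",
        "Beautiful old home, just like I remember my gr...",
        "A great place to stay for travelers with limit...",
        "Staying with Heather and Vasa was great. Their...",
        "I had to fly in on short notice for an intervi...",
    ],
})

# Parse the strings and test whether every timestamp is exactly midnight.
parsed = pd.to_datetime(preview["date"])
all_midnight = bool((parsed == parsed.dt.normalize()).all())

# Count distinct reviewer_ids per first name; more than 1 means the name is shared.
ids_per_name = preview.groupby("reviewer_name")["reviewer_id"].nunique()
shared_names = ids_per_name[ids_per_name > 1]

# Character length of each comment; the max hints at any length limit.
comment_lengths = preview["comments"].str.len()

print(all_midnight)
print(shared_names)
print(comment_lengths.describe())
```

Sentiment is a separate question that simple length statistics cannot answer; it would need a text-analysis step that is outside this sketch.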

---

# EDA

## Single Variable EDA

We'll start our EDA by looking at each variable individually, starting with the first column, *id*.

## id

I'm assuming that id is the unique identifier for each review. That would mean every review has an id (which we previously confirmed: 330237 non-null int64) and that no id is repeated.

id is a numerical data type, but it has no units, so the numbers are arbitrary labels. Their only significance should be that they are all unique. Let's see the range of values that id takes:

In [25]:
airbnbListings.id.describe()

count    3.302370e+05
mean     2.627756e+17
std      3.207232e+17
min      2.215000e+03
25%      3.305652e+08
50%      6.564124e+08
75%      6.035870e+17
max      8.500703e+17
Name: id, dtype: float64

id is not a measurement, so of the values above we mainly care about the min and max. The min is 2215 and the max is about 8.500703e+17. A gap of this size suggests that the ids are effectively arbitrary, possibly random; there may be some sort of hash function that generates each one. Note also that the median (about 6.56e+08) and the 75th percentile (about 6.04e+17) are many orders of magnitude apart, so the values are far from evenly spread. It makes little sense to plot the ids given how spread out they are, but we can check that there are no duplicate values.

In [26]:
airbnbListings.id.is_unique

True
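Beyond uniqueness, one way to probe how the ids are distributed is to count how many digits each one has. The sketch below uses a handful of made-up ids chosen to span the magnitudes reported by `describe()`; the values in `ids` are hypothetical, not rows from the table.

```python
import pandas as pd

# Hypothetical ids spanning the magnitudes seen in describe(); not real rows.
ids = pd.Series(
    [2215, 2373, 330565200, 656412400, 603587000000000000, 850070300000000000],
    name="id",
)

# Number of decimal digits per id, tallied by digit count.
digit_counts = ids.astype(str).str.len().value_counts().sort_index()
print(digit_counts)

# True if the ids never decrease in their current row order.
print(ids.is_monotonic_increasing)
```

If the real ids cluster into a few digit-count bands, that would point toward more than one id scheme rather than a single random generator; this is only a suggestion for what to look at, not a finding.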

So far we have found that each id appears to be arbitrary (judging by the difference in magnitude between the min and max values), and we have verified that each id is unique across the table. Given an id, we could find the specific row it is associated with. I do not believe it will be of any value when looking for pairwise relationships going forward.
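Because the ids are unique, a row lookup by id can be sketched by using id as the index. The `reviews` frame below is a small stand-in built from the first three preview rows, not the full table.

```python
import pandas as pd

# Small stand-in built from the first three preview rows.
reviews = pd.DataFrame({
    "id": [2215, 2373, 2945],
    "reviewer_name": ["Lisa", "Kevin", "Bernhard"],
    "listing_id": [3943, 4197, 4197],
})

# verify_integrity=True raises a ValueError if any id is duplicated,
# so building the index doubles as a uniqueness check.
by_id = reviews.set_index("id", verify_integrity=True)

# Look up a single review by its id.
row = by_id.loc[2373]
print(row)
```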