# Problem Definition: Which features of an Airbnb listing make it more expensive?

Skills used: Python, pandas, NumPy, Matplotlib, seaborn, scikit-learn

>The project is divided into 3 stages:
>1. Data Collection & Cleaning
>2. Data Visualisation
>3. Data Modelling and Prediction

# Stage 1:  Data Collection and Cleaning

In this case, the Airbnb data is already provided; in other use cases, the data would first be gathered with web scrapers or from other dataset sources before cleaning.

This stage deals with anomalies and data types that would otherwise hinder the analysis, modelling and predictions that come later in the project (e.g. empty cells).
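
As a rough illustration of the kind of check this stage relies on, the sketch below counts empty cells per column and lists each column's data type. The `toy` frame and the `audit` name are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real listings table.
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 2.0],
    "Price_Night": [105, 92, 52],
    "Value_rating": [4.3, np.nan, np.nan],
})

# Count the empty cells per column and show each column's dtype side by side.
audit = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "dtype": toy.dtypes.astype(str),
})
print(audit)
```

Columns with a non-zero `missing` count, or with an unexpected dtype, are the ones that need attention during cleaning.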

In [13]:
# Import required modules

import matplotlib.pyplot as plt
import pandas as pd


In [14]:
# Import the listings dataset (Airbnb data in tabular form)
listingsDF = pd.read_csv('tabular_data/tabular_data/listing.csv')
listingsDF.head()   # Show the structure and columns of the first 5 rows.


Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
0,f9dcbd09-32ac-41d9-a0b1-fdb2793378cf,Treehouses,Red Kite Tree Tent - Ynys Affalon,"['About this space', ""Escape to one of these t...","['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,105,4.6,4.7,4.3,5.0,4.3,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,,
1,1b4736a7-e73e-45bc-a9b5-d3e7fcf652fd,Treehouses,Az Alom Cabin - Treehouse Tree to Nature Cabin,"['About this space', ""Come and spend a romanti...","['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,0.0,92,4.3,4.7,4.6,4.9,4.7,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1.0,
2,d577bc30-2222-4bef-a35e-a9825642aec4,Treehouses,Cabane Entre Les Pins\n🌲🏕️🌲,"['About this space', 'Rustic cabin between the...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,52,4.2,4.6,4.8,4.8,4.8,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1.0,
3,ca9cbfd4-7798-4e8d-8c17-d5a64fba0abc,Treehouses,Tree Top Cabin with log burner & private hot tub,"['About this space', 'The Tree top cabin is si...","['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,,1.0,132,4.8,4.9,4.9,4.9,5.0,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,,
4,8b2d0f78-16d8-4559-8692-62ebce2a1302,Treehouses,Hanging cabin,"['About this space', 'Feel refreshed at this u...","['What this place offers', 'Heating and coolin...",Wargnies-le-Petit Hauts-de-France France,2,1.0,,111,,,,,,,5.0,https://www.airbnb.co.uk/rooms/50166553?adults...,1.0,


In [15]:
# Assess the data type and array dimensions
print("Data type : ", type(listingsDF))
print("Data dims : ", listingsDF.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (988, 20)


# Data Cleaning
 
After viewing the columns, the dataset turned out to have only 20 of them, none complex, so all columns were kept for the cleaning stage.

In [16]:
# Drop NaN values in these columns : 'Cleanliness_rating','Accuracy_rating','Communication_rating','Location_rating','Check-in_rating','Value_rating'
listingsDF.dropna(subset = ['Cleanliness_rating','Accuracy_rating','Communication_rating','Location_rating','Check-in_rating','Value_rating'], inplace= True)
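
To make the effect of `dropna(subset=...)` concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions for illustration). Only rows with a missing value in one of the listed rating columns are removed; missing values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row B lacks one rating, row C lacks another,
# and row A lacks only a bedroom count.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Cleanliness_rating": [4.6, np.nan, 4.2],
    "Value_rating": [4.3, 4.5, np.nan],
    "bedrooms": [np.nan, 1.0, 1.0],
})

rating_cols = ["Cleanliness_rating", "Value_rating"]

# Drop a row if ANY of the rating columns is empty; 'bedrooms' is not checked.
cleaned = toy.dropna(subset=rating_cols)
print(cleaned)
```

Row A survives even though its `bedrooms` value is empty, which is why a separate fill step is still needed for the count columns.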

In [17]:
# Fill empty guests, beds, bathrooms and bedrooms with 1, as every listing should have at least 1 of each.
listingsDF.loc[listingsDF['guests'].isnull(),['guests']] = 1
listingsDF.loc[listingsDF['beds'].isnull(),['beds']] = 1
listingsDF.loc[listingsDF['bathrooms'].isnull(),['bathrooms']] = 1
listingsDF.loc[listingsDF['bedrooms'].isnull(),['bedrooms']] = 1
# Raise bathroom counts below 1 to the minimum of 1. A single .loc call is used instead of
# chained indexing (listingsDF['bathrooms'].loc[...]), which triggers a SettingWithCopyWarning.
listingsDF.loc[listingsDF['bathrooms'] < 1, 'bathrooms'] = 1
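
Chained indexing such as `df['col'].loc[mask] = value` can trigger pandas' `SettingWithCopyWarning` and, depending on the pandas version, may fail to update the original frame. Selecting the rows and the column in one `.loc` call avoids the problem. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical bathroom counts, two of them below the assumed minimum of 1.
toy = pd.DataFrame({"bathrooms": [1.0, 0.0, 1.5, 0.5]})

# Row mask and column label go into the same .loc call,
# so the assignment acts on the DataFrame itself.
toy.loc[toy["bathrooms"] < 1, "bathrooms"] = 1

print(toy["bathrooms"].tolist())  # [1.0, 1.0, 1.5, 1.0]
```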


In [18]:
# Assess the number of rows

print("Data type : ", type(listingsDF))
print("Data dims : ", listingsDF.shape) 

# The row count has dropped from 988 to 890: the first removal of rows that lack data we need.

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (890, 20)


In [19]:
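# Keep only rows whose description contains the 'About this space' segment, then remove that segment
# along with any quotation marks and apostrophes that may cause confusion down the line.

# .copy() makes the filtered frame independent, so the column assignments below act on it directly.
listingsDF = listingsDF[listingsDF['Description'].str.contains('About this space', na=False)].copy()
# regex=False treats each pattern as literal text.
listingsDF['Description'] = listingsDF['Description'].str.replace("'About this space',", "", regex=False)
listingsDF['Description'] = listingsDF['Description'].str.replace('"', "", regex=False)
listingsDF['Description'] = listingsDF['Description'].str.replace("' '", "", regex=False)
listingsDF['Description'] = listingsDF['Description'].str.replace("'", "", regex=False)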
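
The effect of these replacements is easiest to see on a single value. The sketch below applies the same four literal patterns to one hypothetical description string (the `raw` Series is an assumption for illustration), passing `regex=False` so that each pattern is treated as plain text.

```python
import pandas as pd

# Hypothetical raw description in the same list-like string form as the dataset.
raw = pd.Series(["['About this space', 'Rustic cabin between the pines']"])

cleaned = raw
# Apply the replacements in order: header segment, double quotes,
# quote-space-quote separators, then any remaining apostrophes.
for token in ("'About this space',", '"', "' '", "'"):
    cleaned = cleaned.str.replace(token, "", regex=False)

print(cleaned.iloc[0])  # [ Rustic cabin between the pines]
```

Note that the square brackets and the leading space remain, and that apostrophes inside the text itself (e.g. in contractions) are removed too.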

In [20]:
# Index the rows with unique alphabetical IDs

# Import string module
import string

# Build the list of three-letter IDs: AAA, AAB, AAC, ...
alphabet = string.ascii_uppercase
index_list = []
row_nums = len(listingsDF)
for first in alphabet:
    for second in alphabet:
        for third in alphabet:
            index_list.append(first + second + third)
# Keep only as many IDs as there are rows
id_list = index_list[:row_nums]

print(row_nums)
# Assign each row a unique ID in the order of the newly created id_list
listingsDF["ID"] = id_list
# listingsDF.set_index("ID", inplace=True)
listingsDF.head()

829


Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
0,AAA,Treehouses,Red Kite Tree Tent - Ynys Affalon,[ Escape to one of these two fabulous Tree Ten...,"['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,105,4.6,4.7,4.3,5.0,4.3,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,1,
1,AAB,Treehouses,Az Alom Cabin - Treehouse Tree to Nature Cabin,[ Come and spend a romantic stay with a couple...,"['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,1.0,92,4.3,4.7,4.6,4.9,4.7,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1,
2,AAC,Treehouses,Cabane Entre Les Pins\n🌲🏕️🌲,"[ Rustic cabin between the pines, 3 meters hig...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,52,4.2,4.6,4.8,4.8,4.8,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1,
3,AAD,Treehouses,Tree Top Cabin with log burner & private hot tub,[ The Tree top cabin is situated in our peacef...,"['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,1.0,1.0,132,4.8,4.9,4.9,4.9,5.0,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,1,
5,AAE,Treehouses,Treehouse near Paris Disney,"[ Charming cabin nestled in the leaves, real u...","['What this place offers', 'Bathroom', 'Hair d...",Le Plessis-Feu-Aussoux Île-de-France France,4,3.0,1.0,143,5.0,4.9,5.0,4.7,5.0,4.7,32.0,https://www.airbnb.co.uk/rooms/935398?adults=1...,2,
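
The same sequence of IDs can also be produced with `itertools.product`, which generates the letter combinations lazily and in the same order as three nested loops. This is only an alternative sketch: the helper name `alphabetical_ids` and its `width` parameter are hypothetical, and the capacity check is an added safeguard that the original cell does not have.

```python
from itertools import islice, product
from string import ascii_uppercase


def alphabetical_ids(count, width=3):
    """Return the first `count` IDs of the form AAA, AAB, AAC, ..."""
    capacity = len(ascii_uppercase) ** width  # 26**3 = 17576 for width 3
    if count > capacity:
        raise ValueError(f"{count} rows need more than {capacity} IDs")
    combos = product(ascii_uppercase, repeat=width)
    # islice stops after `count` combinations, so the full list is never built.
    return ["".join(letters) for letters in islice(combos, count)]


ids = alphabetical_ids(5)
print(ids)  # ['AAA', 'AAB', 'AAC', 'AAD', 'AAE']
```

With three letters there is room for 17,576 IDs, comfortably more than the 829 rows left after cleaning.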


In [25]:
# Count the empty cells remaining in each column.
listingsDF.isnull().sum()

ID                        0
Category                  0
Title                     0
Description               0
Amenities                 0
Location                  0
guests                    0
beds                      0
bathrooms                 0
Price_Night               0
Cleanliness_rating        0
Accuracy_rating           0
Communication_rating      0
Location_rating           0
Check-in_rating           0
Value_rating              0
amenities_count           0
url                       0
bedrooms                  0
Unnamed: 19             829
dtype: int64
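
The `Unnamed: 19` column is empty in all 829 rows. The notebook keeps it, but if it is not needed, one optional way to remove columns that contain no data at all is `dropna` along the column axis. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one entirely empty column.
toy = pd.DataFrame({
    "beds": [1.0, 2.0],
    "Unnamed: 19": [np.nan, np.nan],
})

# how="all" drops a column only when every one of its cells is empty.
trimmed = toy.dropna(axis="columns", how="all")
print(trimmed.columns.tolist())  # ['beds']
```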

In [27]:
# Export the data as clean_tabular_data.csv

listingsDF.to_csv('tabular_data/tabular_data/clean_tabular_data.csv',index=True)
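
With `index=True`, the row index is written as a first column without a header. When such a file is read back without further arguments, pandas labels that column `Unnamed: 0`; passing `index_col=0` restores it as the index instead. The sketch below shows the round trip on hypothetical toy data, using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy frame with a gap in the index, as happens after dropping rows.
toy = pd.DataFrame({"ID": ["AAA", "AAB"], "Price_Night": [105, 92]}, index=[0, 5])

buffer = io.StringIO()
toy.to_csv(buffer, index=True)

# Reading back naively turns the saved index into an extra data column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())  # ['Unnamed: 0', 'ID', 'Price_Night']

# Telling read_csv that column 0 is the index restores the original labels.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.index.tolist())  # [0, 5]
```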

In [28]:
# So far, no column has had its data type changed.
# To keep the statistical tests further down the line straightforward, numeric columns are converted to float where appropriate.

# Convert the object columns 'guests' and 'bedrooms' to numeric types
listingsDF['guests'] = pd.to_numeric(listingsDF['guests'])
listingsDF['bedrooms']= pd.to_numeric(listingsDF['bedrooms'])

# Cast the count and price columns to float for the statistical analysis later.
listingsDF['guests'] = listingsDF['guests'].astype(float)
listingsDF['bedrooms'] = listingsDF['bedrooms'].astype(float)
listingsDF['Price_Night'] = listingsDF['Price_Night'].astype(float)
listingsDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 829 entries, 0 to 987
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    829 non-null    object 
 1   Category              829 non-null    object 
 2   Title                 829 non-null    object 
 3   Description           829 non-null    object 
 4   Amenities             829 non-null    object 
 5   Location              829 non-null    object 
 6   guests                829 non-null    float64
 7   beds                  829 non-null    float64
 8   bathrooms             829 non-null    float64
 9   Price_Night           829 non-null    float64
 10  Cleanliness_rating    829 non-null    float64
 11  Accuracy_rating       829 non-null    float64
 12  Communication_rating  829 non-null    float64
 13  Location_rating       829 non-null    float64
 14  Check-in_rating       829 non-null    float64
 15  Value_rating          8
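
`pd.to_numeric` raises an error by default when a cell cannot be parsed as a number. If a column might contain stray text, passing `errors="coerce"` turns such cells into `NaN` so they can be inspected or filled afterwards. A minimal sketch on hypothetical toy data (the value `"unknown"` is an assumption, not something observed in the dataset):

```python
import pandas as pd

# Hypothetical toy frame: 'guests' is stored as text and contains one bad value.
toy = pd.DataFrame({
    "guests": ["2", "3", "unknown"],
    "Price_Night": [105, 92, 52],
})

# Unparseable cells become NaN instead of raising an error.
toy["guests"] = pd.to_numeric(toy["guests"], errors="coerce")
# Integer prices are cast to float, as in the cell above.
toy["Price_Night"] = toy["Price_Night"].astype(float)

print(toy.dtypes)
```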

# Summary 

The dataset has been cleaned and appropriately prepared for data analysis.