### <center>TP1 Airbnb Data Cleaning</center>

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', None)



# Reading the file

In [2]:
boston = pd.read_csv("data/boston.csv")

cambridge = pd.read_csv("data/cambridge.csv")

df = boston
#df = pd.concat([boston, cambridge])

# Discovering the data set

In [3]:
df.shape

(3507, 106)

In [4]:
df.head(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,5506,https://www.airbnb.com/rooms/5506,20191204162830,2019-12-04,**$79 Special ** Private! Minutes to center!,"Private guest room with private bath, You do n...",**THE BEST Value in BOSTON!!*** PRIVATE GUEST ...,"Private guest room with private bath, You do n...",none,"Peacful, Architecturally interesting, historic...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.81
1,6695,https://www.airbnb.com/rooms/6695,20191204162830,2019-12-04,$99 Special!! Home Away! Condo,"Comfortable, Fully Equipped private apartment...",** WELCOME *** FULL PRIVATE APARTMENT In a His...,"Comfortable, Fully Equipped private apartment...",none,"Peaceful, Architecturally interesting, histori...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.91


In [5]:
# check columns
df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       ...
       'instant_bookable', 'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object', length=106)

In [6]:
# check data types
df.dtypes

id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
name                                             object
summary                                          object
space                                            object
description                                      object
experiences_offered                              object
neighborhood_overview                            object
notes                                            object
transit                                          object
access                                           object
interaction                                      object
house_rules                                      object
thumbnail_url                                   float64
medium_url                                      float64
picture_url                                     

In [7]:
# summary statistics
df.describe()

Unnamed: 0,id,scrape_id,thumbnail_url,medium_url,xl_picture_url,host_id,host_acceptance_rate,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,3507.0,3507.0,0.0,0.0,0.0,3507.0,0.0,3492.0,3492.0,0.0,...,2812.0,2810.0,2813.0,2811.0,2811.0,3507.0,3507.0,3507.0,3507.0,2823.0
mean,23024830.0,20191200000000.0,,,,80785230.0,,155.973081,155.973081,,...,9.481152,9.726335,9.673303,9.578798,9.246176,28.414314,25.100371,3.220131,0.035358,2.200712
std,11778550.0,0.0,,,,91953590.0,,325.392676,325.392676,,...,0.857304,0.769328,0.825613,0.762394,0.96406,44.02058,43.917254,6.93927,0.380506,2.221209
min,5506.0,20191200000000.0,,,,7969.0,,0.0,0.0,,...,2.0,2.0,2.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0
25%,13779610.0,20191200000000.0,,,,11631890.0,,2.0,2.0,,...,9.0,10.0,10.0,9.0,9.0,1.0,0.0,0.0,0.0,0.46
50%,23544020.0,20191200000000.0,,,,30283590.0,,6.0,6.0,,...,10.0,10.0,10.0,10.0,9.0,6.0,1.0,0.0,0.0,1.48
75%,33318010.0,20191200000000.0,,,,133457700.0,,43.0,43.0,,...,10.0,10.0,10.0,10.0,10.0,27.0,22.0,3.0,0.0,3.345
max,40573640.0,20191200000000.0,,,,313255100.0,,1068.0,1068.0,,...,10.0,10.0,10.0,10.0,10.0,147.0,147.0,33.0,5.0,15.0


# Data cleaning

After looking at the data set, we can see that some columns are not useful for our analysis, so we will drop them.
The tables below list the columns that we will drop and the reason for dropping each one:



| Column | Reason |
| ------ | ------ |
| id <br> host_id <br> scrape_id | IDs are not needed |
| calendar_last_scraped <br> neighbourhood <br> host_neighbourhood <br> host_total_listings_count | Duplicate of another column |
| host_acceptance_rate <br> neighbourhood_group_cleansed | Null columns |
| experiences_offered <br> market <br> country_code <br> country <br> has_availability <br> jurisdiction_names <br> is_business_travel_ready | Columns with the <br> same value in all rows |
| square_feet <br> weekly_price <br> monthly_price | Not enough values <br> (less than 50% present) |
| zipcode <br> requires_license | Irrelevant data |
| street <br> city <br> state <br> smart_location | Columns where one value is repeated <br> in over 90% of rows |


| <center>Column</center> | Reason |
| ------ | ------ |
| <table><tr><td>listing_url</td><td>thumbnail_url</td><td>medium_url</td><td>picture_url</td><td>xl_picture_url</td></tr><tr><td>host_url</td><td>host_thumbnail_url</td><td>host_picture_url</td></tr></table> | Links (url) |
|<table><tr><td>name</td><td>summary</td><td>space</td><td>description</td></tr><tr><td>neighborhood_overview</td><td>notes</td><td>transit</td><td>access</td></tr><tr><td>house_rules</td><td>host_about</td><td>host_response_time</td><td>interaction</td></tr><tr><td>host_verifications</td><td>host_name</td><td>amenities</td><td>street</td></tr><tr><td>calendar_updated</td><td>license</td><td>jurisdiction_names</td><td>host_location</td></tr><tr><td>market</td><td>smart_location</td></tr></table>| Textual Data |
|<table><tr><td>maximum_nights_avg_ntm</td><td>minimum_nights_avg_ntm</td><td>maximum_maximum_nights</td></tr><tr><td>minimum_maximum_nights</td><td>maximum_minimum_nights</td><td>minimum_minimum_nights</td></tr><tr><td>availability_60</td><td>availability_90</td><td>availability_365</td></tr></table> | Can be Calculated |

## Gathering the columns that we will drop

In [8]:
# create a list to store the columns that should be deleted
columns_to_delete = []

- columns with url

In [9]:
# columns with url in the name
columns_with_url = [col for col in df.columns if 'url' in col]
# print columns not in columns_to_delete
print([col for col in columns_with_url if col not in columns_to_delete])
columns_to_delete.extend(columns_with_url)

['listing_url', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_url', 'host_thumbnail_url', 'host_picture_url']


- columns with textual data

In [10]:
# get all columns with string data type 
text_cols = df[df.columns[df.dtypes == 'object']]
is_text = []

for col in text_cols:
    # treat the column as free text if more than 10% of its values are unique
    if df[col].nunique() / df.shape[0] > 0.1:
        is_text.append(col)
print(len(is_text))
print([col for col in is_text if col not in columns_to_delete])
columns_to_delete.extend(is_text)

22
['name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'host_name', 'host_since', 'host_about', 'amenities', 'first_review', 'last_review', 'license']
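
The uniqueness-ratio heuristic used above can be wrapped in a small helper. This is only a minimal sketch on a toy frame: the function name `high_cardinality_text_columns` and the sample data are hypothetical and not part of the notebook. Note that a 10% threshold also flags date-like strings, which is why `host_since`, `first_review` and `last_review` appear in the output above.

```python
import pandas as pd

def high_cardinality_text_columns(frame, threshold=0.1):
    """Return object-dtype columns whose share of distinct values exceeds threshold."""
    text_cols = frame.columns[frame.dtypes == object]
    # nunique() ignores NaN, while len(frame) counts every row
    ratios = frame[text_cols].nunique() / len(frame)
    return ratios[ratios > threshold].index.tolist()

# Hypothetical toy data; dtype=object mirrors how read_csv loaded the strings here
toy = pd.DataFrame({
    "description": pd.Series([f"listing {i}" for i in range(20)], dtype=object),
    "room_type": pd.Series(["Private room", "Entire home/apt"] * 10, dtype=object),
    "price": range(20),
})

flagged = high_cardinality_text_columns(toy)
print(flagged)
```

Here `room_type` has a ratio of exactly 0.1, so the strict `>` comparison leaves it out.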


- columns that can be calculated

In [11]:
# columns that can be calculated from other columns
calculated_cols = ['maximum_nights_avg_ntm', 'minimum_nights_avg_ntm', 'maximum_maximum_nights',
                   'minimum_maximum_nights', 'maximum_minimum_nights', 'minimum_minimum_nights',
                   'availability_60', 'availability_90', 'availability_365']

print([col for col in calculated_cols if col not in columns_to_delete])
columns_to_delete.extend(calculated_cols)

['maximum_nights_avg_ntm', 'minimum_nights_avg_ntm', 'maximum_maximum_nights', 'minimum_maximum_nights', 'maximum_minimum_nights', 'minimum_minimum_nights', 'availability_60', 'availability_90', 'availability_365']


- columns with id 

In [12]:
# columns with id in the name
# ignore host_identity_verified
columns_with_id = [col for col in df.columns if 'id' in col and col != 'host_identity_verified']
print(columns_with_id)
columns_to_delete.extend(columns_with_id)

['id', 'scrape_id', 'host_id']
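
The substring test `'id' in col` needs a manual exception for `host_identity_verified`. One possible alternative, shown here as a sketch rather than the notebook's method, is to match `id` only when it is a whole underscore-separated token. The helper name `id_columns` and the sample list are assumptions made for illustration.

```python
import re

# "id" must be delimited by the start/end of the name or by underscores
ID_TOKEN = re.compile(r"(^|_)id(_|$)")

def id_columns(columns):
    """Return the column names that contain 'id' as a standalone token."""
    return [col for col in columns if ID_TOKEN.search(col)]

sample = ["id", "scrape_id", "host_id", "host_identity_verified", "price"]
print(id_columns(sample))
```

With this pattern no explicit exclusion list is needed, at the cost of a slightly less readable condition.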


- columns that duplicate another column

In [13]:

# duplicate columns of other columns

duplicate_columns = []

for col in df.columns:
    for col2 in df.columns:
        if col != col2 and df[col].equals(df[col2]) and col not in duplicate_columns:
            duplicate_columns.append(col)
print(len(duplicate_columns))
print([col for col in duplicate_columns if col not in columns_to_delete])
columns_to_delete.extend(duplicate_columns)

9
['last_scraped', 'host_acceptance_rate', 'host_listings_count', 'host_total_listings_count', 'neighbourhood_group_cleansed', 'calendar_last_scraped']
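
Note that the loop above flags both members of every identical pair (for example `host_listings_count` and `host_total_listings_count`), and that all-null columns compare equal to each other. If the goal were to keep one copy of each group, a variant along the following lines could be used. This is a sketch under that assumption, with a hypothetical helper name `redundant_copies` and toy data; it is not what the notebook does.

```python
import pandas as pd

def redundant_copies(frame):
    """Return columns that are exact copies of an earlier column, keeping the first."""
    kept, redundant = [], []
    for col in frame.columns:
        # Series.equals compares values and dtype; NaNs in the same place count as equal
        if any(frame[col].equals(frame[first]) for first in kept):
            redundant.append(col)
        else:
            kept.append(col)
    return redundant

toy = pd.DataFrame({
    "host_listings_count": [1, 2, 3],
    "host_total_listings_count": [1, 2, 3],
    "beds": [1, 1, 2],
})

print(redundant_copies(toy))
```

Only the later copy is reported, so `host_listings_count` would survive the drop.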


- columns with only null values

In [14]:
# columns where every value is null
null_columns = df.columns[df.isnull().all()]

print([col for col in null_columns if col not in columns_to_delete])
columns_to_delete.extend(null_columns)

[]


- columns with same value in all rows

In [15]:
# columns with same value on all rows
same_value_columns = df.columns[(df.nunique() == 1)]
print([col for col in same_value_columns if col not in columns_to_delete])
columns_to_delete.extend(same_value_columns)

['experiences_offered', 'market', 'country_code', 'country', 'has_availability', 'jurisdiction_names', 'is_business_travel_ready']


- columns with more than 50% of values missing

In [16]:

# columns with more than 50% of the values missing
missing_values = df.columns[(df.isnull().sum() / df.shape[0]) > 0.5]
print([col for col in missing_values if col not in columns_to_delete])
columns_to_delete.extend(missing_values)

['square_feet', 'weekly_price', 'monthly_price']
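
The missing-value share can also be computed with `isna().mean()`, since the mean of a boolean column is the fraction of `True` values. The toy frame below is hypothetical and only illustrates the calculation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 500.0],  # 75% missing
    "price": [100.0, 120.0, np.nan, 90.0],           # 25% missing
})

# fraction of missing values per column
missing_share = toy.isna().mean()
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```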


- irrelevant data


In [17]:
# irrelevant columns
irrelevant_columns = ['zipcode','requires_license']
print([col for col in irrelevant_columns if col not in columns_to_delete])
columns_to_delete.extend(irrelevant_columns)


['zipcode', 'requires_license']


- columns with over 90% of repeated values

In [18]:
# columns with over 90% of the values the same

same_value_columns = []

for col in df.columns:
    # occurrence counts per distinct value, sorted in descending order (NaN excluded)
    values = df[col].value_counts()
    # flag the column if its most common value covers over 90% of the rows
    if not values.empty and values.iloc[0] / df.shape[0] > 0.9:
        same_value_columns.append(col)
    
print([col for col in same_value_columns if col not in columns_to_delete])

columns_to_delete.extend(same_value_columns)

['host_has_profile_pic', 'street', 'city', 'state', 'smart_location', 'bed_type', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count_shared_rooms']
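
To inspect how dominant the most common value is before deciding on a threshold, the share can be computed for every column at once. The sketch below is a variant, not the notebook's cell: it assumes missing values should count as a value of their own (`dropna=False`), and the toy data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["MA"] * 19 + ["NY"],
    "room_type": ["Private room"] * 10 + ["Entire home/apt"] * 10,
})

# share of rows taken by the most frequent value in each column
dominant_share = toy.apply(
    lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
)
dominated = dominant_share[dominant_share > 0.9].index.tolist()
print(dominant_share)
print(dominated)
```

Looking at `dominant_share` directly makes it easier to see which columns sit just above or below the 90% cut-off.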


- `city`: before deleting this column, we keep only the rows located in Boston

In [19]:
df = df[df["city"].fillna("").str.lower().str.contains("boston")]
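
The filter above is a case-insensitive substring match in which missing cities are treated as empty strings. The following sketch, on hypothetical sample values, shows which rows such a mask keeps.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"city": ["Boston", "boston ", "Cambridge", np.nan, "East Boston"]})

# NaN becomes "", so it can never match; lower() makes the match case-insensitive
mask = toy["city"].fillna("").str.lower().str.contains("boston")
kept = toy[mask]
print(kept["city"].tolist())
```

Because this is a substring match, values such as "East Boston" are kept as well, while missing cities are dropped.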

* columns that appear to be invalid but are not, so we will keep them:
    - latitude
    - longitude
    - amenities



In [20]:
columns_to_delete = [col for col in columns_to_delete if col not in ['latitude', 'longitude', 'amenities']]

In [21]:
print("old shape: ", df.shape)
df = df.drop(columns=columns_to_delete)
print("new shape: ", df.shape)


old shape:  (3412, 106)
new shape:  (3412, 43)
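
Several checks flag the same column (for example, `thumbnail_url` is both a URL column and a duplicate), so `columns_to_delete` can contain repeated names. A minimal sketch of deduplicating such a list while preserving order, using hypothetical toy data, is:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "host_id": [7, 8], "price": [100, 120]})
to_delete = ["id", "host_id", "id"]  # "id" was flagged twice

# dict.fromkeys keeps the first occurrence of each name, in order
unique_to_delete = list(dict.fromkeys(to_delete))
cleaned = toy.drop(columns=unique_to_delete)

print(unique_to_delete)
print(cleaned.shape)
```

A deduplicated list is easier to review, and its length equals the number of distinct columns scheduled for removal.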
