<h1 style="font-family: Comic Sans MS; text-align: center; color: #FFC300">Data Processing Project: EDA of the Airbnb dataset</h1>

<div>
    <ol>
        <li><a href="#step1">Problem statement and data collection</a></li>    
        <li><a href="#step2">Exploration and data cleaning</a></li>
        <li><a href="#step3">Analysis of univariate variables</a></li>
        <li><a href="#step4">Analysis of multivariate variables</a></li>
        <li><a href="#step5">Feature engineering</a></li>
        <li><a href="#step6">Feature selection</a></li>
    </ol>
</div>

<h3 id="step1" style="font-family: Comic Sans MS; color: #68FF33">1. Problem statement and data collection</h3>

In [34]:
import pandas as pd
import numpy as np

# Load the data
path = 'C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Airbnb-EDA-project/data/raw/AB_NYC_2019.csv'
total_data = pd.read_csv(path)
total_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [35]:
total_data.to_csv('C:/Users/Jorge Payà/Desktop/4Geeks/DSML Bootcamp/Airbnb-EDA-project/data/raw/total_data.csv', index=False)

<h3 id="step2" style="font-family: Comic Sans MS; color: #68FF33">2. Exploration and data cleaning</h3>

In [36]:
# Obtain dimensions
print(total_data.shape)

(48895, 16)


In [37]:
# Obtain information about data types and missing values (aka null values)
# info() prints its report directly and returns None, so it needs no print()
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

<p style="font-family: Comic Sans MS; color: #33FFFC">
The dataset has 48895 rows (each one a vacation rental listing) and 16 columns, among which is the target variable to predict. The variables <em>last_review</em> and <em>reviews_per_month</em> have 38843 non-null values each, so each contains 10052 null values. The variables <em>name</em> and <em>host_name</em> also have null values, but far fewer (16 and 21, respectively). The remaining variables have no null values. The data has 8 numerical characteristics and 8 categorical characteristics, as follows:
    <ul style="font-family: Comic Sans MS; color: #33FFFC">
        <li>8 Numerical Characteristics (<em>latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365</em>)</li>
        <li>8 Categorical Characteristics (<em>id, name, host_id, host_name, neighbourhood_group, neighbourhood, room_type, last_review</em>). Note that <em>id</em> and <em>host_id</em> are stored as integers but act as identifiers, and <em>last_review</em> is a date stored as text.</li>
    </ul>
</p>
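<p style="font-family: Comic Sans MS; color: #33FFFC">
The null counts described above can also be read directly with <em>isnull().sum()</em> instead of being derived from the <em>info()</em> report. The following is a minimal sketch, not the notebook's original code: the <em>sample</em> DataFrame is a hypothetical stand-in built from three of the rows shown in the <em>head()</em> output, and <em>null_summary</em> is an illustrative name.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for total_data: three rows taken from the head()
# output above, restricted to a few columns.
sample = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "host_name": ["John", "Jennifer", "Elisabeth"],
    "number_of_reviews": [9, 45, 0],
    "last_review": ["2018-10-19", "2019-05-21", np.nan],
    "reviews_per_month": [0.21, 0.38, np.nan],
})

# Count missing values per column and express them as a share of all rows.
null_counts = sample.isnull().sum()
null_share = (null_counts / len(sample)).round(3)

null_summary = pd.DataFrame({"nulls": null_counts, "share": null_share})

# Keep only the columns that actually contain missing values.
null_summary = null_summary[null_summary["nulls"] > 0]
print(null_summary)
```

<p style="font-family: Comic Sans MS; color: #33FFFC">
Applied to the full dataset, the same two calls on <em>total_data</em> would list every column with missing values in one table.
</p>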

<h5 style="font-family: Comic Sans MS; color: #68FF33">Eliminate duplicates</h5>

In [38]:
# Check if there are duplicates in the data
print(f"The number of duplicated Name records is: {total_data['name'].duplicated().sum()}")
print(f"The number of duplicated Host ID records is: {total_data['host_id'].duplicated().sum()}")
print(f"The number of duplicated ID records is: {total_data['id'].duplicated().sum()}")

The number of duplicated Name records is: 989
The number of duplicated Host ID records is: 11438
The number of duplicated ID records is: 0


<ul style="font-family: Comic Sans MS; color: #33FFFC">
    <li><em>name</em> has 989 duplicated values. This looks odd at first, but it is plausible, since different hosts can give their listings the same title, e.g. "House in Brooklyn".</li>
    <li><em>host_id</em> has 11438 duplicated values, because some hosts have multiple Airbnb listings registered.</li>
    <li><em>id</em> has 0 duplicated values, so every record should correspond to a unique listing.</li>
</ul>
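<p style="font-family: Comic Sans MS; color: #33FFFC">
The checks above look at one column at a time. A complementary check is to look for rows that are repeated in every column and drop them. The following is a minimal sketch under stated assumptions: the <em>sample</em> DataFrame is hypothetical, built from two listings in the <em>head()</em> output with the first one repeated on purpose, and it is not a result from the real dataset.
</p>

```python
import pandas as pd

# Hypothetical data: two listings from the head() output above, with the
# first one repeated on purpose to simulate an exact duplicate row.
sample = pd.DataFrame({
    "id": [2539, 2595, 2539],
    "host_id": [2787, 2845, 2787],
    "name": [
        "Clean & quiet apt home by the park",
        "Skylit Midtown Castle",
        "Clean & quiet apt home by the park",
    ],
    "price": [149, 225, 149],
})

# Count rows that repeat an earlier row in every column.
n_full_duplicates = int(sample.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence of each row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(f"Fully duplicated rows: {n_full_duplicates}")
print(f"Rows before: {len(sample)}, rows after: {len(deduplicated)}")
```

<p style="font-family: Comic Sans MS; color: #33FFFC">
<em>drop_duplicates()</em> returns a new DataFrame, so the original data stays untouched until the result is assigned back.
</p>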


<h5 style="font-family: Comic Sans MS; color: #68FF33">Eliminate irrelevant information</h5>
<p style="font-family: Comic Sans MS; color: #33FFFC">
    Before the feature selection phase, we carry out a preliminary, controlled elimination of the variables that we can be sure the algorithm will not use in the predictive process, trying to be as objective as possible. The following variables have been considered irrelevant: <em>"id", "name", "host_name", "last_review", "reviews_per_month"</em>.
</p>

In [39]:
# Drop the columns considered irrelevant; reassigning avoids mutating in place
total_data = total_data.drop(columns=['id', 'name', 'host_name', 'last_review', 'reviews_per_month'])
total_data.head()

Unnamed: 0,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,calculated_host_listings_count,availability_365
0,2787,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,6,365
1,2845,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2,355
2,4632,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,1,365
3,4869,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,1,194
4,7192,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,1,0


<h3 id="step3" style="font-family: Comic Sans MS; color: #68FF33">3. Analysis of univariate variables</h3>
<p style="font-family: Comic Sans MS; color: #33FFFC">We must distinguish whether a variable is categorical or numerical. After removing the irrelevant variables, we have the following:
    <ul style="font-family: Comic Sans MS; color: #33FFFC">
        <li>7 Numerical Characteristics (<em>latitude, longitude, price, minimum_nights, number_of_reviews, calculated_host_listings_count, availability_365</em>)</li>
        <li>4 Categorical Characteristics (<em>host_id, neighbourhood_group, neighbourhood, room_type</em>)</li>
    </ul>
</p>
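<p style="font-family: Comic Sans MS; color: #33FFFC">
One way to make the categorical/numerical split reproducible is to derive it from the column dtypes and then override the columns that are numbers only in storage, such as <em>host_id</em>. The following is a minimal sketch, not the notebook's original code: the <em>sample</em> DataFrame is a hypothetical stand-in built from rows in the <em>head()</em> output, and <em>identifier_columns</em>, <em>numeric_columns</em> and <em>categorical_columns</em> are illustrative names.
</p>

```python
import pandas as pd

# Hypothetical stand-in for total_data after dropping irrelevant columns:
# three rows from the head() output above, restricted to a few columns.
sample = pd.DataFrame({
    "host_id": [2787, 2845, 4632],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "latitude": [40.64749, 40.75362, 40.80902],
    "price": [149, 225, 150],
    "minimum_nights": [1, 1, 3],
})

# Integer identifiers are labels, not quantities, so treat them as categorical.
identifier_columns = ["host_id"]

# Numerical variables: numeric dtype and not an identifier.
numeric_columns = [
    column
    for column in sample.select_dtypes(include="number").columns
    if column not in identifier_columns
]

# Everything else is handled as categorical.
categorical_columns = [
    column for column in sample.columns if column not in numeric_columns
]

print("Numerical:", numeric_columns)
print("Categorical:", categorical_columns)

# Univariate summaries: describe() for numbers, value_counts() for categories.
print(sample[numeric_columns].describe())
print(sample["room_type"].value_counts())
```

<p style="font-family: Comic Sans MS; color: #33FFFC">
Keeping the two lists as variables lets later cells loop over them instead of repeating column names by hand.
</p>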
<h5 style="font-family: Comic Sans MS; color: #68FF33">Analysis on categorical variables</h5>

<h3 id="step4" style="font-family: Comic Sans MS; color: #68FF33">4. Analysis of multivariate variables</h3>

<h3 id="step5" style="font-family: Comic Sans MS; color: #68FF33">5. Feature engineering</h3>

<h3 id="step6" style="font-family: Comic Sans MS; color: #68FF33">6. Feature selection</h3>