# EDA Project - Airbnb Paris

## Background

When I travel to Europe and book an Airbnb rental for a vacation, I try to spend less on lodging so that I can afford good dining and tours. In Paris specifically, the options close to the centre, that is, close to top attractions such as the Eiffel Tower and the Champs Elysees, are pricier, and most of them are unavailable as well, so I end up staying further out. When I do, I always worry that I have booked a property that is too far from the centre of the city, or too far from a metro station, since the exact location is not provided until after the reservation is made, unless you contact the host and ask. How far is too far depends on how much one is willing to walk, but for me anything more than a half-mile walk to a station is far, since I want to save my energy for the leisurely walking one does around Paris.

## Questions for Exploration / Goals


1) Explore and answer the following questions:

* How does location influence property rental price, reviews and availability?  

* What other features drive the price of an Airbnb rental property?

  e.g. price vs location, price vs distance from a top attraction, etc.



2) Recommend Airbnb properties that meet a given set of criteria, enabling a more informed decision for the traveler: 

* walking distance to one of the top attractions (within 2 miles)
* walking distance to a metro station (within 1 mile)
* rating >= 9
* other user-defined criteria, e.g.
  * number of beds
  * number of bedrooms
  * price range
  * type of property

## Datasets

The Airbnb dataset was downloaded from the OpenDataSoft website below.  The data was filtered so that only Airbnb rental properties in Paris, France were downloaded.

[Airbnb Paris Dataset](https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?disjunctive.host_verifications&disjunctive.amenities&disjunctive.features&refine.city=Paris&dataChart=eyJxdWVyaWVzIjpbeyJjaGFydHMiOlt7InR5cGUiOiJjb2x1bW4iLCJmdW5jIjoiQ09VTlQiLCJ5QXhpcyI6Imhvc3RfbGlzdGluZ3NfY291bnQiLCJzY2llbnRpZmljRGlzcGxheSI6dHJ1ZSwiY29sb3IiOiJyYW5nZS1jdXN0b20ifV0sInhBeGlzIjoiY2l0eSIsIm1heHBvaW50cyI6IiIsInRpbWVzY2FsZSI6IiIsInNvcnQiOiIiLCJzZXJpZXNCcmVha2Rvd24iOiJyb29tX3R5cGUiLCJjb25maWciOnsiZGF0YXNldCI6ImFpcmJuYi1saXN0aW5ncyIsIm9wdGlvbnMiOnsiZGlzanVuY3RpdmUuaG9zdF92ZXJpZmljYXRpb25zIjp0cnVlLCJkaXNqdW5jdGl2ZS5hbWVuaXRpZXMiOnRydWUsImRpc2p1bmN0aXZlLmZlYXR1cmVzIjp0cnVlfX19XSwidGltZXNjYWxlIjoiIiwiZGlzcGxheUxlZ2VuZCI6dHJ1ZX0%3D)

The Paris RER/metro dataset was downloaded from the website below and will be used for further exploration:


[Accessibilite des gares et stations de Metro](https://data.ratp.fr/explore/?sort=modified)


### Limitations
* The Airbnb data from OpenDataSoft was last updated in 2017, so availability information reflects what was captured at the last update.
* Historical data for property prices and availability is not available, so we are unable to check how price and availability vary across seasons or over time.
* Distances were computed with the Haversine formula, which gives a good estimate of the great-circle distance between two points, but this can differ from the actual walking distance.

## Cleaning the Data

The analysis reads the dataset in GeoJSON format. The original download has eighty-six (86) columns. Columns not needed for the analysis, e.g. calendar_last_scraped, calendar_updated, listing_url, were dropped to reduce the column count and focus the cleaning and analysis on data useful for this EDA.

The 'city' column was dropped from the file as well, since all the records downloaded are those pertaining to Airbnb rental properties in Paris, France.

Further research on zipcodes showed that zipcodes 75001 - 75020 correspond to Arrondissements 1 through 20 in Paris.  Since the analysis focuses on properties within the centre of Paris, records with zipcodes outside of this range were deleted.

*More information on Paris arrondissements is available on Wikipedia: https://en.wikipedia.org/wiki/Arrondissements_of_Paris
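
The zipcode filter described above could be sketched as follows. This is a minimal illustration, not the notebook's actual cleaning code: the column name `zipcode` and the helper `keep_central_paris` are assumptions made for the example.

```python
import pandas as pd

def keep_central_paris(df, zip_col="zipcode"):
    """Keep rows whose zipcode falls in 75001-75020 (arrondissements 1-20).

    `zip_col` is an assumed column name; adjust it to the real dataset.
    """
    # Coerce to numbers so malformed or missing zipcodes become NaN.
    zips = pd.to_numeric(df[zip_col], errors="coerce")
    # NaN never satisfies between(), so those rows are dropped as well.
    return df.loc[zips.between(75001, 75020)].copy()

sample = pd.DataFrame({"zipcode": ["75001", "75020", "92100", None, "7500"]})
central = keep_central_paris(sample)
print(central)
```

Coercing with `errors="coerce"` is one way to handle free-text zipcodes; rows that cannot be parsed are treated the same as out-of-range rows.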

### Missing Values:
Missing values in beds and bedrooms were imputed using the following rules:
* If 'beds' has a missing value, and 'bedrooms' has a valid value, 'beds' is set with the value of 'bedrooms'.
* If both 'bedrooms' and 'beds' are missing, both fields are set to 1, which is the average number of beds and bedrooms. It is also safe to assume that a property rented out on Airbnb has at least one bed, which we count as one bedroom even if the property is a studio apartment with no physical division between rooms.
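
The two imputation rules above might look like this in pandas. It is a sketch under the assumption that the columns are named `beds` and `bedrooms`; the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def impute_beds_bedrooms(df):
    """Apply the two imputation rules to assumed columns 'beds' and 'bedrooms'."""
    out = df.copy()
    # Remember which rows had both values missing before any filling.
    both_missing = df["beds"].isna() & df["bedrooms"].isna()
    # Rule 1: beds missing, bedrooms present -> copy bedrooms into beds.
    out["beds"] = out["beds"].fillna(out["bedrooms"])
    # Rule 2: both missing -> set both fields to 1.
    out.loc[both_missing, ["beds", "bedrooms"]] = 1
    return out

sample = pd.DataFrame({
    "beds":     [2, np.nan, np.nan, 3],
    "bedrooms": [1, 2,      np.nan, np.nan],
})
imputed = impute_beds_bedrooms(sample)
print(imputed)
```

Rows where only 'bedrooms' is missing are left untouched, since the rules above do not cover that case.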

Several records were also missing 'scores' values, so a new column, 'rating_ind', was created to flag rated (1) versus unrated (0) records and allow further analysis of the two populations.  A record is flagged as rated if all of its score values are populated, and as unrated if at least one score value is missing.
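
A possible way to build the 'rating_ind' flag is shown below. Which score columns were actually checked is not stated here, so the list passed in the example is an assumption, as is the helper name.

```python
import numpy as np
import pandas as pd

def add_rating_ind(df, score_cols):
    """Flag rows as rated (1) only when every column in score_cols is populated."""
    out = df.copy()
    # notna().all(axis=1) is True only if no score is missing in that row.
    out["rating_ind"] = out[score_cols].notna().all(axis=1).astype(int)
    return out

# Toy data with two of the review score fields (assumed column names).
sample = pd.DataFrame({
    "review_scores_rating":   [100.0, 90.0, np.nan],
    "review_scores_location": [10.0, np.nan, np.nan],
})
flagged = add_rating_ind(sample, ["review_scores_rating", "review_scores_location"])
print(flagged)
```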

### Data Transformation and Feature Engineering:

The following fields were added to the dataset as well:

1) arrondissement - derived from the last two digits of the zipcode

2) arrond_name - the arrondissement name, which was populated and compared with the values in the neighbourhood_cleansed field.  Some differences were found.  Since no information is available on how neighbourhood_cleansed was derived, arrond_name is used for the analysis instead, for consistency with the categorization by arrondissement.  


3) One distance column for each of the [2018 top 10 attractions in Paris](https://www.tripadvisor.com/Attractions-g187147-Activities-Paris_Ile_de_France.html#ATTRACTION_SORT_WRAPPER), holding the distance between the Airbnb property and that attraction.  These were derived using the [Haversine formula](https://community.esri.com/groups/coordinate-reference-systems/blog/2017/10/05/haversine-formula) and used to analyze the relationship with the rental property price.

["Musee d’Orsay", "Sainte-Chapelle", "Palais Garnier - Opera", "Notre Dame Cathedral", "Musee de l’Orangerie", "Luxembourg Gardens", "Louvre", "Eiffel", "Pont Alexandre III", "Le Marais"]

4) close_to_attraction - this is an indicator if the property is within 2 miles of one of the top 10 attractions

5) closest_attraction - this is the attraction closest to the Airbnb property

6) attraction_dist - distance between the Airbnb property and the closest attraction.
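
The distance features in items 3 to 6 could be derived roughly as follows. This is a sketch, not the notebook's code: the attraction coordinates are approximate values supplied for illustration, only three attractions are included, and the function names and the `latitude`/`longitude` input columns are assumptions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or pandas Series of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Approximate coordinates, assumed for this example only.
ATTRACTIONS = {
    "Eiffel": (48.8584, 2.2945),
    "Louvre": (48.8606, 2.3376),
    "Notre Dame Cathedral": (48.8530, 2.3499),
}

def add_attraction_features(df, attractions=ATTRACTIONS, max_miles=2.0):
    out = df.copy()
    # One distance column per attraction.
    for name, (lat, lon) in attractions.items():
        out[name] = haversine_miles(out["latitude"], out["longitude"], lat, lon)
    dist_cols = list(attractions)
    out["closest_attraction"] = out[dist_cols].idxmin(axis=1)
    out["attraction_dist"] = out[dist_cols].min(axis=1)
    # 1 if the nearest attraction is within max_miles, else 0.
    out["close_to_attraction"] = (out["attraction_dist"] <= max_miles).astype(int)
    return out

sample = pd.DataFrame({"latitude": [48.8622, 48.8000], "longitude": [2.3443, 2.5000]})
feats = add_attraction_features(sample)
print(feats[["closest_attraction", "attraction_dist", "close_to_attraction"]])
```

Because the function works on whole columns, the same helper can be reused for station distances by passing a different dictionary of coordinates.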




## Exploring the Data

In [2]:
import pandas as pd
dfParis = pd.read_pickle('airbnb_paris/airbnb_Paris_updt_0606.p')

Plotting the prices shows that Airbnb rental prices in Paris have a right-skewed, unimodal distribution.  The graph below shows that the majority of properties rent for less than \$200; the mean is \$93.72 and the median is only \$75.  There are some outliers, properties priced over \$400, which pull the mean above the median and skew the distribution to the right.

[graph 1.0]
<img src="files/images/airbnb_price_dist.png">

In [3]:
dfParis.price.describe()

count    52295.000000
mean        93.718673
std         71.128178
min          0.000000
25%         55.000000
50%         75.000000
75%        105.000000
max        999.000000
Name: price, dtype: float64

The plot below shows the correlation of the other features with the Airbnb rental property price:
* As expected, the distance between the Airbnb property and the top attractions is negatively correlated with price, as is the arrondissement number, while 'close_to_attraction' is positively correlated with price.

* The distance to Le Marais is the exception: it is positively correlated with price.  It is possible that properties farther from Le Marais are closer to the other top attractions, which would explain the inconsistency among the distances to the top attractions.  For the analysis, the 'close_to_attraction' indicator together with 'attraction_dist' will be used instead of the individual distances to each attraction.

* As for the other property features, the number of beds, bedrooms, and guests accommodated are positively correlated with price.  It is expected that the bigger the accommodation, the higher the rental price.

[graph 2.0]
<img src="files/images/features_corr.png">
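
A correlation ranking like the one plotted above can be computed along these lines. The sketch assumes Pearson correlation (the pandas default), which may or may not be what the plot used, and the toy column names are illustrative.

```python
import pandas as pd

def price_correlations(df, target="price"):
    """Correlation of every numeric column with the target, sorted ascending."""
    numeric = df.select_dtypes(include="number")  # text columns are ignored
    return numeric.corr()[target].drop(target).sort_values()

sample = pd.DataFrame({
    "price": [50, 60, 80, 100],
    "beds": [1, 2, 3, 4],
    "attraction_dist": [3.0, 2.0, 1.0, 0.5],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Entire home/apt"],
})
corr = price_correlations(sample)
print(corr)
```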

#### Exploring Location (Arrondissements) and Availability_30

If we look at each of our top 10 attractions in Paris, we can see that the attractions are in arrondissements 1 through 9.

As the correlation graph above showed, the arrondissement number is negatively correlated with price: on average, Airbnb properties in the lower-numbered arrondissements cost more than properties in the higher-numbered ones, which is what the graph below shows.  There are exceptions: it appears that Arr 8 - Champs Elysees has the highest rental prices, and Arr 9 - Opera has lower rental prices than arrondissements 1 - 8.  The former could be because the Champs Elysees, a very famous avenue, is in a generally pricier neighborhood of Paris.

<img src="files/images/df_arrondissements.png">



[graph 3.0]
<img src="files/images/line_pricebyArr.png">
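
The per-arrondissement comparison above can be reproduced with a simple group-by. This sketch assumes columns named `arrondissement` and `price`; it reports the median alongside the mean because the price distribution is skewed.

```python
import pandas as pd

def price_by_arrondissement(df):
    """Mean, median and listing count of price for each arrondissement."""
    return (df.groupby("arrondissement")["price"]
              .agg(["mean", "median", "count"])
              .sort_index())

sample = pd.DataFrame({
    "arrondissement": [1, 1, 8, 20],
    "price": [100.0, 120.0, 150.0, 60.0],
})
stats = price_by_arrondissement(sample)
print(stats)
```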

With regard to availability, the graphs below show that price is directly related to the availability of Airbnb rental properties in Paris: properties that are available for 5 or more of the next 30 days have an average price of around \$100 or more.

[graph 4.0]
<img src="files/images/airbnb_price_byAvail.png">

If we look at availability by price range, we can see that for a budget of \$100/night or less there are far fewer available properties in Arrondissements 1 through 8.  Increasing the budget gives a better chance of getting a property close to the attractions (in Arrondissements 1 through 8); if that is not an option, looking outside of arrondissements 1 through 8 gives more options within the \$100/night budget.

[graph 5.0]
<img src="files/images/Avail_less100.png">

#### Exploring Rating Indicator with Availability and Price

The graphs above show all properties, both rated and unrated, where unrated includes properties with incomplete scores.  If we break down availability by rating indicator, we see higher availability for 'unrated' properties.  This is as expected, since people tend to stay in places with higher ratings, where other renters have had a good experience.

[graph 6.0]
<img src="files/images/box_rating_avail.png">

With regard to price, there appears to be no significant difference between rated and unrated properties.  As seen in the correlation plot in [graph 2.0], the review_scores fields (review_scores_value, review_scores_communication, review_scores_location, review_scores_cleanliness, review_scores_checkin, review_scores_accuracy) are positively, but not strongly, correlated with price.  The strongest correlation is with review_scores_location, which makes sense because, as we have seen, properties closer to the attractions (Arr 1-8) cost more than properties outside of Arrondissements 1-8.

[graph 7.0]
<img src="files/images/box_ratingprice.png">

### Exploring other Features

#### Exploring Property Type

In the graphs below, we can see that the property type does affect the price, as in the case of the Camper/RV, Igloo, and villa property types.  The average price for these properties is around \$300 and above, which is well above the mean and median price of rental properties in Paris.  These property types represent less than 1% of the total Airbnb rental properties in Paris, thus causing the skewness we observed in the price distribution.

<img src="files/images/airbnb_price_byPtype.png">
<img src="files/images/property_type.png">



#### Exploring the distance to a Top Attraction with Property price and Availability

To get closer to the recommendations, the dataset was filtered further to include only Airbnb properties that are within 2 miles of one of the top attractions AND within 1 mile of a RER/metro station.  Since most properties met both conditions, each property was then classified by the number of attractions within 1 mile and the number of stations within 0.2 miles.


Properties that are not within a mile of any top attraction have the lowest median price.  Those with 1 to 4 attractions within a mile have median prices that are close to each other, and those within a mile of more than five (5) attractions have the highest median price.

The same holds for closeness to a station.  Properties that are more than 0.2 miles from any station have the lowest median price.  Those with 1-2 stations within 0.2 miles have median prices that are close to each other, and those with 3-4 stations within 0.2 miles have the highest prices.

<img src="files/images/box_recmd_price_vs_sitecount.png">

<img src="files/images/box_recmd_price_vs_stncount.png">
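
The counts behind these box plots could be derived as sketched below. It assumes the per-attraction distance columns already exist; the helper name, the toy columns, and the bucket edges ("0", "1-4", "5+") are assumptions chosen to mirror the grouping described above.

```python
import pandas as pd

def count_within(df, dist_cols, radius_miles):
    """Number of distance columns per row that are within radius_miles."""
    return (df[dist_cols] <= radius_miles).sum(axis=1)

sample = pd.DataFrame({
    "price":  [100, 80, 60],
    "Eiffel": [0.5, 1.5, 3.0],   # miles to each attraction (toy values)
    "Louvre": [0.9, 0.2, 4.0],
})
sample["site_count"] = count_within(sample, ["Eiffel", "Louvre"], 1.0)

# Group counts into buckets; the edges here are assumed, not taken from the notebook.
sample["site_bucket"] = pd.cut(
    sample["site_count"],
    bins=[-1, 0, 4, float("inf")],
    labels=["0", "1-4", "5+"],
)
median_price = sample.groupby("site_bucket", observed=True)["price"].median()
print(median_price)
```

The same `count_within` helper applies to station distances with a 0.2 mile radius.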

## Summary

### 1) How does location influence property rental price, reviews and availability?  

* Based on our findings from this EDA, we confirmed that the properties in Arrondissements 1 through 8, where most of the top attractions are located, cost more than properties outside of those arrondissements.

* Availability in Arrondissements 1 - 8 is also lower than outside of Arrondissements 1 - 8, especially for properties rented at less than \$100/night.

* With regard to rated versus unrated properties, availability of unrated properties is higher.  A traveler could consider unrated properties to have more options for booking closer to the centre.



### 1b) What other features drive the price of an Airbnb rental property?

* The number of guests accommodated, beds, and bedrooms has a strong positive correlation with the rental price.

* The location of the property matters, i.e. its arrondissement and whether it is within 2 miles of one of the top 10 attractions (close_to_attraction == 1).

* The property type affects the rental price and, in this case, causes the skewness in the price distribution: at least 95% of the Airbnb rental properties in Paris have property type = Apartment, while the higher-priced types (\$300 and above), such as Camper/RV, Igloo, and villa, represent less than 1% of the total.

* After computing the distances to the top 10 attractions, the data shows that properties more than a mile from these attractions cost less, while properties within a mile of one or more attractions cost more on average.

* With regard to distance from a station, the data shows that properties within 0.2 miles of at least one station cost more on average than properties more than 0.2 miles from a station.




### 2) Airbnb Paris Recommendations




Recommendation filter 1:

* property is within 1 mile of at least one of the top attractions

* Room type == Entire home/apt

* property is within 0.2 miles of more than 2 stations


Recommendation filter 2: (if filter #1 returns fewer than 20 properties, results from #2 will be added to the recommendation list)

* property is within 1 mile of at least one of the top attractions AND
* Room type == Entire home/apt AND
* property is within 0.2 miles of at least 2 stations

The recommended Airbnb properties below satisfy the following base criteria, plus the additional filters defined in Recommendation filters 1 and 2:  
* Beds: >= 1  
* Accommodates >= 2  
* Available_30 >=  5 days  
* Price Max: $85  
* Rating Ind = 1 (rated properties only)
* Room type = Entire home/Apt
* Review_Scores_Rating = 100.0
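
The base criteria and the two recommendation filters could be combined as in the sketch below. It is not the code that produced the results file: the column names (`accommodates`, `availability_30`, `room_type`, `review_scores_rating`, and so on) are assumptions, and `site_count` / `station_count` are taken to mean attractions within 1 mile and stations within 0.2 miles.

```python
import pandas as pd

def recommend(df, max_price=85, min_results=20):
    """Apply the base criteria, then filter 1, falling back to filter 2."""
    base = (
        (df["beds"] >= 1)
        & (df["accommodates"] >= 2)
        & (df["availability_30"] >= 5)
        & (df["price"] <= max_price)
        & (df["rating_ind"] == 1)
        & (df["room_type"] == "Entire home/apt")
        & (df["review_scores_rating"] == 100.0)
        & (df["site_count"] >= 1)          # at least one attraction within 1 mile
    )
    # Filter 1: more than 2 stations within 0.2 miles.
    strict = df[base & (df["station_count"] > 2)]
    if len(strict) >= min_results:
        return strict
    # Filter 2: add properties with exactly 2 stations within 0.2 miles.
    extra = df[base & (df["station_count"] == 2)]
    return pd.concat([strict, extra])

sample = pd.DataFrame({
    "id": ["A", "B", "C"],
    "beds": [1, 2, 1],
    "accommodates": [2, 4, 2],
    "availability_30": [10, 5, 12],
    "price": [55, 80, 120],
    "rating_ind": [1, 1, 1],
    "room_type": ["Entire home/apt"] * 3,
    "review_scores_rating": [100.0] * 3,
    "site_count": [7, 1, 5],
    "station_count": [3, 2, 4],
})
picks = recommend(sample)
print(picks[["id", "price", "site_count", "station_count"]])
```

Keeping the strict matches first in the result preserves the priority of filter 1 over filter 2.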
   

In [8]:
import pandas as pd
results = pd.read_pickle('airbnb_paris/recommendation_results_0615.p')
results.reset_index(drop = True, inplace = True)
results.style

|  | id | arrondissement | arrond_name | price | cancellation_policy | closest_attraction | attraction_dist | site_count | station_count | Eiffel Tower | The Louvre | Jardin du Luxembourg | Le Marais | Musee d'Orsay | Sainte-Chapelle | Palais Garnier - Opera | Notre Dame Cathedral | Musee de l'Orangerie | Pont Alexandre III | station1_name | station1_dist | station2_name | station2_dist | station3_name | station3_dist | station4_name | station4_dist | station5_name | station5_dist | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Property12691441 | 1 | Louvre | 55 | flexible | The Louvre | 0.324823 | 7 | 3 | 2.27925 | 0.324823 | 1.15417 | 0.703087 | 0.819248 | 0.47351 | 0.887801 | 0.687199 | 0.987702 | 1.40016 | Les Halles | 0.100234 | Châtelet-Les Halles | 0.127479 | Louvre-Rivoli | 0.177881 | Etienne Marcel | 0.23604 | Châtelet | 0.302536 | 48.86223720346151 | 2.344298191899912 |
| 1 | Property1846308 | 11 | Popincourt | 80 | moderate | The Louvre | 0.347938 | 6 | 3 | 2.31314 | 0.347938 | 1.08447 | 0.636549 | 0.849699 | 0.385578 | 0.981818 | 0.590699 | 1.04265 | 1.45202 | Châtelet-Les Halles | 0.0805858 | Les Halles | 0.0911123 | Louvre-Rivoli | 0.19373 | Châtelet | 0.206811 | Etienne Marcel | 0.254388 | 48.86097809420282 | 2.3452323445275267 |
| 2 | Property3247039 | 6 | Luxembourg | 70 | moderate | Sainte-Chapelle | 0.376824 | 6 | 3 | 1.94343 | 0.424355 | 0.572341 | 1.04049 | 0.601516 | 0.376824 | 1.23375 | 0.602936 | 0.909621 | 1.24056 | Mabillon | 0.136513 | Odéon | 0.180855 | Saint-Germain des Prés | 0.186248 | Saint-Michel | 0.33141 | Pont Neuf | 0.369129 | 48.85447951841604 | 2.336830506364772 |
| 3 | Property10489498 | 1 | Louvre | 70 | strict | Palais Garnier - Opera | 0.435389 | 6 | 3 | 1.77461 | 0.436807 | 1.36874 | 1.3139 | 0.462505 | 0.927324 | 0.435389 | 1.19914 | 0.439555 | 0.842335 | Pyramides | 0.112749 | Pyramides | 0.129119 | Tuileries | 0.143665 | Palais-Royal (Musée du Louvre) | 0.227875 | Opéra | 0.340652 | 48.86570261463177 | 2.3319274559507868 |
| 4 | Property16902638 | 2 | Bourse | 70 | flexible | Le Marais | 0.60404 | 5 | 3 | 2.54114 | 0.619101 | 1.41117 | 0.60404 | 1.09602 | 0.69089 | 0.948613 | 0.82583 | 1.22113 | 1.6338 | Etienne Marcel | 0.0888375 | Réaumur-Sébastopol | 0.163637 | Sentier | 0.18792 | Les Halles | 0.24529 | Arts-et-Métiers | 0.302903 | 48.86494885382183 | 2.3495076340203824 |
| 5 | Property17680319 | 1 | Louvre | 75 | flexible | The Louvre | 0.338715 | 5 | 3 | 2.03562 | 0.338715 | 1.33377 | 1.05725 | 0.641084 | 0.767485 | 0.534395 | 1.01984 | 0.703575 | 1.11286 | Pyramides | 0.179079 | Palais-Royal (Musée du Louvre) | 0.186416 | Pyramides | 0.188924 | Bourse | 0.257117 | Louvre-Rivoli | 0.347106 | 48.865496179677024 | 2.3379621719063404 |
| 6 | Property16052249 | 2 | Bourse | 55 | strict | Le Marais | 0.600863 | 5 | 3 | 2.55661 | 0.635784 | 1.42506 | 0.600863 | 1.11222 | 0.703758 | 0.956175 | 0.834661 | 1.23579 | 1.64828 | Etienne Marcel | 0.101911 | Réaumur-Sébastopol | 0.146928 | Sentier | 0.187308 | Les Halles | 0.26074 | Arts-et-Métiers | 0.287805 | 48.86507933032343 | 2.3498173130555724 |
| 7 | Property15563015 | 2 | Bourse | 69 | strict | Le Marais | 0.785244 | 4 | 3 | 2.70245 | 0.867506 | 1.71164 | 0.785244 | 1.30144 | 0.991827 | 0.93589 | 1.11274 | 1.36739 | 1.76814 | Strasbourg-Saint-Denis | 0.133311 | Bonne Nouvelle | 0.179496 | Réaumur-Sébastopol | 0.195368 | Sentier | 0.221946 | Arts-et-Métiers | 0.314485 | 48.86906032099319 | 2.3517008654862743 |
| 8 | Property10347142 | 3 | Temple | 55 | moderate | Le Marais | 0.428041 | 4 | 3 | 2.71693 | 0.765681 | 1.43423 | 0.428041 | 1.26014 | 0.707518 | 1.14914 | 0.772453 | 1.40825 | 1.82188 | Arts-et-Métiers | 0.158819 | Réaumur-Sébastopol | 0.173283 | Rambuteau | 0.187806 | Etienne Marcel | 0.213903 | Châtelet-Les Halles | 0.353382 | 48.8638989955195 | 2.3536801577761324 |
| 9 | Property4280862 | 5 | Pantheon | 79 | flexible | Notre Dame Cathedral | 0.530646 | 4 | 3 | 2.70487 | 1.20872 | 0.611314 | 0.996372 | 1.4885 | 0.740392 | 2.0346 | 0.530646 | 1.79854 | 2.1148 | Cardinal-Lemoine | 0.100877 | Place Monge (Jardin des Plantes) | 0.165242 | Jussieu | 0.197219 | Maubert-Mutualité | 0.312084 | Cluny-La Sorbonne | 0.453983 | 48.84533334631346 | 2.350578985809922 |


A map of Paris with the recommendations can be viewed below:

[Paris_map](https://cdn.rawgit.com/ayeshavm/K2_Project2_EDA/8de36452/images/paris_map.html)

## Further Research and Analysis



* Refine the recommendations by assigning weights to features, e.g. give each arrondissement a weight, each attraction a weight, etc.

* Incorporate additional information, e.g. restaurants and cafes close to the area, and include it in the analysis and recommendation process as well.

* Use more current data via Airbnb APIs.

* Perform text analysis on comments and other features/amenities.