# Airbnb

## Context and objectives

You have been commissioned by the CEO of Airbnb Spain to analyze the stock of accommodations in Madrid. More specifically, you are going to investigate the price of Airbnb accommodations in Madrid from April 2021 to April 2022.

## The database

### Download

Download the database into the `db` directory:

In [None]:
!curl https://wagon-public-datasets.s3.amazonaws.com/certification/airbnb_profits_analysis/airbnb.sqlite > db/airbnb.sqlite

Check the database has been saved:

In [190]:
!tree

### Schema

Open the database with your favorite tool (DBeaver, sqlite3, Postico,...) then:
- **📝 Draw the database schema in the Le Wagon editor at https://kitt.lewagon.com/db**
- **💾 Save the `XML` version of the database schema in a `db/airbnb.xml` file**

Once done, check you have the `airbnb.xml` file in the `db` directory:

In [191]:
!tree

### Querying the database

To perform the analysis, you need to fetch the following information for **all** the accommodations in the database:
- `id`: the unique identifier of the accommodation
- `price`: the value paid per night in USD
- `neighbourhood`: the neighbourhood the accommodation belongs to
- `neighbourhood_group`: the neighbourhood group the accommodation belongs to
- `bedrooms`: the number of bedrooms 
- `beds`: the number of beds
- `accommodates`: number of persons the accommodation is suited for
- `amenities`: a list of amenities of the accommodation
- `minimum_nights`: the minimum number of nights which can be booked
- `maximum_nights`: the maximum number of nights which can be booked in a row
- `host_id`: the unique identifier of the host
- `host_since`: date of the first listing of the host
- `host_neighbourhood`: the neighbourhood the host belongs to
- `host_location`: the location of the host
- `host_response_time`: the category of response time of the host
- `host_response_rate`: the response rate of the host in %
- `host_acceptance_rate`: the acceptance rate of the host in %
- `host_is_superhost`: whether or not the host is a Superhost
- `host_has_profile_pic`: whether or not the host has a profile picture
- `host_identity_verified`: whether or not the identity of the host is verified
- `latitude`: latitude of the accommodation
- `longitude`: longitude of the accommodation
- `room_type`: category of the accommodation
- `property_type`: category of the property the accommodation belongs to
- `review_scores_rating`: average score rating for the accommodation in %
- `number_of_reviews`: total number of reviews
- `yearly_availability`: the total number of days in a year on which the accommodation is available to guests

**📝 Write an SQL query to fetch the above information and store it as a `str` in the `query` variable.**

In [163]:
# YOUR CODE HERE

**📝 Connect to the `airbnb.sqlite` database and use the query above to store the data in a `DataFrame` named `data`. Display the first 10 rows.**

In [164]:
# YOUR CODE HERE
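
As a rough illustration of the connection step, the sketch below uses Python's built-in `sqlite3` module together with `pandas.read_sql_query`. It runs against a throwaway in-memory database with a hypothetical `listings` table, because the real schema of `airbnb.sqlite` is not described here. The names `demo_query` and `demo_data` are also made up for the example; in the challenge you would connect to `db/airbnb.sqlite` and pass your own `query`.

```python
import sqlite3

import pandas as pd

# Stand-in for sqlite3.connect("db/airbnb.sqlite"): an in-memory database
# with a hypothetical `listings` table (the real schema may differ).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, price REAL, beds INTEGER)"
)
conn.executemany(
    "INSERT INTO listings (id, price, beds) VALUES (?, ?, ?)",
    [(1, 45.0, 1), (2, 120.0, 3), (3, 80.0, 2)],
)
conn.commit()

# Hypothetical query; in the notebook this would be your own `query` string.
demo_query = "SELECT id, price, beds FROM listings ORDER BY id"

# read_sql_query runs the SQL and returns the result set as a DataFrame.
demo_data = pd.read_sql_query(demo_query, conn)
conn.close()

print(demo_data.head(10))
```

Closing the connection once the `DataFrame` is loaded avoids keeping the SQLite file locked while you work on the data.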

### 💾 Save your results

Run the cell below to save your results.

In [195]:
from nbresult import ChallengeResult
result = ChallengeResult(
    'query',
    query=query,
    shape=data.shape,
    columns=data.columns,
    host_locations=data['host_location'].unique(),
    maximums=data.max(axis=0),
    minimums=data.min(axis=0),
    means=data.mean(axis=0)
)
result.write()

### Load data from a CSV file

We provide a clean dataset that you should start with to perform your analysis:

**📝 Load the data from this URL: https://wagon-public-datasets.s3.amazonaws.com/certification/airbnb_profits_analysis/airbnb.csv inside a `DataFrame` named `accommodations`. Display the first 10 rows.**

In [1]:
# YOUR CODE HERE

## Exploratory analysis

In this section, explore the dataset and visualize the data to get some intuitions.

In particular, try to make sense of the relationships between the price of an accommodation and its characteristics.


ℹ️ We are **not** expecting a multivariate analysis at this point (using the `statsmodels` package)

In [3]:
# YOUR CODE HERE

## Statistical analysis

These analyses can help you build your presentation, but you are **strongly encouraged** to follow your own findings.

### Hotel room statistics

The Airbnb team is particularly interested in **Hotel rooms**, so you will have to answer some questions about them.

To do so, we are considering a binary segmentation: **Hotel rooms vs the rest**.

**❓ Using a statistical test, can you tell whether Hotel rooms are statistically more expensive than the other rooms?**

Store the $p\text{-}value$ of your test inside a `p_value` variable.

In [6]:
# YOUR CODE HERE
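
One possible way to frame this question is a one-sided Welch's t-test, sketched below with `scipy.stats.ttest_ind`. The data is synthetic, and the `room_type` labels, group sizes, and price levels are assumptions made for the example, not values taken from the dataset. If the real prices are far from normally distributed, a rank-based test such as Mann-Whitney U may be a better fit.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for `accommodations` (labels and prices are assumptions).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "room_type": ["Hotel room"] * 50 + ["Entire home/apt"] * 200,
        "price": np.concatenate(
            [rng.normal(150, 30, 50), rng.normal(100, 30, 200)]
        ),
    }
)

# Binary segmentation: Hotel rooms vs the rest.
is_hotel = demo["room_type"] == "Hotel room"
hotel_prices = demo.loc[is_hotel, "price"]
other_prices = demo.loc[~is_hotel, "price"]

# H0: mean hotel price <= mean price of the rest.
# H1: mean hotel price > mean price of the rest (one-sided).
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, demo_p_value = stats.ttest_ind(
    hotel_prices, other_prices, equal_var=False, alternative="greater"
)

print(f"t = {t_stat:.2f}, one-sided p-value = {demo_p_value:.3g}")
```

A small p-value here only reflects how the synthetic groups were generated; the conclusion for Madrid has to come from the real data.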

### Price room analysis

**❓ Plot the distribution of the prices.**

ℹ️ Use the accommodations with a price lower than $200 for a better visualization

In [31]:
# YOUR CODE HERE

**❓ What do you think about that distribution?**

> YOUR ANSWER HERE

**📝 What transformation would you apply to the price to fit a linear model? Transform your target as you see fit.**

In [32]:
# YOUR CODE HERE
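
If the price distribution turns out to be right-skewed, a logarithmic transformation is one common option (not necessarily the best one for this dataset). The sketch below applies it to a handful of made-up prices and compares the skewness before and after; the `demo` frame and the `log_price` column name are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up prices with one expensive outlier, to mimic a right-skewed target.
demo = pd.DataFrame({"price": [20.0, 35.0, 50.0, 80.0, 120.0, 400.0]})

# Natural log of the price; this requires strictly positive prices.
demo["log_price"] = np.log(demo["price"])

# Sample skewness of the raw and transformed target (closer to 0 is more symmetric).
print("skew(price)     =", round(demo["price"].skew(), 2))
print("skew(log_price) =", round(demo["log_price"].skew(), 2))

# Predictions made on the log scale are mapped back to prices with exp.
recovered = np.exp(demo["log_price"])
```

Keeping the original `price` column alongside the transformed one makes it easy to report results in USD later.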


**❓ Can we explain the price for a customer with our features?**

Using the numerical and categorical features of your choice, try to fit a reasonably simple model to explain the price of an accommodation.

Store the `summary` of the model inside a `model_summary` variable.

In [274]:
# YOUR CODE HERE
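
A minimal sketch of such a model, assuming the `statsmodels` formula API and a log-transformed target, is shown below. The data is synthetic: the coefficients used to generate it, the two `room_type` labels, and the names `demo`, `demo_model`, and `demo_summary` are all assumptions for the example, and the choice of features is only illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `accommodations` (all values are made up).
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame(
    {
        "accommodates": rng.integers(1, 7, n),  # 1 to 6 guests
        "room_type": rng.choice(["Entire home/apt", "Private room"], n),
    }
)
# Log price generated from a known linear relationship plus noise.
demo["log_price"] = (
    3.5
    + 0.15 * demo["accommodates"]
    + 0.4 * (demo["room_type"] == "Entire home/apt")
    + rng.normal(0, 0.2, n)
)

# One numerical and one categorical feature; C(...) marks the categorical one.
demo_model = smf.ols("log_price ~ accommodates + C(room_type)", data=demo).fit()
demo_summary = demo_model.summary()

print(demo_summary)
```

Because the target is the log of the price, a coefficient `b` on a numerical feature means that a one-unit increase multiplies the price by `exp(b)`, which is roughly a `100 * b` percent change when `b` is small.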

**❓ Which features best explain the price of an accommodation in Madrid?**
- Which ones are the most statistically significant?
- To which numerical feature is the price most sensitive?

> YOUR ANSWER HERE

**❓ Explain in your own words the impact of a one-unit increase in the feature of your choice on the price of an accommodation.**

> YOUR ANSWER HERE

**❓ Are you satisfied with your model? Why?**

> YOUR ANSWER HERE

**❓ Are you confident in the p-values of your model?**

In [None]:
# YOUR CODE HERE

### 💾 Save your results

Run the cell below to save your results.

In [326]:
from nbresult import ChallengeResult

result = ChallengeResult('analysis', p_value=p_value, model_summary=str(model_summary))
result.write()

## CEO question

> **How can we increase the average rating of the accommodations in Madrid to 95% while keeping the revenue as high as possible?**

[BONUS] In addition, could you give some advice / quick wins to the Airbnb Hosts Team to help hosts increase their accommodation price when onboarding on the platform?

⚠️ For this study, we assume that:
- An accommodation is booked **60%** of the available time
- Airbnb takes a **5% fee** on the revenue per accommodation
- The actions you will recommend have a _negligible impact_ on the actual charges of Airbnb
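
One possible reading of these assumptions, written out as arithmetic: the yearly revenue of an accommodation is its nightly price times its booked nights (60% of `yearly_availability`), and Airbnb's share is 5% of that amount. The function name, constant names, and example figures below are hypothetical.

```python
OCCUPANCY_RATE = 0.60   # share of available days that are booked (stated assumption)
AIRBNB_FEE_RATE = 0.05  # Airbnb fee on the revenue per accommodation (stated assumption)


def airbnb_yearly_fee(price, yearly_availability):
    """Estimate the yearly fee Airbnb earns from one accommodation.

    price: nightly price in USD.
    yearly_availability: number of days per year the accommodation is available.
    """
    booked_nights = yearly_availability * OCCUPANCY_RATE
    accommodation_revenue = price * booked_nights
    return accommodation_revenue * AIRBNB_FEE_RATE


# Example: $100 per night, available 200 days a year.
# 200 * 0.60 = 120 booked nights, 120 * 100 = $12,000 revenue, 5% of which is $600.
example_fee = airbnb_yearly_fee(price=100.0, yearly_availability=200)
print(round(example_fee, 2))  # prints 600.0
```

The same function can be applied to the `price` and `yearly_availability` columns of a `DataFrame` to compare scenarios, for instance before and after removing low-rated accommodations.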

## Presentation

Based on the analysis of the `accommodations` dataset, prepare a slide deck to present your conclusions to the CEO of Airbnb Spain. The presentation must contain **5 slides maximum** (including the title slide).

💡 The CEO is fond of illustrations, figures and statistics.


ℹ️ You may follow the [pyramid principle](https://gettingbettereveryday.org/2018/10/05/what-you-could-learn-from-barbra-mintos-the-pyramid-principle-2009-172-pages/) with an inductive approach, actions first!


⚠️ Save your presentation at the root of the repository.

🚀 Your turn!