# Boat rentals (24 hour analysis)

### Project Brief

Congratulations, you have just landed your first job as a data scientist at ABC! ABC is a website that allows users to advertise their used boats for sale. To boost traffic to the website, the product manager wants to prevent listing boats that do not receive many views.

The product manager wants to know if you can develop a model to predict the number of views a listing will receive based on the boat's features. She would consider using your model if, on average, the predictions were only 50% off of the true number of views a listing would receive.

In addition, she has noticed that many users never complete the introductory survey to list their boat. She suspects that it is too long and has asked you whether some features are more predictive of views than others. If so, she may be able to trim the length of the survey and increase the number of people who sign up.



#### Table of Contents
* [The Dataset](#dataset)
* [Data cleaning](#clean)
* [Feature Selection & Model Evaluation](#feature)
* [Final recommendations](#recommend)
* [Future work](#future)


### The Dataset <a class="anchor" id="dataset"/>

Increasing both the sign-up rate and the traffic to the website is challenging, and even more so when the sign-up experience consists of a long list of questions. This notebook attempts to address these challenges using a dataset of 9,888 entries. Although the dataset is not large, it is highly inconsistent, as shown under [Data cleaning](#clean). Here is an overview of the dataset and its columns:

<img src="https://raw.githubusercontent.com/dissagaliyeva/boat-rentals/master/data/data.png" width="500"/>

### Data cleaning <a class="anchor" id="clean"/>

The first step is to identify the problems and create the appropriate action plan. After careful analysis, the following patterns were revealed:
- The Location, Manufacturer, and Boat Type columns have a very large number of unique values, which is a problem when building models. For example, the Location column contains 2995 unique values, so one-hot encoding would give each of them its own column and blow up the dimensionality.
- 29% of the entries have at least one missing value. Most of these fall in the "Material" (18%) or "Manufacturer" (14%) columns.
- The Price column contains a mixture of GBP, EUR, CHF, and DKK. As seen below, almost 95% of all values are in EUR, so all prices were converted to that currency.
<img src="https://github.com/dissagaliyeva/boat-rentals/blob/master/data/prices.png"/>
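
The currency conversion could look roughly like the sketch below. It assumes each raw price is a string such as `"CHF 3337"` (currency code, space, amount), which is a guess about the raw format, and the exchange rates in `RATES_TO_EUR` are hypothetical placeholders rather than the rates used in this analysis.

```python
# Hypothetical exchange rates to EUR; replace with real rates for the relevant date.
RATES_TO_EUR = {"EUR": 1.0, "GBP": 1.15, "CHF": 1.0, "DKK": 0.13}


def price_to_eur(raw_price: str) -> float:
    """Convert a string such as 'CHF 3337' to an amount in EUR."""
    # Assumed format: "<currency code> <amount>"
    currency, amount = raw_price.split(maxsplit=1)
    if currency not in RATES_TO_EUR:
        raise ValueError(f"Unknown currency: {currency!r}")
    return float(amount) * RATES_TO_EUR[currency]


# Made-up sample values for illustration only.
prices = ["EUR 3490", "CHF 3337", "DKK 25900"]
prices_eur = [round(price_to_eur(p), 2) for p in prices]
print(prices_eur)
```

Raising on an unknown currency code, instead of silently passing the value through, makes unexpected entries visible during cleaning.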


After cleaning categorical columns, the number of unique values dropped drastically:

| Column | Before (unique values) | After (unique values) |
| --- | --- | --- |
| Boat Type | 126 | 14 |
| Manufacturer | 910 | 742 |
| Type | 24 | 4 |
| Material | 11 | 6 |
| Location | 2995 | 8 |
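
One common way to reduce cardinality like this is to keep the frequent categories and merge the rare ones into a single bucket. The sketch below illustrates that idea on made-up values; the helper `group_rare`, the threshold, and the `"Other"` label are assumptions for illustration and not necessarily the cleaning rules used in this project.

```python
from collections import Counter


def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up sample; the real Location column holds thousands of distinct values.
locations = ["Germany", "Germany", "Italy", "Italy", "Italy", "Malta", "Isle of Man"]
grouped = group_rare(locations, min_count=2)
print(sorted(set(grouped)))
```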

### Feature Selection & Model Evaluation <a class="anchor" id="feature"/>

To see how variables relate to each other, a correlation matrix is the usual tool. However, a standard correlation matrix only works with numeric data, and this dataset is largely categorical. I therefore use a Predictive Power Score (PPS) matrix instead [1], which also reveals relationships that involve categorical columns.

<img src="https://github.com/dissagaliyeva/boat-rentals/blob/master/data/pps.png"/>

The matrix above reveals a few insights:
- Length and Width are correlated, which is not surprising. One of them needs to be dropped; I am removing the "Width" column.
- The Manufacturer column is strongly related to the "Material", "Boat Type", and "Location" columns. It is also removed, for two reasons: multicollinearity and a large number of unique values (over 700).
- No column seems to be correlated with the target, "Number of views last 7 days".


#### Model
Since the dataset has many outliers and a large share of categorical columns, **RandomForestRegressor** was chosen as the model. It performed better in both training time and accuracy than AdaBoost and XGBoost; neither of those two models scored better than 25%.

The best performing model had the following hyper-parameters:
- max_features: 'sqrt'
- min_samples_leaf: 7
- n_estimators: 200
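
As a minimal sketch, these hyper-parameters map onto scikit-learn's `RandomForestRegressor` as shown below. The data here is synthetic and only stands in for the cleaned feature matrix and the view counts; the train/test split and the `random_state` values are assumptions added for reproducibility, not settings taken from the analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features (X) and the number of views (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 100 + 20 * X[:, 0] + rng.normal(scale=5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",
    min_samples_leaf=7,
    random_state=0,  # assumption: fixed seed so the run is repeatable
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

After fitting, `model.feature_importances_` holds one importance score per input column, which is the kind of information a feature importance plot is built from.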

The overall RMSE was 120. That is far from ideal, but it is a reasonable starting point given that only 24 hours were available to solve the problem.
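
The product manager's target is stated as a relative error ("50% off on average"), while RMSE is an absolute error, so the two are not directly comparable. The sketch below computes both on made-up numbers. Reading the target as a mean absolute percentage error (MAPE) is an assumption on my part, and the helper names `rmse` and `mape` are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes no true value is zero."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Made-up view counts and predictions, for illustration only.
y_true = [100, 200, 50, 400]
y_pred = [150, 180, 60, 300]
print(f"RMSE: {rmse(y_true, y_pred):.1f}")
print(f"MAPE: {mape(y_true, y_pred):.0%}")
```

Under this reading, the model would meet the target when the MAPE on held-out listings is at most 0.5.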


### Final recommendations <a class="anchor" id="recommend"/>

1. Fewer questions

One of the tasks was to analyze whether the sign-up survey asks too many questions. Feature importance helps answer this:

<img src="https://github.com/dissagaliyeva/boat-rentals/blob/master/data/features.png">

As seen above, not all the questions were used in the analysis. The two columns that initially had the most missing values (Material, 18%, and Manufacturer, 14%) turned out to be insignificant. Therefore, to increase the sign-up rate, the survey could be limited to Location, Price, Length & Width, and Year Built.

2. More thorough analysis & better model

It makes sense to spend more time analyzing the dataset to uncover more patterns. With only 24 hours to submit a complete analysis, the work was rushed and some important points may have been overlooked. Before deciding on next steps, further analysis of the dataset is strongly recommended. Possible directions are discussed in the [Future work](#future) section.

3. Look at the boats with the fewest views in more detail

The analysis provided here gives food for thought for follow-up work. Interestingly, the most expensive boat types, Mega Yacht and Houseboat, got fewer views than anticipated. It might be helpful to analyze these entries further.


4. Improved UX and UI

The dataset contained many inconsistent entries that had to be cleaned. One big improvement would be to offer drop-down menus for the categorical fields so that no bad data is introduced.


### Future work <a class="anchor" id="future"/>

To make the analysis more accurate, the Location distribution should be examined for patterns. I expect there to be more "boat-friendly" locations where owning a boat is close to a necessity. More generally, revisiting the dataset is a good way to find new patterns.

Next, to see which questions would increase the sign-up rate, it would be useful to run controlled experiments such as A/B tests.
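
As a rough illustration of what such a test could look like, the sketch below compares the sign-up completion rates of two survey versions with a two-proportion z-test. The counts are made up, the function name `two_proportion_z_test` is hypothetical, and the choice of this particular test is an assumption; no experiment has been run as part of this analysis.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: completed sign-ups out of visitors shown each survey version.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```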

Finally, explore dimensionality reduction techniques and models that are suited to datasets with a large number of categorical features.
