# Cobify aim

Cobify is a people-transport company that operates without a license of any kind and
uses high-end rigged cars.

Cobify has been using gasoline with a high octane rating, such as SP98, in its rigged cars to avoid delayed or premature ignition (engine knock). However, the company wonders whether it would be more profitable to use fuels that include ethanol in their formulation, such as E10, which are cheaper and offer the same octane number as the more expensive gasoline.

Through this analysis, we seek to answer the question:
##  <span style="color:blue"> What kind of fuel should Cobify use for rigged cars? </span>

# Analysis

### Cleaning

For this analysis we cleaned the [dataset provided by Cobify](https://www.kaggle.com/anderas/car-consume?select=measurements.csv) using _Pandas_. It turned out to be fairly clean.

The dataset has 12 columns and 388 rows. Each row describes a single trip: distance, consumption, average speed, temperature inside, temperature outside, special data (about the weather), the gas type used, whether the air conditioner was on or off, whether it was raining or sunny, and the fuel refilled.

The columns _refill liters_, _refill gas_ and _specials_ are mostly NaN (97%, 97% and 76% missing values respectively) and gave us no relevant information, so we dropped them.

Regarding the _temperature inside the car_, 3% of the values were null, so we replaced them with the average temperature.

We also encoded the _gasoline type_ as a numeric category.
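
The cleaning steps above could look roughly like the following sketch. It runs on a tiny made-up sample instead of the real file, and the column names (`temp_inside`, `gas_type`, `refill liters`, and so on) and the 0/1 encoding are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the column names are assumed, not taken from this README.
df = pd.DataFrame({
    "distance": [28.0, 12.0, 11.2, 12.9],
    "consume": [5.0, 4.2, 5.5, 3.9],
    "temp_inside": [21.5, np.nan, 21.5, 23.0],
    "gas_type": ["E10", "E10", "SP98", "SP98"],
    "specials": [np.nan, "rain", np.nan, np.nan],
    "refill liters": [45.0, np.nan, np.nan, np.nan],
    "refill gas": ["E10", np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
nan_share = df.isna().mean() * 100

# Drop the sparse columns, then fill the remaining gaps with the column mean.
clean = df.drop(columns=["specials", "refill liters", "refill gas"])
clean["temp_inside"] = clean["temp_inside"].fillna(clean["temp_inside"].mean())

# Encode the fuel type as a number (this particular mapping is arbitrary).
clean["gas_type"] = clean["gas_type"].map({"E10": 0, "SP98": 1})

print(nan_share.round(1))
print(clean)
```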

### Relations and Visualization

Using a correlation matrix and a heat map, we found no high correlation between variables.
![matrix](./Images/matrix.png)
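
As a rough illustration of how such a matrix can be computed, the sketch below uses `DataFrame.corr()` on a few made-up trips. The values and column names are invented, so the resulting coefficients say nothing about the real dataset.

```python
import pandas as pd

# Made-up trips; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "distance": [5.0, 12.0, 18.0, 30.0, 55.0, 80.0],
    "speed": [22, 30, 38, 45, 60, 75],
    "consume": [6.1, 5.3, 4.9, 4.6, 4.4, 4.5],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Single coefficient of interest, read from the matrix.
speed_distance = corr.loc["speed", "distance"]

print(corr.round(2))
# With seaborn installed, sns.heatmap(corr, annot=True) would draw the heat map.
```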

Despite this, we decided to take a closer look at the highest correlation, _speed - distance_:
- Distances over 50 km usually mean higher speed
- Long distances are driven more often with SP98 than with E10
![speed-distance](./Images/speed-distance.png)

We wanted to see whether the means differ by gasoline type, and we found that although higher speed tends to correlate inversely with consumption, **E10 shows a higher mean for both speed and consumption.**
In addition, SP98 shows higher mean values for temperature (especially outside), use of the air conditioner (probably related to the higher outside temperatures) and variability of the weather.

We focus the analysis on the **difference in consumption** between both gas types, starting with a general view:
- Consumption with E10 is more stable (less variable) than with SP98
- Mean consumption is slightly higher with E10
- The variability in SP98 consumption is likely caused by other factors (more outliers, and more spread out)
![consume](./Images/consume.png)
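
A comparison like the one above boils down to a mean and a spread per gas type. The sketch below shows the idea with the standard library only; the readings are made up to mimic the pattern described, not taken from the dataset.

```python
from statistics import mean, stdev

# Made-up consumption readings per gas type; not the real data.
consume = {
    "E10": [4.6, 4.9, 5.1, 5.0, 4.8],
    "SP98": [3.9, 4.4, 6.2, 4.7, 5.1],
}

# Mean and sample standard deviation for each fuel.
summary = {
    gas: {"mean": mean(values), "std": stdev(values)}
    for gas, values in consume.items()
}

for gas, stats in summary.items():
    print(f"{gas}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
```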

Analysing the relation between consumption and distance gave some insights:
- Consumption decreases with longer distances, regardless of the gasoline type
- SP98 consumption stays more stable over long distances than E10 consumption
- For distances between 40 and 55 km, E10 could be recommended
![distances](./Images/distance.png)

The analysis of speed and consumption reveals little of interest:
- Consumption is higher at lower speeds, for both gas types
![speed](./Images/speed.png)

Relationship of consumption with the temperature inside and outside the car:
- Consumption is higher with SP98 when the temperature inside the car is 23.5º
- Consumption with E10 is lower than with SP98 up to 24.5º
- There doesn't seem to be a difference between gas types related to the temperature outside the car

|Temperature inside the car | Temperature outside the car|
|:-------------------:|:-------------------:|
|![temp_inside](./Images/temp_inside.png)  |  ![temp_outside](./Images/temp_out.png)|

Related to the temperature, we studied how consumption varies with the air conditioner on and off, with the expected results:
- Without air conditioning, consumption is almost equal for both gas types (slightly higher with E10)
- With the air conditioner on, consumption increases slightly for both, and more so for SP98
![AC](./Images/ac.png)

Comparing sunny and rainy days for both types of gasoline, we see that the weather has a greater influence on trips driven with SP98:
- Without sun, consumption is almost the same for both types
- On sunny days, consumption decreases for both, especially for SP98

- Without rain, consumption is slightly higher with E10
- This small difference shrinks with rain, bringing both types closer together

|Sun | Rain |
|:-------------------:|:-------------------:|
|![sun](./Images/sun.png)|![rain](./Images/rain.png)|

### Enrichment with current gasoline prices

The Cobify team knows that the general prices are 1.38 €/liter for E10 and 1.46 €/liter for SP98.
Multiplying the mean consumption by the prices given by the Cobify team, we obtain:
- Cost of the mean consumption driving with E10: 6.81
- Cost of the mean consumption driving with SP98: 7.15


**Although the car consumes more E10 than SP98, the price difference makes E10 the more economical option.**
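
The arithmetic behind these two figures is just consumption times price per liter. As a check, the sketch below backs out the mean consumption implied by the reported costs (roughly 4.93 for E10 and 4.90 for SP98). The helper name `cost` is only illustrative.

```python
# Prices in EUR per liter, as quoted by the Cobify team.
PRICE = {"E10": 1.38, "SP98": 1.46}

# Costs reported in this README for the mean consumption of each fuel.
REPORTED_COST = {"E10": 6.81, "SP98": 7.15}


def cost(consume, price_per_liter):
    """Fuel cost for a given consumption figure and a price per liter."""
    return consume * price_per_liter


# Back out the mean consumption implied by the reported costs.
implied_consume = {gas: REPORTED_COST[gas] / PRICE[gas] for gas in PRICE}

for gas, value in implied_consume.items():
    print(f"{gas}: implied mean consumption {value:.2f}, "
          f"cost {cost(value, PRICE[gas]):.2f}")
```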

However, to make a realistic comparison and decide which fuel to use, it is useful to know the *current price* of each one in Spain.

We scrape the [clickgasoil page](https://www.clickgasoil.com/) to obtain the minimum, maximum and average price of both fuels, which are updated daily. We define a function that lets us keep this data up to date.
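
The scraping function itself is not shown here, so the following is only a sketch of the parsing step. The HTML snippet, the `price` class and the helper name `summarize_prices` are all invented; the real page layout will differ and the pattern would need adapting.

```python
import re
from statistics import mean

# Invented markup for illustration only; the real page layout will differ.
SAMPLE_HTML = """
<table>
  <tr><td>Station A</td><td class="price">1,329 €</td></tr>
  <tr><td>Station B</td><td class="price">1,389 €</td></tr>
  <tr><td>Station C</td><td class="price">1,459 €</td></tr>
</table>
"""

# Captures a number with a decimal comma or dot inside the assumed price cell.
PRICE_PATTERN = re.compile(r'class="price">\s*(\d+[.,]\d+)')


def summarize_prices(html):
    """Return the min, max and mean of the prices found in an HTML string."""
    prices = [float(p.replace(",", ".")) for p in PRICE_PATTERN.findall(html)]
    if not prices:
        raise ValueError("no prices found in the given HTML")
    return {"min": min(prices), "max": max(prices), "mean": mean(prices)}


print(summarize_prices(SAMPLE_HTML))
```

In practice the HTML string would come from something like `requests.get(url).text`, and calling the function again on a later day refreshes the figures.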

This information has allowed us to calculate the real cost of consumption when refueling today,
showing whether anything has changed and letting us switch gasoline if necessary.

> Min consume with E10  price today is 5.34   
> Min consume with SP98 price today is 5.86

> Mean consume with E10  price today is 6.8   
> Mean consume with SP98 price today is 7.58

> Max consume with E10  price today is 7.35   
> Max consume with SP98 price today is 8.11

In all cases the cost of E10 is lower.
SP98 costs roughly 10% more than E10 (between 9.7% and 11.5% in the figures above).
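
The relative difference can be checked directly from the six figures quoted above. This small sketch computes, for each scenario, how much more SP98 costs than E10; it comes out at about 9.7%, 11.5% and 10.3%.

```python
# Figures quoted above: (E10, SP98) for each price scenario.
scenarios = {
    "min": (5.34, 5.86),
    "mean": (6.80, 7.58),
    "max": (7.35, 8.11),
}


def extra_cost_pct(e10, sp98):
    """Percentage by which the SP98 figure exceeds the E10 figure."""
    return (sp98 - e10) / e10 * 100


premium = {name: extra_cost_pct(*pair) for name, pair in scenarios.items()}

for name, pct in premium.items():
    print(f"{name}: SP98 is {pct:.1f}% more expensive than E10")
```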

### Predicting Consumption

Now that we have the updated price, we would like to predict the consumption under different circumstances.
To predict fuel consumption, we fit different models and compute the metrics for each one:
- Linear Regression
- Ridge
- Lasso
- K Neighbors Regressor
- Gradient Boosting Regressor
- Decision Tree Regressor
- Random Forest Regressor

In the first fit, using the enriched dataset, the Ridge model shows a very high fit, with an R squared very close to 1 (0.998) and errors (MAE, MSE, RMSE) tending to zero.
![ridge](./Images/ridge.png)

Ridge is a method for estimating the coefficients of multiple-regression models when independent variables are highly correlated, so _we wanted to check whether the fit is this good because of the enrichment of the dataset_. To do so, we repeated the previous code with the clean (but not enriched) dataset.

This confirmed the overfitting: the enrichment columns were products of values contained in X and y, so they leaked the target into the features. We decided to continue with K Nearest Neighbours, as it has the highest R squared, even though it is more commonly used for classification than for regression. We also searched for better parameters for the gradient boosting, random forest and decision tree models.
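
To see why a column derived from the target inflates the fit, consider this minimal sketch. It builds a cost column as consumption times price on made-up trips and fits a one-variable least-squares line. The data and helper names are illustrative, and the model is plain least squares, not the Ridge model used in the analysis.

```python
from statistics import mean

PRICE = {"E10": 1.38, "SP98": 1.46}

# Made-up trips: (gas type, consumption). Not the real data.
trips = [("E10", 4.0), ("SP98", 5.5), ("E10", 3.8), ("SP98", 6.1),
         ("E10", 4.9), ("SP98", 5.0), ("E10", 4.4), ("SP98", 7.2)]

consume = [c for _, c in trips]
# "Enrichment" column built directly from the target: consumption * price.
cost = [c * PRICE[gas] for gas, c in trips]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    y_bar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_line(cost, consume)
predicted = [intercept + slope * x for x in cost]
r2 = r_squared(consume, predicted)
print(f"R2 using the leaked column: {r2:.3f}")
```

Because the two prices are close to each other, the cost column is almost a rescaled copy of the target, and the fit is close to perfect without the model having learned anything useful.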

Grid search was used to choose a set of optimal hyperparameters for those learning algorithms.
After training the models, we obtained the following results:
> **K Neighbors Regressor**    
MAE: 0.403      
MSE: 0.303    
RMSE: 0.550 
R2: 0.682   

> **Random Forest Regressor**    
MAE: 0.444     
MSE: 0.345     
RMSE: 0.587    
R2: 0.637  

> **Decision Tree Regressor**    
MAE: 0.457   
MSE: 0.361   
RMSE: 0.601   
R2: 0.621   

> **Gradient Boosting Regressor**    
MAE: 0.579   
MSE: 0.547   
RMSE: 0.740   
R2: 0.425     

With 0.68 of the variance explained and a mean absolute error of 0.4 _(we used MAE because it is robust to the outliers detected in the earlier inspection of the data)_, the *K Neighbors Regressor* is the model we have chosen to predict consumption.
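
For reference, the four metrics can be written out in a few lines. The sketch below uses made-up values with one deliberate outlier to show why MAE reacts less to it than MSE. RMSE is simply the square root of MSE, which the reported figures satisfy (for example, the square root of 0.303 is about 0.550).

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R2 for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - sum(e ** 2 for e in errors) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": r2}


# Made-up values; the last observation is an outlier the model misses.
y_true = [4.0, 5.0, 6.0, 5.0, 10.0]
y_pred = [4.5, 5.0, 5.5, 5.0, 6.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here the single outlier accounts for 80% of the absolute error but about 97% of the squared error, which is why the squared metrics are more sensitive to it.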

![knn](./Images/knn.png)
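
The grid search mentioned earlier could be set up along these lines with scikit-learn. This is a sketch on synthetic data: the parameter grid, the MAE-based scoring, the 5-fold split and the 80/20 train/test split are assumptions, not the settings actually used in this analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the cleaned features and the consumption target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 4 + 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate hyperparameters; this grid is an assumption for illustration.
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# Cross-validated search that picks the combination with the lowest MAE.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out trips.
y_pred = search.predict(X_test)
print("best params:", search.best_params_)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```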

# Conclusion
## <span style="color:blue"> We recommend switching the fuel to E10  </span>

- In all cases the cost of E10 is lower; SP98 is about 10-11% more expensive
- Even though mean consumption is slightly higher with E10, it is more stable (less affected by weather factors) than with SP98

# Technologies
- Python 3.8.5
- Pandas
- Numpy
- Seaborn
- Matplotlib
- Requests
- BeautifulSoup
- Regular Expressions
- Sklearn