<a href="https://colab.research.google.com/github/adochsh/aminadoo/blob/main/task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

import phik
from phik.report import plot_correlation_matrix
from phik import report
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#!pip install phik

The key aspect of ride-hailing is **upfront pricing**, which works as follows.
*   First, it **predicts the price** of a ride **based on** the predicted distance and time. This is the price the rider sees on the phone screen before ordering.
*   Second, if **the metered price**, based on the actual distance and time, **differs** a lot **from the predicted one**, the upfront price switches to the metered price. "A lot" means by more than 20%.


---
For example, suppose the upfront price predicted for your ride is 5 euros.
If the **metered price is between 4 and 6 euros**, the rider pays 5 euros; otherwise the rider pays the metered price.
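
The rule above can be sketched as a small function. This is only an illustration: the name `charged_price` and the `tolerance` parameter are assumptions, and the band is taken as inclusive so that exactly 4 or 6 euros still pays the upfront price.

```python
def charged_price(upfront: float, metered: float, tolerance: float = 0.20) -> float:
    """Return the price the rider pays under the upfront pricing rule (hypothetical helper)."""
    # Relative deviation of the metered price from the upfront price.
    deviation = abs(metered - upfront) / upfront
    # Inside the tolerance band the rider pays the upfront price,
    # otherwise the price switches to the metered one.
    return upfront if deviation <= tolerance else metered


print(charged_price(5.0, 5.8))  # metered price inside the 4-6 euro band
print(charged_price(5.0, 6.5))  # metered price outside the band
```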


---
We would like to **improve the upfront pricing precision**. Kindly analyze the data and **identify top opportunities** for that. Could you name the top one opportunity? 



In [None]:
df = pd.read_excel('/content/drive/MyDrive/bolt/Test.csv.xlsx')
print(df.shape)
df.head(3)

In [None]:
# % of null values in columns   
round(df.isna().mean().sort_values(ascending=False)*100).head(10)

In [None]:
# Based on the null percentages above, drop the 'device_token' and 'change_reason_pricing' columns
df = df.drop(columns=['device_token', 'change_reason_pricing'])

# compare the shape with and without duplicate rows
df.drop_duplicates().shape, df.shape

In [None]:
df['prediction_price_type'].value_counts() / df.shape[0]

In 70% of the rows, the 'upfront' prediction_price_type was applied.

### Let's see how the cases where metered_price differs from upfront_price by 20% or more correlate with the other columns



In [None]:
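# split before filtering, otherwise df_next would always be empty
df_next = df[df.upfront_price.isna()].copy()  # rows without an upfront price
df = df[~df.upfront_price.isna()].copy()  # keep only rows with a non-null upfront_price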

In [None]:
# relative price difference; note that it is measured against metered_price
df['price_diff'] = abs(df['metered_price'] - df['upfront_price']) / df['metered_price']

# diff_more_20: 1 when the metered price differs from the upfront_price by 20% or more
df['diff_more_20'] = (df['price_diff'] >= 0.20) * 1

In [None]:
df.diff_more_20.value_counts() # differs by 20% in 1364 rows

In [None]:
df['diff_more_20'].mean() # differs by 20% in 40% of non null upfront prices
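
The task statement defines the 20% band around the upfront price, while the flag above divides by the metered price. The sketch below, on made-up prices, shows that the two denominators can flag different rides. The frame `toy` and the column names `flag_vs_metered` and `flag_vs_upfront` are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical prices, not taken from the dataset.
toy = pd.DataFrame({
    "upfront_price": [5.0, 5.0, 5.0],
    "metered_price": [5.5, 6.2, 3.5],
})

abs_diff = (toy["metered_price"] - toy["upfront_price"]).abs()

# Difference relative to the metered price (the denominator used in this notebook).
toy["flag_vs_metered"] = (abs_diff / toy["metered_price"] > 0.20).astype(int)
# Difference relative to the upfront price (the band described in the task statement).
toy["flag_vs_upfront"] = (abs_diff / toy["upfront_price"] > 0.20).astype(int)

print(toy)
```

If the second definition is preferred, the correlations and counts in this notebook would need to be recomputed.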

In [None]:
df_matrix = df.phik_matrix()
df_matrix[['diff_more_20']].sort_values(by=['diff_more_20'],ascending=False)\
                                                    .style.background_gradient(cmap='RdPu')

## Top opportunity 1

The column most correlated with diff_more_20 (phik = 0.429982) is **device_name**. Let's look at it in more detail.

In [None]:
df.groupby('device_name')['diff_more_20'].agg(['count','mean']).sort_values('mean',ascending=False).head(10)

In [None]:
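# first token of the device name with digits removed (raw string avoids an invalid escape warning)
df.device_name.str.split().str.get(0).str.replace(r'\d+', '', regex=True).unique()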

In [None]:
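# reduce device_name to a brand-like label: first token, digits removed
df.device_name = df.device_name.str.split().str.get(0).str.replace(r'\d+', '', regex=True)
# na=False keeps the mask boolean when device_name is missing
df.loc[df.device_name.str.contains('TECNO', na=False), 'device_name'] = 'TECNO MOBILE'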

In [None]:
df.groupby('device_name')['diff_more_20'].agg(['count','mean']).sort_values('count',ascending=False).head(15)
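
As a minimal sketch of the normalization used above, the same steps can be wrapped in a function and tried on a few made-up device names. The helper name `device_brand` and the sample values are assumptions, not rows from the dataset.

```python
import pandas as pd


def device_brand(names: pd.Series) -> pd.Series:
    """Reduce raw device names to a brand-like label (hypothetical helper)."""
    # Keep the first whitespace-separated token.
    brand = names.str.split().str.get(0)
    # Drop digits so that model numbers do not split one brand into many groups.
    brand = brand.str.replace(r"\d+", "", regex=True)
    # Collapse every TECNO variant into one label; na=False treats missing names as no match.
    return brand.mask(brand.str.contains("TECNO", na=False), "TECNO MOBILE")


sample = pd.Series(["Samsung SM-G960F", "TECNO-KB7 Spark", "HUAWEI9 P20", None])
result = device_brand(sample)
print(result.tolist())
```

Missing names stay missing, so they drop out of the `groupby` rather than raising an error.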

### Conclusion
---

The table above lists the devices associated with worse upfront pricing precision. The recommendation is to switch to a different type of mobile device.

---



## Top opportunity 2 

The second most correlated column (phik = 0.323787) is **gps_confidence**. Let's look at it in more detail.

In [None]:
df.groupby('gps_confidence')['diff_more_20'].agg(['count','mean'])\
  .sort_values('count', ascending=False).head(10)


### Conclusion

---

From the aggregations **above** we can suppose that the upfront_price precision can be improved with more reliable GPS tracking on the device.

---



## Top opportunity 3 

The third most correlated column (phik = 0.314773) is **duration**. Let's look at it in more detail.

In [None]:
df.groupby('duration')['diff_more_20'].agg(['count','mean'])\
  .sort_values('count', ascending=False).head(5)

In [None]:
# split duration into 6 equal-frequency (quantile) bins

df['duration_freq'] = pd.qcut(df['duration'], 6)
df.groupby('duration_freq')['diff_more_20'].agg(['count','mean'])\
                                    .sort_values('mean', ascending=False)
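
To make the binning step concrete, here is a small sketch on made-up values. `pd.qcut` cuts at quantiles, so each bin holds roughly the same number of rows, and the mean of the flag per bin is the share of rides that differ by 20% or more. The frame `toy` and its values are assumptions.

```python
import pandas as pd

# Hypothetical durations and flags, not taken from the dataset.
toy = pd.DataFrame({
    "duration": [60, 120, 180, 240, 300, 360, 420, 480, 540, 600, 660, 720],
    "diff_more_20": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Six quantile bins: with 12 distinct values each bin receives 2 rows.
toy["duration_bin"] = pd.qcut(toy["duration"], 6)

# observed=True lists only the bins that actually occur in the data.
summary = (
    toy.groupby("duration_bin", observed=True)["diff_more_20"]
       .agg(["count", "mean"])
)
print(summary)
```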


### Conclusion

---

Here we can conclude that upfront pricing does not work well for long rides. This can be taken into account in ML modelling.

---



## Top opportunity 4 

The next most correlated column (phik = 0.296276) is **eu_indicator**. Let's look at it in more detail.

In [None]:
df.groupby('eu_indicator')['diff_more_20'].agg(['count','mean'])


### Conclusion

---

The upfront_price precision can be improved by improving the service in non-European Union locations.

---



## Top opportunity 5 

The next most correlated column (phik = 0.224632) is **metered_price**. Let's look at it in more detail.

In [None]:
# split metered_price into 6 equal-frequency (quantile) bins

df['metered_price_freq'] = pd.qcut(df['metered_price'], 6)
df.groupby('metered_price_freq')['diff_more_20'].agg(['count','mean'])\
                                    .sort_values('count', ascending=False)


### Conclusion

---

A metered_price in the range (7940.22, 194483.52) affects upfront pricing precision significantly, so the way metered_price is calculated should be reconsidered for such ranges. We assume that it is driven mainly by two variables, distance and duration.

---



## Top opportunity 6 

The next, less correlated column (phik = 0.178249) is **driver_app_version**. Let's look at it in more detail.

In [None]:
df.groupby('driver_app_version')['diff_more_20'].agg(['count','mean'])\
  .sort_values('count',ascending=False).head(7)


### Conclusion

---

The driver_app_version also affects upfront pricing. The suggestion is to add an alert in the system that prompts the driver to update the app version.

---

