# Assignment - Data Science Intern position @ Dott

## Task
### Is the distance estimate better improved through a **heuristic** or a **model**?

## Heuristic
These are some questions I formulated that could influence my choice:
- Could variables or metrics fluctuate widely, for example, across:
  - Different cities?
  - Different vehicle types?
- Can a heuristic capture and account for deviations unique to specific market regions?
  
More importantly:
- Can the heuristic evolve to fit changing product and company needs?
- What if a **KPI is added**, one that differs across markets?
- ***The distance estimate varies across vehicle types***, so is it a good idea to hard-code a heuristic?
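
One way to keep a heuristic from being hard-coded is to move its parameters into a lookup table keyed by market and vehicle type. The sketch below is only an illustration of that idea: the city names, the factor values, and the function names are all hypothetical placeholders, not values derived from the dataset.

```python
# Hypothetical per-market parameters: every number here is a placeholder,
# not a value measured from real ride data.
DEFAULT_FACTOR = 75.0
FACTORS = {
    ("city_a", "scooter"): 80.0,
    ("city_a", "bike"): 70.0,
    ("city_b", "scooter"): 85.0,
}


def lookup_factor(city, vehicle_type):
    """Return the factor for a (city, vehicle type) pair, or the default."""
    return FACTORS.get((city, vehicle_type), DEFAULT_FACTOR)


def estimate_distance(city, vehicle_type, battery_level):
    """Scale the battery level by the market-specific factor."""
    return battery_level * lookup_factor(city, vehicle_type)


print(estimate_distance("city_a", "bike", 50.0))     # known pair
print(estimate_distance("city_c", "scooter", 50.0))  # falls back to the default
```

Adding a market then means adding a table entry rather than editing the rule itself, although every entry still has to be tuned by hand.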

### Even a modest change, such as a shift in scooter demand, can have a major effect on heuristic performance

### One **pro** of the heuristic solution is that it offers a ***quick*** answer to a planning and scheduling problem, though not an ***optimal*** one

## Model
These are the **pros** of using a model:
- Can converge, **over time**, on an optimal solution
- Can outperform heuristics and so maximize operational efficiency
- Flexibility: parameters can be adjusted to accommodate changing goals and constraints
- Room to add complexity, which can also be a con (overfitting, harder deployment to production)

### Of course, the quality and quantity of data are an important underlying factor. If a model is trained on data from a different problem than the use case, it will likely underperform the heuristic


## Which is the right approach?
### In many cases, a complementary approach between model optimization and heuristics could be the most effective solution

### My approach: A complementary solution, or a basic ARIMA/Random Forest Regressor Model 
### Model Selection 
Why ARIMA?
- Its seasonal variant (SARIMA) takes seasonality into account, which matters from a product perspective: rides can boom or dip during summer, for example
- Lagged values and lagged forecast errors are built into the model, so less manual lag-feature engineering is needed. This is useful when the previous entry's error should inform the next estimate, and it could help dampen the effect of genuine bugs/outliers in the dataset
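
To make the lag idea concrete, the sketch below builds a one-step lag and the corresponding error by hand with pandas. It is only an illustration of what "lag features" means. The ride counts are synthetic and the column names `rides`, `lag_1`, and `naive_error` are made up for this example.

```python
import pandas as pd

# Synthetic daily ride counts, for illustration only.
rides = pd.DataFrame(
    {"rides": [120, 135, 150, 90, 160, 170, 155]},
    index=pd.date_range("2023-06-01", periods=7, freq="D"),
)

# lag_1 is the previous day's value. The first row has no predecessor,
# so its lag is NaN.
rides["lag_1"] = rides["rides"].shift(1)

# Error of a naive forecast that simply repeats yesterday's value.
rides["naive_error"] = rides["rides"] - rides["lag_1"]

print(rides)
```

An ARIMA-type model estimates weights on terms like these internally, whereas a tree-based model would need them supplied as explicit columns.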
 
Why Random Forest?
- Could capture **non-linearity**. Relationships between **battery levels, vehicle types, and ride usage across cities** could be non-linear
- Noise-resistant: averaging many trees makes it less sensitive to noise than a single decision tree, so it is a stable choice
- Generalization: tends to handle unseen data well, with a lower risk of overfitting than a single deep tree


### How?
### Benchmark a Model with a Heuristic
- Use a heuristic to establish a baseline
  - Product-wise, in a real-time environment, this baseline could even be shipped as a first solution
- Establishing a **baseline to beat** could also lead to a better integrated solution, one that could be trained against an existing expectation of how large or small a KPI should be
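
The benchmarking idea can be sketched as follows: score a fixed rule and a fitted model on the same held-out rows and compare their errors. This is a minimal sketch on synthetic data, not the ride dataset. The data-generating formula and the `METRES_PER_PERCENT` constant are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: battery level (%) and a noisy, non-linear distance.
battery = rng.uniform(5, 100, size=500)
distance = 300 * np.sqrt(battery) + rng.normal(0, 100, size=500)

X = battery.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, distance, test_size=0.2, random_state=42
)

# Baseline heuristic: a fixed number of metres per battery percent (assumed).
METRES_PER_PERCENT = 40.0
baseline_pred = METRES_PER_PERCENT * X_test[:, 0]

# Model to compare against the baseline.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
model_pred = model.predict(X_test)

# RMSE in metres, taken as the square root of the mean squared error.
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
model_rmse = np.sqrt(mean_squared_error(y_test, model_pred))
print(f"baseline RMSE: {baseline_rmse:.1f} m, model RMSE: {model_rmse:.1f} m")
```

Because the synthetic target is non-linear by construction, the model is expected to win here. That says nothing about how the two would compare on the real rides.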

## Why?
The added advantages of training a model on a heuristic are:
- A pre-coded heuristic carries product know-how and decision-making experience, so it already generates a **good** solution
- The model doesn't need to start from scratch. It will either improve on the heuristic or, eventually, approach optimality
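
One possible reading of "training a model on a heuristic" is residual learning: the model predicts only the correction to the heuristic, and the final estimate is the sum of the two. The sketch below assumes that reading. The data is synthetic, and `heuristic`, `correction_model`, and `combined_estimate` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the ride dataset).
battery = rng.uniform(5, 100, size=400)
distance = 300 * np.sqrt(battery) + rng.normal(0, 50, size=400)
X = battery.reshape(-1, 1)


def heuristic(battery_level):
    """Hypothetical hand-written rule: fixed metres per battery percent."""
    return 40.0 * battery_level


# Train the model on what the heuristic gets wrong, not on the raw target.
residual = distance - heuristic(battery)
correction_model = RandomForestRegressor(n_estimators=50, random_state=0)
correction_model.fit(X, residual)


def combined_estimate(battery_level):
    """Heuristic estimate plus the learned correction."""
    level = np.asarray(battery_level, dtype=float)
    return heuristic(level) + correction_model.predict(level.reshape(-1, 1))


print(combined_estimate([20.0, 80.0]))
```

If the correction model is unavailable, the heuristic alone still gives an answer, which is one reason this split can be attractive in production.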

In [49]:
# libraries
import numpy as np
import pandas as pd
# hold out a test set to measure performance on unseen data
from sklearn.model_selection import train_test_split
# model
from sklearn.ensemble import RandomForestRegressor
# comparison, estimations
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# requires the statsmodels package (pip install statsmodels)
from statsmodels.tsa.statespace.sarimax import SARIMAX

ModuleNotFoundError: No module named 'statsmodels'

In [28]:
#load dataset
data=pd.read_csv('/Users/vikramvenkat/Downloads/dott_dsia_rides.csv')

### Improved heuristic
This was my idea for a potentially better heuristic

The current one doesn't incorporate factors that could influence the estimation

Hence, I implement a heuristic that accounts for battery level and vehicle type, while considering the efficiency of different types of vehicles

### Battery Factor
The battery factor helps in determining the efficiency of the vehicle.
The values are assumptions based on my domain knowledge: I assume a scooter is more powerful than a bike.
They can be refined through experimentation, but parameter tuning is not the focus here.
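
As a worked example, the arithmetic of `heuristic_function` (defined in the next cell) can be traced on plain numbers. The inputs below are taken from rows visible in the printed output: a bike at battery level 38 with `battery_voltage` 36000, and a scooter whose voltage was imputed as 0. The helper name `estimate` is hypothetical.

```python
def estimate(vehicle_type, battery_level, battery_voltage):
    """Same arithmetic as heuristic_function, on plain numbers."""
    base = 70 if vehicle_type == "bike" else 80
    battery_factor = base * battery_voltage / 3600
    return battery_level * battery_factor


# Bike at battery level 38 with battery_voltage 36000:
# factor = 70 * 36000 / 3600 = 700, estimate = 38 * 700 = 26600
print(estimate("bike", 38.0, 36000.0))

# A voltage of 0 (the value imputed for missing readings) forces the
# estimate to 0, whatever the battery level.
print(estimate("scooter", 89.0, 0.0))
```

The second call shows why zero-imputation matters: every row with a missing voltage reading ends up with an estimated distance of 0.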


In [29]:

# The column had to be cleaned first: the heuristic did not run without it
data['battery_voltage'] = data['battery_voltage'].astype(float)


# Impute NaN values with 0
# (a regression on previous rows would probably be better; not sure that is within the assignment scope)
# Assign the result back rather than calling fillna(inplace=True) on a column selection
data['battery_voltage'] = data['battery_voltage'].fillna(0)

# Heuristic function: calculates the distance estimate for a single row
def heuristic_function(row):
    battery_factor = 70 if row['vehicle_type'] == 'bike' else 80
    battery_factor *= row['battery_voltage'] / 3600
    estimated_distance = row['main_battery_level_before_ride'] * battery_factor
    return estimated_distance

# Apply the heuristic function to the dataset
data['EstimatedDistance'] = data.apply(heuristic_function, axis=1)

# Print the input columns with the estimated distances (pandas truncates long output to the first and last rows)
print(data[['main_battery_level_before_ride', 'vehicle_type', 'battery_voltage', 'EstimatedDistance']])

       main_battery_level_before_ride vehicle_type  battery_voltage  \
0                                89.0      scooter              0.0   
1                                76.0      scooter              0.0   
2                                14.0      scooter              0.0   
3                                93.0      scooter              0.0   
4                                78.0      scooter              0.0   
...                               ...          ...              ...   
47592                            38.0         bike          36000.0   
47593                            18.0         bike          36000.0   
47594                            90.0         bike          36000.0   
47595                            40.0         bike          36000.0   
47596                            89.0         bike          36000.0   

       EstimatedDistance  
0                    0.0  
1                    0.0  
2                    0.0  
3                    0.0  
4           
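
Since imputing 0 drives the estimate to 0, a gentler option is to fill missing voltages with the median of the same vehicle type. The sketch below shows this on a tiny synthetic frame. The column names follow the dataset, but the values are made up, and whether this suits the real data is an assumption.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame: column names follow the dataset, values are made up.
df = pd.DataFrame({
    "vehicle_type": ["scooter", "scooter", "scooter", "bike", "bike"],
    "battery_voltage": [41000.0, np.nan, 39000.0, 36000.0, np.nan],
})

# Median voltage per vehicle type, aligned back to each row.
group_median = df.groupby("vehicle_type")["battery_voltage"].transform("median")

# Fill only the missing readings: observed values are left untouched.
df["battery_voltage"] = df["battery_voltage"].fillna(group_median)

print(df)
```

If a vehicle type has no voltage readings at all, its group median is also missing, so a fallback (or a heuristic that does not rely on voltage) would still be needed for those rows.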

In [34]:
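# Split data for training

# Drop rows with missing inputs or target (if any) so the model can be fitted
model_data = data.dropna(subset=['main_battery_level_before_ride', 'vehicle_type', 'total_distance_meters'])

# Feature variables and target variable
# vehicle_type holds strings such as 'bike', so one-hot encode it: the forest needs numeric input
X = pd.get_dummies(model_data[['main_battery_level_before_ride', 'vehicle_type']], columns=['vehicle_type'])
y = model_data['total_distance_meters']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [35]:
# Random forest
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)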