![title](media/DataRobot.png)

### Building a Predictive Model Factory

#### Definition

A model factory, in the context of data science, is a system or set of procedures that automatically generates predictive models with little or no human intervention. Model factories can have multiple layers of complexity, which we can call modules. One module might train models, while other modules might deploy or retrain them.
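
To make the definition concrete, here is a minimal, library-free sketch of the training module of a model factory. The `train_mean_model` helper and the toy rows are hypothetical and only stand in for a real training step; the point is the pattern of fitting one model per segment.

```python
from collections import defaultdict
from statistics import mean


def train_mean_model(targets):
    """Toy 'model': always predicts the mean of its training targets."""
    avg = mean(targets)
    return lambda: avg


def build_model_factory(rows, segment_key, target_key):
    """Group rows by segment and train one model per segment."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[segment_key]].append(row[target_key])
    return {segment: train_mean_model(targets) for segment, targets in grouped.items()}


# Hypothetical toy data: sales records for two SKUs
rows = [
    {"sku": "A", "sales": 10},
    {"sku": "A", "sales": 14},
    {"sku": "B", "sales": 100},
]
models = build_model_factory(rows, segment_key="sku", target_key="sales")
print({sku: model() for sku, model in models.items()})
```

In a real factory the toy training function would be replaced by an actual modeling step, and further modules would handle deployment and retraining.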

### How would you tackle this?

#### - Consider the scenario where you have 20,000 SKUs and you need to forecast sales for each one of them.
#### - Consider the scenario where you have multiple types of customers and you are trying to predict churners.

- Can one model handle the high dimensionality that comes with these problems?
- Is a single model family enough?
- Is one preprocessing method enough?



### Creating a DataRobot Model Factory

In short:
- Use DataRobot to build a single project on the readmissions dataset.
- Find the best model for this project.
- Use DataRobot to build multiple projects, one per admission type ID.
- Find the best model for each of the sub-projects.
- Make the best models ready for deployment.

#### Import Libraries

In [None]:
import datarobot as dr #Requires version >2.19
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

#### Define Functions
The functions below plot the ROC curve and the feature impact of a model.

In [None]:
def plot_roc_curve(datarobot_model):
    """This function plots a roc curve.
    Input:
        datarobot_model: <Datarobot Model object>
    """
    roc = datarobot_model.get_roc_curve('crossValidation')
    roc_df = pd.DataFrame(roc.roc_points)
    auc_score = datarobot_model.metrics['AUC']['crossValidation']
    plt.plot(roc_df['false_positive_rate'], roc_df['true_positive_rate'], 'b', label = 'AUC = %0.2f' %auc_score)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

def plot_feature_impact(datarobot_model, title=None):
    """This function plots feature impact
    Input:
        datarobot_model: <Datarobot Model object>
        title : <string> --> title of graph
    """
    #Get feature impact
    feature_impacts = datarobot_model.get_or_request_feature_impact()

    #Sort feature impact based on normalised impact
    feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)

    fi_df = pd.DataFrame(feature_impacts) #Save feature impact in pandas dataframe
    fig, ax = plt.subplots(figsize=(14,5))
    b = sns.barplot(x="featureName", y="impactNormalized", data=fi_df[0:5], color="b")
    b.axes.set_title('Feature Impact' if not title else title,fontsize=20)

In [None]:
#Import dataset
df = pd.read_excel('data/10k_diabetes.xlsx')

In [None]:
df.head()

#### Connect to DataRobot
Connect to DataRobot using your credentials

In [None]:
endpoint = 'YOUR_DATAROBOT_HOST'
api_token = 'YOUR_API_KEY'
dr.Client(token=api_token, endpoint=endpoint)

#### Initiate DataRobot project for all patients

In [None]:
original_proj = dr.Project.start(df,                                       #Pandas dataframe with data
                                project_name = 'Readmissions',             # Name of the project
                                target = 'readmitted',                     #Target of the project
                                metric = 'LogLoss',                        #Optimization metric (Default is LogLoss anyway)
                                worker_count = -1)                         #Number of workers to use (-1 means every worker available)

log = original_proj.wait_for_autopilot() #Wait for autopilot to finish

#### Get best model from original project

In [None]:
#Pick best model
best_model = original_proj.get_models()[0]

print(best_model) #Print best model's name
best_model.metrics['LogLoss']['crossValidation'] #Print crossValidation score

#### Visualize the ROC Curve

In [None]:
plot_roc_curve(best_model)

#### Plotting Feature Impact

In [None]:
plot_feature_impact(best_model)

#### Making a better model
Admission type can be used as a splitting point in order to create multiple projects.

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
c = sns.countplot(x="admission_type_id",data=df)

#### Creating a mini model factory
Using a for loop to automatically create multiple projects!

In [None]:
projects = {} #To save projects

#Create one project for each admission type
for value in df['admission_type_id'].unique():
    try:
        temp_project = dr.Project.start(df.loc[df['admission_type_id'] == value],
                                    project_name = 'Readmission_%s'%value,
                                    target = 'readmitted',
                                    metric = 'LogLoss',
                                    worker_count = 10)
        projects[value] = temp_project

    except Exception as exc: #Catching the case when dataset has fewer than 20 rows.
        print('Skipping admission type id %s: %s' % (value, exc))
#Wait for all autopilots to finish
for key in projects:
    log = projects[key].wait_for_autopilot()
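
Instead of relying on the exception handler, small segments can be filtered out before any project is created. The sketch below is library-free; the `MIN_ROWS = 20` threshold is taken from the comment in the cell above, and `eligible_segments` is a hypothetical helper name.

```python
from collections import Counter

MIN_ROWS = 20  # Assumption: minimum rows per project, per the comment above


def eligible_segments(segment_values, min_rows=MIN_ROWS):
    """Return the segment values that have at least `min_rows` rows."""
    counts = Counter(segment_values)
    return sorted(value for value, n in counts.items() if n >= min_rows)


# Hypothetical column of admission type ids
admission_type_ids = [1] * 50 + [2] * 25 + [3] * 5
print(eligible_segments(admission_type_ids))  # [1, 2]
```

With the notebook's dataframe, the same helper could be called as `eligible_segments(df['admission_type_id'])`, assuming the column has no missing values.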

#### Getting best model for each admission type

In [None]:
best_models = {} #To save models
for key in projects:
    best_models[key] = projects[key].get_models()[0]
    print('--------------------------------')
    print('Best model for admission type id: %s' %key)
    print(best_models[key])
    print(best_models[key].metrics['LogLoss']['crossValidation'])
    print('--------------------------------')

##### Even though the accuracy changes might be insignificant for this dataset, a model factory can produce measurable value in cases where segmenting the data makes sense. Furthermore, this concept becomes more important as the cardinality of your data grows.
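
One way to put a number on that comparison is to combine the per-segment scores into a single row-weighted LogLoss and set it next to the score of the single project. The sketch below uses made-up scores and row counts purely for illustration; `weighted_logloss` is a hypothetical helper, and the comparison is only approximate because segments that were skipped are not included.

```python
def weighted_logloss(scores, row_counts):
    """Row-weighted average of per-segment LogLoss values."""
    total_rows = sum(row_counts[key] for key in scores)
    return sum(scores[key] * row_counts[key] for key in scores) / total_rows


# Made-up numbers for illustration only
segment_scores = {1: 0.60, 2: 0.70}
segment_rows = {1: 300, 2: 100}
combined = weighted_logloss(segment_scores, segment_rows)
print(round(combined, 3))  # 0.625
```

In the notebook, the scores could be read from `best_models[key].metrics['LogLoss']['crossValidation']` and the row counts from the `admission_type_id` column of `df`.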

#### It's not just about predictive performance...
We also have differences in feature impact which could give actionable insights.

In [None]:
for key in projects:
    plot_feature_impact(best_models[key], title ='Feature Impact for admission type id: %s' %key)

#### Deploying the models
Deploy the models as a REST API. After that, we can make HTTP requests and get predictions back by using the deployment ID.

In [None]:
prediction_server = dr.PredictionServer.list()[0]

deployments = {} #To save deployments
for key in best_models:
    deployments[key] = dr.Deployment.create_from_learning_model(
        best_models[key].id,
        label='Readmissions_admission_type: %s' %key,
        description='Test deployment',
        default_prediction_server_id=prediction_server.id)
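
Once each admission type has its own deployment, scoring data has to be routed to the matching deployment. A minimal, library-free sketch of that routing step follows; the deployment ids are placeholders, and `route_rows` is a hypothetical helper rather than part of the DataRobot client.

```python
from collections import defaultdict


def route_rows(rows, deployment_ids, segment_key="admission_type_id"):
    """Group rows by the deployment that should score them.

    Rows whose segment has no deployment are collected under None.
    """
    batches = defaultdict(list)
    for row in rows:
        batches[deployment_ids.get(row[segment_key])].append(row)
    return dict(batches)


# Placeholder ids; real ones would come from the deployments created above
deployment_ids = {1: "DEPLOYMENT_ID_FOR_TYPE_1", 2: "DEPLOYMENT_ID_FOR_TYPE_2"}
rows = [
    {"admission_type_id": 1, "age": 54},
    {"admission_type_id": 2, "age": 71},
    {"admission_type_id": 9, "age": 33},  # no dedicated deployment
]
for deployment_id, batch in route_rows(rows, deployment_ids).items():
    print(deployment_id, len(batch))
```

Rows that end up under `None` could, for example, fall back to the model from the single all-patients project.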

#### Through the API, the sky is the limit when it comes to what you can do: 

- You could monitor service performance (also available via UI)
- You could monitor accuracy performance (also available via UI)
- You could retrain and update models (also available via UI)

You could create a model factory that does all of the above based on rules you set and needs minimal human intervention.
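
As an illustration of such a rule, the library-free sketch below flags segments whose monitored LogLoss has drifted above a baseline by more than a tolerance. The function name, the tolerance and all numbers are hypothetical; in practice the scores would come from whatever accuracy monitoring is in place.

```python
def segments_to_retrain(current_scores, baseline_scores, tolerance=0.05):
    """Flag segments whose LogLoss is worse than baseline by more than `tolerance`."""
    flagged = []
    for segment, baseline in baseline_scores.items():
        current = current_scores.get(segment)
        if current is not None and current - baseline > tolerance:
            flagged.append(segment)
    return flagged


# Made-up monitoring numbers for illustration only
baseline = {1: 0.60, 2: 0.65, 3: 0.62}
current = {1: 0.61, 2: 0.75, 3: 0.70}
print(segments_to_retrain(current, baseline))  # [2, 3]
```

The flagged segments would then be handed to the retraining module, while the others keep their current deployments.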