# Actors


In this tutorial, we demonstrate how getML can be applied to relational movie data. Using the IMDb database with about 800,000 actors, our goal is to predict the gender of an actor from the roles they have played.

getML's built-in hyperparameter tuning routines can improve the results further. Because tuning takes a long time, it is optional in this notebook.

Summary:

- Prediction type: __Classification model__
- Domain: __Entertainment__
- Prediction target: __The gender of an actor__ 
- Population size: __817718__

_Author: Dr. Patrick Urbanke_

# Background

The data set is the IMDb movie database (`imdb_ijs`), hosted in the relational learning repository at relational.fit.cvut.cz. We use four of its tables: about 818,000 actors, 3.4 million roles, 388,000 movies and 395,000 movie-genre assignments. Each role links an actor to a movie, and each movie can be assigned to several genres.

The goal is to predict whether an actor is female, using only the roles they have played and the movies and genres those roles belong to.

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [27]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

getml.engine.set_project('actors')


Connected to project 'actors'


Tuning is effective at improving our results, but it takes quite a long time, so we want to make it optional:

In [28]:
ALLOW_TUNING = True

## 1. Loading data

### 1.1 Download from source

We begin by connecting to the MariaDB database of the relational learning repository, from which we then download the tables:

In [29]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="imdb_ijs",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(conn_id='default', dbname='imdb_ijs', dialect='mysql', 
           host='relational.fit.cvut.cz', port=3306)

In [30]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [31]:
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")

In [32]:
actors

| Name | id | target | first_name | last_name | gender |
| --- | --- | --- | --- | --- | --- |
| Role | join_key | target | unused_string | unused_string | unused_string |
| 0 | 2 | 0 | Michael | 'babeepower' Viera | M |
| 1 | 3 | 0 | Eloy | 'Chincheta' | M |
| 2 | 4 | 0 | Dieguito | 'El Cigala' | M |
| 3 | 5 | 0 | Antonio | 'El de Chipiona' | M |
| 4 | 6 | 0 | José | 'El Francés' | M |
| ... | ... | ... | ... | ... | ... |
| 817713 | 845461 | 1 | Herdís | Þorvaldsdóttir | F |
| 817714 | 845462 | 1 | Katla Margrét | Þorvaldsdóttir | F |
| 817715 | 845463 | 1 | Lilja Nótt | Þórarinsdóttir | F |
| 817716 | 845464 | 1 | Hólmfríður | Þórhallsdóttir | F |


In [33]:
roles

| Name | actor_id | movie_id | role |
| --- | --- | --- | --- |
| Role | join_key | join_key | unused_string |
| 0 | 2 | 280088 | Stevie |
| 1 | 2 | 396232 | Various/lyricist |
| 2 | 3 | 376687 | Gitano 1 |
| 3 | 4 | 336265 | El Cigala |
| 4 | 5 | 135644 | Himself |
| ... | ... | ... | ... |
| 3431961 | 845461 | 137097 | Kata |
| 3431962 | 845462 | 208838 | Magga |
| 3431963 | 845463 | 870 | Gunna |
| 3431964 | 845464 | 378123 | Gudrun |


In [34]:
movies

| Name | id | year | rank | name |
| --- | --- | --- | --- | --- |
| Role | join_key | numerical | numerical | unused_string |
| 0 | 0 | 2002 |  | #28 |
| 1 | 1 | 2000 |  | #7 Train: An Immigrant Journey, The |
| 2 | 2 | 1971 | 6.4 | \$ |
| 3 | 3 | 1913 |  | \$1,000 Reward |
| 4 | 4 | 1915 |  | \$1,000 Reward |
| ... | ... | ... | ... | ... |
| 388264 | 412316 | 1991 |  | "zem blch krlu" |
| 388265 | 412317 | 1995 |  | "rgammk" |
| 388266 | 412318 | 2002 |  | "zgnm Leyla" |
| 388267 | 412319 | 1983 |  | " Istanbul" |


In [35]:
movies_genres

| Name | movie_id | genre |
| --- | --- | --- |
| Role | join_key | categorical |
| 0 | 1 | Documentary |
| 1 | 1 | Short |
| 2 | 2 | Comedy |
| 3 | 2 | Crime |
| 4 | 5 | Western |
| ... | ... | ... |
| 395114 | 378612 | Adventure |
| 395115 | 378612 | Drama |
| 395116 | 378613 | Comedy |
| 395117 | 378613 | Drama |


### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [36]:
actors["target"] = (actors.gender == 'F').as_num()

In [37]:
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)

In [38]:
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)

In [39]:
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)

In [40]:
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)

The target is 1 for female actors and 0 otherwise. The columns *first_name*, *last_name* and *gender* keep the role *unused_string*, so neither the names nor the original gender column are available to the feature learner.

Finally, we randomly split the actors into a training set (70%), a validation set (15%) and a test set (15%):

In [41]:
random = actors.random()

is_training = (random < 0.7)
is_validation = (~is_training & (random < 0.85))
is_test = (~is_training & ~is_validation)

data_train = actors.where("data_train", is_training)
data_validation = actors.where("data_validation", is_validation)
data_test = actors.where("data_test", is_test)
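
The split assigns every actor to exactly one subset by comparing a single uniform random number against the thresholds 0.7 and 0.85. The following is a minimal pure-Python sketch of that logic, independent of getML; the helper name `assign_split` and the toy sample size are assumptions made for illustration.

```python
from collections import Counter
from random import Random


def assign_split(u):
    """Map a uniform random number u in [0, 1) to a subset name."""
    if u < 0.7:
        return "train"
    if u < 0.85:
        return "validation"
    return "test"


rng = Random(4367)  # fixed seed so the sketch is reproducible
labels = [assign_split(rng.random()) for _ in range(100_000)]

# Share of rows that ended up in each subset (roughly 0.70 / 0.15 / 0.15).
shares = {name: count / len(labels) for name, count in Counter(labels).items()}
print(shares)
```

Because the three conditions are mutually exclusive and cover the whole unit interval, no row can land in two subsets or be left out.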

## 2. Predictive modelling

We loaded the data, defined the roles and split the population table. Next, we create a getML pipeline for relational learning.

### 2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In our case, there are three joins we are interested in:

1) We join *actors* to *roles* to find all roles an actor has played.

2) We join *roles* to *movies* to look up the movie each role belongs to. Because every role belongs to exactly one movie, this is a many-to-one relationship.

3) We join *movies* to *movie_genres* to find all genres a movie has been assigned to.

In [42]:
actors_ph = getml.data.Placeholder('actors')
roles_ph = getml.data.Placeholder('roles')
movies_ph = getml.data.Placeholder('movies')
movie_genres_ph = getml.data.Placeholder('movie_genres')

actors_ph.join(
    roles_ph,
    join_key='id',
    other_join_key='actor_id'
)

roles_ph.join(
    movies_ph,
    join_key='movie_id',
    other_join_key='id',
    relationship=getml.data.relationship.many_to_one
)

movies_ph.join(
    movie_genres_ph,
    join_key='id',
    other_join_key='movie_id'
)

actors_ph
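
To make the join chain more tangible, here is a small hand-written sketch that walks from actors over roles and movies to genres and aggregates the result per actor. It uses plain Python on made-up toy rows (all `toy_*` names and values are hypothetical) and only illustrates the kind of aggregation a relational feature could express; it is not what Relboost actually learns.

```python
from collections import Counter, defaultdict

# Tiny made-up tables that mimic the schema of the real data frames.
toy_roles = [
    {"actor_id": 2, "movie_id": 10},
    {"actor_id": 2, "movie_id": 11},
    {"actor_id": 3, "movie_id": 11},
]
toy_movies = {10: {"year": 2002}, 11: {"year": 1971}}  # keyed by movie id
toy_movies_genres = [
    {"movie_id": 10, "genre": "Comedy"},
    {"movie_id": 11, "genre": "Comedy"},
    {"movie_id": 11, "genre": "Crime"},
]

# movies -> movies_genres is one-to-many: collect all genres per movie.
genres_by_movie = defaultdict(list)
for row in toy_movies_genres:
    genres_by_movie[row["movie_id"]].append(row["genre"])

# actors -> roles is one-to-many, roles -> movies is many-to-one.
genre_counts = defaultdict(Counter)  # actor id -> genre -> number of roles
first_year = {}                      # actor id -> year of earliest movie
for role in toy_roles:
    actor, movie = role["actor_id"], role["movie_id"]
    genre_counts[actor].update(genres_by_movie[movie])
    year = toy_movies[movie]["year"]
    first_year[actor] = min(year, first_year.get(actor, year))

print(dict(genre_counts[2]))
print(first_year)
```

Each actor ends up with one row of aggregated values, which is the shape a predictor needs.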

### 2.2 getML pipeline

__Set up the feature learner & predictor__

We use Relboost as our feature learner. Since predicting the gender of an actor is a classification problem, we train it with the cross-entropy loss. The learned features are then passed to an XGBoost classifier.

In [43]:
relboost = getml.feature_learning.RelboostModel(
    num_features=10,
    num_subfeatures=10,
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    seed=4367,
    num_threads=1
)

predictor = getml.predictors.XGBoostClassifier()
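
For a binary target $y$ and a predicted probability $p$, the cross-entropy loss of one sample is $-[y \log p + (1 - y) \log(1 - p)]$. The sketch below computes its mean over a few samples in plain Python. The function name and the clipping constant `eps` are assumptions for illustration, and this is not getML's internal implementation.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to the open interval (0, 1)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Confident and correct predictions give a lower loss than a coin flip.
confident = cross_entropy([1, 0], [0.9, 0.1])
uninformed = cross_entropy([1, 0], [0.5, 0.5])
print(confident, uninformed)
```

An uninformed prediction of 0.5 for every sample yields a loss of $\log 2$, which is a useful baseline when reading the scores.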

__Build the pipeline__

In [44]:
pipe = getml.pipeline.Pipeline(
    tags=['relboost'],
    population=actors_ph,
    peripheral=[roles_ph, movies_ph, movie_genres_ph],
    feature_learners=[relboost],
    predictors=[predictor]
)

### 2.3 Model training

In [45]:
pipe.check(data_train, {"roles": roles, "movies": movies, "movie_genres": movies_genres})

Checking data model...


INFO [JOIN KEYS NOT FOUND]: When joining the composite data frame 'roles'-'movies' that has been created by many-to-one joins or one-to-one joins and  data frame 'movie_genres' over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'the composite data frame 'roles'-'movies' that has been created by many-to-one joins or one-to-one joins'. You might want to double-check your join keys.
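
The message reports the share of entries on the left-hand side of the join that have no partner on the right-hand side. This can be expected when some movies have no genre assigned, so it is not necessarily an error. A minimal sketch of how such a share can be computed, using a hypothetical helper and made-up keys:

```python
def share_unmatched(left_keys, right_keys):
    """Fraction of entries in left_keys with no partner in right_keys."""
    known = set(right_keys)
    missing = sum(1 for key in left_keys if key not in known)
    return missing / len(left_keys)


# Made-up toy data: movie ids referenced by roles, and movie ids that
# have at least one genre entry.
role_movie_ids = [1, 1, 2, 3, 4]
genre_movie_ids = [1, 1, 2, 2, 5]

unmatched = share_unmatched(role_movie_ids, genre_movie_ids)
print(unmatched)  # 0.4, because movie ids 3 and 4 have no genre entry
```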


In [None]:
pipe.fit(data_train, {"roles": roles, "movies": movies, "movie_genres": movies_genres})

Checking data model...


INFO [JOIN KEYS NOT FOUND]: When joining the composite data frame 'roles'-'movies' that has been created by many-to-one joins or one-to-one joins and  data frame 'movie_genres' over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'the composite data frame 'roles'-'movies' that has been created by many-to-one joins or one-to-one joins'. You might want to double-check your join keys.



Relboost: Training subfeatures...

Relboost: Training subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Training features...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

### 2.4 Model evaluation

In [None]:
in_sample = pipe.score(data_train, {"roles": roles, "movies": movies, "movie_genres": movies_genres})

out_of_sample = pipe.score(data_test, {"roles": roles, "movies": movies, "movie_genres": movies_genres})

pipe.scores
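
For orientation, the sketch below shows how two common classification metrics, accuracy and AUC, can be computed from labels and predicted probabilities. We assume here that the reported scores include metrics of this kind; the functions are hand-written illustrations on made-up numbers, not getML's implementation.

```python
def accuracy(y_true, p_pred, threshold=0.5):
    """Share of samples whose thresholded prediction equals the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def auc(y_true, p_pred):
    """Probability that a random positive is scored above a random negative."""
    positives = [p for y, p in zip(y_true, p_pred) if y == 1]
    negatives = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.4, 0.35, 0.2]
print(accuracy(y_true, p_pred), auc(y_true, p_pred))
```

Comparing the in-sample and out-of-sample values of such metrics is what reveals overfitting: a large gap suggests the pipeline has memorized the training set.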

### 2.5 Studying features

__Feature correlations__

We want to analyze how the features are correlated with the target variable.

In [None]:
names, correlations = pipe.features.correlations()

plt.subplots(figsize=(20, 10))

plt.bar(names, correlations, color='#6829c2')

plt.title('Feature Correlations')
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
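
As a reminder of what a correlation with the target measures, here is a minimal sketch of the Pearson correlation coefficient in plain Python. We assume the reported values are correlations of this kind; the function and the toy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


feature = [0.1, 0.4, 0.35, 0.8]  # made-up feature values
target = [0, 0, 1, 1]            # made-up binary target
r = pearson(feature, target)
print(r)
```

A value close to 1 or -1 means the feature alone separates the classes well, while a value near 0 means it carries little linear information about the target.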

In [None]:
pipe.features.to_sql()

__Feature importances__
 
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.

In [None]:
names, importances = pipe.features.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Feature Importances')
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

most_important = names[0]
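
The normalization step can be sketched as follows: raw per-feature gains are divided by their total so that they add up to 100%, and the features are then sorted by their share. The gain values and the helper name are made up for illustration.

```python
def normalize_importances(raw_gains):
    """Scale non-negative gains to sum to 1 and sort them descending."""
    total = sum(raw_gains.values())
    shares = {name: gain / total for name, gain in raw_gains.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


raw_gains = {"feature_1": 6.0, "feature_2": 1.5, "feature_3": 2.5}  # made up
ranked = normalize_importances(raw_gains)
print(ranked)
```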

__Column importances__

Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.

The resulting plot shows how much each column contributes to the predictive accuracy.

In [None]:
names, importances = pipe.columns.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Column importances')
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

most_important = names[0]

__Transpiling the learned features__

We can also transpile the learned features to SQLite3 code. We want to show the two most important features, so we call the `.features.importances()` method again. The names it returns are already sorted by importance.

In [None]:
names, _ = pipe.features.importances()

pipe.features.to_sql()[names[0]]

In [None]:
names, _ = pipe.features.importances()

pipe.features.to_sql()[names[1]]
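
To give an idea of how a feature written in SQL can be executed outside of getML, the sketch below runs a simple hand-written aggregation in an in-memory SQLite database using Python's built-in `sqlite3` module. The query, the feature name and the toy rows are assumptions for illustration; the code generated by getML will look different.

```python
import sqlite3

sqlite_conn = sqlite3.connect(":memory:")
sqlite_conn.executescript(
    """
    CREATE TABLE actors (id INTEGER);
    CREATE TABLE roles (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO actors VALUES (2), (3), (4);
    INSERT INTO roles VALUES (2, 280088), (2, 396232), (3, 376687);
    """
)

# One feature value per actor: the number of roles joined over the key.
rows = sqlite_conn.execute(
    """
    SELECT t1.id, COUNT(t2.movie_id) AS feature_count_roles
    FROM actors t1
    LEFT JOIN roles t2 ON t1.id = t2.actor_id
    GROUP BY t1.id
    ORDER BY t1.id;
    """
).fetchall()
sqlite_conn.close()

print(rows)  # [(2, 2), (3, 1), (4, 0)]
```

The `LEFT JOIN` keeps actors without any role in the result, so every row of the population table receives a feature value.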

## 3. Conclusion

In this notebook we have demonstrated how getML can be applied to relational movie data. In particular, we have seen how Relboost can learn features from an actor's roles, movies and genres in order to predict the actor's gender.

# Next Steps

This tutorial went through the basics of applying getML to relational data. If you want to learn more about getML, here are some additional tutorials and articles that will help you:

__Tutorials:__
* [Loan default prediction: Introduction to relational learning](loans_demo.ipynb)
* [Occupancy detection: A multivariate time series example](occupancy_demo.ipynb)  
* [Expenditure categorization: Why relational learning matters](consumer_expenditures_demo.ipynb)
* [Disease lethality prediction: Feature engineering and the curse of dimensionality](atherosclerosis_demo.ipynb)
* [Traffic volume prediction: Feature engineering on multivariate time series](interstate94_demo.ipynb)
* [Air pollution prediction: Why feature learning outperforms brute-force approaches](air_pollution_demo.ipynb) 


__User Guides__ (from our [documentation](https://docs.getml.com/latest/)):
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)


# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.