<a href="https://colab.research.google.com/github/debbysonino/LamasDataHack/blob/master/DataLearn_2019_Scaffold_basefilepaulo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

&nbsp; ![alt text](https://s3.amazonaws.com/monday.com/static/svg/monday-logos/monday-footer-logo.svg)

#Model scaffold
This notebook is intended to get you up and running faster.

It has the basic scaffold of an ML model, including:
* Data loading
* Feature extraction
* Column transformation
* Training
* Evaluating
* Submitting results

###Getting our dependencies (and data!)
First we'll import the libraries we need.

In [0]:
# General DS libraries we are going to need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta

# Importing our base model
# [REDACTED ML MODEL USED]

# Imports for working with our large dataset
from sklearn.utils.random import sample_without_replacement
from sklearn.model_selection import train_test_split

# We need those for data manipulation and getting our features ready for the model
from sklearn.preprocessing import OneHotEncoder, Normalizer, Binarizer
from sklearn.compose import make_column_transformer

# These can be used to measure our model's performance
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# Ignore DataFrame assignment warnings
pd.options.mode.chained_assignment = None

In [0]:
## (when using google colab)
! pip install catboost
! pip install plotly_express

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/ef/e9/41060f73ca5dcf604f75bc871ee5ce8dcb201897640b37c95aa8b1e139c8/catboost-0.16.5-cp36-none-manylinux1_x86_64.whl (61.9MB)
     |████████████████████████████████| 61.9MB 1.7MB/s 
Installing collected packages: catboost
Successfully installed catboost-0.16.5
Collecting plotly_express
  Downloading https://files.pythonhosted.org/packages/d4/d6/8a2906f51e073a4be80cab35cfa10e7a34853e60f3ed5304ac470852a08d/plotly_express-0.4.1-py2.py3-none-any.whl
Collecting plotly>=4.1.0 (from plotly_express)
  Downloading https://files.pythonhosted.org/packages/70/19/8437e22c84083a6d5d8a3c80f4edc73c9dcbb89261d07e6bd13b48752bbd/plotly-4.1.1-py2.py3-none-any.whl (7.1MB)
     |████████████████████████████████| 7.1MB 8.5MB/s 
Installing collected packages: plotly, plotly-express
  Found existing installation: plotly 3.6.1
    Uninstalling plotly-3.6.1:
      Successfully uninstalled plotly-3.6.1
Successfully installed

In [0]:
## General
import os 
import joblib
import requests
from google_drive_downloader import GoogleDriveDownloader as gdd

## Data manipulation
import pandas as pd
import numpy as np

## Modeling
### Modeling pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
 
### Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier    
from catboost import CatBoostClassifier, Pool

## Visualization
import plotly 
import plotly.express as px
import plotly.graph_objects as go


We set a few constants to use later on for sampling and running the model.

In [0]:
#@title Model parameters { run: "auto" }
# n_neighbors = 7 #@param {type:"slider", min:1, max:30, step:1}
group_name = "Lamassim" #@param {type:"string"}
samples_num = 170000 #@param {type:"slider", min:0, max:1500000, step:10000}
n_jobs = -1 #@param {type:"slider", min:-1, max:32, step:1}
path_prefix = "https://storage.googleapis.com/mondaycom-datahack/final_sets" #@param ["https://storage.googleapis.com/mondaycom-datahack/final_sets", "https://mondaycom-datahack.s3.amazonaws.com/final_sets"] {allow-input: true}

Next we'll load all the different parts of our dataset

<br/>

_Or use my data loading [snippet](https://colab.research.google.com/drive/1_Y-sZ5eHIDlDUMuLCwfnbuJdLh0DTXmO#scrollTo=5HGlaJTEAYJu&line=23&uniqifier=1)!_

In [0]:
import os
import pandas as pd

# We define the datasets we want to load
datasets = ('accounts', 'users', 'events', 'subscriptions')
source_prefix = 'https://storage.googleapis.com/mondaycom-datahack/final_sets/'

local_dir = './datasets/datahack/'
file_prefix = 'train_'
file_suffix = ''
file_extension = 'csv'

# We create a directory for the datasets if it doesn't exist
if not os.path.exists(local_dir):
    os.makedirs(local_dir)

# For each dataset we want, we check if we already downloaded it and download it if we didn't
for dataset in datasets:
  if not os.path.isfile('{}{}{}{}.{}'.format(local_dir, file_prefix, dataset, file_suffix, file_extension)):
    !curl {source_prefix}{file_prefix}{dataset}{file_suffix}.{file_extension} --output {local_dir}{file_prefix}{dataset}{file_suffix}.{file_extension}

  # Load the datasets into a DataFrame using pandas
  globals()['{}{}'.format(file_prefix, dataset)] = pd.read_csv('{}{}{}{}.{}'.format(local_dir, file_prefix, dataset, file_suffix, file_extension), low_memory=False)

In [0]:
import os
import pandas as pd

# We define the datasets we want to load
datasets = ('accounts', 'users', 'events', 'subscriptions')
source_prefix = 'https://storage.googleapis.com/mondaycom-datahack/final_sets/'

local_dir = './datasets/datahack/'
file_prefix = 'test_'
file_suffix = ''
file_extension = 'csv'

# We create a directory for the datasets if it doesn't exist
if not os.path.exists(local_dir):
    os.makedirs(local_dir)

# For each dataset we want, we check if we already downloaded it and download it if we didn't
for dataset in datasets:
  if not os.path.isfile('{}{}{}{}.{}'.format(local_dir, file_prefix, dataset, file_suffix, file_extension)):
    !curl {source_prefix}{file_prefix}{dataset}{file_suffix}.{file_extension} --output {local_dir}{file_prefix}{dataset}{file_suffix}.{file_extension}

  # Load the datasets into a DataFrame using pandas
  globals()['{}{}'.format(file_prefix, dataset)] = pd.read_csv('{}{}{}{}.{}'.format(local_dir, file_prefix, dataset, file_suffix, file_extension), low_memory=False)

We need to add our test sets to our train sets and work on both at the same time.

We'll split them back up before training and inference.

In [0]:
accounts = train_accounts.append(test_accounts)
users = train_users.append(test_users)
events = train_events.append(test_events)
subscriptions = train_subscriptions.append(test_subscriptions)


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.
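The warning above comes from `DataFrame.append`, which was deprecated and then removed in pandas 2.0. As a minimal sketch of the same train/test stacking on newer pandas versions, `pd.concat` can be used instead. The two small frames below are hypothetical stand-ins for the real `train_accounts` and `test_accounts`, not the actual data.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_accounts / test_accounts
toy_train_accounts = pd.DataFrame({"account_id": [1, 2], "lead_score": [0, 1]})
toy_test_accounts = pd.DataFrame({"account_id": [3], "country": ["US"]})

# sort=False keeps columns in the order they first appear (and silences the
# sorting warning); ignore_index=True gives every row a unique label
toy_accounts = pd.concat(
    [toy_train_accounts, toy_test_accounts], sort=False, ignore_index=True
)

print(toy_accounts)
```

Columns that exist in only one of the frames are filled with `NaN` for the rows that lack them, which is why the test rows end up with an empty `lead_score`.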





###Feature engineering
In this block we add a new feature of `[REDACTED]` extracted from the user `[REDACTED]`

We also separate all the `[REDACTED]` users into a different DataFrame

In [0]:
users['[REDACTED]'] = users['REDACTED'].apply(lambda x: "[REDACTED]")
[REDACTED] = users[users["[REDACTED]"] == "[REDACTED]"]

KeyError: ignored

Let's enrich our data a bit

In [0]:
# Joining the accounts with the [REDACTED] users
all_features = accounts.merge([REDACTED], on='account_id', suffixes=('_account', '_user'))

all_features = all_features.reset_index().set_index(['account_id', '[REDACTED]']).drop(columns='index')

# Adding the [REDACTED] in a separate column
all_features['[REDACTED]'] = "[REDACTED]"

NameError: ignored

In [0]:
all_features=accounts

In [0]:
all_features.head()

Unnamed: 0,account_id,account_name,browser,churn_date,churn_reason,collection_21_days,company_size,country,created_at,device,has_logo,industry,lead_score,max_team_size,min_team_size,mrr,os,paying,payment_currency,plan_id,region,subscription_started_at,team_size,time_diff,trial_start,user_description,user_goal,utm_cluster_id
0,1.0,"Gardner, Barron and Keller",microsoft edge,,,0,,AU,2019-01-01 00:01:15,desktop,1,,0.0,5.0,2.0,,windows,0,AUD,,New South Wales,,,11.0,2019-01-01 00:01:15,,,orders
1,2.0,Dunn Ltd,,,,0,,US,2019-01-01 00:01:52,mobile,1,,0.0,5.0,2.0,,ios,0,USD,,New Jersey,,,,2019-01-01 00:01:52,,,
2,3.0,Boone Inc,chrome,,,0,,US,2019-01-01 00:03:12,desktop,1,Other,0.0,1.0,1.0,,windows,0,USD,,Louisiana,,1.0,-6.0,2019-01-01 00:03:12,,,todos
3,4.0,"Christian, Carroll and Davis",,,,0,,IL,2019-01-01 00:04:11,mobile,1,,0.0,,,,android,0,USD,,Tel Aviv,,,,2019-01-01 00:04:11,,,
4,5.0,Brooks-Oliver,chrome,,,0,,US,2019-01-01 00:04:21,desktop,1,Design,0.0,1.0,1.0,,chrome_os,0,USD,,North Carolina,,1.0,-5.0,2019-04-04 11:09:12,,,todos


###Data preparation
After we create our raw features, we need to make sure they fit the format our ML model expects.

In [0]:
# We map our features into different types
categorical_features = ['country', 'device']

normalized_features = ['collection_21_days']

binary_features = ['paying', 'has_logo']

untouched_features = ['account_id']

target = ['lead_score']

# And create a column transformer to handle the manipulation for us
preprocess = make_column_transformer(
    (OneHotEncoder(), categorical_features),
    (Normalizer(), normalized_features),
    (Binarizer(), binary_features)
)
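To make the effect of the column transformer concrete, here is a small sketch on a hypothetical toy frame that reuses the feature names from this notebook. Two points are worth checking on your own data. First, `handle_unknown="ignore"` is an added assumption here, so that categories unseen during fitting do not raise an error. Second, `Normalizer` rescales each row to unit norm, so when it is applied to a single column every non-zero value becomes 1.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, Normalizer, Binarizer

# Hypothetical toy data using the same column names as the notebook
toy = pd.DataFrame({
    "country": ["US", "IL", "US"],
    "device": ["desktop", "mobile", "desktop"],
    "collection_21_days": [0.0, 5.0, 20.0],
    "paying": [0, 1, 0],
    "has_logo": [1, 1, 0],
})

toy_preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["country", "device"]),
    (Normalizer(), ["collection_21_days"]),
    (Binarizer(), ["paying", "has_logo"]),
)

X_toy = toy_preprocess.fit_transform(toy)
# The result may be sparse or dense depending on its density; densify to inspect
X_toy = X_toy.toarray() if hasattr(X_toy, "toarray") else X_toy

# 2 country columns + 2 device columns + 1 normalized + 2 binary columns
print(X_toy.shape)
# Column 4 is the "normalized" feature: 0, 5 and 20 collapse to 0, 1 and 1
print(X_toy[:, 4])
```

If the magnitude of `collection_21_days` matters to the model, a column-wise scaler such as `StandardScaler` or `MinMaxScaler` may be a better fit than `Normalizer`.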

###Re-splitting
We now need to split our data back to the original train set and test set.

We also make sure we keep only the columns we want in the data frame (the features)

In [0]:
# Getting only the relevant features from the dataset
dataset = all_features[categorical_features + normalized_features + binary_features + untouched_features + target]

# Filling empty values with default values 
dataset.loc[:,categorical_features] = dataset[categorical_features].fillna('')
dataset.loc[:,normalized_features +
              binary_features +
              untouched_features] = dataset[normalized_features +
                                            binary_features +
                                            untouched_features].fillna(0)

# Splitting them back up to the original train/test split
dataset_train = dataset[dataset.reset_index().account_id.isin(train_accounts.account_id)]
dataset_test = dataset[dataset.reset_index().account_id.isin(test_accounts.account_id)]


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.
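The reindexing warning appears because the boolean mask is built from `dataset.reset_index()`, whose index differs from the index of `dataset` itself. A minimal sketch of a split that avoids this, assuming `account_id` is a regular column, builds the mask directly on the frame being filtered. The toy frame and ID lists below are hypothetical.

```python
import pandas as pd

# Hypothetical combined dataset and the IDs of each original split
toy_dataset = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "lead_score": [0.0, 1.0, None, None],
})
toy_train_ids = pd.Series([1, 2])
toy_test_ids = pd.Series([3, 4])

# The mask shares the index of toy_dataset, so no reindexing is needed
toy_dataset_train = toy_dataset[toy_dataset["account_id"].isin(toy_train_ids)]
toy_dataset_test = toy_dataset[toy_dataset["account_id"].isin(toy_test_ids)]

print(len(toy_dataset_train), len(toy_dataset_test))
```

It is worth checking that both parts are non-empty after the split and that their sizes add up to the size of the combined frame.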



In [0]:
dataset.head()

Unnamed: 0,country,device,collection_21_days,paying,has_logo,account_id,lead_score
0,AU,desktop,0,0,1,1.0,0.0
1,US,mobile,0,0,1,2.0,0.0
2,US,desktop,0,0,1,3.0,0.0
3,IL,mobile,0,0,1,4.0,0.0
4,US,desktop,0,0,1,5.0,0.0


In [0]:
dataset.shape

(1433661, 7)

###Setting everything up
Our dataset is large (1,500,000+ accounts, each has a few users, each has events for every day)

We need to work on a smaller batch of the training data so we can iterate more quickly.

Once we find a good architecture we can increase the sample size to increase the accuracy.

In [0]:
sampled_dataset_train = dataset_train.iloc[sample_without_replacement(dataset_train.shape[0], samples_num)]
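As an alternative sketch, pandas can draw the same kind of sample directly with `DataFrame.sample`. Passing `random_state` makes the draw reproducible between runs, and capping `n` at the frame length avoids an error when `samples_num` is larger than the training set. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical training frame with 100 accounts
toy_train = pd.DataFrame({
    "account_id": range(100),
    "lead_score": [0] * 90 + [1] * 10,
})
toy_samples_num = 20

# Sample rows without replacement; the seed makes the result repeatable
toy_sampled = toy_train.sample(
    n=min(toy_samples_num, len(toy_train)), random_state=42
)

print(toy_sampled.shape)
```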

In [0]:
sampled_dataset_train.head()


Unnamed: 0,country,device,collection_21_days,paying,has_logo,account_id,lead_score
567388,MX,desktop,0,0,1,597414.0,0.0
1226058,US,desktop,0,0,1,1290549.0,0.0
14980,AR,desktop,0,0,1,15785.0,0.0
209515,CO,desktop,0,0,1,220583.0,0.0
911356,AU,desktop,0,0,1,959343.0,0.0


In [0]:


# We fit our column transformer on both the train and the test sets
# (the sampled training frame is named sampled_dataset_train above)
preprocess.fit(sampled_dataset_train.append(dataset_test))

# We use transform to finally manipulate the features of our training set
X = preprocess.transform(sampled_dataset_train)

# Separating the label
y = sampled_dataset_train.pop('lead_score')


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





ValueError: ignored

In [0]:
test=sampled_dataset_train.append(dataset_test)
test.head()


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





Unnamed: 0,account_id,collection_21_days,country,device,has_logo,lead_score,paying
1128513,1187872.0,0,IN,desktop,1,,0
1270589,1337509.0,0,CL,desktop,1,,0
929401,978261.0,0,FR,desktop,1,,0
117723,123922.0,0,NO,desktop,1,,0
981823,1033446.0,0,TH,desktop,1,,0


In [0]:
dataset_test

Unnamed: 0,country,device,collection_21_days,paying,has_logo,account_id,lead_score


In [0]:
test.tail()


Unnamed: 0,account_id,collection_21_days,country,device,has_logo,lead_score,paying
856668,901869.0,0,US,desktop,1,,0
671816,707335.0,0,CO,mobile,1,,0
501604,528116.0,0,US,mobile,1,,0
789937,831598.0,0,IL,desktop,1,,0
674282,709915.0,0,BE,desktop,1,,0


In [0]:
# You now need to split the data into YOUR OWN training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

# For standardization purposes we store y_test in a y_true variable
y_true = y_test
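The sample outputs above show mostly `0.0` lead scores, so the classes are probably imbalanced. With a small `test_size`, a plain random split can leave very few positive examples in the evaluation set. One possible variation, shown here on hypothetical toy arrays, is to pass `stratify=y` so both parts keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 10% positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy preserves the 90/10 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```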

In [0]:

y.tail()


742446     0.0
1337228    0.0
708862     0.0
36225      0.0
205899     0.0
Name: lead_score, dtype: float64

###Running the model
It's money time: we can finally run our model!

First we need to create it and train (fit) it.

In [0]:
## Fit a simple decision tree with the given hyperparameters
dt = DecisionTreeClassifier(random_state = 2, 
                            max_depth = 10, 
                            min_samples_split = 10)
dt.fit(X_train, y_train)

ValueError: ignored

In [0]:
clf = REDACTED_MODEL_TYPE(REDACTED_MODEL_PARAMETERS, random_state=42)
%time clf.fit(X_train, y_train)

ValueError: ignored

In [0]:
# Now we need to get the predictions of our test set
%time y_pred = clf.predict(X_test)

###Model evaluation
Now that we have our model and it can predict the lead score based on features, we need a way to test if it's any good

####Classification report
We use classification_report to get different metrics comparing our prediction to the ground truth.

In [0]:
print(classification_report(y_true, y_pred, target_names=['Not Lead', 'Lead']))

We can also get the accuracy, MCC (Matthews correlation coefficient), and F1 scores of the model

In [0]:
print('Acc:  {}'.format(metrics.accuracy_score(y_true, y_pred)))
print('MCC: {}'.format(metrics.matthews_corrcoef(y_true, y_pred)))
print('F1:  {}'.format(metrics.f1_score(y_true, y_pred)))
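For intuition about what MCC measures, here is a small worked sketch on hypothetical toy labels. It computes the coefficient by hand from the four confusion-matrix counts, using MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and compares the result with scikit-learn's `matthews_corrcoef`.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical labels: 6 negatives and 4 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_toy, y_pred_toy).ravel())

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_manual = numerator / denominator

print(mcc_manual, matthews_corrcoef(y_true_toy, y_pred_toy))
```

Unlike accuracy, MCC uses all four counts, which makes it more informative when one class is much rarer than the other.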

####Plotting the confusion matrix
Confusion matrices are useful for comparing our predictions to the ground truth, class by class

In [0]:
fig, axs = plt.subplots(ncols=2, figsize=(14,4))

cm = confusion_matrix(y_true, y_pred)
ticks = ['Not Lead', 'Lead']
cmap = sns.color_palette("Blues")

# We normalize each row (true class) so the cells show per-class rates, which makes the comparison easier
sns.heatmap(cm.astype('float') / cm.sum(axis=1)[:, np.newaxis], annot=True, ax=axs[0], cmap=cmap)
axs[0].set(title="Normalized confusion matrix", xlabel="Prediction", ylabel="Truth", xticklabels=ticks, yticklabels=ticks)

# We also plot the original numbers to get the whole picture
sns.heatmap(cm, annot=True, ax=axs[1], fmt='g', cmap=cmap)
axs[1].set(title="Confusion matrix", xlabel="Prediction", ylabel="Truth", xticklabels=ticks, yticklabels=ticks);
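The row normalization used for the first heatmap can also be requested from scikit-learn directly. The sketch below, on hypothetical toy labels, assumes scikit-learn 0.22 or later, where `confusion_matrix` accepts a `normalize` argument. It checks that `normalize="true"` matches the manual division by row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 negatives and 4 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Manual normalization: divide every row by the number of true samples in it
cm_manual = cm_toy.astype("float") / cm_toy.sum(axis=1)[:, np.newaxis]

# Built-in equivalent: normalize over the true labels (rows)
cm_builtin = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")

print(cm_manual)
```

Each row then sums to 1, so the diagonal holds the recall of each class.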

###Submitting results
After you have run several iterations and think your model is good enough, you can send it to us and we'll add your score to the leaderboard!

You have to get the results into the following format:
```python
{"9023749": 1, "9837598": 0, ...}
```

This is a dictionary where the keys are `account_id`s and the values are the predicted lead_score.

_Make sure you send us **all** the test accounts!_

_There should be exactly `71,683` of them!_
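A minimal sketch of building a dictionary in this format from NumPy arrays follows. The IDs and predictions are the hypothetical values from the format example above. Casting keys to `str` and values to `int` is an assumption made to match the format shown, and it also avoids JSON serialization problems with NumPy integer types.

```python
import json
import numpy as np

# Hypothetical account IDs and predictions, stored as floats as in the dataset
toy_account_ids = np.array([9023749.0, 9837598.0])
toy_y_pred = np.array([1.0, 0.0])

# Keys become strings like "9023749"; values become plain Python ints
toy_payload = {
    str(int(account_id)): int(score)
    for account_id, score in zip(toy_account_ids, toy_y_pred)
}

toy_encoded = json.dumps(toy_payload)
print(toy_encoded)
```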

####Prediction
First of all, just like before, we have to predict the lead_score.

This time you need to use the test set _we_ provided.

In [0]:
submission_account_ids = dataset_test.index.values
# transform() returns an array, so the label column is dropped from the DataFrame first
X_submission = preprocess.transform(dataset_test.drop(columns='lead_score'))

y_pred_submission = clf.predict(X_submission)

####Submission
Now that we have our submission predictions, we need to pack them up into a compatible format for our server to handle.


In [0]:
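# Creating a dictionary where the keys are the account_ids
# and the values are your predictions.
# Keys are cast to str and values to int to match the expected JSON format.
predictions = {str(int(account_id)): int(score)
               for account_id, score in zip(submission_account_ids, y_pred_submission)}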

We now just send the results to our server and wait for the score!

In [0]:
# Importing stuff for http requests
from urllib import request
import json

# We validate first that we actually send all the test accounts expected to be sent
if y_pred_submission.shape[0] != 71683 or submission_account_ids.shape[0] != 71683:
  raise Exception("You have to send all of the accounts! Expected: (71683, 71683), Got: ({}, {})".format(y_pred_submission.shape[0], submission_account_ids.shape[0]))

if "group_name" not in vars() or group_name == "":
  group_name = input("Please enter your group's name:")

data = json.dumps({'submitter': group_name, 'predictions': predictions}).encode('utf-8')

req = request.Request(f"https://leaderboard.datahack.org.il/monday/api/",
                      headers={'Content-Type': 'application/json'},
                      data=data)

res = request.urlopen(req)
print(json.load(res))