# Machine Learning Challenge - Example Solution

This notebook shows how to load the dataset for the machine learning challenge and how to submit a solution.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import dataset from CSV

In [2]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [3]:
df_train.head()

Unnamed: 0,uid,install_timestamp,free_trial_timestamp,country,language,device_type,os_version,attribution_network,product_price_tier,product_periodicity,product_free_trial_length,onboarding_birth_year,onboarding_gender,net_purchases_15d,net_purchases_1y
0,wBmGYN3Z+l3OBvdnzv/fdw==,2019-06-07 00:01:44+00:00,2019-06-07 03:12:33+00:00,US,en,iPhone 6,12.2,Snapchat Installs,20,30,7,2005.0,M,0.0,0.0
1,s8j5KRaEXICNrKRsaBB8FQ==,2019-06-07 00:05:10+00:00,2019-06-08 07:39:48+00:00,US,en,iPhone X,12.3.1,Organic,6,7,7,2000.0,M,4.2,63.0
2,YpCrTmFIkyrv/8Xv2hNJcw==,2019-06-07 00:05:53+00:00,2019-06-07 00:07:01+00:00,GB,en,iPhone 8,12.3.1,Snapchat Installs,6,7,7,1994.0,M,0.0,0.0
3,U8WfGbQJ5rtEuMARYiLkkA==,2019-06-07 00:08:51+00:00,2019-06-07 00:10:59+00:00,US,en,iPhone 6S Plus,12.3.1,Organic,6,7,7,1992.0,F,4.2,4.9
4,996ECqYkUl6anviM65Npxg==,2019-06-07 00:12:01+00:00,2019-06-07 00:12:41+00:00,GB,en,iPhone 8,12.2,Organic,6,7,7,2000.0,M,4.13,98.900002


In [4]:
df_test.head()

Unnamed: 0,uid,install_timestamp,free_trial_timestamp,country,language,device_type,os_version,attribution_network,product_price_tier,product_periodicity,product_free_trial_length,onboarding_birth_year,onboarding_gender,net_purchases_15d
0,oZM0Urdhm8tR1ZYKnyGbYw==,2019-06-07 00:11:25+00:00,2019-06-07 00:12:35+00:00,GB,en,iPhone 7,12.2,Organic,6,7,7,1989.0,M,0.0
1,0VzRx4ggiIhf66/Rx8b16Q==,2019-06-07 00:18:03+00:00,2019-06-07 00:22:42+00:00,FR,fr,iPhone 8 Plus,12.3.1,Snapchat Installs,20,30,7,1999.0,M,0.0
2,vtf64fAiTxwDmC2hn4aFwg==,2019-06-07 00:22:46+00:00,2019-06-07 00:23:37+00:00,US,en,iPhone XS Max,12.2,Snapchat Installs,20,30,7,1995.0,M,0.0
3,imIeOU/L+Pp9iEHnpBGxOQ==,2019-06-07 00:31:42+00:00,2019-06-17 01:33:42+00:00,US,en,iPhone 5S,12.1.4,Organic,6,7,7,2000.0,M,0.0
4,mFBL8ysoHjQyecIbpbWrzg==,2019-06-07 00:39:12+00:00,2019-06-07 00:39:38+00:00,US,en,iPhone X,12.3.1,Snapchat Installs,6,7,7,2004.0,?,0.0
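Comparing the two headers above, the test set appears to carry the same columns as the training set except for the target. A minimal sketch of that check follows; it uses empty stand-in frames (`train_stub` and `test_stub`, hypothetical names) built from the column names shown above, so it runs without the CSV files. With the real data you would compare `df_train.columns` and `df_test.columns` directly.

```python
import pandas as pd

# Column names copied from the two tables above (index column omitted).
# Empty frames stand in for the real df_train / df_test.
shared_cols = [
    "uid", "install_timestamp", "free_trial_timestamp", "country", "language",
    "device_type", "os_version", "attribution_network", "product_price_tier",
    "product_periodicity", "product_free_trial_length", "onboarding_birth_year",
    "onboarding_gender", "net_purchases_15d",
]
train_stub = pd.DataFrame(columns=shared_cols + ["net_purchases_1y"])
test_stub = pd.DataFrame(columns=shared_cols)

# Columns present in the training set but absent from the test set.
only_in_train = train_stub.columns.difference(test_stub.columns).tolist()
print(only_in_train)
```

The only column left over is `net_purchases_1y`, the quantity to predict.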


## Linear Model

Let's import the model and the metric we need.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Let's define the cost function we want to minimize.

In [6]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
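To make the metric concrete, here is a small worked example that computes the same quantity with NumPy alone: the square root of the mean squared difference between targets and predictions. The helper name `rmse_numpy` and the toy values are made up for illustration and are not part of the challenge data.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: the absolute errors are 0, 3 and 4.
y_true_toy = [0.0, 4.0, 10.0]
y_pred_toy = [0.0, 1.0, 6.0]

toy_rmse = rmse_numpy(y_true_toy, y_pred_toy)
print(toy_rmse)  # sqrt((0 + 9 + 16) / 3) = sqrt(25 / 3)
```

Because the errors are squared before averaging, a few large misses weigh more on the RMSE than many small ones.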

Let's build a simple linear model that uses the net purchases made in the first 15 days since the free trial (`net_purchases_15d`) as its only predictor.

In [7]:
X_train = df_train[["net_purchases_15d"]]
X_test = df_test[["net_purchases_15d"]]

y_train = df_train["net_purchases_1y"]
# y_test is unknown to the candidate; it will be used to evaluate the submission

In [8]:
model = LinearRegression()

In [9]:
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)

Let's see the RMSE of this model on the training set.

In [10]:
rmse_lin = rmse(y_train, y_pred_train)

print(f"Linear Regression RMSE: {rmse_lin}")

Linear Regression RMSE: 13.620715866632395
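The figure above is computed on the same rows the model was fitted on, so it may understate the error on unseen users. One common way to get a rough out-of-sample estimate is to hold out part of the training data. The sketch below illustrates the idea on synthetic data (not the challenge data); the variable names, the 80/20 split, and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (net_purchases_15d, net_purchases_1y); not the real data.
rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=3.0, size=(500, 1))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=500)

# Hold out 20% of the rows; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model_demo = LinearRegression().fit(X_fit, y_fit)
rmse_fit = np.sqrt(mean_squared_error(y_fit, model_demo.predict(X_fit)))
rmse_val = np.sqrt(mean_squared_error(y_val, model_demo.predict(X_val)))
print(f"fit RMSE: {rmse_fit:.3f}, validation RMSE: {rmse_val:.3f}")
```

With the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, and the validation RMSE would serve as a proxy for `rmse(y_test, y_pred_test)`.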


## Export CSV for submission

Now let's make an example submission with our linear model.

In [11]:
y_pred_test = model.predict(X_test)

In [12]:
df_candidate = pd.DataFrame({"y_pred_test": y_pred_test})
df_candidate.to_csv("example_submission.csv", index=False)

Here's a sample of what the submission CSV will look like:

In [13]:
df_candidate.sample(5)

Unnamed: 0,y_pred_test
2181,17.30267
3137,2.323053
670,2.323053
1403,2.323053
487,17.020445
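Before submitting, it can help to read the file back and confirm that it has a single `y_pred_test` column, one row per test row, and no missing values. The sketch below does this with made-up predictions and a temporary directory; in the notebook, `y_pred_demo` would be `y_pred_test` and the expected row count would be `len(df_test)`. It assumes predictions are matched to the test rows by position, which the notebook implies but does not state.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up predictions standing in for y_pred_test.
y_pred_demo = np.array([17.3, 2.3, 2.3, 0.0])
n_test_rows = len(y_pred_demo)  # in the notebook: len(df_test)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_submission.csv")
    pd.DataFrame({"y_pred_test": y_pred_demo}).to_csv(path, index=False)

    # Read the file back while the temporary directory still exists.
    df_check = pd.read_csv(path)

# The file should contain exactly one column, one row per test row, no NaNs.
assert list(df_check.columns) == ["y_pred_test"]
assert len(df_check) == n_test_rows
assert df_check["y_pred_test"].notna().all()
print("submission file looks consistent")
```

Note that the index shown in the sample above is only part of the notebook display; with `index=False` it is not written to the CSV.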


We will compare your solution, `y_pred_test`, with the ground truth target, `y_test`. Your goal is to make use of the available data to minimize the RMSE on the test set, namely `rmse(y_test, y_pred_test)`.

NB: you will be asked to explain how you approached the problem and to detail the high-level steps that brought you to the solution you submitted.