# Tutorial on using the nudging package

**Important note**: If you're using this notebook on Binder, be careful with uploading private datasets, since the data will be uploaded to cloud providers.

To install the package locally see: https://github.com/UtrechtUniversity/nudging.

The goal of this tutorial is to show how you can use the nudging package and apply machine learning methods to predict the Conditional Average Treatment Effect (CATE). Intuitively speaking, this is simply the difference in outcome for each person between the case where that person is nudged and the case where they are not.
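
For readers who prefer a formula, the intuition above is usually written in potential-outcomes notation. The symbols below are introduced here for illustration and are not taken from the nudging package: $Y(1)$ and $Y(0)$ are the outcomes a person would have with and without the nudge, and $X$ are that person's features (age, gender, and so on).

$$\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]$$

Only one of $Y(1)$ and $Y(0)$ is ever observed for a given person, which is why the CATE has to be estimated with a model rather than computed directly.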

## Step 1: Python imports

First import some modules that are needed later.

In [1]:
import nudging.dataset

In [2]:
import pandas as pd
import numpy as np

from nudging.dataset.file import FileDataset

## Step 2: Read the file into a pandas dataframe

Pandas has a good number of file reading functions for tabular datasets, including Excel, Stata and CSV. Below is an example for a CSV file.

In [3]:
# Make sure that the file tutorial.csv is in the same folder as the notebook.
df = pd.read_csv("tutorial.csv")
df[:5]

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,checked,survival chance
0,1,"Braund, Mr. Owen Harris",male,22.0,0,7.25,,S,1922-03-23,14:57:38,2022-08-13 08:42:37,0,-0.673159
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,0,71.2833,C85,C,1938-03-22,17:16:30,2022-07-24 06:45:36,0,0.15806
2,3,"Heikkinen, Miss. Laina",female,26.0,0,7.925,,S,1909-11-06,15:57:59,2022-07-21 02:32:58,1,8.650757
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,0,53.1,C123,S,1915-10-14,14:17:36,2022-07-19 13:22:18,0,0.103713
4,5,"Allen, Mr. William Henry",male,35.0,0,8.05,,S,1929-04-18,13:23:39,2022-07-26 11:47:42,0,1.038863


In the above tutorial dataset, the "checked" column is the nudge, while the "survival chance" column is the outcome. These names will be standardized in the next step (but obviously that can also be done before loading the CSV).
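
If your own file is not a plain comma-separated CSV, the pandas readers accept options for that. The sketch below uses a small made-up table held in memory (not the tutorial file) to show `pd.read_csv` with a semicolon separator and decimal commas; the column names are only illustrative.

```python
import io

import pandas as pd

# A made-up, semicolon-separated table with decimal commas,
# as commonly exported by spreadsheet software in some locales.
csv_text = """PassengerId;Age;Fare
1;22,0;7,25
2;38,0;71,2833
3;;7,925
"""

# sep and decimal tell pandas how to split columns and parse numbers.
toy_df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(toy_df.dtypes)

# For other formats, pandas offers similar readers, for example:
#   pd.read_excel("my_file.xlsx")   # needs an Excel engine such as openpyxl
#   pd.read_stata("my_file.dta")
```

Empty cells, such as the missing age in the last row, are read as `NaN`.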

## Step 3: Adjust the columns 

The end result of adjusting should be that there are at least two columns:

`outcome`, which is a column with the outcome for each person.
`nudge`, whether a person was nudged or not. Should be 1 if nudged, 0 if not nudged.

The remaining columns should be the features, such as age and gender. The features should be numerical. So for example, one would convert gender to a value of 0 or 1.
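
A feature with more than two categories cannot be mapped to a single 0/1 column. One common option is one-hot encoding with `pd.get_dummies`. The sketch below uses a made-up three-row table; whether one-hot encoding is the best choice for your data is an assumption you should check yourself.

```python
import pandas as pd

# Made-up example rows with two categorical columns.
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]}
)

# Two categories: a single 0/1 column is enough.
toy["gender"] = (toy["Sex"] == "male").astype(int)

# Three categories: one 0/1 column per category.
embarked = pd.get_dummies(toy["Embarked"], prefix="embarked", dtype=int)

# Keep only the numerical columns as features.
features = pd.concat([toy[["gender"]], embarked], axis=1)
print(features)
```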

Below we adjust the columns of our demonstration dataset. Note that the original dataset has nothing to do with nudging, so two extra columns were generated for it. The CATE is set to be equal to the fare, so that we can check that the CATE modeling works.

In [4]:
# Copy the checked column into a new nudge column
df["nudge"] = df["checked"]

# Use the survival chance column as the outcome
df["outcome"] = df["survival chance"]

# Convert the Sex column to one with a numerical value.
df["gender"] = (df["Sex"] == "male").astype(int)

# Drop all columns except the ones we train the model on.
new_df = df[["Age", "Parch", "Fare", "gender", "nudge", "outcome"]]
new_df

Unnamed: 0,Age,Parch,Fare,gender,nudge,outcome
0,22.0,0,7.2500,1,0,-0.673159
1,38.0,0,71.2833,0,0,0.158060
2,26.0,0,7.9250,0,1,8.650757
3,35.0,0,53.1000,0,0,0.103713
4,35.0,0,8.0500,1,0,1.038863
...,...,...,...,...,...,...
886,27.0,0,13.0000,1,0,-0.562519
887,19.0,0,30.0000,0,0,-0.123510
888,,2,23.4500,0,0,-0.278336
889,26.0,0,30.0000,1,0,-0.405690
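
Before converting the dataframe, it can help to check that it meets the requirements listed in this step. The helper below, `check_nudging_frame`, is a hypothetical function written for this tutorial and is not part of the nudging package; it only uses pandas.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_nudging_frame(frame):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Both standardized columns must be present.
    for column in ("nudge", "outcome"):
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
    # The nudge column may only contain 0 and 1.
    if "nudge" in frame.columns:
        values = set(frame["nudge"].dropna().unique())
        if not values <= {0, 1}:
            problems.append("nudge contains values other than 0 and 1")
    # Every column, including the features, must be numerical.
    for column in frame.columns:
        if not is_numeric_dtype(frame[column]):
            problems.append(f"non-numeric column: {column}")
    return problems


# Made-up examples: one frame that passes and one that does not.
good = pd.DataFrame({"Age": [22.0, 38.0], "nudge": [0, 1], "outcome": [-0.67, 0.16]})
bad = pd.DataFrame({"Sex": ["male", "female"], "nudge": [0, 2]})

print(check_nudging_frame(good))
print(check_nudging_frame(bad))
```

In the notebook you would call `check_nudging_frame(new_df)` and expect an empty list.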


## Step 4: Convert the pandas DataFrame to a nudging dataset

This is simply done with the `FileDataset` Python class.

In [5]:
dataset = FileDataset.from_dataframe(new_df)

## Step 5: Predict the CATE

In [6]:
cate = dataset.predict_cate()

# Show the first 10 values of the CATE.
print(cate[:10])

[ 7.27231861 71.31586725  7.94288532 53.12835366  8.0325206  51.82207185
 21.11322614 11.2550339  30.06927027 16.72114871]
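
To give a feeling for what a CATE model does, here is a minimal sketch of one common approach, the so-called T-learner: fit one model on the nudged people, one on the others, and take the difference of their predictions. This is an illustration on synthetic data and is not necessarily what `predict_cate` does internally; all variable names below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic features and a random 0/1 nudge.
fare = rng.uniform(5, 80, size=n)
age = rng.uniform(1, 70, size=n)
nudge = rng.integers(0, 2, size=n)

# As in the tutorial data, the true effect of the nudge equals the fare.
outcome = nudge * fare + rng.normal(0, 1, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), age, fare])


def fit_linear(features, target):
    """Ordinary least squares; returns the coefficient vector."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return coef


# One model per group, then the difference of the two predictions.
coef_nudged = fit_linear(X[nudge == 1], outcome[nudge == 1])
coef_control = fit_linear(X[nudge == 0], outcome[nudge == 0])
cate_hat = X @ coef_nudged - X @ coef_control

print(np.corrcoef(cate_hat, fare)[0, 1])
```

The estimated CATE should track the fare closely here, which is the same sanity check you can do by eye on the output of the cell above.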


Notice that the length of the nudging dataset is not the same as the length of the original dataframe. This is because the modeling leaves out every person with a missing value (NA) in their features. To map the predictions back onto the original dataframe, you can use the index: `dataset.standard_df.index`.

In [7]:
cate_whole = np.full(len(df), np.nan, dtype=float)
cate_whole[dataset.standard_df.index] = cate
df["cate"] = cate_whole
df[:10]

Unnamed: 0,PassengerId,Name,Sex,Age,Parch,Fare,Cabin,Embarked,Birthday,Board time,Married since,checked,survival chance,nudge,outcome,gender,cate
0,1,"Braund, Mr. Owen Harris",male,22.0,0,7.25,,S,1922-03-23,14:57:38,2022-08-13 08:42:37,0,-0.673159,0,-0.673159,1,7.272319
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,0,71.2833,C85,C,1938-03-22,17:16:30,2022-07-24 06:45:36,0,0.15806,0,0.15806,0,71.315867
2,3,"Heikkinen, Miss. Laina",female,26.0,0,7.925,,S,1909-11-06,15:57:59,2022-07-21 02:32:58,1,8.650757,1,8.650757,0,7.942885
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,0,53.1,C123,S,1915-10-14,14:17:36,2022-07-19 13:22:18,0,0.103713,0,0.103713,0,53.128354
4,5,"Allen, Mr. William Henry",male,35.0,0,8.05,,S,1929-04-18,13:23:39,2022-07-26 11:47:42,0,1.038863,0,1.038863,1,8.032521
5,6,"Moran, Mr. James",male,,0,8.4583,,Q,1924-06-16,18:21:50,2022-08-06 13:35:13,0,-1.882357,0,-1.882357,1,
6,7,"McCarthy, Mr. Timothy J",male,54.0,0,51.8625,E46,S,1926-09-17,10:46:29,2022-08-09 04:50:09,0,0.549448,0,0.549448,1,51.822072
7,8,"Palsson, Master. Gosta Leonard",male,2.0,1,21.075,,S,1925-08-07,16:23:32,2022-07-26 20:56:16,1,22.51214,1,22.51214,1,21.113226
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,2,11.1333,,S,1934-07-16,12:37:38,2022-07-18 13:50:11,0,0.040837,0,0.040837,0,11.255034
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,0,30.0708,,C,1917-08-19,11:07:42,2022-08-09 00:15:36,1,30.644706,1,30.644706,0,30.06927
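
The cell above writes into a NumPy array by position, which works because `pd.read_csv` gives the dataframe a default 0, 1, 2, ... index. If your dataframe has a different index, aligning on labels is safer. The sketch below uses a made-up table with made-up CATE values, and uses `dropna()` as a stand-in for the rows that end up in `dataset.standard_df`.

```python
import numpy as np
import pandas as pd

# Made-up table with a non-default index and one missing age.
toy = pd.DataFrame(
    {"Age": [22.0, np.nan, 26.0, 35.0], "Fare": [7.25, 8.46, 7.93, 53.1]},
    index=[10, 11, 12, 13],
)

# Stand-in for the rows the model actually used (no missing values).
complete = toy.dropna()

# Made-up predictions, one per complete row.
toy_cate = np.array([7.3, 7.9, 53.1])

# Align on index labels; rows that were dropped get NaN.
toy["cate"] = pd.Series(toy_cate, index=complete.index).reindex(toy.index)
print(toy)
```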


### Save the results of the modeling to a new CSV file

In [8]:
df.to_csv("tutorial_with_cate.csv", index=False)