### Sampling
* Train on 10% and 25% samples of the dataset. For the train/test split, use 90/10.
* Note that RAND() does not use a deterministic seed, so the sampled rows and the exact row counts differ from run to run.
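If a repeatable sample is needed, one common alternative to RAND() is to bucket rows by a hash of their contents. The sketch below only builds the query text; `build_sample_query` is a hypothetical helper, not part of this notebook, and the query has not been run against `uplift.data`. It assumes BigQuery standard SQL, where FARM_FINGERPRINT and TO_JSON_STRING are available. Rows with identical contents always land in the same bucket, so the sampled fraction may drift from the target if the table has many duplicates.

```python
def build_sample_query(table, percent):
    """Return a BigQuery standard-SQL query keeping roughly `percent`% of rows.

    Hypothetical helper: rows are bucketed by a hash of their contents, so the
    same rows are selected on every run (unlike RAND()).
    """
    if not 0 < percent <= 100:
        raise ValueError('percent must be in (0, 100]')
    return (
        'SELECT t.*\n'
        'FROM `{table}` AS t\n'
        'WHERE MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 100) < {percent}'
    ).format(table=table, percent=int(percent))


query = build_sample_query('uplift.data', 10)
print(query)
```

The resulting string could be passed to bq.Query(query) in the same way as the row-count query in this notebook.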

In [3]:
from google.datalab import Context

context = Context.default()
print('The current project is %s' % context.project_id)

The current project is capstone-project-229521


In [4]:
import google.datalab.bigquery as bq

query="""
SELECT COUNT(*) as total_rows
FROM `uplift.data`
"""

df = bq.Query(query).execute().result().to_dataframe()
df.head()

| | total_rows |
|---|---|
| 0 | 25309482 |


#### 10% data

In [5]:
%%bq query --name samples
SELECT f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, treatment, conversion
FROM `uplift.data` 
WHERE RAND() < 0.1

In [6]:
df = samples.execute().result().to_dataframe()
len(df)

2534376

There are 2534376 observations in this sample, which is about 10% of the 25309482 rows in the whole data. The exact count changes between runs because RAND() is unseeded.
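As a quick sanity check, the observed sampling fraction can be compared with the 10% target. This is a minimal sketch using the row counts printed in the cell outputs above; it assumes each row is kept independently with probability 0.1, which makes the sample size binomial.

```python
import math

total_rows = 25309482   # COUNT(*) reported for uplift.data
sample_rows = 2534376   # len(df) printed for the 10% sample
target = 0.1

observed_fraction = sample_rows / total_rows

# Under independent sampling the sample size is binomial(total_rows, target);
# this is its standard deviation, measured in rows.
expected_sd = math.sqrt(total_rows * target * (1 - target))

print('observed fraction: %.4f' % observed_fraction)
print('typical run-to-run spread: about %d rows' % round(expected_sd))
```

A spread on the order of a thousand rows is why two runs of the same query report slightly different counts.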

#### Store data
* Pickle is a serialization format that stores a pandas DataFrame on disk with its dtypes and index intact.
* Save the 10% dataframe using df.to_pickle(file_name).
* Load the dataframe back using df = pd.read_pickle(file_name). This avoids re-running the query, which would draw a different sample.
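The round trip can be checked on a small frame before trusting it with millions of rows. This is a sketch on made-up toy data (the `toy` frame and the temporary file name are illustrative, not from the dataset):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the sampled data; values are illustrative only.
toy = pd.DataFrame({'f0': [0.5, 1.5], 'treatment': [0, 1], 'conversion': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)              # serialize to disk
    restored = pd.read_pickle(path)  # load it back

# equals() compares values, dtypes and index.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files you created yourself or otherwise trust.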

In [7]:
df.to_pickle('data10.pkl')

In [None]:
import pandas as pd

df = pd.read_pickle('data10.pkl')

#### 25% data

In [8]:
%%bq query --name samples
SELECT f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, treatment, conversion
FROM `uplift.data` 
WHERE RAND() < 0.25

In [9]:
df25 = samples.execute().result().to_dataframe()
len(df25)

6327782

In [10]:
df25.to_pickle('data25.pkl')

In [None]:
# df = pd.read_pickle('data25.pkl')

There are 6327782 observations in this sample, which is about 25% of the whole data. As with the 10% sample, the exact count changes between runs.

#### Normalize the data

In [5]:
# Min-max scale every column to the range [0, 1]
normalized_df = (df - df.min()) / (df.max() - df.min())
normalized_df.head()

| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | treatment | conversion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.748562 | 0.0 | 0.127308 | 0.940855 | 0.169318 | 0.0 | 0.666944 | 1.0 | 0.465607 | 0.20116 | 0.013567 | 0.0 | 0.0 | 0.0 |
| 1 | 0.224739 | 0.0 | 0.849874 | 1.0 | 0.231494 | 0.0 | 0.510309 | 1.0 | 0.574894 | 0.0 | 0.078046 | 0.0 | 1.0 | 0.0 |
| 2 | 0.566353 | 0.0 | 0.009958 | 0.817259 | 0.395456 | 0.0 | 0.424659 | 1.0 | 0.364059 | 0.571041 | 0.990042 | 0.290586 | 1.0 | 1.0 |
| 3 | 0.874888 | 0.0 | 0.393534 | 0.940855 | 0.169318 | 0.0 | 0.626398 | 1.0 | 0.63565 | 0.397272 | 0.606396 | 0.192067 | 1.0 | 0.0 |
| 4 | 0.896559 | 0.0 | 0.102252 | 0.454785 | 0.0 | 0.0 | 0.507131 | 1.0 | 0.589101 | 0.448026 | 1.0 | 0.0 | 1.0 | 0.0 |
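The min-max formula divides by max - min, so a column that is constant within a sample would become NaN (0/0). A minimal sketch of a guarded variant follows; `min_max_normalize` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def min_max_normalize(frame):
    """Scale each column to [0, 1]; constant columns map to 0.0, not NaN."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Swap zero ranges for 1 so that (x - min) / range is 0 / 1 = 0.
    safe_range = col_range.where(col_range != 0, 1)
    return (frame - col_min) / safe_range


# Toy data: f1 is constant and would turn into NaN with the unguarded formula.
toy = pd.DataFrame({'f0': [1.0, 3.0, 5.0], 'f1': [2.0, 2.0, 2.0]})
normalized_toy = min_max_normalize(toy)
print(normalized_toy)
```

Whether mapping a constant column to 0.0 is the right choice depends on the model; dropping such columns is another reasonable option.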


#### Train/Test Split

The same code performs the train/test split for either sample (10% or 25%).

In [7]:
from sklearn.model_selection import train_test_split

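# Features: every column except the last one (f0..f11 plus treatment).
# Targets: the treatment flag and the conversion label.
X = normalized_df[normalized_df.columns[:-1]]
y = normalized_df[['treatment', 'conversion']]

# Hold out 10% for testing; the fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=35)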
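To see what the split produces, the same slicing can be run on a small made-up frame. This sketch uses toy data with 20 rows (not the real dataset) and shows that treatment ends up in both the feature and target frames, as the column selection above implies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in with the same column order: features, treatment, conversion.
toy = pd.DataFrame({
    'f0': [i / 19 for i in range(20)],
    'treatment': [i % 2 for i in range(20)],
    'conversion': [1 if i % 5 == 0 else 0 for i in range(20)],
})

X = toy[toy.columns[:-1]]             # f0 and treatment
y = toy[['treatment', 'conversion']]  # treatment and conversion

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=35)

print(X_tr.shape, X_te.shape)
print(list(X.columns), list(y.columns))
```

The row indices of X_tr and y_tr stay aligned, so features and targets can be joined back together by index if needed.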