### Sampling
* Train on 10%, 25%, and 50% samples of the dataset. Use a 90/10 train/test split.
* Note that RAND does not use a deterministic seed, so the samples below are drawn in pandas with a fixed `random_state`.

In [1]:
from google.datalab import Context

context = Context.default()
print('The current project is %s' % context.project_id)

import google.datalab.bigquery as bq

query="""
SELECT COUNT(*) as total_rows
FROM `uplift.data`
"""

df = bq.Query(query).execute().result().to_dataframe()
df.head()

The current project is capstone-project-229521


Unnamed: 0,total_rows
0,25309482


In [2]:
%%bq query --name uplift
SELECT
  *
FROM uplift.data

In [3]:
df = uplift.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,treatment,conversion,visit,exposure
0,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0,0,0
1,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0,0,0
2,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0,0,0
3,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0,0,0
4,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0,0,0


This step is **time-consuming**; a faster alternative may be to read the data directly from a CSV file in chunks (the **chunksize** argument) instead of going through Google BigQuery.
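A minimal sketch of the chunked-CSV idea, assuming the table has been exported to a CSV file. The in-memory buffer, its three columns, and the name `df10_sketch` are hypothetical stand-ins; in practice a file path would be passed to `pd.read_csv`. Each chunk is sampled without replacement (the pandas default), which differs from the `replace=True` setting used later in this notebook.

```python
import io

import pandas as pd

# Stand-in for the real export (assumption): 1000 rows with a few columns.
csv_buffer = io.StringIO(
    "f0,treatment,conversion\n"
    + "\n".join(f"{i * 0.5},{i % 2},0" for i in range(1000))
)

sampled_chunks = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for i, chunk in enumerate(pd.read_csv(csv_buffer, chunksize=250)):
    # Keep 10% of each chunk; offsetting the seed by the chunk index keeps
    # the run reproducible while letting each chunk draw differently.
    sampled_chunks.append(chunk.sample(frac=0.1, random_state=35 + i))

df10_sketch = pd.concat(sampled_chunks)
print(len(df10_sketch))  # 4 chunks x 25 rows = 100
```

Only one chunk is held in memory at a time, so the full 25-million-row table never has to be loaded at once.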

#### 10% data

In [4]:
df = df.drop(['visit', 'exposure'], axis=1)
df.head()

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,treatment,conversion
0,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0
1,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0
2,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0
3,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0
4,1.991981,3.263641,8.272483,3.735871,3.506733,10.161281,2.981721,-0.166689,1.107571,9.850093,-1.8609,4.157648,1,0


In [5]:
df10 = df.sample(frac=0.1, replace=True, random_state=35)
len(df10)

2530948

The sample contains 2,530,948 observations, about 10% of the full dataset. Because `replace=True`, rows are drawn with replacement, so the sample may contain duplicates.
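As an illustration of what `replace=True` means for a subsample, the sketch below compares sampling with and without replacement on a small hypothetical frame (`toy` and `row_id` are made-up names, not part of the dataset). Both calls return the same number of rows, but only sampling without replacement guarantees that every row is distinct.

```python
import pandas as pd

# Hypothetical toy frame with one unique identifier per row.
toy = pd.DataFrame({"row_id": range(1000)})

with_repl = toy.sample(frac=0.1, replace=True, random_state=35)
without_repl = toy.sample(frac=0.1, replace=False, random_state=35)

# Both samples have the same length.
print(len(with_repl), len(without_repl))

# Number of distinct rows in each sample; with replacement this can be
# smaller than the sample size because a row may be drawn more than once.
print(with_repl["row_id"].nunique(), without_repl["row_id"].nunique())
```

If duplicated rows are not wanted in the training subsets, `replace=False` (the pandas default) may be the safer choice.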

#### Store data
* Pickle is a serialization format that can store a pandas DataFrame on disk.
* Save the 10% DataFrame with `df.to_pickle(file_name)`.
* Load it back later with `df = pd.read_pickle(file_name)`.

In [6]:
df10.to_pickle('data10.pkl')

In [None]:
#df = pd.read_pickle('data10.pkl')

#### 25% data

In [7]:
df25 = df.sample(frac=0.25, replace=True, random_state=35)
len(df25)

6327371

In [8]:
df25.to_pickle('data25.pkl')

There are 6327371 observations in the sample set, which is about 25% of the whole data.

#### 50% data

In [9]:
df50 = df.sample(frac=0.50, replace=True, random_state=35)
len(df50)

12654741

In [10]:
df50.to_pickle('data50.pkl')

There are 12654741 observations in the sample set, which is about 50% of the whole data.

#### Normalize the data

In [11]:
# Min-max scale every column to the [0, 1] range
normalized_df = (df - df.min()) / (df.max() - df.min())
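The min-max formula above uses statistics from every row. If the scaling statistics should come only from the training rows (so that no information from the held-out rows enters the scaling), one possible variant is sketched below. The helper names `fit_min_max` and `apply_min_max` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def fit_min_max(train: pd.DataFrame):
    """Return the per-column minimum and range computed from the training rows."""
    col_min = train.min()
    col_range = train.max() - col_min
    # A constant column has zero range; use 1 so it maps to 0 instead of NaN.
    col_range = col_range.replace(0, 1)
    return col_min, col_range


def apply_min_max(df: pd.DataFrame, col_min: pd.Series, col_range: pd.Series) -> pd.DataFrame:
    """Scale df with previously fitted statistics."""
    return (df - col_min) / col_range


# Toy data (hypothetical): f1 is constant on purpose.
train = pd.DataFrame({"f0": [1.0, 3.0, 5.0], "f1": [2.0, 2.0, 2.0]})
test = pd.DataFrame({"f0": [2.0, 7.0], "f1": [2.0, 2.0]})

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Test values can fall outside [0, 1] with this approach (7.0 maps to 1.5 here), which is expected when the statistics are fitted on the training rows alone.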

#### Train/Test Split

The same code can be used for the train/test split of each sample.

In [7]:
from sklearn.model_selection import train_test_split

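# Features: every column except the last ('conversion'); this keeps 'treatment' in X
X = normalized_df[normalized_df.columns[:-1]]
# Targets: the treatment flag and the conversion outcome
y = normalized_df[['treatment', 'conversion']]

# 90/10 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=35)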
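If conversions turn out to be rare, a purely random 90/10 split can leave the two sides with somewhat different conversion rates. One option, sketched below on synthetic data, is to stratify on the combined (treatment, conversion) label via the `stratify` argument of `train_test_split`. The toy frame, its treatment and conversion rates, and the `strata` variable are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the real frame would be normalized_df.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "f0": rng.normal(size=n),
    "treatment": (rng.random(n) < 0.8).astype(int),
    "conversion": (rng.random(n) < 0.1).astype(int),
})

# Same column selection as in the notebook: X keeps everything but 'conversion'.
X = toy[toy.columns[:-1]]
y = toy[["treatment", "conversion"]]

# One label per (treatment, conversion) pair, e.g. "1_0".
strata = y["treatment"].astype(str) + "_" + y["conversion"].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=35, stratify=strata
)

print(len(X_train), len(X_test))
print(y_train["conversion"].mean(), y_test["conversion"].mean())
```

With stratification, each (treatment, conversion) group appears in the train and test sets in roughly the same proportion as in the full frame.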