# Data Wrangling
## September 2021
Create a tidy dataset from the training data provided for the Kaggle competition. 
https://www.kaggle.com/c/ga-customer-revenue-prediction/overview

Note: The target for the project is the natural log of each user's total spend plus one:

$target_{user} = \ln(y_{user}+1)$

where

$y_{user} = \sum\limits_{i=1}^{n} transaction_{user,i}$

and the sum runs over that user's $n$ transactions.

The target therefore requires aggregating each user's spend. I could aggregate up front, but that would discard the time component (repeated visits to the store), so I don't want to do that. Instead I will keep one row per visit and work out how to aggregate the predictions after modelling.
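One possible way to roll session-level predictions up to the user-level target is sketched below. It assumes the model outputs a revenue prediction per session in the same units as `transactionRevenue`; the `predictedRevenue` column and the toy values are hypothetical and only illustrate the aggregation, not a final approach.

```python
import numpy as np
import pandas as pd

# Toy session-level predictions (hypothetical values and column name).
session_preds = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B", "C"],
    "predictedRevenue": [10.0, 5.0, -2.0, 20.0],
})

# Revenue cannot be negative, so clip any negative predictions to zero.
clipped = session_preds["predictedRevenue"].clip(lower=0)

# Sum the per-session predictions for each user, then apply ln(y + 1).
user_total = clipped.groupby(session_preds["fullVisitorId"]).sum()
user_target = np.log1p(user_total)
print(user_target)
```

Summing before taking the log matches the target definition above; taking the log per session and then summing would give a different number.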

In [1]:
import pandas as pd
import os
import numpy as np
import json

Read in training data

In [2]:
os.chdir(r'D:\Springboard\Capstone 3 maybe\Google Analytics')
os.listdir()

['sample_submission.csv',
 'sample_submission_v2.csv',
 'test.csv',
 'test_v2.csv',
 'train.csv',
 'train_v2.csv']

In [3]:
raw_train = pd.read_csv('train.csv', parse_dates=['date'],dtype={'fullVisitorId':'object'})
raw_train.head()

Unnamed: 0,channelGrouping,date,device,fullVisitorId,geoNetwork,sessionId,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,2016-09-02,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",1131660440785968503,"{""continent"": ""Asia"", ""subContinent"": ""Western...",1131660440785968503_1472830385,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472830385,1,1472830385
1,Organic Search,2016-09-02,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",377306020877927890,"{""continent"": ""Oceania"", ""subContinent"": ""Aust...",377306020877927890_1472880147,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472880147,1,1472880147
2,Organic Search,2016-09-02,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",3895546263509774583,"{""continent"": ""Europe"", ""subContinent"": ""South...",3895546263509774583_1472865386,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472865386,1,1472865386
3,Organic Search,2016-09-02,"{""browser"": ""UC Browser"", ""browserVersion"": ""n...",4763447161404445595,"{""continent"": ""Asia"", ""subContinent"": ""Southea...",4763447161404445595_1472881213,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472881213,1,1472881213
4,Organic Search,2016-09-02,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",27294437909732085,"{""continent"": ""Europe"", ""subContinent"": ""North...",27294437909732085_1472822600,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472822600,2,1472822600


In [4]:
raw_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903653 entries, 0 to 903652
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   channelGrouping       903653 non-null  object        
 1   date                  903653 non-null  datetime64[ns]
 2   device                903653 non-null  object        
 3   fullVisitorId         903653 non-null  object        
 4   geoNetwork            903653 non-null  object        
 5   sessionId             903653 non-null  object        
 6   socialEngagementType  903653 non-null  object        
 7   totals                903653 non-null  object        
 8   trafficSource         903653 non-null  object        
 9   visitId               903653 non-null  int64         
 10  visitNumber           903653 non-null  int64         
 11  visitStartTime        903653 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 82

In [5]:
train = raw_train.copy()

There don't appear to be any nulls, and the `date` and `fullVisitorId` columns were parsed correctly on read-in. Four columns hold dictionaries (stored as JSON strings) that need to be expanded into separate columns. I will expand them one column at a time below.
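The expansion could also be wrapped in a small reusable helper. The sketch below is an illustration only: `expand_json_column` is a hypothetical name, and the two-row frame stands in for the real data.

```python
import json

import numpy as np
import pandas as pd


def expand_json_column(df, column, keys, default=np.nan):
    """Parse a JSON-string column and copy the selected keys into new columns."""
    parsed = df[column].apply(json.loads)
    out = df.copy()
    for key in keys:
        # Bind key as a default argument so each lambda keeps its own key.
        out[key] = parsed.apply(lambda d, k=key: d.get(k, default))
    return out


# Tiny stand-in for the `device` column of the raw data.
demo = pd.DataFrame({
    "device": [
        '{"browser": "Chrome", "isMobile": false}',
        '{"browser": "Safari", "isMobile": true}',
    ]
})
demo = expand_json_column(demo, "device", ["browser", "isMobile"])
print(demo[["browser", "isMobile"]])
```

The helper returns a copy rather than modifying its input, so the raw frame stays available for comparison.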

In [6]:
#dict_cols = ['device','geoNetwork','totals','trafficSource']

In [7]:
#train['device'] = train.device.apply(json.loads)

Starting with `device`, the first of the dictionary columns (stored in string form), I will keep only the features whose value is not `not available in demo dataset`.

In [6]:
train.loc[0,'device']

'{"browser": "Chrome", "browserVersion": "not available in demo dataset", "browserSize": "not available in demo dataset", "operatingSystem": "Windows", "operatingSystemVersion": "not available in demo dataset", "isMobile": false, "mobileDeviceBranding": "not available in demo dataset", "mobileDeviceModel": "not available in demo dataset", "mobileInputSelector": "not available in demo dataset", "mobileDeviceInfo": "not available in demo dataset", "mobileDeviceMarketingName": "not available in demo dataset", "flashVersion": "not available in demo dataset", "language": "not available in demo dataset", "screenColors": "not available in demo dataset", "screenResolution": "not available in demo dataset", "deviceCategory": "desktop"}'

In [7]:
# Apply json.loads to convert each JSON string to a dictionary
train['device'] = train['device'].apply(json.loads)

In [8]:
train.loc[0,'device']

{'browser': 'Chrome',
 'browserVersion': 'not available in demo dataset',
 'browserSize': 'not available in demo dataset',
 'operatingSystem': 'Windows',
 'operatingSystemVersion': 'not available in demo dataset',
 'isMobile': False,
 'mobileDeviceBranding': 'not available in demo dataset',
 'mobileDeviceModel': 'not available in demo dataset',
 'mobileInputSelector': 'not available in demo dataset',
 'mobileDeviceInfo': 'not available in demo dataset',
 'mobileDeviceMarketingName': 'not available in demo dataset',
 'flashVersion': 'not available in demo dataset',
 'language': 'not available in demo dataset',
 'screenColors': 'not available in demo dataset',
 'screenResolution': 'not available in demo dataset',
 'deviceCategory': 'desktop'}
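Rather than picking the useful keys by eye, they could be found by scanning the parsed dictionaries. This is a minimal sketch on made-up records; `informative_keys` is a hypothetical helper, and nested values (such as a dictionary inside a dictionary) would count as informative here, so they would still need a manual look.

```python
PLACEHOLDER = "not available in demo dataset"


def informative_keys(records, placeholder=PLACEHOLDER):
    """Return keys that have at least one value other than the placeholder."""
    keep = {}  # a dict preserves first-seen order of the informative keys
    for record in records:
        for key, value in record.items():
            if value != placeholder:
                keep[key] = True
    return list(keep)


# Made-up records shaped like the parsed `device` dictionaries.
records = [
    {"browser": "Chrome", "browserVersion": PLACEHOLDER,
     "isMobile": False, "deviceCategory": "desktop"},
    {"browser": "Firefox", "browserVersion": PLACEHOLDER,
     "isMobile": False, "deviceCategory": "desktop"},
]
kept = informative_keys(records)
print(kept)
```

On the real data the same function could be given the parsed column itself, since iterating a pandas Series yields its values.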

In [9]:
device_cols = ['browser','operatingSystem','isMobile','deviceCategory']

In [10]:
for c in device_cols:
    train[c] = train['device'].apply(lambda x: x.get(c,np.nan))

In [11]:
train[device_cols]

Unnamed: 0,browser,operatingSystem,isMobile,deviceCategory
0,Chrome,Windows,False,desktop
1,Firefox,Macintosh,False,desktop
2,Chrome,Windows,False,desktop
3,UC Browser,Linux,False,desktop
4,Chrome,Android,True,mobile
...,...,...,...,...
903648,Chrome,Windows,False,desktop
903649,Chrome,Android,True,mobile
903650,Android Webview,Android,True,mobile
903651,Chrome,Windows,False,desktop


In [12]:
train[device_cols].dtypes

browser            object
operatingSystem    object
isMobile             bool
deviceCategory     object
dtype: object

That looks almost perfect for the `device` column. Eventually the `isMobile` column will need to be converted to 0s and 1s, so I'll do that now.

In [13]:
train['isMobile'] = train['isMobile']*1
train['isMobile'].mean()

0.26461816648647213

Done.  It looks like 26% of the records come from mobile devices.

Now I can process the `geoNetwork` column the same way.

In [14]:
train['geoNetwork'] = train['geoNetwork'].apply(json.loads)
train.loc[0,'geoNetwork']

{'continent': 'Asia',
 'subContinent': 'Western Asia',
 'country': 'Turkey',
 'region': 'Izmir',
 'metro': '(not set)',
 'city': 'Izmir',
 'cityId': 'not available in demo dataset',
 'networkDomain': 'ttnet.com.tr',
 'latitude': 'not available in demo dataset',
 'longitude': 'not available in demo dataset',
 'networkLocation': 'not available in demo dataset'}

In [15]:
geoNet_cols = ['continent','subContinent','country','region','metro','city','networkDomain']

In [16]:
for c in geoNet_cols:
    train[c] = train['geoNetwork'].apply(lambda x: x.get(c,np.nan))

In [17]:
train[geoNet_cols].dtypes

continent        object
subContinent     object
country          object
region           object
metro            object
city             object
networkDomain    object
dtype: object

In [18]:
train[geoNet_cols].isna().sum()

continent        0
subContinent     0
country          0
region           0
metro            0
city             0
networkDomain    0
dtype: int64

Job done: the column types are correct and there are no nulls. Now I'll skip `totals` for the moment and deal with `trafficSource`.

In [19]:
train['trafficSource'] = train['trafficSource'].apply(json.loads)
train.loc[0,'trafficSource']

{'campaign': '(not set)',
 'source': 'google',
 'medium': 'organic',
 'keyword': '(not provided)',
 'adwordsClickInfo': {'criteriaParameters': 'not available in demo dataset'}}

In [20]:
traffic_cols = ['campaign','source','medium','keyword']

In [21]:
for c in traffic_cols:
    train[c] = train['trafficSource'].apply(lambda x: x.get(c,np.nan))

In [22]:
train[traffic_cols].dtypes

campaign    object
source      object
medium      object
keyword     object
dtype: object

In [23]:
train[traffic_cols].isna().sum()/train.shape[0]

campaign    0.000000
source      0.000000
medium      0.000000
keyword     0.556551
dtype: float64

Job done, but about 56% of the new `keyword` column is null, so I will drop it now.

In [24]:
train = train.drop('keyword',axis=1)

Now for `totals`, which also contains the target for this project, `transactionRevenue`: the amount of money spent at the GStore. I assume that where it is absent from a record, no purchase was made.

In [25]:
train['totals'] = train['totals'].apply(json.loads)
train.loc[0,'totals']

{'visits': '1',
 'hits': '1',
 'pageviews': '1',
 'bounces': '1',
 'newVisits': '1'}

In [26]:
totals_cols = ['visits','hits','pageviews','bounces','newVisits','transactionRevenue']

In [27]:
for c in totals_cols:
    train[c] = train['totals'].apply(lambda x: x.get(c,0))

In [28]:
### Correct the data type for the latest created columns
for c in totals_cols:
    train[c] = pd.to_numeric(train[c])

In [29]:
### Quick look at the transactionRevenue by top spenders
train.groupby('fullVisitorId')[['transactionRevenue']].sum().sort_values('transactionRevenue',ascending=False).head(10)

Unnamed: 0_level_0,transactionRevenue
fullVisitorId,Unnamed: 1_level_1
1957458976293878100,77113430000
5632276788326171571,16023750000
9417857471295131045,15170120000
4471415710206918415,11211100000
4984366501121503466,9513900000
9089132392240687728,8951970000
9029794295932939024,7846350000
7463172420271311409,7225100000
7311242886083854158,7143250000
79204932396995037,7047150000


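Whoa, these numbers look way too big... 77 billion dollars?

I found a discussion thread saying that the values are actually the dollar amounts multiplied by $10^6$, which would put the top spender at roughly 77 thousand dollars. I will leave the numbers as they are for now, but will need to rescale them for plotting later.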
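As a quick sanity check on that scaling, the arithmetic for the top spender can be worked through directly. The factor of one million is the assumption taken from the discussion thread; the raw value is the first row of the table above.

```python
import math

# Assumption from the discussion thread: raw values are dollars times 10^6.
SCALE = 1_000_000

top_spender_raw = 77113430000  # largest per-user total in the table above
top_spender_usd = top_spender_raw / SCALE

print(f"{top_spender_usd:,.2f} dollars")

# ln(y + 1) for the same user, on the raw scale and on the dollar scale.
print(math.log1p(top_spender_raw), math.log1p(top_spender_usd))
```

Rescaling changes the log target by roughly a constant offset for large values, so which scale to model on is a separate decision from how to plot.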

Now I can drop the four columns from which I extracted sub-columns.

In [30]:
train = train.drop(['device','geoNetwork','totals','trafficSource'],axis=1)

In [31]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903653 entries, 0 to 903652
Data columns (total 28 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   channelGrouping       903653 non-null  object        
 1   date                  903653 non-null  datetime64[ns]
 2   fullVisitorId         903653 non-null  object        
 3   sessionId             903653 non-null  object        
 4   socialEngagementType  903653 non-null  object        
 5   visitId               903653 non-null  int64         
 6   visitNumber           903653 non-null  int64         
 7   visitStartTime        903653 non-null  int64         
 8   browser               903653 non-null  object        
 9   operatingSystem       903653 non-null  object        
 10  isMobile              903653 non-null  int32         
 11  deviceCategory        903653 non-null  object        
 12  continent             903653 non-null  object        
 13 

Is there some redundancy in the visit-related columns? I'll take a look.

In [32]:
train[['fullVisitorId','visitId','visitNumber','visitStartTime','visits','hits','pageviews','bounces','newVisits','transactionRevenue']].sort_values('fullVisitorId')

Unnamed: 0,fullVisitorId,visitId,visitNumber,visitStartTime,visits,hits,pageviews,bounces,newVisits,transactionRevenue
230774,0000010278554503158,1477029466,1,1477029466,1,11,8,0,1,0
89784,0000020424342248747,1480578901,1,1480578901,1,17,13,0,1,0
683463,0000027376579751715,1486866293,1,1486866293,1,6,5,0,1,0
648840,0000039460501403861,1490629516,1,1490629516,1,2,2,0,1,0
683316,0000040862739425590,1486838824,2,1486838824,1,3,3,0,0,0
...,...,...,...,...,...,...,...,...,...,...
182220,999997225970956660,1471672810,1,1471672810,1,1,1,1,1,0
796213,999997550040396460,1491708533,1,1491708533,1,1,1,1,1,0
812150,999997550040396460,1492640838,2,1492640838,1,2,2,0,0,0
617541,9999978264901065827,1485325580,1,1485325580,1,1,1,1,1,0


For my own reference, here are a few definitions for these columns from www.tendenci.com:

"""
1. __Visit__ - This is the one piece of information that you really want to know. A visit is one individual visitor who arrives at your web site and proceeds to browse. A visit counts all visitors, no matter how many times the same visitor may have been to your site.
 
2. __Unique Visit/New Visit__ - This is also called Visit by Cookie. A unique visit will tell you which visits from item 1 are visiting your site for the first time. The website can track this as unique by the IP address of the computer. *The number of unique visits will be far less that visits because a unique visit is only tracked if cookies are enabled on the visitors computer*
 
3. __Page View__ - This is also called Impression.  Once a visitor arrives at your website, they will search around on a few more pages. On average, a visitor will look at about 2.5 pages. Each individual page a visitor views is tracked as a page view.
 
4. __Hits__ - The real Black Sheep in the family. The average website owner thinks that a hit means a visit but it is very different (see item 1).  A Hit actually refers to the number of files downloaded on your site, this could include photos, graphics, etc. Picture the average web page, it has photos (each photo is a file and hence a hit) and lots of buttons (each button is a file and hence a hit). On average, each page will include 15 hits.
 
To give you an example -  Using the average statistics listed above, 1 Visit to an average web site will generate 3 Page Views and 45 Hits.
 
"""

Also from webtrafficgeeks.org

__Bounce__ = user left the website without clicking further


In [33]:
train[train.visitNumber>1][['fullVisitorId','visitNumber','newVisits']].head(10)

Unnamed: 0,fullVisitorId,visitNumber,newVisits
4,27294437909732085,2,0
45,239575343807682372,3,0
46,7356002680834488802,2,0
63,854783508496317255,5,0
67,3746051970600816343,2,0
69,8720204952657722494,2,0
87,4555387715496410320,2,0
116,4100806472252896384,2,0
127,5242592678391575004,2,0
132,4784477552492518396,11,0


In [34]:
train[(train.fullVisitorId=='0949718915643445721')].sort_values('visitNumber')

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,browser,operatingSystem,...,networkDomain,campaign,source,medium,visits,hits,pageviews,bounces,newVisits,transactionRevenue
758490,Direct,2016-08-06,0949718915643445721,0949718915643445721_1470518725,Not Socially Engaged,1470518725,18,1470518725,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
758495,Direct,2016-08-06,0949718915643445721,0949718915643445721_1470546023,Not Socially Engaged,1470546023,19,1470546023,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
893927,Direct,2016-08-07,0949718915643445721,0949718915643445721_1470617387,Not Socially Engaged,1470617387,20,1470617387,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
11685,Direct,2016-08-11,0949718915643445721,0949718915643445721_1470928701,Not Socially Engaged,1470928701,21,1470928701,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
853865,Direct,2016-08-18,0949718915643445721,0949718915643445721_1471573251,Not Socially Engaged,1471573251,22,1471573251,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871784,Direct,2016-10-05,0949718915643445721,0949718915643445721_1475675614,Not Socially Engaged,1475675614,141,1475675614,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
51987,Direct,2016-10-14,0949718915643445721,0949718915643445721_1476452567,Not Socially Engaged,1476452567,142,1476452567,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,2,2,0,0,0
245608,Direct,2016-10-15,0949718915643445721,0949718915643445721_1476551760,Not Socially Engaged,1476551760,143,1476551760,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0
245744,Direct,2016-10-15,0949718915643445721,0949718915643445721_1476587564,Not Socially Engaged,1476587564,144,1476587564,Chrome,Macintosh,...,qwest.net,(not set),(direct),(none),1,1,1,1,0,0


After exploring a little: whenever `visitNumber` is > 1, `newVisits` is 0, so `visitNumber` and `newVisits` appear to contain redundant information. I'll leave both for now. In EDA I will check whether `newVisits` carries any additional information about the target.
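The spot checks above could be turned into an explicit count of rows that break the pattern. The sketch below uses a made-up five-row frame with the same two column names; on the real data the same expressions would be applied to `train`.

```python
import pandas as pd

# Made-up rows that follow the pattern seen in the spot checks.
visits = pd.DataFrame({
    "visitNumber": [1, 2, 3, 1, 11],
    "newVisits": [1, 0, 0, 1, 0],
})

is_repeat = visits["visitNumber"] > 1

# Rows that would contradict the suspected redundancy.
repeat_flagged_new = (is_repeat & (visits["newVisits"] == 1)).sum()
first_not_new = (~is_repeat & (visits["newVisits"] == 0)).sum()

# Cross-tabulation of the two flags for a quick overview.
table = pd.crosstab(is_repeat, visits["newVisits"], rownames=["repeat visit"])
print(table)
print(repeat_flagged_new, first_not_new)
```

If both counts are zero on the full dataset, `newVisits` is fully determined by `visitNumber`; a non-zero `first_not_new` would mean the flag carries something extra.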

In [35]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 903653 entries, 0 to 903652
Data columns (total 28 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   channelGrouping       903653 non-null  object        
 1   date                  903653 non-null  datetime64[ns]
 2   fullVisitorId         903653 non-null  object        
 3   sessionId             903653 non-null  object        
 4   socialEngagementType  903653 non-null  object        
 5   visitId               903653 non-null  int64         
 6   visitNumber           903653 non-null  int64         
 7   visitStartTime        903653 non-null  int64         
 8   browser               903653 non-null  object        
 9   operatingSystem       903653 non-null  object        
 10  isMobile              903653 non-null  int32         
 11  deviceCategory        903653 non-null  object        
 12  continent             903653 non-null  object        
 13 

In [36]:
train[train.fullVisitorId=='3694640209777929241']

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,browser,operatingSystem,...,networkDomain,campaign,source,medium,visits,hits,pageviews,bounces,newVisits,transactionRevenue
711,Referral,2016-09-02,3694640209777929241,3694640209777929241_1472846444,Not Socially Engaged,1472846444,9,1472846444,Chrome,Linux,...,(not set),(not set),mall.googleplex.com,referral,1,10,10,0,0,0
1528,Referral,2016-09-02,3694640209777929241,3694640209777929241_1472850328,Not Socially Engaged,1472850328,10,1472850328,Chrome,Linux,...,(not set),(not set),mall.googleplex.com,referral,1,1,1,1,0,0
221043,Referral,2016-09-06,3694640209777929241,3694640209777929241_1473186538,Not Socially Engaged,1473186538,11,1473186538,Chrome,Linux,...,(not set),(not set),mall.googleplex.com,referral,1,1,1,1,0,0


The dataset looks like it is ready for EDA.  I will export the cleaned set now. 

In [41]:
train.to_csv('train_wrangled.csv')

Done and ready for EDA.
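One thing to watch when reloading the exported file: `fullVisitorId` needs the same `dtype` override used on the original read, otherwise the leading zeros are lost. The sketch below demonstrates this on an in-memory CSV with two IDs taken from the tables above; the `hits` values are only filler.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "fullVisitorId": ["0000010278554503158", "999997225970956660"],
    "hits": [11, 1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing infers an integer column and drops the leading zeros.
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Forcing an object dtype keeps the IDs as the original strings.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"fullVisitorId": "object"})

print(as_default["fullVisitorId"].tolist())
print(as_text["fullVisitorId"].tolist())
```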