# Feature Engineering and Pre-processing
### Blaine Murphy -- September 2021


In this notebook I need to:
- Encode categorical columns
- Calculate the information value of every column with respect to whether there was a purchase, and cull columns that don't provide any information value
- Measure multicollinearity among the features and remove those that don't offer additional information
- Format the data into a tensor for input to an RNN model
    - Find the maximum number of repeated visits
    - Decide whether the features need scaling
    - Aggregate users and reshape the data to [batch, maximum number of repeat visits, # of predictive features]
    - The second dimension will house the purchase amount of each visit and the third dimension will hold all of the predictive features
    - Split users into train, validation, and test sets

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
os.chdir(r'D:\Springboard\Capstone 3 maybe\Google Analytics')
os.listdir()

['e390021b-3cdc-4df0-b92e-22082e3ad15b_Data.csv',
 'GDP by Country world bank.csv',
 'life expectency by country_world bank.csv',
 'Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2763936.csv',
 'Metadata_Country_API_SP.DYN.LE00.IN_DS2_en_csv_v2_2764094.csv',
 'Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_2763937.csv',
 'Metadata_Indicator_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2763936.csv',
 'Metadata_Indicator_API_SP.DYN.LE00.IN_DS2_en_csv_v2_2764094.csv',
 'Metadata_Indicator_API_SP.POP.TOTL_DS2_en_csv_v2_2763937.csv',
 'Population by country world bank.csv',
 'sample_submission.csv',
 'sample_submission_v2.csv',
 'test.csv',
 'test_v2.csv',
 'train.csv',
 'train_eda.csv',
 'train_v2.csv',
 'train_wrangled.csv',
 'worldpopulationreview.com']

In [3]:
train = pd.read_csv('train_eda.csv',index_col=0,parse_dates=['date','dateTime'])
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 901907 entries, 0 to 903652
Data columns (total 50 columns):
 #   Column                                   Non-Null Count   Dtype         
---  ------                                   --------------   -----         
 0   channelGrouping                          901907 non-null  object        
 1   date                                     901907 non-null  datetime64[ns]
 2   fullVisitorId                            901907 non-null  object        
 3   sessionId                                901907 non-null  object        
 4   visitId                                  901907 non-null  int64         
 5   visitNumber                              901907 non-null  int64         
 6   visitStartTime                           901907 non-null  int64         
 7   browser                                  901907 non-null  object        
 8   operatingSystem                          901907 non-null  object        
 9   isMobile                  

Two columns need fixing. `keyword` is mostly null, and I already created new columns from it based on the most frequent search words that resulted in a purchase, so I can drop it now.

`timeDiffLastVisit` is mostly null because most of the records are new visits, not repeat visits. I will fill all of those nulls with the maximum of the `timeDiffLastVisit` column.


In [4]:
train.drop('keyword',axis=1,inplace=True)
train['timeDiffLastVisit'] = train.timeDiffLastVisit.fillna(train.timeDiffLastVisit.max())


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 901907 entries, 0 to 903652
Data columns (total 49 columns):
 #   Column                                   Non-Null Count   Dtype         
---  ------                                   --------------   -----         
 0   channelGrouping                          901907 non-null  object        
 1   date                                     901907 non-null  datetime64[ns]
 2   fullVisitorId                            901907 non-null  object        
 3   sessionId                                901907 non-null  object        
 4   visitId                                  901907 non-null  int64         
 5   visitNumber                              901907 non-null  int64         
 6   visitStartTime                           901907 non-null  int64         
 7   browser                                  901907 non-null  object        
 8   operatingSystem                          901907 non-null  object        
 9   isMobile                  

### One hot encoding of categorical columns

In [6]:
"','".join(list(train.columns))

"channelGrouping','date','fullVisitorId','sessionId','visitId','visitNumber','visitStartTime','browser','operatingSystem','isMobile','deviceCategory','continent','subContinent','country','region','metro','city','campaign','source','hits','pageviews','bounces','newVisits','transactionRevenue','dollars','purchase','dateTime','hour','weekend','month','time','timeDiffLastVisit','countryPopulation','countryGDP','countryLE','state','keywordStore','keywordMerch','keywordShirt','keywordGoogle','keywordYoutube','keywordShop','keywordApparel','keyword_(Remarketing/Content targeting)','keyword_(User vertical targeting)','keyword_(automatic matching)','keyword_(content targeting)','keyword_(not provided)','keyword_search"

In [7]:
num_cols = ['isMobile','hits','pageviews','bounces','newVisits',\
        'hour','weekend','month','time','timeDiffLastVisit',\
        'countryPopulation','countryGDP','countryLE','keywordStore',\
        'keywordMerch','keywordShirt','keywordGoogle',\
        'keywordYoutube','keywordShop','keywordApparel',\
        'keyword_(Remarketing/Content targeting)',\
        'keyword_(User vertical targeting)','keyword_(automatic matching)',\
        'keyword_(content targeting)','keyword_(not provided)','keyword_search']

In [8]:
dummy_cols = ['channelGrouping','browser','operatingSystem',\
              'deviceCategory','subContinent','country','city','state',\
              'campaign','source']

In [9]:
X = pd.concat([train[num_cols],pd.get_dummies(train[dummy_cols])], axis=1)

### Information value
I calculate the information value (IV) of each feature with respect to whether or not there was a purchase.
I will keep only the columns with an IV above 0.01 and drop the rest.
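
For reference, here is a minimal sketch of the weight of evidence (WOE) and IV calculation for a single feature that is already binned. The helper name `information_value` and the toy data are illustrative only and are not part of this notebook. The sign convention (events over non-events) and the zeroing of infinite terms are chosen to match the functions in the next cell.

```python
import numpy as np
import pandas as pd


def information_value(feature, target):
    """Return the IV of a discrete (already binned) feature against a 0/1 target.

    Hypothetical helper for illustration; not part of the notebook.
    """
    df = pd.DataFrame({"x": np.asarray(feature), "y": np.asarray(target)})
    grouped = df.groupby("x")["y"]
    events = grouped.sum()                 # purchases per bin
    non_events = grouped.count() - events  # non-purchases per bin

    dist_event = events / events.sum()
    dist_non_event = non_events / non_events.sum()

    # Weight of evidence per bin. Bins with no events or no non-events
    # produce +/-inf, which is replaced by 0 below.
    with np.errstate(divide="ignore"):
        woe = np.log(dist_event / dist_non_event)
    contrib = (dist_event - dist_non_event) * woe
    contrib = contrib.replace([np.inf, -np.inf], 0.0)
    return float(contrib.sum())


# Toy data: purchases are more common when the flag is 1.
bought = [0, 0, 0, 1, 0, 1, 1, 1]
flag = [0, 0, 0, 0, 1, 1, 1, 1]
# A second flag that splits purchases evenly carries no information.
even_flag = [0, 0, 1, 0, 1, 0, 1, 1]

iv_flag = information_value(flag, bought)
iv_even = information_value(even_flag, bought)
```

In the toy data `flag` separates purchases from non-purchases fairly well and gets a clearly positive IV, while `even_flag` has the same purchase rate in both bins and gets an IV of zero.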

In [10]:
### Functions to calculate the Information Value (IV) of each feature


max_bin = 20
force_bin = 3
import pandas.core.algorithms as algos
from scipy import stats
import re


### Binning function for numeric features: bins the feature, then
### calculates the Weight of Evidence (WOE) of each bin against the target
### and the Information Value (IV) of the feature
def mono_bin(Y, X, n = max_bin):    
    ### create new dataframe of series feature and series y
    df1 = pd.DataFrame({"X": X, "Y": Y})
    
    ### check to see if any nulls in feature and separate nulls out
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]
    #print("justmiss", justmiss)
    #print("notmiss", notmiss)
    
    
    ### Loop until the bucket means of the feature and target are perfectly
    ### monotonic, i.e. the Spearman correlation satisfies abs(r) == 1,
    ### reducing the number of bins by one on each pass
    
    r = 0
    while np.abs(r) < 1:
        ### Try creation of new dataframe with max or less bin size 'n'
        ### create dataframe with feature, target and binned feature 
        ### create 'd2' group by object on 'Bucket'
        ### calculate Spearman correlation 'r' and p-value from mean of feature and target
        ### If exception reduce bin number by one and try again
        ### Effectively finding the max bin number that can be used 
        ### for calculating WOE
        try:
            d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y,\
                               "Bucket": pd.qcut(notmiss.X, n)})
            d2 = d1.groupby('Bucket', as_index=True)
                             
            r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
            #print("I am here 1",r, n,len(d2))
            n = n - 1 
            
            
        except Exception as e:
            n = n - 1
            #print("I am here e",n)

    ### If length of d2 is 1 (ie 1 bucket for all of feature) do this
    if len(d2) == 1:
        #print("I am second step ",r, n)
        
        ### force 'n' to 3 and calculate quantiles of feature from (0,0.5,1) 
        ### to be used as bins, if not 3 unique because of heavily skewed data
        ### manually create bin
        n = force_bin         
        bins = np.quantile(notmiss.X, np.linspace(0, 1, n))
        if len(np.unique(bins)) == 2:
            bins = np.insert(bins, 0, 1)
            bins[1] = bins[1]-(bins[1]/2)
            
        ### Create new dataframe bucketed by manual bins
        d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.cut(notmiss.X, np.unique(bins),include_lowest=True)}) 
        d2 = d1.groupby('Bucket', as_index=True)
    
    ### Create new dataframe from aggregating the binned dataframe
    d3 = pd.DataFrame({},index=[])
    d3["MIN_VALUE"] = d2.min().X
    d3["MAX_VALUE"] = d2.max().X
    d3["COUNT"] = d2.count().Y
    d3["EVENT"] = d2.sum().Y
    d3["NONEVENT"] = d2.count().Y - d2.sum().Y
    d3=d3.reset_index(drop=True)
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        #print(justmiss.count().Y)
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)
    
    ### add more features to d3 describing the 'events' of the target
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    print(np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT))
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]       
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    
    return(d3)

def char_bin(Y, X):
        
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]    
    df2 = notmiss.groupby('X',as_index=True)
    d3 = pd.DataFrame({},index=[])
    d3["COUNT"] = df2.count().Y
    d3["MIN_VALUE"] = df2.sum().Y.index
    d3["MAX_VALUE"] = d3["MIN_VALUE"]
    d3["EVENT"] = df2.sum().Y
    d3["NONEVENT"] = df2.count().Y - df2.sum().Y
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)
    
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]      
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    #print("hi",d3.IV )
    d3 = d3.reset_index(drop=True)
    
    return(d3)

def data_vars(df1, target):
    import traceback
    
    
    ### Inspect the call stack (not an error traceback) to read the source
    ### line that called this function, e.g. `data_vars(X,y)`
    stack = traceback.extract_stack()
    filename, lineno, function_name, code = stack[-2]
    
    ### Parse the argument list of that call and keep the last name,
    ### i.e. the target variable, so a column with that name is skipped below
    vars_name = re.compile(r'\((.*?)\).*$').search(code).groups()[0]
    final = (re.findall(r"[\w']+", vars_name))[-1]
    
    
    ### get column names from df1
    x = df1.dtypes.index
    
    count = -1
    ### Loop through columns
    for i in x:
        print(i)
        if i.upper() not in (final.upper()):
            ### test if numeric and not a one-hot encoding
            if np.issubdtype(df1[i].dtype, np.number) and len(pd.Series.unique(df1[i])) > 2:
                #print("Number and unique value greater than 2")
                ###  pass target and feature to 'mono_bin'
                conv = mono_bin(target, df1[i])
                
                ### assign feature name to 'conv'
                conv["VAR_NAME"] = i
                count = count + 1
            else:
                #print("I am here 2")
                ###  pass target and feature to 'char_bin'
                conv = char_bin(target, df1[i])
                conv["VAR_NAME"] = i            
                count = count + 1

            ### First time run through the loop where count==0,
            ### create new df from current
            if count == 0:
                iv_df = conv
            ### on subsequent loops append rows to bottom of 'iv_df'
            ### of next feature and scoring calcs
            else:
                iv_df = pd.concat([iv_df, conv], ignore_index=True)
    ### aggregate 'iv_df' taking the maximum IV for each feature
    ### (every row of a feature carries the same total IV) and create
    ### a new summary df with columns 'VAR_NAME' & 'IV'
    iv = pd.DataFrame({'IV':iv_df.groupby('VAR_NAME').IV.max()})
    iv = iv.reset_index()
    ### return detailed df with all computed features
    ### and summary of only max values and feature names
    return(iv_df,iv)

In [11]:
y = train.purchase
#X = pd.get_dummies(train[['channelGrouping']])
#X = pd.get_dummies(train[['city']])

iv_df, iv = data_vars(X,y)

isMobile
hits
0   -7.540655
1    1.064957
dtype: float64
pageviews


0        -inf
1   -5.992563
2    1.390519
dtype: float64
bounces
newVisits
hour
0   -0.277721
1   -0.166819
2    0.390857
dtype: float64
weekend
month
0    0.062175
1   -0.070128
dtype: float64
time
0   -0.454482
1    0.313858
dtype: float64
timeDiffLastVisit
0   -0.831835
1    0.000113
dtype: float64
countryPopulation
0   -2.438479
1    0.700114
dtype: float64
countryGDP
0   -2.518802
1    0.707667
dtype: float64
countryLE
0    0.260456
1   -2.197908
dtype: float64
keywordStore
keywordMerch
keywordShirt
keywordGoogle
keywordYoutube
keywordShop


keywordApparel
keyword_(Remarketing/Content targeting)
keyword_(User vertical targeting)
keyword_(automatic matching)
keyword_(content targeting)
keyword_(not provided)
keyword_search
channelGrouping_(Other)
channelGrouping_Affiliates
channelGrouping_Direct
channelGrouping_Display
channelGrouping_Organic Search
channelGrouping_Paid Search
channelGrouping_Referral
channelGrouping_Social
browser_Amazon Silk
browser_Android Webview
browser_Chrome
browser_Edge
browser_Firefox
browser_Internet Explorer
browser_Opera
browser_Other
browser_Safari
browser_Safari (in-app)
operatingSystem_Android
operatingSystem_Chrome OS
operatingSystem_Linux
operatingSystem_Macintosh
operatingSystem_Other
operatingSystem_Windows
operatingSystem_Windows Phone
operatingSystem_iOS
deviceCategory_desktop
deviceCategory_mobile
deviceCategory_tablet
subContinent_Australasia
subContinent_Caribbean
subContinent_Central America
subContinent_Central Asia
subContinent_Eastern Africa
subContinent_Eastern Asia
subContinent

city_Fort Collins
city_Fort Worth
city_Fortaleza
city_Frankfurt
city_Frederiksberg
city_Fremont
city_Fresno
city_Fukui
city_Furth
city_Gatineau
city_Gaziantep
city_Geneva
city_Ghent
city_Gijon
city_Glasgow
city_Goiania
city_Goose Creek
city_Gothenburg
city_Granada
city_Greer
city_Greve Strand
city_Greystones
city_Groningen
city_Guadalajara
city_Guangzhou
city_Guatemala City
city_Gurgaon
city_Guwahati
city_Ha Tinh
city_Hai Duong
city_Hai Phong
city_Hallein
city_Hamburg
city_Hamden
city_Hangzhou
city_Hanoi
city_Hayward
city_Helsinki
city_Hermosillo
city_Herzliya
city_Ho Chi Minh City
city_Hoi An
city_Hong Kong
city_Honolulu
city_Houston
city_Hradec Kralove
city_Hua Hin
city_Hue
city_Hukou Township
city_Hyderabad
city_Iasi
city_Indianapolis
city_Indore
city_Ipoh
city_Irvine
city_Islamabad
city_Issy-les-Moulineaux
city_Istanbul
city_Izmir
city_Jacksonville
city_Jaipur
city_Jakarta
city_Jeddah
city_Jersey City
city_Johnson City
city_Johor Bahru
city_Kalamazoo
city_Kampar
city_Kansas City
ci

source_pinterest.com
source_plus.google.com
source_quora.com
source_reddit.com
source_search.myway.com
source_search.xfinity.com
source_seroundtable.com
source_siliconvalley.about.com
source_sites.google.com
source_t.co
source_trainup.withgoogle.com
source_us-mg5.mail.yahoo.com
source_yahoo
source_youtube.com


In [12]:
iv.sort_values('IV',ascending=False)

| | VAR_NAME | IV |
|---|---|---|
| 905 | hits | 5.636926e+00 |
| 931 | pageviews | 2.501622e+00 |
| 1026 | subContinent_Northern America | 2.013279e+00 |
| 892 | country_United States | 1.883897e+00 |
| 678 | countryGDP | 1.566860e+00 |
| ... | ... | ... |
| 17 | campaign_Data Share | 1.261294e-12 |
| 866 | country_St. Martin | 1.261294e-12 |
| 828 | country_Norfolk Island | 1.261294e-12 |
| 863 | country_St. Barthélemy | 1.261294e-12 |


In [13]:
iv[iv.IV>0.01].sort_values('IV',ascending=False)

| | VAR_NAME | IV |
|---|---|---|
| 905 | hits | 5.636926 |
| 931 | pageviews | 2.501622 |
| 1026 | subContinent_Northern America | 2.013279 |
| 892 | country_United States | 1.883897 |
| 678 | countryGDP | 1.566860 |
| ... | ... | ... |
| 206 | city_Dublin | 0.011650 |
| 700 | country_Belgium | 0.011110 |
| 860 | country_South Korea | 0.011066 |
| 419 | city_New Delhi | 0.010398 |


### Now assessing multicollinearity with variance inflation factor

I am keeping variables with a high IV (>0.8) because the problem is not just to predict whether there is a purchase, but also the size of the purchases for each user. `hits` and `pageviews` are two of only a small number of continuous variables in the data.
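
As a reminder of what the variance inflation factor measures, here is a minimal sketch on toy data: the VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others. The helper `vif_by_regression` and the toy columns are hypothetical. The sketch adds an intercept to each regression, which is an assumption; as far as I know, statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so its numbers match this sketch only if the design matrix already contains a constant column.

```python
import numpy as np


def vif_by_regression(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns.

    Hypothetical helper for illustration. An intercept is included,
    which is an assumption of this sketch.
    """
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)             # independent of the others
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`
X_toy = np.column_stack([a, b, c])

vifs = [vif_by_regression(X_toy, i) for i in range(X_toy.shape[1])]
```

Columns `a` and `c` carry almost the same information, so both get a large VIF, while the independent column `b` stays close to 1.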

In [88]:
feats = list(iv.loc[iv.IV>0.01,'VAR_NAME'])
X2 = X[feats]


In [86]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def iterate_vif(df, vif_threshold=5, max_vif=6):
    count = 0
    while max_vif > vif_threshold:
        count += 1
        print("Iteration # "+str(count))
        
        ### Create data frame with features column and 
        ### variance inflation factor column
        vif = pd.DataFrame()
        vif["VIFactor"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
        vif["features"] = df.columns
        
        
        
        ### Drop the single feature with the largest VIF, then loop again;
        ### stop once every remaining VIF is at or below the threshold
        max_vif = vif['VIFactor'].max()
        if max_vif > vif_threshold:
            worst = vif.loc[vif['VIFactor'].idxmax(), 'features']
            print('Removing %s with VIF of %f' % (worst, max_vif))
            df = df.drop(worst, axis=1)
        else:
            print('Complete')
            return df, vif.sort_values('VIFactor')



In [89]:
#final_df,vif = iterate_vif(X2)

Iteration # 1
Removing countryGDP with VIF of 13665.770702
Iteration # 2
Removing channelGrouping_Affiliates with VIF of 8574.478605
Iteration # 3
Removing countryLE with VIF of 5546.885285
Iteration # 4
Removing time with VIF of 2557.517592
Iteration # 5
Removing source_Partners with VIF of 2048.434755
Iteration # 6
Removing isMobile with VIF of 984.920352
Iteration # 7
Removing state_New York with VIF of 415.740056
Iteration # 8
Removing deviceCategory_desktop with VIF of 169.377869
Iteration # 9
Removing keyword_(not provided) with VIF of 64.120647
Iteration # 10
Removing state_NotUS with VIF of 56.538822
Iteration # 11
Removing state_Illinois with VIF of 53.914906
Iteration # 12
Removing subContinent_Northern America with VIF of 49.690311
Iteration # 13
Removing pageviews with VIF of 41.735349
Iteration # 14
Removing countryPopulation with VIF of 40.893756
Iteration # 15
Removing deviceCategory_mobile with VIF of 40.438496
Iteration # 16
Removing operatingSystem_Windows with VIF of

That probably took about 12 hours... I'll print the features out in case I need them again.

In [129]:
print(feats)

['bounces', 'browser_Android Webview', 'browser_Firefox', 'browser_Opera', 'browser_Safari', 'browser_Safari (in-app)', 'campaign_Data Share Promo', 'channelGrouping_Referral', 'city_(not set)', 'city_Ann Arbor', 'city_Austin', 'city_Bangkok', 'city_Bengaluru', 'city_Cambridge', 'city_Chicago', 'city_Dublin', 'city_Istanbul', 'city_London', 'city_Los Angeles', 'city_Mountain View', 'city_Mumbai', 'city_New Delhi', 'city_New York', 'city_Paris', 'city_San Bruno', 'city_San Francisco', 'city_Seattle', 'city_Sunnyvale', 'city_Sydney', 'city_Tel Aviv-Yafo', 'city_Warsaw', 'city_not available in demo dataset', 'country_Argentina', 'country_Belgium', 'country_Brazil', 'country_Czechia', 'country_Denmark', 'country_France', 'country_Germany', 'country_India', 'country_Indonesia', 'country_Ireland', 'country_Israel', 'country_Italy', 'country_Japan', 'country_Malaysia', 'country_Mexico', 'country_Netherlands', 'country_Pakistan', 'country_Peru', 'country_Philippines', 'country_Poland', 'countr

In [14]:
final_df = X[['bounces', 'browser_Android Webview', 'browser_Firefox', 'browser_Opera', 'browser_Safari', 'browser_Safari (in-app)', 'campaign_Data Share Promo', 'channelGrouping_Referral', 'city_(not set)', 'city_Ann Arbor', 'city_Austin', 'city_Bangkok', 'city_Bengaluru', 'city_Cambridge', 'city_Chicago', 'city_Dublin', 'city_Istanbul', 'city_London', 'city_Los Angeles', 'city_Mountain View', 'city_Mumbai', 'city_New Delhi', 'city_New York', 'city_Paris', 'city_San Bruno', 'city_San Francisco', 'city_Seattle', 'city_Sunnyvale', 'city_Sydney', 'city_Tel Aviv-Yafo', 'city_Warsaw', 'city_not available in demo dataset', 'country_Argentina', 'country_Belgium', 'country_Brazil', 'country_Czechia', 'country_Denmark', 'country_France', 'country_Germany', 'country_India', 'country_Indonesia', 'country_Ireland', 'country_Israel', 'country_Italy', 'country_Japan', 'country_Malaysia', 'country_Mexico', 'country_Netherlands', 'country_Pakistan', 'country_Peru', 'country_Philippines', 'country_Poland', 'country_Romania', 'country_Russia', 'country_Singapore', 'country_South Korea', 'country_Spain', 'country_Sweden', 'country_Taiwan', 'country_Thailand', 'country_Turkey', 'country_Ukraine', 'country_United Kingdom', 'country_United States', 'deviceCategory_tablet', 'hits', 'hour', 'newVisits', 'operatingSystem_Android', 'operatingSystem_Chrome OS', 'operatingSystem_Linux', 'operatingSystem_Macintosh', 'operatingSystem_iOS', 'source_google', 'source_google.com', 'source_mall.googleplex.com', 'source_youtube.com', 'state_Massachusetts', 'state_Texas', 'state_Washington', 'subContinent_Australasia', 'subContinent_Northern Africa', 'subContinent_South America', 'subContinent_Southeast Asia', 'subContinent_Southern Europe', 'subContinent_Western Asia', 'weekend']]


In [15]:
feats = list(final_df.columns)
iv[iv.VAR_NAME.isin(feats)].sort_values('IV',ascending=False)

| | VAR_NAME | IV |
|---|---|---|
| 905 | hits | 5.636926 |
| 892 | country_United States | 1.883897 |
| 976 | source_youtube.com | 1.373643 |
| 958 | source_mall.googleplex.com | 0.892687 |
| 922 | newVisits | 0.686432 |
| ... | ... | ... |
| 206 | city_Dublin | 0.011650 |
| 700 | country_Belgium | 0.011110 |
| 860 | country_South Korea | 0.011066 |
| 419 | city_New Delhi | 0.010398 |


### Pre-processing for modelling
1. Put the data back together with the purchase amount, visitor id and visit number

In [16]:
df = pd.concat([ final_df,train[['fullVisitorId','visitNumber','dollars']] ], axis=1).sort_values(['fullVisitorId','visitNumber']).reset_index(drop=True)

In [17]:
df.head()

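| | bounces | browser_Android Webview | browser_Firefox | browser_Opera | browser_Safari | browser_Safari (in-app) | campaign_Data Share Promo | channelGrouping_Referral | city_(not set) | city_Ann Arbor | ... | subContinent_Australasia | subContinent_Northern Africa | subContinent_South America | subContinent_Southeast Asia | subContinent_Southern Europe | subContinent_Western Asia | weekend | fullVisitorId | visitNumber | dollars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4823595352351 | 1 | 0.0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 5103959234087 | 1 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10278554503158 | 1 | 0.0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 20424342248747 | 1 | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26722803385797 | 1 | 0.0 |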


In [18]:
print(df.fullVisitorId.value_counts().head())

824839726118485274     242
1856749147915772585    188
3608475193341679870    187
7634897085866546110    144
3269834865385146569    144
Name: fullVisitorId, dtype: int64


The maximum number of repeated visits is 242.

In [19]:
time_steps = 242 ### maximum number of repeated visits
features = 87 ### features for predicting each time step

Now I need to reshape the data to (users, time_steps, features), where each user's time series of GStore purchase amounts is right-aligned and padded with zeros on the left. So if a user has 3 visits, the time series will look something like [0,0,0,0,...,10,0,35].
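
A minimal sketch of that left-padding for one user, assuming zeros as the fill value. The helper `left_pad_visits` and the toy visits are hypothetical and only illustrate the target layout.

```python
import numpy as np


def left_pad_visits(visits, time_steps, fill_value=0.0):
    """Place a (n_visits, n_features) array at the end of a
    (time_steps, n_features) array, filling earlier steps with fill_value.

    Hypothetical helper; if a user has more visits than time_steps,
    only the most recent ones are kept.
    """
    visits = np.asarray(visits, dtype=float)[-time_steps:]
    out = np.full((time_steps, visits.shape[1]), fill_value)
    out[time_steps - len(visits):] = visits
    return out


# Toy user with 3 visits and 2 features per visit: [newVisit flag, dollars].
user_visits = [[1, 10.0], [0, 0.0], [0, 35.0]]
padded = left_pad_visits(user_visits, time_steps=6)
```

Here `padded` has shape (6, 2): the first three rows are padding and the last three rows are the user's visits in order, so the most recent visit always sits at the final time step.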

In [20]:
shape = tuple([df.fullVisitorId.nunique(),time_steps,features])
arr = np.full(shape,np.nan)

MemoryError: Unable to allocate 113. GiB for an array with shape (722668, 242, 87) and data type float64

There is a problem with my plan to model the data with an LSTM: my computer cannot hold the 3-dimensional array that is needed (113 GiB as float64).
___
But a TensorFlow model trains and processes in batches! Maybe I could separate the data into train, validation, and test sets by randomly choosing users. Then I could convert one batch of users at a time into the 3D time series format as input to the LSTM model... maybe with `tf.keras.preprocessing.sequence.TimeseriesGenerator`.
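
A rough sketch of that idea in plain NumPy and pandas, without `TimeseriesGenerator`. The helpers `split_users` and `user_batches`, the split fractions, and the toy frame are all assumptions for illustration. Only the users in the current batch are expanded to 3D, so the full tensor never has to exist in memory.

```python
import numpy as np
import pandas as pd


def split_users(user_ids, frac=(0.7, 0.15, 0.15), seed=0):
    """Shuffle the unique user ids and cut them into train/validation/test."""
    ids = np.array(sorted(set(user_ids)))
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n_train = int(round(frac[0] * len(ids)))
    n_val = int(round(frac[1] * len(ids)))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def user_batches(df, user_ids, feature_cols, time_steps, batch_size):
    """Yield (ids, array) pairs, where array has shape
    (users in batch, time_steps, n_features) and is left-padded with zeros.

    Rows of `df` are assumed to be sorted by visit order within each user.
    """
    user_ids = np.asarray(user_ids).tolist()
    for start in range(0, len(user_ids), batch_size):
        chunk = user_ids[start:start + batch_size]
        position = {uid: k for k, uid in enumerate(chunk)}
        batch = np.zeros((len(chunk), time_steps, len(feature_cols)))
        rows = df[df["fullVisitorId"].isin(chunk)]
        for uid, visits in rows.groupby("fullVisitorId"):
            values = visits[feature_cols].to_numpy(dtype=float)[-time_steps:]
            batch[position[uid], time_steps - len(values):] = values
        yield chunk, batch


# Toy frame with the same key columns as `df` above.
toy = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c", "c", "c"],
    "visitNumber": [1, 2, 1, 1, 2, 3],
    "hits": [3, 5, 1, 2, 2, 9],
    "dollars": [0.0, 10.0, 0.0, 0.0, 0.0, 35.0],
})
batches = list(user_batches(toy, ["a", "b", "c"], ["hits", "dollars"],
                            time_steps=4, batch_size=2))
train_ids, val_ids, test_ids = split_users(range(20))
```

Filtering the whole frame for every batch is simple but slow on 900k rows; grouping the frame once up front would likely be faster. A generator like this could in principle feed a Keras model batch by batch.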
___
I'll consult with Rahul. As a fallback I can use a standard deep learning model instead, with `visitNumber` as a feature. That still leaves the problem of how to collect all of each user's purchases at the end of the network and minimize that error.
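
To make the per-user error concrete, here is a small sketch that computes it outside the network, from per-visit predictions. The helper `per_user_squared_error` is hypothetical, and squared error on raw dollar totals is an assumption of this sketch, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd


def per_user_squared_error(user_ids, y_true, y_pred):
    """Sum actual and predicted dollars per user, then return the mean
    squared error between the two sets of per-user totals.

    Hypothetical helper; the error definition is an assumption.
    """
    totals = (pd.DataFrame({"user": user_ids, "true": y_true, "pred": y_pred})
              .groupby("user")[["true", "pred"]].sum())
    return float(((totals["true"] - totals["pred"]) ** 2).mean())


# Toy per-visit data for three users.
users = ["a", "a", "b", "c", "c", "c"]
actual = [0.0, 10.0, 0.0, 0.0, 0.0, 35.0]
predicted = [2.0, 6.0, 1.0, 5.0, 5.0, 25.0]
loss = per_user_squared_error(users, actual, predicted)
```

Per-visit errors can cancel within a user: user `c` is off on every visit but the predicted total matches the actual total, so that user contributes nothing to the loss. Doing the same grouping inside the model's loss function would need a segment-sum style operation from the framework.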
