<h3 align = 'center'><font color = '#28abe3'>Red Hat Data Pre-processing</font></h3>

Now that I have some ideas about how to treat the data, potential future predictors, etc., I need to convert it into a usable form, meaning I need to convert all the categorical data to something scikit-learn can handle. My first instinct would be to create a transformation of the categories to the natural numbers $$C \rightarrow \mathbb{N}$$
such that the result is monotonically increasing w.r.t. each category's empirical probability of a positive outcome, i.e. $$\forall \: i \: s.t. \: C_i \in C$$ we define a function f that approximates the empirical probability $$P(Outcome=1|X=C_i) \approx f(C_i) = \frac{len(\{Outcome = 1 \land X = C_i\})}{len(\{X = C_i\})}$$ and we create an ordered list $$Z = \{f(C_a),f(C_b),\ldots,f(C_l)\},\:f(C_a) < f(C_b) < \ldots < f(C_l)$$ and we define our transform $$C_i \rightarrow k$$ where k is the index in Z such that $$Z_k = f(C_i)$$.
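
The transform described above can be sketched with a pandas groupby. This is only an illustration on a made-up toy frame: the column names `X` and `outcome` are hypothetical, and ties (which the strict ordering above rules out) are given the same rank here.

```python
import pandas as pd

# Toy stand-in for one categorical column and the binary outcome.
# The column names "X" and "outcome" are hypothetical.
df = pd.DataFrame({
    "X":       ["a", "a", "b", "b", "b", "c", "c", "c", "c"],
    "outcome": [1, 1, 0, 0, 1, 0, 1, 1, 0],
})

# f(C_i): fraction of rows in category C_i with outcome == 1.
rate = df.groupby("X")["outcome"].mean()

# k: position of f(C_i) in the sorted list Z (0-based, ties share a rank).
rank = rate.rank(method="dense").astype(int) - 1

# Apply the transform C_i -> k to every row.
df["X_encoded"] = df["X"].map(rank)
print(rank.to_dict())  # {'a': 2, 'b': 0, 'c': 1}
```

If something like this were used for modelling, the rates would presumably need to be computed from training rows only, since they are built from the outcome itself.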

----

The problem with this, however, is that we are given a dataset of approximately 2 million data points. Done naively for every column, this is bordering on computationally infeasible. A much easier computation would be to map each $$C_i \rightarrow i$$. This may generate relationships that are harder to learn, but at least the computation is doable.
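
As a rough sketch of the simpler mapping $$C_i \rightarrow i$$, `pd.factorize` numbers categories in order of first appearance. The toy column below is made up; the cells that follow take a slightly different route and parse the integer out of each label.

```python
import pandas as pd

# Toy categorical column (values are made up).
col = pd.Series(["type 3", "type 7", "type 3", "type 1"])

# codes[j] is the integer assigned to col[j]; uniques lists the categories
# in the order they were first seen, so uniques[codes[j]] == col[j].
codes, uniques = pd.factorize(col)
print(codes.tolist())  # [0, 1, 0, 2]
```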

In [5]:
#import modules
import pandas as pd
import numpy as np
import os 
import sqlite3


In [47]:
#load the activity and people tables
act_train = pd.read_csv('act_train.csv')
peopleData = pd.read_csv('people.csv')


In [63]:
#let's just deal with the people data first.
for i in list(peopleData.columns):
    if i not in ['people_id','activity_id','date']:
        if peopleData[i].dtype == 'object':
            #'type 12' -> 12; missing values become 'type 0' -> 0
            peopleData[i] = peopleData[i].fillna('type 0')
            peopleData[i] = peopleData[i].map(lambda x: x.split(' ')[1]).astype(np.int32)
        elif peopleData[i].dtype == 'bool' :
            #True/False -> 1/0
            peopleData[i] = peopleData[i].astype(np.int8)

In [71]:
#now let's do it for the act_train dataset
for i in list(act_train.columns):
    if i not in ['outcome','people_id','activity_id','date']:
        if act_train[i].dtype == 'object':
            act_train[i] = act_train[i].fillna('type 0')
            act_train[i] = act_train[i].map(lambda x: x.split(' ')[1]).astype(np.int32)
        elif act_train[i].dtype == 'bool' :
            act_train[i] = act_train[i].astype(np.int8)

REMINDER: ALL NA VALUES ARE FILLED IN WITH ZERO.
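
To make the reminder concrete, here is a small sketch of the same fill-and-parse step on a made-up column, written with the vectorised `.str` accessor instead of a lambda. It assumes every non-missing value has the form `'<word> <integer>'`.

```python
import numpy as np
import pandas as pd

# Toy column in the same "type N" format, with one missing value.
col = pd.Series(["type 3", np.nan, "type 12", "type 1"])

# Missing values become the placeholder 'type 0', then only the number is kept.
codes = col.fillna("type 0").str.split(" ").str[1].astype(np.int32)
print(codes.tolist())  # [3, 0, 12, 1]
```

The 0 therefore acts as a "missing" code, which only works as long as no real category is itself labelled `type 0`.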

In [143]:
#join the dataframes
trainData = act_train.merge(peopleData,on = 'people_id')
trainData[:3]

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,ppl_100,act2_1734928,2023-08-26,4,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
1,ppl_100,act2_2434093,2022-09-27,2,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
2,ppl_100,act2_3404049,2022-09-27,2,0,0,0,0,0,0,...,0,1,1,0,0,1,1,1,0,36
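
The `_x` and `_y` suffixes in the output come from columns that exist in both tables. A hedged sketch of the same join on toy frames (all values made up) shows a few optional `merge` arguments that may make the result easier to trust and read; the suffix names `_act` and `_ppl` are my own choice, not something the notebook uses.

```python
import pandas as pd

# Toy stand-ins for the activity and people tables (values are made up).
acts = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "activity_id": ["act_a", "act_b", "act_c"],
    "char_1": [0, 2, 5],
})
people = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "char_1": [7, 9],
    "char_38": [36, 80],
})

# how='left' keeps every activity row; validate raises if a person appears
# more than once in the people table; suffixes replace the default _x/_y.
merged = acts.merge(
    people,
    on="people_id",
    how="left",
    validate="many_to_one",
    suffixes=("_act", "_ppl"),
)
print(merged.columns.tolist())
```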


Ugh, that was trickier than I thought. Okay, well, at least we have a dataset that scikit-learn can interact with now. I think I should start with some simple benchmarks and go from there: maybe a benchmark random forest, then isolate the important predictors, work on some feature engineering, and roll from there.

In [144]:
trainData.columns

Index(['people_id', 'activity_id', 'date_x', 'activity_category', 'char_1_x',
       'char_2_x', 'char_3_x', 'char_4_x', 'char_5_x', 'char_6_x', 'char_7_x',
       'char_8_x', 'char_9_x', 'char_10_x', 'outcome', 'char_1_y', 'group_1',
       'char_2_y', 'date_y', 'char_3_y', 'char_4_y', 'char_5_y', 'char_6_y',
       'char_7_y', 'char_8_y', 'char_9_y', 'char_10_y', 'char_11', 'char_12',
       'char_13', 'char_14', 'char_15', 'char_16', 'char_17', 'char_18',
       'char_19', 'char_20', 'char_21', 'char_22', 'char_23', 'char_24',
       'char_25', 'char_26', 'char_27', 'char_28', 'char_29', 'char_30',
       'char_31', 'char_32', 'char_33', 'char_34', 'char_35', 'char_36',
       'char_37', 'char_38'],
      dtype='object')

In [150]:
#drop the identifiers, the dates and the target from the features
X = trainData.drop(columns=['people_id','activity_id','date_x','date_y','outcome'])
Y = trainData['outcome']

In [215]:
#split training and testing
sample = np.random.permutation(len(X))
X = X.iloc[sample]
Y = Y.iloc[sample]
xTrain = X[:1000000]
yTrain = Y[:1000000]
yTest = Y[1000000:1001000]
xTest = X[1000000:1001000]
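
An alternative to the manual permutation is scikit-learn's `train_test_split`. The sketch below runs on synthetic stand-in arrays, and the 80/20 ratio is an arbitrary assumption rather than the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# Shuffle and split in one call; stratify keeps the class balance equal
# in both parts, and random_state makes the split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=100
)
print(x_tr.shape, x_te.shape)  # (800, 5) (200, 5)
```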

In [216]:
from sklearn.ensemble import RandomForestClassifier

forestBenchmark = RandomForestClassifier(n_estimators = 100,random_state=100)
forestBenchmark.fit(xTrain,yTrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=100, verbose=0, warm_start=False)
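
Since the plan is to isolate the important predictors, here is a minimal sketch of reading impurity-based importances off a fitted forest. It uses synthetic data and hypothetical feature names, so the ranking says nothing about the Red Hat columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are hypothetical.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=100)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

On the real data the same two lines would be applied to `forestBenchmark` and `xTrain.columns`.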

In [217]:
from sklearn.metrics import confusion_matrix
predictions = forestBenchmark.predict(xTest)
accuracy = sum(predictions == yTest)/len(predictions)
#rows are true labels, columns are predicted labels
cma = confusion_matrix(yTest,predictions)

In [218]:
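#cma[0] holds the actual negatives, cma[1] the actual positives
TNR = cma[0][0]/(cma[0][0]+cma[0][1]) #specificity
FNR = cma[1][0]/(cma[1][0]+cma[1][1]) #miss rate, so 1-FNR is the TPR

In [219]:
#specificity * sensitivity at a single threshold: a rough proxy, not a true ROC AUC
print('AUC ~', TNR*(1-FNR))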

AUC ~ 0.942446570622
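
The figure printed above comes from one confusion matrix, i.e. a single decision threshold, so it is only a proxy for the area under the ROC curve. A minimal sketch of computing the ROC AUC itself with `roc_auc_score`, on synthetic stand-in data rather than the Red Hat set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=100).fit(x_tr, y_tr)

# ROC AUC is computed from scores, so use the predicted probability of class 1.
scores = forest.predict_proba(x_te)[:, 1]
auc_value = roc_auc_score(y_te, scores)

# For comparison: rates at the single default threshold.
# ravel() on a binary confusion matrix yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_te, forest.predict(x_te)).ravel()
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print("ROC AUC:", auc_value, "TPR:", tpr, "FPR:", fpr)
```

On the real data this would be `roc_auc_score(yTest, forestBenchmark.predict_proba(xTest)[:, 1])`.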


One point of confusion: with 1 million points of training data we get a score of .94, but only around .78 with 1000 data points. Additionally, it takes a while for the forest to train on 1 million data points, so I don't want to work with the 1 million model. In order to do well at the end of the day I'll need to train on all the training data, but I'm not sure if it's valid to assume that model alterations that improve the AUC score with 1000 training points also improve the AUC score with 1 million training points by a similar amount.
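
One way to probe that question is to score the same model at several training sizes against a fixed held-out set, and repeat for each model variant to see whether the ranking of variants is stable. The sketch below does this once on synthetic data; the sizes are arbitrary assumptions and the numbers it prints say nothing about the Red Hat set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5000 rows to train from, 1000 held out.
X_demo, y_demo = make_classification(
    n_samples=6000, n_features=20, n_informative=5, random_state=0
)
x_pool, x_te, y_pool, y_te = train_test_split(
    X_demo, y_demo, test_size=1000, random_state=0
)

# Fit on growing prefixes of the training pool, always scoring on the same test set.
results = {}
for n in (100, 1000, 5000):
    forest = RandomForestClassifier(n_estimators=50, random_state=100)
    forest.fit(x_pool[:n], y_pool[:n])
    results[n] = roc_auc_score(y_te, forest.predict_proba(x_te)[:, 1])

print(results)
```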

At least we have a good benchmark now and can focus on feature engineering.

----
Some Ideas: 

1.) Introduce the historical user data to make predictions.  (Strong)

2.) Develop a better method for treating the NaN fields.  (Strong)

3.) Consider transforms from categorical data that may be easier to learn.  (Weak)
