#### In this notebook we analyze how much each column (feature) contributes to our prediction, and try to understand the main determinant of brand awareness, measured as the number of "yes" responses. We also check whether the experiment group was itself a determining factor.


### Columns Description

* **auction_id:** the unique id of the online user who has been presented the BIO. In standard terminology this is called an impression id. The user may see the BIO questionnaire but choose not to respond. In that case both the yes and no columns are zero.


* **experiment:** which group the user belongs to - control or exposed.
    * **control:** users who have been shown a dummy ad
    * **exposed:** users who have been shown a creative, an online interactive ad, with the SmartAd brand. 
    
    
* **date:** the date in YYYY-MM-DD format


* **hour:** the hour of the day in HH format.


* **device_make:** the name of the type of device the user has e.g. Samsung


* **platform_os:** the id of the OS the user has.


* **browser:** the name of the browser the user uses to see the BIO questionnaire.


* **yes:** 1 if the user chooses the “Yes” radio button for the BIO questionnaire.


* **no:** 1 if the user chooses the “No” radio button for the BIO questionnaire.
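
As a minimal sketch of how the `yes` and `no` columns encode a response, the toy rows below cover the three possible cases: answered yes, answered no, and did not respond. The ids and values are made up for illustration and are not taken from the real data; the `responded` column is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical rows for illustration only; the ids and values are made up.
toy = pd.DataFrame({
    'auction_id': ['a1', 'a2', 'a3', 'a4'],
    'experiment': ['control', 'exposed', 'exposed', 'control'],
    'yes': [1, 0, 0, 0],
    'no':  [0, 1, 0, 0],
})

# A user responded if one of the two radio buttons was chosen.
toy['responded'] = (toy['yes'] + toy['no']) == 1

print(toy)
print('response rate:', toy['responded'].mean())
```

Rows where both columns are zero are impressions without an answer, so they carry no information about awareness.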

### Importing Libraries for Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
#
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
#
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
from sklearn import preprocessing
#
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import scale, StandardScaler
#
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.metrics import r2_score

from sklearn.metrics import log_loss
from collections import defaultdict
import warnings
warnings.filterwarnings('always') 

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Data Cleaning and Manipulation


In [None]:
df = pd.read_csv('../input/ab-testing/AdSmartABdata.csv')

The data was split into users who responded to the questionnaire and users who did not.

In [None]:
# Non-respondents have both columns at zero; everyone else answered yes or no.
# The non-respondent frame is named non_p so it does not shadow the numpy alias np.
no_response = (df['yes'] == 0) & (df['no'] == 0)
p = df[~no_response].copy()      # participants
non_p = df[no_response].copy()   # non-participants

> check if this worked

In [None]:
print('Number of yes for p would be distributed among 1 and 0')
print(p['yes'].value_counts())
print('    Number of yes and no for non_p would just be zero')
print(non_p['yes'].value_counts())
print(non_p['no'].value_counts())

That worked

Next we rename the columns. `yes` becomes `aware`: whether a user who took part in the survey knows the brand or not. `no` becomes `participate`: whether the user responded to the survey at all. Users with `participate` equal to 0 were targeted by our ad (they saw it) but did not respond.

In [None]:
non_p = non_p.rename(columns={'yes':'aware', 'no': 'participate'})
p = p.rename(columns={'yes':'aware', 'no': 'participate'})

`aware` (the old `yes`) marks respondents who know about our brand. I fill the `participate` column with 1 for everyone who answered yes or no; for the non-participants it is already zero.

In [None]:
p['participate'] = 1

> confirm our changes

In [None]:
p.head()

I append the people who didn't participate to the data of the people who did. They are represented with `participate` equal to 0.

The people who actually participated in the survey are the most important for this experiment, but the combined frame below keeps both groups, with `participate` flagging the respondents.

In [None]:
t = pd.concat([p, non_p])
t = t.sort_index()
t.head()
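
The split, rename, and concatenate steps above can also be written as a single vectorized recoding. The sketch below shows that equivalent form on made-up rows; the frame names `toy` and `recoded` are hypothetical and it is not a replacement for the cells above.

```python
import pandas as pd

# Made-up rows: answered yes, answered no, did not respond.
toy = pd.DataFrame({'yes': [1, 0, 0], 'no': [0, 1, 0]})

# aware copies yes; participate flags any answer (yes or no).
recoded = pd.DataFrame({
    'aware': toy['yes'],
    'participate': ((toy['yes'] == 1) | (toy['no'] == 1)).astype(int),
})

print(recoded)
```

Because no rows are moved around, the original index order is kept and no `sort_index` call is needed.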

### Plotting a Correlation Matrix

In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = "Smart Ad"
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df.select_dtypes(include='number') # correlation is only defined for numeric columns
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

In [None]:
plotCorrelationMatrix(t,8)

#### Encoding our Data for our Machine Learning Model 

In [None]:
label_encoder = preprocessing.LabelEncoder()

In [None]:
t['browser'] = label_encoder.fit_transform(t["browser"])
t['experiment'] = label_encoder.fit_transform(t["experiment"])
t['date'] = label_encoder.fit_transform(t["date"])
t['device_make'] = label_encoder.fit_transform(t["device_make"])
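
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a string column. The browser names are a made-up sample, and `enc`, `codes`, and `mapping` are hypothetical names used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

browsers = ['Chrome Mobile', 'Safari', 'Chrome Mobile', 'Facebook']  # made-up sample

enc = LabelEncoder()
codes = enc.fit_transform(browsers)

# classes_ holds the sorted unique labels; each code is an index into it.
mapping = {label: code for code, label in enumerate(enc.classes_)}

print(mapping)
print(codes.tolist())
print(enc.inverse_transform(codes).tolist())
```

Since the notebook refits the same `label_encoder` for each column, only the mapping of the last fitted column remains in `classes_`. Keeping one encoder per column would preserve all of the mappings.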

In [None]:
X = t.drop(columns=['auction_id', 'aware']).values  # every column except the id and the target
Y = t['aware'].values

In this calculation we assume that the phone a person uses is not a factor that determines whether they would click an ad in the first place. We therefore focus on the hour at which they see the ad, the day they see it, and the platform on which they see it. Since the ad is rendered exactly the same way for the same OS type, we assume that it might be less appealing in some browsers. Note that `device_make` is still among the columns passed to the model above.

In [None]:
# performing feature scaling on the full feature matrix, before the train/test split
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# fitting the scaler on X and rescaling every feature to the range [0, 1]
X = scaler.fit_transform(X)

#### Train, Test split for Prediction

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.1, random_state=6)
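
The scaler above is fit on all rows before the split. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse it on the test rows, so that no information from the test set reaches the preprocessing. The arrays and the names `X_demo`, `y_demo`, and `scaler_demo` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(6)
X_demo = rng.uniform(0, 100, size=(50, 3))   # synthetic features
y_demo = rng.integers(0, 2, size=50)         # synthetic binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=6
)

# The min and max are learned from the training rows only.
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)   # may fall slightly outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the test rows are transformed with statistics they did not contribute to, which is closer to how the model would be used on unseen data.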

### Apply ML and train using 5-fold CV

Train a machine learning model with 5-fold cross validation using each of the following 3 algorithms:

* Logistic Regression
* Decision Trees
* XGBoost
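
As a rough sketch of what 5-fold cross validation does, the example below scores two of the listed algorithms with `cross_val_score` on synthetic data from `make_classification`. The data and the names `X_demo`, `y_demo`, and `cv_scores` are assumptions for illustration; XGBoost is left out only to keep the sketch free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=6)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=5),
}

cv_scores = {}
for name, model in models.items():
    # Each model is trained 5 times, each time validated on a different fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(f'{name}: mean={scores.mean():.3f} std={scores.std():.3f}')
```

The spread of the five scores gives a sense of how much the accuracy depends on which rows end up in the validation fold.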


#### Training and Fitting using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(cv=5, random_state=6, scoring= 'accuracy')
clf.fit(X_train, y_train)
y_pred_log = clf.predict(X_test)
y_predpr_log =clf.predict_proba(X_test)
clf.score(X_train, y_train)

#### Predicting using Decision Trees 

In [None]:
# Accuracy scorer used to rank the grid-search candidates
accuracy_scorer = metrics.make_scorer(metrics.accuracy_score)

tree = DecisionTreeClassifier(random_state=5)
param_grid = { 
    # 'auto' was removed in scikit-learn 1.3; None uses all features
    'max_features': [None, 'sqrt', 'log2'],
    'max_depth':[50,150],
    'min_samples_leaf':[1,10]
}

CV_rfc = GridSearchCV(estimator=tree, 
                      param_grid=param_grid,
                      scoring = accuracy_scorer,
                      cv= 5)

CV_rfc.fit(X_train, y_train)


In [None]:
CV_rfc.score(X_train, y_train)

In [None]:
y_pred_dec= CV_rfc.predict(X_test)
y_predpr_dec= CV_rfc.predict_proba(X_test)


In [None]:
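# use the same columns that were passed to the model, so names line up with importances
features = t.drop(columns=['auction_id', 'aware']).columns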
importances = CV_rfc.best_estimator_.feature_importances_
indices = importances.argsort()

f, ax = plt.subplots(figsize=(15, 8))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='orange', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
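
The importances are only meaningful if each value is paired with the column it was computed for. The sketch below shows one way to keep names and values together, using a pandas Series on synthetic data; the feature names `f0` to `f3` and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=6
)
names = ['f0', 'f1', 'f2', 'f3']

demo_tree = DecisionTreeClassifier(random_state=5).fit(X_demo, y_demo)

# Index the importances by name so sorting cannot misalign them.
ranked = pd.Series(demo_tree.feature_importances_, index=names)
ranked = ranked.sort_values(ascending=False)

print(ranked)
```

The names must come from the same columns, in the same order, as the matrix the model was trained on.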

#### Training and Fitting using XGBoost

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

In [None]:
def modelfit(alg, useTrainCV=True, cv_folds=5, early_stopping_rounds=200):
    """Tune n_estimators with xgb.cv, fit on the training split, and print training metrics."""
    if useTrainCV:
        # Use cross-validated early stopping to choose the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data; eval_metric is set on the estimator because
    # recent XGBoost releases no longer accept it as a fit() argument
    alg.set_params(eval_metric='auc')
    alg.fit(X_train, y_train)

    # Predict training set:
    dtrain_predictions = alg.predict(X_train)
    dtrain_predprob = alg.predict_proba(X_train)[:,1]

    # Print model report (these are training-set scores, not test-set scores):
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(y_train, dtrain_predictions))
    print("AUC Score : %f" % metrics.roc_auc_score(y_train, dtrain_predprob))

    return alg

In [None]:
mod =modelfit(xgb1)


In [None]:
y_pred_xgb =mod.predict(X_test)
y_predpr_xgb =mod.predict_proba(X_test)

In [None]:
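# use the same columns that were passed to the model, so names line up with importances
features = t.drop(columns=['auction_id', 'aware']).columns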
importances = mod.feature_importances_
indices = importances.argsort()

f, ax = plt.subplots(figsize=(15, 8))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='orange', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

### Predictions and Accuracy of our Test data

I print a classification report for each model, which shows the precision, recall, F1-score, and support for each class.

In [None]:
# printing the classification report for each classifier to assess performance
from sklearn.metrics import classification_report

# classification report for Logistic Regression
print("Logistic Regression classification report:")
print(classification_report(y_test, y_pred_log))


# classification report for Decision Tree Classifier
print("Decision Tree classification report:")
print(classification_report(y_test, y_pred_dec))

# classification report for XGBoost Classifier
print("XGBoost report:")
print(classification_report(y_test, y_pred_xgb))


The scores above show how well each model performs on each label in our data. The support is the number of test-set rows that truly belong to each label.


#### Confusion Matrix for our Data

In [None]:

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
print('Logistic Regression classifier: Confusion Matrix')
print(confusion_matrix(y_test, y_pred_log))

print('Decision Tree classifier: Confusion Matrix')
print(confusion_matrix(y_test, y_pred_dec))

print('XGBoost Classifier : Confusion Matrix')
print(confusion_matrix(y_test, y_pred_xgb))
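
The argument order of `confusion_matrix` matters: scikit-learn expects the true labels first and the predictions second, with rows for true labels and columns for predicted labels. The small sketch below, on made-up labels, shows the layout and how precision and recall follow from it.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration.
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[1 2]
#  [0 2]]

precision = tp / (tp + fp)   # 2 / (2 + 2) = 0.5
recall = tp / (tp + fn)      # 2 / (2 + 0) = 1.0
print(precision, recall)
```

Swapping the two arguments transposes the matrix, which exchanges false positives and false negatives.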


The best accuracy came from the XGBoost algorithm. The cell below compares the log loss of each model's predicted probabilities.

In [None]:

print('Logistic Regression classifier: Log Loss')
print(log_loss(y_test,y_predpr_log))

print('Decision Tree classifier: Log Loss')
print(log_loss( y_test,y_predpr_dec))

print('XGBoost Classifier : Log Loss')
print(log_loss( y_test, y_predpr_xgb))
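
Log loss penalizes confident wrong probabilities more heavily than accuracy does. For binary labels it is the mean of -(y·log(p) + (1 - y)·log(1 - p)), where p is the predicted probability of class 1. The sketch below checks a hand computation against `log_loss` on made-up values.

```python
import math
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true_demo = [1, 0, 1]
p_pos = [0.9, 0.2, 0.6]

manual = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true_demo, p_pos)
) / len(y_true_demo)

print(manual, log_loss(y_true_demo, p_pos))
```

Lower is better, and a model that always predicts 0.5 scores log(2), about 0.693, which is a useful baseline for the three values printed above.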


### Explain what the difference is between using A/B testing to test a hypothesis vs using Machine learning to learn the viability of the same effect?


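A/B testing compares the control and exposed groups directly and tests whether the difference in the yes rate is statistically significant. Machine learning instead models awareness as a function of all the features. The difference here is the explainability of the result: the model shows how much each feature, including the experiment group, contributes to the outcome.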
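
For contrast with the models above, here is a minimal sketch of the kind of test an A/B analysis would run: a two-proportion z-test on the yes rate of each group. The function name `two_proportion_ztest` and the counts are hypothetical and are not the notebook's results.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: 40 of 100 yes in control, 55 of 100 yes in exposed.
z, p_value = two_proportion_ztest(success_a=40, n_a=100, success_b=55, n_b=100)
print(f'z = {z:.3f}, p-value = {p_value:.4f}')
```

The test answers a single question, whether the two rates differ, while the models also rank the other features by their contribution.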

### Explain the purpose of training using k-fold cross validation instead of using the whole data to train the ML models?


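We used stratified k-fold cross validation so that every fold keeps roughly the same proportion of each class as the full data set, and so that every row is used for both training and validation. This lets our machine learning algorithms learn properly and gives a more reliable performance estimate than training once on the whole data, which would leave nothing to validate on.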
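
To illustrate the stratification, the sketch below splits a made-up imbalanced label vector with `StratifiedKFold` and prints the share of positives in each validation fold. The arrays and names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([1] * 20 + [0] * 80)   # imbalanced: 20% positives
X_demo = np.zeros((len(y_demo), 1))      # features are irrelevant for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=6)

# Share of positives in each validation fold.
fold_rates = [y_demo[val_idx].mean() for _, val_idx in skf.split(X_demo, y_demo)]
print(fold_rates)
```

Every fold keeps the 20% positive rate of the full vector, which a plain random split would only match on average.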

### What information do you gain using the Machine Learning approach that you couldn’t obtain using A/B testing?


The information I couldn't get from the A/B testing approach is what really drives the outcome, and to what extent each of the features in our data set contributes to it. This helps us target what we are looking for more accurately.