# Exercise 1

### We conducted an experiment to improve the acceptance rate of our product. The results of the experiment are in outcomes.csv. 
### Did the treatment improve the acceptance rate? Should we launch it to all of our customers?
* Separate the treatment and control groups according to 'treated' values
* Find group sizes and number of successes in each group
* Perform a two-sample proportions z-test to determine whether the difference between the groups is statistically significant

## Read data

In [6]:
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv('outcomes.csv')

# Check for missing values
print("Missing values in 'outcomes.csv': {}\n".format(df.isna().sum().sum()))

Missing values in 'outcomes.csv': 0



## Separate the treatment and control groups according to 'treated' values

In [7]:
# Treatment Group
df_treatment = df[df.treated]
df_treatment.reset_index(inplace=True, drop=True)

size_treatment = df_treatment.shape[0]
success_treatment = df_treatment.accepted.value_counts()[True]
print("Treatment group:")
print("size       = {}".format(size_treatment))
print("successes  = {}".format(success_treatment))
print("fails      = {}".format(df_treatment.accepted.value_counts()[False]))
print("proportion = {:.3f}\n".format(success_treatment/size_treatment))

# Control Group
df_control = df[~df.treated]
df_control.reset_index(inplace=True, drop=True)

size_control = df_control.shape[0]
success_control = df_control.accepted.value_counts()[True]
print("Control group:")
print("size       = {}".format(size_control))
print("successes  = {}".format(success_control))
print("fails      = {}".format(df_control.accepted.value_counts()[False]))
print("proportion = {:.3f}\n".format(success_control/size_control))

Treatment group:
size       = 25158
successes  = 4334
fails      = 20824
proportion = 0.172

Control group:
size       = 24842
successes  = 4156
fails      = 20686
proportion = 0.167
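
The per-group counts above can also be obtained in a single pass with `groupby`. The following is a minimal sketch on a toy frame, assuming only that the data has the boolean columns `treated` and `accepted` as in outcomes.csv; the `group` column and the toy values are illustrative, not taken from the data.

```python
import pandas as pd

# Toy stand-in for outcomes.csv (hypothetical values, same column names).
toy = pd.DataFrame({
    "treated":  [True, True, True, False, False, False, False],
    "accepted": [True, False, False, True, True, False, False],
})

# Readable group labels instead of True/False (the 'group' column is an assumption).
toy["group"] = toy["treated"].map({True: "treatment", False: "control"})

# One row per group: size, number of successes and acceptance proportion.
summary = toy.groupby("group")["accepted"].agg(
    size="count", successes="sum", proportion="mean"
)
print(summary)
```

The `successes` and `size` columns of `summary` can be passed directly as `count` and `nobs` to `proportions_ztest`.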



## Proportions z-test to determine significant statistical differences

In [11]:
# Proportions z-test : Significant statistical difference between the 2 populations
# H0: Null Hypothesis: Two proportions are the same P1 = P2
# H1: Alternative Hypothesis: Two proportions are not the same 

significance = 0.05  # alpha
successes = np.array([ success_treatment, success_control ])
trials    = np.array([ size_treatment, size_control ])

zstat, pvalue = proportions_ztest(count=successes, nobs=trials, alternative="two-sided")
print("zstat = {:.3f}, p-value = {:.3f}\n".format(zstat, pvalue))

if pvalue > significance:
    print(" We fail to reject the null hypothesis: the data are consistent with P1 = P2")
    print(" No statistically significant difference between treatment and control groups")
else:
    print("There is a statistically significant difference between treatment and control groups")

zstat = 1.481, p-value = 0.139

 We fail to reject the null hypothesis: the data are consistent with P1 = P2
 No statistically significant difference between treatment and control groups
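
As a cross-check, the z statistic can be computed by hand from the group counts printed earlier. This is a sketch using only the standard library and the usual pooled-proportion formula; the helper name `two_proportion_ztest` is an assumption made for illustration. It should agree with the values reported by `proportions_ztest` above.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                    # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail area
    return z, p_value

# Successes and sizes of the treatment and control groups from the output above.
z, p = two_proportion_ztest(4334, 25158, 4156, 24842)
print("zstat = {:.3f}, p-value = {:.3f}".format(z, p))
```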


### <font color="blue">Conclusion: The acceptance rate is slightly higher in the treatment group (0.172 vs. 0.167, a difference of about 0.005), but the p-value of the test (0.139) is greater than the chosen significance level (0.05), so the difference between the groups is not statistically significant. </font> 
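
A confidence interval for the difference in proportions conveys the same message in the units of the acceptance rate. The sketch below uses the unpooled (Wald) standard error and a 95% normal critical value of 1.96; both are common choices but are assumptions here, not part of the original analysis.

```python
import math

def diff_confidence_interval(x1, n1, x2, n2, z_crit=1.96):
    """Wald confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

# Treatment vs. control counts from the output above.
lo, hi = diff_confidence_interval(4334, 25158, 4156, 24842)
print("95% CI for the difference: [{:.4f}, {:.4f}]".format(lo, hi))
```

The interval contains zero, which is consistent with the non-significant z-test.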

### <font color="blue">Conclusion: I wouldn't launch it to all customers yet. Instead, I would estimate the cost of extending the experiment to more businesses, which would increase the power of the statistical test. I would also investigate possible sources of error. </font> 
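
To get a rough idea of how many more businesses would be needed, the standard sample-size formula for comparing two proportions can be applied. This sketch assumes that the observed rates are the true rates and targets 80% power at a two-sided 5% significance level; both targets, and the helper name `n_per_group`, are assumptions.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect p1 vs. p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / (p1 - p2) ** 2)

# Observed acceptance rates of the treatment and control groups.
n = n_per_group(4334 / 25158, 4156 / 24842)
print("Businesses needed per group: {}".format(n))
```

Under these assumptions the required size per group is well above the roughly 25,000 businesses per group in the current experiment.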

# Exercise 2

### Suppose we now obtain some data about the pre-experiment characteristics of the businesses in the experiment. It is provided in pre_experiment.csv. 
### What do you conclude about the experiment based on this data? Were the test/control groups randomly selected?

* Using the data in pre_experiment.csv, we predict whether a business was assigned to the treatment group (the 'treated' variable in outcomes.csv) using Logistic Regression with L2 and L1 regularization. 
* If the groups were in fact randomly selected, the features carry no information about the assignment, so 0.5 (the accuracy of a random model) should lie within the confidence interval of the accuracy with which 'treated' is predicted. If that is not the case, the results of the experiment should not be trusted because of randomization issues.
* The following cells compute predictions of the 'treated' variable together with the corresponding accuracy and confidence interval. A 10-fold cross-validation approach is used, so the predictions are always made on unseen data.
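
To make the criterion above concrete, one can ask how far from 0.5 the accuracy of a chance-level model could plausibly fall on a single test fold. The sketch below uses the normal approximation to the binomial; the fold size of 5,000 follows from the 50,000 businesses in outcomes.csv split into 10 folds, and the helper name `chance_band` is hypothetical.

```python
import math

def chance_band(n, p0=0.5, z_crit=1.96):
    """95% band for the accuracy of a chance-level model evaluated on n samples."""
    half_width = z_crit * math.sqrt(p0 * (1 - p0) / n)
    return p0 - half_width, p0 + half_width

# 50,000 businesses in 10 folds -> 5,000 test samples per fold.
lo, hi = chance_band(5000)
print("Chance-level accuracy band: [{:.3f}, {:.3f}]".format(lo, hi))
```

Fold accuracies well outside this band suggest that the features carry information about the group assignment.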

## Read data

In [12]:
#!pip install sklearn-pandas
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer

dfx = pd.read_csv('pre_experiment.csv')
dfy = pd.read_csv('outcomes.csv')

# Check for missing values
print("Missing values in 'pre_experiment.csv': {}\n".format(dfx.isna().sum().sum()))

# Set labels: treated False = 0 / True = 1
Y = dfy.treated.astype(int).to_frame() 

# Set features with proper types
xx = dfx.drop(labels=["business_id"], axis="columns")
mask = xx.dtypes == 'object'                         # get columns based on data type (state,type)
cat_cols = xx.columns[mask].tolist()                 # list with the categorical column labels of the DataFrame
print('cat_cols',cat_cols)
num_cols = xx.columns[~mask].tolist()                # list with the numeric column labels
xx[num_cols] = xx[num_cols].astype(float)
print('num_cols',num_cols)

# Binarize categorical features
binarize = DataFrameMapper( 
    [ ([col], [LabelBinarizer()]) for col in cat_cols ] + 
    [ ([col], None) for col in num_cols ],
    input_df=True,
    df_out=True
)

print('Binarize',binarize)

X = binarize.fit_transform(xx)
print('X',X)
all_cols = X.columns.tolist()

Missing values in 'pre_experiment.csv': 0

cat_cols ['state', 'type']
num_cols ['average_daily_sales_dollar_volume', 'average_daily_sales_transactions', 'age_in_months', 'debt', 'credit_score', 'expense_to_income_ratio', 'offer_size', 'offer_fee']
Binarize DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['state'], [LabelBinarizer()]),
                          (['type'], [LabelBinarizer()]),
                          (['average_daily_sales_dollar_volume'], None),
                          (['average_daily_sales_transactions'], None),
                          (['age_in_months'], None), (['debt'], None),
                          (['credit_score'], None),
                          (['expense_to_income_ratio'], None),
                          (['offer_size'], None), (['offer_fee'], None)],
                input_df=True)
X        state_AK  state_AL  state_AR  state_AZ  state_CA  state_CO  state_CT  \
0             0         0         0         0         0         0 
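
If `sklearn_pandas` is not available, a similar one-hot encoding can be produced with `pandas.get_dummies`. This is a sketch on a toy frame: the column names `state`, `type` and `debt` come from pre_experiment.csv, while the category values are hypothetical. Note that, unlike `LabelBinarizer`, `get_dummies` emits one column per category even when a feature has only two categories.

```python
import pandas as pd

# Toy stand-in for a few columns of pre_experiment.csv (hypothetical values).
toy = pd.DataFrame({
    "state": ["CA", "NY", "CA"],
    "type":  ["retail", "food", "retail"],
    "debt":  [1000.0, 250.0, 0.0],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X_toy = pd.get_dummies(toy, columns=["state", "type"], dtype=int)
print(X_toy.columns.tolist())
```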

## Model

In [10]:
# Predict treatment class based on features with Logistic Regression and L2 regularization
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Outer 10-fold cross-validation
cv_outer = KFold(n_splits=10, shuffle=True, random_state=0)

accu_outer = []
n = 1
for id_train, id_test in cv_outer.split(X):
    
    # set 10 train/test subsets
    Xtrain, Xtest = X.iloc[id_train,:], X.iloc[id_test,:]
    ytrain, ytest = Y.iloc[id_train,:], Y.iloc[id_test,:]
    
    # normalize data
    scaler = StandardScaler()
    scaler.fit(Xtrain)                                                # scaler uses mean and std of training set
    xtrain = pd.DataFrame(scaler.transform(Xtrain), columns=all_cols) # results to DataFrame instead of numpy array
    xtest  = pd.DataFrame(scaler.transform(Xtest),  columns=all_cols)
    
    # inner 5-fold cross-validation (fine tune hyperparameter)
    cv_inner = KFold(n_splits=5, shuffle=True, random_state=1)
    
    # model
    model = LogisticRegression(penalty='l2',       # L2 regularization
                               random_state=0, 
                               solver='liblinear', 
                               max_iter=100)
    
    # grid search over C (inverse of the regularization strength)
    grid_C = {'C': list(np.logspace(-1, 4, 6))}
    grid_search = GridSearchCV(estimator=model, 
                               param_grid=grid_C, 
                               scoring='accuracy', 
                               cv=cv_inner, 
                               refit=True,          # Refit the best estimator on the full training fold once the search finishes.
                               n_jobs=16)           # Number of jobs to run in parallel 
    grid_search.fit(xtrain, ytrain.values.ravel())  # apply grid search (ravel -> flatten)
    model_best = grid_search.best_estimator_        # save best model
    
    # predictions
    ypred_proba = model_best.predict_proba(xtest)[:,1]  # prob class 1
    ypred = model_best.predict(xtest)                   # pred class
    accu = accuracy_score(ytest.values.ravel(), ypred)  # true label vs predicted label
    
    # save relevant data
    accu_outer.append(accu)
    
    # print progress
    print("n = {}, C = {}, test score = {}".format(n, grid_search.best_params_['C'], accu))
    n += 1
    
# Print summary
print("Accuracy of predicting treatment based on features = {:.3f} +/- {:.3f}".format(
      np.mean(accu_outer), 
      np.std(accu_outer)))

n = 1, C = 10.0, test score = 0.6526
n = 2, C = 100.0, test score = 0.6676
n = 3, C = 100.0, test score = 0.6722
n = 4, C = 10.0, test score = 0.6634
n = 5, C = 10.0, test score = 0.6554
n = 6, C = 1.0, test score = 0.669
n = 7, C = 1000.0, test score = 0.6632
n = 8, C = 1000.0, test score = 0.6664
n = 9, C = 1.0, test score = 0.6718
n = 10, C = 10.0, test score = 0.6744
Accuracy of predicting treatment based on features = 0.666 +/- 0.007
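
The summary above reports the mean and standard deviation of the fold accuracies. An approximate 95% confidence interval for the mean accuracy can be derived from the ten fold scores printed above. This sketch uses a Student's t interval with 9 degrees of freedom (critical value 2.262); since cross-validation folds share training data, the scores are not fully independent and the interval should be read as approximate.

```python
import math
import statistics

# Test accuracies of the 10 outer folds, copied from the output above.
fold_accuracy = [0.6526, 0.6676, 0.6722, 0.6634, 0.6554,
                 0.6690, 0.6632, 0.6664, 0.6718, 0.6744]

mean_acc = statistics.mean(fold_accuracy)
sem = statistics.stdev(fold_accuracy) / math.sqrt(len(fold_accuracy))

t_crit = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom
lo, hi = mean_acc - t_crit * sem, mean_acc + t_crit * sem

# Accuracy of always predicting the larger group (treatment: 25158 of 50000).
majority_baseline = 25158 / 50000

print("mean accuracy = {:.3f}, 95% CI = [{:.3f}, {:.3f}]".format(mean_acc, lo, hi))
print("majority-class baseline = {:.3f}".format(majority_baseline))
```

Both 0.5 and the majority-class baseline fall far below the lower bound of the interval.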


### <font color="blue">Conclusion: With both L1 and L2 regularization, the models show that the 'treated' variable can be predicted with about 67% accuracy, and the accuracy of a random model (0.5) lies well outside the confidence interval. This means that, based on the available features, the selection of the groups was not completely random and is biased towards certain features. As a result, a standard proportions test cannot reliably determine the effect of the treatment on the outcomes. In our particular problem, we conclude that the results from Exercise 1 cannot be trusted and we can't tell whether or not the treatment improved the acceptance rate. </font>