Koushik Chowdhury (2572865)

# General Information
The goal of this assignment is to familiarize yourself with quantitative methods and statistics in practice. In the course of this assignment, you will parse and analyze logfiles of an online computer parts retailer. You will also learn to interpret these values to draw conclusions based on quantitative insights.

# Introduction
You work for a regional online computer parts retailer in Fresno, California, and are responsible for the technical infrastructure. On January 1, 2020, the California Consumer Privacy Act (CCPA) became effective, and you have just realized that it requires a “Do not sell” opt-out option for all customers.
You want to go above and beyond this requirement by showing customers a 90 second long video that explains which data your company collects, how it is used, and under which circumstances you transmit them to third parties. However, when you asked the executive board what they thought about your plan they told you that (a) they would rather only implement the new law as minimally as required, and (b) you may only change the checkout-procedure if you can demonstrate that it does not affect the sales. Since the CEO attended a class in statistics in college, you will need to provide statistical tests to convince her.
To address the board's concerns, you conduct a limited comparative test for one week with three conditions: (1) the current checkout procedure without any privacy-related options, (2) a “Do not sell” opt-out possibility as required by the CCPA, and (3) your 90-second video explanation with an opt-in at the end. You programmed the online shop software to randomly assign arriving customers to one of these three conditions.

You can download the three resulting Apache logfiles and the inventory list from the Materials page on CMS. Your job is to find out if either of the two new options affects sales, i.e. the number of aborted checkout procedures or the amount of money that shoppers spend on computer parts. You may solve this exercise either in R or Python. We expect your submission in a Jupyter Notebook that includes all necessary code and your written answers.

# Prerequisites

* Download the webserver logfiles from the [Materials Page](https://cms.cispa.saarland/usec20/materials/)
* Download the price list from the [Materials Page](https://cms.cispa.saarland/usec20/materials/)
* You may use the [Apache-Log-Parser](https://github.com/rory/apache-log-parser) library in this exercise
> pip3 install apache-log-parser
* You may use the Python libraries numpy, scipy, pandas, and statsmodels for the required calculations in this exercise.
> pip3 install numpy scipy pandas statsmodels

## 1 Parse and prepare the Log-Files (3 Points)


In [36]:
import apache_log_parser
import scipy.stats as stats
import numpy as np
import pandas as pd

# TODO: parse log files

import os

In [37]:
pwd()

'C:\\Users\\Koushik\\usp'

In [38]:
os.chdir('C:\\Users\\Koushik\\usp')

In [39]:
pwd()

'C:\\Users\\Koushik\\usp'

In [40]:
line_parser = apache_log_parser.make_parser('%a %u %l %t "%r" "%{User-Agent}i" %>s %b')

# reading each individual file and saving data to a dictionary
read_data = {}

for file in os.listdir():
    # only reading log files
    # saving content of each file in different dictionary keys as per file name
    if file.endswith(".log"):
        key_ = file.replace(".log", "")
        if key_ not in read_data:
            read_data[key_] = {}
            
        # use a separate name for the file handle so the loop variable `file` is not shadowed
        with open(file, "r") as log_file:
            # reading each line and parsing information and saving by IP address
            for line in log_file:
                log_line_data = line_parser(line)
                if log_line_data['remote_ip'] not in read_data[key_]:
                    read_data[key_][log_line_data['remote_ip']] = []
                    
                # each entry for a IP is saved to a list for future readings
                read_data[key_][log_line_data['remote_ip']].append(log_line_data)
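
To make the parsing step easier to follow, here is a minimal standard-library sketch of what extracting the fields from one log line could look like. It is not how `apache_log_parser` works internally; the regular expression, the `parse_line` helper, and the sample line are assumptions for illustration that only mirror the field order of the format string used above.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout, mirroring '%a %u %l %t "%r" "%{User-Agent}i" %>s %b'
LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<user>\S+) (?P<logname>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" "(?P<agent>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    request_parts = fields["request"].split()
    if len(request_parts) != 3:
        return None
    _method, url, _protocol = request_parts
    url_parts = urlsplit(url)
    fields["path"] = url_parts.path              # e.g. /cart/add
    fields["query"] = parse_qs(url_parts.query)  # e.g. {'sku': ['...']}
    return fields


# Hypothetical sample line, not taken from the real log files
sample = ('192.0.2.10 - - [06/Jan/2020:10:15:32 +0000] '
          '"GET /cart/add?sku=2251588819 HTTP/1.1" "Mozilla/5.0" 200 512')
parsed = parse_line(sample)
print(parsed["ip"], parsed["path"], parsed["query"]["sku"][0], parsed["status"])
```

Grouping the parsed entries by IP address, as done in the cell above, then gives one ordered list of requests per visitor.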

In [41]:
# initializing placeholder for extracted information
extracted_insight = {}

# reading content for different file name
for file_name in read_data:
    
    # user information per file
    user_info_payment = {}
    
    # for each user
    for user in read_data[file_name]:
        # reading parsed data
        temp_lst = read_data[file_name][user]
        
        # initializing to be saved information per user
        added_product = []
        removed_prodct = []
        successful_payement = []
        aborted_payment = []
        
        # Initially both of these conditions should be False
        payment_done = False
        confirmation_wanted = False
        
        # for each log per IP
        for index, dct in enumerate(temp_lst):
            
            # Only requests answered with status code 200 (OK) are considered;
            # everything else is skipped
            if dct['status'] == "200":
                # getting action done 
                action = dct['request_url_path']
                
                # collecting timestamps
                timestamp = dct['time_received_utc_datetimeobj']
                
                # getting product name if available
                product = dct['request_url_query_dict']
                
                # For the 3 log files, this condition is not necessary
                # However, it will come in handy if the same user completes multiple successful payments
                if payment_done == True:
                    payment_done = False
                    confirmation_wanted = False
                
                # if product is added to cart then add it to added product list
                if "/cart/add" in action:
                    product = product['sku'][0].split(" ")[0]
                    added_product.append(product)
                
                # if a product is removed from the cart, add it to the removed product list
                elif "/cart/remove" in action:
                    product = product['sku'][0].split(" ")[0]
                    removed_prodct.append(product)
                
                # If checkout is requested then save the timestamp and make confirmation_wanted to True
                elif "/cart/checkout" in action:
                    confirmation_wanted = True
                    confirmation_wanted_timestamp = timestamp
                
                # If a checkout was requested previously and the current action is thank_you_for_your_order,
                # record a successful payment and set payment_done to True
                elif ("/thank_you_for_your_order" in action) and confirmation_wanted:
                    payment_done = True
                    successful_payement.append(timestamp)

        # If a checkout was requested (confirmation_wanted is True) but the
        # payment was never completed (payment_done is False),
        # count it as an aborted payment
        if confirmation_wanted == True and payment_done == False:
            aborted_payment.append(confirmation_wanted_timestamp)
        
        # ordered products 
        ordered_products = list(set(added_product) - set(removed_prodct))
        
        # saving all these information to a file specific dictionary
        user_info_payment[user] = {
            "added_product" : added_product,
            "removed_prodct" : removed_prodct, 
            "ordered_products" : ordered_products,
            "successful_payement" : successful_payement,
            "aborted_payment" : aborted_payment
        }
    
    # now saving the file specific dictionary for further analysis to a combined dictionary
    extracted_insight[file_name] = user_info_payment
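
The classification logic above can be condensed into a small state machine. The following is a minimal sketch under the same assumptions as the cell above (a checkout request followed by a thank-you page counts as successful, a checkout that is still open at the end counts as aborted); the function name `classify_checkouts` and the example paths are hypothetical.

```python
def classify_checkouts(paths):
    """Return (successful, aborted) counts for one visitor's ordered request paths."""
    successful = 0
    awaiting_confirmation = False
    for path in paths:
        if "/cart/checkout" in path:
            # checkout started, confirmation page not seen yet
            awaiting_confirmation = True
        elif "/thank_you_for_your_order" in path and awaiting_confirmation:
            successful += 1
            awaiting_confirmation = False
    # a checkout that was started but never confirmed counts as aborted
    aborted = 1 if awaiting_confirmation else 0
    return successful, aborted


print(classify_checkouts(["/cart/add", "/cart/checkout", "/thank_you_for_your_order"]))
print(classify_checkouts(["/cart/add", "/cart/checkout"]))
```

Keeping the logic in a function like this makes it easy to check the edge cases (no checkout at all, several purchases by one visitor) before running it on the full logs.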

In [42]:
# reading the csv file
for file in os.listdir():
    # we only have one csv file
    if '.csv' in file:
        df = pd.read_csv(file, names = ["product_id", "product_name", "price"])

# converting $ sign to floating point numbers
df['price'] = df['price'].replace("[$]", "", regex = True).astype(float)

display(df)

Unnamed: 0,product_id,product_name,price
0,2251588819,"Lithium Battery CR2032, Coin CMOS, 3V",1.99
1,2407472151,Super Fast AA/AAA Battery Charger - Included B...,4.99
2,2251588819,"Lithium Battery CR2032, Coin CMOS, 3V",1.99
3,3219851208,"Thermaltake Versa H22 Mid Tower ATX Case, Blac...",49.99
4,5913479748,Cooler Master MasterBox MB600L. Mid Tower ATX ...,59.99
...,...,...,...
250,1087963548,"TP-Link Deco M3 Home Mesh AC1200 WIFI System, ...",109.99
251,9744135408,Ubiquiti Long Range Wireless 802.11ac 1300 Dua...,119.99
252,9705275849,Ubiquiti PRO Wireless AC 1750 Dual Band Access...,159.99
253,5796418955,TP-Link Archer AX50 AX3000 Dual-Band Wireless ...,159.99


### 1.1 The amount of aborted checkout procedures

In [8]:
# TODO: Calculate Aborted Checkout Procedures for the three Log-Files
sum_aborted = 0

for file_name, user_info_payment in extracted_insight.items():
    temp_sum = 0
    for user_ip, payment_info in user_info_payment.items():
        temp_sum += len(payment_info['aborted_payment'])
    print(f"File: {file_name}.log has {temp_sum} aborted payments.")

    sum_aborted += temp_sum
print(100 * "-")
print(f"In total the system has {sum_aborted} aborted payments.")

File: Assignment_2_webshop-comville-apache_ccpa_opt-out_access.log has 13 aborted payments.
File: Assignment_2_webshop-comville-apache_no_privacy_options_access.log has 12 aborted payments.
File: Assignment_2_webshop-comville-apache_video_explanations_opt-in_access.log has 30 aborted payments.
----------------------------------------------------------------------------------------------------
In total the system has 55 aborted payments.


### 1.2 The amount of successful checkout procedures


In [9]:
# TODO: Calculate Successful Checkout Procedures for the three Log-Files
sum_successful = 0

for file_name, user_info_payment in extracted_insight.items():
    temp_sum = 0
    for user_ip, payment_info in user_info_payment.items():
        temp_sum += len(payment_info['successful_payement'])
    print(f"File: {file_name}.log has {temp_sum} successful payments.")

    sum_successful += temp_sum
    
print(100 * "-")
print(f"In total the system has {sum_successful} successful payments.")

File: Assignment_2_webshop-comville-apache_ccpa_opt-out_access.log has 442 successful payments.
File: Assignment_2_webshop-comville-apache_no_privacy_options_access.log has 417 successful payments.
File: Assignment_2_webshop-comville-apache_video_explanations_opt-in_access.log has 423 successful payments.
----------------------------------------------------------------------------------------------------
In total the system has 1282 successful payments.


### 1.3 The amount of money spent in each successful checkout procedure

In [10]:
# TODO: Calculate the amount of money spent in each successful checkout procedure
# placeholder for saving successful payment amount per user
successful_payement_amount_puser = {}
sum_total = 0
for file_name, user_info_payment in extracted_insight.items():
    temp_sum_file = 0
    successful_payement_amount_puser[file_name] = {}
    
    for user_ip, payment_info in user_info_payment.items():
        if len(payment_info['successful_payement']) >= 1:
            ordered_products = payment_info['ordered_products']
            
            # summing up the amount spent in each successful payment
            temp_sum = 0
            for product in ordered_products:
                temp_sum += df[df['product_id'] == int(product)]['price'].iloc[0]
            
            successful_payement_amount_puser[file_name][user_ip] = temp_sum
            print(f"User: {user_ip}\tspent total of:\t{temp_sum : .2f}$")
            temp_sum_file += temp_sum
            
    print(100 * "-")
    print(f"File: {file_name}.log has a total of:\t{temp_sum_file : .2f}$ in successful payments.")

    sum_total += temp_sum_file

print(100 * "+")

print(f"A total payment of:\t{sum_total : .2f} is made by the system.")

User: 192.0.3.164	spent total of:	 619.90$
User: 201.49.125.22	spent total of:	 792.87$
User: 192.55.85.143	spent total of:	 913.84$
User: 192.57.53.252	spent total of:	 694.85$
User: 100.6.161.249	spent total of:	 954.90$
User: 192.175.49.245	spent total of:	 858.99$
User: 192.44.117.175	spent total of:	 688.84$
User: 192.175.49.93	spent total of:	 884.89$
User: 192.52.195.191	spent total of:	 851.90$
User: 192.90.21.51	spent total of:	 902.83$
User: 2.199.49.238	spent total of:	 692.90$
User: 168.228.196.128	spent total of:	 826.84$
User: 198.78.65.129	spent total of:	 645.88$
User: 192.88.92.251	spent total of:	 783.84$
User: 98.100.157.33	spent total of:	 758.85$
User: 192.171.39.146	spent total of:	 1233.81$
User: 22.78.138.144	spent total of:	 275.89$
User: 192.31.64.102	spent total of:	 892.93$
User: 192.92.115.57	spent total of:	 1138.80$
User: 192.0.3.218	spent total of:	 1191.80$
User: 192.175.52.197	spent total of:	 564.91$
User: 169.125.253.9	spent total of:	 616.85$
User: 

User: 1.224.47.132	spent total of:	 940.93$
User: 202.47.233.38	spent total of:	 442.90$
User: 100.232.4.4	spent total of:	 630.87$
User: 192.25.64.83	spent total of:	 700.88$
User: 203.0.37.155	spent total of:	 599.85$
User: 192.175.24.112	spent total of:	 691.84$
User: 169.253.227.112	spent total of:	 622.90$
User: 171.68.117.201	spent total of:	 988.94$
User: 198.51.52.3	spent total of:	 649.89$
User: 192.52.192.189	spent total of:	 718.91$
User: 169.240.161.197	spent total of:	 814.82$
User: 192.88.103.163	spent total of:	 714.89$
User: 198.118.31.227	spent total of:	 885.78$
User: 192.175.51.131	spent total of:	 804.85$
User: 192.17.75.26	spent total of:	 1102.85$
User: 198.30.48.2	spent total of:	 754.86$
User: 97.238.67.198	spent total of:	 786.91$
User: 192.109.3.100	spent total of:	 339.91$
User: 206.47.239.225	spent total of:	 632.89$
User: 192.15.62.139	spent total of:	 676.87$
User: 192.31.199.238	spent total of:	 778.84$
User: 192.171.99.209	spent total of:	 745.83$
User: 

User: 203.23.243.108	spent total of:	 973.81$
User: 198.51.217.171	spent total of:	 683.88$
User: 169.213.62.215	spent total of:	 1010.86$
User: 192.0.75.218	spent total of:	 562.91$
User: 203.139.225.158	spent total of:	 888.83$
User: 192.198.50.252	spent total of:	 918.86$
User: 192.31.178.88	spent total of:	 726.91$
User: 192.175.49.9	spent total of:	 1220.83$
User: 192.91.147.191	spent total of:	 423.92$
User: 169.255.120.248	spent total of:	 430.89$
User: 192.31.198.207	spent total of:	 594.89$
User: 115.164.228.41	spent total of:	 864.81$
User: 169.27.13.99	spent total of:	 949.89$
User: 117.235.180.28	spent total of:	 1000.79$
User: 203.12.29.122	spent total of:	 431.86$
User: 168.226.183.5	spent total of:	 856.88$
User: 101.146.84.120	spent total of:	 432.87$
User: 192.226.120.212	spent total of:	 582.88$
User: 192.31.197.202	spent total of:	 928.81$
User: 192.52.136.125	spent total of:	 921.83$
-----------------------------------------------------------------------------------

User: 192.89.120.12	spent total of:	 570.91$
User: 192.0.105.103	spent total of:	 645.87$
User: 192.0.22.97	spent total of:	 880.86$
User: 192.28.81.38	spent total of:	 545.92$
User: 192.52.192.27	spent total of:	 432.86$
User: 192.52.194.189	spent total of:	 800.79$
User: 192.53.237.218	spent total of:	 641.89$
User: 192.185.150.247	spent total of:	 426.89$
User: 192.52.190.196	spent total of:	 465.91$
User: 192.0.54.15	spent total of:	 813.86$
User: 198.51.101.153	spent total of:	 817.86$
User: 192.20.23.176	spent total of:	 996.88$
User: 192.0.40.221	spent total of:	 860.88$
User: 192.0.3.117	spent total of:	 657.85$
User: 192.29.174.54	spent total of:	 586.86$
User: 202.12.80.39	spent total of:	 771.86$
User: 192.49.173.253	spent total of:	 705.82$
User: 192.175.14.51	spent total of:	 653.92$
User: 198.51.118.67	spent total of:	 732.90$
User: 100.63.48.51	spent total of:	 745.85$
User: 192.0.31.235	spent total of:	 506.89$
User: 192.66.11.94	spent total of:	 701.87$
User: 6.137.74.

User: 192.175.55.199	spent total of:	 941.87$
User: 6.248.210.105	spent total of:	 672.89$
User: 203.0.37.233	spent total of:	 648.87$
User: 192.15.162.212	spent total of:	 615.88$
User: 198.11.168.18	spent total of:	 894.90$
User: 198.52.115.163	spent total of:	 824.88$
User: 193.137.201.138	spent total of:	 486.92$
User: 198.51.109.134	spent total of:	 707.88$
User: 192.31.178.222	spent total of:	 779.88$
User: 198.51.101.52	spent total of:	 1079.86$
User: 66.195.188.11	spent total of:	 792.89$
User: 203.1.159.66	spent total of:	 1049.79$
User: 192.229.123.180	spent total of:	 681.91$
User: 192.31.152.47	spent total of:	 617.85$
User: 192.175.156.238	spent total of:	 1131.83$
User: 124.167.246.49	spent total of:	 649.88$
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_no_privacy_options_access.log has a total of:	 318457.19$ in successful payments.
User: 169.189.61.173	spent total of:	 734

User: 192.31.119.43	spent total of:	 534.87$
User: 203.0.83.27	spent total of:	 575.89$
User: 203.0.114.92	spent total of:	 807.83$
User: 203.5.108.242	spent total of:	 833.87$
User: 192.0.29.171	spent total of:	 686.82$
User: 192.31.194.22	spent total of:	 377.92$
User: 198.51.98.98	spent total of:	 456.87$
User: 192.52.222.99	spent total of:	 464.89$
User: 192.175.5.190	spent total of:	 491.92$
User: 198.14.203.194	spent total of:	 198.94$
User: 203.12.113.170	spent total of:	 572.87$
User: 192.0.242.240	spent total of:	 366.88$
User: 203.206.131.69	spent total of:	 440.94$
User: 192.31.216.175	spent total of:	 672.83$
User: 198.213.175.173	spent total of:	 645.84$
User: 169.255.82.213	spent total of:	 449.90$
User: 214.147.102.225	spent total of:	 1026.91$
User: 203.0.119.9	spent total of:	 475.92$
User: 192.52.192.242	spent total of:	 903.82$
User: 192.90.7.47	spent total of:	 934.87$
User: 203.86.117.98	spent total of:	 989.86$
User: 192.31.204.78	spent total of:	 625.86$
User: 19

User: 203.0.104.103	spent total of:	 949.85$
User: 102.158.116.245	spent total of:	 637.87$
User: 35.190.213.238	spent total of:	 881.84$
User: 10.223.185.71	spent total of:	 673.90$
User: 192.88.24.43	spent total of:	 669.90$
User: 192.175.58.216	spent total of:	 517.90$
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_video_explanations_opt-in_access.log has a total of:	 302300.37$ in successful payments.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A total payment of:	 951741.77 is made by the system.
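
One detail of the calculation above is worth illustrating: the ordered products are derived as a set difference, which ignores quantities. If a SKU can be added more than once, or if a removal only takes out one unit (both are assumptions about the shop, not something the logs shown here confirm), a multiset keeps the counts. The sketch below uses hypothetical SKUs and prices.

```python
from collections import Counter


def remaining_cart(added, removed):
    """Return a Counter of SKUs left in the cart after removals."""
    cart = Counter(added)
    cart.subtract(Counter(removed))
    return +cart  # unary plus drops SKUs whose count is zero or negative


def cart_total(cart, prices):
    """Sum price times quantity over all SKUs in the cart."""
    return sum(prices[sku] * quantity for sku, quantity in cart.items())


# Hypothetical data for illustration only
prices = {"A": 1.99, "B": 49.99}
added = ["A", "A", "B"]
removed = ["A"]

cart = remaining_cart(added, removed)
total = cart_total(cart, prices)
set_based = set(added) - set(removed)  # the approach used above: {'B'}

print(dict(cart), round(total, 2), set_based)
```

With this data the multiset keeps one unit of `A`, while the set difference drops `A` entirely, so the two approaches would report different totals.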


## 2 Analyze the Extracted Information
For each statistical test you run you should (a) **explicitly state H0**, and (b) **list all of the results** (including the non-significant ones).

### 2.1 Effect on Checkout Completion (6 Points)

(a) Assess the global effect of the privacy policy condition on the checkout success/abortion using a non-parametric test.

In [50]:
# TODO: Compare Success-Rates
from scipy.stats import chi2_contingency

cont_table = np.array([[442, 417, 423],
                       [13, 12, 30]])

#chi-square test
chi2, p, dof, expected = chi2_contingency(cont_table)

print('(a) Assess the global effect of the privacy policy condition on the checkout success/abortion using a non-parametric test.')
print('Chi-square statistic: {:.3f}'.format(chi2))
print('p-value: {:.5f}'.format(p))


(a) Assess the global effect of the privacy policy condition on the checkout success/abortion using a non-parametric test.
Chi-square statistic: 10.935
p-value: 0.00422
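
To show what the global test computes, the chi-square statistic can be reproduced by hand from the contingency table above. This is a minimal sketch using only the standard library; the closed-form p-value `exp(-x / 2)` is valid only because the table has 2 degrees of freedom, which is an assumption tied to this 2 x 3 layout.

```python
import math

# Counts from sections 1.1 and 1.2 (columns: CCPA, No-Privacy, Video)
observed = [[442, 417, 423],  # successful checkouts
            [13, 12, 30]]     # aborted checkouts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Survival function of the chi-square distribution for dof == 2 only
p_value = math.exp(-chi2_stat / 2)

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.5f}")
```

The largest contribution to the statistic comes from the aborted checkouts in the video condition, where 30 were observed against roughly 18.6 expected.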


(explain your code and the results here)

(b) In case of a global effect, run a post-hoc analysis to find out which condition shows an effect.

In [53]:
# TODO: if necessary, post hoc-analysis here
from itertools import combinations

conditions = ['CCPA', 'No-Privacy', 'Video']

# pairwise chi-square tests
for c1, c2 in combinations(conditions, 2):
    print(f"Chi-square test between {c1} and {c2}")
    subset = cont_table[:, [conditions.index(c1), conditions.index(c2)]]
    chi2, p, dof, expected = stats.chi2_contingency(subset)
    print('Chi-square statistic: {:.3f}'.format(chi2))
    print('p-value: {:.5f}'.format(p))
    print()


Chi-square test between CCPA and No-Privacy
Chi-square statistic: 0.022
p-value: 0.88136

Chi-square test between CCPA and Video
Chi-square statistic: 6.324
p-value: 0.01191

Chi-square test between No-Privacy and Video
Chi-square statistic: 6.291
p-value: 0.01214
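
For 2 x 2 tables, `chi2_contingency` applies a continuity correction by default, which is why the pairwise statistics are smaller than an uncorrected calculation would give. The sketch below recomputes the three comparisons with the textbook continuity-corrected formula, which appears to reproduce the values printed above, and adds a Bonferroni-adjusted threshold as one possible way to account for running three tests. The helper name `yates_chi2_2x2` is hypothetical, and the Bonferroni step is an addition that the cell above does not perform.

```python
import math
from itertools import combinations

# (successful, aborted) counts per condition, from sections 1.1 and 1.2
counts = {"CCPA": (442, 13), "No-Privacy": (417, 12), "Video": (423, 30)}


def yates_chi2_2x2(first, second):
    """Chi-square statistic and p-value for a 2x2 table with continuity correction."""
    table = [[first[0], second[0]], [first[1], second[1]]]
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (abs(table[i][j] - expected) - 0.5) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


pairs = list(combinations(counts, 2))
bonferroni_alpha = 0.05 / len(pairs)  # assumed correction for three tests

results = {}
for c1, c2 in pairs:
    stat, p_value = yates_chi2_2x2(counts[c1], counts[c2])
    results[(c1, c2)] = (stat, p_value, p_value < bonferroni_alpha)
    print(f"{c1} vs {c2}: chi2 = {stat:.3f}, p = {p_value:.5f}, "
          f"significant at {bonferroni_alpha:.4f}: {p_value < bonferroni_alpha}")
```

Under this stricter threshold the two comparisons involving the video condition remain significant, so the conclusion does not depend on whether the correction is applied.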



(explain your code and the results here)

(c) Write a short explanation for the executive board that describes why you chose these tests and what the results mean for the future privacy options on the online shop.

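H0 (global test): The checkout outcome (success/abort) is independent of the privacy condition.<br>
H0 (CCPA vs. No-Privacy): There is no difference in checkout outcome between CCPA and No-Privacy.<br>
H0 (CCPA vs. Video): There is no difference in checkout outcome between CCPA and Video.<br>
H0 (No-Privacy vs. Video): There is no difference in checkout outcome between No-Privacy and Video.<br>

Explanation for the executive board:<br>
<p style='text-align: justify;'> I chose the chi-square test because both variables are categorical (privacy condition and checkout outcome) and the test does not assume a particular distribution. I first ran a global chi-square test across all three conditions and then pairwise chi-square tests as a post-hoc analysis to determine which conditions differ. The global test in (a) is significant (chi-square = 10.935, p = 0.00422), so there is a significant relationship between the privacy condition and checkout success/abortion. In the post-hoc tests in (b), CCPA vs. No-Privacy is not significant (p = 0.88136), whereas the p-values for CCPA vs. Video (p = 0.01191) and No-Privacy vs. Video (p = 0.01214) are both below 0.05. </p>
    
<p style='text-align: justify;'>As a result, we have sufficient evidence to reject the null hypothesis for both comparisons involving the video, while we retain it for CCPA vs. No-Privacy. The video condition therefore differs significantly from both other conditions: it had 30 aborted checkouts compared with 13 (CCPA) and 12 (No-Privacy), whereas the CCPA opt-out shows no measurable effect on checkout completion. It would therefore be preferable to revise the video option so that it still helps customers make informed decisions.  </p>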



### 2.2 Describe the distributions of money spent by the customers (6 Points)


(a) Calculate the mean and the standard deviation of the money spent in successful checkout procedures for all three log files.

In [59]:
# TODO: Calculate mean and std. deviation of the money spent in successful checkout procedures
import numpy as np

# placeholder for saving successful payment amount per file
successful_payment_amount_file = {}

for file_name, user_info_payment in successful_payement_amount_puser.items():
    payment_amounts = list(user_info_payment.values())
    successful_payment_amount_file[file_name] = payment_amounts
    mean = np.mean(payment_amounts)
    std = np.std(payment_amounts)
    print(f"File: {file_name}.log")
    print(f"Mean: {mean:.2f}$\t SD: {std:.2f}$")
    print(100 * "-")


File: Assignment_2_webshop-comville-apache_ccpa_opt-out_access.log
Mean: 748.83$	 SD: 206.18$
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_no_privacy_options_access.log
Mean: 763.69$	 SD: 203.37$
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_video_explanations_opt-in_access.log
Mean: 714.66$	 SD: 220.68$
----------------------------------------------------------------------------------------------------
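
A note on the standard deviation: `np.std` computes the population standard deviation by default (`ddof=0`), whereas the sample standard deviation divides by n - 1. The following sketch illustrates the difference with the standard library; the five amounts are taken from the printout in section 1.3 purely as an illustrative sample.

```python
import math
import statistics

# A handful of amounts from the 1.3 printout, used only for illustration
amounts = [619.90, 792.87, 913.84, 694.85, 954.90]
n = len(amounts)

population_sd = statistics.pstdev(amounts)  # divides by n, like np.std(..., ddof=0)
sample_sd = statistics.stdev(amounts)       # divides by n - 1, like np.std(..., ddof=1)

# The two differ by the factor sqrt(n / (n - 1))
ratio = math.sqrt(n / (n - 1))

print(f"population SD = {population_sd:.2f}, sample SD = {sample_sd:.2f}")
```

With more than 400 purchases per condition the factor is very close to 1, so the values reported above would change only marginally.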


(b) Check each of the three distributions for normality and homogeneity of variances.

In [67]:
# TODO: Check for normality and homogeneity of variances
# Shapiro-Wilk normality test for each of the three conditions
import scipy.stats as stats
# payment amounts per log file, reused for the Levene test below
successful_payment_amount_file = {}

for file_name, user_info_payment in successful_payement_amount_puser.items():
    payment_amounts = list(user_info_payment.values())
    successful_payment_amount_file[file_name] = payment_amounts
    shapiro_stat, shapiro_p = stats.shapiro(payment_amounts)
    print(f"File: {file_name}.log")
    print(f"Shapiro-Wilk : stat = {shapiro_stat: .3f}, p = {shapiro_p: .5f}")
    print("Normally distributed." if shapiro_p > 0.05 else "Not normally distributed.")
    print(100 * "-")

levene_stat, levene_p = stats.levene(successful_payment_amount_file['Assignment_2_webshop-comville-apache_no_privacy_options_access'], 
                                     successful_payment_amount_file['Assignment_2_webshop-comville-apache_ccpa_opt-out_access'], 
                                     successful_payment_amount_file['Assignment_2_webshop-comville-apache_video_explanations_opt-in_access'])
print(f'Levene Homogeneity : stat = {levene_stat: .3f}, p = {levene_p: .5f}')

File: Assignment_2_webshop-comville-apache_ccpa_opt-out_access.log
Shapiro-Wilk : stat =  0.998, p =  0.87039
Normally distributed.
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_no_privacy_options_access.log
Shapiro-Wilk : stat =  0.992, p =  0.03427
Not normally distributed.
----------------------------------------------------------------------------------------------------
File: Assignment_2_webshop-comville-apache_video_explanations_opt-in_access.log
Shapiro-Wilk : stat =  0.998, p =  0.89182
Normally distributed.
----------------------------------------------------------------------------------------------------
Levene Homogeneity : stat =  1.832, p =  0.16052


(c) Explain what each of your results means in simple terms for the executive board.

Explanation for the executive board:

<p style='text-align: justify;'> To determine whether the amounts from each log file follow a normal distribution, I used the Shapiro-Wilk test, a standard test of normality whose null hypothesis is that the data are normally distributed. The CCPA opt-out (p = 0.87039) and video (p = 0.89182) amounts show no significant deviation from normality, whereas the "No Privacy" amounts deviate significantly (p = 0.03427, below 0.05). I then checked the homogeneity of variances across the three groups with the Levene test, whose null hypothesis is that all groups have equal variances. Its p-value (0.16052) is higher than 0.05, so there is not enough evidence to reject the null hypothesis, and the variances are not significantly different from one another. In simple terms: the spread of purchase amounts is comparable in all three conditions, so any differences in purchase amounts between the three conditions are probably not the result of different variances; only the "No Privacy" amounts deviate from a normal shape, which should be kept in mind when interpreting parametric tests. </p>
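
To make the Levene test less of a black box, the sketch below computes its W statistic by hand: each value is replaced by its absolute deviation from the group median (the centering that `scipy.stats.levene` uses by default), and a one-way ANOVA is then run on those deviations. The helper name `levene_w` and the three small groups are hypothetical; the p-value is omitted because it requires the F distribution.

```python
from statistics import mean, median


def levene_w(*groups):
    """Return Levene's W statistic using deviations from the group medians."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # absolute deviations from each group's median
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for d, m in zip(deviations, group_means) for v in d)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: a and b have the same spread, c is much wider
a = [1, 2, 3, 4, 5]
b = [11, 12, 13, 14, 15]
c = [-10, 0, 10, 20, 30]

w_same_spread = levene_w(a, b)
w_different_spread = levene_w(a, c)
print(w_same_spread, w_different_spread)
```

Groups that differ only in location give a statistic of zero, while a difference in spread drives W up; a large W (small p-value) would speak against equal variances.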



### 2.3 Effect on the amount of money spent in each successful checkout procedure (6 Points)


(a) Use a parametric test to find out if there is a global effect on the money spent in each checkout procedure.

In [69]:
# TODO: Parametric test to check for global effect on money spent 
import scipy.stats as stats

# Get the list of payment amounts for each log file
payment_amounts = []
for user_payment in successful_payement_amount_puser.values():
    payment_amounts.append(list(user_payment.values()))

# Perform one-way ANOVA test
f_stat, p = stats.f_oneway(*payment_amounts)

print("One-way ANOVA results:")
print('p-value: {:.5f}'.format(p))

One-way ANOVA results:
p-value: 0.00251
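
The F statistic behind the one-way ANOVA is the ratio of the variability between group means to the variability within groups. The following minimal sketch computes it by hand on hypothetical toy data; the function name `one_way_anova_f` is an illustration, and the p-value is left out because it requires the F distribution.

```python
def one_way_anova_f(*groups):
    """Return the F statistic of a one-way ANOVA for the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # variability of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # variability of the observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical toy data: three groups with means 2, 3, and 4
f_value = one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_value)  # 3.0
```

An F value well above 1 indicates that the group means differ more than the within-group noise would explain, which is what the small p-value above expresses.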


(b) In case of a global effect, run a post-hoc analysis to find out which condition shows an effect.

In [76]:
test_pair = []

file_names = list(successful_payment_amount_file.keys())

# extracting pairs for Bonferroni testing
# NOTE: with three files this loop yields (file 0, file 1) and (file 1, file 2);
# the append after the loop then adds (file 1, file 2) a second time, so the
# (file 0, file 2) comparison is never run
for index, file in enumerate(file_names[:-1]):
    test_pair.append([file_names[index], file_names[index + 1]])
    
test_pair.append([file_names[index], file_names[-1]])

alpha = 0.05
bonferroni_alpha = alpha / len(test_pair)

for pairs in test_pair:
    t_stat, p_value = stats.ttest_ind(successful_payment_amount_file[pairs[0]], successful_payment_amount_file[pairs[1]])
    
    if p_value < bonferroni_alpha:
        print(f'There is a significant difference between {pairs[0]} and {pairs[1]} with p-value: {p_value:.3f} & Bonferroni alpha: {bonferroni_alpha:.3f}\n')
    else:
        print(f'There is no significant difference between {pairs[0]} and {pairs[1]} with p-value: {p_value:.3f} & Bonferroni alpha: {bonferroni_alpha:.3f}\n')


There is no significant difference between Assignment_2_webshop-comville-apache_ccpa_opt-out_access and Assignment_2_webshop-comville-apache_no_privacy_options_access with p-value: 0.289 & Bonferroni alpha: 0.017

There is a significant difference between Assignment_2_webshop-comville-apache_no_privacy_options_access and Assignment_2_webshop-comville-apache_video_explanations_opt-in_access with p-value: 0.001 & Bonferroni alpha: 0.017

There is a significant difference between Assignment_2_webshop-comville-apache_no_privacy_options_access and Assignment_2_webshop-comville-apache_video_explanations_opt-in_access with p-value: 0.001 & Bonferroni alpha: 0.017
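
The output above lists the no-privacy vs. video comparison twice and never compares the CCPA opt-out with the video. One way to enumerate every unordered pair exactly once is `itertools.combinations`, sketched below. The short condition labels and the `is_significant` helper are hypothetical; the two p-values passed to it are the ones printed above, and no result for the missing pair is assumed.

```python
from itertools import combinations

# Hypothetical short labels standing in for the three log file names
conditions = ["ccpa_opt_out", "no_privacy_options", "video_opt_in"]

# Every unordered pair exactly once: 3 conditions give 3 pairs
pairs = list(combinations(conditions, 2))

alpha = 0.05
bonferroni_alpha = alpha / len(pairs)


def is_significant(p_value, threshold=bonferroni_alpha):
    """Compare a p-value against the Bonferroni-corrected threshold."""
    return p_value < threshold


for first, second in pairs:
    print(first, "vs", second)

print(f"Bonferroni alpha: {bonferroni_alpha:.4f}")
print(is_significant(0.289), is_significant(0.001))
```

Running `stats.ttest_ind` over `pairs` instead of the hand-built list would include the CCPA opt-out vs. video comparison that is missing from the printed results.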



(c) Write a short explanation for the CEO that describes why you chose these tests and what the results mean for the future privacy options on the online shop.

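<p style='text-align: justify;'> One-way ANOVA is the standard parametric test for comparing the means of three or more independent groups, which makes it suitable for determining whether the privacy condition has a global effect on the money spent across the three log files. There are several common post-hoc analyses for one-way ANOVA; I chose the Bonferroni procedure because it uses t-tests to perform pairwise comparisons between group means [1] and because it was covered in the lecture. The one-way ANOVA yields p = 0.00251, which is less than 0.05, so we reject the null hypothesis that the mean amount spent is the same in all three conditions. In the post-hoc output, CCPA opt-out vs. no privacy options is not significant (p = 0.289), while no privacy options vs. video is significant (p = 0.001, below the Bonferroni-corrected alpha of 0.017). This pair appears twice in the output, so the CCPA opt-out vs. video comparison is not shown and still needs to be run. Based on the p-values and the Bonferroni-corrected alpha, we suggest that the online shop consider the future privacy option carefully, since the null hypothesis of no effect on spending can be rejected for the video condition.
 </p>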
 
 [1] https://www.ibm.com/docs/en/spss-statistics/saas?topic=anova-one-way-post-hoc-tests

### 2.4  Argue which kind of additional data and corresponding statistical tests could provide evidence that your preferred option (video-explanation with opt-in) trumps the other two. (4 Points)

<p style='text-align: justify;'> We could additionally record a customer identifier, the region each customer comes from, and traffic-related information such as how frequently a customer visits the site. To understand the relationship between these variables and the outcomes, we can use correlation analysis [2]. With a suitable coefficient, correlation analysis can be applied to both quantitative and qualitative factors. For instance, with the additional data we can determine which variable is most strongly correlated with the amount spent. The correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship. Apart from correlation analysis, we can also apply regression analysis, which was mentioned in one of the lectures; it likewise models the relationship between one variable and another. One of these two methods can probably help to find out whether 'Video' performs better than the other two options. For the post-hoc analysis, we can apply Tukey's HSD, which serves the same purpose as the Bonferroni procedure: after adding the additional data, we still have a fixed number of planned pairwise comparisons between the groups. 

[2] https://www.statisticshowto.com/probability-and-statistics/correlation-analysis/ 
</p>
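
As an illustration of the correlation analysis proposed above, the sketch below computes the Pearson correlation coefficient by hand. The variables (visits per week and amount spent) and their values are hypothetical, since this kind of data was not collected in the experiment.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: visits per week and amount spent per customer
visits = [1, 2, 3, 4, 5]
spent = [100.0, 200.0, 300.0, 400.0, 500.0]

r_positive = pearson_r(visits, spent)        # perfectly increasing relationship
r_negative = pearson_r(visits, spent[::-1])  # perfectly decreasing relationship
print(round(r_positive, 6), round(r_negative, 6))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate none; on real data the coefficient would have to be computed per condition to say anything about the video option.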

