### In this notebook we test the hypotheses that came out of our analysis.

In [130]:
from scipy import stats
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [131]:
df = pd.read_csv('online_shoppers_intention.csv')
df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


- Define a function that tests whether one category's share of the rows is larger than the share of all other categories combined.

In [132]:
def test_porportions(col, category_of_interest, alternative):
    # Number of rows that fall in the category of interest
    n_category = df[col].value_counts()[category_of_interest]

    # Number of rows in every other category
    n_others = df.shape[0] - n_category

    # One-sample binomial test: the observed share of the other categories
    # is used as the null proportion for the category of interest.
    # stats.binomtest replaces stats.binom_test, which was removed in SciPy 1.12.
    p_others = n_others / df.shape[0]
    p_value = stats.binomtest(n_category,
                              n_category + n_others,
                              p_others,
                              alternative=alternative).pvalue

    print(f'P-Value is : {p_value:.3f}')
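
Because the null proportion passed to the test is the observed share of the other categories, the p-value is close to 0 when the category covers more than half of the rows and close to 1 when it covers less than half. The sketch below spells out the upper-tail probability that a one-sided binomial test reports, using only the standard library. `binom_upper_tail` and `majority_p_value` are hypothetical helpers written for illustration (they are not part of SciPy), and the counts are made up.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid underflow."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def majority_p_value(n_category, n_total):
    """One-sided p-value when the null proportion is the share of all other rows."""
    p_others = (n_total - n_category) / n_total
    return binom_upper_tail(n_category, n_total, p_others)

# Hypothetical counts: a 60% share versus a 40% share out of 1,000 rows
p_majority = majority_p_value(600, 1000)
p_minority = majority_p_value(400, 1000)
print(f'{p_majority:.3f} {p_minority:.3f}')
```

With the null proportion tied to the complement, the two calls land at opposite extremes, which matches the pattern of 0.000 and 1.000 values printed in this notebook.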

- Operating System source 2 is the most used one. (reject the null hypothesis)
    - H_0 : source 2 <= other sources
    - H_A : source 2 > other sources

In [133]:
test_porportions(col='OperatingSystems', category_of_interest=2, alternative='greater')

P-Value is : 0.000


- Browser source 2 is the most used one. (reject the null hypothesis)
    - H_0 : source 2 <= other sources
    - H_A : source 2 > other sources

In [134]:
test_porportions(col='Browser', category_of_interest=2, alternative='greater')

P-Value is : 0.000


- Region source 1 is the most common one in the data. (fail to reject the null hypothesis) This is because source 3 also has a considerable presence; pooling sources 1 and 3 does give a majority, as the second cell shows.
    - H_0 : source 1 <= other sources
    - H_A : source 1 > other sources

In [135]:
test_porportions(col='Region', category_of_interest=1, alternative='greater')

P-Value is : 1.000
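
A p-value of 1.000 here means that region 1 covers less than half of the rows; it does not say that another region is larger. One possible way to test the narrower claim that the top category beats the runner-up is to look only at rows in those two categories and test whether the top one takes more than half of them. The sketch below does this with a normal approximation from the standard library. `top_vs_runner_up` is a hypothetical helper, and the counts are invented for illustration, not taken from the dataset.

```python
import math

def top_vs_runner_up(n_top, n_runner_up):
    """One-sided z-test of H_0: the two categories are equally likely,
    against H_A: the top category is more likely, using only rows
    that fall in one of the two categories."""
    n = n_top + n_runner_up
    # Under H_0 the top count is Binomial(n, 0.5): mean n/2, variance n/4
    z = (n_top - n / 2) / math.sqrt(n / 4)
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Invented counts for illustration only
z, p_value = top_vs_runner_up(n_top=600, n_runner_up=400)
print(f'z = {z:.3f}, p-value = {p_value:.3g}')
```

This is a different question from the majority test used in the notebook, so the two results are not interchangeable.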


In [136]:
# Count the rows that fall in region 1 or region 3
n_selected = df['Region'].value_counts()[[1, 3]].sum()

# Count the rows in all other regions
n_rest = df.shape[0] - n_selected

# One-sample binomial test with the share of the other regions as the null proportion
p_rest = n_rest / df.shape[0]
p_value = stats.binomtest(n_selected,
                          n_selected + n_rest,
                          p_rest,
                          alternative='greater').pvalue

print(f'P-Value is : {p_value:.3f}')

P-Value is : 0.000


- Traffic Type source 2 is the most used one. (fail to reject the null hypothesis) This is because sources [1, 3, 4] also have a considerable presence; pooling sources [1, 2, 3, 4] does give a majority, as the second cell shows.
    - H_0 : source 2 <= other sources
    - H_A : source 2 > other sources

In [137]:
test_porportions(col='TrafficType', category_of_interest=2, alternative='greater')

P-Value is : 1.000


In [138]:
# Count the rows that fall in traffic types 1, 2, 3 or 4
n_selected = df['TrafficType'].value_counts()[[1, 2, 3, 4]].sum()

# Count the rows in all other traffic types
n_rest = df.shape[0] - n_selected

# One-sample binomial test with the share of the other traffic types as the null proportion
p_rest = n_rest / df.shape[0]
p_value = stats.binomtest(n_selected,
                          n_selected + n_rest,
                          p_rest,
                          alternative='greater').pvalue

print(f'P-Value is : {p_value:.3f}')

P-Value is : 0.000


- [May, November] are the months in which people use our website the most. (reject the null hypothesis)
    - H_0 : [May, November] <= other months
    - H_A : [May, November] > other months

In [139]:
# Count the rows that fall in May or November
n_selected = df['Month'].value_counts()[['May', 'Nov']].sum()

# Count the rows in all other months
n_rest = df.shape[0] - n_selected

# One-sample binomial test with the share of the other months as the null proportion
p_rest = n_rest / df.shape[0]
p_value = stats.binomtest(n_selected,
                          n_selected + n_rest,
                          p_rest,
                          alternative='greater').pvalue

print(f'P-Value is : {p_value:.3f}')

P-Value is : 0.000
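
The three cells that pool several categories (regions, traffic types, months) repeat the same steps. A minimal sketch of how they could be folded into one helper is shown below. `pooled_proportion_p_value` is a hypothetical name, the function takes the frame as an argument instead of relying on the global `df`, and the toy frame is made up so the block runs on its own. It assumes SciPy 1.7 or later, where `stats.binomtest` is available.

```python
import pandas as pd
from scipy import stats

def pooled_proportion_p_value(data, col, categories, alternative='greater'):
    """Binomial test of the pooled share of `categories` in `data[col]`,
    using the share of all remaining rows as the null proportion."""
    n_total = len(data)
    # Rows whose value is one of the pooled categories
    n_selected = int(data[col].isin(categories).sum())
    p_rest = (n_total - n_selected) / n_total
    return stats.binomtest(n_selected, n_total, p_rest, alternative=alternative).pvalue

# Toy frame for illustration only: 7 of 10 rows fall in 'May' or 'Nov'
toy = pd.DataFrame({'Month': ['May'] * 4 + ['Nov'] * 3 + ['Feb'] * 3})
p_value = pooled_proportion_p_value(toy, 'Month', ['May', 'Nov'])
print(f'P-Value is : {p_value:.3f}')
```

In the notebook this would be called as `pooled_proportion_p_value(df, 'Region', [1, 3])`, and likewise for the traffic-type and month groups.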


- Returning visitors are the most common visitor type on our website. (reject the null hypothesis)
    - H_0 : Returning visitors <= other types
    - H_A : Returning visitors > other types

In [140]:
test_porportions(col='VisitorType', category_of_interest='Returning_Visitor', alternative='greater')

P-Value is : 0.000


- People tend to use our website during weekdays. (reject the null hypothesis)
    - H_0 : weekdays <= weekends
    - H_A : weekdays > weekends

In [141]:
test_porportions(col='Weekend', category_of_interest=False, alternative='greater')

P-Value is : 0.000


- Most people don't generate revenue. (reject the null hypothesis)
    - H_0 : no-revenue <= revenue
    - H_A : no-revenue > revenue

In [142]:
test_porportions(col='Revenue', category_of_interest=False, alternative='greater')

P-Value is : 0.000


- 0.0 is the most common value of SpecialDay in the dataset. (reject the null hypothesis)
    - H_0 : 0.0 value <= other values
    - H_A : 0.0 value > other values

In [143]:
test_porportions(col='SpecialDay', category_of_interest=0.0, alternative='greater')

P-Value is : 0.000


- Administrative source 0 is the most common one. (fail to reject the null hypothesis)
    - H_0 : source 0 <= other sources
    - H_A : source 0 > other sources

In [144]:
test_porportions(col='Administrative', category_of_interest=0, alternative='greater')

P-Value is : 1.000


- Informational source 0 is the most common one. (reject the null hypothesis)
    - H_0 : source 0 <= other sources
    - H_A : source 0 > other sources

In [145]:
test_porportions(col='Informational', category_of_interest=0, alternative='greater')

P-Value is : 0.000


# **----------------------------------------------------------------------------------------**

#### We define a function that compares the revenue rate of one category against all other categories with a two-sample t-test.

In [146]:
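def test_two_samples(col, categroy_of_interest, alternative='greater'):
    # Rows that belong to the category of interest
    mask = df[col] == categroy_of_interest

    # Revenue flags (True/False) inside and outside the category
    revenue_in_category = df['Revenue'][mask]
    revenue_in_others = df['Revenue'][~mask]

    # Two-sample t-test on the revenue flags; with alternative='greater' the
    # alternative is that the category has the higher mean revenue rate
    t_stat, p_value = stats.ttest_ind(revenue_in_category, revenue_in_others, alternative=alternative)
    print(f'statistic is : {t_stat:.3f}, p-value: {p_value:.3f}')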
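
Since `Revenue` is a True/False flag, the group means compared by the t-test are conversion rates, so the comparison can also be framed as a two-proportion z-test. The sketch below shows that formulation with a pooled standard error, using only the standard library. `two_proportion_z` is a hypothetical helper and the counts are invented; it is an alternative view of the same comparison, not the method used in this notebook.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of H_0: rate_a <= rate_b against H_A: rate_a > rate_b."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Invented counts: 30 of 100 rows convert in group A, 20 of 100 in group B
z, p_value = two_proportion_z(30, 100, 20, 100)
print(f'statistic is : {z:.3f}, p-value: {p_value:.3f}')
```

With samples as large as the ones in this dataset, the z-test and the t-test should give very similar statistics.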

- Returning_Visitor tends to generate more revenue than other visitor types (fail to reject the null hypothesis; the second cell shows that New_Visitor generates more revenue than the other visitor types)
    - H_0 : Returning_Visitor revenues <= other visitor types
    - H_1 : Returning_Visitor revenues > other visitor types

In [147]:
test_two_samples(col='VisitorType', categroy_of_interest='Returning_Visitor')

statistic is : -11.592, p-value: 1.000


In [148]:
test_two_samples(col='VisitorType', categroy_of_interest='New_Visitor')

statistic is : 11.626, p-value: 0.000
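
The two tests above each compare one visitor type against everything else. A quick way to see the rates behind those statistics is to average the `Revenue` flag per visitor type. The sketch below uses a small made-up frame (not the real dataset) so that it runs on its own; in the notebook the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration only; the real notebook would use df
toy = pd.DataFrame({
    'VisitorType': ['Returning_Visitor'] * 4 + ['New_Visitor'] * 4,
    'Revenue': [True, False, False, False, True, True, False, False],
})

# The mean of a True/False column is the share of True values,
# i.e. the revenue rate of each visitor type
revenue_rate = toy.groupby('VisitorType')['Revenue'].mean().sort_values(ascending=False)
print(revenue_rate)
```

Sorting the rates in descending order puts the visitor type with the highest revenue rate first.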


- Traffic Type source 2 tends to generate more revenue than other traffic types (reject the null hypothesis)
    - H_0 : source 2 revenues <= other sources
    - H_1 : source 2 revenues > other sources

In [149]:
test_two_samples(col='TrafficType', categroy_of_interest=2)

statistic is : 13.006, p-value: 0.000


- Region source 1 tends to generate more revenue than other Region sources (fail to reject the null hypothesis)
    - H_0 : source 1 revenues <= other sources
    - H_1 : source 1 revenues > other sources

In [150]:
test_two_samples(col='Region', categroy_of_interest=1)

statistic is : 1.601, p-value: 0.055


- Browser source 2 tends to generate more revenue than other Browser sources (fail to reject the null hypothesis)
    - H_0 : source 2 revenues <= other sources
    - H_1 : source 2 revenues > other sources

In [151]:
test_two_samples(col='Browser', categroy_of_interest=2)

statistic is : -0.464, p-value: 0.679


- Operating System source 2 tends to generate more revenue than other operating system sources (reject the null hypothesis)
    - H_0 : source 2 revenues <= other sources
    - H_1 : source 2 revenues > other sources

In [152]:
test_two_samples(col='OperatingSystems', categroy_of_interest=2)

statistic is : 6.678, p-value: 0.000


- People tend to generate more revenue in November than in other months (reject the null hypothesis)
    - H_0 : November revenues <= other months' revenues
    - H_1 : November revenues > other months' revenues

In [153]:
test_two_samples(col='Month', categroy_of_interest='Nov')

statistic is : 17.394, p-value: 0.000


- The 0.0 value of SpecialDay tends to generate more revenue than any other value of SpecialDay (reject the null hypothesis)
    - H_0 : 0.0 value revenues <= other values
    - H_1 : 0.0 value revenues > other values

In [154]:
test_two_samples(col='SpecialDay', categroy_of_interest=0.0)

statistic is : 9.650, p-value: 0.000
