
# Hypothesis Test
Hypothesis testing is a fundamental statistical method for making inferences or drawing conclusions about a population from a sample of data. It involves formulating a hypothesis, collecting data, and then assessing how likely the observed data would be if that hypothesis were true.

In [2]:
import pandas as pd
from google.colab import files

# Upload the files
uploaded = files.upload()

# Load datasets
media_type = pd.read_csv('media_type.csv')
playlist = pd.read_csv('playlist.csv')
playlist_track = pd.read_csv('playlist_track.csv')
album = pd.read_csv('album.csv')
artist = pd.read_csv('artist.csv')
employee_cleaned = pd.read_csv('employee_cleaned.csv')
invoice_cleaned = pd.read_csv('invoice_cleaned.csv')
invoice_line = pd.read_csv('invoice_line.csv')
track_cleaned = pd.read_csv('track_cleaned.csv')
genre = pd.read_csv('genre.csv')
customer_cleaned = pd.read_csv('customer_cleaned.csv')

Saving album.csv to album.csv
Saving artist.csv to artist.csv
Saving customer_cleaned.csv to customer_cleaned.csv
Saving employee_cleaned.csv to employee_cleaned.csv
Saving genre.csv to genre.csv
Saving invoice_cleaned.csv to invoice_cleaned.csv
Saving invoice_line.csv to invoice_line.csv
Saving media_type.csv to media_type.csv
Saving playlist.csv to playlist.csv
Saving playlist_track.csv to playlist_track.csv
Saving track_cleaned.csv to track_cleaned.csv


### 1.0 Hypothesis about unit_price
Let's say we want to test if the average unit_price of items sold is equal to $0.99.

*   Null Hypothesis (H0): The average unit_price is $0.99.

*   Alternative Hypothesis (H1): The average unit_price is not $0.99.

Since we are comparing the mean of a single sample to a known value, a one-sample t-test is appropriate.
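
As a rough illustration of what the one-sample t-test computes, the sketch below derives the statistic t = (sample mean - hypothesized mean) / (s / sqrt(n)) by hand and compares it with `stats.ttest_1samp`. The `prices` array is made-up data for illustration, not values taken from `invoice_line`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of unit prices (illustration only, not the Chinook data)
prices = np.array([0.99, 0.99, 1.99, 0.99, 1.99, 0.99, 0.99, 0.99])
hypothesized_mean = 0.99

n = prices.size
sample_mean = prices.mean()
sample_std = prices.std(ddof=1)  # sample standard deviation (n - 1 denominator)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (sample_mean - hypothesized_mean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test through SciPy, for comparison
t_scipy, p_scipy = stats.ttest_1samp(prices, hypothesized_mean)

print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The two lines should agree, which shows that the p-value reported by SciPy is the two-sided tail probability of the t distribution.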


In [4]:
import pandas as pd
from scipy import stats # import the stats module from the scipy library

# Hypothesized mean
hypothesized_mean = 0.99

# Perform the one-sample t-test
t_statistic, p_value = stats.ttest_1samp(invoice_line['unit_price'], hypothesized_mean)

# Display the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")


T-statistic: 10.804401274261174
P-value: 1.4644777067157573e-26


### 1.1 Interpret the Results
We'll interpret the p-value to draw our conclusion.

In [5]:
# Define the significance level
alpha = 0.05

# Decision based on the p-value
if p_value < alpha:
    print("Reject the null hypothesis (H0). The average unit price is significantly different from $0.99.")
else:
    print("Fail to reject the null hypothesis (H0). The average unit price is not significantly different from $0.99.")


Reject the null hypothesis (H0). The average unit price is significantly different from $0.99.


### 2. Hypothesis on Sales Performance
Is the average number of tracks sold per invoice higher during the holiday season compared to the rest of the year?

*   Null Hypothesis (H0): The average number of tracks sold per invoice is the same during the holiday season and the rest of the year.

*   Alternative Hypothesis (H1): The average number of tracks sold per invoice is higher during the holiday season than the rest of the year.
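
Because H1 here is directional ("higher"), the matching p-value is one-sided. The sketch below, on synthetic data, shows one way to convert the two-sided p-value that `ttest_ind` returns by default into a one-sided one. The arrays `holiday` and `rest_of_year` are hypothetical and are not derived from the Chinook tables. Recent SciPy versions (1.6 and later) also accept `alternative='greater'` directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-invoice track counts (illustration only)
holiday = rng.poisson(lam=6, size=40)
rest_of_year = rng.poisson(lam=5, size=120)

# Welch's t-test; SciPy returns a two-sided p-value by default
t_stat, p_two_sided = stats.ttest_ind(holiday, rest_of_year, equal_var=False)

# One-sided p-value for H1: mean(holiday) > mean(rest_of_year).
# If t is positive, half of the two-sided p-value lies in the tail of interest;
# if t is negative, the evidence points the other way.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"one-sided p = {p_one_sided:.4f}")
```

A negative t-statistic can never support a "higher during the holidays" conclusion, no matter how small the two-sided p-value is.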






In [11]:
# Convert the invoice_date to datetime in invoice_cleaned
invoice_cleaned['invoice_date'] = pd.to_datetime(invoice_cleaned['invoice_date'])


In [12]:
# Merge the invoice_cleaned and invoice_line datasets on 'invoice_id'
data = pd.merge(invoice_cleaned, invoice_line, on='invoice_id')

# Extract month from the invoice_date
data['month'] = data['invoice_date'].dt.month

# Define holiday season as November (11), December (12), and January (1)
holiday_season = [11, 12, 1]

# Flag rows whose invoice falls in the holiday season (1 = yes, 0 = no)
data['is_holiday_season'] = data['month'].isin(holiday_season).astype(int)



In [13]:
# Calculate the number of tracks sold per invoice
tracks_per_invoice = data.groupby('invoice_id').agg({'quantity': 'sum', 'is_holiday_season': 'first'}).reset_index()

# Separate the data into holiday season and non-holiday season
holiday_tracks = tracks_per_invoice[tracks_per_invoice['is_holiday_season'] == 1]['quantity']
non_holiday_tracks = tracks_per_invoice[tracks_per_invoice['is_holiday_season'] == 0]['quantity']


In [15]:
# Import the required function
from scipy.stats import ttest_ind

# Perform an independent t-test
t_stat, p_value = ttest_ind(holiday_tracks, non_holiday_tracks, equal_var=False)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -0.15802270114067282
P-value: 0.8746190342655766


In [16]:
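alpha = 0.05

# H1 is one-sided (holiday > rest of year), but ttest_ind returned a two-sided p-value.
# Reject H0 only if the difference is in the hypothesized direction (t_stat > 0)
# and the one-sided p-value (half the two-sided value) is below alpha.
if t_stat > 0 and p_value / 2 < alpha:
    print("Reject the null hypothesis (H0). The average number of tracks sold per invoice is higher during the holiday season.")
else:
    print("Fail to reject the null hypothesis (H0). There is no significant difference in the average number of tracks sold per invoice during the holiday season compared to the rest of the year.")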


Fail to reject the null hypothesis (H0). There is no significant difference in the average number of tracks sold per invoice during the holiday season compared to the rest of the year.


### 3. Hypothesis on Customer Demographics
Do customers from different countries have different average purchase amounts?
*   Null Hypothesis (H0): The average purchase amounts are the same for customers from all countries.

*   Alternative Hypothesis (H1): The average purchase amounts differ for customers from different countries.
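
Comparing the means of more than two groups is commonly done with a one-way ANOVA. As a minimal sketch of what that test computes, the block below builds the F statistic from the between-group and within-group sums of squares and checks it against `stats.f_oneway`. The three `groups` are invented invoice totals, not the Chinook data.

```python
import numpy as np
from scipy import stats

# Hypothetical invoice totals for three countries (illustration only)
groups = [
    np.array([5.94, 3.96, 8.91, 1.98]),
    np.array([0.99, 13.86, 5.94]),
    np.array([8.91, 1.98, 3.96, 5.94, 0.99]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (SS_between / (k - 1)) / (SS_within / (n - k))
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# p-value: upper tail of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

An F value well below 1 means the group means vary less than the spread within groups would lead one to expect, which corresponds to a large p-value.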



In [26]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Merge invoice_cleaned and customer_cleaned datasets on 'customer_id'
merged_data = pd.merge(invoice_cleaned, customer_cleaned, on='customer_id')



In [19]:
# The invoice 'total' column already holds the purchase amount for each invoice
merged_data['total_purchase'] = merged_data['total']

# Group by country and calculate the average purchase amount
country_avg_purchase = merged_data.groupby('country')['total_purchase'].mean().reset_index()

# Display the average purchase amount for each country
print(country_avg_purchase)


           country  total_purchase
0        Argentina        5.374286
1        Australia        5.374286
2          Austria        6.088571
3          Belgium        5.374286
4           Brazil        5.431429
5           Canada        5.427857
6            Chile        6.660000
7   Czech Republic        6.445714
8          Denmark        5.374286
9          Finland        5.945714
10          France        5.574286
11         Germany        5.588571
12         Hungary        6.517143
13           India        5.789231
14         Ireland        6.517143
15           Italy        5.374286
16     Netherlands        5.802857
17          Norway        5.660000
18          Poland        5.374286
19        Portugal        5.517143
20           Spain        5.374286
21          Sweden        5.517143
22             USA        5.747912
23  United Kingdom        5.374286


In [20]:
# Extract the purchase amounts for each country
country_groups = [group['total_purchase'].values for name, group in merged_data.groupby('country')]


In [21]:
# Perform ANOVA test
f_stat, p_value = stats.f_oneway(*country_groups)

print(f"ANOVA test statistic: {f_stat}")
print(f"p-value: {p_value}")



ANOVA test statistic: 0.07361605570892288
p-value: 0.999999999357232


In [22]:
# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis (H0). There are significant differences in the average purchase amounts among customers from different countries.")
else:
    print("We fail to reject the null hypothesis (H0). There is no significant evidence that the average purchase amounts differ between countries.")


We fail to reject the null hypothesis (H0). There is no significant evidence that the average purchase amounts differ between countries.


### 4. Hypothesis on Customer Loyalty
Are returning customers spending more on average than new customers?
*   Null Hypothesis (H0): Returning customers spend the same amount on average as new customers.
*   Alternative Hypothesis (H1): Returning customers spend more on average than new customers.



In [24]:
# Calculate the first and last purchase dates for each customer
customer_purchase_dates = invoice_cleaned.groupby('customer_id')['invoice_date'].agg(['min', 'max']).reset_index()


In [27]:
# Classify customers as new or returning based on the number of purchases
purchase_counts = invoice_cleaned.groupby('customer_id').size().reset_index(name='purchase_count')
customer_status = purchase_counts.copy()
customer_status['status'] = np.where(customer_status['purchase_count'] > 1, 'returning', 'new')



In [28]:
# Merge the status back to the invoice data
merged_data = pd.merge(invoice_cleaned, customer_status[['customer_id', 'status']], on='customer_id')


In [29]:
# The invoice 'total' column already holds the purchase amount for each invoice
merged_data['total_purchase'] = merged_data['total']

In [31]:
# Group by customer status and calculate the average purchase amount
status_avg_purchase = merged_data.groupby('status')['total_purchase'].mean().reset_index()

# Display the average purchase amount for each customer status
print(status_avg_purchase)

      status  total_purchase
0  returning        5.651942


In [32]:
# Extract the purchase amounts for returning and new customers
returning_customers = merged_data[merged_data['status'] == 'returning']['total_purchase']
new_customers = merged_data[merged_data['status'] == 'new']['total_purchase']

In [33]:
# Perform a t-test
t_stat, p_value = stats.ttest_ind(returning_customers, new_customers, equal_var=False)

print(f"T-test statistic: {t_stat}")
print(f"p-value: {p_value}")


T-test statistic: nan
p-value: nan


The test returns nan because there are no new customers: every customer in the dataset has made more than one purchase, so the "new" group is empty and this hypothesis cannot be tested with the available data.
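
One way to avoid silent nan results like this is to check the group sizes before running the test. The sketch below is a minimal example of that idea. The helper `welch_test_or_none` and its `min_size` parameter are hypothetical names introduced for illustration, and the data is invented to mirror the empty "new" group.

```python
import pandas as pd
from scipy import stats


def welch_test_or_none(group_a, group_b, min_size=2):
    """Run Welch's t-test only if both groups have at least `min_size` values."""
    if len(group_a) < min_size or len(group_b) < min_size:
        return None
    return stats.ttest_ind(group_a, group_b, equal_var=False)


# Hypothetical data mirroring the situation above: no 'new' customers
returning = pd.Series([5.94, 8.91, 3.96, 1.98])
new = pd.Series([], dtype=float)

result = welch_test_or_none(returning, new)
if result is None:
    print("Not enough data in one of the groups; the test was not run.")
else:
    print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Returning `None` instead of nan makes the missing comparison explicit, so a later `p_value < alpha` check cannot be mistaken for a real test outcome.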

In [34]:
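# Interpret the results
alpha = 0.05
if np.isnan(p_value):
    # One of the groups is empty, so the t-test is undefined
    print("The test could not be performed: there are no new customers to compare against.")
elif t_stat > 0 and p_value / 2 < alpha:
    # H1 is one-sided, so halve the two-sided p-value and require t_stat > 0
    print("We reject the null hypothesis (H0). Returning customers spend more on average than new customers.")
else:
    print("We fail to reject the null hypothesis (H0). There is no significant evidence that returning customers spend more on average than new customers.")

The test could not be performed: there are no new customers to compare against.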
