***

## Individual Challenge 3: T-test and chi square-test

This notebook starts with an example of a t-test and a chi-square test. Then the actual individual challenge begins, where you can apply these tests yourself.

There are three Interpretation questions in the Notebook on which you need to focus. 

***

### Example t-test

The data set comes from a study called “Framing and Behavior Change” by Nurit Nobel (2020). Participants in this study were smokers who followed a program to quit smoking, which they joined through a mobile application. In the application, the researcher deliberately varied the framing of a welcome message that portrayed a particular effect of smoking. Participants were randomly assigned to one of five groups: *benchmark*, *money-gain (MG)*, *money-loss (ML)*, *time-gain (TG)*, and *time-loss (TL)*. 

In [65]:
# Importing the data
import pandas as pd
df = pd.read_csv('sample_cssci.csv', delimiter=';') # Add file path if needed
df.head(5)

Unnamed: 0,user_id,habit_program_id,registered_date,probably_fake,gender,age,has_quit,has_reduced_habit,number_of_daily_cigarettes_before,number_of_daily_cigarettes_after,quit_from_before,exp_group,group
0,390977248,390977502,08/05/20,False,FEMALE,35.00.00,True,,18.0,0.0,True,baseline,Benchmark
1,391695433,391695687,10/05/20,False,FEMALE,33.00.00,False,False,10.0,,False,baseline,Benchmark
2,392220070,392220324,11/05/20,False,FEMALE,24.00.00,True,True,12.0,0.0,False,baseline,Benchmark
3,392887074,392887328,11/05/20,False,FEMALE,19.00,True,,4.0,,False,TL,Nudge
4,393173586,393173840,12/05/20,False,FEMALE,30.00.00,False,False,20.0,,False,baseline,Benchmark


In [66]:
# Subsetting the dataset
df2 = df.loc[df['exp_group'].isin(['TG', 'MG']), ['number_of_daily_cigarettes_after', 'exp_group']]
df2.head()

Unnamed: 0,number_of_daily_cigarettes_after,exp_group
50,5.0,MG
150,,MG
165,0.0,MG
220,0.0,MG
320,,TG


In [67]:
# Run an independent sample T-test
import scipy.stats as stats # statistical tests
t_statistic, p_value = stats.ttest_ind(a = df2.loc[df2['exp_group'] == 'TG', 'number_of_daily_cigarettes_after'], # vector of values for TG
                                      b = df2.loc[df2['exp_group'] == 'MG', 'number_of_daily_cigarettes_after'], # vector of values for MG
                                      alternative = 'two-sided',
                                      nan_policy = 'omit')

# a - group 1 data
# b - group 2 data
# equal_var - if True (default), perform a standard independent 2 sample test that assumes equal population variances

# alternative - defines the alternative hypothesis. The following options are available (default is 'two-sided').
#   - 'two-sided': the means of the distributions underlying the two samples are unequal
#   - 'less': the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample
#   - 'greater': the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample

# nan_policy - defines how to handle input that contains nan. The following options are available (default is 'propagate')
#   - 'propagate': returns nan
#   - 'raise': throws an error
#   - 'omit': performs the calculations ignoring nan values

In [68]:
# Printing out the results
print(f't-statistic: {t_statistic}')
print(f'p-value: {p_value}')

t-statistic: 0.9671529447445474
p-value: 0.33375877282002


In [69]:
# Conclusions

alpha = 0.05

if p_value < alpha:
    print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nThe mean values in two groups are statistically significantly different.')

else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nThe mean values in two groups are not statistically significantly different.')


p-value of 0.334 is higher than alpha value of 0.05. 
The null hypothesis should be retained. 
The mean values in two groups are not statistically significantly different.
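
To make the t-statistic less of a black box, here is a minimal sketch of how the pooled-variance statistic (the `equal_var=True` default) can be computed by hand and checked against `stats.ttest_ind`. The two arrays are made-up illustrative values, not the study's data, and the variable names are assumptions.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical values for two groups (illustration only, not the study data)
group_a = np.array([5.0, 0.0, 3.0, 8.0, 2.0, 6.0])
group_b = np.array([0.0, 0.0, 4.0, 1.0, 3.0])

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t = difference in means divided by the standard error of that difference
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

# Reference result from scipy (equal_var=True by default)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

print(f't (manual): {t_manual:.4f}, t (scipy): {t_scipy:.4f}')
print(f'p (manual): {p_manual:.4f}, p (scipy): {p_scipy:.4f}')
```

The manual and scipy values should agree, which shows that the statistic is the difference in means scaled by its standard error.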


### Example Chi square-test

Imagine that you are running an experiment where you are interested in the relationship between the "tone" of a message and whether people subscribe to a newsletter. 

Independent variable (IV): tone of the message 
- Convincing
- Neutral 

Dependent variable (DV): whether people subscribed to the newsletter
- Subscribed
- Not subscribed

Your research question:

Is the **tone of message on your website** related to the __subscription rate__?

H0: The tone of message is __not related__ to the subscription rate

Ha: The tone of message is __related__ to the subscription rate

In [70]:
# Creating values

# Generate values from the Bernoulli discrete random variable distribution (0 or 1)

# size - number of rounds (sample size)
# p - probability of success
# n - number of trials within a round

import numpy as np

np.random.seed(11)

message = np.random.binomial(size=1000, n=1, p=0.55) # 0 - convincing, 1 - neutral
subscription = np.random.binomial(size=1000, n=1, p=0.1) # 1 - subscribed, 0 - not subscribed

# Putting together the dataset

df = pd.DataFrame({'message' : message,
                   'subscription' : subscription})

# Changing numeric values to categories for clarity
df['subscription'] = df['subscription'].replace([0, 1], ['not_subscribed', 'subscribed']) # [old_values], [new_values]
df['message'] = df['message'].replace([0, 1], ['convincing', 'neutral'])
df.head(5)

Unnamed: 0,message,subscription
0,neutral,not_subscribed
1,neutral,not_subscribed
2,neutral,not_subscribed
3,convincing,not_subscribed
4,neutral,not_subscribed


In [71]:
# Running a crosstab
crosstab = pd.crosstab(df['message'], df['subscription'], margins=True)
crosstab

message \ subscription,not_subscribed,subscribed,All
convincing,397,54,451
neutral,506,43,549
All,903,97,1000


In [72]:
# Running a normalized crosstab
# There seems to be a difference in the number of subscriptions per convincing and neutral messages
# To see it clearer, the frequencies are turned into proportions by using 'normalize' parameter

crosstab_prop = round(pd.crosstab(df['message'], df['subscription'], normalize='index'),2) * 100
crosstab_prop

# normalize - default False
#   - If passed ‘all’ or True, will normalize over all values.
#   - If passed ‘index’ will normalize over each row.
#   - If passed ‘columns’ will normalize over each column.

message \ subscription,not_subscribed,subscribed
convincing,88.0,12.0
neutral,92.0,8.0


In [73]:
# Running Chi-squared test

# Unpacking the results of the Chi-squared test into:

# cs_statistic - Chi-squared value for the crosstab
# p_value - p-value
# dof - Degrees of freedom
# expected - Expected values under null hypothesis

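# Important: Note that the input for chi2_contingency() should be a raw crosstab with counts,
# not the one with proportions!

# That's why we're using the object called 'crosstab' created above,
# not 'crosstab_prop'.

# Caution: 'crosstab' was built with margins=True, so the 'All' row and column are
# treated as an extra category. This is why the output below reports 4 degrees of freedom;
# a 2x2 table without margins has (2-1)*(2-1) = 1.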

cs_statistic, p_value, dof, expected = stats.chi2_contingency(crosstab)

print(f'The Chi-squared statistic value: {cs_statistic}')
print(f'The p-value: {p_value}')
print(f'Number of degrees of freedom: {dof}')

The Chi-squared statistic value: 4.8472290674906136
The p-value: 0.30333494360532326
Number of degrees of freedom: 4


In [74]:
# Expected values represent the (theoretical) distribution under the null hypothesis -
# that is in case there is no relationship between the type of message and subscription behavior

# The expected values are calculated with the formula: (Row Total * Column Total)/N

print(f'Expected values: {expected}')

Expected values: [[ 407.253   43.747  451.   ]
 [ 495.747   53.253  549.   ]
 [ 903.      97.    1000.   ]]


In [75]:
# Turn the array "expected" (created above) into a dataframe to improve readability
# Each list in this array is essentially a row in the dataframe below. We manually name the columns
df1 = pd.DataFrame({'0': expected[:, 0], '1': expected[:, 1], 'All': expected[:, 2]})
df1

Unnamed: 0,0,1,All
0,407.253,43.747,451.0
1,495.747,53.253,549.0
2,903.0,97.0,1000.0


In [76]:
# Interpreting the p-value and drawing the conclusions

alpha = 0.05 # setting the alpha value of 5%

if p_value < alpha:
  print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nThe tone of message is related to the subscription rate.')
  
else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nThe tone of message is not related to the subscription rate.')


p-value of 0.303 is higher than alpha value of 0.05. 
The null hypothesis should be retained. 
The tone of message is not related to the subscription rate.
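
The example above passes a crosstab built with `margins=True` to `chi2_contingency()`, which is why 4 degrees of freedom are reported. Below is a minimal sketch of the same test on the 2x2 table of counts only, with the counts copied by hand from the crosstab above. The names `observed`, `chi2_yates` and `chi2_plain` are assumptions for illustration. Because the degrees of freedom change, the p-value and possibly the conclusion can differ from the output printed above.

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the crosstab above, without the 'All' margins
# rows: convincing, neutral; columns: not_subscribed, subscribed
observed = np.array([[397, 54],
                     [506, 43]])

# Default behaviour: for a 2x2 table scipy applies Yates' continuity correction
chi2_yates, p_yates, dof_yates, expected = stats.chi2_contingency(observed)

# Without the continuity correction (plain Pearson chi-squared)
chi2_plain, p_plain, dof_plain, _ = stats.chi2_contingency(observed, correction=False)

print(f'Degrees of freedom: {dof_plain}')
print(f'With Yates correction:    chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}')
print(f'Without Yates correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}')
```

In practice, one way to avoid the issue is to build the crosstab without `margins=True` for the test and keep the version with margins only for display.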


## The Individual Challenge

*** 

### Are older films better than recent ones? 

This question can be addressed with a t-test.

For reasons why older films may be better, see: https://www.quora.com/Why-are-old-movies-way-better-than-new-ones

***

In [77]:
# Import packages and open data
import pandas as pd
df = pd.read_csv('imdb_cssci.csv') # Add path if necessary
print(df.shape)
df.head()

(42500, 6)


Unnamed: 0,year,rating,num_votes,title,recent_film,awarded
0,1910,5.6,29,The Connecticut Yankee,0,0
1,1910,4.7,38,Abraham Lincoln's Clemency,0,0
2,1910,6.7,38,The Sanitarium,0,0
3,1910,4.7,24,Rip Van Winkle,0,0
4,1910,5.7,57,Jane Eyre,0,0


In [78]:
# Compare the ratings for two groups 
df.groupby('recent_film')['rating'].mean()

recent_film
0    5.954845
1    5.708871
Name: rating, dtype: float64

In [79]:
# Now run a t-test and assess whether you can accept/reject the alternative hypothesis

# Run an independent sample T-test
import scipy.stats as stats # statistical tests
t_statistic, p_value = stats.ttest_ind(a = df.loc[df['recent_film'] == 0, 'rating'], # vector of values for old films
                                      b = df.loc[df['recent_film'] == 1, 'rating'], # vector of values for new films
                                      alternative = 'two-sided',
                                      nan_policy = 'omit')

In [80]:
# Printing out the results
print(f't-statistic: {t_statistic}')
print(f'p-value: {p_value}')

t-statistic: 20.54842509469471
p-value: 2.264764248276443e-93


In [81]:
# Conclusions

alpha = 0.05

if p_value < alpha:
    print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nThe mean values in two groups are statistically significantly different.')

else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nThe mean values in two groups are not statistically significantly different.')

p-value of 0.0 is lower than the alpha value of 0.05. 
The alternative hypothesis should be accepted. 
The mean values in two groups are statistically significantly different.


***

**Interpretation question 1:** 

Explain why you accept/reject the alternative hypothesis. Provide an intuitive explanation and then also one in which you demonstrate your mathematical knowledge, while using the terms "alpha", "p-value" and "z-score" in your answer. 


***

Your answer: 


I accept the alternative hypothesis. In hypothesis testing, the p-value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. In this case, the p-value (2.264764248276443e-93) is far smaller than the chosen significance level (alpha = 0.05). When the p-value is smaller than alpha, the null hypothesis is rejected in favor of the alternative hypothesis. Additionally, the t-statistic of 20.54842509469471 indicates that the difference between the means of the two groups (older and recent films) is large relative to the variability within each group. Because the sample is very large, the t distribution is practically identical to the standard normal distribution, so the t-statistic can be read as a z-score. For a two-tailed test with alpha = 0.05 the critical z-scores are approximately ±1.96, and the observed value falls far beyond these cutoff points, which leads to rejecting the null hypothesis and accepting the alternative hypothesis.
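
The comparison with the critical values described in this answer can be sketched as follows. This is a minimal illustration that assumes all 42,500 ratings are non-missing, so that the degrees of freedom equal 42,500 - 2; the variable names are assumptions.

```python
import scipy.stats as stats

alpha = 0.05
t_observed = 20.54842509469471  # t-statistic printed above

# Critical value of the standard normal distribution for a two-sided test
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical value of the t distribution; dof assumes no missing ratings
dof = 42500 - 2
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)

print(f'z critical value: {z_crit:.3f}')
print(f't critical value: {t_crit:.3f}')
print(f'Reject H0: {abs(t_observed) > t_crit}')
```

With this many observations the two critical values are nearly identical, which is why the t-statistic can be treated like a z-score here.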

### Do audiences and professionals agree on what is a good film?

***

Different audiences can evaluate and value things in different ways. In cultural sociology, some make a useful distinction between _professional recognition_ (recognition that comes from other professionals in a field, often experts, who are trained in that area) and _audience recognition_ (recognition from consumers who aren't necessarily experts). 

This prompts the question: do the two groups agree or disagree on what a good film is? This can be studied with the following two variables. 

Independent variable (IV): awarded 
- 1) Film title received an award (indicative of professional recognition)
- 0) Film title did not receive a film award

Dependent variable (DV): high_rated
- 1) Film is highly rated (indicative of audience recognition)
- 0) Film is not highly rated

Your research question may be:

Is the **professional recognition** of films related to **audience recognition**?

Your hypotheses:

H0: Awards are _not related_ to high ratings

Ha: Awards are _related_ to high ratings

***


First, we create a cutoff point to identify "highly rated" films. If a film title has a rating that is more than 1 standard deviation above the mean rating, we consider it "highly rated". 

In [82]:
# Create cutoff point by computing the mean and 1 standard deviation for ratings and add those to each other
basic_stats = df.rating.describe().loc[['mean','std']]
mean_std = basic_stats['mean'] + basic_stats['std']
print(mean_std)

7.0670900497119895



In [83]:
# Create a new binary variable "high_rated" based on this cutoff point
import numpy as np
df['high_rated'] = np.where(df['rating'] > mean_std,1,0)
df

Unnamed: 0,year,rating,num_votes,title,recent_film,awarded,high_rated
0,1910,5.6,29,The Connecticut Yankee,0,0,0
1,1910,4.7,38,Abraham Lincoln's Clemency,0,0,0
2,1910,6.7,38,The Sanitarium,0,0,0
3,1910,4.7,24,Rip Van Winkle,0,0,0
4,1910,5.7,57,Jane Eyre,0,0,0
...,...,...,...,...,...,...,...
42495,2009,4.8,733,Breaking Point,1,0,0
42496,2009,8.8,15,Under One Sun,1,0,1
42497,2009,7.8,20,Taylor's Way,1,1,1
42498,2009,4.2,419,Babysitters Beware,1,0,0


In [84]:
# Now run your crosstabs

In [85]:
# Running a crosstab
crosstab = pd.crosstab(df['awarded'], df['high_rated'], margins=True)
crosstab

awarded \ high_rated,0,1,All
0,30937,3663,34600
1,5487,2413,7900
All,36424,6076,42500


In [86]:
# Running a normalized crosstab

crosstab_prop = round(pd.crosstab(df['awarded'], df['high_rated'], normalize='index'),2) * 100
crosstab_prop

awarded \ high_rated,0,1
0,89.0,11.0
1,69.0,31.0


***

**Interpretation question 2:** 

Your Core Lecturer Rens thought it would be a good idea to construct a cutoff point by adding 1 standard deviation to the mean film rating. 2.1) What could be the reasoning behind his choice? 2.2) Propose an alternative way to construct a cutoff point to identify "highly rated" films, and explain why you may want to use this.  


***

Your answer:

2.1) Assuming the ratings are roughly normally distributed, most films are concentrated around the mean. By adding 1 standard deviation to the mean, Rens raises the standard for "highly rated" so that only films clearly above the bulk of the distribution qualify. This makes the distinction between high-rated and other films sharper and the statistical results more meaningful.

2.2) I could also use quartiles to define high-rated films. The quartiles of the film ratings are calculated first, and the third quartile is then used as the cutoff: films with a rating above it are considered "highly rated". Quartiles are easy to interpret, because a film above the third quartile is rated higher than roughly 75% of all films. Additionally, quartiles are robust to outliers, which minimizes the impact of extreme values.
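
The two cutoff rules discussed in this answer can be compared side by side. The sketch below uses a small made-up `ratings` series (not the IMDb data), and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings (illustration only, not the IMDb data)
ratings = pd.Series([4.2, 4.7, 4.8, 5.6, 5.7, 6.7, 7.8, 8.8])

# Cutoff used in this notebook: mean plus one standard deviation
cutoff_mean_std = ratings.mean() + ratings.std()

# Alternative cutoff: the third quartile (75th percentile)
cutoff_q3 = ratings.quantile(0.75)

# Binary indicators, analogous to the 'high_rated' variable
high_rated_mean_std = (ratings > cutoff_mean_std).astype(int)
high_rated_q3 = (ratings > cutoff_q3).astype(int)

print(f'mean + 1 std cutoff: {cutoff_mean_std:.3f}, films above: {high_rated_mean_std.sum()}')
print(f'third quartile cutoff: {cutoff_q3:.3f}, films above: {high_rated_q3.sum()}')
```

On real data the quartile rule labels roughly a quarter of the films as highly rated by construction, whereas the share selected by the mean-plus-standard-deviation rule depends on the shape of the distribution.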

In [87]:
# Now run a Chi square-test and assess whether you can accept/reject the alternative hypothesis

In [92]:
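# Caution: crosstab_prop contains row percentages, while chi2_contingency() expects raw counts
# (see the note in the example above). The statistic and p-value printed below are therefore
# computed from percentages rather than from the observed film counts.
cs_statistic, p_value, dof, expected = stats.chi2_contingency(crosstab_prop)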

print(f'The Chi-squared statistic value: {cs_statistic}')
print(f'The p-value: {p_value}')
print(f'Number of degrees of freedom: {dof}')

The Chi-squared statistic value: 10.880048221820374
The p-value: 0.0009720571562669138
Number of degrees of freedom: 1


In [89]:
print(f'Expected values: {expected}')

Expected values: [[29653.42117647  4946.57882353 34600.        ]
 [ 6770.57882353  1129.42117647  7900.        ]
 [36424.          6076.         42500.        ]]


In [90]:
# Turn the array "expected" (created above) into a dataframe to improve readability
df1 = pd.DataFrame({'0': expected[:, 0], '1': expected[:, 1], 'All': expected[:, 2]})
df1

Unnamed: 0,0,1,All
0,29653.421176,4946.578824,34600.0
1,6770.578824,1129.421176,7900.0
2,36424.0,6076.0,42500.0


In [91]:
# Interpreting the p-value and drawing the conclusions

alpha = 0.05 # setting the alpha value of 5%

if p_value < alpha:
    print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nAwards are related to high ratings.')

else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nAwards are not related to high ratings.')

p-value of 0.0 is lower than the alpha value of 0.05. 
The alternative hypothesis should be accepted. 
Awards are related to high ratings.
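
The chi-squared test above was run on `crosstab_prop`, while the expected values were printed from a table that includes the 'All' margins. As a hedged sketch, the test can instead be run on the 2x2 table of raw counts. The counts below are copied by hand from the crosstab above, and `crosstab_with_margins` and `counts` are hypothetical names used only for this illustration.

```python
import pandas as pd
import scipy.stats as stats

# Counts copied from the crosstab above, including the 'All' margins
crosstab_with_margins = pd.DataFrame(
    [[30937, 3663, 34600],
     [5487, 2413, 7900],
     [36424, 6076, 42500]],
    index=pd.Index([0, 1, 'All'], name='awarded'),
    columns=pd.Index([0, 1, 'All'], name='high_rated'),
)

# Drop the margins so that only the four observed cells remain
counts = crosstab_with_margins.drop(index='All', columns='All')

# Chi-squared test on raw counts (Yates' correction is applied by default for 2x2 tables)
chi2, p, dof, expected = stats.chi2_contingency(counts)

print(f'The Chi-squared statistic value: {chi2}')
print(f'The p-value: {p}')
print(f'Number of degrees of freedom: {dof}')
```

With the actual notebook objects, the same idea would be to pass a crosstab created without `margins=True` and without `normalize`.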


***

**Interpretation question 3:** 

3.1) Briefly explain why you accept/reject the alternative hypothesis. 3.2) Now explain the intuition behind the formula that is used to compute the Chi-squared statistic, use the terms "observed" and "expected" values in your answer, and use the two-way tables (crosstab) that are created above. 3.3) Can you suggest another way (different data or methods) with which one could study differences between the evaluations by professionals and broader audiences? 


***

Your answer:

3.1) I accept the alternative hypothesis because the calculated p-value (0.0009720571562669138) is much lower than alpha (0.05). This means that an association at least as strong as the one observed would be very unlikely if the null hypothesis were true, so I reject the null hypothesis and accept the alternative hypothesis.

3.2) The intuition behind the formula for the chi-squared statistic lies in comparing the observed frequencies (counts) in the crosstab with the frequencies that would be expected if there were no association between the variables. Take the award-winning, highly rated films in the crosstab above as an example. The observed value is the actual count in that cell, 2413. The expected value is the product of the row total (all award-winning films) and the column total (all highly rated films), divided by the total number of films: (7900*6076)/42500 ≈ 1129. For each of the four cells (awarded and highly rated, not awarded and highly rated, awarded and not highly rated, not awarded and not highly rated), I subtract the expected value from the observed value, square the difference, and divide it by the expected value. Summing these four terms gives the chi-squared value.

3.3) In addition to the award and rating data, I think the number of votes could also be used, because the number of votes better reflects the specific preferences of those who rate films. We could use the same method as before to create a crosstab of the voting data and the rating data, and then draw the corresponding conclusion with a chi-square test.
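
The calculation described in 3.2 can be written out step by step. This is a minimal sketch using the counts from the crosstab above; the result is checked against `stats.chi2_contingency` with `correction=False`, since the hand calculation does not include Yates' continuity correction. Variable names are assumptions for illustration.

```python
import numpy as np
import scipy.stats as stats

# Observed counts from the crosstab above (rows: awarded 0/1, columns: high_rated 0/1)
observed = np.array([[30937, 3663],
                     [5487, 2413]])

# Row totals, column totals and the overall number of films
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under the null hypothesis: (row total * column total) / N
expected = row_totals * col_totals / n

# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Reference value from scipy, without the continuity correction
chi2_scipy, p, dof, _ = stats.chi2_contingency(observed, correction=False)

print(f'Expected count, awarded and highly rated: {expected[1, 1]:.2f}')
print(f'Chi-squared (manual): {chi2_manual:.2f}')
print(f'Chi-squared (scipy):  {chi2_scipy:.2f}')
```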