***

# Individual Challenge 2: t-test and chi-square test

This notebook begins with examples of a t-test and a chi-square test. After that, you'll move on to the individual challenge, where you'll apply these tests yourself and answer three related questions.

***

***

## The example part

***

### Example t-test

The dataset used in this example originates from the study _Framing and Behavior Change_ by Nurit Nobel (2020). The research focused on smokers participating in a smoking cessation program through a mobile application. As part of the study, the researcher deliberately altered the framing of a welcome message highlighting a specific effect of smoking. Participants were randomly assigned to one of five groups: benchmark, money-gain (MG), money-loss (ML), time-gain (TG), or time-loss (TL).

In [39]:
# Importing the data
import pandas as pd
df = pd.read_csv('sample_cssci-2.csv', delimiter=';')
df.head(5)

Unnamed: 0,user_id,habit_program_id,registered_date,probably_fake,gender,age,has_quit,has_reduced_habit,number_of_daily_cigarettes_before,number_of_daily_cigarettes_after,quit_from_before,exp_group,group
0,390977248,390977502,08/05/20,False,FEMALE,35.00.00,True,,18.0,0.0,True,baseline,Benchmark
1,391695433,391695687,10/05/20,False,FEMALE,33.00.00,False,False,10.0,,False,baseline,Benchmark
2,392220070,392220324,11/05/20,False,FEMALE,24.00.00,True,True,12.0,0.0,False,baseline,Benchmark
3,392887074,392887328,11/05/20,False,FEMALE,19.00,True,,4.0,,False,TL,Nudge
4,393173586,393173840,12/05/20,False,FEMALE,30.00.00,False,False,20.0,,False,baseline,Benchmark


In [40]:
# Subsetting the dataset
df2 = df.loc[df['exp_group'].isin(['TG', 'MG']), ['number_of_daily_cigarettes_after', 'exp_group']]
df2.head()

Unnamed: 0,number_of_daily_cigarettes_after,exp_group
50,5.0,MG
150,,MG
165,0.0,MG
220,0.0,MG
320,,TG


In [41]:
# Run an independent sample T-test
import scipy.stats as stats # statistical tests
t_statistic, p_value = stats.ttest_ind(a = df2.loc[df2['exp_group'] == 'TG', 'number_of_daily_cigarettes_after'], # vector of values for TG
                                      b = df2.loc[df2['exp_group'] == 'MG', 'number_of_daily_cigarettes_after'], # vector of values for MG
                                      alternative = 'two-sided',
                                      nan_policy = 'omit')

# a - group 1 data
# b - group 2 data
# equal_var - if True (default), perform a standard independent 2 sample test that assumes equal population variances

# alternative - defines the alternative hypothesis. The following options are available (default is 'two-sided').
#   - 'two-sided': the means of the distributions underlying the two samples are unequal
#   - 'less': the mean of the distribution underlying the first sample (a) is less than that of the second sample (b)
#   - 'greater': the mean of the distribution underlying the first sample (a) is greater than that of the second sample (b)

# nan_policy - defines how to handle when input contains nan. The following options are available (default is 'propagate')
#   - 'propagate': returns nan
#   - 'raise': throws an error
#   - 'omit': performs the calculations ignoring nan values

In [42]:
# Printing out the results
print(f't-statistic: {t_statistic}')
print(f'p-value: {p_value}')

t-statistic: 0.9671529447445474
p-value: 0.33375877282002


In [43]:
# Conclusions

alpha = 0.05

if p_value < alpha:
    print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nThe mean values in two groups are statistically significantly different.')

else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nThe mean values in two groups are not statistically significantly different.')


p-value of 0.334 is higher than alpha value of 0.05. 
The null hypothesis should be retained. 
The mean values in two groups are not statistically significantly different.
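
The `equal_var` parameter mentioned in the comments above decides whether the test pools the two group variances. The sketch below compares the default Student's t-test with Welch's t-test (`equal_var=False`). It uses synthetic data generated for illustration only, not the study's values, and the names `group_tg` and `group_mg` are placeholders.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; these are NOT the study's values.
rng = np.random.default_rng(0)
group_tg = rng.normal(loc=8.0, scale=2.0, size=40)  # smaller spread
group_mg = rng.normal(loc=8.5, scale=6.0, size=25)  # larger spread, fewer cases

# Student's t-test: assumes equal population variances (the default).
t_student, p_student = stats.ttest_ind(group_tg, group_mg, equal_var=True)

# Welch's t-test: does not assume equal population variances.
t_welch, p_welch = stats.ttest_ind(group_tg, group_mg, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```

When group sizes and spreads differ, the two variants generally give different results, so it can be worth checking both before drawing a conclusion.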


### Example chi-square test

Imagine that you are running an experiment where you are interested in the relationship between the "tone" of a message and whether people subscribe to a newsletter.

Independent variable (IV): tone of the message 
- Convincing
- Neutral 

Dependent variable (DV): whether people subscribed to the newsletter
- Subscribed
- Not subscribed

Your research question:

Is the **tone of message on your website** related to the __subscription rate__?

H0: The tone of message is __not related__ to the subscription rate

Ha: The tone of message is __related__ to the subscription rate

In [44]:
# Creating values

# Generate values from the Bernoulli discrete random variable distribution (0 or 1)

# size - number of rounds (sample size)
# p - probability of success
# n - number of trials within a round

import numpy as np

np.random.seed(11)

message = np.random.binomial(size=1000, n=1, p=0.55) # 0 - convincing, 1 - neutral
subscription = np.random.binomial(size=1000, n=1, p=0.1) # 1 - subscribed, 0 - not subscribed

# Putting together the dataset

df = pd.DataFrame({'message' : message,
                   'subscription' : subscription})

# Changing numeric values to categories for clarity
# (the result is assigned back to the column instead of using inplace=True on a column selection)
df['subscription'] = df['subscription'].replace([0, 1], ['not_subscribed', 'subscribed']) # [old_values], [new_values]
df['message'] = df['message'].replace([0, 1], ['convincing', 'neutral'])
df.head(5)


Unnamed: 0,message,subscription
0,neutral,not_subscribed
1,neutral,not_subscribed
2,neutral,not_subscribed
3,convincing,not_subscribed
4,neutral,not_subscribed


In [45]:
# Running a crosstab
crosstab = pd.crosstab(df['message'], df['subscription'], margins=True)
crosstab

subscription,not_subscribed,subscribed,All
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
convincing,397,54,451
neutral,506,43,549
All,903,97,1000


In [46]:
# Running a normalized crosstab
# There seems to be a difference in the number of subscriptions per convincing and neutral messages
# To see it clearer, the frequencies are turned into proportions by using 'normalize' parameter

crosstab_prop = round(pd.crosstab(df['message'], df['subscription'], normalize='index'),2) * 100
crosstab_prop

# normalize - default False
#   - If passed ‘all’ or True, will normalize over all values.
#   - If passed ‘index’ will normalize over each row.
#   - If passed ‘columns’ will normalize over each column.

subscription,not_subscribed,subscribed
message,Unnamed: 1_level_1,Unnamed: 2_level_1
convincing,88.0,12.0
neutral,92.0,8.0


In [47]:
# Running Chi-squared test

# Unpacking the results of the Chi-squared test into:

# cs_statistic - Chi-squared value for the crosstab
# p_value - p-value
# dof - Degrees of freedom
# expected - Expected values under null hypothesis

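# Important: Note that the input for chi2_contingency() should be a raw crosstab,
# not the one with proportions!!!

# That's why we're using the object called 'crosstab' created in cell In [45],
# not 'crosstab_prop' created in cell In [46].

# Caveat: 'crosstab' was built with margins=True, so its 'All' row and column are
# treated as extra categories here. That is why 4 degrees of freedom are reported
# below instead of the 1 expected for a 2x2 table.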

cs_statistic, p_value, dof, expected = stats.chi2_contingency(crosstab)

print(f'The Chi-squared statistic value: {cs_statistic}')
print(f'The p-value: {p_value}')
print(f'Number of degrees of freedom: {dof}')

The Chi-squared statistic value: 4.8472290674906136
The p-value: 0.30333494360532326
Number of degrees of freedom: 4
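
The 4 degrees of freedom reported above come from the `All` row and column: a table built with `margins=True` is read as a 3x3 table, giving (3-1)*(3-1) = 4, whereas a 2x2 table has 1. The sketch below, which copies the counts from the crosstab above into plain NumPy arrays, shows the difference. With a pandas crosstab, the same effect could be obtained by calling `pd.crosstab` without `margins=True`.

```python
import numpy as np
from scipy import stats

# Counts copied from the crosstab above.
# Rows: convincing, neutral. Columns: not_subscribed, subscribed.
observed = np.array([[397, 54],
                     [506, 43]])

# The same counts including the 'All' margins, as passed to the test above.
with_margins = np.array([[397, 54, 451],
                         [506, 43, 549],
                         [903, 97, 1000]])

stat_m, p_m, dof_m, _ = stats.chi2_contingency(with_margins)
stat_core, p_core, dof_core, _ = stats.chi2_contingency(observed)

print(f"With margins:    chi2 = {stat_m:.4f}, dof = {dof_m}, p = {p_m:.4f}")
print(f"Without margins: chi2 = {stat_core:.4f}, dof = {dof_core}, p = {p_core:.4f}")
```

Because the p-value depends on the degrees of freedom, the conclusion at a given alpha can differ between the two calls, so the table without margins is the one to test.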


In [48]:
# Expected values represent the (theoretical) distribution under the null hypothesis -
# that is in case there is no relationship between the type of message and subscription behavior

# The expected values are calculated with the formula: (Row Total * Column Total)/N

print(f'Expected values: {expected}')

Expected values: [[ 407.253   43.747  451.   ]
 [ 495.747   53.253  549.   ]
 [ 903.      97.    1000.   ]]
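
The formula (Row Total * Column Total)/N can be reproduced by hand. A minimal sketch, using the four inner counts from the crosstab above (without the `All` margins) and comparing the result with the expected values that scipy returns:

```python
import numpy as np
from scipy import stats

# Inner counts from the crosstab above.
# Rows: convincing, neutral. Columns: not_subscribed, subscribed.
observed = np.array([[397, 54],
                     [506, 43]])

row_totals = observed.sum(axis=1)  # total per message type
col_totals = observed.sum(axis=0)  # total per subscription outcome
n = observed.sum()                 # overall sample size

# Expected count per cell: (row total * column total) / N
expected_manual = np.outer(row_totals, col_totals) / n
print(expected_manual)

# The expected values computed by scipy for the same table
expected_scipy = stats.chi2_contingency(observed)[3]
print(np.allclose(expected_manual, expected_scipy))
```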


In [49]:
# Turn the array "expected" (created above) into a dataframe to improve readability
# Each list in this array is essentially a row in the dataframe below. We manually name the columns
df1 = pd.DataFrame({'0': expected[:, 0], '1': expected[:, 1], 'All': expected[:, 2]})
df1

Unnamed: 0,0,1,All
0,407.253,43.747,451.0
1,495.747,53.253,549.0
2,903.0,97.0,1000.0


In [50]:
# Interpreting the p-value and drawing the conclusions

alpha = 0.05 # setting the alpha value of 5%

if p_value < alpha:
  print(f'p-value of {round(p_value,3)} is lower than the alpha value of {alpha}. \nThe alternative hypothesis should be accepted. \
\nThe tone of message is related to the subscription rate.')
  
else:
    print(f'p-value of {round(p_value,3)} is higher than alpha value of {alpha}. \nThe null hypothesis should be retained. \
\nThe tone of message is not related to the subscription rate.')


p-value of 0.303 is higher than alpha value of 0.05. 
The null hypothesis should be retained. 
The tone of message is not related to the subscription rate.


***

## The challenge part

***

*** 

### Part 1: Are older films better (i.e., more highly rated) than recent ones? 

This question can be addressed with a t-test.

***

![film](https://i.guim.co.uk/img/media/722eed5bd8c9ccf308dd9228674225587b8fca59/0_44_2750_1649/master/2750.jpg?width=620&dpr=1&s=none&crop=none)

Casablanca (1943) is often hailed as a cinematic masterpiece and one of the greatest films of all time. Read more about it [here](https://www.theguardian.com/film/2022/nov/26/casablanca-80-humphrey-bogart-ingrid-bergman).


In [51]:
# Import packages and open data
import pandas as pd
from scipy.stats import ttest_ind
df = pd.read_csv('imdb_cssci-2.csv')
print(df.shape)
df.head()

(42500, 6)


Unnamed: 0,year,rating,num_votes,title,recent_film,awarded
0,1910,5.6,29,The Connecticut Yankee,0,0
1,1910,4.7,38,Abraham Lincoln's Clemency,0,0
2,1910,6.7,38,The Sanitarium,0,0
3,1910,4.7,24,Rip Van Winkle,0,0
4,1910,5.7,57,Jane Eyre,0,0


In [52]:
# Compare the ratings for two groups 
old_films = df[df['year'] < 2000]
new_films = df[df['year'] >= 2000]

print("Old film mean:", old_films['rating'].mean())
print("New film mean:", new_films['rating'].mean())

Old film mean: 5.818554789172305
New film mean: 5.851980955619792


In [53]:
# Run an independent sample T-test
t_stat, p_value = ttest_ind(old_films['rating'], new_films['rating'], nan_policy='omit')
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: -2.4878388529352966
P-value: 0.012855997864403645


***

**Interpretation question 1:**

1.1) State your hypotheses. Explain in basic terms why you accept/reject your hypotheses based on your t-test. 

1.2) Now explain why you accept/reject your hypotheses in a more elaborate way, while extensively demonstrating your mathematical knowledge, and using the terms "alpha", "p-value," and "z-score" in your answer. 

1.3) We recommend using a two-sided t-test, as demonstrated in the smoking example above. Why is a two-sided test appropriate for your study comparing older and newer films? Additionally, does using a two-sided test make it easier or more difficult to find a statistically significant difference between the two groups? Explain your reasoning.


***

Your answer:

(1.1)

Null Hypothesis (H0): There is NO difference in the mean IMDb ratings between older films (before 2000) and newer films (2000 and after).

Alternative Hypothesis (Ha): There is A difference in the mean IMDb ratings between the two groups.

Based on the results of the t-test, we reject the null hypothesis if the p-value is less than 0.05 (our significance level, α = 0.05); if the p-value is greater than 0.05, we fail to reject it. In our case, the p-value is 0.013 < 0.05, so we reject H₀ and conclude that the average ratings of old and new films are statistically different.



(1.2)

We set the significance level α = 0.05, which means we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis (Type I error).

The t-test returns a t-statistic of -2.49, which measures how many standard errors the observed difference in means lies from the hypothesized difference (which is 0 under H0). The t-statistic behaves like a z-score: the larger its absolute value, the further the observed difference lies from what H0 predicts. With samples this large, the t-distribution is practically identical to the standard normal distribution, where |z| > 1.96 marks the two-sided 5% rejection region, and |-2.49| exceeds that value.

The p-value of 0.0129 is the probability of obtaining a t-statistic at least this extreme if H0 were true. Since the p-value is less than alpha (0.0129 < 0.05), we reject the null hypothesis: the observed difference in average ratings is statistically significant.

Thus, using formal statistical criteria, we find that IMDb ratings DO differ between older and newer films.
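
The description of the t-statistic as "the difference in means divided by its standard error" can be checked by hand. The sketch below uses synthetic ratings generated for illustration (not the IMDb data) and assumes the default equal-variance version of the test, so the standard error is built from the pooled variance. The names `old_ratings` and `new_ratings` are placeholders.

```python
import numpy as np
from scipy import stats

# Synthetic ratings for illustration only; NOT the IMDb data.
rng = np.random.default_rng(1)
old_ratings = rng.normal(loc=5.8, scale=1.2, size=200)
new_ratings = rng.normal(loc=5.9, scale=1.2, size=250)

n_old, n_new = len(old_ratings), len(new_ratings)
dfree = n_old + n_new - 2  # degrees of freedom

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_old - 1) * old_ratings.var(ddof=1)
              + (n_new - 1) * new_ratings.var(ddof=1)) / dfree

# Standard error of the difference in means
std_error = np.sqrt(pooled_var * (1 / n_old + 1 / n_new))

# t-statistic: observed difference measured in standard errors
t_manual = (old_ratings.mean() - new_ratings.mean()) / std_error

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), dfree)

t_scipy, p_scipy = stats.ttest_ind(old_ratings, new_ratings)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```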

(1.3)

A two-sided t-test is appropriate here because we are not making a directional claim (e.g., “new films are better than old films” or vice versa). Instead, we are simply testing whether there is any difference between the average ratings of the two groups in EITHER direction.

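Using a two-sided test is more conservative, meaning it is harder to reach statistical significance than with a one-sided test in the correctly predicted direction. The total alpha level (0.05, as above) is split between BOTH tails of the distribution, 0.025 in each, so the test statistic must be more extreme to fall in the rejection region.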
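
The relationship between one-sided and two-sided p-values can be illustrated with the `alternative` parameter of `ttest_ind`. This is a sketch on synthetic data generated for illustration (not the IMDb data); `group_a` and `group_b` are placeholder names.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; NOT the IMDb data.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.8, scale=1.2, size=300)
group_b = rng.normal(loc=6.0, scale=1.2, size=300)

p_two = stats.ttest_ind(group_a, group_b, alternative='two-sided').pvalue
p_less = stats.ttest_ind(group_a, group_b, alternative='less').pvalue
p_greater = stats.ttest_ind(group_a, group_b, alternative='greater').pvalue

print(f"two-sided: {p_two:.4f}")
print(f"less:      {p_less:.4f}")
print(f"greater:   {p_greater:.4f}")

# The two-sided p-value is twice the smaller one-sided p-value.
print(np.isclose(p_two, 2 * min(p_less, p_greater)))
```

A one-sided test in the direction the data happen to point therefore yields half the two-sided p-value, which is why the direction has to be chosen before looking at the data.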

***

### Part 2: Do professionals and audiences agree on what is a good film?

***

![critic](https://www.hollywoodreporter.com/wp-content/uploads/2012/03/movie_theater_crowd_a_l.jpg?w=1440&h=810&crop=1)

***

Different audiences can evaluate and value things in different ways. When studying cultural production, it can be useful to make a distinction between _professional recognition_ (recognition from other professionals in a field, often experts trained in that area) and _audience recognition_ (recognition from consumers who aren't necessarily experts).

This prompts the question: do the two groups agree or disagree on what is a good film? This can be studied with the following two variables.

Independent variable (IV): awarded 
- 1) Film title received an award (indicative of professional recognition)
- 0) Film title did not receive a film award

Dependent variable (DV): high_rated
- 1) Film is highly rated (indicative of audience recognition)
- 0) Film is not highly rated

Your research question may be:

Is the **professional recognition** of films related to **audience recognition**?

Your hypotheses:

H0: Awards are _not related_ to high ratings

Ha: Awards are _related_ to high ratings

***


First, we create a cutoff point to identify "highly rated" films. If a film title has a rating more than 1 standard deviation above the mean rating, we consider it "highly rated".

In [54]:
# Create cutoff point by computing the mean and 1 standard deviation for ratings and add those to each other
basic_stats = df.rating.describe().loc[['mean','std']]
mean_std = basic_stats['mean'] + basic_stats['std']
print(mean_std)

7.0670900497119895


In [55]:
# Create a new binary variable "high_rated" based on this cutoff point
df['high_rated'] = np.where(df['rating'] > mean_std,1,0)
df

Unnamed: 0,year,rating,num_votes,title,recent_film,awarded,high_rated
0,1910,5.6,29,The Connecticut Yankee,0,0,0
1,1910,4.7,38,Abraham Lincoln's Clemency,0,0,0
2,1910,6.7,38,The Sanitarium,0,0,0
3,1910,4.7,24,Rip Van Winkle,0,0,0
4,1910,5.7,57,Jane Eyre,0,0,0
...,...,...,...,...,...,...,...
42495,2009,4.8,733,Breaking Point,1,0,0
42496,2009,8.8,15,Under One Sun,1,0,1
42497,2009,7.8,20,Taylor's Way,1,1,1
42498,2009,4.2,419,Babysitters Beware,1,0,0


In [56]:
# Create a crosstab between 'awarded' and 'high_rated'
contingency_table = pd.crosstab(df['awarded'], df['high_rated'])
print("Contingency Table (Awarded vs. Highly Rated):")
print(contingency_table)

Contingency Table (Awarded vs. Highly Rated):
high_rated      0     1
awarded                
0           30937  3663
1            5487  2413


***

**Interpretation question 2:**

Your Core Lecturer Rens thought it would be a good idea to construct a cutoff point by adding 1 standard deviation to the mean film rating. 

2.1) What could be the reasoning behind his choice? 

2.2) Propose an alternative way to construct a cutoff point to identify "highly rated" films, and explain why you may want to use this.  


***

Your answer:

(2.1)

Adding 1 standard deviation to the mean is a z-score thresholding approach (the cutoff sits at z = 1), a common way to identify values that are clearly above average:

- The mean represents the average rating.
- The standard deviation tells us how much variation or spread there is in the ratings.
- Setting the cutoff at mean + 1 std selects films that are rated clearly BETTER than most (high outliers included): only about the top 16% in a roughly normal distribution (100%/2 - 68%/2 = 16%).
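
The 16% figure can be checked against the normal distribution and compared with the share of films actually flagged in this dataset. A small sketch, using the counts from the contingency table printed earlier in the notebook:

```python
from scipy.stats import norm

# Share of a normal distribution lying more than 1 standard deviation above the mean
share_normal = norm.sf(1)

# Counts from the contingency table above (high_rated == 1, both 'awarded' rows)
n_high = 3663 + 2413
n_total = 30937 + 3663 + 5487 + 2413
share_observed = n_high / n_total

print(f"Normal distribution: {share_normal:.3f}")
print(f"Observed in data:    {share_observed:.3f}")
```

Any gap between the two shares reflects that the ratings are only roughly normally distributed.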

(2.2)

Use a percentile threshold (for example, the top 25%):

percentile_75 = df['rating'].quantile(0.75)

df['high_rated_alt'] = np.where(df['rating'] > percentile_75, 1, 0)

Percentiles are less sensitive to skewed distributions and outliers than the mean and standard deviation.

In [57]:
# Running Chi-squared test
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("Chi-squared Test Results:")
print(f"Chi2 Statistic: {chi2:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"P-value: {p:.4f}")
print("\nExpected Frequencies:")
print(expected)

Chi-squared Test Results:
Chi2 Statistic: 2089.1271
Degrees of Freedom: 1
P-value: 0.0000

Expected Frequencies:
[[29653.42117647  4946.57882353]
 [ 6770.57882353  1129.42117647]]


***

**Interpretation question 3:** 

3.1) Briefly explain why you accept/reject your hypotheses. 

3.2) Now explain the intuition behind the formula that is used to compute the Chi-squared statistic, use the terms "observed" and "expected" values in your answer, and use the two-way tables (crosstab) that are created above. 

3.3) Can you suggest another way (different data or methods) with which one could study differences between the evaluations by professionals and broader audiences? 


***

Your answer:

(3.1)

We reject the null hypothesis (H₀: awards and high ratings are not related), because the p-value is < 0.0001, far below the common alpha level of 0.05.
This provides strong evidence that professional recognition (awards) and audience recognition (high ratings) ARE statistically related.


(3.2)

The formula is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

$O$ = Observed frequencies from data

$E$ = Expected frequencies if the two variables (awarded and high_rated) were independent

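The expected values (e.g., for awarded=1 and high_rated=1) are calculated under the assumption of independence. Large differences between observed and expected values inflate the χ^2 statistic. In this case:

(30937-29653.42)^2/29653.42 + (3663-4946.58)^2/4946.58 + (5487-6770.58)^2/6770.58 + (2413-1129.42)^2/1129.42 ≈ 2090.76

The reported value of 2089.13 is slightly lower because chi2_contingency applies Yates' continuity correction by default to 2x2 tables (dof = 1), which subtracts 0.5 from each |O - E| before squaring.

Either way, the observed counts lie far from what independence would predict, which points to a relationship between awards and high audience ratings.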
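
The hand calculation can be checked against scipy. A minimal sketch, using the counts from the contingency table above: for 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the plain formula is compared with `correction=False` and the corrected formula with the default call.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the contingency table above.
# Rows: awarded 0, 1. Columns: high_rated 0, 1.
observed = np.array([[30937, 3663],
                     [5487, 2413]])

# Default call: Yates' continuity correction is applied for 2x2 tables
stat_corrected, p_corrected, dof, expected = chi2_contingency(observed)

# Same test without the continuity correction
stat_plain = chi2_contingency(observed, correction=False)[0]

# Plain formula: sum of (O - E)^2 / E over all cells
manual_plain = ((observed - expected) ** 2 / expected).sum()

# Corrected formula: shrink each |O - E| by 0.5 before squaring
manual_corrected = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(f"Plain:     scipy = {stat_plain:.4f}, manual = {manual_plain:.4f}")
print(f"Corrected: scipy = {stat_corrected:.4f}, manual = {manual_corrected:.4f}")
```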


(3.3)

Logistic Regression:

Use a logistic model to predict high_rated (0 or 1) using awarded as a predictor, giving the odds of being highly rated if awarded (but other data should be, supposedly, collected to be control variables).
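
As a rough preview of what such a model would report, the odds ratio can be computed directly from the contingency table above. With `awarded` as the only (binary) predictor, the fitted logistic regression slope should equal the log of this odds ratio; the sketch below computes that quantity by hand and does not fit a model.

```python
import numpy as np

# Counts from the contingency table above.
# Rows: awarded 0, 1. Columns: high_rated 0, 1.
table = np.array([[30937, 3663],
                  [5487, 2413]])

# Odds of being highly rated within each group
odds_not_awarded = table[0, 1] / table[0, 0]
odds_awarded = table[1, 1] / table[1, 0]

# Odds ratio: how many times higher the odds are for awarded films
odds_ratio = odds_awarded / odds_not_awarded
log_odds_ratio = np.log(odds_ratio)

print(f"Odds (not awarded): {odds_not_awarded:.3f}")
print(f"Odds (awarded):     {odds_awarded:.3f}")
print(f"Odds ratio:         {odds_ratio:.3f}")
print(f"Log odds ratio:     {log_odds_ratio:.3f}")
```

Adding control variables, as suggested above, would change this estimate, which is the main reason to fit a full model rather than rely on the table alone.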

Correlation Analysis with Continuous Scores:

Compute a correlation coefficient (Pearson or Spearman), giving a direct measure of agreement between professional and public evaluations across films, treating both as continuous variables.