In [3]:
import pandas as pd
import numpy as np

A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent).

In [4]:
vote = pd.DataFrame({
    "republican": [200, 250],
    "democrat": [150, 300],
    "independent": [50, 50],
}, index=["male", "female"])
vote

Unnamed: 0,republican,democrat,independent
male,200,150,50
female,250,300,50


Is there a gender gap? Do the men's voting preferences differ significantly from the women's preferences? Use a 0.05 level of significance.

# 1. Hypotheses

Ho: Gender and voting preferences are independent.

Ha: Gender and voting preferences are not independent.

# 2. Analysis Plan

We will conduct a chi-square test for independence at the 0.05 significance level.
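
For reference, the test statistic can be written as follows. The symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count), $R_i$ (row total), $C_j$ (column total), and $n$ (grand total) are notation introduced here for illustration, not taken from the notebook.

$$
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{n},
\qquad \text{df} = (2 - 1)(3 - 1) = 2
$$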

# 3. Analyze Sample Data

## 3.1 Resampling Approach

1. Constitute a box with 1,000 voter preferences: 450 Republican, 450 Democrat, and 100 Independent.
2. Shuffle, deal 400 preferences to a "male" group and the remaining 600 to a "female" group, and count the preferences in each group.
3. Find the squared differences between the shuffled counts and the expected counts, divide each by its expected count, and sum them.
4. Repeat steps 2 and 3, say, 1,000 times.
5. How often does the resampled statistic equal or exceed the observed one? That proportion is the p-value.
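
The resampling procedure can be sketched as a permutation test. This is a minimal illustration, not a definitive implementation: the helper `chi2_stat`, the seed, and the choice of 1,000 resamples are assumptions made here, and the statistic is the sum of squared Pearson residuals against the independence expectation.

```python
import numpy as np

# Observed counts (rows: male, female; columns: republican, democrat, independent)
observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

def chi2_stat(table):
    """Sum of squared Pearson residuals against the independence expectation."""
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / table.sum()
    return np.sum((table - expected) ** 2 / expected)

observed_stat = chi2_stat(observed)

# The "box": one label per voter (0 = republican, 1 = democrat, 2 = independent)
box = np.repeat(np.arange(observed.shape[1]), observed.sum(axis=0))
n_male = observed[0].sum()

rng = np.random.default_rng(0)  # arbitrary seed, assumed here for reproducibility
n_resamples = 1000
resampled_stats = np.empty(n_resamples)
for i in range(n_resamples):
    shuffled = rng.permutation(box)
    # First 400 shuffled labels form the "male" group, the rest the "female" group
    male_counts = np.bincount(shuffled[:n_male], minlength=3)
    female_counts = np.bincount(shuffled[n_male:], minlength=3)
    resampled_stats[i] = chi2_stat(np.vstack([male_counts, female_counts]))

# Share of resampled statistics at least as extreme as the observed one
p_value = np.mean(resampled_stats >= observed_stat)
print(f"observed statistic: {observed_stat:.4f}")
print(f"resampled p-value: {p_value:.4f}")
```

With only 1,000 resamples the estimated p-value can resolve nothing finer than 0.001, so a very small true p-value may show up as 0. More resamples give a finer estimate.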

In [17]:
vote

Unnamed: 0,republican,democrat,independent
male,200,150,50
female,250,300,50


For this test, we need the "expected" distribution of counts under the null hypothesis that gender and voting preference are independent. Under this assumption, both genders share the overall preference rates (450/1,000 Republican, 450/1,000 Democrat, 100/1,000 Independent), so each expected count is the row total times the column total divided by the grand total. For example, the expected number of male Republicans is 400 × 450 / 1,000 = 180.

In [47]:
row_totals = vote.sum(axis=1)
col_totals = vote.sum(axis=0)
# Expected counts under independence: row total * column total / grand total
expected_vote = pd.DataFrame(
    np.outer(row_totals, col_totals) / vote.values.sum(),
    index=vote.index,
    columns=vote.columns,
)
expected_vote

Unnamed: 0,republican,democrat,independent
male,180.0,180.0,40.0
female,270.0,270.0,60.0


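In [79]:
observed = vote.values
# Expected counts under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
r = (observed - expected) / np.sqrt(expected)  # Pearson residuals
x = np.sum(r ** 2)  # chi-square statistic
print(f"{x:.4f}")

16.2037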

## 3.2 Theoretical Approach with Chi-Square Distribution

In [49]:
vote.values

array([[200, 150,  50],
       [250, 300,  50]])

In [10]:
from scipy.stats import chi2_contingency

chisq, pvalue, df, expected = chi2_contingency(vote.values)
print(f'p-value: {pvalue:.4f}')

p-value: 0.0003
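
As a cross-check, the same p-value can be recovered by computing the statistic by hand and evaluating the upper tail of the chi-square distribution with (2 - 1)(3 - 1) = 2 degrees of freedom. This is a sketch under the assumption that no continuity correction applies (SciPy applies Yates' correction only when there is one degree of freedom). The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

statistic = np.sum((observed - expected) ** 2 / expected)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Survival function = P(chi-square with dof degrees of freedom >= statistic)
p_value = chi2.sf(statistic, dof)
print(f"chi-square: {statistic:.4f}, df: {dof}, p-value: {p_value:.4f}")
# prints: chi-square: 16.2037, df: 2, p-value: 0.0003
```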


We reject the null hypothesis, because the p-value (0.0003) is below the 0.05 significance level. *There is a relationship between gender and voting preference*.