## 1. Introduction
This Jupyter Notebook performs a two-sample proportions test using `proportions_ztest` from `statsmodels.stats.proportion` to determine whether the proportion of hobbyist users is the same for those under thirty as for those aged thirty and over.

### 1.1 Problem Statement
**$H_0$**: The proportion of hobbyist users is the same for those under thirty as for those aged thirty and over.

**$H_a$**: The proportion of hobbyist users is not the same for the two age groups.
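
In symbols, writing $p_{<30}$ and $p_{\geq 30}$ for the population proportions of hobbyists in the two age groups (notation introduced here for convenience, not taken from the dataset), the two hypotheses can be stated as:

$$
H_0:\; p_{<30} = p_{\geq 30}
\qquad\qquad
H_a:\; p_{<30} \neq p_{\geq 30}
$$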

### 1.2 Data Source
The dataset, `stack_overflow.feather`, is from DataCamp. It is available as a feather file to DataCamp subscribers and is expected to be placed in the `../../data` folder.
We will use two columns: `hobbyist`, which takes the values `Yes` and `No` (`Yes` meaning the user is a hobbyist), and `age_cat`, a categorical column with the value `Under 30` for users under 30 years old and `At least 30` for users aged 30 or over.
We'll use `pandas` to load the data and create two separate groups based on age: `under_thirty_df` and `at_least_thirty_df`.

## 2. Import Libraries and Set Significance Level
This step imports the necessary Python libraries and sets the significance level to 0.05.
- `pandas` is imported as `pd` for data manipulation and analysis.
- `numpy` is imported as `np`; the counts computed later are NumPy integers.
- `proportions_ztest` is imported to test for a difference in proportions using a normal (z) test.
- `alpha` is set to 0.05. This is the chosen significance level, which the p-value is compared against to decide whether to reject the null hypothesis.

In [22]:
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
alpha = 0.05

### 2.1 Load Data and Preprocessing
- Load the `stack_overflow.feather` dataset into a pandas DataFrame as `df`.
- Convert the `hobbyist` column values to 1 for `Yes` and 0 for `No`.
- Split the original DataFrame `df` into two DataFrames based on the users' age category.
    - `under_thirty_df` for users under 30 years old.
    - `at_least_thirty_df` for users aged at least 30.


In [23]:
df = pd.read_feather('../../data/stack_overflow.feather')

# Encode hobbyist as 1 (Yes) / 0 (No) so that summing the column counts hobbyists
df['hobbyist'] = df['hobbyist'].map({'Yes': 1, 'No': 0})
# Separate the data into two groups based on age
under_thirty_df = df[df['age_cat'] == 'Under 30']
at_least_thirty_df = df[df['age_cat'] == 'At least 30']
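
Before moving on, it can be worth confirming that the encoding and the split behaved as expected. The sketch below does this on a small, made-up DataFrame (`toy_df` and its five rows are hypothetical stand-ins, not values from `stack_overflow.feather`); the same two checks could be run on `df` itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({
    "hobbyist": ["Yes", "No", "Yes", "Yes", "No"],
    "age_cat": ["Under 30", "Under 30", "At least 30", "Under 30", "At least 30"],
})

# Same encoding as in the notebook: Yes -> 1, No -> 0
toy_df["hobbyist"] = toy_df["hobbyist"].map({"Yes": 1, "No": 0})

# map() produces NaN for any value missing from the dictionary,
# so a non-zero count here would signal unexpected labels
n_unmapped = int(toy_df["hobbyist"].isna().sum())

# Same split as in the notebook
under_thirty_toy = toy_df[toy_df["age_cat"] == "Under 30"]
at_least_thirty_toy = toy_df[toy_df["age_cat"] == "At least 30"]

# The two groups together should account for every row
rows_covered = len(under_thirty_toy) + len(at_least_thirty_toy)

print(f"unmapped values: {n_unmapped}")
print(f"all rows covered: {rows_covered == len(toy_df)}")
```

If either check fails on the real data, the counts passed to the test later would silently exclude some users.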


## 3. Calculating Sample Proportions

To perform the z-test for proportions, we need the number of "successes" (hobbyist users) and the total number of observations (total users) for each group.

1.  **Count the successes for each group.** A "success" is defined as a user being a hobbyist. Because `hobbyist` was encoded as 1 for `Yes` and 0 for `No`, summing the column gives the number of hobbyists.

2.  **Count the total observations for each group.** Calling `len` on a group's DataFrame gives the number of users in that group.

In [24]:
# Calculate counts for the under_thirty group
count_under_thirty = under_thirty_df['hobbyist'].sum()
nobs_under_thirty = len(under_thirty_df)

# Calculate counts for the at_least_thirty group
count_at_least_thirty = at_least_thirty_df['hobbyist'].sum()
nobs_at_least_thirty = len(at_least_thirty_df)

# Store counts and nobs in arrays
count = [count_under_thirty, count_at_least_thirty]
nobs = [nobs_under_thirty, nobs_at_least_thirty]

print(count, nobs)

[np.int64(1021), np.int64(812)] [1211, 1050]
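
The counts above can be turned into the sample proportions themselves, which makes the size of the gap between the groups easier to see. This is a minimal sketch that hard-codes the counts from the printed output rather than reading the data again; `p_hats` is a name introduced here for illustration.

```python
# Counts copied from the printed output above
count = [1021, 812]   # hobbyists: under 30, at least 30
nobs = [1211, 1050]   # total users: under 30, at least 30

# Sample proportion of hobbyists in each group
p_hats = [c / n for c, n in zip(count, nobs)]

for label, p_hat in zip(["Under 30", "At least 30"], p_hats):
    print(f"{label}: {p_hat:.4f}")
```

Roughly 84.3% of users under 30 are hobbyists, compared with about 77.3% of users aged at least 30. The z-test assesses whether a gap of this size is larger than sampling variation alone would plausibly produce.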



## 4. Performing the Z-Test

The `proportions_ztest` function from `statsmodels.stats.proportion` is the primary tool for this test. It takes three key arguments:

  * `count`: A list or array of the number of successes for each group.
  * `nobs`: A list or array of the total number of observations for each group.
  * `alternative`: Specifies the type of test. For our hypothesis, the `alternative='two-sided'` option is used because we are testing for a difference, not a specific direction (e.g., one group being greater than the other). Other options are `'smaller'` or `'larger'`.

The function returns two values: the **z-statistic** and the **p-value**.
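
For reference, with default settings the two-sample statistic appears to be the standard pooled z-statistic. Writing $x_i$ for the number of successes and $n_i$ for the number of observations in group $i$ (symbols introduced here, not part of the `statsmodels` API), it can be sketched as:

$$
\hat{p}_i = \frac{x_i}{n_i},
\qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2},
\qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Under $H_0$, $z$ is approximately standard normal, so the two-sided p-value is $P(|Z| \geq |z|)$ for $Z \sim N(0, 1)$.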



In [25]:

# Perform the z-test
z_score, p_value = proportions_ztest(count=count, nobs=nobs, alternative='two-sided')

# Print the results
print(f"z-score: {z_score:.4f}")
print(f"p-value: {p_value:.4f}")

z-score: 4.2237
p-value: 0.0000
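
The p-value is not exactly zero; it is only too small to show at four decimal places. As a cross-check, the sketch below recomputes the pooled z-statistic by hand using only the standard library, assuming the pooled-variance form of the test, and prints the p-value in scientific notation. The names `z_manual` and `p_manual` are introduced here for illustration.

```python
import math

# Counts copied from the printed output in Section 3
count = [1021, 812]   # hobbyists: under 30, at least 30
nobs = [1211, 1050]   # total users: under 30, at least 30

# Sample proportions for each group
p_hat_1 = count[0] / nobs[0]
p_hat_2 = count[1] / nobs[1]

# Pooled proportion under the null hypothesis of equal proportions
p_pooled = sum(count) / sum(nobs)

# Standard error of the difference in proportions, using the pooled proportion
std_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))

z_manual = (p_hat_1 - p_hat_2) / std_error

# Two-sided p-value for a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_manual = math.erfc(abs(z_manual) / math.sqrt(2))

print(f"z-score: {z_manual:.4f}")
print(f"p-value: {p_manual:.2e}")
```

The hand-computed z-score agrees with the value reported by `proportions_ztest`, and the p-value comes out on the order of $10^{-5}$, far below `alpha`.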




### 4.1 Interpreting the Results

The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true. We compare the p-value to our **significance level** ($\alpha$), which was set to 0.05 in Section 2.

  * If **p-value < $\alpha$**: We **reject the null hypothesis** ($H_0$). This suggests there is a statistically significant difference between the proportions.
  * If **p-value $\geq$ $\alpha$**: We **fail to reject the null hypothesis**. This means there is not enough evidence to conclude that the proportions differ.


In [26]:
# Interpret the results
if p_value < alpha:
    print("Result: Reject the null hypothesis. There is a statistically significant difference in the proportion of hobbyist users between the two age groups.")
else:
    print("Result: Fail to reject the null hypothesis. There is not enough evidence of a difference in the proportion of hobbyist users between the two age groups.")

Result: Reject the null hypothesis. There is a statistically significant difference in the proportion of hobbyist users between the two age groups.
