In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import norm

## 1. When normal approximation fails (Page 211 - Section 6.1.4)

In the last notebook, when calculating the CI for our ML model, we increased our test set size to satisfy the success/failure condition. Now let's see how to calculate a CI even if we cannot approximate the sampling distribution as a normal distribution.

In [2]:
# Same code from notebook 5
# test_size = 0.2
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'The accuracy of our model is {100*accuracy:.1f}% on the test set.')

The accuracy of our model is 96.5% on the test set.


In [3]:
# Success - Failure condition isn't met
successes = np.sum(y_pred == y_test)
failures = np.sum(y_pred != y_test)

successes, failures

(110, 4)
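The check above can be wrapped in a small helper. This is only a sketch: the function name `success_failure_ok` and the default threshold of 10 are assumptions for illustration (10 is the cut-off this notebook uses in its later checks).

```python
def success_failure_ok(n_success, n_failure, threshold=10):
    """Return True when both counts reach the threshold (assumed to be 10)."""
    return n_success >= threshold and n_failure >= threshold


# Counts from the cell above: 110 correct and 4 incorrect predictions
condition_met = success_failure_ok(110, 4)
print(condition_met)  # False, because only 4 failures were observed
```

With only 4 failures the condition is not met, which is why the normal approximation is not used for this test set.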

From our textbook: "For a confidence interval when the success-failure condition isn’t met, we can use what’s called the Clopper-Pearson interval."

Clopper and Pearson described this confidence interval in their 1934 paper:

Clopper, Charles J., and Egon S. Pearson. "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika 26.4 (1934): 404-413. [Link to paper.](https://www.jstor.org/stable/2331986)



Again from the textbook: "The details are beyond the scope of this book. However, there are many internet resources covering this topic."

EXERCISE

Find a resource and calculate the Clopper-Pearson interval. Solution available below:

In [4]:
# YOUR CODE HERE

In [4]:
from scipy.stats import beta

In [5]:
trials = X_test.shape[0]
trials

114

In [6]:
confidence_level = 0.95

# Calculate alpha (significance level)
alpha = 1 - confidence_level

# Calculate the lower and upper bounds using the beta distribution
lower_bound = beta.ppf(alpha / 2, successes, trials - successes + 1)
upper_bound = beta.ppf(1 - alpha / 2, successes + 1, trials - successes)

print(f"Clopper-Pearson confidence interval: [{lower_bound}, {upper_bound}]")

Clopper-Pearson confidence interval: [0.9125955422149606, 0.9903585051787718]


In [7]:
print(f"{accuracy*100:.1f}% (95% CI: {lower_bound*100:.1f}% - {upper_bound*100:.1f}%)")

96.5% (95% CI: 91.3% - 99.0%)
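As a cross-check, the sketch below recomputes the interval in two ways and compares it with the normal-approximation (Wald) interval. It assumes SciPy 1.7 or newer, where `scipy.stats.binomtest` is available. The names `cp_low`, `cp_high`, `exact_ci`, `wald_low`, and `wald_high` are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import beta, binomtest, norm

successes, trials = 110, 114  # counts from the cells above
confidence_level = 0.95
alpha = 1 - confidence_level

# Clopper-Pearson interval from beta quantiles (same formula as above)
cp_low = beta.ppf(alpha / 2, successes, trials - successes + 1)
cp_high = beta.ppf(1 - alpha / 2, successes + 1, trials - successes)

# The same exact interval via SciPy's binomial test result
exact_ci = binomtest(successes, trials).proportion_ci(
    confidence_level=confidence_level, method="exact"
)

# Normal-approximation (Wald) interval, shown only for comparison
p_hat = successes / trials
se = np.sqrt(p_hat * (1 - p_hat) / trials)
z = norm.ppf(1 - alpha / 2)
wald_low, wald_high = p_hat - z * se, p_hat + z * se

print(f"Clopper-Pearson (beta):      [{cp_low:.4f}, {cp_high:.4f}]")
print(f"Clopper-Pearson (binomtest): [{exact_ci.low:.4f}, {exact_ci.high:.4f}]")
print(f"Wald (normal approximation): [{wald_low:.4f}, {wald_high:.4f}]")
```

The Wald interval is symmetric around the point estimate by construction, while the exact interval is not. When the estimate sits close to 1, as it does here, the two can differ noticeably.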


## 2. Difference of two proportions (Page 217 - Section 6.2)

Let's use the Stack Overflow survey again.

In [8]:
survey = pd.read_csv('survey_results_public.csv')
survey.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


For this exercise, our aim is to explore whether there is a significant difference between professional developers under the age of 35 and those 35 and over in their likelihood of using AI-powered search tools as their first choice when faced with a technical question at work. This investigation will help us understand how age influences the adoption of AI in problem-solving among developers.

Use the data from the following 3 questions.

Q1 Which of the following options best describes you today? For the purpose of this survey, a developer is "someone who writes code".*

Q2 What is your age?*

Q3 When you have a technical question at work, where do you first go to get an answer?

Required questions are noted with *, meaning these columns do not contain NaN values.

EXERCISE:

Estimate the difference $p_1 - p_2$, where $p_1$ is the proportion of under-35 developers who choose AI-powered search and $p_2$ is the proportion of those 35 and older who do the same. Calculate the 95% confidence interval for this difference and decide whether it is statistically and practically significant.

In [9]:
# YOUR CODE HERE

SOLUTION:

In [10]:
# Since we are interested in "professional developers"
# filter accordingly
developers = survey[survey['MainBranch'] == 'I am a developer by profession']

Next step is to create two groups by age:

1. Group 1: Age < 35
2. Group 2: Age >= 35

In [11]:
developers['Age'].value_counts()

Age
25-34 years old       20887
35-44 years old       12705
18-24 years old        9032
45-54 years old        4937
55-64 years old        1850
65 years or older       353
Under 18 years old      296
Prefer not to say       147
Name: count, dtype: int64

Since this question is required, we do not expect NaN values. We will still drop the "Prefer not to say" responses, because they cannot be assigned to either age group.

In [12]:
developers = developers[developers['Age'] != 'Prefer not to say']

Now let's create our two groups `under35` and `over35`:

In [14]:
under35 = developers[developers['Age'].isin(['Under 18 years old', '18-24 years old', '25-34 years old'])]
n1 = len(under35)
n1

30215

In [15]:
over35 = developers[developers['Age'].isin(['35-44 years old', '45-54 years old', '55-64 years old', '65 years or older'])]
n2 = len(over35)
n2

19845

Now let's move to Q3. If you look at `survey_results_schema.csv`, you will see that the column name for Q3 is `ProfessionalQuestion`.

Q1 and Q2 are required questions; Q3 is not, so let's drop the rows where Q3 is missing. Note that our purpose at this point is to focus on the difference of two proportions. When analyzing your creative brief data, always make sure you are not introducing bias by dropping rows.

In [16]:
under35 = under35.dropna(subset=['ProfessionalQuestion'])
n1 = len(under35)
n1

15638

In [17]:
over35 = over35.dropna(subset=['ProfessionalQuestion'])
n2 = len(over35)
n2

10636
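One quick way to see how much data the `dropna` step removed is to compare the share of respondents who answered Q3 in each group. The sketch below uses the group sizes printed above. The variable names are hypothetical, and similar response rates do not by themselves rule out bias.

```python
# Group sizes before and after dropping missing Q3 answers (from the outputs above)
n_under35_all, n_under35_answered = 30215, 15638
n_over35_all, n_over35_answered = 19845, 10636

# Share of each group that answered Q3
rate_under35 = n_under35_answered / n_under35_all
rate_over35 = n_over35_answered / n_over35_all

print(f"Under 35 answered Q3: {rate_under35:.1%}")
print(f"35 and over answered Q3: {rate_over35:.1%}")
```

If the two rates were very different, that would be a signal to think harder about who is missing from each group before comparing proportions.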

In [18]:
p_hat_1 = np.sum(under35['ProfessionalQuestion'].str.contains('AI-powered'))/n1
p_hat_1

0.15110627957539327

In [19]:
p_hat_2 = np.sum(over35['ProfessionalQuestion'].str.contains('AI-powered'))/n2
p_hat_2

0.14009025949605114

In [20]:
p_hat_1 - p_hat_2

0.011016020079342131

So our point estimate for the difference is:

$\hat{p}_1 - \hat{p}_2 = 0.011$

As you know, the standard error for this estimate is:

$\text{SE}_{\hat{p}_1 - \hat{p}_2}\approx \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$

Before using this formula, we need to check two conditions:

*Condition 1: Independence*

The responses from developers under 35 and those 35 and over need to be independent. We can assume each developer completed the survey individually, without influence from others, which supports independence within each group. The two groups also need to be independent of each other. Because each respondent belongs to exactly one age group and the groups were formed after data collection, the between-group independence requirement is reasonable. So this condition is satisfied.

*Condition 2: Success-Failure*

Since $\hat{p}_1$ and $\hat{p}_2$ are not very low or high, and since ${n_1}$ and ${n_2}$ are high we expect this condition to be satisfied, but let's check.

IMPORTANT NOTE: In your creative brief analysis, failing to verify that the assumptions of your chosen technique are valid may result in a loss of points during grading.

In [23]:
n1 * p_hat_1 >= 10, n1 * (1-p_hat_1) >= 10

(True, True)

In [24]:
n2 * p_hat_2 >= 10, n2 * (1-p_hat_2) >= 10

(True, True)

Both conditions are satisfied, so we can approximate the sampling distribution as a normal distribution. Let's calculate the standard error:

$\text{SE}_{\hat{p}_1 - \hat{p}_2}\approx \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$

In [27]:
se = np.sqrt((p_hat_1 * (1 - p_hat_1) / n1) + (p_hat_2 * (1 - p_hat_2) / n2))
se

0.004419141639817424

In [28]:
z = norm.ppf(0.975)
z

1.959963984540054

In [31]:
margin_of_error = z * se
margin_of_error

0.008661358456623428

In [32]:
point_estimate = p_hat_1 - p_hat_2
point_estimate

0.011016020079342131

In [33]:
lower_bound = point_estimate - margin_of_error
upper_bound = point_estimate + margin_of_error
lower_bound, upper_bound

(0.002354661622718704, 0.01967737853596556)

We are 95% confident that the difference in proportions between developers under 35 and those 35 and over who use AI-powered search tools as their first choice lies between 0.24 and 1.97 percentage points. Since this confidence interval is entirely above zero, we can conclude that developers under 35 are more likely to use AI-powered search tools as their first choice than those 35 and over. However, since the upper bound of this interval is less than 2 percentage points, the difference, while statistically significant, might not be practically significant. (Refer to Section 5.3.6 on page 199 for further context.)
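The steps above can be collected into one helper. This is a minimal sketch: the function name `diff_proportion_ci` is introduced here for illustration. The success counts 2363 and 1490 are the sums of the "AI-powered search (free)" and "AI-powered search (paid)" counts for each group, as shown in the value counts in Section 3.

```python
import numpy as np
from scipy.stats import norm


def diff_proportion_ci(x1, n1, x2, n2, confidence_level=0.95):
    """Normal-approximation CI for p1 - p2 (assumes both conditions were checked)."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference of two independent sample proportions
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, e.g. about 1.96 for 95%
    z = norm.ppf(1 - (1 - confidence_level) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# AI-powered answers: 2363 of 15638 (under 35) and 1490 of 10636 (35 and over)
low, high = diff_proportion_ci(2363, 15638, 1490, 10636)
print(f"95% CI for p1 - p2: ({low:.4f}, {high:.4f})")
```

Passing counts instead of precomputed proportions keeps the inputs easy to audit against the contingency table.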

## 3. Chi-square (Section 6.4 - Page 240)

EXERCISE:

Now let's extend our previous analysis. You want to test if there is a significant difference between developers under 35 and those over 35 in how they respond to the question about their preferred source for technical answers.

In [34]:
# YOUR CODE HERE

SOLUTION:

Let's first create the contingency table using the value counts:

In [39]:
under35_counts = under35['ProfessionalQuestion'].value_counts()
over35_counts = over35['ProfessionalQuestion'].value_counts()

In [41]:
under35_counts.head()

ProfessionalQuestion
Traditional public search engine    8762
A coworker                          2782
AI-powered search (free)            1368
AI-powered search (paid)             995
Slack search                         546
Name: count, dtype: int64

In [42]:
over35_counts.head()

ProfessionalQuestion
Traditional public search engine    5822
A coworker                          2004
AI-powered search (paid)             821
AI-powered search (free)             669
Slack search                         415
Name: count, dtype: int64

In [45]:
# possible choices
choices = list(np.unique(under35['ProfessionalQuestion']))
choices

['A coworker',
 'AI-powered search (free)',
 'AI-powered search (paid)',
 'Do search of internal share drives/storage locations for documentation (i.e., not a structured knowledge base)',
 'Internal Developer portal',
 'Microsoft Teams search',
 'Other:',
 'Slack search',
 'Traditional public search engine']

In [53]:
# Ensure both groups have the same index (categories)
under35_counts = under35_counts.reindex(choices, fill_value=0)
over35_counts = over35_counts.reindex(choices, fill_value=0)

In [60]:
# Create the contingency table
data = pd.DataFrame({'Under 35': under35_counts, 'Over 35': over35_counts})
data

ProfessionalQuestion,Under 35,Over 35
A coworker,2782,2004
AI-powered search (free),1368,669
AI-powered search (paid),995,821
"Do search of internal share drives/storage locations for documentation (i.e., not a structured knowledge base)",491,294
Internal Developer portal,384,307
Microsoft Teams search,63,61
Other:,247,243
Slack search,546,415
Traditional public search engine,8762,5822


In [61]:
from scipy.stats import chi2_contingency

In [62]:
# Perform the Chi-Square test
chi2, p, dof, expected = chi2_contingency(data)

In [67]:
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")

Chi-Square Statistic: 103.08622252204025
P-value: 9.97665503703158e-19


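Since the p-value is extremely small, we reject the null hypothesis (that the distribution of answers is the same in both age groups). This is very strong evidence that age group is associated with the choice of search method. Because the survey is observational, this result alone does not show that age causes the difference.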

In [68]:
# These are the expected counts for each cell (Page 241)
pd.DataFrame(expected, index=data.index, columns=data.columns)

ProfessionalQuestion,Under 35,Over 35
A coworker,2848.575322,1937.424678
AI-powered search (free),1212.40032,824.59968
AI-powered search (paid),1080.863515,735.136485
"Do search of internal share drives/storage locations for documentation (i.e., not a structured knowledge base)",467.223491,317.776509
Internal Developer portal,411.27571,279.72429
Microsoft Teams search,73.803456,50.196544
Other:,291.642689,198.357311
Slack search,571.976783,389.023217
Traditional public search engine,8680.238715,5903.761285
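To see where these numbers come from, the sketch below rebuilds the expected counts and the test statistic by hand from the observed table. It also checks the usual condition that every expected count is at least 5; that threshold is stated here as an assumption based on the common textbook rule. The names `observed`, `stat`, and `all_expected_ok` are introduced for illustration only.

```python
import numpy as np
from scipy import stats

# Observed counts from the contingency table above (columns: Under 35, Over 35)
observed = np.array([
    [2782, 2004],   # A coworker
    [1368, 669],    # AI-powered search (free)
    [995, 821],     # AI-powered search (paid)
    [491, 294],     # Internal share drives/storage locations
    [384, 307],     # Internal Developer portal
    [63, 61],       # Microsoft Teams search
    [247, 243],     # Other:
    [546, 415],     # Slack search
    [8762, 5822],   # Traditional public search engine
])

# Expected count per cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(stat, dof)

# Assumed condition: every expected count should be at least 5
all_expected_ok = bool((expected >= 5).all())

print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.3g}")
print(f"All expected counts >= 5: {all_expected_ok}")
```

The smallest expected count here is in the "Microsoft Teams search" row, which is still well above 5, so the chi-square approximation looks reasonable for this table.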
