# Chapter py_06 
 Statistics for Data Science and Analytics<br>
by Peter C. Bruce, Peter Gedeck, Janet F. Dobbins

Publisher: Wiley; 1st edition (2024) <br>
<!-- ISBN-13: 978-3031075650 -->

(c) 2024 Peter C. Bruce, Peter Gedeck, Janet F. Dobbins

The code needs to be executed in sequence.

Python packages and Python itself change over time. This can cause warnings or errors. 
"Warnings" are for information only and can usually be ignored. 
"Errors" will stop execution and need to be fixed in order to get results. 

If you come across an issue with the code, please follow these steps:

- Check the repository (https://gedeck.github.io/sdsa-code-solutions/) to see if the code has been upgraded. This might solve the problem.
- Report the problem using the issue tracker at https://github.com/gedeck/sdsa-code-solutions/issues
- Paste the error message into Google and see if someone else has already found a solution

# Counting

## Counting in Python

In [2]:
# create a list of 100 coin flips
import random
random.seed(1234)
coin_flips = [random.choice(["H", "T"]) for _ in range(100)]

# count the heads with an explicit loop
count = 0
for coin_flip in coin_flips:
    if coin_flip == "H":
        count += 1
print(count)

55


In [3]:
# True counts as 1 and False as 0, so summing the comparisons counts the heads
sum(coin_flip == "H" for coin_flip in coin_flips)

55

In [4]:
# list.count returns the number of occurrences of a value
coin_flips.count("H")

55

In [5]:
# tally both outcomes in a dictionary
counts = {"H": 0, "T": 0}
for coin_flip in coin_flips:
    counts[coin_flip] += 1
print(counts)  # prints: {'H': 55, 'T': 45}

{'H': 55, 'T': 45}


In [6]:
from collections import Counter
counts = Counter(coin_flips)
print(counts)  # prints: Counter({'H': 55, 'T': 45})
print(f'Number of heads: {counts["H"]}')  # prints: Number of heads: 55

Counter({'H': 55, 'T': 45})
Number of heads: 55


In [7]:
counts = Counter(coin_flips)
print(counts)  # prints: Counter({'H': 55, 'T': 45})
# update() adds the counts from another 100 flips to the existing tally
counts.update(random.choice(["H", "T"]) for _ in range(100))
print(counts)  # prints: Counter({'H': 101, 'T': 99})

Counter({'H': 55, 'T': 45})
Counter({'H': 101, 'T': 99})
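
Once the tally is in a `Counter`, ranking the outcomes or converting counts to proportions takes one line each. The sketch below uses a short hard-coded list of flips (an illustrative stand-in, not the seeded list above) so that the results are easy to check by hand.

```python
from collections import Counter

# Illustrative flips, hard-coded so the tallies can be verified by eye
flips = list("HHTHTTHHHT")
counts = Counter(flips)

# most_common() returns (value, count) pairs sorted by descending count
ranked = counts.most_common()
print(ranked)  # prints: [('H', 6), ('T', 4)]

# Divide each count by the total number of flips to get proportions
total = sum(counts.values())
proportions = {outcome: n / total for outcome, n in counts.items()}
print(proportions)  # prints: {'H': 0.6, 'T': 0.4}
```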


## Counting in Pandas

In [8]:
import pandas as pd
df = pd.read_csv("microUCBAdmissions.csv")
counts = df["Admission"].value_counts()
print(f"Number of admitted students: {counts['Admitted']}")
counts

Number of admitted students: 1755


Admission
Rejected    2771
Admitted    1755
Name: count, dtype: int64

In [9]:
counts = df[["Admission", "Gender"]].value_counts()
counts

Admission  Gender
Rejected   Male      1493
           Female    1278
Admitted   Male      1198
           Female     557
Name: count, dtype: int64

In [10]:
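# select a single cell using both index levels
print(f'Number of admitted male students: {counts["Admitted", "Male"]}')
# select all genders for one admission status
print(counts["Admitted"])
# select one gender across all admission statuses
print(counts[:, "Female"])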

Number of admitted male students: 1198
Gender
Male      1198
Female     557
Name: count, dtype: int64
Admission
Rejected    1278
Admitted     557
Name: count, dtype: int64


In [11]:
counts = df[["Admission", "Gender"]].value_counts(normalize=True)
counts

Admission  Gender
Rejected   Male      0.329872
           Female    0.282369
Admitted   Male      0.264693
           Female    0.123067
Name: proportion, dtype: float64
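
The `normalize=True` option is equivalent to dividing each count by the total number of rows. The following sketch checks this on a stand-in data frame rebuilt from the four cell counts printed above, so it runs without `microUCBAdmissions.csv`. The names `cells`, `demo_df`, `demo_counts`, and `by_hand` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in data rebuilt from the cell counts shown above
# (an assumption for illustration, not the original file)
cells = {
    ("Rejected", "Male"): 1493,
    ("Rejected", "Female"): 1278,
    ("Admitted", "Male"): 1198,
    ("Admitted", "Female"): 557,
}
rows = [key for key, n in cells.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=["Admission", "Gender"])

demo_counts = demo_df[["Admission", "Gender"]].value_counts()
by_hand = demo_counts / demo_counts.sum()
normalized = demo_df[["Admission", "Gender"]].value_counts(normalize=True)

# Largest absolute gap between the two results; it should be zero or negligible
max_gap = (by_hand - normalized).abs().max()
print(max_gap)
print(round(by_hand["Admitted", "Male"], 6))
```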

In [12]:
counts.reset_index()

  Admission  Gender  proportion
0  Rejected    Male    0.329872
1  Rejected  Female    0.282369
2  Admitted    Male    0.264693
3  Admitted  Female    0.123067


## Two-way tables

In [13]:
df = pd.read_csv("microUCBAdmissions.csv")
pd.crosstab(df["Admission"], df["Gender"])

Gender     Female  Male
Admission
Admitted      557  1198
Rejected     1278  1493


In [14]:
pd.crosstab(df["Admission"], df["Gender"], margins=True)

Gender     Female  Male   All
Admission
Admitted      557  1198  1755
Rejected     1278  1493  2771
All          1835  2691  4526


In [15]:
pd.crosstab(df["Admission"], df["Gender"], normalize="all", margins=True)

Gender       Female      Male       All
Admission
Admitted   0.123067  0.264693  0.387760
Rejected   0.282369  0.329872  0.612240
All        0.405435  0.594565  1.000000


In [16]:
pd.crosstab(df["Admission"], df["Gender"], normalize="index", margins=True)

Gender       Female      Male
Admission
Admitted   0.317379  0.682621
Rejected   0.461205  0.538795
All        0.405435  0.594565


In [17]:
pd.crosstab(df["Admission"], df["Gender"], margins=True, normalize="columns")

Gender       Female      Male       All
Admission
Admitted   0.303542  0.445188  0.387760
Rejected   0.696458  0.554812  0.612240
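
The column-normalized table reads as conditional proportions: each column gives the share of admitted and rejected applicants within one gender. A minimal sketch of the same admission rates computed directly with `groupby`, using a stand-in data frame rebuilt from the cell counts in the two-way table (the names `cells`, `demo_df`, and `admit_rate` are illustrative, not from the book):

```python
import pandas as pd

# Stand-in data rebuilt from the two-way table of counts
# (an assumption for illustration, not the original file)
cells = {
    ("Admitted", "Female"): 557,
    ("Admitted", "Male"): 1198,
    ("Rejected", "Female"): 1278,
    ("Rejected", "Male"): 1493,
}
rows = [key for key, n in cells.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=["Admission", "Gender"])

# The mean of a boolean column is the proportion of True values,
# so this is the admission rate within each gender
is_admitted = demo_df["Admission"].eq("Admitted")
admit_rate = is_admitted.groupby(demo_df["Gender"]).mean()
print(admit_rate.round(6))
```

The two rates should match the `Admitted` row of the column-normalized table (0.303542 for female and 0.445188 for male applicants).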


## Chi-square test

In [18]:
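data = pd.DataFrame({
    "states": ["Texas"] * 200 + ["California"] * 200,
    "votes": ["yes"] * 25 + ["no"] * 175 + ["yes"] * 17 + ["no"] * 183
})
observed = pd.crosstab(data["states"], data["votes"])
# expected counts per state if both states voted at the same rate;
# dividing by 2 works because both states have 200 respondents
common_rate = observed.sum(axis=0) / 2
# test statistic: sum of absolute deviations from the expected counts
observed_difference = abs(observed - common_rate).sum().sum()
print(f'The observed difference is {observed_difference}')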

The observed difference is 16.0


In [19]:
import random
import numpy as np
random.seed(1234)
differences = []
votes = list(data["votes"])
# permutation test: shuffle the votes across states and recompute the statistic
for _ in range(5_000):
    random.shuffle(votes)
    distribution = pd.crosstab(data["states"], votes)
    differences.append(abs(distribution - common_rate).sum().sum())
# share of permutations with a difference at least as large as the observed one
at_least_observed = np.mean(np.array(differences) >= observed_difference)
print(f'Observed difference of at least {observed_difference}: {at_least_observed:.1%}')

Observed difference of at least 16.0: 25.9%
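
Because the row and column totals stay fixed under shuffling, the whole table is determined by one number: the count k of "yes" votes that land in the first state. With 42 "yes" votes overall and 21 expected per state, each of the four cells deviates from its expected count by the same amount, so the statistic reduces to 4 × |k − 21|. The sketch below uses this shortcut to run the permutation test in plain Python without building a crosstab in every iteration. It is an alternative formulation, not the book's code; its estimate should land close to the 25.9% above, but the exact value is not guaranteed to match.

```python
import random

random.seed(1234)

# Same data as above: 200 Texas votes followed by 200 California votes
votes = ["yes"] * 25 + ["no"] * 175 + ["yes"] * 17 + ["no"] * 183
n_texas = 200
expected_yes = votes.count("yes") / 2  # 21 "yes" votes expected per state


def difference(yes_in_texas):
    # All four cells deviate from their expected counts by the same amount
    return 4 * abs(yes_in_texas - expected_yes)


# Compute the observed statistic before any shuffling
observed_difference = difference(votes[:n_texas].count("yes"))

n_permutations = 5_000
at_least_observed = 0
for _ in range(n_permutations):
    random.shuffle(votes)
    # The first 200 shuffled votes play the role of Texas
    if difference(votes[:n_texas].count("yes")) >= observed_difference:
        at_least_observed += 1

p_value = at_least_observed / n_permutations
print(f"Observed difference: {observed_difference}")
print(f"Estimated p-value: {p_value:.1%}")
```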


In [20]:
from scipy import stats
# for a 2x2 table, chi2_contingency applies Yates' continuity correction by default
result = stats.chi2_contingency(observed)
print(f"chi2 = {result.statistic:.3f}")
print(f"p-value = {result.pvalue:.4f}")
print(f"degrees of freedom = {result.dof}")
print("expected")
print(result.expected_freq)

chi2 = 1.304
p-value = 0.2536
degrees of freedom = 1
expected
[[179.  21.]
 [179.  21.]]
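
For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which is why the statistic above is 1.304 rather than the uncorrected Pearson value. A sketch of both calculations in plain Python, with the expected counts derived from the row and column totals; the names `table`, `chi2_pearson`, and `chi2_yates` are illustrative:

```python
# Observed counts, laid out like the crosstab above
table = {
    "California": {"no": 183, "yes": 17},
    "Texas": {"no": 175, "yes": 25},
}

row_totals = {state: sum(cells.values()) for state, cells in table.items()}
col_totals = {
    vote: sum(cells[vote] for cells in table.values())
    for vote in ("no", "yes")
}
grand_total = sum(row_totals.values())

chi2_pearson = 0.0
chi2_yates = 0.0
for state, cells in table.items():
    for vote, count in cells.items():
        # Expected count under independence: row total * column total / grand total
        expected = row_totals[state] * col_totals[vote] / grand_total
        deviation = abs(count - expected)
        chi2_pearson += deviation ** 2 / expected
        # Yates' correction shrinks each absolute deviation by 0.5
        # (valid here because every deviation exceeds 0.5)
        chi2_yates += (deviation - 0.5) ** 2 / expected

print(f"Pearson chi2 = {chi2_pearson:.3f}")  # prints: Pearson chi2 = 1.703
print(f"Yates chi2 = {chi2_yates:.3f}")  # prints: Yates chi2 = 1.304
```

Passing `correction=False` to `chi2_contingency` should reproduce the uncorrected value.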
