# Hypothesis testing on car accident data in the Czech Republic

First, import libraries for data handling & statistical analysis

In [18]:
import pandas as pd
import scipy.stats as stat
import numpy as np

Load the data from the data file

In [19]:
df = pd.read_pickle("accidents.pkl.gz")

Test the following hypotheses at the 5% significance level (95% confidence).

## Hypothesis 1
_The probability of an accident being fatal is the same on 1st class roads as on highways._

Filter by road type -- highways & 1st class roads only

In [20]:
df1 = df.loc[df["p36"] <= 1, ["p36", "p13a"]].copy()
df1["p36"] = df1["p36"].map({0: "Highway", 1: "1st class"})


Flag each accident as fatal or non-fatal -- an accident counts as fatal when `p13a` is greater than zero

In [21]:
df1["fatal"] = df1["p13a"] > 0

Compute a cross tabulation comparing the road type & accident's fatality

In [22]:
tab = pd.crosstab(df1["fatal"], df1["p36"], rownames=["Fatal"], colnames=["Road type"])
print(tab.to_numpy())
tab

[[78618 24293]
 [  911   166]]


| Fatal \ Road type | 1st class | Highway |
|---|---|---|
| False | 78618 | 24293 |
| True | 911 | 166 |
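The counts are easier to compare as shares per road type. The sketch below rebuilds the table from the numbers printed above (the names `counts` and `rates` and the row labels are introduced here for illustration) and divides each column by its total.

```python
import pandas as pd

# Observed counts copied from the cross tabulation above
counts = pd.DataFrame(
    [[78618, 24293], [911, 166]],
    index=pd.Index(["non-fatal", "fatal"], name="Outcome"),
    columns=pd.Index(["1st class", "Highway"], name="Road type"),
)

# Divide every column by its total: share of each outcome per road type
rates = counts / counts.sum(axis=0)
print(rates.loc["fatal"])
```

On the raw data, `pd.crosstab(df1["fatal"], df1["p36"], normalize="columns")` should give the same shares directly.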


Run the $\chi^2$ test of independence on the cross tabulation

In [23]:
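_, p, _, exp = stat.chi2_contingency([tab])
# The table was passed wrapped in a list, so the result has a leading axis of length 1
exp = exp[0]
print("P-value", p, "is", "lesser" if p < 0.05 else "greater", "than 0.05")
print("Expected frequencies:\n", exp)
# Fatality rates must come from the observed counts; the expected counts
# assume independence and therefore give the same rate for both road types
obs = tab.to_numpy()
probs = obs[1] / np.sum(obs, axis=0)
print("The probability of an accident being fatal is", "greater" if probs[0] > probs[1] else "lesser", "for 1st class road")


P-value 3.6067450279444316e-10 is lesser than 0.05
Expected frequencies:
 [[78705.32098896 24205.67901104]
 [  823.67901104   253.32098896]]
The probability of an accident being fatal is greater for 1st class road


The p-value is far below 0.05, so we reject the null hypothesis that road type and accident fatality are independent. _Hypothesis 1_ states exactly this null hypothesis (an equal probability of a fatal accident on both road types), so it is **rejected**. The observed counts show the direction: 911 of 79,529 accidents on 1st class roads were fatal (about 1.15%), compared with 166 of 24,459 on highways (about 0.68%), and the 911 observed fatal accidents on 1st class roads exceed the 823.7 expected under independence.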
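To make the test less of a black box, the following sketch recomputes the expected frequencies and the p-value by hand from the counts in the cross tabulation. It assumes the Yates continuity correction, which `chi2_contingency` applies by default to a table with one degree of freedom; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stat

# Observed counts copied from the cross tabulation (rows: non-fatal, fatal)
observed = np.array([[78618, 24293], [911, 166]])

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic with Yates continuity correction (2x2 table, 1 degree of freedom)
chi2_stat = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
p_value = stat.chi2.sf(chi2_stat, df=1)

print(expected)
print(chi2_stat, p_value)
```

The expected frequencies and the p-value should agree with the values printed by the notebook cell.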

## Hypothesis 2
_Vehicle damage is lower for **Škoda** cars than for **Audi** cars._

Filter by car brand -- Škoda, Audi

In [24]:
df2 = df.loc[(df["p45a"] == 39) | (df["p45a"] == 2), ["p53", "p45a"]].copy()
skoda = df2.loc[df2["p45a"] == 39, ["p53"]].squeeze()
audi = df2.loc[df2["p45a"] == 2, ["p53"]].squeeze()

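We run an independent two-sample t-test to check whether the population means of the two groups (Škoda, Audi) are equal. Passing `equal_var=False` selects Welch's variant, which does not assume equal variances, and `alternative="less"` makes the test one-sided: the alternative hypothesis is that the mean for Škoda is lower than the mean for Audi.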

In [25]:
val, p = stat.ttest_ind(skoda, audi, equal_var=False, alternative="less")
print("P-value", p, "is", "lesser" if p < 0.05 else "greater", "than 0.05")
print("The value for", "Škoda" if val < 0 else "Audi", "is lesser")

P-value 6.1078288453876684e-121 is lesser than 0.05
The value for Škoda is lesser


The p-value is far below 0.05, so we **reject** the null hypothesis that the mean damage of the two populations is **equal**, in favour of the one-sided alternative that the mean for Škoda cars is lower. The data therefore **support** _hypothesis 2_.
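As a rough illustration of what the one-sided Welch test computes, the sketch below derives the statistic, the Welch-Satterthwaite degrees of freedom, and the p-value by hand and compares them with `ttest_ind`. The two samples are synthetic and hypothetical, not the accident data, and all names are made up for the example.

```python
import numpy as np
import scipy.stats as stat

rng = np.random.default_rng(0)
# Synthetic, hypothetical samples; NOT the accident data
group_a = rng.normal(loc=100.0, scale=20.0, size=200)
group_b = rng.normal(loc=110.0, scale=35.0, size=150)

# Squared standard error of each sample mean
se2_a = group_a.var(ddof=1) / group_a.size
se2_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)
dof = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (group_a.size - 1) + se2_b ** 2 / (group_b.size - 1)
)

# One-sided p-value for the alternative "mean of group_a < mean of group_b"
p_manual = stat.t.cdf(t_manual, dof)

t_scipy, p_scipy = stat.ttest_ind(group_a, group_b, equal_var=False, alternative="less")
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

A negative statistic together with a small p-value is what favours the "less" alternative, which is the pattern the notebook reports for Škoda versus Audi.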