In [None]:
%matplotlib inline


# Z and t-tests
In this tutorial we demonstrate how to use z-tests and t-tests to check whether values differ significantly from each other.


Z-tests and t-tests are statistical tests that help us make inferences or draw conclusions about a population based on sample data. They are commonly used in hypothesis testing and comparing groups.

## Z-Test:
A z-test is used when we know the population's standard deviation (a measure of how spread out the data is) and want to compare a sample mean to the population mean. It helps us determine if the difference between the sample mean and the population mean is statistically significant.

## T-Test:
A t-test is used when we don't know the population standard deviation and have to estimate it from the sample data. It is also used when the sample size is small. The t-test helps us compare the means of two groups (independent samples t-test) or test whether the mean of a sample differs significantly from a known or assumed population mean (one-sample t-test).

The main difference between z-tests and t-tests is the use of the population standard deviation. In a z-test, we use the population standard deviation, while in a t-test, we estimate the standard deviation from the sample.
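
As a rough illustration of the z-test described above, the sketch below computes the z statistic and a two-sided p-value by hand using only the Python standard library. The sample values, the hypothesised mean `mu0` and the "known" population standard deviation `sigma` are made-up assumptions for the example, not values from the Wyscout data.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample: corners taken in ten matches (made-up numbers).
sample = [7, 5, 9, 6, 8, 10, 4, 7, 6, 8]
mu0 = 6.0     # hypothesised population mean (assumption)
sigma = 2.0   # population standard deviation, assumed known for a z-test

n = len(sample)
# z statistic: distance of the sample mean from mu0 in standard-error units.
z = (mean(sample) - mu0) / (sigma / sqrt(n))

# Two-sided p-value: probability of |Z| at least this large if H0 is true.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, p = {p_two_sided:.4f}")
```

In a t-test the same ratio is formed, but `sigma` is replaced by the sample standard deviation and the p-value is taken from Student's t distribution instead of the normal distribution.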

To summarize:

- Use a z-test when you know the population standard deviation and want to compare a sample mean to the population mean.
- Use a t-test when you don't know the population standard deviation and must estimate it from the sample, especially when the sample size is small.

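Both tests provide a p-value: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (no difference) is true. If the p-value is below a predetermined significance level (usually 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference. Otherwise we do not reject the null hypothesis, which means only that the data do not provide enough evidence of a difference.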

In [1]:
import pandas as pd
import numpy as np
import json
# plotting
import matplotlib.pyplot as plt
#opening data
import os
import pathlib
import warnings  
# pd.options.mode.chained_assignment = None
# warnings.filterwarnings('ignore')

## Opening the dataset

First we open the data. For this example we use Wyscout data from the 2017/18 Premier League season. To meet GitHub's file size requirements,
the events are split across several files that we load one by one,
but you can open a single file locally from the directory you saved it in. We also open the file containing all teams in the Wyscout database.



In [None]:
# Tutorial cell

#open events
frames = []
for i in range(13):
    file_name = 'events_England_' + str(i+1) + '.json'
    path = os.path.join(str(pathlib.Path().resolve()), 'data', 'Wyscout', file_name)
    with open(path) as f:
        data = json.load(f)
    frames.append(pd.DataFrame(data))
#concatenate once instead of growing the dataframe inside the loop
train = pd.concat(frames)
    
#open team data
path = os.path.join(str(pathlib.Path().resolve()),"data", 'Wyscout', 'teams.json')
with open(path) as f:
    teams = json.load(f)

teams_df = pd.DataFrame(teams)
teams_df = teams_df.rename(columns={"wyId": "teamId"})

In [3]:
# information about all events that occurred in all the games during 2017/18 Premier League
path = '../wyscout-data/events/events_England.json'
events_england = pd.read_json(path)

# Open dataset with teams
path = '../wyscout-data/teams.json'
teams_df = pd.read_json(path)
teams_df = teams_df.rename(columns={"wyId": "teamId"})
teams_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   city          142 non-null    object
 1   name          142 non-null    object
 2   teamId        142 non-null    int64 
 3   officialName  142 non-null    object
 4   area          142 non-null    object
 5   type          142 non-null    object
dtypes: int64(1), object(5)
memory usage: 6.8+ KB


## Preparing the dataset

First, we filter the events to keep only corners. Then, we count them by team and merge the result with the team dataframe to keep the team names.
Then we repeat the same steps, but count the corners taken by each team in each game.
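
To make the two aggregations concrete, here is a minimal sketch on a made-up miniature event log. The dataframe `events` and its values are hypothetical; only the column names `teamId`, `matchId` and `subEventName` mirror those used in this tutorial.

```python
import pandas as pd

# Hypothetical miniature event log with the three columns used in this tutorial.
events = pd.DataFrame({
    "teamId":       [1, 1, 1, 2, 2, 1],
    "matchId":      [10, 10, 11, 10, 10, 11],
    "subEventName": ["Corner", "Corner", "Corner", "Corner", "Simple pass", "Corner"],
})

# Keep only the corner events.
corners = events[events["subEventName"] == "Corner"]

# One row per team: total number of corners.
per_team = corners.groupby("teamId").size().reset_index(name="counts")

# One row per (team, match) pair: corners taken in that game.
per_game = corners.groupby(["teamId", "matchId"]).size().reset_index(name="counts")

print(per_team)
print(per_game)
```

One thing to keep in mind with this approach: a team that took no corners in a match produces no row for that match, so such games would be missing from the per-game counts rather than appearing as zeros.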



In [4]:
#get corners
corners = events_england[events_england["subEventName"] == "Corner"]
#count corners by team
corners_by_team = corners.groupby(['teamId']).size().reset_index(name='counts')
#merge with team name
summary = corners_by_team.merge(teams_df[["name", "teamId"]], how = "left", on = ["teamId"])
#count corners by team by game
corners_by_game = corners.groupby(['teamId', "matchId"]).size().reset_index(name='counts')
#merge with team name
summary2 = corners_by_game.merge(teams_df[["name", "teamId"]], how = "left", on = ["teamId"])

## Two-sided z-test

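We use a two-sided z-test to check whether Manchester City take 8 corners per game on average. We set the significance level at 0.05.
At this significance level, there is no reason to reject the null hypothesis. Therefore, the data are consistent with City taking
8 corners per game. Note that `ztest` from statsmodels estimates the standard deviation from the sample rather than taking a known population value.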



In [5]:
from statsmodels.stats.weightstats import ztest

#get city corners
city_corners = summary2[summary2["name"] == 'Manchester City']["counts"]

#test 
t, pvalue = ztest(city_corners,  value=8)
#checking outcome
if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - Manchester City do not take 8 corners per game")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - Manchester City take 8 corners per game")

P-value amounts to 0.34703298713007624  - We do not reject null hypothesis - Manchester City take 8 corners per game


## One-sided z-test

We use a one-sided z-test to check whether Manchester City take more than 6 corners per game on average. We set the significance level at 0.05.
At this significance level, we reject the null hypothesis. Therefore, we conclude that City take
more than 6 corners per game.



In [6]:
t, pvalue = ztest(city_corners,  value=6, alternative = "larger")
if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - Manchester City take more than 6 corners per game")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - Manchester City do not take more than 6 corners per game")

P-value amounts to 0.0023931156479123942 - We reject null hypothesis - Manchester City take more than 6 corners per game


- Two-sided z-tests are used when you want to test if the sample mean is significantly different from the population mean, without specifying a particular direction.
- One-sided z-tests are used when you have a specific direction in mind and want to test if the sample mean is significantly greater or less than the population mean.
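
The difference between the two kinds of test can be sketched with the standard library alone. The helper `z_pvalue` below is a hypothetical function written for this illustration; its `alternative` names follow the ones accepted by statsmodels' `ztest` ("two-sided", "larger", "smaller"), and the statistic `z = 1.8` is a made-up value.

```python
from statistics import NormalDist

def z_pvalue(z, alternative="two-sided"):
    """P-value of a z statistic under the standard normal distribution."""
    cdf = NormalDist().cdf(z)
    if alternative == "larger":     # H1: mean is greater than the tested value
        return 1 - cdf
    if alternative == "smaller":    # H1: mean is smaller than the tested value
        return cdf
    if alternative == "two-sided":  # H1: mean differs in either direction
        return 2 * min(cdf, 1 - cdf)
    raise ValueError(f"unknown alternative: {alternative!r}")

z = 1.8  # hypothetical test statistic
for alt in ("two-sided", "larger", "smaller"):
    print(alt, round(z_pvalue(z, alt), 4))
```

For a positive statistic the "larger" p-value is half the two-sided one, so the same data can be significant at 0.05 in a one-sided test and not in a two-sided test. This is why the direction should be chosen before looking at the data.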

## One-sample two-sided t-test

We use a one-sample t-test to check whether Leicester City took a different number of corners than the league average. The sample consists of the season totals of all teams, and Leicester's total is the value we test against. We set the significance level at 0.05.
At this significance level, there's no reason to reject the null hypothesis. Therefore, we cannot claim that Leicester City's
number of corners differs from the league average.
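
As a sanity check on what a one-sample t-test computes, the sketch below derives the t statistic and two-sided p-value by hand and compares them with `scipy.stats.ttest_1samp`. The array `totals` and the value `popmean` are made-up numbers, not the real corner counts.

```python
import numpy as np
from scipy import stats

# Hypothetical season totals for eight teams (made-up numbers).
totals = np.array([180, 205, 150, 220, 190, 175, 210, 165])
popmean = 200  # value the sample mean is compared against (assumption)

n = totals.size
# The sample standard deviation (ddof=1) stands in for the unknown population value.
t_manual = (totals.mean() - popmean) / (totals.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(totals, popmean)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Both pairs of numbers should agree up to floating-point error.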



In [7]:
mean = summary["counts"].mean()
std = summary["counts"].std()


from scipy.stats import ttest_1samp
leicester_corners = summary.loc[summary["name"] == "Leicester City"]["counts"].iloc[0]
t, pvalue = ttest_1samp(summary["counts"], leicester_corners)

if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - Leicester City do not take the league-average number of corners")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - Leicester City take the league-average number of corners")

P-value amounts to 0.4023279517451914  - We do not reject null hypothesis - Leicester City take the league-average number of corners


## One-sample one-sided t-test

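We use a one-sample, one-sided t-test to check whether Arsenal took more corners than the league average. Because the sample is the league's season totals and Arsenal's total is the reference value, we test the alternative that the league mean is less than Arsenal's total. We set the significance level at 0.05.
At this significance level, we reject the null hypothesis. Therefore, we conclude that Arsenal took
more corners than the league average.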



In [8]:
from scipy.stats import ttest_1samp
arsenal_corners = summary.loc[summary["name"] == "Arsenal"]["counts"].iloc[0]
t, pvalue = ttest_1samp(summary["counts"], arsenal_corners, alternative='less')

if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - Arsenal take more corners than league average")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - Arsenal do not take more corners than league average")

P-value amounts to 0.001609869097090137 - We reject null hypothesis - Arsenal take more corners than league average


## Two-sample two-sided t-test

We use a two-sample t-test to check whether Liverpool took a different number of corners per game than Manchester United. We set the significance level at 0.05.
At this significance level, there is no reason to reject the null hypothesis. Therefore, we cannot claim that Liverpool took
a different number of corners per game than United.
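
The two-sample test with `equal_var=True` pools the variances of both samples. The sketch below, on made-up per-game counts `a` and `b`, reproduces that statistic by hand and compares it with `scipy.stats.ttest_ind`. It also runs Welch's variant (`equal_var=False`), which this tutorial does not use, only to show the alternative when equal variances are doubtful.

```python
import numpy as np
from scipy import stats

# Hypothetical corners per game for two teams (made-up numbers).
a = np.array([5, 7, 6, 9, 4, 8, 6, 7])
b = np.array([6, 5, 8, 7, 5, 6, 9, 4])

na, nb = a.size, b.size
# Pooled variance: weighted average of the two sample variances.
pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / na + 1 / nb))
# Two-sided p-value with na + nb - 2 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=na + nb - 2)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, t_welch)
```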



In [9]:
#check if united takes the same average number of corners per game as liverpool
liverpool_corners = summary2.loc[summary2["name"] == 'Liverpool']["counts"]
united_corners = summary2.loc[summary2["name"] == 'Manchester United']["counts"]

from scipy.stats import ttest_ind
t, pvalue  = ttest_ind(a=liverpool_corners, b=united_corners, equal_var=True)

if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - Liverpool took different number of corners per game than United")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - Liverpool took the same number of corners per game as United")

P-value amounts to 0.5879909398542313  - We do not reject null hypothesis - Liverpool took the same number of corners per game as United


## Two-sample one-sided t-test

We use a two-sample, one-sided t-test to check whether Manchester City took more corners per game than Newcastle. We set the significance level at 0.05.
At this significance level, we reject the null hypothesis. Therefore, we conclude that City took
more corners per game than Newcastle.



In [10]:
city_corners = summary2.loc[summary2["name"] == 'Manchester City']["counts"]
castle_corners = summary2.loc[summary2["name"] == 'Newcastle United']["counts"]

from scipy.stats import ttest_ind
t, pvalue  = ttest_ind(a=city_corners, b=castle_corners, equal_var=True, alternative = "greater")

if pvalue < 0.05:
    print("P-value amounts to", pvalue, "- We reject null hypothesis - City took more corners per game than Newcastle")
else:
    print("P-value amounts to", pvalue, " - We do not reject null hypothesis - City did not take more corners per game than Newcastle")

P-value amounts to 1.4280208353516603e-05 - We reject null hypothesis - City took more corners per game than Newcastle
