# **Player Performances in the 2018/2019 Premier League Season**


Source: https://footystats.org/england/premier-league#

570 observations / 26 variables



The population under study consists of 570 players who participated in the English Premier League during the 2018/2019 season.
Using these variables, we compute simple statistics on player and team performance to evaluate their contributions and to get an overview of the league’s dynamics throughout the season.


## **Variables** :

### Categorical Variables :
- **position**: The player’s position. (Raw data)
- **Current Club**: The team the player represents. (Raw data)
- **nationality**: The player’s nationality. (Raw data)

### Discrete Variables :

- **goals_overall**: Goals scored during the season. Range: 0-22.
- **assists_overall**: Assists made during the season.


### Continuous Variables :
- **appearances_overall** Range (0-38)
- **minutes_played_overall** Range (0-3420), Total minutes played.
- **age**: The age of the player. Range: 21-43. Recorded as a discrete value in the raw data and grouped into classes in this analysis. (Raw data)






In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import scipy.stats as st

### Data View

In [None]:
premier = pd.read_csv('Data-PL.csv', sep=';')
premier.head()

# **Function**

##  Function Frequencies Table :


In [None]:
def categorical_frequencies(x):
    l, ni = np.unique(x, return_counts=True)  
    fi = ni / len(x)  
    df_frequencies = pd.DataFrame(
        data=np.transpose([ni, fi]),
        index=l,
        columns=["frequencies", "relative frequencies"]
    )
    mean_frequency = np.mean(ni)
    print(f"mean frequency: {mean_frequency:.2f}")
    return df_frequencies

def discret_frequencies(x):
    l, ni = np.unique(x, return_counts=True)  
    fi = ni / len(x)  
    cf = np.cumsum(fi)  
    df_frequencies = pd.DataFrame(
        data=np.transpose([ni, fi, cf]),
        index=l,
        columns=["frequencies", "relative frequencies", "cumulative frequencies"]
    )
    return df_frequencies

def continuous_frequencies(x, bins):
    ni = np.histogram(x, bins=bins)[0]
    fi = ni / len(x)
    Fi = np.cumsum(fi)
    lo = bins[:-1]
    hi = bins[1:]
    frequency_table = pd.DataFrame({
        'lo': lo,
        'hi': hi,
        'frequencies': ni,
        'relative frequencies': fi,
        'cumulative frequencies': Fi
    })

    return frequency_table
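A point worth checking when choosing `bins` for a table like the one built by `continuous_frequencies`: `np.histogram` silently ignores values outside the outermost edges, while the relative frequencies are still divided by the full sample size. The sketch below uses made-up ages (not taken from the dataset) to show the effect.

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate the binning behaviour.
ages = np.array([21, 22, 24, 27, 29, 31, 34, 36, 40, 43])
bins = np.array([21, 26, 31, 36, 41])  # classes [21,26), [26,31), [31,36), [36,41]

counts = np.histogram(ages, bins=bins)[0]
rel_freq = counts / len(ages)

# The value 43 lies beyond the last edge, so it is never counted.
print(counts)
print(rel_freq.sum())  # below 1 whenever some values fall outside the bins
```

If the cumulative frequency in the last row of a table stays below 1, the bins probably do not cover the whole range of the variable.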

## Function graphs

### Categorical variable Function

In [None]:
def vertical_charts(x,sx,sy):
    ni = np.array(x['frequencies'])
    labels = x.index
    plt.figure(figsize=(sx, sy))
    plt.bar(labels, ni)
    plt.ylabel('Frequencies')
    plt.tight_layout()
    plt.show()

def horizontal_charts(x,sx,sy):
    fi = np.array(x['relative frequencies'])
    labels = x.index
    plt.figure(figsize=(sx, sy))
    plt.barh(labels, fi)
    plt.xlabel('Relative Frequencies')
    plt.tight_layout()
    plt.show()

def pie_charts(x,sx,sy):
    ni = np.array(x['frequencies'])
    labels = x.index
    plt.figure(figsize=(sx, sy))
    plt.pie(ni, labels=labels, autopct='%1.1f%%')
    plt.tight_layout()
    plt.show()

### Discrete Variable Function

In [None]:
def bar_graph(x,sx):
    ni = np.array(x['frequencies'][:])
    plt.bar(x.index, ni, width=sx)
    plt.xticks(x.index, x.index)  
    plt.show()

def line_graph(x):
    fi = np.array(x['relative frequencies'][:])
    plt.plot(x.index, fi)  
    plt.show()

def stair(data, x_min, x_max):
    Fi = np.array(data['cumulative frequencies'][:])
    Fi_comp = np.insert(Fi, 0, 0)
    edges = np.array(data.index).astype(float)
    edges = np.insert(edges, 0, x_min)  
    edges = np.append(edges, x_max)    
    plt.stairs(Fi_comp, edges)
    plt.xlim(x_min, x_max)
    plt.show()

### Continuous Variable Function

In [None]:
def historgram(x):
    lo = np.array(x['lo'])
    hi = np.array(x['hi'])
    fi = np.array(x['relative frequencies'])
    width = hi - lo 
    height = fi / width 
    bin_edges = np.append(lo, hi[-1])
    plt.hist(bin_edges[:-1], bins=bin_edges, weights=height)
    plt.show()

def box_plot(x,v=""):
    plt.boxplot(x)
    plt.xticks([1],[v])
    plt.show()  

def eCDF(x) :
    a= np.min(x)
    b= np.max(x)
    n = len(x)
    x = np.sort(x)  # sorted copy, so the caller's array is not reordered in place
    alpha=[0, 0.25, 0.5, 0.75, 1]
    unique, freq = np.unique(x, return_counts=True)
    cumul_freq = np.cumsum(freq)
    cumul_freq = np.insert(cumul_freq, 0, 0)
    F = cumul_freq / n

    t = np.insert(unique, 0, a)
    t = np.append(t, b)
    q = np.quantile(x, alpha, method='higher')

    plt.stairs(F, t)
    plt.xlim(a,b)
    
    for i in range(1,4):
        plt.plot([a,b],[alpha[i],alpha[i]])
        plt.plot([q[i],q[i]],[0.,alpha[i]], 'k--', linewidth = 0.7)
    plt.xticks(
        [a, q[1], q[2], q[3], b],
        [f"{a:.2f}", 'Q1', 'Q2', 'Q3', f"{b:.2f}"]
    )
    plt.show()

def historgramm(x,y):
    a = np.min(y)
    b = np.max(y)
    alpha = [0, 0.25, 0.5, 0.75, 1]
    q = np.quantile(y, alpha, method='higher')  
    lo = np.array(x['lo'])
    hi = np.array(x['hi'])
    fi = np.array(x['relative frequencies'])
    length = hi - lo
    height = fi / length  
    bin_edges = np.append(lo, hi[-1])
    m = max(height) 
    plt.hist(bin_edges[:-1], bins=bin_edges, weights=height )
    for i in range(1, 4):  
        plt.plot([q[i], q[i]], [0., m], 'k--', linewidth=0.8) 
    plt.xticks(
        [a, q[1], q[2], q[3], b],
        [f"{a:.2f}", 'Q1', 'Q2', 'Q3', f"{b:.2f}"]
    )
    plt.xlabel("")
    plt.ylabel("")
    plt.show()

## Function Info

In [None]:
def get_info(x) :
    m = np.mean(x)
    ra = np.ptp(x)
    s2 = np.var(x)
    s = np.std(x)
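    # Skewness is the third standardized moment; excess kurtosis is the fourth minus 3.
    skew = np.mean((x-m)**3)/s**3
    kurt = np.mean((x-m)**4)/s**4 - 3
    x = np.sort(x)  # sorted copy, so the caller's array is not reordered in place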
    n = len(x)
    unique, freq = np.unique(x, return_counts=True)
    cumul_freq = np.cumsum(freq)
    cumul_rel_freq = np.insert(cumul_freq, 0, 0.) / n
    cumul_rel_freq = np.append(cumul_rel_freq, 1)
    deviation = np.array([4,5, 3, 2, 1])*s
    val = np.insert(unique, 0, 0.)
    val = np.append(val, 80000)
    F = lambda t: np.interp(t, val, cumul_rel_freq)

    print(f"Range  : {ra}   Variance : {s2} Mean : {m}  Skewness : {skew}  Kurtosis : {kurt}" )
    p = [F(m + y) - F(m - y) for y in deviation]
    p_percent = [value * 100 for value in p]
    for i, value in enumerate(p_percent):
        print(f"{deviation[i]:.2f} , {value:.2f} % ")
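As a sanity check on the shape statistics, the moment formulas can be compared with SciPy's own estimators. The sketch below runs on synthetic right-skewed data (an assumption, not the Premier League data) and relies on the default settings of `scipy.stats.skew` and `scipy.stats.kurtosis`, which use the biased moments and report excess kurtosis.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)  # synthetic, right-skewed data

m = np.mean(sample)
s = np.std(sample)  # population standard deviation (ddof=0)

# Third standardized moment: skewness.
skewness = np.mean((sample - m) ** 3) / s ** 3
# Fourth standardized moment minus 3: excess kurtosis.
excess_kurtosis = np.mean((sample - m) ** 4) / s ** 4 - 3

print(np.isclose(skewness, st.skew(sample)))
print(np.isclose(excess_kurtosis, st.kurtosis(sample)))
```

A positive skewness means a longer right tail; a positive excess kurtosis means heavier tails than a normal distribution.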

## Function Tests

### Confidence Interval

In [None]:
def CI_mean(x, alpha=0.05, sample_size=100):
    if sample_size:
        indices = np.random.choice(len(x), size=sample_size, replace=False)
        x = x[indices]
    bx = np.mean(x)  
    s = np.std(x, ddof=1)  
    n = len(x)  # actual sample size (also valid when sample_size is None)
    u = st.norm.ppf(1 - alpha / 2)  
    t = u * s / np.sqrt(n) 
    print(f"With a sample of {sample_size} players, the mean is in the interval: ({bx - t:.2f} ; {bx + t:.2f}) with a confidence level of {(1 - alpha) * 100:.2f}%.")

def CI_proportion(x,name,i,sample_size=100):
    # Resolve the category on the full data, so that index i still points to the
    # same category even if one category is absent from the random sample.
    level = np.unique(x)[i]
    if sample_size:
        indices = np.random.choice(len(x), size=sample_size, replace=False)
        x = x[indices]
    n = len(x)
    p0 = np.mean(x == level)  # sample proportion of the chosen category
    alpha = 0.05
    u = st.norm.ppf(1-alpha/2)
    t = u * np.sqrt(p0*(1-p0)/n)
    print(f'the proportion of {name} is in ({p0-t:.2f},{p0+t:.2f}) with a confidence level of {(1-alpha)*100} % ')

### Goodness of fit tests

In [None]:
def gamma_goodness_of_fit(x, bins, alpha=0.05):
    mean_x = np.mean(x)
    var_x = np.var(x, ddof=1)
    beta = mean_x / var_x
    alpha_gamma = beta * mean_x
    dist = st.gamma(a=alpha_gamma, scale=1/beta)
    cdf_vals = dist.cdf(bins)
    n = len(x)
    npi = np.diff(cdf_vals) * n
    ni = np.histogram(x, bins=bins)[0]
    di = (ni - npi)**2 / npi
    chi_squared = np.sum(di)
    J = len(bins) - 1
    pvalue = st.chi2.sf(chi_squared, J - 1 - 2)
    t = np.linspace(np.min(x), np.max(x), 100)
    plt.hist(x, bins=bins, density=True)
    plt.plot(t, dist.pdf(t), 'k-')
    plt.show()
    print(f"Chi-squared value: {chi_squared:.2f}")
    print(f"p-value: {pvalue:.4f}")
    if pvalue > alpha:
        print(f"The p-value is greater than {alpha}. This means that there is no significant evidence to reject H_0, and we consider that the fit is correct.")
    else:
        print(f"The p-value is less than or equal to {alpha}. This means that there is significant evidence to reject H_0, and we consider that the fit is not correct.")

    


def goodness_of_fit_exponential_discrete(x, alpha=0.05):
    mean_data = np.mean(x)
    lambda_hat = 1 / mean_data 
    dist = st.expon(scale=1/lambda_hat)
    values, ni = np.unique(x, return_counts=True)
    n = len(x)
    cdf_vals = dist.cdf(values)
    npi = np.diff(cdf_vals) * n
    npi = np.append(npi, (1 - cdf_vals[-1]) * n)  # open-ended last class: everything from the largest observed value upward
    plt.bar(values, ni/n,)
    plt.plot(values, dist.pdf(values), '-ko')
    plt.show()
    dj2 = (ni - npi)**2 / npi
    d2 = np.sum(dj2)
    J = len(dj2)
    xc = st.chi2.ppf(1 - alpha, df=J - 1 - 1)  # one parameter (lambda) is estimated from the data
    p_value = st.chi2.sf(d2, J - 1 - 1)
    print(f"Chi-squared value: {d2:.2f}")
    print(f"Critical value: {xc:.2f}")
    print(f"p-value: {p_value:.4f}")
    if p_value > alpha:
        print(f"The p-value is  greater than {alpha}. This means that there is no significant evidence to reject H_0, and we consider that the fit is correct.")
    else:
        print(f"The p-value is less than or equal to {alpha}. This means that there is significant evidence to reject H_0, and we consider that the fit is not correct.")


def goodness_of_fit_uniform(x, alpha=0.05):
    a = np.min(x)
    b = np.max(x)
    n = len(x)
    dist = st.uniform(loc=a, scale=(b-a))
    bins = np.linspace(a, b, 10)
    J = len(bins) - 1
    nj = np.histogram(x, bins=bins)[0]
    F = dist.cdf(bins[1:-1])
    F = np.hstack((0., F, 1.))
    npj = n * np.diff(F)
    dj = (nj - npj) ** 2 / npj
    d2 = np.sum(dj)
    xc = st.chi2.ppf(1 - alpha, J - 1 - 2)
    p_value = st.chi2.sf(d2, J - 1 - 2)
    plt.hist(x, bins=bins, density=True)
    plt.plot(np.linspace(a, b, 100), dist.pdf(np.linspace(a, b, 100)), 'k')
    plt.show()
    print(f"Chi-squared value: {d2:.2f}")
    print(f"Critical value: {xc:.2f}")
    print(f"p-value: {p_value:.4f}")
    if p_value > alpha:
        print(f"The p-value is greater than {alpha}. This means that there is no significant evidence to reject H_0, and we consider that the fit is correct.")
    else:
        print(f"The p-value is less than or equal to {alpha}. This means that there is significant evidence to reject H_0, and we consider that the fit is not correct.")


### Comparison Tests


In [None]:
def chi_squared_test(bins, x, y, titrex, titrey, alpha=0.05):
    nic = np.histogram(x, bins=bins)[0]  
    ninc = np.histogram(y, bins=bins)[0]  
    nkj = np.vstack((nic, ninc))  
    nk = np.sum(nkj, axis=1)  
    nj = np.sum(nkj, axis=0)  
    ddl = (np.shape(nkj)[0] - 1) * (np.shape(nkj)[1] - 1)
    n = np.sum(nkj)  
    nkj_th = np.outer(nk, nj) / n  
    dkj = (nkj - nkj_th) ** 2 / nkj_th  
    d2 = np.sum(dkj)  
    p_value = st.chi2.sf(d2, ddl)
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    plt.hist(x, bins=bins, density=True)
    plt.title(titrex)
    plt.yscale('log')  # log scale on the y-axis: with a linear scale the smaller bars were not visible
    plt.subplot(1, 2, 2)
    plt.hist(y, bins=bins, density=True)
    plt.title(titrey)
    plt.yscale('log')  # same log scale for the second histogram
    plt.show()
    print(f"Chi-squared value: {d2:.2f}")
    print(f"p-value: {p_value:.4f}")
    if p_value > alpha:
        print(f"The distributions are not significantly different (p-value > {alpha}).")
    else:
        print(f"The distributions are significantly different (p-value <= {alpha}).")

def compare_means(x, y, alpha=0.05):
    bx, sx2, nx = np.mean(x), np.var(x, ddof=1), len(x)
    by, sy2, ny = np.mean(y), np.var(y, ddof=1), len(y)
    pooled_variance = ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)
    t = (bx - by) / np.sqrt(pooled_variance * (1 / nx + 1 / ny))
    p_value = 2 * st.t.sf(np.abs(t), nx + ny - 2)
    
    print(f"Mean of first sample: {bx:.2f}")
    print(f"Mean of second sample: {by:.2f}")
    print(f"P-value: {p_value:.4f}")
    
    if p_value > alpha:
        print(f"The means are not significantly different (p-value > {alpha}).")
    else:
        print(f"The means are significantly different (p-value <= {alpha}).")
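The pooled-variance statistic used in `compare_means` is the classical two-sample Student t-test with equal variances. As a hedged cross-check on synthetic data (the two groups below are invented, not drawn from the dataset), the same numbers can be obtained from `scipy.stats.ttest_ind` with `equal_var=True`.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)
x = rng.normal(loc=26.0, scale=4.0, size=40)  # synthetic group 1
y = rng.normal(loc=27.5, scale=4.0, size=55)  # synthetic group 2

nx, ny = len(x), len(y)
# Pooled variance: weighted average of the two unbiased sample variances.
pooled = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
t = (np.mean(x) - np.mean(y)) / np.sqrt(pooled * (1 / nx + 1 / ny))
p_value = 2 * st.t.sf(np.abs(t), nx + ny - 2)  # two-sided p-value

# Reference computation from SciPy.
t_ref, p_ref = st.ttest_ind(x, y, equal_var=True)
print(np.isclose(t, t_ref), np.isclose(p_value, p_ref))
```

When the two groups have clearly different variances, `equal_var=False` (Welch's test) is the usual alternative.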

#  **Variable**

## Categorical Variables :

### Current Club

In [None]:
club = np.array(premier["Current Club"])  
categorical_frequencies(club)

**The distribution of players across clubs is fairly balanced.**


The club with the fewest players, Manchester City (24 players), finished first, while the club with the most players, Huddersfield Town (32 players), finished last.

In [None]:
x= categorical_frequencies(club)
horizontal_charts(x,10,5)

In [None]:
vertical_charts(x,35,5)

### Position


In [None]:
position = np.array(premier["position"])  
categorical_frequencies(position)

In [None]:
x= categorical_frequencies(position)
pie_charts(x,5,5)

In [None]:
vertical_charts(x,8,4)

For a proportion, the margin of error t is given by

$$
t = u_{1 - \frac{\alpha}{2}} \cdot \sqrt{\frac{p_0 (1 - p_0)}{n}}
$$

The confidence interval is:

$$
 \left[ p_0 - t, \, p_0 + t \right]
$$
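To make the formula concrete, here is a small worked example with invented numbers: 10 players of one category in a sample of 29. The counts are an assumption chosen for illustration, not a result from the dataset.

```python
import numpy as np
import scipy.stats as st

n = 29            # sample size
successes = 10    # hypothetical number of players in the category
alpha = 0.05

p0 = successes / n
u = st.norm.ppf(1 - alpha / 2)          # standard normal quantile, about 1.96
t = u * np.sqrt(p0 * (1 - p0) / n)      # margin of error

low, high = p0 - t, p0 + t
print(f"p0 = {p0:.3f}, margin = {t:.3f}, interval = ({low:.3f} ; {high:.3f})")
```

With a sample this small the margin is wide (about 17 percentage points here), so the interval should be read as a rough indication.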


In [None]:
CI_proportion(position,"Defender",0,29)
CI_proportion(position,"GoalKeeper",2,29)

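Each time the cell is run, 29 different players are drawn at random, so the intervals change from run to run. In the runs we observed, the intervals contained the true proportions computed on all 570 players. This is the expected behaviour of a 95% confidence interval, which should cover the true value in roughly 95% of runs.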

## Discrete Variables :

### Goal Overall

In [None]:
goal = np.array(premier["goals_overall"])  
discret_frequencies(goal)

In [None]:
x= discret_frequencies(goal)
bar_graph(x,0.50)
line_graph(x)

In [None]:
get_info(goal)

The goal distribution has a range of 22, reflecting a wide disparity in player performance. Its skewness (third standardized moment) is 2.99, a strong right skew: most players scored very few goals, while a small number scored many more. Its excess kurtosis (fourth standardized moment minus 3) is 10.39, which points to a heavy right tail, with a few players scoring far more goals than average.

For a mean, the margin of error t is given by

$$
t = u_{1 - \frac{\alpha}{2}} \cdot \frac{s}{\sqrt{n}}
$$

The confidence interval is 

$$
 \left[ \bar{x} - t, \, \bar{x} + t \right]
$$

In [None]:
CI_mean(goal,0.05,29)

We use a sample of 29 players because this is the average squad size in the Premier League this season: 570 players across 20 clubs gives about 28.5 players per club (see the Current Club frequencies).

Each time the cell is run, 29 different players are drawn at random, so the interval changes from run to run. In the runs we observed, the interval contained the true mean computed on the full dataset, as expected for a 95% confidence interval.
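How often such an interval covers the true mean can be estimated by simulation. The sketch below is an illustration under stated assumptions: the population is synthetic (a right-skewed count variable standing in for goals), and the interval uses the same normal-quantile formula as above.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
# Synthetic stand-in for a right-skewed count variable; not the real goal data.
population = rng.geometric(p=0.4, size=570) - 1
true_mean = population.mean()

alpha, n, runs = 0.05, 29, 2000
u = st.norm.ppf(1 - alpha / 2)

hits = 0
for _ in range(runs):
    # Draw 29 players without replacement, as CI_mean does.
    sample = rng.choice(population, size=n, replace=False)
    margin = u * np.std(sample, ddof=1) / np.sqrt(n)
    hits += abs(sample.mean() - true_mean) <= margin

coverage = hits / runs
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

For a small sample from a skewed variable, the observed coverage is typically somewhat below the nominal 95%, because the normal approximation is only rough at n = 29.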

We are testing whether the distribution of goals scored by players fits an exponential distribution, using a goodness-of-fit test.

In [None]:
goodness_of_fit_exponential_discrete(goal)
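For a chi-squared goodness-of-fit test, the class probabilities under the fitted distribution should add up to 1. One way to obtain this for integer counts, sketched below with invented values (the observed values and the mean are assumptions), is to use the classes [k, k+1) and to let the last class be open-ended.

```python
import numpy as np
import scipy.stats as st

values = np.arange(0, 6)   # hypothetical observed goal counts 0..5
mean_goals = 1.2           # hypothetical sample mean
dist = st.expon(scale=mean_goals)

cdf_vals = dist.cdf(values)
probs = np.diff(cdf_vals)                    # classes [0,1), [1,2), ..., [4,5)
probs = np.append(probs, 1 - cdf_vals[-1])   # open-ended last class [5, inf)

print(probs)
print(probs.sum())  # the classes partition [0, inf), so the total is 1
```

Classes with very small expected counts (a common rule of thumb is below 5) are usually merged with their neighbours before computing the statistic.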

## Continuous Variables :

### Age

In [None]:
age = np.array(premier["age"])  
bins_age = np.array([21,23, 26,28,30,32,33,35,37,41])
continuous_frequencies(age,bins_age)

In [None]:
x= continuous_frequencies(age,bins_age)
box_plot(age,"age")
eCDF(age)
historgramm(x,age)

In [None]:
get_info(age)

In [None]:
CI_mean(age,0.05,29)

Each time the cell is run, 29 different players are drawn at random, so the interval changes from run to run. In the runs we observed, it contained the true mean age, as expected for a 95% confidence interval.

We are testing whether the distribution of players’ ages fits a gamma distribution, using a goodness-of-fit test.

In [None]:
gamma_goodness_of_fit(age,bins_age)
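The gamma parameters in this test are estimated by the method of moments: the rate is mean / variance and the shape is mean² / variance. The sketch below checks on synthetic data (an assumption, not the age column) that the fitted distribution reproduces the sample mean and variance.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
sample = rng.gamma(shape=9.0, scale=3.0, size=500)  # synthetic data

mean_x = np.mean(sample)
var_x = np.var(sample, ddof=1)

rate = mean_x / var_x    # called beta in gamma_goodness_of_fit
shape = rate * mean_x    # called alpha_gamma in gamma_goodness_of_fit
fitted = st.gamma(a=shape, scale=1 / rate)

# By construction, the fitted distribution matches the first two sample moments.
print(np.isclose(fitted.mean(), mean_x), np.isclose(fitted.var(), var_x))
```

Because two parameters are estimated from the data, the chi-squared statistic is compared with a distribution with J - 1 - 2 degrees of freedom, where J is the number of classes.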

In [None]:
french = premier[premier["nationality"] == "France"]
england= premier[premier["nationality"] == "England"]
age_fff = np.array(french["age"])  
age_eff = np.array(england["age"])  

In [None]:
chi_squared_test(bins_age,age_eff,age_fff,"England","France")

The chi-squared test result indicates  that the age distributions of French and English players are not significantly different.
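The statistic behind this kind of conclusion can be cross-checked against `scipy.stats.chi2_contingency`. The table below is invented (two groups, four age classes) and only serves to show that the manual computation and SciPy agree.

```python
import numpy as np
import scipy.stats as st

# Hypothetical counts per age class for two groups of players.
observed = np.array([[12, 30, 25, 8],
                     [ 9, 14, 20, 6]])

row = observed.sum(axis=1)
col = observed.sum(axis=0)
n = observed.sum()

expected = np.outer(row, col) / n                   # counts expected under independence
d2 = np.sum((observed - expected) ** 2 / expected)  # chi-squared statistic
ddl = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = st.chi2.sf(d2, ddl)

# Reference computation from SciPy (no continuity correction).
stat, p_ref, dof, _ = st.chi2_contingency(observed, correction=False)
print(np.isclose(d2, stat), np.isclose(p_value, p_ref), dof == ddl)
```

If a class is empty in both groups, its expected count is zero and the statistic is undefined, so such classes need to be merged or dropped first.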

In [None]:
compare_means(age_eff,age_fff)

In [None]:
att = premier[premier["position"] == "Forward"]
dc = premier[premier["position"] == "Defender"]
age_att = np.array(att["age"])  
age_dc = np.array(dc["age"])  

In [None]:
chi_squared_test(bins_age,age_att,age_dc,"Forward","Defender")

In [None]:
compare_means(age_att,age_dc)

### Appearances Overall

We filtered the dataset to include only players with more than 3 appearances (appearances_overall > 3), so that the analysis focuses on players with sufficient game time and leads to more reliable and meaningful insights.

In [None]:
played = premier[premier["appearances_overall"] >3]
app = np.array(played["appearances_overall"])
bins_app = np.array([3, 5,10,15,20,25,30,35,38])
continuous_frequencies(app,bins_app)

In [None]:
x = continuous_frequencies(app,bins_app)
box_plot(app,"Appearances")
eCDF(app)
historgramm(x,app)

In [None]:
get_info(app)

After excluding players with 3 or fewer appearances, the mean rose from 18.38 to 23.22, so the filtered data focuses on more active players. The variance dropped from 165.18 to 101.95, reflecting less variability in the filtered dataset. The excess kurtosis (fourth standardized moment minus 3) moved from -1.40 to -1.11: both values are negative, so the distribution of appearances remains flatter than a normal distribution, slightly less so after filtering.

In [None]:
CI_mean(app,0.1,29)

Each time the cell is run, 29 different players are drawn at random, so the interval changes from run to run. In the runs we observed, it contained the true mean number of appearances, as expected for a 90% confidence interval (alpha = 0.1).

We are testing whether the distribution of player appearances follows a uniform distribution, using a goodness-of-fit test.

In [None]:
goodness_of_fit_uniform(app)

In [None]:
old= premier[(premier["age"]>30 ) & (premier["appearances_overall"] > 1)]
jung= premier[(premier["age"] <=30 ) & (premier["appearances_overall"] > 1)]
app_old = np.array(old["appearances_overall"])  
app_jung= np.array(jung["appearances_overall"])  

In [None]:
chi_squared_test(bins_app,app_old,app_jung,"More Than 30 ", "Less Than 30")

In [None]:
compare_means(app_jung,app_old)

In [None]:
old= premier[premier["age"]>30 ]
jung= premier[premier["age"] <=30 ]
app_old = np.array(old["appearances_overall"])  
app_jung= np.array(jung["appearances_overall"]) 
compare_means(app_jung,app_old)

When all players are included, the mean number of appearances is 17.01 for players aged 30 or under, compared with 20.14 for players over 30. The p-value (0.0038) indicates a significant difference between the two groups. When players with at most one appearance are excluded (appearances_overall > 1), the means rise to 21.61 for players aged 30 or under and 22.06 for players over 30, and the p-value (0.6591) shows no significant difference.

This suggests that a sizeable share of the younger players barely played or did not play at all, which lowers the mean for the 30-or-under group when they are included. Older players are more likely to have played, so their group mean stays higher even when players with little or no game time are included.

In [None]:
keeper = premier[(premier["position"] == "Goalkeeper") & (premier["appearances_overall"] > 1)]
forward= premier[(premier["position"] == "Forward") & (premier["appearances_overall"] > 1)]
app_k = np.array(keeper["appearances_overall"])  
app_f= np.array(forward["appearances_overall"])  

In [None]:
chi_squared_test(bins_app,app_k,app_f,"Keeper ", "Forward")

We conclude that the distributions of appearances for goalkeepers and forwards are not significantly different, which suggests that appearances follow similar distribution patterns for both positions.

In [None]:
compare_means(app_k,app_f)

### Minutes Played

We filtered the dataset to include only players with more than 100 minutes played (minutes_played_overall > 100) to ensure the analysis focuses on players with significant participation, providing more reliable and relevant data.

In [None]:
played = premier[premier["minutes_played_overall"] > 100]
min = np.array(played['minutes_played_overall'][:])
bins_min = np.array([100,500,700,900,1100,1600,1800, 2000 ,2200 ,3100 ,3420 ])
continuous_frequencies(min,bins_min)

In [None]:
box_plot(min,"Minutes")
eCDF(min)
x=continuous_frequencies(min,bins_min)
historgramm(x,min)

In [None]:
get_info(min)

The minutes played overall span a wide range (3317 minutes) around a mean of 1667.52 minutes, so participation varies considerably between players. The skewness (third standardized moment) is 0.14, so the distribution is close to symmetric. The excess kurtosis (fourth standardized moment minus 3) is -1.17, close to the -1.2 of a uniform distribution, which indicates a flat distribution with playing times spread fairly evenly over the range.

In [None]:
CI_mean(min,0.1,29)

Each time the cell is run, 29 different players are drawn at random, so the interval changes from run to run.

In the runs we observed, the true mean of about 1667.5 minutes fell within the interval, as expected for a 90% confidence interval (alpha = 0.1).

We are testing whether the distribution of minutes played by players follows a uniform distribution, using a goodness-of-fit test.

In [None]:
goodness_of_fit_uniform(min)

In [None]:
de = premier[(premier["nationality"] == "Germany") & (premier["minutes_played_overall"] > 100)]
en= premier[(premier["nationality"] == "England") & (premier["minutes_played_overall"] > 100)]
min_de = np.array(de["minutes_played_overall"])  
min_en= np.array(en["minutes_played_overall"])  

In [None]:
compare_means(min_de,min_en)

The analysis reveals that the mean minutes played by German players (2198.42) is significantly higher than that of English players (1592.80), with a p-value of 0.0344. This indicates that there is a significant difference in the total minutes played between the two nationalities among players who have played more than 100 minutes.

In [None]:
keeper = premier[(premier["position"] == "Goalkeeper") & (premier["minutes_played_overall"] > 100)]
forward= premier[(premier["position"] == "Forward") & (premier["minutes_played_overall"] > 100)]
app_k = np.array(keeper["minutes_played_overall"])  
app_f= np.array(forward["minutes_played_overall"]) 

In [None]:
chi_squared_test(bins_min, app_k, app_f, "Keeper", "Forward")

In [None]:
compare_means(app_k,app_f)

The analysis shows that when comparing appearances between goalkeepers and forwards, there is no significant difference in their mean number of appearances (p-value = 0.5430), and the distributions also do not differ significantly (Chi-squared p-value = 0.2920). This suggests that the number of appearances is similarly distributed for both positions.

However, when comparing the minutes played overall, there is a significant difference in the mean minutes played by goalkeepers (2061.82 minutes) and forwards (1449.57 minutes) with a p-value of 0.0040, indicating a statistically significant difference. Additionally, the chi-squared test also shows a significant difference in the distributions of minutes played (p-value = 0.0012), implying that goalkeepers and forwards have different playing time distributions.

Goalkeepers tend to play significantly more minutes than forwards, as seen from the statistical tests on minutes_played_overall.
The lack of significant difference in appearances suggests that the number of games played is more evenly distributed between goalkeepers and forwards, while the time spent on the field differs considerably.

# Conclusion

The analysis of Premier League player performances for the 2018/2019 season has revealed several trends and identified interesting characteristics regarding the distribution of players, their contributions, and the overall dynamics of the league.

We used the following statistical methods:
- Descriptive statistics were used to understand the distribution of our variables.
- Confidence intervals for means and proportions were calculated to assess the reliability of our estimates.
- Goodness-of-fit tests were performed to check whether the data followed known distributions.
- Hypothesis tests, including chi-squared and t-tests, helped us compare the performance across different groups, such as position, age, and nationality.

The analysis revealed important insights into player participation in the Premier League. While goalkeepers play more minutes than forwards, the number of appearances is generally balanced across these positions. Age and nationality also play a role in a player’s involvement: players aged 30 or under have fewer appearances on average when players with little or no game time are counted, and German players play more minutes than English players.

This study provides a deeper understanding of the dynamics of player performance and participation in the Premier League. 
