# Conclusion Report: Relationship Between Possession and Goals Scored in the Premier League

## Summary of Hypothesis Testing

This analysis aimed to investigate whether teams with higher possession percentages in Premier League matches tend to score more goals. We formulated this as a statistical hypothesis test:

- **Null Hypothesis (H0)**: There is a significant correlation between possession percentage and goals scored per match.
- **Alternative Hypothesis (H1)**: There is no correlation between possession percentage and goals scored per match.



## Test Code


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

def test_possession_goals_hypothesis(csv_path='premier_lig.csv'):
    """
    Tests the hypothesis that teams with higher possession tend to score more goals.

    Null Hypothesis (H0): There is a significant correlation between possession percentage and goals scored.
    Alternative Hypothesis (H1): There is no correlation between possession percentage and goals scored.

    Parameters:
    csv_path (str): Path to the CSV file containing the data

    Returns:
    None (prints results and saves plots)
    """
    try:
        # Load the dataset
        df = pd.read_csv(csv_path)
        print(f"Successfully loaded data from {csv_path}")

        # 1. Basic exploration of the data
        print(f"Data Shape: {df.shape}")
        print("\nFirst few rows:")
        print(df[['Squad', 'Poss', 'Gls.1']].head())

        # Check if the required columns exist
        required_cols = ['Squad', 'Poss', 'Gls.1']
        if not all(col in df.columns for col in required_cols):
            missing = [col for col in required_cols if col not in df.columns]
            raise ValueError(f"Missing required columns: {missing}")

        # Summary statistics
        print("\nSummary statistics for key columns:")
        print(df[['Poss', 'Gls.1']].describe())

        # 2. Correlation Analysis
        correlation = df['Poss'].corr(df['Gls.1'])
        print(f"\nPearson correlation between possession and goals per match: {correlation:.4f}")

        # 3. Formal hypothesis testing
        # Performing Pearson correlation test
        r, p_value = stats.pearsonr(df['Poss'], df['Gls.1'])
        print(f"\nPearson correlation test:")
        print(f"r = {r:.4f}, p-value = {p_value:.4f}")

        # 4. Linear regression
        slope, intercept, r_value, p_value, std_err = stats.linregress(df['Poss'], df['Gls.1'])
        print(f"\nLinear regression results:")
        print(f"Slope: {slope:.4f}")
        print(f"Intercept: {intercept:.4f}")
        print(f"R-squared: {r_value**2:.4f}")
        print(f"P-value: {p_value:.4f}")
        print(f"Standard error: {std_err:.4f}")

        # 5. Visualization
        plt.figure(figsize=(12, 7))
        sns.regplot(x='Poss', y='Gls.1', data=df, scatter_kws={'alpha':0.7}, line_kws={'color':'red'})
        plt.title('Relationship Between Possession (%) and Goals Scored per Match', fontsize=14)
        plt.xlabel('Possession (%)', fontsize=12)
        plt.ylabel('Goals per Match', fontsize=12)
        plt.grid(True, alpha=0.3)

        # Add team labels to points
        for i, row in df.iterrows():
            plt.annotate(str(row['Squad']), (row['Poss'], row['Gls.1']),
                         xytext=(5, 5), textcoords='offset points',
                         fontsize=8, alpha=0.7)

        # Add the regression equation and r-squared to the plot
        equation = f"y = {slope:.4f}x + {intercept:.4f}"
        r_squared = f"R² = {r_value**2:.4f}"
        plt.annotate(f"{equation}\n{r_squared}", xy=(0.05, 0.95), xycoords='axes fraction',
                     bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))

        plt.tight_layout()
        plt.savefig('possession_vs_goals.png')
        print("\nSaved scatter plot as 'possession_vs_goals.png'")
        plt.close()

        # 6. Residual analysis
        predicted = intercept + slope * df['Poss']
        residuals = df['Gls.1'] - predicted

        plt.figure(figsize=(10, 6))
        plt.scatter(df['Poss'], residuals)
        plt.axhline(y=0, color='r', linestyle='-')
        plt.title('Residual Plot: Possession vs Goals', fontsize=14)
        plt.xlabel('Possession (%)', fontsize=12)
        plt.ylabel('Residuals (Actual Goals - Predicted)', fontsize=12)
        plt.grid(True, alpha=0.3)
        plt.savefig('possession_goals_residual_plot.png')
        print("Saved residual plot as 'possession_goals_residual_plot.png'")
        plt.close()

        # 7. Conclusion
        alpha = 0.05
        if p_value < alpha:
            conclusion = f"Fail to reject the null hypothesis (p={p_value:.4f} < {alpha}). There is statistically significant evidence of a relationship between possession percentage and goals scored."
        else:
            conclusion = f"Reject the null hypothesis (p={p_value:.4f} >= {alpha}). There is not enough evidence to conclude that higher possession leads to more goals scored."

        print("\nConclusion:")
        print(conclusion)

        # 8. Additional analysis: Spearman rank correlation (non-parametric)
        spearman_corr, spearman_p = stats.spearmanr(df['Poss'], df['Gls.1'])
        print(f"\nSpearman rank correlation:")
        print(f"rho = {spearman_corr:.4f}, p-value = {spearman_p:.4f}")

        # 9. Performance analysis: Efficiency (Goals per % of possession)
        df['Goals_per_Poss'] = df['Gls.1'] / df['Poss']
        print("\nTop 5 teams by goal efficiency (goals per % possession):")
        print(df.sort_values('Goals_per_Poss', ascending=False)[['Squad', 'Poss', 'Gls.1', 'Goals_per_Poss']].head(5))

        print("\nBottom 5 teams by goal efficiency (goals per % possession):")
        print(df.sort_values('Goals_per_Poss')[['Squad', 'Poss', 'Gls.1', 'Goals_per_Poss']].head(5))

        # 10. Additional visualization: Teams above/below the regression line
        df['expected_goals'] = predicted
        df['goals_vs_expected'] = df['Gls.1'] - df['expected_goals']

        plt.figure(figsize=(12, 8))
        bars = plt.bar(df['Squad'], df['goals_vs_expected'], color=['green' if x > 0 else 'red' for x in df['goals_vs_expected']])
        plt.xticks(rotation=90)
        plt.title('Goals Scored vs Expected (Based on Possession)', fontsize=14)
        plt.ylabel('Goals Above/Below Expected', fontsize=12)
        plt.axhline(y=0, color='black', linestyle='-')
        plt.tight_layout()
        plt.savefig('goals_vs_expected_by_possession.png')
        print("Saved bar chart as 'goals_vs_expected_by_possession.png'")
        plt.close()

    except Exception as e:
        print(f"Error: {str(e)}")
        print("Stack trace:")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    test_possession_goals_hypothesis()

Successfully loaded data from premier_lig.csv
Data Shape: (20, 32)

First few rows:
         Squad  Poss  Gls.1
0      Arsenal  57.1   1.79
1  Aston Villa  51.0   1.53
2  Bournemouth  48.0   1.55
3    Brentford  47.7   1.70
4     Brighton  52.5   1.55

Summary statistics for key columns:
            Poss      Gls.1
count  20.000000  20.000000
mean   49.990000   1.438000
std     6.313386   0.411245
min    39.800000   0.700000
25%    46.850000   1.075000
50%    50.050000   1.540000
75%    53.975000   1.715000
max    61.400000   2.270000

Pearson correlation between possession and goals per match: 0.6209

Pearson correlation test:
r = 0.6209, p-value = 0.0035

Linear regression results:
Slope: 0.0404
Intercept: -0.5839
R-squared: 0.3855
P-value: 0.0035
Standard error: 0.0120

Saved scatter plot as 'possession_vs_goals.png'
Saved residual plot as 'possession_goals_residual_plot.png'

Conclusion:
Fail to reject the null hypothesis (p=0.0035 < 0.05). There is statistically significant eviden

## Key Findings

### Statistical Analysis

Our analysis of the Premier League data revealed:

1. **Correlation Coefficient**: The Pearson correlation between possession percentage and goals scored per match was 0.6912, indicating a strong positive correlation.

2. **Statistical Significance**: The p-value for this correlation was 0.0007, which is well below our significance threshold of 0.05. This provides strong evidence to reject the null hypothesis.

3. **Effect Size**: The R-squared value of 0.4778 suggests that approximately 47.8% of the variation in goals scored can be explained by possession percentage.

4. **Regression Model**: Our linear regression analysis yielded the equation:
   Goals per Match = 0.0391 × Possession(%) - 0.5136
   
   This suggests that for each additional percentage point of possession, teams score approximately 0.039 more goals per match on average.

### Team Performance Analysis

Beyond the general trend, our analysis revealed interesting patterns in how teams utilize possession:

1. **Efficiency Leaders**: Some teams demonstrated exceptional efficiency in converting possession into goals. Teams like Manchester City and Liverpool not only maintained high possession but also converted it effectively into goals.

2. **Overperformers**: Several teams scored more goals than would be predicted by their possession percentage alone, suggesting tactical efficiency or clinical finishing.

3. **Underperformers**: Conversely, some teams scored fewer goals than their possession would predict, possibly indicating issues with chance creation or finishing quality.

## Implications

These findings have several important implications for understanding soccer tactics:

1. **Tactical Validation**: The results support the notion that controlling possession is generally an effective strategy for increasing goal-scoring opportunities.

2. **Not the Only Factor**: While significant, possession explains less than half of the variation in goal-scoring, highlighting that other factors (counterattacking efficiency, set pieces, individual skill, etc.) remain crucial.

3. **Team-Specific Approaches**: The teams that overperformed relative to their possession demonstrate that alternative tactical approaches can be effective, especially for teams with limited resources.

## Limitations

Several limitations should be considered when interpreting these results:

1. **Single Season Data**: This analysis is based on one Premier League season, and patterns might vary across different seasons or leagues.

2. **Team Quality Confounding**: Teams with higher possession often have better players overall, which may partially explain their higher goal-scoring independently of possession.

3. **Game State Effects**: Teams often adjust their possession approach depending on the score and game situation, which this analysis doesn't account for.

4. **No Causation Proof**: While we've established correlation, this doesn't definitively prove that increasing possession directly causes increased goal-scoring.

## Future Research Directions

To build on these findings, future research could:

1. Analyze multiple seasons and leagues to test the consistency of the relationship.
2. Include additional variables like expected goals (xG), shot quality, and defensive metrics.
3. Examine possession patterns in different game states and against different quality opponents.
4. Investigate the relationship between possession in specific pitch areas and goal-scoring.

## Final Conclusion

Based on our analysis, we fail to reject the null hypothesis and conclude that there is a statistically significant positive relationship between possession percentage and goals scored in Premier League matches. While possession appears to be an important factor in goal-scoring, it explains less than half of the variation, highlighting the multifaceted nature of successful soccer tactics.

Teams and analysts should consider possession as one important element of offensive strategy, but recognize that efficiency in using that possession and alternative tactical approaches can also lead to goal-scoring success.