# Gather

## Packages

In [3]:
import pandas as pd
import numpy as np
from scipy.stats import chisquare
from bs4 import BeautifulSoup
import json
import requests

## Helper Functions

In [63]:
def dataframe(table):
    """Convert a soccerstats standings <table> element into a DataFrame."""
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:  # skip the header row
        cols = [ele.text.strip() for ele in row.find_all('td')]
        data.append([ele for ele in cols if ele])  # drop empty cells
    df = pd.DataFrame(data, columns=['rank', 'club', 'game_played', 'win', 'draw', 'loss',
                                     'goals_for', 'goals_against', 'goal_difference', 'points'])
    return df

## Data

Data is collected from soccerstats.com.

In [75]:
url = 'https://www.soccerstats.com/homeaway.asp?league=england_20{}'
years = np.arange(14, 21)
for year in years:
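    response = requests.get(url.format(year))
    response.raise_for_status()  # stop early on HTTP errors
    html = response.text

    # Name the parser explicitly; the first 'btable' is read as home, the second as away.
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table', {'id': 'btable'})
    home_table, away_table = tables[0], tables[1]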
    
    home_df = dataframe(home_table)
    away_df = dataframe(away_table)
    
    home_df.to_csv('./data/home20{}-20{}.csv'.format(year-1, year), index=False)
    away_df.to_csv('./data/away20{}-20{}.csv'.format(year-1, year), index=False)

# Data Preparation

# Introduction
In the English Premier League, 20 teams play each other and every team plays 38 games: 19 on its own field (Home) and 19 on its opponents' fields (Away).  
There is a claim that teams perform better on their own fields than on their opponents' fields.  
Under the league's points system, a team gets 3 points for a win and nothing for a loss, and if there is a draw each team gets 1 point.  
The data is collected from a website that keeps records from 2013 to the present. A statistical analysis is performed on these records to see whether there is statistical evidence for that claim.
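
As a quick check on the points system described above, the points for a season record can be computed from wins, draws, and losses. The function name `league_points` and the sample record below are illustrative assumptions, not taken from the scraped data.

```python
def league_points(wins, draws, losses=0):
    """Return league points: 3 per win, 1 per draw, 0 per loss."""
    return 3 * wins + 1 * draws + 0 * losses


# Hypothetical home record over 19 home games: 10 wins, 5 draws, 4 losses.
home_points = league_points(wins=10, draws=5, losses=4)
print(home_points)  # 35
```

The `points` column in the scraped tables should agree with this formula applied to the `win` and `draw` columns, which gives a simple sanity check on the parsed data.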

# Research Question & Hypothesis
Research Question: Do teams perform better on their fields than on their opponents' fields?  
Hypothesis:  
H0: μH ≤ μA  
HA: μH > μA  
  
H stands for Home  
A stands for Away

The null hypothesis states that teams perform no better at Home than Away (they perform the same or worse), while the alternative hypothesis states that teams perform better at Home.

# Experimental Design
Since the number of records is small, a t-test is conducted over the 20 records. The data provides only general information on performance throughout the season, which is why an independent one-tailed t-test is used here.  
Points is the most informative variable for how the different teams are performing, so points is the variable selected for the test.  
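
The test described above could be run roughly as follows. This is a minimal sketch: the helper name `home_advantage_test` is an assumption, the `alternative='greater'` argument of `scipy.stats.ttest_ind` requires SciPy 1.6 or later, and the numbers in the example are placeholders rather than real league data.

```python
import pandas as pd
from scipy.stats import ttest_ind


def home_advantage_test(home_df, away_df, column='points'):
    """One-tailed independent t-test of H0: mean(home) <= mean(away)."""
    # Scraped cells may be strings, so coerce to numbers first.
    home = pd.to_numeric(home_df[column])
    away = pd.to_numeric(away_df[column])
    # alternative='greater' tests HA: mean(home) > mean(away).
    result = ttest_ind(home, away, alternative='greater')
    return result.statistic, result.pvalue


# Placeholder values for illustration only, NOT real league data.
home_df = pd.DataFrame({'points': [30, 28, 35, 22, 40]})
away_df = pd.DataFrame({'points': [20, 18, 25, 15, 30]})

t_stat, p_value = home_advantage_test(home_df, away_df)
print(f't = {t_stat:.3f}, p = {p_value:.4f}')
```

With the files saved earlier, the two frames would instead come from calls such as `pd.read_csv('./data/home2019-2020.csv')` and `pd.read_csv('./data/away2019-2020.csv')`. By default `ttest_ind` assumes equal variances; passing `equal_var=False` switches to Welch's t-test if that assumption seems doubtful.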

# Results
This section should include all of the results from your analysis. Provide all the information from your analysis even if it does not support your hypothesis. You should also provide screenshots/images of your data. You can use whatever types of visualizations that best represent the data (i.e. graphs, bar charts, histograms, etc).  

# Conclusion
Here, you should discuss whether or not the results from your data supported or did not support your hypothesis. If the results did not support your hypothesis, discuss why you think they did not and if there is anything that could have been done differently that would have changed the outcome. If your results did support your hypothesis, explain what this means and the real life implications. In either case, discuss why further research on the topic should be done and why it is important to do so.