# <center><div class="alert alert-block alert-info">This code extracts data from a <b>website</b>, transforms it, and loads it into a database</div></center>

## The website we will be scraping here is __[GlobalEspn.com](https://global.espn.com/football/standings/_/league/eng.1/season/2010)__, from which we collect the English Premier League table for each season from 2010 to 2023

### Importing needed packages

In [2]:
import pandas as pd
import urllib.request  # the 'request' submodule must be imported explicitly
from bs4 import BeautifulSoup
import sqlite3

In [3]:
# Looping through the seasons (2010 to 2023), scraping each season's table
# and loading it into the database in the same pass
for i in range(10, 24):
    # Required variables
    url = f'https://global.espn.com/football/standings/_/league/eng.1/season/20{i}'
    html_object = urllib.request.urlopen(url)
    
    # Extracting the html structure from the object created
    html_soup = BeautifulSoup(html_object, 'html.parser')

 
    ## Retrieving the team names and creating the DataFrame
    # Fetching the 'standings' table from the soup
    standings_html_table = html_soup.find(class_ = 'standings__table InnerLayout__child--dividers')
 
    # Retrieving the teams' names
    teams_html_table = standings_html_table.find(class_ = 'Table Table--align-right Table--fixed Table--fixed-left')
    teams_name_year_dict = {'Club':[], 'Year':[]}
    for element in teams_html_table.find('tbody').find_all(class_ = 'hide-mobile'):
        teams_name_year_dict['Club'].append(element.get_text())
        teams_name_year_dict['Year'].append(f'20{i}')
    teams_df = pd.DataFrame(teams_name_year_dict)


    ## Retrieving the glossary
    # Locating the glossary section in the soup
    glossary_html_divider = html_soup.find(class_ = 'glossary glossary--fullWidth glossary--fullWidth--desktopLG')

    # Creating the glossary dictionary
    # Splitting on the first ':' only, so each entry always yields a (short, long) pair
    glossary_list = [element.get_text().split(':', 1) for element in glossary_html_divider.find_all('li')]
    glossary_dict = {short:long for short,long in glossary_list}


    ## Retrieving the points
    # Extracting points table from the html soup
    points_html_table = html_soup.find(class_ = 'Table Table--align-right')
    
    # Creating points dataframe
    # point headers
    points_header_list = [element.get_text() for element in points_html_table.find('thead').find_all('th')]

    points_list = []
    for row in points_html_table.find('tbody').find_all('tr'):
        points_list.append([element.get_text() for element in row.find_all('td')])

    points_df = pd.DataFrame(points_list, columns=points_header_list)
    points_df = points_df.rename(columns = glossary_dict)


    ## Joining both DataFrames
    df = pd.concat([teams_df, points_df], axis = 1)


    ## Optionally saving the data as a CSV file in the current working directory
    # df.to_csv(f'premier_league_table_20{i}.csv', index = False)


    ## Loading data into the SQLite database file (one table per season)
    conn = sqlite3.connect('Premier_league.db')
    df.to_sql(f'premier_league_table_20{i}', con = conn, if_exists = 'replace', index = False)
    conn.close()
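
Once the loop has finished, it can be worth confirming that the load step actually wrote the tables. The sketch below is one way to do that with `sqlite3` alone: it lists every table in the database and counts its rows. To keep it runnable without the scraped data, it uses an in-memory database and two placeholder rows (the club names, the `Points` column, and the values are made up for illustration, not real standings). With the real data, we would connect to `'Premier_league.db'` instead and skip the setup lines.

```python
import sqlite3

# Assumption: an in-memory database stands in for 'Premier_league.db'
# so that this sketch runs without the scraped data.
conn = sqlite3.connect(':memory:')

# Placeholder table and rows (hypothetical values, not real standings).
conn.execute(
    'CREATE TABLE premier_league_table_2010 (Club TEXT, Year TEXT, Points TEXT)'
)
conn.executemany(
    'INSERT INTO premier_league_table_2010 VALUES (?, ?, ?)',
    [('Club A', '2010', '80'), ('Club B', '2010', '71')],
)

# sqlite_master is SQLite's built-in catalogue of tables, indexes, and views.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Count the rows in each table that was found.
row_counts = {
    name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    for name in table_names
}
conn.close()

print(row_counts)
```

Against the real database we would expect one table per season; any season whose row count differs from the others is worth inspecting before the analysis.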

<b>Notice that so far all we have done is extract the data, transform it, and load it into the database.<br>
No cleaning has been done yet. Cleaning involves handling duplicate, inaccurate, unwanted, irrelevant, and missing data, dealing with outliers, and fixing structural errors.<br>We will go through data cleaning in detail in the Analysis part.</b>
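
One structural issue we can already anticipate: `get_text()` returns strings, so every value scraped above is stored as text, including the numeric columns. The sketch below shows one possible way to convert such cells later on. The helper name `to_int_or_none` and the sample row are hypothetical, chosen only for illustration; the actual cleaning approach is left to the Analysis part.

```python
def to_int_or_none(text):
    """Convert a scraped cell such as '38' or '+45' to an int.

    Returns None when the cell is empty or is not a whole number,
    so missing values stay visible instead of raising an error.
    """
    try:
        return int(text.strip())
    except ValueError:
        return None


# Hypothetical raw cells as they might come out of get_text().
raw_row = ['38', '+45', '', 'n/a']
clean_row = [to_int_or_none(cell) for cell in raw_row]

print(clean_row)
```

Returning `None` rather than `0` for unreadable cells keeps missing data distinguishable from a genuine zero.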