# Scraping Ado

This script shows how to scrape data from [LombardiaCanestro](https://lombardia.italiacanestro.it/). In this case, it scrapes the **Promozione - Girone E** league. Here is the list of all teams:

* [Aurora Trezzo](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=482).
* [Posal Sesto San Giovanni](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=483).
* [Ado San Benedetto Milano](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=484).
* [CGB Brugherio](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=485).
* [Azzurri Niguardese](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=486).
* [Pallacanestro Carugate](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=487).
* [CBBA Olimpia Cologno](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=488).
* [Cesano Seveso](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=489).
* [Inzago Basket](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=490).
* [OSAL Novate](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=491).
* [Basket Ajaccio 1988](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=492).
* [Social OSA](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=493).
* [Basket San Rocco 2013 Seregno](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=494).
* [Ciesse Freebasket Milano](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=495).
* [ACLI Trecella](https://lombardia.italiacanestro.it/Maschile/Squadra?id=42&tid=496).

## Python Script

The following cells show the script that scrapes the LombardiaCanestro site. The scraped data are:

* *All games*, with this link: `https://lombardia.italiacanestro.it/Maschile/Partita?id=`***`<id_game>`***
* *Standings*, with this link: `https://lombardia.italiacanestro.it/Maschile/Calendario?id=`***`42`***
* *Rosters*, with this link: `https://lombardia.italiacanestro.it/Maschile/Roster?id=`***`42`***
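
The three links above differ only in the page name and the ID they take. A minimal sketch of how the URLs and the request could be built follows; `build_url` and `build_request` are hypothetical helpers that do not appear in the notebook, and no network call is made here.

```python
from urllib.request import Request

BASE_URL = "https://lombardia.italiacanestro.it/Maschile"

# Page names taken from the links listed above
PAGES = {"game": "Partita", "standings": "Calendario", "rosters": "Roster"}


def build_url(kind, page_id):
    """Return the URL of a game, standings or rosters page (hypothetical helper)."""
    return f"{BASE_URL}/{PAGES[kind]}?id={page_id}"


def build_request(url):
    """Wrap the URL in a Request carrying the same User-Agent the notebook uses."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


print(build_url("game", 5508))
print(build_url("standings", 42))
```

Keeping the page names in one dictionary means a change in the site's paths would only need to be fixed in one place.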

### Libraries

Libraries used:

* `pandas`.
* `numpy`.
* `urllib.request`.
* `tqdm`.
* `pycelonis`.

In [1]:
import pandas as pd
import numpy as np
from urllib.request import Request, urlopen
from tqdm import tqdm

#-- Celonis
from pycelonis import get_celonis
celonis = get_celonis(
    base_url = "alberto-filosa-protiviti-it.training.celonis.cloud",
    api_token = "<YOUR_API_TOKEN>",  # placeholder: keep the real token out of the notebook
    key_type = 'USER_KEY'
)



[2023-04-30 10:41:53,967] INFO: Initial connect successful! PyCelonis Version: 2.0.0
[2023-04-30 10:41:54,030] INFO: `package-manager` permissions: ['EDIT_ALL_SPACES', 'MANAGE_PERMISSIONS', 'CREATE_SPACE', 'DELETE_ALL_SPACES']
[2023-04-30 10:41:54,032] INFO: `workflows` permissions: ['EDIT_AGENTS', 'VIEW_AGENTS', 'REGISTER_AGENTS', 'MANAGE_PERMISSIONS']
[2023-04-30 10:41:54,033] INFO: `task-mining` permissions: ['EDIT_CLIENT_SETTINGS', 'EDIT_USERS']
[2023-04-30 10:41:54,037] INFO: `action-engine` permissions: ['CREATE_PROJECTS', 'MANAGE_SKILLS', 'ACCESS_ALL_PROJECTS', 'MY_INBOX']
[2023-04-30 10:41:54,038] INFO: `team` permissions: ['MANAGE_AUDIT_LOGS', 'MANAGE_SSO_SETTINGS', 'USE_AUDIT_LOGS_API', 'MANAGE_ADOPTION_VIEWS', 'MANAGE_GENERAL_SETTINGS', 'MANAGE_GROUPS', 'MANAGE_APPLICATIONS', 'USE_STUDIO_ADOPTION_API', 'MANAGE_LOGIN_HISTORY', 'MANAGE_LICENSE_SETTINGS', 'USE_LOGIN_HISTORY_API', 'MANAGE_MEMBERS', 'MANAGE_UPLINK_INTEGRATIONS', 'MANAGE_PERMISSIONS', 'MANAGE_ADMIN_NOTIFICATIONS',

### Games Scraping

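`scraping_table` takes a list of game IDs and, for each game, downloads the box score, skips it if the game has not been played yet (fewer than 5 rows), and tags every player row with team, opponent, a `C`/`T` flag (first or second listed team) and the game ID. The per-game tables are concatenated into a single DataFrame.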

In [2]:
def scraping_table(ls_id_game):
    
    #-- Disable chained assignments
    pd.options.mode.chained_assignment = None 
    
    #-- Initialize an empty DataFrame; the result of each id_game is appended to it
    df_all_games  = pd.DataFrame()
    
    for game in tqdm(ls_id_game):
        
        lv_url = f"https://lombardia.italiacanestro.it/Maschile/Partita?id={game}"
        
        #-- Get URL
        req = Request(lv_url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req).read()
            
        #-- Get Current Game
        df_single_game = pd.read_html(webpage)[0]
        
        #-- If the game has not been played yet, skip to the next game
        if len(df_single_game.index) < 5:
            continue
        
        #-- List of Teams
        ls_teams = df_single_game.loc[df_single_game[1] == "PTS"][0]
        
        #-----------------------
        #-- Data Manipulation --
        #-----------------------
        
        #-- Drop all-NaN rows (each URL holds a single table that contains both teams)
        df_single_game_nona = df_single_game.dropna(how = "all")
        
        #-- Add Columns in the DataFrame
        df_single_game_nona["Squadra"]    = [ls_teams[0] if ls_teams.index[1] > row else ls_teams[ls_teams.index[1]] for row in range(0, df_single_game_nona.shape[0])]
        df_single_game_nona["Squadra"]    = df_single_game_nona["Squadra"].str.title()
        df_single_game_nona["Avversario"] = [ls_teams[ls_teams.index[1]] if ls_teams.index[1] > row else ls_teams[0] for row in range(0, df_single_game_nona.shape[0])]
        df_single_game_nona["Avversario"] = df_single_game_nona["Avversario"].str.title()
        df_single_game_nona["Partita"]    = ["C" if ls_teams.index[1] > row else "T" for row in range(0, df_single_game_nona.shape[0])]
        df_single_game_nona["id_gara"]    = game
        
        #-- Remove Header Rows (if they have PTS in the first column)
        df_single_game_end = df_single_game_nona[df_single_game_nona[1] != 'PTS']
        df_single_game_end.columns = ["giocatore", "punti_totali", "tiri_liberi",
                                      "due_punti", "tre_punti", "squadra",
                                      "avversario", "partita","id_gara"]
        
        df_single_game_end["giocatore"] = df_single_game_end["giocatore"].str.title()
        
        #-- Concat Games
        df_all_games = pd.concat([df_all_games, df_single_game_end])
        
    return df_all_games
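
The labelling step in `scraping_table` can be tried offline on a toy table. The sketch below uses a different technique from the notebook: instead of comparing row positions with the index of the second header row, it forward-fills the team name from each header row. The function `label_teams`, the sample rows and the header labels other than `PTS` are assumptions made for illustration.

```python
import pandas as pd

# Toy box score shaped like the scraped table: one header row per team
# (second column == "PTS"), player rows, and an all-NaN separator row.
raw = pd.DataFrame([
    ["TEAM A", "PTS", "TL", "2P", "3P"],
    ["rossi mario", "10", "2/2", "4", "0"],
    ["bianchi luca", "7", "1/2", "3", "0"],
    [None, None, None, None, None],
    ["TEAM B", "PTS", "TL", "2P", "3P"],
    ["verdi paolo", "12", "0/0", "3", "2"],
])


def label_teams(df):
    """Tag each player row with team, opponent and the C/T flag (hypothetical helper)."""
    df = df.dropna(how="all").copy()
    is_header = df[1] == "PTS"
    # Keep the team name on header rows only, then propagate it downwards
    df["squadra"] = df[0].where(is_header).ffill().str.title()
    first, second = df.loc[is_header, 0].str.title().tolist()
    df["avversario"] = df["squadra"].map({first: second, second: first})
    df["partita"] = df["squadra"].map({first: "C", second: "T"})
    # Header rows are no longer needed once the labels are in place
    return df[~is_header].reset_index(drop=True)


labelled = label_teams(raw)
print(labelled[[0, "squadra", "avversario", "partita"]])
```

Working on a copy also avoids the chained-assignment warning, so the global pandas option would not need to be changed in this variant.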

### Rosters

In [3]:
def scraping_players(id_roster):
    
    #-- Initialize an empty DataFrame; the roster of each team is appended to it
    df_all_players = pd.DataFrame()
    
    lv_url = f"https://lombardia.italiacanestro.it/Maschile/Roster?id={id_roster}"
        
    #-- Get URL
    req = Request(lv_url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()

    #-- Get all roster tables (one per team)
    df_single_team = pd.read_html(webpage)
    
    for team in tqdm(df_single_team):
    
        team["Squadra"] = team.columns[1].title()
        team.columns = ["Numero", "Giocatore", "Squadra"]
        df_all_players = pd.concat([df_all_players, team], axis = 0, ignore_index = True)
    
    df_all_players["Giocatore"] = df_all_players["Giocatore"].str.title()
    
    return df_all_players
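
The roster reshaping relies on the team name being the header of the second column of each table. A minimal offline sketch of that step on toy tables follows; `stack_rosters`, the `#` header and the sample players are assumptions, and the frames are collected in a list and concatenated once rather than inside the loop.

```python
import pandas as pd

# Toy roster tables: the second column header carries the team name
tables = [
    pd.DataFrame({"#": [4, 7], "AURORA TREZZO": ["rossi mario", "bianchi luca"]}),
    pd.DataFrame({"#": [10], "INZAGO BASKET": ["verdi paolo"]}),
]


def stack_rosters(team_tables):
    """Stack per-team roster tables into one DataFrame (hypothetical helper)."""
    frames = []
    for table in team_tables:
        team = table.copy()  # leave the input tables untouched
        team["Squadra"] = str(team.columns[1]).title()
        team.columns = ["Numero", "Giocatore", "Squadra"]
        frames.append(team)
    players = pd.concat(frames, axis=0, ignore_index=True)
    players["Giocatore"] = players["Giocatore"].str.title()
    return players


print(stack_rosters(tables))
```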

### Standings

In [4]:
def scraping_standings(id_calendar):
    
    lv_url = f"https://lombardia.italiacanestro.it/Maschile/Calendario?id={id_calendar}"
        
    #-- Get URL
    req = Request(lv_url, headers = {'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    
    df_standings = pd.read_html(webpage)[1]
    
    #-- Data String Manipulation
    df_standings["CLASSIFICA"] = df_standings["CLASSIFICA"].str.title()
    
    df_standings.columns = ["posizione", "squadra", "punti", "partite_giocate",
                            "vittorie",  "sconfitte", "punti_fatti", "punti_subiti"]
    
    return df_standings
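
Renaming the standings columns by position fails with a generic pandas error if the site ever changes the table layout. The sketch below adds an explicit column-count check before renaming; `tidy_standings` and every header of the toy table except `CLASSIFICA` are assumptions.

```python
import pandas as pd

STANDINGS_COLUMNS = ["posizione", "squadra", "punti", "partite_giocate",
                     "vittorie", "sconfitte", "punti_fatti", "punti_subiti"]


def tidy_standings(df):
    """Rename the standings columns after checking their number (hypothetical helper)."""
    if df.shape[1] != len(STANDINGS_COLUMNS):
        raise ValueError(
            f"expected {len(STANDINGS_COLUMNS)} columns, got {df.shape[1]}"
        )
    out = df.copy()
    out.columns = STANDINGS_COLUMNS
    out["squadra"] = out["squadra"].str.title()
    return out


# Toy standings row; only the "CLASSIFICA" header is taken from the notebook
toy = pd.DataFrame(
    [[1, "AURORA TREZZO", 20, 12, 10, 2, 800, 700]],
    columns=["#", "CLASSIFICA", "PT", "G", "V", "P", "PF", "PS"],
)
print(tidy_standings(toy))
```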

## Upload to Celonis

Each scraped DataFrame is uploaded to the Celonis Data Pool as the SQL table listed below:

| DataFrame Name   | SQL Table Name      |
|------------------|---------------------|
| `df_all_games`   | `PR_DATA_GAMES`     |
| `df_standing`    | `PR_DATA_STANDINGS` |
| `df_all_players` | `PR_PLAYERS_NAME`   |

In [5]:
%%time

print("Downloading All Games ...")
df_all_games   = scraping_table(np.arange(5508, 5714))

print("Downloading All Players ...")
df_all_players = scraping_players(42)

df_standing    = scraping_standings(42)

Downloading All Games ...


100%|██████████| 206/206 [00:44<00:00,  4.61it/s]


Downloading All Players ...


100%|██████████| 15/15 [00:00<00:00, 1075.41it/s]


CPU times: user 6.88 s, sys: 360 ms, total: 7.24 s
Wall time: 45.4 s


In [6]:
#-- Selecting Data Pool, Data Model and Data Job
data_pool = celonis.data_integration.get_data_pool("26a8fa87-21b1-4850-9447-48c2e6a171fc")
data_model = data_pool.get_data_model("9fb8576b-a8f6-4f71-9cb9-4722bafa7a92")
print(f"Selected the '{data_pool.name}' Data Pool and the '{data_model.name}' Data Model")

data_job = data_pool.get_job("f9300adf-cde5-43d8-bc44-eca7b355fda1")

Selected the 'Basket - Scraping Data' Data Pool and the 'Data Model - Promozione - Girone E' Data Model


In [7]:
dict_df_games = {
    "PR_DATA_GAMES":     df_all_games,
    "PR_DATA_STANDINGS": df_standing,
    "PR_PLAYERS_NAME":   df_all_players
}

for lv_sql_table, lv_data_frame in dict_df_games.items():
    
    print(f"Uploading of the '{lv_sql_table}' Table from Python to Celonis: \n")
    
    data_pool.create_table(table_name     = lv_sql_table,
                           df             = lv_data_frame,
                           drop_if_exists = True,
                           force          = True)
    
    print("Upload of the Table Completed!")
    print("_" * 45, "\n \n")

Uploading of the 'PR_DATA_GAMES' Table from Python to Celonis: 

[2023-04-30 10:42:47,632] INFO: Successfully created data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'
[2023-04-30 10:42:47,634] INFO: Add data frame as file chunks to data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'


  0%|          | 0/1 [00:00<?, ?it/s]

[2023-04-30 10:42:47,935] INFO: Successfully upserted file chunk to data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'
[2023-04-30 10:42:48,115] INFO: Successfully triggered execution for data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'
[2023-04-30 10:42:48,117] INFO: Wait for execution of data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'


0it [00:00, ?it/s]

[2023-04-30 10:43:08,684] INFO: Successfully created table 'PR_DATA_GAMES' in data pool
[2023-04-30 10:43:08,821] INFO: Successfully deleted data push job with id '57b9b6ba-e65d-4485-8dee-3557f5566e77'
Upload of the Table Completed!
_____________________________________________ 
 

Uploading of the 'PR_DATA_STANDINGS' Table from Python to Celonis: 

[2023-04-30 10:43:15,279] INFO: Successfully created data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'
[2023-04-30 10:43:15,281] INFO: Add data frame as file chunks to data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'


  0%|          | 0/1 [00:00<?, ?it/s]

[2023-04-30 10:43:15,558] INFO: Successfully upserted file chunk to data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'
[2023-04-30 10:43:15,730] INFO: Successfully triggered execution for data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'
[2023-04-30 10:43:15,732] INFO: Wait for execution of data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'


0it [00:00, ?it/s]

[2023-04-30 10:43:30,138] INFO: Successfully created table 'PR_DATA_STANDINGS' in data pool
[2023-04-30 10:43:30,284] INFO: Successfully deleted data push job with id '26cb7022-e36f-4036-97e5-e980ecef2a71'
Upload of the Table Completed!
_____________________________________________ 
 

Uploading of the 'PR_PLAYERS_NAME' Table from Python to Celonis: 

[2023-04-30 10:43:42,680] INFO: Successfully created data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'
[2023-04-30 10:43:42,681] INFO: Add data frame as file chunks to data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'


  0%|          | 0/1 [00:00<?, ?it/s]

[2023-04-30 10:43:42,965] INFO: Successfully upserted file chunk to data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'
[2023-04-30 10:43:43,132] INFO: Successfully triggered execution for data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'
[2023-04-30 10:43:43,134] INFO: Wait for execution of data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'


0it [00:00, ?it/s]

[2023-04-30 10:43:55,480] INFO: Successfully created table 'PR_PLAYERS_NAME' in data pool
[2023-04-30 10:43:55,648] INFO: Successfully deleted data push job with id '30ad0b67-0b7d-4cdb-965a-3cec7ea17725'
Upload of the Table Completed!
_____________________________________________ 
 



In [None]:
data_job.name
data_job.execute()

In [None]:
data_model.reload()