# 01 - Data Ingestion

This notebook loads the raw CSV files from the Transfermarkt dataset, performs initial cleaning and type conversions, and stores everything in a SQLite database for SQL analysis.

**Data Source**: [Football Data from Transfermarkt](https://www.kaggle.com/datasets/davidcariboo/player-scores) (Kaggle)

## Pipeline
1. Load CSV files into pandas DataFrames
2. Inspect data shapes, dtypes, and missing values
3. Perform type conversions (dates, numerics)
4. Load all tables into a SQLite database
5. Create indexes for query performance
6. Verify data integrity

In [1]:
import pandas as pd
import sqlite3
import sys
from pathlib import Path

# Project paths
PROJECT_ROOT = Path("..").resolve()
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
DB_PATH = DATA_PROCESSED / "football.db"

# Ensure output directory exists
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

print(f"Raw data directory: {DATA_RAW}")
print(f"Database path: {DB_PATH}")
print(f"Files in raw directory: {list(DATA_RAW.glob('*.csv'))}")

Raw data directory: C:\Users\bvenn\OneDrive\Desktop\Python Projekte\Showcases\first_project\data\raw
Database path: C:\Users\bvenn\OneDrive\Desktop\Python Projekte\Showcases\first_project\data\processed\football.db
Files in raw directory: [WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/appearances.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/clubs.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/club_games.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/competitions.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/games.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/game_events.csv'), WindowsPath('C:/Users/bvenn/OneDrive/Desktop/Python Projekte/Showcases/first_project/data/raw/

## 1. Load CSV Files

Load all CSV files from the Kaggle dataset into pandas DataFrames.

In [2]:
# Map each table name to its expected CSV filename
CSV_FILES = {
    "appearances": "appearances.csv",
    "clubs": "clubs.csv",
    "competitions": "competitions.csv",
    "games": "games.csv",
    "players": "players.csv",
    "player_valuations": "player_valuations.csv",
    "transfers": "transfers.csv",
    "club_games": "club_games.csv",
    "game_events": "game_events.csv",
}

# Load all CSVs
dataframes = {}
for name, filename in CSV_FILES.items():
    filepath = DATA_RAW / filename
    if filepath.exists():
        df = pd.read_csv(filepath, low_memory=False)
        dataframes[name] = df
        print(f"  {name:.<30} {df.shape[0]:>10,} rows x {df.shape[1]:>3} cols")
    else:
        print(f"  {name:.<30} FILE NOT FOUND: {filepath}")

print(f"\nLoaded {len(dataframes)} / {len(CSV_FILES)} tables.")

  appearances...................  1,725,531 rows x  13 cols
  clubs.........................        451 rows x  17 cols
  competitions..................         44 rows x  11 cols
  games.........................     77,872 rows x  23 cols
  players.......................     34,301 rows x  23 cols
  player_valuations.............    454,064 rows x   6 cols
  transfers.....................     85,404 rows x  10 cols
  club_games....................    155,744 rows x  11 cols
  game_events...................  1,098,364 rows x  11 cols

Loaded 9 / 9 tables.
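The loop above prints a warning for a missing file and carries on, so a partial load can go unnoticed in later cells. A minimal sketch of a stricter variant is shown below; the helper name `load_tables` and the demo CSV are assumptions made for illustration, not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(raw_dir: Path, csv_files: dict[str, str]) -> dict[str, pd.DataFrame]:
    """Load every expected CSV, or raise if any file is missing (hypothetical helper)."""
    missing = [fn for fn in csv_files.values() if not (raw_dir / fn).exists()]
    if missing:
        raise FileNotFoundError(f"Missing CSV files in {raw_dir}: {missing}")
    return {
        name: pd.read_csv(raw_dir / fn, low_memory=False)
        for name, fn in csv_files.items()
    }


# Demo against a throwaway directory with a single tiny CSV.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "clubs.csv").write_text("club_id,name\n10,Arminia Bielefeld\n", encoding="utf-8")

    tables = load_tables(raw, {"clubs": "clubs.csv"})

    try:
        load_tables(raw, {"clubs": "clubs.csv", "games": "games.csv"})
        raised = False
    except FileNotFoundError:
        raised = True  # games.csv does not exist in the demo directory
```

Failing early is a design choice: it trades the convenience of a partial load for the guarantee that every downstream cell sees all nine tables.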


## 2. Data Inspection

Quick overview of each table: columns, data types, missing values.

In [3]:
for name, df in dataframes.items():
    print(f"\n{'=' * 60}")
    print(f"TABLE: {name} ({df.shape[0]:,} rows, {df.shape[1]} columns)")
    print(f"{'=' * 60}")
    
    # Show columns, types, and null counts
    info_df = pd.DataFrame({
        "dtype": df.dtypes,
        "non_null": df.count(),
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1),
        "sample": df.iloc[0] if len(df) > 0 else None
    })
    display(info_df)


TABLE: appearances (1,725,531 rows, 13 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| appearance_id | object | 1725531 | 0 | 0.0 | 2231978_38004 |
| game_id | int64 | 1725531 | 0 | 0.0 | 2231978 |
| player_id | int64 | 1725531 | 0 | 0.0 | 38004 |
| player_club_id | int64 | 1725531 | 0 | 0.0 | 853 |
| player_current_club_id | int64 | 1725531 | 0 | 0.0 | 235 |
| date | object | 1725531 | 0 | 0.0 | 2012-07-03 |
| player_name | object | 1725527 | 4 | 0.0 | Aurélien Joachim |
| competition_id | object | 1725531 | 0 | 0.0 | CLQ |
| yellow_cards | int64 | 1725531 | 0 | 0.0 | 0 |
| red_cards | int64 | 1725531 | 0 | 0.0 | 0 |



TABLE: clubs (451 rows, 17 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| club_id | int64 | 451 | 0 | 0.0 | 10 |
| club_code | object | 451 | 0 | 0.0 | arminia-bielefeld |
| name | object | 451 | 0 | 0.0 | Arminia Bielefeld |
| domestic_competition_id | object | 451 | 0 | 0.0 | L1 |
| total_market_value | float64 | 0 | 451 | 100.0 | |
| squad_size | int64 | 451 | 0 | 0.0 | 27 |
| average_age | float64 | 413 | 38 | 8.4 | 25.3 |
| foreigners_number | int64 | 451 | 0 | 0.0 | 15 |
| foreigners_percentage | float64 | 399 | 52 | 11.5 | 55.6 |
| national_team_players | int64 | 451 | 0 | 0.0 | 4 |



TABLE: competitions (44 rows, 11 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| competition_id | object | 44 | 0 | 0.0 | BE1 |
| competition_code | object | 44 | 0 | 0.0 | jupiler-pro-league |
| name | object | 44 | 0 | 0.0 | jupiler-pro-league |
| sub_type | object | 44 | 0 | 0.0 | first_tier |
| type | object | 44 | 0 | 0.0 | domestic_league |
| country_id | int64 | 44 | 0 | 0.0 | 19 |
| country_name | object | 36 | 8 | 18.2 | Belgium |
| domestic_league_code | object | 36 | 8 | 18.2 | BE1 |
| confederation | object | 44 | 0 | 0.0 | europa |
| url | object | 44 | 0 | 0.0 | https://www.transfermarkt.co.uk/jupiler-pro-le... |



TABLE: games (77,872 rows, 23 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| game_id | int64 | 77872 | 0 | 0.0 | 2211607 |
| competition_id | object | 77872 | 0 | 0.0 | NLSC |
| season | int64 | 77872 | 0 | 0.0 | 2012 |
| round | object | 77872 | 0 | 0.0 | Final |
| date | object | 77872 | 0 | 0.0 | 2012-08-05 |
| home_club_id | int64 | 77872 | 0 | 0.0 | 383 |
| away_club_id | int64 | 77872 | 0 | 0.0 | 610 |
| home_club_goals | int64 | 77872 | 0 | 0.0 | 4 |
| away_club_goals | int64 | 77872 | 0 | 0.0 | 2 |
| home_club_position | float64 | 53818 | 24054 | 30.9 | |



TABLE: players (34,301 rows, 23 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| player_id | int64 | 34301 | 0 | 0.0 | 10 |
| first_name | object | 32167 | 2134 | 6.2 | Miroslav |
| last_name | object | 34301 | 0 | 0.0 | Klose |
| name | object | 34301 | 0 | 0.0 | Miroslav Klose |
| last_season | int64 | 34301 | 0 | 0.0 | 2015 |
| current_club_id | int64 | 34301 | 0 | 0.0 | 398 |
| player_code | object | 34301 | 0 | 0.0 | miroslav-klose |
| country_of_birth | object | 31386 | 2915 | 8.5 | Poland |
| city_of_birth | object | 31711 | 2590 | 7.6 | Opole |
| country_of_citizenship | object | 33948 | 353 | 1.0 | Germany |



TABLE: player_valuations (454,064 rows, 6 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| player_id | int64 | 454064 | 0 | 0.0 | 405973 |
| date | object | 448187 | 5877 | 1.3 | 2000-01-20 |
| market_value_in_eur | int64 | 454064 | 0 | 0.0 | 150000 |
| current_club_name | object | 454064 | 0 | 0.0 | Unknown |
| current_club_id | int64 | 454064 | 0 | 0.0 | 3057 |
| player_club_domestic_competition_id | object | 423027 | 31037 | 6.8 | BE1 |



TABLE: transfers (85,404 rows, 10 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| player_id | int64 | 85404 | 0 | 0.0 | 1077560 |
| transfer_date | object | 85404 | 0 | 0.0 | 2027-06-30 |
| transfer_season | object | 85404 | 0 | 0.0 | 26/27 |
| from_club_id | int64 | 85404 | 0 | 0.0 | 3060 |
| to_club_id | int64 | 85404 | 0 | 0.0 | 683 |
| from_club_name | object | 85404 | 0 | 0.0 | Atromitos |
| to_club_name | object | 85404 | 0 | 0.0 | Olympiacos |
| transfer_fee | float64 | 56009 | 29395 | 34.4 | 0.0 |
| market_value_in_eur | float64 | 52750 | 32654 | 38.2 | 1000000.0 |
| player_name | object | 85404 | 0 | 0.0 | Stavros Pnevmonidis |



TABLE: club_games (155,744 rows, 11 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| game_id | int64 | 155744 | 0 | 0.0 | 2211607 |
| club_id | int64 | 155744 | 0 | 0.0 | 383 |
| own_goals | int64 | 155744 | 0 | 0.0 | 4 |
| own_position | float64 | 107636 | 48108 | 30.9 | |
| own_manager_name | object | 154080 | 1664 | 1.1 | Dick Advocaat |
| opponent_id | int64 | 155744 | 0 | 0.0 | 610 |
| opponent_goals | int64 | 155744 | 0 | 0.0 | 2 |
| opponent_position | float64 | 107636 | 48108 | 30.9 | |
| opponent_manager_name | object | 154080 | 1664 | 1.1 | Frank de Boer |
| hosting | object | 155744 | 0 | 0.0 | Home |



TABLE: game_events (1,098,364 rows, 11 columns)


| column | dtype | non_null | null_count | null_pct | sample |
|---|---|---|---|---|---|
| game_event_id | object | 1098364 | 0 | 0.0 | 2f41da30c471492e7d4a984951671677 |
| date | object | 1098364 | 0 | 0.0 | 2012-08-05 |
| game_id | int64 | 1098364 | 0 | 0.0 | 2211607 |
| minute | int64 | 1098364 | 0 | 0.0 | 77 |
| type | object | 1098364 | 0 | 0.0 | Cards |
| club_id | int64 | 1098364 | 0 | 0.0 | 610 |
| club_name | object | 1098364 | 0 | 0.0 | Ajax Amsterdam |
| player_id | int64 | 1098364 | 0 | 0.0 | 4425 |
| description | object | 1009499 | 88865 | 8.1 | 1. Yellow card , Mass confrontation |
| player_in_id | float64 | 532495 | 565869 | 51.5 | |


## 3. Data Type Conversions

Parse date columns and ensure numeric columns have correct types.

In [4]:
# Date columns to parse per table
DATE_COLUMNS = {
    "games": ["date"],
    "appearances": ["date"],
    "players": ["date_of_birth", "contract_expiration_date"],
    "player_valuations": ["date"],
    "transfers": ["transfer_date"],
}

for table, cols in DATE_COLUMNS.items():
    if table in dataframes:
        for col in cols:
            if col in dataframes[table].columns:
                before_nulls = dataframes[table][col].isnull().sum()
                dataframes[table][col] = pd.to_datetime(
                    dataframes[table][col], errors="coerce"
                )
                after_nulls = dataframes[table][col].isnull().sum()
                new_nulls = after_nulls - before_nulls
                print(f"  {table}.{col}: converted to datetime"
                      f" ({new_nulls} unparseable values set to NaT)")

print("\nDate conversions complete.")

  games.date: converted to datetime (0 unparseable values set to NaT)
  appearances.date: converted to datetime (0 unparseable values set to NaT)
  players.date_of_birth: converted to datetime (0 unparseable values set to NaT)
  players.contract_expiration_date: converted to datetime (0 unparseable values set to NaT)
  player_valuations.date: converted to datetime (0 unparseable values set to NaT)
  transfers.transfer_date: converted to datetime (0 unparseable values set to NaT)

Date conversions complete.
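To make the "unparseable values set to NaT" counter concrete, here is a small sketch of how `errors="coerce"` behaves on a toy series. The sample values are invented for illustration; only the before/after null-count logic mirrors the cell above.

```python
import pandas as pd

# Toy input: two valid dates, one junk string, one value that is already missing.
raw = pd.Series(["2012-07-03", "not a date", None, "2000-01-20"])

before_nulls = raw.isnull().sum()                 # counts the pre-existing None
parsed = pd.to_datetime(raw, errors="coerce")     # junk becomes NaT instead of raising
after_nulls = parsed.isnull().sum()

# Only values that were present but unparseable count as "new" nulls.
new_nulls = int(after_nulls - before_nulls)
print(f"{new_nulls} unparseable value(s) set to NaT")
```

Subtracting the pre-existing nulls is what keeps the report honest: a column such as `player_valuations.date`, which already has missing entries, still reports 0 when every present value parses.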


In [5]:
# Ensure numeric columns are properly typed
NUMERIC_COLUMNS = {
    "players": ["market_value_in_eur", "highest_market_value_in_eur", "height_in_cm"],
    "player_valuations": ["market_value_in_eur"],
    "transfers": ["transfer_fee", "market_value_in_eur"],
    "clubs": ["total_market_value", "squad_size", "average_age"],
    "games": ["home_club_goals", "away_club_goals", "attendance"],
    "appearances": ["goals", "assists", "minutes_played", "yellow_cards", "red_cards"],
}

for table, cols in NUMERIC_COLUMNS.items():
    if table in dataframes:
        for col in cols:
            if col in dataframes[table].columns:
                dataframes[table][col] = pd.to_numeric(
                    dataframes[table][col], errors="coerce"
                )

print("Numeric type conversions complete.")

Numeric type conversions complete.
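Unlike the date cell, the numeric cell does not report how many values were coerced to NaN. A minimal sketch of the same before/after count for `pd.to_numeric` follows; the toy values are assumptions. It also shows that a coerced integer column ends up as `float64`, and that pandas' nullable `Int64` dtype is one way to get whole numbers back if that matters.

```python
import pandas as pd

# Toy input standing in for a column such as height_in_cm read as text.
raw = pd.Series(["180", "n/a", None, "175"])

converted = pd.to_numeric(raw, errors="coerce")   # "n/a" becomes NaN
newly_null = int(converted.isnull().sum() - raw.isnull().sum())

# NaN forces float64; the nullable Int64 dtype keeps integers alongside <NA>.
as_int = converted.astype("Int64")

print(f"dtype after coercion: {converted.dtype}, newly null: {newly_null}")
```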


## 4. Load into SQLite Database

Store all DataFrames in a SQLite database for SQL analysis in subsequent notebooks.

In [6]:
# Remove existing database to start fresh
if DB_PATH.exists():
    DB_PATH.unlink()
    print("Removed existing database.")

conn = sqlite3.connect(DB_PATH)

for name, df in dataframes.items():
    df.to_sql(name, conn, if_exists="replace", index=False)
    row_count = pd.read_sql_query(f"SELECT COUNT(*) as cnt FROM [{name}]", conn)
    print(f"  {name:.<30} {row_count['cnt'].iloc[0]:>10,} rows loaded")

print(f"\nDatabase created at: {DB_PATH}")
print(f"Database size: {DB_PATH.stat().st_size / 1024 / 1024:.1f} MB")

conn.close()

  appearances...................  1,725,531 rows loaded
  clubs.........................        451 rows loaded
  competitions..................         44 rows loaded
  games.........................     77,872 rows loaded
  players.......................     34,301 rows loaded
  player_valuations.............    454,064 rows loaded
  transfers.....................     85,404 rows loaded
  club_games....................    155,744 rows loaded
  game_events...................  1,098,364 rows loaded

Database created at: C:\Users\bvenn\OneDrive\Desktop\Python Projekte\Showcases\first_project\data\processed\football.db
Database size: 334.3 MB
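SQLite has no native datetime storage class, so the datetime columns written by `to_sql` come back as text unless they are parsed again on read (the `2000-01-20 00:00:00` strings in the verification output later in this notebook are an instance of this). The sketch below illustrates the round trip on an in-memory database with invented rows; it is not the notebook's own loading code.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "player_id": [10, 11],
    "date": pd.to_datetime(["2000-01-20", "2012-07-03"]),
})

conn = sqlite3.connect(":memory:")
df.to_sql("player_valuations", conn, if_exists="replace", index=False)

# Raw SQL sees the dates as plain strings.
stored = conn.execute(
    "SELECT date FROM player_valuations ORDER BY player_id"
).fetchall()

# parse_dates restores the datetime dtype when reading back into pandas.
roundtrip = pd.read_sql_query(
    "SELECT * FROM player_valuations ORDER BY player_id",
    conn,
    parse_dates=["date"],
)
conn.close()
```

Because the text is ISO-formatted, string comparison in SQL (`WHERE date >= '2020-01-01'`) still sorts chronologically.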


## 5. Create Indexes

Add indexes for frequently queried columns to improve query performance.

In [7]:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

indexes = [
    # Player valuations -- heavily queried
    "CREATE INDEX IF NOT EXISTS idx_pv_player_id ON player_valuations(player_id)",
    "CREATE INDEX IF NOT EXISTS idx_pv_date ON player_valuations(date)",
    "CREATE INDEX IF NOT EXISTS idx_pv_club_comp ON player_valuations(player_club_domestic_competition_id)",
    
    # Appearances
    "CREATE INDEX IF NOT EXISTS idx_app_player_id ON appearances(player_id)",
    "CREATE INDEX IF NOT EXISTS idx_app_game_id ON appearances(game_id)",
    
    # Games
    "CREATE INDEX IF NOT EXISTS idx_games_season ON games(season)",
    "CREATE INDEX IF NOT EXISTS idx_games_competition ON games(competition_id)",
    "CREATE INDEX IF NOT EXISTS idx_games_date ON games(date)",
    
    # Transfers
    "CREATE INDEX IF NOT EXISTS idx_transfers_player ON transfers(player_id)",
    "CREATE INDEX IF NOT EXISTS idx_transfers_date ON transfers(transfer_date)",
    "CREATE INDEX IF NOT EXISTS idx_transfers_to_club ON transfers(to_club_id)",
    "CREATE INDEX IF NOT EXISTS idx_transfers_from_club ON transfers(from_club_id)",
    
    # Players
    "CREATE INDEX IF NOT EXISTS idx_players_club ON players(current_club_id)",
    "CREATE INDEX IF NOT EXISTS idx_players_position ON players(position)",
    
    # Clubs
    "CREATE INDEX IF NOT EXISTS idx_clubs_competition ON clubs(domestic_competition_id)",
    
    # Club games
    "CREATE INDEX IF NOT EXISTS idx_cg_club ON club_games(club_id)",
    "CREATE INDEX IF NOT EXISTS idx_cg_game ON club_games(game_id)",
    
    # Game events
    "CREATE INDEX IF NOT EXISTS idx_ge_game ON game_events(game_id)",
    "CREATE INDEX IF NOT EXISTS idx_ge_player ON game_events(player_id)",
]

for idx_sql in indexes:
    cursor.execute(idx_sql)
    idx_name = idx_sql.split("EXISTS ")[1].split(" ON")[0]
    print(f"  Created index: {idx_name}")

conn.commit()
conn.close()

print(f"\n{len(indexes)} indexes created.")
print(f"Database size after indexing: {DB_PATH.stat().st_size / 1024 / 1024:.1f} MB")

  Created index: idx_pv_player_id
  Created index: idx_pv_date
  Created index: idx_pv_club_comp
  Created index: idx_app_player_id
  Created index: idx_app_game_id
  Created index: idx_games_season
  Created index: idx_games_competition
  Created index: idx_games_date
  Created index: idx_transfers_player
  Created index: idx_transfers_date
  Created index: idx_transfers_to_club
  Created index: idx_transfers_from_club
  Created index: idx_players_club
  Created index: idx_players_position
  Created index: idx_clubs_competition
  Created index: idx_cg_club
  Created index: idx_cg_game
  Created index: idx_ge_game
  Created index: idx_ge_player

19 indexes created.
Database size after indexing: 434.1 MB
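Creating an index does not by itself prove that queries use it. One way to check is SQLite's `EXPLAIN QUERY PLAN`. The sketch below runs it against a tiny in-memory table that borrows the `player_valuations` name and the `idx_pv_player_id` index from above; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player_valuations "
    "(player_id INTEGER, date TEXT, market_value_in_eur INTEGER)"
)
conn.executemany(
    "INSERT INTO player_valuations VALUES (?, ?, ?)",
    [(10, "2000-01-20", 150000), (11, "2012-07-03", 200000)],
)
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_pv_player_id ON player_valuations(player_id)"
)

# Each plan row ends with a human-readable detail string.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM player_valuations WHERE player_id = ?",
    (10,),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
conn.close()

print(plan_text)  # the detail names the index when SQLite chooses it
```

The exact wording of the plan differs between SQLite versions, so checking for the index name is more robust than matching the whole string.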


## 6. Verification

Run basic integrity checks to ensure data was loaded correctly.

In [8]:
# Make the project root importable so the helper module can be found
sys.path.insert(0, str(PROJECT_ROOT))
from notebooks.utils.db_helpers import get_connection, run_query, table_info

# Show all tables with row counts
print("Database Tables:")
print("=" * 40)
display(table_info())

Database Tables:


| | table | rows |
|---|---|---|
| 0 | appearances | 1725531 |
| 1 | club_games | 155744 |
| 2 | clubs | 451 |
| 3 | competitions | 44 |
| 4 | game_events | 1098364 |
| 5 | games | 77872 |
| 6 | player_valuations | 454064 |
| 7 | players | 34301 |
| 8 | transfers | 85404 |
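The `db_helpers` module is not shown in this notebook. Purely as an illustration of what such helpers could look like, here is a hypothetical sketch: the function bodies, the `db_path` parameter, and the `DEFAULT_DB` constant are all assumptions, and the real module may differ.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical default; the real module presumably points at the project's football.db.
DEFAULT_DB = Path("data") / "processed" / "football.db"


def get_connection(db_path=DEFAULT_DB):
    """Open a SQLite connection (hypothetical signature)."""
    return sqlite3.connect(db_path)


def run_query(query, db_path=DEFAULT_DB):
    """Run a SQL query and return the result as a DataFrame."""
    conn = get_connection(db_path)
    try:
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


def table_info(db_path=DEFAULT_DB):
    """List every table with its row count, ordered by table name."""
    tables = run_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", db_path
    )
    rows = [
        {"table": t, "rows": int(run_query(f'SELECT COUNT(*) AS n FROM "{t}"', db_path)["n"].iloc[0])}
        for t in tables["name"]
    ]
    return pd.DataFrame(rows, columns=["table", "rows"])


# Demo against a throwaway database rather than the real one.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.db"
    conn = get_connection(demo_db)
    conn.execute("CREATE TABLE clubs (club_id INTEGER)")
    conn.execute("CREATE TABLE games (game_id INTEGER)")
    conn.executemany("INSERT INTO clubs VALUES (?)", [(10,), (383,)])
    conn.commit()
    conn.close()

    info = table_info(demo_db)
```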


In [9]:
# Quick sanity checks
checks = [
    ("Top 5 leagues exist", """
        SELECT competition_id, name 
        FROM competitions 
        WHERE competition_id IN ('GB1', 'ES1', 'IT1', 'L1', 'FR1')
        ORDER BY name
    """),
    ("Players have market values", """
        SELECT 
            COUNT(*) as total_players,
            SUM(CASE WHEN market_value_in_eur IS NOT NULL THEN 1 ELSE 0 END) as with_value,
            ROUND(AVG(market_value_in_eur), 0) as avg_value
        FROM players
    """),
    ("Transfers have fees", """
        SELECT 
            COUNT(*) as total_transfers,
            SUM(CASE WHEN transfer_fee > 0 THEN 1 ELSE 0 END) as with_fee,
            ROUND(MAX(transfer_fee), 0) as max_fee
        FROM transfers
    """),
    ("Valuation date range", """
        SELECT 
            MIN(date) as earliest,
            MAX(date) as latest,
            COUNT(DISTINCT player_id) as unique_players
        FROM player_valuations
    """),
]

for title, query in checks:
    print(f"\n{title}:")
    display(run_query(query))

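# Note: this message is printed unconditionally. The query results above are
# reviewed by eye; nothing in this cell asserts on them.
print("\n All checks passed. Database is ready for analysis.")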


Top 5 leagues exist:


| | competition_id | name |
|---|---|---|
| 0 | L1 | bundesliga |
| 1 | ES1 | laliga |
| 2 | FR1 | ligue-1 |
| 3 | GB1 | premier-league |
| 4 | IT1 | serie-a |



Players have market values:


| | total_players | with_value | avg_value |
|---|---|---|---|
| 0 | 34301 | 31660 | 1658334.0 |



Transfers have fees:


| | total_transfers | with_fee | max_fee |
|---|---|---|---|
| 0 | 85404 | 10594 | 180000000.0 |



Valuation date range:


| | earliest | latest | unique_players |
|---|---|---|---|
| 0 | 2000-01-20 00:00:00 | 2026-02-14 00:00:00 | 31660 |



 All checks passed. Database is ready for analysis.
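The checks above display query results but never fail. If an automated gate is wanted, one option is to compare row counts in the database against the counts expected from the DataFrames and collect any mismatches. The following is a minimal sketch on an in-memory database; `check_row_counts` and the toy tables are hypothetical, and in the notebook the expected counts would come from `len(df)` for each loaded table.

```python
import sqlite3


def check_row_counts(conn, expected):
    """Return a list of human-readable mismatches (empty list means all good)."""
    failures = []
    for table, want in expected.items():
        got = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if got != want:
            failures.append(f"{table}: expected {want} rows, found {got}")
    return failures


# Toy database standing in for football.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clubs (club_id INTEGER)")
conn.execute("CREATE TABLE games (game_id INTEGER)")
conn.executemany("INSERT INTO clubs VALUES (?)", [(10,), (383,)])
conn.execute("INSERT INTO games VALUES (2211607)")

failures = check_row_counts(conn, {"clubs": 2, "games": 1})   # matches the data
bad = check_row_counts(conn, {"clubs": 3})                     # deliberately wrong
conn.close()

# A real gate would stop the notebook here if anything mismatched.
assert not failures, failures
```

Returning a list rather than raising on the first mismatch means a single run reports every table that is off.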
