# Data collection 

## Pipeline Overview:

**1. Data Collection** - Fetch player data from the **Lichess API**: usernames, game details, ratings, moves, and openings.

**2. Data Processing** - Clean up the raw data: extract openings, check game outcomes, calculate rating differences, and handle missing info.

**3. Data Export** - Save the cleaned data into a **CSV** for easy manipulation, analysis, and reporting.

**4. Data Analysis** - Reshape the data as needed, building dataframes for the different visualizations. Examine win rates to see which **openings** and **playstyles** work best for each color at different ranks, how **rating gaps** affect match results, which **color** wins more often, and how **move count** (game length) relates to the other details of a game.

### Data Collection:

#### 0. Imports

In [None]:
!nvidia-smi


Fri Oct 25 17:35:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# Install required libraries (numpy and plotly are imported below, so they are listed too)
%pip install berserk pandas numpy chess stockfish matplotlib seaborn plotly



In [None]:
import time
import warnings

import berserk
import chess
import chess.polyglot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns

# The token and the API client are set up in Steps 1 and 2 below.
warnings.filterwarnings('ignore')

In [None]:
# ---------------------------
# Step 1: Set Token
# ---------------------------

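import os

# Lichess API token: paste it here, or set it in the environment (e.g. loaded from a .env file).
# LICHESS_TOKEN is an assumed variable name; use whatever your setup defines.
token = os.getenv("LICHESS_TOKEN", "")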

if not token:
    raise ValueError("No token.")


#### 2. API Query Code

In [None]:
# ---------------------------
# Step 2: Init Client
# ---------------------------

session = berserk.TokenSession(token)
client = berserk.Client(session=session)


In [None]:
# ---------------------------
# Step 3: Fetch Players
# ---------------------------

def get_top_players(perf='blitz', cnt=100):
    try:
        lb = client.users.get_leaderboard(perf_type=perf, count=cnt)  # leaderboard
        return [user['username'] for user in lb]  # names
    except berserk.exceptions.ResponseError as e:
        print(f"Error: {e}")
        return []  # return empty if error

# perf types to fetch
perf_types = {'blitz': 200, 'bullet': 200, 'rapid': 200, 'classical': 200}

# collect all usernames
all_users = set()
for perf, cnt in perf_types.items():
    users = get_top_players(perf, cnt)
    all_users.update(users)
    print(f"Fetched {len(users)} {perf}")

print(f"Total: {len(all_users)} users")


Fetched 200 blitz
Fetched 200 bullet
Fetched 200 rapid
Fetched 200 classical
Total: 752 users
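
The four leaderboards return 800 names in total but only 752 are unique, so 48 entries are repeats: some players rank in more than one time control. A minimal sketch of how that overlap could be counted, using made-up usernames rather than the real leaderboard data:

```python
from collections import Counter

# Hypothetical leaderboards; the real ones come from get_top_players().
leaderboards = {
    "blitz": ["alice", "bob", "carol"],
    "bullet": ["bob", "dave", "alice"],
    "rapid": ["erin", "carol", "frank"],
}

# Count how many leaderboards each username appears on.
appearances = Counter(
    name for names in leaderboards.values() for name in set(names)
)

total_entries = sum(len(names) for names in leaderboards.values())
unique_users = len(appearances)
repeated = {name: n for name, n in appearances.items() if n > 1}

print(f"{total_entries} entries, {unique_users} unique users")
print(f"On more than one leaderboard: {sorted(repeated)}")
```

Collecting usernames in a `set`, as the cell above does, removes these repeats so each player's games are fetched only once.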


In [None]:
# ---------------------------
# Step 4: Helper Functions
# ---------------------------

# fetch a player's classical games, retrying on errors
def get_games(username, max_games, retries=3, delay=5):
    for attempt in range(retries):
        try:
            games = client.games.export_by_player(
                username,
                max=max_games,
                perf_type='classical',  # only classical games are collected
                moves=True,
                pgn_in_json=False,
                clocks=False,
                evals=False,
                opening=True,
                as_pgn=False
            )
            return list(games)  # success; the response is consumed here, inside the try
        except berserk.exceptions.ResponseError as e:
            if e.status_code == 429:  # rate limit hit; Lichess asks clients to wait a full minute
                print("Rate limited. Waiting 60 sec...")
                time.sleep(60)
            else:
                print(f"Error: {e}. Retry {attempt + 1}")
                time.sleep(delay)
        except Exception as e:
            print(f"Error: {e}. Retry {attempt + 1}")
            time.sleep(delay)
    return []  # all attempts failed

# Optional lookup table: position (EPD string) -> (ECO code, opening name).
# python-chess does not ship an opening-name database (chess.polyglot has no
# opening_name function), so this table must be filled from an external source.
# OPENING_BOOK is a name introduced here; it is empty unless you populate it.
OPENING_BOOK = {}

# determine opening
def get_opening(moves, book=None):
    book = OPENING_BOOK if book is None else book
    board = chess.Board()
    opening_name = 'Unknown'
    eco_code = 'Unknown'

    for move_san in moves.split()[:20]:  # limit to the first 20 half-moves
        try:
            board.push_san(move_san)  # Lichess returns moves in SAN (e.g. "e4"), not UCI
        except ValueError as e:
            print(f"Error parsing move {move_san}: {e}")
            break  # invalid or illegal move
        match = book.get(board.epd())
        if match:
            eco_code, opening_name = match  # keep the deepest named position

    opening = f"{eco_code}: {opening_name}"
    return opening
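
The retry loop in `get_games` waits a fixed delay between attempts. A common alternative is exponential backoff, where each wait is twice the previous one. The sketch below is a generic illustration: `call_with_backoff` and `flaky` are hypothetical names that are not part of berserk, and the `sleep` parameter exists only so the example can run without actually pausing.

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again.

    Hypothetical helper, not part of berserk. Returns fn()'s result, or
    re-raises the last exception once all attempts are used up.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == retries - 1:
                raise  # out of attempts
            wait = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); waiting {wait:.0f}s")
            sleep(wait)

# Usage with a stand-in for a flaky API call that succeeds on the third try.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return ["game-1", "game-2"]

waits = []  # record the waits instead of sleeping
result = call_with_backoff(flaky, retries=4, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

For a real request, `fn` could be a `lambda` wrapping the `client.games.export_by_player(...)` call, with `sleep` left at its default.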


In [None]:
# ---------------------------
# Step 5: Fetch & Process Games
# ---------------------------

max_games = 10000
sleep_time = 1  # seconds to wait

all_data = []  # store game data
print("Fetching games...")

for idx, user in enumerate(all_users):
    print(f"User {idx + 1}: {user}")
    games = get_games(user, max_games)
    print(f"Got {len(games)} games.")

    for game in games:
        try:
            white = game['players']['white'].get('user', {}).get('name', 'Anon')
            white_rating = game['players']['white'].get('rating')
            black = game['players']['black'].get('user', {}).get('name', 'Anon')
            black_rating = game['players']['black'].get('rating')

            game_id = game['id']
            link = f"https://lichess.org/{game_id}"

            opening = game.get('opening', {}).get('name', 'Unknown')
            moves = game.get('moves', '')
            if opening == 'Unknown':
                opening = get_opening(moves)

            winner = game.get('winner', 'draw')  # no 'winner' key means no decisive result
            move_count = len(moves.split())  # counts half-moves (plies), not full moves

            all_data.append({
                'game_id': game_id,
                'white': white,
                'white_rating': white_rating,
                'black': black,
                'black_rating': black_rating,
                'opening': opening,
                'winner': winner,
                'moves': move_count,
                'link': link
            })
        except Exception as e:
            print(f"Error in {game.get('id', 'Unknown')}: {e}")

    time.sleep(sleep_time)  # respect limits

print(f"Total games: {len(all_data)}")


Fetching games...
User 1: zwenna
Got 156 games.
User 2: Erebuni91
Got 1 games.
User 3: qwl8
Got 85 games.
User 4: Mishka_The_Great
Got 2 games.
User 5: Zombieblitz
Got 45 games.
User 6: SanthoshAyyappan
Got 397 games.
User 7: Nikolayi10
Got 544 games.
User 8: joddle
Got 0 games.
User 9: NeSuBaMu
Got 317 games.
User 10: gmluke
Got 0 games.
User 11: Tyoma888
Got 16 games.
User 12: Wu_Kong
Got 4 games.
User 13: abudabi22840
Got 0 games.
User 14: chuckbranson
Got 3330 games.
User 15: Superferz7
Got 0 games.
User 16: CryptoPanda
Got 442 games.
User 17: igormezentsev
Got 786 games.
Error parsing move e4: expected uci string to be of length 4 or 5: 'e4'
User 18: zweidreizehen
Got 0 games.
User 19: AlvaroVargasD
Got 1158 games.
User 20: defenestration30
Got 147 games.
User 21: apollothedog
Got 0 games.
User 22: chessed70
Got 11 games.
User 23: terrapin
Got 1273 games.
User 24: KR228
Got 20 games.
User 25: Andrey_Esipenko
Got 0 games.
User 26: jambocomeon
Got 471 games.
Error parsing move Be7: 
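
The `Error parsing move e4: expected uci string to be of length 4 or 5: 'e4'` message in the log suggests that the `moves` field holds standard algebraic notation (SAN, e.g. `e4`, `Be7`) rather than UCI (`e2e4`), so `chess.Move.from_uci` rejects it. A small sketch of the difference, using python-chess's `Board.parse_san`; the move list is an illustrative example, not taken from the dataset:

```python
import chess

san_moves = "e4 e5 Nf3 Nc6 Bb5".split()  # illustrative Ruy Lopez moves in SAN

# SAN needs a board for context: "Nf3" only makes sense in a given position.
board = chess.Board()
uci_moves = []
for san in san_moves:
    move = board.parse_san(san)  # raises ValueError on illegal or malformed SAN
    uci_moves.append(move.uci())
    board.push(move)

print(uci_moves)

# Feeding SAN to the UCI parser fails, which matches the error in the log.
try:
    chess.Move.from_uci("e4")
    uci_error = None
except ValueError as exc:
    uci_error = str(exc)
print(uci_error)
```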


In [None]:
# ---------------------------
# Step 6: Create DataFrame
# ---------------------------
# After creating the DataFrame, drop games with unknown openings
df = pd.DataFrame(all_data)

df_cleaned = df[df['opening'] != 'Unknown: Unknown']
print(f"Dropped {df.shape[0] - df_cleaned.shape[0]} games with unknown openings.")
df_cleaned.to_csv('lichess_games_data.csv', index=False)
print("Exported cleaned data to CSV.")

print(df.head())  # show first rows


Dropped 1818 games with unknown openings.
Exported cleaned data to CSV.
    game_id      white  white_rating       black  black_rating  \
0  hN05at52    aapp_61        1678.0      zwenna        2442.0   
1  ttu4bd63     zwenna        2437.0     Aidas08        2136.0   
2  Qwoly8kd  Ugjgjgjvj        2058.0      zwenna        2434.0   
3  eMJwQKDV     zwenna        2433.0  Narcisse29        1820.0   
4  8AFBBfka     zwenna        2433.0  Bracho2013        1089.0   

                                             opening winner  moves  \
0                   Sicilian Defense: Bowdler Attack  black     54   
1  Sicilian Defense: Najdorf Variation, Zagreb Va...  white     77   
2  Italian Game: Two Knights Defense, Modern Bish...  black     44   
3  French Defense: Tarrasch Variation, Pawn Cente...  white     21   
4                          Ruy Lopez: Berlin Defense  white     57   

                           link  
0  https://lichess.org/hN05at52  
1  https://lichess.org/ttu4bd63  
2  https

In [None]:
# ---------------------------
# Step 7: Export CSV
# ---------------------------

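# Export the cleaned frame; writing `df` here would overwrite the filtered
# file from Step 6 with the unfiltered data.
df_cleaned.to_csv('lichess_games_data.csv', index=False)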
print("Exported to CSV.")


Exported to CSV.


### Data processing:

In [None]:
df = pd.read_csv("lichess_games_data.csv")
df.head(5)
print(len(df))

164589
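
With the CSV loaded, two of the questions from the pipeline overview (rating gaps and results by color) can be approached with a few lines of pandas. The sketch below uses a small made-up frame with the same column names as the export (`white_rating`, `black_rating`, `winner`); `rating_diff` is a column name introduced here for illustration, and the rows are not real games.

```python
import pandas as pd

# Made-up rows that mimic the exported columns; not real games.
sample = pd.DataFrame({
    "white_rating": [2400.0, 2100.0, None, 1900.0],
    "black_rating": [2300.0, 2250.0, 2000.0, 1900.0],
    "winner": ["white", "black", "draw", "white"],
})

# Handle missing info: a rating gap is undefined if either rating is absent.
rated = sample.dropna(subset=["white_rating", "black_rating"]).copy()

# Positive values mean White is the higher-rated player.
rated["rating_diff"] = rated["white_rating"] - rated["black_rating"]

# Share of each result among the remaining games.
result_share = rated["winner"].value_counts(normalize=True)

print(rated[["rating_diff", "winner"]])
print(result_share)
```

The same two steps should carry over to the full `df`, since it uses the same column names.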


## Result: a repository of 160k+ games.
