# Phase II: Data Curation, Exploratory Analysis and Plotting
## What makes someone good at chess?

- Sandeep Salwan
- Andrew Fielding
- QuangVinh Tran

## Project Goal:

Chess is a global strategy game played by people of all ages, with countless openings, tactics, and playstyles that influence the outcome of a match. Many dedicated players home in on these aspects while reviewing past games or studying professionals in search of improvement. But what if the strategies they adopt are not the best fit for their rank or playstyle? Using the wealth of data available through the Lichess API and the Stockfish engine, this project seeks to measure the effectiveness of various openings, playstyles (such as aggressive vs. passive), and common tactics at different ranks, and to give players of all skill levels applicable information for improving their game. Additionally, online platforms such as chess.com and Lichess match players against opponents of similar rating; however, not every match is evenly paired, which can give the higher-rated player a skill advantage. This study explores two key questions:
1. Which chess openings and playstyles lead to higher win percentages across different rank levels?
2. How does the rating differential impact the outcome of a match?
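
As a baseline for the second question, the classic Elo expected-score formula gives a rough prediction of how a rating gap should translate into results. This is only a sketch: Lichess rates players with Glicko-2 rather than Elo, so the 400-point scale below is an assumption used for illustration, and `expected_score` is a hypothetical helper name.

```python
def expected_score(rating_a, rating_b):
    """Elo expected score for player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

even = expected_score(2000, 2000)      # equal ratings
favorite = expected_score(2200, 2000)  # 200-point advantage for player A
print(f"even: {even:.2f}, 200-point favorite: {favorite:.2f}")
```

Comparing observed win rates against this curve would show whether higher-rated players over- or under-perform the naive expectation.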


## Pipeline Overview:

**1. Data Collection** - Fetch player data from the **Lichess API**: usernames, game details, ratings, moves, and openings.

**2. Data Processing** - Clean up the raw data: extract openings, check game outcomes, calculate rating differences, and handle missing info.

**3. Data Analysis** - Look at win rates: figure out which **openings** and **playstyles** work best at different ranks, and how **rating gaps** affect match results.

**4. Data Export** - Save the cleaned data into a **CSV** for easy analysis and reporting.
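
The analysis step can be sketched in plain Python as a win-rate table keyed by rating band and opening. The 200-point band width, the helper names, and the sample games are all assumptions made for illustration; the field names match the columns collected later in the pipeline.

```python
from collections import defaultdict

def rating_band(rating, width=200):
    """Bucket a rating into a band such as '2400-2599' (band width is an assumption)."""
    low = (rating // width) * width
    return f"{low}-{low + width - 1}"

def white_win_rate_by_opening(games):
    """Return {(band, opening): share of games won by White}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for g in games:
        key = (rating_band(g["white_rating"]), g["opening"])
        totals[key] += 1
        if g["winner"] == "white":
            wins[key] += 1
    return {key: wins[key] / totals[key] for key in totals}

# Made-up games, for illustration only.
games = [
    {"white_rating": 2450, "opening": "Sicilian Defense", "winner": "white"},
    {"white_rating": 2510, "opening": "Sicilian Defense", "winner": "black"},
    {"white_rating": 2580, "opening": "Sicilian Defense", "winner": "white"},
    {"white_rating": 2210, "opening": "Italian Game", "winner": "draw"},
]
rates = white_win_rate_by_opening(games)
```

Bands with only a handful of games produce noisy rates, so a minimum game count per cell is probably worth enforcing before drawing conclusions.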


### Pipeline:

#### 0. Imports

In [None]:
# Install required libraries
%pip install berserk pandas chess stockfish matplotlib seaborn

In [3]:
import time  # needed for time.sleep in the fetch helpers below
from io import StringIO

import berserk
import chess
import chess.engine
import chess.pgn
import chess.polyglot
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from stockfish import Stockfish

from lichess_secret import token  # local module holding the API token; keep it out of version control

session = berserk.TokenSession(token)
client = berserk.Client(session=session)


In [None]:
# ---------------------------
# Step 1: Set Token
# ---------------------------

# token is imported from lichess_secret in the imports cell; never paste a real token into the notebook

if not token:
    raise ValueError("No token.")


#### 2. Query generation API Code

In [None]:
# ---------------------------
# Step 2: Init Client
# ---------------------------

session = berserk.TokenSession(token)
client = berserk.Client(session=session)


In [None]:
# ---------------------------
# Step 3: Fetch Players
# ---------------------------

def get_top_players(perf='blitz', cnt=100):
    try:
        lb = client.users.get_leaderboard(perf_type=perf, count=cnt)  # leaderboard
        return [user['username'] for user in lb]  # names
    except berserk.exceptions.ResponseError as e:
        print(f"Error: {e}")
        return []  # return empty if error

# perf types to fetch (Lichess names the slow time control 'classical')
perf_types = {'blitz': 100, 'bullet': 100, 'rapid': 100, 'classical': 100}

# collect all usernames
all_users = set()
for perf, cnt in perf_types.items():
    users = get_top_players(perf, cnt)
    all_users.update(users)
    print(f"Fetched {len(users)} {perf}")

print(f"Total: {len(all_users)} users")


In [None]:
# ---------------------------
# Step 4: Helper Functions
# ---------------------------

# retry fetching games
def get_games(username, max_games, retries=3, delay=5):
    for attempt in range(retries):
        try:
            games = client.games.export_by_player(
                username, max=max_games, perf_type='classical',
                moves=True, opening=True)
            return list(games)  # success
        except berserk.exceptions.ResponseError as e:
            if e.status_code == 429:  # rate limit hit: Lichess asks clients to wait a full minute
                print("Rate limited. Waiting 60 sec...")
                time.sleep(60)
            else:
                print(f"Error: {e}. Retry {attempt + 1}")
                time.sleep(delay)
        except Exception as e:
            print(f"Error: {e}. Retry {attempt + 1}")
            time.sleep(delay)
    return []  # fail

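# fallback opening label when the API response has no opening name
# NOTE: python-chess has no opening-name lookup (chess.polyglot only reads
# opening books), and Lichess exports moves in SAN rather than UCI.
# As a stand-in, label the game by its first few legal moves.
def get_opening(moves, plies=4):
    board = chess.Board()
    played = []
    for san in moves.split()[:plies]:
        try:
            board.push_san(san)  # raises ValueError on illegal or unparseable SAN
        except ValueError:
            break
        played.append(san)
    if not played:
        return "Unknown"
    return f"Unnamed ({' '.join(played)})"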
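
The retry logic above waits a fixed delay between attempts. A common alternative is exponential backoff, sketched here with the standard library only. `retry_with_backoff` and `flaky` are hypothetical names, and unlike `get_games` this version re-raises the last error instead of returning an empty list.

```python
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, retries=3, base_delay=1.0, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.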


In [None]:
# ---------------------------
# Step 5: Fetch & Process Games
# ---------------------------

max_games = 100
sleep_time = 1  # seconds to wait

all_data = []  # store game data
print("Fetching games...")

for idx, user in enumerate(all_users):
    print(f"User {idx + 1}: {user}")
    games = get_games(user, max_games)
    print(f"Got {len(games)} games.")

    for game in games:
        try:
            white = game['players']['white'].get('user', {}).get('name', 'Anon')
            white_rating = game['players']['white'].get('rating')
            black = game['players']['black'].get('user', {}).get('name', 'Anon')
            black_rating = game['players']['black'].get('rating')

            game_id = game['id']
            link = f"https://lichess.org/{game_id}"

            opening = game.get('opening', {}).get('name', 'Unknown')
            moves = game.get('moves', '')
            if opening == 'Unknown':
                opening = get_opening(moves)

            winner = game.get('winner', 'draw')
            move_count = len(moves.split())

            all_data.append({
                'game_id': game_id,
                'white': white,
                'white_rating': white_rating,
                'black': black,
                'black_rating': black_rating,
                'opening': opening,
                'winner': winner,
                'moves': move_count,
                'link': link
            })
        except Exception as e:
            print(f"Error in {game.get('id', 'Unknown')}: {e}")

    time.sleep(sleep_time)  # respect limits

print(f"Total games: {len(all_data)}")
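
Each stored row records the winner by color, but the games are fetched per player, so win rates for a given player need the result from that player's side. A minimal sketch, assuming rows shaped like the dictionaries appended to `all_data`; `result_for` is a hypothetical helper and the sample rows are made up.

```python
def result_for(game, username):
    """Return 'win', 'loss', or 'draw' for username, or None if they did not play."""
    name = username.lower()  # Lichess usernames are case-insensitive
    if name == game["white"].lower():
        color = "white"
    elif name == game["black"].lower():
        color = "black"
    else:
        return None
    if game["winner"] == "draw":
        return "draw"
    return "win" if game["winner"] == color else "loss"

# Made-up rows, for illustration only.
decisive = {"white": "Alice", "black": "Bob", "winner": "white"}
drawn = {"white": "Alice", "black": "Bob", "winner": "draw"}
outcomes = [result_for(decisive, "alice"), result_for(decisive, "Bob"), result_for(drawn, "Bob")]
```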


In [None]:
# ---------------------------
# Step 6: Create DataFrame
# ---------------------------

df = pd.DataFrame(all_data)
print(df.head())  # show first rows
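
The pipeline overview mentions rating differences and missing-value handling, which the cells above do not yet compute. One possible way to add them with pandas is sketched below. The column names match the DataFrame built above, but the sample values, the `rating_diff` and `white_score` columns, and the choice to drop incomplete rows are assumptions.

```python
import pandas as pd

# Toy rows with the same columns as the pipeline's DataFrame (values are made up).
sample = pd.DataFrame({
    "white_rating": [2500, 2300, None, 2650],
    "black_rating": [2450, 2400, 2200, 2600],
    "winner": ["white", "black", "draw", "white"],
})

# Drop games where either rating is missing, then compute the gap from White's side.
clean = sample.dropna(subset=["white_rating", "black_rating"]).copy()
clean["rating_diff"] = clean["white_rating"] - clean["black_rating"]

# Score from White's perspective: 1 for a win, 0.5 for a draw, 0 for a loss.
clean["white_score"] = clean["winner"].map({"white": 1.0, "draw": 0.5, "black": 0.0})
print(clean)
```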


In [None]:
# ---------------------------
# Step 7: Export CSV
# ---------------------------

df.to_csv('lichess_games_data.csv', index=False)
print("Exported to CSV.")


## Visualizations:

## Analysis/ML Plan:

The data visualization shows a bias toward Walmart-sponsored items on the search page: sponsored items appear before non-sponsored ones and at the beginning of each page, while non-sponsored items of similar prices are pushed to later pages. This is a marketing strategy by Walmart to push customers to buy more from its brand; however, the data also showed that Walmart-sponsored products are rated more highly by customers. Further analysis of product quality could be done by examining customers' ratings more closely.

As for the ML plan, we plan to use logistic regression as a binary classifier so that we can isolate each feature and compare the magnitudes of the fitted coefficients to analyze each feature's effect.
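
Applied to the chess data, a binary classifier of this kind could predict whether White wins from the rating gap. The sketch below fits a one-feature logistic regression with plain gradient descent so that it needs no extra libraries; the data points, learning rate, and epoch count are made up for illustration, and a library implementation would likely be used in practice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data: rating gap in hundreds of points (White minus Black), 1 if White won.
rating_gap = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
white_won = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(rating_gap, white_won)
p_favorite = sigmoid(w * 2.0 + b)   # White rated 200 points higher
p_underdog = sigmoid(w * -2.0 + b)  # White rated 200 points lower
```

The sign and size of `w` indicate how strongly the rating gap shifts the odds of a White win; with several features, the coefficients are only comparable once the features are on a common scale.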

Polynomial linear regression