# League of Legends Ranked Solo/Duo EDA

## Environment Setup

In [17]:
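# Load the cuDF accelerator before importing pandas; it only takes effect for imports that follow it.
%load_ext cudf.pandas

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns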

data_dir = "data"
file_path = os.path.join(data_dir, "league.feather")

DVC (Data Version Control) will be used to version the data as the analysis progresses. Some questions require only a subset of the data, and working with that subset makes queries more efficient. With a dataset this large, however, there needs to be a way to track how those versions differ.

See install instructions on the [DVC website](https://dvc.org/doc/install).
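
To illustrate how a versioning tool can tell two versions of a data file apart, here is a minimal sketch of content hashing. DVC records a content hash for each tracked file in its `.dvc` metadata; the helper names `file_md5` and `has_changed` below are hypothetical and are not part of DVC's API.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_digest):
    """Return True when the file's current content hash differs from a recorded one."""
    return file_md5(path) != recorded_digest
```

Reading in chunks matters here because the Feather file holds millions of rows and should not have to be loaded into memory just to be fingerprinted.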

GPU usage check

In [4]:
!nvidia-smi

Wed Mar 26 11:09:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.17              Driver Version: 572.47         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P8             19W /  220W |   10781MiB /  12282MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
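
The same check can be done from Python, which is handy for deciding whether GPU acceleration is worth enabling. The sketch below shells out to `nvidia-smi` with its CSV query flags; the function names `parse_gpu_rows` and `query_gpu_memory` are hypothetical, and the code assumes one GPU per output line.

```python
import shutil
import subprocess


def parse_gpu_rows(text):
    """Parse 'name, used, total' CSV lines into (name, used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = [part.strip() for part in line.rsplit(",", 2)]
        rows.append((name, int(used), int(total)))
    return rows


def query_gpu_memory():
    """Return per-GPU memory usage in MiB, or an empty list if no GPU can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    try:
        return parse_gpu_rows(result.stdout)
    except ValueError:
        # Some devices report non-numeric values; treat that as "unknown".
        return []
```

In the output above, 10781 MiB of 12282 MiB are already in use, which leaves roughly 1.5 GiB free for GPU-backed DataFrames.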
                                                

Save the file locally after downloading it from the Hugging Face Hub.

In [18]:
from datasets import load_dataset

os.makedirs(data_dir, exist_ok=True)

if not os.path.exists(file_path):
    # Download the dataset from the Hugging Face Hub and cache it as a Feather file.
    data = load_dataset("renecotyfanboy/leagueData")
    df = data["train"].to_pandas()
    df.to_feather(file_path)

    print(f"Dataset saved to {file_path}")
else:
    print(f"Dataset already exists at {file_path}")

Dataset already exists at data/league.feather


The data can now be tracked by DVC using `dvc add data/league.feather`.

In [None]:
df = pd.read_feather(file_path)

## Initial EDA

The data consists of 222 columns, the first of which is an index. There are 2,997,254 rows, each corresponding to a single game from a single player.

In [None]:
df.describe()

| | Unnamed: 0 | gameStartTimestamp | gameDuration | detectorWardsPlaced | baronKills | visionClearedPings | turretKills | damageDealtToBuildings | consumablesPurchased | totalHeal | ... | controlWardsPlaced | teamBaronKills | stealthWardsPlaced | completeSupportQuestInTime | flawlessAces | laningPhaseGoldExpAdvantage | poroExplosions | epicMonsterKillsNearEnemyJungler | elderDragonMultikills | tookLargeDamageSurvived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | ... | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2991512.0 | 2997254.0 | 2997254.0 | 2997254.0 | 2997254.0 |
| mean | 1498626.0 | 1710864000000.0 | 1785.701 | 1.27534 | 0.107499 | 0.0 | 1.169899 | 3455.873 | 3.42494 | 8447.035 | ... | 1.261784 | 0.537306 | 9.394825 | 0.117574 | 0.176401 | 0.111424 | 0.0 | 0.163 | 0.004176 | 0.05443 |
| std | 865232.8 | 1439656000.0 | 407.6952 | 1.935386 | 0.348538 | 0.0 | 1.395141 | 3664.54 | 2.809105 | 8588.143 | ... | 1.903211 | 0.68563 | 8.154407 | 0.322852 | 0.442902 | 0.314656 | 0.0 | 0.506687 | 0.067184 | 0.317388 |
| min | 0.0 | 1705441000000.0 | 900.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 749313.2 | 1709999000000.0 | 1522.0 | 0.0 | 0.0 | 0.0 | 0.0 | 680.0 | 1.0 | 2570.0 | ... | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1498626.0 | 1711186000000.0 | 1770.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2390.0 | 3.0 | 5642.0 | ... | 1.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2247940.0 | 1712013000000.0 | 2037.0 | 2.0 | 0.0 | 0.0 | 2.0 | 5092.0 | 5.0 | 11831.0 | ... | 2.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 2997253.0 | 1712792000000.0 | 6220.0 | 81.0 | 5.0 | 0.0 | 12.0 | 67906.0 | 83.0 | 215888.0 | ... | 36.0 | 8.0 | 84.0 | 4.0 | 5.0 | 1.0 | 0.0 | 9.0 | 4.0 | 14.0 |
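
The raw `gameStartTimestamp` and `gameDuration` values are hard to read as printed. Judging by their magnitudes, the timestamps look like Unix epoch milliseconds and the durations like seconds; both are assumptions, not documented facts about the dataset. Under those assumptions, a small sketch converts the rounded summary values into readable units (`ms_to_utc` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def ms_to_utc(timestamp_ms):
    """Convert Unix epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Rounded min and max of gameStartTimestamp, copied from the summary statistics.
first_game = ms_to_utc(1_705_441_000_000)
last_game = ms_to_utc(1_712_792_000_000)

# Mean gameDuration from the summary statistics, converted from seconds to minutes.
mean_duration_min = 1785.701 / 60

print(first_game.date(), last_game.date(), round(mean_duration_min, 1))
```

If the assumptions hold, the games span a few months of early 2024, and the minimum duration of 900 corresponds to a 15-minute game.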



## Basic Data Cleaning

Unnecessary information, empty columns, and missing values will be removed, along with other cleaning steps, to ensure data quality. These changes will be pushed to the `cleaned` branch to document each cleaning step, and will be run as SQL queries in the lakeFS web interface.
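
As a rough illustration of two of these steps in pandas (rather than SQL), the sketch below drops leftover index columns and columns that hold a single constant value. The summary statistics suggest that `visionClearedPings` and `poroExplosions` are always zero, which makes them candidates. The function name `basic_clean` and the toy frame are hypothetical; the real cleaning rules may differ.

```python
import pandas as pd


def basic_clean(df):
    """Drop leftover index columns and columns holding a single constant value."""
    # Columns named "Unnamed: ..." are typically indexes left over from a file round trip.
    index_like = [c for c in df.columns if str(c).startswith("Unnamed:")]
    cleaned = df.drop(columns=index_like)
    # A column with at most one distinct value (counting NaN) carries no information.
    constant = [c for c in cleaned.columns if cleaned[c].nunique(dropna=False) <= 1]
    return cleaned.drop(columns=constant)


# Toy data for illustration only; the values are not taken from real rows.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "gameDuration": [900, 1770, 2037],
    "poroExplosions": [0, 0, 0],
})
print(list(basic_clean(toy).columns))
```

Returning a new frame instead of modifying `df` in place keeps the raw data available for comparison against the cleaned version.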