# Sumo Wrestling Match Prediction

## Data Collection (1/5)

## Contents: 
- [Data Import](#Data-Import)
- [Data Dictionary](#Data-Dictionary)

## Data Import

This project uses two datasets from [data.world](https://data.world/cervus/sumo-japan): `banzuke.csv` and `results.csv`. They cover sumo rankings, bout results, and rikishi (sumo wrestler) characteristics such as weight, height, and age.

### Libraries

In [1]:
# Import libraries
import numpy as np 
import pandas as pd

In [2]:
# Display all rows and columns without truncation
# Reference: (https://kakakakakku.hatenablog.com/entry/2021/04/19/090229)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Data Import

#### Banzuke.csv

In [3]:
# Read in the data 
banzuke = pd.read_csv('../data/original/banzuke.csv')

# Review
banzuke.head()

Unnamed: 0,basho,id,rank,rikishi,heya,shusshin,birth_date,height,weight,prev,prev_w,prev_l
0,1983.01,1354,Y1e,Chiyonofuji,Kokonoe,Hokkaido,1955-06-01,182.0,116.0,Y1e,14.0,1.0
1,1983.01,4080,Y1w,Kitanoumi,Mihogaseki,Hokkaido,1953-05-16,179.0,165.0,Y2eHD,9.0,3.0
2,1983.01,4095,Y2eHD,Wakanohana,Futagoyama,Aomori,1953-04-03,186.0,133.0,Y1w,0.0,0.0
3,1983.01,4104,O1e,Takanosato,Futagoyama,Aomori,1952-09-29,181.0,144.0,O1e,10.0,5.0
4,1983.01,4112,O1w,Kotokaze,Sadogatake,Mie,1957-04-26,183.0,163.0,O1w,10.0,5.0


In [4]:
# Check Data Types 
banzuke.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176000 entries, 0 to 175999
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   basho       176000 non-null  float64
 1   id          176000 non-null  int64  
 2   rank        176000 non-null  object 
 3   rikishi     176000 non-null  object 
 4   heya        176000 non-null  object 
 5   shusshin    176000 non-null  object 
 6   birth_date  175337 non-null  object 
 7   height      170826 non-null  float64
 8   weight      170826 non-null  float64
 9   prev        175915 non-null  object 
 10  prev_w      175915 non-null  float64
 11  prev_l      175915 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 16.1+ MB


In [5]:
# Check data shape
banzuke.shape

(176000, 12)

In [6]:
# Check for missing values
banzuke.isnull().sum()

basho            0
id               0
rank             0
rikishi          0
heya             0
shusshin         0
birth_date     663
height        5174
weight        5174
prev            85
prev_w          85
prev_l          85
dtype: int64

#### Results.csv

In [7]:
# Read in the data
results = pd.read_csv('../data/original/results.csv')

# Review
results.head()

Unnamed: 0,basho,day,rikishi1_id,rikishi1_rank,rikishi1_shikona,rikishi1_result,rikishi1_win,kimarite,rikishi2_id,rikishi2_rank,rikishi2_shikona,rikishi2_result,rikishi2_win
0,1983.01,1,4140,J13w,Chikubayama,0-1 (7-8),0,yorikiri,4306,Ms1e,Ofuji,1-0 (6-1),1
1,1983.01,1,4306,Ms1e,Ofuji,1-0 (6-1),1,yorikiri,4140,J13w,Chikubayama,0-1 (7-8),0
2,1983.01,1,1337,J12w,Tochitsukasa,1-0 (9-6),1,oshidashi,4323,J13e,Shiraiwa,0-1 (3-12),0
3,1983.01,1,4323,J13e,Shiraiwa,0-1 (3-12),0,oshidashi,1337,J12w,Tochitsukasa,1-0 (9-6),1
4,1983.01,1,4097,J12e,Tamakiyama,0-1 (8-7),0,yorikiri,4319,J11w,Harunafuji,1-0 (5-10),1


In [8]:
# Check data info
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226590 entries, 0 to 226589
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   basho             226590 non-null  float64
 1   day               226590 non-null  int64  
 2   rikishi1_id       226590 non-null  int64  
 3   rikishi1_rank     226590 non-null  object 
 4   rikishi1_shikona  226590 non-null  object 
 5   rikishi1_result   226590 non-null  object 
 6   rikishi1_win      226590 non-null  int64  
 7   kimarite          226590 non-null  object 
 8   rikishi2_id       226590 non-null  int64  
 9   rikishi2_rank     226590 non-null  object 
 10  rikishi2_shikona  226590 non-null  object 
 11  rikishi2_result   226590 non-null  object 
 12  rikishi2_win      226590 non-null  int64  
dtypes: float64(1), int64(5), object(7)
memory usage: 22.5+ MB


In [9]:
# Check data shape
results.shape

(226590, 13)

In [10]:
# Check for missing values
results.isnull().sum()

basho               0
day                 0
rikishi1_id         0
rikishi1_rank       0
rikishi1_shikona    0
rikishi1_result     0
rikishi1_win        0
kimarite            0
rikishi2_id         0
rikishi2_rank       0
rikishi2_shikona    0
rikishi2_result     0
rikishi2_win        0
dtype: int64

### Merge Datasets

The datasets were merged with two inner joins. The first merge matches `basho`, `rikishi1_id`, `rikishi1_rank`, and `rikishi1_shikona` from results.csv against `basho`, `id`, `rank`, and `rikishi` from banzuke.csv, which attaches banzuke information for the first wrestler (rikishi1) in each bout. The second merge matches `basho`, `rikishi2_id`, `rikishi2_rank`, and `rikishi2_shikona` from results.csv against the same banzuke.csv columns to attach information for the opponent (rikishi2). The row count stays at 226,590 after each merge, which suggests that every bout found a matching banzuke entry for both wrestlers.

In [11]:
# Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

# Make a copy
results1 = results.copy()
banzuke1 = banzuke.copy()

# Merge dataframe (Rikishi1)
df = pd.merge(results1, 
              banzuke1, 
              left_on=['basho', 'rikishi1_id', 'rikishi1_rank', 'rikishi1_shikona'], 
              right_on=['basho', 'id', 'rank', 'rikishi'], 
              how='inner'
             )

df.shape

(226590, 24)

In [12]:
# Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

# Make a copy
banzuke2 = banzuke.copy()

# Merge dataframe (Rikishi2)
df = pd.merge(df, 
              banzuke2, 
              left_on=['basho', 'rikishi2_id', 'rikishi2_rank', 'rikishi2_shikona'], 
              right_on=['basho', 'id', 'rank', 'rikishi'], 
              how='inner'
             )

df.shape

(226590, 35)
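
The two-step merge above can be sketched on toy data to show where the `_x` and `_y` suffixes come from. This is a minimal sketch, not the notebook's actual data: the IDs, heights, and the reduced key list (`basho` plus ID only) are illustrative assumptions, and `validate="many_to_one"` is an optional guard that is not used in the cells above.

```python
import pandas as pd

# Toy stand-ins for results.csv and banzuke.csv (hypothetical values)
results_toy = pd.DataFrame({
    "basho": [1983.01, 1983.01],
    "rikishi1_id": [1, 2],
    "rikishi2_id": [2, 1],
    "rikishi1_win": [1, 0],
})
banzuke_toy = pd.DataFrame({
    "basho": [1983.01, 1983.01],
    "id": [1, 2],
    "height": [180.0, 175.0],
})

# First merge: attach banzuke info for rikishi1.
# validate="many_to_one" raises if a (basho, id) pair repeats in banzuke_toy.
step1 = pd.merge(results_toy, banzuke_toy,
                 left_on=["basho", "rikishi1_id"],
                 right_on=["basho", "id"],
                 how="inner", validate="many_to_one")

# Second merge: attach banzuke info for rikishi2.
# 'id' and 'height' now exist on both sides, so pandas adds _x / _y suffixes.
step2 = pd.merge(step1, banzuke_toy,
                 left_on=["basho", "rikishi2_id"],
                 right_on=["basho", "id"],
                 how="inner", validate="many_to_one")

print(step2.columns.tolist())
# ['basho', 'rikishi1_id', 'rikishi2_id', 'rikishi1_win',
#  'id_x', 'height_x', 'id_y', 'height_y']
```

Columns ending in `_x` come from the first merge (rikishi1) and columns ending in `_y` from the second (rikishi2), which is why the renaming step later maps them to `r1_` and `r2_`.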

### Drop duplicate columns

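After the merges, the banzuke key columns (`id`, `rank`, and `rikishi`, each with an `_x` and `_y` suffix) duplicate the `rikishi1_*` and `rikishi2_*` columns from results.csv, so they were dropped.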

In [13]:
# Drop unnecessary/duplicate columns
df.drop(columns = ['id_x', 'rank_x', 'rikishi_x', 
                   'id_y', 'rank_y', 'rikishi_y'                   
                   ], inplace=True)

In [14]:
df.columns

Index(['basho', 'day', 'rikishi1_id', 'rikishi1_rank', 'rikishi1_shikona',
       'rikishi1_result', 'rikishi1_win', 'kimarite', 'rikishi2_id',
       'rikishi2_rank', 'rikishi2_shikona', 'rikishi2_result', 'rikishi2_win',
       'heya_x', 'shusshin_x', 'birth_date_x', 'height_x', 'weight_x',
       'prev_x', 'prev_w_x', 'prev_l_x', 'heya_y', 'shusshin_y',
       'birth_date_y', 'height_y', 'weight_y', 'prev_y', 'prev_w_y',
       'prev_l_y'],
      dtype='object')

### Rename column names

The remaining columns were renamed to distinguish the first wrestler from the second. Columns with information about `rikishi1` (the main wrestler) carry the `_x` suffix from the merge; the suffix was removed and `r1_` was added at the beginning. Columns with information about `rikishi2` (the opponent) carry the `_y` suffix; the suffix was removed and `r2_` was added at the beginning. Additionally, the string `rikishi` in the column names was replaced with `r` to shorten them.

In [15]:
def rename_columns(df):
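    """
    If a column name ends with '_x', strip the suffix and add 'r1_' to the beginning of the name.
    If a column name ends with '_y', strip the suffix and add 'r2_' to the beginning of the name.
    Other column names are left unchanged. The dataframe is modified in place and returned.
    """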
    new_cols = []
    for col in df.columns:
        if col.endswith('_x'):
            new_cols.append('r1_' + col[:-2])
        elif col.endswith('_y'):
            new_cols.append('r2_' + col[:-2])
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df

In [16]:
df = rename_columns(df)

In [17]:
def shorten_col_names(df):
    """
    Shorten column names by replacing 'rikishi' with 'r'
    """
    new_cols = {}
    for col in df.columns:
        if 'rikishi' in col:
            new_cols[col] = col.replace('rikishi', 'r')
    return df.rename(columns=new_cols)

In [18]:
df = shorten_col_names(df)
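
The combined effect of the two renaming steps can be checked on a toy column list. This is a minimal sketch of an equivalent, non-mutating approach using `DataFrame.rename` with a function; the helper name `rename_suffix` and the toy columns are hypothetical and not part of the notebook.

```python
import pandas as pd

# Empty toy frame carrying only column names (hypothetical subset)
toy = pd.DataFrame(columns=["basho", "rikishi1_id", "rikishi2_id",
                            "height_x", "height_y"])

def rename_suffix(col):
    """Map a '_x' suffix to an 'r1_' prefix and a '_y' suffix to an 'r2_' prefix."""
    if col.endswith("_x"):
        return "r1_" + col[:-2]
    if col.endswith("_y"):
        return "r2_" + col[:-2]
    return col

# Step 1: suffix -> prefix; step 2: 'rikishi' -> 'r'
renamed = (toy.rename(columns=rename_suffix)
              .rename(columns=lambda c: c.replace("rikishi", "r")))

print(list(renamed.columns))
# ['basho', 'r1_id', 'r2_id', 'r1_height', 'r2_height']
```

After both steps, every wrestler-specific column starts with `r1_` or `r2_`, regardless of whether it came from results.csv or banzuke.csv.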

### Save combined dataset to a CSV file

Finally, the modified merged dataframe was saved as a csv file for future use and analysis.

In [19]:
# Save dataset in csv file 
df.to_csv('../data/sumo_v1_combined.csv', index=False)
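
The `index=False` argument matters when the file is read back later. The sketch below uses an in-memory buffer and a toy frame (both assumptions for illustration) to show that writing the index adds an extra `Unnamed: 0` column on re-read.

```python
import io
import pandas as pd

# Toy frame with two of the renamed columns (hypothetical values)
toy = pd.DataFrame({"basho": [1983.01], "r1_id": [1]})

# Written without the index: columns round-trip unchanged
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

# Written with the default index=True: the index comes back as a column
buf_idx = io.StringIO()
toy.to_csv(buf_idx)
buf_idx.seek(0)
with_index = pd.read_csv(buf_idx)

print(list(clean.columns))       # ['basho', 'r1_id']
print(list(with_index.columns))  # ['Unnamed: 0', 'basho', 'r1_id']
```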

## Data Dictionary 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|basho|float64|banzuke.csv|tournament year and month as yyyy.mm (e.g. 1983.01)|
|id|int64|banzuke.csv|wrestler ID|
|rank|object|banzuke.csv|wrestler rank in the tournament|
|rikishi|object|banzuke.csv|wrestler ring name|
|heya|object|banzuke.csv|'room', an organization of sumo wrestlers|
|shusshin|object|banzuke.csv|hometown|
|birth_date|object|banzuke.csv|birth date|
|height|float64|banzuke.csv|height (cm)|
|weight|float64|banzuke.csv|weight (kg)|
|prev|object|banzuke.csv|rank in previous tournament|
|prev_w|float64|banzuke.csv|number of wins in previous tournament|
|prev_l|float64|banzuke.csv|number of losses in previous tournament|
|basho|float64|results.csv|tournament year and month as yyyy.mm (e.g. 1983.01)|
|day|int64|results.csv|day number (16 for play-off)|
|rikishi1_id|int64|results.csv|first wrestler ID|
|rikishi1_rank|object|results.csv|first wrestler rank|
|rikishi1_shikona|object|results.csv|first wrestler ring name|
|rikishi1_result|object|results.csv|first wrestler result after the bout (final result in brackets)|
|rikishi1_win|int64|results.csv|1 for win, 0 for defeat|
|kimarite|object|results.csv|a type of technique used in sumo by a rikishi to win a match|
|rikishi2_id|int64|results.csv|second wrestler ID|
|rikishi2_rank|object|results.csv|second wrestler rank|
|rikishi2_shikona|object|results.csv|second wrestler ring name|
|rikishi2_result|object|results.csv|second wrestler result after the bout (final result in brackets)|
|rikishi2_win|int64|results.csv|1 for win, 0 for defeat|
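
Two of the fields above pack several values into one column: `basho` stores year and month in a single float, and the `*_result` columns store a running record as text such as `0-1 (7-8)`. The sketch below shows one way to unpack them. It assumes every result string starts with `wins-losses`, which has not been verified against the full dataset, and the new column names (`year`, `month`, `wins_so_far`, `losses_so_far`) are hypothetical.

```python
import pandas as pd

# Toy rows; the second basho value is illustrative
toy = pd.DataFrame({
    "basho": [1983.01, 1983.11],
    "rikishi1_result": ["0-1 (7-8)", "1-0 (6-1)"],
})

# Format the float with two decimals so '1983.01' keeps its leading zero,
# then slice year and month out of the string
basho_str = toy["basho"].map("{:.2f}".format)
toy["year"] = basho_str.str[:4].astype(int)
toy["month"] = basho_str.str[5:].astype(int)

# Extract the running record at the start of the result string
running = toy["rikishi1_result"].str.extract(r"^(\d+)-(\d+)").astype(int)
toy["wins_so_far"] = running[0]
toy["losses_so_far"] = running[1]

print(toy[["year", "month", "wins_so_far", "losses_so_far"]])
```

Going through a formatted string avoids floating-point arithmetic on the fractional part of `basho`.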