# March Machine Learning Mania 2025 (using Tensorflow)

* 컴피티션 목표  
  : NCAA Division 1 미국 대학농구 토너먼트, 혹은 다른 말로 **3월의 광란**의 결과를 예측하는 것!
  
* 전략  
  : 매 시즌 정규시즌 결과를 팀별로 정리한 후, 팀별 승률, 마진(점수차) 등을 사용한다.  
    추가적으로 평균득점, 리바운드, 어시스트 등의 1차 스탯과 시드, 랭킹 등 외부 평가자료를 학습데이터로 사용한다.  
    2024년까지의 정규시즌 성적을 특성 X, 토너먼트 성적을 라벨 y로 하여 머신러닝을 수행한다.  
    
    올해 처음 참가하는 것이므로 데이터를 형식에 맞춰 작성하는 것을 기본 목표로 한다.

## 1. Environment Settings

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import brier_score_loss

## 2. Gamescores, Seeds (RS, from 1985)

- MRegularSeasonCompactResults.csv  
  하나의 로우에 승리팀, 패배팀의 데이터가 섞여있으므로 구분해준다.    
  (패배팀의 경우, 마진에 -1을 곱해줘야 한다.)  
  W/L 컬럼에는 승패를 1,0으로 인코딩해서 넣어준다.  

- MNCAATourneySeeds.csv  
  토너먼트 시드 데이터는 별개의 파일에서 가져온다.

두 데이터를 정리한 뒤, 모든 팀에 대한 경우(카테시안 곱)를 구한다.  
먼저 나온 TeamID가 작은 경우만 남겨서 예측에 사용한다

In [2]:
compact_results_url = '/kaggle/input/march-machine-learning-mania-2025/MRegularSeasonCompactResults.csv'
seeds_url = '/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv'

In [3]:
def regular_season(compact_results_url, seeds_url):
    # Regular Season Data from 1985
    rs_compact = pd.read_csv(compact_results_url)
    rs_compact['margin'] = rs_compact['WScore'] - rs_compact['LScore']

    W_columns = {'WTeamID': 'TeamID', 'WScore': 'Score'}
    L_columns = {'LTeamID': 'TeamID', 'LScore': 'Score'}

    W_results = rs_compact[['Season', 'margin'] + list(W_columns.keys())].rename(columns=W_columns)
    L_results = rs_compact[['Season', 'margin'] + list(L_columns.keys())].rename(columns=L_columns)

    W_results['W/L'], L_results['W/L'] = 1, 0
    L_results['margin'] = -L_results['margin']

    df_flatten = pd.concat([W_results, L_results], axis=0).reset_index(drop=True)
    summary_mean = df_flatten.groupby(['Season', 'TeamID'], as_index=False).mean()

    # Add seeds from 1985
    seeds = pd.read_csv(seeds_url)
    seeds['Seed'] = seeds['Seed'].apply(lambda x: int(x[1:3]))

    df_seeds = pd.merge(summary_mean, seeds, on=['Season', 'TeamID'], how='left')
    df_seeds['Seed'] = df_seeds['Seed'].fillna(16)  # Teams that didn't make the tournament

    # Self join to create a Cartesian product with TeamID ordering
    df_cartesian = df_seeds.merge(df_seeds, on='Season', suffixes=('_x', '_y'))
    df_cartesian = df_cartesian[df_cartesian['TeamID_x'] < df_cartesian['TeamID_y']]

    # Create variables
    df_cartesian['ID'] = df_cartesian['Season'].astype(str) + "_" + df_cartesian['TeamID_x'].astype(str) + "_" + df_cartesian['TeamID_y'].astype(str)
    df_cartesian['margin_diff'] = df_cartesian['margin_x'] - df_cartesian['margin_y']
    df_cartesian['Seed_diff'] = df_cartesian['Seed_x'] - df_cartesian['Seed_y']
    df_cartesian['Score_diff'] = df_cartesian['Score_x'] - df_cartesian['Score_y']
    df_cartesian['W/L_diff'] = df_cartesian['W/L_x'] - df_cartesian['W/L_y']

    return df_cartesian[['ID', 'Season', 'margin_diff', 'Seed_diff', 'Score_diff', 'W/L_diff']]

df_rs = regular_season(compact_results_url, seeds_url)
df_rs

Unnamed: 0,ID,Season,margin_diff,Seed_diff,Score_diff,W/L_diff
1,1985_1102_1103,1985,-2.748188,0.0,2.039855,-0.182971
2,1985_1102_1104,1985,-13.591667,9.0,-5.416667,-0.491667
3,1985_1102_1106,1985,-2.000000,0.0,-8.541667,-0.208333
4,1985_1102_1108,1985,-13.751667,0.0,-19.916667,-0.551667
5,1985_1102_1109,1985,23.333333,0.0,9.250000,0.166667
...,...,...,...,...,...,...
4397248,2025_1477_1479,2025,-4.551843,0.0,-1.430876,-0.267281
4397249,2025_1477_1480,2025,0.550538,0.0,-3.411828,-0.005376
4397612,2025_1478_1479,2025,-3.502381,0.0,6.147619,-0.195238
4397613,2025_1478_1480,2025,1.600000,0.0,4.166667,0.066667


## 3. Gamestats, Rankings (RS, from 2003)

- MRegularSeasonDetailedResults.csv  
  개별 경기의 득점, 리바운드, 어시스트 등의 데이터가 있으므로 시즌 평균을 집계한다.  
  Compact Data와 동일한 방식으로 하나의 로우에 승리팀, 패배팀의 데이터가 섞여있으므로 구분해준다.      

- MMasseyOrdinals.csv  
  매 시즌 각 팀의 랭킹 데이터. 자세히는 모르겠다.  
  시즌 마지막 날의 랭킹 데이터를 평균내서 스탯으로 사용한다.  

위 두 데이터는 2003년부터 데이터가 존재한다.  
시즌별로 정리한 뒤, 위에서 만든 데이터와 합쳐준다.  

In [4]:
detailed_results_url = '/kaggle/input/march-machine-learning-mania-2025/MRegularSeasonDetailedResults.csv'
rankings_url = '/kaggle/input/march-machine-learning-mania-2025/MMasseyOrdinals.csv'

In [5]:
def regular_season_detailed(detailed_results_url, rankings_url):
    # Regular Season Detailed Data from 2003
    rs_detailed = pd.read_csv(detailed_results_url)

    W_columns = {'WTeamID': 'TeamID', 'WScore': 'Score', 
                'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 'WFGA3': 'FGA3', 'WFTM': 'FTM', 'WFTA': 'FTA', 
                'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast', 'WTO': 'TO', 'WStl': 'Stl', 'WBlk': 'Blk', 'WPF': 'PF'}
    L_columns = {'LTeamID': 'TeamID', 'LScore': 'Score',
                'LFGM': 'FGM', 'LFGA': 'FGA', 'LFGM3': 'FGM3', 'LFGA3': 'FGA3', 'LFTM': 'FTM', 'LFTA': 'FTA',
                'LOR': 'OR', 'LDR': 'DR', 'LAst': 'Ast', 'LTO': 'TO', 'LStl': 'Stl', 'LBlk': 'Blk', 'LPF': 'PF'}

    W_results = rs_detailed[['Season'] + list(W_columns.keys())].rename(columns=W_columns)
    L_results = rs_detailed[['Season'] + list(L_columns.keys())].rename(columns=L_columns)

    df_flatten = pd.concat([W_results, L_results], axis=0).reset_index(drop=True)
    summary_mean = df_flatten.groupby(['Season', 'TeamID'], as_index=False).mean()

    # Add rankings from 2003
    df_rankings = pd.read_csv(rankings_url)

    latest_ranking_days = df_rankings.groupby("Season")["RankingDayNum"].max().reset_index()
    df_latest_ranking = df_rankings.merge(latest_ranking_days, on=["Season", "RankingDayNum"], how="inner")
    df_latest_ranking = df_latest_ranking.groupby(["Season", "TeamID"])["OrdinalRank"].mean().reset_index()

    df_rankings = pd.merge(summary_mean, df_latest_ranking, on=['Season', 'TeamID'], how='left')
    df_rankings['OrdinalRank'] = df_rankings['OrdinalRank'].fillna(351)  # Teams that didn't make the tournament

    df_rankings

    # Create variables
    df_rankings['FG%'] = df_rankings['FGM'] / df_rankings['FGA']
    df_rankings['FG3%'] = df_rankings['FGM3'] / df_rankings['FGA3']
    df_rankings['FT%'] = df_rankings['FTM'] / df_rankings['FTA']

    df_rankings['TOR'] = df_rankings['OR'] + df_rankings['DR']
    df_rankings['AST/TO'] = df_rankings['Ast'] / df_rankings['TO']

    # Self join to create a Cartesian product with TeamID ordering
    df_cartesian = df_rankings.merge(df_rankings, on='Season', suffixes=('_x', '_y'))
    df_cartesian = df_cartesian[df_cartesian['TeamID_x'] < df_cartesian['TeamID_y']]

    df_cartesian['ID'] = df_cartesian['Season'].astype(str) + "_" + df_cartesian['TeamID_x'].astype(str) + "_" + df_cartesian['TeamID_y'].astype(str)
    df_cartesian['Rank_diff'] = df_cartesian['OrdinalRank_x'] - df_cartesian['OrdinalRank_y']
    
    return df_cartesian

df_rs_detailed = regular_season_detailed(detailed_results_url, rankings_url)
df_rs_detailed

Unnamed: 0,Season,TeamID_x,Score_x,FGM_x,FGA_x,FGM3_x,FGA3_x,FTM_x,FTA_x,OR_x,...,Blk_y,PF_y,OrdinalRank_y,FG%_y,FG3%_y,FT%_y,TOR_y,AST/TO_y,ID,Rank_diff
1,2003,1102,57.250000,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,...,2.333333,19.851852,168.000000,0.486074,0.338710,0.736390,29.703704,1.205279,2003_1102_1103,-11.968750
2,2003,1102,57.250000,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,...,3.785714,18.035714,38.031250,0.420362,0.320144,0.709898,37.500000,0.911290,2003_1102_1104,118.000000
3,2003,1102,57.250000,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,...,2.076923,20.230769,308.968750,0.395755,0.364815,0.705986,36.615385,0.779381,2003_1102_1105,-152.937500
4,2003,1102,57.250000,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,...,3.142857,18.178571,262.687500,0.423773,0.346154,0.646421,36.142857,0.685535,2003_1102_1106,-106.656250
5,2003,1102,57.250000,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,...,2.035714,15.892857,301.937500,0.418272,0.357488,0.733509,28.500000,0.948864,2003_1102_1107,-145.906250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2770809,2025,1477,64.354839,23.000000,55.290323,8.387097,26.709677,9.967742,15.483871,7.935484,...,1.892857,16.678571,323.092593,0.421367,0.357013,0.806867,24.428571,1.344322,2025_1477_1479,15.695869
2770810,2025,1477,64.354839,23.000000,55.290323,8.387097,26.709677,9.967742,15.483871,7.935484,...,2.966667,15.833333,345.018519,0.427206,0.296642,0.690058,29.333333,1.231013,2025_1477_1480,-6.230057
2771173,2025,1478,71.933333,24.800000,55.400000,7.500000,22.900000,14.833333,20.866667,7.466667,...,1.892857,16.678571,323.092593,0.421367,0.357013,0.806867,24.428571,1.344322,2025_1478_1479,24.111111
2771174,2025,1478,71.933333,24.800000,55.400000,7.500000,22.900000,14.833333,20.866667,7.466667,...,2.966667,15.833333,345.018519,0.427206,0.296642,0.690058,29.333333,1.231013,2025_1478_1480,2.185185


## 4. Data Concatenation

정규시즌 성적, 시드, 평균 스탯, 랭킹 데이터를 정리하여 각 팀별 성적을 합친다.

조인 방식을 `left`로 하면 1985년부터의 데이터를 사용할 수 있고 (스탯, 랭킹은 NULL)  
조인 방식을 `inner`로 하면 2003년부터의 데이터를 모두 사용할 수 있다.

이 분석에서는 2003년부터의 데이터를 사용한다.

In [6]:
df_rs_all = pd.merge(df_rs, df_rs_detailed, on=['ID', 'Season'], how='inner')
df_rs_all

Unnamed: 0,ID,Season,margin_diff,Seed_diff,Score_diff,W/L_diff,TeamID_x,Score_x,FGM_x,FGA_x,...,Stl_y,Blk_y,PF_y,OrdinalRank_y,FG%_y,FG3%_y,FT%_y,TOR_y,AST/TO_y,Rank_diff
0,2003_1102_1103,2003,-0.379630,0.0,-21.527778,-0.052910,1102,57.250000,19.142857,39.785714,...,7.259259,2.333333,19.851852,168.000000,0.486074,0.338710,0.736390,29.703704,1.205279,-11.968750
1,2003_1102_1104,2003,-4.035714,6.0,-12.035714,-0.178571,1102,57.250000,19.142857,39.785714,...,6.607143,3.785714,18.035714,38.031250,0.420362,0.320144,0.709898,37.500000,0.911290,118.000000
2,2003_1102_1105,2003,5.134615,0.0,-14.519231,0.159341,1102,57.250000,19.142857,39.785714,...,9.307692,2.076923,20.230769,308.968750,0.395755,0.364815,0.705986,36.615385,0.779381,-152.937500
3,2003_1102_1106,2003,0.392857,0.0,-6.357143,-0.035714,1102,57.250000,19.142857,39.785714,...,8.357143,3.142857,18.178571,262.687500,0.423773,0.346154,0.646421,36.142857,0.685535,-106.656250
4,2003_1102_1107,2003,10.035714,0.0,-8.678571,0.178571,1102,57.250000,19.142857,39.785714,...,6.857143,2.035714,15.892857,301.937500,0.418272,0.357488,0.733509,28.500000,0.948864,-145.906250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1381956,2025_1477_1479,2025,-4.551843,0.0,-1.430876,-0.267281,1477,64.354839,23.000000,55.290323,...,6.607143,1.892857,16.678571,323.092593,0.421367,0.357013,0.806867,24.428571,1.344322,15.695869
1381957,2025_1477_1480,2025,0.550538,0.0,-3.411828,-0.005376,1477,64.354839,23.000000,55.290323,...,6.866667,2.966667,15.833333,345.018519,0.427206,0.296642,0.690058,29.333333,1.231013,-6.230057
1381958,2025_1478_1479,2025,-3.502381,0.0,6.147619,-0.195238,1478,71.933333,24.800000,55.400000,...,6.607143,1.892857,16.678571,323.092593,0.421367,0.357013,0.806867,24.428571,1.344322,24.111111
1381959,2025_1478_1480,2025,1.600000,0.0,4.166667,0.066667,1478,71.933333,24.800000,55.400000,...,6.866667,2.966667,15.833333,345.018519,0.427206,0.296642,0.690058,29.333333,1.231013,2.185185


## 5. Tournament Data for Model Training

2024년까지의 정규시즌 성적을 특성 X, 토너먼트 성적을 라벨 y로 하는 것이 전략이었다.  
이를 위해 토너먼트 성적을 불러오자.

In [7]:
tournament_results_url = '/kaggle/input/march-machine-learning-mania-2025/MNCAATourneyCompactResults.csv'

In [8]:
def tournament(tournament_results_url):
    df_y = pd.read_csv(tournament_results_url)
    df_y["Team1"] = df_y[["WTeamID", "LTeamID"]].min(axis=1)
    df_y["Team2"] = df_y[["WTeamID", "LTeamID"]].max(axis=1)
    df_y["ID"] = df_y["Season"].astype(str) + "_" + df_y["Team1"].astype(str) + "_" + df_y["Team2"].astype(str)
    df_y["Margin"] = df_y.apply(lambda row: row["WScore"] - row["LScore"] if row["WTeamID"] == row["Team1"] else row["LScore"] - row["WScore"], axis=1)
    df_y['W/L'] = (df_y['Margin'] > 0).astype(int)
    
    return df_y[['ID', 'W/L', 'Margin']]
 
df_tourney = tournament(tournament_results_url)
df_tourney

Unnamed: 0,ID,W/L,Margin
0,1985_1116_1234,1,9
1,1985_1120_1345,1,1
2,1985_1207_1250,1,25
3,1985_1229_1425,1,3
4,1985_1242_1325,1,11
...,...,...,...
2513,2024_1181_1301,0,-12
2514,2024_1345_1397,1,6
2515,2024_1104_1163,0,-14
2516,2024_1301_1345,0,-13


토너먼트 결과는 전체 데이터의 일부일 뿐이다.  
`inner join`을 사용해서 토너먼트 데이터에 해당하는 데이터만 가져온다.

우리는 2003년부터의 데이터를 사용하기 때문에  
조인 후의 토너먼트 데이터도 2003년부터의 데이터만 존재한다.

In [9]:
df_train = pd.merge(df_tourney, df_rs_all, on='ID', how='inner')
df_train

Unnamed: 0,ID,W/L,Margin,Season,margin_diff,Seed_diff,Score_diff,W/L_diff,TeamID_x,Score_x,...,Stl_y,Blk_y,PF_y,OrdinalRank_y,FG%_y,FG3%_y,FT%_y,TOR_y,AST/TO_y,Rank_diff
0,2003_1411_1421,0,-8,2003,9.208046,0.0,1.593103,0.151724,1411,72.800000,...,7.068966,3.000000,19.103448,240.343750,0.429265,0.360153,0.762768,35.448276,0.804255,-1.062500
1,2003_1112_1436,1,29,2003,10.309113,-15.0,17.421182,0.237685,1112,85.214286,...,6.862069,2.965517,15.896552,153.125000,0.444444,0.340757,0.657848,38.689655,1.009804,-150.448529
2,2003_1113_1272,1,13,2003,-1.896552,3.0,1.448276,-0.172414,1113,75.965517,...,7.379310,5.068966,18.758621,21.705882,0.437931,0.348797,0.653614,40.034483,1.205000,14.294118
3,2003_1141_1166,1,6,2003,-8.805643,5.0,0.102403,-0.085684,1141,79.344828,...,8.393939,4.454545,17.272727,20.735294,0.499473,0.389053,0.692890,34.060606,1.258503,24.952206
4,2003_1143_1301,1,2,2003,0.324138,-1.0,2.082759,0.124138,1143,74.482759,...,7.766667,3.066667,18.666667,50.312500,0.456250,0.354074,0.770358,31.766667,1.032864,-13.906250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1377,2024_1181_1301,0,-12,2024,8.739583,-7.0,3.482639,0.138889,1181,79.843750,...,7.416667,3.500000,16.361111,57.666667,0.449203,0.346049,0.733520,32.638889,1.376506,-44.937853
1378,2024_1345_1397,1,6,2024,1.648674,-1.0,3.925189,0.128788,1345,83.393939,...,7.937500,4.656250,17.437500,6.627119,0.444332,0.341912,0.749263,36.093750,1.680251,-3.661017
1379,2024_1104_1163,0,-14,2024,-7.371324,3.0,9.279412,-0.255515,1104,90.750000,...,6.235294,5.382353,16.235294,2.033898,0.495988,0.366871,0.742470,35.264706,2.032258,12.491525
1380,2024_1301_1345,0,-13,2024,-9.575758,10.0,-7.032828,-0.267677,1301,76.361111,...,5.666667,3.787879,14.363636,2.966102,0.488324,0.408012,0.721212,37.696970,1.676796,54.700565


## 6. Model Training

아래와 같은 내용을 고려해서 Tensorflow를 사용한 DNN 모형을 만든다.  

- 최종적으로 Team 1이 승리할 확률을 구하기 때문에 출력층은 `Dense(1, activation='sigmoid')`
- 업셋에도 강건하게 하기 위해 L2 Regularization과 Dropout을 적용
- 마찬가지로 Overfitting을 피하기 위해 모형의 구조는 단순하게

In [10]:
import tensorflow as tf

features = [# Summary Statistics
            'margin_diff', 'Seed_diff', 'Score_diff', 'W/L_diff', 'Rank_diff', 
            # Basis Stats: Team 1
            'Score_x', 'FGM_x', 'FGA_x', 'FGM3_x', 'FGA3_x', 'FTM_x', 'FTA_x', 'OR_x', 'DR_x', 'Ast_x', 'TO_x', 'Stl_x', 'Blk_x', 'PF_x', 
            # Derived Variables: Team 1
            'OrdinalRank_x', 'FG%_x', 'FG3%_x', 'FT%_x', 'TOR_x', 'AST/TO_x', 
            # Basic Stats: Team 2
            'Score_y', 'FGM_y', 'FGA_y', 'FGM3_y', 'FGA3_y', 'FTM_y', 'FTA_y', 'OR_y', 'DR_y', 'Ast_y', 'TO_y', 'Stl_y', 'Blk_y', 'PF_y', 
            # Derived Variables: Team 2
            'OrdinalRank_y', 'FG%_y', 'FG3%_y', 'FT%_y', 'TOR_y', 'AST/TO_y',]

X_train, y_train = df_train[features], df_train["W/L"]

model_tf = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(8, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_tf.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
model_tf.fit(X_train, y_train, epochs=1000, batch_size=32)

Epoch 1/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 32ms/step - accuracy: 0.5031 - loss: 13.6617
Epoch 2/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5765 - loss: 2.8064
Epoch 3/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5575 - loss: 1.3024
Epoch 4/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5697 - loss: 1.0157
Epoch 5/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5622 - loss: 0.9359
Epoch 6/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5274 - loss: 0.8230
Epoch 7/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5571 - loss: 0.8334
Epoch 8/1000
[1m44/44[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5840 - loss: 0.7947
Epoch 9/1000
[1m44/44[0m [32m━━━━━━

<keras.src.callbacks.history.History at 0x79724e540040>

## 7. Men's Tournament Prediction

학습을 마쳤으면, 2025년 데이터를 사용해서 예측 데이터를 생성한다.

In [11]:
df_submission = df_rs_all[df_rs_all.Season == 2025].copy()
df_submission['Pred'] = model_tf.predict(df_submission[features])
df_submission = df_submission.reset_index()
df_submission_men = df_submission[['ID', 'Pred']]

df_submission_men

[1m2065/2065[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 1ms/step


Unnamed: 0,ID,Pred
0,2025_1101_1102,0.670342
1,2025_1101_1103,0.168256
2,2025_1101_1104,0.002535
3,2025_1101_1105,0.734987
4,2025_1101_1106,0.532790
...,...,...
66061,2025_1477_1479,0.559143
66062,2025_1477_1480,0.598932
66063,2025_1478_1479,0.525310
66064,2025_1478_1480,0.575163


## 8. Women's Tournament Prediction

동일한 방식으로 여자 대학농구 토너먼트 예측도 만들어준다.  
다만 여자농구는 랭킹 정보가 제공되지 않아 1985년부터의 경기 데이터만을 사용해서 예측한다.  

In [12]:
# Women's prediction
compact_results_url = '/kaggle/input/march-machine-learning-mania-2025/WRegularSeasonCompactResults.csv'
tournament_results_url = '/kaggle/input/march-machine-learning-mania-2025/WNCAATourneyCompactResults.csv'
rankings_url = '/kaggle/input/march-machine-learning-mania-2025/WMasseyOrdinals.csv'

df_rs = regular_season(compact_results_url, seeds_url)
df_tourney = tournament(tournament_results_url)
df_train = pd.merge(df_tourney, df_rs, on='ID', how='inner').fillna(0)

In [13]:
import tensorflow as tf

features = ['margin_diff', 'Seed_diff', 'Score_diff', 'W/L_diff']

X_train, y_train = df_train[features], df_train["W/L"]

model_tf = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(8, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_tf.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
model_tf.fit(X_train, y_train, epochs=1000, batch_size=32)

Epoch 1/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 27ms/step - accuracy: 0.4638 - loss: 2.3762
Epoch 2/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5713 - loss: 1.3229
Epoch 3/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5660 - loss: 1.0105
Epoch 4/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.6115 - loss: 0.9022
Epoch 5/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5875 - loss: 0.7922
Epoch 6/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.6262 - loss: 0.7570
Epoch 7/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.5969 - loss: 0.7358
Epoch 8/1000
[1m52/52[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.6138 - loss: 0.6698
Epoch 9/1000
[1m52/52[0m [32m━━━━━━━

<keras.src.callbacks.history.History at 0x797237a93a00>

In [14]:
df_submission = df_rs[df_rs.Season == 2025].copy()
df_submission['Pred'] = model_tf.predict(df_submission[features])
df_submission = df_submission.reset_index()
df_submission_women = df_submission[['ID', 'Pred']]

df_submission_women

[1m2042/2042[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step


Unnamed: 0,ID,Pred
0,2025_3101_3102,0.581024
1,2025_3101_3103,0.733680
2,2025_3101_3104,0.136101
3,2025_3101_3105,0.550794
4,2025_3101_3106,0.999139
...,...,...
65336,2025_3477_3479,0.448516
65337,2025_3477_3480,0.393640
65338,2025_3478_3479,0.256691
65339,2025_3478_3480,0.146921


## 9. Final Submission  

남자농구 예측과 여자 농구 예측을 합쳐서 최종 파일을 생성한다.

In [15]:
# Create final submission file
submission_df = pd.concat([
    df_submission_men,
    df_submission_women
], axis=0).sort_values(by='ID')

submission_df.reset_index(drop=True, inplace=True)

# Save submission file
submission_df.to_csv("submission.csv", index=False)
print("Submission file created successfully.")

Submission file created successfully.
