# Overview

## What to Predict

- Stage 1 - You should submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (seasons 2015-2019).
- Stage 2 - You should submit predicted probabilities for every possible matchup before the 2020 tournament begins.

Refer to the [Timeline page](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/overview/timeline) for specific dates. In both stages, the sample submission will tell you which games to predict.
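
To make "every possible matchup" concrete, the following is a minimal sketch that enumerates all pairings for one season. It assumes the submission `ID` follows a `Season_LowerTeamID_HigherTeamID` pattern, which should be checked against the sample submission file. The `build_matchup_ids` helper and the `team_ids` list are hypothetical and used only for illustration.

```python
from itertools import combinations


def build_matchup_ids(season, team_ids):
    """Return one ID per unordered pair of teams, lower TeamID first."""
    return [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]


# Hypothetical field of 68 consecutive team IDs (not the real tournament field)
team_ids = list(range(1101, 1169))
ids = build_matchup_ids(2019, team_ids)

print(len(ids))  # number of unordered pairs: 68 * 67 / 2
print(ids[0])
```

With 68 teams per tournament this gives 2,278 pairings per season, so the row count of the sample submission is a quick sanity check on any generated ID list.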

# Import Packages

In [1]:
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import rcParams
#rcParams['figure.figsize'] = 20, 6
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Data Download

In [2]:
# Downloading the Data using Kaggle API
!kaggle competitions download -c google-cloud-ncaa-march-madness-2020-division-1-mens-tournament

Downloading google-cloud-ncaa-march-madness-2020-division-1-mens-tournament.zip to F:\OneDrive - Georgia State University\Data Science\Competition\Google Cloud & NCAA® ML Competition 2020-NCAAM





# Data Import

In [3]:
def get_file_list(datapath):
    # Build a list of every file under the given directory,
    # descending into subdirectories
    listOfFile = os.listdir(datapath)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(datapath, entry)
        # If the entry is a directory, recurse to collect the files inside it
        if os.path.isdir(fullPath):
            allFiles = allFiles + get_file_list(fullPath)
        else:
            allFiles.append(fullPath)

    return allFiles
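
The same recursive listing can also be sketched with the standard library's `os.walk`, which handles the descent into subdirectories itself. The function name `get_file_list_walk` is hypothetical and not part of the notebook. Sorting the result is an added choice to make the order deterministic.

```python
import os


def get_file_list_walk(datapath):
    """Collect the full path of every file under datapath, recursively."""
    all_files = []
    # os.walk yields (directory, subdirectory names, file names) for each level
    for root, _dirs, files in os.walk(datapath):
        for name in files:
            all_files.append(os.path.join(root, name))
    return sorted(all_files)
```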

In [4]:
main_path = 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\'
sub_folders = os.listdir(main_path)
datapath = main_path + sub_folders[0] + '\\'

# Get the list of all files in directory tree at given path
data_list = get_file_list(datapath)

data_list

['D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\Data Section 1 - The Basics\\MNCAATourneyCompactResults.csv',
 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\Data Section 1 - The Basics\\MNCAATourneySeeds.csv',
 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\Data Section 1 - The Basics\\MRegularSeasonCompactResults.csv',
 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\Data Section 1 - The Basics\\MSampleSubmissionStage1_2020.csv',
 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NCAA® ML Competition 2020-NCAAM\\Data Section 1 - The Basics\\MSeasons.csv',
 'D:\\OneDrive - Georgia State University\\Data Science\\Competition\\Data\\Google Cloud & NC

In [5]:
tourney_compact_result_M = pd.read_csv(datapath + 'MNCAATourneyCompactResults.csv')
tourney_seed_M = pd.read_csv(datapath + 'MNCAATourneySeeds.csv')
regular_compact_result_M = pd.read_csv(datapath + 'MRegularSeasonCompactResults.csv')
season_M = pd.read_csv(datapath + 'MSeasons.csv')
teams_M = pd.read_csv(datapath + 'MTeams.csv')

MSampleSubmissionStage1_2020 = pd.read_csv(datapath + 'MSampleSubmissionStage1_2020.csv')

# Exploratory Data Analysis

For each dataset, we will:

- handle missing values
- find correlations between variables
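
The two steps listed above can be sketched on a small stand-in frame. The values below are illustrative, with one missing entry introduced on purpose. Filling with 0 is only one possible way to handle it, not necessarily the approach taken later in this notebook.

```python
import pandas as pd

# Small illustrative frame; the missing NumOT value is artificial
df = pd.DataFrame({
    "WScore": [63, 59, 68, 58],
    "LScore": [54, 58, 43, 55],
    "NumOT": [0, 1, 0, None],
})

# Step 1: count missing cells per column, then fill them (assumed strategy: 0)
missing_per_column = df.isna().sum()
df_filled = df.fillna({"NumOT": 0})

# Step 2: pairwise Pearson correlation between the numeric columns
corr = df_filled.select_dtypes("number").corr()

print(missing_per_column)
print(corr)
```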

## Section 1 - The Basics

In [48]:
tourney_compact_result_M.info()
tourney_compact_result_M

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2251 entries, 0 to 2250
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   2251 non-null   int64 
 1   DayNum   2251 non-null   int64 
 2   WTeamID  2251 non-null   int64 
 3   WScore   2251 non-null   int64 
 4   LTeamID  2251 non-null   int64 
 5   LScore   2251 non-null   int64 
 6   WLoc     2251 non-null   object
 7   NumOT    2251 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 140.8+ KB


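|  | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1985 | 136 | 1116 | 63 | 1234 | 54 | N | 0 |
| 1 | 1985 | 136 | 1120 | 59 | 1345 | 58 | N | 0 |
| 2 | 1985 | 136 | 1207 | 68 | 1250 | 43 | N | 0 |
| 3 | 1985 | 136 | 1229 | 58 | 1425 | 55 | N | 0 |
| 4 | 1985 | 136 | 1242 | 49 | 1325 | 38 | N | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2246 | 2019 | 146 | 1120 | 77 | 1246 | 71 | N | 1 |
| 2247 | 2019 | 146 | 1277 | 68 | 1181 | 67 | N | 0 |
| 2248 | 2019 | 152 | 1403 | 61 | 1277 | 51 | N | 0 |
| 2249 | 2019 | 152 | 1438 | 63 | 1120 | 62 | N | 0 |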


In [53]:
tourney_seed_M.info()
tourney_seed_M

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2286 entries, 0 to 2285
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Season  2286 non-null   int64 
 1   Seed    2286 non-null   object
 2   TeamID  2286 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 53.7+ KB


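|  | Season | Seed | TeamID |
| --- | --- | --- | --- |
| 0 | 1985 | W01 | 1207 |
| 1 | 1985 | W02 | 1210 |
| 2 | 1985 | W03 | 1228 |
| 3 | 1985 | W04 | 1260 |
| 4 | 1985 | W05 | 1374 |
| ... | ... | ... | ... |
| 2281 | 2019 | Z12 | 1332 |
| 2282 | 2019 | Z13 | 1414 |
| 2283 | 2019 | Z14 | 1330 |
| 2284 | 2019 | Z15 | 1159 |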
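
The `Seed` column is stored as text, so a numeric seed has to be extracted before it can be used in a model. The sketch below assumes the layout seen in the rows above: one region letter followed by a two-digit seed. Slicing characters 1 to 2 ignores any trailing characters that some seeds may carry. The column names `Region` and `SeedNum` are hypothetical.

```python
import pandas as pd

# A few rows copied from the seed table above
seeds = pd.DataFrame({
    "Season": [1985, 1985, 2019],
    "Seed": ["W01", "W02", "Z12"],
    "TeamID": [1207, 1210, 1332],
})

# Assumed layout: region letter, then a two-digit seed number
seeds["Region"] = seeds["Seed"].str[0]
seeds["SeedNum"] = seeds["Seed"].str[1:3].astype(int)

print(seeds)
```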


In [58]:
regular_compact_result_M.info()
regular_compact_result_M

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161552 entries, 0 to 161551
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Season   161552 non-null  int64 
 1   DayNum   161552 non-null  int64 
 2   WTeamID  161552 non-null  int64 
 3   WScore   161552 non-null  int64 
 4   LTeamID  161552 non-null  int64 
 5   LScore   161552 non-null  int64 
 6   WLoc     161552 non-null  object
 7   NumOT    161552 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 9.9+ MB


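|  | Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1985 | 20 | 1228 | 81 | 1328 | 64 | N | 0 |
| 1 | 1985 | 25 | 1106 | 77 | 1354 | 70 | H | 0 |
| 2 | 1985 | 25 | 1112 | 63 | 1223 | 56 | H | 0 |
| 3 | 1985 | 25 | 1165 | 70 | 1432 | 54 | H | 0 |
| 4 | 1985 | 25 | 1192 | 86 | 1447 | 74 | H | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161547 | 2019 | 132 | 1153 | 69 | 1222 | 57 | N | 0 |
| 161548 | 2019 | 132 | 1209 | 73 | 1426 | 64 | N | 0 |
| 161549 | 2019 | 132 | 1277 | 65 | 1276 | 60 | N | 0 |
| 161550 | 2019 | 132 | 1387 | 55 | 1382 | 53 | N | 0 |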
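
Because each row records a winner and a loser, a per-team season record can be derived by counting how often a team appears in each role. This is a minimal sketch on made-up rows, not a step taken from the notebook. The names `record`, `Wins`, `Losses`, and `WinPct` are hypothetical.

```python
import pandas as pd

# Made-up games for illustration only
games = pd.DataFrame({
    "Season": [1985, 1985, 1985],
    "WTeamID": [1228, 1106, 1228],
    "LTeamID": [1328, 1354, 1106],
})

# Count appearances as winner and as loser, aligned on (Season, TeamID)
wins = (games.groupby(["Season", "WTeamID"]).size()
        .rename_axis(["Season", "TeamID"]).rename("Wins"))
losses = (games.groupby(["Season", "LTeamID"]).size()
          .rename_axis(["Season", "TeamID"]).rename("Losses"))

# Teams missing from one side get a count of 0
record = pd.concat([wins, losses], axis=1).fillna(0).astype(int)
record["WinPct"] = record["Wins"] / (record["Wins"] + record["Losses"])

print(record)
```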


In [56]:
season_M.info()
season_M

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   36 non-null     int64 
 1   DayZero  36 non-null     object
 2   RegionW  36 non-null     object
 3   RegionX  36 non-null     object
 4   RegionY  36 non-null     object
 5   RegionZ  36 non-null     object
dtypes: int64(1), object(5)
memory usage: 1.8+ KB


|  | Season | DayZero | RegionW | RegionX | RegionY | RegionZ |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1985 | 10/29/1984 | East | West | Midwest | Southeast |
| 1 | 1986 | 10/28/1985 | East | Midwest | Southeast | West |
| 2 | 1987 | 10/27/1986 | East | Southeast | Midwest | West |
| 3 | 1988 | 11/2/1987 | East | Midwest | Southeast | West |
| 4 | 1989 | 10/31/1988 | East | West | Midwest | Southeast |
| 5 | 1990 | 10/30/1989 | East | Midwest | Southeast | West |
| 6 | 1991 | 10/29/1990 | East | Southeast | Midwest | West |
| 7 | 1992 | 11/4/1991 | East | West | Midwest | Southeast |
| 8 | 1993 | 11/2/1992 | East | Midwest | Southeast | West |
| 9 | 1994 | 11/1/1993 | East | Southeast | Midwest | West |
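
`DayZero` is loaded as text (dtype `object`), so it needs parsing before it can be combined with the `DayNum` column of the results tables. The sketch below assumes that `DayNum` counts days elapsed since that season's `DayZero`. The `GameDate` column name is hypothetical.

```python
import pandas as pd

# One season row and two DayNum values taken from the tables above
seasons = pd.DataFrame({"Season": [1985], "DayZero": ["10/29/1984"]})
games = pd.DataFrame({"Season": [1985, 1985], "DayNum": [20, 136]})

# Parse the month/day/year strings into timestamps
seasons["DayZero"] = pd.to_datetime(seasons["DayZero"], format="%m/%d/%Y")

# Attach each game's DayZero, then add DayNum as a day offset (assumed meaning)
games = games.merge(seasons, on="Season", how="left")
games["GameDate"] = games["DayZero"] + pd.to_timedelta(games["DayNum"], unit="D")

print(games[["Season", "DayNum", "GameDate"]])
```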


In [57]:
teams_M.info()
teams_M

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TeamID         367 non-null    int64 
 1   TeamName       367 non-null    object
 2   FirstD1Season  367 non-null    int64 
 3   LastD1Season   367 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 11.6+ KB


|  | TeamID | TeamName | FirstD1Season | LastD1Season |
| --- | --- | --- | --- | --- |
| 0 | 1101 | Abilene Chr | 2014 | 2020 |
| 1 | 1102 | Air Force | 1985 | 2020 |
| 2 | 1103 | Akron | 1985 | 2020 |
| 3 | 1104 | Alabama | 1985 | 2020 |
| 4 | 1105 | Alabama A&M | 2000 | 2020 |
| ... | ... | ... | ... | ... |
| 362 | 1463 | Yale | 1985 | 2020 |
| 363 | 1464 | Youngstown St | 1985 | 2020 |
| 364 | 1465 | Cal Baptist | 2019 | 2020 |
| 365 | 1466 | North Alabama | 2019 | 2020 |
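
The team table is mainly useful as a lookup from `TeamID` to a readable name. A minimal sketch of that lookup follows. The team rows are copied from the table above, while the `results` rows are made up and the `WTeamName` and `LTeamName` columns are hypothetical.

```python
import pandas as pd

teams = pd.DataFrame({
    "TeamID": [1101, 1102, 1103],
    "TeamName": ["Abilene Chr", "Air Force", "Akron"],
})

# Made-up results, used only to demonstrate the lookup
results = pd.DataFrame({"WTeamID": [1103, 1101], "LTeamID": [1102, 1103]})

# Series indexed by TeamID, so .map() can translate IDs to names
id_to_name = teams.set_index("TeamID")["TeamName"]
results["WTeamName"] = results["WTeamID"].map(id_to_name)
results["LTeamName"] = results["LTeamID"].map(id_to_name)

print(results)
```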


# Reference

- Primary: [google-cloud-ncaa-march-madness-2020-division-1-mens-tournament](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament)
- Secondary: