# Analysis Summary

The goal from this analysis will be to predict the outcome of March Madness games. To do this, I will look at historical data from NCAA tourney matchups from 2003 up to 2018 output predictions for the winning team for each game. For every March Madness game, each team will be assigned a score based on its regular season performance in  several statistical categories, assigned rankings, and how well its oppponent performs in several statistical categories. 

### Baseline Models
I will be using a few baseline models to compare the model's results against. The baseline models I will be using are:

    - March Madness Seed: The seed assigned to each team for a March Madness tournament
    - Rating Percentage Index (RPI):  Ranking assigned to each team based on a team's wins, losses, and strength of record
    - Pomeroy College Basketball Ratings (POM): Ken Pomeroy's annual college basketball ratings 

The baseline models pose the following hypothesis:

    - That in any NCAA tournament game, the winner of a game will have either a lower March Madness seed, lower RPI, or lower POM rating.
  
Intuitively, this is a reasonable prediction. RPI, March Madness seeds, and POM ratings are all based on a team's wins, losses, strength of schedule, and generally considered representative of a team's performance relative to other teams. If team A has a lower RPI than team B at the end of a season, it's generally considered that team A's performance throughout the season has been at a higher level than team B's. For this reason, if we knew nothing else about the two teams, predicting the outcome based on RPI, or the other two benchmarks, is a good starting point. 

### Data Sources Overview
The data that was used to perform this analysis came from the [Google Cloud & NCAA® ML Competition 2018-Men's Kaggle competition](https://www.kaggle.com/c/mens-machine-learning-competition-2018). The following is a brief summary of all the datasets that were used:
- NCAATourneyCompactResults: Contains records from each NCAA tournament game from 1985-2017, including score and region.  
- Teams: Contains information for each Division 1 (D1) basketball team including an ID, name, first and most recent year playing D1 basketball. 
- MasseyOrdinals_Prelim2018: Contains data from 2003-2017 surrounding each D1 team's rank from various rankings sources throughout. 
- RegularSeasonDetailedResults: Contains similar information as the NCAATourneyCompactResults dataset, with the addition that each row will also contain the totals in a variety of statistical categories for the winning team and losing team. These are categories that are often found in a boxscore. 

A more in depth description of each of the datasets that were used and additional datasets provided by Kaggle can be found at https://www.kaggle.com/c/mens-machine-learning-competition-2018/data

### Data Cleansing & Preparation

In order for the model to be created, the following was completed:
    1. A data frame was created that holds a set of statistics for each team in a given year used to evaluate the that team's offensive and defensive performance. 
    2. A data frame was created, where each row represents a matchup that occurred in an NCAA tournament and contains the set of statistics for each team in the matchup 

To accomplish the first task, the following statistics were calculated for each team:
    - Three Point Percentage (3P%)
    - Two Point Percentage (2P%)
    - Defensive Rebounds (DR)
    - Offensive Rebounds (OR)
    - Field Goal Percentage (FG%)
    - Free Throws Attempted (FTA)
    - Free Throw Percentage (FTP)
    - Turnovers (TO)
    - Turnovers Forced (TOF)
    - Personal Fouls (PF)
    - Efficient Field Goal Percentage (eFG): (FGM + 0.5 * 3PM) / FGA
    - Possessions (POS): FGA - ORB + TO + (0.4 x FTA)
    - Defensive Efficiency (dEff): 100 x PA / (FGA - ORB + TO + (0.4 x FTA))   
    - Offensive Effiency (oEff): 100 x (PS) / (FGA - ORB + TO + (0.4 x FTA))
    - Three Point Usage (3PP): 3PA / FGA
    - Two Point Usage (2PP): 2PA / FGA
    - Rankings Percentage Index (RPI)
    - March Madness Seed (Seed)
    - Ken Pomeroy Rating (POM)

Each statistic was calcualated using game data leading up to the tournament, including conference play. RPI, March Madness Seed, and POM ratings were found using the latest ranking given to a team before the tournament.

An example of the output data frame is seen below with a snapshot of some teams' performance in 2003:

In [95]:
statsDemo = setupStatsDF(2003)
statsDemo.head()

Unnamed: 0,teamId,oEff,dEff,eFG,FG%,3P%,FT%,3PP,2P%,2PP,FTA,OR,DR,TO,TOF,POS,PF,season
0,1102_2003,1.062575,1.057935,0.678187,0.481149,0.375643,0.651357,0.409857,0.596987,0.590143,17.107143,4.178571,16.821429,11.428571,12.964286,53.878571,18.75,2003
1,1103_2003,1.140972,1.131853,0.583886,0.486074,0.33871,0.73639,0.207334,0.545624,0.792666,25.851852,9.777778,19.925926,12.62963,15.333333,69.044444,19.851852,2003
2,1104_2003,1.061618,0.995951,0.531855,0.420362,0.320144,0.709898,0.275258,0.473684,0.724742,20.928571,13.571429,23.928571,13.285714,13.857143,65.264286,18.035714,2003
3,1105_2003,0.950489,1.015179,0.519039,0.395755,0.364815,0.705986,0.31672,0.411488,0.68328,21.846154,13.5,23.115385,18.653846,18.807692,75.507692,20.230769,2003
4,1106_2003,0.954755,0.956899,0.534561,0.423773,0.346154,0.646421,0.28804,0.460152,0.71196,16.464286,12.285714,23.857143,17.035714,15.071429,66.621429,18.178571,2003


To complete the second requirement of preparing the data for the model, two steps were taken: 
1. The data frame that was created above and a data frame holding the rankings for all teams by year were merged based on a shared custom id. The custom id was a combination of a team's id and the season the corresponding row of data was associated to. 
2. The merged data frame's features were normalized, converting all values to a scale from 0 - 1. The features were normalized for two reasons: 
    1. To allow each record of data to be interpreted the same way
    2. Eliminate any additional influence a feature with a large range (FTA) would have compared to a feature with a smaller range of data (FT%). 

The output of these two steps can be seen below:


In [99]:
ranks = getRankings()
seeds = getSeeds()
seedsAndRanks = mergeSeedsAndRanks(seeds, ranks)
statsSeeds = mergeSeedsRanksStats(seedsAndRanks, statsDemo)
normStats = normalizeFeatures(statsSeeds)
normStats.head()

Unnamed: 0,teamId,oEff,dEff,eFG,FG%,3P%,FT%,3PP,2P%,2PP,...,OR,DR,TO,TOF,POS,PF,season,Seed,POM,RPI
0,1104_2003,0.240034,0.44146,0.081066,0.160056,0.024833,0.540265,0.517317,0.273052,0.482683,...,0.687804,0.46772,0.317239,0.400189,0.342766,0.409248,2003,0.6,0.117647,0.171296
1,1112_2003,0.669673,0.187895,0.386234,0.559339,0.347937,0.489395,0.39575,0.538274,0.60425,...,0.880254,0.957187,0.482932,0.757805,0.954239,0.372801,2003,0.0,0.007353,0.00463
2,1113_2003,0.613722,0.54546,0.221122,0.722442,0.0,0.299035,0.0,0.636862,1.0,...,0.701961,0.38625,0.39614,0.598082,0.490229,0.585044,2003,0.6,0.113971,0.143519
3,1120_2003,0.188074,0.397887,0.422431,0.641261,0.288017,0.253537,0.366593,0.631687,0.633407,...,0.491648,0.288249,0.606018,0.631788,0.421546,0.196334,2003,0.6,0.161765,0.162037
4,1122_2003,0.17995,0.54458,0.379396,0.471159,0.333096,0.428905,0.442242,0.487484,0.557758,...,0.335534,0.506417,0.531149,0.417219,0.403711,0.489736,2003,0.8,0.448529,0.333333


### Model Creation & Score Assignment

Using the normalized statistics data frame and a data frame consisting of historical records of each March Madness tournament matchup, a score can be calculated for each team to predict the outcome of a tournament game. Each team is given a score based on several criteria:

- Offensive Performance Score
- Defensive Performance Score:
- External Ranking Score: 



Lastly, each matchup was analyzed to assign a score to each team based on offensive performance, defensive performance, and external rankings. 

In [None]:
matchDemo = createModelMatchupsDF()

In [74]:
import pandas as pd 
from scipy import stats as sciStats
import numpy as np
import team, game as g
import Model as mod

"""
stats for each team found from a certain year 
seeds and rankings are found for each team from a certain year 
data frames are merged linking stats to seeds/ranks
merged data frame is normalized to give all stats or ranks equal weight

each matchup from a year's tournament is used to calculate a score for each team
score is calculated using the normalized features and the logic in the findScore function
- Offensive score, defensive score, ranking score that are combined 
- weights used were based on four factors of basketball and then adjusted 

score for both team is calculated and then used to create a new data frame for the corresponding matchup 
a score for both teams in each matchup is calculated and a data frame with those scores is created 
A prediction for each matchup in data frame is found by looking at which team has highest calculated score 

model predictions are compared against baseline models based on how many games were picked accurately by year 
one tailed proportions test was done to test whether or not the models results were significant 
"""

In [75]:
def getTeamNames():
    """
    Return dictionary where key is team ID and value is team name
    """
    names = {}
    teams = pd.read_csv("Data/Teams.csv")
    for index, row in teams.iterrows():
        teamId = row["TeamID"]
        name = row["TeamName"]
        names[teamId] = name
    return names

def getSeasonStats(ncaaTourneyTeams):
    """
    Use regular season results and RPI rankings to create a 
    dictionary where key is the team's ID and the value is a 
    Team object. Team objects contain yearly avg stats for each 
    team in various categories.
    """
    teams = {}
    names = getTeamNames()
    unfiltRanks = pd.read_csv("data/MasseyOrdinals_Prelim2018.csv")
    rankings = unfiltRanks[(unfiltRanks["SystemName"] == "RPI") & (unfiltRanks["RankingDayNum"] == 133)]
    regSeasonResults = pd.read_csv("data/RegularSeasonDetailedResults.csv")
    for index, row in regSeasonResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)
        wRPI = None
        lRPI = None
        try:
            wRPI = rankings[(rankings["Season"] == season) & (rankings["TeamID"] == wTeamId)].iloc[0]["OrdinalRank"]
            lRPI = rankings[(rankings["Season"] == season) & (rankings["TeamID"] == lTeamId)].iloc[0]["OrdinalRank"]
        except Exception as e:
            pass
            # print str(lTeamId) + " " + str(season) + " not found"
        
        if customWId not in teams:
            teams[customWId] = team.Team(customWId)
        if customLId not in teams:
            teams[customLId] = team.Team(customLId)
        wTeam = teams[customWId]
        wTeam.RPI = wRPI
        wTeam.name = names[wTeamId]
        wTeam.updateStats(row, True)
        if customLId in ncaaTourneyTeams:
            wTeam.winsVsTourney += 1
        lTeam = teams[customLId]
        lTeam.name = names[lTeamId]
        lTeam.RPI = lRPI
        lTeam.updateStats(row, False)
        
    for t in teams:
        teams[t].setScore()
    
    return teams

def getTeamStats(teams):
    """
    Get season stats for each team in a Data Frame. Able to see yearly averages and totals for stored
    statistical categories in each team object in teams dictionary
    """
    allTeamData = []
    visited = set()
    for team in teams:
        if team not in visited:
            idAndSeason = team.split("_")
            season = idAndSeason[1]
            _id = idAndSeason[0]
            teamData = teams[team].objToDict()
            teamData['season'] = season
            allTeamData.append(teamData)
            visited.add(team)
    return pd.DataFrame(allTeamData)

def dataToDict(df):
    """
    Convert dataframe to list to dictionary where key is team id and value is row data
    """
    data = {}
    for index, row in df.iterrows():
        data[row["_id"]] = row.to_dict()
    return data

def normalizeFeatures(teamStats):
    """
    Uses team stats for all seasons and returns data frame with stats normalized by season
    """
    toReturn = teamStats.copy()
    copy = toReturn.copy()
    copy['season'] = copy['season'].astype(int)
    copy.drop(copy.select_dtypes(['object']), inplace=True, axis=1)
    normalized = normalize(copy, 'season')
#     normalized.drop(columns=['numGamesPlayed'])
    toReturn.update(normalized)
    return toReturn

def normalize(df, by):
    """
    groups df by season and normalizes each statistical category of features from 0 - 1
    """
    groups = df.groupby(by)
    mins = groups.transform(np.min)
    maxs = groups.transform(np.max)
    return (df[mins.columns] - mins) / (maxs - mins)

def populateNCAATourneyTeams():
    """
    Create an ID for each team using a combination of its id and the season the team played in. 
    Output a dictionary with an entry for each team whose key is its newly created id
    """
    ncaaTourneyTeams = {}
    ncaaTournResults = pd.read_csv("data/NCAATourneyCompactResults.csv")
    for index, row in ncaaTournResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)

        if customWId not in ncaaTourneyTeams:
            ncaaTourneyTeams[customWId] = 1
        if customLId not in ncaaTourneyTeams:
            ncaaTourneyTeams[customLId] = 1
    return ncaaTourneyTeams

def getMatchups(teams, inputDataModified=False):
    """
    Use NCAA Tournament results to return data frame of matchups where each row contains data for one matchup between two teams, including their yearly avg totals in statistical categories, RPI, and game result.
    """
    matchups = []
    ncaaTournResults = pd.read_csv("data/NCAATourneyCompactResults.csv")
    for index, row in ncaaTournResults.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)

        if customWId in teams and customLId in teams:
            wTeamData, lTeamData = {}, {}
            if inputDataModified:
                wTeamData = teams[customWId].copy()
                del wTeamData['season']
            else:
                wTeamData = teams[customWId].objToDict().copy()
            wTeamDataMod, lTeamDataMod = {}, {}
            for key in wTeamData.keys():
                wTeamDataMod["w" + key] = wTeamData[key]
                
            if inputDataModified:
                lTeamData = teams[customLId]
                del lTeamData['season']
            else:
                lTeamData = teams[customLId].objToDict().copy()
            for key in lTeamData.keys():
                lTeamDataMod["l" + key] = lTeamData[key]
                
            matchupData = wTeamDataMod.copy()
            matchupData.update(lTeamDataMod)
            matchupData["dayNum"] = dayNum
            matchupData["season"] = season
            matchups.append(matchupData)
    colOrder = ["dayNum", "season", "l_id", "lname", "w_id", "wname", "lscore", "wscore", "lDRB", "lEFG", "lFTA", "lFTP", "lMOL", "lMOV", "lORB", "lPOSS",
                "lRPI", "lTO", "lTOF", "lconfTournWins", "ldEff", "lnumGamesPlayed", "loEff", "lwinsVsTourney",
                "wDRB", "wEFG", "wFTA", "wFTP", "wMOL", "wMOV", "wORB", "wPOSS", "wRPI", "wTO", "wTOF", 
                "wconfTournWins", "wdEff", "wnumGamesPlayed", "woEff", "wwinsVsTourney"]
    df = pd.DataFrame.from_dict(matchups)
    df = df[colOrder]
    return df

def getMatchupData():
    """
    Returns data frame of historical matchups in NCAA tournament.
    Reads in existing CSV if available. Otherwise, produces data frame by creating Team objects, calculating yearly avg totals for each team, and joining with historical NCAA tourney matchup data
    """
    try:
        matchups = pd.read_csv("Data/output/matchups_normalized.csv")
        return matchups
    except Exception as e:
        ncaaTourneyTeams = populateNCAATourneyTeams()
        teamObjs = getSeasonStats(ncaaTourneyTeams)
        teamStats = getTeamStats(teamObjs)
        normalized = normalizeFeatures(teamStats)
        dataAsDict = dataToDict(normalized)
        matchups = getMatchups(dataAsDict, inputDataModified=True)
        matchups.to_csv("Data/output/matchups_normalized.csv", index=False)
        return matchups

In [76]:
def getRankings():
    unfiltRanks = pd.read_csv("data/MasseyOrdinals_Prelim2018.csv")
    rankings = unfiltRanks[(unfiltRanks["SystemName"] == "POM") & (unfiltRanks["RankingDayNum"] == 133) | (unfiltRanks["SystemName"] == "RPI") & (unfiltRanks["RankingDayNum"] == 133)]
    transpose = {}
    for index, row in rankings.iterrows():
        teamId = str(row["TeamID"]) + "_" + str(row["Season"])
        if teamId in transpose:
            if row["SystemName"] == "RPI":
                transpose[teamId]["RPI"] = row["OrdinalRank"]
            else:
                transpose[teamId]["POM"] = row["OrdinalRank"]
        else:
            transpose[teamId] = {}
            if row["SystemName"] == "RPI":
                transpose[teamId]["RPI"] = row["OrdinalRank"]
            else:
                transpose[teamId]["POM"] = row["OrdinalRank"]
    ranksDF = pd.DataFrame.from_dict(orient = "index", data = transpose).reset_index()
    ranksDF.columns = ["teamId", "POM", "RPI"]
    return ranksDF

def getSeeds():
    seeds = pd.read_csv("data/NCAATourneySeeds.csv")
    seeds["teamId"] = seeds.TeamID.astype(str).str.cat(seeds.Season.astype(str), sep='_')
    seeds["Seed"] = seeds.Seed.str.extract('(\d+)', expand=False).astype(int)
    seeds = seeds.drop(["Season", "TeamID"], axis = 1)
    return seeds

def mergeSeedsAndRanks(seeds, ranks):
    return seeds.merge(ranks, how = "inner", on = "teamId")

def mergeSeedsRanksStats(seedsAndRanks, stats):
    return stats.merge(seedsAndRanks, how = "inner", on = "teamId")

def normalizeFeatures(teamStats):
    """
    Uses team stats for all seasons and returns data frame with stats normalized by season
    """
    toReturn = teamStats.copy()
    copy = toReturn.copy()
    copy['season'] = copy['season'].astype(int)
    copy.drop(copy.select_dtypes(['object']), inplace=True, axis=1)
    normalized = normalize(copy, 'season')
    toReturn.update(normalized)
    return toReturn

def normalize(df, by):
    """
    groups df by season and normalizes each statistical category of features from 0 - 1
    """
    groups = df.groupby(by)
    mins = groups.transform(np.min)
    maxs = groups.transform(np.max)
    return (df[mins.columns] - mins) / (maxs - mins)

def findScore(teamA, teamB):
    offScoreData = [5*teamA["oEff"].values[0],(1 - teamA["POS"] - teamB["POS"].values[0]).values[0],2*teamA["3PP"].values[0],3*teamA["3P%"].values[0],8*teamA["FG%"].values[0],5*teamA["FTA"].values[0],3.5*teamA["FT%"].values[0],2*teamA["OR"].values[0],2.5*teamA["TO"].values[0]]
    defScoreData = [8.5*(1-teamA["dEff"]).values[0], 5*teamA["DR"].values[0], 3*teamA["TOF"].values[0], 5*teamA["PF"].values[0]]
    seedData = [12.5*(1-teamA["Seed"]).values[0], 10*(1-teamA["POM"]).values[0], 7.5*(1-teamA["RPI"]).values[0]]
    offScore = (5*teamA["oEff"]*(1 - teamA["POS"] - teamB["POS"].values[0]) + 4*teamA["3P%"] + 8*teamA["2P%"] + 5*teamA["FTA"] + 3.5*teamA["FT%"] + 2*teamA["OR"] - 2.5*teamA["TO"])
    defScore = (8.5*(1-teamA["dEff"]) + 5*teamA["DR"] + 3*teamA["TOF"] - 5*teamA["PF"])
    seedScore = 12.5*(1-teamA["Seed"]) + 10*(1-teamA["POM"]) + 7.5*(1-teamA["RPI"])
    total = offScore.values[0] + defScore.values[0] + seedScore.values[0]
    scoreData = [teamA["teamId"].values[0]] + offScoreData + [offScore.values[0]] + defScoreData + [defScore.values[0]] + seedData + [seedScore.values[0]] + [total]
    return scoreData


In [94]:
def setupStatsDF(year):
    allGames = []
    regSeasonResults = pd.read_csv("data/RegularSeasonDetailedResults.csv")
    games = regSeasonResults[regSeasonResults["Season"] == year]
    for index, row in games.iterrows():
        season = row["Season"]
        dayNum = row["DayNum"]
        wTeamId = row["WTeamID"]
        lTeamId = row["LTeamID"]
        customWId = str(wTeamId) + "_" + str(season)
        customLId = str(lTeamId) + "_" + str(season)
        row1 = [customWId, season, dayNum, row["WScore"], row["LScore"], row[30]] + list(row[8:21])
        row2 = [customLId, season, dayNum, row["LScore"], row["WScore"], row[17]] + list(row[21:])
        allGames.append(row1)
        allGames.append(row2)
    colNames = ["teamId", "season", "dayNum", "score", "oppScore", "TOF", "FGM", "FGA", "FGM3", "FGA3", "FTM", "FTA", "OR", "DR", "Ast", 
               "TO", "Stl", "Blk", "PF"]
    gamesDF = pd.DataFrame(allGames, columns = colNames)
    gamesDF = gamesDF.drop(["dayNum"], axis = 1)
    gamesDF["season"] = gamesDF["season"].astype("object")
    gamesDF["teamId"] = gamesDF["teamId"].astype("object")
    statDevs = gamesDF.groupby(by = "teamId").std().reset_index() ## Consider using later
    gamesDF["POS"] = gamesDF["FGA"] - gamesDF["OR"] + gamesDF["TO"] + (0.4 * gamesDF["FTA"])
    sums = gamesDF.groupby(by = "teamId").sum().reset_index()
    means = gamesDF.groupby(by = "teamId").mean().reset_index()

    sums["oEff"] = sums["score"]/sums["POS"]
    sums["dEff"] = sums["oppScore"]/sums["POS"]
    sums["eFG"] = (sums["FGM"] + (0.5 + sums["FGM3"])) / sums["FGA"]
    sums["FG%"] = sums["FGM"]/sums["FGA"]
    sums["3P%"] = sums["FGM3"]/sums["FGA3"]
    sums["3PP"] = 3*sums["FGM3"]/sums["score"]
    sums["FT%"] = sums["FTM"]/sums["FTA"]
    sums["2P%"] = (sums["FGM"] - sums["FGM3"]) / (sums["FGA"] - sums["FGA3"])
    sums["2PP"] = 1 - sums["3PP"]

    meanSub = means[["teamId", "FTA", "OR", "DR", "TO", "TOF", "POS", "PF"]]
    sumSub = sums[["teamId", "oEff", "dEff", "eFG", "FG%", "3P%", "FT%", "3PP", "2P%", "2PP"]]
    stats = sumSub.merge(meanSub, how='left', left_on = "teamId", right_on = "teamId")
    stats["season"] = stats["teamId"].apply(lambda x: x[x.index("_") + 1:])
    return stats


In [96]:
def createModelMatchupsDF():
    ranks = getRankings()
    seeds = getSeeds()
    seedsAndRanks = mergeSeedsAndRanks(seeds, ranks)
    names = getTeamNames()
    matchups = []
    teamScoreInfo = []
    visited = set()
    for year in range(2003, 2018):
        statsDF = setupStatsDF(year)
        statsSeeds = mergeSeedsRanksStats(seedsAndRanks, statsDF)
        normStats = normalizeFeatures(statsSeeds)
        tournRes = pd.read_csv("data/NCAATourneyCompactResults.csv")
        tournRes = tournRes[tournRes["DayNum"] > 135]
        recentRes = tournRes[tournRes["Season"] == year]

        for index, row in recentRes.iterrows():
            season = row["Season"]
            wTeamId = row["WTeamID"]
            lTeamId = row["LTeamID"]
            customWId = str(wTeamId) + "_" + str(season)
            customLId = str(lTeamId) + "_" + str(season)
            teamW = normStats[normStats["teamId"] == customWId]
            teamL = normStats[normStats["teamId"] == customLId]
            seedRankW = [teamW["RPI"].values[0], teamW["POM"].values[0], teamW["Seed"].values[0]]
            seedRankL = [teamL["RPI"].values[0], teamL["POM"].values[0], teamL["Seed"].values[0]]
            if not teamW.empty:
                wScore = findScore(teamW, teamL)
                lScore = findScore(teamL, teamW)
                if customWId not in visited:
                    teamScoreInfo.append(wScore)
                    visited.add(customWId)
                if customLId not in visited:
                    teamScoreInfo.append(lScore)
                    visited.add(customLId)
                matchups.append([season, customWId, names[wTeamId], customLId, names[lTeamId], wScore[-1], lScore[-1]] + seedRankW + seedRankL)
    matchDF = pd.DataFrame(matchups, columns = ["season", "wId", "wName", "lId", "lName", "wScore", "lScore", 
                                               "wRPI", "wPOM", "wSeed", "lRPI", "lPOM", "lSeed"])
    matchDF["predicted"] = matchDF["wScore"] > matchDF["lScore"]
    matchDF["seedBaseline"] = matchDF["wSeed"] < matchDF["lSeed"]
    matchDF["pomBaseline"] = matchDF["wPOM"] < matchDF["lPOM"]
    matchDF["rpiBaseline"] = matchDF["wRPI"] < matchDF["lRPI"]
    matchDF.to_csv("data/output/matchupRes_regression.csv", index = False)
    return matchDF

"""
stats for each team found from a certain year 
seeds and rankings are found for each team from a certain year 
data frames are merged linking stats to seeds/ranks
merged data frame is normalized to give all stats or ranks equal weight
each matchup from a year's tournament is used to calculate a score for each team
score is calculated using the normalized features and the logic in the findScore function
- Offensive score, defensive score, ranking score that are combined 
- weights used were based on four factors of basketball and then adjusted 

score for both team is calculated and then used to create a new data frame for the corresponding matchup 
a score for both teams in each matchup is calculated and a data frame with those scores is created 
A prediction for each matchup in data frame is found by looking at which team has highest calculated score 

model predictions are compared against baseline models based on how many games were picked accurately by year 
one tailed proportions test was done to test whether or not the models results were significant 

"""






                                  

"\nstats for each team found from a certain year \nseeds and rankings are found for each team from a certain year \ndata frames are merged linking stats to seeds/ranks\nmerged data frame is normalized to give all stats or ranks equal weight\neach matchup from a year's tournament is used to calculate a score for each team\nscore is calculated using the normalized features and the logic in the findScore function\n- Offensive score, defensive score, ranking score that are combined \n- weights used were based on four factors of basketball and then adjusted \n\nscore for both team is calculated and then used to create a new data frame for the corresponding matchup \na score for both teams in each matchup is calculated and a data frame with those scores is created \nA prediction for each matchup in data frame is found by looking at which team has highest calculated score \n\nmodel predictions are compared against baseline models based on how many games were picked accurately by year \none ta

In [79]:
# no less than for seedbaseline
modelComparisons = matchDF.groupby('season')["predicted", "seedBaseline", "pomBaseline", "rpiBaseline"].agg('sum')

In [80]:
myModel = np.array(modelComparisons ["predicted"])
seed = np.array(modelComparisons["seedBaseline"])
np.random.seed(12345678)
sciStats.ttest_rel(myModel,seed)

Ttest_relResult(statistic=2.1754067128234165, pvalue=0.047224674617508226)