# Chasing Aggregates

It looks like all of our `cml_` fields are hovering around `50%`. That makes it feel like we are calculating the "for" and "against" values at the same time. That would be wrong for all of these fields.

## Steps to Investigate
- Set up a test to `calculate aggregates` for a single year
- Print out some aggregate for that year, week by week for a team,
  as well as the other meaningful data for that team

In [1]:
import utils.game_utils as gu
import utils.plot as guplot

import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

nfld = gu.NFL_Data()

YEAR = 2018
WEEK = 14
TEAM = gu.TEAM_NAME['Saints']

## Calculate Aggregates

In [5]:
# TBD - once we have the data presenting below

## Print Aggregates for the week

The intent here is to see the data for a team week by week so that we can watch the aggregation at work. A visualization may be helpful as well.

In [4]:
FIELDS = gu.COMMON_FIELDS + [
  'team_cml_pass_yards_before',
  'team_pass_yards',
  'team_cml_pass_yards_after',
]

nfl_df = nfld.data_by_team()
year_df = gu.get_year(nfl_df, YEAR)
team_df = year_df[year_df['team'] == TEAM]
team_df[FIELDS]

| | date | year | week | team | team_score | opponent | opponent_score | win | home | team_cml_pass_yards_before | team_pass_yards | team_cml_pass_yards_after |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4106 | 2018-09-09 | 2018 | 1 | New Orleans Saints | 40 | Tampa Bay Buccaneers | 48 | 0 | 1 | 0 | 439 | 439 |
| 4144 | 2018-09-16 | 2018 | 2 | New Orleans Saints | 21 | Cleveland Browns | 18 | 1 | 1 | 439 | 243 | 682 |
| 4165 | 2018-09-23 | 2018 | 3 | New Orleans Saints | 43 | Atlanta Falcons | 37 | 1 | 0 | 682 | 396 | 1078 |
| 4217 | 2018-09-30 | 2018 | 4 | New Orleans Saints | 33 | New York Giants | 18 | 1 | 0 | 1078 | 227 | 1305 |
| 4250 | 2018-10-08 | 2018 | 5 | New Orleans Saints | 43 | Washington Redskins | 19 | 1 | 1 | 1305 | 363 | 1668 |
| 4301 | 2018-10-21 | 2018 | 7 | New Orleans Saints | 24 | Baltimore Ravens | 23 | 1 | 0 | 1668 | 212 | 1880 |
| 4335 | 2018-10-28 | 2018 | 8 | New Orleans Saints | 30 | Minnesota Vikings | 20 | 1 | 0 | 1880 | 164 | 2044 |
| 4358 | 2018-11-04 | 2018 | 9 | New Orleans Saints | 45 | Los Angeles Rams | 35 | 1 | 1 | 2044 | 346 | 2390 |
| 4367 | 2018-11-11 | 2018 | 10 | New Orleans Saints | 51 | Cincinnati Bengals | 14 | 1 | 0 | 2390 | 265 | 2655 |
| 4412 | 2018-11-18 | 2018 | 11 | New Orleans Saints | 48 | Philadelphia Eagles | 7 | 1 | 1 | 2655 | 373 | 3028 |
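
The before/after pattern in this output can be reproduced with a grouped cumulative sum. What follows is a minimal sketch, not the `gu` implementation: it assumes one row per team per game, reuses the column names from the output above, and uses the first four Saints games of 2018 as sample data.

```python
import pandas as pd

# Sample rows copied from the Saints output (weeks 1-4 of 2018).
games = pd.DataFrame({
    "team": ["New Orleans Saints"] * 4,
    "year": [2018] * 4,
    "week": [1, 2, 3, 4],
    "team_pass_yards": [439, 243, 396, 227],
})

# Order the games so the running total follows the schedule.
games = games.sort_values(["team", "year", "week"])

# Running total per team and season, including the current game.
season = games.groupby(["team", "year"])["team_pass_yards"]
games["team_cml_pass_yards_after"] = season.cumsum()

# Total going into the game: the running total minus the current game.
games["team_cml_pass_yards_before"] = (
    games["team_cml_pass_yards_after"] - games["team_pass_yards"]
)

print(games[["week", "team_cml_pass_yards_before",
             "team_pass_yards", "team_cml_pass_yards_after"]])
```

Grouping by team and year is an assumption that the totals restart every season, which is what the week 1 row (`before` equal to 0) suggests.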


### Checking if accumulation is not working

The cell below goes through all `cml` fields to check whether each `_after` value equals `_before` plus the current game's value. At the time of this writing there were no rows where this was incorrect.

In [12]:
nfl_df = nfld.data_by_team()
year_df = gu.get_year(nfl_df, YEAR)

cols = nfl_df.columns
team_cml_cols = pd.Series(cols[cols.str.contains('team_cml_')]).sort_values()
cml_cols = team_cml_cols.apply(lambda x: x.replace('team_', '')).values
# cml_cols = team_cml_cols.values

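df = nfl_df
for col in cml_cols:
  # the `_perf` fields are left out of this check
  if "_perf" not in col:
    # strip prefix and suffix to get the base name, e.g. cml_pass_yards_before -> pass_yards
    cml_root_field = col.replace('cml_', '').replace('_before', '').replace('_after', '')
    # the per-game column behind the cumulative points field is team_score
    root_field = 'score' if cml_root_field == 'points' else cml_root_field
    before = df[f'team_cml_{cml_root_field}_before']
    after = df[f'team_cml_{cml_root_field}_after']
    current = df[f'team_{root_field}']
    # count the rows where after != before + current
    mismatches = (after != before + current).sum()
    print(f"[{mismatches}] {root_field}")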

[0] first_downs
[0] first_downs
[0] fumble_gained
[0] fumble_gained
[0] fumble_lost
[0] fumble_lost
[0] interceptions_gained
[0] interceptions_gained
[0] interceptions_lost
[0] interceptions_lost
[0] pass_completions
[0] pass_completions
[0] pass_count
[0] pass_count
[0] pass_yards
[0] pass_yards
[0] penalty_count
[0] penalty_count
[0] score
[0] score
[0] rush_count
[0] rush_count
[0] rush_yards
[0] rush_yards
[0] sack_count
[0] sack_count
[0] sack_gained
[0] sack_gained
[0] top_sec
[0] top_sec
[0] total_yards
[0] total_yards
[0] turnovers_gained
[0] turnovers_gained
[0] turnovers_lost
[0] turnovers_lost
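
The check above only looks inside a single row. A complementary check is whether each game's `_before` value equals the previous game's `_after` value for the same team and season. The sketch below is one way to do that; `count_chain_breaks` is a hypothetical helper, and it assumes the totals restart at 0 every season, as the week 1 row of the Saints output suggests.

```python
import pandas as pd

def count_chain_breaks(df, field, keys=("team", "year")):
    """Count rows whose `_before` differs from the previous game's `_after`."""
    before = f"team_cml_{field}_before"
    after = f"team_cml_{field}_after"
    ordered = df.sort_values([*keys, "week"])
    # Previous game's `_after` within the same team and season.
    prev_after = ordered.groupby(list(keys))[after].shift(1)
    # The first game of a season has no previous game; expect 0 there (assumption).
    expected = prev_after.fillna(0)
    return int((ordered[before] != expected).sum())

# Sample rows copied from the Saints output (weeks 1-3 of 2018).
sample = pd.DataFrame({
    "team": ["New Orleans Saints"] * 3,
    "year": [2018] * 3,
    "week": [1, 2, 3],
    "team_cml_pass_yards_before": [0, 439, 682],
    "team_cml_pass_yards_after": [439, 682, 1078],
})
clean_breaks = count_chain_breaks(sample, "pass_yards")

# Deliberately corrupt week 2 to show that the check catches it.
broken = sample.copy()
broken.loc[1, "team_cml_pass_yards_before"] = 400
broken_breaks = count_chain_breaks(broken, "pass_yards")

print(clean_breaks, broken_breaks)
```

Because the comparison is against the previous row in schedule order, a bye week (week 6 is missing in the Saints output) does not count as a break.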


# TESTER : CML Values for all data

The output below shows that the fields land at roughly 50%, 55%, 45% and the like. Once things are fixed, rerunning this cell should give percentages that are much more spread out.

In [4]:

### 
### CML PERCENTAGES
### 

def get_perc(df, field):
  '''
  Given a df and a field, return the percentage of games in
  which the winning team led in that field (ties count
  for the opponent).

  If the field name starts with '!', invert the logic.
  '''
  if field.startswith('!'):
    field = field.replace('!','')
    wins_df = df[
      ((df['win'] == 0) & (df[f'team_{field}'] > df[f'opponent_{field}']))|
      ((df['win'] == 1) & (df[f'team_{field}'] <= df[f'opponent_{field}']))
    ]
  else:
    wins_df = df[
      ((df['win'] == 1) & (df[f'team_{field}'] > df[f'opponent_{field}']))|
      ((df['win'] == 0) & (df[f'team_{field}'] <= df[f'opponent_{field}']))
    ]

  return (len(wins_df) / len(df)) * 100

def work_fields(df, fields):
  data = {}
  for field in fields:
    data[field] = get_perc(df, field)
  return data

nfl_df = nfld.data_by_game()

# read in all CML columns and build an
# array of them (without team_ and opponent_)
cols = nfl_df.columns
team_cml_cols = pd.Series(cols[cols.str.contains('team_cml_')]).sort_values()
cml_cols = team_cml_cols.apply(lambda x: x.replace('team_', '')).values

data = work_fields(gu.get_year(nfl_df, 2021), cml_cols)

# array each item so it can be DF'd
for item in data: data[item] = [data[item]]
pdf = pd.DataFrame(data)

# pdf = pdf.T.sort_values(by=0).T
pdf.T



| | 0 |
|---|---|
| cml_comb_comp_perf_after | 63.636364 |
| cml_comb_comp_perf_before | 51.515152 |
| cml_comb_def_perf_after | 63.636364 |
| cml_comb_def_perf_before | 51.515152 |
| cml_comb_off_perf_after | 63.636364 |
| cml_comb_off_perf_before | 51.515152 |
| cml_first_downs_after | 61.818182 |
| cml_first_downs_before | 58.787879 |
| cml_fumble_gained_after | 55.151515 |
| cml_fumble_gained_before | 47.272727 |
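
As a reference point for reading these percentages, a field that carries no information about the outcome should land near 50% under this measure, while a field that fully decides the outcome lands at 100%. The sketch below illustrates this on synthetic data. `agreement_pct` is a hypothetical stand-in that mirrors the non-inverted branch of `get_perc`, and the `noise` and `signal` columns are made up for the illustration.

```python
import numpy as np
import pandas as pd

def agreement_pct(df, field):
    """Percentage of rows where the result agrees with who led in `field`."""
    agree = (
        ((df["win"] == 1) & (df[f"team_{field}"] > df[f"opponent_{field}"]))
        | ((df["win"] == 0) & (df[f"team_{field}"] <= df[f"opponent_{field}"]))
    )
    return agree.mean() * 100

rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "win": rng.integers(0, 2, size=n),   # random outcomes
    "team_noise": rng.random(n),         # unrelated to the outcome
    "opponent_noise": rng.random(n),
})

# A field that is decided entirely by the outcome.
toy["team_signal"] = toy["win"].astype(float)
toy["opponent_signal"] = 1.0 - toy["team_signal"]

noise_pct = agreement_pct(toy, "noise")
signal_pct = agreement_pct(toy, "signal")
print(f"noise: {noise_pct:.1f}%  signal: {signal_pct:.1f}%")
```

Values close to 50% in the real output are therefore consistent with the field carrying little usable signal, although this sketch does not identify the cause.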
