### *Question 3*
The five datafiles ```rpl17.html```, ```rpl18.html```, ```rpl19.html```, ```rpl20.html```, ```rpl21.html``` contain detailed statistics on soccer matches of the last five seasons of the Russian Premier League (the 2021 season is currently in play). Among other statistics, each table includes information about: goals, yellow cards, red cards, shots, shots on target, pre-match bookmaker odds, possession, etc., each for both home and away teams. Check [this webpage](ttps://footystats.org/download-stats-csv) for a more detailed explanation of features in each table. Your task will be to prepare data for Question 4 by completing the function below. In particular, it should

#####  ```process_season(filename,window):```
 1. drop any incomplete matches (note the ```status``` column);
 2. drop all features not listed in ```features``` defined in line 3;
 3. rename those features to ```new_features``` defined in line 4;
 4. add the ```outcome``` feature indicating if the match ended in a home win (1), draw (0), or a home loss (-1);
 5. make sure that rows are sorted in temporal order (note the ```timestamp``` column);
 6. add two more features to your data: ```H_miss``` and ```A_miss``` (the number of goals conceded by home and away teams, respectively);
 7. note that the ```ppg``` column (average points earned per game prior to the current match) does not always contain the correct value; recompute this column based on information from other columns in the data (recall that in soccer, wins, draws and losses bring 3, 1, and 0 points, respectively);
 8. drop all rows with implausible column values; i.e., numeric statistics for home and away teams (```H_``` and ```A_``` prefixed columns except for the team names) should never be negative, while the bookmaker odds (```win```, ```draw```,```loss```) should all be at least 1. Entries violating these rules are most probably indications of missing data.
 9. for each match, replace the numeric statistics for home and away teams (```H_``` and ```A_``` prefixed columns) with their average in the previous ```window``` (e.g., 5) matches of each of these two teams. For example, the row corresponding to the match "Krasnodar"-"Zenit" (the first team listed is always the home team by default) should have the average number of yellow cards earned by "Krasnodar" and "Zenit" in their last ```window``` matches in the league in columns ```H_ycards``` and ```A_ycards```, respectively. This procedure is necessary since we don't have access to match statistics before it is played, so the model in Question 4 will have to base its predictions on the running average perfromance of the teams playing.
 10. return the resulting Pandas DataFrame.
 
After this function is ready, the remaining cells in this file will aggregate data across seasons, then split, normalize, and save it. Run these cells as you will need this data for Question 4.

In [3]:
import pandas as pd
import numpy as np
#pd.options.mode.chained_assignment = None  # default='warn'

def FixIndexing(df):
    new_indexes = []
    for a in range(df.shape[0]):
        new_indexes.append(a)
    df2 = df.set_index( [pd.Index(new_indexes)] )
    return df2

In [4]:
def process_season(filename,window):
    data_test=pd.read_csv(filename)
    
    features=["home_team_name","away_team_name","Pre-Match PPG (Home)","Pre-Match PPG (Away)","home_team_goal_count","away_team_goal_count","home_team_corner_count","away_team_corner_count","home_team_yellow_cards","home_team_red_cards","away_team_yellow_cards","away_team_red_cards","home_team_first_half_cards","away_team_first_half_cards","home_team_shots","away_team_shots","home_team_shots_on_target","away_team_shots_on_target","home_team_fouls","away_team_fouls","home_team_possession","away_team_possession","odds_ft_home_team_win","odds_ft_draw","odds_ft_away_team_win"]
    numG, feats = data_test.shape
    
    #step 1, ensure data is sorted by time-stamp
    data_test.sort_values(by=['timestamp'])

    #step 2, drop incomplete matches
    status_incomp = []
    for g in range(numG):
        if data_test.loc[g,'status'] != 'complete':
            status_incomp.append(g)
    if len(status_incomp) > 0:
        data_test_kms = data_test.drop(data_test.index[status_incomp],inplace=False)
        data_test_1 = FixIndexing(data_test_kms)
    else:
        data_test_1 = data_test.copy()
    
    #step 3, drop irrelevant features
    drop_features=[]
    for i in range(data_test_1.shape[1]):
        if data_test_1.columns[i] not in features:
            drop_features.append(i)
    data_test_2 = data_test_1.drop(data_test_1.columns[drop_features],axis=1,inplace=False)

    #step 4, rename columns
    features_new=["H_team","A_team","H_ppg","A_ppg","H_score","A_score","H_corners","A_corners",'H_ycards', 'H_rcards',"A_ycards","A_rcards","H_htcards","A_htcards","H_shots","A_shots","H_shotst","A_shotst","H_fouls","A_fouls","H_pos","A_pos","win","draw","loss"]
    data_test_3 = data_test_2.rename(columns={name:features_new[i] for i,name in enumerate(data_test_2.columns)},inplace=False)
    

    #step 5, add outcomes column
    numG_, feats = data_test_3.shape
    #note that numG_ is not the same as numG, as we might have dropped many indices
    outcomes=[]
    for i in range(numG_):
        if data_test_3.loc[i,'H_score'] > data_test_3.loc[i,'A_score']:
            outcomes.append(1)
        elif data_test_3.loc[i,'H_score'] < data_test_3.loc[i,'A_score']:
            outcomes.append(-1)
        else:
            outcomes.append(0)
    data_test_3['outcome']=outcomes

    #step 6, adding H_miss and A_miss
    H_miss_data = []
    A_miss_data = []
    for i in range(numG_):
        H_miss_data.append(data_test_3.loc[i,'A_score'])
        A_miss_data.append(data_test_3.loc[i,'H_score'])
    data_test_3['H_miss']=H_miss_data
    data_test_3['A_miss']=A_miss_data

    #step 7, compute H_PPG and A_PPG
    DistinctTeamNames = data_test_3.loc[:,'H_team'].drop_duplicates()
    distinctnames = DistinctTeamNames.tolist() #make a list of all distinct team names for this season
    for team in distinctnames:
        #keep a running tally of total number of games played (per team), total points, and thus the running ppg
        running_games = 0
        points_total = 0
        running_ppg = 0
        for i in range(data_test_3.shape[0]):

            if team == data_test_3.loc[i,'H_team']:
                
                if running_games > 0:
                    data_test_3.loc[i,'H_ppg'] = running_ppg

                running_games += 1
                if data_test_3.loc[i,'outcome'] == 1:
                    points_total += 3
                elif data_test_3.loc[i,'outcome'] == 0:
                    points_total += 1
                #else add zero!
                running_ppg = points_total/running_games

            if team == data_test_3.loc[i,'A_team']:
                if running_games > 0:
                    data_test_3.loc[i,'A_ppg'] = running_ppg
                running_games+=1
                if data_test_3.loc[i,'outcome'] == -1: #team is away now!
                    points_total += 3
                elif data_test_3.loc[i,'outcome'] == 0:
                    points_total += 1
                #else add zero!
                running_ppg = points_total/running_games

                
    ####step 8, drop nonsensical values!
    rows_to_drop = []
    features_a=["H_ppg","A_ppg","H_score","A_score","H_corners","A_corners",'H_ycards', 'H_rcards',"A_ycards",
                "A_rcards","H_htcards","A_htcards","H_shots","A_shots","H_shotst","A_shotst",
                "H_fouls","A_fouls","H_pos","A_pos"]
    features_b=["win","draw","loss"]
    for i in range(data_test_3.shape[0]):
        #if a given match has a "bad feature", either in section features_a or features_b, 
        #then add to the list of games to drop!
        for fa in features_a:
            if data_test_3.loc[i,fa] < 0.0:
                rows_to_drop.append(i)
        for fb in features_b:
            if data_test_3.loc[i,fb] < 1.0:
                rows_to_drop.append(i)
    rows_to_drop = list(dict.fromkeys(rows_to_drop))
    
    if len(rows_to_drop) > 0:
        data_test_step7 = data_test_3.drop(data_test.index[rows_to_drop],inplace=False)
    else:
        data_test_step7 = data_test_3.copy()

    if data_test_step7.shape[0] == 0: #if there are no columns left (i.e. for rsl17...)
        return None
    
    data_test_new = FixIndexing(data_test_step7)

    

    #step 9 running window average!
    for team in distinctnames:
        running_window=0
        dickt={'ppg':[],'score':[],'corners':[], 'ycards': [],'rcards': [],'htcards': [],
                'shots':[],'shotst':[],'fouls':[],'pos':[],'miss':[]}
        #for each team, we keep a running tally of all the relevant features
        for i in range(data_test_new.shape[0]):

            if data_test_new.loc[i,'H_team'] == team:
                #if the team plays a game, we add the counter to the running window
                running_window +=1
                
                #append this game value to the running dictionary
                dickt['ppg'].append( data_test_new.loc[i,'H_ppg']  )
                dickt['score'].append( data_test_new.loc[i,'H_score']  )
                dickt['corners'].append( data_test_new.loc[i,'H_corners']  )
                dickt['ycards'].append( data_test_new.loc[i,'H_ycards']  )
                dickt['rcards'].append( data_test_new.loc[i,'H_rcards']  )
                dickt['htcards'].append( data_test_new.loc[i,'H_htcards']  )
                dickt['shots'].append( data_test_new.loc[i,'H_shots']  )
                dickt['shotst'].append( data_test_new.loc[i,'H_shotst']  )
                dickt['fouls'].append( data_test_new.loc[i,'H_fouls']  )
                dickt['pos'].append( data_test_new.loc[i,'H_pos']  )
                dickt['miss'].append( data_test_new.loc[i,'H_miss']  )

                #if they have played at least "window+1" games, then the dictionary contains at most 5 elements
                if running_window >= window + 1:
                    #at this game, replace the data with the content of the previous 5 games played!
                    #compute the window average
                    data_test_new.loc[i,'H_ppg']=np.average(np.array(dickt['ppg'])[0:window])
                    data_test_new.loc[i,'H_score']=np.average(np.array(dickt['score'])[0:window])
                    data_test_new.loc[i,'H_corners']=np.average(np.array(dickt['corners'])[0:window])
                    data_test_new.loc[i,'H_ycards']=np.average(np.array(dickt['ycards'])[0:window])
                    data_test_new.loc[i,'H_rcards']=np.average(np.array(dickt['rcards'])[0:window])
                    data_test_new.loc[i,'H_htcards']=np.average(np.array(dickt['htcards'])[0:window])
                    data_test_new.loc[i,'H_shots']=np.average(np.array(dickt['shots'])[0:window])
                    data_test_new.loc[i,'H_shotst']=np.average(np.array(dickt['shotst'])[0:window])
                    data_test_new.loc[i,'H_fouls']=np.average(np.array(dickt['fouls'])[0:window])
                    data_test_new.loc[i,'H_pos']=np.average(np.array(dickt['pos'])[0:window])
                
                if running_window > window :
                    #get rid of the earliest game in the current 5-game window (i.e. should always have length 5)
                    running_window -= 1
                    dickt['ppg'].pop(0)
                    dickt['score'].pop(0)
                    dickt['corners'].pop(0)
                    dickt['ycards'].pop(0)
                    dickt['rcards'].pop(0)
                    dickt['htcards'].pop(0)
                    dickt['shots'].pop(0)
                    dickt['shotst'].pop(0)
                    dickt['fouls'].pop(0)
                    dickt['pos'].pop(0)
                    dickt['miss'].pop(0)

            #repeat the above reasoning in case the game is away! still need to treat the two games the same
            elif team == data_test_new.loc[i,'A_team']:

                running_window +=1

                dickt['ppg'].append( data_test_new.loc[i,'A_ppg']  )
                dickt['score'].append( data_test_new.loc[i,'A_score']  )
                dickt['corners'].append( data_test_new.loc[i,'A_corners']  )
                dickt['ycards'].append( data_test_new.loc[i,'A_ycards']  )
                dickt['rcards'].append( data_test_new.loc[i,'A_rcards']  )
                dickt['htcards'].append( data_test_new.loc[i,'A_htcards']  )
                dickt['shots'].append( data_test_new.loc[i,'A_shots']  )
                dickt['shotst'].append( data_test_new.loc[i,'A_shotst']  )
                dickt['fouls'].append( data_test_new.loc[i,'A_fouls']  )
                dickt['pos'].append( data_test_new.loc[i,'A_pos']  )
                dickt['miss'].append( data_test_new.loc[i,'A_miss']  )
                
                if running_window >= window + 1:            
                    data_test_new.loc[i,'A_ppg']=np.average(np.array(dickt['ppg'])[0:window])
                    data_test_new.loc[i,'A_score']=np.average(np.array(dickt['score'])[0:window])
                    data_test_new.loc[i,'A_corners']=np.average(np.array(dickt['corners'])[0:window])
                    data_test_new.loc[i,'A_ycards']=np.average(np.array(dickt['ycards'])[0:window])
                    data_test_new.loc[i,'A_rcards']=np.average(np.array(dickt['rcards'])[0:window])
                    data_test_new.loc[i,'A_htcards']=np.average(np.array(dickt['htcards'])[0:window])
                    data_test_new.loc[i,'A_shots']=np.average(np.array(dickt['shots'])[0:window])
                    data_test_new.loc[i,'A_shotst']=np.average(np.array(dickt['shotst'])[0:window])
                    data_test_new.loc[i,'A_fouls']=np.average(np.array(dickt['fouls'])[0:window])
                    data_test_new.loc[i,'A_pos']=np.average(np.array(dickt['pos'])[0:window])

                if running_window > window :
                    running_window -= 1
                    dickt['ppg'].pop(0)
                    dickt['score'].pop(0)
                    dickt['corners'].pop(0)
                    dickt['ycards'].pop(0)
                    dickt['rcards'].pop(0)
                    dickt['htcards'].pop(0)
                    dickt['shots'].pop(0)
                    dickt['shotst'].pop(0)
                    dickt['fouls'].pop(0)
                    dickt['pos'].pop(0)
                    dickt['miss'].pop(0)
                    
    #remove games that did not have a window average!
    distinctnames = DistinctTeamNames.tolist()
    to_drop=[]
    #track the first window+1 games for each team
    for x in distinctnames:
        num_g = 0
        for i in range(data_test_new.shape[0]):
            if x == data_test_new.loc[i,'H_team']:
                num_g += 1
                if num_g == window + 1:
                    to_drop.append(i)
            if x == data_test_new.loc[i,'A_team']:
                num_g += 1
                if num_g == window+1:
                    to_drop.append(i)
    #take the min of these game numbers
    min_to_drop = min(to_drop)
    #drop all games until this point
    data_test_new.drop(data_test_new.index[0:min_to_drop],inplace=True)
    data_test_final = FixIndexing(data_test_new)

    return data_test_final

In [5]:
# Process each league and concatenate datasets (run this):
filenames=[f'rpl{season}.csv' for season in range(17,22)]
data=pd.DataFrame()
for file in filenames:
    data=pd.concat([data,process_season(file,5)],ignore_index=True)
data=data.reindex()
data

Unnamed: 0,H_team,A_team,H_ppg,A_ppg,H_score,A_score,H_corners,A_corners,H_ycards,H_rcards,...,H_fouls,A_fouls,H_pos,A_pos,win,draw,loss,outcome,H_miss,A_miss
0,Ural,Rubin Kazan,1.217391,1.149846,1.0,1.2,3.0,5.0,3.0,1.0,...,18.0,17.2,56.0,53.8,3.20,2.75,2.55,0,1,1
1,Ufa,Zenit,1.478261,1.772727,1.0,2.0,8.0,1.0,1.0,0.0,...,11.0,7.0,58.0,42.0,4.25,3.10,2.00,-1,2,1
2,Krasnodar,Anzhi Makhachkala,1.760080,0.950530,1.4,1.0,3.0,4.2,0.8,0.0,...,12.4,12.4,53.2,53.0,1.39,4.40,8.00,0,1,1
3,Rostov,CSKA Moskva,1.179567,1.863636,1.2,2.0,5.0,2.0,1.8,0.0,...,11.6,7.0,48.6,50.0,3.40,2.95,2.30,-1,2,1
4,Rubin Kazan,Akhmat Grozny,1.149335,1.173913,1.4,2.0,4.8,4.0,2.8,0.0,...,17.2,11.0,52.2,42.0,2.00,3.05,4.15,1,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
486,Lokomotiv Moskva,Rotor Volgograd,1.517143,0.244444,1.0,0.4,4.6,2.4,2.2,0.0,...,14.6,14.4,52.4,39.8,1.43,4.15,7.75,-1,2,1
487,Ural,Tambov,1.129365,0.823889,0.8,1.0,2.8,3.2,2.2,0.2,...,9.0,13.4,53.6,44.6,2.05,3.15,3.80,0,0,0
488,Rostov,Khimki,1.761111,0.514048,1.8,1.4,4.4,3.0,3.6,0.4,...,13.0,13.4,43.4,38.2,1.77,3.45,4.70,-1,2,0
489,Akhmat Grozny,Ufa,1.437937,0.665635,0.6,0.0,7.2,4.2,3.0,0.4,...,16.2,15.2,52.2,44.4,1.71,3.55,5.05,1,1,3


In [6]:
# Specify predictors and the target (run this):
features=[x for x in data.columns if x not in ["H_team","A_team","outcome"]]
target="outcome"

In [7]:
# Split (80/20) and MinMax transform the data (run this):
from sklearn.preprocessing import MinMaxScaler
train=np.random.choice(range(len(data)),size=int(0.8*len(data)),replace=False)
test=[x for x in data.index if x not in train]
train_X,train_y=data.loc[train,features],data.loc[train,target]
test_X,test_y=data.loc[test,features],data.loc[test,target]
scaler=MinMaxScaler()
train_X=scaler.fit_transform(train_X)
test_X=scaler.transform(test_X)
np.save("train_X_test.npy",train_X)
np.save("train_y_test.npy",train_y)
np.save("test_X_test.npy",test_X)
np.save("test_y_test.npy",test_y)