# Footmav Basics

In [1]:
import sys
sys.path.append('../src')
import pandas as pd


## Load Data

Instantiate an `FbRefData` object from a pandas DataFrame containing fbref game logs. The expected format is a DataFrame of player game log data (for example, https://fbref.com/en/players/5515376c/matchlogs/2021-2022/summary/Trevoh-Chalobah-Match-Logs), with multiple players concatenated vertically and the various fbref data categories (Defending, Misc, Passing, etc.) joined horizontally.
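
As a rough illustration of that layout, the sketch below builds a tiny frame with plain pandas. The player names, column names, and the `join_categories` helper are made up for the example; they are not part of Footmav or fbref, and real logs carry many more columns.

```python
import pandas as pd


def join_categories(summary: pd.DataFrame, passing: pd.DataFrame) -> pd.DataFrame:
    """Join two category tables horizontally on the keys of one player-match row."""
    return summary.merge(passing, on=["player", "date"], how="left")


# Hypothetical per-category logs for two players (illustrative values only).
player_a = join_categories(
    pd.DataFrame({"player": ["player a"] * 2,
                  "date": ["2017-08-05", "2017-08-13"],
                  "minutes": [90, 75]}),
    pd.DataFrame({"player": ["player a"] * 2,
                  "date": ["2017-08-05", "2017-08-13"],
                  "passes_long": [4, 2]}),
)
player_b = join_categories(
    pd.DataFrame({"player": ["player b"], "date": ["2017-08-05"], "minutes": [90]}),
    pd.DataFrame({"player": ["player b"], "date": ["2017-08-05"], "passes_long": [7]}),
)

# Stack the players vertically: one row per player per match.
game_logs = pd.concat([player_a, player_b], ignore_index=True)
```

The resulting `game_logs` frame has one row per player per match and one column per statistic, which is the shape described above.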

In [4]:
from footmav import FbRefData

data = FbRefData(pd.read_parquet('ligue1_2017.parquet'))

## Key Concepts

Footmav has two key concepts. The first is the `DataAttribute`, which is essentially a column in a DataFrame. Each `DataAttribute` defines properties such as its name and how it is aggregated. One of the main ways users of Footmav manipulate data is by defining new derived `DataAttribute`s through column operations on existing ones.


The other key concept of Footmav is the `pipe` member function of the `Data` object, together with a series of `pipeable` functions that can be passed to it. Each pipe transformation operates on the entirety of the data, transforming it in some way and returning a new `Data` object containing the transformed data. Pipe functions can change the size of the data, though they don't necessarily have to.
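
The pipe pattern itself can be sketched in a few lines of plain Python. `MiniData`, `keep_starters`, and `add_minutes_share` are hypothetical names invented for this illustration; this is not Footmav's implementation, only the general shape of "take the whole dataset, return a new object".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniData:
    """Hypothetical stand-in for a Data object holding a tuple of row dicts."""
    rows: tuple

    def pipe(self, func, *args, **kwargs):
        # Hand the whole dataset to the function and wrap its result in a new object.
        return MiniData(tuple(func(self.rows, *args, **kwargs)))


def keep_starters(rows):
    # Changes the size of the data: drops rows for players who did not start.
    return [row for row in rows if row["started"]]


def add_minutes_share(rows, match_length=90):
    # Keeps the size of the data: adds one value to a copy of every row.
    return [{**row, "minutes_share": row["minutes"] / match_length} for row in rows]


data = MiniData((
    {"player": "a", "started": True, "minutes": 90},
    {"player": "b", "started": False, "minutes": 30},
    {"player": "c", "started": True, "minutes": 45},
))

result = data.pipe(keep_starters).pipe(add_minutes_share, match_length=90)
```

Because every call returns a new object, `data` still holds the three original rows after the chain has run.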

## Filtering

Filtering can be done using the `filter` pipeable function, whose other argument is a list of `Filter` objects to apply. The filters are applied one after another, so a row must pass all of them; at present only `AND` filtering is supported.

The `Filter` object has the following parameters:
- `attribute`: The `DataAttribute` to filter on
- `value`: The value or values to filter for
- `operation`: The filtering operation to apply to the `attribute` on the `value`

The currently supported filtering operations are:
- `EQ`: Keep only the rows where the `attribute` is equal to the `value`
- `NEQ`: Keep only the rows where the `attribute` is not equal to the `value`
- `GT`: Keep only the rows where the `attribute` is greater than the `value`
- `GTE`: Keep only the rows where the `attribute` is greater than or equal to the `value`
- `LT`: Keep only the rows where the `attribute` is less than the `value`
- `LTE`: Keep only the rows where the `attribute` is less than or equal to the `value`
- `Contains`: Keep only the rows where the `attribute` is a string containing the string specified in `value`
- `NotContains`: Keep only the rows where the `attribute` is a string not containing the string specified in `value`
- `IsIn`: Keep only the rows where `attribute` is one of the elements contained in a list specified by `value`.
- `StrContainsOneOf`: A combination of `IsIn` and `Contains`, this will return only the rows where the `attribute` is a string containing one of the strings in a list specified in `value`
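
To make the semantics of these operations concrete, here is a minimal sketch on plain Python dictionaries. The `OPERATIONS` table and `apply_filters` function are assumptions written for this example and are not Footmav code; the tuples follow the same `(attribute, value, operation)` order as `Filter`.

```python
import operator

# Each operation receives the cell content and the filter value.
OPERATIONS = {
    "EQ": operator.eq,
    "NEQ": operator.ne,
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
    "Contains": lambda cell, value: value in cell,
    "NotContains": lambda cell, value: value not in cell,
    "IsIn": lambda cell, value: cell in value,
    "StrContainsOneOf": lambda cell, value: any(v in cell for v in value),
}


def apply_filters(rows, filters):
    """Apply the filters one after another, so the result is their logical AND."""
    for attribute, value, operation in filters:
        test = OPERATIONS[operation]
        rows = [row for row in rows if test(row[attribute], value)]
    return rows


rows = [
    {"squad": "Paris S-G", "position": "RB", "minutes": 90},
    {"squad": "Paris S-G", "position": "GK", "minutes": 90},
    {"squad": "Paris S-G", "position": "FW,LW", "minutes": 0},
    {"squad": "Lyon", "position": "CM", "minutes": 64},
]

# Same idea as the PSG example in this notebook: right squad, not a keeper, played.
kept = apply_filters(rows, [
    ("squad", "Paris S-G", "EQ"),
    ("position", "GK", "NotContains"),
    ("minutes", 0, "GT"),
])

# StrContainsOneOf keeps rows whose position string contains any listed value.
forwards_or_cms = apply_filters(rows, [("position", ["FW", "CM"], "StrContainsOneOf")])
```

Since each filter only sees the rows that survived the previous one, the order of the list does not change the final result, only how much work later filters have to do.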

## Filtering Examples



In [5]:
from footmav import filter, filters, Filter
from footmav import fb # Import all the FbRef DataAttributes
# Only PSG players who are not keepers and who were not unused substitutes

filtered_data =  data.pipe(
    filter, 
    [
        Filter(fb.TEAM, 'Paris S-G', filters.EQ), 
        Filter(fb.POSITION, 'GK', filters.NotContains), 
        Filter(fb.MINUTES, 0, filters.GT)
    ]
)

print(filtered_data.df.head(10))

           date dayofweek     comp         round venue result      squad  \
1809 2017-08-05       Sat  Ligue 1   Matchweek 1  Home  W 2–0  Paris S-G   
1810 2017-08-13       Sun  Ligue 1   Matchweek 2  Away  W 3–0  Paris S-G   
1811 2017-08-20       Sun  Ligue 1   Matchweek 3  Home  W 6–2  Paris S-G   
1815 2017-09-17       Sun  Ligue 1   Matchweek 6  Home  W 2–0  Paris S-G   
1816 2017-09-23       Sat  Ligue 1   Matchweek 7  Away  D 0–0  Paris S-G   
1819 2017-10-14       Sat  Ligue 1   Matchweek 9  Away  W 2–1  Paris S-G   
1821 2017-10-27       Fri  Ligue 1  Matchweek 11  Home  W 3–0  Paris S-G   
1823 2017-11-04       Sat  Ligue 1  Matchweek 12  Away  W 5–0  Paris S-G   
1824 2017-11-18       Sat  Ligue 1  Matchweek 13  Home  W 4–1  Paris S-G   
1826 2017-11-26       Sun  Ligue 1  Matchweek 14  Away  W 2–1  Paris S-G   

         opponent game_started position  ...  avg_distance_def_actions_gk  \
1809       Amiens            Y       RB  ...                          NaN   
1810     

## Creating new DataAttributes

You can create a new `DataAttribute` using the `DerivedDataAttribute` class. This is an abstract class that currently has one implementation, `FunctionDerivedDataAttribute`, which lets you create a new `DataAttribute` by performing column operations on existing `DataAttribute`s, including other `DerivedDataAttribute`s.
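
One way to picture "column operations" is as deferred expressions that are only evaluated once data is supplied. The sketch below shows that idea with a hypothetical `Expr` class and `col` helper; it is an assumption about the general technique, not a description of how `F.Col` is implemented.

```python
class Expr:
    """Hypothetical deferred column expression, evaluated only when given a row."""

    def __init__(self, evaluate):
        self._evaluate = evaluate

    def __call__(self, row):
        return self._evaluate(row)

    def __add__(self, other):
        return Expr(lambda row: self(row) + other(row))

    def __truediv__(self, other):
        # No zero-division handling here; a real implementation would need it.
        return Expr(lambda row: self(row) / other(row))


def col(name):
    """Expression that reads one named value from a row."""
    return Expr(lambda row: row[name])


# Building the expression does not touch any data yet.
shots_per_touch = col("shots_total") / col("touches_att_pen_area")

# Expressions compose, so a derived expression can feed another one.
doubled = shots_per_touch + shots_per_touch

row = {"shots_total": 6.0, "touches_att_pen_area": 12.0}
value = shots_per_touch(row)
doubled_value = doubled(row)
```

Keeping the expression separate from the data is what makes it possible to evaluate the same definition again later, for example on aggregated rows.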

In [6]:
from footmav.data_definitions import attribute_functions as F
from footmav.data_definitions.derived import FunctionDerivedDataAttribute
from footmav import fb # Import all the FbRef DataAttributes
from footmav.data_definitions.data_sources import DataSource

# Create a new DataAttribute that calculates shots per touches in the box
SHOTS_PER_TOUCHES_IN_PENALTY_AREA = FunctionDerivedDataAttribute(
    'shots_per_touches_in_penalty_area',
    F.Col(fb.SHOTS_TOTAL) / F.Col(fb.TOUCHES_ATT_PEN_AREA),
    data_type=float,
    source=DataSource.FBREF,
    agg_function=None,
    recalculate_on_aggregation=True
)

#Add it to the data set
data_w_shots_per_touches = filtered_data.with_attributes([SHOTS_PER_TOUCHES_IN_PENALTY_AREA])

#Show it for Neymar only
neymar_logs = data_w_shots_per_touches.pipe(
    filter, [Filter(fb.PLAYER, 'neymar', filters.EQ)]
)

neymar_logs.df[[fb.DATE.N, fb.OPPONENT.N, fb.SHOTS_TOTAL.N, fb.TOUCHES_ATT_PEN_AREA.N, SHOTS_PER_TOUCHES_IN_PENALTY_AREA.N]]

|  | date | opponent | shots_total | touches_att_pen_area | shots_per_touches_in_penalty_area |
| --- | --- | --- | --- | --- | --- |
| 46355 | 2017-08-13 | Guingamp | 6.0 | 12.0 | 0.5 |
| 46356 | 2017-08-20 | Toulouse | 6.0 | 16.0 | 0.375 |
| 46357 | 2017-08-25 | Saint-Étienne | 0.0 | 3.0 | 0.0 |
| 46358 | 2017-09-08 | Metz | 3.0 | 8.0 | 0.375 |
| 46360 | 2017-09-17 | Lyon | 3.0 | 5.0 | 0.6 |
| 46362 | 2017-09-30 | Bordeaux | 2.0 | 6.0 | 0.333333 |
| 46363 | 2017-10-14 | Dijon | 4.0 | 6.0 | 0.666667 |
| 46365 | 2017-10-22 | Marseille | 1.0 | 2.0 | 0.5 |
| 46367 | 2017-11-18 | Nantes | 3.0 | 7.0 | 0.428571 |
| 46369 | 2017-11-26 | Monaco | 2.0 | 8.0 | 0.25 |


## Aggregations

You can aggregate data together using the `aggregate_by` pipeable.  As a second parameter, specify a list of `DataAttributes` you wish to group by.  Typical examples tend to be grouping by player name (or player id, which avoids messy situations when you have two players with exactly the same name) to get total player season data, or by team, to get total team data.

Any `DerivedDataAttribute`s in your dataset will be recalculated on the aggregated data if they were created with the `recalculate_on_aggregation` parameter set to `True`.
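
Recalculation matters most for ratio attributes. The plain-Python sketch below uses the shots and penalty-area touches from the ten Neymar rows shown earlier, and assumes that "recalculating" means re-evaluating the ratio on the summed columns; the variable names are made up for the example.

```python
# Shots and touches in the penalty area from the ten Neymar rows shown earlier.
shots = [6.0, 6.0, 0.0, 3.0, 3.0, 2.0, 4.0, 1.0, 3.0, 2.0]
touches = [12.0, 16.0, 3.0, 8.0, 5.0, 6.0, 6.0, 2.0, 7.0, 8.0]

# The per-match ratio, as in the derived column.
per_match_ratios = [s / t for s, t in zip(shots, touches)]

# Recalculated on aggregation: sum the inputs first, then take the ratio.
recalculated = sum(shots) / sum(touches)

# Aggregating the ratio column directly gives different answers.
summed_ratios = sum(per_match_ratios)                  # not a rate at all
mean_of_ratios = summed_ratios / len(per_match_ratios)  # weights every match equally

print(f"recalculated:   {recalculated:.4f}")
print(f"mean of ratios: {mean_of_ratios:.4f}")
print(f"sum of ratios:  {summed_ratios:.4f}")
```

Over these ten matches the totals are 30 shots from 73 touches, so the recalculated value is 30/73, while the mean of the per-match ratios is slightly lower because it gives a 2-touch match the same weight as a 16-touch match.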

In [7]:
from footmav import aggregate_by

# Aggregate season data for all players
aggregated = data_w_shots_per_touches.pipe(
    aggregate_by, [fb.PLAYER]
)

print(aggregated.df.head(10))

               player  corner_kicks_out  cards_red  carry_distance  \
0       adrien rabiot               0.0        0.0         12766.0   
1      angel di maria               0.0        0.0          6189.0   
2      blaise matuidi               0.0        0.0            93.0   
3  christopher nkunku               0.0        0.0          2072.0   
4          dani alves               0.0        1.0          7669.0   
5      edinson cavani               0.0        0.0          2008.0   
6    giovani lo celso               0.0        0.0          9867.0   
7      goncalo guedes               0.0        0.0            19.0   
8      javier pastore               0.0        0.0          5056.0   
9      julian draxler               0.0        0.0          6746.0   

   passes_live  passes_left_foot  passes_switches  pens_saved  \
0       2148.0            1765.0             33.0         0.0   
1       1128.0            1131.0             68.0         0.0   
2         16.0              13.0  

## Additional pipeables
- `per_90` converts the data from totals to per-90-minute values. This should only be applied to aggregated datasets.
- `rank` calculates the percentile rank for the data. This should also be applied to aggregated datasets.
- Many more are coming.
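
The arithmetic behind these two ideas can be sketched as follows. Both functions are assumptions written for illustration: `per_90` here scales a season total by `90 / minutes`, and `percentile_rank` uses one common definition (the share of values less than or equal to each value). Footmav's own pipeables may differ, for example in how ties are handled.

```python
def per_90(total, minutes):
    """Scale a season total to a per-90-minute rate."""
    return total * 90.0 / minutes


def percentile_rank(values):
    """Share of values less than or equal to each value, between 0 and 1."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


# Hypothetical aggregated season totals: (stat total, minutes played).
season_totals = {
    "player a": (12, 1080),
    "player b": (6, 270),
    "player c": (3, 900),
    "player d": (4, 720),
}

rates = {name: per_90(total, minutes) for name, (total, minutes) in season_totals.items()}
ranks = dict(zip(rates, percentile_rank(list(rates.values()))))
```

In this made-up data `player b` has the highest rate on only 270 minutes, which is why the examples in this notebook filter on minutes before ranking.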

## Putting it all together

### Find all strikers with at least 0.5 NPxG and 0.1 xA per 90 who played at least 500 minutes in the striker position



In [13]:
from footmav import per_90
data.pipe(
    filter, [Filter(fb.POSITION, 'FW', filters.Contains)] # Per-match filtering of only players who played at least part of the game at the striker position
).pipe(
    aggregate_by, [fb.PLAYER] # Aggregate by player
).pipe(
    per_90 # Calculate per-90 data.  Per 90 doesn't do anything to the MINUTES column, so we can still use it as a filter later.
).pipe(
    filter, 
    [
        Filter(fb.MINUTES, 500, filters.GTE), # Keep players who played at least 500 minutes
        Filter(fb.NPXG, 0.5, filters.GTE),    # with at least 0.5 NPxG per 90 minutes
        Filter(fb.XA, 0.10, filters.GTE)]     # and at least 0.1 xA per 90 minutes
).df[[fb.PLAYER.N, fb.TEAM.N, fb.NPXG.N, fb.XA.N]]

|  | player | squad | npxg | xa |
| --- | --- | --- | --- | --- |
| 9 | alexandre mendy | Bordeaux | 0.555882 | 0.105882 |
| 33 | clinton njie | Marseille | 0.549505 | 0.237624 |
| 41 | edinson cavani | Paris S-G | 0.683456 | 0.170864 |
| 110 | kylian mbappe | Monaco,Paris S-G | 0.596215 | 0.241325 |
| 130 | memphis depay | Lyon | 0.596154 | 0.480769 |


### Find all CBs in the top 25 percent for both tackles+interceptions+blocks and long passes per 90, among players who played at least 500 minutes at CB

First, tackles+interceptions+blocks is not an existing `DataAttribute`, so let's create it.

In [9]:
DEFENSIVE_ACTIONS = FunctionDerivedDataAttribute(
    'defensive_actions',
    F.Col(fb.TACKLES_INTERCEPTIONS)+F.Col(fb.BLOCKS),
    data_type=float,
    source=DataSource.FBREF,
    agg_function='sum',
    recalculate_on_aggregation=True
)


Now create and calculate the filter

In [24]:
from footmav import rank
data.with_attributes(DEFENSIVE_ACTIONS).pipe(
    filter, [Filter(fb.POSITION, 'CB', filters.Contains)] # Per-match filtering of only players who played at least part of the game at the centreback position
).pipe(
    aggregate_by, [fb.PLAYER] # Aggregate by player
).pipe(
    per_90 # Calculate per-90 data.  Per 90 doesn't do anything to the MINUTES column, so we can still use it as a filter later.
).pipe(
    filter,[Filter(fb.MINUTES, 500, filters.GTE)] # Keep players who played at least 500 minutes
).pipe(
    rank # Calculate percentile rank of all attributes
).pipe(
    filter,
    [
        Filter(DEFENSIVE_ACTIONS, 0.75, filters.GTE), # Filter players who are in the top 25 percentile of tackles + interceptions + blocks
        Filter(fb.PASSES_LONG, 0.75, filters.GTE),    # and are in the top 25 percentile of long passes
    ]
).df[[fb.PLAYER.N, fb.TEAM.N, DEFENSIVE_ACTIONS.N, fb.PASSES_LONG.N]]


|  | player | squad | defensive_actions | passes_long |
| --- | --- | --- | --- | --- |
| 22 | damien da silva | Caen | 0.958904 | 0.780822 |
| 102 | papy djilobodji | Dijon | 0.835616 | 0.972603 |
| 122 | vitorino hilton | Montpellier | 0.794521 | 0.945205 |
