## Test Section to be deleted in final notebook

In [None]:
import numpy as np
import pandas as pd

In [None]:
def test_dataframe():
    s1 = pd.Series(['2000-03-01',    'Zambia',    'Lesotho',    False,    2,  False,      True,  20])
    s2 = pd.Series(['2000-03-04',  'Honduras',  'Nicaragua',    False,    3,  False,      True,  40])
    s3 = pd.Series(['2000-03-12',    'Zambia',    'Malawi',    False,    0,  True,      False,  20])
    s4 = pd.Series(['2000-03-13',    'Zambia',    'Malawi',    False,    1,  False,      False,  20])
    s5 = pd.Series(['2018-07-07',  'Russia',  'Croatia',    False,    0,   True,     False,  60])
    s6 = pd.Series(['2018-07-10',  'France',  'Belgium',     True,    1,  False,      True,  60])
    test = pd.concat([s1, s2, s3, s4], axis=1, sort=False).T 
    test.columns = ['date',    'home',     'away',  'neutral',  'gdf',   'draw', 'home_win',   'k']
    return(test)

In [60]:
def test_rtgs():
    return {'Zambia':1553, 'Lesotho':1113,'Honduras':1674, 'Nicaragua':932, 'Malawi':1236 }

In [None]:
## End of Test Section

## This is the introduction <a name="introduction"></a>
Some introduction text, formatted in heading 2 style

1. [Introduction](#introduction)
2. [Data](#Data)
    1. [DataFrame from File](#Datasub1)
    2. [Prepare DataFrame for Analysis](#Datasub2)
3. [Ranking Systems](#Rank)
    1. [ELO](#RankSub1)
    2. [Fifa ELO System](#RankSub2)
    3. [Regress to Mean after World Cup](#RankSub3)
    4. [Modify G Coefficient](#RankSub4)
    

## Data <a name="Data"></a>
The data comes from Kaggle as a CSV file. More info about the data



In [32]:
# this function loads the initial DataFrame from the CSV file, parsing the first column as dates
def results_1():   
    return pd.read_csv(r'results.csv', parse_dates=[0])

## DataFrame from File <a name="Datasub1"></a>
With no parsing other than reading the date column as dates, this is what I get. I am not going to work with the entire data set in my analysis, so here I reduce it by setting a start and end date.

In [53]:
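df = results_1()
# date bounds as pandas Timestamps
start, end = pd.Timestamp(2000, 3, 1), pd.Timestamp(2010, 3, 1)
df = df[df['date'] >= start]
# keeps rows dated on or after `end` (the output below starts at 2010-03-02)
df = df[df['date'] >= end]
print(df.head())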

            date   home_team            away_team  home_score  away_score  \
31794 2010-03-02  Guadeloupe  St. Kitts and Nevis           2           1   
31795 2010-03-02     Ireland               Brazil           0           2   
31796 2010-03-03     Albania     Northern Ireland           1           0   
31797 2010-03-03     Algeria               Serbia           0           3   
31798 2010-03-03      Angola               Latvia           1           1   

      tournament             city     country  neutral  
31794   Friendly  Vieux-Habitants  Guadeloupe    False  
31795   Friendly           London     England     True  
31796   Friendly           Tirana     Albania    False  
31797   Friendly          Algiers     Algeria    False  
31798   Friendly           Luanda      Angola    False  
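
The cell above filters on `start` and `end` with two separate comparisons. If the goal is to keep only matches that fall inside the window, one possible way to write it is a single boolean mask. This is a minimal sketch on a made-up three-row frame: the `toy` data and the `window` name are assumptions for illustration and are not part of the results file. It relies on `Series.between`, which includes both end points by default.

```python
import pandas as pd

# Hypothetical stand-in for the results file: three matches on different dates.
toy = pd.DataFrame({
    'date': pd.to_datetime(['1999-12-31', '2005-06-15', '2010-03-02']),
    'home_team': ['A', 'B', 'C'],
})

start, end = pd.Timestamp(2000, 3, 1), pd.Timestamp(2010, 3, 1)

# Keep only rows with start <= date <= end.
window = toy[toy['date'].between(start, end)]
print(window)
```

Only the middle row survives here, because the first date is before `start` and the last is after `end`.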


## Prepare DataFrame for Analysis <a name="Datasub2"></a>

Next I modify the DataFrame prior to applying either rating system.  I keep the *'date'*, *'home_team'* and *'away_team'* columns, renaming the latter two to **'home'** and **'away'**.  The *'home_score'* and *'away_score'* columns are combined (winning score - losing score) into a new column **'gdf'** (goal differential as int) and then deleted. Along with that, the columns **'draw'** (bool), **'win'** (bool, True when the home team wins) and **'result'** (float: 1.0, 0.5 or 0.0 from the home team's point of view) are added, plus **'home_rtg'** and **'away_rtg'** placeholder columns initialised to 0. The *'neutral'* column is kept as a bool.  The *'tournament'* column values are changed to the appropriate match weighting constant and the column is renamed **'k'**.  Columns *'city'* and *'country'* are not needed.
```
60 for World Cup, Olympic Games (1908–1980)
50 for Continental championship and intercontinental tournaments
40 for World Cup and Continental qualifiers and major tournaments
30 for All other tournaments
20 for Friendly matches.
```
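
The score-derived columns described above can be sketched on a tiny made-up frame. The `toy` data below is hypothetical, and using `np.select` for the result column is just one way to do it, not necessarily how the notebook does it.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for three matches: home win, draw, home loss.
toy = pd.DataFrame({'home_score': [2, 1, 0], 'away_score': [1, 1, 3]})

diff = toy['home_score'] - toy['away_score']   # signed goal difference
toy['draw'] = diff == 0                        # True when the scores are level
toy['win'] = diff > 0                          # True when the home team wins
toy['gdf'] = diff.abs()                        # winning score - losing score
# Result from the home team's point of view: 1 win, 0.5 draw, 0 loss.
toy['result'] = np.select([toy['win'], toy['draw']], [1.0, 0.5], default=0.0)
print(toy)
```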


In [44]:
def conv_tournament(S):
    # groups the tournament names found in S by match weight k
    L = set(S.unique())
    A = {'FIFA World Cup'}                                                    # k = 60
    B = {'AFC Asian Cup', 'African Cup of Nations', 'Copa América', 'Gold Cup', 'Nations Cup',
         'UEFA Euro'}                                                         # k = 50
    C = {'FIFA World Cup qualification', 'AFC Asian Cup qualification', 'African Cup of Nations qualification',
        'Copa América qualification', 'Gold Cup qualification', 'CONCACAF Championship', 'UEFA Euro qualification'}  # k = 40
    E = {'Friendly'}                                                          # k = 20
    D = L - A - B - C - E                                                     # k = 30, everything else
    return  {60:A, 50:B, 40:C, 30:D, 20:E}

In [43]:
def set_k(S, k_sets):
    # returns the weight k for each tournament name in S; k_sets maps k -> set of names
    A = []
    for s in S:
        for key, val in k_sets.items():
            if s in val:
                A.append(key)
    return A
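
As an alternative to looping over every set for every row, the weight table could be inverted once into a flat name-to-k lookup. The sketch below assumes a shortened, hypothetical table (`k_sets`) in the same `{k: set_of_names}` shape, and it assumes that any name missing from the table should get 30, the "all other tournaments" weight.

```python
import pandas as pd

# Hypothetical, shortened weight table in the same {k: set_of_names} shape.
k_sets = {60: {'FIFA World Cup'}, 40: {'FIFA World Cup qualification'}, 20: {'Friendly'}}

# Invert to {tournament_name: k} so each lookup is a single dict access.
k_by_name = {name: k for k, names in k_sets.items() for name in names}

tournaments = pd.Series(['Friendly', 'FIFA World Cup', 'Some Other Cup'])
# Names missing from the table fall back to 30 ("all other tournaments").
k_values = tournaments.map(k_by_name).fillna(30).astype(int)
print(k_values.tolist())
```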

In [45]:
def result(draw, win):
    # match result from the home team's point of view
    if draw:   return 0.50
    elif win:  return 1.00
    else:      return 0.00

In [46]:
# this function prepares the DataFrame for the rating systems
def results_2(df):   
    df = df.copy()                                   # work on a copy of the filtered frame
    df.columns = ['date', 'home', 'away', 'home_score', 'away_score', 'tournament', 'city', 'country', 'neutral']
    
    df['gdf']      = df.home_score - df.away_score   # signed goal difference, home minus away
    df['draw']     = df.gdf == 0
    df['win']      = df.gdf > 0                      # True when the home team wins
    df['gdf']      = df.gdf.abs()
    df['result']   = df.apply(lambda x : result(x['draw'], x['win']), axis=1)
    df['k']        = set_k(df.tournament, conv_tournament(df.tournament)) 
    df['home_rtg'] = 0
    df['away_rtg'] = 0
    
    df = df.drop(['home_score', 'away_score', 'city', 'country', 'tournament'], axis=1) 
    df = df.reset_index(drop=True)
    
    return df

In [55]:
df = results_2(df)
print(df.head(2))

        date        home                 away  neutral  gdf   draw    win  \
0 2010-03-02  Guadeloupe  St. Kitts and Nevis    False    1  False   True   
1 2010-03-02     Ireland               Brazil     True    2  False  False   

   result   k  home_rtg  away_rtg  
0     1.0  20         0         0  
1     0.0  20         0         0  


## Ranking Systems <a name="Rank"></a>


Overall discussion

## ELO Rating System <a name="RankSub1"></a>

The World Football Elo Ratings are based on the Elo rating system, developed by Dr. Arpad Elo. This system is used by FIDE, the international chess federation, to rate chess players.

Eloratings.net applies the Elo rating system to international football, by adding a weighting for the kind of match, an adjustment for the home team advantage, and an adjustment for goal difference in the match result.

The ratings take into account all international "A" matches for which results could be found. International football data is primarily obtained from rsssf.com, theroonba.com, and soccer-db.info.

Ratings tend to converge on a team's true strength relative to its competitors after about 30 matches. Ratings for teams with fewer than 30 matches should be considered provisional.

The ratings are based on the following formula:

**Rn = Ro + K × G × (W - We)**

Rn is the new rating, Ro is the old (pre-match) rating.

**K** is the weight constant for the tournament played:
```
60 for World Cup finals;
50 for continental championship finals and major intercontinental tournaments;
40 for World Cup and continental qualifiers and major tournaments;
30 for all other tournaments;
20 for friendly matches.
```
**G** is a multiplier based on the goal differential (gdf).

```
0 (draw) or +1     G = 1.00
+2                 G = 1.50
+3                 G = 1.75
+4 or higher       G = (11 + gdf) / 8
```
**W** is the result of the game.
```
1 for a win
0.5 for a draw
0 for a loss
```
**We** is the expected result (win expectancy). The value *dfr* is the difference in ratings plus 100 points for a team playing at home.
```
We = 1 / (10^(-dfr/400) + 1)
```
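
To see how the pieces fit together, here is a minimal worked sketch of a single rating update. The function names (`expected_result`, `goal_multiplier`, `elo_update`) are hypothetical and chosen for this example only. Treating a draw like a one-goal margin (G = 1) and rounding the rating change to a whole number are both assumptions of the sketch.

```python
def expected_result(dfr):
    """Win expectancy We for a rating difference dfr (home bonus already included)."""
    return 1 / (10 ** (-dfr / 400) + 1)


def goal_multiplier(gdf):
    """G for an absolute goal difference gdf (assumption: a draw uses G = 1)."""
    if gdf <= 1:
        return 1.0
    if gdf == 2:
        return 1.5
    return (11 + gdf) / 8


def elo_update(home_rtg, away_rtg, k, gdf, w, neutral=False):
    """Return (new_home, new_away); w is the home team's result (1, 0.5 or 0)."""
    dfr = home_rtg - away_rtg + (0 if neutral else 100)
    # Assumption: the rating change is rounded to the nearest whole point.
    change = round(k * goal_multiplier(gdf) * (w - expected_result(dfr)))
    return home_rtg + change, away_rtg - change


# A 1553-rated home side beats a 1113-rated visitor by two goals in a friendly.
new_home, new_away = elo_update(1553, 1113, k=20, gdf=2, w=1.0)
print(new_home, new_away)
```

With a rating gap of 540 points including the home bonus, the win expectancy is about 0.957, so the favourite gains only one point for the expected win and the visitor loses one.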




In [1]:
import numpy as np
import pandas as pd

In [2]:
# this function calculates G value 
def calc_g(gdf):    # gdf is int >= 0
    if not gdf:  return 1.00
    if gdf == 1: return 1.00
    if gdf == 2: return  1.50 
    else:        return (11 + gdf) / 8  

In [3]:
def win_exp(dfr):
    return 1 / ((10) ** (-dfr / 400) + 1)

In [10]:
# creates the starting ratings DataFrame (one row per country) to begin an analysis
def start_results(date, rtgs_dict):
    A = []
    for key, val in rtgs_dict.items():
        S = pd.Series([date, key, val])
        A.append(S)
    df = pd.DataFrame(A)
    df.columns = ['date', 'country', 'rating']
    return df

In [27]:
df = test_dataframe()
rtgs = test_rtgs()
start = pd.Timestamp(2000, 3, 1)
results = start_results(start,rtgs)

In [28]:
win_probs = []

for (row, S) in df.iterrows():
   
    date = S['date']                                                           # date
    home, away = S['home'], S['away']                                          # strings
    neutral, draw, win, favor = S['neutral'], S['draw'], S['home_win'], True   # bools
    gdf, k = S['gdf'], S['k']                                                  # ints 
    result = 0.0                                                               # float
    
    home_rtg, away_rtg = rtgs[home], rtgs[away]
    
    if not neutral:   dfr = home_rtg + 100 - away_rtg                          # not neutral ==> home rating + 100
    else:             dfr = home_rtg - away_rtg
    if dfr < 0: favor = False
    
    if win: result = 1.0
    if draw: result = .50   
        
    expect = win_exp(dfr)
    adjust = np.round(calc_g(gdf) * k * (result - expect), decimals=0)  
   
    #if win or draw and not favor:                     # if home team wins or draws when not favored, rating increases
    home_rtg += adjust                                 # adjust is negative when the home team falls short of expectation
    away_rtg -= adjust
    #else:
        #home_rtg -= adjust
        #away_rtg += adjust 
        
    if not draw:
        if win: win_probs.append(expect)                            # appends list with winning team's P(expect)
        else:   win_probs.append(1.0 - expect)
    # add one row per team holding its post-match rating
    for team, rtg in ((home, home_rtg), (away, away_rtg)):
        new_row = pd.DataFrame(pd.Series({'date':date, 'country':team, 'rating':rtg})).T
        results = pd.concat([results, new_row])
    
    rtgs[home] = home_rtg; rtgs[away] = away_rtg 
    
results.index = range(len(results))

In [29]:
results.index = range(len(results))
results.head(9)

                  date    country rating
0  2000-03-01 00:00:00     Zambia   1553
1  2000-03-01 00:00:00    Lesotho   1113
2  2000-03-01 00:00:00   Honduras   1674
3  2000-03-01 00:00:00  Nicaragua    932
4  2000-03-01 00:00:00     Malawi   1236
5           2000-03-01     Zambia   1554
6           2000-03-01    Lesotho   1112
7           2000-03-04   Honduras   1675
8           2000-03-04  Nicaragua    931


In [15]:
rtgs

{'Zambia': 1528.0,
 'Lesotho': 1112.0,
 'Honduras': 1675.0,
 'Nicaragua': 931.0,
 'Malawi': 1262.0}

In [16]:
win_probs

[0.957241588853464, 0.9922088227539905, 0.08996208018848373]

## Fifa ELO System <a name="RankSub2"></a>

After the 2018 World Cup, FIFA adopted a modified version of the Elo system .... 


In [None]:
def fifa_conv_tournament(S):
    L = set(S.unique())
    A = {'FIFA World Cup'}
    B = {'AFC Asian Cup', 'African Cup of Nations', 'Copa América', 'Gold Cup', 'Nations Cup',
         'Copa América', 'UEFA Euro'}
    C = {'FIFA World Cup qualification', 'AFC Asian Cup qualification', 'African Cup of Nations qualification',
        'Copa América qualification', 'Gold Cup qualification', 'CONCACAF Championship', 'UEFA Euro qualification'}
    E = {'Friendly'}
    D = L - A - B - C - E
    return  {60:A, 50:B, 40:C, 30:D, 20:E}

In [None]:
def fifa_win_exp(dfr):
    return 1 / ((10) ** (-dfr / 400) + 1)

## Regress to Mean after World Cup <a name="RankSub3"></a>

One complaint

In [69]:
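def Regress_mean(rtgs_dict):
    # moves each rating halfway toward 1500 (floor division keeps ints); updates rtgs_dict in place
    for key in rtgs_dict:
        rtgs_dict[key] = rtgs_dict[key] + (1500 - rtgs_dict[key]) // 2

    return rtgs_dict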

    

In [70]:
test = Regress_mean(test_rtgs())
test

{'Zambia': 1526,
 'Lesotho': 1306,
 'Honduras': 1587,
 'Nicaragua': 1216,
 'Malawi': 1368}
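
If the original ratings need to stay available for comparison, the same halfway-to-the-mean step can be written so that it returns a new dict instead of changing its input. This is a minimal sketch: the `regress_to_mean` name and the `mean` parameter are assumptions made for illustration.

```python
def regress_to_mean(ratings, mean=1500):
    """Return a new dict with each rating moved halfway toward `mean`."""
    # Floor division keeps the ratings as integers.
    return {team: r + (mean - r) // 2 for team, r in ratings.items()}


before = {'Zambia': 1553, 'Lesotho': 1113}
after = regress_to_mean(before)
print(after)
```

Because of the floor division, ratings above the mean are rounded down after halving the gap (1553 becomes 1526 rather than 1527).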

## Modify G Coefficient <a name="RankSub4"></a>

My belief is that teams should not .....

In [None]:
# this function calculates G value 
def calc_g(gdf):    # gdf is int >= 0
    if not gdf:  return 1.00
    if gdf == 1: return 1.00
    if gdf == 2: return  1.50 
    else:        return (11 + gdf) / 8  