# Prototype linking teams from loteca to betexp

In this notebook, we will try linking Loteca matches to BetExplorer
matches. This has been done before, but we want a clearer approach.

Only a small part of the code will be redone. The aim is to keep as much of the existing code as possible.

In [1]:
import logging
import sqlite3

In [2]:
import sys
sys.path.append('..')

In [3]:
from src.util import load_pickle

In [4]:
# from src.data.interim.teams import betexplorer
from src.data.interim.teams import betexplorer, loteca

## Loading the data

First we load the teams from both sources and compute some of
their properties.

The desired result is a list of `Team`s for each source.
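
For orientation, the sketch below shows roughly what such a record might look like. The field names are taken from the attributes used later in this notebook (`string`, `fname`, `fname_without_state`, `state`, `country`, `under`, `women_flag`). The types, defaults, field meanings, and the class name `TeamSketch` are assumptions, not the actual definition in `src.data.interim.teams`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TeamSketch:
    """Hypothetical stand-in for the project's Team class."""
    string: str                      # raw name as it appears in the source
    fname: str                       # normalized name used for matching (assumed)
    fname_without_state: str         # normalized name with no state part (assumed)
    state: Optional[str] = None      # Brazilian state, if known
    country: Optional[str] = None    # country, for clubs outside Brazil
    under: Optional[int] = None      # age category, e.g. 20 for U20 (assumed type)
    women_flag: bool = False         # True for women's teams


# Made-up example mirroring the "Bragantino U20" entry seen in BetExplorer.
example = TeamSketch(string="Bragantino U20", fname="bragantino",
                     fname_without_state="bragantino", under=20)
```

Keeping the age category and the women's flag as separate fields, rather than as part of `fname`, is what allows two entries such as `Bragantino` and `Bragantino U20` to share one `fname`.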

In [5]:
def retrieve_loteca_teams():
    datapath = '../data/process/loteca_matches.pkl'
    teams = loteca.retrieve_teams(datapath)
    return teams

def retrieve_betexp_teams():
    dbpath = '../data/db.sqlite3'
    teams = betexplorer.retrieve_teams(dbpath)
    return teams

In [6]:
loteca_teams = retrieve_loteca_teams()
betexp_teams = retrieve_betexp_teams()

ERROR:root:Found multiple states for fname 'bragantino'


## Bragantino

In [7]:
[t for t in betexp_teams if t.fname == 'bragantino']

[Team ("Bragantino U20"), Team ("Bragantino")]

In [8]:
import pandas as pd

conn = sqlite3.connect('../data/db.sqlite3')
df = pd.read_sql_query("SELECT * FROM betexp_matches WHERE league_category == 'brazil'", conn)
conn.close()

In [9]:
teams = ['Bragantino', 'Bragantino U20']
df[(df.team_h.isin(teams)) |
   (df.team_a.isin(teams))]

| | id | url | league_category | league_name | league_year | team_h | team_a | date | score | scoremod |
|---|---|---|---|---|---|---|---|---|---|---|
| 380 | GtN3bsu5 | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | ABC | Bragantino | 28.11.2009 | 0:1 | |
| 392 | Gxs5mQQn | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Bragantino | Brasiliense | 21.11.2009 | 1:2 | |
| 403 | W6pzWOeC | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Figueirense | Bragantino | 14.11.2009 | 2:1 | |
| 416 | AsH5wzDR | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Bragantino | Parana | 10.11.2009 | 2:2 | |
| 427 | YswGDJqG | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Ceara | Bragantino | 04.11.2009 | 2:0 | |
| 438 | GfKU5qUs | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Bragantino | America RN | 28.10.2009 | 2:1 | |
| 443 | YwEvAvbf | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Bragantino | Fortaleza | 24.10.2009 | 4:1 | |
| 458 | 6ZGRG2jE | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Ponte Preta | Bragantino | 20.10.2009 | 2:1 | |
| 463 | 619F03yL | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Bragantino | Bahia | 17.10.2009 | 3:0 | |
| 479 | UVjgz3p2 | http://www.betexplorer.com/soccer/brazil/serie... | brazil | Série B | 2009 | Vila Nova FC | Bragantino | 07.10.2009 | 1:2 | |


There are two `Bragantino`s:

- One from São Paulo ([1])
- And one from Pará ([2])

Both of them are called `Bragantino` in the BetExplorer database. We hope
such name collisions are rare, as they can lead to a team being given the wrong state.

[1]: https://pt.wikipedia.org/wiki/Clube_Atl%C3%A9tico_Bragantino
[2]: https://pt.wikipedia.org/wiki/Bragantino_Clube_do_Par%C3%A1

### Problematic cases

We are working under the assumption that each `fname` corresponds to one 
team and one team only.

As seen above, however, this assumption does not always hold.

In our state matching, this may assign the wrong state to a team. It can happen
when all of the following hold:

- Two teams from different states share the same `fname`.
- One of the teams plays in intra-state matches.
- The other one only plays in inter-state matches.

The state of a team is retrieved from its intra-state matches. As one
of the teams only plays in inter-state matches, no state is retrieved for
it.

No conflict is detected, and both teams end up assigned to the same state.
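
The scenario above can be sketched with a toy example. This is a minimal illustration, not the actual logic in `src.data.interim.teams`: the function names, the `(fname, state)` input shape, and the state codes `"sp"` and `"pa"` are all assumptions made for the sketch.

```python
from collections import defaultdict


def infer_states(intra_state_matches):
    """Map each fname to the set of states it was seen in.

    `intra_state_matches` is an iterable of (fname, state) pairs, one per
    team appearance in an intra-state competition (hypothetical input shape).
    """
    states = defaultdict(set)
    for fname, state in intra_state_matches:
        states[fname].add(state)
    return dict(states)


def find_conflicts(states_by_fname):
    """Return the fnames that were observed in more than one state."""
    return sorted(f for f, s in states_by_fname.items() if len(s) > 1)


# Case 1: both teams play intra-state matches, so the clash is detected.
detected = find_conflicts(infer_states([("bragantino", "sp"),
                                        ("bragantino", "pa")]))

# Case 2: only one of the two teams plays intra-state matches. The other is
# seen only in inter-state matches, which carry no state information here,
# so nothing is flagged and every 'bragantino' gets the single known state.
silent = infer_states([("bragantino", "sp")])
undetected = find_conflicts(silent)
```

In the first case the clash shows up as two states for one `fname`, which is the situation the `Found multiple states` log message reports. In the second case there is nothing to report, which is why the mistake would go unnoticed.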

#### Rarity

This setup, however, is unlikely. A team that plays inter-state
matches is presumably big, so we also expect it to play intra-state
matches.

The setup above requires that it does not.

## Everything

In [10]:
loteca_teams = retrieve_loteca_teams()
betexp_teams = retrieve_betexp_teams()

ERROR:root:Found multiple states for fname 'bragantino'


In [11]:
# GENERATE COUNTRIES DICT

# load countries
countries = load_pickle('../data/interim/countries.pkl')

# standardize country names
def standardize_country(name):
    from unidecode import unidecode
    return unidecode(name).lower()
_sc = standardize_country
countries = {_sc(k): _sc(v) for k, v in countries.items()}

# manual updates
updates = {
    'bosnia herzegovina': 'bosnia & herzegovina',
    'camaroes': 'cameroon',
    'costa do marfim': 'ivory coast',
    'escocia': 'scotland',
    'estados unidos': 'usa',
    'inglaterra': 'england',
    'irlanda do norte': 'northern ireland',
    'pais de gales': 'wales',
    'rep.tcheca': 'czech republic',
    'republica tcheca': 'czech republic',
    'servia e montenegro': 'serbia and montenegro',
    'taiti': 'tahiti',
}
countries.update(updates)

In [12]:
import logging
import re

from src.util import re_strip

LOTECA_MAX_SIZE = max(len(str(t)) for t in loteca_teams)

    
def is_same_team(loteca_team, betexp_team):
    """Return True if the Loteca team and the BetExplorer team match."""
    lt = loteca_team
    be = betexp_team

    # strict restrictions: category and known states must agree
    if lt.women_flag != be.women_flag:
        return False
    if lt.under != be.under:
        return False
    if lt.state and be.state and lt.state != be.state:
        return False

    # the comparison depends on the kind of team
    if not lt.state and not lt.country:
        # country team: look up its name in `countries` before comparing
        return countries[lt.fname] == be.fname
    elif lt.country:
        # teams outside of brazil
        return be.state is None and be.fname == lt.fname
    else:
        # teams from brazil
        return (lt.state == be.state and
                lt.fname_without_state == be.fname_without_state)

ltb_dict = {}
for loteca_team in loteca_teams:
    # get matching BetExplorer strings
    matching_teams = [t for t in betexp_teams if is_same_team(loteca_team, t)]
    betexp_strings = [betexplorer.generate_string(t) for t in matching_teams]
    betexp_strings = set(betexp_strings)
    
    # log the matches found
    msg = "{loteca_string!s:>{loteca_max_size}} -> {found}"
    msg = msg.format(loteca_string=loteca_team.string,
                     loteca_max_size=LOTECA_MAX_SIZE,
                     found=betexp_strings if betexp_strings else "{}")
    log_level = logging.ERROR if len(betexp_strings) > 1 else logging.INFO
    logging.log(log_level, msg)
        
    # save
    ltb_dict[loteca_team.string] = betexp_strings      
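
Once the dictionary is built, it may help to split it by number of matches instead of reading the log line by line. The sketch below is one possible way to do that. The helper `summarize_links` and the sample data are hypothetical; only the dictionary shape (Loteca string mapped to a set of BetExplorer strings) comes from the cell above.

```python
def summarize_links(links):
    """Split a {loteca_string: set_of_betexp_strings} dict by match count."""
    # Loteca names with no BetExplorer counterpart
    unmatched = sorted(k for k, v in links.items() if len(v) == 0)
    # Loteca names with more than one candidate (logged as errors above)
    ambiguous = sorted(k for k, v in links.items() if len(v) > 1)
    # Loteca names with exactly one candidate, flattened to a plain string
    unique = {k: next(iter(v)) for k, v in links.items() if len(v) == 1}
    return unmatched, ambiguous, unique


# Made-up sample data in the same shape as `ltb_dict`.
sample = {
    "TEAM A": {"Team A"},
    "TEAM B": set(),
    "TEAM C": {"Team C", "Team C U20"},
}
unmatched, ambiguous, unique = summarize_links(sample)
```

With the real data, this would be called as `summarize_links(ltb_dict)`. The `ambiguous` list then corresponds to the entries logged at `ERROR` level.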