# Prototype Loteca to BetExplorer matches

## Test pandas speed

We will use pandas heavily here, so let's measure the baseline cost
of retrieving a single item from a Series versus a dict.

Let's test a namedtuple as well.

In [1]:
from collections import namedtuple
import pandas as pd

Obj = namedtuple('Obj', 'a, b, c')
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
b = {'a': 1, 'b': 2, 'c': 3}
c = Obj(1, 2, 3)

In [2]:
%timeit a.loc['a']
%timeit b['a']
%timeit c.a

37.4 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.5 ns ± 0.768 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
80.2 ns ± 1.02 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


pandas is roughly 500 to 700 times slower than the pure-Python alternatives
(37.4 µs per lookup versus 53.5 ns for the dict and 80.2 ns for the namedtuple).

This should be taken into consideration.
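One way to act on this is to convert a Series to a plain dict once, before a loop that performs many scalar lookups. The sketch below is only an illustration: the Series mirrors the one timed above, and the names `lookup`, `keys`, and `total` are made up for the example.

```python
import pandas as pd

# Same toy Series as in the timing cell above.
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Pay the pandas overhead once, then use cheap dict lookups in the loop.
lookup = a.to_dict()

keys = ['a', 'c', 'b', 'a']  # hypothetical sequence of repeated lookups
total = sum(lookup[k] for k in keys)
print(total)
```

This only pays off when the same Series is queried item by item many times. Vectorized pandas operations over whole columns are a different case and are not covered by the timings above.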

## Teams problem

Some Loteca fnames (formatted names) correspond to more than one
BetExplorer team.

For example, both `ATLÉTICO/MG` and `ATLÉTICO/GO` get formatted
into `atletico`, so a single fname ends up corresponding to more
than one team.

Let's see an example below:

In [1]:
import sys
sys.path.append('..')

In [5]:
from src.util import load_pickle

d = load_pickle('../data/interim/ltb_teams.pkl')
d

defaultdict(set,
            {'abaete (PA)': set(),
             'abc (RN)': {'abc'},
             'adeco (SP)': set(),
             'africa do sul': {'south africa'},
             'aguia (PA)': set(),
             'albania': {'albania'},
             'alecrim (RN)': {'alecrim'},
             'alemanha': {'germany'},
             'america (MG)': set(),
             'america (PE)': set(),
             'america (RJ)': set(),
             'america (RN)': set(),
             'america (SE)': set(),
             'america (SP)': set(),
             'americano (RJ)': set(),
             'amiens': {'amiens'},
             'amparo (SP)': set(),
             'ananindeua (PA)': set(),
             'anapolina (GO)': {'anapolina'},
             'anapolis (GO)': {'anapolis'},
             'andorra': {'andorra'},
             'aparecidense (GO)': {'aparecidense'},
             'araguaina (TO)': set(),
             'araripina (PE)': set(),
             'argelia': {'algeria'},
             'argentina': 

The `atletico` collision is not in this dictionary, because it was generated in the matches algorithm itself!
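To make the collision concrete, here is a minimal sketch of how it can arise. It assumes a hypothetical formatter, `naive_fname`, that strips accents, lowercases, and drops the state suffix. This is not the project's actual formatting code, only an illustration of the failure mode described above.

```python
import unicodedata
from collections import defaultdict


def naive_fname(raw):
    """Hypothetical formatter: drop the '/UF' state suffix, strip accents, lowercase."""
    name = raw.split('/')[0]
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks.
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    return ascii_name.lower().strip()


raw_names = ['ATLÉTICO/MG', 'ATLÉTICO/GO', 'CRUZEIRO/MG']

# Group raw names by their formatted name.
by_fname = defaultdict(set)
for raw in raw_names:
    by_fname[naive_fname(raw)].add(raw)

# Any fname that maps to more than one raw name is ambiguous.
collisions = {fname: names for fname, names in by_fname.items() if len(names) > 1}
print(collisions)
```

Keeping the state in the fname, as the `america (MG)` style keys in the dictionary above do, avoids this particular ambiguity.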

## Misc

In [3]:
import sys
sys.path.append('..')

In [4]:
from src.util import load_pickle

df = load_pickle('../data/process/loteca_matches.pkl')

In [5]:
df.head()

Unnamed: 0,roundno,gameno,date,team_h,goals_h,team_a,goals_a,happened
5110,366,1,2009-06-07,CRUZEIRO/MG,1,INTERNACIONAL/RS,1,True
5111,366,2,2009-06-07,ATLÉTICO/PR,0,ATLÉTICO/MG,4,True
5112,366,3,2009-06-07,AVAÍ/SC,0,SÃO PAULO/SP,0,True
5113,366,4,2009-06-06,AMÉRICA/RN,2,BRASILIENSE/DF,1,True
5114,366,5,2009-06-06,PONTE PRETA/SP,5,PORTUGUESA DESPORTOS/SP,2,True


In [6]:
import sqlite3

import pandas as pd

conn = sqlite3.connect('../data/db.sqlite3')
q = "SELECT * FROM betexp_matches"
df = pd.read_sql_query(q, conn)
conn.close()

In [7]:
df.head()

Unnamed: 0,id,url,league_category,league_name,league_year,team_h,team_a,date,score,scoremod
0,bqHgC1sl,http://www.betexplorer.com/soccer/world/arab-c...,world,Arab Champions League,2008/2009,Esperance Tunis,Wydad,21.05.2009,1:1,
1,GKRlDsSs,http://www.betexplorer.com/soccer/world/arab-c...,world,Arab Champions League,2008/2009,Wydad,Esperance Tunis,09.05.2009,0:1,
2,pxyJI34K,http://www.betexplorer.com/soccer/world/arab-c...,world,Arab Champions League,2008/2009,Esperance Tunis,ES Setif,26.04.2009,2:0,
3,EBuNHNJQ,http://www.betexplorer.com/soccer/world/arab-c...,world,Arab Champions League,2008/2009,Wydad,Sfaxien,25.04.2009,2:0,
4,6oxFJqkE,http://www.betexplorer.com/soccer/world/arab-c...,world,Arab Champions League,2008/2009,Sfaxien,Wydad,12.04.2009,1:1,


In [8]:
df.shape

(299367, 10)

In [10]:
len(set(df.team_h) | set(df.team_a))

13253

In [None]:
from datetime import date  # needed by get_date below

from src.data.interim.teams import betexplorer

# remove duplicates
df = df.drop_duplicates(subset='id')

# process necessary columns
def get_date(s):
    d, m, y = [int(v) for v in s.split('.')]
    return date(y, m, d)

def get_score(s):
    # strip first so that whitespace-only strings also count as missing
    s = s.strip()
    if s == '':
        return None

    return [int(v) for v in s.split(':')]

def process_name(s):
    name, _, women_flag, under, _ = betexplorer.parse_string(s)
    fname = betexplorer.format_name(name)
    return (fname, under, women_flag)

dates = [get_date(s) for s in df.date]
scores = [get_score(s) for s in df.score]
h_names = [process_name(s) for s in df.team_h]  # about 3s
a_names = [process_name(s) for s in df.team_a]  # about 3s
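As a quick check of the parsing helpers, the sketch below shows a standalone variant using `datetime.strptime`, which rejects any string that is not in `DD.MM.YYYY` form. The names `parse_date` and `parse_score` are hypothetical stand-ins for `get_date` and `get_score`. The sample inputs follow the `date` and `score` formats shown in the BetExplorer table above.

```python
from datetime import date, datetime


def parse_date(s):
    """Parse a 'DD.MM.YYYY' string; raises ValueError for any other layout."""
    return datetime.strptime(s.strip(), '%d.%m.%Y').date()


def parse_score(s):
    """Parse 'H:A' into a (home, away) tuple of ints; empty input gives None."""
    s = s.strip()
    if not s:
        return None
    home, away = s.split(':')
    return int(home), int(away)


print(parse_date('21.05.2009'))
print(parse_score('1:1'))
print(parse_score(''))
```

Returning a tuple instead of a list is a design choice of this sketch, not something the original cell requires.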