### Modificando tabelas
Este script tem como objetivo manipular algumas tabelas e criar novas versões, sendo:
- Relação entre as tabelas `player` e `player_attributes` em uma nova tabela, chamada `player_attributes_modified`, onde cada key do json é uma nova coluna
- Relação entre as tabelas `team` e `team_Attributes` em uma nova tabela, chamada `team_attributes_modified`, onde cada key do json é uma nova coluna
- Criar uma tabela chamada `match_modified` cuja coluna seja representada como JSON, onde as chaves precisam ser referentes às colunas da tabela Match, sendo elas : id,match_api_id, home_team_api_id, away_team_api_id


#### Import das libs

In [1]:
#Importando das libs que serão utilizadas no processo
import sqlite3

#### Conectado ao banco e iniciando cursor

In [10]:
#conectando ao banco de dados test_analytics_engineer
conn = sqlite3.connect('test_analytics_engineer.db')

In [11]:
#instanciando o cursor
c = conn.cursor()

#### Criação das novas tabelas

##### Player Attributes Modified

In [4]:
#realiando teste de extração de algumas colunas do JSON de attributes
c.execute('''
    SELECT
        id,
        player_attributes,
        JSON_EXTRACT(player_attributes,'$.id') AS id,
        JSON_EXTRACT(player_attributes,'$.player_fifa_api_id') AS player_fifa_api_id,
        JSON_EXTRACT(player_attributes,'$.player_api_id') AS player_api_id,
        JSON_EXTRACT(player_attributes,'$.date') AS date
    FROM
        player_attributes
''').fetchall()

OperationalError: malformed JSON

Temos casos de JSON com problema de incosistência

In [9]:
# validando volumetria de json com problemas
c.execute('''
    SELECT
       COUNT(1)
    FROM
        player_attributes
    WHERE
        NOT(JSON_VALID(player_attributes))
''').fetchone()

(3624,)

In [10]:
# checando 1 caso
c.execute('''
    SELECT
       *
    FROM
        player_attributes
    WHERE
        NOT(JSON_VALID(player_attributes))
''').fetchone()

(373,
 '{"id": "374", "player_fifa_api_id": "156626", "player_api_id": "46447", "date": "2010-08-30 00:00:00", "overall_rating": 64.0, "potential": 71.0, "preferred_foot": "right", "attacking_work_rate": NaN, "defensive_work_rate": "_0", "crossing": 41.0, "finishing": 33.0, "heading_accuracy": 74.0, "short_passing": 57.0, "volleys": 24.0, "dribbling": 30.0, "curve": 35.0, "free_kick_accuracy": 40.0, "long_passing": 45.0, "ball_control": 44.0, "acceleration": 60.0, "sprint_speed": 61.0, "agility": 59.0, "reactions": 58.0, "balance": 73.0, "shot_power": 48.0, "jumping": 75.0, "stamina": 64.0, "strength": 71.0, "long_shots": 39.0, "aggression": 71.0, "interceptions": 58.0, "positioning": 28.0, "vision": 61.0, "penalties": 39.0, "marking": 62.0, "standing_tackle": 61.0, "sliding_tackle": 57.0, "gk_diving": 15.0, "gk_handling": 14.0, "gk_kicking": 13.0, "gk_positioning": 10.0, "gk_reflexes": 12.0}')

Temos 1 dos atributos que está com NaN e o `JSON_EXTRACT` não aceita. A solução será substituir o valor para `NULL`

In [11]:
#testando se a solução do replace atende todos os casos ou se temos algum problema diferente
c.execute('''
    SELECT
        JSON_VALID(REPLACE(player_attributes,'NaN','null')) AS json_validation,
        COUNT(1) AS qtd
    FROM
        player_attributes
    WHERE
        NOT(JSON_VALID(player_attributes))
    GROUP BY
        json_validation
''').fetchone()

(1, 3624)

Todos os casos estavam com o mesmo problema!

In [12]:
#validando possíveis chaves do player na tabela player
c.execute('''
        SELECT
            COUNT(1) AS qtd,
            COUNT(DISTINCT player_api_id) AS pids,
            COUNT(DISTINCT player_fifa_api_id) AS pfids
        FROM
            player
        ''').fetchall()

[(11060, 11060, 11060)]

In [13]:
#testando as chaves do player na tabela de attributes
c.execute('''
        WITH player_attributes_key AS (
            SELECT DISTINCT
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_api_id') AS player_api_id,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_fifa_api_id') AS player_fifa_api_id
            FROM 
                player_attributes
        )
        
        SELECT
            COUNT(1) AS qtd,
            COUNT(DISTINCT player_api_id) AS pids,
            COUNT(DISTINCT player_fifa_api_id) AS pfids
        FROM 
            player_attributes_key
        ''').fetchall()

[(11069, 11060, 11062)]

Aparentemente a chave mais confiável é a `player_fifa_api_id`

In [14]:
#Checando casos player_fifa_api_id duplicados
c.execute('''
        WITH player_attributes_key AS (
            SELECT DISTINCT
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_fifa_api_id') AS player_fifa_api_id,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_api_id') AS player_api_id
            FROM 
                player_attributes
        )
        
        SELECT *
        FROM
            player_attributes_key
        WHERE
            player_fifa_api_id IN (
                SELECT
                    player_fifa_api_id
                FROM 
                    player_attributes_key
                GROUP BY
                    player_fifa_api_id
                HAVING COUNT(1) > 1
            )
        ORDER BY
            player_fifa_api_id
        ''').fetchall()

[('118359', '32968'),
 ('118359', '38966'),
 ('184431', '96540'),
 ('184431', '42116'),
 ('190195', '37254'),
 ('190195', '156664'),
 ('192635', '150396'),
 ('192635', '164128'),
 ('198394', '282274'),
 ('198394', '163838'),
 ('206652', '300532'),
 ('206652', '30271'),
 ('208618', '359193'),
 ('208618', '11285')]

Dos 7 casos de duplicidade,todos foram por causa de alteração no `player_api_id`. Mesmo assim, podemos utilizar o `player_fifa_api_id` como chave entre as tabelas, já que na `player` esta informação está deduplicada.

In [5]:
#criação da tabela, extraindo as informações do json da player_attribuites e cruzando com a player pelo player_fifa_api_id
c.execute('''
        CREATE TABLE IF NOT EXISTS player_attributes_modified AS
        WITH player_attributes_extracted AS (
            SELECT 
                CAST(JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.id') AS INTEGER) AS attribute_id,
                CAST(JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_fifa_api_id') AS INTEGER) AS player_fifa_api_id,
                CAST(JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_api_id') AS INTEGER) AS player_api_id,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.date') AS date,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.overall_rating') AS overall_rating,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.potential') AS potential,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.preferred_foot') AS preferred_foot,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.right') AS right,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.attacking_work_rate') AS attacking_work_rate,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.defensive_work_rate') AS defensive_work_rate,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.crossing') AS crossing,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.finishing') AS finishing,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.heading_accuracy') AS heading_accuracy,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.short_passing') AS short_passing,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.volleys') AS volleys,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.dribbling') AS dribbling,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.curve') AS curve,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.free_kick_accuracy') AS free_kick_accuracy,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.long_passing') AS long_passing,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.ball_control') AS ball_control,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.acceleration') AS acceleration,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.sprint_speed') AS sprint_speed,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.agility') AS agility,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.reactions') AS reactions,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.balance') AS balance,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.shot_power') AS shot_power,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.jumping') AS jumping,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.stamina') AS stamina,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.strength') AS strength,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.long_shots') AS long_shots,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.aggression') AS aggression,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.interceptions') AS interceptions,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.positioning') AS positioning,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.vision') AS vision,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.penalties') AS penalties,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.marking') AS marking,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.standing_tackle') AS standing_tackle,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.sliding_tackle') AS sliding_tackle,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.gk_diving') AS gk_diving,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.gk_handling') AS gk_handling,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.gk_kicking') AS gk_kicking,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.gk_positioning') AS gk_positioning,
                JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.gk_reflexes') AS gk_reflexes
            FROM 
                player_attributes
        )
        
        SELECT
            player.player_fifa_api_id,
            attributes.attribute_id,
            attributes.date,
            CAST(attributes.overall_rating AS INTEGER) AS overall_rating,
            CAST(attributes.potential AS INTEGER) AS potential,
            attributes.preferred_foot,
            attributes.right,
            attributes.attacking_work_rate,
            attributes.defensive_work_rate,
            CAST(attributes.crossing AS INTEGER) AS crossing,
            CAST(attributes.finishing AS INTEGER) AS finishing,
            CAST(attributes.heading_accuracy AS INTEGER) AS heading_accuracy,
            CAST(attributes.short_passing AS INTEGER) AS short_passing,
            CAST(attributes.volleys AS INTEGER) AS volleys,
            CAST(attributes.dribbling AS INTEGER) AS dribbling,
            CAST(attributes.curve AS INTEGER) AS curve,
            CAST(attributes.free_kick_accuracy AS INTEGER) AS free_kick_accuracy,
            CAST(attributes.long_passing AS INTEGER) AS long_passing,
            CAST(attributes.ball_control AS INTEGER) AS ball_control,
            CAST(attributes.acceleration AS INTEGER) AS acceleration,
            CAST(attributes.sprint_speed AS INTEGER) AS sprint_speed,
            CAST(attributes.agility AS INTEGER) AS agility,
            CAST(attributes.reactions AS INTEGER) AS reactions,
            CAST(attributes.balance AS INTEGER) AS balance,
            CAST(attributes.shot_power AS INTEGER) AS shot_power,
            CAST(attributes.jumping AS INTEGER) AS jumping,
            CAST(attributes.stamina AS INTEGER) AS stamina,
            CAST(attributes.strength AS INTEGER) AS strength,
            CAST(attributes.long_shots AS INTEGER) AS long_shots,
            CAST(attributes.aggression AS INTEGER) AS aggression,
            CAST(attributes.interceptions AS INTEGER) AS interceptions,
            CAST(attributes.positioning AS INTEGER) AS positioning,
            CAST(attributes.vision AS INTEGER) AS vision,
            CAST(attributes.penalties AS INTEGER) AS penalties,
            CAST(attributes.marking AS INTEGER) AS marking,
            CAST(attributes.standing_tackle AS INTEGER) AS standing_tackle,
            CAST(attributes.sliding_tackle AS INTEGER) AS sliding_tackle,
            CAST(attributes.gk_diving AS INTEGER) AS gk_diving,
            CAST(attributes.gk_handling AS INTEGER) AS gk_handling,
            CAST(attributes.gk_kicking AS INTEGER) AS gk_kicking,
            CAST(attributes.gk_positioning AS INTEGER) AS gk_positioning,
            CAST(attributes.gk_reflexes AS INTEGER) AS gk_reflexes
        FROM
            player
        INNER JOIN
            player_attributes_extracted AS attributes
        ON player.player_fifa_api_id = attributes.player_fifa_api_id
        ''')

<sqlite3.Cursor at 0x1b35a9afdc0>

Por que o `INNER JOIN` ? Do ponto de vista de negócio me parece fazer mais sentido considerar apenas dados de players que consigamos identificar. Ter atributos de players os quais não temos em nossa base pode acabar sujando análises.

In [12]:
#fazendo check de quantos dados perdemos
original_count = c.execute('''
                        SELECT
                            COUNT(DISTINCT CAST(JSON_EXTRACT(REPLACE(player_attributes,'NaN','null'),'$.player_fifa_api_id') AS INTEGER))
                        FROM
                            player_attributes

                        ''').fetchone()

modified_count = c.execute('''
                        SELECT
                            COUNT(DISTINCT player_fifa_api_id)
                        FROM
                            player_attributes_modified

                        ''').fetchone()
print(original_count,modified_count)

(11062,) (11060,)


Levando em conta que a base de attributes é histórica e possui a evolução dos players, são pouquíssimos dados perdidos na granularidade de player (2 perdas, sendo mais exato)

In [6]:
#commitando a criação da base
conn.commit()

##### Team Attributes Modified

In [13]:
#vendo um exemplo de linha
c.execute('''
        SELECT
            *
        FROM
            team_attributes

        ''').fetchone()

(0,
 '{"id": "1", "team_fifa_api_id": "434", "team_api_id": "9930", "date": "2010-02-22 00:00:00", "buildUpPlaySpeed": "60", "buildUpPlaySpeedClass": "Balanced", "buildUpPlayDribbling": NaN, "buildUpPlayDribblingClass": "Little", "buildUpPlayPassing": "50", "buildUpPlayPassingClass": "Mixed", "buildUpPlayPositioningClass": "Organised", "chanceCreationPassing": "60", "chanceCreationPassingClass": "Normal", "chanceCreationCrossing": "65", "chanceCreationCrossingClass": "Normal", "chanceCreationShooting": "55", "chanceCreationShootingClass": "Normal", "chanceCreationPositioningClass": "Organised", "defencePressure": "50", "defencePressureClass": "Medium", "defenceAggression": "55", "defenceAggressionClass": "Press", "defenceTeamWidth": "45", "defenceTeamWidthClass": "Normal", "defenceDefenderLineClass": "Cover"}')

In [12]:
#realiando teste de extração de algumas colunas do JSON de attributes
c.execute('''
    SELECT
        id,
        team_attributes,
        JSON_EXTRACT(team_attributes,'$.id') AS id,
        JSON_EXTRACT(team_attributes,'$.team_fifa_api_id') AS player_fifa_api_id,
        JSON_EXTRACT(team_attributes,'$.team_api_id') AS player_api_id,
        JSON_EXTRACT(team_attributes,'$.date') AS date
    FROM
        team_attributes
''').fetchall()

OperationalError: malformed JSON

Temos casos de JSON com problema de incosistência

In [13]:
# validando volumetria de json com problemas
c.execute('''
    SELECT
       COUNT(1)
    FROM
        team_attributes
    WHERE
        NOT(JSON_VALID(team_attributes))
''').fetchone()

(971,)

In [14]:
# checando 1 caso
c.execute('''
    SELECT
       *
    FROM
        team_attributes
    WHERE
        NOT(JSON_VALID(team_attributes))
''').fetchone()

(0,
 '{"id": "1", "team_fifa_api_id": "434", "team_api_id": "9930", "date": "2010-02-22 00:00:00", "buildUpPlaySpeed": "60", "buildUpPlaySpeedClass": "Balanced", "buildUpPlayDribbling": NaN, "buildUpPlayDribblingClass": "Little", "buildUpPlayPassing": "50", "buildUpPlayPassingClass": "Mixed", "buildUpPlayPositioningClass": "Organised", "chanceCreationPassing": "60", "chanceCreationPassingClass": "Normal", "chanceCreationCrossing": "65", "chanceCreationCrossingClass": "Normal", "chanceCreationShooting": "55", "chanceCreationShootingClass": "Normal", "chanceCreationPositioningClass": "Organised", "defencePressure": "50", "defencePressureClass": "Medium", "defenceAggression": "55", "defenceAggressionClass": "Press", "defenceTeamWidth": "45", "defenceTeamWidthClass": "Normal", "defenceDefenderLineClass": "Cover"}')

Temos 1 dos atributos que está com NaN e o `JSON_EXTRACT` não aceita. Aparentemente é o mesmo problema da tabela de player. A solução será substituir o valor para `NULL`

In [15]:
#testando se a solução do replace atende todos os casos ou se temos algum problema diferente
c.execute('''
    SELECT
        JSON_VALID(REPLACE(team_attributes,'NaN','null')) AS json_validation,
        COUNT(1) AS qtd
    FROM
        team_attributes
    WHERE
        NOT(JSON_VALID(team_attributes))
    GROUP BY
        json_validation
''').fetchone()

(1, 971)

Todos os casos estavam com o mesmo problema!

In [16]:
#validando possíveis chaves do team na tabela team
c.execute('''
        SELECT
            COUNT(1) AS qtd,
            COUNT(DISTINCT team_api_id) AS tids,
            COUNT(DISTINCT team_fifa_api_id) AS tfids
        FROM
            team
        ''').fetchall()

[(299, 299, 285)]

Diferentemente da base de `players`, aqui temos algumas duplicidades na chave `team_fifa_api_id`

In [20]:
#checando as duplicidades
c.execute('''
        SELECT
            *
        FROM
            team
        WHERE
            team_fifa_api_id IN (
                SELECT
                    team_fifa_api_id
                FROM
                    team
                GROUP BY
                    team_fifa_api_id
                HAVING COUNT(1) > 1
            )
        ''').fetchall()

[(16, 9996, 111560, 'Royal Excel Mouscron', 'MOU'),
 (2510, 274581, 111560, 'Royal Excel Mouscron', 'MOP'),
 (31444, 8031, 111429, 'Polonia Bytom', 'POB'),
 (31445, 8020, 111429, 'Polonia Bytom', 'GOR'),
 (31451, 8244, 301, 'Widzew Łódź', 'LOD'),
 (32409, 8024, 301, 'Widzew Łódź', 'WID')]

In [21]:
#testando as chaves do team na tabela de attributes
c.execute('''
        WITH team_attributes_key AS (
            SELECT DISTINCT
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.team_api_id') AS team_api_id,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.team_fifa_api_id') AS team_fifa_api_id
            FROM 
                team_attributes
        )
        
        SELECT
            COUNT(1) AS qtd,
            COUNT(DISTINCT team_api_id) AS tids,
            COUNT(DISTINCT team_fifa_api_id) AS tfids
        FROM 
            team_attributes_key
        ''').fetchall()

[(288, 288, 285)]

Aparentemente a chave mais confiável é a `team_api_id`. Diferentemente da tabela de `player_attributes`, a base de `team_attributes` possui dados já consolidados na granularidade do time ao invés do histórico

In [24]:
# checando 1 caso
c.execute('''
    SELECT
       *
    FROM
        team_attributes
    WHERE
        NOT(JSON_VALID(team_attributes))
''').fetchone()

(0,
 '{"id": "1", "team_fifa_api_id": "434", "team_api_id": "9930", "date": "2010-02-22 00:00:00", "buildUpPlaySpeed": "60", "buildUpPlaySpeedClass": "Balanced", "buildUpPlayDribbling": NaN, "buildUpPlayDribblingClass": "Little", "buildUpPlayPassing": "50", "buildUpPlayPassingClass": "Mixed", "buildUpPlayPositioningClass": "Organised", "chanceCreationPassing": "60", "chanceCreationPassingClass": "Normal", "chanceCreationCrossing": "65", "chanceCreationCrossingClass": "Normal", "chanceCreationShooting": "55", "chanceCreationShootingClass": "Normal", "chanceCreationPositioningClass": "Organised", "defencePressure": "50", "defencePressureClass": "Medium", "defenceAggression": "55", "defenceAggressionClass": "Press", "defenceTeamWidth": "45", "defenceTeamWidthClass": "Normal", "defenceDefenderLineClass": "Cover"}')

In [8]:
#criação da tabela, extraindo as informações do json da team_attribuites e cruzando com a team pelo team_api_id
c.execute('''
        CREATE TABLE IF NOT EXISTS team_attributes_modified AS
        WITH team_attributes_extracted AS (
            SELECT 
                CAST(JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.id') AS INTEGER) AS attribute_id,
                CAST(JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.team_fifa_api_id') AS INTEGER) AS team_fifa_api_id,
                CAST(JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.team_api_id') AS INTEGER) AS team_api_id,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.date') AS date,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlaySpeed') AS buildUpPlaySpeed,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlaySpeedClass') AS buildUpPlaySpeedClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlayDribbling') AS buildUpPlayDribbling,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlayDribblingClass') AS buildUpPlayDribblingClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlayPassing') AS buildUpPlayPassing,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlayPassingClass') AS buildUpPlayPassingClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.buildUpPlayPositioningClass') AS buildUpPlayPositioningClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationPassing') AS chanceCreationPassing,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationPassingClass') AS chanceCreationPassingClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationCrossing') AS chanceCreationCrossing,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationCrossingClass') AS chanceCreationCrossingClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationShooting') AS chanceCreationShooting,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationShootingClass') AS chanceCreationShootingClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.chanceCreationPositioningClass') AS chanceCreationPositioningClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defencePressure') AS defencePressure,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defencePressureClass') AS defencePressureClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defenceAggression') AS defenceAggression,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defenceAggressionClass') AS defenceAggressionClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defenceTeamWidth') AS defenceTeamWidth,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defenceTeamWidthClass') AS defenceTeamWidthClass,
                JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.defenceDefenderLineClass') AS defenceDefenderLineClass
            FROM 
                team_attributes
        )
        
        SELECT
            team.team_api_id,
            attributes.attribute_id,
            attributes.date,
            CAST(attributes.buildUpPlaySpeed AS INTEGER) AS buildUpPlaySpeed,
            attributes.buildUpPlaySpeedClass,
            CAST(attributes.buildUpPlayDribbling AS INTEGER) AS buildUpPlayDribbling,
            attributes.buildUpPlayDribblingClass,
            CAST(attributes.buildUpPlayPassing AS INTEGER) AS buildUpPlayPassing,
            attributes.buildUpPlayPassingClass,
            attributes.buildUpPlayPositioningClass,
            CAST(attributes.chanceCreationPassing AS INTEGER) AS chanceCreationPassing,
            attributes.chanceCreationPassingClass,
            CAST(attributes.chanceCreationCrossing AS INTEGER) AS chanceCreationCrossing,
            attributes.chanceCreationCrossingClass,
            CAST(attributes.chanceCreationShooting AS INTEGER) AS chanceCreationShooting,
            attributes.chanceCreationShootingClass,
            attributes.chanceCreationPositioningClass,
            CAST(attributes.defencePressure AS INTEGER) AS defencePressure,
            attributes.defencePressureClass,
            CAST(attributes.defenceAggression AS INTEGER) AS defenceAggression,
            attributes.defenceAggressionClass,
            CAST(attributes.defenceTeamWidth AS INTEGER) AS defenceTeamWidth,
            attributes.defenceTeamWidthClass,
            attributes.defenceDefenderLineClass
        FROM
            team
        INNER JOIN
            team_attributes_extracted AS attributes
        ON team.team_api_id = attributes.team_api_id
        ''')

<sqlite3.Cursor at 0x1b35a9afdc0>

In [47]:
#fazendo check de quantos dados perdemos por conta do inner
original_count = c.execute('''
                        SELECT
                            COUNT(DISTINCT CAST(JSON_EXTRACT(REPLACE(team_attributes,'NaN','null'),'$.team_api_id') AS INTEGER))
                        FROM
                            team_attributes

                        ''').fetchone()

modified_count = c.execute('''
                        SELECT
                            COUNT(DISTINCT team_api_id)
                        FROM
                            team_attributes_modified

                        ''').fetchone()
print(original_count,modified_count)

(288,) (288,)


Todos os times possuem dados de atributos!

In [9]:
#commitando a criação da base
conn.commit()

##### match Modified

In [23]:
#criação da tabela match_modified, consolidando as chaves em um único campo json
c.execute('''
        CREATE TABLE IF NOT EXISTS match_modified AS
        SELECT
            JSON_OBJECT(
                'id',id,
                'match_api_id',match_api_id,
                'home_team_api_id',home_team_api_id,
                'away_team_api_id','away_team_api_id'
            ) AS match_keys
        FROM
            match
        ''')

<sqlite3.Cursor at 0x1c0ba733b20>

In [28]:
#checando algumas linhas
c.execute('''
        SELECT
            *
        FROM
            match_modified
        LIMIT 10
        ''').fetchall()

[('{"id":1,"match_api_id":492473,"home_team_api_id":9987,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":2,"match_api_id":492474,"home_team_api_id":10000,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":3,"match_api_id":492475,"home_team_api_id":9984,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":4,"match_api_id":492476,"home_team_api_id":9991,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":5,"match_api_id":492477,"home_team_api_id":7947,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":6,"match_api_id":492478,"home_team_api_id":8203,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":7,"match_api_id":492479,"home_team_api_id":9999,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":8,"match_api_id":492480,"home_team_api_id":4049,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":9,"match_api_id":492481,"home_team_api_id":10001,"away_team_api_id":"away_team_api_id"}',),
 ('{"id":10,"match_api_id":492564,"home_team_api_id":8342,"away_team_api_id":"away_team_api_id"}',

In [29]:
#commitando criação da tabela
conn.commit()

In [14]:
#encerrando conexão
conn.close()