# EDA

EDA is performed on the Spider [dataset](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ). More info about the dataset is available [here](https://yale-lily.github.io/spider).

In [4]:
path_to_data = '../../Spider Dataset'

In [2]:
from glob import glob

Let's start by looking at the structure of each of the files.

In [7]:
glob(f'{path_to_data}/*')

['../../Spider Dataset/dev_gold.sql',
 '../../Spider Dataset/database',
 '../../Spider Dataset/train_others.json',
 '../../Spider Dataset/train_spider.json',
 '../../Spider Dataset/tables.json',
 '../../Spider Dataset/dev.json',
 '../../Spider Dataset/README.txt',
 '../../Spider Dataset/train_gold.sql']

As the Spider info page describes, each record in the `.json` files contains the following fields:

- `db_id`: the database id to which this question is addressed.
- `query`: the SQL query corresponding to the question.
- `query_toks`: the SQL query tokens corresponding to the question.
- `query_toks_no_value`: list of tokens for the query, where the values are mapped to the keyword `value`.
- `question`: the natural language question.
- `question_toks`: list of tokens for the natural language question.
- `sql`: parsed results of this SQL query using [process_sql.py](https://github.com/taoyds/spider/blob/master/process_sql.py). 
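
As a quick sanity check on the field list above, a minimal sketch that reports which of the documented fields a record lacks. The helper name `missing_fields` and the abbreviated `sample` record are illustrative assumptions, not part of the Spider tooling; the sample values are copied from the first training record.

```python
# Fields documented on the Spider info page (listed above).
EXPECTED_FIELDS = {
    "db_id", "query", "query_toks", "query_toks_no_value",
    "question", "question_toks", "sql",
}

def missing_fields(record):
    """Return the documented fields that are absent from a record (hypothetical helper)."""
    return sorted(EXPECTED_FIELDS - set(record))

# Deliberately abbreviated record, so some fields are reported as missing.
sample = {
    "db_id": "department_management",
    "query": "SELECT count(*) FROM head WHERE age  >  56",
    "question": "How many heads of the departments are older than 56 ?",
}
print(missing_fields(sample))
```

Running the same helper over every record loaded from a `.json` file would flag any record that deviates from the documented schema.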

In [23]:
!head -n 200 '../../Spider Dataset/train_spider.json'

[
    {
        "db_id": "department_management",
        "query": "SELECT count(*) FROM head WHERE age  >  56",
        "query_toks": [
            "SELECT",
            "count",
            "(",
            "*",
            ")",
            "FROM",
            "head",
            "WHERE",
            "age",
            ">",
            "56"
        ],
        "query_toks_no_value": [
            "select",
            "count",
            "(",
            "*",
            ")",
            "from",
            "head",
            "where",
            "age",
            ">",
            "value"
        ],
        "question": "How many heads of the departments are older than 56 ?",
        "question_toks": [
            "How",
            "many",
            "heads",
            "of",
            "the",
            "departments",
            "are",
            "older",
            "than",
            "56",
            "?"
        ],
        "sql

It's unclear what `train_others.json` is used for, but `train_spider.json` is the training set, and `dev.json` is the validation set. We will explore this further later.

It's also unclear what the `.sql` files are used for. They seem to align with the `query` field in the `.json` files. Let's verify this.

In [25]:
!head '../../Spider Dataset/train_gold.sql'

SELECT count(*) FROM head WHERE age  >  56	department_management
SELECT name ,  born_state ,  age FROM head ORDER BY age	department_management
SELECT creation ,  name ,  budget_in_billions FROM department	department_management
SELECT max(budget_in_billions) ,  min(budget_in_billions) FROM department	department_management
SELECT avg(num_employees) FROM department WHERE ranking BETWEEN 10 AND 15	department_management
SELECT name FROM head WHERE born_state != 'California'	department_management
SELECT DISTINCT T1.creation FROM department AS T1 JOIN management AS T2 ON T1.department_id  =  T2.department_id JOIN head AS T3 ON T2.head_id  =  T3.head_id WHERE T3.born_state  =  'Alabama'	department_management
SELECT born_state FROM head GROUP BY born_state HAVING count(*)  >=  3	department_management
SELECT creation FROM department GROUP BY creation ORDER BY count(*) DESC LIMIT 1	department_management
SELECT T1.name ,  T1.num_employees FROM department AS T1 JOIN management AS T2 ON T1.

In [29]:
with open('../../Spider Dataset/train_gold.sql') as f:
    lines = f.readlines()
len(lines)

8659

In [42]:
import json

with open('../../Spider Dataset/train_spider.json') as f:
    train_spider = json.load(f)
len(train_spider)

7000

There already seems to be a discrepancy, as `train_spider.json` has fewer records than `train_gold.sql`. It's possible that `train_others.json` has the remaining records.

In [38]:
import json

with open('../../Spider Dataset/train_others.json') as f:
    train_others = json.load(f)
len(train_others)

1659

Since 7000 + 1659 = 8659, it seems the full training corpus includes records from both `train_spider.json` and `train_others.json`. Now, let's verify that `train_gold.sql` aligns with the `query` field in the `.json` files.

In [69]:
gold_queries = [line.split('\t')[0] for line in lines]
train_spider_queries = [record['query'] for record in train_spider]
gold_queries[:len(train_spider_queries)] == train_spider_queries

False

In [73]:
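p = 0  # number of mismatches printed so far
for i, (gold, train) in enumerate(zip(gold_queries[:len(train_spider_queries)], train_spider_queries)):
    # Stop once a handful of mismatches has been shown
    if p > 10:
        break
    if gold != train:
        print(i)      # index of the mismatching record
        print(gold)   # query from train_gold.sql
        print(train)  # query from train_spider.json
        p += 1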

20
SELECT Hosts FROM farm_competition WHERE Theme != 'Aliens'
SELECT Hosts FROM farm_competition WHERE Theme !=  'Aliens'
21
SELECT Hosts FROM farm_competition WHERE Theme != 'Aliens'
SELECT Hosts FROM farm_competition WHERE Theme !=  'Aliens'
54
SELECT Census_Ranking FROM city WHERE Status != "Village"
SELECT Census_Ranking FROM city WHERE Status !=  "Village"
55
SELECT Census_Ranking FROM city WHERE Status != "Village"
SELECT Census_Ranking FROM city WHERE Status !=  "Village"
165
SELECT count(*) FROM trip AS T1 JOIN station AS T2 ON T1.end_station_id  =  T2.id WHERE T2.city != "San Francisco"
SELECT count(*) FROM trip AS T1 JOIN station AS T2 ON T1.end_station_id  =  T2.id WHERE T2.city !=  "San Francisco"
789
SELECT city_name FROM city WHERE population  =  ( SELECT MAX ( population ) FROM city WHERE state_name  =  "wyoming" ) AND state_name  =  "wyoming";
SELECT count(*) FROM member WHERE Membership_card  =  'Black'
790
SELECT city_name FROM city WHERE population  =  ( SELECT MAX (
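
The first mismatches printed above differ only in the spacing after `!=`, while the later ones are different queries altogether. A minimal sketch that separates the two cases, assuming whitespace is insignificant for this comparison (note that collapsing whitespace would also alter spacing inside string literals). The helpers `normalize` and `same_multiset` are hypothetical names, and the sample strings are copied from the outputs above.

```python
from collections import Counter

def normalize(query):
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(query.split())

def same_multiset(gold, records):
    """True if both lists hold the same queries, ignoring order and spacing."""
    return Counter(map(normalize, gold)) == Counter(map(normalize, records))

# Same two queries, listed in a different order and with different spacing.
gold_sample = [
    "SELECT Hosts FROM farm_competition WHERE Theme != 'Aliens'",
    "SELECT count(*) FROM member WHERE Membership_card  =  'Black'",
]
json_sample = [
    "SELECT count(*) FROM member WHERE Membership_card  =  'Black'",
    "SELECT Hosts FROM farm_competition WHERE Theme !=  'Aliens'",
]
print(gold_sample == json_sample)               # strict, order-sensitive comparison
print(same_multiset(gold_sample, json_sample))  # order- and spacing-insensitive comparison
```

On the real data, calling `same_multiset(gold_queries, train_spider_queries + [r['query'] for r in train_others])` would show whether the gold file holds the same queries as the two `.json` files once ordering and spacing are set aside.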

`train_gold.sql` seems to disagree with the training data. Even though its record count equals the combined record count of `train_spider.json` and `train_others.json`, its records appear in a different order, and some queries differ in whitespace. According to the Spider GitHub [page](https://github.com/taoyds/spider), it's used during evaluation to compare the predicted SQL to the true SQL. It can probably be regenerated from the `.json` files once we decide how to create our training, validation, and testing splits, so we can ignore this file for now.
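
If a gold file is needed for a custom split, a minimal sketch of how one might be regenerated from the records, assuming the `query<TAB>db_id` layout seen in the `head` output of `train_gold.sql` is all the evaluation expects. The function name `to_gold_lines` is hypothetical.

```python
def to_gold_lines(records):
    """Format records as `query<TAB>db_id` lines, the layout observed in train_gold.sql."""
    return [f"{record['query']}\t{record['db_id']}" for record in records]

# Sample record with values copied from the first training example.
sample = [
    {
        "db_id": "department_management",
        "query": "SELECT count(*) FROM head WHERE age  >  56",
    }
]
for line in to_gold_lines(sample):
    print(line)
```

Writing the result for a chosen split is then a matter of joining the lines with newlines and saving them to a file.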