# Instructions

Datasets:
* `ref_table_precinct_locations_PSGC.csv` – lookup table for precincts
* `results_president.csv` – precinct-level election results for the 2016 presidential race
* `results_vice-president.csv` – precinct-level election results for the 2016 vice presidential race

Tasks:
1. Create a denormalized table replacing `precinct_code` in the `results_*.csv` files with the columns: region, province, municipality, and barangay.

2. Create an interesting data visualization using this dataset.

# Workflow Summary

Since these datasets are rich in geodata, we'll use GADM shapefiles for plotting later on. This entails cleaning the location tags so that they match the GADM labels. The cleaning workflow is divided into three parts:
1. Exact matching by joining the shapefile attribute tables with the location dataframe.
2. Manual cleaning for some items, especially those tagged as overseas votes, since they are easily identifiable and we're sure that they're not in our shapefiles.
3. Fuzzy matching for the rest of the entries.

The result after these steps is a lookup table of precinct codes with the correct GADM location tags so we can easily merge them in Tableau for visualization, and denormalized tables for the `results_*` files so we can do preliminary EDA on the dataset.
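
To make the three passes concrete, here is a minimal sketch on toy data. It is not the code used in this notebook: it uses the standard-library `difflib` as a stand-in for `fuzzywuzzy`, and `GADM_NAMES`, `MANUAL`, and `match_name` are hypothetical names introduced only for illustration.

```python
import difflib

# Toy stand-ins for the GADM labels and the manual overrides (hypothetical values).
GADM_NAMES = ["Abra", "Davao del Norte", "Davao del Sur", "Samar"]
MANUAL = {"DAVAO OCCIDENTAL": "Davao del Sur"}


def match_name(raw, candidates=GADM_NAMES, manual=MANUAL):
    """Return (gadm_name, pass_used) for one raw location tag."""
    by_upper = {name.upper(): name for name in candidates}
    # Pass 1: exact match after normalizing case.
    if raw in by_upper:
        return by_upper[raw], "exact"
    # Pass 2: hand-written overrides for known special cases.
    if raw in manual:
        return manual[raw], "manual"
    # Pass 3: fuzzy match (difflib here, not fuzzywuzzy), best candidate wins.
    close = difflib.get_close_matches(raw, list(by_upper), n=1, cutoff=0.0)
    return (by_upper[close[0]], "fuzzy") if close else (None, "unmatched")


for tag in ["ABRA", "DAVAO OCCIDENTAL", "DAVAO (DAVAO DEL NORTE)"]:
    print(tag, "->", match_name(tag))
```

Running the cheap, reliable passes first keeps the fuzzy pass limited to the entries that really need it.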

# Preliminaries: library imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#geodata
import geopandas as gpd

#utilities
import os
import shutil
from glob import glob
from operator import itemgetter
from collections import defaultdict
import json
from tqdm import tqdm

# for fuzzy matching
from fuzzywuzzy import process

# spark imports
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
from pyspark.sql import Row

sc = SparkContext()
spark = SparkSession(sc)

In [2]:
try:
    os.mkdir(os.path.join('..', 'output'))
except FileExistsError:
    # only swallow the "already there" case; other errors should surface
    print('folder already exists!')

folder already exists!


# Data loading

In [3]:
shape_files = sorted(glob(os.path.join('..', 'gadm', "gadm*.shp")))[1:]
data_files = sorted(glob(os.path.join('..', 'datasets', '*.csv')))

In [4]:
datasets = dict(map(lambda x: (x.strip(r'.csv').split('_')[-1].lower().replace('-', '_'), x), data_files))
datasets

{'location': '../datasets/ref_table_precinct_locations.csv',
 'psgc': '../datasets/ref_table_precinct_locations_PSGC.csv',
 'president': '../datasets/results_president.csv',
 'vice_president': '../datasets/results_vice-president.csv'}
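
A note on why the first key is `location` rather than `locations`: `str.strip` takes a set of characters, not a suffix, so it also eats the trailing `s`. The short sketch below shows the difference using one of the paths above; the `os.path.splitext` variant is only an alternative for comparison, not what the notebook runs.

```python
import os

path = "../datasets/ref_table_precinct_locations.csv"

# str.strip treats its argument as a set of characters, so the trailing
# "s" of "locations" is removed together with ".csv".
stripped = path.strip(".csv").split("_")[-1].lower()

# os.path.splitext removes only the file extension.
stem = os.path.splitext(os.path.basename(path))[0].split("_")[-1].lower()

print(stripped)  # location
print(stem)      # locations
```

The queries in the rest of the notebook refer to the table as `location`, so switching to the `splitext` variant would also require renaming the table in those queries.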

In [5]:
shapes = dict(zip(['provincial', 'municipal', 'brgy'], shape_files))
shapes

{'provincial': '../gadm/gadm36_PHL_1.shp',
 'municipal': '../gadm/gadm36_PHL_2.shp',
 'brgy': '../gadm/gadm36_PHL_3.shp'}

In [6]:
for label, filepath in tqdm(datasets.items()):
    datasets[label] = spark.read.csv(filepath, sep=',', header=True)
#     print(label)
#     display(datasets[label].limit(2).toPandas())
    datasets[label].createOrReplaceTempView(label)

100%|██████████| 4/4 [00:05<00:00,  1.31s/it]


In [7]:
for label, filepath in tqdm(shapes.items()):
#     print(label)
    gdf = gpd.GeoDataFrame.from_file(filepath)
    col_include = list(filter(lambda x: x.startswith('NAME') or x.startswith('VAR'), gdf.columns))
    df = pd.DataFrame(gdf[col_include])
    for col in df.columns:
        df[col] = df[col].astype(str) 
    shapes[label] = spark.createDataFrame(df)
    shapes[label].createOrReplaceTempView(label)
#     display(shapes[label].limit(2).toPandas())

100%|██████████| 3/3 [00:05<00:00,  1.67s/it]


In [8]:
spark.sql("SHOW TABLES").show()

+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
|        |          brgy|       true|
|        |      location|       true|
|        |     municipal|       true|
|        |     president|       true|
|        |    provincial|       true|
|        |          psgc|       true|
|        |vice_president|       true|
+--------+--------------+-----------+



# Cleaning

## first pass cleaning: provincial level

In [9]:
spark.sql('select distinct region, province from location').count()

90

In [10]:
lookup = spark.sql("""
            SELECT DISTINCT
                --l.precinct_code, 
                l.region, 
                l.province, 
                --l.municipality,
                --l.barangay,
                p.NAME_1 gadm_province 
                --b.NAME_2 gadm_municipality, 
                --b.NAME_3 gadm_brgy
            FROM location l
            LEFT JOIN provincial p
                ON (UPPER(p.NAME_1) = l.province OR (UPPER(p.VARNAME_1) = l.province))
--                AND UPPER(b.NAME_2) = l.municipality
--                AND UPPER(b.NAME_3) = l.barangay
""")
lookup.createOrReplaceTempView('lookup')

In [11]:
spark.sql('SELECT * FROM lookup WHERE gadm_province IS NULL').toPandas()

Unnamed: 0,region,province,gadm_province
0,REGION XI,DAVAO (DAVAO DEL NORTE),
1,NCR,TAGUIG - PATEROS,
2,OAV,ASIA,
3,OAV,MIDDLE EAST AND AFRICAS,
4,REGION XI,DAVAO OCCIDENTAL,
5,OAV,NORTH AND LATIN AMERICA,
6,NCR,NATIONAL CAPITAL REGION - THIRD DISTRICT,
7,NCR,NATIONAL CAPITAL REGION - FOURTH DISTRICT,
8,OAV,EUROPE,
9,NCR,NATIONAL CAPITAL REGION - SECOND DISTRICT,


In [12]:
lookup = spark.sql("""
            SELECT 
                --precinct_code,
                region, 
                province,
                CASE 
                    WHEN region = "NCR"
                        THEN "Metropolitan Manila"
                    WHEN region = "OAV"
                        THEN province
                    WHEN province = "DAVAO OCCIDENTAL"
                        THEN "Davao del Sur"
                    WHEN province LIKE "%WESTERN%"
                        THEN "Samar"
                    WHEN province LIKE "%DAVAO DEL NORTE%"
                        THEN "Davao del Norte"
                    WHEN province LIKE "COTABATO%NORTH%"
                        THEN "North Cotabato"
                        ELSE gadm_province
                END as gadm_province
            FROM lookup
""")
lookup.createOrReplaceTempView('lookup')
spark.sql('SELECT * FROM lookup WHERE gadm_province IS NULL').toPandas()

Unnamed: 0,region,province,gadm_province


In [13]:
spark.sql('select count(*) from lookup').show()

+--------+
|count(1)|
+--------+
|      90|
+--------+
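
The repeated `count(*)` checks guard against a join silently duplicating rows: a left join keeps every left row, but it emits one output row per match, so a key that matches twice on the right side inflates the count. Below is a minimal pure-Python sketch of that effect. The `left_join` helper and the toy rows (including the `Western Samar` label) are hypothetical and only illustrate the idea.

```python
def left_join(left, right, key):
    """Left-join two lists of dicts on `key`; unmatched left rows are kept."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # A left row with no match still produces exactly one output row.
        for match in index.get(row[key], [None]):
            joined.append({**row, "gadm": match["gadm"] if match else None})
    return joined


left = [{"province": "ABRA"}, {"province": "SAMAR"}, {"province": "EUROPE"}]
right_ok = [
    {"province": "ABRA", "gadm": "Abra"},
    {"province": "SAMAR", "gadm": "Samar"},
]
# Same table plus a second candidate for SAMAR (toy value).
right_dup = right_ok + [{"province": "SAMAR", "gadm": "Western Samar"}]

print(len(left_join(left, right_ok, "province")))   # 3, same as len(left)
print(len(left_join(left, right_dup, "province")))  # 4, SAMAR was duplicated
```

Comparing the row count after each join with the distinct count taken before it, as done here with 90, catches this kind of fan-out early.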



## second pass cleaning: municipal level

In [14]:
spark.sql('select distinct region, province, municipality from location').count()

1663

In [15]:
lookup = spark.sql("""
            SELECT DISTINCT
                k.region, 
                k.province, 
                k.gadm_province, 
                l.municipality, 
                m.NAME_2 gadm_municipality
            FROM lookup k
            LEFT JOIN (SELECT DISTINCT province, municipality FROM location) as l
                ON l.province = k.province
            LEFT JOIN municipal m
                ON m.NAME_1 = k.gadm_province 
                    AND (
                        UPPER(m.NAME_2) = l.municipality 
                            OR UPPER(m.VARNAME_2) = l.municipality
                        )
""")
lookup.createOrReplaceTempView('lookup')

In [16]:
spark.sql('select count(*) from lookup').show()

+--------+
|count(1)|
+--------+
|    1663|
+--------+



In [17]:
spark.sql('SELECT * FROM lookup WHERE gadm_municipality IS NULL').toPandas()

Unnamed: 0,region,province,gadm_province,municipality,gadm_municipality
0,REGION X,MISAMIS ORIENTAL,Misamis Oriental,MAGSAYSAY (LINUGOS),
1,REGION XIII,DINAGAT ISLANDS,Dinagat Islands,BASILISA (RIZAL),
2,REGION XIII,DINAGAT ISLANDS,Dinagat Islands,LIBJO (ALBOR),
3,REGION IV-B,ORIENTAL MINDORO,Oriental Mindoro,CITY OF CALAPAN,
4,REGION IV-B,ORIENTAL MINDORO,Oriental Mindoro,BULALACAO (SAN PEDRO),
...,...,...,...,...,...
133,ARMM,LANAO DEL SUR,Lanao del Sur,POONA BAYABAO (GATA),
134,ARMM,LANAO DEL SUR,Lanao del Sur,TAGOLOAN,
135,ARMM,LANAO DEL SUR,Lanao del Sur,BALINDONG (WATU),
136,ARMM,TAWI-TAWI,Tawi-Tawi,MAPUN (CAGAYAN DE TAWI-TAWI),
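
Many of the unmatched names follow a `NAME (ALIAS)` pattern, e.g. `BASILISA (RIZAL)`. One option, not used in this notebook, would be to split such names into their parts and try an exact match on each part before falling back to fuzzy matching. The sketch below assumes the text in parentheses is an alternate name; `ALIAS_PATTERN` and `name_variants` are hypothetical helpers.

```python
import re

# "NAME (ALIAS)" -> name="NAME", alias="ALIAS"
ALIAS_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<alias>.+)\)$")


def name_variants(raw):
    """Split 'NAME (ALIAS)' into candidate spellings, primary name first."""
    raw = raw.strip()
    m = ALIAS_PATTERN.match(raw)
    if not m:
        return [raw]
    return [m.group("name"), m.group("alias")]


print(name_variants("BASILISA (RIZAL)"))  # ['BASILISA', 'RIZAL']
print(name_variants("CITY OF CALAPAN"))   # ['CITY OF CALAPAN']
```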


In [18]:
lookup = spark.sql("""
            SELECT 
                --precinct_code,
                region, 
                province,
                gadm_province,
                municipality,
                CASE 
                    WHEN province LIKE "%MANILA"
                        THEN "Manila"
                    WHEN region = "OAV"
                        THEN municipality
                        ELSE gadm_municipality
                END as gadm_municipality
            FROM lookup
""")
lookup.createOrReplaceTempView('lookup')

In [19]:
spark.sql('SELECT * FROM lookup WHERE gadm_municipality IS NULL').toPandas()

Unnamed: 0,region,province,gadm_province,municipality,gadm_municipality
0,REGION X,MISAMIS ORIENTAL,Misamis Oriental,MAGSAYSAY (LINUGOS),
1,REGION XIII,DINAGAT ISLANDS,Dinagat Islands,BASILISA (RIZAL),
2,REGION XIII,DINAGAT ISLANDS,Dinagat Islands,LIBJO (ALBOR),
3,REGION IV-B,ORIENTAL MINDORO,Oriental Mindoro,CITY OF CALAPAN,
4,REGION IV-B,ORIENTAL MINDORO,Oriental Mindoro,BULALACAO (SAN PEDRO),
...,...,...,...,...,...
102,ARMM,LANAO DEL SUR,Lanao del Sur,POONA BAYABAO (GATA),
103,ARMM,LANAO DEL SUR,Lanao del Sur,TAGOLOAN,
104,ARMM,LANAO DEL SUR,Lanao del Sur,BALINDONG (WATU),
105,ARMM,TAWI-TAWI,Tawi-Tawi,MAPUN (CAGAYAN DE TAWI-TAWI),


In [20]:
spark.sql('select count(*) from lookup').show()

+--------+
|count(1)|
+--------+
|    1663|
+--------+



In [21]:
def create_reference_dict(query):
    reference = spark.sql(query).toJSON().collect()
    ref = defaultdict(list)
    for entry in reference:
        items = json.loads(entry)
        if len(items) < 3:
            ref[items['NAME_1']] += [items['NAME_2']]
        else:
            ref[(items['NAME_1'], items['NAME_2'])] += [items['NAME_3']]
    return ref

In [22]:
query = """
                SELECT NAME_1, NAME_2
                FROM municipal
                WHERE NAME_1 IN (SELECT gadm_province FROM lookup WHERE gadm_municipality IS NULL)
                --AND NAME_2 NOT IN (SELECT gadm_municipality FROM lookup WHERE gadm_municipality IS NOT NULL)
    """
reference = create_reference_dict(query)

In [23]:
def map_municipalities(r):
    # rows that already matched exactly are passed through unchanged
    if r.gadm_municipality:
        return r
#     elif r.region == "OAV":
#         return Row(r.region, r.province, r.gadm_province, r.municipality, r.municipality)
    else:
        m = r.municipality
        p = r.gadm_province
        try:
            # candidate GADM municipalities within the same province
            municipalities = reference[p]
            g = process.extractOne(m, municipalities)[0]
        except Exception:
            # no usable candidates for this province: leave the row as-is
            return r
        if g is None:
            return r
        else:
            return Row(r.region, r.province, r.gadm_province, r.municipality, g)

In [24]:
rdd = lookup.rdd.map(map_municipalities)
lookup = rdd.toDF()
lookup.createOrReplaceTempView('lookup')

In [25]:
spark.sql('select * from lookup where gadm_municipality is null').toPandas()

Unnamed: 0,region,province,gadm_province,municipality,gadm_municipality


In [26]:
spark.sql('select count(*) from lookup').show()

+--------+
|count(1)|
+--------+
|    1663|
+--------+



## third pass cleaning: barangay level

In [27]:
spark.sql('select distinct region, province, municipality, barangay from location').count()

40805

In [28]:
lookup = spark.sql("""
            SELECT DISTINCT
                k.region, 
                k.province, 
                k.gadm_province, 
                k.municipality, 
                k.gadm_municipality,
                l.barangay
                --, b.NAME_3 gadm_brgy
            FROM lookup k
            LEFT JOIN location l
                ON l.province = k.province
                AND l.municipality = k.municipality

""")
lookup.createOrReplaceTempView('lookup')

In [29]:
spark.sql('select * from lookup limit 5').show()

+-----------+-------------+-------------+------------+-----------------+--------------------+
|     region|     province|gadm_province|municipality|gadm_municipality|            barangay|
+-----------+-------------+-------------+------------+-----------------+--------------------+
|REGION VIII|EASTERN SAMAR|Eastern Samar|    LLORENTE|         Llorente|  BARANGAY  2 (POB.)|
|REGION VIII|EASTERN SAMAR|Eastern Samar|        ORAS|             Oras|             MABUHAY|
|REGION VIII|EASTERN SAMAR|Eastern Samar|     SALCEDO|          Salcedo|              BUABUA|
|REGION VIII|EASTERN SAMAR|Eastern Samar|  SAN JULIAN|       San Julian|               LIBAS|
|REGION VIII|EASTERN SAMAR|Eastern Samar|      LAWAAN|           Lawaan|BARANGAY POBLACIO...|
+-----------+-------------+-------------+------------+-----------------+--------------------+



In [30]:
lookup = spark.sql("""
            SELECT DISTINCT 
                k.region, 
                k.province, 
                k.gadm_province, 
                k.municipality, 
                k.gadm_municipality, 
                k.barangay, 
                b.NAME_3 gadm_brgy
            FROM lookup k
            LEFT JOIN brgy b
            ON b.NAME_1 = k.gadm_province
            AND b.NAME_2 = k.gadm_municipality
            AND (UPPER(b.NAME_3) = k.barangay)
""")
lookup.createOrReplaceTempView('lookup')

In [31]:
lookup.show(5)

+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|     region|        province|   gadm_province|municipality|gadm_municipality|   barangay|  gadm_brgy|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|        CAR|            ABRA|            Abra|   MALIBCONG|        Malibcong|      UMNAP|      Umnap|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  BUENAVISTA|       Buenavista|POBLACION 6|Poblacion 6|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  BUENAVISTA|       Buenavista|      RIZAL|      Rizal|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  LAS NIEVES|       Las Nieves|MARCOS CALO|Marcos Calo|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|    SANTIAGO|         Santiago| SAN ISIDRO| San Isidro|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
only showing top 5 rows



In [32]:
lookup.count()

40805

In [33]:
spark.sql('SELECT count(*) FROM lookup WHERE gadm_brgy IS NULL').show()

+--------+
|count(1)|
+--------+
|    5276|
+--------+



In [34]:
lookup = spark.sql("""
            SELECT 
                --precinct_code,
                region, 
                province,
                gadm_province,
                municipality,
                gadm_municipality,
                barangay,
                CASE 
                    WHEN region = "OAV"
                        THEN barangay
                        ELSE gadm_brgy
                END as gadm_brgy
            FROM lookup
""")
lookup.createOrReplaceTempView('lookup')

In [35]:
spark.sql('SELECT count(*) FROM lookup WHERE gadm_brgy IS NULL').show()

+--------+
|count(1)|
+--------+
|    5239|
+--------+



In [36]:
query = """
            SELECT NAME_1, NAME_2, NAME_3
            FROM brgy
            WHERE NAME_1 in (SELECT gadm_province FROM lookup WHERE gadm_brgy IS NULL)
            AND NAME_2 in (SELECT gadm_municipality FROM lookup WHERE gadm_brgy IS NULL)
            --AND NAME_2 NOT IN (SELECT gadm_municipality FROM lookup WHERE gadm_municipality IS NOT NULL)
"""
reference = create_reference_dict(query)

In [37]:
len(reference)

1188

In [38]:
def map_brgys(r):
    if r.gadm_brgy:
        return r
    else:
        m = r.gadm_municipality
        p = r.gadm_province
        b = r.barangay
        
        brgys = reference[(p, m)]
        g = process.extractOne(b, brgys)[0]
        if g is None:
            return r
        else:
            return Row(r.region, r.province, r.gadm_province, r.municipality, r.gadm_municipality, r.barangay, g)
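
Two things are worth keeping in mind with this function. `reference` is a `defaultdict`, so a `(province, municipality)` pair that is missing from GADM yields an empty candidate list, and the best candidate is accepted no matter how weak the match is. The sketch below shows one way to guard against both. It is an illustration only: it uses the standard-library `difflib` instead of `fuzzywuzzy`, and `best_brgy`, the `cutoff` value, and the toy `reference` entries are assumptions.

```python
import difflib
from collections import defaultdict

# Toy reference table keyed by (province, municipality).
reference = defaultdict(list)
reference[("Abra", "Malibcong")] = ["Umnap", "Poblacion"]


def best_brgy(province, municipality, barangay, cutoff=0.6):
    """Return the closest candidate name, or None when nothing qualifies."""
    # .get avoids inserting an empty list into the defaultdict on a miss.
    candidates = reference.get((province, municipality), [])
    by_upper = {c.upper(): c for c in candidates}
    close = difflib.get_close_matches(
        barangay.upper(), list(by_upper), n=1, cutoff=cutoff
    )
    return by_upper[close[0]] if close else None


print(best_brgy("Abra", "Malibcong", "UMNAPP"))  # Umnap
print(best_brgy("Abra", "Nowhere", "UMNAP"))     # None
```

Rows that come back as `None` can then be reviewed by hand instead of being assigned to an arbitrary barangay.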

In [39]:
rdd = lookup.rdd.map(map_brgys)
lookup = rdd.toDF()
lookup.createOrReplaceTempView('lookup')

In [40]:
spark.sql('select count(*) from lookup where gadm_brgy is null').show()

+--------+
|count(1)|
+--------+
|       0|
+--------+



In [41]:
spark.sql('select * from lookup limit 5').show()

+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|     region|        province|   gadm_province|municipality|gadm_municipality|   barangay|  gadm_brgy|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|        CAR|            ABRA|            Abra|   MALIBCONG|        Malibcong|      UMNAP|      Umnap|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  BUENAVISTA|       Buenavista|POBLACION 6|Poblacion 6|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  BUENAVISTA|       Buenavista|      RIZAL|      Rizal|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  LAS NIEVES|       Las Nieves|MARCOS CALO|Marcos Calo|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|    SANTIAGO|         Santiago| SAN ISIDRO| San Isidro|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+



In [42]:
lookup.count()

40805

## final precinct code lookup

In [43]:
spark.sql('select count(distinct precinct_code) from location').show()

+-----------------------------+
|count(DISTINCT precinct_code)|
+-----------------------------+
|                        90642|
+-----------------------------+



In [44]:
spark.sql('select count(*) from lookup').show()

+--------+
|count(1)|
+--------+
|   40805|
+--------+



In [45]:
spark.sql('select distinct(*) from lookup').count()

40805

In [46]:
lookup.show(2)

+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|     region|        province|   gadm_province|municipality|gadm_municipality|   barangay|  gadm_brgy|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
|        CAR|            ABRA|            Abra|   MALIBCONG|        Malibcong|      UMNAP|      Umnap|
|REGION XIII|AGUSAN DEL NORTE|Agusan del Norte|  BUENAVISTA|       Buenavista|POBLACION 6|Poblacion 6|
+-----------+----------------+----------------+------------+-----------------+-----------+-----------+
only showing top 2 rows



In [47]:
final_lookup = spark.sql("""
            SELECT DISTINCT
                l.precinct_code, 
                k.region,
                k.gadm_province, 
                k.gadm_municipality, 
                k.gadm_brgy
            FROM location l
            LEFT JOIN lookup k
                ON k.province = l.province
                AND k.municipality = l.municipality
                AND k.barangay = l.barangay
""")
final_lookup.createOrReplaceTempView('final_lookup')

In [48]:
spark.sql('select * from final_lookup limit 5').show()

+-------------+-----------+-------------------+-----------------+--------------+
|precinct_code|     region|      gadm_province|gadm_municipality|     gadm_brgy|
+-------------+-----------+-------------------+-----------------+--------------+
|     74040764|        NCR|Metropolitan Manila|      Quezon City|     Milagrosa|
|     14030047| REGION III|            Bulacan|          Baliuag|        Pagala|
|     24040005|  REGION XI|      Davao del Sur|          Hagonoy|Guihing Aplaya|
|     56070075|REGION IV-A|             Quezon|          Calauag|   Santa Maria|
|     64170026|REGION VIII|     Southern Leyte|            Sogod|      Kahupian|
+-------------+-----------+-------------------+-----------------+--------------+



In [49]:
final_lookup.count()

90642

In [50]:
spark.sql('''select * from lookup 
          where province = "SORSOGON" 
          and municipality="SORSOGON CITY" and barangay="POBLACION"''').show()

+--------+--------+-------------+-------------+-----------------+---------+---------+
|  region|province|gadm_province| municipality|gadm_municipality| barangay|gadm_brgy|
+--------+--------+-------------+-------------+-----------------+---------+---------+
|REGION V|SORSOGON|     Sorsogon|SORSOGON CITY|    Sorsogon City|POBLACION|Poblacion|
+--------+--------+-------------+-------------+-----------------+---------+---------+



In [51]:
spark.sql('select * from location where precinct_code = 62160095').show()

+-------------+--------+--------+-------------+---------+-----------------+------------+
|precinct_code|  region|province| municipality| barangay|registered_voters|ballots_cast|
+-------------+--------+--------+-------------+---------+-----------------+------------+
|     62160095|REGION V|SORSOGON|SORSOGON CITY|POBLACION|              694|         584|
+-------------+--------+--------+-------------+---------+-----------------+------------+



## save lookup tables

In [52]:
def save_query_to_csv(query, filename):
    path = os.path.join('..', 'output', filename)
    try:
        # coalesce(1) forces a single part file inside the `path` folder
        spark.sql(query).coalesce(1).write.option("header", "true").csv(path)
        print('file saved!')
    except Exception:
        # most likely the output folder is left over from a previous run
        inp = input('file already exists! retry save? Y/N')
        if inp == 'Y':
            shutil.rmtree(path)
            save_query_to_csv(query, filename)

In [53]:
save_query_to_csv('select * from final_lookup', 'lookup')

file already exists! retry save? Y/Nn


In [54]:
save_query_to_csv('''select 
                precinct_code, 
                registered_voters, 
                ballots_cast 
            from location''', 'location')

file already exists! retry save? Y/NY
file saved!


# Denormalized results tables

In [55]:
for table in ['president', 'vice_president']:
    spark.sql(f"""
                SELECT 
                    f.region, 
                    f.gadm_province, 
                    f.gadm_municipality, 
                    f.gadm_brgy,
                    p.precinct_code, 
                    p.contest_code, 
                    p.candidate_name, 
                    p.party_code,
                    p.votes, 
                    p.col5, 
                    p.ballots_cast, 
                    p.col7, 
                    p.col8, 
                    p.pct_votes
                FROM {table} p
                JOIN final_lookup f 
                ON p.precinct_code = f.precinct_code

    """).createOrReplaceTempView(f'denorm_{table[:4]}')

In [56]:
spark.sql('select * from denorm_vice limit 5').toPandas()

Unnamed: 0,region,gadm_province,gadm_municipality,gadm_brgy,precinct_code,contest_code,candidate_name,party_code,votes,col5,ballots_cast,col7,col8,pct_votes
0,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,299009,"MARCOS, BONGBONG (IND)",58,129,4,491,8,0,0.262729124236
1,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,299009,"TRILLANES, ANTONIO IV (IND)",58,12,6,491,8,0,0.0244399185336
2,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,299009,"HONASAN, GRINGO (UNA)",163,9,3,491,8,0,0.0183299389002
3,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,299009,"ROBREDO, LENI DAANG MATUWID (LP)",85,210,5,491,8,0,0.427698574338
4,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,299009,"ESCUDERO, CHIZ (IND)",58,54,2,491,8,0,0.109979633401


In [57]:
spark.sql('select * from denorm_pres limit 5').toPandas()

Unnamed: 0,region,gadm_province,gadm_municipality,gadm_brgy,precinct_code,contest_code,candidate_name,party_code,votes,col5,ballots_cast,col7,col8,pct_votes
0,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,199009,"DUTERTE, RODY (PDPLBN)",114,168,3,491,2,0,0.34215885947
1,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,199009,"DEFENSOR SANTIAGO, MIRIAM (PRP)",135,35,2,491,2,0,0.071283095723
2,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,199009,"BINAY, JOJO (UNA)",163,88,1,491,2,0,0.179226069246
3,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,199009,"ROXAS, MAR DAANG MATUWID (LP)",85,82,5,491,2,0,0.16700610998
4,REGION IV-A,Batangas,Batangas City,Barangay 1,10050007,199009,"SEÑERES, ROY (WPPPMM)",165,1,6,491,2,0,0.0020366598778
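
Because the denormalized tables are built with an inner `JOIN`, any precinct in the results that has no entry in `final_lookup` would be dropped without warning. A quick way to check for that is to compare the two sets of keys. The sketch below does this in plain Python on toy rows; `99999999` is a made-up code, while the other two appear earlier in this notebook. In Spark SQL the same check could be written with a `LEFT ANTI JOIN`.

```python
# Toy results rows and lookup keys; precinct codes are kept as strings.
results = [
    {"precinct_code": "10050007", "votes": 129},
    {"precinct_code": "99999999", "votes": 3},  # hypothetical unmatched code
]
lookup_codes = {"10050007", "62160095"}

# Codes present in the results but absent from the lookup table.
missing = sorted({row["precinct_code"] for row in results} - lookup_codes)

print(missing)  # ['99999999']
```

An empty list means the inner join loses nothing.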


In [58]:
for table in tqdm(['pres', 'vice']):
    save_query_to_csv(f'select * from denorm_{table}', f'denorm_{table}')

 50%|█████     | 1/2 [00:25<00:25, 25.08s/it]

file saved!


100%|██████████| 2/2 [00:47<00:00, 23.82s/it]

file saved!





In [4]:
files = glob(os.path.join('..', 'output','*','*csv'))

In [5]:
ogfiles = list(map(lambda x: os.path.split(x)[-1], files))
ogfiles

['part-00000-76f5f423-02b9-49cf-9f40-258287e7ea70-c000.csv',
 'part-00000-a16b52c8-a693-4847-9f27-2b9c0022ae36-c000.csv',
 'part-00000-fff5ddd7-10e8-4520-8bf2-bca38e143240-c000.csv',
 'part-00000-4d6ed548-8914-413c-84dc-bac3ae10fd94-c000.csv']


In [6]:
filenames = list(map(lambda x: x.split(os.sep)[2]+r'.csv', files))
filenames

['location.csv', 'lookup.csv', 'denorm_pres.csv', 'denorm_vice.csv']


In [10]:
for path, filename in tqdm(zip(files, filenames)):
    try:
        ogfilename = os.path.split(path)[-1]
        shutil.move(path, os.path.join('..', 'output'))
        os.rename(os.path.join('..', 'output', ogfilename), os.path.join('..', 'output', filename))
        shutil.rmtree(os.path.join('..', 'output', filename.split('.')[0]))
    except FileNotFoundError:
        pass

4it [00:00, 383.18it/s]
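
The last few cells exist because Spark writes each CSV as a folder containing a `part-*.csv` file, while we want plain `<name>.csv` files in `output`. The same clean-up can be wrapped in one function, sketched below with `pathlib`. `flatten_spark_csv` is a hypothetical helper, and the demonstration runs on a temporary directory that only mimics the layout, so nothing in `../output` is touched.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_spark_csv(output_dir):
    """Move each <name>/part-*.csv up to <name>.csv and delete the folder."""
    output_dir = Path(output_dir)
    moved = []
    for folder in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        parts = sorted(folder.glob("part-*.csv"))
        if len(parts) != 1:
            continue  # skip folders that do not hold exactly one part file
        target = output_dir / f"{folder.name}.csv"
        shutil.move(str(parts[0]), str(target))
        shutil.rmtree(folder)
        moved.append(target.name)
    return moved


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "lookup"
    demo.mkdir()
    (demo / "part-00000-demo-c000.csv").write_text("precinct_code,region\n")
    (demo / "_SUCCESS").write_text("")
    print(flatten_spark_csv(tmp))  # ['lookup.csv']
```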
