<a id="ID_top"></a>
## Raw data script

Takes files from the `./0_raw/` directory, applies small cleaning steps or corrections, documents known errors, and saves the result to:
- `./1_raw_processed_backup/` with version control
- `./2_raw_processed_input/` to store the version to be used for live analysis in other scripts

**Last change:** 16.06.2020

#### Code sections:
    
|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [1]:
#=== working with external files
# the %%writefile cell magic saves the content of a cell to a file; it silently overwrites any existing file
#%%writefile script_0_to_1.py

# the %load line magic loads the content of a file into a cell
#%load script_0_to_1.py

In [1]:
# package to get all file names in directory
import os
from shutil import copyfile
from shutil import make_archive
import zipfile
# create dataframes
import pandas as pd
# dynamic versioning
from datetime import datetime

<a id="ID_paths"></a>
### Filepaths
|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [2]:
# This script allows one to load and correct raw files before saving them again.

file_path_0_raw = "./0_raw/"
file_path_1_backup = "./1_raw_processed_backup/"
file_path_2_input = "./2_raw_processed_input/"
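The export step further down writes into the backup and input directories, so it may help to make sure they exist first. A minimal sketch, assuming a hypothetical helper `ensure_directories` (not part of the original script) and run here against a temporary directory rather than the real project folders:

```python
import os
import tempfile


def ensure_directories(paths):
    """Create each directory if it is missing and return the paths as a list."""
    paths = list(paths)  # materialise first so generators can be reused below
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return paths


# Demonstrated in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as base:
    created = ensure_directories(
        os.path.join(base, name)
        for name in ("0_raw", "1_raw_processed_backup", "2_raw_processed_input")
    )
    all_exist = all(os.path.isdir(path) for path in created)
```

In the notebook itself the call would presumably take `file_path_1_backup` and `file_path_2_input` directly.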

<a id="ID_names"></a>
### Filenames

|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [3]:
# list of all files
filenames = os.listdir(file_path_0_raw)
print(filenames)

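# list of file names that can be read with the same rule
# note: os.listdir returns names in arbitrary order, so check the index against the printed list
file_to_batch_read = [filenames[2]]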

['0_raw_explainer_doc.md', 'WIOT2014_Nov16_ROW.xlsb', 'release_1.0_2005_2016.csv.zip', 'China-multi-regional-input-output-MRIO-table-2012.xlsx']
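The order returned by `os.listdir` is not guaranteed, so choosing files by suffix is usually more robust than choosing them by position. A small sketch, assuming a hypothetical helper `select_by_suffix` and reusing the file names printed above:

```python
def select_by_suffix(names, suffix):
    """Return the names ending in `suffix`, sorted for a stable order."""
    return sorted(name for name in names if name.endswith(suffix))


# File names copied from the directory listing printed above.
example_names = [
    "0_raw_explainer_doc.md",
    "WIOT2014_Nov16_ROW.xlsb",
    "release_1.0_2005_2016.csv.zip",
    "China-multi-regional-input-output-MRIO-table-2012.xlsx",
]
selected = select_by_suffix(example_names, ".csv.zip")
print(selected)
```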


<a id="ID_load"></a>
### Load

|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [4]:
# list of datasets
dataset_list = []

# load files in batch
for file in file_to_batch_read:
    print(f"Processing | {file_path_0_raw}{file} ...")
    
    # read zip file and make sure only the first file is passed on
    with zipfile.ZipFile(f"{file_path_0_raw}{file}") as temp_zf:
        # open the first archive member and parse it as CSV
        with temp_zf.open(temp_zf.namelist()[0]) as temp_member:
            temp_read = pd.read_csv(temp_member)
    dataset_list.append(temp_read)
    print("...done")

Processing | ./0_raw/release_1.0_2005_2016.csv.zip ...
...done
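Taking the first archive member works when the zip holds a single CSV, but an archive could also contain other files. One possible safeguard, sketched with a hypothetical helper `first_csv_member` and a small in-memory archive whose contents are made up for illustration:

```python
import io
import zipfile


def first_csv_member(archive):
    """Return the name of the first member ending in .csv, or raise if none."""
    for name in archive.namelist():
        if name.lower().endswith(".csv"):
            return name
    raise ValueError("archive contains no .csv member")


# Build a small in-memory archive so the example runs without the raw data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not a table")
    archive.writestr("example.csv", "year,contiguity\n2005,0\n")

# Re-open for reading, pick the CSV member, and peek at its header line.
with zipfile.ZipFile(buffer) as archive:
    member = first_csv_member(archive)
    with archive.open(member) as handle:
        header = handle.readline().decode("utf-8").strip()
```

The handle returned by `archive.open(member)` is file-like, so it could be handed to `pd.read_csv` in place of the header peek.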


<a id="ID_correct"></a>
### Correct

|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [5]:
# dataset_list is a list; index into it to open the correct table
dataset_list[0].head()

Unnamed: 0,year,country_d,iso3_d,dynamic_code_d,landlocked_d,island_d,region_d,gdp_pwt_const_d,pop_d,gdp_pwt_cur_d,...,hostility_level_o,hostility_level_d,distance,common_language,colony_of_destination_after45,colony_of_destination_current,colony_of_destination_ever,colony_of_origin_after45,colony_of_origin_current,colony_of_origin_ever
0,2005,Aruba,ABW,ABW,0,1,caribbean,3906.5203,0.100031,4093.2434,...,0,0,120.05867,1,0,0,0,0,0,0
1,2006,Aruba,ABW,ABW,0,1,caribbean,4118.1396,0.10083,4217.0669,...,0,0,978.77728,1,0,0,0,0,0,0
2,2007,Aruba,ABW,ABW,0,1,caribbean,4196.4634,0.101218,4248.4707,...,0,0,8563.6963,0,0,0,0,0,0,0
3,2008,Aruba,ABW,ABW,0,1,caribbean,4433.6772,0.101342,4441.8828,...,0,0,7562.6733,0,0,0,0,0,0,0
4,2009,Aruba,ABW,ABW,0,1,caribbean,4183.0449,0.101416,4304.9224,...,0,0,16904.596,1,0,0,0,0,0,0


#### Corrections | release_1.0_2005_2016.csv 

In [6]:
# release_1.0_2005_2016.csv
raw_df_dyn_grav_05_16 = dataset_list[0]

In [7]:
# preview problematic part of dataset
temp_column_view = ["country_d","country_o","contiguity"]

# filter on Greece and Bulgaria
raw_df_dyn_grav_05_16[
    # filter the Greece - Bulgaria combo or the Bulgaria - Greece combo
    ((raw_df_dyn_grav_05_16.iso3_d == "GRC") & (raw_df_dyn_grav_05_16.iso3_o == "BGR"))
    |
    ((raw_df_dyn_grav_05_16.iso3_d == "BGR") & (raw_df_dyn_grav_05_16.iso3_o == "GRC"))
].loc[:, temp_column_view].head()

Unnamed: 0,country_d,country_o,contiguity
71945,Bulgaria,Greece,0
72099,Bulgaria,Greece,0
72483,Bulgaria,Greece,0
72638,Bulgaria,Greece,0
72988,Bulgaria,Greece,0


In [8]:
# Bulgaria and Greece share a land border, but the dataset records contiguity = 0
# collect the index of all affected rows
temp_index = raw_df_dyn_grav_05_16[
    # filter the Greece - Bulgaria combo or the Bulgaria - Greece combo
    ((raw_df_dyn_grav_05_16.iso3_d == "GRC") & (raw_df_dyn_grav_05_16.iso3_o == "BGR"))
    |
    ((raw_df_dyn_grav_05_16.iso3_d == "BGR") & (raw_df_dyn_grav_05_16.iso3_o == "GRC"))
].index

In [9]:
# set contiguity to 1 for both directions of the pair
raw_df_dyn_grav_05_16.loc[temp_index, "contiguity"] = 1
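If more country pairs need the same kind of fix, the filter and the update could be wrapped in one function. This is only a sketch: `set_pair_value` is a hypothetical helper, and the toy frame below uses made-up rows that merely mimic the `iso3_o`, `iso3_d`, and `contiguity` columns of the dataset.

```python
import pandas as pd


def set_pair_value(df, iso_a, iso_b, column, value):
    """Set `column` to `value` for both directions of a country pair, in place.

    Returns the number of rows that matched the pair.
    """
    mask = (
        ((df["iso3_o"] == iso_a) & (df["iso3_d"] == iso_b))
        | ((df["iso3_o"] == iso_b) & (df["iso3_d"] == iso_a))
    )
    df.loc[mask, column] = value
    return int(mask.sum())


# Toy data for illustration only, not taken from the real dataset.
toy = pd.DataFrame({
    "iso3_o": ["BGR", "GRC", "BGR", "ABW"],
    "iso3_d": ["GRC", "BGR", "ABW", "GRC"],
    "contiguity": [0, 0, 0, 0],
})
rows_changed = set_pair_value(toy, "BGR", "GRC", "contiguity", 1)
```

Returning the number of matched rows gives a quick check that the filter found something before the corrected table is exported.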

<a id="ID_export"></a>
### Export

|| [0|Top](#ID_top) || [1|Filepaths](#ID_paths) || [2|Filenames](#ID_names) || [3|Load](#ID_load) || [4|Correct](#ID_correct) || [5|Export](#ID_export) ||

In [11]:
# Files to export
file_export = [raw_df_dyn_grav_05_16]
file_names = ["dynamic_gravity"]

In [12]:
# run through all files
for index, file in enumerate(file_export):
    # set current time and version
    temp_now = datetime.now()
    time_of_export = temp_now.strftime("%Y%m%d_%H%M")
    # timestamped name in the backup directory
    temp_back_up_name = f"{file_path_1_backup}store_{file_names[index]}_{time_of_export}.csv"
    # export as gzip-compressed CSV
    compression_type = "gzip"
    temp_compress_name = temp_back_up_name + "." + compression_type
    file.to_csv(temp_compress_name, compression=compression_type)
    # copy to live folder, overwriting the previous live version
    temp_live_name = f"{file_path_2_input}input_{file_names[index]}.csv.{compression_type}"
    copyfile(temp_compress_name, temp_live_name)
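The naming scheme used in the export loop can be isolated so it is easy to inspect: the backup name carries a timestamp, while the live name stays fixed and is overwritten on each run. A minimal sketch, assuming a hypothetical helper `build_export_names` and an arbitrary example timestamp:

```python
from datetime import datetime


def build_export_names(name, backup_dir, live_dir, now, compression="gzip"):
    """Return (backup_path, live_path) following the notebook's naming scheme."""
    stamp = now.strftime("%Y%m%d_%H%M")
    backup = f"{backup_dir}store_{name}_{stamp}.csv.{compression}"
    live = f"{live_dir}input_{name}.csv.{compression}"
    return backup, live


# The timestamp is an arbitrary example; the loop above uses datetime.now().
backup_name, live_name = build_export_names(
    "dynamic_gravity",
    "./1_raw_processed_backup/",
    "./2_raw_processed_input/",
    datetime(2020, 6, 16, 9, 30),
)
print(backup_name)
print(live_name)
```

Because the stamp only has minute resolution, two exports within the same minute would write to the same backup file.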