# FII Processing notebook

Objective: 

FII data is in xlsx spreadsheets. Goal is to make this data easier to work with. 

Raw data (description / notes):
 - FII data is broken down in a directory structure
 - directory structure indicates "wave" of surveys
 - filename indicates country of origin, wave #, date
 - each dataset has three files: XLSX format data (MS Excel), SAV format data (SPSS), and a legend for columns in XLS format (MS Excel)
 - Excel files have a single worksheet. 
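
Since the filename carries the country, wave number and (sometimes) a date, those fields can be pulled out with a regular expression. The sketch below is only an illustration: `FILENAME_RE` and `parse_fii_filename` are hypothetical names, and the pattern is inferred solely from the filenames listed in this notebook, so other files may not match it. The date digits look like MMDDYYYY but are kept as a raw string because that is an assumption.

```python
import re

# Hypothetical pattern, inferred only from the filenames seen in this notebook:
#   FSP_Final_<Country>_<W|w><wave>[_<8 date digits>][ ](public).xlsx
FILENAME_RE = re.compile(
    r"^FSP_Final_(?P<country>[A-Za-z]+)_[Ww](?P<wave>\d+)"
    r"(?:_(?P<date>\d{8}))?\s*\(public\)\.xlsx$"
)


def parse_fii_filename(file_name):
    """Return a dict with country, wave and date, or None if the name does not match."""
    match = FILENAME_RE.match(file_name)
    if match is None:
        return None
    return {
        "country": match.group("country"),
        "wave": int(match.group("wave")),
        # raw digit string (apparently MMDDYYYY); None when the name has no date
        "date": match.group("date"),
    }
```

For example, `parse_fii_filename("FSP_Final_Kenya_W3 (public).xlsx")` yields country `Kenya`, wave `3` and no date. Names that do not fit the pattern return `None` and can be reviewed by hand.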

Desired output: 
 - everything in one giant table (HDF5?) 
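
One possible shape for the "one giant table" is a single DataFrame with extra `country` and `wave` columns. This is a minimal sketch and not a settled design: `combine_frames` is a hypothetical helper, and it assumes that columns with the same name mean the same thing in every survey, which the legends would have to confirm.

```python
import pandas as pd


def combine_frames(frames_by_key):
    """Stack survey tables and tag each row with its country and wave.

    frames_by_key maps a (country, wave) tuple to a DataFrame. Columns that a
    given survey lacks are filled with NaN by pd.concat.
    """
    tagged = []
    for (country, wave), df in sorted(frames_by_key.items()):
        df = df.copy()  # leave the caller's frame untouched
        df["country"] = country
        df["wave"] = wave
        tagged.append(df)
    return pd.concat(tagged, ignore_index=True, sort=False)
```

The combined frame could then be written with `DataFrame.to_hdf`, which needs the optional PyTables dependency, or kept as one large CSV.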

Questions: 
 - different legend files for each country. Does that mean columns or data elements mean different things in each country's survey? If so, cannot merge datasets trivially. 
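
A first, rough check on this question is to compare the column names across files. The sketch below assumes the column names have already been collected, for instance with `list(df.columns)` on each loaded DataFrame. `compare_columns` is a hypothetical helper. Matching names do not prove matching meaning, so the legends still have to be read.

```python
def compare_columns(columns_by_file):
    """Split column names into those shared by every file and those that are not.

    columns_by_file maps a file label to a list of column names.
    Returns (shared, extras): shared is the set of names present in every
    file, extras maps each label to its sorted names outside that set.
    """
    column_sets = {label: set(cols) for label, cols in columns_by_file.items()}
    shared = set.intersection(*column_sets.values()) if column_sets else set()
    extras = {label: sorted(cols - shared) for label, cols in column_sets.items()}
    return shared, extras
```

If `extras` is empty for every file, a plain row-wise merge is at least structurally possible. Otherwise the listed columns are the ones to look up in each country's legend first.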
 
 
Progress:
 - Convert each XLSX into CSV format



In [1]:
import os
import sys 

# where xlsx files come from. Provide root, will crawl
source_dir = '../data/FII'

# where to place csv files after conversion. Will flatten source dirs
dest_dir = '../data/fii_csv/'

## Get all source data file names and paths

In [2]:
os.listdir(source_dir)

['FII Datasets - Wave 3', 'FII Datasets - Wave 2', 'FII Datasets - Wave 1']

In [3]:
# get full path of all xlsx files
all_xls_files = []
for root, dirs, files in os.walk(source_dir):
    # match on the full extension, dot included
    xls_files = [f for f in files if f.endswith('.xlsx')]
    
#     if len(xls_files) > 0: 
#         print len(xls_files)
#         print root
#         print dirs
#         print os.path.join(root, xls_files[0])

    xls_files = [os.path.join(root, f) for f in xls_files]
    all_xls_files += xls_files


for xf in all_xls_files:
    print xf

../data/FII/FII Datasets - Wave 3/FSP_Final_Kenya_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Indonesia_W2 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Tanzania_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Bangladesh_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_India_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Pakistan_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Uganda_W3 (public).xlsx
../data/FII/FII Datasets - Wave 3/FSP_Final_Nigeria_W3 (public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_Tanzania_w2_12182014(public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_Uganda_W2_10312015(public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_India_w2_02242015 (public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_Nigeria_w2_11182014(public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_Kenya_w2_12092014(public).xlsx
../data/FII/FII Datasets - Wave 2/FSP_Final_Bangladesh_w2_101620

## Util to process single source file

In [6]:
import pandas as pd
import numpy as np

In [25]:

def conv_rename_and_move(source_full_path, dest_dir):
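    """
    Uses pandas to load an xlsx file and write it to dest_dir as csv.
    The csv name is the source name lower-cased, with spaces replaced
    by underscores. Returns the number of rows read.
    """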
    
    # SOURCE
    source_file_name = os.path.basename(source_full_path)
    
    # debug
    print "Processing file {} ... ".format(source_file_name), 
    
    # DESTINATION
    # make lower case, remove spaces, change extension
    dest_file_name = source_file_name.lower().replace(' ', '_').replace('.xlsx', '.csv')
    dest_full_path = os.path.join(dest_dir, dest_file_name)
        
    # do
    df = pd.read_excel(source_full_path)
    df.to_csv(dest_full_path, encoding='utf-8')
    
    print "--> {}".format(dest_full_path)
    
    return len(df)
        




## DO: Process all the files

In [26]:
for xls_file in all_xls_files: 
    conv_rename_and_move(xls_file,
                         dest_dir=dest_dir)

Processing file FSP_Final_Kenya_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_kenya_w3_(public).csv
Processing file FSP_Final_Indonesia_W2 (public).xlsx ...  --> ../data/fii_csv/fsp_final_indonesia_w2_(public).csv
Processing file FSP_Final_Tanzania_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_tanzania_w3_(public).csv
Processing file FSP_Final_Bangladesh_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_bangladesh_w3_(public).csv
Processing file FSP_Final_India_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_india_w3_(public).csv
Processing file FSP_Final_Pakistan_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_pakistan_w3_(public).csv
Processing file FSP_Final_Uganda_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_uganda_w3_(public).csv
Processing file FSP_Final_Nigeria_W3 (public).xlsx ...  --> ../data/fii_csv/fsp_final_nigeria_w3_(public).csv
Processing file FSP_Final_Tanzania_w2_12182014(public).xlsx ...  --> ../data/fii_csv/fsp_final_tanzania_w2_12182014(