### <center>**SOURCE DATA**</center>
    
    
For this replication, I use the observations from 1985 to 1995 (files WTF85 through WTF95 at https://cid.econ.ucdavis.edu/data/undata/undata.html). The code cleans the raw files and writes a single .csv file to the data_cleaned folder. Users can generate their own .csv file by downloading other versions of the files and placing them in the data_raw folder.

In [1]:
# First, change the working directory to the parent folder.

import os
os.chdir("..")

In [2]:
# For simplicity, I build a single CSV file.

# To read files
import pandas as pd

df1 = pd.read_sas('data_raw/wtf85.sas7bdat', format='sas7bdat', encoding='utf-8')
df1.head()

Unnamed: 0,year,Icode,Importer,Ecode,Exporter,sitc4,Unit,DOT,Value,Quantity
0,1985.0,100000,World,100000,World,11,,,82377.0,
1,1985.0,100000,World,100000,World,11,N,,114391.0,351003.0
2,1985.0,100000,World,100000,World,11,W,,1955295.0,1194567.41
3,1985.0,100000,World,100000,World,12,,,109286.0,
4,1985.0,100000,World,100000,World,12,N,,57490.0,909997.0


The DOT and Quantity columns are not required for this work. The reason I do not use Quantity is simple: the quantities are not measured in a single unit for each product. Also, the current work centers on diversification versus specialization. If a future version uses the quantities, they must first be converted to a common unit, so the researcher would need to work out a conversion between the units reported for each product.
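To illustrate the unit problem, here is a minimal sketch that counts how many distinct units appear for each product. The toy frame copies the `sitc4`, `Unit` and `Quantity` values from the five preview rows above; the names `toy`, `units_per_product` and `mixed` are introduced only for this illustration.

```python
import pandas as pd

# Toy frame built from the five preview rows above (not the full dataset).
toy = pd.DataFrame({
    'sitc4': ['11', '11', '11', '12', '12'],
    'Unit': [None, 'N', 'W', None, 'N'],
    'Quantity': [None, 351003.0, 1194567.41, None, 909997.0],
})

# Count the distinct non-missing units reported for each product.
units_per_product = toy.groupby('sitc4')['Unit'].nunique()

# Products with more than one unit cannot have their quantities summed directly.
mixed = units_per_product[units_per_product > 1]
print(mixed)
```

In this toy sample, product 11 is reported in two different units, so adding its quantities together would mix incompatible measures.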

In [3]:
df1 = df1[['year', 'Icode', 'Importer', 'Ecode', 'Exporter', 'sitc4', 'Value']]
df1.head()

Unnamed: 0,year,Icode,Importer,Ecode,Exporter,sitc4,Value
0,1985.0,100000,World,100000,World,11,82377.0
1,1985.0,100000,World,100000,World,11,114391.0
2,1985.0,100000,World,100000,World,11,1955295.0
3,1985.0,100000,World,100000,World,12,109286.0
4,1985.0,100000,World,100000,World,12,57490.0


According to the UN appendix (documentation/base documentation/FAQ_on_NBER-UN_data.pdf), the Value column shows the aggregate amount of money that a country received from selling a specific product, reported for a specific unit. So, to get the share column ($S_{cp}$), the cleaned data frame must contain the sum of the *Value* column for each country *c* and product *p* pair.

In [4]:
df1 = df1.groupby(['year', 'Icode', 'Importer', 'Ecode', 'Exporter', 'sitc4']).sum().reset_index()
df1.columns = ['year', 'icode', 'importer', 'ecode', 'exporter', 'product', 'share']
df1.head()

Unnamed: 0,year,icode,importer,ecode,exporter,product,share
0,1985.0,100000,World,100000,World,11,2152063.0
1,1985.0,100000,World,100000,World,12,760400.0
2,1985.0,100000,World,100000,World,13,1046521.0
3,1985.0,100000,World,100000,World,14,435667.0
4,1985.0,100000,World,100000,World,15,751362.0
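As a quick sanity check, the first aggregated row can be reproduced by hand from the three rows for product 11 shown in the first preview. This is only a sketch on those three copied values, not on the full file; `rows` and `total` are names used just for this check.

```python
import pandas as pd

# The three rows for product 11 from the first preview, one per unit.
rows = pd.DataFrame({
    'sitc4': ['11', '11', '11'],
    'Unit': [None, 'N', 'W'],
    'Value': [82377.0, 114391.0, 1955295.0],
})

# Summing Value across the unit rows gives one total per product.
total = rows.groupby('sitc4')['Value'].sum()
print(total['11'])  # 2152063.0
```

The result matches the value reported for product 11 in the aggregated table.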


The script data_gen.py applies this process to every data file in the data_raw folder and concatenates the results. It then writes the resulting dataframe to the data_cleaned folder as a csv file.
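The contents of data_gen.py are not shown here, so the following is only a rough sketch of what such a script could look like, assuming it repeats the steps above for each file. The function names `clean_frame` and `unify_sketch` and the output file name `unified.csv` are hypothetical and are not taken from the actual script.

```python
from pathlib import Path

import pandas as pd

KEYS = ['year', 'Icode', 'Importer', 'Ecode', 'Exporter', 'sitc4']
NEW_NAMES = ['year', 'icode', 'importer', 'ecode', 'exporter', 'product', 'share']


def clean_frame(df):
    """Sum Value over the unit rows of one raw frame and rename the columns."""
    out = df[KEYS + ['Value']].groupby(KEYS, as_index=False).sum()
    out.columns = NEW_NAMES
    return out


def unify_sketch(raw_dir='data_raw', out_path='data_cleaned/unified.csv'):
    """Hypothetical stand-in for unify_data: clean each SAS file, write one CSV."""
    frames = []
    for path in sorted(Path(raw_dir).glob('*.sas7bdat')):
        print(f'Opening file {path.name}')
        raw = pd.read_sas(str(path), format='sas7bdat', encoding='utf-8')
        frames.append(clean_frame(raw))
    # Note: pd.concat raises ValueError if no files were found.
    unified = pd.concat(frames, ignore_index=True)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    unified.to_csv(out_path, index=False)
    return unified
```

Keeping the per-file cleaning in its own function makes it possible to test the aggregation on a small frame without reading any SAS files.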

In [5]:
from manuscript.data_gen import unify_data
unify_data()

Opening file wtf85.sas7bdat
Opening file wtf86.sas7bdat
Opening file wtf87.sas7bdat
Opening file wtf88.sas7bdat
Opening file wtf89.sas7bdat
Opening file wtf90.sas7bdat
Opening file wtf91.sas7bdat
Opening file wtf92.sas7bdat
Opening file wtf93.sas7bdat
Opening file wtf94.sas7bdat
Opening file wtf95.sas7bdat
          year   icode   importer   ecode     exporter product         share
0       1985.0  100000      World  100000        World    0011  1.545570e+06
1       1985.0  100000      World  100000        World    0012  1.304516e+06
2       1985.0  100000      World  100000        World    0013  3.823720e+06
3       1985.0  100000      World  100000        World    0014  2.002013e+07
4       1985.0  100000      World  100000        World    0015  1.518694e+05
...        ...     ...        ...     ...          ...     ...           ...
762852  1995.0  908960  Areas NES  710360    Australia    8748  0.000000e+00
762853  1995.0  908960  Areas NES  710360    Australia    8822  3.000000e+00