# Introduction

**In this study, ecoinvent is downscaled to country level (the same regional resolution as EXIOBASE) and integrated with the more detailed inter-country trade information from EXIOBASE, while the process detail of ecoinvent is kept. Region matching is therefore the key connection between the old and the new database.** So far, we have defined the basic regional resolution of the new version of ecoinvent (called new ecoinvent) as the 44 countries from EXIOBASE (explained in the paper), and we know how the locations in the original ecoinvent are matched to these 44 countries at database level. <br>
This notebook performs **activity-wise location matching**, because different activities in ecoinvent have different location sets and the locations of one activity do not overlap (for an activity with locations RoW and RER, RoW excludes the countries covered by RER). Through activity-wise location matching, each dataset in the new ecoinvent is assigned a **reference dataset** in the original ecoinvent. Then, **we generate technosphere exchanges for each dataset in the new ecoinvent based on the reference dataset from ecoinvent and the country-level consumption mix from EXIOBASE. The technosphere exchanges of each dataset form a vector in the final technosphere matrix.** <br>

*The following figure shows the framework of the algorithm. A multiregion key represents an activity with a certain product output. Numbers in green circles refer to sections.*

<img src="main_algorithm_structure.png" width="600">

### Section 1: Get a reference dataset in the original ecoinvent for each dataset in the new ecoinvent via location matching
We characterize each dataset in the original ecoinvent by its location and a multiregion key (mr_key = activity name + product name). Datasets with the same mr_key represent one activity taking place in different locations. For the location set of an mr_key, we match locations to the 44 countries where possible and keep the remaining locations as other countries. Each of the 44 countries can be matched only once. Thus, in the final location set of an mr_key in the new ecoinvent, each final location has exactly one reference location in the original ecoinvent, i.e., each dataset in the new database has exactly one reference dataset in the original ecoinvent.
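
The matching rule described above can be illustrated with a minimal sketch. The region contents and the names `TOY_REGION_TO_COUNTRIES`, `TOY_TARGET_COUNTRIES` and `match_locations` are hypothetical and only serve this example; the notebook's own implementation is `match_eco_locations_to_final_loc_set` in section 1.3, which also handles subnational locations.

```python
# Toy data (hypothetical): target countries covered by each aggregated ecoinvent region.
TOY_REGION_TO_COUNTRIES = {
    "RER": ["DE", "FR", "CH"],
    "RoW": ["DE", "FR", "CH", "US", "CN"],
}
TOY_TARGET_COUNTRIES = {"DE", "FR", "CH", "US", "CN"}


def match_locations(locations):
    """Map each original location to the target countries it is the reference for."""
    matched = {}
    remaining = set(TOY_TARGET_COUNTRIES)

    # Country-level locations are matched to themselves first.
    for loc in locations:
        if loc in TOY_TARGET_COUNTRIES:
            matched[loc] = [loc]
            remaining.discard(loc)

    # Aggregated locations are processed from the smallest to the largest region,
    # so each country is matched once, to the most specific region available.
    aggregated = [loc for loc in locations if loc not in TOY_TARGET_COUNTRIES]
    for loc in sorted(aggregated, key=lambda r: len(TOY_REGION_TO_COUNTRIES[r])):
        matched[loc] = [c for c in TOY_REGION_TO_COUNTRIES[loc] if c in remaining]
        remaining -= set(matched[loc])
    return matched


matching = match_locations(["RoW", "RER", "CH"])
# Invert the mapping: every target country points to exactly one reference location.
reference_location = {c: loc for loc, countries in matching.items() for c in countries}
print(reference_location)
```

In this toy case CH keeps its own dataset, DE and FR use the RER dataset, and US and CN fall back to RoW.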

### Section 2: Summarize the technosphere exchanges of datasets in the original ecoinvent and generate country-specific consumption mixes of products
We summarize the technosphere exchanges (inputs) of each dataset in the original ecoinvent, including the amounts, the activities (mr_key) providing the inputs, and the potential locations of these activities in the new ecoinvent. The rules for potential locations depend on whether the dataset is country-aggregated or not. For a country-aggregated dataset, which is the reference for several country-level datasets in the new ecoinvent, all possible final locations are considered for the activities providing inputs, and the amount is the sum of a certain input provided by different regions. For a subnational or country-level dataset, which is the reference for the dataset of the same location in the new ecoinvent, the original location of an activity providing an input is kept or disaggregated to country level. <br>
At this point, for an activity (mr_key as consumer) we know all activities (mr_key as producer) providing technosphere inputs, but the shares of producers in different locations are not yet determined. For each combination of consumer and producer, we match them to industries in EXIOBASE based on the sector matching results, and generate a vector of import ratios from different locations for each region in EXIOBASE. <br>
The summarized technosphere exchanges of the reference datasets and the vectors of import ratios are used in the next section.<br>
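
As a rough illustration of an import-ratio vector, the sketch below normalizes the flows from all producing regions into one consuming region and sector. The flow table `TOY_FLOWS`, its sector names and the helper `import_ratios` are made up for this example; the notebook itself works on the EXIOBASE table with pandas (section 2.2) and additionally falls back to all sectors of the consuming country when a sector pair has no flows, which is omitted here.

```python
from collections import defaultdict

# Hypothetical toy flows:
# (producer region, producer sector, consumer region, consumer sector) -> monetary flow
TOY_FLOWS = {
    ("DE", "steel", "FR", "machinery"): 30.0,
    ("FR", "steel", "FR", "machinery"): 50.0,
    ("CN", "steel", "FR", "machinery"): 20.0,
    ("CN", "steel", "DE", "machinery"): 10.0,
    ("DE", "cement", "FR", "machinery"): 5.0,
}


def import_ratios(flows, consumer_region, consumer_sectors, producer_sectors):
    """Share of each producing region in the inputs bought by one consuming region."""
    totals = defaultdict(float)
    for (p_reg, p_sec, c_reg, c_sec), value in flows.items():
        if c_reg == consumer_region and c_sec in consumer_sectors and p_sec in producer_sectors:
            totals[p_reg] += value
    grand_total = sum(totals.values())
    if not grand_total:
        return {}  # no explicit connection between the sectors in this toy table
    return {region: value / grand_total for region, value in totals.items()}


ratios = import_ratios(TOY_FLOWS, "FR", {"machinery"}, {"steel"})
print(ratios)
```

The ratios of one consuming region sum to 1, so they can be used directly as weights for distributing an input amount over producing regions.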

### Section 3: Generate technosphere exchanges for datasets in new ecoinvent
Based on sections 1 and 2, each dataset in the new ecoinvent is characterized by a combination of an mr_key and a final location, and has a reference dataset with summarized technosphere exchanges. For each activity providing a technosphere exchange to a dataset, vectors of import ratios from different locations are prepared. We are therefore ready to generate the technosphere exchanges for the datasets in the new ecoinvent. An additional rule is that the locations of activities providing technosphere inputs to a market activity are restricted to be local, while for transforming activities the origins of inputs are not restricted. This avoids applying the country-level consumption mix twice (see discussion in paper). 
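
A simplified sketch of this step is given below: the amount of each input in the reference dataset is distributed over the possible origins according to the import ratios. The function `generate_exchanges`, its arguments and the toy values are assumptions made for illustration, and the way the local-supply rule for markets is coded here (use the consumer's own location if it is among the origins) is only one possible reading of the rule, not necessarily the implementation used later in the notebook.

```python
def generate_exchanges(reference_exchanges, mixes, consumer_location, is_market):
    """Return {(origin, producer_key): amount} for one dataset of the new database.

    reference_exchanges: {producer_key: (amount, [possible origins])}
    mixes: {producer_key: {origin: import ratio}} for the consuming country
    """
    column = {}
    for producer_key, (amount, origins) in reference_exchanges.items():
        if is_market and consumer_location in origins:
            # Assumed reading of the rule: markets are supplied locally only.
            shares = {consumer_location: 1.0}
        else:
            raw = {o: mixes[producer_key].get(o, 0.0) for o in origins}
            total = sum(raw.values())
            if total:
                shares = {o: s / total for o, s in raw.items()}
            else:
                # No trade data for these origins: split equally.
                shares = {o: 1.0 / len(origins) for o in origins}
        for origin, share in shares.items():
            if share:
                column[(origin, producer_key)] = amount * share
    return column


toy_reference = {"steel_key": (2.0, ["DE", "FR", "CN"])}
toy_mixes = {"steel_key": {"DE": 0.3, "FR": 0.5, "CN": 0.2}}

transforming = generate_exchanges(toy_reference, toy_mixes, "FR", is_market=False)
market = generate_exchanges(toy_reference, toy_mixes, "FR", is_market=True)
print(transforming)
print(market)
```

In both cases the total input amount of the reference dataset is preserved; only its distribution over origins changes.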



# Table of contents <a id='back'></a>
    
   <a href='#import'>Preparation: Import packages and files</a> <br>

1. <a href='#step1'>Get a reference dataset in the original ecoinvent for each dataset in the new ecoinvent via location matching </a> <br>
    1.1 <a href='#create_mrkey'>Generate multiregion keys for ecoinvent datasets </a>  <br>
    1.2 <a href='#sec_matching'>Match ISIC codes in ecoinvent to EXIOBASE sector indexes </a>   <br>
    1.3 <a href='#reg_matching'>Conduct activity-wise location matching between ecoinvent and new ecoinvent </a>  <br>
    1.4 <a href='#create_mrd'>Summarize dataset-level matching between ecoinvent and new ecoinvent  </a>  <br>


2. <a href='#step2'>Summarize the technosphere exchanges of datasets in the original ecoinvent and generate country-specific consumption mixes of products </a><br>
    2.1 <a href='#ds_exc_dict'>Summarize the technosphere exchanges of datasets </a>  <br>
    2.2 <a href='#cons_mix'>Generate necessary country-specific consumption mixes </a>  <br>
    2.3 <a href='#suborgin_PV'>Summarize production volumes for further suborigin allocation </a>  <br>


3. <a href='#step3'>Generate technosphere exchanges for datasets in new ecoinvent </a>  <br>
    3.1 <a href='#full_index'>Create index for each dataset in new ecoinvent </a>   <br>
    3.2 <a href='#gen_exch'>Generate technosphere exchanges for each dataset</a>  <br>


## Preparation: Import packages and files <a id='import'></a> <br>
 <a href='#back'>back</a> 

In [1]:
# import package
import pickle
import collections
import hashlib
import re
import itertools
import time
import multiprocessing as mp
import pandas as pd
import numpy as np
import scipy.sparse
from tqdm import *

In [2]:
# import region file, sector file, ecoinvent dataset, iot
with open('../../Data/region_matching/iot_co.p', 'rb') as i:
    iot_co = pickle.load(i)
with open('../../Data/region_matching/partco.p', 'rb') as i:
    partco = pickle.load(i)
with open('../../Data/region_matching/other_co.p', 'rb') as i:
    other_co = pickle.load(i)
with open('../../Data/region_matching/eco_reg_TO_iot_co.p', 'rb') as i:
    eco_reg_TO_iot_co = pickle.load(i)
with open('../../Data/region_matching/final_loc_set_TO_iot_reg_co.p', 'rb') as i:
    final_loc_set_TO_iot_reg_co = pickle.load(i)

with open('../../Data/sector_matching/isic_TO_exio_name.p', 'rb') as i:
    isic_TO_exio_name = pickle.load(i)

with open('../../Data/lci_iot_imported/iot_flow.p', 'rb') as i:
    iot = pickle.load(i)
    
with open('../../Data/lci_iot_imported/cutoff371_no_mg.pickle', 'rb') as i:
    datasets = pickle.load(i)

In [3]:
iot_co_reg= set(iot.index.levels[0])

In [4]:
# sort the dictionary: from small to large geography
eco_reg_TO_iot_co_SORT = {k: v for k, v in sorted(eco_reg_TO_iot_co.items(), key=lambda item: len(item[1]))}

In [5]:
# union of three is final_loc_set of our final LCI database
final_loc_set = iot_co|partco|other_co
with open('../../Data/tech_vector/final_loc_set.p', 'wb') as o:
    pickle.dump(final_loc_set, o)

In [6]:
# exclude 'production mix' datasets. create code_to_ds dict.
datasets = [d for d in datasets if d['activity type']!='production mix'] 
code_to_ds = {d['code']:d for d in datasets}
with open('../../Data/tech_vector/code_to_ds.p', 'wb') as o:
    pickle.dump(code_to_ds, o)

## 1. Get a reference dataset in the original ecoinvent for each dataset in the new ecoinvent via location matching <a id='step1'></a>
    
1.1 <a href='#create_mrkey'>Generate multiregion keys for ecoinvent datasets </a>  <br>
1.2 <a href='#sec_matching'>Match ISIC codes in ecoinvent to EXIOBASE sector indexes </a>   <br>
1.3 <a href='#reg_matching'>Conduct activity-wise location matching between ecoinvent and new ecoinvent</a>  <br>
1.4 <a href='#create_mrd'>Summarize dataset-level matching between ecoinvent and new ecoinvent  </a>  <br>
 <a href='#back'>back</a> <br>

### 1.1 Generate multiregion keys for ecoinvent datasets <a id='create_mrkey'></a>  <br>
 <a href='#back'>back</a> 

In [7]:
def create_multiregion_key(name, flow):
    '''
    for an activity, multiregion_key = name + flow = production activity and product
    
    '''
    # 'municipal waste incineration' and 'municipal incineration' denote the same activity
    replace_waste = lambda w:w.replace('municipal waste incineration',
                                       'municipal incineration') # define a func: replace 1 in string w with 2
    if 'municipal waste incineration' in name:
        name = replace_waste(name)
    # for CH activities with ' with fly ash extraction' in the name, the same GLO activity exists without this part in the name
    # adapt the CH ds name to the GLO ds name
    if ' with fly ash extraction' in name:
        adapted_name = re.sub(r' with fly ash extraction', '', name) # delete "with fly..."
        #ensure there is no repetitive ds after remove the string. (no overlapping of locations)
        #e.g.{'treatment of waste sealing sheet, polyethylene, municipal incineration with fly ash extraction': 'CH'} -> no GLO ds
        # {'treatment of waste sealing sheet, polyethylene, municipal incineration': 'CH'} already exist in ecoinvent
        # {'treatment of residue from mechanical treatment, desktop computer, municipal waste incineration': 'CH'}
        # {'treatment of residue from mechanical treatment, desktop computer, municipal incineration with fly ash extraction': 'CH'}
        if len([True for d in datasets if adapted_name in replace_waste(d['name']) and 
                d['location']=='CH' and d['flow']==flow])==1: 
            name = adapted_name
            
    return hashlib.md5((name+flow).encode('utf-8')).hexdigest()

for ds in datasets:
    ds['multiregion key'] = create_multiregion_key(ds['name'], ds['flow'])

mr_keys = {ds['multiregion key'] for ds in datasets} 

with open('../../Data/tech_vector/mr_keys.p', 'wb') as o:
    pickle.dump(mr_keys, o)
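
The following minimal sketch shows the effect of the key construction on made-up records: datasets that share activity name and product flow receive the same key regardless of location. The records and the helper `toy_multiregion_key` are hypothetical, and the name harmonization for waste incineration done above is left out.

```python
import hashlib


def toy_multiregion_key(name, flow):
    # Same idea as above: hash of activity name + product flow identifier.
    return hashlib.md5((name + flow).encode("utf-8")).hexdigest()


# Hypothetical records; 'flow' stands for the product flow identifier.
toy_datasets = [
    {"name": "market for tap water", "flow": "flow-id-1", "location": "CH"},
    {"name": "market for tap water", "flow": "flow-id-1", "location": "RoW"},
    {"name": "tap water production", "flow": "flow-id-1", "location": "CH"},
]

# Group the locations of all datasets that share a multiregion key.
locations_by_key = {}
for ds in toy_datasets:
    key = toy_multiregion_key(ds["name"], ds["flow"])
    locations_by_key.setdefault(key, []).append(ds["location"])

for key, locations in locations_by_key.items():
    print(key, locations)
```

The first two records end up under one key with the location set CH and RoW, while the production activity gets a key of its own.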

In [8]:
# collect the mr_keys of the markets for tap water and irrigation; when these activities provide an input to another activity, only local producers are kept
TapWat_Irr_Mar_mr_code = set()
for d in datasets:
    if d['name'] == 'market for irrigation' or d['name'] == 'market for tap water':
        TapWat_Irr_Mar_mr_code.add(d['multiregion key'])

with open('../../Data/tech_vector/TapWat_Irr_Mar_mr_code.p', 'wb') as o:
    pickle.dump(TapWat_Irr_Mar_mr_code, o)

In [9]:
# assert there are no repetitive datasets after renaming
for key in mr_keys:
    mr_locations = [ds['location'] for ds in datasets if ds['multiregion key']==key]
    assert len(mr_locations) == len(set(mr_locations)), 'there are repetitive datasets'

### 1.2 Match ISIC codes in ecoinvent to EXIOBASE sector indexes <a id='sec_matching'></a>  
 <a href='#back'>back</a> <br>

In [10]:
# complement missing ISIC classification in recycled content datasets
for d in datasets:
    if 'Recycled Content' in d['name']:
        d['classifications'].append(('ISIC rev.4 ecoinvent', '3900:Remediation activities and other waste management services'))

In [11]:
def extract_isic_code(ds):
    cl = ds['classifications']# example of a "classification": [('ISIC rev.4 ecoinvent','1050:Manufacture of dairy products'),('CPC', '22230: Yoghurt and other fermented or acidified milk and cream')],
    isic_info = [c[1] for c in cl if c[0].startswith('ISIC')][0] # only production mixes have no isic-info #1050:
    isic_code = isic_info.split(':')[0] #1050
    isic_code = re.sub(r'\D','', isic_code) # remove non-digit characters
    return isic_code

for ds in datasets:
    ds['ISIC code'] = extract_isic_code(ds)

In [12]:
# for matching sector names to indexes in iot
iot_sectors = iot.index.levels[1]

In [13]:
def match_isic_code_TO_exio_sectors(isic_code):
    '''
    match isic code in datasets to exio sectors index

    '''
    if isic_code in isic_TO_exio_name.keys():
        exio_sec_name = isic_TO_exio_name[isic_code]
    else:
        possible_full_isic_codes = [code for code in isic_TO_exio_name.keys() if code.startswith(isic_code)]
        exio_sec_name = list({sector for sector in 
                        itertools.chain.from_iterable([isic_TO_exio_name[c] for c in possible_full_isic_codes])})
    iot_sec_indexer = iot_sectors.get_indexer(exio_sec_name) # return an array, an activity - several sectors in exio
    return sorted(iot_sec_indexer)

for ds in datasets:
    ds['exiobase sector index'] = match_isic_code_TO_exio_sectors(ds['ISIC code'])
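
The two steps above (extracting the ISIC code and falling back to prefix matching for shorter codes) can be tried out on toy data as sketched below. The mapping `TOY_ISIC_TO_SECTORS` and its sector labels are invented for illustration, and unlike the notebook function this sketch returns sector names rather than positions in the EXIOBASE sector index.

```python
import re

# Hypothetical mapping from full ISIC codes to sector labels.
TOY_ISIC_TO_SECTORS = {
    "2011": ["sector_a"],
    "2012": ["sector_b", "sector_c"],
    "2410": ["sector_d"],
}


def toy_extract_isic_code(classifications):
    """Take the ISIC entry, keep the part before ':' and strip non-digit characters."""
    info = next(value for system, value in classifications if system.startswith("ISIC"))
    return re.sub(r"\D", "", info.split(":")[0])


def toy_match_sectors(isic_code):
    """Exact match if possible, otherwise the union over all codes sharing the prefix."""
    if isic_code in TOY_ISIC_TO_SECTORS:
        return sorted(TOY_ISIC_TO_SECTORS[isic_code])
    candidates = [code for code in TOY_ISIC_TO_SECTORS if code.startswith(isic_code)]
    return sorted({s for code in candidates for s in TOY_ISIC_TO_SECTORS[code]})


code = toy_extract_isic_code([
    ("ISIC rev.4 ecoinvent", "1050:Manufacture of dairy products"),
    ("CPC", "22230: Yoghurt and other fermented or acidified milk and cream"),
])
print(code)
print(toy_match_sectors("2011"))
print(toy_match_sectors("20"))
```

A two-digit code such as 20 is therefore matched to every sector reachable from the four-digit codes below it.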

### 1.3 Conduct activity-wise location matching between ecoinvent and new ecoinvent  <a id='reg_matching'></a>  <br>
 <a href='#back'>back</a> <br>

In [14]:
def match_eco_locations_to_final_loc_set(locations):
    '''
    match ds locations under certain mr_key to final country set
    
    '''
    eco_loc_TO_final_loc_set = {}
    final_loc_set_TO_eco_loc = {}
    
    # match a country specific location to a certain element in final location set
    co_specific_loc = [loc for loc in locations if loc in partco or loc in other_co or loc in iot_co]
    for loc in co_specific_loc:
        eco_loc_TO_final_loc_set[loc] = [loc]
    
    # the rest of iot_co that should be matched to region specific locations
    # to avoid overlapping, co that partco belong to should also be excluded
    partco_list = [loc for loc in locations if loc in partco]
    partco_co = {final_loc_set_TO_iot_reg_co[p_co] for p_co in partco_list}
    rest_iot_co = iot_co.difference(co_specific_loc)
    rest_iot_co = rest_iot_co.difference(partco_co)
    
    # match a country-aggregated location to a list of iot_co (in final location set), 
    # the list must exclude iot_co already used in last step. 
    # in this step, assign iot_co list to co_aggre_loc from small to large geographies.so:
        #1. sort co_aggre_loc from small to large geography (number of contained iot_co)
        #2. look for matching iot_co one by one. after each step, exclude iot_co already used.
    co_aggre_loc = set(locations).difference(co_specific_loc)
    #1.
    co_aggre_loc_TO_all_matching_iot_co = {}
    for loc in co_aggre_loc:
        co_aggre_loc_TO_all_matching_iot_co[loc] = eco_reg_TO_iot_co[loc]
    sorted_co_aggre_loc_TO_all_matching_iot_co = {k: v for k, v in sorted(co_aggre_loc_TO_all_matching_iot_co.items(), key=lambda item: len(item[1]))}
    sorted_co_aggre_loc = sorted_co_aggre_loc_TO_all_matching_iot_co.keys()
    #2.
    for loc in sorted_co_aggre_loc:
        possible_matching_iot_co = eco_reg_TO_iot_co[loc]
        eco_loc_TO_final_loc_set[loc] = [c for c in possible_matching_iot_co if c in rest_iot_co] #{eco_loc: [final_loc]}
        rest_iot_co = rest_iot_co.difference(set(eco_loc_TO_final_loc_set[loc]))
    
    
    final_loc_set_TO_eco_loc = {} # {country: locations}.locations can be geographically larger, equal or smaller than country 
    for eco_loc, final_loc_list in eco_loc_TO_final_loc_set.items():
        for final_loc in final_loc_list:
            final_loc_set_TO_eco_loc[final_loc]=eco_loc
    
#     for c in iot_co:
#         assert c in final_loc_set_TO_eco_loc.keys(),'incomplete geography cover'
#             print(c)

    return eco_loc_TO_final_loc_set, final_loc_set_TO_eco_loc

### 1.4 Summarize dataset-level matching between ecoinvent and new ecoinvent <a id='create_mrd'></a>  <br>
 <a href='#back'>back</a> 

In [15]:
def MultiRegionDataset(mr_key):
    MRD = {}
    MRD['key'] = mr_key
    MRD['datasets'] = [d for d in datasets if d['multiregion key']==MRD['key']] #list the same act+prod in different regions. not sort the list
    activity_type_set = {d['activity type'] for d in MRD['datasets']}         # in principle, only one element in the set
    assert len(activity_type_set)==1, 'mix of activity type'                # if not 1, raise error
    MRD['activity_type'] = activity_type_set.pop()
    MRD['locations'] = [d['location'] for d in MRD['datasets']]                 #list different regions having this act+prod
    eco_loc_TO_final_loc_set, final_loc_set_TO_eco_loc = match_eco_locations_to_final_loc_set(MRD['locations'])
    MRD['eco_loc_TO_final_loc_set'] = eco_loc_TO_final_loc_set
    MRD['final_location_to_eco_location'] = final_loc_set_TO_eco_loc
    MRD['reference_dataset_dict'] = {final_loc:[d for d in MRD['datasets'] if d['location']==eco_loc][0] 
                                   for final_loc, eco_loc in MRD['final_location_to_eco_location'].items()}# country - matching dataset (activity+location)

    iot_sec_ind = [d['exiobase sector index'] for d in MRD['datasets']]
    if len({tuple(i) for i in iot_sec_ind})!=1:                             # choose the sec index list with the highest occurrence.
        loc_sec_dict = {d['location']:tuple(d['exiobase sector index']) for d in MRD['datasets']}
        ct = collections.Counter(loc_sec_dict.values())
        MRD['exiobase_sector_index'] = list(ct.most_common(1)[0][0])
    else:                                                                   #exiobase_sector_index for each dataset is the same
        MRD['exiobase_sector_index'] = iot_sec_ind[0]
    
    return MRD


mr_key_MRD = {k:MultiRegionDataset(k) for k in sorted(list(mr_keys))}

with open('../../Data/tech_vector/mr_key_MRD.p', 'wb') as o:
    pickle.dump(mr_key_MRD, o)
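
When the datasets of one multiregion key disagree on their sector match, the function above keeps the most frequent sector index list. A minimal sketch of that majority vote, with made-up locations and index values and a hypothetical helper name, could look like this:

```python
from collections import Counter


def most_common_sector_index(sector_indexes_by_location):
    """Pick the sector index list shared by the largest number of locations."""
    # Lists are not hashable, so they are converted to tuples for counting.
    counts = Counter(tuple(ix) for ix in sector_indexes_by_location.values())
    return list(counts.most_common(1)[0][0])


# Hypothetical sector indexes of three datasets with the same multiregion key.
toy_indexes = {"CH": [3, 4], "RER": [3, 4], "RoW": [7]}
print(most_common_sector_index(toy_indexes))
```

With equal counts, `Counter.most_common` returns the elements in the order they were first encountered, so the result then depends on the order of the datasets.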

In [16]:
list(mr_key_MRD.items())[0]

('0000afec84171c872e995c5b53fac1ad',
 {'key': '0000afec84171c872e995c5b53fac1ad',
  'datasets': [{'comment': 'In this market, expert judgement was used to develop product-specific transport distance estimations based on eurostat transport statistics for 2016 (http://ec.europa.eu/eurostat/data/database, extracted on the 2018-06-01). See exchange comments for additional details.\nThis dataset represents the supply of 1 kg of diethyl ether, without water, in 99.95% solution state from activities that produce it within the geography RER.',
    'classifications': [('ISIC rev.4 ecoinvent',
      '2011:Manufacture of basic chemicals'),
     ('CPC',
      '34170: Ethers, alcohol peroxides, ether peroxides, epoxides, acetals and hemiacetals, and their halogenated, sulphona[…]')],
    'activity type': 'market activity',
    'activity': 'aa773de9-6f0a-46ee-8a4e-079ab653f189',
    'database': 'ecoinvent 3.7.1_cutoff_ecoSpold02',
    'exchanges': [{'flow': 'cfbce515-3f54-4411-ad9d-3d26b7faa15a',
  

In [17]:
ref_dataset_dict = {mr_key: {final_loc:ref_ds['code'] for final_loc,ref_ds in MRD['reference_dataset_dict'].items()} for mr_key, MRD in mr_key_MRD.items()}
# {multiregion key: {country: code of the dataset(activity+location)}}. 1 multiregion key-> at least one code
code_mrkey_dict = {d['code']:d['multiregion key'] for d in datasets} # 1 code -> 1 multiregion key. 

with open('../../Data/tech_vector/ref_dataset_dict.p', 'wb') as o:
    pickle.dump(ref_dataset_dict, o)
with open('../../Data/tech_vector/code_mrkey_dict.p', 'wb') as o:
    pickle.dump(code_mrkey_dict, o)

## 2. Summarize the technosphere exchanges of datasets in the original ecoinvent and generate country-specific consumption mixes of products <a id='step2'></a>

2.1 <a href='#ds_exc_dict'>Summarize the technosphere exchanges of datasets </a>  <br>
2.2 <a href='#cons_mix'>Generate necessary country-specific consumption mixes </a>  <br>
2.3 <a href='#suborgin_PV'>Summarize production volumes for further suborigin allocation </a>  <br>

 <a href='#back'>back</a> 

#### 2.1 Summarize the technosphere exchanges of datasets<a id='ds_exc_dict'></a>  <br>
 <a href='#back'>back</a> 

In [18]:
# for each technosphere exchange in each dataset, get the code of dataset that provides the technosphere input
# then, add the corresponding information of multiregion key,location, final locations to that exchange
for d in datasets:
    for e in d['exchanges']:
        if e['type']=='technosphere':
            exchange_d = e['input'][1] # code of the exchange activity
            e['multiregion'] = code_mrkey_dict[exchange_d]
            e['location'] = code_to_ds[exchange_d]['location']     
            e['final locations'] = mr_key_MRD[e['multiregion']]['eco_loc_TO_final_loc_set'][e['location']] #is []

In [19]:
with open('../../Data/tech_vector/datasets_no_prodmix_with_new_items.p', 'wb') as o:
    pickle.dump(datasets, o)

In [20]:
ds_exchange_dict = {} #{code: {mr_key for exchanges: amount, final locations}}

In [21]:
co_specific_ds = [d for d in datasets if d['location'] in partco or d['location'] in other_co or d['location'] in iot_co]
#[d for d in datasets if d['location'] in final_loc_set]
# for country-specific or subnational datasets, keep the amount and original origins.

for d in co_specific_ds:
    tex = [e for e in d['exchanges'] if e['type']=='technosphere']
    exchange_dict = collections.defaultdict(list)
    if tex: # tex may contain losses (inputs from the dataset itself); they are also country-specific and are kept by the code below
        for e in tex:
            exchange_dict[e['multiregion']].append((e['amount'], e['final locations']))
        ds_exchange_dict[d['code']] = exchange_dict  #{code: {mr_key for exchanges: [(amount, [countries])]}}
    else:
        ds_exchange_dict[d['code']] = {}

In [20]:
# result check: 'smelting of copper concentrate, sulfide ore'[CN]
ds_exchange_dict['54703eb7b74591ec08911d4c7c576a1c']

defaultdict(list,
            {'3bdef1d3f2cd6319edd1ddfd49626db7': [(1.34118153563962e-05,
               ['LV',
                'JP',
                'IN',
                'AU',
                'MT',
                'US',
                'SE',
                'CZ',
                'CN',
                'BR',
                'RU',
                'KR',
                'MX',
                'PT',
                'CA',
                'SK',
                'CH',
                'LU',
                'BG',
                'GR',
                'PL',
                'GB',
                'IT',
                'NL',
                'NO',
                'HR',
                'LT',
                'RO',
                'HU',
                'FI',
                'ID',
                'TW',
                'IE',
                'SI',
                'EE',
                'DE',
                'FR',
                'TR',
                'AT',
                'DK',
                'ZA',
       

In [22]:
co_aggre_ds = [d for d in datasets if d['location'] not in final_loc_set]
# for country-aggregated dataset, aggregate technosphere exchanges by mr_key to get amount; add all possible origins behind each mr_key.

def aggregate_technosphere_exchanges_by_mrkey(technosphere_exchanges):
    exchange_to_series = lambda e: pd.Series([e['amount'], e['multiregion']], index=['amount', 'multiregion'])
    exchange_table = pd.DataFrame({i:exchange_to_series(e) for i,e in enumerate(technosphere_exchanges)}).T
    exchanges_by_multiregion = exchange_table.groupby('multiregion').sum() #total inputs of an activity ignoring region
    return exchanges_by_multiregion['amount'].to_dict() #{mr_key: amount}

for d in co_aggre_ds:
    loss = [e for e in d['exchanges'] if e['type']=='technosphere' and e['input'][1] == d['code']]
    tex = [e for e in d['exchanges'] if e['type']=='technosphere' and e['input'][1] != d['code']]
    exchange_dict = {}
    if loss: #loss means import from itself. no need to consider all locations behind the mr_key
        for e in loss:
            exchange_dict[e['multiregion']] = (e['amount'], [e['location']],'loss')
    if tex: # consider possible origins for imports
        mr_amount_d = aggregate_technosphere_exchanges_by_mrkey(tex)#{mr_key: amount}
        tex_mrkeys = list(mr_amount_d.keys())
        mr_locations_d = {}
        for key in tex_mrkeys:
            all_possible_locations_by_mrkey = list(ref_dataset_dict[key].keys())
            mr_locations_d[key] = all_possible_locations_by_mrkey#{mr_key: [countries]}
        assert len(mr_amount_d) == len(mr_locations_d), 'varying length of dicts'
        for mr in tex_mrkeys:
            exchange_dict[mr] = (mr_amount_d[mr], mr_locations_d[mr], 'not_loss')
    # store the result in every case, so that a dataset with only loss exchanges keeps them
    ds_exchange_dict[d['code']] = exchange_dict  #{code: {mr_key for exchanges: (amount, [countries], 'loss' or 'not_loss')}}
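
The aggregation by multiregion key used for country-aggregated datasets can also be expressed without pandas. The sketch below is an alternative formulation for illustration, not the notebook's implementation; the exchange records and the helper `sum_amounts_by_key` are hypothetical.

```python
from collections import defaultdict


def sum_amounts_by_key(exchanges):
    """Total input amount per multiregion key, ignoring the supplying location."""
    totals = defaultdict(float)
    for e in exchanges:
        totals[e["multiregion"]] += e["amount"]
    return dict(totals)


# Hypothetical exchanges: the same input (k1) is supplied from two regions.
toy_exchanges = [
    {"amount": 0.5, "multiregion": "k1", "location": "RER"},
    {"amount": 0.25, "multiregion": "k1", "location": "RoW"},
    {"amount": 1.0, "multiregion": "k2", "location": "GLO"},
]
print(sum_amounts_by_key(toy_exchanges))
```

The regional split that is discarded here is rebuilt later from the country-specific consumption mixes.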


In [29]:
with open('../../Data/tech_vector/ds_exchange_dict.p', 'wb') as o:
    pickle.dump(ds_exchange_dict, o)

#### 2.2 Generate necessary country-specific consumption mixes <a id='cons_mix'></a>  <br>
 <a href='#back'>back</a> 

In [31]:
# use iot to generate necessary consumption mix dictionary 
#{(iot_co_reg, consumer_ind, producer_ind):import_ratio_vec}

necesssary_calculations = set()
mr_connections = list()
for ds_code, exc_dict in ds_exchange_dict.items():
    consumer_mr = code_to_ds[ds_code]['multiregion key']
    consumer_sector = tuple(mr_key_MRD[consumer_mr]['exiobase_sector_index']) # matching sectors in iot
    for prod_mr in exc_dict.keys():
        producer_sector = tuple(mr_key_MRD[prod_mr]['exiobase_sector_index'])
        necesssary_calculations.add((consumer_sector, producer_sector))
        mr_connections.append((consumer_mr, prod_mr))
        
with open('../../Data/tech_vector/necesssary_calculations_sectors.p', 'wb') as o:
    pickle.dump(necesssary_calculations, o)

In [32]:
consumption_mix_dict = {}
for consumer_sector, producer_sector in list(necesssary_calculations):
    for loc in iot_co_reg:
        col = pd.MultiIndex.from_product([[loc], iot_sectors[list(consumer_sector)]]).tolist()
        ind = pd.MultiIndex.from_product([iot_co_reg, iot_sectors[list(producer_sector)]]).tolist()
        df = iot.loc[ind, col] # extract useful part
        # sum cols, then sum rows per iot_co_reg (groupby replaces sum(level=0), which was removed in pandas 2.0)
        consumer_sector_sum = df.sum(axis=1).groupby(level=0, sort=False).sum()
        if consumer_sector_sum.sum():
            import_ratio_vec = consumer_sector_sum/consumer_sector_sum.sum() #obtain a df,(number of iot_co_reg, 1)
        else: #there is one consumer_sector_sum = 0
            # sum=0, there is no explicit connection between the sectors, 
            # approximate the co_sector-co_sector import_ratio to co-co_sector ratios. i.e., all sectors in the consuming country are used
            df = iot.xs(loc, axis=1, level=0).loc[ind,:]
            consumer_sector_sum = df.sum(axis=1).groupby(level=0, sort=False).sum()  # replaces sum(level=0), removed in pandas 2.0
            assert consumer_sector_sum.sum(), 'production sector without data in exiobase'
            import_ratio_vec = consumer_sector_sum/consumer_sector_sum.sum()
        
        consumption_mix_dict[(loc, consumer_sector, producer_sector)] = import_ratio_vec ##pandas.core.series.Series
    
with open('../../Data/tech_vector/consumption_mix_dict.p', 'wb') as o:
    pickle.dump(consumption_mix_dict, o)

In [33]:
len(necesssary_calculations)*len(iot_co_reg) == len(consumption_mix_dict)  #94374

True

#### 2.3 Summarize production volumes for further suborigin allocation <a id='suborgin_PV'></a>  <br>
 <a href='#back'>back</a> 

In [None]:
mrkey_suborigin_PV = {}
for key in mr_keys:
    all_possible_locations_by_mrkey  = list(ref_dataset_dict[key].keys())
    suborigin_list = [loc for loc in all_possible_locations_by_mrkey]
    
    origin_to_suborgin = collections.defaultdict(list)
    for subo in suborigin_list:
        origin_to_suborgin[final_loc_set_TO_iot_reg_co[subo]].append(subo)
#     print(list(origin_to_suborgin.items())[0])
    mrkey_suborigin_PV[key] = origin_to_suborgin
    
    for origin,subo_list in origin_to_suborgin.items():
#         print(origin,subo_list)
        suborigin_PV = {}
        for subo in subo_list:
#             print(subo)
            subo_ds_code = ref_dataset_dict[key][subo]
            d_exc = code_to_ds[subo_ds_code]['exchanges']  # direct lookup instead of scanning all datasets
            PV = [exc['production volume'] for exc in d_exc if exc['type']=='production'][0]
            suborigin_PV[subo] =PV
        mrkey_suborigin_PV[key][origin] = suborigin_PV
#     print(mrkey_suborigin_PV)
    

with open('../../Data/tech_vector/mrkey_suborigin_PV.p', 'wb') as o:
    pickle.dump(mrkey_suborigin_PV, o)

In [35]:
len(mrkey_suborigin_PV) == len(mr_keys)

True

## 3. Generate technosphere exchanges for datasets in new ecoinvent <a id='step3'></a>
    
3.1 <a href='#full_index'>Create index for each dataset in new ecoinvent </a>   <br>
3.2 <a href='#gen_exch'>Generate technosphere exchanges for each dataset</a>  <br>
 <a href='#back'>back</a> 

#### 3.1 Create index for each dataset in new ecoinvent <a id='full_index'></a>  <br>
 <a href='#back'>back</a> 

In [25]:
# index for tech matrix. (final_loc, mr_key)
generated_datasets_index = set()
for mr_key, MRD in mr_key_MRD.items():
    for final_loc in list(MRD['reference_dataset_dict'].keys()):
        generated_datasets_index.add((final_loc, mr_key)) # if for a mr_key, datasets have location GLO or RoW, all iot countries will be covered.

full_index = sorted(list(generated_datasets_index))
full_index_dict = {ind:i for i, ind in enumerate(full_index)}
print(len(full_index))#before 335’099

with open('../../Data/tech_vector/full_index.p', 'wb') as o:
    pickle.dump(full_index, o)

full_reference_dataset_dict = {} #{(country, mr_key): reference dataset}
for index in full_index:
    final_loc, consuming_mr_code = index
    MRD = mr_key_MRD[consuming_mr_code]
    reference_dataset = MRD['reference_dataset_dict'][final_loc]
    full_reference_dataset_dict[index] = reference_dataset['code']

with open('../../Data/tech_vector/full_reference_dataset_dict.p', 'wb') as o:
    pickle.dump(full_reference_dataset_dict, o)


335099
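
To illustrate how such an index can address the technosphere matrix, the sketch below numbers toy (location, mr_key) pairs and stores one exchange as a matrix entry. The toy dictionary, the helper `add_exchange` and the convention that rows are producers and columns are consumers are assumptions for this example, not taken from the notebook.

```python
# Hypothetical reference mapping: {mr_key: {final location: code of reference dataset}}
toy_reference_dataset_dict = {
    "key_b": {"DE": "code_1", "FR": "code_1"},
    "key_a": {"DE": "code_2"},
}

# Every (final location, mr_key) pair is one dataset of the new database.
toy_full_index = sorted(
    (loc, key) for key, refs in toy_reference_dataset_dict.items() for loc in refs
)
toy_position = {ind: i for i, ind in enumerate(toy_full_index)}

# Sparse storage of matrix entries: {(row, column): amount}.
# Assumed convention: row = producing dataset, column = consuming dataset.
toy_entries = {}


def add_exchange(producer, consumer, amount):
    toy_entries[(toy_position[producer], toy_position[consumer])] = amount


add_exchange(("DE", "key_a"), ("FR", "key_b"), 0.4)
print(toy_full_index)
print(toy_entries)
```

Sorting the pairs makes the numbering reproducible between runs, which matters when vectors are generated separately and assembled afterwards.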


#### 3.2 Generate technosphere exchanges for each dataset <a id='gen_exch'></a>  <br>
 <a href='#back'>back</a> 

In [39]:
def get_consumption_mix_for_co_aggre_datasets(cons_final_loc, cons_mr_code, final_locations, prod_mr_code): 
    '''
    for co_aggre_datasets, in their exchanges, all possible origins behind a mr_key are considered.
    cons_final_loc must be in iot_co, producer final_locations not.
    all possible origins: iot_co, partco, other_co. the latter two are not in iot_co_reg. so:
    1. match partco to co (belongs to iot_co); match other_co to iot_reg. -> consume_from
    2. origins allocation
    3. suborigins allocation
    
    '''
    if len(final_locations)==1:
        consumption_mix = pd.Series({final_locations[0]: 1})
    
    else:
        origin_to_suborgin = collections.defaultdict(list)
        for loc in final_locations:
            origin_to_suborgin[final_loc_set_TO_iot_reg_co[loc]].append(loc)
    #     print(origin_to_suborgin)
        consume_from = list(origin_to_suborgin.keys())
#     assert len(consume_from) == len(list(mrkey_suborigin_PV[prod_mr_code].keys())), 'origin unmatch'.
        #Not necessary for mar ds considering only local supplier
    
        consumer_sec_ind = tuple(mr_key_MRD[cons_mr_code]['exiobase_sector_index'])
        producer_sec_ind = tuple(mr_key_MRD[prod_mr_code]['exiobase_sector_index']) 
        consumption_mix = consumption_mix_dict[(cons_final_loc, consumer_sec_ind, producer_sec_ind)]
        consumption_mix_specific = consumption_mix[consume_from]
        if not consumption_mix_specific.sum():
            # sum is 0. no data about consumption mix limits allocation to consumer country.
            # (or all available producers if no production in consumer country)
            consumption_mix = pd.Series({c:1/len(consume_from) for c in consume_from})
        else:
            consumption_mix = consumption_mix_specific/consumption_mix_specific.sum()#df(iot_reg_co, percent)
#         print(consumption_mix)
        
        consumption_mix_with_final_loc = {}
        for origin in consume_from:
            am = consumption_mix[origin]
            suborigin_list = origin_to_suborgin[origin]
            suborigin_PV_dict = mrkey_suborigin_PV[prod_mr_code][origin]
            assert len(suborigin_list) == len(suborigin_PV_dict), 'suborigin unmatch'
            total_PV = sum(list(suborigin_PV_dict.values()))
            if total_PV:
                for subo in suborigin_list:
                    PV = suborigin_PV_dict[subo]
                    consumption_mix_with_final_loc[subo]=am*PV/total_PV
            else:
                for subo in suborigin_list:
                    consumption_mix_with_final_loc[subo]=am/len(suborigin_list)
        consumption_mix = pd.Series(consumption_mix_with_final_loc)
#         print(consumption_mix)
            
    return consumption_mix
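
The two-level allocation above (shares across IOT origins first, then across the suborigins of each origin by production volume) can be illustrated with a minimal, self-contained sketch. The names `loc_to_iot_region`, `region_shares`, `production_volume` and `split_over_final_locations` are hypothetical stand-ins for `final_loc_set_TO_iot_reg_co`, `consumption_mix_dict` and `mrkey_suborigin_PV`, and all numbers are made-up toy values, not data from ecoinvent or exiobase.

```python
import collections

import pandas as pd

# Hypothetical toy inputs (assumptions for illustration only).
loc_to_iot_region = {"CN-GD": "CN", "CN-SC": "CN", "DE": "DE"}  # final location -> IOT origin
region_shares = pd.Series({"CN": 0.6, "DE": 0.4, "FR": 0.0})    # consumption mix at IOT level
production_volume = {"CN": {"CN-GD": 30.0, "CN-SC": 10.0}, "DE": {"DE": 5.0}}


def split_over_final_locations(final_locations):
    """Allocate a unit amount to final locations in two steps."""
    # Step 1: group final locations (suborigins) by their IOT origin.
    origin_to_suborigins = collections.defaultdict(list)
    for loc in final_locations:
        origin_to_suborigins[loc_to_iot_region[loc]].append(loc)
    origins = list(origin_to_suborigins)

    # Step 2: renormalize the origin shares; fall back to an equal split if all are zero.
    shares = region_shares[origins]
    if shares.sum():
        shares = shares / shares.sum()
    else:
        shares = pd.Series({o: 1 / len(origins) for o in origins})

    # Step 3: split each origin's share by production volume (equal split if PV is zero).
    result = {}
    for origin, suborigins in origin_to_suborigins.items():
        total_pv = sum(production_volume[origin][s] for s in suborigins)
        for s in suborigins:
            weight = production_volume[origin][s] / total_pv if total_pv else 1 / len(suborigins)
            result[s] = shares[origin] * weight
    return pd.Series(result)


mix = split_over_final_locations(["CN-GD", "CN-SC", "DE"])
```

With these toy numbers, the 0.6 share of CN is split 3:1 between CN-GD and CN-SC, so the resulting mix is 0.45, 0.15 and 0.4, which sums to 1.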

In [38]:
def get_consumption_mix_for_co_specific_datasets(cons_final_loc, cons_mr_code, final_locations, prod_mr_code):
    '''
    For co_specific_datasets, the producer final_locations awaiting allocation contain only iot_co,
    so there is no suborigin problem to handle.
    The producer final_locations must be in iot_co, but cons_final_loc can be partco/other_co and is matched to iot_co first.
    
    '''
    cons_iot_reg_co = final_loc_set_TO_iot_reg_co[cons_final_loc]
    consume_from = final_locations
    if len(consume_from) > 1:
        consumer_sec_ind = tuple(mr_key_MRD[cons_mr_code]['exiobase_sector_index'])
        producer_sec_ind = tuple(mr_key_MRD[prod_mr_code]['exiobase_sector_index']) 
        consumption_mix = consumption_mix_dict[(cons_iot_reg_co, consumer_sec_ind, producer_sec_ind)] #pandas.core.series.Series
        consumption_mix_specific = consumption_mix[consume_from]
        if not consumption_mix_specific.sum():
            # Sum is 0: exiobase has no consumption-mix data for these origins,
            # so the amount is split equally across all candidate origins.
            consumption_mix = pd.Series({c:1/len(consume_from) for c in consume_from})
        else:
            consumption_mix = consumption_mix_specific/consumption_mix_specific.sum()#df(country, percent)

    else:
        consumption_mix = pd.Series({consume_from[0]: 1})
            
    return consumption_mix
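
Both functions share the same normalization step: restrict the country-level consumption mix to the candidate producers, rescale it to sum to 1, and fall back to an equal split when the restricted mix is all zeros. A minimal sketch of that step is given below. The helper name `renormalize_mix` and the example shares are hypothetical. Unlike the notebook code, which indexes with `consumption_mix[consume_from]` and would raise on a missing label, this sketch uses `reindex` and treats missing candidates as zero.

```python
import pandas as pd


def renormalize_mix(mix, candidates):
    """Restrict `mix` to `candidates` and rescale the shares to sum to 1."""
    # Assumption: a candidate missing from the mix counts as a zero share.
    subset = mix.reindex(candidates).fillna(0.0)
    total = subset.sum()
    if total == 0:
        # No usable consumption-mix data: split equally across candidates.
        return pd.Series(1 / len(candidates), index=candidates)
    return subset / total


# Toy consumption mix of one consumer sector in one country (made-up values).
toy_mix = pd.Series({"DE": 0.2, "FR": 0.6, "IT": 0.2})
normalized = renormalize_mix(toy_mix, ["DE", "FR"])

# Fallback case: all candidate shares are zero.
fallback = renormalize_mix(pd.Series({"DE": 0.0, "FR": 0.0, "IT": 1.0}), ["DE", "FR"])
```

In the first call the shares of DE and FR become 0.25 and 0.75; in the fallback case each candidate receives 0.5.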

In [None]:
def regionalize(index):
    '''
    Generate a vector of technosphere exchanges for each dataset (identified by an index) in the new ecoinvent.
    For market datasets, only local producers are considered as suppliers of technosphere inputs.
    
    '''
    print(full_index.index(index))
    cons_final_loc, cons_mr_code = index
    MRD = mr_key_MRD[cons_mr_code]
    reference_dataset = MRD['reference_dataset_dict'][cons_final_loc]
    
    exchanges = ds_exchange_dict[reference_dataset['code']] 
    #for original co_aggre_ds: {mr_key of producer: (amount, [final loc], type_exc)}.
    #for original co_specific_ds: {mr_key of producer: [(amount, [final loc])]}
    exchange_vec = scipy.sparse.lil_matrix((len(full_index),1))
    
    if reference_dataset['location'] != cons_final_loc: # co_aggre_datasets       
        for prod_mr_code, (amount, final_locations, type_exc) in exchanges.items():
            if type_exc == 'not_loss': #final_locations = certain region. now cons_final_loc = prod_final_loc, prod_mr_code = cons_mr_code
                local_loc = [i for i in final_locations if i[:2]==cons_final_loc[:2]]
                if MRD['activity_type']=='market activity' and len(local_loc)>0: #iot_co in final locations or iot_co-partco in final locations
                    final_locations = local_loc
                if (prod_mr_code in TapWat_Irr_Mar_mr_code) and len(local_loc)>0:
                    final_locations = local_loc
                consumption_mix = get_consumption_mix_for_co_aggre_datasets(cons_final_loc,cons_mr_code,final_locations,prod_mr_code)
                consumption_mix_amount = consumption_mix.mul(amount)#df(country, amount_fraction)
                consumption = {(prod_final_loc, prod_mr_code): a for prod_final_loc,a 
                               in consumption_mix_amount.items()}
                for prod_key, amount_fraction in consumption.items():
                    exchange_vec[full_index_dict[prod_key], 0] = amount_fraction #{country: amount_fraction}
            else:
                exchange_vec[full_index_dict[(cons_final_loc, prod_mr_code)], 0] = amount
    
    else: # co_specific_datasets       
        for prod_mr_code, exc_list in exchanges.items():
            for (amount, final_locations) in exc_list:
                local_loc = [i for i in final_locations if i[:2]==cons_final_loc[:2]]
                if MRD['activity_type']=='market activity' and len(local_loc)>0:
                # co=co, partco=partco, co to partco, partco to smaller partco,
                # iot_co in final locations, partco-co in final locations
#                     check.append(len(local_loc))
                    consumption = {(local_loc[0], prod_mr_code): amount}
                else:    
                    if (prod_mr_code in TapWat_Irr_Mar_mr_code) and len(local_loc)>0:
                        final_locations = local_loc
                    consumption_mix = get_consumption_mix_for_co_specific_datasets(cons_final_loc, cons_mr_code, 
                                                                                   final_locations, prod_mr_code)
                    consumption_mix_amount = consumption_mix.mul(amount)#df(country, amount_fraction)
                    consumption = {(prod_final_loc, prod_mr_code): a for prod_final_loc,a 
                                   in consumption_mix_amount.items()}
                for prod_key, amount_fraction in consumption.items():
                    exchange_vec[full_index_dict[prod_key], 0] = amount_fraction #{country: amount_fraction}
                #for the consumer activity, put consumed amount_fraction into full-index vector based on  (country+prod_mr_key).      
    
    # save the regionalized dataset
    exchange_vec_csc = scipy.sparse.csc_matrix(exchange_vec)
    
    return (index, exchange_vec_csc) # (cons_country, cons_mr_key), series({row of (prod_country,prod_mr_key):amount_fraction})


In [None]:
%%time
with mp.Pool(processes=30) as pool:
    regionalized_dataset_list = pool.map(regionalize, full_index) 

with open('../data/tech_vector/regionalized_dataset_list.pickle', 'wb') as o:
    pickle.dump(regionalized_dataset_list, o)
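
Each entry of `regionalized_dataset_list` is an `(index, column vector)` pair, and `pool.map` returns results in the same order as `full_index`. One possible way to combine these columns into a technosphere matrix is sketched below with toy data. This assembly step is not shown in the notebook, so it is an assumption; so is the construction of `full_index_dict` by enumerating `full_index`, and the helper `make_column` is hypothetical.

```python
import scipy.sparse

# Toy index of (final location, mr_key) pairs; made-up values.
full_index = [("DE", "a"), ("FR", "a"), ("DE", "b")]
# Assumption: row positions follow the order of full_index.
full_index_dict = {key: i for i, key in enumerate(full_index)}


def make_column(entries):
    """Build one sparse column vector from {(location, mr_key): amount}."""
    vec = scipy.sparse.lil_matrix((len(full_index), 1))
    for key, amount in entries.items():
        vec[full_index_dict[key], 0] = amount
    return scipy.sparse.csc_matrix(vec)


# Same shape as the output of regionalize(): (index, column vector).
toy_regionalized = [
    (("DE", "a"), make_column({("DE", "b"): 0.7, ("FR", "a"): 0.3})),
    (("FR", "a"), make_column({("DE", "b"): 1.0})),
    (("DE", "b"), make_column({})),
]

# Stack the columns side by side; column j belongs to full_index[j].
columns = [col for _, col in toy_regionalized]
technosphere = scipy.sparse.hstack(columns, format="csc")
```

Here rows are producers and columns are consumers, so `technosphere[2, 0]` holds the amount of `("DE", "b")` consumed by `("DE", "a")`.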