# Final Project

## To Do in Python:  
- Take the list of product codes, issuing countries and corresponding waves of sanctions  
- Find the full codes of the products  
- Get the 2021 export values from Russia to the corresponding country (the sanction issuer)  
  
This way we can estimate the potential loss by sanction wave or by issuing country.  

Importing libraries

In [1]:
import os
import json
import requests
import pandas as pd
import qgrid
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import urllib3
from datetime import datetime
from bs4 import BeautifulSoup

#qgrid.disable()

Loading the initial dataset  

In [2]:
Export_all = pd.read_excel('Initial data.xlsx',sheet_name = 0,converters={'Код':str})
Codes_all = pd.read_excel('Initial data.xlsx',sheet_name = 1,converters={'Код ТН ВЭД':str})
Codes_all = Codes_all.dropna(how='all')
Codes_all = Codes_all.reset_index(drop = True)
Codes_all = Codes_all.copy()

In [3]:
Export_all.head()

Unnamed: 0,Код,Наименование товара
0,TOTAL,Все продукты
1,270900,"Нефть сырая и нефтепродукты сырые, полученные ..."
2,999999,"Товары, нигде не указанные"
3,271019,Прочие дистилляты и продукты
4,710812,Золото в прочих необработанных формах


In [4]:
Codes_all.head()

Unnamed: 0,Код ТН ВЭД,Только код (для ВПР),Пошлина,Название по классификации ФТС,Отрасль,Наименование (англ.) -\n первоисточник,Наименование \n(перевод),Страна,Дата,№ пакета санкций
0,Запрет на ввоз определенных изделий из железа ...,,,,,,,,NaT,
1,721041 (объедиенение на уровне 6-значного кода...,721041.0,,ПРОКАТ ПЛОСКИЙ ИЗ ЖЕЛЕЗА ИЛИ НЕЛЕГИРОВАННОЙ СТ...,Металлопродукция,Metallic Coated Sheets,Листы с металлическим покрытием,ЕС,2022-03-15,4.0
2,721049 (объедиенение на уровне 6-значного кода...,721049.0,,ПРОКАТ ПЛОСКИЙ ИЗ ЖЕЛЕЗА ИЛИ НЕЛЕГИРОВАННОЙ СТ...,Металлопродукция,Metallic Coated Sheets,Листы с металлическим покрытием,ЕС,2022-03-15,4.0
3,721061 (объедиенение на уровне 6-значного кода...,721061.0,,ПРОКАТ ПЛОСКИЙ ИЗ ЖЕЛЕЗА ИЛИ НЕЛЕГИРОВАННОЙ СТ...,Металлопродукция,Metallic Coated Sheets,Листы с металлическим покрытием,ЕС,2022-03-15,4.0
4,721069 (объедиенение на уровне 6-значного кода...,721069.0,,ПРОКАТ ПЛОСКИЙ ИЗ ЖЕЛЕЗА ИЛИ НЕЛЕГИРОВАННОЙ СТ...,Металлопродукция,Metallic Coated Sheets,Листы с металлическим покрытием,ЕС,2022-03-15,4.0


As we can see, the codes in Codes_all are embedded in free text, so to get the actual code we need to parse each string and pull out the number inside it. The structure is mostly "aaa 123 aa": a number surrounded by other words.  
    
str_divider takes a value and, if it is a string, splits it on " " and passes the parts to code_catcher, which tries to convert each part to an integer in turn. The first part that converts is the code we need, and it is returned as a string; if no part converts, 0 is returned. Non-string values are returned unchanged.

In [5]:
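def code_catcher(b, k):
    # Return the first element of list b, starting at index k, that parses as an integer.
    # The element is returned as the original string; 0 means no element qualified.
    if not k == len(b):
        try:
            int(b[k])  # raises ValueError if this part is not a number
            a = b[k]
        except ValueError:
            # Not a number: move on to the next part
            a = code_catcher(b, k + 1)
        return a
    else:
        return 0

def str_divider(a):
    # Strings are split on spaces and searched for a code; other values pass through unchanged
    if type(a) == str:
        return code_catcher(a.split(' '), 0)
    else:
        return a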

Here's a small demonstration:

In [6]:
str_divider('aaaa 123 aa a')

'123'
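
The same extraction can also be written without recursion. The sketch below is an illustrative variant, not the function used in this notebook: `first_numeric_token` is a hypothetical name, and it accepts only tokens made purely of the digits 0-9, which is slightly stricter than `int()` (for example, `int('-5')` succeeds, while this variant skips `'-5'`).

```python
import re

def first_numeric_token(value, default=0):
    """Return the first space-separated token that consists only of digits 0-9.

    Non-string inputs are returned unchanged, and `default` is returned when
    no token qualifies, mirroring the behaviour of str_divider.
    """
    if not isinstance(value, str):
        return value
    for token in value.split(' '):
        if re.fullmatch(r'[0-9]+', token):
            return token
    return default
```

A loop avoids the recursion depth growing with the number of words in the cell, which should not matter for short strings like these but keeps the control flow easier to follow.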

In [7]:
# Assign the whole column at once: element-wise chained assignment (df[col][i] = ...)
# triggers pandas' SettingWithCopyWarning and may write to a temporary copy
Codes_all['Код ТН ВЭД'] = Codes_all['Код ТН ВЭД'].map(lambda v: str(str_divider(v)))


The next issue is that the codes are not necessarily identical: we can have 290711 in Export_all while the value in Codes_all is 2907. A basic merge won't work, so we need to cut the last 2 digits of the export code first.  
  
We create 2 dictionaries: {code: sanction pack} and {code: country}.  
  
Then we take each code in Export_all, cut its last 2 digits, and look the result up in both dictionaries. Codes without a match get 0.

In [8]:
Dict_sanction_pack = dict(zip(Codes_all['Код ТН ВЭД'],Codes_all['№ пакета санкций']))
Dict_country = dict(zip(Codes_all['Код ТН ВЭД'],Codes_all['Страна']))

# Cut the last 2 digits of every export code and look the prefix up in both dictionaries;
# 0 marks a code that is not on the sanction list
prefixes = [str(i)[:-2] for i in Export_all['Код']]
column_pack = [Dict_sanction_pack.get(p, 0) for p in prefixes]
column_countries = [Dict_country.get(p, 0) for p in prefixes]

Export_all['Sanction pack'] = column_pack
Export_all['Block region'] = column_countries
# Keep only the rows that fall under some sanction pack
Export_packs = Export_all.loc[Export_all['Sanction pack'] > 0]
Export_packs = Export_packs.reset_index(drop=True)
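
Cutting exactly 2 digits assumes every sanctioned code is 2 digits shorter than the export code. If the sanction list mixes code lengths, a longest-prefix lookup is one possible alternative. The sketch below is only an illustration under that assumption: `match_by_prefix` is a hypothetical helper, and the sample dictionary values are made up.

```python
def match_by_prefix(code, lookup, lengths=(6, 4, 2), default=0):
    """Return lookup[prefix] for the longest prefix of `code` found in `lookup`.

    `lengths` lists the prefix lengths to try, longest first.
    `default` is returned when no prefix matches.
    """
    code = str(code)
    for n in lengths:
        if len(code) >= n and code[:n] in lookup:
            return lookup[code[:n]]
    return default

# Made-up sample in the same shape as Dict_sanction_pack: {code: sanction pack}
sample_lookup = {'2907': 5.0, '721041': 4.0}
print(match_by_prefix('290711', sample_lookup))  # matched through the 4-digit prefix
print(match_by_prefix('721041', sample_lookup))  # matched through the full 6-digit code
print(match_by_prefix('TOTAL', sample_lookup))   # no match, falls back to the default
```

Trying the longest prefix first means a specific 6-digit entry takes priority over a broader 4-digit one when both are present.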

The website we will pull the data from encodes countries with numeric codes based on the M49 standard. We tried to map our countries through a dictionary of all countries, but it turned out that the website's codes sometimes differ from it. So we use the specific codes from the website, which we looked up manually; there are only a few countries, so this is not a problem.

In [9]:
# Country codes as used by the website, looked up manually
Country_codes_dict = {
    'ЕС': 26,
    'США': 842,
    'Великобритания': 826,
    'Канада': 124,
    'Новая Зеландия': 554,
    'Швейцария': 757,
}
# A KeyError here means a region is missing from the dictionary above
column_code = [Country_codes_dict[i] for i in Export_packs['Block region']]
Export_packs['countryCode'] = column_code
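
Because the codes are entered by hand, it can help to check for regions without a code before building the column. This is a minimal sketch of such a check; `missing_regions` is a hypothetical helper, and the region names in the example call are placeholders.

```python
def missing_regions(regions, code_map):
    """Return the distinct regions that have no entry in code_map, in order of first appearance."""
    distinct = dict.fromkeys(regions)  # removes duplicates while keeping order
    return [r for r in distinct if r not in code_map]

# Placeholder data: 'Япония' stands in for any region that was not entered manually
print(missing_regions(['ЕС', 'США', 'Япония', 'ЕС'], {'ЕС': 26, 'США': 842}))
```

In the notebook this could be called as `missing_regions(Export_packs['Block region'], Country_codes_dict)`; an empty list would mean every row can be mapped.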

In [10]:
Export_packs.head()

Unnamed: 0,Код,Наименование товара,Sanction pack,Block region,countryCode
0,270900,"Нефть сырая и нефтепродукты сырые, полученные ...",6.0,ЕС,26
1,271019,Прочие дистилляты и продукты,6.0,ЕС,26
2,710812,Золото в прочих необработанных формах,7.0,Новая Зеландия,554
3,271012,Легкие дистилляты и продукты,6.0,ЕС,26
4,270112,Уголь битуминозный,5.0,ЕС,26


## Now to the most interesting part.  

We have a website, https://www.trademap.org/. It doesn't have an API, so to get the data we either have to search for it manually or find a way to automate the searches.  
    
We found a pattern in how the URLs are built, which lets us insert the needed product and country codes into them. The function below sends the URL to the website, gets back the full HTML of the page, and finds the needed value: the export for 2021. Of course, it's not completely automatic: the code needed some initial manual tuning to determine where exactly to look on the page.  
  
The function returns the value and the URL it built, which can be useful later. A request takes about 1 second on average. 

In [11]:
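from io import StringIO

def html_parse(countryCode,Code):
    # verify=False is used below, so silence the resulting InsecureRequestWarning
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    # '%7c' is the URL-encoded '|' that separates the query fields; 643 is the code for Russia
    query3 = '%7c%7c%7c6%7c1%7c1%7c2%7c2%7c1%7c1%7c1%7c1%7c1'
    if not countryCode == 26:
        query1 = 'https://www.trademap.org/Bilateral_TS.aspx?nvpm=1%7c643%7c%7c'
        query2 = '%7c%7c'
        vgm_url = query1+str(countryCode)+query2+str(Code)+query3
    else:
        # The EU (code 26) is placed in a different field than a single country
        query1 = 'https://www.trademap.org/Bilateral_TS.aspx?nvpm=1%7c643%7c%7c%7c26%7c'
        vgm_url = query1+str(Code)+query3

    headers = requests.utils.default_headers()
    headers.update({'User-Agent': 'My User Agent 1.0'})

    # Retry failed connections up to 3 times with a growing delay between attempts
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    a = session.get(vgm_url,headers=headers,verify=False).text

    # Parse every HTML table on the page. Table 8 holds the export values:
    # column 1 is the product code and column 5 the value for 2021
    # (these positions were found by inspecting the page).
    # StringIO avoids the deprecation warning newer pandas versions raise for literal HTML strings.
    dfs = pd.read_html(StringIO(a))
    # int() raises TypeError when no row matches, which the calling loop catches
    value = int(dfs[8].loc[dfs[8][1] == (Code)][5])

    return(value,vgm_url)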
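
The URL pattern inside html_parse can be made more visible by treating the `nvpm` value as a list of fields joined by `%7c` (the URL-encoded `|`). The sketch below reproduces the two URL shapes from the function; `build_url` is a hypothetical helper, and the labels "country slot" and "group slot" are our guess at what the two positions mean, not something documented by the website.

```python
SEP = '%7c'  # URL-encoded '|'
BASE = 'https://www.trademap.org/Bilateral_TS.aspx?nvpm='
# Trailing fields copied unchanged from html_parse
TAIL = ['', '', '6', '1', '1', '2', '2', '1', '1', '1', '1', '1']

def build_url(country_code, product_code, group_code=26):
    """Build the query URL; the group code (26, the EU) goes one field later than a country code."""
    if country_code == group_code:
        country_slot, group_slot = '', str(country_code)
    else:
        country_slot, group_slot = str(country_code), ''
    # 1 and 643 (Russia) are the leading fields used in html_parse
    fields = ['1', '643', '', country_slot, group_slot, str(product_code)] + TAIL
    return BASE + SEP.join(fields)

print(build_url(842, '2709'))
print(build_url(26, '2709'))
```

Keeping the fields in a list makes it easier to see which position changes between the two cases than comparing two long string literals.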

Here's the loop. Sometimes something goes wrong: for example, the needed value is not displayed on the webpage. In that case the loop prints the problematic row and carries on. We had around 20 errors out of 1800 values, which we considered acceptable.  

In [None]:
Final_list = []

for i in range(0,len(Export_packs)):
    #print(i)
    if len(Export_packs['Код'][i]) > 1:
        code = str(Export_packs['Код'][i][:-2])
        countryCode = int(Export_packs['countryCode'][i])
        try:
            result,url = html_parse(countryCode,code)
        except TypeError:
            result = 'Err'
            url = 'Err'
            print('Error on ',i,code,countryCode)
        #print(countryCode,code)
        Final_list.append([code,countryCode,result,Export_packs['Sanction pack'][i],url])
    if i % 100 == 0:
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print('Finished ',i,'out of',len(Export_packs),'; currently ',len(Final_list),'lines. Time: ',current_time)
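
The loop above only catches TypeError. A page with fewer tables than expected, or a cell that is not a number, would raise a different exception and stop the run. One possible way to make the loop more tolerant is a small wrapper like the sketch below; `safe_fetch` is a hypothetical helper, and the set of exceptions is our assumption about what can go wrong, not a list observed in the actual run.

```python
def safe_fetch(fetch, country_code, code, errors=(TypeError, ValueError, IndexError)):
    """Call fetch(country_code, code); on a listed error, report it and return ('Err', 'Err')."""
    try:
        return fetch(country_code, code)
    except errors as exc:
        print('Error on', code, country_code, type(exc).__name__)
        return 'Err', 'Err'
```

Inside the loop, `result,url = safe_fetch(html_parse, countryCode, code)` would then replace the try/except block.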

In [None]:
Finalll = pd.DataFrame(Final_list)

In [None]:
Finalll.columns = ['Code','Country','Sum for 2021','Sanction pack','url']

In [None]:
Finalll = Finalll.drop_duplicates()
Finalll.to_csv('Final result.csv')
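
With the results collected, the estimate named at the top (potential loss by sanction wave or by issuing country) comes down to summing the 2021 values per group. Below is a minimal sketch on plain lists in the same layout as Final_list; `totals_by` is a hypothetical helper, the sample rows are made-up placeholder numbers, and rows marked 'Err' are skipped.

```python
from collections import defaultdict

def totals_by(rows, key_index, value_index=2):
    """Sum the numeric values at value_index, grouped by the entry at key_index."""
    totals = defaultdict(int)
    for row in rows:
        value = row[value_index]
        if isinstance(value, (int, float)):  # skips 'Err' placeholders
            totals[row[key_index]] += value
    return dict(totals)

# Placeholder rows: [code, countryCode, result, sanction pack, url]
sample_rows = [
    ['2709', 26, 100, 6.0, 'url-1'],
    ['2710', 26, 50, 6.0, 'url-2'],
    ['7108', 554, 'Err', 7.0, 'Err'],
    ['2701', 26, 25, 5.0, 'url-3'],
]
by_pack = totals_by(sample_rows, key_index=3)
by_country = totals_by(sample_rows, key_index=1)
print(by_pack)
print(by_country)
```

Note that Final_list can contain duplicate rows (the notebook removes them with drop_duplicates), so duplicates should be dropped before summing to avoid counting a product twice.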