### Setting up for Analysis

In [1]:
# Import the packages we may need
import pandas as pd
import os
import numpy as np

In [2]:
# Gets current working directory
os.getcwd()

'c:\\Users\\bfran\\Ironhack\\Week 3\\Project\\Ironhack-PRY-BRZ-MGA'

### Explanation of Data

The dataset compares Paraguayan and Brazilian reports of Brazil-directed trade.

In layman's terms, it compares what Paraguay reported exporting to Brazil with what Brazil reported importing from Paraguay.

Each row covers trade reporting for a single HS (Harmonized System) code. HS codes are a product classification that most countries have agreed to follow. We examined codes at the six-digit level, HS 6, the most specific level shared internationally. The HS 6 level distinguishes between, say, cardstock colored paper and ruled loose-leaf paper. This specificity allows the sharpest focus on item-based trade fraud.

Comparing the two reports side by side lets a researcher play a game of 'spot the difference' between them.

In [3]:
# Loads the previously created mx_compare.csv file into df
df = pd.read_csv('mx_compare.csv')

In [4]:
df

Unnamed: 0,hs_code,pry_x_q,pry_x_unit,brz_m_q,brz_m_unit,pry_x_net_wgt,brz_m_net_wgt,pry_x_value,itic_rate,applied_tariff,adj_x_value,brz_m_value,trade_gap,est_tax_loss,wgt_gap,value_ratio,wgt_ratio,density_ratio
0,20130,21523200.0,kg,20859060.0,kg,21523200.0,20859060.00,1.249529e+08,1.01,9.60,1.265773e+08,120517874.0,6059424.71,581704.77,664140.00,0.95,0.97,0.98
1,20220,52442.3,kg,52443.0,kg,52442.3,52443.00,1.404194e+05,1.01,8.00,1.417300e+05,140643.0,1087.01,86.96,-0.70,0.99,1.0,0.99
2,20230,7425400.0,kg,7144609.0,kg,7425400.0,7144609.00,3.770901e+07,1.01,9.60,3.806096e+07,36218864.0,1842097.60,176841.37,280791.00,0.95,0.96,0.99
3,20622,2883500.0,kg,2692900.0,kg,2883500.0,2692900.00,3.287412e+06,1.03,8.00,3.397540e+06,3053319.0,344221.49,27537.72,190600.00,0.9,0.93,0.96
4,20629,1037276.0,kg,978823.0,kg,1037276.0,978823.00,2.882343e+06,1.03,8.00,2.978902e+06,2749729.0,229172.92,18333.83,58453.00,0.92,0.94,0.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
791,851230,0.0,NR,2.0,u,0.0,0.19,0.000000e+00,1.03,18.00,0.000000e+00,4.0,-4.00,-0.72,-0.19,NXR,NXR,
792,870810,0.0,NR,15.0,kg,0.0,15.00,0.000000e+00,1.01,18.00,0.000000e+00,10.0,-10.00,-1.80,-15.00,NXR,NXR,
793,870894,0.0,NR,2.0,kg,0.0,2.00,0.000000e+00,1.01,15.09,0.000000e+00,25.0,-25.00,-3.77,-2.00,NXR,NXR,
794,902610,0.0,NR,102.0,u,0.0,8.48,0.000000e+00,1.01,11.80,0.000000e+00,1278.0,-1278.00,-150.80,-8.48,NXR,NXR,
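
The notebook does not define the derived columns, but the sample rows are consistent with the relationships sketched below. These formulas are inferred from the printed numbers, not confirmed by the source: `trade_gap` as `adj_x_value` minus `brz_m_value`, `est_tax_loss` as `trade_gap` times `applied_tariff` read as a percentage, `value_ratio` and `wgt_ratio` as Brazil's figure divided by Paraguay's, and `density_ratio` as `value_ratio` over `wgt_ratio`. The function name `derive_gap_columns` is hypothetical, and the input figures are those of the HS 251200 row that appears in a later output of this notebook.

```python
import pandas as pd


def derive_gap_columns(row: pd.Series) -> pd.Series:
    """Recompute the derived columns from the reported figures (assumed formulas)."""
    trade_gap = row["adj_x_value"] - row["brz_m_value"]
    value_ratio = row["brz_m_value"] / row["adj_x_value"]
    wgt_ratio = row["brz_m_net_wgt"] / row["pry_x_net_wgt"]
    return pd.Series({
        "trade_gap": round(trade_gap, 2),
        # applied_tariff is treated as a percentage (assumption)
        "est_tax_loss": round(trade_gap * row["applied_tariff"] / 100, 2),
        "wgt_gap": round(row["pry_x_net_wgt"] - row["brz_m_net_wgt"], 2),
        "value_ratio": round(value_ratio, 2),
        "wgt_ratio": round(wgt_ratio, 2),
        "density_ratio": round(value_ratio / wgt_ratio, 2),
    })


# Reported figures for HS 251200, copied from the notebook output
hs_251200 = pd.Series({
    "pry_x_net_wgt": 1906500.0,
    "brz_m_net_wgt": 1379500.0,
    "adj_x_value": 334355.47,
    "brz_m_value": 212121.0,
    "applied_tariff": 3.2,
})

derived = derive_gap_columns(hs_251200)
print(derived)
```

If the assumed formulas hold, the recomputed values should match the corresponding columns of that row; a mismatch on other rows would suggest the real derivation differs.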


### Creating Subsets

One thing we noticed in this dataset was the high number of rows where one party reported trade activity but the other did not. 

For instance, in row 792, Paraguay reported sending no weight or quantity of HS item 870810, but Brazil reported receiving 15 kg of that same item. 

That lack of reporting by one party undermines our ability to analyze the accuracy of either country's reporting. Imagine trying to play a game of 'spot the difference' when you only have one picture, not two.

In order to better analyze and explore our data, we need to separate the rows where there was reporting from both countries from those where there was not. 

In [5]:
# Creates two subsets

# Rows are fully reported when neither ratio column carries a missing-report flag
missing_flags = ["NXR", "NMR"]
fully_reported = (~df['value_ratio'].isin(missing_flags)
                  & ~df['wgt_ratio'].isin(missing_flags))

# This subset removes rows in which there were no reported import or export weights or values
df_full_rep = df[fully_reported]

# This subset includes only those rows with incomplete reporting.
df_missing_rep = df[~fully_reported]

In [6]:
pd.set_option('display.max_rows', None)
df_full_rep['wgt_ratio'].value_counts()

wgt_ratio
1.0         180
0.99         27
0.97         16
0.98         13
1.01         11
0.96          7
0.75          5
0.94          5
0.93          5
1.02          5
0.81          4
1.04          4
0.04          3
1.05          3
0.95          3
1.1           3
1.03          3
0.87          3
0.92          3
0.91          3
0.67          2
1.11          2
0.02          2
0.6           2
0.55          2
0.82          2
0.1           2
0.9           2
0.89          2
1.29          2
0.54          2
0.06          2
1.06          2
0.77          2
0.7           2
1.13          2
0.26          2
1.15          2
3.88          1
1.48          1
0.17          1
11.0          1
2.21          1
2.02          1
0.37          1
0.3           1
1.4           1
1.25          1
0.39          1
0.01          1
6.67          1
1.93          1
0.71          1
0.52          1
0.78          1
0.68          1
1.27          1
3.11          1
3.37          1
214.24        1
6.27          1
0.22          

In [7]:
# Prints the shape of each dataset
print("Shape of full reporting data set: ",df_full_rep.shape)
print("Shape of missing reporting data set: ",df_missing_rep.shape)

Shape of full reporting data set:  (414, 18)
Shape of missing reporting data set:  (382, 18)


With 382 of the 796 rows (about 48%) missing a report from one side, incomplete reporting is clearly a major problem in Brazil-directed trade with Paraguay. 

In [8]:
# Counts the occurrences of unique values in value ratio column
df_missing_rep['value_ratio'].value_counts()

value_ratio
NMR     356
NXR      23
0.01      2
0.04      1
Name: count, dtype: int64

However, contrary to our stated hypothesis, the value counts in the value ratio column suggest that the missing reports are mostly a Brazil issue: 356 of the 382 rows are flagged NMR, against only 23 flagged NXR. 

Or it could be that Paraguay is grossly misclassifying its exports. 

Either way, nearly half of the HS codes in Brazil-directed trade with Paraguay are poorly documented by one party or the other. This creates fertile ground for trade fraud. 
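
The proportions behind these statements can be checked with a few lines of arithmetic. This is only a sketch: the counts are copied by hand from the outputs above, and the `numeric` label is a made-up name for the three rows whose value ratio is a number rather than a flag.

```python
# Counts copied from the shape and value_counts outputs above
full_rows = 414
missing_rows = 382
missing_by_flag = {"NMR": 356, "NXR": 23, "numeric": 3}

# Share of all rows that lack a report from one side
total_rows = full_rows + missing_rows
missing_share = missing_rows / total_rows
print(f"Rows with incomplete reporting: {missing_share:.1%}")

# Breakdown of the incomplete rows by flag
for flag, count in missing_by_flag.items():
    print(f"{flag}: {count / missing_rows:.1%} of incomplete rows")
```

The incomplete rows come to roughly 48% of the total, and NMR accounts for roughly 93% of them.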

### Comparing data that can be compared

To make sure we examine only rows that show serious reporting discrepancies, we filter out any row whose value, weight, and density ratios all fall within 15% of parity (between 0.85 and 1.15). 

Filtering out rows that show fairly consistent reporting leaves only those where the discrepancies are large enough to merit closer examination.

In [9]:
# Because we know that one of the rows in the full_rep subset has a string value ('kWh') in the wgt ratio column...
# ... we replace it with a zero for convenience's sake.
# Working on an explicit copy avoids pandas' SettingWithCopyWarning.
df_full_rep = df_full_rep.copy()
df_full_rep['wgt_ratio'] = df_full_rep['wgt_ratio'].replace('kWh', 0.0)


In [10]:
# Convert the ratio columns to numeric type, coercing any non-numeric values to NaN.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
ratio_cols = ['value_ratio', 'wgt_ratio', 'density_ratio']
df_full_rep = df_full_rep.assign(
    **{col: pd.to_numeric(df_full_rep[col], errors='coerce') for col in ratio_cols})

# Flags rows where at least one ratio is missing or falls outside the expected 0.85 to 1.15 band.
condition = ~((df_full_rep['value_ratio'] > 0.85) & (df_full_rep['value_ratio'] < 1.15) & 
              (df_full_rep['wgt_ratio'] > 0.85) & (df_full_rep['wgt_ratio'] < 1.15) &
              (df_full_rep['density_ratio'] > 0.85) & (df_full_rep['density_ratio'] < 1.15))

# Keeps only the flagged rows
df_full_rep = df_full_rep[condition]

In [11]:
df_full_rep.shape

(127, 18)

After filtering out all rows whose ratios are close to what we expect, we are left with 127 rows where the discrepancies could indicate trade fraud.
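
The band filter above repeats the same pair of comparisons for each ratio column. One way it could be written more compactly is sketched below on toy data; the helper name `within_band` and the toy values are illustrative and not part of the project. It relies on `Series.between` with `inclusive="neither"`, which needs pandas 1.3 or later, to reproduce the strict inequalities.

```python
import pandas as pd


def within_band(ratios: pd.Series, lower: float = 0.85, upper: float = 1.15) -> pd.Series:
    """True where a ratio lies strictly between lower and upper (NaN gives False)."""
    return ratios.between(lower, upper, inclusive="neither")


# Toy data for illustration only; not taken from mx_compare.csv
toy = pd.DataFrame({
    "value_ratio": [0.95, 0.72, 1.00, None],
    "wgt_ratio": [0.97, 0.75, 1.30, 1.00],
    "density_ratio": [0.98, 0.95, 0.77, 1.00],
})

ratio_cols = ["value_ratio", "wgt_ratio", "density_ratio"]
# A row counts as consistent only if every ratio is inside the band
consistent = toy[ratio_cols].apply(within_band).all(axis=1)
flagged = toy[~consistent]
print(flagged)
```

As in the original condition, a row with a missing ratio is treated as inconsistent and stays in the flagged set.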

In [12]:
# Selects rows where Brazil reported noticeably less value and weight than Paraguay
# (both ratios below 0.85) while the density ratio stays below 1.15.
condition_sm = ((df_full_rep['value_ratio'] < 0.85) & 
                (df_full_rep['wgt_ratio'] < 0.85) &
                (df_full_rep['density_ratio'] < 1.15))

In [13]:
sm = df_full_rep[condition_sm]

sm

Unnamed: 0,hs_code,pry_x_q,pry_x_unit,brz_m_q,brz_m_unit,pry_x_net_wgt,brz_m_net_wgt,pry_x_value,itic_rate,applied_tariff,adj_x_value,brz_m_value,trade_gap,est_tax_loss,wgt_gap,value_ratio,wgt_ratio,density_ratio
5,20649,100016.0,kg,75014.0,kg,100016.0,75014.0,92000.16,1.03,8.0,95082.17,68066.0,27016.17,2161.29,25002.0,0.72,0.75,0.95
15,90300,1342970.0,kg,1099929.0,kg,1342970.0,1099929.0,2018585.0,1.01,8.0,2046845.19,1652259.0,394586.19,31566.9,243041.0,0.81,0.82,0.99
32,120890,9451.2,kg,6654.0,kg,9451.2,6654.0,32647.2,1.04,8.0,33887.79,25728.0,8159.79,652.78,2797.2,0.76,0.7,1.08
53,200929,66300.0,kg,44200.0,kg,66300.0,44200.0,195630.03,1.02,11.2,200292.55,128210.0,72082.55,8073.25,22100.0,0.64,0.67,0.96
76,251200,1906500.0,kg,1379500.0,kg,1906500.0,1379500.0,322696.54,1.04,3.2,334355.47,212121.0,122234.47,3911.5,527000.0,0.63,0.72,0.88
105,310590,66937.9,kg,51650.0,kg,66937.9,51650.0,250790.88,1.01,1.07,253006.2,211943.0,41063.2,438.01,15287.9,0.84,0.77,1.09
162,391723,2.64,kg,1.0,kg,2.64,1.0,322.4,1.04,16.0,333.95,30.0,303.95,48.63,1.63,0.09,0.38,0.24
168,391890,134189.0,kg,79811.0,kg,134189.0,79811.0,593512.06,1.01,12.8,598457.99,343727.0,254730.99,32605.57,54378.0,0.57,0.59,0.97
170,391990,1468.77,kg,157.65,kg,1468.77,157.65,8471.29,1.01,16.0,8554.59,547.0,8007.59,1281.21,1311.12,0.06,0.11,0.6
174,392043,49081.9,kg,31441.0,kg,49081.9,31441.0,72600.12,1.01,12.8,73531.82,46636.0,26895.82,3442.67,17640.9,0.63,0.64,0.99


These 33 rows show signs of possible smuggling into Brazil. There are other possible explanations: the imported item may simply have been misclassified (meaning it should have been called, say, HS 890111, but was instead called HS 891001), or one party may have understated the value of the product. If Paraguay undervalued the product, it could indicate that someone is cutting a deal to give a buyer extra product for free. Alternatively, if Brazil undervalued the product, it could be that someone in Brazil is trying to skim off some of the product or profits for themselves. 

In [14]:
tariffs = pd.read_csv("brz_app_tariffs.csv")

tariffs.columns

Index(['reporter', 'year', 'hs_code', 'applied_tariff', 'commodity'], dtype='object')

In [15]:
# Drops the columns we don't need, leaving only hs_code and commodity
columns = ['reporter','year','applied_tariff']

tariffs = tariffs.drop(columns=columns)

tariffs.shape

(16584, 2)

In [16]:
sm_codes = sm['hs_code']
sm_codes.shape

(33,)

In [22]:
# Attaches a commodity description to each flagged HS code
sm_com = pd.merge(sm_codes, tariffs, on='hs_code', how='left')

In [26]:
# Drops the repeated rows that the merge produced
sm_com.drop_duplicates(inplace=True)

In [27]:
sm_com

Unnamed: 0,hs_code,commodity
0,20649,"Edible offal of swine, frozen (excl. livers)"
3,90300,Mate
6,120890,Flours and meal of oil seeds or oleaginous fru...
9,200929,"Grapefruit juice, unfermented, Brix value > 20..."
12,251200,"Siliceous fossil meals, e.g. kieselguhr, tripo..."
15,310590,Mineral or chemical fertilisers containing the...
18,391723,"Rigid tubes, pipes and hoses, of polymers of v..."
21,391890,"Floor coverings of plastics, whether or not se..."
24,391990,"Self-adhesive plates, sheets, film, foil, tape..."
27,392043,"Plates, sheets, film, foil and strip, of non-c..."
