# International Data Preprocessing Examples 

### Introduction 

This Python package aims to be a lightweight, easy-to-integrate solution for international data preprocessing. The three examples below show how its main functions are used in practice. Each one works on a small, curated dataset of international metrics and accomplishes a specific goal, described in its own section.

### Example 1: Per Capita Calculation

This first example uses a data frame of immigration metrics. Each row gives the number of immigrants from a given country who travelled to the United States in 2021. The data is read from a CSV file into a pandas dataframe, and a preview is printed.

Before preprocessing can begin, the dataframe must be prepared for function input. This is the parse step. Here, the function parse_countries converts the origin and destination country names into standard alpha-3 codes, which is the input format that adjust_per_capita expects for its country column.

Finally, we run adjust_per_capita, which takes the numeric column (immigrantPopulation), the country column whose population is used (originCountry), and the population year column (year). The function then rewrites the immigrant population column as immigrants per capita of the origin country.
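
To make the per-capita step concrete, here is a minimal sketch of the underlying arithmetic in plain pandas. It is not the package's implementation: per_capita_sketch is a hypothetical helper, and the populations table holds rounded, illustrative figures that are assumptions rather than the data the package uses.

```python
import pandas as pd

# Immigrant counts from the example data, already parsed to alpha-3 codes.
immigration = pd.DataFrame({
    "originCountry": ["IND", "CHN", "BRA", "IDN"],
    "immigrantPopulation": [2652853, 2221943, 472637, 94079],
    "year": [2021, 2021, 2021, 2021],
})

# ASSUMPTION: rounded, illustrative 2021 populations (not the package's data source).
populations = pd.DataFrame({
    "originCountry": ["IND", "CHN", "BRA", "IDN"],
    "year": [2021, 2021, 2021, 2021],
    "population": [1_400_000_000, 1_410_000_000, 214_000_000, 274_000_000],
})

def per_capita_sketch(df, value_col, country_col, year_col, pop_df):
    """Hypothetical helper: divide value_col by the matching country-year population."""
    # A left merge keeps the row order of df; validate guards against duplicate population rows.
    merged = df.merge(pop_df, on=[country_col, year_col], how="left", validate="many_to_one")
    result = df.copy()
    result[value_col] = (merged[value_col] / merged["population"]).to_numpy()
    return result

per_capita = per_capita_sketch(
    immigration, "immigrantPopulation", "originCountry", "year", populations
)
print(per_capita)
```

Unlike the package call in the cell below, this sketch returns a new dataframe instead of modifying its input, which keeps the original counts available for comparison.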

In [1]:
import pandas as pd
from international_data_preprocessing import parse_countries, adjust_per_capita #package import 

immigration_df = pd.read_csv('ExampleCSVImmigration.csv') #read in csv to pandas dataframe
print(immigration_df.head(10))

parse_countries(immigration_df, 'originCountry', ['alpha3']) #parse country name to alpha 3 code
parse_countries(immigration_df, 'destinationCountry', ['alpha3']) #parse country name to alpha 3 code
print(immigration_df.head(10))

adjust_per_capita(immigration_df, 'immigrantPopulation', 'originCountry', 'year') #run per-capita function
print(immigration_df.head(10))

  originCountry destinationCountry  immigrantPopulation  year
0         India      United States              2652853  2021
1         China      United States              2221943  2021
2        Brazil      United States               472637  2021
3     Indonesia      United States                94079  2021
  originCountry destinationCountry  immigrantPopulation  year
0           IND                USA              2652853  2021
1           CHN                USA              2221943  2021
2           BRA                USA               472637  2021
3           IDN                USA                94079  2021
  originCountry destinationCountry  immigrantPopulation  year
0           IND                USA             0.001904  2021
1           CHN                USA             0.001576  2021
2           BRA                USA             0.002209  2021
3           IDN                USA             0.000341  2021
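
As a quick sanity check on the printed output, dividing each original count by its per-capita value should roughly recover the population used as the denominator. The sketch below uses only numbers shown in the previews above; the variable names are illustrative.

```python
# Values copied from the previews above: (immigrant count, per-capita result).
rows = {
    "IND": (2652853, 0.001904),
    "CHN": (2221943, 0.001576),
    "BRA": (472637, 0.002209),
    "IDN": (94079, 0.000341),
}

# count / per_capita approximates the population used as the denominator.
# The printed per-capita values are rounded, so these are rough estimates only.
implied_population = {code: count / rate for code, (count, rate) in rows.items()}

for code, population in implied_population.items():
    print(f"{code}: about {population / 1e6:,.0f} million")
```

The estimates come out near 1.4 billion for India and China and in the low hundreds of millions for Brazil and Indonesia, which is consistent with dividing by the origin country's total population.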


### Example 2: Inflation Adjustment

This next example uses a dataframe of international tuition rates. Each row contains the country, a lower-end tuition estimate, an upper-end tuition estimate, the year, and the country whose currency the estimates are quoted in. The dataset is read from a CSV file into a pandas dataframe, and a preview is printed.

As in Example 1, some columns must be parsed before preprocessing, and the parsing step is driven by the required inputs of the function being used. For the adjust_for_inflation function, the standard three-letter currency code must be passed in, so we parse the currency country column into the standard code for that country's currency. For example, 'USA' becomes 'USD'. The result is stored in a new currency column.

Once this step is complete, we can run adjust_for_inflation. It takes the numeric currency column (tuitionLowerEstimate or tuitionUpperEstimate), the country column, the original year column, and the adjustment year column. The function then adjusts the tuition values for inflation from 2020 to 2021. The final data frame is printed last.
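
Conceptually, an inflation adjustment rescales a value by the ratio of a price index in the target year to the index in the original year. The sketch below illustrates only that idea. The price_index values are made-up placeholders for two fictional countries, and inflate_sketch is a hypothetical helper, not the package's adjust_for_inflation.

```python
import pandas as pd

# ASSUMPTION: placeholder index values keyed by (country, year), for illustration only.
price_index = {
    ("AAA", 2020): 100.0,
    ("AAA", 2021): 102.0,
    ("BBB", 2020): 100.0,
    ("BBB", 2021): 105.0,
}

def inflate_sketch(df, value_col, country_col, from_year_col, to_year_col, index):
    """Hypothetical helper: scale value_col by index[to_year] / index[from_year] per row."""
    factors = [
        index[(country, int(to_year))] / index[(country, int(from_year))]
        for country, from_year, to_year in zip(
            df[country_col], df[from_year_col], df[to_year_col]
        )
    ]
    result = df.copy()
    result[value_col] = result[value_col] * factors
    return result

toy = pd.DataFrame({
    "country": ["AAA", "BBB"],
    "tuitionLowerEstimate": [1000.0, 2000.0],
    "year": [2020, 2020],
    "adjustToYear": [2021, 2021],
})

adjusted = inflate_sketch(
    toy, "tuitionLowerEstimate", "country", "year", "adjustToYear", price_index
)
print(adjusted)
```

Because the factor is looked up per row, rows for different countries or year pairs can be adjusted in a single pass, which mirrors the column-based call signature used in the cell below.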

In [2]:
from international_data_preprocessing import country_to_primary_currency, adjust_for_inflation #import package

tuition_df = pd.read_csv('ExampleCSVTuition.csv') #read in data frame
print(tuition_df.head(10))

country_to_primary_currency(tuition_df, 'currencyCountry', output_col_name='currency', in_place=False) #parse currency country to currency code
print(tuition_df.head(10))

adjust_for_inflation(tuition_df, 'tuitionLowerEstimate', 'country', 'year', 'adjustToYear')
adjust_for_inflation(tuition_df, 'tuitionUpperEstimate', 'country', 'year', 'adjustToYear')
print(tuition_df.head(10))

  currencyCountry country  tuitionLowerEstimate  tuitionUpperEstimate  year  \
0             USA     BRA                     0                 15000  2020   
1             USA     CHN                  4700                 46000  2020   
2             USA     IND                   350                  5500  2020   
3             USA     USA                 20770                 46950  2020   

   adjustToYear  
0          2021  
1          2021  
2          2021  
3          2021  
  currencyCountry country  tuitionLowerEstimate  tuitionUpperEstimate  year  \
0             USA     BRA                     0                 15000  2020   
1             USA     CHN                  4700                 46000  2020   
2             USA     IND                   350                  5500  2020   
3             USA     USA                 20770                 46950  2020   

   adjustToYear currency  
0          2021      USD  
1          2021      USD  
2          2021      USD  
3          2021      USD  

### Example 3: Currency Exchange Adjustment

This final example uses the same data frame as Example 2. Now that the values have been adjusted for inflation, a natural next step is to express the tuition estimates in each country's local currency.

An additional parsing step is needed to convert each country's standard alpha-3 code into that country's standard currency code, so that the function knows which exchange rate to use. Once this is done, the convert_currency function can be executed. Here it takes the monetary value column (tuitionLowerEstimate or tuitionUpperEstimate), the current currency column (currency), the target currency column (localCurrency), and the year column. The function converts the tuition estimates from US dollars into the currency of the country associated with each row and writes the results to new columns. The final data frame is shown below.
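
The conversion itself amounts to multiplying each value by a row-specific exchange rate. The following is a minimal sketch of that idea under stated assumptions: the rates table contains placeholder values for two fictional currencies, and convert_sketch is a hypothetical helper, not the package's convert_currency.

```python
import pandas as pd

# ASSUMPTION: placeholder rates, in units of target currency per one unit of source currency.
rates = {
    ("USD", "AAA", 2021): 5.0,
    ("USD", "BBB", 2021): 80.0,
}

def convert_sketch(df, value_col, from_col, to_col, year_col, new_col, rate_table):
    """Hypothetical helper: write value * rate into new_col, leaving value_col untouched."""
    def lookup(src, dst, year):
        # Converting a currency to itself needs no rate.
        return 1.0 if src == dst else rate_table[(src, dst, int(year))]

    result = df.copy()
    result[new_col] = [
        value * lookup(src, dst, year)
        for value, src, dst, year in zip(
            df[value_col], df[from_col], df[to_col], df[year_col]
        )
    ]
    return result

toy = pd.DataFrame({
    "tuition": [100.0, 10.0, 50.0],
    "currency": ["USD", "USD", "USD"],
    "localCurrency": ["AAA", "BBB", "USD"],
    "adjustToYear": [2021, 2021, 2021],
})

converted = convert_sketch(
    toy, "tuition", "currency", "localCurrency", "adjustToYear", "localTuition", rates
)
print(converted)
```

Writing to a new column, as the cell below does with in_place=False and new_col_name, keeps the US dollar values alongside the converted ones.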

In [3]:
from international_data_preprocessing import convert_currency

country_to_primary_currency(tuition_df, 'country', output_col_name='localCurrency', in_place=False) #parse country to associated currency

convert_currency(tuition_df, 'tuitionLowerEstimate', 'currency', 'localCurrency', 'adjustToYear', in_place=False, new_col_name='localTuitionLowerEstimate') #adjust for exchange rate 
convert_currency(tuition_df, 'tuitionUpperEstimate', 'currency', 'localCurrency', 'adjustToYear', in_place=False, new_col_name='localTuitionUpperEstimate')
print(tuition_df.head(10))


  currencyCountry country  tuitionLowerEstimate  tuitionUpperEstimate  year  \
0             USA     BRA              0.000000          15432.098765  2020   
1             USA     CHN           4742.684157          46417.759839  2020   
2             USA     IND            368.809273           5795.574289  2020   
3             USA     USA          21172.273191          47859.327217  2020   

   adjustToYear currency localCurrency  localTuitionLowerEstimate  \
0          2021      USD           BRL                   0.000000   
1          2021      USD           CNY               29926.337033   
2          2021      USD           INR               28416.754478   
3          2021      USD           USD               21172.273191   

   localTuitionUpperEstimate  
0               78858.024691  
1              292896.064581  
2              446548.998946  
3               47859.327217  
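
One way to check the converted columns is to confirm that the lower and upper estimates in each row imply the same exchange rate. The sketch below uses only values copied from the final preview above; the variable names are illustrative.

```python
# Values copied from the final preview: (USD lower, USD upper, local lower, local upper).
rows = {
    "BRL": (0.000000, 15432.098765, 0.000000, 78858.024691),
    "CNY": (4742.684157, 46417.759839, 29926.337033, 292896.064581),
    "INR": (368.809273, 5795.574289, 28416.754478, 446548.998946),
    "USD": (21172.273191, 47859.327217, 21172.273191, 47859.327217),
}

implied_rates = {}
for currency, (usd_low, usd_high, local_low, local_high) in rows.items():
    rate_from_upper = local_high / usd_high
    # The lower estimate can be zero (as for BRL), so only compare when it is usable.
    if usd_low > 0:
        rate_from_lower = local_low / usd_low
        assert abs(rate_from_lower - rate_from_upper) < 1e-4
    implied_rates[currency] = round(rate_from_upper, 2)

print(implied_rates)
```

Both estimates in every row imply the same rate, roughly 5.11 BRL, 6.31 CNY, and 77.05 INR per US dollar, and the USA row is unchanged because its local currency is already USD.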
