# Notebook 1 - Initial PySpark Data Cleaning:
The first step in the process was narrowing the overall data down to a subset we could work with.  Following Labs18's lead, we used a PySpark SageMaker/EMR cluster combination to do so.  In simple terms, PySpark is the Python API for Apache Spark, which is used to analyze large datasets, and an EMR cluster is the group of linked machines that does the processing.  The initial dataset, `part-deb95738-54f1-4c84-84a0-3449af42f7c3.*.csv`, is too large to be loaded into a pandas or dask dataframe.

Our goal in this notebook was to limit the dataset to the columns we needed and to clean those columns for further analysis.  Setting up a PySpark SageMaker kernel backed by an EMR cluster is complicated for beginners, but AWS provides a step-by-step guide: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/

In [None]:
%%info
# tests the connection to the EMR cluster; throws an error if everything isn't set up correctly
# (the %%info cell magic has to be the first line of the cell)

In [2]:
# imports packages and starts spark instance
# (spark instance is started when the first cell of the notebook is run)
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_replace


### Initial Filters

Our analysis was focused on the text in the trade descriptions, so we removed unrelated information from the original dataset.  Initially we considered the following five columns of data for our analysis:
- `CONSIGNOR_NAME` - the name of the exporting company involved in the trade
- `DECLARATION_NUMBER` - the 'trade id' number of the trade
- `DIRECTION_TRANSLATED` - notation of whether the trade was an import or export
- `CONSIGNOR_INN` - the id number of the exporting company involved in the trade
- `DESCRIPTION_GOOD` - the description text of the trade, containing information about goods and quantities traded

In [3]:
# defines the columns we want to keep from the trade data
# for the initial KMeans analysis of description text, we keep only a few columns
# the Labs18 group had a larger trade_columns list; because our project focused on text vectorization
# and machine learning with NLP, we were mainly concerned with the DESCRIPTION_GOOD and CONSIGNOR_INN columns
trade_columns = [
    'CONSIGNOR_NAME',
    'DECLARATION_NUMBER',
    'DIRECTION_TRANSLATED',
    'CONSIGNOR_INN',
    'DESCRIPTION_GOOD',
]


In [4]:
# reads the Russian trade data from S3 (pipe-delimited, with a header row)
# selects only the columns listed in trade_columns
df_trade = (
    spark.read.options(header=True, inferSchema=True, delimiter='|')
    .csv('s3://russia-trade-data/2019_09_25_21_27_50/part-deb95738-54f1-4c84-84a0-3449af42f7c3.*.csv')
    .select(trade_columns)
)


### RegEx

Following Labs18's lead, we filtered out import trades (leaving only exports) and then used a RegEx to filter rows on the `CONSIGNOR_INN` column.

In [5]:
# keeps only the rows for trade exports, as the C4 ARMS project is focused on exporters (for now)
df_trade = df_trade.filter(df_trade['DIRECTION_TRANSLATED'] == 'EXPORT')


In [6]:
# defines the regex to apply to the trade data
# for the initial KMeans analysis of description text, we only use a regex filter for INN, as correct INNs are critical for our final analysis
# the regexINN filter needs improvement: invalid INN numbers were found in the output of the final product, and cleaning
# them would improve the final product/model
# (raw string, so that \d reaches the regex engine unchanged)
regexINN = r'(\d{8,12}|None|null|0|00)'


In [7]:
# applies the regex written by Jason (Labs18) to the trade data
# keeps only the rows whose CONSIGNOR_INN matches the regex; all other rows are dropped
df_trade_filtered = df_trade.filter(df_trade['CONSIGNOR_INN'].rlike(regexINN))

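One possible contributor to the invalid INNs mentioned above is that the pattern is unanchored: `rlike` keeps a row when the regex matches anywhere in the value, so any value containing a `0` passes the filter.  The sketch below illustrates this with Python's `re` module, using `re.search` as a stand-in for `rlike` and `re.fullmatch` as an anchored alternative.  The sample values are invented for illustration, and anchoring is a suggestion, not something the original notebook did.

```python
import re

# Same pattern as regexINN in the notebook, written as a raw string.
REGEX_INN = r'(\d{8,12}|None|null|0|00)'

# Hypothetical CONSIGNOR_INN values, invented for illustration only.
samples = ['7707083893', 'null', '12AB0', 'INN-1234567890123', 'abc']

# re.search succeeds if the pattern matches anywhere in the string,
# which mirrors how an unanchored pattern behaves under rlike.
unanchored = [s for s in samples if re.search(REGEX_INN, s)]

# re.fullmatch succeeds only if the whole value matches the pattern.
anchored = [s for s in samples if re.fullmatch(REGEX_INN, s)]

print(unanchored)  # includes '12AB0' and 'INN-1234567890123'
print(anchored)    # only '7707083893' and 'null'
```

In Spark, a comparable effect could presumably be achieved by wrapping the pattern in `^` and `$` before passing it to `rlike`.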

### Accounting for Errors
Before continuing with text vectorization, we had to remove some punctuation characters from the `DESCRIPTION_GOOD` column, as the punctuation was causing .csv import errors in subsequent notebooks.

In [8]:
# Here, Labs20 group removed commas, hyphens, semi-colons, colons, apostrophes, and quotation marks from the dataset's DESCRIPTION_GOOD column
# Without this step, the Python3 Kernel notebooks that analyze the data have trouble loading in the .csv file, as read_csv interprets the presence
# of some of these punctuation marks as column separators.
# a single character class removes all six characters in one pass
df = df_trade_filtered.withColumn("DESCRIPTION_GOOD", regexp_replace("DESCRIPTION_GOOD", "[-,;:'\"]", ""))

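To illustrate the failure mode described above, here is a minimal standard-library sketch.  It shows how an unquoted comma splits a description into extra fields, how stripping the punctuation avoids that, and, as an alternative the notebook did not use, how quoting the field would preserve the punctuation.  The description text is invented for illustration.

```python
import csv
import io
import re

# Hypothetical description text, invented for illustration only.
description = 'BOLTS, STEEL; 10-PACK: "M8"'

# A naive split on commas breaks the single description into two pieces.
naive_fields = description.split(',')

# Removing the six punctuation characters in one pass with a character class.
cleaned = re.sub('[-,;:\'"]', '', description)

# Alternative: keep the punctuation and let the csv module quote the field.
buffer = io.StringIO(newline='')
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(['123', description])
buffer.seek(0)
round_trip = next(csv.reader(buffer))

print(naive_fields)
print(cleaned)
print(round_trip)
```

Stripping punctuation is the simpler option when the punctuation carries no meaning for the later text vectorization; quoting is worth considering if the original text needs to be preserved.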

In [9]:
# drop unwanted columns.  The DIRECTION_TRANSLATED column is no longer valuable now that the dataframe contains exports only
df = df.drop('DIRECTION_TRANSLATED')


In [10]:
# saves dataframe to S3
df.write.save('s3://labs20-arms-bucket/data/df_trade_data_desc.csv', format='csv', header=True)

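Note that Spark writes its output as a directory of `part-*.csv` files rather than as a single file, so `df_trade_data_desc.csv` above is a directory name.  With `header=True`, each part file should carry its own header row.  The sketch below shows one way a later notebook might read such a directory once it has been copied locally.  The helper `read_spark_csv_dir`, the local paths, and the sample rows are all hypothetical and only stand in for the real output.

```python
import csv
import glob
import os
import tempfile


def read_spark_csv_dir(path):
    """Read every part-*.csv file in a Spark output directory into one list of dicts."""
    rows = []
    for part in sorted(glob.glob(os.path.join(path, 'part-*.csv'))):
        with open(part, newline='', encoding='utf-8') as handle:
            # DictReader consumes the header row of each part file
            rows.extend(csv.DictReader(handle))
    return rows


# Build a small local stand-in for the output directory (invented rows).
out_dir = os.path.join(tempfile.mkdtemp(), 'df_trade_data_desc.csv')
os.makedirs(out_dir)
for i, inn in enumerate(['7707083893', '500100732259']):
    part_path = os.path.join(out_dir, f'part-0000{i}.csv')
    with open(part_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['CONSIGNOR_INN', 'DESCRIPTION_GOOD'])
        writer.writerow([inn, 'SAMPLE GOODS'])

rows = read_spark_csv_dir(out_dir)
print(len(rows))
```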