# Capstone project - Analytics in agriculture

### This file contains the ETL process that takes our project from the raw data located in 'data/' to the curated data stored in our RDBMS. For this first version, the RDBMS is PostgreSQL-compatible: the load step connects to an Amazon Redshift cluster through psycopg2.

In [87]:
import psycopg2
import pandas as pd
import time
import configparser
import json

# 1. Extraction

### We do not start from the very first stage. Extraction really begins with downloading the data from the source database, but because the source follows a yearly refresh schedule, that download only needs to happen once a year, and we did it by hand before the steps described below (manual step: download files > uncompress files).
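
The manual uncompress step could also be scripted. The following is a minimal sketch, assuming the yearly download arrives as ZIP archives in a single folder; the function name `uncompress_archives` and the folder layout are hypothetical and not part of the project.

```python
import zipfile
from pathlib import Path


def uncompress_archives(src_dir, dest_dir):
    """Extract every .zip archive found in src_dir into dest_dir.

    Returns the sorted list of extracted member names.
    Assumption: the downloaded files are ZIP archives.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return sorted(extracted)
```

Called as `uncompress_archives("downloads", "data")`, this would leave the CSV files where the extraction cell expects them.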

### With that explained, we proceed with the extraction of the data. Our source provides the data as CSV files. Since the data is completely untouched, we need to select the files/tables that are useful for our project and rearrange the column structure: as we will see during the ETL, the given structure is optimized for storage, not for a more advanced data model.

In [88]:
crops_data = pd.read_csv("data/Production_Crops_E_All_Data.csv", encoding="ANSI")
trade_data = pd.read_csv("data/Trade_Crops_Livestock_E_All_Data.csv", encoding="ANSI")
crops_flags = pd.read_csv("data/Production_Crops_E_Flags.csv", encoding="ANSI")
trade_flags = pd.read_csv("data/Trade_Crops_Livestock_E_Flags.csv", encoding="ANSI")

with open("credentials/redshift.json", 'r') as j:
    redshift = json.loads(j.read())
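
Because the credentials are only used much later, in the load step, a missing key would surface after the long transformation has already run. A minimal sketch of an early check follows; the helper name `load_credentials` is hypothetical, and the key list is taken from the keys that the connection string in the load step reads.

```python
import json

# Keys that the connection string in the load step reads from the credentials file
REQUIRED_KEYS = ("endpoint", "database", "username", "password", "port")


def load_credentials(path):
    """Read a JSON credentials file and fail early if a key is missing."""
    with open(path, "r") as handle:
        credentials = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"Missing credential keys: {missing}")
    return credentials
```

With this helper, the cell above could read `redshift = load_credentials("credentials/redshift.json")`.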


# 2. Transformation

## Creation of dimension tables

In [89]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
dim_countries = pd.concat([crops_data[["Area Code", "Area"]], trade_data[["Area Code", "Area"]]]).drop_duplicates()
dim_items = crops_data[["Item Code", "Item"]].drop_duplicates()
dim_elements = pd.concat([crops_data[["Element Code", "Element"]], trade_data[["Element Code", "Element"]]]).drop_duplicates()
dim_flags = pd.concat([crops_flags, trade_flags]).drop_duplicates()
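
To make the dimension-building step concrete, here is a small sketch on toy data. The two sample frames and their values are invented for illustration. It also shows a sanity check that is not in the original notebook: dropping duplicate (code, name) pairs does not by itself guarantee that every code appears only once.

```python
import pandas as pd

# Toy stand-ins for crops_data and trade_data (invented values for illustration only)
crops_sample = pd.DataFrame({"Area Code": [1, 1, 2], "Area": ["A", "A", "B"]})
trade_sample = pd.DataFrame({"Area Code": [2, 3], "Area": ["B", "C"]})

# Stack the two sources, then keep one row per distinct (code, name) pair
dim_sample = (
    pd.concat([crops_sample, trade_sample], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

# A usable dimension key needs each code to appear exactly once
codes_unique = dim_sample["Area Code"].is_unique
print(dim_sample)
print("codes unique:", codes_unique)
```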

## Clean dataframes

### Trade data mixes crops with other products. To improve the performance of the next steps, we first remove the rows that are not crops.

### Dimensions contain a lot of duplicated data, so they are trimmed as well.

In [90]:
trade_data = trade_data[trade_data["Item Code"].isin(dim_items["Item Code"])]
#temp_items = trade_data["Item"].drop_duplicates()
#temp_items
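
The filter above can be checked on a small example. This is a sketch with invented item codes and values; the helper name `keep_crop_rows` is hypothetical.

```python
import pandas as pd


def keep_crop_rows(trade, items):
    """Return only the trade rows whose Item Code appears in the items dimension."""
    mask = trade["Item Code"].isin(items["Item Code"])
    return trade[mask].reset_index(drop=True)


# Invented values for illustration only
items_sample = pd.DataFrame({"Item Code": [15, 27], "Item": ["Wheat", "Rice"]})
trade_sample = pd.DataFrame({"Item Code": [15, 866, 27], "Y1961": [1.0, 2.0, 3.0]})

filtered = keep_crop_rows(trade_sample, items_sample)
print(len(trade_sample) - len(filtered), "non-crop rows removed")
```

Printing how many rows were dropped is a cheap way to notice when the crop dimension and the trade file stop sharing item codes.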

## Rearrange the dataframe structures and create the fact table

### In the source files every reported year adds new columns, so the data grows horizontally. Our SQL schema cannot keep growing in that direction, so to rearrange the tables we have divided the data into 2 groups: keys and values.
* keys: columns that are repeated on each iteration and serve as an identifier for the values
* values: columns reported yearly; each new year adds 2 more columns (a value and its flag) to the dataframe
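
The keys/values split is a wide-to-long reshape. As an alternative to looping over column pairs, pandas' `DataFrame.melt` can perform it directly. The sketch below is not the notebook's method; it runs on toy data and assumes that value columns are labelled like `Y1961` and flag columns like `Y1961F`.

```python
import pandas as pd

# Toy wide table: one value column and one flag column per year.
# The "Y1961" / "Y1961F" label pattern is an assumption for illustration.
wide = pd.DataFrame({
    "Area Code": [1, 2],
    "Item Code": [15, 15],
    "Y1961": [10.0, 20.0],
    "Y1961F": ["A", "E"],
    "Y1962": [11.0, 21.0],
    "Y1962F": ["A", "A"],
})
keys = ["Area Code", "Item Code"]
year_cols = [c for c in wide.columns if c not in keys and not c.endswith("F")]
flag_cols = [c + "F" for c in year_cols]

# Melt values and flags separately, then line them up on keys + year label
values = wide.melt(id_vars=keys, value_vars=year_cols, var_name="Year", value_name="Value")
flags = wide.melt(id_vars=keys, value_vars=flag_cols, var_name="Year", value_name="Flag")
flags["Year"] = flags["Year"].str[:-1]  # strip the trailing "F" so both frames share labels

long_df = values.merge(flags, on=keys + ["Year"])
long_df["Year"] = long_df["Year"].str[1:].astype(int)  # "Y1961" -> 1961
print(long_df)
```

Each input row becomes one output row per year, so the table grows downward when a new year is published and the schema stays fixed.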

In [91]:
start = time.time()

raw_crop_keys = crops_data[["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit"]]
raw_crop_values = crops_data.drop(labels = ["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit"], axis = 1)

raw_trade_keys = trade_data[["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit"]]
raw_trade_values = trade_data.drop(labels = ["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit"], axis = 1)

if(len(raw_crop_values.columns) % 2 == 1):
    print(raw_crop_values.columns)
    raise Exception("Unexpected column found, columns number must be even as they consist of pairs. Please check out the dataframe structure")

if(len(raw_trade_values.columns) % 2 == 1):
    print(raw_trade_values.columns)
    raise Exception("Unexpected column found, columns number must be even as they consist of pairs. Please check out the dataframe structure")

fact_crops_data = pd.DataFrame(columns = ["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit", "Year", "Value", "Flag"])

fact_trade_data = pd.DataFrame(columns = ["Area Code", "Area", "Item Code", "Item", "Element Code", "Element", "Unit", "Year", "Value", "Flag"])

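for A, B in zip(*[iter(raw_crop_values)]*2):
    # Join keys and one (value, flag) pair column-wise; A is the year label, e.g. "Y1961"
    temp_aux_crops = pd.concat([raw_crop_keys, raw_crop_values[[A, B]]], axis = 1).rename(columns = {A: "Value", B: "Flag"})
    temp_aux_crops["Year"] = int(A.lstrip("Y"))
    print("evaluated from crops_data: ", A)
    fact_crops_data = pd.concat([fact_crops_data, temp_aux_crops])

for A, B in zip(*[iter(raw_trade_values)]*2):
    # Same reshaping for the trade data
    temp_aux_trade = pd.concat([raw_trade_keys, raw_trade_values[[A, B]]], axis = 1).rename(columns = {A: "Value", B: "Flag"})
    temp_aux_trade["Year"] = int(A.lstrip("Y"))
    print("evaluated from trade_data: ", A)
    fact_trade_data = pd.concat([fact_trade_data, temp_aux_trade])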

end = time.time()

print("elapsed time: ", end - start)

evaluated from crops_data:  Y1961
evaluated from crops_data:  Y1962
evaluated from crops_data:  Y1963
evaluated from crops_data:  Y1964
evaluated from crops_data:  Y1965
evaluated from crops_data:  Y1966
evaluated from crops_data:  Y1967
evaluated from crops_data:  Y1968
evaluated from crops_data:  Y1969
evaluated from crops_data:  Y1970
evaluated from crops_data:  Y1971
evaluated from crops_data:  Y1972
evaluated from crops_data:  Y1973
evaluated from crops_data:  Y1974
evaluated from crops_data:  Y1975
evaluated from crops_data:  Y1976
evaluated from crops_data:  Y1977
evaluated from crops_data:  Y1978
evaluated from crops_data:  Y1979
evaluated from crops_data:  Y1980
evaluated from crops_data:  Y1981
evaluated from crops_data:  Y1982
evaluated from crops_data:  Y1983
evaluated from crops_data:  Y1984
evaluated from crops_data:  Y1985
evaluated from crops_data:  Y1986
evaluated from crops_data:  Y1987
evaluated from crops_data:  Y1988
evaluated from crops_data:  Y1989
evaluated from
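
The loops in the cell above walk the value columns two at a time with `zip(*[iter(...)]*2)`. The idiom is compact but easy to misread, so here is a small standalone sketch of what it does. The helper name `pairwise_chunks` is hypothetical, and the `Y1961F` flag label is an assumed pattern.

```python
def pairwise_chunks(labels):
    """Return consecutive, non-overlapping pairs: (l0, l1), (l2, l3), ..."""
    iterator = iter(labels)
    # zip pulls from the same iterator twice per step, so each item is used once
    return list(zip(iterator, iterator))


columns = ["Y1961", "Y1961F", "Y1962", "Y1962F"]  # assumed label pattern
print(pairwise_chunks(columns))
# [('Y1961', 'Y1961F'), ('Y1962', 'Y1962F')]

# With an odd number of labels, zip silently drops the last one
print(pairwise_chunks(["a", "b", "c"]))
# [('a', 'b')]
```

The silent drop in the odd case is why the cell checks that the number of value columns is even before entering the loops.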

# 3. Load

## Load the tables into our Redshift cluster

In [92]:
conn = psycopg2.connect(f"host={redshift['endpoint']} dbname={redshift['database']} user={redshift['username']} password={redshift['password']} port={redshift['port']}")

{'endpoint': 'redshift-cluster.crisbfgqf52j.eu-west-2.redshift.amazonaws.com', 'database': 'dev', 'username': 'awsuser', 'password': '********', 'port': 5439}


'"5439"'

In [None]:
redshift['port']
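
The notebook stops after opening the connection. As a rough sketch of how a small table could then be written, the helpers below build a parameterised `INSERT` and send the rows through any DB-API cursor. The function names and the `dim_items` table name are hypothetical, and the target table is assumed to exist already.

```python
def build_insert(table, columns):
    """Build a parameterised INSERT statement using DB-API '%s' placeholders.

    Table and column names are interpolated as-is, so they must come from
    trusted code, never from user input.
    """
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({quoted}) VALUES ({placeholders})"


def load_rows(cursor, table, columns, rows):
    """Send all rows through cursor.executemany and return the row count."""
    rows = [tuple(row) for row in rows]
    cursor.executemany(build_insert(table, columns), rows)
    return len(rows)
```

With psycopg2 this could be used as `with conn, conn.cursor() as cur: load_rows(cur, "dim_items", list(dim_items.columns), dim_items.itertuples(index=False, name=None))`. Row-by-row inserts are only reasonable for the small dimension tables; for the large fact tables, Redshift's `COPY` command from files staged in S3 is generally the faster route.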