# Field work - feature engineering

In this field work, you will create new features based on the primary layer of your data asset.

#### Instructions
* We ask you to create **3 features** from various tables
* Once a feature is calculated, choose a suitable visualization and/or descriptive analysis and note down any insights you gain about the feature
* For each feature, we provide hints on which tables and fields are required
* The introductory section of the code accesses all required tables from the data asset
* Don't forget to import any additional packages you might require (e.g., matplotlib, pandas, pyspark, ...)
* Within each exercise, use multiple cells to keep your code modular

## Accessing tables

In [None]:
import os
print(os.environ)

In [None]:
import os

os.environ["JAVA_HOME"] ="/usr/lib/jvm/java-8-openjdk-amd64/"
os.environ["SPARK_VERSION"] ="2.4.0"
os.environ["SPARK_FOLDER"] ="/opt/spark"
os.environ["SPARK_HOME"] ="/opt/spark/spark-2.4.0-bin-without-hadoop"
os.environ["PYSPARK_DRIVER_PYTHON"] ="ipython3"
os.environ["PYSPARK_PYTHON"] ="/usr/bin/python3"
os.environ["SPARK_DIST_CLASSPATH"] ="/opt/spark/hadoop-3.1.1/etc/hadoop:/opt/spark/hadoop-3.1.1/share/hadoop/common/lib/*:/opt/spark/hadoop-3.1.1/share/hadoop/common/*:/opt/spark/hadoop-3.1.1/share/hadoop/hdfs:/opt/spark/hadoop-3.1.1/share/hadoop/hdfs/lib/*:/opt/spark/hadoop-3.1.1/share/hadoop/hdfs/*:/opt/spark/hadoop-3.1.1/share/hadoop/mapreduce/lib/*:/opt/spark/hadoop-3.1.1/share/hadoop/mapreduce/*:/opt/spark/hadoop-3.1.1/share/hadoop/yarn:/opt/spark/hadoop-3.1.1/share/hadoop/yarn/lib/*:/opt/spark/hadoop-3.1.1/share/hadoop/yarn/*:/opt/spark/hadoop-3.1.1/share/hadoop/tools/lib/*"
os.environ["PATH"] = os.environ["SPARK_HOME"]+"/bin:"+os.environ["PATH"]

print(os.environ)

In [None]:
# Load the Kedro context of the data asset
import logging
from kedro.context import load_context
# TASK: Update the path to where you cloned the repo from Bitbucket
context = load_context("/home/akhil_kulkarni/unicorn/supply_chain_data_asset/")
catalog = context.catalog

In [None]:
# List all tables included in the data catalogue
catalog.list()

In [None]:
# Load all required tables through kedro, using the data catalog
pri_product = context.catalog.load("pri_product")
pri_product_hierarchy = context.catalog.load("pri_product_hierarchy")
pri_product_location_behavior = context.catalog.load("pri_product_location_behavior")
pri_location = context.catalog.load("pri_location")
pri_duns_master = context.catalog.load("pri_duns_master")
pri_usp_midas = context.catalog.load("pri_usp_midas")

pri_product.show()

## 1. For each ingredient, calculate the number of countries where it is manufactured
### Create feature
Hints:

Join table *pri_product* with *pri_product_location_behavior* and filter for the behaviors that count as manufacturing ('API MANUF' and 'MANUF'). Then join with table *pri_location* on the location field to add the country information. Aggregate by ingredient and count the number of distinct countries.
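
The join, filter and aggregate steps in the hint can be sketched on toy data. This is a minimal pandas sketch, not the data asset's actual schema: the column names (`product_id`, `location_id`, `behavior`, `country`) and all rows are assumptions made for illustration. The real tables are loaded as Spark DataFrames, where the same steps would map to `join`, `filter`, `groupBy` and `countDistinct`.

```python
import pandas as pd

# Toy stand-ins for the primary tables; all column names are hypothetical.
product = pd.DataFrame({"product_id": ["I1", "I2", "I3"]})
behavior = pd.DataFrame({
    "product_id": ["I1", "I1", "I1", "I2", "I2", "I3"],
    "location_id": ["L1", "L2", "L3", "L1", "L4", "L2"],
    "behavior": ["MANUF", "API MANUF", "DISTRIBUTION",
                 "MANUF", "API MANUF", "DISTRIBUTION"],
})
location = pd.DataFrame({
    "location_id": ["L1", "L2", "L3", "L4"],
    "country": ["China", "India", "Italy", "China"],
})

# Keep only the behaviors that count as manufacturing.
manufacturing = behavior[behavior["behavior"].isin(["API MANUF", "MANUF"])]

# Left joins keep ingredients without any manufacturing site (count 0).
countries_per_ingredient = (
    product.merge(manufacturing, on="product_id", how="left")
    .merge(location, on="location_id", how="left")
    .groupby("product_id")["country"]
    .nunique()  # distinct countries; missing values are ignored
    .rename("n_manufacturing_countries")
    .reset_index()
)
print(countries_per_ingredient)
```

Using left joins is a design choice: an inner join would silently drop ingredients with no known manufacturing site, which may or may not be what you want for the feature.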

### Explore feature
Explore the new feature with an appropriate visualization and/or descriptive analysis.

Note down any insights gained from the analyses below (e.g., findings, potential alternative representations for the feature, additional cleaning improvements, etc.)
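
As one possible starting point for the descriptive analysis, the sketch below summarizes a count feature with pandas. The values are made up for illustration and the series name `n_manufacturing_countries` is hypothetical; with a Spark DataFrame you would first convert the (small) feature table, for example with `toPandas()`.

```python
import pandas as pd

# Hypothetical feature values: manufacturing countries per ingredient.
feature = pd.Series([1, 1, 2, 1, 3, 0, 1, 5], name="n_manufacturing_countries")

# Summary statistics (count, mean, quartiles, min, max).
summary = feature.describe()

# How many ingredients fall on each distinct count.
distribution = feature.value_counts().sort_index()

# Share of ingredients sourced from exactly one country.
share_single_source = (feature == 1).mean()

print(summary)
print(distribution)
print(f"Share of single-country ingredients: {share_single_source:.0%}")
```

A frequency table like `distribution` also feeds directly into a bar chart if you prefer a visual summary.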

## 2. Repeat #1, but for a finished product's ingredients
### Create feature
Hints:

For each finished product, get the list of its active ingredients by joining *pri_product* with *pri_product_hierarchy*.
For all these ingredients, find the number of distinct countries that manufacture them.

For example, suppose product P1 has ingredients I1, I2 and I3. If I1 is made in China and India, I2 in China and Italy, and I3 in Italy and India, then P1 should have a count of 3 (China, India and Italy) for the countries manufacturing its ingredients.
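
The worked example can be checked with plain Python sets. The two dictionaries below are hypothetical stand-ins for the joined tables, and the helper `count_ingredient_countries` is introduced only for illustration. The point of the sketch is that countries must be de-duplicated at the product level: summing the per-ingredient counts (2 + 2 + 2 = 6) would overstate the result.

```python
# Hypothetical mappings mirroring the worked example.
product_ingredients = {"P1": ["I1", "I2", "I3"]}
ingredient_countries = {
    "I1": {"China", "India"},
    "I2": {"China", "Italy"},
    "I3": {"Italy", "India"},
}


def count_ingredient_countries(product_id):
    """Count distinct countries across all ingredients of one product."""
    countries = set()
    for ingredient in product_ingredients[product_id]:
        # Set union removes countries shared between ingredients.
        countries |= ingredient_countries.get(ingredient, set())
    return len(countries)


print(count_ingredient_countries("P1"))  # 3
```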

### Explore feature
Explore the new feature with an appropriate visualization and/or descriptive analysis.

Note down any insights gained from the analyses below (e.g., findings, potential alternative representations for the feature, additional cleaning improvements, etc.)

## 3. For each ingredient, calculate the percentage of manufacturing sites that have had an FDA warning letter
### Create feature
Hints:

Get the list of sites manufacturing an ingredient by joining *pri_product* with *pri_product_location_behavior*. Then check how many of those locations have the "HAS_FDA_WARNING" behavior in the behavior table.

For example, if ingredient I1 is manufactured in 10 locations and 2 of those locations have had an FDA warning letter, then 20% of the sites manufacturing I1 have had FDA warning letters.
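
The 20% example can be reproduced on toy data. This is a minimal pandas sketch under stated assumptions: the column names are hypothetical, the manufacturing filter reuses the 'API MANUF' and 'MANUF' behaviors from feature 1, and the warning flag is assumed to be recorded per location rather than per product. Check the real behavior table before relying on either assumption.

```python
import pandas as pd

# Toy behavior table; column names and rows are hypothetical.
manuf_rows = pd.DataFrame({
    "product_id": ["I1"] * 10 + ["I2"] * 2,
    "location_id": [f"L{i}" for i in range(1, 11)] + ["L1", "L11"],
    "behavior": ["MANUF"] * 12,
})
warning_rows = pd.DataFrame({
    "product_id": [None, None],
    "location_id": ["L3", "L7"],
    "behavior": ["HAS_FDA_WARNING"] * 2,
})
behavior = pd.concat([manuf_rows, warning_rows], ignore_index=True)

# Locations that carry the warning behavior (assumed location-level).
warned = set(
    behavior.loc[behavior["behavior"] == "HAS_FDA_WARNING", "location_id"]
)

# One row per (ingredient, manufacturing site), flagged if the site is warned.
sites = (
    behavior[behavior["behavior"].isin(["API MANUF", "MANUF"])]
    [["product_id", "location_id"]]
    .drop_duplicates()
    .assign(has_warning=lambda d: d["location_id"].isin(warned))
)

# The mean of a boolean column is the share of True values.
pct_warned = (
    sites.groupby("product_id")["has_warning"].mean() * 100
).rename("pct_sites_with_fda_warning")
print(pct_warned)
```

Dropping duplicate (ingredient, site) pairs before averaging keeps a site that appears under both manufacturing behaviors from being counted twice in the denominator.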

### Explore feature
Explore the new feature with an appropriate visualization and/or descriptive analysis.

Note down any insights gained from the analyses below (e.g., findings, potential alternative representations for the feature, additional cleaning improvements, etc.)