# Supply Chain Shipment Pricing Prediction

## Background
This study analyzes supply chain shipment data and predicts shipment pricing. The models explored in the analysis are listed and described below:

| Algorithm                                | Definition                                                                                                       | Characteristic                                                                                    |
|------------------------------------------|------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| Gradient boosted trees model (GBT Model) | Each tree is trained to predict and then "correct" for the errors of the previously trained trees                | A set of shallow decision trees trained sequentially.                                             |
| Multiple linear regression (MLR)         | A statistical technique that estimates a single target variable from a linear combination of two or more predictor variables. | Predicts a dependent variable using multiple independent variables.                               |
| Deep neural network (DNN)                | An artificial neural network consisting of many hidden layers between an input and output layer.                 | This algorithm can model complex nonlinear relationships, and it contains multiple hidden layers. |
| XGBoost regression                       | Extreme gradient boosting, an improved algorithm based on the gradient boosting algorithm.                       | Efficient, flexible, and portable; includes regularization that helps reduce overfitting.         |
| LightGBM regression                      | Gradient boosting-based algorithm that includes two techniques: (1) gradient-based one-side sampling (GOSS) and (2) exclusive feature bundling (EFB). | An ensemble technique that grows trees leaf-wise.                                                 |

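The boosting idea in the table (each tree corrects the errors of the trees before it) can be illustrated with a minimal pure-Python sketch. It is not the model used later in this notebook: the helper names (`fit_stump`, `fit_boosted`, `predict_boosted`), the toy weights and costs, and the hyperparameters are all hypothetical, and each "tree" is reduced to a single-split stump on one feature.

```python
# Toy gradient boosting for squared error with depth-1 trees ("stumps").
# All names and data here are hypothetical and for illustration only.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    # Try every split point between distinct feature values.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps sequentially, each one on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the residuals are the errors left to correct.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosted(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data: shipment weight (kg) versus freight cost (USD).
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
costs = [10.0, 12.0, 15.0, 30.0, 33.0, 35.0]

model = fit_boosted(weights, costs)
baseline_mse = mse(costs, [model["base"]] * len(costs))
boosted_mse = mse(costs, [predict_boosted(model, w) for w in weights])
print(baseline_mse, boosted_mse)
```

On this toy data the boosted error falls below the mean-only baseline, because every stump is fit to whatever error the earlier stumps left behind. Library implementations such as XGBoost and LightGBM add deeper trees, regularization, and sampling on top of this basic loop.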
### Import Packages

In [None]:
import os
import numpy as np
import pandas as pd
import pandas_bokeh
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import math

In [None]:
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

## Data Collection and Preprocessing

### Data Collection
This dataset provides supply chain health commodity shipment and pricing data. Specifically, it identifies antiretroviral (ARV) and HIV lab shipments to supported countries, along with the commodity pricing and associated supply chain expenses necessary to move the commodities to countries for use. The dataset has similar fields to the Global Fund's Price, Quality and Reporting (PQR) data. PEPFAR and the Global Fund represent the two largest procurers of HIV health commodities. This dataset, when analyzed in conjunction with the PQR data, provides a more complete picture of global spending on specific health commodities.

The data are particularly valuable for understanding ranges and trends in pricing as well as volumes delivered by country. The US Government believes this data will help stakeholders make better, data-driven decisions. Care should be taken to consider contextual factors when using the database. Conclusions related to costs associated with moving specific line items or products to specific countries, and to lead times by product/country, will not be accurate.


In [None]:
# Load the dataset into a pandas DataFrame.
dataset_df = pd.read_csv("Supply_Chain_Shipment_Pricing_Data.csv")

dataset_df.tail()

### Data Preprocessing

In [None]:
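# Work on a copy so the raw data stays available for comparison.
dataset_df_reduced = dataset_df.copy()

# Convert weight to numeric; entries that cannot be parsed become NaN.
dataset_df_reduced['weight (kilograms)'] = pd.to_numeric(dataset_df_reduced['weight (kilograms)'], errors='coerce')
# Note: dropna() without `subset` removes rows with a missing value in any column.
dataset_df_reduced.dropna(inplace=True)

# Repeat the conversion and row removal for freight cost.
dataset_df_reduced['freight cost (usd)'] = pd.to_numeric(dataset_df_reduced['freight cost (usd)'], errors='coerce')
dataset_df_reduced.dropna(inplace=True)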

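The preprocessing step above relies on two pandas behaviors: `pd.to_numeric(..., errors="coerce")` turns unparseable entries into `NaN`, and `DataFrame.dropna()` with no arguments drops a row if any column is missing. The sketch below shows both on a small hypothetical frame. The `"not recorded"` placeholder and the `dosage` values are invented for illustration, and the `subset=` variant is an alternative to consider, not what the notebook cell does.

```python
import pandas as pd

# Hypothetical toy frame; the placeholder strings are invented for this example.
toy_df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "weight (kilograms)": ["13", "358", "not recorded", "171"],
    "freight cost (usd)": ["780.34", "not recorded", "1653.78", "4521.5"],
    "dosage": [None, "10mg", "10mg", "10mg"],
})

numeric_cols = ["weight (kilograms)", "freight cost (usd)"]
for col in numeric_cols:
    # Non-numeric strings become NaN instead of raising an error.
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

# Drops a row when ANY column is missing, including the unrelated `dosage`.
dropped_any = toy_df.dropna()

# Drops a row only when one of the converted numeric columns is missing.
dropped_subset = toy_df.dropna(subset=numeric_cols)

print(len(toy_df), len(dropped_any), len(dropped_subset))
```

If many rows disappear after preprocessing, comparing these two row counts on the real data is a quick way to check whether missing values in unrelated columns are responsible.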
In [None]:
# Explore country count data.
CountryCount_raw = dataset_df['country'].value_counts().nlargest(50)
CountryCount_reduced = dataset_df_reduced['country'].value_counts().nlargest(50)


In [None]:
dataset_df_reduced.plot_bokeh.bar(x="country", y="freight cost (usd)")

South Africa, Nigeria, and Côte d'Ivoire account for the largest shipment counts, each with more than 1,000 records.

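A statement like the one above can be checked directly from the per-country counts. The following is a minimal sketch on hypothetical toy data. The country rows and the threshold of 1 are placeholders; on the full dataset the threshold would be 1000.

```python
import pandas as pd

# Hypothetical toy data standing in for the shipment records.
toy_df = pd.DataFrame({
    "country": ["South Africa"] * 3 + ["Nigeria"] * 2 + ["Zambia"],
})

# value_counts() returns counts sorted from most to least frequent.
country_counts = toy_df["country"].value_counts()

threshold = 1  # placeholder; the text above refers to counts over 1000
frequent = country_counts[country_counts > threshold]
print(frequent)
```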
In [None]:
# Summarize the numerical columns of the raw data, before any rows are removed.
dataset_df.describe()

In [None]:
# Count records for the five most common manufacturing sites, before and after row removal.
ManLocs_raw = dataset_df.groupby('manufacturing site').size().nlargest(5)
ManLocs_reduced = dataset_df_reduced.groupby('manufacturing site').size().nlargest(5)