<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S05_Data_Preprocessing/S05_LectureEx_1_Notebook_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# S5 - Data Exploration, Preprocessing and SCM examples
Programming topics covered in this section:
* Data preprocessing

Examples include:
* Exploring Supply Chain health commodity shipment and pricing data

In [1]:
import pandas as pd

## 2. Importing data and creating a report
In this exercise, we will explore an adapted data set that provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. In addition, the data set provides the commodity pricing and the associated supply chain expenses necessary to move the commodities to countries for use. The original data are provided by the US Agency for International Development and can be accessed at [this page](https://catalog.data.gov/dataset/supply-chain-shipment-pricing-data).

This is a description of our adapted data in the file `SCMS_Delivery_History_Dataset.csv`.

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|id| identification number|
|project_code|identification of the project|
|country|country to which the items are shipped|
|vendor|identification of the vendor of the item|
|manufacturing_site|name of the manufacturer of the item|
|shipment|transportation mode (e.g., air, truck)|
|schedule_delivery_date|programmed date for delivery|
|delivered_to_client_date|real date of delivery|
|delivery_recorded_date|registered date of delivery|
|product_group|main category of the item|
|product_subgroup|subcategory of the item (e.g., HIV test, pediatric, Adult) |
|molecule_type|description of the composition of the item (e.g., Nevirapine, HIV 1/2, Didanosine)|
|brand| item brand (e.g., generic or any other commercial brand)|
|dosage| specifications about the dosage of each item (e.g.,10mg/ml, 200mg)|
|dosage_form|instructions for consumption (e.g., capsule, tablet, oral solution) |
|units_per_pack| number of units in each package|
|quantity_pack_sold| number of packages shipped to the specified country|
|value_sold| total value in $\$$ USD of the shipment (i.e., pack_price * quantity_pack_sold)|
|pack_price| price in $\$$ USD per package|
|unit_price| price in $\$$ USD per unit|
|weight_kg| total weight in kilograms of the shipment|
|freight_cost_usd| value in $\$$ USD paid for transportation|
|insurance_usd| value in $\$$ USD paid for insurance|



Let's import our data.

In [2]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S05_Data_Preprocessing/Supply_Chain_Shipment_Pricing_Data.csv'
df_SC = pd.read_csv(url, encoding='latin-1')  # reading data file into a DataFrame
df_SC.head()


---
## Preprocessing Data

We can use the `df.describe()` method to show descriptive statistics of the data.

In [None]:
df_SC.describe()

Let's take a look at the data types in our `DataFrame`. Notice that the columns `schedule_delivery_date`, `delivered_to_client_date`, and `delivery_recorded_date` are of type `object`, which means they hold strings or mixed types.

In [None]:
df_SC.dtypes

Let's convert the data in columns `schedule_delivery_date`, `delivered_to_client_date`, and `delivery_recorded_date` to the `datetime64` type, as presented below. We can obtain the same result by calling `astype('datetime64[ns]')` on these columns.

In [None]:
# here we replace the original columns with the newly formatted ones
df_SC['schedule_delivery_date'] = pd.to_datetime(df_SC['schedule_delivery_date']) 
df_SC['delivered_to_client_date'] = pd.to_datetime(df_SC['delivered_to_client_date'])
df_SC['delivery_recorded_date'] = pd.to_datetime(df_SC['delivery_recorded_date'])
df_SC.dtypes
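As a quick check of the equivalence mentioned above, the sketch below converts a small column of ISO-formatted date strings both ways. The toy DataFrame `df_demo` and its values are assumptions made for illustration, not rows read from the data set.

```python
import pandas as pd

# Hypothetical toy data (not read from the shipment file)
df_demo = pd.DataFrame({'delivery_date': ['2006-06-02', '2006-11-14', '2006-08-27']})

# Option A: pd.to_datetime parses the strings into datetime values
dates_a = pd.to_datetime(df_demo['delivery_date'])

# Option B: astype with an explicit unit
dates_b = df_demo['delivery_date'].astype('datetime64[ns]')

# Both options should yield the same timestamps
same_values = bool((dates_a == dates_b).all())
print(same_values)

# Once the column is a datetime, date parts are available through the .dt accessor
print(dates_a.dt.month.tolist())
```

`pd.to_datetime` is usually the more flexible choice, since it also accepts a `format` argument for non-ISO date strings.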

We can also see that the columns `weight_kg` and `freight_cost_usd` are of type `object`. These columns should be numeric, since they represent kilograms and USD. However, the raw data contain some text annotations made by the user, as you can see below by printing the first 10 rows of the DataFrame, which is why pandas reads these columns as type `object`.

In [None]:
df_SC.head(10)

We can then use the `pd.to_numeric()` function to convert the values in the `weight_kg` and `freight_cost_usd` columns to floats:

In [None]:
df_SC['weight_kg'] = pd.to_numeric(df_SC['weight_kg'], errors='coerce')
df_SC['freight_cost_usd'] = pd.to_numeric(df_SC['freight_cost_usd'], errors='coerce')

By setting `errors='coerce'`, you will transform the non-numeric values into `NaN`.
Now we can obtain some descriptive statistics for `weight_kg` and `freight_cost_usd` using the `describe()` method.

In [None]:
df_SC.describe()
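To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up Series. The annotation strings are hypothetical stand-ins for the kind of free-text notes found in the raw columns.

```python
import pandas as pd

# Hypothetical raw values: two numbers stored as text plus two free-text annotations
raw_weight = pd.Series(['13', '358.5', 'Weight Captured Separately', 'See other record'])

# Non-numeric entries become NaN instead of raising an error
weight = pd.to_numeric(raw_weight, errors='coerce')

# Count how many entries were turned into NaN by the conversion
n_coerced = int((weight.isna() & raw_weight.notna()).sum())

print(weight)
print(n_coerced)
```

Counting the coerced entries before moving on is a cheap way to know how much information the conversion discarded.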

---
## Missing data

Now let's take a look at the missing values in our DataFrame. We can see how many missing values we have in each column as follows.

In [None]:
df_SC.isna().sum()

We can notice that there are some missing values in the columns `shipment` and `dosage`, which represent the transportation mode (e.g., by air) and the dosage (e.g., 30mg) of each item sold, respectively. There is not much we can do to replace these missing values with meaningful information, so we'll replace the missing values in these columns with the word `'missing'`. We use the `.fillna()` method with the option `inplace=True` to save the changes in our DataFrame. Check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) for more information about the `.fillna()` method.

In [None]:
df_SC.fillna(value={'shipment': 'missing', 'dosage': 'missing'}, inplace=True)
df_SC.isna().sum()

Now we will replace the missing values in columns `weight_kg` and `freight_cost_usd` with 0, and those in `insurance_usd` with an approximate value, computed as the mean of this column.

In [None]:
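# inplace=True is needed here; otherwise the filled copy is discarded and df_SC is unchanged
df_SC.fillna(value={'weight_kg': 0, 'freight_cost_usd': 0, 'insurance_usd': df_SC.insurance_usd.mean()}, inplace=True)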
print(df_SC.isnull().sum())
df_SC.describe()

id                          0
project_code                0
country                     0
vendor                      0
manufacturing_site          0
shipment                    0
schedule_delivery_date      0
delivered_to_client_date    0
delivery_recorded_date      0
product_group               0
product_subgroup            0
molecule_type               0
brand                       0
dosage                      0
dosage_form                 0
units_per_pack              0
quantity_pack_sold          0
value_sold                  0
pack_price                  0
unit_price                  0
weight_kg                   0
freight_cost_usd            0
insurance_usd               0
dtype: int64


| | id | units_per_pack | quantity_pack_sold | value_sold | pack_price | unit_price | weight_kg | freight_cost_usd | insurance_usd |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| count | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 |
| mean | 51098.968229 | 77.990895 | 18332.53487 | 157650.6 | 21.910241 | 0.611701 | 2113.574196 | 6665.812612 | 240.117626 |
| std | 31944.332496 | 76.579764 | 40035.302961 | 345292.1 | 45.609223 | 3.275808 | 10756.353428 | 13404.868186 | 493.188408 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12795.75 | 30.0 | 408.0 | 4314.593 | 4.12 | 0.08 | 0.0 | 0.0 | 7.03 |
| 50% | 57540.5 | 60.0 | 3000.0 | 30471.47 | 9.3 | 0.16 | 122.0 | 1422.09 | 52.94 |
| 75% | 83648.25 | 90.0 | 17039.75 | 166447.1 | 23.5925 | 0.47 | 1596.5 | 7707.64 | 241.75 |
| max | 86823.0 | 1000.0 | 619999.0 | 5951990.0 | 1345.64 | 238.65 | 857354.0 | 289653.2 | 7708.44 |
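The per-column fill used above can be sketched on a toy DataFrame. The example also illustrates that `.fillna()` returns a new object unless the result is assigned or `inplace=True` is passed. The frame `df_demo` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df_demo = pd.DataFrame({'weight_kg': [10.0, np.nan, 30.0],
                        'insurance_usd': [2.0, 4.0, np.nan]})

# One fill value per column: a constant for weight, the column mean for insurance
fill_values = {'weight_kg': 0, 'insurance_usd': df_demo['insurance_usd'].mean()}

# Calling fillna without keeping the result leaves df_demo untouched
df_demo.fillna(value=fill_values)
missing_before = int(df_demo.isna().sum().sum())

# Keeping the returned DataFrame gives the filled version
df_filled = df_demo.fillna(value=fill_values)
missing_after = int(df_filled.isna().sum().sum())

print(missing_before, missing_after)
print(df_filled)
```

Whether 0 or the mean is a sensible replacement depends on the column; filling a weight with 0 lowers the column mean, so it is worth checking `describe()` before and after.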


---
## Data Transformation

### Scaling methods
Variables tend to have different ranges, and some algorithms are adversely affected by these differences: variables with greater ranges tend to have a larger influence on a model's results. Therefore, numeric field values may need to be standardized or normalized.

From the output of the `describe()` method in the previous cell, we can notice that the numerical variables have different ranges. For instance, `units_per_pack` varies from 1 to 1000, while `weight_kg` varies from 0 to 857354. We would like to apply a normalization method to scale the numerical values in our data.

Let's apply the **Min-max normalization** method, by identifying how much greater the field value is than the minimum value, and scaling this difference by the range of field values.

$$X^*=\frac{X-\min(X)}{\max(X)-\min(X)}$$

Thus, we compute the normalized version of each numerical variable and add it as a new column of our data frame. We can proceed as follows.

First, we create a list of the columns we want to normalize. 

In [None]:
columns_to_norm = ['units_per_pack', 'quantity_pack_sold', 'value_sold', 'pack_price', 'unit_price', 'weight_kg',
      'freight_cost_usd', 'insurance_usd']

Then, we can create a `for` loop to compute the normalized version of each of these columns and add it to `df_SC`.

In [None]:
for col in columns_to_norm:
    col_norm = col + '_norm'   # create a new name for the column. For example, 'units_per_pack_norm'
    df_SC[col_norm] = (df_SC[col] - df_SC[col].min())/(df_SC[col].max() - df_SC[col].min())   # add the new normalized col
df_SC.describe()

| | id | units_per_pack | quantity_pack_sold | value_sold | pack_price | unit_price | weight_kg | freight_cost_usd | insurance_usd | units_per_pack_norm | quantity_pack_sold_norm | value_sold_norm | pack_price_norm | unit_price_norm | weight_kg_norm | freight_cost_usd_norm | insurance_usd_norm |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| count | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 |
| mean | 51098.968229 | 77.990895 | 18332.53487 | 157650.6 | 21.910241 | 0.611701 | 2113.574196 | 6665.812612 | 240.117626 | 0.077068 | 0.029567 | 0.026487 | 0.016282 | 0.002563 | 0.002465 | 0.023013 | 0.03115 |
| std | 31944.332496 | 76.579764 | 40035.302961 | 345292.1 | 45.609223 | 3.275808 | 10756.353428 | 13404.868186 | 493.188408 | 0.076656 | 0.064573 | 0.058013 | 0.033894 | 0.013726 | 0.012546 | 0.046279 | 0.06398 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12795.75 | 30.0 | 408.0 | 4314.593 | 4.12 | 0.08 | 0.0 | 0.0 | 7.03 | 0.029029 | 0.000656 | 0.000725 | 0.003062 | 0.000335 | 0.0 | 0.0 | 0.000912 |
| 50% | 57540.5 | 60.0 | 3000.0 | 30471.47 | 9.3 | 0.16 | 122.0 | 1422.09 | 52.94 | 0.059059 | 0.004837 | 0.00512 | 0.006911 | 0.00067 | 0.000142 | 0.00491 | 0.006868 |
| 75% | 83648.25 | 90.0 | 17039.75 | 166447.1 | 23.5925 | 0.47 | 1596.5 | 7707.64 | 241.75 | 0.089089 | 0.027482 | 0.027965 | 0.017533 | 0.001969 | 0.001862 | 0.02661 | 0.031362 |
| max | 86823.0 | 1000.0 | 619999.0 | 5951990.0 | 1345.64 | 238.65 | 857354.0 | 289653.2 | 7708.44 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
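The same min-max formula can also be wrapped in a small helper and applied to several columns at once. This is a sketch under stated assumptions: the function name `min_max_normalize`, the toy frame `df_demo`, and the choice to map a constant column to 0.0 (to avoid dividing by zero) are all illustrative and not part of the original notebook.

```python
import pandas as pd

def min_max_normalize(series):
    """Scale a numeric Series to [0, 1] using (x - min) / (max - min)."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # Assumption: a constant column carries no spread, so map it to 0.0
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

# Hypothetical toy data; 'constant_col' exercises the zero-range guard
df_demo = pd.DataFrame({'units_per_pack': [1, 30, 60, 1000],
                        'constant_col': [5, 5, 5, 5]})

# apply() calls the helper once per column and reassembles a DataFrame
df_norm = df_demo.apply(min_max_normalize)
print(df_norm)
```

For `units_per_pack`, the value 30 maps to (30 - 1) / (1000 - 1), which is about 0.029 and agrees with the 25% row of `units_per_pack_norm` in the table above.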


### Dummy Variables
A categorical variable should generally be encoded as **dummy variables** (a.k.a. indicator variables), each taking only one of two values (0 or 1; False or True).
When a categorical variable takes k possible values, you typically have two options to define your dummy variables:
* Option 1: Define k-1 dummy variables, and use the unassigned category as the reference category
* Option 2: Define k dummy variables. Often referred to as **one-hot** encoding.

Let's transform our categorical variable `shipment` into dummy variables using `Option 1`. First, let's take a look at the possible values of this categorical variable.

In [None]:
df_SC['shipment'].unique()

array(['Air', 'missing', 'Truck', 'Air Charter', 'Ocean'], dtype=object)

We will create 4 dummy variables with names `'Air'`, `'Truck'`, `'Air Charter'` and `'Ocean'`, and use `'missing'` as our reference category. One way to do this is with the pandas function `pd.get_dummies()`, which automatically converts a categorical variable into dummy/indicator variables. You can check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for more information.

In [None]:
# dtype=int gives 0/1 columns (recent pandas versions return True/False by default)
df_dummies = pd.get_dummies(df_SC['shipment'], dtype=int)
df_dummies.head()

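| | Air | Air Charter | Ocean | Truck | missing |
|:----|:----|:----|:----|:----|:----|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |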


Then, we can either merge or concatenate the new `DataFrame` we just created (`df_dummies`) with our original `DataFrame`. We can also get rid of the original column `shipment`, as we will use its corresponding indicator variables instead. We can do this using the function `pd.concat()` and the method `DataFrame.drop()`.

In [None]:
# concatenating the original df_SC with df_dummies without the column 'missing'
df_SC = pd.concat([df_SC, df_dummies.drop('missing', axis=1)], axis=1)

# dropping the column 'shipment' and saving the changes in the original DF
df_SC.drop('shipment', axis=1, inplace=True)
df_SC.describe()

| | id | units_per_pack | quantity_pack_sold | value_sold | pack_price | unit_price | weight_kg | freight_cost_usd | insurance_usd | units_per_pack_norm | ... | value_sold_norm | pack_price_norm | unit_price_norm | weight_kg_norm | freight_cost_usd_norm | insurance_usd_norm | Air | Air Charter | Ocean | Truck |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| count | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | ... | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 | 10324.0 |
| mean | 51098.968229 | 77.990895 | 18332.53487 | 157650.6 | 21.910241 | 0.611701 | 2113.574196 | 6665.812612 | 240.117626 | 0.077068 | ... | 0.026487 | 0.016282 | 0.002563 | 0.002465 | 0.023013 | 0.03115 | 0.592115 | 0.06296 | 0.035936 | 0.274119 |
| std | 31944.332496 | 76.579764 | 40035.302961 | 345292.1 | 45.609223 | 3.275808 | 10756.353428 | 13404.868186 | 493.188408 | 0.076656 | ... | 0.058013 | 0.033894 | 0.013726 | 0.012546 | 0.046279 | 0.06398 | 0.491465 | 0.242903 | 0.186139 | 0.446091 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12795.75 | 30.0 | 408.0 | 4314.593 | 4.12 | 0.08 | 0.0 | 0.0 | 7.03 | 0.029029 | ... | 0.000725 | 0.003062 | 0.000335 | 0.0 | 0.0 | 0.000912 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 57540.5 | 60.0 | 3000.0 | 30471.47 | 9.3 | 0.16 | 122.0 | 1422.09 | 52.94 | 0.059059 | ... | 0.00512 | 0.006911 | 0.00067 | 0.000142 | 0.00491 | 0.006868 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 83648.25 | 90.0 | 17039.75 | 166447.1 | 23.5925 | 0.47 | 1596.5 | 7707.64 | 241.75 | 0.089089 | ... | 0.027965 | 0.017533 | 0.001969 | 0.001862 | 0.02661 | 0.031362 | 1.0 | 0.0 | 0.0 | 1.0 |
| max | 86823.0 | 1000.0 | 619999.0 | 5951990.0 | 1345.64 | 238.65 | 857354.0 | 289653.2 | 7708.44 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
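To make the difference between the two encoding options concrete, the sketch below builds both on a small made-up Series. The Series `shipment` and the variable names are assumptions for illustration; `dtype=int` is passed so that the dummies are 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column with the five categories seen above
shipment = pd.Series(['Air', 'missing', 'Truck', 'Air Charter', 'Ocean', 'Air'],
                     name='shipment')

# Option 2 (one-hot): k columns, exactly one 1 per row
one_hot = pd.get_dummies(shipment, dtype=int)

# Option 1: k-1 columns, with 'missing' chosen explicitly as the reference category
k_minus_1 = one_hot.drop(columns='missing')

# drop_first=True also yields k-1 columns, but always drops the first category
# in sorted order ('Air' here), so the reference category is not ours to choose
auto_k_minus_1 = pd.get_dummies(shipment, drop_first=True, dtype=int)

print(one_hot.shape[1], k_minus_1.shape[1])
print(k_minus_1.iloc[1].tolist())   # the 'missing' row is all zeros
print(list(auto_k_minus_1.columns))
```

Dropping a named column, as done in this notebook, keeps control over which category acts as the reference.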


### Transforming numerical variables into categorical variables
In some cases, categorical variables may be preferred over numerical ones. We then need to partition the numerical variables into bins according to specific criteria.
As an example, let's transform our original variable `'weight_kg'` into a categorical variable with values `'light'` (if the weight is up to 100 kg), `'medium'` (if the weight is within the interval (100 kg, 500 kg]), `'heavy'` (if the weight is within the interval (500 kg, 1000 kg]) and `'super-heavy'` (if the weight is > 1000 kg).

We can implement this transformation using the function `pd.cut()`, which helps us to segment and sort data values into bins. You can check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for more information on this function.


In [None]:
bins = [0, 100., 500., 1000.,  float('inf')]             # defining the bins 
names = ['light', 'medium', 'heavy', 'super-heavy']      # defining the names for the categories
df_SC['weight_category'] = pd.cut(df_SC['weight_kg'], bins, labels=names, include_lowest=True)  # adding the new cat. var. to our DF
df_SC.head()

| | id | project_code | country | vendor | manufacturing_site | schedule_delivery_date | delivered_to_client_date | delivery_recorded_date | product_group | product_subgroup | ... | pack_price_norm | unit_price_norm | weight_kg_norm | freight_cost_usd_norm | insurance_usd_norm | Air | Air Charter | Ocean | Truck | weight_category |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| 0 | 1 | 100-CI-T01 | Cote d Ivoire | EXW | Ranbaxy Fine Chemicals LTD | 2006-06-02 | 2006-06-02 | 2006-06-02 | HRDT | HIV test | ... | 0.021551 | 0.004065 | 1.5e-05 | 0.002694 | 0.03115 | 1 | 0 | 0 | 0 | light |
| 1 | 3 | 108-VN-T01 | Vietnam | EXW | Aurobindo Unit III, India | 2006-11-14 | 2006-11-14 | 2006-11-14 | ARV | Pediatric | ... | 0.004607 | 0.000126 | 0.000418 | 0.01561 | 0.03115 | 1 | 0 | 0 | 0 | medium |
| 2 | 4 | 100-CI-T01 | Cote d Ivoire | FCA | ABBVIE GmbH & Co.KG Wiesbaden | 2006-08-27 | 2006-08-27 | 2006-08-27 | HRDT | HIV test | ... | 0.059451 | 0.003352 | 0.000199 | 0.00571 | 0.03115 | 1 | 0 | 0 | 0 | medium |
| 3 | 15 | 108-VN-T01 | Vietnam | EXW | Ranbaxy, Paonta Shahib, India | 2006-09-01 | 2006-09-01 | 2006-09-01 | ARV | Adult | ... | 0.002965 | 0.000293 | 0.002164 | 0.055263 | 0.03115 | 1 | 0 | 0 | 0 | super-heavy |
| 4 | 16 | 108-VN-T01 | Vietnam | EXW | Aurobindo Unit III, India | 2006-08-11 | 2006-08-11 | 2006-08-11 | ARV | Adult | ... | 0.002378 | 0.00021 | 0.008853 | 0.156912 | 0.03115 | 1 | 0 | 0 | 0 | super-heavy |
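To check how `pd.cut()` treats values that sit exactly on a bin edge, the sketch below reuses the same bins and labels on a handful of made-up weights. The Series `weights` is an assumption chosen to hit each boundary.

```python
import pandas as pd

# Hypothetical weights, including values exactly on the bin edges
weights = pd.Series([0, 100, 100.5, 500, 750, 1000, 2500])

bins = [0, 100., 500., 1000., float('inf')]
names = ['light', 'medium', 'heavy', 'super-heavy']

# Intervals are closed on the right by default: (0, 100], (100, 500], ...
# include_lowest=True additionally places the value 0 in the first bin
categories = pd.cut(weights, bins, labels=names, include_lowest=True)

print(categories.tolist())
# → ['light', 'light', 'medium', 'medium', 'heavy', 'heavy', 'super-heavy']
```

Values outside the bins, such as a negative weight, would be assigned `NaN`, so it is worth checking for missing values in the new column after binning.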

