<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S05_Data_Preprocessing/S05_LectureEx_1_Notebook_Processing_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# S5 - Data Exploration, Preprocessing and SCM examples
Programming topics covered in this section:
* Data preprocessing

Examples include:
* Exploring Supply Chain health commodity shipment and pricing data

In [1]:
import pandas as pd

## 2. Importing data and creating a report
In this exercise, we will explore an adapted data set that provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. It also provides the commodity pricing and the associated supply chain expenses necessary to move the commodities to the countries for use. The original data are provided by the US Agency for International Development and can be accessed at [this page](https://catalog.data.gov/dataset/supply-chain-shipment-pricing-data).

This is a description of our adapted data in the file `SCMS_Delivery_History_Dataset.csv`.

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|id| identification number|
|project_code|identification of the project|
|country|country to which the items are shipped|
|vendor|identification of the vendor of the item|
|manufacturing_site|name of the manufacturer of the item|
|shipment|transportation mode (e.g., air, truck)|
|schedule_delivery_date|programmed date for delivery|
|delivered_to_client_date|real date of delivery|
|delivery_recorded_date|registered date of delivery|
|product_group|main category of the item|
|product_subgroup|subcategory of the item (e.g., HIV test, pediatric, Adult) |
|molecule_type|description of the composition of the item (e.g., Nevirapine, HIV 1/2, Didanosine)|
|brand| item brand (e.g., generic or any other commercial brand)|
|dosage| specifications about the dosage of each item (e.g.,10mg/ml, 200mg)|
|dosage_form|instructions for consumption (e.g., capsule, tablet, oral solution) |
|units_per_pack| number of units in each package|
|quantity_pack_sold| number of packages shipped to the specified country|
|value_sold| total value in $\$$ USD of the shipment (i.e., pack_price * quantity_pack_sold)|
|pack_price| price in $\$$ USD per package|
|unit_price| price in $\$$ USD per unit|
|weight_kg| total weight in kilograms of the shipment|
|freight_cost_usd| value in $\$$ USD paid for transportation|
|insurance_usd|value in $\$$ USD paid for insurance|



Let's import our data.

In [70]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S05_Data_Preprocessing/Supply_Chain_Shipment_Pricing_Data.csv'
df_SC = pd.read_csv(url)  # read the data file into a DataFrame
print(df_SC.columns)

# keep the original column names as an array before renaming
# (assumption: the original cell defined column_names this way)
column_names = df_SC.columns.values
print(column_names)

# replace spaces with "_" and strip "(", ")", "#" and "/" so the column names are easier to type;
# regex=False treats each pattern as a literal character
df_SC.columns = df_SC.columns.str.replace(' ', '_', regex=False)
df_SC.columns = df_SC.columns.str.replace('(', '', regex=False)
df_SC.columns = df_SC.columns.str.replace(')', '', regex=False)
df_SC.columns = df_SC.columns.str.replace('#', '', regex=False)
df_SC.columns = df_SC.columns.str.replace('/', '', regex=False)



df_SC.columns

Index(['id', 'project code', 'pq #', 'po / so #', 'asn/dn #', 'country',
       'managed by', 'fulfill via', 'vendor inco term', 'shipment mode',
       'pq first sent to client date', 'po sent to vendor date',
       'scheduled delivery date', 'delivered to client date',
       'delivery recorded date', 'product group', 'sub classification',
       'vendor', 'item description', 'molecule/test type', 'brand', 'dosage',
       'dosage form', 'unit of measure (per pack)', 'line item quantity',
       'line item value', 'pack price', 'unit price', 'manufacturing site',
       'first line designation', 'weight (kilograms)', 'freight cost (usd)',
       'line item insurance (usd)'],
      dtype='object')
['id' 'project code' 'pq #' 'po / so #' 'asn/dn #' 'country' 'managed by'
 'fulfill via' 'vendor inco term' 'shipment mode'
 'pq first sent to client date' 'po sent to vendor date'
 'scheduled delivery date' 'delivered to client date'
 'delivery recorded date' 'product group' 'sub classific

Index(['id', 'project_code', 'pq_', 'po__so_', 'asndn_', 'country',
       'managed_by', 'fulfill_via', 'vendor_inco_term', 'shipment_mode',
       'pq_first_sent_to_client_date', 'po_sent_to_vendor_date',
       'scheduled_delivery_date', 'delivered_to_client_date',
       'delivery_recorded_date', 'product_group', 'sub_classification',
       'vendor', 'item_description', 'moleculetest_type', 'brand', 'dosage',
       'dosage_form', 'unit_of_measure_per_pack', 'line_item_quantity',
       'line_item_value', 'pack_price', 'unit_price', 'manufacturing_site',
       'first_line_designation', 'weight_kilograms', 'freight_cost_usd',
       'line_item_insurance_usd'],
      dtype='object')
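
The chain of `str.replace` calls above can also be collapsed into a single character-class pattern. The following is a minimal sketch on a toy `DataFrame` (`df_toy` is a hypothetical name, and only three of the raw column names are used):

```python
import pandas as pd

# Toy frame with three of the raw column names shown above
df_toy = pd.DataFrame(columns=['po / so #', 'weight (kilograms)', 'molecule/test type'])

df_toy.columns = (
    df_toy.columns
    .str.replace(r'[()#/]', '', regex=True)   # drop "(", ")", "#" and "/" in one pass
    .str.replace(' ', '_', regex=False)       # then turn spaces into underscores
)
print(list(df_toy.columns))
# ['po__so_', 'weight_kilograms', 'moleculetest_type']
```

The resulting names match the cleaned `Index` printed above.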

---
## Preprocessing Data

We can use the `DataFrame.describe()` method to show descriptive statistics for the numeric columns of the data.

In [22]:
df_SC.describe()

Unnamed: 0,id,unit_of_measure_(per_pack),line_item_quantity,line_item_value,pack_price,unit_price,line_item_insurance_(usd)
count,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10037.0
mean,51098.968229,77.990895,18332.53487,157650.6,21.910241,0.611701,240.117626
std,31944.332496,76.579764,40035.302961,345292.1,45.609223,3.275808,500.190568
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,12795.75,30.0,408.0,4314.593,4.12,0.08,6.51
50%,57540.5,60.0,3000.0,30471.47,9.3,0.16,47.04
75%,83648.25,90.0,17039.75,166447.1,23.5925,0.47,252.4
max,86823.0,1000.0,619999.0,5951990.0,1345.64,238.65,7708.44


In [23]:
df_SC['scheduled_delivery_date']

0         2-Jun-06
1        14-Nov-06
2        27-Aug-06
3         1-Sep-06
4        11-Aug-06
           ...    
10319    31-Jul-15
10320    31-Jul-15
10321    31-Aug-15
10322     9-Sep-15
10323    31-Aug-15
Name: scheduled_delivery_date, Length: 10324, dtype: object

Let's take a look at the data types in our `DataFrame`. Notice that the columns `scheduled_delivery_date`, `delivered_to_client_date`, and `delivery_recorded_date` are of type `object`, which means they hold strings or mixed values rather than dates.

In [24]:
df_SC.dtypes

id                                int64
project_code                     object
pq_#                             object
po_/_so_#                        object
asn/dn_#                         object
country                          object
managed_by                       object
fulfill_via                      object
vendor_inco_term                 object
shipment_mode                    object
pq_first_sent_to_client_date     object
po_sent_to_vendor_date           object
scheduled_delivery_date          object
delivered_to_client_date         object
delivery_recorded_date           object
product_group                    object
sub_classification               object
vendor                           object
item_description                 object
molecule/test_type               object
brand                            object
dosage                           object
dosage_form                      object
unit_of_measure_(per_pack)        int64
line_item_quantity                int64


Let's convert the columns `scheduled_delivery_date`, `delivered_to_client_date`, and `delivery_recorded_date` to the datetime type, as presented below. We can usually obtain the same result by calling `astype('datetime64[ns]')` on these columns.

In [27]:
# here we replace the original columns with the newly formatted ones
df_SC['scheduled_delivery_date'] = pd.to_datetime(df_SC['scheduled_delivery_date'])
df_SC['delivered_to_client_date'] = pd.to_datetime(df_SC['delivered_to_client_date'])
df_SC['delivery_recorded_date'] = pd.to_datetime(df_SC['delivery_recorded_date'])
df_SC.dtypes

id                                       int64
project_code                            object
pq_#                                    object
po_/_so_#                               object
asn/dn_#                                object
country                                 object
managed_by                              object
fulfill_via                             object
vendor_inco_term                        object
shipment_mode                           object
pq_first_sent_to_client_date            object
po_sent_to_vendor_date                  object
scheduled_delivery_date         datetime64[ns]
delivered_to_client_date        datetime64[ns]
delivery_recorded_date          datetime64[ns]
product_group                           object
sub_classification                      object
vendor                                  object
item_description                        object
molecule/test_type                      object
brand                                   object
dosage       
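
When every date follows the same pattern, passing an explicit `format` to `pd.to_datetime` makes the parsing unambiguous. A minimal sketch, assuming the day-month-two-digit-year pattern seen in the output above (`dates` is a hypothetical sample, not the full column):

```python
import pandas as pd

# A few date strings in the same style as scheduled_delivery_date
dates = pd.Series(['2-Jun-06', '14-Nov-06', '31-Jul-15'])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format='%d-%b-%y')

print(parsed.dt.year.tolist())    # [2006, 2006, 2015]
print(parsed.dt.month.tolist())   # [6, 11, 7]
```

Once a column is a datetime type, the `.dt` accessor gives access to components such as year and month.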

We can also see that the columns `weight_(kilograms)` and `freight_cost_(usd)` (`weight_kg` and `freight_cost_usd` in the table above) are of type `object`. These columns should be numeric, since they represent kilograms and US dollars. However, the raw data contain text annotations entered by the user, as you can see by printing the first 10 rows of the DataFrame, which is why pandas reads them as type `object`.

In [28]:
df_SC.head(10)

Unnamed: 0,id,project_code,pq_#,po_/_so_#,asn/dn_#,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,pq_first_sent_to_client_date,po_sent_to_vendor_date,scheduled_delivery_date,delivered_to_client_date,delivery_recorded_date,product_group,sub_classification,vendor,item_description,molecule/test_type,brand,dosage,dosage_form,unit_of_measure_(per_pack),line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_(kilograms),freight_cost_(usd),line_item_insurance_(usd),schedule_delivery_date
0,1,100-CI-T01,Pre-PQ Process,SCMS-4,ASN-8,Côte d'Ivoire,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-06-02,2006-06-02,2006-06-02,HRDT,HIV test,RANBAXY Fine Chemicals LTD.,"HIV, Reveal G3 Rapid HIV-1 Antibody Test, 30 T...","HIV, Reveal G3 Rapid HIV-1 Antibody Test",Reveal,,Test kit,30,19,551.0,29.0,0.97,Ranbaxy Fine Chemicals LTD,True,13,780.34,,2006-06-02
1,3,108-VN-T01,Pre-PQ Process,SCMS-13,ASN-85,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-11-14,2006-11-14,2006-11-14,ARV,Pediatric,Aurobindo Pharma Limited,"Nevirapine 10mg/ml, oral suspension, Bottle, 2...",Nevirapine,Generic,10mg/ml,Oral suspension,240,1000,6200.0,6.2,0.03,"Aurobindo Unit III, India",True,358,4521.5,,2006-11-14
2,4,100-CI-T01,Pre-PQ Process,SCMS-20,ASN-14,Côte d'Ivoire,PMO - US,Direct Drop,FCA,Air,Pre-PQ Process,Date Not Captured,2006-08-27,2006-08-27,2006-08-27,HRDT,HIV test,Abbott GmbH & Co. KG,"HIV 1/2, Determine Complete HIV Kit, 100 Tests","HIV 1/2, Determine Complete HIV Kit",Determine,,Test kit,100,500,40000.0,80.0,0.8,ABBVIE GmbH & Co.KG Wiesbaden,True,171,1653.78,,2006-08-27
3,15,108-VN-T01,Pre-PQ Process,SCMS-78,ASN-50,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-09-01,2006-09-01,2006-09-01,ARV,Adult,SUN PHARMACEUTICAL INDUSTRIES LTD (RANBAXY LAB...,"Lamivudine 150mg, tablets, 60 Tabs",Lamivudine,Generic,150mg,Tablet,60,31920,127360.8,3.99,0.07,"Ranbaxy, Paonta Shahib, India",True,1855,16007.06,,2006-09-01
4,16,108-VN-T01,Pre-PQ Process,SCMS-81,ASN-55,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-08-11,2006-08-11,2006-08-11,ARV,Adult,Aurobindo Pharma Limited,"Stavudine 30mg, capsules, 60 Caps",Stavudine,Generic,30mg,Capsule,60,38000,121600.0,3.2,0.05,"Aurobindo Unit III, India",True,7590,45450.08,,2006-08-11
5,23,112-NG-T01,Pre-PQ Process,SCMS-87,ASN-57,Nigeria,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-09-28,2006-09-28,2006-09-28,ARV,Pediatric,Aurobindo Pharma Limited,"Zidovudine 10mg/ml, oral solution, Bottle, 240 ml",Zidovudine,Generic,10mg/ml,Oral solution,240,416,2225.6,5.35,0.02,"Aurobindo Unit III, India",True,504,5920.42,,2006-09-28
6,44,110-ZM-T01,Pre-PQ Process,SCMS-139,ASN-130,Zambia,PMO - US,Direct Drop,DDU,Air,Pre-PQ Process,Date Not Captured,2007-01-08,2007-01-08,2007-01-08,ARV,Pediatric,MERCK SHARP & DOHME IDEA GMBH (FORMALLY MERCK ...,"Efavirenz 200mg [Stocrin/Sustiva], capsule, 90...",Efavirenz,Stocrin/Sustiva,200mg,Capsule,90,135,4374.0,32.4,0.36,MSD South Granville Australia,True,328,Freight Included in Commodity Cost,,2007-01-08
7,45,109-TZ-T01,Pre-PQ Process,SCMS-140,ASN-94,Tanzania,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-11-24,2006-11-24,2006-11-24,ARV,Adult,Aurobindo Pharma Limited,"Nevirapine 200mg, tablets, 60 Tabs",Nevirapine,Generic,200mg,Tablet,60,16667,60834.55,3.65,0.06,"Aurobindo Unit III, India",True,1478,6212.41,,2006-11-24
8,46,112-NG-T01,Pre-PQ Process,SCMS-156,ASN-93,Nigeria,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-12-07,2006-12-07,2006-12-07,ARV,Adult,Aurobindo Pharma Limited,"Stavudine 30mg, capsules, 60 Caps",Stavudine,Generic,30mg,Capsule,60,273,532.35,1.95,0.03,"Aurobindo Unit III, India",False,See ASN-93 (ID#:1281),See ASN-93 (ID#:1281),,2006-12-07
9,47,110-ZM-T01,Pre-PQ Process,SCMS-165,ASN-199,Zambia,PMO - US,Direct Drop,CIP,Air,Pre-PQ Process,11/13/2006,2007-01-30,2007-01-30,2007-01-30,ARV,Adult,ABBVIE LOGISTICS (FORMERLY ABBOTT LOGISTICS BV),"Lopinavir/Ritonavir 200/50mg [Aluvia], tablets...",Lopinavir/Ritonavir,Aluvia,200/50mg,Tablet,120,2800,115080.0,41.1,0.34,ABBVIE (Abbott) St. P'burg USA,True,643,Freight Included in Commodity Cost,,2007-01-30


We can then use the function `pd.to_numeric()` to convert the values in these two columns to floats:

In [31]:
df_SC['weight_(kilograms)'] = pd.to_numeric(df_SC['weight_(kilograms)'], errors='coerce')
df_SC['freight_cost_(usd)'] = pd.to_numeric(df_SC['freight_cost_(usd)'], errors='coerce')

By setting `errors='coerce'`, non-numeric values are converted to `NaN`.
Now `describe()` also reports descriptive statistics for the weight and freight cost columns.

In [32]:
df_SC.describe()

Unnamed: 0,id,unit_of_measure_(per_pack),line_item_quantity,line_item_value,pack_price,unit_price,weight_(kilograms),freight_cost_(usd),line_item_insurance_(usd)
count,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,6372.0,6198.0,10037.0
mean,51098.968229,77.990895,18332.53487,157650.6,21.910241,0.611701,3424.441306,11103.234819,240.117626
std,31944.332496,76.579764,40035.302961,345292.1,45.609223,3.275808,13526.96827,15813.026692,500.190568
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.75,0.0
25%,12795.75,30.0,408.0,4314.593,4.12,0.08,206.75,2131.12,6.51
50%,57540.5,60.0,3000.0,30471.47,9.3,0.16,1047.0,5869.655,47.04
75%,83648.25,90.0,17039.75,166447.1,23.5925,0.47,3334.0,14406.57,252.4
max,86823.0,1000.0,619999.0,5951990.0,1345.64,238.65,857354.0,289653.2,7708.44
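
To see what `errors='coerce'` does to annotations like those in the freight cost column, here is a minimal sketch on a hypothetical sample (`raw` is not the real column, only a few values in the same style):

```python
import pandas as pd

# Mix of numeric strings and text annotations, as in the raw freight cost column
raw = pd.Series(['780.34',
                 'Freight Included in Commodity Cost',
                 'See ASN-93 (ID#:1281)',
                 '4521.5'])

clean = pd.to_numeric(raw, errors='coerce')   # text that cannot be parsed becomes NaN

print(clean.dtype)         # float64
print(clean.isna().sum())  # 2
```

This is why the `count` row for the weight and freight cost columns is lower than 10324 in the output above: the annotated rows are now counted as missing.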


---
## Missing data

Now let's take a look at the missing values in our DataFrame. We can count the missing values in each column as follows.

In [33]:
df_SC.isna().sum()

id                                 0
project_code                       0
pq_#                               0
po_/_so_#                          0
asn/dn_#                           0
country                            0
managed_by                         0
fulfill_via                        0
vendor_inco_term                   0
shipment_mode                    360
pq_first_sent_to_client_date       0
po_sent_to_vendor_date             0
scheduled_delivery_date            0
delivered_to_client_date           0
delivery_recorded_date             0
product_group                      0
sub_classification                 0
vendor                             0
item_description                   0
molecule/test_type                 0
brand                              0
dosage                          1736
dosage_form                        0
unit_of_measure_(per_pack)         0
line_item_quantity                 0
line_item_value                    0
pack_price                         0
u

We can see that there are missing values in the columns `shipment` and `dosage`, which represent the transportation mode (e.g., by air) and the dosage (e.g., 30mg) of each item sold, respectively. There is not much we can do to replace these missing values with meaningful information, so we'll replace them with the word `'missing'`. We use the `.fillna()` method with the option `inplace=True` to save the changes in our original DataFrame. Check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) for more information about the `.fillna()` method.

In [None]:
df_SC.fillna(value={'shipment': 'missing', 'dosage': 'missing'}, inplace=True)
df_SC.isna().sum()
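
The dictionary passed to `fillna` maps column names to fill values, so only the listed columns are touched. A minimal sketch on a hypothetical toy frame (`df_toy` and its values are illustrative only):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'shipment':  ['Air', None, 'Truck'],
    'dosage':    [None, '30mg', None],
    'weight_kg': [13.0, None, 171.0],
})

# Only 'shipment' and 'dosage' are filled; 'weight_kg' keeps its missing value
df_filled = df_toy.fillna(value={'shipment': 'missing', 'dosage': 'missing'})

print(df_filled.isna().sum().to_dict())
# {'shipment': 0, 'dosage': 0, 'weight_kg': 1}
```

Here the result is assigned to a new variable instead of using `inplace=True`, which keeps the original frame available for comparison.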

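Now we will replace the missing values in columns `weight_kg` and `freight_cost_usd` with 0, and those in `insurance_usd` with an approximate value, computed as the mean of the column. The cell below shows the mean imputation for the insurance column; note that the result is stored in a new column rather than overwriting the original.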

In [43]:
df_SC['item_insurance_rp_mean'] = df_SC['line_item_insurance_(usd)'].fillna(df_SC['line_item_insurance_(usd)'].mean())
print(df_SC.isnull().sum())
# show the original and the imputed column for the rows where insurance was missing
df_SC.loc[df_SC['line_item_insurance_(usd)'].isna(), ['line_item_insurance_(usd)', 'item_insurance_rp_mean']]

id                                 0
project_code                       0
pq_#                               0
po_/_so_#                          0
asn/dn_#                           0
country                            0
managed_by                         0
fulfill_via                        0
vendor_inco_term                   0
shipment_mode                    360
pq_first_sent_to_client_date       0
po_sent_to_vendor_date             0
scheduled_delivery_date            0
delivered_to_client_date           0
delivery_recorded_date             0
product_group                      0
sub_classification                 0
vendor                             0
item_description                   0
molecule/test_type                 0
brand                              0
dosage                          1736
dosage_form                        0
unit_of_measure_(per_pack)         0
line_item_quantity                 0
line_item_value                    0
pack_price                         0
u

Unnamed: 0,line_item_insurance_(usd),item_insurance_rp_mean
0,,240.117626
1,,240.117626
2,,240.117626
3,,240.117626
4,,240.117626
...,...,...
2496,,240.117626
2497,,240.117626
2499,,240.117626
2500,,240.117626
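
Mean imputation keeps the column mean unchanged but reduces its spread, because every filled value sits exactly at the mean. A minimal sketch with hypothetical numbers (`insurance` is a toy series, not the real column):

```python
import pandas as pd

insurance = pd.Series([10.0, None, 30.0, None])

# Replace each missing value with the mean of the observed values (20.0)
insurance_filled = insurance.fillna(insurance.mean())

print(insurance_filled.tolist())   # [10.0, 20.0, 30.0, 20.0]
print(insurance.mean(), insurance_filled.mean())   # 20.0 20.0
print(insurance_filled.std() < insurance.std())    # True
```

This side effect is worth keeping in mind when a large share of a column is missing.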


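As an alternative to the column mean, we can estimate each missing insurance value from the most similar shipments. The cell below uses scikit-learn's `KNNImputer` with `n_neighbors=2`: each missing insurance value is replaced by the average insurance of the two nearest rows, where nearness is measured on the quantity, value, weight and freight cost columns that are available. The result is stored in the new column `item_insurance_rp_kNN`.

In [58]:
from sklearn.impute import KNNImputer   # import the imputer explicitly

# columns used to find similar shipments
inferred_columns = ['line_item_quantity', 'line_item_value', 'weight_(kilograms)', 'freight_cost_(usd)']
df_SC_kNN = df_SC[inferred_columns].copy()
df_SC_kNN['item_insurance_rp_kNN'] = df_SC['line_item_insurance_(usd)']

imputer = KNNImputer(n_neighbors=2)
# fit_transform returns a NumPy array in the same column order; the last column is the imputed insurance
df_SC['item_insurance_rp_kNN'] = imputer.fit_transform(df_SC_kNN)[:, -1]
df_SC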

Unnamed: 0,id,project_code,pq_#,po_/_so_#,asn/dn_#,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,pq_first_sent_to_client_date,po_sent_to_vendor_date,scheduled_delivery_date,delivered_to_client_date,delivery_recorded_date,product_group,sub_classification,vendor,item_description,molecule/test_type,brand,dosage,dosage_form,unit_of_measure_(per_pack),line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_(kilograms),freight_cost_(usd),line_item_insurance_(usd),schedule_delivery_date,item_insurance_rp_mean,item_insurance_rp_kNN
0,1,100-CI-T01,Pre-PQ Process,SCMS-4,ASN-8,Côte d'Ivoire,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-06-02,2006-06-02,2006-06-02,HRDT,HIV test,RANBAXY Fine Chemicals LTD.,"HIV, Reveal G3 Rapid HIV-1 Antibody Test, 30 T...","HIV, Reveal G3 Rapid HIV-1 Antibody Test",Reveal,,Test kit,30,19,551.00,29.00,0.97,Ranbaxy Fine Chemicals LTD,True,13.0,780.34,,2006-06-02,240.117626,0.565
1,3,108-VN-T01,Pre-PQ Process,SCMS-13,ASN-85,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-11-14,2006-11-14,2006-11-14,ARV,Pediatric,Aurobindo Pharma Limited,"Nevirapine 10mg/ml, oral suspension, Bottle, 2...",Nevirapine,Generic,10mg/ml,Oral suspension,240,1000,6200.00,6.20,0.03,"Aurobindo Unit III, India",True,358.0,4521.50,,2006-11-14,240.117626,9.915
2,4,100-CI-T01,Pre-PQ Process,SCMS-20,ASN-14,Côte d'Ivoire,PMO - US,Direct Drop,FCA,Air,Pre-PQ Process,Date Not Captured,2006-08-27,2006-08-27,2006-08-27,HRDT,HIV test,Abbott GmbH & Co. KG,"HIV 1/2, Determine Complete HIV Kit, 100 Tests","HIV 1/2, Determine Complete HIV Kit",Determine,,Test kit,100,500,40000.00,80.00,0.80,ABBVIE GmbH & Co.KG Wiesbaden,True,171.0,1653.78,,2006-08-27,240.117626,63.800
3,15,108-VN-T01,Pre-PQ Process,SCMS-78,ASN-50,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-09-01,2006-09-01,2006-09-01,ARV,Adult,SUN PHARMACEUTICAL INDUSTRIES LTD (RANBAXY LAB...,"Lamivudine 150mg, tablets, 60 Tabs",Lamivudine,Generic,150mg,Tablet,60,31920,127360.80,3.99,0.07,"Ranbaxy, Paonta Shahib, India",True,1855.0,16007.06,,2006-09-01,240.117626,193.570
4,16,108-VN-T01,Pre-PQ Process,SCMS-81,ASN-55,Vietnam,PMO - US,Direct Drop,EXW,Air,Pre-PQ Process,Date Not Captured,2006-08-11,2006-08-11,2006-08-11,ARV,Adult,Aurobindo Pharma Limited,"Stavudine 30mg, capsules, 60 Caps",Stavudine,Generic,30mg,Capsule,60,38000,121600.00,3.20,0.05,"Aurobindo Unit III, India",True,7590.0,45450.08,,2006-08-11,240.117626,177.880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10319,86818,103-ZW-T30,FPQ-15197,SO-50020,DN-4307,Zimbabwe,PMO - US,From RDC,N/A - From RDC,Truck,10/16/2014,N/A - From RDC,2015-07-31,2015-07-15,2015-07-20,ARV,Pediatric,SCMS from RDC,"Lamivudine/Nevirapine/Zidovudine 30/50/60mg, d...",Lamivudine/Nevirapine/Zidovudine,Generic,30/50/60mg,Chewable/dispersible tablet - FDC,60,166571,599655.60,3.60,0.06,"Mylan, H-12 & H-13, India",False,,,705.79,2015-07-31,705.790000,705.790
10320,86819,104-CI-T30,FPQ-15259,SO-50102,DN-4313,Côte d'Ivoire,PMO - US,From RDC,N/A - From RDC,Truck,10/24/2014,N/A - From RDC,2015-07-31,2015-08-06,2015-08-07,ARV,Adult,SCMS from RDC,"Lamivudine/Zidovudine 150/300mg, tablets, 60 Tabs",Lamivudine/Zidovudine,Generic,150/300mg,Tablet - FDC,60,21072,137389.44,6.52,0.11,Hetero Unit III Hyderabad IN,False,,,161.71,2015-07-31,161.710000,161.710
10321,86821,110-ZM-T30,FPQ-14784,SO-49600,DN-4316,Zambia,PMO - US,From RDC,N/A - From RDC,Truck,8/12/2014,N/A - From RDC,2015-08-31,2015-08-25,2015-09-03,ARV,Adult,SCMS from RDC,Efavirenz/Lamivudine/Tenofovir Disoproxil Fuma...,Efavirenz/Lamivudine/Tenofovir Disoproxil Fuma...,Generic,600/300/300mg,Tablet - FDC,30,514526,5140114.74,9.99,0.33,Cipla Ltd A-42 MIDC Mahar. IN,False,,,5284.04,2015-08-31,5284.040000,5284.040
10322,86822,200-ZW-T30,FPQ-16523,SO-51680,DN-4334,Zimbabwe,PMO - US,From RDC,N/A - From RDC,Truck,7/1/2015,N/A - From RDC,2015-09-09,2015-08-04,2015-08-11,ARV,Adult,SCMS from RDC,"Lamivudine/Zidovudine 150/300mg, tablets, 60 Tabs",Lamivudine/Zidovudine,Generic,150/300mg,Tablet - FDC,60,17465,113871.80,6.52,0.11,Mylan (formerly Matrix) Nashik,True,1392.0,,134.03,2015-09-09,134.030000,134.030
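
The idea behind nearest-neighbour imputation can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes the feature columns are complete, and the function name `knn_impute_column` and the toy numbers are hypothetical.

```python
import numpy as np

def knn_impute_column(features, target, k=2):
    """Fill NaNs in `target` with the mean target of the k nearest rows,
    using Euclidean distance on `features` (assumed to have no NaNs)."""
    features = np.asarray(features, dtype=float)
    filled = np.asarray(target, dtype=float).copy()
    observed = ~np.isnan(filled)               # rows whose target is known
    known_values = filled[observed]
    for i in np.flatnonzero(~observed):        # rows whose target is missing
        dist = np.linalg.norm(features[observed] - features[i], axis=1)
        nearest = np.argsort(dist)[:k]         # positions of the k closest known rows
        filled[i] = known_values[nearest].mean()
    return filled

# One feature (e.g., weight) and a target (e.g., insurance) with one missing value
weight = [[1.0], [2.0], [10.0], [11.0], [3.0]]
insurance = [10.0, 20.0, 100.0, 110.0, np.nan]

result = knn_impute_column(weight, insurance, k=2)
print(result[4])   # 15.0, the mean of the two closest rows (weights 2 and 1)
```

Unlike the mean, the imputed value depends on the row's other attributes, which is why the kNN column above differs from row to row.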


---
## Data Transformation

### Scaling methods
Variables tend to have different ranges, and some algorithms are adversely affected by these differences: variables with greater ranges tend to have a larger influence on a model's results. Therefore, numeric field values may need to be standardized or normalized.

From the output of the `describe()` method shown earlier, we can see that the numerical variables have different ranges. For instance, `units_per_pack` varies from 1 to 1000, while `weight_kg` varies from 0 to 857354. We would like to apply a normalization method to scale the numerical values in our data.

Let's apply **min-max normalization**, which measures how much greater the field value is than the minimum value and scales this difference by the range of the field values.

$$X^*=\frac{X-\min(X)}{\max(X)-\min(X)}$$

Thus, we compute the normalized version of each numerical variable and add it as a new column of our DataFrame. We can proceed as follows.

First, we create a list of the columns we want to normalize. 

In [None]:
columns_to_norm = ['units_per_pack', 'quantity_pack_sold', 'value_sold', 'pack_price', 'unit_price', 'weight_kg',
      'freight_cost_usd', 'insurance_usd']

Then, we can create a `for` loop to compute the normalized version of each of these columns and add it to `df_SC`.

In [None]:
for col in columns_to_norm:
    col_norm = col + '_norm'   # create a new name for the column. For example, 'units_per_pack_norm'
    df_SC[col_norm] = (df_SC[col] - df_SC[col].min())/(df_SC[col].max() - df_SC[col].min())   # add the new normalized col
df_SC.describe()

Unnamed: 0,id,units_per_pack,quantity_pack_sold,value_sold,pack_price,unit_price,weight_kg,freight_cost_usd,insurance_usd,units_per_pack_norm,quantity_pack_sold_norm,value_sold_norm,pack_price_norm,unit_price_norm,weight_kg_norm,freight_cost_usd_norm,insurance_usd_norm
count,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0
mean,51098.968229,77.990895,18332.53487,157650.6,21.910241,0.611701,2113.574196,6665.812612,240.117626,0.077068,0.029567,0.026487,0.016282,0.002563,0.002465,0.023013,0.03115
std,31944.332496,76.579764,40035.302961,345292.1,45.609223,3.275808,10756.353428,13404.868186,493.188408,0.076656,0.064573,0.058013,0.033894,0.013726,0.012546,0.046279,0.06398
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12795.75,30.0,408.0,4314.593,4.12,0.08,0.0,0.0,7.03,0.029029,0.000656,0.000725,0.003062,0.000335,0.0,0.0,0.000912
50%,57540.5,60.0,3000.0,30471.47,9.3,0.16,122.0,1422.09,52.94,0.059059,0.004837,0.00512,0.006911,0.00067,0.000142,0.00491,0.006868
75%,83648.25,90.0,17039.75,166447.1,23.5925,0.47,1596.5,7707.64,241.75,0.089089,0.027482,0.027965,0.017533,0.001969,0.001862,0.02661,0.031362
max,86823.0,1000.0,619999.0,5951990.0,1345.64,238.65,857354.0,289653.2,7708.44,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
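
As a quick check of the formula, the following minimal sketch normalizes a hypothetical sample built from the quartiles of `units_per_pack` reported above (min 1, 30, 60, 90, max 1000):

```python
import pandas as pd

units = pd.Series([1, 30, 60, 90, 1000], name='units_per_pack')

# (X - min) / (max - min); for example (30 - 1) / (1000 - 1) = 29 / 999
units_norm = (units - units.min()) / (units.max() - units.min())

print(units_norm.round(6).tolist())
# [0.0, 0.029029, 0.059059, 0.089089, 1.0]
```

The values agree with the `units_per_pack_norm` column in the table above, and the result always lies between 0 and 1.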


### Dummy Variables
A categorical variable should generally be encoded as **dummy variables** (a.k.a. indicator variables), each taking only one of two values (0 or 1; False or True).
When a categorical variable takes k possible values, you typically have two options to define your dummy variables:
* Option 1: Define k-1 dummy variables, and use the unassigned category as the reference category
* Option 2: Define k dummy variables. Often referred to as **one-hot** encoding.

Let's transform our categorical variable `shipment` into dummy variables using Option 1. First, let's take a look at the possible values of this variable.

In [None]:
df_SC['shipment'].unique()

array(['Air', 'missing', 'Truck', 'Air Charter', 'Ocean'], dtype=object)

We will create 4 dummy variables named `'Air'`, `'Truck'`, `'Air Charter'` and `'Ocean'`, and use `'missing'` as our reference category. One way to do this is with the pandas function `pd.get_dummies()`, which converts a categorical variable into dummy/indicator variables. You can check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for more information.

In [None]:
df_dummies = pd.get_dummies(df_SC['shipment'], dtype=int)  # dtype=int gives 0/1 rather than True/False
df_dummies.head()

Unnamed: 0,Air,Air Charter,Ocean,Truck,missing
0,1,0,0,0,0
1,1,0,0,0,0
2,1,0,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0


Then, we can either merge or concatenate the new `DataFrame` we just created (`df_dummies`) with our original `DataFrame`. We can also get rid of the original column `shipment`, as we will use its indicator variables instead. We can do this using the function `pd.concat()` and the method `DataFrame.drop()`.

In [None]:
# concatenate the original df_SC with df_dummies, leaving out the reference column 'missing'
df_SC = pd.concat([df_SC, df_dummies.drop('missing', axis=1)], axis=1)

# drop the column 'shipment' and save the change in the original DataFrame
df_SC.drop('shipment', axis=1, inplace=True)
df_SC.describe()

Unnamed: 0,id,units_per_pack,quantity_pack_sold,value_sold,pack_price,unit_price,weight_kg,freight_cost_usd,insurance_usd,units_per_pack_norm,...,value_sold_norm,pack_price_norm,unit_price_norm,weight_kg_norm,freight_cost_usd_norm,insurance_usd_norm,Air,Air Charter,Ocean,Truck
count,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,...,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0,10324.0
mean,51098.968229,77.990895,18332.53487,157650.6,21.910241,0.611701,2113.574196,6665.812612,240.117626,0.077068,...,0.026487,0.016282,0.002563,0.002465,0.023013,0.03115,0.592115,0.06296,0.035936,0.274119
std,31944.332496,76.579764,40035.302961,345292.1,45.609223,3.275808,10756.353428,13404.868186,493.188408,0.076656,...,0.058013,0.033894,0.013726,0.012546,0.046279,0.06398,0.491465,0.242903,0.186139,0.446091
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12795.75,30.0,408.0,4314.593,4.12,0.08,0.0,0.0,7.03,0.029029,...,0.000725,0.003062,0.000335,0.0,0.0,0.000912,0.0,0.0,0.0,0.0
50%,57540.5,60.0,3000.0,30471.47,9.3,0.16,122.0,1422.09,52.94,0.059059,...,0.00512,0.006911,0.00067,0.000142,0.00491,0.006868,1.0,0.0,0.0,0.0
75%,83648.25,90.0,17039.75,166447.1,23.5925,0.47,1596.5,7707.64,241.75,0.089089,...,0.027965,0.017533,0.001969,0.001862,0.02661,0.031362,1.0,0.0,0.0,1.0
max,86823.0,1000.0,619999.0,5951990.0,1345.64,238.65,857354.0,289653.2,7708.44,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
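
The choice of reference category matters when reading the dummy columns. A minimal sketch on a hypothetical sample (`shipment` here is a toy series) contrasts dropping a named reference with the `drop_first=True` option of `pd.get_dummies`, which drops the first category in sorted order instead:

```python
import pandas as pd

shipment = pd.Series(['Air', 'missing', 'Truck', 'Air Charter', 'Ocean', 'Air'],
                     name='shipment')

# Option 1 with 'missing' as the reference: k - 1 = 4 dummy columns
dummies = pd.get_dummies(shipment, dtype=int).drop(columns='missing')
print(dummies.columns.tolist())      # ['Air', 'Air Charter', 'Ocean', 'Truck']
print(dummies.sum(axis=1).tolist())  # [1, 0, 1, 1, 1, 1]; the all-zero row is the reference

# drop_first=True also yields k - 1 columns, but the reference becomes 'Air'
dummies_first = pd.get_dummies(shipment, dtype=int, drop_first=True)
print(dummies_first.columns.tolist())  # ['Air Charter', 'Ocean', 'Truck', 'missing']
```

A row with zeros in every dummy column belongs to the reference category.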


### Transforming numerical variables into categorical variables
In some cases, categorical variables may be preferred over numerical ones. We then need to partition the numerical variables into bins according to specific criteria.
As an example, let's transform our original variable `'weight_kg'` into a categorical variable with values `'light'` (if the weight is up to 100 kg), `'medium'` (if the weight is within the interval (100 kg, 500 kg]), `'heavy'` (if the weight is within the interval (500 kg, 1000 kg]) and `'super-heavy'` (if the weight is > 1000 kg).

We can implement this transformation using the function `pd.cut()`, which helps us to segment and sort data values into bins. You can check [this page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for more information on this function.


In [None]:
bins = [0, 100., 500., 1000.,  float('inf')]             # defining the bins 
names = ['light', 'medium', 'heavy', 'super-heavy']      # defining the names for the categories
df_SC['weight_category'] = pd.cut(df_SC['weight_kg'], bins, labels=names, include_lowest=True)  # adding the new cat. var. to our DF
df_SC.head()

Unnamed: 0,id,project_code,country,vendor,manufacturing_site,schedule_delivery_date,delivered_to_client_date,delivery_recorded_date,product_group,product_subgroup,...,pack_price_norm,unit_price_norm,weight_kg_norm,freight_cost_usd_norm,insurance_usd_norm,Air,Air Charter,Ocean,Truck,weight_category
0,1,100-CI-T01,Cote d Ivoire,EXW,Ranbaxy Fine Chemicals LTD,2006-06-02,2006-06-02,2006-06-02,HRDT,HIV test,...,0.021551,0.004065,1.5e-05,0.002694,0.03115,1,0,0,0,light
1,3,108-VN-T01,Vietnam,EXW,"Aurobindo Unit III, India",2006-11-14,2006-11-14,2006-11-14,ARV,Pediatric,...,0.004607,0.000126,0.000418,0.01561,0.03115,1,0,0,0,medium
2,4,100-CI-T01,Cote d Ivoire,FCA,ABBVIE GmbH & Co.KG Wiesbaden,2006-08-27,2006-08-27,2006-08-27,HRDT,HIV test,...,0.059451,0.003352,0.000199,0.00571,0.03115,1,0,0,0,medium
3,15,108-VN-T01,Vietnam,EXW,"Ranbaxy, Paonta Shahib, India",2006-09-01,2006-09-01,2006-09-01,ARV,Adult,...,0.002965,0.000293,0.002164,0.055263,0.03115,1,0,0,0,super-heavy
4,16,108-VN-T01,Vietnam,EXW,"Aurobindo Unit III, India",2006-08-11,2006-08-11,2006-08-11,ARV,Adult,...,0.002378,0.00021,0.008853,0.156912,0.03115,1,0,0,0,super-heavy
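
The behaviour of `pd.cut` at the bin edges is easiest to see on a few hand-picked values. A minimal sketch with hypothetical weights (`weights` is a toy series that includes the edge values 0, 100 and 500):

```python
import pandas as pd

weights = pd.Series([0, 13, 100, 358, 500, 1855])

bins = [0, 100, 500, 1000, float('inf')]
names = ['light', 'medium', 'heavy', 'super-heavy']

# Bins are right-inclusive: (0, 100], (100, 500], ...; include_lowest=True also admits 0
categories = pd.cut(weights, bins, labels=names, include_lowest=True)

print(categories.astype(str).tolist())
# ['light', 'light', 'light', 'medium', 'medium', 'super-heavy']
```

Without `include_lowest=True`, a weight of exactly 0 would fall outside the first bin and become `NaN`.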

