<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S05_Data_Preprocessing/S05_LectureEx_Pandas_Profiling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# S5 - Exploratory Data Analysis (EDA) using pandas profiling
Programming topics covered in this section:
* pandas_profiling

Examples include:
* Exploring Supply Chain health commodity shipment and pricing data

---
# `pandas_profiling`
`pandas_profiling` is a library that generates an exploratory report from a pandas `DataFrame`. However, **it is not installed by default in Anaconda**. This notebook gives a brief overview of `pandas_profiling`; for more information, refer to the [documentation](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/introduction.html).

## 1. Installing pandas_profiling

### Using conda
To install `pandas_profiling` using the conda package manager, open an *Anaconda Prompt* and type

```shell
conda install -c conda-forge pandas-profiling
```
    
and press `Enter`.

After a short while, the console lists the packages to be installed and asks whether you want to proceed. Press `Enter` to continue.

### In a notebook/Colab
**IMPORTANT NOTE:** Run the following cell if you use Colab, or if you use Jupyter and have not yet installed the `pandas-profiling` library. It installs the latest version of `pandas-profiling` from GitHub, which may take a while.


In [None]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 

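To confirm that the installation worked, one option is to query the installed version. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is our own and not part of `pandas_profiling`.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "pandas-profiling" is the distribution name used by pip and conda.
version = installed_version("pandas-profiling")
if version is None:
    print("pandas-profiling is not installed yet; run the install cell above.")
else:
    print(f"pandas-profiling version {version} is installed.")
```

If the message says the library is missing right after installing it in Jupyter, restarting the kernel is usually worth trying.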
### Importing libraries

In [None]:
import pandas as pd
import pandas_profiling

## 2. Importing data and creating a report
In this exercise, we will explore an adapted data set that provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. In addition, the data set provides the commodity pricing and the associated supply chain expenses necessary to move the commodities to the countries for use. The original data are provided by the US Agency for International Development and can be accessed at [this page](https://catalog.data.gov/dataset/supply-chain-shipment-pricing-data).

This is a description of our adapted data, which is loaded below from the file `Supply_Chain_Shipment_Pricing_Data.csv`.

| VARIABLE NAME | DESCRIPTION | 
|:----|:----|
|id| identification number|
|project code|identification of the project|
|country|country to which the items are shipped|
|vendor|identification of the vendor of the item|
|manufacturing site|name of the manufacturer of the item|
|shipment mode|transportation mode (e.g., air, truck)|
|scheduled delivery date|planned date of delivery|
|delivered to client date|actual date of delivery|
|delivery recorded date|registered date of delivery|
|product group|main category of the item|
|product subgroup|subcategory of the item (e.g., HIV test, pediatric, Adult) |
|molecule type|description of the composition of the item (e.g., Nevirapine, HIV 1/2, Didanosine)|
|brand| item brand (e.g., generic or any other commercial brand)|
|dosage| specifications about the dosage of each item (e.g., 10mg/ml, 200mg)|
|dosage form|instructions for consumption (e.g., capsule, tablet, oral solution) |
|units per pack| number of units in each package|
|quantity pack sold| number of packages shipped to the specified country|
|value sold| total value in $\$$ USD of the shipment (i.e., pack price * quantity pack sold)|
|pack price| price in $\$$ USD per package|
|unit price| price in $\$$ USD per unit|
|weight (kilograms)| total weight in kilograms of the shipment|
|freight cost (usd)| value in $\$$ USD paid for transportation|
|insurance (usd)|value in $\$$ USD paid for insurance|

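The table states that `value sold` equals `pack price` times `quantity pack sold`. A minimal sketch of how that relationship could be checked is shown here on a tiny hand-made DataFrame. The rows are hypothetical, and the column names are assumed to match the table; the headers in the actual CSV file may differ.

```python
import pandas as pd

# Hypothetical rows for illustration only; real values come from the CSV file.
demo = pd.DataFrame(
    {
        "pack price": [2.50, 10.00, 4.20],
        "quantity pack sold": [100, 30, 50],
        "value sold": [250.00, 300.00, 200.00],  # last row is deliberately inconsistent
    }
)

# Recompute the shipment value from its two components.
expected_value = demo["pack price"] * demo["quantity pack sold"]

# Flag rows where the recorded value differs by more than one cent.
demo["value mismatch"] = (demo["value sold"] - expected_value).abs() > 0.01

print(demo)
```

Rows flagged in `value mismatch` would be candidates for a closer look during preprocessing.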


Let's import our data.

In [None]:
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S05_Data_Preprocessing/Supply_Chain_Shipment_Pricing_Data.csv'
df_SC = pd.read_csv(url)  # reading data file into a DataFrame
df_SC.head()

We now generate the profile report using `pandas_profiling`.

In [None]:
profile = pandas_profiling.ProfileReport(df_SC)

We can access this report in two ways: through notebook widgets and through an HTML report. If you simply evaluate the object `profile` in a cell (see below), the report is embedded here. This option is practical only if the report is not too large; otherwise, it is better to export it to a local file (see the next section). You can skip the following cell.

In [None]:
# profile

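If the embedded report turns out to be too large, one possible workaround (our suggestion, not a feature of `pandas_profiling`) is to profile only the columns you care about, since each column adds its own section to the report. The sketch below uses a small hypothetical stand-in for `df_SC`, with column names assumed from the description table.

```python
import pandas as pd

# Hypothetical stand-in for df_SC; values are made up for illustration.
df_demo = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "country": ["Vietnam", "Zambia", "Haiti"],
        "shipment mode": ["Air", "Truck", "Air"],
        "pack price": [2.5, 10.0, 4.2],
        "freight cost (usd)": [780.3, 4521.5, 1653.8],
    }
)

# Keep only the columns of interest; fewer columns means fewer report sections.
columns_of_interest = ["country", "shipment mode", "freight cost (usd)"]
df_subset = df_demo[columns_of_interest]

print(df_subset.columns.tolist())

# With pandas_profiling installed, the smaller report could then be built with:
# profile_subset = pandas_profiling.ProfileReport(df_subset)
```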
## 3. Saving the report
If you want to generate an HTML report file, store the `ProfileReport` in a variable and call its `to_file()` method, as follows.


In [None]:
file_name = "pandas_profiling_report.html"
profile.to_file(file_name) # saving the report

**Colab**: In Colab, the report is written to the remote machine, so use the code below to download it to your computer.

In [None]:
from google.colab import files

files.download(file_name)
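
The `google.colab` import fails outside Colab, so the download cell raises an error in a local Jupyter session. A minimal sketch of a guard that works in both environments follows; the helper name `running_in_colab` is our own.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


file_name = "pandas_profiling_report.html"

if running_in_colab():
    from google.colab import files

    files.download(file_name)  # triggers a browser download of the report
else:
    # Locally, to_file() writes the report to the current working directory.
    print(f"Not running in Colab; look for {file_name} in the working directory.")
```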