# Explore Data

Source: https://archive.ics.uci.edu/ml/datasets/online+retail#

**Attribute Information**:

`InvoiceNo`: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.

`StockCode`: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

`Description`: Product (item) name. Nominal.

`Quantity`: The quantities of each product (item) per transaction. Numeric.

`InvoiceDate`: Invoice date and time. Numeric, the day and time when each transaction was generated.

`UnitPrice`: Unit price. Numeric, product price per unit in sterling.

`CustomerID`: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

`Country`: Country name. Nominal, the name of the country where each customer resides.
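The cancellation rule in the attribute list can be expressed as a simple string check. This is a minimal sketch on made-up rows (not taken from the dataset); the column name `IsCancellation` is hypothetical and only used for illustration.

```python
import pandas as pd

# Made-up rows that follow the attribute descriptions above (not real data).
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381"],
    "StockCode": ["85123", "22632", "22139"],
    "Quantity": [6, -1, 23],
    "UnitPrice": [2.55, 27.50, 4.25],
})

# InvoiceNo is nominal, so treat it as text rather than as a number.
sample["InvoiceNo"] = sample["InvoiceNo"].astype(str)

# A leading 'c' (checked case-insensitively here) marks a cancellation.
sample["IsCancellation"] = sample["InvoiceNo"].str.upper().str.startswith("C")

print(sample)
```

Checking both cases is a defensive choice, since the attribute description uses a lowercase 'c' while invoice codes may be stored in uppercase.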

Our goal for this project is to take a greenfield dataset and build a model that takes an array of features as input and outputs the customer profile.

Step 1 is to identify the features that might characterize a profile.

Step 2 is to use those features to identify the customer profile (via an appropriate clustering algorithm).

The purpose of this notebook is to upload a dataset to our AML workspace and to explore the data in order to identify the features (columns) that might be used to label a customer profile.
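As an illustration of what Step 1 might produce, transaction rows can be aggregated into one row per customer. This is a minimal sketch on made-up rows; the feature names `n_invoices`, `total_quantity` and `total_spend` are hypothetical candidates, not features this notebook has settled on.

```python
import pandas as pd

# Made-up transactions (not real data); one row per invoice line.
tx = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "CustomerID": ["17850", "17850", "17850", "13047"],
    "Quantity": [6, 2, 1, 10],
    "UnitPrice": [2.0, 5.0, 3.0, 1.5],
})

# Value of each invoice line.
tx["LineTotal"] = tx["Quantity"] * tx["UnitPrice"]

# One row per customer with a few candidate features.
features = tx.groupby("CustomerID").agg(
    n_invoices=("InvoiceNo", "nunique"),
    total_quantity=("Quantity", "sum"),
    total_spend=("LineTotal", "sum"),
)

print(features)
```

A table of this shape is the kind of input a clustering algorithm in Step 2 would consume.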

# Set up

In [None]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import matplotlib.pyplot as plt
import seaborn as sns
from azureml.core import Dataset, Datastore, Workspace

# Data

## Register Data

Our repository has a data folder.

We will take the data/online-retail.csv file and upload it to our AML workspace's default datastore.

If you are curious where that datastore is, it is displayed by line 5 of the cell below.

After we upload the file, we will register the uploaded file as a dataset.

In [None]:
workspace = Workspace.from_config()
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id, sep = '\n')

datastore = workspace.get_default_datastore()
datastore

Dataset.File.upload_directory('../../.aml/data', datastore)

datastore_path = [(datastore, 'online-retail.csv')]
# Create Dataset from .csv file
dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
# Register dataset in AML
dataset.register(workspace=workspace,
                 name='online-retail',
                 description='online retail dataset')


## Load Data Locally

> Potential Bug: 
> - The problem occurs whether `online-retail.csv` is registered as a dataset with `properties` == `None` or with `properties` == `date`. When it is loaded by the cell below using `azureml.core.Dataset`, a large proportion of the `InvoiceDate` column (dtype `datetime64[ns]`) becomes `NaT`.
> - See the temporary mitigation in [00-explore-data-01.ipynb](./00-explore-data-01.ipynb).
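One way to test whether date parsing is behind the `NaT` values is to read `InvoiceDate` as text and convert it explicitly. This is a minimal sketch on made-up strings, assuming a day-first `%d/%m/%Y %H:%M` layout; the actual layout in the file is not confirmed here, and a format mismatch is only one possible cause.

```python
import pandas as pd

# Made-up date strings in an assumed day-first layout (not real data).
raw = pd.Series(["01/12/2010 08:26", "13/12/2010 09:02", "25/12/2010 10:15"])

# Mismatched format (month first): a "month" above 12 cannot parse and becomes NaT.
month_first = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Matching format (day first): every value parses.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

print("NaT share, month-first:", month_first.isna().mean())
print("NaT share, day-first:  ", day_first.isna().mean())
```

If the real column shows a similar pattern, with `NaT` concentrated on days after the 12th of each month, a format mismatch would be a plausible explanation worth checking.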

In [None]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
dataset = Dataset.get_by_name(workspace, name='online-retail')
df_orig = dataset.to_pandas_dataframe()

In [None]:
df = df_orig.copy()
df

## Explore Data

### Basic

In [None]:
df.describe(include='all', datetime_is_numeric=True)

In [None]:
df.isnull().sum(axis=0) # axis=0 sums down the rows, giving one null count per column

In [None]:
df.info()

In [None]:
# change data type
df = df.astype({'StockCode': 'category',
                'Country': 'category'})

df.info()
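Converting low-cardinality text columns to `category` mainly saves memory, which the second `df.info()` call above should reflect. This is a minimal sketch on a made-up column (not real data) that compares the two representations.

```python
import pandas as pd

# Made-up column with few distinct values repeated many times.
countries = pd.DataFrame({"Country": ["United Kingdom", "France", "Germany"] * 1000})

# deep=True counts the Python string objects, not just the pointers to them.
as_text = countries["Country"].astype(object).memory_usage(deep=True)
as_category = countries["Country"].astype("category").memory_usage(deep=True)

print("object bytes:  ", as_text)
print("category bytes:", as_category)
```

The saving shrinks as the number of distinct values grows, so `StockCode` will typically benefit less than `Country`.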

### Correlation

In [None]:
df.corr()
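Depending on the pandas version, `corr()` either silently drops non-numeric columns or raises an error on them. This is a minimal sketch on made-up rows that selects the numeric columns explicitly, so the result does not depend on that behaviour.

```python
import pandas as pd

# Made-up rows (not real data) mixing numeric and non-numeric columns.
demo = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "UnitPrice": [4.0, 3.0, 2.0, 1.0],
    "Country": ["France", "France", "Germany", "Germany"],
})

# Keep only numeric columns before computing Pearson correlations.
numeric = demo.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```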

### Pair-plot

In [None]:
df_pairplot = df.copy()
#df_pairplot = df.sample(frac=0.01, random_state=9) # 1% sample, ~5,400 points

_ = sns.pairplot(df_pairplot, hue='Country', height=4, aspect=1.5); plt.show()

In [None]:
df_pairplot = df.sample(frac=0.01, random_state=9) # 1% sample, ~5,400 points
_ = sns.pairplot(df_pairplot, hue='Country', height=4, aspect=1.5); plt.show()

### `InvoiceDate`

In [None]:
df_InvoiceDate_NaT = df[df['InvoiceDate'].isnull()]; df_InvoiceDate_NaT
df_InvoiceDate_NaT.sort_values(by=['Quantity', 'InvoiceNo']).head(50)
df_InvoiceDate_NaT.sort_values(by=['Quantity', 'InvoiceNo']).tail(50)

Note:
- It is unclear why `InvoiceDate` contains `NaT`. TODO: register the data again, keeping the `InvoiceDate` format as it is.

### Histogram : Numerical

In [None]:
#df_hist = df[(df['Quantity']>=-10) & (df['Quantity']<=10)]
#df_hist.describe()

df_hist = df.copy()

bins=100

_ = df_hist[['Quantity', 'Country']].plot.hist(bins=bins, alpha=0.5, by='Country', figsize=(10,120)); plt.show()

_ = df_hist[['UnitPrice', 'Country']].plot.hist(bins=bins, alpha=0.5, by='Country', figsize=(10,120)); plt.show()

In [None]:
df.sort_values(by=['Quantity'])
df.sort_values(by=['Quantity']).head(50)
df.sort_values(by=['Quantity']).tail(50)

Note:
- Rows with a negative `Quantity` and no letter `C` in `InvoiceNo` seem to be stock adjustments, e.g. damaged or discarded items.
- `UnitPrice` contains the value `0`.
- `InvoiceDate` contains `NaT`.
- `CustomerID` contains `None`.

To Clean:
- Remove rows where `InvoiceDate` is `NaT`.
- Remove rows where `InvoiceNo` has no letter `C` and `Quantity` is `< 0`, or where `UnitPrice` is `0` (unsure what these rows mean).
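The cleaning rules above could be written as boolean masks. This is a minimal sketch on made-up rows, reading the second rule as "negative `Quantity` on a non-cancellation invoice, or a zero `UnitPrice`". The notes mark that rule as uncertain, so this reading is an assumption.

```python
import pandas as pd

# Made-up rows (not real data), one per case the rules mention.
rows = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536400", "536401", "536402"],
    "Quantity": [6, -1, -5, 3, 2],
    "UnitPrice": [2.55, 27.50, 1.00, 0.0, 1.25],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 09:41", "2010-12-01 10:00",
         "2010-12-01 11:00", None]
    ),
})

is_cancellation = rows["InvoiceNo"].astype(str).str.upper().str.startswith("C")

missing_date = rows["InvoiceDate"].isna()                      # rule 1
unexplained_negative = ~is_cancellation & (rows["Quantity"] < 0)  # rule 2, first part
zero_price = rows["UnitPrice"] == 0                            # rule 2, second part

# Keep only rows that trigger none of the removal rules.
cleaned = rows[~(missing_date | unexplained_negative | zero_price)]

print(cleaned)
```

Keeping each rule as its own named mask makes it easy to count how many rows each one removes before committing to it.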


In [None]:
df.sort_values(by=['UnitPrice'])
df.sort_values(by=['UnitPrice']).head(50)
df.sort_values(by=['UnitPrice']).tail(50)

Note:
- `InvoiceNo` contains the letter `A`, which is not in the data definition. It seems to mean `Adjust bad debt`, with `StockCode` `B`.
- `StockCode` values that do not seem to refer to a product include, but are not limited to, `AMAZONFEE`, `M`, `B`, `POST`, `DOT`.
    - TODO: Extract the `StockCode` values that contain letters to understand them further.

To Clean:
- Remove rows where `UnitPrice` is `0` or `NaN`.
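The TODO about extracting `StockCode` values that contain letters could start from two regular-expression checks. This is a minimal sketch on a made-up series; the special codes come from the note above, while `85123`, `85123A` and `22632` are illustrative values only.

```python
import pandas as pd

# Made-up codes: some numeric, one with a letter suffix, some purely alphabetic.
codes = pd.Series(["85123", "85123A", "POST", "M", "AMAZONFEE", "22632", "DOT"])

# Any code that contains at least one letter.
has_letter = codes.str.contains(r"[A-Za-z]", regex=True)

# Codes with no digit at all are the likeliest non-product entries.
no_digit = ~codes.str.contains(r"\d", regex=True)

print("Codes with letters:  ", sorted(codes[has_letter].unique()))
print("Codes with no digits:", sorted(codes[no_digit].unique()))
```

Separating the two groups matters because a code that mixes digits and letters may still be a product variant, whereas a purely alphabetic code is more likely a fee, postage or adjustment entry.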

### Histogram : Categorical

In [None]:
categorical_cols = ['Country']

for col in categorical_cols:
    df[col].value_counts().plot(kind='bar', figsize=(7,4), title=col)
    plt.show()


In [None]:
categorical_cols = df.select_dtypes(include=['category']).columns  # names of the categorical columns

for col in categorical_cols:
    df[col].value_counts(normalize=True).nlargest(100)

In [None]:
for col in df.columns:
    col
    df[col].value_counts(normalize=True).nsmallest(100)

In [None]:
for col in df.columns:
    col
    df[col].value_counts(normalize=True).nlargest(100)

### Scatter plot

In [None]:
df_catplot = df.sample(frac=0.01, random_state=9) # 1% sample, ~5,400 points
_ = sns.catplot(x="Quantity", y="UnitPrice", hue="Country", data=df_catplot, height=5, aspect=2) # 5-6 min
_ = plt.xticks(rotation=90)