# About Data
**Attribute Information**:

`InvoiceNo`: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.

`StockCode`: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

`Description`: Product (item) name. Nominal.

`Quantity`: The quantities of each product (item) per transaction. Numeric.

`InvoiceDate`: Invoice date and time. Numeric, the day and time when each transaction was generated.

`UnitPrice`: Unit price. Numeric, product price per unit in sterling.

`CustomerID`: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

`Country`: Country name. Nominal, the name of the country where each customer resides.

## Summary of Data Processing

Rows are removed based on the following conditions:
- `UnitPrice` == 0
- `CustomerID` == NaN
- `Country` == `Unspecified`
- `StockCode` == `POST`, `BANK CHARGES`, `PADS`, `DOT`, `CRUK`
- `CustomerID`, where effective `Quantity` < 0

Others:
- Data type changed
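
The row-level conditions listed above can be sketched as a single boolean mask. This is a minimal illustration on invented toy rows, not the notebook's actual processing: only the column names and the excluded `StockCode` values come from this notebook, and the effective-`Quantity` condition is left out because it needs a per-customer aggregation.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "StockCode": ["84946", "POST", "22423", "85123A", "DOT"],
    "UnitPrice": [1.25, 18.0, 0.0, 2.55, 5.0],
    "CustomerID": [16923.0, 12583.0, 13047.0, np.nan, 14096.0],
    "Country": ["United Kingdom", "France", "United Kingdom",
                "United Kingdom", "Unspecified"],
})

# StockCode values that the summary lists for removal
NON_PRODUCT_CODES = ["POST", "BANK CHARGES", "PADS", "DOT", "CRUK"]

# A row is kept only if it passes every condition
keep = (
    (toy["UnitPrice"] != 0)
    & toy["CustomerID"].notna()
    & (toy["Country"] != "Unspecified")
    & ~toy["StockCode"].isin(NON_PRODUCT_CODES)
)
cleaned = toy[keep]
print(cleaned)
```

Building one mask makes it easy to count how many rows each condition removes before committing to the filter.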

# Set up

In [None]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Data

### Option 1 of 3: Load from a Dataset registered in AML Workspace.
The cell below assumes that a dataset named `online-retail` is registered in the AML `workspace`.

*Potential Issue:*

> `online-retail.csv` has been registered as a dataset with each of the settings `properties == None` and `properties == date`. In both cases, when it is loaded by the cell below using `azureml.core.Dataset`, a large proportion of the `InvoiceDate` column, which has dtype `datetime64[ns]`, becomes `NaT`. Use option 2 of 3 or option 3 of 3 as a temporary workaround.
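
One way to check whether a loaded frame is affected is to measure the share of `NaT` values in `InvoiceDate`. The sketch below uses an invented three-value sample, and the `%m/%d/%Y %H:%M` format string is an assumption about how the timestamps are written in the CSV; adjust it if the file uses a different layout.

```python
import pandas as pd

# Hypothetical sample; the last value stands in for a timestamp that fails to parse.
raw = pd.Series(["12/1/2010 8:26", "12/12/2010 11:06", "not a date"],
                name="InvoiceDate")

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Fraction of rows that ended up as NaT
nat_share = parsed.isna().mean()
print(f"{nat_share:.0%} of InvoiceDate values are NaT")
```

Running the same `isna().mean()` check on `df_orig['InvoiceDate']` after each loading option gives a quick comparison between them.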

In [None]:
if False:
    from azureml.core import Workspace, Dataset

    # Get workspace configuration
    workspace = Workspace.from_config()
    print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id, sep = '\n')

    # Get dataset registered within the workspace
    dataset = Dataset.get_by_name(workspace, name='online-retail')

    # Convert type Dataset to type Pandas DataFrame
    df_orig = dataset.to_pandas_dataframe()

### Option 2 of 3: Load from Blob Storage
The cell below assumes that you have a container named `online-retail` and a blob named `online-retail.csv`.

In [None]:
if False:    
    from azure.storage.blob import BlobServiceClient
    import pandas as pd

    STORAGEACCOUNTURL= "your-storage-account-url"
    STORAGEACCOUNTKEY= "your-storage-account-key"
    LOCALFILENAME= "../../data/online-retail.csv"
    CONTAINERNAME= "online-retail" # i.e. folder
    BLOBNAME= "online-retail.csv" # i.e. file

    #download from blob
    blob_service_client_instance = BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
    blob_client_instance = blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME, snapshot=None)
    with open(LOCALFILENAME, "wb") as blob:
        blob_data = blob_client_instance.download_blob()
        blob_data.readinto(blob)

    # LOCALFILENAME is the file path
    df_orig = pd.read_csv(LOCALFILENAME)

### Option 3 of 3: Load from local

In [None]:
if True:
    LOCALFILENAME = "../../src/data/online-retail.csv"
    df_orig = pd.read_csv(LOCALFILENAME)

In [None]:
# Make a copy
df = df_orig.copy()
df

## Explore Data

### Basic Statistics

In [None]:
df.describe(include='all', datetime_is_numeric=True)

### Check for columns with `null`

In [None]:
df.isnull().sum(axis=0) # axis=0: count nulls per column

### Basic information about `df`

In [None]:
df.info()

### Change data type

In [None]:
# change data type
df = df.astype({'StockCode' : 'category',
                'Country' : 'category',})

# convert the 'InvoiceDate' column to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# information about df
df.info()

### Correlation between columns

In [None]:
df.corr()

## Investigate Data

### Investigate `Quantity`

In [None]:
df.sort_values(by=['Quantity']).head(10)
df.sort_values(by=['Quantity']).tail(10)

Note:
- `Quantity` has extreme values.
- Rows where `Quantity` is negative and `InvoiceNo` has no letter `C` seem to be stock adjustments, e.g. damaged, thrown away, etc.
- `UnitPrice` has the value `0` in some rows. < REMOVED >
- `CustomerID` has `nan`. What does this mean? Customers who bought things without registering? Remove for now. < REMOVED >
- To check: rows where `InvoiceNo` has no letter `C`, and `Quantity` is `<0`, or `UnitPrice` is `0`.

### Investigate `UnitPrice`

In [None]:
df.sort_values(by=['UnitPrice']).head(10)
df.sort_values(by=['UnitPrice']).tail(10)

Note:
- `UnitPrice` has extreme values.
- Some `InvoiceNo` values contain the letter `A`, which is not in the data definition. It seems to mean `Adjust bad debt`, with `StockCode` `B`.
- Some `StockCode` values do not seem to refer to a product, including, but not limited to, `AMAZONFEE`, `M`, `B`, `POST`, `DOT`.
    - Extract the `StockCode` values that contain letters to understand them further.

## Remove unwanted data

### Remove rows where `CustomerID` is `nan`

In [None]:
# drop rows where 'CustomerID' is nan
df.dropna(subset=['CustomerID'], inplace=True)

# check for columns with null
df.isnull().sum(axis=0) # axis=0: count nulls per column

Note:
- After removing rows where `CustomerID` is `NaN`, `Description` no longer contains `NaN`.

In [None]:
# change data type to int, then str, to drop the decimal point, e.g. 1234.0 -> '1234'
df = df.astype({'CustomerID' : int})
df = df.astype({'CustomerID' : str})

# show statistics
df.describe(include='all', datetime_is_numeric=True)
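
The two-step cast matters because `CustomerID` is read as a float column when it contains `NaN`. A minimal sketch on two invented identifiers shows the difference between casting directly to `str` and going through `int` first:

```python
import pandas as pd

# Invented identifiers, stored as floats the way a column with NaN would be read
ids = pd.Series([16923.0, 12583.0], name="CustomerID")

# Casting straight to str keeps the trailing ".0"
direct = ids.astype(str)

# Casting to int first drops the decimal part before the string conversion
two_step = ids.astype(int).astype(str)

print(direct.tolist())
print(two_step.tolist())
```

The `int` cast only works once the `NaN` rows are gone, which is why it follows the `dropna` step.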

### Normalise Text

In [None]:
# change text to lower case.
df['Description'] = df['Description'].str.lower()

### Remove rows where `Country`==`Unspecified`

In [None]:
# Remove rows where `Country`==`Unspecified`
df = df[df['Country']!='Unspecified']

### Check if `InvoiceNo` now only contains numeric and `C` + numeric

In [None]:
# replace all digits with '', i.e. extract the letters
# (raw string and regex=True so the pattern is treated as a regular expression)
df_temp = df['InvoiceNo'].str.replace(r'\d+', '', regex=True)

# Check for unique letters in column 'InvoiceNo'
df_temp.unique()


Note
- `InvoiceNo` now only contains numeric and `C` + numeric
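
The same check can be written as a pattern match that also flags cancellations. This is a sketch on invented invoice numbers, assuming that a valid `InvoiceNo` is an optional leading `C` followed by digits only:

```python
import pandas as pd

# Invented invoice numbers: a sale, a cancellation, and an unexpected prefix
invoice_no = pd.Series(["538370", "C538372", "A563185"], name="InvoiceNo")

# str.match anchors at the start; the trailing $ forces the whole string to fit
is_valid = invoice_no.str.match(r"C?\d+$")

# Cancellations are the invoices that start with the letter C
is_cancellation = invoice_no.str.startswith("C")

print(pd.DataFrame({"InvoiceNo": invoice_no,
                    "is_valid": is_valid,
                    "is_cancellation": is_cancellation}))
```

If `is_valid.all()` is true on the real column, no prefixes other than `C` remain.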

### Check Conditions Below:

- `UnitPrice` has the value `0`
- `InvoiceNo` has no letter `C`, and `Quantity` is `<0`, or `UnitPrice` is `0`

In [None]:
print('UnitPrice <= 0   AND   InvoiceNo contain letter C')
df[(df['UnitPrice']<=0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('UnitPrice < 0   AND   InvoiceNo contain letter C')
df[(df['UnitPrice']<0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('UnitPrice == 0   AND   InvoiceNo contain letter C')
df[(df['UnitPrice']==0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape

print('Quantity <= 0   AND   InvoiceNo contain letter C')
df[(df['Quantity']<=0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('Quantity < 0   AND   InvoiceNo contain letter C')
df[(df['Quantity']<0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('Quantity == 0   AND   InvoiceNo contain letter C')
df[(df['Quantity']==0) & (df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape

print('UnitPrice <= 0   AND   InvoiceNo not contain letter C')
df[(df['UnitPrice']<=0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('UnitPrice < 0   AND   InvoiceNo not contain letter C')
df[(df['UnitPrice']<0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('UnitPrice == 0   AND   InvoiceNo not contain letter C')
df[(df['UnitPrice']==0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape

print('Quantity <= 0   AND   InvoiceNo not contain letter C')
df[(df['Quantity']<=0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('Quantity < 0   AND   InvoiceNo not contain letter C')
df[(df['Quantity']<0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape
print('Quantity == 0   AND   InvoiceNo not contain letter C')
df[(df['Quantity']==0) & (~df['InvoiceNo'].str.contains('[a-zA-Z]'))].shape

Note:
- 40 rows have `UnitPrice` == 0 and a numeric `InvoiceNo`. What does this mean? Free gifts? Remove for now.
- Rows where `InvoiceNo` contains the letter `C` have `Quantity` < 0. This is consistent now.
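
The pairwise counts above can also be read from a single cross-tabulation. The sketch below uses four invented rows and only covers the sign of `Quantity`; the labels are hypothetical names chosen for readability.

```python
import pandas as pd

# Invented rows: two sales (one with a zero price) and two cancellations
toy = pd.DataFrame({
    "InvoiceNo": ["538370", "C538372", "538400", "C538500"],
    "Quantity": [6, -2, 3, -1],
    "UnitPrice": [1.25, 1.25, 0.0, 2.10],
})

# Label each row by whether InvoiceNo contains a letter
invoice_kind = (
    toy["InvoiceNo"].str.contains("[a-zA-Z]")
    .map({True: "has letter", False: "digits only"})
    .rename("InvoiceNo")
)

# Label each row by the sign of Quantity
quantity_sign = (
    toy["Quantity"].lt(0)
    .map({True: "Quantity < 0", False: "Quantity >= 0"})
    .rename("Quantity sign")
)

# Count rows for every combination of the two labels
summary = pd.crosstab(invoice_kind, quantity_sign)
print(summary)
```

A consistent dataset shows zeros off the diagonal: no cancellations with non-negative quantities and no ordinary invoices with negative ones.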

#### When `UnitPrice` == 0

In [None]:
# show rows where 'UnitPrice' == 0
df[df['UnitPrice']==0]

### Remove when `UnitPrice` == 0

In [None]:
# drop rows where 'UnitPrice'==0
df.drop(df[df['UnitPrice']==0].index, inplace=True)

### Investigate `StockCode`

In [None]:
# retrieve rows where 'StockCode' contains alphabets
df_temp = df[df['StockCode'].str.contains('[a-zA-Z]')] # contain any alphabets

# show those unique 'StockCode' that contains alphabets
df_temp['StockCode'].unique()

In [None]:
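# remove digits within 'StockCode', i.e. keep only the letters
# (raw string and regex=True so the pattern is treated as a regular expression)
df_temp = df['StockCode'].str.replace(r'\d+', '', regex=True)

# show the unique letter parts found in 'StockCode'
df_temp.unique()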

In [None]:
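from IPython.display import display

# Show rows whose 'StockCode' contains each letter pattern
for alphabets in df_temp.unique():
    if alphabets: # if alphabets is not empty
        # bare expressions are not displayed inside a loop, so print/display explicitly
        print(alphabets)
        display(df[df['StockCode'].str.contains(alphabets, regex=False)])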

Note
- Remove `StockCode` == `POST`, `BANK CHARGES`, `PADS`, `DOT`, `CRUK`
- Remove rows where `UnitPrice < 0.01`. < No such rows remain once the above is done >

In [None]:
df.shape

# Condition used to remove rows
condition = df['StockCode'].isin(['POST', 'BANK CHARGES', 'PADS', 'DOT', 'CRUK'])

# Remove rows based on condition stated above
df = df[~condition]

df.shape

#### Check Statistics

In [None]:
df.describe(include='all', datetime_is_numeric=True)

Note
- Extreme values for `Quantity` and `UnitPrice` still exist. < Investigate >

### Investigate extreme `Quantity` values 

In [None]:
df.sort_values(by=['Quantity']).head(10)
df.sort_values(by=['Quantity']).tail(10)

Assuming that we are only interested in effective sales, we exclude returned goods. For example:

Initially bought,

|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
|---|---|---|---|---|---|---|---|
|538370|84946|ANTIQUE SILVER TEA GLASS ETCHED|6|12/12/2010 11:06|1.25|16923.0|United Kingdom|

Then returned,

|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
|---|---|---|---|---|---|---|---|
|C538372|84946|ANTIQUE SILVER TEA GLASS ETCHED|-2|12/12/2010 11:12|1.25|16923.0|United Kingdom|

Effectively, this customer bought 4 units within that period.
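
The arithmetic behind this example (6 bought, 2 returned, 4 effective) can be reproduced with a grouped sum. This is a minimal sketch using the two rows from the example; the `CustomerID` value is an assumption for illustration.

```python
import pandas as pd

# The purchase and the return from the example above
rows = pd.DataFrame({
    "InvoiceNo": ["538370", "C538372"],
    "StockCode": ["84946", "84946"],
    "Quantity": [6, -2],
    "UnitPrice": [1.25, 1.25],
    "CustomerID": ["16923", "16923"],  # assumed identifier
    "Country": ["United Kingdom", "United Kingdom"],
})

# Rows sharing these keys are treated as the same customer/product/price
keys = ["CustomerID", "StockCode", "UnitPrice", "Country"]

# Summing Quantity nets the return against the purchase
effective = rows.groupby(keys, as_index=False)["Quantity"].sum()
print(effective)
```

Because `UnitPrice` is part of the key, a return recorded at a different price would not be netted against the original purchase.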

In [None]:
# Check for duplicated records based on ['CustomerID', 'StockCode', 'UnitPrice', 'Country']
df_duplicated = df[df.duplicated(subset=['CustomerID', 'StockCode', 'UnitPrice', 'Country'], keep=False)]

# Take the sum when grouped by ['CustomerID', 'StockCode', 'UnitPrice', 'Country']
df_effective_quantity = df_duplicated.groupby(['CustomerID', 'StockCode', 'UnitPrice', 'Country'], as_index=False, observed=True)['Quantity'].sum() 

# Basic statistics
df_effective_quantity.describe()

# Display dataframe
df_effective_quantity

Note:
- 'Effective Quantity' is intended for feature engineering in [01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb](01-analyse-customer-value-by-frequency-recency-monetary-value.ipynb)
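
The summary at the top lists a removal condition on `CustomerID` where effective `Quantity` < 0, which this notebook does not apply. One possible reading, sketched below on invented data, is to drop every customer that has at least one negative effective quantity; this interpretation is an assumption, not the confirmed rule.

```python
import pandas as pd

# Invented effective quantities per customer and product
effective = pd.DataFrame({
    "CustomerID": ["16923", "16923", "17001"],
    "StockCode": ["84946", "22423", "84946"],
    "Quantity": [4, 10, -3],
})

# Customers with at least one negative effective quantity
negative_customers = effective.loc[effective["Quantity"] < 0, "CustomerID"].unique()

# Keep only rows belonging to the remaining customers
kept = effective[~effective["CustomerID"].isin(negative_customers)]
print(kept)
```

A narrower alternative would drop only the negative rows rather than the whole customer; which one applies depends on how the downstream notebook defines the feature.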

====================================================================================================

# Data Management

## Upload Processed Data to Datastore

In [None]:
from azureml.core import Workspace, Dataset

workspace = Workspace.from_config()
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id, sep = '\n')

datastore = workspace.get_default_datastore()
datastore

#if True:
if False: # Replace `False` with `True` to run code below
    filename = '../../.aml/data/online-retail-processed.csv'

    # Save to local
    df.to_csv(filename, index=False)

    Dataset.File.upload_directory('../../.aml/data', datastore)

## Register Dataframe as Dataset

In [None]:
from azureml.core import Workspace, Dataset

workspace = Workspace.from_config()
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id, sep = '\n')

datastore = workspace.get_default_datastore()
datastore

#if True:
if False: # Replace `False` with `True` to run code below

    # Dataset name to register as 
    name = 'online-retail-processed'

    # create a new dataset
    Dataset.Tabular.register_pandas_dataframe(dataframe=df, 
                                            target=datastore, 
                                            name=name, 
                                            show_progress=True, 
                                            tags={'Purpose':'Tutorial'})