In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%load_ext lab_black
# or nb_black

# Imports and data

In [3]:
from ecommercerecommendation.utils.data import (
    get_data,
    venn_sets,
    REMOVE_CHARS,
    remove_chars,
    clean_entry,
)

In [4]:
df = get_data("clean_data")

In [5]:
print("Columns: " + ", ".join(df.columns))

Columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country


Available information/features:
- **InvoiceNo:** shows which products were purchased together in one invoice
- **CustomerID:** shows which products were bought by the same customer
- **StockCode:** product ID
- **Description:** free-text product name, for future text-based/LLM solutions
- **Quantity:** is it important?
- **UnitPrice:** is it important? Maybe as a feature for tabular models
- **InvoiceDate:** source for date features such as month and day of the week, maybe hour of the day
- **Country:** one-hot encode (OHE) if needed later
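
As a rough illustration of the `InvoiceDate` and `Country` ideas in the list above, here is a minimal sketch on a toy frame. The toy rows, the ISO date format, and the new column names (`month`, `day_of_week`, `hour`) are assumptions made for the example and are not taken from the real dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "InvoiceDate": ["2010-12-01 08:26:00", "2011-01-04 10:00:00"],
        "Country": ["United Kingdom", "France"],
    }
)

# Parse the timestamps, then derive simple calendar features.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"])
toy["month"] = toy["InvoiceDate"].dt.month
toy["day_of_week"] = toy["InvoiceDate"].dt.dayofweek  # Monday is 0
toy["hour"] = toy["InvoiceDate"].dt.hour

# One-hot encode the country column; other columns are left untouched.
features = pd.get_dummies(toy, columns=["Country"])
print(features.columns.tolist())
```

Whether these features help at all is still an open question at this point in the notebook; the sketch only shows that they are cheap to derive.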

# Text cleaning

For Countries, see the last notebook.

For special characters, first check whether the descriptions contain anything other than the chosen special characters:

In [6]:
print(f"Non alpha-numerical characters to deal with: {REMOVE_CHARS}")

Non alpha-numerical characters to deal with: .,'"&/- !()


In [7]:
# No strange characters in the given dataset:
[
    desc
    for desc in sorted(df["Description"].unique())
    if not remove_chars(desc).isalnum()
]

[]
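
The source of `remove_chars` is not shown in this notebook, so the following is only a sketch of what such a check might look like, assuming the helper simply deletes every character in `REMOVE_CHARS`. The names `REMOVE_CHARS_SKETCH` and `remove_chars_sketch` and the toy descriptions are hypothetical; the character set is copied from the printed output above.

```python
# Character set copied from the printed REMOVE_CHARS output (includes a space).
REMOVE_CHARS_SKETCH = ".,'\"&/- !()"


def remove_chars_sketch(text: str, chars: str = REMOVE_CHARS_SKETCH) -> str:
    """Delete every occurrence of the given characters (assumed behaviour)."""
    return text.translate(str.maketrans("", "", chars))


# Toy descriptions: the last one contains '#', which is not in the chosen set.
descriptions = [
    "LUNCH BAG BLACK SKULL.",
    'PEARL & SHELL 42"NECKL. GREEN',
    "SIGN #2",
]

# Anything left over that is not purely alphanumeric needs a closer look.
leftovers = [d for d in descriptions if not remove_chars_sketch(d).isalnum()]
print(leftovers)
```

One caveat of this kind of check: `"".isalnum()` is `False`, so a description made up only of removable characters would also be flagged.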

Only those characters appear. Let's see how they are used:

In [8]:
for c in [C for C in REMOVE_CHARS if C != " "]:
    print(f"----------------- \nCharacter: {c}\n")
    print(
        "\n".join(
            "|"
            + df[df["Description"].str.find(c) >= 0]["Description"].drop_duplicates()
            + "|"
        )
    )

----------------- 
Character: .

|RED WOOLLY HOTTIE WHITE HEART.|
|LUNCH BAG BLACK SKULL.|
|COLOUR GLASS. STAR T-LIGHT HOLDER|
|GARLAND, MAGIC GARDEN 1.8M|
|HAND OPEN SHAPE DECO.WHITE|
|ESSENTIAL BALM 3.5g TIN IN ENVELOPE|
|SMOKEY GREY COLOUR D.O.F. GLASS|
|FLOWER GLASS GARLAND NECKL.36"BLACK|
|DIAMANTE RING ASSORTED IN BOX.|
|PEARL & SHELL 42"NECKL. GREEN|
|SILVER M.O.P ORBIT BRACELET|
|GREETING CARD, OVERCROWDED POOL.|
|GREETING CARD, TWO SISTERS.|
|GOLD/M.O.P PENDANT ORBIT NECKLACE|
|SILVER/M.O.P PENDANT ORBIT NECKLACE|
|SILVER M.O.P ORBIT DROP EARRINGS|
|GOLD M.O.P ORBIT DROP EARRINGS|
|WINE BOTTLE DRESSING LT.BLUE|
|FLOWER GLASS GARLAND NECKL.36"BLUE|
|PEARL & SHELL 42"NECKL. IVORY|
|FLOWER GLASS GARLAND NECKL.36"GREEN|
|GOLD M.O.P ORBIT BRACELET|
|SILVER M.O.P. ORBIT NECKLACE|
|GOLD M.O.P. ORBIT NECKLACE|
----------------- 
Character: ,

|AIRLINE LOUNGE,METAL SIGN|
|FANCY FONT BIRTHDAY CARD,|
|TRAY, BREAKFAST IN BED|
|SWISS ROLL TOWEL, CHOCOLATE SPOTS|
|BIRTHDAY CARD, RETRO SPOT|

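In [9]:
# Note: `in` on a Series tests the index labels, not the values,
# so this result says nothing about the description text itself.
"e" in df["Description"]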

False


We'll do the following:
- `.`: remove it at the end of the description, and inside the description when it is followed by a space
- `,`: replace with a space, then collapse multiple spaces
- `'`: keep as is
- `"`: add a space after `number+"`
- `&`: replace with ` AND `
- `/`: replace with a space
- `-`: replace ` - ` with a single space
- `!()`: replace with a space

Implemented in `ecommercerecommendation.utils.data.clean_entry`.
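
The rules listed above can be sketched roughly as follows. This is not the project's actual `clean_entry`, whose source is not shown here; `clean_entry_sketch` is a hypothetical name, and the order of the steps and the regular expressions are assumptions chosen to match the stated rules.

```python
import re


def clean_entry_sketch(text: str) -> str:
    """Apply the cleaning rules from the list above (illustrative only)."""
    # '.': drop it at the end of the string or when whitespace follows.
    text = re.sub(r"\.(?=\s|$)", "", text)
    # '"': add a space after digit+'"' when something is glued to it.
    text = re.sub(r'(\d")(?=\S)', r"\1 ", text)
    # '&': spell it out.
    text = text.replace("&", " AND ")
    # '-': only the spaced form ' - ' is removed; 'T-LIGHT' keeps its hyphen.
    text = text.replace(" - ", " ")
    # ',', '/', '!', '(' and ')': replace with a space.
    text = re.sub(r"[,/!()]", " ", text)
    # Collapse runs of whitespace and trim the ends. Apostrophes are untouched.
    return re.sub(r"\s+", " ", text).strip()


for raw in [
    "COLOUR GLASS. STAR T-LIGHT HOLDER",
    'PEARL & SHELL 42"NECKL. GREEN',
    "GOLD/M.O.P PENDANT ORBIT NECKLACE",
]:
    print(f"|{clean_entry_sketch(raw)}|")
```

Periods inside abbreviations or numbers (`M.O.P`, `1.8M`, `DECO.WHITE`) survive because they are not followed by a space, which matches the first rule.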

Let's see the results:

In [11]:
for c in [C for C in REMOVE_CHARS if C != " "]:
    print(f"----------------- \nCharacter: {c}\n")
    print(
        "\n".join(
            "|"
            + df[df["Description"].str.find(c) >= 0]["Description"]
            .drop_duplicates()
            .apply(clean_entry)
            + "|"
        )
    )

----------------- 
Character: .

|RED WOOLLY HOTTIE WHITE HEART|
|LUNCH BAG BLACK SKULL|
|COLOUR GLASS STAR T-LIGHT HOLDER|
|GARLAND MAGIC GARDEN 1.8M|
|HAND OPEN SHAPE DECO.WHITE|
|ESSENTIAL BALM 3.5g TIN IN ENVELOPE|
|SMOKEY GREY COLOUR D.O.F GLASS|
|FLOWER GLASS GARLAND NECKL.36" BLACK|
|DIAMANTE RING ASSORTED IN BOX|
|PEARL AND SHELL 42" NECKL GREEN|
|SILVER M.O.P ORBIT BRACELET|
|GREETING CARD OVERCROWDED POOL|
|GREETING CARD TWO SISTERS|
|GOLD M.O.P PENDANT ORBIT NECKLACE|
|SILVER M.O.P PENDANT ORBIT NECKLACE|
|SILVER M.O.P ORBIT DROP EARRINGS|
|GOLD M.O.P ORBIT DROP EARRINGS|
|WINE BOTTLE DRESSING LT.BLUE|
|FLOWER GLASS GARLAND NECKL.36" BLUE|
|PEARL AND SHELL 42" NECKL IVORY|
|FLOWER GLASS GARLAND NECKL.36" GREEN|
|GOLD M.O.P ORBIT BRACELET|
|SILVER M.O.P ORBIT NECKLACE|
|GOLD M.O.P ORBIT NECKLACE|
----------------- 
Character: ,

|AIRLINE LOUNGE METAL SIGN|
|FANCY FONT BIRTHDAY CARD|
|TRAY BREAKFAST IN BED|
|SWISS ROLL TOWEL CHOCOLATE SPOTS|
|BIRTHDAY CARD RETRO SPOT|
|FEATHER

That looks OK.

In [12]:
df["Description"] = df["Description"].apply(clean_entry)

In [13]:
df.to_csv("cache/clean_data.csv", index=False, encoding="utf-8")