# Task Modelling & Real-World Model Performance: Analysing Shopify

This notebook is an entry point to task modelling: thinking about how ML models can be applied usefully to real-world tasks.

An important aspect of any application scenario is the performance a model actually shows on real-world data.

To explore both aspects, the notebook downloads and processes data from Shopify, a no-code e-commerce platform.

The analysis starts from the descriptions of these shops and applies a trained Named Entity (NE) recognition model to extract the key entities from them.

## Install Necessary Models and Modules

The following code cell downloads a spaCy language model (installed as a Python package) and the CSV file with the Shopify shop data.


In [50]:
# Download a model for spaCy (a machine learning library focused on text processing)
!python -m spacy download en_core_web_sm
# Download the CSV file with the Shopify shop data
!wget https://github.com/eg-bfh/PoML/raw/main/Top10KShopify.csv

2022-11-29 14:58:59.939830: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 4.6 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
--2022-11-29 14:59:06--  https://github.com/eg-bfh/PoML/raw/main/Top10KShopify.csv
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/eg-bfh/PoML/main/Top10KShopify.csv [following]
--2022-11-29 14:59:07--  https://raw.githubu

In [35]:
# Import pandas, which parses the CSV file into a DataFrame (a table data structure).
import pandas as pd

In [24]:
# Read the data and keep only the rows where the site description is not null
shopify_df = pd.read_csv('Top10KShopify.csv')
shopify_desc_df = shopify_df[shopify_df['Site Description'].notnull()]
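
To make the effect of the `notnull()` filter concrete, here is a minimal sketch on a toy table. The shop names and descriptions are made up for illustration and are not taken from the dataset; only the column names `Site Title` and `Site Description` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for Top10KShopify.csv
toy_df = pd.DataFrame({
    "Site Title": ["Shop A", "Shop B", "Shop C"],
    "Site Description": ["Handmade candles", None, "Vintage posters"],
})

# Boolean mask: True where a description is present
has_description = toy_df["Site Description"].notnull()
toy_desc_df = toy_df[has_description]

print(f"kept {len(toy_desc_df)} of {len(toy_df)} rows")
```

Running the same comparison of `len(shopify_desc_df)` against `len(shopify_df)` shows how many shops are dropped because they have no description.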

## Exercise: NE Extraction on Shopify Descriptions

The following cell loads a machine learning model and applies it to extract named entities from the shop descriptions.

The model used is `en_core_web_sm`, where `sm` stands for small.

You can visit the following page to get more information about the available models:

https://spacy.io/models/en

In [49]:

# Extract named entities with spaCy
from textwrap import fill

import spacy
from prettytable import PrettyTable, ALL

nlp = spacy.load("en_core_web_sm")

# One table row per shop description; ALL draws a horizontal rule after every row
t = PrettyTable(['Shop Description', 'Named Entities'])
t.hrules = ALL

# Run the pipeline over a random sample of 100 descriptions
for doc in nlp.pipe(shopify_desc_df.sample(n=100)['Site Description']):
    # Collect (text, label) pairs and wrap long cells so the table stays readable
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    t.add_row([fill(doc.text, width=50), fill(str(entities), width=40)])

print(t)


+---------------------------------------------------------------------------------------------+------------------------------------------+
|                                       Shop Description                                      |            Extracted Keywords            |
+---------------------------------------------------------------------------------------------+------------------------------------------+
|                         Australia's Offical Panini NBA Distributor -                        |   [('Australia', 'GPE'), ('- Pokemon',   |
|                      Pokemon, YuGiOh! Magic The Gathering Trading Cards                     |                'PERSON')]                |
|                      - Buy Direct and Save - Sell Your Cards - Beckett                      |                                          |
|                                         BGS Grading                                         |                                          |
+--------------------------
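
Reading 100 table rows one by one makes it hard to spot patterns. One possible way to get an overview is to count how often each entity label occurs in the sample. The helper below is a minimal sketch, not part of the original notebook; the name `count_entity_labels` is hypothetical, and it only assumes that each processed document exposes `doc.ents` with a `label_` attribute, as spaCy documents do.

```python
from collections import Counter


def count_entity_labels(docs):
    """Count how often each entity label occurs across processed documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# Usage with the objects from the cell above (requires the loaded model and data):
# docs = nlp.pipe(shopify_desc_df.sample(n=100)['Site Description'])
# print(count_entity_labels(docs).most_common())
```

A label that dominates the counts, or one that appears on obviously wrong spans such as `- Pokemon` tagged as `PERSON`, is a good starting point for the observations asked for below.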

## Exercise: Observations

What are your observations in terms of perceived performance?

### Performance

...

### Model Behavior

... 

What do you assume to be the root cause of these observations?

### Potential Cause I
...

### Potential Cause II
...


## Exercise: Keyphrase Extraction on Shopify Descriptions



In [52]:
# This installs pytextrank, a library focused on extracting keyphrases (the most important concepts) from text.
%pip install pytextrank

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytextrank
  Downloading pytextrank-3.2.4-py3-none-any.whl (30 kB)
Collecting pygments>=2.7.4
  Downloading Pygments-2.13.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 28.4 MB/s 
Collecting graphviz>=0.13
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
     |████████████████████████████████| 47 kB 4.2 MB/s 
Collecting icecream>=2.1
  Downloading icecream-2.1.3-py2.py3-none-any.whl (8.4 kB)
Collecting asttokens>=2.0.1
  Downloading asttokens-2.2.0-py2.py3-none-any.whl (26 kB)
Collecting executing>=0.3.1
  Downloading executing-1.2.0-py2.py3-none-any.whl (24 kB)
Collecting colorama>=0.3.9
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting matplotlib>=3.3
  Downloading matplotlib-3.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
     |████████████████████████████████| 11.2 MB 44.7 MB/s 
Collec

In [53]:
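import spacy
import pytextrank  # importing registers the "textrank" pipeline component with spaCy

nlp = spacy.load('en_core_web_sm')
# Append TextRank to the pipeline; ranked phrases become available as doc._.phrases
nlp.add_pipe("textrank")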



<pytextrank.base.BaseTextRankFactory at 0x7f5d3a060890>

In [None]:
# Draw a small random sample of shops and print the top-ranked phrases for each
sampled_df = shopify_desc_df.sample(n=5)
print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
for site_desc, site_title in zip(sampled_df['Site Description'], sampled_df['Site Title']):
    doc = nlp(site_desc)
    print(site_title)
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
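    # Show the first five ranked phrases: rank score, occurrence count, phrase text
    for p in doc._.phrases[:5]:
        print('{:.4f} {:5d}  {}'.format(p.rank, p.count, p.text))
        # chunks lists the text spans in which the phrase occurs
        print(p.chunks)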
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Modern Farmhouse Style Home Decor – The Cozy Nest
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0.2195     1  farmhouse decor
[farmhouse decor]
0.2070     1  modern farmhouse decor
[modern farmhouse decor]
0.1994     1  rustic decor
[rustic decor]
0.1992     1  tabletop decor
[tabletop decor]
0.1984     1  industrial decor
[industrial decor]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shape Concept Fajas Colombianas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0.2226     1  affordable prices
[affordable prices]
0.1956     1  Shape Concept Fajas Colombianas Premium
[Shape Concept Fajas Colombianas Premium]
0.1038     1  the highest quality products
[the highest quality products]
0.0578     1  our customers
[our customers]
0.0000     2  We
[We, We]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
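
The output above includes phrases such as `We` with a rank of 0.0000, which carry little information about the shop. One possible post-processing step is to drop phrases below a rank threshold. The sketch below is an illustration only: the helper name `filter_phrases` and the threshold of 0.05 are assumptions, not values recommended by pytextrank. It relies only on each phrase having a `rank` attribute, as the phrases in `doc._.phrases` do.

```python
def filter_phrases(phrases, min_rank=0.05, limit=5):
    """Keep at most `limit` phrases whose rank is at least `min_rank`."""
    # Sort explicitly so the helper does not depend on the input order
    ranked = sorted(phrases, key=lambda p: p.rank, reverse=True)
    return [p for p in ranked if p.rank >= min_rank][:limit]


# Usage with a document processed by the pipeline above:
# for p in filter_phrases(doc._.phrases):
#     print(f"{p.rank:.4f} {p.count:5d}  {p.text}")
```

Whether a fixed threshold is appropriate depends on the descriptions; very short texts may have few phrases above any cut-off.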

## Exercise: Observations

What are your observations in terms of perceived performance?
Following your impressions from the first exercise, try repeating the extraction with a variety of different models.
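
One way to repeat the extraction with several models is to run the same texts through each pipeline and compare the results side by side. The sketch below shows this for named entities and is an assumption about how one might organise the comparison, not part of the original notebook: the helper name `compare_models` and its `load` parameter are hypothetical. Model names such as `en_core_web_md` are listed on the spaCy models page and must be downloaded first with `python -m spacy download <name>`.

```python
def compare_models(texts, model_names, load=None):
    """Return {model_name: [entities per text]} for each named pipeline."""
    if load is None:
        import spacy  # imported lazily so a stub loader can be used instead
        load = spacy.load
    results = {}
    for name in model_names:
        nlp = load(name)  # raises an error if the model is not installed
        results[name] = [
            [(ent.text, ent.label_) for ent in doc.ents]
            for doc in nlp.pipe(texts)
        ]
    return results


# Usage (requires the data frame and the downloaded models):
# texts = list(shopify_desc_df.sample(n=5, random_state=0)['Site Description'])
# for name, ents in compare_models(texts, ["en_core_web_sm", "en_core_web_md"]).items():
#     print(name, ents)
```

Fixing `random_state` keeps the sample identical between runs, so differences in the output come from the models rather than from the sampled shops.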

### Performance

...

### Model Behavior

... 

What do you assume to be the root cause of these observations?

### Potential Cause I
...

### Potential Cause II
...
