### Set up everything
I am using spaCy's [HashEmbedCNN](https://spacy.io/api/architectures#HashEmbedCNN) architecture.

In [58]:
%pip install spacy
!spacy init config config.cfg --lang de --pipeline ner 


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: de
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Next, edit the generated config so that it points at our training data and uses the intended language and pipeline.


In [48]:
from spacy import util

# Load the file with spaCy's own config reader instead of configparser:
# configparser lowercases keys (for example `nO`) and treats every value as a
# plain string, while spaCy config values are JSON-typed.
config = util.load_config("config.cfg", interpolate=False)

config["paths"]["train"] = "output_data.spacy"

config["nlp"]["lang"] = "de"
# The generated `ner` component listens to a shared `tok2vec` component,
# so both have to stay in the pipeline (see the training log further down).
config["nlp"]["pipeline"] = ["tok2vec", "ner"]

# interpolate=False keeps references such as ${paths.train} as variables
config.to_disk("config.cfg", interpolate=False)

print("Config file updated successfully!")


Config file updated successfully!


In [None]:
# import configparser

# config = configparser.ConfigParser()
# config.read("config.cfg")

# # Update the tok2vec model architecture to use HashEmbedCNN.v2
# config['components.tok2vec.model'] = {
#     '@architectures': 'spacy.Tok2Vec.v2',
#     'embed': {
#         '@architectures': 'spacy.HashEmbedCNN.v2',
#         'width': 96,
#         'depth': 1,
#         'embed_size': 2000,
#         'window_size': 1,
#         'maxout_pieces': 3,
#         'subword_features': True,
#         'pretrained_vectors': False
#     },
#     'encode': {
#         '@architectures': 'spacy.MaxoutWindowEncoder.v2',
#         'width': 96,
#         'depth': 1, # here we can specify the depth of CNN layers
#         'window_size': 1,
#         'maxout_pieces': 3
#     }
# }

# with open('config.cfg', 'w') as configfile:
#     config.write(configfile)
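
A note on the commented-out cell above: `spacy.HashEmbedCNN.v2` is documented as a complete tok2vec architecture (hash embedding plus CNN encoder), so it would sit directly under `[components.tok2vec.model]` rather than inside the `embed` block of `spacy.Tok2Vec.v2`. The sketch below only renders such a section as text that could be pasted into config.cfg. The helper `render_section` is hypothetical, the values mirror the commented-out cell, and `pretrained_vectors` is written as `null` because spaCy config values are JSON.

```python
import json


def render_section(name, values):
    """Render a dict as one spaCy-style config section (values are JSON-encoded)."""
    lines = [f"[{name}]"]
    for key, value in values.items():
        lines.append(f"{key} = {json.dumps(value)}")
    return "\n".join(lines)


# Parameters of spacy.HashEmbedCNN.v2; the numbers are taken from the cell above
hash_embed_cnn = {
    "@architectures": "spacy.HashEmbedCNN.v2",
    "pretrained_vectors": None,   # rendered as null
    "width": 96,
    "depth": 1,                   # number of CNN layers
    "embed_size": 2000,
    "window_size": 1,
    "maxout_pieces": 3,
    "subword_features": True,     # rendered as true
}

section_text = render_section("components.tok2vec.model", hash_embed_cnn)
print(section_text)
```

Any setting elsewhere in the config that refers to the old `embed` or `encode` sub-blocks (for example a width reference used by the `ner` listener) would probably need to be updated to match.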


Next, we store the annotations as span groups, the format the spancat component reads, because span groups may contain overlapping spans while `doc.ents` may not. Note that several separate mentions with the same label do not overlap by themselves. For example, text="Google, Apple and Microsoft are huge companies" with entities = [{start=..., end=..., label='company_name'}, {start=..., end=..., label='company_name'}, {start=..., end=..., label='company_name'}] contains three distinct, non-overlapping spans. The cell below handles both situations and converts the data into the .spacy format.
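
To make the difference between repeated labels and truly overlapping spans concrete, here is a minimal pure-Python sketch. The function `find_overlaps` and the `company_list` label are made up for illustration, and the character offsets are computed by hand from the example sentence rather than taken from the real data.

```python
def find_overlaps(entities):
    """Return pairs of entities whose character ranges [start, end) intersect."""
    ordered = sorted(entities, key=lambda e: (e["start"], e["end"]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so no later entity can overlap `first`
            overlaps.append((first, second))
    return overlaps


text = "Google, Apple and Microsoft are huge companies"
separate = [
    {"start": 0, "end": 6, "label": "company_name"},    # Google
    {"start": 8, "end": 13, "label": "company_name"},   # Apple
    {"start": 18, "end": 27, "label": "company_name"},  # Microsoft
]
# Hypothetical extra span covering "Google, Apple"
nested = separate + [{"start": 0, "end": 13, "label": "company_list"}]

print(len(find_overlaps(separate)))  # same label three times, no overlap
print(len(find_overlaps(nested)))    # the long span overlaps two shorter ones
```

Only data of the second kind actually needs span groups. Data of the first kind would also fit into `doc.ents`.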

In [74]:
import json
import spacy
from spacy.tokens import DocBin

with open("output.json", "r", encoding="utf-8") as file:  # load data for training
    data = json.load(file)

# A blank pipeline is enough here: only the tokenizer is needed to build Docs,
# so no SpanCategorizer component has to be registered or added.
nlp = spacy.blank("de")

doc_bin = DocBin()

for entry in data:
    text = entry["text"]
    entities = entry["entities"]
    doc = nlp.make_doc(text)

    spans = []
    for ent in entities:
        start = ent["start"]
        end = ent["end"]
        label = ent["label"]
        # char_span returns None when the offsets do not align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            spans.append(span)

    # Span groups may overlap; "sc" is the default key that spancat reads
    doc.spans["sc"] = spans
    doc_bin.add(doc)

doc_bin.to_disk("output_data.spacy")
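
One thing to keep in mind: the config generated at the top is for the `ner` pipeline, and as far as I understand spaCy's training data handling, `ner` learns from `doc.ents`, not from `doc.spans["sc"]`. The sketch below shows one way to fill both for a single document. It assumes that dropping overlapping spans with `spacy.util.filter_spans` (which keeps the longest spans) is acceptable for NER, and it uses hand-computed offsets for the example sentence.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("de")
doc = nlp.make_doc("Google, Apple and Microsoft are huge companies")

# Illustrative annotations in the same shape as the entries in output.json
entities = [
    {"start": 0, "end": 6, "label": "company_name"},
    {"start": 8, "end": 13, "label": "company_name"},
    {"start": 18, "end": 27, "label": "company_name"},
]

spans = []
for ent in entities:
    span = doc.char_span(ent["start"], ent["end"], label=ent["label"])
    if span is not None:
        spans.append(span)

doc.spans["sc"] = spans            # read by spancat, overlaps allowed
doc.ents = filter_spans(spans)     # read by ner, overlaps removed

ent_texts = [ent.text for ent in doc.ents]
print(ent_texts)
```

Alternatively, the config could be generated for a spancat pipeline instead of `ner`, in which case `doc.spans["sc"]` alone should be enough.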


In [23]:
!python3 -m spacy download de_core_news_sm # download the german model

Defaulting to user installation because normal site-packages is not writeable
Collecting de-core-news-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl (14.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


In [75]:
# train the model with the config file we created above and the data we prepared
# the output directory is ./output
# the training data is ./output_data.spacy
!python3 -m spacy train config.cfg --output ./output --paths.train output_data.spacy 

ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00      0.00    0.00    0.00    0.00    0.00
 34     200          0.00      0.00    0.00    0.00    0.00    0.00
 75     400          0.00      0.00    0.00    0.00    0.00    0.00
127     600          0.00      0.00    0.00    0.00    0.00    0.00
192     800          0.00      0.00    0.00    0.00    0.00    0.00
272    1000          0.00      0.00    0.00    0.00    0.00    0.00
372    1200          0.00      0.00    0.00    0.00    0.00    0.00
472    1400          0.00      0.00    0.00    0.00    0.00    0.00
627    1600          0.00      0.00    0.00    0.00    0.00    0.00
✔ Saved pipeline to output directory
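
The train command above sets only `--paths.train`, while the message printed by `init config` also suggests `--paths.dev`. Without a development set the ENTS_F, ENTS_P and ENTS_R columns have nothing to be evaluated on, which may be one reason they stay at 0.00. A minimal sketch of splitting the JSON entries before conversion follows. The function `split_entries`, the 20% dev fraction and the dummy entries are assumptions for illustration.

```python
import random


def split_entries(entries, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the entries and split it into (train, dev) lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    if len(shuffled) > 1:
        n_dev = max(1, int(len(shuffled) * dev_fraction))
    else:
        n_dev = 0  # nothing to hold out from a single entry
    return shuffled[n_dev:], shuffled[:n_dev]


# Dummy entries in the same shape as output.json
entries = [{"text": f"example {i}", "entities": []} for i in range(10)]
train_entries, dev_entries = split_entries(entries)
print(len(train_entries), len(dev_entries))
```

Each list can then be converted to its own DocBin file and passed to `spacy train` via `--paths.train` and `--paths.dev`.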

In [76]:
model_path = 'output/model-last'  # path to the model
nlp = spacy.load(model_path)

example_text = "UVP 19.99 14.99 UVP GORDON'S London Dry Gin je 0,7 I UVP 1.49 1.19 UVP COCA-COLA Classic je 0,5 I UVP 3.99 3.59 UVP LAYS Chips je 150 g 2.49 Super Angebot LAYS Chips Knabberbox je 100 g Sparen auf ausgewählte Produkte ab 06.08. bis 07.09. Sonderaktionen für Mitglieder!"

doc = nlp(example_text)

print("Entities detected:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Entities detected:
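
Since nothing was detected, a quick sanity check is to count what the training file actually contains. The sketch below counts `doc.ents` entities and `doc.spans["sc"]` spans. The helper `count_annotations` is hypothetical, and the in-memory document only mimics how the training file was built in this notebook.

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def count_annotations(docs, spans_key="sc"):
    """Return (number of doc.ents entities, number of spans under spans_key)."""
    n_ents = 0
    n_spans = 0
    for doc in docs:
        n_ents += len(doc.ents)
        if spans_key in doc.spans:
            n_spans += len(doc.spans[spans_key])
    return n_ents, n_spans


nlp = spacy.blank("de")

# Small round trip through DocBin with a span group but no doc.ents
doc = nlp.make_doc("Google, Apple and Microsoft are huge companies")
doc.spans["sc"] = [doc.char_span(0, 6, label="company_name")]
restored = list(DocBin(docs=[doc]).get_docs(nlp.vocab))
demo_counts = count_annotations(restored)
print(demo_counts)

# The same check on the real training file, if it is in the working directory
data_path = Path("output_data.spacy")
if data_path.exists():
    docs = DocBin().from_disk(data_path).get_docs(nlp.vocab)
    print(count_annotations(docs))
```

If the first number is 0 for the real file, the `ner` component had no entity annotations to learn from, which would be consistent with the all-zero losses in the training log.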


### Conclusions and challenges
Challenges encountered:
1. Creating the config was much harder than I anticipated: it took a lot of fiddling and debugging to understand the config functions and how they interact. I did not find the tooling spaCy provides very helpful here.
2. Converting our existing data to the format spaCy requires is tedious, and I ran into several problems along the way, such as repeated entities in the same text.

Conclusion: In principle, yes, you can train a 1D CNN to do this, since spaCy uses a CNN under the hood, provided that you have a sufficient amount of data and convert it into the format spaCy expects. Note, however, that the training run above reported 0.00 for every loss and score and the trained pipeline detected no entities, so this notebook does not yet demonstrate a working model.