In [1]:
import os

os.environ['KERAS_BACKEND'] = "torch"

In [2]:
import warnings

# Some parts of torch that are used by spaCy are deprecated; we can ignore those warnings.
# (The new spaCy 3.8 has some minor issues, so we keep the current version for now.)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Pre-processing
## Reasons
The information downloaded from the BGG API might be uninformative, faulty, or bloated with useless content. <br>
To avoid this, we apply pre-processing steps that filter out what we don't need, which may be entire records or only part of the text inside a line.

During this process we also tokenize and lemmatize the text using the ```spacy``` library.

## Method 



## Using Spacy
To download the model and use it with spacy:
```
python -m spacy download en_core_web_sm
```

In [4]:
import spacy

model = spacy.load("en_core_web_sm")

## PreProcessingService
Class that holds the steps to clean the text and produce a lemmatized corpus. <br/> The result is then persisted to a file to avoid re-processing the same data.

In [3]:
from pre_processing import PreProcessingService

ps = PreProcessingService()

In [4]:
demo_text = "This is a demo text. Isn't Root just an amazing game? I love it!"

### 1 - BGG noise removal
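BGG comments can carry metadata such as images and some pseudo-HTML tags. <br>
To avoid processing those, we remove them by applying two regexes: the first drops a tag together with its content, the second drops only the tag markers and keeps the content: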

In [5]:
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"

In [6]:
ps.clean_text("This is a test for processing [IMG]https://cf.geekdo-static.com/mbs/mb_5855_0.gif[/IMG] as content")

'This is a test for processing  as content'

In [7]:
ps.clean_text("This is a test for processing [b=323]bold[/b] as content")

'This is a test for processing bold as content'
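As a minimal sketch of how the two patterns could be combined, the snippet below applies them with ```re.sub```. The helper name ```strip_bgg_markup``` and the order of the two substitutions are assumptions for illustration; the actual ```clean_text``` implementation may differ.

```python
import re

# Patterns copied from the cell above.
clean_tags_regex = r"(?i)\[(?P<tag>[A-Z]+)\].*?\[/\1\]"
keep_tag_content_regex = r"\[(?P<tag>[a-z]+)(=[^\]]+)?\](.*?)\[/\1\]"


def strip_bgg_markup(text: str) -> str:
    """Hypothetical helper, not the actual PreProcessingService.clean_text."""
    # Assumed order: first drop attribute-less tags together with their content
    # (for example [IMG]...[/IMG]).
    text = re.sub(clean_tags_regex, "", text)
    # Then unwrap the remaining lowercase tags (for example [b=323]...[/b]),
    # keeping only the inner text captured by group 3.
    return re.sub(keep_tag_content_regex, r"\3", text)


print(strip_bgg_markup("A [b=323]bold[/b] claim [IMG]https://example.invalid/x.gif[/IMG] here"))
```

Because the first pattern is case-insensitive, it also removes attribute-less lowercase tags such as ```[b]...[/b]``` along with their content, which is worth keeping in mind when choosing the order.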

### 2 - Language detection
While it would of course be great to have a model with multi-language support, we are focusing on English. <br>
To filter out other languages we use the ```fast_langdetect``` library.

In [8]:
from fast_langdetect import detect

german_sentence = "Naja, ich finde die Siedler von Catan noch besser"
print(f"For the demo sentence: \"{demo_text}\" we detected: {detect(demo_text)['lang']}")
print(f"For the demo sentence: \"{german_sentence}\" we detected: {detect(german_sentence)['lang']}")

For the demo sentence: "This is a demo text. Isn't Root just an amazing game? I love it!" we detected: en
For the demo sentence: "Naja, ich finde die Siedler von Catan noch besser" we detected: de
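The detection result can then drive the actual filtering. The sketch below is one possible way to do it: ```keep_english``` is a hypothetical helper (not part of ```pre_processing```) that accepts any detector returning a dict with a ```lang``` key, which is the shape shown in the output above. Collapsing whitespace before detection is an assumption, made because some detectors reject multi-line input.

```python
from typing import Callable, Iterable, List


def keep_english(
    comments: Iterable[str],
    detect_fn: Callable[[str], dict],
    target_lang: str = "en",
) -> List[str]:
    """Hypothetical helper: keep only comments detected as `target_lang`."""
    kept = []
    for comment in comments:
        # Assumption: collapse newlines and repeated spaces before detection.
        single_line = " ".join(comment.split())
        if not single_line:
            continue  # nothing to detect on an empty comment
        if detect_fn(single_line)["lang"] == target_lang:
            kept.append(comment)  # keep the original, unmodified text
    return kept
```

With the import from the cell above, ```keep_english(comments, detect)``` would plug in the real detector.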


### 3 - Tokenization and lemmatization
Using ```spacy``` we tokenize the text and then we lemmatize it. <br>

In [9]:
ps._make_text_lemmas(demo_text)  # (Should be considered private)

['demo', 'text', 'root', 'amazing', 'game', 'love']

Note: a temporary fix used to be needed here because None values and punctuation were not filtered out; in the current version, ```pre_processing``` already does what we expect.

### 4 - Remove texts that are too short
Comments (reviews) that are too short might not be informative. <br>
Since stopwords and punctuation have already been removed, we can filter out comments with too few remaining tokens, but the threshold should be reasonable (not too high).
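A length filter of this kind can be as small as the sketch below. Both the helper name ```is_informative``` and the threshold value are hypothetical and not taken from the project; the right value depends on the corpus.

```python
from typing import List

MIN_LEMMAS = 3  # hypothetical threshold, not taken from the project


def is_informative(lemmas: List[str], min_lemmas: int = MIN_LEMMAS) -> bool:
    """Return True when enough lemmas survive stopword and punctuation removal."""
    return len(lemmas) >= min_lemmas


print(is_informative(["demo", "text", "root", "amazing", "game", "love"]))
print(is_informative(["good", "game"]))
```

Counting lemmas rather than raw characters means a long comment made mostly of stopwords is still treated as short.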

### 5 - Regenerating a text string

In [10]:
ps.pre_process(demo_text)

'demo text root amazing game love'

## Process Batch
We can process the entire corpus now. This will take a while, but it will save us time in the future.

In [1]:
resource_file_path: str = "../data/corpus.csv"
target_file_path: str = "../data/corpus.preprocessed.v2.csv"

In [2]:
from pre_processing import pre_process_corpus

pre_process_corpus(resource_file_path, target_file_path, False)

Pandas Apply:   0%|          | 0/2136200 [00:00<?, ?it/s]
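Since the run is expensive, it helps to skip it when the target file already exists. The sketch below illustrates that idea with the standard ```csv``` module instead of pandas. It is not the actual ```pre_process_corpus```: the function name, the ```comment``` column name, and the ```force``` flag are all assumptions.

```python
import csv
import os
from typing import Callable


def process_csv_once(
    source_path: str,
    target_path: str,
    transform: Callable[[str], str],
    text_column: str = "comment",  # hypothetical column name
    force: bool = False,           # hypothetical flag to redo the work
) -> bool:
    """Write a transformed copy of the CSV; return False if an existing target was reused."""
    if os.path.exists(target_path) and not force:
        return False  # cached result found, nothing to do
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(target_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Only the text column is rewritten; other columns pass through unchanged.
            row[text_column] = transform(row[text_column])
            writer.writerow(row)
    return True
```

Rows are streamed one at a time, so memory use stays flat even for a corpus of a couple of million comments.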