# Classification with dictionaries and word embeddings

## Example

### 1. Imports

In [1]:
# imports:
import pandas as pd

from weelex import WEELexClassifier
from weelex import Lexicon
from weelex import Embeddings



### 2. Preparation


#### 2.1 Dictionary/Lexicon

- The classifier works with `Lexicon` objects.
- These can be created from different formats:
    - Tabular:
        - `pandas.DataFrame` in which each column is one category and the rows hold the terms of that category
        - `.csv` file path with data in the same format
    - Key-value pairs:
        - `dict` of the form `{'category1': ['term1', 'term2'], 'category2': ['term3', 'term4', 'term5']}`
        - `.json` file path with data in the same format
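
The two layouts carry the same information. As a minimal sketch using only the standard library (the data and the `to_rows` helper are illustrative and not part of `weelex`), the key-value form can be turned into the column-per-category layout. Categories with fewer terms are padded with empty cells, which is why shorter categories show `NaN` in a tabular lexicon:

```python
from itertools import zip_longest

# Key-value form: category -> list of terms (illustrative data).
lexicon_dict = {
    "category1": ["term1", "term2"],
    "category2": ["term3", "term4", "term5"],
}


def to_rows(mapping):
    """Return (header, rows) for the column-per-category layout.

    Categories with fewer terms are padded with None, which a tabular
    reader such as pandas would display as NaN.
    """
    header = list(mapping)
    rows = [list(row) for row in zip_longest(*mapping.values(), fillvalue=None)]
    return header, rows


header, rows = to_rows(lexicon_dict)
print(header)
for row in rows:
    print(row)
```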

In [2]:
# Tabular data:
df1 = pd.read_csv('examplefiles/mylex1.csv', sep=';', encoding='latin1')
lex1 = Lexicon(df1)
# or:
lex1 = Lexicon('examplefiles/mylex1.csv', sep=';', encoding='latin1')
lex1

          PolitikVR            AutoVR
0        Demokratie           schnell
1            Regime              Auto
2      demokratisch         Automobil
3         Parlament         Autobauer
4         Bundestag          Mercedes
5            Partei               BMW
6          Parteien           Porsche
7           Politik              Audi
8       Politikerin                VW
9         Politiker           Lenkrad
10             Wahl            Felgen
11           wählen            Reifen
12         Kandidat            Straße
13       Wiederwahl                PS
14        Präsident           Auspuff
15        Kanzlerin              Lack
16          Kanzler             Kombi
17  Bundespräsident               Bus
18         Minister        Ledersitze
19      Ministerien            Fahrer
20      Ministerium           Faherin
21     populistisch   Geschwindigkeit
22           rechts            Bolide
23            links        Karosserie
24       Opposition                Km
25       Kor

In [3]:
# mappings/key-value pairs:
lex2 = Lexicon('examplefiles/mylex2.json')
lex2

        Space     Food
0          ab     Brot
1     abseits   Kuchen
2   abstaende   Gemüse
3     abstand      NaN
4    abstände      NaN
5    abwaerts      NaN
6     abwärts      NaN
7          an      NaN
8  anstellung      NaN

Different lexica can also be combined into one, for example when terms come from several dictionary sources:

In [4]:
lex = lex1.merge(lex2, inplace=False)
lex

          PolitikVR            AutoVR       Space     Food
0        Demokratie           schnell          ab     Brot
1            Regime              Auto     abseits   Kuchen
2      demokratisch         Automobil   abstaende   Gemüse
3         Parlament         Autobauer     abstand      NaN
4         Bundestag          Mercedes    abstände      NaN
5            Partei               BMW    abwaerts      NaN
6          Parteien           Porsche     abwärts      NaN
7           Politik              Audi          an      NaN
8       Politikerin                VW  anstellung      NaN
9         Politiker           Lenkrad         NaN      NaN
10             Wahl            Felgen         NaN      NaN
11           wählen            Reifen         NaN      NaN
12         Kandidat            Straße         NaN      NaN
13       Wiederwahl                PS         NaN      NaN
14        Präsident           Auspuff         NaN      NaN
15        Kanzlerin              Lack         NaN      N
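
Conceptually, merging places the categories of both lexica side by side. The sketch below mimics this with plain dictionaries. It is not the `weelex` implementation, and raising an error on duplicate category names is an assumption made only for this illustration (the text above does not say how `merge` treats them).

```python
def merge_lexica(first, second):
    """Combine two category -> terms mappings into a new mapping."""
    overlap = set(first) & set(second)
    if overlap:
        # Assumption for this sketch only: refuse ambiguous category names.
        raise ValueError(f"duplicate categories: {sorted(overlap)}")
    # Copy the term lists so that neither input is modified.
    merged = {key: list(terms) for key, terms in first.items()}
    merged.update({key: list(terms) for key, terms in second.items()})
    return merged


lex_a = {"PolitikVR": ["Demokratie", "Regime"], "AutoVR": ["schnell", "Auto"]}
lex_b = {"Space": ["ab", "abseits"], "Food": ["Brot", "Kuchen"]}

combined = merge_lexica(lex_a, lex_b)
print(list(combined))
```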

#### 2.2 Embeddings

- Pretrained embedding vectors need to be provided. In the future, there will be support for self-training or fine-tuning.
- Pretrained FastText vectors can be downloaded from the official website:
    - [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html)
    - Here, we download the German vectors in the `bin` version.
    - Store the file in a location of your choice.
    - The file is several GB in size $\rightarrow$ downloading it and loading it into memory may take some time.
    - The downloaded file is compressed (`.bin.gz`). This is fine; it does not need to be uncompressed.
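
Because the file is large, it can be worth checking it before loading. The following is a small standard-library sketch, not part of `weelex`: the helper `describe_vector_file` and the path `cc.de.300.bin.gz` are hypothetical. It reports the file size and whether the file starts with the gzip magic number.

```python
import os


def describe_vector_file(path):
    """Return (size in GB, whether the file is gzip-compressed)."""
    size_gb = os.path.getsize(path) / 1e9
    with open(path, "rb") as handle:
        # Every gzip file starts with the two bytes 0x1f 0x8b.
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return size_gb, is_gzip


candidate_path = "cc.de.300.bin.gz"  # hypothetical location, adjust as needed
if os.path.exists(candidate_path):
    size_gb, is_gzip = describe_vector_file(candidate_path)
    print(f"{size_gb:.1f} GB, gzip-compressed: {is_gzip}")
```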


In [5]:
path_to_embeddings = '../../cc.de.300.bin'  # change this to your saved location

In [6]:
embeds = Embeddings()
embeds.load_vectors(path_to_embeddings, embedding_type='fasttext', fine_tuned=False)

The embedding object can be filtered so that it only contains the words that are in the dictionary, which is sufficient for the method.
The filtered embeddings can be saved and loaded in later sessions, which reduces memory use and loading time.
This is particularly valuable if you work on the following classification steps over multiple days and sessions.

In [9]:
embeds.filter_terms(lex.vocabulary)

# saving
path_to_filtered_embeddings = './filtered_embeddings'
embeds.save_filtered(path_to_filtered_embeddings)
del embeds

# create new embeds instance and load the filtered vectors
embeds = Embeddings()
embeds.load_filtered(path_to_filtered_embeddings)
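
The idea behind filtering can be pictured with a toy example. This is a sketch in plain Python, not the `weelex` implementation: the dictionary `full_embeddings`, the three-dimensional vectors, and the helper `filter_embeddings` are made up for illustration.

```python
# Toy "embedding table": word -> vector (3 dimensions instead of 300).
full_embeddings = {
    "Auto": [0.1, 0.2, 0.3],
    "Politik": [0.4, 0.5, 0.6],
    "Wetter": [0.7, 0.8, 0.9],
    "Kuchen": [0.2, 0.1, 0.0],
}
vocabulary = {"Auto", "Politik", "Kuchen"}  # terms that occur in the lexicon


def filter_embeddings(embeddings, terms):
    """Keep only the vectors whose word is in `terms`."""
    return {word: vec for word, vec in embeddings.items() if word in terms}


filtered = filter_embeddings(full_embeddings, vocabulary)
print(sorted(filtered))  # ['Auto', 'Kuchen', 'Politik']
```

In this toy version, a word outside the vocabulary (here `Wetter`) no longer has a vector after filtering.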

### 3. Train the model on the dictionary

- The method works by first training a machine learning ensemble on the dictionary.
- It is possible to provide `main_keys`, i.e. the categories to predict, and `support_keys`, i.e. other categories for which you provide terms but do not want a prediction.
- Including `support_keys` can improve the classification because it allows the model to differentiate more words.
- By default, all keys of your `Lexicon` instance are main keys. This can be changed with the `main_keys` and `support_keys` parameters. Alternatively, it is possible to provide a `Lexicon` instance via the `lex` parameter for main categories and another `Lexicon` via the `support_lex` parameter for support categories.
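
One way to picture the role of support categories is sketched below. This is an interpretation of the description above, not a statement about the internals of `weelex`: it assumes that, when a binary classifier is trained for one main category, the terms of all other categories (main and support) act as negative examples. The helper `binary_training_set` and the shortened term lists are hypothetical.

```python
lexicon = {
    "PolitikVR": ["Partei", "Wahl"],
    "AutoVR": ["Auto", "Reifen"],
    "Space": ["ab", "an"],
    "Food": ["Brot", "Kuchen"],
}
main_keys = ["PolitikVR", "AutoVR"]
support_keys = ["Space", "Food"]


def binary_training_set(lexicon, target, main_keys, support_keys):
    """Return (term, label) pairs: 1 for terms of `target`, 0 for all others."""
    pairs = []
    for key in list(main_keys) + list(support_keys):
        label = 1 if key == target else 0
        pairs.extend((term, label) for term in lexicon[key])
    return pairs


without_support = binary_training_set(lexicon, "PolitikVR", main_keys, [])
with_support = binary_training_set(lexicon, "PolitikVR", main_keys, support_keys)
print(len(without_support), len(with_support))  # 4 8
```

Under this assumption, the support categories add four negative examples for `PolitikVR` without adding a category to predict.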


In [19]:
classifier = WEELexClassifier(embeds=embeds,
                              relevant_pos=['NOUN'],
                              min_df=1,  # Optional. Chosen so that the small example runs; a higher value is better in practice. Default is 5
                              max_df=0.99,  # Optional. Chosen so that the small example runs. Default is 0.95
                              n_docs=20,  # Optional. Chosen for the small example. Ideally, use the number of documents in your data.
                              n_words=10  # Optional. Chosen for the small example. Default is 40000
                              )
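
The `min_df` and `max_df` arguments are not explained further here. Assuming they follow the usual document-frequency convention (as in scikit-learn's vectorizers, where an integer is an absolute number of documents and a float is a proportion of documents), their effect can be sketched in plain Python. The helper `filter_by_document_frequency` is hypothetical and only illustrates the idea.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=1, max_df=0.99):
    """Keep terms that occur in at least `min_df` documents (absolute count)
    and in at most a share `max_df` of all documents (proportion)."""
    n_docs = len(docs)
    # Count each term once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    }


docs = [
    "das Auto ist schnell",
    "das Parlament tagt",
    "das Wetter ist gut",
]
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.99)
print(sorted(kept))  # ['ist']
```

Here `das` is dropped because it occurs in every document, and all terms that occur in only one document are dropped by `min_df=2`.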

In [20]:
lex.keys

['PolitikVR', 'AutoVR', 'Space', 'Food']

In [21]:
# To tune the machine learning model, we specify a grid of hyperparameters.
# This grid will be searched via RandomizedSearch.
# The grid is very basic, with only 6 possible combinations. It is only
# used for this example and should be expanded in a real setting.
param_grid = [{'modeltype': ['svm'],
               'n_models': [2],
               'pca': [10, None],
               'svc_c': [0.1, 1, 10]}]
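
The count of 6 combinations can be checked by expanding the grid. This is a standard-library sketch; the helper `expand_grid` is illustrative and not part of `weelex`.

```python
from itertools import product

param_grid = [{'modeltype': ['svm'],
               'n_models': [2],
               'pca': [10, None],
               'svc_c': [0.1, 1, 10]}]


def expand_grid(grid):
    """List every parameter combination described by a list of grids."""
    combos = []
    for block in grid:
        keys = list(block)
        for values in product(*(block[key] for key in keys)):
            combos.append(dict(zip(keys, values)))
    return combos


combinations = expand_grid(param_grid)
print(len(combinations))  # 6
```

A randomized search typically evaluates only a sample of these combinations rather than all of them.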

In [22]:
classifier.weelexfit(lex=lex,
                     support_lex=None,  # entire support lexicon can be passed instead of the 'support_keys' parameter
                     main_keys=['PolitikVR', 'AutoVR'],  # optional. Uses all keys of lex if None
                     support_keys=['Space', 'Food'],  # optional. Is not used if None
                     hp_tuning=True,  # Hyperparameter tuning -> use for best results
                     param_grid=param_grid,  # Hyperparameter grid for hp tuning
                     )



    Sets of parameters:
    0: {'input_shape': 300, 'svc_c': 0.1, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}
    1: {'input_shape': 300, 'svc_c': 1, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}
    2: {'input_shape': 300, 'svc_c': 10, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}




    Sets of parameters:
    0: {'input_shape': 300, 'svc_c': 0.1, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}
    1: {'input_shape': 300, 'svc_c': 1, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}
    2: {'input_shape': 300, 'svc_c': 10, 'pca': 10, 'n_models': 2, 'modeltype': 'svm'}


### 4. Predict a body of texts



In [23]:
# the texts to predict in this example:
data = pd.Series(
    [
    'Ich esse gerne Kuchen und andere Süßigkeiten',
    'Dort steht ein schnelles Auto mit einem Lenkrad und Reifen.',
    'Die Politik von heute ist nicht mehr die gleiche wie damals.',
    'Hier ist nochmal ein sehr generischer Satz.',
    'Wie ist das Wetter heute?',
    'Ich esse gerne Kuchen und andere Süßigkeiten',
    'Dort steht ein schnelles Auto mit einem Lenkrad und Reifen.',
    'Die Politik von heute ist nicht mehr die gleiche wie damals.',
    'Hier ist nochmal ein sehr generischer Satz.',
    'Wie ist das Wetter heute?',
    'Ich esse gerne Kuchen und andere Süßigkeiten',
    'Dort steht ein schnelles Auto mit einem Lenkrad und Reifen.',
    'Die Politik von heute ist nicht mehr die gleiche wie damals.',
    'Hier ist nochmal ein sehr generischer Satz.',
    'Wie ist das Wetter heute?',
    'Ich esse gerne Kuchen und andere Süßigkeiten',
    'Dort steht ein schnelles Auto mit einem Lenkrad und Reifen.',
    'Die Politik von heute ist nicht mehr die gleiche wie damals.',
    'Hier ist nochmal ein sehr generischer Satz.',
    'Wie ist das Wetter heute?',
    ])

In [25]:
predictions = classifier.weelexpredict(data)
predictions

Fit vectorizer
Time to vectorize: 0.00 minutes
'Süßigkeit' is not in list
Returning null vector instead
'Reife' is not in list
Returning null vector instead
'Satz' is not in list
Returning null vector instead
'Wetter' is not in list
Returning null vector instead


   PolitikVR  AutoVR
0          0       1
1          0       1
2          1       0
3          0       0
4          0       0
5          0       1
6          0       1
7          1       0
8          0       0
9          0       0
