# StyloMetrix Tutorial
This notebook presents full instructions for using StyloMetrix with examples.

## 1. Quick start

StyloMetrix is a tool for stylometric analysis of texts. It is built on spaCy, and each supported language requires a corresponding language model for the tool to work properly. The supported languages and their models are listed below:
- English [en_core_web_trf](https://spacy.io/models/en)
- German [de_core_news_lg](https://spacy.io/models/de)
- Polish [en_nask-0.0.7](http://mozart.ipipan.waw.pl/~rtuora/spacy/)
- Russian [ru_core_news_lg](https://spacy.io/models/ru)
- Ukrainian [uk_core_web_trf](https://spacy.io/models/uk)


The model must be downloaded and installed in the environment where StyloMetrix (SM) will be used. StyloMetrix itself is installed with `pip install stylo_metrix`.
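
Since spaCy models are installed as regular Python packages, one way to confirm that the required model is present before creating a StyloMetrix object is to look it up by name. The sketch below uses only the standard library; the helper name `is_model_installed` is hypothetical and not part of StyloMetrix.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(model_name) is not None


model = "en_core_web_trf"  # English model from the list above
if is_model_installed(model):
    print(f"{model} is available")
else:
    # Models hosted by spaCy can be fetched with its download command.
    print(f"{model} is missing; try: python -m spacy download {model}")
```

Models that are not distributed through spaCy have to be installed from the location given in the list above.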

The following cells show how to quickly calculate metrics for several texts. Please remember that everything presented in this tutorial can be applied to all supported languages.

In [1]:
# import library
import stylo_metrix as sm

In [2]:
# example texts
texts = ["In our modern world, there are many factors that place the wellbeing of the planet in jeopardy.", 
         "While some people have the opinion that environmental problems are just a natural occurrence, others believe that human beings have a huge impact on the environment.",
         "Regardless of your viewpoint, take into consideration the following factors that place our environment as well as the planet Earth in danger."]

In [3]:
# count metrics
stylo = sm.StyloMetrix('en')  # define language, one of ('en', 'pl', 'ru', 'ukr')
metrics = stylo.transform(texts)
metrics

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.57it/s]


| | text | POS_VERB | POS_NOUN | POS_ADJ | POS_ADV | POS_DET | POS_INTJ | POS_CONJ | POS_PART | POS_NUM | ... | RE | ASF | ASM | OM | RCI | DMC | OR | QAS | PA | PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In our modern world, there are many factors th... | 0.117647 | 0.294118 | 0.117647 | 0.0 | 0.117647 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.176471 | 0.0 | 0.0 |
| 1 | While some people have the opinion that enviro... | 0.153846 | 0.307692 | 0.153846 | 0.038462 | 0.192308 | 0.0 | 0.115385 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.076923 | 0.0 | 0.0 |
| 2 | Regardless of your viewpoint, take into consid... | 0.136364 | 0.272727 | 0.0 | 0.136364 | 0.090909 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.136364 | 0.0 | 0.0 |


You can calculate metrics for just a single string as well.

In [4]:
# you can pass either a string or a list of strings to the transform method
metrics_for_one = stylo.transform(texts[0])
metrics_for_one

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.74it/s]


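| | text | POS_VERB | POS_NOUN | POS_ADJ | POS_ADV | POS_DET | POS_INTJ | POS_CONJ | POS_PART | POS_NUM | ... | RE | ASF | ASM | OM | RCI | DMC | OR | QAS | PA | PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In our modern world, there are many factors th... | 0.117647 | 0.294118 | 0.117647 | 0.0 | 0.117647 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.176471 | 0.0 | 0.0 |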


## 2. Create StyloMetrix instance

This chapter describes in detail the parameters of the `sm.StyloMetrix` class.

- An SM object requires the language in which the processed texts are written. It is passed as the **`lang`** parameter, a `string` that can be one of:
    -  `['english', 'angielski', 'en', 'eng']` for English,
    -  `['polish', 'polski', 'pl', 'pol']` for Polish,
    -  `['russian', 'rosyjski', 'ru']` for Russian,
    -  `['ukrainian', 'ukraiński', 'ukr']` for Ukrainian.

- The **`debug`** parameter takes a boolean value. When it is set to `True`, the `transform` operation returns two DataFrame objects: the first holds the calculated metric values, and the second shows which tokens were counted for each metric.

In [20]:
stylo = sm.StyloMetrix('en', debug=True) 
metrics, debug = stylo.transform(texts)
debug

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.03it/s]


| | text | POS_VERB | POS_NOUN | POS_ADJ | POS_ADV | POS_DET | POS_INTJ | POS_CONJ | POS_PART | POS_NUM | ... | RE | ASF | ASM | OM | RCI | DMC | OR | QAS | PA | PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In our modern world, there are many factors th... | {'TOKENS': [are, place]} | {'TOKENS': [world, factors, wellbeing, planet,... | {'TOKENS': [modern, many]} | {'TOKENS': []} | {'TOKENS': [the, the]} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | ... | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': [the, the, planet]} | {'TOKENS': []} | {'TOKENS': []} |
| 1 | While some people have the opinion that enviro... | {'TOKENS': [have, are, believe, have]} | {'TOKENS': [people, opinion, problems, occurre... | {'TOKENS': [environmental, natural, human, huge]} | {'TOKENS': [just]} | {'TOKENS': [some, the, a, a, the]} | {'TOKENS': []} | {'TOKENS': [While, that, that]} | {'TOKENS': []} | {'TOKENS': []} | ... | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': [the, the]} | {'TOKENS': []} | {'TOKENS': []} |
| 2 | Regardless of your viewpoint, take into consid... | {'TOKENS': [take, following, place]} | {'TOKENS': [viewpoint, consideration, factors,... | {'TOKENS': []} | {'TOKENS': [Regardless, as, well]} | {'TOKENS': [the, the]} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | ... | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': []} | {'TOKENS': [the, the, planet]} | {'TOKENS': []} | {'TOKENS': []} |
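
The debug output can be used to sanity-check the numbers in the metrics table. For the texts above, the reported values are consistent with dividing the number of listed tokens by the number of words in the text, punctuation excluded. That denominator is an assumption inferred from this output, not a documented rule, and the sketch below counts words with a plain regular expression rather than with spaCy tokenization.

```python
import re

text = ("In our modern world, there are many factors that place "
        "the wellbeing of the planet in jeopardy.")

# Token lists copied from the debug output for the first text.
debug_tokens = {
    "POS_VERB": ["are", "place"],
    "POS_ADJ": ["modern", "many"],
    "QAS": ["the", "the", "planet"],
}

# Assumption: the denominator is the word count without punctuation.
n_words = len(re.findall(r"\w+", text))

ratios = {name: round(len(tokens) / n_words, 6)
          for name, tokens in debug_tokens.items()}
print(n_words, ratios)
```

The rounded ratios agree with the values 0.117647, 0.117647 and 0.176471 reported for this text in the metrics table.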


- To save the results automatically, set the **`save_path`** parameter. It takes a `string` with the path to an existing directory, where the DataFrames are saved as CSV files.

In [21]:
path = '.\\'
stylo = sm.StyloMetrix('en', debug=True, save_path=path)

stylo.transform(texts[:2])

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.62it/s]

File saved in location: .\sm_output3.csv
File saved in location: .\sm_debug3.csv





(                                                text  POS_VERB  POS_NOUN  \
 0  In our modern world, there are many factors th...  0.117647  0.294118   
 1  While some people have the opinion that enviro...  0.153846  0.307692   
 
     POS_ADJ   POS_ADV   POS_DET  POS_INTJ  POS_CONJ  POS_PART  POS_NUM  ...  \
 0  0.117647  0.000000  0.117647       0.0  0.000000       0.0      0.0  ...   
 1  0.153846  0.038462  0.192308       0.0  0.115385       0.0      0.0  ...   
 
     RE  ASF  ASM   OM  RCI  DMC   OR       QAS   PA   PR  
 0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.176471  0.0  0.0  
 1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.076923  0.0  0.0  
 
 [2 rows x 197 columns],
                                                 text  \
 0  In our modern world, there are many factors th...   
 1  While some people have the opinion that enviro...   
 
                                  POS_VERB  \
 0                {'TOKENS': [are, place]}   
 1  {'TOKENS': [have, are, believe, have]}   
 
     

- Moreover, the **`nlp`** parameter can be set to use a custom spaCy model.
- By default, all metrics available for a given language are calculated. This can be changed with the **`metrics`** and **`exceptions`** parameters: assign to `metrics` the set of metrics we want to calculate, or assign to `exceptions` the set of metrics we want to leave out. The next section shows how to select metrics.

## 3. Selecting metrics

To select metrics, we first need to see what we have to choose from. **Please keep in mind that metrics might differ between languages.** The available metrics can be listed in the following way:

In [22]:
metrics = sm.get_all_metrics('eng')
print(metrics)

0  |  PartOfSpeech  |  POS_VERB  |  Verbs
1  |  PartOfSpeech  |  POS_NOUN  |  Nouns
2  |  PartOfSpeech  |  POS_ADJ  |  Adjectives
3  |  PartOfSpeech  |  POS_ADV  |  Adverbs
4  |  PartOfSpeech  |  POS_DET  |  Determiners
5  |  PartOfSpeech  |  POS_INTJ  |  Interjections
6  |  PartOfSpeech  |  POS_CONJ  |  Conjunctions
7  |  PartOfSpeech  |  POS_PART  |  Particles
8  |  PartOfSpeech  |  POS_NUM  |  Numerals
9  |  PartOfSpeech  |  POS_PREP  |  Prepositions
10  |  PartOfSpeech  |  POS_PRO  |  Pronouns
11  |  Lexical  |  L_REF  |  References
12  |  Lexical  |  L_HASHTAG  |  Hashtags
13  |  Lexical  |  L_MENTION  |  Mentions
14  |  Lexical  |  L_RT  |  Retweets
15  |  Lexical  |  L_LINKS  |  Links
16  |  Lexical  |  L_CONT_A  |  Content words
17  |  Lexical  |  L_FUNC_A  |  Function words
18  |  Lexical  |  L_CONT_T  |  Content words types
19  |  Lexical  |  L_FUNC_T  |  Function words types
20  |  Lexical  |  L_PLURAL_NOUNS  |  Nouns in plural
21  |  Lexical  |  L_SINGULAR_NOUNS  |  Nouns i

Each row above contains the following fields (from left):
- **order number** - the position of the metric. `metrics` is a `MetricGroup` object, from which we can select individual metrics or slices, e.g. `metrics[0]` or `metrics[10:20]`.
- **category** - the subject category to which the metric is assigned.
- **metric code** - a unique string for each metric, displayed in the DataFrame.
- **name** - the extended name of the metric.

Metrics can also be accessed in other ways, e.g. we can choose all metrics from a given category:

In [23]:
# check available categories
categories = sm.get_all_categories('en')
print(categories)

[PartOfSpeech, Lexical, Syntactic, VerbTenses, Statistics, Pronouns, General, Hurtlex]


In [25]:
# choose category
category = categories[2]

# preview the metrics available within this category
# this is the same kind of object as `metrics` above, so the same operations apply to it
category_metrics = category.get_metrics()
print(category_metrics)

0  |  Syntactic  |  SY_QUESTION  |  Number of words in interrogative sentences
1  |  Syntactic  |  SY_NARRATIVE  |  Number of words in narrative sentences
2  |  Syntactic  |  SY_NEGATIVE_QUESTIONS  |  Words in negative questions
3  |  Syntactic  |  SY_SPECIAL_QUESTIONS  |  Words in special questions
4  |  Syntactic  |  SY_TAG_QUESTIONS  |  Words in tag questions
5  |  Syntactic  |  SY_GENERAL_QUESTIONS  |  Words in general questions
6  |  Syntactic  |  SY_EXCLAMATION  |  Number of words in exclamatory sentences
7  |  Syntactic  |  SY_IMPERATIVE  |  Words in imperative sentences
8  |  Syntactic  |  SY_SUBORD_SENT  |  Words in subordinate sentences
9  |  Syntactic  |  SY_SUBORD_SENT_PUNCT  |  Punctuation in subordinate sentences
10  |  Syntactic  |  SY_COORD_SENT  |  Words in coordinate sentences
11  |  Syntactic  |  SY_COORD_SENT_PUNCT  |  Punctuation in coordinate sentences
12  |  Syntactic  |  SY_SIMPLE_SENT  |  Tokens in simple sentences
13  |  Syntactic  |  SY_DIRECT_SPEECH  |  Word

In [26]:
# example subset of metrics for analysis
metrics_to_analyse = metrics[60:100]
print(metrics_to_analyse)

0  |  Syntactic  |  SY_DIRECT_SPEECH  |  Words in direct speech
1  |  Syntactic  |  SY_INVERSE_PATTERNS  |  Incidents of inverse patterns
2  |  Syntactic  |  FOS_SIMILE  |  Simile
3  |  Syntactic  |  FOS_FRONTING  |  Fronting
4  |  Syntactic  |  PS_SYNTACTIC_IRRITATION  |  Incidents of continuous tenses as irritation markers
5  |  Syntactic  |  SY_INTENSIFIER  |  Intensifiers
6  |  VerbTenses  |  VT_PRESENT_SIMPLE  |  Present Simple tense
7  |  VerbTenses  |  VT_PRESENT_PROGRESSIVE  |  Present Continuous tense
8  |  VerbTenses  |  VT_PRESENT_PERFECT  |  Present Perfect tense
9  |  VerbTenses  |  VT_PRESENT_PERFECT_PROGR  |  Present Prefect Continuous tense
10  |  VerbTenses  |  VT_PRESENT_SIMPLE_PASSIVE  |  Present Simple passive
11  |  VerbTenses  |  VT_PRESENT_PROGR_PASSIVE  |  Present Continuous passive
12  |  VerbTenses  |  VT_PRESENT_PERFECT_PASSIVE  |  Present Perfect passive
13  |  VerbTenses  |  VT_PAST_SIMPLE  |  Past Simple tense
14  |  VerbTenses  |  VT_PAST_SIMPLE_BE  |  Pa

In [27]:
# use metrics_to_analyse, excluding the syntactic metrics
stylo = sm.StyloMetrix('en', metrics=metrics_to_analyse, exceptions=category_metrics)
metrics = stylo.transform(texts)
metrics

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.66it/s]


| | text | VT_PRESENT_SIMPLE | VT_PRESENT_PROGRESSIVE | VT_PRESENT_PERFECT | VT_PRESENT_PERFECT_PROGR | VT_PRESENT_SIMPLE_PASSIVE | VT_PRESENT_PROGR_PASSIVE | VT_PRESENT_PERFECT_PASSIVE | VT_PAST_SIMPLE | VT_PAST_SIMPLE_BE | ... | VT_WOULD_PROGRESSIVE | VT_WOULD_PERFECT | VT_WOULD_PERFECT_PASSIVE | VT_SHOULD | VT_SHOULD_PASSIVE | VT_SHALL | VT_SHALL_PASSIVE | VT_SHOULD_PROGRESSIVE | VT_SHOULD_PERFECT | VT_SHOULD_PERFECT_PASSIVE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In our modern world, there are many factors th... | 0.117647 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | While some people have the opinion that enviro... | 0.153846 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Regardless of your viewpoint, take into consid... | 0.045455 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


- You can calculate the value of a single metric (passed as a one-element list),
- or you can pass a list of any available metrics, as well as groups of metrics.

In [28]:
stylo = sm.StyloMetrix('en', metrics=[metrics_to_analyse[0]])
metrics = stylo.transform(texts)
metrics

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.15it/s]


| | text | SY_DIRECT_SPEECH |
|---|---|---|
| 0 | In our modern world, there are many factors th... | 0 |
| 1 | While some people have the opinion that enviro... | 0 |
| 2 | Regardless of your viewpoint, take into consid... | 0 |


Groups of metrics can be added (concatenated) as well as subtracted (to remove some metrics):

In [31]:
metrics = sm.get_all_metrics('en')
group1 = metrics[20:30]
group2 = metrics[50:70]
group3 = metrics[25:55]
final_group = group1 + group2 - group3
print(final_group)

0  |  Lexical  |  L_PLURAL_NOUNS  |  Nouns in plural
1  |  Lexical  |  L_SINGULAR_NOUNS  |  Nouns in singular
2  |  Lexical  |  L_PROPER_NAME  |  Proper names
3  |  Lexical  |  L_PERSONAL_NAME  |  Personal names
4  |  Lexical  |  L_NOUN_PHRASES  |  Incidence of noun phrases
5  |  Syntactic  |  SY_SUBORD_SENT  |  Words in subordinate sentences
6  |  Syntactic  |  SY_SUBORD_SENT_PUNCT  |  Punctuation in subordinate sentences
7  |  Syntactic  |  SY_COORD_SENT  |  Words in coordinate sentences
8  |  Syntactic  |  SY_COORD_SENT_PUNCT  |  Punctuation in coordinate sentences
9  |  Syntactic  |  SY_SIMPLE_SENT  |  Tokens in simple sentences
10  |  Syntactic  |  SY_DIRECT_SPEECH  |  Words in direct speech
11  |  Syntactic  |  SY_INVERSE_PATTERNS  |  Incidents of inverse patterns
12  |  Syntactic  |  FOS_SIMILE  |  Simile
13  |  Syntactic  |  FOS_FRONTING  |  Fronting
14  |  Syntactic  |  PS_SYNTACTIC_IRRITATION  |  Incidents of continuous tenses as irritation markers
15  |  Syntactic  |  SY_INT
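
The index arithmetic behind this result can be traced with plain Python lists. This sketch assumes that `+` concatenates groups and that `-` drops every metric that also appears in the subtracted group, which is what the printed output suggests; integer positions stand in for the actual `MetricGroup` objects.

```python
# Positions in the full metric list stand in for the metrics themselves.
group1 = list(range(20, 30))
group2 = list(range(50, 70))
group3 = list(range(25, 55))

# Concatenation: group1 + group2
combined = group1 + group2

# Subtraction: drop everything that also belongs to group3
excluded = set(group3)
final_group = [i for i in combined if i not in excluded]

print(final_group)
```

Twenty positions remain: 20 to 24, which correspond to the Lexical metrics at the top of the output, and 55 to 69, which correspond to the Syntactic metrics that follow.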

## 4. Creating new metrics
Your own metrics can be created in two ways:
- With the **`custom_metric`** decorator - the decorated function takes `doc` (spaCy-processed text, divided into tokens). The function should return a result in the range [0, 1] together with the debug information.

In [32]:
from stylo_metrix import custom_metric

@custom_metric('en')
def SAMPL1(doc):
    result = 0.9
    debug = [doc[0], doc[1], doc[3]]
    return result, debug
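
A counting function would usually derive both return values from the tokens rather than return constants. The sketch below computes the share of alphabetic tokens that have at least seven characters; the metric itself, the name `LONG_WORDS` and the threshold are made up for illustration. It relies only on the `text` attribute that spaCy tokens provide, so a list of simple stand-in objects replaces a real `doc`, and the decorator is left out so that the block runs without StyloMetrix installed.

```python
from types import SimpleNamespace

MIN_LENGTH = 7  # hypothetical threshold for a "long" word


def LONG_WORDS(doc):
    """Return (share of long words in [0, 1], list of matching tokens)."""
    words = [token for token in doc if token.text.isalpha()]
    debug = [token for token in words if len(token.text) >= MIN_LENGTH]
    result = len(debug) / len(words) if words else 0.0
    return result, debug


# Stand-in for a spaCy Doc: any sequence of objects with a .text attribute.
doc = [SimpleNamespace(text=t)
       for t in ["In", "our", "modern", "world", ",", "there", "are", "many", "factors"]]
result, debug = LONG_WORDS(doc)
print(result, [token.text for token in debug])
```

With StyloMetrix available, the same function could presumably be registered by decorating it with `@custom_metric('en')`, as in the cell above.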

- By inheriting from the **`Metric`** class. Note that the `category`, `name_en` and `name_local` fields are required for proper operation. The category object must either be taken from the list returned by `sm.get_all_categories()` or be newly created. The calculation itself is implemented in the `count(doc)` method.

In [36]:
categories = sm.get_all_categories('en')
category = categories[5]

class SAMPL2(sm.Metric):
    category = category
    name_en = "abc"
    name_local = "abc"
    
    def count(doc):
        result = 0.1
        debug = [doc[2], doc[3], doc[4]]
        return result, debug

In [37]:
# create a new category - indicate the language to which the category belongs
# (same language codes as with get_all_metrics(), etc.)

class C1(sm.Category):
    lang = 'en'        # define language
    name_en = "C1"     # name in English
    name_local = "C1"  # local name

    
class SAMPL3(sm.Metric):
    category = C1
    name_en = "abc"
    name_local = "abc"
    
    def count(doc):
        result = 0.99
        debug = [doc[9], doc[0], doc[1]]
        return result, debug

New metrics are registered automatically when the code that defines them is run. After executing the cells above we can therefore already use them, which can be seen by listing all available metrics.

In [38]:
print(sm.get_all_metrics('en'))

0  |  PartOfSpeech  |  POS_VERB  |  Verbs
1  |  PartOfSpeech  |  POS_NOUN  |  Nouns
2  |  PartOfSpeech  |  POS_ADJ  |  Adjectives
3  |  PartOfSpeech  |  POS_ADV  |  Adverbs
4  |  PartOfSpeech  |  POS_DET  |  Determiners
5  |  PartOfSpeech  |  POS_INTJ  |  Interjections
6  |  PartOfSpeech  |  POS_CONJ  |  Conjunctions
7  |  PartOfSpeech  |  POS_PART  |  Particles
8  |  PartOfSpeech  |  POS_NUM  |  Numerals
9  |  PartOfSpeech  |  POS_PREP  |  Prepositions
10  |  PartOfSpeech  |  POS_PRO  |  Pronouns
11  |  Lexical  |  L_REF  |  References
12  |  Lexical  |  L_HASHTAG  |  Hashtags
13  |  Lexical  |  L_MENTION  |  Mentions
14  |  Lexical  |  L_RT  |  Retweets
15  |  Lexical  |  L_LINKS  |  Links
16  |  Lexical  |  L_CONT_A  |  Content words
17  |  Lexical  |  L_FUNC_A  |  Function words
18  |  Lexical  |  L_CONT_T  |  Content words types
19  |  Lexical  |  L_FUNC_T  |  Function words types
20  |  Lexical  |  L_PLURAL_NOUNS  |  Nouns in plural
21  |  Lexical  |  L_SINGULAR_NOUNS  |  Nouns i

## 5. Integration with scikit-learn
StyloMetrix is designed to integrate with the popular machine learning library scikit-learn. A StyloMetrix object can be set up as one of the steps of a scikit-learn pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier

# take a small sample of the 20 newsgroups training set
newsgroups = fetch_20newsgroups(subset='train')
X = newsgroups['data'][:500]
y = newsgroups['target'][:500]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# StyloMetrix is the feature extraction step, the classifier is the final step
stylo = sm.StyloMetrix('en')
clf = RandomForestClassifier(max_depth=4, random_state=42)
pipe = Pipeline([('stylo', stylo), ('rmf', clf)])
pipe.fit(X=X_train, y=y_train)

# mean accuracy on the held-out texts
pipe.score(X=X_test, y=y_test)
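
Intermediate steps of a scikit-learn `Pipeline` must implement `fit` and `transform`, so any object placed before the classifier has to provide both methods. The toy class below illustrates that interface with two hand-made text features; `TinyTextFeatures` and its features are hypothetical and do not reflect how StyloMetrix is implemented.

```python
import re
import string


class TinyTextFeatures:
    """Toy transformer: average word length and punctuation ratio."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fitting is a no-op.
        return self

    def transform(self, X):
        rows = []
        for text in X:
            words = re.findall(r"\w+", text)
            avg_len = sum(map(len, words)) / len(words) if words else 0.0
            n_punct = sum(ch in string.punctuation for ch in text)
            punct_ratio = n_punct / len(text) if text else 0.0
            rows.append([avg_len, punct_ratio])
        return rows


# Each input text becomes one row of numeric features.
features = TinyTextFeatures().fit(["Hi there!"]).transform(["Hi there!"])
print(features)
```

Each text is mapped to one row of numbers, which is the shape of input that the final classifier step consumes.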