## ParagraphWebLanguageScoreRetagger

ParagraphWebLanguageScoreRetagger is a retagger for identifying texts, or parts of texts, that use web language. It detects a set of features characteristic of web language in every paragraph of a text and attaches the resulting scores to the paragraphs layer.

In [1]:
from estnltk import Text
from paragraphweblanguagescoreretagger import ParagraphWebLanguageScoreRetagger

In [2]:
weblang_score_retagger = ParagraphWebLanguageScoreRetagger()
weblang_score_retagger

| name | output layer | output attributes | input layers |
|---|---|---|---|
| ParagraphWebLanguageScoreRetagger | paragraphs | ('word_count', 'emoticons', 'missing_commas', 'unknown_words', 'letter_reps', 'no_spaces', 'capital_letters', 'foreign_letters', 'ignored_capital', 'incorrect_spaces') | ('paragraphs', 'words', 'compound_tokens', 'clauses') |

| parameter | value |
|---|---|
| use_unknown_words | True |
| use_emoticons | True |
| use_letter_reps | True |
| use_punct_reps | False |
| use_capital_letters | True |
| use_missing_commas | True |
| use_ignored_capital | True |
| use_no_spaces | True |
| use_incorrect_spaces | True |
| use_foreign_letters | True |


Before applying ParagraphWebLanguageScoreRetagger, the input Text object must have layers "paragraphs", "words", "compound_tokens" and "clauses".

Texts can be analysed based on 10 different features that describe the usage of web language. Each feature has a flag that can be set to True or False; by default, 9 of the 10 features are used.
<br>
For example, ParagraphWebLanguageScoreRetagger(use_punct_reps=True) activates the feature **punct_reps** that by default is set to False.

#### Flags and what they detect:

- **use_unknown_words** -- words that are unknown to the morphological analyser (if morphological analysis without guessing is used)
<br>
- **use_emoticons** -- emoticons, e.g. *:D, :)*
<br>
- **use_letter_reps** -- same letter more than twice in a row, e.g. *jaaaaa*
<br>
- **use_punct_reps** -- repeated punctuation marks (except dots), e.g. *!!!!!!*
<br>
- **use_capital_letters** -- longer parts of text in capital letters, e.g. *MINE METSA! KUHU SA LÄHED?*
<br>
- **use_missing_commas** -- missing commas
<br>
- **use_ignored_capital** -- lowercase letters used instead of capital letters in sentence-initial positions, e.g. *Tere? kuidas läheb?*
<br>
- **use_no_spaces** -- no spaces after punctuation marks, e.g. *Ilm on ilus.Päike paistab.*
<br>
- **use_incorrect_spaces** -- incorrect spaces before and after punctuation marks, e.g. *Tore ! Mulle meeldib.*
<br>
- **use_foreign_letters** -- usage of foreign letters inside (non-capitalized) words, e.g. *ma ei viici yksi*
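
To make a few of the surface features above more concrete, here is a minimal regex sketch. It is an illustration only and not the retagger's implementation: the pattern names, the helper `count_matches`, and the exact rules are assumptions made for this example, and the retagger itself works on EstNLTK layers rather than raw strings.

```python
import re

# Illustrative patterns only; the retagger's own rules may differ.
LETTER_REPS = re.compile(r"([^\W\d_])\1{2,}")         # same letter 3+ times in a row
PUNCT_REPS = re.compile(r"([!?,;:])\1+")              # repeated punctuation, dots excluded
NO_SPACES = re.compile(r"[.!?,;:](?=[^\W\d_])")       # punctuation directly followed by a letter
SPACE_BEFORE_PUNCT = re.compile(r"\s[.!?,;:]")        # whitespace before a punctuation mark


def count_matches(pattern, text):
    """Count non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


# Examples taken from the flag descriptions above
print(count_matches(LETTER_REPS, "jaaaaa"))                      # 1
print(count_matches(PUNCT_REPS, "!!!!!!"))                       # 1
print(count_matches(NO_SPACES, "Ilm on ilus.Päike paistab."))    # 1
print(count_matches(SPACE_BEFORE_PUNCT, "Tore ! Mulle meeldib."))  # 1
```

Features such as **use_unknown_words** or **use_missing_commas** cannot be approximated this way, because they depend on morphological analysis and clause boundaries.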

### Example #1

Let's first try ParagraphWebLanguageScoreRetagger on a string consisting of 4 sentences and 2 paragraphs.

In [3]:
text = Text('''Tšau ! mis teed???


Kas sa kinno ei viitsi minna? mul on niiii igav et ma lähen hulluks varsti!''')
# Add required layers
text.tag_layer(["compound_tokens", "words", "paragraphs", "clauses"])
# Add annotation (adds scores of attributes of web language to paragraph layer)
weblang_score_retagger.retag(text)

text
Tšau ! mis teed???Kas sa kinno ei viitsi minna? mul on niiii igav et ma lähen hulluks varsti!

| key | value |
|---|---|
| whole_text_score | 0.272727 |

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | word_count, emoticons, missing_commas, unknown_words, letter_reps, no_spaces, capital_letters, foreign_letters, ignored_capital, incorrect_spaces | | sentences | False | 2 |
| sentences | | | words | False | 4 |
| tokens | | | | False | 22 |
| compound_tokens | type, normalized | | tokens | False | 0 |
| words | normalized_form | | | False | 22 |
| morph_analysis | lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 22 |
| clauses | clause_type | | words | False | 4 |


In [4]:
for idx in range(len(text["paragraphs"])):
    print("Paragraph", idx + 1, ":", text.paragraphs[idx].enclosing_text)
    # Print only the attributes with a non-zero score
    for attr in text["paragraphs"].attributes:
        if text.paragraphs[attr][idx] > 0:
            print("Attribute:", attr, "--> Score:", text.paragraphs[attr][idx])
    print("---")

Paragraph 1 : Tšau ! mis teed???
Attribute: word_count --> Score: 5
Attribute: ignored_capital --> Score: 1
Attribute: incorrect_spaces --> Score: 1
---
Paragraph 2 : Kas sa kinno ei viitsi minna? mul on niiii igav et ma lähen hulluks varsti!
Attribute: word_count --> Score: 17
Attribute: missing_commas --> Score: 1
Attribute: unknown_words --> Score: 1
Attribute: letter_reps --> Score: 1
Attribute: ignored_capital --> Score: 1
---


The output above shows that the text has 2 paragraphs and that ParagraphWebLanguageScoreRetagger has detected several features in the text and added their scores to the paragraphs layer.
<br>
Note that **word_count** is not controlled by a flag -- it is always computed, because it is needed for calculating the whole-text score, which is always added.

If a feature's flag is set to True (whether by default or explicitly) but the feature is not detected in a paragraph, 0 is added to the paragraphs layer as the score of that feature.

In [5]:
text.paragraphs[0]['missing_commas'] # e.g. first paragraph had no missing commas

0

**whole_text_score** -- the scores of all features attached to the paragraphs layer (excluding word_count) are summed and divided by the total number of words in the text.

In [6]:
text.meta["whole_text_score"] 

0.2727272727272727
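
The value above can be checked by hand from the per-paragraph scores printed in Example #1. The sketch below repeats that arithmetic in plain Python; the function name `whole_text_score` and the list-of-dicts layout are assumptions made for this illustration and do not mirror the retagger's internals. Returning 0.0 for an empty text is likewise an assumption.

```python
# Per-paragraph scores copied from the output of Example #1
paragraph_scores = [
    {"word_count": 5, "ignored_capital": 1, "incorrect_spaces": 1},
    {"word_count": 17, "missing_commas": 1, "unknown_words": 1,
     "letter_reps": 1, "ignored_capital": 1},
]


def whole_text_score(paragraphs):
    """Sum all feature scores and divide by the total word count."""
    total_words = sum(p["word_count"] for p in paragraphs)
    total_features = sum(
        score
        for p in paragraphs
        for name, score in p.items()
        if name != "word_count"  # word_count is the denominator, not a feature
    )
    # Assumption: an empty text gets a score of 0.0
    return total_features / total_words if total_words else 0.0


score = whole_text_score(paragraph_scores)
print(round(score, 6))  # 0.272727
```

Here 6 detected features over 22 words give 6 / 22, which matches the value stored in `text.meta`.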

### Example #2

ParagraphWebLanguageScoreRetagger helps to compare and categorize texts. For example, if one text gets a whole_text_score of 0 and another gets 0.2727 (as in the previous example), the first is possibly a canonical language text and the second a non-canonical one.

We can test this idea on text files that have already been categorized as either canonical or non-canonical.

In [7]:
from estnltk.converters import json_to_text
import os

cwd = os.getcwd()
path = os.path.join(cwd, "kirjak_vs_mittekirjak_ettenten") # files taken from a folder "kirjak_vs_mittekirjak_ettenten"

for file in os.listdir(path):
    file_location = os.path.join(path, file)
    if "kirjak__filmitalgud_ee__58638.json" in file_location or "mittekirjak__www_lemmik_ee__100692.json" in file_location \
    or "kirjak__uudised_err_ee__98236.json" in file_location or "mittekirjak__juura_ee__100106.json" in file_location:
        filename = os.path.basename(file_location)  # works with any path separator
        text = json_to_text(file=file_location)
        text.tag_layer(["compound_tokens", "words", "paragraphs","clauses"])
        weblang_score_retagger.retag(text) 
        
        if "mittekirjak" in filename:
            print("Non-canonical text:\n")
            print(text.text,'\n')
        else:
            print("Canonical text:\n")
            print(text.text,'\n')
        
        for i in text["paragraphs"].attributes:
            print(i,text["paragraphs"][i])
            
        print("whole_text_score:",text.meta["whole_text_score"] )
        print("----------------------\n")

Canonical text:

Videopäevik: Küllap ka nõid on kunagi armastanud!

Viimasel näitlejakoolitusel Hiiumaal Kärdlas osales ka Publiku videopäeviku pidaja Gert. Kogu maikuu sai hoogu võetud, et hirmust üle saada ja peaaegu õnnestus. Igal filminäitleja koolitusel tehti alguses veidi lõdvestavaid harjutusi ning siis asuti konkreetsete stseenide juurde, mis stsenaariumis kirjas.

Meie tänane kangelane pidi koos Brendaga teelt eksima ja sattuma nõia juurde. Hirmust ülesaamiseks sisendas Gert endale, et küllap ka nõid on kunagi armastanud. 

word_count [9, 44, 27]
emoticons [0, 0, 0]
missing_commas [0, 0, 0]
unknown_words [0, 0, 0]
letter_reps [0, 0, 0]
no_spaces [0, 0, 0]
capital_letters [0, 0, 0]
foreign_letters [0, 0, 0]
ignored_capital [0, 0, 0]
incorrect_spaces [0, 0, 0]
whole_text_score: 0.0
----------------------

Canonical text:

Endine Tartu haridusosakonna finantseerimise peaspetsialist Irina Aab sai aastaid tulu ka linnalt sadu tuhandeid kroone teeninud Kersti Võlu Koolituskeskuse ko

The canonical texts have a whole_text_score of 0 -- no features that describe the usage of web language were found.
<br>
The non-canonical texts have whole_text_score values of 0.1297 and 0.0966, and, as can be seen in the output above, different features were detected in all the paragraphs of these texts.
<br>
The output confirms that the non-canonical texts contained more of the described features than the canonical texts.
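
The comparison above can be sketched as a simple rule on whole_text_score. This is only an illustration: the retagger does not classify texts itself, and both the function `label_text` and the cut-off `THRESHOLD` are hypothetical. A real cut-off would have to be tuned on labelled data, since canonical texts can also trigger some features.

```python
# Hypothetical cut-off; the retagger itself does not classify texts
THRESHOLD = 0.0


def label_text(score, threshold=THRESHOLD):
    """Label a text by comparing its whole_text_score with a cut-off."""
    return "non-canonical" if score > threshold else "canonical"


# Scores reported above: canonical texts scored 0.0, non-canonical 0.1297 and 0.0966
scores = [0.0, 0.1297, 0.0966]
labels = [label_text(s) for s in scores]

for s, label in zip(scores, labels):
    print(s, "->", label)
```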