## ParagraphWebLanguageScoreRetagger

ParagraphWebLanguageScoreRetagger is a retagger for identifying texts, or parts of texts, that are written in web language. It detects attributes characteristic of web language in every paragraph of a text and attaches the resulting scores to the paragraphs layer.

In [7]:
from estnltk import Text
from paragraphweblanguagescoreretagger import ParagraphWebLanguageScoreRetagger

In [8]:
weblang_score_retagger = ParagraphWebLanguageScoreRetagger()
weblang_score_retagger

| name | output layer | output attributes | input layers |
|---|---|---|---|
| ParagraphWebLanguageScoreRetagger | paragraphs | ('word_count', 'emoticons', 'missing_commas', 'unknown_words', 'letter_reps', 'no_spaces', 'capital_letters', 'foreign_letters', 'ignored_capital', 'incorrect_spaces') | ('paragraphs', 'words', 'compound_tokens', 'clauses') |

| parameter | value |
|---|---|
| use_unknown_words | True |
| use_emoticons | True |
| use_letter_reps | True |
| use_punct_reps | False |
| use_capital_letters | True |
| use_missing_commas | True |
| use_ignored_capital | True |
| use_no_spaces | True |
| use_incorrect_spaces | True |
| use_foreign_letters | True |


Before applying ParagraphWebLanguageScoreRetagger, the input Text object must have layers "paragraphs", "words", "compound_tokens" and "clauses".

Texts can be analysed on the basis of 10 different attributes that describe the usage of web language. Each attribute has a flag that can be set to True or False; by default, 9 of the 10 are enabled.
<br>
For example, ParagraphWebLanguageScoreRetagger(use_punct_reps=True) enables the attribute **punct_reps**, which is disabled by default.

#### Flags and what they detect:

- **use_unknown_words** -- words that have no morphological analysis
<br>
- **use_emoticons** -- emoticons, e.g. *:D, :)*
<br>
- **use_letter_reps** -- the same letter more than twice in a row, e.g. *jaaaaa*
<br>
- **use_punct_reps** -- repeated punctuation marks (except dots), e.g. *!!!!!!*
<br>
- **use_capital_letters** -- longer stretches of text in capital letters, e.g. *MINE METSA! KUHU SA LÄHED?*
<br>
- **use_missing_commas** -- missing commas
<br>
- **use_ignored_capital** -- missing capital letters at the beginning of a sentence, e.g. *Tere? kuidas läheb?*
<br>
- **use_no_spaces** -- no space after a punctuation mark, e.g. *Ilm on ilus.Päike paistab.*
<br>
- **use_incorrect_spaces** -- incorrect spaces before and after punctuation marks, e.g. *Tore ! Mulle meeldib.*
<br>
- **use_foreign_letters** -- foreign letters, e.g. *q*
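
To make the descriptions above more concrete, here is a rough sketch of how a few of these patterns could be counted with plain regular expressions. It is not the retagger's implementation: the retagger works on EstNLTK layers, while the pattern names and the helper `count_matches` below are hypothetical and only approximate the flag descriptions.

```python
import re

# Hypothetical approximations of four flags; these are NOT the retagger's own rules.
LETTER_REPS = re.compile(r"([^\W\d_])\1{2,}")      # same letter three or more times in a row
PUNCT_REPS = re.compile(r"([!?,;:])\1+")           # repeated punctuation mark, dots excluded
NO_SPACES = re.compile(r"[.,!?;:](?=[^\W\d_])")    # punctuation directly followed by a letter
INCORRECT_SPACES = re.compile(r"\s[.,!?;:]")       # whitespace right before a punctuation mark


def count_matches(pattern, text):
    """Count the non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


# Sample strings are taken from the flag descriptions above.
examples = [
    ("letter_reps", LETTER_REPS, "jaaaaa"),
    ("punct_reps", PUNCT_REPS, "!!!!!!"),
    ("no_spaces", NO_SPACES, "Ilm on ilus.Päike paistab."),
    ("incorrect_spaces", INCORRECT_SPACES, "Tore ! Mulle meeldib."),
]

for name, pattern, sample in examples:
    print(name, count_matches(pattern, sample))
```

Each sample string yields exactly one match for its pattern. The real retagger may count differently, because it can rely on tokenization and morphological analysis rather than on raw characters.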

### Example #1

Let's first try ParagraphWebLanguageScoreRetagger on a string consisting of 4 sentences and 2 paragraphs.

In [3]:
text = Text('''Tšau ! mis teed???


Kas sa kinno ei viitsi minna? mul on niiii igav et ma lähen hulluks varsti!"''')
# Add the required layers
text.tag_layer(["compound_tokens", "words", "paragraphs", "clauses"])
# Add the annotation (adds the scores of web language attributes to the paragraphs layer)
weblang_score_retagger.retag(text)

text
"Tšau ! mis teed???Kas sa kinno ei viitsi minna? mul on niiii igav et ma lähen hulluks varsti!"""

| key | value |
|---|---|
| whole_text_score | 0.217391 |

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | word_count, emoticons, missing_commas, unknown_words, letter_reps, no_spaces, capital_letters, foreign_letters, ignored_capital, incorrect_spaces | | sentences | False | 2 |
| sentences | | | words | False | 4 |
| tokens | | | | False | 23 |
| compound_tokens | type, normalized | | tokens | False | 0 |
| words | normalized_form | | | False | 23 |
| morph_analysis | lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 23 |
| clauses | clause_type | | words | False | 4 |


In [5]:
for attribute in text["paragraphs"].attributes:
    for index in range(len(text["paragraphs"])):
        print("Attribute:", attribute)
        print("Score in paragraph", index + 1, ":", text.paragraphs[attribute][index])
    print("---")

Attribute: word_count
Score in paragraph 1 : 5
Attribute: word_count
Score in paragraph 2 : 18
---
Attribute: emoticons
Score in paragraph 1 : 0
Attribute: emoticons
Score in paragraph 2 : 0
---
Attribute: missing_commas
Score in paragraph 1 : 0
Attribute: missing_commas
Score in paragraph 2 : 1
---
Attribute: unknown_words
Score in paragraph 1 : 0
Attribute: unknown_words
Score in paragraph 2 : 0
---
Attribute: letter_reps
Score in paragraph 1 : 0
Attribute: letter_reps
Score in paragraph 2 : 1
---
Attribute: no_spaces
Score in paragraph 1 : 0
Attribute: no_spaces
Score in paragraph 2 : 0
---
Attribute: capital_letters
Score in paragraph 1 : 0
Attribute: capital_letters
Score in paragraph 2 : 0
---
Attribute: foreign_letters
Score in paragraph 1 : 0
Attribute: foreign_letters
Score in paragraph 2 : 0
---
Attribute: ignored_capital
Score in paragraph 1 : 1
Attribute: ignored_capital
Score in paragraph 2 : 1
---
Attribute: incorrect_spaces
Score in paragraph 1 : 1
Attribute: incorrect_s

The output above shows that the text has 2 paragraphs. ParagraphWebLanguageScoreRetagger computed a score for every enabled attribute and added it to the paragraphs layer; a score of 0 means that the attribute was not found in that paragraph.
<br>
<br>
Note that the attribute **word_count** has no flag: it is always added, because it is needed for calculating the whole text score.
<br>
**whole_text_score** -- the scores of all attributes attached to the paragraphs layer (excluding word_count) are summed and divided by the total number of words in the text. The result is stored in the metadata of the text.

In [6]:
text.meta["whole_text_score"] 

0.21739130434782608
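
The arithmetic behind this value can be checked by hand. The sketch below recomputes the score from the per-paragraph values printed in Example #1; it only illustrates the formula described above and is not the retagger's code. The function name `whole_text_score` and the handling of an empty text are assumptions, and the second `incorrect_spaces` value is assumed to be 0 because the printed output is cut off at that point (0 is the value consistent with the reported score).

```python
# Per-paragraph scores as printed in Example #1 (one value per paragraph).
paragraph_scores = {
    "emoticons": [0, 0],
    "missing_commas": [0, 1],
    "unknown_words": [0, 0],
    "letter_reps": [0, 1],
    "no_spaces": [0, 0],
    "capital_letters": [0, 0],
    "foreign_letters": [0, 0],
    "ignored_capital": [1, 1],
    "incorrect_spaces": [1, 0],  # second value is an assumption (output truncated)
}
word_count = [5, 18]


def whole_text_score(scores, word_counts):
    """Sum all attribute scores and divide by the total number of words."""
    total_words = sum(word_counts)
    if total_words == 0:
        # Assumption: an empty text gets a score of 0.0.
        return 0.0
    total_hits = sum(sum(values) for values in scores.values())
    return total_hits / total_words


score = whole_text_score(paragraph_scores, word_count)
print(round(score, 6))  # 0.217391
```

Five detected attributes over 23 words give 5 / 23, which matches the value stored in text.meta.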

### Example #2

ParagraphWebLanguageScoreRetagger helps to compare and categorize texts -- if one text gets a whole_text_score of 0 and another gets 0.217, as in the previous example, the first is possibly written in canonical language and the second in non-canonical language.

We can test this idea on two text files that have already been categorized as canonical and non-canonical.

In [18]:
import os

from estnltk.converters import json_to_text

# The test files are read from the folder "test_files" in the working directory
path = os.path.join(os.getcwd(), "test_files")

for filename in os.listdir(path):
    # Skip everything that is not a JSON file
    if not filename.endswith(".json"):
        continue
    text = json_to_text(file=os.path.join(path, filename))
    weblang_score_retagger.retag(text)

    # Non-canonical test files have "mittekirjak" in their file names
    if "mittekirjak" in filename:
        print("Non-canonical text:")
    else:
        print("Canonical text:")

    # Print the scores of every attribute, one list per attribute
    for attribute in text["paragraphs"].attributes:
        print(attribute, text["paragraphs"][attribute])

    print("whole_text_score:", text.meta["whole_text_score"])
    print("----------------")

Canonical text:
word_count [44, 46, 52, 50, 120, 1, 1]
emoticons [0, 0, 0, 0, 0, 0, 0]
missing_commas [0, 0, 0, 0, 0, 0, 0]
unknown_words [0, 0, 0, 0, 0, 0, 0]
letter_reps [0, 0, 0, 0, 0, 0, 0]
no_spaces [0, 0, 0, 0, 0, 0, 0]
capital_letters [0, 0, 0, 0, 0, 0, 0]
foreign_letters [0, 0, 0, 0, 0, 0, 0]
ignored_capital [0, 0, 0, 0, 0, 0, 0]
incorrect_spaces [0, 0, 0, 0, 0, 0, 0]
whole_text_score: 0.0
----------------
Non-canonical text:
word_count [63, 82, 77, 62, 81, 64, 81, 80, 56, 64, 100, 63, 76]
emoticons [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
missing_commas [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
unknown_words [2, 0, 0, 1, 2, 3, 0, 0, 1, 0, 2, 3, 2]
letter_reps [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
no_spaces [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 2]
capital_letters [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
foreign_letters [1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 3, 0]
ignored_capital [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 4, 0, 0]
incorrect_spaces [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
whole_text_

The canonical text got a whole_text_score of 0 -- no attributes that describe the usage of web language were found.
<br>
The non-canonical text got a whole_text_score of 0.0569 and, as the output above shows, at least one attribute was detected in each of its 13 paragraphs.
<br>
The output confirms that the non-canonical text contains more of the described attributes than the canonical text.
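
Since the scores are stored per paragraph, the same numbers can also point to the parts of a text that look most like web language. The sketch below is an illustrative extension, not something the retagger computes: it takes the lists printed for the non-canonical text, sums the attribute scores of each paragraph, and divides by that paragraph's word count. The names `hits_per_paragraph` and `density` are hypothetical.

```python
# Values copied from the output of the non-canonical text above.
word_count = [63, 82, 77, 62, 81, 64, 81, 80, 56, 64, 100, 63, 76]
scores = {
    "emoticons":        [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "missing_commas":   [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "unknown_words":    [2, 0, 0, 1, 2, 3, 0, 0, 1, 0, 2, 3, 2],
    "letter_reps":      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "no_spaces":        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 2],
    "capital_letters":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "foreign_letters":  [1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 3, 0],
    "ignored_capital":  [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 4, 0, 0],
    "incorrect_spaces": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
}

# Sum the attribute scores column by column, i.e. paragraph by paragraph.
hits_per_paragraph = [sum(column) for column in zip(*scores.values())]

# Hypothetical per-paragraph ratio: detected attributes per word.
density = [hits / words for hits, words in zip(hits_per_paragraph, word_count)]

# The whole text score is the total number of hits over the total number of words.
whole_text_score = sum(hits_per_paragraph) / sum(word_count)
print(round(whole_text_score, 4))  # 0.0569

# Paragraph with the highest ratio (1-based, as in the examples above).
most_weblike = max(range(len(density)), key=density.__getitem__) + 1
print(most_weblike, round(density[most_weblike - 1], 3))  # 12 0.127
```

The recomputed whole text score agrees with the 0.0569 reported above, and the per-paragraph ratios show that the detected attributes are concentrated in the later paragraphs of this text.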