Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 3.2: Stylistic Analysis

In the previous lab, we saw how to extract information about the syntactic structure of sentences. In this lab, we show a few more examples of extracting stylistic features. We provide examples for three simple stylistic features, but this is only a starting point. **Think about stylistic features that could characterize your dataset.**

In [1]:
import pandas as pd
import stanza
import string

# Read in TSV
tsv_file = "../data/veganism_overview_en.tsv"
news_content = pd.read_csv(tsv_file, sep="\t", keep_default_na=False, header=0)
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')

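# Filter out empty articles; .copy() avoids a possible SettingWithCopyWarning when we add columns later
news_content = news_content[news_content["Text"].str.len() > 0].copy()
articles = news_content["Text"]

# Run the Stanza pipeline on each article (this can take a while on CPU)
processed_articles = []
for article in articles:
    processed_articles.append(nlp(article))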

2023-10-25 13:44:59 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-10-25 13:44:59 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2023-10-25 13:44:59 INFO: Using device: cpu
2023-10-25 13:44:59 INFO: Loading: tokenize
2023-10-25 13:44:59 INFO: Loading: pos
2023-10-25 13:45:00 INFO: Loading: lemma
2023-10-25 13:45:00 INFO: Done loading processors!


In this lab, we extract information from the MRC database. This database contains many types of psycholinguistic features for English words. Download the file [mrc2.dct](https://github.com/samzhang111/mrc-psycholinguistics/blob/master/mrc2.dct) and save it in your data folder. As an example, we extract the concreteness ratings.

**Check the [documentation](https://websites.psychology.uwa.edu.au/school/MRCDatabase/mrc2.html#CONC) and figure out the meaning of the features. What does it mean if a text has a high level of average concreteness?**

Note: some browsers save *mrc2.dct* as *mrc2.dct.txt*. If that happens, adjust either the filename or the path in the code.
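
To avoid editing the path by hand, one option is to look for the file under both names. The following is only a minimal sketch: the helper `find_mrc_file` and its default arguments are hypothetical and not part of the lab code, and it assumes the file sits directly in the data folder.

```python
from pathlib import Path

def find_mrc_file(data_dir="../data", names=("mrc2.dct", "mrc2.dct.txt")):
    """Return the first candidate path that exists, or raise FileNotFoundError."""
    for name in names:
        candidate = Path(data_dir) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"None of {names} found in {data_dir}")
```

The returned path can then be passed to `open` instead of the hard-coded file name.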

In [2]:
def load_mrc(path='../data/mrc2.dct'):
    words = {}
    print("Start loading...")
    # Use a context manager so that the file is closed after reading
    with open(path, 'r') as mrc_file:
        for line in mrc_file:

            # This code is from https://github.com/samzhang111/mrc-psycholinguistics/blob/master/extract.py
            # Each line starts with fixed-width fields, followed by '|'-separated strings
            word, phon, dphon, stress = line[51:].split('|')

            nlet = int(line[0:2])
            nphon = int(line[2:4])
            nsyl = int(line[4])
            kf_freq = int(line[5:10])
            kf_ncats = int(line[10:12])
            kf_nsamp = int(line[12:15])
            tl_freq = int(line[15:21])
            brown_freq = int(line[21:25])
            fam = int(line[25:28])
            conc = int(line[28:31])
            imag = int(line[31:34])
            meanc = int(line[34:37])
            meanp = int(line[37:40])
            aoa = int(line[40:43])
            tq2 = line[43]
            wtype = line[44]
            pdwtype = line[45]
            alphasyl = line[46]
            status = line[47]
            var = line[48]
            cap = line[49]
            irreg = line[50]

            # For this example, we only extract the concreteness rating, but you might try other features
            # In this case, you could save a tuple as value for each word in the dictionary
            # Several lines can map to the same lowercased word. A rating of 0 means "no rating",
            # so we do not let it overwrite a rating that is already stored.
            key = word.lower()
            if conc > 0 or key not in words:
                words[key] = conc
    print("Done.")
    return words
mrc = load_mrc()

Start loading...
Done.


We define a function that calculates the mean concreteness for a list of tokens. Tokens which do not have a concreteness rating in the MRC database are ignored.


In [3]:
import statistics

def calculate_concreteness(tokens, mrc_concreteness):
    concreteness = []
    for token in tokens:

        # For words that are not in the database, we assign the rating 0
        concreteness_rating = mrc_concreteness.get(token.lower(), 0)

        # We only consider words that have a concreteness rating in the database when calculating the mean concreteness.
        # This might be problematic.
        # It could be good to additionally keep track of the number of unrated words.
        if concreteness_rating > 0:
            concreteness.append(concreteness_rating)

    if len(concreteness) > 0:
        return statistics.mean(concreteness)
    else:
        return 0.0
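
The comments above suggest keeping track of how many words have no rating. A minimal sketch of that idea follows; the function name `concreteness_with_coverage` and the toy ratings are made up for illustration and are not part of the lab code or the MRC database.

```python
import statistics

def concreteness_with_coverage(tokens, mrc_concreteness):
    """Return (mean rating of the rated tokens, share of tokens that are rated)."""
    ratings = []
    for token in tokens:
        if token is None:
            # Missing tokens (e.g. an empty lemma) count as unrated
            continue
        rating = mrc_concreteness.get(token.lower(), 0)
        if rating > 0:
            ratings.append(rating)
    mean_rating = statistics.mean(ratings) if ratings else 0.0
    coverage = len(ratings) / len(tokens) if tokens else 0.0
    return mean_rating, coverage

# Toy ratings, invented for this example only
toy_ratings = {"apple": 600, "idea": 300, "the": 0}
print(concreteness_with_coverage(["Apple", "idea", "the", "zzz"], toy_ratings))
```

A low coverage value signals that the mean is based on only a few words and should be interpreted with care.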


We build a document representation from three stylistic features: type-token ratio, average sentence length, and average concreteness. **Add more stylistic features to better distinguish between document styles. A good overview of features can be found in this [article](https://link.springer.com/article/10.3758/BF03195564).** Read the section "IDENTIFIER INFORMATION AND MEASURES SUPPLIED BY COH-METRIX". Section 2 of this [article](https://ep.liu.se/ecp/080/002/ecp12080002.pdf) on readability provides additional information. Both papers focus on formal texts. **Discuss features that could be relevant for less formal data (e.g., Twitter).** For example, one could calculate the emoji ratio, the average number of typos, or the ratio of capitalized words.
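
As a starting point for informal data, two of these features could be sketched as follows. Both function names are hypothetical, and both definitions rest on assumptions: "capitalized" is read here as "written entirely in upper case", and a token counts as an emoji if it contains a character in the Unicode category `So` (other symbols), which is only a rough proxy.

```python
import unicodedata

def capitalized_ratio(tokens):
    """Share of alphabetic tokens written entirely in upper case (length > 1)."""
    words = [t for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    shouted = [t for t in words if len(t) > 1 and t.isupper()]
    return len(shouted) / len(words)

def emoji_ratio(tokens):
    """Share of tokens containing a character of Unicode category 'So'."""
    if not tokens:
        return 0.0
    with_symbol = [t for t in tokens
                   if any(unicodedata.category(c) == "So" for c in t)]
    return len(with_symbol) / len(tokens)

# Made-up example tweet, already tokenized
tweet = ["I", "LOVE", "this", "recipe", "😍", "!"]
print(capitalized_ratio(tweet), emoji_ratio(tweet))
```

Single-letter tokens such as "I" are excluded from the upper-case count because they are capitalized by convention, not for emphasis.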

In [4]:
from collections import Counter

ttr = []
avg_sentence_len = []
avg_concreteness = []


for article in processed_articles:

    # Calculate TTR: number of distinct tokens divided by the total number of tokens
    token_frequencies = Counter()
    for sentence in article.sentences:
        all_tokens = [token.text for token in sentence.tokens]
        token_frequencies.update(all_tokens)
    num_types = len(token_frequencies)
    num_tokens = sum(token_frequencies.values())
    tt_ratio = num_types / num_tokens
    ttr.append(tt_ratio)

    # Calculate average sentence length (in tokens)
    sentence_lengths = [len(sentence.tokens) for sentence in article.sentences]
    avg_sentence_len.append(statistics.mean(sentence_lengths))

    # Calculate concreteness per sentence, then average over the sentences.
    # The lemmas must be collected inside the loop, once for each sentence.
    concreteness = []
    for sentence in article.sentences:
        lemmas = [word.lemma for word in sentence.words if word.lemma is not None]
        concreteness.append(calculate_concreteness(lemmas, mrc))
    avg_concreteness.append(statistics.mean(concreteness))

    # Calculate other metrics
    # ...

# Add the information to the data frame
news_content["Type-Token Ratio"] = ttr
news_content["Avg Sentence Length"] = avg_sentence_len
news_content["Avg Concreteness"] = avg_concreteness
news_content.to_csv("../data/toy_stylistic_features.csv")


The document representations can now be used to perform clustering or classification. If we use neural network classifiers, the document representations are usually learned directly. In this case, it can be interesting to test if stylistic features are implicitly contained in the learned representations (this is called *probing*). Keep in mind that this is just a toy example and that it is your job to come up with more advanced stylistic analyses.
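
Before clustering, note that the three features live on very different scales (the type-token ratio lies between 0 and 1, while MRC concreteness ratings are in the hundreds). One common preprocessing step, shown here only as a sketch with made-up values and a hypothetical helper `standardize`, is to z-score each feature so that no single feature dominates distance computations.

```python
import statistics

def standardize(values):
    """Z-score a list of numbers; a constant column maps to all zeros."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

# Made-up feature values for three documents, for illustration only
toy_ttr = [0.4, 0.5, 0.6]
toy_concreteness = [400, 450, 500]

# One standardized feature vector per document
vectors = list(zip(standardize(toy_ttr), standardize(toy_concreteness)))
print(vectors)
```

The resulting vectors could then be passed to a clustering or classification method of your choice.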