# Stanza analysis: distribution of gender-marked adjectives and verbs by author gender (F/M) and text genre

This notebook runs a morphosyntactic analysis of the GxG texts (YouTube excluded) to compare how authors of gender **F** and **M** use **adjectives** and **verbs** that carry grammatical gender (masculine or feminine), within each **text genre**.
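
The grammatical gender mentioned above is read from each word's morphological features, which Stanza exposes as a Universal Dependencies string such as `Gender=Fem|Number=Sing` (or `None` when a word has no features). The following is a minimal sketch of how such a string can be tested for gender marking. The helper names `parse_feats` and `has_binary_gender` and the example tokens are hypothetical and written by hand; the notebook itself uses a simpler substring test.

```python
def parse_feats(feats):
    """Turn a UD feature string such as 'Gender=Fem|Number=Sing' into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def has_binary_gender(upos, feats):
    """True for ADJ/VERB tokens whose Gender feature is exactly Fem or Masc."""
    gender = parse_feats(feats).get("Gender")
    return upos in {"ADJ", "VERB"} and gender in {"Fem", "Masc"}


# Hypothetical (upos, feats) pairs, written by hand for illustration
examples = [
    ("ADJ", "Gender=Fem|Number=Sing"),
    ("VERB", "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part"),
    ("VERB", "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"),
    ("NOUN", "Gender=Fem|Number=Plur"),
    ("ADJ", None),
]

flags = [has_binary_gender(upos, feats) for upos, feats in examples]
print(flags)  # [True, True, False, False, False]
```

Finite verb forms usually carry no `Gender` feature, so in practice the verbs counted this way are mostly participles. The exact-match test in this sketch is slightly stricter than a substring test, which would also accept a multi-valued feature.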


In [1]:
import stanza
import re
import pandas as pd
from collections import defaultdict
from tqdm import tqdm

# Initialize the Italian pipeline
stanza.download("it")
nlp = stanza.Pipeline(lang="it", processors="tokenize,mwt,pos,lemma,depparse", use_gpu=False)

# Input files
files = {
    "Children": "../data/dataset_originale/Training/GxG_Children.txt",
    "Diary": "../data/dataset_originale/Training/GxG_Diary.txt",
    "Journalism": "../data/dataset_originale/Training/GxG_Journalism.txt",
    "Twitter": "../data/dataset_originale/Training/GxG_Twitter.txt"
}


2025-06-01 23:17:35 INFO: Downloaded file to C:\Users\agnes\stanza_resources\resources.json
2025-06-01 23:17:35 INFO: Downloading default packages for language: it (Italian) ...
2025-06-01 23:17:37 INFO: File exists: C:\Users\agnes\stanza_resources\it\default.zip
2025-06-01 23:17:45 INFO: Finished downloading models and saved to C:\Users\agnes\stanza_resources
2025-06-01 23:17:45 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-06-01 23:17:45 INFO: Downloaded file to C:\Users\agnes\stanza_resources\resources.json
2025-06-01 23:17:47 INFO: Loading these models for language: it (Italian):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2025-06-01 23:17:47 INFO: Using device: cpu
2025-06-01 23:17:47 INFO: Loading: tokenize
2025-06-01 23:17:51 INFO: Loading: mwt
2025-06-01 23:17:51 INFO: Loading: pos
2025-06-01 23:17:55 INFO: Loading: lemma
2025-06-01 23:17:57 INFO: Loading: depparse
2025-06-01 23:17:57 INFO: Done loading processors!
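
The analysis cell that follows pulls documents out of each input file with a regular expression that captures the author gender and the document body. The sketch below applies that same pattern to a small sample. The sample text is invented and only mirrors what the pattern expects; the real GxG files may differ in details, and the name `DOC_RE` is introduced here for illustration.

```python
import re

# Same pattern as in the analysis cell: group 1 is the author gender,
# group 2 is the document body (non-greedy, may span several lines)
DOC_RE = re.compile(r'<doc id="\d+" genre="\w+" gender="(F|M)">([\s\S]*?)</doc>')

# Invented sample, shaped only by what the pattern expects
sample = (
    '<doc id="1" genre="twitter" gender="F">\nSono stanca ma felice.\n</doc>\n'
    '<doc id="2" genre="twitter" gender="M">\nSono arrivato tardi.\n</doc>\n'
)

docs = [(gender, text.strip()) for gender, text in DOC_RE.findall(sample)]
for gender, text in docs:
    print(gender, "|", text)
```

Because the pattern requires the attributes in the order `id`, `genre`, `gender`, a document whose tag deviates from that shape is silently skipped, so it can be worth comparing `len(docs)` with the expected number of documents per file.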


In [None]:
# Analyze all texts and collect counts of gender-marked ADJ/VERB tokens for F and M authors
counts = defaultdict(lambda: {"F": 0, "M": 0})

for genre, path in files.items():
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
        matches = re.findall(r'<doc id="\d+" genre="\w+" gender="(F|M)">([\s\S]*?)</doc>', content)

    for gender, text in tqdm(matches, desc=genre):
        doc = nlp(text.strip())
        for sent in doc.sentences:
            for word in sent.words:
                feats = word.feats if word.feats else ""
                if word.upos in ["ADJ", "VERB"] and ("Gender=Fem" in feats or "Gender=Masc" in feats):
                    key = (genre, word.upos)
                    counts[key][gender] += 1

# Convert to a DataFrame
rows = []
for (genre, pos), d in counts.items():
    F = d["F"]
    M = d["M"]
    tot = F + M
    rows.append({
        "Genere testuale": genre,
        "Categoria": pos,
        "F": F,
        "M": M,
        "Totale": tot,
        "%F": round(F / tot * 100, 1) if tot > 0 else 0.0,
        "%M": round(M / tot * 100, 1) if tot > 0 else 0.0
    })

df = pd.DataFrame(rows).sort_values(["Genere testuale", "Categoria"]).reset_index(drop=True)
df


The table shows the number of adjectives (ADJ) and verbs (VERB) carrying morphological gender marking that were produced by F and M authors in each text genre. The %F and %M columns give each author gender's share of the total for that genre and category.
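
As a worked illustration of how the %F and %M columns described above are derived, the sketch below applies the same rounding and zero-total guard to made-up counts. The numbers are hypothetical and are not results from the GxG corpus; the helper name `shares` is introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical counts, NOT results from the GxG corpus
counts = defaultdict(lambda: {"F": 0, "M": 0})
counts[("Diary", "ADJ")]["F"] = 30
counts[("Diary", "ADJ")]["M"] = 20
counts[("Diary", "VERB")]  # touching the key creates an entry with zero counts


def shares(d):
    """Return (%F, %M) rounded to one decimal; (0.0, 0.0) when the total is zero."""
    tot = d["F"] + d["M"]
    if tot == 0:
        return 0.0, 0.0
    return round(d["F"] / tot * 100, 1), round(d["M"] / tot * 100, 1)


result = {key: shares(d) for key, d in counts.items()}
print(result)
# {('Diary', 'ADJ'): (60.0, 40.0), ('Diary', 'VERB'): (0.0, 0.0)}
```

Since the shares are computed on raw counts, they also depend on how much text each author group contributed to a genre, not only on how often each group uses gender-marked forms.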
