[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/danielmlow/llm_course/blob/main/linguistic_properties.ipynb)

# English linguistic features with spaCy + TextDescriptives

This tutorial shows how to extract a wide set of linguistic and readability features from English text using spaCy and the TextDescriptives library (https://github.com/HLasse/TextDescriptives).

Sections:
- Install dependencies and download a spaCy model
- Extract features using the TextDescriptives convenience functions
- Integrate TextDescriptives as a spaCy pipeline component and access extensions on `Doc` objects
- Run metrics on a list of texts and export to a DataFrame

Notes:
- This notebook uses the `textdescriptives` v2+ API. See the project README and the docs for the full list of metrics and components: https://hlasse.github.io/TextDescriptives/usingthepackage.html#available-attributes
- If you already have `spacy` and `en_core_web_sm` (or another English model) installed, skip the install cells.
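
To decide whether the install cell can be skipped, a quick check along the following lines may help. It is a minimal sketch that relies only on the standard library; `missing_packages` is a hypothetical helper name, and the check assumes the spaCy model was installed as a regular Python package (which is what `spacy download` does).

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# spaCy models install as ordinary Python packages, so the model name
# can be checked the same way as the libraries themselves.
required = ["spacy", "textdescriptives", "en_core_web_sm"]
to_install = missing_packages(required)

if to_install:
    print("Run the install cell; missing:", ", ".join(to_install))
else:
    print("Everything is already installed; the install cell can be skipped.")
```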

In [1]:
# Install required packages (run once).
# If you prefer to install manually in your environment, skip this cell.
import sys
import subprocess
packages = [
    "textdescriptives",
    "spacy"
]
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade'] + packages)
# Download a small English model (switch to en_core_web_lg if you need word vectors)
subprocess.check_call([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'])
print('Installed textdescriptives and spacy + downloaded en_core_web_sm')

Collecting textdescriptives
  Downloading textdescriptives-2.8.4-py3-none-any.whl.metadata (25 kB)
Collecting spacy
  Downloading spacy-3.8.7-cp311-cp311-macosx_11_0_arm64.whl.metadata (27 kB)
Collecting numpy<2.0.0,>=1.20.0 (from textdescriptives)
  Using cached numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
Collecting pyphen>=0.11.0 (from textdescriptives)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Collecting ftfy>=6.0.3 (from textdescriptives)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using c

## Quick example: extract metrics from a single text using `td.extract_metrics`
`td.extract_metrics` is a convenience function that will auto-download an appropriate spaCy model if needed and return a pandas DataFrame with the requested metrics.

In [None]:
import textdescriptives as td

text = (
    'The world is changed. I feel it in the water. I feel it in the earth. '
    'I smell it in the air. Much that once was is lost, for none now live who remember it.'
)
# extract all metrics (can be large). Use metrics=['readability','pos_proportions'] to get a subset.
df_all = td.extract_metrics(text=text, lang='en', metrics=None)

df_all.head().T

ℹ No spacy model provided. Inferring spacy model for en.
Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful

Unnamed: 0,0
text,The world is changed. I feel it in the water. ...
passed_quality_check,False
n_stop_words,24.0
alpha_ratio,0.853659
mean_word_length,2.95122
...,...
rix,0.4
dependency_distance_mean,1.761905
dependency_distance_std,0.530263
prop_adjacent_dependency_relation_mean,0.457143


In [4]:
df_all = df_all.T
print(df_all.index)


Index(['text', 'passed_quality_check', 'n_stop_words', 'alpha_ratio',
       'mean_word_length', 'doc_length', 'symbol_to_word_ratio_#',
       'proportion_ellipsis', 'proportion_bullet_points',
       'contains_lorem ipsum', 'duplicate_line_chr_fraction',
       'duplicate_paragraph_chr_fraction', 'duplicate_ngram_chr_fraction_5',
       'duplicate_ngram_chr_fraction_6', 'duplicate_ngram_chr_fraction_7',
       'duplicate_ngram_chr_fraction_8', 'duplicate_ngram_chr_fraction_9',
       'duplicate_ngram_chr_fraction_10', 'top_ngram_chr_fraction_2',
       'top_ngram_chr_fraction_3', 'top_ngram_chr_fraction_4', 'oov_ratio',
       'first_order_coherence', 'second_order_coherence', 'entropy',
       'perplexity', 'per_word_perplexity', 'pos_prop_ADJ', 'pos_prop_ADP',
       'pos_prop_ADV', 'pos_prop_AUX', 'pos_prop_CCONJ', 'pos_prop_DET',
       'pos_prop_INTJ', 'pos_prop_NOUN', 'pos_prop_NUM', 'pos_prop_PART',
       'pos_prop_PRON', 'pos_prop_PROPN', 'pos_prop_PUNCT', 'pos_prop_SCONJ',


## Integrate TextDescriptives into a spaCy pipeline
You can add `textdescriptives` components to any spaCy `nlp` pipeline. Components are named `textdescriptives/<component>`; the shorthand `textdescriptives/all` adds all of them at once. After a text has been processed, the results are available as `doc._.` extension attributes.

In [5]:
import spacy
import textdescriptives as td
import pandas as pd

# load a spaCy model (use en_core_web_sm or en_core_web_lg)
nlp = spacy.load('en_core_web_sm')
# add all textdescriptives components to the pipeline
nlp.add_pipe('textdescriptives/all')

doc = nlp(text)
# print some of the attributes produced by TextDescriptives
print('Readability metrics (doc._.readability):')
print(doc._.readability)
print('Some basic descriptive stats from doc._.descriptive_stats keys:')
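# Note: 'mean_word_length' comes from the quality component, not from descriptive_stats,
# so only 'n_tokens' and 'n_sentences' match the filter here.
print({k: v for k, v in doc._.descriptive_stats.items() if k in ['n_tokens','n_sentences','mean_word_length']})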

# convert the doc to a DataFrame (single-row)
df_from_doc = td.extract_df(doc)
df_from_doc.T.head()

Readability metrics (doc._.readability):
{'flesch_reading_ease': 107.87857142857146, 'flesch_kincaid_grade': -0.048571428571428044, 'smog': 5.683917801722854, 'gunning_fog': 3.942857142857143, 'automated_readability_index': -2.4542857142857173, 'coleman_liau_index': -0.7085714285714317, 'lix': 12.714285714285715, 'rix': 0.4}
Some basic descriptive stats from doc._.descriptive_stats keys:
{'n_tokens': 35, 'n_sentences': 5}




Unnamed: 0,0
text,The world is changed. I feel it in the water. ...
passed_quality_check,False
n_stop_words,24.0
alpha_ratio,0.853659
mean_word_length,2.95122
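
The readability values printed above can be reproduced by hand, which makes it easier to see what each score measures. The sketch below uses the standard textbook formulas and counts tallied manually from the example passage (punctuation excluded). It matches the printed output for this text, but TextDescriptives' own tokenization and syllable counting may give different counts on other inputs, so treat it as an illustration rather than the library's implementation.

```python
import math

# Counts for the example passage, tallied by hand (punctuation excluded).
n_words = 35        # same as n_tokens in the output above
n_sentences = 5
n_syllables = 38    # every word has one syllable except "water" (2) and "remember" (3)
n_letters = 115
n_long_words = 2    # more than six letters: "changed", "remember"
n_hard_words = 1    # three or more syllables: "remember"

words_per_sentence = n_words / n_sentences
syllables_per_word = n_syllables / n_words
letters_per_100_words = 100 * n_letters / n_words
sentences_per_100_words = 100 * n_sentences / n_words

scores = {
    "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
    "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    "smog": 1.043 * math.sqrt(30 * n_hard_words / n_sentences) + 3.1291,
    "gunning_fog": 0.4 * (words_per_sentence + 100 * n_hard_words / n_words),
    "automated_readability_index": 4.71 * n_letters / n_words + 0.5 * words_per_sentence - 21.43,
    "coleman_liau_index": 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8,
    "lix": words_per_sentence + 100 * n_long_words / n_words,
    "rix": n_long_words / n_sentences,
}

for name, value in scores.items():
    print(f"{name:28} {value:.4f}")
```

Short sentences and mostly one-syllable words are what push Flesch reading ease above 100 and the grade-level scores below zero for this passage.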


## Inspect spaCy-side features alongside TextDescriptives outputs
You can still access token-level spaCy attributes such as `.pos_`, `.dep_`, and `.lemma_`. Below we show the first tokens together with a few of these attributes.

In [6]:
# token-level view
tokens = [(t.text, t.lemma_, t.pos_, t.dep_, t.is_stop) for t in doc]
pd.DataFrame(tokens, columns=['text','lemma','pos','dep','is_stop']).head(20)

Unnamed: 0,text,lemma,pos,dep,is_stop
0,The,the,DET,det,True
1,world,world,NOUN,nsubjpass,False
2,is,be,AUX,auxpass,True
3,changed,change,VERB,ROOT,False
4,.,.,PUNCT,punct,False
5,I,I,PRON,nsubj,True
6,feel,feel,VERB,ROOT,False
7,it,it,PRON,dobj,True
8,in,in,ADP,prep,True
9,the,the,DET,det,True


## Batch processing: run metrics on multiple texts and save results
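Use a loop or list comprehension to compute metrics for a list of documents and concatenate the results into a single DataFrame. Passing `spacy_model` (a model name such as `en_core_web_sm`) tells `extract_metrics` which pipeline to load instead of inferring one from `lang`. The resulting DataFrame can then be written to disk with, for example, `df_batch.to_csv('metrics.csv', index=False)`.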

In [7]:
texts = [
    'This is a short, simple sentence.',
    'Here is a longer sentence, with more clauses, punctuation, and complexity: it should raise reading difficulty.',
    'Once upon a time, in a land far away, there lived a programmer who loved natural language processing.'
]
# Extract only readability and POS proportions for speed; pass metrics=None to get all metrics.
df_batch = pd.concat(
    [
        td.extract_metrics(text=t, spacy_model='en_core_web_sm', metrics=['readability', 'pos_proportions'])
        for t in texts
    ],
    ignore_index=True,
)
df_batch

Unnamed: 0,text,pos_prop_ADJ,pos_prop_ADP,pos_prop_ADV,pos_prop_AUX,pos_prop_CCONJ,pos_prop_DET,pos_prop_INTJ,pos_prop_NOUN,pos_prop_NUM,...,n_characters,n_sentences,flesch_reading_ease,flesch_kincaid_grade,smog,gunning_fog,automated_readability_index,coleman_liau_index,lix,rix
0,"This is a short, simple sentence.",0.25,0.0,0.0,0.125,0.0,0.125,0.0,0.125,0.0,...,28,1,102.045,0.516667,,2.4,1.98,4.746667,22.666667,1.0
1,"Here is a longer sentence, with more clauses, ...",0.095238,0.047619,0.047619,0.095238,0.047619,0.047619,0.0,0.238095,0.0,...,95,1,63.695,8.35,,13.9,13.06375,15.425,53.5,6.0
2,"Once upon a time, in a land far away, there li...",0.047619,0.047619,0.142857,0.0,0.0,0.142857,0.0,0.238095,0.0,...,84,1,75.765,7.163333,,11.644444,8.765,9.015556,40.222222,4.0


## Select specific metrics or components
The `metrics` argument accepts a list (e.g., `['readability','descriptive_stats']`) or `None` for all. You can also add only the components you need into a spaCy pipeline: e.g., `nlp.add_pipe('textdescriptives/readability')`.

## Some pointers
- See the TextDescriptives docs for full API reference and an explanation of each metric: https://hlasse.github.io/TextDescriptives/
- If you need multilingual processing, TextDescriptives supports several languages; pass `lang` to `extract_metrics` or load a spaCy model for your target language.
- To compute semantic coherence you may want a model with word vectors (e.g., `en_core_web_lg`).
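
To make the last pointer concrete, here is a rough sketch of what an n-th order coherence score could look like. It assumes coherence is the mean cosine similarity between each sentence vector and the one `order` positions later; the function names and the toy two-dimensional vectors are hypothetical, and the TextDescriptives docs remain the reference for the exact definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector is what you would get from a model without word vectors.
        return 0.0
    return dot / (norm_u * norm_v)


def coherence(sentence_vectors, order=1):
    """Mean similarity between sentence i and sentence i + order."""
    pairs = zip(sentence_vectors, sentence_vectors[order:])
    sims = [cosine(u, v) for u, v in pairs]
    return sum(sims) / len(sims) if sims else float("nan")


# Toy sentence vectors, purely illustrative.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(coherence(toy_vectors, order=1))  # neighbouring sentences
print(coherence(toy_vectors, order=2))  # sentences two apart
```

The score is only as meaningful as the vectors behind it, which is why a model that ships real word vectors is the safer choice for coherence metrics.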


# Some spaCy linguistic features (there are more)

In [9]:
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

doc = nlp(texts[0])

print("=== Tokens, Lemmas, POS, Morphology ===")
for token in doc:
    print(f"{token.text:12} | Lemma: {token.lemma_:12} | POS: {token.pos_:8} | Morphology: {token.morph}")

=== Tokens, Lemmas, POS, Morphology ===
This         | Lemma: this         | POS: PRON     | Morphology: Number=Sing|PronType=Dem
is           | Lemma: be           | POS: AUX      | Morphology: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a            | Lemma: a            | POS: DET      | Morphology: Definite=Ind|PronType=Art
short        | Lemma: short        | POS: ADJ      | Morphology: Degree=Pos
,            | Lemma: ,            | POS: PUNCT    | Morphology: PunctType=Comm
simple       | Lemma: simple       | POS: ADJ      | Morphology: Degree=Pos
sentence     | Lemma: sentence     | POS: NOUN     | Morphology: Number=Sing
.            | Lemma: .            | POS: PUNCT    | Morphology: PunctType=Peri
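
The POS tags printed above are the raw material for the `pos_prop_*` columns in the batch table earlier in the notebook. As a small sketch (with the tags copied by hand rather than read from a `Doc`), tallying the tags and dividing by the number of tokens gives the same proportions shown there for this sentence, for example 0.25 for `pos_prop_ADJ`.

```python
from collections import Counter

# (token, POS) pairs copied from the output above.
tagged = [
    ("This", "PRON"), ("is", "AUX"), ("a", "DET"), ("short", "ADJ"),
    (",", "PUNCT"), ("simple", "ADJ"), ("sentence", "NOUN"), (".", "PUNCT"),
]

# Count each tag, then divide by the total number of tokens (punctuation included).
counts = Counter(pos for _, pos in tagged)
pos_prop = {f"pos_prop_{pos}": n / len(tagged) for pos, n in counts.items()}

for name, value in sorted(pos_prop.items()):
    print(f"{name:16} {value:.3f}")
```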


In [10]:
print("\n=== Dependency Relations ===")
for token in doc:
    print(f"{token.text:10} <--{token.dep_}-- {token.head.text}")




=== Dependency Relations ===
This       <--nsubj-- is
is         <--ROOT-- is
a          <--det-- sentence
short      <--amod-- sentence
,          <--punct-- sentence
simple     <--amod-- sentence
sentence   <--attr-- is
.          <--punct-- is
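
The head links printed above are enough to work out a dependency distance by hand, which is the idea behind metrics such as `dependency_distance_mean`. The sketch below assumes the distance of a token is the absolute difference between its position and its head's position, with the ROOT token counted as 0; the head indices are read off the printed arrows. Check the TextDescriptives docs for the exact definition and for how sentence-level values are aggregated per document.

```python
# (token index, token text, head index) for the sentence parsed above.
# The ROOT token ("is") points to itself, so its distance is 0.
arcs = [
    (0, "This", 1), (1, "is", 1), (2, "a", 6), (3, "short", 6),
    (4, ",", 6), (5, "simple", 6), (6, "sentence", 1), (7, ".", 1),
]

distances = [abs(head - i) for i, _, head in arcs]
mean_distance = sum(distances) / len(distances)
# Share of tokens whose head is the immediately neighbouring token.
prop_adjacent = sum(d == 1 for d in distances) / len(distances)

print(distances)      # [1, 0, 4, 3, 2, 1, 5, 6]
print(mean_distance)  # 2.75
print(prop_adjacent)  # 0.25
```

Longer arcs, such as the final period attaching back to the verb, are what raise the mean.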
