# TextRank Example

(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 1.1, January 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**Prerequisites:**

To install [spaCy](https://spacy.io/) follow the instructions on the [Install spaCy page](https://spacy.io/usage).

In [None]:
!pip install -U pip setuptools wheel

The following command installs spaCy for my environment, which uses a GPU with CUDA 12.x. Adjust the extras in the brackets to match your own hardware and languages; see the [spaCy homepage](https://spacy.io/usage) for detailed installation instructions.

In [None]:
!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'

Install the required modules:

    pip install ipyfilechooser pytextrank

Read more about the TextRank algorithm in the publications listed at the [pytextrank page](https://pypi.org/project/pytextrank/).

In [None]:
!pip install -U ipyfilechooser pytextrank

## Introduction

In [1]:
import spacy
import pytextrank
import os
from ipyfilechooser import FileChooser

We can use the file chooser to select the document to be analyzed. I recommend using `data/bio_1.txt`. Run the code cell below and pick your text file for analysis.

In [2]:
fc = FileChooser()
display(fc)

FileChooser(path='C:\Users\damir\Dropbox\Develop\python-tutorial-notebooks\notebooks', filename='', title='', …

We need to load a language model and add the `textrank` component to the NLP pipeline. The following example loads the small English core model, `en_core_web_sm`, which has to be installed first (for example with `python -m spacy download en_core_web_sm`).

In [3]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x15f3514f500>
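
To give a rough idea of what a TextRank-style ranking computes, here is a minimal sketch in plain Python. It is not the pytextrank implementation: the function name `textrank_keywords`, the regular-expression tokenizer, and the default values for `window`, `damping`, and `iterations` are assumptions made for illustration. As far as I understand, pytextrank works on spaCy tokens and ranks noun chunks and entities rather than single words. The sketch links words that occur close to each other and then applies the PageRank update rule to that graph.

```python
import re
from collections import defaultdict


def textrank_keywords(text, window=3, damping=0.85, iterations=50):
    """Rank single words by PageRank over a word co-occurrence graph."""
    # Crude tokenization (assumption): lowercase alphabetic strings only.
    words = re.findall(r"[a-z]+", text.lower())

    # Undirected graph: link words that occur within `window` tokens of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Fixed number of PageRank iterations, starting from uniform scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = "lung cancer screening detects lung cancer early"
ranked = textrank_keywords(sample)
for word, score in ranked:
    print(f"{score:.3f}  {word}")
```

Words with many distinct neighbors, here `lung` and `cancer`, end up at the top of the list, which mirrors the intuition behind the ranks reported by the `textrank` component.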

We load the selected file into memory:

In [4]:
with open(os.path.join(fc.selected_path, fc.selected_filename), mode='r', encoding='utf-8') as ifp:
    text = ifp.read()
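
If the cell above is run before a file has been picked, `os.path.join` fails with a rather cryptic `TypeError`. The following sketch wraps the loading step in a small helper with clearer error messages. The helper name `read_selected_text` is hypothetical, and the check assumes that `fc.selected_path` and `fc.selected_filename` are `None` (or empty) until a selection has been made.

```python
from pathlib import Path


def read_selected_text(selected_path, selected_filename, encoding="utf-8"):
    """Return the content of the chosen file, or raise a readable error."""
    # Assumption: both values are None or empty until a file has been picked.
    if not selected_path or not selected_filename:
        raise ValueError("No file selected; pick a text file in the file chooser first.")
    file_path = Path(selected_path) / selected_filename
    if not file_path.is_file():
        raise FileNotFoundError(f"Not a regular file: {file_path}")
    return file_path.read_text(encoding=encoding)
```

In the notebook it could be called as `text = read_selected_text(fc.selected_path, fc.selected_filename)`.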

We can now run the pipeline on the text:

In [5]:
doc = nlp(text)

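The identified terminology can be extracted as follows. For each phrase in `doc._.phrases` we print its text, its rank score together with the number of occurrences (`count`), and the list of matching spans (`chunks`):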

In [6]:
for phrase in doc._.phrases:
    print(phrase.text)
    print(phrase.rank, phrase.count)
    print(phrase.chunks)
    print()

lung cancer
0.09571732260976343 4
[lung cancer, lung cancer, lung cancer, lung cancer]

Peter M A van Ooijen
0.08537151002315986 1
[Peter M A van Ooijen]

Harry J M Groen
0.08148188356925473 2
[Harry J M Groen, Harry J M Groen]

CT screening
0.0807004379512615 1
[CT screening]

cancer diagnosis
0.07615665631325398 1
[cancer diagnosis]

year
0.07439838943862638 2
[year, year]

years
0.07439838943862638 3
[years, years, years]

Michael A den
0.07357908615794032 1
[Michael A den]

Marjolein A Heuvelmans
0.07333500455663015 2
[Marjolein A Heuvelmans, Marjolein A Heuvelmans]

Harry J de Koning
0.07316207795711296 1
[Harry J de Koning]

Joachim G J V Aerts
0.06941620354365778 1
[Joachim G J V Aerts]

NELSON Netherlands Trial Register
0.06934932971708942 1
[NELSON Netherlands Trial Register]

low rates
0.06744312806832187 1
[low rates]

Susan van t Westeinde
0.06655772587228924 1
[Susan van 't Westeinde]

Susan van 
0.06470202901117691 1
[Susan van ']

CT
0.06403996062958595 6
[CT, CT, CT, CT
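
The list above contains many phrases, mostly author names, that occur only once. One possible way to reduce it is to keep only phrases that are mentioned at least twice. The sketch below uses a few `(text, rank, count)` triples copied from the output above, so that it runs without spaCy. The helper name `frequent_phrases` and the threshold `min_count=2` are assumptions for illustration, not part of pytextrank.

```python
# (text, rank, count) triples copied from the output above.
phrases = [
    ("lung cancer", 0.09571732260976343, 4),
    ("Peter M A van Ooijen", 0.08537151002315986, 1),
    ("Harry J M Groen", 0.08148188356925473, 2),
    ("CT screening", 0.0807004379512615, 1),
    ("year", 0.07439838943862638, 2),
    ("years", 0.07439838943862638, 3),
    ("CT", 0.06403996062958595, 6),
]


def frequent_phrases(items, min_count=2):
    """Keep phrases seen at least `min_count` times, highest rank first."""
    kept = [item for item in items if item[2] >= min_count]
    return sorted(kept, key=lambda item: item[1], reverse=True)


for text, rank, count in frequent_phrases(phrases):
    print(f"{rank:.4f}  {count}  {text}")
```

With a real document, the triples could be built with `[(p.text, p.rank, p.count) for p in doc._.phrases]`.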

(C) 2022-2024 by [Damir Cavar](http://damir.cavar.me/)