# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import the library

After installing it, we import stanza into our notebook.

In [17]:
import stanza

## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [18]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


Clone the GitHub repository that contains the articles:

In [19]:
!git clone https://github.com/bergah/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.


In [20]:
! ls /content/

FASDH25-portfolio2  sample_data


Import the `os` library:

In [21]:
import os

Extract the articles written in January 2024:

In [13]:
# set the path to the articles folder
article_path = "/content/FASDH25-portfolio2/articles/"
# list all the files in the folder
all_files = os.listdir(article_path)
# keep only the files whose names start with "2024-01" (articles from January 2024)
january_files = [f for f in all_files if f.startswith("2024-01")]
# print the number of January files
print("There are", len(january_files), "found")


There are 326 found
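
The prefix filter above can also be wrapped in a small helper. The sketch below is only an illustration: the function name `files_for_month` and the sample file names are made up, and the real folder contains many more files. Because `os.listdir` returns names in arbitrary order, the helper sorts its result so that "the first January article" is the same file on every run.

```python
# Hypothetical file names, for illustration only.
sample_files = [
    "2024-01-15_207.txt",
    "2023-12-31_55.txt",
    "2024-01-03_101.txt",
    "2024-02-01_9.txt",
]

def files_for_month(file_names, year_month):
    """Return the sorted file names that start with a "YYYY-MM" prefix."""
    return sorted(f for f in file_names if f.startswith(year_month))

january_sample = files_for_month(sample_files, "2024-01")
print(january_sample)
```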


### Place names
Use a loop to print only the named entities that are place names.

If you don't remember how to do that, look back at last week's notebook!

In [24]:
# check that at least one article from January 2024 was found
if january_files:

  # open the first january article
  with open(os.path.join(article_path, january_files[0]), encoding="utf-8") as file:

    # read the entire article into a variable called "text"
    text = file.read()

    # run the article text through the stanza nlp pipeline to analyze it
    doc = nlp(text)

    # loop through each entity that stanza found
    for e in doc.entities:
      # only print entities that are places
      if e.type in ["GPE", "LOC"]:
        print(e)


{
  "text": "West Bank",
  "type": "GPE",
  "start_char": 51,
  "end_char": 60
}
{
  "text": "West Bank",
  "type": "GPE",
  "start_char": 148,
  "end_char": 157
}
{
  "text": "Dura",
  "type": "GPE",
  "start_char": 300,
  "end_char": 304
}
{
  "text": "Hebron",
  "type": "GPE",
  "start_char": 310,
  "end_char": 316
}
{
  "text": "West Bank",
  "type": "GPE",
  "start_char": 333,
  "end_char": 342
}
{
  "text": "Tulkarem",
  "type": "GPE",
  "start_char": 1084,
  "end_char": 1092
}
{
  "text": "West Bank",
  "type": "GPE",
  "start_char": 1109,
  "end_char": 1118
}
{
  "text": "Gaza Strip",
  "type": "GPE",
  "start_char": 1184,
  "end_char": 1194
}
{
  "text": "the West Bank",
  "type": "GPE",
  "start_char": 1215,
  "end_char": 1228
}
{
  "text": "the West Bank",
  "type": "GPE",
  "start_char": 1511,
  "end_char": 1524
}
{
  "text": "Israel",
  "type": "GPE",
  "start_char": 1694,
  "end_char": 1700
}
{
  "text": "West Bank",
  "type": "GPE",
  "start_char": 1751,
  "end_char": 17

### Counting place names

We can now use a dictionary to count how many times each place is mentioned in the text, as we did with regular expressions:

In [25]:
# create an empty dictionary
places = {}

# loop through the entities:
for e in doc.entities:
  # add a condition so that only place names are processed:
  if e.type in ["GPE", "LOC"]:
    # increment the count for this place name (starting from 0 if it is new):
    places[e.text] = places.get(e.text, 0) + 1

print(places)


{'West Bank': 5, 'Dura': 1, 'Hebron': 1, 'Tulkarem': 1, 'Gaza Strip': 1, 'the West Bank': 2, 'Israel': 1, 'Nablus': 1}
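
As an alternative to the `dict.get` pattern, the standard library's `collections.Counter` does the same counting. The sketch below uses made-up `(text, type)` pairs in place of stanza entities so that it runs without loading the pipeline; with a real `doc` you would iterate over `doc.entities` and read `e.text` and `e.type` instead.

```python
from collections import Counter

# Illustrative (text, type) pairs standing in for stanza entities.
sample_entities = [
    ("West Bank", "GPE"),
    ("Hebron", "GPE"),
    ("Hamas", "ORG"),
    ("West Bank", "GPE"),
    ("Jordan Valley", "LOC"),
]

# Count only the entities tagged as places.
place_counts = Counter(
    text for text, etype in sample_entities if etype in ("GPE", "LOC")
)

# most_common() lists the places from most to least frequent.
print(place_counts.most_common())
```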


Clean and merge the place names:

In [39]:
# clean and merge place names
cleaned_places = {}

for name, count in places.items():
  # eliminate possessive endings like "Gaza's" or "Israel's"
  if name.endswith("’s") or name.endswith("'s"):
    name = name[:-2]

  # eliminate whitespace and punctuation
  name = name.strip(" ,.!?;:\n\t")

  # merge counts into cleaned dictionary
  cleaned_places[name] = cleaned_places.get(name, 0) + count
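
With the counts shown earlier, this cleaning step changes nothing: no name ends in a possessive, and "West Bank" and "the West Bank" stay separate entries. The sketch below shows one possible extension that also drops a leading "the " before merging. It is an assumption, not part of the notebook's method: the helper name `normalize_place` and the sample counts are invented, and the rule would wrongly shorten names such as "The Hague".

```python
def normalize_place(name):
    """Strip possessive endings, surrounding punctuation, and a leading "the "."""
    for suffix in ("’s", "'s"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    name = name.strip(" ,.!?;:\n\t")
    # Assumption: a leading article adds nothing to a place name.
    if name.lower().startswith("the "):
        name = name[4:]
    return name

# Made-up counts, for illustration only.
sample_counts = {"West Bank": 5, "the West Bank": 2, "Israel's": 1, "Israel": 1}

merged = {}
for name, count in sample_counts.items():
    key = normalize_place(name)
    merged[key] = merged.get(key, 0) + count

print(merged)
```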


### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "Place" and "Count".
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row will contain the place name and its frequency, separated by a tab. Each row will have to start on a new line, so we'll also have to add a newline character \n to the row; should we add it at the beginning or end of the line, or both?

Fill in the blanks:

In [41]:
filename = "/content/FASDH25-portfolio2/ner_counts.tsv"

# open the file in writing mode and with unicode UTF-8 encoding:
with open(filename, mode="w", encoding="utf-8") as file:
  # create the header of the tsv file, which consists of the column names separated by a tab:
  header = "Place\tCount\n"
  # write the header to the file:
  file.write(header)
  # now, loop through the cleaned_places dictionary and create a new row for each item
  for place, count in cleaned_places.items():
    # create a row with the place and count separated by a tab
    row = f"{place}\t{count}\n"
    # finally, write the row to the file:
    file.write(row)

The file is now stored in our Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab. Right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/FASDH25-portfolio2/ner_counts.tsv`.

In [43]:
with open("/content/FASDH25-portfolio2/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
West Bank	5
Dura	1
Hebron	1
Tulkarem	1
Gaza Strip	1
the West Bank	2
Israel	1
Nablus	1
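
To reuse the counts in a different script, the file can be loaded back into a dictionary. The sketch below is one way to do that with the standard library's `csv` module; it writes a small sample file to a temporary folder so that it runs anywhere, and the variable names `tsv_path` and `loaded` are illustrative. In the notebook, `tsv_path` would point at `ner_counts.tsv` instead.

```python
import csv
import os
import tempfile

# Write a small sample file with the same two columns as ner_counts.tsv.
tsv_path = os.path.join(tempfile.mkdtemp(), "sample_counts.tsv")
with open(tsv_path, mode="w", encoding="utf-8") as f:
    f.write("Place\tCount\nWest Bank\t5\nHebron\t1\n")

# Read the rows back; DictReader uses the header row for the keys.
with open(tsv_path, encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    # Values are read as strings, so convert the counts back to integers.
    loaded = {row["Place"]: int(row["Count"]) for row in reader}

print(loaded)
```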

