# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cells below to clone the repository that contains the articles and to install stanza:

In [None]:
!git clone https://github.com/Zainab1317/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4424, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 4424 (delta 4), reused 2 (delta 2), pack-reused 4415 (from 3)
Receiving objects: 100% (4424/4424), 19.37 MiB | 20.26 MiB/s, done.
Resolving deltas: 100% (35/35), done.


In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import library

After installing it, we import stanza into our notebook.

In [None]:
import stanza

## Creating the pipeline

Download the English language model and build the pipeline. We restrict the pipeline to three processors: tokenization, multi-word token (MWT) expansion, and Named Entity Recognition (NER):


In [None]:
# Download the language model:
stanza.download("en")

# Create the English NLP pipeline for tokenization, multi-word token expansion and named entity recognition:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [None]:
# Import additional necessary libraries
import os
import pandas as pd
import re

# Path to the folder that contains the article text files
folder = "/content/FASDH25-portfolio2/articles"

# Counter for the number of January 2024 articles processed
jan_articles = 0
# Dictionary to store place name counts
place_counts = {}

# Loop through the files in the folder:
for filename in os.listdir(folder):
    # Only process .txt files whose name contains "2024-01" (January 2024)
    if "2024-01" in filename and filename.endswith(".txt"):
        jan_articles += 1
        # Build the full path to the file
        path = os.path.join(folder, filename)
        # Open and read the file
        with open(path, encoding="utf-8") as file:
            text = file.read()
        # Run the pipeline and keep only geopolitical entities (GPE) and locations (LOC)
        # 1) ChatGPT corrected the code (variable consistency)
        doc = nlp(text)
        for sentence in doc.sentences:
            for entity in sentence.ents:
                if entity.type in ["GPE", "LOC"]:
                    place = entity.text.strip()
                    place_counts[place] = place_counts.get(place, 0) + 1

KeyboardInterrupt: 
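
The counting step above can be tried out without loading stanza or the article folder. The following is a minimal sketch, assuming entities are represented as hypothetical `(text, type)` tuples instead of stanza entity objects (which carry the same information in `.text` and `.type`). The sample data and the helper name `count_places` are invented for illustration, and `collections.Counter` is swapped in for the plain dictionary:

```python
from collections import Counter

# Hypothetical stand-in for the entities stanza would return:
# (entity text, entity type) pairs, invented for illustration only.
sample_entities = [
    ("Gaza", "GPE"),
    ("Israel", "GPE"),
    ("UN", "ORG"),
    ("Gaza", "GPE"),
    ("the Red Sea", "LOC"),
]


def count_places(entities, wanted_types=("GPE", "LOC")):
    """Count entity texts whose type is one of wanted_types."""
    counts = Counter()
    for text, ent_type in entities:
        if ent_type in wanted_types:
            counts[text.strip()] += 1
    return counts


sample_counts = count_places(sample_entities)
print(sample_counts)
```

A `Counter` behaves like the dictionary used in the notebook but also offers `most_common()`, which may be convenient for a quick look at the most frequent places.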

In [None]:
# Cleaning the named entities
# 2) Help taken from ChatGPT to fix an error
clean_counts = {}

for place, count in place_counts.items():
    # Remove possessive endings like 's (straight, backtick or curly apostrophe)
    place = re.sub(r"['`’]s\b", "", place)
    # Remove punctuation from entities
    place = re.sub(r"[^\w\s]", "", place)
    # Remove a leading "the" from entities
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Merge the counts of names that normalize to the same form
    clean_counts[place] = clean_counts.get(place, 0) + count

# Print the normalized place names with their total counts
print(clean_counts)
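
To see why the cleaning step merges counts, it may help to run the substitutions on a few names by hand. This is a minimal sketch, assuming a hypothetical helper `normalize_place` that wraps the three substitutions from the cleaning step (possessive ending, punctuation, leading "the"). The raw counts are invented for illustration and are not results from the articles:

```python
import re


def normalize_place(place):
    """Normalize one place name: possessive, punctuation, leading 'the'."""
    place = re.sub(r"['`]s\b", "", place)   # drop possessive 's
    place = re.sub(r"[^\w\s]", "", place)   # drop remaining punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # drop leading "the"
    return place


# Hypothetical raw counts, invented for illustration only.
raw_counts = {
    "Gaza": 3,
    "Gaza's": 2,
    "the West Bank": 1,
    "West Bank": 4,
    "U.S.": 1,
}

merged = {}
for place, count in raw_counts.items():
    key = normalize_place(place)
    # Variants that normalize to the same key have their counts added up.
    merged[key] = merged.get(key, 0) + count

print(merged)  # {'Gaza': 5, 'West Bank': 5, 'US': 1}
```

Note that removing punctuation also turns "U.S." into "US", so abbreviations written with and without periods end up under one key.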

In [None]:
filename = "ner_counts.tsv"
# Write results to a TSV file with columns "place" and "count"
with open(filename, mode="w", encoding="utf-8") as file:
    # Write the header row of the TSV file:
    header = "place\tcount\n"
    file.write(header)
    # Loop through the cleaned counts, writing one row per place
    for place, count in clean_counts.items():
        row = f"{place}\t{count}\n"
        file.write(row)

# Open the file and print the results
with open(filename, encoding="utf-8") as file:
    print(file.read())
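
As an alternative sketch for the export step, Python's standard `csv` module can write and read tab-separated files, and it takes care of quoting if a place name ever contains a tab or a newline. The counts, the file name `ner_counts_example.tsv`, and the temporary directory below are assumptions made for illustration, not output from the articles:

```python
import csv
import os
import tempfile

# Hypothetical cleaned counts, invented for illustration only.
example_counts = {"Gaza": 5, "West Bank": 5, "US": 1}

# Write to a temporary directory so the sketch does not overwrite real output.
tsv_path = os.path.join(tempfile.mkdtemp(), "ner_counts_example.tsv")

with open(tsv_path, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter="\t")
    writer.writerow(["place", "count"])
    # Sort by descending count so the most frequent places come first.
    for place, count in sorted(example_counts.items(), key=lambda item: -item[1]):
        writer.writerow([place, count])

# Read the file back as dictionaries keyed by the header row.
with open(tsv_path, encoding="utf-8", newline="") as file:
    rows = list(csv.DictReader(file, delimiter="\t"))

print(rows)
```

Values read back this way are strings, so the counts would need `int(...)` before any arithmetic.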