# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [None]:
!pip install stanza



## Import the library

After installing it, we import stanza into our notebook.

In [None]:
import stanza

## Creating the pipeline

Download the English language model and build the pipeline. We restrict the pipeline to three processors: tokenization (`tokenize`), multi-word token expansion (`mwt`) and Named Entity Recognition (`ner`):


In [None]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Load the repository


In [None]:
!git clone https://github.com/faizahmed-22/FASDH25-portfolio2.git  # clone the repository into Colab


Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4362, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4362 (delta 2), reused 1 (delta 1), pack-reused 4357 (from 2)
Receiving objects: 100% (4362/4362), 17.77 MiB | 16.99 MiB/s, done.
Resolving deltas: 100% (8/8), done.


Loop through the articles from January 2024. For each one, create a new stanza document by feeding the article's text to our `nlp` pipeline object, then count every entity that stanza tags as a place (`GPE` or `LOC`):

In [None]:
# import libraries
import os
import re
import pandas as pd

# dictionary mapping each place name to the number of times it was found
places = {}

# path to the folder containing the articles
folder = "/content/FASDH25-portfolio2/articles"

# counter for the articles from January 2024
jan_articles = 0

# loop through the files in the folder:
for filename in os.listdir(folder):
    # keep only the articles from January 2024 (file name contains "2024-01")
    if "2024-01" in filename:
        jan_articles += 1
        # build the path to the file
        path = os.path.join(folder, filename)
        # open and read the file
        with open(path, encoding="utf-8") as file:
            text = file.read()
        # feed the `text` variable into our `nlp` pipeline to create a stanza document
        doc = nlp(text)
        # loop through the entities, keeping geopolitical entities (GPE) and locations (LOC)
        for e in doc.entities:
            if e.type in ["GPE", "LOC"]:
                place = e.text.strip()
                # count the place name
                places[place] = places.get(place, 0) + 1




### Place names
Use a loop to print only the named entities that are place names.

If you don't remember how to do that, look back at last week's notebook!
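
As a reminder, filtering entities by type could look roughly like the sketch below. It does not call stanza: `Entity` is a hypothetical stand-in for the objects in `doc.entities` (which also have `text` and `type` attributes), and the sample entities and their labels are written by hand for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a stanza entity (assumption: only `text` and `type` are needed)
Entity = namedtuple("Entity", ["text", "type"])

# hand-written sample entities, not real stanza output
sample_entities = [
    Entity("Israel", "GPE"),
    Entity("the Red Sea", "LOC"),
    Entity("January", "DATE"),
    Entity("United Nations", "ORG"),
]

# keep only the entities whose type marks them as a place name
place_names = []
for e in sample_entities:
    if e.type in ["GPE", "LOC"]:
        place_names.append(e.text)
        print(e.text, e.type)
```

With a real stanza document, the same loop would run over `doc.entities` instead of `sample_entities`.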

### Counting place names

We can now use a dictionary to count how many times each place is mentioned in the articles, as we did with regular expressions:

In [None]:
#print results
print("Number of articles from January 2024:", jan_articles)
print(places)



Number of articles from January 2024: 326
{'West Bank': 120, 'Dura': 2, 'Hebron': 10, 'Tulkarem': 2, 'Gaza Strip': 31, 'the West Bank': 40, 'Israel': 1593, 'Nablus': 5, 'the Red Sea': 194, 'the United States': 97, 'United Kingdom': 12, 'Yemen': 182, 'Iran': 206, 'Gaza': 1605, 'US': 706, 'Sanaa': 15, 'Saudi Arabia': 39, 'Aden': 3, 'Tel Aviv': 49, 'the Gaza Strip': 123, 'UK': 95, 'Palestine': 124, 'The Red Sea': 5, 'Africa': 29, 'Red Sea': 50, 'Marib': 3, 'the Middle East': 77, 'the United Arab Emirates': 13, 'Turkey': 25, 'Jordan': 42, 'Qatar': 64, 'UAE': 7, 'Charleston': 1, 'South Carolina': 4, 'Gaza City': 31, 'Doha': 19, 'Hong Kong': 2, 'South Africa': 200, 'the State of Palestine': 1, 'Lebanon': 175, 'Hague': 6, 'Pretoria': 8, 'Uganda': 11, 'China': 28, 'Russia': 43, 'The Hague': 33, 'Kuwait': 2, 'Gaza’s': 18, 'Ukraine': 47, 'Canada': 42, 'Montreal': 1, 'Milton, Ontario': 1, 'Jabalia': 11, 'Israel’s': 31, 'Ottawa': 3, 'Egypt': 43, 'Rafah': 40, 'Toronto': 1, 'Calgary': 1, 'Afghanista

### Evaluation

To check how well Stanza performed, we can put the tags into the text.

We can use the `start_char` property to insert the tag codes into the text; this way we can easily see which places Stanza missed.

We're going to loop over all the character indexes in the string;
and whenever we reach a character index where an entity starts,
we add the entity's tag to the text.
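
The tagging loop described above could be sketched as follows. This is a minimal illustration rather than the notebook's own cell: `Entity`, `tag_entities`, `sample` and `sample_entities` are hypothetical names, and the entities are written by hand instead of being produced by stanza. The sketch assumes only that each entity has `type` and `start_char` attributes, as the objects in `doc.entities` do.

```python
from collections import namedtuple

# Hypothetical stand-in for a stanza entity (assumption: same attribute names as stanza's)
Entity = namedtuple("Entity", ["text", "type", "start_char"])

def tag_entities(text, entities):
    """Return `text` with each entity's type inserted in brackets before the entity."""
    # map each start index to the type of the entity that begins there
    starts = {e.start_char: e.type for e in entities}
    tagged = ""
    # loop over all character indexes in the string
    for i, char in enumerate(text):
        # when an entity starts at this index, add its tag first
        if i in starts:
            tagged += f"[{starts[i]}] "
        tagged += char
    return tagged

sample = "Protests took place in Gaza and Hebron."
# hand-written entities; the start indexes are looked up in the sample string
sample_entities = [
    Entity("Gaza", "GPE", sample.index("Gaza")),
    Entity("Hebron", "GPE", sample.index("Hebron")),
]

print(tag_entities(sample, sample_entities))
```

Reading the tagged text makes it easier to spot place names that carry no tag, which are the ones the model missed.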

In [None]:
# cleaning the named entities
normalized_places = {}

for place, count in places.items():
    # remove possessive endings like 's or ’s (regex written with ChatGPT's help)
    place = re.sub(r"[’'`]s\b", "", place)

    # strip punctuation characters
    place = re.sub(r"[^\w\s]", "", place)

    # remove a leading 'the' (case-insensitive)
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # combine the counts of places that share the same normalized name
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# print the normalized place names with their total counts
print(normalized_places)

{'West Bank': 162, 'Dura': 2, 'Hebron': 10, 'Tulkarem': 2, 'Gaza Strip': 159, 'Israel': 1625, 'Nablus': 5, 'Red Sea': 250, 'United States': 160, 'United Kingdom': 43, 'Yemen': 188, 'Iran': 209, 'Gaza': 1623, 'US': 717, 'Sanaa': 15, 'Saudi Arabia': 39, 'Aden': 3, 'Tel Aviv': 51, 'UK': 95, 'Palestine': 124, 'Africa': 29, 'Marib': 3, 'Middle East': 102, 'United Arab Emirates': 14, 'Turkey': 25, 'Jordan': 43, 'Qatar': 65, 'UAE': 7, 'Charleston': 1, 'South Carolina': 4, 'Gaza City': 31, 'Doha': 19, 'Hong Kong': 2, 'South Africa': 208, 'State of Palestine': 1, 'Lebanon': 178, 'Hague': 39, 'Pretoria': 8, 'Uganda': 12, 'China': 30, 'Russia': 43, 'Kuwait': 2, 'Ukraine': 47, 'Canada': 42, 'Montreal': 1, 'Milton Ontario': 1, 'Jabalia': 11, 'Ottawa': 3, 'Egypt': 44, 'Rafah': 40, 'Toronto': 1, 'Calgary': 1, 'Afghanistan': 7, 'Austria': 3, 'Australia': 13, 'Finland': 3, 'Germany': 31, 'Italy': 10, 'Japan': 9, 'Netherlands': 14, 'Iceland': 1, 'Sweden': 3, 'Switzerland': 9, 'Romania': 4, 'Washington D
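
To see how the three cleaning steps merge variants of the same name, here is a small sketch that wraps them in a function and applies them to a handful of raw counts copied from the output of the counting cell. The names `normalize_place`, `sample_counts` and `merged` are hypothetical and not part of the notebook.

```python
import re

def normalize_place(place):
    """Apply the same three cleaning steps as the cell above to a single name."""
    place = re.sub(r"[’'`]s\b", "", place)   # drop possessive endings
    place = re.sub(r"[^\w\s]", "", place)     # drop punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # drop a leading "the"
    return place

# a few raw counts taken from the printed `places` dictionary
sample_counts = {
    "Gaza": 1605,
    "Gaza’s": 18,
    "Hague": 6,
    "The Hague": 33,
    "Milton, Ontario": 1,
}

# add up the counts of all variants that normalize to the same name
merged = {}
for place, count in sample_counts.items():
    name = normalize_place(place)
    merged[name] = merged.get(name, 0) + count

print(merged)
```

The merged totals for Gaza (1623) and Hague (39) agree with the normalized output above.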

In [None]:
filename = "ner_counts.tsv"
# open the file for writing
with open(filename, mode="w", encoding="utf-8") as file:
  # write the header of the tsv file:
  header = "Place\tCount\n"
  file.write(header)
  # loop through the places dictionary, writing one row per place
  # (this exports the raw counts; loop over normalized_places instead to export the cleaned counts)
  for place, count in places.items():
    row = f"{place}\t{count}\n"
    file.write(row)

# open the file again and print its contents
with open(filename, encoding="utf-8") as file:
  print(file.read())

Place	Count
West Bank	120
Dura	2
Hebron	10
Tulkarem	2
Gaza Strip	31
the West Bank	40
Israel	1593
Nablus	5
the Red Sea	194
the United States	97
United Kingdom	12
Yemen	182
Iran	206
Gaza	1605
US	706
Sanaa	15
Saudi Arabia	39
Aden	3
Tel Aviv	49
the Gaza Strip	123
UK	95
Palestine	124
The Red Sea	5
Africa	29
Red Sea	50
Marib	3
the Middle East	77
the United Arab Emirates	13
Turkey	25
Jordan	42
Qatar	64
UAE	7
Charleston	1
South Carolina	4
Gaza City	31
Doha	19
Hong Kong	2
South Africa	200
the State of Palestine	1
Lebanon	175
Hague	6
Pretoria	8
Uganda	11
China	28
Russia	43
The Hague	33
Kuwait	2
Gaza’s	18
Ukraine	47
Canada	42
Montreal	1
Milton, Ontario	1
Jabalia	11
Israel’s	31
Ottawa	3
Egypt	43
Rafah	40
Toronto	1
Calgary	1
Afghanistan	7
Austria	3
Australia	12
Finland	3
Germany	31
Italy	10
Japan	9
Netherlands	14
Iceland	1
Sweden	2
Switzerland	9
Romania	4
the United Kingdom	28
The United States	21
Washington, DC	4
Jerusalem	26
Gretna	2
Louisiana	3
New Orleans	5
@MirandaCleland	1
East Jerusalem	23
S