# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [13]:
# Install stanza
!pip install stanza



## Import library and download language model

After installing it, we import stanza into our notebook.

In [14]:
import stanza
import os
import time


## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [15]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Cloning the repository




In [16]:
# Clone the repository
!git clone https://github.com/ZeeshanKarim-916/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.


## Extracting the January 2024 articles only


In [17]:
# Set the path to the articles folder
path = "/content/FASDH25-portfolio2/articles"
# List the files in the folder (code taken from slides 9.2)
files = os.listdir(path)
# Keep only the articles from January 2024
jan_files = [f for f in files if f.startswith("2024-01")]
# Show how many files were found
print("January files found:", len(jan_files))


January files found: 326
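The prefix test above relies on every filename starting with a date in `YYYY-MM-DD` form. As a minimal sketch under that assumption, the same idea can count articles for every month at once. The filenames below are hypothetical stand-ins for the repository's files:

```python
from collections import Counter

# Hypothetical filenames, assuming each one starts with an ISO date (YYYY-MM-DD).
sample_files = [
    "2023-12-30_101.txt",
    "2024-01-02_102.txt",
    "2024-01-15_103.txt",
    "2024-02-01_104.txt",
]

# The first seven characters of each name hold the year and month, e.g. "2024-01".
files_per_month = Counter(name[:7] for name in sample_files)

for month, n_files in sorted(files_per_month.items()):
    print(month, n_files)
```

Replacing `sample_files` with `os.listdir(path)` would give one count per month for the real folder, provided all of its filenames follow the same pattern.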


## Looping through January 2024 files

In [18]:
import os

# Create a dictionary to store place name counts (help taken from slides 9.2)
places = {}

# Path to the articles folder in our cloned repository
folder = "/content/FASDH25-portfolio2/articles"
jan_files_articles_count = 0  # keep count of January articles

# Loop through files that begin with "2024-01" (code taken from week 11.1)
for filename in os.listdir(folder):
    if filename.startswith("2024-01"):
        jan_files_articles_count += 1
        # create a path to the file:
        path = os.path.join(folder, filename)
        # open and read the file:
        with open(path, encoding="utf-8") as file:
            text = file.read()
        # use the nlp pipeline to analyse the text:
        doc = nlp(text)
        # select only the entities that are place names:
        for e in doc.entities:
            if e.type in ["GPE", "LOC"]:
                # add 1 to the count of the place in our dictionary
                # (and/or add the place to the dictionary if it was not there yet):
                place = e.text.strip()
                places[place] = places.get(place, 0) + 1

print("Articles from January 2024:", jan_files_articles_count)
print(places)


Articles from January 2024: 326
{'Morocco': 13, 'Israel': 1593, 'Gaza': 1605, 'Rabat': 3, 'United States': 40, 'the United Arab Emirates': 13, 'UAE': 7, 'Bahrain': 11, 'Sudan': 3, 'US': 706, 'Western Sahara': 3, 'Washington': 60, 'Tel Aviv': 49, 'Algeria': 7, 'Marrakesh': 1, 'the Western Sahara': 1, 'Morocco’s': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 120, 'Dena': 1, 'Israel’s': 31, 'Oakland': 1, 'the United States': 97, 'South Africa': 200, 'Jordan': 42, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 43, 'Qatar': 64, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 124, 'Indonesia’s': 1, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 3, 'UK': 95, 'Manchester': 1, 'Yemen': 182, 'Washington, DC': 4, 'India': 50, 'Hyderabad': 1, 'Colombo’s Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 2, 'Iran': 206, 'Kerman': 6, 'Lebanon': 175, 'Bethlehem':
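The counting step can also be written with `collections.Counter`, which handles the "add the key if it is missing" case by itself. The sketch below is one possible alternative, not the method used in this notebook. To keep it runnable without loading the models, it uses a hypothetical `Entity` stand-in that has only the `text` and `type` attributes read in the loop above:

```python
from collections import Counter, namedtuple

# Stand-in for the entity objects in doc.entities: only the two attributes used above.
Entity = namedtuple("Entity", ["text", "type"])

# Hypothetical entities, as if they came from one analysed article.
sample_entities = [
    Entity("Gaza", "GPE"),
    Entity("Red Sea", "LOC"),
    Entity("Gaza ", "GPE"),
    Entity("United Nations", "ORG"),
]

PLACE_TYPES = {"GPE", "LOC"}


def count_places(entities):
    """Count the stripped text of every entity whose type marks it as a place."""
    return Counter(e.text.strip() for e in entities if e.type in PLACE_TYPES)


place_counts = count_places(sample_entities)
print(place_counts.most_common())
```

With real data, `places.update(count_places(doc.entities))` inside the file loop would accumulate the counts across articles, assuming `places` is created as a `Counter()`.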

### Cleaning the Place Names




In [19]:
import re
import os

normalized_places = {}

# Standard naming dictionary (help taken from AI-Solution 1)
standard_names = {
    'beruit': 'Beirut',
    'britain': 'United Kingdom',
    'dahiyeb': 'Dahiyeh',
    'gaza': 'Gaza',
    'gaza city': 'Gaza',
    'islamic republic of iran': 'Iran',
    'republic of yemen': 'Yemen',
    'state of israel': 'Israel',
    'state of palestine': 'Palestine',
    'tel israel': 'Tel Aviv',
    'uae': 'United Arab Emirates',
    'u.s.': 'United States',
    'uk': 'United Kingdom',
    'usa': 'United States',
    'westbank': 'West Bank'
}

for place, count in places.items():
    # Remove all possessives like 's
    place = re.sub(r"[’'`]s\b", "", place)
    # Convert newlines to spaces
    place = place.replace('\n', ' ')

    # Remove all the punctuation
    place = re.sub(r"[^\w\s]", "", place)  # code taken from slides 9.2

    # Remove leading 'the' if it appears
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Check for Gaza (special case)
    if re.search(r'gaza', place.lower()):
        place = standard_names['gaza']
    else:
        # Look up the normalized place (lowercase for safety): help taken from AI-Solution 1
        place = standard_names.get(place.lower(), place)

    # Merge counts for normalized places
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# Print the cleaned and aggregated place names with counts
print(normalized_places)


{'Morocco': 14, 'Israel': 1632, 'Gaza': 1830, 'Rabat': 3, 'United States': 162, 'United Arab Emirates': 21, 'Bahrain': 11, 'Sudan': 3, 'US': 717, 'Western Sahara': 4, 'Washington': 62, 'Tel Aviv': 52, 'Algeria': 7, 'Marrakesh': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 164, 'Dena': 1, 'Oakland': 1, 'South Africa': 208, 'Jordan': 43, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 44, 'Qatar': 65, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 125, 'Indonesia': 3, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 7, 'United Kingdom': 152, 'Manchester': 1, 'Yemen': 189, 'India': 50, 'Hyderabad': 1, 'Colombo Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 3, 'Iran': 210, 'Kerman': 6, 'Lebanon': 178, 'Bethlehem': 4, 'Nairoukh': 1, 'China': 30, 'Italy': 10, 'Spain': 7, 'Turkey': 25, 'Shawawra': 1, 'Hague': 39, 'Khan Younis': 23, 'Syria': 84, 'Mazzeh'
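The order of the cleaning steps matters for the lookup. Punctuation is removed before `standard_names` is consulted, so a key that itself contains punctuation, such as `'u.s.'`, can never match the cleaned string. The sketch below repeats the same substitutions on a few hypothetical inputs to make this visible. The small dictionary is an illustrative subset, not the full table:

```python
import re

# Illustrative subset of the lookup table, keyed on lowercase names.
standard_names = {"u.s.": "United States", "uk": "United Kingdom"}


def clean(place):
    """Apply the same substitutions, in the same order, as the loop above."""
    place = re.sub(r"[’'`]s\b", "", place)        # drop possessive endings
    place = place.replace("\n", " ")              # newlines become spaces
    place = re.sub(r"[^\w\s]", "", place)         # strip punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # drop leading "the"
    return place


for raw in ["U.S.", "the UK", "Israel’s"]:
    cleaned = clean(raw)
    final = standard_names.get(cleaned.lower(), cleaned)
    print(repr(raw), "->", repr(cleaned), "->", repr(final))
```

Here `"U.S."` is cleaned to `"US"` and then looked up as `"us"`, which is not a key, so it stays `"US"`. Writing keys in their cleaned form (for example `'us'`) is one way to make such entries match. Whether `US` should be merged into `United States` is a decision for the analysis.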

### Store Data in TSV File




In [20]:
filename = "ner_counts.tsv"   #from slides 11.

# Open the file in writing mode and with UTF-8 encoding
with open(filename, mode="w", encoding="utf-8") as file:
    # Write the header row
    header = "Place\tCount\n"
    file.write(header)

    # Loop through the cleaned place counts and write each row
    for place, count in normalized_places.items():
        row = f"{place}\t{count}\n"
        file.write(row)
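As an alternative to building each row by hand, Python's `csv` module can write tab-separated rows and would quote any field that happens to contain a tab. The sketch below assumes a small hypothetical dictionary in place of `normalized_places`. It also sorts the rows by count, which is a design choice the cell above does not make. It writes to an in-memory buffer so that it runs anywhere:

```python
import csv
import io

# Hypothetical counts standing in for normalized_places.
sample_counts = {"Rabat": 3, "Gaza": 1830, "Israel": 1632}

# An in-memory buffer is used here; a file opened with mode="w", newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Place", "Count"])

# Write the most frequent places first.
for place, count in sorted(sample_counts.items(), key=lambda item: item[1], reverse=True):
    writer.writerow([place, count])

tsv_text = buffer.getvalue()
print(tsv_text)
```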


To check that the data was written correctly, open the file again and print its contents:

In [21]:
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
Morocco	14
Israel	1632
Gaza	1830
Rabat	3
United States	162
United Arab Emirates	21
Bahrain	11
Sudan	3
US	717
Western Sahara	4
Washington	62
Tel Aviv	52
Algeria	7
Marrakesh	1
Maghreb	1
Ukraine	47
Saudi Arabia	39
California	3
West Bank	164
Dena	1
Oakland	1
South Africa	208
Jordan	43
Jerusalem	26
East Jerusalem	23
Egypt	44
Qatar	65
Kuala Lumpur	4
Malaysia	8
Palestine	125
Indonesia	3
Jakarta	2
Johannesburg	4
London	17
Paris	8
Vienna	1
Berlin	5
Amman	6
Washington DC	7
United Kingdom	152
Manchester	1
Yemen	189
India	50
Hyderabad	1
Colombo Kollupitiya	1
Namibia	10
Germany	31
Palestinian Territories	1
Sweden	3
Iran	210
Kerman	6
Lebanon	178
Bethlehem	4
Nairoukh	1
China	30
Italy	10
Spain	7
Turkey	25
Shawawra	1
Hague	39
Khan Younis	23
Syria	84
Mazzeh	2
Damascus	17
Houthis	3
Red Sea	250
BabelMandeb Strait	1
Gulf of Aden	27
Sanaa	15
Hodeidah	5
Taiz	2
Dhamar	1
alBayda	1
Saada	3
Arabian Sea	6
Bab alMandeb Strait	9
Asia	18
Europe	30
Kuwait	2
Middle East	102
Ankara	7
West	24
Tehran	25
South

The file is now stored in the Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/ner_counts.tsv`
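A later script might load the counts back into a dictionary. The following is a minimal sketch using `csv.DictReader`. The sample text is a hypothetical excerpt in the same `Place`/`Count` layout. In Colab, the buffer could be replaced by `open("/content/ner_counts.tsv", encoding="utf-8")`:

```python
import csv
import io

# Hypothetical excerpt with the same two-column layout as ner_counts.tsv.
sample_tsv = "Place\tCount\nGaza\t1830\nRed Sea\t250\nRabat\t3\n"

with io.StringIO(sample_tsv) as file:
    reader = csv.DictReader(file, delimiter="\t")
    # Counts are read as strings, so convert them back to integers.
    loaded = {row["Place"]: int(row["Count"]) for row in reader}

# Pick the two most frequent places.
top_places = sorted(loaded.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_places)
```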