# Use stanza to extract all place names from (part of) the corpus

## Installation

Run the code cell below to install stanza:

In [None]:
!pip install stanza



## Import library and download language model

After installing it, we import stanza into our notebook.

In [None]:
import stanza

## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [None]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


# Using multiple files
Since we have already processed a single file, we can now handle a large number of files too.

To do that, we first download the FASDH25 GitHub repository into Colab. Because `git clone` is a shell command rather than Python code, we put a `!` in front of it when running it in a Colab code cell.

Now, let's complete the command below to clone the repository and run it.

In [None]:
# cloning our FASDH25 folder:
!git clone https://github.com/UlyaBatool/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.


Now we can loop through the articles in the folder one by one, keeping only those from January 2024 (file names that contain `2024-01`).

In [None]:
# import the os module to work with files and folders
import os

# use an empty dictionary to store place names and their frequencies
places = {}
# specify the path to the folder
folder = "/content/FASDH25-portfolio2/articles"
# initialize a counter for January 2024 articles
jan_2024_article_count = 0

# loop through each file
for filename in os.listdir(folder):
    # check if the filename contains 2024-01
    if "2024-01" in filename:
        # count the article
        jan_2024_article_count += 1
        # build the path to the file
        path = os.path.join(folder, filename)
        # open the file
        with open(path, encoding="utf-8") as file:
            # read all the text from the file
            text = file.read()
        # process the text using the pre-loaded NLP pipeline
        doc = nlp(text)
        # loop through all the named entities
        for e in doc.entities:
            # keep only geopolitical entities (GPE) and other locations (LOC)
            if e.type in ["GPE", "LOC"]:
                # strip surrounding whitespace from the place name
                place = e.text.strip()
                # add one to the number of mentions of this place
                places[place] = places.get(place, 0) + 1

# Print results
print("Number of articles from January 2024:", jan_2024_article_count)
print(places)

Number of articles from January 2024: 326
{'Gaza': 1605, 'Israel': 1593, 'South Africa': 200, 'Palestine': 124, 'Dublin': 3, 'The Hague': 33, 'Russia': 43, 'Ukraine': 47, 'Moscow': 4, 'US': 706, 'UK': 95, 'West': 24, 'the Global South': 2, 'Ramallah': 24, 'West Bank': 120, 'Gaza Strip': 31, 'the Gaza Strip': 123, 'Israel’s': 31, 'the United States': 97, 'the West Bank': 40, 'East Jerusalem': 23, 'PA': 1, 'Oslo': 2, 'Jerusalem': 26, 'Middle East': 25, 'United States’': 2, 'the Middle East': 77, 'Bahrain': 11, 'Turkey': 25, 'Greece': 8, 'Jordan': 42, 'Qatar': 64, 'the United Arab Emirates': 13, 'Saudi Arabia': 39, 'Egypt': 43, 'Tel Aviv': 49, 'America': 4, 'South Carolina': 4, 'Lebanon': 175, 'Beirut': 84, 'Washington': 60, 'Iran': 206, 'Asia': 18, 'al-Dabshah': 1, 'Khirbet Selm': 1, 'Bint Jbeil': 1, 'Syria': 83, 'Ibil El Saqi': 1, 'al-Tawil’s': 1, 'al-Khader': 1, 'the Bekaa Valley': 1, 'the Litani River': 1, 'Red Sea': 50, 'Alborz': 4, 'the Red Sea': 194, 'Bab al-Mandeb Strait': 4, 'the
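
The counting step in the loop above can also be expressed with `collections.Counter`. The following is a minimal sketch, not the notebook's own code: `count_places` is a hypothetical helper, and the `(text, type)` pairs are stand-ins for the `e.text` and `e.type` values of stanza's entities, so that the example runs without the pipeline.

```python
from collections import Counter

def count_places(entities, wanted_types=("GPE", "LOC")):
    """Count how often each place name occurs.

    `entities` is an iterable of (text, type) pairs; these are stand-ins
    for the e.text and e.type values that stanza returns.
    """
    counts = Counter()
    for text, ent_type in entities:
        # keep only geopolitical entities and other locations
        if ent_type in wanted_types:
            counts[text.strip()] += 1
    return counts

# made-up sample input, for illustration only
sample = [
    ("Gaza", "GPE"),
    ("Israel", "GPE"),
    ("the Red Sea", "LOC"),
    ("Gaza", "GPE"),
    ("United Nations", "ORG"),  # not a place type, so it is skipped
]
place_counts = count_places(sample)
print(place_counts.most_common(2))
```

A `Counter` behaves like a dictionary, so it could be passed on to the cleaning step in the same way as `places`.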

# Clean up the named entity names
Now we check the extracted data for duplicate place names (for example, "Gaza Strip" and "the Gaza Strip"). We standardize these names and then merge the counts of the duplicates into a single, cleaned version.

In [None]:
import re

normalized_places = {}

for place, count in places.items():
    #  Remove possessives like 's
    place = re.sub(r"[’'`]s\b", "", place)

    # Remove punctuation
    place = re.sub(r"[^\w\s]", "", place)

    # Remove leading 'the' if it appears (to handle "The United States" and "United States")
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # case-insensitive removal of "The"

    #  Merge counts for normalized places
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# Print the cleaned and aggregated place names with counts
print(normalized_places)

{'Gaza': 1623, 'Israel': 1625, 'South Africa': 208, 'Palestine': 124, 'Dublin': 3, 'Hague': 39, 'Russia': 43, 'Ukraine': 47, 'Moscow': 4, 'US': 717, 'UK': 95, 'West': 24, 'Global South': 2, 'Ramallah': 24, 'West Bank': 162, 'Gaza Strip': 159, 'United States': 160, 'East Jerusalem': 23, 'PA': 1, 'Oslo': 2, 'Jerusalem': 26, 'Middle East': 102, 'Bahrain': 11, 'Turkey': 25, 'Greece': 8, 'Jordan': 43, 'Qatar': 65, 'United Arab Emirates': 14, 'Saudi Arabia': 39, 'Egypt': 44, 'Tel Aviv': 51, 'America': 4, 'South Carolina': 4, 'Lebanon': 178, 'Beirut': 87, 'Washington': 62, 'Iran': 209, 'Asia': 18, 'alDabshah': 1, 'Khirbet Selm': 1, 'Bint Jbeil': 1, 'Syria': 84, 'Ibil El Saqi': 1, 'alTawil': 1, 'alKhader': 1, 'Bekaa Valley': 1, 'Litani River': 1, 'Red Sea': 250, 'Alborz': 4, 'Bab alMandeb Strait': 9, 'Gulf of Aden': 27, 'Indian Ocean': 2, 'Africa': 29, 'Yemen': 188, 'Damascus': 17, 'Djibouti': 4, 'Britain': 14, 'Tehran': 25, 'United Kingdom': 43, 'Sanaa': 15, 'Rafah': 40, 'Khreis': 1, 'Norther
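
The output above shows that the punctuation step also removes hyphens, so "al-Dabshah" becomes "alDabshah". If that is not wanted, one possible variant keeps hyphens inside names. This is only a sketch under that assumption, not the version used to produce the counts above; `normalize_place` and `merge_counts` are hypothetical helper names.

```python
import re

def normalize_place(place):
    """Normalize one place name while keeping hyphens inside it."""
    # remove possessive endings such as 's or ’s
    place = re.sub(r"[’'`]s\b", "", place)
    # remove punctuation except hyphens
    place = re.sub(r"[^\w\s-]", "", place)
    # remove a leading "the" (case-insensitive)
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)
    return place.strip()

def merge_counts(counts):
    """Merge the counts of names that normalize to the same string."""
    merged = {}
    for place, count in counts.items():
        key = normalize_place(place)
        merged[key] = merged.get(key, 0) + count
    return merged

# a few entries taken from the raw counts printed earlier
raw = {"the Red Sea": 194, "Red Sea": 50, "Israel’s": 31, "al-Dabshah": 1}
print(merge_counts(raw))
```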

## Writing results to a tsv file
Now we will write the cleaned place names and their counts to a file named ner_counts.tsv. This is done in two steps:
1) First, we write a header row, with the column names separated by a tab (`\t`).
2) Then, we loop through each place name and write a new row, so that each row contains the name and its frequency.

In [None]:
# set the name of the file
filename = "ner_counts.tsv"
# open the tsv file for writing, with utf-8 encoding
with open(filename, 'w', encoding="utf-8") as file:
    # write the header row
    file.write('placename\tcount\n')
    # loop through the dictionary and write one row per place
    for entity, count in normalized_places.items():
        # format each row
        row = f"{entity}\t{count}\n"
        # write the row to the tsv file
        file.write(row)


The file is now saved in the Colab session environment. We can read it back to check its contents:

In [None]:
# opens the tsv file with encoding utf-8
with open("/content/ner_counts.tsv", 'r', encoding="utf-8") as file:
  print(file.read())

placename	count
Gaza	1623
Israel	1625
South Africa	208
Palestine	124
Dublin	3
Hague	39
Russia	43
Ukraine	47
Moscow	4
US	717
UK	95
West	24
Global South	2
Ramallah	24
West Bank	162
Gaza Strip	159
United States	160
East Jerusalem	23
PA	1
Oslo	2
Jerusalem	26
Middle East	102
Bahrain	11
Turkey	25
Greece	8
Jordan	43
Qatar	65
United Arab Emirates	14
Saudi Arabia	39
Egypt	44
Tel Aviv	51
America	4
South Carolina	4
Lebanon	178
Beirut	87
Washington	62
Iran	209
Asia	18
alDabshah	1
Khirbet Selm	1
Bint Jbeil	1
Syria	84
Ibil El Saqi	1
alTawil	1
alKhader	1
Bekaa Valley	1
Litani River	1
Red Sea	250
Alborz	4
Bab alMandeb Strait	9
Gulf of Aden	27
Indian Ocean	2
Africa	29
Yemen	188
Damascus	17
Djibouti	4
Britain	14
Tehran	25
United Kingdom	43
Sanaa	15
Rafah	40
Khreis	1
Northern Gaza	1
Strip	15
Deir elBalah	14
Maghazi	5
Nuseirat	11
Europe	30
Manchester City	1
Barcelona	1
El Arish	3
Italy	10
Iraq	64
Eastern Mediterranean	1
New York City	1
DC	14
Cape Town	2
Pretoria	8
South Africansfrom	1
Africa4Palestine	1
J
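
Python's built-in `csv` module can also write and read tab-separated files, and it takes care of splitting each row into columns. The sketch below is an alternative to the manual approach above, not a replacement for it. It writes to a temporary folder and uses a three-row sample of the counts, so it does not touch `ner_counts.tsv`. The helper names `write_counts` and `read_counts` are hypothetical.

```python
import csv
import os
import tempfile

def write_counts(path, counts):
    """Write a {placename: count} dictionary to a tab-separated file."""
    with open(path, "w", encoding="utf-8", newline="") as file:
        writer = csv.writer(file, delimiter="\t")
        writer.writerow(["placename", "count"])
        for place, count in counts.items():
            writer.writerow([place, count])

def read_counts(path):
    """Read the file back into a {placename: count} dictionary."""
    with open(path, encoding="utf-8", newline="") as file:
        reader = csv.DictReader(file, delimiter="\t")
        return {row["placename"]: int(row["count"]) for row in reader}

# three rows taken from the output above
sample = {"Gaza": 1623, "Israel": 1625, "South Africa": 208}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_counts.tsv")
    write_counts(path, sample)
    restored = read_counts(path)

# sort from most to least frequent
ranked = sorted(restored.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Reading the counts back as integers makes it possible to sort the places by frequency, which the plain text file does not do by itself.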

# Geocoding

Geocoding is the process of finding coordinates for a place.

The process uses APIs, Application Programming Interfaces,
which are internet services that are designed not for human reading
but for being called by applications.

There are many APIs that provide geocoding services. They typically have a database of place names and their coordinates. If you send a geocoding API a place name, it will return its coordinates (and perhaps some other data). Many of them are not free. In our case, we'll use the free GeoNames API to find the coordinates of our place names.

First, try it out by pasting the following URL in your browser (make sure to replace `<your_user_name>` with your GeoNames user name):

`http://api.geonames.org/searchJSON?q=Gaza&maxRows=5&username=<your_user_name>`

Paste the response here:

{
  "totalResultsCount": 5276,
  "geonames": [
    {
      "adminCode1": "GZ",
      "lng": "34.46672",
      "geonameId": 281133,
      "toponymName": "Gaza",
      "countryId": "6254930",
      "fcl": "P",
      "population": 410000,
      "countryCode": "PS",
      "name": "Gaza",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a first-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.50161",
      "fcode": "PPLA"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.48347",
      "geonameId": 281129,
      "toponymName": "Jabālyā",
      "countryId": "6254930",
      "fcl": "P",
      "population": 168568,
      "countryCode": "PS",
      "name": "Jabalia",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "populated place",
      "adminName1": "Gaza Strip",
      "lat": "31.5272",
      "fcode": "PPL"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.30627",
      "geonameId": 281124,
      "toponymName": "Khān Yūnis",
      "countryId": "6254930",
      "fcl": "P",
      "population": 173183,
      "countryCode": "PS",
      "name": "Khan Yunis",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.34018",
      "fcode": "PPLA2"
    },
    {
      "adminCode1": "02",
      "lng": "33",
      "geonameId": 1046058,
      "toponymName": "Gaza Province",
      "countryId": "1036973",
      "fcl": "A",
      "population": 1422460,
      "countryCode": "MZ",
      "name": "Gaza Province",
      "fclName": "country, state, region,...",
      "adminCodes1": {
        "ISO3166_2": "G"
      },
      "countryName": "Mozambique",
      "fcodeName": "first-order administrative division",
      "adminName1": "Gaza Province",
      "lat": "-23.5",
      "fcode": "ADM1"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.24357",
      "geonameId": 281102,
      "toponymName": "Rafaḩ",
      "countryId": "6254930",
      "fcl": "P",
      "population": 126305,
      "countryCode": "PS",
      "name": "Rafah",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.29722",
      "fcode": "PPLA2"
    }
  ]
}
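
The response is JSON, which Python can turn into dictionaries and lists with the built-in `json` module. As a small sketch, the string below is a shortened copy of the first result above (only four of its fields are kept), and `first_coordinates` is a hypothetical helper name. GeoNames returns `lat` and `lng` as strings, so they are converted to numbers here.

```python
import json

# shortened copy of the response above: one result, four fields
response_text = """
{
  "totalResultsCount": 5276,
  "geonames": [
    {"name": "Gaza", "countryName": "Palestine",
     "lat": "31.50161", "lng": "34.46672"}
  ]
}
"""

def first_coordinates(results):
    """Return (latitude, longitude) of the first result, or None."""
    matches = results.get("geonames", [])
    if not matches:
        return None
    first = matches[0]
    return float(first["lat"]), float(first["lng"])

results = json.loads(response_text)
print(first_coordinates(results))
```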



I have created a function, `get_coordinates`, that takes a place name and your GeoNames user name as arguments and returns the coordinates. Please fill in your user name and run the code cell to make the function available:

In [None]:
import requests
import time

geonames_username = "ulya.batool"

def get_coordinates(place, username=geonames_username, fuzzy=0, timeout=1):
  """This function gets a single set of coordinates from the geonames API.

  Args:
    place (str): the place name
    username (str): your geonames user name
    fuzzy (int): 0 = exact matching, 1 = fuzzy matching (allow similar but not exact matches)
    timeout (int): number of seconds to wait before a call to the geonames API
      (to avoid being blocked for overloading the server)

  Returns:
    dictionary: keys: latitude, longitude
  """
  # wait a short while, so that we don't overload the server:
  time.sleep(timeout)
  # make the API call:
  url = "http://api.geonames.org/searchJSON?"
  params = {"q": place, "username": username, "fuzzy": fuzzy, "maxRows": 1, "isNameRequired": True}
  response = requests.get(url, params=params)
  # convert the response into a dictionary:
  results = response.json()
  print(results)
  # get the first result:
  try:
    result = results["geonames"][0]
    return {"latitude": result["lat"], "longitude": result["lng"]}
  except (IndexError, KeyError):
    print("No results found for your API call", response.request.url)

Now, we can test this function on a list of place names:

In [None]:
test_names = ["Khan Younis", "Khān Yūnis", "United States", "blabla"]

# loop through the list of names and call the function:
for name in test_names:
  # call the function and assign the outcome to a variable called `coordinates`:
  coordinates = get_coordinates(name)
  # print the coordinates, only if they were found:
  if coordinates:
    print("=>", coordinates)

{'totalResultsCount': 287, 'geonames': [{'adminCode1': 'GZ', 'lng': '34.30627', 'geonameId': 281124, 'toponymName': 'Khān Yūnis', 'countryId': '6254930', 'fcl': 'P', 'population': 173183, 'countryCode': 'PS', 'name': 'Khan Yunis', 'fclName': 'city, village,...', 'adminCodes1': {}, 'countryName': 'Palestine', 'fcodeName': 'seat of a second-order administrative division', 'adminName1': 'Gaza Strip', 'lat': '31.34018', 'fcode': 'PPLA2'}]}
=> {'latitude': '31.34018', 'longitude': '34.30627'}
{'totalResultsCount': 318, 'geonames': [{'adminCode1': 'GZ', 'lng': '34.30627', 'geonameId': 281124, 'toponymName': 'Khān Yūnis', 'countryId': '6254930', 'fcl': 'P', 'population': 173183, 'countryCode': 'PS', 'name': 'Khan Yunis', 'fclName': 'city, village,...', 'adminCodes1': {}, 'countryName': 'Palestine', 'fcodeName': 'seat of a second-order administrative division', 'adminName1': 'Gaza Strip', 'lat': '31.34018', 'fcode': 'PPLA2'}]}
=> {'latitude': '31.34018', 'longitude': '34.30627'}
{'totalResults
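
Because each call waits and then contacts the server, it can be worth remembering answers that were already retrieved. The following is a sketch of that idea, using a plain dictionary as a cache. `make_cached` is a hypothetical helper, and `fake_lookup` is a stand-in for `get_coordinates` that uses no network, so the example can run on its own.

```python
def make_cached(lookup):
    """Wrap a lookup function so each name is only looked up once."""
    cache = {}

    def cached_lookup(name):
        if name not in cache:
            cache[name] = lookup(name)
        return cache[name]

    return cached_lookup

calls = []  # records every real lookup, to show how many happen

def fake_lookup(name):
    """Stand-in for get_coordinates: no network, one known place."""
    calls.append(name)
    known = {"Khan Younis": {"latitude": "31.34018", "longitude": "34.30627"}}
    return known.get(name)

cached = make_cached(fake_lookup)
for name in ["Khan Younis", "blabla", "Khan Younis"]:
    print(name, "=>", cached(name))

print("real lookups:", len(calls))
```

In the notebook, `make_cached(get_coordinates)` would wrap the real function in the same way, assuming the default user name and settings are used.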

Now, reuse the code above to get the coordinates for the place names we stored in the `ner_counts.tsv` file.

Write a new tsv file, `ner_gazetteer.tsv`, which contains three columns: name, latitude, longitude.

In [None]:
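# get the places from the tsv file:
places = []

# read the tsv file
with open("/content/ner_counts.tsv", 'r', encoding="utf-8") as file:
  lines = file.readlines()

# find the column with the place names (the header we wrote is "placename")
header = lines[0].strip().split('\t')
place_index = header.index('placename')

# collect the place name from each data row, skipping empty lines:
for line in lines[1:]:
  cells = line.strip('\n').split('\t')
  if cells[place_index]:
    places.append(cells[place_index])

# get the coordinates for each place and write them to a tsv file:
with open("ner_gazetteer.tsv", 'w', encoding="utf-8") as file:
  file.write('name\tlatitude\tlongitude\n')
  for place in places:
    coordinates = get_coordinates(place)
    # places without a match get NA in both columns (a choice made here)
    if coordinates:
      file.write(f"{place}\t{coordinates['latitude']}\t{coordinates['longitude']}\n")
    else:
      file.write(f"{place}\tNA\tNA\n")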