# Extracting State, District, Village, and Geocodes Using Traditional Methods

There aren't many reliable resources that provide geographical data for India. This notebook contains simple scraping functions for collecting states, districts, villages, and their geographic coordinates.

This notebook can be used to access location data, or extended to extract villages and their geocodes. It is a starter notebook, created to see how far simple approaches can go without ML, so it can serve as a baseline when evaluating more complex models.

TODO:

*   To get started, this notebook provides a naive function for extracting the district, which reaches an accuracy of about 79%. It can be improved further with fuzzy matching. Whether you're working on naive approaches or ML, fuzzy matching is especially helpful during evaluation against our data, since the manual coding of the SATP database and the other data sources can contain errors at different levels.

*   New functions are needed for extracting villages and their geocodes from the input summaries. ML approaches are worth trying here, since the naive approaches in this notebook don't take context into account, which is necessary for some inputs.
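
As a possible starting point for the village TODO, the sketch below mirrors the keyword idea used for districts: it looks at the single word immediately before "village" and accepts it only if it appears in a set of known village names. The function name `get_village_candidate` and the tiny `known_villages` set are hypothetical and used only for illustration; multi-word village names are not handled.

```python
import re

def get_village_candidate(summary, known_villages):
    """Return the word before 'village' if it is a known village name, else ''."""
    if not isinstance(summary, str):
        return ""
    # Capture the single token directly preceding the keyword "village".
    for match in re.finditer(r"([a-z][a-z\-]*)\s+village\b", summary.lower()):
        candidate = match.group(1).replace("-", " ")
        if candidate in known_villages:
            return candidate
    return ""

known_villages = {"potteru", "teegalabanda"}  # hypothetical sample set
summary = "Villagers were abducted from Potteru village under Kalimela block."
print(get_village_candidate(summary, known_villages))
```

In practice `known_villages` would be built from the scraped location data rather than typed by hand.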



In [1]:
pip install unidecode



# Import Libraries

In [2]:
import json
import random
import requests
import unidecode
import pandas as pd
from tqdm.auto import tqdm
from bs4 import BeautifulSoup

# Get Data

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/eteitelbaum/code-satp/refs/heads/main/data/satp_clean.csv")
data.head(3)

Unnamed: 0,incident_number,state,district,block,village_name,other_areas,constituency,longitude,latitude,year,...,commander_arrests,cadre_arrests,sympathizer_arrests,unknown_arrests,total_surrenders,commander_surrenders,cadre_surrenders,sympathizer_surrenders,unknown_surrenders,incident_summary
0,101010701.0,Andhra Pradesh,Hyderabad,Gachibowli (Rangareddy),,Cyberabad,Serilingampally,17.4325,78.371806,2007,...,0,0.0,1,0.0,0.0,0,0.0,0,0,An alleged arms supplier to the Communist Part...
1,101010901.0,Andhra Pradesh,Nizamabad,,Kamareddy,,Kamareddy,18.320889,78.337139,2009,...,0,0.0,0,0.0,1.0,0,1.0,0,0,A Kamareddy dalam (squad) member belonging to ...
2,101030601.0,Andhra Pradesh,Khammam,,Bhadrachalam,,Bhadrachalam,17.668056,80.896861,2006,...,1,0.0,0,0.0,0.0,0,0.0,0,0,Senior CPI-Maoist 'Polit Bureau' and 'central ...


In [4]:
# Convert the names to lowercase for easier comparisons.
data["state"] = data["state"].apply(lambda x: x.lower() if isinstance(x, str) else x)
data["district"] = data["district"].apply(lambda x: x.lower() if isinstance(x, str) else x)
data["village_name"] = data["village_name"].apply(lambda x: x.lower() if isinstance(x, str) else x)
data[["state", "district", "village_name"]].head()

Unnamed: 0,state,district,village_name
0,andhra pradesh,hyderabad,
1,andhra pradesh,nizamabad,kamareddy
2,andhra pradesh,khammam,bhadrachalam
3,andhra pradesh,vishakhapatnam,
4,andhra pradesh,visakhapatnam,teegalabanda


# Function to scrape States and Districts in India from Wikipedia.

In [5]:
def get_states_and_districts():

    """

      Scrape Wikipedia and return a state-to-district mapping, a list of states, and a list of districts.

      Args:
        None
      Returns:
        states_districts: dict mapping each state name to a list of its districts
        state_names: list of state names
        all_districts: flat list of all district names

    """

    url = "https://en.wikipedia.org/wiki/List_of_districts_in_India"
    response = requests.get(url)

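    if response.status_code != 200:
        # Raise instead of returning a string, since the caller unpacks three values.
        raise RuntimeError(f"Error fetching data from Wikipedia (HTTP {response.status_code})")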

    soup = BeautifulSoup(response.content, 'html.parser')

    # Dictionary to hold state names as keys and their districts as values
    states_districts = {}

    # Find all divs with the class 'mw-heading mw-heading3' (which contain the state names)
    state_names = soup.find_all('div', {'class': 'mw-heading mw-heading3'})
    state_names = [state_name.text[:-10].strip().lower() for state_name in state_names[3:-1]]

    states = soup.find_all('table', {'class': "wikitable sortable"})
    districts_list = []
    for state in states:
      districts_list.append([])
      districts = state.find_all('tr')[1:]

      for district in districts:
        districts_list[-1].append(district.find_all('td')[2].text.strip().lower())

    for i, state_name in enumerate(state_names):
        states_districts[state_name] = districts_list[i]

    all_districts = [d for district in districts_list for d in district]

    print(f"No of states found: {len(state_names)}")
    print(f"No of districts found: {len(all_districts)}")

    return states_districts, state_names, all_districts


# Fetch the states and districts
state_district_data, states, districts = get_states_and_districts()

No of states found: 36
No of districts found: 793
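
The function pairs state headings with district tables purely by position, so a change in the page layout could silently shift districts to the wrong state. A minimal sketch of a guard, assuming the two lists are meant to have equal length; `pair_states_with_districts` is a hypothetical helper that is not part of the notebook, and the inputs are a small hand-written sample.

```python
def pair_states_with_districts(state_names, districts_list):
    """Zip state names with district lists, refusing misaligned input."""
    if len(state_names) != len(districts_list):
        raise ValueError(
            f"{len(state_names)} state headings but "
            f"{len(districts_list)} district tables"
        )
    return dict(zip(state_names, districts_list))

# Small sample input standing in for the scraped lists.
mapping = pair_states_with_districts(
    ["goa", "sikkim"],
    [["north goa", "south goa"], ["gangtok", "mangan"]],
)
print(mapping["goa"])
```

Failing loudly here is a design choice: a mismatch most likely means the selectors no longer fit the page.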


In [6]:
next(iter(state_district_data)), state_district_data[next(iter(state_district_data))][:5]

('andhra pradesh',
 ['alluri sitharama raju',
  'anakapalli',
  'ananthapuramu',
  'annamayya',
  'bapatla'])

In [7]:
states[:5], districts[:5]

(['andhra pradesh', 'arunachal pradesh', 'assam', 'bihar', 'chhattisgarh'],
 ['alluri sitharama raju',
  'anakapalli',
  'ananthapuramu',
  'annamayya',
  'bapatla'])

# Function to scrape villages and their geocodes from the GeoNames website

In [8]:
# https://www.geonames.org/advanced-search.html?

In [9]:
all_villages = {}
def get_villages(sd_data):

    """
    Returns a dictionary with a hierarchical structure of states, districts, and villages with their geocodes.

    Args:
      sd_data: state and district mapping
    Returns:
      all_villages: a dictionary of structure {state: {district: {village: {'Latitude': ..., 'Longitude': ...}}}}
    """

    for state, districts in sd_data.items():
        all_villages[state] = {}
        for district in tqdm(districts):
            all_villages[state][district] = {}
            base_url = "https://www.geonames.org/search.html"
            start_row = 0

            while True:
                # Step 1: Prepare the request URL
                params = {
                    'q': district,
                    'country': 'IN',
                    'continentCode': 'AS',
                    'startRow': start_row
                }

                # Step 2: Send the HTTP request to fetch the page content
                response = requests.get(base_url, params=params)

                if response.status_code != 200:
                    print(f"Failed to retrieve page for : {district}")
                    break

                # Step 3: Parse the page content
                soup = BeautifulSoup(response.content, 'html.parser')

                # Step 4: Find the table with village names and coordinates
                table = soup.find('table', class_='restable')
                if not table:
                    break  # Exit when there are no more tables (i.e., no more results)

                # Step 5: Loop through each row and extract village name, latitude, and longitude
                rows = table.find_all('tr')[2:-1]  # Skip the first two rows and the last row
                for row in rows:
                    try:
                      cols = row.find_all('td')

                      # Extract the place name from the link text in the second column
                      link = cols[1].find('a')
                      if link is None:
                          continue
                      # Convert to plain text first: comparing the tag object itself
                      # against lists of strings never matches.
                      village_name = unidecode.unidecode(link.text.strip().lower())

                      latitude = cols[4].get_text(strip=True)
                      longitude = cols[5].get_text(strip=True)

                      # Note: `districts` here is the loop variable, i.e. the current state's districts.
                      if village_name not in states and village_name not in districts:
                          all_villages[state][district][village_name] = {'Latitude': latitude, 'Longitude': longitude}
                    except Exception:
                      continue  # Skip rows that do not have the expected columns

                # Step 6: Move to the next page (next 50 results)
                start_row += 50
        with open('/content/drive/MyDrive/SATP/location_data.json', 'w') as json_file:
              json.dump(all_villages, json_file)
    return all_villages

#location_data = get_villages(state_district_data) # Takes around 30 min to get all the data.
#No need to run, since this is already saved and can be accessed from the data folder.
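
The paging logic in `get_villages` (request a page, advance `startRow` by 50, stop when no results table is found) can be separated from the HTTP code so that it is testable offline. The sketch below assumes a hypothetical `fetch_page(start_row)` callable that returns a list of rows and an empty list once results run out; `max_pages` is an added safety cap, not something the GeoNames site defines.

```python
def iter_rows(fetch_page, page_size=50, max_pages=1000):
    """Yield rows page by page until a page comes back empty."""
    start_row = 0
    for _ in range(max_pages):
        rows = fetch_page(start_row)
        if not rows:
            return  # No more results
        yield from rows
        start_row += page_size

# Fake fetcher standing in for the HTTP request and HTML parsing.
fake_results = [f"village_{i}" for i in range(120)]

def fake_fetch(start_row):
    return fake_results[start_row:start_row + 50]

rows = list(iter_rows(fake_fetch))
print(len(rows))
```

The cap also protects against a page that keeps returning the same table, which would otherwise loop forever.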

In [10]:
!wget https://raw.githubusercontent.com/eteitelbaum/code-satp/refs/heads/main/data/location_data.json

--2024-10-05 04:39:50--  https://raw.githubusercontent.com/eteitelbaum/code-satp/refs/heads/main/data/location_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51732669 (49M) [text/plain]
Saving to: ‘location_data.json’


2024-10-05 04:39:50 (194 MB/s) - ‘location_data.json’ saved [51732669/51732669]



In [11]:
with open('location_data.json', 'r') as file:
    location_data = json.load(file)

In [12]:
villages = []
for state in location_data.keys():
  for district in location_data[state].keys():
    for village in location_data[state][district].keys():
      villages.append(village)

print(len(villages), villages[:5])

539142 ['alluri sitharama raju', 'rampachodavaram', 'paderu', 'koyyuru', 'hukumpeta']
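
Village names can repeat across districts, so a flat list loses the information needed for geocoding. A sketch of an inverted index from village name to every matching record, assuming the nested layout and the `'Latitude'`/`'Longitude'` keys written by `get_villages`; `build_village_index` is a hypothetical helper, and the sample names and coordinates are placeholders, not real data.

```python
from collections import defaultdict

def build_village_index(location_data):
    """Map each village name to all of its (state, district, lat, lon) records."""
    index = defaultdict(list)
    for state, districts in location_data.items():
        for district, villages in districts.items():
            for village, coords in villages.items():
                index[village].append(
                    (state, district, coords["Latitude"], coords["Longitude"])
                )
    return dict(index)

# Placeholder data with the same nesting as location_data.json.
sample = {
    "state a": {
        "district x": {"rampur": {"Latitude": "1.0", "Longitude": "2.0"}},
        "district y": {"rampur": {"Latitude": "3.0", "Longitude": "4.0"}},
    }
}
index = build_village_index(sample)
print(len(index["rampur"]))
```

When a name maps to several records, an extracted district or state can be used to pick the right one.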


In [13]:
len(districts), districts[:5]

(793,
 ['alluri sitharama raju',
  'anakapalli',
  'ananthapuramu',
  'annamayya',
  'bapatla'])

In [14]:
district_words = [dws for d in districts for dws in d.split()]
print(len(district_words), district_words[:5])

953 ['alluri', 'sitharama', 'raju', 'anakapalli', 'ananthapuramu']


In [15]:
district_words = set(district_words + [d for d in districts])
print(len(district_words))
list(district_words)[:5]

938


['north garo hills', 'gramin', 'kodagu', 'purba medinipur', 'katni']

# Naive approach to extract district from the input summary.

In [16]:
def get_district(x, district_words):

  """

    Given a news summary, returns the district found in the input.

    Args:
        x: news summary,
        district_words: a set of all words in districts and districts themselves.

    Returns:
        district name found in the input.

  """

  x = x.lower()
  if " district" in x: # check if the summary has district keyword in it.
    pre_words = x.split(' district')[0].strip().split() # split at that index
    res = ''

    for i in range(len(pre_words)-1, -1, -1): # loop back from the district keyword
      if (pre_words[i] in district_words) or (pre_words[i].replace('-', ' ') in district_words): # check if the words preceding are present in the scraped data.
        res = pre_words[i] + ' ' + res
      else:
        break # break if there is a keyword that's not present in the district_words set.

    res = res.strip().replace('-', ' ')
    if res in districts:
      return res

  # Fallback: explicit substring matching when the method above fails or the district keyword is not present.
  for d in districts: # loop through all the districts (global list) and check if there is a match.
    if d in x:
      return d

  return ""
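
The substring fallback (`d in x`) can match a short district name inside a longer word, for example "una" inside "tribunal". One possible refinement, sketched under the assumption that longer names should win over shorter ones, is to match whole words only. `find_district_whole_word` is a hypothetical name, and the district list is a small sample.

```python
import re

def find_district_whole_word(text, districts):
    """Return the longest district name that occurs as a whole word in text."""
    text = text.lower()
    # Try longer names first so "north goa" is preferred over "goa".
    for name in sorted(districts, key=len, reverse=True):
        if re.search(rf"\b{re.escape(name)}\b", text):
            return name
    return ""

sample_districts = ["una", "patna", "north goa"]
print(find_district_whole_word("The tribunal met in Patna.", sample_districts))
```

Sorting on every call is wasteful for a full dataset; sorting the district list once beforehand would avoid that.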

In [17]:
# run the function on random samples
item = random.randint(1, 9000)
print("ITEM:", item)
print("ACTUAL:", data["district"][item])
print("Extracted:", get_district(data["incident_summary"][item], district_words))
print("Summary:", data["incident_summary"][item])

ITEM: 8208
ACTUAL: malkangiri
Extracted: malkangiri
Summary: CPI-Maoist cadres abducted five villagers, including one former Maoist, who was about to surrender before the Police from Potteru village under Kalimela block of Motu Police limits in Malkangiri District.


In [18]:
data["extracted_district"] = data["incident_summary"].apply(lambda x: get_district(x, district_words))
data["extracted_district"][:5]

Unnamed: 0,extracted_district
0,hyderabad
1,nizamabad
2,khammam
3,patna
4,una


# Evaluation

In [19]:
print("District Extraction Score:", sum(data["district"] == data["extracted_district"]) / data.shape[0])

District Extraction Score: 0.7934684003628667
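
Exact string equality counts spelling variants such as "vishakhapatnam" and "visakhapatnam" (both appear in the data above) as errors. A sketch of a more lenient score using `difflib.SequenceMatcher` from the standard library; the 0.9 threshold is an arbitrary assumption that would need tuning, and `fuzzy_accuracy` is a hypothetical helper that expects two equal-length sequences.

```python
from difflib import SequenceMatcher

def fuzzy_accuracy(actual, predicted, threshold=0.9):
    """Share of pairs whose similarity ratio meets the threshold."""
    total = len(actual)
    hits = 0
    for a, p in zip(actual, predicted):
        # Missing labels (e.g. NaN) are counted as misses.
        if isinstance(a, str) and isinstance(p, str):
            if SequenceMatcher(None, a, p).ratio() >= threshold:
                hits += 1
    return hits / total if total else 0.0

actual = ["vishakhapatnam", "khammam", "patna"]
predicted = ["visakhapatnam", "khammam", "una"]
print(fuzzy_accuracy(actual, predicted))
```

Reporting this next to the exact-match score shows how much of the gap comes from spelling rather than from wrong extractions.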


Ideas: add fuzzy matching and preprocessing functions.
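
One way to try both ideas with only the standard library: normalize text before comparison, then snap an extracted name to the closest known district with `difflib.get_close_matches`. The function names and the 0.85 cutoff are assumptions for illustration; a dedicated fuzzy-matching library may work better.

```python
import re
import unicodedata
from difflib import get_close_matches

def normalize(text):
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()

def snap_to_district(name, districts, cutoff=0.85):
    """Return the closest known district, or '' if none is similar enough."""
    matches = get_close_matches(normalize(name), districts, n=1, cutoff=cutoff)
    return matches[0] if matches else ""

known = ["visakhapatnam", "khammam", "east godavari"]  # small sample list
print(snap_to_district("Vishakhapatnam", known))
print(normalize("East-Godavari"))
```

Applying `snap_to_district` to both the manually coded labels and the extracted names before comparison would put them on the same canonical spelling.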