### Match district names in `ICRISAT_allcrops.csv` to district names in shapefile

Let us first load a shapefile of Indian districts as of 2020. As we will see, many district names in the shapefile are spelled differently from those in our agriculture data set `ICRISAT_allcrops.csv`. We will correct these discrepancies manually so that the two sources can be matched on district later.

In [1]:
# Load relevant packages.
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

### 1. Load shapefile of Indian districts.

In [2]:
# Load shapefile of Indian districts.
shp = gpd.read_file('../Raw_data/India_districts2020.shp')

In [3]:
shp.head()

Unnamed: 0,objectid,statecode,statename,state_ut,distcode,distname,distarea,totalpopul,totalhh,totpopmale,totpopfema,st_areasha,st_lengths,geometry
0,1,5,Uttarakhand,STATE,66,Nainital,4251.0,954605.0,191383.0,493666.0,460939.0,5322546000.0,506182.695952,"POLYGON ((79.52659 29.05543, 79.52550 29.05545..."
1,2,5,Uttarakhand,STATE,60,Dehradun,3088.0,1696694.0,347001.0,892199.0,804495.0,4177236000.0,578188.681639,"POLYGON ((77.87557 30.26052, 77.87467 30.26087..."
2,3,5,Uttarakhand,STATE,64,Almora,3144.0,622506.0,140577.0,291081.0,331425.0,4140751000.0,463454.225766,"POLYGON ((79.28494 29.92735, 79.28495 29.92723..."
3,4,5,Uttarakhand,STATE,65,Champawat,1766.0,259648.0,53953.0,131125.0,128523.0,2294297000.0,314508.010612,"POLYGON ((80.12479 29.01308, 80.12481 29.01306..."
4,5,5,Uttarakhand,STATE,56,Uttarkashi,8016.0,330086.0,67602.0,168597.0,161489.0,10851660000.0,786425.588972,"POLYGON ((78.92267 31.25333, 78.93106 31.26840..."


In [4]:
# Standardize state name across all data sets.
shp.loc[shp.statename=='Chhatisgarh', 'statename'] = 'Chhattisgarh'

### 2. Load crop yield data.

In [5]:
# Load .csv file of agriculture data.
df = pd.read_csv('../Raw_data/ICRISAT_allcrops.csv')

df.head()

Unnamed: 0,Dist Code,Year,State Code,State Name,Dist Name,RICE AREA (1000 ha),RICE PRODUCTION (1000 tons),RICE YIELD (Kg per ha),WHEAT AREA (1000 ha),WHEAT PRODUCTION (1000 tons),WHEAT YIELD (Kg per ha),GROUNDNUT AREA (1000 ha),GROUNDNUT PRODUCTION (1000 tons),GROUNDNUT YIELD (Kg per ha),SUGARCANE AREA (1000 ha),SUGARCANE PRODUCTION (1000 tons),SUGARCANE YIELD (Kg per ha),COTTON AREA (1000 ha),COTTON PRODUCTION (1000 tons),COTTON YIELD (Kg per ha)
0,1,1990,14,Chhattisgarh,Durg,397.9,481.4,1210.0,18.2,13.4,736.0,3.3,2.8,848.0,0.02,0.0,0.0,0.0,0.0,0.0
1,1,1991,14,Chhattisgarh,Durg,393.2,508.6,1293.0,18.3,11.8,645.0,2.5,2.6,1040.0,0.08,0.1,1250.0,0.0,0.0,0.0
2,1,1992,14,Chhattisgarh,Durg,398.4,514.5,1291.0,17.1,10.7,626.0,1.6,2.3,1438.0,0.06,0.1,1667.0,0.01,0.0,0.0
3,1,1993,14,Chhattisgarh,Durg,410.2,569.1,1387.0,17.0,12.1,712.0,2.4,2.5,1042.0,0.04,0.1,2500.0,0.03,0.0,0.0
4,1,1994,14,Chhattisgarh,Durg,430.1,601.7,1399.0,17.5,14.2,811.0,1.1,1.1,1000.0,0.1,0.1,1000.0,0.0,0.0,0.0


In [6]:
# Districts in df
df_distname = np.unique(df['Dist Name'])
print('No. of unique district names in .csv file = %d'% (len(df_distname)))

No. of unique district names in .csv file = 574


### 3. Building a unique identifier for catalog cross-matching

Can we use district name as an identifier to connect rows in our crop yield data to polygons in our shapefile?

In [7]:
# Districts in shp
shp_distname, shp_dist_counts = np.unique(shp['distname'], return_counts=True)
print('No. of unique district names in shapefile = %d'% (len(shp_distname)))

No. of unique district names in shapefile = 686


In [8]:
print('Max count of any individual district name in shape file = %d'% (np.max(shp_dist_counts)))

Max count of any individual district name in shape file = 2


In [9]:
shp_distname[np.where(shp_dist_counts==2)]

array(['Aurangabad', 'Balrampur', 'Bijapur', 'Bilaspur', 'Hamirpur',
       'Pratapgarh', 'Raigarh'], dtype=object)

In [10]:
shp[shp.distname=='Aurangabad']

Unnamed: 0,objectid,statecode,statename,state_ut,distcode,distname,distarea,totalpopul,totalhh,totpopmale,totpopfema,st_areasha,st_lengths,geometry
250,218,10,Bihar,STATE,235,Aurangabad,3305.0,2540073.0,391898.0,1318684.0,1221389.0,4239872000.0,404564.103599,"POLYGON ((84.61118 24.61312, 84.61133 24.61285..."
330,354,27,Maharashtra,STATE,515,Aurangabad,10107.0,3701282.0,751915.0,1924469.0,1776813.0,11665680000.0,759012.265374,"POLYGON ((74.67514 19.93563, 74.67490 19.93571..."


It turns out that there are two districts named `Aurangabad`, one in Bihar and one in Maharashtra. So, district name alone cannot serve as a unique identifier. Let's build a column of strings combining `distname` and `statename`.

In [11]:
shp['dist_state'] = shp['distname'] + ', ' + shp['statename']

In [12]:
shp_diststate, shp_diststate_counts = np.unique(shp['dist_state'], return_counts=True) 
print('No. of districts in shapefile = %d'% (len(shp_diststate)))

No. of districts in shapefile = 693


In [13]:
print('Max count of any individual district + state combo in shape file = %d'% (np.max(shp_diststate_counts)))

Max count of any individual district + state combo in shape file = 1
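The same uniqueness checks can be written with pandas alone. The snippet below is a minimal sketch on a toy table (the three rows are illustrative stand-ins, not the real shapefile): `duplicated(keep=False)` flags every row whose district name is shared, and `is_unique` confirms that the combined key resolves the ambiguity.

```python
import pandas as pd

# Toy stand-in for the shapefile attribute table (illustrative rows only).
toy = pd.DataFrame({
    'distname': ['Aurangabad', 'Aurangabad', 'Nainital'],
    'statename': ['Bihar', 'Maharashtra', 'Uttarakhand'],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dup_names = toy.loc[toy['distname'].duplicated(keep=False), 'distname'].unique()

# Combine district and state into a single identifier, as in the cells above.
toy['dist_state'] = toy['distname'] + ', ' + toy['statename']
key_is_unique = toy['dist_state'].is_unique

print('Ambiguous district names:', list(dup_names))
print('dist_state is unique:', key_is_unique)
```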


In [14]:
# Building dist-state in df.
df['dist_state'] = df['Dist Name'] + ', ' + df['State Name']
df_diststate, df_diststate_counts = np.unique(df['dist_state'], return_counts=True)
print('No. of districts in crop yield data set = %d'% (len(df_diststate)))

No. of districts in crop yield data set = 579


### 4. Correcting misspellings of district names in shapefile

Let's look for identifier values in our `.csv` file of crop yield data that lack an exact (case-sensitive) counterpart in the shapefile.

In [15]:
missing_diststates = []

for dist in df_diststate:
    if dist not in shp_diststate:
        missing_diststates.append(dist)
print('Total no. of district + state combos apparently missing in shapefile = %d'% (len(missing_diststates)))        

Total no. of district + state combos apparently missing in shapefile = 158
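The membership loop above amounts to a set difference. As a small sketch with made-up key lists (not the real data), Python sets give the same answer in one expression:

```python
# Toy identifier lists (illustrative values only).
csv_keys = ['Ahmedabad, Gujarat', 'Agra, Uttar Pradesh', 'Durg, Chhattisgarh']
shp_keys = ['Ahmadabad, Gujarat', 'Agra, Uttar Pradesh',
            'Durg, Chhattisgarh', 'Aizawl, Mizoram']

# Keys present in the crop data but absent (under exact matching) from the shapefile.
missing = sorted(set(csv_keys) - set(shp_keys))
print(missing)
```

Sorting the result keeps the output stable between runs, which makes it easier to compare against the shapefile listing by eye.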


Are these 158 district + state combos really not in the shapefile? For example, let's check for `Ahmedabad` in the shapefile.

In [16]:
print(*list(shp_diststate[:10]), sep='\n')

Adilabad, Telangana
Agra, Uttar Pradesh
Ahmadabad, Gujarat
Ahmadnagar, Maharashtra
Aizawl, Mizoram
Ajmer, Rajasthan
Akola, Maharashtra
Alappuzha, Kerala
Aligarh, Uttar Pradesh
Alirajpur, Madhya Pradesh


It turns out that our shapefile contains a district `Ahmadabad`, which is just an alternate spelling of Ahmedabad. Let's construct a dictionary matching district + state combos in the shapefile to those in our `.csv` file.

In [17]:
# Dictionary matching shapefile dist-states to corresponding entries in .csv file of crop yield data.
dict_matches = {'North Twenty Four Parganas, West Bengal': '24 - Paraganas North, West Bengal',
                'South Twenty Four Parganas, West Bengal': '24 - Paraganas South, West Bengal',
                'Ahmadabad, Gujarat': 'Ahmedabad, Gujarat',
                'Ahmadnagar, Maharashtra': 'Ahmednagar, Maharashtra',
                'Almora, Uttarakhand': 'Almorah, Uttarakhand',
                'Amravati, Maharashtra': 'Amarawati, Maharashtra',
                'Amethi, Uttar Pradesh': 'Amethi C.S.M.Nagar, Uttar Pradesh',
                'Jyotiba Phule Nagar, Uttar Pradesh': 'Amroha J.B.Fulenagar, Uttar Pradesh',
                'Anugul, Orissa': 'Angul, Orissa',
                'Aravali, Gujarat':'Aravalli, Gujarat',
                'Bagalkot, Karnataka': 'Bagalkote, Karnataka',
                'Baghpat, Uttar Pradesh': 'Bagpat, Uttar Pradesh',
                'Baleshwar, Orissa': 'Balasore, Orissa',
                'Banas Kantha, Gujarat': 'Banaskantha, Gujarat',
                'Bangalore Rural, Karnataka': 'Bangalore(Rural), Karnataka',
                'Bangalore, Karnataka': 'Bangalore(Urban), Karnataka',
                'Bara Banki, Uttar Pradesh': 'Barabanki, Uttar Pradesh',
                'Bilaspur, Himachal Pradesh': 'Bilashpur, Himachal Pradesh',
                'Bid, Maharashtra': 'Beed, Maharashtra',
                'Kaimur (Bhabua), Bihar': 'Bhabhua Kaimur, Bihar',
                'Bhadradri, Telangana':'Bhadradri Kothagudam, Telangana',
                'Bathinda, Punjab': 'Bhatinda, Punjab',
                'Balangir, Orissa': 'Bolangir, Orissa',
                'Batod, Gujarat':'Botad, Gujarat',
                'Baudh, Orissa': 'Boudh, Orissa',
                'Bulandshahr, Uttar Pradesh': 'Buland Shahar, Uttar Pradesh',
                'Buldana, Maharashtra': 'Buldhana, Maharashtra',
                'Barddhaman, West Bengal': 'Burdwan, West Bengal',
                'Chamarajanagar, Karnataka': 'Chamaraja Nagar, Karnataka',
                'Purba Champaran, Bihar': 'Champaran(East), Bihar',
                'Pashchim Champaran, Bihar': 'Champaran(West, Bihar',
                'Champawat, Uttarakhand': 'Champavat, Uttarakhand',
                'Kancheepuram, Tamil Nadu': 'Chengalpattu MGR Kancheepuram, Tamil Nadu',
                'Chikmagalur, Karnataka': 'Chickmagalur, Karnataka',
                'Thoothukkudi, Tamil Nadu': 'Chidambanar Toothukudi, Tamil Nadu',
                'Chikkaballapura, Karnataka': 'Chikkaballapur, Karnataka',
                'Chittaurgarh, Rajasthan': 'Chittorgarh, Rajasthan',
                'Koch Bihar, West Bengal':'Cooch Behar, West Bengal',
                'Dohad, Gujarat': 'Dahod, Gujarat',
                'The Dangs, Gujarat': 'Dangs, Gujarat',
                'Dakshin Bastar Dantewada, Chhattisgarh': 'Dantewara, Chhattisgarh',
                'Darjiling, West Bengal':'Darjeeling, West Bengal',
                'Debagarh, Orissa': 'Deogarh, Orissa',
                'Devbhumi Dwarka, Gujarat': 'Devbhoomi Dwarka, Gujarat',
                'Deoghar, Jharkhand': 'Devghar Deogarh, Jharkhand',
                'Dhaulpur, Rajasthan': 'Dholpur, Rajasthan',
                'Dindigul, Tamil Nadu': 'Dindigul Anna, Tamil Nadu',
                'Purba Medinipur, West Bengal':'East Midnapore Purba Midnapore, West Bengal',                
                'Ernakulam, Kerala': 'Eranakulam, Kerala',
                'Firozpur, Punjab': 'Ferozpur, Punjab',
                'Gautam Buddha Nagar, Uttar Pradesh': 'G.B.Nagar, Uttar Pradesh',
                'Garhwa, Jharkhand': 'Gadva Garhwa, Jharkhand',
                'Gondiya, Maharashtra': 'Gondia, Maharashtra',
                'Hardwar, Uttarakhand': 'Haridwar, Uttarakhand',
                'Mahamaya Nagar, Uttar Pradesh': 'Hathras, Uttar Pradesh',
                'Hisar, Haryana': 'Hissar, Haryana',
                'Hugli, West Bengal': 'Hooghly, West Bengal',
                'Haora, West Bengal':'Howrah, West Bengal',
                'Hydrabad, Telangana': 'Hyderabad, Telangana',
                'Jagtial, Telangana': 'Jagityal, Telangana',
                'Jalor, Rajasthan':'Jalore, Rajasthan',
                'Jangaon, Telangana': 'Janagaon, Telangana',
                'Janjgir-Champa, Chhattisgarh': 'Janjgir, Chhattisgarh',
                'Jayashankar, Telangana': 'Jayashankar Bhuppaly, Telangana',
                'Jhunjhunun, Rajasthan': 'Jhunjhunu, Rajasthan',
                'Jogulamba, Telangana': 'Jogulamba Gadwal, Telangana',
                'Kadapa(YSR), Andhra Pradesh': 'Kadapa YSR, Andhra Pradesh',
                'Kamrup Metropolitan, Assam':'Kamrup(Metro), Assam',
                'Uttar Bastar Kanker, Chhattisgarh': 'Kanker, Chhattisgarh',
                'Kanniyakumari, Tamil Nadu': 'Kanyakumari, Tamil Nadu',
                'Karauli, Rajasthan': 'Karoli, Rajasthan',
                'Kanshiram Nagar, Uttar Pradesh': 'Kasganj Khansi Ram Nagar, Uttar Pradesh',
                'Kabeerdham, Chhattisgarh': 'Kawardha, Chhattisgarh',
                'Kendujhar, Orissa': 'Keonjhar, Orissa',
                'Khandwa (East Nimar), Madhya Pradesh': 'Khandwa, Madhya Pradesh',
                'Khargone (West Nimar), Madhya Pradesh': 'Khargone, Madhya Pradesh',
                'Kodarma, Jharkhand': 'Khodrama Koderma, Jharkhand',
                'Khordha, Orissa': 'Khurda, Orissa',
                'Komaram Bheem, Telangana': 'Kumurambheem Asifabad, Telangana',
                'Kaushambi, Uttar Pradesh': 'Kushambi, Uttar Pradesh',
                'Kushinagar, Uttar Pradesh': 'Kushi Nagar Padrauna, Uttar Pradesh',
                'Kachchh, Gujarat':'Kutch, Gujarat',
                'Lohardaga, Jharkhand': 'Lohardagga, Jharkhand',
                'Mancherial, Telangana':'Macherial, Telangana',
                'Mahasamund, Chhattisgarh': 'Mahasmund, Chhattisgarh',
                'Mahisagar, Gujarat':'Mahi Sagar, Gujarat',
                'Mahrajganj, Uttar Pradesh': 'Mahrajgani, Uttar Pradesh',
                'Maldah, West Bengal':'Malda, West Bengal',
                'Medchal, Telangana':'Malkaigiri, Telangana',
                'Morigaon, Assam': 'Marigaon, Assam',
                'Mayurbhanj, Orissa': 'Mayurbhanja, Orissa',
                'Mahesana, Gujarat': 'Mehsana, Gujarat',
                'Paschim Medinipur, West Bengal':'West Midnapore, West Bengal',
                'Mirzapur, Uttar Pradesh': 'Mirzpur, Uttar Pradesh',
                'Mumbai, Maharashtra': 'Mumbai City, Maharashtra',
                'Mumbai Suburban, Maharashtra': 'Mumbai sub, Maharashtra',
                'Munger, Bihar': 'Mungair, Bihar',
                'Mungeli, Chhattisgarh': 'Mungli, Chhattisgarh',
                'Narsimhapur, Madhya Pradesh': 'Narsinghpur, Madhya Pradesh',
                'Nashik, Maharashtra': 'Nasik, Maharashtra',
                'Nabarangapur, Orissa': 'Nawarangpur, Orissa',
                'Vellore, Tamil Nadu': 'North Arcot Vellore, Tamil Nadu',
                'Dima Hasao, Assam': 'North Cachar Hil, Assam',
                'Pakur, Jharkhand': 'Pakund Pakur, Jharkhand',
                'Palamu, Jharkhand': 'Palamau, Jharkhand',
                'Panch Mahals, Gujarat': 'Panchmahal, Gujarat',
                'Peddapalli, Telangana': 'Peddapally, Telangana',
                'Perambalur, Tamil Nadu': 'Perambular, Tamil Nadu',
                'Erode, Tamil Nadu': 'Periyar(Erode), Tamil Nadu',
                'Kandhamal, Orissa': 'Phulbani(Kandhamal), Orissa',
                'Pithoragarh, Uttarakhand': 'Pithorgarh, Uttarakhand',
                'Purnia, Bihar': 'Purnea, Bihar',
                'Puruliya, West Bengal':'Purulia, West Bengal',
                'Rae Bareli, Uttar Pradesh': 'Rae - Bareily, Uttar Pradesh',
                'Raigarh, Maharashtra': 'Raigad, Maharashtra',
                'Rajanna Sircilla, Telangana':'Rajanna Siricilla, Telangana',
                'Ramanagara, Karnataka': 'Ramanagaram, Karnataka',
                'Ramanathapuram, Tamil Nadu': 'Ramananthapuram, Tamil Nadu',
                'Ramgarh, Jharkhand': 'Ramgadh, Jharkhand',
                'Rupnagar, Punjab': 'Roopnagar, Punjab',
                'Sahibzada Ajit Singh Nagar, Punjab': 'S.A.S Nagar, Punjab',
                'Shahid Bhagat Singh Nagar, Punjab': 'S.B.S Nagar, Punjab',
                'Sri Potti Sriramulu Nellore, Andhra Pradesh': 'S.P.S.Nellore, Andhra Pradesh',
                'Sabar Kantha, Gujarat': 'Sabarkantha, Gujarat',
                'Sahibganj, Jharkhand': 'Sahebganj, Jharkhand',
                'Sant Kabir Nagar, Uttar Pradesh': 'Santh Kabir Nagar, Uttar Pradesh',
                'Sant Ravidas Nagar (Bhadohi), Uttar Pradesh': 'Santh Ravi Das Nagar Bhadoi, Uttar Pradesh',
                'Dumka, Jharkhand': 'Santhal Paragana Dumka, Jharkhand',
                'Saraikela-Kharsawan, Jharkhand': 'Sariakela Kharsawan, Jharkhand',
                'Samli, Uttar Pradesh': 'Shamli, Uttar Pradesh',
                'Sheikhpura, Bihar': 'Sheikapura, Bihar',
                'Sheopur, Madhya Pradesh': 'Sheopur Kalan, Madhya Pradesh',
                'Shimoga, Karnataka': 'Shimoge, Karnataka',
                'Shrawasti, Uttar Pradesh': 'Shravasti, Uttar Pradesh',
                'Muktsar, Punjab': 'Shri Mukatsar Sahib, Punjab',
                'Sivasagar, Assam': 'Sibsagar, Assam',
                'Siddharthnagar, Uttar Pradesh': 'Sidharthnagar, Uttar Pradesh',
                'Purbi Singhbhum, Jharkhand': 'Singhbhum East, Jharkhand',
                'Pashchimi Singhbhum, Jharkhand': 'Singhbhum West, Jharkhand',
                'Sivaganga, Tamil Nadu': 'Sivagangai Pasumpon, Tamil Nadu',
                'Sonipat, Haryana': 'Sonepat, Haryana',
                'Subarnapur, Orissa': 'Sonepur, Orissa',
                'Cuddalore, Tamil Nadu': 'South Arcot Cuddalore, Tamil Nadu',
                'Sawai Madhopur, Rajasthan': 'Swami Madhopur, Rajasthan',
                'Tarn Taran, Punjab': 'Taran Taran, Punjab',
                'Tirunelveli, Tamil Nadu': 'Thirunelveli, Tamil Nadu',
                'Tiruppur, Tamil Nadu': 'Thiruppur, Tamil Nadu',
                'Tiruvannamalai, Tamil Nadu': 'Thiruvannamalai, Tamil Nadu',
                'Tiruchirappalli, Tamil Nadu': 'Tiruchirapalli Trichy, Tamil Nadu',
                'Thiruvarur, Tamil Nadu': 'Tiruvarur, Tamil Nadu',
                'Uttarkashi, Uttarakhand': 'Uttar Kashi, Uttarakhand',
                'Viluppuram, Tamil Nadu': 'Villupuram, Tamil Nadu',
                'Virudhunagar, Tamil Nadu': 'Virudhunagar Kamarajar, Tamil Nadu',
                'Warangal (R), Telangana': 'Warangal, Telangana',
                'Warangal (U), Telangana': 'Warangal Urban, Telangana',
                'Yadadri, Telangana':'Yadadri Bhuvanagiri, Telangana',
                'Yadgir, Karnataka': 'Yadagiri, Karnataka',
                'Yavatmal, Maharashtra': 'Yeotmal, Maharashtra'}
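The dictionary above was compiled by hand. One way to speed up that kind of work, shown here only as a hedged sketch and not as the method used in this notebook, is to let the standard library's `difflib.get_close_matches` propose a candidate for each unmatched identifier. The key lists and the `cutoff=0.8` threshold below are assumptions for illustration. Renamed districts (for example `Hathras` versus `Mahamaya Nagar`) will not be found this way, so every suggestion still needs a manual check.

```python
from difflib import get_close_matches

# Toy inputs (illustrative): unmatched crop-data keys and available shapefile keys.
missing_csv_keys = ['Ahmedabad, Gujarat', 'Hissar, Haryana']
shp_keys = ['Ahmadabad, Gujarat', 'Hisar, Haryana',
            'Agra, Uttar Pradesh', 'Aizawl, Mizoram']

# Build {shapefile key: crop-data key}, the same orientation as dict_matches.
suggestions = {}
for key in missing_csv_keys:
    # n=1 keeps only the best candidate; cutoff is an assumed similarity threshold.
    candidates = get_close_matches(key, shp_keys, n=1, cutoff=0.8)
    if candidates:
        suggestions[candidates[0]] = key

for shp_key, csv_key in suggestions.items():
    print(f'{shp_key!r}: {csv_key!r}')
```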

In [18]:
# Replace dist_state identifiers in shp with the spellings used in the crop yield data.
for key in dict_matches.keys():
    shp.loc[shp['dist_state']==key, 'dist_state'] = dict_matches[key]
print('Replacement complete')

Replacement complete
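A vectorized alternative to the loop is `Series.replace` with the dictionary passed directly. The sketch below uses a toy table and a toy three-entry dictionary (both illustrative). It also lists dictionary keys that match no row, which helps catch typos in a hand-written mapping.

```python
import pandas as pd

# Toy mapping and table (illustrative values only).
toy_matches = {'Ahmadabad, Gujarat': 'Ahmedabad, Gujarat',
               'Hisar, Haryana': 'Hissar, Haryana',
               'Bid, Maharashtra': 'Beed, Maharashtra'}
toy_shp = pd.DataFrame({'dist_state': ['Ahmadabad, Gujarat',
                                       'Agra, Uttar Pradesh',
                                       'Hisar, Haryana']})

# Keys in the mapping that never occur in the table (possible typos).
unused_keys = sorted(set(toy_matches) - set(toy_shp['dist_state']))

# Whole-value replacement; entries without a mapping are left untouched.
toy_shp['dist_state'] = toy_shp['dist_state'].replace(toy_matches)

print(toy_shp['dist_state'].tolist())
print('Unused mapping keys:', unused_keys)
```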


In [19]:
# Check for any further discrepancies in identifier values between shp and df.
anymore_dst = []
shp_diststate, diststate_counts = np.unique(shp['dist_state'], return_counts=True)

for dst in df_diststate:
    if dst not in shp_diststate:
        anymore_dst.append(dst)
print('Total no. of identifiers apparently missing in shapefile = %d'% (len(anymore_dst)))

Total no. of identifiers apparently missing in shapefile = 0
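With no identifiers missing, a join on `dist_state` should attach one polygon record to every crop yield row. The following is a minimal sketch on toy tables (the `distcode` values are hypothetical, and plain DataFrames stand in for the real data): `validate='many_to_one'` raises an error if the shapefile side contains a repeated key, and `indicator=True` adds a `_merge` column for counting unmatched rows.

```python
import pandas as pd

# Toy crop-yield rows: several years per district (illustrative values).
toy_df = pd.DataFrame({
    'dist_state': ['Durg, Chhattisgarh', 'Durg, Chhattisgarh', 'Ahmedabad, Gujarat'],
    'Year': [1990, 1991, 1990],
})
# Toy shapefile attributes: one row per district (hypothetical codes).
toy_shp = pd.DataFrame({
    'dist_state': ['Durg, Chhattisgarh', 'Ahmedabad, Gujarat', 'Aizawl, Mizoram'],
    'distcode': [1, 2, 3],
})

# Left join keeps every crop-yield row; validate guards against duplicate keys on the right.
merged = toy_df.merge(toy_shp, on='dist_state', how='left',
                      validate='many_to_one', indicator=True)

# Rows marked anything other than 'both' found no match in the shapefile table.
n_unmatched = int((merged['_merge'] != 'both').sum())
print('Rows without a shapefile match:', n_unmatched)
```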


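That's a relief: every identifier in the crop yield data now has a counterpart in the shapefile. Let's copy the corrected district names back into `distname` and write the updated shapefile to disk.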

In [20]:
# Update district names in shapefile to match corresponding entries in crop yield data set.
for diststate in shp_diststate:
    shp.loc[shp['dist_state']==diststate, 'distname'] = diststate.split(',')[0]
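The loop above can also be expressed with pandas string methods. This is a sketch on a toy table (illustrative rows), and like the loop it assumes that no district name itself contains a comma:

```python
import pandas as pd

# Toy table with already-corrected identifiers (illustrative values only).
toy_shp = pd.DataFrame({'dist_state': ['Ahmedabad, Gujarat',
                                       'Rae - Bareily, Uttar Pradesh']})

# Take everything before the first comma as the district name.
toy_shp['distname'] = toy_shp['dist_state'].str.split(',').str[0]

print(toy_shp['distname'].tolist())
```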

In [21]:
shp.to_file('../Final_data/India_dld.shp')