# Zubin Sharma: Python Final Project 

# Purpose of this project:

### Many technologies exist for creating maps, including ArcGIS and several Python packages
### While ideally all policymakers would take GLBL 550, in the case that they are unable to, they should still have tools that enable them to make maps with a few clicks and a few inputs
### The application below is meant to be a prototype, whereby a cleaned data file is linked to a GeoJSON for a particular region (India)
### Having connected the two, the user can then be asked for simple input as to what kind of map they would like to create based on the options in the dataframe. 
### A map is then created from the variable the user enters
### In the future, one could imagine an application which uses multiple GeoJSONs at the country, state, or even district level and multiple dataframes which can be called to help users make maps that fit their purposes in a few clicks.

# Step 1: Import and clean the data file 

In [1]:
import numpy as np
import json
import pandas as pd
import plotly.express as px

In [3]:
# Read in the data file from Census_data pgm
# The data file was downloaded from a Wikipedia page
# Using this snippet of code: 
#   dfs=pd.read_html("https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population")
#   dfs[1].to_csv("state_census.csv")
# This CSV file can also be accessed and downloaded from here https://drive.google.com/file/d/1BtzpYJstv18jRLrpKS_t-DJzk9RImUbN/view?usp=sharing
df = pd.read_csv("2021.12.18 GLBL 550 - Final Project - state_census.csv")

In [4]:
df.head()

Unnamed: 0,Row Counter,Rank,State or union territory,Population,National Share (%),Decadal growth(2001â€“2012),Rural population,Percent rural,Urban population,Percent urban,Area[17],Density[a],Sex ratio
0,0,1,Uttar Pradesh,199812341,,20.2,155317278,,44495063,,"240,928Â km2 (93,023Â sqÂ mi)","828/km2 (2,140/sqÂ mi)",912
1,1,2,Maharashtra,112374333,,20.0,61556074,,50818259,,"307,713Â km2 (118,809Â sqÂ mi)",365/km2 (950/sqÂ mi),929
2,2,3,Bihar,104099452,,25.4,92341436,,11758016,,"94,163Â km2 (36,357Â sqÂ mi)","1,102/km2 (2,850/sqÂ mi)",918
3,3,4,West Bengal,91276115,,13.8,62183113,,29093002,,"88,752Â km2 (34,267Â sqÂ mi)","1,029/km2 (2,670/sqÂ mi)",953
4,4,5,Madhya Pradesh,72626809,,16.3,52557404,,20069405,,"308,245Â km2 (119,014Â sqÂ mi)",236/km2 (610/sqÂ mi),931


In [5]:
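# Parse the Density column to keep just the number in front of "/km2"
# Intermediate steps used to build up the expression:
# df['Density[a]'].apply(lambda x: x.split("/"))
# df['Density[a]'].apply(lambda x: x.split("/")[0])
# df['Density[a]'].apply(lambda x: x.split("/")[0].replace(",",""))
df['Density']=df['Density[a]'].apply(lambda x: int(x.split("/")[0].replace(",","")))
# Parse the Area column the same way; the split on "Â" relies on the stray character that the file's encoding leaves before " km2"
df['Area']=df['Area[17]'].apply(lambda x: int(x.split("Â")[0].replace(",","")))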
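
The two `apply` calls above depend on the exact layout of the scraped strings, and the Area parse also depends on the stray `Â` character. As a minimal sketch (not the method this notebook uses), a small helper built on the standard `re` module can pull out the leading number regardless of what follows it. The name `leading_int` is hypothetical.

```python
import re

def leading_int(text):
    """Return the integer at the start of `text`, ignoring thousands separators.

    Hypothetical helper: it reads the leading run of digits and commas and
    ignores everything after it (units, stray characters, parentheses).
    """
    match = re.match(r"\s*([\d,]+)", text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return int(match.group(1).replace(",", ""))

# Strings copied from the df.head() output above
density = leading_int("1,102/km2 (2,850/sqÂ mi)")
area = leading_int("94,163Â km2 (36,357Â sqÂ mi)")
print(density, area)
```

Under this assumption, both columns could be parsed with the same call, for example `df['Density[a]'].apply(leading_int)`.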


In [6]:
#Removing special characters from column names 
#Required step to later drop columns containing NaN values 
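#regex=True is passed explicitly: pandas 2.0 and later treat the pattern as a literal string by default
df.columns = df.columns.str.replace('[(,),%,â,€,“]', ' ', regex=True)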



In [7]:
df.head()

Unnamed: 0,Row Counter,Rank,State or union territory,Population,National Share,Decadal growth 2001 2012,Rural population,Percent rural,Urban population,Percent urban,Area[17],Density[a],Sex ratio,Density,Area
0,0,1,Uttar Pradesh,199812341,,20.2,155317278,,44495063,,"240,928Â km2 (93,023Â sqÂ mi)","828/km2 (2,140/sqÂ mi)",912,828,240928
1,1,2,Maharashtra,112374333,,20.0,61556074,,50818259,,"307,713Â km2 (118,809Â sqÂ mi)",365/km2 (950/sqÂ mi),929,365,307713
2,2,3,Bihar,104099452,,25.4,92341436,,11758016,,"94,163Â km2 (36,357Â sqÂ mi)","1,102/km2 (2,850/sqÂ mi)",918,1102,94163
3,3,4,West Bengal,91276115,,13.8,62183113,,29093002,,"88,752Â km2 (34,267Â sqÂ mi)","1,029/km2 (2,670/sqÂ mi)",953,1029,88752
4,4,5,Madhya Pradesh,72626809,,16.3,52557404,,20069405,,"308,245Â km2 (119,014Â sqÂ mi)",236/km2 (610/sqÂ mi),931,236,308245


In [8]:
#Population density redefined on log scale to better represent differences in state density
#Sex ratio given a new scale so that we can later apply the divergent color map 

df['DensityLog'] = np.log10(df['Density'])
df['SexRatioScale']=df['Sex ratio'] - 1000
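
To see what the two rescalings do, here is a small worked example using values from the table above. It uses only the standard `math` module, and the variable names are illustrative rather than part of the notebook.

```python
import math

# Densities (people per km2) taken from the df.head() output above
density = {"Bihar": 1102, "Uttar Pradesh": 828, "Madhya Pradesh": 236}

# On the raw scale Bihar is several times denser than Madhya Pradesh ...
raw_ratio = density["Bihar"] / density["Madhya Pradesh"]

# ... while on the log10 scale the two values differ by less than one unit,
# which keeps very dense areas from stretching the color scale
density_log = {state: math.log10(value) for state, value in density.items()}
log_gap = density_log["Bihar"] - density_log["Madhya Pradesh"]

# Subtracting 1000 centers the sex ratio on parity, so negative values mean
# fewer than 1000 females per 1000 males
sex_ratio_scale = 912 - 1000  # Uttar Pradesh

print(round(raw_ratio, 2), round(log_gap, 2), sex_ratio_scale)
```

Centering on zero is what lets a diverging color scale use its midpoint to separate values below parity from values above it.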

In [9]:
df.drop(['Percent urban', 'Percent rural', 'National Share    ', 'Row Counter'], axis=1, inplace = True)

In [10]:
df.rename(columns={'Decadal growth 2001   2012 ':'Decadal Growth 2001 to 2012'}, inplace=True)
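
The `drop` and `rename` calls above have to match the exact number of spaces left behind by the character replacement (for example `'National Share    '`). One possible alternative, sketched here with the standard `re` module, is to collapse every run of non-alphanumeric characters into a single space and trim the ends so that the cleaned names are predictable. The function name `tidy_column` is hypothetical, and adopting it would change the column names used in the later cells.

```python
import re

def tidy_column(name):
    """Collapse runs of characters other than ASCII letters and digits into one space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", name).strip()

# Raw column names copied from the first df.head() output
raw_columns = ["National Share (%)", "Decadal growth(2001â€“2012)", "Sex ratio"]
tidy_columns = [tidy_column(name) for name in raw_columns]
print(tidy_columns)
```

Note that this rule would also rename `Area[17]` to `Area 17` and `Density[a]` to `Density a`, so the parsing cell would need the new names.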


In [11]:
df.head()

Unnamed: 0,Rank,State or union territory,Population,Decadal Growth 2001 to 2012,Rural population,Urban population,Area[17],Density[a],Sex ratio,Density,Area,DensityLog,SexRatioScale
0,1,Uttar Pradesh,199812341,20.2,155317278,44495063,"240,928Â km2 (93,023Â sqÂ mi)","828/km2 (2,140/sqÂ mi)",912,828,240928,2.91803,-88
1,2,Maharashtra,112374333,20.0,61556074,50818259,"307,713Â km2 (118,809Â sqÂ mi)",365/km2 (950/sqÂ mi),929,365,307713,2.562293,-71
2,3,Bihar,104099452,25.4,92341436,11758016,"94,163Â km2 (36,357Â sqÂ mi)","1,102/km2 (2,850/sqÂ mi)",918,1102,94163,3.042182,-82
3,4,West Bengal,91276115,13.8,62183113,29093002,"88,752Â km2 (34,267Â sqÂ mi)","1,029/km2 (2,670/sqÂ mi)",953,1029,88752,3.012415,-47
4,5,Madhya Pradesh,72626809,16.3,52557404,20069405,"308,245Â km2 (119,014Â sqÂ mi)",236/km2 (610/sqÂ mi),931,236,308245,2.372912,-69


In [12]:
#Save cleaned India dataframe to csv so that it can later be imported 
df.to_csv('cleanIndiadf.csv')
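
Because `to_csv` writes the dataframe's index by default, reading the file back produces an extra `Unnamed: 0` column, as the `df.columns` output in Step 2 shows. A minimal sketch of the effect, assuming pandas is installed and using an in-memory buffer with made-up values instead of the real file (the helper name `round_trip_columns` is hypothetical):

```python
import io
import pandas as pd

# Made-up two-row frame, only to show the column names after a round trip
toy = pd.DataFrame({"State": ["A", "B"], "Density": [100, 200]})

def round_trip_columns(frame, **to_csv_kwargs):
    """Write `frame` to an in-memory CSV, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

default_columns = round_trip_columns(toy)             # index written as a column
clean_columns = round_trip_columns(toy, index=False)  # index left out
print(default_columns)
print(clean_columns)
```

Passing `index=False` to the `to_csv` call above would keep the index out of `cleanIndiadf.csv` in the same way. The later outputs in this notebook were produced without it, which is why they list `Unnamed: 0`.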

In [13]:
df.columns

Index(['Rank', 'State or union territory', 'Population',
       'Decadal Growth 2001 to 2012', 'Rural population', 'Urban population',
       'Area[17]', 'Density[a]', 'Sex ratio', 'Density', 'Area', 'DensityLog',
       'SexRatioScale'],
      dtype='object')

# Step 2: Having cleaned the CSV file, we now link it to a GeoJSON object
## The GeoJSON contains the state boundaries of India

In [14]:
#We can imagine that the first step is a separate Jupyter Notebook to clean the data file
#In this case, the two have been put into one Jupyter Notebook for ease of viewing

In [15]:
# Render maps outside jupyter notebook to avoid bloat in the Notebook
import plotly.io as pio
pio.renderers.default = 'browser'
# pio.renderers

In [16]:
#Read in cleaned CSV file 

df = pd.read_csv("cleanIndiadf.csv")

In [17]:
df.columns

Index(['Unnamed: 0', 'Rank', 'State or union territory', 'Population',
       'Decadal Growth 2001 to 2012', 'Rural population', 'Urban population',
       'Area[17]', 'Density[a]', 'Sex ratio', 'Density', 'Area', 'DensityLog',
       'SexRatioScale'],
      dtype='object')

In [24]:
#This GeoJSON file was downloaded from https://un-mapped.carto.com/tables/states_india/public/map
#I also have the specific file here - https://drive.google.com/file/d/1d8IEldsSPFbiT56XRO8e7HubweVT1Hnm/view?usp=sharing
#Open the file in a with block so that the file handle is closed after loading
with open("2021.12.18 GLBL 550 - Final Project - states_india.geojson", 'r') as f:
    india_states = json.load(f)

In [25]:
india_states['features'][0]

{'type': 'Feature',
 'geometry': {'type': 'MultiPolygon',
  'coordinates': [[[[78.34088, 19.883615],
     [78.351327, 19.88184],
     [78.370422, 19.883346],
     [78.379149, 19.879733],
     [78.388848, 19.879703],
     [78.389673, 19.874372],
     [78.388883, 19.864121],
     [78.390691, 19.856213],
     [78.390645, 19.853215],
     [78.39395, 19.846705],
     [78.402384, 19.836943],
     [78.413779, 19.830435],
     [78.433447, 19.8237],
     [78.449385, 19.819844],
     [78.469482, 19.816847],
     [78.481036, 19.817011],
     [78.489156, 19.807863],
     [78.494337, 19.799196],
     [78.498808, 19.793852],
     [78.508559, 19.793125],
     [78.514515, 19.801887],
     [78.517292, 19.814976],
     [78.52413, 19.820588],
     [78.531195, 19.822351],
     [78.562889, 19.81634],
     [78.57869, 19.814543],
     [78.590001, 19.81245],
     [78.596781, 19.816171],
     [78.600308, 19.818109],
     [78.608696, 19.818273],
     [78.6194, 19.814049],
     [78.624399, 19.809511],
     [78.6

In [26]:
india_states['features'][1]['properties']

{'cartodb_id': 2, 'state_code': 35, 'st_nm': 'Andaman & Nicobar Island'}

In [27]:
# Get IDs to connect map and data
state_id_map = {}
for feature in india_states['features']:
    feature['id'] = feature['properties']['state_code']
    state_id_map[feature['properties']['st_nm']] = feature['id']
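
Plotly Express matches the `locations` column against each feature's top-level `id` by default, which is why the loop above copies `state_code` into `feature['id']`. As a sketch of how the same step could be reused for another GeoJSON file, the loop can be wrapped in a function that takes the property names as arguments. The function name `build_id_map` and the toy GeoJSON are hypothetical; the property values are taken from this notebook's outputs.

```python
def build_id_map(geojson, name_key, id_key):
    """Set each feature's top-level 'id' and return a {name: id} lookup."""
    id_map = {}
    for feature in geojson["features"]:
        feature["id"] = feature["properties"][id_key]
        id_map[feature["properties"][name_key]] = feature["id"]
    return id_map

# Tiny stand-in for the real file, with the geometry omitted
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"cartodb_id": 2, "state_code": 35,
                        "st_nm": "Andaman & Nicobar Island"}},
        {"type": "Feature", "geometry": None,
         "properties": {"state_code": 10, "st_nm": "Bihar"}},
    ],
}
toy_id_map = build_id_map(toy_geojson, name_key="st_nm", id_key="state_code")
print(toy_id_map)
```

For the file used here, the equivalent call would be `build_id_map(india_states, name_key='st_nm', id_key='state_code')`.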


In [28]:
state_id_map

{'Telangana': 0,
 'Andaman & Nicobar Island': 35,
 'Andhra Pradesh': 28,
 'Arunanchal Pradesh': 12,
 'Assam': 18,
 'Bihar': 10,
 'Chhattisgarh': 22,
 'Daman & Diu': 25,
 'Goa': 30,
 'Gujarat': 24,
 'Haryana': 6,
 'Himachal Pradesh': 2,
 'Jammu & Kashmir': 1,
 'Jharkhand': 20,
 'Karnataka': 29,
 'Kerala': 32,
 'Lakshadweep': 31,
 'Madhya Pradesh': 23,
 'Maharashtra': 27,
 'Manipur': 14,
 'Chandigarh': 4,
 'Puducherry': 34,
 'Punjab': 3,
 'Rajasthan': 8,
 'Sikkim': 11,
 'Tamil Nadu': 33,
 'Tripura': 16,
 'Uttar Pradesh': 9,
 'Uttarakhand': 5,
 'West Bengal': 19,
 'Odisha': 21,
 'Dadara & Nagar Havelli': 26,
 'Meghalaya': 17,
 'Mizoram': 15,
 'Nagaland': 13,
 'NCT of Delhi': 7}

In [29]:
# If you want to list the geo map states in alpha order to check State list in external files:
import collections
od = collections.OrderedDict(sorted(state_id_map.items()))
od

OrderedDict([('Andaman & Nicobar Island', 35),
             ('Andhra Pradesh', 28),
             ('Arunanchal Pradesh', 12),
             ('Assam', 18),
             ('Bihar', 10),
             ('Chandigarh', 4),
             ('Chhattisgarh', 22),
             ('Dadara & Nagar Havelli', 26),
             ('Daman & Diu', 25),
             ('Goa', 30),
             ('Gujarat', 24),
             ('Haryana', 6),
             ('Himachal Pradesh', 2),
             ('Jammu & Kashmir', 1),
             ('Jharkhand', 20),
             ('Karnataka', 29),
             ('Kerala', 32),
             ('Lakshadweep', 31),
             ('Madhya Pradesh', 23),
             ('Maharashtra', 27),
             ('Manipur', 14),
             ('Meghalaya', 17),
             ('Mizoram', 15),
             ('NCT of Delhi', 7),
             ('Nagaland', 13),
             ('Odisha', 21),
             ('Puducherry', 34),
             ('Punjab', 3),
             ('Rajasthan', 8),
             ('Sikkim', 11),
        

In [30]:
# Get ID for the data file
# Some state names in the data file didn't match the carto map file
# One could write more automated cleaning routines here - for now, I changed the names in place: Arunanchal Pradesh, Manipur, J&K etc.
df['id'] = df['State or union territory'].apply(lambda x:state_id_map[x])
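
The lookup above raises a `KeyError` at the first state name that is missing from `state_id_map`, so mismatches surface one at a time. A minimal sketch of a check that lists every unmatched name in one pass is shown below. The function name `unmatched_names` is hypothetical, and the toy data-file spellings are assumptions for illustration only.

```python
def unmatched_names(data_names, id_map):
    """Return the names in the data that have no entry in the map lookup, sorted."""
    return sorted(set(data_names) - set(id_map))

# The lookup uses the map file's spellings shown in the state_id_map output
toy_id_map = {"Arunanchal Pradesh": 12, "Bihar": 10, "NCT of Delhi": 7}
# Assumed spellings in a data file (illustrative only)
toy_data_names = ["Bihar", "Arunachal Pradesh", "Delhi"]

missing = unmatched_names(toy_data_names, toy_id_map)
print(missing)
```

Run against the real objects as `unmatched_names(df['State or union territory'], state_id_map)`, an empty list would indicate that every row can be linked to the map.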

In [31]:
#Testing that our map actually works 

fig=px.choropleth(df,
                  locations='id',
                  geojson=india_states,
                  color='DensityLog',
                  hover_name='State or union territory',
                  hover_data=['Density','Sex ratio'],
                  title='Density Log')
fig.update_geos(fitbounds="locations", visible=False)
fig.show()

# Step 3: Write a function that makes it easy for a user to create maps as per their requirement 

In [32]:
#Use Mapbox for basemap - This function uses a regular color scale
#measure is the name of the dataframe column to color the map by
def drawmap(measure):
    fig=px.choropleth_mapbox(
        df,
        locations="id",
        geojson=india_states,
        color=measure,
        hover_name='State or union territory',
        hover_data=['Density[a]',measure],
        mapbox_style="carto-positron",
        center={'lat': 24, 'lon': 78},
        zoom=3.8,
        opacity=0.7)
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()

#Example calls: pass the column name as a string
# drawmap('Population')
# drawmap('DensityLog')

In [33]:
# Use Mapbox for basemap - This function uses a diverging color scale
# The midpoint of the color scale is fixed at 0, so it suits measures centered on zero such as SexRatioScale
def drawmapdiverging(measure):
    fig=px.choropleth_mapbox(
        df,
        locations="id",
        geojson=india_states,
        color=measure,
        hover_name='State or union territory',
        hover_data=['Density','Sex ratio'],
        color_continuous_scale=px.colors.diverging.BrBG,
        color_continuous_midpoint=0,
        mapbox_style="carto-positron",
        center={'lat': 24, 'lon': 78},
        zoom=3.8,
        opacity=0.7)
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()

# Example call: pass the column name as a string
# drawmapdiverging('SexRatioScale')

In [None]:
print("Create your own map of India!  See the variables available based on the previous 2011 Census:")
print()
print(df.columns)

while True:
    measure = input("What would you like to make a map of? Please choose from the columns above: ")
    if measure in df.columns:
        #SexRatioScale is centered on zero, so it gets the diverging color scale; every other column uses the regular scale
        if measure == 'SexRatioScale':
            drawmapdiverging(measure)
        else:
            drawmap(measure)
        break
    elif measure.lower() == "quit":
        #The user can enter quit if they want to exit the loop.
        break
    else:
        #If the user enters an input which isn't in the dataframe, this response is printed and the user can try again.
        print("Sorry, I don't recognize this measure.  Please try again.")


Create your own map of India!  See the variables available based on the previous 2011 Census:

Index(['Unnamed: 0', 'Rank', 'State or union territory', 'Population',
       'Decadal Growth 2001 to 2012', 'Rural population', 'Urban population',
       'Area[17]', 'Density[a]', 'Sex ratio', 'Density', 'Area', 'DensityLog',
       'SexRatioScale', 'id'],
      dtype='object')


In [None]:
#While the dataframe was previously cleaned and changed so that the data can be displayed in the maps, the dataframe contains both variables which are useful for plotting and variables which are useful for labels.
#It is also possible to create a list for the purpose of sharing the variables with the user in a cleaner way
#This list can also be passed into the while loop

final_varlist = ['Population', 'Decadal Growth 2001 to 2012', 'Rural population', 'Urban population', 'Sex ratio', 'Density', 'Area', 'DensityLog', 'SexRatioScale']


In [None]:
print("Create your own map of India!  See the variables available based on the previous 2011 Census:")
print()
print(final_varlist)

while True:
    measure = input("What would you like to make a map of? Please choose from the columns above (type 'quit' to exit): ")
    if measure in final_varlist:
        #SexRatioScale is centered on zero, so it gets the diverging color scale; every other variable uses the regular scale
        if measure == 'SexRatioScale':
            drawmapdiverging(measure)
        else:
            drawmap(measure)
        break
    elif measure.lower() == "quit":
        #The user can enter quit if they want to exit the loop.
        break
    else:
        #If the user enters an input which isn't in the list, this response is printed and the user can try again.
        print("Sorry, I don't recognize this measure.  Please try again.")
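
The loop above stops after drawing one map, so the cell has to be re-run for each new map. As a sketch of one possible variation (not the notebook's current behavior), the prompt can be wrapped in a function that keeps asking until the user types `quit`. The name `run_map_prompt` and its parameters are hypothetical; the drawing functions and the input reader are passed in as arguments so that the logic can be tried without opening a browser.

```python
def run_map_prompt(allowed, diverging, draw, draw_diverging, read=input, say=print):
    """Keep asking for a measure until the user types 'quit'.

    allowed: names the user may choose from
    diverging: subset of names that should use the diverging color scale
    draw, draw_diverging: functions that take a measure name and draw the map
    Returns the list of measures that were drawn, in order.
    """
    drawn = []
    while True:
        answer = read("What would you like to make a map of? (type 'quit' to exit): ").strip()
        if answer.lower() == "quit":
            return drawn
        if answer not in allowed:
            say("Sorry, I don't recognize this measure.  Please try again.")
            continue
        if answer in diverging:
            draw_diverging(answer)
        else:
            draw(answer)
        drawn.append(answer)

# Scripted demonstration with stand-ins for input() and the drawing functions
scripted_answers = iter(["Density", "not a column", "SexRatioScale", "QUIT"])
calls = []
result = run_map_prompt(
    allowed=["Density", "SexRatioScale"],
    diverging={"SexRatioScale"},
    draw=lambda measure: calls.append(("regular", measure)),
    draw_diverging=lambda measure: calls.append(("diverging", measure)),
    read=lambda prompt: next(scripted_answers),
    say=lambda message: None,
)
print(result)
```

In the notebook, the corresponding call would be `run_map_prompt(final_varlist, {'SexRatioScale'}, drawmap, drawmapdiverging)`, which leaves `read` and `say` at their defaults of `input` and `print`.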


# Conclusions

#### I have dreamed of making maps for several years, but always lacked the tools to do so. 
#### Python packages like Plotly, with its choropleth functions, make map-making significantly more accessible.
#### On the other hand, it is easy to imagine many policymakers not having the tools to make their own maps.
#### The attempt here was to show a scalable process for map-making, which consists of: 
#### (1) Data cleaning that ends up with a CSV file that later only needs to be imported 
#### (2) Based on the cleaned data file, the CSV and GeoJSON files can be linked
#### (3) Once the CSV and GeoJSON are linked, a simple function can be written in which a map is created based on user input.
#### While the nature of the data cleaning will change slightly based on the underlying data structure, the basic process is repeatable, especially for the steps beyond the data cleaning.
#### Therefore, based on this exercise, a very similar function could be written for a different dataset for India -- most easily for the upcoming 2021 census, or, with a bit more work, for the National Family Health Survey (NFHS), Socioeconomic and Caste Census, or otherwise.  
#### With slight changes to the GeoJSON code, it can be repeated for other geographies as well.  
#### There are a number of other potential improvements that could be made, which I will detail below: 
#### (1) Step zero, before data cleaning, could be code to dynamically scrape data from an existing source and then clean it within the Jupyter Notebook.  While I could use some of the Python tools learned in class to optimize the data frame, in some cases, it seems easier to scrape the file and then edit it manually before reading it back into the Jupyter Notebook and mapping application. 
#### (2) In terms of data cleaning, much of the data was cleaned through one-off lines of code which were written for each specific challenge encountered.  In the future, it would be useful to create functions that could identify the usable data and clean it dynamically rather than create extra columns (as I did for density, for example.)
#### (3) In this Jupyter Notebook, I read in a single GeoJSON object.  However, users may want to create maps at different geographical aggregations.  Ideally, based on the levels of aggregation in the data frame, the program would ask users at what geographical level they would like to construct their map -- for example, at the state, district, block, or village levels.  Having done so, they would then go on to the final step of selecting the variable of interest, such as literacy or population density, and then the map would be created. 
#### (4) In the current program, when the user inputs their selected variable and runs the function, the map is created in a new browser tab.  Once in the browser tab, the user has to navigate back to the Jupyter Notebook manually and re-run the interaction.  There would ideally be a navigation button in the map that would bring users back to the Jupyter Notebook interaction.  I was unable to figure out how to do so.  Given more time, this would be a feature I would add.  "We will fix it in post," as Elon Musk says.



# Final thoughts

#### For someone who had never typed any code in his life, to be able to complete these steps in a matter of a few months is a pretty rapid transformation. 
#### Many of the foundational skills that we learned early on in the semester -- like creating lists, for loops, if statements, and so on -- came in handy, much like knowing verb conjugation or sentence structure in a foreign language.  
#### I can see many useful applications of the skills which I have learned here moving forward, and I believe that I have a strong base on which to build.  Many thanks to Casey and Flynn for your patience and efforts! 

# Final video

#### Link: https://drive.google.com/drive/u/1/folders/16cLr9DOx_zS9SlfOHyaTNlOULZeDHR5-