# Standardizing our key columns before merging dataframes
We often need to merge several dataframes. A common obstacle is that the key columns in the different dataframes are not standardised: they contain the same elements, but with spelling differences, leading or trailing spaces, and so on. Let's look at two such dataframes.


In [0]:
import pandas as pd

In [6]:
# Dataset-1: Hospital beds in India
beds = pd.read_csv('/content/Govt_HospitalBedsInIndia.csv')
beds.head()

Unnamed: 0,State/UT,Number of Govt Hospitals in Rural Areas,Number of Govt Beds in Rural Areas,Number of Govt Hospitals in Urban Areas,Number of Govt Beds in Urban Areas,Total Number of Govt Hospitals,Total Number of Govt Beds,Reference Period
0,Andhra Pradesh,193,6480,65,16658,258,23138,01.01.2017
1,Arunachal Pradesh*,208,2136,10,268,218,2404,31.12.2018
2,Assam *,1176,10944,50,6198,1226,17142,31.12.2017
3,Bihar,1032,5510,115,6154,1147,11664,31.12.2018
4,Chhattisgarh,169,5070,45,4342,214,9412,01.01.2016


In [7]:
# Dataset-2: Population in various states
population = pd.read_csv('/content/Population.csv')
population.head()

Unnamed: 0,Rank,State or union territory,Population,Decadal growth,Rural population,Urban population,Sex ratio
0,1,Uttar Pradesh,199812342,0.202,155317278,44495063,912
1,2,Maharashtra,112374333,0.2,61556074,50818259,929
2,3,Bihar,104099452,0.254,92341436,11758016,918
3,4,West Bengal,91276115,0.238,62183113,29093002,953
4,5,Madhya Pradesh,72626809,0.163,52557404,20069405,931


You may want to merge the two datasets to find the per-capita availability of hospital beds in each state. But the merge will not work properly, because the state names are not standardised.

In [14]:
percapita_beds = pd.merge(beds, population, left_on='State/UT', right_on='State or union territory', how='inner')
percapita_beds

Unnamed: 0,State/UT,Number of Govt Hospitals in Rural Areas,Number of Govt Beds in Rural Areas,Number of Govt Hospitals in Urban Areas,Number of Govt Beds in Urban Areas,Total Number of Govt Hospitals,Total Number of Govt Beds,Reference Period,Rank,State or union territory,Population,Decadal growth,Rural population,Urban population,Sex ratio
0,Andhra Pradesh,193,6480,65,16658,258,23138,01.01.2017,10,Andhra Pradesh,"49,577,103[b]",0.11,34966693,14610410,993
1,Bihar,1032,5510,115,6154,1147,11664,31.12.2018,3,Bihar,104099452,0.254,92341436,11758016,918
2,Chhattisgarh,169,5070,45,4342,214,9412,01.01.2016,17,Chhattisgarh,25545198,0.226,19607961,5937237,991
3,Gujarat,363,11688,75,8484,438,20172,31.12.2018,9,Gujarat,60439692,0.193,34694609,25745083,919
4,Jharkhand,519,5842,36,4942,555,10784,31.12.2015,14,Jharkhand,32988134,0.224,25055073,7933061,948
5,Kerala,981,16865,299,21139,1280,38004,01.01.2017,13,Kerala,33406061,0.049,17471135,15934926,1084
6,Madhya Pradesh,330,9900,135,21206,465,31106,01.01.2018,5,Madhya Pradesh,72626809,0.163,52557404,20069405,931
7,Maharashtra,273,12398,438,39048,711,51446,31.12.2015,2,Maharashtra,112374333,0.2,61556074,50818259,929
8,Manipur,23,730,7,697,30,1427,01.01.2014,23,Manipur,2570390,0.186,1793875,776515,992
9,Nagaland,21,630,15,1250,36,1880,31.12.2015,24,Nagaland,1978502,-0.6%,1407536,570966,931


Only 16 of the 36 entries made it into the join, because the other 20 state names differ slightly between the two dataframes.

In [15]:
sorted(list(beds['State/UT']))

['A&N Island',
 'Andhra Pradesh',
 'Arunachal Pradesh*',
 'Assam *',
 'Bihar',
 'Chandigarh',
 'Chhattisgarh',
 'D&N Haveli*',
 'Daman & Diu',
 'Delhi',
 'Goa*',
 'Gujarat',
 'Haryana*',
 'Himachal Pradesh*',
 'Jammu & Kashmir',
 'Jharkhand',
 'Karnataka*',
 'Kerala',
 'Lakshadweep',
 'Madhya Pradesh',
 'Maharashtra',
 'Manipur',
 'Meghalaya*',
 'Mizoram*',
 'Nagaland',
 'Odisha*',
 'Puducherry',
 'Punjab*',
 'Rajasthan *',
 'Sikkim*',
 'Tamil Nadu*',
 'Telangana*',
 'Tripura*',
 'Uttar Pradesh*',
 'Uttarakhand',
 'West Bengal']

In [17]:
sorted(list(population['State or union territory']))

['Andaman and Nicobar Islands',
 'Andhra Pradesh',
 'Arunachal Pradesh',
 'Assam',
 'Bihar',
 'Chandigarh',
 'Chhattisgarh',
 'Dadra and Nagar Haveli and Daman and Diu',
 'Delhi',
 'Goa',
 'Gujarat',
 'Haryana',
 'Himachal Pradesh',
 'Jammu and Kashmir',
 'Jharkhand',
 'Karnataka',
 'Kerala',
 'Ladakh',
 'Lakshadweep',
 'Madhya Pradesh',
 'Maharashtra',
 'Manipur',
 'Meghalaya',
 'Mizoram',
 'Nagaland',
 'Odisha',
 'Puducherry',
 'Punjab',
 'Rajasthan',
 'Sikkim',
 'Tamil Nadu',
 'Telangana',
 'Tripura',
 'Uttar Pradesh',
 'Uttarakhand',
 'West Bengal']

You can see that many states/UTs are written differently. For example, "Andaman and Nicobar Islands" in one dataframe is "A&N Island" in the other.
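
Rather than comparing the two lists by eye, the mismatched keys can also be listed programmatically. The sketch below is only an illustration: `beds_demo` and `population_demo` are small made-up stand-ins that reuse a few names from the lists above. It relies on the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying whether each key was found in the left dataframe, the right one, or both.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (names copied from the lists above).
beds_demo = pd.DataFrame({"State/UT": ["A&N Island", "Assam *", "Bihar"]})
population_demo = pd.DataFrame(
    {"State or union territory": ["Andaman and Nicobar Islands", "Assam", "Bihar"]}
)

# An outer merge keeps every key; indicator=True records where each key was found.
check = pd.merge(
    beds_demo,
    population_demo,
    left_on="State/UT",
    right_on="State or union territory",
    how="outer",
    indicator=True,
)

# Rows found in only one dataframe are the keys that need standardising.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["State/UT", "State or union territory", "_merge"]])
```

Here only "Bihar" is found in both dataframes; the other names show up as `left_only` or `right_only`.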

Such issues affect our joins. For a one-off merge we can simply rename the values that break the join. But when we have to merge many dataframes, we need a **standard key column**.

In our case, let's take the state names in the population dataframe as the standard format. In any dataframe we work with in future, the state names should match this standard format. Only then can joins be done easily.

## LOGIC:
1. Select a state name and compare it with every standard state name. Each comparison should produce a score that measures how closely the two strings match.
2. Pick the standard state name that has the highest score for the selected state.
3. Replace the state name with that standard name.
4. Repeat for the other states.

In [0]:
# SequenceMatcher gives us a metric that measures how closely two strings match
from difflib import SequenceMatcher

# Return a matching score between two strings a and b. The higher the score, the better the match.
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [20]:
similar('Apple',"Apple")

1.0

In [21]:
similar('Apple',"Aple")

0.8888888888888888

In [22]:
similar('Apple',"Pineapple")

0.5714285714285714

In [23]:
similar('Apple',"Mango")

0.0
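
A note on these scores: `SequenceMatcher.ratio()` returns 2·M/T, where M is the number of matching characters and T is the combined length of both strings. The comparison is case-sensitive, which is why 'Apple' and 'Pineapple' score only 0.57 (the capital 'A' does not match the lowercase 'a'). The sketch below shows one possible normalising variant. The helper `similar_normalized` and its cleaning rules (lower-casing, removing `*` markers, trimming spaces) are assumptions made for illustration and are not used in the rest of this notebook.

```python
from difflib import SequenceMatcher

def similar_normalized(a, b):
    # Hypothetical helper: clean both strings before scoring them.
    def clean(s):
        # Remove '*' footnote markers, trim surrounding spaces, ignore case.
        return s.replace("*", "").strip().lower()
    return SequenceMatcher(None, clean(a), clean(b)).ratio()

# Case-sensitive score versus the normalised score for the same pair.
print(SequenceMatcher(None, "Apple", "Pineapple").ratio())
print(similar_normalized("Apple", "Pineapple"))

# A trailing ' *' no longer lowers the score.
print(similar_normalized("Assam *", "Assam"))
```

Whether this kind of cleaning helps depends on the data; for the state names here, the `*` markers and stray spaces are the most visible differences.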

In [28]:
# Use the state names in the population dataframe as the standard
standard_state_names = list(population['State or union territory'])

# List to store the standard name chosen for each state in the beds dataframe
standard_names = []

for state in beds['State/UT']:
  # Similarity score of this state name against every standard name
  matches = [similar(state, standard) for standard in standard_state_names]

  # Position of the highest score; the standard name at that position is the best match
  maxpos = matches.index(max(matches))
  standard_names.append(standard_state_names[maxpos])

# Replace the original names with their standard equivalents (same row order)
beds['State/UT'] = standard_names
beds

Unnamed: 0,State/UT,Number of Govt Hospitals in Rural Areas,Number of Govt Beds in Rural Areas,Number of Govt Hospitals in Urban Areas,Number of Govt Beds in Urban Areas,Total Number of Govt Hospitals,Total Number of Govt Beds,Reference Period
0,Andhra Pradesh,193,6480,65,16658,258,23138,01.01.2017
1,Arunachal Pradesh,208,2136,10,268,218,2404,31.12.2018
2,Assam,1176,10944,50,6198,1226,17142,31.12.2017
3,Bihar,1032,5510,115,6154,1147,11664,31.12.2018
4,Chhattisgarh,169,5070,45,4342,214,9412,01.01.2016
5,Goa,18,1397,25,1615,43,3012,31.12.2018
6,Gujarat,363,11688,75,8484,438,20172,31.12.2018
7,Haryana,609,6690,59,4550,668,11240,31.12.2016
8,Himachal Pradesh,705,5665,96,6734,801,12399,31.12.2017
9,Jammu and Kashmir,35,1221,108,6070,143,7291,31.12.2018
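
Before merging, it may be worth checking how confident each match was. The loop above always accepts the highest-scoring standard name, however low that score is, so a name with no good counterpart would still be renamed to something. A minimal sketch of a safer variant follows. The function `best_match`, the `cutoff` value of 0.6, and the toy lists are assumptions chosen for illustration, not values tuned for this dataset.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, cutoff=0.6):
    # Score the name against every candidate and keep the best (score, candidate) pair.
    scored = [(SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    score, candidate = max(scored)
    # Return None instead of a name when even the best score is below the cutoff.
    return (candidate if score >= cutoff else None), score

standard = ["Assam", "Bihar", "Andaman and Nicobar Islands"]
for raw in ["Assam *", "Bihar", "Mango"]:
    print(raw, "->", best_match(raw, standard))
```

Names that come back as `None` can then be reviewed and mapped by hand. The standard library also offers `difflib.get_close_matches`, which applies a similar cutoff but returns only the matching names, not their scores.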


Let's merge the dataframes now.

In [29]:
percapita_beds = pd.merge(beds, population, left_on='State/UT', right_on='State or union territory', how='inner')
percapita_beds

Unnamed: 0,State/UT,Number of Govt Hospitals in Rural Areas,Number of Govt Beds in Rural Areas,Number of Govt Hospitals in Urban Areas,Number of Govt Beds in Urban Areas,Total Number of Govt Hospitals,Total Number of Govt Beds,Reference Period,Rank,State or union territory,Population,Decadal growth,Rural population,Urban population,Sex ratio
0,Andhra Pradesh,193,6480,65,16658,258,23138,01.01.2017,10,Andhra Pradesh,"49,577,103[b]",0.11,34966693,14610410,993
1,Arunachal Pradesh,208,2136,10,268,218,2404,31.12.2018,26,Arunachal Pradesh,1383727,0.26,1066358,317369,938
2,Assam,1176,10944,50,6198,1226,17142,31.12.2017,15,Assam,31205576,0.171,26807034,4398542,954
3,Bihar,1032,5510,115,6154,1147,11664,31.12.2018,3,Bihar,104099452,0.254,92341436,11758016,918
4,Chhattisgarh,169,5070,45,4342,214,9412,01.01.2016,17,Chhattisgarh,25545198,0.226,19607961,5937237,991
5,Goa,18,1397,25,1615,43,3012,31.12.2018,25,Goa,1458545,0.082,551731,906814,973
6,Gujarat,363,11688,75,8484,438,20172,31.12.2018,9,Gujarat,60439692,0.193,34694609,25745083,919
7,Haryana,609,6690,59,4550,668,11240,31.12.2016,18,Haryana,25351462,0.199,16509359,8842103,879
8,Himachal Pradesh,705,5665,96,6734,801,12399,31.12.2017,20,Himachal Pradesh,6864602,0.129,6176050,688552,972
9,Jammu and Kashmir,35,1221,108,6070,143,7291,31.12.2018,UT1,Jammu and Kashmir,12267032,0.236,9064220,3202812,890


# PERFECT MERGE!
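
With the merge in place, the per-capita figure we set out to find can be computed. Some `Population` values are stored as text with thousands separators and footnote markers (for example "49,577,103[b]"), so they need cleaning first. The sketch below uses a small hypothetical `demo` dataframe holding two rows from the merged output. The cleaning rules and the new column name `Beds per 100,000 people` are assumptions made for illustration.

```python
import pandas as pd

# Two rows copied from the merged output above, as a toy stand-in.
demo = pd.DataFrame({
    "State/UT": ["Andhra Pradesh", "Bihar"],
    "Total Number of Govt Beds": [23138, 11664],
    "Population": ["49,577,103[b]", "104099452"],
})

# Drop footnote markers such as "[b]" and thousands separators, then convert to integers.
population_clean = (
    demo["Population"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.replace(",", "", regex=False)
    .astype("int64")
)

# Government hospital beds per 100,000 people.
demo["Beds per 100,000 people"] = (
    demo["Total Number of Govt Beds"] / population_clean * 100_000
)
print(demo[["State/UT", "Beds per 100,000 people"]])
```

The same steps should carry over to `percapita_beds`, provided its `Population` column contains no other kinds of non-numeric text.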