# Inconsistent Data Entry
## Table of Contents <a id='TOC'></a>
- [Package Import](#package-import)
- [Data Import](#data-import)
- [Text pre-processing](#text-pre-processing)
- [Using fuzzy matching](#using-fuzzy-matching)
- [The first city name](#city-one)
- [The second city name](#city-two)
- [Extra work](#extra-work)

## Package Import <a id='package-import'></a>
[TOC](#TOC)

In [1]:
import pandas as pd
import numpy as np

import fuzzywuzzy
# import fuzz explicitly, since fuzzywuzzy.fuzz.token_sort_ratio is used below
from fuzzywuzzy import fuzz, process
import chardet

np.random.seed(0)

## Data Import <a id='data-import'></a>
[TOC](#TOC)

In [64]:
# look at the first hundred thousand bytes to guess the character encoding
_file = "data/PakistanSuicideAttacks Ver 11 (30-November-2017).csv"
with open(_file, 'rb') as f:
    result = chardet.detect(f.read(100000))
# see what the character encoding might be
result

{'confidence': 0.73, 'encoding': 'Windows-1252', 'language': ''}

In [65]:
suicide_attacks = pd.read_csv(_file, encoding="Windows-1252")
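
Why the encoding guess matters can be illustrated with a small standard-library sketch. The sample string is made up for illustration and is not taken from the dataset, and `try_decode` is a hypothetical helper. It only shows that bytes written as Windows-1252 may be invalid as UTF-8, which is the kind of failure a wrong `encoding` argument can cause.

```python
# Hypothetical sample text; "\u2019" (a curly apostrophe) is byte 0x92 in Windows-1252.
sample = "Sha\u2019baan"
raw = sample.encode("windows-1252")

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode(raw, "utf-8"))                   # None: a lone 0x92 is not valid UTF-8
print(try_decode(raw, "windows-1252") == sample)  # True
```

Note that `chardet` reports a confidence of only 0.73 here, so the detected encoding is a guess worth checking against the loaded text.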

In [5]:
suicide_attacks.head()

| | S# | Date | Islamic Date | Blast Day Type | Holiday Type | Time | City | Latitude | Longitude | Province | ... | Targeted Sect if any | Killed Min | Killed Max | Injured Min | Injured Max | No. of Suicide Blasts | Explosive Weight (max) | Hospital Names | Temperature(C) | Temperature(F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sunday-November 19-1995 | 25 Jumaada al-THaany 1416 A.H | Holiday | Weekend | | Islamabad | 33.718 | 73.0718 | Capital | ... | | 14.0 | 15.0 | | 60 | 2.0 | | | 15.835 | 60.503 |
| 1 | 2 | Monday-November 6-2000 | 10 SHa`baan 1421 A.H | Working Day | | | Karachi | 24.9918 | 66.9911 | Sindh | ... | | | 3.0 | | 3 | 1.0 | | | 23.77 | 74.786 |
| 2 | 3 | Wednesday-May 8-2002 | 25 safar 1423 A.H | Working Day | | 7:45 AM | Karachi | 24.9918 | 66.9911 | Sindh | ... | Christian | 13.0 | 15.0 | 20.0 | 40 | 1.0 | 2.5 Kg | 1.Jinnah Postgraduate Medical Center 2. Civil ... | 31.46 | 88.628 |
| 3 | 4 | Friday-June 14-2002 | 3 Raby` al-THaany 1423 A.H | Working Day | | 11:10:00 AM | Karachi | 24.9918 | 66.9911 | Sindh | ... | Christian | | 12.0 | | 51 | 1.0 | | | 31.43 | 88.574 |
| 4 | 5 | Friday-July 4-2003 | 4 Jumaada al-awal 1424 A.H | Working Day | | | Quetta | 30.2095 | 67.0182 | Baluchistan | ... | Shiite | 44.0 | 47.0 | | 65 | 1.0 | | 1.CMH Quetta \n2.Civil Hospital 3. Boland Medi... | 33.12 | 91.616 |


In [4]:
suicide_attacks.describe()

| | S# | Latitude | Killed Min | Killed Max | Injured Min | No. of Suicide Blasts | Temperature(C) | Temperature(F) |
|---|---|---|---|---|---|---|---|---|
| count | 496.0 | 493.0 | 350.0 | 480.0 | 365.0 | 414.0 | 491.0 | 489.0 |
| mean | 248.5 | 32.614705 | 14.725714 | 15.20625 | 31.39726 | 1.115942 | 21.111599 | 69.972579 |
| std | 143.327132 | 2.475917 | 17.60093 | 20.270436 | 38.603842 | 0.394989 | 8.369068 | 15.069622 |
| min | 1.0 | 24.879503 | 0.0 | 0.0 | 0.0 | 1.0 | -2.37 | 27.734 |
| 25% | 124.75 | 31.8238 | 3.0 | 3.0 | 7.0 | 1.0 | 14.69 | 58.37 |
| 50% | 248.5 | 33.5833 | 8.0 | 8.0 | 20.0 | 1.0 | 21.405 | 70.529 |
| 75% | 372.25 | 34.0043 | 20.0 | 18.25 | 40.0 | 1.0 | 28.115 | 82.499 |
| max | 496.0 | 35.3833 | 125.0 | 148.0 | 320.0 | 4.0 | 44.0 | 111.0 |


## Text pre-processing <a id='text-pre-processing'></a>
[TOC](#TOC)

In [66]:
# get all the unique values in the `City` column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
for i in cities:
    print(i)

ATTOCK
Attock 
Bajaur Agency
Bannu
Bhakkar 
Buner
Chakwal 
Chaman
Charsadda
Charsadda 
D. I Khan
D.G Khan
D.G Khan 
D.I Khan
D.I Khan 
Dara Adam Khel
Dara Adam khel
Fateh Jang
Ghallanai, Mohmand Agency 
Gujrat
Hangu
Haripur
Hayatabad
Islamabad
Islamabad 
Jacobabad
KURRAM AGENCY
Karachi
Karachi 
Karak
Khanewal
Khuzdar
Khyber Agency
Khyber Agency 
Kohat
Kohat 
Kuram Agency 
Lahore
Lahore 
Lakki Marwat
Lakki marwat
Lasbela
Lower Dir
MULTAN
Malakand 
Mansehra
Mardan
Mohmand Agency
Mohmand Agency 
Mohmand agency
Mosal Kor, Mohmand Agency
Multan
Muzaffarabad
North Waziristan
North waziristan
Nowshehra
Orakzai Agency
Peshawar
Peshawar 
Pishin
Poonch
Quetta
Quetta 
Rawalpindi
Sargodha
Sehwan town
Shabqadar-Charsadda
Shangla 
Shikarpur
Sialkot
South Waziristan
South waziristan
Sudhanoti
Sukkur
Swabi 
Swat
Swat 
Taftan
Tangi, Charsadda District
Tank
Tank 
Taunsa
Tirah Valley
Totalai
Upper Dir
Wagah
Zhob
bannu
karachi
karachi 
lakki marwat
peshawar
swat


In [67]:
# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing whitespace
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
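
The effect of these two steps can be checked on a handful of raw spellings copied from the listing above. This is a plain-Python sketch (no pandas needed), and `normalize` is a hypothetical helper name that mirrors `.str.lower()` followed by `.str.strip()`.

```python
# A few raw spellings copied from the unique-value listing above.
raw_cities = ["ATTOCK", "Attock ", "Karachi", "Karachi ", "karachi", "karachi "]

def normalize(name):
    """Lower-case a name and strip leading and trailing whitespace."""
    return name.strip().lower()

cleaned = sorted({normalize(name) for name in raw_cities})
print(len(set(raw_cities)), "->", len(cleaned))  # 6 -> 2
print(cleaned)                                   # ['attock', 'karachi']
```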

In [68]:
province = suicide_attacks['Province'].unique()
province.sort()
for i in province:
    print(i)

AJK
Balochistan
Baluchistan
Capital
FATA
Fata
KPK
Punjab
Sindh


In [69]:
suicide_attacks.Province = suicide_attacks.Province.str.lower()
suicide_attacks.Province = suicide_attacks.Province.str.strip()
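
Lower-casing merges "FATA" and "Fata", but "Balochistan" and "Baluchistan" differ by a letter and would remain two separate values. One option, sketched below in plain Python, is an explicit alias mapping. The choice of "balochistan" as the canonical spelling and the names `PROVINCE_ALIASES` and `canonical_province` are assumptions for illustration; the notebook itself does not do this.

```python
# Assumed canonical spelling, for illustration only.
PROVINCE_ALIASES = {"baluchistan": "balochistan"}

def canonical_province(name):
    """Normalize case and whitespace, then map known spelling variants."""
    cleaned = name.strip().lower()
    return PROVINCE_ALIASES.get(cleaned, cleaned)

# Province values copied from the listing above.
provinces = ["AJK", "Balochistan", "Baluchistan", "Capital", "FATA", "Fata",
             "KPK", "Punjab", "Sindh"]
unique_provinces = sorted({canonical_province(p) for p in provinces})
print(unique_provinces)
```

With pandas, the same dictionary could probably be passed to `Series.replace` after the lower-case and strip steps.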

## Using fuzzy matching to correct inconsistent data entry <a id='using-fuzzy-matching'></a>
[TOC](#TOC)

In [70]:
cities = suicide_attacks['City'].unique()
cities.sort()
for i in cities:
    print(i)

attock
bajaur agency
bannu
bhakkar
buner
chakwal
chaman
charsadda
d. i khan
d.g khan
d.i khan
dara adam khel
fateh jang
ghallanai, mohmand agency
gujrat
hangu
haripur
hayatabad
islamabad
jacobabad
karachi
karak
khanewal
khuzdar
khyber agency
kohat
kuram agency
kurram agency
lahore
lakki marwat
lasbela
lower dir
malakand
mansehra
mardan
mohmand agency
mosal kor, mohmand agency
multan
muzaffarabad
north waziristan
nowshehra
orakzai agency
peshawar
pishin
poonch
quetta
rawalpindi
sargodha
sehwan town
shabqadar-charsadda
shangla
shikarpur
sialkot
south waziristan
sudhanoti
sukkur
swabi
swat
taftan
tangi, charsadda district
tank
taunsa
tirah valley
totalai
upper dir
wagah
zhob


>**Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.

This sounds a lot like aligning two protein or DNA sequences.
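
The "number of changes" idea in the definition above corresponds to the Levenshtein edit distance, and the same dynamic-programming approach underlies global sequence alignment. Below is a minimal sketch; `edit_distance` is a name chosen here for illustration, and the scorers in `fuzzywuzzy` are related similarity ratios rather than this exact count.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(edit_distance("apple", "snapple"))  # 2
print(edit_distance("in", "on"))          # 1
```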

### The first city name <a id='city-one'></a>
[TOC](#TOC)

In [71]:
# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, 
                                     scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# take a look at the matches
matches

[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]
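
To see why "d. i khan" scores 100 while "d.g khan" scores 88, here is a rough standard-library approximation of a token-sort score. It is a sketch under assumptions: `token_sort_score` is a hypothetical name, `difflib.SequenceMatcher` stands in for the library's ratio, and the preprocessing (replace non-word characters with spaces, lower-case, sort the tokens) only approximates what `fuzzywuzzy` does, so other inputs may score differently.

```python
import re
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Approximate token-sort similarity on a 0-100 scale (difflib stand-in)."""
    def prepare(text):
        # Replace non-word characters with spaces, lower-case, then sort tokens.
        tokens = re.sub(r"\W", " ", text).lower().split()
        return " ".join(sorted(tokens))
    ratio = SequenceMatcher(None, prepare(a), prepare(b)).ratio()
    return round(100 * ratio)

print(token_sort_score("d.i khan", "d. i khan"))  # 100
print(token_sort_score("d.i khan", "d.g khan"))   # 88
```

Both spellings of "d.i khan" reduce to the same tokens (`d`, `i`, `khan`), so they compare as identical, while "d.g khan" differs in one of eight characters.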

In [72]:
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10,
                                         scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    
    # keep only the matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    
    print("All done!")

In [73]:
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match='d.i khan')

All done!
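
Because `replace_matches_in_column` modifies the dataframe in place and only prints "All done!", it can help to preview which values would be merged first. The sketch below is a plain-Python stand-in: `preview_replacements` is a hypothetical helper, and it scores with `difflib.SequenceMatcher` rather than `fuzzywuzzy`, so its scores may differ somewhat from the ones shown above.

```python
from difflib import SequenceMatcher

def preview_replacements(values, target, min_ratio=90):
    """Return the distinct values whose similarity to `target` is >= min_ratio."""
    def score(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())
    return sorted(v for v in set(values) if score(v, target) >= min_ratio)

# A few city values copied from the cleaned listing above.
cities = ["d. i khan", "d.g khan", "d.i khan", "khanewal", "kohat"]
print(preview_replacements(cities, "d.i khan"))  # ['d. i khan', 'd.i khan']
```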


In [74]:
cities = suicide_attacks['City'].unique()
cities.sort()
for i in cities:
    print(i)

attock
bajaur agency
bannu
bhakkar
buner
chakwal
chaman
charsadda
d.g khan
d.i khan
dara adam khel
fateh jang
ghallanai, mohmand agency
gujrat
hangu
haripur
hayatabad
islamabad
jacobabad
karachi
karak
khanewal
khuzdar
khyber agency
kohat
kuram agency
kurram agency
lahore
lakki marwat
lasbela
lower dir
malakand
mansehra
mardan
mohmand agency
mosal kor, mohmand agency
multan
muzaffarabad
north waziristan
nowshehra
orakzai agency
peshawar
pishin
poonch
quetta
rawalpindi
sargodha
sehwan town
shabqadar-charsadda
shangla
shikarpur
sialkot
south waziristan
sudhanoti
sukkur
swabi
swat
taftan
tangi, charsadda district
tank
taunsa
tirah valley
totalai
upper dir
wagah
zhob


### The second city name <a id='city-two'></a>
[TOC](#TOC)

In [75]:
matches = fuzzywuzzy.process.extract("kuram agency", cities, limit=10,
                                     scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches

[('kuram agency', 100),
 ('kurram agency', 96),
 ('bajaur agency', 72),
 ('khyber agency', 72),
 ('orakzai agency', 69),
 ('mohmand agency', 62),
 ('mosal kor, mohmand agency', 61),
 ('ghallanai, mohmand agency', 50),
 ('gujrat', 44),
 ('d.g khan', 40)]

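Only "kurram agency" (96) is a genuine match; the next-best score is 72. So a `min_ratio` of 95 keeps the real match and excludes the other agencies.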
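
One way to reason about a cutoff is to look for the biggest drop in the ranked scores. The sketch below uses scores copied from the match list above; `largest_gap` is a hypothetical helper and the heuristic is only a starting point, not a rule the notebook follows.

```python
# Scores copied from the "kuram agency" match list above.
matches = [("kuram agency", 100), ("kurram agency", 96), ("bajaur agency", 72),
           ("khyber agency", 72), ("orakzai agency", 69), ("mohmand agency", 62)]

def largest_gap(scored):
    """Return the (upper, lower) scores around the biggest drop in a ranked list."""
    scores = sorted((score for _, score in scored), reverse=True)
    pairs = list(zip(scores, scores[1:]))
    return max(pairs, key=lambda pair: pair[0] - pair[1])

upper, lower = largest_gap(matches)
print(upper, lower)  # 96 72
```

Any cutoff above 72 and at most 96 separates the two groups here. The heuristic can mislead, though: for "d.i khan" the biggest drop is from 88 to 50, which would pull in "d.g khan", a different city, so the candidates still need to be read by eye.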

In [76]:
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match='kuram agency', min_ratio=95)

All done!


In [77]:
cities = suicide_attacks['City'].unique()
cities.sort()
for i in cities:
    print(i)

attock
bajaur agency
bannu
bhakkar
buner
chakwal
chaman
charsadda
d.g khan
d.i khan
dara adam khel
fateh jang
ghallanai, mohmand agency
gujrat
hangu
haripur
hayatabad
islamabad
jacobabad
karachi
karak
khanewal
khuzdar
khyber agency
kohat
kuram agency
lahore
lakki marwat
lasbela
lower dir
malakand
mansehra
mardan
mohmand agency
mosal kor, mohmand agency
multan
muzaffarabad
north waziristan
nowshehra
orakzai agency
peshawar
pishin
poonch
quetta
rawalpindi
sargodha
sehwan town
shabqadar-charsadda
shangla
shikarpur
sialkot
south waziristan
sudhanoti
sukkur
swabi
swat
taftan
tangi, charsadda district
tank
taunsa
tirah valley
totalai
upper dir
wagah
zhob


## Extra work <a id='extra-work'></a>
[TOC](#TOC)

In [78]:
suicide_attacks.columns

Index(['S#', 'Date', 'Islamic Date', 'Blast Day Type', 'Holiday Type', 'Time',
       'City', 'Latitude', 'Longitude', 'Province', 'Location',
       'Location Category', 'Location Sensitivity', 'Open/Closed Space',
       'Influencing Event/Event', 'Target Type', 'Targeted Sect if any',
       'Killed Min', 'Killed Max', 'Injured Min', 'Injured Max',
       'No. of Suicide Blasts', 'Explosive Weight (max)', 'Hospital Names',
       'Temperature(C)', 'Temperature(F)'],
      dtype='object')

In [102]:
day_type = suicide_attacks['Holiday Type'].dropna().unique()
day_type.sort()
for i in day_type:
    print(i)

Ashura
Christmas/birthday of Quaid-e-Azam
Defence Day
Eid Holidays
General Elections
Iqbal Day
Labour Day
Pakistan Day
Weekend


In [82]:
# helper: return the top 10 fuzzy matches for `string` among the values in `series`
def matches(string, series):
    return fuzzywuzzy.process.extract(string, series, limit=10,
                                      scorer=fuzzywuzzy.fuzz.token_sort_ratio)

In [100]:
matches("Christmas/birthday of Quaid-e-Azam", day_type)

[('Christmas/ birthday of Quaid-e-Azam', 100),
 ('Christmas/birthday of Quaid-e-Azam', 100),
 ('Pakistan Day', 35),
 ('Iqbal Day', 33),
 ('Labour Day', 32),
 ('Eid Holidays', 30),
 ('Defence Day', 27),
 ('General Elections', 24),
 ('Ashura', 20),
 ('Weekend', 10)]

So I'm going to set `min_ratio` to 40 for "Eid Holidays" and replace all of the Eid... holidays with that single value.

In [85]:
replace_matches_in_column(suicide_attacks, "Holiday Type",
                          "Eid-ul-azha", min_ratio=70)

All done!


In [93]:
replace_matches_in_column(suicide_attacks, "Holiday Type",
                          "Ashura", min_ratio=59)

All done!


In [97]:
replace_matches_in_column(suicide_attacks, "Holiday Type",
                          "Eid Holidays", min_ratio=40)

All done!


In [101]:
replace_matches_in_column(suicide_attacks, "Holiday Type",
                          "Christmas/birthday of Quaid-e-Azam", min_ratio=90)

All done!
