# Introduction

## Purpose:

+ The purpose of this Jupyter notebook is to practice data cleaning using the pandas library.

## The dataset:

+ The dataset used for this cleaning practice is a summary of suicide bombings in Pakistan, which can be found on [Kaggle](https://www.kaggle.com/zusmani/pakistansuicideattacks/downloads/pakistansuicideattacks.zip/6).

### Import

In [1]:
# Import packages necessary to open zip files and clean data

import zipfile
import pandas as pd

### Gather

In [2]:
# Open zip file containing dataset

with zipfile.ZipFile('pakistansuicideattacks.zip') as myzip:
    myzip.extractall()

In [6]:
# Open csv as pandas dataframe

df = pd.read_csv('terrorist_bombings.csv', encoding="ISO-8859-1")


In [7]:
# Show first 5 rows of dataframe

df.head()

| | S# | Date | Islamic Date | Blast Day Type | Holiday Type | Time | City | Latitude | Longitude | Province | ... | Targeted Sect if any | Killed Min | Killed Max | Injured Min | Injured Max | No. of Suicide Blasts | Explosive Weight (max) | Hospital Names | Temperature(C) | Temperature(F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sunday-November 19-1995 | 25 Jumaada al-THaany 1416 A.H | Holiday | Weekend | | Islamabad | 33.718 | 73.0718 | Capital | ... | | 14.0 | 15.0 | | 60 | 2.0 | | | 15.835 | 60.503 |
| 1 | 2 | Monday-November 6-2000 | 10 SHa`baan 1421 A.H | Working Day | | | Karachi | 24.9918 | 66.9911 | Sindh | ... | | | 3.0 | | 3 | 1.0 | | | 23.77 | 74.786 |
| 2 | 3 | Wednesday-May 8-2002 | 25 safar 1423 A.H | Working Day | | 7:45 AM | Karachi | 24.9918 | 66.9911 | Sindh | ... | Christian | 13.0 | 15.0 | 20.0 | 40 | 1.0 | 2.5 Kg | 1.Jinnah Postgraduate Medical Center 2. Civil ... | 31.46 | 88.628 |
| 3 | 4 | Friday-June 14-2002 | 3 Raby` al-THaany 1423 A.H | Working Day | | 11:10:00 AM | Karachi | 24.9918 | 66.9911 | Sindh | ... | Christian | | 12.0 | | 51 | 1.0 | | | 31.43 | 88.574 |
| 4 | 5 | Friday-July 4-2003 | 4 Jumaada al-awal 1424 A.H | Working Day | | | Quetta | 30.2095 | 67.0182 | Baluchistan | ... | Shiite | 44.0 | 47.0 | | 65 | 1.0 | | 1.CMH Quetta \n2.Civil Hospital 3. Boland Medi... | 33.12 | 91.616 |


### Assess

In [8]:
# Check for clarity of column names

df.columns

Index(['S#', 'Date', 'Islamic Date', 'Blast Day Type', 'Holiday Type', 'Time',
       'City', 'Latitude', 'Longitude', 'Province', 'Location',
       'Location Category', 'Location Sensitivity', 'Open/Closed Space',
       'Influencing Event/Event', 'Target Type', 'Targeted Sect if any',
       'Killed Min', 'Killed Max', 'Injured Min', 'Injured Max',
       'No. of Suicide Blasts', 'Explosive Weight (max)', 'Hospital Names',
       'Temperature(C)', 'Temperature(F)'],
      dtype='object')

In [9]:
# Check for correct data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 26 columns):
S#                         496 non-null int64
Date                       496 non-null object
Islamic Date               342 non-null object
Blast Day Type             486 non-null object
Holiday Type               72 non-null object
Time                       285 non-null object
City                       496 non-null object
Latitude                   493 non-null float64
Longitude                  493 non-null object
Province                   496 non-null object
Location                   493 non-null object
Location Category          461 non-null object
Location Sensitivity       460 non-null object
Open/Closed Space          461 non-null object
Influencing Event/Event    191 non-null object
Target Type                470 non-null object
Targeted Sect if any       448 non-null object
Killed Min                 350 non-null float64
Killed Max                 480 non-null float64
...

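**Observations:**

+ 'Date' is stored as a string (object), not datetime
+ 'Longitude' is stored as a string (object), while 'Latitude' is float64
+ 'Explosive Weight (max)' is stored as a string (for example '2.5 Kg'), not int or float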
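
The type problems noted for the date and explosive weight columns suggest two conversions for the Clean step. The sketch below is one possible approach, run on two sample rows copied from the `df.head()` output rather than on the full dataset. It assumes every date follows the `Weekday-Month day-Year` pattern and every weight is a number followed by a unit, which the real columns may not satisfy. The helper names `parse_blast_date` and `parse_weight` are hypothetical.

```python
import pandas as pd

# Sample rows copied from the df.head() output above (not the full dataset)
sample = pd.DataFrame({
    "Date": ["Sunday-November 19-1995", "Wednesday-May 8-2002"],
    "Explosive Weight (max)": [None, "2.5 Kg"],
})

def parse_blast_date(dates: pd.Series) -> pd.Series:
    """Parse 'Weekday-Month day-Year' strings; unparseable values become NaT."""
    return pd.to_datetime(dates, format="%A-%B %d-%Y", errors="coerce")

def parse_weight(weights: pd.Series) -> pd.Series:
    """Extract the first number from strings such as '2.5 Kg' as a float."""
    return weights.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

sample["Date"] = parse_blast_date(sample["Date"])
sample["Explosive Weight (max)"] = parse_weight(sample["Explosive Weight (max)"])
print(sample.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected formats, at the cost of turning them into `NaT`, so the null count is worth re-checking after conversion.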

In [16]:
# Check for null values

df.isnull().sum()

S#                           0
Date                         0
Islamic Date               154
Blast Day Type              10
Holiday Type               424
Time                       211
City                         0
Latitude                     3
Longitude                    3
Province                     0
Location                     3
Location Category           35
Location Sensitivity        36
Open/Closed Space           35
Influencing Event/Event    305
Target Type                 26
Targeted Sect if any        48
Killed Min                 146
Killed Max                  16
Injured Min                131
Injured Max                 32
No. of Suicide Blasts       82
Explosive Weight (max)     324
Hospital Names             199
Temperature(C)               5
Temperature(F)               7
dtype: int64

**Observations:**

+ Column 'Holiday Type' has 424 null values out of 496 rows

+ Column 'Explosive Weight (max)' has 324 null values out of 496 rows
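
Raw null counts are easier to judge as a share of the rows. As a rough sketch, the hypothetical helper `null_share` below ranks columns by the fraction of missing values; the 50% cutoff and the toy frame are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

def null_share(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return the fraction of nulls per column, keeping columns above threshold."""
    share = frame.isnull().mean()  # mean of booleans = fraction of nulls
    return share[share > threshold].sort_values(ascending=False)

# Toy frame standing in for df
toy = pd.DataFrame({
    "City": ["Karachi", "Quetta", "Lahore", "Swat"],
    "Holiday Type": [None, None, None, "Weekend"],
    "Time": [None, "7:45 AM", None, "11:10:00 AM"],
})
print(null_share(toy))
```

Applied to the counts listed above, a 50% cutoff would also flag 'Influencing Event/Event' (305 of 496 rows).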

In [18]:
# Check for duplicate rows

df.duplicated().sum()

0

In [19]:
# Check for impossible values
df.describe()

| | S# | Latitude | Killed Min | Killed Max | Injured Min | No. of Suicide Blasts | Temperature(C) | Temperature(F) |
|---|---|---|---|---|---|---|---|---|
| count | 496.0 | 493.0 | 350.0 | 480.0 | 365.0 | 414.0 | 491.0 | 489.0 |
| mean | 248.5 | 32.614705 | 14.725714 | 15.20625 | 31.39726 | 1.115942 | 21.111599 | 69.972579 |
| std | 143.327132 | 2.475917 | 17.60093 | 20.270436 | 38.603842 | 0.394989 | 8.369068 | 15.069622 |
| min | 1.0 | 24.879503 | 0.0 | 0.0 | 0.0 | 1.0 | -2.37 | 27.734 |
| 25% | 124.75 | 31.8238 | 3.0 | 3.0 | 7.0 | 1.0 | 14.69 | 58.37 |
| 50% | 248.5 | 33.5833 | 8.0 | 8.0 | 20.0 | 1.0 | 21.405 | 70.529 |
| 75% | 372.25 | 34.0043 | 20.0 | 18.25 | 40.0 | 1.0 | 28.115 | 82.499 |
| max | 496.0 | 35.3833 | 125.0 | 148.0 | 320.0 | 4.0 | 44.0 | 111.0 |
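
`describe()` summarizes each column on its own, so some impossible values only show up when columns are compared row by row. The sketch below, on made-up rows, flags records where the minimum death toll exceeds the maximum or where the two temperature columns disagree. The 0.5 °F tolerance and the function name `find_inconsistent_rows` are assumptions.

```python
import pandas as pd

def find_inconsistent_rows(frame: pd.DataFrame, tol_f: float = 0.5) -> pd.DataFrame:
    """Return rows whose paired columns contradict each other."""
    # Comparisons involving NaN evaluate to False, so rows with
    # missing values are not flagged here
    bad_killed = frame["Killed Min"] > frame["Killed Max"]
    expected_f = frame["Temperature(C)"] * 9 / 5 + 32
    bad_temp = (frame["Temperature(F)"] - expected_f).abs() > tol_f
    return frame[bad_killed | bad_temp]

# Made-up rows: the second one has Killed Min greater than Killed Max
toy = pd.DataFrame({
    "Killed Min": [14.0, 20.0, None],
    "Killed Max": [15.0, 12.0, 3.0],
    "Temperature(C)": [15.835, 31.43, 44.0],
    "Temperature(F)": [60.503, 88.574, 111.0],
})
print(find_inconsistent_rows(toy))
```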


In [22]:
# Check for nondescript or repetitive values for categorical columns.


df['Holiday Type'].value_counts()



Weekend                                45
Ashura                                  4
Eid Milad un-Nabi                       3
Labour Day                              3
Eid-ul-Fitar                            3
Eid Holidays                            2
Ashura Holiday                          2
Iqbal Day                               2
Pakistan Day                            2
Eid-ul-azha                             1
General Elections                       1
Defence Day                             1
Eid ul Azha Holiday                     1
Christmas/birthday of Quaid-e-Azam      1
Christmas/ birthday of Quaid-e-Azam     1
Name: Holiday Type, dtype: int64

**Observations:**
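+ 'Ashura' and 'Ashura Holiday', 'Eid-ul-azha' and 'Eid ul Azha Holiday', and the two spellings of 'Christmas/birthday of Quaid-e-Azam' (with and without a space after the slash) appear to be duplicate labels for the same holidays
+ 'Defence Day' is the British spelling used in Pakistan, so it is not a misspelling of 'Defense Day'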
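
Variant labels in the counts above can be consolidated with an explicit mapping. The dictionary below is an assumption about which labels refer to the same holiday and should be checked against the source data before use; the name `HOLIDAY_MAP` is hypothetical.

```python
import pandas as pd

# Hypothetical mapping from variant label to canonical label
HOLIDAY_MAP = {
    "Ashura Holiday": "Ashura",
    "Eid-ul-azha": "Eid ul Azha",
    "Eid ul Azha Holiday": "Eid ul Azha",
    "Christmas/ birthday of Quaid-e-Azam": "Christmas/birthday of Quaid-e-Azam",
}

holidays = pd.Series(
    ["Ashura", "Ashura Holiday", "Eid-ul-azha", "Eid ul Azha Holiday", "Weekend", None],
    name="Holiday Type",
)

# replace() leaves labels that are not in the mapping (and nulls) untouched
cleaned = holidays.replace(HOLIDAY_MAP)
print(cleaned.value_counts())
```

An explicit mapping is more verbose than a pattern-based rule, but every merge is visible and easy to review.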

In [23]:
# Check for nondescript or repetitive values for categorical columns.
df['Blast Day Type'].value_counts()



Working Day    403
Holiday         78
Weekend          5
Name: Blast Day Type, dtype: int64

In [24]:
# Check for nondescript or repetitive values for categorical columns.

df['City'].value_counts()


Peshawar                      72
Quetta                        35
Swat                          25
Bannu                         22
Karachi                       21
Rawalpindi                    19
Islamabad                     17
Hangu                         17
Lahore                        14
Khyber Agency                 14
Bajaur Agency                 13
North Waziristan              13
North waziristan              11
Lahore                        11
Kohat                         11
Mardan                         9
D.I Khan                       8
Charsadda                      7
Tank                           7
Karachi                        6
South waziristan               6
Mohmand Agency                 6
Kuram Agency                   5
South Waziristan               5
Nowshehra                      5
D.I Khan                       4
Buner                          4
Lower Dir                      4
Swat                           4
Sargodha                       3
          

**Observations:**

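+ 'North Waziristan' also appears as 'North waziristan', and 'South Waziristan' as 'South waziristan'
+ 'Lahore', 'Karachi', 'Swat', and 'D.I Khan' are each listed twice with the same spelling, which suggests the entries differ by stray whitespace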
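
Inconsistent capitalization, and any stray whitespace that makes identical-looking names count separately, can be normalized in one pass. Whether whitespace is actually present in this column is an assumption, since the printed output cannot show it. A minimal sketch on invented values:

```python
import pandas as pd

cities = pd.Series(
    ["Lahore", "Lahore ", "North Waziristan", "North waziristan"],
    name="City",
)

# repr() makes leading/trailing whitespace visible
print([repr(c) for c in cities.unique()])

# Strip outer whitespace, collapse inner runs of whitespace, and use title case
normalized = (
    cities.str.strip()
          .str.replace(r"\s+", " ", regex=True)
          .str.title()
)
print(normalized.value_counts())
```

Title case may alter names with deliberate capitalization, so the result is worth reviewing with `value_counts()` again.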

In [25]:
# Check for nondescript or repetitive values for categorical columns.

df['Location'].value_counts()


Imambargah                                                                3
Islamabad Marriott Hotal                                                  2
Mingora police Station                                                    2
Security check post Miramshah \nNorth Wazirstan                           2
Orangi Town Faqeer Colony                                                 2
Security Checkpost near Miramshah                                         1
Inside a mosque in Shahi Bagh area near Tirah IDPs registration center    1
Security checkpost located in Tal area of Hangu                           1
Crime Investigation Department                                            1
checkpost near ISI Office building in Qasim Bala area -Multan Cantt       1
Boghra Road                                                               1
Sarband Area on the outskirts of city                                     1
Nowshera near\nProvincial Info Ministr house                              1
Qasim market

In [26]:
# Check for nondescript or repetitive values for categorical columns.

df['Location Category'].value_counts()


Police                        92
Mobile                        70
Military                      70
Religious                     57
Market                        40
Park/Ground                   32
Residence                     25
Government                    19
Hotel                         10
Office Building                9
Foreign                        6
Educational                    6
Hospital                       5
Transport                      5
Bank                           4
Commercial/residence           2
Airport                        1
Civilian                       1
Residential Building           1
                               1
Foreigner                      1
Highway                        1
Government/Office Building     1
Government Official            1
foreign                        1
Name: Location Category, dtype: int64

**Observations:**

+ 'Foreign', 'Foreigner', and 'foreign' appear as separate values for what looks like one category
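
Near-duplicates such as 'Foreign' and 'Foreigner' are easy to miss by eye in a longer list. One way to surface candidates is to compare labels pairwise with the standard library's `difflib`. The sketch below is only a screening aid: the function name `similar_labels` and the 0.85 threshold are assumptions, and every suggested pair still needs a manual decision.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_labels(labels, threshold=0.85):
    """Return pairs of distinct lower-cased labels whose similarity ratio meets threshold."""
    # Lower-casing and stripping first merges labels that differ only by case or padding
    unique = sorted({str(label).strip().lower() for label in labels})
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]

categories = ["Foreign", "Foreigner", "foreign", "Police", "Military"]
print(similar_labels(categories))
```

On the real column this could be called as `similar_labels(df['Location Category'].dropna().unique())`.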

In [None]:
# Check for nondescript or repetitive values for categorical columns.

df['Location Sensitivity'].value_counts()

### Clean

#### Define

#### Code

#### Test

### Conclusion