# "Cleaning Data with Pandas"
> "Using Pandas to clean 2019 animal rescue data from Cape Cod."

- toc: false
- badges: true
- comments: true
- author: Antonio Jurlina
- categories: [learning, python]

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import os

In [2]:
os.chdir('/Users/antoniojurlina/Projects/learning_python/data/')

data = pd.read_csv('CapeCodCases2019.csv')

data.head()

Unnamed: 0,patients.address_found,patients.city_found,patients.common_name,patients.county_found,patients.found_at,patients.disposition,patients.keywords,admissions.id,admissions.case_year,people.postal_code
0,Long Beach,Hyannis,Common Eider,Barnstable County,1/1/2019,Died +24hr,,1,2019,2601.0
1,3260 Main Street,Brewster,Common Eider,Barnstable County,1/2/2019,Euthanized +24hr,,2,2019,2671.0
2,Thumpertown Beach,Eastham,Razorbill,Barnstable County,1/2/2019,Dead on arrival,,3,2019,
3,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable County,1/3/2019,Released,,4,2019,2651.0
4,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable County,1/3/2019,Released,,5,2019,2651.0


By reviewing the unique values of the city, county, and address variables, the following strings were identified as different representations of a missing value.

In [None]:
missing_values = ["Umknown", "No Info Given", "Unknown.", 
                  "Unknown", "No Address Given", "No Address Given.", 
                  "None", "No Information Given.", "Nan", "nan", ""]
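
One way to carry out that kind of review is sketched below. It is a minimal illustration on a made-up frame (the `toy` data and its values are assumptions, not rows from the real file): title-case a column, then count every distinct value, keeping missing entries visible.

```python
import pandas as pd

# Hypothetical toy frame standing in for the rescue data (not the real file).
toy = pd.DataFrame({
    "patients.city_found": ["Eastham", "Unknown", "eastham",
                            None, "Unknown.", "Eastham"],
})

# Title-case the text so that case variants collapse into one value, then
# count each distinct value. dropna=False keeps true missing entries in
# the tally instead of silently hiding them.
counts = (
    toy["patients.city_found"]
    .str.title()
    .value_counts(dropna=False)
)
print(counts)
```

Scanning the low-frequency entries of such a tally is usually enough to spot placeholder strings like "Unknown." that should be treated as missing.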

From this point, we proceed to clean the data, starting with renaming the variables.

In [56]:
data = data.rename(columns={'patients.address_found': 'address', 
                            'patients.city_found': 'city',
                            'patients.common_name': 'common_name',
                            'patients.county_found': 'county', 
                            'patients.found_at': 'date', 
                            'patients.disposition': 'disposition', 
                            'patients.keywords': "keywords", 
                            'admissions.id': "admission_id", 
                            'admissions.case_year': "case_year", 
                            'people.postal_code': 'zip_code'})

print(data.shape)
data.head()

(1871, 10)


Unnamed: 0,address,city,common_name,county,date,disposition,keywords,admission_id,case_year,zip_code
0,Long Beach,Hyannis,Common Eider,Barnstable County,1/1/2019,Died +24hr,,1,2019,2601.0
1,3260 Main Street,Brewster,Common Eider,Barnstable County,1/2/2019,Euthanized +24hr,,2,2019,2671.0
2,Thumpertown Beach,Eastham,Razorbill,Barnstable County,1/2/2019,Dead on arrival,,3,2019,
3,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable County,1/3/2019,Released,,4,2019,2651.0
4,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable County,1/3/2019,Released,,5,2019,2651.0


Next, most variables are converted to strings, and several are title-cased so that the first letter of each word is capitalized. The date on which a rescued animal was found is parsed into a datetime.

In [57]:
data['address'] = data['address'].astype(str).str.title()
data['city'] = data['city'].astype(str).str.title()
data['common_name'] = data['common_name'].astype(str).str.title()
data['county'] = data['county'].astype(str).str.title()
data['date'] = pd.to_datetime(data['date'].astype(str), format='%m/%d/%Y')
data['disposition'] = data['disposition'].astype(str)
data['keywords'] = data['keywords'].astype(str)
data['zip_code'] = data['zip_code'].astype(str)
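
A side effect worth keeping in mind: in pandas versions where `astype(str)` turns a missing value into the literal string `'nan'`, title-casing then yields `'Nan'`, which is presumably why that spelling appears in `missing_values`. Newer pandas releases may preserve missing values through `astype(str)` instead. The sketch below uses a made-up series and a shortened sentinel list (both assumptions) and is written to give the same result either way.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real address, one true NaN, one placeholder.
s = pd.Series(["long beach", np.nan, "UNKNOWN"])

# Same conversion as in the cell above: cast to string, then title-case.
titled = s.astype(str).str.title()

# Shortened stand-in for the notebook's missing_values list.
sentinels = ["Nan", "Unknown"]

# mask() blanks out every entry that matches a sentinel, so both the
# stringified NaN (where it occurs) and the placeholder become missing.
cleaned = titled.mask(titled.isin(sentinels))
print(cleaned)
```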

Further cleaning replaces every string previously identified as 'missing' with NaN and tidies up some of the text.

In [58]:
missing = data['address'].isin(missing_values)
data.loc[missing, 'address'] = np.nan

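# Flag missing counties before stripping the "County" suffix.
missing = data['county'].isin(missing_values)
data['county'] = data['county'].str.replace("County", "").str.strip()
data.loc[missing, 'county'] = np.nan

# Where a zip code (2563) was entered in the county field, copy it to zip_code.
data.loc[data['county']=='2563', 'zip_code'] = '2563'
# Note: this assignment leaves county unchanged, so '2563' stays in that column.
data.loc[data['county']=='2563', 'county'] = '2563'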

missing = data['city'].isin(missing_values)
data.loc[missing, 'city'] = np.nan

data.loc[data['city']=='Orleans, Ma', 'city'] = 'Orleans'
data.loc[data['city']=='Eastham, Ma', 'city'] = 'Eastham'
data.loc[data['city']=='Eastham,', 'city'] = 'Eastham'
data.loc[data['city']=='Eastham.', 'city'] = 'Eastham'
data.loc[data['city']=='N. Eastham', 'city'] = 'North Eastham'
data.loc[data['city']=='E. Orleans', 'city'] = 'East Orleans'
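
The city fixes above can also be written as a single lookup table. The sketch below is one possible alternative, not the notebook's own method: it applies `Series.replace` with a dictionary to a made-up series (`cities` and `city_fixes` are illustrative names).

```python
import pandas as pd

# Hypothetical sample of city values; the spellings mirror those fixed above.
cities = pd.Series(["Orleans, Ma", "Eastham,", "N. Eastham", "Brewster"])

# Mapping from each variant spelling to its canonical name.
city_fixes = {
    "Orleans, Ma": "Orleans",
    "Eastham, Ma": "Eastham",
    "Eastham,": "Eastham",
    "Eastham.": "Eastham",
    "N. Eastham": "North Eastham",
    "E. Orleans": "East Orleans",
}

# With a dict, Series.replace swaps exact matches only and leaves every
# other value (here "Brewster") untouched.
cleaned_cities = cities.replace(city_fixes)
print(cleaned_cities.tolist())
```

Keeping the corrections in one dictionary makes it easier to add a new variant later without writing another `.loc` assignment.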

Finally, any row in which the address, county *and* city are all missing is dropped, since such rows cannot support any analysis of interest.

In [59]:
# Flag rows where address, city and county are all missing.
all_missing = (data['address'].isnull() &
               data['city'].isnull() &
               data['county'].isnull())

# Keep every other row and renumber the index from zero.
data = data[~all_missing].reset_index(drop=True)

print(data.shape)
data.head()

(1799, 10)


Unnamed: 0,address,city,common_name,county,date,disposition,keywords,admission_id,case_year,zip_code
0,Long Beach,Hyannis,Common Eider,Barnstable,2019-01-01,Died +24hr,,1,2019,2601.0
1,3260 Main Street,Brewster,Common Eider,Barnstable,2019-01-02,Euthanized +24hr,,2,2019,2671.0
2,Thumpertown Beach,Eastham,Razorbill,Barnstable,2019-01-02,Dead on arrival,,3,2019,
3,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable,2019-01-03,Released,,4,2019,2651.0
4,5575 State Highway,Eastham,Southern Flying Squirrel,Barnstable,2019-01-03,Released,,5,2019,2651.0
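
The same row filter can also be expressed with `DataFrame.dropna`, which may be useful as a cross-check. The sketch below runs on a made-up three-row frame (the `toy` data is an assumption, not part of the real file) and drops only the row whose three location columns are all missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has no location at all, row 2 has a city.
toy = pd.DataFrame({
    "address": ["Long Beach", np.nan, np.nan],
    "city": ["Hyannis", np.nan, "Eastham"],
    "county": ["Barnstable", np.nan, np.nan],
    "common_name": ["Common Eider", "Razorbill", "Razorbill"],
})

location_cols = ["address", "city", "county"]

# how="all" drops a row only when every column listed in subset is missing;
# rows with at least one known location field are kept.
kept = toy.dropna(subset=location_cols, how="all").reset_index(drop=True)
print(kept.shape)  # (2, 4)
```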


For the data and other notebooks, see [github.com/antoniojurlina/learning_python](https://github.com/antoniojurlina/learning_python).