## Shark Attacks


In [1]:
import pandas as pd
import re
data = pd.read_csv("input/GSAF5.csv", encoding = "ISO-8859-15")

In [2]:
data.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,


### 1st clean: Drop invalid columns
- Columns 'Unnamed: 22' & 'Unnamed: 23' are not referenced in the description of the dataset and do not contain any relevant information. Proceed to drop them.
- Columns 'Case Number.1' & 'Case Number.2' largely duplicate 'Case Number' (the preview above shows a single mismatch, in row 4). Proceed to drop them.
- "Date" cannot be normalized. We drop it and later derive the dates from the 'Case Number' column in a structured way.

In [3]:
data = data.drop(['Unnamed: 22','Unnamed: 23', 'Case Number.1', 'Case Number.2', 'Date'], axis=1)
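
Before dropping the 'Case Number.1' and 'Case Number.2' copies, it may be worth checking how closely they match 'Case Number'. The following is a minimal sketch on a toy frame whose three rows are copied from the preview above; the names `sample` and `mismatch` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame: three rows copied from the data.head() preview.
sample = pd.DataFrame({
    'Case Number':   ['2016.09.18.c', '2016.09.17', '2016.09.15'],
    'Case Number.1': ['2016.09.18.c', '2016.09.17', '2016.09.16'],
    'Case Number.2': ['2016.09.18.c', '2016.09.17', '2016.09.15'],
})

# Flag rows where either copy differs from the main column.
mismatch = (
    (sample['Case Number'] != sample['Case Number.1'])
    | (sample['Case Number'] != sample['Case Number.2'])
)

# Show only the rows that disagree.
print(sample[mismatch])
```

Running the same comparison on the full `data` frame (before the drop) would show how many rows disagree and whether any of them deserve a manual look.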

### 2nd clean: Renaming columns
- Some column names aren't clean or clear enough. The renamed columns are listed below:
    - Sex: remove a blank space at the end.
    - Country: changed to Place, since several entries refer to seas or regions broader than a country.

In [4]:
data.rename(columns={
    'Sex ':'Sex', 
    'Country': 'Place'
    }, inplace=True)
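
As an alternative to listing each padded name by hand, the whitespace can be stripped from every column name at once. This is only a sketch on a toy frame (`sample` is a made-up name), not what the notebook does:

```python
import pandas as pd

# Toy frame; 'Sex ' carries a trailing space like the original column.
sample = pd.DataFrame({'Sex ': ['M', 'F'], 'Country': ['USA', 'FIJI']})

# Strip surrounding whitespace from every column name in one step.
sample.columns = sample.columns.str.strip()

# Then apply the semantic rename used in this notebook.
sample = sample.rename(columns={'Country': 'Place'})
print(list(sample.columns))
```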

### Column types

In [5]:
data.dtypes

Case Number               object
Year                       int64
Type                      object
Place                     object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
original order             int64
dtype: object

### 3rd clean: Drop events dated 1700 or earlier

Of the 5900 events registered, only 137 happened before 1700. To keep only statistically relevant data, events dated 1700 or earlier are excluded (the filter below uses a strict `Year > 1700`).

In [6]:
data = data[data['Year'] > 1700]
print(data.shape)

(5852, 19)
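
The boundary of the filter is easy to get wrong, so here is a small sketch of how the strict comparison behaves. The years in `sample` are invented for illustration; in particular, the `0` stands in for a hypothetical "unknown year" code and is an assumption, not something confirmed by the dataset description.

```python
import pandas as pd

# Invented years; 0 is a hypothetical placeholder for an unknown year.
sample = pd.DataFrame({'Year': [0, 1580, 1700, 1701, 2016]})

# Strict comparison: the year 1700 itself is excluded.
keep = sample['Year'] > 1700
filtered = sample[keep]

# Number of rows kept and number of rows dropped.
print(int(keep.sum()), int((~keep).sum()))
```

Counting `(~keep).sum()` on the real frame before filtering is a quick way to confirm how many events the step removes.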


### 4th clean: Unifying categories
- Sex: Typo found in 2 entries ('M ' with a trailing space). Fixed.
- Place (formerly Country): We have reduced the list of countries from the original set of 197 categories to 174. For that purpose we have used both regular expressions and manual replacement.

In [7]:
data.replace({'Sex': {'M ': 'M'}}, inplace=True)

In [8]:
type(data['Sex'])

pandas.core.series.Series

In [9]:
data['Sex'].value_counts()

M      4723
F       572
.         1
lli       1
N         1
Name: Sex, dtype: int64
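
The counts above still contain three stray values ('.', 'lli' and 'N'). One possible way to handle them, sketched here on a toy series, is to keep only 'M' and 'F' and mark everything else as missing. Treating 'N' as unknown rather than as a typo for 'M' is an assumption made for this sketch.

```python
import pandas as pd

# Toy series with the values seen in the counts above, plus a missing entry.
sex = pd.Series(['M', 'F', 'M ', '.', 'lli', 'N', None])

# Strip whitespace, then blank out anything that is not 'M' or 'F'.
cleaned = sex.str.strip()
cleaned = cleaned.where(cleaned.isin(['M', 'F']))

print(cleaned.value_counts(dropna=False))
```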

In [10]:
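# Regex clean-up. Note: this runs on every string column, not only 'Place'.
#   - remove all question marks
#   - remove a 2nd country after " / "
#   - remove one trailing and one leading whitespace character
data.replace(regex={
    r'\?': '',
    r'\s\/\s[A-Z\s]+': '',
    r'\s$': '',
    r'^\s': ''
}, inplace=True)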
data.replace({
    'Place': {
    'UNITED ARAB EMIRATES (UAE)':'UNITED ARAB EMIRATES', 
    'Fiji':'FIJI', 'ST. MAARTIN':'ST. MARTIN', 
    'Seychelles':'SEYCHELLES', 
    'Sierra Leone':'SIERRA LEONE', 
    'St Helena': 'ST HELENA', 
    'ENGLAND': 'UNITED KINGDOM', 
    'SCOTLAND': 'UNITED KINGDOM' 
    }
}, inplace=True)

In [11]:
len(set(data['Place']))

174
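
Because the regex replacement above touches every string column, a narrower variant that only cleans 'Place' may be preferable. The sketch below uses invented sample values and adds an upper-casing step that the notebook does not perform (it maps 'Fiji', 'Seychelles' and similar by hand), so the resulting category count on the real data could differ from 174.

```python
import pandas as pd

# Invented examples of the kinds of values being cleaned.
place = pd.Series(['EGYPT / ISRAEL', 'SUDAN?', ' PHILIPPINES', 'Fiji', None])

place = (
    place
    .str.replace(r'\?', '', regex=True)             # drop question marks
    .str.replace(r'\s/\s[A-Z\s]+', '', regex=True)  # drop a 2nd country after " / "
    .str.strip()                                    # trim surrounding whitespace
    .str.upper()                                    # unify letter case (not in the notebook)
)

print(place.nunique())
```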

In [14]:
null_cols = data.isnull().sum()
print(null_cols[null_cols > 0])
print(data.shape)


Place                       36
Area                       370
Location                   453
Activity                   502
Name                       193
Sex                        554
Age                       2554
Injury                      21
Fatal (Y/N)                 19
Time                      3080
Species                   2830
Investigator or Source      15
href formula                 1
href                         3
dtype: int64
(5852, 19)


In [15]:
# Regular Expressions: Clean Country, Extract month

In [16]:
data['Date'] = data['Case Number']
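
The copied 'Case Number' values still need to be parsed. A minimal sketch of one way to do it with the `re` module, assuming the `YYYY.MM.DD` prefix seen in the preview (optionally followed by a suffix such as `.c`); `CASE_RE` and `parse_case_number` are hypothetical names, and `'ND.0001'` is an invented example of a value that does not match:

```python
import re

# Year, month and day at the start of the case number, e.g. "2016.09.18.c".
CASE_RE = re.compile(r'^(\d{4})\.(\d{2})\.(\d{2})')

def parse_case_number(case_number):
    """Return (year, month, day) as ints, or None if the pattern does not match."""
    match = CASE_RE.match(str(case_number))
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year, month, day

print(parse_case_number('2016.09.18.c'))
print(parse_case_number('ND.0001'))
```

On the frame this could be applied with `data['Case Number'].map(parse_case_number)`. The sketch does not validate month or day ranges, so implausible values would pass through unchanged.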

In [13]:
activities = data['Activity'].value_counts()
activities

Surfing                                             906
Swimming                                            849
Fishing                                             419
Spearfishing                                        324
Bathing                                             150
                                                   ... 
Skindiving, fish at belt                              1
Fell overboard from SS Ripley Castle                  1
Wading, knocked down & swept away by large waves      1
Spearfishing, but swimming at surface                 1
Swimming a quarter mile offshore                      1
Name: Activity, Length: 1418, dtype: int64
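
With 1418 distinct activity strings, most of them appearing once, some grouping is probably needed before any analysis. The following is a rough sketch of keyword-based grouping; the keyword list, the 'Other' bucket and the name `broad_activity` are all assumptions made for illustration.

```python
# Keywords checked in order; 'spearfishing' must come before 'fishing'.
KEYWORDS = ['spearfishing', 'surfing', 'swimming', 'fishing', 'bathing', 'wading']

def broad_activity(activity):
    """Map a free-text activity to a broad category, or None if it is missing."""
    if not isinstance(activity, str):
        return None
    lowered = activity.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            return keyword.capitalize()
    return 'Other'

# Examples taken from the value counts above.
for text in ['Spearfishing, but swimming at surface',
             'Swimming a quarter mile offshore',
             'Fell overboard from SS Ripley Castle']:
    print(text, '->', broad_activity(text))
```

It could be applied with `data['Activity'].map(broad_activity)`. Because the first matching keyword wins, entries that mention two activities are assigned to whichever appears earlier in `KEYWORDS`.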