## Shark Attacks


In [1]:
import pandas as pd
import re
data = pd.read_csv("input/GSAF5.csv", encoding = "ISO-8859-15")

In [28]:
data.head()

Unnamed: 0,Case Number,Year,Type,Place,Area,Location,Name,Sex,Age,Injury,...,Species,Investigator or Source,pdf,href formula,href,original order,Activity,Date,Month,Hour
0,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",male,M,16.0,Minor injury to thigh,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5993,Surfing,2016.09.18,9,13:00
1,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Chucky Luciano,M,36.0,Lacerations to hands,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5992,Surfing,2016.09.18,9,11:00
2,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",male,M,43.0,Lacerations to lower leg,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5991,Surfing,2016.09.18,9,10:43
3,2016.09.17,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Rory Angiolella,M,,Struck by fin on chest & leg,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5990,Surfing,2016.09.17,9,Unknown
4,2016.09.15,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,male,M,,No injury: Knocked off board by shark,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5989,Surfing,2016.09.15,9,Unknown


### 1st clean: Drop invalid columns
- Columns 'Unnamed: 22' & 'Unnamed: 23' are not referenced in the description of the dataset and do not contain any relevant information. Proceed to drop them.
- Columns 'Case Number.1' & 'Case Number.2' are duplicates of 'Case Number'. Proceed to drop them.
- "Date" cannot be normalized. We drop it and later rebuild the dates from the 'Case Number' column in a structured way.

In [3]:
data = data.drop(['Unnamed: 22','Unnamed: 23', 'Case Number.1', 'Case Number.2', 'Date'], axis=1)
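
As a small illustration of the drop step, the sketch below uses a toy frame with invented values. The names `toy` and `to_drop` are assumptions made for the example. Passing `errors="ignore"` lets the cell be re-run without a `KeyError` once the columns are already gone.

```python
import pandas as pd

# Toy frame standing in for the raw CSV; the values are made up.
toy = pd.DataFrame({
    "Case Number": ["2016.09.18.c", "2016.09.17"],
    "Case Number.1": ["2016.09.18.c", "2016.09.17"],
    "Unnamed: 22": [None, None],
})

to_drop = ["Unnamed: 22", "Unnamed: 23", "Case Number.1"]

# errors="ignore" skips labels that are not present ("Unnamed: 23" here).
toy = toy.drop(columns=to_drop, errors="ignore")
print(list(toy.columns))
```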

### 2nd clean: Renaming columns
- Some column names aren't clean or clear enough. The renamed columns are listed below:
    - Sex: remove a trailing blank space.
    - Country: changed to Place, since several entries refer to seas or regions broader than a country.

In [4]:
data.rename(columns={
    'Sex ':'Sex', 
    'Country': 'Place'
    }, inplace=True)
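
Stray spaces in column names can also be removed in bulk rather than one by one. The following is a minimal sketch on a toy frame. The column `" Species"` with a leading space is hypothetical and only there to show the effect.

```python
import pandas as pd

# Toy frame; the column names imitate the kind of noise described above.
toy = pd.DataFrame({"Sex ": ["M"], "Country": ["USA"], " Species": ["2 m shark"]})

# Strip leading/trailing whitespace from every column name at once.
toy.columns = toy.columns.str.strip()

# Then apply the semantic rename.
toy = toy.rename(columns={"Country": "Place"})
print(list(toy.columns))
```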

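### 3rd clean: Drop the oldest events

Among the roughly 5900 events registered, only 137 happened before 1700. To evaluate only statistically relevant data, these events are not considered. Note that the filter keeps years strictly greater than 1700, so events dated exactly 1700 are dropped as well.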

In [5]:
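# .copy() makes the filtered frame independent of the original,
# so later column assignments do not trigger SettingWithCopyWarning.
data = data[data['Year'] > 1700].copy()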
print(data.shape)

(5852, 19)
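
Before discarding rows it can help to count how many fall on each side of the cutoff. The sketch below does this on a toy `Year` column. The values and the names `cutoff`, `keep` and `filtered` are assumptions for illustration.

```python
import pandas as pd

# Toy years, including a 0 standing in for an unknown year.
toy = pd.DataFrame({"Year": [2016, 1742, 1700, 1543, 0]})

cutoff = 1700
keep = toy["Year"] > cutoff      # strictly greater: the year 1700 itself is excluded

print(keep.value_counts())       # rows kept (True) vs. dropped (False)
filtered = toy[keep].copy()
```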


### 4th clean: Unifying categories
- Sex: a typo found in 2 entries is fixed.
- Place (formerly Country): the list is reduced from the original 197 categories to 174, using both regular expressions and manual replacement.

In [6]:
data.replace({'Sex': {'M ': 'M'}}, inplace=True)

In [7]:
type(data['Sex'])

pandas.core.series.Series

In [8]:
data['Sex'].value_counts()

M      4723
F       572
N         1
lli       1
.         1
Name: Sex, dtype: int64

In [9]:
# Remove question marks
# Remove a second country after " / "
# Remove one leading or trailing whitespace character
data.replace(regex={
    r'\?':'', 
    r'\s\/\s[A-Z\s]+': '', 
    r'\s$':'', r'^\s':''
}, inplace=True)
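
To see what each pattern does, the same expressions can be tried on a few strings with the standard `re` module. This is only an approximation of the pandas call: the patterns are applied one after another here, and the sample strings are illustrative rather than taken from the file.

```python
import re

# Same patterns as the cell above, applied sequentially with re.sub.
PATTERNS = [
    (r'\?', ''),               # question marks
    (r'\s/\s[A-Z\s]+', ''),    # " / SECOND COUNTRY"
    (r'\s$', ''),              # one trailing whitespace character
    (r'^\s', ''),              # one leading whitespace character
]

def clean(value):
    for pattern, repl in PATTERNS:
        value = re.sub(pattern, repl, value)
    return value

samples = ["SUDAN?", "EGYPT / ISRAEL", " PHILIPPINES", "MEXICO "]
cleaned = [clean(s) for s in samples]
print(cleaned)
```

Since `\s$` and `^\s` match a single character, a value padded with two or more spaces would keep the extra ones. `\s+$` and `^\s+` would remove them all.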



In [10]:
# On column Place, manually fixed some duplicates
data.replace({'Place': { 'UNITED ARAB EMIRATES (UAE)':'UNITED ARAB EMIRATES', 
'Fiji':'FIJI', 'ST. MAARTIN':'ST. MARTIN', 
'Seychelles':'SEYCHELLES', 
'Sierra Leone':'SIERRA LEONE', 
'St Helena': 'ST HELENA', 
'ENGLAND': 'UNITED KINGDOM', 
'SCOTLAND': 'UNITED KINGDOM'}
}, inplace=True)

In [11]:
len(set(data['Place']))

174
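
Several of the manual fixes above only differ in letter case, so part of the work could be done with `str.upper()`. This is a sketch on a toy series, not the approach used in this notebook. It also shows `nunique()`, which ignores missing values by default.

```python
import pandas as pd

# Toy values imitating the mixed-case duplicates fixed manually above.
place = pd.Series(["FIJI", "Fiji", "Seychelles", "SEYCHELLES", None])

# Upper-casing merges entries that only differ in letter case.
normalized = place.str.upper()
print(normalized.nunique())
```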

Reduce the original 1418 unique entries in Activity to 7 categories: 'Surfing', 'Fishing', 'Spearing', 'Bathing', 'Swimming', 'Diving' & 'Others'.

In [12]:
data.rename(columns={'Activity':'unActivity'}, inplace=True)
data_activity = data['unActivity']
activity = []
for e in data_activity:
    # Match both the capitalized and the lowercase spelling of each keyword.
    if re.search(r'[Ss]urf[\w\s,]+', str(e)):
        e = 'Surfing'
    # 'Spearfishing' also contains 'fish', so test 'spear' before 'fish'.
    elif re.search(r'[Ss]pear[\w\s,]+', str(e)):
        e = 'Spearing'
    elif re.search(r'[Ff]ish[\w\s,]+', str(e)):
        e = 'Fishing'
    elif re.search(r'[Bb]ath[\w\s,]+', str(e)):
        e = 'Bathing'
    elif re.search(r'[Ss]wim[\w\s,]+', str(e)):
        e = 'Swimming'
    elif re.search(r'[Dd]iv[\w\s,]+', str(e)):
        e = 'Diving'
    else:
        e = 'Others'
    activity.append(e)
data['Activity'] = activity
data = data.drop(['unActivity'], axis=1)
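
A table-driven variant of this classification can be sketched as follows. The names `RULES` and `classify` are assumptions, the patterns are simplified to bare keywords matched case-insensitively, and the sample strings are invented.

```python
import re

# Ordered (pattern, label) pairs; the first match wins.
RULES = [
    (r'surf', 'Surfing'),
    (r'spear', 'Spearing'),   # before 'fish': "Spearfishing" contains both
    (r'fish', 'Fishing'),
    (r'bath', 'Bathing'),
    (r'swim', 'Swimming'),
    (r'div', 'Diving'),
]

def classify(activity):
    """Map a free-text activity to one of the broad categories."""
    text = str(activity)      # NaN becomes the string 'nan' and falls through
    for pattern, label in RULES:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return 'Others'

samples = ["Body surfing", "Spearfishing", "Fishing for mackerel",
           "Scuba diving", "Standing", float("nan")]
print([classify(s) for s in samples])
```

Keeping the rules in a list makes the precedence explicit and easy to reorder. A bare keyword such as `div` can also match inside unrelated words, so the rules may need tightening on real data.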

### 5th clean: Remove the trailing letters and dots from some entries in the column 'Case Number'

In [13]:
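# Strip a trailing letter together with its preceding dot, if any (e.g. '.c').
# Assign the result back instead of using inplace on a column selection.
data['Case Number'] = data['Case Number'].replace(regex={r'\.?[A-Za-z]$': ''})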

In [14]:
data['Date'] = data['Case Number']

### 6th clean: Create a new column for the month, extracted from the 'Case Number' column

In [15]:
data['Month'] = [e[5:7] for e in data['Case Number']]
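
Slicing by position assumes every 'Case Number' follows the `YYYY.MM.DD` layout. A more defensive sketch matches the layout explicitly and returns `None` otherwise. The names `CASE_RE` and `split_case_number` are assumptions, and `"ND.0001"` is a hypothetical malformed value.

```python
import re

# Four-digit year, two-digit month, two-digit day, separated by dots.
CASE_RE = re.compile(r'^(\d{4})\.(\d{2})\.(\d{2})')

def split_case_number(case_number):
    """Return (year, month, day) as strings, or None if the prefix does not match."""
    match = CASE_RE.match(str(case_number))
    return match.groups() if match else None

print(split_case_number("2016.09.18"))
print(split_case_number("1733.00.00"))   # '00' marks an unknown month/day
print(split_case_number("ND.0001"))      # hypothetical value without a date
```

With plain slicing, `"ND.0001"[5:7]` would yield `"01"`, which looks like a valid month even though the value carries no date.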

In [16]:
data.head()

Unnamed: 0,Case Number,Year,Type,Place,Area,Location,Name,Sex,Age,Injury,...,Time,Species,Investigator or Source,pdf,href formula,href,original order,Activity,Date,Month
0,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",male,M,16.0,Minor injury to thigh,...,13h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5993,Surfing,2016.09.18,9
1,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Chucky Luciano,M,36.0,Lacerations to hands,...,11h00,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5992,Surfing,2016.09.18,9
2,2016.09.18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",male,M,43.0,Lacerations to lower leg,...,10h43,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5991,Surfing,2016.09.18,9
3,2016.09.17,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Rory Angiolella,M,,Struck by fin on chest & leg,...,,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5990,Surfing,2016.09.17,9
4,2016.09.15,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,male,M,,No injury: Knocked off board by shark,...,,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5989,Surfing,2016.09.15,9


### 7th clean: Clean the hour, keeping only the values that correspond to a 24h time

In [17]:
data['Time'] = data['Time'].replace(regex = {r'\s[\w\-\d\/\()]+|\-[\w\-\d\/]+|j$|^\>|^\<':'', r'h':':'})
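
A strict check for a 24h value can be sketched as below: hours from 00 to 23, minutes from 00 to 59, and nothing else in the string. The names `TIME_RE` and `to_hour` are assumptions, and the sample values are invented.

```python
import re

# Hours 00-23, a colon, minutes 00-59, anchored at both ends.
TIME_RE = re.compile(r'^([01]\d|2[0-3]):[0-5]\d$')

def to_hour(value):
    """Return the value if it is a valid HH:MM time, else 'Unknown'."""
    text = str(value).strip()
    return text if TIME_RE.match(text) else 'Unknown'

samples = ["13:00", "25:99", "06:1", "Afternoon", float("nan")]
print([to_hour(s) for s in samples])
```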

In [18]:
hour = []
time = data['Time']
for e in time:
    # Keep values containing an HH:MM pattern; everything else becomes 'Unknown'.
    if re.search(r'\d{2}:\d{2}', str(e)) is None:
        e = 'Unknown'
    hour.append(e)
data['Hour'] = hour


In [19]:
null_cols = data.isnull().sum()
print(null_cols[null_cols > 0])
print(data.shape)

Place                       36
Area                       370
Location                   453
Name                       193
Sex                        554
Age                       2554
Injury                      21
Fatal (Y/N)                 19
Time                      3080
Species                   2830
Investigator or Source      15
href formula                 1
href                         3
dtype: int64
(5852, 22)
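
Raw null counts are easier to compare as a share of the rows. The sketch below computes that percentage on a toy frame. The values and the name `null_share` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with a few missing values.
toy = pd.DataFrame({
    "Age": [16.0, None, None, 43.0],
    "Sex": ["M", "M", None, "F"],
    "Year": [2016, 2016, 2015, 2015],
})

# isnull().mean() gives the fraction of missing values per column.
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share[null_share > 0].sort_values(ascending=False))
```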


### 8th clean: Change column types
- Recode the column Fatal (Y/N) as a 0/1 flag and rename it to Fatal (T/F), so that upper- and lowercase Y/N answers map to a single pair of values.

In [20]:
data.dtypes

Case Number               object
Year                       int64
Type                      object
Place                     object
Area                      object
Location                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
original order             int64
Activity                  object
Date                      object
Month                     object
Hour                      object
dtype: object

In [21]:
data = data.replace({'Fatal (Y/N)':
             { 'N' : '0', 
               'Y' : '1',
               'n' : '0',
               'y' : '1',
             }})
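# .astype(bool) is deliberately not applied: its result was never assigned, and it
# would turn every non-empty string, including '0', into True.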
data.rename(columns={ 'Fatal (Y/N)' : 'Fatal (T/F)'}, inplace=True)
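
If real booleans are wanted instead of the strings '0' and '1', one option is to map the answers to `True`/`False` and use the nullable `"boolean"` dtype, which keeps missing answers as `<NA>`. This is a sketch on a toy series: the name `fatal_bool` and the value `"UNKNOWN"` are hypothetical. In plain Python `bool('0')` is `True`, which is why a direct cast of the strings would not work.

```python
import pandas as pd

# Toy answers, including a missing value and a hypothetical unmapped label.
fatal = pd.Series(["N", "Y", "n", "y", None, "UNKNOWN"])

mapping = {"N": False, "Y": True, "n": False, "y": True}

# Values outside the mapping become missing; "boolean" is the nullable dtype.
fatal_bool = fatal.map(mapping).astype("boolean")
print(fatal_bool)
```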

In [39]:
final_table = data[['Date', 'Year', 'Month', 'Hour', 'Place', 'Area', 'Activity', 'Sex', 'Fatal (T/F)']]
final_table.to_csv("./cleaned_data_GSAF5.csv")
display(final_table)

Unnamed: 0,Date,Year,Month,Hour,Place,Area,Activity,Sex,Fatal (T/F)
0,2016.09.18,2016,09,13:00,USA,Florida,Surfing,M,0
1,2016.09.18,2016,09,11:00,USA,Florida,Surfing,M,0
2,2016.09.18,2016,09,10:43,USA,Florida,Surfing,M,0
3,2016.09.17,2016,09,Unknown,AUSTRALIA,Victoria,Surfing,M,0
4,2016.09.15,2016,09,Unknown,AUSTRALIA,Victoria,Surfing,M,0
...,...,...,...,...,...,...,...,...,...
5847,1742.12.17,1742,12,Unknown,,,Swimming,M,1
5848,1738.04.06,1738,04,Unknown,ITALY,Sicily,Swimming,M,1
5849,1733.00.00,1733,00,Unknown,ICELAND,Bardestrand,Others,,
5850,1721.06.00,1721,06,Unknown,ITALY,Sardinia,Swimming,M,1
