### Shark Attacks

#### Possible questions:

a) Among unprovoked incidents, which activity is associated with the most attacks?
> What about provoked incidents?

b) Which country has had the largest number of incidents?
> Does the ranking hold when filtered to fatal incidents?

c) Which shark species is most associated with attacks?

d) Do attacks happen more frequently at a specific time of day?

#### Coding logic:

1) Save a backup of the DataFrame

2) Perform general table cleaning: (a) remove duplicates, (b) remove entirely null rows, (c) standardize column names (lowercase snake case)

3) Select the columns related to a specific question

4) Identify possible inconsistencies

5) Clean the data, saving the result to a new DataFrame variable

In [5]:
import os
os.listdir()

['shark_analysis.ipynb',
 'attacks.csv',
 'README.md',
 '.gitattributes',
 '.ipynb_checkpoints',
 '.git']

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.__version__

'1.14.3'

In [4]:
pd.__version__

'0.23.0'

In [8]:
# Retrieving database
data = pd.read_csv('attacks.csv', encoding='latin-1')

In [55]:
# Understanding database (1)
data.head()

| | Case Number | Date | Year | Type | Country | Area | Location | Activity | Name | Sex | ... | Species | Investigator or Source | pdf | href formula | href | Case Number.1 | Case Number.2 | original order | Unnamed: 22 | Unnamed: 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018.06.25 | 25-Jun-2018 | 2018.0 | Boating | USA | California | Oceanside, San Diego County | Paddling | Julie Wolfe | F | ... | White shark | R. Collier, GSAF | 2018.06.25-Wolfe.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2018.06.25 | 2018.06.25 | 6303.0 | | |
| 1 | 2018.06.18 | 18-Jun-2018 | 2018.0 | Unprovoked | USA | Georgia | St. Simon Island, Glynn County | Standing | Adyson McNeely | F | ... | | K.McMurray, TrackingSharks.com | 2018.06.18-McNeely.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2018.06.18 | 2018.06.18 | 6302.0 | | |
| 2 | 2018.06.09 | 09-Jun-2018 | 2018.0 | Invalid | USA | Hawaii | Habush, Oahu | Surfing | John Denges | M | ... | | K.McMurray, TrackingSharks.com | 2018.06.09-Denges.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2018.06.09 | 2018.06.09 | 6301.0 | | |
| 3 | 2018.06.08 | 08-Jun-2018 | 2018.0 | Unprovoked | AUSTRALIA | New South Wales | Arrawarra Headland | Surfing | male | M | ... | 2 m shark | B. Myatt, GSAF | 2018.06.08-Arrawarra.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2018.06.08 | 2018.06.08 | 6300.0 | | |
| 4 | 2018.06.04 | 04-Jun-2018 | 2018.0 | Provoked | MEXICO | Colima | La Ticla | Free diving | Gustavo Ramos | M | ... | Tiger shark, 3m | A .Kipper | 2018.06.04-Ramos.pdf | http://sharkattackfile.net/spreadsheets/pdf_di... | http://sharkattackfile.net/spreadsheets/pdf_di... | 2018.06.04 | 2018.06.04 | 6299.0 | | |


In [10]:
# Understanding database (2)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
Case Number               8702 non-null object
Date                      6302 non-null object
Year                      6300 non-null float64
Type                      6298 non-null object
Country                   6252 non-null object
Area                      5847 non-null object
Location                  5762 non-null object
Activity                  5758 non-null object
Name                      6092 non-null object
Sex                       5737 non-null object
Age                       3471 non-null object
Injury                    6274 non-null object
Fatal (Y/N)               5763 non-null object
Time                      2948 non-null object
Species                   3464 non-null object
Investigator or Source    6285 non-null object
pdf                       6302 non-null object
href formula              6301 non-null object
href                      6302 non-null obje

In [37]:
# 1) Saving database backup
data_bk = data.copy()

In [36]:
# 2) Data cleaning: removing duplicate rows
# DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
# (the ignore_index parameter only exists in later pandas releases, not in 0.23.0)
data.drop_duplicates(keep='first', inplace=True)

In [35]:
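# 2) Data cleaning: removing mostly null rows (keep rows with at least 12 non-null values)
# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# `thresh` alone decides which rows stay; recent pandas versions raise a TypeError
# when `how` is passed alongside it, so `how` is omitted here
data.dropna(axis=0, thresh=12, inplace=True)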
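
To make the effect of `thresh` concrete, here is a minimal sketch on a toy frame (the values are made up for illustration and do not come from `attacks.csv`). It contrasts keeping rows with a minimum number of non-null values against dropping only rows that are entirely null.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 has one value, row 2 is entirely null.
toy = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [2, 5, np.nan],
    'c': [3, np.nan, np.nan],
})

# thresh=2 keeps only the rows holding at least 2 non-null values.
mostly_filled = toy.dropna(axis=0, thresh=2)
print(mostly_filled.index.tolist())

# how='all' drops only the rows in which every value is null.
not_empty = toy.dropna(axis=0, how='all')
print(not_empty.index.tolist())
```

With the real table, a threshold of 12 out of 24 columns removes both the empty rows and the rows that carry little more than a case number.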

In [39]:
# 2) Data cleaning: renaming columns to lower and snake case
# TBD
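
One possible way to fill in this step is sketched below. The helper name `to_snake_case` is hypothetical (it is not part of pandas), and the toy frame only reuses a few of the column labels listed by `data.columns`, including the ones with trailing spaces.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Convert a column label such as 'Fatal (Y/N)' to 'fatal_y_n'."""
    # Collapse every run of non-alphanumeric characters into one underscore,
    # trim underscores left at either end, then lowercase the result.
    return re.sub(r'[^0-9a-zA-Z]+', '_', str(name)).strip('_').lower()

# Empty toy frame with some of the original labels (note 'Sex ' and 'Species ').
demo = pd.DataFrame(columns=['Case Number', 'Sex ', 'Fatal (Y/N)',
                             'Species ', 'Unnamed: 22'])
demo.columns = [to_snake_case(col) for col in demo.columns]
print(list(demo.columns))
```

Applied to the full table, the same list comprehension over `data.columns` would also strip the trailing spaces in `'Sex '` and `'Species '`, which otherwise make column selection error-prone.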

In [42]:
# 3) Selecting necessary columns to answer question:
# Among unprovoked incidents, which activity is associated with the most attacks?

data.columns
# Case Number: to be used as ID column
# Type: necessary to distinguish between types of incidents
# Activity: also necessary to attribute different kinds of activities

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [43]:
# Explicit copy so that later cleaning of data_a does not trigger SettingWithCopyWarning
data_a = data[['Case Number', 'Type', 'Activity']].copy()

In [50]:
data_a.info()
# TBD: Transform 'Case Number' into date

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6302 entries, 0 to 6301
Data columns (total 3 columns):
Case Number    6301 non-null object
Type           6298 non-null object
Activity       5758 non-null object
dtypes: object(3)
memory usage: 196.9+ KB


In [53]:
data_a['Type'].value_counts()
# TBD: Disregard occurrences not in ['Unprovoked', 'Provoked']

Unprovoked      4595
Provoked         574
Invalid          547
Sea Disaster     239
Boating          203
Boat             137
Questionable       2
Boatomg            1
Name: Type, dtype: int64
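
The filtering noted in the TBD could look roughly like the sketch below. It runs on a small toy frame with hypothetical rows, not on the real data, and it assumes the `Activity` labels are already consistent, which the actual column would still need to be checked for.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from attacks.csv).
toy = pd.DataFrame({
    'Type': ['Unprovoked', 'Unprovoked', 'Unprovoked', 'Provoked',
             'Provoked', 'Invalid', 'Boatomg'],
    'Activity': ['Surfing', 'Surfing', 'Swimming', 'Fishing',
                 'Fishing', 'Surfing', None],
})

# Keep only the two incident types relevant to question (a).
valid_types = ['Unprovoked', 'Provoked']
filtered = toy[toy['Type'].isin(valid_types)]

# For each type, count the activities and record the most frequent one.
top_activity = {}
for incident_type in valid_types:
    counts = filtered.loc[filtered['Type'] == incident_type, 'Activity'].value_counts()
    top_activity[incident_type] = counts.idxmax()

print(top_activity)
```

On the real table the same pattern would be applied to `data_a`; `value_counts()` skips null activities by default, so rows without an `Activity` do not affect the ranking.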

In [54]:
data_a['Activity'].value_counts()

Surfing                                                                            971
Swimming                                                                           869
Fishing                                                                            431
Spearfishing                                                                       333
Bathing                                                                            162
Wading                                                                             149
Diving                                                                             127
Standing                                                                            99
Snorkeling                                                                          89
Scuba diving                                                                        76
Body boarding                                                                       61
Body surfing                               

In [None]:
# 3) Selecting necessary columns to answer question:
# Which country has had the largest number of incidents?

# data.columns: 'Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
#        'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
#        'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
#        'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
#        'Unnamed: 23'
# Case Number: to be used as ID column
# Country: necessary to group by countries
# Fatal (Y/N): (optional) potentially necessary to evaluate the incident's severity
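
A rough sketch of how question (b) might then be approached is shown below. The rows are toy values invented for illustration, so the resulting ranking says nothing about the real data. It also assumes that `Fatal (Y/N)` contains clean `'Y'`/`'N'` values, which should be verified on the actual column first.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from attacks.csv).
toy = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'AUSTRALIA', 'AUSTRALIA', 'MEXICO'],
    'Fatal (Y/N)': ['N', 'N', 'Y', 'Y', 'Y', 'N'],
})

# Ranking by total number of incidents per country.
all_counts = toy['Country'].value_counts()

# Same ranking restricted to fatal incidents.
fatal_counts = toy.loc[toy['Fatal (Y/N)'] == 'Y', 'Country'].value_counts()

print(all_counts.idxmax(), fatal_counts.idxmax())
```

In this toy example the country with the most incidents is not the one with the most fatal incidents, which is exactly the kind of change in ranking the follow-up question asks about.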