# Shark Attacks

### Importing libraries

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
os.listdir()

['shark_analysis.ipynb',
 'Untitled.ipynb',
 'attacks.csv',
 'README.md',
 '.gitattributes',
 '.ipynb_checkpoints',
 '.git']

### Reading and understanding database

In [3]:
db_attacks = pd.read_csv('attacks.csv', encoding='latin-1')
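
The `encoding='latin-1'` argument is presumably needed because the file contains bytes that are not valid UTF-8. The sketch below illustrates the idea on a hypothetical two-row sample (the sample text and the variable names are assumptions, not taken from `attacks.csv`).

```python
import io
import pandas as pd

# Hypothetical sample: the accented name is stored as Latin-1 bytes.
raw = "Country,Name\nREUNION,José\n".encode("latin-1")

# The byte 0xE9 (é in Latin-1) is not valid UTF-8 on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding recovers the text.
sample = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(utf8_ok, sample.loc[0, "Name"])
```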

In [4]:
db_attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [5]:
db_attacks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
Case Number               8702 non-null object
Date                      6302 non-null object
Year                      6300 non-null float64
Type                      6298 non-null object
Country                   6252 non-null object
Area                      5847 non-null object
Location                  5762 non-null object
Activity                  5758 non-null object
Name                      6092 non-null object
Sex                       5737 non-null object
Age                       3471 non-null object
Injury                    6274 non-null object
Fatal (Y/N)               5763 non-null object
Time                      2948 non-null object
Species                   3464 non-null object
Investigator or Source    6285 non-null object
pdf                       6302 non-null object
href formula              6301 non-null object
href                      6302 non-null object

### General Cleaning

#### Duplicates

In [6]:
db_attacks.drop_duplicates(keep = 'first', inplace = True)

#### Rows with fewer than 12 filled columns (50%)

In [7]:
db_attacks.dropna(axis = 0, thresh = 12, inplace = True)  # keep rows with at least 12 non-null values

In [8]:
db_attacks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6302 entries, 0 to 6301
Data columns (total 24 columns):
Case Number               6301 non-null object
Date                      6302 non-null object
Year                      6300 non-null float64
Type                      6298 non-null object
Country                   6252 non-null object
Area                      5847 non-null object
Location                  5762 non-null object
Activity                  5758 non-null object
Name                      6092 non-null object
Sex                       5737 non-null object
Age                       3471 non-null object
Injury                    6274 non-null object
Fatal (Y/N)               5763 non-null object
Time                      2948 non-null object
Species                   3464 non-null object
Investigator or Source    6285 non-null object
pdf                       6302 non-null object
href formula              6301 non-null object
href                      6302 non-null object

#### Columns with fewer than 5 filled rows (about 0.08%)

In [9]:
db_attacks.dropna(axis = 1, thresh = 5, inplace = True)  # keep columns with at least 5 non-null values
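
The `thresh` argument keeps a row (or column) only if it has at least that many non-null values. Below is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption for illustration). It uses `thresh` on its own, since recent pandas releases do not accept `how` and `thresh` together.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with varying numbers of missing values.
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [1, np.nan, np.nan, np.nan],
    "c": [1, 2, 3, np.nan],
    "d": [np.nan, np.nan, np.nan, np.nan],
})

# Keep rows with at least 2 non-null values (rows 0 and 2).
rows_kept = toy.dropna(axis=0, thresh=2)

# Then keep columns with at least 2 non-null values among the remaining rows.
cols_kept = rows_kept.dropna(axis=1, thresh=2)

print(list(rows_kept.index))    # [0, 2]
print(list(cols_kept.columns))  # ['a', 'c']
```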

### Saving a backup copy

In [10]:
data_bk = db_attacks.copy()
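
Using `.copy()` matters here because plain assignment only creates a second name for the same DataFrame. A small sketch with made-up data (the names `original`, `alias` and `backup` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({"Country": ["USA", "BRAZIL"]})

alias = original          # same object, no data is duplicated
backup = original.copy()  # independent copy of the data

# Modify the original in place.
original.loc[0, "Country"] = "CHANGED"

print(alias.loc[0, "Country"])   # CHANGED
print(backup.loc[0, "Country"])  # USA
```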

### Formulating questions

a) Which country has had the biggest number of incidents?
> Is the ranking maintained if filtered by fatal accidents?

b) Within the unprovoked type of incident, which activity accounts for the most incidents?
> What about provoked accidents?

c) Which shark species is most associated with attacks?

d) Do attacks happen more frequently at a specific time of day?

#### a) Which country has had the biggest number of incidents?

Selecting necessary columns to answer question:

In [11]:
db_attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

To identify countries with the most occurrences, the following columns will be used:

> Case Number: to be used as ID column | Country: to identify the location

Filtering dataframe to select only chosen columns

In [12]:
attacks_by_country = db_attacks[['Case Number', 'Country']].groupby(by = 'Country', as_index = False).count()

In [13]:
attacks_by_country.head()

Unnamed: 0,Country,Case Number
0,PHILIPPINES,1
1,TONGA,3
2,ADMIRALTY ISLANDS,1
3,AFRICA,1
4,ALGERIA,1


In [14]:
attacks_by_country.sort_values(by = 'Case Number', ascending = False).head()

Unnamed: 0,Country,Case Number
204,USA,2228
14,AUSTRALIA,1338
171,SOUTH AFRICA,579
145,PAPUA NEW GUINEA,134
127,NEW ZEALAND,128
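
As an aside, the same ranking can usually be obtained with `value_counts()`, which counts rows per country and sorts in descending order in one call. The sketch below compares both approaches on made-up rows (the `toy` frame is an assumption). The two agree here because no `Case Number` is missing; `groupby(...).count()` counts non-null `Case Number` values, while `value_counts()` counts rows.

```python
import pandas as pd

# Hypothetical sample with one missing country.
toy = pd.DataFrame({
    "Case Number": ["c1", "c2", "c3", "c4"],
    "Country": ["USA", "USA", "AUSTRALIA", None],
})

# Approach used in the notebook: group, then count the ID column.
grouped = toy[["Case Number", "Country"]].groupby(by="Country", as_index=False).count()

# Alternative: count occurrences of each country directly.
counts = toy["Country"].value_counts()

print(grouped.sort_values(by="Case Number", ascending=False))
print(counts)
```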


In [15]:
top_attacks_by_country = attacks_by_country.sort_values(by = 'Case Number', ascending = False).head(15).reset_index()

In [16]:
top_attacks_by_country['Country'] = top_attacks_by_country['Country'].str.replace('REUNION', 'REUNION ISLAND')

In [17]:
top_attacks_by_country
# A thorough cleaning of the 'Country' column was skipped because it did not seem to affect the final results

Unnamed: 0,index,Country,Case Number
0,204,USA,2228
1,14,AUSTRALIA,1338
2,171,SOUTH AFRICA,579
3,145,PAPUA NEW GUINEA,134
4,127,NEW ZEALAND,128
5,23,BRAZIL,112
6,16,BAHAMAS,109
7,113,MEXICO,89
8,90,ITALY,71
9,61,FIJI,62


#### b) Is the ranking maintained if filtered by fatal accidents?

Selecting necessary columns to answer question:

In [18]:
db_attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

To identify the countries with the most fatal occurrences, the following columns will be used:

> Case Number: to be used as ID column | Country: to identify the location | Fatal (Y/N): to record the outcome of the attack

Inspecting the values in the Fatal (Y/N) column

In [19]:
db_attacks['Fatal (Y/N)'].value_counts()

N          4293
Y          1388
UNKNOWN      71
 N            7
N             1
M             1
y             1
2017          1
Name: Fatal (Y/N), dtype: int64

Cleaning the Fatal (Y/N) column

In [20]:
db_attacks = db_attacks[(db_attacks['Fatal (Y/N)'] != 'UNKNOWN')]

In [21]:
db_attacks = db_attacks[(db_attacks['Fatal (Y/N)'] != '2017')]

In [22]:
db_attacks = db_attacks[(db_attacks['Fatal (Y/N)'] != 'M')]

In [66]:
db_attacks['Fatal (Y/N)'] = db_attacks['Fatal (Y/N)'].str.replace(' ?N ?', 'N', regex = True)

In [70]:
db_attacks['Fatal (Y/N)'] = db_attacks['Fatal (Y/N)'].str.replace('y', 'Y', regex = True)

In [71]:
db_attacks['Fatal (Y/N)'].value_counts()

N    4301
Y    1389
Name: Fatal (Y/N), dtype: int64
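
The cleaning steps above can also be sketched as a single normalize-then-filter pass. The series below is made up to mirror the kinds of values seen in the column. One difference to be aware of: `isin` also drops missing values, whereas the `!=` filters above keep them.

```python
import pandas as pd

# Hypothetical values mirroring the messy entries in 'Fatal (Y/N)'.
fatal = pd.Series(
    [" N", "N ", "y", "Y", "UNKNOWN", "M", "2017", "N", None],
    name="Fatal (Y/N)",
)

# Normalize whitespace and case, then keep only the two valid labels.
cleaned = fatal.str.strip().str.upper()
cleaned = cleaned[cleaned.isin(["Y", "N"])]

counts = cleaned.value_counts()
print(counts)
```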

Creating a new DataFrame that holds the number of attacks and their fatality outcome

In [72]:
fatal_attacks_by_country = attacks_by_country.merge(db_attacks[['Country', 'Fatal (Y/N)']], how = 'left', left_on = 'Country', right_on = 'Country') 

Grouping by 'Country', counting the non-null values in each column, and sorting in descending order of the 'Fatal (Y/N)' count

In [73]:
fatal_attacks_by_country = fatal_attacks_by_country.groupby(by = 'Country').count().sort_values(by = 'Fatal (Y/N)', ascending = False).reset_index()
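
Note that `count()` tallies every non-null entry, so a 'Fatal (Y/N)' count includes both 'Y' and 'N' rows. If only fatal attacks are wanted, one option is to filter on 'Y' before grouping. A minimal sketch on made-up data (the `toy` frame and the names `known` and `fatal_only` are assumptions):

```python
import pandas as pd

# Hypothetical cleaned data: three USA attacks (one fatal), two BRAZIL attacks (both fatal).
toy = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "BRAZIL", "BRAZIL"],
    "Fatal (Y/N)": ["Y", "N", "N", "Y", "Y"],
})

# count() gives the number of attacks with a recorded outcome.
known = toy.groupby(by="Country")["Fatal (Y/N)"].count()

# Filtering first gives the number of fatal attacks only.
fatal_only = toy[toy["Fatal (Y/N)"] == "Y"].groupby(by="Country").size()

print(known)
print(fatal_only)
```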

In [74]:
top_fatal_attacks_by_country = fatal_attacks_by_country.head(15).copy()  # copy so the next assignment does not write to a slice

In [75]:
top_fatal_attacks_by_country['Country'] = top_fatal_attacks_by_country['Country'].str.replace('REUNION', 'REUNION ISLAND')


In [76]:
top_fatal_attacks_by_country = top_fatal_attacks_by_country[['Country', 'Fatal (Y/N)']]

Comparing ranking for countries with the most attacks and countries with the most fatal attacks

In [77]:
top_attacks_by_country

Unnamed: 0,index,Country,Case Number
0,204,USA,2228
1,14,AUSTRALIA,1338
2,171,SOUTH AFRICA,579
3,145,PAPUA NEW GUINEA,134
4,127,NEW ZEALAND,128
5,23,BRAZIL,112
6,16,BAHAMAS,109
7,113,MEXICO,89
8,90,ITALY,71
9,61,FIJI,62


In [78]:
top_fatal_attacks_by_country

Unnamed: 0,Country,Fatal (Y/N)
0,USA,2023
1,AUSTRALIA,1204
2,SOUTH AFRICA,513
3,PAPUA NEW GUINEA,130
4,NEW ZEALAND,115
5,BAHAMAS,104
6,BRAZIL,102
7,MEXICO,78
8,FIJI,61
9,REUNION ISLAND,60


Although the top 5 countries are the same in both rankings (most attacks and most fatal attacks), the order changes further down the list for:
> Brazil, Bahamas, Italy, Fiji and Reunion Island

#### c) What is the proportion of fatal attacks within all attacks?

Putting previous DataFrames in the same order

In [79]:
top_fatal = top_fatal_attacks_by_country.sort_values(by = 'Country')
top_attacks = top_attacks_by_country.sort_values(by = 'Country')

In [80]:
top_death_probability = top_fatal.copy()  # independent copy, so adding a column does not modify top_fatal

Creating ratio between number of fatal accidents and total number of attacks

In [81]:
top_death_probability['Fatality Probability'] = top_fatal['Fatal (Y/N)'] / top_attacks['Case Number']
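
One caveat: pandas divides Series by matching index labels, not row position, so sorting both frames by country does not by itself pair each country with its own total. For example, the BAHAMAS ratio reported in this section (0.928571) equals 104/112, and 112 is the BRAZIL total. A merge on 'Country' makes the pairing explicit. The sketch below uses made-up numbers (the frames `totals` and `fatal` are assumptions).

```python
import pandas as pd

# Hypothetical rankings whose row order (and therefore index) differs.
totals = pd.DataFrame({"Country": ["USA", "BRAZIL", "BAHAMAS"], "Case Number": [10, 4, 5]})
fatal = pd.DataFrame({"Country": ["USA", "BAHAMAS", "BRAZIL"], "Fatal (Y/N)": [2, 1, 2]})

# Sorting keeps the original index labels, so division still pairs by label:
# label 1 is BAHAMAS in `fatal` but BRAZIL in `totals`.
by_index = (
    fatal.sort_values(by="Country")["Fatal (Y/N)"]
    / totals.sort_values(by="Country")["Case Number"]
)

# Merging on the country name pairs each country with its own total.
merged = totals.merge(fatal, on="Country", how="inner")
merged["Fatality Probability"] = merged["Fatal (Y/N)"] / merged["Case Number"]

print(by_index.loc[1])  # 0.25, i.e. BAHAMAS fatal count over BRAZIL total
print(merged)
```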

In [82]:
top_death_probability = top_death_probability.sort_values(by = 'Fatality Probability', ascending = False).reset_index()

In [83]:
top_death_probability = top_death_probability.reindex(columns = ['Country', 'Fatal (Y/N)', 'Fatality Probability'])

In [84]:
top_death_probability

Unnamed: 0,Country,Fatal (Y/N),Fatality Probability
0,NEW CALEDONIA,52,0.981132
1,PAPUA NEW GUINEA,130,0.970149
2,REUNION ISLAND,60,0.967742
3,BRAZIL,102,0.93578
4,BAHAMAS,104,0.928571
5,PHILIPPINES,56,0.918033
6,USA,2023,0.907989
7,ITALY,54,0.9
8,AUSTRALIA,1204,0.899851
9,NEW ZEALAND,115,0.898438


Ranked by this ratio, a different set of countries comes out on top. The top 5 are:

> New Caledonia, Papua New Guinea, Reunion Island, Brazil and Bahamas

It is also interesting to note that two of the top three, New Caledonia and Reunion Island, are French territories.