# Shark attack data project

#### Import the data, explore it, determine what needs to be cleaned or removed to make it useful, and formulate hypotheses/questions

In [35]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


In [3]:
attacks = pd.read_csv("../data/attacks.csv", encoding='latin1')


In [88]:
pd.set_option('display.max_columns', None) #Displays all the columns if they don't fit in the notebook
attacks.sample()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
2273,1996.10.28.b,28-Oct-1996,1996.0,Unprovoked,BRAZIL,Pernambuco,Barra de Jangada,Surfing,Gilvan Jaime de Freitas Júnior,,,Leg bitten,N,,,D. Duarte,1996.10.28.b-deFrietas.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1996.10.28.b,1996.10.28.b,4030.0,,


### `What would be interesting to find out?`

- Are sharks gender-discriminative?
- What percentage of the attacks occur during activities like boat trips, fishing, surfing...?
- ~~Which species is the most aggressive?~~ (too many unique values)
- What's the mortality rate of a shark attack?
- Do attacks increase or decrease over the years?
- What are the most common injuries (parts of the body)?
- How many megalodon attacks are there...?
- At what time of day do most attacks occur?
- Which age range is attacked the most?
- Where do they occur? Country/region

### `What's this data?`

`This is a dirty and disorganized dataset of global shark attacks with the following information:`

- `Case Number`: Case index
- `Date`
- `Year`
- `Type`: Type of incident, mainly one of: Boating, Unprovoked, Provoked, Questionable or Sea Disaster
- `Country`
- `Area`: Where the attack occurred
- `Location`: More specific location of the incident
- `Activity`: What the person was doing when the incident happened
- `Name`
- `Sex`
- `Age`
- `Injury`: Description of the injury
- `Fatal (Y/N)`: Whether the person was killed ("Y") or survived the attack ("N")
- `Time`: The hour and minutes when the incident happened
- `Species`: The species of shark involved in the incident
- `Investigator or Source`: Person or entity that carried out the case investigation (could be both)
- `pdf`: Name of the document related to the incident
- `href formula & href`: Link to the actual document
- `Case Number.1 & Case Number.2`: Copies of 'Case Number'
- `original order`: Unclear, possibly another identifier


### `What needs to be cleaned?`

- 'Unnamed: 22' and 'Unnamed: 23' are almost completely empty. Can drop them
- Fix column names. 'Sex ' and 'Species ' have a trailing space. Replace spaces with '_' and lowercase capital letters. 'Fatal (Y/N)' -> Fatal
- Rows from index 8702 onwards are all NaN (except the last row, 25722, whose 'Case Number' is 'xx'), but actual data stops at row 6302
- Standardize. All Sex values should be 'F'->female, 'M'->male or 'unknown'
- Standardize. All fatality values should be 'Y'->yes, 'N'->no or 'unknown'
- Too many unique values in species, not useful data
- Lots of matching values between the three identifier columns. Drop two of them and make 'Case Number' a unique identifier
- Remove duplicate rows, indexes: 4688, 5709, 6295
- Dates/years need to be cleaned and standardized to the same format
- Descriptions of the activity being carried out when the incident happened need to be simpler. Find the most common activities, use regex to group them into categories, and mark the rest as unknown
- Injury should also be cleaned with regex
- Replace all null values in the reduced list with 'unknown'
- Type values will be: Boating, Unprovoked, Provoked, Questionable or Sea Disaster
- Country/Area/Location: Useful data. The three may be deducible from one another. Cleanable, but might take too much work
- Time: Useful data. Cleanable, but might take too much work
- Not going to use pdf for now

`The exploration steps used to reach these conclusions can be found in the section below`

In [6]:
attacks.shape

(25723, 24)

##### Exploration methods/attributes

- table
    - attacks.shape
    - attacks.columns
    - attacks.info
- cohesiveness of data
    - attacks.duplicated
    - attacks.isna().sum()
- values
    - attacks.dtypes: categorical / quantitative / str, int, float
    - attacks.describe # quantitative variables
    - attacks[col1].value_counts() # frequency of each level of a categorical variable (dimension)
    - attacks[col1].unique()/.nunique()

In [77]:
attacks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

In [8]:
attacks.isna().sum()
# 'Unnamed: 22' and 'Unnamed: 23' are almost completely empty. Can drop them

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64

In [17]:
attacks.columns.values.tolist()
# Fix column names. 'Sex ' and 'Species ' have a trailing space.

['Case Number',
 'Date',
 'Year',
 'Type',
 'Country',
 'Area',
 'Location',
 'Activity',
 'Name',
 'Sex ',
 'Age',
 'Injury',
 'Fatal (Y/N)',
 'Time',
 'Species ',
 'Investigator or Source',
 'pdf',
 'href formula',
 'href',
 'Case Number.1',
 'Case Number.2',
 'original order',
 'Unnamed: 22',
 'Unnamed: 23']
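
A possible way to apply the column fixes noted above, as a minimal sketch. The helper name `clean_column_names` and the choice of lower-case snake_case names are assumptions (one reading of "replace spaces with '_' and remove capital letters"), and the frame below is an empty toy frame that only reuses a few of the raw column names.

```python
import pandas as pd

# Toy frame that only carries some of the raw column names listed above.
raw = pd.DataFrame(columns=['Case Number', 'Sex ', 'Fatal (Y/N)', 'Species ',
                            'Unnamed: 22', 'Unnamed: 23'])

def clean_column_names(df):
    """Hypothetical helper: strip, lower-case and underscore the column names."""
    renamed = df.rename(columns={'Fatal (Y/N)': 'Fatal'})
    renamed.columns = (
        renamed.columns
        .str.strip()                         # 'Sex ' -> 'Sex'
        .str.lower()                         # 'Sex' -> 'sex'
        .str.replace(' ', '_', regex=False)  # 'case number' -> 'case_number'
    )
    return renamed

# Drop the two almost empty columns first, then normalise the remaining names.
cleaned = clean_column_names(raw.drop(columns=['Unnamed: 22', 'Unnamed: 23']))
print(cleaned.columns.tolist())
```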

In [9]:
attacks.isna().all(axis=1).sum()

17020

In [10]:
attacks.index[attacks.isna().all(axis=1)].min()

8702

In [11]:
attacks.index[attacks.isna().all(axis=1)].max()
# Rows from index 8702 to 25721 are all NaN

25721

In [40]:
attacks.index.max()

25722

In [59]:
attacks.iloc[25722]

Case Number                xx
Date                      NaN
Year                      NaN
Type                      NaN
Country                   NaN
Area                      NaN
Location                  NaN
Activity                  NaN
Name                      NaN
Sex                       NaN
Age                       NaN
Injury                    NaN
Fatal (Y/N)               NaN
Time                      NaN
Species                   NaN
Investigator or Source    NaN
pdf                       NaN
href formula              NaN
href                      NaN
Case Number.1             NaN
Case Number.2             NaN
original order            NaN
Unnamed: 22               NaN
Unnamed: 23               NaN
Name: 25722, dtype: object

In [99]:
#attacks.iloc[8701]
#attacks.iloc[7373]
#attacks.iloc[6373]
attacks.iloc[6302]

Case Number                    0
Date                         NaN
Year                         NaN
Type                         NaN
Country                      NaN
Area                         NaN
Location                     NaN
Activity                     NaN
Name                         NaN
Sex                          NaN
Age                          NaN
Injury                       NaN
Fatal (Y/N)                  NaN
Time                         NaN
Species                      NaN
Investigator or Source       NaN
pdf                          NaN
href formula                 NaN
href                         NaN
Case Number.1                NaN
Case Number.2                NaN
original order            6304.0
Unnamed: 22                  NaN
Unnamed: 23                  NaN
Name: 6302, dtype: object

In [72]:
# Check where the actual data stops: many rows are empty except for 'Case Number', which holds the value '0'
# columns_to_check = [col for col in attacks.columns if col != 'Case Number']

# same with values of column 'original order'
columns_to_check = [col for col in attacks.columns if col != 'Case Number' and col != 'original order']

min_index = attacks[columns_to_check].isna().all(axis=1).idxmax()

print(f"actual data ends at row {min_index}")

actual data ends at row 6302
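
Once the first empty row is known, the frame can be trimmed. A minimal sketch on made-up toy values, assuming a default RangeIndex (so the label returned by `idxmax` is also a position). The guard matters because `idxmax` returns the first label when no row is empty.

```python
import numpy as np
import pandas as pd

# Made-up toy data: two real rows, then rows that only keep placeholder values.
toy = pd.DataFrame({
    'Case Number': ['case-1', 'case-2', '0', '0', np.nan],
    'Date': ['d1', 'd2', np.nan, np.nan, np.nan],
    'original order': [3.0, 2.0, 4.0, np.nan, np.nan],
})

ignore = ['Case Number', 'original order']
empty_row = toy.drop(columns=ignore).isna().all(axis=1)

# idxmax gives the first True; fall back to the full length if nothing is empty.
first_empty = empty_row.idxmax() if empty_row.any() else len(toy)

trimmed = toy.iloc[:first_empty]
print(first_empty, len(trimmed))
```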


In [12]:
attacks.describe()

Unnamed: 0,Year,original order
count,6300.0,6309.0
mean,1927.272381,3155.999683
std,281.116308,1821.396206
min,0.0,2.0
25%,1942.0,1579.0
50%,1977.0,3156.0
75%,2005.0,4733.0
max,2018.0,6310.0


In [15]:
attacks["Sex "].unique()
# Standardize. All Sex values should be 'F'->female, 'M'->male or 'unknown'

array(['F', 'M', nan, 'M ', 'lli', 'N', '.'], dtype=object)

In [16]:
attacks["Fatal (Y/N)"].unique()
# Standardize. All fatality values should be 'Y'->yes, 'N'->no or 'unknown'

array(['N', 'Y', nan, 'M', 'UNKNOWN', '2017', ' N', 'N ', 'y'],
      dtype=object)
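
One way the Sex and Fatal values could be standardized, sketched on the unique values shown above. The helper `standardize_flag` is hypothetical, and treating ambiguous entries such as 'N' in Sex or 'M' in Fatal as 'unknown' is an assumption rather than a decision taken in the notes.

```python
import numpy as np
import pandas as pd

def standardize_flag(values, allowed):
    """Hypothetical helper: trim and upper-case, then keep only allowed codes."""
    tidy = values.str.strip().str.upper()
    # Anything outside the allowed codes (NaN included) becomes 'unknown'.
    return tidy.where(tidy.isin(allowed), 'unknown')

sex = pd.Series(['F', 'M', np.nan, 'M ', 'lli', 'N', '.'])
fatal = pd.Series(['N', 'Y', np.nan, 'M', 'UNKNOWN', '2017', ' N', 'N ', 'y'])

sex_clean = standardize_flag(sex, ['F', 'M'])
fatal_clean = standardize_flag(fatal, ['Y', 'N'])

print(sex_clean.tolist())
print(fatal_clean.tolist())
```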

In [13]:
attacks["Species "].nunique()
#Too many unique values in species, not useful data

1549

In [81]:
# attacks['Case Number'].equals(attacks['Case Number.1']) -> False
# attacks['Case Number'].equals(attacks['Case Number.2']) -> False
# attacks['Case Number.1'].equals(attacks['Case Number.2']) -> False

data = attacks.iloc[:6302]  # rows with actual data
same_case = data['Case Number'] == data['Case Number.1']

matching_cases = data[same_case]
non_matching_cases = data[~same_case]

matching_count = matching_cases.shape[0]
non_matching_count = non_matching_cases.shape[0]

print(matching_count, non_matching_count)
# In element-wise comparisons NaN != NaN, so rows where both values are NaN are not counted as a match.
# Slicing up to 6302 is enough; no need to go up to index 8702.


6278 24
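
The NaN behaviour mentioned in the comment can be checked on a tiny example. The series below are made up; the second count shows one possible way to treat two missing values as a match, if that were ever wanted.

```python
import numpy as np
import pandas as pd

a = pd.Series(['x', np.nan, 'y'])
b = pd.Series(['x', np.nan, 'z'])

# Plain element-wise comparison: NaN == NaN is False.
elementwise_matches = int((a == b).sum())

# NaN-aware variant: also count positions where both sides are missing.
nan_aware_matches = int(((a == b) | (a.isna() & b.isna())).sum())

print(elementwise_matches, nan_aware_matches)
```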


In [82]:
column_pairs = [
    ('Case Number', 'Case Number.1'),
    ('Case Number', 'Case Number.2'),
    ('Case Number.1', 'Case Number.2')
]

results = []
data = attacks.iloc[:6302]  # rows with actual data

for column1, column2 in column_pairs:
    matching_count = (data[column1] == data[column2]).sum()
    non_matching_count = (data[column1] != data[column2]).sum()
    result_string = f"{column1} and {column2} have {matching_count} matching values and {non_matching_count} non-matching values"
    results.append(result_string)

for result in results:
    print(result)

Case Number and Case Number.1 have 6278 matching values and 24 non-matching values
Case Number and Case Number.2 have 6298 matching values and 4 non-matching values
Case Number.1 and Case Number.2 have 6282 matching values and 20 non-matching values


In [76]:
attacks["Case Number"].is_unique
# Lots of matching values between the three identifier columns. Drop two of them and make 'Case Number' a unique identifier

False
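
Note that the check above runs on the whole column, where the trailing '0' and NaN rows already make it non-unique. A small sketch on made-up values of how repeated identifiers could be listed within the real rows only:

```python
import numpy as np
import pandas as pd

# Made-up identifiers: three real rows followed by placeholder rows.
case = pd.Series(['c1', 'c2', 'c2', '0', '0', np.nan, np.nan])

real = case.iloc[:3]                          # keep only the rows with actual data
repeated = real[real.duplicated(keep=False)]  # every occurrence of a repeated value

print(real.is_unique)
print(repeated.tolist())
```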

In [79]:
subset = attacks[['Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)']]

subset.duplicated().any()

True

In [85]:
real_rows = subset.iloc[:6302]
real_rows[real_rows.duplicated()]
# These three rows are duplicated

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N)
4688,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,2 males,M,,Calf nipped in each case,N
5709,1890,1890.0,Unprovoked,INDIA,Tamil Nadu,Tuticorin,Diving,a pearl diver,M,,No details,UNKNOWN
6295,Before 1906,0.0,Unprovoked,AUSTRALIA,,,Fishing,fisherman,M,,FATAL,Y
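
Dropping such rows could look like the sketch below. The toy frame and its values are made up, and comparing only the content columns (not the identifier) is an assumption that mirrors the `subset` used above.

```python
import pandas as pd

# Made-up toy rows: rows 0 and 1 share the same content but not the same identifier.
toy = pd.DataFrame({
    'Case Number': ['c1', 'c2', 'c3'],
    'Date': ['1890', '1890', 'Fall 1943'],
    'Name': ['a pearl diver', 'a pearl diver', '2 males'],
})

content_cols = ['Date', 'Name']

# Keep the first occurrence of each content combination and drop the later ones.
deduped = toy.drop_duplicates(subset=content_cols, keep='first')
print(deduped.index.tolist())
```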


In [105]:
attacks["Type"].nunique()

8

In [107]:
attacks["Type"].unique()

array(['Boating', 'Unprovoked', 'Invalid', 'Provoked', 'Questionable',
       'Sea Disaster', nan, 'Boat', 'Boatomg'], dtype=object)
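
A possible cleanup for these values, as a sketch. Mapping 'Boat' and 'Boatomg' to 'Boating' follows from the target list in the cleaning notes; sending 'Invalid' and NaN to 'unknown' is an assumption.

```python
import numpy as np
import pandas as pd

type_values = pd.Series(['Boating', 'Unprovoked', 'Invalid', 'Provoked', 'Questionable',
                         'Sea Disaster', np.nan, 'Boat', 'Boatomg'])

fixes = {'Boat': 'Boating', 'Boatomg': 'Boating'}  # variants and typos of 'Boating'
allowed = ['Boating', 'Unprovoked', 'Provoked', 'Questionable', 'Sea Disaster']

type_clean = type_values.replace(fixes)
# Whatever is still outside the allowed list (assumed: 'Invalid', NaN) becomes 'unknown'.
type_clean = type_clean.where(type_clean.isin(allowed), 'unknown')

print(type_clean.tolist())
```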

In [95]:
#attacks["Year"].nunique()
attacks["Year"].unique()

array([2018., 2017.,   nan, 2016., 2015., 2014., 2013., 2012., 2011.,
       2010., 2009., 2008., 2007., 2006., 2005., 2004., 2003., 2002.,
       2001., 2000., 1999., 1998., 1997., 1996., 1995., 1984., 1994.,
       1993., 1992., 1991., 1990., 1989., 1969., 1988., 1987., 1986.,
       1985., 1983., 1982., 1981., 1980., 1979., 1978., 1977., 1976.,
       1975., 1974., 1973., 1972., 1971., 1970., 1968., 1967., 1966.,
       1965., 1964., 1963., 1962., 1961., 1960., 1959., 1958., 1957.,
       1956., 1955., 1954., 1953., 1952., 1951., 1950., 1949., 1948.,
       1848., 1947., 1946., 1945., 1944., 1943., 1942., 1941., 1940.,
       1939., 1938., 1937., 1936., 1935., 1934., 1933., 1932., 1931.,
       1930., 1929., 1928., 1927., 1926., 1925., 1924., 1923., 1922.,
       1921., 1920., 1919., 1918., 1917., 1916., 1915., 1914., 1913.,
       1912., 1911., 1910., 1909., 1908., 1907., 1906., 1905., 1904.,
       1903., 1902., 1901., 1900., 1899., 1898., 1897., 1896., 1895.,
       1894., 1893.,

In [94]:
attacks["Date"].nunique()


5433

In [98]:
#attacks["Country"].nunique()
attacks["Country"].unique()

array(['USA', 'AUSTRALIA', 'MEXICO', 'BRAZIL', 'ENGLAND', 'SOUTH AFRICA',
       'THAILAND', 'COSTA RICA', 'MALDIVES', 'BAHAMAS', 'NEW CALEDONIA',
       'ECUADOR', 'MALAYSIA', 'LIBYA', nan, 'CUBA', 'MAURITIUS',
       'NEW ZEALAND', 'SPAIN', 'SAMOA', 'SOLOMON ISLANDS', 'JAPAN',
       'EGYPT', 'ST HELENA, British overseas territory', 'COMOROS',
       'REUNION', 'FRENCH POLYNESIA', 'UNITED KINGDOM',
       'UNITED ARAB EMIRATES', 'PHILIPPINES', 'INDONESIA', 'CHINA',
       'COLUMBIA', 'CAPE VERDE', 'Fiji', 'DOMINICAN REPUBLIC',
       'CAYMAN ISLANDS', 'ARUBA', 'MOZAMBIQUE', 'FIJI', 'PUERTO RICO',
       'ITALY', 'ATLANTIC OCEAN', 'GREECE', 'ST. MARTIN', 'FRANCE',
       'PAPUA NEW GUINEA', 'TRINIDAD & TOBAGO', 'KIRIBATI', 'ISRAEL',
       'DIEGO GARCIA', 'TAIWAN', 'JAMAICA', 'PALESTINIAN TERRITORIES',
       'GUAM', 'SEYCHELLES', 'BELIZE', 'NIGERIA', 'TONGA', 'SCOTLAND',
       'CANADA', 'CROATIA', 'SAUDI ARABIA', 'CHILE', 'ANTIGUA', 'KENYA',
       'RUSSIA', 'TURKS & CAICOS', 'UNITE

In [101]:
attacks["Location"].nunique()
#might be too much work

4108

In [103]:
attacks["Area"].nunique()
#might be too much work

825

In [110]:
attacks["Time"].unique()
# cleanable but might be too much

array(['18h00', '14h00  -15h00', '07h45', nan, 'Late afternoon', '17h00',
       '14h00', 'Morning', '15h00', '08h15', '11h00', '10h30', '10h40',
       '16h50', '07h00', '09h30', 'Afternoon', '21h50', '09h40', '08h00',
       '17h35', '15h30', '07h30', '19h00, Dusk', 'Night', '16h00',
       '15h01', '12h00', '13h45', '23h30', '09h00', '14h30', '18h30',
       '12h30', '16h30', '18h45', '06h00', '10h00', '10h44', '13h19',
       'Midday', '13h30', '10h45', '11h20', '11h45', '19h30', '08h30',
       '15h45', 'Shortly before 12h00', '17h34', '17h10', '11h15',
       '08h50', '17h45', '13h00', '10h20', '13h20', '02h00', '09h50',
       '11h30', '17h30', '9h00', '10h43', 'After noon', '15h15', '15h40',
       '19h05', '1300', '14h30 / 15h30', '22h00', '16h20', '14h34',
       '15h25', '14h55', '17h46', 'Morning ', '15h49', '19h00',
       'Midnight', '09h30 / 10h00', '10h15', '18h15', '04h00', '14h50',
       '13h50', '19h20', '10h25', '10h45-11h15', '16h45', '15h52',
       '06h15', '14h
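
If Time does get cleaned, a regex could pull the hour out of the `HHhMM` entries. A rough sketch on a few of the values above; keeping only the first hour of a range and leaving text-only entries ('Morning', 'Midnight') and plain digits ('1300') as missing are simplifying assumptions.

```python
import numpy as np
import pandas as pd

times = pd.Series(['18h00', '14h00  -15h00', np.nan, 'Late afternoon',
                   '9h00', '1300', 'Shortly before 12h00', 'Midnight'])

# Capture one or two digits followed by 'h' and two minute digits; first match only.
hours = times.str.extract(r'(\d{1,2})h\d{2}', expand=False).astype(float)

print(hours.tolist())
```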