![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

<body>
    <p style="font-size:28px;text-align:center"><b>Project 02 | Data Cleaning & Manipulation</b></p>
</body>

# Introduction

The objective of this project was to answer a problem statement while practicing data cleaning and manipulation.

---

<body>
    <p style="font-size:20px"><b>Problem Statement</b></p>
</body>

_What are the most common characteristics of people involved in shark incidents throughout history?_

---

To answer this problem, the following characteristics will be analyzed:
- Gender
- Age
- Activity

# Setup

## Import

In [1]:
import pandas as pd
import numpy as np
import re

## Load the dataset

In [2]:
# Load the dataset from an Excel file into a Pandas DataFrame
df = pd.read_excel('GSAF5.xls')

In [3]:
# Check information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25792 entries, 0 to 25791
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8771 non-null   object 
 1   Date                    6558 non-null   object 
 2   Year                    6556 non-null   float64
 3   Type                    6552 non-null   object 
 4   Country                 6508 non-null   object 
 5   Area                    6091 non-null   object 
 6   Location                6008 non-null   object 
 7   Activity                6002 non-null   object 
 8   Name                    6343 non-null   object 
 9   Sex                     5987 non-null   object 
 10  Age                     3660 non-null   object 
 11  Injury                  6528 non-null   object 
 12  Fatal (Y/N)             6006 non-null   object 
 13  Time                    3139 non-null   object 
 14  Species                 3610 non-null 

### Conclusions about the dataset for cleaning purposes
1. The dataset information shows that many rows contain only missing values (`NaN`).
2. The columns 'Unnamed: 22' and 'Unnamed: 23' probably do not have any information since they have 1 and 2 non-null values, respectively.
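
The first conclusion can be checked directly by counting the rows in which every column is missing. A minimal sketch on a toy frame (the values below are made up for illustration and are not taken from `GSAF5.xls`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw dataset (hypothetical values)
toy = pd.DataFrame({
    'Date': ['20-Aug-2020', np.nan, np.nan, '14-Aug-2020'],
    'Sex': ['F', np.nan, np.nan, np.nan],
})

# A row counts as empty when every column in it is NaN
empty_rows = toy.isna().all(axis=1)
n_empty = int(empty_rows.sum())
print(f'{n_empty} of {len(toy)} rows contain only missing values.')
```

Running the same two lines on the real dataframe would give the exact number of empty rows before any cleaning.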

## Create a backup version of the raw dataset

In [4]:
# Create a backup version
df_bkp = df.copy()

# General Data Cleaning

## Headers

In [5]:
# Remove unnecessary spaces, put everything in lower case and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex', 'age', 'injury', 'fatal_(y/n)', 'time',
       'species', 'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')
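
The normalization above keeps characters such as `(`, `/` and `:` in the column names, which is why `fatal_(y/n)` is renamed by hand in the next cell. As a possible variation (not what this notebook uses), a regular expression can collapse every run of non-alphanumeric characters into a single underscore. The helper name `normalize_header` and the header list are assumptions for illustration:

```python
import re

def normalize_header(name):
    """Lower-case a header and collapse runs of non-alphanumeric characters into one underscore."""
    name = name.strip().lower()
    name = re.sub(r'[^0-9a-z]+', '_', name)
    return name.strip('_')  # drop underscores left at either end

# Illustrative headers, similar in shape to the ones in the dataset
headers = ['Case Number', 'Fatal (Y/N)', 'Species ', 'Unnamed: 22']
normalized = [normalize_header(h) for h in headers]
print(normalized)
```

With this variant the resulting names differ from the ones used in the rest of the notebook (for example `fatal_y_n` instead of `fatal_(y/n)`), so later cells would need the matching names.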

In [6]:
# Rename the column 'fatal_(y/n)' to make it simpler
df = df.rename(columns={'fatal_(y/n)': 'fatal'})
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex', 'age', 'injury', 'fatal', 'time', 'species',
       'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')

## Columns that are not relevant to the analysis

As mentioned above, the columns `unnamed:_22` and `unnamed:_23` probably do not have any relevant information, so they can be removed.

In [7]:
# Remove the columns 'unnamed:_22' and 'unnamed:_23'
df = df.drop(columns=['unnamed:_22', 'unnamed:_23'])

# Check the result
df.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,...,fatal,time,species,investigator_or_source,pdf,href_formula,href,case_number.1,case_number.2,original_order
0,2020.08.20,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,...,N,11h00,,"K. McMurray, TrackingSharks.com",2020.08.20-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.20,2020.08.20,6559.0
1,2020.08.14,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,...,N,09h30,"White shark, 2-to 3m","B. Myatt, GSAF",2020.08.14-ShellyBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.14,2020.08.14,6558.0
2,2020.08.10,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,...,N,16h00,"Blacktip shark, 6'","K. McMurray, TrackingSharks.com",2020.08.10-Provoked.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.10,2020.08.10,6557.0
3,2020.08.02,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,...,N,14h00,"Nurse shark, 5'","K. McMurray, TrackingSharks.com",2020.08.02-Klein.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.02,2020.08.02,6556.0
4,2020.07.31.c,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,...,N,17h00,,"K. McMurray, TrackingSharks.com",2020.07.31.c-Tossi..pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.07.31.c,2020.07.31.c,6555.0


Checking the table above, it is clear that the columns `investigator_or_source`, `pdf`, `href_formula` and `href` will not be relevant to the analysis since they are just information about the source of each incident. Therefore, these columns can be removed.

In [8]:
# Remove the columns 'investigator_or_source', 'pdf', 'href_formula' and 'href'
df = df.drop(columns=['investigator_or_source', 'pdf', 'href_formula', 'href'])

# Check the result
df.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal,time,species,case_number.1,case_number.2,original_order
0,2020.08.20,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,50.0,Minor lacerations to left leg,N,11h00,,2020.08.20,2020.08.20,6559.0
1,2020.08.14,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m",2020.08.14,2020.08.14,6558.0
2,2020.08.10,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'",2020.08.10,2020.08.10,6557.0
3,2020.08.02,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'",2020.08.02,2020.08.02,6556.0
4,2020.07.31.c,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,22.0,Lacerations to foot,N,17h00,,2020.07.31.c,2020.07.31.c,6555.0


The content of columns `case_number`, `case_number.1`, `case_number.2` and `original_order` is just a way to reference each case and is not relevant to the analysis. So, they will also be removed.

In [9]:
# Remove the columns 'case_number', 'case_number.1', 'case_number.2' and 'original_order'
df = df.drop(columns=['case_number', 'case_number.1', 'case_number.2', 'original_order'])

# Check the result
df.head()

Unnamed: 0,date,year,type,country,area,location,activity,name,sex,age,injury,fatal,time,species
0,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,50.0,Minor lacerations to left leg,N,11h00,
1,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m"
2,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'"
3,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'"
4,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,22.0,Lacerations to foot,N,17h00,


The column `name` is also irrelevant, since there will not be a specific analysis of each individual. So, this column can be removed, and the information about the resulting dataframe is checked afterwards.

In [10]:
# Remove column 'name'
df = df.drop(columns='name')

# Check the result
df.head()

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
0,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,F,50.0,Minor lacerations to left leg,N,11h00,
1,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m"
2,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'"
3,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'"
4,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,F,22.0,Lacerations to foot,N,17h00,


In [11]:
# Check information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25792 entries, 0 to 25791
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      6558 non-null   object 
 1   year      6556 non-null   float64
 2   type      6552 non-null   object 
 3   country   6508 non-null   object 
 4   area      6091 non-null   object 
 5   location  6008 non-null   object 
 6   activity  6002 non-null   object 
 7   sex       5987 non-null   object 
 8   age       3660 non-null   object 
 9   injury    6528 non-null   object 
 10  fatal     6006 non-null   object 
 11  time      3139 non-null   object 
 12  species   3610 non-null   object 
dtypes: float64(1), object(12)
memory usage: 2.6+ MB


## Rows

### Rows containing only missing values

There are 25792 rows in total, but no column has more than 6558 non-null values. So, as mentioned before, many rows contain only missing values (`NaN`) and can be removed.

In [12]:
# Remove all rows that contains only NaN
df = df.dropna(how='all')

# Check the result
print(f'Number of rows after removing the rows containing only missing values: {df.shape[0]}\n')

# Check information about the dataframe
df.info()

Number of rows after removing the rows containing only missing values: 6558

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6558 entries, 0 to 6557
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      6558 non-null   object 
 1   year      6556 non-null   float64
 2   type      6552 non-null   object 
 3   country   6508 non-null   object 
 4   area      6091 non-null   object 
 5   location  6008 non-null   object 
 6   activity  6002 non-null   object 
 7   sex       5987 non-null   object 
 8   age       3660 non-null   object 
 9   injury    6528 non-null   object 
 10  fatal     6006 non-null   object 
 11  time      3139 non-null   object 
 12  species   3610 non-null   object 
dtypes: float64(1), object(12)
memory usage: 717.3+ KB
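
The choice of `how='all'` matters here: it removes only the rows that are entirely empty and keeps rows that are partially filled. A small sketch on made-up data shows the difference from `how='any'`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one empty, one partially filled
toy = pd.DataFrame({
    'date': ['20-Aug-2020', np.nan, '14-Aug-2020'],
    'age': [50, np.nan, np.nan],
})

only_empty_removed = toy.dropna(how='all')   # drops the fully empty row only
any_missing_removed = toy.dropna(how='any')  # also drops the partially filled row

print(len(only_empty_removed), len(any_missing_removed))
```

Using `how='any'` on this dataset would discard every incident with an unknown age or time, which is a large share of the records.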


### Duplicate rows

In [13]:
# Check if there are duplicate rows
df.duplicated().sum()

7

In [14]:
# Check duplicated rows
df[df.duplicated(keep=False)]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
3577,Reported 26-Jun-1972,1972.0,Unprovoked,AUSTRALIA,Queensland,Pancake Creek,,M,,FATAL,Y,,
3578,Reported 26-Jun-1972,1972.0,Unprovoked,AUSTRALIA,Queensland,Pancake Creek,,M,,FATAL,Y,,
3622,1971,1971.0,Unprovoked,IRAN,Khuzestan Province,"Ahvaz, on the Karun River",,M,,Survived,N,,
3623,1971,1971.0,Unprovoked,IRAN,Khuzestan Province,"Ahvaz, on the Karun River",,M,,Survived,N,,
4490,Aug-1956,1956.0,Provoked,UNITED KINGDOM,Cornwall,The Lizard,Attempting to kill a shark with explosives,M,,"FATAL, PROVOKED INCIDENT",Y,,
4491,Aug-1956,1956.0,Provoked,UNITED KINGDOM,Cornwall,The Lizard,Attempting to kill a shark with explosives,M,,"FATAL, PROVOKED INCIDENT",Y,,
4940,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,M,,Calf nipped in each case,N,,"""small sharks"""
4941,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,M,,Calf nipped in each case,N,,"""small sharks"""
5719,Reported 10-Oct-1906,1906.0,Unprovoked,USA,Hawaii,,Swimming,M,,FATAL,Y,,
5721,Reported 10-Oct-1906,1906.0,Unprovoked,USA,Hawaii,,Swimming,M,,FATAL,Y,,


As shown above, there are some rows that are duplicated. Therefore, they can be removed.
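
The two checks above use different `keep` settings, so they count different things: `duplicated()` flags only the extra copies, while `duplicated(keep=False)` flags every row that has a twin. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows; the first two are identical
toy = pd.DataFrame({
    'date': ['1971', '1971', 'Aug-1956', 'Fall 1943'],
    'country': ['IRAN', 'IRAN', 'UNITED KINGDOM', 'USA'],
})

n_extra = int(toy.duplicated().sum())               # copies beyond the first occurrence
n_involved = int(toy.duplicated(keep=False).sum())  # every row that has at least one twin
deduped = toy.drop_duplicates(keep='first')         # keeps one row per group

print(n_extra, n_involved, len(deduped))
```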

In [15]:
# Remove duplicated rows
df = df.drop_duplicates(keep='first')

# Check the result
df.duplicated().sum()

0

In [16]:
# Check the new number of rows
n_rows = df.shape[0]
print(f'After removing the duplicates, the dataframe has {n_rows} rows.')

After removing the duplicates, the dataframe has 6551 rows.


# Specific Data Cleaning & Manipulation - Columns

## Gender

The gender is represented by the column `sex`.

In [17]:
# Check the unique values in the column
print(df.sex.unique())

['F' 'M' nan 'M ' 'lli' 'M x 2' 'N' '.']


In [18]:
# Check the number of each value
df.sex.value_counts()

M        5280
F         693
N           2
M           2
.           1
M x 2       1
lli         1
Name: sex, dtype: int64

In [19]:
# Remove unnecessary spaces in the values
df.sex = df.sex.str.strip()

# Check the result
df.sex.value_counts()

M        5282
F         693
N           2
.           1
M x 2       1
lli         1
Name: sex, dtype: int64

In [20]:
# Check the rows with values that are not 'M', nor 'F', nor NaN
invalid_sex = ~df.sex.isin(['M', 'F']) & df.sex.notna()
print(f'They represent only {invalid_sex.mean() * 100:.2f}% of the dataset.')
df[invalid_sex]

They represent only 0.08% of the dataset.


Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
1867,11-Nov-2004,2004.0,Unprovoked,USA,California,"Bunkers, Humboldt Bay, Eureka, Humboldt County",Surfing,lli,38.0,"Lacerations to hand, knee & thigh",N,13h30,5.5 m [18'] white shark
3205,23-Oct-1962,1982.0,Sea Disaster,USA,Carolina coast,,Yacht Trashman capsized in storm,M x 2,,FATAL,Y x 2,,
5191,11-Jul-1934,1934.0,Watercraft,AUSTRALIA,New South Wales,Cronulla,Fishing,N,,No injury to occupants Sharks continually foll...,N,,"Blue pointer, 11'"
5690,Reported 02-Jun-1908,1908.0,Sea Disaster,PAPUA NEW GUINEA,New Britain,Matupi,.,.,,"Remains of 3 humans recovered from shark, but ...",Y,,Allegedly a 33-foot shark
6386,Reported 18-Dec-1801,1801.0,Provoked,,,,Standing on landed shark's tail,N,,"FATAL, PROVOKED INCIDENT",Y,,12' shark


Given the low percentage, they will not be considered in the analysis. In other words, they will be removed.

In [21]:
print(f'Before the removal, there were {n_rows} rows.', end=' ')

# Remove the rows whose 'sex' value is not 'M', nor 'F', nor NaN
df = df.drop(df[~df.sex.isin(['M', 'F']) & df.sex.notna()].index)

# Number of rows
n_rows = df.shape[0]
print(f'After the removal, there are {n_rows} rows.')

Before the removal, there were 6551 rows. After the removal, there are 6546 rows.


In [22]:
# Check rows with missing values in the 'sex' column
df[df.sex.isna()]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
8,29-Jul-2020,2020.0,Watercraft,AUSTRALIA,Tasmania,Tenth Island,Sightseeing,,,"No injury to occupants, injury to shark attemp...",,09h08,"White shark, 4m"
22,20-Jun-2020,2020.0,Unprovoked,BAHAMAS,Exumas,"Pig Beach, Pig Island",,,,,N,,
32,17 May 2020,2020.0,Unprovoked,FRENCH POLYNESIA,Tahiti,Vallée Blanche,Snorkeling,,,Leg bitten,N,,Blacktip shark
48,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,,"No injury, but paddleboard bitten",N,09h40,Tiger shark
65,20-Dec-2019,2019.0,Provoked,AUSTRALIA,New South Wales,Shellharbour,Fishing,,,PROVOKED INCIDENT,,,White shark
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6468,Before 2004,0.0,Watercraft,MOZAMBIQUE,Inhambane Province,Off Inhambane,Fishing,,,"No injury to occupants, shark bumped boat",N,,Whale shark
6493,"No date, Before 1963",0.0,Unprovoked,SINGAPORE,,"Keppel Harbor, 2 miles from Singapore city ce...",Swimming,,,Recovered,N,,
6515,1941-1945,0.0,Sea Disaster,,,,A group of survivors on a raft for 17-days,,,"FATAL, shark leapt into raft and bit the man w...",Y,Late afternoon,1.2 m [4'] shark
6534,Between 1918 & 1939,0.0,Unprovoked,REUNION,Saint-Denis,Barachois,Swimming,,,FATAL,Y,,


In [23]:
# Check % of missing values
df.sex.isna().mean()*100

8.722884204094102

For now, they will remain in the dataframe, but they may be removed later.

## Age

### Data Cleaning

In [24]:
# Check the unique values in the column
print(f'There are {len(list(df.age.unique()))} unique values in this column.\n')
print(df.age.unique())

There are 237 unique values in this column.

[50 35 55 nan 22 14 28 38 4 63 23 11 12 10 29 15 36 7 16 30 60 18 9 26 57
 'Teen' 24 59 13 75 21 '30s' 45 33 17 37 70 44 '28 & 22' 32 20 51
 '22, 57, 31' '60s' 40 49 "20's" 43 8 64 19 65 67 53 34 25 58 74 46 41 31
 '9 & 60' 48 '20s' 42 39 56 61 'a minor' 6 62 52 54 69 '40s' 3 82 73 68 47
 66 72 27 71 '38' '39' '23' '32' '52' '68' '12' '18' '19' '43' '47' '6'
 '37' '9' '36' '10' '16' '13' '11' '17' '14' '30' '50' '29' '65' '63' '26'
 '71' '48' '70' '58' '18 months' '22' '41' '35' '57' '20' '24' '34' '15'
 '44' '53' '7' '40' '28' '33' '31' '45' '50s' '8' '51' '61' '42' '25'
 'teen' '66' '21' '77' '46' '60' '74' '55' '27' '3' '56' '64' '28 & 26'
 '62' '5' '49' '54' '86' '59' '18 or 20' '12 or 13' '46 & 34'
 '28, 23 & 30' 'Teens' 77 '36 & 26' '8 or 10' 84 '\xa0 ' ' ' '30 or 36'
 '6½' '21 & ?' '33 or 37' 'mid-30s' '23 & 20' 5 ' 30' '7      &    31'
 ' 28' '20?' "60's" '69' '32 & 30' '16 to 18' '87' '67' 'Elderly'
 'mid-20s' 'Ca. 33' '74 ' '45 ' '

In [25]:
# Check the number of each value
df.age.value_counts()

19             86
16             86
18             82
17             82
17             80
               ..
a minor         1
18 or 20        1
?    &   14     1
X               1
1               1
Name: age, Length: 236, dtype: int64

In [26]:
# Convert all values in this column to string if it is not a missing value
df.age = df.age.apply(lambda x: str(x) if pd.notna(x) else x)

# Check result
df.age.unique()

array(['50', '35', '55', nan, '22', '14', '28', '38', '4', '63', '23',
       '11', '12', '10', '29', '15', '36', '7', '16', '30', '60', '18',
       '9', '26', '57', 'Teen', '24', '59', '13', '75', '21', '30s', '45',
       '33', '17', '37', '70', '44', '28 & 22', '32', '20', '51',
       '22, 57, 31', '60s', '40', '49', "20's", '43', '8', '64', '19',
       '65', '67', '53', '34', '25', '58', '74', '46', '41', '31',
       '9 & 60', '48', '20s', '42', '39', '56', '61', 'a minor', '6',
       '62', '52', '54', '69', '40s', '3', '82', '73', '68', '47', '66',
       '72', '27', '71', '18 months', '50s', 'teen', '77', '28 & 26', '5',
       '86', '18 or 20', '12 or 13', '46 & 34', '28, 23 & 30', 'Teens',
       '36 & 26', '8 or 10', '84', '\xa0 ', ' ', '30 or 36', '6½',
       '21 & ?', '33 or 37', 'mid-30s', '23 & 20', ' 30',
       '7      &    31', ' 28', '20?', "60's", '32 & 30', '16 to 18',
       '87', 'Elderly', 'mid-20s', 'Ca. 33', '74 ', '45 ', '21 or 26',
       '20 ', '>50', '

Converting all the values to the same data type makes it easier to manipulate them.
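
The drop in unique values comes from integers and their string twins (for example `50` and `'50'`) becoming the same value. The sketch below reproduces this on a made-up series. Stripping whitespace is an extra step assumed here for illustration; the cell above does not do it:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type column: an int, its string twin, a padded string and a missing value
ages = pd.Series([50, '50', ' 30', np.nan], dtype=object)

# Convert every non-missing value to a stripped string; leave NaN untouched
as_text = ages.map(lambda v: str(v).strip() if pd.notna(v) else v)

print(ages.nunique(), as_text.nunique())  # distinct non-missing values before and after
```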

In [27]:
len(df.age.unique())

164

In [28]:
# Check rows that contains '&'
df[df.age.str.contains('&', na=False)]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
80,29-Oct-2019,2019.0,Unprovoked,AUSTRALIA,Queensland,"Off Airlie Beach, Whitsundays",Snorkeling,M,28 & 22,Raddon’s right foot was severed and Maggs sust...,N,10h20,Tiger shark
168,10-Jan-2019,2019.0,Invalid,AUSTRALIA,Queensland,"Catseye Beach, Hamilton Island, Whitsundays",Wading,F,9 & 60,Injuries to foot & leg,N,09h30,Reported as shark attacks but injuries caused ...
940,10-Mar-2013,2013.0,Unprovoked,PHILIPPINES,Palawan,Off Likas Island,Swimming to shore with floatioon devices after...,M,28 & 26,Minor leg injuries,N,,"""small sharks"""
1749,Reported 28-Jan-2006,2006.0,Watercraft,ATLANTIC OCEAN,300 miles from Antigua,,Competing in the Woodvale Atlantic Rowing Race,M,46 & 34,No injury to occupants; shark rammed boat repe...,N,,12' shark
1751,23-Jan-2006,2006.0,Watercraft,ATLANTIC OCEAN,800 miles from land,,Competing in the Woodvale Atlantic Rowing Race,F,"28, 23 & 30","No injury to occupants; a shark, accidentally ...",N,,
1986,14-Sep-2003,2003.0,Watercraft,SOUTH AFRICA,Western Cape Province,Melkbosstrand,Fishing,,36 & 26,"No injury to occupants, shark bit boat",N,09h40,2 m cow shark
2168,12-Aug-2001,2001.0,Unprovoked,THAILAND,Rayong Province,Laem Mae Pim Beach,Fell off banana boat,M,21 & ?,Legs bitten,N,,3 m [10'] shark
2269,10-Aug-2000,2000.0,Invalid,USA,Florida,Florida Keys,Attempting to illegally enter the USA,M,23 & 20,Shark involvement probably post-mortem,,,Shark involvement prior to death was not confi...
2665,Reported 11-Sep-1994,1994.0,Sea Disaster,USA,Florida,Florida Straits,Adrift on refugee raft,M,7 & 31,FATAL,Y,,
2754,04-Jan-1993,1993.0,Watercraft,CARIBBEAN SEA,,Off Dominican Republic,,M,32 & 30,"No injury to occupants. Sharks, attracted to o...",N,,Two 3 m [10'] oceanic whitetip sharks


In [29]:
# Number of rows that contain "&"
r_clean = len(df[df.age.str.contains("&", na=False)])
print(f'There are {r_clean} rows containing "&" and they represent only {(r_clean / n_rows) * 100:.2f}% of the dataset.\n'
      'Since they have ages of more than one person, they will be removed from the dataframe')

There are 24 rows containing "&" and they represent only 0.37% of the dataset.
Since they have ages of more than one person, they will be removed from the dataframe

In [30]:
print(f'Before the removal, there were {n_rows} rows.', end=' ')

# Remove the rows mentioned above
df = df.drop(df[df.age.str.contains('&', na=False)].index)

# Number of rows
n_rows = df.shape[0]
print(f'After the removal, there are {n_rows} rows.')

Before the removal, there were 6546 rows. After the removal, there are 6522 rows.
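
The `&` filter does not catch comma-separated lists such as `'22, 57, 31'`, which also hold the ages of more than one person. A possible broader pattern is sketched below on a made-up series. Treating any comma as a sign of a multi-person entry is an assumption, not something verified against the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical age values, including two multi-person entries
ages = pd.Series(['28 & 22', '22, 57, 31', '18 or 20', '35', np.nan], dtype=object)

# Flag values containing '&' or ','; missing values are treated as non-matching
multi_person = ages.str.contains(r'[&,]', na=False)

print(ages[multi_person].tolist())
```

Values such as `'18 or 20'` are deliberately left alone, since they describe one person with an uncertain age.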


In [31]:
# Check the unique values
print(f'Now, there are {len(list(df.age.unique()))} unique values in this column.\n')
print(df.age.unique())

Now, there are 140 unique values in this column.

['50' '35' '55' nan '22' '14' '28' '38' '4' '63' '23' '11' '12' '10' '29'
 '15' '36' '7' '16' '30' '60' '18' '9' '26' '57' 'Teen' '24' '59' '13'
 '75' '21' '30s' '45' '33' '17' '37' '70' '44' '32' '20' '51' '22, 57, 31'
 '60s' '40' '49' "20's" '43' '8' '64' '19' '65' '67' '53' '34' '25' '58'
 '74' '46' '41' '31' '48' '20s' '42' '39' '56' '61' 'a minor' '6' '62'
 '52' '54' '69' '40s' '3' '82' '73' '68' '47' '66' '72' '27' '71'
 '18 months' '50s' 'teen' '77' '5' '86' '18 or 20' '12 or 13' 'Teens'
 '8 or 10' '84' '\xa0 ' ' ' '30 or 36' '6½' '33 or 37' 'mid-30s' ' 30'
 ' 28' '20?' "60's" '16 to 18' '87' 'Elderly' 'mid-20s' 'Ca. 33' '74 '
 '45 ' '21 or 26' '20 ' '>50' '18 to 22' 'adult' '9 months' '25 to 35' '1'
 '(adult)' '25 or 28' 'X' '"middle-age"' '13 or 18' '2 to 3 months'
 'MAKE LINE GREEN' ' 43' '81' '"young"' '7 or 8' '78' 'F' 'Both 11'
 '9 or 10' 'young' '  ' 'A.M.' '10 or 12' '31 or 33' '2½' '13 or 14']


In [32]:
# Check rows where the age value does not make sense or cannot be categorized
df[df.age.isin(['\xa0 ', 'X', ' ', '20?', '>50', '13 or 18', 'MAKE LINE GREEN', 'Ca. 33',
                '"young"', 'F', 'Both 11', 'young', '  ', 'A.M.'])]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
2067,27-Sep-2002,2002.0,Provoked,USA,Florida,"Key Largo, Monroe County",Fishing,M,,Left thumb lacerated PROVOKED INCIDENT,N,Afternoon,1.8 m [6'] blacktip shark
2068,27-Sep-2002,2002.0,Provoked,TONGA,Vava'u,Swimming with humpback whales,Swimming,M,,Thigh lacerated PROVOKED INCIDENT,N,15h00,"Tiger shark, 1.5 m [5']k"
2684,06-Apr-1994,1994.0,Unprovoked,BAHAMAS,,Rum Cay,Wading,M,20?,Lower right leg bitten,N,10h30,2' to 3' shark
3229,May 1982,1982.0,Invalid,USA,South Carolina,"Daws Island, Broad River (near Beaufort)",,F,Ca. 33,Human remains recovered from tiger shark,,,Shark involvement prior to death was not confi...
3355,29-Dec-1978,1978.0,Unprovoked,AUSTRALIA,Queensland,Bribie Island,,M,,Survived,N,,
3612,11-Apr-1971,1971.0,Unprovoked,SOUTH AFRICA,Western Cape Province,Buffels Bay,Swimming,M,>50,"FATAL, multiple bites",Y,11h30,White shark according to tooth pattern and wit...
4302,07-Oct-1959,1959.0,Unprovoked,MOZAMBIQUE,Maputo Province,"Xefina Island, Bay of Maputo",Swimming,M,X,"FATAL, died in Lourenco Marques Hospital",Y,,
4325,11-Aug-1959,1959.0,Unprovoked,JAPAN,Wakayama Prefecture,"Isonoura Beach, Wakayama City",Swimming,M,13 or 18,"FATAL, left thigh bitten",Y,14h30,"Blue shark, 3 m [10']"
4643,08-Jan-1953,1953.0,Watercraft,AUSTRALIA,Tasmania,Wynyard,Fishing,,MAKE LINE GREEN,"No injury to occupant, shark charged boat",N,Afternoon,10' to 12' shark
4700,24-Mar-1951,1951.0,Unprovoked,AUSTRALIA,New South Wales,Sydney,"Fishing, casting in the surf",M,"""young""",Severe lacerations of chest & thigh,N,,5.5 m [18'] shark


In [33]:
# Number of rows whose age value cannot be categorized
r_clean = df.age.isin(['\xa0 ', 'X', ' ', '20?', '>50', '13 or 18', 'MAKE LINE GREEN',
                       'Ca. 33', '"young"', 'F', 'Both 11', 'young', '  ', 'A.M.']).sum()
print(f'There are {r_clean} rows with values that cannot be categorized and they represent only '
      f'{(r_clean / n_rows) * 100:.2f}% of the dataset.\n'
      'Since they cannot be categorized, they will be removed from the dataframe')

There are 16 rows with values that cannot be categorized and they represent only 0.25% of the dataset.
Since they cannot be categorized, they will be removed from the dataframe

In [34]:
print(f'Before the removal, there were {n_rows} rows.', end=' ')

# Remove the rows mentioned above (this list additionally includes '30s' and 'a minor')
df = df.drop(df[df.age.isin(['\xa0 ', 'X', ' ', '20?', '>50', '13 or 18', 'MAKE LINE GREEN',
                             'Ca. 33', '"young"', 'F', 'Both 11', 'young', '  ', 'A.M.', '30s',
                             'a minor'])].index)

# Number of rows
n_rows = df.shape[0]
print(f'After the removal, there are {n_rows} rows.')

Before the removal, there were 6522 rows. After the removal, there are 6498 rows.


In [35]:
# Check the unique values
print(f'Now, there are {len(list(df.age.unique()))} unique values in this column.\n')
print(df.age.unique())

Now, there are 125 unique values in this column.

['50' '35' '55' nan '22' '14' '28' '38' '4' '63' '23' '11' '12' '10' '29'
 '15' '36' '7' '16' '30' '60' '18' '9' '26' '57' 'Teen' '24' '59' '13'
 '75' '21' '45' '33' '17' '37' '70' '44' '32' '20' '51' '22, 57, 31' '60s'
 '40' '49' "20's" '43' '8' '64' '19' '65' '67' '53' '34' '25' '58' '74'
 '46' '41' '31' '48' '20s' '42' '39' '56' '61' 'a minor' '6' '62' '52'
 '54' '69' '40s' '3' '82' '73' '68' '47' '66' '72' '27' '71' '18 months'
 '50s' 'teen' '77' '5' '86' '18 or 20' '12 or 13' 'Teens' '8 or 10' '84'
 '30 or 36' '6½' '33 or 37' 'mid-30s' ' 30' ' 28' "60's" '16 to 18' '87'
 'Elderly' 'mid-20s' '74 ' '45 ' '21 or 26' '20 ' '18 to 22' 'adult'
 '9 months' '25 to 35' '1' '(adult)' '25 or 28' '"middle-age"'
 '2 to 3 months' ' 43' '81' '7 or 8' '78' '9 or 10' '10 or 12' '31 or 33'
 '2½' '13 or 14']


In [36]:
# Check % of missing values
df.age.isna().mean()*100

44.429055093875036

About 44% of the ages are missing, so removing these rows would greatly reduce the dataset. For now, they will remain in the dataframe and will be analyzed again later.
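
The same check can be run for every column at once, which helps when deciding which characteristics have enough data to analyze. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical columns with different amounts of missing data
toy = pd.DataFrame({
    'sex': ['M', 'F', np.nan, 'M'],
    'age': ['50', np.nan, np.nan, '22'],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```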

### Categorizing the age

The `age` column will be categorized as follows:

<table>
    <thead>
        <tr>
            <th>AGE</th>
            <th>CATEGORY</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>≤ 11</td>
            <td>Child</td>
        </tr>
        <tr>
            <td>12 - 17</td>
            <td>Teenager</td>
        </tr>
        <tr>
            <td>18 - 35</td>
            <td>Young Adult</td>
        </tr>
        <tr>
            <td>36 - 65</td>
            <td>Adult</td>
        </tr>
        <tr>
            <td>> 65</td>
            <td>Elder</td>
        </tr>
        <tr>
            <td>NaN</td>
            <td>No category</td>
        </tr>
    </tbody>
</table>
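
For values that are already numeric, the ranges in this table can be expressed with `pd.cut`. The sketch below is an alternative for illustration, not the approach used in this notebook, and it assumes that age 11 belongs to `Child`, as in the `cat_ages` function. The sample ages are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric ages; NaN stands for a missing age
ages = pd.Series([4, 11, 12, 17, 18, 35, 36, 65, 66, np.nan])

# Bin edges: each interval includes its right edge, e.g. (11, 17] is Teenager
bins = [-np.inf, 11, 17, 35, 65, np.inf]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elder']

categories = pd.cut(ages, bins=bins, labels=labels)

# pd.cut leaves NaN uncategorized; cast to object so a new label can be filled in
categories = categories.astype(object).fillna('No category')
print(categories.tolist())
```

This only covers clean numbers. Free-text values such as `'Teen'` or `'18 months'` still need explicit handling.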

In [48]:
# Create a function to categorize the ages
def cat_ages(age):
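    '''
    Map a raw age value to an age category.

    Numeric values are categorized by range, known free-text values are
    matched against explicit lists, and NaN becomes 'No category'.
    '''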
    
    try:
        age = int(age)
        if age > 65:
            return 'Elder'
        elif age > 35:
            return 'Adult'
        elif age > 17:
            return 'Young Adult'
        elif age > 11:
            return 'Teenager'
        else:
            return 'Child'
    except (ValueError, TypeError):
        if age != age:
            return 'No category'
        elif age in ['18 months', '8 or 10', '6½', '9 months', '2 to 3 months', '7 or 8', '9 or 10', '10 or 12', '2½']:
            return 'Child'
        elif age in ['Teen', 'teen', '12 or 13', 'Teens', '16 to 18', '13 or 14']:
            return 'Teenager'
        elif age in ["20's", '20s', '18 or 20', 'mid-20s', '30 or 36', '21 or 26', '18 to 22', '25 to 35', '25 or 28',
                     '31 or 33']:
            return 'Young Adult'
        elif age in ['40s', '50s', 'mid-30s', '33 or 37', 'adult', '(adult)', '"middle-age"']:
            return 'Adult'
        elif age in ['60s', "60's", 'Elderly']:
            return 'Elder'

In [49]:
# Categorize the ages
df['cat_age'] = df.age.apply(cat_ages)

# Check the result
df.cat_age.value_counts(dropna=False)

No category    2887
Young Adult    1723
Adult           840
Teenager        751
Child           235
Elder            60
NaN               2
Name: cat_age, dtype: int64
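
The `NaN` entries in the counts above are values that match none of the branches in `cat_ages`: a Python function that reaches its end without a `return` yields `None`. The sketch below uses a simplified stand-in, `toy_categorize` (a hypothetical name), and made-up values to show how such unmatched values can be listed:

```python
import numpy as np
import pandas as pd

def toy_categorize(age):
    """Simplified stand-in for cat_ages: falls through (returns None) on unknown text."""
    try:
        return 'Adult' if int(age) > 35 else 'Young Adult'
    except (ValueError, TypeError):
        if age != age:  # NaN is the only value that is not equal to itself
            return 'No category'
        if age in ['40s', '50s']:
            return 'Adult'
        # No final return: any other text yields None

ages = pd.Series(['50', '40s', np.nan, '22, 57, 31'], dtype=object)
cats = ages.apply(toy_categorize)

# Values that received no category at all
unmatched = ages[cats.isna()].tolist()
print(unmatched)
```

Applying the same filter to `df.age` and `df.cat_age` would show which raw values still need a rule or should be removed.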

In [51]:
# Check number of rows
df.cat_age.value_counts(dropna=False).sum() == n_rows

True

In [52]:
# Check information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6498 entries, 0 to 6557
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      6498 non-null   object 
 1   year      6496 non-null   float64
 2   type      6492 non-null   object 
 3   country   6449 non-null   object 
 4   area      6035 non-null   object 
 5   location  5954 non-null   object 
 6   activity  5946 non-null   object 
 7   sex       5932 non-null   object 
 8   age       3611 non-null   object 
 9   injury    6468 non-null   object 
 10  fatal     5952 non-null   object 
 11  time      3116 non-null   object 
 12  species   3571 non-null   object 
 13  cat_age   6496 non-null   object 
dtypes: float64(1), object(13)
memory usage: 761.5+ KB


In [53]:
# Check the dataset
df.head()

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species,cat_age
0,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,F,50.0,Minor lacerations to left leg,N,11h00,,Adult
1,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m",Young Adult
2,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'",Adult
3,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'",No category
4,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,F,22.0,Lacerations to foot,N,17h00,,Young Adult


## Activity

## Extra - Year

In [43]:
# Check the unique values in the column
df.year.unique()

array([2020., 2019., 2018., 2017.,   nan, 2016., 2015., 2014., 2013.,
       2012., 2011., 2010., 2009., 2008., 2007., 2006., 2005., 2004.,
       2003., 2002., 2001., 2000., 1999., 1998., 1997., 1996., 1995.,
       1984., 1994., 1993., 1992., 1991., 1990., 1989., 1969., 1988.,
       1987., 1986., 1985., 1983., 1982., 1981., 1980., 1979., 1978.,
       1977., 1976., 1975., 1974., 1973., 1972., 1971., 1970., 1968.,
       1967., 1966., 1965., 1964., 1963., 1962., 1961., 1960., 1959.,
       1958., 1957., 1956., 1955., 1954., 1953., 1952., 1951., 1950.,
       1949., 1948., 1848., 1947., 1946., 1945., 1944., 1943., 1942.,
       1941., 1940., 1939., 1938., 1937., 1936., 1935., 1934., 1933.,
       1932., 1931., 1930., 1929., 1928., 1927., 1926., 1925., 1924.,
       1923., 1922., 1921., 1920., 1919., 1918., 1917., 1916., 1915.,
       1914., 1913., 1912., 1911., 1910., 1909., 1908., 1907., 1906.,
       1905., 1904., 1903., 1902., 1901., 1900., 1899., 1898., 1897.,
       1896., 1895.,

# Analyzing the dataset

## Gender

## Activity

## Age

In [None]:
df.Age.unique()

In [None]:
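# Function to convert a value to an integer; returns -1 when the conversion fails
def convert_str_to_int(age):
    try:
        return int(age)
    except (ValueError, TypeError, OverflowError):
        # Non-numeric strings and missing values (NaN, None) end up here
        return -1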
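
The same conversion can probably be done without a row-by-row `apply`. The sketch below uses `pd.to_numeric` with `errors='coerce'` on a small hypothetical sample (the `ages` Series is made up for illustration, it is not the real column). One difference to keep in mind: a string such as `'35.5'` would become `35` here, while `convert_str_to_int` would return `-1` for it.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'Age' column (not the real data)
ages = pd.Series(['50', '35', 'teen', None, '22', '18 or 20'])

# errors='coerce' turns every value that is not a plain number into NaN
age_num = pd.to_numeric(ages, errors='coerce')

# Use -1 as the sentinel for "not a number" and cast to integers
age_int = age_num.fillna(-1).astype(int)

print(age_int.tolist())
```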

In [None]:
# Number of rows with a non-missing date
df_total = df.Date.count()
print(df_total)

# Convert the strings in column 'Age' to integers
df['age_int'] = df.Age.apply(convert_str_to_int)

# Share of values that could not be converted to an integer
print(df.loc[df.age_int == -1, 'age_int'].value_counts() / df_total)

# Check the result
df

In [None]:
# Classification by age
df['age_cat'] = np.where(df['age_int'] > 65, 'Elder',
                        np.where(df['age_int'] > 35, 'Adult',
                                np.where(df['age_int'] > 17, 'Young Adult',
                                        np.where(df['age_int'] > 12, 'Teenager',
                                                 np.where(df['age_int'] == -1, '-', 'Child')))))
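
The nested `np.where` calls above could also be written with `pd.cut`, which keeps the age limits in one list. This is only a sketch on hypothetical ages (the `age_int` Series below is made up); it assumes the same limits as the cell above and that `-1` is the only negative value.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; -1 marks values that could not be converted
age_int = pd.Series([8, 13, 17, 18, 35, 36, 65, 66, -1])

# Right-closed bins: (-1, 12], (12, 17], (17, 35], (35, 65], (65, inf]
bins = [-1, 12, 17, 35, 65, np.inf]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elder']

age_cat = pd.cut(age_int, bins=bins, labels=labels)

# -1 falls outside every bin and becomes NaN; map it to '-' like the cell above
age_cat = age_cat.astype(object).fillna('-')

print(age_cat.tolist())
```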

## Year

In [None]:
df.Year.unique()

In [None]:
# Convert the values in column 'Year' to integers
df['year_int'] = df.Year.apply(convert_str_to_int)

# Check the result
df

In [None]:
# Check the % of incidents that occurred before 1801
print(f'The incidents before 1801 represent only {(df[df.Year < 1801].Year.count() / df.Year.count()) * 100:.2f}%',
      'of the dataset. So, only the incidents from 1801 onward will be analysed.', sep=' ')

In [None]:
# Select only the incidents from 1801 onward (copy to avoid SettingWithCopyWarning when adding columns later)
df = df.loc[df['Year'] >= 1801, :].copy()
df.info()

In [None]:
df

In [None]:
# Classification by year
df['century'] = np.where(df['year_int'] >= 2001, 21,
                            np.where(df['year_int'] >= 1901, 20, 19))
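
The century can also be derived arithmetically instead of listing each limit. The sketch below assumes the usual convention that a century runs from year xx01 to year xx00, which matches the limits used above; the `years` Series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical years around the century limits
years = pd.Series([1801, 1900, 1901, 2000, 2001, 2020])

# 1801-1900 -> 19, 1901-2000 -> 20, 2001-2100 -> 21
century = (years - 1) // 100 + 1

print(century.tolist())
```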

In [None]:
df

## Gender

In [None]:
# Check the column 'Sex'
df.Sex.unique()

In [None]:
df.Sex.count()

In [None]:
# % of each value
df.Sex.value_counts() / df.Sex.count()

In [None]:
# Number of missing values (NaN)
df.Sex.isna().sum()

In [None]:
# Check the number of missing values and their share of all rows
print(f'The number of NaN in the column "Sex" is {df.Sex.isna().sum()}, which represents '
      f'{df.Sex.isna().mean() * 100:.2f}% of the dataset.')

Since NaN represents only a small part of the dataset, the rows containing NaN in the column 'Sex' will be removed.

In [None]:
# Remove the rows where the column 'Sex' is a missing value
df = df.dropna(subset=['Sex'])

In [None]:
# Check information about the changed dataset
df.info()

In [None]:
# Check values in the column 'Sex'
df.Sex.value_counts()

In [None]:
# Remove unnecessary spaces in the values of the column 'Sex'
df['Sex'] = df['Sex'].str.strip()
df.Sex.value_counts()

In [None]:
# Remove the rows where the gender is 'N', 'lli', '.' or 'M x 2'
df = df.drop(df[df.Sex.isin(['.', 'N', 'lli', 'M x 2'])].index)
df.Sex.value_counts()

In [None]:
# Check information about the changed dataset
# 5 rows were removed
df.info()

In [None]:
# Check column 'Sex' percentage
df_gender_total = df.Sex.value_counts().sum()
df.Sex.value_counts() / df_gender_total

In [None]:
print(f'Males represent {(df.Sex.value_counts() / df_gender_total)["M"]*100:.2f}% of the people who were '
      f'involved in incidents with sharks, while females represent only {(df.Sex.value_counts() / df_gender_total)["F"]*100:.2f}%.')
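
Dividing `value_counts()` by a total, as done above, can also be written with the `normalize=True` argument. A minimal sketch on a hypothetical `sex` Series (made-up values, not the real column):

```python
import pandas as pd

# Hypothetical sample of the 'Sex' column, including one missing value
sex = pd.Series(['M', 'M', 'F', 'M', None, 'M', 'F', 'M'])

# normalize=True returns shares of the non-missing values instead of counts
share = sex.value_counts(normalize=True) * 100

print(share.round(2))
```

Missing values are ignored by default, so the shares add up to 100% of the rows with a known gender.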

In [None]:
# Create a subset containing only females
df_female = df.loc[df['Sex'] == 'F']

In [None]:
# Create a subset containing only males
df_male = df.loc[df['Sex'] == 'M']
df_male.info()

In [None]:
# Check males age
df_male.age_cat.value_counts()

In [None]:
df_male_total = df_male.Date.count()
df_male_total

In [None]:
# Check males age
df_male.age_cat.value_counts() / df_male_total

In [None]:
(df_male.age_cat.value_counts() / df_male_total)*100

In [None]:
df_male_age = df_male.loc[df_male['age_cat'] != '-']
df_male_age_total = df_male_age.Date.count()
df_male_age.info()

In [None]:
df_male_age_total

In [None]:
df_male_age.age_cat.value_counts()

In [None]:
df_male_age.age_cat.value_counts() / df_male_age_total

In [None]:
(df_male_age.age_cat.value_counts() / df_male_age_total)*100

Among the males with a known age, 50% of those involved in a shark incident were Young Adults.

In [None]:
df_male_ya = df_male_age[df_male_age['age_cat'] == 'Young Adult']
df_male_ya.info()

In [None]:
df_male_ya_total = df_male_ya.Date.count()
df_male_ya.groupby(by='Activity').age_cat.count().sort_values(ascending=False)

In [None]:
(df_male_ya.groupby(by='Activity').age_cat.count().sort_values(ascending=False) / df_male_ya_total)*100

**Young adult males who were surfing were the group most often involved in shark incidents.**
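
The step-by-step filtering above (gender, then age category, then activity) could also be cross-checked with one grouped count over the three columns. The sketch below runs on a small hypothetical `sample` DataFrame that only reuses the column names of this notebook; the values are made up.

```python
import pandas as pd

# Hypothetical rows using the notebook's column names (not the real data)
sample = pd.DataFrame({
    'Sex': ['M', 'M', 'M', 'F', 'M', 'F'],
    'age_cat': ['Young Adult', 'Young Adult', 'Adult', 'Young Adult', 'Young Adult', 'Child'],
    'Activity': ['Surfing', 'Surfing', 'Fishing', 'Swimming', 'Swimming', 'Swimming'],
})

# Count every (gender, age category, activity) combination
combo_counts = sample.groupby(['Sex', 'age_cat', 'Activity']).size()

# Most frequent combination and how often it appears
top_combo = combo_counts.idxmax()
print(top_combo, combo_counts.max())
```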

# Exporting dataset

In [None]:
df

In [None]:
#df.to_csv('sharks_clean.csv')

# Extra - Analysis through centuries

## Setup

In [None]:
# Create subsets for each century
df_cen21 = df.loc[df['century'] == 21, :]
df_cen20 = df.loc[df['century'] == 20, :]
df_cen19 = df.loc[df['century'] == 19, :]

In [None]:
df_cen21_total = df_cen21.Date.count()
print(df_cen21_total)
df_cen21.info()

In [None]:
df_cen20_total = df_cen20.Date.count()
print(df_cen20_total)
df_cen20.info()

In [None]:
df_cen19_total = df_cen19.Date.count()
print(df_cen19_total)
df_cen19.info()

Less than a quarter of the 21st century has passed, and the number of shark incidents is already getting close to the number of incidents in the whole 20th century. However, we also have to consider that not all incidents may have been registered in the last century.
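
Because the centuries cover different numbers of years in the data, one way to make this comparison fairer is to divide each count by the years covered. The counts below are hypothetical placeholders, not results from this notebook; the years covered assume the data runs from 1801 to 2020.

```python
import pandas as pd

# Hypothetical incident counts per century (placeholders, not real results)
counts = pd.Series({19: 500, 20: 3000, 21: 2000})

# Years covered in the data: 1801-1900, 1901-2000 and 2001-2020
years_covered = pd.Series({19: 100, 20: 100, 21: 20})

# Average number of incidents per year in each century
per_year = counts / years_covered

print(per_year)
```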

## Gender

In [None]:
df_cen21.Sex.value_counts() / df_cen21_total

In [None]:
(df_cen21.Sex.value_counts() / df_cen21_total) * 100

In [None]:
df_cen20.Sex.value_counts() / df_cen20_total

In [None]:
(df_cen20.Sex.value_counts() / df_cen20_total) * 100

In [None]:
df_cen19.Sex.value_counts() / df_cen19_total

In [None]:
(df_cen19.Sex.value_counts() / df_cen19_total) * 100

This result likely reflects the typical behavior of each gender in each century. A long time ago, women would usually just take care of the house and were not allowed to do many things.
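
The six cells above can probably be condensed into one table with `pd.crosstab`, where `normalize='index'` makes each row (century) add up to 100%. The `sample` DataFrame is hypothetical and only reuses the column names `century` and `Sex`.

```python
import pandas as pd

# Hypothetical rows (not the real data)
sample = pd.DataFrame({
    'century': [19, 19, 20, 20, 20, 21, 21, 21, 21],
    'Sex':     ['M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F'],
})

# Share of each gender within each century, in %
table = pd.crosstab(sample['century'], sample['Sex'], normalize='index') * 100

print(table.round(2))
```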

## Activity

In [None]:
df_cen21.Activity.value_counts() / df_cen21_total

In [None]:
(df_cen21.Activity.value_counts() / df_cen21_total) * 100

In [None]:
df_cen20.Activity.value_counts() / df_cen20_total

In [None]:
(df_cen20.Activity.value_counts() / df_cen20_total) * 100

In [None]:
df_cen19.Activity.value_counts() / df_cen19_total

In [None]:
(df_cen19.Activity.value_counts() / df_cen19_total) * 100
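
Instead of one subset per century, the activity shares could be computed in a single grouped call. This is a sketch on a hypothetical `sample` DataFrame; unlike the cells above, the shares here are relative to the rows with a known activity in each century.

```python
import pandas as pd

# Hypothetical rows (not the real data)
sample = pd.DataFrame({
    'century': [20, 20, 20, 21, 21, 21, 21],
    'Activity': ['Swimming', 'Swimming', 'Fishing', 'Surfing', 'Surfing', 'Surfing', 'Swimming'],
})

# Share of each activity within each century, in %
shares = sample.groupby('century')['Activity'].value_counts(normalize=True) * 100

# Keep only the first (most frequent) activity of each century
top_per_century = shares.groupby(level='century').head(1)

print(top_per_century)
```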