## Project Shark Attack

### HYPOTHESIS

A key piece of scientific background on shark attacks on surfers: when surfers lie on top of their surfboards, they look much like sea lions and turtles, which are sharks' prey. This is called the 'Mistaken Identity Theory' (https://royalsocietypublishing.org/doi/10.1098/rsif.2021.0533).
The belief, then, is that sharks mistake surfers for their natural prey.
Another relevant factor is the size of the surfboard. Longboards, for instance, are large surfboards (9 ft+). As elsewhere in the animal kingdom, predators are in most cases more likely to attack prey smaller than themselves. The database has no information about the victims' height or weight, nor about the size of the surfboards they were riding at the time of the incident. However, it does provide demographic information such as age.
Younger surfers are smaller than adults and ride smaller surfboards, making them easier prey for sharks. Although adult surfers outnumber young surfers, this logic would make the smaller individual the primary target for sharks.  

First Hypothesis: SURFERS UP TO 19 YEARS OLD ARE MORE LIKELY TO BE ATTACKED BY SHARKS.

If you paddle out at famous spots such as Hawaii and Australia, you will notice that older men, and women of all ages, usually ride longboards. Although the number of women riding shortboards has grown rapidly over the last decade, there is still a discrepancy.  

Second Hypothesis: EVEN IF WE ADD THE NUMBER OF OLDER MALE SURFER VICTIMS (62+ YEARS) TO THE NUMBER OF FEMALE SURFER VICTIMS OF ALL AGES, THE RESULT WILL STILL BE SMALLER THAN THE NUMBER OF VICTIMS UP TO 19 YEARS OLD.
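
As tested in this notebook, both hypotheses reduce to comparisons between raw counts. A minimal sketch of that logic follows; the counts and variable names are placeholders invented for illustration, not results from the dataset:

```python
# Placeholder counts, invented purely for illustration (not GSAF results).
attacks_by_group = {"Youth": 10, "Young Adult": 7, "Adult": 4, "Elderly": 1}
women_all_ages = 3
elderly_men = 1

# First hypothesis: the youth group has the highest count of all age groups.
first_holds = max(attacks_by_group, key=attacks_by_group.get) == "Youth"

# Second hypothesis: youth victims outnumber elderly men plus women of all ages.
second_holds = attacks_by_group["Youth"] > elderly_men + women_all_ages

print(first_holds, second_holds)
```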

### GENERAL

In [None]:
#import libraries
import pandas as pd
import numpy as np
import re
import pickle
import matplotlib.pyplot as plt

In [None]:
#open the dictionary needed to analyze what the victims were doing when they were attacked by a shark
with open('surfing_dict.pickle', 'rb') as f:
    surfing_dict = pickle.load(f)

In [None]:
#read and display the dataframe shark attacks
#read the excel file in a variable called shark_attacks
shark_attacks = pd.read_excel(r'C:\Users\PC\Desktop\Ironhack\WR_Ironhack_Projects\Shark-Attack\GSAF5.xls')

#display the dataframe
shark_attacks

In [None]:
#check general info about the dataframe
#look for null columns
shark_attacks.info()

In [None]:
#treat and display the dataframe shark attacks
#remove trailing spaces from all column titles
shark_attacks.columns = shark_attacks.columns.str.rstrip()

#eliminate columns filled with NaN elements and all the others that have no use to my analysis
shark_attacks.drop(columns = ['Unnamed: 21', 'Unnamed: 22', 'Date', 'Name', 'Species', 'Type', 'Time', 
                              'Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number', 
                              'Case Number.1', 'original order', 'Injury'], inplace = True)


#eliminate rows filled with ONLY NaN values
shark_attacks.dropna(axis = 0, how = 'all', inplace = True)

#display the first 10 rows of the dataframe
shark_attacks.head(10)

### AGE

In [None]:
#treat and present results of column Age
#check the length of column Age and print how many shark attack cases have been reported
len_age = len(shark_attacks['Age'])
print(f'As of 2023, there were {len_age} reported shark attacks worldwide.')

#check unique elements in Age column
#print(shark_attacks['Age'].unique().tolist())

#check how many elements appear only once on Age column
value_counts_age = shark_attacks['Age'].value_counts()
unique_values_age = value_counts_age[value_counts_age == 1]
unique_list_age = unique_values_age.index.tolist()
#print(unique_list_age)
#print(len(unique_list_age))

#convert Age to numeric; entries that cannot be parsed become NaN
shark_attacks['Age'] = pd.to_numeric(shark_attacks['Age'], errors = 'coerce')

#convert NaN to -1 (assign back instead of inplace to avoid chained-assignment issues)
shark_attacks['Age'] = shark_attacks['Age'].fillna(-1)

#convert column Age to int
shark_attacks['Age'] = shark_attacks['Age'].astype(int)

#check how many people from 1 - 19 years old were attacked (0 and -1 are treated as unknown age)
age_0_19 = shark_attacks[(shark_attacks['Age'] >= 1) & (shark_attacks['Age'] <= 19)]
count_age_0_19 = len(age_0_19)
print(f'\nThe number of people attacked by sharks from ages 1 to 19 is {count_age_0_19}.')

#create classes of different age groups and get the % of each group in column Age
age_count_group = pd.cut(shark_attacks['Age'], [-2, 0, 19, 40, 61, 110], labels = ['No Class', 'Youth (1-19)',
             'Young Adults (20-40)', 'Adults (41-61)', 'Elderly (62+)']).value_counts(normalize = True) * 100

age_count_group = age_count_group.round().astype(int)
print(f'\n% of shark attacks for each group:\n{age_count_group}')
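
The bin edges passed to `pd.cut` above are right-inclusive, which is what decides where ages such as 19, 40 and 61 land. A small sketch on toy ages (invented values, with shortened labels) shows the boundaries:

```python
import pandas as pd

# Toy ages chosen to sit on the bin edges; -1 stands for a missing age.
ages = pd.Series([-1, 0, 1, 19, 20, 40, 41, 61, 62])

# Same edges as the cell above; each bin is right-inclusive: (left, right].
groups = pd.cut(ages, [-2, 0, 19, 40, 61, 110],
                labels=['No Class', 'Youth', 'Young Adult', 'Adult', 'Elderly'])

print(pd.DataFrame({'age': ages, 'group': groups}))
```

With these edges, 61 still counts as Adult and Elderly starts at 62.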

### SEX

In [None]:
#treat and present results of column Sex
#convert all the letters in the column Sex to uppercase
shark_attacks['Sex'] = shark_attacks['Sex'].str.upper()

#remove spaces
shark_attacks['Sex'] = shark_attacks['Sex'].str.replace(' ', '')

#check unique elements in Sex column
#print(shark_attacks['Sex'].unique().tolist())

#check how many NaN in column Sex
nan_sum_sex = shark_attacks['Sex'].isna().sum()

#replace NaN in Sex column with the acronym 'unk'
shark_attacks['Sex'] = shark_attacks['Sex'].fillna('unk')

#replace the unknown items with the acronym 'unk' 
unk_words_sex = {'LLI': 'unk', 'MX2': 'M', 'N': 'unk', '.': 'unk'}
shark_attacks['Sex'] = shark_attacks['Sex'].replace(unk_words_sex)

#confirm that 'unk' is the only unique element besides 'M' and 'F'
different_sex = shark_attacks.loc[~shark_attacks['Sex'].isin(['M', 'F']), 'Sex']
unique_different_sex = different_sex.unique()
#print(unique_different_sex)

#how many shark attacks on M and F
count_male = shark_attacks['Sex'].str.count('M').sum()
count_female = shark_attacks['Sex'].str.count('F').sum()
print(f'According to collected data, {count_male} men and {count_female} women were attacked by sharks worldwide.')

#present the % of attacks in both genders
sex_attacks = shark_attacks['Sex'].value_counts(True) * 100
sex_attacks = sex_attacks.round().astype(int)
print(f'\n% of attacks in Men(M) and Women(F):\n{sex_attacks}')
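
To see what the cleaning steps in this cell do, here is a sketch that applies the same chain to a handful of toy values. The input values are invented to imitate the messy entries handled above:

```python
import pandas as pd

# Toy values imitating the kinds of entries the cell above handles.
raw_sex = pd.Series(['M', 'f', ' M ', 'M x 2', 'lli', None, '.'])

clean_sex = (raw_sex.str.upper()                          # 'f' -> 'F'
                    .str.replace(' ', '', regex=False)    # 'M X 2' -> 'MX2'
                    .fillna('unk')                        # missing -> 'unk'
                    .replace({'LLI': 'unk', 'MX2': 'M', 'N': 'unk', '.': 'unk'}))

print(clean_sex.value_counts())
```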

### ACTIVITY

In [None]:
#treat and present results of column Activity
#convert all the letters in the column Activity to lower
shark_attacks['Activity'] = shark_attacks['Activity'].str.lower()

#check if any Activity entry contains a digit
num_act = shark_attacks['Activity'].str.contains(r'\d', na = False).any()
#num_act

#check if any Activity entry is purely numeric
num_only_act = pd.to_numeric(shark_attacks['Activity'], errors = 'coerce').notnull().any()
#num_only_act

#check how many NaN in column Activity
nan_sum_act = shark_attacks['Activity'].isna().sum()
#nan_sum_act

#replace NaN in Activity column with the acronym 'unk'
shark_attacks['Activity'] = shark_attacks['Activity'].fillna('unk')

#check the list of unique elements in column Activity
#print(sorted(shark_attacks['Activity'].unique().tolist()))

#treatment of the column Activity with the dictionary that was loaded in the beginning of the code
shark_attacks['Activity'] = shark_attacks['Activity'].map(surfing_dict).fillna('unk')

#how many shark attacks on surfers
count_surfing = shark_attacks['Activity'].str.count('surfing').sum()
print(f"Although almost 50% of the incidents caused by sharks have no clear information about what exactly the victim was doing, surfing appears to be the riskiest known activity when considering shark attacks.\nIn the last 200 years {count_surfing} people were victims of shark attacks while surfing.")

#present the % of attacks for each activity
activity_attacks = shark_attacks['Activity'].value_counts(True) * 100
#activity_attacks = activity_attacks.round().astype(int)
print(f'\n% of attacks for each activity:\n\n{activity_attacks}.')
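
The `map(...).fillna('unk')` step above relabels every activity that is not a key in the dictionary. The contents of `surfing_dict` are not shown in this notebook, so the sketch below uses a hypothetical stand-in dictionary and invented activity strings:

```python
import pandas as pd

# Hypothetical stand-in for surfing_dict: raw description -> category.
example_dict = {'surfing': 'surfing', 'body boarding': 'surfing',
                'spearfishing': 'fishing'}

activity = pd.Series(['Surfing', 'body boarding', 'Spearfishing', 'wading', None])

# map() returns NaN for anything that is not a key, so fillna labels the rest.
mapped = activity.str.lower().map(example_dict).fillna('unk')
print(mapped.tolist())
```

Any description missing from the dictionary ends up as 'unk', so the size of the 'unk' share depends directly on how complete the dictionary is.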

### ATTACKS OUTCOME 

In [None]:
#treat and present results of column Fatal 
#convert all the letters in the column Fatal to uppercase
shark_attacks['Fatal (Y/N)'] = shark_attacks['Fatal (Y/N)'].str.upper()

#remove spaces
shark_attacks['Fatal (Y/N)'] = shark_attacks['Fatal (Y/N)'].str.replace(' ', '')

#check list of unique elements in column Fatal
#print(shark_attacks['Fatal (Y/N)'].unique().tolist())

#check how many NaN in column Fatal
nan_sum_fatal = shark_attacks['Fatal (Y/N)'].isna().sum()
#nan_sum_fatal

#replace NaN with 'unk'
shark_attacks['Fatal (Y/N)'] = shark_attacks['Fatal (Y/N)'].fillna('unk')

#create a dictionary to replace words
unk_words_fatal = {'F': 'Y', 'UNKNOWN': 'unk', 'YX2': 'Y', 'M': 'unk', 'NQ': 'unk'}
shark_attacks['Fatal (Y/N)'] = shark_attacks['Fatal (Y/N)'].replace(unk_words_fatal)

#check if there is something besides the proper answers (Y/N) or 'unk'
different_fatal = shark_attacks.loc[~shark_attacks['Fatal (Y/N)'].isin(['Y', 'N']), 'Fatal (Y/N)']
unique_different_fatal = different_fatal.unique()
#print(unique_different_fatal)

#count the data and present the results for Fatal column
count_n_fatal = shark_attacks['Fatal (Y/N)'].str.count('N').sum()
count_y_fatal = shark_attacks['Fatal (Y/N)'].str.count('Y').sum()
count_u_fatal = shark_attacks['Fatal (Y/N)'].str.count('unk').sum()
print(f'According to data collected worldwide, {count_y_fatal} people were killed by sharks and {count_n_fatal} were attacked but survived.\nThere are also {count_u_fatal} reported cases with an unknown outcome.')
#print(count_n_fatal + count_y_fatal + count_u_fatal)

### AGE x GENDER x SURFERS x FATALITIES

In [None]:
#create and display a new column called Age Groups
#create a function to determine the surfer's age group
def age_group(age):
    """Use conditionals to determine which age group a given age belongs to.
    Example: input: 35
             output: 'Young Adult'"""
    if age <= 0:
        return 'No Class'
    elif age <= 19:
        return 'Youth'
    elif age <= 40:
        return 'Young Adult'
    elif age <= 61:
        #61 is still Adult so that Elderly starts at 62, matching the bins used in the AGE section
        return 'Adult'
    else:
        return 'Elderly'

#create the new column
shark_attacks.insert(7, 'Age Group', shark_attacks['Age'].apply(age_group))

#after the calculation was done, transform -1 to unk to be more presentable and readable in the dataframe
shark_attacks['Age'] = shark_attacks['Age'].replace(-1, 'unk')

#present final dataframe
shark_attacks.head(10)

### TESTING HYPOTHESIS

#### First: Surfers aged 1 to 19 are the most likely to be attacked by sharks.

In [None]:
#show shark attacks on surfers separated by age group 
#group once and reuse the result instead of repeating the groupby for every age group
attacks_by_group_activity = shark_attacks.groupby(['Age Group', 'Activity']).size()
no_class_surfers = attacks_by_group_activity['No Class', 'surfing']
youth_surfers = attacks_by_group_activity['Youth', 'surfing']
y_adult_surfers = attacks_by_group_activity['Young Adult', 'surfing']
adult_surfers = attacks_by_group_activity['Adult', 'surfing']
elderly_surfers = attacks_by_group_activity['Elderly', 'surfing']

In [None]:
#present chart first hypothesis
first_hypothesis = pd.pivot_table(shark_attacks.query('Activity == "surfing"'), index = 'Age Group', columns = 'Activity', aggfunc = 'count', values = 'Year').loc[['Youth', 'Young Adult', 'Adult', 'Elderly']].plot(kind = 'bar')

first_hypothesis.set_ylabel('Shark Attacks on Surfers')
first_hypothesis.set_title('First Hypothesis')

plt.show()

In [None]:
#print results
print(f'The total number of shark attacks on surfers was {no_class_surfers + youth_surfers + y_adult_surfers + adult_surfers + elderly_surfers}.')
print(f'\n{no_class_surfers} surfers of unknown age were victims of shark attacks.')
print(f'{youth_surfers} children and teenagers were victims of shark attacks while surfing.')
print(f'{y_adult_surfers} young adults were victims of shark attacks while surfing.')
print(f'{adult_surfers} adults were victims of shark attacks while surfing.')
print(f'{elderly_surfers} elderly people were victims of shark attacks while surfing.')

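#find the age group with the most attacks instead of hard-coding the conclusion
surfer_counts = {'Youth': youth_surfers, 'Young Adult': y_adult_surfers, 'Adult': adult_surfers, 'Elderly': elderly_surfers}
most_affected = max(surfer_counts, key = surfer_counts.get)
print(f'\nThe results show that my first hypothesis - Surfers from 1 to 19 years old are most likely to be attacked by sharks - was {most_affected == "Youth"}.\nThe age group most affected by this kind of incident is {most_affected}, with {surfer_counts[most_affected]} attacks.')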

In [None]:
#demonstrate how many people of each age group and gender were victims of shark attacks while surfing, and the fatality outcomes
no_class_surfers_x = shark_attacks[(shark_attacks['Age Group'] == 'No Class') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M')].shape[0]
no_class_surfers_y = shark_attacks[(shark_attacks['Age Group'] == 'No Class') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F')].shape[0]

youth_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Youth') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M')].shape[0]
youth_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Youth') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F')].shape[0]

y_adult_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Young Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M')].shape[0]
y_adult_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Young Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F')].shape[0]

adult_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M')].shape[0]
adult_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F')].shape[0]

elderly_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Elderly') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M')].shape[0]
elderly_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Elderly') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F')].shape[0]

#add the variables indicating gender to check if the result is equal to the number of shark attacks on surfers 
total = no_class_surfers_x+no_class_surfers_y+youth_surfers_m+youth_surfers_f+y_adult_surfers_m+y_adult_surfers_f+adult_surfers_m+adult_surfers_f+elderly_surfers_m+elderly_surfers_f
#print(total)

f_no_class_surfers_x = shark_attacks[(shark_attacks['Age Group'] == 'No Class') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]
f_no_class_surfers_y = shark_attacks[(shark_attacks['Age Group'] == 'No Class') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]

f_youth_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Youth') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]
f_youth_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Youth') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]

f_y_adult_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Young Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]
f_y_adult_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Young Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]

f_adult_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]
f_adult_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Adult') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]

f_elderly_surfers_m = shark_attacks[(shark_attacks['Age Group'] == 'Elderly') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'M') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]
f_elderly_surfers_f = shark_attacks[(shark_attacks['Age Group'] == 'Elderly') & (shark_attacks['Activity'] == 'surfing') & (shark_attacks['Sex'] == 'F') & (shark_attacks['Fatal (Y/N)'] == 'Y')].shape[0]

print('The gender and age group of the victims of shark attacks while surfing, as well as the fatality outcomes, were:\n')
print(f'{youth_surfers_m} male children and teenagers.\n{f_youth_surfers_m} cases were fatal.\n')
print(f'{youth_surfers_f} female children and teenagers.\n{f_youth_surfers_f} cases were fatal.\n')
print(f'{y_adult_surfers_m} young adult men.\n{f_y_adult_surfers_m} cases were fatal.\n')
print(f'{y_adult_surfers_f} young adult women.\n{f_y_adult_surfers_f} cases were fatal.\n')
print(f'{adult_surfers_m} adult men.\n{f_adult_surfers_m} cases were fatal.\n')
print(f'{adult_surfers_f} adult women.\n{f_adult_surfers_f} cases were fatal.\n')
print(f'{elderly_surfers_m} elderly men.\n{f_elderly_surfers_m} cases were fatal.\n')
print(f'{elderly_surfers_f} elderly women.\n{f_elderly_surfers_f} cases were fatal.\n')
print(f'{no_class_surfers_x} men of unknown age.\n{f_no_class_surfers_x} cases were fatal.\n')
print(f'{no_class_surfers_y} women of unknown age.\n{f_no_class_surfers_y} cases were fatal.')
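
The per-group counts in this cell can also be produced in one step with `pd.crosstab`. The following is a sketch of that alternative, not the approach used above; the rows are invented and only the column names match the cleaned dataframe:

```python
import pandas as pd

# Toy rows with the same column names as the cleaned dataframe.
toy = pd.DataFrame({
    'Age Group':   ['Youth', 'Youth', 'Adult', 'Elderly', 'Youth'],
    'Activity':    ['surfing', 'surfing', 'surfing', 'surfing', 'unk'],
    'Sex':         ['M', 'F', 'M', 'M', 'M'],
    'Fatal (Y/N)': ['N', 'Y', 'N', 'Y', 'N'],
})

surfers = toy[toy['Activity'] == 'surfing']

# Victims per age group and sex.
victims = pd.crosstab(surfers['Age Group'], surfers['Sex'])

# Fatal cases per age group and sex.
fatal = surfers[surfers['Fatal (Y/N)'] == 'Y']
fatalities = pd.crosstab(fatal['Age Group'], fatal['Sex'])

print(victims)
print(fatalities)
```

Each table has one row per age group and one column per sex, so a single lookup such as `victims.loc['Youth', 'M']` replaces a dedicated variable.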

#### Second: The number of children and teenagers of all genders who were victims of shark attacks while surfing is larger than the combined number of women of all ages and elderly people of all genders who were attacked by a shark while surfing 

In [None]:
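#demonstrate the difference between attacks on youth surfers and women+elderly
woman_surfer = no_class_surfers_y + youth_surfers_f + y_adult_surfers_f + adult_surfers_f + elderly_surfers_f
#elderly_surfers already includes elderly women, so subtract them to avoid counting them twice
woman_elderly_surfers = woman_surfer + elderly_surfers - elderly_surfers_f

#the hypothesis holds only if youth victims outnumber the combined group
result = youth_surfers > woman_elderly_surfers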

In [None]:
#present chart second hypothesis
second_hypothesis = pd.pivot_table(shark_attacks.query('Activity == "surfing"'), index = 'Age Group', columns = 'Sex', aggfunc = 'count', values = 'Year')
second_hypothesis_chart = pd.DataFrame({'Youth': youth_surfers, 'Woman+Elderly': woman_elderly_surfers}, index=[0]).plot(kind='bar', figsize=(8,5))
bar_names = {'youth_surfers': 'Youth Surfers', 'woman_elderly_surfers': 'Woman+Elderly'}
second_hypothesis_chart.set_ylabel('Shark Attacks on Surfers')
second_hypothesis_chart.set_title('Second Hypothesis')

plt.show()

In [None]:
#print results second hypothesis
print(f'The total number of victims among women of all ages and elderly people while surfing is {woman_elderly_surfers}.\nThe number of children and teenagers of all genders who were victims of shark attacks while surfing is {youth_surfers}.')
print(f'\nThe results show that my second hypothesis - The number of children and teenagers of all genders who were victims of shark attacks while surfing is larger than the combined number of women of all ages and elderly people of all genders who were attacked by a shark while surfing - was {result}.')
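
Elderly women belong both to "women of all ages" and to "elderly", so a combined count has to include each victim only once. A minimal sketch of one way to do that with boolean masks, on invented rows that reuse the notebook's column names:

```python
import pandas as pd

# Toy surfing victims; column names match the notebook, values are invented.
toy = pd.DataFrame({
    'Age Group': ['Youth', 'Youth', 'Elderly', 'Elderly', 'Adult', 'Adult'],
    'Sex':       ['M',     'F',     'F',       'M',       'F',     'M'],
})

is_woman = toy['Sex'] == 'F'
is_elderly = toy['Age Group'] == 'Elderly'

# The | operator takes the union, so an elderly woman is counted once.
women_or_elderly = (is_woman | is_elderly).sum()
youth = (toy['Age Group'] == 'Youth').sum()

print(women_or_elderly, youth, youth > women_or_elderly)
```

Adding the two masks' sums separately would give 5 here instead of 4, because the elderly woman would be counted twice.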