# Main hypothesis

El Niño is the warm phase of the El Niño–Southern Oscillation (ENSO) and is associated with a band of warm ocean water that develops in the central and east-central equatorial Pacific, including the area off the Pacific coast of South America. The ENSO is the cycle of warm and cold sea surface temperature (SST) of the tropical central and eastern Pacific Ocean. 

El Niño phases recur roughly every four years on average; however, records show that the cycles have lasted between two and seven years. During the development of El Niño, rainfall develops between September and November. The cool phase of ENSO is La Niña. The ENSO cycle, including both El Niño and La Niña, causes global changes in temperature and rainfall.


![ElNino_map.png](attachment:ElNino_map.png)

The largest El Niño event of the 21st century was in 2015-16. Numerous news reports have described 2015 as the worst year on record for shark attacks.


![News1.png](attachment:News1.png)


![News2.png](attachment:News2.png)

![News3.png](attachment:News3.png)

NULL HYPOTHESIS (H0): THE DATA REALLY SHOW AN INCREASE IN CASES DURING THE YEAR 2015 DUE TO THE "EL NIÑO" EFFECT
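
One rough way to make this hypothesis checkable is to ask how far the 2015 count lies from the counts of the other years. The sketch below does that with a simple z-score. The yearly figures are invented for illustration only and are not taken from the dataset; the helper name `year_z_score` is also hypothetical.

```python
import pandas as pd

# Toy yearly counts, invented purely for illustration (NOT from the dataset)
yearly_cases = pd.Series(
    {2010: 80, 2011: 90, 2012: 85, 2013: 95, 2014: 100, 2015: 130, 2016: 105},
    name="cases",
)

def year_z_score(counts: pd.Series, year: int) -> float:
    """Return how many standard deviations `year` lies from the mean of the other years."""
    others = counts.drop(index=year)
    return float((counts.loc[year] - others.mean()) / others.std(ddof=1))

z_2015 = year_z_score(yearly_cases, 2015)
print(f"z-score of 2015 against the other years: {z_2015:.2f}")
```

A large positive value would only say that 2015 is unusual; it would not by itself show that El Niño is the cause.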

## Let's start working with our (partially) cleaned database

In [23]:
# Basic imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, display_html
from itertools import chain, cycle

# Display several DataFrames side by side in the same row, each with an optional title

def display_side_by_side(*args, titles=cycle([''])):
    html_str = ''
    for df, title in zip(args, chain(titles, cycle(['</br>']))):
        html_str += '<th style="text-align:center"><td style="vertical-align:top">'
        html_str += f'<h2 style="text-align: center;">{title}</h2>'
        html_str += df.to_html().replace('table', 'table style="display:inline"')
        html_str += '</td></th>'
    display_html(html_str, raw=True)

# Custom function to summarize missing values and duplicated rows

def display_errors(df):

    # isna() and isnull() are aliases in pandas, so both tables show the same counts
    total_nas = pd.DataFrame(df.isna().sum())
    total_null = pd.DataFrame(df.isnull().sum())
    duplicated = pd.DataFrame(df.duplicated().value_counts())

    return display_side_by_side(total_nas, total_null, duplicated, titles=['Sum of NAs', "Sum of Nulls", "Number of Duplicates"])

In [2]:
# Check if the database is ok

sharks = pd.read_csv("../data/attacks_cleaned.csv")
display(sharks.head(), display_errors(sharks), sharks.shape)


Unnamed: 0,0
Month,0
Year,0
Type,0
Country,0
Location,12
Activity,16
Sex,8
Age,221
Injury,0
Fatal (Y/N),78

Unnamed: 0,0
Month,0
Year,0
Type,0
Country,0
Location,12
Activity,16
Sex,8
Age,221
Injury,0
Fatal (Y/N),78

Unnamed: 0,0
False,2874


Unnamed: 0,Month,Year,Type,Country,Location,Activity,Sex,Age,Injury,Fatal (Y/N),Time,Species
0,Jun,2018,Boating,USA,"Oceanside, San Diego County",Paddling,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark
1,Jun,2018,Unprovoked,USA,"St. Simon Island, Glynn County",Standing,F,11,Minor injury to left thigh,N,14h00 -15h00,
2,Jun,2018,Invalid,USA,"Habush, Oahu",Surfing,M,48,Injury to left lower leg from surfboard skeg,N,07h45,
3,Jun,2018,Unprovoked,BRAZIL,"Piedade Beach, Recife",Swimming,M,18,FATAL,Y,Late afternoon,Tiger shark
4,May,2018,Unprovoked,USA,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,M,52,Minor injury to foot. PROVOKED INCIDENT,N,,Lemon shark


None

(2874, 12)

Because we want to compare the number of cases in 2015 with the rest of the years, and also with other El Niño years, our main columns will be "Month" and "Year". El Niño is a seasonal phenomenon, so we need to look at the differences month by month.
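
As a minimal sketch of that month-by-month comparison, the snippet below sets the monthly counts of one target year against the mean of all other years. It runs on a small invented frame that only mimics the "Month" and "Year" columns; the function name `monthly_profile` and the toy values are assumptions, not part of the original analysis.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy records standing in for the "Month" and "Year" columns (invented values)
toy = pd.DataFrame({
    "Year":  [2014, 2014, 2015, 2015, 2015, 2016, 2016],
    "Month": ["Jan", "Jul", "Jan", "Jul", "Jul", "Jul", "Dec"],
})

def monthly_profile(df: pd.DataFrame, target_year: int) -> pd.DataFrame:
    """Compare the monthly counts of `target_year` with the mean of all other years."""
    # Rows are years, columns are months in calendar order; missing months count as 0
    table = pd.crosstab(df["Year"], df["Month"]).reindex(columns=months, fill_value=0)
    target = table.loc[target_year]
    baseline = table.drop(index=target_year).mean()
    return pd.DataFrame({"target": target, "other_years_mean": baseline})

profile = monthly_profile(toy, 2015)
print(profile.loc[["Jan", "Jul", "Dec"]])
```

With the real data, `toy` would presumably be replaced by `sharks[["Year", "Month"]]`.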

![ElNino_occurrences.png](attachment:ElNino_occurrences.png)

Worst El Niño cases: 
- 1578
- 1728
- 1790-93
- 1828
- 1876-78
- 1891
- 1925-26
- 1982-83
- 1997-98
- 2014-16
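
To use the list above in code, the abbreviated ranges need to be expanded into individual years. The following is a sketch of one way to do that and to flag matching rows. The helper `expand_event`, the `ElNino` column name, and the toy frame are all hypothetical.

```python
import pandas as pd

# Event labels copied from the list above; "1790-93" means 1790 through 1793
events = ["1578", "1728", "1790-93", "1828", "1876-78",
          "1891", "1925-26", "1982-83", "1997-98", "2014-16"]

def expand_event(label: str) -> list:
    """Turn '1790-93' into [1790, 1791, 1792, 1793] and '1578' into [1578]."""
    start_text, _, end_suffix = label.partition("-")
    start = int(start_text)
    if not end_suffix:
        return [start]
    # The end year is abbreviated, so borrow the missing leading digits from the start year
    end = int(start_text[: len(start_text) - len(end_suffix)] + end_suffix)
    return list(range(start, end + 1))

el_nino_years = {year for label in events for year in expand_event(label)}

# Toy frame standing in for the dataset's "Year" column (invented values)
toy = pd.DataFrame({"Year": [1703, 1983, 2015, 2018]})
toy["ElNino"] = toy["Year"].isin(el_nino_years)
print(toy)
```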

Let's see what we have:

In [3]:
min_=sharks["Year"].min()
max_=sharks["Year"].max()

print(f"The minimum year is {min_} and the maximum is {max_}")

The minimum year is 1703 and the maximum is 2018


We can now group our data by year and month:

In [60]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

sharks["Month"] = pd.Categorical(sharks["Month"], categories=months, ordered=True)
sharks.sort_values(by=["Month", "Year"],inplace=True)
sharks.reset_index(drop=True, inplace=True)  # without inplace (or reassignment) the new index is discarded

#OPTIONAL

sharks_month_year = pd.DataFrame(sharks.groupby("Year")["Month"].value_counts())

display(sharks_month_year)

"""Total number of cases by month and year"""

Unnamed: 0_level_0,Unnamed: 1_level_0,Month
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1703,Mar,1
1703,Jan,0
1703,Feb,0
1703,Apr,0
1703,May,0
...,...,...
2018,Aug,0
2018,Sep,0
2018,Oct,0
2018,Nov,0


'Total number of cases by month and year'
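
The table above contains many zero-count rows because "Month" is categorical, so every category is listed for every year. If only the months with at least one case are of interest, one possible alternative is to group on both columns with `observed=True`. This is a sketch on invented toy records; the `Cases` column name is an assumption.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy records standing in for the dataset (invented values)
toy = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2016],
    "Month": ["Jul", "Jul", "Jan", "Dec"],
})
toy["Month"] = pd.Categorical(toy["Month"], categories=months, ordered=True)

# observed=True keeps only (Year, Month) pairs that actually occur,
# so months without cases do not appear as zero-count rows
cases = (
    toy.groupby(["Year", "Month"], observed=True)
       .size()
       .rename("Cases")
       .reset_index()
)
print(cases)
```

The result is a flat frame with one row per year and month, which tends to be easier to filter or plot than the multi-indexed version.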

In [58]:
#sharks_month_year.plot(kind = "bar", ylabel="dsads", figsize = (200,50) )