# CLEANING

![alt text](https://miro.medium.com/max/413/0*Cir0TzUEkHMbb8QB "Cleaning data")

- We load the **libraries, functions, and data set** that we will use for the cleaning

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
import sys
sys.path.append("src/")
from src.cleaning_functions import *

In [3]:
ds = pd.read_csv("data/attacks.csv",encoding = "ISO-8859-1")

- We make a copy of the data set, just in case.

In [4]:
df = ds.copy()

# DATA SET EXPLORATION

![alt text](https://memegenerator.net/img/instances/40379228/let-me-take-a-look-at-this.jpg "Take a look")

In [5]:
df.shape  # 25,723 rows and 24 columns

(25723, 24)

In [6]:
df.sample()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
9452,,,,,,,,,,,...,,,,,,,,,,


df.isna().sum()

Many columns are quite dirty, with almost every value being NaN.
In fact, "Unnamed: 22" and "Unnamed: 23" hold only one and two values, respectively.

In [7]:
df[["Unnamed: 22","Unnamed: 23"]].notna().sum()

Unnamed: 22    1
Unnamed: 23    2
dtype: int64

- We can look at the fraction of NaN in each column to gauge how dirty (or usable) it is:

In [8]:
df.isnull().sum().apply(lambda x: x/df.shape[0]).sort_values(ascending=False)

Unnamed: 22               0.999961
Unnamed: 23               0.999922
Time                      0.885394
Species                   0.865335
Age                       0.865062
Sex                       0.776970
Activity                  0.776154
Location                  0.775998
Fatal (Y/N)               0.775959
Area                      0.772694
Name                      0.763169
Country                   0.756949
Injury                    0.756094
Investigator or Source    0.755666
Type                      0.755161
Year                      0.755083
href formula              0.755044
Date                      0.755005
pdf                       0.755005
href                      0.755005
Case Number.1             0.755005
Case Number.2             0.755005
original order            0.754733
Case Number               0.661704
dtype: float64
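
As a side note, the same per-column NaN fraction can be obtained a little more directly, because the mean of a boolean mask is the share of `True` values. A minimal sketch on a toy frame (the `toy` data is invented for illustration and is not the shark-attack data):

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# Fraction of NaN per column, computed by counting and dividing ...
by_count = toy.isnull().sum().apply(lambda x: x / toy.shape[0])
# ... and via the mean of the boolean mask.
by_mean = toy.isna().mean()

# Dirtiest columns first.
nan_share = by_mean.sort_values(ascending=False)
print(nan_share)
```

Both expressions give the same Series; the second one just avoids the explicit division.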

In [9]:
df.loc[df['Unnamed: 22'].notna()]  # "stopped here": junk value

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
1478,2006.05.27,27-May-2006,2006.0,Unprovoked,USA,Hawaii,"North Shore, O'ahu",Surfing,Bret Desmond,M,...,,R. Collier,2006.05.27-Desmond.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2006.05.27,2006.05.27,4825.0,stopped here,


In [10]:
df.loc[df['Unnamed: 23'].notna()]  # "Teramo" and "change filename"

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
4415,1952.03.30,30-Mar-1952,1952.0,Unprovoked,NETHERLANDS ANTILLES,Curacao,,Went to aid of child being menaced by the shark,A.J. Eggink,M,...,"Bull shark, 2.7 m [9'] was captured & dragged ...","J. Randall, p.352 in Sharks & Survival; H.D. B...",1952.03.30-Eggink.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1952.03.30,1952.03.30,1888.0,,Teramo
5840,1878.09.14.R,Reported 14-Sep-1878,1878.0,Provoked,USA,Connecticut,"Branford, New Haven County",Fishing,Captain Pattison,M,...,,"St. Joseph Herald, 9/14/1878",1878.09.14.R-Pattison.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1878.09.14.R,1878.09.14.R,463.0,,change filename


In [11]:
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

- A couple of column names have trailing spaces that we can remove: Species and Sex

In [12]:
df.columns = df.columns.str.rstrip()

In [13]:
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

# CLEANING

![alt text](https://miro.medium.com/max/568/1*S1HH5F8PqWWcId9sb0L8og.jpeg "Dropna")

- We drop all rows and columns that are entirely NaN, along with duplicate rows

In [14]:
df.dropna(axis = 0, how = 'all', inplace = True)
df.dropna(axis = 1, how = 'all', inplace = True)

df.drop_duplicates(inplace=True)


df.shape  # many rows are gone; no columns were actually dropped, since Unnamed: 22 and 23 hold 1 and 2 values

(6311, 24)
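
To make the effect of `how='all'` concrete, here is a minimal sketch on an invented toy frame. It shows why a column with even a single value survives, which is what happens with "Unnamed: 22" and "Unnamed: 23":

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 1.0, np.nan],
    "y": [2.0, np.nan, 2.0, 5.0],
    "empty": [np.nan] * 4,
})

cleaned = (
    toy.dropna(axis=0, how="all")   # drops row 1 only: every value is NaN
       .dropna(axis=1, how="all")   # drops the "empty" column
       .drop_duplicates()           # drops row 2, a repeat of row 0
)
print(cleaned.shape)
```

Row 3 is kept even though `x` is missing, because `how="all"` only removes rows (or columns) in which nothing at all is present.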

In [15]:
df["Case Number"].sample(40)

4940      1934.06.20
1161    2009.01.27.R
1141      2009.03.27
676     2013.05.08.b
1105    2009.07.24.R
5839      1878.10.13
2470      1993.09.30
5547    1901.06.29.R
1487      2006.04.19
3947      1960.11.22
4556    1948.00.00.c
3210      1975.08.12
1436      2006.09.02
1570      2005.07.15
1964    2001.04.08.b
2843    1985.09.08.b
767     2012.07.07.c
4350      1954.01.15
6033    1849.06.08.b
380       2015.07.31
5029    1931.00.00.a
4114    1959.01.17.b
5       2018.06.03.b
5738      1887.10.18
4686    1943.00.00.d
4117      1959.01.12
5935    1865.07.14.R
3562      1966.08.14
2768      1987.09.13
2783      1987.04.15
4948      1934.02.22
1277    2008.04.08.R
2992      1982.02.13
1704    2004.01.15.R
1162    2009.01.26.R
1803      2002.12.24
4466      1950.07.21
3649      1964.12.25
2747      1988.02.15
3603    1965.11.03.b
Name: Case Number, dtype: object

In [16]:
df[["Case Number", "Date","Investigator or Source",'Case Number.1', 'Case Number.2', 'original order']].sample(30)

Unnamed: 0,Case Number,Date,Investigator or Source,Case Number.1,Case Number.2,original order
2943,1983.04.19,19-Apr-1983,"M. Vorenberg, GSAF",1983.04.19,1983.04.19,3360.0
1628,2004.10.30.x,30-Oct-2004,"New Zealand Herald, 11/24/2004/15/2004",2004.10.30.x,2004.10.30.x,4675.0
4367,1953.09.02,02-Sep-1953,"Honolulu Star Bulletin, 9/2/1953; G.H. Balazs;...",1953.09.02,1953.09.02,1936.0
3828,1962.04.05,05-Apr-1962,T. Wallett,1962.04.05,1962.04.05,2475.0
5994,1857.05.05,05-May-1857,"The Buffalo Commercial Advertiser, 5//18/1857",1857.05.05,1857.05.05,309.0
2470,1993.09.30,30-Sep-1993,"Houston Chronicle, 10/9/1993; Times of London...",1993.09.30,1993.09.30,3833.0
4409,1952.07.05,05-Jul-1952,"V.M. Coppleson (1958), p.265; J. Randall in Sh...",1952.07.05,1952.07.05,1894.0
3726,1963.11.30.b,30-Nov-1963,"C. Dudley, R.D. Weeks, GSAF; H.D. Baldridge, S...",1963.11.30.b,1963.11.30.b,2577.0
3937,1961.01.01,01-Jan-1961,V.M. Coppleson (1962) p.246,1961.01.01,1961.01.01,2366.0
4855,1937.02.11,11-Feb-1937,"Canberra Times, 2/12/1937",1937.02.11,1937.02.11,1448.0


- After a quick look at these columns we can discard them, since they are either duplicated or hold data that will not help with our hypotheses

In [17]:
df[["pdf", "href","href formula"]].sample(20)

Unnamed: 0,pdf,href,href formula
5124,1927.10.00-Josiah.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
1434,2006.09.03.b-Darlan-dos-Santos-Luz.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
3220,1975.07.05-DennisThompson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
3881,1961.09.06-Sailor.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
1566,2005.07.22-Pearce.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
4158,1958.06.17-Irving.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
2581,1991.09.19-Huneidi.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
5625,1896.00.00.b-filibuster.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
41,2018.02.17-Palmer.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
2545,1992.07.08.b-LG.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...


- We open one of the href links to see its content:

In [18]:
df["href"][3513] 

'http://sharkattackfile.net/spreadsheets/pdf_directory/1967.08.25-Casucci.pdf'

https://sharkattackfile.net/spreadsheets/pdf_directory/1967.08.25-Casucci.pdf

In [19]:
df["href formula"][6091]

'http://sharkattackfile.net/spreadsheets/pdf_directory/1830.04.30-Bromwick.pdf'

http://sharkattackfile.net/spreadsheets/pdf_directory/1830.04.30-Bromwick.pdf


- Some were empty (I imagine the address has changed somehow); others contain PDFs with details of the reported attack: images, newspaper clippings, photos of the location, etc.
In this case, however, I am not going to use them.

- Having looked over the columns, we can **drop** the ones that **I don't consider interesting** for this case:

'Case Number','Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22','Unnamed: 23'

In [20]:
df.drop(['Case Number','Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22','Unnamed: 23'], axis=1, inplace=True)

In [21]:
df.columns

Index(['Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity',
       'Name', 'Sex', 'Age', 'Injury', 'Fatal (Y/N)', 'Time', 'Species'],
      dtype='object')

### YEAR and DATE COLUMNS

In [22]:
df.Year.unique()  # Let's see whether we can rescue any year from the contents of the Date column

array([2018., 2017.,   nan, 2016., 2015., 2014., 2013., 2012., 2011.,
       2010., 2009., 2008., 2007., 2006., 2005., 2004., 2003., 2002.,
       2001., 2000., 1999., 1998., 1997., 1996., 1995., 1984., 1994.,
       1993., 1992., 1991., 1990., 1989., 1969., 1988., 1987., 1986.,
       1985., 1983., 1982., 1981., 1980., 1979., 1978., 1977., 1976.,
       1975., 1974., 1973., 1972., 1971., 1970., 1968., 1967., 1966.,
       1965., 1964., 1963., 1962., 1961., 1960., 1959., 1958., 1957.,
       1956., 1955., 1954., 1953., 1952., 1951., 1950., 1949., 1948.,
       1848., 1947., 1946., 1945., 1944., 1943., 1942., 1941., 1940.,
       1939., 1938., 1937., 1936., 1935., 1934., 1933., 1932., 1931.,
       1930., 1929., 1928., 1927., 1926., 1925., 1924., 1923., 1922.,
       1921., 1920., 1919., 1918., 1917., 1916., 1915., 1914., 1913.,
       1912., 1911., 1910., 1909., 1908., 1907., 1906., 1905., 1904.,
       1903., 1902., 1901., 1900., 1899., 1898., 1897., 1896., 1895.,
       1894., 1893.,

- Some years look a bit odd, so I'll look at them in detail

In [23]:
df[(df.Year == 0)]

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
6177,Ca. 214 B.C.,0.0,Unprovoked,,Ionian Sea,,Ascending from a dive,"Tharsys, a sponge diver",M,,"FATAL, shark/s bit him in two",Y,,
6178,Ca. 336.B.C..,0.0,Unprovoked,GREECE,Piraeus,In the haven of Cantharus,Washing his pig in preparation for a religious...,A candidate for initiation,M,,"FATAL, shark ""bit off all lower parts of him u...",Y,,
6179,493 B.C.,0.0,Sea Disaster,GREECE,Off Thessaly,,Shipwrecked Persian Fleet,males,M,,Herodotus tells of sharks attacking men in the...,Y,,
6180,Ca. 725 B.C.,0.0,Sea Disaster,ITALY,Tyrrhenian Sea,Krater found during excavations at Lacco Ameno...,Shipwreck,males,M,,Depicts shipwrecked sailors attacked by a sha...,Y,,
6181,Before 1939,0.0,Unprovoked,CANADA,,Grand Banks,Fishing,Joe Folsom,M,,Arm bitten,N,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6297,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,
6298,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,
6299,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,
6300,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,


- There are 125 rows where df.Year == 0 but df.Date does contain data; let's see what we can rescue from there.

In [24]:
list(df.Date.unique())

['25-Jun-2018',
 '18-Jun-2018',
 '09-Jun-2018',
 '08-Jun-2018',
 '04-Jun-2018',
 '03-Jun-2018',
 '27-May-2018',
 '26-May-2018',
 '24-May-2018',
 '21-May-2018',
 '13-May-2018',
 'May 2018',
 '12-May-2018',
 '09-May-2018',
 'Reported 30-Apr-2018',
 '28-Apr-2018',
 '25-Apr-2018',
 '24-Apr-2018',
 '23-Apr-2018',
 '22-Apr-2018',
 '19-Apr-2018',
 '15-Apr-2018',
 '14-Apr-2018',
 'Reported 10-Apr-2018',
 '09-Apr-2018',
 '05-Apr-2018',
 '03-Apr-2018',
 '31-Mar-2018',
 '14-Mar-2018',
 '9-Mar-2018',
 '24-Feb-2018',
 '23-Feb-2018',
 '18-Feb-2018',
 '15-Feb-2018',
 '14-Feb-2018',
 '11-Feb-2018',
 '03-Feb-2018',
 '01-Feb-2018',
 '28-Jan-2018',
 '21-Jan-2018',
 '14-Jan-2018',
 '13-Jan-2018',
 '12-Jan-2018',
 '05-Jan-2018',
 '31-Dec-2017',
 '30-Dec-2017',
 '21-Dec-2017',
 '09-Dec-2017',
 '30-Nov-2017',
 'Reported 25-Nov-2017',
 '24-Nov-2017',
 '18-Nov-2017',
 'Reported 13-Nov-2017',
 '13-Nov-2017',
 '04-Nov-2017',
 'Reported 31-Oct-2017',
 '28-Oct-2017',
 '26-Oct-2017',
 '23-Oct-2017',
 '22-Oct-2017',

- Lots of "Reported" prefixes and lots of whitespace:

In [25]:
df.Date = df.Date.replace(regex=r'(?i)Reported\s{1,9}',value='')
list(df.Date.unique())

['25-Jun-2018',
 '18-Jun-2018',
 '09-Jun-2018',
 '08-Jun-2018',
 '04-Jun-2018',
 '03-Jun-2018',
 '27-May-2018',
 '26-May-2018',
 '24-May-2018',
 '21-May-2018',
 '13-May-2018',
 'May 2018',
 '12-May-2018',
 '09-May-2018',
 '30-Apr-2018',
 '28-Apr-2018',
 '25-Apr-2018',
 '24-Apr-2018',
 '23-Apr-2018',
 '22-Apr-2018',
 '19-Apr-2018',
 '15-Apr-2018',
 '14-Apr-2018',
 '10-Apr-2018',
 '09-Apr-2018',
 '05-Apr-2018',
 '03-Apr-2018',
 '31-Mar-2018',
 '14-Mar-2018',
 '9-Mar-2018',
 '24-Feb-2018',
 '23-Feb-2018',
 '18-Feb-2018',
 '15-Feb-2018',
 '14-Feb-2018',
 '11-Feb-2018',
 '03-Feb-2018',
 '01-Feb-2018',
 '28-Jan-2018',
 '21-Jan-2018',
 '14-Jan-2018',
 '13-Jan-2018',
 '12-Jan-2018',
 '05-Jan-2018',
 '31-Dec-2017',
 '30-Dec-2017',
 '21-Dec-2017',
 '09-Dec-2017',
 '30-Nov-2017',
 '25-Nov-2017',
 '24-Nov-2017',
 '18-Nov-2017',
 '13-Nov-2017',
 '04-Nov-2017',
 '31-Oct-2017',
 '28-Oct-2017',
 '26-Oct-2017',
 '23-Oct-2017',
 '22-Oct-2017',
 '21-Oct-2017',
 '18-Oct-2017',
 '09-Oct-2017',
 '05-Oct-201
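
The prefix removal can be checked in isolation on a handful of strings. This is a small sketch using `Series.str.replace` plus `str.strip` as an alternative spelling of the same idea; the sample values are made up to resemble the ones listed above:

```python
import pandas as pd

# Sample values invented to resemble the Date column.
dates = pd.Series([
    "Reported 30-Apr-2018",
    "reported  10-Apr-2018",
    " 28-Apr-2018 ",
    "May 2018",
])

cleaned = (
    dates.str.replace(r"(?i)reported\s+", "", regex=True)  # case-insensitive prefix removal
         .str.strip()                                      # trim leftover whitespace
)
print(cleaned.tolist())
```

Partial dates such as "May 2018" pass through untouched, so they still need the year-rescue step later on.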

- Although they don't show up in the **uniques** above, many Dates are given as ranges or as dates before Christ.
- Let's temporarily call them `salvables` (salvageable) to see exactly how many there are:

In [26]:
salvables = df.loc[(df["Year"] == 0) & (df["Date"].notna())]  # notna(): comparing with != np.nan is always True
salvables.shape

(125, 14)
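
A short note on selecting non-missing values: NaN compares unequal to everything, including itself, so a `!= np.nan` comparison never filters anything out, whereas `notna()` gives the intended mask. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Values invented for illustration; dtype=object keeps the raw strings and NaN.
dates = pd.Series(["Before 1939", np.nan, "1900-1905"], dtype=object)

always_true = dates != np.nan   # NaN != NaN, so every element is True
real_mask = dates.notna()       # True only where a value is present

print(always_true.tolist())
print(real_mask.tolist())
```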

In [27]:
salvables.sample(30)

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species
6182,1990 or 1991,0.0,Unprovoked,KENYA,Mombasa,Kilindini,Diving,Conway Plough & Dr. Jonathan Higgs,M,,Conway's leg was bitten Higgs injury was FATAL,N,,
6240,Before 1961,0.0,Unprovoked,SEYCHELLES,Amirante Islands,Marie Louise Island,Swimming from capsized pirogue,Aristede,M,,FATAL,Y,,Tiger shark
6235,"No date, Before 1963",0.0,Unprovoked,AUSTRALIA,Torres Strait,,Diving for trochus,male,M,,Calf removed,N,,0.9 m [3'] shark
6206,Before 2012,0.0,Unprovoked,,,In a river feeding into the Bay of Bengal,Netting shrimp,Sametra Mestri,F,,Hand severed,N,,
6287,Before 1917,0.0,Unprovoked,FIJI,Moala Island,,Wreck of large double sailing canoe,20 Fijians,,,"FATAL, 18 people were killed by sharks, 2 sur...",Y,,
6258,Before 1952,0.0,Unprovoked,KIRIBATI,Gilbert Islands,Nonouti,,Gilbertese fisherman,M,,FATAL,Y,,
6183,Before 2016,0.0,Unprovoked,KENYA,Mombasa,Kilindini,Diving,Hamisi Njenga,M,,FATAL,Y,,
6181,Before 1939,0.0,Unprovoked,CANADA,,Grand Banks,Fishing,Joe Folsom,M,,Arm bitten,N,,
6257,Before Mar-1956,0.0,Unprovoked,NORTH PACIFIC OCEAN,,Wake Island,"Fishing, wading with string of fish",male,M,,Survived,N,,
6220,"No date, Before May-1996",0.0,Unprovoked,KOREA,South Korea,Cheju Island,Diving,"female, a Hae Nyeo",F,,"FATAL, injured while diving, then shark bit her",Y,,


- We call the functions defined in **cleaning_functions.py**, specifically `rescatar_fechas`, which applies three functions in sequence: first the dates containing BC, then the single dates (such as "Before YYYY"), and finally the intervals, from which it takes the mean, to fill in df.Year for those df.Date values. #TODO: look into the "late" dates and similar so they are included too

In [28]:
df.Year = df.Date.apply(rescatar_fechas)

- If we look at the first record from our earlier list, we see that:

In [29]:
df["Date"][6228]

'No date, Before 1969'

In [30]:
df["Year"][6228]

'1969'

In [31]:
df["Date"][6265]

'1941-1942'

In [32]:
df["Year"][6265]

'1941'

In [33]:
#pd.set_option('max_rows', None)

In [34]:
df[["Date","Year"]].sample(10)

Unnamed: 0,Date,Year
5048,20-Feb-1930,1930
3224,15-Jun-1975,1975
4377,Jun-1953,1953
765,14-Jul-2012,2012
310,02-Feb-2016,2016
4673,27-May-1943,1943
839,05-Dec-2011,2011
2950,Mar-1983,1983
4791,27-Sep-1939,1939
809,14-Mar-2012,2012
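
The source of `rescatar_fechas` is not shown in this notebook, so the following is only a rough sketch of the idea described above, not the actual implementation. The name `rescue_year_sketch`, the three regular expressions, the negative sign for B.C. years, and the floored midpoint for intervals are all assumptions. It also returns integers, while the real function returns strings such as `'1969'`, and it checks intervals before single years so that a range is not mistaken for one year.

```python
import math
import re

# Hypothetical patterns; the real rescatar_fechas may use different ones.
BC_PATTERN = re.compile(r"(\d+)\s*\.?\s*B\.?\s*C")
RANGE_PATTERN = re.compile(r"(\d{4})\s*(?:-|or|to)\s*(\d{4})")
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def rescue_year_sketch(date):
    """Return an integer year guessed from a free-text date, or NaN."""
    if not isinstance(date, str):
        return math.nan
    # 1) Years before Christ, e.g. "Ca. 214 B.C." (negative sign is an assumption).
    match = BC_PATTERN.search(date)
    if match:
        return -int(match.group(1))
    # 2) Intervals such as "1941-1942" or "1990 or 1991": floored midpoint (assumption).
    match = RANGE_PATTERN.search(date)
    if match:
        return (int(match.group(1)) + int(match.group(2))) // 2
    # 3) Anything else containing a four-digit year, e.g. "Before 1939".
    match = YEAR_PATTERN.search(date)
    if match:
        return int(match.group(1))
    return math.nan


for sample in ["No date, Before 1969", "1941-1942", "Ca. 214 B.C.", "25-Jun-2018"]:
    print(sample, "->", rescue_year_sketch(sample))
```

A function like this would be applied the same way as in the notebook, with `df.Date.apply(...)`.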


In [35]:
#pd.set_option('max_rows', 20)

## AREA / LOCATION COLUMNS

- Both columns are fairly imprecise about location, so I'd rather work with the **Country** column.

In [36]:
df[["Country","Area","Location"]].sample(10)

Unnamed: 0,Country,Area,Location
5540,SOUTH AFRICA,KwaZulu-Natal,Durban
5246,PHILIPPINES,"Cavite Province, Luzon",
2048,USA,Florida,"Pensacola Bay, Escambia County"
3362,BRITISH ISLES,South Devon,Beesands
3582,USA,Puerto Rico,
4899,AUSTRALIA,Torres Strait,"Near Warrior Reefs, Queensland"
5219,USA,Hawaii,"Keawanui, Kamalo, Moloka'i"
983,SOUTH AFRICA,Western Cape Province,"Melkbaai, Strand"
4822,AUSTRALIA,New South Wales,Tweed Heads
2483,USA,California,"Abalone Point, Westport Union Landing, Mendoci..."


- I'll clean Country with a function in cleaning_functions.py called `paises`. I took a list of countries from GitHub, stored in a CSV;
the function checks whether the string is in the CSV and assigns NaN if it is not. It also fixes a couple of misspelled names that I added by hand.

In [37]:
df.Country = df.Country.apply(paises)

In [38]:
df.Country.notna().sum()

5982
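
Since `paises` lives in a file that is not shown here, this is only a sketch of the lookup described above. The function name, the four-country set standing in for the CSV, and the single alias are all hypothetical:

```python
import math

# Tiny stand-ins for the CSV country list and the spelling fixes (both hypothetical).
VALID_COUNTRIES = {"USA", "AUSTRALIA", "SOUTH AFRICA", "PHILIPPINES"}
ALIASES = {"UNITED STATES": "USA"}


def normalize_country_sketch(value):
    """Return the cleaned country name if it is in the reference list, else NaN."""
    if not isinstance(value, str):
        return math.nan
    name = value.strip().upper()
    # Apply known spelling corrections before checking the reference list.
    name = ALIASES.get(name, name)
    return name if name in VALID_COUNTRIES else math.nan


for sample in [" usa ", "United States", "Atlantis"]:
    print(repr(sample), "->", normalize_country_sketch(sample))
```

Anything that is not in the reference list becomes NaN, which is the trade-off the description mentions: ocean or sea names that appear in Country are lost unless they are added as aliases.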

In [39]:
# This can be improved, for "sea" results...

# ACTIVITY COLUMN

In [40]:
df.Activity.sample(30)

3328                                   Diving for abalone
2433                                              Surfing
3954                            Fishing for rock lobsters
1462                                              Walking
5636                                               Diving
494                                             Kayaking 
3113                                         Spearfishing
1145                                              Surfing
5318                                               Diving
1796                                             Swimming
5260                                     Jumped overboard
3495                                             Swimming
285                                          Spearfishing
5133                                              Fishing
2280                                             Swimming
135                                               Surfing
4970    Diving for trochus  from dinghy when seized by...
4976          

- The list of activities is long, so we group and filter them with a function called `actividad`, also in cleaning_functions.py

In [41]:
df.Activity = df.Activity.apply(actividad)
df.Activity.sample(30)

5326        swimming
4180         fishing
3019          diving
3682         fishing
2506    sea disaster
2807            surf
5749             NaN
4039         fishing
1102         fishing
5138          diving
3365            surf
6265        swimming
1204            surf
3175            surf
646          fishing
4335             NaN
1340            surf
1506         boating
3590         fishing
480             surf
2783        swimming
1815        swimming
5038        swimming
381             surf
5593        swimming
3126             NaN
2343          diving
2634          diving
614           diving
3983         fishing
Name: Activity, dtype: object
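
One plausible way to do this kind of grouping is keyword matching with ordered rules. The sketch below is an assumption about the approach, not the code of `actividad`; the category labels mirror the output above, but the keywords and their order are invented:

```python
import math

# Ordered (category, keywords) rules; the first match wins. All hypothetical.
ACTIVITY_RULES = [
    ("surf", ("surf",)),
    ("diving", ("div", "snorkel")),
    ("fishing", ("fish",)),
    ("swimming", ("swim", "bath")),
    ("sea disaster", ("wreck", "overboard", "disaster")),
    ("boating", ("boat", "kayak", "canoe", "sail")),
]


def group_activity_sketch(value):
    """Map a free-text activity to a coarse category, or NaN if nothing matches."""
    if not isinstance(value, str):
        return math.nan
    text = value.lower()
    for category, keywords in ACTIVITY_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return math.nan


for sample in ["Spearfishing", "Diving for abalone", "Kayaking ", "Walking"]:
    print(repr(sample), "->", group_activity_sketch(sample))
```

The rule order matters: an entry like "Fishing, wading with string of fish" is labelled by whichever rule appears first, so more specific categories should come earlier.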

# INJURY COLUMN

In [42]:
list(df["Fatal (Y/N)"].unique())

['N', 'Y', nan, 'M', 'UNKNOWN', '2017', ' N', 'N ', 'y']

In [43]:
df[["Fatal (Y/N)","Injury"]].sample(30)

Unnamed: 0,Fatal (Y/N),Injury
1579,N,Knee bitten
5647,N,Left leg bitten PROVOKED INCIDENT
6226,N,"Ankle punctured & lacerated, hands abraded PRO..."
1907,Y,FATAL
11,N,Injuries to lower right leg and foot
4754,N,"Survived, but suffered a forequarter amputation"
4454,Y,FATAL
4143,N,Shark tried to bite prop twice
4311,N,Forearm slashed wrist to elbow by hooked shark...
2387,Y,"FATAL, hand & leg severely injured by shark th..."


- Values other than the expected "Y"/"N", as well as the categorization of the Injury column, are handled with two functions, `fatal` and `lesiones`

In [44]:
df["Injury"] = df["Injury"].apply(lesiones)

In [45]:
df['Fatal (Y/N)'] = df["Fatal (Y/N)"].apply(fatal)

In [46]:
df[["Fatal (Y/N)","Injury"]].sample(10)

Unnamed: 0,Fatal (Y/N),Injury
2176,N,injury
3807,N,injury
630,N,injury
5619,Y,fatal
2767,Y,fatal
4506,N,injury
2232,N,injury
2531,Y,fatal
5025,N,injury
341,N,injury
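
The two helpers are defined outside this notebook, so what follows is a guess at their general shape rather than their actual code. The function names and the rule "any mention of 'fatal' means fatal" are assumptions; the raw flag values in the loop are the ones listed by `unique()` above:

```python
import math


def normalize_fatal_sketch(value):
    """Map raw 'Fatal (Y/N)' entries to 'Y', 'N' or NaN."""
    if not isinstance(value, str):
        return math.nan
    flag = value.strip().upper()
    # Anything other than Y/N ('M', 'UNKNOWN', '2017', ...) is treated as missing.
    return flag if flag in {"Y", "N"} else math.nan


def categorize_injury_sketch(value):
    """Collapse a free-text injury description into 'fatal' or 'injury'."""
    if not isinstance(value, str):
        return math.nan
    # Assumption: any mention of "fatal" marks the record as fatal.
    return "fatal" if "fatal" in value.lower() else "injury"


for raw in ["N", "Y", "M", "UNKNOWN", "2017", " N", "N ", "y"]:
    print(repr(raw), "->", normalize_fatal_sketch(raw))
```

A substring rule this simple would also flag a description such as "non-fatal", so a real implementation probably needs a few extra checks.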


In [47]:
#TODO age

# SPECIES COLUMN

In [48]:
df.Species.sample(30)

4688                                       "small sharks"
6180                                                  NaN
4636                                                  NaN
3564                                              Invalid
5009                                                  NaN
5802                                                  NaN
4766    White shark, species identity confirmed by too...
2697           1.8 m to 2.4 m [6' to 8'] hammerhead shark
3428                            White shark, 5 m [16.5'] 
4126                                                  NaN
1276                                             6' shark
41                                                    NaN
6196                                                  NaN
5571                                      3 m [10'] shark
3168    White shark, 3.5 m [11.5'], species identity c...
2277                                     1.2 m [4'] shark
1060                                   Zambesi shark, 2m 
4097          

- To clean this we'll do something similar to what we did with Activity, categorizing by reading the strings. I'll use the list of names from https://sharkattackfile.net/species.htm,
 which is also where our data frame comes from

In [50]:
df.Species = df.Species.apply(species)

In [51]:
df.Species.sample(30)

4633                NaN
1680              tiger
4979                NaN
1798               bull
920               tiger
3957                NaN
5199                NaN
2537           blacktip
4759                NaN
5100                NaN
1348                NaN
2921              white
5765                NaN
5356                NaN
4466                NaN
56                nurse
3156                NaN
2938                NaN
1413                NaN
2495                NaN
3023    grey reef shark
5375      bronze whaler
996               white
315                mako
2605              nurse
3897              white
3905              tiger
1124                NaN
3991              white
4246                NaN
Name: Species, dtype: object
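
As with the other helpers, `species` itself is not shown, so this is only a sketch of name-based extraction. The function name and the short list of names are hypothetical; the real list from the species page is longer, and the labels it produces may differ (for example "grey reef shark" in the output above):

```python
import math

# A few names for illustration; the real list is longer.
SPECIES_NAMES = [
    "grey reef", "bronze whaler", "hammerhead", "whitetip", "blacktip",
    "white", "tiger", "bull", "nurse", "mako",
]
# Check longer names first so that "whitetip" is not reported as "white".
SPECIES_BY_LENGTH = sorted(SPECIES_NAMES, key=len, reverse=True)


def extract_species_sketch(value):
    """Return the first known species name found in the text, or NaN."""
    if not isinstance(value, str):
        return math.nan
    text = value.lower()
    for name in SPECIES_BY_LENGTH:
        if name in text:
            return name
    return math.nan


for sample in ["White shark, 5 m [16.5'] ", "Oceanic whitetip shark", "6' shark"]:
    print(repr(sample), "->", extract_species_sketch(sample))
```

Entries that only give a size, such as "6' shark", end up as NaN, which matches the large number of NaN values in the cleaned column.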

# Export the clean data frame to a new CSV

In [54]:
df.to_csv("src/attack_limpio.csv",index=False)

- The visualization continues in `analysis.ipynb` [📑](analysis.ipynb) 