# Cleaning the data

Before I start deleting and modifying things, it's good to take a look at the table.

In [1]:
import pandas as pd
import src.funciones_toni as tn
import re
import numpy as np

In [2]:
sharkraw = pd.read_csv("data/attacks.csv",encoding = "ISO-8859-1")

In [3]:
sharkraw.shape

(25723, 24)

In [4]:
sharkraw.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [5]:
sharkraw.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

Some column names have a trailing space ('Sex ' and 'Species '). This can lead to confusion and errors later, so I rename them right away.

In [6]:
# Rename both columns in a single call to remove the trailing spaces
sharkraw.rename(columns={'Sex ': 'Sex', 'Species ': 'Species'}, inplace=True)
sharkraw.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')
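Renaming by hand works for two columns, but the whitespace can also be stripped from every column name at once. A minimal sketch, using a small made-up frame (`demo`) in place of `sharkraw`:

```python
import pandas as pd

# Toy frame standing in for sharkraw; the trailing spaces mimic 'Sex ' and 'Species '.
demo = pd.DataFrame({
    "Sex ": ["F", "M"],
    "Species ": ["White shark", None],
    "Age": [57, 11],
})

# Index.str.strip() removes leading and trailing whitespace from every column label.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

This catches any other padded names without having to spot them by eye first.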

In [7]:
shark2=sharkraw.drop_duplicates()
shark2.shape

(6312, 24)

This shows that most of the rows were duplicates, presumably empty ones: of the 25,723 original rows, only 6,312 remain.
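The idea that the removed rows were empty can be checked directly. The sketch below uses a toy frame (`demo`, not the real data) to count the rows that are entirely NaN, and to show that `drop_duplicates` keeps one of them while `dropna(how="all")` removes them all:

```python
import numpy as np
import pandas as pd

# Toy frame: two real rows and three completely empty ones.
demo = pd.DataFrame({
    "Case Number": ["2018.06.25", np.nan, np.nan, np.nan, "2018.06.18"],
    "Country": ["USA", np.nan, np.nan, np.nan, "USA"],
})

# Number of rows in which every column is NaN.
n_empty = demo.isnull().all(axis=1).sum()

# Identical all-NaN rows count as duplicates, so only the first one survives.
deduped = demo.drop_duplicates()

# dropna(how="all") removes every fully empty row, including that survivor.
non_empty = demo.dropna(how="all")

print(n_empty, len(deduped), len(non_empty))
```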

In [8]:
shark2.isnull().sum().sort_values(ascending=False)

Unnamed: 22               6311
Unnamed: 23               6310
Time                      3364
Species                   2848
Age                       2841
Sex                        575
Activity                   554
Location                   550
Fatal (Y/N)                549
Area                       465
Name                       220
Country                     60
Injury                      38
Investigator or Source      27
Type                        14
Year                        12
href formula                11
pdf                         10
href                        10
Case Number.1               10
Case Number.2               10
Date                        10
original order               3
Case Number                  2
dtype: int64

Some columns have a lot of missing information (NaN values). In particular, `Unnamed: 22` and `Unnamed: 23` are almost completely empty.
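One possible way to act on this is to drop the columns whose share of missing values is above some threshold. A sketch on a toy frame; the 0.9 threshold is an arbitrary assumption, not a value taken from this analysis:

```python
import numpy as np
import pandas as pd

# Toy frame with one fully empty column, mimicking 'Unnamed: 22'.
demo = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA", "MEXICO"],
    "Time": ["18h00", np.nan, np.nan, "07h45"],
    "Unnamed: 22": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (the mean of a boolean mask).
null_share = demo.isnull().mean()

# Assumed threshold: drop columns that are more than 90% empty.
threshold = 0.9
to_drop = null_share[null_share > threshold].index

cleaned = demo.drop(columns=to_drop)
print(list(cleaned.columns))
```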

In [9]:
shark2["Species"].value_counts(dropna=False)

NaN                                                   2848
White shark                                            163
Shark involvement prior to death was not confirmed     105
Invalid                                                102
Shark involvement not confirmed                         88
                                                      ... 
8' bull shark or Caribbean reef shark                    1
4.2 m white shark                                        1
White shark, 12' to 15' female                           1
Grey nurse shark, 8'                                     1
Injury believed caused by an eel, not a shark            1
Name: Species, Length: 1550, dtype: int64

This information looks interesting to me; however, the data is very inconsistently recorded. Many long sentences mean the same thing.
Regex:
- Sharks identified by name (capitalized word + "shark"): `([A-Z][a-z]*\sshark)`
- Sharks identified by length (in feet): `\d+']\sshark|\d+'\sshark`
- Not confirmed: `(not confirmed)|(Invalid)|(Questionable)|(unconfirmed)`
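To see what these patterns catch, here is a small sketch using only the standard `re` module. The helper `classify_species` is hypothetical (it is not the `tn.regeshark` function used in this notebook), and the `]` is escaped as `\]` for readability, which matches the same text:

```python
import re

# Name or length patterns from the list above, compiled as a raw string.
SPECIES_PATTERN = re.compile(r"[A-Z][a-z]*\sshark|\d+'\]\sshark|\d+'\sshark")
# Wording that marks an unconfirmed or invalid record.
UNCONFIRMED_PATTERN = re.compile(r"not confirmed|Invalid|Questionable|unconfirmed")


def classify_species(text):
    """Hypothetical helper: return 'Unconfirmed', the first match, or None."""
    if UNCONFIRMED_PATTERN.search(text):
        return "Unconfirmed"
    match = SPECIES_PATTERN.search(text)
    return match.group(0) if match else None


samples = [
    "White shark, 12' to 15' female",
    "5' shark",
    "2 m shark",
    "Shark involvement not confirmed",
]
for sample in samples:
    print(repr(sample), "->", classify_species(sample))
```

Entries such as "2 m shark" match neither pattern (lowercase name, length in meters), so they come back as `None`.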

In [10]:
shark2["Species"]=shark2["Species"].astype(str)
shark2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shark2["Species"]=shark2["Species"].astype(str)


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,
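The `SettingWithCopyWarning` above shows up because pandas cannot tell whether `shark2` is an independent frame or something derived from `sharkraw`. A common way to avoid it is to take an explicit copy before assigning columns. A sketch on a toy frame (`raw` and `dedup` are made-up names):

```python
import pandas as pd

# Toy frame standing in for sharkraw.
raw = pd.DataFrame({"Species": ["White shark", None, "White shark", None]})

# .copy() makes the deduplicated frame an independent object,
# so later column assignments are unambiguous.
dedup = raw.drop_duplicates().copy()

# Same conversion as in the cell above, now on the explicit copy.
dedup["Species"] = dedup["Species"].astype(str)

# The original frame is left untouched.
print(len(dedup), raw["Species"].isna().sum())
```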


In [11]:
# Raw string so that \s and \d reach the regex engine unchanged
pattern = r"[A-Z][a-z]*\sshark|\d+']\sshark|\d+'\sshark"
shark2['Species_sorted'] = shark2['Species'].apply(lambda x: tn.regeshark(x, pattern))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shark2['Species_sorted'] = shark2['Species'].apply(lambda x: tn.regeshark(x, pattern))


In [12]:
shark2['Species_sorted'].value_counts(dropna=False).head(20)

NaN                  4391
White shark           436
Tiger shark           237
5' shark              131
Bull shark            130
6' shark              104
4' shark               98
8' shark               51
Nurse shark            49
Wobbegong shark        46
3' shark               44
Mako shark             44
Raggedtooth shark      43
10' shark              41
12' shark              40
7' shark               36
Blacktip shark         34
Lemon shark            32
Blue shark             29
Zambesi shark          29
Name: Species_sorted, dtype: int64

Now the shark species column looks much cleaner, although 4,391 rows did not match the pattern and are left as NaN.

I'm going to do the same with the date. Three columns carry date information: "Case Number", "Date" and "Year". The year is useful on its own, so I'm keeping that column. From the other two I'm only interested in the month, because I want to see whether there is a relationship between the time of year and the shark attacks.

In [13]:
pattern="[A-Z][a-z]+"


shark2["Date"]=shark2["Date"].astype(str)

shark2['Month_attack'] = shark2['Date'].apply(lambda x: tn.monthattack(x, pattern))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shark2["Date"]=shark2["Date"].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shark2['Month_attack'] = shark2['Date'].apply(lambda x: tn.monthattack(x, pattern))
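The source of `tn.monthattack` is not shown here, so the following is only a guess at what such a helper could look like. `month_from_date` is a hypothetical stand-in that applies the same `[A-Z][a-z]+` pattern and keeps the first token that is a valid three-letter month abbreviation; the sample strings are made up:

```python
import re

MONTH_PATTERN = re.compile(r"[A-Z][a-z]+")
VALID_MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}


def month_from_date(date_text):
    """Hypothetical helper: return the first month abbreviation found, or None."""
    for match in MONTH_PATTERN.finditer(date_text):
        token = match.group(0)
        # Skip capitalized words that are not month abbreviations.
        if token in VALID_MONTHS:
            return token
    return None


for sample in ["25-Jun-2018", "Reported 08-Jan-2017", "nan"]:
    print(repr(sample), "->", month_from_date(sample))
```

Checking against a fixed set of months keeps other capitalized words in the date text from ending up in the month column.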


In [15]:
shark2['Month_attack'].sample(50)

1114    Jul
3872    Oct
509     Aug
3755    Apr
5310    Jul
2484    Aug
764     Jul
5570    Aug
1426    Oct
4568    Oct
4022    Jan
2900    Mar
4772    Jan
5193    Dec
2991    Feb
4683    Jan
4003    Apr
3670    Aug
3568    Jul
2647    Jan
2557    Mar
3128    Feb
1749    Aug
149     Apr
3008    Oct
1887    Jan
6037    Jul
780     Jun
936     Mar
2829    Feb
2266    Jan
1590    May
5636    Jun
5831    Sep
3797    Aug
2857    Mar
757     Jul
5308    Jul
5905    May
2589    Jul
4840    Sep
3546    Nov
4013    Feb
4169    Jan
3313    Dec
2746    Mar
2567    Jan
5851    Dec
4931    Oct
3791    Sep
Name: Month_attack, dtype: object

Now I have a column with the months in which the attacks took place.
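A natural next step is to count the attacks per month in calendar order. A sketch on a short made-up series rather than the real `Month_attack` column:

```python
import pandas as pd

MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Made-up sample of extracted months, including one missing value.
months = pd.Series(["Jul", "Oct", "Aug", "Jul", "Jan", None, "Jul", "Aug"],
                   name="Month_attack")

# value_counts() ignores missing values by default; reindex() puts the
# months in calendar order and fills months with no attacks with 0.
counts = months.value_counts().reindex(MONTH_ORDER, fill_value=0)

print(counts)
```

Keeping the months in calendar order makes any seasonal pattern easier to spot than the default ordering by frequency.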