# Step one: Cleaning

In this notebook we will take the original data file and remove as many of its errors as we can. This will produce a clean file that allows for meaningful analysis.

## Importing libraries and reading the file

In [714]:
import pandas as pd

attacks = pd.read_csv("data/attacks.csv", encoding='unicode_escape')

# We will also run the 'cleaning.py' script, since it defines functions we will use

%run -i 'src/cleaning.py'

## Initial exploration
Now we will start exploring the file. Using ".head()" we can get a preview of the columns and the first rows of the dataset.

In [715]:
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,2018.06.04,2018.06.04,6299.0,,


In [716]:
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [717]:
attacks.shape

(25723, 24)

Using '.head()', '.columns' and '.shape' we can see that we are dealing with a file of 24 columns (names listed above) and 25,723 rows. Let's explore further:

In [718]:
attacks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

From this output we gather a few things:

#### Types and names of the columns: 
* There are only two columns with floats
    * This is odd considering that the column "Age" should also be a number
    * The column "original order" probably adds little to no value
* Columns 0, 19 and 20 all reference a case number
* Columns 17 and 18 could also be duplicated, since they both reference "href"

####  Non-null values:
* Column 0 "Case Number" has significantly more non-null values than the rest of the columns
* Columns 22 and 23 are unnamed and have 2 or fewer non-null values, so these columns add no value to the analysis
* Several columns have a non-null count nearing 6,300, but most columns have fewer than this. This means that we are dealing with a large number of incomplete rows.


### Deleting Unnamed columns:
We will start cleaning the file by deleting the last 2 columns. For this we will use a function declared in 'cleaning.py', which we ran at the beginning of this notebook:

In [719]:
delete_columns(attacks,['Unnamed: 22','Unnamed: 23'])

Deleted columns:  ['Unnamed: 22', 'Unnamed: 23']
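
The source of `delete_columns` lives in 'cleaning.py' and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch under the hypothetical name `delete_columns_sketch`. The behaviour of skipping columns that are already gone is my assumption, not necessarily what the real function does.

```python
import pandas as pd


def delete_columns_sketch(df, columns):
    """Drop the given columns in place and report which ones were removed.

    Hypothetical stand-in for the notebook's delete_columns helper.
    """
    # Only drop columns that exist, so re-running the cell does not raise KeyError.
    present = [col for col in columns if col in df.columns]
    df.drop(columns=present, inplace=True)
    print("Deleted columns: ", present)
    return present


# Small made-up frame that mimics the two empty trailing columns.
demo = pd.DataFrame({"a": [1, 2], "Unnamed: 22": [None, None], "Unnamed: 23": [None, None]})
deleted = delete_columns_sketch(demo, ["Unnamed: 22", "Unnamed: 23"])
```

Because the drop happens in place, the caller does not need to reassign the DataFrame, which matches how the helper is called in the cell above.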


In [720]:
attacks.sample(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
6330,0.0,,,,,,,,,,...,,,,,,,,,,
14975,,,,,,,,,,,...,,,,,,,,,,
7481,0.0,,,,,,,,,,...,,,,,,,,,,
17458,,,,,,,,,,,...,,,,,,,,,,
24022,,,,,,,,,,,...,,,,,,,,,,


### Dealing with the "Case Number" columns
Our data set contains 3 columns referring to a case number. To compare them we will use a function declared in 'cleaning.py', which we ran at the beginning of this notebook:

In [721]:
compare_columns(attacks,["Case Number","Case Number.1","Case Number.2"])

Column information:


Columns,Case Number,Case Number.1,Case Number.2
Lengths,25723,25723,25723
Non - null values,8702,6302,6302


Data type comparison:


Unnamed: 0,Case Number,Case Number.1,Case Number.2
<class 'float'>,17021,19421,19421
<class 'str'>,8702,6302,6302


Value comparisons:


Unnamed: 0,Case Number,Case Number.1,Case Number.2
Case Number,25723,23298,23318
Case Number.1,23298,25723,25703
Case Number.2,23318,25703,25723


The above matrix shows the amount of rows that each column shares with other columns. E.g. how many cells store the same information in both columns
Sample of columns with differences:


Unnamed: 0,Case Number,Case Number.1,Case Number.2
19516,,,
82,2017.09.16.b,2017.09.16.b,2017.09.16.b
9900,,,
24839,,,
3042,1981.02.02,1981.02.02,1981.02.02
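
The "Value comparisons" matrix can be reproduced with a few lines of pandas. The sketch below is not the code of `compare_columns` (which is not shown here); `match_matrix` is a hypothetical name. It assumes that two missing values count as a match, which is consistent with the diagonal of the matrix above being equal to the total number of rows.

```python
import numpy as np
import pandas as pd


def match_matrix(df, columns):
    """Count, for each pair of columns, the rows that hold the same value."""
    counts = pd.DataFrame(0, index=columns, columns=columns)
    for left in columns:
        for right in columns:
            # NaN == NaN is False in pandas, so missing pairs are added explicitly.
            same = (df[left] == df[right]) | (df[left].isna() & df[right].isna())
            counts.loc[left, right] = int(same.sum())
    return counts


# Made-up data: rows 0 and 2 match, rows 1 and 3 do not.
demo = pd.DataFrame({
    "x": ["a", "b", np.nan, "0"],
    "y": ["a", "c", np.nan, np.nan],
})
matrix = match_matrix(demo, ["x", "y"])
```

Subtracting an off-diagonal cell from the row count gives the number of rows in which two columns disagree.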


We can see that the columns "Case Number.1" and "Case Number.2" hold almost the same information: they return the same values in all the tests, and in the comparison matrix they differ in only 20 rows. Let's check these values:

In [722]:
different_value_columns(attacks, ["Case Number.1", "Case Number.2"])

Columns differ in the following 20 values


Unnamed: 0,Case Number.1,Case Number.2
34,2018.04.02,2018.04.03
117,2017/07.20.a,2017.07.20.a
144,2017.06.06,2017.05.06
217,2016.09.16,2016.09.15
314,2015.01.24.b,2016.01.24.b
334,2015.11.07,2015.12.23
339,2015.10.28,2015.10.28.a
560,2013.05.04,2014.05.04
3522,1967/07.05,1967.07.05
3795,1962.08.30.b,"1962,08.30.b"


We can see that the columns hold mostly the same information. The differences are punctuation (slashes or commas in place of periods), a missing letter suffix, or a digit in the date. We will keep column "Case Number.2" and drop "Case Number.1".

In [723]:
attacks.drop(["Case Number.1"], axis=1, inplace=True)
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.2,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,"No injury to occupant, outrigger canoe and paddle damaged",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,2018.06.25,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,2018.06.18,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,2018.06.09,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,2018.06.08,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,2018.06.04,6299.0


Now let's compare "Case Number" with "Case Number.2":

In [724]:
different_value_columns(attacks, ["Case Number.2", "Case Number"])

Columns differ in the following 2405 values


Unnamed: 0,Case Number.2,Case Number
390,2015.07.10,2015.07-10
4949,1934.02.08.R,1934.01.08.R
5488,1905.09.06.R,
5944,1864.05.00,1864.05
6302,,0
...,...,...
8698,,0
8699,,0
8700,,0
8701,,0


We can see that "Case Number" has a great many 0 values. Let's print the value counts for "Case Number" in the rows where it differs from "Case Number.2":

In [725]:
print_test = different_value_columns(attacks, ["Case Number.2", "Case Number"])
print_test["Case Number"].value_counts()

Columns differ in the following 2405 values


0               2400
2015.07-10         1
1934.01.08.R       1
                   1
1864.05            1
xx                 1
Name: Case Number, dtype: int64

As we suspected, in the rows where the two columns differ, "Case Number" holds "0" in 2,400 of them, "xx" in one, a blank in another, and a near-match of "Case Number.2" in only three. Since "Case Number.2" holds more complete information in these rows and "Case Number" adds nothing with its 0 values, we will drop "Case Number".

In [726]:
attacks.drop(["Case Number"], axis=1, inplace=True)
attacks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.2,original order
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and paddle damaged",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,2018.06.25,6303.0
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,2018.06.18,6302.0
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,2018.06.09,6301.0
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,2018.06.08,6300.0
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,2018.06.04,6299.0


Now let's explore the columns "href formula" and "href", since they appear to hold the same information. Since they hold web addresses, I'll increase the column width in order to see the full values.

In [727]:
pd.set_option("max_colwidth", 500)
different_value_columns(attacks, ["href formula", "href"])

Columns differ in the following 60 values


Unnamed: 0,href formula,href
50,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.01.13-Stewart.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2018.01.13-Stewart.pdf
96,http://sharkattackfile.net/spreadsheets/pdf_directory/2017.08.27-Brundler.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2017.08.27-Brundler.pdf
131,http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.05-FrenchPolynesia.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.05-FrenchPolynesia.pdf
133,http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.11-Goff.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.11-Goff.pdf
141,http://sharkattackfile.net/spreadsheets/pdf_directory/2017.05.27-Selwood.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2017.05.27-Selwood.pdf
168,http://sharkattackfile.net/spreadsheets/pdf_directory/2017.03.19-Fernandez.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2017.03.19-Fernandez.pdf
234,http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.29-Spain.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.29-Spain.pdf
241,http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.23.a-Cutbirth.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.23-Cutbirth.pdf
276,http://sharkattackfile.net/spreadsheets/pdf_directory/2016.05.21.a-Girl.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2016.05.21.a-Girl.pdf
324,http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf


In the rows shown above, most of the differences come from "href" repeating the base web address, and one comes from a difference in the case number. Since these columns add little value to the analysis we will conduct later, for simplicity I will fill the missing values of "href formula" with the values in "href" and drop the "href" column.

In [728]:
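# After the drops above, position 16 is "href formula" and position 17 is "href".
# Copy "href" into "href formula" for the two rows being filled.
attacks.iloc[3245,16] = attacks.iloc[3245,17]
attacks.iloc[3244,16] = attacks.iloc[3244,17]
attacks.drop(["href"], axis=1, inplace=True)
attacks.head()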

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,Case Number.2,original order
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and paddle damaged",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,2018.06.25,6303.0
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,2018.06.18,6302.0
2,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,2018.06.09,6301.0
3,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,2018.06.08,6300.0
4,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,2018.06.04,6299.0
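
Filling one column from another can also be done by label instead of by position, which keeps working if the column order changes. The sketch below uses `Series.fillna` on a made-up frame; the example.org addresses are placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame: the second row is missing its "href formula" value.
demo = pd.DataFrame({
    "href formula": ["http://example.org/a.pdf", np.nan],
    "href": ["http://example.org/a.pdf", "http://example.org/b.pdf"],
})

# Take the value from "href" wherever "href formula" is missing, then drop "href".
demo["href formula"] = demo["href formula"].fillna(demo["href"])
demo = demo.drop(columns=["href"])
```

Unlike the positional version, this fills every missing "href formula" value, not only specific rows, so it is worth checking how many rows are affected before relying on it.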


Now that we have altered our data frame a bit, let's tidy up. I'll reorder and rename a few columns to maintain structure, and then we will check the shape and non-null count again.

In [729]:
attacks = attacks[['Case Number.2', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity',
       'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time', 'Species ',
       'Investigator or Source', 'pdf', 'href formula',
       'original order']]
attacks.rename(columns={'href formula' : 'href',
                        'Case Number.2' : 'Case Number',
                        'Species ' : 'Species'}, inplace=True, errors='raise')
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57.0,"No injury to occupant, outrigger canoe and paddle damaged",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.25-Wolfe.pdf,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11.0,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.18-McNeely.pdf,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48.0,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.09-Denges.pdf,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.08-Arrawarra.pdf,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/2018.06.04-Ramos.pdf,6299.0


In [730]:
attacks.tail(100)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,original order
25623,,,,,,,,,,,,,,,,,,,
25624,,,,,,,,,,,,,,,,,,,
25625,,,,,,,,,,,,,,,,,,,
25626,,,,,,,,,,,,,,,,,,,
25627,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25718,,,,,,,,,,,,,,,,,,,
25719,,,,,,,,,,,,,,,,,,,
25720,,,,,,,,,,,,,,,,,,,
25721,,,,,,,,,,,,,,,,,,,


It looks like the bottom of our DataFrame is full of NaNs. Let's get rid of all the rows that contain nothing but NaNs.

In [731]:
attacks.dropna(how='all',inplace=True)
attacks.tail(100)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href,original order
6209,ND.0110,"No date, late 1960s",0.0,Unprovoked,VENEZUELA,Los Roques Islands,,Spearfishing,4 French divers,M,,"FATAL (x3), one survived with minor injuries",Y,,said to involve 2.5 m hammerhead sharks,http://waterco.com.br/ataque_tubarao.htm,ND-0110-FrenchDivers.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0110-FrenchDivers.pdf,94.0
6210,ND-0109,Before 2006,0.0,Unprovoked,USA,Florida,"Tampa Bay, Hillsborough County",Wade-fishing,Ed Snyder,M,,"No injury, shark rammed his back",N,,,Fishingworld.com,ND-0109-Ed-Snyder.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0109-Ed-Snyder.pdf,93.0
6211,ND-0108,Before 2003,0.0,Unprovoked,GREECE,Dodecanese Islands,Near Symi Island,Free diving for sponges,male,M,,FATAL,Y,,,M. Kalafatas,ND-0108-SpongeDiver-Symi.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0108-SpongeDiver-Symi.pdf,92.0
6212,ND-0107,Before 2004,0.0,Boat,MOZAMBIQUE,Inhambane Province,Off Inhambane,Fishing,"4.8-metre skiboat, Occupants: Rod Salm & 4 friends",,,"No injury to occupants, shark bumped boat",N,,Whale shark,South African Shark Attack File,ND-0107-Inhambane.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0107-Inhambane.pdf,91.0
6213,ND-0106,Before 1962,0.0,Unprovoked,SOUTH AFRICA,Western Cape Province,"Murray Bay, Robben Island",Swimming,"male, a mental patient",M,,"FATAL, body not recovered",Y,,,"L.Green, A Decent Fellow doesn't Work, p.225",ND-0106-MentalPatient.pdf,http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0106-MentalPatient.pdf,90.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6304,,,,,,,,,,,,,,,,,,,6306.0
6305,,,,,,,,,,,,,,,,,,,6307.0
6306,,,,,,,,,,,,,,,,,,,6308.0
6307,,,,,,,,,,,,,,,,,,,6309.0


In [732]:
attacks.shape

(6309, 19)

Our DataFrame dropped from 25,723 rows to just 6,309. However, looking at the tail, it looks like we have rows that are NaN in every column except the last one ("original order"). We might have rows like this in the middle of the DataFrame as well. Let's drop all rows with fewer than 14 non-null values (that is, more than 5 NaNs across the 19 columns), as they add little value to our analysis.

In [733]:
attacks.dropna(thresh=14,inplace=True)
attacks.shape

(6167, 19)
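
The two `dropna` calls behave differently: `how='all'` removes only rows that are entirely NaN, while `thresh=N` keeps the rows that have at least N non-null values. A small made-up frame shows the difference (the column names here are illustrative only):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one NaN, row 2 only has a value in "order".
demo = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, np.nan],
    "order": [10, 11, 12],
})

# No row is entirely NaN, so nothing is dropped here.
kept_all = demo.dropna(how="all")

# Keep rows with at least 3 non-null values: row 2 (one non-null value) is dropped.
kept_thresh = demo.dropna(thresh=3)
```

This is why the rows that only carry an "original order" value survive `how='all'` but not the `thresh` filter.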

We have now dropped to 6,167 rows! Let's check the DataFrame info.

In [734]:
attacks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6167 entries, 0 to 6301
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             6167 non-null   object 
 1   Date                    6167 non-null   object 
 2   Year                    6165 non-null   float64
 3   Type                    6164 non-null   object 
 4   Country                 6140 non-null   object 
 5   Area                    5791 non-null   object 
 6   Location                5716 non-null   object 
 7   Activity                5709 non-null   object 
 8   Name                    6030 non-null   object 
 9   Sex                     5689 non-null   object 
 10  Age                     3468 non-null   object 
 11  Injury                  6148 non-null   object 
 12  Fatal (Y/N)             5670 non-null   object 
 13  Time                    2946 non-null   object 
 14  Species                 3424 non-null   

Great! We've reduced our data set to 6,167 rows, and most of the columns have almost all values completed. Now let's start cleaning the data in each column, starting with the date information.


## Date information
It looks like the case number is in fact built from the date. It might be easier to get the date from this column than to extract it from the "Date" column.

In [735]:
attacks["Case Number"].value_counts()

1966.12.26      2
1907.10.16.R    2
2012.09.02.b    2
2013.10.05      2
1983.06.15      2
               ..
2000.02.19      1
2000.02.21      1
2000.03.00      1
2000.03.02      1
ND.0001         1
Name: Case Number, Length: 6151, dtype: int64

Since most of the data in this column could translate into a date, we'll use the "extract_date" function to build a new column with a formatted date using a regular expression.

In [736]:
extract_date(attacks,"Case Number")
attacks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6167 entries, 0 to 6301
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Case Number             6167 non-null   object        
 1   Date                    6167 non-null   object        
 2   Year                    6165 non-null   float64       
 3   Type                    6164 non-null   object        
 4   Country                 6140 non-null   object        
 5   Area                    5791 non-null   object        
 6   Location                5716 non-null   object        
 7   Activity                5709 non-null   object        
 8   Name                    6030 non-null   object        
 9   Sex                     5689 non-null   object        
 10  Age                     3468 non-null   object        
 11  Injury                  6148 non-null   object        
 12  Fatal (Y/N)             5670 non-null   object  
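
The code of `extract_date` is not shown in this notebook. The following is a minimal sketch of how a regular expression could turn a case number into a date; `case_number_to_date` is a hypothetical name. Mapping an unknown month or day (written as "00", as in "2000.03.00") to 1 is my own assumption.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day at the start of the case number.
CASE_DATE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})")


def case_number_to_date(case_number):
    """Return a date parsed from a case number, or None if it has no usable date."""
    if not isinstance(case_number, str):
        return None  # NaN and other non-string values
    match = CASE_DATE.match(case_number)
    if match is None:
        return None  # e.g. "ND.0001" (no date)
    year, month, day = (int(part) for part in match.groups())
    try:
        # Assumption: an unknown month or day ("00") falls back to 1.
        return date(year, max(month, 1), max(day, 1))
    except ValueError:
        return None  # impossible dates such as month 13
```

Applied to the whole column this could look like `attacks["Case Number"].map(case_number_to_date)`, and wrapping the result in `pd.to_datetime` would give a datetime column.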

## Activity information
The activities in the DataFrame are quite varied. We will use the "extract_activity" function to standardize the data.

In [737]:
extract_activity(attacks,"Activity")
attacks["activity_p"].value_counts()

swimming    2739
surfing     1529
fishing      987
others       638
sailing      274
Name: activity_p, dtype: int64

We can see that the function reduced the "others" category to around 10% of the values.
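
One common way to standardize free-text activities is keyword matching. The sketch below is not the code of `extract_activity`; the function name `categorize_activity` and the keyword lists are hypothetical and chosen only for illustration, so the counts it produces would differ from the ones above.

```python
# Hypothetical keyword lists; the first category with a matching keyword wins.
ACTIVITY_KEYWORDS = {
    "surfing": ("surf", "board"),
    "fishing": ("fish",),
    "sailing": ("sail", "boat", "canoe", "kayak"),
    "swimming": ("swim", "bath", "div", "wad", "snorkel", "stand"),
}


def categorize_activity(activity):
    """Map a free-text activity to one of a few broad categories."""
    if not isinstance(activity, str):
        return "others"  # NaN and other non-string values
    text = activity.lower()
    for category, keywords in ACTIVITY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "others"
```

The order of the dictionary matters: "Wade-fishing" contains both "wad" and "fish", and it lands in "fishing" because that category is checked first.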

## Time of day information

In [738]:
process_time_of_day(attacks,"Time")
print(attacks[["cat_time_of_day"]].value_counts())
print(attacks[["int_time_of_day"]].value_counts())

cat_time_of_day
afternoon          245
morning            151
night               93
dtype: int64
int_time_of_day
11                 259
16                 240
15                 237
14                 237
12                 204
13                 200
17                 199
10                 181
18                 131
09                 123
08                  92
07                  83
19                  53
06                  38
20                  30
05                  11
03                  10
23                   8
02                   7
01                   6
04                   6
21                   5
22                   5
00                   1
dtype: int64
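
The "Time" column mixes clock times such as "18h00" or "14h00 -15h00" with other text. As a sketch of how the two derived columns might be computed, assuming that the categorical column comes from textual entries such as "Afternoon" (an assumption, since `process_time_of_day` is not shown), with hypothetical function names:

```python
import re

# One or two digits, the letter "h", then two digits, e.g. "07h45".
HOUR = re.compile(r"^\s*(\d{1,2})h\d{2}")


def hour_of_day(time_text):
    """Return the hour as a two-digit string, or None if no clock time is found."""
    if not isinstance(time_text, str):
        return None
    match = HOUR.match(time_text)
    if match is None:
        return None
    hour = int(match.group(1))
    return f"{hour:02d}" if hour < 24 else None


def part_of_day(time_text):
    """Return 'morning', 'afternoon' or 'night' when the text names one of them."""
    if not isinstance(time_text, str):
        return None
    text = time_text.lower()
    for label in ("morning", "afternoon", "night"):
        if label in text:
            return label
    return None
```

For a range such as "14h00 -15h00" this sketch keeps only the starting hour.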


In [739]:
attacks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6167 entries, 0 to 6301
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Case Number             6167 non-null   object        
 1   Date                    6167 non-null   object        
 2   Year                    6165 non-null   float64       
 3   Type                    6164 non-null   object        
 4   Country                 6140 non-null   object        
 5   Area                    5791 non-null   object        
 6   Location                5716 non-null   object        
 7   Activity                5709 non-null   object        
 8   Name                    6030 non-null   object        
 9   Sex                     5689 non-null   object        
 10  Age                     3468 non-null   object        
 11  Injury                  6148 non-null   object        
 12  Fatal (Y/N)             5670 non-null   object  

## Age information

In [740]:
correct_age(attacks,"Age")
attacks[["Age","age_p"]]

Unnamed: 0,Age,age_p
0,57,57
1,11,11
2,48,48
3,,
4,,
...,...,...
6297,,
6298,,
6299,,
6300,,
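
"Age" is stored as an object column, which suggests that some entries are not plain numbers. A minimal sketch of a parser that keeps the first number it finds is shown below. `parse_age` is a hypothetical name and not the code of `correct_age`, and inputs such as "20s" or "teen" are illustrative values that I have not checked against the dataset.

```python
import re


def parse_age(age):
    """Return the age as an int, or None when no number can be recovered."""
    if isinstance(age, (int, float)):
        # NaN is the only float that is not equal to itself.
        return None if age != age else int(age)
    if not isinstance(age, str):
        return None
    match = re.search(r"\d{1,3}", age)
    # Caveat: text such as "18 months" would be read as 18 by this sketch.
    return int(match.group()) if match else None
```

With pandas this could be applied as `attacks["Age"].map(parse_age)`.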


## Fatal attack information

In [741]:
correct_fatality(attacks,'Fatal (Y/N)')
attacks[["Fatal (Y/N)","fatal_p"]]
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href', 'original order',
       'date_p', 'activity_p', 'int_time_of_day', 'cat_time_of_day', 'age_p',
       'fatal_p'],
      dtype='object')
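
The output of `correct_fatality` is not displayed above, so the following is only a guess at the kind of normalization involved. The hypothetical `parse_fatal` trims whitespace, ignores case, and maps anything other than "Y" or "N" to a missing value. Returning booleans is my own choice.

```python
def parse_fatal(value):
    """Map 'Y'/'N' (any case, surrounding spaces allowed) to True/False, else None."""
    if not isinstance(value, str):
        return None  # NaN and other non-string values
    text = value.strip().upper()
    if text == "Y":
        return True
    if text == "N":
        return False
    return None  # any other text is treated as unknown
```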

## Species information

In [742]:
%run -i 'src/cleaning.py'
process_species(attacks,"Species")
attacks[["Species","species_p"]]

Unnamed: 0,Species,species_p
0,White shark,White Shark
1,,Uncomfirmed Species
2,,Uncomfirmed Species
3,2 m shark,Uncomfirmed Species
4,"Tiger shark, 3m",Tiger Shark
...,...,...
6297,,Uncomfirmed Species
6298,,Uncomfirmed Species
6299,,Uncomfirmed Species
6300,,Uncomfirmed Species
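
The rows above suggest that `process_species` keeps the word in front of "shark" and falls back to a default label otherwise. A crude sketch that reproduces the displayed rows is given below. The name `normalize_species` and the regular expression are hypothetical, and the default label is spelled exactly as it appears in the output.

```python
import re

# A single word followed by "shark", e.g. "Tiger shark".
SPECIES = re.compile(r"\b([A-Za-z]+) shark\b", re.IGNORECASE)
UNCONFIRMED = "Uncomfirmed Species"  # spelling copied from the notebook output


def normalize_species(species):
    """Return '<Name> Shark' when a species word precedes 'shark', else the default."""
    if not isinstance(species, str):
        return UNCONFIRMED  # NaN and other non-string values
    match = SPECIES.search(species)
    if match is None:
        return UNCONFIRMED
    name = match.group(1)
    # A single letter such as the "m" in "2 m shark" is a unit, not a species.
    if len(name) < 2:
        return UNCONFIRMED
    return f"{name.title()} Shark"
```

A real implementation would more likely match against a list of known species, since this sketch would also accept adjectives such as "small".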


## Export to CSV

In [743]:
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href', 'original order',
       'date_p', 'activity_p', 'int_time_of_day', 'cat_time_of_day', 'age_p',
       'fatal_p', 'species_p'],
      dtype='object')

In [744]:
attacks.to_csv('./data/clean_attacks_full.csv')
delete_columns(attacks,['Date','Type','Activity', 'Name', 'Age', 'Fatal (Y/N)', 'Time', 'Species','Investigator or Source', 'pdf', 'href', 'original order', 'Injury'])
attacks = attacks[['Case Number', 'date_p', 'Year', 'Country', 'Area', 'Location', 'Sex ', 'age_p', 'activity_p', 'int_time_of_day', 'cat_time_of_day', 'fatal_p', 'species_p']]
attacks.rename(columns={'date_p' : 'Date',
                        'age_p' : 'Age',
                        'activity_p' : 'Activity',
                        'int_time_of_day' : 'Time_int',
                        'cat_time_of_day' : 'Time_cat',
                        'fatal_p' : 'Fatal',
                        'species_p' : 'Species'}, inplace=True, errors='raise')
attacks.to_csv('./data/clean_attacks.csv')


Deleted columns:  ['Date', 'Type', 'Activity', 'Name', 'Age', 'Fatal (Y/N)', 'Time', 'Species', 'Investigator or Source', 'pdf', 'href', 'original order', 'Injury']
