# Project: Data Cleaning and Manipulation with Pandas

# Dataset : Shark Attacks 

## 1. Importing libraries and Data exploration.

##### 1.1. We import the relevant libraries: Pandas, NumPy, Matplotlib and Seaborn, plus the standard-library modules re and math.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import math

##### 1.2. Then we load the data set.

In [2]:
df = pd.read_csv("/Users/Fabi/Documents/GitHub/data-ber-10-19/module-1_projects/pandas-project/your-code/GSAF5.csv",encoding = "ISO-8859-1")

#This will remove trailing spaces at the end of the column names

df.columns = df.columns.str.rstrip()
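
As a minimal sketch of what this step does, the following uses a small in-memory CSV (hypothetical data, not the shark attack file) whose header names carry trailing spaces:

```python
import io
import pandas as pd

# Hypothetical two-column CSV whose header names carry trailing spaces.
raw = "Sex ,Species \nM,White shark\nF,Tiger shark\n"
sample = pd.read_csv(io.StringIO(raw))

print(list(sample.columns))   # names still include the trailing spaces
sample.columns = sample.columns.str.rstrip()
print(list(sample.columns))   # names are now clean
```

Without this step, an expression such as `sample["Sex"]` would raise a `KeyError`, because the real column name would be `"Sex "`.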

##### 1.3. We explore the data set by looking at its first rows.

In [3]:
df.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2016.09.18.c,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.c,2016.09.18.c,5993,,
1,2016.09.18.b,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.b,2016.09.18.b,5992,,
2,2016.09.18.a,18-Sep-16,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,...,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.18.a,2016.09.18.a,5991,,
3,2016.09.17,17-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,...,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.17,2016.09.17,5990,,
4,2016.09.15,16-Sep-16,2016,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,...,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.09.16,2016.09.15,5989,,


##### 1.4. We also want to know how many rows and columns the data set has, so that we can later check whether too many columns are mostly null.

In [4]:
df.shape

(5992, 24)

##### 1.5. Data types

We want to know whether the data type of each column is suitable or whether we should change it:

In [5]:
df.dtypes

Case Number               object
Date                      object
Year                       int64
Type                      object
Country                   object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
Case Number.1             object
Case Number.2             object
original order             int64
Unnamed: 22               object
Unnamed: 23               object
dtype: object

### 2. Duplicated rows

We don't have duplicated rows in our data frame:


In [6]:

# Here we exclude this possible indexing column:

cols = [col for col in df.columns if col != "original order"]
df[cols][df[cols].duplicated()]



Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,Unnamed: 22,Unnamed: 23
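
An empty result means that no row repeats. A shorter way to get the same answer is to count the duplicates directly. The sketch below uses a small hypothetical frame rather than the real data:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 differ only in the running index column.
sample = pd.DataFrame({
    "Case Number": ["A", "A", "B"],
    "Country": ["USA", "USA", "AUSTRALIA"],
    "original order": [1, 2, 3],
})

# Drop the index-like column before checking, then count the repeated rows.
n_duplicates = sample.drop(columns="original order").duplicated().sum()
print(n_duplicates)
```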


### 3. Duplicated columns

3.1. In almost all the rows, the columns Case Number, Case Number.1  and Case Number.2 have the same values:

In [7]:
(df[
    (df["Case Number"] == df["Case Number.1"])
    & 
    (df["Case Number"] == df["Case Number.2"])
].shape)

(5979, 24)

Since 5979 of the 5992 rows agree, we can drop the columns "Case Number.1" and "Case Number.2".

In [8]:
df.drop(columns = ["Case Number.1", "Case Number.2"], inplace = True)

3.2. Similarly, the columns "href" and "href formula" might contain the same information. To validate this, the next line shows the rows where the values of these two columns differ.

In [9]:
df[df["href formula"] != df["href"]][["href formula", "href"]].shape

(54, 2)

We have 54 rows whose values differ between the two columns; this is roughly 0.9% of the total rows. As "href" has 3 NaNs and "href formula" has just 1 NaN, we can drop the column "href".
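
One caveat about this comparison: `!=` treats two NaN values as different, so rows where both columns are missing are counted as mismatches. A minimal sketch on hypothetical data shows the effect and one way to exclude those rows:

```python
import numpy as np
import pandas as pd

# Hypothetical columns: row 1 differs, row 2 is missing in both.
a = pd.Series(["x", "y", np.nan])
b = pd.Series(["x", "z", np.nan])

naive_diff = (a != b).sum()                       # counts the NaN/NaN row too
both_missing = a.isna() & b.isna()
strict_diff = ((a != b) & ~both_missing).sum()    # ignores rows missing in both
print(naive_diff, strict_diff)
```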

In [10]:
# Locating the value of the missing row in href formula which is present in href:

df[(df["href formula"] != df["href"]) & (df["href formula"]).isna()]["href"]

#shark_attack.drop(columns = ["href"], inplace = True)

3019    http://sharkattackfile.net/spreadsheets/pdf_di...
Name: href, dtype: object

In [11]:
# Then assign this value to the dataset:

df.at[3019, 'href formula'] = df.loc[3019]["href"]

In [12]:
# Validating the change:

df.loc[3019]["href formula"]

'http://sharkattackfile.net/spreadsheets/pdf_directory/1975.01.19-Barrowman.pdf'

Now we can finally drop the column "href" from the data set:

In [13]:
df.drop(columns = ["href"], inplace = True)

3.3. We will sort the data frame by "original order" and have a look at the columns "Case Number" and "original order". Possibly we can drop one of them and use the other as the index of our data frame.

In [14]:
df[["Case Number", "Date", "original order"]].sort_values(by ='original order').head(10)

Unnamed: 0,Case Number,Date,original order
5991,ND.0001,1845-1853,2
5990,ND.0002,1883-1889,3
5989,ND.0003,1900-1905,4
5988,ND.0004,Before 1903,5
5987,ND.0005,Before 1903,6
5986,ND.0006,Before 1906,7
5985,ND.0007,Before 1906,8
5984,ND.0008,Before 1906,9
5983,ND.0009,Before 1906,10
5982,ND.0010,Circa 1862,11


### 4. Missing values

We now want to know whether there is missing information in the data set. To this end, we use the method isna provided by Pandas. We also introduce a function that gives the percentage of NaNs in each column relative to the total number of rows in the data set.

In [15]:
def null_cols(data):
    
    """
    This function takes a dataframe data and returns, for each column that has
    NaN values, the percentage of rows in which the value is missing.
    
    """
    
    nulls = data.isna().sum()
    return nulls[nulls > 0] / len(data) * 100


In [16]:
null_cols(df)

Country                    0.717623
Area                       6.708945
Location                   8.277704
Activity                   8.795060
Name                       3.337784
Sex                        9.462617
Age                       44.742991
Injury                     0.450601
Fatal (Y/N)                0.317089
Time                      53.621495
Species                   48.965287
Investigator or Source     0.250334
Unnamed: 22               99.983311
Unnamed: 23               99.966622
dtype: float64
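
The same percentages can be obtained from the mean of the boolean mask. This is only an alternative sketch on a hypothetical frame, not a replacement for `null_cols`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully populated column and one half-empty column.
sample = pd.DataFrame({"Country": ["USA", "ITALY"], "Age": [np.nan, "20"]})

# The mean of a boolean mask is the fraction of True values.
pct_missing = sample.isna().mean() * 100
pct_missing = pct_missing[pct_missing > 0]
print(pct_missing)
```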

### Some preliminary conclusions:

* The unnamed columns 22 and 23 should be dropped because they are almost entirely null.

* Due to the relevance of the "Species" column in the data set, we cannot drop it, even though nearly 50% of its values are NaN.

* The columns "Time" and "Year" seem to be irrelevant. However, we will explore them further.

* The columns "Country", "Area" and "Location" are related, so it might be possible to infer the missing values of one from the others.

* The column "Age" can perhaps be inferred from other columns after further inspection. It also seems relevant in our context, so we decided not to drop it.


#### 4.1. Dropping "Unnamed" and "Time" columns

As they are not relevant, we can drop the unnamed columns from our dataset. We also decided to drop the column "Time" because more than 50% of its values are null and the information it contains is not precise.

In [17]:
df.drop(columns = ["Unnamed: 22", "Unnamed: 23", "Time"], inplace = True)

In [18]:
null_cols(df)

Country                    0.717623
Area                       6.708945
Location                   8.277704
Activity                   8.795060
Name                       3.337784
Sex                        9.462617
Age                       44.742991
Injury                     0.450601
Fatal (Y/N)                0.317089
Species                   48.965287
Investigator or Source     0.250334
dtype: float64

#### 4.2. Filling values in "Injury" and "Fatal (Y/N)" heuristically

"Injury" and "Fatal (Y/N)" are correlated, so we can use the information in one to fill the gaps in the other. The next line of code shows the rows in which we have a value for "Injury" but NaN in "Fatal (Y/N)":

In [19]:
df[(df["Injury"].isna()== False) & (df["Fatal (Y/N)"].isna())][["Injury", "Fatal (Y/N)"]]

Unnamed: 0,Injury,Fatal (Y/N)
54,"No injury, but sharks repeatedly hit their fin...",
1844,Reported as shark attack but probable drowning,
2449,FATAL,
3280,"Diver shot the shark, then it injured his arm ...",
3435,"Disappeared, probable drowning but sharks in a...",
3901,Boat damaged,
4107,No injury to occupants. Shark tore nets & traw...,
4112,Human remains found in shark,
5307,"Disappeared, but shark involvement unconfirmed",
5437,"No injury, no attack",


In [20]:
# This is a list with the indexes of the rows which have NaN values in the column "Fatal"
rows_missing = list(df[(df["Injury"].isna()== False) & (df["Fatal (Y/N)"].isna())].index)

We can infer the missing values as follows:

In [21]:
inferences = ["N", "Y", "Y", "N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y"]


At this point we introduce the next function, which allows us to fill or replace existing data in our data frame:

In [22]:
def filling(data, indexes, values, col_name):
    
    """
    
    This function fills a column col_name of a data frame data at the places located
    by indexes with the corresponding values in values.
    
    """
    
    for j, i in enumerate(indexes):
        data.at[i, col_name] = values[j]

In [23]:
filling(df, rows_missing, inferences, "Fatal (Y/N)")
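
The same kind of assignment can be written without an explicit loop. This is only a sketch of an alternative on hypothetical data; it assumes that `rows` and `values` have the same length:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps at index labels 10 and 30.
sample = pd.DataFrame({"Fatal (Y/N)": ["Y", np.nan, "N", np.nan]},
                      index=[5, 10, 20, 30])

rows = [10, 30]
values = ["N", "Y"]

# Label-based assignment: values are matched to rows by position in the lists.
sample.loc[rows, "Fatal (Y/N)"] = values
print(sample)
```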

In [24]:
null_cols(df)

Country                    0.717623
Area                       6.708945
Location                   8.277704
Activity                   8.795060
Name                       3.337784
Sex                        9.462617
Age                       44.742991
Injury                     0.450601
Species                   48.965287
Investigator or Source     0.250334
dtype: float64

#### 4.3. Filling values in "Country", "Area" and "Location"

As "Country", "Area" and "Location" are related, we will try to infer the missing values of one of these columns from the other two. We are going to infer the missing values of "Country" first.

##### 4.3.1. Missing values in "Country"

In [25]:
df[(df.Country.isna()) & (df.Area.isna()== False) & (df.Location.isna()==False)][["Country", "Area", "Location"]]

Unnamed: 0,Country,Area,Location
3162,,Caribbean Sea,Between St. Kitts & Nevis
4040,,Between Comores & Madagascar,Geyser Bank
4271,,Caribbean Sea,Between Cuba & Costa Rica
4790,,French Southern Territories,Île Saint-Paul


In [26]:
rows_missing = list(df[(df.Country.isna()) & (df.Area.isna()== False) & (df.Location.isna()==False)].index)

In [27]:
inferences = ["Saint Kitts and Nevis", "France", "Cuba", "France"]

In [28]:
filling(df, rows_missing, inferences, "Country")
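
The countries above were chosen by hand. As a possible complement (an assumption on our side, not something applied in this project), gaps could first be filled from rows that share the same "Area" and already have a country. A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the last one lacks a country but shares its area with others.
sample = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA", np.nan],
    "Area": ["Florida", "Florida", "Victoria", "Florida"],
})

# Most frequent known country for each area.
known = sample.dropna(subset=["Country"])
area_to_country = known.groupby("Area")["Country"].agg(lambda s: s.mode().iloc[0])

# Fill only the gaps; areas without a known country stay NaN.
sample["Country"] = sample["Country"].fillna(sample["Area"].map(area_to_country))
print(sample)
```

This only helps when the area also appears in rows with a known country, which is not the case for open-sea areas such as "Caribbean Sea".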

In [29]:
df[(df.Country.isna()) & (df.Area.isna()) & (df.Location.isna()==False)][["Country", "Area", "Location"]]

Unnamed: 0,Country,Area,Location
3379,,,Florida Strait
4412,,,225 miles east of Hong Kong
5189,,,Near the equator
5560,,,Santa Cruz
5847,,,Carlisle Bay
5896,,,In a river feeding into the Bay of Bengal


In [30]:
rows_missing = list(df[(df.Country.isna()) & (df.Area.isna()) & (df.Location.isna()==False)].index)

# "USA" matches the spelling already used in the column "Country"
inferences = ["USA", "China", "Ecuador", "USA", "Barbados", "India"]
inferences1 = ["Florida", "Hong Kong", "Ecuador", "California", "Carlisle Bay", "Bengal"]

filling(df, rows_missing, inferences, "Country")
filling(df, rows_missing, inferences1, "Area")


In [31]:
null_cols(df)

Country                    0.550734
Area                       6.608812
Location                   8.277704
Activity                   8.795060
Name                       3.337784
Sex                        9.462617
Age                       44.742991
Injury                     0.450601
Species                   48.965287
Investigator or Source     0.250334
dtype: float64

If all three columns are null, we cannot infer any of the fields. As these rows are a very small part of the data frame (< 0.6%), we decided to drop them.

In [32]:
rows_missing = list(df[df.Country.isna() & ( df.Area.isna()) & (df.Location.isna())].index)

df = df.drop(rows_missing, axis =0).copy()
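
Dropping rows in which several columns are all missing can also be expressed with `dropna`. The following is a sketch on a hypothetical frame, shown only as an alternative to building the list of indexes by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the second row lacks all three location fields.
sample = pd.DataFrame({
    "Country": ["USA", np.nan, np.nan],
    "Area": ["Florida", np.nan, "Caribbean Sea"],
    "Location": [np.nan, np.nan, np.nan],
    "Sex": ["M", "F", "M"],
})

# how="all" drops a row only when every column listed in subset is NaN.
cleaned = sample.dropna(subset=["Country", "Area", "Location"], how="all")
print(cleaned.index.tolist())
```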



In [33]:
df[df.Country.isna() & ( (df.Area.isna()== False) & (df.Location.isna()))]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
2731,1983.00.00.d,Ca. 1983,1983,Unprovoked,,English Channel,,Swimming,Padma Shri Taranath Narayan Shenoy,M,,Left leg bitten,N,,"Times of India, 2/5/2012",1983.00.00.d-Shenoy.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,3262
3792,1960.01.26,26-Jan-60,1960,Sea Disaster,,"Between Timor & Darwin, Australia",,Portuguese Airliner with 9 people aboard went ...,,,,"As searchers approached wreckage, sharks circl...",N,,"V.M. Coppleson (1962), p.260",1960.01.26-Portuguese airliner.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2201
4005,1956.09.13,13-Sep-56,1956,Unprovoked,,Near the Andaman & Nicobar Islands,,Climbing back on ship,male,M,,FATAL,Y,Blue shark,M. Hosina,1956.09.13-TunaBoat.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,1988
4473,1942.11.00.a,Nov-42,1942,Sea Disaster,,Off South American coast,,Dutch merchant ship Zaandam torpedoed by the ...,,M,,FATAL,Y,,"M. Murphy; V.M. Coppleson (1962), pp.207-208",1942.11.00.a-Izzi.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,1520
4485,1942.06.00,Jun-42,1942,Unprovoked,,300 miles east of St. Thomas (Virgin Islands),,On life raft tethered to lifeboat. A seaman pu...,male,M,,Forearm lacerated,N,,"V.M. Coppleson (1962), p.258",1942.06.00-on-life-raft.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,1508
5370,1897.03.15.b.R,Reported 15-Mar-1897,1897,Unprovoked,,Mediterranean Sea,,Swimming,male,M,,FATAL,Y,,"Daily Northwestern, 5/15/1897",1897.03.15.b.R-Mediterranean.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,623
5558,1881.08.16.R,Reported 16-Aug-1881,1881,Unprovoked,,Western Banks,,"Floating, holding onto an oar after dory capsized",George Sedgwick,M,20.0,FATAL,Y,,"Lewiston Evening Journal, 8/16/1881",1881.08.16.R-Sedgwick.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,435
5866,0077.00.00,77 A.D.,77,Unprovoked,,Ionian Sea,,Sponge diving,males,M,,FATAL,Y,,Perils mentioned by Pliny the Elder (23 A.D. t...,77AD-Pliny.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,127
5868,0.0214,Ca. 214 B.C.,0,Unprovoked,,Ionian Sea,,Ascending from a dive,"Tharsys, a sponge diver",M,,"FATAL, shark/s bit him in two",Y,,"Reported by Greek poet, Leonidas of Tarentum (...",214BC-Tharsys.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,125


We drop the rows whose "Area" value is not enough to infer the rest.

In [34]:
df = df.drop([4473, 5370, 5558, 5866, 5868]).copy()

In [35]:
rows_missing = list(df[df.Country.isna() & ( (df.Area.isna()== False) & (df.Location.isna()))].index)

# Putting values for countries 
inferences = ["England", "Australia", "India", "St. Thomas"]
filling(df, rows_missing, inferences, "Country")

In [36]:
#inferences = ["English Channel", "Darwin", "Bengal", "St. Thomas"]

inferences = list(df[((df.Area.isna()== False) & (df.Location.isna()))]["Area"].loc[rows_missing])

filling(df, rows_missing, inferences, "Location")


In [37]:
inferences = ["English Channel", "Darwin", "Bengal", "St. Thomas"]
filling(df, rows_missing, inferences, "Area")

##### 4.3.2. We make sure every country is written in upper case:

In [38]:
df["Country"] = df['Country'].str.upper() 
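
Upper-casing does not remove stray spaces, so values such as "USA " and " USA" would still count as different countries. Whether the data contains such values is not checked here; the sketch below, on hypothetical values, shows how both steps could be combined:

```python
import pandas as pd

# Hypothetical spellings of the same countries.
countries = pd.Series([" usa", "USA ", "Australia", None])

# Strip surrounding spaces, then upper-case; missing values stay missing.
normalised = countries.str.strip().str.upper()
print(normalised.value_counts())
```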

### 4.4. Cleaning and parsing the column Date

We sort the data frame by "original order" again and look at the first rows to see how "Case Number", "Date" and "Year" relate to each other.

In [39]:
df.sort_values(by ='original order').head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
5991,ND.0001,1845-1853,0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,15.0,"FATAL. ""Shark bit him in half, carrying away t...",Y,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2
5990,ND.0002,1883-1889,0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,3
5989,ND.0003,1900-1905,0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,4
5988,ND.0004,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5
5987,ND.0005,Before 1903,0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,6


We see that the first entries of the column "Case Number" are related to the original order. Also, the case number contains the date when there is one.

In [40]:
df[(df["Case Number"].str.contains("ND")) & (df["Date"].str.contains("No date"))]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
5899,ND.0110,"No date, late 1960s",0,Unprovoked,VENEZUELA,Los Roques Islands,,Spearfishing,4 French divers,M,,"FATAL (x3), one survived with minor injuries",Y,said to involve 2.5 m hammerhead sharks,http://waterco.com.br/ataque_tubarao.htm,ND-0110-FrenchDivers.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,94
5905,ND.0102,"No date, Before 1963",0,Unprovoked,BAHREIN,,,Pearl diving,male,M,,FATAL,Y,Tiger shark,A.C. Doyle,ND-0102-Bahrein.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,88
5907,ND.0097,No date,0,Unprovoked,USA,Florida,"Key West, Monroe County",Kitesurfing,Paul Menta,M,,Hand bitten,N,,Internet,ND-0097-PaulMenta.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,86
5908,ND.0096,No date,0,Unprovoked,REUNION,Grand'Anse,Petite-île,yachtsman in a zodiac,,M,,Survived,N,,G. Van Grevelynghe,ND-0096-Zodiac-Reunion.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,85
5910,ND.0094,"No date, Before May-1996",0,Unprovoked,KOREA,South Korea,Cheju Island,Diving,"female, a Hae Nyeo",F,,"FATAL, injured while diving, then shark bit her",Y,,"K. Amsler, Divernet.com",ND-0094-HaeNyeo.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,83
5911,ND.0093,"No date, Before Mar-1995",0,Unprovoked,FRENCH POLYNESIA,Tuamotus,Rangiroa,Fishing,male,M,,"Speared a shark, fell overboard and another sh...",N,,J. Windh,ND-0093-Rangiroa.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,82
5913,ND.0090,"No date, Before Aug-1989",0,Unprovoked,VANUATU,Malampa Province,Malakula,,female,F,,FATAL,Y,,S. Combs,ND-0090-Vanuatu-female.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,80
5914,ND.0089,"No date, Before Aug-1987",0,Provoked,VANUATU,Malampa Province,"Hokai, Malakula",Attempting to drive shark from area,a chief,M,,Speared shark broke outrigger of canoe throwin...,N,A large hammerhead shark,S. Combs,ND-0089-VanuatuChief.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,79
5915,ND.0088,"No date, Before 1987",0,Unprovoked,IRAN,Khuzestan Province,"Ahvaz, on the Karun River",,Mr. Jabar-Kaaby,M,,Foot severed,N,Bull shark,B. Coad & F. Papahn,ND-0088-Jabar-Kaaby.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,78
5916,ND.0087,"No date, Before 1975",0,Provoked,USA,Florida,"Riviera Beach, Palm Beach County",Skin diving. Grabbed shark's tail; shark turne...,Carl Bruster,M,19.0,"Ankle punctured & lacerated, hands abraded PRO...",N,"Nurse shark, 2.1 m [7']","R. Skocik, p.176",ND-0087-Carl-Bruster.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,77


The string "ND" in the entries of "Case Number" means that there is no precise date. In these rows the value of "Year" is also 0. If "Case Number" contains "ND" and the value in "Date" is exactly "No date", we drop the row.


In [41]:
to_drop = list(df[(df["Case Number"].str.contains("ND")) & (df["Date"].eq("No date"))].index)

In [42]:
df = df.drop(to_drop).copy()

Next, we remove phrases such as "No date" and "Before" from the column "Date". As these rows are not a significant share of the data set, we keep just the year.

In [43]:
rows = list(df[(df["Case Number"].str.contains("ND")) & (df["Date"].str.match("No date"))].index)

In [44]:
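# The data spells the phrase "No date" with a lower-case "d"; regex=False treats the patterns as plain text
df["Date"] = df["Date"].str.replace("No date, Before", "", regex=False)
df["Date"] = df["Date"].str.replace("Before", "", regex=False)
df["Date"] = df["Date"].str.replace("No date,", "", regex=False)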
df[(df["Case Number"].str.contains("ND")) & (df["Date"].str.match("No date"))]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
5979,ND.0013,No date (3 days after preceding incident) & pr...,0,Unprovoked,SOUTH AFRICA,KwaZulu-Natal,Durban,Fishing,a native fisherman,M,,"FATAL, body not recovered but shark was caught...",Y,,"Rural New Yorker, 7/19/1913",ND-0013-Durban-native-fisherman.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,14


In [45]:
# We drop this row which does not have relevant information
df = df.drop(5979).copy()

In [46]:
rows_missing = list(df[(df["Case Number"].str.contains("ND.")) & (df["Date"].str.contains("Between"))].index)

In [47]:
filling(df, rows_missing, ["1957", "1928"], "Date")

We don't want to lose information about the Species, because we consider it very important. However, we want to drop the rows which do not contain a valid date.

In [48]:
# Rows which have information about Species but don't have a valid date and Case number with no date

rows_missing = list(df[(df["Case Number"].str.contains("ND")) & (df["Species"].isna()== False)].index)

In [49]:
# Series corresponding to the column Date of the rows above

dates = df[(df["Case Number"].str.contains("ND")) & (df["Species"].isna()== False)]["Date"].copy()


Here we iterate over the index of dates and replace the values with valid dates heuristically. We take the first day of the year


In [50]:
year_pattern = r"\d{4}$"

year_s_pattern = r"(\d{4})(\w{1})$"

for i in rows_missing:
    year = re.findall(year_pattern, dates[i]) # for strings containing just a year
    years = re.findall(year_s_pattern, dates[i]) #for strings containing XXXXs
    if len(year) > 0:
        dates[i] = year[0] +  "-01-01"
    if len(years) > 0:
        dates[i] = years[0][0] + "-01-01"
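
The two patterns can also be folded into one helper. The function below is a hypothetical illustration (the name `year_to_date` is ours) and, unlike `year_s_pattern`, it only accepts a trailing "s" after the year:

```python
import re

def year_to_date(text):
    """Return 'YYYY-01-01' for strings ending in 'YYYY' or 'YYYYs', else None."""
    match = re.search(r"(\d{4})s?$", text.strip())
    if match is None:
        return None
    return match.group(1) + "-01-01"

# Hypothetical inputs in the style of the "Date" column.
for raw in [" 1963", " late 1960s", "Ca. 214 B.C."]:
    print(repr(raw), "->", year_to_date(raw))
```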
        
        

In [51]:
# Filling the column Date with the values obtained above

filling(df, list(dates.index), list(dates), "Date")


In [52]:
dates = df[df["Date"].str.match(r"\d{4}$")]["Date"].copy()
rows_missing = list(df[df["Date"].str.match(r"\d{4}$")].index)

In [53]:
for i in rows_missing:
    year = re.findall(year_pattern, dates[i]) # for strings containing just a year
    years = re.findall(year_s_pattern, dates[i]) #for strings containing XXXXs
    if len(year) > 0:
        dates[i] = year[0] +  "-01-01"
    if len(years) > 0:
        dates[i] = years[0][0] + "-01-01"       

In [54]:
filling(df, list(dates.index), list(dates), "Date")

Now the column "Date" is almost clean:

In [55]:
df["Date"] = df["Date"].str.replace("Reported ", "", regex=False)
# Remove the leading word "Circa"; str.lstrip("Circa") would instead strip any of
# the characters C, i, r, c, a from the left, not just the word itself
df["Date"] = df["Date"].str.replace(r"^circa", "", case=False, regex=True)
df["Date"] = df["Date"].str.lstrip()
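
`str.lstrip` takes a set of characters rather than a prefix, which is worth keeping in mind when stripping words such as "Circa". A small sketch in plain Python (the helper `drop_circa` is hypothetical) shows the difference:

```python
import re

# str.lstrip removes any leading characters found in the given set.
print(repr("Circa 1862".lstrip("Circa")))   # leaves ' 1862'
print(repr("Ca. 1983".lstrip("Circa")))     # also strips 'C' and 'a'

def drop_circa(text):
    """Remove a leading 'Circa' (and following spaces) and nothing else."""
    return re.sub(r"^Circa\s*", "", text)

print(repr(drop_circa("Circa 1862")))
print(repr(drop_circa("Ca. 1983")))
```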

Again we repeat the process above with these last changes:

In [56]:
dates = df[df["Date"].str.match(r"(\d{4})(\w{1})$")]["Date"].copy()
rows_missing = list(dates.index)

In [57]:
for i in rows_missing:
    years = re.findall(year_s_pattern, dates[i]) #for strings containing XXXXs
    if len(years) > 0:
        dates[i] = years[0][0] + "-01-01"

In [58]:
filling(df, rows_missing, list(dates), "Date")

In [61]:
df["Date"] = pd.to_datetime(df["Date"], errors = "coerce")
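
With `errors = "coerce"`, every string that cannot be parsed becomes `NaT` rather than raising an error, so it is worth counting how many values were lost. A small sketch on hypothetical values (the explicit `format` is our choice to keep the example deterministic):

```python
import pandas as pd

# Hypothetical values: one ISO date and one string that is not a date.
raw = pd.Series(["2016-09-18", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum(), "value(s) could not be parsed")
```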

In [62]:
# Filtering the greatest real date: Notice that the info is contained in the column Case Number

df[~(df["Case Number"].str.match("ND"))].sort_values(by = "Case Number", ascending = False).head(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
5896,nd-0114,2012-01-01,0,Unprovoked,INDIA,Bengal,In a river feeding into the Bay of Bengal,Netting shrimp,Sametra Mestri,F,,Hand severed,N,,National Georgraphic Television,ND-0114-BayOfBengal.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,97
0,2016.09.18.c,2016-09-18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,16.0,Minor injury to thigh,N,,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5993
1,2016.09.18.b,2016-09-18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,36.0,Lacerations to hands,N,,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5992
2,2016.09.18.a,2016-09-18,2016,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,43.0,Lacerations to lower leg,N,,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5991
3,2016.09.17,2016-09-17,2016,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,,Struck by fin on chest & leg,N,,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,5990


We realize that the column "Date" contains unrealistic dates, far later than the latest real date, "2016-09-18":

In [63]:
from datetime import date 

df[df["Date"]> "2016.09.18"].sort_values(by = "Date", ascending = False).head()


Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
3852,1959.08.02,2176-01-01,1959,Invalid,ITALY,Tuscany,"Cala del Corvo, Isola del Giglio",Scuba diving,Karl Pollerer & Eric Eisesenid,M,34 & 19,Probable drowing. Shark involvement unconfirmed,Y,,"C. Moore, GSAF",1959.08.02-Giglio.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2141
3220,1968.12.29,2068-12-29,1968,Invalid,SOUTH AFRICA,KwaZulu-Natal,Port St. John's,Freediving,John Domoney,M,,No injury,N,,H.D.Baldridge (1994) SAF Case #1588. Note: Una...,1968.12.29-NV-Domoney.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2773
3221,1968.12.26,2068-12-26,1968,Provoked,AUSTRALIA,New South Wales,"Marineland Aquarium, Manley, Sydney",Feeding mullet to sharks,Peter Jones,M,27,Laceration to finger by a captive shark PROVOK...,N,"Grey nurse shark, 10'","The Age, 12/27/1968",1968.12.26-Jones.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2772
3222,1968.12.25,2068-12-25,1968,Unprovoked,NEW ZEALAND,South Island,"St. Clair Beach, Dunedin",Surfing,Gary Barton,M,17,"Hit in face by shark, arm abraded, surfboard b...",N,White shark,"R. D. Weeks, GSAF; Otago Daily Times, 12/26/1968",1968.12.25-Barton.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2771
3223,1968.12.09,2068-12-09,1968,Unprovoked,AUSTRALIA,South Australia,Thistle Island,,Dick OBrien,M,,Survived,N,White shark,"T. Peake, GSAF",1968.12.09-O'Brien.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2770


Now we correct all the dates later than "2016.09.18"; notice that nearly all of them are 100 years later than the real date, most likely because two-digit years such as "68" were read as 2068.


In [64]:

bad_dates = df[df["Date"]> "2016.09.18"].copy()

bad_dates["Date"] = bad_dates['Date'].apply(lambda x: x.replace(year = x.year - 100))



In [65]:
rows_missing = list(bad_dates.index)
filling(df, rows_missing, list(bad_dates.Date), "Date")
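
The same correction can be applied in place with a boolean mask, without the intermediate copy. A sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical dates: the first was parsed a century too late.
dates = pd.Series(pd.to_datetime(["2068-12-29", "2016-09-17", None]))
cutoff = pd.Timestamp("2016-09-18")

# NaT compares as False, so missing dates are left untouched.
too_late = dates > cutoff
dates.loc[too_late] = dates.loc[too_late].apply(lambda ts: ts.replace(year=ts.year - 100))
print(dates)
```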

In [66]:
df[df["Date"]> "2016.09.18"].sort_values(by = "Date", ascending = False).head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
3852,1959.08.02,2076-01-01,1959,Invalid,ITALY,Tuscany,"Cala del Corvo, Isola del Giglio",Scuba diving,Karl Pollerer & Eric Eisesenid,M,34 & 19,Probable drowing. Shark involvement unconfirmed,Y,,"C. Moore, GSAF",1959.08.02-Giglio.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,2141


We manually fill in the correct date for this entry:

In [67]:
filling(df, [3852], [pd.Timestamp("1959.08.02")], "Date")

Now, we drop rows which do not have valuable information about the date.

In [68]:
rows_to_drop = list(df[(df["Date"].isna()) & (df["Case Number"].str.contains("ND"))].index)
df = df.drop(rows_to_drop, axis =0).copy()


There are still rows whose date could not be parsed. We now use regular expressions to extract usable date values from the column "Case Number":

In [69]:
bad_dates = df[(df["Date"].isna())].copy()
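# Regex patterns: any character followed by a letter, e.g. the suffixes ".b" or ".R"
bad_dates["Case Number"] = bad_dates["Case Number"].str.replace(".[a-z]", "", regex=True)
bad_dates["Case Number"] = bad_dates["Case Number"].str.replace(".[A-Z]", "", regex=True)
# Plain-text patterns: remove the dots, then turn an unknown month and day into 01-01
bad_dates["Case Number"] = bad_dates["Case Number"].str.replace(".", "", regex=False)
bad_dates["Case Number"] = bad_dates["Case Number"].str.replace("0000", "0101", regex=False)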
bad_dates_2 = bad_dates[bad_dates["Case Number"].str.contains("00$")].copy()

This function repairs the dates ending in "00" by replacing the last two characters with "01":

In [70]:
def f(x):
    x = x[:-2]
    return x + "01"

bad_dates_2["Case Number"] = bad_dates['Case Number'].apply(f)

In [71]:
#Parsing the dates

bad_dates_2["Case Number"] = pd.to_datetime(bad_dates_2["Case Number"], errors = "coerce")

In [72]:
# We use the column "Case Number" to fill in a suitable way the "Date" column:
bad_dates_2["Date"] = bad_dates_2["Case Number"]

In [73]:
rows_missing = list(bad_dates_2.index)
filling(bad_dates, rows_missing, list(bad_dates_2["Date"]), "Date")
bad_dates["Case Number"] = pd.to_datetime(bad_dates["Case Number"], errors = "coerce")
rows_missing= list(bad_dates.index)
filling(df, rows_missing, list(bad_dates["Case Number"]), "Date")
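
The idea behind these cells is that a case number such as "1942.11.00.a" encodes year, month and day, with "00" standing for an unknown part. A compact way to express that rule, as a hypothetical helper that is not part of the original notebook, could look like this:

```python
import re
from datetime import date

def case_number_to_date(case_number):
    """Read 'YYYY.MM.DD' from the start of a case number; 00 becomes 01."""
    match = re.match(r"(\d{4})\.(\d{2})\.(\d{2})", case_number)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, max(month, 1), max(day, 1))
    except ValueError:
        # Year 0 or an impossible month/day cannot be represented.
        return None

# Case numbers in the style shown earlier in this notebook.
for case in ["1897.03.15.b.R", "1942.11.00.a", "ND.0001"]:
    print(case, "->", case_number_to_date(case))
```

Note that `datetime.date` accepts years that fall outside the range pandas timestamps with nanosecond resolution can represent (roughly 1677 to 2262).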

We will drop all the rows with no date and no species


In [74]:
rows_to_drop = list(df[df.Date.isna() & df.Species.isna()].index)
df = df.drop(rows_to_drop, axis=0)
bad_dates = df[df.Date.isna()].copy()
bad_dates["Case Number"] = bad_dates['Case Number'].apply(f)
bad_dates["Date"] = pd.to_datetime(bad_dates["Case Number"], errors = "coerce")

In [75]:
rows_missing= list(bad_dates.index)
filling(df, rows_missing, list(bad_dates["Date"]), "Date")
bad_dates = df[df.Date.isna()].copy()

Now we define a function similar to *f* which lets us repair the column "Date" without dropping values.

In [76]:
def g(x):
    if re.search("00$", x):
        x = x[:-4]
        return x+ "1.01"
    else:
        x = x[:-4]
        if re.search("00.$", x):
            x = x[:-2]
            return x + "1.01"
        else:
            return x + "01"

In [77]:
bad_dates["Date"] = bad_dates['Case Number'].apply(g).copy()
bad_dates["Date"] = pd.to_datetime(bad_dates["Date"], errors = "coerce").copy()
rows_missing= list(bad_dates.index)
filling(df, rows_missing, list(bad_dates["Date"]), "Date")

In [78]:
df[df.Date.isna()]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,original order
5863,1554.00.00,NaT,1554,Unprovoked,FRANCE,Nice & Marseilles,,,males (wearing armor),M,,,UNKNOWN,Possibly white sharks,G. Rondelet,1554.00.00-Rondelet.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,130


In [79]:
# We drop this entry:

df= df.drop([5863], axis=0)

### 4.2. Dropping the columns "Case Number", "Year" and "original order"

Now that the column "Date" holds suitable values, we can drop the columns "Case Number", "Year" and "original order".

In [80]:
df.drop(columns = ["Case Number", "Year", "original order"], inplace = True)

### 4.3. Filling values for "Name" and "Investigator or Source"

We also fill the NaNs in the column "Name" with "Anonymous" and those in the column "Investigator or Source" with "Unknown":

In [81]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df["Name"] = df["Name"].fillna("Anonymous")
df["Investigator or Source"] = df["Investigator or Source"].fillna("Unknown")

In [82]:
null_cols(df)

Area         6.148867
Location     7.579629
Activity     8.550502
Sex          9.419179
Age         43.893715
Injury       0.340657
Species     48.032703
dtype: float64
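The helper `null_cols` is defined earlier in the notebook. Judging only from its output, it seems to report the percentage of NaN values for each column that has at least one. The sketch below works under that assumption; `percent_missing` is a hypothetical stand-in, not the original function, and the toy dataframe is made up.

```python
import pandas as pd

def percent_missing(data):
    """Hypothetical stand-in for null_cols: percent of NaN per column,
    restricted to the columns that contain at least one NaN."""
    pct = data.isna().mean() * 100
    return pct[pct > 0]

# Made-up data: one of four "Area" values is missing, "Sex" is complete
toy = pd.DataFrame({
    "Area": ["Florida", None, "Victoria", "Hawaii"],
    "Sex": ["M", "F", "M", "M"],
})
report = percent_missing(toy)
print(report)
```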

### 4.4. Columns "Age" and "Activity"

The field "Age" might be relevant, so we are not going to drop this column. However, we suspect that it is correlated with the field "Activity".

#### 4.4.1. Column "Age" conversion type
We first inspect the distinct values of "Age" and then define a function that recovers most of the ages and converts them into numbers.

In [83]:
df[df.Age.isna()== False].Age.unique()

array(['16', '36', '43', '60s', '51', '50', '12', '9', '22', '25', '37',
       '20', '49', '15', '21', '40', '72', '18', '29', '31', '11', '10',
       '59', '42', '34', '35', '19', '6', '27', '64', '60', '23', '52',
       '13', '57', '48', '39', '24', '26', '69', '46', 'Teen', '41', '45',
       '65', '38', '71', '32', '58', '28', '54', '44', '14', '7', '62',
       '40s', '68', '47', '17', '30', '63', '70', '18 months', '53',
       '20s', '33', '30s', '50s', '8', '61', '55', 'teen', '66', '77',
       '74', '3', '56', '28 & 26', '5', '86', '18 or 20', '12 or 13',
       '46 & 34', '28, 23 & 30', 'Teens', '36 & 26', '8 or 10', '84',
       '\xa0 ', ' ', '30 or 36', '6½', '21 & ?', '75', '33 or 37',
       'mid-30s', '73', '23 & 20', '7      &    31', '20?', "60's",
       '32 & 30', '16 to 18', '87', '67', 'Elderly', 'mid-20s', 'Ca. 33',
       '21 or 26', '>50', '18 to 22', 'adult', '9 & 12', '? & 19',
       '9 months', '25 to 35', '23 & 26', '1', '(adult)', '33 & 37',
       '25

In [86]:
def extract_ages(word):
    
    """
    This function extracts the numeric value contained in the field "Age".
    word is the raw value (usually a string); it returns a number, or NaN
    when no age can be recovered.
    """
    
    text = str(word)
    digits = re.findall(r"\d{1,2}", text)  # one- or two-digit numbers found in the text

    # Verbal descriptions are mapped to a representative age
    if re.findall(r"adult", text):
        return 35
    elif re.findall(r"teen", text.lower()):
        return 15
    elif re.findall(r"young", text):
        return 25
    elif re.findall(r"middle-age", text):
        return 45
    elif re.findall(r"Elderly", text):
        return 75
    # "mid-30s" -> 30
    elif re.findall(r"mid", text) and digits:
        return int(digits[0])
    # Ages given in months are converted to whole years
    elif re.findall(r"month", text) and digits:
        return math.floor(int(digits[0]) / 12)
    # Several candidate ages ("18 or 20", "28 & 26", "6½"): keep the first number
    elif re.findall(r"or|&|to|½", text) and digits:
        return int(digits[0])
    # Decades ("60s", "60's"): take the middle of the decade
    elif re.findall(r"s", text) and digits:
        return int(digits[0]) + 5
    else:
        return pd.to_numeric(word, errors="coerce")
    

In [87]:
df["Age"] = df["Age"].apply(extract_ages)
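To see which raw values actually need the special-case rules, we can look at what the fallback `pd.to_numeric(..., errors="coerce")` cannot convert on its own. The following is a small sketch on made-up values, not on the real column.

```python
import pandas as pd

# Made-up sample of raw "Age" entries
raw = pd.Series(["16", "60s", "Teen", "43"])

# Plain numeric strings convert directly; anything else becomes NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Entries that are not plain numbers and therefore need one of the rules
needs_rule = raw[numeric.isna()]
print(needs_rule.tolist())
```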

#### 4.4.2. Tidying column "Activity"

* First, we remove the trailing whitespace from the strings of the field "Activity".

In [88]:
df["Activity"] = df["Activity"].str.rstrip().copy()

From the description in "Activity" we extract the first gerund (a word ending in "-ing") and use it, capitalized, as the activity label. Entries without a gerund are left unchanged.

In [89]:
# Rows where "Activity" is known
lista = list(df[df.Activity.notna()].index)

# Pattern for a verb in gerund form (a word ending in "ing")
ing = r'\b(\w+ing)\b'

# List comprehension: keep the first gerund, capitalized, or the original text if there is none
to_fill = [re.findall(ing, df.Activity[ind])[0].capitalize() if re.findall(ing, df.Activity[ind]) else df.Activity[ind] for ind in lista]



In [90]:
#Replacing the values in the column of the data frame.

filling(df, lista, to_fill, "Activity")
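The same gerund extraction can be sketched without an explicit loop, using `str.extract` with the pattern defined above. The sample values are made up, and the sketch uses `fillna` in place of the notebook's `filling` helper.

```python
import pandas as pd

# Made-up activity descriptions, including one missing value
activity = pd.Series([
    "Surfing",
    "standing in waist-deep water",
    "Fell overboard",
    None,
])

# First word ending in "ing", if any, capitalized
gerund = activity.str.extract(r"\b(\w+ing)\b", expand=False).str.capitalize()

# Keep the original text where no gerund was found
tidy = gerund.fillna(activity)
print(tidy.tolist())
```

Note that the pattern matches any word ending in "ing", so a noun such as "morning" would also be captured; whether that matters depends on the descriptions in the data.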


#### 4.4.3. Predicting values for Age and Activity

The following function finds the most frequent value of column "X" and shows how the rows with that value are distributed over the categories of column "Y":

In [91]:
def predictor_0(Y, X, data):
    
    """
    This function returns a pair (most_X, counts): most_X is the most frequent value of
    column X among the rows where both X and Y are known, and counts gives the number of
    rows with X == most_X for each value of Y.
    
    X and Y are names of columns of the dataframe data.
    """
    # value_counts() skips NaN, so this list holds every known value of Y
    most_freq_Y = list(data[Y].value_counts().index)
    # Use the dataframe passed as argument (data), not the global df
    data_filter = data[(data[Y].isin(most_freq_Y)) & (data[X].notna())][[Y, X]][X].value_counts()
    most_X = list(data_filter.index)[0]
    return (most_X, data[(data[Y].isin(most_freq_Y)) & (data[X]== most_X)][[Y, X]][Y].value_counts())


# for i in list(data_filter.index)

# a.append((i, data[(data[Y].isin(most_freq_Y)) & (data[X]== i)][[Y, X]][Y].value_counts())[1].index[0])


The result below shows that, among the rows with a known "Activity", the most frequent age is 15. It also gives the number of 15-year-old victims for each activity:

In [92]:
predictor_0("Activity", "Age", df)

(15.0, Surfing           53
 Swimming          30
 Standing          10
 Boarding           9
 Wading             7
 Spearfishing       6
 Fishing            5
 Diving             4
 Bathing            3
 Treading           2
 Splashing          2
 Freediving         2
 Snorkeling         2
 Clamming           1
 Paddling           1
 Playing            1
 Fell overboard     1
 Holding            1
 Lying              1
 Walking            1
 Sitting            1
 Sea Disaster       1
 Name: Activity, dtype: int64)

In [93]:
def predictor(Y, X, data):
    
    """
    X and Y are the names of two correlated columns of the dataframe data.
    This function returns a dictionary whose keys are the values of column X; for each
    key, the value is the most frequent value of Y among the rows with that X.
    """
    # value_counts() skips NaN, so this list holds every known value of Y
    most_freq_Y = list(data[Y].value_counts().index)
    # Use the dataframe passed as argument (data), not the global df
    data_filter = data[(data[Y].isin(most_freq_Y)) & (data[X].notna())][[Y, X]][X].value_counts()
    a = dict()
    for i in list(data_filter.index):
        a[i] = data[(data[Y].isin(most_freq_Y)) & (data[X]== i)][[Y, X]][Y].value_counts().index[0]
    return a
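
The dictionary built by `predictor` is essentially a per-group mode. As a sketch, the same lookup can be written with `groupby`; the function name `most_frequent_by_group` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd

# Made-up data with some missing values in both columns
toy = pd.DataFrame({
    "Activity": ["Surfing", "Surfing", "Surfing", "Swimming", "Swimming", None],
    "Age": [15, 15, 20, 30, None, 40],
})

def most_frequent_by_group(data, key, target):
    """Map each value of `key` to the most frequent value of `target`,
    using only the rows where both columns are known."""
    known = data.dropna(subset=[key, target])
    return known.groupby(key)[target].agg(lambda s: s.value_counts().index[0]).to_dict()

lookup = most_frequent_by_group(toy, "Activity", "Age")
print(lookup)
```

Here `key` plays the role of X and `target` the role of Y. When two values are equally frequent within a group, the one that `value_counts` lists first is used, so ties are resolved arbitrarily.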
    

With this dictionary we can infer the missing values of "Age" from the values of "Activity".

In [94]:
predictions = predictor("Age", "Activity", df)

df_2 = df[(df.Activity.isna()== False) & (df.Age.isna())]    

Then we use our dictionary of predictions to fill the NaN values of "Age" in the rows that have a known "Activity":

In [95]:
lista= list(df_2.index)

In [96]:
to_fill = [predictions[df.Activity[ind]] if df.Activity[ind] in predictions.keys() else df.Age[ind] for ind in lista ]

In [97]:
filling(df, lista, to_fill, "Age")
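The three cells above can also be expressed in one step with `map` and `fillna`. This is a sketch on made-up data with a hypothetical `lookup` dictionary; it does not use the notebook's `filling` helper.

```python
import pandas as pd

# Made-up rows: some ages are missing, one activity is missing
toy = pd.DataFrame({
    "Activity": ["Surfing", "Swimming", "Kayaking", None],
    "Age": [None, 30.0, None, None],
})

# Hypothetical predictions: most frequent age per activity
lookup = {"Surfing": 15.0, "Swimming": 28.0}

# map() gives NaN for activities that are not in the lookup,
# so those ages stay missing; known ages are never overwritten
toy["Age"] = toy["Age"].fillna(toy["Activity"].map(lookup))
print(toy["Age"].tolist())
```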

In [98]:
lista= (df[(df.Age.isna()) & (df.Activity.isna()== False)].index)

Once again, we check by how much the share of NaN values in the dataframe has decreased:

In [99]:
null_cols(df)

Area         6.148867
Location     7.579629
Activity     8.550502
Sex          9.419179
Age         12.008176
Injury       0.340657
Species     48.032703
dtype: float64

Now we predict the missing values of "Activity" from the known values of "Age" by applying the same procedure with the two columns swapped:

In [100]:
lista = list(df[(df.Age.isna()== False) & df.Activity.isna()].index)

In [101]:
predictions = predictor("Activity", "Age", df)

In [102]:
to_fill = [predictions[df.Age[ind]] if df.Age[ind] in predictions.keys() else df.Activity[ind] for ind in lista ]

In [103]:
filling(df, lista, to_fill, "Activity")

In [104]:
null_cols(df)

Area         6.148867
Location     7.579629
Activity     6.285130
Sex          9.419179
Age         12.008176
Injury       0.340657
Species     48.032703
dtype: float64

### 4.5. Filling values for column "Species"

It might be possible to infer the species from the Country and Area, but we do not know how to reach this goal yet. We suspect that there is a correlation between these variables, but for now we fill these NaNs with "Unidentified":

In [105]:
df["Species"] = df["Species"].fillna("Unidentified")

### 4.6. Dropping irrelevant rows based on NaN counts by column

So far, we have significantly decreased the number of NaNs in the columns:

In [106]:
null_cols(df)

Area         6.148867
Location     7.579629
Activity     6.285130
Sex          9.419179
Age         12.008176
Injury       0.340657
dtype: float64

As "Injury" has a very low share of NaN rows (about 0.3%), we can drop them.

In [107]:
to_drop = list(df[(df.Injury.isna())].index)
df = df.drop(to_drop)

In [108]:
null_cols(df)

Area         6.135703
Location     7.554264
Activity     6.204068
Sex          9.280465
Age         11.912494
dtype: float64

#### 4.6.1. Predicting values for Area and Location

We will predict the value of the missing rows in "Area" using the column "Country". Similarly, we predict the value of "Location"
by using the column "Area":

In [109]:
lista = list(df[df.Area.isna()].index)
predictions = predictor("Area", "Country", df)
to_fill = [predictions[df.Country[ind]] if df.Country[ind] in predictions.keys() else df.Area[ind] for ind in lista ]
filling(df, lista, to_fill, "Area")

In [110]:
lista = list(df[(df.Location.isna()) & (df.Area.isna()== False)].index)
predictions = predictor("Location", "Area", df)
to_fill = [predictions[df.Area[ind]] if df.Area[ind] in predictions.keys() else df.Location[ind] for ind in lista ]
filling(df, lista, to_fill, "Location")

There are some countries and Areas with misspellings; we correct those entries:

In [111]:
df = df.replace("COLUMBIA", "COLOMBIA")
df = df.replace("Isla provedencia", "Isla Providencia")

In [115]:
null_cols(df)

Area         0.752008
Location     2.409844
Activity     6.204068
Sex          9.280465
Age         11.912494
dtype: float64

Now we can drop the remaining rows with NaN values in "Area", because they are less than 1% of the total (about 0.75%).

In [120]:
df = df.dropna(subset=['Area'])
null_cols(df)

Location     1.963148
Activity     6.164973
Sex          9.264681
Age         11.727226
dtype: float64

#### 4.6.2. Converting the column "Fatal (Y/N)" to boolean

We assign True to "Y" and False to every other value, including "N".

In [113]:
df["Fatal (Y/N)"]= (df["Fatal (Y/N)"] == "Y")
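
Because the comparison above turns every value other than "Y" into False, unknown and missing entries end up looking like non-fatal attacks. If that distinction matters, one option is to map only "Y" and "N" and keep the rest as missing. The sketch below uses made-up values (the entry with a leading space is an assumption about possible dirty data, not something observed in the dataset).

```python
import pandas as pd

# Made-up values, including an unknown, a missing and an untidy entry
fatal = pd.Series(["Y", "N", "UNKNOWN", None, " N"])

# Comparison used above: everything that is not exactly "Y" becomes False
as_bool = fatal == "Y"

# Alternative: strip whitespace, map only "Y"/"N", keep the rest as missing
as_nullable = fatal.str.strip().map({"Y": True, "N": False}).astype("boolean")

print(as_bool.tolist())
print(as_nullable.tolist())
```

The nullable `"boolean"` dtype is used here so that the column can hold True, False and missing values at the same time.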

## 5. Changing column names and Re-indexing

We will rename the columns with more descriptive names.

In [121]:
df.columns

Index(['Date', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name',
       'Sex', 'Age', 'Injury', 'Fatal (Y/N)', 'Species',
       'Investigator or Source', 'pdf', 'href formula'],
      dtype='object')

In [122]:
df.columns = ["Date", "Type", 'Country', "Area", "Location", "Activity_during_attack", "Victim_Name", "Gender", "Age", "Injury", "Fatal_attack", "Shark_species", "Source_information", "Documentation_pdf", "Documentation_href"]

Now we sort the dataset by "Date" in descending order. The rows keep their original index labels, as the output below shows; these labels are not written to the exported file.

In [137]:
df.sort_values(by = "Date", ascending = False, inplace = True)

df.head(5)

Unnamed: 0,Date,Type,Country,Area,Location,Activity_during_attack,Victim_Name,Gender,Age,Injury,Fatal_attack,Shark_species,Source_information,Documentation_pdf,Documentation_href
0,2016-09-18,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,16.0,Minor injury to thigh,False,Unidentified,"Orlando Sentinel, 9/19/2016",2016.09.18.c-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
2,2016-09-18,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,male,M,43.0,Lacerations to lower leg,False,Unidentified,"Orlando Sentinel, 9/19/2016",2016.09.18.a-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
1,2016-09-18,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Chucky Luciano,M,36.0,Lacerations to hands,False,Unidentified,"Orlando Sentinel, 9/19/2016",2016.09.18.b-Luciano.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
3,2016-09-17,Unprovoked,AUSTRALIA,Victoria,Thirteenth Beach,Surfing,Rory Angiolella,M,17.0,Struck by fin on chest & leg,False,Unidentified,"The Age, 9/18/2016",2016.09.17-Angiolella.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
4,2016-09-16,Unprovoked,AUSTRALIA,Victoria,Bells Beach,Surfing,male,M,17.0,No injury: Knocked off board by shark,False,2 m shark,"The Age, 9/16/2016",2016.09.16-BellsBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...
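
If a fresh 0, 1, 2, ... index is wanted after sorting, `reset_index(drop=True)` can be chained to the sort. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up dates with arbitrary index labels
toy = pd.DataFrame(
    {"Date": pd.to_datetime(["2016-09-16", "2016-09-18", "2016-09-17"])},
    index=[4, 0, 3],
)

# Sort from newest to oldest, then replace the old labels with 0..n-1
toy = toy.sort_values(by="Date", ascending=False).reset_index(drop=True)
print(toy.index.tolist())
print(toy["Date"].dt.strftime("%Y-%m-%d").tolist())
```

With `drop=True` the old labels are discarded instead of being kept as a new column.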


## 6. Exporting dataframe as csv file

In [138]:
df.to_csv("/Users/Fabi/Documents/GitHub/data-ber-10-19/module-1_projects/pandas-project/your-code/Shark_attacks_clean.csv", index= False)