# Step one: Cleaning

In this notebook we take the original data file and remove as many of its errors as we can. The result is a clean file that allows for meaningful analysis.

## Importing libraries and reading the file

In [12]:
import pandas as pd


attacks = pd.read_csv("data/attacks.csv", encoding='unicode_escape')

#We will also run the 'cleaning.py' script, as it defines functions we will use

%run -i 'src/cleaning.py'

## Initial exploration
Now we will start exploring the file. Using ".head()" we can get a preview of the columns and the first rows of the dataset.

In [2]:
attacks.head()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,...,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,...,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,...,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.04,2018.06.04,6299.0,,


In [3]:
attacks.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [4]:
attacks.shape

(25723, 24)

Using '.head()', '.columns' and '.shape' we can see that we are dealing with a file of 24 columns (names listed above) and 25723 entries. Let's explore further:

In [5]:
attacks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

From this output we gather a few things:

#### Types and names of the columns:
* There are only two columns with floats
    * This is odd considering that the column "Age" should also be a number
    * The column "original order" probably adds little to no value
* Columns 0, 19 and 20 all reference a case number
* Columns 17 and 18 could also be duplicates, since they both reference "href"
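
The "Age" observation can be checked with a small experiment. The sketch below assumes the column is read as `object` because some entries are not plain numbers; the toy values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the "Age" column (invented values, not from attacks.csv).
ages = pd.Series(["23", "teen", None, "40s", "7"], name="Age")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric.dtype)          # dtype after conversion
print(ages_numeric.notna().sum())  # how many entries were parsed as numbers
```

Comparing the non-null count before and after the conversion gives a rough idea of how many entries would need manual cleaning.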

####  Non-null values:
* Column 0 "Case Number" has significantly more non-null values than the rest of the columns
* Columns 22 and 23 are unnamed and have 2 or fewer non-null values; these columns add no value to the analysis
* Several columns have a non-null count near 6300, but most columns have fewer than this. With 25723 entries in total, this means that we are dealing with a large number of incomplete rows.
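
The same counts can be computed directly rather than read off '.info()'. This is a minimal sketch on a toy frame with invented values; `toy` stands in for `attacks`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `attacks` (invented values).
toy = pd.DataFrame({
    "Case Number": ["2018.06.25", "0", np.nan, np.nan],
    "Date": ["25-Jun-2018", np.nan, np.nan, np.nan],
    "Year": [2018.0, np.nan, np.nan, np.nan],
})

# Non-null count per column (the same figure .info() reports).
non_null_per_column = toy.notna().sum()

# Rows where every column is missing.
fully_empty_rows = toy.isna().all(axis=1).sum()

# Rows that have at least one missing value.
incomplete_rows = toy.isna().any(axis=1).sum()

print(non_null_per_column)
print(fully_empty_rows, incomplete_rows)
```

Separating fully empty rows from partially filled ones helps decide which rows can be dropped outright and which need a closer look.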

We will start cleaning the file by deleting the last 2 columns. For this we will use a function declared in 'cleaning.py', which we ran at the beginning of this notebook:

In [6]:
delete_columns(attacks,['Unnamed: 22','Unnamed: 23'])

Deleted columns:  ['Unnamed: 22', 'Unnamed: 23']
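
The body of `delete_columns` lives in 'src/cleaning.py' and is not shown in this notebook. The sketch below is only a guess at what such a helper could look like, assuming it drops the columns in place (the call above is not assigned back to `attacks`) and prints the names it removed. `delete_columns_sketch` is a hypothetical name, not the actual implementation.

```python
import pandas as pd


def delete_columns_sketch(df, columns):
    """Hypothetical helper: drop `columns` from `df` in place and report them."""
    df.drop(columns=columns, inplace=True)
    print("Deleted columns: ", columns)


# Usage on a toy frame (invented values).
toy = pd.DataFrame({"a": [1, 2], "b": [None, None], "c": [None, None]})
delete_columns_sketch(toy, ["b", "c"])
print(list(toy.columns))
```

Mutating the frame in place matches how the function is called in this notebook; returning a new frame instead would be an equally valid design.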


In [7]:
attacks.sample(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
2649,1989.12.19,19-Dec-1989,1989.0,Provoked,USA,Hawaii,"90 miles east of Hilo, Hawai'i",On board 51' fishing vessel One Ki,George Sohswel,M,...,N,,,"J. Borg, p.78; L. Taylor (1993), pp.108-109",1989.12.19-Sohswel.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1989.12.19,1989.12.19,3654.0
7957,0,,,,,,,,,,...,,,,,,,,,,
6606,0,,,,,,,,,,...,,,,,,,,,,
10227,,,,,,,,,,,...,,,,,,,,,,
22307,,,,,,,,,,,...,,,,,,,,,,


In [67]:
# display() is available by default in notebooks; the import makes this explicit
from IPython.display import display


def compare_columns(df, columns):
    """
    Take a data frame and a list of column names, and display descriptive
    information comparing those columns. Nothing is returned.
    """

    # Length of every column specified
    lengths_list = [len(df[col]) for col in columns]

    # Number of non-null values in every column
    non_null_list = [df[col].notnull().sum() for col in columns]

    # Build a data frame with this information
    results = pd.DataFrame({
        "Columns": columns,
        "Lengths": lengths_list,
        "Non - null values": non_null_list
    })
    results = results.set_index("Columns")

    # Build a data frame counting the Python types found in every column
    data_types_list = [df[col].map(type).value_counts() for col in columns]
    data_types = pd.DataFrame(data_types_list)

    # Display all the information collected
    print("Column information:")
    display(results.transpose())
    print("Data type comparison:")
    display(data_types.transpose())
    print("Sample of columns:")
    display(df[columns].sample(5))


compare_columns(attacks,["Case Number","Case Number.1","Case Number.2"])

Column information:


Columns,Case Number,Case Number.1,Case Number.2
Lengths,25723,25723,25723
Non - null values,8702,6302,6302


Data type comparison:


Unnamed: 0,Case Number,Case Number.1,Case Number.2
<class 'float'>,17021,19421,19421
<class 'str'>,8702,6302,6302


Sample of columns:


Unnamed: 0,Case Number,Case Number.1,Case Number.2
14708,,,
14541,,,
3582,1966.04.08,1966.04.08,1966.04.08
11942,,,
1919,2001.08.19.b,2001.08.19.b,2001.08.19.b
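
The counts above show that "Case Number" has 2400 more non-null values than the other two columns (8702 against 6302). One way to find the rows where two columns disagree is sketched below on a toy frame with invented values; the "0" entry mirrors the pattern seen in the earlier sample, where "Case Number" holds 0 and every other column is empty.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `attacks` (invented values).
toy = pd.DataFrame({
    "Case Number":   ["1966.04.08", "2001.08.19.b", "0", np.nan],
    "Case Number.1": ["1966.04.08", "2001.08.19.b", np.nan, np.nan],
})

a, b = toy["Case Number"], toy["Case Number.1"]

# Two cells agree if they hold the same value or are both missing.
# NaN == NaN is False, so the both-missing case is handled explicitly.
same = (a == b) | (a.isna() & b.isna())

# Rows where the two columns differ.
mismatches = toy[~same]
print(mismatches)
```

If the mismatching rows turn out to carry no other data, the extra non-null values in "Case Number" are likely noise rather than real cases.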


In [47]:
#len(attacks["Case Number"])
#len(attacks["Case Number.1"])
#attacks["Case Number"].map(type).unique()
attacks["Case Number"].map(type).value_counts()

#attacks["Case Number"].notnull().sum()

<class 'float'>    17021
<class 'str'>       8702
Name: Case Number, dtype: int64
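
The `float` entries in this count are the missing values: pandas stores a missing entry in an object column as NaN, and NaN is a Python float. This is consistent with the numbers above, since 17021 + 8702 = 25723. A short check on a toy series with invented values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Case Number" column (invented values).
s = pd.Series(["2018.06.25", np.nan, "0", np.nan], name="Case Number")

# Count entries whose Python type is float, and entries pandas considers missing.
n_float = s.map(lambda v: isinstance(v, float)).sum()
n_missing = s.isna().sum()

print(type(np.nan))
print(n_float, n_missing)
```

When the two counts match, every float in the column is a missing value and the column otherwise holds only strings.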