![IronHack Logo](https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/upload_d5c5793015fec3be28a63c4fa3dd4d55.png)

# Your first Project: Sharks Attack!

## Overview

The goal of this project is to combine everything you have learned about data wrangling, cleaning, manipulation and visualization with Pandas so you can see how it all works together. For this project, you will start with this messy data set: [Shark Attack](https://www.kaggle.com/teajay/global-shark-attacks/version/1). You will need to import it, use your data wrangling skills to clean it up, prepare it for analysis, set some hypotheses and visualize the results.

You will have **one week** to complete the process and upload your results to GitHub.

Next Saturday, you will present your results in front of the class, with the README and the code as your supporting tools.

**You will be working individually for this project**, but we'll be guiding you along the process and helping you as you go. Show us what you've got!

---

## Requirements

Your main goal is to understand the dataset, set some hypotheses (at least three) and work with the methods and libraries used in class to try to confirm them. Failing to confirm a hypothesis is not a bad result; it is just another result.

The technical requirements for this project are as follows:

* The dataset we provide is significantly messy. Apply the different cleaning and manipulation techniques you have learned with pandas, NumPy and so on.
* Examine the data for potential issues.
* Produce a Jupyter Notebook that shows the steps you took and the code you used to clean and transform your data set.
* If you think some of the code could be reused in future projects, encapsulate it in functions.
* Use Matplotlib, Seaborn or Plotly to visualize your results. 
* Create a .md file to show your results! Everybody will be able to see your README on GitHub. Keep it simple and quick to read, with interesting results, and more visual than text-heavy.
* Bonus: Save the functions you create inside a .py file in your repo, and import them in your Jupyter Notebook. 
* Bonus/Next Steps: In future steps, try to cover what Sabi explained to you last Wednesday: could you figure out a model to obtain more insights from your cleaned dataset?
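
As a rough illustration of the "encapsulate it in functions" and ".py file" points above, here is a minimal sketch of one small reusable helper. The function name `strip_column_names` and the idea of placing it in a module such as `cleaning.py` are assumptions made for this example, not part of the brief; the sample frame is made up only to show the call.

```python
import pandas as pd


def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with whitespace stripped from string column names."""
    return df.rename(
        columns=lambda name: name.strip() if isinstance(name, str) else name
    )


# Made-up one-row frame, only to demonstrate the helper (not the real dataset).
sample = pd.DataFrame({"Sex ": ["F"], "Species ": ["White shark"], "Age": ["57"]})
cleaned = strip_column_names(sample)
print(list(cleaned.columns))
```

If the function were saved in a hypothetical `cleaning.py` next to the notebook, it could be pulled in with `from cleaning import strip_column_names`.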



## Suggested Ways to Get Started

* **Examine the data and try to understand what the fields mean** before diving into data cleaning and manipulation methods.
* **Break the project down into different steps and subsets** - use the topics covered in the lessons to form a check list, add anything else you can think of that may be wrong with your data set, and then work through the check list.
* **Use the tools in your tool kit** - your knowledge of Python, data structures, Pandas, and data wrangling.
* **Work through the lessons in class** & ask questions when you need to! Think about adding relevant code to your project each night, instead of, you know... _procrastinating_.
* **Commit early, commit often**, don’t be afraid of doing something incorrectly because you can always roll back to a previous version.
* **Consult documentation and resources** to better understand the tools you are using and how to accomplish what you want.


## Project Feedback + Evaluation

* __Technical Requirements__: Did you deliver a project that met all the technical requirements? Given what the class has covered so far, did you build something that was reasonably complex?

* __Creativity__: Did you add a personal spin or creative element into your project submission? Did you incorporate domain knowledge or a unique perspective into your analysis?

* __Code Quality__: Did you follow code style guidance and best practices covered in class?



## Presentation Guideline and Criteria

### Format

* Presentation Time: 6 minutes
* Q & A: 3 minutes
* **Total Time:** 9 minutes

In [66]:
import warnings
warnings.filterwarnings("ignore")

In [67]:
import pandas as pd

In [68]:
import os

# Get the current working directory
current_dir = os.getcwd()
current_dir

'/Users/haddock_ana/Ironhack/Projects/Proyect 1'

In [69]:
# Build the path from the working directory instead of hard-coding the absolute path
file_path = os.path.join(current_dir, 'Datasets', 'attacks.csv')
attacks = pd.read_csv(file_path, encoding="cp1252")
attacks.head(2)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.18,2018.06.18,6302.0,,


In [70]:
attacks.shape

(25723, 24)

## Cleaning the DataFrame
1. Remove the columns that are not relevant to our study
    - First subset
2. Delete the duplicated cases
3. Analyse the number of nulls in the selected columns
    - Analyse each variable
4. Decide whether to fill them in or delete those samples
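
The first three steps of this plan could be sketched as two small helpers. This is only one possible arrangement under assumptions: the names `subset_and_dedupe` and `null_report` are hypothetical, and the toy frame below is invented to keep the example self-contained.

```python
import pandas as pd


def subset_and_dedupe(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Steps 1-2: keep only the chosen columns, then drop exact duplicate rows."""
    return df[columns].drop_duplicates().reset_index(drop=True)


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: null count and percentage per column, highest count first."""
    counts = df.isna().sum()
    report = pd.DataFrame(
        {"nulls": counts, "pct": (counts / len(df) * 100).round(1)}
    )
    return report.sort_values("nulls", ascending=False)


# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", "25-Jun-2018", None],
        "Country": ["USA", "USA", None],
        "pdf": ["a.pdf", "b.pdf", None],
    }
)
trimmed = subset_and_dedupe(toy, ["Date", "Country"])
report = null_report(trimmed)
print(report)
```

Step 4 is left out on purpose, since whether to fill or drop depends on each variable.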

In [71]:
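# 'Sex ' and 'Species ' carry a trailing space in the raw CSV header.
# .copy() gives an independent frame, so later in-place edits do not act on a slice.
attacks = attacks[['Date', 'Year', 'Type', 'Country', 'Area', 'Location',
                   'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
                   'Species ', 'Investigator or Source']].copy()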

attacks.head(2)

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source
0,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF"
1,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com"


### Duplicated values

In [72]:
attacks.shape

(25723, 15)

In [73]:
attacks.duplicated().sum()

19421

In [74]:
attacks.drop_duplicates(inplace=True)
attacks.duplicated().sum()

0
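
With a duplicate count this high, it can be worth checking what the duplicated rows actually contain before dropping them. The sketch below uses an invented frame and assumes, without having verified it on the real data, that fully empty rows might account for part of the count.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; in the notebook the frame is `attacks`.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", "18-Jun-2018", np.nan, np.nan, np.nan],
        "Country": ["USA", "USA", np.nan, np.nan, np.nan],
    }
)

# Rows that repeat an earlier row (the default rule used by drop_duplicates).
n_duplicates = int(toy.duplicated().sum())

# Rows in which every column is null.
n_empty = int(toy.isna().all(axis=1).sum())

# Duplicates that remain once fully empty rows are set aside.
n_real_duplicates = int(toy.dropna(how="all").duplicated().sum())

print(n_duplicates, n_empty, n_real_duplicates)
```

Comparing the three numbers on the real frame would show whether the duplicates are repeated cases or mostly blank rows.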

### Null values

In [75]:
attacks.isna().sum()

Date                         1
Year                         3
Type                         5
Country                     51
Area                       456
Location                   541
Activity                   545
Name                       211
Sex                        566
Age                       2831
Injury                      29
Fatal (Y/N)                540
Time                      3354
Species                   2839
Investigator or Source      18
dtype: int64

- Some variables can be deduced or filled in from complementary columns.
- My approach here is to remove the cases where key variables that cannot be replaced are missing:
  - Date
  - Country
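
One way to apply this approach is `DataFrame.dropna` with its `subset` parameter, which drops a row only when one of the listed columns is null. The frame below is made up for illustration, and the variable names `required` and `kept` are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; in the notebook the frame is `attacks_v1`.
toy = pd.DataFrame(
    {
        "Date": ["25-Jun-2018", np.nan, "18-Jun-2018", "10-Jun-2018"],
        "Country": ["USA", "USA", np.nan, "BRAZIL"],
        "Age": ["57", "11", np.nan, np.nan],
    }
)

# Columns that must be present for a case to be kept.
required = ["Date", "Country"]

# Drop a row only if Date or Country is null; nulls elsewhere are left alone.
kept = toy.dropna(subset=required)

print(len(kept), int(kept["Age"].isna().sum()))
```

Nulls in the other columns survive this step, so they can still be filled in from complementary columns later.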

In [76]:
attacks_v1 = attacks.copy()  # make a copy from which the nulls will be removed later

In [84]:
# Inspect the row whose Date is missing (the variable is `attacks`, not `attack`)
attacks[attacks['Date'].isnull()]

In [82]:
attacks.isna().sum()

Date                         1
Year                         3
Type                         5
Country                     51
Area                       456
Location                   541
Activity                   545
Name                       211
Sex                        566
Age                       2831
Injury                      29
Fatal (Y/N)                540
Time                      3354
Species                   2839
Investigator or Source      18
dtype: int64