# DSC 540 - Final Project Milestone 3
## HTML Transformations

### Ashley Deibler

### Perform 5 transformation and/or cleansing steps to your website data.

### Import HTML file.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [2]:
from io import StringIO

def read_html_with_beautiful_soup(file_path):
    # Read the HTML file and parse it with BeautifulSoup
    with open(file_path, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')
    # Find all tables in the HTML
    tables = soup.find_all('table')
    # Wrap the markup in StringIO before calling read_html();
    # recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(str(tables)))[0]
    return df

In [3]:
html_file_path = 'C:/Users/diggy/DSC540-Deibler/Species Implicated in Attacks.html'
df = read_html_with_beautiful_soup(html_file_path)

df.head(10)

Unnamed: 0,Species,Common Name,Non-fatal Unprovoked,Fatal Unprovoked,Total
0,Carcharhinus amblyrhynchos,Grey Reef,8,1,9
1,Carcharhinus brachyurus,Bronze Whaler,15,1,16
2,Carcharhinus brevipinna,Spinner,16,0,16
3,Carcharhinus falciformis,Silky,3,0,3
4,Carcharhinus galapagensis,Galapagos,1,1,2
5,Carcharhinus leucas,Bull,93,26,119
6,Carcharhinus limbatus,Blacktip,35,0,35
7,Carcharhinus longimanus,Oceanic Whitetip,12,3,15
8,Carcharhinus melanopterus,Blacktip Reef,14,0,14
9,Carcharhinus obscurus,Dusky,1,1,2


### Transformation 1. Sort data in descending order based on 'Total' attacks
This transformation will let me compare this dataset with my flat file to determine the shark species of highest interest for this study: those with the highest number of documented attacks. This can help me narrow down my data later on.

In [4]:
df = df.sort_values(by='Total', ascending=False)
df.head(10)

Unnamed: 0,Species,Common Name,Non-fatal Unprovoked,Fatal Unprovoked,Total
34,Total,35+ Species,807,142,949
14,Carcharodon carcharias,White,292,59,351
15,Galeocerdo cuvier,Tiger,103,39,142
5,Carcharhinus leucas,Bull,93,26,119
12,Carcharhinus spp.,Requiem,46,5,51
13,Carcharias taurus,Sand Tiger,36,0,36
6,Carcharhinus limbatus,Blacktip,35,0,35
27,Orectolobus spp.,Wobbegong,31,0,31
30,Sphyrna spp.,Hammerhead,18,0,18
1,Carcharhinus brachyurus,Bronze Whaler,15,1,16


### Transformation 2. Remove row 34 containing summary information.
This row, which became noticeable when sorting the data frame in descending order, summarizes the information in the data frame: the total number of non-fatal unprovoked attacks, the total number of fatal unprovoked attacks, and the total number of attacks across all species. This summary row is not necessary for this analysis and would double-count attacks in any aggregate statistics.

In [5]:
df = df.drop(df[df['Common Name'] == '35+ Species'].index)
df.head()

Unnamed: 0,Species,Common Name,Non-fatal Unprovoked,Fatal Unprovoked,Total
14,Carcharodon carcharias,White,292,59,351
15,Galeocerdo cuvier,Tiger,103,39,142
5,Carcharhinus leucas,Bull,93,26,119
12,Carcharhinus spp.,Requiem,46,5,51
13,Carcharias taurus,Sand Tiger,36,0,36
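
Before dropping a summary row, it can be worth confirming that it really is the sum of the rows above it. The sketch below is one way to do that on a toy table. The three species rows are copied from the output above, but the `'3 Species'` summary row is made up for illustration so the example is self-contained; it is not the real `'35+ Species'` row.

```python
import pandas as pd

# Toy subset: three species rows from the table above plus a
# hypothetical summary row computed for just these three species.
sample = pd.DataFrame({
    'Species': ['Carcharodon carcharias', 'Galeocerdo cuvier',
                'Carcharhinus leucas', 'Total'],
    'Common Name': ['White', 'Tiger', 'Bull', '3 Species'],
    'Non-fatal Unprovoked': [292, 103, 93, 488],
    'Fatal Unprovoked': [59, 39, 26, 124],
    'Total': [351, 142, 119, 612],
})

count_cols = ['Non-fatal Unprovoked', 'Fatal Unprovoked', 'Total']
is_summary = sample['Species'] == 'Total'

# Sum the per-species rows and compare them with the summary row.
species_sums = sample.loc[~is_summary, count_cols].sum()
summary_row = sample.loc[is_summary, count_cols].iloc[0]
summary_matches = bool((species_sums == summary_row).all())

# Keep only the per-species rows once the check has been made.
cleaned = sample.loc[~is_summary].reset_index(drop=True)
```

If `summary_matches` comes back `False` on the real data, the summary row may include attacks that are not attributed to any listed species, which would be worth noting before it is removed.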


### Transformation 3. Change column names
I want to change the names of the 'Non-fatal Unprovoked' and 'Fatal Unprovoked' columns. I will clarify in my final project and analysis that the attacks being investigated are all unprovoked, so I would like to remove the word 'Unprovoked' from each column name to reduce clutter. In addition, there are no columns showing 'provoked' data, so I don't think it is necessary to have this word in the column headings.

In [6]:
df = df.rename(columns={'Non-fatal Unprovoked':'Non-fatal Attacks', 'Fatal Unprovoked':'Fatal Attacks'})
df.head(10)

Unnamed: 0,Species,Common Name,Non-fatal Attacks,Fatal Attacks,Total
14,Carcharodon carcharias,White,292,59,351
15,Galeocerdo cuvier,Tiger,103,39,142
5,Carcharhinus leucas,Bull,93,26,119
12,Carcharhinus spp.,Requiem,46,5,51
13,Carcharias taurus,Sand Tiger,36,0,36
6,Carcharhinus limbatus,Blacktip,35,0,35
27,Orectolobus spp.,Wobbegong,31,0,31
30,Sphyrna spp.,Hammerhead,18,0,18
1,Carcharhinus brachyurus,Bronze Whaler,15,1,16
2,Carcharhinus brevipinna,Spinner,16,0,16


### Transformation 4. Address duplicates by adding new column
I noticed that there are entries such as 'Orectolobus spp.', 'Sphyrna spp.', 'Isurus spp.', and 'Rhinobatos spp.' These entries indicate that an attack was attributed to a shark in that genus, but the specific species wasn't determined. Because individual species from some of these groups are also listed in the dataset, I would like to create a new column specifying the family each entry belongs to. This way, I can also analyze which family is responsible for the most attacks, and incorporate the data from both the 'spp.' entries and the species entries.

Because the type of information I want to include in this new column is difficult to extrapolate from existing columns, I will manually create this column. 

In [7]:
print(df['Species'])

14         Carcharodon carcharias
15              Galeocerdo cuvier
5             Carcharhinus leucas
12              Carcharhinus spp.
13              Carcharias taurus
6           Carcharhinus limbatus
27               Orectolobus spp.
30                   Sphyrna spp.
1         Carcharhinus brachyurus
2         Carcharhinus brevipinna
7         Carcharhinus longimanus
8       Carcharhinus melanopterus
28                Prionace glauca
23         Negaprion brevirostris
20              Isurus oxyrinchus
0      Carcharhinus amblyrhynchos
17         Ginglymostoma cirratum
21                    Isurus spp.
19          Isistius brasiliensis
11          Carcharhinus plumbeus
31              Triaenodon obesus
24         Notorynchus cepedianus
10            Carcharhinus perezi
25          Orectolobus maculatus
3        Carcharhinus falciformis
26            Orectolobus ornatus
9           Carcharhinus obscurus
4       Carcharhinus galapagensis
22                    Lamna nasus
29            

In [8]:
df['Family']=['Lamnidae', 'Galeocerdonidae', 'Carcharhinidae', 'Carcharhinidae', 'Odontaspididae', 'Carcharhinidae',
             'Orectolobidae', 'Sphyrnidae', 'Carcharhinidae', 'Carcharhinidae', 'Carcharhinidae',
             'Carcharhinidae', 'Carcharhinidae', 'Carcharhinidae', 'Lamnidae', 'Carcharhinidae',
             'Ginglymostomatidae', 'Lamnidae', 'Dalatiidae', 'Carcharhinidae', 'Carcharhinidae', 
             'Hexanchidae', 'Carcharhinidae', 'Orectolobidae', 'Carcharhinidae', 'Orectolobidae',
             'Carcharhinidae', 'Carcharhinidae', 'Lamnidae', 'Rhinobatidae', 'Heterodontidae',
             'Squatinidae', 'Triakidae', 'Triakidae']
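
The list above is matched to rows purely by position, so it depends on the current sort order. A possible alternative, sketched here on a toy subset, is to look up the family from the genus (the first word of each 'Species' entry). The `GENUS_TO_FAMILY` dictionary and the `sample` frame are hypothetical names for this illustration; the genus and family pairs are taken from the rows and list above, and a full version would need one entry per genus in the dataset.

```python
import pandas as pd

# Hypothetical lookup table: genus -> family (only a few genera shown).
GENUS_TO_FAMILY = {
    'Carcharodon': 'Lamnidae',
    'Galeocerdo': 'Galeocerdonidae',
    'Carcharhinus': 'Carcharhinidae',
    'Carcharias': 'Odontaspididae',
    'Orectolobus': 'Orectolobidae',
}

sample = pd.DataFrame({'Species': [
    'Carcharodon carcharias', 'Galeocerdo cuvier', 'Carcharhinus leucas',
    'Carcharhinus spp.', 'Carcharias taurus', 'Orectolobus spp.',
]})

# The genus is the first word of each entry, including the 'spp.' rows.
genus = sample['Species'].str.split().str[0]
sample['Family'] = genus.map(GENUS_TO_FAMILY)

# Any genus missing from the lookup shows up as NaN and is listed here.
unmapped = sample.loc[sample['Family'].isna(), 'Species'].tolist()
```

Because the lookup is keyed on the genus rather than on row position, the result does not change if the data frame is re-sorted, and `unmapped` makes it easy to spot a genus that was forgotten.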

In [9]:
df.head()

Unnamed: 0,Species,Common Name,Non-fatal Attacks,Fatal Attacks,Total,Family
14,Carcharodon carcharias,White,292,59,351,Lamnidae
15,Galeocerdo cuvier,Tiger,103,39,142,Galeocerdonidae
5,Carcharhinus leucas,Bull,93,26,119,Carcharhinidae
12,Carcharhinus spp.,Requiem,46,5,51,Carcharhinidae
13,Carcharias taurus,Sand Tiger,36,0,36,Odontaspididae
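
With a 'Family' column in place, the family-level comparison mentioned above could look something like the sketch below. It uses only the five rows shown in this output, so the totals are for this toy subset and not for the full dataset; `sample` and `family_totals` are names chosen for the illustration.

```python
import pandas as pd

# The five rows shown in the output above.
sample = pd.DataFrame({
    'Species': ['Carcharodon carcharias', 'Galeocerdo cuvier',
                'Carcharhinus leucas', 'Carcharhinus spp.',
                'Carcharias taurus'],
    'Non-fatal Attacks': [292, 103, 93, 46, 36],
    'Fatal Attacks': [59, 39, 26, 5, 0],
    'Total': [351, 142, 119, 51, 36],
    'Family': ['Lamnidae', 'Galeocerdonidae', 'Carcharhinidae',
               'Carcharhinidae', 'Odontaspididae'],
})

# Sum the attack counts within each family, then rank families by total.
family_totals = (
    sample.groupby('Family')[['Non-fatal Attacks', 'Fatal Attacks', 'Total']]
    .sum()
    .sort_values('Total', ascending=False)
)
```

Grouping this way combines the 'Carcharhinus spp.' row with the identified Carcharhinus species, which is the reason for adding the column.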


### Transformation 5. Rearrange columns to optimize readability. 
With the new column and the way the data frame is formatted, it would make more sense for the 'Family' column to be the first column. This way it keeps all descriptive data together, and all quantitative data together. 

In [10]:
df = df.loc[:,['Family', 'Species', 'Common Name', 'Non-fatal Attacks', 'Fatal Attacks', 'Total']]
df.head()

Unnamed: 0,Family,Species,Common Name,Non-fatal Attacks,Fatal Attacks,Total
14,Lamnidae,Carcharodon carcharias,White,292,59,351
15,Galeocerdonidae,Galeocerdo cuvier,Tiger,103,39,142
5,Carcharhinidae,Carcharhinus leucas,Bull,93,26,119
12,Carcharhinidae,Carcharhinus spp.,Requiem,46,5,51
13,Odontaspididae,Carcharias taurus,Sand Tiger,36,0,36


### Ethical Implications
The data itself was already fairly clean and organized. I only needed to make a few simple changes to optimize the data for use in my specific analysis. I sorted the data, removed an irrelevant summary row, changed column names, created a new column to more easily group species by family, and rearranged the order of columns.
As far as risks regarding these transformations go, one assumption I made when transforming my data was that readers will understand that the attacks addressed in this study are 'Unprovoked' attacks. The original dataset included two columns whose headings specified that the values in those columns were unprovoked attacks. I decided to remove the word 'Unprovoked', on the understanding that I will make it clear throughout my final analysis that these attacks were unprovoked, in order to de-clutter the column names a bit. I also think that adding the word 'Attacks' to the column names helps to specify exactly what the data represents.

In [11]:
import sqlite3
conn = sqlite3.connect('sharks.db')

In [12]:
df.to_sql('attacks', conn, if_exists='replace')

34
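
To confirm that the table was written as expected, it can be read back with a query. The sketch below is a minimal round trip on three rows from the data above. It assumes an in-memory database (`':memory:'`) instead of `sharks.db` so that it does not touch any file, and it passes `index=False` on the assumption that the leftover row labels from sorting are not needed in the database.

```python
import sqlite3
import pandas as pd

# Three rows from the transformed data, with the non-sequential
# index left over from sorting.
sample = pd.DataFrame({
    'Family': ['Lamnidae', 'Galeocerdonidae', 'Carcharhinidae'],
    'Species': ['Carcharodon carcharias', 'Galeocerdo cuvier',
                'Carcharhinus leucas'],
    'Common Name': ['White', 'Tiger', 'Bull'],
    'Non-fatal Attacks': [292, 103, 93],
    'Fatal Attacks': [59, 39, 26],
    'Total': [351, 142, 119],
}, index=[14, 15, 5])

conn = sqlite3.connect(':memory:')  # assumption: throwaway in-memory database
try:
    # index=False keeps the old row labels out of the table.
    sample.to_sql('attacks', conn, if_exists='replace', index=False)
    # Read the table back to check the row count and column names.
    check = pd.read_sql_query(
        'SELECT * FROM attacks ORDER BY "Total" DESC', conn
    )
finally:
    conn.close()
```

Without `index=False`, `to_sql` also writes the data frame index as an extra column, which is what the call above does with the sorted index.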