![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

<body>
    <p style="font-size:28px;text-align:center"><b>Project 02 | Data Cleaning & Manipulation</b></p>
</body>

# Introduction

The objective of this project was to answer a problem statement while practicing data cleaning and manipulation.

---

<body>
    <p style="font-size:20px"><b>Problem Statement</b></p>
</body>

_What are the most common characteristics of people involved in shark incidents throughout history?_

---

To answer this problem, the following characteristics will be analyzed:
- Gender
- Age
- Activity

# Setup

## Import

In [1]:
import pandas as pd
import numpy as np

## Load the dataset

In [2]:
# Load the dataset from an Excel file into a Pandas DataFrame
df = pd.read_excel('GSAF5.xls')

In [3]:
# Check information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25792 entries, 0 to 25791
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8771 non-null   object 
 1   Date                    6558 non-null   object 
 2   Year                    6556 non-null   float64
 3   Type                    6552 non-null   object 
 4   Country                 6508 non-null   object 
 5   Area                    6091 non-null   object 
 6   Location                6008 non-null   object 
 7   Activity                6002 non-null   object 
 8   Name                    6343 non-null   object 
 9   Sex                     5987 non-null   object 
 10  Age                     3660 non-null   object 
 11  Injury                  6528 non-null   object 
 12  Fatal (Y/N)             6006 non-null   object 
 13  Time                    3139 non-null   object 
 14  Species                 3610 non-null 

### Conclusions about the dataset for cleaning purposes
1. The information about the dataset shows that many rows contain only missing values (`NaN`).
2. The columns 'Unnamed: 22' and 'Unnamed: 23' probably do not hold any useful information, since they have only 1 and 2 non-null values, respectively.
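Both observations can be checked directly rather than read off `df.info()`. The sketch below is only an illustration: the `toy` frame and its values are hypothetical stand-ins for the raw sheet, not the GSAF data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw sheet (not the real GSAF values)
toy = pd.DataFrame({
    'Date': ['20-Aug-2020', '14-Aug-2020', np.nan, np.nan],
    'Sex': ['F', np.nan, np.nan, np.nan],
    'Unnamed: 22': [np.nan, 'stray note', np.nan, np.nan],
})

# Number of rows in which every column is missing
n_empty_rows = toy.isna().all(axis=1).sum()

# Non-null values per column, smallest first, to spot nearly empty columns
non_null_per_column = toy.notna().sum().sort_values()

print(n_empty_rows)
print(non_null_per_column)
```

On the real dataframe the same two expressions would be applied to `df` instead of `toy`.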

## Create a backup version of the raw dataset

In [4]:
# Create a backup version
df_bkp = df.copy()

# General Data Cleaning

## Headers

In [5]:
# Remove unnecessary spaces, put everything in lower case and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex', 'age', 'injury', 'fatal_(y/n)', 'time',
       'species', 'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')

In [6]:
# Rename the column 'fatal_(y/n)' to make it simpler
df = df.rename(columns={'fatal_(y/n)': 'fatal'})
df.columns

Index(['case_number', 'date', 'year', 'type', 'country', 'area', 'location',
       'activity', 'name', 'sex', 'age', 'injury', 'fatal', 'time', 'species',
       'investigator_or_source', 'pdf', 'href_formula', 'href',
       'case_number.1', 'case_number.2', 'original_order', 'unnamed:_22',
       'unnamed:_23'],
      dtype='object')

## Columns that are not relevant to the analysis

As mentioned above, the columns `unnamed:_22` and `unnamed:_23` probably do not have any relevant information, so they can be removed.

In [7]:
# Remove the columns 'unnamed:_22' and 'unnamed:_23'
df = df.drop(columns=['unnamed:_22', 'unnamed:_23'])

# Check the result
df.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,...,fatal,time,species,investigator_or_source,pdf,href_formula,href,case_number.1,case_number.2,original_order
0,2020.08.20,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,...,N,11h00,,"K. McMurray, TrackingSharks.com",2020.08.20-NSB.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.20,2020.08.20,6559.0
1,2020.08.14,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,...,N,09h30,"White shark, 2-to 3m","B. Myatt, GSAF",2020.08.14-ShellyBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.14,2020.08.14,6558.0
2,2020.08.10,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,...,N,16h00,"Blacktip shark, 6'","K. McMurray, TrackingSharks.com",2020.08.10-Provoked.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.10,2020.08.10,6557.0
3,2020.08.02,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,...,N,14h00,"Nurse shark, 5'","K. McMurray, TrackingSharks.com",2020.08.02-Klein.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.08.02,2020.08.02,6556.0
4,2020.07.31.c,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,...,N,17h00,,"K. McMurray, TrackingSharks.com",2020.07.31.c-Tossi..pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2020.07.31.c,2020.07.31.c,6555.0


Checking the table above, it is clear that the columns `investigator_or_source`, `pdf`, `href_formula` and `href` will not be relevant to the analysis since they are just information about the source of each incident. Therefore, these columns can be removed.

In [8]:
# Remove the columns 'investigator_or_source', 'pdf', 'href_formula' and 'href'
df = df.drop(columns=['investigator_or_source', 'pdf', 'href_formula', 'href'])

# Check the result
df.head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal,time,species,case_number.1,case_number.2,original_order
0,2020.08.20,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,50.0,Minor lacerations to left leg,N,11h00,,2020.08.20,2020.08.20,6559.0
1,2020.08.14,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m",2020.08.14,2020.08.14,6558.0
2,2020.08.10,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'",2020.08.10,2020.08.10,6557.0
3,2020.08.02,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'",2020.08.02,2020.08.02,6556.0
4,2020.07.31.c,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,22.0,Lacerations to foot,N,17h00,,2020.07.31.c,2020.07.31.c,6555.0


The content of columns `case_number`, `case_number.1`, `case_number.2` and `original_order` is just a way to reference each case and is not relevant to the analysis. So, they will also be removed.

In [9]:
# Remove the columns 'case_number', 'case_number.1', 'case_number.2' and 'original_order'
df = df.drop(columns=['case_number', 'case_number.1', 'case_number.2', 'original_order'])

# Check the result
df.head()

Unnamed: 0,date,year,type,country,area,location,activity,name,sex,age,injury,fatal,time,species
0,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,Carolina Jones,F,50.0,Minor lacerations to left leg,N,11h00,
1,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,Chantelle Doyle,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m"
2,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,male,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'"
3,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,Melony Klein,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'"
4,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,Megan Tossi,F,22.0,Lacerations to foot,N,17h00,


The column `name` is also irrelevant, since there will not be a specific analysis of each individual. So, this column can be removed, and the information about the resulting dataframe is checked afterwards.

In [10]:
# Remove column 'name'
df = df.drop(columns='name')

# Check the result
df.head()

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
0,20-Aug-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,F,50.0,Minor lacerations to left leg,N,11h00,
1,14-Aug-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,"Shelly Beach, Port Macquarie",Surfing,F,35.0,Lacerations to right calf and posterior thigh,N,09h30,"White shark, 2-to 3m"
2,10-Aug-2020,2020.0,Provoked,USA,Florida,"Off Gasparilla Island, Charlotte County",Fishing,M,55.0,Injury to left forearm by hooked shark PROVOKE...,N,16h00,"Blacktip shark, 6'"
3,02-Aug-2020,2020.0,Unprovoked,USA,Virgin Islands,"Candle Reef, St. Croix",Snorkeling,F,,Lacerations to hand and wrist,N,14h00,"Nurse shark, 5'"
4,31-Jul-2020,2020.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Surfing,F,22.0,Lacerations to foot,N,17h00,


In [11]:
# Check information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25792 entries, 0 to 25791
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      6558 non-null   object 
 1   year      6556 non-null   float64
 2   type      6552 non-null   object 
 3   country   6508 non-null   object 
 4   area      6091 non-null   object 
 5   location  6008 non-null   object 
 6   activity  6002 non-null   object 
 7   sex       5987 non-null   object 
 8   age       3660 non-null   object 
 9   injury    6528 non-null   object 
 10  fatal     6006 non-null   object 
 11  time      3139 non-null   object 
 12  species   3610 non-null   object 
dtypes: float64(1), object(12)
memory usage: 2.6+ MB


## Rows

### Rows containing only missing values

There is a total of 25,792 rows, but no column has even 9,000 non-null values. So, as mentioned before, many rows contain only missing values (`NaN`) and can be removed.

In [12]:
# Remove all rows that contain only NaN
df = df.dropna(how='all')

# Check the result
print(f'Number of rows after removing the rows containing only missing values: {df.shape[0]}\n')

# Check information about the dataframe
df.info()

Number of rows after removing the rows containing only missing values: 6558

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6558 entries, 0 to 6557
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      6558 non-null   object 
 1   year      6556 non-null   float64
 2   type      6552 non-null   object 
 3   country   6508 non-null   object 
 4   area      6091 non-null   object 
 5   location  6008 non-null   object 
 6   activity  6002 non-null   object 
 7   sex       5987 non-null   object 
 8   age       3660 non-null   object 
 9   injury    6528 non-null   object 
 10  fatal     6006 non-null   object 
 11  time      3139 non-null   object 
 12  species   3610 non-null   object 
dtypes: float64(1), object(12)
memory usage: 717.3+ KB


### Duplicate rows

In [13]:
# Check if there are duplicate rows
df.duplicated().sum()

7

In [14]:
# Check duplicated rows
df[df.duplicated(keep=False)]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
3577,Reported 26-Jun-1972,1972.0,Unprovoked,AUSTRALIA,Queensland,Pancake Creek,,M,,FATAL,Y,,
3578,Reported 26-Jun-1972,1972.0,Unprovoked,AUSTRALIA,Queensland,Pancake Creek,,M,,FATAL,Y,,
3622,1971,1971.0,Unprovoked,IRAN,Khuzestan Province,"Ahvaz, on the Karun River",,M,,Survived,N,,
3623,1971,1971.0,Unprovoked,IRAN,Khuzestan Province,"Ahvaz, on the Karun River",,M,,Survived,N,,
4490,Aug-1956,1956.0,Provoked,UNITED KINGDOM,Cornwall,The Lizard,Attempting to kill a shark with explosives,M,,"FATAL, PROVOKED INCIDENT",Y,,
4491,Aug-1956,1956.0,Provoked,UNITED KINGDOM,Cornwall,The Lizard,Attempting to kill a shark with explosives,M,,"FATAL, PROVOKED INCIDENT",Y,,
4940,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,M,,Calf nipped in each case,N,,"""small sharks"""
4941,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,M,,Calf nipped in each case,N,,"""small sharks"""
5719,Reported 10-Oct-1906,1906.0,Unprovoked,USA,Hawaii,,Swimming,M,,FATAL,Y,,
5721,Reported 10-Oct-1906,1906.0,Unprovoked,USA,Hawaii,,Swimming,M,,FATAL,Y,,


As shown above, there are some rows that are duplicated. Therefore, they can be removed.
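The difference between the two `duplicated` calls used above can be seen on a small example. This is a minimal sketch with a hypothetical `toy` frame: the default call flags only the extra copies, while `keep=False` flags every row that has a twin.

```python
import pandas as pd

# Hypothetical rows: one pair, one triple and one unique row
toy = pd.DataFrame({
    'date': ['1971', '1971', 'Aug-1956', 'Aug-1956', 'Aug-1956', '2020'],
    'sex': ['M', 'M', 'M', 'M', 'M', 'F'],
})

# Copies that come after the first occurrence (these are the rows that get dropped)
n_extra = toy.duplicated().sum()

# Every row that belongs to a group of identical rows
n_involved = toy.duplicated(keep=False).sum()

# Keep the first occurrence of each group
deduped = toy.drop_duplicates(keep='first')

print(n_extra, n_involved, len(deduped))
```

The number of rows removed by `drop_duplicates` matches the default `duplicated().sum()`, not the number of rows listed with `keep=False`.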

In [15]:
# Remove duplicated rows
df = df.drop_duplicates(keep='first')

# Check the result
df.duplicated().sum()

0

In [16]:
# Check the new number of rows
print(f'After removing the duplicates, the dataframe has {df.shape[0]} rows.')

After removing the duplicates, the dataframe has 6551 rows.


# Specific Data Cleaning - Columns

## Year

In [43]:
# Check the unique values in the column
df.year.unique()

array([2020., 2019., 2018., 2017.,   nan, 2016., 2015., 2014., 2013.,
       2012., 2011., 2010., 2009., 2008., 2007., 2006., 2005., 2004.,
       2003., 2002., 2001., 2000., 1999., 1998., 1997., 1996., 1995.,
       1984., 1994., 1993., 1992., 1991., 1990., 1989., 1969., 1988.,
       1987., 1986., 1985., 1983., 1982., 1981., 1980., 1979., 1978.,
       1977., 1976., 1975., 1974., 1973., 1972., 1971., 1970., 1968.,
       1967., 1966., 1965., 1964., 1963., 1962., 1961., 1960., 1959.,
       1958., 1957., 1956., 1955., 1954., 1953., 1952., 1951., 1950.,
       1949., 1948., 1848., 1947., 1946., 1945., 1944., 1943., 1942.,
       1941., 1940., 1939., 1938., 1937., 1936., 1935., 1934., 1933.,
       1932., 1931., 1930., 1929., 1928., 1927., 1926., 1925., 1924.,
       1923., 1922., 1921., 1920., 1919., 1918., 1917., 1916., 1915.,
       1914., 1913., 1912., 1911., 1910., 1909., 1908., 1907., 1906.,
       1905., 1904., 1903., 1902., 1901., 1900., 1899., 1898., 1897.,
       1896., 1895.,

## Gender

The gender is represented by the column `sex`.

In [17]:
# Check the unique values in the column
print(df.sex.unique())

['F' 'M' nan 'M ' 'lli' 'M x 2' 'N' '.']


In [23]:
# Check the number of each value
df.sex.value_counts()

M        5280
F         693
N           2
M           2
M x 2       1
lli         1
.           1
Name: sex, dtype: int64

In [24]:
# Remove unnecessary spaces in the values
df.sex = df.sex.str.strip()

# Check the result
df.sex.value_counts()

M        5282
F         693
N           2
M x 2       1
lli         1
.           1
Name: sex, dtype: int64

In [42]:
# Check the rows with values that are not 'M', nor 'F', nor NaN
df[df.sex.notna() & ~df.sex.isin(['M', 'F'])]

Unnamed: 0,date,year,type,country,area,location,activity,sex,age,injury,fatal,time,species
1867,11-Nov-2004,2004.0,Unprovoked,USA,California,"Bunkers, Humboldt Bay, Eureka, Humboldt County",Surfing,lli,38.0,"Lacerations to hand, knee & thigh",N,13h30,5.5 m [18'] white shark
3205,23-Oct-1962,1982.0,Sea Disaster,USA,Carolina coast,,Yacht Trashman capsized in storm,M x 2,,FATAL,Y x 2,,
5191,11-Jul-1934,1934.0,Watercraft,AUSTRALIA,New South Wales,Cronulla,Fishing,N,,No injury to occupants Sharks continually foll...,N,,"Blue pointer, 11'"
5690,Reported 02-Jun-1908,1908.0,Sea Disaster,PAPUA NEW GUINEA,New Britain,Matupi,.,.,,"Remains of 3 humans recovered from shark, but ...",Y,,Allegedly a 33-foot shark
6386,Reported 18-Dec-1801,1801.0,Provoked,,,,Standing on landed shark's tail,N,,"FATAL, PROVOKED INCIDENT",Y,,12' shark


In [32]:
# Number of rows
n_rows = df.shape[0]

# Check % of missing values
df.sex.isna().mean()*100

8.716226530300718

In [40]:
# Check the number of missing values
df.sex.isna().sum()

571
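One possible way to handle the unexpected values is sketched below, under the assumption that anything other than 'M' or 'F' should be treated as missing. The series reuses the unique values printed above; the name `cleaned` is only illustrative.

```python
import numpy as np
import pandas as pd

# The unique values found in the column 'sex'
sex = pd.Series(['F', 'M', np.nan, 'M ', 'lli', 'M x 2', 'N', '.'])

# Strip the spaces, then keep only 'M' and 'F'; everything else becomes NaN
cleaned = sex.str.strip()
cleaned = cleaned.where(cleaned.isin(['M', 'F']))

print(cleaned.value_counts())
print(cleaned.isna().sum())
```

Whether such values are set to `NaN` or the rows are dropped is a choice that depends on the analysis.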

## Age

In [None]:
# Check the unique values in the column
df.age.unique()

In [None]:
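# Function to convert a value to an integer (returns -1 when the conversion fails)
def convert_str_to_int(age):
    try:
        age = int(age)
    except (ValueError, TypeError):
        # Non-numeric strings, NaN and None end up here
        age = -1
    return age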
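As an alternative to a custom function, pandas offers `pd.to_numeric` with `errors='coerce'`, which turns values that cannot be parsed into `NaN` instead of -1. The sketch below uses hypothetical age values and is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical mix of values similar to what an age column may contain
ages = pd.Series(['50', 35, 'teen', None, '20s', 22.0])

# Values that cannot be parsed as numbers become NaN
age_num = pd.to_numeric(ages, errors='coerce')

print(age_num)
print(age_num.notna().sum())
```

Keeping `NaN` rather than a sentinel such as -1 means the unparsed ages are skipped automatically by methods like `mean` or `value_counts`.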

In [None]:
# Number of rows with a non-null date
df_total = df.date.count()
print(df_total)

# Convert the values in column 'age' to integers (-1 marks values that could not be converted)
df['age_int'] = df.age.apply(convert_str_to_int)

# Share of ages that could not be converted, relative to all rows and to the rows with a date
print(df.loc[df.age_int == -1, :].age_int.value_counts() / df.shape[0])
print(df.loc[df.age_int == -1, :].age_int.value_counts() / df_total)

# Check the result
df

In [None]:
# Classification by age
df['age_cat'] = np.where(df['age_int'] > 65, 'Elder',
                        np.where(df['age_int'] > 35, 'Adult',
                                np.where(df['age_int'] > 17, 'Young Adult',
                                        np.where(df['age_int'] > 12, 'Teenager',
                                                 np.where(df['age_int'] == -1, '-', 'Child')))))
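The same age classes can also be expressed with `pd.cut`, which may be easier to read than nested `np.where` calls. This is a sketch with hypothetical ages; the bin edges are chosen to reproduce the thresholds above, and the sentinel -1 falls outside every bin, so it becomes `NaN` instead of '-'.

```python
import numpy as np
import pandas as pd

# Hypothetical ages around each threshold, plus the -1 sentinel
age_int = pd.Series([8, 12, 13, 17, 18, 35, 36, 65, 66, -1])

# Right-inclusive intervals: (-1, 12], (12, 17], (17, 35], (35, 65], (65, inf]
bins = [-1, 12, 17, 35, 65, np.inf]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elder']

age_cat = pd.cut(age_int, bins=bins, labels=labels)
print(age_cat)
```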

In [None]:
df

## Year

In [None]:
df.year.unique()

In [None]:
# Convert the values in column 'year' to integers
df['year_int'] = df.year.apply(convert_str_to_int)

# Check the result
df

In [None]:
# Check the % of incidents that occurred before 1801
print(f'The incidents before 1801 represent only {(df[df.year < 1801].year.count() / df.year.count()) * 100:.2f}%',
      'of the dataset. So, only the years from 1801 onwards will be analysed.', sep=' ')

In [None]:
# Select only the years from 1801 onwards (copy to avoid modifying a view later)
df = df.loc[df['year'] >= 1801, :].copy()
df.info()

In [None]:
df

In [None]:
# Classification by year
df['century'] = np.where(df['year_int'] >= 2001, 21,
                            np.where(df['year_int'] >= 1901, 20, 19))
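For years from 1801 onwards, the century can also be computed arithmetically, since a century runs from year xx01 to year (xx+1)00. The sketch below uses hypothetical years at the boundaries to illustrate the formula.

```python
import pandas as pd

# Hypothetical years at the century boundaries
years = pd.Series([1801, 1900, 1901, 2000, 2001, 2020])

# 1801-1900 -> 19, 1901-2000 -> 20, 2001-2100 -> 21
century = (years - 1) // 100 + 1
print(century.tolist())
```

This gives the same classes as the `np.where` thresholds above and would extend to other centuries without extra conditions.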

In [None]:
df

## Gender

In [None]:
# Check the column 'sex'
df.sex.unique()

In [None]:
df.sex.count()

In [None]:
# Share of each value
df.sex.value_counts() / df.sex.count()

In [None]:
# Number of NaN
df.sex.isna().sum()

In [None]:
# Check the number of missing values
print(f'The number of NaN in the column "sex" is {df.sex.isna().sum()}, which represents '
      f'{(df.sex.isna().sum() / df.shape[0])*100:.2f}% of the dataset.')

Since the NaN values represent only a small part of the dataset, the rows containing NaN in the column `sex` will be removed.

In [None]:
# Remove the rows where the column 'sex' is a missing value
df = df.dropna(subset=['sex'])

In [None]:
# Check information about the changed dataset
df.info()

In [None]:
# Check values in the column 'sex'
df.sex.value_counts()

In [None]:
# Remove unnecessary spaces in the values of the column 'sex'
df['sex'] = df['sex'].str.strip()
df.sex.value_counts()

In [None]:
# Remove the rows where the gender is 'N', 'lli', '.' or 'M x 2'
df = df.drop(df[df.sex.isin(['.', 'N', 'lli', 'M x 2'])].index)
df.sex.value_counts()

In [None]:
# Check information about the changed dataset
# 5 rows were removed
df.info()

In [None]:
# Check the share of each value in the column 'sex'
df_gender_total = df.sex.value_counts().sum()
df.sex.value_counts() / df_gender_total

In [None]:
# Share of each gender, selected by label
gender_share = df.sex.value_counts() / df_gender_total
print(f'Males represent {gender_share["M"]*100:.2f}% of the people who were '
      f'involved in incidents with sharks, while females represent only {gender_share["F"]*100:.2f}%.')

In [None]:
# Create a subset containing only females
df_female = df.loc[df['sex'] == 'F']

In [None]:
# Create a subset containing only males
df_male = df.loc[df['sex'] == 'M']
df_male.info()

In [None]:
# Check males age
df_male.age_cat.value_counts()

In [None]:
df_male_total = df_male.date.count()
df_male_total

In [None]:
# Share of each age category among males
df_male.age_cat.value_counts() / df_male_total

In [None]:
# Same share, as a percentage
(df_male.age_cat.value_counts() / df_male_total)*100

In [None]:
# Keep only the males whose age could be classified
df_male_age = df_male.loc[df_male['age_cat'] != '-']
df_male_age_total = df_male_age.date.count()
df_male_age.info()

In [None]:
df_male_age_total

In [None]:
df_male_age.age_cat.value_counts()

In [None]:
df_male_age.age_cat.value_counts() / df_male_age_total

In [None]:
(df_male_age.age_cat.value_counts() / df_male_age_total)*100

50% of the males involved in a shark incident were Young Adults.

In [None]:
df_male_ya = df_male_age[df_male_age['age_cat'] == 'Young Adult']
df_male_ya.info()

In [None]:
df_male_ya_total = df_male_ya.date.count()
df_male_ya.groupby(by='activity').age_cat.count().sort_values(ascending=False)

In [None]:
(df_male_ya.groupby(by='activity').age_cat.count().sort_values(ascending=False) / df_male_ya_total)*100

**Young adult males surfing were the most involved in shark incidents.**
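The "count divided by total" pattern used throughout this section can also be written with `value_counts(normalize=True)`. The sketch below uses a hypothetical `toy` frame; note that by default the shares are relative to the non-null values only, whereas dividing by a row count (as done above) also includes rows with a missing activity.

```python
import pandas as pd

# Hypothetical activities, including one missing value
toy = pd.DataFrame({'activity': ['Surfing', 'Surfing', 'Swimming', 'Fishing', None]})

# Percentages relative to the non-null values only
share_non_null = toy['activity'].value_counts(normalize=True) * 100

# Percentages relative to all rows, with the missing values counted as their own group
share_all = toy['activity'].value_counts(normalize=True, dropna=False) * 100

print(share_non_null)
print(share_all)
```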

# Exporting dataset

In [None]:
df

In [None]:
#df.to_csv('sharks_clean.csv')

# Extra - Analysis through centuries

## Setup

In [None]:
# Create subsets for each century
df_cen21 = df.loc[df['century'] == 21, :]
df_cen20 = df.loc[df['century'] == 20, :]
df_cen19 = df.loc[df['century'] == 19, :]

In [None]:
df_cen21_total = df_cen21.date.count()
print(df_cen21_total)
df_cen21.info()

In [None]:
df_cen20_total = df_cen20.date.count()
print(df_cen20_total)
df_cen20.info()

In [None]:
df_cen19_total = df_cen19.date.count()
print(df_cen19_total)
df_cen19.info()

Not even a quarter of the 21st century has passed, and the number of shark incidents is already getting close to the number of incidents in the 20th century. However, it is also possible that not all incidents were registered in the last century.

## Gender

In [None]:
df_cen21.sex.value_counts() / df_cen21_total

In [None]:
(df_cen21.sex.value_counts() / df_cen21_total) * 100

In [None]:
df_cen20.sex.value_counts() / df_cen20_total

In [None]:
(df_cen20.sex.value_counts() / df_cen20_total) * 100

In [None]:
df_cen19.sex.value_counts() / df_cen19_total

In [None]:
(df_cen19.sex.value_counts() / df_cen19_total) * 100

This result probably reflects the behavior of each gender in each century. In the past, women were usually expected to take care of the house and were not allowed to take part in many activities.
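The per-century shares can also be obtained in a single table with `pd.crosstab`. This is only a sketch: the `toy` frame and its counts are hypothetical and do not reproduce the results of the dataset.

```python
import pandas as pd

# Hypothetical incidents with their century and gender
toy = pd.DataFrame({
    'century': [19, 19, 20, 20, 20, 21, 21, 21, 21],
    'sex':     ['M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F'],
})

# normalize='index' makes each row (century) sum to 1; multiply to get percentages
share = pd.crosstab(toy['century'], toy['sex'], normalize='index') * 100
print(share.round(2))
```

Each row of `share` shows how the incidents of one century are split between the genders, which makes the centuries directly comparable.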

## Activity

In [None]:
df_cen21.activity.value_counts() / df_cen21_total

In [None]:
(df_cen21.activity.value_counts() / df_cen21_total) * 100

In [None]:
df_cen20.activity.value_counts() / df_cen20_total

In [None]:
(df_cen20.activity.value_counts() / df_cen20_total) * 100

In [None]:
df_cen19.activity.value_counts() / df_cen19_total

In [None]:
(df_cen19.activity.value_counts() / df_cen19_total) * 100