# Data Engineering Project | August 2022 | Victims Dataset
## Explore Data

In this notebook, a few functions are applied to the input CSVs to understand
the data and so arrive at a suitable data architecture for the project.

### Import relevant packages

In [1]:
import pandas as pd

### Load CSV data

In [3]:
acc_deaths_df = pd.read_csv('../data/accidental_deaths_2021_to_date.csv')
acc_deaths_df.name = 'Accidental Deaths 2021 To Date'

acc_injuries_df = pd.read_csv('../data/accidental_injuries_2021_to_date.csv')
acc_injuries_df.name = 'Accidental Injuries 2021 To Date'

chi_killed_df = pd.read_csv('../data/children_killed_2021_to_date.csv')
chi_killed_df.name = 'Children Killed 2021 To Date'

chi_injured_df = pd.read_csv('../data/children_injured_2021_to_date.csv')
chi_injured_df.name = 'Children Injured 2021 To Date'

teens_killed_df = pd.read_csv('../data/teens_killed_2021_to_date.csv')
teens_killed_df.name = 'Teens Killed 2021 To Date'

teens_injured_df = pd.read_csv('../data/teens_injured_2021_to_date.csv')
teens_injured_df.name = 'Teens Injured 2021 To Date'

ms_inj_kil_df = pd.read_csv('../data/mass_shootings_Injured_killed_2021_to_date.csv')
ms_inj_kil_df.name = 'Mass Shootings Injured Killed 2021 To Date'

In [6]:
all_input_dfs = [acc_deaths_df, acc_injuries_df, chi_killed_df,
                 chi_injured_df, teens_killed_df, teens_injured_df,
                 ms_inj_kil_df]
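
The seven load statements above follow one pattern, so they could also be driven by a mapping from file stem to display name. The following is a minimal sketch, not the notebook's actual code: `load_named_csvs` is a hypothetical helper, and it stores the label in `DataFrame.attrs` (which pandas provides for dataset-level metadata and still marks as experimental) instead of setting a `.name` attribute.

```python
from pathlib import Path

import pandas as pd


def load_named_csvs(data_dir, names):
    """Read one CSV per entry in `names` (file stem -> display name).

    Hypothetical helper: returns a dict keyed by display name, and also
    records the display name in each frame's `attrs` metadata.
    """
    frames = {}
    for stem, display_name in names.items():
        df = pd.read_csv(Path(data_dir) / f"{stem}.csv")
        df.attrs["name"] = display_name
        frames[display_name] = df
    return frames
```

Assuming the same directory layout, a call such as `load_named_csvs('../data', {'accidental_deaths_2021_to_date': 'Accidental Deaths 2021 To Date'})` would return a one-entry dict; the remaining six files would be added to the mapping in the same way.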

### Check number of rows, column names and data types per file

In [32]:
for df in all_input_dfs:
    print(f'----- {df.name} -----')
    print(f'Number of rows: {len(df.index)}')
    print(f'Column names: {list(df.columns)}')
    print(f'Column data types: \n{df.dtypes}')

----- Accidental Deaths 2021 To Date -----
Number of rows: 714
Column names: ['Incident ID', 'Incident Date', 'State', 'City Or County', 'Address', '# Killed', '# Injured', 'Operations']
Column data types: 
Incident ID         int64
Incident Date      object
State              object
City Or County     object
Address            object
# Killed            int64
# Injured           int64
Operations        float64
dtype: object
----- Accidental Injuries 2021 To Date -----
Number of rows: 1930
Column names: ['Incident ID', 'Incident Date', 'State', 'City Or County', 'Address', '# Killed', '# Injured', 'Operations']
Column data types: 
Incident ID         int64
Incident Date      object
State              object
City Or County     object
Address            object
# Killed            int64
# Injured           int64
Operations        float64
dtype: object
----- Children Killed 2021 To Date -----
Number of rows: 436
Column names: ['Incident ID', 'Incident Date', 'State', 'City Or County', 'Add

#### Observations
- All files share the same columns: the same number, names and data types
- The number of records differs from file to file
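
The first observation was read off the printed output; it could also be checked programmatically. A minimal sketch, assuming the frames are collected in a list such as `all_input_dfs`; `column_schema` and `schemas_match` are hypothetical helper names.

```python
import pandas as pd


def column_schema(df):
    """Return the frame's schema as a list of (column name, dtype string) pairs."""
    return [(col, str(dtype)) for col, dtype in df.dtypes.items()]


def schemas_match(dfs):
    """Return True if every frame has the same columns, order and dtypes as the first."""
    reference = column_schema(dfs[0])
    return all(column_schema(df) == reference for df in dfs[1:])
```

Calling `schemas_match(all_input_dfs)` would return `True` only if all seven files agree on column names, column order and data types.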

### Check how the datasets relate to each other

Check how many accidents involved both deaths and injuries

In [22]:
len(set(acc_deaths_df['Incident ID']).intersection(acc_injuries_df['Incident ID']))

24

Check how many incidents involving children have both injured and killed victims


In [23]:
len(set(chi_killed_df['Incident ID']).intersection(chi_injured_df['Incident ID']))

27

Check how many incidents involving teens have both injured and killed victims

In [24]:
len(set(teens_killed_df['Incident ID']).intersection(teens_injured_df['Incident ID']))

312

Check how many incidents have both teens and children among the killed victims

In [25]:
len(set(teens_killed_df['Incident ID']).intersection(chi_killed_df['Incident ID']))

22

Check how many incidents have both teens and children among the injured victims

In [26]:
len(set(teens_injured_df['Incident ID']).intersection(chi_injured_df['Incident ID']))

88

Check how many mass shootings involve children

In [36]:
# Incident IDs that appear in either children file
child_ids = set(chi_injured_df['Incident ID']) | set(chi_killed_df['Incident ID'])
# How many of those are also mass shootings
len(child_ids & set(ms_inj_kil_df['Incident ID']))

77
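
The pairwise checks above all compare `Incident ID` sets between two files, so they could be generalized to every pair at once. A minimal sketch under that assumption; `pairwise_id_overlap` is a hypothetical helper that takes a dict mapping a label to a DataFrame.

```python
from itertools import combinations

import pandas as pd


def pairwise_id_overlap(frames, id_col="Incident ID"):
    """Count shared IDs for every pair of frames.

    `frames` maps a label to a DataFrame; the result has one row per pair.
    """
    id_sets = {label: set(df[id_col]) for label, df in frames.items()}
    rows = [
        {"left": a, "right": b, "shared_ids": len(id_sets[a] & id_sets[b])}
        for a, b in combinations(id_sets, 2)
    ]
    return pd.DataFrame(rows, columns=["left", "right", "shared_ids"])
```

With the notebook's frames, this could be called as `pairwise_id_overlap({df.name: df for df in all_input_dfs})`, which would give 21 rows for the seven files.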

#### Observations

- Some incidents appear in more than one dataset, so the files contain
  "duplicated" records: a single event can fall under several categories
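
One way to quantify this overlap is to tag each incident ID with the file it came from and count the distinct categories per incident. This is only a sketch of the idea, not a decision on the final architecture; `categories_per_incident` is a hypothetical helper, and it again assumes a dict mapping a label to a DataFrame.

```python
import pandas as pd


def categories_per_incident(frames, id_col="Incident ID"):
    """Return a Series mapping each incident ID to the number of files it appears in."""
    tagged = pd.concat(
        [
            # Keep one row per ID per file, labelled with that file's category
            df[[id_col]].drop_duplicates().assign(category=label)
            for label, df in frames.items()
        ],
        ignore_index=True,
    )
    return tagged.groupby(id_col)["category"].nunique()
```

Incidents with a count above 1 are the ones that fall under several categories, for example `(categories_per_incident(frames) > 1).sum()` gives how many there are.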