## EDA: Answering Questions from the Dataset

For this drill, we will use a modified dataset from the Colorado animal shelter statistics collected under the Pet Animal Care Facilities Act (PACFA). The original data can be found here: https://ag.colorado.gov/ics/pet-animal-care-facilities-act-pacfa/animal-shelter-and-rescue-individual-statistics

The data consists of two separate files - one for the 2019 statistics and one for the 2020 statistics.  

Your boss wants you to answer a few questions from the dataset:
- Overall numbers for 2019 and 2020 in these categories:
  - Intake for cats & dogs
  - Outtake for cats & dogs
  - Comparison of 2020 vs 2019, both as a number and as a percent
  - Make this a table so you can present it to your boss & other stakeholders
- Did data quality improve from 2019 to 2020?
- Anecdotally, during 2020 it seemed like everyone was adopting new pets - does this show in the data?


Some considerations about the dataset:
- In 2020, the collection requirements and process changed in an attempt to improve data quality. This also caused the column names to be slightly different between years. 
- You can combine the files into one, but this isn't necessary; each approach has its own problems to solve!
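
Because the column names differ between years, it can help to list the differences before deciding whether to combine the files. The sketch below is one possible way to do that; `compare_columns` is a hypothetical helper, and the two small frames and their column names are made-up stand-ins for the real files.

```python
import pandas as pd

def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the column names unique to each frame and the ones they share."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
        "shared": sorted(left_cols & right_cols),
    }

# Hypothetical stand-ins for the yearly files; the real column names may differ.
demo19 = pd.DataFrame({"facility": ["A"], "adopt_out_D": [5], "count_1231_D": [2]})
demo20 = pd.DataFrame({"facility": ["A"], "adoption_out_D": [7]})

report = compare_columns(demo19, demo20)
print(report)

# If the frames are combined, a year column keeps the rows distinguishable.
combined = pd.concat(
    [demo19.assign(year=2019), demo20.assign(year=2020)],
    ignore_index=True,
)
print(combined.columns.tolist())
```

Columns that exist in only one year are filled with NaN for the other year after `pd.concat`, which is one of the problems the combined approach has to handle.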

### Answering Questions

Use as many cells below as you'd like to answer the questions from your boss!

The first step is to build a summary table with this information. You could look up the numbers and cut and paste them into a spreadsheet, or make a new dataframe and save it (which we will do below).

- Overall numbers for 2019 and 2020 in these categories:
  - Intake for cats & dogs
  - Outtake for cats & dogs
  - Comparison of 2020 vs 2019, both as a number and as a percent
  - Make this a table so you can present it to your boss & other stakeholders


The vision here is to have 4 rows:
- Intake dogs
- Outtake dogs
- Intake cats
- Outtake cats

With 4 columns:
- 2019
- 2020
- Difference in number (2020 minus 2019)
- Difference in % (relative to 2019)

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [2]:
as19 = pd.read_csv('2019_shelter_report_analysis.csv')
as20 = pd.read_csv('2020_shelter_report_analysis.csv')

In [3]:
# Sanity check: total 2019 intake, summing every column whose name contains '_in_'
sum(sum(as19[col]) for col in as19.columns if '_in_' in col)

170345.0
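
The same substring-based selection can also be written with pandas itself. This is only a sketch of an alternative, not the notebook's method: `DataFrame.filter(like=...)` keeps the columns whose names contain a substring, and pandas' `.sum()` skips missing values by default, whereas Python's built-in `sum` returns NaN as soon as one value is missing. The helper `total_matching` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd

def total_matching(df: pd.DataFrame, pattern: str) -> float:
    """Sum every value in the columns whose name contains `pattern`."""
    return float(df.filter(like=pattern).sum().sum())

# Made-up data with one missing value in an intake column.
demo = pd.DataFrame({
    "stray_in_D": [3.0, np.nan],
    "transfer_in_C": [2.0, 4.0],
    "adopt_out_D": [1.0, 1.0],
})

# pandas skips the NaN when summing.
pandas_total = total_matching(demo, "_in_")

# The built-in sum propagates the NaN instead.
builtin_total = sum(sum(demo[col]) for col in demo.columns if "_in_" in col)

print(pandas_total, builtin_total)
```

Which behavior is preferable depends on the question: skipping NaN gives a usable total, while a NaN result makes missing data impossible to overlook.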

In [4]:
# 2019 totals: columns containing '_in_D' / '_out_D' are dogs, '_in_C' / '_out_C' are cats
intake_dogs_2019 = sum(sum(as19[col]) for col in as19.columns if '_in_D' in col)
outake_dogs_2019 = sum(sum(as19[col]) for col in as19.columns if '_out_D' in col)
intake_cats_2019 = sum(sum(as19[col]) for col in as19.columns if '_in_C' in col)
outake_cats_2019 = sum(sum(as19[col]) for col in as19.columns if '_out_C' in col)

In [5]:
# Same totals for 2020
intake_dogs_2020 = sum(sum(as20[col]) for col in as20.columns if '_in_D' in col)
outake_dogs_2020 = sum(sum(as20[col]) for col in as20.columns if '_out_D' in col)
intake_cats_2020 = sum(sum(as20[col]) for col in as20.columns if '_in_C' in col)
outake_cats_2020 = sum(sum(as20[col]) for col in as20.columns if '_out_C' in col)

In [6]:
final = pd.DataFrame([
        [intake_dogs_2019, intake_dogs_2020],
        [outake_dogs_2019, outake_dogs_2020],
        [intake_cats_2019, intake_cats_2020],
        [outake_cats_2019, outake_cats_2020]
    ],
    columns=['2019', '2020'],
    index=['intake_dogs', 'outake_dogs', 'intake_cats', 'outake_cats']
    )

In [7]:
final

| | 2019 | 2020 |
|---|---|---|
| intake_dogs | 103790.0 | 94498.0 |
| outake_dogs | 97934.0 | 90294.0 |
| intake_cats | 66555.0 | 62588.0 |
| outake_cats | 59022.0 | 58303.0 |


In [8]:
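# Absolute change from 2019 to 2020 (negative means a decrease)
final['Total_diff'] = final['2020'] - final['2019']
# Relative change as a fraction of the 2019 value (multiply by 100 for a percent)
final['Percent_diff'] = final['Total_diff'] / final['2019']
final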

| | 2019 | 2020 | Total_diff | Percent_diff |
|---|---|---|---|---|
| intake_dogs | 103790.0 | 94498.0 | -9292.0 | -0.089527 |
| outake_dogs | 97934.0 | 90294.0 | -7640.0 | -0.078012 |
| intake_cats | 66555.0 | 62588.0 | -3967.0 | -0.059605 |
| outake_cats | 59022.0 | 58303.0 | -719.0 | -0.012182 |
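
For a stakeholder-facing version, the fractions in `Percent_diff` read more naturally as percentages. The sketch below rebuilds the table from the totals shown above and formats a display copy; the names `summary` and `presentable` are illustrative, and the formatting choices (two decimals, thousands separators) are assumptions rather than requirements.

```python
import pandas as pd

# Totals copied from the summary table above.
summary = pd.DataFrame(
    {
        "2019": [103790, 97934, 66555, 59022],
        "2020": [94498, 90294, 62588, 58303],
    },
    index=["intake_dogs", "outake_dogs", "intake_cats", "outake_cats"],
)
summary["Total_diff"] = summary["2020"] - summary["2019"]
summary["Percent_diff"] = summary["Total_diff"] / summary["2019"]

# Format a copy for display so the numeric table stays usable for calculations.
presentable = summary.copy()
presentable["Percent_diff"] = presentable["Percent_diff"].map("{:.2%}".format)
for col in ["2019", "2020", "Total_diff"]:
    presentable[col] = presentable[col].map("{:,}".format)

print(presentable)
```

Keeping the numeric frame separate from the formatted copy avoids accidentally doing arithmetic on strings later.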


- Did data quality improve from 2019 to 2020?

There aren't any noticeable data quality changes between 2019 and 2020. Both files contain some empty rows, and the 2019 data has additional columns (the 12/31/2019 summary data) that were dropped in 2020.
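
One way to back up a data quality comparison with numbers is to count missing values and fully empty rows for each year. The following is a minimal sketch under stated assumptions: `quality_summary` is a hypothetical helper, and the two demo frames are invented, so the figures say nothing about the real files.

```python
import numpy as np
import pandas as pd

def quality_summary(df: pd.DataFrame) -> dict:
    """Collect a few simple completeness measures for one year's data."""
    missing = df.isna()
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "empty_rows": int(missing.all(axis=1).sum()),   # rows with no values at all
        "missing_cells": int(missing.sum().sum()),      # total NaN cells
        "missing_share": float(missing.to_numpy().mean()),  # NaN cells / all cells
    }

# Invented stand-ins for the 2019 and 2020 frames.
demo19 = pd.DataFrame({"stray_in_D": [4.0, np.nan, np.nan],
                       "adopt_out_D": [3.0, np.nan, 1.0]})
demo20 = pd.DataFrame({"stray_in_D": [5.0, 2.0, np.nan],
                       "adopt_out_D": [4.0, 2.0, np.nan]})

q19 = quality_summary(demo19)
q20 = quality_summary(demo20)

# Side-by-side view of the two years.
print(pd.DataFrame({"2019": q19, "2020": q20}))
```

Applied to `as19` and `as20`, a lower `missing_share` or fewer `empty_rows` in 2020 would be evidence that the new collection process helped.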

- Anecdotally, during 2020 it seemed like everyone was adopting new pets - does this show in the data?

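In the data, the numbers for both intake and outtake went down, but the percentages show that a greater share of animals were adopted relative to intake: the outtake-to-intake ratio rose from about 94.4% to 95.6% for dogs and from about 88.7% to 93.2% for cats. Note that the outtake totals sum every `_out_` column, so they may include outcomes other than adoption.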

In [9]:
# Dogs: outtake as a share of intake, per year
print('2019:', final.loc['outake_dogs', '2019'] / final.loc['intake_dogs', '2019'])
print('2020:', final.loc['outake_dogs', '2020'] / final.loc['intake_dogs', '2020'])

2019: 0.9435783794199827
2020: 0.9555122859743064


In [10]:
# Cats: outtake as a share of intake, per year
print('2019:', final.loc['outake_cats', '2019'] / final.loc['intake_cats', '2019'])
print('2020:', final.loc['outake_cats', '2020'] / final.loc['intake_cats', '2020'])

2019: 0.886815415821501
2020: 0.9315363967533713
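
The four ratios above can also be gathered into one small table, which makes the year-over-year change easier to present. This sketch rebuilds the totals from the summary table; the names `totals` and `ratios` are illustrative.

```python
import pandas as pd

# Totals copied from the summary table earlier in the notebook.
totals = pd.DataFrame(
    {
        "2019": [103790, 97934, 66555, 59022],
        "2020": [94498, 90294, 62588, 58303],
    },
    index=["intake_dogs", "outake_dogs", "intake_cats", "outake_cats"],
)

# Outtake divided by intake, one row per species and one column per year.
ratios = pd.DataFrame({
    "dogs": totals.loc["outake_dogs"] / totals.loc["intake_dogs"],
    "cats": totals.loc["outake_cats"] / totals.loc["intake_cats"],
}).T

# Change in the ratio from 2019 to 2020 (positive means a larger share left the shelters).
ratios["change"] = ratios["2020"] - ratios["2019"]

print(ratios.round(4))
```

Both changes are positive, with the larger shift for cats, which matches the reading that relatively more animals left the shelters in 2020 even though the raw counts fell.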
