## Extract, Transform, Load (ETL) Project


Data was sourced from openFDA, the US Food and Drug Administration (USFDA) API. We selected the animal and veterinary endpoints, where the available API was the Adverse Event Report. During our EDA project, we selected data on a small set of attributes of cats, recorded from 1987 to 2021. However, that code would no longer run, so we switched to selecting data on dogs.

We obtained an API key, defined a base URL, and made API calls. Each response was parsed as JSON. The key turned out to be unnecessary and was not used for most of the data retrieval. Data extraction was done in a loop that requested one record per call. The original 750 records were stored in a dataframe and also saved as a .csv file.

Our dataframe comprises reports of incidents where drug exposure resulted in adverse reactions in dogs. After inspecting the variables, we trimmed the dataframe to a set of 435 records. We did not filter the dogs dataset as heavily as we had filtered the cats dataset.

### Retrieving data and creating dataframe

In [1]:
#Setting dependencies

import numpy as np
import pandas as pd

import json
import requests
import time

import matplotlib.pyplot as plt
from pandas.plotting import table
from pprint import pprint
import seaborn as sns

# Import API key

from api_keys import api_key

In [2]:
#creating empty dictionary to store extracted data

cats_data = {'species':[],
                'gender':[],
                'age':[],
                'weight':[],
                'breed':[],
                'drug':[],
                'outcome':[],
                'date_in':[],
                'react_term_code':[],
                'react_term_name':[],
                'first_exp_da':[],
                'last_exp_da':[],
                'admin_by':[],
                'route':[],
                'dosage_form':[]}

#verifying search on Cat species 

#base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat'"
#base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat+(animal.breed.breed_component='Domestic Shorthair'+OR+animal.breed.breed_component='Domestic Longhair')"
#base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat+(animal.breed.breed_component='Crossbred Feline')"
base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.breed.breed_component<>'Unknown'"
req = requests.get(base_url)
data = req.json()

#add 1 row to the cats_data dictionary with information extracted - limit for this api call is 1 result

cats_data['species'].append(data['results'][0]['animal']['species'])
cats_data['gender'].append(data['results'][0]['animal']['gender'])
cats_data['age'].append(data['results'][0]['animal']['age']['min'])
cats_data['weight'].append(data['results'][0]['animal']['weight']['min'])
cats_data['breed'].append(data['results'][0]['animal']['breed']['breed_component'])
cats_data['drug'].append(data['results'][0]['drug'][0]['active_ingredients'][0]['name'])
cats_data['outcome'].append(data['results'][0]['outcome'][0]['medical_status'])

cats_data['date_in'].append(data['results'][0]['original_receive_date'])
cats_data['react_term_code'].append(data['results'][0]['reaction'][0]['veddra_term_code'])
cats_data['react_term_name'].append(data['results'][0]['reaction'][0]['veddra_term_name'])
cats_data['first_exp_da'].append(data['results'][0]['drug'][0]['first_exposure_date'])
cats_data['last_exp_da'].append(data['results'][0]['drug'][0]['last_exposure_date'])
cats_data['admin_by'].append(data['results'][0]['drug'][0]['administered_by'])
cats_data['route'].append(data['results'][0]['drug'][0]['route'])
cats_data['dosage_form'].append(data['results'][0]['drug'][0]['dosage_form'])
print(cats_data)

{'species': ['Dog'], 'gender': ['Female'], 'age': ['10.00'], 'weight': ['5.900'], 'breed': ['Poodle (unspecified)'], 'drug': ['Spinosad'], 'outcome': ['Recovered/Normal'], 'date_in': ['20120627'], 'react_term_code': ['334'], 'react_term_name': ['Vomiting'], 'first_exp_da': ['20120601'], 'last_exp_da': ['20120702'], 'admin_by': ['Animal Owner'], 'route': ['Oral'], 'dosage_form': ['Tablet, chewable']}


In [3]:
# This URL returns only the first match, so we used a loop with a skip value to pick up a different entry on each call;
# otherwise the same one is returned every time.
# Returning 750 results takes a while to run but gives us more data to look at/scrub

cats_data = {'species':[],
                'gender':[],
                'age':[],
                'weight':[],
                'breed':[],
                'drug':[],
                'outcome':[],
                'date_in':[],
                'react_term_code':[],
                'react_term_name':[],
                'first_exp_da':[],
                'last_exp_da':[],
                'admin_by':[],
                'route':[],
                'dosage_form':[]}
#base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat'"
base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.breed.breed_component<>'Unknown'"
# NB: It turns out that the API key was unnecessary - the + api_key part of the call was not copied, and the call was successful. These data were extracted without using the key.

counter = 1
for i in range(750):
    req = requests.get(base_url)
    data = req.json()

    # validate the data, if age and weight don't exist, replace with 0. If date fields don't exist, replace with nan.
    # For all other fields that don't exist, replace with 'Unknown'. 
    try:
        cats_data['age'].append(data['results'][0]['animal']['age']['min'])
    except:
        cats_data['age'].append('0')
    
    try:
        cats_data['weight'].append(data['results'][0]['animal']['weight']['min'])
    except:
        cats_data['weight'].append('0')

    try:
        cats_data['outcome'].append(data['results'][0]['outcome'][0]['medical_status'])
    except:
        cats_data['outcome'].append('Unknown')
    
    try:
        cats_data['date_in'].append(data['results'][0]['original_receive_date'])
    except:
        cats_data['date_in'].append(np.nan)

    try:
        cats_data['species'].append(data['results'][0]['animal']['species'])
    except:
        cats_data['species'].append('Unknown')
        
    try:
        cats_data['gender'].append(data['results'][0]['animal']['gender'])
    except:
        cats_data['gender'].append('Unknown')
    
    try:
        cats_data['breed'].append(data['results'][0]['animal']['breed']['breed_component'])
    except:
        cats_data['breed'].append('Unknown')
    
    try:
        cats_data['drug'].append(data['results'][0]['drug'][0]['active_ingredients'][0]['name'])
    except:
        cats_data['drug'].append('Unknown')
        
    try:    
        cats_data['react_term_code'].append(data['results'][0]['reaction'][0]['veddra_term_code'])
    except:
        cats_data['react_term_code'].append('Unknown')
    
    try:
        cats_data['react_term_name'].append(data['results'][0]['reaction'][0]['veddra_term_name'])
    except:
        cats_data['react_term_name'].append('Unknown')
    
    try:
        cats_data['first_exp_da'].append(data['results'][0]['drug'][0]['first_exposure_date'])
    except:
        cats_data['first_exp_da'].append(np.nan)
        
    try: 
        cats_data['last_exp_da'].append(data['results'][0]['drug'][0]['last_exposure_date'])
    except:
        cats_data['last_exp_da'].append(np.nan)
        
    try:
        cats_data['admin_by'].append(data['results'][0]['drug'][0]['administered_by'])
    except:
        cats_data['admin_by'].append('Unknown')
        
    try:
        cats_data['route'].append(data['results'][0]['drug'][0]['route'])
    except:
        cats_data['route'].append('Unknown')
        
    try:
        cats_data['dosage_form'].append(data['results'][0]['drug'][0]['dosage_form'])
    except:
        cats_data['dosage_form'].append('Unknown')
        
    counter +=1
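    # counter is converted to a string and used as the skip value, so each call returns the next record rather than the first one
    # NB: counter starts at 1 and is incremented before its first use, so skip=1 is never requested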
    str_count = str(counter)
    #base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat+(animal.breed.breed_component='Domestic Shorthair'+OR+animal.breed.breed_component='Domestic Longhair')&skip=" + str_count
    #base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.species=Cat'&skip=" + str_count
    base_url = "https://api.fda.gov/animalandveterinary/event.json?search=animal.breed.breed_component<>'Unknown'&skip=" + str_count
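The try/except blocks in the loop above all follow one pattern: walk a path of keys and list indexes into the JSON record and fall back to a default when any step is missing. A minimal sketch of that pattern is shown below. The names `dig`, `FIELDS` and `extract_record` are hypothetical and not part of the original notebook, and `sample_result` is a made-up record shaped like the output of In [2], with the weight and last exposure date left out on purpose.

```python
import numpy as np

def dig(record, path, default):
    """Follow keys/indexes into nested JSON; return default if any step is missing."""
    current = record
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# (column name, path into one result, fallback) with the same fallbacks as the loop above
FIELDS = [
    ("species", ("animal", "species"), "Unknown"),
    ("gender", ("animal", "gender"), "Unknown"),
    ("age", ("animal", "age", "min"), "0"),
    ("weight", ("animal", "weight", "min"), "0"),
    ("breed", ("animal", "breed", "breed_component"), "Unknown"),
    ("drug", ("drug", 0, "active_ingredients", 0, "name"), "Unknown"),
    ("outcome", ("outcome", 0, "medical_status"), "Unknown"),
    ("date_in", ("original_receive_date",), np.nan),
    ("react_term_code", ("reaction", 0, "veddra_term_code"), "Unknown"),
    ("react_term_name", ("reaction", 0, "veddra_term_name"), "Unknown"),
    ("first_exp_da", ("drug", 0, "first_exposure_date"), np.nan),
    ("last_exp_da", ("drug", 0, "last_exposure_date"), np.nan),
    ("admin_by", ("drug", 0, "administered_by"), "Unknown"),
    ("route", ("drug", 0, "route"), "Unknown"),
    ("dosage_form", ("drug", 0, "dosage_form"), "Unknown"),
]

def extract_record(result):
    """Flatten one API result into a dict keyed by dataframe column name."""
    return {name: dig(result, path, default) for name, path, default in FIELDS}

# Made-up record for illustration; weight and last_exposure_date are missing
sample_result = {
    "animal": {"species": "Dog", "gender": "Female", "age": {"min": "10.00"},
               "breed": {"breed_component": "Poodle (unspecified)"}},
    "drug": [{"active_ingredients": [{"name": "Spinosad"}],
              "first_exposure_date": "20120601",
              "administered_by": "Animal Owner", "route": "Oral",
              "dosage_form": "Tablet, chewable"}],
    "outcome": [{"medical_status": "Recovered/Normal"}],
    "original_receive_date": "20120627",
    "reaction": [{"veddra_term_code": "334", "veddra_term_name": "Vomiting"}],
}

row = extract_record(sample_result)
print(row)
```

Catching only `KeyError`, `IndexError` and `TypeError` keeps unrelated problems (for example a network failure) visible instead of silently recording them as 'Unknown'. A list of such row dicts could then be passed to `pd.DataFrame` directly.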

In [5]:
# converting the raw data in the cats dictionary to a dataframe and writing it out to a csv file - mainly because it takes
# a while to run so if the dataframe gets messed up, it can be read from the file instead of running the api again

cat_df = pd.DataFrame.from_dict(cats_data)
cat_df.to_csv("cat_data.csv",index=False)

#print the info of the dataframe

cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750 entries, 0 to 749
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   species          750 non-null    object
 1   gender           750 non-null    object
 2   age              750 non-null    object
 3   weight           750 non-null    object
 4   breed            750 non-null    object
 5   drug             750 non-null    object
 6   outcome          750 non-null    object
 7   date_in          750 non-null    object
 8   react_term_code  750 non-null    object
 9   react_term_name  750 non-null    object
 10  first_exp_da     687 non-null    object
 11  last_exp_da      655 non-null    object
 12  admin_by         750 non-null    object
 13  route            750 non-null    object
 14  dosage_form      750 non-null    object
dtypes: object(15)
memory usage: 88.0+ KB


In [16]:
# read the csv file in 
# changing to just use dogs soon so naming the dataframe appropriately
dog_df = pd.read_csv('cat_data.csv')

#### Begin data analysis and pre-processing

In [17]:
dog_df['species'].value_counts()

Dog              629
Cat               61
Horse             19
Cattle            17
Human             13
Pig                3
Unknown            2
Chicken            2
Turkey             1
Mouse              1
Other Birds        1
Other Mammals      1
Name: species, dtype: int64

We are only working with dogs this time around, so the remaining species are dropped below.

In [18]:
dog_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750 entries, 0 to 749
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   species          750 non-null    object 
 1   gender           750 non-null    object 
 2   age              750 non-null    float64
 3   weight           750 non-null    float64
 4   breed            750 non-null    object 
 5   drug             750 non-null    object 
 6   outcome          750 non-null    object 
 7   date_in          750 non-null    int64  
 8   react_term_code  750 non-null    int64  
 9   react_term_name  750 non-null    object 
 10  first_exp_da     687 non-null    float64
 11  last_exp_da      655 non-null    float64
 12  admin_by         750 non-null    object 
 13  route            750 non-null    object 
 14  dosage_form      750 non-null    object 
dtypes: float64(4), int64(2), object(9)
memory usage: 88.0+ KB


In [19]:
dog_df.drop(dog_df.index[dog_df['species'] != 'Dog'], inplace = True)
dog_df['species'].value_counts()

Dog    629
Name: species, dtype: int64

Next, look at age and weight to see about dropping the rows where they are 0, since those were unknowns from the API pull.

In [20]:
dog_df['age'].value_counts()

0.00     82
2.00     55
3.00     52
4.00     43
7.00     43
8.00     42
5.00     41
6.00     39
10.00    35
11.00    26
9.00     23
1.00     22
12.00    17
13.00    17
14.00    15
15.00    10
16.00     7
1.50      6
17.00     5
22.00     4
4.50      4
18.00     3
21.00     3
2.50      3
1.30      3
6.50      2
1.25      2
3.50      2
2.40      1
8.75      1
3.75      1
12.50     1
9.70      1
12.70     1
9.80      1
2.90      1
19.00     1
4.30      1
4.20      1
23.00     1
11.50     1
10.80     1
4.90      1
7.50      1
5.50      1
20.00     1
1.20      1
10.50     1
16.50     1
6.40      1
9.60      1
Name: age, dtype: int64

The dogs with an age of 0 need to be dropped. Once those are dropped, the age column is converted to int (which truncates fractional ages) so that more of the ages are grouped together.

In [21]:
dog_df.drop(dog_df.index[dog_df['age'] == 0], inplace = True)
dog_df['age'].describe()

count    547.000000
mean       7.027971
std        4.516529
min        1.000000
25%        3.000000
50%        6.000000
75%       10.000000
max       23.000000
Name: age, dtype: float64
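The zeros in age and weight are placeholders written during extraction, not real measurements. One alternative to dropping on `== 0` column by column is to turn the placeholder into a missing value and let pandas' missing-data tools do the filtering. The sketch below illustrates this on a small made-up frame; `toy`, `dogs` and `clean` are hypothetical names, and the values are for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows; 0.0 stands for "unknown", as in the extraction loop
toy = pd.DataFrame({
    "species": ["Dog", "Dog", "Cat", "Dog"],
    "age": [0.0, 2.5, 4.0, 7.0],
    "weight": [6.8, 0.0, 3.4, 11.34],
})

# Keep dogs only, working on a copy so the original frame is untouched
dogs = toy[toy["species"] == "Dog"].copy()

# Replace the 0 placeholder with NaN in both numeric columns
dogs[["age", "weight"]] = dogs[["age", "weight"]].replace(0, np.nan)

# Drop rows where either measurement is missing
clean = dogs.dropna(subset=["age", "weight"])
print(clean)
```

Restricting `dropna` to a `subset` avoids discarding rows because of nulls in unrelated columns.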

In [22]:
dog_df[["age"]] = dog_df[["age"]].astype(int)
dog_df['age'].value_counts()

2     60
3     55
4     50
7     44
8     43
5     42
6     42
10    37
1     34
11    27
9     26
12    19
13    17
14    15
15    10
16     8
17     5
22     4
21     3
18     3
20     1
23     1
19     1
Name: age, dtype: int64

Time to evaluate the weight field, dropping the rows with zeroes. This column is also converted to int in order to group weights together.

In [23]:
dog_df['weight'].value_counts()

0.000     14
6.800      9
11.340     7
3.629      6
4.990      6
          ..
4.763      1
6.033      1
3.402      1
12.000     1
6.010      1
Name: weight, Length: 343, dtype: int64

In [24]:
dog_df.drop(dog_df.index[dog_df['weight'] == 0], inplace = True)
dog_df['weight'].describe()

count    533.000000
mean      18.705396
std       13.339007
min        0.573000
25%        7.031000
50%       15.876000
75%       28.580000
max       72.575000
Name: weight, dtype: float64

In [26]:
dog_df[["weight"]] = dog_df[["weight"]].astype(int)
dog_df['weight'].value_counts()

9     31
4     30
6     26
3     24
5     23
2     23
7     20
8     18
27    18
30    15
18    13
24    13
32    13
10    13
11    13
15    13
28    12
19    12
17    12
29    12
34    11
31    11
14    11
12    10
26    10
33     9
20     9
13     9
22     9
39     8
35     8
23     8
25     6
1      6
38     6
16     5
40     5
36     5
43     4
45     3
42     3
37     3
46     3
58     3
44     2
52     2
21     2
67     1
48     1
53     1
0      1
41     1
50     1
62     1
72     1
Name: weight, dtype: int64

Converting to int caused a 0 to show up again (the minimum weight of 0.573 truncates to 0), so that row is dropped as well.

In [27]:
dog_df.drop(dog_df.index[dog_df['weight'] == 0], inplace = True)
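The reappearing 0 comes from how `astype(int)` converts floats: it truncates toward zero rather than rounding. The short sketch below shows the difference on a few weights taken from the output above; rounding first is shown only as one possible alternative, not as what the notebook does.

```python
import pandas as pd

# Weights taken from the describe()/value_counts() output above
weights = pd.Series([0.573, 6.8, 11.34, 28.58])

# astype(int) discards the fractional part
truncated = weights.astype(int)

# Rounding to the nearest whole number before converting
rounded = weights.round().astype(int)

print(truncated.tolist())  # [0, 6, 11, 28]
print(rounded.tolist())    # [1, 7, 11, 29]
```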

Look at the three date fields next. first_exp_da and last_exp_da have some nulls. Dropping those rows to see what is left.

In [28]:
dog_df['last_exp_da'].isnull().sum()

36

In [29]:
dog_df.dropna(inplace=True)

Dropping the nulls leaves 494 rows in the dog_df dataframe.

In [30]:
dog_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 494 entries, 0 to 749
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   species          494 non-null    object 
 1   gender           494 non-null    object 
 2   age              494 non-null    int32  
 3   weight           494 non-null    int32  
 4   breed            494 non-null    object 
 5   drug             494 non-null    object 
 6   outcome          494 non-null    object 
 7   date_in          494 non-null    int64  
 8   react_term_code  494 non-null    int64  
 9   react_term_name  494 non-null    object 
 10  first_exp_da     494 non-null    float64
 11  last_exp_da      494 non-null    float64
 12  admin_by         494 non-null    object 
 13  route            494 non-null    object 
 14  dosage_form      494 non-null    object 
dtypes: float64(2), int32(2), int64(2), object(9)
memory usage: 57.9+ KB
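In the `info()` output above, `date_in` is stored as int64 and the two exposure dates as float64, because the YYYYMMDD strings were read back from the csv file as numbers. If real dates are needed later (for example to compute exposure duration), they could be parsed along the lines of the sketch below. `parse_yyyymmdd` and the `exposure_days` column are hypothetical, and the sample values are for illustration.

```python
import numpy as np
import pandas as pd

def parse_yyyymmdd(series):
    """Convert numeric YYYYMMDD values (int or float, possibly NaN) to datetimes."""
    # Turn 20120601.0 into '20120601'; missing values are left untouched
    text = series.map(lambda v: str(int(v)), na_action="ignore")
    return pd.to_datetime(text, format="%Y%m%d", errors="coerce")

# Illustrative values in the same numeric form as the dataframe columns
dates = pd.DataFrame({
    "date_in": [20120627, 20130115],
    "first_exp_da": [20120601.0, np.nan],
    "last_exp_da": [20120702.0, 20130110.0],
})

# Parse every column, then derive the number of days between exposures
parsed = dates.apply(parse_yyyymmdd)
parsed["exposure_days"] = (parsed["last_exp_da"] - parsed["first_exp_da"]).dt.days
print(parsed)
```

Rows with a missing exposure date end up with `NaT` and a missing `exposure_days`, so they can still be filtered afterwards with `dropna`.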


### Dataframe wrangling

Continuing the clean-up of the dataframe.

In [31]:
dog_df['outcome'].value_counts()

Outcome Unknown           274
Recovered/Normal          131
Ongoing                    53
Recovered with Sequela     26
Died                        6
Euthanized                  4
Name: outcome, dtype: int64

In [32]:
dog_df['breed'].value_counts()

Retriever - Labrador                                45
Crossbred Canine/dog                                33
Chihuahua                                           18
Beagle                                              15
Shepherd Dog - German                               15
                                                    ..
['Pointer (unspecified)', 'Dog (unknown)']           1
['Terrier (unspecified)', 'Dog (other)']             1
['Shepherd Dog - German', 'Crustacea (unknown)']     1
Terrier - Cairn                                      1
Schipperke                                           1
Name: breed, Length: 152, dtype: int64

There are a lot of stray characters in the breed names. Try to strip them out first before grouping the breeds together.

In [33]:
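dog_df['breed'] = dog_df['breed'].str.strip('[]')
# NB: strip('()') also removes the closing ')' from names that end with one,
# e.g. 'Dog (unknown)' becomes 'Dog (unknown'; some replacements below match that form
dog_df['breed'] = dog_df['breed'].str.strip('()')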
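The bracketed breed values are Python lists that were written to the csv file as text. Instead of stripping characters, such a cell could be parsed back into a list and reduced to a single component. The sketch below keeps the first component, which is an assumption: the manual replacements in this notebook usually, but not always, keep the first breed. `first_breed` is a hypothetical helper.

```python
import ast
import pandas as pd

def first_breed(value):
    """Return the first breed from a list-like csv cell, or the cell itself otherwise."""
    text = str(value).strip()
    if text.startswith("["):
        try:
            parsed = ast.literal_eval(text)  # safely evaluate the list literal
        except (ValueError, SyntaxError):
            return text
        if isinstance(parsed, list) and parsed:
            return str(parsed[0])
    return text

# Examples in the forms seen in the value_counts() output above
breeds = pd.Series([
    "Retriever - Labrador",
    "['Pointer (unspecified)', 'Dog (unknown)']",
    "Dog (unknown)",
])
first = breeds.map(first_breed)
print(first.tolist())
```

Because nothing is stripped, names such as 'Dog (unknown)' keep their closing parenthesis.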


In [34]:
dog_df['breed'].value_counts()

Retriever - Labrador                              45
Crossbred Canine/dog                              33
Chihuahua                                         18
Beagle                                            15
Shepherd Dog - German                             15
                                                  ..
'Pointer (unspecified)', 'Dog (unknown)'           1
'Terrier (unspecified)', 'Dog (other)'             1
'Shepherd Dog - German', 'Crustacea (unknown)'     1
Terrier - Cairn                                    1
Schipperke                                         1
Name: breed, Length: 152, dtype: int64

Next, clean up the breed names so that they can be grouped together.

In [35]:
dog_df['breed'].replace("'Pointer (unspecified)', 'Dog (unknown)'",'Pointer', inplace=True)
dog_df['breed'].replace('Terrier - Cairn','Terrier', inplace=True)
dog_df['breed'].replace('Terrier (unspecified)', 'Terrier', inplace=True)
dog_df['breed'].replace("'Terrier (unspecified)', 'Dog (unknown)'",'Terrier', inplace=True)

In [36]:
dog_df['breed'].replace("'Terrier (unspecified)', 'Dog (other)'",'Terrier', inplace=True)
dog_df['breed'].replace("'Retriever - Golden', 'Poodle (unspecified)'", 'Retriever - Golden', inplace=True)
dog_df['breed'].replace("'Spitz - German Pomeranian', 'Chihuahua'", 'Chihuahua', inplace=True)
dog_df['breed'].replace("'Shepherd Dog - German', 'Crustacea (unknown)'",'Shepherd Dog - German', inplace=True)

In [37]:
dog_df['breed'].replace('Terrier - Border', 'Terrier', inplace=True)
dog_df['breed'].replace("'Maltese', 'Japanese Chin (Spaniel)', 'Papillon - Spaniel - Continental Toy (with erect ears or with dropped ears (Phaléne))', 'Shih Tzu'",'Maltese', inplace=True)
dog_df['breed'].replace("'Shepherd Dog - Australian', 'Retriever - Labrador'",'Retriever - Labrador', inplace=True)
dog_df['breed'].replace("'Chihuahua', 'Crossbred Canine/dog'", 'Chihuahua', inplace=True)

In [38]:
dog_df['breed'].replace("'Spaniel - Cocker American'", 'Spaniel - Cocker American', inplace=True)
dog_df['breed'].replace("'Rottweiler'", 'Rottweiler',inplace=True)
dog_df['breed'].replace("'Poodle - Standard'", 'Poodle - Standard', inplace=True)
dog_df['breed'].replace("Retriever - Labrador', 'Dog (unknown)", 'Retriever - Labrador', inplace=True)
dog_df['breed'].replace('Dog (unknown', 'Dog', inplace=True)
dog_df['breed'].replace('Boxer (German Boxer','German Boxer', inplace=True)
dog_df['breed'].replace('Dachshund (unspecified', 'Dachshund', inplace=True)
dog_df['breed'].replace("Retriever - Labrador', 'Poodle (unspecified)", 'Retriever - Labrador Poodle', inplace=True)
dog_df['breed'].replace('Hound (unspecified)', 'Hound', inplace=True)
dog_df['breed'].replace('Hound (unspecified', 'Hound', inplace=True)
dog_df['breed'].replace("'Retriever - Labrador', 'Dog (unknown)', 'Bulldog - French'", 'Retriever - Labrador', inplace=True)
dog_df['breed'].replace("'Maine Coon', 'Ragdoll'", 'Maine Coon', inplace=True)
dog_df['breed'].replace('Dachshund - Standard Wire-haired', 'Dachshund', inplace=True)
dog_df['breed'].replace("'Retriever - Labrador', 'Shepherd (unspecified)'", 'Retriever - Labrador',inplace=True)
dog_df['breed'].replace('Terrier - Silky', 'Terrier', inplace=True)

In [39]:
dog_df['breed'].replace("'Terrier - Yorkshire', 'Chihuahua'", 'Terrier - Yorkshire',inplace=True)
dog_df['breed'].replace("'Siberian Husky', 'Dog (unknown)'",'Siberian Husky', inplace=True)
dog_df['breed'].replace("'Poodle (unspecified)', 'Dog (unknown)'",'Poodle', inplace=True)
dog_df['breed'].replace("'Mastiff', 'Dog (unknown)'",'Mastiff', inplace=True)
dog_df['breed'].replace("'Shepherd Dog - German', 'Crossbred Canine/dog'", 'Shepherd Dog - German', inplace=True)
dog_df['breed'].replace("'Cattle Dog - Australian (blue heeler, red heeler, Queensland cattledog)', 'Collie - Border'", 'Cattle Dog', inplace=True)
dog_df['breed'].replace("'Cattle Dog - Australian (blue heeler, red heeler, Queensland cattledog)', 'Dog(unknown)'", 'Cattle Dog', inplace=True)
dog_df['breed'].replace("'Cattle Dog - Australian (blue heeler, red heeler, Queensland cattledog)', 'Dog (unknown)'", 'Cattle Dog', inplace=True)
dog_df['breed'].replace("'Maltese','Poodle (unspecified)'",'Maltese',inplace=True)
dog_df['breed'].replace("'Maltese', 'Poodle (unspecified)'",'Maltese', inplace=True)
dog_df['breed'].replace("'Retriever - Labrador', 'Dog (unknown)'", 'Retriever - Labrador', inplace=True)
dog_df['breed'].replace('Shepherd (unspecified', 'Shepherd', inplace=True)


In [40]:
dog_df['breed'].replace("'Pug', 'Chihuahua'", 'Pug', inplace=True)
dog_df['breed'].replace("'Spaniel - Cocker American', 'Poodle (unspecified)'", 'Spaniel - Cocker American', inplace=True)
dog_df['breed'].replace("'Shepherd Dog - German', 'Dog (unknown)'", 'Shepherd Dog - German', inplace=True)
dog_df['breed'].replace("'Shepherd (unspecified)', 'Dog (unknown)'", 'Shepherd', inplace=True)
dog_df['breed'].replace("'Alaskan Malamute', 'Retriever - Labrador'", 'Alaskan Malamute', inplace=True)
dog_df['breed'].replace("'Bulldog', 'Pit Bull'", 'Bulldog', inplace=True)
dog_df['breed'].replace("'Chihuahua', 'Dachshund (unspecified)'", 'Chihuahua', inplace=True)
dog_df['breed'].replace("'Chihuahua', 'Greyhound - Italian'", 'Chihuahua', inplace=True)
dog_df['breed'].replace("'Chinese Crested Dog (unspecified)', 'Poodle (unspecified)'", 'Chinese Crested Dog', inplace=True)
dog_df['breed'].replace("'Collie - Border', 'Dog (unknown)'", 'Collie - Border', inplace=True)
dog_df['breed'].replace("'Crossbred Canine/dog', 'Retriever - Labrador'", 'Retriever - Labrador', inplace=True)
dog_df['breed'].replace("'Crossbred Canine/dog', 'Shepherd Dog - German'", 'Shepherd Dog - German', inplace=True)
dog_df['breed'].replace("'Dachshund (unspecified)', 'Dog (unknown)'", 'Dachshund', inplace=True)
dog_df['breed'].replace("'Great Pyrenees', 'Dog (unknown)'", 'Great Pyrenees', inplace=True)


In [41]:
dog_df['breed'].value_counts()

Retriever - Labrador             59
Crossbred Canine/dog             33
Chihuahua                        21
Shepherd Dog - German            21
Beagle                           15
                                 ..
'Rottweiler', 'Dog (unknown)'     1
Shar Pei                          1
'Pug', 'Shih Tzu'                 1
Griffon - Brussels                1
Schipperke                        1
Name: breed, Length: 125, dtype: int64
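The breed clean-up in cells In [35] to In [40] applies one `replace` call per value. The same idea can be expressed with a single dictionary passed to `Series.replace`, which keeps all the groupings in one place and makes duplicates easy to spot. The sketch below uses a small subset of the replacements above; `BREED_MAP` is a hypothetical name and the sample series is made up.

```python
import pandas as pd

# A few of the replacements from the cells above, collected in one mapping
BREED_MAP = {
    "Terrier - Cairn": "Terrier",
    "Terrier - Border": "Terrier",
    "Terrier - Silky": "Terrier",
    "'Pug', 'Chihuahua'": "Pug",
    "'Mastiff', 'Dog (unknown)'": "Mastiff",
}

breeds = pd.Series(["Terrier - Cairn", "'Pug', 'Chihuahua'", "Beagle", "Terrier - Silky"])

# Values are matched exactly; anything not in the mapping is left unchanged
cleaned = breeds.replace(BREED_MAP)
print(cleaned.value_counts())
```

Assigning the result back, as in `dog_df['breed'] = dog_df['breed'].replace(BREED_MAP)`, also avoids calling `replace(..., inplace=True)` on a selected column, a pattern that recent pandas releases warn about.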

In [42]:
# Get the count of each value
value_counts = dog_df['breed'].value_counts()

# Select the values where the count is less than 2
to_remove = value_counts[value_counts < 2].index

# Keep rows where the breed column is not in to_remove
dog_df = dog_df[~dog_df.breed.isin(to_remove)]

In [43]:
dog_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 1 to 749
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   species          435 non-null    object 
 1   gender           435 non-null    object 
 2   age              435 non-null    int32  
 3   weight           435 non-null    int32  
 4   breed            435 non-null    object 
 5   drug             435 non-null    object 
 6   outcome          435 non-null    object 
 7   date_in          435 non-null    int64  
 8   react_term_code  435 non-null    int64  
 9   react_term_name  435 non-null    object 
 10  first_exp_da     435 non-null    float64
 11  last_exp_da      435 non-null    float64
 12  admin_by         435 non-null    object 
 13  route            435 non-null    object 
 14  dosage_form      435 non-null    object 
dtypes: float64(2), int32(2), int64(2), object(9)
memory usage: 51.0+ KB


In [44]:
dog_df['react_term_code'].value_counts()

334     70
335     40
305     34
1039    28
2430    24
        ..
193      1
998      1
124      1
2071     1
837      1
Name: react_term_code, Length: 104, dtype: int64

In [45]:
dog_df['react_term_name'].value_counts()

Vomiting                                       70
Emesis                                         40
Digestive tract disorder NOS                   34
Lack of efficacy - NOS                         28
Lack of efficacy (endoparasite) - heartworm    23
                                               ..
Inappropriate urination                         1
General illness                                 1
Skin irritation                                 1
INEFFECTIVE, ANTIBIOTIC                         1
Abnormal breathing                              1
Name: react_term_name, Length: 107, dtype: int64

In [46]:
dog_df['react_term_name'].replace('Emesis (multiple)', 'Emesis', inplace=True)

In [47]:
dog_df['drug'].value_counts()

Spinosad                                                         151
Ivermectin                                                        72
Milbemycin Oxime                                                  38
Afoxolaner                                                        22
Moxidectin                                                        13
Nitenpyram                                                         9
Imidacloprid                                                       8
Melarsomine Dihydrochloride Injection                              8
Milbemycin Oxime, Lufenuron                                        7
Sarolaner                                                          7
Oclacitinib Maleate                                                6
Ivermectin 272Mcg, Pyrantel Pamoate 228Mg, Praziquantel 228Mg      6
Fluralaner Chew Tablets                                            6
Ivermectin/Pyrantel Pamoate Chewable 136Mcg/326Mg                  5
Ivermectin/Pyrantel Pamoate Chewab

Finally, we look at the "Gender" category:

In [48]:
dog_df['gender'].value_counts()

Female     246
Male       183
Unknown      4
Mixed        2
Name: gender, dtype: int64

There are four categories; we did not drop "Unknown" and "Mixed" from the set. "Unknown" comes from a lack of information, and it is unclear whether "Mixed" refers to a hermaphrodite animal.

Now, we can run the summary statistics on the numerical values:

In [49]:
dog_df['age'].describe()

count    435.000000
mean       6.965517
std        4.564056
min        1.000000
25%        3.000000
50%        6.000000
75%       10.000000
max       23.000000
Name: age, dtype: float64

And summary proportions of nominal categorical variables:

In [50]:
dog_df['breed'].value_counts()/dog_df['breed'].value_counts().sum()

Retriever - Labrador     0.135632
Crossbred Canine/dog     0.075862
Chihuahua                0.048276
Shepherd Dog - German    0.048276
Beagle                   0.034483
                           ...   
Weimaraner               0.004598
Mountain Cur             0.004598
Dachshund - Miniature    0.004598
Dalmatian                0.004598
Dog (other               0.004598
Name: breed, Length: 66, dtype: float64
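The proportions in these cells are computed by dividing the counts by their sum. `value_counts` also accepts `normalize=True`, which returns the same proportions directly. The sketch below compares the two on a small made-up series.

```python
import pandas as pd

# Made-up sample for illustration
gender = pd.Series(["Female", "Male", "Female", "Unknown"])

# Approach used in the notebook: counts divided by their total
manual = gender.value_counts() / gender.value_counts().sum()

# Built-in equivalent
builtin = gender.value_counts(normalize=True)

print(builtin)
```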

In [51]:
dog_df['gender'].value_counts()/dog_df['gender'].value_counts().sum()

Female     0.565517
Male       0.420690
Unknown    0.009195
Mixed      0.004598
Name: gender, dtype: float64

In [52]:
dog_df['outcome'].value_counts()/dog_df['outcome'].value_counts().sum()

Outcome Unknown           0.563218
Recovered/Normal          0.262069
Ongoing                   0.108046
Recovered with Sequela    0.050575
Died                      0.009195
Euthanized                0.006897
Name: outcome, dtype: float64

In [53]:
dog_df['drug'].value_counts()/dog_df['drug'].value_counts().sum()

Spinosad                                                         0.347126
Ivermectin                                                       0.165517
Milbemycin Oxime                                                 0.087356
Afoxolaner                                                       0.050575
Moxidectin                                                       0.029885
Nitenpyram                                                       0.020690
Imidacloprid                                                     0.018391
Melarsomine Dihydrochloride Injection                            0.018391
Milbemycin Oxime, Lufenuron                                      0.016092
Sarolaner                                                        0.016092
Oclacitinib Maleate                                              0.013793
Ivermectin 272Mcg, Pyrantel Pamoate 228Mg, Praziquantel 228Mg    0.013793
Fluralaner Chew Tablets                                          0.013793
Ivermectin/Pyrantel Pamoate Chewable 1

The last step prior to updating the database is to write the csv file out.

In [54]:
dog_df.to_csv("dog_data.csv",index=False)