# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess the data programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data into your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions towards methods used for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement
In 2-4 sentences, explain the kind of problem you want to look at and the datasets you will be wrangling for this project.

 *The City of Vancouver tracks not only every animal that comes into its shelters, but also those that are reported as lost by their owners. While the city does track those that are matched back to their owner, is it possible that an animal still tracked as lost has already been accounted for? If so, is there a reasonably reliable way to match animals based on the data entered into a lost-animal report?*

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### City of Vancouver Animal Control Inventory - Lost and Found

This dataset is information from the City of Vancouver where an owner of an animal has reported them as lost. It also tracks those that were either reported as found or were matched by the shelter back to the owner. 

I chose this dataset because it shows which animals were reported as lost within the city. It does not cover every animal that was lost; however, it does provide a large sample for this metro area.

Further information on the dataset can be found [here](https://opendata.vancouver.ca/explore/dataset/animal-control-inventory-lost-and-found/information).

Type: JSON

Method: This data was gathered by querying the City of Vancouver's database with the standard Opendatasoft API. I am doing it this way because the data is updated daily, and this guarantees that the most up-to-date information will be used.

Dataset variables:

- *breed* - type of animal or breed that fits best.
- *color* - color of the animal's coat/fur.
- *date* - date that the animal was lost
- *name* - the given name of the animal being tracked (if known).
- *sex* - used to label the biological sex of the animal, as well as if they are spayed or neutered (marked with `F/S` or `M/N` accordingly). `X` = unknown
- *state* - the last state of being for the animal, i.e. `matched` or `lost`.

After some poking around, I found out that without a `group by` statement, the server only returns 100 results. By including a `group by` statement for all of the fields, this should theoretically drop duplicate values. I'm also filtering out anything before `1998-10-03`, as this is the earliest `DateImpounded` timestamp in the other dataset that will be used.

In [80]:
import requests
import pandas as pd
import json
import datetime

# "lf" is for *l*ost and *f*ound
lf_api_query = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-lost-and-found/records?where=date%20%3E%20%221998-10-02%22&group_by=date%2C%20breed%2C%20color%2C%20name%2C%20sex%2C%20state&order_by=date&limit=-1"
lf_data = requests.get(lf_api_query)
lf_data.raise_for_status()
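
For readability, the same request URL can be assembled from a dictionary of parameters instead of a hand-encoded string. The sketch below uses only the standard library. The names `LF_BASE_URL`, `lf_params`, and `lf_api_query_built` are illustrative and not part of the original notebook; the endpoint and parameter values are copied from the cell above.

```python
from urllib.parse import urlencode, quote

# Illustrative names; endpoint and parameter values are copied from the cell above.
LF_BASE_URL = (
    "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/"
    "animal-control-inventory-lost-and-found/records"
)

lf_params = {
    "where": 'date > "1998-10-02"',                       # only reports after this date
    "group_by": "date, breed, color, name, sex, state",   # group on every field
    "order_by": "date",
    "limit": -1,
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+")
lf_api_query_built = f"{LF_BASE_URL}?{urlencode(lf_params, quote_via=quote)}"
print(lf_api_query_built)
```

Keeping the parameters in a dictionary makes it easier to change the date filter or the grouped fields without re-encoding the string by hand.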

In [81]:
lf_json = lf_data.json()

print(lf_json.keys())
print(lf_json['total_count'])

# `results` holds the records themselves; check its container and element types
print(type(lf_json['results']))     # the full collection
print(type(lf_json['results'][0]))  # a single record

dict_keys(['total_count', 'results'])
17829
<class 'list'>
<class 'dict'>


Based on the quick inspection above, we want to load the data under the `results` key into a pandas DataFrame. `results` is simply a `list` of `dict`s, which the `pandas.DataFrame` constructor can handle directly.

In [82]:
lf_df = pd.DataFrame(lf_json['results'])

#### City of Vancouver Animal Control Inventory - Register

This dataset is a "general record of each animal that has come into the custody" of the City of Vancouver's animal control service.

I chose this dataset to have a record to compare all of the lost and found animals to in the event an animal is reported as lost and the City of Vancouver happens to have them, or someone very much like them, already processed into their database.

As with the lost and found dataset, this data is updated daily. Because I must use a different gathering method for this dataset, I will download it programmatically, and in CSV format, to make sure I cover all bases for this project.

Type: Semicolon (`;`) delimited "CSV" file.

Method: Programmatic download via HTTP GET request

Dataset variables:

- *AnimalID* - Unique sequential number given to each entry.
- *Breed* - Type of animal.
- *ShotsDate* - Date when vaccinated.
- *Sex* - M = Male, F = Female, M/N = Male Neutered, F/S = Female Spayed.
- *ReceiptNumber* - Point of sales system of record receipt number.
- *DateImpounded* - Date first in custody of the City of Vancouver.
- *PitNumber* - Number identifying animal kennel, does not change while in custody of the city.
- *Name* - Name if known.
- *KennelNumber* - Kennel number displayed at the top of each kennel.
- *DispositionDate* - Date when animal was no longer under the control of the city.
- *Color* - Color of coat.
- *Code* - Walk-ability index (*Green = easy, Yellow = moderate, Blue = hard*).
- *ApproxWeight* - Approximate weight of animal.
- *Age category* - Rough estimate of age - puppy, young adult, adult, senior.
- *Source* - Where the animal came from (Brought-in, Holding stray, Transferred).
- *Status* - Current state/disposition of animal.
- *ACO* - Animal control officer number or initials of employee.

In [83]:
# "reg" is for registry
reg_url = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-register/exports/csv?lang=en&timezone=America%2FChicago&use_labels=true&delimiter=%3B"
reg_data = requests.get(reg_url)
reg_data.raise_for_status()

In [84]:
file_dl_date = datetime.date.today()
filename = f"vancouver-ac-registry_{file_dl_date.strftime('%Y%m%d')}.csv"
relativepath = "./datasets/" + filename

with open(relativepath, mode="wb") as f:
    f.write(reg_data.content)
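
Note that the `open` call above assumes the `./datasets/` directory already exists and raises `FileNotFoundError` otherwise. A minimal sketch of a helper that creates the directory first is shown below; `save_bytes` is a hypothetical name, and the payload in the usage lines is dummy data rather than the real download.

```python
from pathlib import Path
import tempfile

def save_bytes(content: bytes, directory: str, filename: str) -> Path:
    """Write content to directory/filename, creating the directory if needed."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = target_dir / filename
    target.write_bytes(content)
    return target

# Usage with throwaway data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b"AnimalID;Breed\n1;Pit Bull\n", f"{tmp}/datasets", "example.csv")
    print(saved.name)
```

In this notebook the equivalent call would be something like `save_bytes(reg_data.content, "./datasets", filename)`.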

We'll make the `AnimalID` column our index column for our DataFrame as that seems most sensible. Following this, we'll start assessing our data.

In [85]:
reg_df = pd.read_csv(relativepath, sep=";", index_col='AnimalID')

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:

In [86]:
#FILL IN - Inspecting the dataframe visually
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [87]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
17824,2025-07-14T00:00:00+00:00,Shih Tzu schnauzer mix,Grey w/ white,Nelly,F/S,Lost
17825,2025-07-15T00:00:00+00:00,Cat - DMH - Black,Black with white patch on ches,George,M/N,Matched
17826,2025-07-15T00:00:00+00:00,Cat - DSH - Bengal,Creamy White/brown spots,Unknown,F/S,Lost
17827,2025-07-16T00:00:00+00:00,Aussie Doodle,black and White,Pearl,F/S,Lost
17828,2025-07-16T00:00:00+00:00,Tabby,Brown,Puss,F/S,Lost


In [88]:
#FILL IN - Inspecting the dataframe programmatically
lf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17829 entries, 0 to 17828
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    17829 non-null  object
 1   breed   17781 non-null  object
 2   color   17727 non-null  object
 3   name    16060 non-null  object
 4   sex     17439 non-null  object
 5   state   17820 non-null  object
dtypes: object(6)
memory usage: 835.9+ KB


In [89]:
print("Duplicate rows:", lf_df.duplicated().any())
print(lf_df.date.max())
print(lf_df.date.min())

Duplicate rows: False
2025-07-16T00:00:00+00:00
1999-01-03T00:00:00+00:00


In [90]:
print("The date column is the following datatype:", type(lf_df['date'][0]))
num_colors_and = lf_df[lf_df['color'].str.contains(r" and ", na=False)].shape[0]
num_colors_amp = lf_df[lf_df['color'].str.contains(r"&", na=False)].shape[0]

print(num_colors_amp, "contain '&' in the color description while", num_colors_and, "contain 'and'.")

The date column is the following datatype: <class 'str'>
5503 contain '&' in the color description while 424 contain 'and'.


In [91]:
print("There are", lf_df['color'].unique().shape[0], "unique strings in the color column.")

There are 3395 unique strings in the color column.


Issue and justification: Aside from the lack of total completeness, which is understandable, several columns do not have a consistent format:
- "&" is used far more often than "and" in the `color` column (5503 rows versus 424), and there are 3395 unique color strings overall. This matters when trying to answer our question, as matching on `color` will require fuzzy matching.
- The `date` column is stored as strings rather than as datetimes.
- The formatting in the `breed` column is also inconsistent, but this is addressed further in the tidiness section.
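
To illustrate the kind of normalization and fuzzy matching this implies, here is a minimal sketch using only the standard library. The helper names `normalize_color` and `color_similarity` are hypothetical, and `difflib.SequenceMatcher` is just one possible similarity measure, not necessarily the one the final analysis will use.

```python
import re
from difflib import SequenceMatcher

def normalize_color(color):
    """Lowercase a color description and unify '&' / 'and' separators."""
    if not isinstance(color, str):
        return None  # missing values are passed through as None
    text = color.strip().lower()
    text = re.sub(r"\s*&\s*", " and ", text)  # "black & tan" -> "black and tan"
    text = re.sub(r"\s+", " ", text)          # collapse repeated whitespace
    return text.strip()

def color_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two color descriptions.

    Both arguments must be strings; callers should skip missing values.
    """
    return SequenceMatcher(None, normalize_color(a), normalize_color(b)).ratio()

print(normalize_color("Black & Tan"))
print(color_similarity("Black & White", "black and White"))
print(color_similarity("Black & Tan", "Brown"))
```

After normalization, descriptions that differ only in case or in the separator compare as identical, while unrelated descriptions score lower.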

### Quality Issue 2:

In [92]:
#FILL IN - Inspecting the dataframe visually
reg_df.head()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
34304,"Rabbit, Domestic Tortoiseshell",,F/S,,2023-07-14,,Mulberry (new name),Bunny Lane,,Brown and Tan,,,Puppy,BROUGHT-IN,Viewable,
34731,Mastiff X,,M,DAL 24-216813,2024-04-16,,Boss,48,,Tan,,,Senior,SEIZED,Seized,11.0
35338,Corgi/Pitbull X,2025-06-07,F,,2025-04-23,,Shorty,6,,Black & White,,20.4 kg,Adult,SEIZED,Adoptable,14.0
35342,Corgi/Pitbull X,,F,Foster DL 25-225415,2025-04-23,,Angelica (new name),100,,Tan w/ white belly,,,Puppy,SEIZED,Fostered,14.0
35404,Guinea Pig,,M,,2025-05-20,,Henry (new name),100,,Beige/Brown,,,Adult,VPD IMPOUND,Fostered,


In [93]:
reg_df['DateImpounded'].sort_values()

AnimalID
7060     1998-10-03
745      1998-11-30
406      1998-12-04
134      1998-12-10
135      1998-12-19
            ...    
35511    2025-07-14
35514    2025-07-16
35515    2025-07-16
35519    2025-07-17
35517    2025-07-17
Name: DateImpounded, Length: 26058, dtype: object

In [94]:
reg_df.duplicated().value_counts()

False    26017
True        41
Name: count, dtype: int64

In [95]:
reg_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26058 entries, 34304 to 8760
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Breed            26050 non-null  object 
 1   ShotsDate        3752 non-null   object 
 2   Sex              25771 non-null  object 
 3   ReceiptNumber    21668 non-null  object 
 4   DateImpounded    26058 non-null  object 
 5   PitNumber        17763 non-null  float64
 6   Name             23310 non-null  object 
 7   KennelNumber     26037 non-null  object 
 8   DispositionDate  12297 non-null  object 
 9   Color            26013 non-null  object 
 10  Code             1173 non-null   object 
 11  ApproxWeight     11643 non-null  object 
 12  Age category     8017 non-null   object 
 13  Source           24084 non-null  object 
 14  Status           26049 non-null  object 
 15  ACO              21300 non-null  object 
dtypes: float64(1), object(15)
memory usage: 3.4+ MB


In [96]:
reg_df['Age category'].value_counts()

Age category
Adult          4906
Young Adult    1538
Senior         1131
Puppy           442
Name: count, dtype: int64

Issue and justification: Immediately, we can see that there are a lot of null values, especially in the `Code`, `ShotsDate`, and `Age category` columns. None of those is more than about 31% populated (roughly 5%, 14%, and 31% respectively). These columns would be of little help in answering our question anyway.

The best columns for making inferences would probably be `name`, `color`, `sex`, and `breed`, so we can probably just drop the columns that are mostly null. For `name` and the other columns previously mentioned, we can replace the null values with something like "Unknown".

All of these null values make correlating animals based on similar factors less reliable, especially if something critical like the name isn't disclosed, or the color is not reported the way the intake staff would describe it. Names are also likely the strongest differentiator between two animals with similar breeds and coat colors.

There are also duplicate rows (41 of them), but those will be easy to handle. The biggest issue, I feel, is the sheer abundance of missing values.
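
As a rough sketch of how these steps could look in pandas, the example below works on a small made-up frame standing in for `reg_df`. The 0.5 threshold for "mostly null" and the names `toy`, `filled_share`, `keep_cols`, and `cleaned` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for reg_df; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Name": ["Taz", None, "Boss", "Boss"],
        "Color": ["Tan", "Black", None, None],
        "Code": [None, None, None, None],
    }
)

# Share of non-null values per column
filled_share = toy.notna().mean()

# Keep columns that are at least half populated (threshold is an assumption)
keep_cols = filled_share[filled_share >= 0.5].index

cleaned = (
    toy[keep_cols]
    .fillna("Unknown")   # placeholder for missing names/colors
    .drop_duplicates()   # remove fully identical rows
)

print(filled_share)
print(cleaned)
```

Filling with "Unknown" before dropping duplicates means two rows that differ only in which values were missing are treated as the same record, which may or may not be desirable for the real data.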

### Tidiness Issue 1:

In [97]:
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [98]:
lf_df.columns  # are the columns values instead of variable names? 

Index(['date', 'breed', 'color', 'name', 'sex', 'state'], dtype='object')

In [99]:
print("There are", lf_df.breed.unique().shape[0], "unique breeds in the dataset.")

breed_s = pd.Series(lf_df.breed.unique()).sort_values(na_position='first', ignore_index=True)
breed_s.head(10)

There are 4039 unique breeds in the dataset.


0                            None
1          (Miniature) Pomeranian
2              .Unknown Breed Mix
3      1 Pit Bull & 1 Terrier mix
4    1 Pitbull & 1 Bernese/Poodle
5             1/2 Pit & 1/2 Presa
6                      2 Dachunds
7                2 Great Pyrenese
8                       2 Husky X
9                 2 Labs & 1 Shep
dtype: object

In [100]:
breed_s.tail(10)

4029    stratfordshire terrier
4030                     tabby
4031    tabby Grey black brown
4032       tabby short haireds
4033               terrier (?)
4034                 terrier X
4035             terrier X pug
4036      very small ( 5 lbs )
4037             westy terrier
4038              yorkie cross
dtype: object

In [101]:
breed_s.str.extractall(r"(.*\d.*)")  # raw string avoids an invalid-escape warning for \d

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
3,0,1 Pit Bull & 1 Terrier mix
4,0,1 Pitbull & 1 Bernese/Poodle
5,0,1/2 Pit & 1/2 Presa
6,0,2 Dachunds
7,0,2 Great Pyrenese
...,...,...
3638,0,Staff Terrier #2
3660,0,Stafforshire Terrier 14 months
3971,0,lab and 2nd dog German Sheper
4027,0,small dog Benji type (10lbs)


Issue and justification: It seems the `breed` column is being used to track reports for multiple animals (e.g., `1 Pit Bull & 1 Terrier mix`). A tidier way to do this would be to split those out into separate rows, one per animal. If the animals need to stay associated with the same report, a report ID shared by all applicable rows could be added; however, I will not need that information to answer my question.

Having multiple animals in one row not only misrepresents the number of animals reported missing, but also makes grouping or filtering by the reported breed unreliable.
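
One possible way to separate such entries is sketched below in plain Python. The function name `split_counted_breeds` and the `<count> <breed>` pattern are assumptions based on the sample values shown above, so real data will likely need more cases. Entries such as `1/2 Pit & 1/2 Presa`, which appear to describe a single mixed-breed animal, are deliberately left unchanged.

```python
import re

# A part must look like "<count> <breed>", e.g. "2 Labs"
_COUNTED = re.compile(r"^(\d+)\s+(.+)$")

def split_counted_breeds(breed):
    """Expand entries like '2 Labs & 1 Shep' into one breed string per animal.

    If any '&'-separated part does not follow the '<count> <breed>' pattern,
    the original value is returned unchanged as a one-element list.
    """
    expanded = []
    for part in (p.strip() for p in breed.split("&")):
        match = _COUNTED.match(part)
        if match is None:
            return [breed]
        count, name = int(match.group(1)), match.group(2).strip()
        expanded.extend([name] * count)
    return expanded or [breed]  # guard against a count of zero

print(split_counted_breeds("1 Pit Bull & 1 Terrier mix"))
print(split_counted_breeds("2 Labs & 1 Shep"))
print(split_counted_breeds("1/2 Pit & 1/2 Presa"))
```

Applied to the non-null values of the `breed` column, the resulting lists could then be passed to `Series.explode` to get one row per animal. Plural forms such as `Labs` are kept as written.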

### Tidiness Issue 2: 

In [119]:
#FILL IN - Inspecting the dataframe visually
reg_df.head(10)

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
34304,"Rabbit, Domestic Tortoiseshell",,F/S,,2023-07-14,,Mulberry (new name),Bunny Lane,,Brown and Tan,,,Puppy,BROUGHT-IN,Viewable,
34731,Mastiff X,,M,DAL 24-216813,2024-04-16,,Boss,48,,Tan,,,Senior,SEIZED,Seized,11
35338,Corgi/Pitbull X,2025-06-07,F,,2025-04-23,,Shorty,6,,Black & White,,20.4 kg,Adult,SEIZED,Adoptable,14
35342,Corgi/Pitbull X,,F,Foster DL 25-225415,2025-04-23,,Angelica (new name),100,,Tan w/ white belly,,,Puppy,SEIZED,Fostered,14
35404,Guinea Pig,,M,,2025-05-20,,Henry (new name),100,,Beige/Brown,,,Adult,VPD IMPOUND,Fostered,
35407,French Bulldog,,F/S,DA - 25 249924,2025-05-22,,Pippin (new name),100,,Brown & Cream Brindle,,,Adult,HOLDING STRAY,Fostered,11
35408,Shepherd/Rottweiller x,,F,,2025-05-23,,Jelly,100,,"Brown, black w/ tan",,26.0 kg,Young Adult,BROUGHT-IN,Fostered,
35409,Dane/Mastiff X,,M,,2025-05-23,,Dane (new name),100,,Black,,,Adult,HOLDING STRAY,Fostered,ACO42
35444,"Rabbit, Harlequin",,M,,2025-06-11,,Pipkin (new name),ISO,,"White, grey & black",,,Young Adult,VPD IMPOUND,Viewable,
35474,Husky/Lab X,,F,,2025-06-29,,Sapphire (new name),100,,Cream w/Black,,40kg,Adult,HOLDING STRAY,Fostered,40


In [126]:
reg_df.sort_index().head()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Pit Bull,2005-06-18,M/N,17057 ks,2005-06-12,20038.0,Taz,200,,Tan,,45,,HOLDING STRAY,Sold,3.0
2,English Setter,,M/N,20372 BC3,2006-03-19,20041.0,Dudley,200,,White & Brown,,100lbs,,HOLDING STRAY,Redeemed,
3,Lab X,,M/N,N/C BC,2006-07-24,40041.0,Evander,400,,Black,,,,HOLDING STRAY,Ride Home Free,5.0
4,Pomeranian,,M,"22171,MC",2006-09-08,20017.0,Chewbaca,200,,Brown,,10 lbs.,,HOLDING STRAY,Redeemed,14.0
5,American Bulldog X,2006-10-07,M/N,22565 - skj,2006-09-06,20053.0,Cadillac,200,,White with Brown Patch on Eye,,80 lbs,,HOLDING STRAY,Sold,22.0


In [103]:
reg_df.columns

Index(['Breed', 'ShotsDate', 'Sex', 'ReceiptNumber', 'DateImpounded',
       'PitNumber', 'Name', 'KennelNumber', 'DispositionDate', 'Color', 'Code',
       'ApproxWeight', 'Age category', 'Source', 'Status', 'ACO'],
      dtype='object')

In [129]:
reg_df['Breed'].str.extractall("(.*[lL]ab.*)").value_counts()

0                     
Lab X                     872
Lab                       571
Labrador                  495
Labrador X                 78
Shep X Lab                 78
                         ... 
Black Lab X Pit             1
Black Lab/GSD               1
Black X Lab                 1
Heeler X Lab X Bassett      1
rotty X lab                 1
Name: count, Length: 490, dtype: int64

Issue and justification: There are a few places where some improvements can be made. 
- `Name` - Some rows contain "`(new name)`", which tracks two pieces of information in a single column.
  - This could be resolved with another column of boolean values, but I do not think it will be necessary for our purposes. Therefore, we will remove occurrences of `(new name)`.
- `Breed` - Some breeds have "`mix`" while others have "`X`". Presumably, `X` also stands for "mix," but this is merely an assumption. Additionally, `Black Lab` is not a breed; the breed is Labrador Retriever.
  - One way to handle this is to check whether the `Breed` field contains the `Color` of the animal, and trim it out if it does.
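
A small sketch of two of these clean-ups, using only the standard library, is shown below. The helper names `split_new_name` and `normalize_cross` are hypothetical. The second helper only rewrites a trailing `X` or `cross`, and it relies on the unverified assumption that `X` means "mix".

```python
import re

_NEW_NAME = re.compile(r"\s*\(new name\)\s*", flags=re.IGNORECASE)
_TRAILING_CROSS = re.compile(r"\s+(?:x|cross)\s*$", flags=re.IGNORECASE)

def split_new_name(name):
    """Return (clean_name, was_renamed) for a register Name value."""
    was_renamed = bool(_NEW_NAME.search(name))
    clean_name = _NEW_NAME.sub(" ", name).strip()
    return clean_name, was_renamed

def normalize_cross(breed):
    """Rewrite a trailing 'X' or 'cross' as 'mix' (assumes X means mix)."""
    return _TRAILING_CROSS.sub(" mix", breed)

print(split_new_name("Mulberry (new name)"))
print(split_new_name("Boss"))
print(normalize_cross("Corgi/Pitbull X"))
print(normalize_cross("Shep X Lab"))
```

Returning the flag alongside the cleaned name keeps the option of storing it in a boolean column later, even if it is dropped for this analysis. A value like `Shep X Lab` is left alone because the `X` there joins two breeds rather than marking an unspecified mix.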

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After the cleaning for each issue, please use **either** the visual or programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [None]:
# Make copies of the datasets to ensure the raw dataframes are not impacted
lf_df_clean = lf_df.copy()
reg_df_clean = reg_df.copy()

### **Quality Issue 1: FILL IN**

In [None]:
# FILL IN - Apply the cleaning strategy


In [107]:
# FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Quality Issue 2: FILL IN**

In [108]:
#FILL IN - Apply the cleaning strategy

In [109]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [110]:
#FILL IN - Apply the cleaning strategy

In [111]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [112]:
#FILL IN - Apply the cleaning strategy

In [113]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [114]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store

In [115]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [None]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN