# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess it programmatically and visually across elements of data quality and structure, and implement a cleaning strategy. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions about the methods used for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

## 1. Gather data

In this section, you will extract data using at least two different types of data-gathering methods and combine the data.

### **1.1.** Problem Statement
In 2-4 sentences, explain the kind of problem you want to look at and the datasets you will be wrangling for this project.

*The City of Vancouver tracks not only every animal that comes into its shelters, but also those reported as lost by their owners. While the city does track animals that are matched back to their owners, could an animal still recorded as lost already be accounted for? If so, is there a reasonably reliable way to match animals based on the details entered in a lost report?*

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### City of Vancouver Animal Control Inventory - Lost and Found

This dataset contains reports filed with the City of Vancouver by owners whose animals went missing. It also tracks animals that were either reported as found or matched by the shelter back to their owner.

I chose this dataset because it shows which animals were reported as lost within the city. It does not cover every animal that was lost, but it does provide a large sample for this metro area.

Further information on the dataset can be found [here](https://opendata.vancouver.ca/explore/dataset/animal-control-inventory-lost-and-found/information).

Type: JSON

Method: This data was gathered by querying the City of Vancouver's database with the standard Opendatasoft API. I am doing it this way because the data is updated daily, and this guarantees that the most up-to-date information will be used.

Dataset variables:

- *breed* - type of animal or breed that fits best.
- *color* - color of the animal's coat/fur.
- *date* - date that the animal was lost
- *name* - the given name of the animal being tracked (if known).
- *sex* - used to label the biological sex of the animal, as well as if they are spayed or neutered (marked with `F/S` or `M/N` accordingly). `X` = unknown
- *state* - the last state of being for the animal, i.e. `matched` or `lost`.

After some poking around, I found out that without a `group by` statement, the server only returns 100 results. Grouping by all of the fields gets around that cap and, in theory, also collapses fully duplicated rows. I'm also filtering out anything before `2000-11-17`, as this is the earliest timestamp in the other dataset that will be used.

In [2]:
import requests
import pandas as pd
import json
import datetime

# "lf" is for *l*ost and *f*ound
lf_api_query = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-lost-and-found/records?where=date%20%3E%20%222000-11-16%22&group_by=date%2C%20breed%2C%20color%2C%20name%2C%20sex%2C%20state&order_by=date&limit=-1"
lf_data = requests.get(lf_api_query)
lf_data.raise_for_status()
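
The percent-encoded query string above can be hard to read. As a minimal sketch, the same URL could be assembled from a plain dictionary with the standard library. The names `LF_BASE_URL`, `lf_params`, and `lf_api_query_built` are illustrative and not part of the original notebook; the endpoint and parameter values are copied from the query above.

```python
from urllib.parse import quote, urlencode

# Illustrative names; endpoint and values are copied from the original query.
LF_BASE_URL = (
    "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/"
    "animal-control-inventory-lost-and-found/records"
)
lf_params = {
    "where": 'date > "2000-11-16"',
    "group_by": "date, breed, color, name, sex, state",
    "order_by": "date",
    "limit": "-1",
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+").
lf_query_string = urlencode(lf_params, quote_via=quote)
lf_api_query_built = f"{LF_BASE_URL}?{lf_query_string}"
print(lf_api_query_built)
```

Keeping the filter and grouping fields as readable strings makes it easier to adjust the cutoff date or the grouped columns later.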

In [3]:
lf_json = lf_data.json()

print(lf_json.keys())
print(lf_json['total_count'])

lf_json['results'][0:3]
print(type(lf_json['results']))     # all elements
print(type(lf_json['results'][0]))  # individual element

dict_keys(['total_count', 'results'])
15720
<class 'list'>
<class 'dict'>


Based on our mild digging above, we want to load specifically the data under the `results` key into a pandas DataFrame. `results` is simply a `list` of `dict`s, which the `pandas.DataFrame` constructor can handle directly.

In [4]:
lf_df = pd.DataFrame(lf_json['results'])
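
Because the server caps ungrouped queries at 100 rows, it may be worth confirming that nothing was truncated before relying on the frame. The sketch below compares the number of returned records with the reported `total_count`. The helper name `is_complete` and the toy payloads are hypothetical; only the `total_count` and `results` keys come from the response shown above.

```python
def is_complete(payload: dict) -> bool:
    """Return True when the payload holds as many records as it reports."""
    return len(payload["results"]) == payload["total_count"]


# Toy payload shaped like the API response (two records, both present).
toy_payload = {
    "total_count": 2,
    "results": [
        {"date": "2000-11-17T00:00:00+00:00", "breed": "Dashound", "state": "Found"},
        {"date": "2000-11-17T00:00:00+00:00", "breed": "Siberian Husky X", "state": "Lost"},
    ],
}
print(is_complete(toy_payload))

# A truncated payload reports more records than it actually carries.
truncated_payload = {"total_count": 150, "results": toy_payload["results"]}
print(is_complete(truncated_payload))
```

On the live response, the equivalent check would be `is_complete(lf_json)`.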

#### City of Vancouver Animal Control Inventory - Register

This dataset is a "general record of each animal that has come into the custody" of the City of Vancouver's animal control service.

I chose this dataset as a reference against which to compare the lost and found reports, in case an animal reported as lost (or one very much like it) has already been processed into the City of Vancouver's database.

As with the lost and found dataset, this data is updated daily. Because I must use a different gathering method for this dataset, I will download it programmatically, and in CSV format, just to make sure I cover all bases for this project.

Type: Semicolon (`;`) delimited "CSV" file.

Method: Programmatic download via HTTP GET request

Dataset variables:

- *AnimalID* - Unique sequential number given to each entry.
- *Breed* - Type of animal.
- *ShotsDate* - Date when vaccinated.
- *Sex* - M = Male, F = Female, M/N = Male Neutered, F/S = Female Spayed.
- *ReceiptNumber* - Point of sales system of record receipt number.
- *DateImpounded* - Date first in custody of the City of Vancouver.
- *PitNumber* - Number identifying animal kennel, does not change while in custody of the city.
- *Name* - Name if known.
- *KennelNumber* - Kennel number displayed at the top of each kennel.
- *DispositionDate* - Date when animal was no longer under the control of the city.
- *Color* - Color of coat.
- *Code* - Walk-ability index (*Green = easy, Yellow = moderate, Blue = hard*).
- *ApproxWeight* - Approximate weight of animal.
- *Age category* - Rough estimate of age - puppy, young adult, adult, senior.
- *Source* - Where the animal came from (Brought-in, Holding stray, Transferred).
- *Status* - Current state/disposition of animal.
- *ACO* - Animal control officer number or initials of employee.

In [5]:
# "reg" is for registry
reg_url = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-register/exports/csv?lang=en&timezone=America%2FChicago&use_labels=true&delimiter=%3B"
reg_data = requests.get(reg_url)
reg_data.raise_for_status()

In [6]:
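import os

file_dl_date = datetime.date.today()
filename = f"vancouver-ac-registry_{file_dl_date.strftime('%Y%m%d')}.csv"
relativepath = "./datasets/" + filename

# Create the target folder if it is missing so open() does not fail
os.makedirs("./datasets", exist_ok=True)

with open(relativepath, mode="wb") as f:
    f.write(reg_data.content)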

We'll make the `AnimalID` column our index column for our DataFrame as that seems most sensible. Following this, we'll start assessing our data.

In [7]:
reg_df = pd.read_csv(relativepath, sep=";", index_col='AnimalID')
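
As a small illustration of what `sep` and `index_col` do here, the sketch below parses a few semicolon-delimited lines from an in-memory string instead of a file. The sample is cut down to three columns, with values borrowed from the first rows of the register; `sample_csv` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Three columns only; values borrowed from the first rows of the register.
sample_csv = "AnimalID;Breed;Name\n74;Australian Terrier;Holden\n76;Shepherd X;Fenris\n"

# sep=";" splits on semicolons; index_col promotes AnimalID to the row index.
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";", index_col="AnimalID")
print(sample_df.loc[76, "Name"])
```

With `AnimalID` as the index, individual animals can be looked up by ID with `.loc`.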

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:

In [8]:
#FILL IN - Inspecting the dataframe visually
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,2000-11-17T00:00:00+00:00,Dashound,Brown,,M,Found
1,2000-11-17T00:00:00+00:00,Siberian Husky X,Cream & White,Sequoia,F,Lost
2,2000-11-17T00:00:00+00:00,Terrier X Span,Black & White,Sara,F/S,Lost
3,2000-11-18T00:00:00+00:00,Akita,Brownish Blonde,Ty,F/S,Lost
4,2000-11-18T00:00:00+00:00,Chihuahua,Brownish,Piojo,M,Matched


In [9]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
15715,2025-07-14T00:00:00+00:00,Shih Tzu schnauzer mix,Grey w/ white,Nelly,F/S,Lost
15716,2025-07-15T00:00:00+00:00,Cat - DMH - Black,Black with white patch on ches,George,M/N,Matched
15717,2025-07-15T00:00:00+00:00,Cat - DSH - Bengal,Creamy White/brown spots,Unknown,F/S,Lost
15718,2025-07-16T00:00:00+00:00,Aussie Doodle,black and White,Pearl,F/S,Lost
15719,2025-07-16T00:00:00+00:00,Tabby,Brown,Puss,F/S,Lost


In [10]:
#FILL IN - Inspecting the dataframe programmatically
lf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15720 entries, 0 to 15719
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    15720 non-null  object
 1   breed   15672 non-null  object
 2   color   15624 non-null  object
 3   name    14571 non-null  object
 4   sex     15391 non-null  object
 5   state   15711 non-null  object
dtypes: object(6)
memory usage: 737.0+ KB


In [11]:
print(lf_df.duplicated().any())  # check for duplicate rows
print(lf_df.date.max())
print(lf_df.date.min())

False
2025-07-16T00:00:00+00:00
2000-11-17T00:00:00+00:00


Issue and justification: *FILL IN*

### Quality Issue 2:

In [12]:
#FILL IN - Inspecting the dataframe visually
reg_df.head()

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
74,Australian Terrier,,F,"23404,MC",2007-02-10,20053.0,Holden,200,,Black & Tan,,5 lbs,,HOLDING STRAY,Redeemed,
76,Shepherd X,,M/N,23407-ARD,2007-02-10,20041.0,Fenris,200,,Black & White,,50 lbs,,HOLDING STRAY,Redeemed,
79,Lab X,2007-02-13,M/N,N/C,2007-02-10,20024.0,Paxton,200,,Black & White,,40 lbs,,BROUGHT-IN,Sold,4.0
80,Collie X GSD,,F/S,"23409,MC",2007-02-10,20012.0,Lulu,200,,Black & Tan,,45 lbs,,HOLDING STRAY,Redeemed,20.0
81,German Shepherd,,F/S,"23411,DK",2007-02-11,20094.0,Bella,200,,Black & Tan,,60lbs,,HOLDING STRAY,Redeemed,5.0


In [13]:
reg_df['DispositionDate'].sort_values()

AnimalID
21467    2000-11-17
14807    2009-10-19
20002    2009-10-19
14802    2009-10-19
14800    2009-10-19
            ...    
9340            NaN
9341            NaN
9343            NaN
9345            NaN
9348            NaN
Name: DispositionDate, Length: 26056, dtype: object

In [14]:
reg_df.duplicated().value_counts()

False    26015
True        41
Name: count, dtype: int64
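
One caveat when reading this count: `DataFrame.duplicated` compares column values only and ignores the index. Since `AnimalID` is the index here, two rows with different IDs but otherwise identical fields are flagged as duplicates. The sketch below shows the difference on a toy frame; the rows and the name `toy_reg` are made up for illustration.

```python
import pandas as pd

# Made-up rows: animals 101 and 102 share every column value but not the ID.
toy_reg = pd.DataFrame(
    {"Breed": ["Lab X", "Lab X", "Akita"], "Name": ["Max", "Max", "Ty"]},
    index=pd.Index([101, 102, 103], name="AnimalID"),
)

# Index ignored: the second "Lab X / Max" row counts as a duplicate.
dupes_ignoring_id = int(toy_reg.duplicated().sum())

# Index included as a regular column: no row is a full duplicate.
dupes_including_id = int(toy_reg.reset_index().duplicated().sum())

print(dupes_ignoring_id, dupes_including_id)
```

Applied to the register, `reg_df.reset_index().duplicated().sum()` would show how many of the flagged rows also repeat an `AnimalID`.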

In [15]:
#FILL IN - Inspecting the dataframe programmatically
reg_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26056 entries, 74 to 9348
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Breed            26048 non-null  object 
 1   ShotsDate        3752 non-null   object 
 2   Sex              25769 non-null  object 
 3   ReceiptNumber    21668 non-null  object 
 4   DateImpounded    26056 non-null  object 
 5   PitNumber        17763 non-null  float64
 6   Name             23310 non-null  object 
 7   KennelNumber     26035 non-null  object 
 8   DispositionDate  12297 non-null  object 
 9   Color            26011 non-null  object 
 10  Code             1173 non-null   object 
 11  ApproxWeight     11643 non-null  object 
 12  Age category     8015 non-null   object 
 13  Source           24082 non-null  object 
 14  Status           26047 non-null  object 
 15  ACO              21299 non-null  object 
dtypes: float64(1), object(15)
memory usage: 3.4+ MB


In [16]:
reg_df['Age category'].value_counts()

Age category
Adult          4905
Young Adult    1538
Senior         1131
Puppy           441
Name: count, dtype: int64

Issue and justification: Immediately, we can see that there are a lot of null values, especially in the `Code`, `ShotsDate`, and `Age category` columns. None of them has more than about 31% of its values filled. These sparse columns are unlikely to help answer our question in any case; the most useful columns for matching will probably be `Name`, `Color`, and `Breed` (`name`, `color`, and `breed` in the lost and found dataset).
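
The fill rates behind this assessment can be recomputed from the non-null counts reported by `reg_df.info()`. The short sketch below does the arithmetic by hand; on the live frame, `reg_df.notna().mean()` should give the same proportions. The variable names are illustrative.

```python
# Non-null counts copied from the reg_df.info() output above.
total_rows = 26056
non_null_counts = {"Code": 1173, "ShotsDate": 3752, "Age category": 8015}

# Share of filled values per column, as a percentage rounded to one decimal.
fill_rates = {col: round(100 * n / total_rows, 1) for col, n in non_null_counts.items()}
print(fill_rates)
```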

### Tidiness Issue 1:

In [17]:
#FILL IN - Inspecting the dataframe visually
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,2000-11-17T00:00:00+00:00,Dashound,Brown,,M,Found
1,2000-11-17T00:00:00+00:00,Siberian Husky X,Cream & White,Sequoia,F,Lost
2,2000-11-17T00:00:00+00:00,Terrier X Span,Black & White,Sara,F/S,Lost
3,2000-11-18T00:00:00+00:00,Akita,Brownish Blonde,Ty,F/S,Lost
4,2000-11-18T00:00:00+00:00,Chihuahua,Brownish,Piojo,M,Matched


In [18]:
print("The date column is the following datatype:", type(lf_df['date'][0]))
num_colors_and = lf_df[lf_df['color'].str.contains(r" and ", na=False)].shape[0]
num_colors_amp = lf_df[lf_df['color'].str.contains(r"&", na=False)].shape[0]

print(num_colors_amp, "contain '&' in the color description while", num_colors_and, "contain 'and'.")

The date column is the following datatype: <class 'str'>
4481 contain '&' in the color description while 418 contain 'and'.
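
Both observations point to likely cleaning steps: parsing the `date` strings into real timestamps and settling on one separator for multi-color descriptions. A minimal sketch follows, assuming `&` is chosen as the canonical separator. That choice and the name `toy_lf` are assumptions of this sketch, not something the dataset prescribes; the sample values are taken from rows shown earlier.

```python
import pandas as pd

toy_lf = pd.DataFrame(
    {
        "date": ["2000-11-17T00:00:00+00:00", "2025-07-16T00:00:00+00:00"],
        "color": ["Cream & White", "black and White"],
    }
)

# Parse the ISO 8601 strings into timezone-aware timestamps.
toy_lf["date"] = pd.to_datetime(toy_lf["date"], utc=True)

# Replace a standalone "and" (any case) with "&".
toy_lf["color"] = toy_lf["color"].str.replace(r"\s+and\s+", " & ", case=False, regex=True)

print(toy_lf.dtypes["date"])
print(toy_lf["color"].tolist())
```

Requiring whitespace on both sides of `and` avoids touching words that merely contain those letters, such as "Sandy".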


In [40]:
print("There are", lf_df.breed.unique().shape[0], "unique breeds in the dataset.")

breed_s = pd.Series(lf_df.breed.unique()).sort_values(na_position='first', ignore_index=True)
breed_s.head(10)

There are 3631 unique breeds in the dataset.


0                            None
1          (Miniature) Pomeranian
2              .Unknown Breed Mix
3      1 Pit Bull & 1 Terrier mix
4    1 Pitbull & 1 Bernese/Poodle
5             1/2 Pit & 1/2 Presa
6                2 Great Pyrenese
7                       2 Husky X
8                  2 Rotty (pups)
9                         2 Sheps
dtype: object

In [41]:
breed_s.tail(10)

3621    small dog Benji type (10lbs)
3622        small dog-med length fur
3623                           tabby
3624          tabby Grey black brown
3625             tabby short haireds
3626                       terrier X
3627                   terrier X pug
3628            very small ( 5 lbs )
3629                   westy terrier
3630                    yorkie cross
dtype: object

In [20]:
print("There are", lf_df['color'].unique().shape[0], "unique strings in the color column.")

There are 3209 unique strings in the color column.
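
Part of the variety in these free-text columns comes from inconsistent casing and spacing rather than genuinely different values (for example, both `Tabby` and `tabby` appear as breeds above). As a rough sketch, collapsing whitespace and case-folding before counting shows how much of the count is formatting noise. `normalize_label` is a hypothetical helper.

```python
import re


def normalize_label(text: str) -> str:
    """Trim, collapse runs of whitespace, and case-fold a free-text label."""
    return re.sub(r"\s+", " ", text).strip().casefold()


# Values seen in the outputs above, plus one spacing variant for illustration.
raw_labels = ["Tabby", "tabby", "terrier X", "Terrier  X ", "Akita"]
unique_raw = len(set(raw_labels))
unique_normalized = len({normalize_label(label) for label in raw_labels})
print(unique_raw, unique_normalized)
```

This only removes superficial differences; misspellings and free-form descriptions would still need separate handling.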


Issue and justification: *FILL IN*

### Tidiness Issue 2: 

In [21]:
#FILL IN - Inspecting the dataframe visually

In [22]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After cleaning each issue, please use **either** the visual or the programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [23]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes 
# are not impacted

### **Quality Issue 1: FILL IN**

In [24]:
# FILL IN - Apply the cleaning strategy

In [25]:
# FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Quality Issue 2: FILL IN**

In [26]:
#FILL IN - Apply the cleaning strategy

In [27]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [28]:
#FILL IN - Apply the cleaning strategy

In [29]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [30]:
#FILL IN - Apply the cleaning strategy

In [31]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [32]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data are saved to your database/data store
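
One possible way to meet these naming requirements is to encode the dataset name, processing stage, and download date in each filename, mirroring the pattern already used for the register download. This is only a sketch: the helper `versioned_path`, the stage labels `raw` and `cleaned`, and the fixed example date are assumptions.

```python
import datetime
from pathlib import Path


def versioned_path(name: str, stage: str, day: datetime.date, folder: str = "datasets") -> Path:
    """Build a path such as datasets/<name>_<stage>_<YYYYMMDD>.csv."""
    return Path(folder) / f"{name}_{stage}_{day.strftime('%Y%m%d')}.csv"


example_day = datetime.date(2025, 7, 16)  # fixed date so the example is reproducible
raw_path = versioned_path("vancouver-ac-registry", "raw", example_day)
cleaned_path = versioned_path("vancouver-ac-registry", "cleaned", example_day)
print(raw_path.name)
print(cleaned_path.name)
```

Keeping the stage in the filename means the raw and cleaned versions can sit side by side without overwriting each other.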

In [33]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [34]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN

In [35]:
#Visual 2 - FILL IN

*Answer to research question:* FILL IN

### **5.2:** Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

*Answer:* FILL IN