# Real-world Data Wrangling

## 1. Gather data

### **1.1.** Problem Statement

 *The City of Vancouver tracks not only every animal that comes into its shelters, but also those that are reported as lost by their owners. While the city does track those that are matched back to their owner, is it possible that an animal still tracked as lost has possibly been accounted for? If so, is it possible to find somewhat reliable means to match animals based on data entered into a report for lost ones?*

### **1.2.** Gather at least two datasets using two different data gathering methods


#### City of Vancouver Animal Control Inventory - Lost and Found

This dataset is information from the City of Vancouver where an owner of an animal has reported them as lost. It also tracks those that were either reported as found or were matched by the shelter back to the owner. 

I chose this dataset because it will address what animals were reported as lost within the city. This does not cover every animal that was lost, however it does provide a large sample size for this metro area.

Further information on the dataset can be found [here](https://opendata.vancouver.ca/explore/dataset/animal-control-inventory-lost-and-found/information).

Type: JSON

Method: This data was gathered by querying the City of Vancouver's database with the standard Opendatasoft API. I am doing it this way because the data is updated daily, and this guarantees that the most up-to-date information will be used.

Dataset variables:

- *breed* - type of animal or breed that fits best.
- *color* - color of the animal's coat/fur.
- *date* - date that the animal was lost
- *name* - the given name of the animal being tracked (if known).
- *sex* - used to label the biological sex of the animal, as well as if they are spayed or neutered (marked with `F/S` or `M/N` accordingly). `X` = unknown
- *state* - the last state of being for the animal, i.e. `matched` or `lost`.

After some poking around, I found out that without a `group by` statement, the server only returns 100 results. By including a `group by` statement for all of the fields, this should theoretically drop duplicate values. I'm also filtering out anything before `1998-10-03`, as this is the earliest `DateImpounded` timestamp in the other dataset that will be used.

In [1]:
import requests
import pandas as pd
import datetime

# "lf" is for *l*ost and *f*ound
lf_api_query = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-lost-and-found/records?where=date%20%3E%20%221998-10-02%22&group_by=date%2C%20breed%2C%20color%2C%20name%2C%20sex%2C%20state&order_by=date&limit=-1"
lf_data = requests.get(lf_api_query)
lf_data.raise_for_status()

In [2]:
lf_json = lf_data.json()

print(lf_json.keys())
print(lf_json['total_count'])

lf_json['results'][0:3]
print(type(lf_json['results']))     # all elements
print(type(lf_json['results'][0]))  # individual element

dict_keys(['total_count', 'results'])
17867
<class 'list'>
<class 'dict'>


Based on our mild digging above, we would want to load specifically the data in the `results` key as a Pandas DataFrame, since `results` is simply a `list` of `dict`'s, which Pandas.DataFrame's constructor can handle.

In [3]:
lf_df = pd.DataFrame(lf_json['results'])

#### City of Vancouver Animal Control Inventory - Register

This dataset is a "general record of each animal that has come into the custody" the City of Vancouver's animal control service.

I chose this dataset to have a record to compare all of the lost and found animals to in the event an animal is reported as lost and the City of Vancouver happens to have them, or someone very much like them, already processed into their database.

Like with the lost and found dataset, this data is updated daily. Because I must choose a different method to pull this data, I will download it programatically, as well as in CSV format just to make sure I cover all bases for this project.

Type: Semicolon (`;`) delimited "CSV" file.

Method: Programatic download via HTTP GET request

Dataset variables:

- *AnimalID* - Unique sequential number given to each entry.
- *Breed* - Type of animal.
- *ShotsDate* - Date when vaccinated.
- *Sex* - M = Male, F = Female, M/N = Male Neutered, F/S = Female Spayed.
- *ReceiptNumber* - Point of sales system of record receipt number.
- *DateImpounded* - Date first in custody of the City of Vancouver.
- *PitNumber* - Number identifying animal kennel, does not change while in custody of the city.
- *Name* - Name if known.
- *KennelNumber* - Kennel number displayed at the top of each kennel.
- *DispositionDate* - Date when animal was no longer under the control of the city.
- *Color* - Color of coat.
- *Code* - Walk-ability index (*Green = easy, Yellow = moderate, Blue = hard*).
- *ApproxWeight* - Approximate weight of animal.
- *Age category* - Rough estimate of age - puppy, young adult, adult, senior.
- *Source* - Where the animal came from (Brought-in, Holding stray, Transferred).
- *Status* - Current state/disposition of animal.
- *ACO* - Animal control officer number or initials of employee.

In [4]:
# "reg" is for registry
reg_url = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-register/exports/csv?lang=en&timezone=America%2FChicago&use_labels=true&delimiter=%3B"
reg_data = requests.get(reg_url)
reg_data.raise_for_status()

In [5]:
# Save contents to a file labeled with today's date.
file_dl_date = datetime.date.today()
filename = f"vancouver-ac-registry_{file_dl_date.strftime('%Y%m%d')}.csv"
relativepath = "./datasets/" + filename

# Write dataset JSON stored as binary to target file
with open(relativepath, mode="wb") as f:
    f.write(reg_data.content)

We'll make the `AnimalID` column our index column for our DataFrame since it's essentially a built-in order for the animals. Following this, we'll start assessing our data.

In [6]:
reg_df = pd.read_csv(relativepath, sep=";", index_col='AnimalID')

## 2. Assess data

When we are assessing data, we are on the lookout for **quality** and **tidiness** (structural) issues.

**Quality Issues:**
- Completeness - The collected data is sufficient for addressing specific problems.
- Validity - Data conforms to the defined schema.
- Accuracy - Data accurately represents the reality it is describing.
- Consistency - A standard format is followed. Data matches that which can be found in other sources.
- Uniqueness - non-duplicate or overlapping values in the data.

**Tidiness Issues:**
- Each variable forms an individual column, i.e. color, name, birth year, etc. 
  - This also means that each column only contains one variable, or "factor that varies."
- Each observation forms an individual row, i.e. red, Scott, 1997, etc.
  - As with columns, only a single observation per row.
- Each type of observational unit forms a table, i.e. a table of immediate family members.

### Quality Issue 1:

We'll use the `.head()` and `.tail()` methods to quickly browse the dataset for anything that sticks out. This helps us ensure completeness to some degree, as well as validity and consistency.

In [7]:
# Inspecting the dataframe visually
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [8]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
17862,2025-08-01T00:00:00+00:00,German Shepherd Lab X,Black,Millie,F,Matched
17863,2025-08-01T00:00:00+00:00,"Pointer, Stafford Terrier",White with Red,Willow,F/S,Matched
17864,2025-08-02T00:00:00+00:00,Cat - DSH - Black & White,Black & White,Bean,F/S,Lost
17865,2025-08-02T00:00:00+00:00,GSD/Lab X,Black,Milly,F,Matched
17866,2025-08-03T00:00:00+00:00,Rottweiler,Black & Brown,Marks,M,Lost


In [9]:
lf_df.sample(5)  # random sampling to see if anything else jumps out

Unnamed: 0,date,breed,color,name,sex,state
6781,2005-07-07T00:00:00+00:00,Rotty,Black & Tan,Gus,M/N,Lost
536,1999-07-28T00:00:00+00:00,Pomeranian,Sable & White Tail,,M,Found
5262,2003-11-01T00:00:00+00:00,Bull Mastiff,Brindle,Vanna,F/S,Lost
7686,2006-07-24T00:00:00+00:00,Chihuahua mix,Tan,Hoochie Momma,F,Lost
7242,2006-01-10T00:00:00+00:00,Beagle,Tri Colour,Buddy,M/N,Lost


Thankfully, there are no duplicate entries in this data (as of 20250723), indicating each row has "uniqueness."

In [10]:
lf_df.duplicated().value_counts()

False    17867
Name: count, dtype: int64

There are a lot of colors in the `color` column, which indicates potential a lack of consistency and validity.

In [11]:
print("There are", lf_df['color'].unique().shape[0], "unique strings in the color column.")

There are 3410 unique strings in the color column.


One prime example is this hamster, which has `Golden/Blonde` fur. This indicates there is not a standardized process for determining and labeling fur color.

In [12]:
lf_df.iloc[16633]

date     2023-06-26T00:00:00+00:00
breed                      Hamster
color                Golden/Blonde
name                        Chivis
sex                              M
state                         Lost
Name: 16633, dtype: object

I am also curious about how many of those colors have consistency issues, for example using an ampersand instead of an "and". The use of slashes may also indicate `color` labeling similar to `Chivis`, as seen immediately above.

In [13]:
num_colors_amp = pd.Series(lf_df.color.unique()).str.contains(r'&', na=False).value_counts().iloc[1]
num_colors_and = pd.Series(lf_df.color.unique()).str.contains(r' and ', na=False).value_counts().iloc[1]
num_colors_fwdsl = pd.Series(lf_df.color.unique()).str.contains(r'/', na=False).value_counts().iloc[1]

print(num_colors_amp, "unique color strings contain '&' in the color description while", num_colors_and, "contain 'and'.")
print(num_colors_fwdsl, " unique colors contain a forward slash (/).")

1152 unique color strings contain '&' in the color description while 191 contain 'and'.
1346  unique colors contain a forward slash (/).


Issue and justification: Aside from lacking completeness, a lot of the columns do not have a consistent format:
Currently, "&" is primarily used in place of "and" in the `color` column. This matters when trying to answer our question as this means that matching based on `color` will different approaches (fuzzy matching, tokenization, etc.). 

As a side note, the formatting in the `breed` column is also quite inconsistent, but this is addressed further in the tidiness section.

### Quality Issue 2:

We'll do a head, tail, and sample again.

In [14]:
reg_df.head()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
6214,Border Collie X,,F,NC,2002-02-17,20035.0,Goggle,200,,Black & White,,,,BROUGHT-IN,Redeemed,17
6218,Labrador X,,F,"11830,JC",2002-03-05,20030.0,,200,,Tan,,,,HOLDING STRAY,Redeemed,6
6222,Rotty (with full tail),2000-06-07,M/N,,2000-05-28,40062.0,Harry Wong,400,,Black & Tan,,,,BROUGHT-IN,Transferred,0
6223,Keeshound X,2001-02-21,M/N,,2001-02-14,40089.0,Dexter,400,,Tri-Color,,,,COMPLAINT,Transferred,8
6224,Rotty,2001-12-12,M/N,11848 ar,2001-12-05,20021.0,Nicholas,200,,Black & Tan,,,,BROUGHT-IN,Sold,0


In [15]:
reg_df.tail()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
520,Shepherd,1999-03-15,M/N,8652 ar,1999-03-14,20012.0,Simba ($53.50),200,,Black,,,,,Sold,17
526,Shep X,1999-04-14,M/N,8560 #17,1999-04-04,20025.0,Monty,200,,Tan,,,,,Sold,17
531,Lab X,1999-04-28,F,8564 #17,1999-04-12,20028.0,Suzie,200,,Blk/Tan/White,,,,,Sold,0
537,Shep X,,M,8626 #17,1999-04-19,20036.0,Zep(real name Oscar),200,,Blk/Tan,,,,,Redeemed,17
548,Shep X,,M/N,"8555, MC 7",1999-04-23,20023.0,Kilo,200,,Black,,,,,Sold,1


In [16]:
reg_df.sample(5)

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
4701,Lab,2001-07-29,M/N,"12464,TU",2001-07-20,20079.0,Joe,200,,Black,,,,SPCA,Sold,1.0
12975,Chihuahua,,F,17054 DT,2005-10-27,20017.0,Blackie,200,,Black & White Feet,,8 Lbs,,BROUGHT-IN,Redeemed,
9810,Sheltie,,F,15923 - skj,2005-01-27,20009.0,Mei Mei,200,,White & Brown & Black,,25,,HOLDING STRAY,Redeemed,20.0
7650,Standard Poodle,,M,n/c,2003-07-16,20070.0,Darcy,200,,Black,,45lbs.(222kgs.,,COMPLAINT,Ride Home Free,8.0
15909,Shar Pei,2008-12-10,M/N,N/C,2007-12-15,20030.0,Horatio (new name),200,,Tan,,35 lbs,,HOLDING STRAY,Sold,14.0


Since there are duplicates, we will need to drop those.

In [17]:
reg_df.duplicated().value_counts()

False    26067
True        41
Name: count, dtype: int64

Using the results from the `.info()` method, we can quickly visually parse about how many null values there are per column.

In [18]:
reg_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26108 entries, 6214 to 548
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Breed            26100 non-null  object 
 1   ShotsDate        3755 non-null   object 
 2   Sex              25820 non-null  object 
 3   ReceiptNumber    21709 non-null  object 
 4   DateImpounded    26108 non-null  object 
 5   PitNumber        17763 non-null  float64
 6   Name             23354 non-null  object 
 7   KennelNumber     26087 non-null  object 
 8   DispositionDate  12349 non-null  object 
 9   Color            26063 non-null  object 
 10  Code             1173 non-null   object 
 11  ApproxWeight     11645 non-null  object 
 12  Age category     8061 non-null   object 
 13  Source           24134 non-null  object 
 14  Status           26099 non-null  object 
 15  ACO              21327 non-null  object 
dtypes: float64(1), object(15)
memory usage: 3.4+ MB


`Age category` is the only column with a space character, as well as not being in `CamelCase`. Additionally, these columns are `CamelCase` instead of all lowercase, like the column labels in the `lost and found` dataset.

In [19]:
reg_df['Age category'].value_counts()   # Puppy being a category assumes the animal is a dog

Age category
Adult          4922
Young Adult    1557
Senior         1134
Puppy           448
Name: count, dtype: int64

Issue and justification: Immediately, we can see that there are a lot of null values, especially for `Code`, `ShotsDate`, and `Age category` columns. Each of those have no more than about 30% of the values filled. It would be difficult, however, to infer anything to answer our question using any of these columns. 

The best columns would probably be `name`, `color`, `sex`, and `breed` for making inferences, so this means we can probably just drop the columns that are mostly null. For `name` and the other column.s previousy mentioned, we can probably just replace the null values with something like "Unknown"

All of these null values make correlating animals together based on similar factors less reliable, especially if something critical like the name isn't disclosed or the color is not reported as the staff intaking animals would describe. Names are also likely the highest differentiator between two animals with similar breeds and coat colors, as well.

There are also duplicate values, but those will be easy to handle. The biggest issue I feel is the insane abundance of missing values.

### Tidiness Issue 1:

I'm of the opinion that validating the 3 qualities of tidiness is most easily began visually with `.head()/.tail()`, then drilling down programatically with `.value_counts()`, queries, `.columns`, and more. 

In [20]:
lf_df.head()  # lets ground overselves in the data again

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [21]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
17862,2025-08-01T00:00:00+00:00,German Shepherd Lab X,Black,Millie,F,Matched
17863,2025-08-01T00:00:00+00:00,"Pointer, Stafford Terrier",White with Red,Willow,F/S,Matched
17864,2025-08-02T00:00:00+00:00,Cat - DSH - Black & White,Black & White,Bean,F/S,Lost
17865,2025-08-02T00:00:00+00:00,GSD/Lab X,Black,Milly,F,Matched
17866,2025-08-03T00:00:00+00:00,Rottweiler,Black & Brown,Marks,M,Lost


Based on the output above, the columns are each a distinct type of variable. It does seem, however, some columns store multiple variables in a single datapoint, especially `sex`, which indicates the reproductive sterility of an animal. This column, and another to be investigated further below, can be split into additional columns.

In [22]:
lf_df.sex.value_counts()

sex
M      5104
M/N    4560
F      3906
F/S    3689
X       221
Name: count, dtype: int64

In [23]:
print("There are", lf_df.breed.unique().shape[0], "unique breeds in the dataset.")

breed_s = pd.Series(lf_df.breed.unique()).sort_values(na_position='first', ignore_index=True)
breed_s.head()

There are 4052 unique breeds in the dataset.


0                            None
1          (Miniature) Pomeranian
2              .Unknown Breed Mix
3      1 Pit Bull & 1 Terrier mix
4    1 Pitbull & 1 Bernese/Poodle
dtype: object

In [24]:
breed_s.tail(10)

4042    stratfordshire terrier
4043                     tabby
4044    tabby Grey black brown
4045       tabby short haireds
4046               terrier (?)
4047                 terrier X
4048             terrier X pug
4049      very small ( 5 lbs )
4050             westy terrier
4051              yorkie cross
dtype: object

In [25]:
breed_s.str.extractall("(.*\d.*)")

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
3,0,1 Pit Bull & 1 Terrier mix
4,0,1 Pitbull & 1 Bernese/Poodle
5,0,1/2 Pit & 1/2 Presa
6,0,2 Dachunds
7,0,2 Great Pyrenese
...,...,...
3650,0,Staff Terrier #2
3672,0,Stafforshire Terrier 14 months
3983,0,lab and 2nd dog German Sheper
4040,0,small dog Benji type (10lbs)


Issue and justification: It seems the `breed` column could be useful for finding single rows with multiple animals reported (e.g., `1 Pit Bull & 1 Terrier mix`). A more tidy way to do this would be splitting those out into separate rows. If the report needs to be associated with those from the same report, there can be a report ID that all applicable reports share, however I will not need that information to answer my question.

By having multiple animals in one row, this not only inaccurately represents the number of animals reported missing, but also makes grouping or filtering by the reported breed unreliable.

The `sex` column could also be split into two columns, where one solely stores the biological sex of the animal and the other stores if they are neutered or spayed. This can just be a boolean value for simplicity's sake.

### Tidiness Issue 2: 

In [26]:
reg_df.head()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
6214,Border Collie X,,F,NC,2002-02-17,20035.0,Goggle,200,,Black & White,,,,BROUGHT-IN,Redeemed,17
6218,Labrador X,,F,"11830,JC",2002-03-05,20030.0,,200,,Tan,,,,HOLDING STRAY,Redeemed,6
6222,Rotty (with full tail),2000-06-07,M/N,,2000-05-28,40062.0,Harry Wong,400,,Black & Tan,,,,BROUGHT-IN,Transferred,0
6223,Keeshound X,2001-02-21,M/N,,2001-02-14,40089.0,Dexter,400,,Tri-Color,,,,COMPLAINT,Transferred,8
6224,Rotty,2001-12-12,M/N,11848 ar,2001-12-05,20021.0,Nicholas,200,,Black & Tan,,,,BROUGHT-IN,Sold,0


In [27]:
reg_df.sort_index().head()

Unnamed: 0_level_0,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1,Pit Bull,2005-06-18,M/N,17057 ks,2005-06-12,20038.0,Taz,200,,Tan,,45,,HOLDING STRAY,Sold,3.0
2,English Setter,,M/N,20372 BC3,2006-03-19,20041.0,Dudley,200,,White & Brown,,100lbs,,HOLDING STRAY,Redeemed,
3,Lab X,,M/N,N/C BC,2006-07-24,40041.0,Evander,400,,Black,,,,HOLDING STRAY,Ride Home Free,5.0
4,Pomeranian,,M,"22171,MC",2006-09-08,20017.0,Chewbaca,200,,Brown,,10 lbs.,,HOLDING STRAY,Redeemed,14.0
5,American Bulldog X,2006-10-07,M/N,22565 - skj,2006-09-06,20053.0,Cadillac,200,,White with Brown Patch on Eye,,80 lbs,,HOLDING STRAY,Sold,22.0


Skimming through the breed column revealed multiple occurrences of a string like "`<breed-name> X`". Immediately, I am unsure what this may indicate, but I have a presumption that I will elaborate upon when I describe the issue I'll be addressing with this dataset.

In [28]:
reg_df.query('AnimalID == 3').iloc[0]  # iloc will return a series, which is a bit easier to read for a single row

Breed                       Lab X
ShotsDate                     NaN
Sex                           M/N
ReceiptNumber              N/C BC
DateImpounded          2006-07-24
PitNumber                 40041.0
Name                      Evander
KennelNumber                  400
DispositionDate               NaN
Color                       Black
Code                          NaN
ApproxWeight                  NaN
Age category                  NaN
Source              HOLDING STRAY
Status             Ride Home Free
ACO                             5
Name: 3, dtype: object

Looking a little further, it looks like `Lab X` is incredibly common.

In [29]:
reg_df['Breed'].str.extractall("(.*[lL]ab.*)").value_counts()

0              
Lab X              872
Lab                571
Labrador           495
Labrador X          78
Shep X Lab          78
                  ... 
Black Lab X Pit      1
Black Lab/GSD        1
Black X Lab          1
Heeler/Lab X         1
rotty X lab          1
Name: count, Length: 490, dtype: int64

Issue and justification: There are a few places where some improvements can be made. 
- `Name` - Some rows contain "`(New name)`," which is tracking two factors of data in a single column.
  - This could be resolved with another column of boolean values, but I do not think it will be entirely necessary for our cause. Therefore, we will remove occurrences of `(New name)`
- `Breed` - Some breeds have "`mix`" while others have "`X`". Presumably, `X` is also for "mix," but this is merely an assumption. 
  - Additionally, `Black Lab` is not a breed, the breed is Labrador Retriever. Despite this, we will be leaving it in order to simplify this project.

Another general issue is that this field has inconsistent capitalization. Eventually, I will normalize all of the string fields so that they are all in lowercase.

## 3. Clean data

There are a few operations I plan to do before we begin the real cleaning:
1. Lowercase all strings
2. Lowercase all column names and remove spaces

Following that, we will address the four issues brought up during phase 2.

In [30]:
# Make copies of the datasets to ensure the raw dataframes are not impacted
lf_df_clean = lf_df.copy()
reg_df_clean = reg_df.copy()

reg_df_clean.rename(columns=lambda x: x.strip().lower().replace(" ", ""), inplace=True)


for col in lf_df_clean.columns:
    lf_df_clean[col] = lf_df_clean[col].str.lower()

#lf_df_clean.rename(columns={"name": "animalname"}, inplace=True)

#reg_df_clean.index.rename(reg_df_clean.index.name.lower(), inplace=True)

for col in reg_df_clean.columns:
    if col == 'pitnumber': continue  # this is type float64

    reg_df_clean[col] = reg_df_clean[col].str.lower()

In [31]:
lf_df_clean

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03t00:00:00+00:00,rotty x shep,black & tan,tex,m/n,lost
1,1999-01-04t00:00:00+00:00,dog,light colour,,m/n,found
2,1999-01-04t00:00:00+00:00,golden lab x,black & tan,oscar,m,lost
3,1999-01-04t00:00:00+00:00,shep x,black & tan,,f,found
4,1999-01-04t00:00:00+00:00,shep x collie,black & tan,angel,f,lost
...,...,...,...,...,...,...
17862,2025-08-01t00:00:00+00:00,german shepherd lab x,black,millie,f,matched
17863,2025-08-01t00:00:00+00:00,"pointer, stafford terrier",white with red,willow,f/s,matched
17864,2025-08-02t00:00:00+00:00,cat - dsh - black & white,black & white,bean,f/s,lost
17865,2025-08-02t00:00:00+00:00,gsd/lab x,black,milly,f,matched


Now we'll validate...

In [32]:
# Random sample of the lost and found table is all lowercase?
lf_df_clean.sample(1)

Unnamed: 0,date,breed,color,name,sex,state
4111,2002-10-04t00:00:00+00:00,germ shep x lab,brown,prince,m/n,lost


In [33]:
# Random sample of the register table is all lowercase?
reg_df_clean.dropna().sample(1) # dropna will prevent NaN -> "NaN"

Unnamed: 0_level_0,breed,shotsdate,sex,receiptnumber,dateimpounded,pitnumber,name,kennelnumber,dispositiondate,color,code,approxweight,agecategory,source,status,aco
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
24068,terrier,2013-05-20,m/n,da 13-368611,2013-04-08,11.0,kipper,200,2013-07-07,cream,yellow,15lbs,young adult,holding stray,sold,21


In [34]:
# All columns lowercase?
str(reg_df_clean.columns.tolist()).islower()

True

### **Quality Issue 1: to '&' or to 'and?'**

For the issue of inconsistent formatting in the lost and found data's `color` column, we will just remove the "&" and "and" strings, parsing the results out into a list. This gets it split out into something ready for us to access programatically right out of the gate.

This cleaning will be applied to both datasets, as I noticed that (understandably), the same issue exists within the register dataset.

We will need to start off by handling the null values in the `color` column, though, so that we don't have to repeat the same steps of converting it to an array with the second quality issue.

In [35]:
# Getting ahead of ourselves a bit, but handling null color values
lf_df_clean.fillna({'color':'unknown'}, inplace=True)
reg_df_clean.fillna({'color':'unknown'}, inplace=True)

In [36]:
# Replace " w/", "/", " and ", "&", " with " with "|", then split into array on "|"
pattern = r"( w\/| [^\w\s] | and |/| with )"
lf_df_clean['color'] = lf_df_clean['color'].str.replace(pat=pattern, repl="|", regex=True)
lf_df_clean['color'] = lf_df_clean['color'].str.split("|")

# The same thing, but for reg_df
reg_df_clean['color'] = reg_df_clean['color'].str.replace(pat=pattern, repl="|", regex=True)
# we aren't going to split, yet, as this will cause some errors with the next quality issue.

In [37]:
# Validation cleaning succeeded for lost and found data
lf_df_clean['color'].sample(5)

7253           [beige]
6052      [black, tan]
8546    [black, brown]
2607           [black]
4350           [black]
Name: color, dtype: object

In [38]:
# Validation on register data
reg_df_clean['color'].sample(5)

AnimalID
21521          black|tan
33529          tan|white
20266         grey|white
17586              brown
13551    brown|tan|white
Name: color, dtype: object

In [39]:
# Further validation
# Before -> After
print(lf_df.iloc[372].color, "->", lf_df_clean.iloc[372].color)
print(lf_df.iloc[424].color, "->", lf_df_clean.iloc[424].color)
print(lf_df.iloc[439].color, "->", lf_df_clean.iloc[439].color)
print("\nSplitting reg_df_clean will happen later")
print(reg_df.iloc[372].Color, "->", reg_df_clean.iloc[372].color)
print(reg_df.iloc[424].Color, "->", reg_df_clean.iloc[424].color)
print(reg_df.iloc[439].Color, "->", reg_df_clean.iloc[439].color)

Reddish Blonde -> ['reddish blonde']
Cream  & Rust -> ['cream ', 'rust']
Grey & White & Black -> ['grey', 'white', 'black']

Splitting reg_df_clean will happen later
Calico -> calico
Black w/white -> black|white
White w/ Yellow -> white| yellow


Justification: In order to normalize a large majority of the colors in these datasets, which makes them significantly easier to compare. It also will make our aggregation results a bit more reliable since there won't be differences in using "&" vs. "and."

### **Quality Issue 2: Handling Nulls and Duplicates**

There are lots of null values throughout the datasets, which then requires us to handle them specially. If we instead set them to a string, this will simplify life a tad. We also don't want duplicate values, as they are just redundant records. Thankfully, dupes are only in the register data.

We'll start by dropping what we don't want, being duplicates and a few columns that don't help answer our question.

In [40]:
cols_to_drop = ['shotsdate','kennelnumber', 'pitnumber', 'code', 'aco']
reg_df_clean.drop(columns=cols_to_drop, axis=1, inplace=True)

In [41]:
sub = reg_df_clean.columns.to_list()
sub.pop(sub.index('color'))
print(sub)

print("Pre-drop:\n", reg_df_clean[sub].duplicated().value_counts())
reg_df_clean.drop_duplicates(subset=sub, inplace=True)  # throws error when lists are included

# Prevent indexing disparity
# reg_df_clean.reset_index(inplace=True, drop=True)

['breed', 'sex', 'receiptnumber', 'dateimpounded', 'name', 'dispositiondate', 'approxweight', 'agecategory', 'source', 'status']
Pre-drop:
 False    26007
True       101
Name: count, dtype: int64


In [42]:
sub = lf_df_clean.columns.to_list()
sub.pop(sub.index('color'))

print("Pre-drop:\n", lf_df_clean[sub].duplicated().value_counts())
lf_df_clean.drop_duplicates(subset=sub, inplace=True)  # throws error when lists are included

# Prevent indexing disparity
lf_df_clean.reset_index(inplace=True, drop=True)

Pre-drop:
 False    17840
True        27
Name: count, dtype: int64


Now we can validate that our changes did what we intended

In [43]:
# Before -> After
print(reg_df.duplicated().any(), "->", reg_df_clean.duplicated().any())

True -> False


We'd get an error reporting that a list is not hashable if we tried to `drop_duplicates` or look for `duplicated` rows on a split `color` column. Since we have all of that handled, we can split it into a list now.

In [44]:
reg_df_clean['color'] = reg_df_clean['color'].str.split("|")

In [45]:
#FILL IN - Apply the cleaning strategy
lf_df_clean.fillna('unknown', inplace=True)
reg_df_clean.fillna('unknown', inplace=True)

In [46]:
#FILL IN - Validate the cleaning was successful
reg_df_clean.isnull().any()

breed              False
sex                False
receiptnumber      False
dateimpounded      False
name               False
dispositiondate    False
color              False
approxweight       False
agecategory        False
source             False
status             False
dtype: bool

In [47]:
lf_df_clean.isnull().any()

date     False
breed    False
color    False
name     False
sex      False
state    False
dtype: bool

Justification: Now that we have our nulls and duplicates handled, our data should be less tricky to work with, especially with aggregation, comparison, and enhancement.

### **Tidiness Issue 1: Multiple Animals in Some Rows**

This particular issue in the lost and found table, as I observed in the `breed` and `name` column, breaks the tidiness rule of "a single row for a single observation." It makes sense in terms of inputting the data into a record system that you would want this information in the same record, as that is also probably associated with a point of contact and it de-duplicates efforts, however it will not be helpful for our purposes. 

It's entirely possible that an animal was found separately from the second animal that was reported, or that they were input into the register as separate records. We'll clean up this issue by handling the `name` and `breed` column similarly to how we dealt with the `color` column.

We could handle the `color` column (since animals aren't always the same color), but that should have been done before during this step as well, however that was quality issue, not a tidiness issue, and this is a project for school. I presume that it's best to just leave things in this order and not stray even further from the template.

In [48]:
# Save indices for future access/removal.
lf_df_index_amp = lf_df_clean[lf_df_clean['name'].str.contains(" & ")].index
lf_df_index_and = lf_df_clean[lf_df_clean['name'].str.contains(" and ")].index

In [49]:
clean_and = lf_df_clean.iloc[lf_df_index_and]['name'].apply(lambda x: x.split(" and "))
clean_amp = lf_df_clean.iloc[lf_df_index_amp]['name'].apply(lambda x: x.split(" & "))

In [50]:
# Before change implemented
print(lf_df_clean.iloc[227])
print(lf_df_clean.iloc[1142])

date     1999-04-10t00:00:00+00:00
breed        lab x collie x beagle
color             [two black dogs]
name                coco and wayne
sex                              m
state                        found
Name: 227, dtype: object
date     2000-02-07t00:00:00+00:00
breed                rottys-2 dogs
color                 [black, tan]
name               starsky & hutch
sex                            m/n
state                      matched
Name: 1142, dtype: object


In [51]:
# Change + validation
lf_df_clean['name'].iloc[clean_and.index] = clean_and
print(lf_df_clean.iloc[227])

lf_df_clean['name'].iloc[clean_amp.index] = clean_amp
print(lf_df_clean.iloc[1142])

date     1999-04-10t00:00:00+00:00
breed        lab x collie x beagle
color             [two black dogs]
name                 [coco, wayne]
sex                              m
state                        found
Name: 227, dtype: object
date     2000-02-07t00:00:00+00:00
breed                rottys-2 dogs
color                 [black, tan]
name              [starsky, hutch]
sex                            m/n
state                      matched
Name: 1142, dtype: object


In [52]:
lf_df_clean.iloc[227]

date     1999-04-10t00:00:00+00:00
breed        lab x collie x beagle
color             [two black dogs]
name                 [coco, wayne]
sex                              m
state                        found
Name: 227, dtype: object

In [53]:
# Further validation 
print(lf_df_clean['name'].str.contains(" and ").any())
print(lf_df_clean['name'].str.contains(" & ").any())

False
False


Finally, we will need to split the rows that have list in `name` into separate rows, then reindex the entire dataframe.

In [54]:
cols = lf_df_clean.columns.tolist()
cols.pop(cols.index("name"))
cols

['date', 'breed', 'color', 'sex', 'state']

In [55]:
# Since lists aren't hashable, convert all color lists to strings
lf_df_clean.color = lf_df_clean.color.apply(lambda x: "|".join(x))

# Which allows us to maintain the data in a row even when we split it with explode()
lf_df_clean.set_index(cols, inplace=True)

In [56]:
lf_df_clean

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,name
date,breed,color,sex,state,Unnamed: 5_level_1
1999-01-03t00:00:00+00:00,rotty x shep,black|tan,m/n,lost,tex
1999-01-04t00:00:00+00:00,dog,light colour,m/n,found,unknown
1999-01-04t00:00:00+00:00,golden lab x,black|tan,m,lost,oscar
1999-01-04t00:00:00+00:00,shep x,black|tan,f,found,unknown
1999-01-04t00:00:00+00:00,shep x collie,black|tan,f,lost,angel
...,...,...,...,...,...
2025-08-01t00:00:00+00:00,german shepherd lab x,black,f,matched,millie
2025-08-01t00:00:00+00:00,"pointer, stafford terrier",white|red,f/s,matched,willow
2025-08-02t00:00:00+00:00,cat - dsh - black & white,black|white,f/s,lost,bean
2025-08-02t00:00:00+00:00,gsd/lab x,black,f,matched,milly


In [57]:
# Turn list of names into separate rows using explode
lf_df_clean = lf_df_clean.explode('name')
lf_df_clean.iloc[227:230]  # both rows will have the same index

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,name
date,breed,color,sex,state,Unnamed: 5_level_1
1999-04-10t00:00:00+00:00,lab x collie x beagle,two black dogs,m,found,coco
1999-04-10t00:00:00+00:00,lab x collie x beagle,two black dogs,m,found,wayne
1999-04-10t00:00:00+00:00,maltipoo,black|white,m,lost,barney


In [58]:
lf_df_clean.reset_index(inplace=True)  # Revert changes and retain columns for all rows
lf_df_clean.color = lf_df_clean.color.apply(lambda x: x.split("|")) # Re-cast color as a list

In [59]:
# Validation
lf_df_clean.iloc[227:230]

Unnamed: 0,date,breed,color,sex,state,name
227,1999-04-10t00:00:00+00:00,lab x collie x beagle,[two black dogs],m,found,coco
228,1999-04-10t00:00:00+00:00,lab x collie x beagle,[two black dogs],m,found,wayne
229,1999-04-10t00:00:00+00:00,maltipoo,"[black, white]",m,lost,barney


In [60]:
# Confirming color was changed back to a list
type(lf_df_clean.iloc[0].color)

list

In [61]:
# Previously, the name field for Coco and Wayne looked like this
clean_and.iloc[0]

['coco', 'wayne']

Justification: Now, we are no longer storing two dogs in one row, thus making this data just a bit tidier. By splitting with the somewhat highly prevalent usage of "&" and "and," we are able to get closer to a more accurate count of how many dogs were reported lost, as well as more easily allowing us to join based on the `name` column.

One limitation with this approach is that we are not doing anything about the color for the specific animal we split out into a new row. If we addressed this issue before we handled `color` in the quality steps, we may be able to have the correct color for each animal, rather than a list of colors in for two animals. I do not forsee this causing many issues as we will be using more than just a color to try and match animals when I combine the data.

### **Tidiness Issue 2: Tracking Multiple Variables in One Column**

In the both datasets, there are multiple occurences of the string "(New name)." I'd consider this a different variable that could be represented in a separate column, since it isn't part of there name. Rather, it is an aspect or quality of the name. In our case, however, we are just doing to remove every occurence of that "(New name)" string.

A similar issue comes up in the `breed` column, where we see "`X`" in place of "`mix`" and strings like "`Black Lab`". I will split out the "`mix`" portion into another column to reduce the number of unique `breed` strings.

In [62]:
# Remove undesired strings and replace if necessary in both lf_df and reg_df
lf_df_clean['breed'] = lf_df_clean['breed'].str.replace("\s[xX]", " mix", regex=True)
reg_df_clean['breed'] = reg_df_clean['breed'].str.replace("\s[xX]", " mix", regex=True)

lf_df_clean['name'] = lf_df_clean['name'].str.replace("\(?[Nn]ew [Nn]ame\)?", "", regex=True)
reg_df_clean['name'] = reg_df_clean['name'].str.replace("\(?[Nn]ew [Nn]ame\)?", "", regex=True)

lf_df_clean['breed'] = lf_df_clean['breed'].str.strip()
lf_df_clean['name'] = lf_df_clean['name'].str.strip()
reg_df_clean['breed'] = reg_df_clean['breed'].str.strip()
reg_df_clean['name'] = reg_df_clean['name'].str.strip()

In [63]:
# Validate the cleaning was successful
print(lf_df_clean['breed'].str.contains("\s[xX]", regex=True).any())
print(lf_df_clean['name'].str.contains("[Nn]ew [Nn]ame", regex=True).any())

print(reg_df_clean['breed'].str.contains("\s[xX]", regex=True).any())
print(reg_df_clean['name'].str.contains("[Nn]ew [Nn]ame", regex=True).any())

False
False
False
False


Justification: We removed the additional variable in the `name` column denoting a given animal's name as a "new name", which removes likely superfluous information and simplifys the handling of a large majority of names. 

By expanding the shorthand used for "mix" in the `breed` column, we somewhat address the tidiness issue where the breed column is storing multiple variables in the multiple breeds. There are better ways to address this that would take more cycles of cleaning, but now "mix" can act as a sort of flag for further operations. Given more time, I would create 2 more columns: one for `primary_breed` column, and `mix_breed` column that lists the other mixes (or `none` for purebred, `unknown` for "mutts").

Both of these actions aid in answering our question because they prevent potential false negatives due to the "new name" string, as well as give us a string to ignore when matching, but to review when manually investigating results with the "mix" string. Effectively, it lets us ignore just "mix" instead of both it and " x" or " X".

### **A few extra bits of cleaning**

In [64]:
# Conforming to one date format by trimming off at and after the "T" for ISO 8601
lf_df_clean['date'] = lf_df_clean['date'].apply(lambda x: x.split('t')[0])

In [65]:
lf_df_clean['date'].iloc[0]

'1999-01-03'

In [66]:
reg_df_clean['dateimpounded'].iloc[0]

'2002-02-17'

## 4. Update data store

In [67]:
# Saving raw data
lf_df.to_csv('./datastore/raw_latest_lost_found.csv')
reg_df.to_csv('./datastore/raw_latest_register.csv')

# Saving cleaned data
lf_df_clean.to_csv('./datastore/clean_latest_lost_found.csv')
reg_df_clean.to_csv('./datastore/clean_latest_register.csv')

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* The City of Vancouver tracks every animal that comes into its shelters and those reported by owners as lost. The city does track those that are matched back to their owner, is it possible that an animal still tracked as lost has possibly been accounted for?

#### Visual 1

In [None]:
# make copies, we'll be adding a column
lf_df_tmp = lf_df_clean.copy()
reg_df_tmp = reg_df_clean.copy()

# store only YYYY-MM in `shortdate` col
lf_df_tmp['shortdate'] = lf_df_tmp.date.str.slice(0,7)
reg_df_tmp['shortdate'] = reg_df_tmp.dateimpounded.str.slice(0,7)

Here is what that looks like:

In [135]:
lf_df_tmp.head()

Unnamed: 0,date,breed,color,sex,state,name,shortdate
0,1999-01-03,rotty mix shep,"[black, tan]",m/n,lost,tex,1999-01
1,1999-01-04,dog,[light colour],m/n,found,unknown,1999-01
2,1999-01-04,golden lab mix,"[black, tan]",m,lost,oscar,1999-01
3,1999-01-04,shep mix,"[black, tan]",f,found,unknown,1999-01
4,1999-01-04,shep mix collie,"[black, tan]",f,lost,angel,1999-01


Now we'll join the date on the a few columns: `name`, since that is usually verifiable via microchip or collar, `sex`, since that is physically identifiable (most of the time), and `shortdate`, presuming that a large majority of animals turn up within the same month that they are reported as lost. This does not accounts for animals reported as lost on the last day of a month, the possibly brought in to the city's custody in the following month or so.

In [144]:
lf_df_clean.query("name == 'unknown' & state == 'lost'")

Unnamed: 0,date,breed,color,sex,state,name
12,1999-01-10,lab,[black],m/n,lost,unknown
41,1999-01-27,shep mix husky,[tan],f/s,lost,unknown
73,1999-02-08,rottweiller puppy,"[black, tan]",m,lost,unknown
120,1999-02-22,lab,[black],m/n,lost,unknown
143,1999-03-03,daschund mix terr,"[black, tan]",m,lost,unknown
...,...,...,...,...,...,...
17105,2024-03-15,cat - dsh,"[grey, black]",x,lost,unknown
17669,2025-03-15,pigeon,[grey],x,lost,unknown
17874,2025-07-12,"bird, budgie","[yellow, green]",x,lost,unknown
17880,2025-07-15,cat - dsh - bengal,"[creamy white, brown spots]",f/s,lost,unknown


In [145]:
lf_reg_names = lf_df_tmp.merge(
    reg_df_tmp, 
    left_on=['shortdate', 'name', 'sex'], 
    right_on=['shortdate', 'name', 'sex'], 
    how='inner'
)

In [146]:
lf_reg_names.status.value_counts()

status
redeemed                       10106
sold                            1193
ride home free                   658
transferred                      472
behavior                         135
owner request - signed over      134
health                            36
passed away                       29
impound                           18
unknown                            7
escaped                            3
viewable                           2
released ( wildlife)               2
fostered                           1
stolen                             1
Name: count, dtype: int64

In [147]:
help

Type help() for interactive help, or help(object) for help about object.

In [None]:
# Select sources that are potentially lost animals
reg_query = reg_df_clean.query("source in \
                   ['holding stray', " \
                   "'brought-in', " \
                   "'other', " \
                   "'complaint', " \
                   "'unknown', " \
                   "'spca', " \
                   "'transferred', " \
                   "'patrol']")\
    [['dateimpounded','source']]


Unnamed: 0_level_0,dateimpounded,source
AnimalID,Unnamed: 1_level_1,Unnamed: 2_level_1
6214,2002-02-17,brought-in
6218,2002-03-05,holding stray
6222,2000-05-28,brought-in
6223,2001-02-14,complaint
6224,2001-12-05,brought-in
...,...,...
520,1999-03-14,unknown
526,1999-04-04,unknown
531,1999-04-12,unknown
537,1999-04-19,unknown


In [None]:
lf_yr_mo_counts = lf_df_clean.query("state == 'lost' | state == 'matched'").date.apply(lambda x: x[:7]).value_counts()
reg_yr_mo_counts = reg_query.dateimpounded.apply(lambda x: x[:7]).value_counts()

In [94]:
lf_2024 = lf_yr_mo_counts[lf_yr_mo_counts.index.str.contains('2024')]
reg_2024 = reg_yr_mo_counts[reg_yr_mo_counts.index.str.contains('2024')]

In [None]:
ax = lf_2025.hist(alpha=0.5, figsize=(8, 6), label='lost+found');
reg_2024.hist(alpha=0.5, figsize=(8, 6), label='register', ax=ax);

ax.set_title("Distribution of Lost Animal Reports and ")

dateimpounded
07    2623
08    2469
05    2447
06    2393
10    2201
04    2160
09    2145
03    2115
11    1956
01    1902
02    1818
12    1778
Name: count, dtype: int64

[<matplotlib.lines.Line2D at 0x2e8e75b7e50>]

Error in callback <function _draw_all_if_interactive at 0x000002E8D3EA1510> (for post_execute), with arguments args (),kwargs {}:


KeyboardInterrupt: 

*Answer to research question:* FILL IN

In [None]:
# Visual 2

ModuleNotFoundError: No module named 'upsetplot'