- This project explores trends in NYC dog licensing data, with a focus on how a dogβs age and breed affect licensing behaviors across the city.
- The dataset was provided by the Animal Care Centers of NYC (ACC), who manage dog licensing and pet population initiatives in collaboration with the Department of Health.
- The goal was to help ACC understand current licensing patterns and uncover areas where improvements can be made, particularly around the adoption and licensing of older dogs and underrepresented breeds.
- The data comes from the NYC Department of Healthβs Dog Licensing System, where owners are legally required to register their dogs. The dataset includes:
- Both new licenses and renewals
- One row per active license per year
- Some dogs appear more than once due to renewals
- This structure allows us to analyze annual licensing trends, but does not represent unique dogs.
- Animal Care Centers of NYC (ACC)
- ACC helps manage pet adoptions, licensing compliance, and dog population insights across the five boroughs. They rely on licensing data to support outreach, shelter planning, and policy development around responsible pet ownership.
- How does a dogβs age and breed affect licensing trends across NYC?
- Removed fully duplicated rows with df.drop_duplicates()
- Case: "Abigail" appeared multiple times with identical license info β confirmed true duplicate
- Created IssueYear and ExpireYear from license date columns
- Caught data issues where
ExtractYeardid not match the actual license year range β these rows were removed
- Dropped rows missing key fields:
AnimalBirthYear,AnimalGender, orLicenseExpiredDate- Impact: Only ~130 rows removed
- Calculated
ApproxDogAge=IssueYear-AnimalBirthYear - Created
DogAgeGroupbins:0β1 (Puppy), 2β3 (Young), 4β6 (Adult), 7β10 (Senior), 10+ (Elderly)
- Removed unused or duplicate fields (e.g.,
LicenseIssuedYear)
- ZIP codes were critical for our geographic visualizations (like maps by borough or ZIP), so we needed to make sure the data was clean and accurate. We handled this process in two key steps:
- Format & Filter by Length
- We first ensured that the
ZipCodecolumn was in string format and that only entries with 5-digit codes remained.- Why: Some entries had ZIPs like
0,100, or were blank. These arenβt valid for NYC mapping.df['ZipCode'] = df['ZipCode'].astype(str)df = df[df['ZipCode'].str.len() ==5]
- This cleaned up obvious formatting issues but didnβt guarantee that the ZIPs were real NYC ZIP codes.
- Why: Some entries had ZIPs like
- We first ensured that the
- Cross-Check with Official NYC ZIP List
- To validate ZIP codes more thoroughly, we used an external file:
zipcode.csv, which contained a verified list of valid NYC ZIP codes. - We compared every ZIP code in our dataset against this official list to filter out any invalid, outdated, or non-NYC ZIPs.
- We also cleaned up formatting quirks from Excel exports, such as
.0suffixes that appeared when numbers were read as floats. - This was important because:
- Ensured that all geographic visualizations (especially our ZIP code map) were accurate and didnβt mislead stakeholders with bad data.
- Prevented invalid ZIPs from skewing analysis, for example, a license record with
00000would distort mapping entirely. - Made it possible to link each license to a real NYC neighborhood, helping stakeholders identify underserved or overrepresented areas.
- To validate ZIP codes more thoroughly, we used an external file:
- Format & Filter by Length
-
The
BreedNamecolumn in the dataset had significant inconsistencies. For example, the same breed appeared under many slightly different names such as "Poodle Mix", "Poodle X", "Poodle Crossbreed", or even informal variants like "Poodle Dog" or "Poodle-type". These inconsistencies posed a problem for accurate aggregation and trend analysis, especially when grouping by breed. -
To address this, we implemented a two-step cleaning process:
- Regex-Based Cleanup with re
- We used regular expressions to standardize mix indicators like "Mix", "X", and "Crossbreed"
- This ensured that similar breed types were grouped under a single standard label, such as "Poodle Mix".
- Fuzzy Matching with thefuzz
- Even after regex cleanup, breed names varied too widely for manual standardization. So we used the fuzz library from thefuzz package to compare each unique breed name to a curated reference list of common dog breeds and their standardized labels. We mapped any breed that matched with a similarity score β₯ 85 to its closest reference value.
- This method allowed us to automatically consolidate variants like:
- "Yorkie Poodle" β "Yorkshire Terrier Mix"
- "Labrador Retriever" and "Lab Ret" β "Labrador Retriever"
- Handling Unknown or Placeholder Breeds
- We also identified and replaced invalid or placeholder breed names, such as:
"Unknown", "NoName", "Mixbreed", "0", etc. These were flagged with a boolean column IsBadName to help with filtering or analysis decisions later in Tableau.
- Regex-Based Cleanup with re
-
This allowed us to have:
- Reduced noise in the data
- Made breed-based grouping accurate
- Helped us visualize breed trends more reliably in Tableau
- To enhance our analysis and support more insightful visualizations, we engineered several new columns from the original dataset:
CleanedBreedName: A standardized version of breed names using fuzzy matching and manual corrections to address inconsistencies, typos, and duplicate labels.ApproxDogAge: Calculated by subtracting AnimalBirthYear from LicenseIssuedYear, giving us a rough estimate of the dogβs age at the time of licensing.DogAgeGroup: Created by binning ApproxDogAge into meaningful categories (e.g., "0β1 (Puppy)", "2β3 (Young)", etc.), allowing us to group and compare age ranges.IssueYearandExpireYear: Extracted from license date fields to simplify time-based filtering and support time series visualizations.IsBadName: A Boolean column identifying missing, placeholder, or junk names (like βNoNameβ or βWHATSERNAMEβ), used to flag low-quality entries for filtering or review.
π View Dashboard on Tableau
- Most dogs are licensed under age 3
- Very few senior/elderly dogs are licensed
- Suggests that older dogs are either less likely to be adopted or renewed
- Action: ACC can target awareness campaigns in areas with low senior dog licensing
- Puppy licenses peaked in 2021
- Elderly dog licenses are consistently lowest
- Senior dog licensing is declining year over year
- Action: Renewals and long-term ownership may need more city outreach or incentives
- Shows which breeds are mostly licensed as filtered by age group, such as in puppies, the most popular breed is American Pit Bull Terrier mix
- Identifies breeds like Shih Tzus and Airedale Terriers that appear in older age groups
- Action: Breed-specific adoption or renewal strategies could be developed
- Visualizes where younger vs older dogs are licensed
- Highlights geographic patterns by breed and age group
- Action: Helps ACC identify neighborhoods for localized campaigns or events
- Encouraging senior dog licensing:
- Offer discounts, reminders, or partner with vets for older pet outreach
- Target ZIP codes with low senior dog license counts:
- These areas could benefit from education or adoption events
- Standardize breed input for better analysis
- The licensing system could enforce dropdown-based input instead of free text
- Track renewal rates by breed and age
- This would help ACC understand which populations are staying licensed over time
- Whatβs the most important insight your dashboard reveals?
- The most impactful insight from our dashboard is that most dogs are licensed when they are young, particularly within the puppy (0β1 years) and young (2β3 years) age groups. In contrast, senior and elderly dogs are significantly underrepresented, especially when it comes to renewals over time. This pattern highlights a potential gap in outreach and awareness for older dogs, both in terms of adoption and license renewal.
- This insight opens the door for targeted education campaigns, renewed focus on senior dog advocacy, and potential policy interventions to encourage long-term licensing behavior and reduce drop-off as dogs age.
- With limited time, how did you decide what to prioritize?
-
At the start of the project, it was easy to feel overwhelmed. There were so many directions we could go. We had access to a rich dataset with dozens of columns, years of records, and endless possibilities. But from the very beginning, we reminded ourselves of one thing:
- π βStick to the business question.β
-
Our stakeholder asked:
- βHow does a dogβs age and breed affect licensing trends across NYC?β
- So we treated this question like a compass. Every time we felt tempted to explore something unrelated or scope out, we brought ourselves back to that core. We broke the big question into smaller ones that helped guide our analysis:
- Are older or younger dogs more likely to be licensed?
- How has the licensing of older vs. younger dogs changed over the years?
- Which breeds are most commonly licensed at older ages?
- Are there regional differences in breed or age group licensing (e.g., by ZIP code)?
- Are certain breeds underrepresented or possibly declining?
-
These became our building blocks. From there, we asked ourselves:
- Which of these supporting questions actually give the stakeholder insight into behavior and trends?
- What visuals will speak clearly in five minutes, without overwhelming the audience?
-
That mindset helped us filter out noise, like redundant columns or overcomplicated features, and stay focused on what truly matters. We didnβt just prioritize visuals; we prioritized value. Each visualization was selected not just to show data, but to tell part of a story, one that could help Animal Care Centers of NYC make informed decisions about outreach, renewal campaigns, and breed-specific policies.
In the end, it wasnβt about analyzing everything β it was about answering the right question, with purpose.
.
βββ data/
β βββ raw/
βββ NYC_Dog_Licensing_Dataset_.csv
β βββ cleaned/
βββ final_dataset.csv
βββ zipcode.csv
βββ excel/
β βββ final_dataset.csv
βββ notebooks/
β βββ cleaning_and_eda.ipynb
|
βββ .gitignore
|
βββ READE.md
Weβre passionate about data storytelling and using analysis to drive actionable insights. Feel free to connect with us on LinkedIn!