# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess the data programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your choice of methods for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement

 *The City of Vancouver tracks not only every animal that comes into its shelters, but also those that are reported as lost by their owners. The city does track animals that are matched back to their owners, but could an animal still listed as lost already be accounted for in the shelter records? If so, is there a reasonably reliable way to match animals using the data entered in a lost-animal report?*

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files [ X ]
- Gather data by accessing APIs [ X ]
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### City of Vancouver Animal Control Inventory - Lost and Found

This dataset contains reports from the City of Vancouver in which an owner has reported an animal as lost. It also tracks animals that were either reported as found or matched by the shelter back to their owner.

I chose this dataset because it shows which animals were reported as lost within the city. It does not cover every animal that was lost, but it does provide a large sample size for this metro area.

Further information on the dataset can be found [here](https://opendata.vancouver.ca/explore/dataset/animal-control-inventory-lost-and-found/information).

Type: JSON

Method: This data was gathered by querying the City of Vancouver's database with the standard Opendatasoft API. I am doing it this way because the data is updated daily, and this guarantees that the most up-to-date information will be used.

Dataset variables:

- *breed* - type of animal or breed that fits best.
- *color* - color of the animal's coat/fur.
- *date* - date that the animal was lost
- *name* - the given name of the animal being tracked (if known).
- *sex* - used to label the biological sex of the animal, as well as if they are spayed or neutered (marked with `F/S` or `M/N` accordingly). `X` = unknown
- *state* - the last state of being for the animal, i.e. `matched` or `lost`.

After some poking around, I found out that without a `group by` statement, the server only returns 100 results. By including a `group by` statement for all of the fields, this should theoretically drop duplicate values. I'm also filtering out anything before `1998-10-03`, as this is the earliest `DateImpounded` timestamp in the other dataset that will be used.

In [565]:
import requests
import pandas as pd
import datetime

# "lf" is for *l*ost and *f*ound
lf_api_query = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-lost-and-found/records?where=date%20%3E%20%221998-10-02%22&group_by=date%2C%20breed%2C%20color%2C%20name%2C%20sex%2C%20state&order_by=date&limit=-1"
lf_data = requests.get(lf_api_query)
lf_data.raise_for_status()
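
The hand-encoded URL above is hard to read and easy to mistype. As a minimal sketch, the same query string can be assembled from a plain dictionary with the standard library. The helper name `build_lf_query` is hypothetical; the parameter values are copied from the query above.

```python
from urllib.parse import quote, urlencode

BASE_URL = (
    "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/"
    "animal-control-inventory-lost-and-found/records"
)


def build_lf_query(start_date="1998-10-02"):
    """Assemble the lost-and-found query URL from readable parameters."""
    params = {
        "where": f'date > "{start_date}"',
        "group_by": "date, breed, color, name, sex, state",
        "order_by": "date",
        "limit": -1,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, uses "+")
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(build_lf_query())
```

The resulting string could be passed to `requests.get` exactly as `lf_api_query` is above.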

In [566]:
lf_json = lf_data.json()

print(lf_json.keys())
print(lf_json['total_count'])

lf_json['results'][0:3]
print(type(lf_json['results']))     # all elements
print(type(lf_json['results'][0]))  # individual element

dict_keys(['total_count', 'results'])
17840
<class 'list'>
<class 'dict'>


Based on our mild digging above, we want to load specifically the data in the `results` key as a Pandas DataFrame, since `results` is simply a `list` of `dict`s, which the `pandas.DataFrame` constructor can handle.

In [567]:
lf_df = pd.DataFrame(lf_json['results'])

#### City of Vancouver Animal Control Inventory - Register

This dataset is a "general record of each animal that has come into the custody" of the City of Vancouver's animal control service.

I chose this dataset to have a record to compare all of the lost and found animals to in the event an animal is reported as lost and the City of Vancouver happens to have them, or someone very much like them, already processed into their database.

Like the lost and found dataset, this data is updated daily. Because I must use a different method to pull this data, I will download it programmatically as a CSV file, just to make sure I cover all bases for this project.

Type: Semicolon (`;`) delimited "CSV" file.

Method: Programmatic download via HTTP GET request

Dataset variables:

- *AnimalID* - Unique sequential number given to each entry.
- *Breed* - Type of animal.
- *ShotsDate* - Date when vaccinated.
- *Sex* - M = Male, F = Female, M/N = Male Neutered, F/S = Female Spayed.
- *ReceiptNumber* - Point of sales system of record receipt number.
- *DateImpounded* - Date first in custody of the City of Vancouver.
- *PitNumber* - Number identifying animal kennel, does not change while in custody of the city.
- *Name* - Name if known.
- *KennelNumber* - Kennel number displayed at the top of each kennel.
- *DispositionDate* - Date when animal was no longer under the control of the city.
- *Color* - Color of coat.
- *Code* - Walk-ability index (*Green = easy, Yellow = moderate, Blue = hard*).
- *ApproxWeight* - Approximate weight of animal.
- *Age category* - Rough estimate of age - puppy, young adult, adult, senior.
- *Source* - Where the animal came from (Brought-in, Holding stray, Transferred).
- *Status* - Current state/disposition of animal.
- *ACO* - Animal control officer number or initials of employee.

In [568]:
# "reg" is for registry
reg_url = "https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/animal-control-inventory-register/exports/csv?lang=en&timezone=America%2FChicago&use_labels=true&delimiter=%3B"
reg_data = requests.get(reg_url)
reg_data.raise_for_status()

In [569]:
# Save contents to a file labeled with today's date.
file_dl_date = datetime.date.today()
filename = f"vancouver-ac-registry_{file_dl_date.strftime('%Y%m%d')}.csv"
relativepath = "./datasets/" + filename

# Write the downloaded CSV bytes to the target file
with open(relativepath, mode="wb") as f:
    f.write(reg_data.content)
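
The cell above assumes that a `./datasets/` folder already exists; `open()` fails if it does not. A small sketch of the same dated-filename idea that also creates the folder, assuming a hypothetical helper named `dated_csv_path`:

```python
import datetime
from pathlib import Path


def dated_csv_path(directory, stem, day):
    """Return <directory>/<stem>_<YYYYMMDD>.csv, creating the directory if needed."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder / f"{stem}_{day.strftime('%Y%m%d')}.csv"
```

For example, `dated_csv_path("./datasets", "vancouver-ac-registry", datetime.date.today())` produces a path with the same naming pattern as `relativepath`, and `Path.write_bytes` could stand in for the `open`/`write` pair.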

We'll make the `AnimalID` column our index column for our DataFrame since it's essentially a built-in order for the animals. Following this, we'll start assessing our data.

In [570]:
reg_df = pd.read_csv(relativepath, sep=";", index_col='AnimalID')

## 2. Assess data

When we are assessing data, we are on the lookout for **quality** and **tidiness** (structural) issues.

**Quality Issues:**
- Completeness - The collected data is sufficient for addressing specific problems.
- Validity - Data conforms to the defined schema.
- Accuracy - Data accurately represents the reality it is describing.
- Consistency - A standard format is followed. Data matches that which can be found in other sources.
- Uniqueness - No duplicate or overlapping values in the data.

**Tidiness Issues:**
- Each variable forms an individual column, i.e. color, name, birth year, etc. 
  - This also means that each column only contains one variable, or "factor that varies."
- Each observation forms an individual row, i.e. red, Scott, 1997, etc.
  - As with columns, only a single observation per row.
- Each type of observational unit forms a table, i.e. a table of immediate family members.

### Quality Issue 1:

We'll use the `.head()` and `.tail()` methods to quickly browse the dataset for anything that sticks out. This helps us ensure completeness to some degree, as well as validity and consistency.

In [571]:
# Inspecting the dataframe visually
lf_df.head()

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [572]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
17835,2025-07-21T00:00:00+00:00,Cat - DSH,Creamy white/brown,Hylia,F/S,Lost
17836,2025-07-22T00:00:00+00:00,.Unknown Breed Mix,Cream,Bebe,F/S,Matched
17837,2025-07-23T00:00:00+00:00,Orange Tabby,orange & white,Murzik/Gingie,M/N,Lost
17838,2025-07-24T00:00:00+00:00,Border Collie X,"Black, brown w white",Tilly,F,Lost
17839,2025-07-24T00:00:00+00:00,Tabby,Brown & White,Puss,F/S,Lost


In [573]:
lf_df.sample(5)  # random sampling to see if anything else jumps out

Unnamed: 0,date,breed,color,name,sex,state
7160,2005-11-26T00:00:00+00:00,Boxer,Reddish-Brown,Otis,M/N,Lost
3473,2002-03-14T00:00:00+00:00,Rotty,Black & Tan,Goliath,M,Matched
8192,2011-01-31T00:00:00+00:00,DLH,Black,Bailey,F/S,Lost
12384,2016-10-14T00:00:00+00:00,Border Collie,Black,Callie,F/S,Lost
5153,2003-09-28T00:00:00+00:00,Bernese Mountain Dog,Black & Brown & White,,M,Matched


Thankfully, there are no duplicate entries in this data (as of 20250723), indicating each row has "uniqueness."

In [574]:
lf_df.duplicated().value_counts()

False    17840
Name: count, dtype: int64

There are a lot of colors in the `color` column, which indicates a potential lack of consistency and validity.

In [575]:
print("There are", lf_df['color'].unique().shape[0], "unique strings in the color column.")

There are 3400 unique strings in the color column.


One prime example is this hamster, which has `Golden/Blonde` fur. This indicates there is not a standardized process for determining and labeling fur color.

In [576]:
lf_df.iloc[16633]

date     2023-06-26T00:00:00+00:00
breed                      Hamster
color                Golden/Blonde
name                        Chivis
sex                              M
state                         Lost
Name: 16633, dtype: object

I am also curious about how many of those colors have consistency issues, for example using an ampersand instead of an "and". The use of slashes may also indicate `color` labeling similar to `Chivis`, as seen immediately above.

In [577]:
# Count unique color strings containing each separator; .sum() counts the True values
unique_colors = pd.Series(lf_df.color.unique())
num_colors_amp = unique_colors.str.contains('&', na=False).sum()
num_colors_and = unique_colors.str.contains(' and ', na=False).sum()
num_colors_fwdsl = unique_colors.str.contains('/', na=False).sum()

print(num_colors_amp, "unique color strings contain '&' in the color description while", num_colors_and, "contain 'and'.")
print(num_colors_fwdsl, " unique colors contain a forward slash (/).")

1151 unique color strings contain '&' in the color description while 191 contain 'and'.
1341  unique colors contain a forward slash (/).
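
The same tally can be sketched in plain Python as a single pass over the strings. This is an illustration rather than a reproduction of the cell above: the helper `count_separators` and its `SEPARATORS` table are hypothetical, and it matches case-insensitively on word boundaries, so its counts on the real data may differ slightly from the numbers printed above.

```python
import re
from collections import Counter

# Label -> regex for each way of joining two colors (illustrative choices)
SEPARATORS = {
    "&": r"&",
    "and": r"\band\b",
    "/": r"/",
    "with": r"\bwith\b|\bw/",
}


def count_separators(color_strings):
    """Count how many strings contain each separator at least once."""
    counts = Counter()
    for text in color_strings:
        if not isinstance(text, str):  # skip None / NaN
            continue
        for label, pattern in SEPARATORS.items():
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[label] += 1
    return counts
```

It could be called as `count_separators(lf_df.color.unique())`.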


Issue and justification: Aside from lacking completeness, a lot of the columns do not have a consistent format.
Currently, "&" is primarily used in place of "and" in the `color` column. This matters when trying to answer our question, because matching on `color` will require different approaches (fuzzy matching, tokenization, etc.).

As a side note, the formatting in the `breed` column is also quite inconsistent, but this is addressed further in the tidiness section.
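
To make "tokenization" and "fuzzy matching" concrete, here is a minimal standard-library sketch of both ideas. The function names are hypothetical, and neither is the matching method used later in this notebook; they only show why free-text colors need more than an equality test.

```python
import re
from difflib import SequenceMatcher

FILLER_WORDS = {"and", "with", "w"}


def color_tokens(text):
    """Lowercase a color description and reduce it to a set of color words."""
    return set(re.findall(r"[a-z]+", text.lower())) - FILLER_WORDS


def jaccard(a, b):
    """Token overlap: 1.0 means the same set of color words, 0.0 means none shared."""
    tokens_a, tokens_b = color_tokens(a), color_tokens(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def fuzzy_ratio(a, b):
    """Character-level similarity, useful for spelling variants such as grey/gray."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With these, `jaccard("Black & Tan", "tan and black")` treats the two descriptions as identical, while a plain string comparison would not.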

### Quality Issue 2:

We'll do a head, tail, and sample again.

In [578]:
reg_df.head()

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
3793,Doberman (pup),,M,10973 - #4,2001-01-17,20088.0,,200,,Black & Tan,,,,POLICE-DAY,Sold,8
3796,Greyhound X Lab,,M/N,"11048,MB",2001-01-17,20057.0,,200,,Brindle,,,,BROUGHT-IN,Redeemed,15
3797,Collie X (pup),,M,"10965,MC",2001-01-19,20059.0,Rex,200,,Black & White,,,,HOLDING STRAY,Redeemed,16
3798,Shep X,,M/N,"10968,JS",2001-01-19,20045.0,,200,,White,,,,BROUGHT-IN,Redeemed,16
3805,Rottweiler,2000-09-13,F/S,10966 ar,2000-09-06,20032.0,Sara,200,,Black & Tan,,,,COMPLAINT,Sold,1


In [579]:
reg_df.tail()

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
20811,Pit Bull,,M,DI 10-193205,2010-06-12,20072.0,Cane,200,2010-06-15,Tan & White Chest,,80 lbs,,HOLDING STRAY,Redeemed,15
20815,Shih Tzu X,2010-06-22,F/S,DA 10-195737,2010-06-13,20044.0,ThimbleBerry (new name),200,2010-07-03,White,green,,,HOLDING STRAY,Sold,MA#11
20819,Cocker Spaniel X,,M,DG 10-193082,2010-06-14,20074.0,Bit Bit,200,2010-06-14,Tan,,35 lbs,,HOLDING STRAY,Redeemed,tan
20823,Chihuahua,,M,10 - 193326 - skj;,2010-06-15,20005.0,Brainy,200,2010-06-16,Tan & White,,2 lbs,,POLICE-NIGHT,Redeemed,12
20825,Chihuahua,,F,10 - 193330 - skj;,2010-06-15,20035.0,Princess,200,2010-06-16,White & Tan,,2 lbs,,POLICE-NIGHT,Redeemed,12


In [580]:
reg_df.sample(5)

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
21975,Beagle,,M/N,11 - 187207,2011-05-31,20060.0,Lucky,200,2011-05-31,Brown /white/ tan,,50 lbs,,HOLDING STRAY,Redeemed,ks19
12671,Bichon Frise,,M,CT9,2005-08-18,40035.0,Buddy,400,,White,,10 lbs.,,HOLDING STRAY,Ride Home Free,14
28512,Schipperke,,M,n/c,2017-09-15,,Puck,400,2017-09-15,Black,,,Adult,HOLDING STRAY,Ride Home Free,32
887,Border Collie X Lab,,M/N,"8145, MC 7",1999-07-16,20040.0,Amigo,200,,Black & Tan,,,,,Redeemed,18
34957,Coonhound malamute x,,F,n/a - RHF,2024-08-13,,Haha,400,2024-08-14,Black w/white,,,Adult,HOLDING STRAY,Ride Home Free,43


Since there are duplicates, we will need to drop those.

In [581]:
reg_df.duplicated().value_counts()

False    26039
True        41
Name: count, dtype: int64

Using the results from the `.info()` method, we can quickly visually parse about how many null values there are per column.

In [582]:
reg_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26080 entries, 3793 to 20825
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Breed            26072 non-null  object 
 1   ShotsDate        3752 non-null   object 
 2   Sex              25794 non-null  object 
 3   ReceiptNumber    21683 non-null  object 
 4   DateImpounded    26080 non-null  object 
 5   PitNumber        17763 non-null  float64
 6   Name             23332 non-null  object 
 7   KennelNumber     26059 non-null  object 
 8   DispositionDate  12315 non-null  object 
 9   Color            26035 non-null  object 
 10  Code             1173 non-null   object 
 11  ApproxWeight     11643 non-null  object 
 12  Age category     8029 non-null   object 
 13  Source           24106 non-null  object 
 14  Status           26071 non-null  object 
 15  ACO              21311 non-null  object 
dtypes: float64(1), object(15)
memory usage: 3.4+ MB


`Age category` is the only column with a space character, as well as not being in `CamelCase`. Additionally, these columns are `CamelCase` instead of all lowercase, like the column labels in the `lost and found` dataset.

In [583]:
reg_df['Age category'].value_counts()   # Puppy being a category assumes the animal is a dog

Age category
Adult          4909
Young Adult    1544
Senior         1132
Puppy           444
Name: count, dtype: int64

Issue and justification: Immediately, we can see that there are a lot of null values, especially in the `Code`, `ShotsDate`, and `Age category` columns. Each of those has no more than about 30% of its values filled. It would be difficult, however, to infer anything that answers our question from any of these columns.
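
As a quick check of the "about 30%" figure, the fill rates can be worked out from the non-null counts in the `.info()` output above (26080 rows in total):

```python
TOTAL_ROWS = 26080  # "Index: 26080 entries" from reg_df.info()

# Non-null counts copied from the .info() output
non_null = {"Code": 1173, "ShotsDate": 3752, "Age category": 8029}

fill_rate = {col: round(100 * n / TOTAL_ROWS, 1) for col, n in non_null.items()}
print(fill_rate)  # {'Code': 4.5, 'ShotsDate': 14.4, 'Age category': 30.8}
```

On the DataFrame itself, `reg_df.notna().mean()` would give the same proportions for every column.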

The best columns for making inferences are probably `name`, `color`, `sex`, and `breed`, so we can probably just drop the columns that are mostly null. For `name` and the other columns previously mentioned, we can probably just replace the null values with something like "Unknown".

All of these null values make correlating animals based on similar factors less reliable, especially if something critical like the name isn't disclosed, or the color is not reported the way intake staff would describe it. Names are also likely the strongest differentiator between two animals with similar breeds and coat colors.

There are also duplicate values, but those will be easy to handle. The biggest issue, I feel, is the sheer abundance of missing values.

### Tidiness Issue 1:

I'm of the opinion that validating the 3 qualities of tidiness is most easily begun visually with `.head()`/`.tail()`, then by drilling down programmatically with `.value_counts()`, queries, `.columns`, and more.

In [584]:
lf_df.head()  # let's ground ourselves in the data again

Unnamed: 0,date,breed,color,name,sex,state
0,1999-01-03T00:00:00+00:00,Rotty X Shep,Black & Tan,Tex,M/N,Lost
1,1999-01-04T00:00:00+00:00,Dog,Light Colour,,M/N,Found
2,1999-01-04T00:00:00+00:00,Golden Lab X,Black & Tan,Oscar,M,Lost
3,1999-01-04T00:00:00+00:00,Shep X,Black & Tan,,F,Found
4,1999-01-04T00:00:00+00:00,Shep X Collie,Black & Tan,Angel,F,Lost


In [585]:
lf_df.tail()

Unnamed: 0,date,breed,color,name,sex,state
17835,2025-07-21T00:00:00+00:00,Cat - DSH,Creamy white/brown,Hylia,F/S,Lost
17836,2025-07-22T00:00:00+00:00,.Unknown Breed Mix,Cream,Bebe,F/S,Matched
17837,2025-07-23T00:00:00+00:00,Orange Tabby,orange & white,Murzik/Gingie,M/N,Lost
17838,2025-07-24T00:00:00+00:00,Border Collie X,"Black, brown w white",Tilly,F,Lost
17839,2025-07-24T00:00:00+00:00,Tabby,Brown & White,Puss,F/S,Lost


Based on the output above, each column holds a distinct type of variable. Some columns, however, store multiple variables in a single data point, especially `sex`, which also records whether an animal has been spayed or neutered. This column, and another investigated further below, can be split into additional columns.

In [586]:
lf_df.sex.value_counts()

sex
M      5099
M/N    4552
F      3900
F/S    3682
X       216
Name: count, dtype: int64

In [587]:
print("There are", lf_df.breed.unique().shape[0], "unique breeds in the dataset.")

breed_s = pd.Series(lf_df.breed.unique()).sort_values(na_position='first', ignore_index=True)
breed_s.head()

There are 4041 unique breeds in the dataset.


0                            None
1          (Miniature) Pomeranian
2              .Unknown Breed Mix
3      1 Pit Bull & 1 Terrier mix
4    1 Pitbull & 1 Bernese/Poodle
dtype: object

In [588]:
breed_s.tail(10)

4031    stratfordshire terrier
4032                     tabby
4033    tabby Grey black brown
4034       tabby short haireds
4035               terrier (?)
4036                 terrier X
4037             terrier X pug
4038      very small ( 5 lbs )
4039             westy terrier
4040              yorkie cross
dtype: object

In [589]:
# Raw string so that \d is passed to the regex engine instead of being read as a string escape
breed_s.str.extractall(r"(.*\d.*)")

,match,0
3,0,1 Pit Bull & 1 Terrier mix
4,0,1 Pitbull & 1 Bernese/Poodle
5,0,1/2 Pit & 1/2 Presa
6,0,2 Dachunds
7,0,2 Great Pyrenese
...,...,...
3640,0,Staff Terrier #2
3662,0,Stafforshire Terrier 14 months
3973,0,lab and 2nd dog German Sheper
4029,0,small dog Benji type (10lbs)


Issue and justification: It seems the `breed` column could be useful for finding single rows with multiple animals reported (e.g., `1 Pit Bull & 1 Terrier mix`). A tidier approach would be to split those into separate rows. If the rows need to stay associated with the original report, they could share a report ID; however, I will not need that information to answer my question.

By having multiple animals in one row, this not only inaccurately represents the number of animals reported missing, but also makes grouping or filtering by the reported breed unreliable.
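
One possible way to split such rows is sketched below, under the assumption that multi-animal entries follow a `<count> <breed>` pattern joined by `&` or `and`. The helper `split_multi_animal` is hypothetical, and it deliberately leaves anything that does not fit the pattern (fractions such as `1/2 Pit & 1/2 Presa`, ages, weights) untouched.

```python
import re

JOINER = re.compile(r"\s*(?:&|\band\b)\s*")
COUNTED = re.compile(r"(\d+)\s+(.+)")


def split_multi_animal(breed):
    """Return one breed string per animal described in a single breed entry."""
    parts = JOINER.split(breed)
    matches = [COUNTED.fullmatch(part) for part in parts]
    if not all(matches):
        return [breed]  # not a "<count> <breed>" entry, keep as-is
    animals = []
    for match in matches:
        count, name = int(match.group(1)), match.group(2).strip()
        animals.extend([name] * count)
    return animals
```

In pandas, the resulting lists could then be turned into one row per animal with `DataFrame.explode`.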

The `sex` column could also be split into two columns, where one solely stores the biological sex of the animal and the other stores if they are neutered or spayed. This can just be a boolean value for simplicity's sake.
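
A minimal sketch of that split, assuming the codes seen in `value_counts()` above are the only ones that occur. The helper `split_sex` is hypothetical. Treating a plain `M` or `F` as "not altered" is also an assumption; the source data may simply not record it.

```python
SEX_CODES = {
    "M": ("male", False),
    "M/N": ("male", True),     # neutered
    "F": ("female", False),
    "F/S": ("female", True),   # spayed
}


def split_sex(code):
    """Map a combined code to (sex, altered); unknown codes give ('unknown', None)."""
    if not isinstance(code, str):
        return ("unknown", None)
    return SEX_CODES.get(code.strip().upper(), ("unknown", None))
```

Applied with `Series.map`, this yields tuples that can be unpacked into two new columns.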

### Tidiness Issue 2: 

In [590]:
reg_df.head()

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
3793,Doberman (pup),,M,10973 - #4,2001-01-17,20088.0,,200,,Black & Tan,,,,POLICE-DAY,Sold,8
3796,Greyhound X Lab,,M/N,"11048,MB",2001-01-17,20057.0,,200,,Brindle,,,,BROUGHT-IN,Redeemed,15
3797,Collie X (pup),,M,"10965,MC",2001-01-19,20059.0,Rex,200,,Black & White,,,,HOLDING STRAY,Redeemed,16
3798,Shep X,,M/N,"10968,JS",2001-01-19,20045.0,,200,,White,,,,BROUGHT-IN,Redeemed,16
3805,Rottweiler,2000-09-13,F/S,10966 ar,2000-09-06,20032.0,Sara,200,,Black & Tan,,,,COMPLAINT,Sold,1


In [591]:
reg_df.sort_index().head()

AnimalID,Breed,ShotsDate,Sex,ReceiptNumber,DateImpounded,PitNumber,Name,KennelNumber,DispositionDate,Color,Code,ApproxWeight,Age category,Source,Status,ACO
1,Pit Bull,2005-06-18,M/N,17057 ks,2005-06-12,20038.0,Taz,200,,Tan,,45,,HOLDING STRAY,Sold,3.0
2,English Setter,,M/N,20372 BC3,2006-03-19,20041.0,Dudley,200,,White & Brown,,100lbs,,HOLDING STRAY,Redeemed,
3,Lab X,,M/N,N/C BC,2006-07-24,40041.0,Evander,400,,Black,,,,HOLDING STRAY,Ride Home Free,5.0
4,Pomeranian,,M,"22171,MC",2006-09-08,20017.0,Chewbaca,200,,Brown,,10 lbs.,,HOLDING STRAY,Redeemed,14.0
5,American Bulldog X,2006-10-07,M/N,22565 - skj,2006-09-06,20053.0,Cadillac,200,,White with Brown Patch on Eye,,80 lbs,,HOLDING STRAY,Sold,22.0


Skimming through the breed column revealed multiple occurrences of a string like "`<breed-name> X`". Immediately, I am unsure what this may indicate, but I have a presumption that I will elaborate upon when I describe the issue I'll be addressing with this dataset.

In [592]:
reg_df.query('AnimalID == 3').iloc[0]  # iloc will return a series, which is a bit easier to read for a single row

Breed                       Lab X
ShotsDate                     NaN
Sex                           M/N
ReceiptNumber              N/C BC
DateImpounded          2006-07-24
PitNumber                 40041.0
Name                      Evander
KennelNumber                  400
DispositionDate               NaN
Color                       Black
Code                          NaN
ApproxWeight                  NaN
Age category                  NaN
Source              HOLDING STRAY
Status             Ride Home Free
ACO                             5
Name: 3, dtype: object

Looking a little further, it looks like `Lab X` is incredibly common.

In [593]:
reg_df['Breed'].str.extractall("(.*[lL]ab.*)").value_counts()

0                     
Lab X                     872
Lab                       571
Labrador                  495
Labrador X                 78
Shep X Lab                 78
                         ... 
Black Lab X Pit             1
Black Lab/GSD               1
Black X Lab                 1
Heeler X Lab X Bassett      1
rotty X lab                 1
Name: count, Length: 490, dtype: int64

Issue and justification: There are a few places where some improvements can be made. 
- `Name` - Some rows contain "`(New name)`," which is tracking two factors of data in a single column.
  - This could be resolved with another column of boolean values, but I do not think it will be entirely necessary for our cause. Therefore, we will remove occurrences of `(New name)`
- `Breed` - Some breeds have "`mix`" while others have "`X`". Presumably, `X` also means "mix," but this is merely an assumption. Additionally, `Black Lab` is not a breed; the breed is Labrador Retriever.
  - One way to handle this is to see if the `Breed` field contains the `Color` of the animal, then trimming that out if it does.

Another general issue is that this field has inconsistent capitalization. Eventually, I will normalize all of the string fields so that they are all in lowercase.
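
The two fixes described above could look roughly like this. Both helpers are hypothetical sketches, and reading a standalone `X` (or `cross`) as "mix" is the same unverified assumption stated in the list.

```python
import re


def strip_new_name(name):
    """Remove a '(new name)' marker, in any capitalization, from a name."""
    return re.sub(r"\s*\(new name\)", "", name, flags=re.IGNORECASE).strip()


def normalize_breed(breed, color_tokens=()):
    """Lowercase a breed, unify 'x'/'cross' as 'mix', and drop embedded color words."""
    text = breed.lower().strip()
    # Assumption: a standalone "x" or "cross" means a mixed breed
    text = re.sub(r"\b(?:x|cross)\b", "mix", text)
    for color in color_tokens:
        text = re.sub(rf"\b{re.escape(color.lower())}\b", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_breed("Black Lab X", ["Black"])` gives `"lab mix"`, where the color tokens would come from that row's `Color` value.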

## 3. Clean data

There are a few operations I plan to do before we begin the real cleaning:
1. Lowercase all strings
2. Lowercase all column names and remove spaces

Following that, we will address the four issues brought up during phase 2.

In [594]:
# Make copies of the datasets to ensure the raw dataframes are not impacted
lf_df_clean = lf_df.copy()
reg_df_clean = reg_df.copy()

reg_df_clean.rename(columns=lambda x: x.strip().lower().replace(" ", ""), inplace=True)

for col in lf_df_clean.columns:
    lf_df_clean[col] = lf_df_clean[col].str.lower()

reg_df_clean.index.rename(reg_df_clean.index.name.lower(), inplace=True)

for col in reg_df_clean.columns:
    if col == 'pitnumber': continue  # this is type float64

    reg_df_clean[col] = reg_df_clean[col].str.lower()

Now we'll validate...

In [595]:
# Random sample of the lost and found table is all lowercase?
lf_df_clean.sample(1).to_string().islower()

False
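
A `False` here does not necessarily mean the lowercasing failed. One likely explanation is that the sampled row contains a missing value, which `to_string()` renders as `NaN` or `None`, both of which contain capital letters. Because the row is random, the result can also change between runs. A deterministic alternative, sketched with a hypothetical helper, checks only the actual strings:

```python
def all_lowercase(values):
    """True if no string in values contains an uppercase character.

    Non-string entries (None, NaN) are skipped rather than counted as failures.
    """
    return all(v == v.lower() for v in values if isinstance(v, str))
```

It could be run per column, e.g. `all(all_lowercase(lf_df_clean[col]) for col in lf_df_clean.columns)`.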

In [596]:
# Random sample of the register table is all lowercase?
reg_df_clean.dropna().sample(1).to_string().islower()  # dropna will prevent NaN -> "NaN"

True

In [597]:
# All columns lowercase?
str(reg_df_clean.columns.tolist()).islower()

True

### **Quality Issue 1: to '&' or to 'and?'**

For the issue of inconsistent formatting in the lost and found data's `color` column, we will replace the "&" and "and" separators (along with "/", "w/", and "with") and split the results into a list. This gets the colors into a form we can access programmatically right out of the gate.

This cleaning will be applied to both datasets, as I noticed that (understandably), the same issue exists within the register dataset.

We will need to start off by handling the null values in the `color` column, though, so that we don't have to repeat the same steps of converting it to an array with the second quality issue.

In [598]:
# Getting ahead of ourselves a bit, but handling null color values
lf_df_clean.fillna({'color':'unknown'}, inplace=True)
reg_df_clean.fillna({'color':'unknown'}, inplace=True)

In [599]:
# Replace " w/", "/", " and ", " with ", and any single punctuation character surrounded by spaces (e.g. " & ") with "|", then split into a list on "|"
pattern = r"( w\/| [^\w\s] | and |/| with )"
lf_df_clean['color'] = lf_df_clean['color'].str.replace(pat=pattern, repl="|", regex=True)
lf_df_clean['color'] = lf_df_clean['color'].str.split("|")

# The same thing, but for reg_df
reg_df_clean['color'] = reg_df_clean['color'].str.replace(pat=pattern, repl="|", regex=True)
# we aren't going to split, yet, as this will cause some errors with the next quality issue.

In [600]:
# Validation cleaning succeeded for lost and found data
lf_df_clean['color'].sample(5)

14767    [gold, brown, black, white]
17538         [brown, orange, white]
12789            [black, tan, white]
6667                        [blonde]
16017           [grey, black, merle]
Name: color, dtype: object

In [601]:
# Validation on register data
reg_df_clean['color'].sample(5)

animalid
24504                 green
28746           black|white
21764                   tan
35004    blue, white|yellow
26794                 black
Name: color, dtype: object

In [602]:
# Further validation
# Before -> After
print(lf_df.iloc[372].color, "->", lf_df_clean.iloc[372].color)
print(lf_df.iloc[424].color, "->", lf_df_clean.iloc[424].color)
print(lf_df.iloc[439].color, "->", lf_df_clean.iloc[439].color)
print("\nSplitting reg_df_clean will happen later")
print(reg_df.iloc[372].Color, "->", reg_df_clean.iloc[372].color)
print(reg_df.iloc[424].Color, "->", reg_df_clean.iloc[424].color)
print(reg_df.iloc[439].Color, "->", reg_df_clean.iloc[439].color)

Reddish Blonde -> ['reddish blonde']
Cream  & Rust -> ['cream ', 'rust']
Grey & White & Black -> ['grey', 'white', 'black']

Splitting reg_df_clean will happen later
Brown -> brown
Blonde -> blonde
Black Brindle -> black brindle
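
Note the trailing space in `'cream '` above: splitting leaves whatever whitespace surrounded the separator. A small sketch that uses the same separators but strips each token and drops empty ones is shown below. `split_colors` is a hypothetical helper, not the cell that produced the output above, and like the original pattern it leaves comma-separated colors unsplit.

```python
import re

# Same alternatives as `pattern` above, without the capturing group
COLOR_SEPARATOR = re.compile(r" w/| [^\w\s] | and |/| with ")


def split_colors(text):
    """Split a lowercase color description into clean, non-empty tokens."""
    return [token.strip() for token in COLOR_SEPARATOR.split(text) if token.strip()]
```

This could be applied with `Series.map(split_colors)` in place of the replace-then-split pair.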


Justification: This normalizes the large majority of the colors in these datasets, which makes them significantly easier to compare. It will also make our aggregation results a bit more reliable, since differences between "&" and "and" no longer matter.

### **Quality Issue 2: Handling Nulls and Duplicates**

There are lots of null values throughout the datasets, which then requires us to handle them specially. If we instead set them to a string, this will simplify life a tad. We also don't want duplicate values, as they are just redundant records. Thankfully, dupes are only in the register data.

We'll start by dropping what we don't want, being duplicates and a few columns that don't help answer our question.

In [603]:
cols_to_drop = ['shotsdate','kennelnumber', 'pitnumber', 'code', 'aco']
reg_df_clean.drop(columns=cols_to_drop, inplace=True)

In [604]:
reg_df_clean.drop_duplicates(inplace=True)  # must run before `color` is split into lists, which are unhashable

Now we can validate that our changes did what we intended

In [605]:
# Before -> After
print(reg_df.duplicated().any(), "->", reg_df_clean.duplicated().any())

True -> False


We'd get an error reporting that a list is not hashable if we tried to `drop_duplicates` or look for `duplicated` rows on a split `color` column. Since we have all of that handled, we can split it into a list now.
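
The underlying reason can be shown in plain Python: duplicate detection relies on hashing, and lists cannot be hashed, while tuples can. A minimal illustration with made-up rows:

```python
# Illustrative rows only: (breed, colors)
rows = [
    ("lab x", ["black", "white"]),
    ("lab x", ["black", "white"]),
    ("beagle", ["tan"]),
]

try:
    set(rows)  # hashing a tuple that contains a list fails
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

# Converting each list to a tuple makes the rows hashable again
hashable_rows = [(breed, tuple(colors)) for breed, colors in rows]
unique_rows = list(dict.fromkeys(hashable_rows))  # de-duplicate, keep order
```

If duplicates ever need to be dropped after the split, converting `color` to tuples first would be one way around the error.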

In [606]:
reg_df_clean['color'] = reg_df_clean['color'].str.split("|")

In [607]:
# Replace all remaining nulls with the string 'unknown'
lf_df_clean.fillna('unknown', inplace=True)
reg_df_clean.fillna('unknown', inplace=True)

In [608]:
# Validate that no nulls remain in either dataset
reg_df_clean.isnull().any()

breed              False
sex                False
receiptnumber      False
dateimpounded      False
name               False
dispositiondate    False
color              False
approxweight       False
agecategory        False
source             False
status             False
dtype: bool

In [609]:
lf_df_clean.isnull().any()

date     False
breed    False
color    False
name     False
sex      False
state    False
dtype: bool

Justification: Now that we have our nulls and duplicates handled, our data should be less tricky to work with, especially with aggregation, comparison, and enhancement.

### **Tidiness Issue 1: FILL IN**

In [610]:
#FILL IN - Apply the cleaning strategy

In [611]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [612]:
#FILL IN - Apply the cleaning strategy

In [613]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [614]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store
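
As one hedged illustration of the naming guidance above, a hypothetical helper could tag each file with its stage and date, extending the dated-filename pattern used when the register data was downloaded. This is only a sketch of a naming scheme, not the storage step itself.

```python
import datetime


def versioned_filename(dataset, stage, day, ext="csv"):
    """Build names like 'vancouver-ac-registry_raw_20250724.csv'."""
    if stage not in {"raw", "clean"}:
        raise ValueError("stage must be 'raw' or 'clean'")
    return f"{dataset}_{stage}_{day:%Y%m%d}.{ext}"
```

Keeping the stage in the name lets the raw and cleaned versions sit side by side without overwriting each other.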

In [615]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [616]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN