# Introduction to Data Analysis Using Python Libraries

![Data Analysis Process from Career Foundry](images/the-data-analysis-process.jpeg)

[Image Source](https://careerfoundry.com/en/blog/data-analytics/the-data-analysis-process-step-by-step/)

There are many ways to break down the process of exploring data - I like this representation because it keeps things high-level. What this visual doesn't capture is how cyclical the process often is! Cleaning your data once does not mean you're done with that step: you'll likely return to it repeatedly as you hone your analysis and visualize your findings more effectively.

We'll gloss over steps 1 and 2 here, and focus mostly on steps 3-5.

Let's get started!

![Front of the Austin Animal Center building](images/austin-animal-center-front.jpeg)

## Scenario

The Austin Animal Center keeps great records of its animal intakes, but it's a lot of data. Your task: process the data and find some initial high-level insights into the trends it contains. But insights left in a notebook like this one are wasted - you'll also want to visualize what you've found so you can showcase it to others!

## The Data

[Austin Animal Center Intakes Data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm/) - updated pretty much every day!

The data for today's session was downloaded on 2/15/2022.

## Getting Started

We'll use two libraries today: [Pandas](https://pandas.pydata.org/) for data parsing and manipulation, and [Plotly](https://plotly.com/python/) for data visualization.

First things first, we want to answer some basic questions about our data:
- What does the data look like?
- What is the shape of the data?
- What data types are in each column?
- What are the most common entries in each column?

In [None]:
# Imports
# Pandas for data manipulation

# Plotly for data visualization


In [None]:
# Read in the data


In [None]:
# What does the data look like?
# Let's look at the first five rows


In [None]:
# Look at the shape


In [None]:
# What data types are in each column?


In [None]:
# What are the most common entries in each column?
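Here is a minimal sketch of those first-look questions. It reads a tiny, made-up CSV from a string so it runs anywhere; the column names (`Animal ID`, `DateTime`, `Animal Type`, `Breed`) and all the rows are assumptions for illustration, and with the real download you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV (the real file is far larger)
csv_text = """Animal ID,DateTime,Animal Type,Breed
A1,01/03/2019 04:19:00 PM,Dog,Beagle Mix
A2,07/05/2015 12:59:00 PM,Dog,Pit Bull Mix
A3,04/14/2016 06:43:00 PM,Cat,Domestic Shorthair Mix
A4,10/21/2013 07:59:00 AM,Cat,Domestic Shorthair Mix
A5,06/29/2014 10:38:00 AM,Other,Bat
A6,02/20/2020 03:15:00 PM,Dog,Beagle Mix
"""

# Read in the data (swap the StringIO object for a file path in practice)
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column

# Most common entries, one column at a time
for column in df.columns:
    print(df[column].value_counts().head(3))
```

Looping over `value_counts` is only one way to see common entries; `df.describe()` is another option worth trying.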


## The Questions

1. What kind of animals are brought into the Center?
2. What are the top 10 most common dog breeds brought into the Center?
3. How has the number of animals brought in changed over time?

## Question 1: What kind of animals are brought into the Center?

We need just one column for this - the Animal Type.

In [None]:
# Explore the breakdown of the Animal Type column


In [None]:
# Wow - birds and livestock make up such a small percentage
# Let's use replace to lump them in with 'Other' for a more effective visualization


In [None]:
# Now let's see how that changed
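One way to approach these three steps, sketched on a small made-up `Animal Type` column (the category labels `Bird` and `Livestock` are assumed to match how they appear in the data):

```python
import pandas as pd

# Hypothetical sample of the Animal Type column
df = pd.DataFrame(
    {"Animal Type": ["Dog", "Dog", "Dog", "Cat", "Cat", "Other", "Bird", "Livestock"]}
)

# Breakdown as proportions, before any grouping
print(df["Animal Type"].value_counts(normalize=True))

# Fold the rare categories into 'Other'
df["Animal Type"] = df["Animal Type"].replace({"Bird": "Other", "Livestock": "Other"})

# See how the breakdown changed
counts = df["Animal Type"].value_counts()
print(counts)
```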


In [None]:
# Capture that output in a variable - then reset the index to make it a dataframe


In [None]:
# Explore the variable we just created - we should rename these columns!


In [None]:
# So let's do that - rename the columns to actually describe the data


In [None]:
# Check our work
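A sketch of turning the counts into a tidy dataframe. The default column names produced by `reset_index` on a `value_counts` result differ between pandas versions, so this example assigns both names directly rather than renaming specific defaults. The name `Count` is just an illustrative choice.

```python
import pandas as pd

# Hypothetical Animal Type values
animal_types = pd.Series(
    ["Dog", "Dog", "Dog", "Cat", "Cat", "Other"], name="Animal Type"
)

# value_counts returns a Series; reset_index turns it into a two-column dataframe
type_counts = animal_types.value_counts().reset_index()

# Give both columns names that describe the data
type_counts.columns = ["Animal Type", "Count"]

print(type_counts)
```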


In [None]:
# Visualize it! With the world's most controversial chart... a pie chart
# https://plotly.com/python/pie-charts/


## Question 2: What are the top 10 most common dog breeds brought into the Center?

We'll want to look only at dogs to find the top 10 most common breeds.

In [None]:
# Segment to only dogs, in a variable we'll call df_dogs
# Time for a locate statement!
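The "locate statement" here is `.loc` with a boolean condition. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical intake rows
df = pd.DataFrame(
    {
        "Animal Type": ["Dog", "Cat", "Dog", "Other", "Dog"],
        "Breed": ["Beagle Mix", "Domestic Shorthair Mix", "Pit Bull Mix", "Bat", "Pit Bull"],
    }
)

# The comparison builds a True/False mask; .loc keeps the True rows.
# .copy() gives an independent dataframe that is safe to modify later.
df_dogs = df.loc[df["Animal Type"] == "Dog"].copy()

print(df_dogs.shape)
print(df_dogs.head())
```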


In [None]:
# Explore our new dataframe


In [None]:
# Now, we can explore the breeds - let's look at that column's value counts


In [None]:
# Already we can see some dirty data - let's look at pit bulls
# We can use a string method inside a locate statement to explore!


In [None]:
# Let's see how many unique ways 'pit bull' appears in the breed column
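A sketch of the pit bull exploration, assuming the breed column is named `Breed` and using invented breed strings. `str.contains` builds the mask, and `nunique` counts the distinct spellings that matched.

```python
import pandas as pd

# Hypothetical breed entries for dogs
df_dogs = pd.DataFrame(
    {
        "Breed": [
            "Pit Bull Mix",
            "Pit Bull",
            "Pit Bull/Labrador Retriever",
            "Beagle Mix",
            "Pit Bull Mix",
            "Labrador Retriever/Pit Bull",
        ]
    }
)

# case=False ignores capitalization; na=False treats missing breeds as non-matches
pit_mask = df_dogs["Breed"].str.contains("Pit Bull", case=False, na=False)
pit_bulls = df_dogs.loc[pit_mask]

print(pit_bulls["Breed"].value_counts())

# Number of distinct ways 'pit bull' shows up
n_variants = pit_bulls["Breed"].nunique()
print(n_variants)
```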


Behold! Welcome to data cleaning! Or, as it's often called in cases like this - data wrangling!

This is definitely part of the job, and can be a frustrating part. For example, let's look at how a Chief Decision Scientist at Google ran into a similar issue:

![screenshot from LinkedIn, as a Chief Decision Scientist at Google laments the dozens of different ways people input 'Philadelphia' in a table](images/data-wrangling-philadelphia.png)

This is something to keep in mind - whenever someone can input data, it will often be inconsistent! How you deal with that, and how you go about cleaning your data to make it usable, can have a very real impact on your analysis!

Let's first try without cleaning at all, then make one simple change and see how that impacts our visualization.

In [None]:
# Save the top 10 breeds without cleaning as a new variable
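One possible way to grab the top 10: `value_counts` sorts from most to least common, so `head(10)` keeps the ten largest. The breed names below are placeholders generated for the example.

```python
import pandas as pd

# Hypothetical data: 12 placeholder breeds, where "Breed N" appears N times
breed_sizes = {f"Breed {i}": i for i in range(1, 13)}
df_dogs = pd.DataFrame(
    {"Breed": [name for name, n in breed_sizes.items() for _ in range(n)]}
)

# Counts come back sorted in descending order, so head(10) is the top 10
top_10_breeds = df_dogs["Breed"].value_counts().head(10)

print(top_10_breeds)
```

The result is a Series with breeds in the index and counts as values, which is worth keeping in mind when passing it to a chart.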


In [None]:
# What does this object look like?


In [None]:
# Time to visualize - let's use a bar chart!
# https://plotly.com/python/bar-charts/


Time for one simple change: let's drop " Mix" from the values in the column, and see how that affects the top 10.

In [None]:
# Let's use string methods to replace " Mix" with nothing ("")
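A sketch of that one change on made-up breeds (column name `Breed` assumed). Passing `regex=False` keeps the replacement a plain substring match regardless of pandas version.

```python
import pandas as pd

# Hypothetical breed entries
df_dogs = pd.DataFrame(
    {"Breed": ["Pit Bull Mix", "Pit Bull", "Beagle Mix", "Beagle", "Pit Bull Mix"]}
)

# Before: 'Pit Bull Mix' and 'Pit Bull' are counted separately
print(df_dogs["Breed"].value_counts())

# Replace " Mix" with nothing
df_dogs["Breed"] = df_dogs["Breed"].str.replace(" Mix", "", regex=False)

# After: the mixes are folded into their base breed
no_mix_counts = df_dogs["Breed"].value_counts()
print(no_mix_counts)
```

Note that this is a cleaning decision, not a neutral one: it treats a mix as the same thing as the named breed.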


In [None]:
# Save the top 10 no mix breeds as a new variable


In [None]:
# Visualization round two!


## Question 3: How has the number of animals brought in changed over time?

Here we'll need to look at our DateTime column - but also get an idea of the number of animals arriving per day. Time for a group by!

In [None]:
# Let's explore our DateTime column using describe


In [None]:
# Pandas isn't recognizing this as a datetime object - let's fix that


# Note - this code might take a second to run
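A sketch of the conversion with `pd.to_datetime`. The timestamp strings and the explicit `format` are assumptions about how the column is written; if the format is left out, pandas will try to infer it, which can be slower on a large column.

```python
import pandas as pd

# Hypothetical timestamps stored as plain strings
df = pd.DataFrame(
    {
        "DateTime": [
            "01/03/2019 04:19:00 PM",
            "07/05/2015 12:59:00 PM",
            "01/03/2019 09:05:00 AM",
        ]
    }
)

# Assumed layout: month/day/year, 12-hour clock with AM/PM
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%m/%d/%Y %I:%M:%S %p")

print(df["DateTime"].dtype)
print(df["DateTime"].min(), df["DateTime"].max())
```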

In [None]:
# Check our work using describe again


In [None]:
# We won't need the hour/minute/second data - just the date
# We can use normalize on the datetime (.dt) accessor of this column to fix it


In [None]:
# Let's save that output as a new column, Date
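A small sketch of `dt.normalize`, which resets the time portion to midnight while keeping the datetime type (the sample timestamps are made up):

```python
import pandas as pd

# Hypothetical intake timestamps, already converted to datetimes
df = pd.DataFrame(
    {
        "DateTime": pd.to_datetime(
            ["2019-01-03 16:19:00", "2019-01-03 09:05:00", "2015-07-05 12:59:00"]
        )
    }
)

# Same day, different times collapse to the same value
df["Date"] = df["DateTime"].dt.normalize()

print(df)
```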


In [None]:
# Check our work - let's use info


In [None]:
# Now - time for that group by!
# Let's explore what's happening in the groupby, then save it to a variable
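A sketch of the group by, assuming each row is one intake, so counting rows per `Date` gives animals per day. Using `size()` is one choice among several; counting a specific column would also work.

```python
import pandas as pd

# Hypothetical intakes: three on Jan 3 and one on Jan 4
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-01-03", "2019-01-03", "2019-01-04", "2019-01-03"]
        )
    }
)

# size() counts the rows in each group
daily_intakes = df.groupby("Date").size()

print(daily_intakes)
```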


In [None]:
# Time for a line chart!
# https://plotly.com/python/line-charts/


In [None]:
# Whoa - that's a bit messy. Let's just look at a monthly breakdown
# We can resample, then grab the sum per month
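A sketch of resampling daily counts to monthly totals. The daily numbers are invented, and the `"MS"` (month start) frequency is my choice here because it labels each month by its first day; other monthly aliases exist.

```python
import pandas as pd

# Hypothetical daily intake counts with a datetime index
daily_intakes = pd.Series(
    [3, 1, 2, 4],
    index=pd.to_datetime(["2019-01-03", "2019-01-20", "2019-02-01", "2019-02-15"]),
    name="Intakes",
)

# Bucket the days by month, then add up each bucket
monthly_intakes = daily_intakes.resample("MS").sum()

print(monthly_intakes)
```

Resampling needs a datetime index, which is what grouping by a datetime column produces.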


In [None]:
# Time for another line chart!


In [None]:
# Looks like we have an annual trend - let's take a better look...
# Let's go back to our original dataframe and create a new groupby for this...
# First - grab out the Year and Month as new columns
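Pulling the year and month out of a datetime column can be sketched with the `.dt` accessor (dates below are made up):

```python
import pandas as pd

# Hypothetical intake dates
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-01-03", "2019-12-25", "2020-02-29"])}
)

# .dt.year and .dt.month return integers
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

print(df)
```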


In [None]:
# Check our work...


In [None]:
# A new groupby - now with two columns to group by!
# Let's explore, reset the index for clarity, then save to a variable
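A sketch of grouping by two columns at once. The count column name `Intakes` is an illustrative choice, and the rows are invented.

```python
import pandas as pd

# Hypothetical Year and Month values, one row per intake
df = pd.DataFrame(
    {
        "Year": [2019, 2019, 2019, 2020, 2020],
        "Month": [1, 1, 2, 1, 1],
    }
)

# Grouping by two columns produces a two-level index;
# reset_index flattens it back into ordinary columns
monthly_by_year = df.groupby(["Year", "Month"]).size().reset_index(name="Intakes")

print(monthly_by_year)
```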


In [None]:
# Check this new variable we just created


In [None]:
# One last line chart!


### Thank you for joining us!