# Using Python Libraries

## First Up: Numeric Python

![numpy](https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/numpy.png)

[NumPy](https://www.numpy.org/) is the fundamental package for scientific computing with Python. Many other data science packages, especially those that work with matrices, rely on it for its speed and utility.

For NumPy, the standard alias is `np`.

In [19]:
# Import numpy
import numpy as np

### NumPy Arrays

Python lists and NumPy arrays can both hold numbers. However, Python lists have limited functionality for mathematical operations. NumPy arrays make it easy and fast to do math with a collection of numbers.

In [20]:
# Explore a numpy array
x = np.array([1, 2, 3])
print(x)
print(type(x))

[1 2 3]
<class 'numpy.ndarray'>


Let's make a list using base Python and an array using NumPy, and see how they behave differently:

In [21]:
# Create a list in base Python of 3 integers
numbers_list = [2,4,6]
# Create a numpy array containing the same 3 integers
numbers_array = np.array([2,4,6])

### Arithmetic Operations

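Arithmetic operators (e.g. `+`, `-`, `*` and `/`) applied to an array act on each element individually, the way you'd expect from math. These operations are done "element-wise". Lists behave differently: `*` repeats the list and `+` only concatenates lists.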

In [22]:
# Multiply the array by 3
numbers_array * 3

array([ 6, 12, 18])

In [23]:
# Multiply the list by 3
# Base Python list just repeats the list
numbers_list * 3

[2, 4, 6, 2, 4, 6, 2, 4, 6]

In [24]:
# Add 20 to the array
# You can do math with NumPy! 
numbers_array + 20

array([22, 24, 26])

In [25]:
# Add 20 to the list
# Python list has no idea what to do
numbers_list + 20

TypeError: can only concatenate list (not "int") to list
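
For comparison, here is a small sketch of what it takes to get an element-wise result out of a plain list: you have to loop yourself, for example with a list comprehension. `numbers_list` and `numbers_array` mirror the cells above; `list_plus_20` and `array_plus_20` are names made up for this illustration.

```python
import numpy as np

numbers_list = [2, 4, 6]
numbers_array = np.array([2, 4, 6])

# A list needs an explicit loop (here, a list comprehension) to add 20 to each item
list_plus_20 = [n + 20 for n in numbers_list]

# The array applies the operation to every element at once
array_plus_20 = numbers_array + 20

print(list_plus_20)   # [22, 24, 26]
print(array_plus_20)  # [22 24 26]
```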

### Speed

Below, you will find a piece of code we will use to compare the speed of operations on lists vs arrays.

In [None]:
# Setting the size of our iterables
size_of_vec = 1000

# Creating two lists of that size
X = list(range(size_of_vec))
Y = list(range(size_of_vec))

In [None]:
# Timing how long it takes to add each element in the two lists
# The list comprehension below is basically an in-line for loop, the output of which is a list.
# %timeit is an IPython "magic" command (not a decorator): it runs the statement many times
# and reports the average time per loop, in whatever unit fits (ns, µs, ms, ...).
%timeit [X[i] + Y[i] for i in range(size_of_vec)]

In [None]:
# Now let's try with numpy arrays
X = np.array(range(size_of_vec))
Y = np.array(range(size_of_vec))

In [26]:
# Much simpler code, since it's easier to do element-wise math
%timeit X + Y

1.35 µs ± 77.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
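
`%timeit` only exists inside IPython/Jupyter. As a rough sketch, the same comparison can be run in a plain Python script with the standard library's `timeit` module. The repeat count `number=1000` and the variable names below are arbitrary choices for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

size_of_vec = 1000

# Plain Python lists
X_list = list(range(size_of_vec))
Y_list = list(range(size_of_vec))

# NumPy arrays holding the same values
X_arr = np.arange(size_of_vec)
Y_arr = np.arange(size_of_vec)

# Time each approach; number=1000 is an arbitrary repeat count
list_seconds = timeit.timeit(
    lambda: [X_list[i] + Y_list[i] for i in range(size_of_vec)], number=1000
)
array_seconds = timeit.timeit(lambda: X_arr + Y_arr, number=1000)

print(f"lists:  {list_seconds:.4f} s for 1000 runs")
print(f"arrays: {array_seconds:.4f} s for 1000 runs")

# Both approaches compute the same values
list_sum = [X_list[i] + Y_list[i] for i in range(size_of_vec)]
array_sum = X_arr + Y_arr
```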


## Next Up: Importing, Reading and Manipulating Data with ACTUAL LITERAL PANDAS

![I have no idea what I'm doing panda](https://cdn-images-1.medium.com/max/1600/1*oBx032ncOwLmCFX3Epo3Zg.jpeg)

Just kidding - but Pandas is a great library to work with relational data. 

[Check out the documentation!](https://pandas.pydata.org/pandas-docs/stable/) (always a great idea)

Note that we didn't go into a lot of NumPy's functionality, but here's something cool - Pandas is built on top of NumPy! That means they work really well together, and that Pandas has some math functionality already built in.

If you'd like to read more about NumPy and Pandas, [here is an interesting blog post](https://cloudxlab.com/blog/numpy-pandas-introduction/) discussing them.

Let's dive into some data from the Austin Animal Shelter. 

Data source: [intakes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and [outcomes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

Today we'll be working with the intakes data, which I've already downloaded and included in the repository.

In [27]:
# Import
import pandas as pd

Before reading in the data, we need to know what format the data is in and where exactly the data can be found, so we can tell Pandas what to do.

In [28]:
# Where is our data?
# Figuring this out using a shell command (the leading ! runs it from the notebook)
!ls data/

Austin_Animal_Center_Intakes_030921.csv
Austin_Animal_Center_Outcomes_030921.csv


In [29]:
# Read in the comma-separated-value (csv) document as df
# parse_dates tells Pandas to read the DateTime column as dates/times instead of leaving it as strings
df = pd.read_csv('data/Austin_Animal_Center_Intakes_030921.csv', parse_dates=['DateTime'])

In [None]:
# parse_dates is an argument to read_csv, and it is not destructive: it doesn't change the original file
# Never do anything to the original file!
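
To see what `parse_dates` changes, here is a minimal sketch that reads the same CSV text twice. The two rows in `csv_text` are made up for illustration (they are not from the shelter file), and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file
csv_text = "Name,DateTime\nRex,2019-01-03 16:19:00\nLuna,2015-07-05 12:59:00\n"

# Without parse_dates, DateTime is read in as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateTime"])

print(raw["DateTime"].dtype)
print(parsed["DateTime"].dtype)
```

Either way, only the dataframe in memory differs; the text being read is never modified.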

What options do we have when we read in a csv? Let's look at the documentation!

[Convenient link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

I happen to know that there is a column in the data named 'DateTime' (run the code below to check it out before adjusting our read-in code!) - let's use an argument to read it in as a datetime object, then discuss.

In [None]:
df['DateTime']

### Initial Exploration of a Dataframe

Questions to ask yourself:

- How big is the data?
- Are there any empty cells? 
- What are the datatypes of the columns of data?

In [None]:
# What does this dataframe look like?
# Check out the first 3 rows (head() with no argument shows the first 5)
df.head(3)

In [None]:
# Check out the shape of the df
# shape (and size, below) are attributes of this dataframe, not methods, so they have no parentheses and you can't pass in parameters.
df.shape

In [None]:
# And then the size (number of cells)
df.size

In [None]:
# And then look at some info on the df
df.info()

In [None]:
# Describe the columns
# If this prints a warning: a warning is not an error. It is telling me that a future version of pandas is going to treat something a little differently.
# Lindsey usually ignores warnings, but she reads them. 
df.describe()

**A note on `.describe()`:** this method behaves differently depending on whether we feed in object columns (such as strings) or numeric types. We'll explore this more later.
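
Here is a small sketch of that difference on a made-up dataframe (`toy`, `animal_type` and `age_years` are invented for this example, not columns from the shelter data).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "animal_type": ["Dog", "Cat", "Dog", "Bird"],
        "age_years": [2, 5, 1, 3],
    }
)

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy["age_years"].describe()

# String column: count, unique, top (most common value), freq
object_summary = toy["animal_type"].describe()

print(numeric_summary)
print(object_summary)
```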

**And a question:** You see that some of the ways we dealt with our dataframe required `()` and some did not - why is that?

- Methods vs. attributes 
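
A quick sketch of the distinction, using a made-up dataframe called `toy`: attributes are stored facts about the object, while methods are functions attached to it that you call with `()`.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Rex", "Luna", "Milo"], "age_years": [2, 5, 1]})

# An attribute is stored information about the object: no parentheses
print(toy.shape)  # (3, 2)

# A method is a function attached to the object: parentheses call it,
# and arguments can go inside them
print(toy.head(2))

# Without parentheses you get the method itself, not its result
print(callable(toy.head))   # True
print(callable(toy.shape))  # False
```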


### Accessing Columns

Use brackets and the exact column name to access a particular column.

In [None]:
# Access columns using bracket notation 
# A SINGLE COLUMN IS A SERIES!!!

type(df['Name'])

In [None]:
# You can force a single column to be a dataframe by passing a list of column names (double brackets):
type(df[['Name']])

In [None]:
df[['Name']].head()

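You can also use `.` notation if the column name is a valid Python identifier (no spaces, for example) and doesn't clash with an existing dataframe attribute or method.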

In [None]:
df.Name
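
To show where `.` notation breaks down, here is a sketch on a made-up dataframe. The columns `Animal Type` and `size` are chosen on purpose: one has a space, the other shares its name with a built-in dataframe attribute.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["Rex", "Luna"],
        "Animal Type": ["Dog", "Cat"],  # space in the name
        "size": ["large", "small"],     # clashes with the DataFrame.size attribute
    }
)

# Works: "Name" is a valid Python identifier and not an existing attribute
same_column = toy.Name.equals(toy["Name"])

# toy.Animal Type would be a SyntaxError, so brackets are required
animal_types = toy["Animal Type"]

# toy.size returns the number of cells, not the "size" column
cell_count = toy.size
size_column = toy["size"]

print(same_column, cell_count, list(size_column))
```

Bracket notation always works, which is why it is the safer default.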

### Dealing with Datetime Objects

You can access parts of a datetime column using `.dt` - an accessor attribute of the column, not a method!

In [None]:
# This raises an AttributeError because the "MonthYear" column was read in as strings (dtype object), not datetimes.
df['MonthYear'].dt.year

In [None]:
# Let's check out the intake year
df['DateTime'].dt.year

In [None]:
# How do we create a new column?
# Let's create a new column for intake year
df['IntakeYear'] = df['DateTime'].dt.year

In [None]:
# Check our work
df['IntakeYear']

In [None]:
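# What datatype is the data in our new column?
# type(df['IntakeYear']) would only tell us it is a Series; .dtype shows the type of the values inside
df['IntakeYear'].dtype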
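
The `.dt` accessor only works on datetime columns, which is why the `MonthYear` cell above fails. One way around that, sketched here on made-up strings (the real format of `MonthYear` isn't shown in this notebook), is to convert the column with `pd.to_datetime` first.

```python
import pandas as pd

toy = pd.DataFrame({"MonthYear": ["2019-01-03 16:19:00", "2015-07-05 12:59:00"]})

# .dt is only available on datetime-like columns, so convert the strings first
toy["MonthYear"] = pd.to_datetime(toy["MonthYear"])

# Now the accessor exposes pieces of each timestamp
toy["Year"] = toy["MonthYear"].dt.year
toy["Month"] = toy["MonthYear"].dt.month

print(toy)
```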

### Checking for Null Values

You can use `.isna()` or `.isnull()` - they do the same thing!

In [None]:
# Check it - is the result what you expect?
df.isna()

In [None]:
df.shape

In [None]:
# How can you make that result more usable?
df.isna().sum()
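
Why does `.sum()` help? A minimal sketch on a made-up dataframe: `isna()` returns True/False for every cell, and True counts as 1 when added up. Taking the mean instead gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["Rex", None, "Milo", None],
        "Age": [2.0, 5.0, np.nan, 1.0],
    }
)

# isna() gives True/False per cell; True counts as 1 when summed
missing_counts = toy.isna().sum()

# mean() of the Booleans gives the fraction missing per column
missing_fraction = toy.isna().mean()

print(missing_counts)
print(missing_fraction)
```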

### Checking for Duplicate Rows

In [None]:
# The method is called duplicated - check the documentation!
# With subset=['Name'], a row is flagged if its Name already appeared in an earlier row.
# Leave out subset to check whether entire rows are the same.
df.duplicated(subset=['Name'])

In [None]:
# Can use the same trick as above (.sum()) on duplicated to make the result more usable
# This counts the rows that share a Name with at least one other row.
# keep=False marks every copy of a duplicate as True, including the first occurrence
df.duplicated(subset=['Name'], keep=False).sum()
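
The `subset` and `keep` arguments change the answer quite a bit. Here is a sketch on four made-up rows showing three variations side by side.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["Rex", "Rex", "Luna", "Rex"],
        "Breed": ["Beagle", "Beagle", "Tabby", "Poodle"],
    }
)

# Whole-row comparison: only the second Rex/Beagle row repeats an earlier row
full_row_dupes = toy.duplicated().sum()

# subset=['Name'] compares the Name column only; the default keep='first'
# leaves the first occurrence unflagged
name_dupes_first = toy.duplicated(subset=["Name"]).sum()

# keep=False flags every row whose Name appears more than once
name_dupes_all = toy.duplicated(subset=["Name"], keep=False).sum()

print(full_row_dupes, name_dupes_first, name_dupes_all)  # 1 2 3
```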

### Dropping Columns or Rows

There are several different methods depending on what we're doing, but the one to discuss right now is `.drop`.

In [None]:
# Let's drop the MonthYear column, which holds the same information as our DateTime column
# Not doing anything to the underlying file; only changing the df object
df = df.drop(['MonthYear'], axis=1)

In [None]:
# Check our work here...
df.head()

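Fun thing about pandas: methods like `.drop` return a new dataframe instead of changing the original. To keep the change, either reassign the result to the variable (as above) or use `inplace=True`.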
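
A short sketch of both options on made-up data (`toy`, `other` and their values are invented for this example):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Rex"], "MonthYear": ["January 2019"], "Breed": ["Beagle"]})

# drop() returns a new DataFrame; the original is untouched unless you reassign
dropped = toy.drop(["MonthYear"], axis=1)
print(list(toy.columns))      # ['Name', 'MonthYear', 'Breed']
print(list(dropped.columns))  # ['Name', 'Breed']

# Option 1: reassign the result to the same variable name
toy = toy.drop(["MonthYear"], axis=1)

# Option 2: inplace=True modifies the object directly and returns None
other = pd.DataFrame({"Name": ["Luna"], "MonthYear": ["July 2015"]})
result = other.drop(["MonthYear"], axis=1, inplace=True)
print(result)  # None
```

A common mistake is combining the two, as in `df = df.drop(..., inplace=True)`, which leaves `df` set to `None`.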

### Renaming Columns

[Documentation for `.rename`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

In [None]:
# Let's remove spaces from the columns, and make all column names lowercase to be easier
# Can use a dictionary to rename
col_names = df.columns

In [None]:
col_names[0].replace(" ", "")

In [None]:
new_col_names = []
for c in col_names: 
    new_col_names.append(c.replace(" ", "").lower())

In [None]:
# Check the cleaned-up names
# (A lambda function passed to .rename can also do this clean-up in one step)
new_col_names

In [None]:
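col_dict = dict(zip(col_names, new_col_names))  # dict() is called with parentheses, not square brackets
df = df.rename(columns=col_dict)  # apply the mapping so later cells can use names like 'animaltype'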
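
For reference, here is a sketch of both renaming styles on a made-up dataframe: a dictionary of old-to-new names, and a function applied to every column name. The columns in `toy` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Animal ID": ["A1"], "Intake Type": ["Stray"], "Name": ["Rex"]})

# Option 1: a dictionary mapping old names to new names
col_dict = {c: c.replace(" ", "").lower() for c in toy.columns}
renamed_with_dict = toy.rename(columns=col_dict)

# Option 2: a function (here a lambda) applied to every column name
renamed_with_lambda = toy.rename(columns=lambda c: c.replace(" ", "").lower())

print(list(renamed_with_dict.columns))  # ['animalid', 'intaketype', 'name']
```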

### Slicing and Dicing

Perhaps your biggest tool for exploring your dataframes will be `.loc` (and its companion, `.iloc`). `.loc` selects rows and columns by label and lets you use conditionals to explore your data, while `.iloc` selects by integer position.
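
Here is a minimal sketch of the pattern on a made-up dataframe. The column names follow the lowercase, no-space style used in this notebook, but the rows (and the `intakeyear` values) are invented.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Rex", "Luna", "Milo", "Kiwi"],
        "animaltype": ["Dog", "Cat", "Dog", "Bird"],
        "intaketype": ["Stray", "Owner Surrender", "Stray", "Stray"],
        "intakeyear": [2016, 2019, 2020, 2017],
    }
)

# Rows where a condition holds: the Boolean Series acts as a row filter
strays = toy.loc[toy["intaketype"] == "Stray"]

# Not-equal works the same way
not_dogs = toy.loc[toy["animaltype"] != "Dog"]

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
# The second argument to .loc picks columns by label.
early_stray_names = toy.loc[
    (toy["intaketype"] == "Stray") & (toy["intakeyear"] < 2018), "name"
]

# .iloc selects by integer position instead: first two rows, first two columns
top_left = toy.iloc[:2, :2]

print(list(early_stray_names))  # ['Rex', 'Kiwi']
```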

In [None]:
# Example: look only at animals with intake type 'Stray'
df.loc[df['intaketype'] == 'Stray']

In [None]:
# Second example: animals where the animal type is not dog


In [None]:
# And a third - animals found before 2018


## Let's Start to Answer Questions!

#### Question 1: What is the most common Animal Type?

In [None]:
# Let's explore the Animal Type column to find out
df['animaltype'].value_counts()

In [None]:
# Another way - look above at describe, or run another describe
# 'Top' for an object column means 'most common'


#### Question 2: What is the most common dog breed to come into the shelter?
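
One possible approach, sketched on made-up rows rather than the real shelter data: filter down to dogs, then count the breeds. The breed values here are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "animaltype": ["Dog", "Cat", "Dog", "Dog", "Cat"],
        "breed": ["Beagle", "Domestic Shorthair", "Pit Bull", "Beagle", "Siamese"],
    }
)

# Keep only the dog rows
dogs = toy.loc[toy["animaltype"] == "Dog"]

# value_counts() tallies each breed; idxmax() returns the label with the largest count
breed_counts = dogs["breed"].value_counts()
most_common_breed = breed_counts.idxmax()

print(breed_counts)
print(most_common_breed)  # Beagle
```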

In [None]:
# Let's create a new df, dogs, for all dogs in the original data


In [None]:
# Now it's easier to look at common dog breeds


#### Question 3: What percentage of animals have come into the shelter in a condition other than "Normal"?
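
As a sketch of two ways to get a percentage without typing in any numbers by hand, here is the calculation on four made-up rows. The condition values are invented; only the `intakecondition` column name follows this notebook's renamed style.

```python
import pandas as pd

toy = pd.DataFrame(
    {"intakecondition": ["Normal", "Injured", "Normal", "Sick"]}
)

# Boolean Series: True where the condition is anything other than "Normal"
not_normal = toy["intakecondition"] != "Normal"

# One way: count the True values and divide by the number of rows
pct_by_count = not_normal.sum() / len(toy) * 100

# Another way: the mean of a Boolean Series is the fraction of True values
pct_by_mean = not_normal.mean() * 100

print(pct_by_count, pct_by_mean)  # 50.0 50.0
```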

In [None]:
# Need to explore the proper column


In [None]:
# Want to use pandas to calculate, not inputting number manually


In [None]:
# Calculate percentage


In [None]:
# Other way to calculate


## Now - Outcomes Data!

Let's explore together if we have time! If not - extra credit!