![pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2880px-Pandas_logo.svg.png)

# Objectives

- Load .csv files into `pandas` DataFrames
- Describe and manipulate data in Series and DataFrames

# What is Pandas?

![I have no idea what I'm doing panda](https://cdn-images-1.medium.com/max/1600/1*oBx032ncOwLmCFX3Epo3Zg.jpeg)

Just kidding - not actual literal pandas.

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful. The name comes from "panel data", a term used in certain academic circles (namely, statistics and econometrics) for the kind of multidimensional data we'll be working with. [[Source]](https://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf)

In order to use pandas, we'll need to import it into our notebook first.

In [1]:
# Import - using the common alias
import pandas as pd

## Accessing Data

![pandas documentation image showcasing the kinds of data it can both read and write to](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)

[[Image Source]](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html)

Pandas can read from (and write to) a ton of different file formats, including some that should be familiar: CSV and JSON! That's right, no more `with` / `open` statements now that we're using pandas!

Most of the time, we'll see CSVs - so let's access a 'toy' data set quickly just to familiarize ourselves with using pandas. There's a heart dataset available in the data folder on this repository - let's read that in.

In [3]:
# Use read_csv to read in the heart csv file
# Need to assign it to a variable too - let's call this heart_df
heart_df = pd.read_csv('~/Desktop/heart.csv')

The output of the `pd.read_csv()` function is a pandas *DataFrame*, which has a familiar tabular structure of rows and columns.

In [4]:
# Let's check this variable out
heart_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [5]:
# What type is this variable?
type(heart_df)

pandas.core.frame.DataFrame
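
If the heart file isn't on disk, `pd.read_csv` can still be tried out on an in-memory string. This is only a minimal sketch: the three rows are copied from the output above, `io.StringIO` stands in for a file path, and the names `csv_text` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# A few rows copied from the heart dataset output above
csv_text = """age,sex,cp,trestbps,chol
63,1,3,145,233
37,1,2,130,250
41,0,1,130,204
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)
```

The first line of the text becomes the column names, and pandas assigns the default 0, 1, 2, ... index to the rows.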

## DataFrames and Series

Two main types of pandas objects are the DataFrame and the Series, the latter being in effect a single column of the former:

In [6]:
# Let's grab just one column
heart_df['age']

0      63
1      37
2      41
3      56
4      57
       ..
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: int64

Notice how we can isolate a column of our DataFrame simply by using square brackets together with the name of the column. We can also access columns as an attribute of the DataFrame - but that only works if the column name is a valid Python identifier (no spaces or special characters) and doesn't clash with an existing DataFrame attribute or method!

In [7]:
# Grab the same column as an attribute
heart_df.age

0      63
1      37
2      41
3      56
4      57
       ..
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: int64

In [8]:
# What type is the column?
type(heart_df['age'])

pandas.core.series.Series
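
To see where attribute access falls short, here is a small sketch using a made-up DataFrame (`toy_df` and its columns are hypothetical, not part of the heart data). One column name contains a space, and the other shares its name with the DataFrame method `count`.

```python
import pandas as pd

# Hypothetical frame: one name has a space, the other matches a DataFrame method
toy_df = pd.DataFrame({'Animal Type': ['Dog', 'Cat'], 'count': [3, 5]})

# Bracket access works for any column name
animal_types = toy_df['Animal Type']
counts = toy_df['count']

# Attribute access finds the DataFrame.count method, not the 'count' column
count_attr = toy_df.count

print(type(counts))
print(callable(count_attr))
```

Square brackets are the safer habit: they behave the same way no matter what the column is called.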

Both Series and DataFrames have an *index* as well:

In [9]:
# Check out the dataframe index
heart_df.index

RangeIndex(start=0, stop=303, step=1)

In [10]:
# Check out the series index
heart_df['age'].index

RangeIndex(start=0, stop=303, step=1)

DataFrames have columns - but a Series is just a single column, so it doesn't have the columns attribute.

In [12]:
# Check out the dataframe columns
heart_df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [15]:
# Now try with the series
heart_df['age'].columns

AttributeError: 'Series' object has no attribute 'columns'

Pandas is built on top of NumPy, and we can always get a DataFrame's data as a NumPy array using `.values` (newer code often uses the equivalent `.to_numpy()` method).

In [18]:
# Check out the dataframe values
heart_df.values

array([[63.,  1.,  3., ...,  0.,  1.,  1.],
       [37.,  1.,  2., ...,  0.,  2.,  1.],
       [41.,  0.,  1., ...,  0.,  2.,  1.],
       ...,
       [68.,  1.,  0., ...,  2.,  3.,  0.],
       [57.,  1.,  0., ...,  1.,  3.,  0.],
       [57.,  0.,  1., ...,  1.,  2.,  0.]])
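
Notice that every number in the array above is a float, even though most columns are integers. A possible explanation, sketched below with a made-up two-column frame (`mixed_df` is hypothetical): a NumPy array holds a single dtype, so mixing an integer column with a float column upcasts everything to float.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
mixed_df = pd.DataFrame({'age': [63, 37], 'oldpeak': [2.3, 3.5]})

# Each column keeps its own dtype inside the DataFrame
print(mixed_df.dtypes)

# A NumPy array has a single dtype, so the integers are upcast to floats
arr = mixed_df.to_numpy()
print(arr.dtype)
print(arr)
```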

## More Basic DataFrame Attributes and Methods

### `.head()` : first 5 rows

In [23]:
heart_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [24]:
heart_df[:5]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### `.tail()` : last 5 rows

In [27]:
heart_df.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


### `.info()` : information about the columns, including about nulls in those columns

In [28]:
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


### `.describe()` : statistics about the data

In [40]:
heart_df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


### `.dtypes` : data types of each column

In [32]:
type(heart_df.dtypes)

pandas.core.series.Series

### `.shape` : number of rows and columns

In [31]:
heart_df.shape

(303, 14)

### Statistics

We saw them above in the `.describe()` output, but we can also calculate statistics by calling each method individually.

In [44]:
# Calculate the mean - for the whole dataframe!
heart_df.mean()

age          54.366337
sex           0.683168
cp            0.966997
trestbps    131.623762
chol        246.264026
fbs           0.148515
restecg       0.528053
thalach     149.646865
exang         0.326733
oldpeak       1.039604
slope         1.399340
ca            0.729373
thal          2.313531
target        0.544554
dtype: float64

Let's pause and interpret - any observations?

- 


In [45]:
# Now min
heart_df.min()

age          29.0
sex           0.0
cp            0.0
trestbps     94.0
chol        126.0
fbs           0.0
restecg       0.0
thalach      71.0
exang         0.0
oldpeak       0.0
slope         0.0
ca            0.0
thal          0.0
target        0.0
dtype: float64

In [46]:
# And max
heart_df.max()

age          77.0
sex           1.0
cp            3.0
trestbps    200.0
chol        564.0
fbs           1.0
restecg       2.0
thalach     202.0
exang         1.0
oldpeak       6.2
slope         2.0
ca            4.0
thal          3.0
target        1.0
dtype: float64
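
The same methods also work on a single column, and `.agg` can bundle several statistics into one call. The sketch below uses three rows copied from the heart output above; `stats_df` is just an illustrative name.

```python
import pandas as pd

# Three rows copied from the heart dataset output above
stats_df = pd.DataFrame({'age': [63, 37, 41], 'chol': [233, 250, 204]})

# One statistic on a single column returns a single number
mean_age = stats_df['age'].mean()

# .agg computes several statistics at once: rows are statistics, columns are features
summary = stats_df.agg(['mean', 'min', 'max'])

print(mean_age)
print(summary)
```

One caveat, as far as I know: in recent pandas versions (2.0 and later), calling `.mean()` on a DataFrame that contains text columns raises an error unless `numeric_only=True` is passed. The heart data is all numeric, so it isn't affected.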

## Enough With The Small Stuff - Bring On Real Data!

Let's access an open data portal and get some real live data!

Austin Animal Center Intake Data: https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm/

In [64]:
# Accessing a CSV from a url
animals_from_url = pd.read_csv('https://data.austintexas.gov/resource/wter-evkm.csv')

In [65]:
animals_from_url.head()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A852217,A852217,2022-02-28T10:56:00.000,2022-02-28T10:56:00.000,1901 Onion Creek Parkway in Austin (TX),Stray,Normal,Other,Unknown,1 year,Ferret,Brown/Tan
1,A852216,,2022-02-28T10:50:00.000,2022-02-28T10:50:00.000,Outside Jurisdiction,Stray,Normal,Cat,Unknown,4 months,Domestic Shorthair,Black/White
2,A852215,,2022-02-28T10:36:00.000,2022-02-28T10:36:00.000,William Cannon in Austin (TX),Stray,Normal,Dog,Unknown,10 years,Chihuahua Shorthair Mix,White/Tan
3,A850918,Kaia,2022-02-28T10:34:00.000,2022-02-28T10:34:00.000,Outside Jurisdiction,Owner Surrender,Normal,Cat,Spayed Female,6 years,Domestic Shorthair Mix,White
4,A850917,Daphne,2022-02-28T10:34:00.000,2022-02-28T10:34:00.000,Outside Jurisdiction,Owner Surrender,Normal,Cat,Spayed Female,6 years,Manx Mix,Torbie


In [53]:
# Same as the JSON output from this API endpoint, but different levels of detail for dates!
pd.read_json('https://data.austintexas.gov/resource/wter-evkm.json').head()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A852217,A852217,2022-02-28 10:56:00,2022-02-28T10:56:00.000,1901 Onion Creek Parkway in Austin (TX),Stray,Normal,Other,Unknown,1 year,Ferret,Brown/Tan
1,A852216,,2022-02-28 10:50:00,2022-02-28T10:50:00.000,Outside Jurisdiction,Stray,Normal,Cat,Unknown,4 months,Domestic Shorthair,Black/White
2,A852215,,2022-02-28 10:36:00,2022-02-28T10:36:00.000,William Cannon in Austin (TX),Stray,Normal,Dog,Unknown,10 years,Chihuahua Shorthair Mix,White/Tan
3,A850918,Kaia,2022-02-28 10:34:00,2022-02-28T10:34:00.000,Outside Jurisdiction,Owner Surrender,Normal,Cat,Spayed Female,6 years,Domestic Shorthair Mix,White
4,A850917,Daphne,2022-02-28 10:34:00,2022-02-28T10:34:00.000,Outside Jurisdiction,Owner Surrender,Normal,Cat,Spayed Female,6 years,Manx Mix,Torbie


In [66]:
# But this is only 1000 rows... website says there's 137K rows!
animals_from_url.shape

(1000, 12)

In [57]:
# It's a limitation of the API - let's just download the data instead
# It's in the data folder
animal_df = pd.read_csv('~/Desktop/Austin_Animal_Center_Intakes_021822.csv')

In [59]:
# Now let's explore those earlier attributes and methods on this dataset!
# Check the first 5 rows
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [60]:
# Check the last 5 rows
animal_df.tail()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
136518,A851512,*Lea,02/14/2022 10:30:00 AM,February 2022,4907 Strass in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Black
136519,A851544,*Vera,02/14/2022 02:34:00 PM,February 2022,700 East Live Oak in Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Australian Kelpie Mix,Black/White
136520,A851553,*Tommy Bahama,02/14/2022 04:19:00 PM,February 2022,9Th And Red River in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Chinese Sharpei Mix,Red
136521,A851754,,02/18/2022 09:35:00 AM,February 2022,1411 Gracy Farms Lane in Austin (TX),Stray,Pregnant,Cat,Intact Female,2 years,Domestic Shorthair,Gray Tabby
136522,A851742,Storm,02/17/2022 05:32:00 PM,February 2022,Thorton Road in Austin (TX),Stray,Normal,Dog,Spayed Female,2 years,Pit Bull,White/Black


In [61]:
# Check the shape
animal_df.shape

(136523, 12)

In [62]:
# Check the datatypes
animal_df.dtypes

Animal ID           object
Name                object
DateTime            object
MonthYear           object
Found Location      object
Intake Type         object
Intake Condition    object
Animal Type         object
Sex upon Intake     object
Age upon Intake     object
Breed               object
Color               object
dtype: object

In [63]:
# Check more general information on the dataframe
animal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136523 entries, 0 to 136522
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         136523 non-null  object
 1   Name              95561 non-null   object
 2   DateTime          136523 non-null  object
 3   MonthYear         136523 non-null  object
 4   Found Location    136523 non-null  object
 5   Intake Type       136523 non-null  object
 6   Intake Condition  136523 non-null  object
 7   Animal Type       136523 non-null  object
 8   Sex upon Intake   136522 non-null  object
 9   Age upon Intake   136523 non-null  object
 10  Breed             136523 non-null  object
 11  Color             136523 non-null  object
dtypes: object(12)
memory usage: 12.5+ MB


In [68]:
# Check summary/descriptive statistics on the dataframe
animal_df.describe()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
count,136523,95561,136523,136523,136523,136523,136523,136523,136522,136523,136523,136523
unique,122029,23018,95988,101,57672,6,15,5,5,53,2729,613
top,A721033,Max,09/23/2016 12:00:00 PM,June 2015,Austin (TX),Stray,Normal,Dog,Intact Male,1 year,Domestic Shorthair Mix,Black/White
freq,33,619,64,2189,25645,94009,117672,76925,44888,23391,32018,14258
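
For text columns, `.describe()` swaps the numeric statistics for `count`, `unique`, `top` and `freq`. A tiny made-up example (the `names` Series is hypothetical) makes those four rows easier to read:

```python
import pandas as pd

# Hypothetical text column with one missing value
names = pd.Series(['Max', 'Bella', 'Max', None, 'Luna'], name='Name')

# count = non-missing values, unique = distinct values,
# top = most common value, freq = how often the top value appears
summary = names.describe()
print(summary)

# value_counts() shows the full tally behind 'top' and 'freq'
print(names.value_counts())
```

Read this way, the table above says the most common name is Max (619 times), and that the same `Animal ID` can appear more than once.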


#### Any Observations?

- 


## Adding to a DataFrame

### Adding Rows

We have a new animal coming in, captured here in a Python dictionary:

In [78]:
# Dictionary, where keys match the column names and values are the row values
next_row = {'Animal ID': 'A851755',
            'Name': "T'Challa",
            'DateTime': '2/28/2022 11:25:00 AM',
            'MonthYear': 'February 2022',
            'Found Location': 'Houston (TX)',
            'Intake Type': 'Public Assist',
            'Intake Condition': 'Normal',
            'Animal Type': 'Cat',
            'Sex upon Intake': 'Neutered Male',
            'Age upon Intake': '4 years',
            'Breed': 'Domestic Shorthair',
            'Color': 'Black'}
next_row

{'Animal ID': 'A851755',
 'Name': "T'Challa",
 'DateTime': '2/28/2022 11:25:00 AM',
 'MonthYear': 'February 2022',
 'Found Location': 'Houston (TX)',
 'Intake Type': 'Public Assist',
 'Intake Condition': 'Normal',
 'Animal Type': 'Cat',
 'Sex upon Intake': 'Neutered Male',
 'Age upon Intake': '4 years',
 'Breed': 'Domestic Shorthair',
 'Color': 'Black'}

How can we add this to the bottom of our dataset?

In [99]:
animal_df.index.max() + 1

136523

In [105]:
# Let's first turn this into a DataFrame
new_row = pd.DataFrame(next_row, index=[0])

In [106]:
new_row

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A851755,T'Challa,2/28/2022 11:25:00 AM,February 2022,Houston (TX),Public Assist,Normal,Cat,Neutered Male,4 years,Domestic Shorthair,Black


In [110]:
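# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True so the new row gets the next index label.
# concat returns a new DataFrame and leaves animal_df unchanged - assign the result if you want to keep it.
pd.concat([animal_df, new_row], ignore_index=True)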

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
...,...,...,...,...,...,...,...,...,...,...,...,...
136519,A851544,*Vera,02/14/2022 02:34:00 PM,February 2022,700 East Live Oak in Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Australian Kelpie Mix,Black/White
136520,A851553,*Tommy Bahama,02/14/2022 04:19:00 PM,February 2022,9Th And Red River in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Chinese Sharpei Mix,Red
136521,A851754,,02/18/2022 09:35:00 AM,February 2022,1411 Gracy Farms Lane in Austin (TX),Stray,Pregnant,Cat,Intact Female,2 years,Domestic Shorthair,Gray Tabby
136522,A851742,Storm,02/17/2022 05:32:00 PM,February 2022,Thorton Road in Austin (TX),Stray,Normal,Dog,Spayed Female,2 years,Pit Bull,White/Black


In [None]:
# Let's check the end to make sure we were successful!


### Adding (and Deleting) Columns

In [111]:
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


Adding a column is very easy in `pandas`. Let's add a new column to our dataset called "test", and set all of its values to 0.

In [112]:
# Create a new column, 'test', where every value in the col is 0
animal_df['test'] = 0

In [114]:
# Sanity check
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,test
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,0
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,0
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,0
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,0
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,0


But we don't need that - let's drop that column.

In [125]:
# Drop that test column
animal_df = animal_df.drop('test', axis=1)
# Same as
# animal_df.drop(columns='test', inplace=True)

In [127]:
# Sanity check
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
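
Why did we reassign `animal_df` when dropping the column? By default `.drop` returns a new DataFrame and leaves the original alone. A short sketch on a made-up frame (`demo_df` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a throwaway 'test' column
demo_df = pd.DataFrame({'Name': ['Belle', 'Rio'], 'test': [0, 0]})

# By default, drop returns a new DataFrame and leaves the original alone
dropped = demo_df.drop(columns='test')

print(list(demo_df.columns))
print(list(dropped.columns))
```

`drop(columns='test')` and `drop('test', axis=1)` are two spellings of the same operation.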


We can also do math with columns, or use operators like `+` to combine columns even when they aren't numeric!

We don't have any numeric data in this current dataset. But we can still create a combined "Type" column that combines the values of our Intake Type and Animal Type columns.

In [136]:
animal_df['Intake Type'] + " " + animal_df['Animal Type']

0         Stray Dog
1         Stray Dog
2         Stray Dog
3         Stray Cat
4         Stray Dog
            ...    
136518    Stray Cat
136519    Stray Dog
136520    Stray Dog
136521    Stray Cat
136522    Stray Dog
Length: 136523, dtype: object

In [130]:
# Create a new column, 'Type', from the two 'Type' columns
animal_df['Type'] = animal_df['Intake Type'] + " " + animal_df['Animal Type']

In [131]:
# Sanity check
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,Stray Cat
4,A682524,Rio,06/29/2014 10:38:00 AM,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,Stray Dog
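
Column operations run row by row, for text and numbers alike. The sketch below uses a made-up frame: `pets_df` and the `weight_lb` column are hypothetical and not in the Austin data. It also shows what happens when one of the pieces is missing.

```python
import pandas as pd

# Hypothetical frame; the None mimics a missing value
pets_df = pd.DataFrame({'Intake Type': ['Stray', 'Owner Surrender', 'Stray'],
                        'Animal Type': ['Dog', 'Cat', None],
                        'weight_lb': [40.0, 9.0, 12.0]})

# + on text columns concatenates row by row; a missing piece makes the result missing
pets_df['Type'] = pets_df['Intake Type'] + ' ' + pets_df['Animal Type']

# The same element-wise idea applies to arithmetic on numeric columns
pets_df['weight_kg'] = pets_df['weight_lb'] * 0.4536

print(pets_df[['Type', 'weight_kg']])
```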


### DateTime Data!

Useful reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

I mentioned we don't have any numeric data in here... let's make a new numeric column! Specifically, let's grab the year from our `DateTime` column and use that to create a new `Year` column. AKA, let's learn how to deal with datetime data!

In [132]:
# What type is the DateTime column now?
animal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136523 entries, 0 to 136522
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         136523 non-null  object
 1   Name              95561 non-null   object
 2   DateTime          136523 non-null  object
 3   MonthYear         136523 non-null  object
 4   Found Location    136523 non-null  object
 5   Intake Type       136523 non-null  object
 6   Intake Condition  136523 non-null  object
 7   Animal Type       136523 non-null  object
 8   Sex upon Intake   136522 non-null  object
 9   Age upon Intake   136523 non-null  object
 10  Breed             136523 non-null  object
 11  Color             136523 non-null  object
 12  Type              136523 non-null  object
dtypes: object(13)
memory usage: 13.5+ MB


In [134]:
# Let's make that a datetime datatype
animal_df['DateTime'] = pd.to_datetime(animal_df['DateTime'])

In [135]:
# Sanity check
animal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136523 entries, 0 to 136522
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         136523 non-null  object        
 1   Name              95561 non-null   object        
 2   DateTime          136523 non-null  datetime64[ns]
 3   MonthYear         136523 non-null  object        
 4   Found Location    136523 non-null  object        
 5   Intake Type       136523 non-null  object        
 6   Intake Condition  136523 non-null  object        
 7   Animal Type       136523 non-null  object        
 8   Sex upon Intake   136522 non-null  object        
 9   Age upon Intake   136523 non-null  object        
 10  Breed             136523 non-null  object        
 11  Color             136523 non-null  object        
 12  Type              136523 non-null  object        
dtypes: datetime64[ns](1), object(12)
memory usage: 13.5+ MB


In [141]:
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog
3,A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,Stray Cat
4,A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,Stray Dog


In [152]:
# Now we can access parts of a datetime object using .dt - an attribute!
animal_df['DateTime'].dt.year

# Kind of the same as:
# animal_df['MonthYear'].str[-4:].astype(int)

0         2019
1         2015
2         2016
3         2013
4         2014
          ... 
136518    2022
136519    2022
136520    2022
136521    2022
136522    2022
Name: DateTime, Length: 136523, dtype: int64

In [153]:
# Make that into a new column
animal_df['Year'] = animal_df['DateTime'].dt.year

In [161]:
# Sanity check
animal_df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog,2019
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog,2015
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog,2016
3,A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,Stray Cat,2013
4,A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,Stray Dog,2014
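
`pd.to_datetime` worked out the date format on its own above. When the format is known, it can also be spelled out, as in this sketch. The two strings are copied from the original `DateTime` column, and the format string is my reading of that layout, so treat it as an assumption.

```python
import pandas as pd

# Two timestamp strings copied from the DateTime column above
raw = pd.Series(['01/03/2019 04:19:00 PM', '07/05/2015 12:59:00 PM'])

# Explicit format: month/day/year, 12-hour clock, AM/PM marker
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

# The .dt accessor exposes the parts of each timestamp
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())
```

Besides `.dt.year`, the accessor offers pieces like `.dt.month` and `.dt.hour`, which is handy for building other numeric columns.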


# Filtering

We can use filtering techniques to see only certain rows of our data. Let's look at only animals taken into the center during or after 2020:

In [162]:
# Check which rows have an intake year greater than or equal to 2020
animal_df['Year'] >= 2020

0         False
1         False
2         False
3         False
4         False
          ...  
136518     True
136519     True
136520     True
136521     True
136522     True
Name: Year, Length: 136523, dtype: bool

In [163]:
# Let's explore an interesting property of boolean columns...
# Count the animals taken in during or after 2020 (each True counts as 1)
(animal_df['Year'] >= 2020).sum()

23029
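
Here is that boolean property on a handful of made-up years (`years` is hypothetical): `True` counts as 1 and `False` as 0, so summing counts the matches and averaging gives their share.

```python
import pandas as pd

# Hypothetical intake years
years = pd.Series([2019, 2015, 2021, 2020, 2022])

# Comparing a Series to a number gives a boolean Series
mask = years >= 2020

# True counts as 1 and False as 0
n_recent = mask.sum()       # how many rows match
share_recent = mask.mean()  # what fraction of rows match

print(n_recent, share_recent)
```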

But this only gives us True/False outputs... what if we want to really filter?

## `.loc` 

Using `.loc`, we can select only the rows where some condition is true. It takes in a boolean condition (one True/False value per row) and only outputs the rows where that condition is True!

> **Note:** locate (`.loc`) uses square brackets, not parentheses! Often, square brackets denote location-focused actions, like this one.

Let's try this first with the condition we just built, and locate all animals taken in during or after 2020.

In [168]:
# Create a subset dataframe of animals taken in during or after 2020
subset = animal_df.loc[animal_df['Year'] >= 2020]

In [169]:
# Sanity check
subset

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
7,A844350,*Ella,2021-10-15 11:40:00,October 2021,2112 East William Cannon Drive in Austin (TX),Stray,Normal,Cat,Intact Female,6 months,Domestic Shorthair,Brown Tabby,Stray Cat,2021
9,A818975,,2020-06-18 14:53:00,June 2020,Braker Lane And Metric in Travis (TX),Stray,Normal,Cat,Intact Male,4 weeks,Domestic Shorthair,Cream Tabby,Stray Cat,2020
22,A831808,,2021-04-02 11:16:00,April 2021,Austin (TX),Owner Surrender,Normal,Cat,Intact Female,1 month,Domestic Shorthair Mix,Tortie,Owner Surrender Cat,2021
24,A836850,,2021-06-15 12:37:00,June 2021,6111 Softwood Drive in Austin (TX),Public Assist,Pregnant,Dog,Intact Female,4 years,Pit Bull,Blue/White,Public Assist Dog,2021
26,A815227,Baby,2020-03-12 13:52:00,March 2020,12305 Zeller Lane in Austin (TX),Stray,Normal,Dog,Intact Female,1 month,Norfolk Terrier,Brown/Cream,Stray Dog,2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136518,A851512,*Lea,2022-02-14 10:30:00,February 2022,4907 Strass in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Black,Stray Cat,2022
136519,A851544,*Vera,2022-02-14 14:34:00,February 2022,700 East Live Oak in Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Australian Kelpie Mix,Black/White,Stray Dog,2022
136520,A851553,*Tommy Bahama,2022-02-14 16:19:00,February 2022,9Th And Red River in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Chinese Sharpei Mix,Red,Stray Dog,2022
136521,A851754,,2022-02-18 09:35:00,February 2022,1411 Gracy Farms Lane in Austin (TX),Stray,Pregnant,Cat,Intact Female,2 years,Domestic Shorthair,Gray Tabby,Stray Cat,2022


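We can return only certain columns, too. One way is to select a list of columns from the filtered `subset`; another is to pass that list to `.loc` as a second argument, after the condition. Both give the same result: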

In [173]:
# Select a list of columns from the subset we already made
subset[['Animal ID', 'DateTime', 'Type']]

Unnamed: 0,Animal ID,DateTime,Type
7,A844350,2021-10-15 11:40:00,Stray Cat
9,A818975,2020-06-18 14:53:00,Stray Cat
22,A831808,2021-04-02 11:16:00,Owner Surrender Cat
24,A836850,2021-06-15 12:37:00,Public Assist Dog
26,A815227,2020-03-12 13:52:00,Stray Dog
...,...,...,...
136518,A851512,2022-02-14 10:30:00,Stray Cat
136519,A851544,2022-02-14 14:34:00,Stray Dog
136520,A851553,2022-02-14 16:19:00,Stray Dog
136521,A851754,2022-02-18 09:35:00,Stray Cat


In [174]:
# Let's return just the 'Animal ID', 'DateTime' and 'Type' columns
animal_df.loc[animal_df['Year'] >= 2020, ['Animal ID', 'DateTime', 'Type']]

Unnamed: 0,Animal ID,DateTime,Type
7,A844350,2021-10-15 11:40:00,Stray Cat
9,A818975,2020-06-18 14:53:00,Stray Cat
22,A831808,2021-04-02 11:16:00,Owner Surrender Cat
24,A836850,2021-06-15 12:37:00,Public Assist Dog
26,A815227,2020-03-12 13:52:00,Stray Dog
...,...,...,...
136518,A851512,2022-02-14 10:30:00,Stray Cat
136519,A851544,2022-02-14 14:34:00,Stray Dog
136520,A851553,2022-02-14 16:19:00,Stray Dog
136521,A851754,2022-02-18 09:35:00,Stray Cat
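
To see that the two approaches above give the same answer, here is a minimal sketch on a tiny made-up table. `toy_df` and its values are assumptions invented for illustration, not rows from the shelter data.

```python
import pandas as pd

# Hypothetical toy data standing in for the shelter intake table
toy_df = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3', 'A4'],
    'Type': ['Stray Cat', 'Stray Dog', 'Stray Cat', 'Owner Surrender Dog'],
    'Year': [2019, 2020, 2021, 2022],
})

# Two steps: filter the rows first, then pick columns from the result
two_step = toy_df.loc[toy_df['Year'] >= 2020][['Animal ID', 'Type']]

# One step: row condition and column list in the same .loc call
one_step = toy_df.loc[toy_df['Year'] >= 2020, ['Animal ID', 'Type']]

print(one_step)
print(two_step.equals(one_step))  # True when both results match exactly
```

Both versions keep the original index labels of the rows that passed the condition.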


What if I want to segment using multiple conditions? Use `&` for "and" and `|` for "or" - and use parentheses around individual conditions!

In [177]:
# Find all the Stray Cats taken in during or after 2020
animal_df.loc[(animal_df['Year'] >= 2020) & (animal_df['Type'] == 'Stray Cat')]

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
7,A844350,*Ella,2021-10-15 11:40:00,October 2021,2112 East William Cannon Drive in Austin (TX),Stray,Normal,Cat,Intact Female,6 months,Domestic Shorthair,Brown Tabby,Stray Cat,2021
9,A818975,,2020-06-18 14:53:00,June 2020,Braker Lane And Metric in Travis (TX),Stray,Normal,Cat,Intact Male,4 weeks,Domestic Shorthair,Cream Tabby,Stray Cat,2020
44,A821389,*Dim Sum,2020-08-10 14:10:00,August 2020,7800 San Felipe Boulevard in Austin (TX),Stray,Normal,Cat,Intact Male,4 weeks,Domestic Shorthair,Brown Tabby,Stray Cat,2020
123,A816185,,2020-04-09 07:49:00,April 2020,1830 W Rundberg in Austin (TX),Stray,Normal,Cat,Intact Male,1 week,Domestic Shorthair,Gray Tabby,Stray Cat,2020
128,A834412,Montie,2021-05-13 13:58:00,May 2021,1601 East Slaughter Lane in Austin (TX),Stray,Sick,Cat,Neutered Male,5 years,Domestic Shorthair,Brown Tabby,Stray Cat,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136514,A851514,*Michaela,2022-02-14 10:30:00,February 2022,4907 Strass in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Black,Stray Cat,2022
136515,A851530,*Pickles,2022-02-14 12:40:00,February 2022,8900 North Ih 35 in Austin (TX),Stray,Normal,Cat,Intact Male,5 months,Domestic Longhair Mix,Black/White,Stray Cat,2022
136516,A851529,*Pj,2022-02-14 12:40:00,February 2022,8900 North Ih 35 in Austin (TX),Stray,Normal,Cat,Intact Male,5 months,Domestic Shorthair Mix,Blue Tabby/White,Stray Cat,2022
136518,A851512,*Lea,2022-02-14 10:30:00,February 2022,4907 Strass in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Black,Stray Cat,2022
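
The parentheses matter because Python evaluates `&` and `|` before comparisons like `>=` and `==`. One way to keep this readable is to give each condition its own name first. The sketch below does that on a small made-up table (`toy_df` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data, not the real shelter file
toy_df = pd.DataFrame({
    'Type': ['Stray Cat', 'Stray Dog', 'Stray Cat', 'Stray Dog'],
    'Year': [2019, 2018, 2021, 2022],
})

# Each condition is a boolean Series with one True/False per row
recent = toy_df['Year'] >= 2020
stray_cat = toy_df['Type'] == 'Stray Cat'

# & keeps rows where both conditions are True
both = toy_df.loc[recent & stray_cat]

# | keeps rows where at least one condition is True
either = toy_df.loc[recent | stray_cat]

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [0, 2, 3]
```

Because each condition is already stored in a variable, no extra parentheses are needed when combining them.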


### Your turn!

### Exercise 1

You need to find dogs that need extra attention - How would you find all dogs where the intake condition is NOT normal?

In [None]:
# Your code here

### Exercise 2

You need to find animals that might need to be fixed - How would you find all animals that are either Intact Male or Intact Female?

In [None]:
# Your code here

## `.iloc`

`.iloc` is used for integer-location based indexing, aka locate by position number. It can take in lists of numbers, Python slices, or specific numbers - but it can be a bit tricky, because a row's position is not always the same as its index label!

In [181]:
# Find the first 3 rows
animal_df.iloc[:3]

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog,2019
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog,2015
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog,2016


In [182]:
# Same as using head(3)
animal_df.head(3)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog,2019
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog,2015
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog,2016


In [183]:
# Can look up exactly the row at position 0
animal_df.iloc[0]

Animal ID                                       A786884
Name                                             *Brock
DateTime                            2019-01-03 16:19:00
MonthYear                                  January 2019
Found Location      2501 Magin Meadow Dr in Austin (TX)
Intake Type                                       Stray
Intake Condition                                 Normal
Animal Type                                         Dog
Sex upon Intake                           Neutered Male
Age upon Intake                                 2 years
Breed                                        Beagle Mix
Color                                          Tricolor
Type                                          Stray Dog
Year                                               2019
Name: 0, dtype: object
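
As a rough illustration of the different inputs `.iloc` accepts, the sketch below uses a small made-up table. `toy_df` is an assumption for illustration; only the names and years are borrowed from the rows shown above.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({
    'Name': ['Brock', 'Belle', 'Runster', 'Rio'],
    'Animal Type': ['Dog', 'Dog', 'Dog', 'Dog'],
    'Year': [2019, 2015, 2016, 2014],
})

single_row = toy_df.iloc[0]        # one number -> that row, returned as a Series
row_slice = toy_df.iloc[1:3]       # slice -> positions 1 and 2 (the end is excluded)
picked_rows = toy_df.iloc[[0, 3]]  # list -> exactly those positions
one_cell = toy_df.iloc[2, 0]       # row position 2, column position 0

print(single_row['Name'])            # Brock
print(row_slice['Name'].tolist())    # ['Belle', 'Runster']
print(picked_rows['Year'].tolist())  # [2019, 2014]
print(one_cell)                      # Runster
```

Like regular Python slicing, a slice in `.iloc` leaves out the end position.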

In [186]:
# Our subset kept its original index labels (7, 9, 22, ...)
subset.head(3)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
7,A844350,*Ella,2021-10-15 11:40:00,October 2021,2112 East William Cannon Drive in Austin (TX),Stray,Normal,Cat,Intact Female,6 months,Domestic Shorthair,Brown Tabby,Stray Cat,2021
9,A818975,,2020-06-18 14:53:00,June 2020,Braker Lane And Metric in Travis (TX),Stray,Normal,Cat,Intact Male,4 weeks,Domestic Shorthair,Cream Tabby,Stray Cat,2020
22,A831808,,2021-04-02 11:16:00,April 2021,Austin (TX),Owner Surrender,Normal,Cat,Intact Female,1 month,Domestic Shorthair Mix,Tortie,Owner Surrender Cat,2021


In [188]:
# Make a version of the dataframe that uses 'Animal ID' as its index
test = animal_df.set_index('Animal ID')

In [189]:
test

Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,Stray Dog,2019
A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,Stray Dog,2015
A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,Stray Dog,2016
A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,Stray Cat,2013
A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,Stray Dog,2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...
A851512,*Lea,2022-02-14 10:30:00,February 2022,4907 Strass in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Black,Stray Cat,2022
A851544,*Vera,2022-02-14 14:34:00,February 2022,700 East Live Oak in Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Australian Kelpie Mix,Black/White,Stray Dog,2022
A851553,*Tommy Bahama,2022-02-14 16:19:00,February 2022,9Th And Red River in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Chinese Sharpei Mix,Red,Stray Dog,2022
A851754,,2022-02-18 09:35:00,February 2022,1411 Gracy Farms Lane in Austin (TX),Stray,Pregnant,Cat,Intact Female,2 years,Domestic Shorthair,Gray Tabby,Stray Cat,2022


In [192]:
# .iloc goes by position, so it still works when the index holds Animal IDs
# But what about our subset dataframe above? It doesn't have an index label 0
test.iloc[0]

Name                                             *Brock
DateTime                            2019-01-03 16:19:00
MonthYear                                  January 2019
Found Location      2501 Magin Meadow Dr in Austin (TX)
Intake Type                                       Stray
Intake Condition                                 Normal
Animal Type                                         Dog
Sex upon Intake                           Neutered Male
Age upon Intake                                 2 years
Breed                                        Beagle Mix
Color                                          Tricolor
Type                                          Stray Dog
Year                                               2019
Name: A786884, dtype: object

In [None]:
# Try it...
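
Once you've tried it, the sketch below shows the same idea on a small made-up table: `.iloc` counts positions, while `.loc` looks for index labels. `toy_df`, `toy_subset`, and the values in them are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({
    'Name': ['Brock', 'Belle', 'Ella', 'Baby'],
    'Year': [2019, 2015, 2021, 2020],
})

# Filtering keeps the original index labels of the surviving rows
toy_subset = toy_df.loc[toy_df['Year'] >= 2020]
print(toy_subset.index.tolist())  # [2, 3]

# .iloc[0] is the first row by position, whatever its label is
first_by_position = toy_subset.iloc[0]
print(first_by_position['Name'])  # Ella

# .loc[0] looks for the label 0, which was filtered out
try:
    toy_subset.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)  # False

# reset_index(drop=True) renumbers the rows from 0 if that is more convenient
renumbered = toy_subset.reset_index(drop=True)
print(renumbered.loc[0, 'Name'])  # Ella
```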


# Series Methods

## `.value_counts()`

How many different values does the Animal Type column have? What about Breed?

In [193]:
# Check the value counts for Animal Type
animal_df['Animal Type'].value_counts()

Dog          76925
Cat          51698
Other         7240
Bird           636
Livestock       24
Name: Animal Type, dtype: int64

In [195]:
# Now check Breed
animal_df['Breed'].value_counts()

Domestic Shorthair Mix         32018
Domestic Shorthair             10409
Pit Bull Mix                    8902
Labrador Retriever Mix          7359
Chihuahua Shorthair Mix         6481
                               ...  
Pomeranian/Border Terrier          1
Black Mouth Cur/Plott Hound        1
Collie Rough/Irish Setter          1
English Bulldog/Dachshund          1
Rex-Mini/Lop-English               1
Name: Breed, Length: 2729, dtype: int64

Raw counts are more useful in some cases than in others... but we can also check each value's share of the total, which might be more useful!

In [197]:
# Use the normalize argument to view as proportions of the total (multiply by 100 for percentages)
animal_df['Breed'].value_counts(normalize=True).head(10)

Domestic Shorthair Mix      0.234525
Domestic Shorthair          0.076244
Pit Bull Mix                0.065205
Labrador Retriever Mix      0.053903
Chihuahua Shorthair Mix     0.047472
German Shepherd Mix         0.024230
Domestic Medium Hair Mix    0.023527
Pit Bull                    0.013851
Bat Mix                     0.012862
Bat                         0.012291
Name: Breed, dtype: float64

In [203]:
# Keep only the breeds that show up more than 9000 times
animal_df['Breed'].value_counts()[animal_df['Breed'].value_counts() > 9000]

Domestic Shorthair Mix    32018
Domestic Shorthair        10409
Name: Breed, dtype: int64
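
The pieces above fit together a bit more readably if the counts are stored in a variable first. Here is a minimal sketch on a made-up Series; `breeds` and its values are assumptions for illustration, not the real column.

```python
import pandas as pd

# Hypothetical toy Series standing in for the Breed column
breeds = pd.Series(
    ['Pit Bull Mix', 'Bat', 'Pit Bull Mix', 'Beagle Mix', 'Pit Bull Mix', 'Bat'],
    name='Breed',
)

n_unique = breeds.nunique()                   # how many different values there are
counts = breeds.value_counts()                # how often each value appears, highest first
shares = breeds.value_counts(normalize=True)  # proportions that add up to 1
common = counts[counts >= 2]                  # boolean filter on the counts themselves

print(n_unique)                # 3
print(counts['Pit Bull Mix'])  # 3
print(shares['Pit Bull Mix'])  # 0.5
print(common.index.tolist())   # ['Pit Bull Mix', 'Bat']
```

Storing `counts` once also avoids computing `.value_counts()` twice on a large column.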

## `.sort_values()`

As you can imagine, this works differently depending on the column: by default, numeric columns are sorted from smallest to largest, while text columns are sorted alphabetically.

In [206]:
# Let's sort the year column
animal_df['Year'].sort_values()

28124     2013
44406     2013
100709    2013
55127     2013
81493     2013
          ... 
135469    2022
135468    2022
135467    2022
135483    2022
136522    2022
Name: Year, Length: 136523, dtype: int64

In [207]:
# Now, sort the Animal Type col
animal_df['Animal Type'].sort_values()

38041      Bird
95491      Bird
89756      Bird
5175       Bird
56448      Bird
          ...  
34118     Other
70691     Other
90568     Other
101816    Other
107676    Other
Name: Animal Type, Length: 136523, dtype: object

In [209]:
# We can do this on the whole dataframe, it just needs to know what to sort by
animal_df.sort_values(by='DateTime')

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Type,Year
72969,A521520,Nina,2013-10-01 07:51:00,October 2013,Norht Ec in Austin (TX),Stray,Normal,Dog,Spayed Female,7 years,Border Terrier/Border Collie,White/Tan,Stray Dog,2013
38982,A664235,,2013-10-01 08:33:00,October 2013,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,Stray Cat,2013
115915,A664236,,2013-10-01 08:33:00,October 2013,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,Stray Cat,2013
2378,A664237,,2013-10-01 08:33:00,October 2013,Abia in Austin (TX),Stray,Normal,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,Stray Cat,2013
117657,A664233,Stevie,2013-10-01 08:53:00,October 2013,7405 Springtime in Austin (TX),Stray,Injured,Dog,Intact Female,3 years,Pit Bull Mix,Blue/White,Stray Dog,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136497,A851713,Zeus,2022-02-17 17:01:00,February 2022,Lunar And Cloudview in Austin (TX),Stray,Normal,Dog,Intact Male,1 year,German Shepherd,Black/Tan,Stray Dog,2022
136498,A851741,Kief,2022-02-17 17:01:00,February 2022,Lunar And Cloudview in Austin (TX),Stray,Normal,Dog,Intact Male,6 months,Boxer,Red/White,Stray Dog,2022
136522,A851742,Storm,2022-02-17 17:32:00,February 2022,Thorton Road in Austin (TX),Stray,Normal,Dog,Spayed Female,2 years,Pit Bull,White/Black,Stray Dog,2022
136500,A851744,,2022-02-17 20:14:00,February 2022,Austin (TX),Wildlife,Normal,Other,Unknown,2 years,Bat,Brown,Wildlife Other,2022
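
Two options that often come up with `.sort_values()` are reversing the order and sorting by more than one column. The sketch below tries both on a small made-up table (`toy_df` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({
    'Name': ['Zeus', 'Nina', 'Storm', 'Kief'],
    'Animal Type': ['Dog', 'Cat', 'Dog', 'Cat'],
    'Year': [2022, 2013, 2021, 2020],
})

# ascending=False puts the largest values first
newest_first = toy_df.sort_values(by='Year', ascending=False)
print(newest_first['Name'].tolist())  # ['Zeus', 'Storm', 'Kief', 'Nina']

# A list of columns sorts by the first, then breaks ties with the next;
# ascending can be a matching list with one True/False per column
by_type_then_year = toy_df.sort_values(
    by=['Animal Type', 'Year'],
    ascending=[True, False],
)
print(by_type_then_year['Name'].tolist())  # ['Kief', 'Nina', 'Zeus', 'Storm']

# sort_values returns a new object, so the original order is untouched
print(toy_df['Name'].tolist())  # ['Zeus', 'Nina', 'Storm', 'Kief']
```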


# Extra Credit: Find a .csv file online and experiment with it.

Head to [dataportals.org](https://dataportals.org) to find a .csv file.