# Module 1 - Introduction to Pandas
## Pandas Part 1

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. You have found out that Austin has one of the largest no-kill animal shelters in the country, and they keep meticulous track of animals that have been taken in and released. However, it is a large file, the online visualization tools provided are terrible, the data is stored as strings, and the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

#### _Big questions for this lesson_: <br/>
- Why use Pandas? 
 
 (a) Provides methods to analyze data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) Makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Along with Python visualization packages, allows for the visual analysis of tabular data.
 

- When do we want to use NumPy versus Pandas?
- What are the [advantages of using Pandas?](https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/)
- What are the [disadvantages of using Pandas?](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  




- The data manipulation capabilities of Pandas are built on top of the NumPy library.
- A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels.
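
The two points above can be seen on a tiny DataFrame. This is only a sketch: the frame `pets_df` and its values are made up for illustration and do not come from the shelter file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the shelter file
pets_df = pd.DataFrame(
    {'weight_kg': [4.2, 30.5, 0.9], 'age_years': [2, 7, 1]},
    index=['cat', 'dog', 'guinea pig'],
)

print(pets_df.columns.tolist())   # column names
print(pets_df.index.tolist())     # row index labels

cell_values = pets_df.to_numpy()  # the cell values as a NumPy array
print(type(cell_values))
```

The three pieces of a DataFrame (values, column names, row labels) can each be pulled out separately, and the values come back as a NumPy array.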

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.



In [2]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd
!ls -la
!ls -la data

/Users/brad/Documents/Scripts/flatiron/dc-ds-100719/module-1/week-1/day-5-pandas-1
total 56
drwxr-xr-x  5 brad  staff    160 Oct 11 11:09 .
drwxr-xr-x  7 brad  staff    224 Oct 11 11:05 ..
drwxr-xr-x  3 brad  staff     96 Oct 11 11:09 .ipynb_checkpoints
drwxr-xr-x  4 brad  staff    128 Oct 11 11:05 data
-rw-r--r--  1 brad  staff  25472 Oct 11 11:09 pandas-1.ipynb
total 16
drwxr-xr-x  4 brad  staff  128 Oct 11 11:05 .
drwxr-xr-x  5 brad  staff  160 Oct 11 11:09 ..
-rw-r--r--  1 brad  staff   62 Oct 11 11:05 example1.csv
-rw-r--r--  1 brad  staff  238 Oct 11 11:05 made_up_jobs.csv


In [1]:
import pandas as pd

example_df = pd.read_csv('data/example1.csv')

There are also `read_excel`, `read_html`, and many other pandas `read_` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
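
As a small illustration of how flexible the readers are, the sketch below reads tab-separated text with `read_csv` by passing `sep='\t'`. The text in `tsv_text` is a made-up stand-in for a `.tsv` file on disk, wrapped in `io.StringIO` so the example runs without any files.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a .tsv file on disk
tsv_text = "Title1\tTitle2\tTitle3\none\ttwo\tthree\nexample1\texample2\texample3\n"

# read_csv accepts file paths, URLs, and file-like objects; sep sets the delimiter
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(tsv_df.shape)
print(tsv_df.head())
```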

In [5]:
example_df.head()

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


Try loading in the example file in the `data` directory called `made_up_jobs.csv` using pandas.

In [6]:
#read in your csv here!
jobs = pd.read_csv('data/made_up_jobs.csv')

#remember that it's nice to be able to look at your data, so let's do that here, too.
jobs.head()

Unnamed: 0,ID,Name,Job,Years Employed
0,0,Bob Bobberty,Underwater Basket Weaver,13
1,1,Susan Smells,Salad Spinner,5
2,2,Alex Lastname,Productivity Manager,2
3,3,Rudy P.,Being cool,55
4,4,Rudy G.,Being compared to Rudy P,50


You can also load in data by using the url of an associated dataset.

In [2]:
shelter_data = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')
# this link is copied directly from the download option for CSV

shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A806379,Guyute,10/13/2019 03:54:00 PM,10/13/2019 03:54:00 PM,10/09/2017,Adoption,,Other,Intact Female,2 years,Guinea Pig,Tricolor
1,A797732,Benji,10/13/2019 03:40:00 PM,10/13/2019 03:40:00 PM,02/17/2019,Adoption,,Dog,Neutered Male,7 months,Maltese Mix,Buff
2,A806428,Akasha,10/13/2019 03:22:00 PM,10/13/2019 03:22:00 PM,04/10/2018,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Tortie
3,A801936,Snoop,10/13/2019 02:54:00 PM,10/13/2019 02:54:00 PM,02/12/2019,Adoption,,Dog,Neutered Male,7 months,Pit Bull,Brown Brindle
4,A797476,*Ollie,10/13/2019 02:52:00 PM,10/13/2019 02:52:00 PM,05/05/2019,Euthanasia,Suffering,Cat,Neutered Male,5 months,Domestic Shorthair,Black


Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [None]:
type(shelter_data)

In [None]:
# Now that the data is read, let's look at its shape
print(shelter_data.shape)

In [None]:
# What are the names of the columns
print(shelter_data.columns)

In [None]:
# What are the different data types present in our data?
# info() prints its summary itself and returns None, so no print() is needed
shelter_data.info()

In [22]:
# We can find the type of a particular column in a DataFrame this way.
ID_series = shelter_data['Animal ID']
shelter_data['Animal ID'].dtypes
# There is no column called 'size', so this rename leaves the columns unchanged
shelter_data.rename(columns={'size': 'Name'}, inplace=True)
shelter_data.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')
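
The inspection calls in this section can also be tried offline. The following sketch uses a made-up three-row frame, `mini_shelter`, whose column names and values are assumptions chosen to resemble the shelter data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the shelter data
mini_shelter = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'Animal Type': ['Dog', 'Cat', 'Cat'],
    'Age in Years': [2, 1, 7],
})

print(type(mini_shelter))          # a DataFrame
print(mini_shelter.shape)          # (number of rows, number of columns)
print(list(mini_shelter.columns))  # column names
mini_shelter.info()                # prints dtypes and non-null counts

id_column = mini_shelter['Animal ID']  # selecting one column gives a Series
print(type(id_column))
```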

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are questions we will cover in this section. To begin, let's start with this list of fruits.

In [8]:
fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

print(fruits)

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']


Using our list of fruits, we can create a pandas object called a 'series' which is much like an array or a vector.

In [9]:
fruits_series = pd.Series(fruits)

print(fruits_series)
type(fruits_series)

0         Apple
1        Orange
2    Watermelon
3         Lemon
4         Mango
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **series objects** is that you can define the index manually for a **series object**.

In [10]:
ind = ['a', 'b', 'c', 'd', 'e']

fruits_series = pd.Series(fruits, index=ind)

print(fruits_series)

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object
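
A custom index changes how values are looked up. The sketch below rebuilds the same series and retrieves items by label and by position; the variable names `by_label`, `by_position`, and `several` are only illustrative.

```python
import pandas as pd

fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']
fruits_series = pd.Series(fruits, index=['a', 'b', 'c', 'd', 'e'])

by_label = fruits_series['c']          # look up by index label
by_position = fruits_series.iloc[2]    # look up by integer position
several = fruits_series[['a', 'e']]    # a list of labels works too

print(by_label, by_position)
print(several.tolist())
```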


With a partner, create your own custom series from a list of lists.

In [31]:
list_of_lists = [['cat','ape'], ['dog'], ['horse'], ['cow'], ['macaw']]

lst = [x for sublist in list_of_lists for x in sublist]
print(lst)
# create custom indices for your series
# ind = ['imagine that','what a hog','she died of course','dunno how','how absurd']
ind = range(2,7)
# create the series using your list objects
# You can use either a for loop or also pd.Series
list_of_lists_series = pd.Series(list_of_lists, index=ind)

# print your series
print(list_of_lists_series)
type(list_of_lists_series)

['cat', 'ape', 'dog', 'horse', 'cow', 'macaw']
2    [cat, ape]
3         [dog]
4       [horse]
5         [cow]
6       [macaw]
dtype: object


pandas.core.series.Series

We can do a similar thing with Python dictionaries. This time, however, we will create a DataFrame object from a Python dictionary.

In [12]:
# Dictionary with list object in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [13]:
#to find data types of columns
students_df.dtypes

name    object
age     object
city    object
dtype: object

Let's change the data type of ages to int.

In [15]:
# We can also change a column's type, but the change has to make sense.
students_df.age = students_df.age.astype(int)

# Observe what happens when trying to convert the students' names to int: it raises a ValueError
students_df.name = students_df.name.astype(int)


# What happens when converting numeric to string?
#students_df.age = students_df.age.astype(str)

students_df.dtypes

ValueError: invalid literal for int() with base 10: 'Samantha'
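
When a column mixes numbers with text that cannot be parsed, one option is `pd.to_numeric` with `errors='coerce'`, which replaces the unparseable entries with `NaN` rather than raising. This is a sketch on a made-up series, `raw_ages`, not part of the lesson's data.

```python
import pandas as pd

# Hypothetical column mixing numeric text with a value that is not a number
raw_ages = pd.Series(['35', '17', 'unknown'])

# errors='coerce' turns unparseable entries into NaN instead of raising ValueError
ages = pd.to_numeric(raw_ages, errors='coerce')

print(ages)
```

Because `NaN` is a floating-point value, the resulting column is a float column rather than an int column.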

We can also use a custom index for these items. For example, we might want them to be the individual student ID numbers.

In [None]:
school_ids = ['1111', '1145', '0096']

# Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
students_df = pd.DataFrame(student_dict, index=school_ids)

students_df.head()

Using Pandas, we can also rename column names.

In [None]:
students_df.columns = ['NAME', 'AGE', 'HOME']
students_df.head()

Or, we can also change the column names using the rename function.

In [None]:
students_df.rename(columns={'AGE': 'YEARS'})

In [None]:
# Notice what happens when we print students_df

students_df

In [None]:
# To modify the DataFrame itself instead of getting a modified copy back, use the option `inplace=True`.
students_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
students_df.head()
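
The difference between the two `rename` calls can be checked directly. The sketch below uses a separate made-up frame called `students` and shows reassignment as an alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical two-row frame, separate from the lesson's students_df
students = pd.DataFrame({'NAME': ['Samantha', 'Alex'], 'AGE': [35, 17]})

renamed = students.rename(columns={'AGE': 'YEARS'})  # returns a new DataFrame
columns_before = list(students.columns)              # the original is unchanged

# Reassigning the result keeps the change without inplace=True
students = students.rename(columns={'AGE': 'YEARS'})
columns_after = list(students.columns)

print(columns_before, columns_after)
```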

Similarly, there is a tool to remove rows and columns from your DataFrame.

In [None]:
students_df.drop(columns=['YEARS', 'HOME'])

In [None]:
#Notice again what happens if we print students_df 
students_df

In [None]:
students_df.drop(columns=['YEARS', 'HOME'], inplace=True)
students_df

To modify the DataFrame itself rather than return a modified copy, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
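
The cells above only drop columns, but `drop` also removes rows when given index labels. Here is a minimal sketch on a made-up frame, `students`, with assumed ID labels as the index.

```python
import pandas as pd

# Hypothetical frame with student IDs as the index
students = pd.DataFrame(
    {'name': ['Samantha', 'Alex', 'Dante'], 'age': [35, 17, 26]},
    index=['1111', '1145', '0096'],
)

without_alex = students.drop(index=['1145'])  # drop a row by its index label
names_only = students.drop(columns=['age'])   # drop a column by name

print(without_alex)
print(names_only)
```

Neither call uses `inplace=True`, so `students` itself keeps all three rows and both columns.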

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [None]:
l = [1, 2, 3, 4, 5]
# Raises TypeError: a list of integers is not a valid index for a Python list
l[[0, 5]]
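
The failure above can be caught and compared with a plain-Python workaround. This sketch uses positions `[0, 4]` so that both are valid positions in a five-item list.

```python
numbers = [1, 2, 3, 4, 5]

try:
    numbers[[0, 4]]  # a list of positions is not a valid list index
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Plain-Python workaround: pick the positions one at a time
picked = [numbers[i] for i in [0, 4]]
print(picked)
```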

### DataFrames can be indexed by column name (label), by row name (index), or by position.   
#### The `.loc` indexer is used for indexing by label.  
#### The `.iloc` indexer is used for indexing by integer position.

In [None]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

In [None]:
students_df

### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help with remembering what it does.
#### `.iloc[row, column]`

In [None]:
# returns the first row
students_df.iloc[0]

In [None]:
# returns the first column
students_df.iloc[:, 0]

In [None]:
# returns the first two rows; notice that iloc uses regular Python slicing (the stop is excluded).
students_df.iloc[0:2]

In [None]:
# returns the first two columns
students_df.iloc[:, 0:2]

In [None]:
# returns the first row and the first two columns (positions 0 and 1)
students_df.iloc[0:1, 0:2]

### How would we use `.iloc` to return the last item in the last row?


In [32]:
# return the last item in the last row using iloc
students_df.iloc[-1,-1]

'New york'

### How would we use `.iloc` to return the first item in the last column?


In [36]:
# return the first item in the last column using iloc
students_df.iloc[0,-1]

'Houston'

### What if we only want certain columns or rows?

In [37]:
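# Don't do students_df.iloc[0, 2]: that returns the single cell at row 0, column 2.
# Pass a list of positions to select rows 0 and 2.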
students_df.iloc[[0, 2]]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
2,Dante,26,New york


In [38]:
students_df.iloc[[0, 2], [0, 2]]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New york
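
What `.iloc` returns depends on whether it is given single integers or lists. The sketch below rebuilds the same `students_df` and stores three selections in illustrative variables.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
})

one_cell = students_df.iloc[0, 2]    # row 0, column 2 -> a single value
one_row = students_df.iloc[0]        # one row -> Series
two_rows = students_df.iloc[[0, 2]]  # list of rows -> DataFrame

print(one_cell)
print(type(one_row), type(two_rows))
```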


### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [None]:
# We will use loc to return rows and columns based on labels. Let's look at the students_df DataFrame again.
students_df

In [None]:
# returns the student information associated with index 0
students_df.loc[0]

In [None]:
# returns the student information for row index 0 to 2 inclusive.
# Note: iloc uses normal Python slicing, which would not include position 2, as demonstrated above.
students_df.loc[0:2]

In [None]:
# returns the column labeled 'age'
students_df.loc[:, 'age']

In [None]:
# returns the column labeled 'age' for index values 1 to 2:
# the rows with index from 1 to 2 (inclusive)
# and the column labeled 'age'
students_df.loc[1:2, 'age']

In [None]:
# returns the columns from 'age' to 'city' for index values 1 to 2:
# the rows with index from 1 to 2 (inclusive)
# and the columns labeled 'age' to 'city' (inclusive)
students_df.loc[1:2, 'age':'city']

In [None]:
# What should we get?
students_df.loc[1:2, ['name', 'city']]

In [None]:
# How about?
students_df.loc[[0, 2], ['name', 'city']]

In [None]:
# if index rearranged
school_ids = ['5', '11', '3']
students_df = pd.DataFrame(student_dict, index=school_ids)

In [None]:
students_df

In [None]:
# What should we get now?
students_df.loc[[0, 2], ['name', 'city']]

In [None]:
# What should we get now?
students_df.loc[['5', '11'], ['name', 'city']]
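
Once the index holds strings, integer labels no longer exist in it. The sketch below rebuilds a two-column version of the frame with the same string index and contrasts a failing `.loc` lookup with a label lookup and a position lookup; `lookup_failed` is just an illustrative flag.

```python
import pandas as pd

students_df = pd.DataFrame(
    {'name': ['Samantha', 'Alex', 'Dante'], 'city': ['Houston', 'Seattle', 'New york']},
    index=['5', '11', '3'],
)

try:
    students_df.loc[[0, 2], ['name', 'city']]  # 0 and 2 are not labels in this index
    lookup_failed = False
except KeyError:
    lookup_failed = True

by_label = students_df.loc[['5', '11'], ['name', 'city']]  # labels that do exist
by_position = students_df.iloc[[0, 2]]                     # positions still work

print(lookup_failed)
print(by_label)
print(by_position)
```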

In [None]:
students_df.set_index("name", inplace=True)
students_df

In [None]:
students_df.loc[['Samantha']]

In [None]:
# Subsetting nonconsecutive rows
students_df.loc[['Samantha', 'Dante']]

In [None]:
# Samantha to the end
students_df.loc['Samantha':]

In [None]:
# return the first and last rows using one loc command
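
One possible approach to the exercise above, sketched under the assumption that `name` is the index as in the preceding cells: take the first and last labels from the index and pass them to `.loc` as a list.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
}).set_index('name')

# index[0] and index[-1] are the first and last row labels
first_and_last = students_df.loc[[students_df.index[0], students_df.index[-1]]]

print(first_and_last)
```

Writing the labels out directly, as in `students_df.loc[['Samantha', 'Dante']]`, gives the same rows for this frame.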

### Boolean Subsetting

In [None]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

In [None]:
# The expression students_df['name'] == 'Samantha' produces a Pandas Series with a True/False value for every row
# in the students_df DataFrame, with True for the rows where the name is 'Samantha'.
# These types of boolean arrays can be passed directly to the .loc indexer.
students_df.loc[students_df['name'] == 'Samantha']

In [None]:
# What about if we only want the city and state of the selected students with the name Samantha?
students_df.loc[students_df['name'] == 'Samantha', ['city', 'state']]

In [None]:
# What about if we want to select a student of a specific age?
students_df.loc[students_df['age'] == '21']

In [None]:
# What about if we want to select a student of a specific age who also lives in a specific city?
students_df.loc[(students_df['age'] == '21') &
                (students_df['city'] == 'Atlanta')]

In [None]:
# What should be returned?
students_df.loc[(students_df['age'] == '35') &
                (students_df['city'] == 'Atlanta')]
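
The boolean masks used above can be inspected and combined in other ways. This sketch rebuilds a three-column version of the frame; the variable names and the extra `|` and `astype(int)` examples are illustrative additions, not part of the original cells.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
})

is_samantha = students_df['name'] == 'Samantha'  # Series of True/False, one per row
print(is_samantha.tolist())

# | means "or"; each condition needs its own parentheses
either = students_df.loc[(students_df['age'] == '17') | (students_df['city'] == 'Atlanta')]

# No row is both 35 and in Atlanta, so the result is an empty DataFrame
nobody = students_df.loc[(students_df['age'] == '35') & (students_df['city'] == 'Atlanta')]

# The ages are strings here; converting to int allows numeric comparisons
adults = students_df.loc[students_df['age'].astype(int) >= 21]

print(either)
print(nobody)
print(adults)
```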

### Lesson Recap
Pandas combines the power of Python lists (selection via integer location) and dictionaries (selection by label).

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

`.iloc` will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

`.loc` is primarily label based, but may also be used with a boolean array.

#### Warning: Note that contrary to usual Python slices, with `.loc` both the start and the stop are included.

`.loc` will raise a `KeyError` when any items are not found.
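
The recap rules can be verified on a small series. In this sketch, `letters` and its integer labels 10 to 40 are made up so that labels and positions are easy to tell apart.

```python
import pandas as pd

letters = pd.Series(['w', 'x', 'y', 'z'], index=[10, 20, 30, 40])

by_position = letters.iloc[0:2].tolist()   # positions 0 and 1; the stop is excluded
by_label = letters.loc[10:30].tolist()     # labels 10 through 30; the stop is included
truncated = letters.iloc[2:100].tolist()   # an out-of-bounds slice is simply cut short

try:
    letters.iloc[100]                      # out-of-bounds single position
    raised_index_error = False
except IndexError:
    raised_index_error = True

try:
    letters.loc[99]                        # label that does not exist
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(by_position, by_label, truncated)
print(raised_index_error, raised_key_error)
```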

### Pandas
- The data structures in Pandas are implemented using series and dataframe classes.  
- A series is a one-dimensional indexed array of some fixed data type.  
- A dataframe is a two-dimensional data structure, like a table, where each column contains data of the same type.  
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.


### CLASS ASSIGNMENT
Now that we have all of these new tools in our tool belt, use these tools on the shelter data set! 
- Use `shelter_data.columns` to get the list of column names.
- Subset the data by `Outcome Subtype`.
- Subset the data by `Outcome Type` `Adoption` and only return the `Animal Type` column. 
- Subset the data by `Outcome Type` `Adoption` and only return the `Animal Type` column with only `Cat`. 
- Play around with your new tools on the data set.
- For extra credit: What are the data types returned from the different subsetting operations? Is what is returned a Series or a DataFrame?

In [39]:
import pandas as pd
shelter_data = pd.read_csv(
    'https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')
shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A800276,Porkchop,10/10/2019 07:30:00 PM,10/10/2019 07:30:00 PM,09/19/2004,Return to Owner,,Dog,Intact Male,15 years,Australian Cattle Dog,Tan
1,A806032,,10/10/2019 06:25:00 PM,10/10/2019 06:25:00 PM,07/21/2019,Adoption,,Cat,Intact Female,2 months,Domestic Shorthair,Black/White
2,A806431,,10/10/2019 06:13:00 PM,10/10/2019 06:13:00 PM,09/18/2019,Transfer,Partner,Cat,Unknown,3 weeks,Domestic Shorthair,Black
3,A806432,,10/10/2019 06:13:00 PM,10/10/2019 06:13:00 PM,09/18/2019,Transfer,Partner,Cat,Unknown,3 weeks,Domestic Shorthair,Brown Tabby
4,A806149,*Bingley,10/10/2019 05:59:00 PM,10/10/2019 05:59:00 PM,10/06/2017,Adoption,,Dog,Neutered Male,2 years,Basset Hound/American Pit Bull Terrier,Brown Tiger/White


In [11]:
shelter_data.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

In [23]:
shelter_data[shelter_data['Outcome Type'] == "Adoption"]

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A806379,Guyute,10/13/2019 03:54:00 PM,10/13/2019 03:54:00 PM,10/09/2017,Adoption,,Other,Intact Female,2 years,Guinea Pig,Tricolor
1,A797732,Benji,10/13/2019 03:40:00 PM,10/13/2019 03:40:00 PM,02/17/2019,Adoption,,Dog,Neutered Male,7 months,Maltese Mix,Buff
2,A806428,Akasha,10/13/2019 03:22:00 PM,10/13/2019 03:22:00 PM,04/10/2018,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Tortie
3,A801936,Snoop,10/13/2019 02:54:00 PM,10/13/2019 02:54:00 PM,02/12/2019,Adoption,,Dog,Neutered Male,7 months,Pit Bull,Brown Brindle
6,A802927,Honeybun,10/13/2019 02:20:00 PM,10/13/2019 02:20:00 PM,03/24/2019,Adoption,,Dog,Spayed Female,6 months,Labrador Retriever Mix,Brown
10,A806181,,10/13/2019 01:45:00 PM,10/13/2019 01:45:00 PM,08/21/2019,Adoption,,Dog,Spayed Female,1 month,Australian Cattle Dog Mix,Black/Tricolor
11,A805536,Marley,10/13/2019 01:35:00 PM,10/13/2019 01:35:00 PM,09/28/2017,Adoption,,Dog,Spayed Female,2 years,Pointer Mix,White/Black
13,A796880,Desmond,10/13/2019 12:57:00 PM,10/13/2019 12:57:00 PM,06/06/2017,Adoption,,Dog,Neutered Male,2 years,Basenji/Pit Bull,Tan/White
14,A756806,Shadow,10/13/2019 12:51:00 PM,10/13/2019 12:51:00 PM,07/10/2017,Adoption,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Blue Tabby/White
18,A806289,,10/13/2019 12:08:00 PM,10/13/2019 12:08:00 PM,08/29/2019,Adoption,,Cat,Intact Female,1 month,Domestic Shorthair,Torbie


In [24]:
shelter_data[shelter_data['Outcome Type'] == "Adoption"]['Animal Type']

0         Other
1           Dog
2           Cat
3           Dog
6           Dog
10          Dog
11          Dog
13          Dog
14          Cat
18          Cat
19          Dog
20          Dog
26          Cat
27          Cat
28          Cat
31          Cat
32          Cat
33          Cat
35          Dog
37          Dog
38          Dog
39          Dog
40          Cat
41          Cat
42          Cat
43          Dog
44          Dog
55          Dog
61          Dog
62          Cat
          ...  
109287      Cat
109289      Cat
109294      Dog
109311      Cat
109314      Dog
109316      Dog
109319      Dog
109320      Dog
109322      Dog
109323      Dog
109327      Dog
109328      Cat
109342      Dog
109345      Cat
109355      Dog
109356      Dog
109357      Dog
109359      Cat
109368      Dog
109370      Cat
109373      Dog
109377      Dog
109378      Dog
109379      Cat
109382      Cat
109387      Dog
109410      Cat
109411      Cat
109413      Dog
109420      Dog
Name: Animal Type, Lengt

In [27]:
shelter_data[(shelter_data['Outcome Type'] == "Adoption") & (shelter_data['Animal Type'] == "Cat")]['Animal Type']

2         Cat
14        Cat
18        Cat
26        Cat
27        Cat
28        Cat
31        Cat
32        Cat
33        Cat
40        Cat
41        Cat
42        Cat
62        Cat
63        Cat
65        Cat
70        Cat
71        Cat
78        Cat
79        Cat
85        Cat
89        Cat
93        Cat
105       Cat
106       Cat
109       Cat
110       Cat
117       Cat
123       Cat
124       Cat
125       Cat
         ... 
109174    Cat
109176    Cat
109182    Cat
109189    Cat
109194    Cat
109195    Cat
109202    Cat
109207    Cat
109210    Cat
109214    Cat
109215    Cat
109216    Cat
109218    Cat
109255    Cat
109259    Cat
109261    Cat
109263    Cat
109265    Cat
109275    Cat
109287    Cat
109289    Cat
109311    Cat
109328    Cat
109345    Cat
109359    Cat
109370    Cat
109379    Cat
109382    Cat
109410    Cat
109411    Cat
Name: Animal Type, Length: 18250, dtype: object
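
The same subsets can also be written as a single `.loc` call that takes the row mask and the column selection together, which also speaks to the extra-credit question. This is a sketch on a made-up four-row frame, `mini_shelter`, that borrows only the two column names from the shelter data.

```python
import pandas as pd

# Hypothetical handful of rows shaped like the shelter data
mini_shelter = pd.DataFrame({
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption', 'Adoption'],
    'Animal Type': ['Dog', 'Cat', 'Cat', 'Other'],
})

adopted = mini_shelter['Outcome Type'] == 'Adoption'
cats = mini_shelter['Animal Type'] == 'Cat'

# A single column label returns a Series
adopted_types = mini_shelter.loc[adopted, 'Animal Type']

# A list of column labels returns a DataFrame, even with one column
adopted_cats = mini_shelter.loc[adopted & cats, ['Animal Type']]

print(type(adopted_types))
print(type(adopted_cats))
```

Naming the masks first keeps each condition readable and lets them be reused across several selections.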