<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Introduction-to-Pandas" data-toc-modified-id="Introduction-to-Pandas-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Pandas</a></span></li><li><span><a href="#Getting-Started-with-Pandas" data-toc-modified-id="Getting-Started-with-Pandas-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Getting Started with Pandas</a></span><ul class="toc-item"><li><span><a href="#Importing-Pandas" data-toc-modified-id="Importing-Pandas-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Importing Pandas</a></span></li><li><span><a href="#Importing-a-File" data-toc-modified-id="Importing-a-File-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Importing a File</a></span></li></ul></li><li><span><a href="#Pandas-Data-Structures" data-toc-modified-id="Pandas-Data-Structures-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pandas Data Structures</a></span><ul class="toc-item"><li><span><a href="#Series" data-toc-modified-id="Series-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Series</a></span></li><li><span><a href="#Dataframes" data-toc-modified-id="Dataframes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Dataframes</a></span></li></ul></li><li><span><a href="#Modifying-Data" data-toc-modified-id="Modifying-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modifying Data</a></span><ul class="toc-item"><li><span><a href="#Indexing" data-toc-modified-id="Indexing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Indexing</a></span></li><li><span><a href="#Combining" data-toc-modified-id="Combining-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Combining</a></span></li></ul></li><li><span><a href="#Basic-Analysis-&amp;-Visualization" data-toc-modified-id="Basic-Analysis-&amp;-Visualization-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Basic Analysis &amp; Visualization</a></span><ul class="toc-item"><li><span><a href="#Summary-Statistics" 
data-toc-modified-id="Summary-Statistics-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Summary Statistics</a></span></li><li><span><a href="#Visualization" data-toc-modified-id="Visualization-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Visualization</a></span></li></ul></li></ul></div>

# Introduction to Pandas

<b>Pandas builds on NumPy and is designed to make data analysis faster and easier through its data structures and data manipulation tools.</b>

# Getting Started with Pandas

## Importing Pandas

<b>Import numpy and matplotlib.</b>

In [462]:
import numpy as np

In [463]:
import matplotlib.pyplot as plt

<b>Import pandas. Series and DataFrame are used frequently, so it's useful to import them individually.</b>

In [464]:
from pandas import Series, DataFrame

In [465]:
import pandas as pd

## Importing a File

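<b>Load the data file by passing its path, relative to the working directory, to a pandas reader function.</b>

Our file is delimited text rather than an Excel workbook, so we read it with pd.read_table, which parses delimited text and assumes tab-separated fields by default. If the fields are separated by commas, use pd.read_csv or pass sep=',' instead.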

In [466]:
data = pd.read_table('seaboard_flights.csv')

We can then output the data to have a look at what we're working with.

In [467]:
data

Unnamed: 0,year,month,passengers
0,1958,January,340
1,1958,February,318
2,1958,March,362
3,1958,April,348
4,1958,May,363
5,1958,June,435
6,1958,July,491
7,1958,August,505
8,1958,September,404
9,1958,October,359
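
The difference between the two readers can be sketched on a small in-memory sample. The sample text below is a made-up stand-in, not the contents of seaboard_flights.csv, and io.StringIO is used only so the sketch runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample: the same three columns, once comma-separated and once tab-separated.
csv_text = "year,month,passengers\n1958,January,340\n1958,February,318\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv splits on commas by default; read_table splits on tabs by default.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))

# read_table can also read comma-separated text if the separator is given explicitly.
from_table_sep = pd.read_table(io.StringIO(csv_text), sep=",")

# Without sep=",", read_table finds no tabs and puts each whole line into one column.
single_column = pd.read_table(io.StringIO(csv_text))

print(from_csv)
print(single_column.shape)
```

If a loaded table shows up as one wide column, a mismatched separator is a likely cause.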


# Pandas Data Structures

## Series

<b>A Series is a one-dimensional labeled array: it pairs an array of values with an array of index labels. It behaves much like a fixed-length, ordered dictionary.</b>

One of the most common ways to create a Series is to specify its values and the corresponding index labels.

In [468]:
trips = Series(['London', 'Beijing', 'Paris', 'New York'], index=['2010', '2011', '2012', '2013'])

In [469]:
trips

2010      London
2011     Beijing
2012       Paris
2013    New York
dtype: object
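
The dictionary comparison can be made concrete with a short sketch. It rebuilds the same trips Series with the pd.Series spelling, then looks values up by label and builds an equivalent Series from a plain dict (trips_from_dict is a name introduced here for illustration).

```python
import pandas as pd

trips = pd.Series(['London', 'Beijing', 'Paris', 'New York'],
                  index=['2010', '2011', '2012', '2013'])

# Look up a value by its index label, as with a dictionary key.
print(trips['2012'])

# Membership tests check the index labels, not the values.
print('2011' in trips)
print('London' in trips)

# A dict maps directly onto a Series: keys become labels, values become data.
trips_from_dict = pd.Series({'2010': 'London', '2011': 'Beijing',
                             '2012': 'Paris', '2013': 'New York'})
print(trips.equals(trips_from_dict))
```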

## Dataframes

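<b>DataFrames are two-dimensional, labeled tables of data, similar to a spreadsheet. Each column is a Series.</b>

pd.read_table already returned the flights dataset as a DataFrame. Here we pass it to the DataFrame constructor with a columns argument, which keeps only the named columns that are already present in the data.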

In [470]:
flights = DataFrame(data, columns=['year', 'month', 'passengers'])

In [471]:
flights

Unnamed: 0,year,month,passengers
0,1958,January,340
1,1958,February,318
2,1958,March,362
3,1958,April,348
4,1958,May,363
5,1958,June,435
6,1958,July,491
7,1958,August,505
8,1958,September,404
9,1958,October,359
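
As a rough sketch of what the columns argument does, the example below uses a small hand-built stand-in for data (the real file is not loaded here) and compares the constructor call with plain column selection.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data, including an extra column.
data = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'year': [1958, 1958, 1958],
    'month': ['January', 'February', 'March'],
    'passengers': [340, 318, 362],
})

# Passing an existing DataFrame with columns=... keeps only those columns.
flights = pd.DataFrame(data, columns=['year', 'month', 'passengers'])

# Selecting with a list of column names gives the same result.
selected = data[['year', 'month', 'passengers']]

print(flights)
print(flights.equals(selected))
```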


Once the data is in a DataFrame, we can use its methods to index, filter, and modify it.

# Modifying Data

## Indexing

<b>Indexing allows us to view subsets of the data based on columns, rows, and/or values.</b>

Slicing with a range selects rows by position, such as the first 20 rows.

In [472]:
flights[:20]

Unnamed: 0,year,month,passengers
0,1958,January,340
1,1958,February,318
2,1958,March,362
3,1958,April,348
4,1958,May,363
5,1958,June,435
6,1958,July,491
7,1958,August,505
8,1958,September,404
9,1958,October,359
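
Row slicing can be written in a few equivalent ways. The sketch below uses a five-row stand-in for flights and takes the first three rows rather than twenty, to keep the output short.

```python
import pandas as pd

# Small stand-in for the flights table.
flights = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May'],
    'passengers': [340, 318, 362, 348, 363],
})

first_rows = flights[:3]          # slice on the DataFrame itself
with_iloc = flights.iloc[:3]      # explicit position-based indexing
with_head = flights.head(3)       # convenience method for the first n rows

print(first_rows)
print(first_rows.equals(with_iloc) and first_rows.equals(with_head))
```

The iloc form states the intent most explicitly, which helps when the index is not a simple 0..n range.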


We can select a single column by name. Here we select the 'passengers' column, which is returned as a Series.

In [473]:
flights['passengers']

0     340
1     318
2     362
3     348
4     363
5     435
6     491
7     505
8     404
9     359
10    310
11    337
12    360
13    342
14    406
15    396
16    420
17    472
18    548
19    559
20    463
21    407
22    362
23    405
24    417
25    391
26    419
27    461
28    472
29    535
30    622
31    606
32    508
33    461
34    390
35    432
Name: passengers, dtype: int64
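
One detail worth seeing in code is that the type of the result depends on the brackets. In this sketch (again on a small stand-in table), a single name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

flights = pd.DataFrame({
    'year': [1958, 1958, 1958],
    'month': ['January', 'February', 'March'],
    'passengers': [340, 318, 362],
})

as_series = flights['passengers']             # one name -> Series
as_frame = flights[['passengers']]            # list with one name -> DataFrame
two_columns = flights[['year', 'passengers']] # list of names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(two_columns.columns.tolist())
```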

We can also filter with a boolean condition to find more specific subsets, such as all the months with more than 500 passengers.

In [474]:
flights[flights['passengers'] > 500]

Unnamed: 0,year,month,passengers
7,1958,August,505
18,1959,July,548
19,1959,August,559
29,1960,June,535
30,1960,July,622
31,1960,August,606
32,1960,September,508
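
It can help to see that the condition is itself a Series of True/False values, which is then used to pick rows. The sketch below uses five of the 1958 rows shown earlier; the names mask, busy, and busy_not_september are introduced here for illustration.

```python
import pandas as pd

flights = pd.DataFrame({
    'year': [1958] * 5,
    'month': ['June', 'July', 'August', 'September', 'October'],
    'passengers': [435, 491, 505, 404, 359],
})

# The comparison produces a boolean Series, one value per row.
mask = flights['passengers'] > 500
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
busy = flights[mask]
print(busy)

# Conditions combine with & (and) or | (or); each one needs its own parentheses.
busy_not_september = flights[(flights['passengers'] > 400) &
                             (flights['month'] != 'September')]
print(busy_not_september['month'].tolist())
```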


We can combine a row condition with a column selection to find a subset based on multiple values.

For example, flights.loc[flights['year'] == 1960, 'passengers'] returns the passenger data for flights from 1960.
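
A minimal sketch of selecting by row and column together, assuming .loc is the intended tool (the original notebook does not show this cell). The table is a small stand-in built from rows shown in the outputs above.

```python
import pandas as pd

flights = pd.DataFrame({
    'year': [1958, 1959, 1960, 1960, 1960],
    'month': ['August', 'August', 'June', 'July', 'August'],
    'passengers': [505, 559, 535, 622, 606],
})

# .loc takes a row selector and a column selector, separated by a comma.
passengers_1960 = flights.loc[flights['year'] == 1960, 'passengers']
print(passengers_1960)

# A list of column names in the second position returns a DataFrame instead.
months_1960 = flights.loc[flights['year'] == 1960, ['month', 'passengers']]
print(months_1960)
```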

## Combining

<b>We can combine the values in Series and DataFrames to produce a new dataset.</b>

Pandas aligns data on index labels: it matches labels across the two objects and computes a value only where a label is present in both. Where a label is missing from either one, the result is NaN.

In [477]:
trips_adam = Series([5, 2, 1, 4, 3], index=[2010, 2011, 2012, 2013, 2014])

In [478]:
trips_bob = Series([1, 0, 3, 4], index=[2011, 2012, 2013, 2014])

In [479]:
trips_adam + trips_bob

2010    NaN
2011    3.0
2012    1.0
2013    7.0
2014    7.0
dtype: float64

Note that these two Series do not share all of their index labels. trips_adam has a value for 2010 but trips_bob does not, so NaN is inserted for 2010 in the combined Series. Because NaN is a floating-point value, the result has dtype float64.
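
If a missing label should count as zero rather than produce NaN, one option is the add method with fill_value. The sketch below reuses the two Series from above; treating a missing year as zero trips is an assumption made for illustration.

```python
import pandas as pd

trips_adam = pd.Series([5, 2, 1, 4, 3], index=[2010, 2011, 2012, 2013, 2014])
trips_bob = pd.Series([1, 0, 3, 4], index=[2011, 2012, 2013, 2014])

# The + operator leaves NaN where a label exists in only one Series.
total = trips_adam + trips_bob
print(total)

# add() with fill_value=0 substitutes 0 for the missing side before adding.
total_filled = trips_adam.add(trips_bob, fill_value=0)
print(total_filled)

# Alternatively, drop the labels that could not be matched.
total_matched = total.dropna()
print(total_matched.index.tolist())
```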

# Basic Analysis & Visualization

## Summary Statistics

<b>Pandas has a variety of descriptive statistics methods that can help you quickly summarize and examine your data.</b>

In [480]:
flights['passengers'].describe()

count     36.000000
mean     428.500000
std       79.329152
min      310.000000
25%      362.000000
50%      412.000000
75%      472.000000
max      622.000000
Name: passengers, dtype: float64
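
Each figure in the describe() output is also available through its own method. The sketch below rebuilds the passengers column from the 36 values printed earlier, so it runs without the data file, and calls a few of these methods.

```python
import pandas as pd

# The 36 passenger counts shown in the 'passengers' column output above.
passengers = pd.Series([
    340, 318, 362, 348, 363, 435, 491, 505, 404, 359, 310, 337,
    360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405,
    417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432,
], name='passengers')

print(passengers.count())    # number of non-missing values
print(passengers.mean())     # arithmetic mean
print(passengers.median())   # middle value, the 50% row in describe()
print(passengers.min(), passengers.max())
print(passengers.idxmax())   # index label of the largest value
print(passengers.sum())      # total across all rows
```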


## Visualization

<b>Pandas uses matplotlib to create graphs and other data visualizations that are easy to produce and modify.</b>

Let's plot a histogram of the data from trips_bob, specifying that there will be 4 bins.

In [481]:
trips_bob.hist(bins=4)

<matplotlib.axes._subplots.AxesSubplot at 0x113186c88>
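
Since hist returns a matplotlib Axes object, the plot can be adjusted with ordinary matplotlib calls. The following is a sketch only: the Agg backend is selected so it runs outside a notebook, and the title, axis labels, and output file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

trips_bob = pd.Series([1, 0, 3, 4], index=[2011, 2012, 2013, 2014])

# hist() draws the histogram and returns the Axes it drew on.
ax = trips_bob.hist(bins=4)

# The Axes can then be customized like any other matplotlib plot.
ax.set_title("Bob's trips per year")      # hypothetical title
ax.set_xlabel("Number of trips")
ax.set_ylabel("Number of years")

# np.histogram reports the same bin counts without drawing anything.
counts, edges = np.histogram(trips_bob, bins=4)
print(counts, edges)

plt.savefig("trips_bob_hist.png")         # hypothetical output file name
```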
