<h1>Chapter I</h1>
<h2>A Tour of Pandas</h2>

<ul><li><h2>Pandas Series</h2></li></ul>

The basic data structure of pandas is the Series object, which is designed to operate
similarly to a NumPy array while adding index capabilities. A simple way to create a Series
object is to initialize it with an array (for example, a NumPy array) or a Python list.

In [1]:
import pandas as pd
from pandas import Series, DataFrame

s = Series([1, 2, 3, 4, 5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [2]:
s[[1]]

1    2
dtype: int64

In [3]:
s[1]

2

In [4]:
s[[2,3]]

2    3
3    4
dtype: int64

In [5]:
s = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s

a    1
b    2
c    3
d    4
dtype: int64

In [6]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [7]:
s[['c']]

c    3
dtype: int64

In [8]:
s['c']

3
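
The cells above pass either a single label or a list of labels to the indexer. As a minimal sketch (the variable names `single_value` and `sub_series` are introduced here only for illustration), a scalar label returns the stored value, while a list of labels returns a new Series, even when the list holds one label:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# A scalar label returns the stored value itself
single_value = s['c']

# A list of labels returns a new Series, even for a single label
sub_series = s[['c']]

print(type(single_value).__name__)
print(type(sub_series).__name__)
print(sub_series.index.tolist())
```

This distinction matters when the result is passed on to code that expects an index to still be present.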

In [9]:
# Get mean value of Series
s.mean()

2.5

In [10]:
# Create a date range
dates = pd.date_range('2017-06-10', '2017-06-14')
dates

DatetimeIndex(['2017-06-10', '2017-06-11', '2017-06-12', '2017-06-13',
               '2017-06-14'],
              dtype='datetime64[ns]', freq='D')

In [11]:
# Align data with Index
Series([80, 82, 85, 90, 83], index=dates)

2017-06-10    80
2017-06-11    82
2017-06-12    85
2017-06-13    90
2017-06-14    83
Freq: D, dtype: int64

In [12]:
# Difference between 2 Series
temps1 = Series([80, 82, 85, 90, 83], index=dates)
temps2 = Series([60, 62, 65, 10, 13], index=dates)

temp_diffs = temps1 - temps2
temp_diffs

2017-06-10    20
2017-06-11    20
2017-06-12    20
2017-06-13    80
2017-06-14    70
Freq: D, dtype: int64
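
The subtraction above works label by label because both Series share the same index. As a hedged sketch of what happens when the indexes only partly overlap (the Series `a` and `b` and their values are made up for illustration), pandas aligns on the union of the labels and fills the unmatched positions with NaN:

```python
import pandas as pd

# Two date ranges that overlap only on 2017-06-11 and 2017-06-12
dates_a = pd.date_range('2017-06-10', '2017-06-12')
dates_b = pd.date_range('2017-06-11', '2017-06-13')

a = pd.Series([80, 82, 85], index=dates_a)
b = pd.Series([62, 65, 10], index=dates_b)

# Arithmetic aligns by index label, not by position
diff = a - b
print(diff)

# Labels present in only one Series produce NaN in the result
print(diff.isna().sum())
```

Because NaN is a floating-point value, the result is a float Series even though both inputs hold integers.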

<ul><li><h2>Pandas DataFrames</h2></li></ul>

A pandas Series represents a single array of values, with an index label for each value. If
you want to have more than one Series of data that is aligned by a common index, then a
pandas DataFrame is used.
Note: In a way, a DataFrame is analogous to a database table in that it contains one or more
columns of data of heterogeneous types (but a single type for all items in each respective
column).
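
To make the note concrete, here is a small sketch with invented data (the `people` DataFrame and its column names are hypothetical). Each column keeps its own dtype, which can be inspected through the `dtypes` attribute:

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'age': [34, 28, 45],
    'height_m': [1.70, 1.82, 1.65],
})

# Each column reports its own type
print(people.dtypes)

# Helper functions in pd.api.types check a column's kind
print(pd.api.types.is_integer_dtype(people['age']))
print(pd.api.types.is_float_dtype(people['height_m']))
```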

In [13]:
temps1, temps2

(2017-06-10    80
 2017-06-11    82
 2017-06-12    85
 2017-06-13    90
 2017-06-14    83
 Freq: D, dtype: int64,
 2017-06-10    60
 2017-06-11    62
 2017-06-12    65
 2017-06-13    10
 2017-06-14    13
 Freq: D, dtype: int64)

In [14]:
temps_df = DataFrame({'Missoula': temps1, 'Philadelphia': temps2})
temps_df

            Missoula  Philadelphia
2017-06-10        80            60
2017-06-11        82            62
2017-06-12        85            65
2017-06-13        90            10
2017-06-14        83            13


In [15]:
test_df = DataFrame({'My1': [1,2,3], 'My2': [4,5,6]})
test_df

   My1  My2
0    1    4
1    2    5
2    3    6


In [16]:
# Pick only one column - 2 ways
temps_df['Missoula']
temps_df.Missoula

2017-06-10    80
2017-06-11    82
2017-06-12    85
2017-06-13    90
2017-06-14    83
Freq: D, Name: Missoula, dtype: int64

In [17]:
# Get Mean Value of Series
temps_df.Missoula.mean()

84.0

In [18]:
# Subtract Series within DataFrame
temps_df.Missoula - temps_df.Philadelphia

2017-06-10    20
2017-06-11    20
2017-06-12    20
2017-06-13    80
2017-06-14    70
Freq: D, dtype: int64

In [19]:
# Add New Column in DataFrame
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia
temps_df

            Missoula  Philadelphia  Difference
2017-06-10        80            60          20
2017-06-11        82            62          20
2017-06-12        85            65          20
2017-06-13        90            10          80
2017-06-14        83            13          70


In [20]:
# Get The Columns of DataFrame
temps_df.columns

Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')

In [21]:
# Pick the order of columns
temps_df[['Missoula', 'Philadelphia', 'Difference']]

            Missoula  Philadelphia  Difference
2017-06-10        80            60          20
2017-06-11        82            62          20
2017-06-12        85            65          20
2017-06-13        90            10          80
2017-06-14        83            13          70


In [22]:
# Get rows 1 - 3 
temps_df.Missoula[1:4]

2017-06-11    82
2017-06-12    85
2017-06-13    90
Freq: D, Name: Missoula, dtype: int64

In [23]:
# Entire rows of a DataFrame can be retrieved using its .loc and .iloc properties.
# The following code returns a Series representing the second row of temps_df,
# selected by the zero-based position of the row using the .iloc property:

temps_df.iloc[1]

# The row is returned as a Series, with the column names of the DataFrame
# pivoted into the index labels of the resulting Series.

Missoula        82
Philadelphia    62
Difference      20
Name: 2017-06-11 00:00:00, dtype: int64

In [24]:
# Rows can be explicitly accessed via index label using the .loc property. The following
# code retrieves a row by the index label:

temps_df.loc['2017-06-11']

# .loc selects by index label
# .iloc selects by integer (zero-based) position

Missoula        82
Philadelphia    62
Difference      20
Name: 2017-06-11 00:00:00, dtype: int64
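
The two cells above reach the same row by position and by label. The sketch below rebuilds a small frame with the same temperature values (under the illustrative name `temps`) to show that both lookups agree, and that slices behave differently: `.iloc` excludes the end position, while `.loc` includes the end label.

```python
import pandas as pd

dates = pd.date_range('2017-06-10', '2017-06-14')
temps = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83],
                      'Philadelphia': [60, 62, 65, 10, 13]},
                     index=dates)

# Same row, reached by zero-based position and by index label
by_position = temps.iloc[1]
by_label = temps.loc[pd.Timestamp('2017-06-11')]
same = by_position.equals(by_label)
print(same)

# Position slices exclude the end; label slices include it
pos_slice = temps.iloc[1:3]
label_slice = temps.loc['2017-06-11':'2017-06-13']
print(len(pos_slice), len(label_slice))
```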

In [25]:
temps_df

            Missoula  Philadelphia  Difference
2017-06-10        80            60          20
2017-06-11        82            62          20
2017-06-12        85            65          20
2017-06-13        90            10          80
2017-06-14        83            13          70


In [26]:
# get the values in the Difference column in the rows at positions 1 and 3
# using 0-based location

temps_df.iloc[[1, 3]].Difference


2017-06-11    20
2017-06-13    80
Freq: 2D, Name: Difference, dtype: int64

In [27]:
# compare with a slice: positions 1 and 2 (the end position 3 is excluded)
temps_df.iloc[1:3].Difference

2017-06-11    20
2017-06-12    20
Freq: D, Name: Difference, dtype: int64

In [28]:
# which values in the Missoula column are > 82?
temps_df.Missoula > 82

2017-06-10    False
2017-06-11    False
2017-06-12     True
2017-06-13     True
2017-06-14     True
Freq: D, Name: Missoula, dtype: bool

In [29]:
# return the rows where the temps for Missoula > 82

temps_df[temps_df.Missoula > 82]

# This technique of selection in pandas terminology is referred to as a Boolean selection,
# and will form the basis of selecting data based upon its values.

            Missoula  Philadelphia  Difference
2017-06-12        85            65          20
2017-06-13        90            10          80
2017-06-14        83            13          70
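
Boolean selection also works with several conditions at once. The following sketch rebuilds the temperature data under the illustrative name `temps`; the thresholds are chosen arbitrarily. Conditions are combined with `&` (and) or `|` (or), and each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

dates = pd.date_range('2017-06-10', '2017-06-14')
temps = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83],
                      'Philadelphia': [60, 62, 65, 10, 13]},
                     index=dates)

# Rows where Missoula is above 82 AND Philadelphia is below 20
both = temps[(temps.Missoula > 82) & (temps.Philadelphia < 20)]
print(both)

# Rows where Missoula is above 85 OR Philadelphia is above 61
either = temps[(temps.Missoula > 85) | (temps.Philadelphia > 61)]
print(either)
```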


<ul><li><h2>Loading CSV data from files</h2></li></ul>

In [30]:
# display the contents of test1.csv
# which command to use depends on your OS
!cat data/test1.csv # on non-Windows systems
# on Windows, use: !type data\test1.csv

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [31]:
# read the contents of the file into a DataFrame
df = pd.read_csv('data/test1.csv')
df

   ID                 Date  Salary
0   1  2000-01-01 00:00:00    1500
1   2  2000-01-02 00:00:00    1800
2   3  2000-01-03 00:00:00    1900
3   4  2000-01-04 00:00:00    2000
4   5  2000-01-05 00:00:00    1800
5   6  2000-01-06 00:00:00    1900
6   7  2000-01-07 00:00:00    1500
7   8  2000-01-08 00:00:00    1800
8   9  2000-01-09 00:00:00    1780
9  10  2000-01-10 00:00:00    1680


In [32]:
df.Date

0    2000-01-01 00:00:00
1    2000-01-02 00:00:00
2    2000-01-03 00:00:00
3    2000-01-04 00:00:00
4    2000-01-05 00:00:00
5    2000-01-06 00:00:00
6    2000-01-07 00:00:00
7    2000-01-08 00:00:00
8    2000-01-09 00:00:00
9    2000-01-10 00:00:00
Name: Date, dtype: object

In [33]:
df.Date[8]

'2000-01-09 00:00:00'

In [34]:
type(df.Date[8])

str

In [35]:
type(df.ID[8])

numpy.int64

In [36]:
# read the data and tell pandas the date column should be
# a date in the resulting DataFrame

df = pd.read_csv('data/test1.csv', parse_dates=['Date'])
df

   ID       Date  Salary
0   1 2000-01-01    1500
1   2 2000-01-02    1800
2   3 2000-01-03    1900
3   4 2000-01-04    2000
4   5 2000-01-05    1800
5   6 2000-01-06    1900
6   7 2000-01-07    1500
7   8 2000-01-08    1800
8   9 2000-01-09    1780
9  10 2000-01-10    1680


In [37]:
type(df.Date[0])

pandas._libs.tslibs.timestamps.Timestamp

In [38]:
# read in again, now specify the date column as being the
# index of the resulting DataFrame
df = pd.read_csv('data/test1.csv',
                 parse_dates=['Date'],
                 index_col='Date')

df

            ID  Salary
Date
2000-01-01   1    1500
2000-01-02   2    1800
2000-01-03   3    1900
2000-01-04   4    2000
2000-01-05   5    1800
2000-01-06   6    1900
2000-01-07   7    1500
2000-01-08   8    1800
2000-01-09   9    1780
2000-01-10  10    1680
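
For readers who do not have `data/test1.csv` at hand, the same steps can be tried on an in-memory string. This is only a sketch: the text in `csv_text` is an assumed three-row stand-in for the file, with the column layout taken from the output shown above, and `io.StringIO` wraps it so that `pd.read_csv` can treat it like a file.

```python
import io
import pandas as pd

# Assumed stand-in for the first rows of data/test1.csv
csv_text = """ID,Date,Salary
1,2000-01-01 00:00:00,1500
2,2000-01-02 00:00:00,1800
3,2000-01-03 00:00:00,1900
"""

# Without parse_dates, the Date column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.Date[0]).__name__)

# With parse_dates and index_col, Date becomes a DatetimeIndex
parsed = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Date'],
                     index_col='Date')
print(type(parsed.index).__name__)
print(parsed.loc[pd.Timestamp('2000-01-02'), 'Salary'])
```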


<ul><li><h2>Loading data from the Web</h2></li></ul>

In [42]:
# The following reads the data of the previous three months for GOOG (based on the current
# date), and prints the five most recent days of stock data:

import pandas_datareader.data as web 
from datetime import date
from dateutil.relativedelta import relativedelta

# read the last three months of data for GOOG
goog = web.DataReader("GOOG", "yahoo", date.today() + relativedelta(months=-3))
goog.tail()

ConnectionError: HTTPSConnectionPool(host='finance.yahoo.com', port=443): Max retries exceeded with url: /quote/GOOG/history?period1=1585702800&period2=1593651599&interval=1d&frequency=1d&filter=history (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002234A43B668>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

<ul><li><h2>Simplicity of visualization of pandas data</h2></li></ul>

In [None]:
# plot the Adj Close, High, and Low values we just read in
goog.plot(y=['Adj Close', 'High', 'Low']);