# Essential functionality



In [69]:
import pandas as pd
import numpy as np

## Reindexing

Reindexing creates a new object with the data conformed to a new index. Index labels that were not already present get missing values (NaN).

In [2]:
obj = pd.Series([4.5,7.2,-5.3,3.6], index=['d','b','a','c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [3]:
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [4]:
# for ordered data like time series it may be desirable to do some interpolation or filling of values when reindexing
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [5]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
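As a companion to the forward-fill example above, here is a minimal sketch of the opposite direction. It assumes the same `obj3` Series; the name `backfilled` is only illustrative. With `method='bfill'`, each gap takes the next valid value, so a trailing label with nothing after it stays missing.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the value of the next existing label
backfilled = obj3.reindex(range(6), method='bfill')

# Label 5 has no later value to borrow from, so it remains NaN
print(backfilled)
```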

In [6]:
# reindex can alter the row index, the columns, or both.
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [7]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0
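The cells above only reindex the rows. A short sketch of the column case follows, using the `columns` keyword of `reindex`; the `states` list and the variable names `by_columns` and `both` are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the columns: 'Utah' is new (all NaN), 'Ohio' is left out
states = ['Texas', 'Utah', 'California']
by_columns = frame.reindex(columns=states)

# Rows and columns can be reindexed in a single call
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(both)
```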


### Dropping Entries from an Axis

In [8]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [9]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [10]:
obj.drop(['c','d'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [11]:
# Index values can be deleted from either axis

data = pd.DataFrame(np.arange(16).reshape((4,4)), 
    index=['Ohio','Colorado','Utah','New York'], 
    columns=['one','two','three','four'])

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [12]:
# calling drop with a sequence of labels will drop values from the row labels (axis 0)
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [13]:
# you can drop values from the columns by passing axis=1 or axis='columns'
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [14]:
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


## Indexing, Selection and Filtering

In [15]:
obj = pd.Series(np.arange(4.), index=['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [16]:
obj['b']

1.0

In [17]:
obj[1]

1.0

In [18]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [19]:
obj[['b','a','d']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [20]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [21]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [22]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [23]:
# setting using these methods modifies the corresponding section of the Series
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [24]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [25]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [26]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [27]:
data[data['three'] > 5]  # selects only the rows where column 'three' is larger than 5

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [28]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [29]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


#### Selection with loc and iloc

loc and iloc enable you to select a subset of the rows and columns from a DataFrame using either axis labels (loc) or integer positions (iloc).

In [30]:
data.loc['Colorado', ['two','three']]

two      5
three    6
Name: Colorado, dtype: int64

In [31]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [32]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [33]:
data.iloc[[1,2],[3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [34]:
# both indexing functions work with slices in addition to single labels or lists of labels
data.loc[:'Utah', 'two']


Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [35]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


### Integer Indexes

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error.


In [36]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [37]:
ser[-1]

KeyError: -1

In this case, pandas could 'fall back' on integer indexing, but it's difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult.

In [38]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

On the other hand, with a non-integer index, there is no potential for ambiguity. 

In [39]:
ser2 = pd.Series(np.arange(3.), index=['a','b','c'])

In [40]:
ser2[-1]

2.0

If you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) and iloc (for integer positions).
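To make the loc/iloc distinction concrete, here is a small sketch. The `shifted` Series, with labels 10, 20, 30, is a hypothetical example chosen so that labels and positions cannot be confused.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# iloc is always positional, so -1 means "last element"
last = ser.iloc[-1]

# Hypothetical Series whose integer labels differ from the positions
shifted = pd.Series(np.arange(3.), index=[10, 20, 30])

first_by_label = shifted.loc[10]      # label 10
first_by_position = shifted.iloc[0]   # position 0

label_slice = shifted.loc[10:20]      # labels 10 and 20, endpoint included
position_slice = shifted.iloc[0:1]    # position 0 only, endpoint excluded
```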

In [41]:
# Note: slicing with integers is always integer-oriented (positional)

ser[:2]

0    0.0
1    1.0
dtype: float64

### Arithmetic and Data Alignment

An important pandas feature is the behavior of arithmetic between objects with different indexes. When you add objects together, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an 'automatic outer join' on the index labels.

In [42]:
s1 = pd.Series([7.3,-2.5,3.4,1.5], index=['a','c','d','e'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])

In [43]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [44]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [45]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns.

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls.
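The two DataFrame statements above can be checked with a short sketch. The frames `df_a`, `df_b`, `left` and `right` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                    index=['Ohio', 'Texas', 'Colorado'])
df_b = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Rows and columns are both aligned; only labels present in both
# frames (rows Ohio/Texas, columns b/d) produce non-missing results
total = df_a + df_b

# No row or column labels in common: every entry of the sum is NaN
left = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'B': [3, 4]}, index=['p', 'q'])
disjoint = left + right
print(total)
```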

### Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other.

In [47]:
from numpy import reshape


df1 = pd.DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

In [48]:
df2.loc[1, 'b'] = np.nan

In [49]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [50]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [51]:
df1 + df2  # adding these together results in NA values in the locations that don't overlap

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [52]:
# passing fill_value=0 to the add method substitutes 0 for a value that is missing in only one of the two objects

df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
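Two related calls are sketched below, as an aside rather than as part of the original notebook: the reversed arithmetic methods (here `rdiv`), which swap the operands, and `reindex` with its own `fill_value` argument. The names `inverse_op`, `inverse_method` and `widened` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Each arithmetic method has a reversed counterpart starting with 'r':
# df1.rdiv(1) computes 1 / df1
inverse_op = 1 / df1
inverse_method = df1.rdiv(1)

# reindex also accepts fill_value: the new column 'e' is filled with 0
widened = df1.reindex(columns=df2.columns, fill_value=0)
print(widened)
```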


### Operations between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined.

First, as a motivating example, consider the difference between a two-dimensional array and one of its rows.

In [53]:
arr = np.arange(12.).reshape((3,4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [54]:
arr[0]

array([0., 1., 2., 3.])

In [55]:
arr - arr[0] 

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

- When we subtract arr[0] from arr, the subtraction is performed once for each row.
- This is referred to as broadcasting.

In [56]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah', ' Ohio',' Texas', 'Oregon'])

In [57]:
series = frame.iloc[0]

In [58]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [59]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows.

In [60]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame's columns or the Series' index, the objects will be reindexed to form the union.

In [63]:
series2 = pd.Series(range(3), index=['b','e','f'])

In [64]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods, for example:

In [65]:
series3 = frame['d']

In [66]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [67]:
series3

Utah       1.0
 Ohio      4.0
 Texas     7.0
Oregon    10.0
Name: d, dtype: float64

In [68]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


The axis that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
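One way to convince yourself of what `axis='index'` does is to compare it against plain NumPy broadcasting with an explicit column vector. This is a sketch under the assumption that the frame uses clean labels; the names `col_d`, `by_rows` and `manual` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
col_d = frame['d']

# Match the Series on the row index and broadcast across the columns
by_rows = frame.sub(col_d, axis='index')

# The same computation in NumPy: reshape the column to (4, 1) so that
# it broadcasts across the three columns of the (4, 3) array
manual = frame.to_numpy() - col_d.to_numpy()[:, np.newaxis]
print(by_rows)
```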

## Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.

In [72]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', ' Ohio',' Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.378925,0.299718,0.55587
Ohio,0.043119,-0.420183,-2.006215
Texas,-0.169012,-0.381466,0.496843
Oregon,0.165155,-1.433737,2.939899


In [73]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.378925,0.299718,0.55587
Ohio,0.043119,0.420183,2.006215
Texas,0.169012,0.381466,0.496843
Oregon,0.165155,1.433737,2.939899


In [74]:
# another frequent operation is applying a function on one-dimensional arrays to each column or row

f = lambda x: x.max() - x.min()

In [75]:
frame.apply(f)

b    0.547937
d    1.733456
e    4.946113
dtype: float64

Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series with the columns of frame as its index.

- If you pass axis='columns' to apply, the function will be invoked once per row instead.

In [76]:
frame.apply(f, axis='columns')

Utah      0.256152
 Ohio     2.049334
 Texas    0.878308
Oregon    4.373636
dtype: float64
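The function passed to `apply` does not have to return a scalar. The sketch below, which uses a deterministic frame instead of random numbers so the result is reproducible, returns a Series from the applied function and also formats one column element by element with `Series.map`. The helper `min_max` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(x):
    """Return both extremes of a column as a labeled Series."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# One Series per column is assembled into a DataFrame (rows: min, max)
summary = frame.apply(min_max)

# Element-wise transformation of a single column with Series.map
formatted = frame['e'].map(lambda x: f'{x:.2f}')
print(summary)
```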

## Sorting and Ranking

- To sort lexicographically by row or column index, use ```sort_index```, which returns a new, sorted object.

In [77]:
obj = pd.Series(range(4), index=['d','a','b','c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [78]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [80]:
# With a DataFrame, you can sort by index on either axis.
frame = pd.DataFrame(np.arange(8).reshape((2,4)), index=['three', 'one'], columns=['d','a','b','c'])

In [81]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [82]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [83]:
# the data is sorted in ascending order by default, but can be sorted in descending order too
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [84]:
# To sort a Series by its values, use its sort_values method
obj = pd.Series([4,7,-3,2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [85]:
# Any missing values are sorted to the end of the Series by default
obj = pd.Series([4, np.nan,7,-3,np.nan,2])
obj.sort_values()

3   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
4    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the ```by``` option of ```sort_values```.

In [86]:
frame = pd.DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [87]:
frame.sort_values(by='b') 

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [89]:
# to sort by multiple columns

frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
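When sorting by several columns, the direction can differ per column. A minimal sketch, assuming the same `frame` as above: `ascending` accepts a list with one flag per sort key.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then break ties by 'b' descending
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```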


#### Ranking 

Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [94]:
obj = pd.Series([4,-5,7,-3,0,2,4])
obj.rank()

0    5.5
1    1.0
2    7.0
3    2.0
4    3.0
5    4.0
6    5.5
dtype: float64

In [95]:
# ranks can also be assigned according to the order in which they're observed in the data

obj.rank(method='first')

0    5.0
1    1.0
2    7.0
3    2.0
4    3.0
5    4.0
6    6.0
dtype: float64

In [96]:
# or in descending order

obj.rank(ascending=False, method='max')

0    3.0
1    7.0
2    1.0
3    6.0
4    5.0
5    4.0
6    3.0
dtype: float64
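To compare the tie-breaking rules directly, the sketch below ranks the same Series with several values of the `method` argument and collects the results in one table. The `methods` list and the `ranks` frame are illustrative names.

```python
import pandas as pd

obj = pd.Series([4, -5, 7, -3, 0, 2, 4])

# The two 4s (positions 0 and 6) tie for ranks 5 and 6;
# each method resolves that tie differently
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

Note that `'dense'` is like `'min'`, except that ranks always increase by 1 between groups, so the largest value here gets rank 6 rather than 7.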

## Axis Indexes with Duplicate Labels

In [97]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [98]:
# the index's is_unique property can tell you whether its labels are unique or not
obj.index.is_unique

False

- Data selection is one of the main things that behaves differently with duplicates.
- Indexing a label with multiple entries returns a Series, while single entries return a scalar value.


In [99]:
obj['a']

a    0
a    1
dtype: int64

This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.

The same logic applies to indexing rows in a DataFrame.

In [100]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','b','a','b'])
df

Unnamed: 0,0,1,2
a,-0.133515,-0.448774,-1.869797
b,0.150461,0.965128,0.250872
a,1.279856,-0.75447,-0.547819
b,0.297246,0.817181,-0.831021
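A short sketch of the type difference described above. It uses deterministic values instead of random ones, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # label occurs twice: a Series of length 2
single = obj['c']     # label occurs once: a scalar

df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=['a', 'b', 'a', 'b'])

# Selecting a repeated row label returns a DataFrame with both rows
rows_b = df.loc['b']
print(rows_b)
```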


# Summarizing and computing descriptive statistics

In [101]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b','c','d'], columns=['one','two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [102]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [103]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [104]:
# NA values are excluded unless the entire slice (row or column) is NA; this can be disabled with the skipna option
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

- ```idxmin``` returns the index value where the minimum value is attained
- ```idxmax``` returns the index value where the maximum value is attained

In [105]:
df.idxmin()

one    d
two    b
dtype: object

In [106]:
df.idxmax()

one    b
two    d
dtype: object

In [107]:
# other methods are accumulations

df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [109]:
df.describe() # produces a standard description of the numeric data

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [110]:
# on non-numeric data, describe produces alternative summary statistics

obj = pd.Series(['a','b','c','d']*4)
obj.describe()

count     16
unique     4
top        a
freq       4
dtype: object

## Correlation and covariance

In [111]:
import pandas_datareader.data as web

In [112]:
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [113]:
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})

volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})

In [114]:
returns = price.pct_change()

In [115]:
returns.tail()

Date,AAPL,IBM,MSFT,GOOG
2022-07-21,0.015094,-0.015714,0.009799,0.002964
2022-07-22,-0.008111,0.008651,-0.016916,-0.058067
2022-07-25,-0.007398,0.002261,-0.005876,-0.001384
2022-07-26,-0.008826,-0.003579,-0.026774,-0.025598
2022-07-27,0.034235,0.00812,0.066852,0.07739


- The ```corr``` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series.

- Relatedly, the ```cov``` method computes the covariance.

In [116]:
returns['MSFT'].corr(returns['IBM'])

0.4766658141596792

In [117]:
returns['MSFT'].cov(returns['IBM'])

0.00015267288242659132

In [118]:
# since MSFT is a valid Python attribute name, we can select these columns using more concise syntax

returns.MSFT.corr(returns.IBM)

0.4766658141596792

DataFrame's ```corr``` and ```cov``` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively.

In [119]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.432731,0.756931,0.682604
IBM,0.432731,1.0,0.476666,0.443424
MSFT,0.756931,0.476666,1.0,0.787012
GOOG,0.682604,0.443424,0.787012,1.0


In [120]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000411,0.00015,0.000287,0.000259
IBM,0.00015,0.000294,0.000153,0.000142
MSFT,0.000287,0.000153,0.000349,0.000275
GOOG,0.000259,0.000142,0.000275,0.000349


In [121]:
# corrwith computes pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame

returns.corrwith(returns.IBM)

AAPL    0.432731
IBM     1.000000
MSFT    0.476666
GOOG    0.443424
dtype: float64

In [122]:
# passing a dataframe computes the correlations of matching column names

returns.corrwith(volume)

AAPL   -0.076587
IBM    -0.113976
MSFT   -0.069770
GOOG   -0.081373
dtype: float64
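Because the stock data above depends on a download, here is an offline sketch on synthetic data showing how `corr` and `cov` relate: the Pearson correlation is the covariance divided by the product of the two standard deviations. The frame `fake_returns`, its column names and the seed are all assumptions made for this illustration, not real market data.

```python
import numpy as np
import pandas as pd

# Synthetic "returns": 250 rows of normally distributed noise (hypothetical)
rng = np.random.default_rng(0)
fake_returns = pd.DataFrame(rng.normal(scale=0.01, size=(250, 3)),
                            columns=['X', 'Y', 'Z'])

cov_xy = fake_returns['X'].cov(fake_returns['Y'])
corr_xy = fake_returns['X'].corr(fake_returns['Y'])

# Pearson correlation recomputed by hand from cov and std
manual_corr = cov_xy / (fake_returns['X'].std() * fake_returns['Y'].std())

# The DataFrame methods return the full matrices
corr_matrix = fake_returns.corr()
cov_matrix = fake_returns.cov()
```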

## Unique values, value counts and membership

- Another class of related methods extracts information about the values contained in a one-dimensional Series.

In [123]:
obj = pd.Series(['c','a','d','a','a','b','b','c','c'])

In [124]:
uniques = obj.unique()  # array of the unique values, in order of first appearance
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [126]:
obj.value_counts()  # count how often each value occurs, sorted by count in descending order

c    3
a    3
b    2
d    1
dtype: int64

In [127]:
# isin performs a vectorized set membership check and can be useful in 
# filtering a dataset down to a subset of values in a Series or column in a DataFrame

mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [128]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
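The same methods apply to a single column of a DataFrame. The sketch below filters rows with `isin` and counts values in a column; the `df` frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({'state': ['Ohio', 'Texas', 'Utah', 'Ohio', 'Texas', 'Ohio'],
                   'value': range(6)})

# Keep only the rows whose 'state' is in the given set
subset = df[df['state'].isin(['Ohio', 'Utah'])]

# Frequency of each state, most common first
counts = df['state'].value_counts()
print(counts)
```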