# Introduction to pandas Data Structures 

In [2]:
import pandas as pd 
import numpy as np 

# 1) Series

In [48]:
obj = pd.Series([4, 7, -7])

In [4]:
print(obj)

0    4
1    7
2   -7
dtype: int64


The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively: 

In [5]:
obj.values

array([ 4,  7, -7], dtype=int64)

In [6]:
obj.index

RangeIndex(start=0, stop=3, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label: 

In [7]:
obj2 = pd.Series([4,7,-2,-1], index = ['d', 'b', 'a', 'c'])

In [8]:
obj2

d    4
b    7
a   -2
c   -1
dtype: int64

In [9]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [11]:
obj2.values


array([ 4,  7, -2, -1], dtype=int64)

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values: 

In [12]:
obj2['a']

-2

In [13]:
obj2['b']

7

In [15]:
obj2['c']

-1

In [16]:
obj2*2

d     8
b    14
a    -4
c    -2
dtype: int64

In [17]:
obj+3

0     7
1    10
2    -4
dtype: int64

In [19]:
obj2[['a','b']]

a   -2
b    7
dtype: int64

In [20]:
obj2[['c','a','d']]

c   -1
a   -2
d    4
dtype: int64

Here ['c', 'a', 'd'] or ['a','b'] is interpreted as a list of indices, even though it contains strings instead of integers. 

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link: 

In [21]:
obj2[obj2 > 0]

d    4
b    7
dtype: int64

In [22]:
obj2*2

d     8
b    14
a    -4
c    -2
dtype: int64

In [23]:
obj2+3

d     7
b    10
a     1
c     2
dtype: int64

In [24]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.135335
c       0.367879
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict: 

In [25]:
'b' in obj2

True

In [26]:
'b' in obj2.index

True

In [28]:
'e' in obj2

False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict: 

In [30]:
sdata = {'Ohio' : 35000, 'Texas': 71000, 'Oregon' : 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

In [31]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you are only passing a dict, the index in the resulting Series will follow the order in which the keys were inserted into the dict, as the output above shows. You can override this by passing the dict keys in the order you want them to appear in the resulting Series: 

In [32]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [33]:
obj4 = pd.Series(sdata,index = states)

In [34]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

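Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object. 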

I will use the terms 'missing' or 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

In [35]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [36]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods: 

In [37]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [38]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [39]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [40]:
obj3 + obj4 

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

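A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations, as in the obj3 + obj4 sum above. Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation. 

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality: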

In [41]:
obj4.name = 'population' 

In [42]:
obj4.index.name = 'state'

In [43]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series index can be altered in-place by assignment:

In [46]:
obj 

0    4
1    7
2   -7
dtype: int64

In [51]:
obj.index = ['Bob', 'Steve', 'Jeff']

In [52]:
obj

Bob      4
Steve    7
Jeff    -7
dtype: int64

# 2) DataFrame: 

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays: 

In [23]:
data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2002, 2003],
        'pop' : [1.5, 2.3, 2.4, 20.7, 3.0, 4.5]
}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,2.3
2,Ohio,2002,2.4
3,Nevada,2001,20.7
4,Nevada,2002,3.0
5,Nevada,2003,4.5


If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table. 

For large DataFrames, the head method selects only the first five rows: 

In [55]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,2.3
2,Ohio,2002,2.4
3,Nevada,2001,20.7
4,Nevada,2002,3.0


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order: 

In [58]:
pd.DataFrame(data, columns = ['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,2.3
2,2002,Ohio,2.4
3,2001,Nevada,20.7
4,2002,Nevada,3.0
5,2003,Nevada,4.5


If you pass a column that isn't contained in the dict, it will appear with missing values in the result: 

In [59]:
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'],
                             index = ['one', 'two', 'three', 'four', 'five', 'six'])

In [60]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,2.3,
three,2002,Ohio,2.4,
four,2001,Nevada,20.7,
five,2002,Nevada,3.0,
six,2003,Nevada,4.5,


In [61]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [62]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 2.3, nan],
       [2002, 'Ohio', 2.4, nan],
       [2001, 'Nevada', 20.7, nan],
       [2002, 'Nevada', 3.0, nan],
       [2003, 'Nevada', 4.5, nan]], dtype=object)

In [63]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [64]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [65]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Rows can also be retrieved by name with the special loc attribute, or by position with iloc (more on this later): 

In [66]:
frame2.loc['three']

year     2002
state    Ohio
pop       2.4
debt      NaN
Name: three, dtype: object

In [67]:
frame2.loc['one']

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object

In [68]:
frame2.loc['six']

year       2003
state    Nevada
pop         4.5
debt        NaN
Name: six, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values: 

In [70]:
frame2['debt'] = 16.5 

In [71]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,2.3,16.5
three,2002,Ohio,2.4,16.5
four,2001,Nevada,20.7,16.5
five,2002,Nevada,3.0,16.5
six,2003,Nevada,4.5,16.5


In [72]:
frame2['debt'] = np.arange(6.)

In [73]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,2.3,1.0
three,2002,Ohio,2.4,2.0
four,2001,Nevada,20.7,3.0
five,2002,Nevada,3.0,4.0
six,2003,Nevada,4.5,5.0


When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes: 

In [74]:
val = pd.Series([-1.2, -3.4, 1.8], index = ['two', 'four', 'five'])

In [75]:
val 

two    -1.2
four   -3.4
five    1.8
dtype: float64

In [76]:
frame2['debt'] = val 

In [77]:
frame2 

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,2.3,-1.2
three,2002,Ohio,2.4,
four,2001,Nevada,20.7,-3.4
five,2002,Nevada,3.0,1.8
six,2003,Nevada,4.5,


Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict. 

As an example of del, I first add a new column of boolean values where the state column equals 'Ohio': 

In [78]:
frame2['eastern'] = frame2.state == 'Ohio'

In [79]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,2.3,-1.2,True
three,2002,Ohio,2.4,,True
four,2001,Nevada,20.7,-3.4,False
five,2002,Nevada,3.0,1.8,False
six,2003,Nevada,4.5,,False


In [80]:
del frame2['eastern']

In [81]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts: 

In [83]:
pop = {'Nevada' : {2001: 2.4, 2002: 3.1},
      'Ohio' : {2000 : 1.5, 2001 : 3.2, 2002 : 3.7}}
frame3 = pd.DataFrame(pop)

In [84]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,3.2
2002,3.1,3.7
2000,,1.5


In [85]:
frame3.values

array([[2.4, 3.2],
       [3.1, 3.7],
       [nan, 1.5]])

# Table 5-1. Possible data inputs to the DataFrame constructor

# 2D ndarray : A matrix of data, passing optional row and column labels
# Dictionary of arrays, lists, or tuples : Each sequence becomes a column in the DataFrame; all sequences must be the same length
# NumPy structured/record array : Treated as the “dictionary of arrays” case
# Dictionary of Series : Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
# Dictionary of dictionaries : Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case
# List of dictionaries or Series : Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels
# List of lists or tuples : Treated as the “2D ndarray” case
# Another DataFrame : The DataFrame’s indexes are used unless different ones are passed
# NumPy MaskedArray : Like the “2D ndarray” case except masked values are missing in the DataFrame result
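
A few rows of Table 5-1 can be sketched in code. The example below builds a DataFrame from a 2D ndarray, from a dict of Series, and from a list of dicts. The variable names and sample values are made up for illustration and do not appear elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# Dict of Series: the row index is the union of the Series indexes,
# so "y" has no value for "r1" and gets NaN there
from_series = pd.DataFrame({
    "x": pd.Series([1.0, 2.0], index=["r1", "r2"]),
    "y": pd.Series([3.0], index=["r2"]),
})

# List of dicts: each dict becomes a row; the keys are unioned into columns
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(from_array.shape)
print(from_series)
print(from_records)
```

In the last two cases pandas fills the positions that have no input value with NaN, which is the same alignment behavior seen with Series.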

# Index Object 

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index: 

In [86]:
obj = pd.Series(range(3), index = ['a', 'b', 'c'])

In [87]:
index = obj.index 

In [88]:
index 

Index(['a', 'b', 'c'], dtype='object')

In [89]:
index [1:]


Index(['b', 'c'], dtype='object')

# 3) Essential functionality 

# Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example: 

In [90]:
obj = pd.Series([4.5, 7.2, -4.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -4.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present: 

In [91]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2 

a   -4.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values: 

In [92]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])

In [94]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [95]:
obj3.reindex(range(6), method = 'ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result: 

In [5]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), 
                    index = ['a', 'c', 'd'], 
                    columns = ['Ohio', 'Texas', 'California'])
frame 

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [9]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [10]:
frame2 

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [18]:
states = ['Texas', 'Utah', 'California']
other_frame = frame.reindex(columns = states)
other_frame

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


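As we'll explore in more detail, you can also reindex by label-indexing with loc, and many users prefer to always do it this way. This works only if all of the requested labels already exist in the DataFrame (whereas reindex will insert missing data for new labels): 

In [None]:
other_frame.loc[['a', 'c', 'd'], states]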

In [25]:
script = pd.DataFrame({
        'Argument' : ['index','method','fill_value','limit','tolerance','level','copy'],
        'Description' : [
            'New sequence to use as index. Can be index instance or any other sequence-like Python data structure. An index will be used exactly as is without any copying', 
            'Interpolation (fill) method, "ffill" forward, while "bfill" fills backward', 
            'Substitute value to use when introducing missing data by reindexing', 
            'When forward -or backfilling, maximum size gap to fill', 
            'When forward -or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches', 
            'Match simple index on level of MultiIndex, otherwise select subset of.',
            'If true, always copy underlying data even if new index is equivalent to the old index. If False, do not copy the data when the indexes are equivalent'
        ]
    }
)

script

Unnamed: 0,Argument,Description
0,index,New sequence to use as index. Can be index ins...
1,method,"Interpolation (fill) method, ""ffill"" forward, ..."
2,fill_value,Substitute value to use when introducing missi...
3,limit,"When forward -or backfilling, maximum size gap..."
4,tolerance,"When forward -or backfilling, maximum size gap..."
5,level,"Match simple index on level of MultiIndex, oth..."
6,copy,"If true, always copy underlying data even if n..."
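
To make a few of the arguments in the table above concrete, here is a small sketch of method, limit, and fill_value. The Series colors and counts are made-up examples, not objects defined earlier in these notes.

```python
import pandas as pd

colors = pd.Series(["blue", "purple"], index=[0, 3])

# method="ffill" carries the last valid value forward; limit=1 caps the
# fill at one consecutive new label, so labels 2 and 5 stay missing
filled = colors.reindex(range(6), method="ffill", limit=1)
print(filled)

counts = pd.Series([10, 20], index=["a", "b"])

# fill_value replaces the NaN that reindex would otherwise introduce
padded = counts.reindex(["a", "b", "c"], fill_value=0)
print(padded)
```

Because fill_value supplies an integer here, padded keeps an integer dtype instead of being converted to float to hold NaN.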


# Dropping Entries from an Axis

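Dropping one or more entries from an axis is easy if you already have an index array or list without those entries, since you can use reindex or loc-based indexing. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis: 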

In [26]:
obj = pd.Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [27]:
new_obj = obj.drop('c')
new_obj 

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [28]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame: 


In [30]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                   index = ['Ohio', 'Colorado', 'Utah', 'New York'], 
                   columns = ['one', 'two', 'three', 'four'])
data 

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [31]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


You can drop values from the columns by passing axis = 1 or axis = 'columns': 

In [32]:
data.drop('two', axis = 1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [33]:
data.drop('three', axis = 1)

Unnamed: 0,one,two,four
Ohio,0,1,3
Colorado,4,5,7
Utah,8,9,11
New York,12,13,15


In [34]:
data.drop('four', axis = 1)

Unnamed: 0,one,two,three
Ohio,0,1,2
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


In [36]:
data.drop('one', axis = 1)

Unnamed: 0,two,three,four
Ohio,1,2,3
Colorado,5,6,7
Utah,9,10,11
New York,13,14,15


In [37]:
data.drop(['two', 'four'], axis = 'columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [38]:
data.drop(['two', 'one'], axis = 'columns')

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


In [39]:
data.drop(['two', 'three'], axis = 'columns')

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15


Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object: 

In [40]:
obj.drop('c', inplace = True)

In [41]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with inplace, as it destroys any data that is dropped.
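
The difference between the two modes can be checked directly. The sketch below rebuilds the same Series; the names trimmed, backup, and result are only illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

# Default: drop returns a new object and leaves the original untouched
trimmed = obj.drop("c")
print(len(obj), len(trimmed))   # 5 4

# inplace=True mutates obj and returns None, so keep a copy if the
# dropped data might be needed later
backup = obj.copy()
result = obj.drop("c", inplace=True)
print(result is None, "c" in obj, "c" in backup)   # True False True
```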

# Indexing, Selection, and Filtering 

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this: 

In [42]:
obj = pd.Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [43]:
obj[1]

1.0

In [44]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [46]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [47]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [48]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the end-point is inclusive: 

In [49]:
obj['b' : 'c'] = 5 
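
The assignment above does not display anything, so here is a short sketch that makes the inclusive endpoint visible by comparing a label slice with the equivalent positional slice. The names by_label and by_position are only illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj["b":"c"]       # label slice: both endpoints are included
by_position = obj.iloc[1:2]   # positional slice: the end is excluded

print(by_label.index.tolist())      # ['b', 'c']
print(by_position.index.tolist())   # ['b']

# Assigning through a label slice sets every element in the slice
obj.loc["b":"c"] = 5
print(obj.tolist())                 # [0.0, 5.0, 5.0, 3.0]
```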

Indexing into a DataFrame retrieves one or more columns, either with a single value or a sequence: 

In [17]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                   index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [51]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [52]:
data['three']

Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int32

In [53]:
data['one']

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

Indexing like this has a few special cases. First, slicing or selecting data with a boolean array: 

In [54]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [55]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison: 

In [56]:
data < 5 

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [57]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
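
A boolean DataFrame like data < 5 is mostly useful as a mask for assignment. The sketch below works on a copy so that the original data is left alone; the names mask and masked, and the choice of 0 as the replacement value, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

mask = data < 5          # boolean DataFrame with the same shape as data

masked = data.copy()     # work on a copy to keep data unchanged
masked[mask] = 0         # set every element where mask is True to 0

print(masked)
```

Only the positions where the mask is True are changed; every other value keeps its original place.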


# Selection with loc and iloc

As a preliminary example, let's select a single row and multiple columns by label: 

In [58]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

We'll then perform some similar selections with integers using iloc: 

In [59]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [60]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [61]:
data.iloc[[1,2], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


Both indexing functions work with slices in addition to single labels or lists of labels: 

In [62]:
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [64]:
data.iloc[:,:3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


A boolean Series such as data.three == 2 can be used with loc but not with iloc: 

In [6]:
data.loc[data.three == 2] 

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3


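There are many ways to select and rearrange the data contained in a pandas object. The following table summarizes the indexing options for a DataFrame: 

In [None]:
options = pd.DataFrame({
    'Type' : [
        'df[column]', 
        'df.loc[rows]',
        'df.loc[:, cols]',
        'df.loc[rows, cols]',
        'df.iloc[rows]',
        'df.iloc[:, cols]', 
        'df.iloc[rows, cols]',
        'df.at[row, col]',
        'df.iat[row, col]',
        'reindex method'
    ], 
    'Notes': [
        'Select a single column or sequence of columns; a boolean array or slice selects rows instead',
        'Select a single row or subset of rows by label',
        'Select a single column or subset of columns by label',
        'Select both rows and columns by label',
        'Select a single row or subset of rows by integer position',
        'Select a single column or subset of columns by integer position',
        'Select both rows and columns by integer position',
        'Select a single scalar value by row and column label',
        'Select a single scalar value by row and column integer position',
        'Select either rows or columns by labels'
    ]
})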
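
The scalar accessors at and iat are not demonstrated elsewhere in these notes, so here is a brief sketch comparing them with loc and iloc on the same example DataFrame. The result variable names are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Label-based and position-based selection of the same cells
by_label = data.loc["Utah", ["one", "two"]]
by_position = data.iloc[2, [0, 1]]
print(by_label.equals(by_position))   # True

# at and iat return a single scalar by label or by integer position
scalar_by_label = data.at["Utah", "two"]
scalar_by_position = data.iat[2, 1]
print(scalar_by_label, scalar_by_position)   # 9 9
```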

# Integer Index pitfalls

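Working with pandas objects indexed by integers can be a stumbling block for new users, since they work differently from built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error: 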

In [7]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [8]:
ser[-1]

KeyError: -1

In this case, pandas could "fall back" on integer indexing, but it is difficult to do this in general without introducing subtle bugs into the user code. Here we have an index containing 0, 1, and 2, but pandas does not want to guess what the user wants (label-based indexing or position-based): 

In [9]:
ser 

0    0.0
1    1.0
2    2.0
dtype: float64

On the other hand, with a noninteger index, there is no such ambiguity: 

In [10]:
ser2 = pd.Series(np.arange(3.), index = ["a", "b", "c"])

In [11]:
ser2[-1]

2.0

If you have an axis index containing integers, data selection will always be label oriented. As I said above, if you use loc (for labels) or iloc (for integers) you will get exactly what you want: 

In [14]:
ser.iloc[-1]

2.0

On the other hand, slicing with integers is always integer oriented: 

In [15]:
ser[:2]

0    0.0
1    1.0
dtype: float64

# Pitfalls with chained indexing 

The loc and iloc attributes can also be used to modify a DataFrame in place, but doing so requires some care. For example, we can assign to a row or column by integer position or label: 

In [21]:
data.iloc[2] = 5 

In [22]:
data.loc[:,"one"] = 1 

In [23]:
data

Unnamed: 0,one,two,three,four
Ohio,1,1,2,3
Colorado,1,5,6,7
Utah,1,5,5,5
New York,1,13,14,15


In [25]:
data.loc[data["four"] > 5] = 3

In [26]:
data

Unnamed: 0,one,two,three,four
Ohio,1,1,2,3
Colorado,3,3,3,3
Utah,1,5,5,5
New York,3,3,3,3


In [27]:
data.loc[data.three == 5, "three"] = 6 

In [28]:
data 

Unnamed: 0,one,two,three,four
Ohio,1,1,2,3
Colorado,3,3,3,3
Utah,1,5,6,5
New York,3,3,3,3


A good rule of thumb is to avoid chained indexing when doing assignments. There are other cases where pandas will generate SettingWithCopyWarning that have to do with chained indexing. I refer you to this topic in the online pandas documentation. 
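
The reason behind this rule of thumb can be sketched without triggering the warning itself. A chained assignment first selects rows and then assigns into the result of that selection. The example below makes that intermediate object explicit (the name subset and the explicit copy are assumptions for illustration) and contrasts it with a single loc call.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# First step of a chained assignment: a boolean selection produces a new
# object. The explicit copy makes that visible and avoids any warning.
subset = data.loc[data["three"] > 5].copy()
subset["three"] = 0
print(data["three"].tolist())   # [2, 6, 10, 14]  (original unchanged)

# A single loc call selects rows and column together and assigns in place
data.loc[data["three"] > 5, "three"] = 0
print(data["three"].tolist())   # [2, 0, 0, 0]
```

Writing the row selection and the column label inside one loc call leaves no intermediate object for the assignment to get lost in.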

# Arithmetic and Data Alignment

pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let's look at an example: 

In [29]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index = ["a", "c", "d", "e"])

In [33]:
s2 = pd.Series([-2.1, 3.6, -1.5,4, 3.1], index = ["a", "c", "e", "f","g"])

In [30]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [34]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [35]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations. 

In the case of DataFrame, alignment is performed on both rows and columns: 

In [3]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns = list("bcd"), index = ["Ohio","Texas","Colorado"])

In [2]:
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns = list("bde"),
                  index = ["Utah", "Ohio", "Texas", "Oregon"])

In [4]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [5]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Adding these returns a DataFrame with index and columns that are the unions of the ones in each DataFrame: 

In [6]:
df1+df2 

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Since the "c" and "e" columns are not found in both DataFrame objects, they appear as missing in the result. The same holds for the rows with labels that are not common to both objects. 

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls: 

In [7]:
df1 = pd.DataFrame({"A": [1,2]})

In [8]:
df2 = pd.DataFrame({"B": [3,4]})

In [9]:
df1

Unnamed: 0,A
0,1
1,2


In [10]:
df2

Unnamed: 0,B
0,3
1,4


In [11]:
df1+df2

Unnamed: 0,A,B
0,,
1,,


# Arithmetic methods with fill values 

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other. Here is an example where we set a particular value to NA (null) by assigning np.nan to it: 

In [12]:
df1 = pd.DataFrame(np.arange(12.).reshape((3,4)), 
                  columns = list('abcd'))

In [16]:
df2 = pd.DataFrame(np.arange(20.).reshape((4,5)),
                  columns = list('abcde'))

In [18]:
df2.loc[1,"b"] = np.nan

In [19]:
df2 

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Adding these results in missing values in the locations that don't overlap: 

In [20]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


Using the add method on df1, I pass df2 and an argument to fill_value, which substitutes the passed value for any missing values in the operation: 

In [21]:
df1.add(df2, fill_value = 0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value: 

In [22]:
df1.reindex(columns = df2.columns, fill_value = 0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [24]:
options = pd.DataFrame({
    "Method" : [
        "add, radd", 
        "sub, rsub", 
        "div, rdiv", 
        "floordiv, rfloordiv", 
        "mul, rmul",
        "pow, rpow"
    ], 
    "Description" : [
        "Methods for addition (+)", 
        "Methods for subtraction(-)",
        "Methods for division (/)",
        "Methods for floor division (//)",
        "Methods for multiplication (*)",
        "Methods for exponentiation (**)"
    ]
})

options

Unnamed: 0,Method,Description
0,"add, radd",Methods for addition (+)
1,"sub, rsub",Methods for subtraction(-)
2,"div, rdiv",Methods for division (/)
3,"floordiv, rfloordiv",Methods for floor division (//)
4,"mul, rmul",Methods for multiplication (*)
5,"pow, rpow",Methods for exponentiation (**)
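
Each method in the table has a counterpart starting with the letter r that reverses the order of the operands. The sketch below checks this for division and subtraction on a small made-up DataFrame (small is not one of the objects defined in these notes).

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list("ab"))

# rdiv(1) computes 1 / small, while div(1) computes small / 1
print((1 / small).equals(small.rdiv(1)))   # True

# rsub(1) computes 1 - small, while sub(1) computes small - 1
print((1 - small).equals(small.rsub(1)))   # True
print((small - 1).equals(small.sub(1)))    # True
```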


# Operations between DataFrame and Series 

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows: 

In [25]:
arr = np.arange(12.).reshape((3,4))

In [26]:
arr 

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [27]:
arr[0]

array([0., 1., 2., 3.])

In [28]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar: 

In [30]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)), 
                    columns = list("bde"), 
                    index = ["Utah", "Ohio", "Texas", "Oregon"])

In [31]:
series = frame.iloc[0] 

In [32]:
frame 

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [33]:
series 

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the columns of the DataFrame, broadcasting down the rows: 

In [34]:
frame - series 

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union: 

In [35]:
series2 = pd.Series(np.arange(3), index = ["b", "e", "f"])

In [36]:
series2

b    0
e    1
f    2
dtype: int32
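
The notes define series2 but never combine it with frame, so here is a self-contained sketch of the union behavior. It rebuilds frame and series2 with the same values; the name result is only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])

# Columns b and e exist on both sides and are added row by row;
# d (frame only) and f (series2 only) come out as all NaN
result = frame + series2
print(result)
```

The result has the columns b, d, e, and f, which is the union of the DataFrame's columns and the Series's index.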

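If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods and specify to match over the index. For example: 

In [38]: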
series3 = frame["d"]

In [39]:
frame 

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [40]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [41]:
frame.sub(series3, axis = "index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


The axis that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis = "index") and broadcast across the columns. 

# Function Application and Mapping 

NumPy ufuncs (element-wise array methods) also work with pandas objects: 

In [42]:
frame = pd.DataFrame(np.random.standard_normal((4,3)), 
                    columns = list("bde"), 
                    index = ["Utah", "Ohio", "Texas", "Oregon"])

In [43]:
frame 

Unnamed: 0,b,d,e
Utah,-1.231779,-1.333524,0.665756
Ohio,-0.240269,1.248649,2.524658
Texas,-0.652251,1.351951,-1.454387
Oregon,-0.391071,0.815062,0.3841


In [44]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.231779,1.333524,0.665756
Ohio,0.240269,1.248649,2.524658
Texas,0.652251,1.351951,1.454387
Oregon,0.391071,0.815062,0.3841


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this: 

In [45]:
def fi(x): 
    return x.max() - x.min() 

In [46]:
frame.apply(fi)

b    0.991510
d    2.685475
e    3.979046
dtype: float64

Here the function fi, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index. 

If you pass axis = "columns" to apply, the function will be invoked once per row instead. A helpful way to think about this is as "apply across the columns". 

In [48]:
frame.apply(fi, axis = "columns")

Utah      1.999280
Ohio      2.764927
Texas     2.806338
Oregon    1.206133
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary. 
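
As a quick check of that statement, the sketch below compares apply with the built-in methods. It uses a deterministic DataFrame rather than the random one above, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Column sums via apply and via the built-in method give the same result
via_apply = frame.apply(lambda col: col.sum())
via_method = frame.sum()
print(via_apply.equals(via_method))   # True

# axis="columns" reduces across the columns, giving one value per row
row_means = frame.mean(axis="columns")
print(row_means.tolist())             # [1.0, 4.0, 7.0, 10.0]
```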

The function passed to apply need not return a scalar value; it can also return a Series with multiple values: 


In [49]:
def f(x): 
    return pd.Series([x.min(), x.max()], index = ["min", "max"])
frame.apply(f) 

Unnamed: 0,b,d,e
min,-1.231779,-1.333524,-1.454387
max,-0.240269,1.351951,2.524658


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map): 

In [50]:
def my_format(x): 
    return f"{x:.2f}"

frame.applymap(my_format)

Unnamed: 0,b,d,e
Utah,-1.23,-1.33,0.67
Ohio,-0.24,1.25,2.52
Texas,-0.65,1.35,-1.45
Oregon,-0.39,0.82,0.38


The reason for the name applymap is that Series has a map method for applying an element-wise function: 

In [52]:
frame["e"].map(my_format)

Utah       0.67
Ohio       2.52
Texas     -1.45
Oregon     0.38
Name: e, dtype: object

# Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object: 

In [53]:
obj = pd.Series(np.arange(4), index = ["d", "a", "b", "c"])
obj

d    0
a    1
b    2
c    3
dtype: int32

In [54]:
obj.sort_index() 

a    1
b    2
c    3
d    0
dtype: int32

With a DataFrame, you can sort by index on either axis: 

In [56]:
frame = pd.DataFrame(np.arange(8).reshape((2,4)), 
                    index = ["Three", "one"], 
                    columns = ["d","b","a","c"])

In [57]:
frame 

Unnamed: 0,d,b,a,c
Three,0,1,2,3
one,4,5,6,7


In [58]:
frame.sort_index()

Unnamed: 0,d,b,a,c
Three,0,1,2,3
one,4,5,6,7


In [60]:
frame.sort_index(axis = "columns")

Unnamed: 0,a,b,c,d
Three,2,1,3,0
one,6,5,7,4


The data is sorted in ascending order by default but can be sorted in descending order, too: 

In [61]:
frame.sort_index(axis = "columns", ascending =False)

Unnamed: 0,d,c,b,a
Three,0,3,1,2
one,4,7,5,6


To sort a Series by its values, use its sort_values method: 

In [62]:
obj = pd.Series([4,7,-72, 2])

In [63]:
obj.sort_values() 

2   -72
3     2
0     4
1     7
dtype: int64

Any missing values are sorted to the end of the Series by default: 

In [64]:
obj = pd.Series([4,np.nan, 7, np.nan, -3, 2])

In [65]:
obj.sort_values() 

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

Missing values can be sorted to the start instead by using the na_position option: 

In [66]:
obj.sort_values(na_position = "first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to sort_values: 

In [67]:
frame = pd.DataFrame ({"b" : [4,7,-3,2], 
                      "a" : [0,1,0,1]})

In [68]:
frame 

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1
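
A minimal sketch of the single-column case, rebuilding the same frame so that the snippet runs on its own:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort rows by the values in column "b"; the original row labels
# travel with their rows.
by_b = frame.sort_values("b")
print(by_b)
```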


To sort by multiple columns, pass a list of names: 

In [69]:
frame.sort_values(["a", "b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
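
When sorting by several columns, the ascending argument can also take a list with one flag per column. The following sketch sorts column a in ascending order and breaks ties by column b in descending order:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# One ascending flag per sort key: "a" ascending, then "b" descending.
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```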


Ranking assigns ranks from one through the number of valid data points in an array, starting from the lowest value. The rank methods for Series and DataFrame are the place to look. By default, rank breaks ties by assigning each group the mean rank: 

In [70]:
obj = pd.Series([7,-5,7,4,2,8,4])

In [71]:
obj.rank() 

0    5.5
1    1.0
2    5.5
3    3.5
4    2.0
5    7.0
6    3.5
dtype: float64

Ranks can also be assigned according to the order in which they're observed in the data: 

In [72]:
obj.rank(method = 'first')

0    5.0
1    1.0
2    6.0
3    3.0
4    2.0
5    7.0
6    4.0
dtype: float64

Here, instead of using the average rank 5.5 for the entries 0 and 2, they have been set to 5 and 6 because label 0 precedes label 2 in the data. 

You can rank in descending order, too: 

In [73]:
obj.rank(ascending = False)

0    2.5
1    7.0
2    2.5
3    4.5
4    6.0
5    1.0
6    4.5
dtype: float64

DataFrame can compute ranks over the rows or the columns: 

In [75]:
frame = pd.DataFrame({
    "b" : [4.3,7,-3,2], 
    "a" : [0,1,0,1],
    "c" : [-2, 5, 8, -2.5]
})

In [76]:
frame 

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [78]:
frame.rank(axis = "columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0
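
To see how the tie-breaking methods differ, it can help to rank the same Series with each of them and put the results side by side. This is only an illustrative sketch; the names methods and comparison are introduced here and do not appear elsewhere in the notebook.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 8, 4])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(comparison)
```

The two 7s and the two 4s are the tied groups, so those are the rows where the columns disagree.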


In [79]:
options = pd.DataFrame({
    "Method" : [
        "average", 
        "min",
        "max", 
        "first",
        "dense" 
    ], 

    "Description" : [
        "Default, assign the average rank to each entry in the equal group", 
        "Use the minimum rank for the whole group", 
        "Use the maximum rank for the whole group",
        "Assign ranks in the order the values appear in the data", 
        "Like method = 'min', but ranks always increase by 1 between groups rather than the number of equal elements in a group",
    ]
})

options 

# Axis indexes with Duplicate Labels 

Up until now, almost all of the examples we have looked at have unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it's not mandatory. Let's consider a small Series with duplicate indices: 

In [80]:
obj = pd.Series(np.arange(5), index = ["a", "a", "b", "b", "c"])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

The is_unique property of the index can tell you whether or not its labels are unique: 

In [81]:
obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value: 

In [82]:
obj["a"]

a    0
a    1
dtype: int32

In [83]:
obj["c"]

4

This can make your code more complicated, as the output type from indexing can vary based on whether or not a label is repeated. 
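
One way to sidestep the varying return type, sketched below, is to pass the label inside a list. A list selector returns a Series whether or not the label is repeated.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A list selector returns a Series for repeated and unique labels alike.
repeated = obj.loc[["a"]]
single = obj.loc[["c"]]

print(type(repeated).__name__, len(repeated))
print(type(single).__name__, len(single))
```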

The same logic extends to indexing rows (or columns) in a DataFrame: 


In [84]:
df = pd.DataFrame(np.random.standard_normal((5,3)), 
                 index = ["a", "a", "b", "b", "c"])

df 

Unnamed: 0,0,1,2
a,-0.037619,-0.940243,-1.505569
a,0.877829,0.171656,1.007554
b,0.520287,-1.004429,0.476966
b,-1.30498,-1.31359,1.025022
c,-0.201832,-0.177136,-0.567304


In [86]:
df.loc["b"]

Unnamed: 0,0,1,2
b,0.520287,-1.004429,0.476966
b,-1.30498,-1.31359,1.025022


In [87]:
df.loc["c"] 

0   -0.201832
1   -0.177136
2   -0.567304
Name: c, dtype: float64

# Summarizing and Computing Descriptive Statistics 

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame: 

In [2]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                   [np.nan, np.nan], [0.75, -1.3]], 
                  index = ["a", "b", "c", "d"], 
                  columns = ["one", "two"])

df 

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame's sum method returns a Series containing column sums: 

In [90]:
df.sum() 

one    9.25
two   -5.80
dtype: float64

Passing axis = "columns" or axis = 1 sums across the columns instead: 
    

In [92]:
df.sum(axis = "columns")

a    1.40
b    2.60
c    0.00
d   -0.55
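dtype: float64

By default, NA values are excluded, so a row or column that contains only NA values sums to 0 (as in row c above). Passing skipna = False disables this, in which case any NA value in a row or column makes the corresponding result NA: 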

In [93]:
df.sum(axis = "index", skipna = False)

one   NaN
two   NaN
dtype: float64

In [94]:
df.sum(axis = "columns", skipna = False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

Some aggregations, like mean, require at least one non-NA value to yield a valid result, so here we have: 

In [95]:
df.mean(axis = "columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
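
If a sum of 0 for an all-NA row is not what you want, the min_count argument of sum offers a middle ground between the default and skipna = False. A minimal sketch on the same small DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Require at least one non-NA value per row: rows with some data are
# summed as usual, while the all-NA row "c" yields NaN instead of 0.
row_sums = df.sum(axis="columns", min_count=1)
print(row_sums)
```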

In [99]:
options = pd.DataFrame({
    "Method" : [
        "axis", 
        "skipna", 
        "level",
    ], 
    
    "Description" : [
        "Axis to reduce over, 'index' for DataFrame's rows and 'columns' for columns", 
        "Exclude missing values, True by default", 
        "Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)",
    ]
})

options 

Unnamed: 0,Method,Description
0,axis,"Axis to reduce over, 'index' for DataFrame's r..."
1,skipna,"Exclude missing values, True by default"
2,level,Reduce grouped by level if the axis is hierarc...


Some methods, like idxmin and idxmax, return indirect statistics, like the index value where the minimum or maximum values are attained: 

In [3]:
df.idxmax() 

one    b
two    d
dtype: object
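
Besides reductions, some methods are accumulations, which return an object of the same shape as the input. A minimal sketch with cumsum on the same small DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Running total down each column; NA positions stay NA and are
# skipped when accumulating.
cumulative = df.cumsum()
print(cumulative)
```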

Other methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot: 

In [4]:
df.describe() 

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On non-numeric data, describe produces alternative summary statistics: 

In [3]:
obj = pd.Series(["a", "a","b","c"]*3)

In [4]:
obj.describe() 

count     12
unique     3
top        a
freq       6
dtype: object

In [10]:
options = pd.DataFrame({
    "Method": [
        "count",
        "describe", 
        "min, max", 
        "argmin, argmax", 
        "idxmin, idxmax", 
        "quantile", 
        "sum",
        "mean", 
        "median",
        "mad",
        "prod",
        "var",
        "std", 
        "skew",
        "kurt",
        "cumsum",
        "cummin, cummax", 
        "cumprod", 
        "diff",
        "pct_change", 
    ],

    "Description": [
        "Number of non-NA values", 
        "Compute set of summary statistics", 
        "Compute minimum and maximum values",
        "Compute index locations (integers) at which minimum or maximum value is obtained, respectively; not available on DataFrame objects", 
        "Compute index labels at which minimum or maximum value is obtained, respectively", 
        "Compute sample quantile ranging from 0 to 1 (default: 0.5)", 
        "Sum of values",
        "Mean of values", 
        "Arithmetic median (50% quantile) of values", 
        "Mean absolute deviation from mean value", 
        "Product of all values", 
        "Sample variance of values", 
        "Sample standard deviation of values",
        "Sample skewness (third moment) of values", 
        "Sample kurtosis (fourth moment) of values", 
        "Cumulative sum of values", 
        "Cumulative minimum or maximum of values, respectively",
        "Cumulative product of values", 
        "Compute first arithmetic difference (useful for time series)",
        "Compute percent changes",
    ]
})

options 

Unnamed: 0,Method,Description
0,count,Number of non-NA values
1,describe,Compute set of summary statistics
2,"min, max",Compute minimum and maximum values
3,"argmin, argmax",Compute index locations (integers) at which mi...
4,"idxmin, idxmax",Compute index labels at which minimum or maxim...
5,quantile,Compute sample quantile ranging from 0 to 1 (d...
6,sum,Sum of values
7,mean,Mean of values
8,median,Arithmetic median (50% quantile) of values
9,mad,Mean absolute deviation from mean value


# Correlation and Covariance 

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes originally obtained from Yahoo! Finance and available in binary Python pickle files you can find in the accompanying datasets for the book: 

In [14]:
price = pd.read_pickle("C:/Users/ADMIN/Documents/GitHub/python/Pandas_python/yahoo_price.pkl")

In [17]:
volume = pd.read_pickle("C:/Users/ADMIN/Documents/GitHub/python/Pandas_python/yahoo_volume.pkl")

In [18]:
returns = price.pct_change() 

In [21]:
returns.tail() 

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


In [22]:
returns["MSFT"].corr(returns["IBM"])

0.49976361144151144

In [23]:
returns["MSFT"].cov(returns["IBM"])

8.870655479703546e-05
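
The two quantities are closely related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on two short, made-up return series (hypothetical values, not the Yahoo! Finance data), chosen without missing values so that every method sees the same observations.

```python
import pandas as pd

# Hypothetical daily returns; not taken from the Yahoo! Finance data.
x = pd.Series([0.010, -0.020, 0.015, 0.003, -0.007])
y = pd.Series([0.008, -0.010, 0.020, -0.001, -0.004])

corr = x.corr(y)

# Covariance divided by the product of the sample standard deviations.
manual = x.cov(y) / (x.std() * y.std())

print(corr, manual)
```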

Since MSFT is a valid Python variable name, we can also select these columns using more concise syntax: 

In [24]:
returns.MSFT.corr(returns.IBM)

0.49976361144151144

DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively: 

In [25]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [27]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


Using DataFrame's corrwith method, you can compute pair-wise correlations between a DataFrame's columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column: 

In [29]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here, I compute correlations of percent changes with volume: 

In [30]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

Passing axis = "columns" does things row-by-row instead. In all cases, the data points are aligned by label before the correlation is computed. 

# Unique Values, Value Counts, and Membership 

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example: 

In [31]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

In [32]:
uniques = obj.unique() 

In [33]:
uniques 

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are returned in the order in which they first appear, not in sorted order, but they can be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies: 

In [34]:
obj.value_counts() 

c    3
a    3
b    2
d    1
Name: count, dtype: int64

The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with NumPy arrays or other Python sequences: 

In [35]:
pd.value_counts(obj.to_numpy(), sort = False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64
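
As far as I know, recent pandas releases (the 2.x series) emit a deprecation warning for the top-level pd.value_counts. A sketch of an alternative that relies only on the Series method is to wrap the array in a Series first:

```python
import pandas as pd

# Start from a plain NumPy array, as in the cell above.
values = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"]).to_numpy()

# Wrap the array in a Series and use the value_counts method.
counts = pd.Series(values).value_counts(sort=False)
print(counts)
```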

isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame: 

In [36]:
obj 

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [37]:
mask = obj.isin(["b", "c"])

In [38]:
mask 

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [39]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly nondistinct values into another array of distinct values: 

In [40]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])

In [41]:
unique_vals = pd.Series(["c", "b", "a"])

In [42]:
indices = pd.Index(unique_vals).get_indexer(to_match)

In [43]:
indices 

array([0, 2, 1, 1, 0, 2], dtype=int64)
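
A detail worth knowing is what happens when a value has no match. In the sketch below, the label "z" (a made-up value for illustration) is absent from the index, and get_indexer marks it with -1:

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])

# "z" does not occur in unique_vals, so its position is reported as -1.
indices = unique_vals.get_indexer(["c", "a", "z"])
print(indices)
```

Because -1 is also a valid position when indexing from the end, it is safest to check for it explicitly before using the result as an indexer.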

In [46]:
options = pd.DataFrame({
    "Method" : [
        "isin", 
        "get_indexer", 
        "unique", 
        "value_counts", 
    ],

    "Description" : [
        "Compute a Boolean array indicating whether each Series or DataFrame value is contained in the passed sequence of values", 
        "Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations", 
        "Compute an array of unique values in a Series, returned in the order observed", 
        "Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order", 
    ]
})
options

Unnamed: 0,Method,Description
0,isin,Compute a Boolean array indicating whether eac...
1,get_indexer,Compute integer indices for each value in an a...
2,unique,"Compute an array of unique values in a Series,..."
3,value_counts,Return a Series containing unique values as it...


In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here's an example: 

In [47]:
data = pd.DataFrame({
    "Qu1" : [1,3,4,3,4], 
    "Qu2" : [2,3,1,2,3], 
    "Qu3" : [1,5,2,4,4]
})

data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


We can compute the value counts for a single column, like so: 

In [48]:
data["Qu1"].value_counts().sort_index() 

Qu1
1    1
3    2
4    2
Name: count, dtype: int64

To compute this for all columns, pass pandas.value_counts to the DataFrame's apply method: 

In [49]:
result = data.apply(pd.value_counts).fillna(0) 

Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column. 
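
Since the cell above does not display its result, here is a self-contained sketch of the same computation. It calls the Series value_counts method through a lambda instead of passing pandas.value_counts, which is an equivalent formulation chosen for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "Qu1": [1, 3, 4, 3, 4],
    "Qu2": [2, 3, 1, 2, 3],
    "Qu3": [1, 5, 2, 4, 4],
})

# Count values per column; values missing from a column become 0.
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```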

There is also a DataFrame.value_counts method, but it computes counts considering each row of the DataFrame as a tuple to determine the number of occurrences of each distinct row: 

In [50]:
data = pd.DataFrame({
    "a" : [1,1,1,2,2], 
    "b" : [0,0,1,0,0], 
})

data

Unnamed: 0,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [51]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64

In this case, the result has an index representing the distinct rows as a hierarchical index. 
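
Individual counts can then be looked up by passing the row's values as a tuple. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [0, 0, 1, 0, 0],
})

counts = data.value_counts()

# Each key of the hierarchical index is a tuple of (a, b) values.
n_1_0 = counts.loc[(1, 0)]
print(list(counts.index.names), n_1_0)
```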

# New vocabulary: 

In [5]:
Waterproof_clothes_and_windows = pd.DataFrame({
    "the public" : "cộng đồng, công chúng", 
    "upset" : "bực tức, khó chịu", 
    "to be caught out unprepared by something" : "bị bất ngờ bởi một điều gì đó (thường là điều đó không đem lại niềm vui)", 
    "waterproof" : "chống nước", 
    "exactly" : "một cách chính xác", 
    "jacket" : "áo khoác", 
    "bounce off" : "nảy ra khỏi",
    "roll off" : "lăn và rơi ra khỏi", 
    "water - repellent" : "chất chống nước (thường có tác dụng ở mức trung bình, không phải hoàn toàn chống nước)", 
    "paint" : "sơn", 
    "spray" : "xịt, phun", 
    "boots" : "đôi bốt", 
    "muddy" : "có chứa, dính bùn đất", 
    "dust" : "bụi bẩn", 
    "particle" : "hạt vật chất (trong vật lý)", 
    "keep off" : "không nằm trên, không bám trên", 
    "glass" : "thủy tinh, kính"
}, index = np.arange(16))

Waterproof_clothes_and_windows

Unnamed: 0,the public,upset,to be caught out unprepared by something,waterproof,exactly,jacket,bounce off,roll off,water - repellent,paint,spray,boots,muddy,dust,particle,keep off,glass
0,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
1,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
2,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
3,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
4,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
5,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
6,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
7,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
8,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"
9,"cộng đồng, công chúng","bực tức, khó chịu",bị bất ngờ bởi một điều gì đó (thường là điều ...,chống nước,một cách chính xác,áo khoác,nảy ra khỏi,lăn và rơi ra khỏi,chất chống nước (thường có tác dụng ở mức trun...,sơn,"xịt, phun",đôi bốt,"có chứa, dính bùn đất",bụi bẩn,hạt vật chất (trong vật lý),"không nằm trên, không bám trên","thủy tinh, kính"


In [6]:
Waterproof_clothes_and_windows = pd.DataFrame({
    "Vocabulary" : [
        "the public",
        "upset",
        "to be caught out unprepared by something",
        "waterproof",
        "exactly",
        "jacket",
        "bounce off",
        "roll off",
        "water - repellent",
        "paint",
        "spray",
        "boots",
        "muddy", 
        "dust",
        "particle",
        "keep off",
        "glass"
    ], 
    "Meaning" : [
        "cộng đồng, công chúng",
        "bực tức, khó chịu",
        "bị bất ngờ bởi một điều gì đó (thường là điều đó không đem lại niềm vui)",
        "chống nước",
        "một cách chính xác",
        "áo khoác",
        "nảy ra khỏi",
        "lăn và rơi ra khỏi",
        "chất chống nước (thường có tác dụng ở mức trung bình, không phải hoàn toàn chống nước)", 
        "sơn",
        "xịt, phun",
        "đôi bốt",
        "có chứa, dính bùn đất",
        "bụi bẩn",
        "hạt vật chất (trong vật lý)", 
        "không nằm trên, không bám trên",
        "thủy tinh, kính"
    ]
})

Waterproof_clothes_and_windows

Unnamed: 0,Vocabulary,Meaning
0,the public,"cộng đồng, công chúng"
1,upset,"bực tức, khó chịu"
2,to be caught out unprepared by something,bị bất ngờ bởi một điều gì đó (thường là điều ...
3,waterproof,chống nước
4,exactly,một cách chính xác
5,jacket,áo khoác
6,bounce off,nảy ra khỏi
7,roll off,lăn và rơi ra khỏi
8,water - repellent,chất chống nước (thường có tác dụng ở mức trun...
9,paint,sơn


In [8]:
Chimp_escapes_from_a_zoo = pd.DataFrame({
    "Vocabulary" : [
        "chimp", 
        "escape", 
        "climb up", 
        "power lines", 
        "get away", 
        "veterinarian", 
        "tranquilizer", 
    ], 
    "Meaning" : [
        "tinh tinh, vượn", 
        "trốn thoát", 
        "trèo lên", 
        "mạng lưới điện", 
        "chạy trốn, trốn thoát", 
        "bác sĩ thú y",
        "thuốc an thần" 
    ]
})

Chimp_escapes_from_a_zoo

Unnamed: 0,Vocabulary,Meaning
0,chimp,"tinh tinh, vượn"
1,escape,trốn thoát
2,climb up,trèo lên
3,power lines,mạng lưới điện
4,get away,"chạy trốn, trốn thoát"
5,veterinarian,bác sĩ thú y
6,tranquilizer,thuốc an thần


In [11]:
Collocations_business = pd.DataFrame({
    "Vocabulary" : [
        "Go into business", 
        "Abide by",
        "Make a deal",
        "Do business", 
        "Break even", 
        "Launch a new product", 
        "Sign a contract", 
        "Make a profit", 
        "Hire/take on staff/ employees", 
        "Do market research", 
        "Operate at a loss",
        "Going bankrupt", 
        "Tough competition",
        "Customer service", 
        "Terms of a contract", 
        "To breach a contract", 
        "To renew a contract", 
        "A contract runs out",
        "To terminate a contract", 
        "Set up a company"
    ],
    "Meaning": [
        "bắt đầu kinh doanh",
        "tuân thủ, tuân theo",
        "thỏa thuận, cam kết điều khoản",
        "kinh doanh",
        "hòa vốn",
        "ra mắt sản phẩm mới", 
        "ký kết hợp đồng",
        "sinh lợi nhuận",
        "thuê nhân viên",
        "làm khảo sát",
        "hoạt động thua lỗ",
        "phá sản",
        "cạnh tranh gay gắt",
        "dịch vụ khách hàng", 
        "điều khoản hợp đồng",
        "vi phạm hợp đồng",
        "gia hạn hợp đồng",
        "hợp đồng hết hạn",
        "chấm dứt hợp đồng",
        "thành lập công ty"
    ]
})

Collocations_business

Unnamed: 0,Vocabulary,Meaning
0,Go into business,bắt đầu kinh doanh
1,Abide by,"tuân thủ, tuân theo"
2,Make a deal,"thỏa thuận, cam kết điều khoản"
3,Do business,kinh doanh
4,Break even,hòa vốn
5,Launch a new product,ra mắt sản phẩm mới
6,Sign a contract,ký kết hợp đồng
7,Make a profit,sinh lợi nhuận
8,Hire/take on staff/ employees,thuê nhân viên
9,Do market research,làm khảo sát


In [9]:
vocabulary = pd.DataFrame({
    "Vocabulary" : [
        "Behind us", 
        "Purchase",
        "Committee", 
        "Sparkling",
        "creek",
        "Grill",
        "Acknowledge",
        "Gratitude",
        "Repay",
        "loan",
        "express gratitude",
        "assistant",
        "Warmly welcome",
        "Training session",
        "Attachment",
        "Onboarding",
        "Social gathering",
        "New hire",
        "Badge",
        "Health insurance",
        "Retirement",
        "Upon their arrival",
        "Purpose",
        "Trainer",
        "Identification document",
        "Cabinet",
        "Bet",
        "Check on",
        "Signatures",
        "Deliver", 
        "Courier service",
        "Contingency plan",
        "Onboard departure delays",
        "Rare occasions",
        "Visibility",
        "Unavoidable",
        "Complimentary snack",
        "Domestic",
        "Aircraft",
        "Passengers",
        "Notifications",
        "Amenities",
        "Procedures",
        "Policy"
    ],
    "Meaning" : [
        "đi qua",
        "mua",
        "Ủy ban",
        "lấp lánh / nhiệt huyết, năng động / đồ uống có ga",
        "sông nhỏ",
        "Đồ nướng",
        "công nhận",
        "biết ơn",
        "trả lại",
        "khoản vay",
        "bày tỏ sự biết ơn",
        "trợ lý",
        "chào đón nồng nhiệt",
        "buổi đào tạo",
        "tệp đính kèm",
        "quy trình training nhân viên mới",
        "buổi giao lưu",
        "nhân viên mới",
        "thẻ tên",
        "bảo hiểm y tế",
        "quỹ hưu, lương hưu",
        "Khi họ đến",
        "mục đích",
        "mentor",
        "căn cước định danh",
        "tủ tài liệu",
        "tin rằng, cá rằng, chắc rằng",
        "kiểm tra",
        "chữ ký",
        "giao hàng",
        "dịch vụ giao hàng, dịch vụ thư tín",
        "kế hoạch dự phòng",
        "chậm giờ khởi hành",
        "trường hợp hiếm gặp",
        "tầm nhìn",
        "không thể tránh khỏi",
        "đồ ăn kèm theo",
        "nội địa",
        "phi cơ",
        "hành khách",
        "thông báo",
        "tiện ích",
        "thủ tục",
        "chính sách"
    ]
})

vocabulary
