In [1]:
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures

- __Series and DataFrame__

### Series
- A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its _index_.
- Another way to think about a Series is as a fixed-length, ordered dict, since it maps index values to data values.
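
The dict analogy can be sketched with a few dict-style operations. The `prices` data here is hypothetical and only serves as an illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
prices = pd.Series({"apple": 1.5, "pear": 2.0, "kiwi": 0.5})

# Dict-style membership tests look at the index labels.
has_pear = "pear" in prices
has_mango = "mango" in prices

# Like dict.get, Series.get returns a default for a missing label.
missing = prices.get("mango", 0.0)

# Round trip back to a plain dict; key order follows the index order.
as_dict = prices.to_dict()
print(has_pear, has_mango, missing, as_dict)
```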

In [2]:
obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The left column is the index. Since we did not specify one, a default index consisting of the integers 0 through N - 1 is created.

In [3]:
obj.values

array([ 4,  7, -5,  3])

In [4]:
obj.index   # like range(4)

RangeIndex(start=0, stop=4, step=1)

__Assigning an index to identify each data point with a label__

In [5]:
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [7]:
#selecting single values or a set of values using labels in the index
obj2['a']

-5

In [8]:
obj2['d']=6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [9]:
obj2[['c','a','d']]   #here the list of labels is interpreted as a list of indices, even though it contains strings instead of integers

c    3
a   -5
d    6
dtype: int64

In [10]:
#using NumPy-like operations preserves the index-value link
obj2[obj2>0]

d    6
b    7
c    3
dtype: int64

In [11]:
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [12]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [13]:
'b' in obj2

True

In [14]:
# A Series can be created from a Python dict by passing the dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)    # the index follows the dict's key order (older pandas versions sorted the keys)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [15]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)   #overriding by passing dict keys in the order we want
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In the above example, no value was found for 'California', so it appears as NaN. Since 'Utah' is in the dict but not in the states list, it is excluded from the resulting object.

The ___isnull and notnull___ functions can be used to detect missing data.

In [16]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [17]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [18]:
#Series also has these as instance methods
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [19]:
#it automatically aligns by index label in arithmetic operations
print(obj3)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [20]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [21]:
obj4.name = 'population'   #give name to the Series object
obj4.index.name = 'state'  #give name to the object's index
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [22]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [23]:
obj.index = ['a','h','e','i']   #series's index can be altered in-place by assignment
obj

a    4
h    7
e   -5
i    3
dtype: int64

### DataFrame

Represents a rectangular table of data and contains an ordered collection of columns, each of which can be different value types. The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index

- Data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of 1D arrays
- It can be used to represent higher-dimensional data in a tabular format using __hierarchical indexing__
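
The "dict of Series sharing the same index" picture can be sketched as follows. The two Series and their values are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels.
pop = pd.Series({"Ohio": 1.5, "Nevada": 2.4})
area = pd.Series({"Ohio": 44.8, "Utah": 84.9})

# Building a DataFrame from a dict of Series aligns them on the union of labels.
table = pd.DataFrame({"pop": pop, "area": area})

# Each column comes back as a Series that shares the frame's row index.
col = table["pop"]
same_index = col.index.equals(table.index)
print(table)
print(same_index)
```

Labels missing from one of the inputs show up as NaN in that column.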

In [24]:
#construction of dataframe
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)   #index is assigned automatically and the columns keep the dict's key order (older pandas versions sorted them)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [25]:
frame.head()  #select only the first five rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [26]:
frame.head(10)  #displays first 10 rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [27]:
pd.DataFrame(data, columns=['year','state','pop'])   #df's columns will be arranged in specified columns order

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [28]:
#if a passed column name is not in the dict, it appears with missing values (NaN) in the result
frame2 = pd.DataFrame(data, columns=['year','state','pop', 'debt'], index=['one', 'two', 'three', 'four','five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [29]:
#column in df can be retrieved as a Series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [30]:
frame2.year   #works only when the column name is a valid Python variable name

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [31]:
#rows can be retrieved by label with loc (or by position with iloc)
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [32]:
#columns can be modified by assignment
#frame2['debt'] = 16.5
frame2.debt = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [33]:
#while assigning lists or arrays to a column, the value's length must match the length of the df
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val    #a Series is realigned exactly to the df's index, inserting missing values in any holes
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [34]:
#del keyword can be used to delete columns; first create a column to remove
frame2['eastern'] = frame2.state == 'Ohio'   #new columns cannot be created with the frame2.eastern attribute syntax; use dict-like notation
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [35]:
del frame2['eastern']
frame2.columns    #note: the column returned from indexing a DataFrame (e.g. frame2['debt']) can be a view on the underlying data rather than a copy

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [36]:
#nested dict of dicts
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [37]:
frame3 = pd.DataFrame(pop)   #if this kind of data is passed then pd will interpret the outer dict keys as columns and the inner keys as the row indices
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [38]:
#transpose of DataFrame
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [39]:
pd.DataFrame(pop, index=[2001,2002,2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [40]:
pdata = {'Ohio': frame3['Ohio'][:-1],'Nevada': frame3['Nevada'][:2]}   #dicts of series are treated in much the same way
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [41]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [42]:
frame3.index.name = 'year'; frame3.columns.name='state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [43]:
#values attribute returns the data contained in the DataFrame as a 2D ndarray
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [44]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

#### Possible data inputs to DataFrame constructor

| Type | Notes |
| :--- | :---- |
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case |
| List of dicts or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
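
A few of the input types from the table can be sketched side by side. All data and labels here are hypothetical.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# List of dicts: each dict becomes a row; keys are unioned into columns.
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

# List of tuples is treated like the 2D ndarray case.
from_tuples = pd.DataFrame([(1, "x"), (2, "y")], columns=["num", "letter"])

print(from_array, from_records, from_tuples, sep="\n\n")
```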

### Index Objects

- pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
- Any labels used while constructing a Series or DataFrame are internally converted to an Index
- Index objects are immutable
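
A minimal sketch of what immutability means in practice: item assignment on an Index raises `TypeError`, while producing a relabelled object is fine. The Series `ser` is a made-up example.

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

# In-place modification of an Index is rejected.
try:
    ser.index[1] = "z"
    mutated = True
except TypeError:
    mutated = False

# Building a relabelled object with rename is allowed instead.
renamed = ser.rename(index={"b": "z"})
print(mutated, list(renamed.index))
```

Immutability is what makes it safe to share one Index object among several data structures.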

In [45]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [46]:
index[1:]

Index(['b', 'c'], dtype='object')

In [47]:
#index[1] = 'd'   # TypeError - it's not mutable

In [48]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [49]:
obj2 = pd.Series([1.5,-2.5,0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [50]:
obj2.index is labels

True

In [51]:
#an Index also behaves like a fixed-size set
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [52]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [53]:
'Ohio' in frame3.columns

True

In [54]:
2003 in frame3.index

False

In [55]:
# Unlike Python sets, pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo','foo','bar','bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [56]:
lab = pd.Index(['foo','too','hee','shee'])
dup_labels.union(lab)

Index(['bar', 'bar', 'foo', 'foo', 'hee', 'shee', 'too'], dtype='object')

In [57]:
dup_labels.delete(0)

Index(['foo', 'bar', 'bar'], dtype='object')

#### Some Index methods and properties
| Method | Description |
| :----- | :---------- |
| append | Concatenate with additional Index objects, producing a new Index |
| difference | Compute set difference as an Index |
| intersection | Compute set intersection |
| union | Compute set union |
| isin | Compute boolean array indicating whether each value is contained in the passed collection |
| delete | Compute new Index with element at index i deleted |
| drop | Compute new Index by deleting passed values |
| insert | Compute new Index by inserting element at index i |
| is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element (named is_monotonic in older pandas versions) |
| is_unique | Returns True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |
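
Several of the methods in the table, sketched on two small hypothetical Index objects. Each call returns a new object and leaves `left` unchanged.

```python
import pandas as pd

left = pd.Index(["a", "b", "c", "d"])
right = pd.Index(["c", "d", "e"])

common = left.intersection(right)       # labels present in both
only_left = left.difference(right)      # labels in left but not in right
mask = left.isin(["a", "d"])            # boolean NumPy array, one entry per label
with_x = left.insert(1, "x")            # new Index with "x" at position 1
without_b = left.drop("b")              # drop by value
without_first = left.delete(0)          # delete by position

print(common, only_left, mask, with_x, without_b, without_first, sep="\n")
```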

## Essential Functionality

### Reindexing
- The __reindex__ method creates a new object with the data _conformed_ to a new index
- calling __reindex__ on the Series rearranges the data according to the new index
- With DataFrame, __reindex__ can alter either the (row) index, columns, or both

In [58]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [59]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The __method__ option allows us to do this, using a method such as __ffill__, which forward-fills the values

In [60]:
obj3 = pd.Series(['blue','purple','yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [61]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [62]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [63]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [64]:
# columns can be reindexed with the columns keyword
states = ['Texas','Utah','California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [65]:
#label-indexing with loc
#frame.loc[['a','b','c','d'], states]   #raises KeyError in recent pandas: loc no longer accepts lists containing missing labels; use reindex instead

#### reindex function arguments
| Argument | Description |
| :------- | :---------- |
| index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying. |
| method | Interpolation (fill) method; __'ffill'__ fills forward, while __'bfill'__ fills backward. |
| fill_value | Substitute value to use when introducing missing data by reindexing. |
| limit | When forward- or backfilling, maximum size gap (in number of elements) to fill. |
| tolerance | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches. |
| level | Match simple Index on level of MultiIndex; otherwise select subset of. |
| copy | If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent. |
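
The `fill_value` and `limit` arguments can be sketched on a small Series. The data is hypothetical and numeric to keep the output easy to read.

```python
import pandas as pd

values = pd.Series([1.0, 5.0], index=[0, 4])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = values.reindex(range(6), fill_value=0.0)

# method='ffill' with limit=1 forward-fills at most one consecutive new label.
limited = values.reindex(range(6), method="ffill", limit=1)

print(filled)
print(limited)
```

With `limit=1`, labels 2 and 3 stay missing because they are more than one step away from the last valid label.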

### Dropping Entries from an Axis

The ___drop___ method returns a new object with the indicated value or values deleted from an axis.


In [66]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [67]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [68]:
#with dataframe, index values can be deleted from either axis.
data = pd.DataFrame(np.arange(16).reshape(4,4), index=['Ohio','Colorado','Utah','New York'],
                   columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [69]:
data.drop(['Colorado','Ohio'])    #drop with sequence of labels will drop values from the row labels (axis 0)

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [70]:
#can drop values from the columns by passing axis=1 or axis='columns'
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Many functions, like *drop*, which modify the size or shape of a Series or DataFrame, can __manipulate an object *in-place* without returning a new object.__<br>
*__Note: Be careful with inplace, as__* <u>*__it destroys any data that is dropped.__*</u>

In [71]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
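
One cautious pattern, sketched here with made-up data, is to prefer the default (non-in-place) form, or to work on an explicit copy when `inplace=True` is used.

```python
import numpy as np
import pandas as pd

original = pd.Series(np.arange(3.), index=["a", "b", "c"])

# Default behaviour: a new object is returned and the original is untouched.
trimmed = original.drop("b")

# In-place dropping modifies the object itself, so do it on a copy
# if the original data is still needed.
working_copy = original.copy()
working_copy.drop("b", inplace=True)

print(list(original.index), list(trimmed.index), list(working_copy.index))
```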

### Indexing, Selection, and Filtering

In [72]:
obj = pd.Series(np.arange(4.), index = ['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [73]:
obj['b']

1.0

In [74]:
obj[1:3]

b    1.0
c    2.0
dtype: float64

In [75]:
obj[['b','c']]

b    1.0
c    2.0
dtype: float64

In [76]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the end-point is __inclusive__

In [77]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [78]:
#assigning with these methods modifies the corresponding section of the Series
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [79]:
data = pd.DataFrame(np.arange(16).reshape(4,4),
                   index=['Ohio','Colorado','Utah','New York'],
                   columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [80]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [81]:
data[['two','four']]

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


In [82]:
#passing a single element or a list to the [] operator selects columns, while a slice selects rows
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [83]:
data[data['three']>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [84]:
#Another use case is in indexing with boolean DataFrame, such as one produced by a scalar comparison
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [85]:
data[data<5]=0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


#### Selection with *__loc and iloc__*
- The special indexing operators are __loc and iloc__
- They let you select a subset of the rows and columns from a DataFrame with NumPy-like notation using either __axis labels (loc) or integers (iloc)__

In [86]:
data.loc['Colorado']

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [87]:
data.loc['Colorado',['two','three']]

two      5
three    6
Name: Colorado, dtype: int64

In [88]:
data.iloc[1,[1,2]]

two      5
three    6
Name: Colorado, dtype: int64

In [89]:
data.iloc[1]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [90]:
data.iloc[[1,2],[3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [91]:
#both indexing functions work with slices in addition to single labels or lists of labels
data.loc[:'Utah','two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [92]:
data.iloc[:,:3][data.three>3]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


#### Table: Indexing options with DataFrame
| Type | Notes |
| :--- | :---- |
| df[val] | Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion) |
| df.loc[val] | Selects single row or subset of rows from the DataFrame by label |
| df.loc[:, val] | Selects single column or subset of columns by label |
| df.loc[val1, val2] | Select both rows and columns by label |
| df.iloc[where] | Selects single row or subset of rows from the DataFrame by integer position |
| df.iloc[:, where] | Selects single column or subset of columns by integer position |
| df.iloc[where_i, where_j] | Select both rows and columns by integer position |
| df.at[label_i, label_j] | Select a single scalar value by row and column label |
| df.iat[i, j] | Select a single scalar value by row and column position (integers) |
| reindex method | Select either rows or columns by labels |
| get_value, set_value methods | Select single value by row and column label (removed in pandas 1.0; use at or iat instead) |
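
A short sketch of the scalar accessors and of combining a boolean row mask with column labels in `loc`. The frame `grid` is a fresh, unmodified copy of the 4x4 example data.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

by_label = grid.at["Utah", "three"]      # scalar lookup by labels
by_position = grid.iat[2, 2]             # same cell by integer position

# loc also accepts a boolean mask for rows combined with column labels.
subset = grid.loc[grid["three"] > 5, ["one", "four"]]

print(by_label, by_position)
print(subset)
```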

### Integer Indexes

In [93]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [94]:
#ser[-1]   #raises KeyError: with an integer index, -1 is treated as a label, not a position

In [95]:
#but with non-integer, there is no potential for ambiguity
ser2 = pd.Series(np.arange(3.),index=['a','b','c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [96]:
ser2[-1]

2.0

If you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use __loc__ (for labels) or __iloc__ (for integers).

In [97]:
ser[:1]

0    0.0
dtype: float64

In [98]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [99]:
ser.iloc[:1]

0    0.0
dtype: float64
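
The difference is easier to see when the integer labels do not coincide with the positions. A minimal sketch with a hypothetical Series whose labels are deliberately shuffled:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
ser = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = ser.loc[0]        # the element labelled 0
by_position = ser.iloc[0]    # the first element

# Label slices include the end label; positional slices exclude the end.
label_slice = ser.loc[2:0]
position_slice = ser.iloc[0:1]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```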

### Arithmetic and Data Alignment
- When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs

In [100]:
s1 = pd.Series([7.3,-2.5,3.4,1.5], index=['a','c','d','e'])
s2 = pd.Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [101]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [102]:
s1 + s2
#The internal data alignment introduces missing values in the label locations that don't overlap.
#Missing values will then propagate in further arithmetic computations.

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [103]:
#In dataframe, the alignment is performed on both the rows and columns
df1 = pd.DataFrame(np.arange(9.).reshape(3,3), columns=list('bcd'),
                  index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
                  index=['Utah','Ohio','Texas','Oregon'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [104]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [105]:
#adding above two dataframes returns a DataFrame whose index and columns are the unions of the ones in each DataFrame
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [106]:
#combining DataFrame objects with no column or row labels in common gives a result containing all nulls
df1 = pd.DataFrame({'A':[1,2]})
df2 = pd.DataFrame({'B':[3,4]})

In [107]:
df1 - df2

Unnamed: 0,A,B
0,,
1,,


#### Arithmetic methods with fill values

In [108]:
df1 = pd.DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

In [109]:
df2.loc[1,'b'] = np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [110]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [111]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [112]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [113]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [114]:
df1.div(1)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [115]:
#while reindexing a Series or DF, we can also use fill_value argument
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


#### Table: Flexible arithmetic method
| Method | Description |
| :----- | :---------- |
| add, radd | Methods for addition (+) |
| sub, rsub | Methods for subtraction (-) |
| div, rdiv | Methods for division (/) |
| floordiv, rfloordiv | Methods for floor division (//) |
| mul, rmul | Methods for multiplication (*) |
| pow, rpow | Methods for exponentiation (**) |
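
Each method in the table has a counterpart starting with `r` that flips the operands. A small sketch with a hypothetical frame `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape(2, 3), columns=list("abc"))

# rdiv flips the operands: df.rdiv(1) computes 1 / df.
flipped = df.rdiv(1)
same_as_operator = flipped.equals(1 / df)

# rsub likewise computes 10 - df rather than df - 10.
from_ten = df.rsub(10)

print(same_as_operator)
print(from_ten)
```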

#### Operations between DataFrame and Series

In [116]:
arr = np.arange(12.).reshape((3,4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [117]:
arr[0]

array([0., 1., 2., 3.])

In [118]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

The subtraction above is performed once for each row. This is referred to as *broadcasting*.

In [119]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                    columns=list('bde'),
                    index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [120]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [121]:
#arithmetic between DataFrame and Series matches the index of the series on the DF's columns
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [122]:
#If an index value is not found in either the DF's columns or the Series's index, the objects will be reindexed to form the union
series2 = pd.Series(range(3), index=['b','e','f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [123]:
#broadcast over columns
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [124]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [125]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


### Function Application and Mapping
- NumPy *ufuncs* (element-wise array methods) also work with pandas objects

In [126]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
                    index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.985826,-0.410615,-0.258728
Ohio,-1.409047,0.70861,0.078281
Texas,0.5515,0.097803,-0.431392
Oregon,0.699231,-0.47595,0.728262


In [127]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.985826,0.410615,0.258728
Ohio,1.409047,0.70861,0.078281
Texas,0.5515,0.097803,0.431392
Oregon,0.699231,0.47595,0.728262


In [128]:
#applying a function on one-dimensional arrays to each column or row
f = lambda x : x.max() - x.min()
frame.apply(f, axis=0)    #function f is invoked once on each column in DF
#result is the Series having the columns of frame as its index

b    2.394872
d    1.184560
e    1.159655
dtype: float64

In [129]:
frame.apply(f, axis='columns')    #function will be invoked once per row

Utah      1.396441
Ohio      2.117657
Texas     0.982892
Oregon    1.204212
dtype: float64

In [130]:
#function which return a Series with multiple values
def f(x):
    return pd.Series([x.max(), x.min()], index=['max','min'])

In [131]:
frame.apply(f)

Unnamed: 0,b,d,e
max,0.985826,0.70861,0.728262
min,-1.409047,-0.47595,-0.431392


In [132]:
#compute a formatted string from each floating-point value in DF using applymap method
format = lambda c: '%.2f' % c
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,0.99,-0.41,-0.26
Ohio,-1.41,0.71,0.08
Texas,0.55,0.1,-0.43
Oregon,0.7,-0.48,0.73


In [133]:
#Series has a map method for applying an element-wise function
frame['e'].map(format)

Utah      -0.26
Ohio       0.08
Texas     -0.43
Oregon     0.73
Name: e, dtype: object

### Sorting and Ranking
- To sort lexicographically by row or column index, use the *__sort_index__* method, which returns a new, sorted object

In [134]:
obj = pd.Series(range(4), index=['d','a','b','c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [136]:
#with DF, we can sort by index on either axis
frame = pd.DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [138]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [139]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [140]:
#the data is sorted ascending order by default, but can be sorted in descending order
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


#### To sort a Series by its values, we can use *_sort_values_* method

In [141]:
obj = pd.Series([4,7,-3,2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [143]:
#any missing values are sorted to the end of the Series by default
obj = pd.Series([4, np.nan, 7, np.nan, -3,2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [146]:
frame = pd.DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


To sort values in a DataFrame, we can use the data in one or more columns as the sort keys by passing one or more column names to the *__by__* option of *__sort_values__*.

In [147]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [149]:
frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
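
Two further `sort_values` options, sketched with hypothetical data: a separate `ascending` flag per sort key, and `na_position` to control where missing values go.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"group": ["x", "y", "x", "y"],
                       "score": [4.0, 7.0, -3.0, 2.0]})

# One ascending flag per sort key: group ascending, score descending.
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])

# na_position='first' moves missing values to the front instead of the end.
with_gaps = pd.Series([4.0, np.nan, -3.0])
na_first = with_gaps.sort_values(na_position="first")

print(ordered)
print(na_first)
```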


#### Ranking assigns ranks from one through the number of valid data points in an array
- __rank__ method breaks ties by assigning each group the mean rank
-  ranks can also be assigned according to the order in which they're observed in the data

In [150]:
obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [151]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [152]:
# assign tied values the maximum rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [153]:
#DF can compute ranks over the rows or the columns
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [154]:
frame.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


#### Tie-breaking methods with rank
| Method | Description |
| :----- | :---------- |
| 'average' | Default: assign the average rank to each entry in the equal group |
| 'min' | Use the minimum rank for the whole group |
| 'max' | Use the maximum rank for the whole group |
| 'first' | Assign ranks in the order the values appear in the data |
| 'dense' | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group |
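
The difference between `'min'` and `'dense'` is easiest to see side by side. This sketch reuses the values from the ranking example above.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': tied values share the lowest rank of their group; the next rank skips ahead.
min_rank = obj.rank(method="min")

# 'dense': tied values share a rank and the next group is exactly one higher.
dense_rank = obj.rank(method="dense")

print(pd.DataFrame({"value": obj, "min": min_rank, "dense": dense_rank}))
```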

### Axis Indexes with Duplicate Labels
- Many pandas functions, like __reindex__, require that the labels be unique, but unique labels are not mandatory for an axis index

In [156]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [159]:
# is_unique property can tell whether its labels are unique or not
obj.index.is_unique

False

In [160]:
obj['a']

a    0
a    1
dtype: int64

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value. This can make code more complicated, as the output type from indexing can vary based on whether a label is repeated or not. The same logic extends to indexing rows in a DataFrame.
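
A minimal sketch of the varying return type, plus one possible way to get a predictable type (selecting with a list of labels). The Series matches the example above.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

repeated = obj["a"]     # label appears twice, so a Series comes back
single = obj["c"]       # label appears once, so a scalar comes back

# Selecting with a list yields a Series, whatever the label count.
always_series = obj.loc[["c"]]

print(type(repeated).__name__, type(single).__name__, type(always_series).__name__)
```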

## Summarizing and Computing Descriptive Statistics
- pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of *__reductions or summary statistics__*: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.

In [161]:
df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan, np.nan], [0.75,-1.3]],
                 index=['a','b','c','d'], columns=['one','two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [162]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [163]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [164]:
#NA values are excluded unless the entire slice is NA. This can be disabled with the *skipna* option
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
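
In the `df.sum(axis=1)` output above, the all-NA row `c` sums to 0.0 rather than NaN. A short sketch, assuming a pandas version where `sum` accepts the `min_count` argument, of how to get NaN for such rows instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NA values are skipped, so the all-NA row 'c' sums to 0.0
default_sums = df.sum(axis=1)

# min_count=1 requires at least one non-NA value; otherwise the result is NaN
strict_sums = df.sum(axis=1, min_count=1)
print(strict_sums)
```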

#### Options for reduction methods
| Argument | Description |
| :----- | :---------- |
| axis | Axis to reduce over; 0 for DataFrame’s rows and 1 for columns |
| skipna | Exclude missing values; True by default |
| level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0 in favor of groupby |

Some methods, like __idxmin and idxmax__, return indirect statistics such as the index value where the minimum or maximum value is attained.

In [166]:
df.idxmin()

one    d
two    b
dtype: object

#### Other methods are *__accumulations__*

In [167]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


#### Another type of method is neither a reduction nor an accumulation. *__describe__* is one example, producing multiple summary statistics in one shot

In [168]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [169]:
#for non-numeric data, describe produces alternative summary statistics
obj = pd.Series(['a','a','b','c']*4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

#### Table: Descriptive and summary statistics
| Method | Description |
| :----- | :---------- |
| count | Number of non-NA values |
| describe | Compute set of summary statistics for Series or each DataFrame column |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
| idxmin, idxmax | Compute index labels at which minimum or maximum value obtained, respectively |
| quantile | Compute sample quantile ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value (removed in pandas 2.0) |
| prod | Product of all values |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (third moment) of values |
| kurt | Sample kurtosis (fourth moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
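
A quick sketch exercising a few of the methods from the table on a made-up Series. The values and variable names are illustrative only:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 3.0], index=['w', 'x', 'y', 'z'])

pos_of_max = s.argmax()      # integer position of the maximum
label_of_max = s.idxmax()    # index label of the maximum
median_q = s.quantile(0.5)   # 50% sample quantile
first_diff = s.diff()        # s[i] - s[i-1]; the first entry is NaN
pct = s.pct_change()         # s[i] / s[i-1] - 1; the first entry is NaN
running_max = s.cummax()     # largest value seen so far

print(pos_of_max, label_of_max, median_q)
```

Note the difference between `argmax` (position 2) and `idxmax` (label 'y'): they point at the same element but answer in different terms.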

### Correlation and Covariance

In [170]:
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
           for ticker in ['AAPL','IBM','MSFT','GOOG']}
price = pd.DataFrame({ticker:data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker:data['Volume']
                      for ticker, data in all_data.items()})

In [172]:
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2021-06-09,0.003077,0.010733,0.004038,0.003444
2021-06-10,-0.008023,-0.000863,0.014393,0.012122
2021-06-11,0.009833,0.004916,0.002527,-0.003042
2021-06-14,0.024578,-0.008263,0.007755,0.005215
2021-06-15,-0.004522,-0.004199,-0.005983,-0.003336


The *__corr__* method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, *__cov__* computes the covariance.

In [174]:
returns['MSFT'].corr(returns['IBM'])

0.5297873314952851

In [175]:
returns['MSFT'].cov(returns['IBM'])

0.0001500232585839488

In [176]:
returns.MSFT.corr(returns.IBM)

0.5297873314952851
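
The stock data above requires a network download. As an offline sketch of the same two methods, the made-up Series below show that `corr` and `cov` align by index label and drop pairs containing NA before computing. All names and values here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Same labels in a different order, plus one NA in `a`
a = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], index=list('abcde'))
b = pd.Series([1.0, 9.0, 6.0, 4.0, 2.0], index=list('edcba'))

corr = a.corr(b)   # uses only labels a..d, matched by label
cov = a.cov(b)

# Manual check on the aligned, NA-free pairs: corr = cov / (std_a * std_b)
aligned = pd.concat([a, b], axis=1, keys=['a', 'b']).dropna()
manual_corr = cov / (aligned['a'].std() * aligned['b'].std())
print(corr, cov, manual_corr)
```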

In [177]:
# cov and corr methods can also return a full correlation or covariance matrix as a DF
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.445788,0.726006,0.660487
IBM,0.445788,1.0,0.529787,0.492385
MSFT,0.726006,0.529787,1.0,0.772583
GOOG,0.660487,0.492385,0.772583,1.0


In [178]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000364,0.000139,0.000239,0.000212
IBM,0.000139,0.000268,0.00015,0.000136
MSFT,0.000239,0.00015,0.000299,0.000225
GOOG,0.000212,0.000136,0.000225,0.000284


In [180]:
# corrwith method can be used to compute pairwise correlations between a DF's columns or rows and another Series or DF
returns.corrwith(returns.IBM)

AAPL    0.445788
IBM     1.000000
MSFT    0.529787
GOOG    0.492385
dtype: float64

In [183]:
#Passing a DF computes the correlations of matching column names
returns.corrwith(volume)

AAPL   -0.053839
IBM    -0.100900
MSFT   -0.058938
GOOG   -0.117926
dtype: float64

Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
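
A minimal sketch of the row-by-row case, using two hypothetical frames whose columns are deliberately stored in a different order to show the label alignment:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
                    index=['r1', 'r2'], columns=['a', 'b', 'c'])

# Same labels, different column order: r1 is (a=2, b=4, c=6), r2 is (a=3, b=2, c=1)
right = pd.DataFrame([[6.0, 2.0, 4.0], [1.0, 3.0, 2.0]],
                     index=['r1', 'r2'], columns=['c', 'a', 'b'])

# Correlate each row of `left` with the row of `right` that has the same label
row_corr = left.corrwith(right, axis='columns')
print(row_corr)
```

Row `r1` is perfectly correlated with its counterpart and row `r2` perfectly anti-correlated, regardless of the column order in `right`.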

### Unique Values, Value Counts, and Membership

In [184]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [185]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [187]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [188]:
pd.value_counts(obj.values, sort=False)

b    2
a    3
c    3
d    1
dtype: int64

__isin__ performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame.

In [190]:
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [192]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
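
The same pattern applied to a DataFrame column, as a small sketch. The `sales` frame and its column names are hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({'state': ['Ohio', 'Texas', 'Utah', 'Ohio', 'Oregon'],
                      'units': [3, 5, 2, 7, 1]})

wanted = ['Ohio', 'Utah']

# Keep only the rows whose state is in the wanted list
subset = sales[sales['state'].isin(wanted)]

# Negate the mask with ~ to keep everything else
excluded = sales[~sales['state'].isin(wanted)]
print(subset)
```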

#### Table: Unique, value counts, and set membership method
| Method | Description |
| :----- | :---------- |
| isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values |
| match | Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations |
| unique | Compute array of unique values in a Series, returned in the order observed |
| value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order |
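
Recent pandas versions may not expose a top-level `match` function. Assuming that is the case in your installation, `Index.get_indexer` gives the same kind of result: for each value, its integer position in an array of distinct values. The data below is illustrative:

```python
import pandas as pd

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# Position of each element of to_match within unique_vals
indices = pd.Index(unique_vals).get_indexer(to_match)
print(indices)
```

Values that are absent from the distinct array come back as -1, which is worth checking for before using the result as positions.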

In [193]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],'Qu2': [2, 3, 1, 2, 3],'Qu3': [1, 5, 2, 4, 4]})

In [194]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [196]:
result = data.apply(pd.value_counts).fillna(0)

In [197]:
result
#row labels are the distinct values found across all columns; each entry is how often that value occurs in the column

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
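
Newer pandas releases may warn that the top-level `pd.value_counts` is deprecated. A sketch of the same per-column count using the `Series.value_counts` method instead, with the result cast to integers (the name `counts` is an illustrative choice):

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values per column, fill absent values with 0, and sort the row labels
counts = (data.apply(pd.Series.value_counts)
              .fillna(0)
              .astype(int)
              .sort_index())
print(counts)
```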
