# Essential Functionality
This section covers the basic mechanics of working with the data contained in a Series or DataFrame.

## Reindexing
Reindexing, done with the 'reindex' method, creates a new object with the data conformed to a new index.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# example
obj = pd.Series([3,2,5,2,1,5], index=['e','c','a','b','d','f'])
obj

e    3
c    2
a    5
b    2
d    1
f    5
dtype: int64

In [3]:
# calling reindex on this series rearranges the data according to the new index; any index values not already present would get missing values (none are absent here)
obj2 = obj.reindex(['a','b','c','d','e','f'])
obj2

a    5
b    2
c    2
d    1
e    3
f    5
dtype: int64
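
The reindex call above only reorders labels that already exist, so no missing values show up. As a small sketch of the other case (the label 'z' and the names with_gap and filled are made up for illustration), reindexing with a label that is absent from the original index gives NaN unless a fill_value is supplied:

```python
import pandas as pd

obj = pd.Series([3, 2, 5], index=['e', 'c', 'a'])

# 'z' is not in the original index, so its value is missing (NaN)
with_gap = obj.reindex(['a', 'c', 'e', 'z'])

# fill_value replaces the missing entries introduced by reindexing
filled = obj.reindex(['a', 'c', 'e', 'z'], fill_value=0)

print(with_gap)
print(filled)
```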

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:

In [4]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [5]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [6]:
# bfill fills backward instead: each gap takes the next valid value, so label 5 has nothing after it and stays NaN
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [7]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                    index=['a','c','d'],
                    columns=['Ohio', 'Texas', 'California']
                    )
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [8]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [9]:
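states = ['Texas', 'Utah', 'California']
# 'Utah' is not an existing column, so this result would contain a NaN column;
# it is not displayed because only the last expression in a cell is shown
frame.reindex(columns=states)
frame.reindex(columns=['Ohio','Texas', 'California'])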

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8
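
Since reindex can alter rows, columns, or both, here is a minimal sketch of doing both in one call. The frame is rebuilt so the snippet runs on its own, and the name both is only illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns together; the new labels ('b' and 'Utah')
# are filled with NaN because they have no data in the original frame
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```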


## Dropping Entries from an Axis
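Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. When you do not, the drop method returns a new object with the indicated value or values deleted from an axis.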

In [10]:
ser = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
ser

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [11]:
# drop a single entry by its label; drop returns a new object
new_ser = ser.drop('c')
new_ser

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [12]:
# drop several entries by passing a list of labels
new_ser = ser.drop(['a', 'c'])
new_ser

b    1.0
d    3.0
e    4.0
dtype: float64

In [13]:
# the original Series is unchanged, since drop returns a new object
ser

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

With a DataFrame, index values can be deleted from either axis. To illustrate this, consider the following example.


In [14]:
data = pd.DataFrame(np.arange(16).reshape((4,4)) + 1,
                   index=['Monday','Tuesday','Thursday', 'Saturday'],
                   columns=['Vince','Elroy','Kanye','Karnage'])
data

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,1,2,3,4
Tuesday,5,6,7,8
Thursday,9,10,11,12
Saturday,13,14,15,16


 - Calling drop with a label or a sequence of labels drops from the row labels (axis=0) by default.
 - To drop from the columns instead, pass axis=1 or axis='columns'.

In [15]:
data.drop('Thursday', axis=0)

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,1,2,3,4
Tuesday,5,6,7,8
Saturday,13,14,15,16


In [16]:
data.drop('Elroy', axis=1)

Unnamed: 0,Vince,Kanye,Karnage
Monday,1,3,4
Tuesday,5,7,8
Thursday,9,11,12
Saturday,13,15,16


In [17]:
# chain two drop calls to remove a row and a column
(data.drop('Thursday', axis=0)).drop('Elroy', axis=1)

Unnamed: 0,Vince,Kanye,Karnage
Monday,1,3,4
Tuesday,5,7,8
Saturday,13,15,16
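
The chained call above can also be written as a single drop using the index and columns keywords. This is just a sketch of an alternative spelling; the names trimmed and chained are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)) + 1,
                    index=['Monday', 'Tuesday', 'Thursday', 'Saturday'],
                    columns=['Vince', 'Elroy', 'Kanye', 'Karnage'])

# Remove a row label and a column label in one call
trimmed = data.drop(index='Thursday', columns='Elroy')

# Same result as chaining two drops along each axis
chained = data.drop('Thursday', axis=0).drop('Elroy', axis=1)

print(trimmed)
```

Either way, data itself is left untouched, because drop returns a new object.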


## Indexing, Selection, and Filtering
Series indexing works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [18]:
ser = pd.Series(np.arange(4.), index=['a','b','c','d'])
ser

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [19]:
# indexing the value at index='c'
ser['c']

2.0

In [20]:
# indexing the value at integer position 2 (newer pandas versions prefer ser.iloc[2] for positional access)
ser[2]

2.0

In [21]:
# slicing positions 1 up to (but not including) 3
ser[1:3]

b    1.0
c    2.0
dtype: float64

A list of labels, as in ser[['a','b']], selects several entries at once. Here a list of integers selects by position:

In [22]:
ser[[1,2]]

b    1.0
c    2.0
dtype: float64

In [23]:
ser[ser<2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive.


In [24]:
ser['b':'d']

b    1.0
c    2.0
d    3.0
dtype: float64
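
To make the contrast concrete, here is a small sketch comparing a label slice with a positional slice on the same Series (the names by_label and by_position are only for illustration):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'd' is included
by_label = ser['b':'d']

# Position-based slice: the end position 3 is excluded, as in plain Python
by_position = ser.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```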

In [25]:
# setting using this method modifies the corresponding section of the series
ser['b':'c'] = 7
ser

a    0.0
b    7.0
c    7.0
d    3.0
dtype: float64

In [26]:
data = pd.DataFrame(np.arange(16).reshape((4,4)) + 1,
                   index=['Monday','Tuesday','Thursday', 'Saturday'],
                   columns=['Vince','Elroy','Kanye','Karnage'])
data

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,1,2,3,4
Tuesday,5,6,7,8
Thursday,9,10,11,12
Saturday,13,14,15,16


In [27]:
# indexing with a single label or a list of labels selects columns, not rows
data['Vince']

Monday       1
Tuesday      5
Thursday     9
Saturday    13
Name: Vince, dtype: int64

In [28]:
data[['Elroy', 'Vince']]

Unnamed: 0,Elroy,Vince
Monday,2,1
Tuesday,6,5
Thursday,10,9
Saturday,14,13


In [29]:
# Indexing like this has a few special cases: a slice or a boolean array selects rows instead of columns
data[2:4]

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Thursday,9,10,11,12
Saturday,13,14,15,16


In [30]:
data[data['Vince'] > 5]

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Thursday,9,10,11,12
Saturday,13,14,15,16


In [31]:
data < 4

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,True,True,True,False
Tuesday,False,False,False,False
Thursday,False,False,False,False
Saturday,False,False,False,False


In [32]:
data[data < 4] = -4
data

Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,-4,-4,-4,4
Tuesday,5,6,7,8
Thursday,9,10,11,12
Saturday,13,14,15,16


### Selection with loc and iloc

In [33]:
data = pd.DataFrame(np.arange(16).reshape((4,4)) + 1,
                   index=['Monday','Tuesday','Thursday', 'Saturday'],
                   columns=['Vince','Elroy','Kanye','Karnage'])
data


Unnamed: 0,Vince,Elroy,Kanye,Karnage
Monday,1,2,3,4
Tuesday,5,6,7,8
Thursday,9,10,11,12
Saturday,13,14,15,16


In [34]:
data.loc['Monday', ['Elroy', 'Kanye']]

Elroy    2
Kanye    3
Name: Monday, dtype: int64

In [35]:
data.iloc[0, [1,2]]

Elroy    2
Kanye    3
Name: Monday, dtype: int64

In [36]:
data.iloc[3]

Vince      13
Elroy      14
Kanye      15
Karnage    16
Name: Saturday, dtype: int64

In [37]:
data.iloc[[2,3],[1,2,3]]

Unnamed: 0,Elroy,Kanye,Karnage
Thursday,10,11,12
Saturday,14,15,16
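
Both indexers also accept slices. The sketch below assumes the same data frame as above; the names up_to_tuesday and first_three are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)) + 1,
                    index=['Monday', 'Tuesday', 'Thursday', 'Saturday'],
                    columns=['Vince', 'Elroy', 'Kanye', 'Karnage'])

# loc slices by label, and the end label is included
up_to_tuesday = data.loc[:'Tuesday', 'Elroy']

# iloc slices by position; a boolean mask can then filter the rows
first_three = data.iloc[:, :3][data['Kanye'] > 5]

print(up_to_tuesday)
print(first_three)
```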


## Integer Indexes
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:

In [38]:
ser = pd.Series(np.arange(3.))
ser
ser[-1]

KeyError: -1

The error is expected. Because the Series has an integer index, ser[-1] is treated as a lookup of the label -1,
which does not exist, rather than as "the last element".
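
One way to sidestep the ambiguity is to state explicitly whether a label or a position is meant. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# iloc is always position-based, so -1 means the last element
last_by_position = ser.iloc[-1]

# loc is always label-based, so 2 means the entry labelled 2
by_label = ser.loc[2]

print(last_by_position, by_label)
```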

## Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.

In [39]:
ser1 = pd.Series([7.3, -2.5, 3.4, 1.5], 
                 index=['a','c','d','e'])
ser2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], 
                 index=['a','c','e','f','g'])

In [40]:
print("ser1:\n{}\n\nser2:\n{}\n".format(ser1, ser2))

ser1:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

ser2:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64



In [41]:
ser1 + ser2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns.
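
As a rough sketch of that two-way alignment (the frames df1 and df2 and their labels are invented for illustration), adding two DataFrames gives the union of both row and column labels, with NaN wherever a cell is missing from either side. The add method with fill_value treats a value missing on one side as 0:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)),
                   index=['x', 'y'], columns=['a', 'b'])
df2 = pd.DataFrame(np.arange(4.).reshape((2, 2)),
                   index=['y', 'z'], columns=['b', 'c'])

# Only the cell ('y', 'b') exists in both frames; every other cell is NaN
total = df1 + df2

# fill_value=0 stands in for a value missing from one frame only;
# cells missing from both frames remain NaN
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```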

## Function Application and Mapping
NumPy universal functions, or ufuncs (element-wise array methods), also work with pandas objects.

In [42]:
frame = pd.DataFrame(np.random.randn(4,3), 
                    columns=list('bde'),
                    index=['Bambili', 'Bamenda', 'Bamunka', 'Bambui'])
frame

Unnamed: 0,b,d,e
Bambili,0.679722,1.107812,1.385562
Bamenda,0.69983,1.856633,-0.774327
Bamunka,-1.744046,-2.18072,-0.223393
Bambui,0.639016,0.003955,0.277216


In [43]:
# using the np absolute ufunc
np.abs(frame)

Unnamed: 0,b,d,e
Bambili,0.679722,1.107812,1.385562
Bamenda,0.69983,1.856633,0.774327
Bamunka,1.744046,2.18072,0.223393
Bambui,0.639016,0.003955,0.277216


In [44]:
# function that takes a Series and returns the difference between its maximum and minimum
my_func = lambda x: x.max() - x.min()

frame.apply(my_func)

b    2.443876
d    4.037353
e    2.159889
dtype: float64

In [45]:
# using axis=1, the function will be invoked once per row instead
frame.apply(my_func, axis=1)

Bambili    0.705840
Bamenda    2.630960
Bamunka    1.957327
Bambui     0.635060
dtype: float64
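
The function passed to apply does not have to return a single number. As a sketch (the helper min_max and the small frame below are made up for illustration, with fixed values instead of random ones), it can return a Series, in which case the result is a DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.0],
                      'd': [0.5, 4.0, -1.5]})

def min_max(col):
    # Return two labelled values for each column
    return pd.Series([col.min(), col.max()], index=['min', 'max'])

# One output column per input column, one output row per returned label
summary = frame.apply(min_max)
print(summary)
```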

## Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.

In [46]:
ser = pd.Series(range(4), index=['d','a','c','b'])
ser

d    0
a    1
c    2
b    3
dtype: int64

In [47]:
ser.sort_index()

a    1
b    3
c    2
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

In [48]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [49]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [50]:
frame.sort_index(axis=0)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [51]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [53]:
(frame.sort_index(axis=0)).sort_index(axis=1)

Unnamed: 0,a,b,c,d
one,5,6,7,4
three,1,2,3,0


In [54]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [56]:
ser = pd.Series([4,7,-3,2])
ser

0    4
1    7
2   -3
3    2
dtype: int64

In [58]:
# sort a series by its values
ser.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [61]:
# missing values are sorted to the end of the Series by default
ser = pd.Series([4,np.nan, 7, np.nan, -3, 2])
ser.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [64]:
frame = pd.DataFrame({
    'b': [4,7,-3,2],
    'a': [0,1,0,1]
})
frame

# when sorting a DataFrame, you can use the data in one or more columns as the sort keys.
# to do so, pass one or more column names to the by option of the sort_values method.

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [66]:
frame.sort_values(by='a')

Unnamed: 0,b,a
0,4,0
2,-3,0
1,7,1
3,2,1


In [67]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1
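
To sort by more than one column, a list of names can be passed to by. A short sketch using the same frame (the name by_both is illustrative): rows are ordered by 'a' first, and ties in 'a' are broken by 'b'.

```python
import pandas as pd

frame = pd.DataFrame({
    'b': [4, 7, -3, 2],
    'a': [0, 1, 0, 1]
})

# Sort by column 'a', then by column 'b' within equal values of 'a'
by_both = frame.sort_values(by=['a', 'b'])
print(by_both)
```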


### Ranking
Ranking assigns ranks from one through the number of valid data points in an array.
It is done with the rank methods of Series and DataFrame; by default, rank breaks ties by assigning each group the mean rank.

In [70]:
ser = pd.Series([7,-5,7,4,2,0,4])
ser.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [71]:
# ranks can also be assigned according to the order in which they are observed in the data.
ser.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [72]:
# we can also rank in descending order; method='max' gives tied values the highest rank in their group
ser.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
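
Two other tie-breaking options are worth a quick look. This sketch reuses the same Series; the names min_rank and dense_rank are illustrative. With method='min' each tied group gets its lowest rank, while method='dense' does the same but always increases the rank by 1 between groups:

```python
import pandas as pd

ser = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Tied values share the lowest rank of their group
min_rank = ser.rank(method='min')

# Like 'min', but ranks never skip a number between groups
dense_rank = ser.rank(method='dense')

print(min_rank.tolist())
print(dense_rank.tolist())
```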

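## Axis Indexes with Duplicate Labels
Labels on an axis do not have to be unique. The Series below has duplicate index values, and selecting a duplicated label returns a Series rather than a single value.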

In [74]:
ser = pd.Series(range(5), index=['a','a','b','b','b'])
ser

a    0
a    1
b    2
b    3
b    4
dtype: int64

In [77]:
# to check if index is unique
ser.index.is_unique

False

In [78]:
ser['a']

a    0
a    1
dtype: int64

In [79]:
ser['b']

b    2
b    3
b    4
dtype: int64
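
One consequence is that the type of the result depends on whether the label is repeated. A small sketch with a slightly different index, where the label 'c' occurs once (the names repeated and single are illustrative):

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# 'a' occurs twice, so the selection is a Series
repeated = ser['a']

# 'c' occurs once, so the selection is a scalar value
single = ser['c']

print(type(repeated).__name__, single)
```

Code that indexes by label may therefore need to check ser.index.is_unique first if it relies on getting a scalar back.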