
# Reindexing
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Reindexing can accomplish several operations, such as:

    Reorder the existing data to match a new set of labels.

    Insert missing value (NA) markers in label locations where no data for the label existed.

* https://www.tutorialspoint.com/python_pandas/python_pandas_reindexing.htm

In [1]:
import pandas as pd
import numpy as np

In [2]:
N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

print(df)

            A     x         y       C           D
0  2016-01-01   0.0  0.463441    High   98.872864
1  2016-01-02   1.0  0.646793  Medium  103.939844
2  2016-01-03   2.0  0.901605    High  105.932214
3  2016-01-04   3.0  0.290928    High  100.384344
4  2016-01-05   4.0  0.473389  Medium   97.051083
5  2016-01-06   5.0  0.401205  Medium   94.648927
6  2016-01-07   6.0  0.858519    High   87.095262
7  2016-01-08   7.0  0.831933     Low   96.414287
8  2016-01-09   8.0  0.141707    High   90.150203
9  2016-01-10   9.0  0.057209    High   99.287485
10 2016-01-11  10.0  0.708362  Medium   96.245662
11 2016-01-12  11.0  0.009059    High   96.673056
12 2016-01-13  12.0  0.696792    High  110.919842
13 2016-01-14  13.0  0.279249    High  101.479858
14 2016-01-15  14.0  0.798202  Medium   95.998821
15 2016-01-16  15.0  0.777210  Medium  109.307149
16 2016-01-17  16.0  0.078750    High   97.974567
17 2016-01-18  17.0  0.956570     Low   97.003481
18 2016-01-19  18.0  0.854299    High  106.197019


In [3]:
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print (df_reindexed)

           A       C   B
0 2016-01-01    High NaN
2 2016-01-03    High NaN
5 2016-01-06  Medium NaN


In [4]:
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
print (df1,'\n' )
print (df2 ,'\n' )
df1 = df1.reindex_like(df2)  # returns a new DataFrame with df2's row and column labels; df1 is rebound to it
print (df1 )

       col1      col2      col3
0 -0.340309  0.855078 -0.191876
1 -0.978566 -1.977544  2.810239
2  0.985124 -0.054314 -0.064572
3 -1.873912  0.824377  0.339957
4 -0.175455 -1.361435  0.539952
5  0.250965  0.493151 -0.112290
6 -0.229682  1.105200  1.113134
7 -0.459851 -0.383806 -0.463992
8 -1.311710  0.347020  1.320741
9  0.582851  0.524169 -0.522299 

       col1      col2      col3
0 -0.521479 -2.015521  0.000188
1  0.136159 -0.010194 -0.300114
2 -0.313661 -0.984261  0.986633
3 -0.161970 -0.669158 -0.235743
4 -0.498049 -0.715438 -0.193233
5 -0.718357  0.035994 -0.655087
6  0.235211  0.699096 -0.177808 

       col1      col2      col3
0 -0.340309  0.855078 -0.191876
1 -0.978566 -1.977544  2.810239
2  0.985124 -0.054314 -0.064572
3 -1.873912  0.824377  0.339957
4 -0.175455 -1.361435  0.539952
5  0.250965  0.493151 -0.112290
6 -0.229682  1.105200  1.113134


# Filling while ReIndexing

reindex() takes an optional parameter, method, which selects how newly introduced labels are filled. Its values are:

    pad/ffill − Fill values forward

    bfill/backfill − Fill values backward

    nearest − Fill from the nearest index values
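The three fill methods listed above can be compared side by side. The following is a minimal sketch, not part of the original notebook: the Series `s` and the label list `new_index` are made-up illustration values, and it uses reindex() directly rather than reindex_like(). Note that a fill method requires the original index to be sorted (monotonic).

```python
import pandas as pd

# Hypothetical data: known values at labels 0, 5 and 10.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])

# New labels, some of which do not exist in the original index.
new_index = [0, 2, 4, 5, 7, 10]

filled = pd.DataFrame({
    'none': s.reindex(new_index),                       # gaps become NaN
    'ffill': s.reindex(new_index, method='ffill'),      # take the previous known value
    'bfill': s.reindex(new_index, method='bfill'),      # take the next known value
    'nearest': s.reindex(new_index, method='nearest'),  # take the closest label's value
})
print(filled)
```

Each column shows the same reindexing with a different fill strategy, so the rows for labels 2, 4 and 7 make the difference between the methods visible.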


In [5]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
print (df2,'\n' ) 
# Without a fill method, the rows missing from df2 are filled with NaN
print (df2.reindex_like(df1),'\n' )

# Now fill the NaNs with the preceding values
print ("Data Frame with Forward Fill:")
x = df2.reindex_like(df1,method='ffill')
print (x)

       col1      col2      col3
0  0.930893  0.491661 -0.168651
1  1.199753  0.085649 -1.171365 

       col1      col2      col3
0  0.930893  0.491661 -0.168651
1  1.199753  0.085649 -1.171365
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN 

Data Frame with Forward Fill:
       col1      col2      col3
0  0.930893  0.491661 -0.168651
1  1.199753  0.085649 -1.171365
2  1.199753  0.085649 -1.171365
3  1.199753  0.085649 -1.171365
4  1.199753  0.085649 -1.171365
5  1.199753  0.085649 -1.171365


# Limits on Filling while Reindexing

The limit argument provides additional control over filling while reindexing.
It specifies the maximum number of consecutive missing labels to fill; any gaps beyond that count remain NaN.
 

In [6]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
print (df1,'\n')
print (df2,'\n')
# Padding NAN's
print (df2.reindex_like(df1),'\n')

# Now fill the NaNs with the preceding values, for at most two consecutive rows
print ("Data Frame with Forward Fill limiting to 2:")
print (df2.reindex_like(df1,method='ffill',limit=2))  # limit=2 fills only two rows


       col1      col2      col3
0  0.748162  1.655339  0.782519
1 -1.066825 -1.019949 -0.910280
2  0.722073  0.420461 -0.560903
3  0.775435  1.420676 -1.085266
4  0.559094  1.786018  0.140612
5 -0.053141 -0.913574  0.179552 

       col1      col2      col3
0 -2.134272  1.471083 -0.352476
1 -0.641892  0.665497  0.130276 

       col1      col2      col3
0 -2.134272  1.471083 -0.352476
1 -0.641892  0.665497  0.130276
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN 

Data Frame with Forward Fill limiting to 2:
       col1      col2      col3
0 -2.134272  1.471083 -0.352476
1 -0.641892  0.665497  0.130276
2 -0.641892  0.665497  0.130276
3 -0.641892  0.665497  0.130276
4       NaN       NaN       NaN
5       NaN       NaN       NaN
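Because the cell above uses random numbers, the effect of limit can be easier to see with fixed values. This is a small sketch under assumed data (the Series `s` is hypothetical), again calling reindex() directly:

```python
import pandas as pd

# Hypothetical data: two known values at labels 0 and 1.
s = pd.Series([1.0, 2.0], index=[0, 1])

# Extend to six labels, forward-filling at most two consecutive gaps.
limited = s.reindex(range(6), method='ffill', limit=2)
print(limited)
```

Labels 2 and 3 receive the value from label 1, while labels 4 and 5 stay NaN because they lie beyond the two-row limit.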


# Renaming

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) 
or an arbitrary function.
 

In [7]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)

print ("After renaming the rows and columns:")

x = df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
               index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
print (x)

       col1      col2      col3
0  0.883731 -1.444869 -1.078702
1  0.401935  0.703915  1.325604
2  0.812734  0.646461 -1.146596
3  1.518568  0.317017 -0.077020
4 -1.943398 -0.006784 -0.042426
5 -1.141004  0.145199  0.077505
After renaming the rows and columns:
              c1        c2      col3
apple   0.883731 -1.444869 -1.078702
banana  0.401935  0.703915  1.325604
durian  0.812734  0.646461 -1.146596
3       1.518568  0.317017 -0.077020
4      -1.943398 -0.006784 -0.042426
5      -1.141004  0.145199  0.077505
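The cell above relabels with dict mappings only. As a hedged illustration of the "arbitrary function" case, the sketch below passes callables to rename(); the small DataFrame and the `row_` prefix are made up for the example.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['a', 'b'])

# A function is applied to every label on the given axis.
renamed = df.rename(columns=str.upper,
                    index=lambda label: 'row_' + label)
print(renamed)

# rename() returns a new object; the original labels are untouched.
print(df.columns.tolist())
```

A function is convenient when every label follows the same transformation, whereas a dict is better suited to renaming a few specific labels.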


# Iteration
 
Basic iteration (for i in object) produces −

    Series − values

    DataFrame − column labels

To iterate over the contents of a DataFrame, we can use the following methods:

    items() (called iteritems() before pandas 2.0, which removed that name): iterate over the columns as (label, Series) pairs

    iterrows(): iterate over the rows as (index, Series) pairs

    itertuples(): iterate over the rows as namedtuples
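The difference in basic iteration between a Series and a DataFrame can be checked directly. This is a minimal sketch with made-up values; the names `s`, `series_items` and `frame_items` are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Iterating a Series yields its values, not its index labels.
series_items = [value for value in s]
print(series_items)

# Iterating a DataFrame yields its column labels, not its rows.
frame_items = [label for label in df]
print(frame_items)
```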


In [8]:
# Iterating a DataFrame gives column names. 
 
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })
print (df)
for col in df:
   print (col)

            A     x         y       C           D
0  2016-01-01   0.0  0.549736     Low  123.988172
1  2016-01-02   1.0  0.334943     Low   95.301357
2  2016-01-03   2.0  0.068851     Low  108.480894
3  2016-01-04   3.0  0.861036     Low  105.012330
4  2016-01-05   4.0  0.931212    High  104.193562
5  2016-01-06   5.0  0.361122    High  106.319582
6  2016-01-07   6.0  0.359372     Low   93.294670
7  2016-01-08   7.0  0.380307    High   99.077732
8  2016-01-09   8.0  0.985083    High  103.717583
9  2016-01-10   9.0  0.494113  Medium  106.505823
10 2016-01-11  10.0  0.915462  Medium  110.031379
11 2016-01-12  11.0  0.588182  Medium  107.516912
12 2016-01-13  12.0  0.557605     Low   80.858199
13 2016-01-14  13.0  0.315939  Medium  103.758867
14 2016-01-15  14.0  0.091131  Medium   75.081830
15 2016-01-16  15.0  0.911866  Medium  112.954515
16 2016-01-17  16.0  0.911728    High  105.086248
17 2016-01-18  17.0  0.051437     Low  113.375045
18 2016-01-19  18.0  0.147823  Medium  117.335025


In [9]:
# Iterate over the columns as (label, Series) pairs.
# items() replaces iteritems(), which was removed in pandas 2.0.
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print (df,'\n\n' )
for key,value in df.items():
   print ('\n',key,'\n',value)

       col1      col2      col3
0 -0.113084 -0.412035  0.529195
1  0.127746 -1.421538  0.836490
2  0.701757  1.757010 -0.538531
3 -1.403024 -0.991567  2.881753 



 col1 
 0   -0.113084
1    0.127746
2    0.701757
3   -1.403024
Name: col1, dtype: float64

 col2 
 0   -0.412035
1   -1.421538
2    1.757010
3   -0.991567
Name: col2, dtype: float64

 col3 
 0    0.529195
1    0.836490
2   -0.538531
3    2.881753
Name: col3, dtype: float64


iterrows() returns an iterator that yields each index value along with a Series containing the data in that row.

In [10]:
 
for row_index,row in df.iterrows():
   print  ('\n row_index = ',row_index,'\n',row )


 row_index =  0 
 col1   -0.113084
col2   -0.412035
col3    0.529195
Name: 0, dtype: float64

 row_index =  1 
 col1    0.127746
col2   -1.421538
col3    0.836490
Name: 1, dtype: float64

 row_index =  2 
 col1    0.701757
col2    1.757010
col3   -0.538531
Name: 2, dtype: float64

 row_index =  3 
 col1   -1.403024
col2   -0.991567
col3    2.881753
Name: 3, dtype: float64


itertuples()

The itertuples() method returns an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple is the row’s index value, and the remaining elements are the row values.

https://www.tutorialspoint.com/python_pandas/python_pandas_iteration.htm

# Note
Do not try to modify any object while iterating over it. Iteration is meant for reading: depending on the data types, the iterator returns a copy of the original data rather than a view, so changes made to it are not reflected in the original object.

In [15]:
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=-0.11308421248942918, col2=-0.412034757630168, col3=0.5291951689673072)
Pandas(Index=1, col1=0.12774615905636919, col2=-1.4215375242583257, col3=0.8364902638412343)
Pandas(Index=2, col1=0.7017570921790534, col2=1.757010138126316, col3=-0.5385313071274301)
Pandas(Index=3, col1=-1.4030244694269578, col2=-0.9915667777202011, col3=2.8817526054448193)
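The note about not modifying objects while iterating can be illustrated with a short sketch. The frame below is hypothetical and deliberately mixes a string column with a numeric one, because whether iterrows() hands back a copy depends on the data types; with mixed types each row is a fresh copy.

```python
import pandas as pd

# Hypothetical mixed-dtype frame (strings and floats).
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.0, 2.0, 3.0]})

# Writing to the row yielded by iterrows() changes only that temporary copy.
for _, row in df.iterrows():
    row['value'] = row['value'] * 10

after_loop = df['value'].tolist()
print(after_loop)  # the DataFrame still holds the original values

# To actually change the data, assign to the DataFrame itself.
df['value'] = df['value'] * 10
print(df['value'].tolist())
```

Assigning to the column as a whole is also typically much faster than looping over rows.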


# Python Pandas - Sorting 
https://www.tutorialspoint.com/python_pandas/python_pandas_sorting.htm

There are two kinds of sorting available in pandas:

    By label
    By actual value


In [21]:
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns =['col2','col1'])
print (unsorted_df)


       col2      col1
1 -0.488801 -1.531690
4 -0.014412 -1.124997
6 -0.386607 -0.477815
2 -3.305768 -1.440976
3 -1.287479  0.982586
5  0.656364 -0.674083
9 -1.714328  1.222876
8  2.464627  1.191909
0  0.415631  0.388929
7  0.334158 -1.510351


# By Label

The sort_index() method sorts a DataFrame by its labels. The axis argument selects which labels to sort, and the ascending argument sets the sort order.
By default, sorting is done on row labels in ascending order.

In [26]:
sorted_df=unsorted_df.sort_index()
print (sorted_df,'\n')
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

       col2      col1
0  0.415631  0.388929
1 -0.488801 -1.531690
2 -3.305768 -1.440976
3 -1.287479  0.982586
4 -0.014412 -1.124997
5  0.656364 -0.674083
6 -0.386607 -0.477815
7  0.334158 -1.510351
8  2.464627  1.191909
9 -1.714328  1.222876 

       col2      col1
9 -1.714328  1.222876
8  2.464627  1.191909
7  0.334158 -1.510351
6 -0.386607 -0.477815
5  0.656364 -0.674083
4 -0.014412 -1.124997
3 -1.287479  0.982586
2 -3.305768 -1.440976
1 -0.488801 -1.531690
0  0.415631  0.388929


# Sort the Columns

Passing axis=1 to sort_index() sorts by the column labels instead. The default, axis=0, sorts by the row labels.
 

In [28]:
sorted_df=unsorted_df.sort_index(axis=1)
print (sorted_df )

       col1      col2
1 -1.531690 -0.488801
4 -1.124997 -0.014412
6 -0.477815 -0.386607
2 -1.440976 -3.305768
3  0.982586 -1.287479
5 -0.674083  0.656364
9  1.222876 -1.714328
8  1.191909  2.464627
0  0.388929  0.415631
7 -1.510351  0.334158


# By Value

Just as sort_index() sorts by labels, sort_values() sorts by values. It accepts a 'by' argument that names the column (or columns) whose values determine the order.

In [29]:
sorted_df = unsorted_df.sort_values(by='col1')

print (sorted_df )


       col2      col1
1 -0.488801 -1.531690
7  0.334158 -1.510351
2 -3.305768 -1.440976
4 -0.014412 -1.124997
5  0.656364 -0.674083
6 -0.386607 -0.477815
0  0.415631  0.388929
3 -1.287479  0.982586
8  2.464627  1.191909
9 -1.714328  1.222876
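The 'by' argument can also take a list of column names, in which case later columns break ties in earlier ones. The sketch below is an illustration with made-up integer data, not output from the notebook; it also passes a list to ascending to set the direction per column.

```python
import pandas as pd

# Hypothetical data with repeated values in col1.
df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# Sort by col1 ascending, then break ties by col2 descending.
result = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(result)
```

The original row labels travel with the rows, so the index of the result shows how the rows were reordered.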
