# Indexing and Selecting Data
 
The Python and NumPy indexing operator "[ ]" and the attribute operator "." provide quick and easy access to pandas data structures across a wide range of use cases. However, because the type of the data being accessed is not known in advance, these standard operators can only be optimized so far. For production code, we recommend the optimized pandas data access methods explained in this chapter.

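pandas provides three indexers for multi-axis indexing, listed below. All three are used with square brackets rather than called like functions. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so only .loc and .iloc are available in current releases.

  .loc[] Label based

  .iloc[] Integer position based

  .ix[] Both label and integer based (deprecated, then removed)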

 https://www.tutorialspoint.com/python_pandas/python_pandas_indexing_and_selecting_data.htm
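
To make the difference between label-based and position-based access concrete, here is a minimal sketch. The frame `demo`, its labels, and its values are made up for illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Small illustrative frame; the labels and values are assumptions for this sketch.
demo = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

by_label = demo.loc["y", "B"]     # row labelled "y", column labelled "B"
by_position = demo.iloc[1, 1]     # second row, second column (0-based positions)

print(by_label, by_position)
```

Both expressions reach the same cell; .loc gets there by name and .iloc by position.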

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(np.random.randn(8, 4),
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  columns=['A', 'B', 'C', 'D'])

df

          A         B         C         D
a  0.891379 -0.406788 -0.253217 -0.401724
b -1.595768  1.226259 -2.080529  0.733669
c -0.459034 -2.042511  1.483967  0.968990
d  2.453766 -0.646895  0.752428 -0.034786
e  0.967470  0.762320 -0.888068  2.223633
f -0.353881 -1.895874  0.570419 -1.736160
g  1.740131  0.678020  0.717293  0.698409
h  0.527542 -1.319085  0.391154 -1.331395


# .loc[]

.loc provides purely label-based indexing.

In [3]:
# Select all rows for a single column
print (df.loc[:,'A'] ,'\n')
# Select all rows for several columns, given as a list
print (df.loc[:,['A','C']],'\n')
# Select a few rows and several columns, both given as lists
print (df.loc[['a','b','h'],['A','C']],'\n')
# Select a range of rows (both endpoints included) for all columns
print (df.loc['a':'h'])

a    0.891379
b   -1.595768
c   -0.459034
d    2.453766
e    0.967470
f   -0.353881
g    1.740131
h    0.527542
Name: A, dtype: float64 

          A         C
a  0.891379 -0.253217
b -1.595768 -2.080529
c -0.459034  1.483967
d  2.453766  0.752428
e  0.967470 -0.888068
f -0.353881  0.570419
g  1.740131  0.717293
h  0.527542  0.391154 

          A         C
a  0.891379 -0.253217
b -1.595768 -2.080529
h  0.527542  0.391154 

          A         B         C         D
a  0.891379 -0.406788 -0.253217 -0.401724
b -1.595768  1.226259 -2.080529  0.733669
c -0.459034 -2.042511  1.483967  0.968990
d  2.453766 -0.646895  0.752428 -0.034786
e  0.967470  0.762320 -0.888068  2.223633
f -0.353881 -1.895874  0.570419 -1.736160
g  1.740131  0.678020  0.717293  0.698409
h  0.527542 -1.319085  0.391154 -1.331395


In [4]:
# Compare every value in row 'a' with 0; the result is a boolean Series
print (df.loc['a']>0)

A     True
B    False
C    False
D    False
Name: a, dtype: bool
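
A boolean Series like the one above can itself be passed to .loc as a mask. The following is a small sketch of that idea; the frame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Illustrative data; not the random frame used in the cells above.
demo = pd.DataFrame(
    {"A": [1.0, -2.0], "B": [-0.5, 3.0], "C": [0.25, -1.0]},
    index=["a", "b"],
)

mask = demo.loc["a"] > 0            # boolean Series indexed by column label
positive_in_a = demo.loc[:, mask]   # keep only the columns where row "a" is positive

print(positive_in_a)
```

Here only columns A and C survive, because row "a" is negative in column B.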


# .iloc[]

.iloc provides purely integer-position-based indexing, with positions counted from 0.

In [5]:
# Select the first four rows for all columns
print (df.iloc[:4])

          A         B         C         D
a  0.891379 -0.406788 -0.253217 -0.401724
b -1.595768  1.226259 -2.080529  0.733669
c -0.459034 -2.042511  1.483967  0.968990
d  2.453766 -0.646895  0.752428 -0.034786


In [6]:
print (df.iloc[1:5, 2:4] )

          C         D
b -2.080529  0.733669
c  1.483967  0.968990
d  0.752428 -0.034786
e -0.888068  2.223633


In [7]:
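# Select rows and columns by lists of integer positions
print (df.iloc[[1, 3, 5], [1, 3]])
# Select rows by a position slice, all columns
print (df.iloc[1:3, :])
# Select all rows, columns by a position slice
print (df.iloc[:,1:4])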

          B         D
b  1.226259  0.733669
d -0.646895 -0.034786
f -1.895874 -1.736160
          A         B         C         D
b -1.595768  1.226259 -2.080529  0.733669
c -0.459034 -2.042511  1.483967  0.968990
          B         C         D
a -0.406788 -0.253217 -0.401724
b  1.226259 -2.080529  0.733669
c -2.042511  1.483967  0.968990
d -0.646895  0.752428 -0.034786
e  0.762320 -0.888068  2.223633
f -1.895874  0.570419 -1.736160
g  0.678020  0.717293  0.698409
h -1.319085  0.391154 -1.331395
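
One detail worth checking when switching between the two indexers is how slices treat their endpoint. A minimal sketch, using a made-up frame `demo`:

```python
import pandas as pd

# Illustrative frame with five labelled rows.
demo = pd.DataFrame({"A": range(5)}, index=list("abcde"))

label_slice = demo.loc["b":"d"]    # label slice: the stop label "d" is included
position_slice = demo.iloc[1:3]    # position slice: the stop position 3 is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```

The label slice returns rows b, c and d, while the position slice returns only rows b and c, following ordinary Python slicing rules.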


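# .ix[]

.ix was a hybrid indexer that accepted both labels and integer positions for selecting and subsetting an object. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the cell below runs only on older versions.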

In [8]:
print (df.ix[:4],'\n')
print (df.ix[:'a'],'\n')

          A         B         C         D
a  0.891379 -0.406788 -0.253217 -0.401724
b -1.595768  1.226259 -2.080529  0.733669
c -0.459034 -2.042511  1.483967  0.968990
d  2.453766 -0.646895  0.752428 -0.034786 

          A         B         C         D
a  0.891379 -0.406788 -0.253217 -0.401724 



.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

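Following the advice in the deprecation message, the two .ix calls above can be rewritten with .iloc and .loc. The sketch below is one possible translation, not the only one; the frame `demo` is a deterministic stand-in for the random data, and the mixed position-plus-label selection is an extra case added for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the 8x4 frame used in this chapter.
demo = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    index=list("abcdefgh"),
    columns=list("ABCD"),
)

first_four = demo.iloc[:4]    # positional replacement for df.ix[:4]
up_to_a = demo.loc[:"a"]      # label-based replacement for df.ix[:'a']

# Mixed selection: first three rows by position, column "C" by label.
mixed = demo.loc[demo.index[:3], "C"]
mixed_alt = demo.iloc[:3, demo.columns.get_loc("C")]

print(mixed)
```

Converting positions to labels with `demo.index[:3]`, or labels to positions with `get_loc`, keeps each call purely label-based or purely positional.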

# Statistical Functions
https://www.tutorialspoint.com/python_pandas/python_pandas_statistical_functions.htm

# Percent change

Series and DataFrame objects both have a pct_change() method (as did Panel, before it was removed from pandas). It compares every element with the element before it and returns the relative change as a fraction, so 0.5 means an increase of 50%.

By default, pct_change() works down each column; to compute the change across each row instead, pass the axis=1 argument.
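
As a rough check of what pct_change() computes, the sketch below rebuilds the result by hand with shift() and then applies the method across rows. The column names q1, q2, q3 and all values are assumptions made up for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# Change relative to the previous element, expressed as a fraction.
manual = (s - s.shift(1)) / s.shift(1)
max_diff = (s.pct_change() - manual).abs().max()

# Hypothetical quarterly figures, one row per product.
df_q = pd.DataFrame({"q1": [100.0, 50.0], "q2": [110.0, 55.0], "q3": [99.0, 66.0]})
row_wise = df_q.pct_change(axis=1)   # compare each column with the one to its left

print(row_wise)
```

With axis=1 the first column is NaN, since it has no column to its left to compare against.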

In [9]:
s = pd.Series([1,2,3,4,5,4])
print (s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change() )

0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
          0         1
0       NaN       NaN
1  6.245426 -6.524687
2  0.280130 -1.700129
3 -0.856789  0.127849
4 -8.972565 -3.624335


# Covariance

The cov() method of a Series computes the covariance between two Series; missing values are excluded automatically. Called on a DataFrame, cov() computes the pairwise covariance between all columns.

In [15]:
# Covariance between two Series
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1,'\n cov  =',s1.cov(s2))

0   -1.217630
1   -0.451641
2    0.599213
3   -1.438302
4   -0.905229
5    0.695782
6   -1.572937
7   -0.578331
8   -0.858843
9   -1.621571
dtype: float64 
 cov  = 0.37813347147983056


In [18]:
# Covariance applied to a DataFrame computes the pairwise covariance between all columns.
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print ('frame[ a ].cov(frame[ b ]) = ',frame['a'].cov(frame['b']))
print (frame.cov())

frame[ a ].cov(frame[ b ]) =  0.19068166873131734
          a         b         c         d         e
a  1.987950  0.190682  0.303564  0.121438  0.818489
b  0.190682  2.054943  0.239032  0.468931  0.711169
c  0.303564  0.239032  0.805684 -0.167491  0.648007
d  0.121438  0.468931 -0.167491  1.489115 -0.583660
e  0.818489  0.711169  0.648007 -0.583660  1.351781
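
To show what these numbers mean, the following sketch computes a sample covariance by hand and compares it with cov(). The two short Series are made-up values chosen so the arithmetic is easy to follow.

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])
y = pd.Series([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of the products of deviations, divided by (n - 1).
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# The diagonal of DataFrame.cov() holds each column's variance.
cov_matrix = pd.DataFrame({"x": x, "y": y}).cov()

print(manual_cov, x.cov(y))
print(cov_matrix)
```

The division by n - 1 rather than n is why the diagonal entries match var(), which also uses the sample (unbiased) estimator by default.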


# Correlation

Correlation measures the strength of the relationship between two arrays of values (Series). Several methods are available: pearson (the default), which measures linear correlation, and spearman and kendall, which are rank-based.

In [20]:
print (frame['a'].corr(frame['b']))
print (frame.corr())

0.09434226514797121
          a         b         c         d         e
a  1.000000  0.094342  0.239864  0.070581  0.499295
b  0.094342  1.000000  0.185769  0.268068  0.426697
c  0.239864  0.185769  1.000000 -0.152913  0.620932
d  0.070581  0.268068 -0.152913  1.000000 -0.411380
e  0.499295  0.426697  0.620932 -0.411380  1.000000
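
The difference between the linear and rank-based methods shows up on data that is monotonic but not linear. The sketch below uses a made-up cubic relationship and DataFrame.corr() with its `method` argument to illustrate this.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
demo = pd.DataFrame({"x": x, "y": x ** 3})   # monotonic but not linear

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

# Spearman correlation is the Pearson correlation of the ranks.
rank_pearson = demo.rank().corr().loc["x", "y"]

print(pearson, spearman, rank_pearson)
```

Because y always increases with x, the rank-based value is 1, while the Pearson value stays below 1.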


# Data Ranking

Data ranking assigns a rank to each element of an array. By default, tied elements receive the mean of the ranks they span.


In [24]:
s = pd.Series(np.random.randn(5), index=list('abcde'))
print(s)
s['d'] = s['b'] # so there's a tie
print (s.rank())

a    0.047684
b   -0.677289
c    1.031239
d   -1.113284
e    0.080109
dtype: float64
a    3.0
b    1.5
c    5.0
d    1.5
e    4.0
dtype: float64
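
Mean rank is only the default way to break ties. The sketch below compares it with the other values accepted by the `method` argument of rank(), using a small made-up Series with one tie.

```python
import pandas as pd

s = pd.Series([7, 3, 7, 1], index=list("abcd"))   # "a" and "c" are tied

ranks = pd.DataFrame({
    "average": s.rank(),               # default: tied values share the mean rank
    "min": s.rank(method="min"),       # tied values all get the lowest rank
    "max": s.rank(method="max"),       # tied values all get the highest rank
    "first": s.rank(method="first"),   # ties broken by order of appearance
    "dense": s.rank(method="dense"),   # like "min", but ranks have no gaps
})

print(ranks)
```

The tied elements get 3.5 under "average", 3 under "min", 4 under "max", and 3 and 4 under "first".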


#  Window Functions
https://www.tutorialspoint.com/python_pandas/python_pandas_window_functions.htm

For working with numerical data, pandas provides several kinds of window operations: rolling, expanding, and exponentially weighted windows. Statistics that can be computed over these windows include sum, mean, median, variance, covariance, and correlation.
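
As a quick side-by-side of the three window types, the sketch below applies mean() to the same made-up Series through each of them. The window size and span are arbitrary values picked for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolling_mean = s.rolling(window=2).mean()   # fixed-size window of the last 2 values
expanding_mean = s.expanding().mean()       # window grows from the first row onward
ewm_mean = s.ewm(span=2).mean()             # all past values, exponentially decaying weights

print(pd.DataFrame({
    "rolling": rolling_mean,
    "expanding": expanding_mean,
    "ewm": ewm_mean,
}))
```

The rolling result starts with NaN because its window is not yet full, whereas the expanding and exponentially weighted results are defined from the first row.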

# .rolling()  

rolling() can be applied to a Series or a DataFrame. Pass the window=n argument, then call the statistical function you need on the result.

In [27]:
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print (df)
print (df.rolling(window=3).mean())
print (df.rolling(window=5).mean())

                   A         B         C         D
2000-01-01 -0.139488  1.005300  0.219181  2.152987
2000-01-02 -1.702983  1.420294  0.302528 -0.505084
2000-01-03 -0.308261  0.574901  0.162752 -0.848525
2000-01-04  1.251195 -1.853299  1.506683 -0.863800
2000-01-05 -0.013401 -0.821836 -0.373666 -0.077285
2000-01-06 -0.141061 -2.367297 -0.916062  0.267193
2000-01-07  0.242596  0.431652  0.508310  1.532691
2000-01-08  1.886295 -0.221724 -0.326843 -0.589927
2000-01-09  1.213165  0.127129  0.523600  0.548656
2000-01-10  0.593953  0.360567  0.796264 -0.529942
                   A         B         C         D
2000-01-01       NaN       NaN       NaN       NaN
2000-01-02       NaN       NaN       NaN       NaN
2000-01-03 -0.716911  1.000165  0.228154  0.266459
2000-01-04 -0.253350  0.047299  0.657321 -0.739136
2000-01-05  0.309844 -0.700078  0.431923 -0.596537
2000-01-06  0.365578 -1.680811  0.072318 -0.224631
2000-01-07  0.029378 -0.919160 -0.260473  0.574199
2000-01-08  0.662610 -0.719123 

Note: with a window size of 3, the first two rows are NaN because a full window is not yet available. From the third row onward, each value is the average of the current row and the two rows before it (rows n, n-1 and n-2).
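
The note above can be verified on a small made-up Series. The sketch also uses the `min_periods` argument of rolling(), which is one way to avoid the leading NaN values when partial windows are acceptable.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

rolled = s.rolling(window=3).mean()

# The value at position 2 averages positions 0, 1 and 2.
manual_third = s.iloc[0:3].mean()

# min_periods=1 lets a window produce a result with fewer than 3 observations.
partial = s.rolling(window=3, min_periods=1).mean()

print(rolled)
print(partial)
```

With min_periods=1 the first two results are the means of one and two values respectively, instead of NaN.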

# .ewm()

ewm() can be applied to a Series or a DataFrame. Specify one of the com, span, or halflife arguments, then call the statistical function you need on the result. The weights it assigns decay exponentially, so recent observations count more than older ones.

In [29]:
print (df.ewm(com=0.5).mean())

                   A         B         C         D
2000-01-01 -0.139488  1.005300  0.219181  2.152987
2000-01-02 -1.312109  1.316546  0.281692  0.159434
2000-01-03 -0.617137  0.803099  0.199349 -0.538384
2000-01-04  0.643987 -0.989969  1.081799 -0.758039
2000-01-05  0.203917 -0.877417  0.107479 -0.302328
2000-01-06 -0.026384 -1.872035 -0.575819  0.077874
2000-01-07  0.153018 -0.335541  0.147264  1.048196
2000-01-08  1.308712 -0.259652 -0.168856 -0.044053
2000-01-09  1.245011 -0.001785  0.292805  0.351107
2000-01-10  0.810965  0.239787  0.628450 -0.236269
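
To see where such values come from, the sketch below reproduces the last value of an exponentially weighted mean by hand. It assumes the default adjust=True behaviour and uses a short made-up Series instead of the random frame above; `alpha` is the smoothing factor that pandas derives from com as 1 / (1 + com).

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
com = 0.5
alpha = 1 / (1 + com)   # smoothing factor implied by com

ewm_mean = s.ewm(com=com).mean()   # adjust=True is the default

# With adjust=True, the observation i steps back gets weight (1 - alpha) ** i.
weights = [(1 - alpha) ** i for i in range(len(s))]   # newest observation first
newest_first = s.iloc[::-1].tolist()
manual_last = sum(w * v for w, v in zip(weights, newest_first)) / sum(weights)

print(ewm_mean.iloc[-1], manual_last)
```

With com=0.5 each step back in time cuts the weight to one third, so the most recent rows dominate the result.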
