# Objective : 11. Pandas for Computation

<hr>

1. Percent change
2. Covariance
3. Correlation
4. Data Ranking
5. Window Functions
6. Time aware rolling
7. Rolling vs Expanding

<hr>


## Statistical Functions

### 1. Percent Change
Series and DataFrame have a `pct_change()` method that computes the change over a given number of periods. The result is a fraction, so multiply it by 100 to express it as a percentage.

In [1]:
import pandas as pd
import numpy as np

In [17]:
# Random monthly sales figures for four products, indexed by month
sales_data = pd.DataFrame(data=np.random.randint(1, 100, (10, 4)),
                          columns=['Tea', 'Milk', 'Carpet', 'Cream'],
                          index=pd.period_range('1/1/2011', freq='M', periods=10))
sales_data

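|  | Tea | Milk | Carpet | Cream |
|---|---|---|---|---|
| 2011-01 | 67 | 2 | 72 | 69 |
| 2011-02 | 59 | 48 | 21 | 79 |
| 2011-03 | 46 | 65 | 49 | 72 |
| 2011-04 | 43 | 21 | 83 | 35 |
| 2011-05 | 49 | 85 | 63 | 20 |
| 2011-06 | 64 | 3 | 52 | 11 |
| 2011-07 | 72 | 8 | 94 | 73 |
| 2011-08 | 85 | 86 | 2 | 28 |
| 2011-09 | 64 | 8 | 80 | 54 |
| 2011-10 | 23 | 34 | 92 | 57 |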


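* Month-over-month change in sales, expressed as a percentage:

In [18]:
# Scale to percent first, then round, so the displayed values have two decimals
(sales_data.pct_change(periods=1) * 100).round(2)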

|  | Tea | Milk | Carpet | Cream |
|---|---|---|---|---|
| 2011-01 | NaN | NaN | NaN | NaN |
| 2011-02 | -11.94 | 2300.0 | -70.83 | 14.49 |
| 2011-03 | -22.03 | 35.42 | 133.33 | -8.86 |
| 2011-04 | -6.52 | -67.69 | 69.39 | -51.39 |
| 2011-05 | 13.95 | 304.76 | -24.1 | -42.86 |
| 2011-06 | 30.61 | -96.47 | -17.46 | -45.0 |
| 2011-07 | 12.5 | 166.67 | 80.77 | 563.64 |
| 2011-08 | 18.06 | 975.0 | -97.87 | -61.64 |
| 2011-09 | -24.71 | -90.7 | 3900.0 | 92.86 |
| 2011-10 | -64.06 | 325.0 | 15.0 | 5.56 |
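To make the calculation concrete, here is a minimal sketch that recomputes the change by hand for the first four Tea values from the table above. The names `tea`, `builtin`, and `manual` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# First four Tea values from the sales table above.
tea = pd.Series([67, 59, 46, 43], name="Tea")

# Built-in fractional change over one period.
builtin = tea.pct_change(periods=1)

# Manual equivalent: (current - previous) / previous.
previous = tea.shift(1)
manual = (tea - previous) / previous

# Both should reproduce the start of the Tea column in the output above.
print((builtin * 100).round(2).tolist())
print((manual * 100).round(2).tolist())
```

The first entry is NaN because the first row has no previous value to compare against.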


### 2. Covariance & Correlation
`cov()` calculates the covariance between series. Covariance is a measure of how much two random variables vary together.
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR5-M7fDrZCkWZI2w9wVhlWsUvBmZoF94HGBYMs6L2kXFLlO095">



A correlation coefficient puts a value on the strength of that relationship. Correlation coefficients range between -1 and 1. A value of 0 means there is no linear relationship between the variables, while -1 or 1 means there is a perfect negative or positive linear correlation.
<img src="https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2012/10/pearson-2-small.png">
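For reference, the sample covariance and the Pearson correlation coefficient of $n$ paired observations can be written as follows. The $1/(n-1)$ normalization is an assumption chosen to match the default of `DataFrame.cov()`; the linked images may use a different convention.

$$
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
$$

Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_X$ and $s_Y$ are the sample standard deviations of the two variables.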


In [22]:
df = pd.DataFrame(np.random.randint(10,20,(10,2)), columns=['A','B'])

df.cov()

In [24]:
df.corr()

|  | A | B |
|---|---|---|
| A | 1.0 | 0.182584 |
| B | 0.182584 | 1.0 |
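As a rough check of what `cov()` and `corr()` compute, the sketch below rebuilds `df` from the A and B values that this notebook prints for it and recomputes both statistics by hand. The variable names `cov_manual` and `corr_manual` are illustrative assumptions.

```python
import pandas as pd

# A and B values as printed for df in this notebook.
df = pd.DataFrame({
    "A": [17, 14, 10, 10, 16, 14, 17, 10, 15, 14],
    "B": [17, 19, 13, 13, 10, 13, 10, 10, 19, 11],
})

a, b = df["A"], df["B"]
n = len(df)

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_manual = ((a - a.mean()) * (b - b.mean())).sum() / (n - 1)

# Pearson correlation: covariance scaled by both sample standard deviations.
corr_manual = cov_manual / (a.std() * b.std())

print(cov_manual, df.cov().loc["A", "B"])
print(corr_manual, df.corr().loc["A", "B"])
```

Each printed pair should agree up to floating-point rounding.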


* The `rank()` method produces a data ranking. By default, tied values are each assigned the mean of the ranks that the tied group spans:

In [27]:
df

|  | A | B |
|---|---|---|
| 0 | 17 | 17 |
| 1 | 14 | 19 |
| 2 | 10 | 13 |
| 3 | 10 | 13 |
| 4 | 16 | 10 |
| 5 | 14 | 13 |
| 6 | 17 | 10 |
| 7 | 10 | 10 |
| 8 | 15 | 19 |
| 9 | 14 | 11 |


In [29]:
df['Rank'] = df.A.rank()
df

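|  | A | B | Rank |
|---|---|---|---|
| 0 | 17 | 17 | 9.5 |
| 1 | 14 | 19 | 5.0 |
| 2 | 10 | 13 | 2.0 |
| 3 | 10 | 13 | 2.0 |
| 4 | 16 | 10 | 8.0 |
| 5 | 14 | 13 | 5.0 |
| 6 | 17 | 10 | 9.5 |
| 7 | 10 | 10 | 2.0 |
| 8 | 15 | 19 | 7.0 |
| 9 | 14 | 11 | 5.0 |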
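The default tie handling is only one option. The sketch below compares it with three other values of the `method` parameter of `Series.rank()`, using column A from the table above. The `ranks` frame and its column names are illustrative only.

```python
import pandas as pd

# Column A from the df table above.
a = pd.Series([17, 14, 10, 10, 16, 14, 17, 10, 15, 14], name="A")

ranks = pd.DataFrame({
    "A": a,
    "average": a.rank(),              # default: ties share the mean rank
    "min": a.rank(method="min"),      # ties share the lowest rank in the group
    "dense": a.rank(method="dense"),  # like "min", but ranks have no gaps
    "first": a.rank(method="first"),  # ties broken by order of appearance
})
print(ranks)
```

For example, the three rows with value 10 occupy ranks 1 to 3, so they get 2.0 under `average`, 1.0 under `min` and `dense`, and 1.0, 2.0, 3.0 under `first`.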


### 3. Window Functions
1. pandas provides a number of window functions for computing statistics over a sliding (rolling) window of rows.
2. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis.

In [31]:
sales_data = pd.read_csv('../Data/sales-data.csv')

In [32]:
sales_data

|  | Month | Sales |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
| 2 | 1-03 | 183.1 |
| 3 | 1-04 | 119.3 |
| 4 | 1-05 | 180.3 |
| 5 | 1-06 | 168.5 |
| 6 | 1-07 | 231.8 |
| 7 | 1-08 | 224.5 |
| 8 | 1-09 | 192.8 |
| 9 | 1-10 | 122.9 |


In [34]:
r = sales_data.Sales.rolling(window=5)

In [35]:
r.count()

0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     5.0
6     5.0
7     5.0
8     5.0
9     5.0
10    5.0
11    5.0
12    5.0
13    5.0
14    5.0
15    5.0
16    5.0
17    5.0
18    5.0
19    5.0
20    5.0
21    5.0
22    5.0
23    5.0
24    5.0
25    5.0
26    5.0
27    5.0
28    5.0
29    5.0
30    5.0
31    5.0
32    5.0
33    5.0
34    5.0
35    5.0
Name: Sales, dtype: float64

In [36]:
r.max()

0       NaN
1       NaN
2       NaN
3       NaN
4     266.0
5     183.1
6     231.8
7     231.8
8     231.8
9     231.8
10    336.5
11    336.5
12    336.5
13    336.5
14    336.5
15    273.3
16    273.3
17    287.0
18    287.0
19    303.6
20    303.6
21    421.6
22    421.6
23    421.6
24    421.6
25    440.4
26    440.4
27    440.4
28    440.4
29    440.4
30    575.5
31    575.5
32    682.0
33    682.0
34    682.0
35    682.0
Name: Sales, dtype: float64
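The leading NaN values appear because a fixed-size rolling window needs a full set of observations by default. The sketch below illustrates this with the `min_periods` parameter, using only the first ten Sales values shown in the table above. The names `full_window`, `partial_window`, and `manual` are illustrative.

```python
import pandas as pd

# First ten Sales values from the table above.
sales = pd.Series([266.0, 145.9, 183.1, 119.3, 180.3,
                   168.5, 231.8, 224.5, 192.8, 122.9], name="Sales")

# Default: min_periods equals the window size, so the first four results are NaN.
full_window = sales.rolling(window=5).mean()

# min_periods=1 emits a result as soon as one observation is available.
partial_window = sales.rolling(window=5, min_periods=1).mean()

# The rolling mean at position 4 is simply the mean of positions 0 through 4.
manual = sales.iloc[0:5].mean()

print(pd.DataFrame({"full": full_window, "partial": partial_window}))
print(manual)
```

Setting `min_periods` trades the leading NaN values for early results that are based on fewer than five observations.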

### Time aware rolling

In [37]:
dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                    index=pd.date_range('20130101 09:00:00',
                                        periods=5,
                                        freq='s'))

In [40]:
dft.rolling('2s').sum()

|  | B |
|---|---|
| 2013-01-01 09:00:00 | 0.0 |
| 2013-01-01 09:00:01 | 1.0 |
| 2013-01-01 09:00:02 | 3.0 |
| 2013-01-01 09:00:03 | 2.0 |
| 2013-01-01 09:00:04 | 4.0 |
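With evenly spaced timestamps, a time-based window and a fixed-size window can look alike. The difference shows up when the spacing is irregular. The sketch below uses hypothetical timestamps with a gap (they are not from the original notebook) to compare `rolling('2s')` with `rolling(2)`.

```python
import pandas as pd

# Hypothetical irregular timestamps: note the 4-second gap after 09:00:01.
idx = pd.to_datetime([
    "2013-01-01 09:00:00",
    "2013-01-01 09:00:01",
    "2013-01-01 09:00:05",
    "2013-01-01 09:00:06",
])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Offset window: sums the observations in the 2 seconds ending at each timestamp.
by_time = s.rolling("2s").sum()

# Fixed window: always sums the last 2 rows, however far apart they are.
by_count = s.rolling(2).sum()

print(pd.DataFrame({"by_time": by_time, "by_count": by_count}))
```

At 09:00:05 the time-based window contains only that one row, because the previous observation is more than two seconds old, while the fixed-size window still adds the row from 09:00:01.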


In [41]:
# r is the 5-row rolling window over sales_data.Sales defined earlier
r.agg('sum')

0        NaN
1        NaN
2        NaN
3        NaN
4      894.6
5      797.1
6      883.0
7      924.4
8      997.9
9      940.5
10    1108.5
11    1062.6
12    1032.4
13     989.1
14    1076.3
15    1013.1
16    1018.6
17    1111.3
18    1187.8
19    1281.3
20    1297.9
21    1528.1
22    1505.6
23    1621.9
24    1658.0
25    1808.5
26    1702.8
27    1877.6
28    1936.6
29    2034.3
30    2169.4
31    2261.1
32    2503.8
33    2577.8
34    2721.7
35    2793.1
Name: Sales, dtype: float64

In [42]:
# Apply several aggregations at once; each becomes a column in the result
r.agg(['sum', 'mean'])

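|  | sum | mean |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | 894.6 | 178.92 |
| 5 | 797.1 | 159.42 |
| 6 | 883.0 | 176.6 |
| 7 | 924.4 | 184.88 |
| 8 | 997.9 | 199.58 |
| 9 | 940.5 | 188.1 |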


### Rolling vs Expanding

In [43]:
data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns = ['category', 'value'])

In [44]:
data.value.expanding(1).sum()

0      1.0
1      3.0
2      6.0
3     11.0
4     17.0
5     24.0
6     32.0
7     42.0
8     53.0
9     65.0
10    78.0
Name: value, dtype: float64

In [47]:
data.value.rolling(2).sum()

0      NaN
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0
Name: value, dtype: float64

1. Expanding - An expanding window with an initial size of 1 contains only the first row in the first step. In the second step it contains the first and second rows. Each step adds one more row to the window, and the aggregate function is recalculated over all rows seen so far.

2. Rolling - A rolling window has a fixed size and slides along the data. What happens when we set the rolling window size to 2?

   - In the first step, the window contains only the first row. That is fewer than the 2 observations required by default, so the result is NaN.

   - In the second step, the window contains the first and second rows, so the aggregate function can be calculated. In this example, it is the sum of both rows.

   - In the third step, the window moves again and no longer contains the first row. It now gives the sum of the second and third rows.
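The two behaviours described above can be checked against simpler operations. In this minimal sketch, which reuses the `value` column from the example, the expanding sum is compared with a running total and the rolling sum with a shifted addition. The names `running_total` and `pairwise` are illustrative.

```python
import pandas as pd

# The value column from the data frame above.
value = pd.Series([1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13], name="value")

expanding_sum = value.expanding(min_periods=1).sum()  # window grows from the first row
rolling_sum = value.rolling(window=2).sum()           # window always spans 2 rows

# An expanding sum is a running total, so it should match cumsum().
running_total = value.cumsum()

# A rolling sum of size 2 adds each row to the one before it.
pairwise = value + value.shift(1)

print(pd.DataFrame({
    "expanding": expanding_sum,
    "cumsum": running_total,
    "rolling": rolling_sum,
    "pairwise": pairwise,
}))
```

Both `rolling` and `pairwise` start with NaN, since the first row has no predecessor to pair with.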