### Libraries / Data

import numpy and pandas

In [None]:
import numpy as np
import pandas as pd

set pandas display options that control the output format

In [None]:
pd.options.display.max_rows = 10

- read data 
- use Symbol column as index 
- read only those columns ['Symbol', 'Sector', 'Price', 'Book Value']

| Column Name | Description |
| ------------- |:-------------:|
|Symbol|Ticker symbol (abbreviated company name)|
|Name|Full company name|
|Sector|Economic sector|
|Price|Share price|
|Dividend Yield|Dividend yield|
|Price/Earnings|Price-to-earnings ratio|
|Earnings/Share|Earnings per share|
|Book Value|Book value of the company|
|52 week low|52-week low|
|52 week high|52-week high|
|Market Cap|Market capitalization|
|EBITDA|**E**arnings **b**efore **i**nterest, **t**axes, **d**epreciation and **a**mortization|
|Price/Sales|Price-to-sales ratio|
|Price/Book|Price-to-book ratio|
|SEC Filings|Link to *sec.gov*|

In [None]:
sp500 = pd.read_csv("../data/sp500.csv",
                    index_col='Symbol',
                    usecols=['Symbol', 'Sector', 'Price', 'Book Value'])

In [None]:
sp500.head()

read historical stock quotes

In [None]:
omh = pd.read_csv('../data/omh.csv', 
                  parse_dates=['Date'])

omh.set_index('Date', 
              inplace=True)

In [None]:
omh.head()

### Summary of statistics

get summary statistics for the DataFrame

In [None]:
sp500.describe()

calculate the stats summary for a single Price column

In [None]:
sp500.Price.describe()

get a summary of statistics for non-numeric data

In [None]:
sp500.Sector.describe()

method info:

In [None]:
sp500.info()

get the relative frequency of each value in the non-numeric Sector column

In [None]:
sp500.Sector.value_counts(normalize=True)
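A minimal sketch of what `normalize=True` changes, using a small hypothetical Series (`sectors`) rather than the sp500 data: without the flag `value_counts` returns absolute counts, with it the counts are divided by the total so they sum to 1.

```python
import pandas as pd

# hypothetical sector labels, not the sp500 data
sectors = pd.Series(['Tech', 'Tech', 'Energy', 'Health'])

counts = sectors.value_counts()                # absolute count per value
shares = sectors.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(shares)
```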

### Arithmetic operations

- set the seed of the random number generator for reproducible results
- create DataFrame object

In [None]:
np.random.seed(123)
df = pd.DataFrame(np.random.randn(5, 4), 
                  columns=['A', 'B', 'C', 'D'])
df

multiply everything by 2, take only absolute values

In [None]:
abs(df * 2)

subtract the first row from each row of the DataFrame object

In [None]:
df

In [None]:
df.iloc[0]

In [None]:
df - df.iloc[0]

subtract DataFrame object from Series object

In [None]:
df.iloc[0] - df

- take the second and third elements of the first row
- add an element labeled E
- see how alignment is applied in this arithmetic operation

In [None]:
df

In [None]:
# copy the slice so that adding 'E' cannot modify df or raise SettingWithCopyWarning
s = df.iloc[0, 1:3].copy()
s['E'] = 0
s

In [None]:
df + s
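The alignment rule can be checked on a small frame with known values. This is only a sketch: `demo` and `row` are hypothetical names, not objects from the notebook. Labels are matched by name, and any label present on only one side yields NaN.

```python
import pandas as pd

# hypothetical data with easy-to-check values
demo = pd.DataFrame([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]], columns=['A', 'B', 'C'])
row = pd.Series({'B': 10.0, 'C': 20.0, 'E': 0.0})

# the Series index is aligned with the DataFrame columns;
# A exists only in demo and E only in row, so both columns become NaN
aligned = demo + row
print(aligned)
```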

extract the rows at positions 1 through 3 and only columns B and C <br>
in effect, cut a small rectangle out of the middle of df

In [None]:
subframe = df[1:4][['B', 'C']].copy()
subframe

demonstrate how alignment occurs during the subtraction operation

In [None]:
df - subframe

extract column A and subtract it from our DataFrame

In [None]:
df.sub(df['A'], axis=0)
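Why `axis=0` matters here, as a sketch on hypothetical data (`demo` is not part of the notebook): the plain `-` operator aligns the Series index with the column labels, so subtracting a column that way finds no matching labels and returns only NaN, while `sub(..., axis=0)` aligns on the row index.

```python
import pandas as pd

# hypothetical frame with string row labels to make the alignment visible
demo = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, 20.0]}, index=['x', 'y'])

# align the column on the row index: every column has A subtracted row by row
by_rows = demo.sub(demo['A'], axis=0)

# default alignment matches the Series index (x, y) against the columns (A, B):
# nothing overlaps, so the result has four columns filled with NaN
by_cols = demo - demo['A']

print(by_rows)
print(by_cols)
```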

### One-dimensional statistics

#### minimum / maximum

determine the minimum price for both stocks

In [None]:
omh[['MSFT', 'AAPL']].min()

determine the index label that corresponds to the minimum price for both stocks

In [None]:
omh[['MSFT', 'AAPL']].idxmin()

#### mean / median / mode

<img src='..\images\moda-mediana.jpg'/>

calculate the mean of every column in the omh data

In [None]:
omh.mean()

calculate the average of all columns for each row (print the first 5)

In [None]:
omh.mean(axis=1).head() 

calculate median values for each column

In [None]:
omh.median()

calculate the mode for the Sector column

In [None]:
sp500.Sector.mode()

there can be several modes, so the result of the operation is a Series

In [None]:
s = pd.Series([1, 2, 3, 3, 5, 1])
s.mode()

#### [variance](https://ru.wikipedia.org/wiki/Дисперсия_случайной_величины) / standard deviation

calculate the variance of values in each column

In [None]:
omh.var()

In [None]:
(omh.MSFT**2 - omh.MSFT.mean()**2).sum() / (omh.shape[0]-1)
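The manual expression above relies on the identity sum((x - mean)^2) = sum(x^2) - n * mean^2. A small sketch on hypothetical values (the Series `x` is not from the omh data) compares the shortcut, the textbook definition and `var()`, which uses the same n - 1 denominator by default.

```python
import numpy as np
import pandas as pd

# hypothetical sample, chosen so the result is easy to verify by hand
x = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0])
n = len(x)

# shortcut form, same shape as the expression used for MSFT
shortcut = (x**2 - x.mean()**2).sum() / (n - 1)

# textbook definition of the sample variance
definition = ((x - x.mean())**2).sum() / (n - 1)

print(shortcut, definition, x.var())
```

The shortcut subtracts two large numbers and can lose precision when the values are large relative to their spread, so the definition form is usually the safer one to write by hand.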

calculate the standard deviation

In [None]:
omh.std()

In [None]:
omh.MSFT.var()**0.5

#### [covariance](https://ru.wikipedia.org/wiki/Ковариация) / [correlation](https://ru.wikipedia.org/wiki/Корреляция)

calculate the covariance between MSFT and AAPL

In [None]:
omh.MSFT.cov(omh.AAPL)

calculate the correlation between MSFT and AAPL

In [None]:
omh.MSFT.corr(omh.AAPL)

In [None]:
omh.MSFT.cov(omh.AAPL) / (omh.MSFT.std() * omh.AAPL.std())

or we can get the full correlation matrix

In [None]:
omh.corr()
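The same cov / (std * std) relation holds for the whole matrix at once. A sketch on randomly generated data (the frame `prices` and its column names are hypothetical): dividing the covariance matrix by the outer product of the standard deviations reproduces `corr()`.

```python
import numpy as np
import pandas as pd

# hypothetical data: 50 observations of three unrelated series
rng = np.random.default_rng(0)
prices = pd.DataFrame(rng.normal(size=(50, 3)), columns=['X', 'Y', 'Z'])

cov = prices.cov()   # pairwise covariances
std = prices.std()   # per-column standard deviations

# element (i, j) of the outer product is std_i * std_j
corr_manual = cov / np.outer(std, std)

print(corr_manual)
```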

### Data conversion

#### discretization and quantization

generate 10000 random numbers from standard normal distribution

In [None]:
np.random.seed(123456)
dist = np.random.normal(size = 10000)
dist

compute the mean and standard deviation

In [None]:
(dist.mean(), dist.std())

split into five bins of equal width (equal interval length, not equal number of observations per bin!)

In [None]:
bins = pd.cut(dist, 5)
bins

find the widths of the resulting intervals

In [None]:
bins.categories

In [None]:
[q.right - q.left for q in bins.categories]

generate 50 integer ages from 6 to 69 (the upper bound of randint is exclusive)

In [None]:
np.random.seed(242)
ages = np.random.randint(6, 70, 50)
ages

split the ages into bins and give each group a name

In [None]:
ranges = [6, 12, 18, 35, 50, 70]
labels = ['Youth', 'Young Adult', 'Adult', 'Middle Aged', 'Retired persons']
agebins = pd.cut(ages, ranges, labels=labels)
agebins.describe()
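One detail worth checking: by default `pd.cut` builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge (an age of exactly 6 here, if one is generated) falls outside every bin and becomes NaN. A sketch with a hypothetical array `ages_demo` shows the effect and the `include_lowest=True` option that closes the first interval on the left.

```python
import numpy as np
import pandas as pd

# hypothetical ages that sit on or near the bin edges
ages_demo = np.array([6, 12, 13, 69])
edges = [6, 12, 18, 35, 50, 70]

# default: intervals are (6, 12], (12, 18], ... so 6 is not covered
default_bins = pd.cut(ages_demo, edges)

# include_lowest=True makes the first interval contain its left edge
closed_low = pd.cut(ages_demo, edges, include_lowest=True)

# codes give the bin position of each value, -1 marks a missing bin
print(default_bins.codes)
print(closed_low.codes)
```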

Split (using quantiles) into 5 groups with the same number of elements

In [None]:
qbin = pd.qcut(dist, 5)

get statistics for the resulting groups

In [None]:
qbin.describe()
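To see the difference between the two functions side by side, here is a sketch on freshly generated data (the names `values`, `equal_width` and `equal_count` are hypothetical): `pd.cut` produces bins of nearly equal width with very uneven counts, while `pd.qcut` produces bins of uneven width with nearly equal counts.

```python
import numpy as np
import pandas as pd

# hypothetical sample from the standard normal distribution
rng = np.random.default_rng(42)
values = rng.normal(size=1000)

equal_width = pd.cut(values, 4)   # same interval length
equal_count = pd.qcut(values, 4)  # same number of observations

# widths of the cut intervals (the first one is extended slightly by pandas)
widths = [iv.right - iv.left for iv in equal_width.categories]

# observations per quantile bin
counts = pd.Series(equal_count).value_counts()

print(widths)
print(pd.Series(equal_width).value_counts(sort=False))
print(counts)
```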

example of using qcut:

In [None]:
sp500_copy = sp500.copy()
sp500_copy['Price_Group'], bins = pd.qcut(sp500_copy.Price, 
                                          5,
                                          labels=['group_'+str(i) for i in range(1, 6)],
                                          retbins=True)
sp500_copy.Price_Group

In [None]:
bins

In [None]:
sp500_copy.Price_Group.value_counts()

#### cumulative sums

calculate the cumulative sum

In [None]:
pd.Series([1, 2, 3, 4]).cumsum()

calculate the cumulative product

In [None]:
pd.Series([1, 2, 3, 4]).cumprod()

#### ranking

for example:

In [None]:
s = pd.Series([160, 165, 165, 170, 175], index=list('abcde'))
s

ranking values:

In [None]:
s.rank()
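By default tied values receive the average of the ranks they occupy, which is why the two 165 entries get 2.5. The `method` parameter changes the tie handling; a short sketch on the same values (the Series is named `heights` here only for illustration):

```python
import pandas as pd

heights = pd.Series([160, 165, 165, 170, 175], index=list('abcde'))

average_rank = heights.rank()                # ties share the mean rank: 2.5
min_rank = heights.rank(method='min')        # ties take the lowest rank: 2
dense_rank = heights.rank(method='dense')    # like 'min', but with no gaps after ties

print(pd.DataFrame({'average': average_rank,
                    'min': min_rank,
                    'dense': dense_rank}))
```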

#### relative change

In [None]:
omh[['MSFT']].head()

calculate the relative change for MSFT (the current value relative to the previous one)

In [None]:
omh[['MSFT']].pct_change().head()

In [None]:
(48.46 - 48.62)/48.62
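The manual check above generalizes to the whole column with `shift`. A sketch on hypothetical prices (the Series `p` is not taken from omh): `pct_change` should match (current - previous) / previous, and the first element is NaN because it has no predecessor.

```python
import numpy as np
import pandas as pd

# hypothetical price series
p = pd.Series([100.0, 110.0, 99.0])

pct = p.pct_change()

# same computation written out: shift(1) moves every value one step down
previous = p.shift(1)
manual = (p - previous) / previous

print(pct)
print(manual)
```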

### Window functions

Rolling object:

In [None]:
r = omh.MSFT.rolling(3)

possible operations:

In [None]:
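# list the public methods and attributes of the Rolling object
[name for name in dir(r) if not name.startswith('_')]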

moving average:

In [None]:
r.mean()

the first non-NaN value (the mean of the first three observations):

In [None]:
omh.MSFT.loc['2014-12-01':'2014-12-03'].mean()

second:

In [None]:
omh.MSFT.loc['2014-12-02':'2014-12-04'].mean()
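The same check can be reproduced without the data file. In this sketch the Series `p` and its dates are hypothetical: a window of 3 leaves the first two positions as NaN, and each later value equals the mean of the current observation and the two before it.

```python
import numpy as np
import pandas as pd

# hypothetical daily series with values that are easy to average
idx = pd.date_range('2020-01-01', periods=5, freq='D')
p = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

rolled = p.rolling(3).mean()

# the first full window covers positions 0..2, the second covers 1..3
first_window = p.iloc[0:3].mean()
second_window = p.iloc[1:4].mean()

print(rolled)
print(first_window, second_window)
```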