### Libraries / Data

import numpy and pandas libraries

In [None]:
import numpy as np
import pandas as pd

import datetime library to work with dates

In [None]:
from datetime import datetime

Specify some pandas library options that configure output

In [None]:
pd.options.display.max_rows = 10

- read data
- use Symbol column as index 
- read only columns ['Symbol', 'Sector', 'Price', 'Book Value']

| Column Name | Description |
| ------------- |:-------------:|
|Symbol|Abbreviated name of organization|
|Name|Full name of organization|
|Sector|Economic sector|
|Price|Share price|
|Dividend Yield|Dividend yield|
|Price/Earnings|Price-to-earnings ratio|
|Earnings/Share|Earnings per share|
|Book Value|Company book value|
|52 week low|52-week low|
|52 week high|52-week high|
|Market Cap|Market capitalization|
|EBITDA|**E**arnings **b**efore **i**nterest, **t**axes, **d**epreciation and **a**mortization|
|Price/Sales|Price-to-sales ratio|
|Price/Book|Price-to-book ratio|
|SEC Filings|Link to filings on *sec.gov*|

In [None]:
sp500 = pd.read_csv("../data/sp500.csv",
                    index_col='Symbol',
                    usecols=['Symbol', 'Sector', 'Price', 'Book Value'])
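
The effect of `index_col` and `usecols` can be checked without the real file. The sketch below reads a made-up two-row CSV from memory; the symbols `AAA` and `BBB` and all values are placeholders for illustration only, not records from `sp500.csv`.

```python
import io
import pandas as pd

# Made-up sample in the same layout; these are NOT real sp500.csv records.
csv_text = (
    "Symbol,Name,Sector,Price,Book Value\n"
    "AAA,Alpha Corp,Industrials,10.0,2.5\n"
    "BBB,Beta Inc,Financials,20.0,4.0\n"
)

sample = pd.read_csv(io.StringIO(csv_text),
                     index_col='Symbol',
                     usecols=['Symbol', 'Sector', 'Price', 'Book Value'])

# 'Name' was not listed in usecols, so it is skipped;
# 'Symbol' becomes the index rather than a regular column.
print(sample.columns.tolist())
print(sample.index.name)
```

Note that the index column has to be listed in `usecols` as well, otherwise pandas cannot find it.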

create a dataframe with 5 rows and 3 columns

In [None]:
df = pd.DataFrame(np.arange(0, 15).reshape(5, 3), 
                  index=['a', 'b', 'c', 'd', 'e'], 
                  columns=['c1', 'c2', 'c3'])
df

- add columns and rows containing NaN values to the dataframe
- column 'c4' consisting of NaN values
- row 'f' with values from 15 to 18
- row 'g' consisting of NaN values
- column 'c5' consisting of NaN values
- change the value in column 'c4' of row 'a' to 20

In [None]:
df['c4'] = np.nan
df.loc['f'] = np.arange(15, 19) 
df.loc['g'] = np.nan
df['c5'] = np.nan
df.loc['a', 'c4'] = 20
df
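
Assigning a single cell is most reliably done with one `.loc[row, column]` call, because a chained form like `frame[col][row] = value` can end up writing to a temporary copy, depending on the pandas version and settings. A minimal sketch of the same construction on a separate frame (the name `demo` is introduced here only for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(0, 15).reshape(5, 3),
                    index=['a', 'b', 'c', 'd', 'e'],
                    columns=['c1', 'c2', 'c3'])
demo['c4'] = np.nan                  # new all-NaN column
demo.loc['f'] = np.arange(15, 19)    # new row: one value per existing column (4)
demo.loc['g'] = np.nan               # new all-NaN row
demo['c5'] = np.nan                  # another all-NaN column
demo.loc['a', 'c4'] = 20             # single-cell assignment in one .loc call

nan_per_column = demo.isnull().sum()
print(demo.shape)
print(nan_per_column.tolist())
```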

### Missing values

#### Search

Which elements are NaN values?

In [None]:
df.isnull()

Which elements are not NaN? (the same result can be obtained with ~df.isnull())

In [None]:
df.notnull()

count the number of NaN values in each column

In [None]:
df.isnull().sum(axis=0)

calculate the number of values other than NaN for each column (we can use len(df) - df.isnull().sum())


In [None]:
df.count(axis=0)
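
The two counts are complementary: for every column, the number of NaN values plus the number of non-NaN values equals the number of rows. A small sketch on a hypothetical frame `demo` (not the notebook's `df`):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                     'y': [np.nan, np.nan, 6.0]})

missing = demo.isnull().sum()   # NaN values per column
present = demo.count()          # non-NaN values per column

# Per column, missing + present adds up to the row count.
totals = missing + present
print(missing.tolist(), present.tolist(), totals.tolist())
```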

#### Deleting

In [None]:
df

select the non-NaN values in column c4

In [None]:
df.c4[df.c4.notnull()]

.dropna() gives the same result: it extracts all values of column c4 except the NaN values

In [None]:
df.c4.dropna()

.dropna() returns a copy with the NaN values removed; the original dataframe/column is left unchanged

In [None]:
df.c4

applied to a dataframe, .dropna() removes every row that has at least one NaN value; in this case all rows will be deleted

In [None]:
df.dropna()

using the how='all' parameter, delete only those rows where all values are NaN

In [None]:
df.dropna(how = 'all')

In [None]:
df

change the axis to remove columns instead of rows: with how='all', only columns consisting entirely of NaN values are dropped

In [None]:
df.dropna(how='all', axis=1) # drops c5

- create a copy of dataframe df
- replace two NaN cells with 0

In [None]:
df2 = df.copy()
df2.loc['g', 'c1'] = 0
df2.loc['g', 'c3'] = 0
df2

and now delete columns that have at least one NaN value

In [None]:
df2.dropna(how='any', axis=1) 

keep only the columns that have at least 2 non-NaN values (thresh counts non-NaN values, not NaN values); the all-NaN column c5 is dropped

In [None]:
df2

In [None]:
df2.dropna(thresh=2, axis=1)
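
The three column-wise variants differ only in how strict they are. The sketch below compares them on a hypothetical frame `demo` whose column names describe how many non-NaN values each column holds:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'full': [1.0, 2.0, 3.0, 4.0],
                     'two':  [1.0, np.nan, np.nan, 4.0],
                     'one':  [np.nan, np.nan, np.nan, 4.0],
                     'none': [np.nan] * 4})

drop_any = demo.dropna(how='any', axis=1)   # drop a column if it has any NaN
drop_all = demo.dropna(how='all', axis=1)   # drop a column only if it is all NaN
keep_two = demo.dropna(thresh=2, axis=1)    # keep columns with >= 2 non-NaN values

print(drop_any.columns.tolist())
print(drop_all.columns.tolist())
print(keep_two.columns.tolist())
```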

#### Filling

##### constant

In [None]:
df

return a new dataframe where NaN values are filled with constant - zeros

In [None]:
filled = df.fillna(0)
filled

NaN values are not taken into account when calculating mean values

In [None]:
df.mean()

after replacing the NaN values with 0, we get different averages

In [None]:
filled.mean()
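
The difference comes from the denominator: `mean()` skips NaN values, whereas a zero counts as a regular observation. A minimal sketch on a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])

mean_skipping_nan = s.mean()           # (2 + 4) / 2
mean_after_fill = s.fillna(0).mean()   # (2 + 0 + 4) / 3

print(mean_skipping_nan, mean_after_fill)
```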

##### forward and backward

forward fill: each NaN takes the last valid value that precedes it

In [None]:
df.c4

In [None]:
df.c4.ffill()

or backward fill: each NaN takes the next valid value that follows it

In [None]:
df.c4.bfill()
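
Neither direction is guaranteed to remove every NaN: a forward fill leaves leading NaN values untouched, and a backward fill leaves trailing ones. A sketch on a hypothetical Series with gaps at both ends:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # the leading NaN has no earlier value to copy
backward = s.bfill()   # the trailing NaN has no later value to copy

print(forward.tolist())
print(backward.tolist())
```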

##### using indexes

create a Series of fill values whose index labels say which rows they apply to:

In [None]:
fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])
fill_values

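fill the NaN values of c4 from fill_values by matching index labels; labels that already hold a value are left unchanged, and NaN values without a matching label stay NaN: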

In [None]:
df.c4

In [None]:
df.c4.fillna(fill_values)
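
The alignment rule can be verified on a shorter, hypothetical column (the values of `c4` below are chosen for illustration and only loosely mirror the notebook's column):

```python
import numpy as np
import pandas as pd

c4 = pd.Series([20, np.nan, np.nan, 18, np.nan],
               index=['a', 'b', 'e', 'f', 'g'])
fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])

filled = c4.fillna(fill_values)

# 'a' already had a value, so 100 is ignored;
# 'b' has no entry in fill_values, so it stays NaN.
print(filled.to_dict())
```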

fill the NaN values in each column with the average value of that column

In [None]:
df.mean()

In [None]:
df

In [None]:
df.fillna(df.mean())
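
Passing `df.mean()` works because `fillna` matches the Series of means to the columns by name. A column that is entirely NaN has a NaN mean, so it stays unfilled. A small sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                     'y': [np.nan, np.nan, np.nan]})

column_means = demo.mean()          # x -> 2.0, y -> NaN
filled = demo.fillna(column_means)  # each column filled with its own mean

print(filled)
```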

#### Interpolation of missing values

perform linear interpolation (method='linear' is the default) of the NaN values between 1 and 2

In [None]:
s = pd.Series([1, np.nan, np.nan, np.nan, 2])
s

In [None]:
s.interpolate()

create a time series in which one date (2014-03-01) is missing

In [None]:
ts = pd.Series([1, np.nan, 2], 
            index=[datetime(2014, 1, 1), 
                   datetime(2014, 2, 1),                   
                   datetime(2014, 4, 1)])
ts

perform linear interpolation based on position only: the elements are treated as equally spaced, so the dates are ignored

In [None]:
ts.interpolate()

method="time" weights the result by the actual time intervals, so it takes into account that we have no record for 2014-03-01

In [None]:
ts.interpolate(method="time")
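
The time-weighted value can be reproduced by hand: 2014-02-01 lies 31 days into the 90-day span between the two known points. A sketch that rebuilds the same series and compares the manual calculation with both interpolation modes:

```python
from datetime import datetime
import numpy as np
import pandas as pd

ts = pd.Series([1, np.nan, 2],
               index=[datetime(2014, 1, 1),
                      datetime(2014, 2, 1),
                      datetime(2014, 4, 1)])

# Share of the elapsed time between the two known points.
elapsed = (ts.index[1] - ts.index[0]).days   # Jan 1 -> Feb 1
total = (ts.index[2] - ts.index[0]).days     # Jan 1 -> Apr 1
by_hand = 1 + (2 - 1) * elapsed / total

by_position = ts.interpolate().iloc[1]            # ignores the dates
by_time = ts.interpolate(method="time").iloc[1]   # uses the dates

print(elapsed, total, by_hand, by_position, by_time)
```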

create a Series object to demonstrate interpolation based on index labels

In [None]:
s = pd.Series([0, np.nan, 100], index=[0, 2, 10])
s

interpolate linearly

In [None]:
s.interpolate()

interpolate based on index values

In [None]:
s.interpolate(method="index")
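
With method="index" the missing value is placed proportionally to its label: label 2 is one fifth of the way from 0 to 10, so the result is one fifth of the way from 0 to 100. A sketch comparing both modes with the manual calculation:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 100], index=[0, 2, 10])

by_position = s.interpolate().loc[2]                # midpoint of 0 and 100
by_index = s.interpolate(method="index").loc[2]     # weighted by index labels
by_hand = 0 + (100 - 0) * (2 - 0) / (10 - 0)

print(by_position, by_index, by_hand)
```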

### Duplicate values

create dataframe with duplicate rows

In [None]:
data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4, 
                     'b': [1, 1, 2, 3, 3, 4, 4]})
data

determine which rows are duplicate, that is, which rows have been previously encountered in the dataframe

In [None]:
data.duplicated()

delete duplicate rows, each time leaving the first of the duplicate observations

In [None]:
data.drop_duplicates()

delete duplicate rows, each time leaving the last of the duplicate observations

In [None]:
data.drop_duplicates(keep='last')

add a column 'c' with unique values, so that no complete row is a duplicate any more:

In [None]:
data

In [None]:
data['c'] = range(7)
data.duplicated()

In [None]:
data

but if we specify that duplicates should be identified only by the values in columns a and b, the results will look like this

In [None]:
data.drop_duplicates(['a', 'b'])
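
The `subset` and `keep` options can be combined. The sketch below rebuilds the same frame and records which row labels survive in each case:

```python
import pandas as pd

data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
                     'b': [1, 1, 2, 3, 3, 4, 4],
                     'c': range(7)})

full_row_dups = data.duplicated()                  # column c makes every row unique
subset_dups = data.duplicated(subset=['a', 'b'])   # compare columns a and b only

kept_first = data.drop_duplicates(['a', 'b'])               # keep='first' is the default
kept_last = data.drop_duplicates(['a', 'b'], keep='last')

print(subset_dups.tolist())
print(kept_first.index.tolist(), kept_last.index.tolist())
```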

### Replacing values

#### method .map() 

create two Series objects to illustrate the value matching process

In [None]:
x = pd.Series({"one": 1, "two": 2, "three": 3})
y = pd.Series({1: "a", 2: "b", 3: "c"})
x

In [None]:
y

map the values of x onto y: each value of x is looked up among the index labels of y and replaced by the corresponding value of y

In [None]:
x.map(y)

if a value of x has no matching index label in y, the result for that element is NaN

In [None]:
x = pd.Series({"one": 1, "two": 2, "three": 3})
y = pd.Series({1: "a", 2: "b"})

In [None]:
x

In [None]:
y

In [None]:
x.map(y)
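
`.map()` also accepts a plain dictionary, and the NaN produced by an unmatched value can then be replaced with a default. A sketch, assuming a hypothetical default label "unknown":

```python
import pandas as pd

x = pd.Series({"one": 1, "two": 2, "three": 3})
lookup = {1: "a", 2: "b"}    # 3 is deliberately absent

mapped = x.map(lookup)                   # 'three' has no match -> NaN
with_default = mapped.fillna("unknown")  # "unknown" is an illustrative default

print(mapped.tolist())
print(with_default.tolist())
```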

#### method .replace()

create a Series to illustrate the .replace() method

In [None]:
s = pd.Series([0., 1., 2., 3., 2., 4.])
s

replace the value 2 with 5

In [None]:
s.replace(2, 5)

replace all elements with new values

In [None]:
s.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

In [None]:
s

replace elements using dictionary as argument

In [None]:
s.replace({0: 10, 2: 100})

create a dataframe with two columns

In [None]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
df

specify, for each column, which value to replace and what to replace it with

In [None]:
df.replace({'a': 1, 'b': 8}, {'a': 777, 'b': 888})
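
Each replacement applies only within its own column. The sketch below uses a hypothetical frame in which both columns contain both target values, and also shows the equivalent nested-dictionary form:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 8], 'b': [1, 8]})

# Two dictionaries: values to look for, and their replacements, per column.
two_dicts = frame.replace({'a': 1, 'b': 8}, {'a': 777, 'b': 888})

# Nested dictionary: {column: {old_value: new_value}}.
nested = frame.replace({'a': {1: 777}, 'b': {8: 888}})

# The 8 in column 'a' and the 1 in column 'b' are left alone.
print(two_dicts)
```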

### Applying functions

#### to rows / columns

illustrate the application of a function to each element of a Series object

In [None]:
s = pd.Series(np.arange(0, 5))
s

In [None]:
s.apply(lambda v: v * 3)

create a dataframe to illustrate the application of the summation operation to each column

In [None]:
df = pd.DataFrame(np.arange(12).reshape(4, 3), 
                  columns=['a', 'b', 'c'])
df

calculate the sum of elements in each column

In [None]:
df.apply(lambda col: col.sum())

calculate the sum of elements in each row

In [None]:
df.apply(lambda row: row.sum(), axis=1)

create column 'interim' by multiplying columns a and b

In [None]:
df

In [None]:
df['interim'] = df.apply(lambda r: r.a * r.b, axis=1)
df

and now we get the 'result' column by adding the 'interim' and 'c' columns

In [None]:
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
df
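
For simple arithmetic like this, column-wise (vectorized) expressions give the same values as a row-wise `.apply()` and are usually faster on large frames. A sketch on a separate copy of the same data (the name `demo` is introduced for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    columns=['a', 'b', 'c'])

# Row-wise apply: the lambda is called once per row.
via_apply = demo.apply(lambda r: r.a * r.b + r.c, axis=1)

# Vectorized: whole columns are combined in one step.
vectorized = demo['a'] * demo['b'] + demo['c']

print(vectorized.tolist())
```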

#### to values

use the .applymap() method to apply a function to every element of the DataFrame object (in pandas 2.1 and later the same method is available as DataFrame.map())

In [None]:
df

In [None]:
df.applymap(lambda x: np.exp(x)/10)
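
When the function is a NumPy ufunc, the element-wise call can be replaced by applying the ufunc to the whole frame. The sketch below picks whichever element-wise method the installed pandas provides and checks both routes against each other on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = demo.map if hasattr(demo, 'map') else demo.applymap
per_element = elementwise(lambda x: np.exp(x) / 10)

# Same computation applied to the whole frame at once.
vectorized = np.exp(demo) / 10

print(np.allclose(per_element, vectorized))
```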