#Introduction to Financial Python


##NumPy and Basic Pandas


###Introduction
Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.



###NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:

In [1]:
import numpy as np

####Basic NumPy Arrays
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [2]:
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


In [3]:
#My example
price_list_2 = [23.3, 673, 27, 276.62, 71.2, 54.8, 455.6, 3.01]
price_array_2 = np.array(price_list_2)
print(price_array_2, type(price_array_2))

[ 23.3  673.    27.   276.62  71.2   54.8  455.6    3.01] <class 'numpy.ndarray'>


Notice that the type of the array is "ndarray", NumPy's multi-dimensional array type. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [4]:
Ar = np.array([[1,3],[2,4]])
print(Ar, type(Ar))

[[1 3]
 [2 4]] <class 'numpy.ndarray'>


In [9]:
# My example
Ar_2 = np.array([[1,2,3,0],[4,5,6,0], [7,8,9,0]])
print(Ar_2, type(Ar_2))

[[1 2 3 0]
 [4 5 6 0]
 [7 8 9 0]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [10]:
print(Ar.shape)

(2, 2)


In [11]:
# My example
print(Ar_2.shape)

(3, 4)
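
As a small illustrative sketch (the names `mixed` and `grid` are ours, not part of the original notebook), the attributes `ndim`, `size` and `dtype` sit alongside `shape`. Mixing integers and floats in one list also shows what "all of the same type" means in practice: NumPy converts every element to a common type.

```python
import numpy as np

# Integers and floats in one list are converted to a single common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64

# Same 3x4 layout as Ar_2 above
grid = np.array([[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 9, 0]])
print(grid.shape)             # (rows, columns)
print(grid.ndim)              # number of dimensions
print(grid.size)              # total number of elements
```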


If we create a 2-dimensional array (i.e. a matrix), each row can be accessed by index:

In [12]:
print(Ar[0])
print(Ar[1])

[1 3]
[2 4]


In [14]:
# My example
print(Ar_2[0])
print(Ar_2[1])
print(Ar_2[2])

[1 2 3 0]
[4 5 6 0]
[7 8 9 0]


To access the matrix by column instead, we slice along the second axis:

In [15]:
print('the first column: ', Ar[:,0])
print('the second column: ', Ar[:,1])

the first column:  [1 2]
the second column:  [3 4]


In [17]:
# My example
print('the first column: ', Ar_2[:,0])
print('the second column: ', Ar_2[:,1])
print('the third column: ', Ar_2[:,2])
print('the fourth column: ', Ar_2[:,3])

the first column:  [1 4 7]
the second column:  [2 5 8]
the third column:  [3 6 9]
the fourth column:  [0 0 0]
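
Row and column access can be combined in a single pair of brackets. The sketch below is only an illustration (the names `matrix`, `element`, `sub_block` and `last_row` are ours); it reuses the same values as Ar_2.

```python
import numpy as np

# Same 3x4 matrix as Ar_2 above
matrix = np.array([[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 9, 0]])

element = matrix[1, 2]        # row 1, column 2 (both zero-based)
sub_block = matrix[:2, 1:3]   # first two rows, columns 1 and 2
last_row = matrix[-1]         # negative indices count from the end

print(element)
print(sub_block)
print(last_row)
```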


####Array Functions
NumPy has built-in functions that perform calculations on arrays element by element. For example, we can apply the natural logarithm to each element of an array:

In [18]:
print(np.log(price_array))

[4.96793654 4.98244156 4.9675886  4.96995218 4.96633504 4.96018375]


In [19]:
# My example
print(np.log(price_array_2))

[3.14845336 6.51174533 3.29583687 5.62264472 4.26549282 4.00369019
 6.12161523 1.10194008]


Or, for example, the square root:

In [20]:
print(np.sqrt(price_array))

[11.98874472 12.07600927 11.98665925 12.0008333  11.97914855 11.94236158]


In [21]:
# My example
print(np.sqrt(price_array_2))

[ 4.82700735 25.94224354  5.19615242 16.63189707  8.43800924  7.40270221
 21.34478859  1.73493516]
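
To make "each element" concrete, here is a hedged sketch comparing np.log with an explicit Python loop over the same Apple prices. The names `looped`, `vectorized` and `doubled` are ours, and the loop is only there for comparison.

```python
import math
import numpy as np

price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)

# One call on the whole array...
vectorized = np.log(price_array)

# ...gives the same values as looping over the list by hand
looped = [math.log(p) for p in price_list]
print(np.allclose(vectorized, looped))

# Arithmetic operators are applied element by element as well
doubled = price_array * 2
print(doubled)
```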


Other functions return a single value:

In [22]:
print(np.mean(price_array))
print(np.std(price_array))
print(np.sum(price_array))
print(np.max(price_array))

143.89666666666668
0.9673790478515796
863.38
145.83


In [23]:
# My example
print(np.mean(price_array_2))
print(np.std(price_array_2))
print(np.sum(price_array_2))
print(np.max(price_array_2))

198.06625
232.52163861335035
1584.53
673.0


The functions above return the mean, standard deviation, total and maximum value of an array.
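
One detail worth knowing: by default np.std() computes the population standard deviation (it divides by n). Passing ddof=1 gives the sample standard deviation (it divides by n - 1), which is also the default used by pandas. A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

price_array = np.array([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

population_std = np.std(price_array)        # ddof=0 by default: divide by n
sample_std = np.std(price_array, ddof=1)    # divide by n - 1

print(population_std)
print(sample_std)
```

This explains why the standard deviation reported by describe() on the same prices later in this chapter (1.059711) is slightly larger than the np.std() result above.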

###Pandas
Pandas is one of the most powerful tools for dealing with financial data. First we need to import Pandas:

In [24]:
import pandas as pd

####Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [25]:
price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price)
s

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
dtype: float64

In [26]:
#My example
price_2 = [23.3, 673, 27, 276.62, 71.2, 54.8, 455.6, 3.01]
s_2 = pd.Series(price_2)
s_2

0     23.30
1    673.00
2     27.00
3    276.62
4     71.20
5     54.80
6    455.60
7      3.01
dtype: float64
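
The examples above build a Series from a list. The dictionary and scalar forms mentioned earlier can be sketched as follows; the labels and values here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 1.5, 'b': 2.5})
print(from_dict)

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(5.0, index=['x', 'y', 'z'])
print(from_scalar)
```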

We can customize the indices of a new Series:

In [40]:
s = pd.Series(price,index = ['a','b','c','d','e','f'])
s

a    143.73
b    145.83
c    143.68
d    144.02
e    143.50
f    142.62
dtype: float64

In [30]:
# My example
s_2 = pd.Series(price_2,index = ['j','k','l','m','n','o','p','q'])
s_2

j     23.30
k    673.00
l     27.00
m    276.62
n     71.20
o     54.80
p    455.60
q      3.01
dtype: float64

Or we can change the indices of an existing Series:

In [41]:
s.index = [6,5,4,3,2,1]
s

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64

In [32]:
# My example
s_2.index = [100,200,300,400,500,600,700,800]
s_2

100     23.30
200    673.00
300     27.00
400    276.62
500     71.20
600     54.80
700    455.60
800      3.01
dtype: float64

A Series is like a list in that it can be sliced by position:

In [34]:
print(s[1:])
print(s[:-2])

5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
6    143.73
5    145.83
4    143.68
3    144.02
dtype: float64


In [35]:
# My example
#The series from the fifth element (position 4) onward
print(s_2[4:])
#The series without its last element
print(s_2[:-1])

500     71.20
600     54.80
700    455.60
800      3.01
dtype: float64
100     23.30
200    673.00
300     27.00
400    276.62
500     71.20
600     54.80
700    455.60
dtype: float64


Series is also like a dictionary whose values can be set or fetched by index label:

In [42]:
print(s[4])
s[4] = 0
print(s)

143.68
6    143.73
5    145.83
4      0.00
3    144.02
2    143.50
1    142.62
dtype: float64


In [39]:
#My example
print(s_2[200])
s_2[200] = 56
print(s_2)

673.0
100     23.30
200     56.00
300     27.00
400    276.62
500     71.20
600     54.80
700    455.60
800      3.01
dtype: float64
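
The dictionary analogy extends a little further. As a hedged sketch (the Series `demo` and its labels are our own), membership tests with `in` look at the index labels rather than the values, and get() returns a default instead of raising an error when a label is missing:

```python
import pandas as pd

demo = pd.Series([143.73, 145.83, 143.68], index=['a', 'b', 'c'])

has_label = 'b' in demo          # checks the index labels
has_value = 145.83 in demo       # values are not searched, so this is False
missing = demo.get('z', 0.0)     # default returned for an unknown label

print(has_label, has_value, missing)
```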


A Series can also have a name attribute, which is used as the column name when we build a Pandas DataFrame from several Series.

In [43]:
s = pd.Series(price, name = 'Apple Price List')
print(s)
print(s.name)

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
Name: Apple Price List, dtype: float64
Apple Price List


In [44]:
# My example
s_2 = pd.Series(price_2, name = 'Cars price list')
print(s_2)
print(s_2.name)

0     23.30
1    673.00
2     27.00
3    276.62
4     71.20
5     54.80
6    455.60
7      3.01
Name: Cars price list, dtype: float64
Cars price list


We can get the statistical summaries of a Series:

In [45]:
print(s.describe())

count      6.000000
mean     143.896667
std        1.059711
min      142.620000
25%      143.545000
50%      143.705000
75%      143.947500
max      145.830000
Name: Apple Price List, dtype: float64


In [46]:
print(s_2.describe())

count      8.000000
mean     198.066250
std      248.576088
min        3.010000
25%       26.075000
50%       63.000000
75%      321.365000
max      673.000000
Name: Cars price list, dtype: float64
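
The result of describe() is itself a Series, so individual statistics can be read by label. A minimal sketch using the Apple prices (the name `summary` is ours):

```python
import pandas as pd

price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price, name='Apple Price List')

summary = s.describe()
print(type(summary))

# Each statistic is stored under its label
print(summary['mean'])
print(summary['50%'], s.median())   # the 50% quantile is the median
print(summary['std'], s.std())      # same sample standard deviation
```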


####Time Index
Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [47]:
time_index = pd.date_range('2017-01-01',periods = len(s),freq = 'D')
print(time_index)
s.index = time_index
print(s)

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
2017-01-01    143.73
2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
2017-01-06    142.62
Freq: D, Name: Apple Price List, dtype: float64


In [52]:
# My example 
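# For this one, the frequency is going to be monthly (month-end dates)
# Note: pandas 2.2 and later deprecate 'M' in favor of 'ME' for month-end frequency
time_index_2 = pd.date_range('1810-01-01',periods = len(s_2),freq = 'M')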
print(time_index_2)
s_2.index = time_index_2
print(s_2)

DatetimeIndex(['1810-01-31', '1810-02-28', '1810-03-31', '1810-04-30',
               '1810-05-31', '1810-06-30', '1810-07-31', '1810-08-31'],
              dtype='datetime64[ns]', freq='M')
1810-01-31     23.30
1810-02-28    673.00
1810-03-31     27.00
1810-04-30    276.62
1810-05-31     71.20
1810-06-30     54.80
1810-07-31    455.60
1810-08-31      3.01
Freq: M, Name: Cars price list, dtype: float64
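
Financial data usually skips weekends, so a business-day frequency is often more natural than 'D'. The sketch below is only an illustration, with a start date and variable names of our own choosing; it also shows that a range can be given by its start and end dates instead of a number of periods.

```python
import pandas as pd

# 'B' generates business days (Monday to Friday); 2017-01-02 is a Monday
business_days = pd.date_range('2017-01-02', periods=6, freq='B')
print(business_days)

# A range can also be defined by its start and end dates (daily by default)
first_week = pd.date_range('2017-01-01', '2017-01-06')
print(len(first_week))
```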


Series elements are usually accessed using the iloc[] and loc[] indexers. iloc[] selects elements by integer position, and loc[] selects elements by index label.

iloc[] is necessary when the index labels of a series are themselves integers. Take our previously defined series as an example:

In [54]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
Name: Apple Price List, dtype: float64
142.62


In [57]:
s_2.index = [100,200,300,400,500,600,700,800]
print(s_2)
#Prints the element whose index is 600
print(s_2[600])

100     23.30
200    673.00
300     27.00
400    276.62
500     71.20
600     54.80
700    455.60
800      3.01
Name: Cars price list, dtype: float64
54.8


If we intended to take the second element of the series, we would make a mistake here: because the index labels are integers, s[1] returns the element whose label is 1, not the element at position 1. To access the element we want, we use iloc[]:

In [58]:
print(s.iloc[1])

145.83


In [59]:
# My example
#Prints the fourth element of the series regardless of the index
print(s_2.iloc[3])

276.62
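
The difference between the two indexers can be summarized in one short sketch. The variable names are ours; the data is the Apple price series with the integer index used above. Note that a loc[] slice includes both end labels, while an iloc[] slice excludes the end position, like a Python list.

```python
import pandas as pd

price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price, index=[6, 5, 4, 3, 2, 1])

by_label = s.loc[1]            # element whose label is 1
by_position = s.iloc[1]        # element at position 1 (the second one)

label_slice = s.loc[5:3]       # labels 5, 4 and 3: both ends included
position_slice = s.iloc[1:3]   # positions 1 and 2: end excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```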


While working with time series data, we often use time as the index. Pandas provides us with various methods to access the data by time index.

In [61]:
s.index = time_index
print(s['2017-01-03'])

143.68


In [64]:
# My example
s_2.index = time_index_2
print(s_2['1810-02-28'])

673.0


We can even access a range of dates:



In [65]:
print(s['2017-01-02':'2017-01-05'])

2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
Freq: D, Name: Apple Price List, dtype: float64


In [67]:
# My example
print(s_2['1810-04-03':'1810-07-11'])

1810-04-30    276.62
1810-05-31     71.20
1810-06-30     54.80
Freq: M, Name: Cars price list, dtype: float64
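
As the last example shows, the endpoints of a date slice do not need to be present in the index; every row that falls within the range is returned. Date strings can also be partial. The sketch below is an illustration on hypothetical values (the names `dates`, `demo`, `february` and `window` are ours):

```python
import pandas as pd

# Hypothetical values on four consecutive days spanning a month boundary
dates = pd.date_range('2017-01-30', periods=4, freq='D')
demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# A partial date string selects every row in that month
february = demo.loc['2017-02']

# A slice between two dates includes both endpoints
window = demo.loc['2017-01-31':'2017-02-01']

print(february)
print(window)
```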


Indexing a Series with square brackets is very flexible. We can place a boolean condition inside the brackets to select only the elements that satisfy it:

In [68]:
print(s[s < np.mean(s)])
print(s[(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))])

2017-01-01    143.73
2017-01-03    143.68
2017-01-05    143.50
2017-01-06    142.62
Name: Apple Price List, dtype: float64
2017-01-04    144.02
Name: Apple Price List, dtype: float64


In [69]:
#My example
#Prints the values of s_2 that are less than half the maximum value of s_2
print(s_2[s_2 < np.max(s_2)/2])


1810-01-31     23.30
1810-03-31     27.00
1810-04-30    276.62
1810-05-31     71.20
1810-06-30     54.80
1810-08-31      3.01
Name: Cars price list, dtype: float64


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
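
A short sketch of the | and ~ operators on the Apple prices follows. The thresholds and variable names are our own choices for illustration. Each condition is wrapped in parentheses because these operators bind more tightly than comparisons such as < and >.

```python
import pandas as pd

s = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# | (or): keep prices below 143.6 or above 145
low_or_high = s[(s < 143.6) | (s > 145)]
print(low_or_high)

# ~ (not): keep prices that are NOT below the mean
not_below_mean = s[~(s < s.mean())]
print(not_below_mean)
```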



###Summary
Here we have introduced NumPy and Pandas for scientific computing in Python. In the next chapter, we will dive deeper into Pandas to learn how to resample data and manipulate DataFrames, both of which are common tasks in financial data analysis.