
<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png">

Quantconnect

Introduction to Financial Python
</div>

# 04 NumPy and Basic Pandas
Edited by Diego Felipe Lopez

# Introduction

Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.

# NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:

In [2]:
import numpy as np

## Basic NumPy Arrays
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [19]:
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


In [20]:
price_list = [100, 200, 300, 400, 500, 600]
price_array = np.array(price_list)
print(price_array, type(price_array))

[100 200 300 400 500 600] <class 'numpy.ndarray'>


Notice that the type of array is "ndarray" which is a multi-dimensional array. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [21]:
Ar = np.array([[1,3],[2,4]])
print(Ar, type(Ar))

[[1 3]
 [2 4]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [22]:
print(Ar.shape)

(2, 2)


If we create a 2-dimensional array (i.e. a matrix), each row can be accessed by index:

In [23]:
print(Ar[0])
print(Ar[1])

[1 3]
[2 4]


If we want to access the matrix by column instead:

In [24]:
print('the first column: ', Ar[:,0])
print('the second column: ', Ar[:,1])

the first column:  [1 2]
the second column:  [3 4]
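
Rows and columns can also be combined in a single index expression. The following is a minimal sketch using the same matrix; the variable names `top_right` and `first_row_2d` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same 2x2 matrix as in the text.
Ar = np.array([[1, 3], [2, 4]])

# A single element is addressed as [row, column].
top_right = Ar[0, 1]

# Slicing the rows (0:1) instead of indexing them keeps the result 2-dimensional.
first_row_2d = Ar[0:1, :]

print(top_right)
print(first_row_2d.shape)
```

Indexing with an integer drops a dimension, while slicing keeps it, which is why `Ar[0]` is 1-dimensional and `Ar[0:1, :]` is not.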


## Array Functions
NumPy has built-in functions that allow us to perform calculations on arrays. For example, we can apply the natural logarithm to each element of an array:

In [25]:
print(np.log(price_array))

[4.60517019 5.29831737 5.70378247 5.99146455 6.2146081  6.39692966]
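
One common financial use of the element-wise logarithm is computing log returns, the difference between consecutive log prices. This is a sketch under the assumption that the array holds a price sequence in time order; `log_returns` is an illustrative name.

```python
import numpy as np

# Same values as price_array in the text.
price_array = np.array([100, 200, 300, 400, 500, 600])

# np.diff subtracts each element from the next one, so the result
# has one element fewer than the input.
log_returns = np.diff(np.log(price_array))

print(log_returns)
```

Each entry equals the logarithm of the ratio between a price and the one before it, so the first entry here is log(200/100).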


Other functions return a single value:

In [26]:
print(np.mean(price_array))
print(np.std(price_array))
print(np.sum(price_array))
print(np.max(price_array))

350.0
170.78251276599332
2100
600


The functions above return the mean, standard deviation, total and maximum value of an array.
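
Note that np.std() computes the population standard deviation by default (it divides by N). The sketch below shows how the `ddof` argument switches to the sample standard deviation (dividing by N - 1), which is the convention Pandas uses in describe(). The variable names are illustrative.

```python
import numpy as np

price_array = np.array([100, 200, 300, 400, 500, 600])

# Default: ddof=0, divide the sum of squared deviations by N.
population_std = np.std(price_array)

# ddof=1: divide by N - 1 instead (sample standard deviation).
sample_std = np.std(price_array, ddof=1)

print(population_std)
print(sample_std)
```

This explains why the value printed above (about 170.78) differs from the std of about 187.08 that describe() reports for the same numbers later in this chapter.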

# Pandas
Pandas is one of the most powerful tools for dealing with financial data. 

First we need to import Pandas:

In [27]:
import pandas as pd

## Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [36]:
price = [100, 200, 300, 400, 500, 600]
s = pd.Series(price)
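
The cell above builds a Series from a list. As a minimal sketch of the other two inputs mentioned in the text, a dictionary and a scalar, consider the following; the names `from_dict` and `from_scalar` and the labels are made up for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({'a': 100, 'b': 200, 'c': 300})

# From a scalar: an index must be given, and the value is repeated for each label.
from_scalar = pd.Series(5, index=['x', 'y', 'z'])

print(from_dict)
print(from_scalar)
```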


We can customize the indices of a new Series:

In [37]:
s = pd.Series(price,index = ['a','b','c','d','e','f'])
s

a    100
b    200
c    300
d    400
e    500
f    600
dtype: int64

Or we can change the indices of an existing Series:

In [38]:
s.index = [6,5,4,3,2,1]
s

6    100
5    200
4    300
3    400
2    500
1    600
dtype: int64

In [39]:
s.index = [44,55,33,22,11,0]
print(s)

44    100
55    200
33    300
22    400
11    500
0     600
dtype: int64


A Series is like a list in that it can be sliced by position:

In [40]:
print(s[1:])
print(s[:-2])

55    200
33    300
22    400
11    500
0     600
dtype: int64
44    100
55    200
33    300
22    400
dtype: int64


Series is also like a dictionary whose values can be set or fetched by index label:

In [43]:
print(s[44])
s[44] = 0
print(s)

100
44      0
55    200
33    300
22    400
11    500
0     600
dtype: int64
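
The dictionary analogy extends a little further. The sketch below, which uses a shortened Series with made-up labels, shows membership tests, get() with a missing label, and what happens when a value is assigned to a label that does not exist yet.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=[44, 55, 33])

# "in" tests the index labels, not the values.
has_label = 44 in s
has_value_as_label = 100 in s

# get() returns None (or a supplied default) instead of raising KeyError.
missing = s.get(12345)

# Assigning to a label that is not in the index appends a new entry.
s.loc[99] = 700

print(has_label, has_value_as_label, missing)
print(s)
```

Because assignment to an unknown label silently adds a row, a typo in a label grows the Series instead of raising an error.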


A Series can also have a name attribute, which is used as the column name when we build a Pandas DataFrame from several Series.

In [44]:
s = pd.Series(price, name = 'Instagram favs')
print(s)
print(s.name)

0    100
1    200
2    300
3    400
4    500
5    600
Name: Instagram favs, dtype: int64
Instagram favs


We can get the summary statistics of a Series:

In [45]:
print(s.describe())

count      6.000000
mean     350.000000
std      187.082869
min      100.000000
25%      225.000000
50%      350.000000
75%      475.000000
max      600.000000
Name: Instagram favs, dtype: float64
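
The result of describe() is itself a Series, so individual statistics can be read by label. This is a small sketch showing that the entries agree with the corresponding Series methods; the name `summary` is illustrative.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 500, 600], name='Instagram favs')

summary = s.describe()

# Each statistic is available by its label in the summary.
mean_value = summary['mean']
first_quartile = summary['25%']

# The same numbers can be computed directly.
print(mean_value, s.mean())
print(first_quartile, s.quantile(0.25))
print(summary['std'], s.std())
```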


## Time Index
Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [47]:
time_index = pd.date_range('2021-01-01',periods = len(s),freq = 'D')
print(time_index)
s.index = time_index
print(s)

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')
2021-01-01    100
2021-01-02    200
2021-01-03    300
2021-01-04    400
2021-01-05    500
2021-01-06    600
Freq: D, Name: Instagram favs, dtype: int64
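
The freq argument accepts other frequency aliases as well. As a sketch, 'B' generates business days (Monday to Friday), which is often closer to a trading calendar than 'D'. Note that 'B' only skips weekends; it does not know about public holidays.

```python
import pandas as pd

# 2021-01-01 is a Friday, so the weekend of January 2-3 is skipped.
business_days = pd.date_range('2021-01-01', periods=6, freq='B')

print(business_days)
```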


Series are usually accessed using the iloc[] and loc[] indexers. iloc[] selects elements by integer position, and loc[] selects elements by index label.

iloc[] is necessary when the index labels of a series are themselves integers. Take our previously defined series as an example:

In [62]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    100
5    200
4    300
3    400
2    500
1    600
Name: Instagram favs, dtype: int64
600


If we intended to take the second element of the series, s[1] would give the wrong result: because the index labels are integers, s[1] returns the element labeled 1 (600) rather than the element at position 1. To access an element by position, we use iloc[]:

In [63]:
print(s.iloc[1])

200


In [64]:
print(s.iloc[0])

100
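
The difference between label-based and position-based access can be seen side by side in the following sketch, which rebuilds the integer-indexed series so that it runs on its own. The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 500, 600], index=[6, 5, 4, 3, 2, 1])

# loc[] looks up the index label 1, which is the last row.
by_label = s.loc[1]

# iloc[] looks up position 1, which is the second row.
by_position = s.iloc[1]

print(by_label)
print(by_position)
```

Using loc[] or iloc[] explicitly, rather than plain square brackets, makes the intent clear regardless of the index type.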


While working with time series data, we often use time as the index. Pandas provides us with various ways to access the data by time index:

In [68]:
s.index = time_index
print(s['2021-01-03'])

300


We can even access a range of dates:

In [69]:
print(s['2021-01-02':'2021-01-05'])

2021-01-02    200
2021-01-03    300
2021-01-04    400
2021-01-05    500
Freq: D, Name: Instagram favs, dtype: int64
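
Unlike positional slices, date-string slices include both endpoints. The sketch below rebuilds the series so it is self-contained, and also shows partial string indexing, where a less specific string such as a month selects every matching date. The names `window` and `january` are illustrative.

```python
import pandas as pd

index = pd.date_range('2021-01-01', periods=6, freq='D')
s = pd.Series([100, 200, 300, 400, 500, 600], index=index)

# Both '2021-01-02' and '2021-01-05' are part of the result.
window = s.loc['2021-01-02':'2021-01-05']

# A year-month string matches every date in that month.
january = s.loc['2021-01']

print(window)
print(len(january))
```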


Series[] provides us a very flexible way to index data. We can add any condition in the square brackets:

In [53]:
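print(s[s < np.mean(s)])
print(s[(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))])

The first line selects the values below the mean (100, 200 and 300). The second selects the values that are above the mean but less than 1.64 standard deviations above it (400, 500 and 600).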


In [59]:
print(s[(s > np.mean(s)) | (s > np.std(s) + 1.64*np.std(s))])

Here the | operator keeps every value that satisfies at least one of the two conditions, which gives 400, 500 and 600.


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
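
The ~ operator is not used in the cells above, so here is a minimal sketch of it. The series is rebuilt so the example runs on its own, and the names `above_mean` and `not_above_mean` are illustrative.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 500, 600])

# A comparison on a Series yields a boolean Series (a mask).
above_mean = s > s.mean()

# ~ inverts the mask element by element.
not_above_mean = s[~above_mean]

print(not_above_mean)
```

When combining conditions, each comparison needs its own parentheses, because &, | and ~ bind more tightly than comparison operators such as < and >.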

# Summary
Here we have introduced NumPy and Pandas for scientific computing in Python. In the next chapter, we will dive deeper into Pandas to learn about resampling and manipulating DataFrames, which are commonly used in financial data analysis.