| [PyFinLab Index Page](ALWAYS-START-HERE.ipynb) >


<h2 align = "center"> Ulster University (Belfast Campus)</h2>

<br>

<a><img src="figures/new_campus.jpg" width="500" height="150" border="1" /></a>



<h2 align = "center">FIN303 Financial Modelling (Lab Book)</h2>
<br>

<center><a><img src="figures/cover.png" width="500" height="150" border="10" /></a></center>
<hr>
<br><br>
<h3 align = "center"> Dr Will Smyth </h3>
<h3 align = "center"> Lecturer in Financial Services</h3>
<h3 align = "center"> Ulster University Business School</h3>
<br>
<br>



<a id = "ref00"></a>

<div class="alert alert-block alert-info" style="margin-top: 10px">

<li><a href="#ref100">Welcome</a></li>
<li><a href="#ref600">Lab Session Topics</a></li>
<li><a href="#ref200">About this Python Jupyter Lab(book)</a></li>
<li><a href="#ref300">Why Python?</a></li>
<li><a href="#ref400">Python versus Excel</a></li>
<li><a href="#ref500">Packages for Data Analysis</a></li>
<li><a href="#ref700">Acknowledgements</a></li>

</div>

<a id="ref100"></a>

## Welcome 

<div align="right"><a href="#ref00">back to top</a></div>


For decades, stockbrokers traded on the floors of stock exchanges worldwide on behalf of their customers, supplying them with the up-to-date market information they needed to make decisions. Nowadays more people trade stocks through online brokers, and trading decisions are increasingly based on financial data and market information available from the internet, analysed using a data science language.

In these labs we combine Python and Statistics concepts and apply them to analysing financial data such as stock data. After understanding foundational concepts of statistical analysis, we implement these concepts by means of Python packages to perform financial data analysis (utilising the Jupyter Notebook environment). We are going to learn how to import, pre-process and save financial data, and how to manipulate existing data by generating new variables using multiple columns in Python data structures. This will culminate in building a simple stock trading model in the very first lab. 

Subsequently we are going to consider more advanced models for predicting stock returns using regression, and we are going to evaluate the performance of these models against statistical standards and financial standards such as the Sharpe ratio and maximum drawdown. Bespoke Jupyter Notebooks will provide an opportunity to practise the financial analysis techniques introduced (Python and Jupyter will need to be installed on your computer). Completion of the labs will draw on theoretical concepts from Financial Mathematics and Statistics (ACF101 (entry 2019)/FIN105 (entry 2020)) and the lectures for this module.

By the end of the lab sessions you should be able to perform the following tasks using Python: 

<br>

<div>
    
<li>import financial data into a pandas DataFrame, then pre-process and visualise it</li>
<li>manipulate existing financial data by generating new variables using multiple columns</li>
<li>apply concepts of probability and statistical inference within financial scenarios</li>
<li>build a trading model using multiple linear regression</li>
<li>evaluate the performance of trading models using a range of investment indicators</li>

</div>
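
Two of the investment indicators mentioned above, the Sharpe ratio and maximum drawdown, can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the labs: the price path is hypothetical, and the choices of 252 trading periods per year, a zero risk-free rate and the sample standard deviation are assumptions made for the example.

```python
import numpy as np


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualised Sharpe ratio of a series of periodic simple returns.

    Assumptions: risk_free_rate is an annual rate spread evenly over the
    periods, and the sample standard deviation (ddof=1) is used.
    """
    excess = np.asarray(returns, dtype=float) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)


def max_drawdown(prices):
    """Largest peak-to-trough fall, as a fraction of the running peak."""
    prices = np.asarray(prices, dtype=float)
    running_peak = np.maximum.accumulate(prices)
    drawdowns = (running_peak - prices) / running_peak
    return drawdowns.max()


# Hypothetical price path, for illustration only
prices = np.array([100.0, 110.0, 99.0, 120.0, 90.0, 108.0])
returns = prices[1:] / prices[:-1] - 1  # simple period-on-period returns

print(f"Maximum drawdown: {max_drawdown(prices):.2%}")
print(f"Sharpe ratio (annualised): {sharpe_ratio(returns):.2f}")
```

Here the largest fall is from the peak of 120 to the trough of 90, so the maximum drawdown is 25% of the peak.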

<a id="ref600"></a>

## Lab Session Topics

<div align="right"><a href="#ref00">back to top</a></div>

The lab sessions are broken into five parts:

<br>

<li>Visualising and Manipulating Stock Data (Part I)</li>
<li>Random Variables and Probability Distributions (Part II)</li>
<li>Sampling and Statistical Inference (Part III)</li>
<li>Linear Regression Models for Financial Analysis (Part IV)</li>
<li><b>Module Report:</b> Signal-based Trading on SPT[ETF] (Part V)</li>



<a id="ref200"></a>

## About this Python Jupyter Lab(book)

<div align="right"><a href="#ref00">back to top</a></div>

Each notebook in this lab book focuses on a particular concept that contributes in a fundamental way to the Python financial modelling toolbox. This material is about **learning to do** financial data analysis with Python and is therefore unapologetically an applied course where skills mastery is the objective. To that end practise, practise, practise will be key to success.  

We shall install the [Anaconda](https://www.anaconda.com/distribution/) distribution of Python and set up a single uniform working environment, [Jupyter Notebook](https://jupyter.org/), that will be used to deliver the labs. The Appendix presents the nuts and bolts of the Python language in detail for anyone wishing to acquire general knowledge of the language. It is not necessary to refer to the notebooks in the Appendix to complete the labs. 

<a id="ref300"></a>

## Why the Python programming language?

<div align="right"><a href="#ref00">back to top</a></div>

Python is now the number one programming language for data science. Due to its simplicity and readability, it is gaining increasing importance in the financial industry. Before we begin, we should consider how Python is used in finance and investment. Quantitative analysts and financial engineers in investment banks use Python to build all kinds of models to predict returns and evaluate risks. Engineers use Python to crawl financial news, to dig out users’ opinions and sentiments. It is widely considered that this modern source of data, from social media, can help quantitative analysts improve the performance of the models they deploy. 

Python is used not only in investment banks but also widely in retail banking. Many data scientists in retail banks use Python in credit-risk modelling, utilising customer behaviour analysis to lower the risk of lending. To predict customer behaviour, they use Python to build recommendation models that make more accurate recommendations and allocate new customers to differing categories, in a process called customer migration. 

Two aspects of Python in particular make it easy for beginners: simplicity and readability. Simplicity means that the grammar is easy to learn whilst readability means that the code is easy to understand. Python is an example of a high-level language because it uses more human-readable syntax and structure than some alternative data-science languages (Matlab, R, etc.). Another real advantage of Python is its general applicability. Whilst we focus here on using it for financial modelling, Python is applicable in essentially any vocational discipline (more so than Matlab or R for instance). This renders it highly portable and means that time spent in learning it is time well spent as the knowledge acquired and skills developed are highly transferable. It is also open-source (like R but unlike Matlab) and very well supported (probably the best supported of the open-source data-science languages). 


<a id="ref400"></a>
<a id="ref007"></a>

<h3>Python versus Excel</h3>

Ten reasons to choose Python:

<div class="alert alert-block alert-info" style="margin-top: 10px">

<div align="right"><a href="#ref00">back to top</a></div>

<li><a href="#ref0">Data importing and manipulation</a></li>
<li><a href="#ref1">Automation</a></li>
<li><a href="#ref2">Working with big(ish) data</a></li>
<li><a href="#ref3">Reproducibility</a></li>
<li><a href="#ref4">Debugging</a></li>
<li><a href="#ref5">Accessibility</a></li>
<li><a href="#ref6">Advanced statistics and machine learning</a></li>
<li><a href="#ref7">Advanced data visualisation</a></li>
<li><a href="#ref8">Cross-platform stability</a></li>
<li><a href="#ref9">Skills transferability</a></li>

</div>


<a id="ref0"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Data importing and manipulation</h4>

Python can read essentially any type of data. Formats that it can’t read natively can still be used; there are Python libraries and modules specifically designed to read XML, JSON, SPSS, Excel, SAS, and STATA data files, and you can also scrape data from websites and execute SQL queries.

In terms of data manipulation, tasks like subsetting, merging, and recoding data are much easier in Python. Anyone who’s spent a lot of time trying to merge and clean several large datasets in Excel can attest to the fact that it is often a difficult and time-consuming process. 
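
To make the three manipulation tasks named above concrete, the sketch below merges, recodes and subsets two small tables with pandas. The tickers, prices and sector codes are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
prices = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "close": [101.5, 48.2, 250.0],
})
sectors = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "sector_code": [1, 2, 1],
})

# Merging: join the two tables on their shared key column
merged = prices.merge(sectors, on="ticker", how="left")

# Recoding: map numeric codes to readable labels
merged["sector"] = merged["sector_code"].map({1: "Financials", 2: "Energy"})

# Subsetting: keep only the rows that satisfy a condition
expensive = merged[merged["close"] > 100]
print(expensive)
```

Each step is a single line, and the same lines work unchanged whether the tables hold three rows or three million.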

<a id="ref1"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Automation</h4>

The fact that Excel has a GUI (a user interface where you can click buttons rather than writing code) definitely makes it more approachable, but that can be a real hindrance when you’re trying to automate a process or run the same analysis multiple times. Using a programming language makes this much faster.

For instance, if you needed to run the same analysis on a new set of sales data each week, doing this in Excel would require opening a different file manually each week and re-entering formulas and other elements needed for the analysis, or spending valuable time creating macros or template sheets only to find they need frequent ad hoc tweaking. You could perform that same analysis automatically in Python, writing a simple script that imports the new data and runs the same analysis each week, outputting the results in whatever format you’d like. The script may also require tweaking but it will almost certainly be much easier and much quicker to do in Python than Excel.
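
A weekly script of the kind described above might look like the following sketch. The file names, column names and the `summarise_sales` helper are all hypothetical; the example writes two tiny files to a temporary folder so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def summarise_sales(csv_path):
    """Read one week's sales file and return total and mean sales per region."""
    sales = pd.read_csv(csv_path)
    return sales.groupby("region")["amount"].agg(["sum", "mean"])


# Create two small hypothetical weekly files so the example is self-contained
folder = Path(tempfile.mkdtemp())
(folder / "sales_week01.csv").write_text("region,amount\nNorth,100\nSouth,250\nNorth,50\n")
(folder / "sales_week02.csv").write_text("region,amount\nNorth,80\nSouth,300\n")

# The same analysis runs on every matching file with no manual steps
summaries = {
    path.stem: summarise_sales(path)
    for path in sorted(folder.glob("sales_week*.csv"))
}

for week, summary in summaries.items():
    print(week)
    print(summary)
```

When a new week's file arrives in the folder, re-running the script picks it up automatically.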


<a id="ref2"></a>
<div align="right"><a href="#ref007">back to top</a></div>

<h4>Working with big(ish) data</h4>

In Excel, projects are organised in sheets or tabs, and if you’ve ever dealt with Excel files that have many sheets or lots of data entries in each sheet, you know that it can get very slow very quickly. Working with enough data in Excel can sometimes even cause crashes. Python, however, can handle large amounts of data much more quickly, and because your analysis lives in a script or notebook rather than in the data file itself, a failure is far less likely to cost you your work.
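
One way to work with a file that is too large to load comfortably is to process it in pieces. The sketch below uses the `chunksize` argument of `pandas.read_csv` for this; the in-memory data is a hypothetical stand-in for a large file on disk, and chunking is only one of several possible approaches.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file; read_csv accepts a file path in the same way
csv_data = io.StringIO(
    "price\n" + "\n".join(str(100 + i % 7) for i in range(10_000))
)

total, count = 0.0, 0
# With chunksize set, read_csv returns an iterator of DataFrames
# rather than loading the whole table at once
for chunk in pd.read_csv(csv_data, chunksize=1_000):
    total += chunk["price"].sum()
    count += len(chunk)

mean_price = total / count
print(f"Mean of {count} prices: {mean_price:.4f}")
```

Only one chunk is held in memory at a time, so the same loop scales to files far larger than this example.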


<a id="ref3"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Reproducibility</h4>

Data analysis is most useful when you can explain what you’ve done to others, and others can easily reproduce your work to confirm it (or you can reproduce it yourself to double-check). But this is difficult in Excel; there’s no way to clearly document or illustrate the steps you took in the analysis, and re-doing it would entail re-opening the original Excel file and manually re-executing all of the steps you took (even assuming you can remember them).

Reproducing results is much easier in Python. Re-running an analysis is as simple as pressing "Shift-Enter", and it’s easy to add comments to your code that explain what’s happening at every step of the process, so that anyone can check or verify your work.

<a id="ref4"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Debugging</h4>

When you’ve made an error in Excel, working out what’s gone wrong can be difficult: you might have to scroll through thousands of cells of data to find the answer; or attempt to manually re-trace your steps. But when you make an error in a coding language like Python, you’ll typically get an error message (aka Traceback in Python) indicating what the Python interpreter thinks has gone wrong.

And of course, you should also have comments explaining each line of your code, which makes it easier to go back and re-check each step looking for mistakes. Typically, programmers also use a system for [version control](https://github.com/), so if you experience an error you haven’t seen before, you’ll be able to compare your current code with its previous iteration to get a sense of what’s gone wrong. This doesn’t mean that you’ll always be able to fix mistakes immediately, but mistakes in data analysis are inevitable and it’s easier to find and fix those mistakes in Python than in Excel.
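
To see what a Traceback looks like, the short sketch below triggers a deliberate error and captures the message the interpreter would print. The `average_return` function is hypothetical and exists only to cause the error.

```python
import traceback


def average_return(returns):
    """Mean of a list of returns."""
    return sum(returns) / len(returns)


try:
    average_return([])  # an empty list leads to a division by zero
except ZeroDivisionError:
    # format_exc() returns the same traceback text the interpreter would print
    report = traceback.format_exc()
    print(report)
```

The report names the type of error and the function and line where it occurred, which is usually enough to locate the problem.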

<a id="ref5"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Accessibility</h4>

Excel is great, but it’s owned by Microsoft, which means you’re ultimately at the whims of the Washington-based company in terms of bugs, updates, and feature support. Unlike Excel, Python is not a black box. You can examine Python code for any function or computation you perform. You can even modify and improve key functions by changing the code, though you will go far in data analysis before you need to modify any underlying already-optimised code in either environment. But it is the transparency in Python that appeals. 
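
As a small illustration of this transparency, the standard library's `inspect` module can display the source code of a function. The sketch below does so for `statistics.median`, chosen here simply because it is written in pure Python.

```python
import inspect
import statistics

# statistics.median is implemented in pure Python, so its source can be read directly
source = inspect.getsource(statistics.median)
print(source)
```

Functions implemented in C, as many performance-critical routines are, cannot be displayed this way, but their source remains publicly available in the relevant project repository.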

Python is also open source, which means that any developer (including you) can create packages to augment the language and add functionality or improve ease-of-use. Python has an extensive collection of popular and widely used libraries that were created by third-party developers to make data analysis and visualisation easier.

Excel does have some third-party influence, add-ins, admittedly, but because it’s proprietary software, they’re not as powerful and it’s not as easy for you to add the functionality you might want or need.

<a id="ref6"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Advanced statistics and machine learning</h4>

Python has more advanced statistical capabilities than Excel and facilitates the creation of machine learning models with the integration of bespoke and powerful packages and frameworks such as [statsmodels](https://www.statsmodels.org/stable/tsa.html), [scikit-learn](05.00-Machine-Learning.ipynb), the subject of Section E of this coursebook, and [Keras](https://keras.io/), an excellent high-level package for the analysis of time series data via recurrent neural networks (deep learning).  

From a mathematics point of view Python has the library [SymPy](https://www.sympy.org/en/index.html) (symbolic mathematics along the lines of the market leaders Maple and Mathematica; not as comprehensive or sophisticated as the latter packages but ideal for general data science). Not only does SymPy provide the facility to check and build analytic expressions, from an instructional point of view it also serves as an additional pedagogical aid. It constitutes a multiple-representation approach to learning to work with the core mathematical ideas in finance, complementing the traditional approach nicely. It can be a safety net during the reinforcement phase of learning until confidence with the algebra involved is obtained. 


There is an Excel add-in called [PyXLL](https://www.pyxll.com/) that allows Python code to be run from within Excel, but this is not an effective substitute for using Python directly and importing the Excel data into Python to work on. 


<a id="ref7"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Advanced data visualisation</h4>

Excel can of course create a variety of charts, but programming languages can do more, and Python in particular has more advanced graphics capabilities. The ability to create attractive and informative visualisations is particularly important in the business context, since the people who make key decisions in a company may not be familiar with statistical analysis or adept at reading complex charts. The easier you make it to understand your results, and the more compelling the story your visualisations tell, the more likely it is that your work will have a real impact.

<a id="ref8"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Cross-platform stability</h4>

Programming languages like Python run on all major platforms. You can generally be confident that your Python script is going to work across Windows, Mac, and Linux machines, but the same isn’t always true of Excel files.
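
File paths are one place where platforms do differ, and the standard library's `pathlib` module is a common way to keep a script portable. The folder and file names in this sketch are hypothetical.

```python
from pathlib import Path

# Hypothetical folder and file names, for illustration only
data_file = Path("data") / "prices" / "stock.csv"

print(data_file)             # separator follows the operating system's convention
print(data_file.as_posix())  # always uses forward slashes
print(data_file.suffix)      # the file extension
```

Building paths this way avoids hard-coding either `/` or `\` into the script.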


<a id="ref9"></a>
<div align="right"><a href="#ref007">back to top</a></div>
<h4>Skills transferability</h4>

The field of data science is fast becoming all-encompassing in the modern commercial environment as well as in academia. The range of projects, tasks, emerging industries and research fields that are finding use for some form of data science application is staggering. If you learned data science in Python and find that you are presented with a new and very different kind of project, or even consider a change in career direction, it is almost certain that your Python knowledge and skills will transfer with you. The same will not apply to Excel.


All of the advantages listed above also apply to the main programming alternative to Python, namely the [R](https://www.r-project.org/about.html) statistical computing language. However, unless you are a professional data scientist I do not believe there is any advantage in being proficient in both languages. If you weigh up the investment in time and effort required to learn both languages against the fact that when you come to interpret the results of your analysis, or present the results to others, the language used to perform the analysis has no bearing on the discussion, it clearly suggests that knowing one language well is the preference. 

I have selected Python for these labs because of advantages I see in using Python over R. This pertains primarily to the fact that Python is a full-blooded programming language, which means it is highly transferable and much more flexible than R. The learning curve for R is also significantly steeper, so as a first step into coding Python provides a faster route to making substantial progress. 



<a id="ref500"></a>

## Packages for Data Analysis

<div align="right"><a href="#ref00">back to top</a></div>


We shall make use of four Python libraries in the course of completing the labs:

[pandas:](https://pandas.pydata.org/)
provides fast, flexible and expressive data structures. It aims to provide the fundamental high-level building blocks for performing practical real-world data analysis. For instance, DataFrame and Series, from pandas, are excellent data structures to store tabular and time-series data. With DataFrame, we can easily pre-process data: handling missing values, computing pairwise correlations, etc. We will rely heavily on pandas and DataFrame in these labs. 

[NumPy:](https://numpy.org/)
is the fundamental package for numerical computing. Its core data structures, arrays and matrices, are the workhorses of data science. It is also a very convenient tool for generating random (or more correctly pseudorandom) numbers which can be helpful if we want to shuffle data, or generate a dataset with a normal distribution. 

[Matplotlib:](https://matplotlib.org/)
is a plotting package which produces highly customisable, high-quality graphics.

[Statsmodels:](https://www.statsmodels.org/stable/index.html)
is a powerful library for the statistician. It contains modules for regression and time-series analysis. In this course, we will use Statsmodels to obtain multiple linear regression models.
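
The pandas and NumPy features described above can be seen working together in the short sketch below. The two return series are hypothetical pseudorandom data, and filling the missing value with zero is just one of several possible treatments; Matplotlib and Statsmodels are not shown here.

```python
import numpy as np
import pandas as pd

# Seeding the generator makes the pseudorandom numbers repeatable
rng = np.random.default_rng(seed=42)

# Hypothetical daily returns for two stocks, drawn from a normal distribution
returns = pd.DataFrame(
    rng.normal(loc=0.0005, scale=0.01, size=(250, 2)),
    columns=["stock_a", "stock_b"],
)

# Introduce a missing value, then handle it by filling with zero
returns.loc[10, "stock_a"] = np.nan
cleaned = returns.fillna(0.0)

# Pairwise correlation between the columns
print(cleaned.corr())
```

Because the generator is seeded, re-running the cell produces the same numbers each time.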




<a id="ref700"></a>

## Acknowledgements

<div align="right"><a href="#ref00">back to top</a></div>

The deep-dive notebooks (Appendix A1-A8) are adapted from the excellent work of [Charles Severance](https://www.py4e.com/) and [Bradley Miller](https://runestone.academy/runestone/static/fopp/index.html).

The books by [McKinney](https://covers.oreillystatic.com/images/0636920050896/lrg.jpg) and [VanderPlas](https://covers.oreillystatic.com/images/0636920034919/lrg.jpg) come highly recommended for anyone wishing to seriously explore the scope that Python has to offer. 

Finally, we hope you enjoy this series of labs and that they provide a basis to further explore the use of Python in financial modelling.

<a><img src="figures/Signature.png" width="130" height="95" border="1" style="float:left"/></a>


January 2021

| [PyFinLab Index Page](ALWAYS-START-HERE.ipynb) >

<div align="right"><a href="#ref00">back to top</a></div>