**Fin 585**  
**Diether**  
**Python/Pandas Introduction**<br><br>

**Instructions**

+ Please read through my notes, and run each of the code cells.

+ You can run a cell of code by pressing SHIFT and ENTER at the same time.


**1 Python in Empirical Finance**

**1.1 Role of Python in this Course**

+ Goal: Develop the Python skills and tools necessary to engage in empirical research in Finance.

+ More specifically, use Python/Pandas to $\rightarrow$

  - Test economic models.
  
  - Construct portfolios (container for financial assets).
  
  - Create and backtest trading strategies.
  
  - Estimate regressions: time series and panel regressions.
  
+ I will focus on the most important features and programming constructs in Python to accomplish these goals.

+ Our main library will be `pandas`.<br><br>


**1.2 Example: Portfolio Construction and Trading Strategies**

+ A core quant finance and academic skill is portfolio construction and backtesting.

+ All trading strategies are implemented as portfolios (container for financial assets).

+ Portfolio construction and backtesting can be broken into five general steps:

  1. Data preparation.

  2. Creation of the portfolio formation variable.

  3. Binning the stock return data based on the formation variable.

  4. Portfolio creation.

  5. Estimating and benchmarking historical performance of the strategy.
  
+ Need to learn enough Python/Pandas so you can tackle each step for portfolio strategies you're interested in testing.<br><br>
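+ A minimal sketch of the five steps on made-up data. The tickers, prices, returns, and column names (`mktcap`, `size_bin`) are all hypothetical, and market capitalization is just one possible formation variable. The toy ignores timing: a real backtest would form the bins using only information available before the return is earned.

```python
import pandas as pd

# Step 1: data preparation (made-up monthly data for four hypothetical stocks).
stocks = pd.DataFrame({
    'date':   ['2020-01'] * 4 + ['2020-02'] * 4,
    'tick':   ['A', 'B', 'C', 'D'] * 2,
    'price':  [10.0, 20.0, 5.0, 40.0, 11.0, 19.0, 6.0, 42.0],
    'shares': [100, 60, 400, 100, 100, 60, 400, 100],
    'ret':    [0.02, -0.01, 0.04, 0.01, 0.03, 0.00, 0.05, -0.01],
})

# Step 2: portfolio formation variable (here: market capitalization).
stocks['mktcap'] = stocks['price'] * stocks['shares']

# Step 3: bin the stocks each month: above the monthly median is 'big', otherwise 'small'.
median_cap = stocks.groupby('date')['mktcap'].transform('median')
stocks['size_bin'] = 'small'
stocks.loc[stocks['mktcap'] > median_cap, 'size_bin'] = 'big'

# Step 4: portfolio creation (equal-weighted return of each bin in each month).
port = stocks.groupby(['date', 'size_bin'])['ret'].mean().unstack()

# Step 5: historical performance (average monthly return and the small-minus-big spread).
port['small_minus_big'] = port['small'] - port['big']
avg_ret = port.mean()
print(avg_ret)
```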


**1.3 Why Python?**

+ Why Python? Why not R, SAS, Stata, or something else?

+ All of those languages are used in empirical finance research.

+ Python has some important advantages:

  - Python has emerged as the most important of these languages in **Finance**.

  - Well designed and popular general purpose programming language.<br><br>  


**1.4 What about Polars Instead of Pandas?**

+ `Polars` is really useful and worth learning.

+ Particularly in Finance.

+ Its multi-threading is better than that of `Pandas`.

+ I still think it's beneficial to start with `Pandas`.

+ It's still the most important data analysis library in Python.

+ Learning to use `Pandas` efficiently will help you tackle programming tasks in Finance better.

+ `Pandas` "penalizes" you more for being inefficient than `Polars`. That's good for you at this stage of tackling Finance related problems.<br><br>


**2 Overview of Basic Concepts and Features**

+ Main purpose of this notebook is to introduce the `Pandas` library.

+ `Pandas` is the main library for this course.

+ Will overview core concepts and features of `pandas` for quant and academic finance.

+ Will cover the concepts and features with more detail as we move forward.<br><br>


**2.1 Accessing the Pandas Library**

+ To use the Pandas library we have to tell Python that we want access to it.

+ You make `pandas` accessible by using the `import` command.

+ When importing the pandas library, you also associate the library with a namespace: **I use pd**

  - `pd` namespace is not required, but it's a strong convention.

  - Given the `pd` namespace $\rightarrow$ Pandas' functions look like `pd.function`.
  
  - For example, `pd.read_csv`. 

  - Namespaces make it clear what library a certain function or command comes from if each library you use has its own namespace.

+ **code for importing pandas:**

In [None]:
import pandas as pd

<br>

**2.2 Pandas Core Data Structures: Dataframes and Series**

+ Core data structure/object $\rightarrow$ the `dataframe`.

+ Dataframe: container for holding rectangular array of mixed type data.

  - Columns: represent different variables (e.g., stock price or earnings of Google).

  - Rows: represent a given observation for those variables (e.g., January 2009 for Google).

+ It's the programming equivalent of a spreadsheet.

+ Each column can be of a different type: integers, floating point numbers, strings, or even complex numbers. <br><br>
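+ A small sketch of a mixed-type dataframe. The ticker, years, and prices are made up for illustration; `dtypes` reports the type pandas chose for each column.

```python
import pandas as pd

# Hypothetical data: one string column, one integer column, one float column.
toy = pd.DataFrame({
    'tick':  ['XYZ', 'XYZ'],
    'year':  [2009, 2010],
    'price': [31.5, 29.75],
})

# One row per column, showing that column's data type.
col_types = toy.dtypes
print(col_types)
```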


**2.3 Dataframes: Store Data and Provide Useful Functions**

+ With `dataframes`, Pandas provides:

  - Ways to create new data.

  - Transform and combine data.

  - Aggregate data.

  - Display data. 

+ Examples:

  - Built in division operator (`/`): divides element by element.

  - Built in mean function that computes sample average of each column.
  
  - Higher level functions: e.g., plotting functions
  
+ Many functions are built into the `dataframe` object.

+ Built in functions are called `methods` (object oriented programming terminology).<br><br>
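+ A quick sketch of the first two examples, using made-up numbers: the division operator works element by element, and `mean` is a method built into the dataframe.

```python
import pandas as pd

# Hypothetical two-row dataframe.
toy = pd.DataFrame({'ebit': [10.0, 30.0], 'revenue': [100.0, 200.0]})

# Element-by-element division: row 0 is 10/100, row 1 is 30/200.
ratio = toy['ebit'] / toy['revenue']

# The mean method computes the sample average of each column.
col_means = toy.mean()

print(ratio)
print(col_means)
```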


**2.4 Summary of DataFrames**

+ A `dataframe` is an object that provides data storage and useful functions<br><br>


**2.5 Series**

+ A `series` in pandas is the name for a single column of data.
  
+ If you grab one column of a dataframe, you're grabbing a series.

+ `Dataframes` and `series` behave very similarly. For our purposes, it's mostly just a technical distinction between a one-dimensional and a two-dimensional array.<br><br>


**2.6 Importing Data and Creating a DataFrame:**

+ Getting data into a `Pandas` dataframe is usually easy. 

+ Pandas can read many different data formats: csv files, Excel files, SAS data files, Stata data files (dta), Feather data files, etc.

+ In this class, we mostly use csv files.

+ I will highlight other methods.<br><br>


**3 Reading in Some Financial Data**

+ Let's read in some data, and create a `dataframe` object.

+ Data: annual balance sheet data for Amazon and Hormel.

+ Data are in a csv file $\rightarrow$ use the `read_csv` function.

+ The `read_csv` function will automatically create a `dataframe` object containing the data in the csv file.

+ The `read_csv` function has a lot of options and flexibility.

+ Often you don't use any of the options. Particularly for well formed csv files. 

+ [Pandas' documentation for read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

+ The `read_csv` function can, of course, read files stored on your local machine.

+ It can also read files stored remotely on a webserver (just provide a URL). 

+ The code below calls pandas' `read_csv` function and then reads the csv located at the URL in quotes. After reading the file it creates a dataframe and assigns the dataframe to `df`.

In [None]:
df = pd.read_csv('https://diether.org/prephd/01-intro.csv')
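+ `read_csv` also accepts a file-like object, which makes it easy to experiment without a file or a network connection. A minimal sketch, assuming a tiny made-up csv held in a string (the column names and values are hypothetical, not the contents of the course file):

```python
import io
import pandas as pd

# Made-up csv text: a header row followed by two observations.
csv_text = "tick,year,revenue\nAAA,2010,1.5\nAAA,2011,2.0\n"

# Wrap the string in StringIO so read_csv can treat it like an open file.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```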

+ To read in data from non-csv formats you generally invoke a command very similar to `read_csv`. For example, you can read in a `Stata` datafile using the following:
```python
    df = pd.read_stata('filename.dta')
```

+ Many other ways to create dataframes. Can convert core Python `lists` or `dictionaries` into `dataframes`.<br><br> 
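+ A short sketch of both conversions, using made-up tickers and revenue numbers. A dictionary maps column names to lists of values; a list of lists holds one inner list per row, so the column names are supplied separately.

```python
import pandas as pd

# From a dictionary: keys become column names, values become the columns.
from_dict = pd.DataFrame({'tick': ['AAA', 'BBB'], 'revenue': [1.5, 2.5]})

# From a list of rows: each inner list is one observation.
from_rows = pd.DataFrame([['AAA', 1.5], ['BBB', 2.5]],
                         columns=['tick', 'revenue'])

print(from_dict)
print(from_rows)
```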


**3.1 Displaying or Printing out the Data in a Dataframe**

+ A **Jupyter notebook** is a special environment where if you type the name of a dataframe (or other datatypes), it will display the default view of that object.

+ If the dataframe is small it will display all the data.

+ If the dataframe is large, only a truncated view of the data will be displayed. 

+ If you write a python program and run it outside of the Jupyter notebook environment, then you need to use the `print` function to see any output.

In [None]:
df

<br>

**Print function**

+ You can also explicitly print a dataframe out using python's print function.

In [None]:
print(df)

<br>

**3.2 Dataframes and Series**

+ Our `dataframe` is called `df`.

+ If we select a column from the `dataframe` it will be of type `series`.

+ We select a column of a dataframe (a series) by putting the column's name, wrapped in quotes, inside square brackets after the dataframe's name.

  - You typically must wrap the column's name in quotes because most column names are stored as strings.

  - You will need to reference columns this way as long as the variable names aren't entirely numeric. 

  - Can delimit strings with single or double quotes: ' ' or " ".

In [None]:
df['revenue']

In [None]:
df[['tick','year','revenue']]
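+ The two cells above differ in what they return. A small sketch on a made-up dataframe (the values are hypothetical): a single column name in brackets gives a `series`, while a list of column names gives a `dataframe`, even if the list has only one name in it.

```python
import pandas as pd

toy = pd.DataFrame({
    'tick':    ['AAA', 'BBB'],
    'year':    [2010, 2010],
    'revenue': [1.5, 2.5],
})

one_col = toy['revenue']              # single name -> Series
two_cols = toy[['tick', 'revenue']]   # list of names -> DataFrame
one_col_frame = toy[['revenue']]      # list with one name -> still a DataFrame

print(type(one_col))
print(type(two_cols))
print(type(one_col_frame))
```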

<br>

**3.3 Checking the Data Type**

+ In Python, there is a `type` function that returns the type of a variable or object.

In [None]:
type(df)

In [None]:
type(df['revenue'])

<br>

**3.4 Data creation** 

+ Typically create a new column with the assignment operator.

+ Like most programming languages, the assignment operator is just the equal sign (`=`) in Python.

+ For example, suppose I want to create a new column that measures profit margin. Profit margin is defined as the following (note, ebit is earnings before interest and taxes):
$$
\text{Profit Margin} = \frac{ebit}{revenue}
$$

+ Mathematical operations such as addition (+), subtraction (-), multiplication (*), or division (/) are all element by element operations between the dataframe columns that are addressed by the code.

+ Python/Pandas code for creating profit margin column in the dataframe.

In [None]:
df['profit_margin'] = df['ebit'] / df['revenue']
df.round(2)

<br>

**3.5 If/then/else logic in Pandas**

+ `If/then/else` logic is important in all types of programming.

+ In `Python/Pandas`, you rarely write code that looks like classic `if/then/else` statements.

+ For example, many `Pandas` logical functions or statements are actually `if/then` statements with an implicit else.

+ Data selection often involves if/then/else logic $\leftarrow$ in Pandas' jargon it's often called Boolean indexing.

+ For example, we can use if/then/else logic to create a new variable that is `True` if the year is greater than 2010 and `False` otherwise. The logical statement looks like the following:

```
if (year is greater than 2010) then
   True
else
   False
```

+ In Pandas, we create a logical condition for the whole column $\rightarrow$

In [None]:
df['year'] > 2010

+ We can also assign this new `True`/`False` variable to the dataframe: 

In [None]:
df['gt_2010'] = df['year'] > 2010
df
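+ One way to see the "implicit else" is the `where` method of a series. A minimal sketch on made-up data (the `toy` dataframe is hypothetical): `where` keeps a value if the condition is `True`; otherwise it fills in a missing value, unless you supply the "else" value yourself as the second argument.

```python
import pandas as pd

toy = pd.DataFrame({'year':    [2009, 2010, 2011, 2012],
                    'revenue': [1.0, 2.0, 3.0, 4.0]})

# Logical condition for the whole column.
late = toy['year'] > 2010

# if late then revenue, else missing (the implicit else).
rev_late = toy['revenue'].where(late)

# if late then revenue, else 0.0 (an explicit else).
rev_late_or_zero = toy['revenue'].where(late, 0.0)

print(rev_late)
print(rev_late_or_zero)
```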

<br>

**3.6 Data selection** 

+ Based on if/then/else logic, `Pandas` allows you to select only the rows or columns of a `dataframe` that you want.

+ Suppose you only want observations where the year is greater than 2010.

+ Pandas allows us to index a dataframe's rows based on a logical condition.

+ Pandas will select only the observations where the condition is `True`.

+ Called logical indexing $\rightarrow$

In [None]:
df[df['gt_2010'] == True]

In [None]:
df[df['gt_2010']]

In [None]:
df[df['year'] > 2010]
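+ The three cells above select the same rows. A quick check on a made-up dataframe (the values are hypothetical); note that the selected rows keep their original index labels.

```python
import pandas as pd

toy = pd.DataFrame({'year':    [2009, 2010, 2011, 2012],
                    'revenue': [1.0, 2.0, 3.0, 4.0]})
toy['gt_2010'] = toy['year'] > 2010

a = toy[toy['gt_2010'] == True]   # explicit comparison with True
b = toy[toy['gt_2010']]           # the Boolean column used directly
c = toy[toy['year'] > 2010]       # the condition written inline

print(a.equals(b) and b.equals(c))
print(c)
```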

<br>

**3.7 Creating a Sub Dataframe**

+ We can assign the smaller dataframe to a new dataframe with the following:

In [None]:
sub = df[df['year'] > 2010]
sub

<br>

**3.8 Deleting a Variable/Column of Data**

+ Very common to delete or remove columns.

+ Typically rely on the `drop` function.

+ For example, suppose I want to drop the `capx` column from the dataframe.

In [None]:
df.drop('capx',axis='columns')

+ The preceding command created a new `dataframe` with the `capx` column removed. 

+ Most `pandas` functions create a new dataframe.

+ To modify the original `dataframe` (df) we have to assign the `dataframe` created by the drop command back to `df`.

In [None]:
df = df.drop('capx',axis='columns')
df
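+ A small sketch, on a made-up dataframe, showing that `drop` leaves the original untouched until the result is assigned back (the `toy` data and the `still_there` variable are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'revenue': [1.0, 2.0], 'capx': [0.1, 0.2]})

# drop returns a new dataframe; toy itself is not modified.
dropped = toy.drop('capx', axis='columns')
still_there = 'capx' in toy.columns

# Assigning the result back replaces the original.
toy = toy.drop('capx', axis='columns')

print(still_there)
print(list(toy.columns))
```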

<br>

**4 More Advanced Concepts and Features for Later**

**4.1 The groupby/apply construct:**

+ The most important **programming idiom** or construct for this class is the `groupby/apply` construct.

+ Allows us to loop through the data grouping observations in a `dataframe` together, and then apply a function to each group.

+ For example, we often use it to group observations by date or to group all the observations of the same stock together. We then typically apply a function to the data that aggregates it or transforms it within these groups.

+ The `groupby/apply` construct allows us to accomplish the following with just one (or a few lines) of code:

  1. Logically **group** observations together based on some attribute of the data: for example, we could group stock data based on whether the company was big or small.

  2. **Apply** a function to the different groups. For example, we could compute the average number of analysts covering big versus small stocks.

+ The groupby/apply does a whole bunch of work for us behind the scenes. It loops over all the observations, categorizes the observations into the groups, and then applies the functions separately to each group.<br><br>
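+ A minimal sketch of the big versus small example. The tickers, size labels, and analyst counts are made up. For a common aggregation like the mean, pandas has a built-in shortcut; `apply` with a function gives the same numbers.

```python
import pandas as pd

stocks = pd.DataFrame({
    'tick':       ['A', 'B', 'C', 'D'],
    'size':       ['big', 'small', 'big', 'small'],
    'n_analysts': [20, 3, 30, 5],
})

# Group by size, then average analyst coverage within each group.
avg_cov = stocks.groupby('size')['n_analysts'].mean()

# Same calculation written with apply and an explicit function.
avg_cov_apply = stocks.groupby('size')['n_analysts'].apply(lambda x: x.mean())

print(avg_cov)
```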


**4.2 User-written functions:**

+ You will write your own custom (i.e., user written) functions to extend the functionality of the `groupby/apply` construct.

+ For example, writing a custom function is sometimes an important part of implementing a portfolio formation criterion for a trading strategy.<br><br>
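+ A sketch of a user-written function passed to `groupby/apply`. The function name `high_minus_low`, the dates, and the returns are all hypothetical; the function is only meant to show the mechanics, not an actual formation criterion.

```python
import pandas as pd

def high_minus_low(x):
    """Spread between the largest and smallest value in a group."""
    return x.max() - x.min()

rets = pd.DataFrame({
    'date': ['2020-01'] * 3 + ['2020-02'] * 3,
    'tick': ['A', 'B', 'C'] * 2,
    'ret':  [0.02, -0.01, 0.04, 0.03, 0.01, 0.02],
})

# Apply the custom function to the returns within each date.
spread = rets.groupby('date')['ret'].apply(high_minus_low)
print(spread)
```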


**4.3 Merging data** 

+ Merging data is a core part of the data preparation step for most empirical work or backtesting of strategies.

+ You will learn how to merge dataframes together based on a single key or multiple keys.
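+ A small preview of both kinds of merge, using the dataframe `merge` method on made-up data (the dataframes `fund`, `names`, and `prices` and all their values are hypothetical). The `on` argument names the key or keys that must match.

```python
import pandas as pd

fund = pd.DataFrame({'tick': ['A', 'A', 'B'],
                     'year': [2010, 2011, 2010],
                     'revenue': [1.0, 1.5, 2.0]})
names = pd.DataFrame({'tick': ['A', 'B'],
                      'name': ['Alpha Co', 'Beta Co']})
prices = pd.DataFrame({'tick': ['A', 'A', 'B'],
                       'year': [2010, 2011, 2010],
                       'price': [10.0, 12.0, 20.0]})

# Single key: every row of fund picks up the name for its ticker.
one_key = fund.merge(names, on='tick')

# Multiple keys: rows match only when both tick and year agree.
two_keys = fund.merge(prices, on=['tick', 'year'])

print(one_key)
print(two_keys)
```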