# Day 6 - Pandas Data Frames

A data frame stores data in a rectangular grid that is easy to view. It is a two-dimensional labeled data structure whose columns can have different types. Each row of the grid corresponds to the measurements or values of one instance, while each column is a vector containing data for one variable. This means that the values in a row can, but need not, share the same type: they can be numeric, character, logical, etc.  

A [Pandas](https://en.wikipedia.org/wiki/Pandas_(software)) DataFrame is built by a constructor that takes five main arguments: 

`pandas.DataFrame( data, index, columns, dtype, copy)`  

- the **data** (various forms such as `ndarray`, `Series`, `list`, `dict`, constants, or another `DataFrame`), 
- the **index** (row labels),  
- the **columns** (column labels),
- `dtype` (the data type to force on the data; if omitted, it is inferred), and
- `copy` (used for copying of data).
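
As a minimal sketch of how the five constructor arguments fit together (the values and the labels `r1`, `x`, etc. are made up for illustration):

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])   # illustrative data only

df_demo = pd.DataFrame(
    data=values,                  # the data
    index=['r1', 'r2', 'r3'],     # row labels
    columns=['x', 'y'],           # column labels
    dtype=float,                  # force the data to float
    copy=True,                    # copy the input instead of sharing memory with `values`
)
print(df_demo)
print(df_demo.dtypes)
```

With `copy=True`, changing `values` afterwards should not affect `df_demo`.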

<div>
<img src=data/images/PandasDataFrame01.jpg width="500"/>
</div>


Firstly, the DataFrame can contain data that is:
- a Pandas `DataFrame`
- a Pandas `Series`: a one-dimensional labeled array capable of holding any data type with axis labels or index. An example of a Series object is one column from a DataFrame.
- a NumPy `ndarray`, which can be a record or structured array
- a two-dimensional `ndarray`
- dictionaries of one-dimensional `ndarray`s, lists, dictionaries or Series.

Note the difference between `np.ndarray` (an actual data type) and `np.array()` (a function to make arrays from other data structures).
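
A short sketch of that distinction, using a made-up array: `np.array()` is called to build the object, and the object it returns is of type `np.ndarray`.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # np.array() is the function that builds the array

print(type(arr))                     # the resulting object is an instance of np.ndarray
print(isinstance(arr, np.ndarray))
print(arr.ndim, arr.shape)           # number of dimensions and (rows, columns)
```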

![](data/images/PandasDataFrame02.png)


In [None]:
# Your Notebook should start with the imports of all required packages. 
# You can amend it as you go along.

# IMPORTANT: If you made modifications to a package (e.g., added a new function to utsbootcamp.py),
# you must restart the kernel and import it again. Menu->Kernel->Restart

import numpy as np

import pandas as pd
# You can set this up at the start of your Notebook (after you imported pandas package) as a global parameter
#pd.options.display.max_rows = 4
#pd.options.display.max_columns = 6
#pd.options.display.precision = 2
# To reset options to their default values, use one of the following two:
#pd.reset_option('^display.', silent=True)    # reset only display option to their default values
#pd.reset_option("all")                      # reset all pandas options to defaults
# more on Pandas options: https://queirozf.com/entries/pandas-display-options-examples-and-reference

import yfinance as yf

import matplotlib.pyplot as plt
# Set default parameters for 'matplotlib' package
# More info on tweaking default parameters: https://matplotlib.org/stable/tutorials/introductory/customizing.html
plt.rcParams["figure.figsize"] = [10,8]  # Set default figure size

import datetime as dt
import utsbootcamp as bc

### Spellchecker for Jupyter Notebook

Following several attempts based on the instructions [here](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html), I was able to install the spell check extension for Jupyter Notebook markdown cells. I ran the following three commands in **Anaconda Prompt**:

1. `conda install -c conda-forge jupyter_contrib_nbextensions`
2. `jupyter contrib nbextension install --system` and then also `jupyter contrib nbextension install --user` just to make sure (one is for all users - system-wide; the second one is for a specific current user)
3. `jupyter nbextension enable codefolding/main`

Eventually, it worked.

## Recap

In [None]:
help(bc.get_yahoo_data)

In [None]:
x=bc.get_yahoo_data({'INTC': 'Intel','BHP': 'BHP','AMZN': 'Amazon','GME': 'GameStop'},
                 start='2020-01-01',
                 end='2020-12-31',
                 column='High',
                 plot=True)

In [None]:
x.head(5)

In [None]:
y=bc.price2cret(x)
y.head()

### Subset of data based on several conditions

Option 1: 

<font size="6"> **df.loc**<font color=blue> [</font> <font color=Crimson> (cond1)</font> <font color=Magenta>&</font> <font color=Crimson>(cond2)</font> , <font color=SpringGreen> [</font>col1, col2<font color=SpringGreen> ] </font> <font color=blue> ]</font> </font>

In [None]:
y.loc[ (y.BHP<0) & (y.AMZN<0), ['BHP', 'AMZN']].head(10)

Option 2: (the second blue bracket is optional)

<font size="6"> **df.query**<font color=blue> (</font> <font color=Crimson> 'cond1</font> <font color=Magenta>&</font> <font color=Crimson>cond2'</font> <font color=blue> )</font> <font color=blue> [</font> <font color=SpringGreen> [</font>col1, col2<font color=SpringGreen> ]  </font> <font color=blue> ]</font> </font>

In [None]:
y.query('BHP<0 & AMZN<0')[['BHP', 'AMZN']]

Option 3: 

<font size="6"> **df**<font color=blue> [</font> <font color=Crimson> (cond1)</font> <font color=Magenta>&</font> <font color=Crimson>(cond2)</font> <font color=blue> ]</font> <font color=blue> [</font> <font color=SpringGreen> [</font>col1, col2<font color=SpringGreen> ]  </font> <font color=blue> ]</font> </font>

In [None]:
y[ (y.BHP<0) & (y.AMZN<0) ][['BHP', 'AMZN']].head(10)
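
To check that the three options select the same rows, here is a small sketch on a made-up DataFrame `r` (the numbers are illustrative, not real market data):

```python
import pandas as pd

# Illustrative returns only (not real market data)
r = pd.DataFrame({'BHP':  [0.01, -0.02, -0.03, 0.04],
                  'AMZN': [-0.01, -0.01, 0.02, -0.05],
                  'GME':  [0.10, -0.20, 0.30, -0.40]})

opt1 = r.loc[(r.BHP < 0) & (r.AMZN < 0), ['BHP', 'AMZN']]   # Option 1: .loc
opt2 = r.query('BHP < 0 & AMZN < 0')[['BHP', 'AMZN']]       # Option 2: .query
opt3 = r[(r.BHP < 0) & (r.AMZN < 0)][['BHP', 'AMZN']]       # Option 3: boolean mask

print(opt1)
print(opt1.equals(opt2) and opt1.equals(opt3))
```

Option 1 does the row and column selection in a single indexing step, which is usually the safer choice if you later want to assign values to the selection.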

**Running ahead:** Consider various options for plotting [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)

In [None]:
y.plot(title='Cumulative Returns',figsize=(14,8),
         drawstyle='steps',
         linewidth=5,
         alpha=0.5)
plt.grid(axis='y',alpha=0.2,color='k', linestyle=':', linewidth=3)

In [None]:
y.plot(title='Cumulative Returns',figsize=(14,8),
         drawstyle='steps',
         linewidth=5,
         alpha=0.5,
      subplots=True)

# Since you are now plotting subplots, make sure you declare titles for each subplot. The best way is to get the column names from your pandas DataFrame.
# How?
# Tip: y.columns
# Is the object type ok or does it need to be converted?

## Creating `pandas` DataFrames


Creating an empty DataFrame

In [None]:
df = pd.DataFrame()
print(df)

Creating a DataFrame from Lists

In [None]:
data = [1,2,3,4,5]
df = pd.DataFrame(data, dtype=None) # Try dtype=float; dtype=int;      dtype=None (infer types automatically)
df

In [None]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df

Creating a DataFrame from `dict` of ndarrays / Lists

In [None]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df

Change index labels in a dataframe:

In [None]:
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
df

Define an array

In [None]:
data = np.array([['','Col1','Col2'],
                ['Row1',1,2],
                ['Row2',3,4]])
print(data)                

Select relevant portions from your array to define your DataFrame (e.g., data, index labels, and column names). 

Recall that you index rows first and THEN columns (and don't forget that the indices start at 0!) 

In [None]:
df=pd.DataFrame(data=data[1:,1:],        # values
                  index=data[1:,0],      # 1st column as index
                  columns=data[0,1:])    # 1st row as the column names
display(df)

If your array contains only data values, converting it to a DataFrame is simple:

In [None]:
my_array = np.array([[1, 2, 3], [4, 5, 6]])

In [None]:
print('Below is a \'print\' of an array:')
print(my_array,'\n')

print('Below is a \'print\' of a DataFrame:')
print(pd.DataFrame(my_array),'\n')
# vs.
print('Below is a \'display\' of a DataFrame:')
display(pd.DataFrame(my_array))

# Use "print()" or "display()" to show output that is not from the last line

In [None]:
# Take a dictionary as input to your DataFrame 
my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
print(my_dict,'\n')
pd.DataFrame(my_dict)

In [None]:
# Take a part of DataFrame as input to your new DataFrame 
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print('Original DataFrame:')
display(df)


#my_df = pd.DataFrame(data=df, index=range(0,4), columns=['Age'])
my_df = pd.DataFrame(data=df, index=range(1,5), columns=['Name'])
print('New DataFrame:')
my_df

In [None]:
# Take a Series as input to your DataFrame
my_series = pd.Series({"Belgium":"Brussels", 
                       "India":"New Delhi", 
                       "United Kingdom":"London", 
                       "United States":"Washington"})
print(my_series,'\n')     # Show output - Option 1
pd.DataFrame(my_series)   # Show output - Option 2

In [None]:
pd.DataFrame(my_series, columns=['Country','Capital']) # Error

In [None]:
pd.DataFrame(my_series, columns=['Capital City'])

> <font color=blue>Issue:</font> What gives? Why didn't it work the first time?

After you have created your DataFrame, you can use the `shape` property or the `len()` function in combination with the `.index` property to investigate and summarise the data frame. The `shape` property will provide you with the dimensions of your DataFrame. That means that you will get to know the width and the height of your DataFrame. On the other hand, the `len()` function, in combination with the `index` property, will only give you information on the height of your DataFrame.

In [None]:
df = pd.DataFrame(
    {"a" : [4 ,5, 6],
    "b" : [7, 8, 9],
    "c" : [10, 11, 12],
    "d" : [13, 14, 15]},
    index = [1, 2, 3])
display(df)

# Use the `shape` property
print('shape property: ',df.shape,'\n')

# Or use the `len()` function with the `index` property
print('len() function: ',len(df.index),'\n')

In [None]:
len(df)

In [None]:
df["a"].count() # Count elements in column "a". Will exclude the NaN values (if there are any).

In [None]:
df.columns           # This is an index object
df.columns.values    # This is an array object

print(f'df.columns is of type \t\t: {type(df.columns)}')
print(f'df.columns.values is of type \t: {type(df.columns.values)}')

In [None]:
list(df.columns.values) # vs. list(df.columns) - any difference? should we care?

## Simulate data and dates
This is useful if you want to check your code to see if it works properly. For example, instead of finding and loading time series data, you can create your own with specific properties and check your `def`s and functions, or perform Monte Carlo simulations.

In [None]:
dftest=pd.DataFrame(np.random.rand(20,5)) # 5 columns and 20 rows of random floats

In [None]:
dftest

In [None]:
dftest.index = pd.date_range('2020/1/30', periods=dftest.shape[0]) # Add a date index 

In [None]:
dftest
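
If you want the simulated data to be identical on every run, or to have specific statistical properties, a seeded generator is one option. The sketch below is an illustration under assumed values: the seed, the mean and standard deviation, and the column names `asset1` to `asset3` are all made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)          # fixed seed, so the draws repeat on every run

# Hypothetical daily returns: mean 0.05%, standard deviation 1% (assumed values)
sim = pd.DataFrame(rng.normal(loc=0.0005, scale=0.01, size=(20, 3)),
                   columns=['asset1', 'asset2', 'asset3'],
                   index=pd.bdate_range('2020-01-30', periods=20))  # business days only

print(sim.head())
print(sim.mean())
```

`pd.bdate_range` skips weekends, which is often closer to a trading calendar than `pd.date_range` with daily frequency.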

## Viewing and inspecting data

In [None]:
dftest.info()

In [None]:
dftest.describe()

In [None]:
# Rename columns (replaces existing list of column names with a new list)
dftest.columns = ['a','b','c','d','e']

In [None]:
dftest.head(2)

In [None]:
# Rename specific column(s)- provide a dictionary variable 
# {'old_name1':'new_name1', 'old_name2':'new_name2'}
# Optionally, save the changes to the same variable by using "inplace=True" option
dftest.rename(columns={'a':'A', 'e':'E'}, inplace=True)

In [None]:
dftest.head(2)

## Reshaping data


In [None]:
df

In [None]:
pd.melt(df) # Recall: this will not save the changes to your variable "df"

In [None]:
df.sort_values('d',ascending=False) # Try `ascending=True` or a different column name

In [None]:
df.rename(columns = {'d':'D'})

In [None]:
# the variable "df" did not change 
# (you did not save the changes you made nor did you assign it to another variable)
df

## Sorting 

In [None]:
df_sorted=df.sort_values('d',ascending=False)
display(df_sorted)
df_sorted.sort_index()  # get original DataFrame back

> In the example above, I used the `display()` command to show output. Recall that only the output of the last line in a multiline cell is displayed. To force Python to show output for intermediate code lines, use `print()` or `display()`. For DataFrames, `display()` creates a visually appealing table.

In [None]:
df.drop(columns=['a','d'])

In [None]:
df # try using the `inplace=` argument in the above

## Selecting an Index or Column from a Pandas DataFrame

You can either access the values by calling them by their label (`.loc[]`) or by their position in the index or column (`.iloc[]`).   

To grasp the concept of `.loc[]` and how it differs from other indexing attributes such as `.iloc[]` and `.ix[]`:

- `.loc[]` works on the labels of your index. This means that if you pass `loc[2]`, you look for the rows of your DataFrame whose index label is `2`.
- `.iloc[]` works on the positions in your index. This means that if you pass `iloc[2]`, you look for the row at position `2` (the third row, since positions start at 0).
- (deprecated) `.ix[]` was a more complex case: when the index is integer-based, you pass a label to `.ix[]`. `ix[2]` then means that you're looking in your DataFrame for values that have an index labeled `2`. This is just like `.loc[]`! However, if your index is not solely integer-based, `ix` will work with positions, just like `.iloc[]`. <font color=red>Warning</font>: The `.ix` indexer was deprecated in favor of the stricter `.iloc` and `.loc` indexers and was removed in pandas 1.0.


In [None]:
df

In [None]:
# Using `iloc[]`
print(df.iloc[0,0])

In [None]:
# Using `loc[]`
print(df.loc[1]['a'])

In [None]:
# Using `loc[]`
print(df.loc[1,'a'])

In [None]:
# Using `at[]`
print(df.at[1,'a'])

In [None]:
# Using `iat[]`
print(df.iat[0,0])

Let's access another cell:

In [None]:
# Using `iloc[]`
print(df.iloc[1,3])

# Using `loc[]`
print(df.loc[2]['d'])

# Using `loc[]`
print(df.loc[2,'d'])

# Using `at[]`
print(df.at[2,'d'])

# Using `iat[]`
print(df.iat[1,3])

What about selecting rows and columns?

In [None]:
# Use `iloc[]` to select a row
df.iloc[0]

In [None]:
# Use `loc[]` to select a column
df.loc[:,'d']

### The difference between `.loc` and `.iloc`

In [None]:
df2 = pd.DataFrame(data=np.array([[1, 2, 3], 
                                  [4, 5, 6], 
                                  [7, 8, 9]]), 
                   index= [2, 'A', 4], 
                   columns=[48, 49, 50])
df2

In [None]:
# Pass `2` to `loc`
df2.loc[2]

In [None]:
# Pass `2` to `iloc`
df2.iloc[2]

> <font color=blue>Issue:</font> Why does the output differ?

## Adding a Column to your DataFrame

In [None]:
df = pd.DataFrame(data=np.array([[1, 2, 3], 
                                 [4, 5, 6], 
                                 [7, 8, 9]]), 
                  columns=['A', 'B', 'C'])
df

In [None]:
# Use `.index`
df['D'] = df.index
df

Alternatively,

In [None]:
df.loc[:, 'E'] = pd.Series(['50', '60', '70'], index=df.index)

In [None]:
df

## Deleting a Column from your DataFrame
To get rid of (a selection of) columns from your DataFrame, you can use the `drop()` method. 

- You can set `inplace` to `True` to delete the column without having to reassign the DataFrame to a new variable.
- The `axis` argument is `0` when you are dropping rows and `1` when you are dropping columns.

In [None]:
df

In [None]:
# Drop the column with label 'D'                  
df.drop('D', axis=1, inplace=True)

In [None]:
df

In [None]:
# Show the label of the column at position 3 (the column dropped in the next cells)
df.columns[[3]]

In [None]:
df.drop(df.columns[[3]], axis=1)

In [None]:
df

In [None]:
df.drop(df.columns[[3]], axis=1, inplace=True)

In [None]:
df

## Rename Index labels or Column names
To give the columns or the index labels of your DataFrame different names, it is best to use the `.rename()` method.

In [None]:
df = pd.DataFrame(data=np.array([[1, 2, 3], 
                                 [4, 5, 6], 
                                 [7, 8, 9]]), 
                  columns=['A', 'B', 'C'])
df

In [None]:
newcols = {
    'A': 'new_column_1', 
    'B': 'new_column_2', 
    'C': 'new_column_3'
}

In [None]:
# Use `rename()` to rename your columns
df.rename(columns=newcols, inplace=True)

In [None]:
df

In [None]:
# Rename your index
df.rename(index={1: 'new_row_1'}, inplace=True)

In [None]:
df

In [None]:
df.add_prefix('Column_')

# Reading data in from a file

## Example 1: NBA database

In [None]:
df=pd.read_csv('data/nbaNew.csv')

In [None]:
df.head(10)

In [None]:
df.replace(['A.C. Green'],['A.C. Red'])

In [None]:
# Delete/Drop only the rows in which all values are NaN

df.dropna(axis = 0, how = 'all', inplace = True)
df

In [None]:
# Drop the last two rows:

# Option 1 (flexible: with 'inplace=False' a new DataFrame is returned and 'df' is left unchanged)
df.drop(df.tail(2).index, inplace = False)

# Option 2 (slicing: 'df' is also left unchanged; assign the result to a variable to keep it)
df2 = df[:-2] 
df2

Delete rows from a pandas DataFrame based on a conditional expression:

In [None]:
df[df.PlayerName=='A.C. Green']

In [None]:
# Drop entries relevant to player 'A.C. Green':

df.drop(df[df.PlayerName=='A.C. Green'].index, inplace = False)
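
An alternative that should give the same result is to keep the rows where the condition is false, using `~` to negate the boolean mask. The sketch below uses a made-up DataFrame `players`; only the `PlayerName` column name is taken from the example above, and the `Points` column is hypothetical.

```python
import pandas as pd

# Made-up rows; only the `PlayerName` column name follows the NBA example
players = pd.DataFrame({'PlayerName': ['A.C. Green', 'B. Example', 'A.C. Green', 'C. Sample'],
                        'Points': [10, 12, 8, 15]})

mask = players.PlayerName == 'A.C. Green'

dropped = players.drop(players[mask].index)   # drop by index labels
kept = players[~mask]                         # keep rows where the condition is False

print(kept)
print(dropped.equals(kept))
```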

## Example 2: Macroeconomic data

In [None]:
df=pd.read_csv('data/Country-data.csv')

In [None]:
df

In [None]:
df.head()[['country','gdpp']]

#### Statistics

In [None]:
df.describe()

In [None]:
df.corr(numeric_only=True)  # correlation needs numeric columns, so skip the text column 'country'

In [None]:
df.std(numeric_only=True)  # skip the text column 'country'

In [None]:
df.count()

In [None]:
df.min()

In [None]:
df.max()

> <font color=blue>Issue:</font> <br>
> - What do the `min()` and `max()` statistics above represent? Do these make sense?
> - What are the countries with the highest and lowest child mortality?

In [None]:
display(df[df.country=='Afghanistan'])
display(df[df.country=='Zambia'])

In [None]:
df[df.child_mort==208]

In [None]:
df.max().inflation

Find out which country has the highest inflation rate

In [None]:
df[df.inflation==df.max().inflation].country
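
Another way to answer this kind of question is `idxmax()`, which returns the index label of the row holding the largest value. The sketch below uses a made-up DataFrame `toy` with invented countries and numbers; only the column names follow the dataset above.

```python
import pandas as pd

# Made-up numbers; only the column names follow the dataset above
toy = pd.DataFrame({'country': ['Aland', 'Borduria', 'Syldavia'],
                    'inflation': [2.5, 104.0, 7.1]})

row_label = toy['inflation'].idxmax()          # index label of the row with the largest value
print(toy.loc[row_label, 'country'])

# Same idea as the filter above, written with the column's own max()
print(toy[toy.inflation == toy.inflation.max()].country)
```

Note that `idxmax()` returns a single label (the first one if there are ties), while the boolean filter returns every row that matches the maximum.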

In [None]:
df.gdpp.mean()

In [None]:
df.gdpp.median()

Sort data by GDP per capita in descending order and display top 10 countries.

In [None]:
df.sort_values(['gdpp'],ascending=False).head(10)

In [None]:
df.sort_values(['gdpp','country'],ascending=[False,True])

In [None]:
extr=df['country'].str.extract(r'^(Al)')
print(extr)

In [None]:
df[df['country'].str.contains("lia")]

In [None]:
df[df['country'].str.contains("lia|Ca")==True]

In [None]:
# For columns with mixed data types (e.g., string AND numeric data in a single column). 
# Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). 
# In this case, specify na=False to ignore non-string data.

df[df['country'].str.contains("lia|Ca",na=False)]
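
The effect of `na=False` is easier to see on a column that really does mix types. The Series `s` below is hypothetical and is not taken from the country dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing strings, a missing value and a number
s = pd.Series(['Australia', np.nan, 'Canada', 42, 'Italy'])

raw = s.str.contains('lia|Ca')               # non-strings give NaN instead of True/False
safe = s.str.contains('lia|Ca', na=False)    # NaN results are replaced by False

print(raw)
print(s[safe])
```

A mask that contains NaN cannot be used reliably for row selection, which is why `na=False` is the safer form when filtering.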

In [None]:
df['exports'].plot.hist()

In [None]:
df.plot.scatter('exports','imports')

In [None]:
df.country

In [None]:
df.plot.scatter('exports','imports',85,'red',alpha=0.2)
plt.show() # disable log output

In [None]:
df.plot.scatter('exports','imports','child_mort','red',alpha=0.2)
plt.show() # disable log output

# What is the difference?

In [None]:
# Add labels to scatter points (annotations):
ax=df.plot.scatter('exports','imports',85,'red',alpha=0.2)  # Note: we are saving a plot/axis object to a variable so we can refer to it and modify it later

for i, txt in enumerate(df.country):
    ax.annotate(txt, (df.exports.iat[i], df.imports.iat[i]))  # here, we refer to the already plotted plain scatterplot, 'ax', and add stuff to it.
    
plt.show() # disable log output

Replace numeric row labels in the DataFrame with the data in one of the columns:

In [None]:
df.set_index('country',inplace=True)

In [None]:
df

In [None]:
df.loc['Yemen',:]

## Merge, Join, and Concatenate DataFrames

In [None]:
x=bc.get_yahoo_data({'INTC': 'Intel'}, start='2020-01-01', end='2020-01-31')
y=bc.get_yahoo_data({'AMZN': 'Amazon'}, start='2020-01-15', end='2020-02-15')

In [None]:
x.shape

In [None]:
y.shape

In [None]:
xy = pd.merge(x, y,on=["Date"])

In [None]:
xy # why doesn't start date in y match the requested one?

In [None]:
pd.merge(x, y,on=["Date"],how="left") # how: {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
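
To see how the `how` argument changes the result without downloading anything, here is a sketch on two small made-up DataFrames with partly overlapping dates (the prices are invented, and `Date` is an ordinary column here):

```python
import pandas as pd

# Made-up prices with partly overlapping dates
left = pd.DataFrame({'Date': pd.to_datetime(['2020-01-14', '2020-01-15', '2020-01-16']),
                     'Intel': [59.0, 59.5, 60.0]})
right = pd.DataFrame({'Date': pd.to_datetime(['2020-01-15', '2020-01-16', '2020-01-17']),
                      'Amazon': [1860.0, 1875.0, 1865.0]})

# Compare the number of rows each join type keeps
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='Date', how=how)
    print(how, merged.shape)

print(pd.merge(left, right, on='Date', how='outer'))
```

An inner join keeps only the dates present in both DataFrames, left and right joins keep all dates from one side, and an outer join keeps every date, filling the gaps with `NaN`.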

# References
* [Pandas reference](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
* [Combining Data in Pandas With `.merge()`, `.join()`, and `concat()`](https://realpython.com/pandas-merge-join-and-concat/)
* [More on `merge()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)