# Code Snippets

A collection of small pandas code samples.

In [5]:
import pandas as pd
from numpy import nan as NaN  # the numpy.NaN alias was removed in NumPy 2.0

### Add a Column to DataFrame

Add a column to a DataFrame object. We use the pd.concat() function to achieve this. To make it work, we need to:

1. Create a dataframe of a single column, with the column name specified by the user;
2. Align the index of the new dataframe with the dataframe we are appending to;
3. Call concat().
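
The three steps above can be sketched as follows. The data is hypothetical and chosen only for illustration. The second concat shows what tends to happen when step 2 is skipped: the new dataframe keeps its default integer index, so concat takes the union of both indexes and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
base = pd.DataFrame({'Score': [88, 90]}, index=['Isaac', 'Yan'])

# Steps 1 and 2: a single-column dataframe that reuses base's index
extra = pd.DataFrame({'Date': ['2020-05-16', '2020-05-16']}, index=base.index)

# Step 3: concatenate column-wise
aligned = pd.concat([base, extra], axis=1)

# Without step 2 the new dataframe has the default 0..n-1 index,
# so the result contains the union of both indexes, padded with NaN
unaligned = pd.concat(
    [base, pd.DataFrame({'Date': ['2020-05-16', '2020-05-16']})],
    axis=1,
)

print(aligned.shape)    # (2, 2)
print(unaligned.shape)  # (4, 2)
```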

#### Add a List of Fixed Values As a Column

In [16]:
from itertools import repeat

"""
    Add a column of fixed value to a dataframe
    
    [String] column (column name), [Object] value (any value type that can be put into a dataframe), [DataFrame] df
        => [DataFrame] df (new DataFrame with a column appended)
"""
addColumn = lambda column, value, df: \
    pd.concat( [ df
               , pd.DataFrame(repeat([value], len(df)), columns=[column], index=df.index)
               ]
             , axis=1
             )

Try it out

In [17]:
df = pd.DataFrame( [[88, 3], [90, 1], [75, 10]]
                 , columns=['Score', 'Rank']
                 , index=['Isaac', 'Yan', 'Chuyi'])
df

Unnamed: 0,Score,Rank
Isaac,88,3
Yan,90,1
Chuyi,75,10


In [18]:
addColumn('Date', '2020-05-16', df)

Unnamed: 0,Score,Rank,Date
Isaac,88,3,2020-05-16
Yan,90,1,2020-05-16
Chuyi,75,10,2020-05-16


#### Add a Series Object As a Column

In [19]:
"""
    Add a Series Object as a column to a dataframe
    
    The index of the series object is not important but its length must be the same as the length 
    of the dataframe to be appended, otherwise an exception will be thrown.
    
    [String] column (column name), [Series] s, [DataFrame] df
        => [DataFrame] df (new DataFrame with series s as a column appended)
"""
addSeriesColumn = lambda column, s, df: \
    pd.concat( [ df
               , pd.DataFrame({column: s.values}, index=df.index)
               ]
             , axis=1
             )

In [20]:
dates = pd.Series(['2020-05-16', '2020-05-17', '2020-05-18'])
dates

0    2020-05-16
1    2020-05-17
2    2020-05-18
dtype: object

In [21]:
addSeriesColumn('Date', dates, df)

Unnamed: 0,Score,Rank,Date
Isaac,88,3,2020-05-16
Yan,90,1,2020-05-17
Chuyi,75,10,2020-05-18


In [22]:
# This raises a ValueError because the length of the series differs from the length of the dataframe
# addSeriesColumn( 'Date'
#                , pd.Series(['2020-05-16', '2020-05-17'])
#                , df)
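
A minimal sketch of that failure, reduced to the DataFrame construction inside addSeriesColumn. The variable names are illustrative; the point is that pandas rejects values whose length does not match the index.

```python
import pandas as pd

df = pd.DataFrame({'Score': [88, 90, 75]}, index=['Isaac', 'Yan', 'Chuyi'])
short = pd.Series(['2020-05-16', '2020-05-17'])  # two values for three rows

try:
    pd.DataFrame({'Date': short.values}, index=df.index)
    raised = False
except ValueError as err:
    raised = True
    print(err)  # the message reports the mismatching lengths
```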

#### Create a New Series Then Add

Sometimes we create a series from a dataframe and add it as a new column. Because the series is derived from the dataframe, it already shares the dataframe's index, so we don't have to worry about aligning it. There are two steps:

1. Create a new series;
2. Combine the new series with the dataframe.

In [23]:
df

Unnamed: 0,Score,Rank
Isaac,88,3
Yan,90,1
Chuyi,75,10


In [26]:
# create a new column
comments = df['Rank'].apply(lambda x: 'Excellent' if x < 6 else 'Good')
comments

Isaac    Excellent
Yan      Excellent
Chuyi         Good
Name: Rank, dtype: object

In [27]:
# combine the new column
pd.concat([df, pd.DataFrame({'Comments': comments})], axis=1)

Unnamed: 0,Score,Rank,Comments
Isaac,88,3,Excellent
Yan,90,1,Excellent
Chuyi,75,10,Good
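
As an alternative to concat, not used in the original cells, DataFrame.assign() can attach a derived series in one call. The sketch below rebuilds the same data and also checks that the derived series carries the dataframe's index, which is why no alignment step is needed.

```python
import pandas as pd

df = pd.DataFrame({'Score': [88, 90, 75], 'Rank': [3, 1, 10]},
                  index=['Isaac', 'Yan', 'Chuyi'])
comments = df['Rank'].apply(lambda x: 'Excellent' if x < 6 else 'Good')

# The derived series has the same index as its source dataframe
same_index = comments.index.equals(df.index)

# assign() returns a new dataframe and leaves df unchanged
with_comments = df.assign(Comments=comments)
```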


## Count NaN Values

In [34]:
# isnull() yields booleans; summing them counts the True values
countNan = lambda v: v.isnull().sum()

Let's count the number of NaN values in a Series:

In [35]:
s = pd.Series([1, 2, NaN, 0])
s

0    1.0
1    2.0
2    NaN
3    0.0
dtype: float64

In [36]:
countNan(s)

1

In [38]:
df = pd.DataFrame([[1, NaN], [2, 3], [NaN, NaN]], columns=['value1', 'value2'])
df

Unnamed: 0,value1,value2
0,1.0,
1,2.0,3.0
2,,


In [39]:
df.apply(countNan)

value1    1
value2    2
dtype: int64

### Count NaN Percentage

In [43]:
countNanPercent = lambda v: countNan(v)/v.size

In [41]:
countNanPercent(s)

0.25

In [42]:
df.apply(countNanPercent)

value1    0.333333
value2    0.666667
dtype: float64
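
For whole dataframes, the same numbers can be obtained without apply(). This is a sketch of an alternative, not the approach used above: isna() gives a boolean dataframe, sum() counts the True values per column, and mean() gives their fraction.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [2, 3], [np.nan, np.nan]],
                  columns=['value1', 'value2'])

nan_counts = df.isna().sum()      # NaN count per column
nan_fractions = df.isna().mean()  # fraction of NaN values per column
```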

## Filter DataFrame Rows By Function

To filter dataframe rows, we need a boolean series. One way to generate a boolean series is through an expression, e.g., df[x] == y, where x is the column name and y is a value.

What if we need more complicated logic, e.g., taking values from several columns and then doing some calculation? Then we need a function instead. So, how do we filter rows by a function? Two steps:

1. Generate a boolean series through apply();
2. Filter rows using the series.

In [5]:
df = pd.DataFrame( [['700 HK', 3500, 1.25], ['BABA US', 10200, 0.36], ['TSLA US', 15000, 3.25]]
                 , columns=['Investment', 'Profit', 'Profit Ratio']
                 )
df

Unnamed: 0,Investment,Profit,Profit Ratio
0,700 HK,3500,1.25
1,BABA US,10200,0.36
2,TSLA US,15000,3.25


Now, we want to select the investments that satisfy both of the following:

1. Profit > 10,000;
2. Profit ratio > 1.0

In [9]:
def successful(row):
    return row['Profit'] > 10000 and row['Profit Ratio'] > 1.0


selector = df.apply(successful, axis=1)
selector

0    False
1    False
2     True
dtype: bool

In [10]:
df[df.apply(successful, axis=1)]

Unnamed: 0,Investment,Profit,Profit Ratio
2,TSLA US,15000,3.25
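
When the condition only compares column values, as it does here, the same selection can also be written as a combined boolean expression. This is a sketch of that alternative; the row-wise function remains the more general tool when the logic cannot be expressed column-wise.

```python
import pandas as pd

df = pd.DataFrame(
    [['700 HK', 3500, 1.25], ['BABA US', 10200, 0.36], ['TSLA US', 15000, 3.25]],
    columns=['Investment', 'Profit', 'Profit Ratio'],
)

# & combines the two boolean series element-wise; the parentheses are required
mask = (df['Profit'] > 10000) & (df['Profit Ratio'] > 1.0)
selected = df[mask]
```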


## Read DataFrame from Lines

pandas provides read_csv() and read_excel(). But sometimes we need to filter lines and columns from an Excel or CSV file before feeding them to a dataframe. For that we need a read_lines() function that creates a dataframe from an iterator of lines. Here is how.

In [3]:
from utils.iter import pop
from toolz.functoolz import compose

"""
    [Iterator] lines => [DataFrame] df
    
    lines is an iterator over lines, where each line is a list or an iterator over values.
    The first line holds the headers (column names).
    
    Note: for the pop() function to work properly, lines cannot be a List.
"""
read_lines = compose(
    lambda t: pd.DataFrame(t[1], columns=t[0])
  , lambda lines: (pop(lines), lines)
)
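
The pop() helper comes from a local utils package that is not shown here. The sketch below assumes it returns the first element and advances the iterator, and replaces it with the built-in next(). The name read_lines_sketch is hypothetical. Calling iter() first means a plain list is accepted as well.

```python
import pandas as pd

def read_lines_sketch(lines):
    """Build a dataframe from lines; the first line holds the column names."""
    it = iter(lines)    # works for lists as well as iterators
    headers = next(it)  # consume the header line
    return pd.DataFrame(list(it), columns=headers)

# Hypothetical input, for illustration only
rows = [['Name', 'Score'], ['Isaac', 88], ['Yan', 90]]
from_list = read_lines_sketch(rows)
from_iterator = read_lines_sketch(iter(rows))
```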

Let's use an Excel file as an example.

In [9]:
from utils.excel import fileToLines
from itertools import takewhile
from functools import partial

compose(
    read_lines
  , partial(takewhile, lambda line: len(line) > 0 and line[0] != '')
  , fileToLines
)('data/19437_Investment_Positions_20200529.xlsx').tail()

Unnamed: 0,ReportMode,LongShortDescription,SortKey,LocalCurrency,BasketInvestDescription,Description,InvestID,Quantity,LocalPrice,CostLocal,CostBook,BookUnrealizedGainOrLoss,AccruedInterest,MarketValueBook,Invest
193,Payable Investments,SFC Annual Fee Payable,Cash and Equivalents,Hong Kong Dollar,,Hong Kong Dollar,HKD,515.28,1,515.28,515.28,0.0,0.0,515.28,0.0
194,Payable Investments,SoldAI,Cash and Equivalents,United States Dollar,,United States Dollar,USD,-579734.96,1,-579734.96,-4493908.07,266.43,0.0,-4493641.64,0.0175
195,Payable Investments,SubscriptionsInAdvance,Cash and Equivalents,Hong Kong Dollar,,Hong Kong Dollar,HKD,-2282680.79,1,-2282680.79,-2282680.79,0.0,0.0,-2282680.79,0.0089
196,Payable Investments,System Fee Payable,Cash and Equivalents,United States Dollar,,United States Dollar,USD,-8550.0,1,-8550.0,-66591.28,0.0,0.0,-66591.28,0.0003
197,Payable Investments,Trustee Directors and Officers Fees Payable,Cash and Equivalents,Hong Kong Dollar,,Hong Kong Dollar,HKD,-1911151.15,1,-1911151.15,-1911151.15,0.0,0.0,-1911151.15,0.0074
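
The takewhile() step can be tried without the Excel file. The rows below are hypothetical stand-ins for what a line reader might yield; reading stops at the first empty line or the first line whose first cell is empty, so trailing footer lines never reach the dataframe.

```python
from itertools import takewhile

import pandas as pd

# Hypothetical rows, for illustration only
rows = [
    ['Currency', 'Quantity'],
    ['HKD', 515.28],
    ['USD', -8550.0],
    [],                   # blank line: takewhile stops here
    ['Footer text', ''],  # never reached
]

lines = takewhile(lambda line: len(line) > 0 and line[0] != '', rows)
headers = next(lines)  # first line holds the column names
table = pd.DataFrame(list(lines), columns=headers)
```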
