In [None]:
import pandas as pd
import numpy as np
from numpy.random import default_rng
rng = default_rng()

## Series

Given the Series below, without entering the statements:

In [None]:
s = pd.Series(np.arange(5),index=list("abcde"))

- predict the values and the type of object returned for each statement:

In [None]:
s['d']        # value at index label 'd': a scalar (3)
s['b':'d']    # Series with labels 'b', 'c' and 'd' (a label slice includes its end point)
s[2::2][::-1] # Series with every second value from position 2 to the end, then reversed: 4, 2
s[['b', 'a']] # Series with the values at labels 'b' and 'a', in that order
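
As a cross-check of the predictions, the sketch below stores each result and prints its type. It assumes the same `s` as above; `.iloc` is used for the positional slice so that the intent (positions, not labels) is explicit.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list("abcde"))

scalar = s['d']                  # single label: a scalar, not a Series
label_slice = s['b':'d']         # label slice: both end points are included
stepped = s.iloc[2::2][::-1]     # positions 2 and 4, then reversed
picked = s[['b', 'a']]           # list of labels: the order of the list is kept

for name, obj in [('scalar', scalar), ('label_slice', label_slice),
                  ('stepped', stepped), ('picked', picked)]:
    print(name, type(obj).__name__)
```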

- predict the contents of `s`, `s1` and `lst`:

In [None]:
# 'lst' is an ndarray. In pandas versions without Copy-on-Write (the default before pandas 3) the
# Series 's' holds a reference to 'lst' as its content, also termed a view, so a change made through
# either name is visible through the other. With Copy-on-Write enabled the constructor copies the
# array by default and 's' and 'lst' are independent. 's1' is a copy of 's' in both cases, so a
# change to 's1' is never propagated to 's' or 'lst'.
#
lst, idx = np.arange(5), list("abcde")
s = pd.Series(lst,idx)
s[-1:] = 10               # the slice -1: selects the last position, which is set to 10 in 's' (and in 'lst' when shared)
lst[0] = 5                # a change in 'lst' is reflected in 's' when the data is shared, so s.iloc[0] is then 5
s1 = pd.Series(s.copy())  # s1 is a copy of s, a separate memory location
s1.iloc[0] = -1           # position 0 of s1 is set to -1; 's' and 'lst' are untouched
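
Whether a Series shares memory with the array it was built from depends on the pandas version and on its Copy-on-Write setting, so it is worth checking rather than assuming. A minimal sketch using `np.shares_memory` (the names `arr`, `shared` and `copy_shared` are made up for this illustration):

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
s = pd.Series(arr, index=list("abcde"))

# True when 's' is a view on 'arr', False when the constructor made a copy
shared = np.shares_memory(arr, s.to_numpy())

# an explicit (deep) copy does not share its data with the original
s_copy = s.copy()
copy_shared = np.shares_memory(s.to_numpy(), s_copy.to_numpy())

print("Series shares memory with the array:", shared)
print("copy shares memory with the Series: ", copy_shared)
```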

- predict the result of each operation:

In [None]:
# Arithmetic on two Series aligns them on their index labels: the result covers the union of the
# labels, and labels present in only one operand (the symmetric difference) are set to NaN.
s1 = pd.Series({'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4})
s2 = pd.Series({'d': 0, 'e': 1, 'f': 2, 'g': 3})

s1 + s2           # only 'd' and 'e' are common and are summed; all other labels become NaN
s1[3:] * s2[:-2]  # both slices select the labels 'd' and 'e': [3,4] * [0,1] = [0,4]
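
When NaN for non-overlapping labels is not wanted, the arithmetic methods accept a `fill_value`. The sketch below contrasts `+` with `Series.add(..., fill_value=0)`, which treats a label missing from one operand as 0:

```python
import pandas as pd

s1 = pd.Series({'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4})
s2 = pd.Series({'d': 0, 'e': 1, 'f': 2, 'g': 3})

plain = s1 + s2                     # labels outside the overlap become NaN
filled = s1.add(s2, fill_value=0)   # a missing operand counts as 0 instead

print(plain)
print(filled)
```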

## DataFrame

Given the DataFrame below:

retrieve:
- row 2 as a Series as well as a DataFrame
- rows on even positions
- rows with even indices
- 3rd column
- rows with odd indices, columns 'b' to 'd'

In [None]:
rng = default_rng(1234)
df = pd.DataFrame(np.array(rng.standard_normal(25)).reshape(5,5),
             index=[1, 0, 4, 3, 2], columns=list("abcde"))

In [None]:
type(df.loc[2])                   # row 2 as Series
type(df.loc[[2]])                 # row 2 as DataFrame
df.loc[df.index[0::2]]            # rows on even positions
df.loc[df.index % 2 ==0]          # rows with even indices
df.iloc[:,2]                      # 3rd column <=> df['c']
df.loc[df.index % 2 !=0,'b':'d']  # rows with odd indices, columns 'b' to 'd'
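
Because the index of `df` is shuffled, position and label do not coincide, which is exactly what separates `iloc` from `loc`. A small sketch on the same frame:

```python
import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(1234)
df = pd.DataFrame(rng.standard_normal(25).reshape(5, 5),
                  index=[1, 0, 4, 3, 2], columns=list("abcde"))

by_label = df.loc[2]                      # the row whose index label is 2 (the last row)
by_position = df.iloc[2]                  # the row at position 2 (index label 4)
even_positions = df.iloc[0::2]            # positions 0, 2 and 4
even_labels = df.loc[df.index % 2 == 0]   # labels 0, 4 and 2, in the order they occur

print(by_label.name, by_position.name)
print(even_positions.index.tolist(), even_labels.index.tolist())
```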

### Merge DataFrames

Given `df1`, `df2` and `df3` apply the following:

- merge df1 and df2 side by side
- merge df1 and df3 stacked
- merge all and reset index

In [None]:
df1 = pd.DataFrame({'name': ['ants', 'bees','wasps'] , 'order':['Hymenoptera']*3})
df2 = pd.DataFrame({'name': ['beetles', 'weevils'] , 'order':['Coleoptera']*2})
df3 = pd.DataFrame({'name': ['butterflies', 'moths'], 'order':['Lepidoptera']*2 })

In [None]:
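pd.concat([df1, df2], axis=1)                         # side by side: rows are aligned on the index
pd.concat([df1, df3], axis=0)                         # stacked: the original index labels are kept
pd.concat([df1, df2, df3], axis=0,ignore_index=True)  # stack all three and renumber the index 0..6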
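
`pd.concat` aligns on the index of the axis that is not being concatenated and keeps the labels of the axis that is. A short sketch that inspects the shape and index of each result (the variable names are made up for this illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['ants', 'bees', 'wasps'], 'order': ['Hymenoptera'] * 3})
df2 = pd.DataFrame({'name': ['beetles', 'weevils'], 'order': ['Coleoptera'] * 2})
df3 = pd.DataFrame({'name': ['butterflies', 'moths'], 'order': ['Lepidoptera'] * 2})

side_by_side = pd.concat([df1, df2], axis=1)    # df2 has no row 2, so that row gets NaN
stacked = pd.concat([df1, df3], axis=0)         # index labels 0, 1 repeat
renumbered = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

print(side_by_side.shape, side_by_side.columns.tolist())
print(stacked.index.tolist())
print(renumbered.index.tolist())
```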

### Missing values

Given the following DataFrame

In [None]:
df = pd.DataFrame(np.arange(25).reshape(5,5))

set the values to NaN to reproduce the following

In [None]:
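df = df.astype(float)     # NaN is a float; converting first avoids the upcasting warning (or error) in recent pandas versions
df.loc[0,0::2] = np.nan   # row 0, every second column
df.loc[::2,2] = np.nan    # every second row, column 2
df.loc[:,4] = np.nan      # all of column 4
df.loc[2,:] = np.nan      # all of row 2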

Apply the following on the dataframe with missing values created in the previous step.

Drop missing:
- rows with missing values
- columns with missing values
- rows where all values are missing
- columns where all values are missing

Fill missing:
- with 0
- with mean based on column values
- with median based on row values

In [None]:
# drop
df.dropna(axis=0)             # all rows have missing values
df.dropna(axis=1)             # all columns have missing values
df.dropna(axis=0, how='all')  # row 2 has all missing
df.dropna(axis=1, how='all')  # column 4 has all missing
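# fill
df.fillna(0)                      # every NaN becomes 0
df.fillna(df.mean(axis=0))        # the vertical axis, axis=0, is taken to get the column means
                                  # (column 4 is all NaN, has no mean and stays NaN)
df.T.fillna(df.median(axis=1)).T  # the horizontal axis, axis=1, is taken to get the row medians;
                                  # fillna matches a Series on column labels, hence the transposes
                                  # (row 2 is all NaN, has no median and stays NaN)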
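
`fillna` matches a Series argument on the column labels, so filling along rows needs some extra care. The sketch below, on a small made-up frame, fills with column means directly and with row medians through a transpose:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [np.nan, 5.0, 9.0],
                    'z': [2.0, 8.0, np.nan]})

# column means: the Series index ('x', 'y', 'z') matches the column labels
col_filled = toy.fillna(toy.mean(axis=0))

# row medians: transpose so that the rows become columns, fill, transpose back
row_filled = toy.T.fillna(toy.median(axis=1)).T

print(col_filled)
print(row_filled)
```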

### Natural gas consumption in the Netherlands

The dataset can be downloaded from [CBS Open data StatLine](https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS). A version is already included in the data directory of this session's git repository. We will be using this dataset in the exercises to prepare for visualisation later on.

We first read the data with `pd.read_csv`. Here we only select the columns `Periods` and `TotalSupply_1`:

In [None]:
cbs = pd.read_csv("data/00372eng_UntypedDataSet_17032023_161051.csv",sep=";")
df0 = cbs[['Periods','TotalSupply_1']].copy()

The column `Periods` holds the year (yyyy), followed by a tag {JJ,KW,MM} representing the yearly, quarterly and monthly terms respectively, and ends with two digits. The meaning of these two digits depends on the tag: for `JJ` they are always `00`, for `MM` they run over `01..12` for the twelve months and for `KW` over `01..04` for the four quarters. The column `TotalSupply_1` holds the natural gas consumption (MCM).

To get more control over the date ranges we need to split the string according to the pattern `YYYY{MM,KW,JJ}{00,...,12}`. A Series exposes several accessors, one of which is `pandas.Series.str` with the method `split`. With `regex=True`, `split` takes a [regular expression](https://docs.python.org/3/library/re.html) describing the pattern and splits each string wherever the pattern matches. Regular expressions fall beyond the scope of this course, so the solution is simply given here.

In [None]:
df = df0.Periods.str.split(r'(JJ|MM|KW)', regex=True, expand=True)  # expand=True forces the result into
                                                                    # a DataFrame
df = pd.DataFrame({'year': df[0].astype(int),                       # Create DataFrame {year,term,idx}
                        'term': df[1],
                        'idx': df[2].astype(int)})

df = pd.concat([df,cbs[['TotalSupply_1']]],axis=1)
df
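
To see what the pattern does in isolation, the sketch below applies it to three made-up period strings. It also shows `str.extract` with named groups as an alternative that labels the columns directly; that variant is an illustration and not part of the solution above.

```python
import pandas as pd

periods = pd.Series(['2020JJ00', '2020KW03', '2020MM11'])  # made-up examples

# the capture group (...) makes split keep the tag between the two pieces
parts = periods.str.split(r'(JJ|MM|KW)', regex=True, expand=True)

# alternative: one named group per piece, converted to integers afterwards
named = periods.str.extract(r'(?P<year>\d{4})(?P<term>JJ|KW|MM)(?P<idx>\d{2})')
named = named.astype({'year': int, 'idx': int})

print(parts)
print(named)
```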

1) Write a function that, given a Series with {year,term,idx}, returns a timestamp for the last day of the period according to the following specification:

```
JJ : yyyyJJ00 => 31-12-yyyy
KW : yyyyKWmm => where mm in {1,2,3,4}
                 01: 1-1-yyyy to 31-3-yyyy
                 02: 1-4-yyyy to 30-6-yyyy
                 03: 1-7-yyyy to 30-9-yyyy
                 04: 1-10-yyyy to 31-12-yyyy
MM : yyyyMMmm => dd-mm-yyyy where dd is the last day of the month and
                 mm in {1,..,12}
```

2) Create a new DataFrame called `ngc` (natural gas consumption) with three columns {term, date, consumption} :
- term : {JJ,KW,MM}
- date : timestamps as specified in the previous exercise
- consumption: which is `TotalSupply_1` only renamed

In [None]:
def last_day(ts):
    """
    Return the number of days in the month of a timestamp.

    The count is obtained by subtracting the timestamp from the same day in the next month, which
    results in a Timedelta of days. Only for the last month of the year do we need an exception:
    the next month then falls in the next year. The timestamp should lie on a day that exists in
    every month (to_ts always passes the first of the month).

    :param ts: Timestamp
    :return: Number of days in the month (int).
    """
    return (ts.replace(year=ts.year + (ts.month == 12), month=(ts.month % 12) + 1) - ts).days


def to_ts(s):
    """
    Return the timestamp of the last day of the period for each term {JJ,KW,MM}:

    JJ : yyyyJJ00 => 31-12-yyyy
    KW : yyyyKWqq => 01: 31-3-yyyy  (quarter 1-1-yyyy to 31-3-yyyy)
                     02: 30-6-yyyy  (quarter 1-4-yyyy to 30-6-yyyy)
                     03: 30-9-yyyy  (quarter 1-7-yyyy to 30-9-yyyy)
                     04: 31-12-yyyy (quarter 1-10-yyyy to 31-12-yyyy)
    MM : yyyyMMmm => dd-mm-yyyy where dd is the last day of month mm

    :param s: {year,term,idx}
    :return: Timestamp
    """
    year_, term, idx = s

    # To check which Python version is running (match/case further down needs >= 3.10):
    # import sys
    # sys.version

    if term == 'JJ':
        day_, month_ = 31, 12
        return pd.Timestamp(year=year_, month=month_, day=day_)
    elif term == 'KW':
        day_, month_ = [(31,3),(30,6),(30,9),(31,12)][idx-1]
        return pd.Timestamp(year=year_, month=month_, day=day_)
    elif term == 'MM':
        day_, month_ = last_day(pd.Timestamp(year=year_, month=idx, day=1)), idx
        return pd.Timestamp(year=year_, month=month_, day=day_)
    else:
        raise ValueError(f'invalid tag {term!r}, valid tags: {{JJ, KW, MM}} !')

    """ Alternative for if/elif/.../else.

    The construct match/case is available in Python versions >= 3.10. The following
    commented-out code is equivalent to the if/elif/.../else construct above that
    creates the timestamps (without the final else branch).

    """

    # match term:
    #     case 'JJ':
    #         day_, month_ = 31, 12
    #         return pd.Timestamp(year=year_, month=month_, day=day_)
    #     case 'KW':
    #         day_, month_ = [(31,3),(30,6),(30,9),(31,12)][idx-1]
    #         return pd.Timestamp(year=year_, month=month_, day=day_)
    #     case 'MM':
    #         day_, month_ = last_day(pd.Timestamp(year=year_, month=idx, day=1)), idx
    #         return pd.Timestamp(year=year_, month=month_, day=day_)


df['date'] = df[['year', 'term', 'idx']].apply(to_ts, axis=1)  # create the date variable, one row at a time
ngc = df[['term', 'date', 'TotalSupply_1']].copy()              # make a copy
ngc = ngc.rename(columns={'TotalSupply_1': 'consumption'})      # rename TotalSupply_1 to consumption
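
The month, quarter and year ends computed by hand above can be cross-checked against pandas' own date logic. A minimal sketch with `Timestamp.days_in_month` and the offsets `MonthEnd`, `QuarterEnd` and `YearEnd` (the example dates are arbitrary):

```python
import pandas as pd

first = pd.Timestamp(year=2020, month=2, day=1)

n_days = first.days_in_month                  # number of days in February 2020
month_end = first + pd.offsets.MonthEnd(0)    # rolls forward to the end of the month
quarter_end = pd.Timestamp(2021, 4, 1) + pd.offsets.QuarterEnd(0)
year_end = pd.Timestamp(2021, 1, 1) + pd.offsets.YearEnd(0)

print(n_days, month_end.date(), quarter_end.date(), year_end.date())
```

An offset with `n=0` leaves a date that already is a period end unchanged, so it is safe to apply to dates that may or may not be on the boundary.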

Validate the entries in the `ngc` DataFrame from the previous step:
- whether the sum of the consumption of 3 months equals the corresponding quarterly (KW) entry
- whether the sum of 4 quarters adds up to the yearly (JJ) entry

In [None]:
jj,kw,mm = [ngc.loc[ngc.term==t] for t in ['JJ','KW','MM']]

In [None]:
# freq='Y' groups per calendar year; pandas >= 2.2 prefers the alias 'YE' (year end) and warns about 'Y'
j = jj.groupby(pd.Grouper(key='date', freq='Y'))['consumption'].sum()       # yearly
q = kw.groupby(pd.Grouper(key='date', freq='Y'))['consumption'].sum()       # quarterly, summed per year
m = mm.groupby(pd.Grouper(key='date', freq='Y'))['consumption'].sum()[:-1]  # monthly, summed per year: remove last year (only two months)
j = j[q.index.min():] # remove years for which there are no corresponding months and quarters

In [None]:
(q==m).all()  # compare quarterly and monthly

In [None]:
(q==j).all()  # compare quarterly and yearly
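
Comparing sums with `==` is exact, which is fine for integer data but fragile if `consumption` happens to be stored as floats. A hedged alternative is a tolerance-based comparison with `np.isclose`; the toy numbers below are made up to show the difference:

```python
import numpy as np
import pandas as pd

# made-up yearly totals: one reported per quarter, one summed from months
quarterly = pd.Series([0.6, 1.5], index=[2021, 2022])
monthly_sum = pd.Series([0.1 + 0.2 + 0.3, 0.4 + 0.5 + 0.6], index=[2021, 2022])

exact = bool((quarterly == monthly_sum).all())           # sensitive to rounding error
close = bool(np.isclose(quarterly, monthly_sum).all())   # equal within a small tolerance

print("exact comparison:    ", exact)
print("tolerance comparison:", close)
```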