# Jupyter Notebooks

# Basics

## Operators
Operators are symbols such as +, -, * and / that perform basic operations on values. Along with variables, they form the basis of most expressions in Python. Each operator also has an equivalent function in the standard operator module. More can be found at https://docs.python.org/3/library/operator.html
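
As a minimal sketch of the relationship between operators and the operator module, the cell below calls a few module functions next to their infix equivalents. The variable names (a, b and the *_result names) are chosen here for illustration only.

```python
import operator

a, b = 7, 2

# Each infix operator has a function equivalent in the operator module
add_result = operator.add(a, b)            # same as a + b
sub_result = operator.sub(a, b)            # same as a - b
floordiv_result = operator.floordiv(a, b)  # same as a // b
pow_result = operator.pow(a, b)            # same as a ** b

print(add_result, sub_result, floordiv_result, pow_result)
```

The function forms are mostly useful when an operation has to be passed around as a value, for example as an argument to another function.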

## Data Types
Data is represented as different types in Python (as in nearly every other programming language). A variable's type determines what kind of value it holds and which operations apply to it. Three basic types are integers, floats and strings. We assign a value to a variable using the "=" operator. More on Python data types can be found at https://docs.python.org/3/library/datatypes.html

## Printing
The print function displays the value of a variable. Here an f-string is used to show both the type and the value of each variable.

In [4]:
x_int = 3
x_float = 1.1
x_str = "2"

print(f'x_int is {type(x_int)} and has a value of {x_int}')
print(f'x_float is {type(x_float)} and has a value of {x_float}')
print(f'x_str is {type(x_str)} and has a value of {x_str}')

x_int is <class 'int'> and has a value of 3
x_float is <class 'float'> and has a value of 1.1
x_str is <class 'str'> and has a value of 2


We can convert between types using the built-in Python functions (more on functions later). Notice that converting the float to an int does not round the value: it truncates it, discarding everything after the decimal point.

In [5]:
x_int_str = str(x_int)
x_float_int = int(x_float)
x_str_int = int(x_str)

print(f'x_int_str is {type(x_int_str)} and has a value of {x_int_str}')
print(f'x_float_int is {type(x_float_int)} and has a value of {x_float_int}')
print(f'x_str_int is {type(x_str_int)} and has a value of {x_str_int}')

x_int_str is <class 'str'> and has a value of 3
x_float_int is <class 'int'> and has a value of 1
x_str_int is <class 'int'> and has a value of 2
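
To make the difference between truncation and rounding concrete, here is a small sketch comparing int, round and math.floor on a few values. The list name values and the result names are illustrative, not part of the original notebook.

```python
import math

values = [1.9, -1.9, 2.5]

for v in values:
    print(f'{v}: int -> {int(v)}, round -> {round(v)}, floor -> {math.floor(v)}')

truncated = [int(v) for v in values]       # drops the fractional part (toward zero)
rounded = [round(v) for v in values]       # nearest integer, ties go to the even one
floored = [math.floor(v) for v in values]  # largest integer <= v
```

Note that int and math.floor agree for positive numbers but differ for negative ones, and that round(2.5) gives 2 because Python rounds ties to the nearest even integer.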


We can also perform operations on different data types. It is important to note that Python converts mixed numeric operations up to the more general type. As an example, subtracting a float from an int implicitly converts the integer to a float and produces a float. This is not true for all languages, so it is generally best practice to convert all data to the same type explicitly.

In [6]:
print(x_int - x_float)

1.9
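
A few more cases of implicit numeric conversion, as a hedged sketch (the variable names are illustrative only):

```python
int_minus_float = 3 - 1.1   # int and float -> float
int_div_int = 3 / 2         # true division always returns a float
int_floordiv_int = 3 // 2   # floor division of two ints stays an int
int_plus_bool = 3 + True    # bool is a subclass of int, so True acts as 1

for name, value in [('int_minus_float', int_minus_float),
                    ('int_div_int', int_div_int),
                    ('int_floordiv_int', int_floordiv_int),
                    ('int_plus_bool', int_plus_bool)]:
    print(f'{name} is {type(value)} and has a value of {value}')
```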


However, implicit conversion is not always possible, as in the following case:

In [7]:
print(x_int - x_str)

TypeError: unsupported operand type(s) for -: 'int' and 'str'

We can fix this by explicitly converting the string to an integer:

In [None]:
print(x_int - int(x_str))

## More Complex Data Types

Besides integers, floats and strings, more complex data types exist. Three of these are the tuple, the list and the dictionary. Each can store multiple values, possibly of different types, together in a single variable.

In [None]:
x_tuple = (0,1,2)
print(f'x_tuple is {type(x_tuple)} and has a value of {x_tuple}')

It is important to note that not all types can be directly converted to other types. For example, converting a tuple to a float does not make sense. However, converting each element of a tuple to a float does make sense. 

In [None]:
float(x_tuple)  # raises a TypeError: a tuple cannot be converted to a single float

We can instead apply the float conversion to each element of the tuple using the map function, and then convert the result back to a tuple.

In [None]:
tuple(map(float, x_tuple))
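
The same idea can be sketched with a generator expression instead of map, with the failing whole-tuple conversion caught explicitly. The names demo_tuple, conversion_error and as_floats are illustrative; the exact wording of the error message depends on the Python version.

```python
demo_tuple = (0, 1, 2)

# Converting the whole tuple at once fails
try:
    float(demo_tuple)
except TypeError as err:
    conversion_error = str(err)
    print(f'Conversion failed: {conversion_error}')

# Converting element by element works
as_floats = tuple(float(v) for v in demo_tuple)
print(as_floats)
```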

In [None]:
# We can unpack the elements of a tuple into separate variables

x, y, z = x_tuple

print(f'x={x};y={y};z={z}')

In [8]:
# Or we can keep only specific elements, using "_" as a placeholder for the ones we ignore

_, y, _ = x_tuple

print(f'y={y}')


## Lists

Lists are ordered collections of elements. The tuple above was unpacked by assignment; list elements are usually accessed by an integer index (similar to an array), which tuples also support. Unlike tuples, lists can be modified after they are created.

In [None]:
x_list = [1, 2, 3]
print(f'x_list is {type(x_list)} and has a value of {x_list}')

In [None]:
print(f'The second element of the list has value {x_list[1]}')
print(f'The sum of the first and second elements of the list is {x_list[0]+x_list[1]}')
print(f'The length of the list is {len(x_list)}')

In [None]:
# We can also add new elements to a list using the append method

x_list.append(4)
print(f'x_list is {type(x_list)} and has a value of {x_list}')

# Elements can be deleted as well; remove deletes the first element equal to the given value

x_list.remove(4)
print(f'x_list is {type(x_list)} and has a value of {x_list}')
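
A short sketch of a few more list operations, and of the contrast with tuples. The names demo_list and demo_tuple are illustrative and kept separate from the variables used in the notebook.

```python
demo_list = [1, 2, 3, 2]

last_item = demo_list[-1]   # negative indexes count from the end
first_two = demo_list[:2]   # slicing returns a new list

demo_list.remove(2)         # removes only the first element equal to 2
print(demo_list)            # [1, 3, 2]

popped = demo_list.pop(0)   # removes by index and returns the element
print(demo_list)            # [3, 2]

demo_tuple = (0, 1, 2)
second = demo_tuple[1]      # tuples support indexing too

try:
    demo_tuple[1] = 10      # but tuples cannot be modified in place
except TypeError as err:
    print(f'Tuples cannot be modified: {err}')
```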

## Dictionaries

Dictionaries store a mapping of key:value pairs. In a list, the index of each element is an integer that begins at 0 and increases by one for each element. In a dictionary, the key for each value (similar to an index in a list) can be of almost any type, as long as it is hashable (for example ints, floats, strings and tuples, but not lists). This provides a great deal of flexibility in data storage.

In [None]:
x_dict = {0:'1', 1.0:'2', 'dog': 3}
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')

In [9]:
# We can print all of the keys and values

print(f'x_dict has keys {x_dict.keys()}')
print(f'x_dict has values {x_dict.values()}')


In [None]:
# We can add new elements simply by specifying a new key:value pair

x_dict['opensuse'] = 4
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')

# Elements can be removed by key using pop, which also returns the removed value
x_dict.pop('opensuse')
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')
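
The sketch below illustrates a few dictionary behaviors that are easy to trip over: keys that compare equal are treated as the same key, get can supply a default for a missing key, and unhashable objects such as lists cannot be used as keys. The name demo_dict and its contents are made up for this example.

```python
demo_dict = {0: 'zero', 'dog': 3}

# 1 and 1.0 compare equal and hash equally, so they refer to the same entry
demo_dict[1] = 'int one'
demo_dict[1.0] = 'float one'   # overwrites the value stored under key 1
print(demo_dict)

# get returns a default instead of raising KeyError for a missing key
missing = demo_dict.get('cat', 'not found')

# Lists are not hashable, so they cannot be used as keys
try:
    demo_dict[[1, 2]] = 'list key'
except TypeError as err:
    print(f'Invalid key: {err}')

# items() iterates over key:value pairs
for key, value in demo_dict.items():
    print(f'{key!r} -> {value!r}')
```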

# Functions

Python is a high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Since then, it has become one of the most widely used languages in the world and is the *lingua franca* of data science. This notebook is a basic introduction to Python. It is neither a comprehensive overview of the language nor a deep technical dive into any particular topic. Its purpose is to cover the basic principles of Python and Jupyter needed for the other tutorials.

Python extends its functionality through packages, which are loaded with the import command. Packages add new capabilities and can improve performance by providing access to functions written in faster languages (C/C++/Rust).

Functions group reusable statements under a name. They are defined with the def keyword, take arguments as input and hand back a result with the return statement. The function below computes the mean of a sequence using an explicit loop.

In [None]:
def calc_mean(x):
    x_sum = 0
    for x0 in x:
        x_sum += x0
    x_mean = x_sum / len(x)
    return x_mean

In [None]:
x_list_mean = calc_mean(x_list)
print(f'x_list has mean {x_list_mean}')
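
As a hedged variation on the function above, the sketch below adds a docstring, a default argument and a guard against empty input (for which the original loop version would divide by zero). The name calc_mean_safe and the default parameter are hypothetical and not part of the notebook.

```python
def calc_mean_safe(values, default=None):
    """Return the arithmetic mean of values, or default if values is empty."""
    if len(values) == 0:
        return default
    # The built-in sum replaces the explicit accumulation loop
    return sum(values) / len(values)


mean_of_list = calc_mean_safe([1, 2, 3])
mean_of_empty = calc_mean_safe([], default=0.0)

print(f'mean_of_list={mean_of_list}; mean_of_empty={mean_of_empty}')
```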

# Packages

## NumPy

NumPy is the fundamental package for scientific computing in Python. Along with SciPy and Matplotlib, it forms a core group of scientific computing resources. The cells below time the pure Python calc_mean function against NumPy's built-in mean on the same array. For more information see https://numpy.org

In [20]:
import numpy as np
from time import process_time

In [None]:
n_samples = int(1e8)

x = np.random.random(n_samples)
t0 = process_time()
x_mean = calc_mean(x)
t1 = process_time()
print(f'> Python function returned value {x_mean} in {t1-t0} seconds')

t0 = process_time()
x_mean = np.mean(x)
t1 = process_time()
print(f'> Numpy function returned value {x_mean} in {t1-t0} seconds')
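
Much of NumPy's speed comes from operating on whole arrays at once rather than looping over elements in Python. The sketch below shows a few such vectorized operations on a small array; the seed, the array size and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded generator so the run is repeatable
samples = rng.random(1_000)           # 1,000 uniform values in [0, 1)

squared = samples ** 2                # elementwise operation, no explicit loop
above_half = samples[samples > 0.5]   # boolean mask selects matching elements

loop_mean = sum(samples) / len(samples)   # pure Python accumulation
numpy_mean = samples.mean()               # vectorized equivalent

print(f'loop mean {loop_mean:.6f}, numpy mean {numpy_mean:.6f}')
print(f'{above_half.size} of {samples.size} samples are above 0.5')
```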

## Pandas

Pandas is a flexible data manipulation library for Python, built on top of NumPy. It is a staple of data manipulation in data science. For more information see https://pandas.pydata.org

In [48]:
import pandas as pd
import scipy.stats  # import the stats submodule explicitly; it is used as scipy.stats below

In [46]:
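def pad_array(x0, cutoff=3):
    """Return the last `cutoff` values of x0, left-padded with zeros if x0 is shorter."""
    x_new = np.zeros(cutoff)
    n = min(cutoff, len(x0))   # number of values to copy
    i = cutoff - n             # start position in the padded output
    ii = len(x0) - n           # start position in the input
    x_new[i:] = x0[ii:]
    return x_new

def q1(x):
    """First quartile (25th percentile) of x."""
    return np.quantile(x, 0.25)

def q3(x):
    """Third quartile (75th percentile) of x."""
    return np.quantile(x, 0.75)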

In [11]:
df = pd.DataFrame(x, columns=['values'])
df


In [None]:
t0 = process_time()
x_mean = df['values'].mean()  # select the column so the result is a single number rather than a Series
t1 = process_time()
print(f'> Pandas function returned value {x_mean} in {t1-t0} seconds')
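
A few basic DataFrame operations, sketched on a small frame so the cell runs quickly. The frame demo_df, its size and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
demo_df = pd.DataFrame({'values': rng.random(100)})

column_mean = demo_df['values'].mean()   # mean of one column: a single number
frame_mean = demo_df.mean()              # mean of every column: a Series

large_rows = demo_df[demo_df['values'] > 0.9]   # boolean filtering of rows
summary = demo_df['values'].describe()          # count, mean, std, quartiles, ...

print(f'column mean: {column_mean:.4f}')
print(f'{len(large_rows)} rows have a value above 0.9')
print(summary)
```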

### Groupby and Agg

The pandas groupby function groups together rows that share a value in one or more columns. As an example of its usefulness, consider data collection where patient vital signs are recorded by patient identifier, value and timestamp. This produces multiple entries for a single patient. To collect all of the data for a patient into a single row, the groupby function can be used.

Using groupby returns a grouped object, which needs further processing before it can be used. Rows are collected together based on the column(s) specified, but we still need to specify which operations create the entries for each group. This is done with the agg function, which aggregates all of the entries in a group. Users can specify built-in functions (like mean, median, etc.) or pass user-defined functions for custom transformations.

For more information see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html

In [32]:
test_df = pd.DataFrame({'Patient': ['001', '002','001', '001','002','001','003'],
                        'Timestamp': [
                           '01/01/1970 07:00:00','01/01/1970 07:10:00','01/01/1970 07:30:00',
                           '01/01/1970 08:00:00','01/01/1970 07:40:00','01/01/1970 08:30:00','01/01/1970 11:00:00'],
                       'Value Type':['HR','HR','HR','HR','HR','HR','HR'],
                       'Value':['80','57','82','77','61','72','91']
                  })
test_df = test_df.astype({'Value':'float32','Timestamp':'datetime64[ns]'})
test_df

| | Patient | Timestamp | Value Type | Value |
|---|---|---|---|---|
| 0 | 001 | 1970-01-01 07:00:00 | HR | 80.0 |
| 1 | 002 | 1970-01-01 07:10:00 | HR | 57.0 |
| 2 | 001 | 1970-01-01 07:30:00 | HR | 82.0 |
| 3 | 001 | 1970-01-01 08:00:00 | HR | 77.0 |
| 4 | 002 | 1970-01-01 07:40:00 | HR | 61.0 |
| 5 | 001 | 1970-01-01 08:30:00 | HR | 72.0 |
| 6 | 003 | 1970-01-01 11:00:00 | HR | 91.0 |


In [45]:
test_df.groupby('Patient',as_index=False).agg({'Value':'mean'})

| | Patient | Value |
|---|---|---|
| 0 | 001 | 77.75 |
| 1 | 002 | 59.0 |
| 2 | 003 | 91.0 |
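
When several statistics are needed, pandas also supports named aggregation, where each output column is given a name together with the input column and the function to apply. The sketch below rebuilds a reduced version of the heart rate data; the frame name demo_vitals and the output column names are illustrative.

```python
import pandas as pd

demo_vitals = pd.DataFrame({
    'Patient': ['001', '002', '001', '001', '002', '001', '003'],
    'Value': [80.0, 57.0, 82.0, 77.0, 61.0, 72.0, 91.0],
})

# Each keyword becomes an output column: name=(input column, aggregation)
hr_summary = demo_vitals.groupby('Patient', as_index=False).agg(
    mean_hr=('Value', 'mean'),
    max_hr=('Value', 'max'),
    n_readings=('Value', 'count'),
)

print(hr_summary)
```

Named aggregation produces flat column names directly, which avoids having to join multi-level column labels afterwards.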


In [49]:
# We can also pass a list of functions like so

f_list = [
    'mean', 
    np.median, 
    scipy.stats.mode, 
    np.std, 
    np.var,
    scipy.stats.kurtosis,
    scipy.stats.skew,
    np.sum,
    np.min,
    np.max,
    scipy.stats.entropy,
    scipy.stats.variation,
    q1,
    q3
    ]

test_group = test_df.groupby('Patient',as_index=False).agg({'Value':f_list})

test_group.columns = test_group.columns.map('_'.join)
test_group.reset_index(drop=True,inplace=True)
test_group

| | Patient_ | Value_mean | Value_median | Value_mode | Value_std | Value_var | Value_kurtosis | Value_skew | Value_sum | Value_amin | Value_amax | Value_entropy | Value_variation | Value_q1 | Value_q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 001 | 77.75 | 78.5 | (72.0, 1) | 4.349329 | 18.916658 | -1.204875 | -0.478933 | 311.0 | 72.0 | 82.0 | 1.385111 | 0.048445 | 75.75 | 80.5 |
| 1 | 002 | 59.0 | 59.0 | (57.0, 1) | 2.828427 | 8.0 | -2.0 | 0.0 | 118.0 | 57.0 | 61.0 | 0.692572 | 0.033898 | 58.0 | 60.0 |
| 2 | 003 | 91.0 | 91.0 | (91.0, 1) | NaN | NaN | NaN | NaN | 91.0 | 91.0 | 91.0 | 0.0 | 0.0 | 91.0 | 91.0 |


In [42]:
group_df = test_df.groupby('Patient',as_index=False).agg({'Value':list})
group_df

| | Patient | Value |
|---|---|---|
| 0 | 001 | [80.0, 82.0, 77.0, 72.0] |
| 1 | 002 | [57.0, 61.0] |
| 2 | 003 | [91.0] |


### Apply

To transform a column, we can also use the apply function. Combined with a lambda (a short anonymous function), it lets us apply a user-defined transformation to each element of a column.

For more information see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

In [43]:
group_df['Value_pad'] = group_df.Value.apply(lambda x: pad_array(x, cutoff=3))
group_df

| | Patient | Value | Value_pad |
|---|---|---|---|
| 0 | 001 | [80.0, 82.0, 77.0, 72.0] | [82.0, 77.0, 72.0] |
| 1 | 002 | [57.0, 61.0] | [0.0, 57.0, 61.0] |
| 2 | 003 | [91.0] | [0.0, 0.0, 91.0] |
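
apply accepts any callable, not only a lambda. The sketch below applies a built-in function, a named user-defined function and a lambda to a column of lists. The frame demo_group mirrors the grouped data above, and the helper value_range is hypothetical.

```python
import pandas as pd

demo_group = pd.DataFrame({
    'Patient': ['001', '002', '003'],
    'Value': [[80.0, 82.0, 77.0, 72.0], [57.0, 61.0], [91.0]],
})


def value_range(values):
    """Difference between the largest and smallest reading."""
    return max(values) - min(values)


demo_group['n_values'] = demo_group['Value'].apply(len)            # built-in function
demo_group['range'] = demo_group['Value'].apply(value_range)       # named function
demo_group['last'] = demo_group['Value'].apply(lambda v: v[-1])    # lambda

print(demo_group)
```

A named function is usually easier to test and reuse than a lambda once the transformation grows beyond a single expression.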


## Scikit Learn

Scikit-Learn offers simple, flexible tools for machine learning and predictive analytics. It is the *de facto* standard package for classical machine learning in Python and provides tools to support all phases of the machine learning life cycle. For more information see https://scikit-learn.org/stable/

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [25]:
# Create a fake dataset
X, y = make_classification(n_samples=1000, n_features=3, n_redundant=0, n_repeated=0, n_classes=2, class_sep=1.5)
print(X)
print(y)

[[-0.01941594  1.2275616  -1.51232478]
 [ 0.69940899 -0.15222557  0.06109745]
 [ 1.2817935  -0.0572301  -1.4481825 ]
 ...
 [ 1.35207759  0.79044877  1.11381198]
 [ 2.79167392  0.40681725  4.19002012]
 [-1.4400929   1.63315846  2.18916863]]
[0 1 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1
 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1
 0 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0
 1 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 1
 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1
 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1
 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1 0 1
 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 0
 1 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1
 1 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 1
 0 1 1 1 1

In [26]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y) 

In [27]:
# Instantiate the classifier
est = RandomForestClassifier()

In [28]:
# Fit the classifier to the training data
est.fit(X_train, y_train)

In [29]:
# Make a prediction on the test data
y_pred = est.predict(X_test)

In [30]:
# Calculate AUC on the test data (note: y_pred holds hard class labels, not probabilities)
roc_auc_score(y_test, y_pred)

0.9900217787457221
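
ROC AUC is normally computed from scores or probabilities, so that every decision threshold contributes to the curve. The sketch below repeats the workflow above using predict_proba and, as an added assumption, fixes random_state so the run is repeatable. The variable names are illustrative, and the resulting scores will differ from the value shown above because the data is regenerated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same settings as above, plus a fixed seed (an assumption added for repeatability)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_redundant=0, n_repeated=0,
    n_classes=2, class_sep=1.5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_tr, y_tr)

# Probability of the positive class (column 1) for each test sample
proba_pos = clf.predict_proba(X_te)[:, 1]

auc_from_proba = roc_auc_score(y_te, proba_pos)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

print(f'AUC from probabilities: {auc_from_proba:.4f}')
print(f'AUC from hard labels:   {auc_from_labels:.4f}')
```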