# Jupyter Notebooks

# Basics

## Operators
Operators are symbols such as "+", "-" and "*" that perform basic operations on values. Python also exposes each of them as an efficient function in the operator module. Along with variables, operators form the basis for all operations in Python. More can be found at https://docs.python.org/3/library/operator.html
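
As a small illustration of that mapping, the sketch below compares a few infix operators with their function counterparts in the standard-library `operator` module. The values `a` and `b` are arbitrary and chosen only for illustration.

```python
import operator

a, b = 7, 2

# Each infix operator has a function counterpart in the operator module
total = operator.add(a, b)          # same as a + b
difference = operator.sub(a, b)     # same as a - b
quotient = operator.truediv(a, b)   # same as a / b (always a float)
floor_q = operator.floordiv(a, b)   # same as a // b (rounds down)
remainder = operator.mod(a, b)      # same as a % b
power = operator.pow(a, b)          # same as a ** b

print(total, difference, quotient, floor_q, remainder, power)
# prints: 9 5 3.5 3 1 49
```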

## Data Types
Data is represented as different types in Python (and in pretty much every other programming language). The type of a value determines what kind of data it holds and which operations can be applied to it. Three basic types are integers, floats and strings. We assign a value to a variable using "=". More on Python data types can be found at https://docs.python.org/3/library/datatypes.html

## Printing
We can print the type and value of a variable using the built-in print and type functions together with an f-string.

In [1]:
x_int = 3
x_float = 1.1
x_str = "2"

print(f'x_int is {type(x_int)} and has a value of {x_int}')
print(f'x_float is {type(x_float)} and has a value of {x_float}')
print(f'x_str is {type(x_str)} and has a value of {x_str}')

x_int is <class 'int'> and has a value of 3
x_float is <class 'float'> and has a value of 1.1
x_str is <class 'str'> and has a value of 2


We can convert between types using the built-in Python functions (more on functions later). Notice how converting the float to an int drops everything after the decimal point (truncation toward zero) rather than rounding to the nearest integer.

In [2]:
x_int_str = str(x_int)
x_float_int = int(x_float)
x_str_int = int(x_str)

print(f'x_int_str is {type(x_int_str)} and has a value of {x_int_str}')
print(f'x_float_int is {type(x_float_int)} and has a value of {x_float_int}')
print(f'x_str_int is {type(x_str_int)} and has a value of {x_str_int}')

x_int_str is <class 'str'> and has a value of 3
x_float_int is <class 'int'> and has a value of 1
x_str_int is <class 'int'> and has a value of 2
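
The difference between truncation and rounding only shows up for values that are not close to an integer from above. The sketch below uses an arbitrary negative value to contrast `int`, `math.floor` and `round`, and also shows that `int` does not accept a string containing a decimal point directly.

```python
import math

value = -1.7  # arbitrary example value

truncated = int(value)        # drops the fractional part, moving toward zero
floored = math.floor(value)   # largest integer less than or equal to value
rounded = round(value)        # nearest integer

print(truncated, floored, rounded)
# prints: -1 -2 -2

# int() cannot parse a string with a decimal point directly
try:
    int("2.5")
    parsed_directly = True
except ValueError:
    parsed_directly = False

# Converting in two steps works: string -> float -> int
two_step = int(float("2.5"))
print(parsed_directly, two_step)
# prints: False 2
```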


We can also perform operations on different data types. It is important to note that Python promotes mixed numeric operations to the more general type. As an example, subtracting a float from an int implicitly converts the integer to a float and produces a float. This is not true for all languages, so it is generally good practice to convert data to the same type explicitly.

In [3]:
print(x_int - x_float)

1.9
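
To make the promotion rule concrete, the following sketch checks the result type of a few mixed operations. The variable names are illustrative only.

```python
# Result types of a few numeric operations
int_minus_float = 3 - 1.1    # int and float -> float
true_division = 6 / 3        # "/" always returns a float, even for two ints
floor_division = 7 // 2      # "//" on two ints stays an int
mixed_floor = 7 // 2.0       # "//" with a float operand returns a float
bool_plus_int = True + 1     # bool is a subclass of int, so this is an int

for name, result in [
    ('int_minus_float', int_minus_float),
    ('true_division', true_division),
    ('floor_division', floor_division),
    ('mixed_floor', mixed_floor),
    ('bool_plus_int', bool_plus_int),
]:
    print(f'{name} is {type(result).__name__} and has a value of {result}')
```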


However, implicit conversion is not always possible. In the following case, subtracting a string from an integer raises a TypeError, so the line is commented out:

In [4]:
#print(x_int - x_str)
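
One way to observe this failure without stopping the notebook is to wrap the operation in a try/except block. This is a minimal sketch that redefines the same values used above; the names `result` and `error_name` are introduced here for illustration.

```python
x_int = 3
x_str = "2"

try:
    result = x_int - x_str   # int - str is not defined
    error_name = None
except TypeError as err:
    result = None
    error_name = type(err).__name__
    print(f'{error_name}: {err}')
```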

We can fix this by explicitly converting the string to an integer.

In [5]:
print(x_int - int(x_str))

1


## More Complex Data Types

Besides integers, floats and strings, other more complex data types exist. Three of these are the tuple, the list and the dictionary. These can store multiple values, possibly of different types, together in a single variable.

In [6]:
x_tuple = (0,1,2)
print(f'x_tuple is {type(x_tuple)} and has a value of {x_tuple}')

x_tuple is <class 'tuple'> and has a value of (0, 1, 2)


It is important to note that not all types can be directly converted to other types. For example, converting a tuple to a float does not make sense. However, converting each element of a tuple to a float does make sense. 

In [7]:
#float(x_tuple)

However, we can apply the float type conversion to each element of a tuple using the map function and then convert the result back to a tuple.

In [8]:
tuple(map(float, x_tuple))

(0.0, 1.0, 2.0)
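
A generator expression is a common alternative to map for element-wise conversion. The sketch below, which redefines `x_tuple` so that it runs on its own, also confirms that converting the whole tuple at once fails.

```python
x_tuple = (0, 1, 2)

# Converting the tuple as a whole raises a TypeError
try:
    float(x_tuple)
    whole_conversion_failed = False
except TypeError:
    whole_conversion_failed = True

# Element-wise conversion with a generator expression
as_floats = tuple(float(v) for v in x_tuple)

# The result matches the map-based version
same_as_map = as_floats == tuple(map(float, x_tuple))

print(whole_conversion_failed, as_floats, same_as_map)
# prints: True (0.0, 1.0, 2.0) True
```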

In [9]:
# We can unpack the elements of a tuple into separate variables

x, y, z = x_tuple

print(f'x={x};y={y};z={z}')

x=0;y=1;z=2


In [10]:
# Or we can keep specific elements and discard the rest by assigning them to "_"

_, y, _ = x_tuple

print(f'y={y}')

y=1
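
Unpacking is not the only way to read a tuple. The sketch below shows that tuple elements can also be read by index or slice, and that tuples cannot be modified after creation. The names `first`, `last`, `tail`, `head` and `rest` are illustrative.

```python
x_tuple = (0, 1, 2)

first = x_tuple[0]     # indexing starts at 0
last = x_tuple[-1]     # negative indices count from the end
tail = x_tuple[1:]     # slicing returns a new tuple

# Starred unpacking collects the remaining elements into a list
head, *rest = x_tuple

print(first, last, tail, head, rest)
# prints: 0 2 (1, 2) 0 [1, 2]

# Tuples are immutable, so item assignment raises a TypeError
try:
    x_tuple[0] = 10
    message = None
except TypeError as err:
    message = str(err)

print(message)
```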


## Lists

Lists are ordered, mutable collections of elements. Like tuples, their elements can be accessed by an integer index starting at 0 (similar to an array); unlike tuples, lists can be modified after they are created.

In [11]:
x_list = [1, 2, 3]
print(f'x_list is {type(x_list)} and has a value of {x_list}')

x_list is <class 'list'> and has a value of [1, 2, 3]


In [12]:
print(f'The second element of the list has value {x_list[1]}')
print(f'The sum of the first and second elements of the list is {x_list[0]+x_list[1]}')
print(f'The length of the list is {len(x_list)}')

The second element of the list has value 2
The sum of the first and second elements of the list is 3
The length of the list is 3


In [13]:
# We can also add new elements to the end of a list using the append method

x_list.append(4)
print(f'x_list is {type(x_list)} and has a value of {x_list}')

# Elements can be removed by value as well (remove deletes the first match)

x_list.remove(4)
print(f'x_list is {type(x_list)} and has a value of {x_list}')

x_list is <class 'list'> and has a value of [1, 2, 3, 4]
x_list is <class 'list'> and has a value of [1, 2, 3]
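
A few other common list operations are sketched below on a separate list, so `x_list` itself is left unchanged for the later cells. The list `demo_list` and its contents are chosen only for illustration.

```python
demo_list = [1, 2, 3, 2]

demo_list.remove(2)          # removes the first 2 only -> [1, 3, 2]
after_remove = list(demo_list)

popped = demo_list.pop()     # removes and returns the last element -> [1, 3]
demo_list.insert(1, 2)       # inserts 2 at index 1 -> [1, 2, 3]
del demo_list[0]             # deletes by index -> [2, 3]
demo_list.extend([4, 5])     # appends several elements -> [2, 3, 4, 5]

middle = demo_list[1:3]      # slicing returns a new list

print(after_remove, popped, demo_list, middle)
# prints: [1, 3, 2] 2 [2, 3, 4, 5] [3, 4]
```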


## Dictionaries

Dictionaries map keys to values. Unlike lists, in which the index of each element is an integer that begins at 0 and increases by one for each element, the key (similar to an index in a list) for each value (element) can be of any hashable type, such as an integer, float, string or tuple. This provides a great deal of flexibility in data storage when using a dictionary.

In [14]:
x_dict = {0:'1', 1.0:'2', 'dog': 3}
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')

x_dict is <class 'dict'> and has a value of {0: '1', 1.0: '2', 'dog': 3}


In [15]:
# We can print all of the keys and values

print(f'x_dict has keys {x_dict.keys()}')
print(f'x_dict has values {x_dict.values()}')

x_dict has keys dict_keys([0, 1.0, 'dog'])
x_dict has values dict_values(['1', '2', 3])


In [16]:
# We can add new elements simply by specifying a new key:value pair, and remove them again with pop

x_dict['opensuse'] = 4
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')

x_dict.pop('opensuse')
print(f'x_dict is {type(x_dict)} and has a value of {x_dict}')

x_dict is <class 'dict'> and has a value of {0: '1', 1.0: '2', 'dog': 3, 'opensuse': 4}
x_dict is <class 'dict'> and has a value of {0: '1', 1.0: '2', 'dog': 3}
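
The sketch below shows a few further ways of working with a dictionary: membership tests, lookups with a fallback, iteration over key:value pairs, and the restriction that keys must be hashable. It redefines `x_dict` with the same contents so that it runs on its own; the key 'cat' and the fallback string are illustrative.

```python
x_dict = {0: '1', 1.0: '2', 'dog': 3}

has_dog = 'dog' in x_dict                  # membership is tested on keys
missing = x_dict.get('cat', 'not found')   # get returns a fallback instead of raising KeyError

# 1 and 1.0 compare equal and hash identically, so they refer to the same key
same_key = x_dict[1]

# items() yields (key, value) pairs
pairs = [(key, value) for key, value in x_dict.items()]

# Mutable objects such as lists are not hashable and cannot be used as keys
try:
    bad_dict = {[1, 2]: 'x'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(has_dog, missing, same_key, pairs, list_key_allowed)
```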


# Functions

Python is a high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Since then, it has become one of the most widely used languages in the world and is the *lingua franca* of data science. This notebook serves as a basic introduction to Python. It is neither a comprehensive overview of Python nor a deep technical dive into any particular topic. Its purpose is to provide the basic principles of Python and Jupyter needed to follow the other tutorials.

Python groups reusable code into functions, and collections of functions are distributed as packages. Packages extend functionality by providing new abilities and can improve performance by providing access to functions written in faster compiled languages (C/C++/Rust). Packages are accessed with the import command. The cells below first define and call a plain-Python function, calc_mean, which later sections compare against package implementations.

In [17]:
def calc_mean(x):
    x_sum = 0
    for x0 in x:
        x_sum += x0
    x_mean = x_sum / len(x)
    return x_mean

In [18]:
x_list_mean = calc_mean(x_list)
print(f'x_list has mean {x_list_mean}')

x_list has mean 2.0
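
Note that calc_mean divides by `len(x)`, so it raises a ZeroDivisionError for an empty input. The following is a minimal sketch of a more defensive variant using the built-in `sum` and a default argument. The name `calc_mean_safe` and the `default` parameter are hypothetical and not part of the original notebook.

```python
def calc_mean_safe(values, default=None):
    """Return the arithmetic mean of values, or default if values is empty.

    Hypothetical helper for illustration only.
    """
    values = list(values)      # allows any iterable, including generators
    if not values:
        return default         # avoid dividing by zero
    return sum(values) / len(values)


mean_of_list = calc_mean_safe([1, 2, 3])
mean_of_empty = calc_mean_safe([])
mean_with_default = calc_mean_safe([], default=0.0)

print(mean_of_list, mean_of_empty, mean_with_default)
# prints: 2.0 None 0.0
```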


# Packages

## NumPy

NumPy is a fundamental scientific computing package available in Python. Along with SciPy and Matplotlib, it forms a core group of scientific computing resources. For more information see https://numpy.org

In [19]:
import numpy as np
from time import process_time

In [20]:
n_samples = int(1e8)

x = np.random.random(n_samples)
t0 = process_time()
x_mean = calc_mean(x)
t1 = process_time()
print(f'> Python function returned value {x_mean} in {t1-t0} seconds')

t0 = process_time()
x_mean = np.mean(x)
t1 = process_time()
print(f'> Numpy function returned value {x_mean} in {t1-t0} seconds')

> Python function returned value 0.49996989021833904 in 13.5 seconds
> Numpy function returned value 0.4999698902181744 in 0.15625 seconds
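
The comparison above can be reproduced on a much smaller array, which runs in well under a second on most machines. This is a sketch under a few assumptions that differ from the cell above: it uses a seeded generator from `np.random.default_rng` for repeatability (the seed value is arbitrary), a smaller sample size, and `perf_counter` (wall-clock time) instead of `process_time` (CPU time). Exact timings depend on the machine, so none are shown.

```python
import numpy as np
from time import perf_counter

rng = np.random.default_rng(seed=0)   # arbitrary seed, for repeatability only
x_small = rng.random(100_000)         # far smaller than the 1e8 samples used above

# Pure-Python loop: one interpreter step per element
t0 = perf_counter()
loop_sum = 0.0
for value in x_small:
    loop_sum += value
loop_mean = loop_sum / len(x_small)
t1 = perf_counter()

# Vectorized NumPy call: the loop runs in compiled code
numpy_mean = x_small.mean()
t2 = perf_counter()

print(f'> Python loop returned {loop_mean} in {t1 - t0} seconds')
print(f'> NumPy mean returned {numpy_mean} in {t2 - t1} seconds')
print(f'> Results agree: {np.isclose(loop_mean, numpy_mean)}')
```

The two results can differ in the last few digits because NumPy sums the elements in a different order than the sequential loop, which is also visible in the outputs above.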


## Pandas

Pandas is a flexible data manipulation package for Python, built on top of NumPy. It is a staple of data manipulation in data science. For more information see https://pandas.pydata.org

In [21]:
import pandas as pd

In [22]:
df = pd.DataFrame(x, columns=['values'])
df

            values
0         0.066640
1         0.694262
2         0.542817
3         0.352289
4         0.957097
...            ...
99999995  0.712517
99999996  0.981184
99999997  0.698013
99999998  0.741049


In [23]:
t0 = process_time()
x_mean = df.mean()
t1 = process_time()
print(f'> Pandas function returned value {x_mean} in {t1-t0} seconds')

> Pandas function returned value values    0.49997
dtype: float64 in 0.296875 seconds
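
The output above spans two lines because calling `mean` on a DataFrame returns a Series with one entry per column. Selecting the column first returns a single number instead. The sketch below uses a tiny hand-written DataFrame, `df_small`, which is an illustrative stand-in for the large one above.

```python
import pandas as pd

df_small = pd.DataFrame({'values': [0.25, 0.5, 0.75]})

column_means = df_small.mean()            # Series: one mean per column
scalar_mean = df_small['values'].mean()   # single float for the selected column

print(type(column_means).__name__)
print(f'> Pandas function returned value {scalar_mean}')
```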


## Scikit-learn

Scikit-learn offers simple, flexible tools for machine learning and predictive analytics. It is the *de facto* standard package for machine learning in Python and provides tools to support all phases of the machine learning life cycle. For more information see https://scikit-learn.org/stable/

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [25]:
# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=3, n_redundant=0, n_repeated=0, n_classes=2, class_sep=1.5)
print(X)
print(y)

[[-2.69544515 -1.69341893  1.78504868]
 [ 0.69356288  1.46843332  1.25232292]
 [ 1.21260162 -1.31346584 -1.78104389]
 ...
 [-0.22831571  1.62102379  1.47278412]
 [-0.1637844   2.18549166 -1.60545869]
 [-0.37250378  1.55628482  1.19283399]]
[1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0
 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1
 0 1 1 1 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0
 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0
 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 0
 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1 0
 0 0 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 1
 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0
 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 0
 1 0 0 0 1

In [26]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y) 

In [27]:
# Instantiate the classifier
est = RandomForestClassifier()

In [28]:
# Fit the classifier to the training data
est.fit(X_train, y_train)

In [29]:
# Make a prediction on the test data
y_pred = est.predict(X_test)

In [30]:
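# Calculate AUC on the test data. Note that hard class predictions are used here;
# scores from predict_proba would generally give a more informative AUC.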
roc_auc_score(y_test, y_pred)

0.9767322992132984
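
AUC is usually computed from continuous scores rather than hard 0/1 predictions, because the ROC curve is traced by varying the decision threshold. The sketch below repeats the workflow above using `predict_proba` for the positive class. The `random_state=0` arguments are an assumption added here for repeatability and are not in the original cells; the resulting values will differ from the output above and are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same synthetic setup as above, with a fixed seed (assumption, for repeatability)
X, y = make_classification(n_samples=1000, n_features=3, n_redundant=0,
                           n_repeated=0, n_classes=2, class_sep=1.5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

est = RandomForestClassifier(random_state=0)
est.fit(X_train, y_train)

# Column 1 of predict_proba holds the predicted probability of class 1
y_score = est.predict_proba(X_test)[:, 1]

auc_from_scores = roc_auc_score(y_test, y_score)
auc_from_labels = roc_auc_score(y_test, est.predict(X_test))

print(f'AUC from probabilities: {auc_from_scores}')
print(f'AUC from hard labels:   {auc_from_labels}')
```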