# Jupyter Notebook

Jupyter is an interactive wrapper around Python code, and it comes with a lot of convenience tools. There are two main modes you can be in:

- Edit mode: where you type code or markdown into a cell
- Command mode: where you select cells and run them

Every Jupyter session starts fresh, so you will need to run the cells when you open the notebook. You can either run each cell individually by pressing SHIFT+Enter on the highlighted cell, or run all cells at once. The Cell menu at the top gives you these options. Explore the options in that menu; there are not many of them, but you will use them often.

# Bringing data into Python: The Difficult Way

Jorge has been working on a problem where he takes point data generated by Mimics and computes distance information from it. The first step is to read the file in and parse out the numerical data. We will do that using Python's built-in file objects.

In [4]:
f = open('./data/RCA_Test_Spline.txt', 'r') #I have saved a file in the data folder.
#'r' opens the file for reading. Files can be opened for writing too,
#but we won't need that here and will rarely ever need it.

The file object here is something like a promise to read a document. You can read the whole thing at once or read it line by line. If you read the first line, the next call to readline returns the second line. In this way, a file object both returns values to you and changes its own state as it does so.
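
The statefulness described above can be seen without touching the disk. This is a minimal sketch that assumes an in-memory `io.StringIO` object as a stand-in for a real file; the three-line sample text and the variable names are made up for illustration.

```python
import io

# Hypothetical three-line stand-in for a file on disk.
fake_file = io.StringIO("first\nsecond\nthird\n")

line_a = fake_file.readline()   # consumes the first line
line_b = fake_file.readline()   # the same call now returns the second line
rest = fake_file.read()         # everything that has not been read yet
leftover = fake_file.read()     # nothing is left, so this is an empty string

fake_file.seek(0)               # rewind to the start instead of reopening
again = fake_file.readline()    # back at the first line

print(repr(line_a), repr(line_b), repr(rest), repr(leftover), repr(again))
```

Rewinding with `seek(0)` is one alternative to closing and reopening the file when you want to read it a second time.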

In [5]:
print(f.readline())

Legend



In [6]:
print(f.readline())




In [7]:
print(f.read()) #Notice how this does not show 'Legend' and
#'=====' again, because those lines have already been read


Spline:
Name:          Name of the spline
Xp,Yp,Zp:      Coordinates of point P on spline

Data
====

Spline:
Name: RCA
X0 Y0 Z0: 27.31  12.30  18.66 
X1 Y1 Z1: 26.40  13.08  19.79 
X2 Y2 Z2: 24.85  14.08  21.45 
X3 Y3 Z3: 23.65  14.55  22.67 
X4 Y4 Z4: 22.20  15.08  24.27 
X5 Y5 Z5: 20.75  15.38  25.75 
X6 Y6 Z6: 19.00  15.29  26.93 
X7 Y7 Z7: 17.32  15.12  28.21 
X8 Y8 Z8: 15.74  14.86  29.15 
X9 Y9 Z9: 13.90  12.97  29.75 
X10 Y10 Z10: 12.23  12.48  30.84 
X11 Y11 Z11: 10.02  11.84  32.11 
X12 Y12 Z12: 7.04   10.40  34.03 
X13 Y13 Z13: 4.97   8.80   35.26 
X14 Y14 Z14: 2.77   6.61   35.89 
X15 Y15 Z15: 0.58   4.38   36.66 
X16 Y16 Z16: -1.45  3.06   37.11 
X17 Y17 Z17: -4.09  0.52   37.05 
X18 Y18 Z18: -6.47  -0.76  37.95 
X19 Y19 Z19: -7.70  -4.25  37.12 
X20 Y20 Z20: -10.36 -6.29  38.10 
X21 Y21 Z21: -11.24 -9.54  36.88 
X22 Y22 Z22: -12.26 -11.09 36.93 
X23 Y23 Z23: -14.42 -11.16 36.55 
X24 Y24 Z24: -15.76 -12.52 36.12 
X25 Y25 Z25: -15.99 -14.75 34.92 
X26 Y26 Z26: -17.34 -16.1

In [8]:
f.read() #There is nothing more left to read

''

In [9]:
f.close() #Let's close the file before we read it again, so that
#python can free any resources it is still reserving for the file

In [10]:
f = open('./data/RCA_Test_Spline.txt', 'r')

In [11]:
#There are a lot of ways we could go about picking the lines we need.
#Using regular expressions is one possibility.
#We are going to take a naive approach and just take the lines that have
#X's in them
ls = []
for line in f:
    #Iterating over a file object yields one line at a time, like repeated readline() calls
    if 'X' in line:
        ls.append(line)

For loops are great and everything, but I don't like them very much. I replace them with list comprehensions whenever I can, because the result looks much cleaner and I end up creating fewer variables. So let's do that instead.

In [37]:
f = open('./data/RCA_Test_Spline.txt', 'r')
points = [line for line in f if 'X' in line]
f.close()
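
A common alternative to pairing `open` with `close` is a `with` block, which closes the file automatically when the block ends. The sketch below is not what the notebook runs; it assumes a small temporary file with made-up contents so that it is self-contained.

```python
import os
import tempfile

# Write a small hypothetical file so the example does not depend on ./data.
sample = "Legend\nXp,Yp,Zp: header\nX0 Y0 Z0: 1.0 2.0 3.0\nX1 Y1 Z1: 4.0 5.0 6.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# The file is closed automatically when the with block ends,
# even if an exception is raised inside it.
with open(path, "r") as f:
    points = [line for line in f if "X" in line]

was_closed = f.closed   # True once the with block has finished
print(was_closed)
print(points)

os.remove(path)         # clean up the temporary file
```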

In [38]:
points

['Xp,Yp,Zp:      Coordinates of point P on spline\n',
 'X0 Y0 Z0: 27.31  12.30  18.66 \n',
 'X1 Y1 Z1: 26.40  13.08  19.79 \n',
 'X2 Y2 Z2: 24.85  14.08  21.45 \n',
 'X3 Y3 Z3: 23.65  14.55  22.67 \n',
 'X4 Y4 Z4: 22.20  15.08  24.27 \n',
 'X5 Y5 Z5: 20.75  15.38  25.75 \n',
 'X6 Y6 Z6: 19.00  15.29  26.93 \n',
 'X7 Y7 Z7: 17.32  15.12  28.21 \n',
 'X8 Y8 Z8: 15.74  14.86  29.15 \n',
 'X9 Y9 Z9: 13.90  12.97  29.75 \n',
 'X10 Y10 Z10: 12.23  12.48  30.84 \n',
 'X11 Y11 Z11: 10.02  11.84  32.11 \n',
 'X12 Y12 Z12: 7.04   10.40  34.03 \n',
 'X13 Y13 Z13: 4.97   8.80   35.26 \n',
 'X14 Y14 Z14: 2.77   6.61   35.89 \n',
 'X15 Y15 Z15: 0.58   4.38   36.66 \n',
 'X16 Y16 Z16: -1.45  3.06   37.11 \n',
 'X17 Y17 Z17: -4.09  0.52   37.05 \n',
 'X18 Y18 Z18: -6.47  -0.76  37.95 \n',
 'X19 Y19 Z19: -7.70  -4.25  37.12 \n',
 'X20 Y20 Z20: -10.36 -6.29  38.10 \n',
 'X21 Y21 Z21: -11.24 -9.54  36.88 \n',
 'X22 Y22 Z22: -12.26 -11.09 36.93 \n',
 'X23 Y23 Z23: -14.42 -11.16 36.55 \n',
 'X24 Y24 Z24: -

In [39]:
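#That worked fine. Let's get rid of the legend line (points[0]) and
#strip the trailing whitespace and newline from every remaining line.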
stripped_points = [point.strip() for point in points[1:]]
stripped_points

['X0 Y0 Z0: 27.31  12.30  18.66',
 'X1 Y1 Z1: 26.40  13.08  19.79',
 'X2 Y2 Z2: 24.85  14.08  21.45',
 'X3 Y3 Z3: 23.65  14.55  22.67',
 'X4 Y4 Z4: 22.20  15.08  24.27',
 'X5 Y5 Z5: 20.75  15.38  25.75',
 'X6 Y6 Z6: 19.00  15.29  26.93',
 'X7 Y7 Z7: 17.32  15.12  28.21',
 'X8 Y8 Z8: 15.74  14.86  29.15',
 'X9 Y9 Z9: 13.90  12.97  29.75',
 'X10 Y10 Z10: 12.23  12.48  30.84',
 'X11 Y11 Z11: 10.02  11.84  32.11',
 'X12 Y12 Z12: 7.04   10.40  34.03',
 'X13 Y13 Z13: 4.97   8.80   35.26',
 'X14 Y14 Z14: 2.77   6.61   35.89',
 'X15 Y15 Z15: 0.58   4.38   36.66',
 'X16 Y16 Z16: -1.45  3.06   37.11',
 'X17 Y17 Z17: -4.09  0.52   37.05',
 'X18 Y18 Z18: -6.47  -0.76  37.95',
 'X19 Y19 Z19: -7.70  -4.25  37.12',
 'X20 Y20 Z20: -10.36 -6.29  38.10',
 'X21 Y21 Z21: -11.24 -9.54  36.88',
 'X22 Y22 Z22: -12.26 -11.09 36.93',
 'X23 Y23 Z23: -14.42 -11.16 36.55',
 'X24 Y24 Z24: -15.76 -12.52 36.12',
 'X25 Y25 Z25: -15.99 -14.75 34.92',
 'X26 Y26 Z26: -17.34 -16.15 34.48',
 'X27 Y27 Z27: -17.72 -17.70 33

In [40]:
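[point.split() for point in stripped_points] #Split each line on whitespace
#Keep only the three coordinates by dropping the 'Xn Yn Zn:' labels:
#[point.split()[3:] for point in stripped_points]
#or do the strip, the split and the slice in one chained expression:
points = [point.strip().split()[3:] for point in points[1:]] #Chain chain chain chain...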

In [45]:
#Nested for loops. The commented-out line at the end of this cell
#is the equivalent nested list comprehension.
ls = []
for point in points:
    ls.append([])
    for coord in point:
        ls[-1].append(float(coord))

ls
#[[float(coord) for coord in point] for point in points]

[[27.31, 12.3, 18.66],
 [26.4, 13.08, 19.79],
 [24.85, 14.08, 21.45],
 [23.65, 14.55, 22.67],
 [22.2, 15.08, 24.27],
 [20.75, 15.38, 25.75],
 [19.0, 15.29, 26.93],
 [17.32, 15.12, 28.21],
 [15.74, 14.86, 29.15],
 [13.9, 12.97, 29.75],
 [12.23, 12.48, 30.84],
 [10.02, 11.84, 32.11],
 [7.04, 10.4, 34.03],
 [4.97, 8.8, 35.26],
 [2.77, 6.61, 35.89],
 [0.58, 4.38, 36.66],
 [-1.45, 3.06, 37.11],
 [-4.09, 0.52, 37.05],
 [-6.47, -0.76, 37.95],
 [-7.7, -4.25, 37.12],
 [-10.36, -6.29, 38.1],
 [-11.24, -9.54, 36.88],
 [-12.26, -11.09, 36.93],
 [-14.42, -11.16, 36.55],
 [-15.76, -12.52, 36.12],
 [-15.99, -14.75, 34.92],
 [-17.34, -16.15, 34.48],
 [-17.72, -17.7, 33.09],
 [-17.59, -19.83, 31.34],
 [-17.03, -22.5, 30.07],
 [-17.42, -24.78, 29.26],
 [-17.2, -27.02, 28.09],
 [-16.89, -29.09, 26.77],
 [-16.24, -30.99, 25.36],
 [-15.33, -32.48, 23.73],
 [-14.68, -34.07, 22.07],
 [-13.82, -35.26, 19.99],
 [-13.13, -36.33, 18.02],
 [-12.48, -36.79, 15.88],
 [-11.09, -37.52, 13.65],
 [-9.19, -38.44, 11.2],

In [48]:
#This is the sum total of everything we've done so far.
def extract_data(point_file_path):
    f = open(point_file_path, 'r')
    #This is the fragile line: matching on 'X' also picks up the legend line,
    #which is why points[1:] is used below. It should probably be a regex;
    #we are just being lazy.
    points = [line for line in f if 'X' in line]
    f.close()
    points = [point.strip().split()[3:] for point in points[1:]]
    points = [[float(coord) for coord in point] for point in points]
    return points

extracted_points = extract_data('./data/RCA_Test_Spline.txt')
extracted_points

[[27.31, 12.3, 18.66],
 [26.4, 13.08, 19.79],
 [24.85, 14.08, 21.45],
 [23.65, 14.55, 22.67],
 [22.2, 15.08, 24.27],
 [20.75, 15.38, 25.75],
 [19.0, 15.29, 26.93],
 [17.32, 15.12, 28.21],
 [15.74, 14.86, 29.15],
 [13.9, 12.97, 29.75],
 [12.23, 12.48, 30.84],
 [10.02, 11.84, 32.11],
 [7.04, 10.4, 34.03],
 [4.97, 8.8, 35.26],
 [2.77, 6.61, 35.89],
 [0.58, 4.38, 36.66],
 [-1.45, 3.06, 37.11],
 [-4.09, 0.52, 37.05],
 [-6.47, -0.76, 37.95],
 [-7.7, -4.25, 37.12],
 [-10.36, -6.29, 38.1],
 [-11.24, -9.54, 36.88],
 [-12.26, -11.09, 36.93],
 [-14.42, -11.16, 36.55],
 [-15.76, -12.52, 36.12],
 [-15.99, -14.75, 34.92],
 [-17.34, -16.15, 34.48],
 [-17.72, -17.7, 33.09],
 [-17.59, -19.83, 31.34],
 [-17.03, -22.5, 30.07],
 [-17.42, -24.78, 29.26],
 [-17.2, -27.02, 28.09],
 [-16.89, -29.09, 26.77],
 [-16.24, -30.99, 25.36],
 [-15.33, -32.48, 23.73],
 [-14.68, -34.07, 22.07],
 [-13.82, -35.26, 19.99],
 [-13.13, -36.33, 18.02],
 [-12.48, -36.79, 15.88],
 [-11.09, -37.52, 13.65],
 [-9.19, -38.44, 11.2],
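
The comments in `extract_data` note that the `'X' in line` test should probably be a regular expression. One possible version is sketched below. The pattern and the name `extract_data_regex` are assumptions for illustration, and the pattern only handles plain decimal numbers in the `Xn Yn Zn: x y z` layout shown above. It takes any iterable of lines, so an open file object could be passed in as well.

```python
import re

# Matches lines such as "X12 Y12 Z12: 7.04   10.40  34.03"
# and captures the three coordinates.
NUMBER = r"(-?\d+(?:\.\d+)?)"
POINT_LINE = re.compile(
    r"^X\d+\s+Y\d+\s+Z\d+:\s*" + NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER + r"\s*$"
)

def extract_data_regex(lines):
    """Return [x, y, z] floats for every line that looks like a point."""
    points = []
    for line in lines:
        match = POINT_LINE.match(line)
        if match:
            points.append([float(value) for value in match.groups()])
    return points

# A few lines copied from the file shown earlier.
sample_lines = [
    "Xp,Yp,Zp:      Coordinates of point P on spline\n",
    "Name: RCA\n",
    "X0 Y0 Z0: 27.31  12.30  18.66 \n",
    "X16 Y16 Z16: -1.45  3.06   37.11 \n",
]
parsed = extract_data_regex(sample_lines)
print(parsed)
```

Because the legend line does not match the pattern, there is no need to skip the first match by hand.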

# Numpy

NumPy is the core of fast matrix computing in Python. Python lists are too slow for heavy numerical work. NumPy stores data in contiguous memory and uses a compiled C backend for its operations, which makes them fast. As an example we will use Jorge's problem of finding the closest points between two point sets.
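
To make the contrast concrete, here is a small sketch that computes the same quantity twice: the length of each segment between consecutive spline points, first with plain lists and then with NumPy. The three points are the first three from the file above; the choice of segment lengths as the example quantity is an assumption for illustration, and no timing is measured here.

```python
import math
import numpy as np

pts = [[27.31, 12.30, 18.66], [26.40, 13.08, 19.79], [24.85, 14.08, 21.45]]

# Pure Python: explicit loops over points and over coordinates.
seg_py = []
for p, q in zip(pts[:-1], pts[1:]):
    seg_py.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

# NumPy: the same arithmetic expressed on whole arrays at once.
arr = np.array(pts)
seg_np = np.sqrt((np.diff(arr, axis=0) ** 2).sum(axis=1))

print(seg_py)
print(seg_np)
```

Both versions give the same numbers; the NumPy version hands the looping to compiled code, which is where the speed-up on large arrays comes from.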

In [53]:
import numpy as np
#It is easy to convert a list of values like this
#into a numpy array. Simply pass it to the np.array function,
#and you get back a numpy array object with all its
#own methods.
points_np = np.array(extracted_points)
points_np

array([[ 27.31,  12.3 ,  18.66],
       [ 26.4 ,  13.08,  19.79],
       [ 24.85,  14.08,  21.45],
       [ 23.65,  14.55,  22.67],
       [ 22.2 ,  15.08,  24.27],
       [ 20.75,  15.38,  25.75],
       [ 19.  ,  15.29,  26.93],
       [ 17.32,  15.12,  28.21],
       [ 15.74,  14.86,  29.15],
       [ 13.9 ,  12.97,  29.75],
       [ 12.23,  12.48,  30.84],
       [ 10.02,  11.84,  32.11],
       [  7.04,  10.4 ,  34.03],
       [  4.97,   8.8 ,  35.26],
       [  2.77,   6.61,  35.89],
       [  0.58,   4.38,  36.66],
       [ -1.45,   3.06,  37.11],
       [ -4.09,   0.52,  37.05],
       [ -6.47,  -0.76,  37.95],
       [ -7.7 ,  -4.25,  37.12],
       [-10.36,  -6.29,  38.1 ],
       [-11.24,  -9.54,  36.88],
       [-12.26, -11.09,  36.93],
       [-14.42, -11.16,  36.55],
       [-15.76, -12.52,  36.12],
       [-15.99, -14.75,  34.92],
       [-17.34, -16.15,  34.48],
       [-17.72, -17.7 ,  33.09],
       [-17.59, -19.83,  31.34],
       [-17.03, -22.5 ,  30.07],
       [-1

In [55]:
#We can change our function to return a numpy array instead.
def extract_data_np(point_file_path):
    f = open(point_file_path, 'r')
    #This is the bad line. Should probably be regex
    #We are just being lazy.
    points = [line for line in f if 'X' in line]
    f.close()
    points = [point.strip().split()[3:] for point in points[1:]]
    #The list comprehension version was:
    #points = [[float(coord) for coord in point] for point in points]
    points = np.array(points).astype(float) #We like this much better
    return points

extract_data_np('./data/RCA_Test_Spline.txt')

array([[ 27.31,  12.3 ,  18.66],
       [ 26.4 ,  13.08,  19.79],
       [ 24.85,  14.08,  21.45],
       [ 23.65,  14.55,  22.67],
       [ 22.2 ,  15.08,  24.27],
       [ 20.75,  15.38,  25.75],
       [ 19.  ,  15.29,  26.93],
       [ 17.32,  15.12,  28.21],
       [ 15.74,  14.86,  29.15],
       [ 13.9 ,  12.97,  29.75],
       [ 12.23,  12.48,  30.84],
       [ 10.02,  11.84,  32.11],
       [  7.04,  10.4 ,  34.03],
       [  4.97,   8.8 ,  35.26],
       [  2.77,   6.61,  35.89],
       [  0.58,   4.38,  36.66],
       [ -1.45,   3.06,  37.11],
       [ -4.09,   0.52,  37.05],
       [ -6.47,  -0.76,  37.95],
       [ -7.7 ,  -4.25,  37.12],
       [-10.36,  -6.29,  38.1 ],
       [-11.24,  -9.54,  36.88],
       [-12.26, -11.09,  36.93],
       [-14.42, -11.16,  36.55],
       [-15.76, -12.52,  36.12],
       [-15.99, -14.75,  34.92],
       [-17.34, -16.15,  34.48],
       [-17.72, -17.7 ,  33.09],
       [-17.59, -19.83,  31.34],
       [-17.03, -22.5 ,  30.07],
       [-1

Remember that this is the hard way of doing things. Mimics produces files that are not easy to parse, but list comprehensions make dealing with them fairly painless. When we get more structured data, like a CSV file, the pandas library can create a tabular data object for us in one line. For data like this, though, it makes more sense to keep it as a numpy array: there are no fields that you need to query or match up with anything else, you just care about the math.
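
For comparison, this is roughly what the one-line pandas route looks like. The sketch assumes the points had been saved as a CSV with a header row; the column names `x`, `y`, `z` and the in-memory text are made up for illustration, and pandas must be installed.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "x,y,z\n27.31,12.30,18.66\n26.40,13.08,19.79\n"

# The "one line": read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# If only the math matters, the table converts straight to a numpy array.
arr = df.to_numpy()
print(arr.shape)
```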

In [1]:

#Random placeholder point sets, standing in for real data files
#parsed with the function above.
ps1 = np.random.random((40,3))
ps2 = np.random.random((30,3))

ps1

array([[0.21353496, 0.3834453 , 0.56432614],
       [0.69772295, 0.39672954, 0.2747583 ],
       [0.3034489 , 0.41751568, 0.69772927],
       [0.30413305, 0.67913954, 0.37630028],
       [0.15937635, 0.40533564, 0.97341271],
       [0.69822766, 0.95237304, 0.49655389],
       [0.52424429, 0.52050969, 0.15214705],
       [0.64268822, 0.77106205, 0.23334826],
       [0.76984718, 0.90613019, 0.66798594],
       [0.02097098, 0.10248472, 0.35366696],
       [0.30289593, 0.68625763, 0.98727945],
       [0.69270558, 0.50318029, 0.40569417],
       [0.57050612, 0.01199752, 0.73033142],
       [0.40833829, 0.96199326, 0.9340937 ],
       [0.77476349, 0.92064619, 0.30979368],
       [0.71643566, 0.13264324, 0.93598299],
       [0.89097785, 0.9849063 , 0.55294764],
       [0.28109038, 0.37466919, 0.00901289],
       [0.62930505, 0.18521589, 0.29459521],
       [0.75078031, 0.88621428, 0.01533064],
       [0.25550922, 0.30569632, 0.68970241],
       [0.58248162, 0.11384877, 0.67044518],
       [0.

In [56]:
ps1 + ps2

ValueError: operands could not be broadcast together with shapes (40,3) (30,3) 

In [2]:
ps1[0:10, 0]

array([0.21353496, 0.69772295, 0.3034489 , 0.30413305, 0.15937635,
       0.69822766, 0.52424429, 0.64268822, 0.76984718, 0.02097098])

In [3]:
ps1[:,:].shape #This doesn't change anything, just asks for everything

(40, 3)

In [4]:
ps1[:, None, :].shape, ps2[None,:,:].shape #Insert a dimension of size one into the arrays

((40, 1, 3), (1, 30, 3))

In [5]:
#Broadcasting stretches the size-one axes, so this subtracts
#every point in ps2 from every point in ps1 in one step.
disp_vec_mat = (ps1[:, None, :] - ps2[None,:,:])
disp_vec_mat.shape

(40, 30, 3)

In [6]:
dist_mat = np.sqrt((disp_vec_mat ** 2).sum(axis=-1))
dist_mat.shape

(40, 30)
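
To see that the broadcast subtraction really does compute every pairwise distance, the sketch below repeats the same steps on two tiny, hand-picked point sets and checks the result against an explicit double loop. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np

a = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])                    # shape (2, 3)
b = np.array([[0.0, 3.0, 4.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])                    # shape (3, 3)

# (2, 1, 3) - (1, 3, 3) broadcasts to (2, 3, 3): one displacement per pair.
diff = a[:, None, :] - b[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))           # shape (2, 3)

# The same thing written as a double loop, for checking.
dist_loop = np.empty((len(a), len(b)))
for i in range(len(a)):
    for j in range(len(b)):
        dist_loop[i, j] = np.sqrt(((a[i] - b[j]) ** 2).sum())

print(dist)
print(np.allclose(dist, dist_loop))
```

Entry `dist[i, j]` is the distance from `a[i]` to `b[j]`, so row 0 holds the distances from the origin to each point in `b`.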

In [7]:
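#Careful: neither line below reports the minimum distance.
#argmin(0) holds, for each point in ps2, the index of its nearest point in ps1
#(argmin(1) is the reverse); taking argmin() of those index arrays only finds
#where the smallest index sits. dist_mat.min() gives the minimum distance.
print(dist_mat.argmin(0).argmin())
print(dist_mat.argmin(1).argmin())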

5
13
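
One way to locate the closest pair in a distance matrix is to take the flat `argmin` and convert it back to a row and column with `np.unravel_index`. The sketch below uses a small made-up matrix rather than `dist_mat`, so that the answer can be checked by eye.

```python
import numpy as np

# Hypothetical 2 x 3 distance matrix: rows index one point set, columns the other.
dist = np.array([[5.0, 1.0, 2.0],
                 [4.0, 0.5, 3.0]])

flat_index = dist.argmin()                        # position in the flattened array
i, j = np.unravel_index(flat_index, dist.shape)   # back to (row, column)
min_dist = dist[i, j]

print(i, j, min_dist)

# Nearest neighbour in the second set for each point of the first set.
nearest_per_row = dist.argmin(axis=1)
print(nearest_per_row)
```

Here the smallest entry is 0.5 at row 1, column 1, so the closest pair is point 1 of the first set and point 1 of the second.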


# Broadcasting

In [8]:
arr = np.arange(16).reshape(8,2)
arr.shape

(8, 2)

In [9]:
const = np.array([2])
const.shape

(1,)

In [10]:
(arr + const).shape

(8, 2)

In [11]:
vec = np.array([3,4])
vec.shape

(2,)

In [12]:
arr + vec #makes sense

array([[ 3,  5],
       [ 5,  7],
       [ 7,  9],
       [ 9, 11],
       [11, 13],
       [13, 15],
       [15, 17],
       [17, 19]])

In [13]:
arr + np.array([1,2,3])

ValueError: operands could not be broadcast together with shapes (8,2) (3,) 
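
The cells above follow NumPy's broadcasting rule: shapes are compared from the last axis backwards, and two axes are compatible when they are equal or when one of them is 1 (missing leading axes count as 1). The helper below, `try_broadcast`, is a hypothetical name introduced here to check shapes without doing any arithmetic.

```python
import numpy as np

def try_broadcast(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        return None

print(try_broadcast((8, 2), (1,)))             # size-one axis stretches
print(try_broadcast((8, 2), (2,)))             # last axes match
print(try_broadcast((8, 2), (3,)))             # 2 vs 3: incompatible
print(try_broadcast((40, 1, 3), (1, 30, 3)))   # the pairwise trick used earlier
print(try_broadcast((40, 3), (30, 3)))         # 40 vs 30: incompatible
```

The last two calls show why inserting the size-one axes matters: `(40, 3)` and `(30, 3)` clash on their first axis, while `(40, 1, 3)` and `(1, 30, 3)` do not.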

In [14]:
def broadcast_op(arr1, arr2):
    return np.sqrt( ((arr1[:,None,:] - arr2[None,:,:])**2).sum(axis=-1) )

%timeit -n10 broadcast_op(ps1, ps2)

10 loops, best of 3: 62.8 µs per loop


In [15]:
import numba
broadcast_op_numb = numba.jit(broadcast_op)

%timeit -n10 broadcast_op_numb(ps1, ps2)

The slowest run took 2172.88 times longer than the fastest. This could mean that an intermediate result is being cached.
10 loops, best of 3: 72.7 µs per loop
