# Introduction

## Jupyter notebooks

This class is focused on data analysis, so we'll restrict ourselves to the parts of Python that we need to get the job done. This means we won't be using much object-oriented programming or any complicated software engineering tools. For the most part, we'll be running code cell-by-cell and presenting our results in Jupyter notebooks, just like the one you are reading right now.

The Jupyter server is a program that hosts notebook documents. It runs on your computer or on a remote server and lets you run code and take notes directly in a web browser. This makes it much easier to integrate pictures, text and even equations in the same space as your code. However, if you need to write a program that will generate predictions every day, repeatedly apply functions to streaming data or carry out some other recurring task, notebooks are not the best choice, and you would be better off developing standalone Python scripts that can be run from the command line.

The Jupyter notebook has a tour of the user interface built in. If you are new to using Jupyter, I recommend you select "User Interface Tour" from the "Help" button on the toolbar and see the different features it has. Similarly, take a look at the keyboard shortcuts display.

## Cells

The basic unit of structure in the notebook is a **cell**. These can be either code cells, markdown-formatted cells, or raw text cells. Markdown is a formatting language that lets you add effects to text like *italics*, **bold** or bulleted lists. You can see how to use all the different formatting options markdown allows at https://www.markdownguide.org/basic-syntax/. Section headers with large and bold text are made by prefixing your text with up to six `#` symbols; the more symbols, the smaller the header.

To run a cell, press shift-enter. This will either render a markdown cell or tell a code cell to start computation. To edit a formatted markdown cell, just double click it. Once the computation has started, a little asterisk will be shown to the left of the cell. In the cell below, we'll import the standard library `time` which comes packaged with Python and tell the computer to wait for seven seconds.

In [1]:
import time
time.sleep(7)

Once the cell is done running, you can see a number appear where the asterisk used to be. This indicates the order of execution of all cells and can be useful for identifying in what order your code was run. In general, it's best to try to keep from running code cells out of order if possible. To run the entire notebook at once, you can select "Run All" from the Cell toolbar button.

Sometimes your code can hang for a long time without completing. If you're running remotely, this could be due to a hiccup in the communication between the server and your computer. Whatever the cause, you can restart the underlying Python instance, called the *kernel*, from the toolbar. Restarting the kernel wipes all variables stored in memory but does not delete any code or text.

## Directives / magic commands

You can also pass certain commands directly to the notebook's kernel, rather than having them interpreted as Python code, by putting the `%` symbol in front of them. These are called magic commands. For example, you will want to tell the Python plotting library, `matplotlib`, to display its plots in the Jupyter notebook using the magic command `%matplotlib inline`.



In [2]:
%matplotlib inline

You can also run shell commands (for example, Linux commands) by prepending `!` to your command. To identify the current directory that the notebook is being run out of, you can use `!pwd`.

In [3]:
!pwd

/home/ckrapu/Dropbox/teaching/engineering-data-science/notebooks


Here, we can see the full path of the directory that this notebook is in. This can be really handy to figure out whether or not your data is in the right directory. I'll check to see if our data file `heart.csv` is in the right place.

In [4]:
! ls ../datasets

heart.csv


# Functions

Python functions are declared with a simple syntax:

In [5]:
def some_function(argument1,argument2, optional_argument=0):
    sum_value = argument1 + argument2 + optional_argument
    return sum_value

Note that Python is a dynamically typed language - the type of an input variable is not checked when a function runs. A function can take two types of arguments: positional (required) and keyword (optional) arguments. Optional arguments must be given a default value when the function is declared.

Both types of arguments are used in the example above. Also, unlike C, Python does not use braces to indicate nesting. Instead, it uses indentation. You can use either tabs or four consecutive spaces to create an indent. If you are writing code in a `.py` script, make sure you do not mix the two styles or else you will get errors.
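
As a minimal sketch of how the two kinds of arguments behave when the function is called, the cell below repeats the definition of `some_function` so that it runs on its own. The result variable names are made up for illustration.

```python
# Repeated from above so this cell is self-contained
def some_function(argument1, argument2, optional_argument=0):
    sum_value = argument1 + argument2 + optional_argument
    return sum_value

# Positional arguments only; optional_argument falls back to its default of 0
default_result = some_function(1, 2)

# Override the default by naming the keyword argument
keyword_result = some_function(1, 2, optional_argument=10)

# Types are not checked, so strings work too: + concatenates them
string_result = some_function('a', 'b', optional_argument='c')

print(default_result, keyword_result, string_result)  # 3 13 abc
```

Mixing types that cannot be added, such as `some_function(1, 'b')`, only fails once the addition inside the function is actually attempted.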

Python allows for higher-order functions: functions can be used as arguments to other functions.

In [6]:
def cube(x):
    return x*x*x

def square(x):
    return x*x

def polynomial(x,function1, function2,a=1,b=1):
    return a*function1(x) + b*function2(x)

print(polynomial(2,cube,square))

12


While the `print` function works fine, a notebook will also display the value of the last expression in a cell without using `print`. How that value is shown depends on the object's own method for displaying itself.

In [7]:
x = 3+5
x

8
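
The difference between printing a value and displaying it is easiest to see with a string. This is a small sketch with a made-up variable name; `repr` is the built-in that produces the form a notebook shows for the last expression in a cell.

```python
greeting = 'hello'

# print uses str(), which shows the text itself
print(greeting)        # hello

# Displaying a value uses repr(), which keeps the quotes for a string
print(repr(greeting))  # 'hello'
```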

## Exercise
* Write a function `radius_to_area` that converts the radius of a circle into its area.
* Write a function `apply_and_double` which takes a number and a function as arguments, applies the function to the number and then doubles the result.

If you want to see all the methods and attributes that belong to a Python object, you can use the `dir` function to do so. Everything in Python is an object. Methods which have double underscores before and after their name are special methods that Python calls behind the scenes, for example `__add__` when you use `+`. They are typically not meant to be called directly by end users.

In [8]:
dir(x)

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

# Sequences

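Python has a range of data structures built in. Tuples (`tuple`) and lists (`list`) are very commonly used. Both of these types of sequences keep their elements in order and can be indexed by position. A `set` does not; once an object is added to a set, the ordering of the data is lost. A `dict` remembers the order in which its keys were inserted (as of Python 3.7) but is indexed by key rather than by position. Tuples and lists behave very similarly; the main difference is that a list can be modified after it is created while a tuple cannot.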

In [9]:
my_tuple = (0,1,2,3)
my_list  = [4,5,6,7]


Indexing one or several elements from the sequence is accomplished with square brackets. Note that Python's index starts at zero.

In [10]:
my_list[0]

4

In [11]:
my_list[0:3]

[4, 5, 6]

Leaving out the stop value after the colon indicates that the slice should run until the end of the sequence.

In [12]:
my_list[1::]

[5, 6, 7]

You can also take every n-th element from the list using the square bracket notation with `start:stop:step`.

In [13]:
my_list[0:4:2]

[4, 6]

Indexing with negative numbers is also valid. This counts backwards from the end.

In [14]:
my_tuple[-1]

3

In [15]:
my_tuple[-3]

1

The append and insert operations are used to make a list longer. 

In [16]:
my_list.append(4)
my_list.insert(0,0)
my_list

[0, 4, 5, 6, 7, 4]
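
One place where tuples and lists differ is that a tuple cannot be changed after it is created, which is also why tuples have no `append` or `insert`. A minimal sketch with made-up variable names:

```python
example_list = [4, 5, 6, 7]
example_tuple = (0, 1, 2, 3)

# Lists can be modified in place
example_list[0] = 40

# Tuples cannot; item assignment raises a TypeError
try:
    example_tuple[0] = 40
    tuple_was_modified = True
except TypeError as error:
    tuple_was_modified = False
    print('TypeError:', error)

print(example_list)   # [40, 5, 6, 7]
print(example_tuple)  # (0, 1, 2, 3)
```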

We can find the length of a sequence with the function `len`.

In [17]:
len(my_list)

6

Lists may also be nested!

In [18]:
nested = [[0,1],[2,3,4]]
nested

[[0, 1], [2, 3, 4]]

## Exercise
* How would you access every third element of a list `x` starting at the fifth element through the 13th?
* What do you get from the expression `my_list[my_tuple]`? Can you explain what is going on here?

# Sets

Sets aren't used as frequently as some other Python data structures but they can be useful in certain situations. A set will only contain unique elements. Converting a list into a set is easy - you just pass the list as an argument to `set`. This gives a straightforward way to find all the unique values in a list.

In [19]:
list_with_duplicates = [1,1,2,2,3,3,3]
set(list_with_duplicates)

{1, 2, 3}
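
Since a set does not promise any particular ordering, a common follow-up is to turn the unique values back into a sorted list. The sketch below repeats the list from the cell above; the other names are made up for illustration.

```python
list_with_duplicates = [1, 1, 2, 2, 3, 3, 3]

unique_values = set(list_with_duplicates)

# sorted() returns a list, so the unique values come back in a predictable order
unique_sorted = sorted(unique_values)

# Adding a value that is already present leaves the set unchanged
unique_values.add(3)

print(unique_sorted)       # [1, 2, 3]
print(len(unique_values))  # 3
```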

# Dict

A Python dict (short for dictionary) is an implementation of a hash table which provides very fast lookup and updating. Dicts map one value (the key) onto another, with the requirement that the key must be immutable. This means that the key cannot be changed once it has been created. Examples of immutable datatypes include strings, integers and floating point numbers.

In [20]:
mapping = {1:'a',2:'b',3:'c',0:'d'}
mapping[1]


'a'

In [21]:
mapping[0]

'd'
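
To illustrate the requirement on keys, here is a small sketch using made-up data. A tuple is immutable, so it is accepted as a key, while a list is rejected.

```python
# Hypothetical example data: tuples are immutable, so they can serve as keys
grid_labels = {(0, 0): 'origin', (1, 0): 'east'}
print(grid_labels[(0, 0)])  # origin

# Lists are mutable, so using one as a key raises a TypeError
try:
    bad_mapping = {[0, 0]: 'origin'}
    list_key_allowed = True
except TypeError as error:
    list_key_allowed = False
    print('TypeError:', error)
```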

## Exercise

* You can insert new values into the dict with the syntax `dictionary[index] = new_value`. Try this with `mapping`. What happens to the previously indexed value?
* Suppose that you work at a restaurant and want to keep a table with customers' names as indices and the names of their favorite dishes as values. You would also like to keep track of the order in which the values were entered. Is the dict an appropriate data structure?
* Review the [documentation](https://docs.python.org/3/library/collections.html) for the `collections` standard library and see if there is a more appropriate data structure for this purpose.

# Map function and iterators

Dicts are really handy for applying a mapping to a sequence such as a list or tuple. Here, we'll see how to combine this with the `map` function and anonymous or `lambda` functions.

The `map` function takes a function and a sequence and applies the function to each element of the sequence. It does not return a list or a tuple, however. Instead, it returns an iterator which provides new values one at a time. This is especially useful when you have a very large input list and you do not want to compute all of the results at once, but rather only when you need them. The iterator tracks its internal state over time.

In [22]:
iterator = map(square,my_list)


In [23]:
next(iterator)

0

If we just want all the values at once, we can turn it into a list too.

In [24]:
iterator = map(square,my_list)
list(iterator)

[0, 16, 25, 36, 49, 16]
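
Because an iterator keeps track of where it is, each value can only be taken from it once. The sketch below uses the built-in `next` function on a short made-up list and repeats the definition of `square` so that it runs on its own.

```python
def square(x):
    return x * x

iterator = map(square, [1, 2, 3])

first = next(iterator)      # 1
second = next(iterator)     # 4

# Converting to a list consumes whatever is left
remaining = list(iterator)  # [9]

# The iterator is now exhausted, so asking for another value raises StopIteration
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, remaining, exhausted)  # 1 4 [9] True
```

This is why the cell above creates a fresh iterator with `map` before calling `list` on it.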

If we want to use our dict, we'll have to create a throwaway or anonymous lambda function to do so. The syntax for a lambda function is `lambda <input>: <output>`.

In [25]:
letter_iterator = map(lambda x: mapping[x],my_tuple)
print('Before mapping:',my_tuple)
print('After mapping:',list(letter_iterator))

Before mapping: (0, 1, 2, 3)
After mapping: ['d', 'a', 'b', 'c']


One of the iterators you will see quite often is the `range` iterator. With only the required argument, it starts at 0 and counts up to, but does not include, that argument. You can also supply a start value and a step.

In [26]:
list(range(5))

[0, 1, 2, 3, 4]

In [27]:
list(range(2,4))

[2, 3]

In [28]:
list(range(5,1,-1))

[5, 4, 3, 2]

# Loops and conditionals

`for` loops in Python require a sequence or iterator to loop over, and `while` loops continue until their condition becomes false.

In [29]:
for x in range(5):
    print(x,cube(x))

0 0
1 1
2 8
3 27
4 64


In [30]:
x = 0
while x < 5:
    print(x,cube(x))
    x += 1

0 0
1 1
2 8
3 27
4 64


Both types of loop can also take an optional `else` clause that executes once the loop finishes normally, that is, without being interrupted by a `break` statement.

In [31]:
x = 0
while x < 5:
    print(x,cube(x))
    x += 1
else:
    print('The next number would have been',x)

0 0
1 1
2 8
3 27
4 64
The next number would have been 5
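
The `else` clause is skipped when a loop is cut short with `break`, which makes it handy for searches. The function below is a hypothetical example, not part of the course code.

```python
def find_first_negative(values):
    for value in values:
        if value < 0:
            result = value
            break
    else:
        # Runs only when the loop finished without hitting break
        result = None
    return result

print(find_first_negative([3, -1, 4]))  # -1
print(find_first_negative([3, 1, 4]))   # None
```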


Conditional statements are also straightforward. If you wish for nothing to be done, you can use the `pass` keyword.

In [32]:
condition = True

def do_something():
    print('Something happened.')
    
def do_nothing():
    pass

if condition:
    do_something()
else:
    do_nothing()

Something happened.


If you want to iterate over both an integer index `i` and some `value` drawn from a sequence or iterator, you can do both at the same time with `enumerate`.

In [50]:
for i,value in enumerate(['a','b','c','d']):
    print(i,value)

0 a
1 b
2 c
3 d


## Exercise
* Use a `for` loop and the `range` iterator to calculate the sum of all numbers between 11 and 39.
* Use `map` and `range` to create a list populated with the cube of all numbers between 1 and 6.

# Strings

A string is a sequence of characters, much like a character array in the C programming language, so you can index and slice it just like a list or tuple.

In [33]:
my_name = 'Christopher'
my_name[5::]

'topher'

In [34]:
my_name.lower()

'christopher'

In [35]:
my_name.split('t')

['Chris', 'opher']

Using the `in` keyword allows us to check whether a substring is part of a larger string. This operation can also be used more generally to check whether an element is part of a list.

In [36]:
'Chris' in my_name

True

In [37]:
0 in [0,1,2,3]

True

We can turn a longer string into a list of substrings by splitting it on a separator such as a space or a piece of punctuation.

In [38]:
text = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty...'

In [39]:
text

'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty...'

In [51]:
text.split(' ')

['Four',
 'score',
 'and',
 'seven',
 'years',
 'ago',
 'our',
 'fathers',
 'brought',
 'forth',
 'on',
 'this',
 'continent,',
 'a',
 'new',
 'nation,',
 'conceived',
 'in',
 'Liberty...']

Also, we can use the `replace` method to replace one substring with another. `replace` is not an in-place operation! It returns a new string and leaves the original unchanged.

In [41]:
text.replace('Four','Twelve').replace('seven','three')


'Twelve score and three years ago our fathers brought forth on this continent, a new nation, conceived in Liberty...'
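
To see what this means in practice, the sketch below uses a shortened copy of `text` and a made-up name `new_text` for the result.

```python
text = 'Four score and seven years ago'

# replace returns a new string; the original is left untouched
text.replace('Four', 'Twelve')
print(text)      # Four score and seven years ago

# To keep the change, assign the result to a name
new_text = text.replace('Four', 'Twelve')
print(new_text)  # Twelve score and seven years ago
```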

## Exercise
* Calculate the average word length (in characters) of `text`.

# In-class project: loading data from text files

In this section, we'll be using a dataset on heart disease from UCI which can be downloaded here: https://www.kaggle.com/ronitf/heart-disease-uci. Unzip the file and place `heart.csv` in the `datasets` directory one level above the directory that you are running the Jupyter notebook out of, since the code below reads it from `../datasets/heart.csv`.

Python has functions for opening and reading files. The syntax looks like this:


In [42]:
filepath = '../datasets/heart.csv'
with open(filepath,'r') as file:
    data = file.read()

The `'r'` argument to `open` is the mode. It indicates that the file should be opened for reading rather than writing (`'w'`). This file is a comma separated value (CSV) file that contains numerous data values in a tabular format.

In [43]:
data[0:1000]

'\ufeffage,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n63,1,3,145,233,1,0,150,0,2.3,0,0,1,1\n37,1,2,130,250,0,1,187,0,3.5,0,0,2,1\n41,0,1,130,204,0,0,172,0,1.4,2,0,2,1\n56,1,1,120,236,0,1,178,0,0.8,2,0,2,1\n57,0,0,120,354,0,1,163,1,0.6,2,0,2,1\n57,1,0,140,192,0,1,148,0,0.4,1,0,1,1\n56,0,1,140,294,0,0,153,0,1.3,1,0,2,1\n44,1,1,120,263,0,1,173,0,0,2,0,3,1\n52,1,2,172,199,1,1,162,0,0.5,2,0,3,1\n57,1,2,150,168,0,1,174,0,1.6,2,0,2,1\n54,1,0,140,239,0,1,160,0,1.2,2,0,2,1\n48,0,2,130,275,0,1,139,0,0.2,2,0,2,1\n49,1,1,130,266,0,1,171,0,0.6,2,0,2,1\n64,1,3,110,211,0,0,144,1,1.8,1,0,2,1\n58,0,3,150,283,1,0,162,0,1,2,0,2,1\n50,0,2,120,219,0,1,158,0,1.6,1,0,2,1\n58,0,2,120,340,0,1,172,0,0,2,0,2,1\n66,0,3,150,226,0,1,114,0,2.6,0,0,2,1\n43,1,0,150,247,0,1,171,0,1.5,2,0,2,1\n69,0,3,140,239,0,1,151,0,1.8,2,2,2,1\n59,1,0,135,234,0,1,161,0,0.5,1,0,3,1\n44,1,2,130,233,0,1,179,1,0.4,2,0,2,1\n42,1,0,140,226,0,1,178,0,0,2,0,2,1\n61,1,2,150,243,1,1,137,1,1,1,0,2,1\n40,1,3,140

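The rows are delimited by the newline character `\n` while the columns are separated by commas. All of the values in the first line correspond to column header names. The `\ufeff` at the very start is a byte-order mark left by the program that saved the file; it is not part of the first column name.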

## Exercise

In this exercise, we will load the data in the CSV file into a dict where each column name indexes a list of data values corresponding to that column. We also want the order of values across lists to be consistent with the order in the CSV file. To do this, complete the following steps.

* First, split the large data string using the newline character to separate the file into distinct lines.
* Split the first line into column names and keep track of how many columns there are.
* For the rest of the file, iterate over each of the lines and add the data value to the appropriate dict's list. You can do this by splitting each line by commas and keeping track of the column index belonging to each column name. You can initialize an empty dict with `{}` and an empty list with `[]`. If possible, try to convert the string values into numerical values.

In [47]:
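# Drop the byte-order mark ('\ufeff') and the trailing newline, then split into lines
lines = data.lstrip('\ufeff').strip().split('\n')
columns = lines[0].split(',')

# Start with one empty list per column name
table = {}
for name in columns:
    table[name] = []

for line in lines[1:]:
    values = line.split(',')
    for i, value in enumerate(values):
        # Convert each string to a number so the columns can be used in calculations
        table[columns[i]].append(float(value))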
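
For comparison, Python's standard library also includes a `csv` module that handles the splitting for you. The sketch below is one possible alternative, not the intended solution to the exercise, and it runs on a made-up three-line sample rather than the real file.

```python
import csv
import io

# Hypothetical sample in the same layout as the data file
sample = 'age,sex,chol\n63,1,233\n37,1,250\n'

# DictReader uses the first row as column names and yields one dict per row
reader = csv.DictReader(io.StringIO(sample))

table = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        table[name].append(float(value))

print(table)
```

When reading the real file this way, opening it with `encoding='utf-8-sig'` should remove the byte-order mark before the first column name.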

# Challenge exercises

If you found the previous problems to be too easy, here are a few extra exercises to encourage you to explore more of Python's functionality.

* Define a function that implements the matrix-vector product between a matrix $A$ and a vector $x$. $A$ should have dimensions $d\times d$ and $x$ should be $d$ elements long. $A$ should be a nested list-of-lists and $x$ should be a list.
* Define a function to generate a large sample matrix and large vector for testing purposes. The entries of the matrix and vector are irrelevant, but the dimension should be passed in as an argument.
* Generate a list-of-lists and vectors for varying dimensions. Use the `time` library (documentation [here](https://docs.python.org/3/library/time.html)) to calculate how much time is required to calculate the product for $d = 10$, $d = 100$ and $d = 1000$.
* Install and use the Numba just-in-time compiler to speed up your code. A good tutorial for this can be found at https://numba.pydata.org/numba-doc/dev/user/5minguide.html.
* Compare the run time for each case. Is Numba providing a performance boost?
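
As a starting point for the timing step, here is a minimal sketch of how a single function call could be timed with `time.perf_counter`. The helper `time_call` is a made-up name, and summing a range is only a placeholder workload standing in for your matrix-vector product.

```python
import time

def time_call(function, *args):
    """Return the function's result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = function(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Placeholder workload standing in for the matrix-vector product
total, seconds = time_call(sum, range(1_000_000))
print(total, seconds)
```

Timings from a single run can be noisy, so it may help to repeat each measurement a few times.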