

__Functions, File I/O and External Libraries__

<hr>

### Functions

#### Calling functions

We have already used many built-in functions. A function is a block of code that takes input arguments and, optionally, returns values, written for a specific purpose. In Python (and many other languages), a function call looks like this:

```python
>> output = function(input_argument)
```

For example:

In [1]:
range(5)

range(0, 5)

In Python 3, `range` returns a lazy `range` object rather than a list (see this discussion of [`iterator`](https://stackoverflow.com/questions/25653996/what-is-the-difference-between-list-and-iterator-in-python) versus `list`), so we convert the output to a `list` to see the values explicitly:

In [2]:
list(range(5))

[0, 1, 2, 3, 4]
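
As a small sketch of why the conversion to `list` is only needed for display (the variable name `r` is just for illustration): a `range` object already knows its length and supports indexing and membership tests without building the full list.

```python
# A range object stores only start, stop, and step; values are computed on demand.
r = range(5)

print(len(r))    # number of values the range would produce
print(r[2])      # indexing works without materializing a list
print(3 in r)    # membership tests work too
print(list(r))   # convert to a list only when the values must be shown explicitly
```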

As another example:

In [3]:
abs(-3.5)

3.5

Many functions take more than one input argument. For example:

In [4]:
list(range(5, 0, -1))

[5, 4, 3, 2, 1]

A second example: sorting a dictionary by its values. Calling `sorted` on a dictionary alone sorts its keys:

In [5]:
d = {'a': 100, 'c': 50, 'b': 70}
res = sorted(d)
print(res)

['a', 'b', 'c']


In [6]:
d

{'a': 100, 'c': 50, 'b': 70}

In [7]:
d['a']

100

In [8]:
sorted(d, key=lambda k: d[k])

['c', 'b', 'a']

#### Lambda functions

Aha, we just saw something different: `lambda`!

Lambda functions are just functions, except that they are anonymous (literally: they have no name). See [here](https://stackoverflow.com/questions/890128/why-are-python-lambdas-useful) for many good discussions. In short, anything you can do with `lambda` you can also do with a regular function. Yet `lambda` is handy because it is lightweight and anonymous.

The example above is actually a good example of when to use `lambda`:

In [6]:
sorted(d, key=lambda k: d[k])

['c', 'b', 'a']

A `lambda` function contains exactly one expression. In this case, the input is `k`, a key of the dictionary `d`, and the output is `d[k]`, the value stored in `d` under the key `k`. We are therefore sorting the dictionary keys by their values instead of by the keys themselves.
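
The same idea can be sketched in a couple of other ways. The variable names below are only illustrative; `d.get` and `d.items()` are standard dictionary methods.

```python
d = {'a': 100, 'c': 50, 'b': 70}

# Keys ordered by their values, using a lambda as the key function.
keys_by_value = sorted(d, key=lambda k: d[k])

# d.get already maps a key to its value, so it can stand in for the lambda.
keys_by_value_get = sorted(d, key=d.get)

# Sorting the (key, value) pairs keeps each value next to its key.
pairs_by_value = sorted(d.items(), key=lambda kv: kv[1])

print(keys_by_value)
print(keys_by_value_get)
print(pairs_by_value)
```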

#### Define our own functions

We are not limited to built-in functions. Let's now try to make our own. Before that, we need to be clear on the structure of a function:
```python
def func_name(arg1, arg2, arg3, ...):
    #####################
    # Do something here #
    #####################
    return output
```

\* *`return output` is NOT required; a function without a `return` statement returns `None`.*
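
A quick sketch of a function with no `return` statement (`greet` is a made-up name for illustration): calling it still works, and the call evaluates to `None`.

```python
def greet(name):
    # No return statement: the function only prints a message.
    print("Hello, " + name)

result = greet("world")
print(result)  # the value of a call to a function that has no `return`
```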

In the following example, we make use of `sum`, a built-in function to sum up numeric iterables.

In [None]:
def mySum(list_to_sum):
    return sum(list_to_sum)

In [None]:
mySum(range(5))

10

A slightly more involved version that does not use the built-in `sum` function.
- Don't remember the `for` loop? See [here](https://github.com/zhiyzuo/python-tutorial/blob/master/1-Variables-Data_Structures-Control_Logic.ipynb)

In [None]:
def mySumUsingLoop(list_to_sum):
    sum_ = 0  # start from zero so that an empty input also works
    for item in list_to_sum:
        sum_ += item  # add each element to the running total
    return sum_

In [None]:
mySumUsingLoop(range(5))

10

*The two example functions do nothing interesting; they only illustrate how to build custom functions.*

Finally, let's see how we can sort a dictionary by values using a regular function instead of `lambda`:

In [None]:
d

{'a': 100, 'c': 50, 'b': 70}

In [None]:
def my_key(key):
    return d[key]

In [None]:
sorted(d, key=my_key)

['c', 'b', 'a']

As you can see, `lambda` is more concise than defining a function explicitly.

---

In [7]:
import glob

### File I/O

This section covers the basics of reading and writing data with Python's built-in tools.

#### Write data to a file

In [9]:
!pwd
!ls

/content
sample_data


In [10]:
f = open("tmp1.csv", "w") # f is a file handle; "w" is the mode (w for write)
for item in range(6):
    f.write(str(item))
    f.write('hello')
    # add newline character
    f.write("\n")
    # alternatively, we can do:
    # f.write(str(item)+"\n") because we can concatenate two strings with `+`
f.close() # close the file handle so buffered data is flushed to disk and the file is released

In [11]:
!ls

sample_data  tmp1.csv


Check out the file we just created, `tmp1.csv`:

In [12]:
!cat tmp1.csv

0hello
1hello
2hello
3hello
4hello
5hello


Note that without the conversion from `int` to `str`, `f.write` raises a `TypeError`, because a file opened in text mode accepts only strings.
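
To see that error without touching the files above, here is a minimal sketch that writes to a throwaway temporary file (the file name `demo.txt` and the variable names are assumptions made for this example):

```python
import os
import tempfile

# Use a temporary directory so nothing in the working directory is changed.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w")
try:
    f.write(3)              # passing an int instead of a str
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
f.write(str(3) + "\n")      # converting to str first works
f.close()

f = open(path, "r")
written = f.read()
f.close()
print(repr(written))
```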

A more commonly used way:

In [13]:
with open("tmp2.csv", "w") as f: # f is a file handle; "w" is the mode (w for write)
    for item in range(4):
        f.write(str(item))
        f.write("\n") # add newline character

In [14]:
cat tmp2.csv

0
1
2
3


There is no need to call `f.close()`: the `with` statement closes the file automatically when the block ends.

See more here:
1. https://stackoverflow.com/questions/3012488/what-is-the-python-with-statement-designed-for
2. https://docs.python.org/3/whatsnew/2.6.html#pep-343-the-with-statement
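
A short sketch of what `with` does for us, using a temporary file (the path and the names `f` and `g` are illustrative). The `closed` attribute of a file object reports whether it has been closed.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("inside the with block\n")
    print(f.closed)   # the file is still open inside the block

print(f.closed)       # leaving the block closed it automatically

# The file is closed even if an error occurs inside the block.
try:
    with open(path, "a") as g:
        raise ValueError("something went wrong")
except ValueError:
    pass
print(g.closed)
```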

Occasionally, we need to _append new elements_ instead of _overwriting_ existing files. In this case, we should use `a` mode in our `open` function:

In [15]:
with open("tmp2.csv", "a") as f:
    for item in range(15, 19):
        f.write(str(item)+"\n")

In [16]:
cat tmp2.csv

0
1
2
3
15
16
17
18
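
The difference between the `w` and `a` modes can be sketched on a temporary file (the path and variable names are assumptions for this example): opening with `w` discards the existing contents, while `a` keeps them and writes at the end.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:
    f.write("first\n")

# Opening the same file with "w" again truncates it before writing.
with open(path, "w") as f:
    f.write("second\n")
with open(path, "r") as f:
    after_overwrite = f.read()

# Opening with "a" keeps what is there and appends to the end.
with open(path, "a") as f:
    f.write("third\n")
with open(path, "r") as f:
    after_append = f.read()

print(repr(after_overwrite))
print(repr(after_append))
```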


#### Read data from a file

To read a text file into Python, we use `r` mode (for _read_)

In [17]:
f = open("tmp1.csv", "r") # this time, use read mode
contents = [item for item in f] # list comprehension. This is the same as for-loop but more concise
print(contents)

['0hello\n', '1hello\n', '2hello\n', '3hello\n', '4hello\n', '5hello\n']


Usually, we do not like trailing newlines. We can use `strip` to remove them.

In [18]:
contents = [item.strip("\n") for item in contents] # strip the newline
print(contents)

['0hello', '1hello', '2hello', '3hello', '4hello', '5hello']


`map` applies a function to every element of an iterable, much like a _list comprehension_. See [here](https://stackoverflow.com/questions/10973766/understanding-the-map-function) for more discussions.

In [19]:
contents = [item.strip('hello') for item in contents] # strip removes any of the characters h, e, l, o from both ends
print(contents)

['0', '1', '2', '3', '4', '5']
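
One caveat worth sketching: `strip` treats its argument as a set of characters to remove from both ends, not as a substring. The strings below are made up for illustration, and the slicing approach is just one possible way to drop an exact suffix.

```python
line = "0hello"
print(line.strip("hello"))   # removes h, e, l, o from both ends
print(line.strip("olleh"))   # same result: the order of the characters does not matter

# A string that also starts with some of those characters loses them too.
tricky = "hole0hello"
stripped = tricky.strip("hello")
print(stripped)

# To remove only the exact suffix, check for it and slice it off.
suffix = "hello"
without_suffix = tricky[:-len(suffix)] if tricky.endswith(suffix) else tricky
print(without_suffix)
```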


In [20]:
int_values = list(map(int, contents)) # map the values into integer type
print(int_values)
f.close() # always remember to close the file handler

[0, 1, 2, 3, 4, 5]


In [None]:
type(int_values[0])

int
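
For comparison, the `map` call above can be written as a list comprehension. This sketch (variable names are illustrative) also shows that `map` is lazy in Python 3, which is why it is wrapped in `list`.

```python
contents = ['0', '1', '2', '3', '4', '5']

via_map = list(map(int, contents))                    # apply int to every element
via_comprehension = [int(item) for item in contents]  # the same thing, spelled out

print(via_map == via_comprehension)

# map returns a lazy object; values are produced only when requested.
lazy = map(int, contents)
first = next(lazy)
print(first)
```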

The same steps using `with`:

In [21]:
with open("tmp1.csv", "r") as f:
    contents = [item for item in f] # list comprehension. This is the same as for-loop but more concise
    contents = [item.strip("hello\n") for item in contents] # strip the characters h, e, l, o and the newline from both ends
    print('Before converting to `int`')
    print(contents)
    int_values = list(map(int, contents)) # map the values into integer type
    print('After...')
    print(int_values)

Before converting to `int`
['0', '1', '2', '3', '4', '5']
After...
[0, 1, 2, 3, 4, 5]


---

### Libraries

Oftentimes we need help, either from Python's own modules or from third-party packages, for complicated computation tasks. In these cases, we need to _import libraries_.

#### Built-in libraries

Python ships with many built-in packages (the standard library) that save us from reimplementing common, useful functions.

We will use __math__ as an example.

In [None]:
import math # use import to load a library

To use functions from the library, do: `library_name.function_name`. For example, when we want to calculate the logarithm using a function from `math` library, we can do `math.log`

In [None]:
x = 3
print("e^x = e^3 = %f"%math.exp(x))
print("log(x) = log(3) = %f"%math.log(x))

e^x = e^3 = 20.085537
log(x) = log(3) = 1.098612


You can also import one specific function:

In [None]:
from math import exp # You can import a specific function
print(exp(x)) # This way, you don't need to use math.exp but just exp

20.085536923187668


Or all:

In [None]:
from math import * # Import all functions

In [None]:
print(exp(x))
print(log(x)) # Without importing these names from math, calling `exp` or `log` raises a NameError

20.085536923187668
1.0986122886681098


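Depending on what you want to achieve, you can import a few functions or all of them (with `*`). Importing everything is convenient in an interactive session, but it can silently overwrite names you have already defined, so explicit imports are usually safer in scripts.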

#### External libraries

At times you will want advanced utility functions that Python itself does not provide. Many useful packages have been written by other developers.

We'll use __numpy__ as an example. (__numpy__, __scipy__, __matplotlib__, and probably __pandas__ will be the most important to you for data analyses.)

The easiest way to install Python packages is with <a href="https://packaging.python.org/installing/" target="_blank">pip</a>:

```bash
~$ pip install numpy scipy pandas
```

If you use Anaconda, I believe all of these come preinstalled.

Loading external libraries works the same way as loading built-in ones. To give a library an _alias_ for easier access, import it with `import library_long_name as short_name`. For example:

In [24]:
# After you install numpy, load it
import numpy as np # you can use np instead of numpy to call the functions in numpy package

In [25]:
x = np.array([[1,2,3], [4,5,7]], dtype=float) # create a numpy array object with float data type (the `np.float` alias was removed in recent NumPy versions)
print(x)
print(type(x))

[[1. 2. 3.]
 [4. 5. 7.]]
<class 'numpy.ndarray'>


We can check the dimensions of a `numpy.ndarray` through its `shape` attribute:

In [None]:
x.shape

(2, 3)

Unlike a `list`, an array uses a single data type for all of its elements:

In [26]:
y = np.array([1, 'yes'])
y

array(['1', 'yes'], dtype='<U21')

In [27]:
y[0], type(y[0])

('1', numpy.str_)

In [28]:
y_list = [1, 'yes']
y_list[0], type(y_list[0])

(1, int)
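
The same rule applies to numbers. As a small sketch (variable names are illustrative), mixing integers and floats gives a float array, and `astype` converts an existing array to another type:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float is enough to make the whole array float

print(ints.dtype)    # an integer type; the exact width depends on the platform
print(mixed.dtype)

# astype returns a converted copy and leaves the original array unchanged.
as_float = ints.astype(float)
print(as_float)
```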

__Scipy/Numpy__ provide extensive utilities for data manipulation and simple analyses:

In [22]:
from scipy.stats import pearsonr, spearmanr # correlation functions

In [29]:
print(pearsonr(x[1, :], x[0, :]))
print(spearmanr(x[1, :], x[0, :]))

(0.9819805060619655, 0.1210377183236774)
SpearmanrResult(correlation=1.0, pvalue=0.0)


__Pandas__ (Python Data Analysis Library) is a great package built around one data structure: the `DataFrame`.

If you're familiar with `R`, you will probably love the `pandas.DataFrame` data structure.

In [None]:
import pandas as pd

In [None]:
x

array([[1., 2., 3.],
       [4., 5., 7.]])

In [None]:
x_df = pd.DataFrame(x)
x_df

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  7.0


Easy import/export

In [None]:
x_df.to_csv('tmp_pd.csv', index=False) # `index=False`: do not write row indices to file

In [None]:
df = pd.read_csv('tmp_pd.csv')

In [None]:
df

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  7.0


---

### Quick Intro to Numpy

Instead of using the native data structures, we use `numpy.ndarray` for data analytics most of the time. While they are not as "flexible" as lists, they are easy to use and have better performance. As Numpy's official documentation states:
> NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers.

As we were using it just now, the most common alias for `numpy` is `np`:

In [None]:
import numpy as np

#### Create arrays

Depending on the type of analysis we plan to do later, we can choose the most appropriate way to initialize an array.

##### By hand

This is very similar to creating a list of elements manually, except that we wrap the list in `np.array()`.

In [None]:
arr = np.array([1,2,3,8])
arr

array([1, 2, 3, 8])

In [None]:
arr.shape

(4,)

Multidimensional arrays: nested lists separated by commas

1 by 4: 1 row and 4 columns

In [None]:
arr = np.array([[1,2,3,8]])
arr.shape

(1, 4)

In [None]:
arr

array([[1, 2, 3, 8]])

3 by 4: 3 rows and 4 columns

In [None]:
arr = np.array([[1,2,3,8], [3,2,3,2], [4,5,0,8]])
arr.shape

(3, 4)

In [None]:
arr

array([[1, 2, 3, 8],
       [3, 2, 3, 2],
       [4, 5, 0, 8]])

##### By functions

There are many special array initialization methods to call:

In [None]:
np.zeros([3,5], dtype=int)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [None]:
np.ones([3,5])

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [None]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

#### Arithmetic operations

The rules are very similar to R: operations are generally element-wise.

In [None]:
arr

array([[1, 2, 3, 8],
       [3, 2, 3, 2],
       [4, 5, 0, 8]])

In [None]:
arr * 6

array([[ 6, 12, 18, 48],
       [18, 12, 18, 12],
       [24, 30,  0, 48]])

In [None]:
arr - 5

array([[-4, -3, -2,  3],
       [-2, -3, -2, -3],
       [-1,  0, -5,  3]])

In [None]:
np.exp(arr)

array([[2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 2.98095799e+03],
       [2.00855369e+01, 7.38905610e+00, 2.00855369e+01, 7.38905610e+00],
       [5.45981500e+01, 1.48413159e+02, 1.00000000e+00, 2.98095799e+03]])

Note that to perform matrix multiplication we need the `@` operator or the `.dot` method, since `*` still means element-wise multiplication:

In [None]:
arr_2 = np.array([[1], [3], [2], [0]])
arr_2

array([[1],
       [3],
       [2],
       [0]])

In [None]:
arr @ arr_2

array([[13],
       [15],
       [19]])

In [None]:
arr.dot(arr_2)

array([[13],
       [15],
       [19]])
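
To make the contrast concrete, here is a small sketch on two square matrices, where both operations are defined (`a` and `b` are made-up arrays for illustration):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

elementwise = a * b   # multiplies entries at the same position
matrix = a @ b        # rows of a times columns of b

print(elementwise)
print(matrix)
print(np.array_equal(matrix, a.dot(b)))   # @ and .dot agree for 2-d arrays
```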

##### Operation based on itself

There are many methods that compute statistics of the array itself along a given axis:
- `axis=1` works row-wise: one result per row
- `axis=0` works column-wise: one result per column

In [None]:
arr

array([[1, 2, 3, 8],
       [3, 2, 3, 2],
       [4, 5, 0, 8]])

In [None]:
arr.max()

8

In [None]:
arr.max(axis=1)

array([8, 3, 8])

In [None]:
arr.max(axis=0)

array([4, 5, 3, 8])

In [None]:
arr.cumsum()

array([ 1,  3,  6, 14, 17, 19, 22, 24, 28, 33, 33, 41])

In [None]:
arr.cumsum(axis=1)

array([[ 1,  3,  6, 14],
       [ 3,  5,  8, 10],
       [ 4,  9,  9, 17]])
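
The effect of `axis` is easy to check on the shapes of the results. A minimal sketch using `sum` on the same 3 by 4 array (the names `row_sums` and `col_sums` are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3, 8],
                [3, 2, 3, 2],
                [4, 5, 0, 8]])

row_sums = arr.sum(axis=1)   # one value per row: the columns are collapsed
col_sums = arr.sum(axis=0)   # one value per column: the rows are collapsed
total = arr.sum()            # no axis: everything is collapsed to one number

print(row_sums, row_sums.shape)
print(col_sums, col_sums.shape)
print(total)
```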

#### Indexing and slicing

The most important part is how to index and slice a `np.array`. It is very similar to indexing a `list`, except that we may need more than one index, because most real-life datasets have more than one dimension.

##### 1 dimensional case

In [None]:
a1 = np.array([1,2,8,100])
a1

array([  1,   2,   8, 100])

In [None]:
a1[0]

1

In [None]:
a1[-2]

8

In [None]:
a1[[0,1,3]]

array([  1,   2, 100])

We can also use boolean values to index
- `True` means we want this element

In [None]:
a1 > 3

array([False, False,  True,  True])

In [None]:
a1[a1 > 3]

array([  8, 100])
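
Conditions can also be combined. In this sketch (the thresholds are arbitrary), note that NumPy uses `&` and `|` for element-wise "and" and "or", and that each condition needs its own parentheses:

```python
import numpy as np

a1 = np.array([1, 2, 8, 100])

# Elements greater than 1 AND smaller than 50.
mask = (a1 > 1) & (a1 < 50)
selected = a1[mask]

# Elements smaller than 2 OR greater than 50.
either = a1[(a1 < 2) | (a1 > 50)]

print(mask)
print(selected)
print(either)
```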

##### 2 dimensional case

In [None]:
arr

array([[1, 2, 3, 8],
       [3, 2, 3, 2],
       [4, 5, 0, 8]])

Indexing with a single number returns a subset of the original multidimensional array, which is itself an array:

In [None]:
arr[0]

array([1, 2, 3, 8])

In [None]:
type(arr[0])

numpy.ndarray

Since we have 2 dimensions now, there are 2 indices we can use for indexing the 2 dimensions respectively

In [None]:
arr[0,0]

1

We can use `:` to indicate everything along that axis

In [None]:
arr[1]

array([3, 2, 3, 2])

In [None]:
arr[1, :]

array([3, 2, 3, 2])

In [None]:
arr[1,:] == arr[1]

array([ True,  True,  True,  True])

In [None]:
arr[:, 1]

array([2, 2, 5])

##### 3 dimensional case

As a final example, we look at a 3d array:

In [None]:
arr_3 = np.random.randint(low=0, high=100, size=24)
arr_3

array([69,  5, 51, 67, 35, 20,  8, 40, 66, 94,  0, 25, 64, 78, 47, 48, 98,
       59, 86, 82, 67, 63, 81, 34])

We can use [`reshape`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.reshape.html) to manipulate the shape of an array

In [None]:
arr_3 = arr_3.reshape(3,4,2)
arr_3

array([[[69,  5],
        [51, 67],
        [35, 20],
        [ 8, 40]],

       [[66, 94],
        [ 0, 25],
        [64, 78],
        [47, 48]],

       [[98, 59],
        [86, 82],
        [67, 63],
        [81, 34]]])

In [None]:
arr_3[0]

array([[69,  5],
       [51, 67],
       [35, 20],
       [ 8, 40]])

In [None]:
arr_3[:, 3, 1]

array([40, 48, 34])

In [None]:
arr_3[2, 3, 1]

34
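
Because the array above is random, its values change on every run. As a reproducible sketch of the same indexing patterns, the example below builds a 3d array with `np.arange` instead (the name `cube` is made up for illustration), so that the element at position `[i, j, k]` equals `i*8 + j*2 + k`:

```python
import numpy as np

cube = np.arange(24).reshape(3, 4, 2)   # values 0..23 laid out as 3 blocks of 4 by 2

block = cube[0]             # first block: a 4 by 2 array
column = cube[:, 3, 1]      # row 3, column 1 from every block
element = cube[2, 3, 1]     # a single element: 2*8 + 3*2 + 1
first_cols = cube[:, :, 0]  # column 0 of every row in every block

print(block.shape)
print(column)
print(element)
print(first_cols.shape)
```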