## Files and Loops

In this session, we’ll learn how to work with files and use loops to iterate through lists. We’ll be working with crime rate data for 73 cities in the United States.
Datasets are often represented in files that you can download and manipulate.
Before we get started, we’ll first need to learn how to work with files in Python.


### Open files

To open a file in Python, we use the `open()` function. This function accepts two different arguments (inputs) in the parentheses, always in the following order:

- the name of the file (as a string)
- the mode of working with the file (as a string)

We’ll learn about the various modes later. For now, we’ll just use `"r"`, the mode for reading files.

For example, to open a file named `story.txt` in read mode, we write the following:

    open("story.txt", "r")

The `open()` function returns a File object. This object stores the information we passed in, and allows us to call methods specific to the File class. We can assign the File object to a variable so we can refer to it later:

    a = open("story.txt", "r")

Note that the File object, `a` , won’t contain the actual contents of the file. It’s instead an object that acts as an interface to the file and contains methods for reading in and modifying the file’s contents (which we’ll cover on the next slide).
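
To see this distinction in practice, here is a minimal sketch. It first creates a small throwaway file so that it runs anywhere; the file name `story_demo.txt`, its text, and the write mode `"w"` used to create it are assumptions made for this illustration and are not part of the session material.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The name "story_demo.txt" and its text are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "story_demo.txt")
with open(demo_path, "w") as out:   # "w" opens the file for writing
    out.write("Once upon a time...")

a = open(demo_path, "r")
print(type(a))     # a file object, not a string
print(a.mode)      # the mode we passed in: r
print(a.closed)    # False while the file is still open
a.close()          # release the file once we are done with it
print(a.closed)    # True
```

The object `a` only describes an open connection to the file. The text itself has not been read at this point.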

#### Exercise

- Use the `open()` function to create a File object.
- The name of the file to open is `"movie_metadata.csv"` . Access the file in read mode ( `"r"` ).
- Assign this File object to the variable `f`.

In [None]:
# Exercise
#open('movie_metadata.csv','r')

movie_file = open('movie_metadata.csv', 'r')
movie_data = movie_file.read()
print(movie_data)

### Reading from files
File objects have a `read()` method that returns a string representation of the text in a file.

Unlike the `append()` method from the previous session, the `read()` method returns a value instead of modifying the object that calls the method.

In the following code, we use the `read()` method to read the contents of `"test.txt"` into a string, and assign that string to `g`:


    f = open("test.txt", "r")
    g = f.read()

Since `g` is a string, we can use the `print()` function to display the contents of the file:

    f = open("test.txt", "r")
    g = f.read()
    print(g)
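
The same pattern can be tried end to end without an existing file. The sketch below writes a two-line file (its name and contents are invented for this example) and reads it back. It uses a `with` block, which is not covered above but is a common way to make sure the file is closed once the block ends.

```python
import os
import tempfile

# Hypothetical file, created only so that the example can run anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "test_demo.txt")
with open(demo_path, "w") as out:
    out.write("first line\nsecond line")

# Read the whole file into one string.
with open(demo_path, "r") as f:
    g = f.read()

print(type(g))     # read() gives back a string
print(g)           # the two lines of text
print(f.closed)    # True: the with block closed the file for us
```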

#### Exercise

- Run the `read()` method on the File object `f` to return the string representation of `movie_metadata.csv`.
- Assign the resulting string to a new variable named `data`.

In [None]:
# Exercise
movie_file = open('movie_metadata.csv', 'r')
movie_data = movie_file.read()
#print(movie_data)
split_data = movie_data.split('\n')

In [None]:
rows = []
for i in split_data[:]:
    #print('current i: ',i)
    split_row = i.split(',')
    #print('result of split: ',split_row)
    rows.append(split_row)
    
print(rows[0])

In [None]:
print(rows[1])

In [None]:
print(rows[2])

In [11]:
movie_file = open('movie_metadata.csv', 'r')
movie_data = movie_file.read()
split_data = movie_data.split('\n')

rows = []
for i in split_data[:]:
    split_row = i.split(',')
    rows.append(split_row)

# extract the duration of every movie
# compute the average duration
# call it mean_duration
# print it
movie_durations = []
duration = 0
for item in rows[1:]:
    if len(item[3]) > 0:
        duration = int(item[3])
    else:
        duration = 0
    movie_durations.append(duration)
    
print(movie_durations[:5])

total_duration = 0
for i in movie_durations:
    total_duration = total_duration + i
    
mean_duration = total_duration/len(movie_durations)
print('total duration: ', total_duration)
print('mean duration: ', mean_duration)

[178, 169, 148, 164, 136]
total duration:  527922
mean duration:  107.04014598540147


### Splitting
To make our string object data more useful, let’s convert it into a list. Here’s a preview of how the dataset looks:

    movie_title,director_name,color,duration,actor_1_name,language,country,title_year\nAvatar,James Cameron,Color,178,CCH Pounder,English,USA,2009

Each line is separated by the string `\n` , which is referred to as the new-line character. When we open a text file in a text editor, the editor will automatically split the text
and create a new line wherever it sees the string `\n` .

In Python, we can use the `split()` method to turn a string object into a list of
strings, like so:

    ["Avatar,James Cameron,Color,178,CCH Pounder,English,USA,2009", 
    "Pirates of the Caribbean: At World's End,Gore Verbinski,Color,169,Johnny Depp,English,USA,2007", 
    "Spectre,Sam Mendes,Color,148,Christoph Waltz,English,UK,2015"]

The `split()` method takes a string input corresponding to the delimiter, or separator. This delimiter determines how the string is split into elements in a list.

For example, the delimiter for the movie data we just looked at is `\n`. Many other files use commas to separate elements:

     sample = "john,plastic,joe"
     split_list = sample.split(",")
     # split_list is a list of _strings_: ["john", "plastic", "joe"]
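
The two delimiters can be combined. As a rough sketch, the snippet below takes the two-line preview of the movie data shown earlier, splits it into lines on `"\n"`, and then splits each line into fields on `","`. The variable names `lines`, `header`, and `first_movie` are chosen here for illustration.

```python
# Two-line preview copied from the dataset preview in this section.
data = ("movie_title,director_name,color,duration,actor_1_name,language,country,title_year\n"
        "Avatar,James Cameron,Color,178,CCH Pounder,English,USA,2009")

lines = data.split("\n")            # one string per line of the file
header = lines[0].split(",")        # column names
first_movie = lines[1].split(",")   # values for the first movie

print(len(lines))        # 2
print(header[3])         # duration
print(first_movie[3])    # 178 (still a string, not a number)
```

Splitting on commas this way assumes that no field contains a comma of its own. For files where fields may be quoted, Python’s built-in `csv` module is the usual alternative.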
     
#### Exercise
- Split the string object `data` on the new-line character `"\n"` , and store the result in a variable named `rows`.
- Then, use the `print()` function to display the first five elements in `rows`.

In [None]:
# Exercise



## Introduction to Functions

### Motivating functions
A function is a packaged body of code that we can reuse by __calling__ it with the relevant parameters.
The parameters that a function takes are called the __inputs__ of the function, and the result that it returns is called the __output__.

Other than reusability, there are 3 main advantages of using functions:
- They allow us to use other people’s code without the necessity to have a deep understanding of how it was written. We call this __information hiding__.
- They break down complex logic into smaller components or modules. We refer to this as __modularity__. Modularity makes it easier for someone else to read, understand, use, and build upon our code.
- They streamline our code and make it easier to maintain. Programmers reuse the same functions in multiple situations across a project. This means that they generalize the function as much as possible to maximize its usefulness. We call this process __abstraction__.

### Writing our own functions
The syntax for defining a function consists of 5 parts:
- `def` keyword - For Python to interpret the following code as a function
- Name - To refer to when we need to call the function later
- Arguments - Input value(s) that the function takes in
- Body - The code that the function executes
- Return value - The value that the function returns to the user when the function terminates

Let us examine the syntax further, using an example function that returns the first element of a list:

    def first_elt(input_lst):
        first = input_lst[0]
        return first

Things to note:
- Indentation of the function: after the colon, we indent the remainder of the function by one `tab` , which is the equivalent of 4 `space bar` strokes.
- `first` and `input_lst` are temporary variables, which means that they are only accessible inside the function.
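
The following sketch calls `first_elt()` on a made-up list and then checks that the temporary variable `first` is not available outside the function. The list `animals` and the flag `first_is_visible` are names introduced only for this example, and the result assumes a fresh session in which no other variable called `first` exists.

```python
def first_elt(input_lst):
    first = input_lst[0]   # `first` exists only while the function runs
    return first

animals = ["dog", "cat", "rabbit"]   # made-up example list
result = first_elt(animals)
print(result)                        # dog

# Outside the function, the name `first` is not defined.
try:
    print(first)
    first_is_visible = True
except NameError:
    first_is_visible = False
print(first_is_visible)              # False when run in a fresh session
```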

#### Exercise

- Write a function, with a definition, name, argument(s), body and return value, that returns a list containing the names of the movies in `movie_data`.
- This function is expected to behave similarly to `first_elt()` on the previous slide, but for multiple lists.
    - Give the function a name that describes what it does; `first_elts()` is a good example, but feel free to be creative.
    - Declare an empty list.
    - Use a `for` loop to extract the first element of each list, and append these elements to the empty list.
    - Return the list.
- Assign the returned list to `movie_names` .
- Display the first 5 elements of `movie_names` using the `print()` function.

In [None]:
# Exercise





### Function with multiple return paths

Even though we suggested that `return` signifies the end of a function, a function can have multiple `return` statements.
We can take advantage of this to add an `if` statement that returns one value if a certain criterion is met, and another value otherwise. For example,

     def is_blah(input_lst):
         if input_lst[0] == "blah":
             return True
         else:
             return False
    
Notice that there is a further layer of indentation after the `if` and `else` statements.
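
Because a comparison such as `==` already evaluates to `True` or `False`, the same check can also be written with a single `return`. The sketch below puts both versions side by side; `is_blah_short()` is a name introduced here only for comparison.

```python
def is_blah(input_lst):
    if input_lst[0] == "blah":
        return True      # the function stops here when the condition holds
    else:
        return False

def is_blah_short(input_lst):
    # The comparison itself produces the boolean we want to return.
    return input_lst[0] == "blah"

# Both versions agree on a matching and a non-matching list.
for sample in (["blah", 1], ["other", 2]):
    print(is_blah(sample), is_blah_short(sample))
```

Both forms behave the same way. The longer one makes the two return paths explicit, which is the point of this slide.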

#### Exercise

- Write a function named `is_usa()` that checks whether or not a movie was made in the United States.
    - Check the `movie_metadata.csv` file to see which column corresponds to the nationality of the movie. Don’t forget to subtract one to find the true index of the column in the list.
    - Use an `if` statement to compare the right column of the list with the string `"USA"`. The equality operation is case sensitive, so make sure to get the capitalization right.
    - Return `True` if the condition is met, and `False` otherwise.

- Try it with a few movies in `movie_data`.
- Call it on `wonder_woman` and store the result in `wonder_woman_usa`.

        wonder_woman = ['Wonder Woman', 'Patty Jenkins', 'Color', 141, 'Gal Gadot', 'English', 'USA', 2017]

In [None]:
# Exercise




### Functions with multiple arguments

If we wanted to check whether the value in the 7th column is `"UK"` instead, we would have to write a completely separate function:

     def is_uk(input_lst):
         if input_lst[6] == "UK":
             return True
         else:
             return False
             
However, you can see that this function is almost the same as `is_usa()`, except for the string it checks for.

This can give us the intuition that there is another layer of abstraction we can perform.

We could write a function that takes in two inputs, namely, the list and the string to check for:

    def equals_str(input_lst, input_str):
        if input_lst[6] == input_str:
            return True
        else:
            return False

Now, `is_usa(input_lst)` behaves the same way as `equals_str(input_lst, "USA")` and `is_uk(input_lst)` behaves the same way as `equals_str(input_lst, "UK")`.

### Functions with multiple arguments

Because there is more than one argument in this function, the order in which we pass the arguments becomes important.

For example, `equals_str(movie_data[4], "UK")` would be correct; however, `equals_str("UK", movie_data[4])` would not, because the function expects to get the list first and the string second.

If we want to override this, we have to use __named arguments__ instead of the default, __positional arguments__.

If we explicitly write the names of the arguments as we provide them, their positions
become unimportant.

This means that `equals_str(input_str="UK", input_lst=movie_data[4])` does not result in an error.

Naming arguments does not add any functionality, but it can improve the readability of the code, which is important if you are working on a team.
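
As a small sketch of the difference, the code below calls a two-argument function positionally, with named arguments in reversed order, and with the positional arguments swapped. It assumes that the country sits at index 6, as in `is_uk()`; the row `spectre` is typed in by hand from the preview of the data, and `swapped_failed` is a helper flag introduced for this example.

```python
def equals_str(input_lst, input_str):
    # Assumption: the country is stored at index 6, as in is_uk().
    return input_lst[6] == input_str

# Row for "Spectre", split into a list of strings.
spectre = ["Spectre", "Sam Mendes", "Color", "148",
           "Christoph Waltz", "English", "UK", "2015"]

positional = equals_str(spectre, "UK")                  # list first, string second
named = equals_str(input_str="UK", input_lst=spectre)   # order no longer matters
print(positional, named)                                # True True

# Swapping positional arguments makes the function index into the string "UK",
# which is too short to have an element at index 6.
try:
    equals_str("UK", spectre)
    swapped_failed = False
except IndexError:
    swapped_failed = True
print(swapped_failed)                                   # True
```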

Finally, we can abstract out another layer by adding a third argument that determines which column of the list holds the attribute being checked.

#### Exercise

- Write a function `index_equals_str()` that takes in three arguments: a list, an index and a string, and checks whether that index of the list is equal to that string.
- Call the function with a different order of the inputs, using named arguments.
- Call the function on `wonder_woman` to check whether or not it is a movie in color, store it in `wonder_woman_in_color`, and print the value.

        wonder_woman = ['Wonder Woman', 'Patty Jenkins', 'Color', 141, 'Gal Gadot', 'English', 'USA', 2017]

In [None]:
# Exercise





### Optional arguments

Functions can also have __optional arguments__.

Optional arguments have default values that they take on unless a different value is
provided by the user.

Let’s say we want to count the number of movies in the list `movie_data` . Our intuition might be to do the following:

     def naive_counter(input_lst):
         num_elt = 0
         for each in input_lst:
             num_elt = num_elt + 1
         return num_elt

However, if we attempt to call this function with `movie_data` as its argument, we get a wrong answer.

This is because the first item in the list is a header row and is also counted by the counter.

Of course, we can get around this by subtracting one from the result, but manipulating the function that way would cause it to be unusable in cases where there is no header row.

This is not generalizable, so it is not a neat solution.

Instead, we can use an argument that has a default value that can be manipulated:

     def counter(input_lst,header_row = False):
         num_elt = 0
         if header_row == True:
             input_lst = input_lst[1:len(input_lst)]
         for each in input_lst:
             num_elt = num_elt + 1
         return num_elt
         
Now, the function will behave as we expected:

     print(counter(movie_data))
     # returns 4933
     print(counter(movie_data, True))
     # returns 4932
     
If we are concerned about the readability of our code by co-workers, we can name the optional argument as well:

     >>> print(counter(movie_data, header_row = True))
     4932
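
The numbers above depend on the full dataset, so here is a tiny version that can be checked by hand. It is only a sketch: `toy_rows` is a made-up list with one header row and two movies, and the condition is written as `if header_row:`, which behaves the same as `header_row == True` for boolean values.

```python
def counter(input_lst, header_row=False):
    if header_row:
        input_lst = input_lst[1:]   # skip the header row
    num_elt = 0
    for each in input_lst:
        num_elt = num_elt + 1
    return num_elt

# Made-up data: one header row followed by two movies.
toy_rows = [["movie_title", "country"], ["Avatar", "USA"], ["Spectre", "UK"]]

print(counter(toy_rows))                    # 3: default, header is counted
print(counter(toy_rows, True))              # 2: header skipped
print(counter(toy_rows, header_row=True))   # 2: same call, named argument
print(len(toy_rows))                        # 3: the original list is unchanged
```

Slicing inside the function creates a new list, so the caller’s list keeps its header row.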
     
#### Exercise

- Write a function named `feature_counter()` that combines the logic of the `index_equals_str()` and `counter()` functions.
- Use this to find out how many of the movies were made in USA, and store the value in `num_of_us_movies` .

In [None]:
# Exercise 





### Calling a function inside another function

Now, we have all the tools we need to create the statistics summary function we explained in the beginning of the session.

However, we would like `summary_statistics()` to be a function itself, and re-writing all of the code inside `feature_counter()` in `summary_statistics()` defeats the purpose of using a function.

You may remember that one of the big advantages of using a function is abstraction: the fact that it saves us from having to write the same code twice.

In this vein, the last feature of functions that we will use is the ability to call a function inside another function.

The body of one function can include a call to another function.

Let’s say we want to build a function `list_counter()` that will count the elements in multiple lists, and make a separate list holding these values. This is how we want the function to operate:

     >>> lists = [["dog","cat","rabbit"],[1,2,3,4],[True]]
     >>> list_count = list_counter(lists)
     >>> print(list_count)
     [3, 4, 1]
         
Even though this seems like a complicated problem, because we have a counter function, it will not take more than 6 lines:

     def list_counter(input_lst):
         final_list = []
         for each in input_lst:
             num_elt = counter(each)
             final_list.append(num_elt)
         return final_list

As you can see, we called the user-defined function `counter()` and assigned its return value to `num_elt`.

Each time the `for` loop starts a new iteration, `counter()` is called with a different argument (the current value assigned to `each`) and returns a different value. Whenever we define a new function, we can call it inside another function using this syntax.
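
Putting the two functions together gives a version that can be run as is. This is a sketch that repeats the definitions from this session and, as an extra check that is not part of the original material, compares the result with Python’s built-in `len()`.

```python
def counter(input_lst, header_row=False):
    if header_row:
        input_lst = input_lst[1:]
    num_elt = 0
    for each in input_lst:
        num_elt = num_elt + 1
    return num_elt

def list_counter(input_lst):
    final_list = []
    for each in input_lst:
        num_elt = counter(each)       # call one function inside another
        final_list.append(num_elt)
    return final_list

lists = [["dog", "cat", "rabbit"], [1, 2, 3, 4], [True]]
list_count = list_counter(lists)
print(list_count)                     # [3, 4, 1]

# Cross-check against the built-in len() function.
expected = [len(each) for each in lists]
print(list_count == expected)         # True
```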

#### Exercise

Write a `summary_statistics()` function that will take `movie_data` as input, and output a dictionary that will give useful numbers from the data.

- Define `summary_statistics()` with one argument, an input list.
- Use the `feature_counter()` with the relevant arguments to count the following properties and make them equal to the corresponding variables.
    - Assign the number of movies made in Japan to `num_japan_films`.
    - Assign the number of movies in color to `num_color_films`.
    - Assign the number of movies in English to `num_films_in_english`.
    
- Create a dictionary that associates the keys (`japan_films`, `color_films`, `films_in_english`) with the corresponding variables.

- Return the dictionary.

Call the function with `movie_data` as its input, and store its value in `summary`.

In [None]:
# Exercise




### `lambda` expressions

Small anonymous functions can be created with the `lambda` keyword.

This function returns the sum of its two arguments: `lambda a, b: a+b`.

Lambda functions can be used wherever function objects are required. They are syntactically restricted to a single expression. Semantically, they are just syntactic sugar for a normal function definition. Like nested function definitions, lambda functions can reference variables from the containing scope:

    def make_incrementor(n):
        return lambda x: x + n
    f = make_incrementor(42)
    print(f(0))
    print(f(1))

The above example uses a lambda expression to return a function.
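
To illustrate the "syntactic sugar" point, the sketch below defines the same addition once with `lambda` and once with `def`, and then passes a lambda as the `key` argument of the built-in `sorted()` function. The list `movies` is typed in by hand from the data preview; the names `add`, `add_def`, and `by_duration` are introduced for this example.

```python
# The same function written twice: as a lambda and as a regular definition.
add = lambda a, b: a + b

def add_def(a, b):
    return a + b

print(add(2, 3), add_def(2, 3))   # 5 5

# A lambda is handy when another function expects a function as input.
movies = [("Avatar", 178),
          ("Pirates of the Caribbean: At World's End", 169),
          ("Spectre", 148)]
by_duration = sorted(movies, key=lambda movie: movie[1])
print(by_duration[0][0])          # Spectre (shortest duration)
```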

#### Exercise
Write a lambda expression that gives $f(x) = x_1 + x_2 - x_4^2$.

In [None]:
# Exercise






## Numpy basics

### Introducing NumPy

From the first Session, we learned that Python lists offer a few advantages when representing data:

- Lists can contain mixed types
- Lists can shrink and grow dynamically

Using Python lists to represent and work with data also has a few key disadvantages:
- To support their flexibility, lists tend to consume lots of memory
- They struggle to work with medium and larger sized datasets

While there are many different ways to classify programming languages, an important way that keeps performance in mind is the difference between __low-level__ and __high-level languages__.

Python is a high-level programming language that allows us to quickly write, prototype, and test our logic.
The C programming language, on the other hand, is a low-level programming language that is highly performant but has a much slower human workflow.

NumPy is a library that combines the flexibility and ease-of-use of Python with the speed of C.

### Creating arrays

The core data structure in NumPy is the `ndarray` object, which stands for __N-dimensional array__. An array is a collection of values, similar to a list. __N-dimensional__ refers to the number of indices needed to select individual values from the object.

A 1-dimensional array is often called a __vector__ while a 2-dimensional array is called a __matrix__. Both of these terms are borrowed from linear algebra.


![](n-dimarray.png)

To use NumPy, we first need to import it into our environment. NumPy is commonly imported using the alias `np` :

    import numpy as np

We can directly construct arrays from lists using the `numpy.array()` function. To construct a vector, we need to pass in a single list (without nesting):

     vector = np.array([5, 10, 15, 20])
     
The `numpy.array()` function also accepts a list of lists, which we use to create a matrix (where each sublist is a row of the matrix):

    matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
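
The idea that the number of dimensions equals the number of indices needed to select a value can be checked directly. The sketch below uses the `ndim` attribute of an array, which is not introduced above, together with ordinary indexing.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

print(vector.ndim)    # 1: one index selects a value
print(matrix.ndim)    # 2: two indices (row, column) select a value

print(vector[2])      # 15
print(matrix[1, 2])   # 30 (second row, third column)
```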

#### Exercise

- Create a vector from the list `[10, 20, 30]`, assign the result to the variable `vector` .
- Create a matrix from the list of lists `[[5, 10, 15], [20, 25, 30], [35, 40, 45]]`, assign the result to the variable `matrix` .

In [None]:
# Exercise





### Array shape

Vectors have a certain number of elements. The vector below has 5 elements:

|1986| Western Pacific| Viet Nam| Wine| 0|
|----|---|---|----|---|

Matrices instead use rows and columns, which matches how we thought about datasets in the first session.
The matrix below has 3 rows and 5 columns, often referred to as a 3 by 5 matrix:



|row\col |1|2|3|4|5|
|----|---|---|----|---|---|
|1|1986| Western Pacific| Viet Nam| Wine| 0|
|2|1986|Americas|Uruguay|Other|0.5|
|3|1985|Africa|Cote d’Ivoire|Wine|1.62|

It’s often useful to know how many elements an array contains.

We can use the `ndarray.shape` property to figure out how many elements are in the array.

For vectors, the shape property contains a tuple with 1 element. A tuple is a kind of list where the elements can’t be changed.

     vector = np.array([1, 2, 3, 4])
     print(vector.shape)
     
The code above would result in the tuple `(4,)`. This tuple indicates that the array `vector` has one dimension, with length 4, which matches our intuition that `vector` has 4 elements.

For matrices, the shape property contains a tuple with 2 elements.

     matrix = np.array([[5, 10, 15], [20, 25, 30]])
     print(matrix.shape)
     
The above code will result in the tuple `(2, 3)`, indicating that `matrix` has 2 rows and 3 columns.
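
Since the shape is a tuple, it can be unpacked into separate variables. The short sketch below does that and also shows the `size` attribute (the total number of elements), which is not covered above; `n_rows` and `n_cols` are names chosen for this example.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30]])

n_rows, n_cols = matrix.shape     # unpack the tuple (2, 3)
print(n_rows, n_cols)             # 2 3
print(matrix.size)                # 6 elements in total
print(n_rows * n_cols)            # 6, the same number
print(len(matrix))                # 2: len() only counts the rows
```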

#### Exercise

- Assign the shape of `vector` to `vector_shape`.
- Assign the shape of `matrix` to `matrix_shape`.
- Display both `vector_shape` and `matrix_shape` using the `print()` function.

In [None]:
# Exercise





### Elementwise operations

We can perform elementwise operations on NumPy arrays. For example, `a-b` gives the elementwise difference of vectors $a$ and $b$.
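
A quick sketch with made-up numbers shows how this differs from Python lists, where `+` joins two lists instead of adding their elements.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

print(a + b)    # [11 22 33]
print(a * b)    # [10 40 90]
print(a * 2)    # [2 4 6]: the scalar is applied to every element

# With plain lists, + concatenates rather than adds.
print([1, 2, 3] + [10, 20, 30])   # [1, 2, 3, 10, 20, 30]
```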

As an example of elementwise division, let’s find the 3-vector of asset returns $r$ from the (numpy arrays of) initial and final prices of assets.

    import numpy as np
    p_initial = np.array([22.15, 89.32, 56.77])
    p_final = np.array([23.05, 87.32, 53.13])
    r = (p_final - p_initial) / p_initial
    r
    # array([ 0.04063205, -0.0223914 , -0.06411837])

### Numpy inbuilt functions

NumPy has a wide variety of inbuilt functions that come in handy when you are coding a specific operation of your own. Each function has its own name, requires its own arguments, and gives its own output. We recommend searching for the function you need and reading the corresponding NumPy documentation.

Other useful numpy functions:
- stacking and concatenating: `np.concatenate(), np.vstack(), np.hstack()`
- algebraic operations: `np.linalg.norm(), np.mean(), np.std(), np.matmul(), np.transpose()`
- mathematics functions: `np.sin(), np.cos(), np.exp(), np.log(), np.arccos()`
- index operations: `np.where(), np.unique()`


We can do a lot more by combining our own functions with NumPy’s inbuilt functions.
For example, say we want to define a function that calculates the nearest neighbour of a vector in a list of vectors.

    def near_neigh(x,z):
        distance = []
        for vector in z:
            distance.append(np.linalg.norm(x - vector))
        return z[np.argmin(distance)]
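
Here is a usage sketch for `near_neigh()` with made-up vectors, followed by one possible loop-free variant. The variant stacks the vectors into a matrix and uses the `axis` argument of `np.linalg.norm()`; it is an alternative shown for comparison, and the names `stacked`, `distances`, and `nearest_vec` are introduced here.

```python
import numpy as np

def near_neigh(x, z):
    distance = []
    for vector in z:
        distance.append(np.linalg.norm(x - vector))
    return z[np.argmin(distance)]

# Made-up example: which vector in z is closest to x?
x = np.array([5, 6])
z = [np.array([7, 1]), np.array([5, 5]), np.array([0, 0])]

nearest = near_neigh(x, z)
print(nearest)                        # [5 5]

# Loop-free variant: one row per vector, one distance per row.
stacked = np.array(z)                 # shape (3, 2)
distances = np.linalg.norm(stacked - x, axis=1)
nearest_vec = stacked[np.argmin(distances)]
print(nearest_vec)                    # [5 5]
```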

#### Exercise

Define a function that computes the angle between two vectors.
$$ \theta = \mathrm{cos}^{-1} \left( \frac{x^T y}{\|x\| \|y\|} \right) $$

In [None]:
# Exercise





#### Exercise
Write a function that computes the 2-norm of a vector. Call the function on vector $z = (1,2,3,4,5)$. The 2-norm of a vector is given by
$$\|x\|_2 = \left( \sum_i x_i^2 \right)^{\frac{1}{2}}$$

In [None]:
# Exercise





#### Exercise
Recall from BUSS1020 that to standardize a variable, we subtract its mean and divide by its standard deviation.
Write a function `standardize` that standardizes a vector.

In [None]:
# Exercise





### NumPy strengths and weaknesses

You should now have a good foundation in NumPy, and in handling issues with your data.

NumPy is much easier to work with than lists of lists, because:
- It’s easy to perform algebraic operations.
- Data indexing and slicing is faster and easier.
- We can convert data types quickly.
- There are many inbuilt functions that we can use!

Overall, NumPy makes working with data in Python much more efficient. It’s widely used for this reason, especially for machine learning.

You may have noticed some limitations with NumPy. For example:
- All of the items in an array must have the same data type. For many datasets, this can make arrays cumbersome to work with.
- Columns and rows must be referred to by number, which gets confusing when you go back and forth from column name to column number.
- NumPy performance can degrade when dealing with very large multi-dimensional arrays.

In the next few sessions, we’ll learn about the Pandas library, one of the most popular data analysis libraries.

Pandas builds on NumPy, but does a better job addressing the limitations of NumPy.

## Pandas basics

Pandas is a package that unifies the most common workflows for which data analysts and data scientists previously relied on many different packages. It has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python.

To represent tabular data, pandas uses a custom data structure called a __dataframe__. A dataframe is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyse and visualise data. It is similar to a NumPy 2D array but supports many more features that help you work with tabular/panel data.

Pandas dataframes also handle missing values gracefully (NaN).
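
As a minimal sketch of what this means, the toy dataframe below contains one missing value. The column names and numbers are invented for this example and are unrelated to the portfolio dataset.

```python
import numpy as np
import pandas as pd

# Made-up returns for two hypothetical stocks; one value is missing.
toy = pd.DataFrame({
    "stock_a": [0.01, np.nan, -0.02],
    "stock_b": [0.00, 0.03, 0.01],
})

print(toy.shape)               # (3, 2)
print(toy.isna().sum())        # number of missing values per column
print(toy["stock_a"].mean())   # about -0.005: the NaN is skipped by default
```

By default, summary methods such as `mean()` skip missing values instead of returning `NaN`.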

In this section, you will learn the basics of pandas while exploring the portfolio dataset. The dataset contains the daily returns of 19 stocks and 1 risk-free asset over a period of 2517 days.


In [None]:
# Import the pandas package
import pandas as pd
# Read the data file and use the first row (index 0) as header
data = pd.read_csv("portfolio_data.csv", sep=",", header=0)

#### Exploring the dataframe

In [None]:
# Look at first 5 rows
data.head() #or 10?

In [None]:
# Look at the last 5 rows
data.tail() #or 10?

In [None]:
# Look at all column names
data.columns

In [None]:
# Look at dataset dimension
data.shape

In [None]:
# Summary statistics
summary_stats = data.describe()
summary_stats

#### Compute the stock return
Let's compute the stock (log) return for 3 stocks: 'Bank of America Corporation', 'Intel Corporation', 'Tiffany & Co.'

In [None]:
#Select 3 stocks and make a copy


#subsample.head()

Say, we are interested in the days when log return of Bank of America is above 0.025.
We can filter the dataframe using Boolean conditions.
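
The general pattern can be sketched on a toy dataframe. A comparison on a column produces a Series of `True`/`False` values (a mask), and indexing the dataframe with that mask keeps only the rows marked `True`. The column name `log_return` and the numbers are made up for this illustration.

```python
import pandas as pd

# Made-up log returns for four days.
toy = pd.DataFrame({"log_return": [0.010, 0.030, -0.020, 0.026]})

mask = toy["log_return"] > 0.025   # True where the condition holds
high = toy[mask]                   # keep only the rows where mask is True

print(mask.tolist())               # [False, True, False, True]
print(high.shape)                  # (2, 1)
```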

In [None]:
# Filtering
mask = 
BOA_high_return =
#BOA_high_return.shape

#### Dataframe Indexing

In [None]:
# DataFrame indexing
data.iloc[0,0]

In [None]:
data.loc[0, 'American Express Company']
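
The two indexers differ in what they expect: `iloc` selects by integer position, while `loc` selects by row and column labels. The sketch below uses a toy dataframe with invented labels to make the difference visible.

```python
import pandas as pd

# Made-up data with text labels for the rows.
toy = pd.DataFrame(
    {"stock_a": [0.01, 0.02, 0.03], "stock_b": [0.04, 0.05, 0.06]},
    index=["mon", "tue", "wed"],
)

print(toy.iloc[0, 1])              # 0.04: first row, second column by position
print(toy.loc["mon", "stock_b"])   # 0.04: the same cell, selected by label

# Position slices exclude the end point; label slices include it.
print(toy.iloc[0:2, 0].tolist())                   # [0.01, 0.02]
print(toy.loc["mon":"tue", "stock_a"].tolist())    # [0.01, 0.02]
```

In the portfolio data the row labels are the default integers, which is why `data.loc[0, ...]` uses `0` as a label.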

#### Plotting with dataframe

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.lineplot(x=subsample.index, y=subsample['log return Bank of America Corporation'])

In [None]:
sns.heatmap(data.corr(), 
            xticklabels=data.corr().columns.values,
            yticklabels=data.corr().columns.values)
plt.title('Correlation Matrix')

Note that operations on Pandas dataframes and NumPy arrays are generally applied in different ways. Pandas is generally method based, while NumPy is generally function based.

For example,

Operations in Pandas: `dataframe.sum(), dataframe.mean(), dataframe["column"].unique()`

Operations in NumPy: `np.sum(np.array()), np.mean(np.array()), np.isin(np.array(), np.array())`
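
The sketch below runs the same operations in both styles on made-up values. It also shows two differences worth knowing: `unique()` in Pandas keeps the order of first appearance while `np.unique()` sorts its result, and the default standard deviation differs (Pandas divides by n - 1, NumPy by n).

```python
import numpy as np
import pandas as pd

values = [2.0, 1.0, 2.0, 5.0]      # made-up numbers
series = pd.Series(values)
array = np.array(values)

# Method style (Pandas) versus function style (NumPy).
print(series.mean(), np.mean(array))   # 2.5 2.5

print(series.unique())                 # [2. 1. 5.]: order of first appearance
print(np.unique(array))                # [1. 2. 5.]: sorted

# Different defaults for the standard deviation.
print(series.std())                    # about 1.732 (divides by n - 1)
print(np.std(array))                   # 1.5 (divides by n)
```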