#  Python for Economic and Social Data Science: Lecture Three

### 17th July, 2024
---

Let's now run the homework randomiser again, where each randomly selected student will answer some of the homework questions!

## Section 8: Pseudocode (An unrelated aside)

Pseudocode is a common method which enables the programmer to plan without worrying about syntax and tracebacks, with a focus on the operations and methods. We can write down every step, including if conditions and loops, and fill it in in a rough and ready fashion, using shortcuts where we think appropriate. This allows us to ascertain what we already know how to do easily (which we can use shortcuts or shorthand for in the pseudocode), and what we still need to figure out. This skill of abstraction is essential to solving problems as a programmer! Note that pseudocode isn't meant to be run. The more comfortable you become with programming in Python, the less you will need to write or use pseudocode.

See below for an example:


```python

state my favourite number
for each exponent from 1 to 19
    raise my favourite number to that power
    print out the result
```

But by using all the tools we've learnt so far, we can easily expand this into _real_ code:

In [1]:
myfavnum=37
for number in range(1,20):
    answer=myfavnum**number
    print(answer)

37
1369
50653
1874161
69343957
2565726409
94931877133
3512479453921
129961739795077
4808584372417849
177917621779460413
6582952005840035281
243569224216081305397
9012061295995008299689
333446267951815307088493
12337511914217166362274241
456487940826035155404146917
16890053810563300749953435929
624931990990842127748277129373


## Section 9: Functions. 

### Section 9.1: Introduction to Functions

Perhaps the most important concept that we will cover in this course is the notion of abstraction. To make something more abstract is to articulate or encode some phenomenon in a form that is less contingent or specific. If we can discern the commonalities between different cases, we can apply the same logic in a number of cases.

The functional purpose of programming is typically to transform particular cases into more general ones. Instead of renaming every photo in a set of cat photos by hand, we can write one rule that adds "Cat_" to all of their names. In a broader sense, we are trying to automate or abstract tasks with computers in order to play to their strengths for speed and accuracy.

When coding, we prefer abstraction because it gives code a number of advantages:

* easier to re-use
* more robust
* more efficient for the coder.

The primary way in which we employ abstraction in programming is the use of the function, f(x) = y. In this case x is some input, f() is a function that is applied to the input, and y is the output. Here is a very simple function in Python (one that happens to take no input at all):

```python
def greetings(): 
    print('hello world')
```

This function simply prints the greeting 'hello world'. Now every time we want to say hello world, we could just call greetings() instead. It is hardly an improvement, but when you place many reusable lines of code together, then the benefits become obvious and the readability of code is increased substantially.

In [2]:
def greet():
    print('Are you having a nice day today?')
    print('I love Charlie\'s Python class!')

greet()

Are you having a nice day today?
I love Charlie's Python class!


Critically, functions can also take an **input**, operate on it, and then return the result:

In [3]:
def doubleme(input):
    doublednumber=input*2
    # greet()
    return doublednumber

print(doubleme(2))

4


What's going to happen if we uncomment the greet() **function call** above?

Let's try some more functions as examples, including one with a try/except.

First, take a function which accepts strings as inputs:

In [4]:
def favouritecolour(colour):
    print('My favourite colour is ' + colour)
    
favouritecolour('green')

My favourite colour is green


What about if somebody tries to input a number into a function which only accepts strings? Try and except to the rescue!

In [5]:
def favouritecolour(colour):
    try:
        print('My favourite colour is ' + colour)
    except TypeError:
        print('That cant possibly be a colour...')

favouritecolour('green')

My favourite colour is green


In [6]:
favouritecolour(5)

That cant possibly be a colour...


### Section 9.2: Passing variables around

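When you pass an object to a function, Python passes a **reference** to that object, not a copy. This means that if the function modifies the object in place (here, by calling `.sort()` on a list), the change is also visible outside the function. Reassigning the parameter name inside the function, by contrast, only rebinds the local name and leaves the caller's object untouched. See the example below: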

In [7]:
def list_sorter(listIn):
    listIn.sort() 
    return listIn

mylist = [7, 10, 1, 15, 2, 4]

print("The list before it was sorted:", mylist)

result = list_sorter(mylist)

print("\nThe list after it was sorted:", result)

The list before it was sorted: [7, 10, 1, 15, 2, 4]

The list after it was sorted: [1, 2, 4, 7, 10, 15]
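
The difference between modifying a list in place and reassigning the parameter name can be sketched as follows. The function names `sort_in_place` and `sort_a_copy` are made up for this illustration:

```python
def sort_in_place(numbers):
    # .sort() mutates the very list object that the caller passed in
    numbers.sort()

def sort_a_copy(numbers):
    # sorted() builds a new list; rebinding the local name 'numbers'
    # does not affect the caller's list
    numbers = sorted(numbers)
    return numbers

original = [7, 10, 1, 15, 2, 4]
sorted_copy = sort_a_copy(original)
print(original)      # still [7, 10, 1, 15, 2, 4]
print(sorted_copy)   # [1, 2, 4, 7, 10, 15]

sort_in_place(original)
print(original)      # now [1, 2, 4, 7, 10, 15]
```

If you want to keep the original order, work on a copy (for example with `sorted()`) rather than calling `.sort()` on the list you were given.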


### Section 9.3. Global vs local variables

In Python, variables are local unless otherwise declared. This means that when you define (or reassign) a variable inside the indented body of a function, it is `local` to that function by default. While this topic is slightly more advanced and beyond the scope of this module, consider the following:

In [8]:
def my_favourite_food(a_random_statement):
    print(a_random_statement)
    a_random_statement = "But 538 isnt the same after Harry Enton left"
    print(a_random_statement)

a_random_statement = "Tofu is my favourite food!"        
my_favourite_food(a_random_statement)
print(a_random_statement)

Tofu is my favourite food!
But 538 isnt the same after Harry Enton left
Tofu is my favourite food!
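
A small sketch of the other side of the same idea: a variable created inside a function does not exist outside it. The names `make_local_variable` and `favourite_food` are invented for this example:

```python
def make_local_variable():
    favourite_food = "tofu"      # exists only while the function runs
    return favourite_food

returned_value = make_local_variable()
print(returned_value)

try:
    print(favourite_food)        # the local name is not visible out here
except NameError as error:
    name_error_message = str(error)
    print("NameError:", error)
```

The only way the value gets out of the function here is through `return`.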


### Section 9.4: Abstraction

Functions are most useful when we want to solve a repetitive task efficiently. However, there is a challenge: how do we make the functions general enough, but not too general? So far, we haven't seen functions take more than one input or return more than one output:

In [9]:
def two_outputs(a):
    b=a+a
    c=a*a
    return b, c

two_outputs(5)

(10, 25)

Importantly, we can also assign function outputs:

In [10]:
def two_outputs(a):
    b=a+a
    c=a*a
    return b, c

b, c = two_outputs(5)
print(b, c)

10 25


And, similarly, we can assign something with two inputs, one output:

In [11]:
def two_inputs(a, b):
    return a*b # note how we can return directly, without assignment to c inside

b = two_inputs(5, 2)
print(b)

10


You can also declare default function inputs, for example like:

```python
def two_inputs(input1="burger", input2="cheeseburger"):
    return input1 + input2

print(two_inputs("cheese","sandwich"))
```
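
As a rough sketch of how default values behave when a function is called, consider the following. The function `describe_meal` and its parameter names are hypothetical, chosen only for illustration:

```python
def describe_meal(main="burger", side="chips"):
    # both parameters are optional because both have default values
    return main + " with " + side

print(describe_meal())                      # burger with chips
print(describe_meal("sandwich"))            # sandwich with chips
print(describe_meal(side="salad"))          # burger with salad
print(describe_meal("cheese", "sandwich"))  # cheese with sandwich
```

Positional arguments fill the parameters from left to right, while a keyword argument such as `side="salad"` lets us override one default and keep the others.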

### Section 9.5 Your turn!

Python doesn't like it when the first argument into a function has a default value, and the second one doesn't. Try this yourself, and try to understand the error.

## Section 10: Reading, Writing and Appending Files

### Section 10.1 Reading

File manipulation is central to the management of data and thus to data analysis. Opening files in python is relatively straightforward. To open a file in python we create a 'file handler'. There are a couple ways of doing this. Most involve the basic format of:

```python
filein = open("PATH_TO_FILE", 'r')  # use 'r' to read, 'w' to write, or 'a' to append
```

`filein.close()` closes the file after the operations are complete: it's a good habit to always close files!

In [12]:
# Open a file and read it line by line:
filein = open("../README.md",'r') # What does the '..' mean?
counter = 0
for i in filein: 
    if counter < 20:
        print(i.strip())
        counter += 1
filein.close()

# Python_for_DataScience

<p align="center">
<img src="./assets/python_logo.png" width="200"/>&nbsp; &nbsp; &nbsp;<img src="./assets/python_logo.png" width="200"/>&nbsp; &nbsp; &nbsp;<img src="./assets/python_logo.png" width="200"/>
</p>

## :page_facing_up: # Python for Economic and Social Data Science  :page_facing_up:

![coverage](https://img.shields.io/badge/Teaching-yellow)
[![Generic badge](https://img.shields.io/badge/Python-blue.svg)](https://shields.io/)
[![Generic badge](https://img.shields.io/badge/GNU3.0-purple.svg)](https://shields.io/)
[![Generic badge](https://img.shields.io/badge/Maintained-brightgreen.svg)](https://shields.io/)
[![Generic badge](https://img.shields.io/badge/BuildPassing-orange.svg)](https://shields.io/)
---

### Introduction
Welcome to the 'Python for Economic and Social Data Science' class! This GitHub repository contains everything that we'll need for about ~20 hours of lectures. This course begins by assuming absolutely no knowledge of what Python i

One way to guarantee that the file closes is to open it in the following way:

```python
with open("../README.md", 'r') as filein:
    print(filein.read())
```

Note that the line by line method might be more suitable if we are dealing with very, very large files.
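
The two ideas can be combined: a `with` block that reads line by line and stops early. This is only a sketch; it first writes a throwaway file (the name `example.txt` and its contents are made up) so that it runs on its own, and it uses `itertools.islice` from the standard library to take the first few lines:

```python
import os
import tempfile
from itertools import islice

# Create a small throwaway file so the example is self-contained
# (the file name and its contents are made up for this sketch)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, 'w') as fileout:
    for number in range(100):
        fileout.write("this is line " + str(number) + "\n")

# Read only the first three lines; the file is closed automatically
# when the 'with' block ends, even if an error occurs inside it
with open(path, 'r') as filein:
    first_lines = [line.strip() for line in islice(filein, 3)]

print(first_lines)
print(filein.closed)
```

Because the file is iterated lazily, the remaining 97 lines are never loaded into memory.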

### Section 10.2: Writing Files

Writing files is very similar to reading them. However, you need to notice the 'w' parameter (for 'write') in the open() function. This replaces the 'r' in Section 10.1 above. Be careful: opening an existing file with 'w' overwrites its contents. Let's save some text as data to a csv file, and then read it back in and print it line by line:

In [13]:
abc = open("../Data/data_write.csv",'w')
abc.write('hi, this, is, a, csv, with, some, data')
abc.close()

In [14]:
filein = open("../Data/data_write.csv",'r')
for i in filein: 
    print(i.strip())
filein.close()

hi, this, is, a, csv, with, some, data


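You can also call write() several times while the file is open, with each call adding to the end of the file.

See this example where we might be iterating over all of the tweets in our twitter database (here `twitter_database` and `functioncall` are placeholders for data and a function that we would define elsewhere):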

```python
with open("donaldstweets.txt",'w') as fileout:
    for tweet in twitter_database:
        if "@therealdonald" in tweet:
            newtweet=functioncall(tweet)
            fileout.write(newtweet)
```

### Section 10.3: Appending to Files

Appending to files is very similar to writing them. However, you need to notice the 'a' parameter (for 'append') in the open() function. This replaces the 'r'/'w' in Sections 10.1-10.2 above. Let's append to our earlier file, and then read it line by line:

In [15]:
abc = open("../Data/data_write.csv",'a')
abc.write('\neven, more, data, now, in, this, csv, file')
abc.close()

In [16]:
filein = open("../Data/data_write.csv",'r')
for i in filein: 
    print(i.strip())
filein.close()

hi, this, is, a, csv, with, some, data
even, more, data, now, in, this, csv, file
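
To make the difference between the 'w' and 'a' modes concrete, here is a minimal sketch. It writes to a temporary location (the file name `modes_demo.csv` is invented for this example) so that it doesn't touch the course's Data folder:

```python
import os
import tempfile

# A throwaway path, so we don't touch the course's Data folder
path = os.path.join(tempfile.mkdtemp(), "modes_demo.csv")

with open(path, 'w') as fileout:      # 'w' creates the file (or empties an existing one)
    fileout.write("first, version\n")

with open(path, 'w') as fileout:      # opening with 'w' again overwrites the old content
    fileout.write("hi, this, is, a, csv\n")

with open(path, 'a') as fileout:      # 'a' keeps the content and adds to the end
    fileout.write("even, more, data\n")

with open(path, 'r') as filein:
    lines = [line.strip() for line in filein]

print(lines)   # ['hi, this, is, a, csv', 'even, more, data']
```

Note that "first, version" is gone: the second 'w' replaced it, whereas the 'a' only added a new line.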


### Section 10.4 Your turn!

Write _any_ new file to any part of your file system. Can you read it back in successfully? If this is too easy, can you next append something to the file?

## Section 11: Programming outside Jupyter 

There are two main options when it comes to writing code outside of Jupyter notebooks. 

### Section 11.1: Integrated Development Environments.

IDEs are coding tools which allow you to write/test/debug your code. They offer features such as completion/insight by highlighting, resource management, debugging tools, etc. They are therefore extremely useful for development. Some of the most commonly used ones are:

1. Spyder: bundled with the Anaconda distribution, built specifically for data science. Interface similar to RStudio/MATLAB
2. PyCharm: also has support for JavaScript, HTML/CSS, Angular JS, Node.js, etc. Full git integration.
3. Rodeo: *very* similar to RStudio: divided into text editor, console, environment plot/libraries/files.

### Section 11.2: Text editors

Another option is to use a text editor and then execute the .py files in the command prompt/terminal. This comes in two steps:

1. Edit code in a text editor (*strongly recommend Atom*)
2. Open a command window/terminal (Start > Run > "cmd" on Windows, Terminal on Mac/Linux) and execute the .py file from there

### Section 11.3: Consoles

You can also use the command window/terminal to prototype code on the fly:

    > python

### Section 11.4 Your turn!

What are your favourite IDEs or text editors? Why? How does this way of working relate to Stata, Eviews, and MATLAB?

## Section 12: Numpy

We've spent the majority of the first three lessons talking about the 'standard library'.

**The standard library:** Python’s standard library is very extensive, offering a wide range of facilities. The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs.

However, the majority of tools which we need for practical data science come from libraries _outside_ of the standard library. We're now going to talk about the four main, very important ones, although note that we could have made alternative choices.

Python is a general-purpose language and as such it is widely used by system administrators for operating system administration, by web developers as a tool to create dynamic websites, and by linguists for natural language processing tasks. Being a truly general-purpose language, Python can of course be used to solve numerical problems as well, even without any special numerical modules. So far so good, but the crux of the matter is execution speed. Pure Python without any numerical modules is too slow for many of the numerical tasks that Matlab, R and other languages are designed for. When it comes to computational problem solving, it is of the greatest importance to consider the performance of algorithms, both in terms of speed and memory usage. Used in combination with modules such as NumPy, Matplotlib and Pandas, Python belongs among the top numerical programming languages, and for many tasks it is competitive with Matlab or R.

Note, also, that you don't need to install any of these tools if you have installed the Anaconda distribution of Python: one of the main benefits of our installation of that is that it takes care of all the bundles of tools that we need as data scientists.

NumPy is a module for Python. The name is short for "Numeric Python" or "Numerical Python". It is pronounced /ˈnʌmpaɪ/ (NUM-py). It is an extension module for Python, mostly written in C, which means that its precompiled mathematical and numerical functions execute very quickly. Furthermore, NumPy enriches Python with powerful data structures, implementing multi-dimensional arrays and matrices. These data structures allow efficient calculations with matrices and arrays, and the implementation is even aimed at huge matrices and arrays, better known under the heading of "big data". Besides that, the module supplies a large library of high-level mathematical functions to operate on these matrices and arrays.

Before we can use NumPy we will have to import it.

In [17]:
import numpy as np

Our first simple Numpy example deals with temperatures. We are given a list of values, e.g. temperatures in Celsius:

In [18]:
cvalues = [20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1]
print(cvalues)

[20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1]


We will turn our list "cvalues" into a one-dimensional numpy array:

In [19]:
C = np.array(cvalues)
print(C)

[20.1 20.8 21.9 22.5 22.7 22.3 21.8 21.2 20.9 20.1]


These might look the same, but note that they are different object types:

In [20]:
print(type(C), type(cvalues))

<class 'numpy.ndarray'> <class 'list'>


Let's assume we want to turn the values into degrees Fahrenheit. This is very easy to accomplish with a numpy array. The solution to our problem can be achieved by simple scalar multiplication and addition:

In [21]:
print(C * 9 / 5 + 32)

[68.18 69.44 71.42 72.5  72.86 72.14 71.24 70.16 69.62 68.18]


Compared to this, the solution for our Python list looks awkward:

In [22]:
fvalues = [x*9/5 + 32 for x in cvalues]
print(fvalues)

[68.18, 69.44, 71.42, 72.5, 72.86, 72.14, 71.24000000000001, 70.16, 69.62, 68.18]


We can choose the size of the integers when we define an array. Needless to say, this changes the memory requirement:

In [23]:
import sys

a = np.array([24, 12, 57], np.int8)
print(sys.getsizeof(a))
a = np.array([24, 12, 57], np.int16)
print(sys.getsizeof(a))
a = np.array([24, 12, 57], np.int32)
print(sys.getsizeof(a))
a = np.array([24, 12, 57], np.int64)
print(sys.getsizeof(a))

115
118
124
136
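
Note that `sys.getsizeof` also counts the fixed overhead of the array object itself, which is why the numbers above are larger than the data alone. As a complementary sketch, the array attributes `itemsize` and `nbytes` report only the memory used by the elements:

```python
import numpy as np

for integer_type in (np.int8, np.int16, np.int32, np.int64):
    a = np.array([24, 12, 57], dtype=integer_type)
    # itemsize: bytes per element; nbytes: bytes used by all elements together
    print(a.dtype, a.itemsize, a.nbytes)

sizes = [np.array([24, 12, 57], dtype=t).nbytes
         for t in (np.int8, np.int16, np.int32, np.int64)]
print(sizes)   # [3, 6, 12, 24]
```

Three elements at 1, 2, 4 and 8 bytes each give 3, 6, 12 and 24 bytes respectively.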


### 12.1 Speed of numpy vs standard library

Let's use two very simple but otherwise equivalent functions to compare the efficiency of pure Python with numpy for a basic arithmetical task:

In [24]:
import time


def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1

size_of_vec = 10000000
t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("Numpy is in this example " + str(t1/t2) + " faster!")

2.703886032104492 0.08404898643493652
Numpy is in this example 32.17035858246319 faster!
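
An alternative way to time the same comparison is the standard library's `timeit` module, which repeats each function several times and so gives steadier measurements. This is only a sketch: the function names are invented, and the vector is much smaller than the one above so that it runs quickly. The exact timings will differ from machine to machine.

```python
import timeit
import numpy as np

size_of_vec = 100_000  # smaller than the 10,000,000 used above, so this runs quickly

def pure_python_sum():
    X = range(size_of_vec)
    Y = range(size_of_vec)
    return [x + y for x, y in zip(X, Y)]

def numpy_sum():
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    return X + Y

# Run each function 5 times per measurement, take 3 measurements, keep the best
python_time = min(timeit.repeat(pure_python_sum, number=5, repeat=3))
numpy_time = min(timeit.repeat(numpy_sum, number=5, repeat=3))
print(f"pure Python: {python_time:.4f}s, NumPy: {numpy_time:.4f}s")

# Both versions should compute exactly the same values
same_result = np.array_equal(numpy_sum(), pure_python_sum())
print(same_result)
```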


### 12.2 Evenly spaced values

The syntax of arange is ```arange([start,] stop[, step][, dtype=None])```. arange returns evenly spaced values within a given interval. The values are generated within the half-open interval '[start, stop)'. If the function is used with integers, it is nearly equivalent to the Python built-in function range, but arange returns an ndarray rather than a range object as range does. If the 'start' parameter is not given, it will be set to 0. The end of the interval is determined by the parameter 'stop'. The default value for 'step' is 1.

In [25]:
x = np.arange(10.4)
print(x)
x = np.arange(0.5, 10.4, 0.8)
print(x)

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[ 0.5  1.3  2.1  2.9  3.7  4.5  5.3  6.1  6.9  7.7  8.5  9.3 10.1]
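
A short sketch comparing `arange` with the built-in `range`. It also shows `np.linspace`, a related numpy function that has not been introduced above: it takes the number of points rather than a step size.

```python
import numpy as np

print(list(range(1, 10, 3)))       # [1, 4, 7]
print(np.arange(1, 10, 3))         # [1 4 7]

# range() rejects non-integer steps, but arange accepts them
halves = np.arange(0, 2, 0.5)
print(halves)                      # [0.  0.5 1.  1.5]

# linspace takes the number of points instead of a step,
# and includes the end point by default
quarters = np.linspace(0, 1, 5)
print(quarters)                    # [0.   0.25 0.5  0.75 1.  ]
```

When the step is a float, rounding can make it hard to predict whether the last value is included by `arange`, so `linspace` is often the safer choice if you know how many points you need.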


### 12.3 Dimensions and Shapes

### 12.3.1 Zero dimensional arrays

It's possible to create multidimensional arrays in numpy. Scalars are zero dimensional. In the following example, we will create the scalar 42. Applying the `np.ndim` function to our scalar, we get the dimension of the array. We can also see that the type is a "numpy.ndarray" type.

In [26]:
x = np.array(42)
print("x: ", x)
print("The type of x: ", type(x))
print("The dimension of x:", np.ndim(x))

x:  42
The type of x:  <class 'numpy.ndarray'>
The dimension of x: 0


### 12.3.2 One dimensional arrays

Numpy arrays are containers of items of the same type, e.g. only integers. The homogeneous type of the array can be determined with the attribute "dtype", as we can learn from the following example:

In [27]:
F = np.array([1, 1, 2, 3, 5, 8, 13, 21])
V = np.array([3.4, 6.9, 99.8, 12.8])
print("F: ", F)
print("V: ", V)
print("Type of F: ", F.dtype)
print("Type of V: ", V.dtype)
print("Dimension of F: ", np.ndim(F))
print("Dimension of V: ", np.ndim(V))

F:  [ 1  1  2  3  5  8 13 21]
V:  [ 3.4  6.9 99.8 12.8]
Type of F:  int64
Type of V:  float64
Dimension of F:  1
Dimension of V:  1
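
Because every item must share one type, numpy converts mixed input to a common type. A brief sketch of this behaviour (the variable names are chosen for illustration):

```python
import numpy as np

integers = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])      # one float is enough to make every item a float
print(integers.dtype, mixed.dtype)
print(mixed)                       # [1.  2.  3.5]

# astype() returns a converted copy; the original array is unchanged
as_floats = integers.astype(float)
print(as_floats.dtype)             # float64
```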


### 12.3.3 Multidimensional Arrays

Of course, NumPy arrays are not limited to one dimension. They can have an arbitrary number of dimensions. We create them by passing nested lists (or tuples) to the array function of numpy.

In [28]:
B = np.array([[[111, 112], [121, 122]],
              [[211, 212], [221, 222]],
              [[311, 312], [321, 322]]]
            )
print(B.ndim)
print(B.shape)

3
(3, 2, 2)


### 12.4 Slicing

We can also index and slice multidimensional arrays. There are two ways of indexing a single element, one more efficient than the other:

In [29]:
A = np.array([[3.4, 8.7, 9.9],
             [1.1, -7.8, -0.7],
             [4.1, 12.3, 4.8]]
            )
print(A[1][0])
print(A[1, 0])

1.1
1.1


We accessed the element in the second row, i.e. the row with the index 1, and the first column (index 0), in two different ways. The first, `A[1][0]`, is the same way we would access an element of a nested Python list. Be aware that this way of accessing multi-dimensional arrays can be highly inefficient, because `A[1]` first creates an intermediate array which is then indexed again with `[0]`. The second, `A[1, 0]`, is the Numpy way: we use only one pair of square brackets, and all the indices are separated by commas.
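
The comma-separated form also accepts slices, one per dimension. A minimal sketch using the same array (the names `first_column` and `top_right` are just for illustration):

```python
import numpy as np

A = np.array([[3.4, 8.7, 9.9],
              [1.1, -7.8, -0.7],
              [4.1, 12.3, 4.8]])

first_column = A[:, 0]      # every row, column 0
top_right = A[:2, 1:]       # rows 0-1, columns 1-2
print(first_column)
print(top_right)
print(top_right.shape)      # (2, 2)
```

A bare `:` means "everything along this dimension", just as it does when slicing a Python list.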

### 12.5 Creating Arrays

We might want to initialize empty arrays for filling in later. There are two common ways to do this, with either zeros or ones:

In [30]:
F = np.ones((3,4), dtype=int)
print(F)

Z = np.zeros((2,4))
print(Z)

[[1 1 1 1]
 [1 1 1 1]
 [1 1 1 1]]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]


We can also make an identity matrix (useful for linear algebra):

In [31]:
np.identity(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Note, here, that we aren't talking about dtypes or structured arrays in any depth, because pandas generally supersedes numpy for tasks which involve things like that.

### 12.6 Basic matrix arithmetic

Let's start with adding a scalar to an array:

In [32]:
v = np.array([5, 6, 7])
v + 2

array([7, 8, 9])

How about multiplying by a scalar? Exactly as we expect:

In [33]:
np.array([5, 6, 7]) * 3

array([15, 18, 21])

If we use another array instead of a scalar, the elements of both arrays will be component-wise combined:

In [34]:
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.ones((3,3))
print("Adding to arrays: ")
print(A + B)
print("\nMultiplying two arrays: ")
print(A * (B + 1))

Adding to arrays: 
[[12. 13. 14.]
 [22. 23. 24.]
 [32. 33. 34.]]

Multiplying two arrays: 
[[22. 24. 26.]
 [42. 44. 46.]
 [62. 64. 66.]]


This should not be mistaken for matrix multiplication: the elements are multiplied purely component-wise. For true matrix multiplication, we can use the dot product. Using the previous arrays, we can calculate the matrix product:

In [35]:
np.dot(A, B)

array([[36., 36., 36.],
       [66., 66., 66.],
       [96., 96., 96.]])
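
Python also provides the `@` operator for matrix multiplication, which for two-dimensional arrays gives the same result as `np.dot`. A small sketch contrasting it with the element-wise `*`:

```python
import numpy as np

A = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
B = np.ones((3, 3))

elementwise = A * B     # multiplies matching entries
matrix_product = A @ B  # rows of A times columns of B, same as np.dot(A, B)

print(elementwise)
print(matrix_product)
print(np.array_equal(matrix_product, np.dot(A, B)))   # True
```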

### 12.7 Comparisons

Comparisons are performed elementwise:

In [36]:
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.array([ [11, 102, 13], [201, 22, 203], [31, 32, 303] ])
A == B

array([[ True, False,  True],
       [False,  True, False],
       [ True,  True, False]])

We can use `array_equal` if we want to check whether two entire arrays are elementwise equal:

In [37]:
print(np.array_equal(A, B))
print(np.array_equal(A, A))

False
True


### 12.8 Broadcasting

Numpy provides a powerful mechanism, called broadcasting, which allows us to perform arithmetic operations on arrays of different shapes. This is generally very computationally efficient, as it saves on memory and avoids explicit Python loops:

In [38]:
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.array([1, 2, 3])
print("Multiplication with broadcasting: ")
print(A * B)
print("... and now addition with broadcasting: ")
print(A + B)

Multiplication with broadcasting: 
[[11 24 39]
 [21 44 69]
 [31 64 99]]
... and now addition with broadcasting: 
[[12 14 16]
 [22 24 26]
 [32 34 36]]
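
The shape of the smaller array decides the direction in which it is repeated. As a sketch (variable names are illustrative), a one-dimensional array of shape (3,) is applied to every row, while the same values reshaped to (3, 1) are applied to every column:

```python
import numpy as np

A = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
row = np.array([1, 2, 3])            # shape (3,)
column = row.reshape(3, 1)           # shape (3, 1)

plus_row = A + row                   # adds 1, 2, 3 across each row
plus_column = A + column             # adds 1 to row 0, 2 to row 1, 3 to row 2
print(plus_column)

# The explicit loop that broadcasting saves us from writing
looped = np.zeros_like(A)
for i in range(A.shape[0]):
    looped[i] = A[i] + row[i]
print(np.array_equal(plus_column, looped))   # True
```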


### 12.9 Flattening, Reshaping, and Concatenating

Let's first see an example of flattening a two-dimensional array:

In [39]:
A=np.array([[0, 1],
            [2, 3],
            [4, 5]
           ] # Note the indenting here
        )
A.flatten()

array([0, 1, 2, 3, 4, 5])

The method reshape() gives a new shape to an array without changing its data:

In [40]:
X = np.array(range(12))
X.reshape((2,3,2))

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])

And, finally, concatenating:

In [41]:
x = np.array(range(5))
y = np.array(range(10,12))
np.concatenate((x,y))

array([ 0,  1,  2,  3,  4, 10, 11])
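
Two further options that are often useful, shown here as a rough sketch: passing `-1` to `reshape` lets numpy infer one dimension, and the `axis` argument of `concatenate` chooses the direction in which two-dimensional arrays are joined. The array names are made up for this example.

```python
import numpy as np

X = np.arange(12)
Y = X.reshape(3, -1)        # -1 asks numpy to work out this dimension: 12 / 3 = 4
print(Y.shape)              # (3, 4)

extra_row = np.array([[100, 101, 102, 103]])     # shape (1, 4)
extra_column = np.array([[7], [8], [9]])         # shape (3, 1)

taller = np.concatenate((Y, extra_row), axis=0)      # join below: shape (4, 4)
wider = np.concatenate((Y, extra_column), axis=1)    # join alongside: shape (3, 5)
print(taller.shape, wider.shape)
```

The arrays must agree in every dimension except the one being joined, otherwise numpy raises a `ValueError`.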

### Section 12.10 Your turn!

Generate a 5x5 Identity matrix, and a 5x1 vector. Multiply them by one another. Check that the shape of the output is exactly what we think it is.

## Optional Homework

Can you come up with another example of pseudocode, and then convert it to real code with everything we've learnt so far?

## Non-Optional Homework!

See Homework_Three.ipynb in the 'Homeworks' section of the course materials!