Five Elements in Learning a Programming Language
===============
This notebook uses Python as an example to demonstrate five basic elements common to any programming language. Once familiar with these elements, users should be able to start using the language in a meaningful way. The five elements are 
    1. Data types and their operations
    2. Conditional statements (decision rules)
    3. Loops (repetition)
    4. Input/Output (I/O)
    5. User-defined functions (breaking large code into smaller pieces)
The idea here is not to learn *everything* about the language, but to learn enough to be productive with it. 


Data Types and their Operations
------------
Most languages define several data types, for example, numbers versus characters. While the reasons behind this classification lie at the core of the language design, from a user's perspective we need to understand the different data types so that we can operate on them in a meaningful way.

For example:

In [1]:
5+6

11

The statement above applies the operator '$+$' to two *numbers* and produces their sum. The case below demonstrates a different use of the operator '$+$'. In this case, the operator is applied to two *characters*.

In [2]:
'a'+'b'

'ab'

This produces a *concatenation* of the two characters, namely 'a' and 'b'. Another example concerns *lists*.

In [3]:
A = [1,2,3,4]
B = [2,3,4,5]
A+B 

[1, 2, 3, 4, 2, 3, 4, 5]

Note that '$+$' does not add the individual elements together; rather, it joins the two lists together. 

It is often possible to introduce more data types. For example, the module *numpy* introduces data types like *vector* and *matrix*. It is also often possible to **convert** between data types. See the example below. 

In [4]:
import numpy as np
NA = np.array(A)
NB = np.array(B)
NA + NB

array([3, 5, 7, 9])

The *np.array* function converts the list into a numpy array. This is a different data type, similar to a *vector*. The point is that when we apply the '$+$' operator to two numpy arrays, it adds the individual elements together and forms a new numpy array. 

This highlights the importance of understanding the different data types of a language as well as their associated operations. 
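
As a further illustration of how the data type decides what an operator does, the short sketch below applies '$+$' to the same digits stored first as strings and then as integers. The variable names are chosen for this illustration only.

```python
# The same operator behaves differently depending on the data type.
text_sum = '5' + '6'                # two strings: '+' concatenates them
number_sum = int('5') + int('6')    # converted to integers: '+' adds them

print(type('5'), text_sum)      # the type of '5' and the concatenated result
print(type(5), number_sum)      # the type of 5 and the arithmetic result
```

The built-in functions *type* and *int* are a quick way to check and convert a data type before operating on it.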

The example below demonstrates that a *string* behaves much like a list (collection) of characters: both can be indexed by position. 

In [5]:
S = 'Hello World' #define a string called 'S' and has a value 'Hello World'
Slist = list(S) #'convert' S into a list of its characters. 

In [9]:
S[0] == Slist[0] #asking Python if the first element in the string S is the same as the first element in Slist
print(S[5])

 


To save time, we can check the above more efficiently by 

In [11]:
i = 5
S[i] == Slist[i] #asking Python if the i element in the string S is the same as the i element in Slist with i defined above

True

Obviously we can use the function 'map' or list comprehension to make this easier but we will leave this until later. 
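
As a brief preview of the list comprehension mentioned above, the sketch below checks every position in one go rather than one index at a time. It is only one possible way to write the check.

```python
S = 'Hello World'
Slist = list(S)

# Compare the string and the list position by position.
matches = [S[i] == Slist[i] for i in range(len(S))]

# all() is True only if every comparison in the list is True.
all_match = all(matches)
print(all_match)
```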

Conditional Statements
--------
The last example demonstrates a *boolean* operation; specifically, it asks Python to determine whether something is true or false. We can extend this to tell Python in advance what to do next, depending on the outcome of a boolean statement. This is generally referred to as a *conditional statement*. 

The example below demonstrates a typical 'if-elif-else' structure. 

In [14]:
a = 'go away'
if a == 'Welcome': #use == to compare values; 'is' tests whether two names refer to the same object
    print('Hello')
elif a == 'Bye':
    print('See ya!')
else:
    print('Hanging out?')

Hanging out?


Note that there can be any number of scenarios. 

In [9]:
a = 'Welcome' #change the value of a so that the first branch is taken
if a == 'Welcome':
    print('Hello')
elif a == 'Bye':
    print('See ya!')
elif a == 'hungry':
    print('Would you like something to eat?')
else:
    print('Hanging out?')

Hello
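
A note on comparisons inside conditions: Python has two operators that are easy to confuse. The operator `==` asks whether two values are equal, while `is` asks whether two names refer to the very same object. The sketch below uses lists, where the difference is easy to see; the variable names are made up for this illustration.

```python
x = [1, 2, 3]
y = [1, 2, 3]   # a second list with the same contents
z = x           # another name for the very same list

same_value = (x == y)    # compares the contents
same_object = (x is y)   # compares identity: x and y are separate objects
alias = (x is z)         # z refers to the same object as x

print(same_value, same_object, alias)
```

When a condition is meant to compare values, such as the text held in a string, `==` is the operator to use.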


Loop
------
The loop is one of the most powerful concepts in programming and automation. It takes advantage of what computers do best: repetition! There are two general loop structures: the for-loop and the while-loop. Let's look at the for-loop first. 

Let's say we want to count the number of characters in a string. We can do this by using the **len** function in Python. 

In [15]:
sentence = 'Hello World'
len(sentence)

11

Note that the white space is counted as one character. To the computer, a space is stored as a character just like any letter. 

An alternative way to achieve this is to go through the sentence and count the characters one by one.

In [18]:
i = 0 #initiate a variable to count. 
for c in sentence: #go through each character in the string 'sentence'
    i = i + 1 #add 1 to the counter 
    print('Character {0} and count {1}'.format(c,i)) #print the current value of 'c'
print('The total number of characters is {0}'.format(i))

Character H and count 1
Character e and count 2
Character l and count 3
Character l and count 4
Character o and count 5
Character   and count 6
Character W and count 7
Character o and count 8
Character r and count 9
Character l and count 10
Character d and count 11
The total number of characters is 11


The code above can be explained as follows. For each element (character) in the variable *sentence*, we assign the value of the element to the variable 'c', then we execute the indented part of the code. Upon finishing the indented code, we go back to the beginning of the for-loop and change the value of 'c' to the next character in 'sentence'. We keep doing this until we have processed the last character in 'sentence'. 
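
The manual counter can also be produced by the built-in function *enumerate*, which hands back a running number together with each element. The sketch below is an alternative way to write the same loop, assuming we want the count to start at 1; the variable *total* is introduced here only to keep the last count.

```python
sentence = 'Hello World'

total = 0
# enumerate pairs each character with a counter, starting from 1 here.
for count, c in enumerate(sentence, start=1):
    print('Character {0} and count {1}'.format(c, count))
    total = count  # remember the latest count

print('The total number of characters is {0}'.format(total))
```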

The power of a loop can be enhanced when we combine it with a conditional statement. For example, we can now count the number of characters, excluding white space. 

In [12]:
i = 0 #initiate a variable to count. 
for c in sentence: #go through each character in the string 'sentence'
    if c != ' ': # check that c is not a white space. 
        print(c)
        i = i + 1 #add 1 to the counter 
print('The total number of characters is {0}'.format(i))

H
e
l
l
o
W
o
r
l
d
The total number of characters is 10


Another form of loop is the *while* loop. It repeats the same operations for as long as a certain condition holds. To demonstrate, we repeat our example above using the *while* loop. 

In [19]:
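length = len(sentence) #get the total length of the string. 
i = 0 #initiate an index into the string. 
count = 0 #initiate a counter for the non-space characters. 
while i < length:
    c = sentence[i] #examine the ith element in the sentence. 
    if c != ' ':
        print(c)
        count = count + 1 #count only the non-space characters. 
    i = i + 1 #move to the next position; without this line the loop never ends. 
print('The total number of characters is {0}'.format(count))

H
e
l
l
o
W
o
r
l
d
The total number of characters is 10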
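
A *while* loop is most natural when we do not know in advance how many repetitions are needed. As a small sketch, the code below walks along the sentence until it meets the first white space and then stops. The variable name *first_word* is made up for this illustration.

```python
sentence = 'Hello World'

i = 0
# Keep moving right while we are inside the string and have not hit a space.
while i < len(sentence) and sentence[i] != ' ':
    i = i + 1

first_word = sentence[:i]  # everything before position i
print('The first word is {0} and it ends before position {1}'.format(first_word, i))
```

The first part of the condition guards against running past the end of the string when it contains no white space at all.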


A More Practical Example
----------
We will demonstrate loops and conditional statements with a more practical example. 
Let's assume you want to examine how each additional variable contributes to the $R^2$ of a linear regression model. We can try the following:
    1. Estimate the benchmark model. 
    2. Record $R^2$. 
    3. Add a variable into the model. 
    4. Re-estimate the model. 
    5. Record the new $R^2$. 
    6. Remove the new variable and go back to step 3, until we have exhausted all variables in the list. 

We first import the modules that will help us handle the data and do the estimation. 

In [20]:
import pandas as pd #Pandas is a powerful module for data 
import statsmodels as sm #statsmodels is a feature-rich module for statistical modelling
import statsmodels.formula.api as smf #This allows R like model specification 

Next we import the data into a Pandas dataframe.

In [21]:
m = pd.read_excel('../data/mur.xlsx', header=0, index_col=0)
m

State,Index,AGE,LF,M,NW,PC,POP,PX,SOUTH,T,U,URB,W,X,XPOS
Alabama,1,0.165999,0.512,19.2523,0.320782,0.203578,8.67,0.035,1,47,0.042,40.1,1102,0.324932,1
Arkansas,2,0.154263,0.485,7.5286,0.224143,0.326934,3.72,0.080851,1,58,0.047,32.3,920,0.3174,1
Arizona,3,0.152304,0.508,5.6567,0.126838,0.400922,2.12,0.011765,0,82,0.076,36.5,1716,0.302128,1
California,4,0.132523,0.544,3.2094,0.063389,0.317876,66.1,0.07037,0,100,0.079,67.1,2184,0.297032,1
Colorado,5,0.150324,0.524,2.8048,0.02146,0.34978,6.42,0.061538,0,222,0.042,57.4,1748,0.302152,1
Connecticut,6,0.132526,0.567,1.4085,0.027376,0.282964,9.94,0.1,0,164,0.054,64.1,2255,0.247715,1
Delaware,7,0.138485,0.546,6.1778,0.138979,0.203556,1.29,0.05,1,161,0.031,46.5,2066,0.270326,1
Florida,8,0.142542,0.527,12.1511,0.2184,0.23163,12.2,0.053846,1,70,0.045,56.5,1431,0.280856,1
Iowa,9,0.142673,0.523,1.3423,0.008213,0.198968,10.4,0.085714,0,219,0.018,46.9,1916,0.295264,1
Idaho,10,0.146452,0.53,3.7062,0.012303,0.137514,1.89,0.0,0,81,0.055,39.8,1815,0.319505,0


The code below shows us how to specify and estimate a benchmark model. 

In [26]:
benchmark = 'M~1+PC+PX+T' #define the benchmark model 
reg01 = smf.ols(benchmark, data=m).fit()
reg01.summary()

Dep. Variable:,M,R-squared:,0.356
Model:,OLS,Adj. R-squared:,0.308
Method:,Least Squares,F-statistic:,7.37
Date:,"Mon, 17 Oct 2016",Prob (F-statistic):,0.000481
Time:,13:53:17,Log-Likelihood:,-118.07
No. Observations:,44,AIC:,244.1
Df Residuals:,40,BIC:,251.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,11.6980,1.832,6.385,0.000,7.995 15.401
PC,-6.6638,4.095,-1.627,0.112,-14.941 1.613
PX,11.0791,8.481,1.306,0.199,-6.061 28.219
T,-0.0383,0.009,-4.156,0.000,-0.057 -0.020

Omnibus:,4.692,Durbin-Watson:,1.97
Prob(Omnibus):,0.096,Jarque-Bera (JB):,3.746
Skew:,0.702,Prob(JB):,0.154
Kurtosis:,3.266,Cond. No.,2290.0


Construct the list of variables that we want to add to the regression model. 

In [24]:
varlist = m.columns.difference(['PC', 'PX', 'T', 'Index', 'M'])

Now we are ready to loop through the variables, adding each one to the benchmark model in turn.

In [27]:
Rlist = [] #initiate a list to store R-squared. 
benchmark = 'M~1+PC+PX+T' 
for var in varlist: 
    tempm = benchmark+'+'+var #add variable 'var' into the specification. 
    tempsmf = smf.ols(tempm, data=m).fit() #estimate the new model. 
    Rlist.append( [var, tempsmf.rsquared, tempsmf.rsquared-reg01.rsquared]) #record the R^2


In [28]:
Rlist

[['AGE', 0.53309830573178807, 0.17712324295586002],
 ['LF', 0.39616849163393431, 0.040193428858006253],
 ['NW', 0.65896220506394698, 0.30298714228801893],
 ['POP', 0.36980201228814835, 0.013826949512220299],
 ['SOUTH', 0.66806772504062861, 0.31209266226470056],
 ['U', 0.40508669966301614, 0.049111636887088084],
 ['URB', 0.45163768462927834, 0.095662621853350283],
 ['W', 0.61677688925584362, 0.26080182647991557],
 ['X', 0.36747365290635281, 0.011498590130424757],
 ['XPOS', 0.46279850938667133, 0.10682344661074328]]
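
Once the results are stored in a list, it takes little extra work to rank them. The sketch below sorts by the gain in $R^2$ (the third entry of each row), largest first. To keep it self-contained it uses only four of the rows above, with the numbers rounded to three decimals.

```python
# A few rows copied from the output above, rounded to three decimals.
Rlist = [['AGE', 0.533, 0.177],
         ['NW', 0.659, 0.303],
         ['POP', 0.370, 0.014],
         ['SOUTH', 0.668, 0.312]]

# Sort by the third entry of each row (the gain in R-squared), largest first.
ranked = sorted(Rlist, key=lambda row: row[2], reverse=True)

for var, r2, gain in ranked:
    print('{0:6s} R2 = {1:.3f}, gain = {2:.3f}'.format(var, r2, gain))

best = ranked[0][0]  # name of the variable with the largest gain
```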

Input/Output (IO)
---------------
In the last example, we came across a function that imports data stored in Excel into Python/pandas. We are now in a position to talk a little more about input and output. 

Most Python books start with the 'print' function for output. 

In [29]:
print('What is your name?')

What is your name?


This doesn't allow us to enter an answer. Try 'input'

In [30]:
name = input('What is your name? ')

What is your name? Felix


The last line prints 'What is your name? ' and stores the response (as a string) in the variable called **name**. Now we can use **name** to construct a response. 

In [31]:
print('Nice meeting you {0}!'.format(name))

Nice meeting you Felix!


This shows basic input and output functionality via *standard* I/O. We can also read input from a file. 

In [32]:
f = open('../data/estimate.txt') #open a file to be read. 
for i,line in enumerate(f): 
    print(line)
f.close()
print('The total number of line in this file is {0}'.format(i+1))

,coef,std err

c,11.6980,1.832

PC,-6.6638,4.095

PX,11.0791,8.481

T,-0.0383,0.009

The total number of line in this file is 5
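
The blank lines in the output appear because each line read from the file still ends with a newline character, and *print* adds another one. The sketch below shows one common way to handle both this and the closing of the file, using a *with* block. So that it runs on its own, it first writes a small stand-in file to a temporary folder; the file name *estimate_demo.txt* is made up for this illustration.

```python
import os
import tempfile

# Create a small stand-in file so the example runs on its own.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'estimate_demo.txt')
with open(path, 'w') as f:
    f.write(',coef,std err\nc,11.6980,1.832\nPC,-6.6638,4.095\n')

lines = []
with open(path) as f:  # the file is closed automatically at the end of the block
    for line in f:
        lines.append(line.rstrip('\n'))  # drop the newline character
        print(lines[-1])

print('The total number of lines in this file is {0}'.format(len(lines)))
```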


As you can see, this text file contains the regression output of our benchmark model. It represents a typical output format from software, namely, a column of coefficients and a column of standard errors. A typical task is to put each standard error under its corresponding estimate, possibly with brackets around it. So the steps are:
1. Import each line. 
2. Separate the coefficient and the standard error. 
3. For each row, put the standard error underneath the coefficient. 

In [33]:
coeff = [] # list to store coefficients. 
std = [] #list to store standard errors
f = open('../data/estimate.txt') #open a file to be read. 
for i,line in enumerate(f):
    if i > 0: #ignore the first line. 
        temp = line.rstrip().split(',') #strip the trailing newline character and split the line into a list, with ',' as the separator. 
        print(temp) #optional, demonstrate what 'split' does.         
        coeff.append([temp[0], temp[1]]) #store coefficient into coeff. 
        std.append(temp[2]) #store standard errors into std. 
f.close()
    

['c', '11.6980', '1.832']
['PC', '-6.6638', '4.095']
['PX', '11.0791', '8.481']
['T', '-0.0383', '0.009']


In [34]:
coeff

[['c', '11.6980'], ['PC', '-6.6638'], ['PX', '11.0791'], ['T', '-0.0383']]

In [35]:
std

['1.832', '4.095', '8.481', '0.009']

In [40]:
final = []
for i,c in enumerate(coeff):
    final.append(c)
    tempstd = '['+std[i]+']'
    final.append([' ',tempstd])
    

In [41]:
final

[['c', '11.6980'],
 [' ', '[1.832]'],
 ['PC', '-6.6638'],
 [' ', '[4.095]'],
 ['PX', '11.0791'],
 [' ', '[8.481]'],
 ['T', '-0.0383'],
 [' ', '[0.009]']]

Now we can write this back to a CSV! 

In [39]:
s = '\n'.join([','.join([i for i in l]) for l in final]) #turn this into a string
f = open('../data/new_estimate.csv', 'w') #open a file to write
f.write(s)
f.close()
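
Joining the entries by hand works here because none of them contains a comma. For messier data, Python's built-in *csv* module takes care of quoting. The sketch below is an alternative, not the method used in this notebook; it writes to an in-memory buffer instead of a real file so that it can run anywhere, and it uses only part of the table above.

```python
import csv
import io

# Part of the table built above.
final = [['c', '11.6980'], [' ', '[1.832]'],
         ['PC', '-6.6638'], [' ', '[4.095]']]

buffer = io.StringIO()  # stands in for a file opened for writing
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(final)  # one CSV line per inner list
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original rows.
read_back = list(csv.reader(io.StringIO(csv_text)))
```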

User-Defined Functions
-----------
For a bigger task, we sometimes want to reuse some of the code without having to write the whole thing again and again. User-defined functions allow us to reuse a block of code. They also allow us to break a big task into multiple smaller tasks. 

One example is to turn our problem above into a function. Let's say we need to put the standard errors under the coefficients for multiple output files. We can first define a function that rearranges the table, then use a loop to go through all the output files. 

We can do this by defining three different functions, plus a fourth that calls them in sequence. The first one opens the file and stores the coefficients and standard errors.

In [29]:
def get_coef(filename):
    coeff = [] # list to store coefficients. 
    std = [] #list to store standard errors
    f = open(filename) #open a file to be read. 
    for i,line in enumerate(f):
        if i > 0: #ignore the first line. 
            temp = line.rstrip().split(',') #strip the trailing newline character and split the line into a list, with ',' as the separator. 
            coeff.append([temp[0], temp[1]]) #store coefficient into coeff. 
            std.append(temp[2]) #store standard errors into std. 
    f.close()
    return coeff, std #returning the two lists. 
    

In [30]:
def put_under(coef, std):
    final = [] #initiate the final list
    for i,c in enumerate(coef): #loop through the coefficient list
        final.append(c) #append the coefficients
        tempstd = [' '] #initiate a list to store the modified standard errors
        temp = '('+std[i]+')' #put brackets around the standard errors
        tempstd.append(temp) #add the modified standard error row into the list. 
        final.append(tempstd) #add a new row. 
    return final #return the double list.

In [31]:
def writecsv(final, savefilename):
    s = '\n'.join([','.join([i for i in l]) for l in final]) #turn this into a string
    f = open(savefilename, 'w') #open a file to write
    f.write(s)
    f.close()

In [32]:
def rearranging(filename, savefilename):
    coef, std = get_coef(filename)
    final = put_under(coef, std)
    writecsv(final, savefilename)
    return coef, std, final 
    

Let's test this on one file. 

In [33]:
filename = '../data/estimate.txt'
sfilename = '../data/save_est.csv'
rearranging(filename, sfilename)

([['c', '11.6980'], ['PC', '-6.6638'], ['PX', '11.0791'], ['T', '-0.0383']],
 ['1.832', '4.095', '8.481', '0.009'],
 [['c', '11.6980'],
  [' ', '(1.832)'],
  ['PC', '-6.6638'],
  [' ', '(4.095)'],
  ['PX', '11.0791'],
  [' ', '(8.481)'],
  ['T', '-0.0383'],
  [' ', '(0.009)']])
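
Functions become more reusable when the details that might change are passed in as arguments. As a sketch, the hypothetical variant below (called *put_under_with* here; it is not part of the notebook) lets the caller choose the bracket characters and falls back to round brackets by default.

```python
def put_under_with(coef, std, left='(', right=')'):
    """Place each standard error, wrapped in brackets, under its coefficient.

    left and right are the bracket characters; round brackets are the default.
    """
    final = []
    for c, s in zip(coef, std):  # walk through both lists in step
        final.append(c)
        final.append([' ', left + s + right])
    return final


coef = [['c', '11.6980'], ['PC', '-6.6638']]
std = ['1.832', '4.095']

round_version = put_under_with(coef, std)                        # default brackets
square_version = put_under_with(coef, std, left='[', right=']')  # square brackets
print(round_version)
print(square_version)
```

The short text in triple quotes is a docstring; Python displays it when you call *help* on the function.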

Now, let's do it for all of them. 
First we generate all the filenames we need to apply the rearrangement for. 

In [34]:
filelist = ['estimate'+str(i)+'.csv' for i in range(0,10)] #this is using list comprehension, which is a neat feature in Python
#You can use a standard for-loop

In [35]:
filelist

['estimate0.csv',
 'estimate1.csv',
 'estimate2.csv',
 'estimate3.csv',
 'estimate4.csv',
 'estimate5.csv',
 'estimate6.csv',
 'estimate7.csv',
 'estimate8.csv',
 'estimate9.csv']

Then we generate a list of save file names.

In [36]:
savefilelist = ['save_'+i for i in filelist]
savefilelist

['save_estimate0.csv',
 'save_estimate1.csv',
 'save_estimate2.csv',
 'save_estimate3.csv',
 'save_estimate4.csv',
 'save_estimate5.csv',
 'save_estimate6.csv',
 'save_estimate7.csv',
 'save_estimate8.csv',
 'save_estimate9.csv']

In [37]:
path = '../data/' 
for i,fn in enumerate(filelist): #loop through each file in filelist
    coef, std, final = rearranging(path+fn, path+savefilelist[i]) #apply modification to file and save it to a new file 
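
When two lists need to be walked through together, the built-in function *zip* is an alternative to indexing the second list by position. The sketch below only builds and prints the pairs of file names, using three files instead of ten, so that it can run without the data files; the list *pairs* is introduced for this illustration.

```python
filelist = ['estimate' + str(i) + '.csv' for i in range(0, 3)]
savefilelist = ['save_' + name for name in filelist]
path = '../data/'

pairs = []
# zip hands back one element from each list at a time.
for fn, sfn in zip(filelist, savefilelist):
    pairs.append((path + fn, path + sfn))
    print(path + fn, '->', path + sfn)
```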

Appendix
======
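The code below shows how I generated the test files. Note that, as written, the third column holds t-statistics (*tvalues*) rather than standard errors; replacing *tvalues* with *bse* would store the standard errors instead.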

In [39]:
for i,var in enumerate(varlist): 
    tempm = benchmark+'+'+var #add variable 'var' into the specification. 
    tempsmf = smf.ols(tempm, data=m).fit() #estimate the new model. 
    header = [' ', 'coeffs','tstats']
    final = [header]
    for k, row in enumerate(tempsmf.model.exog_names):
        temp = [row, tempsmf.params.iloc[k], tempsmf.tvalues.iloc[k]] #use .iloc to pick the kth entry by position
        final.append(temp)
    s = '\n'.join([','.join([str(i) for i in l]) for l in final])
    savefilename = '../data/estimate'+str(i)+'.csv'
    f = open(savefilename, 'w')
    f.write(s)
    f.close()

Regular Expression
===============
A regular expression (RE) is an extremely powerful way to describe patterned text. It has a solid foundation in theoretical computer science and formal language theory. For our purposes, it allows us to search through a text file when what we are looking for is not one exact string but a family of variations. Examples include hyperlinks or the email addresses of a particular organisation. 

Most text editors provide regular expression support, which also makes find/replace fairly easy. 

The demonstration here barely scratches the surface of what RE can do. The aim is to convince you that RE is a useful skill that is worth learning. One of the examples also demonstrates how Python can be used to download the HTML code behind webpages. Combined with RE, this allows us to collect data from webpages very efficiently. 

In [None]:
import re #import the regular expression module

We start off by examining a data file first. 

In [None]:
m1 = pd.read_csv('../data/datastream_equities_201405.csv', header=0, index_col=0) #load data

In [None]:
m1.head()

Let's assume we only want observations on a particular date of every year. 

In [None]:
reexp = '^23/06/[0-9]{4}'
re105 = re.compile(reexp)
dlist = [i for i,e in enumerate(m1.index) if re105.search(e) is not None]
sm = m1.iloc[dlist,:]

In [None]:
sm
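
To see what the pattern does without the data file, the sketch below applies the same regular expression to a short list of made-up date strings. The symbol '^' anchors the match at the start of the string, and '[0-9]{4}' asks for exactly four digits.

```python
import re

# Made-up index values for illustration.
dates = ['23/06/2011', '24/06/2011', '23/06/2012', '23/07/2012', '123/06/2013']

pattern = re.compile(r'^23/06/[0-9]{4}')

# Keep the positions of the entries that match, as in the cell above.
positions = [i for i, d in enumerate(dates) if pattern.search(d) is not None]
selected = [dates[i] for i in positions]
print(selected)
```

The last entry is rejected because of the '^': the text '23/06/' does appear in it, but not at the very start.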

Let's take a look at how RE can work with a text file. Let's say we want to extract all the hyperlinks in a particular HTML file called *temp.html*.

We first define our regular expression, then we ask Python to search through every line in the file and return each line that contains a match. We save each of these lines in a list called *httplist*.

In [None]:
f = open('../data/temp.html')
reexp = 'http.*/"'
rehttp = re.compile(reexp)
httplist = [e for i,e in enumerate(f) if rehttp.search(e) is not None]
f.close()    

In [None]:
httplist

That's good, but the lines also contain all the other code. We would like to extract just the hyperlink addresses. To do that, we use the attributes of the match object returned by **search**. The attribute **regs** contains the starting and ending positions of the text that matches the regular expression (the documented methods **span** and **group** give the same information). 

In [None]:
f = open('../data/temp.html')
reexp1 = 'http.*/"'
rehttp1 = re.compile(reexp1)
addresslist = []
for i,e in enumerate(f):
    temp = rehttp1.search(e)
    if temp is not None:
        start = temp.regs[0][0]
        end = temp.regs[0][1]
        addresslist.append(temp.string[start:end])
f.close()  

In [None]:
addresslist

The addresses above include both secure (https) and non-secure (http) links. Let's say we only want the non-secure links. We can modify our regular expression to ensure there is no 's' after 'http'.

In [None]:
f = open('../data/temp.html')
reexp2 = 'http[^s].*/"'
rehttp2 = re.compile(reexp2)
addresslist = []
for i,e in enumerate(f):
    temp = rehttp2.search(e) #use the pattern compiled in this cell
    if temp is not None:
        start = temp.regs[0][0]
        end = temp.regs[0][1]
        addresslist.append(temp.string[start:end])
f.close()  

In [None]:
addresslist

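We can narrow the search further, for example to links that do not start with 'https' and that contain 'economics-and-finance'.

In [None]: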
f = open('../data/temp.html')
reexp3 = 'http[^s].+economics-and-finance.*/"'
rehttp3 = re.compile(reexp3)
addresslist = []
for i,e in enumerate(f):
    temp = rehttp3.search(e)
    if temp is not None:
        start = temp.regs[0][0]
        end = temp.regs[0][1]
        addresslist.append(temp.string[start:end])
f.close()  

In [None]:
addresslist
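
One thing to be aware of is that '.*' is *greedy*: it matches as much text as it can. If a line contains two links, the pattern 'http.*/"' returns everything from the start of the first link to the end of the last one. The sketch below shows this on a made-up line of HTML, together with one possible alternative that stops at the closing quote and returns every link on the line.

```python
import re

# A made-up line of HTML containing two links.
line = '<a href="http://example.com/a/">A</a> <a href="https://example.com/b/">B</a>'

# Greedy: runs from the first 'http' to the last '/"' on the line.
greedy = re.search(r'http.*/"', line).group(0)

# Alternative: match 'http' or 'https', then anything that is not a quote.
links = re.findall(r'https?://[^"]+', line)

# Keep only the non-secure links.
insecure = [u for u in links if u.startswith('http://')]

print(greedy)
print(links)
print(insecure)
```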

Let's say we want to find the names of the people who work for the School of Economics and Finance.  
In this case, we specify a web address and ask Python to download the HTML page. Then we extract the names of the people listed on the page. 

In [None]:
import urllib.request as ul

furl = ul.urlopen("http://business.curtin.edu.au/schools-and-departments/economics-and-finance/our-people/") #open a link to the HTML page. 
tempfile = "temp02.html" #The name of the file that we want to save the downloaded HTML. 
f = open(tempfile, 'w') #open a file to write.
f.write(furl.read().decode('utf-8')) #Save the HTML file to tempfile. 
f.close()
f = open(tempfile) #open the saved file for name extraction. 
re_name = re.compile(r"profile/view/[A-Za-z]+\.[A-Za-z]+") #define the regular expression as a raw string so that '\.' reaches the RE engine unchanged. 
staffname = [] #initialise the list to store the names. 
for i,line in enumerate(f):
    temp = re_name.findall(line) 
    if len(temp) > 0:
        temp1 = temp[0].split('/')
        staffname.append(temp1[-1])
f.close()
firstname = [i.split(".")[0] for i in staffname]  #separating first name and surname. 
surname = [i.split(".")[1] for i in staffname]
back = list(zip(firstname,surname)) #pair each first name with its surname


In [None]:
staffname
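
The splitting step can also be done by the regular expression itself. Round brackets inside a pattern define *groups*, and the match object returns the text captured by each group. The sketch below applies this idea to a few made-up lines of HTML, so the names in it are purely illustrative.

```python
import re

# Made-up lines of HTML for illustration.
lines = ['<a href="/profile/view/Jane.Doe">Jane Doe</a>',
         '<p>No profile link here</p>',
         '<a href="/profile/view/John.Smith">John Smith</a>']

# Each pair of round brackets captures one part of the name.
re_name = re.compile(r'profile/view/([A-Za-z]+)\.([A-Za-z]+)')

people = []
for line in lines:
    found = re_name.search(line)
    if found is not None:
        people.append((found.group(1), found.group(2)))  # (first name, surname)

print(people)
```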

We have only just touched the surface of RE here. In fact, many implementations allow RE to have *conditional statement* and even *recursion* (a different form of loop). For more information, check out *[this site][regular]*. 
[regular]: http://www.regular-expressions.info/tutorial.html