## What is a programming language?

<img src="natural_languages_bookends.png">

In what sense is a *programming language* a language like English or French?

While programming languages clearly have a different **purpose** from natural languages, they can be meaningfully compared in other respects.

### Grammar 

What is the syntactic structure of the sentence?

### Semantics

What is the meaning of a grammatical sentence?


### Pragmatics

What assumptions do I need to invoke to interpret the sentence?

We'll go over each of these aspects in this notebook. 

### A language with a purpose

Most of us use our native languages to talk about everything under the sun and to accomplish all sorts of things. Nevertheless, there is a widespread notion that some languages are more suitable for certain purposes than others.

According to Google autocomplete, 

- Russian is the language of literature
- French is the language of desire
- Chinese is the language of the future
- Italian is the language of music
- Greek is the language of mathematics

### What language is Python?

Although these conventional images may not be wholly accurate in the case of natural languages, programming languages are actually constructed with a very specific purpose in mind. You use them to control what your computer does. In other words,


- Python is the language of bossing your computer around

Or, more delicately phrased:

- Python is the language of computation and algorithms

### Grammaticality

#### Natural languages

In natural languages the difference between grammatical and ungrammatical statements is sometimes blurry. English, for instance, mostly puts modifying adjectives on the left side of the noun, e.g. 

> I'll take you **unknown places**

Reversing the word order is likely to raise some eyebrows, but people would nevertheless have no trouble understanding the sentence, which might be considered *borderline grammatical*:

> I'll take you **places unknown**

In some instances the inverted word order is uncontroversial:

> Paradise Lost


#### Formal languages

Formal languages have a rigid notion of grammaticality. Either a sentence is a part of the language, or it isn't.

A formal language is often conceptualized as a set of sentences, namely all sentences that are a part of that language. We can thus say that a sentence is grammatical if it *belongs* to the (set of sentences in the) language.
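As a minimal sketch of this set view, the snippet below represents a tiny language directly as a Python set and tests grammaticality with the `in` operator. The two sentences and the helper name `is_grammatical` are invented for illustration.

```python
# A toy formal language, represented directly as a set of sentences.
# The sentences are made up for this illustration.
toy_language = {
    "the dog barks",
    "the cat sleeps",
}

def is_grammatical(sentence, language):
    """A sentence is grammatical exactly when it is a member of the language."""
    return sentence in language

print(is_grammatical("the dog barks", toy_language))  # True
print(is_grammatical("barks the dog", toy_language))  # False
```

There is no middle ground here: membership in a set is either true or false, which is what makes the notion of grammaticality so strict.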

Formal languages are based on collections of rules, which can be thought of as building blocks for constructing sentences.

### A formal language of treasure maps

![Treasure map of Jesse James](jesse_james_treasure_map.jpg)

#### Some sentences generated by the language

- The treasure chest is hidden under the old oak
- The gold is hidden at the cemetery
- The treasure chest is hidden under the floor

#### Discussion

These sentences seem to follow a simple pattern where you plug in words or phrases from a list into predefined slots.

Indicate by inserting brackets into the sentences where these slots are.

Here are the sentences with brackets inserted.

- The \[treasure chest\] is hidden \[under the old oak\]
- The \[gold\] is hidden \[at the cemetery\]
- The \[treasure chest\] is hidden \[under the floor\]

All of the sentences can be summarized by a master sentence with *placeholder symbols* inserted. 

> *Master* $\rightarrow$ The *Treasure* is hidden *AtPlace* 

The placeholder symbols then need to be defined. First, the treasure:

> *Treasure* $\rightarrow$ gold <br/>
> *Treasure* $\rightarrow$ treasure chest

The location apparently consists of two things, a relative location (e.g. "under" or "at") and a landmark (e.g. "old oak").

Therefore, we split that placeholder further:


> *AtPlace* $\rightarrow$ *RelativeLocation* the *Landmark* <br/>
> *RelativeLocation* $\rightarrow$ at <br/>
> *RelativeLocation* $\rightarrow$ under <br/>
> *Landmark* $\rightarrow$ old oak <br/>
> *Landmark* $\rightarrow$ floor <br/>
> *Landmark* $\rightarrow$ cemetery <br/>


What we have written above in terms of a master sentence with pluggable placeholder symbols is a grammar that defines a small formal language. 

This sort of grammar is also called a *rewrite system*, because the individual elements of the grammar are small rules that tell you that you may rewrite (replace) the symbol on the left side of the arrow with whatever is on the right side. 
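The rewriting process can be sketched in a few lines of Python. This is one possible encoding, not the only one: the dictionary `GRAMMAR`, the function `derive`, and the convention of always rewriting the leftmost placeholder are assumptions made for this illustration.

```python
# The treasure map grammar as a dictionary: each placeholder symbol maps to
# the list of right-hand sides it may be rewritten as.
GRAMMAR = {
    "Master": [["The", "Treasure", "is", "hidden", "AtPlace"]],
    "Treasure": [["gold"], ["treasure", "chest"]],
    "AtPlace": [["RelativeLocation", "the", "Landmark"]],
    "RelativeLocation": [["at"], ["under"]],
    "Landmark": [["old", "oak"], ["floor"], ["cemetery"]],
}

def derive(choices):
    """Rewrite the leftmost placeholder until no placeholders remain.

    `choices` maps a placeholder to the index of the rule to apply
    (the first rule is used when a placeholder is not listed).
    Returns every intermediate sentence, one per rewrite step.
    """
    sentence = ["Master"]
    steps = [" ".join(sentence)]
    while any(word in GRAMMAR for word in sentence):
        # Locate the leftmost placeholder symbol.
        position = next(i for i, word in enumerate(sentence) if word in GRAMMAR)
        symbol = sentence[position]
        replacement = GRAMMAR[symbol][choices.get(symbol, 0)]
        sentence = sentence[:position] + replacement + sentence[position + 1:]
        steps.append(" ".join(sentence))
    return steps

for step in derive({"Treasure": 0, "RelativeLocation": 1, "Landmark": 0}):
    print(step)
```

Each printed line differs from the previous one by exactly one rewrite, and the last line, "The gold is hidden under the old oak", contains no placeholders.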

#### Exercise

Let's try it out. In the cell below, begin by writing the master sentence on the first line (i.e. the right-hand side of the *Master* rule). Then replace one symbol of your choice in the sentence, writing the new sentence on the next line.

Repeat this process until there are no longer any symbols left. Once you are done, repeat the process to generate a different sentence.

#### Another view of the generating process

<img src="treasure_non_recursive_parseviz.png">

### Limitations of the treasure map grammar

The number of sentences that can be generated by the simple rule-based grammar above is small. 

One way to greatly enlarge this number is through adding more rules of the type:

> *Landmark* $\rightarrow$ mountain top <br/>
> *Landmark* $\rightarrow$ enchanted forest <br/>

#### Discussion

Say we want to be able to express more complex descriptions of the location of the treasure, such as the following sentence:

- The gold is buried under the old oak at the mountain top in the enchanted forest.

What kind of relation holds between the sentence parts *under the old oak*, *at the mountain top*, and *in the enchanted forest*?

Can we get away with simply adding more rules to the grammar in order to generate a sentence like the one above?

Regardless of any such additions to the vocabulary, the treasure map language is incapable of expressing more complex descriptions of the treasure and its location. In particular, we run into trouble when meaning is compositional (obtained by combining smaller units) and we do not wish to impose some arbitrary limit on the number of such combinations.


(Some formal observations:)

- The number of sentences in the language remains finite. 
- It's easy to calculate the maximum number of words a sentence can contain.
- The number of derivation steps you need to generate a sentence is always the same.
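These observations can be checked by brute force. The sketch below enumerates every sentence of the non-recursive grammar by filling the three slots of the master sentence in all possible ways; the variable and function names are chosen for this illustration.

```python
from itertools import product

# Options for each slot of the master sentence, copied from the rules above.
treasures = ["gold", "treasure chest"]
relative_locations = ["at", "under"]
landmarks = ["old oak", "floor", "cemetery"]

def all_sentences(treasures, relative_locations, landmarks):
    """Enumerate every sentence the non-recursive grammar can generate."""
    return [
        f"The {treasure} is hidden {relation} the {landmark}"
        for treasure, relation, landmark in product(treasures, relative_locations, landmarks)
    ]

sentences = all_sentences(treasures, relative_locations, landmarks)
print(len(sentences))                          # 2 * 2 * 3 = 12
print(max(len(s.split()) for s in sentences))  # 9 words in the longest sentence

# Adding two landmarks enlarges the language, but it stays finite.
more_landmarks = landmarks + ["mountain top", "enchanted forest"]
print(len(all_sentences(treasures, relative_locations, more_landmarks)))  # 2 * 2 * 5 = 20
```

The size of the language is simply the product of the number of options per slot, so new vocabulary multiplies the count but never makes it unbounded.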

### Recursion 

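Recursion allows us to express sentences with a long chain of modifications like the one above. Recursion means defining something in terms of itself, here a rule whose right-hand side mentions the very symbol being defined. It is a special form of repetition and is pervasive in programming languages.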

Allowing chains of modifications can be done by adding a single rule to our grammar. 

> *AtPlace* $\rightarrow$ *RelativeLocation* the *Landmark* *AtPlace* <br/>

Here is an example derivation (with some non-essential omissions):

- The gold is buried \[*AtPlace*\]
- The gold is buried \[*RelativeLocation* the *Landmark* *AtPlace*\].
- The gold is buried \[under the old oak *AtPlace*\].
- The gold is buried \[under the old oak \[ *RelativeLocation* the *Landmark* *AtPlace* \] \].
- The gold is buried \[under the old oak \[ at the mountain top *AtPlace* \] \].
- The gold is buried \[under the old oak \[ at the mountain top \[ *RelativeLocation* the *Landmark* \] \] \].
- The gold is buried \[under the old oak \[ at the mountain top \[ in the enchanted forest \] \] \].

#### Sentence with right-recursive rule

<img src="treasure_right_branching_parseviz.png">

The formal properties of the grammar have changed:

- The number of sentences in the language is infinite. 
- There is no bound on the number of words in a sentence. 
- The number of derivation steps you need to generate a sentence has a minimum but no maximum.
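A recursive Python function can mirror the recursive rule and make the growth visible. This is a sketch under a few assumptions: the word "in" is added as a relative location (the example sentence uses it, but no rule above defines it), only three landmarks are included, and the name `at_place` is made up for the illustration.

```python
from itertools import product

# Vocabulary for the sketch; "in" is an assumed extra RelativeLocation.
relative_locations = ["at", "under", "in"]
landmarks = ["old oak", "mountain top", "enchanted forest"]

def at_place(depth):
    """Yield every AtPlace phrase made of exactly `depth` location phrases.

    depth == 1 corresponds to the non-recursive rule; larger depths apply
    the recursive rule AtPlace -> RelativeLocation the Landmark AtPlace.
    """
    for relation, landmark in product(relative_locations, landmarks):
        phrase = f"{relation} the {landmark}"
        if depth == 1:
            yield phrase
        else:
            # The function calls itself, just as the rule mentions itself.
            for rest in at_place(depth - 1):
                yield f"{phrase} {rest}"

for depth in range(1, 5):
    count = sum(1 for _ in at_place(depth))
    print(depth, count)  # 9, 81, 729, 6561: the count is 9 ** depth
```

Since `depth` can be any positive number, there is no longest phrase and no final count, which is exactly the change in formal properties listed above.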


#### Discussion (time permitting)

Here is an alternative rule that also adds compositionality of the place name to the grammar. What does it do differently? Can you think of any reasons for preferring the first version?

> *AtPlace* $\rightarrow$ *AtPlace* *RelativeLocation* the *Landmark* <br/>


#### Sentence with left-recursive rule

<img src="treasure_left_branching_parseviz.png">

### Semantics 

While writing code that obeys the grammar of a programming language is an accomplishment, it doesn't ensure that the program is *bug free*. 

Code may be well-formed but semantically undefined. For such statements the computer understands your instructions, but carrying them out leads to errors. 

In [1]:
# Division by zero
x = 25 
y = 0
x / y

ZeroDivisionError: division by zero

In [2]:
# Calling a method that does not exist
"a most unusual story".make_title()

AttributeError: 'str' object has no attribute 'make_title'

In [3]:
# Combining incompatible types of things
a_number = 3
a_letter = "X"
a_number + a_letter

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [4]:
# However, numbers work
a_number + a_number

6

In [5]:
# Strings may be combined
a_letter + a_letter

'XX'
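The difference between a grammatical error and an error that only shows up when the code is carried out can be made explicit with Python's standard `ast` module. The sketch below is limited to single expressions, and the function name `classify` and its return strings are invented for the illustration.

```python
import ast

def classify(source):
    """Separate grammatical errors from errors that only appear when running."""
    try:
        ast.parse(source, mode="eval")  # grammar check only; nothing is executed
    except SyntaxError:
        return "ungrammatical"
    try:
        eval(source)                    # now actually carry out the instruction
    except Exception as error:
        return "grammatical, but fails when run: " + type(error).__name__
    return "grammatical and runs"

for source in ["25 / 0", "3 + 'X'", "'X' + 'X'", "25 /"]:
    print(repr(source), "->", classify(source))
```

Only the last expression, `25 /`, is rejected by the grammar. The first two are perfectly well-formed and fail only once Python tries to compute a value.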

Some programming languages can check for semantic errors like the ones above, where incompatible types are combined, before the program is run. These are referred to as *statically typed* languages, because the type of any variable like `a_letter` is known even before running the program. (Recall that the name of the variable, despite being meaningful to humans, is nothing but an uninterpretable symbol to the computer.) Java, C and C++ are examples of such languages.

Python, on the other hand, is *dynamically typed*. In practical terms that means you have to deal with semantic errors due to type incompatibility while the program is running (they could happen when your program has been running for an hour). Statically typed languages reject such programs without even attempting to run them, potentially saving you time.

Why would anyone then use a dynamically typed language? Well, it turns out that it takes a lot of effort to keep track of the exact types in a complicated program.

For instance, if you wanted a data structure to store how many times you've seen different words in a collection of documents, you might write something like the following in Java:

```java
HashMap<String, HashMap<String, Integer>> wordCountsByDoc = new HashMap<String, HashMap<String, Integer>>(); 
```

Python would let you get away with this:

```python
word_counts_by_doc = {}
```
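To see how such an empty dictionary ends up holding word counts per document, here is a minimal sketch. The two-document corpus and the document names are hypothetical, and this is just one of several idiomatic ways to fill the structure.

```python
# Hypothetical mini-corpus: document name -> text.
documents = {
    "doc1": "the gold is hidden under the old oak",
    "doc2": "the gold is hidden at the cemetery",
}

word_counts_by_doc = {}
for name, text in documents.items():
    # Create the inner dictionary for this document on first use.
    counts = word_counts_by_doc.setdefault(name, {})
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

print(word_counts_by_doc["doc1"]["the"])   # 2
print(word_counts_by_doc["doc2"]["gold"])  # 1
```

Nowhere did we have to declare that the outer dictionary maps strings to dictionaries of integers. Python simply accepts whatever is stored.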


### Lack of pragmatics 

Pragmatics is what enables speakers of a language to communicate efficiently. They do not have to spell out every single detail, because they can rely on a set of shared assumptions. Even if the message comes out less than perfect, speakers are often able to infer the meaning. 

Programming languages offer no such pragmatics. The computer is incredibly literal-minded and will do exactly as instructed, offering no charitable interpretation of your code, even in cases when humans would have no trouble inferring what you meant. 

Yet this is not as bad as it sounds. Computer languages have less need for pragmatics, because they are restricted in a way that natural languages are not. A syntactically and semantically valid line in a programming language will have one and only one interpretation. English sentences, in contrast, are often ambiguous and can mean a number of things, depending on the context. One of the more famous examples of natural language ambiguity is prepositional attachment:


> He saw the girl with the telescope

Did the girl or the guy have the telescope? I.e. which of the two interpretations below is the correct one?

> He saw \[the girl \[with the telescope\]\] <br/>
> He saw \[the girl \] \[with the telescope\]

Whenever programming languages encounter potentially ambiguous statements, they employ preference rules to select a single interpretation.

Below, the value of the mathematical expression depends on the order in which you evaluate its components: subtraction or multiplication first?

```
3 - 2 * 5
```

As you might have guessed, multiplications are handled first in Python. Alternative orderings of the operations can be forced by using parentheses.


In [6]:
3 - 2 * 5

-7

In [7]:
(3 - 2) * 5

5

In [8]:
3 - (2 * 5)

-7
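The single interpretation Python settles on can be inspected directly, much like the bracketed readings of the telescope sentence. The sketch below uses the standard `ast` module to look at the structure Python assigns to the expression without evaluating it; it assumes Python 3.8 or later, where number literals are parsed as `ast.Constant` nodes.

```python
import ast

# Parse the expression without evaluating it.
tree = ast.parse("3 - 2 * 5", mode="eval").body

# The outermost operation is the subtraction ...
print(type(tree.op).__name__)        # Sub
print(tree.left.value)               # 3

# ... and its right-hand operand is the whole multiplication 2 * 5.
print(type(tree.right.op).__name__)  # Mult
```

In bracket notation this is the reading \[3 - \[2 * 5\]\], never \[\[3 - 2\] * 5\].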