# The Big Picture

<div style="width:500px;">


<p>Before we begin mining data for the betterment of the world, we should take a minute to reflect on what programming is really about.</p>

<p>Programming is about telling the computer to do tedious, repetitive things that we personally would rather not do.</p>

<p>Anything a computer can do, we could <i>choose</i> to do with paper and pencil, and vice versa.</p>

<p>In fact, chances are that you have already done some data processing at some point, such as calculating how much you spent during a holiday. When it comes to calculating how much you spent during a <i>year</i>, it is worth offloading all those calculations to a computer.</p>

</div>
<img src="programming.png" alt="Variables" style="width:50%;height:55%;float:right;">

### Hand-calculation versus programming

<div style="width:500px;">

<p> Consider how you would do the following by hand: you have a table with all surgeons in Glasgow, and you want to figure out what proportion of all female surgeons work at Gartnavel Hospital.</p>


<p> You would probably do something like: </p> 

1. Count all the rows that have gender F and hospital Gartnavel

2. Write down this number so you don't forget

3. Count all the rows that have gender F

4. Write it down

5. Calculate the ratio

Just like how you would repeatedly write things down externally, a computer program works by doing simple operations and storing the result in a **variable**. In other words, we save a value by giving it a memorable name.

This is useful, because it means that we can express our steps in terms of those names, and then simply look up their actual values.

Without further ado, let's have a look at the Python code below, and you might be shocked to discover just how intuitive programming can be.
</div>

In [2]:
gartnavel = 14
total = 67
ratio = gartnavel / total

print(ratio)

0.208955223880597


<div style="width:500px;">

<p>This idea of temporarily storing things before processing them further is central in programming. </p>
    
<p> When programming with Python, whenever we bring a new value into play and assign it a label, we are free to revise the value later:</p>
    
</div>

In [3]:
gartnavel = 14
gartnavel = 16

print(gartnavel)

16


<div style="width:500px;">
It may also be that we forgot one surgeon and need to add 1 to the `gartnavel` variable.

Here, Python first evaluates the right-hand side, `gartnavel + 1`, using the *old* value, and then stores the result under the name on the left-hand side. When `gartnavel` is referenced later it will have the new value.

</div>

In [None]:
gartnavel = 14
gartnavel = gartnavel + 1

print(gartnavel)

### Variable syntax

<div style="width:500px;">
There are some rules for variable assignments. Variable names must start with a letter or an underscore (by convention a lower-case letter) and can only contain letters, numbers and underscores. Therefore we may write `gartnavel = 14`, but not `Gartnavel Hospital = 14`, not `g@rtnavel = 14` and certainly not `2019gartnavel = 14`.

Python has strict rules for everything so that the computer can interpret the code unambiguously. Violating them results in **syntax errors**.

See for yourself what happens:
</div>

In [5]:
2019gartnavel = 14



SyntaxError: invalid syntax (<ipython-input-5-3fad73510cdf>, line 1)
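A gentler way of checking a name, without triggering an error, is sketched below. It relies on the `isidentifier()` method that every Python string has, applied to the names discussed above. Note that it only checks the spelling rules: it would also accept reserved words such as `for`, which Python does not allow as variable names.

```python
# Each candidate name is written as a string so that we can test it safely
print("gartnavel".isidentifier())           # True: letters only
print("Gartnavel Hospital".isidentifier())  # False: contains a space
print("g@rtnavel".isidentifier())           # False: contains the @ symbol
print("2019gartnavel".isidentifier())       # False: starts with a digit
print("gartnavel_2019".isidentifier())      # True: digits are fine after the first character
```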

### Other operators

<div style="width:500px;">

Python, by the way, has built into it all the operators you would find on a calculator (`+`, `-`, `*`, `/`, `%`, `**`). Don't worry about spaces around operators (Python ignores them within a line, although indentation at the start of a line does matter) and add parentheses as you see fit, as long as you close them.

Also note that you can add comments using the hash-symbol (`#`) for messages that are useful for the programmer (yourself) but not part of the program logic.

</div>

In [10]:
a = 3
b = 4

total = a + b       #Named total, because sum would hide Python's built-in sum()
product = a * b
ratio = a / b
remainder = a % b
exponent = a ** b

combo = (( a * b ) * a ) + b
print(combo) #Try out the others

40


### Variable types

<div style="width:500px;">

Here's another thing. Python's functionality goes way beyond that of a normal calculator, because it can also deal with **strings**. Strings consist of characters, enclosed by quotation marks, like `name = "Gartnavel"`. The quotation marks are important, because they allow Python to distinguish between numbers-as-string (`a = '3'`) and numbers-as-numbers (`a = 3`).

Why are strings useful? Many programming tasks involve processing and producing text. For example, we may want to glue two strings together (*concatenate* them). We can use the `+` operator for this.

</div>

In [12]:
name = 'Peter'
surname = 'Jones'

fullname = name + ' ' + surname  #Note the blank space in between
print(fullname)

Peter Jones
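The distinction between numbers-as-strings and numbers-as-numbers matters as soon as `+` is involved. The short sketch below (the variable names are only for illustration) shows the same operator behaving differently depending on the type of its values.

```python
a = 3        #A number
b = '3'      #A string that happens to contain a digit

print(a + a)        #6, because + adds numbers
print(b + b)        #33, because + glues strings together
print(int(b) + a)   #6, because int() converts the string to a number first
```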


<div style="width:500px;">

To complicate things, Python distinguishes between whole numbers (*integers*) and decimals (*floats*, called *doubles* in some other languages). A number written with a decimal point is treated as a float.

Moreover, there are many situations when we wish to store the value `True` or `False`. These are called *Boolean values* and are not to be confused with strings. We'll discuss them more later.

</div>

In [3]:
name = "Gartnavel"      #String
name = 'Gartnavel'      #Both single-quote and double-quote work!
age = 45                #Integer
age = 45.5              #Float (a decimal number)
is_female = True        #Boolean
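If you are ever unsure which type a value has, the built-in `type()` function will tell you. A minimal sketch, using the same values as above:

```python
print(type("Gartnavel"))   #<class 'str'>
print(type(45))            #<class 'int'>
print(type(45.5))          #<class 'float'>
print(type(True))          #<class 'bool'>
```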

# Input, Output and the In-Between Part


<img src="generating_mean.png" alt="Variables" style="width:33%;height:55%;float:right;">
<div style="width:500px;">



What we have been doing so far is using Python the way we use a calculator. We manually <i>type in</i> data, do something to it, and then <i>print</i> it for immediately visible feedback.
    
When we `print`, what happens is that the program sends data to the surrounding programming environment. This channel of output is called **standard output**. The programming environment can then display it.

We mention this, because many programming workflows look very different to this.

</div>


<div style="width:500px;">
 
In order for programming to save us work, it is rare for programmers to type in data within the program. Generally, the data has been written to a separate file, and the program *reads in* this data to process it.

Moreover, our desired end-product is not always a single number (such as a sum or a mean), but often another file, such as a graphics file that we could put in our report.
 
   

To do these more complex workflows - the ones that truly save us effort - we will need to go beyond the built-in Python functionality. 


<img src="generating_graph.png" alt="Variables" > 
</div>




### Using functions

When we talk about Python, the language, we are really talking about a set of rules for automatically translating code such as the `+` operator into instructions the computer can carry out. Like everything else, these rules are stored in files, but we never see them directly: we let our programming environment take care of it whenever we press Run.

Python, the language, doesn't only contain operators. It also contains a range of **functions**: labelled code snippets stored elsewhere that we can run by referring to their names.

A tell-tale sign of a function is that its name is followed by brackets, as in `print(ratio)`.
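To make this concrete, here is a small sketch that calls a few of Python's built-in functions on the surgeon numbers used earlier. The choice of `round`, `max` and `len` is only for illustration; many other built-in functions exist.

```python
gartnavel = 14
total = 67
ratio = gartnavel / total

print(round(ratio, 2))         #0.21: round() shortens the ratio to two decimals
print(max(gartnavel, total))   #67: max() picks the larger value
print(len("Gartnavel"))        #9: len() counts the characters in a string
```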




#### Functions

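There are two ways of signalling to the function which piece of information is which:
* By position: the function expects the information in a particular order
* By name: using keyword arguments (often called "kwargs")

Many functions also have default values for some of their arguments, which are used whenever you leave those arguments out.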
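The sketch below illustrates these ideas with the built-in `round` function, whose second argument is named `ndigits`. The variable names on the left are made up for the example.

```python
ratio = 0.208955223880597

by_position = round(ratio, 3)           #Arguments are matched by their order
by_keyword = round(ratio, ndigits=3)    #The second argument is matched by its name
with_default = round(ratio)             #ndigits is left out, so the default is used

print(by_position)    #0.209
print(by_keyword)     #0.209
print(with_default)   #0
```

Keyword arguments take a little more typing, but they make it clear to a reader what each value means.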

### Using libraries 

These built-in functions are limited to the most basic operations, however. To do more complex things, like calculating a mean, we have to combine them. How useful it would be if we, i
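As a small sketch of the point about calculating a mean: the built-in functions `sum` and `len` can be combined by hand, or a library can do the combining for us. The `statistics` module from Python's standard library is used here as one example of such a library, and the list of ages is made up for illustration.

```python
import statistics

ages = [18, 17, 17]   #Illustrative values, not real data

#Combining two built-in functions ourselves
mean_by_hand = sum(ages) / len(ages)

#Letting a library function do the same job
mean_from_library = statistics.mean(ages)

print(mean_by_hand)
print(mean_from_library)
```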

### Objects



# A world made up of kinds: Statistical attributes

Seeing the world the way a data scientist sees it requires you to put on a new pair of spectacles. Through these spectacles, any characteristic you are interested in - whether that be height, petal width, loudness, sadness, biological sex or the number of piano tuners in Scotland - is an **attribute**.

Think of an attribute as a set of potential values. Height, width and loudness are physical and quantitative: they are **continuous** attributes that may take infinitely many values, while the number of piano tuners can only be counted in whole numbers (it is **discrete**). Biological sex is not quantitative but rather **categorical**: its values are labels such as "male" or "female".
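One way to connect these kinds of attribute to the Python types introduced earlier is sketched below. The values are invented for illustration, and the mapping is a common convention rather than a strict rule.

```python
height_m = 1.82       #Continuous: stored as a float
piano_tuners = 12     #Discrete: stored as an integer
sex = "female"        #Categorical: stored as a string

print(type(height_m), type(piano_tuners), type(sex))
```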

<img src="question.png" alt="Variables" style="width:10%;height:10%;float:left;"> </p> 

### Look around you and convert the characteristics you see into attributes. Take, for example, age, hometown, ID number, humidity and income. Are they numerical or categorical? Discrete or continuous? What is the difference between categorical and discrete?

# Getting hold of the raw materials: CSV files

Data therefore ultimately consists of attribute values, or variable values as they are often called. For example, you may have a list of the heights of all people in your class, a list of loudness estimates from football matches, a list of members of parliament... All of these are **raw data**, because they list individual observations, each of which has one value out of the possible variable values.

It is common practice to organise these observations such that each row represents one individual observation and every column represents a variable. For example, the first column may be "id". 

Below is an example of how we may organise data when the variables are id, age and physics grade. Note how the first row, called **headings**, *should* contain the variable names.

| ID        | Age           | Physics grade  |
| ------------- |:-------------:| -----:|
| 1456      | 18 | "A" |
| 1457      | 17      |   "C" |
| 1458 | 17      |    "B" |

 </p> 

These lists must come from somewhere. Perhaps it was you who dutifully asked classmates for their ages, or noted down the sex of each Member of Parliament, and put the answers into an Excel sheet. Perhaps administration staff had already put it in a database and all you had to do was to extract it. **Either way, you must convert it into a filetype that Python knows how to read.**

The easiest way to do this is as follows:

1. _Obtain the data as a **CSV** file_. CSV stands for *Comma-Separated Values*, and if you open such a file you will find that, instead of a neat table of columns, the columns are separated by plain commas.
2. _Put it into the directory of your Python script_ 
3. _Have the program read in that file_

For example, say you wish to open the file `traffic_data_glasgow.csv`. If you open this in a program like Notepad, you will find a jumble of characters:

*Year,Count points,Pedal Cycles,MotorCycles,Cars,Buses and Coaches...
2000," 118","   1829"," 4197","                     19197*

Have a closer look and you will find that it is actually a table, and a program has all the information it needs in order to locate the rows and columns.
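To see how a program locates rows and columns, here is a minimal sketch using the `csv` module from Python's standard library, which is one of several ways to read CSV data. To keep the example self-contained, the small table of IDs, ages and physics grades from this section is written out as text inside the program; the names `csv_text` and `rows` are made up for the example.

```python
import csv
import io

#The small table from this section, written as CSV text.
#With a real file you would pass an opened file instead of io.StringIO(csv_text).
csv_text = """ID,Age,Physics grade
1456,18,A
1457,17,C
1458,17,B
"""

reader = csv.DictReader(io.StringIO(csv_text))   #The first row supplies the column names
rows = list(reader)                              #Each remaining row becomes one observation

print(reader.fieldnames)   #The headings
print(rows[0]["Age"])      #The age in the first observation, read in as a string
print(len(rows))           #The number of observations
```

Note that every value arrives as a string, so numbers such as the age need converting with `int()` before doing arithmetic on them.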

Without getting bogged down in the specifics, let's have a look at an example below.

In [2]:
### import

We can do the same thing for max, except in a single row.