# The things in memory: Variables

### Real-life variables  
<img src="variable.png" alt="Variables" style="width:50%;height:50%;float:right;"> 
<p> <b> Chances are that you have already done some data processing at some point in your life. </b>
    
 You may have calculated how much money you spent on a holiday, or made notches in a door frame to track your own growth spurt. </p> 

    
<p> Suppose that you wish to count the number of female surgeons in Glasgow using pen and paper. </p>

<p> Most likely you would try to make this problem more manageable by going through the surgeons one hospital at a time.</p> 

<p> Then you would make a record of the number of female surgeons in that hospital, for example <b>Gartnavel Hospital = 14</b>, before moving on. <b>It serves almost as a resting point: by storing the number externally, you are relieved from having to keep every number in your own memory.</b> </p>

<p> Eventually, you would add the hospital totals together into one grand total.</p>

### Programming variables

<p>This idea of temporarily storing things before processing them further is central in programming. </p>
    
<p> When programming with Python, whenever we bring a new value into play and store it in memory, we decide on a label for it.<img src="bookwork.png" alt="Variables" style="width:50%;height:50%;float:right;"> </p> 

<p> This label is called a <em>variable</em>. <b>Gartnavel Hospital</b> is an example of this. If a new female surgeon is hired tomorrow, we will have to revise our value to 15. Program variables can similarly be given new values.</p>

<p>The rule for making a record is &lt;variable name&gt; = &lt;value&gt;. Variable names can only contain letters, numbers and underscores, cannot begin with a number, and by convention start with a lowercase letter. Therefore we may write <b>gartnavel = 14</b>.  </p>
    
#### Execute the statement below to assign value 14 to the variable **gartnavel**:  

In [None]:
gartnavel = 14
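
Since program variables can be given new values, here is a minimal sketch of the surgeon count being revised. The hiring events are hypothetical and only serve to show that a label can be pointed at a new value:

```python
gartnavel = 14              # the first record we made
print(gartnavel)            # prints 14

gartnavel = 15              # a new surgeon is hired, so we store a new value under the same label
print(gartnavel)            # prints 15

gartnavel = gartnavel + 1   # the new value may also be worked out from the old one
print(gartnavel)            # prints 16
```

Only the most recent value is kept: once we assign 15, the 14 is no longer stored under this label.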

<img src="question.png" alt="Variables" style="width:10%;height:10%;float:left;"> </p> 
### Which of the following are valid variable names:  Gartnavel Hospital, gartnavel2018, g@rtnavel, gartnavel_hospital, 2018gartnavel?


### Variable types
For Python to handle values appropriately, you need to make sure each one is of the right type. Python distinguishes between strings, like names, and numbers, like ages. <p>**Strings** consist of characters enclosed in quotation marks, like **name = "Gartnavel"**.</p> <p> A number needs no quotation marks, but must contain nothing other than the number itself, like **age = 45.5**. Python distinguishes between whole numbers (integers) and decimal numbers (floats, known as "doubles" in some other languages) by the presence of a decimal point. </p> <p> A logical value (a "boolean") is either True or False. Do remember the capital first letter. </p>

In [3]:
name = "Gartnavel"      #String
age = 45                #Integer
age = 45.5              #Float (decimal number); this replaces the previous value of age
is_female = True        #Boolean
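
If you are ever unsure which type a value has, Python's built-in type() function will tell you. A small sketch, where age_whole and age_decimal are names made up for this illustration so that both numbers can be kept side by side:

```python
name = "Gartnavel"
age_whole = 45
age_decimal = 45.5
is_female = True

print(type(name))         # <class 'str'>
print(type(age_whole))    # <class 'int'>
print(type(age_decimal))  # <class 'float'>
print(type(is_female))    # <class 'bool'>
```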

# Doing things to variables: Computation

Giving a value a label is known as an **assignment**. We don't have to explicitly specify a value: we could also give the label the result of some other processing.

For example, Python can serve you well as a calculator. You have all the usual operators at your service: the right side of **result = 2*3 + 5*4** is 26. The asterisk * is used for multiplication and the forward slash / for division (the percentage sign % gives the remainder of a division instead).

When we want to write a comment rather than executable code, we prefix it with a #. To print a value to the output, enclose it in **print()**.
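
A short sketch of Python as a calculator. The numbers 7 and 2 are arbitrary and only there to show the difference between the division and remainder operators:

```python
result = 2*3 + 5*4      # multiplication is carried out before addition
print(result)           # prints 26

quotient = 7 / 2        # forward slash: division
print(quotient)         # prints 3.5

remainder = 7 % 2       # percentage sign: the remainder left over after division
print(remainder)        # prints 1
```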

<img src="calc.png" alt="Variables" style="width:50%;height:50%;"> </p> 

#### By now we have enough knowledge to use Python as a regular calculator, except with variables instead of raw numbers. Run the following code to calculate the average number of tenants per landlord in Fife: 

In [4]:
nr_of_landlords = 57011

nr_of_tenants = 90982

avg_t_per_l = nr_of_tenants / nr_of_landlords #Note! The variable avg_t_per_l is a number

print("Average number of tenants per landlord is " + str(avg_t_per_l) ) 
#To join it to the text with +, we first need to convert it to a string

Average number of tenants per landlord is 1.5958674641735806
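
The long tail of decimals is rarely needed. As one possible refinement, the built-in round() function can trim the result before it is converted to a string. Keeping two decimal places is an arbitrary choice made for this sketch:

```python
nr_of_landlords = 57011
nr_of_tenants = 90982

avg_t_per_l = nr_of_tenants / nr_of_landlords

avg_rounded = round(avg_t_per_l, 2)   # keep at most two decimal places

print("Average number of tenants per landlord is " + str(avg_rounded))
```

Here the rounded value is 1.6, because Python does not print the trailing zero of 1.60.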



# A world made up of kinds: Statistical attributes

Seeing the world the way a data scientist sees it requires you to put on a new pair of spectacles. Through these spectacles, any characteristic you are interested in - whether that be height, petal width, loudness, sadness, biological sex or the number of piano tuners in Scotland - is an **attribute**.

Think of an attribute as a set of potential values. Height, width and loudness are physical and quantitative: they are **continuous** attributes that may take infinitely many values, while the number of piano tuners can only be counted in whole numbers (it is **discrete**). Biological sex is not quantitative but rather **categorical** - either "male" or "female". 

<img src="question.png" alt="Variables" style="width:10%;height:10%;float:left;"> </p> 
### Look around you and convert the characteristics you see into attributes. Take, for example, age, hometown, ID number, humidity and income. Are they numerical or categorical? Discrete or continuous? What is the difference between categorical and discrete?

# Getting hold of the raw materials: CSV files

Data therefore ultimately consists of the values of attributes (also called variables). For example, you may have a list of the heights of all people in your class, a list of loudness estimates from football matches, a list of members of parliament... All of these are **raw data**, because they list individual observations, each of which takes one value out of the possible values of the variable.

It is common practice to organise these observations such that each row represents one individual observation and every column represents a variable. For example, the first column may be "id". 

Below is an example of how we may organise data when the variables are ID, age and physics grade. Note how the first row, called the **headings**, *should* contain the variable names.

| ID        | Age           | Physics grade  |
| ------------- |:-------------:| -----:|
| 1456      | 18 | "A" |
| 1457      | 17      |   "C" |
| 1458 | 17      |    "B" |

<img src="question.png" alt="Variables" style="width:10%;height:10%;float:left;"> </p> 
### Why do you think we put quotation marks around the grades? Was it necessary to make IDs numerical?

<img src="import.png" alt="csv" style="width:50%;height:50%;"> </p> 

These lists must come from somewhere. Perhaps it was you who dutifully asked classmates for their ages, or noted down the sex of each Member of Parliament, and put the answers into an Excel sheet. Perhaps administration staff had already put the data in a database and all you had to do was extract it. **Either way, you must convert it into a file type that Python knows how to read.**

The easiest way to do this is as follows:

1. _Obtain the data as a **CSV** file_. CSV stands for *Comma-Separated Values*, and if you open such a file you will find that, instead of a neat table, the columns are marked simply by commas.
2. _Put it into the directory of your Python script_ 
3. _Have the program read in that file_

For example, say you wish to open the file traffic_data_glasgow.csv. If you open this in a program like Notepad, you will find what looks like a jumble of characters:

*Year,Count points,Pedal Cycles,MotorCycles,Cars,Buses and Coaches...
2000," 118","   1829"," 4197","                     19197*

Have a closer look and you will find that it is actually a table, and a program has all the information it needs in order to locate the rows and columns.
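
To see how a program might locate the rows and columns, here is a minimal sketch using Python's built-in csv library on a small piece of text. The text reuses the ID, age and physics grade table from above rather than the traffic file, and csv_text, rows, headings and observations are names chosen for this illustration:

```python
import csv
import io

# The table from earlier, written the way it would look inside a CSV file
csv_text = 'ID,Age,Physics grade\n1456,18,"A"\n1457,17,"C"\n1458,17,"B"\n'

# csv.reader splits every line at the commas and strips the quotation marks
rows = list(csv.reader(io.StringIO(csv_text)))

headings = rows[0]        # first row: the variable names
observations = rows[1:]   # remaining rows: one observation each

print(headings)
for observation in observations:
    print(observation)
```

Note that every value comes back as a string, even the numbers.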

Without getting bogged down in the specifics, let's have a look at an example below.

In [2]:
import pandas as pd  #Import a library full of relevant code that will help you later on

df = pd.read_csv("traffic_data_glasgow.csv", sep=',')


# With a little help from others: Libraries and functions

Let's parse the above program. The file containing the data we wish to read in is *traffic_data_glasgow.csv* and it is indeed a csv, as shown by the (.csv) extension. 

The code that we use to read the file is actually very complex and is stored in other files, collectively called a **library**. We can make use of it by simply loading the library (called pandas) and *calling* its **function** *read_csv( )* (that is, running some of its code), into which we pass various specifications, such as the file name and the separator symbol.
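
The same pattern of loading a library and then calling one of its functions applies well beyond pandas. A minimal sketch using statistics, a library that ships with Python, with the three ages from the example table as input:

```python
import statistics                         # load the library

ages = [18, 17, 17]

middle_age = statistics.median(ages)      # call its function median(), passing in our list
print(middle_age)                         # prints 17
```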

Try it yourself! Read a CSV file into Python - once through the portal, all kinds of magic can be done to it!

# Playing with tables in Python: Dataframes

We create a variable into which we store all the data we read in. How is it stored? Well, it's not quite a string or a number (the types of Python variables we have dealt with so far) but a Python **object** called a **dataframe**.

A dataframe is, quite simply, a table. We can access its values, change its values, summarise its values. 

For starters, let us see which headers it has.

In the case you just read in we have a huge dataframe with 10 000+ rows and many columns, and we wouldn't want to print it all, as we might end up feeling lost in all those numbers. 

What we could do instead is see what the columns are, then perhaps see the first few rows. 

We are going to use the following: 
* df.columns: a property (attribute) that gives us the names of the columns

* df.head(): a function that gives us the first five rows

Here *df* is the name we gave our dataframe (this might be something else), *.columns* is a property of the dataframe (thus no parentheses after it), and *.head()* is a **function** (thus the parentheses).

These little magical expressions each return a value: the column names and a smaller dataframe, respectively. Unless we wrap them in a print statement, those values may not be visible in our console. Have a go at it below.


In [None]:
#getting the columns of the dataframe
print(df.columns)

This property is very useful when we haven't seen the data file beforehand. Perhaps we just downloaded it from the Internet and have no insight into its structure yet, or perhaps we can't remember the attribute (column) names. To remind yourself, print them out, and do remember that names are case-sensitive. "Cars" is *not* the same as "cars".

In [None]:
print(df.head())
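
To try these out without the traffic file, a dataframe can also be built by hand. The sketch below assumes pandas is installed and reuses the small ID, age and physics grade table from earlier; grades, column_names and first_rows are names made up for this illustration:

```python
import pandas as pd

# A tiny hand-made dataframe holding the example table (ID, Age, Physics grade)
grades = pd.DataFrame({
    "ID": [1456, 1457, 1458],
    "Age": [18, 17, 17],
    "Physics grade": ["A", "C", "B"],
})

column_names = list(grades.columns)   # property: no parentheses
first_rows = grades.head(2)           # function: parentheses, here asking for two rows

print(column_names)
print(first_rows)
```

Passing a number to head(), as in head(2), asks for that many rows instead of the default five.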

### Exploring the attributes 
We may also want to show all the values in a particular column, for example all the ages in the Age variable. To do this, we simply need to plug the column name into a pair of square brackets and join that to the dataframe. So for example, we could write *print(df["Cars"])*. Here too we use quotation marks around the column name, since it is a string.

In [3]:
print(df["Cars"])

0      935727
1      950044
2      975271
3      963936
4     1006914
5     1004468
6     1010355
7     1015887
8     1023497
9     1044660
10    1028489
11    1030061
12    1090425
13    1121681
14    1121856
15    1108334
16    1135713
17    1132339
Name: Cars, dtype: int64


The list of values in an attribute that we just obtained has statistical properties just like any list. For instance, it has a minimum and a maximum value. To find them, we chain another function on top, named *.min()* or *.max()*.



In [None]:
count_points = df['Count points']

count_points_min = count_points.min()

print(count_points_min)

We can do the same thing for max, except in a single line.

In [None]:
print(df['Count points'].max())

We printed out two values, but those numbers don't really mean much on their own. That's why it's useful to add meaningful print messages. To do this, we need to cast those numbers to strings so that we can concatenate them with some meaningful text. 

Also, let's chain the operations into one.

In [None]:
print(str(df['Count points'].min()) + " is the min number of count points" )
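
For comparison, a plain Python list offers the same summaries through the built-in functions min() and max(). The sketch below uses only the first three values of the Cars column printed earlier, so its results describe those three values and not the whole column:

```python
cars = [935727, 950044, 975271]   # first three values of the Cars column

cars_min = min(cars)
cars_max = max(cars)

print(str(cars_min) + " is the min of these three values")
print(str(cars_max) + " is the max of these three values")
```

Note the difference in form: with a list we write min(cars), while with a dataframe column we chain .min() onto the end.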

# Your turn:
Based on the concepts covered so far, complete the following exercises. Skeleton code is provided below.

1. Retrieve the column where the car statistics are stored
2. What's the minimum number of cars recorded over the years? 
3. What's the maximum number of cars recorded? 



In [4]:
#create a variable in which to store the column with header "Cars"
#(replace LABEL with the column name, and FUNCTION with the function you need)
carsColumn = df['LABEL']

#variable to store the min of cars
carsMin = df['LABEL'].FUNCTION()
print()   #add a meaningful message here

#variable to store the max of cars
carsMax = df['LABEL'].FUNCTION()
print()   #add a meaningful message here