# Week 7: 4.0 Working with Data

![image.png](attachment:image.png)


## 4.001 Useful references and an introduction to the topic
In this topic we will do some reading and practical activities first, then switch to lots of video content in the second week. I want to give you a chance to practise working with both code and documentation. `The two labs` for this week build on your existing knowledge and apply it in what might be, for some of you, a new domain:

* Lab one applies many of the techniques you `already know` (iteration, loops, conditions etc.). We will look at these in the context of some `basic data handling`: grabbing some data from somewhere and manipulating it for a purpose.
* Lab two will focus on a `powerful library` that we will use to do tasks we have already had some practice with. We have dealt with `lists and other types of data structures before`. As we move towards the midway point of the course we want to encourage you to `use libraries more and more` - even where they might be novel or unfamiliar to you.


These are the reading resources that you should make use of while studying this topic. We will look at lots of practical examples in the lectures, but as with everything in Python there are lots of options and libraries available to us. We have provided these to you as an accessible resource for the topic - but they are not compulsory reading. You should get used to handling documentation to solve tasks - it is a really important skill as a programmer. You will often be working on unfamiliar libraries and being able to jump in and figure things out is an invaluable skill!

* Python ‘14.1. csv – CSV file reading and writing’, Python language documentation – 14. File formats 
* Python ‘pandas.read_csv’, pandas 0.22.0 documentation – API reference 
* Python ‘pandas.DataFrame.to_csv’, pandas 0.22.0 documentation – API reference – pandas.DataFrame 
* Python ‘pandas.read_json’, pandas 0.22.0 documentation – API reference 

If any of the links are broken, let us know via the Student Portal. 

## 4.002 Exploring the CSV data format

What is CSV? 

![image.png](attachment:image.png)
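
As a minimal sketch of the format: a CSV (comma-separated values) file is plain text in which each line is one record and commas separate the fields, usually with a header row on the first line. The sample text and its column names below are made up for illustration; they are not taken from Schools.csv.

```python
import csv
import io

# Hypothetical CSV text: a header row followed by two records.
sample_text = "school,borough,no2\nOak Primary,Camden,41.5\nElm High,Hackney,38.2\n"

# csv.reader accepts any iterable of lines, so an in-memory file works too.
reader = csv.reader(io.StringIO(sample_text), delimiter=",")
rows = list(reader)

header = rows[0]    # the column names
records = rows[1:]  # every field is read as a string, including the numbers

print(header)
print(records)
```

Note that the numeric field arrives as the string `'41.5'`, which is why the values have to be converted with `float` before doing any arithmetic.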

# Individual Activity 1 (15 minutes)

## Finding the average, max and min of nitrogen dioxide (NO2) in UK schools

* Try the above using Schools.csv
* Fill out the blank (_________)

##### Nitrogen dioxide causes a range of harmful effects on the lungs.

In the Air Quality Directive (2008/50/EC) the EU has set two limit values for nitrogen dioxide (NO2) for the protection of human health: the NO2 hourly mean value may not exceed 200 micrograms per cubic metre (µg/m3) more than 18 times in a year, and the NO2 annual mean value may not exceed 40 micrograms per cubic metre (µg/m3).

![image-2.png](attachment:image-2.png)

### The 8th column (index 7)

In [5]:
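import csv

# Collect the 8th field (index 7) of every row, header included.
with open("Schools.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    lst = []
    for row in reader:  # the reader yields one row (a list of strings) at a time
        lst.append(row[7])
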

In [6]:
del lst[0]  # drop the header text so that only the readings remain

In [7]:
lst=[float(i) for i in lst]

In [8]:
sum(lst)/len(lst)

36.22726157620361

In [9]:
max(lst)

73.1

In [13]:
min(lst)

22.4
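
The cells above can be folded into one small function. This is only a sketch: `column_stats` is a hypothetical helper name, it assumes the first row is a header, and the demo data is made up rather than taken from Schools.csv.

```python
import csv
import io

def column_stats(lines, index):
    """Return (mean, maximum, minimum) of one numeric CSV column.

    `lines` is any iterable of text lines, for example an open file.
    The first row is assumed to be a header and is skipped.
    """
    reader = csv.reader(lines, delimiter=",")
    next(reader)  # skip the header row
    values = [float(row[index]) for row in reader]
    return sum(values) / len(values), max(values), min(values)

# Made-up data for illustration (not the real Schools.csv values).
demo = io.StringIO("school,no2\nA,40.0\nB,30.0\nC,20.0\n")
mean, highest, lowest = column_stats(demo, index=1)
print(mean, highest, lowest)  # 30.0 40.0 20.0
```

For the activity file, the same function could be called as `column_stats(open_file, index=7)` inside a `with open("Schools.csv", newline="") as open_file:` block.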

## 4.004 Alternative libraries
Now that we have seen an example of how to do things in standard Python, let us look at a library that improves 
<br>`efficiency` both in terms of `performance` and in terms of `our coding style`.

https://numpy.org/doc/stable/reference/generated/numpy.mean.html

Not only is NumPy a `faster alternative`, it also provides a far more `comprehensive set` of tools than the standard operators in Python.

Read the documentation examples for the following functions: `std, var, nanmean, nanstd, nanvar`.

We will be using NumPy in the labs to try and do some things, so you might want to spend a little longer in this reading to prepare yourself.

Tip: NumPy can also handle other variations of data e.g. string operations.

In [35]:
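import csv

# Collect the 8th field (index 7) of every row, header included.
with open('Schools.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    lst = []
    for row in reader:  # the reader yields one row (a list of strings) at a time
        lst.append(row[7])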


In [36]:
lst=lst[1:]

In [16]:
lst=[float(i) for i in lst]

In [17]:
import numpy as np
myarray=np.array(lst)

In [18]:
myarray

array([73.1, 68.2, 67.8, ..., 22.8, 22.5, 22.4])

In [31]:
np.std(myarray)

7.011575972786973

In [None]:
np.nanstd(myarray) 

In [32]:
np.var(myarray)

49.162197622163596

In [41]:
np.nanvar(myarray)

49.162197622163596

In [42]:
np.mean(myarray)

36.22726157620362

In [43]:
np.nanmean(myarray)

36.22726157620362

### By the way, what are std, var and mean? And nanstd, nanvar and nanmean?
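
One way to answer this is to compute the three statistics by hand and compare them with NumPy. The sketch below uses made-up values. The mean is the arithmetic average, the variance is the average squared distance from the mean, and the standard deviation is the square root of the variance. By default `np.var` and `np.std` divide by the number of values (the population form). The `nan` variants do the same calculation but skip any NaN entries.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

mean = values.sum() / values.size                   # arithmetic average
var = ((values - mean) ** 2).sum() / values.size    # mean squared deviation
std = var ** 0.5                                    # square root of the variance

print(mean, var, std)
print(np.mean(values), np.var(values), np.std(values))

# Append a missing value: the plain functions return nan,
# while the nan-aware versions ignore the missing entry.
with_gap = np.append(values, np.nan)
print(np.mean(with_gap), np.nanmean(with_gap))
```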

## 4.005 The NaN type

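NaN stands for "not a number". In Python and NumPy it is a special floating-point value (`float('nan')` or `np.nan`), not a separate type, and it is commonly used to mark missing or undefined numeric data.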


Why might we want to ignore NaN values?<br>
1. Python has corrupted the data on import<br>
2. `They do not add meaning to our analysis` <br>


## Be careful with NaN: remove it or use an appropriate function

In [13]:
import numpy as np

In [14]:
L=[3,4,5,np.nan]

In [15]:
L

[3, 4, 5, nan]

In [16]:
np.array(L)

array([ 3.,  4.,  5., nan])

In [17]:
np.mean(L)

nan

In [18]:
np.nanmean(L)

4.0
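
The other option, removing the NaN values before calculating, can be sketched with a boolean mask. The variable names are illustrative. An equality test cannot be used to find NaN because NaN is not equal to anything, including itself, so `np.isnan` is used instead.

```python
import numpy as np

data = np.array([3.0, 4.0, 5.0, np.nan])

# NaN is an ordinary float value that never compares equal to itself.
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False

# Build a boolean mask of the NaN positions, then keep everything else.
mask = np.isnan(data)
clean = data[~mask]

print(clean)
print(clean.mean())      # 4.0
```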

## 4.006 Distinctions between data and information

Information and data are the same thing.<br>
1. True
2. `False`

Data is raw and unprocessed. Information is data that has been processed so that it carries meaning.


Question 2
This is a chance to practise an exam-style, open-ended question.
What do you think the differences between data, information and knowledge are?



![image.png](attachment:image.png)

https://www.cambridgeinternational.org/Images/285017-data-information-and-knowledge.pdf

## 4.007 Open data resources


Lots of institutions publish "open data", i.e. data that is designed to be readily accessible to all. In many cases, the funding attached to research projects now requires this as a prerequisite. I regularly use the UK Government's open data repository search engine to find data to work with.

https://data.gov.uk/

We do not need to be data scientists to look at data and make sense of what is contained within. In fact, all of us have probably explored datasets before, in software like Excel and in databases. Here are your tasks for this activity:

* Find some open data resources relating to health.
* Identify the file types and how they are structured.
* See if you can find out some headlines about the data by looking at it in a spreadsheet program.
* Post your findings in the discussion forum and comment on at least two posts of your peers.

Participation is optional

## 4.008 Working with data

Question 1
Why might we use a different name for a data file that we are processing? Hint: Think about maintaining the integrity of the original file.<br>
1. It is not possible to overwrite an existing file so it needs a different name
2. `To stop the program from overwriting the original data file`
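
A minimal sketch of that idea: read from the original file and write the processed rows to a file with a different name, so the original stays intact. The file names and data are hypothetical, and a temporary folder is used so that no real file is touched.

```python
import csv
import os
import tempfile

# Work in a temporary folder so the sketch does not touch real files.
folder = tempfile.mkdtemp()
original_path = os.path.join(folder, "readings.csv")         # hypothetical name
processed_path = os.path.join(folder, "readings_clean.csv")  # hypothetical name

# Create a small made-up original file with one missing reading.
with open(original_path, "w", newline="") as f:
    f.write("site,no2\nA,40.0\nB,\nC,20.0\n")

# Read from the original, write the cleaned rows to a different file.
with open(original_path, newline="") as src, open(processed_path, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if row[1] != "":  # drop rows with a missing reading
            writer.writerow(row)

with open(original_path) as f:
    original_text = f.read()
with open(processed_path) as f:
    processed_text = f.read()

print(original_text)   # unchanged, still contains the row for site B
print(processed_text)  # cleaned copy without the row for site B
```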

Question 2
What is the `correct name for the programming technique` used here: 
data = [d[0] for d in data]
1. `A list comprehension`
2. A for loop
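
To see what the comprehension is doing, it can be compared with the equivalent explicit loop. The rows below are made up for illustration.

```python
data = [["a", 1], ["b", 2], ["c", 3]]  # made-up rows

# Explicit for loop: build the list one append at a time.
firsts_loop = []
for d in data:
    firsts_loop.append(d[0])

# Equivalent list comprehension: the same result in a single expression.
firsts_comp = [d[0] for d in data]

print(firsts_loop == firsts_comp)  # True
```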

Question 3
What is the `mean` of the values in data?
<br>x = [2,4,6,8,10]
<br>data = [v*2 for v in x]


```python
import numpy
numpy.array(data).mean()
```

In [1]:
x = [2,4,6,8,10]

data = [v*2 for v in x]

In [3]:
import numpy
numpy.array(data).mean()

12.0

## 4.009 DataFrame

## Click <a href='Week4C_B4.009_DataFrame.ipynb'> Here </a>

## 4.010 Share your Work
Go back to the previous course activity: `4.009 DataFrame`

Post a one-sentence summary of your data in the discussion forum.

You can also share your lab work on the discussion forum:

Use the "Generate/Update shared link" button to generate a read-only version of your lab.

Copy and paste the generated link to another browser tab and make sure it works.

Post your generated link in the discussion forum so that other students can review your work and comment on at least one other post.


## 4.011 Lists and Arrays

## Click <a href='Week4D_B_4.012ListsArray.ipynb'> Here </a>

## 4.012 Share your Work
Go back to the previous course activity: 4.011 Lists and Arrays

Can you do `some basic statistics` on it with NumPy? 

Post a one-sentence summary of your data (from the basic statistics) in the discussion forum.
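
As a rough sketch of how such a summary might be produced, the example below runs a few NumPy statistics over stand-in values (a handful of readings copied from the array printed earlier in this topic, not your lab data) and builds a one-sentence summary from them.

```python
import numpy as np

# Stand-in values; replace these with the data from your own lab.
readings = np.array([73.1, 68.2, 67.8, 22.8, 22.5, 22.4])

low = np.min(readings)
high = np.max(readings)
middle = np.median(readings)
average = np.mean(readings)

summary = (
    f"Readings range from {low:.1f} to {high:.1f}, "
    f"with a median of {middle:.1f} and a mean of {average:.1f}."
)
print(summary)
```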

You can also share your lab work on the discussion forum:

Use the "Generate/Update shared link" button to generate a read-only version of your lab.

Copy and paste the generated link to another browser tab and make sure it works.

Post your generated link in the discussion forum so that other students can review your work and comment on at least one other post.


Participation is optional

# Possible Solution

In [2]:
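import csv

# Collect the 8th field (index 7) of every row, header included.
with open('Schools.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    lst = []
    for row in reader:  # the reader yields one row (a list of strings) at a time
        lst.append(row[7])
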

In [5]:
del lst[0]  # drop the header text so that only the readings remain

In [8]:
lst=[float(i) for i in lst]

In [10]:
sum(lst)/len(lst)

36.22726157620361

In [11]:
max(lst)

73.1

In [12]:
min(lst)

22.4

In [13]:
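import csv

# Collect the 8th field (index 7) of every row, header included.
with open('Schools.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    lst = []
    for row in reader:  # the reader yields one row (a list of strings) at a time
        lst.append(row[7])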

In [18]:
lst=lst[1:]

In [20]:
lst=[float(i) for i in lst]

In [21]:
import numpy as np
np.array(lst)

array([73.1, 68.2, 67.8, ..., 22.8, 22.5, 22.4])