# Download and process data in Python (aka how to Google when trying to do something in Python)

In this notebook we will continue to work on our datasets. Our first challenge is how to download some data in Python.

# How to download
I could tell you about various methods for downloading data in our notebooks, but let's go through the process together instead. This is often what you'll be doing when you use Python for your work, particularly if your main work isn't programming.

# Finding an external library

When we looked at why we might use Python, one of the things mentioned was the availability of libraries that help us do lots of things. Someone has probably done what you want to do before, e.g.:
- https://pypi.org/project/pih2o/ 
- https://pypi.org/project/pynamical/

and, slightly more relevant:
- https://pypi.org/project/marc2excel/
- https://pypi.org/project/filehash/

# Google is your friend

Let's google "download files Python". One of the results we get is this [article](https://stackabuse.com/download-files-with-python/). The article mentions three possible options. Having to choose between several options is something we'll often face. Again, Google might help you decide which one is appropriate.

# Choosing between options

Again there are a few ways of deciding between options. One is to use Google again. 

> [urllib vs requests](http://google.com/#q=urllib+vs+requests) 

One of the results we get is a Stack Overflow question. Stack Overflow is a website where you can ask programming questions and people will give answers. This can be a great resource for finding help. Often someone will have already answered the question you have.
> https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-urllib3-and-requests-modul

![Example answer on Stack Overflow](figures/stack.png)

We can see that Stack Overflow allows you to vote for good responses. This helps separate 'good' answers from 'bad' ones but should be treated carefully. Things will also change over time, so it's worth checking the date of answers. It seems that 'Requests' might be a good option for us to use. One other place we can look is the project's GitHub (or GitLab etc.) page.

## Using GitHub to assess code 
We have seen that 'Requests' was a suggested option. Let's take a quick look at its GitHub page.

> https://github.com/psf/requests 

### We can check: 

#### Stars ✨
This is a sort of 'like' system on GitHub. In this case there are lots of stars. We should of course not rely on this too much. A project dealing with something obscure might be very good but not have many stars because it isn't something lots of people will work with.

#### Activity 🤹🏽‍♀️
Is there some recent activity? We can check if there have been updates to the code recently. We shouldn't always worry if there hasn't been much activity. Some code can remain stable for a long time without needing changes.

#### Issues 👾
The issues of a GitHub repository can also be a good indicator of activity in a project. It's not the case that more issues = more problems, but looking at the issues can give you a sense of whether the people using and developing the project are still investing in it.

#### Documentation 📚
It can also be worth checking if a project/library has good/any documentation. This is particularly important when you are getting started. Good documentation might make the difference between being able to get something done or not and should be an important consideration. It doesn't matter much if there is a 'better' or 'quicker' way to do something with another library, if you aren't able to make use of that library because of a lack of documentation. 


# Using Requests 
We've decided to use Requests. How do we use it to download our file?

## Read the docs
- This is usually a good starting point 
- Often docs include tutorials 

## Read a tutorial 
- Tutorials can be very helpful but sometimes only focus on a 'toy' example. 

## Stack Overflow/Google 
- If you get stuck this can be very helpful. 
- Again we need to be careful about choosing a suitable answer
- Sometimes it's tempting to use answers without fully understanding them. This can be okay but it's worth a) trying to understand the answer, b) checking documentation based on the answer to see if you can replicate a solution yourself


# Importing packages

So far we have used functionality included in Python itself. Now we want to import extra functionality. The way we do this in Python is using 'imports'. An import assumes that we already have the package installed. We won't cover how to install packages in Python in these notebooks in great depth. Fortunately Google Colab already has many of the packages we'll need. To import we use ```import package_name```

In [89]:
import requests 
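Because an import only works when the package is installed, it can be handy to check first. The sketch below is one possible way to do that using only the standard library. The helper name ```is_installed``` and the second package name are made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a package with this name."""
    # find_spec returns None when no package with this name can be found.
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))                           # part of Python itself
print(is_installed("a_package_that_does_not_exist"))  # made-up name
```

If the check fails for a package you need, that is the moment to look up how to install it.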

We now have requests imported. Let's look at the Requests documentation and see if it can help us get hold of Frankenstein in a form we can begin to work with. 
https://requests.readthedocs.io/en/master/

In this case looking at the 'quick start' gives us a clue. 

In [114]:
url = 'https://www.gutenberg.org/files/84/84-0.txt'
r = requests.get(url)

Let's see what happens if we print out ```r```

In [115]:
r

<Response [200]>

This is a good start! As you may know, 200 is an [HTTP status code which means OK](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). Let's see if we can get the text. 
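If you are unsure what a status code means, Python's standard library can tell you. This is a small sketch that needs no download. With a real response the number itself is available as ```r.status_code```.

```python
from http import HTTPStatus

# Look up the short description that belongs to a status code.
ok = HTTPStatus(200)
print(ok.phrase)          # OK

not_found = HTTPStatus(404)
print(not_found.phrase)   # Not Found
```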

As another tip, we can use ```?``` in Jupyter to get documentation. Let's see what ```r.text``` will do.

In [99]:
?r.text

Type:        property
String form: <property object at 0x10edbbf48>
Docstring:  
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using
``chardet``.

The encoding of the response content is determined based solely on HTTP
headers, following RFC 2616 to the letter. If you can take advantage of
non-HTTP knowledge to make a better guess at the encoding, you should
set ``r.encoding`` appropriately before accessing this property.
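The docstring's point about encodings can be seen without downloading anything. The sketch below decodes the same bytes in two ways. The variable names are made up, and the bytes are a stand-in for what a server might send. A wrong guess produces stray characters like the ones at the start of the printed book later in this notebook.

```python
# A UTF-8 "byte order mark" followed by some text, as raw bytes.
raw_bytes = "\ufeffFrankenstein".encode("utf-8")

wrong = raw_bytes.decode("latin-1")     # decoded with the wrong encoding
right = raw_bytes.decode("utf-8-sig")   # UTF-8, dropping the byte order mark

print(wrong)   # ï»¿Frankenstein
print(right)   # Frankenstein
```

With a real response, the equivalent step would be to set ```r.encoding``` to a suitable name (probably ```'utf-8-sig'``` for this file, though that is an assumption) before reading ```r.text```.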


In [102]:
r.text;

# Exercise 🎓 
- How can we store this response in a variable?

In [None]:
# Answer here


In [116]:
text = r.text

# Exercise 🎓
- How can we get only the first 250 characters of the text?

In [123]:
# Answer here 


# Saving 

In [127]:
file = open("book.txt", "w")
file.write(r.text)
file.close()
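The cell above opens a file for writing (```"w"```), writes the text, and closes the file again. A common alternative is a ```with``` block, which closes the file for us. The sketch below assumes a short made-up string in place of ```r.text``` and writes into a temporary folder so it leaves nothing behind.

```python
import os
import tempfile

sample_text = "Frankenstein; or, the Modern Prometheus"  # stand-in for r.text

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "book.txt")

    # The file is closed automatically at the end of each 'with' block.
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample_text)

    # Read the file back to check that the text was saved.
    with open(path, encoding="utf-8") as file:
        saved_text = file.read()

print(saved_text == sample_text)   # True
```

Passing ```encoding="utf-8"``` is a choice made for this sketch. It makes the result the same on every operating system.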

# For loops 
For loops are an important concept in many programming languages. Using for loops will start to move us away from doing one thing at a time to doing a series of things. 

For loops execute a command for each item in a series. The basic idea is something like

```python
for thing in things:
    do something to thing
```

In a for loop the first line must end in a ```:``` and the lines below it that belong to the loop must be indented.

Let's look at a proper example:

In [143]:
numbers = [1,2,3,4,5]
for number in numbers:
    print(number)

1
2
3
4
5


What happens if we do the same thing with a string variable?

In [144]:
name = 'Harry Potter'
for character in name:
    print(character)

H
a
r
r
y
 
P
o
t
t
e
r


## For loop parts 
A for loop is made up of a collection, a loop variable, and a body.

- The collection is the thing being looped over. This could be many different things: a string, a list, a collection of files. 
- The body says what we do for each item in the collection 
- The loop variable is the current item in the collection 
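To make the three parts concrete, here is a small annotated sketch. The list ```titles``` and the variable ```total_characters``` are made up for illustration.

```python
titles = ["Frankenstein", "Dracula", "Emma"]   # the collection

total_characters = 0
for title in titles:                 # 'title' is the loop variable
    # The body: these indented lines run once for every item.
    print(title)
    total_characters += len(title)

print(total_characters)   # 23
```

The last line is not indented, so it is not part of the body and only runs once, after the loop has finished.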

### Loop variables 
You will have seen that in the first loop we have for **number** and in the second we have for **character**. These particular words don't change the behaviour of the loop but they help indicate what items are in the collection we are looping over. You want to try and use something meaningful that helps you and others remember what type of thing you are looping over. To prove this, we can use another loop variable name:

In [147]:
numbers = [1,2,3,4,5]
for puppy in numbers:
    print(puppy)

1
2
3
4
5


This works even though there are no puppies in numbers. In this case 'number' is more useful for other people to understand what it is you are doing. 

# Exercise 🎓
Print out each of the names in the list below

In [148]:
names = ['Berthold', 'Denys', 'Erno']
# your code here 


# Exercise 🎓
- Can we write some code that prints out the characters in each of these names?

In [197]:
# Your code here


# Back to Frankenstein 
Now that we know how to loop and we have a text file downloaded, let's see if we can use what we have learned so far to do some basic counting.

# Opening a file in Python 
We already saw above that we can use ```open()``` to open a file. Let's open the file we saved earlier. To make sure we all have the file we'll repeat the steps here again. 

In [171]:
url = 'https://www.gutenberg.org/files/84/84-0.txt'
r = requests.get(url)
file = open('book.txt','w')
file.write(r.text)
file.close()

In [183]:
file = open('book.txt')

Again we can use tab completion to see what options we have. Type ```file.``` and press Tab:

In [None]:
file.

In [184]:
text = file.readlines()

```readlines()``` reads the whole file and returns a list with one item per line. 
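To see what ```readlines()``` gives back, here is a sketch that uses ```io.StringIO``` as a stand-in for an open text file, so no file on disk is needed. The three example lines are made up.

```python
import io

# io.StringIO behaves like an open text file.
fake_file = io.StringIO("first line\nsecond line\nthird line\n")

lines = fake_file.readlines()
print(len(lines))        # 3
print(repr(lines[0]))    # 'first line\n'

# The file has now been read to the end, so a second call finds nothing.
second_read = fake_file.readlines()
print(second_read)       # []
```

Each item keeps its newline character (```\n```) at the end, which matters when we print the lines.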

At the moment 'text' contains the whole book. To make things a bit easier we can slice off the first 50 lines. 

In [185]:
text_50 = text[:50]

We can check what happens when we loop through this. 

In [188]:
for line in text_50:
    print(line)

ï»¿

Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.net





Title: Frankenstein

       or The Modern Prometheus



Author: Mary Wollstonecraft (Godwin) Shelley



Release Date: June 17, 2008 [EBook #84]

Last updated: January 13, 2018



Language: English



Character set encoding: UTF-8



*** START OF THIS PROJECT GUTENBERG EBOOK FRANKENSTEIN ***









Produced by Judith Boss, Christy Phillips, Lynn Hanninen,

and David Meltzer. HTML version by Al Haines.

Further corrections by Menno de Leeuw.







Frankenstein;





or, the Modern Prometheus









by





Mary Wollstonecraft (Godwin) Shelley
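The blank line after every line of output above comes from two newlines: each line read from the file already ends in ```\n```, and ```print``` adds another. The sketch below shows two possible ways around this, using a short made-up list in place of ```text_50```.

```python
lines = ["Title: Frankenstein\n", "\n", "Language: English\n"]  # stand-in for text_50

first_line = lines[0]
print(repr(first_line))   # repr shows the hidden newline character

# Option 1: remove the trailing newline from the line.
stripped = first_line.rstrip("\n")
print(repr(stripped))

# Option 2: tell print not to add its own newline.
for line in lines:
    print(line, end="")
```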















# Conditionals 

Another very important concept in programming is conditionals. A conditional says whether to do something or not. 
The general syntax is:

```python
if something is true:
    do something
```

Let's look at a proper example:

In [190]:
number = 20
if number > 10:
    print(number)

20


If we do the same thing with a number lower than 10:

In [191]:
number = 1
if number > 10:
    print(number)

This time we don't get any output because the condition isn't met. Let's see another example:

In [194]:
name = 'Harry' 
if 'a' in name:
    print('a is in name')

a is in name


We haven't seen ```in``` before but we can probably guess what it does. 
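In case the guess needs checking, here is a short sketch of how ```in``` behaves with strings and lists. The example values are made up.

```python
name = "Harry"

has_a = "a" in name        # True: the letter appears in the string
has_z = "z" in name        # False
has_arr = "arr" in name    # True: longer pieces of a string count too
print(has_a, has_z, has_arr)

# A string can match inside a longer word.
inside_word = "the" in "other"
print(inside_word)         # True

# With a list, 'in' looks for a whole item, not a piece of one.
names = ["Harry", "Ron"]
whole_item = "Harry" in names   # True
part_of_item = "Har" in names   # False
print(whole_item, part_of_item)
```

The ```'the' in 'other'``` case is worth remembering when searching text for a particular word.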

# Using a conditional in a loop 
Conditionals are often used inside a loop. This allows you to decide whether to do something with/to an item in your collection depending on whether a condition is met. Let's look at an example. This mainly uses things we have already seen:

In [195]:
total = 0 
for line in text_50:
    if 'the' in line:
        total += 1
print(total)

5


# Exercise 🎓
What does this code do? 

In [209]:
# Add comments for each line 
total = 0  #
for line in text_50: #
    if 'the' in line: #
        total += 1 #
print(total) #

5


# 🏁 Fin 
