# Introduction to Web Scraping with Python

#### CEMFI Undergraduate Summer Internship 2021

#### Instructor: Cay

# Course Outline

1. **Introduction: Python and Jupyter Notebooks**
1. Web Scraping Example 1
1. Web Scraping Example 2

# Introduction: Python and Jupyter Notebooks

- This section of the course is based on a lecture series from [Quantitative Economics with Python](https://quantecon.org/python-lectures/), an amazing resource for learning Python for economic modelling. 

## What's Python?

[Python](https://www.python.org) is a free and open source general-purpose programming language conceived in 1989.


### Common Uses (General)

Python is used in a wide variety of application domains:
- Web development.  
- Computer generated imagery.
- Game development.  
- Data processing, and much more.  

### Uses in Economics

In economics, it is probably true that Stata, MATLAB, and R are still dominant.

But Python is **beginning** to rise...

It can be used for:

- Economic modelling like in MATLAB or Julia (not this course).
- __Data and empirics:__
  - __Get data: web scraping__ (most of this course), digitalization of maps, etc. 
  - Clean/organize data: Stata also good for that but Python has an edge on unstructured data (e.g. text-heavy).
  - Visualize data: very nice plots!
  - Analyze data: econometrics (as of today, probably it is still easier to use Stata but...)

### Resources to learn Python

General purpose:
- [Beginner's Guide to Python](https://wiki.python.org/moin/BeginnersGuide/)
- [Codecademy](https://www.codecademy.com/catalog/language/python)


For economics:
- [Quantitative Economics with Python](https://quantecon.org/python-lectures/) by Thomas J. Sargent and John Stachurski (I recommend this one!)


## Download options

1. Official website [https://www.python.org/downloads/](https://www.python.org/downloads/)
 
 - Easy to install (click on download, double click the file and you are done).
 - To do **just web scraping** it would probably be good enough.
 - But you may need to further install separate packages to do more complex economic modelling.


2. [Anaconda](https://www.anaconda.com/what-is-anaconda/) (recommended)

 - Free and open-source distribution of Python (and R) for scientific computing.
 - Includes lots of data-science packages. 
 - *Makes our life easier:* automates the process of installing, upgrading, configuring, and removing packages in a consistent manner.

### Installing Anaconda

To install Anaconda, [download](https://www.anaconda.com/download/) the binary and follow the instructions.

**Hopefully**, some of you already have Anaconda installed by now.

### Executing Python code

#### System Terminal

You can run Python directly from the system terminal: type `python` (or `python3` on some systems) to start an interactive session, or `python file.py` to run a script.

#### Integrated development environment (IDE)

- With an IDE you can:
    - Write and execute code from one program.
    - Use code completion, syntax highlighting, debugging tools, and more (depending on the IDE).
<br><br>  
- Examples of IDE:
    - __IDLE:__ comes with the default implementation of the language.
    - __Spyder:__ comes with Anaconda, specifically built for data science, interface familiar to those used to Matlab or R.
    - __PyCharm:__ very popular and provides support for other languages like JavaScript, HTML/CSS, etc (good option for web development).
    - Other options: Atom, Visual Studio Code, nteract, Sublime Text.
<br><br>
- But in our classes we use: [__Jupyter Notebooks__](https://jupyter.org/)

## Jupyter Notebook

__[Jupyter](http://jupyter.org/)__ notebook uses a *browser-based* interface to Python and provides:
- The ability to write and execute Python commands.  
- Formatted output in the browser, including tables, figures, animation, etc.  
- The option to mix in formatted text and mathematical expressions.    

__Jupyter__ is great for:
- Starting to code in Python.  
- Testing new ideas or interacting with small pieces of code.  

__Note:__ for the purposes of this class Jupyter is ideal, but once you have to write longer and more complex code...
- It is probably a good idea to move to one of the IDEs discussed above.
- They are better for organizing complex code and reproducing results.

### Starting the Jupyter Notebook

Once you have installed Anaconda, there are different ways to start a jupyter notebook:

1. Search for the `Anaconda-Navigator` in your applications menu, open it and click on `JupyterLab` (or `Notebook`)

2. Open up a `terminal` (on Mac) or `Anaconda command prompt` (on Windows) and type `jupyter lab` (or `jupyter notebook`)  

### Any questions?

### Notebook Basics

Let’s start with how to edit code and run simple programs.

#### Running Cells

To execute the code in a cell, hit `Shift+Enter` instead of the usual `Enter`.

#### Modal Editing

A particular feature of Jupyter notebooks is that they use a *modal* editing system.

This means that the effect of typing at the keyboard **depends on which mode you are in**.

1. Edit mode
    - Whatever you type appears in that cell.
1. Command mode  
    - Keystrokes are interpreted as commands — for example, typing `b` adds a new cell below  the current one.  

To switch:
- Command mode to edit mode: hit `Enter` or click in a cell.
- Edit mode to command mode: hit the `Esc` key.

#### Cell Type

Cells can be of different types. The two we will use are `Code` and `Markdown`.

1. **Code:** to write Python code.  
1. **Markdown:** to write formatted text like in this cell.  

To switch between types you can either do it using the dropdown list above or, while in `command` mode:

- From Code to Markdown: press `m`
- From Markdown to Code: press `y`

**Note:** the behavior of the Jupyter notebook may look a bit confusing at first but it is very efficient when you get used to it.

#### Working with Python Files

So far we’ve focused on executing Python code typed into a Jupyter notebook cell or stored in a notebook file (`.ipynb`).

Traditionally, most Python code is saved in plain text files with the `.py` extension.

If you come across code saved in a `*.py` file and want to load it into your Jupyter Notebook, run `%load file.py` in a code cell.

The entire contents of the file are loaded into that cell, and the `%load` line itself is commented out.

In [None]:
# %load "sq_function.py"
def squared(x):
    return x**2

In [None]:
# %load "/Users/cayrua/Desktop/USI/sq_function.py"
def squared(x):
    return x**2


**Note:**

If you work on Windows and copy the path to a file, you may get folders separated by `\`, which Python does not read correctly because `\` is the escape character inside strings.

Modify the string with the path so that folders are separated with `/` or `\\`. Alternatively, put an `r` before the opening quote of the *path* to make it a raw string.


- `"C:\Users\cayrua\Desktop\USI\sq_function.py"` **will not work**

But all of the options below will:
- `"C:/Users/cayrua/Desktop/USI/sq_function.py"`
- `"C:\\Users\\cayrua\\Desktop\\USI\\sq_function.py"`
- `r"C:\Users\cayrua\Desktop\USI\sq_function.py"`
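As a small sketch of the point above, the cell below writes the same path in the three safe ways and checks that the escaped and raw versions are identical. The variable names are only for illustration. A path with single backslashes inside a normal string is left out on purpose, since Python would reject `\U` as an invalid escape.

```python
# The same Windows path written in the three safe ways listed above.
forward = "C:/Users/cayrua/Desktop/USI/sq_function.py"
escaped = "C:\\Users\\cayrua\\Desktop\\USI\\sq_function.py"
raw = r"C:\Users\cayrua\Desktop\USI\sq_function.py"

# The escaped and raw versions are the very same string.
print(escaped == raw)

# Why single backslashes are risky: "\t" is read as one tab character,
# while the raw string keeps the backslash and the letter t.
print(len("a\tb"), len(r"a\tb"))
```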

We can also create a `.py` file directly from a cell with the `%%writefile` magic:

In [None]:
%%writefile cube_function.py
def cube(x):
    return x**3

#### More tips on Jupyter Notebook

There are many other tips on working with Jupyter Notebooks. Go to [Quantitative Economics with Python](https://python-programming.quantecon.org/) to learn more.

### Any questions?

# Python Basics

## Data Types

Here we discuss only a few of the standard (built-in) Python data types:

### Numeric

Any representation of data which has a numeric value.

We can check the type of any object in memory using the `type()` function.

In [None]:
# type() returns the class of an object: int for whole numbers, float for decimals
type(1), type(1.5)
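A minimal sketch of how the two numeric types interact, and of converting text to numbers, which comes up often because scraped values arrive as strings. The variable names and values are made up for illustration.

```python
# Integers and floats are different types.
n_pages = 404          # int
price = 12.5           # float

# Mixing an int and a float gives a float.
total = n_pages + price
print(type(total))

# Scraped numbers arrive as text and must be converted before doing arithmetic.
scraped_pages = "404"              # hypothetical text taken from a web page
print(int(scraped_pages) + 1)
print(float("12.5") * 2)
```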

### Boolean

Data with one of two values `True` or `False`. 

In [None]:
x = True

In [None]:
x

In [None]:
type(x)

In [None]:
w = True
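Booleans are rarely typed by hand. More often they come out of a comparison or a membership test. A short sketch, with illustrative variable names:

```python
# Comparisons evaluate to a Boolean.
is_bigger = 3 > 2
print(is_bigger, type(is_bigger))

# "in" checks membership and also returns a Boolean.
has_word = "web" in "web scraping"
print(has_word)

# Booleans combine with and / or / not.
print(is_bigger and not has_word)
```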

### Strings

A sequence of characters enclosed in single or double quotes.

**Note:** strings are usually important in web scraping applications.

**Example:** we can make operations on strings, see below:

In [None]:
x = 'web'          # single quotes
y = "scraping"     # double quotes
z = 4

In [None]:
# concatenate strings with +
print(x + y)

In [None]:
# multiply a string by a number: the string is repeated z times
# in a notebook, the value of the last line of a cell is displayed even without print
x*z

We can also select only parts of a string:

In [None]:
# remember Python starts indices with a zero 
x[0]

In [None]:
# a slice of a string: string_var[a:b] returns the characters from index a to index b-1 (b-a characters)
# leaving out b runs to the end; a negative index counts from the end
y[-2:]

### Lists

An *ordered* collection of one or more data items, not necessarily of the same type.

When we use **square brackets** `[]` after the `=` sign, Python understands we want to create a **list**.

In [None]:
# a list of elements of different types
x = [10, 'web', False]
print(type(x))
print(x)

The first element of `x` is an `integer`, the second is a `string`, and the third is a `Boolean`.

When adding an element to a list, we can use: `list_name.append(element)`

In [None]:
x.append("extraValue")

In [None]:
x

Here `append()` is what’s called a *method*, which is a function “attached to” an object—in this case, the list `x`.

Python objects such as lists, strings, etc. all have methods that are used to manipulate the data contained in the object.

String objects have [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods), list objects have [list methods](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists), etc.  
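To make the idea of string methods concrete, here is a small sketch using a few of them that tend to be handy when cleaning scraped text. The sample text is hypothetical.

```python
# A hypothetical piece of text as it might come out of a web page.
raw_text = "  The Republic, by PLATO \n"

clean = raw_text.strip()          # drop leading/trailing whitespace
print(clean)

print(clean.lower())              # all lower case
print(clean.replace("PLATO", "Plato"))

# split() cuts a string into a list of pieces.
parts = clean.split(", by ")
print(parts)
```

Each method returns a new string (or list) and leaves the original unchanged, because strings, unlike lists, cannot be modified in place.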


Another useful list method is `list_name.pop()`, which removes the last element of the list and returns it.

In [None]:
x

In [None]:
x.pop()

In [None]:
x
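A short sketch of what `pop()` gives back, using a fresh list (the name `items` is only for illustration):

```python
items = [10, "web", False, "extraValue"]

# pop() removes the last element and hands it back.
last = items.pop()
print(last)
print(items)

# pop(i) does the same for the element at index i.
first = items.pop(0)
print(first)
print(items)
```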

The first element is referenced by `exampleList[0]`

In [None]:
x

In [None]:
x[0]

In [None]:
x[2]

Python lists are __mutable__:

In [None]:
x = [1, 2]

In [None]:
x

In [None]:
x[0]

In [None]:
x[0] = 10

In [None]:
x

#### Slice Notation

To access multiple elements of a list, you can use Python’s slice notation.

For example,

In [None]:
a = [2, 4, 6, 8]

In [None]:
# standard is [a:b] => from index a to index b-1
a[1:3]

In [None]:
a[:2]  # from the start up to (not including) index 2: the first two elements

In [None]:
a[-2:]  # Last two elements of the list
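The slicing rules can be collected in one cell, as a quick reference. The last lines show that the same notation also works on strings.

```python
a = [2, 4, 6, 8]

print(a[1:3])    # indices 1 and 2
print(a[:2])     # from the start up to, not including, index 2
print(a[1:])     # from index 1 to the end
print(a[-2:])    # last two elements

# The same notation works on strings.
word = "scraping"
print(word[:3])
```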

### Dictionaries

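A dictionary is a collection of data stored as `key:value` pairs, where each value is looked up by its key rather than by its position. (Since Python 3.7 dictionaries keep their insertion order, but they are still not indexed by position.)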

Dictionaries are very much like lists, except that the elements are named with a `key` instead of numbered by their order.

In [None]:
# lists are ordered
a = [2, 4, 6, 8]
# get first element
a[0]

In [None]:
# dictionary values are indexed by a key (which could also be a number)
d = {'author' : 'Plato' , 'book': 'The Republic', 'pages': 404}
type(d)

In [None]:
# values are not accessed by position: there is no key 0, so this raises a KeyError
d[0]

In [None]:
# it is the value of the key called author
d['author'] 

The names `'author'`, `'book'` and `'pages'` are called the *keys*.

The objects that the keys are mapped to (`'Plato'`, `'The Republic'`  and `404`) are called the `values`.
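Like lists, dictionaries are mutable. The sketch below adds and overwrites entries and shows two common ways to deal with a key that may be missing. The added values are hypothetical.

```python
d = {'author': 'Plato', 'book': 'The Republic', 'pages': 404}

# Add a new key:value pair, or overwrite an existing one, by assignment.
d['language'] = 'Greek'
d['pages'] = 416        # hypothetical page count of another edition

# "in" tests whether a key exists.
print('author' in d, 'year' in d)

# get() returns a default instead of raising a KeyError for a missing key.
year = d.get('year', 'unknown')
print(year)
print(d)
```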

### Any questions?

## Loops

One common task in web scraping is stepping through a sequence of data and performing a given action.

One of Python’s strengths is its simple, flexible interface to this kind of loop/iteration.

It is particularly useful to know that `lists` and `dictionaries` are __iterable__.

#### Loop over dictionaries

In [None]:
# dictionary
d

In [None]:
# to get keys only
for a in d:
    print(a)

In [None]:
# to get values only
for v in d.values():
    print(v)

In [None]:
# to get a tuple with both key and value
for i in d.items():
    print(i)

In [None]:
# to get both key and values but as separate objects
for a in d:
    print(a , "=>" , d[a] )
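Since `d.items()` yields `(key, value)` tuples, the two parts can also be unpacked directly in the `for` statement. A minimal sketch, with the dictionary redefined so the cell runs on its own:

```python
d = {'author': 'Plato', 'book': 'The Republic', 'pages': 404}

# Each item is a (key, value) tuple, so it can be unpacked in the for statement.
lines = []
for key, value in d.items():
    line = key + " => " + str(value)
    lines.append(line)
    print(line)
```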

#### Loop over lists

In [None]:
# a list
social_sciences = ['anthropology', 'sociology', 'psychology', 'political science', 'economics']

In [None]:
# the "Pythonic" way: loop directly over the elements, without using indices
for i in social_sciences:
    print(i)

In [None]:
# we can also get index and value in the same iteration: remember indices in python start from zero
for idx, value in enumerate(social_sciences):
    position = str(idx)
    print("List's item in position " + position + " is " + value)
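A pattern that shows up constantly in scraping is starting from an empty list and filling it inside a loop. The sketch below keeps only the longer names and capitalizes them. The length threshold is arbitrary.

```python
social_sciences = ['anthropology', 'sociology', 'psychology', 'political science', 'economics']

# Start with an empty list and fill it inside the loop.
long_names = []
for name in social_sciences:
    if len(name) > 9:
        long_names.append(name.title())

print(long_names)
```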

## Functions and Imports

Three broad types of functions:

1. __Built-in__ functions.
1. __User defined__ functions.
1. Functions from __external modules__ that have to be __imported__.

### Built-in functions

We have already used quite a few.

The structure is `function(argument)`.

In [None]:
# range(10) produces the integers 0 to 9 (starts at 0 and increments by 1 by default); list() displays them
list(range(10))

In [None]:
# the sorted function
letters = ['c', 'z', 'a', 'p', 'd']
letters

In [None]:
sorted(letters)

Many more examples....
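A few more built-in functions that work on lists, as a quick sketch (the numbers are made up):

```python
pages = [404, 120, 256]

print(len(pages))                    # number of elements
print(sum(pages))                    # total
print(min(pages), max(pages))        # smallest and largest
print(sorted(pages, reverse=True))   # sorted copy, largest first
print(pages)                         # sorted() leaves the original list unchanged
```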

### User defined functions

In [None]:
def name_of_function(x):
    """
    This function returns whether its argument is negative or non-negative
    """
    if x < 0:
        return 'negative'

    return 'nonnegative'

In [None]:
# if no argument is passed, Python raises a TypeError
name_of_function()

In [None]:
# we need to pass the argument
name_of_function(-5)

In [None]:
# with a non-negative argument the function returns 'nonnegative'
name_of_function(5)
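User defined functions combine naturally with loops. The sketch below defines the same kind of function under the hypothetical name `sign_label` and applies it to every element of a list.

```python
def sign_label(x):
    """
    Return 'negative' if x is below zero and 'nonnegative' otherwise.
    """
    if x < 0:
        return 'negative'
    return 'nonnegative'

# Apply the function to every element of a list.
labels = []
for number in [-5, 0, 5]:
    labels.append(sign_label(number))

print(labels)
```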

### Functions from external modules, packages and libraries.

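Function < Module < Package < Library. Roughly: functions live in modules (`.py` files), modules are grouped into packages, and packages into libraries.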

There are many ways to import modules/packages/libraries.

Here are some examples:

#### Numpy

In [None]:
# using NumPy before importing it raises a NameError
numpy.random.randn()

In [None]:
# one way to import
import numpy

In [None]:
# get a random draw from a standard normal using numpy 
numpy.random.randn()

In [None]:
# another option is to import only the function you need
from numpy.random import randn

In [None]:
randn()

In [None]:
# More common to import with short name!
import numpy as np

In [None]:
np.random.randn()
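The same three import styles work for any module. The sketch below repeats them with `math` from the standard library, so it runs even without NumPy installed. The alias `m` is only for illustration and is not a common convention.

```python
# 1. Import the whole module and use the module name as a prefix.
import math
print(math.sqrt(16))

# 2. Import only the function you need.
from math import sqrt
print(sqrt(16))

# 3. Import the module under a shorter alias (the alias "m" is just for illustration).
import math as m
print(m.sqrt(16))
```

All three calls run the same function. The styles only differ in how much you type and in how clear it is where a name comes from.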

#### Matplotlib

In [None]:
# standard way to import it 
import matplotlib.pyplot as plt

In [None]:
# pass any list of numbers to plt.plot and it draws the values against their index

# build the list with a list comprehension
x = [ a**2 for a in range(101) ]

# plt.plot(list) plots the list
plt.plot(x);

### Any questions?