# Web Scraping

![Data Science Workflow](img/ds-workflow.png)

## Acquire Data
### Common Data Sources
- **The Internet - Web Scraping**
- Databases
- CSV
- Excel
- Parquet

### Web Scraping
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- The legality of web scraping varies across the world.
- In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

### Be ethical
- Not for commercial use
- Only private use

## Example
- Let's consider [https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics](https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics)
- **pandas** ```.read_html(.)``` reads the HTML tables on a page into a list of DataFrame objects ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)).

In [1]:
import pandas as pd

In [7]:
url="https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"

In [8]:
data=pd.read_html(url)
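
To see what `read_html` returns without depending on a live page, the sketch below parses a small inline HTML string. The HTML snippet and variable names are made up for illustration (the two revenue rows reuse figures from the Wikipedia table), and it assumes an HTML parser such as `lxml` is installed, which `read_html` also needs for the URL above. The `match` argument keeps only the tables whose text matches the given string or regex.

```python
from io import StringIO

import pandas as pd

# Illustrative HTML with two tables; only the first one mentions "Revenue".
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2020/21</td><td>$ 162,886,686</td></tr>
  <tr><td>2019/20</td><td>$ 129,234,327</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>unrelated</td></tr>
</table>
"""

# read_html always returns a list, one DataFrame per table found.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# match= filters the list down to tables containing the given text.
revenue_tables = pd.read_html(StringIO(html), match="Revenue")
print(revenue_tables[0])
```

Because the result is always a list, a page with a single table still needs `data[0]` to get at the DataFrame.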

## Data Wrangling
- Data wrangling (data munging): transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics

### Check the data types
- Remember ```.dtypes```: it lists the data type of each column in a DataFrame

In [9]:
type(data)

list

In [12]:
type(data[0])

pandas.core.frame.DataFrame

In [13]:
len(data)

1

In [14]:
data[0].head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 |
| 1 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 |
| 2 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 |
| 3 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 |
| 4 | 2016/17 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 |


In [15]:
Fundaraising=data[0]

In [16]:
Fundaraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
dtype: object

In [18]:
Fundaraising['Expenses'].str[2:]

0     111,839,819
1     112,489,397
2      91,414,010
3      81,442,265
4      69,136,758
5      65,947,465
6      52,596,782
7      45,900,745
8      35,704,796
9      29,260,652
10     17,889,794
11     10,266,793
12      5,617,236
13      3,540,724
14      2,077,843
15        791,907
16        177,670
17         23,463
Name: Expenses, dtype: object

In [19]:
Fundaraising['Exp']=Fundaraising['Expenses'].str[2:]

In [20]:
Fundaraising['Exp']=Fundaraising['Expenses'].str[2:].str.replace(',','')

In [21]:
Fundaraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp |
|---|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 |
| 1 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 |
| 2 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 |
| 3 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 |
| 4 | 2016/17 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 |
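
Slicing with `.str[2:]` assumes every value starts with exactly two characters to drop (`$` and a space). As a hedged alternative, a regular expression can strip the dollar sign, whitespace, and commas wherever they appear. In the sketch below the first two sample values are copied from the Expenses column, while the last one is invented to show what happens when the space is missing.

```python
import pandas as pd

# First two values come from the Expenses column; "$23,463" (no space)
# is an invented variant used only to show the difference.
expenses = pd.Series(["$ 111,839,819", "$ 112,489,397", "$23,463"])

# Approach from the cells above: fixed-width slice, then drop commas.
sliced = expenses.str[2:].str.replace(",", "", regex=False)

# Alternative: remove "$", commas, and whitespace with one regex.
cleaned = expenses.str.replace(r"[$,\s]", "", regex=True)

print(sliced.tolist())   # the last value loses its leading digit
print(cleaned.tolist())  # all three values keep every digit
```

For the Wikipedia table both approaches give the same result, since every value there uses the `$ ` prefix; the regex version is only more forgiving if the formatting varies.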


In [22]:
Fundaraising['Exp']=pd.to_numeric(Fundaraising['Exp'])

In [24]:
Fundaraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
dtype: object
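
`pd.to_numeric` works here because every cleaned string contains only digits. As a small sketch of what happens otherwise, the example below includes an invented `"n/a"` entry: by default the conversion raises a `ValueError`, while `errors="coerce"` turns unparseable entries into `NaN` (which also makes the column a float type).

```python
import pandas as pd

# Two cleaned values from the Exp column plus an invented bad entry.
raw = pd.Series(["111839819", "112489397", "n/a"])

# Default behaviour: a value that cannot be parsed raises ValueError.
try:
    pd.to_numeric(raw)
    failed = False
except ValueError:
    failed = True
print("default conversion failed:", failed)

# errors="coerce" replaces unparseable values with NaN instead.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)
```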

In [25]:
Fundaraising['Revenue'].str[2:]

0     162,886,686
1     129,234,327
2     120,067,266
3     104,505,783
4      91,242,418
5      81,862,724
6      75,797,223
7      52,465,287
8      48,635,408
9      38,479,665
10     24,785,092
11     17,979,312
12      8,658,006
13      5,032,981
14      2,734,909
15      1,508,039
16        379,088
17         80,129
Name: Revenue, dtype: object

In [26]:
Fundaraising['Rev']=Fundaraising['Revenue'].str[2:]

In [27]:
Fundaraising['Rev']=Fundaraising['Rev'].str.replace(',','')

In [28]:
Fundaraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp | Rev |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 | 162886686 |
| 1 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 | 129234327 |
| 2 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 | 120067266 |
| 3 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 | 104505783 |
| 4 | 2016/17 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 | 91242418 |


In [29]:
Fundaraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
Rev             object
dtype: object

In [30]:
Fundaraising['Rev']=pd.to_numeric(Fundaraising['Rev'])

In [31]:
Fundaraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
Rev              int64
dtype: object
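
The same three steps (drop the `$ ` prefix, remove the commas, convert to a number) were repeated for Expenses and Revenue, and the remaining money columns could be handled the same way. The sketch below loops over the column names on a small stand-in DataFrame built from the first two rows shown above. The names `df`, `money_columns`, and the `" (num)"` suffix are assumptions made for this example, not part of the notebook.

```python
import pandas as pd

# Stand-in for the scraped table, using the first two rows shown above.
df = pd.DataFrame({
    "Year": ["2020/21", "2019/20"],
    "Revenue": ["$ 162,886,686", "$ 129,234,327"],
    "Expenses": ["$ 111,839,819", "$ 112,489,397"],
    "Asset rise": ["$ 50,861,811", "$ 14,674,300"],
    "Total assets": ["$ 231,177,536", "$ 180,315,725"],
})

# Hypothetical list of the columns that hold dollar amounts.
money_columns = ["Revenue", "Expenses", "Asset rise", "Total assets"]

for column in money_columns:
    # Same steps as in the cells above: slice off "$ ", drop commas, convert.
    digits = df[column].str[2:].str.replace(",", "", regex=False)
    df[column + " (num)"] = pd.to_numeric(digits)

print(df.dtypes)
```

Writing to new columns keeps the original strings available for comparison, mirroring how `Exp` and `Rev` sit next to `Expenses` and `Revenue` above.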