![Workshop header](https://github.com/gerwolf/webscraping-workshop/blob/main/header.PNG?raw=true)

## Course content
1. Python basics and collaborative software development (GitHub)
2. HTML basics and parsing, I/O operations
3. Programmatic access (APIs)
4. Different types of data
5. Visualisation techniques and simple analyses
6. Web application development basics (your project!)

## Learning objectives
1. Get up-and-running with Python and Jupyter Notebooks
2. Follow best practices such as commenting, documenting and sharing results in reproducible and effective ways
3. Start to think in an "algorithmic" way, i.e. think about how you can use computational resources to perform repetitive tasks
4. Understand how you can deploy a diverse arsenal of (mostly free and complementary) tools and services, and then develop your own toolkit and workflow based on your preferences and experience
5. Get used to machine-readable formats and objects
6. Learn how to seek help in the extensive open-source community and contribute to it, if you believe you can
7. Have fun!

## Software/tools and preliminaries

### Software
- [x] [Current Anaconda distribution](https://www.anaconda.com/products/individual)
- [x] Git installed and authenticated, GitHub and GitHub Desktop (see [installation guide](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/setting-up-github-desktop); GitHub Desktop requires a 64-bit OS)
- [x] A text editor, e.g. [Sublime Text](https://www.sublimetext.com/)

### GitHub repository
- [ ] Create a repository on the GitHub website, including a `README.md` and a `.gitignore`, and clone it to your local directory (see video tutorial)
- [ ] Amend the `README.md` (in a text editor of your choice; I use Sublime Text) so that any user can quickly understand what this repository is about. You can find excellent tips and hints on how to properly format your `README.md` [here](https://docs.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax).

### Virtual environment
- [ ] Create and activate a virtual environment (see video tutorial)
    - [ ] Install the Jupyter module via pip, i.e. `pip install jupyter`
    - [ ] Install the Jupyter Notebook kernel module via pip, i.e. `pip install ipykernel`
    - [ ] Run `pip freeze > requirements.txt` to write the installed packages to a `requirements.txt` in your local repository folder
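
If you prefer to run the last step from inside a notebook, the following is a minimal sketch of the same idea in Python. The function names `active_environment` and `write_requirements` are made up for this illustration, and the conda check relies on the assumption that activating a conda environment sets the `CONDA_DEFAULT_ENV` variable.

```python
import os
import subprocess
import sys
from pathlib import Path


def active_environment():
    """Return a short description of the active environment, or None."""
    # venv/virtualenv: the interpreter prefix differs from the base installation.
    if sys.prefix != sys.base_prefix:
        return f"venv at {sys.prefix}"
    # conda (assumption): activation sets CONDA_DEFAULT_ENV; "base" means that
    # no dedicated environment has been activated.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env and conda_env != "base":
        return f"conda environment '{conda_env}'"
    return None


def write_requirements(target="requirements.txt"):
    """Write the `pip freeze` output of the current interpreter to `target`."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    Path(target).write_text(frozen)
    return frozen.splitlines()


print("Active environment:", active_environment() or "none detected")
```

Calling `write_requirements()` afterwards creates the file in the current working directory, so run it from your local repository folder.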
    
### Jupyter Notebook
- [ ] Create a new Jupyter Notebook using the installed Kernel (see video tutorial)

### GitHub Desktop
- [ ] Commit your changes and push them to your repository (see video tutorial). Congratulations, you have just made your first contribution and are now part of the world’s largest Open-Source community!

### Check-list
- [ ] [GitHub](https://github.com/pricing) (personal user, free)
- [ ] [Twitter developer account](https://developer.twitter.com/en/apply-for-access) (you need a Twitter account first and must submit an application to get access to the Twitter API; choose the standard product track and fill out the mandatory fields; approval may take some time and may require verification via phone/SMS; store your `consumer_key`, `consumer_secret`, `access_token_key` and `access_token_secret` row by row as string objects, in this order, in a `.py` file, e.g. `Twitter_API.py`)
- [ ] [Destatis/GENESIS-Online](https://www-genesis.destatis.de/genesis/online?Menu=Registrierung#abreadcrumb) (choose "Registrierter Nutzer" (registered user) and obtain your username and password; store them row by row, in this order, in a `.py` file, e.g. `Genesis_API.py`)
- [ ] [Spotify developer account](https://developer.spotify.com/documentation/web-api/quick-start/) (you need a Spotify account, Premium or Free, registered at http://www.spotify.com/; create an app and store your Client ID and Client Secret row by row, in this order, in a `.py` file, e.g. `Spotify_API.py`)
- [ ] [Plotly account](https://chart-studio.plotly.com/Auth/login/?action=signup#/) (obtain your username and API key via Account → Settings → API Keys → Generate Key, and store them row by row, in this order, in a `.py` file, e.g. `Plotly_API.py`)
- [ ] [Heroku account](https://signup.heroku.com/) (for web application hosting and deployment)
- [ ] [Slack Desktop](https://slack.com/intl/de-de/downloads/windows) (make sure to join [our Slack group](https://join.slack.com/t/codingbootcam-nmp8973/shared_invite/zt-ofv11epd-PTKTczAm7H2s1OD6XNbK5g) as this will be the main communication tool before and during the workshop)
- [ ] Text editor, e.g. [Sublime Text](https://www.sublimetext.com/)
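
As an illustration of the credential files mentioned in the check-list, here is one possible layout for `Twitter_API.py`. It is a sketch under the assumption that "row by row as string objects" means one string assignment per row; the values are placeholders, not real keys.

```python
# Twitter_API.py: hypothetical layout, one credential per row, in the order
# given in the check-list. Replace the placeholders with your own values.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token_key = "YOUR_ACCESS_TOKEN_KEY"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```

With this layout, a notebook in the same folder could load the values with `from Twitter_API import consumer_key, consumer_secret`. Since these files contain secrets, it is worth considering listing them in your `.gitignore` so they are not pushed to GitHub.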

## Why all of this?
### Why [Python](https://en.wikipedia.org/wiki/Python_(programming_language))?
1. Readability, intuitive logic and accessibility
2. It's free (Open Source)
3. Great documentation and learning resources (Stack Overflow, books, MOOCs, etc.)
4. FOMO: it's everywhere, and the user base is growing

The PYPL PopularitY of Programming Language Index is created by analyzing how often language tutorials are searched for on Google.

<a href="https://de.statista.com/statistik/daten/studie/678732/umfrage/beliebteste-programmiersprachen-weltweit-laut-pypl-index/" rel="nofollow"><img src="https://de.statista.com/graphic/1/678732/beliebteste-programmiersprachen-weltweit-laut-pypl-index.jpg" alt="Statistic: The most popular programming languages worldwide according to the PYPL index in April 2021 | Statista" style="width: 60%; height: auto !important; max-width:1000px;-ms-interpolation-mode: bicubic;"/></a><br />Find more statistics at <a href="https://de.statista.com" rel="nofollow">Statista</a>

### Why [Open Source](https://en.wikipedia.org/wiki/Open-source_model) (GitHub etc.)?
1. **Participation** and transparency/reduction of information asymmetries
2. Efficient and effective **communication** through **collaboration**
3. Skill development and structure, credible repertoire
4. Awareness: Who owns `data`?
5. **Network effects** and reduction of **transaction costs**
6. Digital **globalisation** and democratisation

As of January 2020, GitHub reports having over 40 million users and more than 190 million repositories (including at least 28 million public repositories), making it the largest host of source code in the world.

### Why Webscraping?
1. Modern communication and the exchange of information and services run over the internet, but most of what happens goes unnoticed by the average user (cookies, IP, `user-agent`, clicks, etc.)
2. We disentangle the traffic and learn how to **read** it (e.g. JSON) and then how to **participate** in it in order to accomplish our (research) mission
3. Then we usually need to **structure** it, i.e. bring it into a format we can work with, e.g. tables
4. The amount of data is growing at incredible speed (storage costs are close to zero), but complexity increases exponentially (if you ask me; others might disagree), and we need to derive **insights** from those data lakes
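
To make the "read" and "structure" steps above concrete, here is a minimal sketch that parses a JSON response and turns it into a table (CSV). It uses only the standard library; the payload and its field names (`tracks`, `title`, `artist`, `plays`) are invented for this illustration and do not come from a real API.

```python
import csv
import io
import json

# A made-up JSON payload, standing in for what a web service might return.
raw = """
{"tracks": [
    {"title": "Song A", "artist": "Band X", "plays": 120},
    {"title": "Song B", "artist": "Band Y", "plays": 87}
]}
"""

# Read: turn the JSON text into Python objects (dicts and lists).
data = json.loads(raw)
rows = data["tracks"]

# Structure: write the records as a table with one row per track.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "artist", "plays"])
writer.writeheader()
writer.writerows(rows)

table = buffer.getvalue()
print(table)
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with `open("tracks.csv", "w", newline="")` instead would store the table on disk.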

<img src="https://camo.githubusercontent.com/6012a6bf037b404c727827e42f2df887810a08e7/687474703a2f2f7777772e616c6c6163636573732e636f6d2f6173736574732f696d672f636f6e74656e742f6d657267652f323032302f6d2d30332d31302d706963312e6a70672e7061676573706565642e63652e664f6b447a6e666e2d4c2e6a7067" alt="Drawing" data-canonical-src="http://www.allaccess.com/assets/img/content/merge/2020/m-03-10-pic1.jpg.pagespeed.ce.fOkDznfn-L.jpg">

### Economics & Data Science?

#### The fundamental problem of economics (as of 2021) summarized in one cat meme:
<img alt="" width="500" height="375" role="presentation" src="https://miro.medium.com/max/500/1*ZhYNqU2y96_f3QkWq9oiWQ.jpeg" srcset="https://miro.medium.com/max/276/1*ZhYNqU2y96_f3QkWq9oiWQ.jpeg 276w, https://miro.medium.com/max/500/1*ZhYNqU2y96_f3QkWq9oiWQ.jpeg 500w" sizes="500px">

1. Fundamentally different objectives, e.g. w.r.t. causal inference (Economists: **Structural!** Machine Learning people: **Fit!**)
2. Trade-off between interpretability and predictive power
3. ...but economists need to address changing economies, markets, information problems etc. in a timely manner
4. Data Science tools can complement economic reasoning and the research process in general

### About me - Gerome Wolf
- BSc Volkswirtschaftslehre (HU Berlin, Bachelor thesis: ["Cultural modernisation in Germany in the 19th century: First name choices"](https://drive.google.com/file/d/1T-9apxtAVRYWy7ksiqbx3JzOoq0GNtVB/view?usp=sharing), Economic History)
- MSc Economics (UCL, Dissertation: [The equity risk premium puzzle - A cross-country study of asset returns, growth and disaster risk](https://github.com/gerwolf/master-thesis-UCL), Macro Finance)
- Currently finishing a second master's degree (MSc Volkswirtschaftslehre, HU Berlin), dissertation on imperfect VAT pass-through in a HANK model
- I like coding, hacking, prototyping, innovation, `data`, metal music, snowboarding, hiking and animals (cats in particular)
- Participated in three Hackathons so far (never won but had an amazing time)
- Worked at HelloFresh (BI) as well as at commercial and central banks
- Worked with many types of data and data sources (time series, spatial, music/voice/audio, images)
- Started computer programming 3.5 years ago in a seminar on ... webscraping :D

### Now it's your turn!
1. Background and Affiliation
2. Why did you decide to join this workshop?
3. Do you have a specific use case / project in mind?
4. Previous coding experience?

```python
from random_name import draw_name

draw_name(textfile='participants.txt', total_seconds=120)
```

## Schedule*
| Date | Start | End | Content | Type |
| --- | --- | --- | --- | --- |
| **Wednesday, 14/04/2021** | 10:00<br><br>11:30<br><br>11:45<br><br>13:00<br><br>14:00<br><br>15:00 | 11:30<br><br>11:45<br><br>13:00<br><br>14:00<br><br>15:00<br><br>16:30 | Welcome, Intro & Who is who?<br><br>GitHub, Anaconda & Jupyter NB/Markdown<br><br>Python Basics<br><br>Break<br><br>Python Basics Challenge<br><br>Webscraping I | Plenum<br><br>Plenum<br><br>Plenum<br><br><br><br>Individual<br><br>Plenum|
|  |  |  |  |  |
| **Thursday, 15/04/2021** | 10:00<br><br>11:00<br><br>13:00<br><br>14:00<br><br>15:00 | 11:00<br><br>13:00<br><br>14:00<br><br>15:00<br><br>16:30 | Webscraping II + Visualisation<br><br>API Basics<br><br>Break<br><br>Spatial libraries + Visualisation<br><br>Spatial routines + Application |Plenum<br><br>Plenum<br><br><br><br>Plenum<br><br>Individual/Plenum|
|  |  |  |  |  |
| **Friday, 16/04/2021** | 10:00<br><br>10:30<br><br>11:00<br><br>11:15<br><br>11:30<br><br>13:00<br><br>14:00<br><br>16:00<br><br>16:30| 10:30<br><br>11:00<br><br>11:15<br><br>11:30<br><br>13:00<br><br>14:00<br><br>16:00<br><br>16:30<br><br>17:00 | Browser Automation (Selenium) <br><br>Flask App Basics + Deployment<br><br>Challenges announcement<br><br>Group discussion + decision<br><br>Group work I<br><br>Break<br><br>Group work II<br><br>Demoing + Wrapping it up<br><br>Virtual Friday Beers (Q&A) | Plenum<br><br>Plenum<br><br>Plenum<br><br>Breakout groups<br><br>Breakout groups<br><br><br><br>Breakout groups<br><br>Plenum<br><br>Optional |

\*Subject to change, depending on pace and flow

### How we will learn
- *Learning by doing* - minimum amount of theory and instructions, maximum amount of applications, implementing and hacking
- Individually - *exercises and challenges*
- Groups - *main project and collaboration*

### Literature

<img src="https://covers.oreillystatic.com/images/0636920034391/lrg.jpg" alt="Drawing" style="width: 400px;"/>

[Web scraping with Python](http://shop.oreilly.com/product/0636920034391.do) with [Code examples](https://github.com/REMitchell/python-scraping)
<br>
<br>


___

                
**Contact: Gerome Wolf** (Email: wolfgerome@gmail.com)