# Web Scraping Wikipedia

## Scraping rules
- Check the terms - Read a site's terms and conditions before you scrape it. It's their data, and they likely have rules governing its use.
- Be nice - A computer can send web requests much faster than a user can. Space out your requests so that you don't hammer the site's server.
- Scrapers break - Sites change their layout all the time. When that happens, be prepared to rewrite your code.
- Web pages are inconsistent - Some manual cleanup is often needed even after you've gotten your data.
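
The "be nice" rule can be sketched as a small helper that pauses between requests. This is a minimal sketch rather than a definitive implementation: `fetch_politely`, its `fetch` parameter, and the one-second default delay are assumptions made for illustration. In this notebook, `fetch` could be `requests.get` or a session's `get` method.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests.

    `fetch` is any callable that takes a URL (hypothetical parameter,
    for example requests.get). The default delay is an arbitrary assumption.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

A suitable delay depends on the site; its terms or `robots.txt` may state a preferred rate.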

<h3>Import necessary modules</h3>

In [None]:
import requests
from bs4 import BeautifulSoup
import json
import os

## requests
- requests sends HTTP requests, such as GET and POST.
- The Response object returned by each call holds the results of the request: the page content plus other items such as the HTTP status code and headers.
- requests only fetches the page content; it does not parse it.
- Beautiful Soup parses the HTML and finds content within it.
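
The division of labor described above can be sketched as follows. The helper name `describe_response` is hypothetical; it relies only on the `status_code`, `headers`, and `text` attributes that a requests Response provides. It uses Python's built-in `'html.parser'` so the sketch needs no extra install, whereas the rest of this notebook uses `'lxml'`.

```python
from bs4 import BeautifulSoup


def describe_response(response):
    """Summarize a requests-style response and parse its HTML title."""
    # requests supplies the raw result of the HTTP exchange ...
    summary = {
        'status': response.status_code,
        'content_type': response.headers.get('Content-Type'),
    }
    # ... and Beautiful Soup turns the HTML text into a searchable tree
    soup = BeautifulSoup(response.text, 'html.parser')
    summary['title'] = soup.title.get_text() if soup.title else None
    return summary
```

With the imports from the first cell, a call such as `describe_response(requests.get(url))` would return the status code, content type, and page title in one dictionary.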

In [None]:
url = 'https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page'
login = 'https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login'
watchlist = 'https://en.wikipedia.org/wiki/Special:Watchlist'
url, login, watchlist

## Login via session

Store your credentials in an encrypted/protected file (line 1 = username, line 2 = password).

In [None]:
# Create credentials file:
# CRED_KEY = '....'
# !touch ../_credentials/wiki_credentials.txt
# !echo "$CRED_KEY" > ../_credentials/wiki_credentials.txt
# !chmod 400 ../_credentials/wiki_credentials.txt
# !cat ../_credentials/wiki_credentials.txt

In [None]:
with open('../_credentials/wiki_credentials.txt') as f:
    contents = f.read().split('\n')
    username = contents[0]
    password = contents[1]
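
If the file is missing a line, the cell above fails with an `IndexError` that does not say what went wrong. A slightly more defensive variant might look like the sketch below; `read_credentials` is a hypothetical helper, and the two-line layout is the one assumed throughout this notebook.

```python
from pathlib import Path


def read_credentials(path):
    """Return (username, password) from a two-line credentials file."""
    lines = Path(path).read_text().splitlines()
    # Require a non-empty username on line 1 and password on line 2
    if len(lines) < 2 or not lines[0] or not lines[1]:
        raise ValueError('expected the username on line 1 and the password on line 2')
    return lines[0], lines[1]
```

It could be used as `username, password = read_credentials('../_credentials/wiki_credentials.txt')`.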

### Construct the payload that contains the required login data
Inspect the login form in your browser to find the names of the fields it submits.
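
As a complement to inspecting the form by hand, the field names can also be listed programmatically. The following is a minimal sketch: `list_form_fields` is a hypothetical helper, and it uses Python's built-in `'html.parser'` instead of the `'lxml'` parser used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup


def list_form_fields(html):
    """Map each named <input> in the HTML to its default value ('' if none)."""
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for tag in soup.find_all('input'):
        name = tag.get('name')
        if name:
            # Hidden inputs (such as tokens) carry a preset value
            fields[name] = tag.get('value', '')
    return fields
```

Passing it the text of the login page response would show which keys the payload needs and which values, such as the login token, the server presets.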

<h3>Get the value of the login token</h3>

In [None]:
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input', {'name': "wpLoginToken"}).get('value')
    return token

In [None]:
payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': '+\\',
    'title': 'Special:UserLogin',
    'authAction': 'login',
    'force': '',
    'wpForceHttps': '1',
    'wpFromhttp': '1',
    'wpLoginToken': '',  # placeholder; replaced with the real token from the login form before posting
    }

<h3>Set up a session, log in, and get data</h3>

In [None]:
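with requests.Session() as s:

    # Fetch the login page first; it contains the login token and sets the session cookies
    response = s.get(url)

    # Set login token
    payload['wpLoginToken'] = get_login_token(response)

    # Send the login request
    response_post = s.post(login, data=payload)
    # Raises on a 4xx/5xx status; a rejected login can still return 200
    response_post.raise_for_status()

    # Get another page and check if we're still logged in
    response = s.get(watchlist)
    data = BeautifulSoup(response.content, 'lxml')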

In [None]:
data;

In [None]:
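details = data.find('div', class_='watchlistDetails')
# find() returns None when the element is missing, e.g. if the login did not succeed
print(details.get_text() if details else 'watchlistDetails not found; the login may have failed')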

In [None]:
# The with block above already closed the session, so this call is redundant but harmless
s.close()