# Web Scraping in Python

## Instructions:
- Install the libraries: pip install beautifulsoup4 requests
- Optionally install the lxml parser: pip install lxml
- Check the site's robots.txt, e.g. www.website_to_scrape.com/robots.txt
  - These rules identify which parts of the website may not be extracted automatically and how frequently a bot is allowed to request a page.
- Inspect the website
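
The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The sample rules, the bot name `my-bot`, and the example.com URLs are assumptions invented for illustration; they are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) answers "may this bot request this URL?"
allowed_page = parser.can_fetch("my-bot", "https://www.example.com/wiki/page")
allowed_private = parser.can_fetch("my-bot", "https://www.example.com/private/data")

# crawl_delay(user_agent) returns the requested pause in seconds, or None.
delay = parser.crawl_delay("my-bot")

print(allowed_page, allowed_private, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.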

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import pandas as pd

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_game_engines'

## Crawler

In [3]:
content = requests.get(URL)
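
A bare `requests.get` call waits indefinitely on a stalled connection and does not complain about HTTP error pages. The following is one possible, slightly more defensive variant. The helper name `fetch_html`, the `USER_AGENT` string, and the 10-second timeout are assumptions for this sketch, not part of the original notebook.

```python
import requests

# Hypothetical identifier; replace it with something that describes your own scraper.
USER_AGENT = "tutorial-scraper/0.1"


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_game_engines')
```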

## Parser

In [4]:
soup = BeautifulSoup(content.text, 'html.parser')  # or 'lxml', if it is installed

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of game engines - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"Xc8eEQpAADwAAA2baasAAADE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_game_engines","wgTitle":"List of game engines","wgCurRevisionId":926365077,"wgRevisionId":926365077,"wgArticleId":2323909,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use mdy dates from June 2018","All articles w

## find

In [6]:
row = soup.find('tr')
print(row)

<tr>
<th style="width: 12em">Name
</th>
<th>Primary <a href="/wiki/Programming_language" title="Programming language">programming language</a>
</th>
<th><a href="/wiki/Scripting_language" title="Scripting language">Scripting</a>
</th>
<th><a class="mw-redirect" href="/wiki/Cross-platform" title="Cross-platform">Cross-platform</a>
</th>
<th>2D/3D oriented
</th>
<th>Target <a href="/wiki/Computing_platform" title="Computing platform">platform</a>
</th>
<th>Notable games
</th>
<th>License
</th>
<th class="unsortable">Notes and references
</th></tr>


In [7]:
print(row.get_text())


Name

Primary programming language

Scripting

Cross-platform

2D/3D oriented

Target platform

Notable games

License

Notes and references
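
Calling `get_text()` on the whole row keeps every newline from the markup, which is why the output above is full of blank lines. A small sketch of one way to get a clean list of header names instead, by reading each `<th>` separately. The HTML snippet is a shortened, made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the header row of the real table.
html = """
<table><tr>
<th>Name
</th>
<th>Primary <a href="/wiki/Programming_language">programming language</a>
</th>
<th>License
</th></tr></table>
"""

demo = BeautifulSoup(html, "html.parser")
header_row = demo.find("tr")

# strip=True trims each text fragment; " " joins the fragments of one cell.
headers = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
print(headers)
```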



## find_all

In [8]:
tot_rows = soup.find_all('tr')
for row in tot_rows:
    print(row.get_text())


Name

Primary programming language

Scripting

Cross-platform

2D/3D oriented

Target platform

Notable games

License

Notes and references


4A Engine

C++



Yes

3D

Windows, OS X, Linux, PlayStation 3, PlayStation 4, Xbox 360, Xbox One

Metro 2033, Metro: Last Light, Metro Exodus

Proprietary




A-Frame (VR)

HTML, JavaScript

JavaScript

Yes

3D

Cross-platform

A-Painter[1]

MIT

Open source Entity component system WebVR framework


Adventure Game Interpreter



C style

Yes

2D

DOS, Apple SOS, ProDOS, Classic Mac OS, Atari TOS

List

Proprietary




Adventure Game Studio

C++

AGSScript

Yes

2D

Windows, Linux

Chzo Mythos, Blackwell

Artistic 2.0

Mostly used to develop third-person pre-rendered graphic adventure games, one of the most popular for developing amateur adventure games


Alamo





Yes

3D

Windows, OS X, Xbox 360

Star Wars: Empire at War, Star Wars: Empire at War: Forces of Corruption, Universe at War: Earth Assault

Proprietary




Aleph One

C++

Lua, Mara
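
The difference between `find` and `find_all` matters most when nothing matches. The sketch below, on a tiny made-up table, shows that `find` returns a single tag or `None`, while `find_all` always returns a list-like result that may be empty.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>first</td></tr><tr><td>second</td></tr></table>"
demo = BeautifulSoup(html, "html.parser")

all_rows = demo.find_all("tr")   # every match, in document order
first_row = demo.find("tr")      # only the first match

print(len(all_rows))
print(first_row.get_text())

# find() behaves like find_all() with limit=1, unwrapped.
same_row = demo.find_all("tr", limit=1)[0] == first_row
print(same_row)

# No <ul> exists here: find() gives None, find_all() gives an empty result.
missing_one = demo.find("ul")
missing_all = demo.find_all("ul")
print(missing_one, len(missing_all))
```

Because `find` can return `None`, chaining a call such as `.get_text()` directly onto it fails with an `AttributeError` when the tag is absent.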

## table

In [9]:
table = soup.find_all('table')
print(table)

[<table class="wikitable sortable" style="text-align: center; font-size: 85%; width: auto; table-layout: fixed;">
<tbody><tr>
<th style="width: 12em">Name
</th>
<th>Primary <a href="/wiki/Programming_language" title="Programming language">programming language</a>
</th>
<th><a href="/wiki/Scripting_language" title="Scripting language">Scripting</a>
</th>
<th><a class="mw-redirect" href="/wiki/Cross-platform" title="Cross-platform">Cross-platform</a>
</th>
<th>2D/3D oriented
</th>
<th>Target <a href="/wiki/Computing_platform" title="Computing platform">platform</a>
</th>
<th>Notable games
</th>
<th>License
</th>
<th class="unsortable">Notes and references
</th></tr>
<tr>
<th><a href="/wiki/4A_Engine" title="4A Engine">4A Engine</a>
</th>
<td><a href="/wiki/C%2B%2B" title="C++">C++</a>
</td>
<td>
</td>
<td class="table-yes" style="background:#9F9;vertical-align:middle;text-align:center;">Yes
</td>
<td>3D
</td>
<td><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a>, <a

In [10]:
content_table = soup.find('table', {"class":"wikitable sortable"})
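
Passing the full string "wikitable sortable" matches the `class` attribute exactly as written, so a table whose classes appear in a different order would be missed. The sketch below compares that with two looser alternatives on made-up markup; the three tables are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="sortable wikitable"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute: order matters.
exact = demo.find_all("table", {"class": "wikitable sortable"})

# A single class name matches any tag that carries that class.
any_class = demo.find_all("table", class_="wikitable")

# CSS selector: both classes required, in any order.
css = demo.select("table.wikitable.sortable")

print(len(exact), len(any_class), len(css))
```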

In [11]:
rows_table = content_table.find_all('tr')
for r in rows_table:
    print(r.get_text())


Name

Primary programming language

Scripting

Cross-platform

2D/3D oriented

Target platform

Notable games

License

Notes and references


4A Engine

C++



Yes

3D

Windows, OS X, Linux, PlayStation 3, PlayStation 4, Xbox 360, Xbox One

Metro 2033, Metro: Last Light, Metro Exodus

Proprietary




A-Frame (VR)

HTML, JavaScript

JavaScript

Yes

3D

Cross-platform

A-Painter[1]

MIT

Open source Entity component system WebVR framework


Adventure Game Interpreter



C style

Yes

2D

DOS, Apple SOS, ProDOS, Classic Mac OS, Atari TOS

List

Proprietary




Adventure Game Studio

C++

AGSScript

Yes

2D

Windows, Linux

Chzo Mythos, Blackwell

Artistic 2.0

Mostly used to develop third-person pre-rendered graphic adventure games, one of the most popular for developing amateur adventure games


Alamo





Yes

3D

Windows, OS X, Xbox 360

Star Wars: Empire at War, Star Wars: Empire at War: Forces of Corruption, Universe at War: Earth Assault

Proprietary




Aleph One

C++

Lua, Mara


XnGine





No

3D

DOS

The Terminator: Future Shock, The Terminator: SkyNET, TES 2: Daggerfall, TES Legends: Battlespire, TES Adventures: Redguard

Proprietary




Zest3D

ActionScript 3, C++

Lua

Yes

3D

Web, Windows, Linux, OS X, Android, iOS, BlackBerry



Boost




Zillions of Games



Zillions Rules

No

2D

Windows



Proprietary




Name

Primary programming language

Scripting

Cross-platform

2D/3D oriented

Target platform

Notable games

License

Notes and references



## Nested tags

In [12]:
print(soup.select("html head title")[0].get_text())

List of game engines - Wikipedia
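
`select` takes a CSS selector, so a space-separated path like "html head title" means "a title somewhere inside head, somewhere inside html". A short sketch of a few other selector forms follows; the page content and the `content` id are made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo page</title></head>
<body><div id="content"><p class="lead">Intro</p>
<p>More <a href="/wiki/X">X</a></p></div></body></html>
"""
demo = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
title = demo.select_one("head > title").get_text()    # ">" means direct child
lead = demo.select_one("div#content p.lead").get_text()  # id and class filters

# a[href] keeps only anchors that actually have an href attribute.
hrefs = [a["href"] for a in demo.select("div#content a[href]")]

print(title, lead, hrefs)
```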


## Regular Expressions

In [13]:
rows_table_ = content_table.find_all('a', title=re.compile('^Id Tech .*'))
print(rows_table_)

[<a href="/wiki/Id_Tech_3" title="Id Tech 3">id Tech 3</a>, <a href="/wiki/Id_Tech_4" title="Id Tech 4">id Tech 4</a>, <a href="/wiki/Id_Tech_5" title="Id Tech 5">id Tech 5</a>, <a href="/wiki/Id_Tech_6" title="Id Tech 6">id Tech 6</a>, <a href="/wiki/Id_Tech_7" title="Id Tech 7">id Tech 7</a>]


In [14]:
for row in rows_table_:
    print(row.get_text())

id Tech 3
id Tech 4
id Tech 5
id Tech 6
id Tech 7


In [15]:
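# Match any class that starts with 'wiki' (a trailing '*' would only repeat the final 'i').
regex = re.compile('^wiki')
# Use a new name so the response stored in `content` is not overwritten.
wiki_tables = soup.find_all('table', attrs={'class': regex})
print(wiki_tables)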

[<table class="wikitable sortable" style="text-align: center; font-size: 85%; width: auto; table-layout: fixed;">
<tbody><tr>
<th style="width: 12em">Name
</th>
<th>Primary <a href="/wiki/Programming_language" title="Programming language">programming language</a>
</th>
<th><a href="/wiki/Scripting_language" title="Scripting language">Scripting</a>
</th>
<th><a class="mw-redirect" href="/wiki/Cross-platform" title="Cross-platform">Cross-platform</a>
</th>
<th>2D/3D oriented
</th>
<th>Target <a href="/wiki/Computing_platform" title="Computing platform">platform</a>
</th>
<th>Notable games
</th>
<th>License
</th>
<th class="unsortable">Notes and references
</th></tr>
<tr>
<th><a href="/wiki/4A_Engine" title="4A Engine">4A Engine</a>
</th>
<td><a href="/wiki/C%2B%2B" title="C++">C++</a>
</td>
<td>
</td>
<td class="table-yes" style="background:#9F9;vertical-align:middle;text-align:center;">Yes
</td>
<td>3D
</td>
<td><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a>, <a
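
When a compiled pattern is passed as a filter, Beautiful Soup tests it with the pattern's `search()` method. In a regular expression, `*` repeats only the character before it; it is not a shell-style wildcard. The sketch below uses plain `re` on a few made-up class strings to show the difference between a pattern ending in `i*` and the plain prefix.

```python
import re

loose = re.compile('^wiki*')   # 'wik' followed by zero or more 'i'
strict = re.compile('^wiki')   # the literal prefix 'wiki'

results = {}
for name in ['wikitable', 'wik', 'infobox wikitable']:
    results[name] = (bool(loose.search(name)), bool(strict.search(name)))
    print(name, results[name])
```

Both patterns accept 'wikitable', but only the loose one also accepts 'wik'. Neither matches a string that merely contains 'wiki' later on, because `^` anchors the match at the start.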

In [16]:
rows = content_table.find_all('a', string='C', limit=3)
print(rows)

[<a href="/wiki/C_(programming_language)" title="C (programming language)">C</a>, <a href="/wiki/C_(programming_language)" title="C (programming language)">C</a>, <a href="/wiki/C_(programming_language)" title="C (programming language)">C</a>]
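
A plain string passed as `string=` must equal the tag's text exactly, and `limit` stops the search after that many matches. The following sketch illustrates both on a handful of made-up links, with a regular expression shown for comparison.

```python
import re
from bs4 import BeautifulSoup

html = ('<p><a href="/c">C</a> <a href="/cpp">C++</a> <a href="/c">C</a> '
        '<a href="/cs">C#</a> <a href="/c">C</a></p>')
demo = BeautifulSoup(html, 'html.parser')

exact = demo.find_all('a', string='C')               # only links whose text is exactly 'C'
limited = demo.find_all('a', string='C', limit=2)    # stop after two matches
prefix = demo.find_all('a', string=re.compile('^C'))  # any text starting with 'C'

print(len(exact), len(limited), len(prefix))
```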


In [17]:
links = content_table.find_all('a')
print(links)

[<a href="/wiki/Programming_language" title="Programming language">programming language</a>, <a href="/wiki/Scripting_language" title="Scripting language">Scripting</a>, <a class="mw-redirect" href="/wiki/Cross-platform" title="Cross-platform">Cross-platform</a>, <a href="/wiki/Computing_platform" title="Computing platform">platform</a>, <a href="/wiki/4A_Engine" title="4A Engine">4A Engine</a>, <a href="/wiki/C%2B%2B" title="C++">C++</a>, <a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a>, <a class="mw-redirect" href="/wiki/OS_X" title="OS X">OS X</a>, <a href="/wiki/Linux" title="Linux">Linux</a>, <a href="/wiki/PlayStation_3" title="PlayStation 3">PlayStation 3</a>, <a href="/wiki/PlayStation_4" title="PlayStation 4">PlayStation 4</a>, <a href="/wiki/Xbox_360" title="Xbox 360">Xbox 360</a>, <a href="/wiki/Xbox_One" title="Xbox One">Xbox One</a>, <a href="/wiki/Metro_2033_(video_game)" title="Metro 2033 (video game)">Metro 2033</a>, <a href="/wiki/Metro:_Last_Lig

In [18]:
web_links = []
for link in links:
    # .get() returns None when a link has no title attribute
    web_links.append(link.get('title'))
print(web_links)

['Programming language', 'Scripting language', 'Cross-platform', 'Computing platform', '4A Engine', 'C++', 'Microsoft Windows', 'OS X', 'Linux', 'PlayStation 3', 'PlayStation 4', 'Xbox 360', 'Xbox One', 'Metro 2033 (video game)', 'Metro: Last Light', 'Metro Exodus', 'Proprietary software', 'A-Frame (VR)', 'Cross-platform', None, 'MIT License', 'Entity component system', 'WebVR', 'Adventure Game Interpreter', 'DOS', 'Apple SOS', 'ProDOS', 'Classic Mac OS', 'Atari TOS', 'Adventure Game Interpreter', 'Proprietary software', 'Adventure Game Studio', 'C++', 'Microsoft Windows', 'Linux', 'Chzo Mythos', 'Blackwell (series)', 'Artistic License', 'Pre-rendering', 'Adventure game', 'Adventure game', 'Petroglyph Games', 'Microsoft Windows', 'OS X', 'Xbox 360', 'Star Wars: Empire at War', 'Star Wars: Empire at War: Forces of Corruption', 'Universe at War: Earth Assault', 'Proprietary software', 'Aleph One (game engine)', 'C++', 'Lua (programming language)', 'Marathon (video game)', 'Microsoft Wind
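
The `None` in the list above comes from a link without a `title` attribute: `.get()` returns `None` rather than raising an error. One way to skip such links is to filter on the attribute's presence, sketched here on made-up markup.

```python
from bs4 import BeautifulSoup

html = ('<p><a href="/a" title="Alpha">a</a>'
        '<a href="#cite-1">[1]</a>'
        '<a href="/b" title="Beta">b</a></p>')
demo = BeautifulSoup(html, 'html.parser')

# .get() tolerates the missing attribute and yields None for it.
titles_with_none = [link.get('title') for link in demo.find_all('a')]

# title=True keeps only tags that have a title attribute at all.
titles = [link['title'] for link in demo.find_all('a', title=True)]

print(titles_with_none)
print(titles)
```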

## Extract Data

In [19]:
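# One list per <td> column. The engine name sits in a <th>, so it is not collected here.
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
H=[]

for t in content_table.find_all('tr'):
    cells = t.find_all('td')
    # Keep only complete data rows; header rows contain no <td> cells.
    if len(cells) == 8:
        # find(string=True) returns only the first text node of each cell
        # (string= replaces the deprecated text= argument).
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
        D.append(cells[3].find(string=True))
        E.append(cells[4].find(string=True))
        F.append(cells[5].find(string=True))
        G.append(cells[6].find(string=True))
        H.append(cells[7].find(string=True))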

In [21]:
df = pd.DataFrame(A, columns=['1'])
df['2']=B
df['3']=C
df['4']=D
df['5']=E
df['6']=F
df['7']=G
df['8']=H
df.head(10)

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0 | C++ | \n | Yes\n | 3D\n | Windows | Metro 2033 | Proprietary | \n |
| 1 | HTML, JavaScript\n | JavaScript\n | Yes\n | 3D\n | Cross-platform | A-Painter | MIT | Open source |
| 2 | \n | C style\n | Yes\n | 2D\n | DOS | List | Proprietary | \n |
| 3 | C++ | AGSScript\n | Yes\n | 2D\n | Windows | Chzo Mythos | Artistic | Mostly used to develop third-person |
| 4 | \n | \n | Yes\n | 3D\n | Windows | Star Wars: Empire at War | Proprietary | \n |
| 5 | C++ | Lua | Yes\n | 2.5D\n | Windows | Aleph One ( | GPL | FPS engine\n |
| 6 | C | Ada | Yes\n | 2D\n | Windows | Factorio | zlib | Graphics, audio, input\n |
| 7 | C, Assembler\n | C, C++, Gel\n | Yes\n | 2D, 3D\n | Windows | \n | Proprietary | \n |
| 8 | C++, FFL | FFL | Yes\n | 2D\n | Windows | Frogatto & Friends | zlib | [ |
| 9 | C++, C#\n | \n | Yes\n | 3D\n | Windows | List | Proprietary | \n |
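
The values above are cut short (for example "Aleph One (" and "[") and carry trailing newlines because only the first text node of each cell is taken. The following sketch shows one alternative: read the full text of every cell, normalize its whitespace, and build the DataFrame from complete rows, including the name held in the `<th>`. The table content, the helper `clean`, and the names `demo_table` and `df_demo` are made up for this illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Small invented table that mimics the structure of the real one.
html = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Language</th><th>License</th></tr>
<tr><th><a href="/wiki/Engine_A">Engine A</a></th><td><a href="/wiki/C%2B%2B">C++</a>, Lua
</td><td>MIT
</td></tr>
<tr><th>Engine B</th><td>
</td><td>Proprietary
</td></tr>
</table>
"""


def clean(cell):
    """Return the cell's full text with runs of whitespace collapsed."""
    return ' '.join(cell.get_text().split())


demo_table = BeautifulSoup(html, 'html.parser').find('table')
rows = demo_table.find_all('tr')

# The first row supplies the column names.
columns = [clean(th) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    cells = row.find_all(['th', 'td'])  # name cell (<th>) plus data cells (<td>)
    if len(cells) == len(columns):      # skip incomplete or repeated header rows
        records.append([clean(cell) for cell in cells])

df_demo = pd.DataFrame(records, columns=columns)
print(df_demo)
```

Building a list of rows first avoids one list per column and keeps the column names tied to the header row rather than to hard-coded labels.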


## Reference:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

https://en.wikipedia.org/wiki/List_of_game_engines