# HTML and CSS

- HTML: Most web pages are formatted using the Hypertext Markup Language (HTML), so we need to understand how to extract
information from such pages.

- CSS: Cascading Style Sheets (CSS) are used to format and style modern web pages.

## HTML

Recall what we extracted from the web page:

```
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Game of Thrones episodes - Wikipedia</title>
[...]
</html>
```


HTML provides the building blocks that give documents their structure and
formatting. It does so by means of a series of “tags.” 
- HTML tags often come in pairs and are enclosed in angle brackets, with “`<tagname>`” being the opening tag and
“`</tagname>`” indicating the closing tag. 
- Some tags come in an unpaired form and do not require a closing tag.

Let's look at some commonly used tags:

- `<p>`...`</p>` to enclose a paragraph;

- `<br>` to set a line break;

- `<table>`...`</table>` to start a table block; inside, `<tr>`...`</tr>` is
used for the rows and `<td>`...`</td>` for the cells;

- `<img>` for images;

- `<h1>`...`</h1>` to `<h6>`...`</h6>` for headers;

- `<div>`...`</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;

- `<a>`...`</a>` for hyperlinks;

- `<ul>`...`</ul>`, `<ol>`...`</ol>` for unordered and ordered lists respectively; inside of these, `<li>`...`</li>` is used for each list item.

-----------------------------
A few things:

1. Tags can be nested inside each other, so “`<div><p>Hello</p></div>`” is perfectly valid, though overlapping tags such as “`<div><p>Oops</div></p>`” are not.

2. Even though overlapping tags are not proper HTML, every web browser will exert a lot of effort to still parse and render an HTML page as well as possible.

3. HTML is messy. If web browsers strictly required every page to follow the HTML standard perfectly, most websites would stop working. 

When you read HTML content:

1. Tags that come in pairs have content. For instance, “`<a>click here</a>`” will render out “click here” as a hyperlink in your browser.

2. Tags can also have attributes, which are put inside the opening tag. The `href` attribute indicates the web address of the link. 
“`<a href="http://www.google.com"> click here </a>`” directs the user to Google’s home page.

3. For the image tag, the `src` attribute indicates the URL of the image the browser should retrieve, e.g.,
“`<img src="http://www.example.com/image.jpg">`”. 
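To make the idea of tags and attributes concrete, here is a minimal sketch that walks through a small HTML snippet with Python's built-in `html.parser` module and records each opening tag together with its attributes. The `TagCollector` class and the snippet itself are made up for this illustration; they are not part of the page we scrape.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and its attributes as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


# Hypothetical snippet, written only for this illustration
snippet = (
    '<p>Visit <a href="http://www.google.com">click here</a></p>'
    '<img src="http://www.example.com/image.jpg">'
)

collector = TagCollector()
collector.feed(snippet)

for tag, attributes in collector.tags:
    print(tag, attributes)
```

The paired `<p>` and `<a>` tags and the unpaired `<img>` tag each show up once, and the `href` and `src` attributes are available as ordinary dictionary entries.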

### Use browser as a development tool

The steps below assume you use Google Chrome. 

- Navigate to https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

- Let's look at the underlying HTML of the page. 
    - Right-click on the page and press "View source" or `Control+U` in Chrome.
    - A new page will open containing the raw HTML contents for the current page (the same content as what we got back using `r.text`);

- Open up Chrome’s “Developer Tools.” To do so, either select the Chrome Menu at the top right of your browser window, then select “Tools,” “Developer Tools,” or press Control+Shift+I. Alternatively, you can also right-click on any page element and select “Inspect Element.” You will see a new window with a bunch of helpful tools for web scrapers.

- Explore the Developer Tools pane. Yours might appear at the bottom of your browser window. If you prefer to have it on the right, find the menu with the three-dotted-colon icon (the tri-colon), and pick a different “Dock side.”

Summary: 

- The “Elements” and “Network” tabs will be the most helpful.

- The red “recording” icon in the toolbar indicates that Chrome is tracking network requests (if the icon is not lit, press it to start tracking).

- Refresh the Wikipedia page and look at what happens in the Developer Tools pane: 
    - Chrome starts logging all requests it is making, starting with an HTTP request for the page itself at the top.
    - Your web browser is also making lots of other requests to actually render the page, most of them to fetch image data (“Type: png”).
    - By clicking a request, you can get more information about it. Click the “index.php” request at the top.
        - Selecting a request opens another pane with a wealth of information that should already look pretty familiar now that you’ve worked with HTTP.
        - For instance, select the “Headers” tab to see the request URL, the method (verb), and the status code that was sent back by the server, as well as a full list of request and response headers.

----------------------------

**Check boxes** in the Network tab that are noteworthy to mention:
   - Enabling “Preserve log” will prevent Chrome from “cleaning up” the overview every time a new page request is performed. This can come in handy in case you want to track a series of actions when navigating a website.
   - “Disable cache” will prevent Chrome from using its “short-term memory.”
   - Chrome will try to be smart and prevent performing a request if it still has the contents of a recent page around, though you can override this in case you want to force Chrome to actually perform every request.
   
------------------------------

**Elements** tab in Chrome:

- You can hover over the HTML tags in the Elements tab, and Chrome will show a transparent box over the corresponding visual representation on the web page itself. This can help you to quickly find the pieces of content you’re looking for.

- Alternatively, you can right-click any element on a web page and press “Inspect element” to immediately highlight its corresponding HTML code in the Elements tab.

#### Inspecting Elements versus View Source
Why is the “View source” option still useful for looking at a page’s raw HTML source when the Elements tab offers a much more user-friendly alternative?

1. The “View source” option shows the HTML code as it was returned by the web server, and it will contain the same contents as `r.text` when using requests.
2. The view in the Elements tab, on the other hand, provides a “cleaned up” version after the HTML was parsed by your web browser. Overlapping tags are fixed and extra white space is removed, for instance.
3. There might hence be small differences between these two views.

4. The Elements tab provides a live and dynamic view and will reflect the current state of the page. 
    - Websites can include scripts that are executed by the web browser and that can alter the contents of the page at will. 
    - These scripts are written in a programming language called JavaScript and can be found inside `<script>...</script>` tags in HTML.
    
---------------------------

#### A few other things:

- Next, note that any HTML element in the Elements tab can be right-clicked.
    - “Copy, Copy selector” and “Copy XPath” are particularly useful, which we’re going to use quite often later on.
    
- You can edit the HTML code in real time (the web page will update itself to reflect your edits). 
    - These changes are only local. They don’t do anything on the web server itself and will be gone once you refresh the page, though it can be a fun way to experiment with HTML.

## Cascading Style Sheets: CSS

The structure and the formatting of documents are two different concerns:
- HTML is used to define the general structure and semantics of a document
- CSS will govern how a document should be styled, or, what it should look like.

The CSS language looks somewhat different from HTML. In CSS, style information is written down 
   - as a list of colon-separated key-value based statements
   - with each statement itself being separated by a semicolon, as follows:

```
color: red;
background-color: #ccc;
font-size: 14pt;
border: 2px solid yellow;
```
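The "key: value;" layout is simple enough to take apart with plain string methods. The following is a rough sketch, not a real CSS parser: the helper name `parse_declarations` is invented here, and the approach would break on values that themselves contain a semicolon or colon in unexpected places (for example inside a quoted string).

```python
def parse_declarations(style):
    """Split 'key: value; key: value' text into a dictionary."""
    declarations = {}
    for statement in style.split(';'):
        if ':' not in statement:
            continue  # skip empty pieces such as the one after the last semicolon
        key, value = statement.split(':', 1)
        declarations[key.strip()] = value.strip()
    return declarations


style = "color: red; background-color: #ccc; font-size: 14pt; border: 2px solid yellow;"
print(parse_declarations(style))
```

A helper like this can be handy when a scraped element carries its styling in a `style` attribute and you only need one property out of it.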



These style declarations can be included in a document in 3 different ways:

- Inside a regular HTML “style” attribute, for instance as in: `<p style="color: red;">...</p>`

- Inside HTML `<style>...</style>` tags, placed inside the `<head>` tag of a page.

- Inside a separate file, which is then referred to by means of a `<link>` tag inside the `<head>` tag of a page. This is the cleanest way of working. When loading a web page, your browser will perform an additional HTTP request to download this CSS file and apply its defined styles to the document.

When style declarations are placed inside a "style" attribute, it is clear to which element they apply: the HTML tag itself. In the other two cases, declarations are grouped inside curly brackets, and a "CSS selector" at the beginning of each group states which elements the group applies to. 

```

h1{
  color: red;
}
div.box{
  border: 1px solid black;
}
#intro-paragraph{
  font-weight: bold;
}

```

CSS selectors define the patterns used to "select" the HTML elements you want to style. ([Details](https://www.w3schools.com/cssref/trysel.asp?selector=ul%20~%20table))


Here is a reference of commonly used selectors:
- `tagname` selects all elements with a particular tag name. Ex., "h1" simply matches with all "\<h1\>" tags on a page.
- `.classname` selects all elements having a particular class defined in the HTML document. This is where the "class" attribute comes in. 
    - For instance, `.intro` will match with both "\<p class="intro"\>" and "\<h1 class="intro"\>". Note that HTML elements can have multiple classes, for example, "\<p class="intro highlight"\>".
- `#idname` matches elements based on their “id” attribute. Contrary to classes, proper HTML documents should ensure that each “id” is unique and given to one element only (though don’t be surprised if some particularly messy HTML page breaks this convention and uses the same id value multiple times).
- These selectors can be combined in all sorts of ways. `div.box`,
for instance, selects all "\<div class="box"\>" tags, but not "\<div class="circle"\>" tags.
- Multiple selector rules can be specified by using a comma, “,”, for example, `h1, h2, h3`.
- `selector1` `selector2` defines a chaining rule (note the space) and selects all elements matching selector2 inside of elements matching selector1. Note that it is possible to chain more than two selectors together.
- `selector1` > `selector2` selects all elements matching selector2 where the parent element matches selector1. Note the subtle difference here with the previous line. A “parent” element refers to the “direct parent.” For instance, `div > span` will not match with the span element inside “`<div><p><span></span></p></div>`” (as the parent element here is a "\<p>" tag), whereas `div span` will.
- `selector1` + `selector2` selects all elements matching `selector2` that are placed directly after (i.e., on the same level in the HTML hierarchy) elements matching `selector1`.
- `selector1` ~ `selector2` selects all elements matching `selector2` that are placed after (on the same level in the HTML hierarchy) `selector1`. Again, there’s a subtle difference here with the previous rule: the precedence here does not need to be “direct”: there can be other tags in between.
- It is also possible to add more fine-tuned selection rules based on attributes of elements. tagname\[attributename\] selects all tagname elements where an attribute named attributename is present. Note that the tag selector is optional, and simply writing \[title\] selects all elements with a “title” attribute.
- The attribute selector can be further refined. \[attributename=value\]
checks the actual value of an attribute as well. If you want to include
spaces, wrap the value in double quotes.
- \[attributename~=value\] does something similar, but instead of performing an exact value comparison, here all elements are selected whose attributename attribute’s value is a space-separated list of words, one of them being equal to value.
- \[attributename|=value\] selects all elements whose attributename attribute’s value is either exactly “value” or starts with “value” immediately followed by a hyphen (“-”).
- \[attributename^=value\] selects all elements whose attribute value starts with the provided value. If you want to include spaces, wrap the value in double quotes.
- \[attributename$=value\] selects all elements whose attribute value ends with the provided value. If you want to include spaces, wrap the value in double quotes.
- \[attributename*=value\] selects all elements whose attribute value contains the provided value. If you want to include spaces, wrap the value in double quotes.
- Finally, there are a number of “colon” pseudo-classes (and “double-colon” pseudo-elements) that can be used in a selector rule as well. `p:first-child` selects every “\<p\>” tag that is the first child of its parent element, and `p:last-child` and `p:nth-child(10)` provide similar functionality.
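Several of the selector rules above can be tried out from Python with Beautiful Soup's `select` method, which accepts CSS selector syntax. The HTML fragment and the `texts` helper below are invented for this illustration; this is a sketch of how the rules behave, not code taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only for this illustration
html = """
<div class="box" id="main">
  <h1 class="intro">Title</h1>
  <p class="intro highlight">First</p>
  <p><span>Nested</span></p>
  <span>Direct</span>
  <a href="https://example.com/page.pdf" title="A file">PDF</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [element.get_text() for element in soup.select(selector)]


print(texts('h1'))               # by tag name
print(texts('.intro'))           # by class, regardless of tag
print(texts('#main > span'))     # direct children only
print(texts('#main span'))       # any descendant
print(texts('h1 + p'))           # sibling directly after the h1
print(texts('h1 ~ p'))           # any later sibling of the h1
print(texts('a[href$=".pdf"]'))  # attribute value ends with ".pdf"
print(texts('[title]'))          # any element with a "title" attribute
```

Comparing the `#main > span` and `#main span` lines shows the difference between the child rule and the descendant rule, and `h1 + p` versus `h1 ~ p` shows the difference between "directly after" and "somewhere after".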

-----------------------------
Next, play around with a web page using Chrome's Developer Tools and find instances of the "class" attribute. The CSS resource of the page is referenced through a `<link>` tag (note that pages can load multiple CSS files as well):

`<link rel="stylesheet" href="/w/load.php?[...];skin=vector">`

- The same CSS selector syntax can be used to quickly find and retrieve elements from an HTML page using Python. 

- Try right-clicking some HTML elements in the Elements tab of Chrome’s Developer Tools pane and press “Copy, Copy selector.” You will obtain a CSS selector. Ex., the following selector can fetch a table on the page: 
    - `#mw-content-text > div > table:nth-child(9)`
    - In words: “inside the element with id “mw-content-text,” get the child “div” element, and get the “table” element that is the 9th child of that div.”
    
Practice: 
`#mw-content-text > div.mw-parser-output > table:nth-child(24)`
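A selector of this shape can be handed to Beautiful Soup's `select_one` method. The sketch below uses a tiny, made-up fragment rather than the real Wikipedia page, so the child index is 3 instead of 24; it is only meant to show how `nth-child` counts.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, far smaller than the real Wikipedia page
html = """
<div id="mw-content-text">
  <div class="mw-parser-output">
    <p>Intro</p>
    <table><tr><td>first table</td></tr></table>
    <table><tr><td>second table</td></tr></table>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

selector = '#mw-content-text > div.mw-parser-output > table:nth-child(3)'
table = soup.select_one(selector)
print(table.get_text(strip=True))
```

Note that `nth-child(3)` counts all child elements of the div, including the paragraph, so it picks the second table here. With `table:nth-child(1)` nothing would match, because the first child is a `<p>` tag.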

## BeautifulSoup Package

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It tries to organize complexity: it helps to parse, structure, and organize the messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

How to install Beautiful Soup (the `beautifulsoup4` package) in Python:

- `pip install -U beautifulsoup4`

- `conda install beautifulsoup4`


In [2]:
# run multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import requests

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
#url = 'http://faculty.baruch.cuny.edu/nkumar/pytest/page.htm'
r = requests.get(url)

# html content from the page
html_contents = r.text

In [231]:
from bs4 import BeautifulSoup
help(BeautifulSoup)

Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(na

In [3]:
from bs4 import BeautifulSoup

# create a BeautifulSoup object
# (without an explicit parser, Beautiful Soup picks one itself and emits a warning)
html_soup = BeautifulSoup(html_contents)

# type(html_soup)

# To avoid the warning, name the parser explicitly
html_soup = BeautifulSoup(html_contents, 'html.parser')

# print(html_contents)

# get website text
# print(html_soup.get_text())
# print(html_soup.text)

# get website title
# print(html_soup.title)

### HTML parser

The reason why we use `BeautifulSoup(html_contents, "html.parser")` is that the Beautiful Soup library itself depends on an HTML parser to perform most of the bulk parsing work. In Python, there are multiple existing parsers:

- “html.parser”: a built-in Python parser that is decent (especially when using recent versions of Python 3) and requires no extra installation.

- “lxml”: which is very fast but requires an extra installation.

- “html5lib”: which aims to parse web pages in exactly the same way as a web browser does, but is a bit slower.

Since there are small differences between these parsers, Beautiful Soup warns you if
you don’t explicitly provide one, because your code might then behave slightly differently
when the same script runs on different machines. To solve this, we simply specify a
parser ourselves: `html_soup = BeautifulSoup(html_contents, 'html.parser')`
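As a small sketch of what "fixing bad HTML" means in practice, the snippet below feeds the overlapping-tags example from earlier to Beautiful Soup with the parser named explicitly. The exact shape of the repaired tree can differ between parsers, which is the reason for pinning one; the variable names are chosen just for this example.

```python
from bs4 import BeautifulSoup

# Overlapping tags: invalid HTML that a parser has to repair somehow
messy_html = '<div><p>Oops</div></p>'

# Naming the parser explicitly keeps the behavior the same on every machine
soup = BeautifulSoup(messy_html, 'html.parser')

print(soup)               # the repaired tree, serialized back to HTML
print(soup.p.get_text())  # the paragraph text survives the repair
```

Swapping `'html.parser'` for `'lxml'` or `'html5lib'` (if installed) and printing the soup again is a quick way to see how the parsers differ on the same input.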

Beautiful Soup’s main task is to take HTML content and transform it into a tree-based representation. Once you’ve created a BeautifulSoup object, there are two methods you’ll be using to fetch data from the page:
- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`

Both methods look very similar indeed, with the exception that find_all takes an
extra limit argument.

- The `name` argument defines the tag names you wish to “find” on the page. You can pass a string, or a list of tags. Leaving this argument out simply selects all elements.

- The `attrs` argument takes a Python dictionary of attributes and matches HTML elements that match those attributes.

- The `recursive` argument is a Boolean and governs the depth of the search. If set to True (the default value), the find and find_all methods will look into children, children’s children, and so on for elements that match your query. If it is False, it will only look at direct child elements.

- The `string` argument is used to perform matching based on the text content of elements.

- The `limit` argument is only used in the find_all method and can be used to limit the number of elements that are retrieved. 

    - `find` is functionally equivalent to calling `find_all` with the limit set to 1
        - the former returns the retrieved element directly
        - the latter will always return a list of items, even if it just contains a single element. 
    - when `find_all` cannot find anything, it returns an empty list, whereas if `find` cannot find anything, it returns None.

- `**keywords` is kind of a special case. Basically, this part of the method signature indicates that you can add in as many extra named arguments as you like, which will then simply be used as attribute filters. 
    - Writing `find(id='myid')` is hence the same as `find(attrs={'id': 'myid'})`. 
    - If you define both the attrs argument and extra keywords, all of these will be used together as filters. 
    - This functionality is mainly offered as a convenience in order to write easier-to-read code.
    - Caveats:
        - You cannot use `class` as a keyword argument, as it is a reserved Python keyword.
            - Beautiful Soup offers a workaround: use `find(class_='myclass')`. Or use the original way: `find(attrs={'class': 'myclass'})`.
        - However, `name` is used as the first argument name for these 2 functions. You need to use `attrs` if you want to filter based on the "name" HTML attribute.
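The arguments described above can be tried side by side on a small document. The HTML below is made up for this illustration, and the variable names are only there to label each call; treat it as a sketch of the behavior rather than a recipe.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only for this illustration
html = """
<div id="menu">
  <a class="nav" href="/home" name="first">Home</a>
  <p><a class="nav external" href="https://example.com">Example</a></p>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find(id='menu')                    # keyword used as an attribute filter

all_links = soup.find_all('a')                 # always a list
first_link = soup.find('a')                    # a single Tag, or None
limited = soup.find_all('a', limit=2)          # stop after two matches
direct = menu.find_all('a', recursive=False)   # direct children of the div only
by_class = soup.find_all('a', class_='nav')    # class_ avoids the reserved word
by_name_attr = soup.find_all(attrs={'name': 'first'})  # the HTML "name" attribute
by_text = soup.find_all('a', string='About')   # match on text content
missing = soup.find('table')                   # nothing found

print(len(all_links), first_link['href'], len(limited))
print([a['href'] for a in direct])
print([a['href'] for a in by_class])
print(by_name_attr[0].get_text(), by_text[0]['href'], missing)
```

The `recursive=False` call skips the link nested inside the paragraph, and `class_='nav'` also matches the link whose class attribute is `nav external`, since an element may carry several classes.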

#### What you can do with `find` and `find_all`
They return `Tag` objects and there are many things you can do with the `Tag` object.

- Access the `name` attribute to retrieve the tag name.

- Access the `contents` attribute to get a Python list containing the tag’s children (its direct descendants).

- The `children` attribute does the same but provides an iterator instead; the `descendants` attribute also returns an iterator, now including all the tag’s descendants in a recursive manner. These attributes are used when you call find and find_all.

- Similarly, you can also go “up” the HTML tree by using the parent and parents attributes. To go sideways (i.e., find next and previous elements at the same level in the hierarchy), `next_sibling`, `previous_sibling` and `next_siblings`, and `previous_siblings` can be used.

- Converting the `Tag` object to a string shows both the tag and its HTML content as a string. This is what happens if you print the `Tag` object, for instance, or wrap such an object in the `str` function.

- Access the attributes of the element through the `attrs` attribute of the `Tag` object. You can also directly use the `Tag` object itself as a dictionary. 

- Use the `text` attribute to get the contents of the `Tag` object as clear text (with the HTML tags removed).
    - `get_text()` does the same thing; passing `strip=True` additionally strips leading and trailing whitespace from each piece of text.
    - You can also specify a string to be used to join the bits of text enclosed in the element together, ex., `get_text('***')`.
- If a tag has only one child and that child is text, you can use the `string` attribute to get the textual content. 
    - If a tag contains other HTML tags nested within, `string` will return `None` whereas `text` can recursively fetch all the text.
    
- Last, not all `find` and `find_all` searches need to start from the BeautifulSoup objects. Every Tag object itself can be used as a new root from which new searches can be started. 
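The following sketch runs through the `Tag` features listed above on a one-line document. The HTML is invented for this example (loosely modeled on the page heading), so the attribute values are assumptions and will differ from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, modeled loosely on the heading of the Wikipedia page
html = ('<div><h1 id="firstHeading" class="big">List of <i>Game of Thrones</i>'
        ' episodes</h1><p>Intro</p></div>')
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

print(h1.name)             # tag name
print(h1.contents)         # direct children as a list
print(h1.parent.name)      # one step up the tree
print(h1.next_sibling)     # next node on the same level
print(h1.attrs, h1['id'])  # attributes, dictionary style
print(h1.text)             # all text, tags removed
print(h1.get_text('***'))  # same, joined with a custom separator
print(h1.string)           # None: the tag has more than one child
print(h1.i.string)         # the <i> tag has a single text child
print(h1.find('i'))        # a Tag can be the root of a new search
```

The `string` versus `text` difference is the one that most often surprises: `h1.string` is `None` here because the heading mixes text with a nested `<i>` tag, while `h1.text` still returns the full heading.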

In [240]:
html_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of Game of Thrones episodes - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6a012443-228a-4eee-93b8-264e40f72009","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":1087086788,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: unsupported parameter","Articles containing potentially dated statements from Augus

In [236]:
# find the first heading tag
first_h1 = html_soup.find('h1')
first_h1
str(first_h1)

<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>

'<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>'

In [245]:
# find() can be used on tag object directly 
html_soup.find('body').find('h1')

<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>

In [247]:
# find_all()  
first_h2 = html_soup.find_all('h2')
print(first_h2)

[<h2 id="mw-toc-heading">Contents</h2>, <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>, <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>, <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>, <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>, <h2><span class="mw-headline" id="References">References</span></h2>, <h2><span class="mw-headline" id="External_links">External links</span></h2>, <h2>Navigation menu</h2>]


In [274]:
first_h2 = html_soup.find_all(name = 'h2')
first_h2                            

[<h2 id="mw-toc-heading">Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>,
 <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>,
 <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>,
 <h2>Navigation menu</h2>]

In [272]:
# using attrs argument
first_h2 = html_soup.find_all(name = 'h2', attrs = {'id':'mw-toc-heading'})
print(first_h2)

[<h2 id="mw-toc-heading">Contents</h2>]


In [275]:
first_h1

<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>

In [279]:
# methods for the h1 object

# tag name 
print(first_h1.name)

# contents between HTML tags
print(first_h1.contents)

# attributes
first_h1.attrs

h1
['List of ', <i>Game of Thrones</i>, ' episodes']


{'id': 'firstHeading', 'class': ['firstHeading', 'mw-first-heading']}

In [280]:
# tag object
# Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
type(first_h1)

# extract attrs
# treat tag object as a dict
print(first_h1.attrs['id'])
# use get method for tag object
print(first_h1.get('id'))

bs4.element.Tag

firstHeading
firstHeading


In [135]:
# use Tag object itself as dictionary
print(first_h1['class'])
print(first_h1['id'])

['firstHeading', 'mw-first-heading']
firstHeading


In [31]:
#find and find_all methods

#syntax:
#find(name, attrs, recursive, string, **keywords)

#find first <a> tag (hyperlink)
html_soup.find('a')

print(html_soup.a)

print('\n')

#find first <b> tag (bold text) beneath <body> tag
html_soup.body.b
print(html_soup.find('body').find('b'))

print('\n')

print(html_soup.find('body').find('b').text)
# alternative to get text
print(html_soup.find('body').find('b').get_text())

<a id="top"></a>


<b>This is an <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a> of this page, as edited by <span id="mw-revision-name"><a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a> <span class="mw-usertoollinks">(<a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a> | <a class="mw-usertoollinks-contribs" href="/wiki/Special:Contributions/Alex_21" title="Special:Contributions/Alex 21">contribs</a>)</span></span> at <span id="mw-revision-date">22:27, 26 September 2017</span><span id="mw-revision-summary"> <span class="comment">(Work on article at <a class="mw-redirect" href="/wiki/Draft:Game_of_Thrones_(season_8)" title="Draft:Game of Thrones (season 8)">Draft:Game of Thrones (season 8)</a>)</span></span>. The present address (URL) is a <a href="/wiki/Help:Permanent_link" title="Help:Permanent link">permanent link</a> to this revision, which may differ significantly f

In [284]:
#find first <p> tag (paragraph)
print(html_soup.p)
html_soup.find('p')
html_soup.find('p').text

<p><b>This is an <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a> of this page, as edited by <span id="mw-revision-name"><a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a> <span class="mw-usertoollinks">(<a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a> | <a class="mw-usertoollinks-contribs" href="/wiki/Special:Contributions/Alex_21" title="Special:Contributions/Alex 21">contribs</a>)</span></span> at <span id="mw-revision-date">22:27, 26 September 2017</span><span id="mw-revision-summary"> <span class="comment">(Work on article at <a class="mw-redirect" href="/wiki/Draft:Game_of_Thrones_(season_8)" title="Draft:Game of Thrones (season 8)">Draft:Game of Thrones (season 8)</a>)</span></span>. The present address (URL) is a <a href="/wiki/Help:Permanent_link" title="Help:Permanent link">permanent link</a> to this revision, which may differ significantly from the <span cl

<p><b>This is an <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a> of this page, as edited by <span id="mw-revision-name"><a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a> <span class="mw-usertoollinks">(<a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a> | <a class="mw-usertoollinks-contribs" href="/wiki/Special:Contributions/Alex_21" title="Special:Contributions/Alex 21">contribs</a>)</span></span> at <span id="mw-revision-date">22:27, 26 September 2017</span><span id="mw-revision-summary"> <span class="comment">(Work on article at <a class="mw-redirect" href="/wiki/Draft:Game_of_Thrones_(season_8)" title="Draft:Game of Thrones (season 8)">Draft:Game of Thrones (season 8)</a>)</span></span>. The present address (URL) is a <a href="/wiki/Help:Permanent_link" title="Help:Permanent link">permanent link</a> to this revision, which may differ significantly from the <span cl

'This is an old revision of this page, as edited by Alex 21 (talk | contribs) at 22:27, 26 September 2017 (Work on article at Draft:Game of Thrones (season 8)). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.'

In [26]:
# rendered text
print(first_h1.text)
# rendered text
print(first_h1.get_text())

List of Game of Thrones episodes
List of Game of Thrones episodes


In [286]:
# you can try to match certain string in the content 
# here no text 'jupyter notebook' exists
print(html_soup.find(string = 'jupyter notebook'))

None


In [287]:
# limit in find_all()
html_soup.find_all('a',limit = 5)

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a>,
 <a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a>,
 <a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a>]

In [294]:
html_soup.find_all('a', class_='external text', limit = 5)

[<a class="external text" href="https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes">current revision</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&amp;action=edit">[update]</a>,
 <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>,
 <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">the original</a>,
 <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>]

In [296]:
html_soup.find_all(attrs = {'class':'external text'},limit = 5)

[<a class="external text" href="https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes">current revision</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&amp;action=edit">[update]</a>,
 <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>,
 <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">the original</a>,
 <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>]

In [288]:
html_soup.find_all('a', string = 'Game of Thrones')

[<a href="/wiki/Game_of_Thrones" title="Game of Thrones">Game of Thrones</a>,
 <a class="external text" href="https://www.imdb.com/title/tt0944947/episodes" rel="nofollow"><i>Game of Thrones</i></a>,
 <a class="external text" href="https://www.rottentomatoes.com/tv/game-of-thrones//" rel="nofollow"><i>Game of Thrones</i></a>,
 <a href="/wiki/Game_of_Thrones_(2012_video_game)" title="Game of Thrones (2012 video game)"><i>Game of Thrones</i></a>,
 <a href="/wiki/Game_of_Thrones_(2014_video_game)" title="Game of Thrones (2014 video game)"><i>Game of Thrones</i></a>,
 <a href="/wiki/Game_of_Thrones" title="Game of Thrones">Game of Thrones</a>]

In [304]:
# match several tag names at once by passing them in a list
# (findAll is the older alias of find_all and returns the same result)
html_soup.find_all(['h1','h2'])

html_soup.find_all('h1')

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>,
 <h2 id="mw-toc-heading">Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>,
 <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>,
 <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>,
 <h2>Navigation menu</h2>]

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>]

In [151]:
#find_all method
#find_all(name, attrs, recursive, string, limit, **keywords)

#find all <a> tags (external links)
html_soup.find_all('a', class_='external text',limit=1)

#find all <a> tags (external links)
#attrs argument in find_all
html_soup.find_all(attrs = {'class':'external text'},limit = 1)

#find all <a> tags with the content "Game of Thrones"
html_soup.find_all('a',string = 'Game of Thrones', limit = 1)

#match several tag names by passing a list
html_soup.find_all(['h1','h2'])

# html_soup.find_all(id = 'bodyContent')
html_soup.find_all('cite',limit = 1)


[<a class="external text" href="https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes">current revision</a>]

[<a class="external text" href="https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes">current revision</a>]

[<a href="/wiki/Game_of_Thrones" title="Game of Thrones">Game of Thrones</a>]

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>,
 <h2 id="mw-toc-heading">Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>,
 <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>,
 <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>,
 <h2>Navigation menu</h2>]

[<cite class="citation web cs1" id="CITEREFFowler2011">Fowler, Matt (April 8, 2011). <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. Archived from <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">the original</a> on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>]

In [306]:
# find the first 3 <cite> tags that have the class "citation"
cites = html_soup.find_all('cite', attrs = {'class':'citation'}, limit = 3)
cites

[<cite class="citation web cs1" id="CITEREFFowler2011">Fowler, Matt (April 8, 2011). <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. Archived from <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">the original</a> on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news cs1" id="CITEREFFleming2007">Fleming, Michael (January 16, 2007). <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. Archived from <a class="external text" href="http://www.variety.c

In [311]:
for citation in cites:
    print(citation.get_text())
    # inside of this cite element, find the first <a> tag
    link = citation.find('a')
    # ... and show its URL
    print(link.get('href'))

Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.
https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html
Fleming, Michael (January 16, 2007). "HBO turns Fire into fantasy series". Variety. Archived from the original on May 16, 2012. Retrieved September 3, 2016.
https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14
"Game of Thrones". Emmys.com. Retrieved September 17, 2016.
http://www.emmys.com/shows/game-thrones


In [314]:
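# find returns the first matching tag
html_soup.find('h1')
# dotted access such as html_soup.h1 is shorthand for find and also returns a Tag object
type(html_soup.h1)

# traverse a chain of tag names to drill down into the tree
html_soup.find('body').find('table').find('tr').find('th')

# retrieve an attribute from the tag at the end of the chain
html_soup.body.table.tr.th.get('style')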

<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>

bs4.element.Tag

<th rowspan="2" scope="col" style="min-width:50px;padding:0 8px">Season</th>

'min-width:50px;padding:0 8px'

In [194]:
# for find_all, this is equivalent
html_soup.find_all('h1')
html_soup('h1')

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>]

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>]
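
As a rough sketch of what chained access does when a tag is missing: the made-up snippet below (not the Wikipedia page) contains no `<table>`. Dotted access behaves like `find` and returns `None` in that case, so going further down the chain would raise an `AttributeError` unless you check first.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it deliberately has no <table>.
soup = BeautifulSoup('<body><h1 id="t">Title</h1></body>', 'html.parser')

# Dotted access is shorthand for find().
print(soup.h1 == soup.find('h1'))

# No <table> in this snippet, so the lookup gives None.
print(soup.body.table)

# Guard before chaining further down the tree.
table = soup.find('table')
first_cell = table.find('th') if table is not None else None
print(first_cell)
```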

#### Code Example

The example below shows how to scrape an episode table from the page and store its rows neatly in a list.

In [494]:
#<tr> tag defines a row in an HTML table
#<th> tag defines a header cell in an HTML table

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)

html_contents = r.text

html_soup = BeautifulSoup(html_contents, 'html.parser')

# use a list to store episode list
episodes = []

ep_tables = html_soup.find_all('table', attrs = {'class':'wikiepisodetable'}, limit = 1)

for table in ep_tables:
    
    headers = []
    rows = table.find_all('tr')
    
    # start by fetching the header cells from the first row to determine 
    # the field names 
    
    for header in rows[0].find_all('th'):
        headers.append(header.text)
    
    # then go through all the rows except the first row
    for row in rows[1:]:
        values = []
        # get the column cells, the first one being inside a th-tag (header cell)
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]:values[i] for i in range(len(values))}
            episodes.append(episode_dict)

# show the result
for episode in episodes:
    print(episode)

{'No.overall': '1', 'No. inseason': '1', 'Title': '"Winter Is Coming"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'April\xa017,\xa02011\xa0(2011-04-17)', 'U.S. viewers(millions)': '2.22[21]'}
{'No.overall': '2', 'No. inseason': '2', 'Title': '"The Kingsroad"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'April\xa024,\xa02011\xa0(2011-04-24)', 'U.S. viewers(millions)': '2.20[22]'}
{'No.overall': '3', 'No. inseason': '3', 'Title': '"Lord Snow"', 'Directed by': 'Brian Kirk', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date\u200a[20]': 'May\xa01,\xa02011\xa0(2011-05-01)', 'U.S. viewers(millions)': '2.44[23]'}
{'No.overall': '4', 'No. inseason': '4', 'Title': '"Cripples, Bastards, and Broken Things"', 'Directed by': 'Brian Kirk', 'Written by': 'Bryan Cogman', 'Original air date\u200a[20]': 'May\xa08,\xa02011\xa0(2011-05-08)', 'U.S. viewers(millio

Summary:

- We don’t come up with the `find_all('table', attrs = {'class':'wikiepisodetable'})` call from thin air, although it might seem that
way just by looking at the code. Recall what we said earlier about your
browser’s developer tools becoming your best friend. Inspect the
episode tables on the page. Note how they’re all defined by means
of a `<table>` tag. However, the page also contains tables we do not
want to include. Some further investigation leads us to a solution: all
the episode tables have “wikiepisodetable” as a class, whereas the
other tables do not. (The example passes `limit = 1`, so only the first
episode table is processed.) You’ll often have to puzzle your way through a
page first before coming up with a solid approach. In many cases,
you’ll have to perform multiple `find` and `find_all` iterations before
ending up where you want to be.
    
- For every table, we first want to retrieve the headers to use as keys in
a Python dictionary. To do so, we first select the first `<tr>` tag, and
select all `<th>` tags within it.
    
- Next, we loop through all the rows (the `<tr>` tags), except for the first one (the header row). For each row, we loop through the `<th>`
and `<td>` tags to extract the column values (the first column is
wrapped inside a `<th>` tag, the others in `<td>` tags, which is why
we need to handle both). At the end of each row, we add
a new entry to the `episodes` list. Each entry is a
normal Python dictionary (`episode_dict`). The way this object is
constructed might look a bit strange if you’re not very familiar
with Python: Python allows us to construct a complete list or
dictionary “in one go” by putting a “for” construct inside the “[...]”
or “{...}” brackets (a comprehension). Here, we use this to loop through the
`headers` and `values` lists and build the dictionary object. Note that this
assumes that both of these lists have the same length, and that the
order of both matches, so that the header at `headers[2]`,
for instance, corresponds to the value at
`values[2]`. Since we’re dealing with rather simple tables here, this
is a safe assumption.
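
The pairing of headers and values can also be written with `zip`. The sketch below uses a made-up sample row (only the field names echo the table above) and is an alternative, not what the example itself does. One difference worth knowing: `zip` stops at the shorter list, while indexing `headers[i]` raises an `IndexError` when a row has more cells than there are headers.

```python
# Hypothetical sample data standing in for one scraped table row.
headers = ['No.overall', 'Title', 'Directed by']
values = ['1', '"Winter Is Coming"', 'Tim Van Patten']

# Index-based comprehension, as in the example above.
by_index = {headers[i]: values[i] for i in range(len(values))}

# Equivalent pairing with zip; it stops at the shorter of the two lists.
by_zip = dict(zip(headers, values))

print(by_index == by_zip)

# A row with an extra cell: zip silently drops the unmatched value,
# whereas headers[i] would raise an IndexError for it.
extra = values + ['unexpected cell']
print(dict(zip(headers, extra)) == by_zip)
```

Which behavior you want depends on the page: silently dropping a cell hides layout changes, so an explicit length check may be the safer choice for a long-running scraper.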

### More on Beautiful Soup

Instead of an exact tag name, `find_all` also accepts a regular expression built with Python's `re` module. This lets us match, for example, every tag whose name starts with a given letter.

#### Regex
A regular expression (regex) is a sequence of characters that defines a
search pattern. It is frequently used in string searching and matching code to find
(and replace) fragments of strings. Although regular expressions are very powerful constructs,
they can also be misused. 
- Do not go overboard with long or complex regular expressions, as they’re not readable and it might be hard to figure out what a particular piece of regex is doing later on. 
- You should avoid using regex to parse HTML pages. 
    - Always use an HTML parser like Beautiful Soup to perform the grunt work. You can then use small snippets of regex to find or extract pieces of content.
    
- I wrote a short introduction to Regex [here](https://github.com/YuxiaoLuo/RUG-RUserGroup/blob/main/RUG_RgularExpr.md), and here is a [cheatsheet](https://github.com/YuxiaoLuo/RUG-RUserGroup/blob/main/images/regex_cheatsheet.pdf) about using Regex.
- [regexone.com](https://regexone.com/) can help you practice using Regex.
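
As a small illustration of using regex only on pieces of content that the parser has already extracted, the sketch below cleans two cell values copied from the episode output above. The variable names are made up for this example.

```python
import re

# Values copied from the episode output above.
raw_viewers = '2.22[21]'
raw_date = 'April\xa017,\xa02011\xa0(2011-04-17)'

# Drop footnote markers such as "[21]" and convert to a number.
viewers = float(re.sub(r'\[\d+\]', '', raw_viewers))

# Pull the ISO-formatted date out of the parentheses.
match = re.search(r'\((\d{4}-\d{2}-\d{2})\)', raw_date)
air_date = match.group(1) if match else None

print(viewers, air_date)
```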

In [6]:
#find all the tags whose names start with the letter "b", then collect the tag names:
import re

taglist = []
for tag in html_soup.find_all(re.compile("^b")):
    #print(tag.name)
    taglist.append(tag.name)

set(taglist)

{'b', 'bdi', 'body', 'br'}

In [325]:
#find all the tags whose names end with the letter "a", then collect the tag names:
import re

taglist = []
for tag in html_soup.find_all(re.compile("a$")):
    #print(tag.name)
    taglist.append(tag.name)

set(taglist)

{'a', 'meta'}

In [218]:
#find all the tags whose names contain the letter "t", then print the set of tag names:

taglist1 = []
for tag in html_soup.find_all(re.compile("t")):
    #print(tag.name)
    taglist1.append(tag.name)

print(set(taglist1))

{'title', 'html', 'style', 'footer', 'script', 'cite', 'input', 'td', 'tr', 'noscript', 'table', 'th', 'meta', 'tbody'}


In [352]:
#get one link from a web page
html_soup.find_all('a')[1]
html_soup.find_all('a')[1].get('href')

html_soup.find_all('a')[1].attrs['href']
#The href attribute specifies the URL of the page the link goes to
#If the href attribute is not present, the <a> tag will not be a hyperlink

# get other html attributes in the a tag
html_soup.find_all('a')[1]['title']

# get the child tag (img) under a tag
html_soup.find_all('a')[1].find('img')
# get the html attribute of child tag img
html_soup.find_all('a')[1].find('img')['src']

<a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>

'/wiki/Wikipedia:Featured_lists'

'/wiki/Wikipedia:Featured_lists'

'This is a featured list. Click here for more information.'

<img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>

'//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png'

In [356]:
html_soup.find_all('a',limit = 5)
html_soup.find_all('a', limit =5)[1].get('href')

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a>,
 <a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a>,
 <a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a>]

'/wiki/Wikipedia:Featured_lists'

In [357]:
#get all links from a website:
for links in html_soup.find_all('a'):
    print(links.get('href'))

None
/wiki/Wikipedia:Featured_lists
/wiki/Help:Page_history
/wiki/User:Alex_21
/wiki/User_talk:Alex_21
/wiki/Special:Contributions/Alex_21
/wiki/Draft:Game_of_Thrones_(season_8)
/wiki/Help:Permanent_link
https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes
/wiki/User:Alex_21
/wiki/User_talk:Alex_21
/wiki/Special:Contributions/Alex_21
/wiki/Draft:Game_of_Thrones_(season_8)
/w/index.php?title=List_of_Game_of_Thrones_episodes&diff=prev&oldid=802553687
/w/index.php?title=List_of_Game_of_Thrones_episodes&direction=prev&oldid=802553687
/wiki/List_of_Game_of_Thrones_episodes
/w/index.php?title=List_of_Game_of_Thrones_episodes&diff=cur&oldid=802553687
/w/index.php?title=List_of_Game_of_Thrones_episodes&direction=next&oldid=802553687
/w/index.php?title=List_of_Game_of_Thrones_episodes&diff=next&oldid=802553687
#mw-head
#searchInput
/wiki/File:Game_of_Thrones_2011_logo.svg
/wiki/Game_of_Thrones
/wiki/Fantasy
/wiki/Television_drama
/wiki/David_Benioff
/wiki/D._B._Weiss
/wiki/A_Song_of_I

Below is a code example showing how to use Regex to match the links ending with `.org/` using `re.compile` and `re.search`.

In [17]:
# regex_pattern = re.compile(r'([A-Za-z0-9-]+)\.org/$')
# escape the dot so it matches a literal "." instead of any character
regex_pattern = re.compile(r'\.org/\Z')

results = []
for link in html_soup.find_all('a'):
    href = link.get('href')
    if href:
        # search each href for a trailing ".org/"
        results.append(regex_pattern.search(href))

# show the result
for result in results:
    if result:
        print(result.string)

https://wikimediafoundation.org/
https://www.mediawiki.org/
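
To see why escaping the dot matters in a pattern like this, the sketch below compares an unescaped and an escaped version on a few made-up URLs (the `example.com` and `example.org` addresses are placeholders, not links from the page). An unescaped `.` matches any character, so it also accepts a path that merely ends in "org/".

```python
import re

unescaped = re.compile(r'.org/\Z')   # "." matches any single character
escaped = re.compile(r'\.org/\Z')    # "\." matches only a literal dot

# Hypothetical URLs for illustration.
urls = [
    'https://www.mediawiki.org/',
    'https://example.com/cyborg/',
    'https://example.org/page',
]

for url in urls:
    print(url, bool(unescaped.search(url)), bool(escaped.search(url)))
```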


In [480]:
#<li> defines a list item

url = 'https://en.wikipedia.org/wiki/Baruch_College'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

newlist1 = []
for item in soup.find_all('li'):
    # take the first <a> tag inside each list item, if there is one
    a = item.find('a')
    if a is not None:
        newlist1.append(a.get('href'))
        # a.attrs['href'] also works when the attribute is present

for i in newlist1:
    print(i)


#History
#Presidents_of_Baruch_College
#Academics
#Campus
#Lawrence_and_Eris_Field_Building
#Information_and_Technology_Building
#Newman_Vertical_Campus
#Campus_location
#Academic_centers_and_institutes
#Partnerships
#Student_life
#Athletics
#Admissions
#Rankings
#Notable_people
#Alumni
#Business
#Politics,_government,_and_law
#The_arts
#Literature,_journalism,_and_tech
#Sports
#Faculty
#References
#External_links
/wiki/New_York_City_Subway
/wiki/MTA_Regional_Bus_Operations
#cite_note-30
#cite_note-31
#cite_note-32
#cite_note-33
#cite_note-34
#cite_note-35
#cite_note-36
#cite_note-37
#cite_note-38
#cite_note-39
#cite_note-40
/wiki/Bernard_L._Schwartz_Communication_Institute
#cite_note-42
#cite_note-43
#cite_note-44
#cite_note-45
#cite_note-46
#cite_note-47
/wiki/JPMorgan_Chase
/wiki/JPMorgan_Chase
/wiki/Brooklyn_Law_School
/wiki/Baruch_College_Campus_High_School
/wiki/American_Graduate_School_in_Paris
#cite_note-64
/wiki/Washington_Monthly
/wiki/CNBC
/wiki/Entrepreneur_(magazine)
/wiki

In [481]:
#request the page and create a BeautifulSoup object
url = 'https://en.wikipedia.org/wiki/Baruch_College'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# the first 10 <cite> tags with the class "citation"
cites = soup.find_all('cite',class_='citation',limit = 10)

newlist = []
for cite in cites:
    # first <a> tag inside the citation; skip citations without a link
    a = cite.find('a')
    if a is not None:
        newlist.append(a.get('href'))
    
newlist 

['https://www.alumni.baruch.cuny.edu/uploaded/Annual_Reports/BCF_Annual_Report_2020-21-Final.pdf',
 'https://ir.baruch.cuny.edu/wp-content/uploads/sites/23/2021/01/Factsheet.Fall_2020_Finalx.pdf',
 'http://colleges.usnews.rankingsandreviews.com/best-colleges/baruch-cuny-4766',
 'http://www.baruch.edu/news/waldron_announces_gifts.htm',
 'https://web.archive.org/web/20090918142223/http://www.baruch.cuny.edu/campaign/index.htm',
 'https://www.nytimes.com/2011/11/22/education/cuny-students-clash-with-police-in-manhattan.html',
 'https://www1.cuny.edu/mu/forum/2020/02/03/cuny-appoints-next-president-of-baruch-college/',
 'https://www.baruch.cuny.edu//academic-degree-programs/',
 'http://zicklin.baruch.cuny.edu/',
 'http://www.baruch.cuny.edu/wsas/']

Apart from `find` and `find_all`, there are a number of other methods for searching the HTML tree. They work much like `find` and `find_all`, but search different parts of the tree:
- `find_parent` and `find_parents` work their way up the tree, looking at a tag’s parents using its `parents` attribute. Remember that `find` and `find_all` work their way down the tree, looking at a tag’s descendants.
- `find_next_sibling` and `find_next_siblings` iterate over and match a tag’s siblings using the `next_siblings` attribute.
- `find_previous_sibling` and `find_previous_siblings` do the same but use the `previous_siblings` attribute.
- `find_next` and `find_all_next` use the `next_elements` attribute to iterate over and match whatever comes after a tag in the document.
- `find_previous` and `find_all_previous` perform the search backward using the `previous_elements` attribute instead.
- Remember that `find` and `find_all` work on the `children` attribute when the `recursive` argument is set to `False`, and on the `descendants` attribute when `recursive` is set to `True` (the default).
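
The methods listed above can be tried out on a small document. The HTML below is made up for illustration (it is not taken from the Wikipedia page), and the sketch only touches a few of the methods.

```python
from bs4 import BeautifulSoup

# A made-up snippet, not taken from the Wikipedia page.
html = """
<div id="box">
  <ul class="lst">
    <li>first</li>
    <li id="mid">second</li>
    <li>third</li>
  </ul>
  <p>after the list</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
mid = soup.find('li', id='mid')

# Up the tree: the enclosing <div>.
print(mid.find_parent('div')['id'])

# Sideways: the neighbouring <li> tags on the same level.
print(mid.find_next_sibling('li').get_text())
print(mid.find_previous_sibling('li').get_text())

# Forward in document order, regardless of nesting.
print(mid.find_next('p').get_text())

# recursive=False only looks at direct children; the <li> tags are
# grandchildren of the <div>, so nothing is found.
box = soup.find('div', id='box')
print(box.find_all('li', recursive=False))
print(len(box.find_all('li')))
```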

### CSS selectors: `.select()` method 

Pass a CSS selector rule as a string. The `.select()` method runs the CSS selector against the parsed document and returns all the matching elements.

For example,

```
# Find all <a> tags
html_soup.select('a')

# Find the element with the info id
html_soup.select('#info')

# Find <div> tags with both classa and classb CSS classes
html_soup.select('div.classa.classb')

# Find <a> tags with an href attribute starting with http://example.com/
html_soup.select('a[href^="http://example.com/"]')

# Find <li> tags which are children of <ul> tags with class lst
html_soup.select('ul.lst > li')
```



- There are often multiple ways to write a CSS selector to get the same result.
    - Ex., instead of writing “cite a,” you can also go overboard and write “body div.reflist ol.references li cite.citation a” and get the same result.
- Websites change over time, and such changes might break your selectors. Including extra checks in your code and providing early warning signs can help a lot to build robust web scrapers.
- [Details](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)
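
One possible form of the "extra checks" mentioned above is a small wrapper that fails loudly when a selector stops matching. The sketch below is only an assumption about how you might do this: `select_or_fail` is a hypothetical helper, not part of Beautiful Soup, and the HTML is made up so the selectors from the earlier list have something to match.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<div id="info" class="classa classb">
  <ul class="lst">
    <li><a href="http://example.com/one">one</a></li>
    <li><a href="/local">two</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def select_or_fail(soup, selector):
    """Hypothetical helper: run a selector and raise if nothing matches."""
    found = soup.select(selector)
    if not found:
        raise ValueError(f'selector matched nothing: {selector!r}')
    return found


items = select_or_fail(soup, 'ul.lst > li')
links = select_or_fail(soup, 'a[href^="http://example.com/"]')
print(len(items), links[0]['href'])

# A selector that no longer matches is reported instead of silently
# yielding an empty list.
try:
    select_or_fail(soup, 'table.wikiepisodetable')
except ValueError as err:
    print(err)
```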

In [113]:
print('Type: ',type(html_soup.select('a',limit=1)))
print('Type:', type(html_soup.find_all('a')))

Type:  <class 'bs4.element.ResultSet'>
Type: <class 'bs4.element.ResultSet'>


In [24]:
# Find the element with id
print(html_soup.select('#firstHeading'))
print(html_soup.find('h1'))

[<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>]
<h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>


In [496]:
# Find elements by several ids at once (selectors separated by a comma)
html_soup.select('#firstHeading, #top')

[<a id="top"></a>,
 <h1 class="firstHeading mw-first-heading" id="firstHeading">List of <i>Game of Thrones</i> episodes</h1>]

In [30]:
# find tags beneath other tags
html_soup.select('body div a', limit = 5)

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/Help:Page_history" title="Help:Page history">old revision</a>,
 <a class="mw-userlink" href="/wiki/User:Alex_21" title="User:Alex 21"><bdi>Alex 21</bdi></a>,
 <a class="mw-usertoollinks-talk" href="/wiki/User_talk:Alex_21" title="User talk:Alex 21">talk</a>]

In [504]:
#find tags
html_soup.select('title')
html_soup.find('title')

#find tags directly beneath other tags
html_soup.select('body > div > a')

[<title>List of Game of Thrones episodes - Wikipedia</title>]

<title>List of Game of Thrones episodes - Wikipedia</title>

[<a id="top"></a>]

In [517]:
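# note: there is no space between "a" and "[href]"; 'a [href]' would instead
# look for elements with an href attribute nested inside the <a> tags
for link in html_soup.select('ol.references cite a[href]'):
    print(link.get('href'))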

In [507]:
# find specific tags in the reference section, with select method

for i in soup.select("ol.references li#cite_note-1 a[href]"):
    print(i.get('href'))
    #print(i)

#cite_ref-1
https://www.alumni.baruch.cuny.edu/uploaded/Annual_Reports/BCF_Annual_Report_2020-21-Final.pdf


In [516]:
#find all citation links from a webpage with select method

for i in soup.select('ol.references cite a[href]'):
    print(i.get('href'))

https://www.alumni.baruch.cuny.edu/uploaded/Annual_Reports/BCF_Annual_Report_2020-21-Final.pdf
https://ir.baruch.cuny.edu/wp-content/uploads/sites/23/2021/01/Factsheet.Fall_2020_Finalx.pdf
http://colleges.usnews.rankingsandreviews.com/best-colleges/baruch-cuny-4766
http://www.baruch.edu/news/waldron_announces_gifts.htm
https://web.archive.org/web/20090918142223/http://www.baruch.cuny.edu/campaign/index.htm
http://www.baruch.cuny.edu/campaign/index.htm
https://www.nytimes.com/2011/11/22/education/cuny-students-clash-with-police-in-manhattan.html
https://www1.cuny.edu/mu/forum/2020/02/03/cuny-appoints-next-president-of-baruch-college/
https://www.baruch.cuny.edu//academic-degree-programs/
http://zicklin.baruch.cuny.edu/
http://www.baruch.cuny.edu/wsas/
http://www.baruch.cuny.edu/spa/home.php
http://zicklin.baruch.cuny.edu/programs/doctoral/areas-of-study
https://web.archive.org/web/20171223211306/http://www.baruch.cuny.edu/wsas/academics/psychology/Psychology_PhD.htm
http://www.baruch.cu

### Other Beautiful Soup objects
There are two more object types in Beautiful Soup that, although less commonly used, are useful to know about:

- `NavigableString` objects: these are used to represent text within tags, rather than the tags themselves. Some Beautiful Soup functions and attributes return such objects, for instance the `string` attribute of tags. Attributes such as `descendants` also include these in their listings. In addition, if you use the `find` or `find_all` methods and supply a `string` argument value without a `name` argument, these return `NavigableString` objects as well, instead of `Tag` objects.
- `Comment` objects: these are used to represent HTML comments (found in comment tags, “`<!-- ... -->`”). These are very rarely useful when web scraping.
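
A minimal sketch of both object types, using a one-line made-up document rather than the Wikipedia page:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Made-up HTML containing text and a comment.
html = '<p>Hello <b>world</b><!-- a hidden note --></p>'
soup = BeautifulSoup(html, 'html.parser')

# .string on a tag with a single text child gives a NavigableString.
bold_text = soup.b.string
print(type(bold_text).__name__, repr(str(bold_text)))

# A string argument without a tag name returns text nodes, not tags.
print(soup.find_all(string='world'))

# Comment is a subclass of NavigableString, so comments can be
# filtered out of the text nodes with isinstance.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print([str(c).strip() for c in comments])
```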