# Topic 10: HTML, CSS, & Web Scraping


- 03/11/21
- onl01-dtsc-ft-022221


## Learning Objectives / Outline




- **Part 1: HTML & CSS: Beyond Web Scraping**
    - Brief Overview of HTML & CSS
    - Learn when you will use HTML & CSS in your data science journey
    - Demonstrate the power of CSS with a Plotly/Dash dashboard. 
    - Demonstrate the value of learning HTML/CSS with VS Code.
    <br><br>

- **Part 2: Walk through the basics of web scraping:**
    - Learn to use Chrome's Inspect tool to hunt down target website data
    - Learn how to use Beautiful Soup to scrape the contents of a web page. 

    



### Questions

# 📓 Part 1: HTML & CSS

- HTML is responsible for the _content_ of a website.
- CSS is responsible for the appearance / layout of a website.



## HTML Overview & Tags


- All HTML pages have the following components:
    1. Document declaration, followed by the opening `html` tag<br>
    `<!DOCTYPE html>`<br>
    `<html>`
    2. Head<br>
    `<head> <title></title> </head>`
    3. Body, followed by the closing `html` tag<br>
    `<body>` ... content ... `</body>`<br>
    `</html>`



- HTML content is divided into **tags** that specify the type of content.
    - [Basic Tags Reference Table](https://www.w3schools.com/tags/ref_byfunc.asp)
    - [Full Alphabetical Tag Reference Table](https://www.w3schools.com/tags/)
    
    - **tags** have attributes
        - [Tag Attributes](https://www.w3schools.com/html/html_attributes.asp)
        - Attributes are always defined in the start/opening tag. 

    - **tags** may have several content-creator-defined attributes such as `class` or `id`
    
    
- We will **use the tag and its identifying attributes to isolate content** we want on a web page with BeautifulSoup.
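
To make the idea of tags and attributes concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module to list every opening tag in a small page along with its attributes. The sample page and the `TagLister` class are made up for this illustration; they are not part of the lesson's files.

```python
from html.parser import HTMLParser

# A made-up page used only for this illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Demo page</title></head>
<body>
  <h1 id="main-header">Top films</h1>
  <p class="important-topic">First paragraph.</p>
  <p class="important-topic">Second paragraph.</p>
</body>
</html>"""


class TagLister(HTMLParser):
    """Collect (tag name, attributes) for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the opening tag
        self.found.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)
for name, attributes in lister.found:
    print(name, attributes)
```

Attributes such as `id` and `class` show up only on the opening tags, which is exactly the information we will hand to BeautifulSoup when isolating content.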

___

## CSS Overview


#### List the Components of CSS
*Excerpt From Section 13: Intro to CSS*

>For each **presentation rule**, there are 3 things to keep in mind:
1. What is the specific HTML we want to style?
2. What are the qualities we want to modify (e.g. the properties of text
   in a paragraph)?
3. _How_ do we want to modify the qualities of the element (e.g. font
   family, font color, font size, line height, letter spacing, etc.)?


> CSS **selectors** are a way of declaring which HTML elements you wish to style.
Selectors can appear a few different ways:
- The type of HTML element(`h1`, `p`, `div`, etc.)
- The value of an element's `id` or `class` (`<p id='idvalue'></p>`, `<p
  class='classname'></p>`)
- The value of an element's attributes (`value="hello"`)
- The element's relationship with surrounding elements (a `p` within an element
  with class of `.infobox`)

[Type selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors)

The `class` attribute is a commonly used selector. Class selectors are used
to **select all elements that share a given class name**. The class selector
syntax is: `.classname`. Prefix the class name with a `.` (period).

```css
/*
select all elements that have the 'important-topic' class name
(e.g. <h1 class='important-topic'> and <p class='important-topic'>)
*/
.important-topic
```




You can also use the `id` selector to style elements. However, **there should
be only one element with a given id** in an HTML document. This can make
styling with the ID selector ideal for one-off styles. The `id` selector syntax
is: `#idvalue`. Prefix the id attribute of an element with a `#` (which is
called "octothorpe," "pound sign", or "hashtag").

```css
/*
selects the HTML element with the id 'main-header' (e.g. <h1 id='main-header'>)
*/
#main-header

```

[id selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors)
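
The same selector syntax can be tried out from Python. The sketch below assumes `beautifulsoup4` is installed and uses its `.select()` method, which accepts CSS selectors; the HTML string is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<div class="infobox">
  <h1 id="main-header">Top films</h1>
  <p class="important-topic">Inside the infobox.</p>
</div>
<p class="important-topic">Outside the infobox.</p>
"""

soup = BeautifulSoup(html, "html.parser")

by_type = soup.select("p")                  # type selector
by_class = soup.select(".important-topic")  # class selector
by_id = soup.select("#main-header")         # id selector
nested = soup.select(".infobox p")          # a p inside an element with class infobox

print(len(by_type), len(by_class), len(by_id), len(nested))
print(nested[0].get_text())
```

Only the paragraph nested inside the `.infobox` element matches the last selector, which is the "relationship with surrounding elements" case from the list above.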

## When will/can I use HTML & CSS?


### 1. Web scraping.

- See Part 2 of notebook.

### 2. Adding Images and links to your Markdown Documents/Blog Posts

- Using `img` tags

```html
<img src="" width=70%>
```

<img src="https://raw.githubusercontent.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/master/assets/images/flatironlogo_slack.png" width=30%>




### 3. Controlling the appearance of Pandas with CSS

- https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
    

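In [4]:
import pandas as pd
## Note: pd.util.testing was deprecated in pandas 1.0 and removed in 2.0,
## so this helper only works on older pandas versions
df = pd.util.testing.makeDataFrame()
df = df.head(10)
df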


| | A | B | C | D |
| --- | --- | --- | --- | --- |
| EI9vIjHG0h | -0.161342 | -0.313988 | -0.65333 | 0.063165 |
| qWxEt6GVel | 1.037422 | 1.614248 | -0.399002 | 0.128166 |
| XjKlqMmymL | -0.75738 | 1.055972 | -1.635408 | -0.682517 |
| qusofIkkbl | -0.174048 | 1.313819 | -1.97849 | -0.510585 |
| nsXzydtRQ3 | -0.382832 | 1.467373 | -0.595242 | 0.754063 |
| s2RIBOdwBZ | -1.269973 | 1.240586 | 1.419255 | -0.104134 |
| OUphaFDUKJ | 0.383776 | 1.453987 | 0.344523 | -1.085975 |
| HkkyhO08Gh | 0.948445 | 0.149785 | -0.445306 | 0.912587 |
| 0Il1oSrPLX | -0.206911 | 1.489894 | -1.221931 | 1.201347 |
| T9e4i5ph7w | -2.029459 | 0.845881 | 1.782064 | 1.198443 |


In [5]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    css = 'background-color: yellow;font-weight:bold;color:green;font-size:1.3em;'
    return [css if v else '' for v in is_max]


df.style.apply(highlight_max,axis=0)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| EI9vIjHG0h | -0.161342 | -0.313988 | -0.65333 | 0.063165 |
| qWxEt6GVel | 1.037422 | 1.614248 | -0.399002 | 0.128166 |
| XjKlqMmymL | -0.75738 | 1.055972 | -1.635408 | -0.682517 |
| qusofIkkbl | -0.174048 | 1.313819 | -1.97849 | -0.510585 |
| nsXzydtRQ3 | -0.382832 | 1.467373 | -0.595242 | 0.754063 |
| s2RIBOdwBZ | -1.269973 | 1.240586 | 1.419255 | -0.104134 |
| OUphaFDUKJ | 0.383776 | 1.453987 | 0.344523 | -1.085975 |
| HkkyhO08Gh | 0.948445 | 0.149785 | -0.445306 | 0.912587 |
| 0Il1oSrPLX | -0.206911 | 1.489894 | -1.221931 | 1.201347 |
| T9e4i5ph7w | -2.029459 | 0.845881 | 1.782064 | 1.198443 |


### 4. Your Markdown Cells in Notebooks

<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>



- The HTML used above

```HTML
<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>

```

### 5. Dashboards

- Plotly and Dash
- Open `./dash-example/app.py` & `./dash-example/assets/style.css`

<!-- - Example Dashboard from a former student
    - https://still-plateau-25734.herokuapp.com/ -->

## HTML/CSS DEMOS

### Demo 1: Loading CSS styles via Python

- We can load external CSS files by reading them into a string (`css_info` below) and wrapping that string in `<style>` tags with `IPython.display.HTML`:


```python
from IPython.display import HTML
HTML("<style>{}</style>".format(css_info))
```

In [7]:
from IPython.display import HTML
css_stylesheet = "../../assets/webscrape_example.css"

with open(css_stylesheet,'r')  as f_css:
    style = f_css.read()

## Run Me First
HTML(f"<style>{style}</style>")

## UNCOMMENT TO RESET STYLE
# HTML("")

### Demo 2: Using Codepen to practice HTML/CSS

- https://codepen.io/james_irving/pen/OJbrNjx

### Demo 3: Dashboard

- Plotly and Dash
- Open `./dash-example/app.py` & `./dash-example/assets/style.css`

<!-- - Example Dashboard from a former student
    - https://still-plateau-25734.herokuapp.com/ -->

# 📓 Part 2: Web Scraping 101

### Scraping Task

- Our task is to get the tables from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_highest-grossing_films

## Using Python's `requests` module:


- Use the `requests` library to initiate connections to a website.
- Check the status code returned to determine if the connection was successful (status code = 200).

```python
import requests
url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'

# Connect to the url using requests.get
response = requests.get(url)
response.status_code
```

 ___
 
| Status Code | Code Meaning |
| ----------- | ------------- |
| 1xx | Informational |
| 2xx | Success |
| 3xx | Redirection |
| 4xx | Client Error |
| 5xx | Server Error |

___
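
As a small illustration of the status code classes, the sketch below maps a numeric code to its class label and looks up the standard reason phrase with Python's built-in `http.HTTPStatus`. The helper name `describe_status` and the `STATUS_CLASSES` dictionary are made up for this example.

```python
from http import HTTPStatus

# Class labels keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a readable description such as '404 Not Found (Client Error)'."""
    status_class = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a registered status code; fall back to a generic phrase.
        phrase = "Unregistered code"
    return f"{code} {phrase} ({status_class})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With `requests`, the same helper could be called as `describe_status(response.status_code)`.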



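- Adding a sleep time between requests (`time.sleep()`) helps avoid getting blocked by a server.
- **Note: You can add a `timeout` to `requests.get()` to avoid indefinite waiting**
    - Commonly set to a multiple of 3 (`timeout=3`, `6`, `9`, etc.); the `requests` documentation suggests going slightly above a multiple of 3, which is the default TCP packet retransmission window.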

```python
# Add a timeout to prevent hanging
response = requests.get(url, timeout=3)
response.status_code

```
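
To show how a pause between requests might be wired in, here is a minimal sketch. The `fetch_all` helper and the `fake_fetch` stand-in are hypothetical names invented for this example; the stand-in lets the code run without a network connection.

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # pause so we do not hammer the server
        results[url] = fetch(url)
    return results


# Stand-in fetcher so the sketch runs without a network connection.
def fake_fetch(url):
    return f"fetched {url}"


pages = fetch_all(["https://example.com/a", "https://example.com/b"], fake_fetch, delay=0.1)
print(pages)
```

For real scraping, the fetcher could be something like `lambda u: requests.get(u, timeout=3)`, so every request gets both a timeout and a polite delay.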




In [8]:
import requests
from time import sleep

url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
response = requests.get(url=url, timeout=3)
print(f'Status code: {response.status_code}')

Status code: 200


In [9]:
print(response.text[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of highest-grossing films - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEpdhEI1ogwQpTPXquKUBwAAAFM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_highest-grossing_films","wgTitle":"List of highest-grossing films","wgCurRevisionId":1011527056,"wgRevisionId":1011527056,"wgArticleId":59892,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","CS1 errors: missing periodical","CS1: long volume v

___

##  Using `BeautifulSoup`


### Cook a soup




- Connect to a website using `response = requests.get(url)`
- Feed `response.content` into BeautifulSoup
- Specify the parser that will analyze the contents
    - the built-in option is `'html.parser'`
    - installing and using `lxml` is recommended [[lxml documentation](https://lxml.de/3.7/)]
- Use `soup.prettify()` to get a user-friendly version of the content to print

```python
import requests
from bs4 import BeautifulSoup

# Define the URL and establish a connection
url = 'https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3)

# Feed the response's .content into BeautifulSoup
page_content = response.content
soup = BeautifulSoup(page_content, 'lxml')  # or 'html.parser' if lxml is not installed

# Preview soup contents using .prettify()
print(soup.prettify()[:2000])

```
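
Because `lxml` is an optional install, one possible pattern is to try it first and fall back to the built-in parser. This is only a sketch: `make_soup` is a made-up helper name, and the HTML string stands in for `response.content` so the example runs offline.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


html = "<html><head><title>Demo</title></head><body><p>Hello, soup!</p></body></html>"
soup = make_soup(html)
print(soup.title.string)
print(soup.prettify())
```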




In [10]:
import bs4
## Make a BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, 'lxml')  # name the parser explicitly (or use 'html.parser')

In [11]:
## Print the prettified preview
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of highest-grossing films - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEpdhEI1ogwQpTPXquKUBwAAAFM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_highest-grossing_films","wgTitle":"List of highest-grossing films","wgCurRevisionId":1011527056,"wgRevisionId":1011527056,"wgArticleId":59892,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","CS1 errors: missing periodical","

## What's in a Soup?


- **A soup is essentially a collection of `tag objects`**
    - each tag from the HTML is a tag object in the soup
    - the tags maintain the hierarchy of the HTML page, so tag objects will contain _other_ tag objects that were under them in the HTML tree.
    
    

- **Each tag has a:**
    - `.name`
    - `.contents`
    - `.string`
    
    
    
- **A tag can be accessed by name (like a column in a dataframe using dot notation)**
    - and then you can access the tags within the new tag-variable just like the first tag
    ```python
    # Access tags by name
    meta = soup.meta
    head = soup.head
    body = soup.body
    p = soup.p
    # and so on...
    ```
    
    
- [!] ***BUT this will only return the FIRST tag of that type. To access all occurrences of a tag type, we will need to navigate the HTML family tree***
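
A short sketch of these attributes on a tiny, made-up document (assuming `beautifulsoup4` is installed) shows both the tag attributes and the "first tag only" behavior of dot notation:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>First <b>bold</b></p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(title.name)      # name of the tag
print(title.string)    # the single string inside the tag
print(title.contents)  # list of the tag's direct children

first_p = soup.p         # dot notation returns only the FIRST <p>
print(first_p.contents)
print(first_p.string)    # None, because this <p> has more than one child

all_p = soup.find_all("p")  # every <p> in the document
print(len(all_p))
```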


In [None]:
# soup.contents

In [12]:
## check .head
print(soup.head)

<head>
<meta charset="utf-8"/>
<title>List of highest-grossing films - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEpdhEI1ogwQpTPXquKUBwAAAFM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_highest-grossing_films","wgTitle":"List of highest-grossing films","wgCurRevisionId":1011527056,"wgRevisionId":1011527056,"wgArticleId":59892,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","CS1 errors: missing periodical","CS1: long volume value","CS1: Julian–Gregorian uncertainty","Wikipedia articles n

In [14]:
# soup.body


### Navigating the HTML Family Tree: Children, siblings, and parents

- **Each tag is located within a tree-hierarchy of parents, siblings, and children**
    - The family relation is based on how the tags are nested (shown as the indentation level in prettified HTML).

- **Methods/attributes for the location/related tags of a tag**
    - `.parent`, `.parents`
    - `.contents`, `.children`
    - `.descendants`
    - `.next_sibling`, `.previous_sibling`

- *Note: a newline character `\n` between tags also counts as a sibling/child (it is stored as a string node, not a tag)*

#### Accessing Child Tags

- To get to later occurrences of a tag type (i.e. the 2nd `<p>` tag in a tree), we need to navigate through the parent tag's `children`
    - To access an iterator over a tag's children use `.children`
        - But, this only returns its *direct children*  (one indentation level down)     
        
    ```python
    # print direct children of the body tag
    body = soup.body
    for child in body.children:
        # print each child, with '\n\n' for visual separation
        print(child, '\n\n')
    ```
- To access *all children* use `.descendants`
    - Returns all children and children of children
    ```python
    for child in body.descendants:
        # print all children/grandchildren, etc
        print(child, '\n\n')
    ```
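
The difference between direct children and all nested tags is easier to see on a small example. The HTML below is invented for illustration; `isinstance(node, Tag)` is used to skip the plain text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body><div><p>One</p><p>Two</p></div><p>Three</p></body>"
body = BeautifulSoup(html, "html.parser").body

# Keep only tag objects so text nodes do not clutter the result
direct = [child.name for child in body.children if isinstance(child, Tag)]
everything = [node.name for node in body.descendants if isinstance(node, Tag)]

print(direct)      # direct children only
print(everything)  # children, grandchildren, and so on
```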
  

In [15]:
## What is the .children for the soup.body?
soup.body.children

<list_iterator at 0x7fed4c591040>

In [16]:
## Make the children viewable and check how many there are
body_tags = list(soup.body.children)
len(body_tags)

18

  
#### Accessing Parent tags

- To access the parent of a tag use `.parent`
```python
title = soup.head.title
print(title.parent.name)
```

- To get a list of _all parents_ use `.parents`
```python
title = soup.head.title
for parent in title.parents:
    print(parent.name)
```

#### Accessing Sibling tags
- siblings are tags in the same tree indentation level
- `.next_sibling`, `.previous_sibling`
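
Here is a brief sketch of moving sideways and upwards from a tag, using a made-up list with no whitespace between the tags (so every sibling is a tag rather than a `\n` string):

```python
from bs4 import BeautifulSoup

html = "<ul><li>Rank</li><li>Title</li><li>Year</li></ul>"
soup = BeautifulSoup(html, "html.parser")

middle = soup.find_all("li")[1]
print(middle.parent.name)              # the enclosing tag
print(middle.previous_sibling.string)  # the <li> before it
print(middle.next_sibling.string)      # the <li> after it
print([parent.name for parent in middle.parents])
```

On real pages the immediate sibling is often a newline string; `.find_next_sibling()` and `.find_previous_sibling()` skip over those and return the nearest tag.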

In [17]:
## Check the parents of the first p
p_parents = list(soup.body.p.parents)
p_parents[0]

<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Wikimedia list article</div>
<p class="mw-empty-elt">
</p>
<div class="thumb tright"><div class="thumbinner" style="width:222px;"><a class="image" href="/wiki/File:Poster_-_Gone_With_the_Wind_01.jpg"><img alt="A screencap of the title card from the trailer of Gone with the Wind." class="thumbimage" data-file-height="2000" data-file-width="1299" decoding="async" height="339" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/27/Poster_-_Gone_With_the_Wind_01.jpg/220px-Poster_-_Gone_With_the_Wind_01.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/27/Poster_-_Gone_With_the_Wind_01.jpg/330px-Poster_-_Gone_With_the_Wind_01.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/27/Poster_-_Gone_With_the_Wind_01.jpg/440px-Poster_-_Gone_With_the_Wind_01.jpg 2x" width="220"/></a> <div class="thumbcaption"><div class="magnify"><a class="internal" href="/wiki/

## Searching Through Soup


### Finding the target tags to isolate


Using an example from the [Wikipedia article](https://en.wikipedia.org/wiki/List_of_highest-grossing_films),
where we are trying to isolate the body of the article content.


- **Examine the website using Chrome's inspect view.**

    - Press F12 or right-click > inspect

    - Use the mouse selector tool (top left button) to explore the web page content for your desired target
        - the web page element will be highlighted on the page itself and its corresponding entry in the document tree.
        - Note: click on the web page with the selector in order to keep it selected in the document tree

    - Take note of any identifying attributes for the target tag (class, id, etc)
<img src="https://drive.google.com/uc?export-download&id=1KifQ_ukuXFdnCh1Tz1rwzA_cWkB_45mf" width=450>

### Using BeautifulSoup's search functions

Note: while the process below is a decent summary, there is more nuance to html/css tags than I personally have been able to digest. 
    - If something doesn't work as expected/explained, please verify in the documentation.
        - [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)
        - [docs for .find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
    
- **BeautifulSoup has methods for searching through descendant tags**
    - `.find`
    - `.find_all`
    
- **Using `.find_all()`**
    - Searches through all descendent tags and returns a result set (list of tag objects)
```python
# How to get results from .find_all()
results = soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```        
    - `.find_all()` parameters:
        - `name` _(type of tags to consider)_
            - only consider tags with this name 
                - Ex: 'a',  'div', 'p' ,etc.
        - `attrs` _(attributes that you are looking for in your target tag)_
            - a plain string is matched against the tag's `class`:

                `attrs='mw-content-ltr'`
            - to match other attributes (such as `id`), or more than one attribute, use a dictionary:

            `attrs={'class':'mw-content-ltr', 'id':'mw-content-text'}`
        - `recursive` _(Default=True)_
            - search all descendants (`True`)
            - search only direct children (`False`)

        - `string`
            - search for text _inside_ of tags instead of the tags themselves
            - can be a regular expression
        - `limit`
            - How many results you want it to return
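
The parameters above can be exercised on a small, made-up snippet. This sketch assumes `beautifulsoup4` is installed; it uses the `class_` keyword (bs4's shortcut for the `class` attribute) alongside the dictionary form of `attrs`.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="mw-content-text" class="mw-content-ltr">
  <p>Gross: $2,797,800,564</p>
  <div><p>Nested paragraph</p></div>
  <a href="/wiki/Avatar_(2009_film)">Avatar</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("p")                             # name
by_class = soup.find_all("div", class_="mw-content-ltr")  # class shortcut
by_dict = soup.find_all(attrs={"class": "mw-content-ltr", "id": "mw-content-text"})

content = by_dict[0]
direct_only = content.find_all("p", recursive=False)  # direct children only
limited = soup.find_all("p", limit=1)                 # stop after one result
money = soup.find_all(string=re.compile(r"\$\d"))     # search text, not tags

print(len(by_name), len(by_class), len(direct_only), len(limited))
print(money[0].strip())
```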


    

In [25]:
import re
exp = re.compile(r"wikitable sortable plainrowheaders jquery-tablesorter")
exp

re.compile(r'wikitable sortable plainrowheaders jquery-tablesorter',
re.UNICODE)

In [29]:
## Find All Table tags
tables = soup.find_all('table')
len(tables)

93

In [30]:
## Save the first table as its own tag
table = tables[0]
table

<table class="wikitable sortable plainrowheaders" style="margin:auto; margin:auto;">
<caption>Highest-grossing films<sup class="reference" id="cite_ref-13"><a href="#cite_note-13">[13]</a></sup>
</caption>
<tbody><tr>
<th scope="col">Rank
</th>
<th scope="col">Peak
</th>
<th scope="col">Title
</th>
<th scope="col">Worldwide gross
</th>
<th scope="col">Year
</th>
<th class="unsortable" scope="col">Reference(s)
</th></tr>
<tr>
<td>1
</td>
<td>1
</td>
<th scope="row"><i><a href="/wiki/Avengers:_Endgame" title="Avengers: Endgame">Avengers: Endgame</a></i>
</th>
<td align="right">$2,797,800,564
</td>
<td data-sort-value="2019-04" style="text-align:center;">2019
</td>
<td style="text-align:center;"><sup class="reference" id="cite_ref-endgame_14-0"><a href="#cite_note-endgame-14">[# 1]</a></sup><sup class="reference" id="cite_ref-endgame_peak_15-0"><a href="#cite_note-endgame_peak-15">[# 2]</a></sup>
</td></tr>
<tr>
<td>2
</td>
<td>1
</td>
<th scope="row"><i><a href="/wiki/Avatar_(2009_film)"

In [31]:
## Save all table rows as children
children = list(table.find_all('tr'))
len(children)

51

In [32]:
row = children[0]
row

<tr>
<th scope="col">Rank
</th>
<th scope="col">Peak
</th>
<th scope="col">Title
</th>
<th scope="col">Worldwide gross
</th>
<th scope="col">Year
</th>
<th class="unsortable" scope="col">Reference(s)
</th></tr>

In [38]:
print(row.text.replace('\n',' '))

 Rank  Peak  Title  Worldwide gross  Year  Reference(s) 


In [44]:
children[1].text

'\n1\n\n1\n\nAvengers: Endgame\n\n$2,797,800,564\n\n2019\n\n[# 1][# 2]\n'

In [47]:
children[1].text.replace('\n',',')#.strip().split(' ')

',1,,1,,Avengers: Endgame,,$2,797,800,564,,2019,,[# 1][# 2],'

In [51]:
## Note: `row` is whichever row was assigned last; splitting on ',' also breaks up the gross value
row.text.replace('\n',',').replace(',,',',').strip(',').split(',')

['50', '2', 'The Lion King', '$968', '483', '777', '1994', '[# 79][# 64]']

In [50]:
## First, Print the text of each row from children 
table_data = []
for row in children:
    clean_text = row.text.replace('\n',',').replace(',,',',').strip(',')
    print(clean_text)
    ## Save the cleaned-up version of the row text
    table_data.append(clean_text)

    

Rank,Peak,Title,Worldwide gross,Year,Reference(s)
1,1,Avengers: Endgame,$2,797,800,564,2019,[# 1][# 2]
2,1,Avatar,$2,790,439,000,2009,[# 3][# 4]
3,1,Titanic,$2,194,439,542,1997,[# 5][# 6]
4,3,Star Wars: The Force Awakens,$2,068,223,624,2015,[# 7][# 8]
5,4,Avengers: Infinity War,$2,048,359,754,2018,[# 9][# 10]
6,3,Jurassic World,$1,671,713,208,2015,[# 11][# 12]
7,7,The Lion King,$1,656,943,394,2019,[# 13][# 2]
8,3,The Avengers,$1,518,812,988,2012,[# 14][# 15]
9,4,Furious 7,$1,516,045,911,2015,[# 16][# 17]
10,10,Frozen II,$1,450,026,933,2019,[# 18][# 19]
11,5,Avengers: Age of Ultron,$1,402,805,868,2015,[# 20][# 17]
12,9,Black Panther,$1,347,280,838,2018,[# 21][# 22]
13,3,Harry Potter and the Deathly Hallows – Part 2,$1,342,025,430,2011,[# 23][# 24]
14,9,Star Wars: The Last Jedi,$1,332,539,889,2017,[# 25][# 26]
15,12,Jurassic World: Fallen Kingdom,$1,309,484,461,2018,[# 27][# 10]
16,5,Frozen,F$1,290,000,000,2013,[# 28][# 29]
17,10,Beauty and the Beast,$1,263,521,126,2017,[# 30][# 31]
18,1

In [None]:
## Try to make a df
pd.DataFrame(table_data)
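
Splitting the row text on commas breaks values like `$2,797,800,564` apart, so an alternative is to collect the text of each cell separately. The sketch below is one possible approach, run on a small hand-written table that imitates the structure of the Wikipedia one (title cells are `<th>`, the rest are `<td>`).

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Title</th><th>Worldwide gross</th><th>Year</th></tr>
  <tr><td>1</td><th><i><a href="/wiki/Avengers:_Endgame">Avengers: Endgame</a></i></th><td>$2,797,800,564</td><td>2019</td></tr>
  <tr><td>2</td><th><i><a href="/wiki/Avatar_(2009_film)">Avatar</a></i></th><td>$2,790,439,000</td><td>2009</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

table_data = []
for row in table.find_all("tr"):
    # Header cells (th) and data cells (td) both count as columns
    cells = row.find_all(["th", "td"])
    table_data.append([cell.get_text(strip=True) for cell in cells])

header, *records = table_data
print(header)
print(records)
```

From here, `pd.DataFrame(records, columns=header)` would give a dataframe with one column per cell.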

### SuperPower: pd.read_html

In [56]:
url

'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'

In [52]:
import pandas as pd
dfs = pd.read_html(url)
type(dfs)


list

In [53]:
len(dfs)

93

In [54]:
dfs[0]

| | Rank | Peak | Title | Worldwide gross | Year | Reference(s) |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | Avengers: Endgame | $2,797,800,564 | 2019 | [# 1][# 2] |
| 1 | 2 | 1 | Avatar | $2,790,439,000 | 2009 | [# 3][# 4] |
| 2 | 3 | 1 | Titanic | $2,194,439,542 | 1997 | [# 5][# 6] |
| 3 | 4 | 3 | Star Wars: The Force Awakens | $2,068,223,624 | 2015 | [# 7][# 8] |
| 4 | 5 | 4 | Avengers: Infinity War | $2,048,359,754 | 2018 | [# 9][# 10] |
| 5 | 6 | 3 | Jurassic World | $1,671,713,208 | 2015 | [# 11][# 12] |
| 6 | 7 | 7 | The Lion King | $1,656,943,394 | 2019 | [# 13][# 2] |
| 7 | 8 | 3 | The Avengers | $1,518,812,988 | 2012 | [# 14][# 15] |
| 8 | 9 | 4 | Furious 7 | $1,516,045,911 | 2015 | [# 16][# 17] |
| 9 | 10 | 10 | Frozen II | $1,450,026,933 | 2019 | [# 18][# 19] |


In [55]:
dfs[1]

| | Rank | Title | Worldwide gross(2019 $) | Year |
| --- | --- | --- | --- | --- |
| 0 | 1 | Gone with the Wind | $3,706,000,000 | 1939 |
| 1 | 2 | Avatar | $3,257,000,000 | 2009 |
| 2 | 3 | Titanic | T$3,081,000,000 | 1997 |
| 3 | 4 | Star Wars | $3,043,000,000 | 1977 |
| 4 | 5 | Avengers: Endgame | AE$2,798,000,000 | 2019 |
| 5 | 6 | The Sound of Music | $2,549,000,000 | 1965 |
| 6 | 7 | E.T. the Extra-Terrestrial | $2,489,000,000 | 1982 |
| 7 | 8 | The Ten Commandments | $2,356,000,000 | 1956 |
| 8 | 9 | Doctor Zhivago | $2,233,000,000 | 1965 |
| 9 | 10 | Star Wars: The Force Awakens | $2,202,000,000 | 2015 |
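
Some gross values come back with stray footnote letters in front of the dollar sign (for example `T$3,081,000,000`). A possible cleanup step, sketched here with a made-up helper called `parse_gross`, pulls out the digits after the `$` and converts them to an integer.

```python
import re


def parse_gross(text):
    """Convert a string such as 'T$3,081,000,000' to the integer 3081000000."""
    match = re.search(r"\$([\d,]+)", text)
    if match is None:
        raise ValueError(f"No dollar amount found in {text!r}")
    return int(match.group(1).replace(",", ""))


for raw in ["$3,706,000,000", "T$3,081,000,000", "AE$2,798,000,000"]:
    print(raw, "->", parse_gross(raw))
```

Applied to a dataframe column, this could look like `dfs[1]['Worldwide gross(2019 $)'].apply(parse_gross)`.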


### Level-Up Activity


- For each of the movies in our scraped table:
    - Navigate to the movie's wikipedia page. 
    - Save the top-right dark gray box with the summary info for the movie.

In [62]:
## Extract table again
table = soup.find_all('table')[0]
a_tags = table.find_all('a',href=True)
a_tags


[<a href="#cite_note-13">[13]</a>,
 <a href="/wiki/Avengers:_Endgame" title="Avengers: Endgame">Avengers: Endgame</a>,
 <a href="#cite_note-endgame-14">[# 1]</a>,
 <a href="#cite_note-endgame_peak-15">[# 2]</a>,
 <a href="/wiki/Avatar_(2009_film)" title="Avatar (2009 film)">Avatar</a>,
 <a href="#cite_note-avatar-16">[# 3]</a>,
 <a href="#cite_note-avatar_peak-17">[# 4]</a>,
 <a href="/wiki/Titanic_(1997_film)" title="Titanic (1997 film)">Titanic</a>,
 <a href="#cite_note-titanic-18">[# 5]</a>,
 <a href="#cite_note-titanic_peak-19">[# 6]</a>,
 <a href="/wiki/Star_Wars:_The_Force_Awakens" title="Star Wars: The Force Awakens">Star Wars: The Force Awakens</a>,
 <a href="#cite_note-sw7-20">[# 7]</a>,
 <a href="#cite_note-sw7_peak-21">[# 8]</a>,
 <a href="/wiki/Avengers:_Infinity_War" title="Avengers: Infinity War">Avengers: Infinity War</a>,
 <a href="#cite_note-infinity_war-22">[# 9]</a>,
 <a href="#cite_note-infinity_war_peak-23">[# 10]</a>,
 <a href="/wiki/Jurassic_World" title="Jurassi

In [64]:
tag= a_tags[1]
tag

<a href="/wiki/Avengers:_Endgame" title="Avengers: Endgame">Avengers: Endgame</a>

In [91]:
tag['href'].startswith('/wiki')

True

In [73]:
movie_urls = []
for tag in a_tags:
    if tag['href'].startswith('/wiki'):
        movie_urls.append(tag['href'])
movie_urls
    

['/wiki/Avengers:_Endgame',
 '/wiki/Avatar_(2009_film)',
 '/wiki/Titanic_(1997_film)',
 '/wiki/Star_Wars:_The_Force_Awakens',
 '/wiki/Avengers:_Infinity_War',
 '/wiki/Jurassic_World',
 '/wiki/The_Lion_King_(2019_film)',
 '/wiki/The_Avengers_(2012_film)',
 '/wiki/Furious_7',
 '/wiki/Frozen_II',
 '/wiki/Avengers:_Age_of_Ultron',
 '/wiki/Black_Panther_(film)',
 '/wiki/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2',
 '/wiki/Star_Wars:_The_Last_Jedi',
 '/wiki/Jurassic_World:_Fallen_Kingdom',
 '/wiki/Frozen_(2013_film)',
 '/wiki/Beauty_and_the_Beast_(2017_film)',
 '/wiki/Incredibles_2',
 '/wiki/The_Fate_of_the_Furious',
 '/wiki/Iron_Man_3',
 '/wiki/Minions_(film)',
 '/wiki/Captain_America:_Civil_War',
 '/wiki/Aquaman_(film)',
 '/wiki/The_Lord_of_the_Rings:_The_Return_of_the_King',
 '/wiki/Spider-Man:_Far_From_Home',
 '/wiki/Captain_Marvel_(film)',
 '/wiki/Transformers:_Dark_of_the_Moon',
 '/wiki/Skyfall',
 '/wiki/Transformers:_Age_of_Extinction',
 '/wiki/The_Dark_Knight_Rises',
 '/wi

In [None]:
## Get all wikipedia links


### Joining Together Links

In [77]:
url

'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'

In [86]:
import os
os.path

<module 'posixpath' from '/opt/anaconda3/envs/learn-env-new/lib/python3.8/posixpath.py'>

In [83]:
import urllib.parse
parsed = urllib.parse.urlparse(url)
parsed

ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/List_of_highest-grossing_films', params='', query='', fragment='')

In [84]:
parsed.scheme+parsed.netloc  # naive concatenation drops the '://' separator

'httpsen.wikipedia.org'

In [79]:
urllib.parse.urljoin(url,movie_urls[0])

'https://en.wikipedia.org/wiki/Avengers:_Endgame'

In [74]:
## Make full links for full list
abs_urls = [urllib.parse.urljoin(url,url_movie) for url_movie in movie_urls]
abs_urls


['https://en.wikipedia.org/wiki/Avengers:_Endgame',
 'https://en.wikipedia.org/wiki/Avatar_(2009_film)',
 'https://en.wikipedia.org/wiki/Titanic_(1997_film)',
 'https://en.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens',
 'https://en.wikipedia.org/wiki/Avengers:_Infinity_War',
 'https://en.wikipedia.org/wiki/Jurassic_World',
 'https://en.wikipedia.org/wiki/The_Lion_King_(2019_film)',
 'https://en.wikipedia.org/wiki/The_Avengers_(2012_film)',
 'https://en.wikipedia.org/wiki/Furious_7',
 'https://en.wikipedia.org/wiki/Frozen_II',
 'https://en.wikipedia.org/wiki/Avengers:_Age_of_Ultron',
 'https://en.wikipedia.org/wiki/Black_Panther_(film)',
 'https://en.wikipedia.org/wiki/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2',
 'https://en.wikipedia.org/wiki/Star_Wars:_The_Last_Jedi',
 'https://en.wikipedia.org/wiki/Jurassic_World:_Fallen_Kingdom',
 'https://en.wikipedia.org/wiki/Frozen_(2013_film)',
 'https://en.wikipedia.org/wiki/Beauty_and_the_Beast_(2017_film)',
 'https://en.wikiped
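
To see how `urljoin` resolves different kinds of links against the page URL, here is a small standalone sketch using only the standard library. It also shows one way to rebuild the site root from the parsed pieces, including the `://` separator.

```python
from urllib.parse import urljoin, urlparse

base = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'

# A root-relative link replaces the whole path
print(urljoin(base, '/wiki/Avatar_(2009_film)'))
# A relative link replaces only the last path segment
print(urljoin(base, 'Titanic_(1997_film)'))
# A fragment-only link stays on the same page
print(urljoin(base, '#cite_note-13'))

parsed = urlparse(base)
root = f"{parsed.scheme}://{parsed.netloc}"
print(root)
```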

### Getting the info Box for the first movie

In [76]:
## Make a soup for the first url in abs_urls
resp2 = requests.get(abs_urls[0])
soup2 = bs4.BeautifulSoup(resp2.content, 'lxml')  # or 'html.parser' if lxml is not installed
soup2

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Avengers: Endgame - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEpeD77xEFMk0veNjWuhmAAAAAc","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":1011573003,"wgRevisionId":1011573003,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Articles with short description","Short de

In [None]:
## Identify a tag you could use to target the info


In [None]:
## Eff that, use pd.read_html


# APPENDIX

## Other HTML/CSS Use Cases

#### Adding/controlling images and alignment of text

- Add an image hosted on GitHub by grabbing the raw link.

<img src="https://raw.githubusercontent.com/jirvingphd/fsds_pt_100719_cohort_notes/master/Images/flatiron-building-glitter.jpeg" width=30%>

- Let's add an image hosted on GitHub to our notebook (great example for making blog posts).
1. Go to the repo's website, click on the image file.
2. Click download and copy the raw.githubusercontent.com link.

### 6. ipywidget layouts (e.g. fsds.ihelp_menu)

In [None]:
## 4. ipywidgets Example
try: 
    import fsds as fs
except ImportError:
    !pip install -U fsds 
    import fsds as fs

fs.ihelp_menu(fs.ihelp_menu)

## Other Text-Related Tips

### Sidebar: Using RegularExpressions to sift through the content.

In [None]:
import re
regexp = re.compile(r"(\$\d{1,})\.(\d{2})")
regexp


In [None]:
found_text = regexp.findall(all_text)
found_text

In [None]:
tag0=tags[0]
target = tag0.contents
target = ' '.join(target)
target


- Best Hands On Tester for Regex:
    - https://regex101.com/
    - Select "Python" on the left side of the page.
    - Paste the text you want to sift through in the large center window.
    - Type your expression in the top center window.
    - It will highlight the text that matches your regular expression in the big center panel. 

- Cheatsheet for Regex Symbols:
    - https://www.debuggex.com/cheatsheet/regex/python

In [None]:
import re
price = re.compile(r"(\$\d\,\d*\.\d{2})")  # raw string avoids invalid-escape warnings
price.findall(target)

### Text Formatting with f-strings

In [None]:
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

response = requests.get(url, timeout=3)
print('Status code: ',response.status_code)
if response.status_code==200:
    print('Connection successful.\n\n')
else: 
    print('Error. Check status code table.\n\n')    

    
# Print out the contents of a request's response
print(f"{'---'*20}\n\tContents of response.headers.items():\n{'---'*20}")

for k,v in response.headers.items():
    print(f"{k:{25}}: {v:{40}}") # Note: :{number} after the variable sets the field width

In [None]:
for k,v in response.headers.items():
    print(f"{k:{30}}:{v:{20}}") # Note: :{number} after the variable sets the field width

#### Sidebar Notes - Explaining The Above Text Printing/Formatting:



- **You can repeat strings by using multiplication**
    - `'---'*20` will repeat the dashed lines 20 times

- **You can determine how much space is allotted for a variable when using f-strings**
    - Add a `:{##}` after the variable to specify the allocated width
    - Add a `>` before the `{##}` to force right alignment 
    - Add another symbol (like `.` or `-`) before `>` to add a guiding line/placeholder (like in a table of contents)

```python
print(f"Status code: {response.status_code}")
print(f"Status code: {response.status_code:>{20}}")
print(f"Status code: {response.status_code:->{20}}")
```    
```
# Returns:
Status code: 200
Status code:                  200
Status code: -----------------200
```
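
The same width, alignment, and fill options can be combined to print a table-of-contents style listing. The sketch below uses a made-up dictionary in place of `response.headers` so it runs without a network connection.

```python
# Made-up stand-in for response.headers
headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "Server": "example-server",
    "Age": "42",
}

print(f"{'---' * 10}\n\tSample headers\n{'---' * 10}")

lines = []
for key, value in headers.items():
    # key: left-aligned, padded with dots to 20 characters
    # value: right-aligned in a 30-character field
    line = f"{key:.<{20}}{value:>{30}}"
    lines.append(line)
    print(line)
```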

### Recommended packages/tools to use
1. `fake_useragent`
    - pip-installable module that conveniently supplies fake user agent information to use in your request headers.
    - recommended by a Udemy course
2. `lxml`
    - popular pip-installable HTML parser (recommended by a Udemy course)
    - using `'html.parser'` in `BeautifulSoup()` did not work for me; I had to install `lxml`

In [None]:
# !pip install fake_useragent
# !pip install lxml

# import fake_useragent
# import lxml