In order to successfully complete this assignment you must do the required reading, watch the provided videos, and complete all instructions.  The embedded survey form must be entirely filled out and submitted on or before **11:59pm on Wednesday September 30**.  Students must come to class the next day prepared to discuss the material covered in this assignment.

# Pre-Class Assignment: Web Scraping

### Goals for today's pre-class assignment 


1. [Web Scraping](#Web_Scraping)
2. [Regular Expressions and the ```re``` library](#Regular_Expressions)
3. [Mechanics of regular expressions](#Mechanics_of_regular_expressions)
4. [RE Examples](#RE_Examples)
5. [Additional Resources](#Additional_Resources)
6. [Assignment wrap-up](#Assignment_wrap-up)

---
<a name="Web_Scraping"></a>
# 1. Web Scraping

In the next class we will be doing some more web scraping. This is a common technique for downloading and using data from the internet.  Sometimes web scraping is really easy; other times it can be complex. Here are some basic levels of difficulty:

- (easy) Simple HTML
- (harder) HTML and CSS
- (difficult) Javascript - Often requires a "Headless" web browser.

To prepare for class, make sure you understand the basics of HTML. If you are new to HTML, I highly recommend reading the following article in detail. If you are familiar with HTML, a quick skim of the article may be sufficient:

- https://www.dataquest.io/blog/web-scraping-tutorial-python/
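
As a small illustration of the "Simple HTML" level, the sketch below pulls the link targets out of a short HTML snippet using only Python's standard library. The snippet, the `example.com` URLs, and the `LinkCollector` class are made up for illustration; real pages are usually messier than this.

```python
from html.parser import HTMLParser

# A tiny, hand-written page (hypothetical content, for illustration only)
sample_html = """
<html>
  <body>
    <h1>Contact</h1>
    <p>See the <a href="https://example.com/a">first link</a> and
       the <a href="https://example.com/b">second link</a>.</p>
  </body>
</html>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```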

---
<a name="Regular_Expressions"></a>
# 2. Regular Expressions and the ```re``` library

Regular expressions (also referred to as **regex** or **regexp**) can be thought of as a powerful language for pattern matching in text, and the concept is used a lot in web scraping.  The Python module **re** provides support for regular expressions. A typical regular expression search in Python looks like

    match = re.search(pattern, text)

where

1. **pattern**: a string with the instructions for what to look for and how to look for it
1. **text**: the string on which the pattern matching will be performed

In particular, the ```re.search(pattern, text)``` method returns a match object if the search pattern was found within the text, and None otherwise.

---

Try it for yourself:

In [None]:
import re

text = 'Go green, go white!'

match1 = re.search('MSU', text )

match2 = re.search('green', text )

print(match1)
print(match2)
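
When a search succeeds, the returned match object can be asked what it matched and where. The following is a minimal sketch using the same text as above; the variable names `found`, `missing`, `matched_text`, and `position` are only illustrative.

```python
import re

text = 'Go green, go white!'

found = re.search('green', text)    # pattern is present in the text
missing = re.search('MSU', text)    # pattern is absent from the text

if found:
    matched_text = found.group()    # the substring that matched
    position = found.span()         # (start, end) indices within text
    print(matched_text, position)

# A failed search returns None, so it can be tested directly
print(missing is None)
```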

The power of regular expressions comes from the fact that the pattern can contain not only **regular characters** such as 'g' and 'M', but also **metacharacters** such as \d (any digit), \s (white space), \w (alphanumeric or underscore), \W (non-alphanumeric); and **quantifiers** such as * (zero or more occurrences), + (one or more occurrences) and ? (at most one occurrence). Here is an example that finds the text "word" followed by a dash and any run of alphanumeric characters:

In [None]:
text = 'an example word-cat!!'

match = re.search(r'word-\w*', text)  # raw string so that \w reaches re unchanged

# If-statement after search() tests if it succeeded
if match:                      
    print('found', match.group()) ## 'found word-cat'
else:
    print('did not find')

&#9989; **<font color=red>DO THIS:</font>**  In the code above, replace the quantifier * by another quantifier so that the result includes any single alphanumeric character after the dash (ex: word-c).

In [None]:
### Your code here

##ANSWER##
match = re.search(r'word-\w', text)  # exactly one alphanumeric character, same as \w{1}
print(match)

##ANSWER##

---

A good list of regex characters and other expressions can be found here:

https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet
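
To make the metacharacters and quantifiers described earlier more concrete, here is a small sketch that applies a few of them to one sample string. The string and the variable names are made up for illustration.

```python
import re

sample = 'Room 42, Floor 7'

digits = re.findall(r'\d', sample)       # every single digit
numbers = re.findall(r'\d+', sample)     # runs of one or more digits
words = re.findall(r'\w+', sample)       # runs of alphanumeric characters
spaces = re.findall(r'\s', sample)       # every whitespace character

# ? makes the preceding character optional: matches 'Floor' or 'Floors'
optional_s = re.search(r'Floors? \d', sample).group()

print(digits)
print(numbers)
print(words)
print(len(spaces))
print(optional_s)
```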

---

Here is another example: Say you want to find the domain name that follows the @ in every email address in a text string (without the ending such as .com or .edu). Here is a solution:



In [None]:
text = 'purple alice-b@google.com monkey dishwasher sparty@msu.edu'

match = re.findall(r'@\w+', text)

match

The method ```findall``` returns a list of strings with all the matches found. If no matches are found, then it returns the empty list [] .
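
The sketch below shows both behaviours of ```findall```: a list of matches and an empty list. It also widens the pattern to the character class `[\w.]` so that the dots in a domain are included; that class is an assumption chosen for illustration and is not a complete description of valid domains.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher sparty@msu.edu'

# \w alone stops at the first dot; [\w.] also accepts dots
full_domains = re.findall(r'@[\w.]+', text)
print(full_domains)

# No @ in this string, so there is nothing to find
none_found = re.findall(r'@\w+', 'no addresses here')
print(none_found)
```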

---

The final point I would like to make is the following: Since the search pattern in a regular expression is essentially a set of instructions (i.e., a program) then one can compile it, which is advantageous if the same search pattern is going to be used several times. The methods from the library **re** can then be applied to the compiled pattern:

In [None]:
pattern = re.compile(r'@([\w\d.]+\.)+(com|org|edu)')

text = 'This is a list of email addresses: first.last@example.com, first.last+category@gmail.com,  valid-address@mail.example.com,  not-valid@example.foo'

pattern.findall(text)
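
One detail worth knowing: when a pattern contains parenthesized groups, ```findall``` returns the contents of the groups (as tuples) rather than the whole match. The sketch below contrasts capturing groups with non-capturing groups, written `(?:...)`. The simplified patterns and the names `grouped` and `ungrouped` are illustrative only.

```python
import re

text = ('first.last@example.com, first.last+category@gmail.com, '
        'valid-address@mail.example.com, not-valid@example.foo')

# Capturing groups: findall returns one tuple of group contents per match
grouped = re.compile(r'@([\w.]+\.)+(com|org|edu)')
group_results = grouped.findall(text)
print(group_results)

# Non-capturing groups (?:...): findall returns each whole match as a string
ungrouped = re.compile(r'@(?:[\w.]+\.)+(?:com|org|edu)')
whole_results = ungrouped.findall(text)
print(whole_results)
```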

---
<a name="Mechanics_of_regular_expressions"></a>
# 3.   Mechanics of regular expressions

The following video is about regular expressions and is really LONG.  You will want to watch it at faster speeds.  I don't like long videos so feel free to skip if you are not finding it helpful.  

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("UR6a_wZ8ido",width=640,height=360, cc_load_policy=True)

----
<a name="RE_Examples"></a>
# 4. RE Examples

Here is an example of how regular expressions are used in the wild:

In [None]:
# example postal codes used in Great Britain
example_codes = ["SW1A 0AA", # House of Commons
                 "SW1A 1AA", # Buckingham Palace
                 "SW1A 2AA", # Downing Street
                 "BX3 2BB", # Barclays Bank
                 "DH98 1BT", # British Telecom
                 "N1 9GU", # Guardian Newspaper
                 "E98 1TT", # The Times
                 "TIM E22", # a fake postcode
                 "A B1 A22", # not a valid postcode
                 "EC2N 2DB", # Deutsche Bank
                 "SE9 2UG", # University of Greenwich
                 "N1 0UY", # Islington, London
                 "EC1V 8DS", # Clerkenwell, London
                 "WC1X 9DT", # WC1X 9DT
                 "B42 1LG", # Birmingham
                 "B28 9AD", # Birmingham
                 "W12 7RJ", # London, BBC News Centre
                 "BBC 007" # a fake postcode
                ]


# [A-Z] rather than [A-z]: the range A-z also covers lowercase letters and [ \ ] ^ _ `
pattern = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")

for postcode in example_codes:
    r = pattern.search(postcode)
    if r:
        print(postcode + " matched!")
    else:
        print(postcode + " is not a valid postcode!")
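
Note that ```search``` accepts a match anywhere inside the string. When the goal is to validate a whole string, ```fullmatch``` is stricter. The sketch below compares the two on a made-up string that merely contains a postcode; the variable names are illustrative.

```python
import re

postcode_pattern = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")

embedded = "XXSW1A 0AAXX"   # a valid postcode surrounded by extra characters

# search() succeeds because the postcode appears somewhere in the string
found_anywhere = postcode_pattern.search(embedded) is not None

# fullmatch() fails because the whole string is not a postcode
whole_string = postcode_pattern.fullmatch(embedded) is not None

# fullmatch() succeeds when the string is exactly one postcode
exact = postcode_pattern.fullmatch("SW1A 0AA") is not None

print(found_anywhere, whole_string, exact)
```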

For another example, the following code finds simple phone numbers in websites:

In [None]:
import re
import requests

url = "https://colbrydi.github.io/pages/contact.html"

source_code = requests.get(url)
plain_text = source_code.text

regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

res = regex.findall(plain_text)

print(res)
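
The same pattern can be tried without a network connection by running it on a string. The sketch below uses a made-up sample line to show which formats this simple pattern picks up and which it misses.

```python
import re

phone_regex = re.compile(r"\(?\d{3}\)?\s?\d{3}[-.]\d{4}")

# Hypothetical sample text with three ways of writing a phone number
sample = 'Office: (517) 432-0455, lab: 517 432.0455, other: 517-432-0455'

found_numbers = phone_regex.findall(sample)
print(found_numbers)

# The dash-separated area code is missed: after the first three digits the
# pattern allows only an optional ")" and optional whitespace, not "-"
print(len(found_numbers))
```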


---
<a name="Additional_Resources"></a>
      
# 5. Additional Resources

There are a lot of resources on regular expressions.  Here are a few to check out:

* https://docs.python.org/3/howto/regex.html
* http://www.pyregex.com/
* http://www.bogotobogo.com/python/python_regularExpressions.php
* http://howardabrams.com/regexp/
* https://www.regextester.com/ 

Whenever I work with regular expressions, I look for online tools that can help me. For example, the following is a really great tool for experimenting with regular expressions: 

https://regex101.com/

&#9989; **<font color=red>DO THIS:</font>**  Use the [regex101](https://regex101.com/) tool to generate regular expressions for the following:

&#9989; **<font color=red>QUESTION:</font>** Write a regular expression to find valid **email addresses** in a body of text (ex. dirk@colbry.com, colbrydi@msu.edu, dirkcolbry+junk@gmail.com).  Include both positive and negative examples for testing. 

Put your regular expression and test code here

&#9989; **<font color=red>QUESTION:</font>** Write a regular expression to find valid **hashtags** in a body of text (ex. #Election2016, #RegexpressionsAreAwsome, #whatisahashtag). Include both positive and negative examples for testing.

Put your regular expression and test code here

&#9989; **<font color=red>QUESTION:</font>** Write a more general regular expression to find valid **phone numbers** in a body of text (ex. (517) 432-0455, 432-0455, 517-432-0455, 1-517-432-0455). Include both positive and negative examples for testing.

Put your regular expression and test code here
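
One possible way to organize positive and negative examples for the questions above is sketched below. It deliberately uses an unrelated placeholder pattern (a five-digit ZIP code), so the pattern and the test strings are assumptions for illustration and should be swapped for your own.

```python
import re

# Placeholder pattern: exactly five digits (replace with your own pattern)
candidate_pattern = re.compile(r'\d{5}')

should_match = ['48824', '90210']               # positive examples
should_not_match = ['4882', 'ABCDE', '488245']  # negative examples

# Positive examples that the pattern wrongly rejects
false_negatives = [s for s in should_match
                   if not candidate_pattern.fullmatch(s)]

# Negative examples that the pattern wrongly accepts
false_positives = [s for s in should_not_match
                   if candidate_pattern.fullmatch(s)]

print('false negatives:', false_negatives)
print('false positives:', false_positives)
```

Both lists being empty suggests the pattern behaves as intended on the chosen examples; it does not prove the pattern is correct in general.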

----
<a name="Assignment_wrap-up"></a>
# 6. Assignment wrap-up

Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**

[Direct Link to Google Form](https://cmse.msu.edu/cmse802-pc-survey)


If you have trouble with the embedded form, please make sure you log on with your MSU google account at [googleapps.msu.edu](https://googleapps.msu.edu) and then click on the direct link above.

&#9989; **<font color=red>Assignment-Specific QUESTION:</font>** What is the regular expression and testing code you used to find hashtags?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  Summarize what you did in this assignment.

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  What questions do you have, if any, about any of the topics discussed in this assignment after working through the jupyter notebook?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  How well do you feel this assignment helped you to achieve a better understanding of the above mentioned topic(s)?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** What was the **most** challenging part of this assignment for you? 

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** What was the **least** challenging part of this assignment for you? 

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  What kind of additional questions or support, if any, do you feel you need to have a better understanding of the content in this assignment?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>**  Do you have any further questions or comments about this material, or anything else that's going on in class?

Put your answer to the above question here

&#9989; **<font color=red>QUESTION:</font>** Approximately how long did this pre-class assignment take?

Put your answer to the above question here

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://cmse.msu.edu/cmse802-pc-survey?embedded=true" 
	width="100%" 
	height="1200px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

---------
### Congratulations, we're done!

To get credit for this assignment you must fill out and submit the above survey form on or before the assignment due date.

### Course Resources:


- [Website](https://msu-cmse-courses.github.io/cmse802-f20-student/)
- [ZOOM](https://msu.zoom.us/j/97272546850)
- [Syllabus](https://docs.google.com/document/d/e/2PACX-1vT9Wn11y0ECI_NAUl_2NA8V5jcD8dXKJkqUSWXjlawgqr2gU5hII3IsE0S8-CPd3W4xsWIlPAg2YW7D/pub)
- [Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vQRAm1mqJPQs1YSLPT9_41ABtywSV2f3EWPon9szguL6wvWqWsqaIzqkuHkSk7sea8ZIcIgZmkKJvwu/pubhtml?gid=2142090757&single=true)



Written by Dirk Colbry, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

-----