# Python examples in lecture 3
* This file is a Jupyter notebook. To run it, download it from the DLE and run it on your own machine.
* Alternatively, run it on Google Colab <https://colab.research.google.com> using your Google account. This may be slower than running it on your own machine.


## Why do we need to manipulate text?
* We need to extract specific parts of a text, such as names or numbers. This is an important part of data cleaning.
* We may need to correct misspelled words.
* When you download web pages you get additional tags (such as \<li\>) that must be removed.

Python has excellent string processing tools.
See https://greenteapress.com/thinkpython2/html/thinkpython2009.html from the book Think Python

* There is potentially useful information in free text written by customers.
* The text needs to be "cleaned up" so that it can be potentially used in machine learning.
* The review below is from Trustpilot  https://uk.trustpilot.com/

![RegBook](https://github.com/cmcneile/COMP5000-2023-lectures/blob/main/trustPilot.png?raw=true)




## Example of possible text manipulation

Often we need to extract specific parts of a document

List of food Cost<br>
Cost 10 pounds meat<br>
Cost 5  pounds fruit<br>
Cost 2  Chocolate<br>


In the above example we might want to extract 10, 5 and 2. How
do we match **Cost** at the start of the line?


## String manipulation methods

* To deal with text files we often need to change or extract parts of the text.
* The string class in Python has some useful tools.

* See https://www.w3schools.com/python/python_ref_string.asp for more information.

* There are more powerful ways of manipulating strings using **regular expressions**.



## Example of string manipulation
* The **split** method breaks a string into a list of the pieces that were separated by spaces.
* There are additional methods such as **lower** (which converts the text to lower case).

In [1]:
sentence = "List of food Cost"
sentence_list = sentence.split()
print(sentence_list)
for xx in sentence_list :
    print(xx )
    if "Cost" in xx :
        print("Cost has been found")


['List', 'of', 'food', 'Cost']
List
of
food
Cost
Cost has been found
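
The **lower** method mentioned above is not used in the cell, so here is a minimal sketch of how it could be combined with **split** to make the search ignore case. The sentence with `COST` in capitals is made up for illustration.

```python
# Case-insensitive word search: convert to lower case before comparing.
sentence = "List of food COST"          # hypothetical input with different case
words = sentence.lower().split()        # lower-case the text, then split on spaces
has_cost = "cost" in words              # membership test on the list of words
print(words)
print("Cost found:", has_cost)
```

Testing membership in the list matches whole words only, whereas `"cost" in sentence.lower()` would also match inside longer words.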


## Searching strings

* There are commands to search for substrings in strings. 
* These return the first position of the substring, or -1 if the substring is not part of the string.

In [2]:
text = "List of food Cost"
xxx = text.find("of") 
print ("Location of of  " ,  xxx)
yyy = text.find("blue") 
print ("Location of blue " ,  yyy)

Location of of   5
Location of blue  -1


## Example of replacing text
* It is important to be able to change strings

In [3]:
name = "<li> The first example"
namer = name.replace("<li> " , "")
print("Initial string = ", name)
print("Modified string = ", namer)


Initial string =  <li> The first example
Modified string =  The first example


## Further string manipulation

* How do we search for a word at the start or end of a string?
* How do you search for a pattern, such as four digits, e.g. 4567?

Regular expressions are a powerful way to search and replace strings.
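
For the first question, when the word is fixed, the built-in string methods `startswith` and `endswith` may already be enough. The sketch below is only an illustration using the food list from earlier. It does not help with the second question (a pattern such as four digits), which is where regular expressions are needed.

```python
# Fixed-word checks at the start or end of a string, without regular expressions.
header = "List of food Cost"
item = "Cost 10 pounds meat"

header_ends = header.endswith("Cost")      # word at the end of the string
item_starts = item.startswith("Cost")      # word at the start of the string
header_starts = header.startswith("Cost")  # False: the header starts with "List"

print(header_ends, item_starts, header_starts)
```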

##  Introduction to regular expressions


* There is a module called **re** that contains the regular expression functions.
* Regular expressions can search for text and replace parts of strings.
* Regular expressions allow you to match text at the start of a line, for example.
* There is a powerful syntax for manipulating strings, which is similar in many computer languages.
* See examples on https://www.w3schools.com/python/python_regex.asp



I am not going to discuss the performance of regular expressions,
because I personally only use them for small strings.


## Using regular expressions to replace text

* The regular expressions are part of a module called **re**
* The function **re.sub** can replace parts of strings.


In [3]:
import re
in_string = "Hello Student"
out_string = re.sub("Student", "Roger" , in_string)
print(in_string)
print(out_string)

Hello Student
Hello Roger


In [5]:
import re
Email = "Dear Name, You didn't get the job."
for name in [ "Roger" , "Mary" ]  :
    EmailOut = re.sub("Name", name, Email)
    print ("Email", EmailOut)

Email Dear Roger, You didn't get the job.
Email Dear Mary, You didn't get the job.


The string **EmailOut** would be sent to a 
function that sends out an
email.

## Regular expressions

There are some special symbols which match generic patterns.


* . matches any character except a newline.
* ^ matches the start of the string.
* $ matches the end of the string, or just before the newline at the end of the string.
* \d matches any digit (0 - 9).
* \* repeats the previous expression zero or more times, matching as many repetitions as possible.
* \s matches any white space character; \S matches any character that is not white space.
* \D matches any non-digit character.

This is a topic where it is helpful to see some examples.
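
As a first quick illustration, the sketch below tries each symbol on one made-up sentence. The sample string and the variable names are assumptions chosen for this example.

```python
import re

sample = "Room 42 costs 15 pounds"   # hypothetical sample text

single_digits = re.findall(r"\d", sample)         # \d : each digit on its own
room_number = re.search(r"Room \d*", sample).group()  # * : as many digits as possible
any_char = re.search(r"c.sts", sample).group()    # . : any single character
at_end = bool(re.search(r"pounds$", sample))      # $ : match at the end of the string
no_letters = re.sub(r"\D", "", sample)            # \D : delete every non-digit
pieces = re.split(r"\s", sample)                  # \s : split on white space
first_word = re.search(r"\S*", sample).group()    # \S : run of non-white-space characters

print(single_digits, room_number, any_char, at_end)
print(no_letters, pieces, first_word)
```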


##  Searching
* The function **re.search** looks for a pattern in a string

In [7]:
import re
xxx = " Cost List of food Cost"

if "Cost" in xxx :
    print("Cost is in the string")
    
if re.search(r"Cost$" , xxx) :
    print("Cost is at the end ")

if re.search(r"^Cost" , xxx) :
    print("Cost is at the start")

Cost is in the string
Cost is at the end 


## Another example 

In [7]:
def look_for_Cost(xxx) :
    if re.search(r"Cost$" , xxx) :   # match at end of string
        print("Cost is at the end ")

    if re.search(r"^Cost" , xxx) :  # match at start of string
        print("Cost is at the start")

look_for_Cost("List of food Cost")
look_for_Cost("Cost 10 pounds meat")

Cost is at the end 
Cost is at the start


##  Examples of Regular expressions
*  \\D below means any character that is not a digit
* The replacement deletes every non-digit character, leaving only the digits

In [8]:
import re
text = "My name is Roger and my ID is 5674231"   # avoid the name input, which is a built-in function
digits_only = re.sub(r"\D", "", text)            # delete every non-digit character
print("Student ID", digits_only)

Student ID 5674231
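
Deleting every non-digit works here because the string contains only one number. If there were two numbers, the digits would run together. A possible alternative, sketched below, is `re.findall` with `\d+`, where `+` (one or more repetitions) is a symbol not in the list above. The second number in the sample string is made up.

```python
import re

message = "My name is Roger, my ID is 5674231 and my room is 42"  # hypothetical text

joined = re.sub(r"\D", "", message)      # all digits joined into one string
numbers = re.findall(r"\d+", message)    # each run of digits kept separate

print("Joined digits:", joined)
print("Separate numbers:", numbers)
```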


## Example
* Write a regular expression to check that a string contains exactly 4 digits and nothing else
* The application would be to check that the format of a PIN is correct

In [11]:
def match_4_digits(xxx) :
  if re.search(r"^\d\d\d\d$", xxx) :
      print(xxx, "has 4 digits")

match_4_digits("12")
match_4_digits("123")
match_4_digits("2345")
match_4_digits("1X34")
match_4_digits("1234 ")


2345 has 4 digits
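
A more compact way to write the same check, sketched here as one possible variant, is the repetition count `\d{4}` together with `re.fullmatch`. The function name `is_pin` is made up for this example. One difference from the version above is that `$` also matches just before a newline at the end of the string, so `"1234\n"` passes the `^\d\d\d\d$` test but not `re.fullmatch`.

```python
import re

def is_pin(candidate):
    """Return True if candidate consists of exactly four digits."""
    return re.fullmatch(r"\d{4}", candidate) is not None

for candidate in ["12", "2345", "1X34", "1234 ", "1234\n"]:
    print(repr(candidate), is_pin(candidate))

# The search-based pattern accepts a trailing newline; fullmatch does not.
search_accepts_newline = re.search(r"^\d\d\d\d$", "1234\n") is not None
print("search accepts trailing newline:", search_accepts_newline)
```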


##  Example of Regular expressions

* Extract a phone number from a string.
* .*\$  matches any characters up to the end of the line

In [11]:
import re
phone = "2004-959-559 # This is Phone Number"
print ("Input " , phone)
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)  
print ("Part A : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print ("Phone Number : ", num)


Input  2004-959-559 # This is Phone Number
Part A :  2004-959-559 
Phone Number :  2004959559


## Example of possible text manipulation

Often we need to extract specific parts of a document

List of food Cost<br>
Cost 10 pounds meat<br>
Cost 5  pounds fruit<br>
Cost 2  Chocolate<br>

Plan: following test driven development, we know the expected answer first. The total cost should be 17.

Live coding example:



In [22]:
import re
text = """
List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate
"""

cost = 0
#print(text)
text_lines = text.split("\n")
#print(text_lines)
for line in text_lines :
    if re.search(r"^Cost", line):
        # print(line)
        tmp = line.split(" ")    # e.g. ['Cost', '10', 'pounds', 'meat']
        print(tmp[1])
        cost = cost + int(tmp[1])
print(cost)

10
5
2
17


In [21]:
# Solution
text = """
List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate
"""

cost = 0
#print(text)
text_lines = text.split("\n")
#print(text_lines)
for x_ in text_lines :
    if re.search("^Cost" , x_) :
       x_split = x_.split()
       cost += int( x_split[1])
print("Cost = " , cost)

Cost =  17
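
The loop over lines can also be folded into a single regular expression. The sketch below is one possible variant, not the method used in the lecture. It relies on the `re.MULTILINE` flag, which makes `^` match at the start of every line, and on a group `(\d+)` so that `re.findall` returns only the number.

```python
import re

text = """
List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate
"""

# ^Cost must be at the start of a line; (\d+) captures the number that follows.
amounts = re.findall(r"^Cost +(\d+)", text, flags=re.MULTILINE)
total = sum(int(amount) for amount in amounts)

print(amounts)
print("Cost = ", total)
```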


## The power of regular expressions

So far we can do simple replacements of substrings.

* I have shown some examples, but you need to read the documentation to fully use all the tricks.
* See the documentation https://docs.python.org/3/library/re.html
* Regular expressions are a language, but not as complete a language as Python.
* Once you have understood the basics, regular expressions are similar in most languages.

##  My history with regular expressions

* I started using regular expressions with UNIX tools: grep, sed, awk.
* I then used regular expressions with a language called Perl.
* I don't use the full power of regular expressions.

![RegBook](https://m.media-amazon.com/images/I/51s3zpVhkYL._SX260_.jpg)

## Background to regular expressions

There is a long history of regular expressions.

* A regular expression (shortened as regex or regexp) is a sequence of characters that defines a search pattern.
* The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language.
* The first computer implementations were in the late 1960s.


More detail at https://en.wikipedia.org/wiki/Regular_expression
and the book:
Mastering Regular Expressions: Understand Your Data and Be More
Productive 
by Jeffrey E. F. Friedl .

## Natural Language processing

Regular expressions allow us to split a text
into sub parts.

* The next step is Natural Language Processing (NLP), where the library knows the structure of English.
* The Natural Language Toolkit (nltk) is a Python library for NLP: https://www.nltk.org/

One of the professions that AI may radically change is lawyers.
For example, "Will A.I. Put Lawyers Out Of Business?" 
https://www.forbes.com/sites/cognitiveworld/2019/02/09/will-a-i-put-lawyers-out-of-business/?sh=4df4c25131f0 .

For example, NLP is used to extract information from legal 
contracts https://pythonawesome.com/a-spacy-pipeline-and-model-for-nlp-on-unstructured-legal-text/ .


## Basic example of nltk
* You may need to install the nltk module  https://anaconda.org/anaconda/nltk
* nltk knows about punctuation, so it breaks strings down into language tokens.

In [20]:
import nltk
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('omw-1.4')

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good, but he tested negative for COVID."""
tokens = nltk.word_tokenize(sentence,language='english', preserve_line=True)

print(sentence)
print(tokens)

At eight o'clock on Thursday morning
Arthur didn't feel very good, but he tested negative for COVID.
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', ',', 'but', 'he', 'tested', 'negative', 'for', 'COVID', '.']
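
For comparison, here is a rough sketch of a tokenizer built only from a regular expression. The pattern is an assumption chosen for illustration: `\w+` picks up runs of letters or digits, and `[^\w\s]` picks up single punctuation characters. It has no knowledge of English, so it splits "didn't" at the apostrophe, whereas nltk above produced "did" and "n't".

```python
import re

sentence = "Arthur didn't feel very good, but he tested negative for COVID."

# A word is a run of \w characters; anything else that is not white space
# is treated as a one-character token.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```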


##  Web scraping project

* Web scraping is downloading a web page and extracting information https://en.wikipedia.org/wiki/Web_scraping
* Web scraping is useful, for example, to find customer information (a 360 or 720 degree view of customers).
* If possible it is better to use an API (such as the Twitter API) to download the information, but sometimes there is no API, for example for Facebook or Trustpilot (https://uk.trustpilot.com/).
* Some web servers block or restrict automated downloads of their pages.


![Customer](https://www.insurancethoughtleadership.com/sites/default/files/wp/2015/03/Untitled.png)

##  Example

*  As an example we look at extracting information from https://webscraper.io/test-sites/tables
*  This is a test site to practice web scraping (which means the structure is simpler than other sites)
*  The goal is to extract the **usernames**

![scrape](https://github.com/cmcneile/COMP5000-2022-lectures/blob/main/webscrape.png?raw=true)

##  Plan of the project

<blockquote>
    Extract the usernames from the web page
</blockquote>

We start from a basic plan. In this case the overall plan 
is suggested by the problem.


* **Download the web page** There will be a python module to do that.

* **Convert the web page into lines in a list**
    * Loop through the lines.
    * Use regular expressions or string matching to extract lines containing the username such as @mdo

## Python module to web scrape
* We shall use a module to download the web page
* Use https://urllib3.readthedocs.io/en/stable/ 
* Start from the example on the above web page.


In [20]:
#  conda install -c conda-forge urllib3   # command to install
import urllib3
http = urllib3.PoolManager()
import re
url_="https://webscraper.io/test-sites/tables"
r = http.request('GET',url_ )
print("Status " , r.status)
print("rdata = " , r.data)

Status  200
rdata =  b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<!-- Google Tag Manager -->\n<script nonce="768RYLf6zgq9yhbdhjULJxjJb4ZRLDhV">(function (w, d, s, l, i) {\n\t\tw[l] = w[l] || [];\n\t\tw[l].push({\n\t\t\t\'gtm.start\':\n\t\t\t\tnew Date().getTime(), event: \'gtm.js\'\n\t\t});\n\t\tvar f = d.getElementsByTagName(s)[0],\n\t\t\tj = d.createElement(s), dl = l != \'dataLayer\' ? \'&l=\' + l : \'\';\n\t\tj.async = true;\n\t\tj.src =\n\t\t\t\'https://www.googletagmanager.com/gtm.js?id=\' + i + dl;\n\t\tf.parentNode.insertBefore(j, f);\n\t})(window, document, \'script\', \'dataLayer\', \'GTM-NVFPDWB\');</script>\n<!-- End Google Tag Manager -->\n\t<title>Table Playground | Web Scraper Test Sites</title>\n\t<meta charset="utf-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n\t<meta name="keywords"\n\t\t  content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper"/>\n\t<meta name="description"\n\t\t  content="The most popular web 

##  Two complications
*  The function returns the data as bytes, not as a string.
* The bytes https://www.w3resource.com/python/python-bytes.php#byte-string need to be converted to a string with **decode**.
* The document also contains HTML tags, which describe the format of the document.

In [23]:
import urllib3
http = urllib3.PoolManager()
url_="https://webscraper.io/test-sites/tables"
r = http.request('GET', url_ )
print(type(r.data))
rr = r.data.decode('utf-8')
print(rr)
#type(2)

<class 'bytes'>
<!DOCTYPE html>
<html lang="en">
<head>
	<!-- Google Tag Manager -->
<script nonce="zuUekavuHNJgO0TdZ5yv0twMYS25KOWl">(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
	<title>Table Playground | Web Scraper Test Sites</title>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">

	<meta name="keywords"
		  content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper"/>
	<meta name="description"
		  content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scrape

![acsi](https://github.com/cmcneile/COMP5000-2022-lectures/blob/main/asci.png?raw=true)

![unicodeA](https://github.com/cmcneile/COMP5000-2022-lectures/blob/main/unicode_A.png?raw=true)

![unicodeB](https://github.com/cmcneile/COMP5000-2022-lectures/blob/main/unicode_B.png?raw=true)
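
The relation between strings and bytes can be tried without downloading anything. The sketch below uses a made-up string with two non-ASCII characters. In UTF-8 each of them takes two bytes, so the bytes object is longer than the string.

```python
# Round trip between str and bytes using UTF-8.
text = "Café costs £3"                 # hypothetical text with non-ASCII characters
data = text.encode("utf-8")            # str -> bytes
decoded = data.decode("utf-8")         # bytes -> str, as done with r.data above

print(type(data), len(text), len(data))
print(data)
print(decoded)
```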

##  Next version of the code
* We can loop over the lines in the document
* The goal is to search for user names  @mdo so we match lines containing @

In [26]:
import urllib3
import re
http = urllib3.PoolManager()
url_="https://webscraper.io/test-sites/tables"

r = http.request('GET', url_ )
rr = r.data.decode('utf-8')


##sys.exit(1)
rrr = rr.split("\n") # recall \n is the new line
# rrr is a list with each line 
for ll in rrr :
    if re.search("@",ll) :
        print(ll)


	<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@400;500;600&display=swap" rel="stylesheet">
	<link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@300;400;500;600;700;900&display=swap" rel="stylesheet">
				<td>@mdo</td>
				<td>@fat</td>
				<td>@twitter</td>
				<td>@hp</td>
				<td>@dunno</td>
				<td>@timbean</td>
						<a href="mailto:info@webscraper.io">info@webscraper.io</a>
						<a href="https://youtube.com/@WebScraper/videos" target="_blank" rel="noopener" aria-label="Web Scraper on Youtube">


##  Background to HTML

* HTML (HyperText Markup Language) is a markup language.  https://www.w3schools.com/html/
* There are many other text-based formats, such as the markup language XML and the data format YAML https://en.wikipedia.org/wiki/YAML.

The tags are used to tell the browser how to display the document
(eg. section headings or bullet points.)


* \<p\> \</p\>   paragraph
* \<li\>   list item (bullet point)
* \<h1\> \</h1\>   section heading


When we are extracting information from the downloaded text we mostly
want to remove the tags.
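
A minimal sketch of tag removal with a regular expression is shown below. The pattern `<[^>]+>` (a `<`, then anything that is not `>`, then `>`) is an assumption that works for simple lines like these. It is not a full HTML parser and can fail on comments, scripts, or a `>` inside an attribute.

```python
import re

def strip_tags(line):
    """Remove anything that looks like <...> and trim surrounding white space."""
    return re.sub(r"<[^>]+>", "", line).strip()

cell = strip_tags("\t\t\t\t<td>@mdo</td>")
item = strip_tags("<li> The <b>first</b> example")

print(cell)
print(item)
```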


##  Solution
* The goal was to download the usernames such as @mdo
* We need to get rid of additional text such as \<td\>

In [29]:
import sys
import urllib3
import re
http = urllib3.PoolManager()
url_="https://webscraper.io/test-sites/tables"
r = http.request('GET', url_ )
rr = r.data.decode('utf-8')
rrr = rr.split("\n") # recall \n is the new line
# rrr is a list with each line 
for ll in rrr :
    if re.search("<td>@",ll) :
        ll = re.sub("<td>","", ll)
        ll = re.sub("</td>","", ll)
        ll = ll.strip()
        print(ll)


@mdo
@fat
@twitter
@hp
@dunno
@timbean


We now have the list of user IDs.
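
The search and the two substitutions can also be combined into one pattern with a group. The sketch below runs on a few hand-written lines that imitate the downloaded page, so it needs no network access. The sample lines are assumptions, not a copy of the real page.

```python
import re

# Hand-written lines in the style of the downloaded page (hypothetical sample).
sample_lines = [
    "\t\t\t\t<td>1</td>",
    "\t\t\t\t<td>Mark</td>",
    "\t\t\t\t<td>@mdo</td>",
    '\t\t<a href="mailto:info@webscraper.io">info@webscraper.io</a>',
    "\t\t\t\t<td>@fat</td>",
]

usernames = []
for ll in sample_lines:
    # The brackets capture just the username between <td> and </td>.
    match = re.search(r"<td>(@\w+)</td>", ll)
    if match:
        usernames.append(match.group(1))

print(usernames)
```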

## Summary of the process

How to write Python code to solve these "small problems".



* Break the problem into small steps
* Develop Python code for each small step, by mapping the problem onto small parts of Python.
    * Loops: for, while, ...
    * Data structures: variables, lists, dictionaries, ...
    * Functions, ...
* For specialized tasks, you also need to find an appropriate module. For example, we needed a function to download the web page.
* When I have a first draft of a script, I often improve it. Writing code is an iterative process, particularly for web scraping.


## Another solution

* A potentially better library for scraping web pages is **Beautiful Soup**
*  https://www.crummy.com/software/BeautifulSoup/ Beautiful Soup can help you work with HTML documents downloaded from web pages


In [24]:
import requests
from bs4 import BeautifulSoup
import re

URL = "https://webscraper.io/test-sites/tables"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

cells = soup.find_all("td")    # every table cell in the page
for cell in cells:
    ll = cell.text             # the text inside the cell, without the tags
    if re.search("@",ll) :
        print(ll)


@mdo
@fat
@twitter
@hp
@dunno
@timbean
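
If Beautiful Soup is not installed, a similar result can be sketched with `html.parser` from the Python standard library. The class name `CellCollector` and the sample HTML string are made up for this illustration, and the code only handles simple tables without nested cells.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text found inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data     # add text to the current cell

# Hypothetical HTML in the style of the test site.
sample_html = ("<table><tr><td>1</td><td>Mark</td><td>@mdo</td></tr>"
               "<tr><td>2</td><td>Jacob</td><td>@fat</td></tr></table>")

parser = CellCollector()
parser.feed(sample_html)
usernames = [c.strip() for c in parser.cells if c.strip().startswith("@")]
print(usernames)
```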


##  Final comments on web scraping

* The Beautiful Soup web scraping code is a bit simpler than my first version.
* See this review of web scraping using python  https://www.scrapingbee.com/blog/web-scraping-101-with-python/

* Web sites are now used as training data for Large Language Models.


##  Semantic web
<blockquote>
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize
</blockquote>
Tim Berners-Lee originally expressed his vision of the Semantic Web in 1999.


* Semantic web  https://en.wikipedia.org/wiki/Semantic_Web
* Not yet fully implemented



## ChatGPT and web scraping

I tried using ChatGPT to extract the usernames.

![RegBook](https://github.com/cmcneile/COMP5000-2024-lectures/blob/main/webscrapeChatGPT.png?raw=true)


##  Warning

* A PhD student I work with gave a presentation at a conference about a data science project based on web scraping.
* One question was "is the web scraping legal?"
* He had scraped data from Trustpilot, so that was fine.  https://www.setuserv.com/trustpilot-scraping-service/