# Python examples in lecture 4
* This file is a Jupyter notebook. To run it, download it from the DLE and run it on your own machine.
* Alternatively, run it on Google Colab <https://colab.research.google.com> using your Google account. This may be slower than running it on your own machine.
* Information on downloading notebooks from the store to your computer: https://youtu.be/1zY7hIj5tWg

## Basic example of string manipulation

* Python has excellent string processing tools.
* See https://greenteapress.com/thinkpython2/html/thinkpython2009.html from the book Think Python
* The example shows how to combine two strings
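
A minimal sketch of combining two strings. The variable names and example values are made up for illustration and are not part of the lecture.

```python
# Hypothetical example values
first = "List of"
second = "food"

# The + operator joins strings end to end
combined = first + " " + second
print(combined)

# An f-string builds the same result by substitution
combined_f = f"{first} {second}"
print(combined_f)

# join glues a list of strings together with a separator
combined_j = " ".join([first, second])
print(combined_j)
```

All three approaches give the same string. The f-string is often the easiest to read when several variables are mixed with fixed text.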

## Why do we need to manipulate text?
* We need to extract specific parts of a text, such as names or numbers. This is an important part of data cleaning.
* We may need to correct misspelled words.
* When you download web pages you get additional tags (such as \<li\>) that must be removed.

## Example of possible text manipulation

Often we need to extract specific parts of a document

List of food Cost<br>
Cost 10 pounds meat<br>
Cost 5  pounds fruit<br>
Cost 2  Chocolate<br>


In the above example we might want to extract 10, 5 and 2. How
do we match **Cost** at the start of the line?
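
One possible way to extract the numbers with plain string methods is sketched below. It assumes that every cost line starts with the word Cost followed by a number. The names `text` and `costs` are illustrative.

```python
text = """List of food Cost
Cost 10 pounds meat
Cost 5  pounds fruit
Cost 2  Chocolate"""

costs = []
for line in text.split("\n"):
    # Keep only the lines that begin with the word Cost
    if line.startswith("Cost"):
        words = line.split()          # split on any run of white space
        costs.append(int(words[1]))   # the number is the second word

print(costs)
```

The first line contains Cost only at the end, so `startswith` skips it.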


## String manipulation methods

* To deal with text files we often need to change or extract parts of the text.
* The string class in Python has some useful tools.

* https://www.w3schools.com/python/python_ref_string.asp

* There are more powerful ways of manipulating strings using **regular expressions**.



## Example of string manipulation
* The split method can break a string into a list of words.

In [20]:
sentence = "List of food Cost"
sentence_list = sentence.split()
for xx in sentence_list :
    print(xx, xx.lower() )


List list
of of
food food
Cost cost


## Searching strings

* There are commands to search for substrings in strings. 
* These return the first position of the substring, or -1 if the substring is not part of the string.

In [21]:
text = "List of food Cost"
xxx = text.find("of") 
print ("Location of of  " ,  xxx)
yyy = text.find("blue") 
print ("Location of blue " ,  yyy)

Location of of   5
Location of blue  -1


In [22]:
name = "<li> The first example"
namer = name.replace("<li> " , "")
print(name)
print(namer)
namerA = namer.title()
print(namerA)

<li> The first example
The first example
The First Example


## Further string manipulation

* How do we search for a word at the start or end of a string?
* How do we search for a pattern?

Regular expressions are a powerful way to search and replace strings.

##  Introduction to regular expressions

* Regular expressions are a powerful way to manipulate text.
* They are provided by a module called **re**.
* Regular expressions can search for text and replace parts of strings.
* The syntax for describing patterns is powerful, and it is similar in many computer languages.



I am not going to discuss the performance of regular expressions,
because I personally only use them for small strings.


## Using regular expressions to replace text

* The regular expressions are part of a module called **re**
* The function **re.sub** can replace parts of strings.


In [23]:
import re
Email = "Dear Name, You didn't get the job."
for name in [ "Roger" , "Mary" ]  :
    EmailOut = re.sub("Name", name, Email)
    print ("Email:", EmailOut)

Email: Dear Roger, You didn't get the job.
Email: Dear Mary, You didn't get the job.


The string **EmailOut** would be sent to a function that sends out an email.

## Regular expressions

There are some special symbols which match generic patterns.


* . matches any character except a newline.
* ^ matches the start of the string.
* $ matches the end of the string, or just before the newline at the end of the string.
* \d matches any digit (0 - 9).
* \* repeats the previous expression zero or more times, as many times as possible.
* \s matches any white space character, and \S matches any character that is not white space.
* \D matches any non-digit character.

This is a topic where it is helpful to see some examples.
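
As a first illustration, the sketch below tries each special symbol on one line of the food list. The variable names are made up for this example.

```python
import re

sample = "Cost 10 pounds meat"

# \d matches one digit; \d\d* matches a digit followed by zero or more digits
first_digit = re.search(r"\d", sample).group()
number = re.search(r"\d\d*", sample).group()
print(first_digit, number)

# . matches any single character except a newline
three_chars = re.search(r"C..", sample).group()
print(three_chars)

# $ anchors the match to the end of the string
ends_with_meat = bool(re.search(r"meat$", sample))
print(ends_with_meat)

# \s matches white space, \S matches anything that is not white space
no_spaces = re.sub(r"\s", "", sample)
first_word = re.search(r"\S*", sample).group()
print(no_spaces, first_word)

# \D matches any character that is not a digit
digits_only = re.sub(r"\D", "", sample)
print(digits_only)
```

Note the `r` in front of each pattern. A raw string stops Python from interpreting the backslash itself before the **re** module sees it.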


In [24]:
xxx = """List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate""" 

print(xxx)
for xxx_ in xxx:
    print("> ", xxx_)

List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate
>  L
>  i
>  s
>  t
>   
>  o
>  f
>   
>  f
>  o
>  o
>  d
>   
>  C
>  o
>  s
>  t
>  

>  C
>  o
>  s
>  t
>   
>  1
>  0
>   
>  p
>  o
>  u
>  n
>  d
>  s
>   
>  m
>  e
>  a
>  t
>  

>  C
>  o
>  s
>  t
>   
>  5
>   
>  p
>  o
>  u
>  n
>  d
>  s
>   
>  f
>  r
>  u
>  i
>  t
>  

>  C
>  o
>  s
>  t
>   
>  2
>   
>  C
>  h
>  o
>  c
>  o
>  l
>  a
>  t
>  e


In [25]:
import re
xxx = "List of food Cost"

if "Cost" in xxx :
    print("Cost is in the string")

if re.search(r"Cost$" , xxx) :
    print("Cost is at the end ")

if re.search(r"^Cost" , xxx) :
    print("Cost is at the start")


Cost is in the string
Cost is at the end 


In [26]:
def look_for_Cost(xxx) :
    if re.search(r"Cost$" , xxx) :   # match at end of string
        print("Cost is at the end ")

    if re.search(r"^Cost" , xxx) :  # match at start of string
        print("Cost is at the start")

look_for_Cost("List of food Cost")
look_for_Cost("Cost 10 pounds meat")

Cost is at the end 
Cost is at the start
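
The same idea can be applied to the whole food list in one step. The sketch below is one possible approach, and it uses three features not covered above: parentheses to capture part of a match, `+` for one or more repetitions, and the `re.MULTILINE` flag, which makes `^` match at the start of every line.

```python
import re

text = """List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate"""

# ^Cost  : Cost at the start of a line
# \s+    : one or more white space characters
# (\d+)  : one or more digits, captured by the parentheses
costs = re.findall(r"^Cost\s+(\d+)", text, flags=re.MULTILINE)
print(costs)

# findall returns strings, so convert them before adding up
total = sum(int(c) for c in costs)
print(total)
```

The first line is skipped because its Cost is at the end of the line, not the start.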


##  Examples of Regular expressions

In [11]:
import re
text = "My name is Roger and my ID is 5674231"   # avoid shadowing the built-in input()
text_id = re.sub(r"\D", "", text)                  # raw string so \D reaches the re module unchanged
print("Student ID", text_id)

Student ID 5674231


##  Example of Regular expressions

Extract a phone number from a string.


In [12]:
import re
phone = "2004-959-559 # This is Phone Number"
print ("Input " , phone)
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", num)
print ("Phone Num : ", num)


Input  2004-959-559 # This is Phone Number
Phone Num :  2004-959-559 
Phone Num :  2004959559


## Summary of regular expressions

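* **re.search** tests whether a pattern occurs in a string.
* **re.sub** replaces the parts of a string that match a pattern.
* Special symbols such as ^, $, \d and \D match generic patterns.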

## The power of regular expressions

So far we can do simple replacements of substrings.

* I will show some more sophisticated examples, but you need to read the documentation to use all the tricks.
* See the documentation https://docs.python.org/3/library/re.html
* Regular expressions are a language.
* Once you have understood the basics, the ideas are similar in most languages.

##  My history with regular expressions

* I started using regular expressions with UNIX tools: grep, sed, awk.
* I then used regular expressions with a language called Perl.
* I don't use the full power of regular expressions.

![RegBook](https://m.media-amazon.com/images/I/51s3zpVhkYL._SX260_.jpg)

## Background to regular expressions

There is a long history of regular expressions.

* A regular expression (shortened as regex or regexp) is a sequence of characters that defines a search pattern.
* The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language.
* The first computer implementations were in the late 1960s.


More detail at https://en.wikipedia.org/wiki/Regular_expression
and in the book
Mastering Regular Expressions: Understand Your Data and Be More Productive
by Jeffrey E. F. Friedl.

## Natural Language processing

Regular expressions allow us to split a text
into sub parts.

* The next step is Natural Language Processing (NLP), where the library knows the structure of English.
* Natural Language Toolkit library in Python: https://www.nltk.org/

One of the professions that AI may radically change is law.
See, for example, "Will A.I. Put Lawyers Out Of Business?"
https://www.forbes.com/sites/cognitiveworld/2019/02/09/will-a-i-put-lawyers-out-of-business/?sh=4df4c25131f0 .

NLP is also used to extract information from legal
contracts: https://pythonawesome.com/a-spacy-pipeline-and-model-for-nlp-on-unstructured-legal-text/ .


## Basic example of nltk
* You may need to install the nltk module: https://anaconda.org/anaconda/nltk
* nltk knows about punctuation, so it separates punctuation marks from the words around them.

In [3]:
import nltk
nltk.download('punkt')

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good, but he tested negative for COVID."""
tokens = nltk.word_tokenize(sentence)

print(sentence)
print(tokens)

At eight o'clock on Thursday morning
Arthur didn't feel very good, but he tested negative for COVID.
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', ',', 'but', 'he', 'tested', 'negative', 'for', 'COVID', '.']


[nltk_data] Downloading package punkt to /Users/cmcneile/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
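
For comparison, the sketch below applies the plain split method to the same sentence. It needs no extra modules. The variable name `words` is illustrative.

```python
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good, but he tested negative for COVID."""

# split only breaks on white space, so punctuation stays attached to the words
words = sentence.split()
print(words)
print(len(words), "words")
```

Here "good," and "COVID." keep their punctuation and "didn't" stays in one piece, whereas nltk returned them as separate tokens.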


##  Web scraping project

In [3]:
import urllib3
http = urllib3.PoolManager()
url_ = 'https://www.plymouth.ac.uk/schools/school-of-engineering-computing-and-mathematics'
r = http.request('GET', url_)
print("Status " , r.status)
print("rdata = " , r.data)

Status  200
rdata =  b'<!DOCTYPE html>\n<html lang="en" itemscope="itemscope" itemtype="http://schema.org/WebPage" class="">\n  <head>\n    <!-- Preload assets -->\n    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/application-707c6accb9405c1609a233782ddc1ba3502810493e0607902c9b169dd307f8f6.css" as="style" type="text/css">\n    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/application-28b262323718b1cfec5d1317e64aa8c3071ac6dd744196ab459f1ff537696af0.js" as="script" type="text/javascript">\n    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/icon-webfont-a9e5f93912dcd8125407e4c448f72d8b7bbac509d75f5f503159afdda5cf89e0.woff" as="font" type="font/woff" crossorigin="anonymous">\n\n    <!-- Prefetch DNS for external assets -->\n    <link rel="preconnect" href="https://platform.twitter.com/" >\n    <link rel="preconnect" href="https://www.google-analytics.com/" >\n    <link rel="preconnect" href="https://www.googletagman

In [4]:
import urllib3
http = urllib3.PoolManager()
url_='https://www.plymouth.ac.uk/schools/school-of-engineering-computing-and-mathematics'
r = http.request('GET', url_ )
rr = r.data.decode('utf-8')
print(rr)

<!DOCTYPE html>
<html lang="en" itemscope="itemscope" itemtype="http://schema.org/WebPage" class="">
  <head>
    <!-- Preload assets -->
    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/application-707c6accb9405c1609a233782ddc1ba3502810493e0607902c9b169dd307f8f6.css" as="style" type="text/css">
    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/application-28b262323718b1cfec5d1317e64aa8c3071ac6dd744196ab459f1ff537696af0.js" as="script" type="text/javascript">
    <link rel="preload" href="https://d3bpgcke55gfwt.cloudfront.net/assets/icon-webfont-a9e5f93912dcd8125407e4c448f72d8b7bbac509d75f5f503159afdda5cf89e0.woff" as="font" type="font/woff" crossorigin="anonymous">

    <!-- Prefetch DNS for external assets -->
    <link rel="preconnect" href="https://platform.twitter.com/" >
    <link rel="preconnect" href="https://www.google-analytics.com/" >
    <link rel="preconnect" href="https://www.googletagmanager.com/" >
    <link rel="precon

In [25]:
import urllib3
import re
http = urllib3.PoolManager()
url_="https://webscraper.io/test-sites/tables"

r = http.request('GET', url_ )

# Decode the bytes into text and process it one line at a time
rr = r.data.decode('utf-8')
rrr = rr.split("\n")
for ll in rrr :
    ll = ll.rstrip()             # rstrip returns a new string, so keep the result
    if re.search("<td>@",ll) :   # keep the table cells that hold a user name
        ll = re.sub("<td>","", ll)
        ll = re.sub("</td>","", ll)
        print(ll)


				@mdo
				@fat
				@twitter
				@hp
				@dunno
				@timbean
				@mdo
				@fat
				@twitter
				@mdo
				@fat
				@twitter
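
The extraction step can also be tried without a network connection. The sketch below is an illustration only: the HTML fragment is made up to resemble rows of such a table, and it uses parentheses to capture the text between the tags, a feature not covered above.

```python
import re

# Hypothetical HTML fragment, not downloaded from the real page
html = """<tr>
    <td>1</td>
    <td>Mark</td>
    <td>Otto</td>
    <td>@mdo</td>
</tr>
<tr>
    <td>2</td>
    <td>Jacob</td>
    <td>Thornton</td>
    <td>@fat</td>
</tr>"""

# Capture the text between <td> and </td> when it starts with @
# [^<]* means any run of characters that are not <
usernames = re.findall(r"<td>(@[^<]*)</td>", html)
print(usernames)

# set removes repeated names, sorted puts them in a fixed order
unique_names = sorted(set(usernames))
print(unique_names)
```

Capturing the name directly avoids the two separate calls to **re.sub**, and it also drops the leading white space.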


## Summary of the process

How to write Python code to solve these "small problems":



* Break the problem into small steps.
* Develop Python code for each small step, by mapping each part of the problem onto a small piece of Python.
    * Loops: for, while, ...
    * Data structures: variables, lists, dictionaries, ...
    * Functions, ...
* For specialized tasks, you also need to find an appropriate module. For example, we needed a function to download the web page.
* When I have a first draft of a script, I often improve it. Writing code is an iterative process.
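
As an illustration of this process, the sketch below breaks the food list problem into three small functions. The function names and the choice of a dictionary are assumptions made for this example, and it assumes that the item name is the last word on each line.

```python
def cost_lines(text):
    """Step 1: keep only the lines that start with Cost."""
    return [line for line in text.split("\n") if line.startswith("Cost")]

def parse_line(line):
    """Step 2: turn one line into an (item, cost) pair."""
    words = line.split()
    return words[-1], int(words[1])

def cost_table(text):
    """Step 3: combine the steps and store the result in a dictionary."""
    table = {}
    for line in cost_lines(text):
        item, cost = parse_line(line)
        table[item] = cost
    return table

text = """List of food Cost
Cost 10 pounds meat
Cost 5 pounds fruit
Cost 2 Chocolate"""

table = cost_table(text)
print(table)
print("Total", sum(table.values()))
```

Each function can be tested on its own, which makes it easier to improve the script step by step.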
