<H2>Regular Expressions</H2>

<H3>Regular Expressions Quick Guide</H3>

<P>^ Matches the beginning of a line</P>
<P>$ Matches the end of a line</P>
<P>. Matches any character (except a newline, by default)</P>
<p>\s Matches whitespace</p>
<p>\S Matches any non-whitespace character</p>
<p>* Repeats a character zero or more times</p>
<p>*? Repeats a character zero or more times (non-greedy)</p>
<p>+ Repeats a character one or more times</p>
<p>+? Repeats a character one or more times (non-greedy)</p>
<p>[aeiou] Matches a single character in the listed set</p>
<p>[^XYZ] Matches a single character not in the listed set</p>
<p>[a-z0-9] The set of characters can include a range</p>
<p>( Indicates where string extraction is to start</p>
<p>) Indicates where string extraction is to end</p>
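<p>A minimal sketch of a few guide entries in action. The sample string is made up for illustration and is not part of the mbox-short data. The snippet uses single-argument print() calls so it should run on Python 3 as well as the Python 2 used in this notebook.</p>

```python
import re

# Hypothetical sample line, used only to exercise the guide entries.
sample = 'From: alice@example.com  Sat Jan  5'

# ^ anchors the match at the beginning of the line.
starts_with_from = re.search(r'^From:', sample) is not None

# [aeiou] matches a single character from the listed set.
vowels = re.findall(r'[aeiou]', 'regex')

# \S+ matches a run of one or more non-whitespace characters.
tokens = re.findall(r'\S+', sample)

# Parentheses mark the part of the match that findall() returns.
domain = re.findall(r'@(\S+)', sample)

print(starts_with_from)
print(vowels)
print(tokens)
print(domain)
```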

<H3>Load the Regular Expression Library</H3>

In [3]:
import re

<H3>Load the mbox-short dataset</H3>

In [5]:
import urllib2

target_url = 'http://www.py4inf.com/code/mbox-short.txt'

<H3>Using re.search() like find()</H3>

In [6]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [8]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
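<p>The two cells above select the same lines because both tests only ask whether 'From:' occurs anywhere in the line. A small offline sketch of that equivalence, using made-up lines in place of the downloaded file:</p>

```python
import re

# Hypothetical lines standing in for the mbox-short file.
lines = [
    'From: alice@example.com',
    'Subject: hello',
    'Reply-To: From: bob@example.com',
]

# str.find() returns the index of the first hit, or -1 when absent.
found_with_find = [line for line in lines if line.find('From:') >= 0]

# re.search() returns a match object (truthy) or None (falsy).
found_with_search = [line for line in lines if re.search('From:', line)]

print(found_with_find == found_with_search)
```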


<H3>Fine-tune to find 'From:' at the beginning of the line</H3>

In [10]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [11]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
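<p>In mbox-short the anchored and unanchored searches happen to print the same lines. The difference only shows when 'From:' appears later in a line. A sketch with made-up lines (assumptions, not taken from the dataset):</p>

```python
import re

# Hypothetical lines; the second one contains 'From:' but not at the start.
lines = [
    'From: alice@example.com',
    'Reply-To: From: bob@example.com',
]

# Unanchored: matches 'From:' anywhere in the line.
unanchored = [line for line in lines if re.search('From:', line)]

# Anchored with ^: matches only at the beginning of the line.
anchored = [line for line in lines if re.search('^From:', line)]

# startswith() expresses the same test without a regular expression.
with_startswith = [line for line in lines if line.startswith('From:')]

print(len(unanchored))
print(anchored == with_startswith)
```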


<H3>Print the lines containing 'X' followed by any character any number of times, followed by a colon</H3>

In [13]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if re.search('X.*:', line):
        print line

X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 16:10:39 2008
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 15:46:24 2008
X-DSPAM-Confidence:

<H3>Without any whitespace between X- and :</H3>
<H4>Match X- followed by one or more non-whitespace characters and then a colon</H4>

In [14]:
hand = urllib2.urlopen(target_url)
for line in hand:
    line = line.rstrip()
    if re.search(r'X-\S+:', line):
        print line

X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 16:10:39 2008
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 15:46:24 2008
X-DSPAM-Confidence:
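<p>On this dataset both patterns print the same header lines. A sketch of a case where they differ, using a made-up second line that has a space between 'X-' and the colon:</p>

```python
import re

# Hypothetical lines; only the first is a real-looking header.
lines = [
    'X-DSPAM-Result: Innocent',
    'X-Plane is fun: really',
]

# 'X.*:' accepts anything, including spaces, between X and the colon.
loose = [line for line in lines if re.search(r'X.*:', line)]

# 'X-\S+:' requires a run of non-whitespace between 'X-' and the colon.
strict = [line for line in lines if re.search(r'X-\S+:', line)]

print(loose)
print(strict)
```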

<H3>Find numbers in a string</H3>
<H4>One or more digits</H4>

In [15]:
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print y

['2', '19', '42']


<H3>Find any uppercase vowels in x</H3>
<H4>All sequences of one or more uppercase vowels</H4>

In [18]:
y = re.findall('[AEIOU]+', x)
print y

[]


<H3>Greedy Matching</H3>
<H4>Greedy matching returns the longest possible match</H4>
<H4>Non-greedy matching stops at the shortest possible match</H4>

In [19]:
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print y

['From: Using the :']


In [20]:
y = re.findall('^F.+?:', x)
print y

['From:']
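<p>The same contrast appears with any pair of delimiters. A short sketch using angle brackets; the sample string is an assumption chosen for illustration:</p>

```python
import re

# Hypothetical sample with two bracketed pieces.
text = '<b>bold</b>'

# Greedy: .+ runs to the last '>' it can reach.
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first '>' that completes a match.
non_greedy = re.findall(r'<.+?>', text)

print(greedy)
print(non_greedy)
```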


<H3>Extracting an email address</H3>
<H4>At least one non-whitespace character before and after the @ sign</H4>

In [21]:
x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
y = re.findall(r'\S+@\S+', x)
print y

['stephen.marquard@uct.ac.za']


<H3>Fine-tuning the above code to extract email addresses only from lines that start with 'From'</H3>

In [24]:
y = re.findall(r'^From (\S+@\S+)', x)
print y

['stephen.marquard@uct.ac.za']


In [23]:
x

'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
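<p>Parentheses can be used more than once in a pattern. A sketch, assuming we want the user name and the host as separate pieces; the two-group pattern is an illustration and not part of the original notebook:</p>

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Two pairs of parentheses: findall() returns one tuple per match.
pairs = re.findall(r'^From (\S+)@(\S+)', x)

# re.search() exposes the same groups by number on the match object.
match = re.search(r'^From (\S+)@(\S+)', x)
user, host = match.group(1), match.group(2)

print(pairs)
print(user)
print(host)
```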

<H3>Finding the domain name</H3>

In [25]:
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
atpos = data.find('@')
print atpos

sppos = data.find(' ', atpos)
print sppos

host = data[atpos+1 : sppos]
print host

21
31
uct.ac.za


In [26]:
words = data.split()
email = words[1]
pieces = email.split('@')
print pieces[1]

uct.ac.za


In [27]:
y = re.findall('@([^ ]*)', data)
print y

['uct.ac.za']


<H3>Modify the above to find matches only in lines starting with 'From'</H3>

In [31]:
y = re.findall('^From .*@([^ ]*)', data)
print y

['uct.ac.za']
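<p>When the pattern does not match, findall() returns an empty list and re.search() returns None. A sketch of a small helper that wraps the anchored pattern; the name extract_host and the second test line are hypothetical:</p>

```python
import re

def extract_host(line):
    """Return the host part of a 'From ' line, or None if the line does not match."""
    match = re.search(r'^From .*@([^ ]*)', line)
    return match.group(1) if match else None

data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# A line that contains an address but does not start with 'From '.
other = 'Subject: write to x@y.org'

print(extract_host(data))
print(extract_host(other))
```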


<H3>Finding the X-DSPAM-Confidence number</H3>

In [32]:
hand = urllib2.urlopen(target_url)
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)

print 'Maximum: ', max(numlist)  

Maximum:  0.9907
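<p>The same loop can be tried offline. A sketch on a few made-up lines (not the real file), showing that only lines that begin with the header contribute a number:</p>

```python
import re

# Hypothetical lines; the last one mentions the header but not at the start.
sample_lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
    'Subject: X-DSPAM-Confidence: 0.9999',
]

numlist = []
for line in sample_lines:
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue  # skip lines without exactly one captured number
    numlist.append(float(stuff[0]))

print(numlist)
print(max(numlist))
```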


<H3>Escape Character</H3>
<H4>To make a special regular expression character match literally, prefix it with a backslash ('\')</H4>

In [36]:
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
print y

['$10.00']
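<p>When a whole literal string needs escaping, the standard library's re.escape() can add the backslashes. A sketch comparing the unescaped and escaped forms; the sample sentence is extended with a made-up near miss:</p>

```python
import re

# Sample text with one exact amount and one near miss (hypothetical).
x = 'We just received $10.00 for cookies, not $10x00.'

# Unescaped: '$' means end of line and '.' means any character.
unescaped = re.findall('$10.00', x)

# re.escape() backslash-escapes the special characters in the literal.
pattern = re.escape('$10.00')
escaped = re.findall(pattern, x)

print(unescaped)
print(escaped)
```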


<H3>Assignment</H3>

In [44]:
import re
import urllib2

new_url = 'http://python-data.dr-chuck.net/regex_sum_287906.txt'
book = urllib2.urlopen(new_url)
numbers = []
for line in book:
    num = re.findall('[0-9]+', line)
    num_sum = 0
    for nu in num:
        num_sum += int(nu)
    numbers.append(num_sum)
print sum(numbers)

459288
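<p>Because findall() scans the whole string, the per-line loop is optional. A sketch on a made-up text (not the assignment file) showing that a line-by-line sum and a single pass over the full text agree:</p>

```python
import re

# Hypothetical text standing in for the downloaded file.
text = 'Why 3 should you 42 learn\nto write 7 programs 19?\nno numbers here\n'

# Line by line, as in the cell above.
line_by_line = 0
for line in text.splitlines():
    line_by_line += sum(int(n) for n in re.findall('[0-9]+', line))

# One findall() over the entire text.
all_at_once = sum(int(n) for n in re.findall('[0-9]+', text))

print(line_by_line)
print(all_at_once)
```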


<H2>Networks and Sockets</H2>

<H3>Making a socket connection</H3>

In [7]:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('www.py4inf.com', 80) )

<H3>An HTTP Request in Python</H3>

In [8]:
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print data
mysock.close()

HTTP/1.1 200 OK
Date: Sat, 02 Jul 2016 10:32:10 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "e103c2f4-a7-526172f5b5d89"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fai
r sun and kill the envious moon
Who is already sick and pale with grief
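<p>The break in 'Arise fai' / 'r sun' above is most likely a chunk boundary: recv(512) returns whatever bytes have arrived, and print adds a newline after each chunk. A sketch of joining the chunks before splitting headers from body. The chunks list is a made-up stand-in for successive mysock.recv(512) results, written as plain strings (on Python 3 a real socket returns bytes):</p>

```python
# Hypothetical chunks standing in for successive mysock.recv(512) results.
chunks = [
    'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n'
    'But soft what light\nArise fai',
    'r sun and kill the envious moon\n',
]

# Join the chunks first, so a chunk boundary never splits a line.
response = ''.join(chunks)

# A blank line (CRLF CRLF) separates the headers from the body.
headers, body = response.split('\r\n\r\n', 1)

print(headers)
print(body)
```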



<H3>Making HTTP easier with urllib</H3>

In [9]:
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')

for line in fhand:
    print line.strip()

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


<H3>Make a word count for the above text file</H3>

In [11]:
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')

counts = dict()

for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

print counts

{'and': 3, 'envious': 1, 'already': 1, 'fair': 1, 'is': 3, 'through': 1, 'pale': 1, 'yonder': 1, 'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1, 'window': 1, 'sick': 1, 'east': 1, 'breaks': 1, 'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1, 'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}
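<p>The counts.get(word, 0) + 1 idiom can also be written with collections.Counter from the standard library. A sketch on the first two lines of the text, typed in directly so it runs without a network connection:</p>

```python
from collections import Counter

# First two lines of romeo.txt, copied from the output above.
text = ('But soft what light through yonder window breaks\n'
        'It is the east and Juliet is the sun\n')

# Counter tallies each word produced by split().
counts = Counter(text.split())

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = counts.most_common(2)

print(counts['is'])
print(counts['the'])
```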


<H3>Assignment</H3>

In [12]:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('www.pythonlearn.com', 80) )

mysock.send('GET http://www.pythonlearn.com/code/intro-short.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print data
mysock.close()

HTTP/1.1 200 OK
Date: Sat, 02 Jul 2016 11:21:08 GMT
Server: Apache
Last-Modified: Mon, 12 Oct 2015 14:55:29 GMT
ETag: "20f7401b-1d3-521e9853a392b"
Accept-Ranges: bytes
Content-Length: 467
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain

Why should you learn to write programs?

Writing programs (or programming) is a very creative 

and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  

