# Python notes 3 - regex

This document is a kind of cookbook, with recipes to help you code faster.
Enjoy it :)

## Table of Contents


* [regex](#regex)
* [date_and_time](#date_and_time)
* [processing_time](#processing_time)
* [timedelta_pretty_print](#timedelta_pretty_print)
* [file_names_and_paths](#file_names_and_paths)
* [copy_file](#copy_file)



<a id = 'regex' ></a>

## regex - regular expressions

#### regex mini course
https://regexone.com/

#### regex python reference
https://regexone.com/references/python

#### Regular Expressions Cookbook, 2nd Edition 
Detailed Solutions in Eight Programming Languages 
By Jan Goyvaerts, Steven Levithan
Publisher: O'Reilly Media    
http://shop.oreilly.com/product/0636920023630.do    
    

#### find one match

In [1]:
# match the strings that contain abc
import re

s1 = 'abcdefgabc'
s2 = 'abcde'
s3 = 'abc'

pattern = 'abc'

m1 = re.search( pattern, s1 )
m2 = re.search( pattern, s2 )
m3 = re.search( pattern, s3 )

print( m1.group(0) )
print( m2.group(0) )
print( m3.group(0) )

abc
abc
abc
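
`re.search` returns `None` when the pattern does not occur, so calling `.group(0)` on the result directly would raise an `AttributeError`. A minimal sketch of a guard follows; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

def first_match( pattern, text ):
    # hypothetical helper: return the matched text, or None when nothing matches
    m = re.search( pattern, text )
    return m.group(0) if m else None

print( first_match( 'abc', 'abcdefgabc' ) )
print( first_match( 'abc', 'xyz' ) )
```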


### find all matches

In [6]:
import re
# Let's use a regular expression to match a few date strings.
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12" )
for match in matches:
    # This will print:
    #   June 24
    #   August 9
    #   Dec 12
    print( "Full match: %s" % (match) )
    

Full match: June 24
Full match: August 9
Full match: Dec 12

### find a pattern and extract just a subpattern

In [3]:
# To capture the month and the day of each date separately, wrap each part in a group

regex = r"([a-zA-Z]+) (\d+)"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # With two groups, findall returns one (month, day) tuple per match.
    # This will now print:
    #   ('June', '24')
    #   ('August', '9')
    #   ('Dec', '12')
    print( match )


('June', '24')
('August', '9')
('Dec', '12')
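
When the tuple positions get hard to remember, named groups can make the extraction easier to read. The sketch below uses `re.finditer` with the group names `month` and `day`, which are chosen here for illustration only.

```python
import re

regex = r"(?P<month>[a-zA-Z]+) (?P<day>\d+)"

# finditer yields match objects, so each group can be read by name
dates = [ ( m.group( 'month' ), int( m.group( 'day' ) ) )
          for m in re.finditer( regex, "June 24, August 9, Dec 12" ) ]

print( dates )
```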


In [7]:
# match all pdf files, ignore temporary files.

import re

s1 = 'file_record_transcript.pdf'
s2 = 'file_07241999.pdf'

# skip this temporary file
s3 = 'testfile_fake.pdf.tmp'

pattern = r'^(file.+)\.(pdf)$'

m1 = re.search( pattern, s1 )
m2 = re.search( pattern, s2 )
m3 = re.search( pattern, s3 )

print( m1.group(0) )
print( m1.group(1) )
print( m1.group(2) )



print( m2.group(0) )

if m3 is None:
    print( 'pattern not found for {0}'.format( s3 ) )
else:    
    print( m3.group(0) )


file_record_transcript.pdf
file_record_transcript
pdf
file_07241999.pdf
pattern not found for testfile_fake.pdf.tmp
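
The same pattern can filter a whole list of names in one pass. This is a small sketch; the variable names `pdf_pattern`, `names` and `pdf_names` are made up for the example.

```python
import re

# compile once, reuse for every file name
pdf_pattern = re.compile( r'^(file.+)\.(pdf)$' )

names = [ 'file_record_transcript.pdf',
          'file_07241999.pdf',
          'testfile_fake.pdf.tmp' ]

# keep only the names that match the pattern
pdf_names = [ n for n in names if pdf_pattern.search( n ) ]
print( pdf_names )
```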


#### exercise_regex

If you're familiar with web servers at all, you'll recognize that the log line in this exercise is in [Common Log Format](https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format). The fields are:

_remotehost rfc931 authuser [date] "request" status bytes_

| field         | meaning                                                                |
| ------------- | ---------------------------------------------------------------------- |
| _remotehost_  | Remote hostname (or IP number if DNS hostname is not available).       |
| _rfc931_      | The remote logname of the user. We don't really care about this field. |
| _authuser_    | The username of the remote user, as authenticated by the HTTP server.  |
| _[date]_      | The date and time of the request.                                      |
| _"request"_   | The request, exactly as it came from the browser or client.            |
| _status_      | The HTTP status code the server sent back to the client.               |
| _bytes_       | The number of bytes (`Content-Length`) transferred to the client.      |



In [8]:
value = r"""uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0   """

pattern = r""

# write your code here to extract the value of all the fields:
#    * remotehost
#    * [date]
#    * "request"
#    * status
#    * bytes

def replace_month_string_by_numbers( string_date ):
    # write code here to replace the month name in the date with its number.
    # input : 01/Aug/1995
    # output: 01/08/1995
    pass  # placeholder so the cell runs before the exercise is solved

    
    
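
One possible solution to the exercise, offered as a sketch rather than the reference answer: a single pattern with one named group per field, plus a month lookup list. The names `CLF_PATTERN` and `MONTHS` are assumptions made for this example, and the pattern is only checked against the sample line above.

```python
import re

# rfc931 and authuser are matched (\S+) but not captured
CLF_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) \S+ \S+ '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<bytes>\d+|-)'
)

MONTHS = [ 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ]

def replace_month_string_by_numbers( string_date ):
    # input : 01/Aug/1995
    # output: 01/08/1995
    day, month, year = string_date.split( '/' )
    return '{0}/{1:02d}/{2}'.format( day, MONTHS.index( month ) + 1, year )

value = r"""uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0   """

fields = CLF_PATTERN.match( value ).groupdict()
for name, text in fields.items():
    print( '{0:<10}: {1}'.format( name, text ) )

print( replace_month_string_by_numbers( '01/Aug/1995' ) )
```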

In [27]:

import datetime
import re


# For extracting the date and time from a GET request line.
pattern = re.compile( r"""^.*\[(.*)\] \"GET .*$""" )

def StandardizeDate( date_time_str ):
    # Parse the timestamp together with its UTC offset (for example "-0400"),
    # then convert it to UTC.
    dt = datetime.datetime.strptime( date_time_str, "%d/%b/%Y:%H:%M:%S %z" )
    dt = dt.astimezone( datetime.timezone.utc )

    # %m writes the month as a number instead of the abbreviated name (%b)
    s_date = dt.strftime( "%d/%m/%Y:%H:%M:%S" )

    return s_date


# quick check:
# line = "01/Aug/1995:00:00:08 -0400"
# s_date = StandardizeDate( line )
# print( s_date )

In [28]:
line = r"""uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0   """

res = pattern.match(line)
if res:
    print( 'We have matches' )
    date_str = res.group(1)
    line_2 = line.replace( date_str, StandardizeDate(date_str) )
    print( line_2 )
else:
    print( 'No matches found' )


We have matches
uplherc.upl.com - - [01/08/1995:04:00:08] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0   


### replace a pattern

In [2]:
import re

pattern =  r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):' 
repl    = r'static PyObject*\npy_\1(void)\n{'
my_str  = 'def myfunc():'
s       = re.sub( pattern, repl, my_str )

print( s )

static PyObject*
py_myfunc(void)
{
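
`re.sub` also accepts a function as the replacement, which helps when the new text has to be computed from the match. A minimal sketch, with a hypothetical helper `to_upper` that turns snake_case into camelCase:

```python
import re

def to_upper( match ):
    # called once per match; the return value replaces the matched text
    return match.group(1).upper()

s = re.sub( r'_([a-z])', to_upper, 'my_function_name' )
print( s )
```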


#### delete a pattern

In [4]:
import re

pattern =  r'\d{2}:\d{2}:\d{2}' 
repl    = r''
my_str  = r'''
About Alluxio And The Course
00:03:38
About The Author
00:01:24
Using Alluxio Locally
Downloading Alluxio
00:03:03
Starting The System Locally
00:05:09
Interacting Via The Shell
00:02:45
Browsing The Web UI
00:03:53
Examples With Alluxio
Setting Up Alluxio With Spark And S3
00:06:15
Running Spark on Alluxio with S3
00:05:29
Using Alluxio With Unified Namespace
00:06:05
Deploying Alluxio On A Cluster
Deploying Alluxio In AWS
00:07:49
Conclusion
Contributing To The Project And Conclusion
00:03:52

'''
s       = re.sub( pattern, repl, my_str )

print( s )


About Alluxio And The Course

About The Author

Using Alluxio Locally
Downloading Alluxio

Starting The System Locally

Interacting Via The Shell

Browsing The Web UI

Examples With Alluxio
Setting Up Alluxio With Spark And S3

Running Spark on Alluxio with S3

Using Alluxio With Unified Namespace

Deploying Alluxio On A Cluster
Deploying Alluxio In AWS

Conclusion
Contributing To The Project And Conclusion





#### delete new line - convert to one row string

In [1]:
import re

pattern =  r'\n' 
repl    = r''

In [6]:
my_str  = r'''
<book id="bk112">
  <author>Galos, Mike</author>
  <title>Visual Studio 7: A Comprehensive Guide</title>
  <genre>Computer</genre>
  <price>49.95</price>
  <publish_date>2001-04-16</publish_date>
  <description>
    Microsoft Visual Studio 7 is explored in depth,
    looking at how Visual Basic, Visual C++, C#, and ASP+ are
    integrated into a comprehensive development
    environment.
  </description>
</book>
'''

In [7]:
s       = re.sub( pattern, repl, my_str )

print( s )

<book id="bk112">  <author>Galos, Mike</author>  <title>Visual Studio 7: A Comprehensive Guide</title>  <genre>Computer</genre>  <price>49.95</price>  <publish_date>2001-04-16</publish_date>  <description>    Microsoft Visual Studio 7 is explored in depth,    looking at how Visual Basic, Visual C++, C#, and ASP+ are    integrated into a comprehensive development    environment.  </description></book>
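
Removing only `\n` keeps the indentation, so runs of spaces stay in the result. If a single space between items is preferred, one option is to collapse every whitespace run. The sketch below uses a short sample string in place of the XML above.

```python
import re

text = 'first line,\n    second line\n    third line\n'

# replace every run of whitespace (spaces, tabs, new lines) with one space
one_row = re.sub( r'\s+', ' ', text ).strip()
print( one_row )
```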


<a id = 'date_and_time' ></a>

## Date and Time

get date and time

In [20]:
import datetime

curtime  = datetime.datetime.now()

time_rg  = curtime.strftime( 'year: %Y, month: %m, day: %d, hours: %H, minutes: %M,  seconds: %S' )

print( curtime )
print( time_rg )

2017-06-07 16:06:38.296000
year: 2017, month: 06, day: 07, hours: 16, minutes: 06,  seconds: 38


create a file name with datetime stamp

In [21]:
import datetime

curtime  = datetime.datetime.now()

# format datetime in year month day hours minutes seconds
time_rg  = curtime.strftime( '%Y%m%d_%H%M%S' )
fileName = 'flowers_' + time_rg + '.png'

print( type( curtime ) )
print( fileName )

<class 'datetime.datetime'>
flowers_20171221_164405.png


In [24]:
import datetime

curtime  = datetime.datetime(year= 2017, month= 1, day = 15, hour= 3, minute= 7, second= 9 )

# format datetime in year month day hours minutes seconds
time_rg  = curtime.strftime( '%Y%m%d_%H%M%S' )
fileName = 'flowers_' + time_rg + '.png'

print( type( curtime ) )
print( fileName )

<class 'datetime.datetime'>
flowers_20170115_030709.png
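
Going the other way, the time stamp can be read back from such a file name with `strptime` and the same format string. A small sketch, assuming the name follows the `flowers_YYYYMMDD_HHMMSS.png` layout used above:

```python
import datetime
import re

fileName = 'flowers_20170115_030709.png'

# pick the 8 digit date and 6 digit time out of the name
m     = re.search( r'(\d{8}_\d{6})', fileName )
stamp = datetime.datetime.strptime( m.group(1), '%Y%m%d_%H%M%S' )

print( stamp )
```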


### Calculate days to my next birthday

In [49]:
import time
import datetime
from datetime import date

#my_birth_day = date( 2018, 6, 20 )
my_birth_day = date( 2018, 5, 20 )

today        = date.today()

delta        = my_birth_day - today

print( 'delta in days: {0}'.format( delta ) )
years = delta.days // 365

m = delta.days % 365
#print ( 'm: {0}'.format( m )  )
months = m // 30   # integer division: whole months of 30 days
d      = m % 30


print( 'years : {0}'.format( years  ) )
print( 'months: {0}'.format( months ) )
print( 'days  : {0}'.format( d      ) )


delta in days: 268 days, 0:00:00
years : 0
months: 8
days  : 28
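
The cell above uses a fixed birthday date, so it stops working once that date has passed. A possible variant that always looks for the next occurrence is sketched below. The helper `days_to_next_birthday` is hypothetical, and the optional `today` argument exists only to make the example reproducible.

```python
from datetime import date

def days_to_next_birthday( month, day, today = None ):
    # hypothetical helper; pass today explicitly to get a repeatable result
    today    = today or date.today()
    birthday = date( today.year, month, day )
    if birthday < today:
        # already passed this year, so use next year's date
        birthday = date( today.year + 1, month, day )
    return ( birthday - today ).days

print( days_to_next_birthday( 5, 20, today = date( 2017, 8, 25 ) ) )
```

A birthday on February 29 would need extra handling, because `date( year, 2, 29 )` raises `ValueError` in years that are not leap years.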


<a id = 'processing_time' ></a>

## Processing time

measure processing time of my script

In [19]:
import datetime
import time

start_time   = time.time()

# execute my process ...
time.sleep(3) # delays for 3 seconds

end_time     = time.time()
process_time = end_time - start_time

print ( 'process_time: {0} seconds'.format( process_time ) )

print( 'timedelta: {0}'.format( datetime.timedelta( seconds = process_time ) ) )
  

# format the process time in Hours Minutes Seconds
#timeHMS     = "*{:0>30}*".format(datetime.timedelta( seconds = process_time ))
#timeHMS     = "*{:<30}*".format(datetime.timedelta( seconds = process_time ))

#print ( timeHMS )
#print( 'the time to process my script was {0} (hours:minutes:seconds)'.format( timeHMS ) )

process_time: 3.0000112056732178 seconds
timedelta: 0:00:03.000011
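
For measuring short intervals, `time.perf_counter()` is usually a better fit than `time.time()`, because it is a monotonic, high resolution clock that is not affected by system clock adjustments. A minimal sketch, with a short sleep standing in for the real work:

```python
import time

start = time.perf_counter()

time.sleep( 0.1 )  # stand-in for the real process

elapsed = time.perf_counter() - start
print( 'process_time: {0:.3f} seconds'.format( elapsed ) )
```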


#### subtract two datetime objects

In [3]:
import datetime
import time

start_time   = datetime.datetime.now()

# execute my process ...
time.sleep(3) # delays for 3 seconds

end_time     = datetime.datetime.now()
process_time = end_time - start_time

# subtracting two datetime objects gives a timedelta, printed as h:mm:ss
print( 'process_time: {0} (h:mm:ss)'.format( process_time ) )


#print( 'timedelta: {0}'.format( datetime.timedelta( seconds = process_time ) ) )
  

process_time: 0:00:03.003409 (h:mm:ss)


<a id = 'timedelta_pretty_print' ></a>

#### Time Delta Pretty print

In [15]:
def timedelta_pretty_print( start_time, end_time ):
    delay = datetime.timedelta(seconds=( end_time - start_time ))
    if (delay.days > 0):
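        # str(timedelta) writes "1 day, " for exactly one day and "N days, " otherwise
        out = str(delay).replace(" days, ", ":").replace(" day, ", ":")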
    else:
        out = "0:" + str(delay)
    outAr = out.split(':')
    outAr = ["%02d" % (int(float(x))) for x in outAr]
    out   = ":".join(outAr)
    return out

In [18]:
start_time   = time.time()
time.sleep(3) # delays for 3 seconds
end_time     = time.time()

s = timedelta_pretty_print( start_time, end_time )
print( 'process time (dd:HH:MM:SS) {0}'.format( s ) )

process time (dd:HH:MM:SS) 00:00:00:03


In [11]:
import datetime
import time

def timedelta_pretty( process_time, start_time = None, end_time = None ):
    
    if start_time is not None and end_time is not None:
        process_time = datetime.timedelta(seconds=( end_time - start_time))
    
    print( 'param process_time {0}'.format( process_time ) )
    
    if (process_time.days > 0):
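        # str(timedelta) writes "1 day, " for exactly one day and "N days, " otherwise
        out = str( process_time ).replace( " days, ", ":" ).replace( " day, ", ":" )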
    else:
        out = "0:" + str( process_time )
    
    print( 'out {0}'.format( out ) )
    
    outAr = out.split(':')
    outAr = ["%02d" % (int(float(x))) for x in outAr]
    out   = ":".join(outAr)
    return out

In [13]:
process_time = datetime.timedelta(seconds=41000)

print( process_time )

s = timedelta_pretty( process_time )
print( 'process time (dd:HH:MM:SS) {0}'.format( s ) )


11:23:20
param process_time 11:23:20
out 0:11:23:20
process time (dd:HH:MM:SS) 00:11:23:20
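
An alternative that avoids parsing the string form of the timedelta is to split the total number of seconds with `divmod`. This is a sketch under that assumption; the helper name `format_dd_hh_mm_ss` is hypothetical.

```python
import datetime

def format_dd_hh_mm_ss( delta ):
    # hypothetical helper: format a timedelta as dd:HH:MM:SS
    total          = int( delta.total_seconds() )
    days, rest     = divmod( total, 86400 )   # 86400 seconds in a day
    hours, rest    = divmod( rest, 3600 )
    minutes, secs  = divmod( rest, 60 )
    return '{0:02d}:{1:02d}:{2:02d}:{3:02d}'.format( days, hours, minutes, secs )

print( format_dd_hh_mm_ss( datetime.timedelta( seconds = 41000 ) ) )
```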


<a id = 'file_names_and_paths' ></a>

## File names and paths

### where is this Python notebook (ipynb) stored?

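A simple way to answer this is to print the current working directory. This sketch assumes the notebook kernel was started in the notebook's own folder, which is the usual default, and that no earlier cell changed it with `os.chdir`.

```python
import os

# assumption: the working directory is still the folder that holds the notebook
notebook_dir = os.getcwd()
print( notebook_dir )
```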
#### extract file name and extension from path

In [2]:
import os

filename, ext = os.path.splitext('/path/to/myDir/myFriendList.txt')
print( 'file name: {0}'.format( filename ) )
print( 'extension: {0}'.format( ext      ) )

pathSegmented, ext = os.path.splitext('a.png')
print( 'file name: {0}'.format( pathSegmented ) )
print( 'extension: {0}'.format( ext      ) )

file name: /path/to/myDir/myFriendList
extension: .txt
file name: a
extension: .png
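
The standard `pathlib` module gives the same pieces as attributes. The sketch below uses `PurePosixPath` so that the result does not depend on the operating system; on a real file system `pathlib.Path` would be the usual choice.

```python
from pathlib import PurePosixPath

p = PurePosixPath( '/path/to/myDir/myFriendList.txt' )

print( p.parent )  # directory part
print( p.name )    # file name with extension
print( p.stem )    # file name without extension
print( p.suffix )  # extension, including the dot
```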


#### extract last dir from path

In [1]:
import os
fileName = 'C:/aat/pics/animals/insects/hornet.jpg'

d1 = os.path.dirname( fileName )
d2 = os.path.basename( d1 ) 
d3 = os.path.split   ( d1 )[1]


print( 'd1: {0}'.format( d1 ) )
print( 'd2: {0}'.format( d2 ) )
print( 'd3: {0}'.format( d3 ) )

d1: C:/aat/pics/animals/insects
d2: insects
d3: insects


In [2]:
print ( os.path.split   ( d1 ) )

('C:/aat/pics/animals', 'insects')


#### concatenate directory and file name to create path

In [14]:
from os.path         import join
myPath = join( '/aat/pics/animals/', 'hornet2.png')
myPath

'/aat/pics/animals/hornet2.png'

#### normalize paths

In [5]:
myDir = os.path.normpath( '/aat/pics/animals\\insects' )
print( myDir )

\aat\pics\animals\insects


#### get a list of files in a path

In [9]:
from os import listdir
from os.path import isfile, join

_dir = 'C:/aat/pics/'

onlyfiles = [f for f in listdir( _dir ) if isfile(join( _dir, f ))]            
onlyfiles

['box.png',
 'desktop.ini',
 'rectangles.png',
 'rotated.png',
 'shapes.png',
 'shapes2.png',
 'shapes_01.png',
 'shapes_02.png',
 'sudoku-original.jpg']

#### get a list of directories in a path

In [2]:
from os import listdir
from os.path import isdir, join

_dir = 'C:/aat/Pictures/'

only_dirs = [f for f in listdir( _dir ) if isdir(join( _dir, f ))]            
only_dirs

['animals', 'sharks']

#### get a list of files in a path with glob (full path)

In [None]:
from glob import glob
files = glob('C:/tmp/Pictures/sharks/*')
for i in files:
    print( i )
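
`glob` can also descend into sub directories when the pattern contains `**` and `recursive=True` is passed. The sketch below builds a small throwaway tree with `tempfile` so that it runs anywhere; the file and directory names are made up for the example.

```python
import os
import tempfile
from glob import glob

# build a small temporary tree: a.png, notes.txt and sharks/b.png
root = tempfile.mkdtemp()
os.makedirs( os.path.join( root, 'sharks' ) )
for name in ( 'a.png', 'notes.txt', os.path.join( 'sharks', 'b.png' ) ):
    open( os.path.join( root, name ), 'w' ).close()

# '**' together with recursive=True matches the root and every sub directory
png_files = sorted( glob( os.path.join( root, '**', '*.png' ), recursive = True ) )
for f in png_files:
    print( f )
```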


#### get a list of files in a path (full path)

In [None]:
from os import listdir
from os.path import isfile, join

_dir = 'C:/tmp/Pictures'

onlyfiles = [join( _dir, f ) for f in listdir( _dir ) if isfile(join( _dir, f ))]            
#print( 'onlyfiles: {0}'.format( onlyfiles ) )
for i in onlyfiles:
    print( i )

### list all the files - walking recursively from a root directory

In [None]:
import os

_dir = 'C:/tmp/Pictures/'

for root, dirs, files in os.walk( _dir, topdown=False):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))

<a id = 'copy_file' ></a>

### copy a file

In [None]:
import os
import sys
from os.path import join
from shutil import copyfile


copyfile( 'mountains.png', 'mountains_copy.png' )
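
`copyfile` needs a full destination file name and expects the target folder to exist. A possible variant that creates the folder first and copies into it with `shutil.copy2` is sketched below. The `backup` folder name is an assumption, and an empty temporary file stands in for `mountains.png`.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()
src      = os.path.join( work_dir, 'mountains.png' )
open( src, 'wb' ).close()                     # empty stand-in for a real image

backup_dir = os.path.join( work_dir, 'backup' )
os.makedirs( backup_dir, exist_ok = True )    # no error if it already exists

# copy2 accepts a directory as destination and also tries to keep file metadata
dst = shutil.copy2( src, backup_dir )
print( dst )
```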
        
        

In [None]:
import os
import shutil
import tempfile

# mkstemp creates an empty temporary file and returns an open descriptor plus its path.
# (tempfile.mktemp is deprecated; its first argument is a suffix, not a directory.)
# Pass dir = 'C:/aat/info_sources/python_mini_workshop' to create the file there.
fd, filename1 = tempfile.mkstemp( suffix = '.txt' )
os.close( fd )

print( 'filename1' )
print( filename1 )

filename2 = filename1 + ".testcopy"
print( filename1, "=>", filename2 )

shutil.copy( filename1, filename2 )

if os.path.isfile( filename2 ):
    print( "Success" )

<a id = 'txt_files' ></a>

## references

* https://docs.python.org/2/tutorial/
* https://regexone.com/
* https://regexone.com/references/python