# <center>LECTURE OVERVIEW </center>

---

## By the end of the lecture, you'll be able to:
- pattern match with string methods
- pattern match regular expressions using the `re` module
- pattern match filenames using the `glob` and `pathlib` modules
- handle the complexities of dates and times using the `datetime` and `dateutil` modules
- archive and unarchive files using the `zipfile` and `shutil` modules
- serialize and deserialize Python objects using the `pickle` module

# <center>PATTERN MATCHING</center>

---

How often do you Google for something on the web? How often do you search for text in a document or webpage by pressing `Ctrl-F` and typing in the words you're looking for? Both rely on some form of pattern matching, and there are many ways to go about pattern matching in Python.

For instance, you've learned to use the `in` operator to see if a given string is a substring of another string:

In [None]:
'apples' in "I love apples!"

You have also learned how to index strings and use the result with comparison operators:

In [None]:
"apples"[0] == 'z'

In [None]:
"apples"[-2] != 's'

# String Methods for Pattern Matching

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- **pattern match with string methods**
- <font color='LIGHTGRAY'>pattern match regular expressions using the re module</font>
- <font color='LIGHTGRAY'>pattern match filenames using the glob and pathlib modules</font>
- <font color='LIGHTGRAY'>handle the complexities of dates and times using the datetime and dateutil modules</font>
- <font color='LIGHTGRAY'>archive and unarchive files using the zipfile and shutil modules</font>
- <font color='LIGHTGRAY'>serialize and deserialize Python objects using the pickle module</font>

Python has several built-in methods associated with the string datatype. We'll go over several of them that we can use for pattern matching.

Specifically, we'll look at the string methods that evaluate to a `Boolean` value. These methods are useful when, for example, we are validating forms that users fill in. Most of the string methods below return `Boolean` values:

- `str.isalnum()`: String consists of only alphanumeric characters (no symbols)
- `str.isalpha()`: String consists of only alphabetic characters (no symbols)
- `str.islower()`: String’s alphabetic characters are all lower case
- `str.isnumeric()`: String consists of only numeric characters
- `str.isspace()`: String consists of only whitespace characters
- `str.istitle()`: String is in title case
- `str.isupper()`: String’s alphabetic characters are all upper case
- `str.startswith(val)`: String starts with the specified value
- `str.endswith(val)`: String ends with the specified value
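- `str.find(substring, start=None, end=None)`: Searches the target string for a given substring and returns the lowest index where it is found, or `-1` if it is not found (an integer rather than a `Boolean`)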

Let's see a few of these in action:

In [None]:
"5".isnumeric()

In [None]:
"abcdef".isnumeric()

Checking whether characters are lower case, upper case, or title case, can help us to sort our data appropriately. It can also provide us with the opportunity to standardize data we collect by checking and then modifying strings as needed:

In [None]:
movie = "2012: THE KINGS OF BBQ BARBECUE KUWAIT"
book = "The Barbecue! Bible"
poem = "fred could bbq before he could read"

print(f"movie.islower(): {movie.islower()}")
print(f"movie.isupper(): {movie.isupper()}")
 
print(f"book.istitle(): {book.istitle()}")
print(f"book.isupper(): {book.isupper()}")
 
print(f"poem.istitle(): {poem.istitle()}")
print(f"poem.islower(): {poem.islower()}")

We can also chain these with logical operators:

In [None]:
print(
    f"Does the movie title start with '2012' and is upper case?: {movie.startswith('2012') and movie.isupper()}"
)

print(
    f"Is 'fred' in the poem and ends with 'city'?: {('fred' in poem) and (poem.endswith('city'))}"
)
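
The form-checking use case mentioned earlier can be sketched with the same methods. This is a minimal sketch, not a complete validator: the function name `validate_signup`, the two fields, and the rules themselves are assumptions made up for illustration.

```python
def validate_signup(username, zip_code):
    """Return a list of problems found in two hypothetical form fields."""
    problems = []
    # Assumed rule: usernames are letters and digits only, not starting with a digit.
    if not username.isalnum():
        problems.append("username must be alphanumeric")
    elif username[0].isnumeric():
        problems.append("username must not start with a digit")
    # Assumed rule: a ZIP code is exactly five numeric characters.
    if not (len(zip_code) == 5 and zip_code.isnumeric()):
        problems.append("zip code must be five digits")
    return problems


print(validate_signup("fred42", "02912"))    # no problems found
print(validate_signup("42 fred!", "2912"))   # one problem per field
```

Returning a list of problems, rather than a single `Boolean`, lets a form report every issue at once.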

To find out whether a string contains a particular substring, you can use the `str.find()` method, which returns the lowest index in the string at which the substring is found, or `-1` if it is not found:

In [None]:
# show readable indexes
for i in range(len(book)):
    print(f'{i:2d} {book[i]}')

print(f"book.find('Bible'): {book.find('Bible')}")
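
Because `str.find()` returns an index instead of a `Boolean`, a little care is needed when testing its result. The short sketch below reuses the book title under a new name (`title`, chosen here only to keep the block self-contained):

```python
title = "The Barbecue! Bible"

print(title.find("Bible"))   # 14: index where the substring starts
print(title.find("bible"))   # -1: the search is case-sensitive
print(title.find("e", 3))    # 8: start searching at index 3

# Compare against -1 explicitly: index 0 is a valid match, but 0 is falsy.
if title.find("The") != -1:
    print("found 'The' at index", title.find("The"))
```

If you only need a yes/no answer, the `in` operator is usually the clearer choice.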

To learn more about string methods, check out the [documentation](https://docs.python.org/3.8/library/stdtypes.html#string-methods).

# Regular Expressions

**Regular expressions**, also known as **regexes** (pronounced reg-EX-is), are sequences of characters that define a search pattern for finding text. To use a regex, you specify the rules for the set of possible strings that you want to match and then ask yourself:
*<center>"Does this string match the pattern?"</center>* 
*<center>"Is there a match for the pattern anywhere in this string?"</center>*

## Syntax

Since a regex is a pattern that aims to match an input string, here are some special characters and patterns you can use to match strings:

| Metacharacter | Description |
|:---------------:|-------------|
| `[] ` | Specifies a character class. For example, `[abc]` matches "a", "b", or "c". |
| `()` | Creates a group. For example, `(ab)*` matches "", "ab", "abab", "ababab", and so on. |
| `*` | Matches zero or more repetitions. For example, `ab*c` matches "ac", "abc", "abbbc", etc. |
| `+` | Matches one or more repetitions. For example, `ab+c` matches "abc", "abbc", "abbbc", and so on, but not "ac". |
| `?` | Matches the preceding element zero or one time. For example, `ab?c` matches only "ac" or "abc". |
| `\|` | Designates alternation. For example, `abc\|def` can match either "abc" or "def". |
| `.` | Matches any single character except newline. For example, `a.c` matches "abc", etc., but `[a.c]` matches only "a", ".", or "c". |
| `^` | Matches the starting position in the string, like the `str.startswith()` method. In multiline mode, it also matches the starting position of each line. |
| `{}` | Matches an explicitly specified number of repetitions. For example, `a{2}` matches "aa". |
| `(?P<name>...)` | Creates a named group. |
| `$` | Matches the ending position of the string or the position just before a string-ending newline, like the `str.endswith()` method. In multiline mode, it also matches the ending position of each line. |
| `\`| Removes the special meaning of a metacharacter, introduces a special character class, or introduces a grouping backreference. |

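If you want to use regex metacharacters in a Python string, put an `r` before the string to make it a raw string. Python then passes backslash sequences such as `\b` or `\d` through to the regex engine unchanged: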
```python
regex_pattern = r'[abc]'
```
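
To see why the `r` prefix matters, here is a small sketch comparing a raw and a normal string literal. The sample text is made up for illustration.

```python
import re

# In a normal string literal, Python turns "\b" into a single backspace character.
print(len("\b"))     # 1
# In a raw string the backslash is kept, so the regex engine sees \b (word boundary).
print(len(r"\b"))    # 2

text = "bbq cat concatenate"
print(re.findall(r"\bcat\b", text))   # ['cat']
print(re.findall("\bcat\b", text))    # []: this pattern looks for backspace characters
```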

There is also a list of special sequences that consist of `\` and a character from the list below:

| Special sequence | Description |
|:-------------:|-------------|
| `\number` | Matches the same text as the group with that `number` (a backreference) |
| `\A` | Matches only at the start of the string |
| `\b` | Matches the empty string, but only at the beginning or end of a word |
| `\B` | Matches the empty string, but only when it is not at the beginning or end of a word |
| `\d` | Matches any digit |
| `\D` | Matches any character that is not a digit |
| `\s` | Matches any whitespace character |
| `\S` | Matches any non-whitespace character |
| `\w` | Matches any word character (a letter, digit, or underscore) |
| `\W` | Matches any character that is not a word character |
| `\Z` | Matches only at the end of the string |
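
A few of the metacharacters and special sequences from the tables can be tried out directly. This is a minimal sketch that uses `re.findall()` and `re.search()` (both covered in the next section) on a made-up sample string:

```python
import re

order = "Order 66: 12 ribs, 3 wings"

print(re.findall(r"\d+", order))           # runs of digits: ['66', '12', '3']
print(re.findall(r"\b\w+s\b", order))      # whole words ending in "s": ['ribs', 'wings']
print(re.findall(r"ribs|wings", order))    # alternation: ['ribs', 'wings']
print(re.findall(r"\d{2}", order))         # exactly two digits in a row: ['66', '12']
print(bool(re.search(r"^Order", order)))   # True: anchored at the start
print(bool(re.search(r"wings$", order)))   # True: anchored at the end
```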

You can read more about regex [here](https://en.wikipedia.org/wiki/Regular_expression).

## The `re` Module

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- <font color='LIGHTGRAY'>pattern match with string methods</font>
- **pattern match regular expressions using the `re` module**
- <font color='LIGHTGRAY'>pattern match filenames using the glob and pathlib modules</font>
- <font color='LIGHTGRAY'>handle the complexities of dates and times using the datetime and dateutil modules</font>
- <font color='LIGHTGRAY'>archive and unarchive files using the zipfile and shutil modules</font>
- <font color='LIGHTGRAY'>serialize and deserialize Python objects using the pickle module</font>


There are several methods that are available with the built-in `re` module and we are going to discuss the most commonly used methods. 

In [None]:
import re

Note that `re.match()` and `re.search()` return a [Match object](https://docs.python.org/3/library/re.html#match-objects), which always has a boolean value of `True`, when they find a match, and `None` when they do not. This makes it convenient to test if there is a match with a simple `if` statement.

1. ```python
re.match(pattern, string, ...)
```
: returns a `Match` object if `pattern` matches at the *beginning* of `string`

In [None]:
string = 'I LOVE BBQ!'

result1 = re.match('BBQ', string)
print(result1)

result2 = re.match('I', string)
print(result2)

`re.match()` only ever looks at the beginning of the string. A compiled pattern's `.match()` method does accept start and end positions, but you might as well use the `re.search()` function, which locates a match anywhere in the string.

To read more about `re.match()`, check out help output:

In [None]:
help(re.match)

2. ```python
re.search(pattern, string, ...)
```
: returns `Match` object if `pattern` matches *anywhere* in a given `string`

In [None]:
string = 'Mark is bringing his 3 kids to the BBQ.'

if re.search('kids', string):
    print('Kids at the BBQ!')
else:
    print('No kids at the BBQ.')
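
Beyond the yes/no test, the `Match` object also tells you what was matched and where. A short sketch, using the same sentence with a slightly richer pattern (the pattern is an illustration, not part of the example above):

```python
import re

string = 'Mark is bringing his 3 kids to the BBQ.'

match = re.search(r'(\d+) kids', string)
if match:                     # a Match object is truthy; no match gives None
    print(match.group())      # '3 kids': the whole matched text
    print(match.group(1))     # '3': the text captured by the first (...) group
    print(match.span())       # (21, 27): start and end indexes of the match

print(re.search('dogs', string))   # None
```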

To read more about `re.search()`, check out help output:

In [None]:
help(re.search)

3. ```python
re.findall(pattern, string, ...)
```
: finds and returns a list of all occurrences of `pattern` in a given `string`; if `pattern` contains a group, the list holds the text captured by the group

In [None]:
string = "Mark is bringing 3 kids, Joanne is bringing 0 kids, and Georgie is bringing 1 kid."

re.findall(r'(\d+)\s+kids?', string)  # a number, whitespace, then "kid" or "kids"
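
What `re.findall()` returns depends on how many groups the pattern contains. The sketch below runs three illustrative patterns over the same sentence:

```python
import re

string = "Mark is bringing 3 kids, Joanne is bringing 0 kids, and Georgie is bringing 1 kid."

# No group: each item is the full matched text.
print(re.findall(r'\d+ kids?', string))
# One group: each item is only the captured text.
print(re.findall(r'(\d+) kids?', string))
# Two groups: each item is a tuple with one entry per group.
print(re.findall(r'(\w+) is bringing (\d+)', string))
```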

To read more about `re.findall()`, check out help output:

In [None]:
help(re.findall)

4. ```python
re.sub(pattern, replace, string, ...)
```
: search `string` for a `pattern` and substitute with a new string `replace` if `pattern` occurs

In [None]:
string = 'I will have chicken, mashed potatoes, and a beer from the BBQ menu. Please and thank you!'
pattern = 'chicken'
repl = 'pulled pork'

re.sub(pattern, repl, string)
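
`re.sub()` becomes more useful once the pattern is a real regex. The sketch below shows two things on made-up strings: limiting the number of replacements with `count`, and reusing captured groups in the replacement through `\1`, `\2`, and so on.

```python
import re

menu = "chicken, chicken wings, and fried chicken"

# count=1 replaces only the first occurrence.
first_only = re.sub("chicken", "pulled pork", menu, count=1)
print(first_only)

# \1, \2, and \3 in the replacement refer to the numbered groups in the pattern.
date_us = "07-17-2021"
date_iso = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3-\1-\2", date_us)
print(date_iso)   # 2021-07-17
```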

To read more about `re.sub()`, check out help output:

In [None]:
help(re.sub)

5. ```python
re.split(pattern, string, ...)
```
: similar to the `str.split()` method, but splits `string` wherever `pattern` matches and returns the resulting pieces as a list (along with the text of any groups in `pattern`)

In [None]:
string = "Mark is bringing 3 kids, Joanne is bringing 0 kids, and Georgie is bringing 1 kid."

re.split(', ', string)
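
The advantage over `str.split()` shows up when the separators vary. In this sketch the guest list and its mix of separators are made up for illustration:

```python
import re

guests = "Mark, Joanne;Georgie  and   Sam"

# Split on a comma or semicolon (plus trailing whitespace), or on the word "and".
pieces = re.split(r"[,;]\s*|\s+and\s+", guests)
print(pieces)               # ['Mark', 'Joanne', 'Georgie', 'Sam']

# str.split() can only split on one fixed separator.
print(guests.split(", "))   # ['Mark', 'Joanne;Georgie  and   Sam']
```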

To read more about `re.split()`, check out help output:

In [None]:
help(re.split)

6. ```python
re.compile(pattern, ...)
```
: compiles a regex `pattern` into a regular expression object

We can call the functions we have learned above as methods on this compiled object. This can save time when the same pattern is used repeatedly, since parsing regex strings can be computationally expensive.

In [None]:
string = "Mark is bringing 3 kids, Joanne is bringing 0 kids, and Georgie is bringing 1 kid."
pattern = re.compile(r'(\d+)\s+kids?')  # a number, whitespace, then "kid" or "kids"
pattern.findall(string)
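
Compiling pays off mostly when one pattern is applied to many strings. A minimal sketch, assuming a hypothetical list of RSVP messages:

```python
import re

kids_pattern = re.compile(r'(\d+) kids?')   # compiled once, reused in the loop

rsvps = [
    "Mark is bringing 3 kids",
    "Joanne is bringing 0 kids",
    "Georgie is bringing 1 kid",
    "Sam is coming alone",
]

total = 0
for rsvp in rsvps:
    match = kids_pattern.search(rsvp)
    if match:                          # skip messages without a number of kids
        total += int(match.group(1))

print(total)   # 4
```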

To read more about `re.compile()`, check out help output:

In [None]:
help(re.compile)

As you've seen, regular expressions can be rather complex, so use them sparingly. It can be tempting to write one massive regular expression, but doing so is rarely the best practice. As a rule of thumb, prefer the `in` operator or string methods whenever they can do the job.

For a more in-depth tutorial on regular expressions, check [this](https://realpython.com/regex-python/) out.

To learn more about the `re` module, see the [documentation](https://docs.python.org/3/library/re.html#re.Pattern.match).

## **<font color='GREEN'> Exercise</font>**

Write a program that finds all the sub-strings where there is a single character between `a` and `c`.

For example:
```python
string = "abcadsfearc"
```

Expected output:
    
`['abc', 'arc']`

In [None]:
# TODO: insert solution here

# Filename Pattern Matching

When you are working within your filesystem, you'll often want to quickly retrieve the filenames in a directory. Additionally, you might only want the subset of filenames that match a given pattern. To accomplish this, there are a few built-in Python modules to help out.

## The `glob` Module

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- <font color='LIGHTGRAY'>pattern match with string methods</font>
- <font color='LIGHTGRAY'>pattern match regular expressions using the re module</font>
- **pattern match filenames using the `glob` and `pathlib` modules**
- <font color='LIGHTGRAY'>handle the complexities of dates and times using the datetime and dateutil modules</font>
- <font color='LIGHTGRAY'>archive and unarchive files using the zipfile and shutil modules</font>
- <font color='LIGHTGRAY'>serialize and deserialize Python objects using the pickle module</font>

The `glob` module (short for global) is useful for finding all the pathnames matching a specific pattern.

We can use `glob` to search for a specific file, or for files whose names match a certain pattern, using wildcards such as:

- `*`: matches zero or more characters
- `?`: matches exactly one character

In [None]:
import glob

For example, we can use the
```python
glob.glob(pathname, recursive=False)
```
method to search for all the Jupyter Notebook (`.ipynb` extension) source files in the current directory. It returns a list of the filenames:

In [None]:
glob.glob('*.ipynb')

Here's another example where we find all the files with a `.ipynb` extension that contain `lecture` in the filename:

In [None]:
for filename in glob.glob('*lecture*.ipynb'):
    print(filename)

The `glob` module also makes it easy to search for files recursively in subdirectories:

In [None]:
for file in glob.glob('**/*.txt', recursive=True):
    print(file)
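
Since the results above depend on what happens to be in your current directory, here is a self-contained sketch that builds a small temporary directory first. The filenames and the helper `names()` are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a small, hypothetical directory tree to search.
    os.makedirs(os.path.join(tmp, "notes"))
    for name in ["day_1.txt", "day_2.txt", "day_10.txt", os.path.join("notes", "todo.txt")]:
        open(os.path.join(tmp, name), "w").close()

    def names(pattern, **kwargs):
        # Return the matches relative to tmp, sorted for stable output.
        found = glob.glob(os.path.join(glob.escape(tmp), pattern), **kwargs)
        return sorted(os.path.relpath(path, tmp) for path in found)

    one_char = names("day_?.txt")                     # ? matches exactly one character
    any_chars = names("day_*.txt")                    # * matches zero or more characters
    recursive = names("**/*.txt", recursive=True)     # ** also descends into subdirectories

print(one_char)    # ['day_1.txt', 'day_2.txt']
print(any_chars)   # ['day_1.txt', 'day_10.txt', 'day_2.txt']
print(recursive)   # the three files above plus the one inside notes
```

`glob.escape()` is used on the temporary directory's path so that any special characters in it are not treated as wildcards.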

To read more about `glob.glob()`, check out help output:

In [None]:
help(glob.glob)

A similar method,
```python
glob.iglob(pathname, recursive=False)
```
returns an iterator that yields the same values as `glob.glob()` without storing them all at once.

Searching a large number of directories could take a long time and use a lot of memory. Because `glob.iglob()` produces matches one at a time instead of building the whole list up front, it is a good fit for that situation.

For example:

In [None]:
glob.iglob('**/*.txt', recursive=True)

In [None]:
for file in glob.iglob('**/*.txt', recursive=True):
    print(file)

To read more about `glob.iglob()`, check out help output:

In [None]:
help(glob.iglob)

To read more about the `glob` module, check out the [documentation](https://docs.python.org/3/library/glob.html).

## The `pathlib` Module

In previous lectures, we've learned how to use the `pathlib` module to navigate around our filesystem. It can also be used for pattern matching of filenames. The module contains similar `glob` functionality for making flexible file listings.

In [None]:
from pathlib import Path

For example, when we import `Path` from `pathlib`, we can use the
```python
Path.glob(pattern)
```
or
```python
Path.rglob(pattern)
```
(recursive glob) method to list the files whose names contain the number `2`. Both return a generator object that yields the matching paths: `Path.glob()` searches only the given directory (here, the current directory), while `Path.rglob()` also searches its subdirectories:

In [None]:
p = Path('.')
for name in p.glob('*2*'):
    print(name)
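
The difference between `Path.glob()` and `Path.rglob()` is easiest to see side by side. This sketch builds a small temporary directory with made-up filenames so that the output does not depend on your own files:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files: two at the top level, one in a subdirectory.
    (root / "lab").mkdir()
    for relative in ["lecture_2.ipynb", "notes.txt", "lab/lab_2.ipynb"]:
        (root / relative).touch()

    top_level = sorted(p.name for p in root.glob("*2*"))     # this directory only
    everywhere = sorted(p.name for p in root.rglob("*2*"))   # subdirectories too

print(top_level)    # ['lecture_2.ipynb']
print(everywhere)   # ['lab_2.ipynb', 'lecture_2.ipynb']
```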

To read more about `Path.glob()` and `Path.rglob()`, check out help output:

In [None]:
help(Path.glob)

In [None]:
help(Path.rglob)

## **<font color='GREEN'> Exercise</font>**

Recall that in the day 3 lab we did a scavenger hunt for coins scattered across directories starting at `day_3_assets/lab`, where each coin's filename has the format `{coin type}_COIN.txt`.

Find all the coins and print out the pathnames.

In [None]:
# TODO: insert solution here

# <center>DATES AND TIMES</center>

---

Working with dates and times can be quite a tricky challenge, since you have to deal with time zones, daylight saving time, and various written date formats. Fortunately, Python has built-in libraries that can help manage all these complexities.

# How Computers Count Time

Almost all computers count time from an instant called the [Unix epoch](https://en.wikipedia.org/wiki/Unix_time), which occurred on January 1, 1970, at 00:00:00 UTC (or [Coordinated Universal Time](https://en.wikipedia.org/wiki/Coordinated_Universal_Time)).

By definition, Unix time elapses at the same rate as UTC (i.e., a one-second step in UTC is equivalent to a one-second step in Unix time). Nearly all programming languages, including Python, incorporate the concept of Unix time.

For instance, we can use the `time` module to show the number of seconds since the Unix epoch (excluding leap seconds):

In [None]:
import time
time.time()

As you can see from the example above, Unix time is nearly impossible for humans to read, so it is typically converted to UTC, which can then be converted into a local time zone.
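
As a quick preview of the `datetime` module covered below, this sketch converts Unix time to a readable UTC date and back. The timestamps `0` and `86_400` are picked only because their results are easy to check by hand.

```python
import datetime

# Timestamp 0 is the Unix epoch itself.
epoch = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
print(epoch.isoformat())            # 1970-01-01T00:00:00+00:00

# 86,400 seconds later is exactly one day after the epoch.
one_day_later = datetime.datetime.fromtimestamp(86_400, tz=datetime.timezone.utc)
print(one_day_later.isoformat())    # 1970-01-02T00:00:00+00:00

# Going the other way: a datetime with a time zone back to Unix time.
print(one_day_later.timestamp())    # 86400.0
```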

# Date Formats

Instead of trying to parse Unix time, we can work in terms of years, months, days, and so forth. But even with these conventions, another layer of complexity stems from the fact that different languages and cultures have different ways of writing the date.

For instance, in the United States, dates are usually written starting with the month, then day, then year. For example, January 31, 2020, is written as **01-31-2020**. However, most of Europe and many other areas write the date starting with the day, then the month, then the year. This means that January 31, 2020 is written as **31-01-2020**. 

These differences can cause all sorts of confusion when communicating across cultures. To help avoid communication mistakes, the International Organization for Standardization (ISO) developed a standard (called [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601)) that specifies that all dates should be written in order of most-to-least-significant data. For instance, a date and time made up of year (`YYYY`), month (`MM`), day (`DD`), hour (`hh`), minute (`mm`), and second (`ss`) is formatted as:

```
YYYY-MM-DD hh:mm:ss
```

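This format has no ambiguity. (Strictly speaking, ISO 8601 separates the date from the time with a `T`; a space is a common, more readable variant.) You'll see how the ISO 8601 format is used in the `datetime` and `dateutil` modules.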
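
One practical benefit of the most-to-least-significant ordering is that sorting the strings as plain text also sorts them chronologically. A small sketch with made-up timestamps, using the `datetime` module introduced in the next section:

```python
import datetime

moment = datetime.datetime(2020, 1, 31, 13, 14, 31)

print(moment.isoformat())          # 2020-01-31T13:14:31
print(moment.isoformat(sep=" "))   # 2020-01-31 13:14:31

# Plain string sorting puts ISO 8601 timestamps in chronological order.
stamps = ["2020-01-31 13:14:31", "2019-12-25 08:00:00", "2020-01-02 23:59:59"]
print(sorted(stamps))
```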

# The `datetime` and `dateutil` Modules

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- <font color='LIGHTGRAY'>pattern match with string methods</font>
- <font color='LIGHTGRAY'>pattern match regular expressions using the re module</font>
- <font color='LIGHTGRAY'>pattern match filenames using the glob and pathlib modules</font>
- **handle the complexities of dates and times using the `datetime` and `dateutil` modules**
- <font color='LIGHTGRAY'>archive and unarchive files using the zipfile and shutil modules</font>
- <font color='LIGHTGRAY'>serialize and deserialize Python objects using the pickle module</font>

Fortunately, you don't need to implement these complicated features from scratch. We will show you how to work with dates and times using the `datetime` and `dateutil` modules.

In [None]:
import datetime

## Creating `datetime` Instances

The `datetime` module provides three classes that make up the high-level interface that most people will use:

1. `datetime.date`: idealized date that assumes the Gregorian calendar. This object stores the `year`, `month`, and `day` as attributes
2. `datetime.time`: idealized time that assumes there are 86,400 seconds per day with no leap seconds. This object stores `hour`, `minute`, `second`, and `microsecond`.
3. `datetime.datetime`: combination of `datetime.date` and `datetime.time` classes. It has all the attributes of both classes.

In [None]:
datetime.date(year=2020, month=1, day=31)

In [None]:
datetime.time(hour=13, minute=14, second=31)

In [None]:
datetime.datetime(year=2020, month=1, day=31, hour=13, minute=14, second=31)

To read more about the `datetime` classes, check out help output:

In [None]:
help(datetime.date)

In [None]:
help(datetime.time)

In [None]:
help(datetime.datetime)

The `datetime` module also provides several other ways to create `datetime` instances that don't require you to specify each attribute:

1. `.date.today()`: creates a `datetime.date` instance with the current local date
2. `.datetime.now()`: creates a `datetime.datetime` instance with the current local date and time
3. `.datetime.combine()`: combines instances of `datetime.date` and `datetime.time` into a single `datetime.datetime` instance

For example:

In [None]:
today = datetime.date.today()
today

In [None]:
now_time = datetime.datetime.now()
now_time

In [None]:
now = datetime.datetime.now()
current_time = datetime.time(now.hour, now.minute, now.second)
datetime.datetime.combine(today, current_time)

## Using Strings to Create `datetime` Instances

If you have a `date_string` with the date in ISO 8601 format, you can create a `date` instance from the `date_string` by using the
```python
.fromisoformat(date_string)
```
function.

For example:

In [None]:
datetime.date.fromisoformat("2020-01-21")

But what if you have a string that represents a date and time that isn't in the ISO 8601 format? Fortunately, `datetime` provides a method called
```python
.strptime(date_string, format_string)
```
to handle this.

First, you need to tell Python what each part of the string represents using formatting codes:

In [None]:
date_string = "01-31-2020 14:45:37"
format_string = "%m-%d-%Y %H:%M:%S"

Now that `date_string` and `format_string` are defined, we can use them to create a `datetime` instance using `.strptime()`:

In [None]:
datetime.datetime.strptime(date_string, format_string)
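
The same approach handles the day-first format described earlier, and `.strftime()` performs the reverse conversion from a `datetime` back to a string. The sketch below also shows what happens when the string does not fit the format; the sample strings are made up for illustration.

```python
import datetime

european = "31-01-2020 14:45:37"   # day first
parsed = datetime.datetime.strptime(european, "%d-%m-%Y %H:%M:%S")
print(parsed)                      # 2020-01-31 14:45:37

# .strftime() goes the other way: datetime -> string in a chosen format.
print(parsed.strftime("%m/%d/%Y"))   # 01/31/2020
print(parsed.strftime("%Y-%m-%d"))   # 2020-01-31

# A month-first string does not fit the day-first format: 31 is not a valid month.
try:
    datetime.datetime.strptime("01-31-2020 14:45:37", "%d-%m-%Y %H:%M:%S")
except ValueError as error:
    print(f"could not parse: {error}")
```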

To find a complete list of format codes, see the [documentation](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior).

## Working With Time Zones

To ensure that your code is correct, it is important to know the time zone in which a date occurs. The `datetime` module does not provide a direct way to interact with the time zone database. Thus, the `datetime` module recommends using a third-party package called `dateutil` for this.

In [None]:
import dateutil.tz  # import the submodule explicitly so that dateutil.tz is available

We can use the 
```python
dateutil.tz.tzlocal()
```
method to get the local time zone.

For example:

In [None]:
now = datetime.datetime.now(tz=dateutil.tz.tzlocal())
now.tzname()

You can also create time zones that are not the same as the time zone reported by your computer. To do this, we will use the
```python
dateutil.tz.gettz(zone_string)
```
method, where `zone_string` is an official [IANA name](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) for the time zone you're interested in.

For example:

In [None]:
london_tz = dateutil.tz.gettz("Europe/London")
now = datetime.datetime.now(tz=london_tz)
now

We can also use the `.tzname()` method to print the name of the time zone, which is `'BST'` (British Summer Time) during the summer months and `'GMT'` (Greenwich Mean Time) otherwise.

In [None]:
now.tzname()

To read more about `dateutil.tz`, check out help output:

In [None]:
help(dateutil.tz)

## Arithmetic With `datetime`

Using a
```python
datetime.timedelta(...)
```
instance, we can do all sorts of arithmetic on our `datetime` instances.

For example:

In [None]:
now = datetime.datetime.now()
now

In [None]:
delta_pos_1_day = datetime.timedelta(days=+1)
tomorrow = now + delta_pos_1_day
tomorrow

`datetime.timedelta` instances also support negative values.

For example:

In [None]:
now = datetime.datetime.now()
now

In [None]:
delta_neg_1_day = datetime.timedelta(days=-1)
yesterday = now + delta_neg_1_day
yesterday

You can also provide a mix of positive and negative arguments.

For example:

In [None]:
now = datetime.datetime.now()
now

In [None]:
delta = datetime.timedelta(days=+3, hours=-4)
now + delta
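
Subtracting one `datetime` from another also produces a `timedelta`. The sketch below uses fixed, made-up dates instead of `now()` so that the results are reproducible:

```python
import datetime

start = datetime.datetime(2021, 7, 17, 10, 0)
end = datetime.datetime(2021, 7, 19, 16, 30)

elapsed = end - start             # subtracting datetimes gives a timedelta
print(elapsed)                    # 2 days, 6:30:00
print(elapsed.days)               # 2
print(elapsed.total_seconds())    # 196200.0

# timedelta normalizes its arguments: days=+3, hours=-4 is 2 days and 20 hours.
mixed = datetime.timedelta(days=+3, hours=-4)
print(mixed)                      # 2 days, 20:00:00
```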

To read more about `datetime.timedelta()`, check out help output:

In [None]:
help(datetime.timedelta)

`datetime.timedelta` is very useful, but it is limited because it cannot add or subtract calendar intervals such as months or years, whose lengths vary. Fortunately, `dateutil` provides a more powerful replacement called
```python
dateutil.relativedelta.relativedelta(...)
```
that is very similar to `datetime.timedelta`.

For example:

In [None]:
now = datetime.datetime.now()
now

In [None]:
from dateutil.relativedelta import relativedelta

delta = relativedelta(years=+5, months=+1, days=+3, hours=-4, minutes=-30)
now + delta

You can also use `.relativedelta()` to calculate the difference between two `datetime` instances.

For example:

In [None]:
now = datetime.datetime.now()
now

In [None]:
tomorrow = now + datetime.timedelta(days=+1)
relativedelta(now, tomorrow)

To read more about `.relativedelta()`, check out help output:

In [None]:
help(relativedelta)

To learn more about the `datetime` module, see the [documentation](https://docs.python.org/3/library/datetime.html#module-datetime).

To learn more about the `dateutil` module, see the [documentation](https://dateutil.readthedocs.io/en/stable/).

## **<font color='GREEN'> Exercise</font>**

This year's Community BBQ will start on July 17, 2021 in Providence, RI at 10am. Next year's Community BBQ will start on August 16, 2022 in Providence, RI at 10:30am.

Do and answer the following:
- create a `datetime` instance of the 2021 Community BBQ named `bbq_date_2021`
- create a `datetime` instance of the 2022 Community BBQ named `bbq_date_2022`
- what is the date and time difference between `bbq_date_2021` and `bbq_date_2022`?

In [None]:
# TODO: insert solution here

# <center>ARCHIVING</center>

---

Archiving is a convenient way to package several files into one. The two most common archive types are ZIP and TAR, but we will mostly cover working with ZIP archives.

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- <font color='LIGHTGRAY'>pattern match with string methods</font>
- <font color='LIGHTGRAY'>pattern match regular expressions using the re module</font>
- <font color='LIGHTGRAY'>pattern match filenames using the glob and pathlib modules</font>
- <font color='LIGHTGRAY'>handle the complexities of dates and times using the datetime and dateutil modules</font>
- **archive and unarchive files using the `zipfile` and `shutil` modules**
- <font color='LIGHTGRAY'>serialize and deserialize Python objects using the pickle module</font>

# Reading ZIP Files

The `zipfile` module makes it easy to open and extract ZIP files. Similar to using `open()`, you can create a `ZipFile` object with the support of a `with` statement.

In [None]:
from zipfile import ZipFile

For example, using the 
```python
ZipFile.namelist()
```
method we can retrieve a list of files in the archive:

In [None]:
from pprint import pprint
from pathlib import Path

with ZipFile(Path.cwd() / 'day_4_assets' / 'event_tables.zip', 'r') as zip_obj:
    pprint(zip_obj.namelist())

You can also retrieve information about the files in the archive using the
```python
ZipFile.getinfo(filename)
```
method that returns a `ZipInfo` object.

For example:

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'event_tables.zip', 'r') as zip_obj:
    bbq_event_2_csv = [name for name in zip_obj.namelist() if 'bbq_event_2.csv' in name][0]
    print(zip_obj.getinfo(bbq_event_2_csv))
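
If you don't have `event_tables.zip` at hand, the same calls can be tried on an archive built in memory. This is a sketch under stated assumptions: the member names and contents are made up, an `io.BytesIO` buffer stands in for a file on disk, and `ZipFile.writestr()` is used to add members directly from strings.

```python
import io
from zipfile import ZipFile

buffer = io.BytesIO()   # in-memory stand-in for a .zip file on disk
with ZipFile(buffer, 'w') as zip_obj:
    zip_obj.writestr('tables/bbq_event_1.csv', 'guest,kids\nMark,3\n')
    zip_obj.writestr('tables/bbq_event_2.csv', 'guest,kids\nJoanne,0\n')

with ZipFile(buffer, 'r') as zip_obj:
    names = zip_obj.namelist()
    info = zip_obj.getinfo('tables/bbq_event_2.csv')

print(names)                # both member names, in the order they were added
print(info.filename)        # tables/bbq_event_2.csv
print(info.file_size)       # 20: uncompressed size in bytes
print(info.compress_size)   # size as stored in the archive
```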

# Extracting ZIP Files

We can extract one or more files from ZIP archives through the
```python
ZipFile.extract(member, path=None, pwd=None)
```
and
```python
ZipFile.extractall(path=None, members=None, pwd=None)
```
methods. They both extract files to the current directory by default but you can pass in a `path` that allows you to specify a different directory to extract files to. If the directory does not exist, it is automatically created.

For example:

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'event_tables.zip', 'r') as zip_obj:
    extract_path = [name for name in zip_obj.namelist() if 'bbq_event_2.csv' in name][0]
    unzip_path = Path.cwd() / 'day_4_assets' / 'unzipped_with_extract'
    zip_obj.extract(extract_path, path=unzip_path)

We will reuse the `tree()` function from a previous lecture, which uses the `pathlib` module, to view the directory tree of our extracted files:

In [None]:
def tree(directory):
    print(f'+ {directory}')
    for path in sorted(directory.rglob('*')):             # list subdirectories
        depth = len(path.relative_to(directory).parts)    # use .relative_to() to get how far we are from the root
        spacer = '    ' * depth
        print(f'{spacer}+ {path.name}')

In [None]:
tree(Path.cwd() / 'day_4_assets' / 'unzipped_with_extract')

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'event_tables.zip', 'r') as zip_obj:
    unzip_path = Path.cwd() / 'day_4_assets' / 'unzipped_with_extractall'
    zip_obj.extractall(path=unzip_path)

In [None]:
tree(Path.cwd() / 'day_4_assets' / 'unzipped_with_extractall')
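
The difference between `extract()` and `extractall()` can also be checked without the lecture assets. The sketch below works entirely inside a temporary directory; the archive and its two members are made up for illustration.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    archive = root / 'tables.zip'
    with ZipFile(archive, 'w') as zip_obj:
        zip_obj.writestr('tables/a.csv', 'a\n')
        zip_obj.writestr('tables/b.csv', 'b\n')

    with ZipFile(archive, 'r') as zip_obj:
        zip_obj.extract('tables/a.csv', path=root / 'one')   # a single member
        zip_obj.extractall(path=root / 'all')                # every member

    one = sorted(p.name for p in (root / 'one').rglob('*.csv'))
    everything = sorted(p.name for p in (root / 'all').rglob('*.csv'))

print(one)          # ['a.csv']
print(everything)   # ['a.csv', 'b.csv']
```

Because everything lives in a `TemporaryDirectory`, no cleanup step is needed afterwards.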

Some misc. cleanup:

In [None]:
import shutil
shutil.rmtree(Path.cwd() / 'day_4_assets' / 'unzipped_with_extract')
shutil.rmtree(Path.cwd() / 'day_4_assets' / 'unzipped_with_extractall')

# Creating New ZIP Archives

To create a new ZIP archive, you simply open a `ZipFile` object in write mode and add the files you want to archive using the 
```python
ZipFile.write(filename, arcname=None, ...)
```
method:

In [None]:
recipe_files = ['pulled_pork_recipe.txt', 'smoked_mac_and_cheese_recipe.txt']
with ZipFile(Path.cwd() / 'day_4_assets' / 'recipes.zip', 'w') as recipe_zip:
    for name in recipe_files:
        filename = Path.cwd() / 'day_4_assets' / 'recipes' / name
        recipe_zip.write(filename, arcname=name)

Note that we must pass in `arcname`; otherwise, the file will be saved with its full path within our archive file.

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'recipes.zip', 'r') as zip_obj:
    pprint(zip_obj.namelist())

You can also add files to an existing archive using the append mode:

In [None]:
new_recipe_files = ['cornbread_recipe.txt']
with ZipFile(Path.cwd() / 'day_4_assets' / 'recipes.zip', 'a') as recipe_zip:
    for name in new_recipe_files:
        filename = Path.cwd() / 'day_4_assets' / 'recipes' / name
        recipe_zip.write(filename, arcname=name)

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'recipes.zip', 'r') as zip_obj:
    pprint(zip_obj.namelist())
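
To see the effect of `arcname` and of the append mode in isolation, here is a sketch that writes the same hypothetical recipe file twice, once without and once with `arcname`, inside a temporary directory:

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    recipe = root / 'cornbread_recipe.txt'
    recipe.write_text('cornmeal, buttermilk, butter')
    archive = root / 'recipes.zip'

    with ZipFile(archive, 'w') as zip_obj:   # 'w' creates (or overwrites) the archive
        zip_obj.write(recipe)                # stored under its full path
    with ZipFile(archive, 'a') as zip_obj:   # 'a' adds to the existing archive
        zip_obj.write(recipe, arcname=recipe.name)

    with ZipFile(archive, 'r') as zip_obj:
        names = zip_obj.namelist()

print(names[0])   # the temporary directory's path followed by the filename
print(names[1])   # cornbread_recipe.txt
```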

Some misc. cleanup:

In [None]:
import os
os.remove(Path.cwd() / 'day_4_assets' / 'recipes.zip')

To read more about `ZipFile()`, check out help output:

In [None]:
help(ZipFile)

To learn more about `zipfile`, check out the [documentation](https://docs.python.org/3/library/zipfile.html#module-zipfile).

# An Easier Way of Creating Archives

The `shutil` module also supports creating, reading, and extracting TAR and ZIP archives using high-level methods. We will also use the `pathlib` module for best practices when creating file paths.

In [None]:
import shutil
from pathlib import Path

To create an archive using `shutil`, you will use the
```python
shutil.make_archive(base_name, archive_format, root_dir=current_directory, ...)
```
method, which supports an `archive_format` of `zip`, `tar`, `gztar`, `bztar`, or `xztar`.

For example:

In [None]:
base_name = Path.cwd() / 'day_4_assets' / 'archived_recipes'
root_dir = Path.cwd() / 'day_4_assets' / 'recipes/'
shutil.make_archive(base_name, 'zip', root_dir)

In [None]:
with ZipFile(Path.cwd() / 'day_4_assets' / 'archived_recipes.zip', 'r') as zip_obj:
    pprint(zip_obj.namelist())

To read more about `shutil.make_archive()`, check out help output:

In [None]:
help(shutil.make_archive)

To extract the archive, call the
```python
shutil.unpack_archive(filename, extract_dir=current_directory)
```
method.

For example:

In [None]:
filename = Path.cwd() / 'day_4_assets' / 'archived_recipes.zip'
extract_dir = Path.cwd() / 'day_4_assets' / 'unpacked_recipes/'
shutil.unpack_archive(filename, extract_dir)

In [None]:
tree(Path.cwd() / 'day_4_assets' / 'unpacked_recipes')
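
The full round trip, archive then unpack, can be sketched in a temporary directory as well. The directory and file names below are made up for illustration; paths are converted to strings before being handed to `shutil`.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'recipes').mkdir()
    (root / 'recipes' / 'ribs.txt').write_text('low and slow')

    # make_archive adds the extension itself and returns the archive's path.
    archive_path = shutil.make_archive(
        str(root / 'archived_recipes'), 'zip', root_dir=str(root / 'recipes')
    )
    shutil.unpack_archive(archive_path, extract_dir=str(root / 'unpacked'))

    archive_name = Path(archive_path).name
    restored = (root / 'unpacked' / 'ribs.txt').read_text()

print(archive_name)   # archived_recipes.zip
print(restored)       # low and slow
```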

To read more about `shutil.unpack_archive()`, check out help output:

In [None]:
help(shutil.unpack_archive)

Some misc. cleanup:

In [None]:
import os
shutil.rmtree(Path.cwd() / 'day_4_assets' / 'unpacked_recipes')
os.remove(Path.cwd() / 'day_4_assets' / 'archived_recipes.zip')

To learn more about archiving operations from the `shutil` module, check out the [documentation](https://docs.python.org/3/library/shutil.html#archiving-operations).

# <center>OBJECT PERSISTENCE</center>

---

**Pickling** is a popular method of preserving food for later consumption. By placing food in a preserving solution, it is possible to increase its shelf life.

As a Python developer, you might one day find yourself in need of a way to store your Python objects for later use. What if I told you that you can pickle Python objects too?

# Serialization

**Serialization** is the process of transforming objects (or data structures) into **byte streams**. A byte is composed of 8 bits, each a zero or a one, and a stream of bytes can be stored and transferred easily. This allows developers to save configuration data or a program's progress, and then store it on disk or send it to another location.

Unlike pickling vegetables, where the pickled food's flavor and texture change, pickled (or serialized) Python objects can be easily unpickled back to their original form. This process is known as **deserialization**.

Python objects can be serialized and deserialized using the native `pickle` module.

In [None]:
import pickle

Pickling should not be confused with archiving or compression. Pickling translates data into a format that can be transferred from RAM to disk. Compression, on the other hand, is the process of encoding data using fewer bits to save disk space.

For example, think of serialization as the process of saving and loading your progress in a video game from your storage disk.
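The difference between pickling and compression can be sketched by doing both, one after the other. The `inventory` data below is hypothetical and deliberately repetitive so that compression has something to shrink. `zlib` is used here only as one convenient standard-library compressor.

```python
import pickle
import zlib

# Hypothetical, highly repetitive data
inventory = [{'name': 'cucumber', 'color': 'green'} for _ in range(1000)]

pickled = pickle.dumps(inventory)      # serialization: object -> bytes
compressed = zlib.compress(pickled)    # compression: bytes -> fewer bytes

# Undo both steps in reverse order
restored = pickle.loads(zlib.decompress(compressed))

print(len(pickled), len(compressed))
print(restored == inventory)
```

Pickling alone makes the object storable. Compression is an optional extra step applied to the resulting bytes.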

# Pickle vs JSON

Recall from previous lectures that JavaScript Object Notation (JSON) lets us save and transmit objects encoded as strings in a human-readable, language-independent schema. Although converting Python objects to JSON might be faster than pickling, the JSON format does have limitations. Most importantly, only a limited subset of Python built-in data structures can be represented in JSON. With pickling, we can easily serialize a much larger range of Python types, including custom classes.
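A quick sketch of that limitation, using a hypothetical `basket` dictionary: JSON rejects a `set` outright and silently turns a `tuple` into a `list`, while `pickle` restores both types unchanged.

```python
import json
import pickle

basket = {'veggies': {'carrot', 'pumpkin'}, 'size': (2, 3)}

# Sets are not JSON serializable, so json.dumps() raises a TypeError
try:
    json.dumps(basket)
    json_error = None
except TypeError as err:
    json_error = str(err)

# pickle round-trips the set and the tuple as they are
restored = pickle.loads(pickle.dumps(basket))

# JSON has no tuple type, so a tuple comes back as a list
json_round_trip = json.loads(json.dumps({'size': (2, 3)}))

print(json_error)
print(restored)
print(json_round_trip)
```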

# What can be Pickled and Unpickled

The following types can be serialized and deserialized using the `pickle` module:

- All native datatypes in Python (e.g., `booleans`, `None`, `integers`, `floats`, `strings`, etc.)
- Python containers (e.g., `dictionary`, `set`, `list`, `tuples`) - as long as the container has pickleable objects
- Functions and classes that are defined at the top level of a module
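The list above can be checked with a short experiment. This is only a sketch: `math.sqrt` stands in for "a function defined at the top level of a module", and a `lambda` stands in for a function that is not reachable by name.

```python
import math
import pickle

# Native datatypes and containers of pickleable objects
values = [True, None, 42, 3.14, 'carrot', (1, 2), {'seeds': {1, 2}}]
values_ok = pickle.loads(pickle.dumps(values)) == values

# A function is pickled as a reference (module + name), not as code,
# so unpickling hands back the very same function object
same_function = pickle.loads(pickle.dumps(math.sqrt)) is math.sqrt

# A lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda x: x * 2)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(values_ok, same_function, lambda_pickled)
```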

<center>
    <b>IMPORTANT NOTES:</b>
    <ul><b>Your pickled data can only be unpickled using Python</b></ul>
    <ul><b>Using a different Python version when pickling and unpickling can cause many problems</b></ul>
    <ul><b>Make sure that the environment where a function or class instance is unpickled is able to import that function or class; otherwise, an exception will be raised</b></ul>
    <ul><b>Unpickling data from an untrusted source can result in the execution of malicious code</b></ul>
</center>
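To see why untrusted pickles are dangerous, consider this harmless sketch. The hypothetical `Sneaky` class uses the `__reduce__` hook to tell `pickle` which callable to run while loading. Here it is only `sorted`, but a malicious pickle could name any importable callable instead.

```python
import pickle

class Sneaky:
    # __reduce__ returns (callable, args); pickle calls it when loading
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Sneaky())

# Loading does not rebuild a Sneaky object; it calls sorted([3, 1, 2])
result = pickle.loads(payload)
print(result)
```

The loaded value is whatever the callable returned, which shows that `pickle.loads()` executes code chosen by whoever created the byte stream.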

## <font color='LIGHTGRAY'>By the end of the lecture, you'll be able to:</font>
- <font color='LIGHTGRAY'>pattern match with string methods</font>
- <font color='LIGHTGRAY'>pattern match regular expressions using the re module</font>
- <font color='LIGHTGRAY'>pattern match filenames using the glob and pathlib modules</font>
- <font color='LIGHTGRAY'>handle the complexities of dates and times using the datetime and dateutil modules</font>
- <font color='LIGHTGRAY'>archive and unarchive files using the zipfile and shutil modules</font>
- **serialize and deserialize Python objects using the `pickle` module**

# Pickling

To pickle a Python object, use the
```python
pickle.dump(obj, file_obj, ...)
```
function, paired with a file opened by `open()` in write binary mode (`'wb'`) inside a `with` statement.

For example:

In [None]:
from pathlib import Path

food_lst = ['cucumber', 'pumpkin', 'carrot']

with open(Path.cwd() / 'day_4_assets' / 'food_lst.pkl', 'wb') as f:
    pickle.dump(food_lst, f)
    
tree(Path.cwd() / 'day_4_assets')

Note that the `.pkl` extension is not required, but including it is good practice for readability.

To read more about `pickle.dump()`, check out the help output:

In [None]:
help(pickle.dump)
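The sketch below illustrates why the file must be opened in binary mode, and shows the related `pickle.dumps()` function, which returns the bytes instead of writing them to a file. It writes to a temporary directory rather than `day_4_assets`, and choosing `protocol=pickle.HIGHEST_PROTOCOL` is optional and shown only as an illustration.

```python
import pickle
import tempfile
from pathlib import Path

food_lst = ['cucumber', 'pumpkin', 'carrot']

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / 'food_lst.pkl'

    # Text mode ('w') fails because pickle writes bytes, not str
    try:
        with open(pkl_path, 'w') as f:
            pickle.dump(food_lst, f)
        text_mode_failed = False
    except TypeError:
        text_mode_failed = True

    # Binary mode ('wb') works
    with open(pkl_path, 'wb') as f:
        pickle.dump(food_lst, f, protocol=pickle.HIGHEST_PROTOCOL)
    file_size = pkl_path.stat().st_size

# dumps() produces the byte stream without touching the disk
as_bytes = pickle.dumps(food_lst)

print(text_mode_failed, file_size, type(as_bytes))
```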

# Unpickling

To unpickle the contents we just pickled, use the
```python
pickle.load(file_obj, ...)
```
function with a file object opened in read binary mode (`'rb'`).

For example:

In [None]:
with open(Path.cwd() / 'day_4_assets' / 'food_lst.pkl', 'rb') as f:
    unpickled_lst = pickle.load(f)
    
print(unpickled_lst)

To read more about `pickle.load()`, check out the help output:

In [None]:
help(pickle.load)
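Each call to `pickle.load()` reads exactly one pickled object from the file. As a sketch, the cell below dumps two objects into one hypothetical file and loads them back in order, stopping when `pickle.load()` raises `EOFError` at the end of the file.

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / 'foods.pkl'

    # Write two separate pickles into the same file
    with open(pkl_path, 'wb') as f:
        pickle.dump(['cucumber', 'pumpkin'], f)
        pickle.dump({'carrot': 'orange'}, f)

    # Read them back one at a time until the file is exhausted
    loaded = []
    with open(pkl_path, 'rb') as f:
        while True:
            try:
                loaded.append(pickle.load(f))
            except EOFError:
                break

print(loaded)
```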

# Pickling and Unpickling Custom Objects

As mentioned, you can serialize your custom objects.

For example:

In [None]:
class Veggy():
    def __init__(self):
        self.color = ''
    def set_color(self, color):
        self.color = color

cucumber = Veggy()
cucumber.set_color('green')

with open(Path.cwd() / 'day_4_assets' / 'cucumber.pkl', 'wb') as f:
    pickle.dump(cucumber, f)

with open(Path.cwd() / 'day_4_assets' / 'cucumber.pkl', 'rb') as f:
    unpickled_cucumber = pickle.load(f)

print(unpickled_cucumber.color)
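It can help to see what actually comes back from unpickling a custom object. In this sketch, the hypothetical `Fruit` class counts how often `__init__` runs. The unpickled object is a separate copy with the same attributes, and by default it is restored without calling `__init__` again.

```python
import pickle

class Fruit:
    init_calls = 0  # class-level counter, for illustration only

    def __init__(self):
        Fruit.init_calls += 1
        self.color = ''

apple = Fruit()
apple.color = 'red'

# Round trip through bytes instead of a file
clone = pickle.loads(pickle.dumps(apple))

print(clone.color)            # attributes are restored
print(clone is apple)         # but it is a different object
print(Fruit.init_calls)       # __init__ ran only for the original
```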

## **<font color='ORANGE'>Caution</font>**

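We can only unpickle the object in an environment where the class `Veggy` is either defined or imported. This is because the pickle stores only a reference to the class (its module and name) along with the instance's attributes, not the class definition itself. If the class cannot be found, we'll get an `AttributeError`.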

In [None]:
del Veggy

In [None]:
with open(Path.cwd() / 'day_4_assets' / 'cucumber.pkl', 'rb') as f:
    unpickled_cucumber = pickle.load(f)

print(unpickled_cucumber.color)
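If you would rather handle this failure than let it stop your program, one possible approach is to catch the exception. The sketch below assumes it runs as a script or notebook cell, and the `Herb` class is hypothetical.

```python
import pickle

class Herb:
    def __init__(self, name):
        self.name = name

payload = pickle.dumps(Herb('basil'))

# Simulate an environment where the class is no longer available
del Herb

try:
    pickle.loads(payload)
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(error_message)
```

The error message names the missing class, which is usually enough to tell you which import or definition is needed.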

Some misc. cleanup:

In [None]:
import os
os.remove(Path.cwd() / 'day_4_assets' / 'food_lst.pkl')
os.remove(Path.cwd() / 'day_4_assets' / 'cucumber.pkl')

To learn more about the `pickle` module, check out the [documentation](https://docs.python.org/3/library/pickle.html#module-pickle).

# Conclusion

## You are now able to:
- pattern match with string methods
- pattern match regular expressions using the `re` module
- pattern match filenames using the `glob` and `pathlib` modules
- handle the complexities of dates and times using the `datetime` and `dateutil` modules
- archive and unarchive files using the `zipfile` and `shutil` modules
- serialize and deserialize Python objects using the `pickle` module

# References

- https://www.digitalocean.com/community/tutorials/an-introduction-to-string-functions-in-python-3
- https://stackabuse.com/introduction-to-regular-expressions-in-python/
- https://realpython.com/working-with-files-in-python/#filename-pattern-matching-using-glob
- https://realpython.com/python-datetime/
- https://realpython.com/working-with-files-in-python/#archiving
- https://stackabuse.com/introduction-to-the-python-pickle-module/