<img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">


<h1 style="text-align:center;">CSCI 141, Fall 2020</h1>
<h1 style="text-align:center;">Project: searching files; reading and copying webpages</h1>

# Part I: searching files

In this project you will make a simple version of the Unix/Linux <code>fgrep</code> program for searching files for a specified string (the name <code>fgrep</code> is short for "fixed-string <code>grep</code>": it searches for a literal string rather than a regular expression).  You should work through the <code>file_io.ipynb</code> notebook as part of this project.

This time you are writing a program (named <code>fgrep.py</code>) rather than a function.  Your program should behave as follows.  When invoked via
<pre>
python fgrep.py foo file1 file2 file3 file4
</pre>
it will search the files <code>file1, file2, &hellip;</code>, in order, for the string <code>foo</code>.  Whenever it finds this string in a line of a file, it will print the name of the file and the line.  The output should
look like this:
<pre>
file1: foo is a metasyntactic variable
file2: foobar() is a good name for a function
file3: foo is not a good name for a baby
</pre>
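
The matching rule and output format above can be sketched as follows. This is only an illustration, assuming a plain substring test (so <code>foo</code> also matches <code>foobar()</code>, as in the sample output). The helper name <code>matching_lines</code> is hypothetical, and the sketch works on a list of strings rather than on real files.

```python
def matching_lines(name, lines, pattern):
    """Return 'name: line' strings for every line that contains pattern."""
    results = []
    for line in lines:
        # Substring test: 'foo' in 'foobar()' is True.
        if pattern in line:
            # Drop the trailing newline so print() does not add a second one.
            text = line.rstrip("\n")
            results.append(f"{name}: {text}")
    return results


# Hypothetical sample data standing in for the contents of file1.
sample = ["foo is a metasyntactic variable\n", "nothing to see here\n"]
for out in matching_lines("file1", sample, "foo"):
    print(out)
```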

## Comments:
* You can either read each file line-by-line or you can read the entire file into a variable and then split it up into lines.
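
The two ways of reading mentioned in the comment above might look like this. It is a minimal sketch: the function names <code>read_by_iteration</code> and <code>read_then_split</code> are made up for illustration, and a small temporary file stands in for <code>file1</code>.

```python
import os
import tempfile


def read_by_iteration(path):
    """Read one line at a time; a file object iterates over its own lines."""
    lines = []
    with open(path) as f:
        for line in f:
            lines.append(line.rstrip("\n"))
    return lines


def read_then_split(path):
    """Read the whole file into one string, then split it into lines."""
    with open(path) as f:
        contents = f.read()
    return contents.splitlines()


# Create a small temporary file to try both approaches on.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("foo is a metasyntactic variable\nbar is another one\n")
    demo_path = tmp.name

by_iteration = read_by_iteration(demo_path)
by_split = read_then_split(demo_path)
os.remove(demo_path)

print(by_iteration)
print(by_split)
```

For an ordinary text file like this one, both approaches produce the same list of lines. Reading line by line avoids holding a very large file in memory all at once.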

# How to retrieve command line arguments

From your program you can retrieve the command line arguments via the
[sys.argv](https://docs.python.org/3/library/sys.html?highlight=sys.argv%23sys.argv#sys.argv) variable:
<pre>
import sys
print(sys.argv)
</pre>
This variable is a list of strings.  If you invoked the program as shown above, it would have the value
<pre>
['fgrep.py', 'foo', 'file1', 'file2', 'file3', 'file4']
</pre>
You should experiment with <code>sys.argv</code> until you are confident you understand it.
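
One possible way to pull the pieces out of such a list is sketched below. The function name <code>split_arguments</code> and the usage message are assumptions made for illustration. The sketch runs on a hard-coded list so that it works without any command line arguments.

```python
def split_arguments(argv):
    """Split an argv-style list into (search string, list of filenames)."""
    # argv[0] is the script name, argv[1] is the string to search for,
    # and everything after that is a filename.
    if len(argv) < 3:
        raise ValueError("usage: python fgrep.py STRING FILE [FILE ...]")
    return argv[1], argv[2:]


# The same list that sys.argv would hold for the invocation shown earlier.
example = ['fgrep.py', 'foo', 'file1', 'file2', 'file3', 'file4']
pattern, filenames = split_arguments(example)
print(pattern)
print(filenames)

# In a real program you would instead write:
#     import sys
#     pattern, filenames = split_arguments(sys.argv)
```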

## What to upload

Upload <code>fgrep.py</code> via the Blackboard site.

# Part II: retrieving data from the interwebs

In this project you will make a simple version of the <code>wget</code> program for downloading data from the web.
This will give you a Pythonic, command line alternative to downloading by opening a file in a browser and trying to save it.

# <code>urllib.request</code>

The [<code>urllib.request</code> module](https://docs.python.org/3/library/urllib.request.html) is an easy-to-use
module for opening and reading files on the Web.  Files are specified by their web address, formally known as a Uniform Resource Locator (URL).  For instance, the URL for Amazon is [https://www.amazon.com](https://www.amazon.com).  The <code>https</code> indicates that this URL is accessed via HTTPS, the secure version of HTTP (Hypertext Transfer Protocol).
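
To see the parts of a URL for yourself, you can take one apart with <code>urllib.parse.urlparse</code>. That function lives in a sibling module of <code>urllib.request</code> and is not needed for this project. It is shown here only as an illustration, and it does not contact the network.

```python
from urllib.parse import urlparse

# Split the Amazon URL into its components.
parts = urlparse('https://www.amazon.com')
print(parts.scheme)   # the protocol part, before '://'
print(parts.netloc)   # the host name

# The same for the file used later in this project.
alice = urlparse('http://www.cs.wm.edu/~rml/teaching/csci141/jupyter/alice.txt')
print(alice.scheme, alice.netloc, alice.path)
```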

## Reading a webpage into a variable

Here's how we can read a webpage into a variable:

<pre>
import urllib.request

URL = 'http://www.cs.wm.edu/~rml/teaching/csci141/jupyter/alice.txt'

# Notice the similarity to opening and reading a file:
with urllib.request.urlopen(URL) as url:
    s = url.read()
</pre>

The result is a variable of type [<code>bytes</code>](https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview):

<pre>
print(type(s))
print(s)
</pre>

The reason this is bytes rather than a string is that non-ASCII characters can be encoded in several different ways, and Python does not make assumptions about the encoding scheme.

By far the most common encoding scheme is [UTF-8](https://en.wikipedia.org/wiki/UTF-8).  Since our file is an ASCII text file, and every ASCII file is also valid UTF-8, it is safe to use UTF-8 to decode the web page and turn it into a string:

<pre>
text = s.decode('utf-8')

print(type(text))
print(text)
</pre>
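
To see why the encoding matters, here is a small offline sketch. The word <code>'café'</code> is just an example chosen because it contains one non-ASCII character. It compares the bytes produced by two different encodings.

```python
word = 'café'

# The same string becomes different bytes under different encodings.
utf8_bytes = word.encode('utf-8')      # 'é' takes two bytes in UTF-8
latin1_bytes = word.encode('latin-1')  # 'é' takes one byte in Latin-1

print(utf8_bytes, len(utf8_bytes))
print(latin1_bytes, len(latin1_bytes))

# Decoding with the encoding that produced the bytes recovers the string.
print(utf8_bytes.decode('utf-8') == word)

# Pure ASCII text gives identical bytes in ASCII and in UTF-8.
plain = 'Alice'
print(plain.encode('ascii') == plain.encode('utf-8'))
```

Decoding bytes with the wrong scheme either raises an error or silently produces the wrong characters, which is why Python leaves the choice to you.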

## Copying a webpage to a file

We can also copy a webpage directly to a file using the [shutil.copyfileobj()](https://docs.python.org/3/library/shutil.html#shutil.copyfileobj) function:

<pre>
import shutil

local_file = 'alice.txt'

with urllib.request.urlopen(URL) as url:
    with open(local_file, 'bw') as file:
        shutil.copyfileobj(url, file)
</pre>

**NB:** The destination file is opened with the mode <code>'bw'</code>, which makes the file writable in **binary mode** as opposed to **text mode**.  This is because the data to be written is of type <code>bytes</code> rather than <code>str</code>.

Try removing the 'b' and see what happens.
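
If you would like to experiment without a network connection, the sketch below runs the same copy offline. It assumes that an <code>io.BytesIO</code> object is a fair stand-in for what <code>urlopen()</code> returns, since both are file-like objects whose <code>read()</code> gives back <code>bytes</code>. The payload and file name are invented for the example.

```python
import io
import os
import shutil
import tempfile

# Stand-in for the bytes a web server would send.
payload = b'Down the Rabbit-Hole\n'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'copy.txt')

    # Binary mode: the bytes are written to the file unchanged.
    with open(path, 'bw') as file:
        shutil.copyfileobj(io.BytesIO(payload), file)
    with open(path, 'br') as file:
        copied = file.read()

    # Text mode: write() expects str, so handing it bytes raises TypeError.
    try:
        with open(path, 'w') as file:
            shutil.copyfileobj(io.BytesIO(payload), file)
        text_mode_error = None
    except TypeError as err:
        text_mode_error = err

print(copied == payload)
print(type(text_mode_error).__name__)
```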

# A <code>wget</code> program

Write a program <code>wget.py</code> so that the invocation
<pre>
python wget.py http://www.cs.wm.edu/~rml/teaching/csci141/jupyter/alice.txt foo.txt
</pre>
will copy the file with the URL <code>http://www.cs.wm.edu/~rml/teaching/csci141/jupyter/alice.txt</code> to a file named <code>foo.txt</code>.  You will need to retrieve the command line arguments via <code>sys.argv</code>.

## What to upload

Upload your file <code>wget.py</code> via Blackboard.