# Computer Programming

## Programs 7: File Wrangling

For the final two practicals, we are moving away from Jupyter Notebooks to writing programs that run "on their own". Notebooks are great for small pieces of code, and are widely used in Data Science, but we want to end by looking at something more general.

This week we will create some programs that make use of data in files. We will read data from files, and write results back.

_Hint: There are examples of file use in the program library that came with your module repo. That would be an excellent place to start._

_As usual, once finished, make sure this Notebook ends up in your GitHub repo._

## File Locations

Obviously in order to use a file we need to know where it is on the computer!

A file's location consists of its name, and the name of the folder (directory) in which it is stored. So examples might be (Linux/Mac first, then Windows):

    /home/users/voldemort/horcruxes.txt
    C:\Users\volde\Documents\horcruxes.txt

A program can either process a file from its full location, or from just its name **if it is in the same location as the program**.

For these exercises you can put your data files in the same folder as your Notebook. But this is untidy, so we recommend creating a new folder called something like ``data`` _below_ your Notebooks folder, and then referencing files there:

    data/useful_data.txt
    data\useful_data.txt

Since this can be error-prone, always make sure you can open the file before starting on the program that will use it!
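
One way to run that check from Python itself is sketched below. It is only an illustration: it assumes the ``data`` folder and the ``useful_data.txt`` name from the example above, and it uses the standard ``pathlib`` module so that the same line works with either style of separator. Note that a relative location is resolved against the folder Python is currently running in, which is usually, but not always, the folder holding the program.

```python
from pathlib import Path

# Path joins folder and file names with the right separator for this machine
data_file = Path("data") / "useful_data.txt"

# Relative locations are resolved against the current working directory
print("Python is running in:", Path.cwd())
print("Full location:", Path.cwd() / data_file)
print("Is it there?", data_file.is_file())
```

If the last line prints ``False``, compare the full location with where you actually saved the file before going any further.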

## Practice

Answer the following before trying the programs below. Where you write code, make sure it runs without errors (or with the errors you expect!).

Remember that there are two styles of file-handling. (Basically one uses ``with`` for simplicity.) The questions below ask about both, but for the programs, stick with, ah, ``with``.

_Create a code block below, with the line that would open a file called ``data.txt`` that is located **in the same folder as the program**. Don't use ``with``._

_Create a code block below, with the line that would open a file called ``data.txt`` that is located **in a folder called ``data`` inside the folder containing the program**. Don't use ``with``._

_Finally, create a code block below, with the line that would open a file called ``data.txt`` that is located **in a folder called ``data`` above the folder containing the program**. Don't use ``with``._

_What Exception would be thrown if opening the file fails because the file cannot be found?_

_What about the Exception if the file does exist, but the user does not have access rights to use it?_

_What changes in the code if we want to write the file? Create that line in a code block below._

_If a file is opened for writing, and does not exist, is that an error? What happens?_

_If a file is opened for writing, and already exists, what happens to the existing contents of the file?_

_And add below a code block with the code to add to the end of a file when opening it._

_To complete the circle, create a code block and add the line to close a file after it has been written._

_Create a code block below, and use ``with`` to open a file, read all the lines, and display the number of lines in the file._

_Hint: This is the Unix ``wc`` command. Look in your program library._

_What is the difference between the ``read``, ``readline``, and ``readlines`` methods when reading from a file? What data types do you get from each?_

_The method to write to a file is (excitingly) called ``write``. How do we add a new line to a file using this?_

_If using ``with`` what needs to be done in order to close a file after it has been used?_

## Programs

Now complete these. Use ``with``, as that is the modern way.

Remember to be very sure where you have stored your data files!

_Write a program that simply tells you whether a file exists and can be opened. Just assign the name of the file to a string variable. (The easiest way to do this is just to try and open it, and report whether or not an Exception happens.)_

_Now modify that code to display how many characters there are in the file, assuming it exists. Just display that it cannot be opened otherwise._

_Copy it below, and modify the code again so that it reports how many lines there are in the file. (This should be a very, very small change.)_

_Make another file, and populate it with some integers, one on each line. Remember to change the file name in your code, and add below a program to print the total of all the numbers in the file. Assume a Happy Path (that is, there is one number on each line, and it really is a number)._

_Hint: Start by copying your code from above. You don't need to change much. Remember that your data will be read as strings, so some conversion will be needed._
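
As a small illustration of that conversion (using a made-up string rather than a real file), ``int`` accepts surrounding whitespace, including the newline that ends each line read from a file:

```python
line = "42\n"            # what one line read from a file typically looks like

print(line + line)       # strings are joined end to end, not added
number = int(line)       # int() ignores the surrounding whitespace
print(number + number)   # now it is arithmetic
```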

_Now use the ``random`` module to create a file containing 1000 random numbers between 0 and 100 inclusive._

_If the numbers are really random, the average of the numbers in your file should be round about 50. Write a program below that reads that file, and tells you what the average is. (Remember that you will need the output file of the previous code block here.)_

_Write a program that is intended to process a file containing just numbers. Have it report if there is a line in the file that does not contain an integer. If all is OK, do nothing._

_Hint: One of the programs above will almost do this. Edit your file of random numbers so that one line has letters on it. See what happens. Then just trap this._

_Now take your random number file, and use it to create another file of random numbers, this time in sorted order. Call the new file anything you like. Assume a Happy Path._

_Modify your program so that it sorts the same file. That is, the output sorted file overwrites the original file. This **should** be a trivial change._

## Challenge

_A trainspotter has a file of all the numbers of all the locomotives they have seen. It looks something like:_

    50321
    47362
    78919
    50321

_Only it is much longer._

_The first two digits denote the "class" of the locomotive. Write a program that counts how many locomotives of each class our trainspotter has recorded, and displays the four most common. Obviously the same locomotive may have been spotted more than once (so will be in the file more than once) but should only be counted once._

_Assume the file is in the format above. In fact, assume it exists too._

_You want a hint. Enter the following into Python, and see what is in the ``counts`` thingy. Specifically, look at what's in ``counts.most_common()``._

    >>> letters = ['a', 'a', 'b', 'c', 'c', 'c', 'c']
    >>> from collections import Counter
    >>> counts = Counter(letters)
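
If you would rather run the hint as a script than type it at the prompt, it might look like the sketch below. The only addition is that ``most_common`` also accepts an optional number, which limits how many entries come back.

```python
from collections import Counter

letters = ['a', 'a', 'b', 'c', 'c', 'c', 'c']
counts = Counter(letters)

print(counts)                  # every item with the number of times it appeared
print(counts.most_common())    # (item, count) pairs, most frequent first
print(counts.most_common(2))   # just the top two
```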



## Reflection

The last program here showed how a complex problem can be solved by using some built-in features in Python. We _could_ have done that with a dictionary, but it was much easier with a ``Counter`` from the ``collections`` module.

The mindset to have is to keep thinking "There **must** be a way to do that." Then look at the docs, and StackOverflow, and you'll probably find it.

These programs were all a little artificial in that we had to hard-code the names of the files they used. Next, and finally, we'll see how to sort that.
