# Day 2 Part 2 - File Management

Python has many built-in modules and functions for working with files of different types (e.g. `.txt`, `.csv`).

In this short tutorial, you'll learn the basics of navigating directories and working with files of various types in Python, along with general best practices for your coding workflow.

## What is a File?

Most modern files contain three parts:

1. Header - Metadata about the contents of the file (file name, size, type, etc.);
2. Data - The contents written by the author;
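3. End of File (EOF) - A marker that signals the end of the file. Historically this was a special character; modern operating systems usually keep track of the file's length instead.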

What the data represents is generally signalled by the file's extension. For example, `.pdf`, `.txt` and `.py` are all file extensions you will have seen before.



## File Paths

When accessing a file, you need a file path. This is a string representing the location of the file.

It is also broken into three parts:

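1. Folder Path - The location of the folder containing the file, with successive folders separated by a forward slash `/` (Unix) or a backslash `\` (Windows).
2. File Name - The actual name of the file.
3. Extension - The end of the file path, beginning with a period (`.`), which indicates the file type.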
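
As a small illustration of these three parts, the sketch below splits a path using `os.path` from the standard library. The path itself is hypothetical and does not need to exist for this to work.

```python
import os.path

# Hypothetical path, used only for illustration
file_path = "/content/data/hello.txt"

folder_path, file_name = os.path.split(file_path)  # folder vs. final component
stem, extension = os.path.splitext(file_name)      # name vs. extension

print(folder_path)
print(file_name)
print(stem, extension)
```

Note that `os.path.splitext` keeps the leading period as part of the extension.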


## Reading and Writing Files

This is pretty straightforward. The most important thing here is that when we use files in Python, we `open` them in a certain mode. 

The modes are as follows:

1. 'w' - Write Mode. Use this when altering/adding/changing information. This erases the existing file to create a new one. The file pointer is placed at the beginning of the file.

2. 'r' - Read Mode. Use this when reading information and not changing anything. File pointer is placed at the beginning of the file.

3. 'a' - Append Mode. Use this when adding new information to the end of the file. File Pointer is placed at the end of the file.

4. 'r+' - Read/Write Mode. Used when making changes and reading information from a file. File pointer is placed at the beginning of the file.

5. 'a+' - Append/Read Mode. Use this when adding new information to the end of the file and also reading from it. File pointer is placed at the end of the file.

6. 'x' - Exclusive Creation Mode. Exclusively creates a new file. If a file of the same name exists, this will fail.
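
To see a few of these modes side by side, here is a minimal sketch that works inside a temporary folder so no real files are touched. The file name `modes_demo.txt` is made up for the example, and it uses the `with open(...)` form that is discussed just below.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    with open(demo_path, "w") as f:   # 'w' creates the file (or empties an existing one)
        f.write("first line\n")

    with open(demo_path, "a") as f:   # 'a' keeps the contents and adds to the end
        f.write("second line\n")

    with open(demo_path, "r") as f:   # 'r' reads without changing anything
        contents = f.read()

    try:
        open(demo_path, "x")          # 'x' refuses to overwrite an existing file
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

print(contents)
print(exclusive_failed)
```

Opening an existing file in `'x'` mode raises `FileExistsError`, which is what makes it a safe choice when you must not overwrite anything.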

First off, let's create a simple file to work with called `hello.txt`
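
A minimal sketch of that step is shown here. The text written to the file is an assumption made for illustration; any short string would do.

```python
# The file contents are an assumption made for illustration
with open('hello.txt', 'w') as our_file:
    our_file.write('Hello, world!\n')
```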

Note that the path to our file is in a `string` format, with us using the `open` functionality in the write (`w`) mode.

We do not need to use the `with ... as` format; we could simply assign the result of `open` to a variable. However, `with` is a much nicer way to work with files, because the file is closed automatically when the block ends.

The alternative, perhaps more familiar, way is shown below.

In [None]:
our_file = open('hello.txt', 'r')
data = our_file.read()
print(data)
our_file.close()

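In this format, it is important to remember to `.close()` our file, so that the file handle is released back to the operating system (and, when writing, so that any buffered data is actually saved). If you run the following code, you will see that we can no longer read from a closed file; Python raises a `ValueError`: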

In [None]:
our_file.read()

For working with files, we recommend the first `with open(filepath) as x:` approach. The file is closed automatically at the end of the block, even if an error occurs, so you will not keep files open unnecessarily.
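
The automatic closing can be checked directly through the file object's `closed` attribute. The sketch below first writes a small `hello.txt` (with assumed contents) so that it runs on its own.

```python
# Create a small file first so the example is self-contained
with open('hello.txt', 'w') as our_file:
    our_file.write('Hello, world!\n')  # assumed contents, for illustration only

with open('hello.txt', 'r') as our_file:
    data = our_file.read()
    open_inside_block = not our_file.closed   # still open inside the block

closed_after_block = our_file.closed          # closed as soon as the block ends

try:
    our_file.read()
except ValueError as error:
    print(error)   # reading from a closed file raises ValueError

print(open_inside_block, closed_after_block)
```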

## Working with Directories with the `os` Module

A tutorial on the `os` module could be endless. We will focus on some core concepts and commands for interacting with files and directories. For a full run-through, check the [docs](https://docs.python.org/3/library/os.html) or [this tutorial](https://www.geeksforgeeks.org/os-path-module-python/).

In [None]:
import os
from os import path # for direct usage of the path functionality within os

In [None]:
filenames = ['hello.txt', 'goodbye.txt'] #First file exists, the second doesn't
directory = '/content/'

for filename in filenames:
  print(path.exists(path.join(directory,filename))) # Checking if each filename exists
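
Beyond `path.exists`, a handful of other `os` and `os.path` calls cover most everyday needs. The sketch below tries a few of them inside a temporary folder; the folder and file names are hypothetical.

```python
import os
import tempfile
from os import path

with tempfile.TemporaryDirectory() as base_dir:
    # Hypothetical folder and file names, used only for illustration
    new_dir = path.join(base_dir, 'data', 'raw')
    os.makedirs(new_dir, exist_ok=True)      # create nested folders in one call

    file_path = path.join(new_dir, 'notes.txt')
    with open(file_path, 'w') as f:
        f.write('some notes')

    listing = os.listdir(new_dir)            # names of everything inside a folder
    is_file = path.isfile(file_path)         # True only for files
    is_dir = path.isdir(new_dir)             # True only for directories
    size = path.getsize(file_path)           # size in bytes

    os.remove(file_path)                     # delete the file
    still_there = path.exists(file_path)

print(os.getcwd())   # the current working directory
print(listing, is_file, is_dir, size, still_there)
```

Building paths with `path.join` rather than concatenating strings means the correct separator is used on both Unix and Windows.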

## CSV Files

When working with data, most of us are familiar with a 'spreadsheet' layout, where the data is organised into columns and each new 'occurrence' gets its own row.

This is doable in a plain text format by separating the columns with a 'delimiter' character. The clearest example is a 'Comma Separated Values' (`.csv`) file, also known as a comma-delimited file.

CSV files have a fairly simple structure. For example:

```
Name,Email,Age,City
Mark, mark@email.com, 23, Belfast
Bob, bob@email.com, 25, Glasgow
```
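
To make the role of the delimiter concrete, here is a rough sketch that splits the sample above by hand using plain string methods. It is only meant to show the idea: naive splitting breaks as soon as a value itself contains a comma, which is one reason to prefer a dedicated CSV library.

```python
sample = """Name,Email,Age,City
Mark, mark@email.com, 23, Belfast
Bob, bob@email.com, 25, Glasgow"""

rows = []
for line in sample.splitlines():
    # Split on the delimiter, then trim the spaces around each value
    columns = [value.strip() for value in line.split(',')]
    rows.append(columns)

header, records = rows[0], rows[1:]
print(header)
print(records)
```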

These files can easily be imported and manipulated in Python, most commonly through libraries such as `pandas`, which you will work with tomorrow. `pandas` is particularly useful for working with numerical data.

For today, we will use the `csv` library, which is included in the Python standard library, as it is more similar to how we have already worked with files today. It is sufficient for most cases where you do not need heavy numerical analysis.

In [None]:
import csv

Let's create our csv. We will open the file the same as before, and writing to the file will be done by the `DictWriter` functionality within the `csv` library. It does the heavy lifting for us.

In [None]:
with open('PIADS_students.csv', mode='w', newline='') as csv_file: # Same process as before; newline='' avoids blank rows on Windows
    
    fieldnames = ['Name', 'Cohort', 'City'] # Our column names
    
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames) #Establishing our framework for writing to the csv

    writer.writeheader() # Write the first line (column names)

    writer.writerow({'Name': 'Mark', 'Cohort': '2021', 'City': 'Belfast'})
    writer.writerow({'Name': 'Colin', 'Cohort': '2020', 'City': 'Glasgow'})

We have now written to a new csv file and stored data!
Let's now read data from the file to see what we have done.

We will first read from the file using a `reader` provided by the `csv` library.

In [None]:
with open('PIADS_students.csv', mode='r') as csv_file:
  csv_reader = csv.reader(csv_file, delimiter=',')

  for row in csv_reader:
    print(row)

As you can see, we have read our data in easily enough. Let's do some post-processing and make everything look a bit nicer, by separating out our original column names from the rest of the data.

In [None]:
with open('PIADS_students.csv', mode='r') as csv_file:
  csv_reader = csv.reader(csv_file)

  line_count = 0
  for row in csv_reader:
    if line_count==0:
      print(f'Our column names are: {", ".join(row)}')

    else:
      print(f'{row[0]} is a part of the {row[1]} cohort, based in the {row[2]} campus.')
    line_count+=1

This looks a bit nicer!
However, indexing into our data as `row[0]`, `row[1]`, etc. could get a bit messy if we have a lot of columns. The indices begin to lose all meaning and context.

In this case, perhaps it is best to store our data as a dictionary, whereby we can index into the data nicely using the column names!

For this, we will use the `DictReader` functionality from the `csv` library.

In [None]:
with open('PIADS_students.csv', mode='r') as csv_file:
  csv_reader = csv.DictReader(csv_file) # Each row becomes a dictionary keyed by column name

  for row in csv_reader:
    print(f'{row["Name"]} is a part of the {row["Cohort"]} cohort, based in the {row["City"]} campus.')
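
Because each row is a dictionary, it is also easy to collect the rows into a list and filter or convert them by column name. The sketch below recreates the same file first so that it runs on its own; the variable names `students`, `cohort_years` and `belfast_names` are made up for the example.

```python
import csv

# Recreate the file so the example runs on its own
with open('PIADS_students.csv', mode='w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=['Name', 'Cohort', 'City'])
    writer.writeheader()
    writer.writerow({'Name': 'Mark', 'Cohort': '2021', 'City': 'Belfast'})
    writer.writerow({'Name': 'Colin', 'Cohort': '2020', 'City': 'Glasgow'})

with open('PIADS_students.csv', mode='r', newline='') as csv_file:
    students = list(csv.DictReader(csv_file))   # one dictionary per data row

# Every value is read in as a string, so convert before doing arithmetic
cohort_years = [int(student['Cohort']) for student in students]
belfast_names = [s['Name'] for s in students if s['City'] == 'Belfast']

print(cohort_years)
print(belfast_names)
```

The header row is consumed by `DictReader` to build the keys, so it does not appear in `students`.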

Tomorrow, you will use the `pandas` module for working with csv files in more detail!

## `np.loadtxt` and `io`

The Python `io` module provides tools for working with various types of input/output. You can read about it [here](https://docs.python.org/3/library/io.html).

In [None]:
from io import StringIO #StringIO behaves like a file object
import numpy as np
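
To show what "behaves like a file object" means in practice, here is a short sketch that uses the same reading methods on a `StringIO` as we would on an opened text file. The variable name `buffer` and the sample text are made up for the example.

```python
from io import StringIO

buffer = StringIO("first line\nsecond line\n")

first = buffer.readline()        # reads up to and including the first newline
rest = buffer.read()             # reads whatever is left
buffer.seek(0)                   # move the position back to the start
all_lines = buffer.readlines()   # list with one string per line

print(repr(first))
print(repr(rest))
print(all_lines)
```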

In [None]:
string_file_object = StringIO("0 1\n 2 3")

our_array = np.loadtxt(string_file_object)

print(our_array)
print(our_array.shape)

As we can see above, by wrapping a simple string in a file object and loading it with `np.loadtxt`, we have created a 2x2 matrix.

### Only taking certain columns

In [None]:
second_file_object = StringIO("1,2,3\n4,5,6")
matrix = np.loadtxt(second_file_object,
                  delimiter = ',',
                  usecols = (0,2)) # Leaving out middle column

print(matrix)
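
A related option is `skiprows`, which is handy when the first line holds column names rather than numbers. The sketch below assumes a small made-up data set with one header line.

```python
from io import StringIO
import numpy as np

# Hypothetical data with a header line that is not numeric
with_header = StringIO("x,y,z\n1,2,3\n4,5,6")

data = np.loadtxt(with_header, delimiter=',', skiprows=1)  # skip the header row

print(data)
print(data.shape)
```

Without `skiprows=1`, `np.loadtxt` would try to convert the header text to numbers and raise an error.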

## Exercise 1

In the code cells below, write a short program that will read `our_text_file.txt` and extract the data using `np.loadtxt`. Return a generator `our_generator` that iterates through each row of the data and yields the row's mean value as a `float`.

Make sure to use functions, function annotations and comments.

In [None]:
### Run this to create your file
with open("our_text_file.txt","w") as our_file:
  our_file.write("1,2,3\n4,5,6\n7,8,9\n10,11,12")

## Exercise 2

Run the cell below to create a new text file containing lots of words.

Write your own code, using functions, that will read in a text file and compute the most repeated word and letter, along with the number of times each is repeated.

In [None]:
with open("our_exercise_text.txt","w") as our_file:
  our_file.write("This is the first line which will be used for the exercise. ") # Trailing space keeps the sentences from running together
  our_file.write("What will happen when lots of words are used over and over again?")
  our_file.write("\n\n")
  our_file.write("I think we could write things in this box forever!")

In [None]:
#Your code here