# STAT1100 Data Communication and Modelling: Week 4

This week we will be looking at reading, saving, and analyzing data from files. To complete this
lab you will need to download the following files and place them in the same folder you are
running python/jupyter from:

- 04cars.txt
- 04cars.xlsx
- brainsize.csv
- example.stat1100
- heart.json
- iris.xlsx
- mystery_file
- titanic.xml

## Reading Data From Files

The first step in reading data from a file is knowing what type of file it is. Often
we can tell by the file extension, that is, the part of the file name after the dot; e.g. the docx part
of `notes.docx` shows you that it is a Word document. However, sometimes we can't tell, either
because the file has the wrong extension or because it has none at all. This is not a big problem in python though,
as we can use the [python-magic](https://pypi.org/project/python-magic/) library to help us:

In [None]:
%pip install -U python-magic

Using this package in python is reasonably simple: just import it and then call the `from_file` function:

In [None]:
import magic

magic.from_file("mystery_file")
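
To give a feel for how this kind of detection can work, the following is a minimal sketch of the idea. It is not how python-magic is implemented (that library relies on the much larger libmagic signature database); the `SIGNATURES` table and the `guess_type` helper are hypothetical names made up for this illustration, and only a handful of well-known leading byte sequences are checked.

```python
# Hypothetical lookup table: a few well-known leading byte sequences
SIGNATURES = {
    b"PK\x03\x04": "zip archive (also used by .xlsx and .docx)",
    b"%PDF": "PDF document",
    b"\x89PNG": "PNG image",
}


def guess_type(first_bytes):
    """Guess a file type from the first bytes of its content."""
    for signature, label in SIGNATURES.items():
        if first_bytes.startswith(signature):
            return label
    # Text-based formats have no fixed signature, so inspect the first characters
    stripped = first_bytes.lstrip()
    if stripped.startswith(b"<?xml"):
        return "XML text"
    if stripped[:1] in (b"{", b"["):
        return "probably JSON text"
    return "unknown"


print(guess_type(b"PK\x03\x04rest-of-file"))
print(guess_type(b'{"id": 1}'))
```

To try it on a real file, open the file in binary mode (`'rb'`) and pass the first few bytes, e.g. `guess_type(f.read(8))`.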

Once we know the file type, we know what strategy we need to read and translate it into a python-based
object/collection. In the following, we will provide techniques to read common file types that store data.
For many of these there is both a technique specific to the file type and a pandas/numpy based approach.
Either usually works fine: the former allows for more flexibility, while the latter is much easier.

### Text Files (TXT)

Text files are one of the most basic file types, and within a structured file context tend to only
store basic descriptions to accompany other data that holds structure. With text files, we can see the basics
of python file reading. In the following example code, we will read the text file '04cars.txt' and store its
contents in a variable:

In [None]:
description = ""
with open('04cars.txt', 'r') as f:
    for line in f:
        description += line

print(description)

In the line beginning with `with`, we open the file inside a `with` block (a 'context'). This ensures that our file is closed once
the block ends (when the indented section after the colon ends), preventing potential file corruption. This could alternatively
be done without the `with` block: the file would be opened with a statement like `f = open('04cars.txt', 'r')` and
closed with `f.close()`. You may notice that in the `open` statement we first specify the file name and then the 'mode'.
The mode states our use for the file; in this case we are reading the file, so we use the `r` mode.

In the next two lines, we create a loop iterating over each line of the file, and add each line to our `description` string.
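
To compare the two styles described above, here is a small sketch that reads the same file both ways. It writes its own throwaway file (the name `notes.txt` and the temporary folder are just for illustration) so that it does not depend on the lab files, and it uses `f.read()`, which returns the whole file as one string, in place of the line-by-line loop.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the lab files
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Manual version: try/finally guarantees close() runs even if reading fails
f = open(path, "r")
try:
    manual_text = f.read()
finally:
    f.close()

# Equivalent version using a with block
with open(path, "r") as g:
    with_text = g.read()

print(manual_text == with_text)
print(f.closed, g.closed)
```

Both file objects report that they are closed afterwards; the `with` version simply gets there with less code and less room for mistakes.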

### Comma Separated Values (CSV)

The csv file is a simple file type that stores values in a tabulated format by separating columns with commas (sometimes
also a different character like a space or colon), and rows by new lines. For example,

```csv
id,name,age
1,John,20
2,Jane,21
```

Python has a built-in library for reading csv files, which is called `csv`. We can use this library to read 'brainsize.csv'
as follows:

In [None]:
import csv

data = []
with open('brainsize.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)

print(data)

The above example uses the `csv.reader` function to interpret the file as a csv file instead of plain text; this transforms each
row of the file into a list of the values in that row. This allows us to easily construct the 2D list `data`, which
contains the data from the file, where rows are lists of the individual values (each value is still a string).
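
As a small illustration of what `csv.reader` hands back, the sketch below parses the three-line example table from earlier in this section. It uses `io.StringIO` to stand in for an open file, so no file on disk is needed; separating the header row and converting the numeric columns by position are choices made for this example only.

```python
import csv
import io

# The small example table from above, held in memory instead of a file
sample = io.StringIO("id,name,age\n1,John,20\n2,Jane,21\n")

reader = csv.reader(sample)
header = next(reader)  # the first row holds the column names

rows = []
for row in reader:
    # Every value arrives as a string, so convert the numeric columns
    rows.append([int(row[0]), row[1], int(row[2])])

print(header)
print(rows)
```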

Alternatively, both numpy and pandas provide functions to read csv files, placing the data into a numpy array or
pandas dataframe respectively. We show both in the following:

In [None]:
# First we import both libraries, you would only need one or the other
# depending on what collection you want the data in
import numpy as np
import pandas as pd

# numpy csv reading
data = np.genfromtxt('brainsize.csv', delimiter=',')
print(data)  
# Notice that it places nan (not a number) for each of the string values

# pandas csv reading
data = pd.read_csv('brainsize.csv')
print(data)
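
If the `nan` values are a problem, one option (a sketch, assuming the file has a header row like the small example table earlier in this section) is to ask `np.genfromtxt` to read the column names and to choose a type per column. The result is a numpy structured array whose columns are accessed by name.

```python
import io

import numpy as np

# The small example table from above, held in memory instead of a file
sample = io.StringIO("id,name,age\n1,John,20\n2,Jane,21\n")

# names=True reads the column names from the first row;
# dtype=None lets numpy pick a suitable type for each column
table = np.genfromtxt(sample, delimiter=",", names=True, dtype=None, encoding="utf-8")

print(table.dtype.names)
print(table["name"])
print(table["age"].mean())
```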

### JavaScript Object Notation (JSON)

[JSON](https://www.json.org) files are also a simple file type that stores data in a structured format, yet they are more flexible than csv.
JSON originated as the format by which JavaScript objects are represented, and thus has similar flexibility to collections
seen in programming languages. The following is an example of what a JSON file looks like:

```json
{
    "id": 1,
    "name": "John",
    "age": 20,
    "address": {
        "street": "Main Street",
        "city": "London",
        "postcode": "E1 2AB"
    },
    "phone": [
        "0123456789",
        "9876543210"
    ]
}
```

Fortunately for python, the structure of JSON files is equivalent to the structure of python dictionaries and lists. So reading, analyzing and
writing remains relatively simple. The json library built into python loads the files directly into dictionaries, for example:

In [None]:
import json

with open("heart.json", 'rb') as f:
    data = json.load(f)

print(data)

Alternatively, Pandas also provides a function to read json files, which is called `read_json`:

In [None]:
import pandas as pd

data = pd.read_json('heart.json')
print(data)

The pandas function again provides a dataframe representing the data in the json file.
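
To make the dictionary and list correspondence concrete, the following sketch parses the sample JSON shown earlier in this section (not `heart.json`, whose layout may differ) and reaches into the nested values with ordinary indexing.

```python
import json

text = """
{
    "id": 1,
    "name": "John",
    "age": 20,
    "address": {"street": "Main Street", "city": "London", "postcode": "E1 2AB"},
    "phone": ["0123456789", "9876543210"]
}
"""

# loads (with an s) parses a string rather than an open file
person = json.loads(text)

print(type(person))
print(person["address"]["city"])  # nested object becomes a nested dictionary
print(person["phone"][0])         # array becomes a list
```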

### Excel Spreadsheets (XLS/XLSX)

Excel spreadsheets are a much more complex file type which stores data in the tabulated spreadsheet format. As you
may know, these files are created using applications such as Microsoft Excel or LibreOffice. Even inside of those
applications we can perform substantial analysis of the data. However, python also provides libraries to interact with
excel files, giving us more flexibility in both analysis and in saving or converting data collections.

Python does not have any built-in libraries for reading excel files, but we can use the combination of the openpyxl
and pandas libraries to read excel files. First make sure both are installed by running the following:

In [None]:
%pip install -U openpyxl pandas

Next, we can read the excel file '04cars.xlsx' as follows:

In [None]:
import pandas as pd

data = pd.read_excel('04cars.xlsx')
print(data)

Since excel tables are nearly equivalent to pandas dataframes, there is almost no loss of information here, though
it appears we are missing the ability to see different sheets in the file. It is actually possible to read the other sheets;
it just requires a new dataframe for each:

In [None]:
data = pd.read_excel('iris.xlsx', sheet_name='virginica')

### Extensible Markup Language (XML)

Reading xml files is a bit more complicated as they are much more flexible than other structured file types. XML files follow a
tree-based structure, and use tags to define branches in the tree, opened with angle brackets `<tag-name>` and closed with
`</tag-name>`. The tags themselves can hold extra information known as attributes, which are written as `attribute="value"` (as a whole
that would be `<tag-name attribute="value"></tag-name>`). Inside a tag are its 'children', which can be other tags or values.
The following is an example of what an xml file looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<people>
	<person id="1">
		<name>John</name>
		<age>20</age>
	</person>
	<person id="2">
		<name>Jane</name>
		<age>21</age>
	</person>
</people>
```

The above XML represents the following tree structure (diagram does not show the specific values):

![A tree representation of our xml example](xml.png)

The tree structure makes xml files a bit more difficult to read in python. For now we will briefly look at doing so using
python's built-in xml library, but we will also look into reading xml/html in more detail in the future tutorials on
web scraping.

In [None]:
from xml.etree import ElementTree as ET

tree = ET.parse('titanic.xml')
root = tree.getroot()

data = []
for row in root:  
    # There are rows for each passenger
    record = {}
    for item in row:  
        # Each passenger only has data here
        # Take the labelled data and keep both the tag and the value
        record[item.tag] = item.text  
    data.append(record)

print(data)
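
The loop above keeps each child tag and its text but not the attributes on the tags. As a sketch of how attributes can be read too, the following parses the small `<people>` example from earlier in this section (the XML declaration line is left out for brevity) and uses the element's `get` method to fetch the `id` attribute.

```python
from xml.etree import ElementTree as ET

# The example document from above, held in a string instead of a file
text = """
<people>
    <person id="1">
        <name>John</name>
        <age>20</age>
    </person>
    <person id="2">
        <name>Jane</name>
        <age>21</age>
    </person>
</people>
"""

root = ET.fromstring(text)  # fromstring parses text; parse reads a file

people = []
for person in root:
    # get() reads an attribute from the <person> tag itself
    record = {"id": person.get("id")}
    for item in person:
        # Child tags become keys; their text is kept as a string
        record[item.tag] = item.text
    people.append(record)

print(people)
```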

### Exercises

1. Create a dataframe from the file 'brainsize.csv' and print the first 5 rows.
2. Load the file '04cars.xlsx' and print the retail price of the Mini Cooper.
3. Load the file 'iris.xlsx' and print the mean petal length of the Setosa species.
4. Load the file 'heart.json' and print the median age of the patients.

## Saving Data

Once we have read the data, saving it is a simple task. It is just the reverse of reading, and the same libraries are used.
We first need to structure the data in python into a suitable collection/structure for the writing process, then run the
respective library's write/save function. The following code shows examples of writing each of the discussed file types:

In [None]:
import numpy as np
import pandas as pd

# Saving a txt file
with open('example.txt', 'w') as f:
    f.write('Hello World!')

# Using numpy to save a csv file
data = np.array([[1, 2, 3], [4, 5, 6]])
np.savetxt('example.csv', data, delimiter=',')

# Using pandas to save an excel file (saving is a method of the dataframe)
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
data.to_excel('example.xlsx')

# Using pandas to save a json file
data.to_json('example.json')

# Saving an xml file with pandas (the default parser needs the lxml package)
data.to_xml('example.xml')
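
Since saving is the reverse of reading, the built-in `csv` and `json` libraries used earlier can also write files. The following is a minimal sketch: the `records` list, the `people.csv` and `people.json` file names, and the temporary output folder are all made up for this illustration.

```python
import csv
import json
import os
import tempfile

records = [
    {"id": 1, "name": "John", "age": 20},
    {"id": 2, "name": "Jane", "age": 21},
]
out_dir = tempfile.mkdtemp()  # throwaway folder for the example files

# csv: newline="" stops the csv module adding blank lines on Windows
csv_path = os.path.join(out_dir, "people.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()
    writer.writerows(records)

# json: dump is the mirror image of load
json_path = os.path.join(out_dir, "people.json")
with open(json_path, "w") as f:
    json.dump(records, f, indent=4)

# Read both files back to check the round trip
with open(csv_path, newline="") as f:
    csv_rows = list(csv.DictReader(f))
with open(json_path) as f:
    json_rows = json.load(f)

print(csv_rows)
print(json_rows == records)
```

Note that the JSON round trip preserves the numbers, whereas the values read back from the csv file are all strings.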

### Exercises

1. Read `heart.json` and save it into a csv file
2. Read `brainsize.csv` and save it into a json file
3. Take the data from `04cars.xlsx`, `brainsize.csv`, `heart.json` and save them into sheets of a single excel file called `data.xlsx`

## Supplementary Material: What If We Are Looking at a Different File Type?

Fortunately, due to the existence of [PyPI](https://pypi.org/) and pip we can often find and install a library for reading and
writing almost any file type. However, there may be cases where you are dealing with a custom file type. For these you will
need to write your own code to read the characters in the file and construct a collection with the correct
types to make it useful. For example, if we are reading the following custom stat1100 file, `example.stat1100`:

```txt
data: object
    1; name: str=John; age: int=20
    2; name: str=Jane; age: int=21
```

we could write the following code, which is a modified version of the txt file reader:

In [None]:
import builtins

data = []
with open('example.stat1100', 'r') as f:
    for line in f:
        if line.startswith('data:') or not line.strip():
            # Skip the header line and any blank lines, they hold no records
            continue
        else:
            # Current line value created as a dictionary
            line_object = dict()

            # Split the line at the ';' into a list of values
            line_split = line.strip().split(';')

            # id is always the first value
            line_object['id'] = line_split[0].strip()

            # Iterate through the rest of the values
            for value in line_split[1:]:
                # Split the value at the ':' into a key and value
                key, val = value.split(':')

                # Split the value at the '=' into a type and value
                val_type, val_data = val.split('=')

                # Convert value and add (strip removes the surrounding spaces)
                line_object[key.strip()] = getattr(builtins, val_type.strip())(val_data.strip())
            data.append(line_object)
print(data)

In this example, we iterate through the lines in the file and look at their contents; we then convert them accordingly
into values within a dictionary that is stored in our data list. We use a new function here, `getattr`, which gets a function,
module, or variable from the object given as the first argument, looked up by the name given as a string in the second argument. In
this case we take the type named in the file from the builtins module, and use that to convert the value to the correct type.
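
To see `getattr` on its own, the sketch below looks up `int` by name in the `builtins` module. It then shows one possible alternative, which is not part of the reader above: a small dictionary of allowed type names (the hypothetical `CONVERTERS` table and `convert` helper). This avoids looking up arbitrary names, since `builtins` also contains functions such as `eval` that we would not want a data file to trigger.

```python
import builtins

# getattr(module, "name") is the same as writing module.name
to_int = getattr(builtins, "int")
print(to_int("20") + 1)

# Hypothetical alternative: only allow the type names we expect in the file
CONVERTERS = {"str": str, "int": int, "float": float}


def convert(type_name, raw):
    """Convert raw text using a named type, rejecting unknown names."""
    if type_name not in CONVERTERS:
        raise ValueError(f"unsupported type: {type_name}")
    return CONVERTERS[type_name](raw.strip())


print(convert("int", " 21\n"))
print(convert("str", "Jane"))
```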