# File Formats
## Try me
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ffraile/computer_science_tutorials/blob/main/source/Data%20Manipulation/tutorials/Files.ipynb)[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ffraile/computer_science_tutorials/main?labpath=source%2FData%20Manipulation%2Ftutorials%2FFiles.ipynb)

## Introduction
Before we dive into data processing, let us discuss some common file formats used to store data, set the basic terminology and describe the main steps involved when dealing with data files in computer programming.

### Basic explanation of how Python reads files
In the end, a file is just a collection of bytes containing information for a specific purpose. In this Notebook, we will address different common file formats that contain information represented as **text**. Text files are composed of characters and organized in **lines**. In storage, characters need to be **encoded** into bytes. This process is called character encoding, and each file may use a different character encoding, although your operating system defines a default encoding that is used when none is specified.
Line breaks are stored using a special character (the newline character, written ```"\n"``` in Python). The end of the file, in contrast, is not a stored character: the operating system knows the size of the file and reports when there is nothing left to read.
So basically, when reading a file in Python, we read the contents line by line until the end of the file is reached. But before we are able to read a file, we first need to open it.
Another important
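
To make the idea of character encoding concrete, the short sketch below encodes a string into bytes and back. The variable names and the sample text are illustrative only, and UTF-8 and Latin-1 are simply two widely used encodings picked for the example.

```python
# Illustrative sketch: how text becomes bytes (names and sample text are made up)
text = "Hello,\nBest luck!"

# Encoding turns characters into bytes; the line break is the single character "\n"
data = text.encode("utf-8")
print(data)                   # b'Hello,\nBest luck!'
print(len(text), len(data))   # 17 17 (ASCII characters take one byte each in UTF-8)

# The same character can map to different bytes depending on the encoding
utf8_bytes = "ñ".encode("utf-8")
latin1_bytes = "ñ".encode("latin-1")
print(utf8_bytes)             # b'\xc3\xb1'
print(latin1_bytes)           # b'\xf1'

# Decoding with the matching encoding recovers the original text
decoded = data.decode("utf-8")
print(decoded == text)        # True
```

When opening a file, the encoding can be passed explicitly, for example with `open("example.txt", encoding="utf-8")`, instead of relying on the platform default.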

#### Opening files
After this brief explanation, and without further ado, let's start practicing. Copy the content of the next cell into a plain text editor (like Notepad or TextEdit) and save it in a file named example.txt.

```text
Hello,
This is the first file to try in Python.
Best luck!
```
Once you have saved it, you need to import it into your Python runtime. If you have opened this Notebook in Colab, open the lateral menu *Files* (the one with the folder 📁 icon), and either drag and drop the file into the area where the files and folders of your runtime are listed, or click on the upload button.

![Import file in colabs](img/colabs_import.png)

> ☝ Note that you can also connect your Google Drive folder to your runtime and use any file you have stored in there!

Once you have uploaded the file (and the example.txt file is available in the file system of your Python runtime, as in the figure), you are ready to test the following cell:


In [1]:
f = open("example.txt")
line = f.readline() # read one line
while line: # if line is an empty string, this will evaluate to false
    print(line)
    line = f.readline() # read a new line again
f.close() # Close the file

Hello,

This is the first file to try in Python.

Best luck!



Note that we used the built-in function ```open()``` to open the file. Its first argument is the location of the file you want to open, either **relative** to the working directory of your Python script, or **absolute**, starting from the root directory of your file system. You need to have permission to read the file in your file system; otherwise this line will raise an error. Let us stop here for a minute to make these concepts clear.

Imagine that we are working on a Unix-based file system (such as in macOS or Google Colab) and the working directory of our Python script is a folder called content located in the root folder (the folder that contains all files in the system). Imagine we want to open a file called example.txt, which is stored in a folder called example1 within the content folder, that is, our target file is organised as:
```terminal
content
|-->example1
    |--> example.txt
...
```
Since the working directory is content, the path to the file relative to the working directory is:
```python
f = open('example1/example.txt')
```
We can also use an absolute path to the file from the root folder, as:

```python
f = open('/content/example1/example.txt')
```
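
As a small sketch of the difference between both kinds of paths, the snippet below uses the standard ```pathlib``` module to build an absolute path from the working directory and a relative path. The folder and file names mirror the example above and are hypothetical; the snippet does not require the file to exist.

```python
from pathlib import Path

# Hypothetical relative path, matching the folder layout sketched above
relative_path = Path("example1") / "example.txt"

# Relative paths are interpreted starting from the working directory
working_dir = Path.cwd()
absolute_path = working_dir / relative_path

print(working_dir)                    # depends on where the script runs
print(relative_path.is_absolute())    # False
print(absolute_path.is_absolute())    # True
```

Printing the working directory is often the quickest way to find out why a relative path is not found.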

#### Reading lines
The ```open()``` function returns a file object (assigned to variable ```f``` in the example), which has a ```readline()``` method that returns a string with the content of the next line (the first line after calling open and subsequent lines thereafter). When the end of the file is reached, an empty string is returned. In the example, we assign the result to the variable ```line``` in a while loop. Since an empty string evaluates to false, the example prints the file line by line and exits the loop when the end of the file is reached.

Finally, we use the method ```close()``` to close the file. Closing the file writes any pending changes to disk and releases the resources that the operating system reserved for it, which helps avoid inconsistencies when the same file is used from several places (more on this below).
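
The blank lines in the output above appear because every line read from the file keeps its trailing newline character and ```print()``` adds another one. A minimal sketch of an alternative is shown below: it iterates over the file object directly, which yields one line per iteration, and strips the newline before printing. The file name reading_demo.txt and the temporary folder are assumptions made so that the snippet is self-contained.

```python
import os
import tempfile

# Create a small sample file in a temporary folder (hypothetical name, for illustration)
path = os.path.join(tempfile.mkdtemp(), "reading_demo.txt")
f = open(path, "w")
f.write("Hello,\nThis is the first file to try in Python.\nBest luck!\n")
f.close()

f = open(path)
lines = []
for line in f:                        # a file object yields one line per iteration
    lines.append(line.rstrip("\n"))   # drop the newline stored at the end of the line
f.close()

for line in lines:
    print(line)                       # no extra blank lines this time
```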

In some examples, you may find that the file is opened using the keyword ```with```, as in the following example:


In [3]:
with open("example.txt") as f:
    line = f.readline() # read one line
    while line: # if line is an empty string, this will evaluate to false
        print(line)
        line = f.readline() # read a new line again
# No explicit close() is needed: the with block closes the file automatically

Hello,

This is the first file to try in Python.

Best luck!



The ```with``` statement assigns the result of the ```open()``` function to the variable f and closes the file automatically as soon as the indented block below it finishes, even if an error occurs inside the block. This ensures that the file is kept open only while it is needed, without having to remember to call ```close()```.
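
The following sketch checks this behaviour through the ```closed``` attribute of the file object. The file name with_demo.txt and the temporary folder are hypothetical and only used to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, "w") as f:
    f.write("Hello,\n")
    closed_inside = f.closed   # the file is still open inside the block

closed_after = f.closed        # the with block has closed it on exit

print(closed_inside)  # False
print(closed_after)   # True
```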

#### Modes
The ```open()``` function has some additional arguments worth highlighting. One of them is the opening mode. This argument states explicitly what we intend to do with the file in our program, which adds a layer of safety: for instance, we cannot accidentally write to a file that we opened only for reading. The opening mode is specified using the characters in the table below, extracted from the official [Python documentation](https://docs.python.org/3/library/functions.html#open):

| Character | Meaning                                                         |
|-----------|-----------------------------------------------------------------|
| 'r'       | open for reading (default)                                      |
| 'w'       | open for writing, truncating the file first                     |
| 'x'       | open for exclusive creation, failing if the file already exists |
| 'a'       | open for writing, appending to the end of file if it exists     |
| 'b'       | binary mode                                                     |
| 't'       | text mode (default)                                             |
| '+'       | open for updating (reading and writing)                         |

#### Writing to a file
By default, files are opened with mode 'rt' (or 'r', which is equivalent), so we can only read from the file and cannot write to it. The mode 'w' allows us to write to the file using the ```write()``` method, but first it *truncates* the file, meaning that in practice we will overwrite its contents. If we do not want to overwrite the contents of the file, we can either use mode 'a' (to append content after the last line of the file), or mode 'r+', which opens the file for both reading and writing starting from the beginning, without truncating it.
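
The effect of the most common modes can be sketched as follows. The file name modes_demo.txt and the temporary folder are assumptions made for the example, so that it does not touch any of your own files.

```python
import os
import tempfile

# Hypothetical file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:   # 'w' creates the file
    f.write("first\n")
with open(path, "w") as f:   # 'w' again truncates it: "first" is lost
    f.write("second\n")
with open(path, "a") as f:   # 'a' appends after the existing content
    f.write("third\n")
with open(path, "r") as f:   # 'r' only allows reading
    content = f.read()
print(repr(content))         # 'second\nthird\n'

# 'x' refuses to open a file that already exists
try:
    with open(path, "x") as f:
        f.write("never written\n")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True
print(exclusive_failed)      # True
```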

In the example below, we write a small program to write a shopping list into a file using the input provided by the user:

In [None]:
with open("list.txt", 'a') as f:
  while True:
    line = input("Write something to append to the list or press Enter to exit: ")
    if line:
      f.write(line + "\n")
    else:
      break  # leaving the with block closes the file automatically

Note that we appended the special character ```"\n"``` to the string passed to the ```write()``` method, so that each entry of the list is written on a new line.




## Common file formats for tabular data
### JSON
JSON stands for JavaScript Object Notation, since it is the notation used to define objects in JavaScript, another popular programming language. However, you are already familiar with it, as previously described when covering iterables: it is very similar to the notation used to define [dictionaries in Python](https://programming.engineeringcodehub.com/en/latest/Introduction/tutorials/Iterable%20Objects%20II.html#Dictionaries). JSON files use the .json extension. For instance, copy the following contents into a file named example.json:

```json
[
{
  "date": "2022-08-31",
  "time": "00:15",
  "temperature": 25.5,
  "humidity": 65
},
{
  "date": "2022-08-31",
  "time": "00:30",
  "temperature": 25.6,
  "humidity": 66
},
{
  "date": "2022-08-31",
  "time": "00:45",
  "temperature": 25.7,
  "humidity": 67
},
{
  "date": "2022-08-31",
  "time": "01:00",
  "temperature": 25.6,
  "humidity": 66
},
{
  "date": "2022-08-31",
  "time": "01:15",
  "temperature": 25.5,
  "humidity": 65
}
]
```
Two important differences between Python dictionaries and the JSON file format need to be accounted for. First, strings must always use double quotation marks, as single quotation marks are not allowed. Second, unlike Python dictionary keys, which can also be numbers or other hashable objects, JSON field names must always be double-quoted strings.
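
These differences can be checked with the standard ```json``` module, as in the sketch below. The sample strings are made up for the example. Note also that JSON spells the boolean and null values as true, false and null, which are mapped to True, False and None in Python.

```python
import json

# Valid JSON: double quotes everywhere, lowercase true/false/null
valid_text = '{"temperature": 25.5, "raining": false, "note": null}'
parsed = json.loads(valid_text)
print(parsed)  # {'temperature': 25.5, 'raining': False, 'note': None}

# A Python-style literal with single quotes is not valid JSON
invalid_text = "{'temperature': 25.5}"
try:
    json.loads(invalid_text)
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False
print(single_quotes_accepted)  # False

# Non-string keys are converted to strings when writing JSON
dumped = json.dumps({1: "one", "ok": True})
print(dumped)  # {"1": "one", "ok": true}
```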



#### The json Library
The [json](https://docs.python.org/3/library/json.html) library is a useful library to read JSON data into Python objects and to dump the contents of Python objects, such as dictionaries, into a JSON file. The function ```load()``` takes as argument a file object created by opening a JSON file and returns an object with the contents of the file (a list of dictionaries in our case, since the top-level element of the example is an array). For instance, save the JSON example above in a file named example.json and try the code snippet below:

In [16]:
import json
with open('example.json') as example_file:  # the file is closed when the block ends
    my_example_dict = json.load(example_file)  # a list with one dictionary per reading
print(my_example_dict[0]["temperature"])

25.5


Similarly, we can save the contents of a dictionary into a JSON file using the function ```dump()```:

In [19]:
my_dict = {"name": "Wilson", "surname": "Fisk", "age": 52}
my_file = open('kingpin.json', 'w')
json.dump(my_dict,my_file)
my_file.close()
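
As a quick check, the sketch below writes the same dictionary and reads it back to confirm that nothing is lost in the round trip. The file name kingpin_demo.json and the temporary folder are hypothetical; the optional ```indent``` argument of ```dump()``` only makes the file easier to read for humans.

```python
import json
import os
import tempfile

my_dict = {"name": "Wilson", "surname": "Fisk", "age": 52}

# Hypothetical file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "kingpin_demo.json")

with open(path, "w") as my_file:
    json.dump(my_dict, my_file, indent=2)   # indent=2 pretty-prints the output

with open(path) as my_file:
    restored = json.load(my_file)

print(restored == my_dict)  # True
```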

### Comma separated values
The simplest file format for tabular data which is still widely used nowadays is called CSV (Comma Separated Values). Just as its name indicates, in CSV files, each line represents a different row or record, and the values corresponding to each column are separated by commas, for instance:

```csv
DATE, TIME, TEMPERATURE, HUMIDITY
2022-08-31, 00:15, 25.5, 65
2022-08-31, 00:30, 25.7, 66
2022-08-31, 00:45, 25.9, 67
2022-08-31, 01:00, 25.7, 66
2022-08-31, 01:15, 25.5, 65
```
In this example, the data collects temperature and humidity readings in four columns: a date column containing the date of the reading, a time column containing the time of the reading, and the temperature and humidity values read. Normally, CSV files use the .csv file extension.
Note that the first row is a header row, used to facilitate the use of the file by humans.

Now, using the skills you gained from the previous section, can you make a Python script to read this file and calculate the average temperature? Save the contents of the CSV example above in a file named exercise1.csv and use the previous examples to read its contents.

Although the name states that fields are separated by commas, you can use other field separators. Another common field separator is the tabulation, as in the following example:

```csv
DATE TIME TEMPERATURE HUMIDITY
2022-08-31 00:15 25.5 65
2022-08-31 00:30 25.7 66
2022-08-31 00:45 25.9 67
2022-08-31 01:00 25.7 66
2022-08-31 01:15 25.5 65
```
Note that instead of commas, we have used tabulations (special character ```"\t"```). This type of CSV file usually uses the extension .tab and is convenient because you can just copy or drag and drop content from applications like Google Sheets or Excel into a text editor to create a tab file.
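
Besides splitting each line by hand, Python's standard library includes a ```csv``` module that handles the field separator for us. The sketch below is one possible approach, not the only one: it reads the comma-separated example from an in-memory string (through ```io.StringIO```, so no file is needed) and a shortened tab-separated sample. The option ```skipinitialspace=True``` ignores the space that follows each comma in the example.

```python
import csv
import io

comma_text = """DATE, TIME, TEMPERATURE, HUMIDITY
2022-08-31, 00:15, 25.5, 65
2022-08-31, 00:30, 25.7, 66
2022-08-31, 00:45, 25.9, 67
2022-08-31, 01:00, 25.7, 66
2022-08-31, 01:15, 25.5, 65
"""

# DictReader uses the header row as keys for each record
reader = csv.DictReader(io.StringIO(comma_text), skipinitialspace=True)
temperatures = [float(row["TEMPERATURE"]) for row in reader]
average = sum(temperatures) / len(temperatures)
print(round(average, 2))  # 25.66

# For tab-separated content, only the delimiter changes
tab_text = "DATE\tTIME\tTEMPERATURE\tHUMIDITY\n2022-08-31\t00:15\t25.5\t65\n"
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
print(tab_rows[1])  # ['2022-08-31', '00:15', '25.5', '65']
```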

### CSV files with Numpy
Luckily for us, Numpy provides methods to load data from a CSV file into an array and to write arrays to csv files.

The function [loadtxt](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) lets us load data from a CSV file into a Numpy array, for example:

In [5]:
import numpy as np

my_arr = np.loadtxt('exercise1.csv', delimiter=',', skiprows=1, usecols=(2, 3))
print(my_arr.mean(axis=0))

[25.66 65.8 ]


In the example, we loaded the CSV file described above and indicated that the field delimiter is a comma using the named argument ```delimiter```. We also ignored the header using the ```skiprows``` named argument, specifying that we want to skip exactly one row. Finally, since we are only interested in the temperature and humidity readings (the only columns containing numerical values), we use the named argument ```usecols``` to only load the data in columns 2 and 3 (yeah, you guessed it, column indexing starts at 0). The result is an array with two columns, so we can, for instance, calculate the mean temperature and humidity using the ```mean()``` method with axis 0, which averages over the rows and returns one value per column.
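
To see what the loaded array looks like, the sketch below repeats the call on an in-memory copy of the data (```loadtxt``` also accepts file-like objects such as ```io.StringIO```). The copy omits the spaces after the commas, and the variable names are illustrative.

```python
import io
import numpy as np

csv_text = """DATE,TIME,TEMPERATURE,HUMIDITY
2022-08-31,00:15,25.5,65
2022-08-31,00:30,25.7,66
2022-08-31,00:45,25.9,67
2022-08-31,01:00,25.7,66
2022-08-31,01:15,25.5,65
"""

readings = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=(2, 3))
print(readings.shape)          # (5, 2): five rows, two columns

# axis=0 collapses the rows, leaving one mean per column
column_means = readings.mean(axis=0)
print(column_means)

# Slicing selects a single column, here the temperature
temperatures = readings[:, 0]
print(temperatures.max())      # 25.9
```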

Conversely, the function [savetxt](https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html) saves the values of a Numpy array into a CSV file:

In [14]:
my_arr = np.arange(1,9)
my_arr = my_arr.reshape((2,4))
print(my_arr)
np.savetxt('my_array.csv', my_arr, delimiter=",", fmt='%i')

[[1 2 3 4]
 [5 6 7 8]]


This will create a csv file named 'my_array.csv' in the working directory, containing the contents of the ```my_arr``` Numpy array, using commas as field delimiter or separator, and formatting numbers as integers.
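
If the file should also carry a header row, ```savetxt``` accepts the ```header``` and ```comments``` arguments. The sketch below is a minimal example under the assumption that the columns are simply named a, b, c and d. It writes to a hypothetical file in a temporary folder and reads it back with ```loadtxt``` to check the round trip.

```python
import os
import tempfile
import numpy as np

my_arr = np.arange(1, 9).reshape((2, 4))

# Hypothetical file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "my_array_demo.csv")

# comments="" avoids the default "# " prefix in front of the header line
np.savetxt(path, my_arr, delimiter=",", fmt="%i", header="a,b,c,d", comments="")

with open(path) as f:
    file_text = f.read()
print(file_text)

# Skip the header row when loading the data back
restored = np.loadtxt(path, delimiter=",", skiprows=1, dtype=int)
print(np.array_equal(restored, my_arr))  # True
```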

### IoT Challenge: Data Logging in Python
In this activity, we will extend the Python script of the IoT challenge template [available here](iiot_challenge/templates/python_part/script_template.py) to log data received from the Arduino to a CSV file. This data can be analyzed later for trends, helping with decision-making in your application.

We are going to implement this data logging functionality as an additional option called continuous mode. The image below illustrates how the continuous mode works:

![Continuous mode](./img/Serial_Communication_Continuous_Mode.png)

Basically, we will send commands to the Arduino device to read sensor data, and whenever we collect new data, we will print it to the console and also put it in a dictionary that we will later save into a CSV file. In this mode, we will repeat these steps in an endless loop, so information is sent back and forth over the serial port without requiring input from the user. All the information will be stored in the file, and we will be able to use this data for analysis.
This mode has the advantage that we can collect data from the device without sending a user command every time we want to read the sensors, which is useful when we want to collect data over a long period of time.

### Example code
Here's the complete example code. Note that we have used the same sensor data as in the Arduino part example. Make sure you adapt to your application!




In [None]:
import serial
import time
import random

# Initialize the port variable.
# If you already know the serial port where the Arduino board is connected,
# you need to assign it to this variable, replacing None with the actual name of the port.
# For instance, if your Arduino board is connected to port COM7, you can use the line below
# port = 'COM7'
# TODO: Find out the name of the port you use to connect to Arduino and update the variable
# definition
port = None

# Initialize serial communications. Set the baud rate to 9600 bps.
if port is not None:
    arduino = serial.Serial(port, 9600, timeout=1)

simulation_mode = True # Set this variable to True to simulate the data
if port is not None or simulation_mode:
    print(" _      __    __")
    print("| | /| / /__ / /______  __ _  ___")
    print("| |/ |/ / -_) / __/ _ \\/  ' \\/ -_)")
    print("|__/|__/\\__/_/\\__/\\___/_/_/_/\\__/")

    print("Welcome to the Arduino control panel")
    print("You can use the following commands:")
    print("1. Read humidity")
    print("2. Read temperature")
    print("3. Read humidity and temperature")
    print("4. Read soil moisture")
    print("5. read all in continuous mode")
    print("Press Ctrl+C to exit")
    while True:

        command = input("Enter command: ")
        if command in ['1', '2', '3', '4'] and not simulation_mode:
            signal = command.encode('utf-8') # Convert the command to a binary string
            arduino.write(signal) # Send the command to the device
            arduino.flush() # Wait until the command is sent
            raw_data = arduino.readline() # Read the data from the device
            print(raw_data) # Print the response from the device
        elif command == '5':
            print("Entering continuous mode. Press Ctrl+C to exit")
            while True:
                time.sleep(1) # Wait for 1 second
                # Let's put the data in a dictionary.
                data = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}

                if simulation_mode: # If we are in testing mode, we will simulate the data
                    data["humidity"] = random.randint(0, 100)
                    data["temperature"] = random.randint(0, 100)
                    data["soil_moisture"] = random.randint(0, 1000)
                else: # If we are not in testing mode, we will read the data from the device
                    # First send a signal to read humidity and temperature
                    signal = b'3'
                    arduino.write(signal) # Send the command to the device
                    arduino.flush() # Wait until the command is sent
                    raw_data = arduino.readline() # Read the data from the device

                    #Incoming data is in the format b'Humidity: 50.00 % Temperature: 23.00 \n'
                    # We need to split the string into a list of strings
                    raw_data = raw_data.decode('utf-8').strip().split(' ')
                    # Now we need to convert the strings to floats and add them to the dictionary
                    data["humidity"] = float(raw_data[1])
                    data["temperature"] = float(raw_data[4])
                    time.sleep(1) # Wait for 1 second
                    # Now send a signal to read soil moisture
                    signal = b'4'
                    arduino.write(signal) # Send the command to the device
                    arduino.flush() # Wait until the command is sent
                    raw_data = arduino.readline() # Read the data from the device
                    # Incoming data is in the format b'Soil Moisture: 350 \n'
                    # Decode the data and split it into a list of strings
                    raw_data = raw_data.decode('utf-8').strip().split(' ')
                    # Get the soil moisture value and store it in the dictionary
                    data["soil_moisture"] = float(raw_data[2])

                print(data)
                # Now we can save incoming data into the file. We need to open the file in append mode, and check whether the file contains previous data
                with open("data.csv", "a+") as f:
                    # First we need to check whether the file contains previous data
                    f.seek(0) # Move the cursor to the beginning of the file
                    previous_data = f.read() # Read the file
                    if previous_data == "": # If the file is empty, we need to add the header
                        f.write("timestamp,humidity,temperature,soil_moisture\n")
                        # Now we can write the data to the file
                        f.write(f"{data['timestamp']},{data['humidity']},{data['temperature']},{data['soil_moisture']}\n")
                    else: # If the file is not empty, we just need to move the cursor to the end of the file and add the data
                        f.seek(0, 2) # Move the cursor to the end of the file
                        f.write(f"{data['timestamp']},{data['humidity']},{data['temperature']},{data['soil_moisture']}\n")
        else:
            print("Invalid command")

Note that the strategy is very simple: we use device commands to collect sensor data, and we store the sensor values in a dictionary named ```data```. Finally, we open the file in ```a+``` mode so that we can append data. This means that if the file already contains some records we will not overwrite them, but instead, we will append new data to the file.
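
The header check and the row formatting could also be delegated to the standard ```csv``` module. The sketch below is an alternative under stated assumptions, not the code of the template: the helper ```append_reading()```, the file name data_demo.csv and the sample readings are hypothetical. It decides whether to write the header by checking if the file is missing or empty, and then appends one row per reading.

```python
import csv
import os
import tempfile

FIELDS = ["timestamp", "humidity", "temperature", "soil_moisture"]

def append_reading(path, reading):
    """Append one reading to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control the line endings itself
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(reading)

# Hypothetical log file and made-up readings, for illustration only
log_path = os.path.join(tempfile.mkdtemp(), "data_demo.csv")
append_reading(log_path, {"timestamp": "2022-08-31 00:15:00", "humidity": 65,
                          "temperature": 25.5, "soil_moisture": 350})
append_reading(log_path, {"timestamp": "2022-08-31 00:15:01", "humidity": 66,
                          "temperature": 25.7, "soil_moisture": 352})

with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows))               # 2
print(rows[0]["temperature"])  # 25.5
```

Since the file is opened in plain ```'a'``` mode, existing records are never overwritten, and the header is written only once.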

And this is it! We have successfully established a serial connection with the Arduino board, and we have used it to control the device. We have also saved the data in a file!