In [1]:
!wget --no-cache -O init.py -q https://raw.githubusercontent.com/UDEA-Esp-Analitica-y-Ciencia-de-Datos/EACD-01-FUNDAMENTOS/master/init.py
import init; init.init(force_download=False); 

/bin/sh: wget: command not found


# Reading and writing files

One of the most common tasks in data science or machine learning is processing data files or metadata files; metadata is data about the data.

We need to be able to process many file types: plain text, CSV, JSON, XML, YAML, images, and videos are the most common. Eventually we will also need to write files, either as part of a long data-processing pipeline or to export some configuration of our models so that other components of a larger system can read it.

## Plain text

These files may in fact have some format; it simply does not fit any standard. Let's start with the file `local/files/abstract.txt`, which contains the abstract of a famous paper.

In [2]:
f = open("local/files/abstract.txt", "r")
content = f.readlines()
f.close()
content

['Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.\n',
 'The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, 

**What happened?**

1. We used the `open` function to open a file (`"local/files/abstract.txt"`) in read mode (`"r"`).
2. We read all the lines of the file.
3. We closed the file.

Note that `readlines` splits the file into lines, which is why `content` is a list of strings. We could essentially have read it like this:

In [3]:
f = open("local/files/abstract.txt", "r")
content = []
line = f.readline()
while line != "":  # readline() returns "" only at the end of the file
    content.append(line)
    line = f.readline()
f.close()
content

['Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.\n',
 'The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, 
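The loop above still stores every line in `content`, so the whole file ends up in memory anyway. When only an aggregate is needed, a file object can be iterated directly, which keeps a single line around at a time. A minimal sketch, assuming a small demo file (its path and contents are made up for illustration and are not part of the course files):

```python
import os
import tempfile

# Hypothetical demo file, created here so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

n_lines = 0
longest = ""
with open(demo_path, "r") as f:
    for line in f:  # the file object yields one line per iteration
        n_lines += 1
        stripped = line.rstrip("\n")
        if len(stripped) > len(longest):
            longest = stripped

print(n_lines, repr(longest))
```

Only `n_lines` and `longest` survive the loop, so memory use does not grow with the size of the file.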

Reading the file line by line lets us hold only one line in memory at a time (as long as we process each line instead of accumulating them in a list), which matters when the file's contents do not fit in the computer's memory. It is also possible to read the whole file as a single string, without splitting it into lines:

In [4]:
f = open("local/files/abstract.txt", "r")
content = f.read()
f.close()
content

'Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.\nThe depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obt

Note that to turn this into a list of lines, we only need to split the string at every end-of-line character (`"\n"`):

In [5]:
content.split("\n")

['Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.',
 'The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we
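One detail worth checking when splitting by hand: if the text ends with a newline, `split("\n")` leaves an empty string at the end of the list, while `str.splitlines()` does not. A small sketch with a made-up string:

```python
text = "line one\nline two\n"  # illustrative text that ends with a newline

by_split = text.split("\n")
by_splitlines = text.splitlines()
with_endings = text.splitlines(keepends=True)  # same shape as readlines()

print(by_split)       # ['line one', 'line two', '']
print(by_splitlines)  # ['line one', 'line two']
print(with_endings)   # ['line one\n', 'line two\n']
```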

Another way to read files is to use a [context manager](https://www.geeksforgeeks.org/context-manager-in-python/):

In [6]:
with open("local/files/abstract.txt", "r") as f:
    content = f.readlines()
content

['Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.\n',
 'The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, 

This is not just a prettier way of doing the same thing: the `with` block ties the file handle `f` to the block and guarantees that the file is closed when the block exits, even if an exception is raised inside it, so other processes can use the file if needed.
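A minimal sketch of that guarantee, assuming a throwaway demo file (the path is hypothetical): an exception is raised in the middle of the `with` block, and the file is still closed afterwards.

```python
import os
import tempfile

# Hypothetical demo file, created here so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("some text\n")

try:
    with open(demo_path, "r") as f:
        first = f.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass  # the error is handled here; the file was already closed by `with`

print(f.closed)  # True
```

With the explicit `open` / `close` style, the same exception would skip the call to `f.close()` unless it were wrapped in `try` / `finally`.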

The `open` function accepts more parameters, which you can read about in the [official documentation](https://docs.python.org/3/library/functions.html#open). Here we will show a couple more ways of opening files.

**Write mode**

In [7]:
with open("temp_file", "w") as f:
    for i in range(10):
        f.write(f"writing line with number {i}\n")
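Keep in mind that `"w"` truncates the file if it already exists, whereas `"a"` (append mode) adds to the end. A short sketch of the difference, using a temporary file whose name is made up for illustration:

```python
import os
import tempfile

# Hypothetical scratch file used only for this example.
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(demo_path, "w") as f:
    f.write("first\n")

with open(demo_path, "w") as f:  # "w" discards the previous contents
    f.write("second\n")

with open(demo_path, "a") as f:  # "a" keeps them and writes at the end
    f.write("third\n")

with open(demo_path, "r") as f:
    result = f.read()

print(repr(result))  # 'second\nthird\n'
```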

**Reading and writing binary files**

This last one is especially important, because some formats can only be read and written in binary mode.

In [8]:
with open("local/imgs/udea-datascience.png", "r") as f:
    content = f.read()
content

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

In [9]:
with open("local/imgs/udea-datascience.png", "rb") as f:
    content = f.read()
content

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01/\x00\x00\x00\xea\x08\x06\x00\x00\x00\xf3\x8c\xc3k\x00\x00\x00\x04sBIT\x08\x08\x08\x08|\x08d\x88\x00\x00\x00\x19tEXtSoftware\x00gnome-screenshot\xef\x03\xbf>\x00\x00\x00*tEXtCreation Time\x00Tue 09 Jun 2020 06:38:38 -05Ng\xdc\x83\x00\x00 \x00IDATx\x9c\xed\xbd\x7fTSW\xba\xff\xff\x0e\x049Q\xd4\xa0X\x93\x8aJ\xfcI\xa8\xb6\x84\xd1\xb6\xa1\xda\x19cmk\x1c\xef\x8c\xe1\xda{\x85q:\x9a\xda[\x1b\xda\xfbi\xb13\xb7\xc5vu\xf9\xa5\x9du[\xe8\xcc\xb5`\xef\xa7\x85\xf6~\xda\x01;\xb5\xe0\x8c\x96\xd8\xea\x10\xdbj\x83\xb7Zb\x8b%\xfe$\xa8hRA\x13\x05\xcc\x01\x02\xfb\xfbG\xc2/\x05r\x0e$\x84\x03\xfb\xe5\xcaZ\x08\xfb\xec\xf3d\x9fs\xdeg\xefg?\xfb\xd9"B\x08\x01\x85B\xa1\x08\x8c\xb0P\x1b@\xa1P(\x03\x81\x8a\x17\x85B\x11$T\xbc(\x14\x8a \xa1\xe2E\xa1P\x04\t\x15/\n\x85"H\xa8xQ(\x14AB\xc5\x8bB\xa1\x08\x12*^\x14\nE\x90P\xf1\xa2P(\x82\x84\x8a\x17\x85B\x11$T\xbc(\x14\x8a \xa1\xe2E\xa1P\x04\t\x15/\n\x85"H\xa8xQ(\x14AB\xc5\x8bB\xa1\x08\x12*^\x14\nE\x90P\xf1\xa2P(\x82\x84\x8a\x1

In [10]:
with open("temp_file", "w") as f:
    f.write(content)

TypeError: write() argument must be str, not bytes

In [11]:
with open("temp_file", "wb") as f:
    f.write(content)
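The two errors above come from mixing `str` (text) and `bytes` (raw data). A minimal sketch of how they relate, assuming a fake payload that only imitates the 8-byte PNG signature visible at the start of the output above (it is not a real image):

```python
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of a PNG file

# Hypothetical payload: the signature followed by filler bytes.
payload = PNG_SIGNATURE + b"not a real image"
demo_path = os.path.join(tempfile.mkdtemp(), "fake.png")

with open(demo_path, "wb") as f:  # binary write expects bytes
    f.write(payload)

with open(demo_path, "rb") as f:  # binary read returns bytes
    header = f.read(8)

looks_like_png = header == PNG_SIGNATURE
print(looks_like_png)  # True

# Text becomes bytes through an encoding, and bytes become text by decoding.
text = "años"
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(len(text), len(encoded))  # 4 5
```

Text mode performs the decoding step automatically, which is why it fails on bytes such as `0x89` that are not valid UTF-8.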

## CSV

Usually these are very easy to read with the [Pandas](https://pandas.pydata.org/) library, which is especially useful for data analysis. Here, however, we will show an example using the `csv` module.

```
col1,col2,col3,col4,col5
57,20,65,2,20
72,34,8,44,65
60,35,98,49,7
0,49,25,27,70
```

In [12]:
import csv

In [13]:
with open("local/data/demodata.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

OrderedDict([('col1', '57'), ('col2', '20'), ('col3', '65'), ('col4', '2'), ('col5', '20')])
OrderedDict([('col1', '72'), ('col2', '34'), ('col3', '8'), ('col4', '44'), ('col5', '65')])
OrderedDict([('col1', '60'), ('col2', '35'), ('col3', '98'), ('col4', '49'), ('col5', '7')])
OrderedDict([('col1', '0'), ('col2', '49'), ('col3', '25'), ('col4', '27'), ('col5', '70')])
OrderedDict([('col1', '32'), ('col2', '5'), ('col3', '49'), ('col4', '68'), ('col5', '84')])
OrderedDict([('col1', '18'), ('col2', '78'), ('col3', '60'), ('col4', '50'), ('col5', '95')])
OrderedDict([('col1', '26'), ('col2', '59'), ('col3', '20'), ('col4', '70'), ('col5', '96')])
OrderedDict([('col1', '30'), ('col2', '18'), ('col3', '33'), ('col4', '98'), ('col5', '97')])
OrderedDict([('col1', '89'), ('col2', '65'), ('col3', '53'), ('col4', '43'), ('col5', '22')])
OrderedDict([('col1', '46'), ('col2', '6'), ('col3', '65'), ('col4', '66'), ('col5', '91')])


Each row of the CSV file is represented by an `OrderedDict`, which is like a dictionary that preserves the order in which items are added (since Python 3.8, `csv.DictReader` returns plain `dict` objects, which also preserve insertion order). Each key of that dictionary is one of the CSV's column names, and every value is read as a string.
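Because the `csv` module returns every value as a string, numeric columns have to be converted explicitly. A minimal sketch using the first rows of the sample shown earlier, read from an in-memory string instead of a file so that it runs on its own:

```python
import csv
import io

# First rows of the sample CSV, held in memory for illustration.
raw = """col1,col2,col3,col4,col5
57,20,65,2,20
72,34,8,44,65
60,35,98,49,7
0,49,25,27,70
"""

reader = csv.DictReader(io.StringIO(raw))
# Convert every value from str to int while building a list of rows.
rows = [{key: int(value) for key, value in row.items()} for row in reader]

col1_total = sum(row["col1"] for row in rows)
print(rows[0])
print(col1_total)  # 189
```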

In [14]:
data = [
    {"name": "julanito", "grade": 3.7},
    {"name": "peranito", "grade": 3.3}
]
with open("temp.csv", mode="w", newline="") as f:  # newline="" is what the csv docs recommend
    writer = csv.DictWriter(f, fieldnames=["name", "grade"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)

You can read more in [Reading and Writing CSV Files in Python](https://realpython.com/python-csv/).

## JSON

A JSON file is very similar to a Python dictionary or a list of dictionaries, so it is common to write JSON from dictionaries and to read it back as dictionaries.

```json
[
    {
        "name": "julanito",
        "grade": 3.7
    },
    {
        "name": "peranito",
        "grade": 3.3
    }
]
```
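The correspondence can be seen without touching the disk: `json.loads` parses a JSON string into Python objects and `json.dumps` goes the other way. A small sketch using the same two records:

```python
import json

text = '[{"name": "julanito", "grade": 3.7}, {"name": "peranito", "grade": 3.3}]'

students = json.loads(text)  # JSON array of objects -> list of dicts
print(type(students).__name__, students[0]["name"])  # list julanito

round_trip = json.dumps(students, indent=4)  # back to a formatted JSON string
print(round_trip)
```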

In [15]:
import json

In [16]:
with open("local/files/coco_example.json", "r") as f:
    data = json.load(f)
data

{'images': [{'height': 600, 'width': 800, 'id': 1, 'file_name': '1.jpg'}],
 'categories': [{'supercategory': 'date', 'id': 1, 'name': 'date'},
  {'supercategory': 'fig', 'id': 2, 'name': 'fig'},
  {'supercategory': 'hazelnut', 'id': 3, 'name': 'hazelnut'}],
 'annotations': [{'segmentation': [[307.37888198757764,
     99.62111801242236,
     345.2670807453416,
     75.3975155279503,
     348.3726708074534,
     47.4472049689441,
     352.7204968944099,
     35.64596273291925,
     340.91925465838506,
     31.298136645962728,
     330.9813664596273,
     20.11801242236025,
     311.1055900621118,
     13.906832298136646,
     277.5652173913043,
     32.54037267080745,
     266.3850931677018,
     57.38509316770186,
     267.0062111801242,
     77.88198757763975,
     282.53416149068323,
     93.40993788819875]],
   'iscrowd': 0,
   'area': 5186.528297519399,
   'image_id': 1,
   'bbox': [266.0, 13.0, 86.0, 86.0],
   'category_id': 3,
   'id': 13},
  {'segmentation': [[620.4223602484471,
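Once loaded, the data is just nested dictionaries and lists, so it can be navigated with ordinary indexing. A minimal sketch that resolves each annotation's `category_id` and `image_id` into readable names; the dictionary below is an excerpt copied by hand from the output above (with the segmentation polygon omitted), not the full file:

```python
# Hand-copied excerpt of the loaded structure, for illustration only.
coco = {
    "images": [{"height": 600, "width": 800, "id": 1, "file_name": "1.jpg"}],
    "categories": [
        {"supercategory": "date", "id": 1, "name": "date"},
        {"supercategory": "fig", "id": 2, "name": "fig"},
        {"supercategory": "hazelnut", "id": 3, "name": "hazelnut"},
    ],
    "annotations": [
        {"image_id": 1, "bbox": [266.0, 13.0, 86.0, 86.0], "category_id": 3, "id": 13},
    ],
}

# Lookup tables from numeric ids to readable values.
id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}

labels = [
    (id_to_file[ann["image_id"]], id_to_name[ann["category_id"]], ann["bbox"])
    for ann in coco["annotations"]
]
print(labels)  # [('1.jpg', 'hazelnut', [266.0, 13.0, 86.0, 86.0])]
```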


In [17]:
data = [
    {"name": "julanito", "grade": 3.7},
    {"name": "peranito", "grade": 3.3}
]
with open("local/files/temp.json", "w") as f:
    json.dump(data, f)  # json.dump returns None, so do not reassign data

This is the most important thing to remember when working with JSON files. If you want to know more, I suggest reading [Working With JSON Data in Python](https://realpython.com/python-json/).

## XML

Another structure, a bit more complex in my opinion...

```xml
<annotation>
	<folder>GeneratedData_Train</folder>
	<filename>000001.png</filename>
	<path>/my/path/GeneratedData_Train/000001.png</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>224</width>
		<height>224</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>21</name>
		<pose>Frontal</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<occluded>0</occluded>
		<bndbox>
			<xmin>82</xmin>
			<xmax>172</xmax>
			<ymin>88</ymin>
			<ymax>146</ymax>
		</bndbox>
	</object>
</annotation>
```

In [18]:
import xml.etree.ElementTree as ET

In [19]:
tree = ET.parse('local/files/pascalvoc_example.xml')
tree

<xml.etree.ElementTree.ElementTree at 0x7fbcaa5b8490>

This format can be represented as a tree, and we can recursively access each element of the tree. Each element has a `tag` and an `attrib` (a dictionary with its attributes).

In [20]:
root = tree.getroot()

In [21]:
[(c.tag, c.attrib) for c in root]

[('folder', {}),
 ('filename', {}),
 ('path', {}),
 ('source', {}),
 ('size', {}),
 ('segmented', {}),
 ('object', {})]

In [22]:
root.find("object").find("bndbox").find("xmin").text

'82'
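As with CSV, the text of every element is a string, so coordinates need to be converted before doing arithmetic. A minimal sketch that collects the bounding box of each `object` as integers; it parses a trimmed copy of the sample annotation from a string with `ET.fromstring`, so it does not depend on the course file:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample annotation, kept in memory for illustration.
xml_text = """<annotation>
    <object>
        <name>21</name>
        <bndbox>
            <xmin>82</xmin>
            <xmax>172</xmax>
            <ymin>88</ymin>
            <ymax>146</ymax>
        </bndbox>
    </object>
</annotation>"""

root = ET.fromstring(xml_text)

boxes = []
for obj in root.iter("object"):  # visits every <object> element in the tree
    bndbox = obj.find("bndbox")
    box = {child.tag: int(child.text) for child in bndbox}
    boxes.append((obj.findtext("name"), box))

name, box = boxes[0]
width = box["xmax"] - box["xmin"]
height = box["ymax"] - box["ymin"]
print(name, box, width, height)
```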

Another option is to use [xmltodict](https://pypi.org/project/xmltodict/):

In [23]:
import xmltodict

In [24]:
xmltodict.parse(open("local/files/pascalvoc_example.xml", "r").read())

OrderedDict([('annotation',
              OrderedDict([('folder', 'GeneratedData_Train'),
                           ('filename', '000001.png'),
                           ('path', '/my/path/GeneratedData_Train/000001.png'),
                           ('source', OrderedDict([('database', 'Unknown')])),
                           ('size',
                            OrderedDict([('width', '224'),
                                         ('height', '224'),
                                         ('depth', '3')])),
                           ('segmented', '0'),
                           ('object',
                            OrderedDict([('name', '21'),
                                         ('pose', 'Frontal'),
                                         ('truncated', '0'),
                                         ('difficult', '0'),
                                         ('occluded', '0'),
                                         ('bndbox',
                                          O

## YAML

This format is quite similar to JSON, but a bit more readable.

```yaml
name: Tests on PRs

on:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7, 3.8]
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: python -m pip install ".[dev]"
    - name: Test with pytest
      run: pytest --cov=lac --cov-report=term-missing
```

In [25]:
import yaml

In [26]:
with open("local/files/config_example.yml") as f:
    data = yaml.safe_load(f)

In [27]:
data

{'name': 'Tests on PRs',
 True: {'pull_request': None},
 'jobs': {'test': {'runs-on': 'ubuntu-latest',
   'strategy': {'matrix': {'python-version': [3.7, 3.8]}},
   'steps': [{'uses': 'actions/checkout@v2'},
    {'name': 'Set up Python ${{ matrix.python-version }}',
     'uses': 'actions/setup-python@v2',
     'with': {'python-version': '${{ matrix.python-version }}'}},
    {'name': 'Install dependencies', 'run': 'python -m pip install ".[dev]"'},
    {'name': 'Test with pytest',
     'run': 'pytest --cov=lac --cov-report=term-missing'}]}}}

In [28]:
data = [
    {"name": "julanito", "grade": 3.7},
    {"name": "peranito", "grade": 3.3}
]
with open("local/files/temp.yml", "w") as f:
    yaml.dump(data, f)  # yaml.dump returns None when given a stream, so do not reassign data