# Data Preparation

#### Context
The project uses the following directory structure:

In [13]:
from pathlib import Path
from useful_funcs import tree 

In [14]:
parent_dir = Path("..")

tree(parent_dir)

+ ..
    + data
        + final
        + interim
        + raw
            + hidrografia.zip
            + mexstates.zip
    + docs
    + figures
        + final
        + interim
    + notebooks
        + 00 Data Prep.ipynb
        + useful_funcs.py
    + README.md


We have two shapefiles, each packaged as a `.zip` archive:
* `hidrografia.zip`: contains hydrography data for Mexico.
  - Source: [Instituto Nacional de Estadística y Geografía - INEGI](http://www.beta.inegi.org.mx/app/mapas/)
* `mexstates.zip`: contains the outlines of the 32 states that compose Mexico.
  - Source: [ArcGIS](https://www.arcgis.com/home/item.html?id=ac9041c51b5c49c683fbfec61dc03ba8)

The first step is to unzip the files.

In [15]:
from zipfile import ZipFile

In [17]:
interim_data_path = Path("../data/interim/")
raw_data_path = Path("../data/raw/")

In [19]:
for zippedfile in raw_data_path.glob("*.zip"):
    # extract each archive into its own folder, named after the archive
    with ZipFile(zippedfile) as archive:
        archive.extractall(interim_data_path / zippedfile.stem)

For those unfamiliar with the `pathlib` module: the `/` operator joins path components, as in `interim_data_path / zippedfile.stem` above. The following example may illustrate this better:

*NOTE: for more on `pathlib`, read [RealPython's article](https://realpython.com/python-pathlib/) on it.*

In [22]:
for zippedfile in raw_data_path.glob("*.zip"):
    # Path.glob() returns an iterator of Path objects;
    # each one has attributes such as .name and .stem.
    print(f"the file's .name is: {zippedfile.name}")
    print(f"the file's .stem is: {zippedfile.stem}")
    print(f"and we can concatenate paths like this Path / Path. the resulting path in this case is: \n\t {interim_data_path / zippedfile.stem}")
    print("="*15)

the file's .name is: hidrografia.zip
the file's .stem is: hidrografia
and we can concatenate paths like this Path / Path. the resulting path in this case is: 
	 ..\data\interim\hidrografia
===============
the file's .name is: mexstates.zip
the file's .stem is: mexstates
and we can concatenate paths like this Path / Path. the resulting path in this case is: 
	 ..\data\interim\mexstates
===============


Now, because these `.zip` files come from different sources, they are bundled differently. 

In [23]:
tree(interim_data_path)

+ ..\data\interim
    + hidrografia
        + conjunto_de_datos
            + red_hidrografica_250k.dbf
            + red_hidrografica_250k.prj
            + red_hidrografica_250k.sbn
            + red_hidrografica_250k.sbx
            + red_hidrografica_250k.shp
            + red_hidrografica_250k.shp.xml
            + red_hidrografica_250k.shx
        + metadato
            + Red_Hidrografica_Digital.html
            + Red_Hidrografica_Digital.sgml
            + Red_Hidrografica_Digital.txt
            + Red_Hidrografica_Digital.xml
            + ~$d_Hidrografica_Digital.txt
    + mexstates
        + mexstates.dbf
        + mexstates.prj
        + mexstates.sbn
        + mexstates.sbx
        + mexstates.shp
        + mexstates.shp.xml
        + mexstates.shx


***
Now that we have extracted our data, we can move on to our visualization work with `geopandas` and `altair`.

[01 Data Visualization notebook](01_Data_Visualization.ipynb)