# Optimus Utilities

This notebook describes the Optimus Utilities module in more detail.

The Utilities module contains helper classes that support the use of the DataFrameTransformer and DataFrameAnalyzer modules. It provides facilities to read local or remote csv files (by providing a URL), to load Spark dataframes from a URL, to read parquet files, and to perform csv-parquet conversion.

### Importing Modules

In [1]:
# Import optimus
import optimus as op

Deleting previous folder if exists...
Creation of checkpoint directory...
Done.


### Instantiation of Utility class

In [2]:
# Instance of Utilities class
tools = op.Utilities()

## Reading Dataframe from csv

Let's assume you have a file called foo.csv in your current directory (this file can be found in the Optimus repository).

The dataframe can be obtained by reading the csv file with the following line:

In [3]:
# Read the dataframe. In this case the local file
# system (the hard drive of the PC) is used.

df = tools.read_dataset_csv(path="foo.csv", delimiter_mark=',')

#### Now we can view part of the dataframe by using DataFrame.show()

In [None]:
df.show(5)

+---+---------+-----------+---------+-------+-----+----------+--------+
| id|firstName|   lastName|billingId|product|price|     birth|dummyCol|
+---+---------+-----------+---------+-------+-----+----------+--------+
|  1|     Luis|Alvarez$$%!|      123|   Cake|   10|1980/07/07|   never|
|  2|    André|     Ampère|      423|   piza|    8|1950/07/08|   gonna|
|  3|    NiELS| Böhr//((%%|      551|  pizza|    8|1990/07/09|    give|
|  4|     PAUL|     dirac$|      521|  pizza|    8|1954/07/10|     you|
|  5|   Albert|   Einstein|      634|  pizza|    8|1990/07/11|      up|
+---+---------+-----------+---------+-------+-----+----------+--------+
only showing top 5 rows
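
If foo.csv is not at hand, a small stand-in can be written with Python's standard `csv` module. The sketch below contains only the five rows printed above (the real foo.csv in the repository has more rows), and the file name `foo_sample.csv` and both helper functions are hypothetical names chosen for illustration, not part of Optimus.

```python
import csv

# Column names and the first five rows, copied from the df.show(5) output.
# Assumption: the real foo.csv contains additional rows that are not shown.
HEADER = ["id", "firstName", "lastName", "billingId",
          "product", "price", "birth", "dummyCol"]
ROWS = [
    [1, "Luis", "Alvarez$$%!", 123, "Cake", 10, "1980/07/07", "never"],
    [2, "André", "Ampère", 423, "piza", 8, "1950/07/08", "gonna"],
    [3, "NiELS", "Böhr//((%%", 551, "pizza", 8, "1990/07/09", "give"],
    [4, "PAUL", "dirac$", 521, "pizza", 8, "1954/07/10", "you"],
    [5, "Albert", "Einstein", 634, "pizza", 8, "1990/07/11", "up"],
]


def write_sample_csv(path="foo_sample.csv"):
    """Write the header and the five sample rows as a comma-delimited file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return path


def read_sample_csv(path):
    """Read the file back as a list of rows (every value is a string)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=","))
```

After calling `write_sample_csv()`, the resulting file could be passed to `tools.read_dataset_csv(path="foo_sample.csv", delimiter_mark=',')` in place of foo.csv.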



#### Or with the DataFrameProfiler Optimus class:

In the following cell, a basic profile of the DataFrame is shown. This overview presents basic information about the DataFrame, such as the number of variables it has, how many values are missing and in which columns, the type of each variable, and some statistical information that describes each variable, plus a frequency plot. It also includes a table that specifies the existing datatypes in each column of the DataFrame, along with other features. For this particular case, the table of datatypes is also shown in order to visualize a sample of the column content.

In [None]:
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

In [21]:
#tools.read_dataset_url()


### Instantiation of DataFrameTransformer
DataFrameTransformer is a specialized class for making dataFrame transformations. Transformations are optimized as much as possible to use native Spark transformation functions internally.

In [24]:
# Instance of transformer class 
transformer = op.DataFrameTransformer(df)

In [37]:
df.show(5)

+---+---------+-----------+---------+-------+-----+----------+--------+
| id|firstName|   lastName|billingId|product|price|     birth|dummyCol|
+---+---------+-----------+---------+-------+-----+----------+--------+
|  1|     Luis|Alvarez$$%!|      123|   Cake|   10|1980/07/07|   never|
|  2|    André|     Ampère|      423|   piza|    8|1950/07/08|   gonna|
|  3|    NiELS| Böhr//((%%|      551|  pizza|    8|1990/07/09|    give|
|  4|     PAUL|     dirac$|      521|  pizza|    8|1954/07/10|     you|
|  5|   Albert|   Einstein|      634|  pizza|    8|1990/07/11|      up|
+---+---------+-----------+---------+-------+-----+----------+--------+
only showing top 5 rows



## Checkpoints
At this point, it is important to remember that Optimus is an abstraction layer on top of Apache Spark. Consequently, all the computing logic is actually carried out by Apache Spark, and this is the main reason why we need to use checkpoints.

Checkpoints in this context are images of processed data (dataframes) saved to disk. Checkpoints are useful because of Apache Spark's lazy evaluation (instructions are not run until they are actually needed): when you execute, for example, some dataframe transformations, what Apache Spark actually does is place those instructions in a queue.

The transformation instructions are accumulated but not executed until an action is requested (for example, a df.show() instruction). The problem comes when we have a very long list of stacked instructions. At that point, executing all the instructions every time the dataframe is shown can become tedious and time consuming.

The solution is to use the DataFrameTransformer.check_point() method to cut the lineage of instructions, save the dataframe in its new state, and avoid re-running the same processes again and again. This behavior is normal in Apache Spark, because it was designed to save the tasks rather than the results (it is also a distributed computing framework).
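
To make the idea concrete, here is a toy model in plain Python. It is not Spark and not Optimus code: `LazyFrame` is a hypothetical class invented for this illustration, and its `check_point()` keeps the result in memory, whereas a real checkpoint writes it to disk. It only shows how queued transformations are replayed on every action and how a checkpoint cuts that lineage.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not Spark)."""

    def __init__(self, rows, lineage=None):
        self._rows = list(rows)               # data that is already materialized
        self._lineage = list(lineage or [])   # queued, not yet executed, steps
        self.steps_run = 0                    # how many steps this frame has executed

    def transform(self, func):
        # Lazy: nothing is computed here, the step is only added to the queue.
        return LazyFrame(self._rows, self._lineage + [func])

    def collect(self):
        # Action: replay the whole lineage from the base data, every time.
        rows = self._rows
        for func in self._lineage:
            rows = [func(row) for row in rows]
            self.steps_run += 1
        return rows

    def check_point(self):
        # Run the lineage once, keep the result, and start with an empty lineage.
        return LazyFrame(self.collect(), lineage=[])


df = LazyFrame(["  Luis", "NiELS ", "PAUL"])
df = df.transform(str.strip).transform(str.lower).transform(str.title)

df.collect()
df.collect()
print(df.steps_run)      # 6: three queued steps, replayed for each of the two actions

saved = df.check_point()
saved.collect()
saved.collect()
print(saved.steps_run)   # 0: the lineage was cut, so nothing is replayed
```

With three steps the replay cost is negligible, but it grows with every transformation added to the queue, which is where a checkpoint starts to pay off.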

#### Setting checkpoint folder

In [31]:
tools.set_check_point_folder(path="/home/hugo/Documents/Development/Optimus/examples", file_system="local")

Deleting previous folder if exists...
Creation of checkpoint directory...
Done.


#### Deleting checkpoint folder

In [32]:
tools.delete_check_point_folder(path="/home/hugo/Documents/Development/Optimus/examples", file_system="local")

Deleting checkpoint folder...
Folder deleted. 
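
The messages printed by these two calls suggest a simple folder lifecycle: remove any previous checkpoint folder, create a fresh one, and later delete it. The sketch below is one plausible reading of that behavior for a local file system, using only the standard library. It is an assumption, not the Optimus implementation; the function names and the `checkpoint_demo` subfolder name are hypothetical.

```python
import os
import shutil

# Hypothetical subfolder name; the folder Optimus actually uses is not shown here.
CHECKPOINT_SUBDIR = "checkpoint_demo"


def reset_checkpoint_folder(base_path):
    """Delete the previous checkpoint folder if it exists, then create it again."""
    folder = os.path.join(base_path, CHECKPOINT_SUBDIR)
    if os.path.isdir(folder):
        shutil.rmtree(folder)   # drop stale checkpoints from an earlier run
    os.makedirs(folder)
    return folder


def delete_checkpoint_folder(base_path):
    """Remove the checkpoint folder; return True if something was deleted."""
    folder = os.path.join(base_path, CHECKPOINT_SUBDIR)
    if os.path.isdir(folder):
        shutil.rmtree(folder)
        return True
    return False
```

Keeping the checkpoints in a dedicated subfolder, rather than in `base_path` itself, means that deleting them never touches other files stored next to them.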

