# Initial exploration

You might want to keep track of your notebooks in the git repository. The easiest way to do this is to put them in the `notebooks` folder. There are many naming conventions that you can use, but here are a couple of useful ones:


- `XX-{title}.ipynb` where `XX` is a number that indicates the order in which the notebooks were created. Useful for a smaller team where you don't want to clutter the directory with long file names.
- `YYYYMMDD-{initials}-{title}.ipynb` where `YYYYMMDD` is the date on which the notebook was created, `initials` are your initials, and `title` is the title of the notebook. This is useful if you want to keep track of when the notebook was created and who created it.

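As a minimal sketch of the two conventions above, the helpers below build file names from a title. The function names (`slugify`, `numbered_name`, `dated_name`), the lowercase-with-hyphens title style, and the example initials `ab` are assumptions made for illustration, not part of the project.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join its words with hyphens (an assumed style)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def numbered_name(index: int, title: str) -> str:
    """Build an `XX-{title}.ipynb` name with a zero-padded counter."""
    return f"{index:02d}-{slugify(title)}.ipynb"


def dated_name(created: date, initials: str, title: str) -> str:
    """Build a `YYYYMMDD-{initials}-{title}.ipynb` name."""
    return f"{created:%Y%m%d}-{initials}-{slugify(title)}.ipynb"


print(numbered_name(1, "Initial Exploration"))
# 01-initial-exploration.ipynb
print(dated_name(date(2025, 1, 16), "ab", "Initial Exploration"))
# 20250116-ab-initial-exploration.ipynb
```
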
Consider putting notebooks in your `user` directory instead if they are one-offs that are not meant to be reused as part of the workflow.

In [1]:
%load_ext autoreload
%autoreload 2

In the code cell above, we load IPython's `autoreload` extension. This is useful when the notebook calls utility functions from your own module: with `%autoreload 2`, IPython reloads modules before running each cell, so edits to the module take effect without restarting the kernel.

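To see what the extension saves you from doing by hand, here is a minimal sketch of a manual reload with the standard library's `importlib.reload`. The module name `autoreload_demo_utils` and its `answer` function are hypothetical and are written to a temporary directory purely for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway module, created in a temp dir for illustration only.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "autoreload_demo_utils.py"
module_file.write_text("def answer():\n    return 1\n")

sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import autoreload_demo_utils

before = autoreload_demo_utils.answer()

# Edit the module on disk. The new source has a different length, so a
# stale bytecode cache cannot be mistaken for the current version.
module_file.write_text("def answer():\n    return 42\n")

# Without a reload, the already-imported module still runs the old code.
stale = autoreload_demo_utils.answer()

# This is the step that autoreload performs for you before each cell.
importlib.reload(autoreload_demo_utils)
after = autoreload_demo_utils.answer()

print(before, stale, after)
# 1 1 42
```
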
In [2]:
from my_task_package.spark import get_spark

spark = get_spark()
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/16 21:09:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/16 21:09:41 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


Here we call into our module, which lets us organize many of the utilities into shared functions. Use this to your advantage to keep notebooks organized and to limit the amount of code that you write directly in notebooks.

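The notebook does not show how `get_spark` is implemented. Purely as an illustration, a shared helper of this kind might look like the sketch below. The parameter names, defaults, and configuration values are assumptions rather than the actual contents of `my_task_package.spark`, and calling the function assumes PySpark is installed.

```python
def get_spark(app_name="my_task", master="local[*]", driver_memory="4g"):
    """Return a SparkSession (hypothetical sketch, not the project's real helper)."""
    # Imported lazily so that defining the helper does not itself require PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .master(master)
        .config("spark.driver.memory", driver_memory)
        # getOrCreate returns the existing session if one is already running,
        # so every notebook that calls the helper shares the same session.
        .getOrCreate()
    )
```
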
One final tip is to use the `!` prefix to run shell commands. This is useful when a shell utility is easier to use, or if you want to capture the output from your program to check into the repository. You can interpolate values from the Python environment using `{}`.

In [4]:
project_root = "../.."
! ls {project_root}

docs		 my_task_project.egg-info  README.md	     tests
LICENSE		 notebooks		   requirements.txt  user
my_task_package  pyproject.toml		   scripts

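The `!` syntax only works inside IPython or Jupyter. If the same logic later moves into a shared module, a rough plain-Python equivalent is `subprocess.run`, sketched below. It assumes a POSIX system where `ls` is available, and it lists the current directory instead of `../..` so that it runs from anywhere.

```python
import subprocess

# The notebook uses "../.."; "." is substituted here so the sketch runs anywhere.
project_root = "."

# Passing the arguments as a list avoids shell quoting problems with the path.
result = subprocess.run(
    ["ls", project_root],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

# One entry per line, ready to be inspected or written to a file.
entries = result.stdout.splitlines()
print(entries)
```

Within a notebook, assigning the command to a variable (`files = !ls {project_root}`) captures the output lines in a similar way.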