# Use a project settings module

I always struggle to keep track of paths in my project scripts and notebooks. Are they relative to the notebook file? Or the directory I was in when I started the notebook server?
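
To make the ambiguity concrete, here is a minimal sketch showing that a relative path is resolved against the process's current working directory, so the same relative path points to different files depending on where the process was started. The temporary directory and the `relative` path are stand-ins chosen for illustration, not part of this project.

```python
import os
import tempfile
from pathlib import Path

# A relative path, like one you might type in a notebook cell (illustrative only)
relative = Path("data") / "processed" / "test.csv"

original_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    try:
        # Pretend the notebook server was started from some other directory
        os.chdir(tmp)
        resolved_in_tmp = relative.resolve()
    finally:
        # Always return to where we started
        os.chdir(original_cwd)

resolved_in_original = relative.resolve()

# Same relative path, two different absolute locations
print(resolved_in_tmp)
print(resolved_in_original)
```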

To make this explicit, I started adding a settings module to my project similar to the [one used in Django projects](https://docs.djangoproject.com/en/3.2/topics/settings/). Django is a popular Python web application framework, so this convention is likely to be easily understood by other Python developers.

You can see a bare-bones example in the [`settings.py`](../settings.py) file at the root of this project's repository.

In [1]:
# Take a look at the settings module contents
!cat settings.py

"""Project settings"""

from pathlib import Path

PROJECT_ROOT = Path(__file__).parent
# I use a layout similar to the one used by templates distributed along with AP's DataKit
DATA_DIR = PROJECT_ROOT / "data"
DATA_DIR_SRC = DATA_DIR / "source"
DATA_DIR_MANUAL = DATA_DIR / "manual"
DATA_DIR_PROCESSED = DATA_DIR / "processed"
DATA_DIR_PUBLIC = DATA_DIR / "public"

However, before you can import this module in your notebook, you have to tell Python where to find it. I like to do this with a script that Python runs on startup, which I point to with the [`PYTHONSTARTUP` environment variable](http://www.witkowskibartosz.com/blog/pythonstartup_what_it_is_and_how_to_use_it.html).

This script changes directory to the project root.

If you use this in your own project, change the `project_name` variable to match the name of your project's root directory.

In [2]:
# Show the Python startup script
!cat .startup.py

import os

project_name = "python-data-cheatsheet" 
cwd = os.getcwd()
basename = os.path.basename(cwd)
project_root_not_found = False

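# Walk up the directory tree until we reach a directory named project_name
while basename != project_name:
    if os.path.split(cwd)[1] == '':
        # Reached the filesystem root without a match: set basename to end
        # the loop and flag the failure
        basename = project_name
        project_root_not_found = True
    else:
        # Move up one directory level
        cwd = os.path.split(cwd)[0]
        basename = os.path.basename(cwd)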

if project_root_not_found:
    print("Couldn't find project root. Current directory not changed to project root.")
else:
    os.chdir(cwd)
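
The same upward search can also be written with `pathlib`. This is a rough sketch of an alternative, not the script this project actually uses; the `find_project_root` helper and the `PROJECT_NAME` constant are names I made up for the example.

```python
import os
from pathlib import Path

# Assumption: the project root is the directory with this name
PROJECT_NAME = "python-data-cheatsheet"


def find_project_root(start, project_name):
    """Return the nearest directory named project_name at or above start, or None."""
    start = Path(start).resolve()
    # Check the starting directory first, then each parent up to the filesystem root
    for candidate in (start, *start.parents):
        if candidate.name == project_name:
            return candidate
    return None


project_root = find_project_root(Path.cwd(), PROJECT_NAME)
if project_root is None:
    print("Couldn't find project root. Current directory not changed to project root.")
else:
    os.chdir(project_root)
```

Returning `None` when nothing matches avoids the flag variable and makes the "not found" case explicit to the caller.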


Finally, you need to set the `PYTHONSTARTUP` environment variable and point it to your script.

You'll want to replace `/home/ghing/workspace/python-data-cheatsheet/` with the full absolute path to your project directory.

You can set the `PYTHONSTARTUP` environment variable in a number of ways.

You could specify it when you start the notebook server:

```
PYTHONSTARTUP=/home/ghing/workspace/python-data-cheatsheet/.startup.py jupyter-lab
```

Or you could export it, and it will then be used any time you run Python in your current terminal:

```
export PYTHONSTARTUP=/home/ghing/workspace/python-data-cheatsheet/.startup.py
# Anything you run after this point will use that startup script
```

I use [Pipenv](https://github.com/pypa/pipenv) to manage virtual environments for my project, and it automatically sources `.env` files when you run a command with `pipenv run`. So I just set the environment variable in a `.env` file in my project root.

In [3]:
# Show relevant line of .env file
!grep PYTHONSTARTUP .env

PYTHONSTARTUP=/home/ghing/workspace/python-data-cheatsheet/.startup.py
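
If you are unsure whether the variable reached your session, a quick check from inside Python can help. This is a minimal sketch that only inspects the environment and the working directory; it makes no assumptions about how the variable was set.

```python
import os
from pathlib import Path

# Read the variable from the environment of the running Python process
startup_script = os.environ.get("PYTHONSTARTUP")

if startup_script is None:
    print("PYTHONSTARTUP is not set in this session.")
else:
    # Confirm the path points at a real file
    exists = Path(startup_script).is_file()
    print(f"PYTHONSTARTUP={startup_script} (file exists: {exists})")

# If the startup script ran, this should be the project root
print(f"Current working directory: {Path.cwd()}")
```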


This might seem like a lot to set up, but it's worth it to not have to think about paths mid-project. If you have one main project or virtual environment for your data analysis, you can set this up once and forget about it.

If you create many projects, I highly recommend the [DataKit](https://datakit.ap.org/) tool, which will initialize all of this whenever you bootstrap a new project.

In [4]:
# Setup

import os

import numpy as np
import pandas as pd

# Import the settings module
import settings

Then you can easily use the paths to data directories defined in your settings module to build paths to data that you open or save in your notebooks.

In [5]:
# Show building paths based on ones defined in the settings module

settings.DATA_DIR_PROCESSED / "test.csv"

PosixPath('/home/ghing/workspace/python-data-cheatsheet/data/processed/test.csv')

In [6]:
# Create some toy data

df = pd.DataFrame(
    [
        [np.nan, 2, np.nan, 0],
        [3, 4, np.nan, 1],
        [np.nan, np.nan, np.nan, 5],
        [np.nan, 3, np.nan, 4],
    ],
    columns=list("ABCD"),
)

df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | NaN | 2.0 | NaN | 0 |
| 1 | 3.0 | 4.0 | NaN | 1 |
| 2 | NaN | NaN | NaN | 5 |
| 3 | NaN | 3.0 | NaN | 4 |


In [7]:
# Use a path in the settings module to unambiguously save the dataframe
# to a known location.
os.makedirs(settings.DATA_DIR_PROCESSED, exist_ok=True)
df.to_csv(settings.DATA_DIR_PROCESSED / "test.csv", index=False)
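
Because the paths in the settings module are `pathlib.Path` objects, the directory could also be created with `Path.mkdir` instead of `os.makedirs`. The sketch below uses a temporary directory as a stand-in for `settings.DATA_DIR_PROCESSED` and writes a small text file instead of a dataframe so that it runs on its own; `data_dir_processed` is a name used only for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for settings.DATA_DIR_PROCESSED (assumption for this demo)
    data_dir_processed = Path(tmp) / "data" / "processed"

    # parents=True creates missing parent directories;
    # exist_ok=True makes repeated calls harmless
    data_dir_processed.mkdir(parents=True, exist_ok=True)
    data_dir_processed.mkdir(parents=True, exist_ok=True)

    # Write a small file to the known location
    out_path = data_dir_processed / "test.csv"
    out_path.write_text("A,B,C,D\n,2.0,,0\n")
    created = out_path.is_file()

print(created)
```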

In [8]:
# This and remaining cells aren't likely to be things you would put in a real
# notebook. They're just to show how this works.

# Show that we successfully saved the file
!cat data/processed/test.csv

A,B,C,D
,2.0,,0
3.0,4.0,,1
,,,5
,3.0,,4


In [9]:
# Cleanup
!rm -f data/processed/test.csv