# Research data management with Python and MongoDB

This notebook details an example upload process from parsing raw data to pushing it into the MongoDB database.
A second notebook will follow up by querying the data and retrieving it for analysis.

## Parsing files in Python

Before we can push data to the database, we need to parse a data file. EVN generously provided us with a sample dataset, found in `data/BZ011_Rohdaten.dat`. Before trying to parse it, we should take a look at the file ourselves to understand its structure. Doing so, we should notice the following:
- The file is structured as a CSV file, but uses tabs instead of commas as delimiters between values
- The first row of the file contains column headers as strings, and all following rows contain mixed data
- The data is a mix of strings, decimal numbers and dates
- The decimal numbers use commas as decimal separators (as opposed to points, which are used in English-speaking countries and therefore also in virtually all programming languages)
- Some column headers contain special characters, e.g. `°C`

We will take care of the last point first. Special characters often cause problems with text encoding if they are not handled consistently. Therefore we use the `chardet` module to automatically detect the encoding of our input file:

In [50]:
import pandas as pd
import chardet

# File path
file_path = "./data/BZ011_Rohdaten.dat"


with open(file_path, "rb") as f:
    result = chardet.detect(f.read(100000))  # Analyze first 100KB
    detected_encoding = result["encoding"]

print(detected_encoding)

utf-8
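
Since `chardet` returns a statistical guess (and its `encoding` entry can be `None` when it cannot decide), it can be worth cross-checking the result. The following is a minimal sketch of one possible check, not part of the original workflow: it simply tries to decode the raw bytes with a short list of candidate encodings. The helper name `guess_encoding` and the candidate list are assumptions chosen for illustration.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without errors.

    Hypothetical helper for illustration; the candidate list is an assumption
    and should be adapted to the encodings you expect in your data.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# The degree sign is encoded differently in UTF-8 and cp1252
print(guess_encoding("°C".encode("utf-8")))
print(guess_encoding("°C".encode("cp1252")))
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any byte sequence.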


We can then parse the file using the function `pd.read_csv()`, which conveniently accepts arguments to adjust the decimal separator and the delimiter. We also want to convert the `Datum` column into a proper date format using `pd.to_datetime()`.

In [51]:
# Read the tab-delimited file into a pandas DataFrame, treating "," as the decimal separator
df = pd.read_csv(file_path, delimiter="\t", encoding=detected_encoding, decimal=",")

# Convert the 'Datum' column to datetime
df['Datum'] = pd.to_datetime(df['Datum'])



With this, we have done all the parsing necessary to get our input file into a pandas DataFrame. Some additional parsing methods, like combining multiple files, can be found in the repository of the recent Python course: https://github.com/ZBT-Tools/Python_workshop, in the notebook for part 3.
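
To see the effect of the `delimiter` and `decimal` arguments in isolation, the following sketch parses a tiny in-memory sample instead of the real file. The column names, values, and the date format `%d.%m.%Y %H:%M:%S` are assumptions made up for this illustration; the actual file may look different.

```python
import io

import pandas as pd

# Hypothetical sample: tab-delimited, decimal commas, German-style dates
sample = (
    "Datum\tTemperatur [°C]\tSet\n"
    "01.02.2024 10:00:00\t21,5\tA\n"
    "01.02.2024 10:00:01\t21,7\tA\n"
)

df_sample = pd.read_csv(io.StringIO(sample), delimiter="\t", decimal=",")

# Stating the format explicitly avoids ambiguity between day-first and month-first dates
df_sample["Datum"] = pd.to_datetime(df_sample["Datum"], format="%d.%m.%Y %H:%M:%S")

print(df_sample.dtypes)
```

Without `decimal=","`, the temperature column would be read as strings rather than floating point numbers.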

## Uploading data to MongoDB

To upload data to MongoDB, we start by creating a connection using the `pymongo` module, in particular its `MongoClient` class. The ZBT database can be reached with the following connection string:
`mongodb://username:password@172.16.134.8:27017/?directConnection=true&authSource=admin`, where username and password need to be replaced by your own credentials. For users that exist only on a specific database, e.g. student users, the `authSource` parameter needs to be set to that database.
Naturally, we do not want our credentials to be plainly visible in a Python script - especially if we want to push it to a GitHub repository at some point. Any such secrets should be stored in environment variable files, which are conventionally called `.env` (but can be called however you prefer). 

```
MONGODB_USER = "username"
MONGODB_PASSWORD = "password"
```
Before committing your files to a git repository you should then create a `.gitignore` file, which contains the name of your environment file. 

In [None]:
from pymongo import MongoClient
from dotenv import load_dotenv
import os

# Load the variables defined in the .env file into the process environment
load_dotenv()

mongodb_user = os.environ.get("MONGODB_USER")
mongodb_password = os.environ.get("MONGODB_PASSWORD")  # must match the name used in .env

# MongoDB connection
mongo_uri = f"mongodb://{mongodb_user}:{mongodb_password}@172.16.134.8:27017/?directConnection=true&authSource=admin"
# mongo_uri = "mongodb://localhost:27017"
client = MongoClient(mongo_uri)
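
If a username or password contains characters that are reserved in URIs, such as `@`, `:` or `/`, they need to be percent-escaped before being placed in the connection string. The following is a minimal sketch of how this could be done with `urllib.parse.quote_plus` from the standard library. The helper `build_mongo_uri` and the example credentials are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="172.16.134.8", port=27017, auth_source="admin"):
    """Build a MongoDB connection string with percent-escaped credentials.

    Hypothetical helper for illustration; the defaults mirror the connection
    string shown above.
    """
    if not user or not password:
        raise ValueError("MongoDB credentials are missing, check your .env file")
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?directConnection=true&authSource={auth_source}"
    )


# Example with a made-up password containing reserved characters
print(build_mongo_uri("username", "p@ss:word/1"))
```

Raising an error for missing credentials also catches the case where `os.environ.get()` returned `None` because a variable name in the script does not match the one in the `.env` file.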



We then select the database, for example `rdm_workshop`, and create a new collection named after the data file that we loaded before. Before creating the collection, we make sure that it does not already exist. For this particular example, we create a `timeseries` collection, which is optimized for tabular data where the main variable is time. The column with date and time information is named `Datum`, so we pass that to the `timeField` argument to make sure it is indexed properly and can be queried.

In [53]:

db = client["rdm_workshop"]
collection_name = file_path.split("/")[-1].split(".")[0]

if collection_name not in db.list_collection_names():
    # Create time-series collection if it doesn't exist
    db.create_collection(
        collection_name,
        timeseries={
            "timeField": "Datum",
            "metaField": "Set ",
            "granularity": "seconds"
        },
    )
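
As a side note, the collection name can also be derived from the file path with `pathlib` from the standard library. This is a small alternative sketch, not the approach used in the cell above.

```python
from pathlib import Path

file_path = "./data/BZ011_Rohdaten.dat"

# .stem is the final path component without its last suffix
collection_name = Path(file_path).stem
print(collection_name)  # BZ011_Rohdaten
```

Note that the two approaches differ for file names with several dots: `Path("archive.tar.gz").stem` gives `archive.tar`, whereas splitting on the first `.` gives `archive`.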


Finally, before uploading the data we need to convert it to a list of dictionaries (one per row), since the dictionary is Python's built-in datatype for JSON-like data. Dictionaries have preserved insertion order since Python 3.7, but it is still more robust to query data by column names than to rely on column positions.

In [54]:


collection = db[collection_name]

if collection.count_documents({}) == 0:
    # Insert data into MongoDB
    records = df.to_dict(orient="records")
    collection.insert_many(records)
    print("Data uploaded successfully!")
else:
    print("WARNING: Collection already contains data, make sure you are writing to the correct collection!")


Data uploaded successfully!
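
To make the shape of `records` more concrete, the following sketch converts a small made-up DataFrame in the same way. The column names and values are assumptions for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the parsed data file
df_demo = pd.DataFrame(
    {
        "Datum": pd.to_datetime(["2024-02-01 10:00:00", "2024-02-01 10:00:01"]),
        "Temperatur [°C]": [21.5, 21.7],
        "Set": ["A", "A"],
    }
)

# One dictionary per row, keyed by column name
records_demo = df_demo.to_dict(orient="records")
first = records_demo[0]

print(len(records_demo))
print(first["Temperatur [°C]"])  # values are accessed by column name
```

Each dictionary in the list becomes one document in the collection, with the column names as field names.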
