# Research data management with Python and MongoDB

This notebook details an example upload process, from parsing raw data to pushing it into the MongoDB database.
A second notebook will follow up on this by querying the data and retrieving it for analysis.

## Parsing files in Python

Before we can push data to the database, we need to parse a data file. EVN generously provided us with a sample dataset found in `data/BZ011_Rohdaten.dat`. Before trying to parse it, we should take a look at the file ourselves to understand its structure. Doing so, we should notice the following:
- The file is structured like a CSV file, but uses tabs instead of commas as delimiters between values
- The first row of the file contains column headers as strings and all following rows contain mixed data
- The data is a mix of strings, decimal numbers and dates
- The decimal numbers use commas as decimal separators (as opposed to points, which are used in English-speaking countries and therefore also in virtually all programming languages)
- Some column headers contain special characters, e.g. `°C`
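
One way to take that first look is to print the first few lines in escaped form, which makes tabs (`\t`), decimal commas and non-ASCII characters visible. The following is only a minimal sketch: the helper name `peek` and the sample content are made up for illustration and do not reflect the actual EVN file.

```python
import os
import tempfile


def peek(path, n_lines=3, encoding="latin-1"):
    """Return the first n_lines of a text file as escaped strings."""
    # latin-1 maps every byte to a character, so decoding never fails,
    # even if it is not the file's real encoding.
    preview = []
    with open(path, "r", encoding=encoding, newline="") as f:
        for _ in range(n_lines):
            line = f.readline()
            if not line:
                break
            # ascii() shows tabs as \t and non-ASCII characters as escapes
            preview.append(ascii(line))
    return preview


# Hypothetical sample file that mimics the structure described above
with tempfile.NamedTemporaryFile(
    "w", suffix=".dat", delete=False, encoding="latin-1", newline=""
) as tmp:
    tmp.write("Datum\tT [°C]\n05.08.24 13:30:00\t65,2\n")
    sample_path = tmp.name

preview = peek(sample_path)
for line in preview:
    print(line)
os.remove(sample_path)
```

An escape such as `\xb0` in the header line is a hint that the file contains special characters and that its encoding deserves attention.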

We will take care of the last point first. Special characters often cause problems with text encoding if they are not handled consistently. Therefore we use the `chardet` module to automatically detect the encoding of our input file:

In [None]:
import pandas as pd
import chardet

# File path
file_path = "./data/BZ011_Rohdaten.dat"
file_path_metadata = "./data/metadata_BZ011_Rohdaten.json"


with open(file_path, "rb") as f:
    result = chardet.detect(f.read(100000))  # Analyze first 100KB
    detected_encoding = result["encoding"]

print(detected_encoding)
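
To illustrate why the encoding matters for a header containing `°C`, the sketch below encodes a made-up column header in two common encodings and then decodes it with the wrong one. The header text is an assumption for illustration and is not taken from the data file.

```python
degree_header = "T [°C]"  # hypothetical column header

latin1_bytes = degree_header.encode("latin-1")
utf8_bytes = degree_header.encode("utf-8")

# The degree sign is one byte in Latin-1 but two bytes in UTF-8
print(latin1_bytes)
print(utf8_bytes)

# Reading UTF-8 bytes as Latin-1 silently produces a stray character
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))

# Reading Latin-1 bytes as UTF-8 fails outright
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

The first kind of mismatch is the more dangerous one, because it corrupts column names without raising an error.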

We can then parse the file using the function `pd.read_csv()`, which conveniently accepts arguments to adjust the decimal separator and the delimiter. We also want to convert the `Datum` column into a proper date format using `pd.to_datetime()`.

In [None]:
# Read data into pandas DataFrame
df = pd.read_csv(file_path, delimiter="\t", encoding=detected_encoding, decimal=",")  
df['Datum'] = pd.to_datetime(df['Datum'], format="%d.%m.%y %H:%M:%S")  

df.head()
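
The effect of the `delimiter` and `decimal` arguments can be checked on a small in-memory sample without touching the real file. The column names and values below are invented for illustration; only the overall structure (tab-separated, decimal commas, `Datum` column) follows the description above.

```python
import io

import pandas as pd

# Hypothetical sample: tab-separated values with decimal commas
sample = (
    "Datum\tT_Stack [°C]\tStatus\n"
    "05.08.24 13:30:00\t65,2\tOK\n"
    "05.08.24 13:30:01\t65,4\tOK\n"
)

df_sample = pd.read_csv(io.StringIO(sample), delimiter="\t", decimal=",")
df_sample["Datum"] = pd.to_datetime(df_sample["Datum"], format="%d.%m.%y %H:%M:%S")

# The temperature column is parsed as float, the date column as datetime
print(df_sample.dtypes)
```

Without `decimal=","` the temperature column would be read as strings, since `65,2` is not a valid number for the default parser.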

With this we have done all the necessary parsing to get our input file into a pandas DataFrame. Some additional parsing methods, such as combining multiple files, can be found in the repository of the recent Python course: https://github.com/ZBT-Tools/Python_workshop , in the notebook of part 3.

## Uploading data to MongoDB

To upload data to MongoDB we start by creating a connection using the `pymongo` module and particularly its `MongoClient` class. The ZBT database can be reached with the following connection string:
`mongodb://username:password@172.16.134.8:27017/?directConnection=true&authSource=admin`, where username and password need to be replaced by your own credentials. For users that exist only on a specific database, e.g. student users, the `authSource` parameter needs to be set to that database.
Naturally, we do not want our credentials to be plainly visible in a Python script - especially if we want to push it to a GitHub repository at some point. Any such secrets should be stored in environment variable files, which are conventionally called `.env` (but can be called however you prefer). 

```
MONGODB_USER = "username"
MONGODB_PASS = "password"
```

Before committing your files to a git repository you should then create a `.gitignore` file, which contains the name of your environment file. For your testing purposes, I have added the `.env` file to this repository, so you can use the test account from the course with username `rdm_workshop` and password `password`. 

In [None]:
from pymongo import MongoClient
from dotenv import load_dotenv
import os 

load_dotenv()

mongodb_user = os.environ.get("MONGODB_USER")
mongodb_password = os.environ.get("MONGODB_PASS")
# MongoDB connection
mongo_uri = "mongodb://"+mongodb_user+":"+mongodb_password+"@172.16.134.8:27017/?directConnection=true&authSource=admin"
# mongo_uri = "mongodb://localhost:27017"
client = MongoClient(mongo_uri)
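
Plain string concatenation works for simple credentials, but a password containing characters such as `@`, `:` or `/` would break the connection string. A possible safeguard is to percent-encode both values with `urllib.parse.quote_plus` before assembling the URI. The helper `build_mongo_uri` below is a hypothetical name used for this sketch and is not part of `pymongo`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="172.16.134.8:27017", auth_source="admin"):
    """Assemble a connection string with percent-encoded credentials."""
    if not user or not password:
        # os.environ.get() returns None when a variable is not set
        raise ValueError("MongoDB credentials are missing; check your .env file")
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?directConnection=true&authSource={auth_source}"
    )


# Made-up password containing characters that would otherwise break the URI
print(build_mongo_uri("rdm_workshop", "p@ss:word/1"))
```

The resulting string can be passed to `MongoClient` in the same way as `mongo_uri` above. The explicit check also gives a clearer message than the `TypeError` that concatenating `None` would raise.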



We then select the database, for example `rdm_workshop`, and create a new collection named after the data file that we loaded before. Before creating the collection we check that it does not already exist. For this particular example, we create a `timeseries` collection, which is optimized for tabular data where the main variable is time. The column with date and time information is named `Datum`, so we pass that to the `timeField` argument to make sure it is indexed properly and can be queried efficiently.

In [None]:

db = client["rdm_workshop"]
collection_name = file_path.split("/")[-1].split(".")[0]

if collection_name not in db.list_collection_names():
    # Create time-series collection if it doesn't exist
    db.create_collection(
        collection_name,
        timeseries={
            "timeField": "Datum",   # Name of the main column, by which the time series is indexed 
            "metaField": "metadata",    # Name of the metadata field  
            "granularity": "seconds"    
        },
    )
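
The collection name above is derived from the file path by string splitting. As a side note, `pathlib` from the standard library offers an equivalent that does not depend on the path separator. The sketch below compares both; the second file name is invented to show where they differ.

```python
from pathlib import Path

file_path = "./data/BZ011_Rohdaten.dat"

name_split = file_path.split("/")[-1].split(".")[0]
name_stem = Path(file_path).stem
print(name_split, name_stem)  # both give BZ011_Rohdaten

# The two differ only when the file name itself contains extra dots
dotted = "./data/run.2024.dat"  # hypothetical file name
dotted_split = dotted.split("/")[-1].split(".")[0]
dotted_stem = Path(dotted).stem
print(dotted_split)  # run
print(dotted_stem)   # run.2024
```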


Finally, before uploading the data we need to convert it to a list of dictionaries, one per row. Dictionaries are Python's built-in datatype for JSON-like data, and each one becomes a document in MongoDB. Although Python dictionaries preserve insertion order since version 3.7, you should not rely on the order of fields in documents retrieved from the database. Therefore we need to query data by column names and cannot rely on column positions.

In [None]:



collection = db[collection_name]

if collection.count_documents({}) == 0:
    # Insert data into MongoDB
    records = df.to_dict(orient="records")
    collection.insert_many(records)
    print("Data uploaded successfully!")
else:
    print("WARNING: Collection already contains data, make sure you are writing to the correct collection!")
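
To see what `to_dict(orient="records")` produces, it helps to run it on a tiny table. The DataFrame below is a made-up stand-in for the parsed data file; the column name `Spannung` is hypothetical.

```python
import pandas as pd

# Hypothetical two-row table standing in for the parsed data file
df_small = pd.DataFrame(
    {
        "Datum": pd.to_datetime(["2024-08-05 13:30:00", "2024-08-05 13:30:01"]),
        "Spannung": [0.71, 0.70],
    }
)

# One dictionary per row, keyed by column name
records_small = df_small.to_dict(orient="records")
print(len(records_small))
print(list(records_small[0].keys()))
```

Each dictionary in the list corresponds to one document that `insert_many` writes to the collection.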




Now let us add some metadata to help us find our data set later on. We will read the metadata from a secondary file `metadata_BZ011_Rohdaten.json`, which is structured as JSON data. The `metaField` in a time series collection has some special properties. If multiple documents share the same information in their respective `metaField`, MongoDB groups them internally and stores this information only once for the group rather than in every document. This gives us the best of both worlds between the convenience of having the data available in each document and the efficiency of storing it only once.

In [None]:
metadata_from_file = pd.read_json(file_path_metadata)
metadata_from_file = metadata_from_file.to_dict(orient="records")

print(metadata_from_file)

collection.update_many(
  {},
  { "$set": { "metadata": metadata_from_file} }
)
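
If the metadata file holds a single flat JSON object, the standard-library `json` module is an alternative to `pd.read_json` that yields one plain dictionary instead of a list of records. The sketch below assumes such a structure; the sample content is invented, with key names borrowed from the example queries in this notebook, and the real file may look different.

```python
import json
import os
import tempfile

# Hypothetical metadata content; the real file may hold different keys
sample_metadata = {"experiment_type": "Accelerated stress test", "testbench": "BZ011"}

with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample_metadata, tmp)
    metadata_path = tmp.name

with open(metadata_path, encoding="utf-8") as f:
    metadata = json.load(f)  # a plain dict, not a list of records
os.remove(metadata_path)

# Update document in the form expected by update_many
update_doc = {"$set": {"metadata": metadata}}
print(update_doc)
```

The resulting `update_doc` could then be passed as the second argument to `collection.update_many({}, update_doc)`.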

The `update_many` command above may require some additional explanation. As the name implies, it is used to update many documents in a collection at once. To do this, we first query the documents to update. By querying for an empty `{}` dictionary, the query will match all documents. The second argument is the update that we wish to apply to all selected documents. Specifically, we use the `$set` operator to set the `metaField`, which we named `"metadata"` earlier, to the value `metadata_from_file`, which contains the entries of the original JSON file converted to Python dictionaries.

With this metadata in place, we can run some example queries to retrieve our data from the database.

In [None]:
# Find all documents that describe Accelerated stress test experiments
cursor = collection.find({"metadata.experiment_type": "Accelerated stress test"})
# Convert cursor to list of dictionaries
data = list(cursor)
# Convert to Pandas DataFrame
df = pd.DataFrame(data)
# Print only the first few rows
df.head()


In [None]:
# Shorter version
cursor = collection.find({"metadata.testbench": "BZ011"})
pd.DataFrame(list(cursor)).head()

In [None]:


from datetime import datetime
query_date = datetime(2024, 8, 5, 13, 30, 0)
# Query data taken after a certain date / time
cursor = collection.find({
    "Datum": {"$gt": query_date}
})

pd.DataFrame(list(cursor)).head()

In [None]:
# We can combine multiple queries to further specify which documents we want
cursor = collection.find({
    "metadata.testbench": "BZ011",
    "Datum": {"$gt": query_date}
})
pd.DataFrame(list(cursor)).head()
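
Query filters are ordinary Python dictionaries, so they can be assembled and inspected before they are sent to the database. As a sketch, the hypothetical helper `time_window_filter` below builds a filter for a time window by combining the `$gte` and `$lt` operators on the `Datum` field. The dates are arbitrary examples.

```python
from datetime import datetime


def time_window_filter(start, end, testbench=None):
    """Build a filter for start <= Datum < end, optionally for one test bench."""
    query = {"Datum": {"$gte": start, "$lt": end}}
    if testbench is not None:
        query["metadata.testbench"] = testbench
    return query


window = time_window_filter(
    datetime(2024, 8, 5, 13, 30, 0),
    datetime(2024, 8, 5, 14, 0, 0),
    testbench="BZ011",
)
print(window)
```

The returned dictionary can be passed to `collection.find()` or `collection.count_documents()` like the filters above.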

In [None]:

# We can also just get the number of documents that match a condition
count = collection.count_documents({
    "Datum": {"$gt": query_date}
})
print(f"Number of documents recorded after {query_date}: {count}")