Staff of the Data Management and Analysis (DMA) practice support internal and external clients by compiling, organizing, and standardizing data so that they are available and suitable for technical analyses; by conducting geospatial analyses and producing map figures; by carrying out a variety of data summarization and data analysis tasks including statistical analyses, graphic visualization, and data modeling; and by developing websites for presentation of data and analytical results.

Data are the foundation of most of the decision support services that Integral provides to its clients. Our interpretations, assessments, judgments, and recommendations can only be as good as the underlying data. Because data are often voluminous and complex, Integral's DMA practice has established goals, standards, practices, and tools for handling data efficiently and consistently.

## Goals

Data management standards and practices are designed to meet the following goals:

:::{.grid}
:::{.g-col-10}

- Establish and maintain the highest level of data quality that is consistent with the needs of each project.  [Six dimensions of data quality](/docs/data-management/data-quality.qmd#data-quality-dimensions), and the approaches to handling them, are described on the [Data Quality](./data-quality.qmd) page.
- Establish data security, ensuring that clients' data are available only to authorized project staff and that data revisions are controlled and documented.
- Ensure that data are accessible and available to project staff either directly or with the assistance of experienced data managers.
- Carry out data summarization and analysis using consistent, efficient, and technically appropriate methods, including both standard and cutting-edge analysis methods.
- Document the provenance, handling, and history of each data set, providing traceability analogous to chain-of-custody procedures for digital data.

:::
:::{.g-col-2}

![](/static/data-management/data-to-action.svg){height=360 fig-align="left"}

:::
:::

## Staff

The DMA practice includes the following {{< fa people >}} staff with specialties in the management, analysis, and presentation of data:


In [None]:
import polars as pl
import itables
from ipyleaflet import (
    Heatmap,
    Map,
    Marker,
    basemap_to_tiles,
    basemaps,
)
from ipywidgets import HTML

m = Map(
    basemap=basemap_to_tiles(basemaps.OpenStreetMap.Mapnik),
    center=(40, -98),
    zoom=3,
)

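# Workbook with one sheet of office locations and one sheet of DMA staff.
dm_staff = "../../static/staff.xlsx"
offices = pl.read_excel(dm_staff, sheet_name="Offices")
locations = offices.to_dicts()
staff = pl.read_excel(dm_staff, sheet_name="Data Management")
# One row per staff member, with the coordinates of that person's office.
df = offices.join(staff, on="Location Code", how="inner")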

# Add one marker per office that has DMA staff; its popup lists those staff.
for location in locations:
    location_staff = staff.filter(pl.col("Location Code") == location["Location Code"])
    if not location_staff.is_empty():
        names = "".join(
            f"<li>{person['Staff']}</li>" for person in location_staff.to_dicts()
        )
        marker = Marker(
            location=(location["Latitude"], location["Longitude"]),
            draggable=False,
            title=str(location["Location Code"]),
        )
        marker.popup = HTML(f"<b>{location['Location Code']}</b><ul>{names}</ul>")
        m.add_layer(marker)

m.add_layer(
    Heatmap(
        locations=[
            (person["Latitude"], person["Longitude"]) for person in df.to_dicts()
        ],
        radius=20,
    )
)

m

In [None]:
itables.show(
    df.select(
        pl.col(
            [
                "State",
                "City",
                "Staff",
                "Title / Resume",
                "Database Management",
                "GIS",
                "Programming",
                "Statistics",
                "Web Development",
            ]
        )
    ).sort(["State", "City", "Staff"]),
    classes="display nowrap compact table table-striped",
)

## Data Management Standards

The following data management standards help ensure that data management activities are carried out efficiently, consistently, flexibly, and reliably within and across projects.

- Use of a centralized, standardized, authoritative database (ordinarily IDB for environmental data).
- Adherence to [data management best practices](./best-practices.qmd).
- [Standard data summary routines](./data-summaries.qmd#standard-data-summaries) to make data easily available.
- A process for [producing custom data summaries](./data-summaries.qmd#producing-custom-data-summaries) in response to user requests.
- Use of [scripts](./processes/scripting-data-management-operations.qmd) to perform all changes to data.
- Documentation
  - [Data Management Plan](./processes/data-management-planning.qmd#data-management-plan)
  - Completed [checklists](./processes/data-management-checklists.qmd)
  - [Metadata for data requests](./data-summaries.qmd#producing-custom-data-summaries)
  - [Script documentation](../development/code/sql.qmd#script-header)
  - [Issue logs](./data-issues.qmd)
- [Templates for entry and uploading of field sampling information](./processes/electronic-data-deliverables.qmd).
- [Checklists](./processes/data-management-checklists.qmd) for project initiation, data set evaluation, and other activities.
- [EDD templates for laboratory analytical results](./processes/electronic-data-deliverables.qmd)
- [ArcGIS and CAD templates](../gis/resources/gis-and-cad-templates.qmd) for figure production.
- [Default rules for summarizing chemical data](./data-summaries.qmd#summarization-of-chemistry-data).

## Data Accessibility

:::{.grid}
:::{.g-col-8}

For most projects, data are stored in a centralized PostgreSQL database (see below and the IDB wiki page). Project staff can access data in this database directly, although some knowledge of SQL and of the data structure may be required. Login credentials are also required; the project data manager can assign these as needed.

Data can also be accessed, and summarized in various forms, through:

- Several types of [standard data summaries](./data-summaries.qmd#standard-data-summaries) that may be maintained for a project.
- [Using graphical query builders](../tools/gui/query-builders.qmd#using-graphical-query-builders-with-idb).
- [Requests](./data-summaries.qmd#producing-custom-data-summaries) made directly to data managers and GIS staff.
- Custom [Shiny web interfaces](https://envision.integral-corp.com).
- A [web interface (IWeb)](../tools/gui/iweb-web-interface-to-idb.qmd), if one has been set up for the project.
- A centralized QGIS map file containing spatial and tabular data of general use for project tasks.
- [Directly from QGIS](../gis/using-idb-with-qgis.qmd).
- [Directly from R](../development/code/r.qmd#using-idb-data-with-r).

More information on data accessibility can be found on the [Data Accessibility](./data-accessibility.qmd) page.

:::
:::{.g-col-4}

![](/static/data-management/navicat2.jpg){height=400 fig-align="left"}

:::
:::

## Databases

:::{.grid}
:::{.g-col-8}

Integral has developed a general-purpose database ([IDB](./idb/index.qmd#integrals-custom-database)) that is used as a centralized, standardized, authoritative repository for environmental measurement data. IDB is capable of storing chemical, physical, and biological data from both simple and complex environmental investigations. Software is built into the database to automatically apply standard rules for [data summarization](./data-summaries.qmd#summarization-of-chemistry-data). Data management staff maintain an additional library of standard data summarization procedures designed for IDB that will reliably produce consistent data summaries within and across projects. Integral's project web interface ([IWeb](../tools/gui/iweb-web-interface-to-idb.qmd)) is built to work with IDB, and provides rapid and flexible access to data for technical staff and project managers from any location. The database is built on client-server technology so that data managers (and other technical staff) in any of Integral's offices can efficiently access any project's data. Standardization of the database structure and of scripts for summarizing data allows data managers to easily share work and support one another, thereby helping to level workloads and reduce project delays that might otherwise result from limited staff availability.

IDB is completely under Integral's control, and can be extended and customized as necessary to meet the needs of new projects. Although IDB is not necessarily appropriate for all projects, it should be the first option to consider for management of environmental characterization data.

A stripped-down [chemistry-only database](./idb/chemistry-only-database.qmd) is also available for projects with limited requirements.  Other custom databases are used or developed as needed by individual projects.  When adopting databases provided by clients or other consultants, good practice is to review the structure and content of those databases for the [six dimensions of data quality](./data-quality.qmd#data-quality-dimensions).

:::
:::{.g-col-4}

![](/static/data-management/simplest-chem-erd.svg){height=300 fig-align="left"}

:::
:::

## Data Management Activities

Data management is not a single activity that occurs at a discrete point in a project. Data management consists of a set of interrelated activities--acquisition, organization, summarization, analysis, and presentation of data, among others--that runs through, and connects, other project tasks. Project tasks, and their data management elements, are described briefly below.

![](/static/data-management/field-data-dfd.svg){fig-align="center"}

### Project Planning

Incorporation of data management planning into overall project planning helps to ensure that appropriate staff, tools, and budget are available, and that schedules and responsibilities are aligned so that data quality objectives are met. The pages on data management plans, data management checklists, workflows for sampling and historical data, and budgeting provide additional information and tools.

### Data Acquisition

Project data may be obtained from a variety of sources, each needing different strategies to identify, obtain, review, and standardize the data. Data management staff can lead, guide, or perform these activities.

### Data Quality Assessment

Assessment of data quality may be required at several different points in a project, and may vary depending on the type and origin of the data. Data management staff work closely with analytical chemists, quality assurance specialists, and other technical staff to perform or support screening-level and in-depth technical analyses of data quality. The Data Quality page contains more in-depth information on the dimensions of data quality and our approaches to assessing, improving, and documenting it.

![](/static/data-management/qa-coding.jpg){fig-align="center"}

### Data Organization, Standardization, and Centralization

Core elements of data quality are that data are unambiguous, do not contain internal inconsistencies, and are centrally available to avoid problems resulting from multiple versions of a data set. Organization of data to meet these requirements is one of the core functions of data management activities. Key goals of these data management activities are to ensure that data are:

- As correct as possible, internally consistent, and unambiguous
- Handled consistently across all Integral projects
- Centralized to prevent proliferation of multiple inconsistent versions
- Readily available to project technical staff in whatever formats are needed
- Managed efficiently and consistently across changes in staff availability.

The IDB database is the default tool for standardizing and centralizing environmental characterization data.  Multiple interfaces are available to make data available to all project staff.

The DMA practice uses databases instead of spreadsheets because databases provide better data integrity, speed, documentation, replicability of operations, ease of quality assurance review of operations, security, and automated backups.  Information and recommendations regarding the use of spreadsheets are provided on the page Managing Data With Spreadsheets.  A comparison of SQL (Structured Query Language, used with databases) and spreadsheets is also provided by Codecademy.
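
As a small illustration of the data-integrity point, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a server database. The table and column names (`sample`, `result`, `concentration`) are hypothetical and are not taken from IDB. It shows declared constraints rejecting a duplicate result, a result for a nonexistent sample, and a negative concentration, all of which a spreadsheet would accept without complaint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE sample (
        sample_id TEXT PRIMARY KEY,
        collection_date TEXT NOT NULL
    );
    CREATE TABLE result (
        sample_id TEXT NOT NULL REFERENCES sample (sample_id),
        analyte TEXT NOT NULL,
        concentration REAL CHECK (concentration >= 0),
        PRIMARY KEY (sample_id, analyte)
    );
    """
)

# Load and commit two valid rows (hypothetical values).
with conn:
    conn.execute("INSERT INTO sample VALUES ('S-001', '2024-05-01')")
    conn.execute("INSERT INTO result VALUES ('S-001', 'Lead', 12.5)")


def try_insert(sql):
    """Return True if the statement is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert("INSERT INTO result VALUES ('S-001', 'Lead', 13.0)")
orphan_ok = try_insert("INSERT INTO result VALUES ('S-999', 'Lead', 4.2)")
negative_ok = try_insert("INSERT INTO result VALUES ('S-001', 'Zinc', -1.0)")
row_count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(duplicate_ok, orphan_ok, negative_ok, row_count)
```

Because the rules are declared once in the table definitions, they apply to every script and every user that writes to the table, rather than depending on each person's care.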

### Document and File Management

Documents and files must ordinarily be managed to establish the provenance of project data. In addition, management of documents or files may itself be a separate project goal or requirement. Different tools are appropriate--and available--for document and file management in different circumstances.

![](/static/data-management/document-management.jpg){fig-align="center"}

### Data Summarization, Analysis, Reporting, and Visualization

Many complex data summaries can be automated so that they can be conducted reliably and efficiently, and easily revised and updated. Quality assurance reviews can be more easily conducted on automated (scripted) processes than on point-and-click processes. Database, GIS, and statistical tools all support automation of data summarization, analysis and visualization tasks. A set of standard data summaries can be easily adapted to meet most common data summarization and analysis needs.
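
A minimal sketch of what a scripted summary can look like is shown here, using only the Python standard library. The input data, the column names (`station`, `analyte`, `concentration`), and the choice of statistics are all hypothetical; this is not one of Integral's standard data summaries, and a real script would read from the project database rather than from an inline string.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical long-format results used only for illustration.
RAW = """station,analyte,concentration
A,Lead,12.0
A,Lead,18.0
B,Lead,7.0
A,Zinc,110.0
B,Zinc,95.0
B,Zinc,105.0
"""


def summarize(rows):
    """Group concentrations by (station, analyte) and compute count, mean, and max."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["station"], row["analyte"])].append(float(row["concentration"]))
    return {
        key: {"n": len(vals), "mean": statistics.mean(vals), "max": max(vals)}
        for key, vals in sorted(groups.items())
    }


summary = summarize(csv.DictReader(io.StringIO(RAW)))
for (station, analyte), stats in summary.items():
    print(station, analyte, stats["n"], stats["mean"], stats["max"])
```

Because every step is written down, a reviewer can read the script to check the logic, and the summary can be regenerated unchanged whenever the underlying data are revised.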

![](/static/data-management/vertical-profiles.png){width=60% fig-align="center"}

### Data Exchange with Clients, Agencies, and Other Consultants

When data are to be provided to clients or other consultants, the data should be complete, have well-established data integrity, have documented data quality, and be provided in a well-defined format.

### Project Closeout

When a project is completed, or even when it is put on hold for a lengthy period, there may be tracking and documentation tasks to be completed. Project data may be set to read-only so that it cannot be subsequently modified in any way for any reason. Automatic backups of project data should ordinarily be discontinued and a final archive copy of the complete set of data created. These steps will ensure that the future value of the data will not be lost to the client or to Integral.
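
The following is a file-level sketch of the archive-and-lock idea, assuming the final archive is an exported file; in a database, read-only status would instead be set through the database's own permissions. The file name and contents are hypothetical. It records a SHA-256 checksum of the archive copy and then marks the file read-only, so that later changes can be detected by recomputing the checksum.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def finalize_archive(path):
    """Record a checksum for the archive file, then mark the file read-only."""
    checksum = sha256_of(path)
    os.chmod(path, stat.S_IREAD)  # owner read permission only
    return checksum


# Hypothetical archive file created in a temporary directory for the demonstration.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "project_export.csv"
archive.write_text("sample_id,analyte,concentration\nS-001,Lead,12.5\n")

recorded = finalize_archive(archive)
is_unchanged = sha256_of(archive) == recorded
is_writable = bool(archive.stat().st_mode & stat.S_IWUSR)
print(recorded, is_unchanged, is_writable)
```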

Many project activities such as sample collection, sample analysis, and data analysis occur in a sequence, like beads on a thread.  However, data management activities, like project management activities, run through and connect all of these other activities.

![](/static/data-management/pm-dm-threads.png){width=60% fig-align="center"}

An analogy can also be made to the somatic systems of a vertebrate: there are organs and systems with specialized functions, such as the lungs, liver, kidneys, eyes, skin, and musculoskeletal system, and there are also systems that connect these so that they interact and work together: the nervous system (a project management analogue) and the circulatory system (a data management analogue).

The data management activities that are carried out throughout a project's lifecycle are conducted partly in sequence and partly in parallel.

![](/static/data-management/project-data-lifecycle.svg){width=75% fig-align="center"}

## Data Management Planning

:::{.grid}
:::{.g-col-6}

Planning for data management activities helps to avoid later surprises about the type or amount of work required to support creation of project deliverables.  Data management activities should be integrated with other project activities, and both the tasks to be completed and the level of effort required should be planned so that they contribute to meeting project goals. Planning data management activities early in the project helps to ensure that they are carried out consistently and efficiently, and that appropriate tools and skilled personnel are available. In some cases you may get by without planning, relying only on the skills and experience of Integral's data management staff. In those cases, however, data management activities for your project will most likely be carried out according to the plan for some other project--or maybe even pieces and parts of several different projects if several people work on the project without any common plan.

Data management activities should be incorporated into the project plan. A standalone [data management plan](./processes/data-management-planning.qmd#data-management-plan) (DMP) may be required for projects conducted under the oversight of regulatory agencies, but even when it is not required, a DMP will provide all project staff with a reference and resource for conducting the work consistently and efficiently. Documenting data management standards and approaches is particularly valuable for projects where personnel may change over time. A DMP need not be complex, and may evolve over the course of a project. To simplify the creation of a project-specific DMP, a template is available to use as a foundation. This template describes default Integral standards and approaches for data management activities, and may be useful for project planning even if a project-specific DMP is not required.

[Checklists](./processes/data-management-checklists.qmd) are available to assist in planning of project data management activities.

:::
:::{.g-col-6}

![](/static/data-management/data-workflow.svg){fig-align="center"}

:::
:::

## Documentation

Documentation of project data management activities includes some or all of the following components:

- A [data management plan](./processes/data-management-planning.qmd#data-management-plan) (DMP)
- A [data managers' manual](./processes/data-management-planning.qmd#data-managers-manual) (DMM)
- Data management [SOPs](./processes/data-management-planning.qmd#importing-coordinate-infromation)
- A revision date and the data manager's name automatically recorded on every row of every table
- An audit log that automatically records every addition, deletion, and modification to every table
- Notes in the [header of SQL scripts](../development/code/sql.qmd#script-header)
- [Logs of data issues and resolutions](./data-issues.qmd#data-issue-tracking)
- Logs of script actions automatically created by the [execsql script processor](./processes/scripting-data-management-operations.qmd)
- Custom logs of script actions that may be created by the data manager and some standard scripts.
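
To illustrate how an automatic audit log can work, the sketch below uses triggers in Python's built-in `sqlite3` module as a stand-in for a server database. It is not IDB's audit implementation; the table names, column names, and log layout are hypothetical. Each insert, update, and delete on the `result` table writes a row to `audit_log` without any action by the person making the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE result (
        sample_id TEXT NOT NULL,
        analyte TEXT NOT NULL,
        concentration REAL,
        PRIMARY KEY (sample_id, analyte)
    );
    CREATE TABLE audit_log (
        action TEXT NOT NULL,
        sample_id TEXT,
        analyte TEXT,
        old_value REAL,
        new_value REAL,
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER result_ins AFTER INSERT ON result BEGIN
        INSERT INTO audit_log (action, sample_id, analyte, new_value)
        VALUES ('INSERT', NEW.sample_id, NEW.analyte, NEW.concentration);
    END;
    CREATE TRIGGER result_upd AFTER UPDATE ON result BEGIN
        INSERT INTO audit_log (action, sample_id, analyte, old_value, new_value)
        VALUES ('UPDATE', NEW.sample_id, NEW.analyte, OLD.concentration, NEW.concentration);
    END;
    CREATE TRIGGER result_del AFTER DELETE ON result BEGIN
        INSERT INTO audit_log (action, sample_id, analyte, old_value)
        VALUES ('DELETE', OLD.sample_id, OLD.analyte, OLD.concentration);
    END;
    """
)

# Make three changes; the triggers record each one.
with conn:
    conn.execute(
        "INSERT INTO result (sample_id, analyte, concentration) "
        "VALUES ('S-001', 'Lead', 12.5)"
    )
    conn.execute("UPDATE result SET concentration = 13.0 WHERE sample_id = 'S-001'")
    conn.execute("DELETE FROM result WHERE sample_id = 'S-001'")

log = conn.execute(
    "SELECT action, old_value, new_value FROM audit_log ORDER BY rowid"
).fetchall()
for entry in log:
    print(entry)
```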

See the page on data management [procedures, processes, and workflows](./processes/index.qmd#data-management-procedures-processes-and-workflows) for additional information.

## Data Analysis and Visualization

Data must be managed well so that they can be used effectively, particularly for quantitative analyses and visualizations. Quantitative analyses can make use of various [analytical tools](../data-analysis/tools/index.qmd) and [techniques](../data-analysis/techniques/index.qmd) available to Integral technical staff. A variety of types of analyses and data displays prepared by GIS and data management staff are illustrated in the [data visualization gallery](https://icit.sharepoint.com/sites/wheelhouse/SitePages/TechGraphics-VisualGallery.aspx).

## DMA Resources and Tools

Data management resources and tools are available on the Integral Citrix server at `M:\DataManagement`. Tools, files, and other information available in the `M:\DataManagement` directory include:

- Boilerplate text and presentations on data management topics.
- ExecSQL, a Python program that is the primary tool for automating data management operations.
- Additional software tools for data manipulation, available to network and Citrix users.  Some of these are open-source, others were developed by Integral staff.
- Design documents and tools for Integral's custom environmental database (IDB), including:
  - SQL code to initialize a new PostgreSQL instance of the database
  - Specifications for field and laboratory electronic data deliverables (EDDs), including import procedures and SQL scripts to carry out QA checks and data loading.
  - A library of SQL code to carry out common (or complex) data summarizations
  - A (Python) software tool for running a SQL script on multiple PostgreSQL databases sequentially.
  - An Access template for loading and export of data (using the dbmigrator program, see below)
- Tools to review the inventory of project databases.
- Software tools to assist with manipulation of data files (e.g., automated editing, crosstabbing, and un-crosstabbing of CSV files).
- A (Python) software tool to assist with exporting IDB audit log data.
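
For readers unfamiliar with the terms, crosstabbing turns long-format records into a wide table with one column per category, and un-crosstabbing does the reverse. The sketch below shows both operations with the Python standard library. It is only an illustration, not the Integral tool mentioned in the list above, and the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical crosstabbed table: one row per sample, one column per analyte.
WIDE = """sample_id,Lead,Zinc
S-001,12.5,110
S-002,7.0,
"""


def uncrosstab(text, key="sample_id"):
    """Convert wide rows to (key, analyte, value) records, skipping empty cells."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column != key and value != "":
                records.append((row[key], column, value))
    return records


def crosstab(records):
    """Rebuild a wide table as {key: {analyte: value}} from long records."""
    table = {}
    for sample_id, analyte, value in records:
        table.setdefault(sample_id, {})[analyte] = value
    return table


long_rows = uncrosstab(WIDE)
wide_again = crosstab(long_rows)
print(long_rows)
print(wide_again)
```

Values are kept as text in this sketch so that nothing is altered by numeric conversion; a real tool would also need to handle duplicate keys and units.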

GIS data can be found on the network at `N:\GIS`. Resources available in the `N:\GIS` directory include:

- ArcGIS templates.
- Common data, such as data sets with national scope.
- Project-specific data sets and ArcGIS project files used by Integral's GIS staff.

Some project-specific GIS data may also be found in the project directory in `Working_Files\Map\User_layers`.  These data have generally not yet been QA'd by GIS staff and added to the project-specific data directories under `N:\GIS`.

Resources and information to support data analyses can be found on the network at `M:\DataAnalysis`.  Available information includes:

- Boilerplate text and presentations related to data analysis
- Technical approaches and guidance on a variety of topics
- A library of R scripts for data analysis and presentation
- A library of MATLAB scripts for data analysis