# File formats

In principle, Niimpy can deal with files of any format: you only need to convert them to a DataFrame. Still, it is very useful to have some common formats, so we provide default readers for the following:

* **CSV files** are simple to create and understand, but everything must be loaded into memory in order to work with them.
* **sqlite3 databases** require sqlite3 to read, but provide more power for filtering and automatic processing without reading everything into memory.
* **Google TakeOut** provides a large selection of data in different formats. We provide readers for the most commonly used data types.
* **MHealth** is a common format for health data.

## DataFrame format (in-memory)

In memory, data is stored in a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). It is an ordinary DataFrame with some standardized columns (see the [schema](schema.html)) and a DatetimeIndex as the index.
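
As a rough illustration of this layout, the sketch below builds a small DataFrame by hand with plain pandas. The `user` and `battery_level` columns and all values are made up for the example; the schema page remains the authoritative list of column names.

```python
import pandas as pd

# Hypothetical observations: two users reporting battery levels.
records = [
    {"time": "2023-11-20 08:00:00", "user": "u1", "battery_level": 80},
    {"time": "2023-11-20 08:15:00", "user": "u1", "battery_level": 78},
    {"time": "2023-11-20 08:05:00", "user": "u2", "battery_level": 55},
]
df = pd.DataFrame(records)

# Move the timestamps out of the columns and into a timezone-aware DatetimeIndex.
timestamps = pd.to_datetime(df.pop("time"))
df.index = pd.DatetimeIndex(timestamps).tz_localize("Europe/Helsinki")
df = df.sort_index()

print(type(df.index).__name__)
print(df)
```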

## CSV files

CSV files should have a header that lists the column names and generally be readable by `pandas.read_csv`.

Reading these can be done with `niimpy.read_csv`:

In [1]:
import os
import niimpy 
import niimpy.config as config

# Read the battery data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')
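
To make the expected file layout concrete, the following sketch parses a tiny CSV with a header row using plain `pandas.read_csv`. The column names (`time` in unix seconds, `user`, `battery_level`) are illustrative assumptions, and the timestamp conversion is done by hand here rather than by Niimpy.

```python
import io

import pandas as pd

# A minimal CSV: the first line is the header with the column names.
csv_text = """time,user,battery_level
1700460000,u1,80
1700460900,u1,78
1700460300,u2,55
"""

df = pd.read_csv(io.StringIO(csv_text))

# Convert unix seconds into a timezone-aware DatetimeIndex.
utc_times = pd.to_datetime(df["time"], unit="s", utc=True)
df.index = pd.DatetimeIndex(utc_times).tz_convert("Europe/Helsinki")

print(df)
```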



## sqlite3 databases

For the purposes of niimpy, sqlite3 databases can generally be seen as supercharged CSV files.

A single database file can contain multiple datasets, so a **table name** must be specified when reading.

You can read an entire table into memory using `niimpy.read_sqlite`:

In [2]:
# Read the sqlite3 data
df = niimpy.read_sqlite(config.SQLITE_SINGLEUSER_PATH, table="AwareScreen", tz='Europe/Helsinki')

You can list the tables within a database using `niimpy.reading.read.read_sqlite_tables`:

In [3]:
niimpy.reading.read.read_sqlite_tables(config.SQLITE_SINGLEUSER_PATH)

{'AwareScreen'}

sqlite3 files are highly recommended as a data storage format, since many common exploration operations can be done within the database itself, without reading all of the data into memory or writing an iterator.  However, the interface is more difficult to use.  Before 2021-07, Niimpy used this as its primary interface; since then it has been de-emphasized.  You can read more in [the database section](database.html), but this is only recommended if you need efficiency when working with massive amounts of data.
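
The following sketch illustrates what working "within the database itself" can look like, using only Python's built-in `sqlite3` module rather than Niimpy's own interface. The table name `AwareScreen` matches the example above, but the columns and rows are invented for the illustration.

```python
import sqlite3

# Build a throwaway in-memory database with hypothetical screen events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AwareScreen (time REAL, user TEXT, screen_status INTEGER)")
conn.executemany(
    "INSERT INTO AwareScreen VALUES (?, ?, ?)",
    [(1.0, "u1", 1), (2.0, "u1", 0), (3.0, "u2", 1), (4.0, "u2", 1)],
)

# List the tables stored in the database file.
tables = {row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}

# Filter and aggregate inside the database; only the small result reaches Python.
query = "SELECT user, COUNT(*) FROM AwareScreen WHERE screen_status = 1 GROUP BY user"
counts = dict(conn.execute(query))
conn.close()

print(tables)
print(counts)
```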

## Google TakeOut

Google TakeOut contains many different types of data, and new types are added as Google creates services or changes data storage methods. Readers are currently available for location data, emails, chat messages, YouTube watch history, and activity data from the Fit app. For other data types, the user needs to manually convert them into a Niimpy-compatible pandas DataFrame.

In [4]:
# Data downloaded from Google Takeout is compressed as a zip archive to conserve disk space.
# To demonstrate reading from a zip file, we first compress our example data into the zip format.
import zipfile

with zipfile.ZipFile("test.zip", mode="w") as test_zip:
    for dirpath, dirs, files in os.walk(config.GOOGLE_TAKEOUT_DIR):
        for f in files:
            filename = os.path.join(dirpath, f)
            # Store each file under its path relative to the Takeout directory.
            filename_in_zip = os.path.relpath(filename, config.GOOGLE_TAKEOUT_DIR)
            test_zip.write(filename, filename_in_zip)


In [5]:
# Next we read location data from the zip file.
import niimpy
import niimpy.config as config
import niimpy.preprocessing.location as nilo

data = niimpy.reading.google_takeout.location_history("test.zip")
data = nilo.filter_location(
    data,
    latitude_column="latitude",
    longitude_column="longitude",
    remove_disabled=False,
    remove_network=False,
    remove_zeros=True,
)
data


timestamp,accuracy,source,device,placeid,formfactor,altitude,verticalaccuracy,platformtype,servertimestamp,devicetimestamp,batterycharging,velocity,heading,latitude,longitude,inferred_latitude,inferred_longitude,activity_type,activity_inference_confidence,user
2016-08-12 19:29:43.821000+00:00,25,WIFI,-577680260,,,,,,,,,,,35.997488,-78.922194,,,,,92581690-cfd6-11ee-bc05-b0dcef010c43
2016-08-12 19:30:49.531000+00:00,21,WIFI,-577680260,,,,,,,,,,,35.997559,-78.922504,,,STILL,62.0,92581690-cfd6-11ee-bc05-b0dcef010c43
2016-08-12 19:31:49.531000+00:00,21,WIFI,-577680260,ChIJS_5Nmuz1jUYRGYf3QiiZco4,PHONE,,,,,,,,,35.997559,-78.922504,60.187135,24.824478,STILL,62.0,92581690-cfd6-11ee-bc05-b0dcef010c43
2016-08-12 21:15:55.295000+00:00,1500,CELL,-577680260,,,,,,,,,,,36.00087,-78.923343,,,ON_FOOT,54.0,92581690-cfd6-11ee-bc05-b0dcef010c43
2016-08-12 21:16:33+00:00,8,GPS,-577680260,,,,,,,,,,,35.99725,-78.923989,,,,,92581690-cfd6-11ee-bc05-b0dcef010c43
2016-08-12 21:16:48+00:00,3,GPS,-577680260,,,,,,,,,,,35.997236,-78.924124,,,,,92581690-cfd6-11ee-bc05-b0dcef010c43
2023-11-21 11:29:21.730000+00:00,13,WIFI,1832214273,ChIJw1WKQev1jUYRCdZmYR-HCiI,PHONE,28.0,2.0,ANDROID,2023-11-21T11:29:24.747Z,2023-11-21T11:29:24.350Z,False,,,60.186818,24.821288,60.186816,24.821288,,,92581690-cfd6-11ee-bc05-b0dcef010c43


Activity data is read similarly. The data contains many columns with missing data, so in order to use the step count data, for example, we must set the NaN values to 0. 

In [6]:
data = niimpy.reading.google_takeout.activity("test.zip")
data.loc[data["step_count"].isna(), "step_count"] = 0
data[["calories_(kcal)", "step_count"]]

timestamp,calories_(kcal),step_count
2023-11-20 00:00:00+02:00,18.252604,0.0
2023-11-20 00:15:00+02:00,18.252604,0.0
2023-11-20 00:30:00+02:00,18.252604,0.0
2023-11-20 00:45:00+02:00,18.252604,0.0
2023-11-20 01:00:00+02:00,18.252604,0.0
...,...,...
2023-11-21 22:45:00+02:00,18.252604,0.0
2023-11-21 23:00:00+02:00,18.252604,0.0
2023-11-21 23:15:00+02:00,18.252604,0.0
2023-11-21 23:30:00+02:00,18.252604,0.0


The `google_takeout.email_activity` and `google_takeout.chat` functions read and process all emails in the Gmail mailbox and all Google Chat messages, respectively. They return a DataFrame containing metadata and statistics for each message. Email addresses, email IDs and names are replaced by numerical indexes.
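
To illustrate the idea of replacing identifiers with numerical indexes, here is a minimal sketch in plain Python. It is not Niimpy's implementation; the helper name `pseudonymize` and the addresses are hypothetical.

```python
def pseudonymize(values):
    """Map each distinct value to an integer, numbered in order of first appearance."""
    mapping = {}
    ids = [mapping.setdefault(value, len(mapping)) for value in values]
    return ids, mapping


# Hypothetical sender addresses; repeated addresses receive the same index.
senders = ["alice@example.com", "bob@example.com", "alice@example.com"]
ids, mapping = pseudonymize(senders)

print(ids)
```

The same index for two messages then means the same sender, without the address itself appearing in the data.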

The email files can be large and processing them could take some time. You can also include sentiment analysis of each email using the `sentiment` parameter. For this, we recommend using a system with a GPU.

In [7]:
niimpy.reading.google_takeout.email_activity("test.zip")



timestamp,received,from,to,cc,bcc,message_id,in_reply_to,character_count,word_count,user
2023-12-15 12:19:43+00:00,NaT,0,[5],[],[],1,,33,6,925e80c0-cfd6-11ee-bc05-b0dcef010c43
2023-12-15 12:29:43+00:00,NaT,0,"[1, 5]",[],[],0,,31,6,925e80c0-cfd6-11ee-bc05-b0dcef010c43
2023-12-15 12:29:43+00:00,NaT,0,"[1, 5]",[],[],0,,28,5,925e80c0-cfd6-11ee-bc05-b0dcef010c43
2023-12-15 12:39:43+00:00,2023-12-15 12:19:43+00:00,1,[0],[3],[3],3,1.0,30,5,925e80c0-cfd6-11ee-bc05-b0dcef010c43
"Sat, 15 Dec 202Not a time3 12:39:43 0000",2023-12-15 12:19:43+00:00,1,[0],[3],[3],3,1.0,51,7,925e80c0-cfd6-11ee-bc05-b0dcef010c43
2023-12-15 12:39:43+00:00,"Sat, 15 DeNot a timec 2023 12:19:43 0000",1,[0],[3],[3],3,1.0,51,7,925e80c0-cfd6-11ee-bc05-b0dcef010c43


In [8]:
niimpy.reading.google_takeout.chat("test.zip", sentiment=True)





timestamp,topic_id,message_id,chat_group,creator_name,creator_email,creator_user_type,character_count,word_count,user,sentiment,sentiment_score
2024-01-30 13:27:33+00:00,iDImYGRudHk,9guW_0AAAAE/iDImYGRudHk/iDImYGRudHk,0,0,0,Human,5,1,,none,0.0
2024-01-30 13:29:10+00:00,cVEoT9zu63M,9guW_0AAAAE/cVEoT9zu63M/cVEoT9zu63M,0,1,2,Human,5,1,,none,0.0
2024-01-30 13:29:17+00:00,qEfkUgUvX80,9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80,0,1,2,Human,11,3,,positive,0.53531
2024-01-30 13:29:17+00:00,qEfkUgUvX80,9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80,0,0,0,Human,22,5,,positive,0.912528


Finally, we have a reader for extracting YouTube watch history data. By default, we do not return video identifiers, but replace them with numerical IDs. The only other available information is the recorded time, which corresponds to the video start time.

Importantly, we have no information on how long the user watched a given video, as this is not stored in the TakeOut data. You can still deduce whether the user has rewatched a given video, watched multiple videos in a row, or started another video quickly without finishing the previous one.

In [9]:
niimpy.reading.google_takeout.youtube_watch_history("test.zip")

timestamp,video_title,channel_title,user
2024-02-13 08:36:49+02:00,0,0,9322da7e-cfd6-11ee-bc05-b0dcef010c43
2024-02-13 08:36:05+02:00,1,1,9322da7e-cfd6-11ee-bc05-b0dcef010c43
2024-02-13 08:35:38+02:00,2,2,9322da7e-cfd6-11ee-bc05-b0dcef010c43
2024-02-13 08:35:03+02:00,0,0,9322da7e-cfd6-11ee-bc05-b0dcef010c43
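
As a sketch of the deductions mentioned above, the following uses plain pandas on a hand-made frame shaped like this output (a time index plus a numerical `video_title` column). The 60-second threshold for a "quick" switch is an arbitrary assumption.

```python
import pandas as pd

# Stand-in for the watch history: numerical video IDs indexed by start time.
history = pd.DataFrame(
    {"video_title": [0, 1, 2, 0]},
    index=pd.to_datetime(
        [
            "2024-02-13 08:36:49+02:00",
            "2024-02-13 08:36:05+02:00",
            "2024-02-13 08:35:38+02:00",
            "2024-02-13 08:35:03+02:00",
        ]
    ),
).sort_index()

# A row is a rewatch if the same video ID already appeared earlier in time.
history["rewatch"] = history["video_title"].duplicated()

# Seconds until the next video was started (NaN for the last row).
starts = history.index.to_series()
history["seconds_until_next"] = (starts.shift(-1) - starts).dt.total_seconds()

# Flag videos that were followed by another start within the assumed threshold.
history["quick_switch"] = history["seconds_until_next"] < 60

print(history)
```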


Since Google TakeOut may provide the mailbox as a single uncompressed file, it is also possible to provide its file path directly.

In [10]:
path = os.path.join(config.GOOGLE_TAKEOUT_DIR, "Takeout", "Mail", "All mail Including Spam and Trash.mbox")
niimpy.reading.google_takeout.email_activity(path, sentiment=True)



Running sentiment analysis on 6 messages.




timestamp,received,from,to,cc,bcc,message_id,in_reply_to,character_count,word_count,user,sentiment,sentiment_score
2023-12-15 12:19:43+00:00,NaT,0,[4],[],[],1,,33,6,9325966a-cfd6-11ee-bc05-b0dcef010c43,positive,0.993223
2023-12-15 12:29:43+00:00,NaT,0,"[1, 4]",[],[],0,,31,6,9325966a-cfd6-11ee-bc05-b0dcef010c43,negative,0.980209
2023-12-15 12:29:43+00:00,NaT,0,"[1, 4]",[],[],0,,28,5,9325966a-cfd6-11ee-bc05-b0dcef010c43,negative,0.968588
2023-12-15 12:39:43+00:00,2023-12-15 12:19:43+00:00,1,[0],[2],[2],3,1.0,30,5,9325966a-cfd6-11ee-bc05-b0dcef010c43,positive,0.997529
"Sat, 15 Dec 202Not a time3 12:39:43 0000",2023-12-15 12:19:43+00:00,1,[0],[2],[2],3,1.0,51,7,9325966a-cfd6-11ee-bc05-b0dcef010c43,neutral,0.477333
2023-12-15 12:39:43+00:00,"Sat, 15 DeNot a timec 2023 12:19:43 0000",1,[0],[2],[2],3,1.0,51,7,9325966a-cfd6-11ee-bc05-b0dcef010c43,neutral,0.477333


Each subject downloads their Google TakeOut data as a separate zip file. The `zipfile` module, which is part of the Python standard library, is convenient for reading the data files contained in the archive. For example, one could read the location data with the following code:

In [12]:
from zipfile import ZipFile
import json
import pandas as pd

zip_file = ZipFile("test.zip")
json_data = zip_file.read("Takeout/Location History (Timeline)/Records.json")
json_data = json.loads(json_data)
# json_normalize already returns a DataFrame, flattening nested fields into columns.
data = pd.json_normalize(json_data["locations"])
data.head()

,latitudeE7,longitudeE7,accuracy,source,deviceTag,timestamp,activity,locationMetadata,placeId,formFactor,...,deviceDesignation,altitude,verticalAccuracy,platformType,osLevel,serverTimestamp,deviceTimestamp,batteryCharging,velocity,heading
0,359974880,-789221943,25,WIFI,-577680260,2016-08-12T19:29:43.821Z,,,,,...,,,,,,,,,,
1,359975588,-789225036,21,WIFI,-577680260,2016-08-12T19:30:49.531Z,"[{'activity': [{'type': 'STILL', 'confidence':...",,,,...,,,,,,,,,,
2,359975588,-789225036,21,WIFI,-577680260,2016-08-12T19:31:49.531Z,"[{'activity': [{'type': 'STILL', 'confidence':...",[{'wifiScan': {'accessPoints': [{'mac': '12410...,ChIJS_5Nmuz1jUYRGYf3QiiZco4,PHONE,...,,,,,,,,,,
3,360008703,-789233433,1500,CELL,-577680260,2016-08-12T21:15:55.295Z,"[{'activity': [{'type': 'ON_FOOT', 'confidence...",,,,...,,,,,,,,,,
4,359972502,-789239894,8,GPS,-577680260,2016-08-12T21:16:33Z,,,,,...,,,,,,,,,,
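
The raw records differ from the output of `location_history` shown earlier: coordinates appear as integers in `latitudeE7` and `longitudeE7`, that is, degrees multiplied by 10^7, and timestamps are ISO 8601 strings. A minimal sketch of converting them with pandas, using the first two rows above as stand-in data:

```python
import pandas as pd

# Two raw records copied from the table above.
raw = pd.DataFrame(
    {
        "latitudeE7": [359974880, 359975588],
        "longitudeE7": [-789221943, -789225036],
        "timestamp": ["2016-08-12T19:29:43.821Z", "2016-08-12T19:30:49.531Z"],
    }
)

# E7 values are degrees scaled by 10**7, so divide to get decimal degrees.
raw["latitude"] = raw["latitudeE7"] / 1e7
raw["longitude"] = raw["longitudeE7"] / 1e7

# Parse the ISO 8601 strings into a UTC DatetimeIndex.
raw.index = pd.DatetimeIndex(pd.to_datetime(raw["timestamp"], utc=True))

print(raw[["latitude", "longitude"]])
```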


Location data is stored in JSON format. Other types of data are stored in various formats and with different file structures. The user must find out how each type of data they need is stored and how it can be read in Python.
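
One way to find out how an archive is organised is to list its members before reading anything. The sketch below builds a small in-memory archive so that it runs without real TakeOut data; apart from the `Records.json` path used above, the member names are made up.

```python
import io
import json
from zipfile import ZipFile

# Build a tiny stand-in archive in memory.
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as archive:
    archive.writestr(
        "Takeout/Location History (Timeline)/Records.json",
        json.dumps({"locations": []}),
    )
    archive.writestr("Takeout/Example/notes.csv", "a,b\n1,2\n")  # hypothetical member

with ZipFile(buffer) as archive:
    # Group member names by file extension to see which formats are present.
    by_extension = {}
    for name in archive.namelist():
        extension = name.rsplit(".", 1)[-1]
        by_extension.setdefault(extension, []).append(name)

    # Read one member once its path and format are known.
    records = json.loads(archive.read("Takeout/Location History (Timeline)/Records.json"))

print(by_extension)
```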

## MHealth

We have implemented readers for three data types formatted according to the [MHealth schema](https://www.openmhealth.org/documentation/#/schema-docs/schema-library): total sleep time, heart rate and geolocation. Other data types may be added as needed.

In [None]:
# Reading total sleep time data:
filename = config.MHEALTH_TOTAL_SLEEP_TIME_PATH
niimpy.reading.mhealth.total_sleep_time_from_file(filename)

timestamp,descriptive_statistic,descriptive_statistic_denominator,date,part_of_day,total_sleep_time,start,end
2016-02-06 04:35:00+00:00,,,,,0 days 07:45:00,2016-02-06 04:35:00+00:00,2016-02-06 14:35:00+00:00
2016-02-05 15:00:00+00:00,average,d,,,0 days 07:15:00,2016-02-05 15:00:00+00:00,2016-06-06 15:00:00+00:00
2013-01-26 07:35:00+00:00,,,,,0 days 03:00:00,2013-01-26 07:35:00+00:00,2013-02-05 07:35:00+00:00
2013-02-05 00:00:00,,,2013-02-05 00:00:00,evening,0 days 03:00:00,NaT,NaT


In [None]:
# Reading heart rate data:
filename = config.MHEALTH_HEART_RATE_PATH
niimpy.reading.mhealth.heart_rate_from_file(filename)

timestamp,temporal_relationship_to_sleep,heart_rate,effective_time_frame.date_time,descriptive_statistic,start,end
2023-11-20 07:25:00-08:00,on waking,70,2023-11-20T07:25:00-08:00,,,NaT
2023-12-20 01:50:00-08:00,on waking,65,,,2023-12-20 09:50:00+00:00,2023-12-20 10:00:00+00:00
2023-12-19 19:50:00-08:00,during sleep,35,,average,2023-12-20 03:50:00+00:00,2023-12-20 04:00:00+00:00


In [None]:
# Reading geolocation data:
filename = config.MHEALTH_GEOLOCATION_PATH
niimpy.reading.mhealth.geolocation_from_file(filename)

,positioning_system,latitude,latitude.unit,longitude,longitude.unit,effective_time_frame.time_interval.start_date_time,effective_time_frame.time_interval.end_date_time,elevation.value,elevation.unit
0,GPS,60.1867,deg,24.8283,deg,2016-02-05T20:35:00-08:00,2016-02-06T06:35:00-08:00,,
1,GPS,60.1867,deg,24.8283,deg,2016-02-05T20:35:00-08:00,2016-02-06T06:35:00-08:00,20.4,m
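
Unlike the other readers shown here, the geolocation output above is indexed by row number. If a time index is needed, one option is to build it from the interval start column. This sketch uses a hand-made frame with the column names displayed above; it is not part of the Niimpy reader.

```python
import pandas as pd

# Stand-in for the geolocation output, keeping only a few of its columns.
start_column = "effective_time_frame.time_interval.start_date_time"
geo = pd.DataFrame(
    {
        "latitude": [60.1867, 60.1867],
        "longitude": [24.8283, 24.8283],
        start_column: ["2016-02-05T20:35:00-08:00", "2016-02-05T20:35:00-08:00"],
    }
)

# Parse the interval start times and use them as a UTC DatetimeIndex.
geo.index = pd.DatetimeIndex(pd.to_datetime(geo[start_column], utc=True))
geo.index.name = "timestamp"

print(geo[["latitude", "longitude"]])
```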


## Other formats

You can add readers for any format that can be converted into a pandas DataFrame (so basically anything).  For examples of readers, see `niimpy/reading/read.py`.  Apply the function `niimpy.preprocessing.util.df_normalize` to standardize the result into the standard Niimpy format.
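
A custom reader can be as small as a function that parses a file and returns a DataFrame with a time index. The sketch below reads a hypothetical JSON Lines format (one JSON object per line, with `time` in unix seconds). The function name `read_jsonl` and the field names are assumptions for illustration and are not part of Niimpy.

```python
import io
import json

import pandas as pd


def read_jsonl(file_obj, tz="Europe/Helsinki"):
    """Parse JSON Lines into a DataFrame indexed by timezone-aware timestamps."""
    rows = [json.loads(line) for line in file_obj if line.strip()]
    df = pd.DataFrame(rows)
    utc_times = pd.to_datetime(df["time"], unit="s", utc=True)
    df.index = pd.DatetimeIndex(utc_times).tz_convert(tz)
    return df.sort_index()


# Hypothetical input with two out-of-order records.
example = io.StringIO(
    '{"time": 1700460300, "user": "u2", "value": 3}\n'
    '{"time": 1700460000, "user": "u1", "value": 5}\n'
)
df = read_jsonl(example)

print(df)
```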