# Digitale Techniken: Some data reading code for weatherstation data
January 2023, J. Kerch and G. Liebs

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" style="height:50px" align="left"/> <br><br>

https://creativecommons.org/licenses/by-nc-sa/4.0/

You may use code from this Jupyter notebook, in particular for your project work, provided you declare it.<br>
Please do not just copy and paste it; adapt it to your needs and write your own comments.

### Before the analysis: consider the metadata of all measurements
- Which data match your own measurement period?
- Which data are outdoor and which are indoor measurements?
- Which data do you think are interesting to look at or comparable with your own data?
- Collect the coordinates of all measurements (3 columns: lat/long/owner; make sure you use the required coordinate format, you might need to convert)
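
The required coordinate format is not stated here. As a minimal sketch, assuming decimal degrees are wanted and some positions were noted in degrees, minutes and seconds, a conversion could look like this. The helper `dms_to_decimal`, the example position and the owner name are made up for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere="N"):
    """Convert degrees/minutes/seconds to decimal degrees (hypothetical helper)."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # southern and western hemispheres are negative in decimal notation
    return -value if hemisphere.upper() in ("S", "W") else value

# made-up example position, not taken from the course data
lat = dms_to_decimal(51, 30, 0, "N")
lon = dms_to_decimal(9, 45, 0, "E")

# one row of the table: lat / long / owner
row = (round(lat, 6), round(lon, 6), "owner_name")
print(row)
```

If your coordinates are already in decimal degrees, as in the data file shown further down, no conversion is needed.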

## Overview

Not all aspects might be relevant to your work.

- Access the data files
- Assess metadata
- Split data file with multiple data sets
- Open and read data file **(the standard way)**
- Read data file to access the data as data frame **(the pandas way)**
- Read the data as a structured array using a conversion function for datetime **(the numpy way)**

In [1]:
pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install pandas



In [3]:
from glob import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from io import StringIO

### Access the data files

In [4]:
# create a search pattern for the data files in txt format
# get the matching files as a sorted list
path = os.path.join("data", "*.txt")
files = sorted(glob(path))

print(path)
print(files)

data/*.txt
['data/data.txt']


### Open and read data file (the standard way)

In [5]:
# open data file by reading it line by line
with open(files[0],'r') as file:
    data_lines = file.readlines()

In [6]:
# how many lines are there (i.e. measurements incl. header lines)?
# the full list is too long to print(): Jupyter would stop with an output rate limit error
len(data_lines)

50781

In [7]:
# quick look at the first 10 lines, note the newline escape sequence
print(data_lines[0:10])

['Time,Longitude,Latitude,Altitude,CO2,p,Light,T,h\n', 'Time,Longitude,Latitude,Altitude,CO2,p,Light,T,h\n', 'Time,Longitude,Latitude,Altitude,CO2,p,Light,T,h\n', '2023-12-02 09:54:00,9.967963,51.405773,0.0,502,97066,28.35,6.52,40.04\n', '2023-12-02 09:54:02,9.967979,51.405761,0.0,478,97074,27.86,6.45,39.98\n', '2023-12-02 09:54:04,9.968000,51.405757,0.0,403,97082,28.35,6.44,39.99\n', '2023-12-02 09:54:06,9.968023,51.405750,0.0,333,97110,28.15,6.38,40.01\n', '2023-12-02 09:54:08,9.968046,51.405746,0.0,273,97127,28.25,6.35,40.02\n', '2023-12-02 09:54:10,9.968069,51.405742,0.0,189,97152,27.86,6.28,40.02\n', '2023-12-02 09:54:12,9.968091,51.405735,0.0,144,97152,27.96,6.28,40.05\n']


In [8]:
# what does the header look like? (header of WiSe2021/22 measurements)
print(data_lines[0])

Time,Longitude,Latitude,Altitude,CO2,p,Light,T,h



In [9]:
# assign the header
header = data_lines[0]

Refer to the earlier notebooks from the course to turn the data lines into columns that represent the data series.
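
As a reminder of how that step could look, here is a minimal sketch. It is not the version from the earlier notebooks. It uses three sample lines copied from the output above, and the names `sample_lines`, `names` and `columns` are chosen only for this example. In the notebook you would loop over `data_lines` instead.

```python
from datetime import datetime

# three sample lines copied from the output above; in the notebook use data_lines instead
sample_lines = [
    'Time,Longitude,Latitude,Altitude,CO2,p,Light,T,h\n',
    '2023-12-02 09:54:00,9.967963,51.405773,0.0,502,97066,28.35,6.52,40.04\n',
    '2023-12-02 09:54:02,9.967979,51.405761,0.0,478,97074,27.86,6.45,39.98\n',
]
sample_header = sample_lines[0]
names = sample_header.strip().split(",")

# one empty list per column, keyed by the column name
columns = {name: [] for name in names}

for line in sample_lines:
    if line == sample_header or not line.strip():
        continue  # skip (repeated) header lines and blank lines
    values = line.strip().split(",")
    for name, value in zip(names, values):
        if name == "Time":
            # timestamps become datetime objects
            columns[name].append(datetime.strptime(value, "%Y-%m-%d %H:%M:%S"))
        else:
            # all other columns are numeric
            columns[name].append(float(value))

print(columns["T"])
```

Skipping every line that equals the header matters here because the header appears several times in this file.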

### Split data file with multiple data sets

In [10]:
# as in an earlier example, several data sets (each with its own header line) are combined in one file
# let's check how many data sets are in this file
# find the indices of the header lines using a "list comprehension"
# e.g. https://www.kite.com/python/answers/how-to-find-the-index-of-list-elements-that-meet-a-condition-in-python

header_indices = [index for index, line in enumerate(data_lines) if line == header]
print(header_indices)

[0, 1, 2, 50314, 50315]


In [11]:
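# alternatively, read the file as one string (not as a list of lines) and use the header line found above
with open(files[0],'r') as file:
    data = file.read()

# printing the whole string exceeds the Jupyter output rate limit (see the message below)
print(data)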

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [12]:
# splitting the data using the header removes the header at the same time
blocks = data.split(header)
len(blocks)

6
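
Five header lines give six blocks, but not six data sets: wherever two header lines follow each other directly, `split` returns an empty string between them. The miniature example below illustrates this. The strings `mini_header` and `mini_data` are stand-ins invented for this sketch, arranged like the header indices found above (three headers, data, two headers, data).

```python
# miniature stand-in for the file: "H\n" plays the role of the header line
mini_header = "H\n"
mini_data = "H\nH\nH\na\nb\nH\nH\nc\n"

mini_blocks = mini_data.split(mini_header)
print(mini_blocks)

# keep only the blocks that actually contain measurements
data_sets = [block for block in mini_blocks if block.strip()]
print(len(data_sets))
```

Applied to `blocks`, the same filter would keep only the non-empty data sets.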

In [15]:
# write a separate file for each block
# for loop over the indices of the header lines found before
# the last block needs its own case: leaving out the value after the colon slices to the end of data_lines

base = os.path.splitext(files[0])[0]  # file path without the ".txt" extension
for i in range(len(header_indices)):

    filename = base + "_" + str(i+1) + ".txt"
    print(filename)
    if i < len(header_indices)-1:
        with open(filename, "w") as f:
            f.writelines(data_lines[header_indices[i]:header_indices[i+1]])
    else:
        with open(filename, "w") as f:
            f.writelines(data_lines[header_indices[i]:])

data/data_1.txt
data/data_2.txt
data/data_3.txt
data/data_4.txt
data/data_5.txt
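
With the header indices `[0, 1, 2, 50314, 50315]`, the files 1, 2 and 4 contain nothing but a header line. A possible variant that skips such blocks is sketched below on a miniature stand-in for `data_lines`. The names `mini_lines`, `bounds` and `written` and the temporary output folder are assumptions made for this example only.

```python
import os
import tempfile

# miniature stand-in for data_lines: "H\n" plays the role of the header line
mini_header = "H\n"
mini_lines = ["H\n", "H\n", "H\n", "a\n", "b\n", "H\n", "H\n", "c\n"]

starts = [i for i, line in enumerate(mini_lines) if line == mini_header]
# appending the total length gives every block an end index, so no special case is needed
bounds = starts + [len(mini_lines)]

out_dir = tempfile.mkdtemp()  # temporary folder, used for this sketch only
written = []
for start, stop in zip(bounds[:-1], bounds[1:]):
    block = mini_lines[start:stop]
    if len(block) <= 1:
        continue  # header without any measurements
    filename = os.path.join(out_dir, "data_" + str(len(written) + 1) + ".txt")
    with open(filename, "w") as f:
        f.writelines(block)
    written.append(filename)

print(written)
```

Pairing each start index with the next one through `zip` replaces the if/else for the last block.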


In [13]:
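# splitting by the header leaves an empty first block
# because the header repeats at lines 0, 1 and 2, blocks[1] and blocks[2] are empty as well (nothing is printed here)
# the first data set is blocks[3]
print(blocks[1][0:500])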


