# Working with files

To open a file, you can use the built-in `open` function. It takes the name of the file you want to open and, optionally, the mode in which to open it, e.g., `'r'` for reading (the default).

In [None]:
help(open)

In [21]:
f = open('textfile.txt', 'r')
print(f)
type(f)

<_io.TextIOWrapper name='textfile.txt' mode='r' encoding='UTF-8'>


_io.TextIOWrapper

Once the file is opened, you can read its content into a variable using the `read` method.

In [22]:
x = f.read()
len(x)

14

In [24]:
x

'Hello World\n42'

It is important to close a file after you are done with it to free up the resources that were tied to the file. You can use the `close` method to close a file.

In [12]:
f.close()

It is preferred to use the `with` statement for opening files. This ensures that the file is closed automatically even if an error occurs inside the `with` block.
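
As a minimal sketch of this behaviour, the snippet below checks the file object's `closed` attribute inside and after a `with` block. The temporary directory and the file name `demo.txt` are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on textfile.txt.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

with open(path, 'w') as fs:
    fs.write('Hello World\n42')

with open(path, 'r') as fs:
    content = fs.read()
    closed_inside = fs.closed   # the file is still open inside the block

closed_after = fs.closed        # the file was closed when the block ended
print(closed_inside, closed_after)
```

No explicit call to `close` is needed here, which is why this pattern is usually less error-prone than pairing `open` and `close` by hand.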

For larger files, reading the entire content of the file into memory using the `read` method might not be efficient due to memory constraints. Instead, it is preferred to read the file line by line using a loop.

In [14]:
with open('textfile.txt','r') as fs:
    for line in fs:
        print(line.strip())

Hello World
42


In [105]:
s = '   this is a string\nwith multiple lines\n'
print('---')
print(s)
print('---')
print(s.strip())  # strip() removes leading and trailing whitespace
print('---')

---
   this is a string
with multiple lines

---
this is a string
with multiple lines
---


You can write content to a file using the `write` method. `write` does not add line breaks on its own, so to write multiple lines you append a newline character (`\n`) to each one.

In [25]:
s = ["this is a test file", "this is the second line of the test file.", "this is the third line."]

with open('textfile3.txt','w') as fs:
    for line in s:
        fs.write(line + "\n")

In [26]:
with open('textfile3.txt','r') as fs:
    for line in fs:
        print(line.strip())

this is a test file
this is the second line of the test file.
this is the third line.


Note that `w` mode will overwrite the file if it already exists or create a new one if it doesn't.
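
The overwriting behaviour can be illustrated with a short sketch. The temporary directory and the file name `overwrite_demo.txt` are hypothetical and only serve to avoid touching real files.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'overwrite_demo.txt')  # hypothetical file name

with open(path, 'w') as fs:
    fs.write('first version\n')

with open(path, 'w') as fs:   # opening in 'w' mode again discards the old content
    fs.write('second version\n')

with open(path, 'r') as fs:
    result = fs.read()

print(repr(result))   # only the text from the second write remains
```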

To add content at the end of an existing file, you can open the file in 'a' mode and then use the `write` method.

In [27]:
with open('textfile3.txt','a') as fs:
    fs.write("Add another line here!\n")

In [28]:
with open('textfile3.txt','r') as fs:
    for line in fs:
        print(line.strip())

this is a test file
this is the second line of the test file.
this is the third line.
Add another line here!


Note that this is a very basic way of reading from and writing to files. Most of the time when we want to load data from a file for analysis, we would use dedicated methods that are provided by modules such as Pandas instead. These dedicated methods know how to handle file types beyond plain text (e.g., Excel, CSV, or JSON files) and will be covered in lecture 8.

# Standard library

The Python standard library is included in the Python distribution and provides functionality for a wide range of useful tasks.

### Random Module
The `random` module provides functions for generating random numbers.


In [29]:
import random

for i in range(1, 10):
    print(random.randint(1, 100), end = ' ')

89 24 41 69 70 76 7 64 62 

Sometimes it is useful to get the same "random" sequence if we run code multiple times. This can be achieved by setting a seed value using the `seed` method.

In [31]:
import random
random.seed(1)

for i in range(1, 11):
    print(random.randint(1, 100), end = ' ')

18 73 98 9 33 16 64 98 58 61 
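
A small sketch of why seeding matters: resetting the seed to the same value restarts the same sequence, so two runs can be compared directly. The variable names are illustrative only.

```python
import random

random.seed(1)
first_run = [random.randint(1, 100) for _ in range(5)]

random.seed(1)  # resetting the seed restarts the same sequence
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run == second_run)
```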

If we want to get random numbers without repetitions, we can use the `sample` method.

In [32]:
x = [i for i in range(1, 101)]
random.sample(x, 10)

[84, 49, 27, 13, 63, 4, 50, 56, 78, 98]

Alternatively, we can use the `shuffle` method to randomly rearrange the elements of a list in place.

In [35]:
random.shuffle(x)
print(x[:10])

[60, 90, 87, 2, 55, 99, 74, 23, 14, 28]
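
The difference between the two approaches can be sketched as follows: `sample` returns a new list without repeated picks, while `shuffle` modifies the list it is given and returns `None`. The variable names are assumptions for illustration.

```python
import random

numbers = list(range(1, 101))

picked = random.sample(numbers, 10)      # new list; `numbers` itself is untouched
print(len(set(picked)) == len(picked))   # no value was picked twice

returned = random.shuffle(numbers)       # reorders `numbers` in place
print(returned)                          # shuffle has no return value
print(sorted(numbers) == list(range(1, 101)))  # same elements, new order
```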


### Statistics Module
The `statistics` module provides functions for calculating mathematical statistics of numeric data.

In [36]:
women = [168, 165, 162, 154, 172, 170, 160, 152, 167, 171, 159, 165, 161, 163, 166, 173, 164, 168, 167, 162]
men = [166, 168, 179, 164, 167, 181, 188, 183, 169, 179, 179, 182, 180, 175, 177, 180, 178, 168, 173, 181]

In [37]:
import statistics as ss

In [38]:
ss.mean(men), ss.stdev(men), ss.mean(women), ss.stdev(women)

(175.85, 6.706438383335351, 164.45, 5.548589199541011)
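
Beyond `mean` and `stdev`, the module offers a few related functions. The sketch below uses a small made-up data set (not the height data above) to contrast the sample standard deviation, which divides by n - 1, with the population standard deviation, which divides by n.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustrative data

center = statistics.median(values)         # middle value of the sorted data
population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(statistics.mean(values), center, population_sd, sample_sd)
```

For data that is a sample from a larger population, as with the height measurements, `stdev` is normally the appropriate choice.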

### SciPy Module (not part of the Standard Library)
The `scipy` module provides functions for scientific computing, including statistical hypothesis testing.

In [39]:
from scipy import stats
stats.ttest_ind(men, women)

TtestResult(statistic=5.857210267889033, pvalue=8.937958782868781e-07, df=38.0)
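
To see where the test statistic comes from, here is a sketch that recomputes it with the standard library only. It assumes the equal-variance (pooled) form of the two-sample t-test, which is the default of `ttest_ind`; the variable names are illustrative.

```python
import math
import statistics

women = [168, 165, 162, 154, 172, 170, 160, 152, 167, 171,
         159, 165, 161, 163, 166, 173, 164, 168, 167, 162]
men = [166, 168, 179, 164, 167, 181, 188, 183, 169, 179,
       179, 182, 180, 175, 177, 180, 178, 168, 173, 181]

n_men, n_women = len(men), len(women)
degrees_of_freedom = n_men + n_women - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_men - 1) * statistics.variance(men)
              + (n_women - 1) * statistics.variance(women)) / degrees_of_freedom

# Standard error of the difference between the two means.
standard_error = math.sqrt(pooled_var * (1 / n_men + 1 / n_women))
t_statistic = (statistics.mean(men) - statistics.mean(women)) / standard_error

print(t_statistic, degrees_of_freedom)
```

The p-value additionally requires the t-distribution, which is the part that `scipy.stats` supplies.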

### Pickle Module
The `pickle` module is used to serialize (convert to bytes) and deserialize (convert from bytes) Python objects. In simpler words, it allows saving (and loading) arbitrary Python data structures.

In [40]:
import pickle

data = {'x' : 13, 'y': 50}

with open('data.save', 'wb') as fs:  # 'wb' = write in binary mode
    pickle.dump(data, fs)

In [41]:
with open('data.save', 'rb') as fs:  # 'rb' = read in binary mode
    data2 = pickle.load(fs)
print(data2)

{'x': 13, 'y': 50}
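
Serialization does not have to go through a file. As a sketch, `pickle.dumps` and `pickle.loads` perform the same round trip in memory; the nested dictionary here is made up for illustration.

```python
import pickle

data = {'x': 13, 'y': [1, 2, 3], 'z': ('a', None)}  # made-up nested data

raw = pickle.dumps(data)        # serialize to a bytes object in memory
restored = pickle.loads(raw)    # rebuild an equal, but separate, object

print(type(raw), restored == data, restored is data)
```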


**Be careful with pickle files that you obtain from an untrusted source!** They can contain malicious code, so do not simply load any pickle that you found somewhere on the internet. For an explanation of why pickle files are dangerous, see for instance: https://huggingface.co/docs/hub/security-pickle#why-is-it-dangerous

### Date and Time

The `datetime` module provides classes to work with dates and times.

There are several classes within the module that you can use to work with dates and times, such as `datetime.date` for dates, `datetime.time` for times, and `datetime.datetime` for both dates and times.

The `datetime.date` class is used for working with dates (year, month, and day).

In [42]:
import datetime

new_year = datetime.date(2023,1,1)
type(new_year)

datetime.date

The object created above represents January 1, 2023:

In [43]:
new_year

datetime.date(2023, 1, 1)

In [44]:
print(new_year)

2023-01-01


We can get the current date using the `today` method:

In [45]:
today = datetime.date.today()
print(today)

2024-03-13


`datetime.timedelta` is used for working with differences in time. This allows adding or subtracting a specific amount of time from a date.

In [46]:
week = datetime.timedelta(days = 7)
week

datetime.timedelta(days=7)

In [47]:
print(today + week, today - week)

2024-03-20 2024-03-06


In [48]:
since_new_year = today - new_year
print(since_new_year)

437 days, 0:00:00
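
A `timedelta` can also be inspected numerically. The sketch below uses fixed dates (the second one matches the "today" shown in the output above) so that the result does not depend on when the code is run.

```python
import datetime

start = datetime.date(2023, 1, 1)
end = datetime.date(2024, 3, 13)   # fixed stand-in for "today"

delta = end - start
whole_days = delta.days                               # number of days as an integer
seconds = delta.total_seconds()                       # full duration in seconds
full_weeks = delta // datetime.timedelta(weeks=1)     # number of complete weeks

print(whole_days, seconds, full_weeks)
```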


We can also multiply a `timedelta` by a number.

In [49]:
print(today + since_new_year * 9)

2034-12-19


We can also compare dates.

In [50]:
print(today < new_year)
date_in_future = datetime.date(2024,1,1)
date_in_future > new_year

False


True

In [51]:
new_year.weekday()  # day of the week as an integer: Monday is 0, Sunday is 6

6
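
As a brief sketch of how to interpret this number: `weekday` counts from Monday as 0, while `isoweekday` counts from Monday as 1. The `day_names` list is a helper introduced here for illustration.

```python
import datetime

# Helper list for this example, ordered to match weekday() (Monday first).
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

new_year = datetime.date(2023, 1, 1)
index = new_year.weekday()          # Monday is 0 ... Sunday is 6
iso_index = new_year.isoweekday()   # Monday is 1 ... Sunday is 7

print(index, iso_index, day_names[index])
```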

The `strftime` method allows you to format `datetime` objects as strings. You can find the [full list](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) of available directives in the official documentation.


In [52]:
today.strftime("%d.%m.%y")

'13.03.24'

In [55]:
today.strftime('Hello, the date is: %a %b, %H:%M:%S %p %y')

'Hello, the date is: Wed Mar, 00:00:00 AM 24'

Python 3.6 introduced so-called f-strings, which accept the same format directives, so you can format dates and other data types directly inside a string.

In [61]:
f'Today is {today:%d.%m.%Y}.'

'Today is 13.03.2024.'
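
Formatting has a counterpart for the opposite direction: `datetime.datetime.strptime` parses a string using the same directives. A minimal sketch, with a hard-coded example string:

```python
import datetime

text = '13.03.2024'
parsed = datetime.datetime.strptime(text, '%d.%m.%Y')   # string -> datetime
as_date = parsed.date()                                  # drop the time part

print(parsed, as_date)
print(f'{as_date:%d.%m.%Y}' == text)   # formatting gives back the original string
```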

Another class, `datetime.datetime`, is used for working with both dates and times.

In [62]:
now = datetime.datetime.now()
type(now)

datetime.datetime

In [63]:
print(now)

2024-03-13 11:04:41.293187


We can use `datetime.timedelta` objects with `datetime.datetime` objects as well.

In [64]:
time_delta = datetime.timedelta(minutes = 10, seconds = 10)
print(now + time_delta)

2024-03-13 11:14:51.293187


An advantage of working with `datetime.datetime` objects is that they can be made aware of time zones, which allows for easy conversion between time zones.

In [72]:
now = datetime.datetime.now(tz = datetime.timezone.utc)
print(now)

2024-03-13 10:06:54.896572+00:00


In [73]:
beijing_time = datetime.timezone(datetime.timedelta(hours = 8))
berlin_time = datetime.timezone(datetime.timedelta(hours = 1))
print(now.astimezone(beijing_time), '---', now.astimezone(berlin_time))

2024-03-13 18:06:54.896572+08:00 --- 2024-03-13 11:06:54.896572+01:00
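
Converting between time zones changes only how a moment is displayed, not the moment itself. The sketch below uses a fixed UTC time (an assumption, so the result is reproducible) and shows that the converted value still compares equal to the original.

```python
import datetime

utc_time = datetime.datetime(2024, 3, 13, 10, 6, 54, tzinfo=datetime.timezone.utc)
plus_eight = datetime.timezone(datetime.timedelta(hours=8))  # fixed UTC+8 offset

local_time = utc_time.astimezone(plus_eight)
print(local_time)               # wall-clock time shifted by eight hours
print(local_time == utc_time)   # both describe the same instant
print(local_time.utcoffset())   # the offset attached to the converted value
```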


Instead of manually defining the time zone by its offset from UTC, we can use named time zones, which are aware of daylight saving time. The example below uses the third-party `pytz` package.

In [80]:
import pytz
for tz in pytz.all_timezones[:10]:
    print(tz)

Africa/Abidjan
Africa/Accra
Africa/Addis_Ababa
Africa/Algiers
Africa/Asmara
Africa/Asmera
Africa/Bamako
Africa/Bangui
Africa/Banjul
Africa/Bissau


In [82]:
mt = pytz.timezone('Asia/Shanghai')
bt = pytz.timezone('Europe/Berlin')

print(now.astimezone(mt), '---', now.astimezone(bt))

2024-03-13 18:06:54.896572+08:00 --- 2024-03-13 11:06:54.896572+01:00


A timestamp is a number that represents a specific moment in time. It is the number of seconds that have passed since a fixed point in the past.

In [83]:
ts = now.timestamp()
print(ts)

1710324414.896572


In [85]:
year_in_seconds = 60 * 60 * 24 * 365
ts / year_in_seconds  # seconds since timestamp 0

54.23403142112418

The result is roughly 54 years, so timestamp 0 corresponds to January 1, 1970, 00:00:00 (UTC).
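
This can be checked directly by converting timestamp 0 back into a time-zone-aware `datetime`. The second value is a fixed example moment chosen for illustration.

```python
import datetime

utc = datetime.timezone.utc

epoch = datetime.datetime.fromtimestamp(0, tz=utc)   # timestamp 0 as a datetime
print(epoch)

moment = datetime.datetime(2024, 3, 13, 10, 6, 54, tzinfo=utc)  # fixed example moment
seconds_since_epoch = moment.timestamp()
print(seconds_since_epoch)
```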

In [88]:
# let's use timestamps to measure how long an operation takes
ts = datetime.datetime.now().timestamp()
for i in range(1, 100000000):
    x = 1
print(datetime.datetime.now().timestamp() - ts)

2.1089348793029785
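
For timing code, the standard library also provides `time.perf_counter`, a high-resolution clock intended for measuring short durations. The following is a sketch of the same idea; the workload is arbitrary.

```python
import time

start = time.perf_counter()        # high-resolution clock meant for timing
total = sum(range(1_000_000))      # some arbitrary work to measure
elapsed = time.perf_counter() - start

print(f'{elapsed:.4f} seconds')
```

Unlike wall-clock timestamps, this clock is not affected by system clock adjustments, which makes it a safer choice for benchmarks.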


In [90]:
new_current_time = datetime.datetime.fromtimestamp(ts)  # convert timestamp to datetime
new_current_time.strftime("%H:%M:%S %d.%m.%Y")

'11:10:51 13.03.2024'

### pprint – "pretty-print"

The `pprint` module prints nested data structures in a readable, indented layout.

In [93]:
data = [
 {'id': 731128,
  'first_name': 'Lukas',
  'last_name': 'Strohmaier',
  'age': 25,
  'friends': [483198,239570,793005,661742,366391,615506,938483,873528,763436,927225,225686,460894,674641],
  'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]},
 {'id': 610278,
  'first_name': 'Sofia',
  'last_name': 'Schneider',
  'age': 21,
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'papers': [{'journal': 'PNAS','title': 'Schools are segregated by educational outcomes in the digital space'}]},
 {'id': 226486,
  'first_name': 'Ivan',
  'last_name': 'Schmidt',
  'age': 22,
  'friends': [840459, 849913, 550089],
  'papers': [{'journal': 'Nature', 'title': 'Parents mention sons more often than daughters on social media'},
             {'journal': 'Science', 'title': 'The digital flynn effect: Complexity of posts on social media increases over time'}]},
 {'id': 468537,
  'first_name': 'Lukas',
  'last_name': 'Lemmerich',
  'age': 73,
  'friends': [636316],
  'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'},
             {'journal': 'PLOS One', 'title': 'Predicting PISA scores from students’ digital traces'}]},
 {'id': 257556,
  'first_name': 'Anna',
  'last_name': 'Müller',
  'age': 92,
  'friends': [227376, 706211, 876261, 388121, 368097, 790177, 551119, 622808, 204274, 606230, 737488, 874988],
  'papers': [{'journal': 'Science', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]}
]

print(data) # this is almost unreadable

[{'id': 731128, 'first_name': 'Lukas', 'last_name': 'Strohmaier', 'age': 25, 'friends': [483198, 239570, 793005, 661742, 366391, 615506, 938483, 873528, 763436, 927225, 225686, 460894, 674641], 'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]}, {'id': 610278, 'first_name': 'Sofia', 'last_name': 'Schneider', 'age': 21, 'friends': [340424, 118896, 728115, 732455, 685114, 237262], 'papers': [{'journal': 'PNAS', 'title': 'Schools are segregated by educational outcomes in the digital space'}]}, {'id': 226486, 'first_name': 'Ivan', 'last_name': 'Schmidt', 'age': 22, 'friends': [840459, 849913, 550089], 'papers': [{'journal': 'Nature', 'title': 'Parents mention sons more often than daughters on social media'}, {'journal': 'Science', 'title': 'The digital flynn effect: Complexity of posts on social media increases over time'}]}, {'id': 468537, 'first_name': 'Lukas', 'last_name': 'Lemmerich', 'age': 73, 'friends': [636316], '

In [94]:
from pprint import pprint, pformat

pprint(data)

[{'age': 25,
  'first_name': 'Lukas',
  'friends': [483198,
              239570,
              793005,
              661742,
              366391,
              615506,
              938483,
              873528,
              763436,
              927225,
              225686,
              460894,
              674641],
  'id': 731128,
  'last_name': 'Strohmaier',
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'age': 21,
  'first_name': 'Sofia',
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'id': 610278,
  'last_name': 'Schneider',
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'age': 22,
  'first_name': 'Ivan',
  'friends': [840459, 849913, 550089],
  'id': 226486,
  'last_name': 'Schmidt',
  'papers': [{'journal': 'Nature',
             

In [95]:
help(pprint)

Help on function pprint in module pprint:

pprint(object, stream=None, indent=1, width=80, depth=None, *, compact=False, sort_dicts=True, underscore_numbers=False)
    Pretty-print a Python object to a stream [default is sys.stdout].
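
A short sketch of two of these parameters, using `pformat` so the results can be inspected as strings. The `record` dictionary is a cut-down, made-up example.

```python
from pprint import pformat

record = {'name': 'Anna', 'id': 257556, 'papers': [{'journal': 'Science'}]}  # made-up record

sorted_keys = pformat(record)                       # keys sorted alphabetically (default)
original_order = pformat(record, sort_dicts=False)  # keys kept in insertion order
shallow = pformat(record, depth=1)                  # nested containers are collapsed

print(sorted_keys)
print(original_order)
print(shallow)
```

Limiting `depth` is useful for getting an overview of a deeply nested structure before printing it in full.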



In [96]:
pprint(data, compact = True, width = 80, sort_dicts = False, indent = 1)

[{'id': 731128,
  'first_name': 'Lukas',
  'last_name': 'Strohmaier',
  'age': 25,
  'friends': [483198, 239570, 793005, 661742, 366391, 615506, 938483, 873528,
              763436, 927225, 225686, 460894, 674641],
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'id': 610278,
  'first_name': 'Sofia',
  'last_name': 'Schneider',
  'age': 21,
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'id': 226486,
  'first_name': 'Ivan',
  'last_name': 'Schmidt',
  'age': 22,
  'friends': [840459, 849913, 550089],
  'papers': [{'journal': 'Nature',
              'title': 'Parents mention sons more often than daughters on '
                       'social media'},
             {'journal': 'Science',
              '

In [99]:
s = pformat(data)  # convert to a pretty format and store into string
print(s)

[{'age': 25,
  'first_name': 'Lukas',
  'friends': [483198,
              239570,
              793005,
              661742,
              366391,
              615506,
              938483,
              873528,
              763436,
              927225,
              225686,
              460894,
              674641],
  'id': 731128,
  'last_name': 'Strohmaier',
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'age': 21,
  'first_name': 'Sofia',
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'id': 610278,
  'last_name': 'Schneider',
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'age': 22,
  'first_name': 'Ivan',
  'friends': [840459, 849913, 550089],
  'id': 226486,
  'last_name': 'Schmidt',
  'papers': [{'journal': 'Nature',
             