# Working with files

To open a file, you can use the built-in `open` function. It takes the name of the file you want to open and, optionally, the mode in which to open it, e.g., 'r' for reading (the default).

In [1]:
help(open)

Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise OSError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position

In [23]:
f = open('textfile.txt', 'r')
print(f)
type(f)

<_io.TextIOWrapper name='textfile.txt' mode='r' encoding='UTF-8'>


_io.TextIOWrapper

Once the file is opened, you can read its content into a variable using the `read` method.

In [24]:
x = f.read()
len(x)

1645

In [25]:
x

'THE WINDOW\n\n\n1\n\n\n"Yes, of course, if it\'s fine tomorrow," said Mrs. Ramsay. "But you\'ll\nhave to be up with the lark," she added.\n\nTo her son these words conveyed an extraordinary joy, as if it were\nsettled, the expedition were bound to take place, and the wonder to which\nhe had looked forward, for years and years it seemed, was, after a night\'s\ndarkness and a day\'s sail, within touch. Since he belonged, even at the\nage of six, to that great clan which cannot keep this feeling separate\nfrom that, but must let future prospects, with their joys and sorrows,\ncloud what is actually at hand, since to such people even in earliest\nchildhood any turn in the wheel of sensation has the power to crystallise\nand transfix the moment upon which its gloom or radiance rests, James\nRamsay, sitting on the floor cutting out pictures from the illustrated\ncatalogue of the Army and Navy stores, endowed the picture of a\nrefrigerator, as his mother spoke, with heavenly bliss. It was fr

It is important to close a file once you are done with it, to free up the resources tied to it. You can use the `close` method to close a file.

In [26]:
f.close()

It is preferred to use the `with` statement for opening files. This ensures that the file is closed automatically even if an error occurs inside the `with` block.
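As a minimal sketch of that guarantee, the cell below raises an error on purpose inside a `with` block and then checks that the file was closed anyway. The temporary file `demo.txt` and the simulated `ValueError` are assumptions made only for this illustration.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on 'textfile.txt'.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as fs:
    fs.write("first line\nsecond line\n")

try:
    with open(path, "r", encoding="utf-8") as fs:
        first = fs.readline()
        # Simulate something going wrong while the file is still open.
        raise ValueError("simulated failure while processing the file")
except ValueError:
    pass

print(repr(first))  # 'first line\n'
print(fs.closed)    # True: the with statement closed the file despite the error
```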

For larger files, reading the entire content of the file into memory using the `read` method might not be efficient due to memory constraints. Instead, it is preferred to read the file line by line using a loop.

In [32]:
with open('textfile.txt','r') as fs:
    for li in fs:
        print(li.strip())  # strip() removes the trailing newline and any leading/trailing whitespace

THE WINDOW


1


"Yes, of course, if it's fine tomorrow," said Mrs. Ramsay. "But you'll
have to be up with the lark," she added.

To her son these words conveyed an extraordinary joy, as if it were
settled, the expedition were bound to take place, and the wonder to which
he had looked forward, for years and years it seemed, was, after a night's
darkness and a day's sail, within touch. Since he belonged, even at the
age of six, to that great clan which cannot keep this feeling separate
from that, but must let future prospects, with their joys and sorrows,
cloud what is actually at hand, since to such people even in earliest
childhood any turn in the wheel of sensation has the power to crystallise
and transfix the moment upon which its gloom or radiance rests, James
Ramsay, sitting on the floor cutting out pictures from the illustrated
catalogue of the Army and Navy stores, endowed the picture of a
refrigerator, as his mother spoke, with heavenly bliss. It was fringed
with joy. The wheel

In [28]:
s = '   this is a string\nwith multiple lines\n'
print('---')
print(s)
print('---')
print(s.strip())  # strip() removes leading and trailing whitespace
print('---')

---
   this is a string
with multiple lines

---
this is a string
with multiple lines
---


You can write content to files using the `write` method. `write` does not add a line break on its own, so to write multiple lines you have to include the newline character `\n` wherever a line should end.

In [37]:
s = ["this is a test file", "this is the second line of the test file.", "this is the third line."]

with open('textfile3.txt','w') as fs:
    for line in s:
        fs.write("123\n"+line)

In [38]:
with open('textfile3.txt','r') as fs:
    for line in fs:
        print(line.strip())

123
this is a test file123
this is the second line of the test file.123
this is the third line.


Note that `w` mode will overwrite the file if it already exists or create a new one if it doesn't.
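A common pattern is to put the newline after each line rather than before it, so that every line in the file ends cleanly. The sketch below does this and reads the lines back; the temporary file name `lines_demo.txt` is an assumption used only here.

```python
import os
import tempfile

lines = ["this is a test file",
         "this is the second line of the test file.",
         "this is the third line."]

# Write into a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")

with open(path, "w", encoding="utf-8") as fs:
    for line in lines:
        fs.write(line + "\n")  # the newline goes after each line

with open(path, "r", encoding="utf-8") as fs:
    read_back = [line.rstrip("\n") for line in fs]

print(read_back == lines)  # True
```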

To add content at the end of an existing file, you can open the file in 'a' mode and then use the `write` method.

In [44]:
with open('textfile3.txt','a') as fs:
    fs.write("Add another line here!\n")

In [47]:
with open('textfile3.txt','r') as fs:
    for line in fs:
        print(line.strip())

123
this is a test file123
this is the second line of the test file.123
this is the third line.Add another line here!
Add another line here!
Add another line here!
Add another line here!


In [49]:
with open('textfile3.txt','a') as b:
    b.write('Add one\n')  # leaving the block closes the file and flushes the write
with open('textfile3.txt','r') as c:
    print(c.read())

123
this is a test file123
this is the second line of the test file.123
this is the third line.Add another line here!
Add another line here!
Add another line here!
Add another line here!
Add one



Note that this is a very basic way of reading from and writing to files. Most of the time, when we want to load data from a file for analysis, we use dedicated functions provided by modules such as Pandas instead. These know how to handle file types beyond plain text (e.g., Excel, CSV, or JSON files) and will be covered in lecture 8.

# Standard library

The Python standard library is included in the Python distribution and provides functionality for a wide range of useful tasks.

### Random Module
The `random` module provides functions for generating random numbers.


In [56]:
import random

for i in range(1, 10):
    print(random.randint(1, 100), end = '!')

17!93!25!35!57!2!29!80!59!

Sometimes it is useful to get the same "random" sequence each time we run the code. This can be achieved by setting a seed value with the `seed` function.

In [55]:
import random
random.seed(23)

for i in range(1, 11):
    print(random.randint(1, 100), end = ' ')

100 38 11 3 76 40 55 49 68 46 
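To see the effect of a seed without re-running a cell, one option is to create two independent generators with the same seed and compare their output. This is only a sketch: the names `gen_a` and `gen_b` are made up for the illustration, and `random.Random` is the generator class behind the module-level functions.

```python
import random

# Two separate generators, both started from the same seed value.
gen_a = random.Random(23)
gen_b = random.Random(23)

seq_a = [gen_a.randint(1, 100) for _ in range(10)]
seq_b = [gen_b.randint(1, 100) for _ in range(10)]

print(seq_a == seq_b)  # True: same seed, same sequence
```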

If we want to get random numbers without repetitions, we can use the `sample` function.

In [70]:
x = [i for i in range(1, 101)]
y = random.sample(x, 10)
y

[100, 66, 15, 80, 93, 27, 75, 46, 26, 61]

We can also use the `shuffle` function to randomly rearrange the elements of a list in place.

In [71]:
random.shuffle(y)
print(y)

[93, 100, 80, 75, 46, 66, 15, 27, 26, 61]
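The two properties used above can be checked directly: a sample drawn from distinct values contains no duplicates, and shuffling changes only the order of a list, not its contents. The variable names in this small sketch are chosen for illustration.

```python
import random

numbers = list(range(1, 101))

# sample() draws without replacement, so no value can appear twice.
picked = random.sample(numbers, 10)
print(len(set(picked)) == len(picked))  # True

# shuffle() reorders the list in place; work on a copy to keep the original.
shuffled = picked.copy()
random.shuffle(shuffled)
print(sorted(shuffled) == sorted(picked))  # True: same elements, possibly new order
```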


### Statistics Module
The `statistics` module provides functions for calculating mathematical statistics of numeric data.

In [72]:
women = [168, 165, 162, 154, 172, 170, 160, 152, 167, 171, 159, 165, 161, 163, 166, 173, 164, 168, 167, 162]
men = [166, 168, 179, 164, 167, 181, 188, 183, 169, 179, 179, 182, 180, 175, 177, 180, 178, 168, 173, 181]

In [73]:
import statistics as ss

In [74]:
ss.mean(men), ss.stdev(men), ss.mean(women), ss.stdev(women)

(175.85, 6.706438383335351, 164.45, 5.548589199541011)
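Besides `mean` and `stdev`, the module offers `median` and the population variant `pstdev`, among others. The sketch below uses the first five values of the `women` list from above to keep the numbers easy to follow.

```python
import statistics

heights = [168, 165, 162, 154, 172]  # first five values of the women list

print(statistics.median(heights))  # 165, the middle value after sorting

# stdev() treats the data as a sample (divides by n - 1),
# pstdev() treats it as the whole population (divides by n).
sample_sd = statistics.stdev(heights)
population_sd = statistics.pstdev(heights)
print(sample_sd > population_sd)  # True
```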

In [77]:
help(ss)

Help on module statistics:

NAME
    statistics - Basic statistics module.

MODULE REFERENCE
    https://docs.python.org/3.10/library/statistics.html
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides functions for calculating statistics of data, including
    averages, variance, and standard deviation.
    
    Calculating averages
    --------------------
    
    Function            Description
    mean                Arithmetic mean (average) of data.
    fmean               Fast, floating point arithmetic mean.
    geometric_mean      Geometric mean of data.
    harmonic_mean       Harmonic mean of data.
    median              Median (middle value) of data.
    median_low  

### SciPy Module (not part of the Standard Library)
The `scipy` module provides functions for scientific computing, including statistical hypothesis testing.

In [75]:
from scipy import stats
stats.ttest_ind(men, women)

TtestResult(statistic=5.857210267889033, pvalue=8.937958782868781e-07, df=38.0)

### Pickle Module
The `pickle` module is used to serialize (convert to bytes) and deserialize (convert from bytes) Python objects. In simpler words, it allows saving (and loading) arbitrary Python data structures.

In [81]:
import pickle

data = {'x' : 13, 'y': 50}

# open in binary write mode; the with block closes the file afterwards
with open('data.save', 'wb') as fs:
    pickle.dump(data, fs)

In [82]:
# open in binary read mode; the with block closes the file afterwards
with open('data.save', 'rb') as fs:
    data2 = pickle.load(fs)
print(data2)

{'x': 13, 'y': 50}


**Be careful with pickle files that you obtain from an untrusted source!** They can contain malicious code, so do not simply load any pickle that you found somewhere on the internet. For an explanation of why pickle files are dangerous, see for instance: https://huggingface.co/docs/hub/security-pickle#why-is-it-dangerous
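For plain data such as the dictionary above, a text format like JSON is one possible alternative: the standard library's `json` module only reconstructs basic types (dicts, lists, strings, numbers, booleans, `None`) when loading. The sketch below assumes a temporary file named `data.json`; it is not a drop-in replacement for pickle, since JSON cannot store arbitrary Python objects.

```python
import json
import os
import tempfile

data = {'x': 13, 'y': 50}

# Temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "data.json")

with open(path, "w", encoding="utf-8") as fs:
    json.dump(data, fs)        # JSON is text, so the file is opened in text mode

with open(path, "r", encoding="utf-8") as fs:
    restored = json.load(fs)

print(restored == data)  # True
```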

### Date and Time

The `datetime` module provides classes to work with dates and times.

There are several classes within the module that you can use to work with dates and times, such as `datetime.date` for dates, `datetime.time` for times, and `datetime.datetime` for both dates and times.

The `datetime.date` class is used for working with dates (year, month, and day). Let's create a `datetime.date` object for January 1, 2023:

In [84]:
import datetime

new_year = datetime.date(2023,1,1)
type(new_year)

datetime.date

Evaluating the object shows its representation, while `print` shows the date in ISO format (YYYY-MM-DD):

In [85]:
new_year

datetime.date(2023, 1, 1)

In [86]:
print(new_year)

2023-01-01


We can get the current date using the `today` method:

In [87]:
today = datetime.date.today()
print(today)

2024-10-07


`datetime.timedelta` is used for working with differences in time. This allows adding or subtracting a specific amount of time from a date.

In [88]:
week = datetime.timedelta(days = 7)
week

datetime.timedelta(days=7)

In [90]:
weekday = datetime.timedelta(days = 5)

In [92]:
print(today + week, today - week)

2024-10-14 2024-09-30


In [93]:
since_new_year = today - new_year
print(since_new_year)

645 days, 0:00:00


We can multiply a `timedelta` by a number.

In [96]:
print(today + since_new_year * 9)

2040-08-29
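A `timedelta` can also be inspected and divided. The sketch below uses fixed dates (January 1, 2023 and October 7, 2024, the dates from the cells above) so that the result does not depend on when it is run.

```python
import datetime

start = datetime.date(2023, 1, 1)
end = datetime.date(2024, 10, 7)
delta = end - start

print(delta.days)             # 645
print(delta.total_seconds())  # 55728000.0

# Dividing one timedelta by another gives a plain number:
# here, the count of complete weeks in the interval.
full_weeks = delta // datetime.timedelta(weeks=1)
print(full_weeks)             # 92
```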


We can also compare dates.

In [97]:
print(today < new_year)
date_in_future = datetime.date(2024,1,1)
date_in_future > new_year

False


True

In [98]:
new_year.weekday()

6
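The number returned by `weekday` counts from Monday as 0, so 6 means Sunday. The related `isoweekday` method counts from Monday as 1 instead. A short sketch, where the list `day_names` is a helper defined only for this example:

```python
import datetime

new_year = datetime.date(2023, 1, 1)

# Helper list for this example: index 0 is Monday, matching weekday().
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

print(new_year.weekday())             # 6
print(day_names[new_year.weekday()])  # Sunday
print(new_year.isoweekday())          # 7 (ISO numbering: Monday is 1, Sunday is 7)
```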

The `strftime` method allows you to format `datetime` objects as strings. You can find the [full list](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) of available directives in the official documentation.


In [102]:
today.strftime("%d~%m~%y")

'07~10~24'

In [106]:
today.strftime('Hello, the date is: %A %B, %H:%M:%S %p %y')

'Hello, the date is: Monday October, 00:00:00 AM 24'

Python 3.6 introduced so-called `f-strings` that you can use to directly format dates and other data types to include them into strings.

In [104]:
f'Today is {today:%d.%m.%Y}.'

'Today is 07.10.2024.'
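The reverse direction, turning a string into a date, uses `datetime.datetime.strptime` with the same directives. The sketch below parses the string produced above; the input string and format are chosen for illustration.

```python
import datetime

# Parse a string using the same directives that strftime uses for output.
parsed = datetime.datetime.strptime("07.10.2024", "%d.%m.%Y")

print(parsed)         # 2024-10-07 00:00:00
print(parsed.date())  # 2024-10-07

# Formatting the parsed value with the same pattern gives the input back.
print(parsed.strftime("%d.%m.%Y") == "07.10.2024")  # True
```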

Another class, `datetime.datetime`, is used for working with both dates and times.

In [107]:
now = datetime.datetime.now()
type(now)

datetime.datetime

In [108]:
print(now)

2024-10-07 09:28:27.916944


We can use `datetime.timedelta` objects with `datetime.datetime` objects as well.

In [110]:
time_delta = datetime.timedelta(minutes = 10, seconds = 10)
print(now + time_delta)

2024-10-07 09:38:37.916944


An advantage of working with `datetime.datetime` objects is that they can be made aware of time zones, which allows for easy conversion between time zones.

In [122]:
now = datetime.datetime.now(tz = datetime.timezone.utc)
print(now)

2024-10-07 09:36:45.503081+00:00


In [115]:
beijing_time = datetime.timezone(datetime.timedelta(hours = 8))
berlin_time = datetime.timezone(datetime.timedelta(hours = 2))
print(now.astimezone(beijing_time), '---', now.astimezone(berlin_time))

2024-10-07 17:33:08.589002+08:00 --- 2024-10-07 11:33:08.589002+02:00


Instead of manually defining a time zone by its offset from UTC, we can use named time zones, here via the `pytz` package (not part of the standard library). These are aware of daylight saving time.

In [120]:
import pytz
for tz in pytz.all_timezones[-18:]:
    print(tz)

UCT
US/Alaska
US/Aleutian
US/Arizona
US/Central
US/East-Indiana
US/Eastern
US/Hawaii
US/Indiana-Starke
US/Michigan
US/Mountain
US/Pacific
US/Samoa
UTC
Universal
W-SU
WET
Zulu


In [123]:
mt = pytz.timezone('Asia/Shanghai')
bt = pytz.timezone('Europe/Berlin')

print(now.astimezone(mt), '---', now.astimezone(bt))

2024-10-07 17:36:45.503081+08:00 --- 2024-10-07 11:36:45.503081+02:00


A timestamp is a number that represents a specific moment in time. It is the number of seconds that have passed since a certain point in the past.

In [124]:
ts = now.timestamp()
print(ts)

1728293805.503081


In [125]:
year_in_seconds = 60 * 60 * 24 * 365
ts / year_in_seconds  # approximate number of years since timestamp 0

54.80383705933159

So timestamp 0 lies about 54.8 years in the past: it is January 1, 1970, 00:00:00 (UTC).
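This can be checked directly by converting timestamps back to `datetime` objects in UTC. The second value in this sketch is the integer part of the timestamp printed above.

```python
import datetime

utc = datetime.timezone.utc

# Timestamp 0 is the reference point from which seconds are counted.
epoch = datetime.datetime.fromtimestamp(0, tz=utc)
print(epoch)   # 1970-01-01 00:00:00+00:00

# Whole seconds of the timestamp shown earlier in this section.
moment = datetime.datetime.fromtimestamp(1728293805, tz=utc)
print(moment)  # 2024-10-07 09:36:45+00:00
```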

In [138]:
# let's use timestamps to measure how long an operation takes
ts = datetime.datetime.now().timestamp()
for i in range(1, 10000000):
    x = 1
print(datetime.datetime.now().timestamp() - ts)

0.3535439968109131
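For measuring short durations, the standard library also provides `time.perf_counter`, a clock intended for timing code that is unaffected by changes to the system time. A minimal sketch, with a smaller loop than above and an arbitrary workload chosen for illustration:

```python
import time

start = time.perf_counter()

total = 0
for i in range(1_000_000):
    total += i

elapsed = time.perf_counter() - start
print(f"{elapsed:.4f} seconds")  # the exact value varies from run to run
```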


In [139]:
new_current_time = datetime.datetime.fromtimestamp(ts)  # convert timestamp to datetime
new_current_time.strftime("%H:%M:%S %d.%m.%Y")

'09:44:38 07.10.2024'

### pprint – "pretty-print"

The `pprint` module prints nested data structures in a readable, indented layout.

In [140]:
data = [
 {'id': 731128,
  'first_name': 'Lukas',
  'last_name': 'Strohmaier',
  'age': 25,
  'friends': [483198,239570,793005,661742,366391,615506,938483,873528,763436,927225,225686,460894,674641],
  'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]},
 {'id': 610278,
  'first_name': 'Sofia',
  'last_name': 'Schneider',
  'age': 21,
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'papers': [{'journal': 'PNAS','title': 'Schools are segregated by educational outcomes in the digital space'}]},
 {'id': 226486,
  'first_name': 'Ivan',
  'last_name': 'Schmidt',
  'age': 22,
  'friends': [840459, 849913, 550089],
  'papers': [{'journal': 'Nature', 'title': 'Parents mention sons more often than daughters on social media'},
             {'journal': 'Science', 'title': 'The digital flynn effect: Complexity of posts on social media increases over time'}]},
 {'id': 468537,
  'first_name': 'Lukas',
  'last_name': 'Lemmerich',
  'age': 73,
  'friends': [636316],
  'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'},
             {'journal': 'PLOS One', 'title': 'Predicting PISA scores from students’ digital traces'}]},
 {'id': 257556,
  'first_name': 'Anna',
  'last_name': 'Müller',
  'age': 92,
  'friends': [227376, 706211, 876261, 388121, 368097, 790177, 551119, 622808, 204274, 606230, 737488, 874988],
  'papers': [{'journal': 'Science', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]}
]

print(data) # this is almost unreadable

[{'id': 731128, 'first_name': 'Lukas', 'last_name': 'Strohmaier', 'age': 25, 'friends': [483198, 239570, 793005, 661742, 366391, 615506, 938483, 873528, 763436, 927225, 225686, 460894, 674641], 'papers': [{'journal': 'PNAS', 'title': 'Estimating educational outcomes from students’ short texts on social media'}]}, {'id': 610278, 'first_name': 'Sofia', 'last_name': 'Schneider', 'age': 21, 'friends': [340424, 118896, 728115, 732455, 685114, 237262], 'papers': [{'journal': 'PNAS', 'title': 'Schools are segregated by educational outcomes in the digital space'}]}, {'id': 226486, 'first_name': 'Ivan', 'last_name': 'Schmidt', 'age': 22, 'friends': [840459, 849913, 550089], 'papers': [{'journal': 'Nature', 'title': 'Parents mention sons more often than daughters on social media'}, {'journal': 'Science', 'title': 'The digital flynn effect: Complexity of posts on social media increases over time'}]}, {'id': 468537, 'first_name': 'Lukas', 'last_name': 'Lemmerich', 'age': 73, 'friends': [636316], '

In [None]:
from pprint import pprint, pformat

pprint(data)

[{'age': 25,
  'first_name': 'Lukas',
  'friends': [483198,
              239570,
              793005,
              661742,
              366391,
              615506,
              938483,
              873528,
              763436,
              927225,
              225686,
              460894,
              674641],
  'id': 731128,
  'last_name': 'Strohmaier',
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'age': 21,
  'first_name': 'Sofia',
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'id': 610278,
  'last_name': 'Schneider',
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'age': 22,
  'first_name': 'Ivan',
  'friends': [840459, 849913, 550089],
  'id': 226486,
  'last_name': 'Schmidt',
  'papers': [{'journal': 'Nature',
             

In [None]:
help(pprint)

Help on function pprint in module pprint:

pprint(object, stream=None, indent=1, width=80, depth=None, *, compact=False, sort_dicts=True, underscore_numbers=False)
    Pretty-print a Python object to a stream [default is sys.stdout].
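One of the parameters listed in this signature, `depth`, limits how many nesting levels are shown; anything deeper is replaced by `...`. The sketch below uses `pformat`, which accepts the same options, on a shortened record. The record and its title are made up for this illustration, and `sort_dicts` assumes Python 3.8 or newer.

```python
from pprint import pformat

# A shortened, made-up record in the same shape as the data above.
record = {
    'id': 731128,
    'friends': [483198, 239570, 793005],
    'papers': [{'journal': 'PNAS', 'title': 'A hypothetical title'}],
}

# depth=1 shows only the top level; nested lists are abbreviated.
summary = pformat(record, depth=1, sort_dicts=False)
print(summary)
```

This is handy for getting a quick overview of the keys in a large structure before printing it in full.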



In [None]:
pprint(data, compact = True, width = 80, sort_dicts = False, indent = 1)

[{'id': 731128,
  'first_name': 'Lukas',
  'last_name': 'Strohmaier',
  'age': 25,
  'friends': [483198, 239570, 793005, 661742, 366391, 615506, 938483, 873528,
              763436, 927225, 225686, 460894, 674641],
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'id': 610278,
  'first_name': 'Sofia',
  'last_name': 'Schneider',
  'age': 21,
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'id': 226486,
  'first_name': 'Ivan',
  'last_name': 'Schmidt',
  'age': 22,
  'friends': [840459, 849913, 550089],
  'papers': [{'journal': 'Nature',
              'title': 'Parents mention sons more often than daughters on '
                       'social media'},
             {'journal': 'Science',
              '

In [None]:
s = pformat(data)  # convert to a pretty format and store into string
print(s)

[{'age': 25,
  'first_name': 'Lukas',
  'friends': [483198,
              239570,
              793005,
              661742,
              366391,
              615506,
              938483,
              873528,
              763436,
              927225,
              225686,
              460894,
              674641],
  'id': 731128,
  'last_name': 'Strohmaier',
  'papers': [{'journal': 'PNAS',
              'title': 'Estimating educational outcomes from students’ short '
                       'texts on social media'}]},
 {'age': 21,
  'first_name': 'Sofia',
  'friends': [340424, 118896, 728115, 732455, 685114, 237262],
  'id': 610278,
  'last_name': 'Schneider',
  'papers': [{'journal': 'PNAS',
              'title': 'Schools are segregated by educational outcomes in the '
                       'digital space'}]},
 {'age': 22,
  'first_name': 'Ivan',
  'friends': [840459, 849913, 550089],
  'id': 226486,
  'last_name': 'Schmidt',
  'papers': [{'journal': 'Nature',
             