# Reading Data

There are many file formats. How do we make sense of them all?

* There are archive and compression formats such as .zip, .rar, .7z, and .tar, which hold other files.
* There are text formats such as .txt, .csv, .json, and .tsv, which humans can read in a text editor.
* There are binary formats such as .exe, .jpg, and .png, which are not human-readable.
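One rough way to tell text formats from binary formats is to check whether a file's first bytes decode as text. The sketch below is only a heuristic under stated assumptions: the helper name `looks_like_text` is hypothetical, and it assumes text files are UTF-8 encoded and contain no NUL bytes.

```python
import codecs


def looks_like_text(sample: bytes) -> bool:
    """Heuristic guess: True if the bytes look like UTF-8 encoded text."""
    # NUL bytes are common in binary formats and rare in text files
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that was
    # cut off at the end of the sample, but still rejects invalid bytes
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text("Rīga is the capital of Latvia".encode("utf-8")))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n"))  # False (start of a PNG file)
```

To apply this to a real file, open it in binary mode with `open(path, "rb")` and pass the result of `read(1024)` to the helper. A file that fails this check may still be text in another encoding, so treat the answer as a hint only.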

### Reading text files

In this section we will read a simple text file. It contains the opening of the English Wikipedia article about Riga: https://en.wikipedia.org/wiki/Riga

In [1]:
filename = "data/riga_wikipedia.txt"

In [2]:
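# open the file for reading; the with block closes it automatically,
# even if an error occurs while reading
# (the encoding is assumed to be UTF-8, since the text has non-ASCII characters)
with open(filename, encoding="utf-8") as file_1:
    # read the entire contents of the file into one string
    data = file_1.read()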

In [3]:
# print the first 100 characters of the file
print(data[:100])

Riga (/ˈriːɡə/; Latvian: Rīga [ˈriːɡa] (listen), Livonian: Rīgõ) is the capital of Latvia and is hom


In [13]:
# split text into tokens (words)
words = data.split()

In [12]:
# count the number of tokens in text

print(len(words))

257


In [14]:
# print the first 50 tokens
print(words[:50])

['Riga', '(/ˈriːɡə/;', 'Latvian:', 'Rīga', '[ˈriːɡa]', '(listen),', 'Livonian:', 'Rīgõ)', 'is', 'the', 'capital', 'of', 'Latvia', 'and', 'is', 'home', 'to', '671,000', 'inhabitants', '[10][11][12]', 'which', 'is', 'a', 'third', 'of', "Latvia's", 'population.', 'Being', 'significantly', 'larger', 'than', 'other', 'cities', 'of', 'Latvia,', 'Riga', 'is', 'the', "country's", 'primate', 'city.', 'It', 'is', 'also', 'the', 'largest', 'city', 'in', 'the', 'three']
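The token list above still has punctuation attached to words because `split()` with no arguments only splits on whitespace. A small illustration on a made-up sentence (the names `sample_text` and `sample_words` are used only for this example):

```python
# split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings
sample_text = "Riga is  the capital\nof Latvia, and a port."
sample_words = sample_text.split()

print(sample_words)
# ['Riga', 'is', 'the', 'capital', 'of', 'Latvia,', 'and', 'a', 'port.']
print(len(sample_words))  # 9
```

Note that the comma and the period stay attached to "Latvia," and "port.", just as "(listen)," does in the Riga text.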


### Counting word frequency

Here we will use the Counter class from the collections module in Python's standard library to determine the frequency of each word in the text.

https://docs.python.org/3/library/collections.html#collections.Counter
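Before applying it to the Riga text, here is a minimal sketch of how a Counter behaves on a tiny hand-made list. The names `sample_tokens` and `sample_counts` are used only for this illustration.

```python
from collections import Counter

sample_tokens = ["the", "city", "the", "Riga", "the", "city"]

# Counter tallies how many times each distinct item occurs
sample_counts = Counter(sample_tokens)

print(sample_counts["the"])          # 3
print(sample_counts["Latvia"])       # 0 (missing items count as zero, no KeyError)
print(sample_counts.most_common(2))  # [('the', 3), ('city', 2)]
```

Because a Counter is a dictionary subclass, the usual dictionary methods such as `items()` and `keys()` work on it as well.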

In [6]:
from collections import Counter

In [7]:
c = Counter(words)

In [8]:
# print the 20 most common words (tokens)
print(c.most_common(20))

[('the', 21), ('of', 13), ('is', 11), ('Riga', 9), ('and', 9), ('a', 5), ('in', 5), ('Baltic', 5), ('European', 5), ('World', 4), ('home', 3), ('to', 3), ('city', 3), ('was', 3), ('Union', 3), ('It', 2), ('largest', 2), ('three', 2), ('The', 2), ('lies', 2)]


In [11]:
# a nicer way of printing counter results using a *for* loop

for token, count in c.most_common(20):
    print(f"{token}: {count}")


the: 21
of: 13
is: 11
Riga: 9
and: 9
a: 5
in: 5
Baltic: 5
European: 5
World: 4
home: 3
to: 3
city: 3
was: 3
Union: 3
It: 2
largest: 2
three: 2
The: 2
lies: 2


Notice that the same word can appear both in lowercase ("the") and capitalized ("The"), and the counter treats them as different tokens. You may want to normalize the text by converting it all to lowercase and applying other clean-up steps, such as stripping punctuation.
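One possible normalization is sketched below, assuming that lowercasing and trimming ASCII punctuation from both ends of each token is enough. The helper name `normalize` and the sample tokens are made up for illustration, and `string.punctuation` covers ASCII punctuation only.

```python
import string
from collections import Counter


def normalize(tokens):
    """Lowercase each token and strip ASCII punctuation from both ends."""
    cleaned = (token.lower().strip(string.punctuation) for token in tokens)
    # drop tokens that become empty, e.g. a lone "-"
    return [token for token in cleaned if token]


sample = ["The", "city.", "the", "(listen),", "Latvia's", "-"]
normalized = normalize(sample)

print(normalized)
# ['the', 'city', 'the', 'listen', "latvia's"]
print(Counter(normalized).most_common(1))
# [('the', 2)]
```

Running `Counter(normalize(words))` on the tokens from the Riga text would merge "the" and "The" into one entry. Punctuation inside a token, such as the apostrophe in "Latvia's", is kept because `strip` only removes characters at the ends.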