# Reading Data

There are many file formats. How do we make sense of them all?

* Archive and compression formats such as .zip, .rar, .7z, and .tar hold other files.
* Text formats such as .txt, .csv, .json, and .tsv can be read by humans in a text editor.
* Binary formats such as .exe, .jpg, and .png are not human-readable.
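
The difference between text and binary formats can be illustrated with a small sketch. The helper name `looks_like_text` and the sample byte strings are assumptions made for illustration. The check only tests whether the bytes decode as UTF-8, which is a rough heuristic rather than a reliable format detector.

```python
def looks_like_text(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (a rough heuristic)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: UTF-8 encoded text and the first bytes of a PNG file
text_sample = "Rīga is the capital of Latvia".encode("utf-8")
png_sample = b"\x89PNG\r\n\x1a\n"

print(looks_like_text(text_sample))
print(looks_like_text(png_sample))
```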

### Reading text files

In this section we will read a simple text file. It contains the beginning of the English Wikipedia article about Riga: https://en.wikipedia.org/wiki/Riga

In [1]:
filename = "data/riga_wikipedia.txt"

In [2]:
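# open the file for reading; the "with" block closes the file automatically,
# and the explicit encoding keeps non-ASCII characters intact on every platform
with open(filename, encoding="utf-8") as file_1:
    # read contents of the file
    data = file_1.read()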

Note: Reading a local file as shown above will fail if you run it in Google Colab.

If that is the case, uncomment the code below (remove the # marks from the start of each line) to read the same file from a remote web location (a GitHub repository in this case):

In [3]:
#import requests

#url = "https://raw.githubusercontent.com/ValRCS/BSSDH_22/main/notebooks/data/riga_wikipedia.txt"

#response = requests.get(url)
#data = response.text

In [4]:
# print the first 100 characters of the file
print(data[:100])

Riga (/ˈriːɡə/; Latvian: Rīga [ˈriːɡa] (listen), Livonian: Rīgõ) is the capital of Latvia and is hom


In [5]:
# split text into tokens (words)
words = data.split()

In [6]:
# count the number of tokens in text

print(len(words))

257


In [7]:
# print the first 50 tokens
print(words[:50])

['Riga', '(/ˈriːɡə/;', 'Latvian:', 'Rīga', '[ˈriːɡa]', '(listen),', 'Livonian:', 'Rīgõ)', 'is', 'the', 'capital', 'of', 'Latvia', 'and', 'is', 'home', 'to', '671,000', 'inhabitants', '[10][11][12]', 'which', 'is', 'a', 'third', 'of', "Latvia's", 'population.', 'Being', 'significantly', 'larger', 'than', 'other', 'cities', 'of', 'Latvia,', 'Riga', 'is', 'the', "country's", 'primate', 'city.', 'It', 'is', 'also', 'the', 'largest', 'city', 'in', 'the', 'three']


### Counting word frequency

Here we will use Python's Counter class (from the collections module of the standard library) to determine the word frequencies in the text.

https://docs.python.org/3/library/collections.html#collections.Counter

In [8]:
from collections import Counter

In [9]:
c = Counter(words)

In [10]:
# print the 20 most common words (tokens)
print(c.most_common(20))

[('the', 21), ('of', 13), ('is', 11), ('Riga', 9), ('and', 9), ('a', 5), ('in', 5), ('Baltic', 5), ('European', 5), ('World', 4), ('home', 3), ('to', 3), ('city', 3), ('was', 3), ('Union', 3), ('It', 2), ('largest', 2), ('three', 2), ('The', 2), ('lies', 2)]


In [11]:
# a nicer way to print the Counter results, using a for loop

for token, count in c.most_common(20):
    print(f"{token}: {count}")


the: 21
of: 13
is: 11
Riga: 9
and: 9
a: 5
in: 5
Baltic: 5
European: 5
World: 4
home: 3
to: 3
city: 3
was: 3
Union: 3
It: 2
largest: 2
three: 2
The: 2
lies: 2


Notice how the same word may appear both in lowercase ("the") and capitalized ("The"). You may want to normalize the text by converting it all to lowercase and applying other clean-up steps, such as removing punctuation attached to tokens.
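
One possible way to do such normalization is sketched here. The sample sentence is taken from the first tokens printed earlier, while the helper name `normalize` is an assumption made for illustration. Note that `string.punctuation` covers ASCII punctuation only, so other characters (for example typographic quotes) would need extra handling.

```python
import string
from collections import Counter

# The first two sentences of the text, without the pronunciation notes
sample = (
    "Riga is the capital of Latvia and is home to 671,000 inhabitants "
    "which is a third of Latvia's population. Being significantly larger "
    "than other cities of Latvia, Riga is the country's primate city."
)


def normalize(token):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return token.lower().strip(string.punctuation)


# normalize every token and drop tokens that become empty
tokens = [normalize(t) for t in sample.split()]
tokens = [t for t in tokens if t]

counts = Counter(tokens)
print(counts.most_common(5))
```

Punctuation inside a token (as in "671,000" or "Latvia's") is left untouched, because only the start and end of each token are stripped.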

### 🔴 Note: perhaps a CSV file example would also be useful here 🔴
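
A CSV (comma-separated values) file can be read with the csv module from Python's standard library. This is a minimal sketch with made-up content: the column names and rows are assumptions for illustration, and `io.StringIO` stands in for an opened file. The quoted field shows why splitting each line on commas by hand is not enough.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO lets us treat a string like an open file
csv_text = '''city,country,note
Riga,Latvia,"capital, largest city"
Tartu,Estonia,university town
'''

# DictReader uses the first row as column names
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["city"], "|", row["note"])
```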

### Reading TSV files

The corpora that we will work with are stored as archived TSV (tab-separated values) files:
https://github.com/ValRCS/BSSDH_22/tree/main/corpora

These files consist of rows (records) that contain one or more values separated by "Tab" characters.
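
To make the row structure concrete, here is a small sketch that splits a TSV snippet by hand. The snippet is a shortened, hypothetical stand-in for the real corpus file (the values are borrowed from the table shown later in this section); "\t" is how Python writes the tab character.

```python
# A hypothetical two-line TSV snippet: a header row and one record
tsv_text = "Language\tSource\tDate\nLatvian\tdiena.lv\t2012/01/10\n"

lines = tsv_text.splitlines()
header = lines[0].split("\t")
record = lines[1].split("\t")

# pair each column name with the value from the record
print(dict(zip(header, record)))
```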

We will use the pandas library to read a TSV file that contains a smaller version of the "lv_old_newspapers.zip" corpus: https://github.com/ValRCS/BSSDH_22/blob/main/corpora/lv_old_newspapers_5k.tsv

In [12]:
import pandas

In [13]:
filename = "../corpora/lv_old_newspapers_5k.tsv"

In [14]:
# read the tab-separated file (the "sep" parameter tells pandas that values
# in the file are separated by the tab character)

df_1 = pandas.read_csv(filename, sep="\t")

Note: Reading a local file as shown above will fail if you run it in Google Colab.

If that is the case, uncomment the code below (remove the # marks from the start of each line) to read the same file from a remote web location (a GitHub repository in this case):


In [15]:
#url = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers_5k.tsv"

#df_1 = pandas.read_csv(url, sep="\t")

In [16]:
df_1

| Unnamed: 0 | Language | Source | Date | Text |
|---|---|---|---|---|
| 0 | Latvian | rekurzeme.lv | 2008/09/04 | "Viņa pirmsnāves zīmītē bija rakstīts vienīgi ... |
| 1 | Latvian | diena.lv | 2012/01/10 | info@zurnalistiem.lv |
| 2 | Latvian | bauskasdzive.lv | 2007/12/27 | Bhuto, kas Pakistānā no trimdas atgriezās tika... |
| 3 | Latvian | bauskasdzive.lv | 2008/10/08 | Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol... |
| 4 | Latvian | diena.lv | 2011/10/05 | CVK bija vērsusies Skaburska, lūdzot izskaidro... |
| ... | ... | ... | ... | ... |
| 4994 | Latvian | zz.lv | 2011/04/25 | Mākslas dienu laikā Jelgavā viesosies Baltkrie... |
| 4995 | Latvian | diena.lv | 2010/05/12 | ""Melnās Piektdienas” četru pastāvēšanas gadu ... |
| 4996 | Latvian | diena.lv | 2009/09/17 | Nevaru mierā nosēdēt |
| 4997 | Latvian | zz.lv | 2010/02/25 | Vairāki pasākumi veltīti jaunākajiem lasītājie... |


In [17]:
# the size of the corpus:

print(len(df_1))

4999


In [18]:
# select the Text column, show the first 10 entries

df_1["Text"][:10]

0    "Viņa pirmsnāves zīmītē bija rakstīts vienīgi ...
1                                 info@zurnalistiem.lv
2    Bhuto, kas Pakistānā no trimdas atgriezās tika...
3    Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol...
4    CVK bija vērsusies Skaburska, lūdzot izskaidro...
5    Apbalvojumus piešķir piemiņas zīmes valde Saei...
6    - Amerikā biju uzaicināts viesoties ar visu ģi...
7    Mūrniece gan saka, ka Lužkova bitēm Latvijas p...
8    PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ...
9    Ar Ivaru tikāmies viņa dzimtajos "Lazdiņos" Za...
Name: Text, dtype: object

#### 🔴 Note: it would be useful to show word frequency counting on the df_1 data here 🔴
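
A possible way to count word frequencies over a whole column of texts is sketched here. The function name `count_tokens` and the sample list are assumptions for illustration. The function accepts any iterable of strings, so it does not depend on pandas itself, and it skips values that are not strings (pandas represents missing values as NaN).

```python
from collections import Counter


def count_tokens(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counter = Counter()
    for text in texts:
        # skip missing values, which are not strings
        if isinstance(text, str):
            counter.update(text.split())
    return counter


# Hypothetical stand-in for the values of the Text column
sample_texts = ["Nevaru mierā nosēdēt", "info@zurnalistiem.lv", "Nevaru mierā"]

counts = count_tokens(sample_texts)
for token, count in counts.most_common(3):
    print(f"{token}: {count}")
```

With the DataFrame loaded above, calling `count_tokens(df_1["Text"]).most_common(20)` should list the 20 most frequent tokens in the corpus sample, since iterating over a pandas column yields its values one by one.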


### Reading archived files

Pandas can also read archived CSV and TSV files.

In [19]:
filename_2 = "../corpora/lv_old_newspapers.zip"

In [20]:
# read the archived, tab-separated file ("compression" parameter tells
# Pandas that this is a ZIP archived file).

df_2 = pandas.read_csv(filename_2, sep="\t", compression="zip")

Note: Reading a local file as shown above will fail if you run it in Google Colab.

If that is the case, uncomment the code below (remove the # marks from the start of each line) to read the same file from a remote web location (a GitHub repository in this case):

In [21]:
#url_2 = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers.zip"

#df_2 = pandas.read_csv(url_2, sep="\t", compression="zip")
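
To see what a ZIP archive holds without pandas, the zipfile module from the standard library can be used. This is a minimal sketch under stated assumptions: the archive is built in memory, and the member name "sample.tsv" and its content are made up for illustration rather than taken from the real corpus.

```python
import io
import zipfile

# Build a small ZIP archive in memory (hypothetical file name and content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("sample.tsv", "Language\tSource\nLatvian\tdiena.lv\n")

# Reopen the archive for reading and list the files it holds
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    # read the first member and decode its bytes into text
    content = archive.read(names[0]).decode("utf-8")

print(names)
print(content.splitlines())
```

When `compression="zip"` is passed to `pandas.read_csv`, the archive is expected to contain exactly one data file, which is the case in this sketch.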

In [22]:
# the size of the corpus:

print(len(df_2))

319428


In [23]:
# show the last 10 entries

df_2[-10:]

| Unnamed: 0 | Language | Source | Date | Text |
|---|---|---|---|---|
| 319418 | Latvian | bdaugava.lv | 2010/01/16 | Ceturtdien no rajona padomes ēkas tika svinīgi... |
| 319419 | Latvian | nra.lv | 2011/12/21 | AFP vēsta, ka naktī uz otrdienu, jau piekto na... |
| 319420 | Latvian | db.lv | 2011/12/02 | TOP 500 ir vienīgais ikgadējais izdevums Latvi... |
| 319421 | Latvian | diena.lv | 2009/12/21 | ka pati visu mūžu bijusi saistīta ar šo jomu. ... |
| 319422 | Latvian | la.lv | 2011/12/08 | Prakse liecina, ka tādos gadījumos tiesu izpil... |
| 319423 | Latvian | ziemellatvija.lv | 2008/01/30 | Beigu beigās I. Klempere kopā ar dēlu mājās de... |
| 319424 | Latvian | db.lv | 2012/01/03 | Vienkāršā valodā tas nozīmē, ka investori par ... |
| 319425 | Latvian | la.lv | 2011/08/27 | – Visi mūsu projekti ir notikuši sadarbībā ar ... |
| 319426 | Latvian | ziemellatvija.lv | 2007/03/12 | Pole atzina, ka par šo ziņojumu VM saņēmusi li... |
| 319427 | Latvian | bauskasdzive.lv | 2011/07/07 | Trešdienas, 6. jūlija, vakarā projekta vadītāj... |
