# Top Slovene words

## Using BeautifulSoup with Slovene
This notebook scrapes a page listing the [2000 most frequently used words](http://bos.zrc-sazu.si/a_top2000.html) across 7 different linguistic corpora for Slovene, then prints a random Slovene word from that list. The steps:
* reassign the encoding setting for requests
* read the text into BeautifulSoup
* collect the text of every <code>td</code> tag that has no attributes into a set

#### What it doesn't do: 
Stemming/lemmatization of the words. As a result, some of the words appear in inflected forms that reflect one of the three genders, six cases, and/or three numbers (singular, dual, plural) in Slovene.

In [1]:
#Setting up my packages
import requests, random
from bs4 import BeautifulSoup

In [2]:
#Read in the page.  Get the table with data
top_2000 = requests.get('http://bos.zrc-sazu.si/a_top2000.html')

#The header in the HTML page says that the text encoding is ISO-8859-2. Reassign. 
top_2000.encoding = 'ISO-8859-2'

#lxml is the recommended parser for BeautifulSoup
top_2000_page = BeautifulSoup(top_2000.text,"lxml")
top_2000_table = top_2000_page.find_all("table")[1]


In [3]:
#Making a set of words from the BeautifulSoup parsed source page
words = set()
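for td in top_2000_table.find_all("td"):
    #Only cells without attributes hold words. td.string is None when a
    #cell contains nested tags, so guard against that before stripping.
    if not td.attrs and td.string:
        words.add(td.string.strip())

#Sorting the set by returning a list of the elements in a sorted order.
#sorted() compares Unicode code points, so words starting with č, š, ž come after z
top_words = sorted(words)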

#Select random word from list
print(random.choice(top_words))


koliko
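
The comment about special characters landing at the end follows from <code>sorted</code> comparing Unicode code points. If Slovene alphabetical order matters, one option is a custom sort key. The sketch below is not part of the original notebook: <code>SLOVENE_ALPHABET</code> and <code>slovene_key</code> are hypothetical names, and placing letters outside the Slovene alphabet (q, w, x, y) after ž is an assumption.

```python
# Hypothetical helper, not part of the original notebook.
SLOVENE_ALPHABET = "abcčdefghijklmnoprsštuvzž"
_RANK = {letter: index for index, letter in enumerate(SLOVENE_ALPHABET)}


def slovene_key(word):
    """Build a sort key that ranks each letter by its place in the Slovene alphabet."""
    # Assumption: anything outside the alphabet sorts after ž, then by code point.
    return [(_RANK.get(ch, len(_RANK)), ch) for ch in word.lower()]


sample = ["žena", "zakaj", "čas", "dan", "cesta", "šola", "sonce"]

by_code_point = sorted(sample)
by_alphabet = sorted(sample, key=slovene_key)

print(by_code_point)  # ['cesta', 'dan', 'sonce', 'zakaj', 'čas', 'šola', 'žena']
print(by_alphabet)    # ['cesta', 'čas', 'dan', 'sonce', 'šola', 'zakaj', 'žena']
```

In the notebook this would be <code>sorted(words, key=slovene_key)</code>. For picking a random word the order makes no difference, so this only matters if the sorted list is shown to a reader.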


In [209]:
len(words)

5528

I wanted to extract all of the words, but the page's table is one long run of <td> cells. The words sit in the <td> tags that do **not** have the align attribute; the cells with align hold the rank and the numbers. Each row runs across all seven columns (one per linguistic corpus), and for this project I don't care which corpus a word comes from. Here is a sample of one row from the original file.

In [None]:
<tr><td align=right>1.<td>   je<td align=right>34920<td>   je<td align=right>58338<td>   je<td align=right>35233<td>   je<td align=right>33210<td>   je<td align=right>33483<td>   je<td align=right>27892<td>   je<td align=right>32070<td align=right>   1.
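
To make the "no attributes" rule concrete, here is a small sketch that applies it to a shortened copy of the row above. It deliberately swaps BeautifulSoup for the standard library's <code>html.parser</code> so it runs without extra packages; <code>PlainCellCollector</code> is a hypothetical name, and this is an illustration of the filter rather than the notebook's actual code.

```python
from html.parser import HTMLParser


class PlainCellCollector(HTMLParser):
    """Collect the text of <td> cells that carry no attributes."""

    def __init__(self):
        super().__init__()
        self.in_plain_cell = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        # The page never closes its <td> tags, so each new tag simply
        # replaces the previous cell as the "current" one.
        self.in_plain_cell = (tag == "td" and not attrs)

    def handle_data(self, data):
        text = data.strip()
        if self.in_plain_cell and text:
            self.words.append(text)


# Shortened copy of the sample row (three of the seven corpora).
row = ("<tr><td align=right>1.<td>   je<td align=right>34920"
       "<td>   je<td align=right>58338<td>   je<td align=right>   1.")

collector = PlainCellCollector()
collector.feed(row)
collector.close()

print(collector.words)       # ['je', 'je', 'je']
print(set(collector.words))  # {'je'}
```

The same word shows up once per corpus, which is why the notebook gathers the cells into a set.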

## Errors

At first I saw a lot of: <code>UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 164: invalid start byte </code>

Not surprising, given that Slovene uses some special characters. 
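
The byte in that message is a useful clue. In ISO-8859-2, <code>0xb9</code> is the letter š, and on its own it is not valid UTF-8. The snippet below reproduces the error with a single character; that š is what tripped the decoder on the real page is a plausible guess, not something I checked.

```python
# One Slovene letter, encoded the way an ISO-8859-2 page would store it.
raw = "š".encode("iso-8859-2")
print(raw)  # b'\xb9'

try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as err:
    # Same complaint as in the traceback above.
    reason = err.reason
print(reason)  # invalid start byte

# Decoding with the matching codec recovers the letter.
decoded = raw.decode("iso-8859-2")
print(decoded)  # š
```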

### Here's what didn't work

* <code>top_2000_page = BeautifulSoup(top_2000.text,"lxml", from_encoding = "ISO-8859-2")</code>
* <code>top_2000_page = BeautifulSoup(top_2000.text,"lxml", from_encoding = "latin-1") </code>

Neither did trying other [possible encodings](https://docs.python.org/3.4/library/codecs.html#encodings-and-unicode): <code>utf-8, utf-16, cp852, cp1250</code>. Let's be honest, I was just guessing what might fit Central European diacritics.
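
In hindsight, one likely reason the <code>from_encoding</code> attempts changed nothing: BeautifulSoup only uses that argument when it is handed bytes. <code>top_2000.text</code> is already a decoded <code>str</code>, so the hint is ignored (BeautifulSoup emits a warning saying so). A minimal sketch, using a made-up one-line document in place of the real page and the built-in <code>html.parser</code> instead of lxml:

```python
import warnings

from bs4 import BeautifulSoup

# Stand-in for the raw bytes of the page (the real page is not fetched here).
raw = "<p>Dobrodošli</p>".encode("iso-8859-2")

# Bytes in: the from_encoding hint is honoured.
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-2")
print(soup_from_bytes.p.string)  # Dobrodošli

# str in: the text was already decoded (here with the wrong codec),
# so from_encoding cannot fix it and is ignored.
already_decoded = raw.decode("iso-8859-1")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "will be ignored" warning
    soup_from_str = BeautifulSoup(
        already_decoded, "html.parser", from_encoding="iso-8859-2"
    )
print(soup_from_str.p.string)  # Dobrodo¹li
```

With Requests, the byte-level equivalent of <code>top_2000.text</code> is <code>top_2000.content</code>, so passing that along with <code>from_encoding</code> would be another route to the same result.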

So I went **upstream** to the request for the page itself. When I called <code>top_2000.text</code> the output looked...crazy. Not correct at all. 
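
For a sense of what "crazy" looks like, here is a small reconstruction of ISO-8859-2 bytes read as ISO-8859-1. The sample string is mine, not output from the page. Because ISO-8859-1 maps every byte to a character, the damage is reversible.

```python
original = "Dobrodošli, čšž"

# Stand-in for the bytes the server sends.
raw = original.encode("iso-8859-2")

# What you get when those bytes are decoded as Latin-1 instead of Latin-2.
garbled = raw.decode("iso-8859-1")
print(garbled)  # Dobrodo¹li, è¹¾

# Undo the wrong decode, then decode with the right codec.
repaired = garbled.encode("iso-8859-1").decode("iso-8859-2")
print(repaired)  # Dobrodošli, čšž
```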

### Here's what did work

Checking the text encoding declared in the page's HTML head. Yep. That's where I should have started hours ago, before getting into a much bigger rabbit hole than it seems here.

<code><meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-2"></code>
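
Reading that declaration can be automated. The helper below is a rough sketch with hypothetical names (<code>declared_charset</code>, <code>META_CHARSET</code>): it searches the first couple of kilobytes of the raw bytes for a <code>charset=</code> value. It is a plain regex, not a full HTML parser, so treat it as a quick check only.

```python
import re

# Matches charset=VALUE, with or without quotes, in raw (undecoded) HTML.
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def declared_charset(raw_html, default=None):
    """Return the charset named near the top of raw_html, or default if none."""
    match = META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode("ascii") if match else default


head = b'<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-2">'
found = declared_charset(head)
missing = declared_charset(b"<html><body>no declaration</body></html>")

print(found)    # ISO-8859-2
print(missing)  # None
```

In the notebook this could be called as <code>declared_charset(top_2000.content)</code> before touching <code>top_2000.text</code>.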

Then I tested the initial text from the request.
<code>top_2000 = requests.get('http://bos.zrc-sazu.si/a_top2000.html')</code>
<code>top_2000.encoding</code>
It claimed to be ISO-8859-1, but I had a suspicion Python was lying (thanks, [StackOverflow](http://stackoverflow.com/questions/27109725/python-and-beautifulsoup-encoding-issue-from-utf-8), for the corroboration).
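
The ISO-8859-1 answer is less a lie than a fallback: Requests guesses the encoding from the HTTP <code>Content-Type</code> header, and for <code>text/*</code> responses without a charset it assumes ISO-8859-1. It does not look at the meta tag inside the page. The sketch below calls the Requests helper that does this guessing; that this server sent a header without a charset is my assumption, not something verified here.

```python
from requests.utils import get_encoding_from_headers

# Assumed header: a text type with no charset, so Requests falls back.
guess_without_charset = get_encoding_from_headers({"content-type": "text/html"})
print(guess_without_charset)  # ISO-8859-1

# If the server had named the charset, Requests would have used it.
guess_with_charset = get_encoding_from_headers(
    {"content-type": "text/html; charset=ISO-8859-2"}
)
print(guess_with_charset)  # ISO-8859-2
```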

#### The key to unlocking the beauty of proper encoding...
Looking at the [Requests documentation for encoding](http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request). Since I knew the text was jumbled before it even got into BeautifulSoup, Requests was the logical next step to check. So I reassigned the encoding.

<code>top_2000.encoding = 'ISO-8859-2'</code>

After reassigning the encoding, I ran the text through BeautifulSoup, as seen above, with lxml. 

Then the clouds parted, the sun emerged, and I felt victorious. Beautiful, beautiful characters emerged. Dobrodošli, čšž.