# Laboratory 02

## Requirements

For the second part of the exercises you will need the `wikipedia` package. On Windows machines, use the following command in the Anaconda Prompt (`Start --> Anaconda --> Anaconda Prompt`):

    conda install -c conda-forge wikipedia
    
The same command should also work in Anaconda environments on other platforms (macOS, Linux).

If you are using virtualenv directly instead of Anaconda, the following command installs it in your virtualenv:

    pip install wikipedia

or

    sudo pip install wikipedia
    
installs it system-wide.

You are encouraged to reuse functions that you defined in earlier exercises.

## 1.1 Define a function that takes a sequence as its input and returns whether the sequence is symmetric. A sequence is symmetric if it is equal to its reverse.

In [1]:
def is_symmetric(l):
    for i in range(len(l) // 2):
        if l[i] != l[len(l)-i-1]:
            return False
    return True

# idiomatic solution
def is_symmetric(l):
    return all(l[i] == l[len(l)-i-1] for i in range(len(l) // 2))

assert(is_symmetric([1]) == True)
assert(is_symmetric([]) == True)
assert(is_symmetric([1, 2, 3, 1]) == False)
assert(is_symmetric([1, "foo", "bar", "foo", 1]) == True)
assert(is_symmetric("abcba") == True)

## 1.2 Define a function that takes a sequence and an integer $k$ as its input and returns the $k$ largest elements. Do not use the built-in `max` function. Do not change the original sequence. If $k$ is not specified, return a list containing only the largest element.

In [2]:
def k_largest(l, k=1):
    return sorted(l)[-k:]
    

l = [-1, 0, 3, 2]

assert(k_largest(l) == [3])
assert(k_largest(l, 2) == [2, 3] or k_largest(l, 2) == [3, 2])

## \*1.3 Add an optional `key` argument to `k_largest` that works analogously to the `key` argument of the built-in `sorted`.
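One possible solution, sketched under the assumption that this exercise extends `k_largest` from 1.2. The `key` parameter is passed straight through to `sorted`; the guard for non-positive `k` is an addition that the exercise text does not ask for.

```python
def k_largest(l, k=1, key=None):
    """Return the k largest elements of l in ascending order.

    key is an optional one-argument function, as in the built-in sorted.
    """
    if k <= 0:
        # Without this guard, sorted(l)[-0:] would return the whole list
        return []
    # sorted() builds a new list, so the input sequence is left unchanged
    return sorted(l, key=key)[-k:]


words = ["pear", "fig", "banana", "kiwi"]
print(k_largest(words, 2, key=len))
```

With `key=len` the words are ranked by length, so the call returns the two longest words rather than the two that come last alphabetically.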

Define a function that takes a matrix as an input represented as a list of lists (you can assume that the input is a valid matrix). Return its transpose without changing the original matrix.

In [3]:
def transpose(M):
    Mt = []
    for j in range(len(M[0])):
        Mt.append([])
        for i in range(len(M)):
            Mt[-1].append(M[i][j])
    return Mt

m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[1, 4], [2, 5], [3, 6]]

assert(transpose(m1) == m2)
assert(transpose(transpose(m1)) == m1)
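A more compact variant of the transpose can be sketched with the built-in `zip`. This is an alternative to the nested loops above, not a replacement for them; note that it returns an empty list for an empty matrix, a case the exercise does not cover.

```python
def transpose(M):
    # zip(*M) groups the i-th elements of every row, i.e. it yields the columns.
    # Each column comes back as a tuple, so convert it to a list.
    return [list(column) for column in zip(*M)]


m1 = [[1, 2, 3], [4, 5, 6]]
print(transpose(m1))
```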

## 2.1 Define a function that takes a string as its input and returns a dictionary with the character frequencies.

In [4]:
def char_freq(s):
    freq = {}
    for c in s:
        if c not in freq:
            freq[c] = 0
        freq[c] += 1
    return freq
    
assert(char_freq("aba") == {"a": 2, "b": 1})

## 2.2 Add an optional `skip_symbols` argument to the `char_freq` function. `skip_symbols` is the set of symbols that should be excluded from the frequency dictionary. If this argument is not specified, the function should include every symbol.

In [5]:
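def char_freq_with_skip(s, skip_symbols=None):
    # Treat a missing argument as "skip nothing"; `c in None` would raise a TypeError
    if skip_symbols is None:
        skip_symbols = set()
    freq = {}
    for c in s:
        if c in skip_symbols:
            continue
        if c not in freq:
            freq[c] = 0
        freq[c] += 1
    return freq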
    
assert(char_freq_with_skip("ab.abc?", skip_symbols=".?") == {"a": 2, "b": 2, "c": 1})

## 2.2 Define a function that computes word frequencies in a text.

In [6]:
def word_freq(s):
    freq = {}
    for word in s.split():
        if word not in freq:
            freq[word] = 0
        freq[word] += 1
    return freq
    
s = "the green tea and the black tea"

assert(word_freq(s) == {"the": 2, "tea": 2, "green": 1, "black": 1, "and": 1})

## 2.3 Define a function that counts the uppercase letters in a string.

In [7]:
def count_upper_case(s):
    cnt = 0
    for c in s:
        if c.isupper():
            cnt += 1
    return cnt

# idiomatic solution
def count_upper_case(s):
    return sum(int(c.isupper()) for c in s)
    
assert(count_upper_case("A") == 1)
assert(count_upper_case("abA bcCa") == 2)

## 2.4 Define a function that takes two strings and decides whether they are anagrams. A string is an anagram of another string if its letters can be rearranged so that it equals the other string.

For example:

```
abc -- bac
aabb -- abab
```

Counterexamples:

```
abc -- aabc
abab -- aaab
```

In [None]:
def anagram(s1, s2):
    # TODO

assert(anagram("abc", "bac") == True)
assert(anagram("aabb", "abab") == True)
assert(anagram("abab", "aaab") == False)
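One way to approach this exercise, given only as a sketch: two strings are anagrams exactly when they contain the same characters the same number of times, so comparing their sorted characters is enough. This version treats case and whitespace as significant, which is an assumption because the exercise does not say either way.

```python
def anagram(s1, s2):
    # sorted() turns each string into an ordered list of its characters;
    # the lists are equal only if both strings use the same characters
    # with the same multiplicities.
    return sorted(s1) == sorted(s2)


print(anagram("abc", "bac"))
print(anagram("abc", "aabc"))
```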

## 2.5 Define a sentence splitter function that takes a string and splits it into a list of sentences. A sentence ends with a `.` that is followed by whitespace (`str.isspace`) or by the end of the string. The terminating `.` is not part of the sentence. See the examples below.

In [None]:
def sentence_splitter(s):
    # TODO
        
assert(sentence_splitter("A.b. acd.") == ['A.b', 'acd'])
assert(sentence_splitter("A. b. acd.") == ['A', 'b', 'acd'])
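A possible solution, sketched as a single pass over the characters. Two choices here are assumptions rather than requirements of the exercise: surrounding whitespace is stripped from each sentence, and trailing text without a closing `.` is kept as a final sentence.

```python
def sentence_splitter(s):
    sentences = []
    current = []
    for i, c in enumerate(s):
        at_end = i == len(s) - 1
        # A period closes a sentence only before whitespace or at the end
        if c == "." and (at_end or s[i + 1].isspace()):
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
        else:
            current.append(c)
    # Keep any trailing text that is not terminated by a period
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(sentence_splitter("A.b. acd."))
```

The period in `A.b` is followed by `b`, not by whitespace, so it does not end a sentence.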

## Wikipedia module

The following exercises use the `wikipedia` package. The basic usage is illustrated below.

The documentation is available [here](https://pypi.python.org/pypi/wikipedia/).

Searching for pages:

In [33]:
import wikipedia

results = wikipedia.search("Budapest")
results

['Budapest',
 'Budapest Honvéd',
 'Gare de Budapest-Nyugati',
 'Arrondissements de Budapest',
 'Métro de Budapest',
 'Bataille de Budapest',
 'Papp László Budapest Sportaréna',
 'MTK Budapest FC',
 'Gare de Budapest-Déli',
 'Ligne H8 du HÉV de Budapest']

Downloading an article:

In [34]:
article = wikipedia.page("Budapest")

article.summary[:100]

'Budapest (prononcé [by.da.ˈpɛst] , hongrois : Budapest [ˈbu.dɒ.pɛʃt]  ; allemand : Budapest ou ancie'

The content attribute contains the full text:

In [35]:
type(article.content), len(article.content)

(str, 92420)

By default the module downloads pages from the English Wikipedia. The language can be changed as follows:

In [36]:
wikipedia.set_lang("fr")

In [37]:
wikipedia.search("Budapest")

['Budapest',
 'Budapest Honvéd',
 'Gare de Budapest-Nyugati',
 'Arrondissements de Budapest',
 'Bataille de Budapest',
 'Métro de Budapest',
 'Papp László Budapest Sportaréna',
 'MTK Budapest FC',
 'Gare de Budapest-Déli',
 'MTK Budapest']

In [38]:
fr_article = wikipedia.page("Budapest")
fr_article.summary[:100]

'Budapest (prononcé [by.da.ˈpɛst] , hongrois : Budapest [ˈbu.dɒ.pɛʃt]  ; allemand : Budapest ou ancie'

## 3.0 Change the language back to English and test the package with a few other pages.

## 3.1 Download 4-5 arbitrary pages from the English Wikipedia (they should exceed 100000 characters combined) and compute the word frequencies using your previously defined function(s). Print the 20 most common words in the following format (the example is not the correct answer):

```
unintelligent<TAB>123456
moribund<TAB>123451
...
```

Each word is separated from its frequency by a single TAB character, and no additional whitespace should be added.
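The ranking and formatting step can be sketched without network access. The sample string below stands in for the concatenated `article.content` of the downloaded pages, and `format_most_common` is a hypothetical helper name. Breaking ties alphabetically is an assumption made only to keep the output deterministic.

```python
def word_freq(s):
    freq = {}
    for word in s.split():
        freq[word] = freq.get(word, 0) + 1
    return freq


def format_most_common(freq, n=20):
    # Sort by descending frequency, then alphabetically for ties
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    # Exactly one TAB between the word and its count, no other whitespace
    return ["{}\t{}".format(word, count) for word, count in ranked[:n]]


text = "the green tea and the black tea"  # placeholder for the real articles
for line in format_most_common(word_freq(text), n=3):
    print(line)
```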

## 3.2 Repeat the same exercise for your native language if it denotes word boundaries with spaces. If it doesn't, choose an arbitrary language other than English.

## 3.3 Define a function that takes a string and returns its bigram frequencies as a dictionary.

Character bigrams are pairs of consecutive characters. For example, the word `apple` contains the following bigrams: `ap, pp, pl, le`.

They are used for language modeling. 
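The bigram definition above could be implemented roughly as follows. This is a minimal sketch that counts every pair of adjacent characters, including pairs that contain whitespace or punctuation; whether such pairs should be filtered out is left open by the exercise.

```python
def bigram_freq(s):
    freq = {}
    # zip(s, s[1:]) pairs every character with its successor
    for first, second in zip(s, s[1:]):
        bigram = first + second
        freq[bigram] = freq.get(bigram, 0) + 1
    return freq


print(bigram_freq("apple"))
```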

## 3.4 Using your previous English collection compute bigram frequencies.

What are the 10 most common and 10 least common bigrams?

## \*3.5 Define a function that takes two parameters: a string and an integer N and returns the N-gram frequencies of the string. For $N=2$ the function should behave the same as the bigram function from exercise 3.3.

Try the function for $N=1..5$. How many unique N-grams are in your collection?
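A sketch of one possible generalization, based on sliding a window of length `n` over the string. The sample text is a placeholder for your own collection, and raising `ValueError` for `n < 1` is an assumption not required by the exercise.

```python
def ngram_freq(s, n):
    if n < 1:
        raise ValueError("n must be a positive integer")
    freq = {}
    # A string of length L contains L - n + 1 windows of length n
    for i in range(len(s) - n + 1):
        ngram = s[i:i + n]
        freq[ngram] = freq.get(ngram, 0) + 1
    return freq


text = "banana"  # placeholder for the real collection
for n in range(1, 6):
    # The number of unique N-grams is the number of dictionary keys
    print(n, len(ngram_freq(text, n)))
```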

## 3.6 Compute the same statistics for your native language.