# Ex 1

Take a text from Project Gutenberg and clean it (i.e., remove the Project Gutenberg messages from the head and the tail of the text using any text editor). Once you have a clean text in your language of choice, write a function that takes the path to that file as input (the path itself is a string) and returns a list of lists, where each inner list represents a sentence as a list of its tokens. Something like:

```python
input_text = "I am blue. You are green!"
output = [["I", "am", "blue", "."], ["You", "are", "green", "!"]]
```

In order to do this, you will have to build two tokenizers (you will decide whether or not to do that in the same function): one that takes care of the sentence tokenization and one that takes care of the word tokenization. You will use the "re" module to build the tokenizers.
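
As a quick check of the expected output format, here is a minimal sketch of the two tokenizers working together. The function names (`split_sentences`, `split_words`, `tokenize`) and both patterns are illustrative assumptions, not the only way to solve the exercise; the lookbehind only protects the abbreviation "Dr.".

```python
import re

def split_sentences(text):
    # A sentence is the shortest run of characters ending in ., ! or ?;
    # the negative lookbehind keeps the period of "Dr." from ending a sentence.
    return [s.strip() for s in re.findall(r".*?(?<!Dr)[.!?]+", text)]

def split_words(sentence):
    # Either a word (possibly with internal hyphens or apostrophes)
    # or a single non-space, non-word symbol such as a punctuation mark.
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", sentence)

def tokenize(text):
    return [split_words(s) for s in split_sentences(text)]

print(tokenize("I am blue. You are green!"))
# → [['I', 'am', 'blue', '.'], ['You', 'are', 'green', '!']]
```

Keeping the two steps in separate functions makes it easy to test each pattern on its own before running it on a whole book.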


In [136]:
import re

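# utf-8-sig drops the byte-order mark, if the file starts with one
with open("../data/proposal.txt", "r", encoding="utf-8-sig") as file:
    raw_text = file.read()
# flatten line breaks so that a sentence can span several lines of the file
raw_text = re.sub("\n", " ", raw_text)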

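def sent_tokenizer(text):
    # a sentence ends at a run of ., ! or ?, unless the period belongs to "Dr."
    # (a character class such as [^Dr\.] would exclude any single D, r or . instead)
    return re.findall(r".*?(?<!Dr)[.!?]+", text)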

def word_tokenizer(sentence):
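    # put a space between a letter and the punctuation mark that follows it;
    # [A-Za-z] is used because [A-z] also matches [ \ ] ^ _ and the backtick
    punct = r"""([A-Za-z])([,;:?!."'])"""
    temp_sentence = re.sub(punct, r"\1 \2", sentence)
    toks = temp_sentence.split()
    temp_out = []
    # split the English possessive (’s or a bare ’) off the end of a token
    for tok in toks:
        if re.search(r"[A-Za-z]+’s?$", tok):
            temp_out.extend(re.sub(r"([A-Za-z]+)(’s?)$", r"\1 \2", tok).split())
        else:
            temp_out.append(tok)
    return temp_out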

def my_tokenizer(text):
    sentences = sent_tokenizer(text)
    tokenized_text = []
    for sent in sentences:
        tokenized_text.append(word_tokenizer(sent))
    return tokenized_text

my_tokenizer(raw_text)

[['A',
  'Modest',
  'Proposal',
  'For',
  'preventing',
  'the',
  'children',
  'of',
  'poor',
  'people',
  'in',
  'Ireland',
  ',',
  'from',
  'being',
  'a',
  'burden',
  'on',
  'their',
  'parents',
  'or',
  'country',
  ',',
  'and',
  'for',
  'making',
  'them',
  'beneficial',
  'to',
  'the',
  'publick',
  '.'],
 ['by',
  'Dr',
  '.',
  'Jonathan',
  'Swift',
  '1729',
  'It',
  'is',
  'a',
  'melancholy',
  'object',
  'to',
  'those',
  ',',
  'who',
  'walk',
  'through',
  'this',
  'great',
  'town',
  ',',
  'or',
  'travel',
  'in',
  'the',
  'country',
  ',',
  'when',
  'they',
  'see',
  'the',
  'streets',
  ',',
  'the',
  'roads',
  ',',
  'and',
  'cabbin-doors',
  'crowded',
  'with',
  'beggars',
  'of',
  'the',
  'female',
  'sex',
  ',',
  'followed',
  'by',
  'three',
  ',',
  'four',
  ',',
  'or',
  'six',
  'children',
  ',',
  'all',
  'in',
  'rags',
  ',',
  'and',
  'importuning',
  'every',
  'passenger',
  'for',
  'an',
  'alms',
  '.

# Ex 2

Given the better dictionary structure that we created, try to determine how precise our current lemmatization approach can be. More precisely, we now have a gold-standard dataset (the CoNLL-U file in the [GitHub repository](https://github.com/UniversalDependencies/UD_English-EWT) seen in class) which contains information about the tokens, the POS tags and the lemmas in a text. Use the token + POS information to obtain a lemmatization via the [ENGLAFF](http://redac.univ-tlse2.fr/lexiques/englaff.html)-based lemmatizer; note that you need to rewrite the function "lemmatizer" to fit the new data structure in the "better_d" variable, and that you have to find a way to convert the POS tags found in the CoNLL-U file to match the ones found in the ENGLAFF file (or vice versa). How can you measure the precision of our tool?

In [8]:
import pickle

with open('better_d.pickle', 'rb') as handle:
    better_d = pickle.load(handle)
    
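# keep FORM, LEMMA and UPOS (columns 2-4) of every line that starts with a token ID;
# multiword-token ranges such as 1-2 are kept too and carry "_" as their POS
UD_dataset = []
with open("en_ewt-ud-dev.conllu", encoding="utf8") as conllu_file:
    for line in conllu_file:
        if re.search("^[0-9]", line):
            UD_dataset.append(line.split("\t")[1:4])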

In [81]:
UD_dataset = {el[0] : [el[1], el[2]] for el in UD_dataset}
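
Keying the dictionary on the surface form alone keeps only the last (lemma, POS) pair seen for each form, so ambiguous forms lose their other readings. A minimal sketch of an alternative that keeps every reading is given below; the rows are made up for illustration and the name `readings` is hypothetical.

```python
from collections import defaultdict

# Toy (FORM, LEMMA, UPOS) rows in the same shape as the lists collected above;
# the rows themselves are invented for this example.
rows = [
    ["saw", "see", "VERB"],
    ["saw", "saw", "NOUN"],
    ["dogs", "dog", "NOUN"],
]

# form -> set of (lemma, POS) pairs, so a later row never overwrites an earlier one
readings = defaultdict(set)
for form, lemma, pos in rows:
    readings[form].add((lemma, pos))

# the plain dict comprehension keeps only the last reading of "saw"
last_only = {form: [lemma, pos] for form, lemma, pos in rows}

print(sorted(readings["saw"]))  # → [('saw', 'NOUN'), ('see', 'VERB')]
print(last_only["saw"])         # → ['saw', 'NOUN']
```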

In [29]:
POS_1 = set()
for tok_pos,lemma in better_d.items():
    POS_1.add(tok_pos[1])
    
POS_1

{'A', 'N', 'R', 'V'}

In [28]:
# UD_dataset maps each form to [lemma, POS] at this point, so the tags are in the values
POS_2 = {pos for lemma, pos in UD_dataset.values()}
POS_2

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X',
 '_'}

In [76]:
translation_map = {
    'A' : 'ADJ',
    'N' : 'NOUN',
    'R' : 'ADV',
    'V' : 'VERB'
}

# re-key the ENGLAFF dictionary with UD-style tags: (form, 'N') becomes (form, 'NOUN')
new_d = {}
for k, v in better_d.items():
    pos = translation_map[k[1]]
    new_d[(k[0], pos)] = v
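
The conversion can also go the other way, from UD tags to the four one-letter categories. The sketch below does that with a plain dictionary; folding `AUX` into `V` and `PROPN` into `N` is an assumption made for illustration, and the names `UD_TO_ENGLAFF`, `to_englaff_pos`, `lookup` and `toy_d` are hypothetical.

```python
# UD tag -> one-letter category. Folding AUX into V and PROPN into N is an
# assumption made for this sketch, not something stated by the lexicon.
UD_TO_ENGLAFF = {
    "ADJ": "A",
    "NOUN": "N",
    "PROPN": "N",
    "ADV": "R",
    "VERB": "V",
    "AUX": "V",
}

def to_englaff_pos(ud_pos):
    """Return the one-letter category for a UD tag, or None if it has no counterpart."""
    return UD_TO_ENGLAFF.get(ud_pos)

# toy dictionary keyed like better_d: (form, one-letter POS) -> lemma
toy_d = {("is", "V"): "be", ("dogs", "N"): "dog"}

def lookup(form, ud_pos, dictionary):
    # returns None when the tag has no counterpart or the pair is not listed
    return dictionary.get((form.lower(), to_englaff_pos(ud_pos)))

print(lookup("is", "AUX", toy_d))     # → be
print(lookup("Dogs", "NOUN", toy_d))  # → dog
print(lookup("the", "DET", toy_d))    # → None
```

Converting on the UD side leaves the lexicon untouched and makes it explicit which UD tags (determiners, punctuation and so on) the lexicon cannot cover at all.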

In [89]:
from nltk import word_tokenize

In [119]:
def get_UD_info(token, dataset):
    return dataset[token] 

def lemmatize(tokens_, dataset1, dataset2):
    # dataset1: (lowercased form, POS) -> lemma; dataset2: form -> [lemma, POS]
    out = []
    lemma_diff = []
    not_match = []
    match = []
    for token in tokens_:
        try:
            UD_lemma, UD_pos = get_UD_info(token, dataset2)
            lemma = dataset1[(token.lower(), UD_pos)]
            if lemma == UD_lemma:
                out.append(lemma)
                match.append(lemma)
            else:
                lemma_diff.append([token, UD_pos, lemma, UD_lemma])
        except KeyError:
            # the token is missing from the UD data or from the dictionary
            not_match.append(token)
            out.append(">" + token)
#           print(f'PROBLEM: {token}')
    return out, lemma_diff, match, not_match


out, lemma_diff, match, not_match = lemmatize(word_tokenize(raw_text), new_d, UD_dataset)

In [126]:
print(f'{len(match)} combinations of token + POS are present in both datasets and share the same lemma.')
print(f'{len(lemma_diff)} combinations of token + POS are present in both datasets but show different lemmas.')
print(f'{len(not_match)} combinations of token + POS are missing from at least one of the datasets.')

828 combinations of token + POS are present in both datasets and share the same lemma.
22 combinations of token + POS are present in both datasets but show different lemmas.
3080 combinations of token + POS are missing from at least one of the datasets.
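
One way to turn the three counts above into a score is sketched below: precision is taken as the share of correct lemmas among the tokens the tool could answer, and coverage as the share of tokens it could answer at all. These definitions and the function name `precision_and_coverage` are assumptions chosen for illustration; other definitions are possible.

```python
def precision_and_coverage(n_match, n_diff, n_missing):
    """Precision over the answered tokens, and the share of tokens that were answered."""
    answered = n_match + n_diff
    total = answered + n_missing
    precision = n_match / answered if answered else 0.0
    coverage = answered / total if total else 0.0
    return precision, coverage

# the counts printed above
precision, coverage = precision_and_coverage(828, 22, 3080)
print(f"precision = {precision:.3f}, coverage = {coverage:.3f}")
# → precision = 0.974, coverage = 0.216
```

Read together, the two numbers say that the lemmatizer is rarely wrong when it answers, but that it answers for only about a fifth of the tokens.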


# Ex 3

Let's create a dataset consisting of song lyrics by scraping https://www.azlyrics.com. You can access the full list of artists in alphabetical order by browsing the top of the site (is there an easier way to go directly to the page of interest?). Using requests or selenium, in combination with BeautifulSoup, accomplish the following tasks:
- Create a function that, given a letter of the alphabet, returns a list of all the artists present on the site whose name begins with that letter and the respective link to their page. Save the information in a list of tuples in which the first element is the name of the artist and the second one is the link to her/his page;
- Write a function that, given the link to the artist's webpage, returns the list of all her/his songs present on the website and the respective link to the lyrics. Save the information in a list of tuples in which the first element is the title of the song and the second the link to the lyrics;
- Write a function that, given a link to the lyrics page, returns the lyrics of the song.

Write a function that takes a letter as input and returns a dictionary whose keys are the names of all the artists on the website starting with that letter, and as a value a dictionary structured as follows: as keys it will have the names of the songs, and as values it will have another dictionary in which you will store the link and the lyrics of the song. The final structure of your dataset should be similar to the one below.

```python
dataset = {[artist name]: {[song title]: {[link]: str,
                                          [lyrics]: str},
                           [song title]: {[link]: str,
                                          [lyrics]: str},
                           ...
                          },
           
           [artist name]: {...},
          
           ...
           
          }
```

Are you able to retrieve the lyrics of a song of your choice with a single line of code?
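
To make the target structure concrete, here is a toy instance of it together with a single-line lookup. All names, links and lyrics in `toy_dataset` are placeholders invented for this sketch.

```python
# Toy dataset in the target shape; names, links and lyrics are placeholders.
toy_dataset = {
    "Artist A": {
        "Song 1": {"link": "https://example.com/a/song1.html", "lyrics": "la la la"},
        "Song 2": {"link": "https://example.com/a/song2.html", "lyrics": "na na na"},
    },
    "Artist B": {},
}

# single-line retrieval: artist -> song -> field
print(toy_dataset["Artist A"]["Song 2"]["lyrics"])  # → na na na

# chained .get() calls return None instead of raising KeyError
print(toy_dataset.get("Artist B", {}).get("Song 1", {}).get("lyrics"))  # → None
```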

Finally, try to store artists and related lyrics using more than just one letter.  

#### For convenience, five elements have been extracted for each type (artists, songs, lyrics).

In [54]:
import requests
import time 
from bs4 import BeautifulSoup
import random 

base_url = "https://www.azlyrics.com/"


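def get_soup(url):
    # wait a random 5-16 seconds between requests to avoid hammering the site
    time.sleep(random.randint(5, 16))
    source = requests.get(url)
    source.encoding = 'utf-8'  # override encoding manually
    # name the parser explicitly so the result does not depend on what is installed
    return BeautifulSoup(source.text, "html.parser")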


def get_artists(letter):
    url = f'{base_url}{letter}.html'
    soup = get_soup(url)
    artists = []
    for el in soup.find("div", class_="container main-page").find_all("a")[0:5]:
        name = el.get_text()
        link = base_url + el['href'] 
        artists.append((name,link))
    return artists


def get_songs(artist_url):
    soup = get_soup(artist_url)
    songs = []
    try:
        for song in soup.find_all("div", class_="listalbum-item")[0:5]:
            title = song.get_text()
            link = song.find("a")['href']
            link = link.replace("../", base_url)
            songs.append((title, link))
    except (TypeError, KeyError):
        # an item without a link stops the loop; the songs collected so far are kept
        print(f'ERROR: {artist_url}')
    return songs
    

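def get_lyrics(song_url):
    soup = get_soup(song_url)
    lyrics = ""  # returned as is when the lyrics container is not found
    try:
        lyrics = soup.find("div", class_="col-xs-12 col-lg-8 text-center").find("div", class_=None).get_text()
    except AttributeError:
        # find() returned None, so the expected container is missing from the page
        print(f'ERROR: {song_url}')
    return lyrics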

def get_all_lyrics(letter):
    dataset = {}
    print(f'\nGetting artists with letter "{letter}"...')
    artists = get_artists(letter)
    for artist in artists:
        print(f'\nGetting the songs of {artist[0]} ...')
        songs = get_songs(artist[1])
        dataset[artist[0]] = {}
        for song in songs:
            print(f'>> Getting the lyrics of {song[0]} ...')
            lyrics = get_lyrics(song[1])
            dataset[artist[0]][song[0]] = {'link' : song[1], 'lyrics' : lyrics}
    return dataset
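
The functions above build absolute links by string concatenation and by replacing `"../"`. A possible alternative, sketched here with the standard library only, is `urllib.parse.urljoin`, which resolves a relative href against the page it was found on. The song href used below is a made-up example of the `"../..."` form.

```python
from urllib.parse import urljoin

base_url = "https://www.azlyrics.com/"

# artist links on a letter page are relative to the site root
artist_url = urljoin(base_url + "u.html", "u/u2.html")
print(artist_url)  # → https://www.azlyrics.com/u/u2.html

# song links on an artist page start with "../"; this href is a made-up example
song_url = urljoin(artist_url, "../lyrics/u2/twilight.html")
print(song_url)  # → https://www.azlyrics.com/lyrics/u2/twilight.html
```

An href that is already absolute is returned unchanged by `urljoin`, so the same call can be used for every link on the page.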

In [55]:
songs_dataset = get_all_lyrics("u")

Getting artists with letter "u"...

Getting the songs of U2 ...
>> Getting the lyrics of I Will Follow ...
>> Getting the lyrics of Twilight ...
>> Getting the lyrics of An Cat Dubh ...
>> Getting the lyrics of Into The Heart ...
>> Getting the lyrics of Out Of Control ...

Getting the songs of UB40 ...
ERROR: https://www.azlyrics.com/u/ub40.html
>> Getting the lyrics of Tyler ...
>> Getting the lyrics of King ...

Getting the songs of Uchis, Kali ...
>> Getting the lyrics of Good To You ...
>> Getting the lyrics of ChimiChanga ...
>> Getting the lyrics of Mucho Gusto ...
>> Getting the lyrics of Dream ...
>> Getting the lyrics of TYWIG ...

Getting the songs of Ude Af Kontrol ...
>> Getting the lyrics of Ta' Dine Sko På ...
>> Getting the lyrics of Fri ...
>> Getting the lyrics of Den Dyre Pt. 1 ...
>> Getting the lyrics of Den Dyre Pt. 2 ...
>> Getting the lyrics of Ude Af Kontrol ...

Getting the songs of U.D.O. ...
>> Getting the lyrics of Animal House ...
>> Getting the lyrics of 

#### Let's retrieve the lyrics of Twilight by U2 from our dictionary

In [61]:
U2_twilight = songs_dataset['U2']['Twilight']['lyrics']
print(U2_twilight)



I look into his eyes
They're closed but I see something
A teacher told me why
I laugh when old men cry

My body grows and grows
It frightens me you know
The old man tried to walk me home
I thought he should have known

Twilight...
Twilight, lost my way
Twilight, can't find my way

In the shadow boy meets man
In the shadow boy meets man
In the shadow boy meets man
In the shadow boy meets man

I'm running in the rain
I'm caught in a late night play
It's all; it's everything
I'm soaking through the skin

Twilight...darkened day
Twilight...lost my way
Twilight...night and day
Twilight...can't find my way

Can't find your way
Can't find my way
Can't find your way

Twilight...darkened day
Twilight...lost my way
Twilight...night and day
Twilight...can't find my way

In the shadow boy meets man
In the shadow boy meets man
In the shadow boy meets man
In the shadow boy meets man



#### Let's scrape more than one letter from the website.

In [None]:
for letter in ["r", "s", "t"]:
    songs_dataset.update(get_all_lyrics(letter))

# Ex 4

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49714.

What is the total of all the name scores in the file?
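
The COLIN example can be checked directly before running anything on the full file. The sketch below assumes names made of uppercase ASCII letters only; the helper name `alphabetical_value` and the three-name list are illustrative.

```python
def alphabetical_value(name):
    # A=1, B=2, ..., Z=26; assumes the name contains only uppercase ASCII letters
    return sum(ord(letter) - ord("A") + 1 for letter in name)

print(alphabetical_value("COLIN"))        # → 53
print(938 * alphabetical_value("COLIN"))  # → 49714

# the same scoring on a tiny made-up list: ANNA (30), BOB (19), COLIN (53)
toy_names = sorted(["COLIN", "ANNA", "BOB"])
total = sum(pos * alphabetical_value(n) for pos, n in enumerate(toy_names, start=1))
print(total)  # → 227
```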

In [106]:
import string 

alphabet = string.ascii_uppercase

def letters_score(name):
    score = 0
    for letter in name:
        score += alphabet.index(letter) + 1
    return score 
        
with open("../data/names.txt", "r", encoding="utf-8") as file:
    names = file.read().split(',')
names = [name.replace('"', '') for name in names]
names.sort()

total_score = 0
# enumerate gives the 1-based alphabetical position without searching the list again
for position, name in enumerate(names, start=1):
    name_score = letters_score(name)
    final_name_score = name_score * position
    total_score += final_name_score
    
total_score

871198282