# Ex 1

In the previous weeks you have seen how to tokenize a text using regular expressions, and how to get useful information, such as lemmas and POS, using already tagged knowledge bases. In the last lesson, you were introduced to the use of existing tools that provide an efficient and intuitive pipeline for obtaining the same information with just a few lines of code. One such tool is Stanza.

Let's put the pieces together and build some useful functions for processing text with Stanza. After importing Stanza, choose the language you want to work with (the supported languages are listed <a href="https://stanfordnlp.github.io/stanza/available_models.html">here</a>) and download the appropriate model. Then write two functions. The first takes as its argument the path to a text file written in the language of your choice (you can download one from Gutenberg or use any other text). It must run the Stanza pipeline on the last 10000 characters of the text, save the result as a JSON file, and return the corresponding dictionary. The second function is a variation of the <i>json_to_tokens(...)</i> function already written in class. Modify it so that it also returns a dictionary whose keys are the tokens and whose values are dictionaries with the following information: lemma, upos, feats.

In [None]:
import json

def json_to_tokens(path_, keys = ["text"] ):
    out = []
    with open(path_) as fin:
        jin = json.load(fin)
    for sent in jin:
        temps = []
        for word in sent:
            tword = []
            for key in keys:
                try:
                    tword.append(word[key])
                except KeyError:
                    pass
            temps.append(tword)
        out.append(temps)
    return out

In [None]:
! pip install stanza

import stanza
import json
# downloading the English model
# stanza.download("en")



def stanza_to_json(path):
    
    # opening my text file
    with open(path, "r", encoding = "utf-8-sig") as fin:
        text = fin.read()
    
    # initializing Stanza Pipeline
    nlp = stanza.Pipeline("en")
    
    # feeding the Pipeline with the last 10000 chars of my text
    processed_doc = nlp(text[-10000:])
    processed_doc_dict = processed_doc.to_dict()
    
    # saving the dict in a .json file
    with open("/content/drive/MyDrive/CSTA/processed_doc.json", "w", encoding="utf-8-sig") as fout:
        json.dump(processed_doc_dict, fout)
        
    return processed_doc_dict


def json_to_tokens(path_, keys = ["text"] ):
    out = []
    dict_out = {}  
    jin = json.load(open(path_, "r", encoding ="utf-8-sig"))
    for sent in jin:
        temps = [] 
        for word in sent:
            tword = []
            for key in keys: 
                try:
                    tword.append(word[key])
                except KeyError:
                    pass
            temps.append(tword) 
        out.append(temps)
        
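    # populating the dict (assumes keys = ["text", "upos", "lemma", "feats"], in this order)
    for sent in out:
        for word in sent:
            d = {'upos' : word[1], 'lemma' : word[2]}
            if len(word) == 4:
                d['feats'] = word[3]
            else:
                d['feats'] = None
            # a token that occurs more than once keeps the information of its last occurrence
            dict_out[word[0]] = d
    return out, dict_out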

In [None]:
my_path = "proposal.txt"

processed_text = stanza_to_json(my_path)
res_list, res_dict = json_to_tokens("CSTA/processed_doc.json", keys = ["text", "upos", "lemma", "feats"])
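
To make the expected shape of the second return value concrete, here is a minimal sketch that builds the token dictionary from a small hand-written sample. The sample only imitates the list-of-sentences layout that the pipeline result has once converted with `to_dict()`; the sample itself and the helper name `tokens_to_info` are illustrative assumptions, not part of the solution above.

```python
# Hand-written sample (an assumption): one sentence, one dictionary per word.
sample_doc = [
    [
        {"id": 1, "text": "Dogs", "lemma": "dog", "upos": "NOUN",
         "feats": "Number=Plur"},
        {"id": 2, "text": "bark", "lemma": "bark", "upos": "VERB",
         "feats": "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"},
        # punctuation carries no "feats" key in this sample
        {"id": 3, "text": ".", "lemma": ".", "upos": "PUNCT"},
    ]
]


def tokens_to_info(doc, fields=("lemma", "upos", "feats")):
    """Map each token to a dictionary holding the requested fields."""
    info = {}
    for sentence in doc:
        for word in sentence:
            # .get() yields None when a field is missing for this word
            info[word["text"]] = {field: word.get(field) for field in fields}
    return info


token_info = tokens_to_info(sample_doc)
print(token_info["."])
```

Looking fields up by name with `.get()` does not depend on the order of the keys, and a missing field simply becomes `None`. As in the solution above, a token that occurs several times keeps only the information of its last occurrence.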

# Ex 2

From the GitHub profile of <a href="https://github.com/UniversalDependencies">UniversalDependencies</a>, choose two repositories that interest you and download them with the command "git clone", after moving into the folder of interest in the terminal. Next, write a function that takes as its argument the path to a Git folder. Using the "os" module, extract the paths of all the ".conllu" files and process them to obtain, for each file, a dictionary whose keys are the tokens and whose values are dictionaries with the following information: lemma and POS. Save all the dictionaries in a list and return it. Do this for both Git directories. At the end of the process, make sure you have a list containing the lists of dictionaries from both repositories. Would you be able to attach the name of the repository you got the data from to each list?

In [None]:
import os
import re

directory_1 = "C:/Users/Daniel/CSTA_tutor/exercises/UD_Italian-PoSTWITA/"
directory_2 = "C:/Users/Daniel/CSTA_tutor/exercises/UD_Italian-TWITTIRO/"
directories = [directory_1, directory_2]

In [None]:
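def get_conllu_features(path):
    # keep only the CoNLL-U files found directly inside the folder
    files = [file for file in os.listdir(path) if file.endswith(".conllu")]
    all_conllu_dataset = []
    for file in files:
        dataset = {}
        with open(os.path.join(path, file), encoding="utf8") as fin:
            for line in fin:
                # token lines start with a numeric ID; comment lines start with "#"
                if re.search("^[0-9]", line):
                    # columns 1-3 are FORM, LEMMA and UPOS
                    features = line.split("\t")[1:4]
                    dataset[features[0]] = {'lemma' : features[1], 'POS' : features[2]}
        all_conllu_dataset.append(dataset)
    return all_conllu_dataset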

In [None]:
final_dataset = []
for dir_ in directories: 
    final_dataset.append(get_conllu_features(dir_))

Would you be able to attach the name of the repository you got the data from to each list?

In [None]:
dir_names = [name.split("/")[-2] for name in directories]
list(zip(dir_names, final_dataset))
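
As a variation on the steps above, the sketch below searches a repository recursively with `os.walk` and returns the repository name together with its list of dictionaries. The function names `conllu_token_dict` and `repo_datasets`, the folder name `UD_Demo`, and the three-token sample file are all made up for illustration. Unlike the solution above, this version keeps only lines whose ID is a plain integer, so multiword-token ranges such as `1-2` are skipped; that is a design choice, not a requirement of the exercise.

```python
import os
import tempfile


def conllu_token_dict(file_path):
    """Return {token: {'lemma': ..., 'pos': ...}} for one .conllu file."""
    tokens = {}
    with open(file_path, encoding="utf-8") as fin:
        for line in fin:
            columns = line.rstrip("\n").split("\t")
            # token lines have 10 tab-separated columns and an integer ID
            if len(columns) == 10 and columns[0].isdigit():
                tokens[columns[1]] = {"lemma": columns[2], "pos": columns[3]}
    return tokens


def repo_datasets(repo_path):
    """Return (repository name, list of token dictionaries)."""
    datasets = []
    for root, _dirs, files in os.walk(repo_path):
        for name in sorted(files):
            if name.endswith(".conllu"):
                datasets.append(conllu_token_dict(os.path.join(root, name)))
    # normpath drops a trailing slash, so basename is the folder name
    repo_name = os.path.basename(os.path.normpath(repo_path))
    return repo_name, datasets


# Tiny made-up treebank written to a temporary folder for the demo.
sample = (
    "# text = Il gatto dorme\n"
    "1\tIl\til\tDET\tRD\t_\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tNOUN\tS\t_\t3\tnsubj\t_\t_\n"
    "3\tdorme\tdormire\tVERB\tV\t_\t0\troot\t_\t_\n"
)
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "UD_Demo")
    os.makedirs(repo)
    with open(os.path.join(repo, "demo.conllu"), "w", encoding="utf-8") as fout:
        fout.write(sample)
    repo_name, datasets = repo_datasets(repo)

print(repo_name, datasets)
```

Returning the name next to the data means the caller can build `dict([repo_datasets(d) for d in directories])` and look a repository up by name.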

# Ex 3

On this <a href="https://www.tripadvisor.it/Tourism-g194889-Rovereto_Province_of_Trento_Trentino_Alto_Adige-Vacations.html">page</a> you will find a number of places in Rovereto that have been reviewed by users. Using selenium and BeautifulSoup, scrape the URL and, for each item, extract the reviews at the bottom of its page. In particular, for each review extract and save in a dictionary the following information: name of the reviewer, location (if any), number of reviews, rating, number of likes, date, title, and text of the review. The final dataset will be a dictionary that has the name of the location/activity as key and a list of dictionaries with the information from the different reviews as value. Divide the code up in an appropriate and functional way, making use of several functions.

## SOLUTION

The first thing to do in order to scrape the website properly is to study it and identify its common patterns. Scraping the links of all the items, you can see that there are three types of review page: 
- <p>https://www.tripadvisor.it/<b>Hotel_Review</b> ...</p>
- <p>https://www.tripadvisor.it/<b>Attraction_Review</b> ...</p>
- <p>https://www.tripadvisor.it/<b>Restaurant_Review</b> ...</p>

If you look at the design of the website, and therefore at the HTML of the three types of URL, you will see that they differ in certain details. To be able to scrape all the links, I have written three functions, one for each type of link. 
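
The three link types can also be told apart by the full prefix of the path rather than by a single letter. The following is a small sketch of that idea; the helper name `review_type` is hypothetical, and the two review URLs are made-up placeholders that only follow the patterns listed above.

```python
from urllib.parse import urlparse

# Path prefix -> one-letter code used by the scraping functions
REVIEW_TYPES = {
    "Hotel_Review": "H",
    "Attraction_Review": "A",
    "Restaurant_Review": "R",
}


def review_type(link):
    """Return 'H', 'A' or 'R' for a review link, or None for anything else."""
    path = urlparse(link).path.lstrip("/")
    # the prefix is everything before the first "-" of the path
    prefix = path.split("-", 1)[0]
    return REVIEW_TYPES.get(prefix)


# Placeholder links (assumptions), shaped like the three patterns above
hotel = "https://www.tripadvisor.it/Hotel_Review-g194889-d0000000-Reviews-Example_Hotel.html"
restaurant = "https://www.tripadvisor.it/Restaurant_Review-g194889-d0000000-Reviews-Example.html"
print(review_type(hotel), review_type(restaurant))
```

A link that matches none of the three prefixes yields `None`, which makes unexpected page types easy to spot before any scraping starts.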

In [5]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome("C://Users/Daniel/CSTA_tutor/exercises/solutions/chromedriver.exe")
url = "https://www.tripadvisor.it/Tourism-g194889-Rovereto_Province_of_Trento_Trentino_Alto_Adige-Vacations.html"


In [6]:
def get_items(driver_, url_):
    time.sleep(2)
    # use the driver passed as argument, not the global one
    driver_.get(url_)
    soup = BeautifulSoup(driver_.page_source, "html.parser")
    items = soup.find_all("li", class_="cNVSx Fg I")
    names = [i.find("div", class_="bsLRU btBEK fUpii").get_text() for i in items]
    base_url = "https://www.tripadvisor.it"
    links = [base_url + i.find("a")['href'] for i in items]
    # Extracting the letter in position "len(base_url) + 1" as website typology
    type_of_rev = [el[len(base_url) + 1] for el in links]
    return list(zip(names,links,type_of_rev))

In [7]:
def type_h(html):
    reviews_info = []
    reviews = html.find_all("div", class_="cWwQK MC R2 Gi z Z BB dXjiy")
    for rev in reviews:
        try: 
            rev_info = {}
            rev_info['name'] = rev.find("div", class_="bcaHz").find("a").get_text()
            
            # .split()[-2:] returns the last two elements of the list
            # " ".join(...) return a string made of all the el. of the list divided by " "
            rev_info['date'] = " ".join(rev.find("div", class_="bcaHz").find("span").get_text().split()[-2:])
            try:
                rev_info['location'] = rev.find("span", class_="default ShLyt small").get_text()
            except AttributeError:
                rev_info['location'] = None
                
            # both num. of rev. and likes are stored in similar tags
            no_rev_likes = [i.get_text() for i in rev.find_all("span", class_="ckXjS")]
            rev_info['no_reviews'] = no_rev_likes[0]
            if len(no_rev_likes) > 1:
                rev_info['likes'] = no_rev_likes[1]
                
            # ratings are created using different classes in the same tag, i.e. <span class="ui_bubble_rating bubble_40">:
            # "ui_bubble_rating bubble_10", "ui_bubble_rating bubble_20", ... , "ui_bubble_rating bubble_50"
            # The number in the second el. of our class gives us the rating: bubble_10 --> 1/5, ... ,bubble_50 --> 5/5
            # I used the first digit of the final number as rating 
            # .find("span")['class'] returns the two values of the "class" in a list: ['ui_bubble_rating', 'bubble_40']
            # .find("span")['class'][1] --> 'bubble_40'
            # .find("span")['class'][1][-2] --> '4'
            rev_info['rating'] = rev.find("div", class_="emWez F1").find("span")['class'][1][-2]
            rev_info['title'] = rev.find("div", class_="fpMxB MC _S b S6 H5 _a").find("span").get_text()
            rev_info['text'] = rev.find("q", class_="XllAv H4 _a").get_text()
            reviews_info.append(rev_info)
        except AttributeError as e: 
            continue
    
    return reviews_info


def type_a(html): 
    reviews_info = []
    # recursive=False --> find_all returns just the div tags directly connected to <div class="bPhtn"> 
    reviews = html.find("div", class_="bPhtn").find_all("div", recursive=False)
    
    for rev in reviews:
        try: 
            rev_info = {}

            # scraping name,locations, no reviews, places in three <span>
            # tags inside the <div class="cjhIj"> one. 
            n_l_r = [i.get_text() for i in rev.find("div", class_="cjhIj").find_all("span")]

            # As the location could be absent, the resulting list can be
            # of len 3 (location present) or 2 (location absent)
            if len(n_l_r) == 3:
                rev_info['name'] = n_l_r[0]
                rev_info['location'] = n_l_r[1]
                rev_info['no_reviews'] = n_l_r[2]
            else: 
                rev_info['name'] = n_l_r[0]
                rev_info['location'] = None
                rev_info['no_reviews'] = n_l_r[1]

            rev_info['rating'] = rev.find("svg", class_="RWYkj d H0")['title']
            rev_info['likes'] = rev.find("span", class_="WlYyy bTDWl").get_text()
            rev_info['date'] = rev.find("div", class_="WlYyy diXIH cspKb bQCoY").get_text()

            # title and text share the same tag, i.e. <span class="NejBf">
            title_text = [i.get_text() for i in rev.find_all("span", class_="NejBf")]
            rev_info['title'] = title_text[0]
            rev_info['text'] = title_text[1]

            # updating the list of all the reviews
            reviews_info.append(rev_info)
        except AttributeError as e: 
            continue
            
    return reviews_info


def type_r(html):
    reviews_info = []
    reviews = html.find_all("div", class_="rev_wrap ui_columns is-multiline")
    for rev in reviews:
        try:
            rev_info = {}
            rev_info['name'] = rev.find("div", class_="info_text pointer_cursor").get_text()
            # keep the whole leading number, not just its first character
            rev_info['no_reviews'] = rev.find("span", class_="badgeText").get_text().split()[0]
            rev_info['rating'] = rev.find("span", class_="ui_bubble_rating")['class'][1][-2]
            rev_info['date'] = rev.find("span", class_="ratingDate")['title']
            rev_info['title'] = rev.find("span", class_="noQuotes").get_text()
            rev_info['text'] = rev.find("p", class_="partial_entry").get_text()
            rev_info['likes'] = rev.find("span", class_="numHelp").get_text()
            reviews_info.append(rev_info)
        except AttributeError as e: 
            continue
    
    return reviews_info
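
The rating trick described in the comments of `type_h` can be tried in isolation, without a browser. The sketch below works on the list of class names that BeautifulSoup returns for the rating `<span>`; the helper name `rating_from_classes` is hypothetical. It parses the whole number after `bubble_` instead of a single character, so a value such as `bubble_45` (assumed here for illustration) would not lose its half point.

```python
def rating_from_classes(classes):
    """Turn ['ui_bubble_rating', 'bubble_40'] into the rating 4.0."""
    for name in classes:
        if name.startswith("bubble_"):
            # 'bubble_40' -> '40' -> 40 -> 4.0
            return int(name.split("_")[1]) / 10
    # no rating class found
    return None


print(rating_from_classes(["ui_bubble_rating", "bubble_40"]))
```

Returning a number rather than a one-character string also makes it possible to average or sort the ratings later.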

In [8]:
def get_reviews_by_type(driver_, url_, type_):
    time.sleep(2)
    driver_.get(url_)
    soup = BeautifulSoup(driver_.page_source, "html.parser")
    if type_ == "H":
        return type_h(soup)
    elif type_ == "A":
        return type_a(soup)
    elif type_ == "R":
        return type_r(soup)
    else:
        return "ERROR"

In [9]:
dataset = {}
items = set(get_items(driver,url))
for name,link,rev_type in items:
    print(name)
    dataset[name] = get_reviews_by_type(driver,link,rev_type)

Ristorante il Doge
Palazzo de' Cobelli
Chiesa di Santa Maria del Carmine
B&B Diele
Casa del Pittore
Il Ristorante Novecento Dell'Hotel Rovereto
La Caffetteria Bontadi
Trattoria Bar Christian
Ruina Dantesca
La Pizia Pizza & Ristò
Viadante 20
Museo Civico di Rovereto
Ristorante Al Lupo
Nero Caffè
Santa Maria delle Grazie
Caverna Damiano Chiesa
Azienda Agricola Balter
Snack Bar Stella d'Italia
Museo Storico Italiano della Guerra
Assaporando Trattoria
B&B Relais Mozart
Cammino dei Dinosauri
Castel Dante Sacrario AI Caduti
Ristorante San Colombano
Chiesa Arcipretale di San Marco
Hotel Rovereto
Bar Circolo SantaMaria
B&B Casa dei Turchi
Al Silenzio
Piazza Rosmini
Casa dei Turchi
Colle Ameno Rooms & Breakfast
Hotel Leon d'Oro
Campana dei Caduti
Moja
River
Bar Rosmini
Osteria del Pettirosso
Vanila Caffe
Museo di Arte Moderna e Contemporanea di Trento e Rovereto
L'orto di Pitagora
Hotel Sant'Ilario
Casa D'Arte Futurista Depero
Bar Street Bikers
Putipù
Biblioteca Civica
B&B Manu & Dige
Hotel 

In [10]:
dataset

{'Ristorante il Doge': [{'name': '286fathib',
   'no_reviews': '9',
   'rating': '1',
   'date': '11 novembre 2021',
   'title': 'Lontani dal mistieri.',
   'text': 'Siamo in tre.\nPiatti senze sapore\nMal presentati.\nNn torniamo\nPiu. Ce di melio.\nÀ rovereto.\nE securamento.\nPiu professionale.',
   'likes': ''},
  {'name': '102marcog',
   'no_reviews': '3',
   'rating': '5',
   'date': '11 novembre 2021',
   'title': 'Un bel ristorantino dove si mangia bene e si pende il giuto',
   'text': "Ho cento in questo ristorante su consiglio dell'albergo in cui soggiornavo e devo ammettere che è stato un ottimo consiglio. Ho provato un piatto unico tipico assaggiando prodotti diversi: canederli, tagliatelle al cervo, gulash con polenta e un ottimo tortino al cioccolato con cuore bianco....Tutto veramente ottimo. Viene offerta anche un'ottima selezione di vini anche se io ho scelto di non prenderne. Il personale è stato molto gentile. Il servizio rapido. Il locale è molto bello, piccolo 

# Ex 4 

The following iterative sequence is defined for the set of positive integers:

<br><center> n → n/2 (n is even)

<center>n → 3n + 1 (n is odd)

<br>Using the rule above and starting with 13, we generate the following sequence:

<br><center> 13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
    
    
<br>It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1.

Which starting number, under one million, produces the longest chain?

NOTE: Once the chain starts the terms are allowed to go above one million.

In [11]:
def even(n):
    # integer division stays exact even for very large terms
    return n // 2

def odd(n):
    return 3 * n + 1

def check_number(n):
    if n % 2 == 0:
        return "even"
    return "odd"

def generate_seq(starting_number):
    if starting_number == 0:
        # return the same (sequence, length) pair as the general case
        return [starting_number], 1
    else:
        current_number = starting_number
        sequence = [starting_number]
        while current_number != 1: 
            typology = check_number(current_number)
            if typology == "even": 
                next_num = even(current_number)
            else: 
                next_num = odd(current_number)
            sequence.append(next_num)
            current_number = next_num
    return sequence, len(sequence)

In [12]:
longest = [0,0]
for number in range(1,1000000):
    sequence, length = generate_seq(number)
    if length > longest[1]:
        longest[0] = number
        longest[1] = length

In [13]:
print(f'{longest[0]} produces the longest chain (length equal to {longest[1]})')

837799 produces the longest chain (length equal to 525)
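
The search above rebuilds every chain from scratch, although most chains quickly run into a number that has already been seen. A possible speed-up, sketched below under that observation, is to remember the length of every chain computed so far. The function names `chain_length` and `longest_chain` are illustrative and not part of the solution above.

```python
def chain_length(n, cache):
    """Number of terms in the chain starting at n, using known lengths."""
    path = []
    # walk forward until a number with a known chain length is reached
    while n not in cache:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    length = cache[n]
    # walk back, storing the length for every number met on the way
    for m in reversed(path):
        length += 1
        cache[m] = length
    return length


def longest_chain(limit):
    """Return (start, length) of the longest chain with start < limit."""
    cache = {1: 1}
    best_start, best_len = 1, 1
    for start in range(1, limit):
        length = chain_length(start, cache)
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len


print(chain_length(13, {1: 1}))  # 10 terms, as in the example above
```

Calling `longest_chain(1000000)` should give the same answer as the loop above, at the cost of keeping the dictionary of known lengths in memory.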
