# Web-Mining - Assignment 1 (25 points total)

This **Home Assignment** must be submitted; each task is awarded points. It familiarizes you with the basics of *web crawling* and standard text preprocessing.

## Formalities
**Submit in a group of 3-4 people by 11.05.2020, 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed as well.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function usually can be inferred from task description, 
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check if your code relies on presence of files or directories other than those mentioned
    in given tasks. Tests run under Linux, so don't use Windows-style paths 
    (`some\path`, `C:\another\path`). Also, use only paths that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`, 
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.
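
The path rule above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `is_portable_path` that is not part of the assignment; it accepts only POSIX-style paths that are relative and never leave the working directory.

```python
from pathlib import PurePosixPath

def is_portable_path(path: str) -> bool:
    """Hypothetical helper: True if `path` is relative, POSIX-style,
    and never climbs above the working directory."""
    if "\\" in path:              # Windows-style separator
        return False
    p = PurePosixPath(path)
    if p.is_absolute():           # e.g. /home/alice/python
        return False
    depth = 0                     # directory levels below the working directory
    for part in p.parts:
        if part == "..":
            depth -= 1
            if depth < 0:         # stepped outside the working directory
                return False
        else:
            depth += 1
    return True

for candidate in ("some/path", "./some/path", "/home/alice/python", "../../python"):
    print(candidate, is_portable_path(candidate))
```

The sketch inspects the path string only; it does not resolve symlinks or touch the file system.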

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if:
- your function throws an unexpected error (e.g. takes the wrong number of arguments)
- your function gets stuck in an infinite loop
- your function takes much longer than expected (e.g. >1s to compute the mean of two numbers)
- your function does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for only the mean, prints an output instead of returning it, ...)

In [1]:
# credentials of all team members (you may add or remove items from the dictionary)
team_members = [
    {
        'first_name': 'Haron',
        'last_name': 'Nqiri',
        'student_id': 343289
    },
    {
        'first_name': 'Mike',
        'last_name': 'Grüne',
        'student_id': 381076
    },
    {
        'first_name': 'Seyed Pouria',
        'last_name': 'Mirelmi',
        'student_id': 416910
    },
    {
        'first_name': 'Pruthvi',
        'last_name': 'Hegde',
        'student_id': 404809
    }
]

# Task 1: Crawling web pages (total of 15 points)
We are interested in obtaining questions and their answers on the `stackoverflow` platform. Use the `BeautifulSoup` and `requests` packages for this task.

## a) Simple spider (5)

Write a function `get_questions` that takes a tag (like `"data-mining"`) and a number `n` and returns a list containing the hyperlinks (as strings) to the top n questions by votes for the given tag. Assume the tag exists.

The first link for "data-mining" is: https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression 

Notice that `n` might exceed the number of questions appearing on the first page.
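
Because `n` can exceed one listing page, the spider has to walk several pages. The sketch below only builds the page URLs and makes no request. The helper name `listing_urls` and the `page_size` parameter are assumptions for illustration; how many questions a page really shows has to be read from the site.

```python
import math
from typing import List

BASE = "https://stackoverflow.com/questions/tagged/"

def listing_urls(tag: str, n: int, page_size: int) -> List[str]:
    """URLs of the vote-sorted listing pages needed to cover n questions.

    `page_size` is an assumed value, not something the task specifies.
    """
    pages = math.ceil(n / page_size)
    return [f"{BASE}{tag}?tab=votes&page={i}" for i in range(1, pages + 1)]

for url in listing_urls("data-mining", 20, page_size=15):
    print(url)
```

In a real crawl it is usually more robust to keep requesting the next page until enough links are collected or a page comes back empty, rather than trusting a fixed page size.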

In [2]:
from typing import List, Dict
from bs4 import BeautifulSoup
import requests
import time
from tqdm.notebook import tqdm

In [3]:
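def get_questions(tag : str, n : int, pause=0.5) -> List[str]:
    ret = []
    pageIndex = 1
    while len(ret) < n:
        # fetch one page of the vote-sorted listing for this tag
        page = requests.get('https://stackoverflow.com/questions/tagged/' + tag + '?tab=votes&page=' + str(pageIndex))
        if page.status_code != 200:
            break
        soup = BeautifulSoup(page.content, 'html.parser')
        found = soup.find_all('div', class_="t-" + tag, limit=(n - len(ret)))
        if not found:
            # fewer than n questions exist: stop instead of looping forever
            break
        for question in found:
            ret.append('https://stackoverflow.com' + question.parent.find('a', class_='question-hyperlink').get('href'))
        pageIndex += 1
        if len(ret) < n:
            time.sleep(pause)  # throttle before requesting the next page
    return ret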

In [4]:
# example usage
result = get_questions("machine-learning", 20)
for link in result:
    print(link)
# expected output
# https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
# https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm
# https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model
# https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
# https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work
# https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks
# https://stackoverflow.com/questions/34240703/what-is-logits-what-is-the-difference-between-softmax-and-softmax-cross-entropy
# https://stackoverflow.com/questions/11632516/what-are-advantages-of-artificial-neural-networks-over-support-vector-machines
# https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow
# https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
# https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression
# https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array
# https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
# https://stackoverflow.com/questions/34518656/how-to-interpret-loss-and-accuracy-for-a-machine-learning-model
# https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio
# https://stackoverflow.com/questions/2595176/which-machine-learning-classifier-to-choose-in-general
# https://stackoverflow.com/questions/10592605/save-classifier-to-disk-in-scikit-learn
# https://stackoverflow.com/questions/598726/overwhelmed-by-machine-learning-is-there-an-ml101-book
# https://stackoverflow.com/questions/5064928/difference-between-classification-and-clustering-in-data-mining
# https://stackoverflow.com/questions/42081257/why-binary-crossentropy-and-categorical-crossentropy-give-different-performances

https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm
https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model
https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work
https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks
https://stackoverflow.com/questions/34240703/what-are-logits-what-is-the-difference-between-softmax-and-softmax-cross-entrop
https://stackoverflow.com/questions/11632516/what-are-advantages-of-artificial-neural-networks-over-support-vector-machines
https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow
https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supe

In [5]:
# example usage
print(get_questions("python", 3))
#['https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do',
# 'https://stackoverflow.com/questions/419163/what-does-if-name-main-do',
# 'https://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator']

['https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do', 'https://stackoverflow.com/questions/419163/what-does-if-name-main-do', 'https://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator']


## b) Processing a page (8)

Write a function `parse_question` that takes a hyperlink (as a string) to a StackOverflow question and returns a dictionary containing the text that is produced by the users. This includes the question (+ comments) and the answers (+ comments).

We are not interested in other content like: images, `<aside>` sections, html tags, user_information, `<code>` sections and `<a>` links.
We are also not interested in information on edits, dates etc. As that text is not user generated we also disregard all functional text-content like share, edit, follow, flag, add comment (the kind of button things). 

! Ignore everything that is not inside the div with `id="mainbar"` !

Use BeautifulSoup to process the HTML of the page. There you can get the text of an element through `.text`. You can remove elements by calling `.decompose()` on them.
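
To show the idea of dropping unwanted elements before reading the text without needing network access, here is a sketch that uses only the standard library's `html.parser` as a stand-in. It is not BeautifulSoup and not the expected solution. The class `VisibleText` and the function `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Iterable, List

class VisibleText(HTMLParser):
    """Collects text while skipping everything nested inside `skip` tags."""

    def __init__(self, skip: Iterable[str]):
        super().__init__()
        self.skip = set(skip)
        self.depth = 0                 # > 0 while inside a skipped element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:            # keep text only outside skipped elements
            self.parts.append(data)

def strip_tags(html: str, skip=("code", "a", "aside")) -> str:
    parser = VisibleText(skip)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

snippet = '<p>Use <code>x = 1</code> here, see <a href="#">docs</a>.</p>'
print(repr(strip_tags(snippet)))
```

Void elements such as `<img>` have no closing tag, so they are deliberately not in `skip` here; they contribute no text anyway. With BeautifulSoup the same effect comes from calling `.decompose()` on the unwanted elements and then reading `.text`.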

The structure of the result is:

```python
{'title': 'The title',
 'question': {'text' : 'How to learn web-mining?',
              'comments':['Good question', 'Sounds\n interesting']},
 'answers' : [{'text':'Do a course at CSSH!', 
               'comments' : ['You will learn a lot', 'Good stuff']},
              {'text':'Learn on youtube', 'comments' : []}, ]}
```
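
A quick structural check can catch shape mistakes before submission. The sketch below is an assumption-based helper, not part of the task: `is_post` and `has_expected_shape` are hypothetical names, and they test only the keys and types shown in the structure above.

```python
from typing import Any

def is_post(post: Any) -> bool:
    """A post is a dict with a string 'text' and a list of string 'comments'."""
    return (isinstance(post, dict)
            and isinstance(post.get("text"), str)
            and isinstance(post.get("comments"), list)
            and all(isinstance(c, str) for c in post["comments"]))

def has_expected_shape(result: Any) -> bool:
    """True if `result` has a title, one question post and a list of answer posts."""
    return (isinstance(result, dict)
            and isinstance(result.get("title"), str)
            and is_post(result.get("question"))
            and isinstance(result.get("answers"), list)
            and all(is_post(a) for a in result["answers"]))

example = {
    "title": "The title",
    "question": {"text": "How to learn web-mining?",
                 "comments": ["Good question", "Sounds\n interesting"]},
    "answers": [{"text": "Do a course at CSSH!",
                 "comments": ["You will learn a lot", "Good stuff"]},
                {"text": "Learn on youtube", "comments": []}],
}
print(has_expected_shape(example))
```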
You can also find an example of the processed page https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression on Moodle.


Note that the answers might be spread across multiple pages, see:
https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do

In [6]:
# converts the given object to text after removing unwanted parts
def convert_to_text(te: BeautifulSoup):
    for unTag in te.find_all(['code', 'img', 'aside', 'a']):
        unTag.decompose()
    return te.text

# returns the title of the stackoverflow question
def get_question_title(soup: BeautifulSoup):
    questionHeader = soup.find('div',id='question-header')
    return questionHeader.find('a', class_='question-hyperlink').text

# returns the text of the question or an answer
def get_main_text(post: BeautifulSoup):
    postDiv = post.find('div', class_=['postcell','answercell'])
    textDiv = postDiv.find('div', class_='js-post-body')
    return convert_to_text(textDiv)

# returns a list of comments on the current page
def get_comments(post: BeautifulSoup):
    comments = post.find_all('span', class_='comment-copy')
    retComments = []
    for comment in comments:
        retComments.append(convert_to_text(comment))
    return retComments

# returns a question or answer object
def get_post(pDiv: BeautifulSoup):
    ret = {}
    ret['text'] = get_main_text(pDiv)
    ret['comments'] = get_comments(pDiv)
    return ret

# returns a list of all answers objects on the current site
def get_answers(answersDiv: BeautifulSoup):
    ret = []
    answers = answersDiv.find_all('div', class_='answer')
    for answer in answers:
        ret.append(get_post(answer))
    return ret
    
def parse_question(link : str)->Dict:
    # get the wanted page and soup it
    page = requests.get(link)
    if page.status_code != 200:
        return {}
    soup = BeautifulSoup(page.content, 'html.parser')
    ret = {}
    
    # get the title of the question and add it to return object
    ret['title'] = get_question_title(soup)
    
    # only work with the mainbar div of the site
    mainbar = soup.find('div',id='mainbar')
    
    # get the question with comments of the page
    questionDiv = mainbar.find('div', id='question')
    ret['question'] = get_post(questionDiv)
    
    # get all answers with comments on the current page
    answersDiv = mainbar.find('div', id='answers')
    ret['answers'] = get_answers(answersDiv)
    
    # get all answers from the remaining pages
    sitePager = mainbar.find('div', class_='pager-answers')
    while (sitePager is not None and sitePager.find('a', attrs={'rel': 'next'}) is not None):
        page = requests.get('https://stackoverflow.com' + sitePager.find('a', attrs={'rel': 'next'}).get('href'))
        if page.status_code != 200:
            break
        soup = BeautifulSoup(page.content, 'html.parser')
        mainbar = soup.find('div',id='mainbar')
        sitePager = mainbar.find('div', class_='pager-answers')
        answersDiv = mainbar.find('div', id='answers')
        ret['answers'] = ret['answers'] + get_answers(answersDiv)
        
    return ret

In [7]:
url = "https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression"
result =  parse_question(url)
assert type(result) == dict
import json
with open('example.json', 'w', encoding='utf-8') as file:
    json.dump(result, file, ensure_ascii=False)

In [8]:
url2 = "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do"

In [9]:
result =  parse_question(url2)

## c) Bringing it all together (2)

Write a function `process_top_questions` that takes a tag and a number `n` and processes the top n questions by votes as explained above. It returns a single list of dicts, one dict for each question. Use the previously defined functions to do so.


In [10]:
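def process_top_questions(tag : str, n : int, pause = 0.5) -> List[Dict]:
    ret = []
    # pass the pause on so the listing crawl is throttled as well
    questionLinks = get_questions(tag, n, pause)
    for questionLink in questionLinks:
        ret.append(parse_question(questionLink))
        # wait between question pages to avoid hammering the server
        time.sleep(pause)
    return ret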

In [11]:
top_questions = process_top_questions('data-mining', 15)

In [12]:
for i, question in enumerate(top_questions):
    with open(f'data_mining{i}.json', 'w', encoding='utf-8') as file:
        json.dump(question, file, ensure_ascii=False)

# Task 2: Using inverted indices (total of 10 points)
It is/was common to use stopword removal and stemming as a preparation for word embeddings. For this task, you should use the stopword list provided by the Python nltk library.

## a) The usual preprocessing (2)

Write a function `to_tokens` that takes a string and performs tokenization, lowercases all tokens, removes stopwords (`'english'`) and performs stemming (`PorterStemmer`). It returns a list of tokens. The tokens are in the same order they were in the initial string. Use the functions from the nltk library.
To transform a string into tokens, use the function `nltk.word_tokenize`. Do not allow empty or all-whitespace strings.


In [13]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')
def to_tokens(my_string : str) -> List[str]:
    porter = PorterStemmer()
    # look the stopword list up once, not once per token
    stop = set(stopwords.words('english'))
    res = [s.lower() for s in nltk.word_tokenize(my_string)]
    res = [porter.stem(s) for s in res if s not in stop]
    return res



In [14]:
to_tokens("Write a function")
# ['write', 'function']

['write', 'function']

In [15]:
to_tokens("The tokens are in the same order they were in the initial string. Use the functions from the nltk library.")
#['token',
# 'order',
# 'initi',
# 'string',
# '.',
# 'use',
# 'function',
# 'nltk',
# 'librari',
# '.']

['token',
 'order',
 'initi',
 'string',
 '.',
 'use',
 'function',
 'nltk',
 'librari',
 '.']

## b) Load data (3)

Write a function `load_data` that takes a path to a folder and reads all the `'*.txt'` files inside that folder. It returns a list of file names (without the folder containing them) and a list of lists of strings that contains the file contents, preprocessed using the function from a). Use `encoding="utf-8"`.
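
The file-listing part of this task can be tried in isolation. The sketch below creates a throwaway directory with made-up file names (an assumption for illustration), selects only the `*.txt` files, and keeps the bare file names via `Path.name`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # three hypothetical files; only two of them match '*.txt'
    for name in ("a.txt", "b.txt", "notes.md"):
        Path(tmp, name).write_text("hello", encoding="utf-8")
    # .name drops the folder part; sorting makes the order reproducible
    names = sorted(p.name for p in Path(tmp).glob("*.txt"))

print(names)  # ['a.txt', 'b.txt']
```

`Path.glob` makes no promise about ordering, so sorting is one way to keep repeated runs consistent.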

In [16]:
from pathlib import Path
from typing import Set
def load_data(folder : str) -> (List[str], List[List[str]]):
    file_list = []
    texts = []
    files = sorted(Path(folder).glob("*.txt"))  # sorted for a reproducible order
    for file in tqdm(files):
        file_list.append(file.name)
        with open(file, "r", encoding="utf-8") as text:
            texts.append(to_tokens(text.read()))
    return file_list, texts

In [17]:
folder = "arxiv_snapshot"

In [18]:
# example usage
# this may take a while, please remove before submission
#files, stemmed_texts = load_data(folder)





In [19]:
# example usage
#print(len(files))
#print(files[0]) # could vary for you
# 0704.1181v1.txt
#print(stemmed_texts[0][:20]) # should be same for the same file
# ['simul', 'four-bodi', 'interact', 'nuclear', 'magnet', 'reson', 'quantum', 'inform', 'processor', '∗', 'wen-zhang', 'liu1', ',', 'jin-fu', 'zhang1', ',', 'gui', 'lu', 'long1,2', 'key']


In [None]:
#print(files[-1]) # could vary for you
# 1807.07319v2.txt
#print(stemmed_texts[-1][:20]) # should be same for the same file
# ['arxiv:1807.07319v2', '[', 'hep-ph', ']', '21', 'mar', '2019', 'model-independ', 'constraint', 'ckm', 'matrix', 'element', '|vtb|', ',', '|vts|', '|vtd|', 'wenx', 'fang1', ',', 'barbara']

## c) Reduce vocabulary (2)
Write a function `filter_tokens` that takes a list of lists of tokens and an integer `min_support` as input. It keeps only tokens whose total count across all documents combined is at least (`>=`) `min_support` and discards the rest. The return values are

1) the list of lists of tokens, in which all tokens that appear fewer than `min_support` times (in the entire corpus) are __removed__

2) a set of tokens that were __not__ removed, i.e. the frequent tokens
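
A toy corpus makes the corpus-wide counting concrete. This is a sketch with made-up tokens, using `collections.Counter` as one possible way to count; the task does not prescribe it.

```python
from collections import Counter
from itertools import chain

# three tiny, already-tokenized documents (hypothetical data)
docs = [["web", "mine", "web"], ["mine", "text"], ["web"]]

# count every token over the whole corpus: web=3, mine=2, text=1
counts = Counter(chain.from_iterable(docs))

min_support = 2
frequent = {t for t, c in counts.items() if c >= min_support}
filtered = [[t for t in doc if t in frequent] for doc in docs]

print(sorted(frequent))  # ['mine', 'web']
print(filtered)          # [['web', 'mine', 'web'], ['mine'], ['web']]
```

The token `text` occurs only once in the whole corpus, so it is dropped from every document, while the order of the remaining tokens is preserved.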

In [None]:
def filter_tokens(l_tokens : List[List[str]], min_support : int) -> (List[List[str]], Set[str]):
    token_count = {}
    for doc in l_tokens:
        for token in doc:
            if token in token_count:
                token_count[token] += 1
            else:
                token_count[token] = 1
    remaining_tokens = set([t for t in token_count.keys() if token_count[t] >= min_support])
    new_corpus = [[t for t in doc if t in remaining_tokens] for doc in l_tokens]
    return new_corpus, remaining_tokens

In [None]:
# example usage
#filtered_texts, frequent_tokens = filter_tokens(stemmed_texts, 0)
#len(frequent_tokens) #268879

In [None]:
# example usage
#filtered_texts, frequent_tokens = filter_tokens(stemmed_texts, 5)
#len(frequent_tokens)
# 50306

## d) Write a query function (3)
You want to provide a way to quickly search the documents at hand for a combination of search terms. 

Write a function `query` that takes a string, an inverted index dict and a set of frequent tokens. It treats the string as a list of tokens and returns the filenames of all files that include all the search terms in the query string.


To test your code, first convert the list of filenames and filtered tokens into an inverted index to get the corresponding filenames for each token. You can use the `to_inverted_index` function provided in the exercise.
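
The lookup itself boils down to intersecting the file sets of all query terms. Below is a minimal sketch on a hand-made index. The tokens, the file names and the helper `lookup` are assumptions for illustration, and the terms are taken as already preprocessed.

```python
from typing import Dict, List, Set

# hypothetical inverted index: token -> set of files containing it
index: Dict[str, Set[str]] = {
    "comput": {"a.txt", "b.txt", "c.txt"},
    "spider": {"b.txt", "c.txt"},
    "man": {"d.txt"},
}

def lookup(terms: List[str], index: Dict[str, Set[str]]) -> Set[str]:
    """Files that contain every term; unknown terms match no file."""
    sets = [index.get(t, set()) for t in terms]
    # set.intersection returns a new set, so the index is left untouched
    return set.intersection(*sets) if sets else set()

print(sorted(lookup(["comput", "spider"], index)))  # ['b.txt', 'c.txt']
print(lookup(["comput", "spider", "man"], index))   # set()
```

Returning an empty set for an empty term list is a design choice of this sketch; the task text does not say what an empty query should yield.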

In [None]:
def query(search_string : str, inverted_index : Dict[str, Set[str]], frequent_tokens : Set[str]) -> Set[str]:
    search = to_tokens(search_string)
    # nothing left to search for (e.g. the query held only stopwords)
    if not search:
        return set()
    # quick check if it makes sense to search at all
    for t in search:
        if t not in frequent_tokens or t not in inverted_index:
            return set()
    # intersect the file sets; copy first so the index is never mutated
    res = set(inverted_index[search[0]])
    for t in search[1:]:
        res = res.intersection(inverted_index[t])
    return res

In [None]:
def to_inverted_index(extracts : Dict[str, List[str]]) -> Dict[str, Set[str]]:
    inverted = {}
    for filename in extracts.keys():
        for token in extracts[filename]:
            if token in inverted:
                inverted[token].add(filename)
            else:
                inverted[token] = set([filename])
    return inverted

In [None]:
#II = to_inverted_index({filename : text for filename, text in zip(files, filtered_texts)})

In [None]:
# example usage
#query("computer spider", II, frequent_tokens)

#{'1007.0880v1.txt',
# '1410.5760v1.txt',
# '1604.02076v1.txt',
# '1609.00359v3.txt',
# '1610.05332v1.txt',
# '1805.03482v2.txt'}

In [None]:
# example usage
#query("computer spider man", II, frequent_tokens)
# set()

In [None]:
# example usage
#query("gene", II, frequent_tokens)
#{'0712.3383v1.txt',
# '0810.4745v1.txt',
# '1009.2186v1.txt',
# '1101.0211v2.txt',
# '1210.6482v1.txt',
# '1303.0539v1.txt',
# '1408.5461v2.txt',
# '1410.0649v1.txt',
# '1508.00192v1.txt',
# '1510.02516v2.txt',
# '1802.02337v1.txt',
# '1807.04779v1.txt'}