# Crawling StackOverflow
This project retrieves questions and answers from StackOverflow using the `BeautifulSoup` and `requests` packages.

## a) Simple spider

A function `get_questions` takes a tag (like `"data-mining"`) and a number `n` and returns a list containing the hyperlinks (as strings) to the top n questions by votes for the given tag. It assumes the tag always exists.

The first link for "data-mining" is: https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression 
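
The request in `get_questions` asks for 50 questions per page (`pagesize=50`), so collecting the top `n` links can take several page requests. A minimal sketch of that page arithmetic, assuming a hypothetical helper `pages_needed` that is not part of the original notebook:

```python
def pages_needed(n: int, page_size: int = 50) -> int:
    """Number of listing pages required to collect n links (hypothetical helper)."""
    if n <= 0:
        return 0
    # ceiling division without floats
    return (n + page_size - 1) // page_size


print(pages_needed(20))   # 1
print(pages_needed(51))   # 2
print(pages_needed(300))  # 6
```

Rounding up instead of using `n // page_size + 1` avoids one extra request whenever `n` is an exact multiple of the page size.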

In [2]:
from typing import List, Dict
from bs4 import BeautifulSoup
import requests
import time
from tqdm.notebook import tqdm

In [3]:
def get_questions(tag : str, n : int, pause=0.5) -> List[str]:
    uri = 'https://stackoverflow.com'
    page_size = 50
    # `n` might exceed the number of questions on the first page, so round up.
    iterations = (n + page_size - 1) // page_size
    S = requests.Session()
    links = []
    for page_index in range(1, iterations + 1):
        res = S.get(f'{uri}/questions/tagged/{tag}?sort=MostVotes&edited=true&pagesize={page_size}&page={page_index}')
        soup = BeautifulSoup(res.text, "html.parser")
        wrapper_div = soup.find('div', id='questions')
        if wrapper_div is None:
            # the question list was not found on this page: give up
            return []
        items = wrapper_div.find_all('a', class_='question-hyperlink')

        for item in items:
            if len(links) >= n:
                break
            links.append(f'{uri}{item["href"]}')

        # be polite: wait between page requests
        time.sleep(pause)
    return links

l = get_questions("machine-learning", 300)
print(len(l),l)

300 ['https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks', 'https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm', 'https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model', 'https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification', 'https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work', 'https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks', 'https://stackoverflow.com/questions/34240703/what-are-logits-what-is-the-difference-between-softmax-and-softmax-cross-entrop', 'https://stackoverflow.com/questions/11632516/what-are-advantages-of-artificial-neural-networks-over-support-vector-machines', 'https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow', 'https://stackoverflow.com/questions/1832076/wh

In [4]:
# example usage
result = get_questions("machine-learning", 20)
for link in result:
    print(link)
# expected output
# https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
# https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm
# https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model
# https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
# https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work
# https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks
# https://stackoverflow.com/questions/34240703/what-is-logits-what-is-the-difference-between-softmax-and-softmax-cross-entropy
# https://stackoverflow.com/questions/11632516/what-are-advantages-of-artificial-neural-networks-over-support-vector-machines
# https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow
# https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
# https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression
# https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array
# https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
# https://stackoverflow.com/questions/34518656/how-to-interpret-loss-and-accuracy-for-a-machine-learning-model
# https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio
# https://stackoverflow.com/questions/2595176/which-machine-learning-classifier-to-choose-in-general
# https://stackoverflow.com/questions/10592605/save-classifier-to-disk-in-scikit-learn
# https://stackoverflow.com/questions/598726/overwhelmed-by-machine-learning-is-there-an-ml101-book
# https://stackoverflow.com/questions/5064928/difference-between-classification-and-clustering-in-data-mining
# https://stackoverflow.com/questions/42081257/why-binary-crossentropy-and-categorical-crossentropy-give-different-performances

https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm
https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model
https://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
https://stackoverflow.com/questions/307291/how-does-the-google-did-you-mean-algorithm-work
https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks
https://stackoverflow.com/questions/34240703/what-are-logits-what-is-the-difference-between-softmax-and-softmax-cross-entrop
https://stackoverflow.com/questions/11632516/what-are-advantages-of-artificial-neural-networks-over-support-vector-machines
https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow
https://stackoverflow.com/questions/1832076/what-is-the-difference-between-supe

In [5]:
# example usage
print(get_questions("python", 3))
#['https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do',
# 'https://stackoverflow.com/questions/419163/what-does-if-name-main-do',
# 'https://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator']

['https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do', 'https://stackoverflow.com/questions/419163/what-does-if-name-main-do', 'https://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator']


## b) Processing a page

A function `parse_question` takes a hyperlink (as a string) to a StackOverflow question and returns a dictionary containing the text written by the users. This includes the question (+ comments) and the answers (+ comments).

The structure of the result is:

```python
{'title': 'The title',
 'question': {'text' : 'How to learn web-mining?',
              'comments':['Good question', 'Sounds\n interesting']},
 'answers' : [{'text':'Do a course at CSSH!', 
               'comments' : ['You will learn a lot', 'Good stuff']},
              {'text':'Learn on youtube', 'comments' : []}, ]}
```
An example of the processed page for https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression is also available on Moodle.
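
A result can be checked against this structure before it is saved. The following is a minimal sketch, assuming a hypothetical helper `has_expected_shape` that is not part of the original notebook; it only inspects keys and types, not the content.

```python
from typing import Any


def has_expected_shape(result: Any) -> bool:
    """Check that result follows the title/question/answers layout (hypothetical helper)."""

    def is_post(post: Any) -> bool:
        # a post (question or answer) has a text and a list of comment strings
        return (
            isinstance(post, dict)
            and isinstance(post.get('text'), str)
            and isinstance(post.get('comments'), list)
            and all(isinstance(c, str) for c in post['comments'])
        )

    return (
        isinstance(result, dict)
        and isinstance(result.get('title'), str)
        and is_post(result.get('question'))
        and isinstance(result.get('answers'), list)
        and all(is_post(a) for a in result['answers'])
    )


example = {'title': 'The title',
           'question': {'text': 'How to learn web-mining?',
                        'comments': ['Good question', 'Sounds\n interesting']},
           'answers': [{'text': 'Do a course at CSSH!',
                        'comments': ['You will learn a lot', 'Good stuff']},
                       {'text': 'Learn on youtube', 'comments': []}]}

print(has_expected_shape(example))  # True
print(has_expected_shape({}))       # False
```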


Note that the answers may be spread across multiple pages; see:
https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
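
One way to handle this is to derive the list of answer-page URLs from the answer count. The sketch below assumes 30 answers per page and a `?page=` query parameter, as used in the notebook's own code; `answer_page_urls` is a hypothetical helper and not part of the original notebook.

```python
from typing import List


def answer_page_urls(link: str, answer_count: int, page_size: int = 30) -> List[str]:
    """Build one URL per answer page (hypothetical helper, page size assumed)."""
    # round up, and always visit at least the first page
    pages = max(1, (answer_count + page_size - 1) // page_size)
    return [f'{link}?page={i}' for i in range(1, pages + 1)]


for url in answer_page_urls('https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do', 45):
    print(url)
```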

In [6]:
# todo
# [x] question title
# [x] question text + comments
# [x] parse answer text, not only p and h elements but also lists, <b> etc...

def parse_question(link : str) -> Dict:
    
    # start a new session
    S = requests.Session()
    
    # get number of iterations for answers
    res = S.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        answer_count = int(soup.find('div', id='answers-header').find('div').find('div').find('h2')['data-answercount'])
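        # assumes 30 answers per page; round up so an exact multiple does not request an extra page
        iterations = max(1, (answer_count + 29) // 30)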
    except Exception as e:
        iterations = 1
        
    ret = {}
    answers = []
    
    # get title
    try:
        question_header = soup.find('div', id='question-header')
        ret['title'] = question_header.find('a').text
    except Exception as e:
        ret['title'] = 'title not found'
    
    # get question + comments
    try:
        question_div = soup.find('div', class_='question')
        content_wrapper = question_div.find_all('div', recursive=False)[1].find_all('div', recursive = False)[1].find('div')
        unwanted_divs = content_wrapper.find_all(['div', 'a', 'code', 'img', 'aside'])
    except Exception as e:
        return {}
    
    # remove unwanted tags
    for tag in unwanted_divs:
        tag.decompose()
    
    question = {}
    try:
        question['text'] = content_wrapper.text
    except Exception as e:
        question['text'] = 'question content not found'
    
    try:
        comment_list = question_div.find_all('div', recursive=False)[1].find_all('div', recursive=False)[2].find('div').find('ul').find_all('li')
    except Exception as e:
        comment_list = []
        
    comments = []
    
    for comment in comment_list:
        try:
            comment = comment.find_all('div', recursive=False)[1].find('div').find('span')
            unwanted_tags = comment.find_all(['a', 'code', 'img', 'aside'])
            for tag in unwanted_tags:
                tag.decompose()
            comments.append(comment.text)
        except Exception as e:
            continue
    
    question['comments'] = comments
    ret['question'] = question
    
    # get answers + comments
    # iterate over answer pages (number calculated above)
    for page_index in range(1,iterations+1):
        # load the respective page
        res = S.get(f'{link}?page={page_index}')
        soup = BeautifulSoup(res.text, "html.parser")
        
        try:
            mainbar_div = soup.find('div', id='mainbar')
            answers_div = mainbar_div.find('div', id='answers')
            answer_divs = answers_div.find_all('div', class_='answer')
        except Exception as e:
            continue
        
        """
        apparently unnecessary..
        try:
            accepted_answer = answers_div.find('div', class_='answer accepted_answer')
            answer_divs.append(accepted_answer)
            
        except Exception as e:
            pass
        """
        
        # iterate over answers on the current page
        for answer in answer_divs:
            try:
                content_wrapper = answer.find('div').find_all('div', recursive=False)[1].find('div')
                unwanted_tags = content_wrapper.find_all(['a', 'code', 'img', 'aside'])

                for tag in unwanted_tags:
                    tag.decompose()

                tmp = {}
                tmp['text'] = content_wrapper.text

                comment_list = answer.find('div').find_all('div', recursive=False)[2].find('div').find('ul').find_all('li')
                comments = []

                # iterate over comments to current answer
                for comment in comment_list:
                    comment = comment.find_all('div', recursive=False)[1].find('div').find('span')
                    unwanted_tags = comment.find_all(['a', 'code', 'img', 'aside'])
                    for tag in unwanted_tags:
                        tag.decompose()
                    comments.append(comment.text)

                tmp['comments'] = comments
                answers.append(tmp)
            except Exception as e:
                continue
            
    ret['answers'] = answers
    return ret


# parse_question('https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do')
# parse_question('https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression')

### Comment:
There are two ways to locate the right HTML elements: either search by the class/id of the element (e.g. `find('div', id='xyz')`) and hope it is unique, i.e. occurs only once on the entire page, or use relative positions, i.e. descend the tree until you reach the right place. The latter option was chosen here. Neither method is optimal: ids/classes may not be unique, and layouts may vary between pages, so the same element can sit at different relative positions.
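
The two lookup strategies can be compared on a small, made-up HTML fragment. The markup below is an assumption for illustration only and does not reproduce StackOverflow's real page layout.

```python
from bs4 import BeautifulSoup

# made-up fragment, not StackOverflow's actual markup
html = """
<div id="mainbar">
  <div class="question">
    <div class="votes">12</div>
    <div class="body"><p>How to learn web-mining?</p></div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Strategy 1: search by class and hope it is unique on the page.
by_class = soup.find('div', class_='body')

# Strategy 2: descend by relative position from a known ancestor.
question_div = soup.find('div', class_='question')
by_position = question_div.find_all('div', recursive=False)[1]

print(by_class.get_text(strip=True))
print(by_position.get_text(strip=True))
```

Both lookups reach the same element here. The first breaks if another `div` with class `body` appears earlier on the page; the second breaks if a sibling `div` is inserted before the target.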

In [7]:
url = "https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression"
result =  parse_question(url)
assert isinstance(result, dict)
import json
with open('example.json', 'w', encoding='utf-8') as file:
    json.dump(result, file, ensure_ascii=False)

In [8]:
url2 = "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do"

In [9]:
result =  parse_question(url2)

## c) Bringing it all together

A function `process_top_questions` takes a tag and a number `n` and processes the top `n` questions by votes as explained above. It returns a single list of dicts, one dict per question, and reuses the previously defined functions.


In [10]:
def process_top_questions(tag : str, n : int, pause = 0.5) -> List[Dict]:
    questions = get_questions(tag, n, pause)
    ret = []
    for url in questions:
        ret.append(parse_question(url))
    
    return ret
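
`parse_question` is written to return an empty dict when the question body cannot be located, so the combined list may contain empty entries. A minimal sketch of a variant that drops them, assuming a hypothetical helper `process_links` and a stand-in parser `fake_parse` so the example runs without network access:

```python
from typing import Callable, Dict, List


def process_links(links: List[str], parse: Callable[[str], Dict]) -> List[Dict]:
    """Parse each link and drop empty results (hypothetical variant)."""
    results = []
    for url in links:
        parsed = parse(url)
        if parsed:  # an empty dict signals that parsing failed
            results.append(parsed)
    return results


def fake_parse(url: str) -> Dict:
    """Stand-in for parse_question; fails for URLs ending in '/bad'."""
    if url.endswith('/bad'):
        return {}
    return {'title': url, 'question': {'text': '', 'comments': []}, 'answers': []}


out = process_links(['https://example.com/a', 'https://example.com/bad'], fake_parse)
print(len(out))  # 1
```

Passing the parser in as an argument is a design choice for this sketch only; it makes the loop testable offline.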

In [11]:
top_questions = process_top_questions('data-mining', 15)

In [12]:
len(top_questions)

15

In [13]:
for i, question in enumerate(top_questions):
    with open(f'data_mining{i}.json', 'w', encoding='utf-8') as file:
        json.dump(question, file, ensure_ascii=False)
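
The saved files can be read back with `json.load`. The sketch below uses a made-up record and a temporary directory (both assumptions for illustration) to show that text written with `ensure_ascii=False` and UTF-8 encoding survives the round trip.

```python
import json
import os
import tempfile

# made-up record with non-ASCII characters
record = {'title': 'Café question',
          'question': {'text': 'naïve?', 'comments': []},
          'answers': []}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(record, file, ensure_ascii=False)
    with open(path, encoding='utf-8') as file:
        loaded = json.load(file)

print(loaded == record)  # True
```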