# Scraping StackOverflow
This notebook scrapes StackOverflow for dask-tagged questions and puts the cleaned data into a DataFrame for further analysis.

### Reference
Scraping SO Tutorial: https://www.youtube.com/watch?v=BFAQCDr6Qvc

## Scraping Data

In [1]:
import requests
from requests_html import HTML
import pandas as pd 
import time
import re

In [2]:
base_url = "https://stackoverflow.com/questions/tagged/"
tag = "dask"
query_filter = "Newest"
url = f"{base_url}{tag}?tab={query_filter}"
url

'https://stackoverflow.com/questions/tagged/dask?tab=Newest'
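
The URL above is assembled by hand with an f-string. As a minimal sketch, the same thing can be done with the standard library so that tags containing special characters are escaped. The helper name `build_tag_url` and the constant `BASE_URL` are made up for illustration and are not part of the original notebook; the `page` and `pagesize` parameters mirror the ones used further down in `scrape_tag`.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://stackoverflow.com/questions/tagged/"

def build_tag_url(tag, tab="Newest", page=None, pagesize=None):
    """Build a tagged-questions URL; `page` and `pagesize` are optional."""
    params = {"tab": tab}
    if page is not None:
        params["page"] = page
    if pagesize is not None:
        params["pagesize"] = pagesize
    # quote() escapes characters such as '+' or '#' in tags like "c++" or "c#"
    return f"{BASE_URL}{quote(tag, safe='')}?{urlencode(params)}"

print(build_tag_url("dask"))
print(build_tag_url("c++", page=2, pagesize=50))
```

For the `dask` tag this should produce the same URL as the cell above.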

In [3]:
r = requests.get(url)
html_str = r.text
html = HTML(html=html_str)

In [4]:
question_elements = html.find(".s-post-summary")

In [5]:
print(question_elements[0].text)

1 vote
0 answers
9 views
Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK
I am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ...
python
pandas
dataframe
dask
Joseph Kuchar
11
asked 16 mins ago


### Get Data for 1 Question

In [10]:
# get most recent question element
this_question_element = question_elements[0]

In [11]:
# inspect element
this_question_element.text

"1 vote\n0 answers\n9 views\nIssues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK\nI am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ...\npython\npandas\ndataframe\ndask\nJoseph Kuchar\n11\nasked 16 mins ago"

In [12]:
# get question title
this_question_element.find('.s-link', first=True).text

'Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK'

In [13]:
# get stats (formatted more nicely)
this_question_element.find('.s-post-summary--stats', first=True).text.replace('\n', ' ')

'1 vote 0 answers 9 views'

In [14]:
# get metadata
this_question_element.find('.s-post-summary--meta', first=True).text.replace('\n', ' ')

'python pandas dataframe dask Joseph Kuchar 11 asked 16 mins ago'

In [15]:
# get question excerpt
this_question_element.find('.s-post-summary--content-excerpt', first=True).text

"I am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ..."

In [16]:
# get question title again, this time via the title container instead of the link
this_question_element.find('.s-post-summary--content-title', first=True).text

'Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK'

In [17]:
# get the link element (its href attribute holds the relative question URL)
this_question_element.find('.s-link', first=True)

<Element 'a' href='/questions/74305172/issues-with-workers-dying-and-garbage-collection-summing-over-rows-of-dataframe' class=('s-link',)>
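
The element above only carries a relative `href`. One possible way to turn it into a full URL is sketched below, using `urljoin` from the standard library. The helper `absolute_question_url` and the constant `SITE_ROOT` are hypothetical names; reading the `href` through the element's `attrs` dictionary is shown as a comment because it needs the live element.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://stackoverflow.com"

def absolute_question_url(href, root=SITE_ROOT):
    """Turn a relative href such as '/questions/123/some-title' into a full URL."""
    return urljoin(root, href)

# With requests_html, the relative href would come from the element's attributes:
#   href = this_question_element.find('.s-link', first=True).attrs['href']
href = "/questions/74305172/issues-with-workers-dying-and-garbage-collection-summing-over-rows-of-dataframe"
print(absolute_question_url(href))
```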

### Get Data for All Questions


Ideally, we want to get:
- the question title
- answered / unanswered status (derived from the number of answers)
- number of answers
- number of votes
- number of views
- tags (this one might be tricky)
- the question hyperlink (not yet sure exactly how)
- the timestamp, if possible
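
For the timestamp item, here is a rough sketch of the parsing step only. It assumes that each question summary contains an element with class `relativetime` whose `title` attribute holds an absolute UTC time such as `2022-11-03 15:20:11Z`. That selector and format have not been checked against the page scraped here, and the example value is made up. The function name `parse_so_timestamp` is hypothetical.

```python
from datetime import datetime, timezone

def parse_so_timestamp(title_attr):
    """Parse a 'YYYY-MM-DD HH:MM:SSZ' string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(title_attr, "%Y-%m-%d %H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)

# Assumed usage with requests_html (selector not verified against the live page):
#   time_el = this_question_element.find('.relativetime', first=True)
#   asked_at = parse_so_timestamp(time_el.attrs['title'])
example = "2022-11-03 15:20:11Z"  # made-up value in the assumed format
print(parse_so_timestamp(example).isoformat())
```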

In [18]:
# define key names and the CSS classes needed
keynames = ['title', 'stats', 'tags']
classes_needed = ['.s-post-summary--content-title', '.s-post-summary--stats', '.s-post-summary--meta-tags',]

In [19]:
datas = []

for q_el in question_elements:
    q_data = {}
    for i, _class in enumerate(classes_needed):
        sub_el = q_el.find(_class, first=True)
        keyname = keynames[i]
        q_data[keyname] = sub_el.text 
    datas.append(q_data)

In [20]:
df = pd.DataFrame(datas)
df.head(3)

|   | title | stats | tags |
|---|---|---|---|
| 0 | Issues with workers dying and garbage collecti... | 1 vote\n0 answers\n9 views | python\npandas\ndataframe\ndask |
| 1 | Setting maximum number of workers in Dask map ... | 1 vote\n0 answers\n10 views | python\ndask\ndask-distributed\ndask-dataframe... |
| 2 | Dask Dataframe read_sql_table & to_sql method ... | 1 vote\n0 answers\n18 views | python\npostgresql\nsqlalchemy\ndask\ndask-dat... |


OK, the basics are working here.

Let's now refine:
- Split votes, answers, and views into separate values
- Split the tags into individual items

In [21]:
# get stats and split
this_question_element.find('.s-post-summary--stats', first=True).text.split("\n")

['1 vote', '0 answers', '9 views']

In [22]:
# get only numbers
stats_test = re.findall(r'\d+', this_question_element.find('.s-post-summary--stats', first=True).text)
stats_test

['1', '0', '9']

In [23]:
# get just tags
this_question_element.find('.s-post-summary--meta-tags', first=True).text.split('\n')

['python', 'pandas', 'dataframe', 'dask']

OK, we now have clean tags and stats. Let's try this again:

In [24]:
# # function that will clean the scraped data
# def clean_scraped_data(text, keyname=None):
#     if keyname == 'stats':
#         return re.findall(r'\b\d+\b', text)
#     elif keyname == 'tags':
#         return text.split("\n")
#     return text

In [37]:
# function that will clean the scraped data
def clean_scraped_data(text, keyname=None):
    if keyname == 'stats':
        # strip the labels: remove the singular words first, then any leftover plural "s"
        rep = {" vote": "", " answer": "", " view": "", "s": ""}
        rep = {re.escape(k): v for k, v in rep.items()}
        pattern = re.compile("|".join(rep.keys()))
        text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
        # one number per line: votes, answers, views
        new_text = []
        for n in text.split('\n'):
            # expand the "k" suffix, e.g. "126k" -> 126000 (whole thousands only)
            new_text.append(int(n.replace('k', '000')))
        return new_text

    elif keyname == 'tags':
        return text.split("\n")
    return text
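
The `k` replacement in `clean_scraped_data` works for whole thousands such as `126k`, but it would fail on a value like `1.2k`. Whether the page ever shows decimal or `m` suffixes is an assumption here, not something observed in this scrape. A more defensive sketch, with the hypothetical helper name `parse_count`, might look like this:

```python
import re

# Multipliers for the (assumed) abbreviation suffixes
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Convert strings like '9 views', '126k views' or '1.2k' to an int."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)([km]?)\b", text.lower())
    if match is None:
        raise ValueError(f"no count found in {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix])

stats_text = "1 vote\n0 answers\n9 views"
print([parse_count(line) for line in stats_text.split("\n")])
print(parse_count("126k views"), parse_count("1.2k views"))
```

Because the number is parsed directly, the label words do not need to be stripped first.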

In [38]:
datas = []

for q_el in question_elements:
    q_data = {}
    for i, _class in enumerate(classes_needed):
        sub_el = q_el.find(_class, first=True)
        keyname = keynames[i]
        q_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname) 
    datas.append(q_data)

In [39]:
df = pd.DataFrame(datas)
df.head()

|   | title | stats | tags |
|---|---|---|---|
| 0 | Issues with workers dying and garbage collecti... | [1, 0, 9] | [python, pandas, dataframe, dask] |
| 1 | Setting maximum number of workers in Dask map ... | [1, 0, 10] | [python, dask, dask-distributed, dask-datafram... |
| 2 | Dask Dataframe read_sql_table & to_sql method ... | [1, 0, 18] | [python, postgresql, sqlalchemy, dask, dask-da... |
| 3 | Dask running out of memory even when partition... | [0, 0, 24] | [dask, dask-distributed, dask-dataframe] |
| 4 | Disable pure function assumption in dask distr... | [0, 0, 11] | [python, dask, dask-distributed] |


Sweet. This is working for the first 50 results!

In [40]:
# define function that will parse a single page

def parse_tagged_page(html):
    question_elements = html.find(".s-post-summary")
    keynames = ['title', 'stats', 'tags']
    classes_needed = ['.s-post-summary--content-title', '.s-post-summary--stats', '.s-post-summary--meta-tags',]
    datas = []
    for q_el in question_elements:
        q_data = {}
        for i, _class in enumerate(classes_needed):
            sub_el = q_el.find(_class, first=True)
            keyname = keynames[i]
            q_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname) 
        datas.append(q_data)
    return datas

In [41]:
# define function that will extract data from url
def extract_data_from_url(url):
    r = requests.get(url)
    if not 200 <= r.status_code < 300:  # treat anything outside 2xx as a failure
        return []
    html_str = r.text
    html = HTML(html=html_str)
    datas = parse_tagged_page(html)
    return datas
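
`extract_data_from_url` gives up immediately on a non-2xx response. If the site throttles requests, a retry with a growing delay may help. The sketch below is one way to do that and is not part of the original notebook: `fetch_with_retries` is a hypothetical helper, and `fetch` stands for any callable (such as `requests.get`) that returns an object with `status_code` and `text`. The demonstration uses a fake response so that it runs offline.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a 2xx status or attempts run out.

    Returns the response text, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if 200 <= response.status_code < 300:
            return response.text
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None

# Offline demonstration: a fake fetch that fails once, then succeeds
class FakeResponse:
    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

responses = iter([FakeResponse(429), FakeResponse(200, "<html>ok</html>")])
delays = []  # record the delays instead of actually sleeping
result = fetch_with_retries(lambda url: next(responses), "https://example.com", sleep=delays.append)
print(result, delays)
```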

In [42]:
# function that will scrape the entire tag
def scrape_tag(tag="python", query_filter="Newest", max_pages=100, pagesize=50):
    base_url = 'https://stackoverflow.com/questions/tagged/'
    datas = []
    for p in range(max_pages):
        page_num = p + 1
        url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
        datas += extract_data_from_url(url)
        time.sleep(1.2)
    return datas
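
`scrape_tag` always requests `max_pages` pages, even when the tag has fewer. A possible refinement is to stop at the first page that yields no records. The sketch below assumes that a page past the last one returns no question summaries, which has not been verified here. `scrape_pages` and `fetch_page` are hypothetical names; `fetch_page(page_num)` could wrap `extract_data_from_url`, and the fake version below only stands in for real requests.

```python
def scrape_pages(fetch_page, max_pages=100):
    """Collect records page by page, stopping at the first empty page."""
    records = []
    for page_num in range(1, max_pages + 1):
        page_records = fetch_page(page_num)
        if not page_records:
            break  # no more results (or the request failed)
        records.extend(page_records)
    return records

# Stand-in for real requests: three pages of fake data, then nothing
fake_pages = {1: [{"title": "a"}], 2: [{"title": "b"}], 3: [{"title": "c"}]}
calls = []

def fake_fetch(page_num):
    calls.append(page_num)
    return fake_pages.get(page_num, [])

print(len(scrape_pages(fake_fetch)), calls)
```

In a real run, the `time.sleep` between requests from `scrape_tag` would still belong inside the loop.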

In [43]:
%%time 
datas = scrape_tag(tag='dask')

CPU times: user 22.5 s, sys: 540 ms, total: 23.1 s
Wall time: 3min 23s


In [44]:
df = pd.DataFrame(datas)
df.head()

|   | title | stats | tags |
|---|---|---|---|
| 0 | Why does dask DataFrame.to_parquet tries to in... | [0, 0, 4] | [dask, parquet, dask-distributed] |
| 1 | Issues with workers dying and garbage collecti... | [1, 0, 12] | [python, pandas, dataframe, dask] |
| 2 | Setting maximum number of workers in Dask map ... | [1, 0, 10] | [python, dask, dask-distributed, dask-datafram... |
| 3 | Dask Dataframe read_sql_table & to_sql method ... | [1, 0, 18] | [python, postgresql, sqlalchemy, dask, dask-da... |
| 4 | Dask running out of memory even when partition... | [0, 0, 24] | [dask, dask-distributed, dask-dataframe] |


In [45]:
len(df)

4135

## Clean up dataframe

In [46]:
# get stats into separate columns
df[['votes', 'answers', 'views']] = pd.DataFrame(df['stats'].to_list())
df = df.drop(columns=['stats'])
df.head()

|   | title | tags | votes | answers | views |
|---|---|---|---|---|---|
| 0 | Why does dask DataFrame.to_parquet tries to in... | [dask, parquet, dask-distributed] | 0 | 0 | 4 |
| 1 | Issues with workers dying and garbage collecti... | [python, pandas, dataframe, dask] | 1 | 0 | 12 |
| 2 | Setting maximum number of workers in Dask map ... | [python, dask, dask-distributed, dask-datafram... | 1 | 0 | 10 |
| 3 | Dask Dataframe read_sql_table & to_sql method ... | [python, postgresql, sqlalchemy, dask, dask-da... | 1 | 0 | 18 |
| 4 | Dask running out of memory even when partition... | [dask, dask-distributed, dask-dataframe] | 0 | 0 | 24 |
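
The same split could also be done on the raw records before building the DataFrame, which makes it easy to add an answered flag at the same time. The sketch below assumes each record has a three-item `stats` list in the order votes, answers, views, as produced by `clean_scraped_data`. The helper name `expand_stats` is hypothetical, and "answered" here only means at least one answer, not an accepted one.

```python
def expand_stats(record):
    """Replace the 'stats' list with named fields and add an 'answered' flag."""
    votes, answers, views = record["stats"]
    expanded = {key: value for key, value in record.items() if key != "stats"}
    expanded.update(votes=votes, answers=answers, views=views, answered=answers > 0)
    return expanded

sample = [
    {"title": "First question", "stats": [1, 0, 12], "tags": ["python", "dask"]},
    {"title": "Second question", "stats": [53, 3, 36000], "tags": ["pandas", "dask"]},
]
for row in map(expand_stats, sample):
    print(row)
```

On the DataFrame itself, the equivalent flag would be `df['answers'] > 0`.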


In [47]:
# TODO
# 1. Fetch question URL 
# 2. Fetch question timestamp
# 3. Create "answered" true/false column
# 4. For fun: use Futures to fetch data in parallel
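
For item 4, one possible starting point is sketched below. It uses the standard library's `concurrent.futures` thread pool rather than Dask futures, and the names `fetch_all` and `fake_fetch` are made up for illustration. In the notebook, `fetch` could be `extract_data_from_url`. Note that fetching in parallel drops the delay between requests, so the site may start rejecting them.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL with a thread pool and return one flat list of records."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, urls))  # map() preserves input order
    # Flatten the per-page lists into a single list
    return [record for page in pages for record in page]

# Offline demonstration with a fake fetch function
def fake_fetch(url):
    return [{"url": url}]

urls = [f"https://example.com/page/{n}" for n in range(1, 4)]
print(fetch_all(urls, fake_fetch))
```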

## Run some analyses

In [48]:
# most-upvoted questions
df.sort_values('votes', ascending=False).head(10)

|   | title | tags | votes | answers | views |
|---|---|---|---|---|---|
| 3667 | Make Pandas DataFrame apply() use all cores? | [pandas, dask] | 170 | 12 | 126000 |
| 4033 | At what situation I can use Dask instead of Ap... | [python, pandas, apache-spark, dask] | 101 | 1 | 41000 |
| 4025 | How to transform Dask.DataFrame to pd.DataFrame? | [python, pandas, dask] | 53 | 3 | 36000 |
| 4133 | python dask DataFrame, support for (trivially ... | [python, pandas, parallel-processing, dask] | 51 | 2 | 25000 |
| 3981 | Convert Pandas dataframe to Dask dataframe | [python, pandas, dataframe, data-conversion, d... | 47 | 1 | 51000 |
| 3697 | Out-of-core processing of sparse CSR arrays | [python, scipy, apache-spark-mllib, dask, joblib] | 43 | 1 | 2000 |
| 3987 | Writing Dask partitions into single file | [python, dask] | 34 | 2 | 16000 |
| 3960 | Can dask parralelize reading fom a csv file? | [python, csv, pandas, dask] | 32 | 2 | 26000 |
| 4129 | Read a large csv into a sparse pandas datafram... | [python, pandas, numpy, scipy, dask] | 32 | 2 | 7000 |
| 3066 | Converting numpy solution into dask (numpy ind... | [python, numpy, dask, dask-distributed] | 31 | 1 | 3000 |


In [49]:
# most-viewed questions
df.sort_values('views', ascending=False).head(10)

|   | title | tags | votes | answers | views |
|---|---|---|---|---|---|
| 3667 | Make Pandas DataFrame apply() use all cores? | [pandas, dask] | 170 | 12 | 126000 |
| 3981 | Convert Pandas dataframe to Dask dataframe | [python, pandas, dataframe, data-conversion, d... | 47 | 1 | 51000 |
| 4033 | At what situation I can use Dask instead of Ap... | [python, pandas, apache-spark, dask] | 101 | 1 | 41000 |
| 4025 | How to transform Dask.DataFrame to pd.DataFrame? | [python, pandas, dask] | 53 | 3 | 36000 |
| 4014 | Convert string to dict, then access key:values... | [python, pandas, dictionary, data-manipulation... | 19 | 4 | 33000 |
| 2119 | Only a column name can be used for the key in ... | [python, pandas, dask] | 9 | 1 | 28000 |
| 4012 | Sampling n= 2000 from a Dask Dataframe of len ... | [python, dask] | 22 | 4 | 28000 |
| 3898 | Default pip installation of Dask gives "Import... | [python, installation, pip, importerror, dask] | 22 | 6 | 28000 |
| 3539 | simple dask map_partitions example | [python, parallel-processing, dask] | 10 | 2 | 28000 |
| 3986 | dask dataframe how to convert column to to_dat... | [python, pandas, dask] | 19 | 5 | 27000 |


In [82]:
# most common tags
from collections import Counter
from itertools import chain

# get all tags into a list of lists
all_tags = df['tags'].tolist()

# count all tags
c = Counter(chain.from_iterable(all_tags))

# transform into dataframe (kept separate so the questions in df are not overwritten)
tag_counts = pd.DataFrame.from_dict(c, orient='index').reset_index()
tag_counts.sort_values(0, ascending=False).head(10)

|   | index | 0 |
|---|---|---|
| 0 | dask | 4135 |
| 3 | python | 2657 |
| 4 | pandas | 1075 |
| 2 | dask-distributed | 959 |
| 5 | dataframe | 465 |
| 6 | dask-dataframe | 313 |
| 14 | python-3.x | 277 |
| 7 | dask-delayed | 258 |
| 12 | python-xarray | 181 |
| 28 | numpy | 176 |
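
Since every scraped question carries the `dask` tag, it always tops the list. As a small sketch, `Counter.most_common` gives the ranking directly, and the scraped tag can be left out. The helper name `count_tags`, its `exclude` parameter, and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import chain

def count_tags(tag_lists, exclude=()):
    """Count tags across all questions, optionally dropping some tags."""
    counts = Counter(chain.from_iterable(tag_lists))
    for tag in exclude:
        counts.pop(tag, None)
    return counts

sample = [
    ["python", "pandas", "dask"],
    ["dask", "dask-distributed"],
    ["python", "dask"],
]
print(count_tags(sample).most_common(2))
print(count_tags(sample, exclude=["dask"]).most_common(1))
```

With the notebook's data, `tag_lists` would be the `tags` column converted to a list.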


## CONTINUE HERE

Ian shared this Stack Exchange Data Explorer query; try it out: https://data.stackexchange.com/stackoverflow/query/edit/1491554