# Scraping StackOverflow

## Testing code from SO Question
https://stackoverflow.com/questions/65611633/scraping-dynamic-website-with-filters-python

In [None]:
import bs4
import requests
import pandas as pd

url = "https://www.feedtables.com/fr/content/table-dry-matter"
headers = {"user-agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(page.text,'lxml')

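# build one filtered URL per feed category listed in the feed_cat dropdown
options = soup.find('select', attrs={'name': 'feed_cat'}).find_all('option')
filter_urls = [f"{url}?feed_cat={option['value']}&parameter_cat=All" for option in options]

# skip the first option and read only the next two categories
# (a separate loop variable avoids overwriting `url`)
df_list = []
for filter_url in filter_urls[1:3]:
    df_list.append(pd.read_html(filter_url)[0])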

df = df_list[0].dropna(how='all')
df

## YouTube Tutorial
https://www.youtube.com/watch?v=BFAQCDr6Qvc


In [1]:
import requests
from requests_html import HTML
import pandas as pd 
import time
import re

In [2]:
base_url = "https://stackoverflow.com/questions/tagged/"
tag = "dask"
query_filter = "Newest"
url = f"{base_url}{tag}?tab={query_filter}"
url

'https://stackoverflow.com/questions/tagged/dask?tab=Newest'
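The f-string above works for a simple tag like `dask`. As a minimal sketch, assuming tags with special characters (such as `c++`) need percent-encoding, the same URL can be built with the standard library. `build_tag_url` is a hypothetical helper, not part of the tutorial, and the optional `page` and `pagesize` parameters mirror the ones used further down in `scrape_tag`.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://stackoverflow.com/questions/tagged/"


def build_tag_url(tag, query_filter="Newest", page=None, pagesize=None):
    """Build a tagged-questions URL (hypothetical helper)."""
    params = {"tab": query_filter}
    # only add paging parameters when they are explicitly requested
    if page is not None:
        params["page"] = page
    if pagesize is not None:
        params["pagesize"] = pagesize
    # quote() percent-encodes characters such as '+' or '#' in the tag
    return f"{BASE_URL}{quote(tag, safe='')}?{urlencode(params)}"


print(build_tag_url("dask"))
```

For the `dask` tag this produces the same URL as the cell above.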

In [3]:
r = requests.get(url)
html_str = r.text
html = HTML(html=html_str)

In [4]:
question_elements = html.find(".s-post-summary")

In [5]:
print(question_elements[0].text)

1 vote
0 answers
9 views
Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK
I am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ...
python
pandas
dataframe
dask
Joseph Kuchar
11
asked 16 mins ago


## Get Data for 1 Question

In [10]:
# get most recent question element
this_question_element = question_elements[0]

In [11]:
# inspect element
this_question_element.text

"1 vote\n0 answers\n9 views\nIssues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK\nI am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ...\npython\npandas\ndataframe\ndask\nJoseph Kuchar\n11\nasked 16 mins ago"

In [12]:
# get question title
this_question_element.find('.s-link', first=True).text

'Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK'

In [13]:
# get stats (formatted more nicely)
this_question_element.find('.s-post-summary--stats', first=True).text.replace('\n', ' ')

'1 vote 0 answers 9 views'

In [14]:
# get metadata
this_question_element.find('.s-post-summary--meta', first=True).text.replace('\n', ' ')

'python pandas dataframe dask Joseph Kuchar 11 asked 16 mins ago'

In [15]:
# get question excerpt
this_question_element.find('.s-post-summary--content-excerpt', first=True).text

"I am new to dask, so this is likely a case of user error, but I'm a bit perplexed. My problem is essentially that I have a large dataframe (tens of thousands of rows and hundreds to thousands of ..."

In [16]:
# get question title (this time via the title container class)
this_question_element.find('.s-post-summary--content-title', first=True).text

'Issues with workers dying and garbage collection summing over rows of Dataframe with distributed DASK'

In [17]:
# get question hyperlink
this_question_element.find('.s-link', first=True)

<Element 'a' href='/questions/74305172/issues-with-workers-dying-and-garbage-collection-summing-over-rows-of-dataframe' class=('s-link',)>

## Get Data for All Questions

## What do I want?

Ideally, I want to get:
- the question title
- answered / unanswered status (derived from the number of answers)
- number of answers
- number of votes
- number of views
- tags (this one might be tricky)
- the question hyperlink (not yet sure exactly how)
- a timestamp would be nice
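For the hyperlink item, the `.s-link` element found earlier carries a relative `href`. One possible approach, sketched here with the standard library, is to join it onto the site root. `SITE_ROOT` and `absolute_question_url` are hypothetical names. In the notebook the `href` would presumably come from the element's `attrs` dictionary.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://stackoverflow.com"  # assumed site root


def absolute_question_url(href):
    """Turn a relative question href into an absolute URL (hypothetical helper)."""
    return urljoin(SITE_ROOT, href)


# In the notebook the href would likely be read like this (untested assumption):
# href = this_question_element.find('.s-link', first=True).attrs['href']
href = "/questions/74305172/issues-with-workers-dying-and-garbage-collection-summing-over-rows-of-dataframe"
print(absolute_question_url(href))
```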

In [18]:
# define key names and the CSS classes needed
keynames = ['title', 'stats', 'tags']
classes_needed = ['.s-post-summary--content-title', '.s-post-summary--stats', '.s-post-summary--meta-tags',]

In [19]:
datas = []

for q_el in question_elements:
    q_data = {}
    for i, _class in enumerate(classes_needed):
        sub_el = q_el.find(_class, first=True)
        keyname = keynames[i]
        q_data[keyname] = sub_el.text 
    datas.append(q_data)

In [20]:
df = pd.DataFrame(datas)
df.head(3)

Unnamed: 0,title,stats,tags
0,Issues with workers dying and garbage collecti...,1 vote\n0 answers\n9 views,python\npandas\ndataframe\ndask
1,Setting maximum number of workers in Dask map ...,1 vote\n0 answers\n10 views,python\ndask\ndask-distributed\ndask-dataframe...
2,Dask Dataframe read_sql_table & to_sql method ...,1 vote\n0 answers\n18 views,python\npostgresql\nsqlalchemy\ndask\ndask-dat...


OK, the basics are working here.

Let's now refine:
- Separate votes, answers, and views into separate columns
- Split the tags into separate values
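As a rough sketch of the first refinement, the stats text can be parsed into named counts instead of one string. `parse_count` and `parse_stats` are hypothetical helpers. The `k` suffix matches the abbreviated view counts seen on the site, while the `m` suffix and decimal values such as `1.5k` are assumptions.

```python
import re

# multipliers for abbreviated counts; 'm' is an assumption
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_count(token):
    """Convert a count such as '9', '2k' or '1.5k' into an int."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([km]?)", token.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised count: {token!r}")
    number, suffix = match.groups()
    return round(float(number) * MULTIPLIERS[suffix])


def parse_stats(text):
    """Map each stats line (number followed by a label) to a named count."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        value, label = line.split(maxsplit=1)
        # normalise 'vote' and 'votes' to the same key
        key = label.strip().lower().rstrip("s") + "s"
        stats[key] = parse_count(value)
    return stats


print(parse_stats("1 vote\n0 answers\n9 views"))
```

Returning a dictionary keeps the column names attached to the numbers, so `pd.DataFrame` can create the `votes`, `answers` and `views` columns directly.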

In [21]:
# get stats and split
this_question_element.find('.s-post-summary--stats', first=True).text.split("\n")

['1 vote', '0 answers', '9 views']

In [22]:
# get only numbers
stats_test = re.findall(r'\d+', this_question_element.find('.s-post-summary--stats', first=True).text)
stats_test

['1', '0', '9']

In [23]:
# get just tags
this_question_element.find('.s-post-summary--meta-tags', first=True).text.split('\n')

['python', 'pandas', 'dataframe', 'dask']

OK, we now have clean tags and stats. Let's try this again:

In [24]:
# # function that will clean the scraped data
# def clean_scraped_data(text, keyname=None):
#     if keyname == 'stats':
#         return re.findall(r'\b\d+\b', text)
#     elif keyname == 'tags':
#         return text.split("\n")
#     return text

In [37]:
# function that will clean the scraped data
def clean_scraped_data(text, keyname=None):
    if keyname == 'stats':
        counts = []
        for line in text.split('\n'):
            # keep only the leading number, e.g. '9' from '9 views' or '2k' from '2k views'
            token = line.split()[0].lower()
            if token.endswith('k'):
                # '2k' -> 2000; going through float() also handles a decimal such as '1.5k'
                counts.append(round(float(token[:-1]) * 1000))
            else:
                counts.append(int(token))
        return counts
    
    elif keyname == 'tags':
        return text.split("\n")
    return text

In [38]:
datas = []

for q_el in question_elements:
    q_data = {}
    for i, _class in enumerate(classes_needed):
        sub_el = q_el.find(_class, first=True)
        keyname = keynames[i]
        q_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname) 
    datas.append(q_data)

In [39]:
df = pd.DataFrame(datas)
df.head()

Unnamed: 0,title,stats,tags
0,Issues with workers dying and garbage collecti...,"[1, 0, 9]","[python, pandas, dataframe, dask]"
1,Setting maximum number of workers in Dask map ...,"[1, 0, 10]","[python, dask, dask-distributed, dask-datafram..."
2,Dask Dataframe read_sql_table & to_sql method ...,"[1, 0, 18]","[python, postgresql, sqlalchemy, dask, dask-da..."
3,Dask running out of memory even when partition...,"[0, 0, 24]","[dask, dask-distributed, dask-dataframe]"
4,Disable pure function assumption in dask distr...,"[0, 0, 11]","[python, dask, dask-distributed]"


Sweet. This is working for the first 50 results!

In [40]:
# define function that will parse a single page

def parse_tagged_page(html):
    question_elements = html.find(".s-post-summary")
    keynames = ['title', 'stats', 'tags']
    classes_needed = ['.s-post-summary--content-title', '.s-post-summary--stats', '.s-post-summary--meta-tags',]
    datas = []
    for q_el in question_elements:
        q_data = {}
        for i, _class in enumerate(classes_needed):
            sub_el = q_el.find(_class, first=True)
            keyname = keynames[i]
            q_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname) 
        datas.append(q_data)
    return datas

In [41]:
# define function that will extract data from url
def extract_data_from_url(url):
    r = requests.get(url)
    if r.status_code not in range(200, 300):  # treat any non-2xx response as a failed page
        return []
    html_str = r.text
    html = HTML(html=html_str)
    datas = parse_tagged_page(html)
    return datas

In [42]:
# function that will scrape the entire tag
def scrape_tag(tag="python", query_filter="Newest", max_pages=100, pagesize=50):
    base_url = 'https://stackoverflow.com/questions/tagged/'
    datas = []
    # pages are numbered from 1
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
        datas += extract_data_from_url(url)
        # pause between requests to avoid hammering the server
        time.sleep(1.2)
    return datas
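`scrape_tag` always requests `max_pages` pages, even when a tag has fewer. A possible variant, sketched below, stops at the first page that yields no rows. `scrape_until_empty` and its `fetch_page` argument are hypothetical. The sketch assumes that a page past the end returns no question summaries, which has not been verified here.

```python
import time


def scrape_until_empty(fetch_page, max_pages=100, delay=1.2):
    """Collect rows page by page, stopping at the first empty page.

    fetch_page is any callable that takes a 1-based page number
    and returns a list of row dictionaries.
    """
    rows = []
    for page_num in range(1, max_pages + 1):
        page_rows = fetch_page(page_num)
        if not page_rows:
            # nothing came back: assume we are past the last page
            break
        rows.extend(page_rows)
        # pause between requests, as in scrape_tag
        time.sleep(delay)
    return rows


# stand-in for the real fetch, so the sketch runs without network access
fake_pages = {1: [{"title": "a"}], 2: [{"title": "b"}]}
print(scrape_until_empty(lambda p: fake_pages.get(p, []), delay=0))
```

In the notebook, `fetch_page` could wrap `extract_data_from_url` with the page URL. Since that function also returns an empty list on a failed request, a transient error would end the scrape early.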

In [None]:
%%time 
datas = scrape_tag(tag='dask')

In [None]:
df = pd.DataFrame(datas)
df.head()

In [None]:
len(df)

## Clean up dataframe

In [34]:
# get stats into separate columns
df[['votes', 'answers', 'views']] = pd.DataFrame(df['stats'].to_list())
df = df.drop(columns=['stats'])
df.head()

Unnamed: 0,title,tags,votes,answers,views
0,Issues with workers dying and garbage collecti...,"[python, pandas, dataframe, dask]",1,0,9
1,Setting maximum number of workers in Dask map ...,"[python, dask, dask-distributed, dask-datafram...",1,0,10
2,Dask Dataframe read_sql_table & to_sql method ...,"[python, postgresql, sqlalchemy, dask, dask-da...",1,0,18
3,Dask running out of memory even when partition...,"[dask, dask-distributed, dask-dataframe]",0,0,24
4,Disable pure function assumption in dask distr...,"[python, dask, dask-distributed]",0,0,11


In [None]:
# TODO
# 1. Fetch question URL
# 2. Fetch question timestamp
# 3. Create "answered" true/false column
# 4. For fun: use Futures to fetch data in parallel
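For item 4, one way to fetch pages in parallel is a thread pool from `concurrent.futures`. This is only a sketch: `fetch_pages_in_parallel` and `fetch_page` are hypothetical names, and parallel requests drop the polite delay used in `scrape_tag`, so a small `max_workers` is probably wise.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_pages_in_parallel(fetch_page, page_numbers, max_workers=4):
    """Run fetch_page over page_numbers in worker threads.

    executor.map returns results in the order of page_numbers,
    so the rows stay in page order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        per_page = list(executor.map(fetch_page, page_numbers))
    # flatten the list of per-page row lists into one list
    return [row for page_rows in per_page for row in page_rows]


# stand-in for a real page fetch, so the sketch runs without network access
def fake_fetch(page_num):
    return [{"page": page_num, "title": f"question on page {page_num}"}]


rows = fetch_pages_in_parallel(fake_fetch, range(1, 4))
print(len(rows))
```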

## Run some analyses

In [35]:
# most-upvoted questions
df.sort_values('votes', ascending=False).head(10)

Unnamed: 0,title,tags,votes,answers,views
2597,How to parallelize groupby() in dask?,"[pandas, parallel-processing, pandas-groupby, ...",9,1,2000
3705,"ValueError: Not all divisions are known, can't...","[python, dataframe, dask, dask-distributed]",9,1,3000
2969,Dask: Drop NAs on columns?,"[python, pandas, optimization, dask]",9,3,4000
3224,Sorting in Dask,"[sorting, dask, dask-distributed, dask-delayed]",9,1,6000
3150,dask: specify number of processes,"[python, dask]",9,2,6000
3410,Dask delayed object of unspecified length not ...,"[python, dictionary, dask, dask-delayed]",9,1,6000
3570,dask.multiprocessing or pandas + multiprocessi...,"[python, multithreading, pandas, multiprocessi...",9,1,11000
3572,dask apply: AttributeError: 'DataFrame' object...,"[python, dask]",9,1,8000
3436,Summarize categorical data in Dask DataFrame,"[python, dask]",9,1,747
3329,How to use Dask Pivot_table?,"[dataframe, pivot-table, dask]",9,1,8000


In [36]:
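# most-viewed questions
# cast to int so that e.g. 11000 sorts above 999 (the output below appears to have been sorted as text)
df.astype({'views': int}).sort_values('views', ascending=False).head(10)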

Unnamed: 0,title,tags,votes,answers,views
2691,convert CSV file to parquet using dask (jupyte...,"[python, tensorflow, jupyter-notebook, dask, p...",2,1,999
2757,How to convert column into category 'as_known(...,"[python, dask, dask-distributed]",1,1,998
2841,Get ID of Dask worker from within a task,"[dask, dask-distributed]",7,1,998
2487,Pycharm debugger throws Bad file descriptor er...,"[python, pycharm, dask, dask-distributed]",4,0,994
3602,Lazy repartitioning of dask dataframe,"[dask, dask-distributed]",3,1,994
3345,Push a pure-python module to Dask workers,[dask],0,2,993
3451,distributed.protocol.pickle - INFO - Failed to...,"[python-3.x, python-2.7, dask, dask-distribute...",2,1,992
4066,dask.array.reshape very slow,"[arrays, dask]",1,1,991
2295,Specify dashboard port for dask,"[python, dask, dask-distributed]",5,1,990
2020,dask: astype() got an unexpected keyword argum...,[dask],0,1,990


In [None]:
# most common tags
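A minimal sketch for this last cell, assuming each row's `tags` value is a list of strings as in the dataframe above: flatten the lists and count with `collections.Counter`. `most_common_tags` is a hypothetical helper, and the sample data is made up from the first rows shown earlier.

```python
from collections import Counter


def most_common_tags(tag_lists, n=10):
    """Count tags across an iterable of tag lists and return the top n."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return counts.most_common(n)


# small sample in the same shape as df['tags']
sample = [
    ["python", "pandas", "dataframe", "dask"],
    ["python", "dask", "dask-distributed"],
    ["dask"],
]
print(most_common_tags(sample, n=2))  # → [('dask', 3), ('python', 2)]
```

In the notebook, passing `df['tags']` should work the same way, since iterating over the column yields one tag list per question.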