**Stack Exchange - Data Scraping for Questions and Tags**

We want to look at the distribution of Questions and Tags on [Stack Exchange - Ask Ubuntu](https://askubuntu.com/) to see if we find anything interesting. To do this, we'll first scrape the question titles and their tags from each of the site's question listing pages that we decide to cover.


**Working out which pages to scrape**

Once we've defined our goal, we need to identify an efficient set of pages to scrape. To fetch those pages we use the `requests` library. A request is what happens when we access a web page: we 'request' the content of the page from the server.

In [1]:
import pandas as pd
import numpy as np
import os
import time
import requests
from requests import get
import regex as re
from bs4 import BeautifulSoup

In [2]:
url = 'https://askubuntu.com/questions'
time.sleep(5)
# Getting the response from the source URL.
response = get(url)
print(response.text[:500])

<!DOCTYPE html>


    <html class="html__responsive">

    <head>

        <title>Newest Questions - Ask Ubuntu</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/askubuntu/Img/favicon.ico?v=928dfb7c1990">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/askubuntu/Img/apple-touch-icon.png?v=e16e1315edd6">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/askubuntu/Img/apple-touch-icon.png?v=e16e1315edd6"> 
        <lin


As the first line of `response.text` shows, the server sent us an HTML document. This document describes the overall structure of the web page, along with its specific content (which is what makes that particular page unique).

In [3]:
# Parsing response.text by creating a BeautifulSoup object and assigning it to html_soup.
time.sleep(5)
html_soup = BeautifulSoup(response.text, 'html.parser')


In [4]:
# Collecting the question summaries on the page (the listing shows 50 questions per page).
# Each question sits in a div with the class question-summary, which holds all the information displayed for it.

Questions_EachPage = html_soup.find_all('div', class_ = 'question-summary')
print(type(Questions_EachPage))
print(len(Questions_EachPage))

<class 'bs4.element.ResultSet'>
50


As we can see, there are 50 `question-summary` containers, which is what we expected given the page size of 50.

However, we only want to scrape the question titles from each summary, not the other details such as time, comments, votes, answers or views. So we will only deal with the `<h3>` and `<a>` tags.

**Extracting data for a single question:**


In [5]:
# Printing out the HTML content of our first question.

first_question = Questions_EachPage[0]
print(first_question)

<div class="question-summary" id="question-summary-1287284">
<div class="statscontainer">
<div class="stats">
<div class="vote">
<div class="votes">
<span class="vote-count-post"><strong>0</strong></span>
<div class="viewcount">votes</div>
</div>
</div>
<div class="status unanswered">
<strong>0</strong>answers
            </div>
</div>
<div class="views" title="10 views">
    10 views
</div>
</div>
<div class="summary">
<h3><a class="question-hyperlink" href="/questions/1287284/how-do-i-fix-my-display-resolution-problem-on-ubuntu-20-10-raspberry-pi">How do I fix my display resolution problem on Ubuntu 20.10 Raspberry Pi?</a></h3>
<div class="excerpt">
            My monitor has a resolution of 1368x768. In the Ubuntu settings page, there's no option for my display resolution nor 1366x768. All of the options have the top bar cut out or the dock. I used xrandr ...
        </div>
<div class="tags t-display t-display-resolution t-xrandr t-20û10">
<a class="post-tag" href="/questions/t

As we can see, the HTML content of a single container is very long. To find the HTML element that holds each data point, we can inspect the page with the browser's DevTools.

In [6]:
# Requesting the page once to check for timeouts or connection resets, using try/except.

from urllib.request import urlopen
from socket import timeout

url = "https://askubuntu.com/questions?tab=newest&pagesize=50"
try:
    string = urlopen(url, timeout=5).read()
except ConnectionResetError:
    print("==> ConnectionResetError")
except timeout:
    print("==> Timeout")
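
The check above only reports a timeout or a connection reset. When scraping hundreds of pages it can help to retry a failed request a few times before giving up. The following is a minimal sketch of that idea, not part of the original notebook: the function name `fetch_with_retries` and the parameters `opener`, `retries`, `backoff` and `sleep` are hypothetical, and `urlopen` may also raise other errors (for example `URLError`) that this sketch does not handle.

```python
import time
from socket import timeout as SocketTimeout
from urllib.request import urlopen


def fetch_with_retries(url, opener=urlopen, retries=3, backoff=1.0, sleep=time.sleep):
    """Return the response body, retrying on timeouts and connection resets.

    All parameters except `url` are hypothetical knobs added for this sketch.
    """
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url, timeout=5) as response:
                return response.read()
        except (ConnectionResetError, SocketTimeout) as err:
            last_error = err
            if attempt < retries - 1:
                # Wait longer after each failure: backoff, 2*backoff, 4*backoff, ...
                sleep(backoff * (2 ** attempt))
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

Passing the opener and the sleep function as arguments keeps the helper easy to try out without touching the network.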

In [7]:
pages_list = []                       # List of page URLs to scrape

pages_list.append("https://askubuntu.com/questions?tab=newest&pagesize=50")

for page in range(2, 601):            # Pages 2 to 600, so 600 pages in total with the first one
  pages_list.append('https://askubuntu.com/questions?tab=newest&page='+str(page))
  
print(len(pages_list))

600
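
The page URLs can also be built from their query parameters, which makes it explicit that every page uses the same page size. This is a sketch under assumptions: it assumes the site accepts `page=1` and a `pagesize` parameter on every page (the notebook only sets `pagesize` on the first URL), and the names `BASE_URL` and `build_page_urls` are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://askubuntu.com/questions"


def build_page_urls(n_pages, pagesize=50, base_url=BASE_URL):
    """Return one listing URL per page, numbered from 1 to n_pages."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({"tab": "newest", "page": page, "pagesize": pagesize})
        urls.append(f"{base_url}?{query}")
    return urls


pages = build_page_urls(600)
print(len(pages))   # 600
print(pages[0])     # https://askubuntu.com/questions?tab=newest&page=1&pagesize=50
```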


In [8]:
questions_list = []                   # List that will collect the question titles

# time.sleep(5)
for x in pages_list:                  # Looping over all the pages in the page list
  page = requests.get(x)
  soup = BeautifulSoup(page.text, 'html.parser')

  # Finding every h3 tag that contains a link. Note that this also matches
  # h3 headings that are not questions, such as the site navigation headings.
  question_name = soup.find_all('h3')
  for question in question_name:
    if question.find('a'):
      questions_list.append(question.find('a').text)

In [9]:
# Printing the number of entries scraped and the first 11 of them.

print(len(questions_list))
print(questions_list[:11])

31200
['current community', 'more stack exchange communities', 'How do I fix my display resolution problem on Ubuntu 20.10 Raspberry Pi?', 'Text size in xfig not working in Ubuntu version 20.04', 'Not everything loads correctly when logging in', "Please how do i fix this NO DEVICES FOUND: Press 'M' and '+' to add", 'Ubuntu 20.04 never starts install', 'System upgrade to latest 20.04', 'USB formatting problem - unable due to error (udisks-error-quark, 0)', 'Certbot SSL certificate generation issue', 'Nvidia Graphics Card Disaster After Update to 20.04']


In [10]:
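# To get the tags from the webpage

def ListToString(items):
  # Concatenate the list of class names into a single string, with no separator
  return "".join(items)


tag = []                                # One '|'-separated tag string per question

for x in pages_list:
  page = requests.get(x)
  soup = BeautifulSoup(page.text, 'html.parser')

  # Each question has a div whose class list looks like: tags t-<tag1> t-<tag2> ...
  tags = []
  for div in soup.find_all('div', {"class": re.compile("^tags")}):
    tags.append(div.get('class')[1:])   # Drop the leading 'tags' class

  # Join the class names and turn every 't-' into a '|' separator.
  # Note: this also splits tag names that themselves contain 't-'.
  clean_tags = []
  for i in tags:
    clean_tags.append(ListToString(i).replace('t-','|'))

  # Drop the leading '|' from each tag string
  for j in clean_tags:
    tag.append(j[1:])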

In [11]:
print(len(tag))
print(tag[:11])

30000
['drivers|nvidia|cuda', 'usb|mount|sd-card|fsck|vmware-workstation', 'deja-dup', 'unity|gnome|keyboard|shortcu|keys', 'display|display-resolution|xrandr|20û10', '20û04', 'gnome-shell|desktop-environments', 'programming|minecraft|devices|preseed|bitcoin', 'system-installation', 'upgrade|system', 'partitioning|usb|format']
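
The output contains the tag string `unity|gnome|keyboard|shortcu|keys`, which suggests that a tag such as `shortcut-keys` was split because it contains `t-` itself. A possible alternative is to strip the `t-` prefix from each class name separately instead of replacing `t-` in the joined string. The sketch below is an illustration only: `classes_to_tags` and `decode_dots` are hypothetical names, and treating `û` as an encoded `.` is an assumption based on the `20û10` tag appearing on a question about Ubuntu 20.10.

```python
def classes_to_tags(classes, decode_dots=True):
    """Turn a class list like ['tags', 't-display', 't-xrandr'] into tag names."""
    tags = []
    for name in classes:
        if not name.startswith("t-"):
            continue                      # Skip 'tags' and any unrelated class
        tag = name[2:]                    # Drop only the leading 't-' prefix
        if decode_dots:
            tag = tag.replace("û", ".")   # Assumption: 'û' stands for '.'
        tags.append(tag)
    return tags


example = ["tags", "t-unity", "t-gnome", "t-keyboard", "t-shortcut-keys"]
print("|".join(classes_to_tags(example)))   # unity|gnome|keyboard|shortcut-keys
```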


In [12]:
# Creating a dataframe for questions and tags

ask_ubuntu_df = pd.DataFrame(list(zip(tag, questions_list)), columns =['Tags', 'Questions']) 
ask_ubuntu_df.head(10)

Unnamed: 0,Tags,Questions
0,drivers|nvidia|cuda,current community
1,usb|mount|sd-card|fsck|vmware-workstation,more stack exchange communities
2,deja-dup,How do I fix my display resolution problem on ...
3,unity|gnome|keyboard|shortcu|keys,Text size in xfig not working in Ubuntu versio...
4,display|display-resolution|xrandr|20û10,Not everything loads correctly when logging in
5,20û04,Please how do i fix this NO DEVICES FOUND: Pre...
6,gnome-shell|desktop-environments,Ubuntu 20.04 never starts install
7,programming|minecraft|devices|preseed|bitcoin,System upgrade to latest 20.04
8,system-installation,USB formatting problem - unable due to error (...
9,upgrade|system,Certbot SSL certificate generation issue
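
In the table above, the first two entries in the Questions column are navigation headings rather than questions, and the tags appear to be out of step with the titles (the `display|display-resolution|xrandr|20û10` tags belong to the display resolution question printed earlier). Titles and tags were collected in two separate passes, so nothing ties a tag string to its title. One way to keep them together is to read both from the same container in a single pass. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it is self-contained. The class `QuestionParser` and the sample HTML are made up for illustration, and it assumes that each `post-tag` link contains the tag name as its text, which the truncated HTML above does not confirm.

```python
from html.parser import HTMLParser


class QuestionParser(HTMLParser):
    """Collect title and tags together for each question on a listing page."""

    def __init__(self):
        super().__init__()
        self.records = []      # List of {"title": str, "tags": [str, ...]}
        self._capture = None   # "title", "tag", or None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "question-hyperlink" in classes:
            # A new question starts here; the tags that follow belong to it.
            self.records.append({"title": "", "tags": []})
            self._capture = "title"
        elif "post-tag" in classes and self.records:
            self.records[-1]["tags"].append("")
            self._capture = "tag"

    def handle_data(self, data):
        if self._capture == "title":
            self.records[-1]["title"] += data
        elif self._capture == "tag":
            self.records[-1]["tags"][-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._capture = None


# Made-up HTML shaped like the container printed earlier (hrefs are placeholders).
sample = """
<h3><a href="#">current community</a></h3>
<div class="question-summary">
  <h3><a class="question-hyperlink" href="#">How do I fix my display resolution?</a></h3>
  <div class="tags t-display t-xrandr">
    <a class="post-tag" href="#">display</a>
    <a class="post-tag" href="#">xrandr</a>
  </div>
</div>
"""

parser = QuestionParser()
parser.feed(sample)
parser.close()
for record in parser.records:
    print(record["title"], "->", "|".join(record["tags"]))
```

Because only links with the `question-hyperlink` class start a record, the navigation heading is skipped and each title keeps its own tags.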


In [13]:
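# Intended to remove special characters from the Questions and Tags.
# Note: Series.replace only replaces cells whose entire value equals '()', ';' or ':',
# so these characters are left in place inside longer strings (.str.replace works on substrings).

ask_ubuntu_df['Questions'] = ask_ubuntu_df['Questions'].replace('()','').replace(';','').replace(':','')
ask_ubuntu_df['Tags'] = ask_ubuntu_df['Tags'].replace('()','').replace(';','').replace(':','')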
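
The comment in this cell describes removing special characters and URLs from each string. A per-string helper is one way to express that. The following is a sketch only: the name `clean_text`, the URL pattern and the exact set of characters removed are assumptions chosen for illustration, not the notebook's actual cleaning rules.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")     # Assumed, simplified URL pattern
UNWANTED = str.maketrans("", "", "();:")      # Characters to delete


def clean_text(text):
    """Remove URLs and a few punctuation characters, then tidy whitespace."""
    text = URL_PATTERN.sub("", text)
    text = text.translate(UNWANTED)
    return " ".join(text.split())


print(clean_text("CUDA: driver (v2); see https://example.com now"))
# CUDA driver v2 see now
```

Such a helper could be applied to a column with `ask_ubuntu_df['Questions'].map(clean_text)`.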

In [14]:
ask_ubuntu_df["Tags"] = [item.replace("|", " __label__") for item in ask_ubuntu_df["Tags"]]
ask_ubuntu_df['Tags'] ='__label__' + ask_ubuntu_df['Tags'].astype(str)
ask_ubuntu_df.head()

Unnamed: 0,Tags,Questions
0,__label__drivers __label__nvidia __label__cuda,current community
1,__label__usb __label__mount __label__sd-card _...,more stack exchange communities
2,__label__deja-dup,How do I fix my display resolution problem on ...
3,__label__unity __label__gnome __label__keyboar...,Text size in xfig not working in Ubuntu versio...
4,__label__display __label__display-resolution _...,Not everything loads correctly when logging in
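
Each training line for fastText consists of the labels, each prefixed with `__label__`, followed by the text. The two steps above can also be written as a single function that works from a list of tags. This is a minimal sketch with a hypothetical name, `to_fasttext_line`. It keeps the tab between labels and text that the notebook uses when saving the file, and it collapses whitespace in the question so that a stray newline cannot split a record.

```python
def to_fasttext_line(tags, question):
    """Format one example as '__label__a __label__b<TAB>question text'."""
    labels = " ".join(f"__label__{tag}" for tag in tags)
    # Collapse runs of whitespace (including newlines and tabs) into single spaces.
    text = " ".join(question.split())
    return f"{labels}\t{text}"


line = to_fasttext_line(["drivers", "nvidia", "cuda"], "CUDA: driver  version\n")
print(line)
```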


In [167]:
# Setting the current working directory

relative_path = "/Users/Agam/Project Files/"
os.chdir(relative_path)

In [168]:
# Cloning and Installing fastText
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!make
!cp fasttext ../
%cd ..

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.22 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (2418/2418), done.
/Users/Agam/Project Files/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++11 -march

In [169]:
# Changing the respective path to the fastText directory after cloning fastText.

relative_path = "/Users/Agam/Project Files/fastText"
os.chdir(relative_path)

In [170]:
# Save the DataFrame as a .txt file which is required for feeding to fastText

np.savetxt('ask_ubuntu.txt', ask_ubuntu_df.values, fmt='%s', delimiter='\t')

In [171]:
!head "/Users/Agam/Project Files/fastText/ask_ubuntu.txt"

__label__server __label__games	current community
__label__18û04 __label__virtualization __label__kvm	more stack exchange communities
__label__20û04 __label__wacom __label__graphics-tablet	Unable to start virtual domain in Ubuntu 18.04 LTS
__label__cpu-load	Wacom gets stuck in Ubuntu 20.04
__label__dual-boot __label__grub2	How to Monitor CPU and GPU usage for a specific amount of time
__label__drivers __label__nvidia __label__cuda	GRUB can't find Windows 10 after changing SATA cable entry
__label__usb __label__mount __label__sd-card __label__fsck __label__vmware-workstation	CUDA: driver version is upgrading properly, but Runtime API not upgrading
__label__deja-dup	fsck read-only file system on unmounted usb stick (SD Card Reader)
__label__unity __label__gnome __label__keyboard __label__shortcu __label__keys	Ubuntu 20.04 locks up when restoring a backup
__label__display __label__display-resolution __label__xrandr __label__20û10	How can I enable shortcut hint overlay screen in Ub

In [172]:
# Computing the number of records for the training, validation and testing splits (70% / 15% / 15%).

train_data= round(len(ask_ubuntu_df)*0.70)
validation_data = round(len(ask_ubuntu_df)*0.15)
testing_data = round(len(ask_ubuntu_df)*0.15) 

print("Number of records for training dataset are:", train_data)
print("Number of records for validation dataset are:", validation_data)
print("Number of records for testing dataset are:", testing_data)

Number of records for training dataset are: 21000
Number of records for validation dataset are: 4500
Number of records for testing dataset are: 4500
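
These counts are later used to take consecutive rows from the saved file. Since the rows are in scraping order, one option worth considering is to shuffle them before splitting so that each split covers the same mix of pages. The sketch below shows that variant using only the standard library. The function `split_rows` and its `seed` parameter are hypothetical, and shuffling is a suggestion rather than what the notebook does.

```python
import random


def split_rows(rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle rows reproducibly and cut them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # Whatever remains goes to the test set
    return train, val, test


train, val, test = split_rows(range(30000))
print(len(train), len(val), len(test))   # 21000 4500 4500
```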


In [173]:
ask_ubuntu_df.head(10)

Unnamed: 0,Tags,Questions
0,__label__server __label__games,current community
1,__label__18û04 __label__virtualization __label...,more stack exchange communities
2,__label__20û04 __label__wacom __label__graphic...,Unable to start virtual domain in Ubuntu 18.04...
3,__label__cpu-load,Wacom gets stuck in Ubuntu 20.04
4,__label__dual-boot __label__grub2,How to Monitor CPU and GPU usage for a specifi...
5,__label__drivers __label__nvidia __label__cuda,GRUB can't find Windows 10 after changing SATA...
6,__label__usb __label__mount __label__sd-card _...,"CUDA: driver version is upgrading properly, bu..."
7,__label__deja-dup,fsck read-only file system on unmounted usb st...
8,__label__unity __label__gnome __label__keyboar...,Ubuntu 20.04 locks up when restoring a backup
9,__label__display __label__display-resolution _...,How can I enable shortcut hint overlay screen ...


In [177]:
training_data =pd.read_table("/Users/Agam/Project Files/fastText/ask_ubuntu.txt", nrows=train_data,header=None)
valid_data =pd.read_table("/Users/Agam/Project Files/fastText/ask_ubuntu.txt",skiprows=train_data,nrows=validation_data,header=None)
#test_data =pd.read_csv("/Users/Agam/Project Files/ask_ubuntu.txt",skiprows=(test_data+validation_data),nrows=testing_data,header=None)

In [178]:
# Saving the training and validation files.

np.savetxt('/Users/Agam/Project Files/fastText/ask_ubuntu.train', training_data.values, fmt='%s', delimiter='\t')
np.savetxt('/Users/Agam/Project Files/fastText/ask_ubuntu.val', valid_data.values, fmt='%s', delimiter='\t')
# np.savetxt('/Users/Agam/Project Files/ask_ubuntu.test', testing_data.values, fmt='%s', delimiter='\t')

In [179]:
!./fasttext  supervised -input "/Users/Agam/Project Files/fastText/ask_ubuntu.train" -output ./model_ask_ubuntu

Read 0M words
Number of words:  23726
Number of labels: 2122
Progress: 100.0% words/sec/thread:    8803 lr:  0.000000 avg.loss: 10.292150 ETA:   0h 0m 0s


In [181]:
# Validating model accuracy on the validation set.

!./fasttext test ./model_ask_ubuntu.bin "/Users/Agam/Project Files/fastText/ask_ubuntu.val"

N	4494
P@1	0.276
R@1	0.0906
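
In this output, `N` is the number of examples, `P@1` is the precision of the single top prediction and `R@1` is the recall at one prediction. As commonly defined for multi-label data, precision at k divides the number of correct predictions by the number of predictions made, while recall at k divides it by the total number of true labels. The sketch below implements those definitions on toy data; `precision_recall_at_k` is a hypothetical helper, not fastText's own code. Under these definitions, the ratio 0.276 / 0.0906 ≈ 3.05 would suggest roughly three tags per question on average.

```python
def precision_recall_at_k(gold_labels, predictions, k=1):
    """Compute precision@k and recall@k over a set of multi-label examples.

    gold_labels: list of sets of true labels, one set per example.
    predictions: list of ranked label lists, best prediction first.
    Assumes at least one prediction and one true label overall.
    """
    correct = predicted = gold = 0
    for truth, ranked in zip(gold_labels, predictions):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        predicted += len(top_k)
        gold += len(truth)
    return correct / predicted, correct / gold


gold = [{"a", "b", "c"}, {"d"}]
preds = [["a"], ["x"]]
print(precision_recall_at_k(gold, preds, k=1))   # (0.5, 0.25)
```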


**The precision of our fastText model (P@1 of about 0.28 on the validation set) shows that scraping data from around 600 pages raised precision significantly, from about 9% to around 28%. For further analysis, we will consider tuning the number of epochs, the learning rate and word n-grams, as well as hierarchical softmax, and will validate the accuracy of the model on the validation data.**
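
The options mentioned above correspond to the `-epoch`, `-lr`, `-wordNgrams` and `-loss hs` flags of the fastText command line tool. As a sketch of how the next training run could be assembled, the helper below builds the command as a list of arguments. The function name `build_train_command` is hypothetical, and the default values are illustrative starting points that have not been tested on this dataset.

```python
import shlex


def build_train_command(train_path, output, epoch=25, lr=0.5, word_ngrams=2, loss="hs"):
    """Return the fastText supervised training command as an argument list."""
    return [
        "./fasttext", "supervised",
        "-input", train_path,
        "-output", output,
        "-epoch", str(epoch),              # Number of passes over the training data
        "-lr", str(lr),                    # Learning rate
        "-wordNgrams", str(word_ngrams),   # Maximum length of word n-grams
        "-loss", loss,                     # "hs" selects hierarchical softmax
    ]


cmd = build_train_command("ask_ubuntu.train", "./model_ask_ubuntu")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids quoting problems with paths that contain spaces.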