**Stack Exchange - Data Scraping for Questions and Tags**

We want to look at the distribution of questions and tags on [Stack Exchange - Ask Ubuntu](https://askubuntu.com/) to see if we find anything interesting. To do this, we’ll first scrape the question titles and their tags from the site's paginated question listing.


**Working out which pages to scrape**

Once we've defined our goal, we need to identify an efficient set of pages to scrape. To fetch those pages, we use the `requests` library. A request is what happens when we access a web page: we 'request' the content of the page from the server.

In [1]:
import pandas as pd
import numpy as np
import os
import time
import re
import requests
from requests import get
from bs4 import BeautifulSoup

In [2]:
url = 'https://askubuntu.com/questions'
time.sleep(5)
# Getting the response from the source URL.
response = get(url)
print(response.text[:500])

<!DOCTYPE html>


    <html class="html__responsive">

    <head>

        <title>Newest Questions - Ask Ubuntu</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/askubuntu/Img/favicon.ico?v=928dfb7c1990">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/askubuntu/Img/apple-touch-icon.png?v=e16e1315edd6">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/askubuntu/Img/apple-touch-icon.png?v=e16e1315edd6"> 
        <lin


As we can see from the first line of response.text, the server sent us an HTML document. This document describes the overall structure of that web page, along with its specific content (which is what makes that particular page unique).

In [3]:
# Parsing response.text by creating a BeautifulSoup object and assigning it to html_soup.
time.sleep(5)
html_soup = BeautifulSoup(response.text, 'html.parser')


In [4]:
# Counting the question summaries on the page (pagination set to 50 for viewing purposes).
# Each question sits in a div with the class question-summary, which holds all the information shown for that question.

Questions_EachPage = html_soup.find_all('div', class_ = 'question-summary')
print(type(Questions_EachPage))
print(len(Questions_EachPage))

<class 'bs4.element.ResultSet'>
50


As we can see, the page contains 50 `question-summary` containers, which is what we expected given the page size we set.

However, we only want to scrape the question titles from each summary, not the other details such as time, comments, votes, answers, or views. So we will only deal with the `<h3>` and `<a>` tags.

**Extracting data for a single question:**


In [5]:
# Printing out the HTML content of our first question.

first_question = Questions_EachPage[0]
print(first_question)

<div class="question-summary" id="question-summary-1286792">
<div class="statscontainer">
<div class="stats">
<div class="vote">
<div class="votes">
<span class="vote-count-post "><strong>0</strong></span>
<div class="viewcount">votes</div>
</div>
</div>
<div class="status unanswered">
<strong>0</strong>answers
            </div>
</div>
<div class="views " title="6 views">
    6 views
</div>
</div>
<div class="summary">
<h3><a class="question-hyperlink" href="/questions/1286792/removing-all-ubuntu-packages">REMOVING ALL UBUNTU PACKAGES</a></h3>
<div class="excerpt">
            Help. I have Ubuntu software on a bootable usb, due some issues I desperately need to remove all packages installed and make my Ubuntu 20.04 brand new. The software is so messed up  I just don't want ...
        </div>
<div class="tags t-20û04">
<a class="post-tag" href="/questions/tagged/20.04" rel="tag" title="show questions tagged '20.04'">20.04</a>
</div>
<div class="started fr">
<div class="user-info "

As we can see, the HTML content of a single container is very long. To find the HTML element that holds each data point, we can inspect the page with the browser's DevTools.

In [6]:
# Requesting the page once to check for timeouts or connection resets before scraping.

from urllib.request import urlopen
from socket import timeout

url = "https://askubuntu.com/questions?tab=newest&pagesize=50"
try:
    urlopen(url, timeout=5).read()
except ConnectionResetError:
    print("==> ConnectionResetError")
except timeout:
    print("==> Timeout")
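
The check above makes a single attempt. When looping over hundreds of pages, it can help to wrap each request in a small retry helper that pauses between attempts. The sketch below is one possible shape, not the notebook's method: `fetch_with_retry`, its parameters, and the pause lengths are hypothetical, and the fetching function is passed in so the helper can be tried without touching the network.

```python
import time
from socket import timeout
from urllib.request import urlopen


def fetch_with_retry(url, fetch=None, retries=3, pause=5.0, sleep=time.sleep):
    """Return the page body, or None if every attempt fails.

    `fetch`, `retries`, `pause`, and `sleep` are illustrative parameters,
    not part of any library API.
    """
    if fetch is None:
        # Default fetcher: the same urlopen call used in the cell above.
        fetch = lambda u: urlopen(u, timeout=5).read()

    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except (ConnectionResetError, timeout) as err:
            print(f"==> attempt {attempt} failed: {type(err).__name__}")
            if attempt < retries:
                sleep(pause * attempt)  # wait a little longer after each failure
    return None
```

Returning `None` instead of raising lets a scraping loop skip a page that keeps failing and carry on with the next one.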

In [7]:
pages_list=[]                         # Empty list that will hold the URL of every page to scrape

pages_list.append("https://askubuntu.com/questions?tab=newest&pagesize=50")

for page in range(2, 401):            # Adding the URLs of pages 2 to 400

  pages_list.append('https://askubuntu.com/questions?tab=newest&page='+str(page))
  
print(len(pages_list))

400
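
The page URLs above are built by string concatenation. An alternative sketch uses `urllib.parse.urlencode` so that every page carries the same query parameters. `build_page_urls` is a hypothetical helper, and passing `pagesize` and `page` on every URL (including page 1) is an assumption about how the site handles these parameters, based only on the URLs used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://askubuntu.com/questions"


def build_page_urls(n_pages, page_size=50, tab="newest"):
    """Build the listing URL for pages 1..n_pages (hypothetical helper)."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({"tab": tab, "pagesize": page_size, "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


pages = build_page_urls(400)
print(len(pages))
print(pages[1])
```

Stating the page size on every URL keeps the number of questions per page from depending on the site's default.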


In [8]:
questions_list=[]                     # Creating an empty list that will hold the question titles.

# time.sleep(5)
for x in pages_list:                  # Looping over all the pages in the page list
  page = requests.get(x)
  soup = BeautifulSoup(page.text, 'html.parser')

  question_name = soup.find_all('h3')   # Finding every h3 tag; the question titles are links inside h3 tags
  for question in question_name:
    if question.find('a'):
      questions_list.append(question.find('a').text)

In [9]:
# Printing the number of titles scraped and the first 11 entries.
print(len(questions_list))
print(questions_list[:11])

20800
['current community', 'more stack exchange communities', 'REMOVING ALL UBUNTU PACKAGES', 'What is the dependency of tshark', 'When I make upgrade will version increase', 'Problem with installing package (python-qt4)', "Screenpad (Asus Zenbook's screen touchpad) gestures on Ubuntu 20.10", "Can’t open /var/../freshclam.log error when updating ClamAV's virus database", 'Dell Inspiron 5491 Fingerprint not working on Ubuntu 20.04', 'Atom text editor gone after update to 20.10, error: Depends: gvfs-bin but it is not installable', 'socket error when try to start mysql']
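
The first two entries in the output above ('current community' and 'more stack exchange communities') are page headers rather than questions, because every `<h3>` on the page is collected. One way to avoid this, and to pick up the tags listed in the same container, is to read both from each `question-summary` div. The following is a sketch under the assumption that every page has the structure of the container printed earlier; `parse_questions` and the trimmed sample HTML are for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the container printed earlier, preceded by a header h3
# like the ones that leak into the list above.
SAMPLE_HTML = """
<h3>current community</h3>
<div class="question-summary" id="question-summary-1286792">
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/1286792/removing-all-ubuntu-packages">REMOVING ALL UBUNTU PACKAGES</a></h3>
    <div class="tags t-20û04">
      <a class="post-tag" href="/questions/tagged/20.04" rel="tag">20.04</a>
    </div>
  </div>
</div>
"""


def parse_questions(html):
    """Return (title, [tags]) pairs, one per question-summary container."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for summary in soup.find_all("div", class_="question-summary"):
        link = summary.find("a", class_="question-hyperlink")
        if link is None:
            continue  # skip containers without a title link
        tags = [a.get_text(strip=True)
                for a in summary.find_all("a", class_="post-tag")]
        rows.append((link.get_text(strip=True), tags))
    return rows


print(parse_questions(SAMPLE_HTML))
```

Because each title and its tags come from the same container, the two cannot drift out of step when a page fails to load or contains extra headings.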


In [10]:
# To get the tags from the webpage

def ListToString(items):
  # Join a list of strings into one string, e.g. ['t-apt', 't-atom'] -> 't-aptt-atom'
  return "".join(items)


tag=[]                                  # One '|'-separated string of tags per question

for x in pages_list:
  page = requests.get(x)
  soup = BeautifulSoup(page.text, 'html.parser')

  tags=[]

  # Each question's tags are encoded in the class list of its div, e.g. class="tags t-apt t-atom"
  for div in soup.find_all('div', {"class": re.compile("^tags")}):
    tags.append(div.get('class')[1:])   # Dropping the leading 'tags' class

  clean_tags=[]

  # Joining the classes and turning each 't-' prefix into a '|' separator
  for i in tags:
    clean_tags.append(ListToString(i).replace('t-','|'))

  # Dropping the leading '|'
  for j in clean_tags:
    tag.append(j[1:])

In [13]:
print(len(tag))
print(tag[:11])

12050
['20û04', 'dependencies|uninstall', 'lubuntu', 'python', 'touchpad|asus|synaptics|20û10', 'clamav|clamtk', '20û04', 'apt|atom', 'mysql', 'networking', 'system-installation|20û04|windows-10|secure-boot']
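
Replacing every `t-` in the joined class string also splits tags that contain `t-` themselves (a hypothetical tag `boot-repair` would come out as `boo|repair`), and the class names spell `20.04` as `20û04`. A sketch that strips the prefix from each class separately avoids the first problem. Mapping `û` back to `.` is an assumption inferred from the container printed earlier, where the class `t-20û04` sits next to the tag link `20.04`. `decode_tag_classes` is a hypothetical helper.

```python
def decode_tag_classes(classes, prefix="t-"):
    """Turn ['tags', 't-apt', 't-20û04'] into ['apt', '20.04']."""
    tags = []
    for name in classes:
        if not name.startswith(prefix):
            continue  # skip 'tags' and any other non-tag class
        tag = name[len(prefix):]
        tags.append(tag.replace("û", "."))  # assumption: 'û' stands for '.'
    return tags


print("|".join(decode_tag_classes(["tags", "t-boot-repair", "t-20û04"])))
```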


In [15]:
# Creating a dataframe for questions and tags

ask_ubuntu_df = pd.DataFrame(list(zip(tag, questions_list)), columns =['Tags', 'Questions']) 
ask_ubuntu_df.head(10)

Unnamed: 0,Tags,Questions
0,20û04,current community
1,dependencies|uninstall,more stack exchange communities
2,lubuntu,REMOVING ALL UBUNTU PACKAGES
3,python,What is the dependency of tshark
4,touchpad|asus|synaptics|20û10,When I make upgrade will version increase
5,clamav|clamtk,Problem with installing package (python-qt4)
6,20û04,Screenpad (Asus Zenbook's screen touchpad) ges...
7,apt|atom,Can’t open /var/../freshclam.log error when up...
8,mysql,Dell Inspiron 5491 Fingerprint not working on ...
9,networking,"Atom text editor gone after update to 20.10, e..."


In [16]:
# Removing special characters, hyperlinks/URL's from the Questions.

ask_ubuntu_df['Questions'] = ask_ubuntu_df['Questions'].replace('()','').replace(';','').replace(':','')
ask_ubuntu_df['Tags'] = ask_ubuntu_df['Tags'].replace('()','').replace(';','').replace(':','')
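
`Series.replace` with a plain string only replaces cells whose entire value equals that string, so a call like `.replace(';','')` leaves a title that merely contains `;` untouched. Substring replacement goes through the `.str` accessor instead. A small sketch on made-up data (the sample values and the set of characters to strip are illustrative):

```python
import pandas as pd

sample = pd.Series(["error: Depends; gvfs-bin (not installable)", ";"])

# Whole-value replacement: only the second cell, which is exactly ';', changes.
whole_value = sample.replace(";", "")

# Substring replacement: removes ';', ':' and parentheses wherever they occur.
substring = sample.str.replace(r"[();:]", "", regex=True)

print(whole_value.tolist())
print(substring.tolist())
```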

In [22]:
ask_ubuntu_df["Tags"] = [item.replace("|", " __label__") for item in ask_ubuntu_df["Tags"]]
ask_ubuntu_df['Tags'] ='__label__' + ask_ubuntu_df['Tags'].astype(str)
ask_ubuntu_df.head()

Unnamed: 0,Tags,Questions
0,__label____label____label__20û04,current community
1,__label____label____label__dependencies __labe...,more stack exchange communities
2,__label____label____label__lubuntu,REMOVING ALL UBUNTU PACKAGES
3,__label____label____label__python,What is the dependency of tshark
4,__label____label____label__touchpad __label__a...,When I make upgrade will version increase
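
The tripled `__label____label____label__` prefix in the output above is what results when the prefixing cell is executed more than once on the same column (an inference from the output, since the cell as written adds the prefix once per run). A sketch of a formatting step that can be re-run safely follows; `to_fasttext_labels` is a hypothetical helper, and it assumes tags are separated by `|` as in the DataFrame above.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_labels(tag_string, sep="|"):
    """Format 'apt|atom' as '__label__apt __label__atom'.

    Input that is already formatted comes back with single prefixes,
    so running the function twice gives the same result.
    """
    labels = []
    for part in tag_string.replace(sep, " ").split():
        while part.startswith(LABEL_PREFIX):
            part = part[len(LABEL_PREFIX):]  # strip prefixes left by earlier runs
        if part:
            labels.append(LABEL_PREFIX + part)
    return " ".join(labels)


once = to_fasttext_labels("touchpad|asus|synaptics")
print(once)
print(to_fasttext_labels(once) == once)
```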


In [23]:
# Setting the current working directory

os.chdir("/content/")

In [27]:
# Saving the DataFrame as a .txt file, which is the input format fastText requires

np.savetxt(r'/content/ask_ubuntu.txt', ask_ubuntu_df.values, fmt='%s', delimiter='\t')

In [29]:
!head "/content/ask_ubuntu.txt"

__label____label____label__20û04	current community
__label____label____label__dependencies __label__uninstall	more stack exchange communities
__label____label____label__lubuntu	REMOVING ALL UBUNTU PACKAGES
__label____label____label__python	What is the dependency of tshark
__label____label____label__touchpad __label__asus __label__synaptics __label__20û10	When I make upgrade will version increase
__label____label____label__clamav __label__clamtk	Problem with installing package (python-qt4)
__label____label____label__20û04	Screenpad (Asus Zenbook's screen touchpad) gestures on Ubuntu 20.10
__label____label____label__apt __label__atom	Can’t open /var/../freshclam.log error when updating ClamAV's virus database
__label____label____label__mysql	Dell Inspiron 5491 Fingerprint not working on Ubuntu 20.04
__label____label____label__networking	Atom text editor gone after update to 20.10, error: Depends: gvfs-bin but it is not installable
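
fastText's supervised tutorial lowercases the text and separates punctuation from words before training, so that `PACKAGES`, `packages` and `packages,` map to the same token. The notebook does not do this; the following is a sketch of a comparable step in Python. `normalize_question` and the exact set of punctuation characters are choices made for this illustration.

```python
import re

# Punctuation marks to separate from the neighbouring words.
_PUNCT = re.compile(r"([.!?,'/()])")


def normalize_question(text):
    """Lowercase a title and put spaces around common punctuation marks."""
    spaced = _PUNCT.sub(r" \1 ", text)
    return " ".join(spaced.lower().split())  # collapse repeated whitespace


print(normalize_question("Problem with installing package (python-qt4)"))
```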


In [56]:
# Computing the number of records for the training, validation and testing sets (70% / 15% / 15%).

train_data= round(len(ask_ubuntu_df)*0.70)
validation_data = round(len(ask_ubuntu_df)*0.15)
testing_data = round(len(ask_ubuntu_df)*0.15) 

print("Number of records for training dataset are:", train_data)
print("Number of records for validation dataset are:", validation_data)
print("Number of records for testing dataset are:", testing_data)


Number of records for training dataset are: 8435
Number of records for validation dataset are: 1808
Number of records for testing dataset are: 1808
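
Rounding each share separately gives 8435 + 1808 + 1808 = 12051, one more than the 12050 rows available, and slicing the file in order keeps the newest questions together in the training set. A sketch of one alternative: shuffle with a fixed seed and let the test set take whatever remains. `split_rows`, the seed, and the 70/15/15 defaults are illustrative.

```python
import random


def split_rows(rows, train_share=0.70, valid_share=0.15, seed=42):
    """Shuffle rows and split them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_share)
    n_valid = int(len(shuffled) * valid_share)

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no row is lost or reused
    return train, valid, test


train, valid, test = split_rows(range(12050))
print(len(train) + len(valid) + len(test))
```

The same function can be applied to the list of formatted lines before they are written to the `.train`, `.validation` and `.testing` files.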


In [45]:
training_data =pd.read_table("/content/ask_ubuntu.txt", nrows=train_data,header=None)
valid_data =pd.read_table("/content/ask_ubuntu.txt",skiprows=train_data,nrows=validation_data,header=None)
test_data =pd.read_table("/content/ask_ubuntu.txt",skiprows=(train_data+validation_data),nrows=testing_data,header=None)

In [48]:
# Creating and saving all the files to disk

np.savetxt(r'/content/ask_ubuntu.train', training_data.values, fmt='%s', delimiter='\t')
np.savetxt(r'/content/ask_ubuntu.validation', valid_data.values,fmt='%s', delimiter='\t')
np.savetxt(r'/content/ask_ubuntu.testing', test_data.values ,fmt='%s', delimiter='\t')

In [None]:
# Cloning and Installing fastText
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!make
!cp fasttext ../
%cd ..

In [53]:
# Training fastText on the training data

!./fasttext  supervised -input "/content/ask_ubuntu.train" -output ./model_ask_ubuntu

Read 0M words
Number of words:  12855
Number of labels: 2208
Progress: 100.0% words/sec/thread:    3353 lr:  0.000000 avg.loss: 12.654812 ETA:   0h 0m 0s


In [51]:
# Validating model accuracy on the validation set.

!./fasttext test ./model_ask_ubuntu.bin "/content/ask_ubuntu.validation"

N	1771
P@1	0.0898
R@1	0.0299
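
In fastText's test output, `N` is the number of examples evaluated, `P@1` is the share of top-1 predictions that are correct, and `R@1` is the share of all true labels that were recovered. Because most questions carry several tags while only one is predicted, `R@1` can be no higher than `P@1`; here the ratio of the two suggests roughly three tags per validation question. A sketch of how the two numbers can be computed by hand, on made-up labels; `precision_recall_at_k` is a hypothetical helper and not part of fastText.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute precision@k and recall@k over a set of examples.

    gold:      list of sets of true labels, one set per example
    predicted: list of ranked label lists, best prediction first
    """
    correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return correct / n_predicted, correct / n_gold


gold = [{"apt", "atom", "20.04"}, {"mysql"}]
predicted = [["apt"], ["networking"]]
precision, recall = precision_recall_at_k(gold, predicted, k=1)
print(precision, recall)
```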
