## Web scraping exercise

Define a generic function `SOS_help` that retrieves "Stack Overflow Stunning" help results from Stack Overflow. <br>

The following command works just fine:

Create a function `get_SOS_help` which: <br>
    - Prints "works as intended" if the command raises no error. <br>
    - Otherwise, prints the first Stack Overflow link related to the error. As an example: <br>
        `print_output(command = 'np.random.uniform(-1, 1, siz=100)')`
        should retrieve the following link:
        https://stackoverflow.com/questions/72537485/typeerror-uniform-got-an-unexpected-keyword-argument-low-size <br>
    - Prints the most voted answer.
    - Opens a new browser window using the link.

## Create a malfunctioning command and use this function on it
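
As a minimal sketch of the first step, the snippet below runs a deliberately broken command with `exec` and builds the `ExceptionName: message` string that can later be used as a search query. The helper name `capture_error` is hypothetical and not part of this notebook, and the standard-library `random.uniform` stands in for the NumPy call so that the sketch runs without extra packages.

```python
import random


def capture_error(command: str):
    """Run `command` and return 'ExceptionName: message', or None if it succeeds.

    Hypothetical helper, for illustration only.
    """
    try:
        # exec runs arbitrary code: only pass commands you trust.
        exec(command, {"random": random})
        return None
    except Exception as ex:
        return f"{type(ex).__name__}: {ex}"


broken = "random.uniform(-1, 1, siz=100)"   # 'siz' is not a valid keyword
working = "random.uniform(-1, 1)"

print(capture_error(broken))    # a TypeError description
print(capture_error(working))   # None, the command works as intended
```

The exact wording of the error message depends on the Python version, which is one reason the search result for the same command may differ between machines.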

# 1. Required Packages

In [1]:
import pandas as pd
import requests
import urllib.parse
import bs4
from bs4 import BeautifulSoup
from html.parser import HTMLParser
from os import linesep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# 2. Functions and classes

## 2.1. Global Parameters

In [2]:
GOOGLE_COOKIE_CHECK = ["h1","Uo8X3b OhScic zsYMMe"]
STACKOVERFLOW_COOKIE_CHECK = ["div","answer"]

## 2.2. Exceptions

In [3]:
class CookieError(Exception):
    pass

## 2.3. Pretty print functions

### 2.3.1 Colors class

This class holds the ANSI escape codes used to style Python's terminal output.

In [4]:
class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    ITALIC = '\x1B[3m'
    END = '\033[0m'
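
A short sketch of how such codes are typically combined: each code switches a style on, and the reset code switches everything off again. The `style` helper is a hypothetical name introduced only for this example; the three constants are copied from the class above.

```python
# Stand-alone copies of three codes from the `color` class.
BOLD = '\033[1m'
DARKCYAN = '\033[36m'
END = '\033[0m'


def style(text: str, *codes: str) -> str:
    """Wrap `text` in the given ANSI codes and reset the style afterwards."""
    return "".join(codes) + text + END


print(style("FIRST STACKOVERFLOW RESPONSE IN:", BOLD))
print(style("import pandas as pd", DARKCYAN))
```

Terminals and notebook outputs that understand ANSI sequences render the styles; anywhere else the raw codes appear as stray characters around the text.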

### 2.3.2 MyHTMLParser 

This class converts an HTML string into a list of lightly formatted strings.

In [5]:

    
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
    def feed(self, in_html):
        self.output = ""
        self.tag_lines = []
        super(MyHTMLParser, self).feed(in_html)
        return self.tag_lines
    def handle_data(self, data):
        if self.output not in([linesep,""," "]):
            self.tag_lines.append(self.output)
        self.output = data.strip()
    def handle_starttag(self, tag, attrs):
        aux = ""
        if tag == 'li':
            self.output = self.output + linesep
        elif tag == 'code':
            pass
        elif tag == 'blockquote' :
            self.output  =  self.output + '\t' 
        elif tag == 'p':
            self.output = self.output + color.END
        elif tag == 'div':
            pass
             
    def handle_endtag(self, tag):

        if tag == 'blockquote':
            self.output += ""  
        elif tag == 'code':
            self.output = '\t' + color.DARKCYAN + self.output + color.END
        elif tag == 'p':
            pass 
        elif tag == 'div':
            pass
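
The same idea can be sketched in a simplified form: subclass `HTMLParser`, track whether the parser is inside a `<code>` tag, and collect one output line per text node. `TextCollector` is a hypothetical class written for this illustration, not a replacement for `MyHTMLParser`, and it ignores colors, list items and blockquotes.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical, simplified parser: one output line per text node."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.in_code = False

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True

    def handle_endtag(self, tag):
        if tag == "code":
            self.in_code = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only nodes between tags
        # Indent code so it stands out from the surrounding prose.
        self.lines.append("\t" + text if self.in_code else text)


parser = TextCollector()
parser.feed("<div><p>You have imported it as</p>"
            "<pre><code>import pandas as pd</code></pre></div>")
parser.close()
print(parser.lines)
```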
            

### 2.3.3 print_help 

Prints the Stack Overflow answers. The custom parser may not always work properly, so it can be disabled by setting `parser_b` to False.

In [6]:

def print_help(content,url,parser_b = False):
    """
    Print the answer with a little styling. Code is shown in dark cyan if parser_b is True.
  
    Parameters:
    content (str): HTML content of the answer to be printed
    url (str): address of the page that contains the answers
    parser_b (bool): if True, the content is parsed with the MyHTMLParser class;
        otherwise the plain text is extracted with BeautifulSoup
  
    Returns:
    None
    """
    print(color.BOLD + "FIRST STACKOVERFLOW RESPONSE IN: "+ color.END + url ) 
    if parser_b:
        parser = MyHTMLParser()
        for text in parser.feed(content):
            print(text)
    else:
        # Use the content passed in, not a global DataFrame
        soup = BeautifulSoup(content, 'html.parser')
        print(soup.get_text())


## 2.4. Get html

- `error_encode_url`
    - error:str
- `get_html_search`
    - query: str
- `check_html_mal`
    - soup:
- `is_valid_html`
    - soup:bs4.BeautifulSoup, 
    - config:[str, str]

In [7]:
def error_encode_url(error:str):
    """
    Build a URL-encoded Google search URL for an error, prefixing the query with the website to search
  
    Parameters:
    error (str): string with the error to be searched 
  
    Returns:
    str: encoded search URL
    """
    base_search = "https://www.google.com/search?q="
    web = "stackoverflow.com"
    search = urllib.parse.quote_plus(f'{web} {error}')
    url_base = f'{base_search}{search}'
    return url_base
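
To illustrate what the encoding step produces, here is a small sketch that applies `urllib.parse.quote_plus` to a sample error string. The error text is only an example value; the base URL and the `stackoverflow.com` prefix are the ones used in the function above.

```python
from urllib.parse import quote_plus

error = "NameError: name 'pandas' is not defined"   # example error string
query = quote_plus(f"stackoverflow.com {error}")
url = f"https://www.google.com/search?q={query}"
print(url)
```

Spaces become `+`, while characters such as `:` and `'` are percent-encoded (`%3A`, `%27`), so the whole error message travels safely inside the `q` parameter.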

In [8]:
def get_html_search(query: str):
    """
    Get the HTTP response for the given query URL.
  
    Parameters:
    query (str): URL to be requested
  
    Returns:
    requests.Response: response of the request; the HTML is in its `text` attribute
    """
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'DNT': '1', 'Accept-Encoding': 'gzip, deflate', 'Cookie':'CONSENT=YES+cb.20210418-17-p0.it+FX+917; '}
    # headers = ""

    res = requests.get(query,headers=headers)
    if res.status_code != 200:
        raise requests.exceptions.RequestException("Cannot proceed: only responses with status code 200 are supported")
    return res

In [9]:
def check_html_mal(soup):
    """
    BAD: kept only as an example of how the code evolved.
    I overcomplicated this one.
    """
    ret = True
    common_button = ["Accept all", "Reject all","Rechazar todo","Aceptar todo"]
    common_title = ['Antes de ir a Google', 'Before you continue to Google']
    title_res = soup.find_all("h1")
    title_str = title_res[0].contents[0]
    button_res = soup.find_all("input",attrs={'class':'basebutton'})

    for b in button_res:
        if b.get_attribute_list("value")[0] in common_button:
            ret = False
    if title_str in common_title:
        ret = False
    return ret 

In [10]:


def is_valid_html(soup:bs4.BeautifulSoup, config:[str, str]):
    """
    Check whether the page contains the expected result. 
    If not, return False. This may happen when you have to agree to the cookie policy first.
    Parameters:
    soup (bs4.BeautifulSoup): soup with the parsed HTML 
    config ([str, str]): tag name and class that must be present in a valid page
  
    Returns:
    bool: True if the page is valid, False otherwise
    """
    ret = False
    search_result = soup.find_all(config[0],attrs={'class':config[1]})
    if len(search_result)>0:
        ret = True
    return ret
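
The check can be tried on small hand-written pages. The sketch below is an assumption-laden illustration: `has_marker` is a hypothetical helper that mirrors the logic of `is_valid_html`, and both HTML strings are made up rather than taken from real Google or Stack Overflow responses.

```python
from bs4 import BeautifulSoup


def has_marker(html: str, tag: str, css_class: str) -> bool:
    """Return True if at least one `tag` with class `css_class` is present."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, attrs={"class": css_class})) > 0


# Made-up pages: one that looks like a result, one that looks like a consent wall.
results_page = '<div class="answer">Use pd instead of pandas.</div>'
consent_page = '<h1>Before you continue to Google</h1>'

print(has_marker(results_page, "div", "answer"))
print(has_marker(consent_page, "div", "answer"))
```

Because the check depends on class names chosen by the site, it may need updating whenever the page markup changes.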

## 2.5. Parse HTML

- `parse_html`: 
    - html: str, 
    - config: [str, str]
- `get_search_urls`: 
    - soup:bs4.BeautifulSoup
- `get_user`: 
    - answer:bs4.element.Tag, 
    - detail: str
- `get_answers`
    - url:str
- `most_rated_answer`
    - df:pd.DataFrame
    - sort_by:str

In [11]:
def parse_html(html: str,config: [str, str]):
    '''
    It takes an HTML string and a config list as input and returns a BeautifulSoup object
    
    Parameters:
    html : str
        the html of the page
    config : [str, str]
        tag name and class used by is_valid_html to check that the page is valid
    
    Returns:
        A BeautifulSoup object with the parsed HTML
    
    Raises:
        CookieError if the page does not contain the expected tag and class
    
    '''
    soup = BeautifulSoup(html, 'html.parser')
    if not(is_valid_html(soup,config)):
        raise CookieError("Check the headers of the request; it seems Google has updated some required fields or their values")
    return soup

def get_search_urls(soup:bs4.BeautifulSoup): 
    '''
    It takes a BeautifulSoup object and returns a list of urls that are the search results
    
    Parameters:
    soup : bs4.BeautifulSoup
        the soup object of the page
    
    Returns:
        A list of urls
    
    '''
    #find all div that have MjjYud == search results class in google search 
    results = soup.find_all('div', attrs={'class':'MjjYud'})

    urls = []
    for result in results:
        #find the first a that contains the href with the url of the solution
        res = result.find("a")
        #Skip results without a link, then check if the url is in the stackoverflow site
        if res is not None and "https://stackoverflow.com" in res.attrs.get("href", ""):
            urls.append(res.attrs["href"])
        
    return urls
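
A small offline sketch of the same filtering, run on made-up HTML that imitates the structure assumed above (result blocks in `div` elements with class `MjjYud`). The markup and URLs are invented for the example; real Google pages may use different class names at any time.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure the notebook expects.
html = """
<div class="MjjYud"><a href="https://stackoverflow.com/questions/1/example">A</a></div>
<div class="MjjYud"><a href="https://example.com/other">B</a></div>
<div class="MjjYud"><span>no link here</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
urls = []
for block in soup.find_all("div", attrs={"class": "MjjYud"}):
    link = block.find("a")
    # find returns None when the block has no link at all.
    if link is None or "href" not in link.attrs:
        continue
    if link["href"].startswith("https://stackoverflow.com"):
        urls.append(link["href"])

print(urls)
```

The third block shows why the `None` check matters: without it, a result without a link would raise an `AttributeError`.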

def get_user(answer:bs4.element.Tag, detail:str):
    '''
    Some answers show a blank user, and some posts seem to lack the itemprop attribute
    (I cannot find such a post now, but the fallback is kept just in case).
    It looks for the user's name or rank in the answer, and returns it
    
    Parameters
    answer : bs4.element.Tag
        the answer to be parsed
    detail : str
        the detail you want to extract from the user: "name" or "rank".
    
    Returns
        The name or rank of the user who posted the answer, or None if it is not found.
    
    '''
    if detail == "name":
        search = 'a'
        attrs = {}
    elif detail == "rank":
        search = 'span'
        attrs = {'class':'reputation-score'}
    else:
        raise ValueError('detail must be "name" or "rank"')
    user = []
    findings = answer.find_all('div', attrs={'class':'user-details','itemprop':'author'})
    for pos in findings:
        aux = pos.find(search, attrs=attrs)
        #find returns None (not -1) when nothing matches
        if aux is not None and aux.contents:
            user.append(aux.contents[0])
    if len(user) == 0:
        #fallback for posts without the itemprop attribute
        findings = answer.find_all('div', attrs={'class':'user-details'})
        for pos in findings:
            aux = pos.find(search,attrs=attrs)
            if aux is not None and aux.contents:
                user.append(aux.contents[0])
    return user[0] if user else None

def get_answers(url:str):
    '''
    It takes a url, gets the html, parses it, finds all the answers, and returns a dataframe with the
    answers
    
    Parameters:
    url : str
        The url of the question you want to get the answers for.
    
    Returns:
        A dataframe with the following columns:
        votes: number of votes for the answer
        time: time of the answer
        user_name: name of the user who answered
        user_rank: rank of the user who answered
        answer_html: html of the answer
    
    '''
    res = get_html_search(url)
    soup_r = parse_html(res.text,STACKOVERFLOW_COOKIE_CHECK)
    answers = soup_r.find_all('div', attrs={'class':'answer'})
    df = pd.DataFrame({},columns = ["votes","time","answer_html","user_name","user_rank"])

    for answer in answers:
        votes = answer.find('div', attrs={'class':'js-vote-count'})
        exp = answer.find('div', attrs={'class':'s-prose'})
        time = answer.find('div', attrs={'class':'user-action-time'}).find('span', attrs={'class':'relativetime'}).attrs["title"]
        usr_name = get_user(answer,"name")
        usr_rank = get_user(answer,"rank")
        ans = {'votes': [int(votes.attrs["data-value"])],
               'time': [time],
               'user_name':[usr_name],
               'user_rank':[usr_rank],
               'answer_html': [exp.prettify()]
              }
        df = pd.concat([df, pd.DataFrame(ans)])
    return df.reset_index(drop = True)

def most_rated_answer(df:pd.DataFrame,sort_by:str):
    '''The function takes in a dataframe and a string as input and returns the most rated answer in the
    dataframe
    
    Parameters:
    df : pd.DataFrame
        The dataframe that contains the answers
    sort_by : str
        "ans" sorts the answers by votes; any other value (e.g. "ques") keeps the original order
    
    Returns:
        A one-row dataframe with the most rated answer
    
    '''
    if sort_by == "ans":
        df = df.sort_values("votes",ascending=False).reset_index(drop = True)
    
    return df.iloc[[0]]
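
The selection step can be sketched on a tiny, made-up table. The column names follow the ones produced by `get_answers`; the values are invented for the example.

```python
import pandas as pd

# Made-up answers, using column names returned by get_answers.
df = pd.DataFrame({
    "votes": [3, 12, 7],
    "user_name": ["ana", "ben", "cy"],
    "answer_html": ["<p>a</p>", "<p>b</p>", "<p>c</p>"],
})

# Sort by votes, renumber the rows, and keep the first one as a DataFrame.
best = df.sort_values("votes", ascending=False).reset_index(drop=True).iloc[[0]]
print(best)
print(best["answer_html"][0])
```

Using `iloc[[0]]` with a list keeps the result as a one-row DataFrame rather than a Series, and `reset_index(drop=True)` is what makes the later lookup by label `0` valid.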



## 2.6. Open URL
- `open_url_web`: url as str

In [12]:
def open_url_web(url:str):
    """
    Open a URL in a Chrome window. 
    Requires the selenium module and the Chrome web driver. 
    The options keep the browser alive after the function returns.
    
    Parameters:
    url (str): URL of the page to be opened 
    """
    try:
        options = Options()
        options.add_experimental_option("detach", True)
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        driver.service.stop()
    except Exception as ex:
        print("Consider installing selenium, or replace the way of getting the driver with: driver = webdriver.Chrome(ChromeDriverManager().install())")
        raise
# get Stack Overflow Stunning help


## 2.7. Get Help

- `get_SOS_help`: command parameters

In [13]:
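def get_SOS_help(command):
    """
    Run a command and, if it fails, print the most voted Stack Overflow answer
    for the error and open the question in the browser.

    Parameters:
    command (str): Python code to be executed
    """
    try:
        exec(command)
        print("Works as intended")
    except Exception as ex:
        error_str = f'{type(ex).__name__}: {ex}'
        r = get_html_search(error_encode_url(error_str))
        soup = parse_html(r.text,GOOGLE_COOKIE_CHECK)
        urls = get_search_urls(soup)
        if not urls:
            # Nothing to show if the search returned no Stack Overflow link
            print("No Stack Overflow result found for: " + error_str)
            return
        df = get_answers(urls[0])

        ans = most_rated_answer(df,"ans")
        print_help(ans['answer_html'][0],urls[0],True)
        open_url_web(urls[0])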
        

In [15]:
command = 'pandas.random.uniform(-1, 1, siz=100)'
get_SOS_help(command)

FIRST STACKOVERFLOW RESPONSE IN: https://stackoverflow.com/questions/31721996/is-pandas-not-importing-nameerror-global-name-pandas-is-not-defined

You have imported it as
	import pandas as pd

and calling
	#pandas
df = pandas.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio'])

You could either change
	import pandas as pd
to
	import pandas
or
	df = pandas.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio'])
to
	df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio'])
.

edit:

error is due to missing
	)
	save = gather.replace(' ','').replace(')','').replace('(','').replace('/',''+('.csv'))
