# Webscraping Functions
All of the webscraping takes place in this code block, using Newspaper3k. A scrape is reported as an error when the article does not contain enough words, which screens out stray text such as ads and article previews. It is also reported as an error when the article contains a heavily repeated phrase (as happens when a site blocks scraping and returns the same error message over and over), or when the URL belongs to a known social media platform.
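
The functions in this cell read three module-level settings, `word_count_filter`, `repeated_phrase_filter`, and `social_starts_with`, which are defined in the main pipeline settings rather than here. As a rough illustration only, the sketch below uses made-up values (the real thresholds and URL list are not shown in this section) and a simplified standalone helper, `max_phrase_repeats`, which is a hypothetical name, to show how the repeated-phrase rule flags a page that returns the same blocker message over and over.

```python
from collections import Counter

# Hypothetical values for illustration; the real ones live in the main pipeline settings.
word_count_filter = 100        # assumed minimum number of words
repeated_phrase_filter = 3     # assumed: reject when a phrase appears this many times
social_starts_with = ["https://twitter.com", "https://www.facebook.com"]  # assumed prefixes


def max_phrase_repeats(text):
    """Count how often the most frequent period-delimited phrase occurs."""
    phrases = [p.strip() for p in text.split(".") if p.strip()]
    return max(Counter(phrases).values(), default=0)


blocked_page = "please enable javascript to continue. " * 4
normal_page = "the council met on tuesday. the vote was postponed. a new date is pending."

# A blocker page repeats one phrase, so it reaches the repeated-phrase threshold
print(max_phrase_repeats(blocked_page) >= repeated_phrase_filter)
print(max_phrase_repeats(normal_page) >= repeated_phrase_filter)

# A very short text falls under the word-count threshold
print(len(normal_page.split()) < word_count_filter)

# A social media URL matches one of the known prefixes
print("https://twitter.com/user/status/1".startswith(tuple(social_starts_with)))
```

With these assumed settings, the blocker page is rejected by the repeated-phrase rule, while the short sample text would be rejected by the word-count rule instead.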

In [2]:
import re
from collections import Counter

import numpy as np
from newspaper import Article


def most_repeated_phrase_count(text):
    """
    Split a webscraped text into phrases and count how often each phrase occurs. Returns the count of the most
    repeated phrase in the article, or 0 if the text contains no phrases.
    """
    # Split the text into phrases (e.g., sentences)
    phrases = re.split(r'\.', text)

    # Remove leading and trailing spaces from each phrase
    phrases = [phrase.strip() for phrase in phrases if phrase.strip()]

    # Count the occurrences of each phrase using Counter
    phrase_counts = Counter(phrases)

    if not phrase_counts:
        return 0

    # Find the most common phrase and return only its count
    _, count = phrase_counts.most_common(1)[0]

    return count


def filter_scrape_data(text, url):
    """
    Taking in an article text, this performs word counting and repeated phrase counting on the text. If either word count
    or repeated phrases do not fit within our filter settings, it returns False and prints some error statements.

    If both of our filters pass, we return True to show that it passed this filter.
    """
    # Count whitespace-separated words (also handles newlines and repeated spaces)
    word_count = len(text.split())
    # Compute the repeated-phrase count once and reuse it in the check and the message
    repeated_count = most_repeated_phrase_count(text)
    if word_count < word_count_filter or repeated_count >= repeated_phrase_filter:
        print("Webscraping failed: Word Count or Repeated Phrase")
        print(url)
        print("Word count:", word_count)
        print("Repeated phrases:", repeated_count)
        print("\n")
        return False
    return True


def filter_social(url):
    """
    Taking in a single url, checks if the url is from a known social media website using our social_starts_with social media
    url list (found in main pipeline settings). If the URL is from any of the social media websites, return False to indicate
    the webscraping failed and prints some error statements.

    If the URL is not found in this list, returns True to indicate it passed this filter.
    """
    for prefix in social_starts_with:
        if url.startswith(prefix):
            print("Webscraping failed: Social Media")
            print(url)
            print("\n")
            return False
    return True



def scrapeData(url):
    """
    Our main data scraping function. Taking in an unprocessed URL, performs several filters on it to make sure it is
    properly webscraped and the text is something we want in our model.

    Filters out social media websites using filter_social(), then gathers the text and filters out the article if it
    doesn't follow our word count and repeated phrase settings. 

    Finally, if it passes all filters, the text is processed and returned successfully.

    If the scraping or any other step fails, returns a "PARERROR: ErrorCouldntParse" string indicating a failure to
    gather the data. Each kind of webscraping failure has its own PARERROR string, so you can tell why a given URL
    failed to be webscraped.
    """
    try:
        if not filter_social(url):
            return "PARERROR: SocialError"
        else:
            article = Article(url)
            article.download()
            article.parse()
            
            page_text = (article.text).lower()
            
            if not filter_scrape_data(page_text, url):
                return "PARERROR: WebBlockerError"
        
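            # Trim the ends and collapse double spaces to one space so adjacent words are not joined
            page_text = page_text.strip().replace("  ", " ")
            # Drop empty lines while keeping the line endings of the remaining lines
            page_text = "".join([s for s in page_text.splitlines(True) if s.strip("\r\n")])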
            
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print("Webscraping Error: Couldn't Parse")
        print(url)
        print("\n")
        page_text = "PARERROR: ErrorCouldntParse"
    return page_text
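
Because `scrapeData` signals every failure by returning a string that begins with `PARERROR`, callers can separate usable text from failures with a simple prefix check. The sketch below is one possible way to do that; `is_scrape_error` and `split_results` are hypothetical helper names, not part of the original pipeline, and the sample results are invented so the example runs without network access.

```python
def is_scrape_error(result):
    """Return True if a scrapeData-style result is one of the PARERROR strings."""
    return isinstance(result, str) and result.startswith("PARERROR")


def split_results(results):
    """Split a {url: result} mapping into (texts, failures) dictionaries."""
    texts, failures = {}, {}
    for url, result in results.items():
        if is_scrape_error(result):
            # Keep only the reason after "PARERROR:", e.g. "SocialError"
            failures[url] = result.partition(":")[2].strip()
        else:
            texts[url] = result
    return texts, failures


# Invented results standing in for real scrapeData output
sample = {
    "https://example.com/a": "some scraped article text",
    "https://example.com/b": "PARERROR: WebBlockerError",
    "https://example.com/c": "PARERROR: ErrorCouldntParse",
}
texts, failures = split_results(sample)
print(failures)
```

Since `scrapeData` lowercases the article text before returning it, a successfully scraped article cannot begin with the uppercase `PARERROR` prefix, so the check does not misclassify real text.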