## Libraries 

In [14]:
import pandas as pd
from tqdm import tqdm

## Part 1: Analyze the Fake News Dataset


### (1) - Import dataset

I have turned the previous assignment into a Python script so that the cleaned dataset can be imported easily. However, I have had problems downloading the 995,000_rows.csv subset of the Fake News Corpus, so for now I will unfortunately have to proceed with a cleaned version of news_sample.csv until the download issue is resolved.

In [13]:
from graded_exercise_1 import df

### (2) - Dataset analysis

#### (A)

After inspecting the description of each article type in the Fake News Corpus on GitHub (https://github.com/several27/FakeNewsCorpus), I would argue that many of the article types are to some degree unreliable and can therefore either be classified as fake or be omitted.

In my classification I try to draw the line so that an article type is either clearly fake (at least to some degree) or it is not, since these are the only two categories we are interested in. If I cannot determine whether an article type is clearly fake, I take that as a reason to omit it.

This way of distinguishing is of course prone to subjective errors, so others might classify the corpus differently.


#### (B)

OMITTED:

First of all, I have omitted three types of articles: proceed with caution, state news and hate news.

My reasoning is that the GitHub descriptions of state news and hate news do not clearly state whether the content is fake or reliable (this is also an attempt to set aside my own prejudices and subjective views). However, I also see an argument for classifying these types.

As for the proceed with caution type, I have chosen to omit it because the description states that these articles may be reliable but have yet to be verified. An article type in that state fits neither classification until it is verified, and might therefore as well be omitted.


RELIABLE:

I have classified three article types as reliable: credible, political and clickbait.

These are the only three article types whose descriptions to some degree justify treating the content as reliable.

Clickbait articles might use exaggerated and/or misleading headlines, images, etc., so I also see an argument for omitting the type or even classifying it as fake. I have chosen to mark it as reliable, since the content is described as generally credible.

For the other two article types, I do not find it necessary to argue further, since their descriptions clearly state the extent of their credibility.


FAKE:

The fake article types are then the remaining ones: fake news, satire, extreme bias, conspiracy theory and junk science.

All of these article types are described as being unreliable to some extent, or as containing unreliable information. It is clear to me that at least these five types should be marked as fake based on the descriptions in the Fake News Corpus.



In [15]:
# This block contains three lists showing my classification of article types as fake, reliable or omitted
fake_news_labels = ["fake", 
                    "satire", 
                    "bias", 
                    "conspiracy",
                    "junksci"
                    ]

reliable_news_labels = ["reliable",
                        "political",
                        "clickbait"
                        ]

omitted_news_labels = ["unreliable",
                       "state",
                       "hate"]

#### (C)

Unfortunately I can only comment on the distribution of the sample, but I will try to generalize some of the aspects that might be relevant for the bigger subset.

In this case, the distribution shows that most of the articles in the sample are labeled as fake rather than reliable, so the dataset is not balanced. This might of course not be the case for another sample of the dataset, and since this is a very small subset compared to the 995,000_rows.csv subset, the distribution there might look vastly different.

Having a balanced distribution is important for training a model. With a distribution like this one (roughly 80/20), the model may become biased towards the majority class, in this case the fake articles. That is, it might perform well at identifying fake news but poorly at recognizing reliable news.

There is also a risk of the model overfitting, making it less generalizable to new data, in particular data from the minority class, in this case the reliable news.
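To make the risk concrete, here is a minimal sketch with hypothetical counts (80 fake and 20 reliable, chosen only to mirror the roughly 80/20 split discussed above; they are not the actual sample counts). A "model" that always predicts the majority class reaches a high accuracy while never recognizing a single reliable article.

```python
# Hypothetical label counts, chosen only to mirror a roughly 80/20 split.
labels = ["fake"] * 80 + ["reliable"] * 20

# A trivial baseline that always predicts the majority class.
predictions = ["fake"] * len(labels)

# Overall accuracy: share of predictions that match the true label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for the minority class: share of reliable articles correctly identified.
reliable_total = sum(y == "reliable" for y in labels)
reliable_hits = sum(
    p == "reliable" and y == "reliable" for p, y in zip(predictions, labels)
)
reliable_recall = reliable_hits / reliable_total

print(f"accuracy: {accuracy:.2f}")
print(f"recall on reliable: {reliable_recall:.2f}")
```

Accuracy alone would make this baseline look acceptable, which is why the per-class figures matter on an imbalanced dataset.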

Since I am unable to use the larger subset for now, I have manually inspected the article counts per type in the GitHub documentation and roughly estimated what the distribution would look like under my classification.

Fake: 928,083 + 146,080 + 1,300,444 + 905,981 + 144,939 = 3,425,527

Reliable: 292,201 + 2,435,471 + 1,920,139 = 4,647,811

Total: 4,647,811 + 3,425,527 = 8,073,338

Which roughly translates to a percentage distribution of 58% reliable news and 42% fake news, given my classification.

That would mean a far more balanced dataset, and training a model on it would therefore not be as prone to problems such as over-/underfitting and bias.
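The estimate above can be double-checked with a few lines of arithmetic. The sketch below only uses the per-type counts quoted above and prints the shares to one decimal place; the variable names are illustrative.

```python
# Article counts per type, copied from the figures quoted above.
fake_counts = [928_083, 146_080, 1_300_444, 905_981, 144_939]
reliable_counts = [292_201, 2_435_471, 1_920_139]

fake_total = sum(fake_counts)
reliable_total = sum(reliable_counts)
total = fake_total + reliable_total

# Percentage share of each category under this classification.
fake_share = 100 * fake_total / total
reliable_share = 100 * reliable_total / total

print(f"fake: {fake_total:,} ({fake_share:.1f}%)")
print(f"reliable: {reliable_total:,} ({reliable_share:.1f}%)")
```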


In [16]:
# This block filters out the omitted article types, categorizes the remaining articles as
# fake or reliable, and finally calculates the distribution of the two categories in the dataset.
# Note: any remaining type that is not in fake_news_labels is labeled 'reliable'.
df_filtered = df[~df['type'].isin(omitted_news_labels)].copy()

df_filtered['category'] = df_filtered['type'].apply(lambda x: 'fake' if x in fake_news_labels else 'reliable')

distribution = df_filtered['category'].value_counts(normalize=True) * 100

distribution

category
fake        81.481481
reliable    18.518519
Name: proportion, dtype: float64
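One detail worth noting about the cell above: the lambda assigns `reliable` to every type that is not in `fake_news_labels`, so missing values or types outside the three lists would also be counted as reliable. A stricter variant, sketched below, returns `None` for anything unlisted so that those rows can be inspected or dropped. The helper name `categorize` is hypothetical, and the label lists are repeated only so the block runs on its own.

```python
fake_news_labels = ["fake", "satire", "bias", "conspiracy", "junksci"]
reliable_news_labels = ["reliable", "political", "clickbait"]


def categorize(article_type):
    """Map a corpus type to 'fake' or 'reliable'; return None for anything else."""
    if article_type in fake_news_labels:
        return "fake"
    if article_type in reliable_news_labels:
        return "reliable"
    # Omitted, unlisted, or missing types fall through here.
    return None


# Small illustrative input, not taken from the dataset.
examples = ["fake", "political", "hate", "unknown", None]
print([categorize(t) for t in examples])
```

In the notebook this could be applied with `df['type'].map(categorize)`, after which rows with a missing category can be dropped before computing the distribution.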

## Part 2: Gathering Links

### (1) - Importing libraries

In [17]:
import requests
from bs4 import BeautifulSoup

### (2) - Retrieving HTML content
After inspecting the string output of contents, I verified that contents does indeed hold the HTML source of the given webpage.

In [32]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text
contents;

### (3) - Extract articles

In [19]:
def extract_articles(url):
    """This function extracts all articles from a given url, 
    and for each article retrieves its corresponding link"""

    response = requests.get(url)
    contents = response.text

    soup = BeautifulSoup(contents, 'html.parser')
    
    articles = soup.find_all('div', attrs={'type': 'article'})

    links = [article.find('a')['href'] for article in articles]

    return links

# Example usage on the Europe section of the BBC
links = extract_articles('https://www.bbc.com/news/world/europe')
print("Number of retrieved article links on the first page of the europe section: ")
print(len(links))


Number of retrieved article links on the first page of the europe section: 
38


### (4) - Scrape multiple pages


In [21]:
def extract_all_articles(url):
    """This function loops through the pages of a section, 
    and for each page collects the corresponding links for all of the articles of the page"""
    all_links = []
    
    response = requests.get(url)
    contents = response.text

    soup = BeautifulSoup(contents, 'html.parser')

    max_pages = soup.find_all('div', class_="ssrcss-3vkeha-StyledButtonContent e1kcrsdk1")[-2]
    max_pages_number = int(max_pages.text.strip())
    

    for page in tqdm(range(1,max_pages_number + 1)):
        page_url = f'{url}?page={page}'
        response = requests.get(page_url)
        contents = response.text
        soup = BeautifulSoup(contents, 'html.parser')
        articles = soup.find_all('div', attrs={'type': 'article'})
        links = [article.find('a')['href'] for article in articles]
        all_links.extend(links)
    return all_links


# Example usage on the Europe section once again
links = extract_all_articles('https://www.bbc.com/news/world/europe')
print("Number of retrieved article links within all the pages of the europe section: ")
print(len(links)) 

100%|██████████| 42/42 [00:06<00:00,  6.11it/s]

Number of retrieved article links within all the pages of the europe section: 
902
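The page count above is read from a button whose class name (`ssrcss-3vkeha-StyledButtonContent e1kcrsdk1`) looks auto-generated and may change over time. As a hedged alternative, the loop could simply stop at the first page that yields no links. In the sketch below, `fetch_links` is a hypothetical callable standing in for "request one page and extract its article links", and `page_limit` is an assumed safety cap; neither is part of the original code. It also assumes that pages past the end of a section return no articles, which would need to be checked against the BBC site.

```python
def collect_paginated(fetch_links, page_limit=100):
    """Collect links page by page until a page comes back empty.

    fetch_links: hypothetical callable taking a page number (starting at 1)
                 and returning a list of links for that page.
    page_limit:  assumed upper bound so the loop always terminates.
    """
    all_links = []
    for page in range(1, page_limit + 1):
        links = fetch_links(page)
        if not links:
            # An empty page is treated as the end of the section.
            break
        all_links.extend(links)
    return all_links


# Offline stand-in for a real section with three pages of results.
fake_site = {1: ["/news/a", "/news/b"], 2: ["/news/c"], 3: ["/news/d"]}
links = collect_paginated(lambda page: fake_site.get(page, []))
print(len(links))
```

Passing the page fetcher in as an argument keeps the stopping logic testable without any network access.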





### (5) - Expand the scope

In [22]:
def extract_all_articles_for_regions(url):
    """This function loops through the list of given regions, and for each region scrapes
    the links of all articles on all of its pages using the former function"""
    all_region_links = []

    regions = ['europe','australia', 'asia', 'africa', 'latin_america', 'middle_east']

    for region in regions:
        region_links = extract_all_articles(f"{url}/{region}")
        all_region_links.extend(region_links)
    
    return all_region_links


# Example usage on the given regions
links = extract_all_articles_for_regions('https://www.bbc.com/news/world')
print("Number of retrieved article links within all of the specified regions: " )
print(len(links))

100%|██████████| 42/42 [00:05<00:00,  7.05it/s]
100%|██████████| 42/42 [00:05<00:00,  7.22it/s]
100%|██████████| 42/42 [00:06<00:00,  6.72it/s]
100%|██████████| 25/25 [00:03<00:00,  6.37it/s]
100%|██████████| 42/42 [00:05<00:00,  7.03it/s]
100%|██████████| 41/41 [00:06<00:00,  6.07it/s]

Number of retrieved article links within all of the specified regions: 
4790





### (6) - Save your results

In [23]:
# This block of code saves all of the links in a .csv file
all_links_df = pd.DataFrame(links, columns=['Article Links'])
all_links_df.to_csv('links.csv', index=False)

## Part 3: Scraping Article Text

### (1) - Article inspection

After inspecting the HTML of a couple of articles from a few different regions using the in-browser devtools, it seems that the headline, author, publication date and text can be identified by the following attributes:

- Headline: 
    - tag = h1
    - class = "ssrcss-fmi64d-StyledHeading e10rt3ze0" _>HEADLINE HERE<_
    
- Author: 
    - tag = div
    - class = "ssrcss-68pt20-Text-TextContributorName e8mq1e96" _>AUTHOR HERE<_

- Date: 
    - tag = time
    - data-testid = "timestamp"

- Text: 
    - tag = p
    - class = "ssrcss-1q0x1qg-Paragraph e1jhz7w10" _>TEXT HERE<_

### (2) - Text scraping function

In [24]:
def scrape_essentials(url):
    """This function scrapes an article for its headline, author, publish-date & text"""
    response = requests.get(url)
    contents = response.text
    soup = BeautifulSoup(contents, 'html.parser')
    
    article_info = {}
    
    header = soup.find('h1', class_="ssrcss-fmi64d-StyledHeading e10rt3ze0")
    article_info["Header"] = header.text if header else "No headline found"

    author = soup.find('div', class_="ssrcss-68pt20-Text-TextContributorName e8mq1e96")
    article_info["Author"] = author.text.replace("By ", "") if author else "No author found"

    date = soup.find('time', attrs={"data-testid": "timestamp"})
    article_info["Date"] = date.text if date else "No date found"

    text = ""

    for text_block in soup.find_all('div', attrs={"data-component": "text-block"}):
        if text_block.find('p'):
            text += text_block.find('p').text + " "
    article_info["Text"] = text.strip() if text else "No text found"

    return article_info

# Example usage on an arbitrary article
scrape_essentials('https://www.bbc.com/news/uk-england-humber-68339496')


{'Header': 'Hull: Amy Johnson Cup for Courage nominations sought',
 'Author': 'Emma Hartley',
 'Date': '24 February',
 'Text': 'Hull City Council is inviting nominations for the 2024 Amy Johnson Cup for Courage.  Nominees must be 17 or under, have been born in Hull and have to have committed a "signal act of courage" in the previous year while living in the city.  The Cup for Courage was first awarded in 1931, arising from the famous local aviator\'s solo flight to Australia. Midge Gillies, Miss Johnson\'s biographer, said: "A part of [Amy Johnson] would always be a Hull girl." The last award-winner was Keleighsha Thorpe in 2012, who was 10 when she saved her grandmother\'s life after their home in Clarendon Street was targeted by arsonists.  She had woken in the night and smelt smoke, then woke her grandmother in time for them both to escape the burning building. Her name was entered in the city\'s roll of honour at the Guildhall as a result. The award originated shortly after Amy Joh

### (3) - Scrape all articles

In [25]:
# This block of code prepares a list of all links to loop through for scraping
links_file_path = '/Users/askeklausen/Desktop/GDS/links.csv'
base_url = 'https://www.bbc.com'

df = pd.read_csv(links_file_path)
df["Full links"] = base_url + df['Article Links']

article_links = df["Full links"].to_list()
print("Number of prepared article links: ")
print(len(article_links))


Number of prepared article links: 
4790
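Two things may be worth guarding against when preparing the links: the same article can appear on several pages or in several regions, and plain string concatenation would produce a broken URL if any scraped `href` were already absolute. Below is a minimal sketch using the standard library's `urljoin`, with made-up example links; whether the scraped file actually contains duplicates or absolute links is not verified here.

```python
from urllib.parse import urljoin

base_url = "https://www.bbc.com"

# Made-up hrefs for illustration: one duplicate and one already-absolute URL.
raw_links = [
    "/news/world-europe-1",
    "/news/world-europe-1",
    "https://www.bbc.com/news/world-asia-2",
]

# urljoin leaves absolute URLs untouched and resolves relative ones against base_url.
full_links = [urljoin(base_url, href) for href in raw_links]

# dict.fromkeys removes duplicates while keeping the first-seen order.
unique_links = list(dict.fromkeys(full_links))

print(unique_links)
```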


In [30]:
# This block of code scrapes all the essentials for every article in the specified regions 
# of the bbc website, and organizes them nicely in a pandas dataframe
import time
scraped_articles = []

for link in tqdm(article_links[:]):
    try:
        article_data = scrape_essentials(link)
        scraped_articles.append(article_data)
        time.sleep(0.1)
    except Exception as e:
        print(f"Error scraping {link}: {e}")

scraped_df = pd.DataFrame(scraped_articles)
scraped_df[:50]

100%|██████████| 4790/4790 [25:20<00:00,  3.15it/s]


|   | Header | Author | Date | Text |
|---|---|---|---|---|
| 0 | Singapore sting: How spies listened in on Germ... | Jessica Parker | 11 hours ago | It's nearly midnight in Singapore. A senior of... |
| 1 | Portugal elections: André Ventura, ex-footbal... | Alison Roberts | 12 hours ago | Friday marks the last day of campaigning for P... |
| 2 | Ireland Referendum: Polls open on family and c... | No author found | 4 hours ago | Polls have opened in two referendums on changi... |
| 3 | Production of Duvel beer hit by cyber-attack | Imran Rahman-Jones | 1 hour ago | Production at four breweries owned by Belgian ... |
| 4 | Sweden formally joins Nato military alliance | Laura Gozzi | 18 hours ago | Sweden has officially become the 32nd member o... |
| 5 | Ukraine war: Eastern residents brace for Russi... | James Waterhouse | 1 day ago | In eastern Ukraine, the tide of this war hasn'... |
| 6 | Valerii Zaluzhnyi: Ukraine to appoint ex-army ... | James Waterhouse, Ukraine correspondent & Joha... | 15 hours ago | The former head of Ukraine's armed forces is t... |
| 7 | Millions affected by German air and rail strikes | Lipika Pelham | 21 hours ago | Millions of travellers in Germany are facing s... |
| 8 | France MPs back compensating victims of anti-g... | No author found | 21 hours ago | France's lower house has passed a bill which w... |
| 9 | Pornhub challenges EU over online content rules | Shiona McCallum & Chris Vallance | 23 hours ago | One of the world's most popular pornography we... |
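The `Date` column above holds relative strings such as "11 hours ago", which lose their meaning once the scrape time is forgotten. One possible remedy, sketched under the assumption that the scrape time is recorded, is to convert them to absolute timestamps. The helper `to_absolute` is hypothetical and only handles the "N minutes/hours/days ago" pattern; other formats, such as "24 February", are returned as `None` rather than guessed.

```python
import re
from datetime import datetime, timedelta

# Matches strings like "11 hours ago" or "1 day ago".
RELATIVE = re.compile(r"^(\d+)\s+(minute|hour|day)s?\s+ago$")


def to_absolute(date_text, scraped_at):
    """Convert a relative date string to a datetime, or return None if unrecognized."""
    match = RELATIVE.match(date_text.strip().lower())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    # timedelta accepts minutes=, hours= and days= keyword arguments.
    return scraped_at - timedelta(**{unit + "s": amount})


# Assumed scrape time, for illustration only.
scraped_at = datetime(2024, 3, 8, 12, 0)
print(to_absolute("11 hours ago", scraped_at))
print(to_absolute("1 day ago", scraped_at))
print(to_absolute("24 February", scraped_at))
```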


### (4) - Data storage

In [31]:
# This block of code saves the essentials of all the scraped articles of the given regions
# nicely in a .csv file
scraped_df.to_csv('scraped_articles_data.csv', index=False)

### (5) - Discussion

Arguments for:

Adding the new data might add more diversity to the dataset by covering a wider variety of article types.
The new data also consists of more recent articles, which might make the model more applicable to current scenarios.
And if the dataset were heavily imbalanced towards the fake side, adding a lot of reliable articles could help mitigate the imbalance.

Arguments against:

The newly scraped data and the older data might differ in quality and consistency, which could cause problems when training the model.
Also, the newly scraped articles are probably not labeled in the same way as the previously scraped articles.

Conclusion:

As of right now I would not include the newly scraped data in the dataset, unless it were needed to even out the distribution of fake and reliable data and thereby avoid problems such as bias and over-/underfitting.