# Graded Exercise 2



## Part 1: Analyze the Fake News Dataset



### 1.1. Import Dataset



In [79]:
import pandas as pd

data = pd.read_csv('995K_subset.csv')



### 1.2. Dataset Analysis



#### 1.2.A. Determine which article types should be omitted

In [80]:
article_type_counts = data['type'].value_counts()

print("Article type counts:")
print(article_type_counts)

Article type counts:
type
reliable                      218564
political                     194518
bias                          133232
fake                          104883
conspiracy                     97314
rumor                          56445
unknown                        43534
unreliable                     35332
clickbait                      27412
junksci                        14040
satire                         13160
hate                            8779
2018-02-10 13:43:39.521661         1
Name: count, dtype: int64


* To decide which article types should be omitted, we print the count for each type. Types that are rare or irrelevant to the goal of the analysis are candidates for omission.
* Each article type may provide valuable information or context about the articles being analyzed. Omitting types could lose information and potentially bias the analysis.
* Each type represents a category of articles. By including all of them, I ensure that the analysis captures the full spectrum of article types and characteristics present in the dataset.
* Even if a type seems less relevant at the moment, it might become important in future analyses.
* I have decided not to omit any types.
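
As a rough illustration of the first bullet, the sketch below lists the labels whose count falls under a chosen threshold. The counts are copied from the output above; the function name `rare_types` and the threshold values are assumptions made for this example, not part of the original analysis. With a low threshold, the only label flagged is the timestamp-like entry with a single row, which looks more like a malformed record than a real article type.

```python
# Counts copied from the value_counts() output above.
type_counts = {
    "reliable": 218564,
    "political": 194518,
    "bias": 133232,
    "fake": 104883,
    "conspiracy": 97314,
    "rumor": 56445,
    "unknown": 43534,
    "unreliable": 35332,
    "clickbait": 27412,
    "junksci": 14040,
    "satire": 13160,
    "hate": 8779,
    "2018-02-10 13:43:39.521661": 1,
}


def rare_types(counts, min_count):
    """Return the labels whose count is below min_count, rarest first."""
    rare = [label for label, n in counts.items() if n < min_count]
    return sorted(rare, key=lambda label: counts[label])


# min_count=10 is an arbitrary illustrative threshold, not a recommendation.
print(rare_types(type_counts, min_count=10))
```

Whether a flagged label is actually dropped remains a judgment call; the function only makes the candidates visible.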

#### 1.2.B. Group the types into 'fake' and 'reliable'

In [81]:
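# Note: the counts printed in 1.2.C were produced with 'humor' in place of 'rumor';
# 'humor' matches no label in the dataset, so those totals exclude the 56,445 'rumor' articles.
fake_types = ['fake', 'conspiracy', 'rumor', 'unreliable', 'clickbait', 'junksci', 'satire']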
reliable_types = ['reliable', 'political', 'bias', 'hate']

FAKE:
* 'Fake': Articles that contain false information and often lack credible sources, evidence, or verification for their statements.
* 'Conspiracy': Articles that propose theories attributing events or phenomena to secret plots by influential people. Many are based on speculation and conjecture.
* 'Rumor': Information that is not verified and may contain claims that are not true.
* 'Unreliable': Sources with a history of publishing inaccurate or misleading content.
* 'Clickbait': Articles with sensational headlines or misleading thumbnails designed to attract clicks and make websites popular.
* 'Junksci': Pseudoscientific claims and incorrect research findings that lack scientific evidence.
* 'Satire': Humor, irony, and exaggeration used to critique real-world events. Can contain fictional narratives or unreal scenarios.

RELIABLE:
* 'Reliable': Articles originating from reputable sources that follow journalistic standards, ethical guidelines, and fact-checking protocols.
* 'Political': Established news organizations that cover political developments, policies, and events through research and analysis.
* 'Bias': Viewpoints or opinions that may reflect a particular ideological stance.
* 'Hate': Typically not tolerated in reputable sources, but articles addressing hate-related topics from a factual and ethical standpoint may still be categorized as 'reliable'.
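
The grouping described in the two lists above can be sketched as a single lookup from article type to binary label. The names `TYPE_TO_LABEL` and `label_for` are hypothetical helpers introduced for this example; the type lists follow the bullet lists above. A type that appears in neither group (such as 'unknown') yields `None`, which makes unassigned types easy to spot.

```python
FAKE_TYPES = ["fake", "conspiracy", "rumor", "unreliable", "clickbait", "junksci", "satire"]
RELIABLE_TYPES = ["reliable", "political", "bias", "hate"]

# Build one lookup table from article type to binary label.
TYPE_TO_LABEL = {t: "fake" for t in FAKE_TYPES}
TYPE_TO_LABEL.update({t: "reliable" for t in RELIABLE_TYPES})


def label_for(article_type):
    """Return 'fake', 'reliable', or None for a type in neither group."""
    return TYPE_TO_LABEL.get(article_type)


for t in ["satire", "political", "unknown"]:
    print(t, "->", label_for(t))
```

With pandas, the same dictionary could be passed to `data['type'].map(TYPE_TO_LABEL)`, which leaves unmapped types as `NaN`.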

#### 1.2.C. Distribution of 'reliable' vs. 'fake' articles

In [82]:
# Function definitions
def calculate_reliable_count(article_counts_df, reliable_types):
    reliable_count = article_counts_df.loc[article_counts_df.index.isin(reliable_types)].sum()
    return reliable_count

def calculate_fake_count(article_counts_df, fake_types):
    fake_count = article_counts_df.loc[article_counts_df.index.isin(fake_types)].sum()
    return fake_count

# Using the functions with your data
reliable_count = calculate_reliable_count(article_type_counts, reliable_types)
fake_count = calculate_fake_count(article_type_counts, fake_types)

# Printing the results
print("Reliable Count:", reliable_count)
print("Fake Count:", fake_count)

Reliable Count: 555093
Fake Count: 292141


In [83]:
def calculate_distribution(reliable_count, fake_count):
    total_articles = reliable_count + fake_count
    reliable_percentage = (reliable_count / total_articles) * 100
    fake_percentage = (fake_count / total_articles) * 100
    return reliable_percentage, fake_percentage

reliable_percentage, fake_percentage = calculate_distribution(reliable_count, fake_count)
print("Percentage of 'reliable' articles:", reliable_percentage)
print("Percentage of 'fake' articles:", fake_percentage)

Percentage of 'reliable' articles: 65.51826295922967
Percentage of 'fake' articles: 34.48173704077032


* The dataset is imbalanced: about 65.5% of the grouped articles are 'reliable' and about 34.5% are 'fake'.
* Balanced datasets are generally desirable for training models and performing analysis, but how much balance matters depends on the specific context and goals of the analysis.
* The implications of the imbalance should be considered carefully. It can be addressed with techniques such as oversampling, undersampling, or algorithms designed to handle imbalanced data.
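
One of the techniques mentioned above, random undersampling, can be sketched in a few lines. This is a minimal illustration on made-up rows, not the procedure used in this exercise; the function name `undersample`, the `label` key, and the fixed seed are all assumptions for the example. It expects at least one row per class.

```python
import random


def undersample(rows, label_key="label", seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 6 'reliable' rows and 3 'fake' rows (made up for illustration).
rows = [{"id": i, "label": "reliable"} for i in range(6)]
rows += [{"id": i, "label": "fake"} for i in range(6, 9)]

balanced = undersample(rows)
counts = {}
for row in balanced:
    counts[row["label"]] = counts.get(row["label"], 0) + 1
print(counts)
```

Undersampling discards data from the majority class, so it trades information for balance. Oversampling or class weighting may be preferable when the minority class is small.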

## Part 2: Gathering Links


### 2.1. Library Installation

In [84]:
import requests
from bs4 import BeautifulSoup

### 2.2. Retrieve HTML Content

In [85]:
response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text
soup = BeautifulSoup(contents, 'html.parser')

### 2.3. Extract Articles

In [86]:
def extract_article_links(contents):

    soup = BeautifulSoup(contents, 'html.parser')
    
    articles = soup.find_all('div', attrs={'type': 'article'})

    article_links = []
    
    for article in articles:
        link = article.find('a')['href']
        
        article_links.append(link)

    return articles, article_links

articles, article_links = extract_article_links(contents)

print(len(articles))
print(len(article_links))

38
38

### 2.4. Scrape Multiple Pages

In [87]:
def all_article_links(soup):
    max_pages = soup.find_all('div', class_="ssrcss-3vkeha-StyledButtonContent e1kcrsdk1")[-2]
    max_page_number = int(max_pages.text.strip())

    all_links = []

    for page_number in range(1, max_page_number + 1):
        page_url = f"https://www.bbc.com/news/world/europe?page={page_number}"
        response = requests.get(page_url)
        contents = response.text
        soup = BeautifulSoup(contents, 'html.parser')
        articles = soup.find_all('div', attrs={'type': 'article'})
        links = [article.find('a')['href'] for article in articles]
        all_links.extend(links)
    return max_page_number, all_links

response = requests.get('https://www.bbc.com/news/world/europe')
contents = response.text
soup = BeautifulSoup(contents, 'html.parser')
max_page_number, links = all_article_links(soup)

print("Number of pages available in the 'Europe' section:", max_page_number)
print("Number of article links from all pages in 'Europe' section:", len(links))

Number of pages available in the 'Europe' section: 42
Number of article links from all pages in 'Europe' section: 904


### 2.5. Expand the Scope

In [88]:
def all_article_links(region):
    region_urls = {
        'Europe': 'https://www.bbc.com/news/world/europe',
        'Australia': 'https://www.bbc.com/news/world/australia',
        'Asia': 'https://www.bbc.com/news/world/asia',
        'Africa': 'https://www.bbc.com/news/world/africa',
        'Latin America': 'https://www.bbc.com/news/world/latin_america',
        'Middle East': 'https://www.bbc.com/news/world/middle_east'
    }

    region_url = region_urls.get(region)

    response = requests.get(region_url)
    contents = response.text
    soup = BeautifulSoup(contents, 'html.parser')
    max_pages = soup.find_all('div', class_="ssrcss-3vkeha-StyledButtonContent e1kcrsdk1")[-2]
    max_page_number = int(max_pages.text.strip())

    all_links = []

    for page_number in range(1, max_page_number + 1):
        page_url = f"{region_url}?page={page_number}"
        response = requests.get(page_url)
        contents = response.text
        soup = BeautifulSoup(contents, 'html.parser')
        articles = soup.find_all('div', attrs={'type': 'article'})
        links = [article.find('a')['href'] for article in articles]
        all_links.extend(links)

    return max_page_number, all_links

regions = ['Europe', 'Australia', 'Asia', 'Africa', 'Latin America', 'Middle East']

total_links = []
total_pages = 0

for region in regions:
    max_page_number, links = all_article_links(region)
    if max_page_number is not None:
        print(f"Number of pages available in the '{region}' section:", max_page_number)
        print(f"Number of article links from all pages in '{region}' section:", len(links))
        total_links.extend(links)
        total_pages += max_page_number

print("Total number of article links across all regions:", len(total_links))
print("Total number of pages across all regions:", total_pages)

Number of pages available in the 'Europe' section: 42
Number of article links from all pages in 'Europe' section: 904
Number of pages available in the 'Australia' section: 42
Number of article links from all pages in 'Australia' section: 827
Number of pages available in the 'Asia' section: 42
Number of article links from all pages in 'Asia' section: 907
Number of pages available in the 'Africa' section: 25
Number of article links from all pages in 'Africa' section: 489
Number of pages available in the 'Latin America' section: 42
Number of article links from all pages in 'Latin America' section: 845
Number of pages available in the 'Middle East' section: 41
Number of article links from all pages in 'Middle East' section: 818
Total number of article links across all regions: 4790
Total number of pages across all regions: 234


### 2.6. Save Results

In [89]:
import csv

csv_file = 'article_links.csv'

with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Article Links'])
    writer.writerows([[link] for link in total_links])

print("Article links saved to:", csv_file)

Article links saved to: article_links.csv


## Part 3: Scraping Article Text



In [90]:
import requests
from bs4 import BeautifulSoup

### 3.1. Article Inspection



### 3.2. Text Scraping Function



In [91]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

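def extract_published_date(timestamp_str):
    # Replace the trailing 'Z' with an explicit UTC offset so the parsed datetime is
    # timezone-aware; otherwise %Z formats as an empty string
    timestamp = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
    return timestamp.strftime('%Y-%m-%d %H:%M:%S %Z')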

def get_article_details(article_url):
    response = requests.get(article_url)
    if response.status_code != 200:
        print(f"Failed to fetch article: {article_url}")
        return None
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract headline (None if the heading element is missing)
    headline_tag = soup.find('h1', class_="ssrcss-fmi64d-StyledHeading e10rt3ze0")
    headline = headline_tag.get_text() if headline_tag else None
    
    # Extract text
    text = ""
    paragraphs = soup.find_all('p', class_='ssrcss-1q0x1qg-Paragraph e1jhz7w10')
    for paragraph in paragraphs:
        text += paragraph.get_text() + "\n"
    
    # Extract published date
    metadata_divs = soup.find_all('div', class_="ssrcss-m5j4pi-MetadataContent eh44mf00")
    for div in metadata_divs:
        timestamp = div.find('time', {'data-testid': 'timestamp'})
        if timestamp:
            published_date = extract_published_date(timestamp['datetime'])
            break
    else:
        published_date = None
    
    # Extract author (None if the contributor element is missing)
    author_tag = soup.find('div', class_="ssrcss-68pt20-Text-TextContributorName e8mq1e96")
    author = author_tag.get_text() if author_tag else None
    
    article_details = {
        'headline': headline,
        'text': text,
        'published_date': published_date,
        'author': author
    }
    
    return article_details

# Example usage:
article_urls = [
    'https://www.bbc.com/news/world-europe-68479836',
    'https://www.bbc.com/news/entertainment-arts-68505050',
    'https://www.bbc.com/news/world-europe-68493215'
]

for url in article_urls:
    article_details = get_article_details(url)
    print(article_details)

{'headline': 'Singapore sting: How spies listened in on German general', 'text': 'It\'s nearly midnight in Singapore.\nA senior officer of the Luftwaffe, the German Air Force, is in his hotel room.\nHe\'s in the region to rub shoulders with defence industry players at Asia\'s largest air show. \nHe has had a long day - but he can\'t go to bed just yet.\nBrigadier General Frank Gräfe has a work call to dial into with his boss - the commander of the German air force.\nIt\'s not a big deal for the head of Air Force Operations. He sounds relaxed on the line as he chats with two colleagues about the "mega" view from his room, and how he\'s just come back from a drink at a nearby hotel where there\'s an incredible swimming pool.\n"Not too shabby," one of them remarks.\nFinally, the boss, Lieutenant General Ingo Gerhartz, dials in - and they begin. Over the next 40 minutes, the group appear to touch upon highly sensitive military issues, including the ongoing debate over whether Germany shou

### 3.3. Scrape All Articles

In [92]:
from urllib.parse import urljoin

df = pd.read_csv('article_links.csv')

base_url = 'https://www.bbc.com'

subset_urls = df['Article Links']

for path in subset_urls:
    # urljoin turns relative paths such as '/news/...' into absolute URLs
    url = urljoin(base_url, path)
    article_details = get_article_details(url)
    if article_details:
        print(article_details)

### 3.4. Data Storage

In [None]:
from urllib.parse import urljoin  # needed for urljoin below

article_details_list = []

for path in subset_urls:
    url = urljoin(base_url, path)
    article_details = get_article_details(url)
    if article_details:
        article_details_list.append(article_details)

article_details_df = pd.DataFrame(article_details_list)

article_details_df.to_csv('scraped_article_details.csv', index=False)

### 3.5. Discussion

* Evaluate whether the scraped article details add value to the existing dataset. If the dataset aims to provide comprehensive information about BBC articles, the scraped details can enhance its completeness and utility.
* Assess the quality of the scraped data. The scraping process must be reliable and accurate, and the extracted information must match the actual content of the articles. High-quality data will contribute positively to the dataset, whereas inaccurate or incomplete data may decrease its reliability.
* Check whether the scraped data aligns with the existing structure and format of the dataset, so that the dataset remains coherent and easy to analyze.
* Conduct statistical analysis to quantify the impact of including the scraped data. Calculate descriptive statistics such as counts, averages, and distributions to understand the characteristics of the new data compared to the existing dataset.
* Consider the needs of potential users or stakeholders of the dataset.
* Whether to include the scraped data depends on a careful assessment of relevance, quality, consistency, statistical analysis, and user needs. If these criteria are met, the inclusion can enrich the dataset and improve its utility for analysis and insights.
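
The descriptive statistics and quality checks mentioned above could start from something like the following sketch. It assumes records shaped like the dictionaries returned by `get_article_details` and a non-empty list; the function name `summarize` and the sample records are made up for illustration and are not scraped data.

```python
from statistics import mean

FIELDS = ["headline", "text", "published_date", "author"]


def summarize(articles):
    """Return the article count, mean word count, and share of missing values per field."""
    word_counts = [len((a.get("text") or "").split()) for a in articles]
    missing = {
        field: sum(1 for a in articles if not a.get(field)) / len(articles)
        for field in FIELDS
    }
    return {
        "count": len(articles),
        "mean_words": mean(word_counts),
        "missing_share": missing,
    }


# Made-up records in the shape returned by get_article_details.
articles = [
    {"headline": "A", "text": "one two three",
     "published_date": "2024-03-01 10:00:00", "author": None},
    {"headline": "B", "text": "four five six seven eight",
     "published_date": None, "author": "Jane Doe"},
]
summary = summarize(articles)
print(summary)
```

A high share of missing authors or dates, or a mean word count far from that of the existing dataset, would be a signal to inspect the scraper before merging the two sources.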