# Choose a Data Set

You can choose to analyze any data that you would like! Remember, you need 1000 rows of non-null data in order to get 5 points for the "Data" criteria of my [rubric](https://docs.google.com/document/d/1s3wllcF3LLnytxwD8mZ-BCypXKnfaahnizWGNojT-B4/edit?usp=sharing). Consider looking at [Kaggle](https://www.kaggle.com/datasets) or [free APIs](https://free-apis.github.io/#/browse) for datasets of this size. Alternatively, you can scrape the web to make your own dataset! :D

Once you have chosen your dataset, please read your data into a dataframe and call `.info()` below. If you don't call `info` I will give you 0 points for the first criteria described on the [rubric](https://docs.google.com/document/d/1s3wllcF3LLnytxwD8mZ-BCypXKnfaahnizWGNojT-B4/edit?usp=sharing).

In [12]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
import re
from collections import defaultdict

def fetch_links(url):
    """
    Fetch all article and sub-index links from a Wikipedia page.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch page: {url}")
        return []
    
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    
    # Extract all relevant links from the page
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        if href.startswith('/wiki/') and ':' not in href:  # Ignore special pages like "Category:"
            full_url = urljoin("https://en.wikipedia.org", href)
            links.append(full_url)
    
    return list(set(links))  # Remove duplicates

def scrape_sources(url):
    """
    Scrape the sources (references) from a Wikipedia article.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch article: {url}")
        return []
    
    soup = BeautifulSoup(response.text, 'html.parser')
    references = soup.find_all('ol', class_='references')
    sources = []
    
    for ref_list in references:
        for ref in ref_list.find_all('li'):
            link = ref.find('a', href=True)
            if link and link['href'].startswith('http'):  # Only capture external links
                sources.append(link['href'])
    
    return sources

def main():
    science_index_url = "https://en.wikipedia.org/wiki/Category:Indexes_of_science_articles"
    page_limit = 100  # Adjusted to limit the total number of pages processed
    pages_per_subcategory = page_limit // 5  # Assuming even distribution over 5 subcategories
    
    print("Fetching main index...")
    all_links = fetch_links(science_index_url)
    
    # Collect a subset of pages for even distribution
    article_links = set()
    for link in all_links[:5]:  # Limit to the first 5 subcategories
        print(f"Fetching links from: {link}")
        sub_links = fetch_links(link)
        article_links.update(sub_links[:pages_per_subcategory])
    
    print(f"Total articles to scrape: {len(article_links)}")
    
    # Scrape sources and build the DataFrame
    data = defaultdict(list)
    for idx, page in enumerate(article_links):
        print(f"Scraping article {idx + 1}/{len(article_links)}: {page}")
        sources = scrape_sources(page)
        print(f"Page: {page} - Sources Found: {len(sources)}")
        data['Wikipedia Page'].append(page)
        for i, source in enumerate(sources):
            data[f"Source {i + 1}"].append(source)
    
    # Normalize the DataFrame so each column has equal length
    max_columns = max(len(values) for values in data.values())
    for column in data.keys():
        while len(data[column]) < max_columns:
            data[column].append(None)
    
    # Convert to DataFrame
    df = pd.DataFrame(data)
    
    # Print the full DataFrame
    pd.set_option('display.max_columns', None)  # Show all columns
    pd.set_option('display.max_rows', None)  # Show all rows
    print(df)
    
    # Analyze total sources
    all_sources = df.iloc[:, 1:].values.flatten()  # Exclude the first column (Wikipedia Page)
    all_sources = [source for source in all_sources if source and source.startswith('http')]  # Filter only external links
    source_counts = pd.Series(all_sources).value_counts()
    
    print("\nTop Referenced Sources:")
    print(source_counts.head(10))  # Display the top 10 sources by frequency

if __name__ == "__main__":
    main()

Fetching main index...
Fetching links from: https://en.wikipedia.org/wiki/Index_of_biochemistry_articles
Fetching links from: https://en.wikipedia.org/wiki/Index_of_branches_of_science
Fetching links from: https://en.wikipedia.org/wiki/Main_Page
Fetching links from: https://en.wikipedia.org/wiki/Index_of_chemistry_articles
Fetching links from: https://en.wikipedia.org/wiki/Index_of_genetics_articles
Total articles to scrape: 91
Scraping article 1/91: https://en.wikipedia.org/wiki/DNA_sequence
Page: https://en.wikipedia.org/wiki/DNA_sequence - Sources Found: 0
Scraping article 2/91: https://en.wikipedia.org/wiki/Chloride
Page: https://en.wikipedia.org/wiki/Chloride - Sources Found: 0
Scraping article 3/91: https://en.wikipedia.org/wiki/Codex_Monacensis_(X_033)
Page: https://en.wikipedia.org/wiki/Codex_Monacensis_(X_033) - Sources Found: 0
Scraping article 4/91: https://en.wikipedia.org/wiki/Biometrics
Page: https://en.wikipedia.org/wiki/Biometrics - Sources Found: 0
Scraping article 5/9

Page: https://en.wikipedia.org/wiki/Aceology - Sources Found: 0
Scraping article 57/91: https://en.wikipedia.org/wiki/Timeline_of_the_Sudanese_civil_war_(2024)
Page: https://en.wikipedia.org/wiki/Timeline_of_the_Sudanese_civil_war_(2024) - Sources Found: 0
Scraping article 58/91: https://en.wikipedia.org/wiki/Pathology
Page: https://en.wikipedia.org/wiki/Pathology - Sources Found: 0
Scraping article 59/91: https://en.wikipedia.org/wiki/Chemistry
Page: https://en.wikipedia.org/wiki/Chemistry - Sources Found: 0
Scraping article 60/91: https://en.wikipedia.org/wiki/Nature
Page: https://en.wikipedia.org/wiki/Nature - Sources Found: 0
Scraping article 61/91: https://en.wikipedia.org/wiki/You_Belong_with_Me
Page: https://en.wikipedia.org/wiki/You_Belong_with_Me - Sources Found: 0
Scraping article 62/91: https://en.wikipedia.org/wiki/Ectopic_expression
Page: https://en.wikipedia.org/wiki/Ectopic_expression - Sources Found: 0
Scraping article 63/91: https://en.wikipedia.org/wiki/Negative_contr

# My Question

### Which sources are most frequently cited across Wikipedia articles in a specific domain, and what does this tell us about the reliability and diversity of the information on Wikipedia?

# My Analysis


Full DataFrame:


NameError: name 'df' is not defined

# My Answer

### Write your answer here.