# Choose a Data Set

You can choose to analyze any data that you would like! Remember, you need 1000 rows of non-null data in order to get 5 points for the "Data" criteria of my [rubric](https://docs.google.com/document/d/1s3wllcF3LLnytxwD8mZ-BCypXKnfaahnizWGNojT-B4/edit?usp=sharing). Consider looking at [Kaggle](https://www.kaggle.com/datasets) or [free APIs](https://free-apis.github.io/#/browse) for datasets of this size. Alternatively, you can scrape the web to make your own dataset! :D

Once you have chosen your dataset, please read your data into a dataframe and call `.info()` below. If you don't call `info` I will give you 0 points for the first criteria described on the [rubric](https://docs.google.com/document/d/1s3wllcF3LLnytxwD8mZ-BCypXKnfaahnizWGNojT-B4/edit?usp=sharing).

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
from collections import defaultdict
import re

def fetch_links(url):
    """
    Fetch all article links from a Wikipedia page.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch page: {url}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    links = []

    # Extract all relevant links from the page
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        if href.startswith('/wiki/') and ':' not in href:  # Ignore special pages like "Category:"
            full_url = urljoin("https://en.wikipedia.org", href)
            links.append(full_url)

    return list(set(links))  # Remove duplicates

def scrape_sources(url):
    """
    Scrape the sources (references) from a Wikipedia article.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch article: {url}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    references = soup.find_all('ol', class_='references')
    source_pages = []

    for ref_list in references:
        for ref in ref_list.find_all('li'):
            link = ref.find('a', href=True)
            if link and 'http' in link['href']:  # Capture external links
                # Extract domain or title from the link
                match = re.search(r'https?://([^/]+)', link['href'])
                if match:
                    source_pages.append(match.group(1))

    return source_pages

def main():
    science_index_url = "https://en.wikipedia.org/wiki/Category:Indexes_of_science_articles"

    print("Fetching main index...")
    index_links = fetch_links(science_index_url)

    # Prepare data collection
    data = []

    print("Processing indexes...")
    for index_url in index_links:
        print(f"Processing index: {index_url}")
        sub_links = fetch_links(index_url)

        # Group links by their starting letter
        grouped_links = defaultdict(list)
        for link in sub_links:
            page_name = link.split('/wiki/')[-1]
            first_letter = page_name[0].upper() if page_name[0].isalpha() else '#'
            grouped_links[first_letter].append(link)

        # Select one page per letter
        for letter, pages in grouped_links.items():
            if pages:  # Take the first page for each letter
                selected_page = pages[0]
                print(f"Scraping sources from: {selected_page}")
                sources = scrape_sources(selected_page)

                # Create a new row with the page and its sources
                row = {"Wikipedia Page": selected_page}
                for i, source in enumerate(sources):
                    row[f"Source {i + 1}"] = source

                data.append(row)


    # Normalize the DataFrame to handle uneven rows
    max_length = max(len(column) for column in data.values())  # data must be a defaultdict
    for key in data.keys():
        while len(data[key]) < max_length:
            data[key].append(None)  # Ensure all columns are the same length

    # Create and display the DataFrame
    df = pd.DataFrame(data)

        # Fill empty cells with an appropriate value, such as "No Source"
    df.fillna("No Source", inplace=True)

    pd.set_option('display.max_columns', None)  # Show all columns
    pd.set_option('display.max_rows', None)  # Show all rows
    print(df)

if __name__ == "__main__":
    main()

Fetching main index...
Processing indexes...
Processing index: https://en.wikipedia.org/wiki/Index_of_psychology_articles
Scraping sources from: https://en.wikipedia.org/wiki/Collective_consciousness
Scraping sources from: https://en.wikipedia.org/wiki/Problem_solving
Scraping sources from: https://en.wikipedia.org/wiki/Inhibited_male_orgasm
Scraping sources from: https://en.wikipedia.org/wiki/Valence_(psychology)
Scraping sources from: https://en.wikipedia.org/wiki/Bicameral_mentality
Scraping sources from: https://en.wikipedia.org/wiki/Kinesics
Scraping sources from: https://en.wikipedia.org/wiki/Research_methods
Scraping sources from: https://en.wikipedia.org/wiki/AIDS_dementia_complex
Scraping sources from: https://en.wikipedia.org/wiki/Libido
Scraping sources from: https://en.wikipedia.org/wiki/Scholastic_Aptitude_Test
Scraping sources from: https://en.wikipedia.org/wiki/Epiphany_(feeling)
Scraping sources from: https://en.wikipedia.org/wiki/Frustration
Scraping sources from: http

Scraping sources from: https://en.wikipedia.org/wiki/World_AIDS_Day
Scraping sources from: https://en.wikipedia.org/wiki/Antenatal
Scraping sources from: https://en.wikipedia.org/wiki/NICHD
Scraping sources from: https://en.wikipedia.org/wiki/Tuberculin_skin_test
Scraping sources from: https://en.wikipedia.org/wiki/Erythema
Scraping sources from: https://en.wikipedia.org/wiki/List_of_HIV-positive_people
Scraping sources from: https://en.wikipedia.org/wiki/Organelle
Scraping sources from: https://en.wikipedia.org/wiki/Kaposi%27s_sarcoma-associated_herpesvirus
Scraping sources from: https://en.wikipedia.org/wiki/DNA
Scraping sources from: https://en.wikipedia.org/wiki/Gp120
Scraping sources from: https://en.wikipedia.org/wiki/Retina
Scraping sources from: https://en.wikipedia.org/wiki/Vaginal_candidiasis
Scraping sources from: https://en.wikipedia.org/wiki/Fat_redistribution
Scraping sources from: https://en.wikipedia.org/wiki/United_States_National_Library_of_Medicine
Scraping sources f

Scraping sources from: https://en.wikipedia.org/wiki/Zvonnitsa
Scraping sources from: https://en.wikipedia.org/wiki/University_of_London_Society_of_Change_Ringers
Processing index: https://en.wikipedia.org/wiki/Index_of_physics_articles
Scraping sources from: https://en.wikipedia.org/wiki/List_of_physics_journals
Scraping sources from: https://en.wikipedia.org/wiki/Index_of_physics_articles_(B)
Scraping sources from: https://en.wikipedia.org/wiki/Astrophysics
Scraping sources from: https://en.wikipedia.org/wiki/Quantum_information_science
Scraping sources from: https://en.wikipedia.org/wiki/Medical_physics
Scraping sources from: https://en.wikipedia.org/wiki/Non-equilibrium_thermodynamics
Scraping sources from: https://en.wikipedia.org/wiki/Glossary_of_classical_physics
Scraping sources from: https://en.wikipedia.org/wiki/Biophysics
Scraping sources from: https://en.wikipedia.org/wiki/Relativistic_mechanics
Scraping sources from: https://en.wikipedia.org/wiki/Electromagnetism
Scraping 

Scraping sources from: https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography
Scraping sources from: https://en.wikipedia.org/wiki/Human_ecology
Scraping sources from: https://en.wikipedia.org/wiki/Political_ecology
Scraping sources from: https://en.wikipedia.org/wiki/Geobiology
Scraping sources from: https://en.wikipedia.org/wiki/Internet_GIS
Scraping sources from: https://en.wikipedia.org/wiki/Rank-size_distribution
Scraping sources from: https://en.wikipedia.org/wiki/List_of_geoscience_organizations
Scraping sources from: https://en.wikipedia.org/wiki/Anatopism
Scraping sources from: https://en.wikipedia.org/wiki/Earth_science
Scraping sources from: https://en.wikipedia.org/wiki/Direction_(geometry,_geography)
Scraping sources from: https://en.wikipedia.org/wiki/Survey_(human_research)
Scraping sources from: https://en.wikipedia.org/wiki/Meteorology
Scraping sources from: https://en.wikipedia.org/wiki/Children%27s_geographies
Scraping sources from: https://en.wikipedia.org/

Scraping sources from: https://en.wikipedia.org/wiki/Yasuni_National_Park
Scraping sources from: https://en.wikipedia.org/wiki/Functional_agrobiodiversity
Scraping sources from: https://en.wikipedia.org/wiki/University_of_California,_Riverside_Herbarium
Scraping sources from: https://en.wikipedia.org/wiki/Langtang_National_Park
Scraping sources from: https://en.wikipedia.org/wiki/Measurement_of_biodiversity
Scraping sources from: https://en.wikipedia.org/wiki/Data_Deficient
Scraping sources from: https://en.wikipedia.org/wiki/Occupancy%E2%80%93abundance_relationship
Scraping sources from: https://en.wikipedia.org/wiki/Key_Biodiversity_Areas
Scraping sources from: https://en.wikipedia.org/wiki/Park_Grass_Experiment
Scraping sources from: https://en.wikipedia.org/wiki/The_Economics_of_Ecosystems_and_Biodiversity
Scraping sources from: https://en.wikipedia.org/wiki/Vulnerable_species
Scraping sources from: https://en.wikipedia.org/wiki/2010_Biodiversity_Indicators_Partnership
Processing i

Scraping sources from: https://en.wikipedia.org/wiki/Yadin_Dudai
Processing index: https://en.wikipedia.org/wiki/Index_of_pesticide_articles
Scraping sources from: https://en.wikipedia.org/wiki/Trifluralin
Scraping sources from: https://en.wikipedia.org/wiki/Fluoroacetamide
Scraping sources from: https://en.wikipedia.org/wiki/Botany
Scraping sources from: https://en.wikipedia.org/wiki/Agent_Purple
Scraping sources from: https://en.wikipedia.org/wiki/Pyrimethanil
Scraping sources from: https://en.wikipedia.org/wiki/Hexachlorophenol
Scraping sources from: https://en.wikipedia.org/wiki/Copper(I)_cyanide
Scraping sources from: https://en.wikipedia.org/wiki/Roman_gardens
Scraping sources from: https://en.wikipedia.org/wiki/Lime_sulfur
Scraping sources from: https://en.wikipedia.org/wiki/Insecticide
Scraping sources from: https://en.wikipedia.org/wiki/Neonicotinoid
Scraping sources from: https://en.wikipedia.org/wiki/Garden
Scraping sources from: https://en.wikipedia.org/wiki/Electropositive

Scraping sources from: https://en.wikipedia.org/wiki/Gravitational_binding_energy
Scraping sources from: https://en.wikipedia.org/wiki/Laws_of_thermodynamics
Scraping sources from: https://en.wikipedia.org/wiki/Units_of_energy
Scraping sources from: https://en.wikipedia.org/wiki/Rotational_energy
Scraping sources from: https://en.wikipedia.org/wiki/Josephson_energy
Scraping sources from: https://en.wikipedia.org/wiki/Vacuum_energy
Scraping sources from: https://en.wikipedia.org/wiki/Kinetic_energy
Processing index: https://en.wikipedia.org/wiki/Index_of_topics_related_to_life_extension
Scraping sources from: https://en.wikipedia.org/wiki/Eugenics
Scraping sources from: https://en.wikipedia.org/wiki/Megadose
Scraping sources from: https://en.wikipedia.org/wiki/Nutrition
Scraping sources from: https://en.wikipedia.org/wiki/Antagonistic_Pleiotropy
Scraping sources from: https://en.wikipedia.org/wiki/Dietary_supplement
Scraping sources from: https://en.wikipedia.org/wiki/Technological_sing

Scraping sources from: https://en.wikipedia.org/wiki/Pyruvate
Scraping sources from: https://en.wikipedia.org/wiki/Outline_of_biochemistry
Scraping sources from: https://en.wikipedia.org/wiki/Transport_protein
Scraping sources from: https://en.wikipedia.org/wiki/Globin
Scraping sources from: https://en.wikipedia.org/wiki/Action_potential
Scraping sources from: https://en.wikipedia.org/wiki/Van_der_Waals_force
Scraping sources from: https://en.wikipedia.org/wiki/5%27_end
Scraping sources from: https://en.wikipedia.org/wiki/Uric_acid
Scraping sources from: https://en.wikipedia.org/wiki/Water
Scraping sources from: https://en.wikipedia.org/wiki/Junk_DNA
Scraping sources from: https://en.wikipedia.org/wiki/Y_chromosome
Processing index: https://en.wikipedia.org/wiki/Index_of_philosophy_of_science_articles
Scraping sources from: https://en.wikipedia.org/wiki/Galileo_Galilei
Scraping sources from: https://en.wikipedia.org/wiki/Scientific_Communism
Scraping sources from: https://en.wikipedia.

Scraping sources from: https://en.wikipedia.org/wiki/Urban_renewal
Scraping sources from: https://en.wikipedia.org/wiki/Economics
Scraping sources from: https://en.wikipedia.org/wiki/Index_of_law_articles
Scraping sources from: https://en.wikipedia.org/wiki/Agency_(sociology)
Scraping sources from: https://en.wikipedia.org/wiki/Modernity
Scraping sources from: https://en.wikipedia.org/wiki/Group_action_(sociology)
Scraping sources from: https://en.wikipedia.org/wiki/Trade_union
Scraping sources from: https://en.wikipedia.org/wiki/Hybridity
Scraping sources from: https://en.wikipedia.org/wiki/Kinship
Scraping sources from: https://en.wikipedia.org/wiki/Diffusion
Scraping sources from: https://en.wikipedia.org/wiki/Neo-locality
Scraping sources from: https://en.wikipedia.org/wiki/Violence
Scraping sources from: https://en.wikipedia.org/wiki/Balance_of_power_(federalism)
Scraping sources from: https://en.wikipedia.org/wiki/Jingoism
Scraping sources from: https://en.wikipedia.org/wiki/Quan

AttributeError: 'list' object has no attribute 'values'

# My Question

### Which sources are most frequently cited across Wikipedia articles in a specific domain, and what does this tell us about the reliability and diversity of the information on Wikipedia?

# My Analysis


Full DataFrame:


NameError: name 'df' is not defined

# My Answer

### Write your answer here.