# Web Scraping
<br/>
This project is a Python-based web scraping tool that uses the Trafilatura library to extract text content from a list of websites and save it to a .txt file. As an example, the URLs are collected from the parent website 'https://www.nytimes.com/'. The program processes multiple URLs, extracts the main content of each page, and writes the results to disk.<br/>
This raw data (corpus) can be used for various purposes.

In [1]:
# Importing necessary libraries
# minidom is used for parsing XML files
from xml.dom import minidom
# trafilatura is designed to gather text on the Web, including discovery, extraction,
# and text-processing components.
import trafilatura

# IPython.display is used to display HTML content in Jupyter notebooks
from IPython.display import display, HTML
# This line sets the display style of the output area so that long lines are not wrapped.
display(HTML("<style>div.output_area pre {white-space: pre;}</style>"))

In [2]:
# Fetching a sample web page
url = "https://www.voicesofyouth.org/blog/export-waste-how-it-exacerbates-global-inequalities-and-counterintuitive-fight-climate-action"

# We want to fetch the content of the specified URL.
# Use the fetch_url() function from the trafilatura library to do so.
downloaded = trafilatura.fetch_url(url)
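
`fetch_url()` returns `None` when a download fails, so it can help to check the result before extraction. A minimal sketch of a retry wrapper follows. `fetch_with_retry` and its `retries` and `delay` parameters are hypothetical names introduced for illustration, and the fetcher is passed in as an argument so the helper can be tried without network access.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times and return the first non-empty result.

    Hypothetical helper: `fetch` is any callable that returns the page content
    or None on failure, such as trafilatura.fetch_url.
    """
    for attempt in range(1, retries + 1):
        result = fetch(url)
        if result:
            return result
        if attempt < retries:
            time.sleep(delay)  # wait before trying again
    return None
```

In this notebook it could be called as `downloaded = fetch_with_retry(url, trafilatura.fetch_url)`.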

In [3]:
# Extracting information from the fetched web page
result = trafilatura.extract(
    downloaded,
    # add the desired output format
    output_format='xml',
    url=url,
    #include_comments=True,
    #include_formatting=True,
    #include_links=True,
    #include_images=True,
    #include_tables=True,
    #favor_precision=True,
    #favor_recall=True
)
print(result)

<doc sitename="Voices of Youth" title="Export Waste: How it Exacerbates Global Inequalities and is Counterintuitive to the Fight for Climate Action" author="Naomi Like" date="2022-05-02" source="https://www.voicesofyouth.org/blog/export-waste-how-it-exacerbates-global-inequalities-and-counterintuitive-fight-climate-action" hostname="voicesofyouth.org" categories="" tags="" fingerprint="16cd90ab2ba57e5e">
  <main>
    <p>The buildup of global waste throughout the years is not an enjoyable subject to dwell on. However, given how certain countries and communities around the world bear more of the burden of plastic pollution than others, it is necessary to consider how current global waste production and waste management practices exacerbate inequalities.</p>
    <p>In focusing specifically on the horrors of waste exporting, this piece will argue that, despite recent efforts to curb waste exporting abroad, this does not change the fact that countries have found ways around these new laws, 

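With `output_format='xml'`, the metadata is carried as attributes of the `<doc>` element, as the output above shows. A minimal sketch of reading it with `minidom` (imported earlier) follows. The `sample_xml` string is a shortened stand-in modeled on that output, not the real result, and `read_doc` is a hypothetical helper name.

```python
from xml.dom import minidom

# Shortened stand-in for the XML string returned by trafilatura.extract()
sample_xml = (
    '<doc sitename="Voices of Youth" author="Naomi Like" date="2022-05-02">'
    '<main><p>First paragraph.</p><p>Second paragraph.</p></main>'
    '</doc>'
)


def read_doc(xml_text):
    """Return (metadata dict, list of paragraph texts) from an extracted XML string."""
    dom = minidom.parseString(xml_text)
    doc = dom.documentElement
    # Attributes of <doc> hold the metadata (sitename, author, date, ...)
    metadata = {name: value for name, value in doc.attributes.items()}
    # Collect the direct text of every <p> element (nested inline tags are skipped)
    paragraphs = []
    for p in doc.getElementsByTagName("p"):
        text = "".join(
            node.data for node in p.childNodes if node.nodeType == node.TEXT_NODE
        )
        paragraphs.append(text)
    return metadata, paragraphs


metadata, paragraphs = read_doc(sample_xml)
print(metadata["author"], metadata["date"])
print(len(paragraphs), "paragraphs")
```

In the notebook, `read_doc(result)` would be applied to the string printed in the previous cell.
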
In [4]:
# Focused web crawling
# Use the focused_crawler() function from the trafilatura.spider module
# to perform focused web crawling on the specified homepage.
from trafilatura.spider import focused_crawler

In [5]:
homepage = "https://www.nytimes.com/"

# Visit at most 10 pages per call and stop once 100,000 URLs are known.
to_visit, known_urls = focused_crawler(homepage, max_seen_urls=10, max_known_urls=100_000)
# Resume the crawl from the previous state to discover more links.
to_visit, known_urls = focused_crawler(homepage, max_seen_urls=10, max_known_urls=100_000, todo=to_visit, known_links=known_urls)

# Keep only the URLs under the homepage and sort them alphabetically
found_url = sorted([url for url in known_urls if url.startswith("https://www.nytimes.com/")])
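
The `startswith` filter above keeps only URLs that begin with the exact homepage string. A hedged alternative, sketched with the standard library, compares the parsed hostname instead, so that `http://` links or a missing `www.` prefix are still matched. `same_site` is a hypothetical helper and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse


def same_site(url, homepage):
    """True if url and homepage share a hostname, ignoring a leading 'www.'."""
    def host(u):
        name = urlparse(u).hostname or ""
        return name[4:] if name.startswith("www.") else name

    return host(url) != "" and host(url) == host(homepage)


# Illustrative URLs only, not crawler output
sample_urls = [
    "https://www.nytimes.com/section/climate",
    "http://nytimes.com/section/world",
    "https://example.com/nytimes.com/",
]
kept = sorted(u for u in sample_urls if same_site(u, "https://www.nytimes.com/"))
print(kept)
```

Whether the looser match is wanted depends on the crawl: the original prefix check is stricter and also excludes other subdomains.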

### Demonstration of the collected websites

In [6]:
# Displaying all the URLs found under the parent web address 'https://www.nytimes.com/'
found_url

['https://www.nytimes.com/',
 'https://www.nytimes.com/2022/09/19/crosswords/mini-to-maestro-part-1.html',
 'https://www.nytimes.com/2024/08/25/opinion/christianity-evangelicals-persecution-faith.html',
 'https://www.nytimes.com/2024/08/26/us/new-orleans-appeals-court-trump.html',
 'https://www.nytimes.com/2024/08/27/us/politics/trump-indictment-election-jan-6.html',
 'https://www.nytimes.com/2024/08/28/us/politics/biden-student-loans-supreme-court.html',
 'https://www.nytimes.com/2024/08/28/us/politics/supreme-court-biden-student-loans.html',
 'https://www.nytimes.com/2024/08/29/insider/a-father-found-his-son-but-a-happy-ending-remains-elusive.html',
 'https://www.nytimes.com/2024/08/29/us/politics/biden-courts-immigration-student-loans-title-ix.html',
 'https://www.nytimes.com/2024/08/29/us/politics/supreme-court-death-penalty-cole.html',
 'https://www.nytimes.com/2024/08/30/business/biden-student-loan-debt-plan.html',
 'https://www.nytimes.com/2024/08/30/us/black-enrollment-affirmat

In [7]:
len(found_url)

636

# Filtering the websites
From the websites found, we can select a specific category. For demonstration purposes, the URLs related to "climate" are shown here.

In [None]:
# Keep only the URLs that contain the keyword 'climate'
climate = [url for url in found_url if 'climate' in url]
print(climate)
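
Substring matching works for a single keyword. Many of the article URLs listed above follow a `/YYYY/MM/DD/section/...` path pattern, so a rough sketch can also group them by section. That pattern is an assumption read off the sample output, and `section_of` is a hypothetical helper.

```python
from collections import defaultdict
from urllib.parse import urlparse


def section_of(url):
    """Return the path segment after /YYYY/MM/DD/, or None if the URL does not match."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expect at least: year, month, day, section, article slug
    if len(parts) >= 5 and all(p.isdigit() for p in parts[:3]):
        return parts[3]
    return None


# A few URLs copied from the listing above
urls = [
    "https://www.nytimes.com/",
    "https://www.nytimes.com/2024/08/25/opinion/christianity-evangelicals-persecution-faith.html",
    "https://www.nytimes.com/2024/08/26/us/new-orleans-appeals-court-trump.html",
    "https://www.nytimes.com/2024/08/30/business/biden-student-loan-debt-plan.html",
]

by_section = defaultdict(list)
for url in urls:
    by_section[section_of(url)].append(url)

for section, group in by_section.items():
    print(section, len(group))
```

URLs that do not fit the pattern, such as the homepage, end up under the `None` key.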

# Storing the extracted data in a 'txt' file
<br/>
We store the content extracted from all the websites found in a single 'txt' file.

In [None]:
import trafilatura


# Fetch a page and extract its main text; returns None if the download fails
def fetch_website_content(url):
    downloaded = trafilatura.fetch_url(url)  # Download raw HTML content (None on failure)
    if downloaded:
        extracted = trafilatura.extract(downloaded)  # Extract main text from the page
        return extracted
    else:
        print(f"Failed to download content from {url}")
        return None

# Open the output file once so that every article is kept;
# reopening it with "w" for each URL would overwrite the previous articles.
with open("all_content.txt", "w", encoding="utf-8") as file:
    # Iterate through the list of URLs and fetch the content
    for url in found_url:
        content = fetch_website_content(url)
        if content:
            file.write(f"Content from {url}:\n")
            file.write(content)
            file.write("\n" + "="*10 + "\n")

            
# For demonstration purposes, read the txt file back and print it.
with open("all_content.txt", "r", encoding="utf-8") as file:  # Open file in read mode
    articles = file.read()  # Read entire content of the file
    print(articles)  # Print or process the content
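
The cell above collects everything in a single file. If one file per page is preferred, a possible sketch is shown here. `url_to_filename` and `save_page` are hypothetical helpers, the `pages` directory name is an assumption, and the text passed in would be the string returned by `fetch_website_content(url)`.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Turn a URL into a safe .txt file name based on its host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Replace anything that is not a letter, digit, dot or hyphen with "_"
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "index") + ".txt"


def save_page(url, content, out_dir="pages"):
    """Write the extracted text of one page to its own file and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / url_to_filename(url)
    path.write_text(content, encoding="utf-8")
    return path


print(url_to_filename("https://www.nytimes.com/2024/08/30/business/biden-student-loan-debt-plan.html"))
```

Inside the loop over `found_url`, `save_page(url, content)` would then replace the three `file.write(...)` calls.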
