# XML Sitemap Parsing for Web Data Analysis: TradingView

This notebook extracts and parses the XML sitemaps of `www.tradingview.com`. The goal is to show how raw sitemap data can be turned into a structured, tabular format that is ready for further analysis, including possible machine learning applications.

This exercise is part of Project 2 in the DAV 5400 course, emphasizing the development of web data handling skills, particularly in XML parsing, with Python tools and libraries.


## Setup for Parsing TradingView Sitemap

To begin parsing the sitemaps of `www.tradingview.com`, we first import the libraries used in this project: `pandas` for data manipulation, `BeautifulSoup` from `bs4` for parsing XML content, and `requests` for making HTTP requests. The `SitemapParser` class, defined in the accompanying `sitemap_parser` script, automates fetching and parsing the sitemaps.


In [2]:
# Import necessary libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
from sitemap_parser import SitemapParser  

# Initialize the SitemapParser with the TradingView website
parser = SitemapParser("https://www.tradingview.com")


## Fetching and Parsing the Sitemap

The `SitemapParser` class locates the sitemaps of the specified domain by reading its `robots.txt` file. It fetches the sitemap URLs listed there, parses each sitemap, and organizes the results into pandas DataFrames, one per sitemap, for easy manipulation and analysis.
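The internals of `SitemapParser` are not shown in this notebook. As a minimal sketch of the `robots.txt` step only, the function below pulls the `Sitemap:` entries out of a `robots.txt` body. The function name `sitemap_urls_from_robots` and the `example.com` URLs are hypothetical, and the text is passed in as a string so the example runs without network access.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named in `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own "https:" survives.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content, not TradingView's actual file.
sample = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml  # field name in lowercase
"""

print(sitemap_urls_from_robots(sample))
```

In a live run, the string would come from fetching `/robots.txt` on the target domain, for example with `requests.get(...).text`.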


In [3]:
# Start the parsing process
parser.start_parsing()
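What `start_parsing()` does internally is not shown here. As a rough illustration of the XML step described above, the sketch below reads the `<loc>` entries from a sitemap document and reports whether the document is a sitemap index (a list of other sitemaps) or a URL set. It uses the standard library's `xml.etree.ElementTree` instead of `BeautifulSoup` so that it has no third-party dependencies; the function names and the `example.com` data are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return (root element name, list of <loc> URLs) for a sitemap document."""
    root = ET.fromstring(xml_text)
    kind = local_name(root.tag)  # "sitemapindex" or "urlset"
    locs = []
    for entry in root:  # each <sitemap> or <url> element
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


# Hypothetical sitemap index and URL set.
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-base.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-news.xml</loc></sitemap>
</sitemapindex>"""

urlset_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/ideas/</loc></url>
</urlset>"""

print(parse_sitemap(index_xml))
print(parse_sitemap(urlset_xml))
```

A parser following this approach would recurse into each URL returned for a `sitemapindex` and collect page URLs from each `urlset`.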


## DataFrames from Sitemaps

Once the sitemaps are fetched and parsed, the next step is to structure the extracted URLs into pandas DataFrames. This transformation is crucial for several reasons:

1. **Structured Data:** DataFrames provide a structured, tabular format, which is essential for data analysis tasks.
2. **Ease of Manipulation:** With pandas DataFrames, we can easily manipulate and analyze the data, perform filtering, and extract specific information.
3. **Readability:** Presenting the data in a table format enhances readability and makes it easier to comprehend the website's structure.

Each sitemap's URLs are stored in their own DataFrame, which holds the URLs together with derived columns for each subdirectory level (`subdirectory1`, `subdirectory2`, and so on). Below, we display the first few rows of each sitemap's DataFrame to get an overview of the parsed data.
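To make the column layout concrete, the sketch below builds rows with a `URLs` column and numbered `subdirectory` columns, padding shorter paths so every row has the same width. It assumes each URL has already been split into its path segments, and it uses empty strings for missing levels; both the helper name `rows_with_subdirectory_columns` and the padding choice are assumptions, not the notebook's actual implementation.

```python
def rows_with_subdirectory_columns(
    parts_by_url: dict[str, list[str]],
) -> list[dict[str, str]]:
    """Build one row per URL with columns URLs, subdirectory1..subdirectoryN."""
    # N is the depth of the deepest URL in this sitemap.
    width = max((len(parts) for parts in parts_by_url.values()), default=0)
    rows = []
    for url, parts in parts_by_url.items():
        row = {"URLs": url}
        for i in range(width):
            # Pad shallower URLs with empty strings.
            row[f"subdirectory{i + 1}"] = parts[i] if i < len(parts) else ""
        rows.append(row)
    return rows


rows = rows_with_subdirectory_columns(
    {
        "https://www.tradingview.com/": [],
        "https://www.tradingview.com/ideas/": ["ideas"],
        "https://www.tradingview.com/ideas/alligator/": ["ideas", "alligator"],
    }
)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would produce a frame with this same column layout.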


In [4]:
# Display the DataFrames created from sitemaps
for key, df in parser.sitemap_dataframes.items():
    print(f"Sitemap: {key}")
    display(df.head())  # Show the first few rows of each DataFrame


Sitemap: sitemap-base


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 |
| --- | --- | --- | --- |
| 0 | https://www.tradingview.com/ | | |
| 1 | https://www.tradingview.com/ideas/ | ideas | |
| 2 | https://www.tradingview.com/scripts/ | scripts | |
| 3 | https://www.tradingview.com/education/ | education | |
| 4 | https://www.tradingview.com/chart/ | chart | |


Sitemap: sitemap-news


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/news/tradingview:c... | news | tradingview:c09617edc094b:0-xau-usd-gold-on-pa... | |
| 1 | https://www.tradingview.com/news/tradingview:d... | news | tradingview:d05f29fa7094b:0-gbp-usd-sterling-r... | |
| 2 | https://www.tradingview.com/news/tradingview:3... | news | tradingview:39dfb6c6f094b:0-wmt-walmart-stock-... | |
| 3 | https://www.tradingview.com/news/tradingview:1... | news | tradingview:1d94c1823094b:0-american-airlines-... | |
| 4 | https://www.tradingview.com/news/tradingview:8... | news | tradingview:83cd1c08d094b:0-boa-q3-income-up-1... | |


Sitemap: sitemap-categories


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/markets/stocks-den... | markets | stocks-denmark | market-movers-52wk-high |
| 1 | https://www.tradingview.com/markets/stocks-tai... | markets | stocks-taiwan | market-movers-atl |
| 2 | https://www.tradingview.com/markets/stocks-spa... | markets | stocks-spain | market-movers-unusual-volume |
| 3 | https://www.tradingview.com/markets/stocks-ven... | markets | stocks-venezuela | market-movers-ath |
| 4 | https://www.tradingview.com/markets/stocks-spa... | markets | stocks-spain | market-movers-largest-employers |


Sitemap: sitemap-ideas


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 | subdirectory4 |
| --- | --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/chart/EURUSD/aX3kC... | chart | EURUSD | aX3kCqrE-10-more-things-I-learned-in-my-short-... | |
| 1 | https://www.tradingview.com/chart/BTCUSD/7GCqP... | chart | BTCUSD | 7GCqPZBy-Bitcoin-Long-Term-4-year-cycle-fractal | |
| 2 | https://www.tradingview.com/chart/GME/EQu6xcC3... | chart | GME | EQu6xcC3-GME-is-bound-to-pop-A-technical-funda... | |
| 3 | https://www.tradingview.com/chart/GME/vElShFpa... | chart | GME | vElShFpa-Michael-Burry-The-Big-Short-squeeze-500 | |
| 4 | https://www.tradingview.com/chart/BTCUSD/Ra2xQ... | chart | BTCUSD | Ra2xQeBx-The-fascinating-history-of-derivatives | |


Sitemap: sitemap-scripts


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 |
| --- | --- | --- | --- |
| 0 | https://www.tradingview.com/script/32ohT5SQ-Fu... | script | 32ohT5SQ-Function-Highest-Lowest |
| 1 | https://www.tradingview.com/script/juMBqtQk-Ti... | script | juMBqtQk-Time-Series-Lag-Reduction-Filter-by-C... |
| 2 | https://www.tradingview.com/script/PcO280uj-LS... | script | PcO280uj-LSMA-A-Fast-And-Simple-Alternative-Ca... |
| 3 | https://www.tradingview.com/script/CtjX82Hp-Ex... | script | CtjX82Hp-Extrapolated-Pivot-Connector-Lets-Mak... |
| 4 | https://www.tradingview.com/script/tKzP5Uj0-Ha... | script | tKzP5Uj0-Hancock-RSI-Volume |


Sitemap: sitemap-sparks


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/sparks/work/ | sparks | work | |
| 1 | https://www.tradingview.com/sparks/play/ | sparks | play | |
| 2 | https://www.tradingview.com/sparks/home/ | sparks | home | |
| 3 | https://www.tradingview.com/sparks/industry-in... | sparks | industry-infrastructure | |
| 4 | https://www.tradingview.com/sparks/transport-l... | sparks | transport-logistics | |


Sitemap: sitemap-support


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/support/categories... | support | categories | mobileApps |
| 1 | https://www.tradingview.com/support/folders/43... | support | folders | 43000558389-application-installation |
| 2 | https://www.tradingview.com/support/solutions/... | support | solutions | 43000506667-where-can-i-download-the-app |
| 3 | https://www.tradingview.com/support/solutions/... | support | solutions | 43000506675-i-m-unable-to-download-the-app-sin... |
| 4 | https://www.tradingview.com/support/folders/43... | support | folders | 43000558390-application-settings |


Sitemap: sitemap-symbols


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/symbols/SHNWMATIC_... | symbols | SHNWMATIC_F6467B | |
| 1 | https://www.tradingview.com/symbols/SHRMUSDC_3... | symbols | SHRMUSDC_3B8830 | |
| 2 | https://www.tradingview.com/symbols/SIGNWMATIC... | symbols | SIGNWMATIC_B21482 | |
| 3 | https://www.tradingview.com/symbols/SIGNWMATIC... | symbols | SIGNWMATIC_B21482.USD | |
| 4 | https://www.tradingview.com/symbols/SIMWMATIC_... | symbols | SIMWMATIC_77F5FD | |


Sitemap: sitemap-tags


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 |
| --- | --- | --- | --- |
| 0 | https://www.tradingview.com/ideas/5-0pattern/ | ideas | 5-0pattern |
| 1 | https://www.tradingview.com/ideas/abcdpattern/ | ideas | abcdpattern |
| 2 | https://www.tradingview.com/ideas/accumulation... | ideas | accumulationdistribution |
| 3 | https://www.tradingview.com/ideas/alligator/ | ideas | alligator |
| 4 | https://www.tradingview.com/ideas/ascendingbro... | ideas | ascendingbroadeningwedge |


Sitemap: sitemap-timelines


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/symbols/NASDAQ-CAS... | symbols | NASDAQ-CASA | history-timeline |
| 1 | https://www.tradingview.com/symbols/NASDAQ-PTO... | symbols | NASDAQ-PTON | history-timeline |
| 2 | https://www.tradingview.com/symbols/NYSE-GIS/h... | symbols | NYSE-GIS | history-timeline |
| 3 | https://www.tradingview.com/symbols/NASDAQ-DDO... | symbols | NASDAQ-DDOG | history-timeline |
| 4 | https://www.tradingview.com/symbols/NASDAQ-AEM... | symbols | NASDAQ-AEMD | history-timeline |


Sitemap: sitemap


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 |
| --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/sitemaps/www_tradi... | sitemaps | www_tradingview_com | sitemap-base.xml |
| 1 | https://www.tradingview.com/sitemap-news.xml | sitemap-news.xml | | |
| 2 | https://www.tradingview.com/sitemaps/www_tradi... | sitemaps | www_tradingview_com | sitemap-categories.xml |
| 3 | https://www.tradingview.com/sitemaps/www_tradi... | sitemaps | www_tradingview_com | sitemap-ideas.xml |
| 4 | https://www.tradingview.com/sitemaps/www_tradi... | sitemaps | www_tradingview_com | sitemap-scripts.xml |


## Extracting Subdirectories from URLs

After parsing the sitemaps and storing the URLs in DataFrames, we take an additional step to extract the subdirectories from each URL. This process involves analyzing each URL and breaking it down into its constituent parts. For example, a URL like `https://www.tradingview.com/ideas/` is broken down into `['ideas']`.
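A minimal sketch of this splitting step, using the standard library's `urllib.parse.urlsplit`, is shown here. The function name `extract_subdirectories` is hypothetical and the code is an illustration of the idea, not the `SitemapParser` implementation.

```python
from urllib.parse import urlsplit


def extract_subdirectories(url: str) -> list[str]:
    """Split a URL's path into its non-empty segments."""
    # urlsplit separates the path from the scheme, host, query and fragment.
    path = urlsplit(url).path
    # Leading, trailing and doubled slashes produce empty strings; drop them.
    return [segment for segment in path.split("/") if segment]


print(extract_subdirectories("https://www.tradingview.com/ideas/"))
print(extract_subdirectories("https://www.tradingview.com/sparks/work/"))
print(extract_subdirectories("https://www.tradingview.com/"))
```

Working on the parsed path rather than the raw string keeps the scheme, the host name, and any query string out of the result.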

The extraction of subdirectories is vital for several reasons:

1. **Enhanced Analysis:** It allows us to analyze the structure of the website in more detail, understanding how the content is categorized.
2. **Data Enrichment:** This step enriches our dataset by providing more specific information about each URL, which can be crucial for in-depth web analysis.
3. **Preparation for Advanced Tasks:** With detailed breakdowns, the data is better prepared for advanced tasks like machine learning algorithms that might require granular features.
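As one illustration of the structural analysis this enables, the sketch below counts how many URLs fall under each first-level subdirectory. The rows are a handful of entries copied from the tables above and represented as plain dictionaries rather than a DataFrame; the function name `count_by_top_level` is hypothetical.

```python
from collections import Counter


def count_by_top_level(rows: list[dict[str, str]]) -> Counter:
    """Count rows per `subdirectory1` value, skipping rows where it is empty."""
    return Counter(row["subdirectory1"] for row in rows if row["subdirectory1"])


rows = [
    {"URLs": "https://www.tradingview.com/", "subdirectory1": ""},
    {"URLs": "https://www.tradingview.com/ideas/", "subdirectory1": "ideas"},
    {"URLs": "https://www.tradingview.com/scripts/", "subdirectory1": "scripts"},
    {"URLs": "https://www.tradingview.com/sparks/work/", "subdirectory1": "sparks"},
    {"URLs": "https://www.tradingview.com/sparks/play/", "subdirectory1": "sparks"},
    {"URLs": "https://www.tradingview.com/ideas/alligator/", "subdirectory1": "ideas"},
    {"URLs": "https://www.tradingview.com/ideas/abcdpattern/", "subdirectory1": "ideas"},
]

counts = count_by_top_level(rows)
print(counts.most_common())
```

On a pandas DataFrame, `df["subdirectory1"].value_counts()` gives the same kind of summary.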

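Below, we demonstrate the result using the `sitemap-ideas` sitemap. Because the extraction is applied during parsing, the "before" and "after" views show the same DataFrame, with the subdirectory columns already populated.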


In [5]:
# Display the 'sitemap-ideas' DataFrame.
# Note: the subdirectory extraction is assumed to run during parsing,
# so this "before" view already contains the subdirectory columns.
print("DataFrame Before Extracting Subdirectories:")
display(parser.sitemap_dataframes['sitemap-ideas'].head())

# No further extraction step is called here, so this "after" view
# shows the same DataFrame as above.
print("\nDataFrame After Extracting Subdirectories:")
display(parser.sitemap_dataframes['sitemap-ideas'].head())


DataFrame Before Extracting Subdirectories:


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 | subdirectory4 |
| --- | --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/chart/EURUSD/aX3kC... | chart | EURUSD | aX3kCqrE-10-more-things-I-learned-in-my-short-... | |
| 1 | https://www.tradingview.com/chart/BTCUSD/7GCqP... | chart | BTCUSD | 7GCqPZBy-Bitcoin-Long-Term-4-year-cycle-fractal | |
| 2 | https://www.tradingview.com/chart/GME/EQu6xcC3... | chart | GME | EQu6xcC3-GME-is-bound-to-pop-A-technical-funda... | |
| 3 | https://www.tradingview.com/chart/GME/vElShFpa... | chart | GME | vElShFpa-Michael-Burry-The-Big-Short-squeeze-500 | |
| 4 | https://www.tradingview.com/chart/BTCUSD/Ra2xQ... | chart | BTCUSD | Ra2xQeBx-The-fascinating-history-of-derivatives | |



DataFrame After Extracting Subdirectories:


| Unnamed: 0 | URLs | subdirectory1 | subdirectory2 | subdirectory3 | subdirectory4 |
| --- | --- | --- | --- | --- | --- |
| 0 | https://www.tradingview.com/chart/EURUSD/aX3kC... | chart | EURUSD | aX3kCqrE-10-more-things-I-learned-in-my-short-... | |
| 1 | https://www.tradingview.com/chart/BTCUSD/7GCqP... | chart | BTCUSD | 7GCqPZBy-Bitcoin-Long-Term-4-year-cycle-fractal | |
| 2 | https://www.tradingview.com/chart/GME/EQu6xcC3... | chart | GME | EQu6xcC3-GME-is-bound-to-pop-A-technical-funda... | |
| 3 | https://www.tradingview.com/chart/GME/vElShFpa... | chart | GME | vElShFpa-Michael-Burry-The-Big-Short-squeeze-500 | |
| 4 | https://www.tradingview.com/chart/BTCUSD/Ra2xQ... | chart | BTCUSD | Ra2xQeBx-The-fascinating-history-of-derivatives | |


## Saving Parsed Data as CSV Files

After parsing the sitemaps and enriching the data by extracting subdirectories, the next crucial step is to save this structured data for future use. We accomplish this by saving the DataFrames as CSV files. Storing data in CSV format offers several advantages:

1. **Compatibility:** CSV files are widely supported and can be used in various tools and platforms, enhancing accessibility.
2. **Persistence:** Saved files preserve the parsed data across sessions, so future analyses do not need to re-fetch and re-parse the website.
3. **Convenience:** CSV files provide an easy way to share data and collaborate with others who might not use Python.

In the following code, we demonstrate how to save these DataFrames as CSV files using the `save_as_csv` method of our `SitemapParser` class.


In [6]:
# Save the DataFrames as CSV files in the 'sitemaps' directory
parser.save_as_csv()


Saved: sitemaps\sitemap-base.csv
Saved: sitemaps\sitemap-news.csv
Saved: sitemaps\sitemap-categories.csv
Saved: sitemaps\sitemap-ideas.csv
Saved: sitemaps\sitemap-scripts.csv
Saved: sitemaps\sitemap-sparks.csv
Saved: sitemaps\sitemap-support.csv
Saved: sitemaps\sitemap-symbols.csv
Saved: sitemaps\sitemap-tags.csv
Saved: sitemaps\sitemap-timelines.csv
Saved: sitemaps\sitemap.csv
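The body of `save_as_csv` is not shown in this notebook. A minimal sketch of the same idea, writing one CSV file per sitemap into a `sitemaps` directory, could look like the following. It uses the standard library's `csv` module on lists of dictionaries instead of pandas to stay dependency-free, writes into a temporary directory so the example leaves no files behind, and the function name `save_tables_as_csv` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_tables_as_csv(tables: dict[str, list[dict[str, str]]], out_dir) -> list[Path]:
    """Write each table to <out_dir>/<name>.csv and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    written = []
    for name, rows in tables.items():
        path = out / f"{name}.csv"
        # Column order follows the keys of the first row.
        fieldnames = list(rows[0]) if rows else []
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
        print(f"Saved: {path}")
    return written


tables = {
    "sitemap-tags": [
        {
            "URLs": "https://www.tradingview.com/ideas/alligator/",
            "subdirectory1": "ideas",
            "subdirectory2": "alligator",
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_tables_as_csv(tables, Path(tmp) / "sitemaps")
    saved_text = paths[0].read_text(encoding="utf-8")

print(saved_text)
```

With pandas DataFrames, the per-table write would be `df.to_csv(path, index=False)`. An `Unnamed: 0` column, like the one in the tables above, typically appears when a CSV is written with the index included and later read back without `index_col=0`.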


## Conclusion

This notebook demonstrated how to extract and parse the XML sitemaps of `www.tradingview.com`: fetching the sitemap URLs, parsing them, and organizing the data into structured pandas DataFrames. With the subdirectories extracted and the data saved as CSV files, the resulting dataset describes the website's structure and is ready for further analysis and machine learning applications.

This task underscores the importance of web data handling skills in data analytics and visualization, showcasing how raw web data can be transformed into a structured and analyzable format.


## References

- Python Requests Library Documentation: [https://docs.python-requests.org/](https://docs.python-requests.org/)
- BeautifulSoup Documentation: [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- Pandas Documentation: [https://pandas.pydata.org/pandas-docs/stable/](https://pandas.pydata.org/pandas-docs/stable/)
- TradingView Website: [https://www.tradingview.com/](https://www.tradingview.com/)
