# Structured data: Introduction to XML


In the world of data science, structured data plays a crucial role in ensuring that information is organized, accessible, and easy to analyze. Well-structured data formats enable efficient storage, retrieval, and processing, which is essential for data-driven decision-making in businesses. One of the widely used formats for structured data is XML (Extensible Markup Language), which provides a flexible yet standardized way to store and exchange data across different systems. For a data scientist, understanding and working with structured data, including XML, is fundamental for tasks such as data integration, preprocessing, and analysis. In a business context, structured data allows organizations to optimize operations, enhance customer insights, and improve decision-making processes. This week, we will explore the principles and applications of XML, demonstrating how it supports structured data management and why it remains a valuable tool in the data ecosystem.

## Basic XML Structure

To understand XML, consider the following simple structure:

```xml
<data>
    <person guid="123e4567-e89b-12d3-a456-426614174000" status="Active">
        <name>John Doe</name>
        <age>30</age>
        <email>johndoe@example.com</email>
    </person>
    <person guid="789e1234-e89b-34d3-a456-426614174999" status="Inactive">
        <name>Jane Smith</name>
        <age>25</age>
        <email>janesmith@example.com</email>
    </person>
</data>
```

Each `<person>` element contains nested tags providing details such as `<name>`, `<age>`, and `<email>`. Additionally, attributes like `guid` and `status` provide supplementary data that describes the entity.

### Tags vs. Attributes

- **Tags** (`<name>John Doe</name>`) store structured data in a hierarchical way.
- **Attributes** (`guid="123e4567-e89b-12d3-a456-426614174000" status="Active"`) store metadata or properties associated with an element without requiring additional nested tags.

Note: [This Stack Overflow discussion](https://stackoverflow.com/questions/1096797/should-i-use-elements-or-attributes-in-xml) of best practices for choosing between tags and attributes is worth reading.

### XML Schemas for validation

XML Schemas provide a way to define the structure, content, and data types of XML documents. Think of them as a blueprint or contract for your XML. Instead of just being a free-for-all of tags, a schema allows you to specify precisely what elements and attributes are allowed, the order they appear in, and the kind of data they can contain. This is crucial for data validation, ensuring that XML documents conform to a predefined standard, making data exchange between different systems much more reliable and predictable. Without a schema, interpreting and processing XML can become ambiguous and error-prone.

Example of an XML Schema for the sample document above:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="data">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="age" type="xs:integer"/>
              <xs:element name="email" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="guid" type="xs:string" use="required"/>
            <xs:attribute name="status" use="required">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="Active"/>
                    <xs:enumeration value="Inactive"/>
                  </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```
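
A schema only helps if documents are actually checked against it. The following is a minimal sketch of such a check, assuming the third-party `lxml` package is installed (Python's standard library has no XSD validator). The helper `make_doc` and the two test documents are illustrative, and the schema is a compact copy of the one above with the `status` values defined inline.

```python
from lxml import etree  # Assumption: lxml is installed (pip install lxml)

XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="data">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="age" type="xs:integer"/>
              <xs:element name="email" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="guid" type="xs:string" use="required"/>
            <xs:attribute name="status" use="required">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="Active"/>
                  <xs:enumeration value="Inactive"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def make_doc(status):
    """Build a one-person document with the given status value (illustrative helper)."""
    return f"""
<data>
  <person guid="123e4567-e89b-12d3-a456-426614174000" status="{status}">
    <name>John Doe</name>
    <age>30</age>
    <email>johndoe@example.com</email>
  </person>
</data>"""


# Compile the schema once, then validate as many documents as needed
schema = etree.XMLSchema(etree.fromstring(XSD))

for status in ("Active", "Unknown"):
    doc = etree.fromstring(make_doc(status))
    verdict = "valid" if schema.validate(doc) else "invalid"
    print(f"status={status!r}: {verdict}")
```

After a failed validation, `schema.error_log` holds the reasons, which is usually more helpful than the bare boolean.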

### JSON and XML

Previously, we retrieved JSON from API endpoints. XML may look similar to JSON so far, but there are a few key differences. The most important are:
- **Metadata Representation:** JSON lacks a direct equivalent to XML attributes. It relies entirely on key-value pairs. Relationships that XML expresses via attributes are represented in JSON through the structure of the key-value pairs, often leading to nested objects to differentiate between data and metadata. This is a fundamental difference in data modeling philosophy, not merely a difference in syntax.

- **Schema and Validation:** JSON has schema validation options (e.g., JSON Schema), which, while powerful and increasingly adopted, are still less ubiquitously enforced in existing systems compared to XML schemas. However, the trend is towards greater JSON Schema usage, particularly in modern API design and development.

- **Interoperability and Use Cases:** XML has historically been dominant in enterprise applications, particularly in industries like finance, healthcare, and government, where its strong schema validation and established standards for data integrity and long-term storage were crucial. JSON has become the preferred choice in web development and APIs, particularly in RESTful services and front-end applications, due to its more compact syntax and easier parsing in JavaScript. *Note: These lines are not absolute. JSON is increasingly used in enterprise contexts for specific use cases, and some web services still use XML, particularly those interacting with older systems.*


XML is the basis of many document-oriented formats, such as **XHTML**, **eBooks (EPUB)**, and **news feeds** (RSS and Atom).

Read more [here](https://www.geeksforgeeks.org/difference-between-json-and-xml/).
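
To make the metadata point concrete, here is a small sketch that converts one `<person>` element into a JSON object. Because JSON has no attributes, the sketch marks them with an `@` prefix; that prefix is just one possible convention chosen for illustration, and `element_to_dict` is a hypothetical helper that only handles flat elements.

```python
import json
import xml.etree.ElementTree as ET

xml_person = """
<person guid="123e4567-e89b-12d3-a456-426614174000" status="Active">
    <name>John Doe</name>
    <age>30</age>
</person>"""


def element_to_dict(elem):
    """Convert a flat element to a dict; attribute keys get an "@" prefix."""
    result = {f"@{key}": value for key, value in elem.attrib.items()}
    for child in elem:
        result[child.tag] = child.text  # Element text is always a string
    return result


person = element_to_dict(ET.fromstring(xml_person))
print(json.dumps(person, indent=2))
```

In the resulting JSON, attributes and child elements are both plain key-value pairs. Only the naming convention tells them apart.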

## Navigating XML using XPath

In [1]:
# Sample XML data
xml_data = """
<data>
    <person guid="123e4567-e89b-12d3-a456-426614174000" status="Active">
        <name>John Doe</name>
        <age>30</age>
        <email>johndoe@example.com</email>
    </person>
    <person guid="789e1234-e89b-34d3-a456-426614174999" status="Inactive">
        <name>Jane Smith</name>
        <age>25</age>
        <email>janesmith@example.com</email>
    </person>
</data>"""

**XPath** is a query language for selecting nodes from an XML document. It provides a way to traverse the hierarchical structure of an XML document and select specific elements or attributes based on their names, relationships, and values.

XPath expressions are patterns used to locate nodes within an XML document. Here are some basic constructs:

  * `/`: Selects from the root node.
  * `//`: Selects nodes in the document from the current node that match the selection no matter where they are.
  * `.`: Selects the current node.
  * `..`: Selects the parent of the current node.

**Examples**:
- Select all `person` elements: `.//person`
- Select the `name` of the first `person` element: `.//person[1]/name`
- Select the `email` for all `person` elements: `.//person/email`
- Select the `name` of `person` elements with `status` "Active": `.//person[@status='Active']/name`
- Select the parent of the `<name>` element of Jane Smith: `.//person[name='Jane Smith']/..`
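
The example expressions above can be tried out with Python's `xml.etree.ElementTree`, which supports this subset of XPath. This is a minimal sketch that repeats the sample document so it runs on its own; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_data = """
<data>
    <person guid="123e4567-e89b-12d3-a456-426614174000" status="Active">
        <name>John Doe</name>
        <age>30</age>
        <email>johndoe@example.com</email>
    </person>
    <person guid="789e1234-e89b-34d3-a456-426614174999" status="Inactive">
        <name>Jane Smith</name>
        <age>25</age>
        <email>janesmith@example.com</email>
    </person>
</data>"""

root = ET.fromstring(xml_data)

# findall returns a list of matching elements, find returns the first match
all_people = root.findall(".//person")
first_name = root.find(".//person[1]/name").text
emails = [elem.text for elem in root.findall(".//person/email")]
active_names = [elem.text for elem in root.findall(".//person[@status='Active']/name")]
jane_parent = root.find(".//person[name='Jane Smith']/..")

print("Number of persons:", len(all_people))
print("First name:", first_name)
print("Emails:", emails)
print("Active:", active_names)
print("Parent tag of Jane's <person>:", jane_parent.tag)
```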

## Processing XML in Python



To parse XML and extract information using Python, we can use `xml.etree.ElementTree`:

In [2]:
import xml.etree.ElementTree as ET

# Parse XML
root = ET.fromstring(xml_data)

# Extract and display information
for person in root.findall(".//person"): # Explained below
    guid = person.get("guid")  # Extract attribute
    status = person.get("status")  # Extract attribute
    name = person.find("name").text #extract element
    age = person.find("age").text
    email = person.find("email").text
    print(f"GUID: {guid}, Status: {status}, Name: {name}, Age: {age}, Email: {email}")

GUID: 123e4567-e89b-12d3-a456-426614174000, Status: Active, Name: John Doe, Age: 30, Email: johndoe@example.com
GUID: 789e1234-e89b-34d3-a456-426614174999, Status: Inactive, Name: Jane Smith, Age: 25, Email: janesmith@example.com


The code snippet above demonstrates how to use XPath to extract information from an XML document. The `root.findall(".//person")` expression selects all `person` elements in the document, regardless of their position. Note that `xml.etree.ElementTree` supports only a limited subset of XPath. For more detailed information, you can refer to the [XPath specification](https://www.w3.org/TR/xpath/).

> Note: The code above works only if `xml_data` is well-formed XML. HTML documents are usually not well-formed XML, so they need different tools to parse effectively.
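
As a small illustration of that note, the sketch below feeds an HTML fragment with an unclosed `<br>` tag to `ElementTree`. The fragment and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Browsers accept this, but it is not well-formed XML: <br> is never closed
html_snippet = "<p>First line<br>Second line</p>"

try:
    ET.fromstring(html_snippet)
    parsed = True
except ET.ParseError as error:
    parsed = False
    print(f"Not well-formed XML: {error}")
```

Catching `ET.ParseError` lets a program report the problem instead of stopping with a traceback.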

When working with XML data, not every element is present in every entry. If an element is missing, `find()` returns `None`, and accessing `.text` on it raises an `AttributeError`. A conditional expression handles missing data gracefully without breaking the program. Providing a default value such as `"N/A"` also makes the output more consistent, which is useful when storing or analyzing data: every entry gets a meaningful fallback value instead of causing `NoneType` errors or gaps in the dataset.

Examine the following syntax:

```python
for person in root.findall(".//person"):
    name_elem = person.find("name")
    # Fall back to "N/A" when the <name> element is missing
    name = name_elem.text if name_elem is not None else "N/A"
```
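
`ElementTree` also offers `findtext()`, which combines the lookup and the fallback in one call. A minimal sketch, using a made-up document in which the second person has no `<email>`:

```python
import xml.etree.ElementTree as ET

# Illustrative data: the second <person> lacks an <email> element
xml_data = """
<data>
    <person><name>John Doe</name><email>johndoe@example.com</email></person>
    <person><name>Jane Smith</name></person>
</data>"""

root = ET.fromstring(xml_data)

rows = []
for person in root.findall(".//person"):
    # findtext returns the default when no matching element exists
    name = person.findtext("name", default="N/A")
    email = person.findtext("email", default="N/A")
    rows.append((name, email))
    print(f"Name: {name}, Email: {email}")
```

Note that the default applies only when the element is absent. An element that exists but is empty yields an empty string.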

### Exercise: Practice XPaths

1. Retrieve the `guid` attribute of the first `person` element.
2. Retrieve the `email` element of the second `person` element.
3. Find all `email` elements and print their text content.

In [11]:
# TODO: code here
import xml.etree.ElementTree as ET

# Parse XML
root = ET.fromstring(xml_data)

for person in root.findall(".//person[1]"):
  guid = person.get("guid")
  print(f'GUID of the first person: {guid}')
for person in root.findall(".//person[2]"):
  email = person.find("email").text
  print(f'email of the second person: {email}')
all_emails = root.findall(".//email")
print("All emails:")
for email in all_emails:
    print(email.text)
print()


GUID of the first person: 123e4567-e89b-12d3-a456-426614174000
email of the second person: janesmith@example.com
All emails:
johndoe@example.com
janesmith@example.com




## Parsing XML using pandas

The `pandas.read_xml` function in Python is a powerful tool for parsing XML data and converting it into a pandas DataFrame. This function relies heavily on the structured nature of XML.  XML's hierarchical, tag-based format, with clearly defined elements and attributes, allows `read_xml` to predictably map the XML structure to a DataFrame's rows and columns. Without this consistent structure, the function wouldn't be able to determine how to represent the data in a tabular form. Elements and attributes within the structure become columns, and repeating elements translate into rows.

Here's how you can use it with the provided XML sample:

In [12]:
import pandas as pd
import io

#Using pandas read_xml.  We use io.StringIO to treat the string as a file.
df = pd.read_xml(io.StringIO(xml_data), xpath='.//person')
df

Unnamed: 0,guid,status,name,age,email
0,123e4567-e89b-12d3-a456-426614174000,Active,John Doe,30,johndoe@example.com
1,789e1234-e89b-34d3-a456-426614174999,Inactive,Jane Smith,25,janesmith@example.com


> You can pass `elems_only=True` or `attrs_only=True` to load only the elements or only the attributes.

This code snippet first imports the pandas library and the `io` module. `io.StringIO` wraps the `xml_data` string so that `read_xml` can read it as if it were a file. The `xpath='.//person'` argument tells `read_xml` to find all `<person>` elements within the XML and use them as the basis for the DataFrame's rows.
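
The effect of `attrs_only` and `elems_only` is easiest to see side by side. This is a sketch on the same sample document. It passes `parser="etree"` so that only the standard-library parser is needed, which is a choice made for this example rather than a requirement.

```python
import io

import pandas as pd

xml_data = """
<data>
    <person guid="123e4567-e89b-12d3-a456-426614174000" status="Active">
        <name>John Doe</name>
        <age>30</age>
        <email>johndoe@example.com</email>
    </person>
    <person guid="789e1234-e89b-34d3-a456-426614174999" status="Inactive">
        <name>Jane Smith</name>
        <age>25</age>
        <email>janesmith@example.com</email>
    </person>
</data>"""

# Only the attributes of each <person> become columns
attrs_df = pd.read_xml(
    io.StringIO(xml_data), xpath=".//person", attrs_only=True, parser="etree"
)

# Only the child elements of each <person> become columns
elems_df = pd.read_xml(
    io.StringIO(xml_data), xpath=".//person", elems_only=True, parser="etree"
)

print(attrs_df.columns.tolist())
print(elems_df.columns.tolist())
```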


# Assignment for Week 3: Processing RSS feeds

## Learning Objectives and Assignment Goals

For this assignment, you will develop a Python program that gathers news articles from three different **BBC RSS feeds**: [Business](https://feeds.bbci.co.uk/news/business/rss.xml), [Science and Environment](https://feeds.bbci.co.uk/news/science_and_environment/rss.xml), and [Technology](https://feeds.bbci.co.uk/news/technology/rss.xml). The goal is to

a) extract relevant details such as **article title, description, publication date, and category**

b) and organize them into a **pandas DataFrame**.

c) Finally, you will **filter the dataset to include only articles published on weekdays** and export them as a **structured JSON file** sorted by publication date.

## Expected Output

The final output of this assignment will be a structured dataset in the form of a pandas DataFrame, with only weekday articles included. The DataFrame should have the following structure:

| title     | description            | pub_date   | link                       | category     | day_of_week | is_weekday |
| --------- | ---------------------- | ---------- | -------------------------- | ------------ | ----------- | ---------- |
| "Title 1" | "Summary of article 1" | 2025-01-28 | "https://bbc.com/article1" | "Business"   | "Tuesday"   | True       |
| "Title 2" | "Summary of article 2" | 2025-01-27 | "https://bbc.com/article2" | "Technology" | "Monday"    | True       |

The dataset should be **filtered to include only weekday articles**, sorted by publication date (most recent first), and exported as a **JSON file**.

## Understanding RSS Feeds

RSS feeds provide a standardized way to publish frequently updated information, such as news articles and blog posts. RSS data is formatted in XML, following a specific structure that consists of nested elements.

Each article within an RSS feed follows a common structure:

```xml
<item>
    <title>Sample News Article</title>
    <description>This is a summary of the article.</description>
    <pubDate>Tue, 28 Jan 2025 14:00:00 GMT</pubDate>
    <link>https://www.bbc.com/sample-article</link>
</item>
```

Each `<item>` represents an article, containing:

- `<title>`: The article’s headline.
- `<description>`: A brief summary.
- `<pubDate>`: The publication date.
- `<link>`: A URL to the full article.
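
The four fields can be read with the same `ElementTree` calls used earlier. This is a minimal sketch on a made-up feed: the `<rss>` and `<channel>` wrappers follow the usual RSS 2.0 layout, and the content is the sample item from above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed wrapping the sample <item>
rss_sample = """
<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item>
      <title>Sample News Article</title>
      <description>This is a summary of the article.</description>
      <pubDate>Tue, 28 Jan 2025 14:00:00 GMT</pubDate>
      <link>https://www.bbc.com/sample-article</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss_sample)

# One dict per <item>; missing fields fall back to "N/A"
articles = [
    {
        field: item.findtext(field, default="N/A")
        for field in ("title", "description", "pubDate", "link")
    }
    for item in root.findall(".//item")
]

for article in articles:
    print(article)
```

Searching with `.//item` skips the feed-level `<title>` because it only looks inside `<item>` elements.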


### Publication Date Format

The `<pubDate>` field follows the **RFC-822 format**, which looks like this:

```
Tue, 28 Jan 2025 14:00:00 GMT
```

This format needs to be converted into a **datetime object** for proper sorting and analysis.
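
One way to do this conversion for a single value is the standard library's `email.utils.parsedate_to_datetime`, which understands this date format. The sketch below uses the sample date from above. The assignment itself asks for `pd.to_datetime()`, so treat this as an illustration of the idea only.

```python
from email.utils import parsedate_to_datetime

pub_date = parsedate_to_datetime("Tue, 28 Jan 2025 14:00:00 GMT")

print(pub_date.isoformat())      # Timezone-aware ISO 8601 string
print(pub_date.strftime("%A"))   # Name of the weekday
print(pub_date.weekday() < 5)    # weekday() counts Monday as 0, so 0-4 are weekdays
```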

### Step 1: Parse the RSS Feeds

We will start by fetching RSS feeds from BBC's Business, Science and Environment, and Technology sections. In **Week 2**, you learned how to retrieve data from online resources using the `requests` library in Python. Apply those same concepts here to fetch the RSS feeds.

- Define the RSS feed URLs.
- Use the `requests` library to retrieve the XML content.
- Check the response status to ensure the request was successful.
- Read the XML into a pandas DataFrame.

> **Hint:** Use `requests.get(url)` to retrieve the RSS feed and `response.text` to access the content.

To prevent excessive requests from overwhelming the server, introduce a delay between each request. This ensures compliance with best practices and helps avoid temporary blocks due to too many requests in a short time frame. In Python, this can be achieved using the `time.sleep()` function, which pauses execution for a specified number of seconds before proceeding with the next request. Set an appropriate delay, such as two seconds, between requests to balance efficiency and server-friendly behavior.

You can **concatenate** several DataFrames using the `pd.concat` function. Passing `ignore_index=True` gives the combined DataFrame a fresh index, which is usually the wiser option.
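
The difference the index handling makes is visible on two tiny DataFrames. The frames and their contents are made up for this sketch.

```python
import pandas as pd

business = pd.DataFrame({"title": ["A", "B"]})
technology = pd.DataFrame({"title": ["C"]})

# Default: each frame keeps its own index, so labels repeat
kept = pd.concat([business, technology])

# ignore_index=True renumbers the rows from 0
reset = pd.concat([business, technology], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0]
print(reset.index.tolist())  # [0, 1, 2]
```

Repeated index labels are easy to overlook and can give surprising results when selecting rows by label later on.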

In [18]:
# Todo
import io
import time

import pandas as pd
import requests

URL_business = 'https://feeds.bbci.co.uk/news/business/rss.xml'
URL_sne = 'https://feeds.bbci.co.uk/news/science_and_environment/rss.xml'
URL_tech = 'https://feeds.bbci.co.uk/news/technology/rss.xml'


def get_data(url):
    """Download one RSS feed and return its <item> entries as a DataFrame."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Stop early if the request was not successful
    df = pd.read_xml(io.StringIO(response.text), xpath=".//item")
    df['Source_url'] = url  # Remember which feed each row came from
    return df


frames = []
for url in [URL_business, URL_sne, URL_tech]:
    frames.append(get_data(url))
    time.sleep(2)  # Pause between requests to go easy on the server

df = pd.concat(frames, ignore_index=True)
print(df.head())

                                               title  \
0     Tesco trials giant trolley scales in Gateshead   
1  Data, waves and wind to be counted in the economy   
2  Trump 'strongly considering' large-scale sanct...   
3        Price of first-class stamp to rise to £1.70   
4      US job growth stable as government cuts start   

                                         description  \
0  Trolleys are weighed before checkout to identi...   
1  Wind and wave power is set to be included in c...   
2  The president has reversed US policy and says ...   
3  The cost of a second-class stamp will also ris...   
4  Employers added 151,000 jobs in February but t...   

                                             link  \
0  https://www.bbc.com/news/articles/c0rzvrjkklko   
1  https://www.bbc.com/news/articles/czedpnen168o   
2  https://www.bbc.com/news/articles/c36wkpy3497o   
3  https://www.bbc.com/news/articles/cwygl9vj28do   
4  https://www.bbc.com/news/articles/czedwgrwp32o   

       

### Step 2: Data preparation

#### Task 1: Drop unnecessary columns

Remove the columns named `guid` and `thumbnail` from the DataFrame using the `drop` method. Apply the change directly to the DataFrame instead of creating a new copy.

In [19]:
# Todo

df.drop(columns=['guid', 'thumbnail'], inplace=True)
df.head()

Unnamed: 0,title,description,link,pubDate,Source_url
0,Tesco trials giant trolley scales in Gateshead,Trolleys are weighed before checkout to identi...,https://www.bbc.com/news/articles/c0rzvrjkklko,"Fri, 07 Mar 2025 22:48:44 GMT",https://feeds.bbci.co.uk/news/business/rss.xml
1,"Data, waves and wind to be counted in the economy",Wind and wave power is set to be included in c...,https://www.bbc.com/news/articles/czedpnen168o,"Sat, 08 Mar 2025 01:52:58 GMT",https://feeds.bbci.co.uk/news/business/rss.xml
2,Trump 'strongly considering' large-scale sanct...,The president has reversed US policy and says ...,https://www.bbc.com/news/articles/c36wkpy3497o,"Fri, 07 Mar 2025 19:14:20 GMT",https://feeds.bbci.co.uk/news/business/rss.xml
3,Price of first-class stamp to rise to £1.70,The cost of a second-class stamp will also ris...,https://www.bbc.com/news/articles/cwygl9vj28do,"Fri, 07 Mar 2025 14:25:04 GMT",https://feeds.bbci.co.uk/news/business/rss.xml
4,US job growth stable as government cuts start,"Employers added 151,000 jobs in February but t...",https://www.bbc.com/news/articles/czedwgrwp32o,"Fri, 07 Mar 2025 17:28:24 GMT",https://feeds.bbci.co.uk/news/business/rss.xml


#### Task 2: Category column

Create a dictionary called `categories` whose keys are the source URLs and whose values are the corresponding category names (e.g., 'Business', 'Science and environment', 'Technology'). If the source URL was not stored previously, make the necessary changes. Use the Series `map` method to populate the `category` column based on the source URL.

> The `map()` method applies a given function to each element of a Series (such as a column of a DataFrame) and returns a new Series with the transformed elements. You can also pass a `dict` instead of a function.
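
A short sketch of mapping with a dictionary, using made-up feed names instead of the real URLs:

```python
import pandas as pd

sources = pd.Series(["feed_a", "feed_b", "feed_c"])
labels = {"feed_a": "Business", "feed_b": "Technology"}

# Each value is looked up in the dict; keys that are missing become NaN
mapped = sources.map(labels)
print(mapped)
```

The `NaN` for `feed_c` shows why the dictionary should cover every source URL in the DataFrame.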

In [21]:
# Todo
categories = {
    'https://feeds.bbci.co.uk/news/business/rss.xml': 'Business',
    'https://feeds.bbci.co.uk/news/science_and_environment/rss.xml': 'Science and environment',
    'https://feeds.bbci.co.uk/news/technology/rss.xml': 'Technology',
}
df['category'] = df['Source_url'].map(categories)
df.head()

Unnamed: 0,title,description,link,pubDate,Source_url,category
0,Tesco trials giant trolley scales in Gateshead,Trolleys are weighed before checkout to identi...,https://www.bbc.com/news/articles/c0rzvrjkklko,"Fri, 07 Mar 2025 22:48:44 GMT",https://feeds.bbci.co.uk/news/business/rss.xml,Business
1,"Data, waves and wind to be counted in the economy",Wind and wave power is set to be included in c...,https://www.bbc.com/news/articles/czedpnen168o,"Sat, 08 Mar 2025 01:52:58 GMT",https://feeds.bbci.co.uk/news/business/rss.xml,Business
2,Trump 'strongly considering' large-scale sanct...,The president has reversed US policy and says ...,https://www.bbc.com/news/articles/c36wkpy3497o,"Fri, 07 Mar 2025 19:14:20 GMT",https://feeds.bbci.co.uk/news/business/rss.xml,Business
3,Price of first-class stamp to rise to £1.70,The cost of a second-class stamp will also ris...,https://www.bbc.com/news/articles/cwygl9vj28do,"Fri, 07 Mar 2025 14:25:04 GMT",https://feeds.bbci.co.uk/news/business/rss.xml,Business
4,US job growth stable as government cuts start,"Employers added 151,000 jobs in February but t...",https://www.bbc.com/news/articles/czedwgrwp32o,"Fri, 07 Mar 2025 17:28:24 GMT",https://feeds.bbci.co.uk/news/business/rss.xml,Business


#### Task 3: Day columns, dropping rows and sorting

1. Convert the `pubDate` column to `datetime` using `pd.to_datetime()`.
2. Create a new column `day_of_week` containing the day of the week for each publication date. (Search online for such a function)
3. Create a boolean column `is_weekday` indicating whether an article was published on a weekday (Monday to Friday). With `dayofweek`, Monday is 0, so weekdays are 0 to 4 and Saturday and Sunday are 5 and 6.
4. Filter the DataFrame to include only weekday articles.
5. Sort the DataFrame by publication date in descending order (most recent first) using `sort_values`.

In [29]:
df['pubDate'] = pd.to_datetime(df['pubDate'])
# Name of the weekday, e.g. "Friday"
df['day_of_week'] = df['pubDate'].dt.day_name()
# dayofweek counts Monday as 0, so 0-4 are weekdays and 5-6 are the weekend
df['is_weekday'] = df['pubDate'].dt.dayofweek < 5
# Keep only the weekday articles
df = df[df['is_weekday']]
# sort_values returns a new DataFrame, so the result must be assigned
df = df.sort_values(by='pubDate', ascending=False)
df.head()


#### Task 4: Export data into JSON

Export the combined DataFrame to a JSON file named `result.json` using the `to_json()` method with the `orient='records'` parameter.

In [33]:
# When given a path, to_json writes the file and returns None, so there is nothing to assign
df.to_json('result.json', orient='records')


In [35]:
from google.colab import files

# Download the file
files.download('result.json')


### Extra task for advanced data wranglers :)

Periodically check whether new articles are available (for example, every 15 minutes). If so, append the new items to the ones already exported.

Extra task in more detail:
- The main objective is to create a program that checks for new articles from three BBC RSS feeds (Business, Science and Environment, and Technology). If new weekday articles are found, they should be added to an existing JSON file (result.json) that already contains previous weekday articles.
- You should leverage the code you wrote in the earlier steps of the assignment, which involved fetching RSS feeds, extracting relevant information (title, description, date, etc.), filtering for weekday articles, and creating the initial result.json file.
- Defining a new function might be necessary to encapsulate the process of: 1) fetching news from the three RSS feeds, 2) performing the data preparation steps (dropping unnecessary columns, adding categories, converting dates, filtering for weekdays), 3) appending new weekday articles to the existing JSON file.
- Don't simply overwrite previous data, but append to it. Keep old articles. One article should be stored only once.
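
The append-without-duplicates step can be sketched with the standard library alone. The helper below is hypothetical: it assumes the file holds a list of records (as `orient='records'` produces), that each record has a `link` that identifies the article, and that all values are already JSON-serializable (for example, dates converted to strings).

```python
import json
from pathlib import Path


def append_new_articles(path, new_records, key="link"):
    """Add records whose `key` value is not yet stored in the JSON file at `path`.

    Returns the number of records that were actually added.
    """
    path = Path(path)
    # Start from the stored articles, or from an empty list on the first run
    stored = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = {record[key] for record in stored}

    added = 0
    for record in new_records:
        if record[key] not in seen:  # Store each article only once
            stored.append(record)
            seen.add(record[key])
            added += 1

    path.write_text(json.dumps(stored, ensure_ascii=False, indent=2), encoding="utf-8")
    return added
```

The new records could come from `df.to_dict(orient='records')` after the preparation steps. The periodic check can then be a loop that calls the fetching and preparation code, passes the result to this helper, and waits with `time.sleep(15 * 60)` before the next round.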