# Parsing Orange Express WordPress post data from XML to CSV

This script converts the post data from a WordPress `.xml` export into a `.csv` file.

Note: the data file `oranjeexpress.WordPress.2020-12-06.xml` was exported from WordPress on Dec 6, 2020.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup 
import re

In [2]:
with open('./oranjeexpress.WordPress.2020-12-06.xml', 'r', encoding='utf8', errors='ignore') as f: 
    data = f.read() 

In [3]:
soup = BeautifulSoup(data, "xml")

A demonstration of the XML data structure of a post:
- `{CONTENT}` is a placeholder describing the type of content
- `<TAG>` is an XML tag
- Some values are wrapped in a CDATA section, e.g. `<![CDATA[...]]>`

A glance:

```
<item>
    <title>{title}</title>
    <link>{long url link}</link>
    <pubDate>{published date}</pubDate>
    <dc:creator>{author, `<![CDATA[OliviaDung]]>`}</dc:creator>
    <guid isPermaLink="false">{short url link}</guid>
    <content:encoded>{content, formatted}</content:encoded>
    
    <wp:post_id>267</wp:post_id>
    <wp:post_date><![CDATA[2012-02-07 00:00:00]]></wp:post_date>
    <category domain="post_tag" nicename="%e8%8d%b7%e8%98%ad%e5%82%b3%e7%b5%b1"><![CDATA[傳統]]></category>
    <category domain="category" nicename="{omitted}"><![CDATA[吃喝 &amp; 玩樂]]></category>
</item>
```
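
To make the structure above concrete, here is a minimal sketch that reads one `<item>` with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. It is not the approach this notebook uses; the sample document, the namespace URIs, and the names `SAMPLE`, `NS`, and `record` are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down export with a single post. The namespace URIs
# are placeholders chosen for this sketch, not the ones in the real export.
SAMPLE = """<rss xmlns:dc="http://example.org/dc" xmlns:wp="http://example.org/wp">
  <channel>
    <item>
      <title>Sample post</title>
      <dc:creator><![CDATA[OliviaDung]]></dc:creator>
      <wp:post_id>267</wp:post_id>
      <wp:post_date><![CDATA[2012-02-07 00:00:00]]></wp:post_date>
      <category domain="post_tag" nicename="tag-a"><![CDATA[Tag A]]></category>
      <category domain="category" nicename="cat-a"><![CDATA[Food &amp; Fun]]></category>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI mapping used to resolve "dc:..." and "wp:..." lookups.
NS = {"dc": "http://example.org/dc", "wp": "http://example.org/wp"}

root = ET.fromstring(SAMPLE)
item = root.find("./channel/item")

record = {
    "title": item.findtext("title"),
    "dc:creator": item.findtext("dc:creator", namespaces=NS),
    "wp:post_id": item.findtext("wp:post_id", namespaces=NS),
    "wp:post_date": item.findtext("wp:post_date", namespaces=NS),
    # <category> is split by its "domain" attribute.
    "post_tag": [c.text for c in item.findall("category[@domain='post_tag']")],
    "category": [c.text for c in item.findall("category[@domain='category']")],
}
print(record)
```

The parser unwraps the CDATA sections, so `dc:creator` comes back as plain text. Text inside CDATA is kept literally, which is why the category still reads `Food &amp; Fun` rather than `Food & Fun`.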

In [4]:
def remove_tags(txt):
    """Remove xml/html tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', str(txt))
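
As a quick check of what this regex does, the sketch below applies the same pattern to a few made-up strings. The name `TAG_RE` and the sample inputs are assumptions for illustration only.

```python
import re

# Non-greedy pattern: each "<...>" run is matched and removed separately.
TAG_RE = re.compile(r"<.*?>")

def remove_tags(txt):
    """Remove xml/html tags from a string (same regex as the cell above)."""
    return TAG_RE.sub("", str(txt))

examples = [
    "<title>Sample post</title>",
    '<guid isPermaLink="false">short-link</guid>',
    "<p>nested <b>bold</b> text</p>",
    None,  # e.g. what find() returns when a tag is missing
]
for raw in examples:
    print(repr(remove_tags(raw)))
```

Because the input is passed through `str()`, a missing tag (`None`) comes back as the literal string `'None'` rather than an empty value, which is worth keeping in mind when inspecting the resulting table.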

In [5]:
# define the tags to be retrieved from each post data (excl. <category>, special fix needed)
metadata = ['title', 'dc:creator', 'wp:post_id', 'wp:post_date']
categoryDomAttr = ['post_tag', 'category']

def filterMetaDataPerPost(post):
    """
    Cleanup and filter out the targeted data for each input post data
    """
    miniSoup = BeautifulSoup(str(post), 'lxml')  # name the parser explicitly to avoid bs4's guessed-parser warning
    postData = {key: remove_tags(miniSoup.find(key)) for key in metadata}
    
    # special fix for <category> tags since attributes are involved
    for attr in categoryDomAttr:
        postData[attr] = [remove_tags(x).replace('&amp;amp;', '&') 
                          for x in miniSoup.find_all('category', {'domain': attr})]
        
    return postData
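
The `'&amp;amp;'` replacement above is most likely needed because the CDATA text already contains a literal `&amp;`, and turning the post back into a string escapes that ampersand a second time. The sketch below reproduces the effect with the standard library's `html` module; using `html.escape` as a stand-in for the serializer's escaping is an assumption, and the variable names are made up for illustration.

```python
import html

# Text as it sits inside the CDATA section of the export.
cdata_text = "吃喝 &amp; 玩樂"

# Re-serialising escapes every "&" again, so "&amp;" becomes "&amp;amp;".
reserialised = html.escape(cdata_text, quote=False)
print(reserialised)

# Option 1: the targeted replacement used in the cell above.
via_replace = reserialised.replace("&amp;amp;", "&")

# Option 2: undo both layers of escaping generically.
via_unescape = html.unescape(html.unescape(reserialised))

print(via_replace)
print(via_unescape)
```

Both options give the same result for this input. The second one would also handle other doubly escaped entities, at the cost of unescaping text that was meant to stay escaped.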

In [6]:
# get data!

posts = soup.find_all('item')
postsCleaned = list(map(filterMetaDataPerPost, posts))
df = pd.DataFrame(postsCleaned)
df.set_index('wp:post_id', inplace=True)

In [7]:
# check how the data looks
df.sample(5)

| wp:post_id | title | dc:creator | wp:post_date | post_tag | category |
| --- | --- | --- | --- | --- | --- |
| 10391 | [貓力畫荷蘭] 創業也要趁早？荷蘭青少年創業正夯 | moli | 2019-01-28 06:28:11 | [創業, 教育, 荷蘭教育, 荷蘭青年, 貓力畫荷蘭] | [社會 & 文化] |
| 8234 | Museum MORE give you even more! -盧若城堡分館也精彩! | Cynthia | 2017-11-10 00:14:08 | [博物館, 景點, 藝術, 設計] | [人文 & 藝術] |
| 1412 | 阿姆斯特丹轉角遇到ART－De Staalman | QB | 2014-03-31 09:21:51 | [公共藝術, 景點, 設計, 轉角遇到ART, 阿姆斯特丹, 霍夫曼] | [人文 & 藝術] |
| 5238 | 有尊嚴的老去－荷蘭與比利時老年照護 | rhythmsmonthly | 2015-12-04 08:17:25 | [合作文章, 社會, 老年照護, 銀髮族] | [社會 & 文化] |
| 10914 | 在城市中看見攝影，鋪陳觀眾視角：探訪荷蘭布雷達攝影節的導覽與推廣規劃 | daning | 2019-10-30 00:02:48 | [Breda, 展覽, 攝影, 藝術] | [人文 & 藝術] |


In [8]:
# save data to csv
df.to_csv('oe-wp-posts.csv', encoding='utf_8_sig')
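
One thing to keep in mind when reading this file back: the `post_tag` and `category` columns hold Python lists, and a CSV cell can only store their string form. The sketch below mimics that round trip with the standard library only (it does not use pandas), and also shows the byte-order mark that the `utf_8_sig` codec prepends, presumably chosen so that spreadsheet tools detect UTF-8. The sample row and variable names are assumptions for illustration.

```python
import ast
import csv
import io

# Hypothetical row shaped like one record of the table above.
rows = [{"wp:post_id": "267", "title": "Sample post", "post_tag": ["tag-a", "tag-b"]}]

# Write to an in-memory buffer; the list is stored as its str() form.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["wp:post_id", "title", "post_tag"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string, not a list.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
raw_tags = restored[0]["post_tag"]
print(type(raw_tags).__name__, raw_tags)

# ast.literal_eval safely turns the string form back into a list.
tags = ast.literal_eval(raw_tags)
print(tags)

# The utf_8_sig codec puts a 3-byte BOM in front of the encoded text.
bom = "title".encode("utf_8_sig")[:3]
print(bom)
```

With pandas, the same idea can be applied at load time by passing `converters={'post_tag': ast.literal_eval, 'category': ast.literal_eval}` to `pd.read_csv`.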