# Parsing Orange Express WordPress post data from XML to CSV

This script converts the post data from a WordPress `.xml` export into a `.csv` file.

Note: the data file `oranjeexpress.WordPress.2020-12-06.xml` was exported from WordPress on Dec 6, 2020.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup 
import re

In [2]:
with open('./oranjeexpress.WordPress.2020-12-06.xml', 'r', encoding='utf8', errors='ignore') as f: 
    data = f.read() 
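
Because `errors='ignore'` silently drops any byte sequence that is not valid UTF-8, it can hide encoding damage in the export. The sketch below uses a made-up byte string (not taken from the export) to compare `'ignore'` with `'replace'`. It only illustrates the decoding behaviour and is not a required change to the cell above.

```python
# Hypothetical sample: valid UTF-8 text with one stray invalid byte (0xFF) in the middle.
raw = "荷蘭".encode("utf8") + b"\xff" + " post".encode("utf8")

ignored = raw.decode("utf8", errors="ignore")    # the invalid byte vanishes without a trace
replaced = raw.decode("utf8", errors="replace")  # the invalid byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
print("replacement characters:", replaced.count("\ufffd"))
```

Counting U+FFFD after decoding with `'replace'` is one way to check whether the export contains damaged bytes before deciding to ignore them.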

In [3]:
soup = BeautifulSoup(data, "xml")

A glance at the XML structure of a single post:

```
<item>
    <title>{title}</title>
    <link>{long url link}</link>
    <pubDate>{published date}</pubDate>
    <dc:creator>{author, `<![CDATA[OliviaDung]]>`}</dc:creator>
    <guid isPermaLink="false">{short url link}</guid>
    <content:encoded>{content, formatted}</content:encoded>
    
    <wp:post_id>267</wp:post_id>
    <wp:post_date><![CDATA[2012-02-07 00:00:00]]></wp:post_date>
    <category domain="post_tag" nicename="%e8%8d%b7%e8%98%ad%e5%82%b3%e7%b5%b1"><![CDATA[傳統]]></category>
    <category domain="category" nicename="{omitted}"><![CDATA[吃喝 &amp; 玩樂]]></category>
</item>
```

In [4]:
def remove_tags(txt):
    """Remove xml/html tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', str(txt))
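
To make the behaviour of `remove_tags` concrete, here is a small sketch with made-up inputs. It also shows two edge cases worth knowing about: a missing tag (`None`) comes back as the string `'None'`, and a tag that spans two lines is not removed because `.` does not match newlines by default. The `remove_tags_multiline` variant is a hypothetical helper, not part of the original notebook.

```python
import re

def remove_tags(txt):
    """Same logic as the cell above: strip anything that looks like <...>."""
    return re.sub(r'<.*?>', '', str(txt))

def remove_tags_multiline(txt):
    """Hypothetical variant: re.DOTALL lets '.' match newlines inside a tag."""
    return re.sub(r'<.*?>', '', str(txt), flags=re.DOTALL)

simple = remove_tags('<title>Hello</title>')
missing = remove_tags(None)            # what happens when find() returns None
wrapped = '<a\nhref="x">link</a>'      # an opening tag broken across two lines

print(repr(simple))
print(repr(missing))
print(repr(remove_tags(wrapped)))            # opening tag survives
print(repr(remove_tags_multiline(wrapped)))  # opening tag removed
```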

In [5]:
# define the tags to be retrieved from each post data (excl. <category>, special fix needed)
metadata = ['title', 'dc:creator', 'wp:post_id', 'wp:post_date']
categoryDomAttr = ['post_tag', 'category']

def filterMetaDataPerPost(post):
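    """
    Extract the targeted metadata fields from a single post (<item>).
    """
    # Name the parser explicitly: without it bs4 picks the best installed HTML parser
    # (lxml here, since the "xml" parser above already requires it) and emits a warning.
    miniSoup = BeautifulSoup(str(post), "lxml")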
    postData = {key: remove_tags(miniSoup.find(key)) for key in metadata}
    
    # special fix for <category> tags since attributes are involved
    for attr in categoryDomAttr:
        postData[attr] = [remove_tags(x).replace('&amp;amp;', '&') 
                          for x in miniSoup.find_all('category', {'domain': attr})]
        
    return postData
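
For comparison, the same fields can be pulled out with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. This is a minimal sketch on a made-up miniature export, not the method used in this notebook. The namespace URIs in `NS` and the helper name `item_to_record` are assumptions; the real URIs are whatever the `xmlns:` declarations at the top of the export say.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check the xmlns declarations on the export's <rss> element.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}

# Made-up miniature export that follows the <item> layout shown earlier.
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Sample post</title>
      <dc:creator><![CDATA[OliviaDung]]></dc:creator>
      <wp:post_id>267</wp:post_id>
      <wp:post_date><![CDATA[2012-02-07 00:00:00]]></wp:post_date>
      <category domain="post_tag" nicename="tag-a"><![CDATA[傳統]]></category>
      <category domain="post_tag" nicename="tag-b"><![CDATA[冬天]]></category>
      <category domain="category" nicename="cat-a"><![CDATA[吃喝 & 玩樂]]></category>
    </item>
  </channel>
</rss>"""

def item_to_record(item):
    """Collect the same fields as filterMetaDataPerPost from one <item> element."""
    record = {key: item.findtext(key, namespaces=NS)
              for key in ("title", "dc:creator", "wp:post_id", "wp:post_date")}
    # <category> is split by its "domain" attribute, as in the original function.
    for domain in ("post_tag", "category"):
        record[domain] = [c.text for c in item.findall(f"category[@domain='{domain}']")]
    return record

root = ET.fromstring(SAMPLE)
records = [item_to_record(item) for item in root.iter("item")]
print(records[0])
```

Since the parser resolves CDATA sections and entities itself, no tag stripping or `&amp;` replacement is needed in this variant.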

In [6]:
# get data!

posts = soup.find_all('item')
postsCleaned = list(map(filterMetaDataPerPost, posts))
df = pd.DataFrame(postsCleaned)
df.set_index('wp:post_id', inplace=True)

In [7]:
# check how the data looks
df.sample(5)

wp:post_id,title,dc:creator,wp:post_date,post_tag,category
8890,2018荷蘭亞洲電影節專訪《轉彎之後》導演黃千殷--一個重新找自己的旅程,SandyTu,2018-03-15 06:52:57,"[CinemAsia, 亞洲, 亞洲電影節, 文化, 社會, 藝術, 電影]",[社會 & 文化]
1735,歡慶百年來第一個國王節－創意不限！,Ching,2014-04-27 13:55:28,"[二手市場, 國王節, 橘色, 購物, 阿姆斯特丹]",[吃喝 & 玩樂]
6633,當水岸再生遇上循環經濟－永續新星De Ceuvel（下）,alleychu,2016-08-24 12:29:31,"[再生, 循環經濟, 永續, 環境, 能源, 阿姆斯特丹]",[環境 & 科學]
3351,荷蘭王國憲法200週年(上)：從共和國走向君主國的崎嶇旅程,LUChen,2014-12-24 00:00:30,"[政治, 歷史, 皇室]",[社會 & 文化]
7227,生命終止與延續課題（下）：荷蘭安樂死相關人員訪談,brontesun,2016-11-25 00:49:36,"[安樂死, 法律, 社會, 老人]",[社會 & 文化]


In [8]:
# save data to csv
df.to_csv('./data/oe-wp-posts.csv', encoding='utf_8_sig')
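
Two details of this save step matter later on. The list-valued columns (`post_tag`, `category`) are written as plain strings, and `utf_8_sig` prepends a byte-order mark, presumably so that spreadsheet tools detect UTF-8. The sketch below reproduces both effects with the standard library's `csv` module on a made-up row. It assumes pandas stringifies lists in the same way, which matches the output shown in the next section.

```python
import csv
import io

# Made-up row with a list-valued column, mimicking one post record.
rows = [{"wp:post_id": "267", "title": "Sample post", "post_tag": ["English", "傳統"]}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["wp:post_id", "title", "post_tag"])
writer.writeheader()
writer.writerows(rows)          # non-string values are written via str()
text = buffer.getvalue()
print(text)

# utf_8_sig adds the three BOM bytes EF BB BF in front of the content.
encoded = text.encode("utf_8_sig")
print(encoded[:3])

# Reading the file back yields a string, not a list.
restored = next(csv.DictReader(io.StringIO(encoded.decode("utf_8_sig"))))
print(type(restored["post_tag"]).__name__, restored["post_tag"])
```

This is why the tags have to be parsed back into lists after the CSV is reloaded.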

### Get post tags

In [1]:
import pandas as pd

In [2]:
# load data
posts = pd.read_csv("./data/oe-wp-posts.csv", encoding="utf_8_sig")

# get a glance of the dataset
posts.head()

,wp:post_id,title,dc:creator,wp:post_date,post_tag,category
0,133,荷式路邊設計－舊衣回收桶,CindyLiao,2014-01-15 19:45:03,"['Breda', '創新', '設計']",['街拍543']
1,223,到底是荷蘭，還是尼德蘭？,OliviaDung,2012-12-27 14:08:28,"['English', '加勒比海荷屬地', '地理', '皇室']",['社會 & 文化']
2,254,希望與絕望的秘密角落－阿姆斯特丹安妮之家,YingChen,2012-02-08 22:57:56,"['English', '二戰', '博物館', '景點', '歷史', '阿姆斯特丹']",['人文 & 藝術']
3,260,阿姆斯特丹的消失八號電車－二戰猶太人電車,OliviaDung,2013-05-30 23:20:29,"['English', '二戰', '博物館', '歷史', '社會', '阿姆斯特丹']",['人文 & 藝術']
4,267,荷蘭冬季暖胃湯－不倒豌豆湯,OliviaDung,2012-02-07 00:00:00,"['English', '傳統', '冬天', '吃', '食譜']",['吃喝 & 玩樂']


In [3]:
def getTags(postTag):
    '''
    Parse the raw `postTag` string (a stringified list) into a list of tags
    '''
    listTags = [tag.strip("'") for tag in postTag.lstrip("[").rstrip("]").split(", ")]
    return listTags
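
The string splitting in `getTags` works for the tags shown above, but it would leave the quotes on a tag that Python wrote with double quotes (for example one containing an apostrophe) and would split a tag that itself contains `", "`. A possible alternative, sketched here with the hypothetical name `parse_tags`, is to let `ast.literal_eval` read the stringified list back. The inputs below are made up for illustration.

```python
import ast

def parse_tags(raw):
    """Hypothetical alternative to getTags: evaluate the stringified list literally."""
    tags = ast.literal_eval(raw)                # safe: only Python literals are accepted
    return [str(tag) for tag in tags if tag]    # drop empty entries

plain = parse_tags("['English', '傳統', '冬天']")
empty = parse_tags("[]")
tricky = parse_tags("[\"King's Day\", 'a, b']")  # apostrophe and embedded ", "

print(plain)
print(empty)
print(tricky)
```

With this variant an empty list stays empty, so the later `filter(None, ...)` step would no longer be needed.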

In [4]:
# aggregate post tags from the dataset
postTags = posts.post_tag.apply(getTags).sum()
postTags = list(filter(None, postTags)) # remove empty entries

print("In total there are %i post tags (without duplicates removed)." % len(postTags))

In total there are 3629 post tags (without duplicates removed).
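
Calling `.sum()` on a Series of lists concatenates them one by one, which is fine for a few thousand tags but can get slow on larger exports. A minimal sketch of a linear-time alternative with `itertools.chain`, using made-up tag lists in place of `posts.post_tag.apply(getTags)`:

```python
from itertools import chain

# Made-up tag lists; [""] mimics what getTags returns for a post without tags.
tag_lists = [["English", "傳統"], [""], ["English", "阿姆斯特丹"]]

# Flatten in one pass and drop empty entries at the same time.
all_tags = [tag for tag in chain.from_iterable(tag_lists) if tag]

print("In total there are %i post tags (without duplicates removed)." % len(all_tags))
```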


In [5]:
# remove duplicate tags
postTagsCleaned = list(set(postTags))

print("In total there are %i unique post tags." % len(postTagsCleaned))

In total there are 674 unique post tags.
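
Note that `list(set(...))` returns the tags in an arbitrary order that can change between runs, so the saved CSV may differ from one run to the next. If a reproducible file or tag frequencies are useful, something like the following sketch could be used instead. The tag list is made up and `collections.Counter` is a swapped-in technique, not what the notebook does.

```python
from collections import Counter

# Made-up flat tag list standing in for postTags.
post_tags = ["English", "傳統", "English", "阿姆斯特丹", "傳統", "English"]

counts = Counter(post_tags)        # tag -> number of occurrences
unique_sorted = sorted(counts)     # deterministic order for the output file

print("In total there are %i unique post tags." % len(unique_sorted))
print(counts.most_common(2))
```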


In [6]:
# save data to csv
pd.DataFrame({ "tag" : postTagsCleaned }).to_csv('./data/oe-wp-posts-tags.csv', encoding='utf_8_sig', index=False)