# Scraping Semi-Structured Data and Exporting It as XML
#### Author Name: Dahye Kim

Date: 02/09/2020

Version: 1.0

Environment: Python 3.7.9 and Jupyter notebook

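Libraries used:
* re (for regular expressions, part of the Python standard library) 
* os (for listing the files in the working directory, part of the Python standard library) 
* json (for parsing the files as JSON in section 2.2, part of the Python standard library) 
* langid (for identifying the language of each tweet; a third-party package that has to be installed separately if it is not already available) 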

## 1.  Import libraries 

In [1]:
import os
# loop through all the documents in working directory 

import re 
# regular expression 

import langid
# classify the language of the tweet

## 2. Parse Tweets

To parse the text files containing the tweets, I used os.listdir() to retrieve the names of the files in the tweetsJSON directory, and then looped through those names. 

In [2]:
fileNames = os.listdir('tweetsJSON')
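
As a minimal sketch of the directory listing step: `os.listdir()` returns names in arbitrary order and also includes sub-directories, so it can help to sort the names and keep only regular files. The helper name `list_data_files` and the sample file names below are hypothetical and only serve the illustration.

```python
import os
import tempfile

def list_data_files(directory):
    """Return the sorted full paths of the regular files in `directory`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip sub-directories such as .ipynb_checkpoints
            paths.append(path)
    return paths

# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ('2020-04-02.txt', '2020-04-01.txt'):
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as handle:
            handle.write('{}')
    os.mkdir(os.path.join(tmp, 'subfolder'))
    found = [os.path.basename(p) for p in list_data_files(tmp)]

print(found)
```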

### 2.1 Parse as Text File Using Regex

In [4]:
tweets = []

for textFileName in fileNames: 
    data = open('tweetsJSON/'+textFileName)
    # opening the file in the working directory 

    for i in data.read().split('},{'):
    # consecutive tweet objects in the file are separated by '},{',
    # so I split the file contents on '},{' to get one chunk per tweet

        if not re.findall(r'"id":"(\d{19})"', i) or \
        not re.findall(r'"text":"(.+?)"', i) or \
        not re.findall(r'"created_at":"(.+?)"', i):    
            continue 
            # every single valid tweet requires the text, the id, and the created date
            # if any of these three fields is missing from the chunk, the chunk is omitted
            # ('or' is needed here: with 'and', a chunk missing only one field would reach
            # the else branch and re.search(...).group(1) would raise an AttributeError)

        else: 
            singleTweet = {}
            # each tweet is collected into its own dictionary 
            
            singleTweet['text']=re.search(r'"text":"(.+?)"', i).group(1)
            # the tweet is stored under the key 'text' in the dictionary singleTweet
            singleTweet['id']=re.search(r'"id":"(\d{19})"', i).group(1)
            # the 19-digit id is stored under the key 'id' in the dictionary 
            singleTweet['createdAt']=re.search(r'"created_at":"(.+?)"', i).group(1)
            # the created date of the tweet is stored under the key 'createdAt' in the dictionary 

            tweets.append(singleTweet)
            # each singleTweet dictionary composed of a single piece of tweet with its relevant information 
            # is appended into the list tweets 
            
    data.close()

tweets = [dict(t) for t in {tuple(d.items()) for d in tweets}]
# any duplicate tweets are removed using set() 
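
The parsing and de-duplication steps above can be sketched on a small in-memory sample. The `raw` string and its ids are made up for illustration and only imitate the layout the regular expressions expect; the real files may differ.

```python
import re

# Hypothetical sample: two identical tweets and one chunk without a created date
raw = ('{"data":[{"id":"1234567890123456789","text":"hello world",'
       '"created_at":"2020-04-01T10:00:00.000Z"},'
       '{"id":"1234567890123456789","text":"hello world",'
       '"created_at":"2020-04-01T10:00:00.000Z"},'
       '{"id":"1234567890123456780","text":"no date here"}]}')

ID_RE = re.compile(r'"id":"(\d{19})"')
TEXT_RE = re.compile(r'"text":"(.+?)"')
DATE_RE = re.compile(r'"created_at":"(.+?)"')

parsed = []
for chunk in raw.split('},{'):
    matches = (ID_RE.search(chunk), TEXT_RE.search(chunk), DATE_RE.search(chunk))
    if not all(matches):
        continue  # skip the chunk if any of the three fields is missing
    tweet_id, text, created = (m.group(1) for m in matches)
    parsed.append({'text': text, 'id': tweet_id, 'createdAt': created})

# Dictionaries are not hashable, so each one is turned into a tuple of
# (key, value) pairs before going through a set, then converted back.
unique = [dict(t) for t in {tuple(d.items()) for d in parsed}]

print(len(parsed), len(unique))
```

Two caveats apply to this approach: a set does not preserve order, so the tweets need to be sorted afterwards, and the non-greedy `(.+?)"` stops at the first double quote, so a tweet containing an escaped quote (`\"`) would be cut short.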

### 2.2 Parse as ``json`` Using the ``json`` Library

In [5]:
import json 

In [6]:
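allTweets = list()
for fileName in fileNames:
    # the with-statement closes the file automatically, even if parsing fails
    with open('tweetsJSON/'+fileName) as handle:
        # each file holds one JSON object whose 'data' key contains the list of tweets
        allTweets.append(json.loads(handle.read())['data'])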

In [7]:
individualTweet= [singleTweet for dailyTweets in allTweets for singleTweet in dailyTweets]
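
A minimal sketch of the same load-and-flatten idea on in-memory strings, assuming each file is a JSON object with a `data` list as in the cell above. The payloads are invented for illustration. It also shows that `json.loads` turns an escaped surrogate pair into the emoji it stands for.

```python
import json
from itertools import chain

# Hypothetical payloads imitating two daily files
daily_files = [
    '{"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second \\uD83D\\uDE00"}]}',
    '{"data": [{"id": "3", "text": "third"}]}',
]

# One list of tweets per file
all_days = [json.loads(payload)['data'] for payload in daily_files]

# Flatten the list of lists into a single list of tweets
flat = list(chain.from_iterable(all_days))

print(len(flat))
```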

## 3. Unescape Backslashes in Each Tweet 

Many of the tweets parsed from the text files contain double backslashes. 
These prevent us from decoding the surrogate pairs and from printing the escape sequences as the characters they stand for. 
Therefore, before writing the tweets into XML, I applied the following steps: 

1. Remove the backslash at the end of a tweet if it has one 
2. Replace the double backslash of an escape sequence with a single backslash 
3. For the emojis, first find the tweets that contain emojis using a regex 
4. After finding the tweets with emoji text, unescape the backslashes 
5. Encode and decode the surrogate pairs 
6. Filter out the tweets that are not in English 
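
Steps 3 to 5 can be sketched on a single string as follows. The helper name `decode_escaped_emoji` and the sample text are hypothetical; the chain of calls is the one used in the notebook.

```python
import re

# Escaped high surrogates start with a literal backslash followed by 'uD'
EMOJI_ESCAPE = re.compile(r'\\uD\S{3}')

def decode_escaped_emoji(text):
    """Turn literal \\uD83D\\uDE00-style escapes into the emoji they encode."""
    if not EMOJI_ESCAPE.search(text):
        return text  # nothing to decode
    return (text.encode('utf-8')
                .decode('unicode_escape')            # literal escapes -> lone surrogates
                .encode('utf-16', 'surrogatepass')   # allow the surrogates to be encoded
                .decode('utf-16'))                   # surrogate pair -> one code point

raw = 'Good morning \\uD83D\\uDE00'
decoded = decode_escaped_emoji(raw)
print(len(raw), len(decoded))
```

One caveat: `unicode_escape` interprets the bytes as Latin-1, so a tweet that already contains non-ASCII characters (for example `é`) would come out garbled by this chain.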

In [8]:
filteredTweets = []

for i in range(len(tweets)): 
# looping through every tweet in the list tweets

    if tweets[i]['text'].endswith('\\'): 
    # if the tweet ends with '\', .decode('unicode_escape') cannot be used 
        tweets[i]['text'] = tweets[i]['text'].rstrip('\\')
        # remove only the trailing backslash; .replace('\\','') would also delete 
        # the backslashes of the escape sequences inside the tweet 
        
    if re.findall(r'(\\n)',tweets[i]['text']): 
    # if the tweet contains a newline escape sequence written as a literal backslash followed by 'n' 
        tweets[i]['text'] = tweets[i]['text'].replace('\\n','\n')
        # replace it with an actual newline character 
        
    if re.findall(r'(\\uD\S{3})',tweets[i]['text']): 
    # find the tweets with escaped emojis by using '\\uD\S{3}'
        tweets[i]['text']=tweets[i]['text'].encode('utf-8').decode('unicode_escape').encode("utf-16", "surrogatepass").decode("utf-16")
        # unescape the backslashes and then encode & decode the surrogate pairs 
        
    if langid.classify(tweets[i]['text'])[0]!='en': continue 
    # omit all the tweets that are not in English 
    # this includes the decoded emojis, which are recognised as Latin by langid.classify()
    
    filteredTweets.append(tweets[i])
    # append all the corrected tweets to filteredTweets list

In [9]:
filteredTweets=sorted(filteredTweets, key = lambda x: x['createdAt'])
# sort the filteredTweets list based on the created date 
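
Sorting on the raw string works when the timestamps are zero-padded ISO 8601 strings in a single time zone, because their alphabetical order then matches their chronological order. The sketch below assumes timestamps shaped like `2020-04-01T09:00:00.000Z`, a format inferred from the `split('T')` calls later in the notebook, and compares the string sort with a sort on parsed `datetime` objects.

```python
from datetime import datetime

# Hypothetical tweets with ISO 8601 timestamps
sample = [
    {'id': 'b', 'createdAt': '2020-04-02T08:15:00.000Z'},
    {'id': 'a', 'createdAt': '2020-04-01T23:59:59.000Z'},
    {'id': 'c', 'createdAt': '2020-04-01T09:00:00.000Z'},
]

# Sort on the raw string, as the notebook does
by_string = sorted(sample, key=lambda t: t['createdAt'])

# Sort on a parsed datetime for comparison
by_datetime = sorted(
    sample,
    key=lambda t: datetime.strptime(t['createdAt'], '%Y-%m-%dT%H:%M:%S.%fZ'),
)

print([t['id'] for t in by_string])
```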

## 4. Writing the XML File 

After filtering the tweets and unescaping the sequences, we can finally write the XML file. 
First, we need to decide on the tags of the XML. 

The tags are as follows: 
1. ```<data>``` - the root tag, containing all the tweets from the filteredTweets list 
2. ```<tweets>``` - this tag contains all the tweets created on one single day. Therefore the attribute of the tag is the created date of the tweets, which is ```date=```
3. ```<tweet>``` - this tag contains an individual tweet and the user ID of the tweet. The attribute of the ```<tweet>``` tag is ```id=```.

Each tag must be closed after the relevant information has been inserted. 
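
The tag layout described above can be sketched with the standard library's `xml.etree.ElementTree` and `itertools.groupby`. This is an alternative to the string concatenation used in this notebook, not the method the notebook follows, and the sample tweets are made up. It has the side benefit that ElementTree closes every tag and escapes characters such as `&` and `<` automatically.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical tweets, already sorted by creation time
sample = [
    {'id': '1', 'createdAt': '2020-04-01T09:00:00.000Z', 'text': 'coffee & code'},
    {'id': '2', 'createdAt': '2020-04-01T18:30:00.000Z', 'text': 'a <b> tag'},
    {'id': '3', 'createdAt': '2020-04-02T07:45:00.000Z', 'text': 'new day'},
]

def day_of(tweet):
    """Return the date part (YYYY-MM-DD) of the tweet's timestamp."""
    return tweet['createdAt'].split('T')[0]

root = ET.Element('data')
# groupby only merges adjacent items, so the input must be sorted by date
for day, group in groupby(sample, key=day_of):
    tweets_el = ET.SubElement(root, 'tweets', date=day)
    for tweet in group:
        tweet_el = ET.SubElement(tweets_el, 'tweet', id=tweet['id'])
        tweet_el.text = tweet['text']

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```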

In [10]:
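tweetXML = open('scrapedTweet.xml', 'w', encoding='utf-8')
# open an empty xml file to write in the tweets 
# the encoding is set explicitly so that it matches the UTF-8 declaration written below
# and so that emojis can be written regardless of the platform's default encoding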

tweetXML.write('''<?xml version="1.0" encoding="UTF-8"?>
<data>\n''')
# insert the XML version and the encoding method, followed by the starting tag of 'data'

for index in range(len(filteredTweets)): 
    if index == 0: 
    # for the first tweet, writing in the date created directly 
    #followed by the actual tweet and the user ID 
        tweetXML.write('''<tweets date="'''+\
                       filteredTweets[index]['createdAt'].split('T')[0]+\
                       '''">\n<tweet id="'''+\
                       filteredTweets[index]['id']+\
                       '''">'''\
                       +filteredTweets[index]['text']\
                       +'</tweet>\n')
        
    elif filteredTweets[index]['createdAt'].split('T')[0] != filteredTweets[index-1]['createdAt'].split('T')[0]: 
    # if the created date of the former tweet and the tweet in current loop differs
    # a new tag is created with a different created date 
        tweetXML.write('''</tweets>\n<tweets date="'''+\
                       filteredTweets[index]['createdAt'].split('T')[0]+\
                       '''">\n<tweet id="'''+filteredTweets[index]['id']+'''">'''+\
                       filteredTweets[index]['text']+'</tweet>\n')  
    else: 
        # if the created date of the former tweet written into the file and that of the tweet in current loop are identical 
        # we do not need to create a new tag with a new created date in it 
        tweetXML.write('''<tweet id="'''+filteredTweets[index]['id']+'''">'''+filteredTweets[index]['text']+'</tweet>\n')
        
tweetXML.write('''</tweets>\n</data>''')
# end the XML file by closing all the tags 

tweetXML.close()
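
One thing the string concatenation above does not handle is tweet text containing `&`, `<` or `>`, which would make the XML file ill-formed. A minimal sketch of how a single `<tweet>` line could be built safely with `xml.sax.saxutils` is shown below. The helper name `tweet_line` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def tweet_line(tweet_id, text):
    """Build one <tweet> element as a string with special characters escaped."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value;
    # escape() replaces &, < and > in the element text
    return '<tweet id=' + quoteattr(tweet_id) + '>' + escape(text) + '</tweet>\n'

line = tweet_line('1234567890123456789', 'Tom & Jerry <3')
print(line)

# Parsing the line back confirms that it is well-formed
round_trip = ET.fromstring(line)
```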

## Bibliography 

How to open multiple files in a directory. (2016, August 17). Stack Overflow. https://stackoverflow.com/questions/38991923/how-to-open-multiple-files-in-a-directory/38992988

How to un-escape a backslash-escaped string? (2009, December 11). Stack Overflow. https://stackoverflow.com/questions/1885181/how-to-un-escape-a-backslash-escaped-string

How to work with surrogate pairs in Python? (2016, July 1). Stack Overflow. https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python

Remove duplicate dict in list in Python. (2012, February 24). Stack Overflow. https://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python

Unescaping escaped characters in a string using Python 3.2. (2012, February 18). Stack Overflow. https://stackoverflow.com/questions/9339630/unescaping-escaped-characters-in-a-string-using-python-3-2