Date: 01/09/2021

Version: 4.0

Environment: Python 3.8.3 and Anaconda 6.0.3

Operating System: macOS Big Sur (Version 11.5.1)

Libraries used:    
* re (for regular expressions)
* os (for file/directory related operations)


## Table of Contents
* 1. Introduction
* 2. Import libraries
* 3. Examining and loading data
* 4. Helper Functions
* 5. Segregating each user instance
* 6. Regexes
* 7. Parsing and Operations
* 8. Creating XML string
* 9. Writing XML file
* Conclusion
* References

## 1. Introduction

This task assesses our ability to extract data from our designated semi-structured text file. The dataset consists of cryptocurrency tweets and their associated metadata.

Each text file contains information about the tweets, i.e., `“user name”`, `“user code”`, `“user description”`, `“number of followers”`, `“whether or not the user account is verified”`, `“date of the tweet”`, and the `“tweet text”`. 

Our task is to extract the data from the text file and transform it into XML with the following elements:

1. users: this tag wraps all the users
2. user: this tag wraps all the tweets from a particular user and keeps the metadata for each user, such as the number of followers, whether the account is verified, and the user description. If a user has multiple tweets, the metadata of the latest tweet (i.e., the tweet with the most recent date) must be used.
3. tweets: this tag wraps all the tweets of a specific user
4. tweet: for each user, this tag holds the text of one tweet

## 2. Import libraries

In [1]:
import re
import os

## 3. Examining and loading data

In [2]:
with open('31072100_task1_input.txt', 'r', encoding='utf-8') as tweet_file:
    read_tweets = tweet_file.read()

# checking first 4000 chars for examination/observation    
read_tweets[:4000]

"$user_name.: Sedekah Bang\n$user_code.: 100024962\n$user_desc.: cari dolar\n$No. followers.: 1.0 $verified_user?.: False $tweet_date.: 2021-06-23 17:52:12\n$tweet.: I found #bitcoin in a User vault at this location! Join me playing #coinhuntworld, It's awesome! https://t.co/xBy6ZGO8jZ #cryptocurrency #14303 https://t.co/ZXLt62F6wr\n$uname.: Harshen Hars\n$user_code.: 100000003\n$tweet_date.: 2021-02-19 22:05:02\n$tweet.: $BTC A big chance in a million! Price: \\6114862.0 (2021/02/20 07:04) #Bitcoin #FX #BTC #crypto\n$verified_user?.: False $followerNo.: 648.0 $user_desc.: nan\n$username.: 🌋 SmallState HODL 🌋\n$user_code.: 100035233\n$udesc.: #bitcoin\n$No. followers.: 1368.0 $verified_user?.: False $tweet_date.: 2021-06-22 04:41:08\n$tweet_text.: This is why we #bitcoin https://t.co/CjAiwNE1eR\n$user_name.: aya | ia ♨\n$user_code.: 100061387\n$udesc.: @BTS_twt || #JanielGang || #EunieTy || #ThatGiveawaySquid || #BaniousRaid || $EDDA\n\nyou are sky that full of star ✨🌌\n$No. followers.

## 4. Helper Functions

In [3]:
def datetime_split(date):
    """
     Take a date-time string as input and return the
     year, month, day, hour, minute and second values as integers.
     :param date: date-time in the designated format, e.g. 2021-04-18 13:48:25
    """
    date_part = date.split(' ')[0]
    time_part = date.split(' ')[1]
    
    year, month, day = map(int, date_part.split('-'))
    hour, minute, sec = map(int, time_part.split(':'))
    
    return (year, month, day, hour, minute, sec)


def compare_datetime(date1, date2):
    """
     Take two date-time strings as input and return True if the
     2nd date-time is later than the 1st, False otherwise.
     :param date1: 1st date-time
     :param date2: 2nd date-time
    """
    year1, month1, day1, hour1, minute1, sec1 = datetime_split(date1)
    year2, month2, day2, hour2, minute2, sec2 = datetime_split(date2)
        
    return (year2, month2, day2, hour2, minute2, sec2) > (year1, month1, day1, hour1, minute1, sec1)

In [4]:
# Testing code
date1 = '2021-04-18 13:48:25'
date2 = '2021-02-06 16:48:35'

compare_datetime(date1, date2)

False
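As a cross-check on the helper above, the same comparison can be sketched with the standard library `datetime` module, assuming that importing it is permitted for this task (the notebook itself only uses `re` and `os`). The name `is_later` and the constant `DATE_FORMAT` are hypothetical and not part of the original code.

```python
from datetime import datetime

# Assumed date layout, matching the example 2021-04-18 13:48:25
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_later(date1, date2):
    """Return True if date2 is strictly later than date1."""
    first = datetime.strptime(date1, DATE_FORMAT)
    second = datetime.strptime(date2, DATE_FORMAT)
    return second > first


# Same inputs as the test cell above
result = is_later('2021-04-18 13:48:25', '2021-02-06 16:48:35')
print(result)
```

Unlike the manual split, `strptime` also raises a `ValueError` on malformed dates, which can help to surface unexpected input early.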

The `maketrans()` method returns a mapping table that can be used with the `translate()` method to replace specified characters.

In [5]:
def handle_xml_chars(value, tag_type):
    """
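      Replace XML special characters in value with their entities,
      using the set of replacements that applies to the given tag_type.
      :param value: text to escape
      :param tag_type: 'xml_description' or 'xml_tweet' for element text,
                       'xml_name' for the name attribute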
    """
    if tag_type == 'xml_description' or tag_type == 'xml_tweet':
        return value.translate(str.maketrans({'&': '&amp;', '<':'&lt;','>':'&gt;'}))
    if tag_type == 'xml_name':
        return value.translate(str.maketrans({'"': '&quot;', '&': '&amp;'}))
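For comparison, a minimal sketch of the same escaping using `xml.sax.saxutils.escape` from the standard library, assuming extra imports are allowed. The wrapper names `escape_text` and `escape_attribute` are hypothetical. Note that, unlike the `xml_name` branch above, this version also escapes `<` and `>` inside attribute values, so its output may differ from the sample output that the notebook targets.

```python
from xml.sax.saxutils import escape


def escape_text(value):
    """Escape &, < and > for use inside element text."""
    return escape(value)


def escape_attribute(value):
    """Escape &, <, > and the double quote for a double-quoted attribute."""
    return escape(value, {'"': '&quot;'})


text_example = escape_text('AT&T <3 crypto')
attr_example = escape_attribute('Tom "T" & <Co>')
print(text_example)
print(attr_example)
```

`escape` always replaces `&` first, so the entities it produces are never escaped a second time.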

## 5. Segregating each user instance

We split the input into one block per user instance, where each block holds the metadata and the tweet text of a single tweet. We write a regex to achieve this and then use `findall()` to get each instance as an element of a list.

In [6]:
# matches each user data instance
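tweet_info_regex = re.compile(
    r"(?s)"                          # DOTALL: '.' also matches newlines
    r"(?:\$(?:u(?:ser)?_?name)\.:)"  # start at a user name tag
    r".*?"                           # lazily consume the block
    r"(?=\$(?:u(?:ser)?_?name)|$)"   # stop before the next user name tag or at the end
)

tweet_info = tweet_info_regex.findall(read_tweets)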
print(len(tweet_info))

12129


> We got `12129` user instances/blocks.

**Regex Explanation**

In the above regex, `(?s)` enables the `s` (single-line, `re.DOTALL`) flag, so the dot also matches newline characters instead of stopping at them. We use a non-capturing group `(?:...)` to match the variations of the `user name` tag, followed by the characters `.:`.

The block itself is consumed by `.*?`, a lazy match that stops as early as possible.

The last group is a positive lookahead with two alternatives. The first is the same `user name` tag pattern explained earlier. The second, `$`, asserts the position at the end of the string, or just before the line terminator at the very end of the string (if any).
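The splitting behaviour described above can be tried on a tiny input. The sketch below uses a hypothetical three-user `sample` string that only imitates the tag variations seen in the data. The names `block_regex` and `blocks` are illustrative.

```python
import re

# Hypothetical miniature input imitating the tag variations in the dataset
sample = (
    "$user_name.: Alice\n$user_code.: 1\n$tweet.: first\n"
    "$uname.: Bob\n$user_code.: 2\n$tweet_text.: second\n"
    "$username.: Carol\n$user_code.: 3\n$tweet.: third\n"
)

block_regex = re.compile(
    r"(?s)"                     # DOTALL: '.' also matches newlines
    r"\$u(?:ser)?_?name\.:"     # a user name tag opens a block
    r".*?"                      # lazily consume the block
    r"(?=\$u(?:ser)?_?name|$)"  # stop before the next user name tag or at the end
)

blocks = block_regex.findall(sample)
print(len(blocks))
for block in blocks:
    print(repr(block))
```

Because the lookahead does not consume any text, the next match can start exactly at the following user name tag, so no block is skipped.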

## 6. Regexes

In [7]:
re_uname = re.compile(r"(?:\$u(?:ser)?_?name\.:\s)(.+)")
re_ucode = re.compile(r"(?:\$user_code\.:\s)(.+)")

re_udesc = re.compile(r"(?sm)\$u(?:ser)?_?desc(?:ription)?\.:\s(.*?)(?=^\$\w+\??\.:?)")
re_udesc_alt = re.compile(r"(?sm)\$u(?:ser)?_?desc(?:ription)?\.:\s(.+$)")

re_tdate = re.compile(r"(?:\$tweet_date\.:\s)(.+)")

re_followers = re.compile(r"\$(?:followerNo\.|No\.\sfollowers\.):\s(\d+\.0)")
re_uverified = re.compile(r"\$verified(?:_user)?\?\.:\s(True|False)")

re_tweet = re.compile(r"(?sm)\$tweet(?:_text)?\.:\s(.*?)(?=^\$\w+\??\.:?)")
re_tweet_alt = re.compile(r"(?sm)\$tweet(?:_text)?\.:\s(.+$)")

**Regex Explanation**

1. re_uname
> We create two groups: a non-capturing group that matches the user name tag variations, and a capturing group `(.+)` that holds the actual user name text.

2. re_ucode 
> This again contains two groups: a non-capturing group that matches the user code tag, and a capturing group `(.+)` that returns the actual user code.

3. re_udesc and re_udesc_alt
> * We include two variations to match the user description, because the user description tag has no fixed position within a user instance. It usually sits between other tags, but in a few instances it appears at the end. We try the first pattern and fall back to the second (as shown later) so that both cases are covered.
> * The `re_udesc` regex enables the `(?sm)` flags, where the `m` modifier means multi-line, so `^` also matches at the start of each line. We lazily capture, with `(.*?)`, the data between `\$u(?:ser)?_?desc(?:ription)?\.:\s` (which matches the user description tag variations) and `(?=^\$\w+\??\.:?)` (which matches any other tag at the start of a line). The other variation, `re_udesc_alt`, is similar, except that `(.+$)` matches until the end of the user instance text.

4. re_tweet and re_tweet_alt
> We use the same approach as for the user description, because the position of the tweet text is not fixed within the user instance either.

5. re_tdate
> This matches the tweet date with a non-capturing group for the `$tweet_date` tag and a capturing group `(.+)` for the actual date text.

6. re_followers
> The follower tag has two variations. We handle both and validate that the value is a number with `.0` appended, using `(\d+\.0)`.

7. re_uverified
> The verified tag has two variations. The regex matches both and validates that the tag value is `(True|False)`. Since `(True|False)` is a capturing group, we get its value for later use.
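The primary/alternative pairing can be illustrated on two hypothetical blocks, one with the description in the middle and one with it at the end. The helper `extract` and both sample blocks are assumptions made for this sketch. The two patterns are the description regexes from this section.

```python
import re

re_desc = re.compile(r"(?sm)\$u(?:ser)?_?desc(?:ription)?\.:\s(.*?)(?=^\$\w+\??\.:?)")
re_desc_alt = re.compile(r"(?sm)\$u(?:ser)?_?desc(?:ription)?\.:\s(.+$)")


def extract(primary, fallback, block):
    """Return group 1 of the primary pattern, or of the fallback if it fails."""
    match = primary.search(block) or fallback.search(block)
    return match.group(1).rstrip("\n") if match else None


# Description sits between other tags and spans two lines
middle_block = "$user_name.: Alice\n$udesc.: line one\nline two\n$tweet.: hi\n"
# Description is the last tag of the block
end_block = "$user_name.: Bob\n$tweet.: hi\n$user_desc.: trailing text\n"

middle_desc = extract(re_desc, re_desc_alt, middle_block)
end_desc = extract(re_desc, re_desc_alt, end_block)
print(repr(middle_desc))
print(repr(end_desc))
```

In the second block the primary pattern finds no following tag at the start of a line, so it returns `None` and the alternative pattern takes over.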

## 7. Parsing and Operations

The approach we use is briefly described below:

For every user instance we capture the metadata values and store them in variables. Because "user description" and "user tweet" each have two cases depending on their position in the block, we try the primary regex first; if it returns no match (`None`), we fall back to the alternative regex.

Once everything is captured, we write to a dictionary `tweets_dict` with `user_code` as the key and a structured dictionary of the user's data as the value. `user_tweet` is a list, because a user can have several tweets. The metadata must also reflect the most recent tweet of the user: the `compare_datetime(d1, d2)` function we wrote earlier is used to decide whether to update the stored values when `user_code` is already present in the dictionary.

**Why did we choose a dictionary?**
Python dictionaries are implemented as hash maps, so looking up a key takes O(1) time on average, which is much faster than the O(n) scan needed to find an item in a list. With a dataset of this size, checking for an existing user in a list for every block would take considerably longer. A dictionary also fits the problem naturally, since we need to keep users unique and append tweets to each of them.

Another important function we wrote is `handle_xml_chars(value, tag_type)`, which replaces XML special characters with the corresponding entity according to the tag type **(the replacements required for each tag were decided by carefully inspecting the sample output)**. The purpose is to make sure that the XML we create is valid.
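The "latest tweet wins" rule can be sketched on a few hypothetical, already-parsed records. The names `records`, `users` and `parse` are illustrative, and `datetime` is assumed to be allowed here. The point of the sketch is that the stored date has to move forward together with the metadata, otherwise a later comparison is made against a stale date.

```python
from datetime import datetime

# Hypothetical parsed records: (user_code, followers, tweet_date, tweet_text)
records = [
    ("100", "10.0", "2021-01-01 10:00:00", "first"),
    ("100", "30.0", "2021-03-01 10:00:00", "third"),
    ("100", "20.0", "2021-02-01 10:00:00", "second"),
    ("200", "5.0", "2021-01-15 08:30:00", "hello"),
]


def parse(date_text):
    """Convert the assumed 'YYYY-MM-DD HH:MM:SS' text into a datetime."""
    return datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S")


users = {}
for code, followers, date_text, text in records:
    entry = users.get(code)
    if entry is None:
        # first time we see this user: create the entry
        users[code] = {"no_followers": followers,
                       "tweet_date": date_text,
                       "tweets": [text]}
        continue
    if parse(date_text) > parse(entry["tweet_date"]):
        # newer tweet: refresh the metadata AND the date it belongs to
        entry["no_followers"] = followers
        entry["tweet_date"] = date_text
    entry["tweets"].append(text)

print(users["100"]["no_followers"], users["100"]["tweets"])
```

User `100` ends up with the follower count of the March tweet even though the February tweet is processed last, because the comparison is always made against the most recent date seen so far.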

In [8]:
tweets_dict = {}

# for each user instance in all users
for tweet in tweet_info:

    try:
        # capture each metadata value with the regexes compiled earlier
        user_code = re_ucode.search(tweet).group(1)
        user_name = re_uname.search(tweet).group(1)
        tweet_date = re_tdate.search(tweet).group(1)
        user_followers = re_followers.search(tweet).group(1)
        user_verfied = re_uverified.search(tweet).group(1)
    except AttributeError:
        # search() returned None: a metadata tag is missing, so skip this block
        print('ERROR: Some metadata missing, check this block:', tweet[:80])
        continue


    # description and tweet text can sit mid-block or at the end of the block,
    # so try the primary pattern first and fall back to the alternative
    user_desc = re_udesc.search(tweet) or re_udesc_alt.search(tweet)
    user_tweet = re_tweet.search(tweet) or re_tweet_alt.search(tweet)

    if not (user_code and user_name and tweet_date and user_followers and user_verfied):
        print(user_name)
        print('ERROR: One or more metadata tags returned an empty value!')

    if not (user_desc and user_tweet):
        # neither pattern matched: nothing usable to store for this block
        print(user_name)
        print('ERROR: user_desc or user_tweet is missing!')
        continue

    user_desc = user_desc.group(1)
    user_tweet = user_tweet.group(1)



        
    # Adding user data to dictionary according to condition    
        
    # if user already in dictionary    
    if user_code in tweets_dict:
        
        old_tweet_datetime = tweets_dict[user_code]['tweet_date']
        new_tweet_datetime = tweet_date
        
        # if the new tweet is more recent, refresh the metadata and remember its date
        if compare_datetime(old_tweet_datetime, new_tweet_datetime):
            tweets_dict[user_code]['name'] = handle_xml_chars(user_name, 'xml_name')
            tweets_dict[user_code]['verified_user'] = user_verfied
            tweets_dict[user_code]['user_description'] = handle_xml_chars(user_desc, 'xml_description')
            tweets_dict[user_code]['no_followers'] = user_followers
            # keep the stored date in sync so later comparisons use the latest tweet
            tweets_dict[user_code]['tweet_date'] = new_tweet_datetime
        
        # appending the tweet
        tweets_dict[user_code]['user_tweet'].append(handle_xml_chars(user_tweet, 'xml_tweet'))
    
    # create a new user
    else:
        tweets_dict[user_code] = {
            "name": handle_xml_chars(user_name,'xml_name'),
            "verified_user": user_verfied,
            "user_description": handle_xml_chars(user_desc, 'xml_description'),
            "no_followers": user_followers,
            "tweet_date": tweet_date,
            "user_tweet": [handle_xml_chars(user_tweet, 'xml_tweet')]
        }
        

## 8. Creating XML string

We build the XML as a string using Python f-strings. This gives us the complete document as text, which we can write directly to a new `.xml` file at the end.

In [9]:
# xml's outermost tag open
xml_string = '<users>'

# for each user, create a xml string based block
for user in tweets_dict:
    
    # joining each tweet and placing <tweet> at start of it, along with its closing tag
    all_tweet_xml_str = ''.join(["<tweet>"+tweet.rstrip('\n')+"</tweet>" for tweet in tweets_dict[user]["user_tweet"]])
    
    descript_xml_str = "<user_description>"+tweets_dict[user]["user_description"].rstrip('\n')+"</user_description>"

    xml_string += (

        f'<user name=\"{tweets_dict[user]["name"]}\">'
            f'<verified_user>{tweets_dict[user]["verified_user"]}</verified_user>'
            f'{descript_xml_str}'
            f'<no_followers>{tweets_dict[user]["no_followers"]}</no_followers>'
            f'<tweets>'
            f'{all_tweet_xml_str}'
            f'</tweets>'
        f'</user>'  
    )
    
# xml's outermost tag close
xml_string += '</users>'
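Since the XML is assembled by hand, a quick well-formedness check can be useful. The sketch below parses a string with `xml.etree.ElementTree`, assuming the standard library parser may be used for verification only. The helper `is_well_formed` and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Special characters escaped as entities
good = ('<users><user name="a &amp; b"><tweets>'
        '<tweet>1 &lt; 2</tweet></tweets></user></users>')
# Same content with raw '&' and '<' left in place
bad = ('<users><user name="a & b"><tweets>'
       '<tweet>1 < 2</tweet></tweets></user></users>')

print(is_well_formed(good))
print(is_well_formed(bad))
```

Running the same check on the full string built above would show whether any unescaped special character slipped through.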

The data contains many emojis and text in different languages, so we need to convert these characters into a form that any XML parser can read.

**`encode()`**
The `encode()` method encodes the string using the specified encoding. If no encoding is specified, UTF-8 is used.
`encoding="ascii"`: converts to ASCII encoding
`errors="xmlcharrefreplace"`: replaces each character that cannot be encoded with an XML numeric character reference (for example `&#233;`)

Finally, we decode the resulting bytes back to a `str` (the output is pure ASCII, so decoding as `utf-8` is safe) to get rid of the `b'...'` bytes representation.

In [10]:
final_xml_encoded = xml_string.encode(encoding="ascii",errors="xmlcharrefreplace").decode('utf-8')
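The effect of `xmlcharrefreplace` can be seen on a short hypothetical string. The sketch also parses the result with `xml.etree.ElementTree` (assumed to be available for checking only) to confirm that the character references decode back to the original text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with an accented letter and an emoji
original = "café 🌋"

# Non-ASCII characters become numeric character references such as &#233;
encoded = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(encoded)

# An XML parser turns the references back into the original characters
round_trip = ET.fromstring(f"<tweet>{encoded}</tweet>").text
print(round_trip == original)
```

The encoded text contains only ASCII characters, so the output file can be read regardless of the encoding a consumer assumes.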

In [11]:
len(tweets_dict.keys())

7416

> Unique users are `7416`

## 9. Writing XML file

In [12]:
# writing .xml file with 'w' (write) mode

with open('31072100.xml','w') as f:
    f.write(final_xml_encoded)

## Conclusion

The required formatted XML file was created successfully as per the specification.

## References

* W3Schools. Python String encode() Method. Retrieved from https://www.w3schools.com/python/ref_string_encode.asp
* W3Schools. Python String maketrans() Method. Retrieved from https://www.w3schools.com/python/ref_string_maketrans.asp
* Code Beautify. XMLViewer. Used from https://codebeautify.org/xmlviewer
* Ciro Santilli. How to validate very large XML files? Answer on Stack Overflow. Retrieved from https://stackoverflow.com/questions/7528249/how-to-validate-very-large-xml-files
* Regex 101. Used from https://regex101.com/
* Steve Campbell. Guru99. Python RegEx: re.match(), re.search(), re.findall() with Example. Retrieved from https://www.guru99.com/python-regular-expressions-complete-tutorial.html