# FIT5196 Assessment 1 Task 1
#### Student Name: ZHIYIN WANG
#### Student ID: 31436285

Date: 13/09/2020

Version: 3.0

Environment: Python 3.6.0 

Libraries used:
* os (for reading files from a given path, included in Anaconda Python 3.6)
* re 2.2.1 (for regular expression, included in Anaconda Python 3.6) 
* langid.classify (for classifying language of text, included in Anaconda Python 3.6)


## 1. Introduction
This assessment touches the very first step of analyzing textual data, i.e., extracting data from semi-structured text files. Each student is provided with a data-set that contains information about COVID-19 related tweets.
The task is to extract the data and transform the data into the XML format with the following
elements:
1. id: is a 19-digit number.
2. text: is the actual tweet.
3. created_at: is the date and time that the tweet was created.

And the following constraints must also be satisfied:
1. The “id”s must be unique, so if there are multiple instances of the same tweets, you must
only keep one of them in your final XML file.

2. The non-english tweets should be filtered out from the dataset and the final XML should
only contain the tweets in English language. For the sake of consistency, you must use
the langid package to classify the language of a tweet.

3. The re, os, and the langid packages in Python are the only packages that you are
allowed to use for the task 1 of this assessment (e.g., “pandas” is not allowed!). Any
other packages that you need to “import” before usage is not allowed.

## 2.  Import libraries 

In [1]:
import re
import os
from langid import classify

## 3. Examining and loading data

First, the file path is defined. All input files are in a folder named <student_id>, which sits in the same directory as this notebook.

Before reading the files, empty lists and a dictionary are defined to hold their contents. Each file is loaded into the dictionary with the file name as the key and the text of the file as the value, so every file is easy to access.

The files are read with encoding = utf-8. The emojis in the tweets are stored as escaped \uxxxx sequences (UTF-16 code units written out as text), so they are not displayed correctly straight after loading. This issue is fixed later, in Section 4.

In [2]:
# define files pathways
path = './31436285'
all_files = os.listdir(path)

# define lists to contain tweet information
text_list = []
id_list = []
date_list = []

# read txt files into a dictionary: key = name of the text file (date), value = file contents
data_dict = {}

# read every text file and store its contents in the dictionary
for file in all_files:
    with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
        text = f.read()  # the with-block closes the file, no explicit close() is needed
    data_dict[file] = text

## 4. Extract and transform data

In this step, data is extracted from the dictionary using re.findall and regular expressions. Three lists store the extracted text, id and date values.

The regular expression for the text captures everything after "text":" and before the next quotation mark. re.findall returns only the part matched by the group (.*?), where .* matches any character zero or more times and the trailing ? makes the match lazy, i.e. it uses as few characters as possible.
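The difference between the lazy and the greedy form can be seen on a single record. This is a minimal sketch: the sample line below is hypothetical and only assumes that the fields appear as "key":"value" pairs, which is what the patterns in this notebook expect.

```python
import re

# Hypothetical sample record; the real input files may be laid out differently.
sample = '{"id":"1234567890123456789","text":"Stay safe","created_at":"2020-03-29T00:00:00.000Z"}'

# Lazy: stop at the first quotation mark after "text":"
lazy = re.findall(r'"text":"(.*?)"', sample)

# Greedy: run on to the last quotation mark in the line
greedy = re.findall(r'"text":"(.*)"', sample)

print(lazy)    # only the tweet text
print(greedy)  # the tweet text plus the following field
```

With the greedy pattern the capture spills over into the created_at field, which is why the lazy form is used.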

For the id, (\d+) matches one or more digits after "id":", and the date is matched in a similar way. A non-capturing group (?:) is used inside the date pattern so that the inner group is not returned separately; only the entire date is returned.
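The effect of the non-capturing group can be checked with the same patterns on one record. This is a sketch on a hypothetical sample line; the variable dates_capturing is added here only for comparison and is not part of the notebook.

```python
import re

# Hypothetical sample record; the real input files may be laid out differently.
sample = '{"id":"1234567890123456789","text":"Stay safe","created_at":"2020-03-29T00:00:00.000Z"}'

# id: one or more digits between the quotation marks
ids = re.findall(r'"id":"(\d+)"', sample)

# date with a non-capturing inner group: findall returns plain strings
dates = re.findall(r'"created_at":"(\d+(?:\W\d{2}){2})\w', sample)

# same pattern with a capturing inner group: findall returns tuples instead
dates_capturing = re.findall(r'"created_at":"(\d+(\W\d{2}){2})\w', sample)

print(ids)
print(dates)
print(dates_capturing)
```

When a pattern has more than one capturing group, re.findall returns one tuple per match, so the (?:) form keeps date_list a simple list of strings.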

The requirement states that the ids must be unique, so the id is checked before any further data wrangling. A list named cage is created; when a tweet is processed, its id is added to the list. If a tweet with the same id appears again, it is ignored. Because cage is created inside the make_tweet function, the check covers the ids within the file being processed.
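The duplicate check can be sketched in isolation as follows. The function name drop_duplicate_ids is hypothetical, and a set is used here instead of the list described above, because membership tests on a set do not slow down as more ids are collected.

```python
def drop_duplicate_ids(ids, texts):
    """Keep the first tweet for every id and skip later repeats."""
    seen = set()   # ids that have already been processed
    kept = []      # (id, text) pairs in their original order
    for tweet_id, text in zip(ids, texts):
        if tweet_id in seen:
            continue  # duplicate id: ignore this tweet
        seen.add(tweet_id)
        kept.append((tweet_id, text))
    return kept


# Made-up ids and texts, for illustration only.
example = drop_duplicate_ids(['1', '2', '1'], ['first', 'second', 'first again'])
print(example)
```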

The goal is to write the English tweets to XML in the correct format, so non-English tweets must not be recorded. langid is used to classify the language of each tweet. Tweets that are classified as English are put into the correct format and appended to new_list, which stores all the text that has been processed and is ready to be written to the XML file.

However, the emoji issue needs to be fixed before classifying, otherwise it will affect the results of classify().
Emojis appear in the text as \uxxxx escape sequences, but with a double backslash \\ instead of a single \. replace() and eval() are used to repair this: the doubled backslashes are replaced with single ones and the result is passed through eval() so that the escapes become real characters. The text is then encoded and decoded with utf-16 so that the emojis are displayed correctly.
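As an alternative to eval(), the same repair can be sketched with re.sub alone, which avoids evaluating tweet text as Python code. This is not the approach used in this notebook; the function name decode_unicode_escapes is hypothetical, and the sketch assumes that each emoji is stored as literal \uxxxx escapes with four hexadecimal digits.

```python
import re

# A literal backslash, the letter u and four hexadecimal digits
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_unicode_escapes(text):
    """Turn literal \\uXXXX escapes into characters and merge surrogate pairs."""
    # Step 1: replace every escape with the code unit it names
    decoded = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: a round trip through UTF-16 joins each high/low surrogate pair
    # into one emoji; an unpaired surrogate becomes the replacement character
    return decoded.encode('utf-16', 'surrogatepass').decode('utf-16', 'replace')


# Made-up tweet text containing an escaped surrogate pair
fixed = decode_unicode_escapes('Stay safe \\ud83d\\ude37')
print(fixed == 'Stay safe \U0001F637')
```

Other escapes such as \n are left untouched by this sketch and would need their own handling.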

After extraction and classification, the tweets for each day need to be written in the format (id + text). This format is built inside the make_tweet function so that it can be applied later to every file in the dictionary.

In [5]:
# define function for extracting tweets into corresponding lists and write into xml in correct format
def make_tweet(dict_key): 
    
    # define list for saving output 
    new_list = []
    # define list for id checking
    cage = []
    global date_list
    
    # extract text, id and date from the raw file contents
    text_list = re.findall(r'"text":"(.*?)"', dict_key)  # text: everything up to the next quotation mark
    id_list = re.findall(r'"id":"(\d+)"', dict_key)  # id: one or more digits
    date_list = re.findall(r'"created_at":"(\d+(?:\W\d{2}){2})\w', dict_key)  # date part of the timestamp
    
    
    #global t_date
    #t_date = date_list # copy of date_list
    
    # Before filtering for English tweets, repair the escaped emojis in each tweet
    for i in range(len(text_list)):
        if re.search(r'\\u', text_list[i]):
            # strip trailing backslashes: one would escape the closing quote and make eval() fail
            text_list[i] = text_list[i].rstrip("\\")
            # change the "\\" produced by repr() into "\" so that eval() reads \uXXXX as real characters
            a = eval(repr(text_list[i]).replace('\\\\', '\\'))
            a = a.replace('\\n\\n','\\n')
            # encode and decode again so that surrogate pairs are merged into single emojis
            a = a.encode('utf-16', 'surrogatepass').decode('utf-16')
            text_list[i] = a

    # loop to keep only the English tweets and write them in the correct format
    for i in range(len(text_list)):
        # check the id against the ids already seen; go on only if it is not a duplicate
        if str(id_list[i]) not in cage:
            # check if the tweet is in English
            if classify(text_list[i])[0] == 'en':
                # output string in the correct format
                new_list.append(f'<tweet id="{id_list[i]}">{text_list[i]}</tweet>')
        cage.append(str(id_list[i]))  # record this id so that later duplicates are skipped
        
    return new_list

new_list is returned with every tweet formatted as a string of the form <tweet id="{id}">{text}</tweet>.

## 5. Output data to xml 

This is the section where the formatted data is written to the XML file.

The only issue in this section is making sure that the XML tags are written correctly. The XML needs to start with <data>, followed by a <tweets date="..."> tag, and a new date tag needs to be created for each new day. Some of the files contain tweets from the same date, so the date tag is written conditionally: it is only needed when the tweets of a new day start. This is achieved by checking the date of each file against a list of dates. After the tweets of one file have been written, the date of that file is appended to the list, and the list is checked again when the other files are written.
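The tag logic can be sketched on its own, without any file handling. The function build_xml and its input format are hypothetical, and the sketch makes a simplifying assumption: files from the same date arrive next to each other, so it is enough to compare each date with the previous one instead of keeping a list of all dates seen.

```python
def build_xml(files):
    """files: list of (date, tweet_lines) pairs in the order they are written."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<data>']
    current_date = None  # date of the <tweets> block that is currently open
    for date, tweet_lines in files:
        if date != current_date:
            # a new day starts: close the previous block, if any, and open a new one
            if current_date is not None:
                lines.append('</tweets>')
            lines.append(f'<tweets date="{date}">')
            current_date = date
        lines.extend(tweet_lines)
    if current_date is not None:
        lines.append('</tweets>')  # close the last open block
    lines.append('</data>')
    return '\n'.join(lines)


# Made-up input: two files from the same day, then one from the next day.
xml_text = build_xml([
    ('2020-03-29', ['<tweet id="1">a</tweet>']),
    ('2020-03-29', ['<tweet id="2">b</tweet>']),
    ('2020-03-30', ['<tweet id="3">c</tweet>']),
])
print(xml_text)
```

Building the whole document as a string first also makes the tag structure easy to inspect before anything is written to disk.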

In [6]:
# define function to output the string in correct format in the "make tweet" function into xml
def write_file():  
    # define the list for date checking, if date = date of last file, don't close <tweets> (don't write </tweets>)
    date_cage = []
    # count number of files
    count = 0
    
    # Output data into xml
    f = open('31436285.xml','w',encoding='utf-8')
    # header tag
    f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + "\n")
    f.write("<data>"+ "\n")
    
    # write to xml 
    for i in data_dict:
        file_name = str(i) # define file name
        print(file_name)
        # process a single day, single file with the make_tweet function
        tweets = make_tweet(data_dict[file_name])
        
        # if it is a new date, write end tag for last date and a new xml date title
        if date_list[0] not in date_cage:
            # except for the first file
            if count != 0:
                f.write("</tweets>"+"\n")
            f.write(f"<tweets date=\"{date_list[0]}\">"+'\n')
            
        # write the output from make_tweet function into xml    
        for line in tweets:
            f.write("%s\n" % line)
        
        count += 1  # increment the file counter (the original "=+" assigned +1 instead of adding)
        date_cage.append(date_list[0]) # append date to checking list
    f.write("</tweets>"+"\n")
    # close tag at the end
    f.write("</data>")
    f.close()

## 6. Run and write to xml

Finally, run the function and output all formatted tweet information to xml.

In [None]:
write_file()