## Table of Contents

* [1. Extracting the data](#one)
* [2. Using Regex to extract tweet, id and date](#two)
    * [2.1 Tweets](#two1)
    * [2.2 ID](#two2)
    * [2.3 Date](#two3)
* [3. Stripping and making the tweets in proper format](#three)
* [4. Creating a dictionary with ID and tweets](#four)
* [5. Convert special characters to HTML Code](#five)
* [6. Use surrogate pairs to get the emojis](#six)
* [7. Using langid to remove non-English tweets](#seven)
* [8. Writing to XML](#eight)
* [Converting to dict from XML](#nine)
* [Summary](#sum)


### 1.  Extracting the data <a class="anchor" name="one"></a>

The first step of the task is to extract all the text files from the folder 30511704 (my student ID).

Here I use the package os: passing the base path where all my files are stored to the function scandir lists every file in that folder, so that each one can be opened and read.

First I initialize the variable 'final', to which I append the entire content of each text file. After this cell runs, the variable 'final' holds the raw data extracted from all the text files.

In [1]:
import re
import os

In [2]:

basepath = "./data/"
files = []

# Raw text of all files, concatenated
final = " "

# Collect the names of the regular files in the folder (sub-folders are skipped)
with os.scandir(basepath) as entries:
    for entry in entries:
        if entry.is_file():
            files.append(entry.name)

# Read every file as UTF-8 and append its content to 'final'
for name in files:
    path = os.path.join(basepath, name)
    with open(path, mode="r", encoding="utf-8") as fd:
        final += fd.read()

In [3]:
len(files)

2421
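
The same reading step can also be sketched with pathlib. This is only an illustration under assumptions: the helper name read_all_text and the two throwaway files are made up for the demo, and the file names are sorted because os.scandir returns entries in arbitrary order.

```python
import tempfile
from pathlib import Path


def read_all_text(folder):
    """Concatenate the contents of every regular file in `folder`."""
    parts = []
    # sorted() makes the reading order deterministic
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            parts.append(path.read_text(encoding="utf-8"))
    return "".join(parts)


# Demo on a temporary directory holding two made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text('{"id":"1"}', encoding="utf-8")
    Path(tmp, "b.txt").write_text('{"id":"2"}', encoding="utf-8")
    combined = read_all_text(tmp)

print(combined)  # {"id":"1"}{"id":"2"}
```

Building a list and joining it once avoids repeatedly growing one large string, which may matter with a few thousand files.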

### 2. Using Regex to extract tweet, id and date <a class="anchor" name="two"></a>

The next step is to extract the tweet, ID and date using regular expressions.

#### 2.1 Tweets <a class="anchor" name="two1"></a>

The regex used to extract the tweets is "text":(.*?)(?:"id":|"created_at":|"errors":)

This regex can be split into 3 parts.

The first part is the start of the tweet. Every tweet text starts with "text":, so the regex begins with this literal tag, which is matched but not captured.

The second part is the actual tweet text. It can contain any characters, and the number of characters can be 0 or more, so I have written this part as (.*?). This is the only captured subpattern. The ? makes the match lazy, so it stops at the first ending tag rather than the last one.

The third part is the end of the tweet. Here I use the start of the next tweet to find the end of the current one. I noticed that every tweet starts with either "id": or "created_at":, and that some tweets are followed by "errors":. So a tweet text always ends just before "id": or "created_at": or "errors":, and I have given (?:"id":|"created_at":|"errors":) in the regex. This part is a non-capturing subpattern, which is why it starts with ?:

Most of the tweets start with "created_at":, e.g. tweet id: 1258367433028636677 (in the sample file).

After some tweets there might be an errors tag, e.g. tweet id: 1258367433322237952 (in the sample file).
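
To see the three parts at work, here is a small sketch that runs the tweet regex on a made-up fragment. The fragment only imitates the shape described above and is an assumption, not real data from the files.

```python
import re

# Made-up fragment shaped like the records described above (not real data)
sample = (
    '{"data":[{"text":"Stay safe everyone","created_at":"2020-05-07T12:05:50.000Z",'
    '"id":"1"},{"id":"2","text":"Wash your hands"}],"errors":[]}'
)

pattern = r'"text":(.*?)(?:"id":|"created_at":|"errors":)'

# re.DOTALL lets . match newlines, in case a tweet spans several lines
matches = re.findall(pattern, sample, re.DOTALL)
print(matches)  # ['"Stay safe everyone",', '"Wash your hands"}],']
```

Each capture still carries the surrounding quotes and JSON punctuation, which is why a separate stripping step is needed afterwards.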

#### 2.2 ID <a class="anchor" name="two2"></a>

The regex used to extract the ID is "id":"([0-9]+)"

e.g. "id":"1258367433322237952"

The ID starts with the tag "id":" followed by a run of 1 or more digits, so I have given "id":"([0-9]+)". Here "id":" is literal text that is matched but not captured, and the capturing group is the run of digits ([0-9]+).
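
A quick check of the ID regex on a made-up record. The "author_id" field and the non-numeric "abc" value are invented here only to show what the pattern skips.

```python
import re

id_pattern = r'"id":"([0-9]+)"'

# Made-up record: only the first field is a quoted, all-digit "id"
sample = '{"id":"1258367433322237952","author_id":"42","id":"abc"}'

ids = re.findall(id_pattern, sample)
print(ids)  # ['1258367433322237952']
```

"author_id" is skipped because the pattern requires a double quote directly before id, and "abc" is skipped because it contains no digits.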


#### 2.3 Date <a class="anchor" name="two3"></a>


The regex used to extract the date is "created_at":"(\d{4}-\d{2}-\d{2})(?:.*?)"

e.g. "created_at":"2020-05-07T12:05:50.000Z"

Each date field starts with the tag "created_at":". This part is matched but not captured.

After the tag, the date is present in the format (\d{4}-\d{2}-\d{2}). This is the capturing group.

After the date there is a time, which is skipped by the non-capturing (?:.*?), and the field ends with "
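
The sketch below compares the date regex anchored on "created_at" with a bare date pattern. The sample record, including the date written inside the tweet text, is made up for the illustration.

```python
import re

# Made-up record whose tweet text happens to contain a date as well
sample = '{"created_at":"2020-05-07T12:05:50.000Z","text":"Lockdown began on 2020-03-23"}'

anchored = re.findall(r'"created_at":"(\d{4}-\d{2}-\d{2})(?:.*?)"', sample)
bare = re.findall(r'(\d{4}-\d{2}-\d{2})', sample)

print(anchored)  # ['2020-05-07']
print(bare)      # ['2020-05-07', '2020-03-23']
```

Anchoring on the tag keeps exactly one date per record, while the bare pattern also picks up dates mentioned by users.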


In [4]:
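# Every date-like string in the raw text (this also matches dates inside tweet text)
date = re.findall(r'(\d{4}-\d{2}-\d{2})', final)

# Tweet IDs in order of appearance, and the distinct IDs
_id = re.findall(r'"id":"([0-9]+)"', final)
uniq_id = list(set(_id))

# Tweet text; re.DOTALL lets . match newlines inside a tweet
tweet = re.findall(r'"text":(.*?)(?:"id":|"created_at":|"errors":)', final, re.DOTALL)

# Creation dates (the time part is dropped), and the distinct dates
created = re.findall(r'"created_at":"(\d{4}-\d{2}-\d{2})(?:.*?)"', final)
uniq_date = list(set(created))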

### 3. Stripping and making the tweets in proper format <a class="anchor" name="three"></a>

The next step is to remove the unwanted characters from each tweet so that only the tweet text remains. (This could have been handled in the regex itself, but I tried this method instead.)

Here I strip off the characters ,{}]" which are present at the start and end of the tweet.

In [5]:
# Strip the JSON punctuation first, then the surrounding double quotes
tweet = [x.strip(',{}]').strip('"') for x in tweet]
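
As a small sketch of what the two strip calls do, the example below applies them to three made-up captures. The third one is an assumed edge case: a tweet ending in an escaped quote.

```python
# Made-up captures in the shape produced by the tweet regex
raw = [
    '"Stay safe everyone",',
    '"Wash your hands"}],',
    '"She said \\"hi\\"",',   # tweet text ends with an escaped quote \"
]

# Same two steps as the cell above: JSON punctuation first, then quotes
clean = [x.strip(',{}]').strip('"') for x in raw]

for text in clean:
    print(text)
# Stay safe everyone
# Wash your hands
# She said \"hi\
```

str.strip removes every listed character from both ends, so the last example loses the quote that belonged to the tweet itself. Removing exactly one leading and one trailing quote (for example with slicing) would avoid this.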

### 4. Creating a dictionary with ID and tweets  <a class="anchor" name="four"></a>

Now we have 3 lists holding the tweets, IDs and dates.

For further processing, I combine the ID and tweet lists into a single dictionary keyed by ID, which makes the processing easier.

In [6]:
res = dict(zip(_id, tweet))
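
One property of dict(zip(...)) is worth keeping in mind: when an ID occurs more than once, only the last tweet paired with it is kept. A minimal sketch with made-up values:

```python
# Made-up lists in which the ID "1" appears twice
ids = ["1", "2", "1"]
texts = ["first", "second", "first again"]

demo = dict(zip(ids, texts))
print(demo)  # {'1': 'first again', '2': 'second'}
```

This removes duplicate IDs as a side effect, so the dictionary can be shorter than the original lists.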

### 5. Convert special characters to HTML Code <a class="anchor" name="five"></a>

Here we need to convert a few special characters to their HTML codes (entity references) in order to make the text compatible with the XML format:

& , ' , " , > , <

These 5 characters need to be converted.

Some tweets contain \n. This needs to be converted to \\n so that the system can read it.

Some tweets also contain \ . This needs to be converted to \\ so that the system can read it.


In [7]:
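for key, val in res.items():
    # & must be replaced first, otherwise the & of the entities added afterwards would be escaped again
    t = val.replace("&","&amp;").replace("'", "&apos;").replace('\\"',"&quot;").replace(">","&gt;").replace("<","&lt;").replace("\\","\\\\").replace("\n","\\n")
    res[key] = t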

Since we are converting \\ to \\\\, \\u will be converted to \\\\u, so this needs to be converted back to its original form before using the surrogate pass.

In [8]:
for key, val in res.items(): 
    t = val.replace("\\\\u","\\u")
    res[key] = t
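
The five character replacements of this section can also be sketched with the standard library function xml.sax.saxutils.escape. This is an alternative under assumptions, not the method used in the notebook: the helper name to_xml_text is made up, and the sketch does not handle the backslash and \n conversions.

```python
from xml.sax.saxutils import escape


def to_xml_text(text):
    """Escape the five characters that are special in XML."""
    # escape() handles &, < and > by default; the mapping adds the two quotes
    return escape(text, {"'": "&apos;", '"': "&quot;"})


escaped = to_xml_text('Tom & Jerry <3 "cartoons"')
print(escaped)  # Tom &amp; Jerry &lt;3 &quot;cartoons&quot;
```

escape() replaces & before the other characters, so the entities it inserts are not escaped a second time.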

### 6. Use surrogate pairs to get the emojis <a class="anchor" name="six"></a>

The next step is to recover the emojis and other symbols that are encoded as 'surrogate pairs'.

For this we encode the string with surrogatepass and then decode it to the form we want.

We do this encoding and decoding only if the tweet contains the pattern \\uD.

In [9]:
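for key, val in res.items():
    # Escapes starting with \uD are UTF-16 surrogates; only such tweets need decoding.
    # Note: eval() executes the text as Python code, which is unsafe on untrusted input.
    if re.search(r'\\uD', val):
        t = eval("'" + val + "'").encode('utf-16', 'surrogatepass').decode('utf-16')
    else:
        t = val
    res[key] = t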

Since we are converting \\ to \\\\, \\n will be converted to \\\\n, so this needs to be converted back to its original form \\n.

In [10]:
# Undo the doubled backslash in front of n (two backslashes become one)
for key, val in res.items():
    t = val.replace("\\\\n","\\n")
    res[key] = t
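
The surrogate-pair decoding above relies on eval, which executes the tweet text as Python code. As a hedged alternative, the sketch below combines each escaped pair with a regex and the standard UTF-16 formula. The names PAIR and decode_surrogate_pairs are made up for this illustration, and unlike eval the function only touches surrogate pairs and leaves every other escape as it is.

```python
import re

# A high surrogate escape (D800-DBFF) followed by a low surrogate escape (DC00-DFFF)
PAIR = re.compile(r'\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})')


def decode_surrogate_pairs(text):
    """Replace every escaped surrogate pair in `text` with the character it encodes."""
    def combine(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        # Standard UTF-16 formula for the code point behind a surrogate pair
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    return PAIR.sub(combine, text)


sample = r'Stay home \uD83D\uDE00'      # raw string: the backslashes are literal
decoded = decode_surrogate_pairs(sample)
print(decoded == 'Stay home \U0001F600')  # True
```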

### 7. Using langid to remove non-English tweets <a class="anchor" name="seven"></a>

The next step is to remove the tweets which are not in English.

For this, we use the classify function of the langid package, which returns the predicted language code of a tweet. Every tweet whose code is not 'en' is removed.

In [11]:
import langid

In [12]:
delete = [] 
for key, val in res.items(): 
    if langid.classify(str(val))[0]!= 'en': 
        delete.append(key)
for i in delete: 
    del res[i] 
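
The filtering above can also be sketched as a dictionary comprehension that builds a new dictionary instead of deleting keys afterwards. In this sketch keep_language and fake_classify are hypothetical names: fake_classify is only a stand-in so the example runs without langid installed, and in the notebook langid.classify would be passed instead.

```python
def keep_language(tweets, classify, lang="en"):
    """Return a new dict with only the tweets whose predicted language is `lang`."""
    # classify(text) is expected to return a (language_code, score) pair
    return {tid: text for tid, text in tweets.items() if classify(text)[0] == lang}


# Hypothetical stand-in for langid.classify: ASCII-only text counts as English
def fake_classify(text):
    return ("en", -1.0) if text.isascii() else ("other", -1.0)


demo = {"1": "Wash your hands", "2": "Restez chez vous, protégez-vous"}
english_only = keep_language(demo, fake_classify)
print(english_only)  # {'1': 'Wash your hands'}
```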

Now that all the processing is complete, I convert the final dictionary into 2 separate lists to write them to the XML file.

In [13]:
_id = list(res.keys())
tweet = list(res.values())

Here _id and tweet are the 2 lists holding the values after all the processing.

### 8. Writing to XML <a class="anchor" name="eight"></a>

The final step is to write the IDs and tweets collected to an XML file.

First we open a file, 30511704.xml, to write our tweets and IDs.

Next we write the header to this XML file.

In [14]:
header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<data>\n"
output = open("30511704.xml", mode="w", encoding="utf-8")

In [15]:
# Write the whole header in one call instead of character by character
output.write(header)

Next we write the IDs and tweets according to the format of the sample XML file.

In [16]:
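# Map every tweet ID to its creation date using the unfiltered matches, so the
# dates stay aligned with the IDs after the non-English tweets were removed.
# Assumption: every record has exactly one "id" and one "created_at" field.
id_to_date = dict(zip(re.findall(r'"id":"([0-9]+)"', final), created))

for day in uniq_date:
    output.write("<tweets date=\"" + day + "\">\n")
    for tweet_id, text in zip(_id, tweet):
        if id_to_date.get(tweet_id) == day:
            output.write("<tweet id=\"" + tweet_id + "\">" + text.replace("\\n", "\n") + "</tweet>\n")
    output.write("</tweets>\n")
output.write("</data>")
output.close()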
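
As an alternative to writing the tags by hand, the same structure can be sketched with xml.etree.ElementTree, which escapes special characters when serializing. The rows below are made up for the illustration and are not taken from the data set.

```python
import xml.etree.ElementTree as ET

# Made-up rows of (date, tweet id, text)
rows = [
    ("2020-05-07", "1", "Tom & Jerry <3"),
    ("2020-05-07", "2", "Stay safe"),
    ("2020-05-08", "3", "Wash your hands"),
]

root = ET.Element("data")
groups = {}  # date -> its <tweets> element
for day, tweet_id, text in rows:
    if day not in groups:
        groups[day] = ET.SubElement(root, "tweets", date=day)
    node = ET.SubElement(groups[day], "tweet", id=tweet_id)
    node.text = text  # &, < and > are escaped when the tree is serialized

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

With this approach the text must not be escaped by hand beforehand, otherwise an & would end up as &amp;amp; in the file.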

### Converting to dict from XML <a class="anchor" name="nine"></a>

This part is to test if the XML created is correct and readable.

In [17]:
import xmltodict
with open('30511704.xml', 'r', encoding='utf-8') as infile:
    text = infile.read()

In [18]:
final_dic = xmltodict.parse(text)

In [19]:
# Take the tweets of the first <tweets> group (index 0 in the parsed list)
final_lis = final_dic['data']['tweets'][0]['tweet']

In [20]:
final_lis

[OrderedDict([('@id', '1244612928990588929'),
              ('#text',
               "@paddymacc1 People need to learn from  this sad  realities, if not  careful many countries will plunge into  the  same ditch. Covid-19 is real &amp; traumatizing  families. Don't add yours on this list.")]),
 OrderedDict([('@id', '1244612929229590529'),
              ('#text',
               '(City A.M):#Michael Gove overstates #UK Covid-19 testing as government misses target : Senior government minister Michael Gove has been caught out misstating the number of coronavirus tests carried out last week. The The post .. https://t.co/XMcbDdUpF9 https://t.co/1sagt5ryr3')]),
 OrderedDict([('@id', '1244612929800015872'),
              ('#text',
               'Six response plan guidelines for navigating Covid-19 https://t.co/8PGsomHtVF')]),
 OrderedDict([('@id', '1244612929833615364'),
              ('#text',
               "Trump Cabinet's Bible teacher says gays cause 'God's wrath' in COVID-19 blog post ht

Here we are able to parse the XML file successfully.
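
To pick the tweets of one particular date rather than a group by position, a sketch with the standard library parser could look like the one below. The small XML string is made up in the same shape as the output file.

```python
import xml.etree.ElementTree as ET

# Small made-up document in the same shape as the output file
xml_text = (
    '<data>'
    '<tweets date="2020-05-07"><tweet id="1">Stay safe</tweet></tweets>'
    '<tweets date="2020-05-08"><tweet id="2">Wash your hands</tweet></tweets>'
    '</data>'
)

root = ET.fromstring(xml_text)

# XPath-style predicate: the <tweets> group whose date attribute matches
group = root.find("tweets[@date='2020-05-08']")
selected = [(t.get("id"), t.text) for t in group.findall("tweet")]
print(selected)  # [('2', 'Wash your hands')]
```

ET.fromstring raises a ParseError on malformed XML, so a successful parse also serves as a basic well-formedness check.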

### Summary  <a class="anchor" name="sum"></a>

The main task here was to convert the raw text into a proper XML format.

The first and most important step was to extract the tweets, IDs and dates using regex.

The regexes used were:


id  - "id":"([0-9]+)"

tweet - "text":(.*?)(?:"id":|"created_at":|"errors":)

created - "created_at":"(\d{4}-\d{2}-\d{2})(?:.*?)"

The characters ,{}]" were then stripped from the start and end of each tweet.

The IDs and tweets were combined into the dictionary res for further processing.

The characters & , ' , " , > , < were converted to their HTML codes.

Then the non-English tweets were removed.

Finally, the tweets and IDs were written to an XML file following the sample XML file.






