# Data Pre-Processing
#### Author: Dhrruv Tokas
#### Email ID: dhrruvtokas@gmail.com

Date: 1/7/2020

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* os 0.1.4 (mainly for operating-system tasks such as listing directories, included in Anaconda Python 3.6)
* regex 2020.7.14 (mainly for regular-expression operations, included in Anaconda Python 3.6; the code itself imports the standard-library `re` module)
* langid 1.1.6 (mainly for language identification, included in Anaconda Python 3.6)

## Phase 1: Parsing Text Files

# 1. Introduction

This task focuses on extracting the `meaningful` data from the available datasets and parsing the data collected from several `semi-structured` files. The final `XML` file has around `956` `unique` `tweet-ids`, which are listed later in this notebook.

Tasks that were performed:
1. Data Extraction: Extracting relevant information such as `tweet-ids`, `tweet-texts`, and `tweet-dates` from the provided text files `(.txt)`.
2. Parsing Semi-structured text files: Writing this information into a new `XML` file that has its own specified format.
3. Reviewing: Reading the final `XML` file.

Further details are provided in the following sections.

# 2.  Import Libraries 

In [None]:
# Import the libraries required throughout this task
# Only three libraries are used: os, re (regular expressions), and langid
import os # useful for carrying out operating system related tasks
import re # useful for all regular expression operations
import langid # useful for language related tasks
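
`langid.classify` returns a pair of (language code, score), so indexing the result with `[0]` gives the language code (such as `en`) and `[1]` gives the score. The sketch below shows that filtering step in isolation. To keep it runnable without `langid` installed, the classifier is passed in as a parameter; `stub_classify` is a made-up stand-in, not a real language model, and `is_english` is a hypothetical helper name.

```python
def is_english(text, classify):
    """Return True if classify(text) labels the text as English.

    classify is any callable returning (language_code, score),
    which is the shape langid.classify returns.
    """
    language_code, _score = classify(text)
    return language_code == "en"


def stub_classify(text):
    # Made-up stand-in for langid.classify: pure-ASCII text counts as English
    is_ascii = all(ord(char) < 128 for char in text)
    return ("en", -50.0) if is_ascii else ("und", -50.0)


tweets = ["Stay home and stay safe", "Quédate en casa"]
english_only = [tweet for tweet in tweets if is_english(tweet, stub_classify)]
print(english_only)
```

With the real library installed, the call would be `is_english(tweet, langid.classify)`.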


# 3. Hold and Verify

In [None]:
# A set keeps a record of each tweet id that has been written to the XML file, so repeated ids can be skipped
hold_repeated = set() # Starts empty; tweet ids are added as they are written
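
As a minimal sketch of how such a set is used for de-duplication: the membership test has to be made against the id string itself, because strings are what the set stores. The helper name `is_new_id` and the shortened sample ids are illustrative only.

```python
def is_new_id(tweet_id, seen):
    """Return True and record tweet_id if it has not been seen before."""
    if tweet_id in seen:
        return False
    seen.add(tweet_id)
    return True


seen_ids = set()  # plays the role of hold_repeated
sample_ids = ["1001", "1002", "1001"]  # shortened, made-up ids
unique_ids = [tid for tid in sample_ids if is_new_id(tid, seen_ids)]
print(unique_ids)
```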

# 4. Loading and Understanding The Dataset

In [None]:
count = 0 # Count is used to keep a record of each file that will be opened for parsing
current_directory = os.getcwd() # Current working directory
print("Dataset: Semi-structured Text Files:") 
print("\n") #Next line
for root, directory, file_list in os.walk(current_directory): # Walks the current working directory and its subdirectories
    for file_name in file_list: # For each file found in the directory being visited
        if file_name.endswith(".txt"): # Checks if the name of the file ends with ".txt"
            print(os.path.join(root, file_name)) # Displays the path of each semi-structured text file
            count = count + 1 # Increments the count for each text file found
print(f"\nNumber of Semi-structured Files: {count}") # Displays the number of available files in the dataset
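
The same file discovery can be wrapped in a reusable function. This is a sketch only: `find_text_files` is a hypothetical name, and the demonstration runs against a throwaway temporary directory rather than the real dataset.

```python
import os
import tempfile


def find_text_files(top):
    """Return the sorted paths of all .txt files under top (recursive)."""
    matches = []
    for root, _dirs, file_list in os.walk(top):
        for file_name in file_list:
            if file_name.endswith(".txt"):
                matches.append(os.path.join(root, file_name))
    return sorted(matches)


# Demonstration on a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.txt", "b.txt", "notes.md"):
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write("sample")
    found = find_text_files(demo_dir)
    print(len(found))
```

Note that `os.walk` descends into subdirectories, whereas `os.listdir('.')` only looks at the top level, so the two can report different file sets when `.txt` files sit in subfolders.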

# 5. Data Exploration

#### Displays the first 100 characters of the first line of each file

In [None]:
for each_file_present in os.listdir('.'): # For each available file in the current directory
    if each_file_present.endswith('.txt'): # Keeps only .txt files
        with open(each_file_present, "r", encoding='utf-8') as file_open: # Opens the file for reading with utf-8 as its encoding
            line = file_open.readline().strip() # Reads the first line of the file
            print(line[0:100]) # Displays the first 100 characters of that line
        # The with statement closes the file automatically


# 6. Extraction and Parsing

This step extracts all the `required` information from the available `semi-structured` text files and writes the collected data into an `XML` file.
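
To illustrate the extraction step on its own, the sketch below applies three regular expressions to a single made-up line. The layout of `sample_line` is an assumption about what the text files look like, the date pattern is a simplified ISO-style stand-in rather than the pattern used in this notebook, and `extract_fields` is a hypothetical helper name.

```python
import re

ID_PATTERN = re.compile(r'\d{19}')  # a tweet id is a 19-digit number
TEXT_PATTERN = re.compile(r'(?<="text":").*?(?=")')  # text between "text":" and the next "
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')  # simplified YYYY-MM-DD (assumption)


def extract_fields(line):
    """Return (tweet_id, date, text), or None if any field is missing."""
    found_id = ID_PATTERN.search(line)
    found_date = DATE_PATTERN.search(line)
    found_text = TEXT_PATTERN.search(line)
    if found_id is None or found_date is None or found_text is None:
        return None
    return found_id.group(0), found_date.group(0), found_text.group(0)


sample_id = "125" + "0" * 15 + "1"  # made-up 19-digit id
sample_line = (
    '{"id":"' + sample_id + '","text":"Stay home and stay safe",'
    '"created_at":"2020-04-14T10:00:00.000Z"}'
)
print(extract_fields(sample_line))
print(extract_fields("no tweet here"))
```

Checking all three matches for `None` before calling `.group(0)` avoids an `AttributeError` on lines that lack a field. The text pattern stops at the first double quote, so a tweet containing an escaped quote would be cut short.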

In [None]:
write_file = open('dhrruvtokas.xml', "w", encoding="utf-8") # Opens the XML file for writing; "w" mode already empties any existing content
write_file.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>") # Writes the XML declaration (version and encoding) as the first line
write_file.write("\n<data>") # Opens the <data> tag; all records go between <data> and </data>

print("XML Formatted Information:") # A label for the output
print("\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>") # Echoes the XML declaration to the output
print("<data>") # Echoes the opening <data> tag to the output

# Date such as 2020-04-14: century, two-digit year, optional separator, month, hyphen, two-digit day
pattern_date = r'(?:19|20)[0-2][0-9][-/]?(?:0?[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])'
pattern_id = r'\d{19}' # Each tweet id is a 19-digit number
pattern_text = r'(?<="text":").*?(?=")' # Anything between "text":" and the next " is taken as the tweet text

for each_file in os.listdir('.'): # For each file in the current working directory
    if not each_file.endswith('.txt'): # Skips anything that is not a text file
        continue
    with open(each_file, 'r', encoding='utf-8') as open_file: # Opens the file as utf-8; the with statement closes it automatically
        line = open_file.readline().strip() # Reads the first line of the file
    find_pattern_date = re.search(pattern_date, line) # First date in the line, or None
    find_pattern_id = re.search(pattern_id, line) # First tweet id in the line, or None
    find_pattern_text = re.search(pattern_text, line) # First tweet text in the line, or None
    if find_pattern_date is None or find_pattern_id is None or find_pattern_text is None:
        continue # Skips the file if any of the three fields is missing
    tweet_date = find_pattern_date.group(0)
    tweet_id = find_pattern_id.group(0)
    tweet_text = find_pattern_text.group(0)
    if tweet_id in hold_repeated: # Compares the id string (not the match object) with the ids already written
        continue
    # langid.classify() returns (language code, score); [0] keeps only the language code, e.g. "en"
    if langid.classify(tweet_text)[0] != "en":
        continue # Non-English tweets are not written to the XML file
    hold_repeated.add(tweet_id) # Records the id so it is not written again
    write_file.write(f"\n<tweets date=\"{tweet_date}\">") # Opens a <tweets> group carrying the date
    write_file.write(f"\n<tweet id=\"{tweet_id}\">{tweet_text}</tweet>") # Writes the tweet id and text
    write_file.write("\n</tweets>") # Closes the group so every <tweets> tag is balanced
    print(f"<tweets date=\"{tweet_date}\">") # Echoes the same three lines to the output
    print(f"<tweet id=\"{tweet_id}\">{tweet_text}</tweet>")
    print("</tweets>")

print("</data>") # Echoes the closing </data> tag to the output
write_file.write("\n</data>") # Closes the <data> tag in the XML file
write_file.close() # Closes the XML file after writing is completed
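
Writing tags by hand leaves escaping and tag balancing to the author. As an alternative sketch (not the method used in this notebook), the standard-library `xml.etree.ElementTree` module can build the same `<data>` / `<tweets date>` / `<tweet id>` layout, grouping tweets that share a date under one `<tweets>` element and escaping characters such as `&` automatically. The `records` list, its shortened ids, and the function name `build_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_xml(records):
    """Build <data><tweets date=..><tweet id=..>text</tweet></tweets></data>."""
    root = ET.Element("data")
    groups = {}  # maps each date to its <tweets> element
    for tweet_id, date, text in records:
        if date not in groups:
            groups[date] = ET.SubElement(root, "tweets", {"date": date})
        tweet = ET.SubElement(groups[date], "tweet", {"id": tweet_id})
        tweet.text = text
    return root


records = [
    ("1001", "2020-04-14", "Stay home & stay safe"),
    ("1002", "2020-04-14", "Wash your hands"),
    ("1003", "2020-04-15", "Masks <3"),
]
root = build_xml(records)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the tree is serialised in one step, every opening tag gets its closing tag and the special characters in the tweet text cannot break the file.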

# 7. Recorded Tweet ID(s)

The following section displays all the `tweet-ids` that have been retrieved from the dataset, as well as the `total` number of unique `tweet-ids`.

In [None]:
print("Number of Tweet IDs: ",len(hold_repeated)) # Displays the number of recorded unique tweet ids
print("\nList of Tweet IDs:") # Label for tweet IDs
print("\n", hold_repeated) # Displays all the unique tweet ids

# 8. Reviewing The XML File

The entire content documented in the `XML` file will be displayed in the following section 


In [None]:
with open('dhrruvtokas.xml', "r", encoding="utf-8") as open_review: # Opens the XML file for reading its content, for the final review
    print(open_review.read()) # Displays the content of the XML file
# The with statement closes the XML file automatically
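
Beyond printing the file, a quick structural check can confirm that it is well-formed and count the tweets it holds. The sketch below is one way to do that with the standard library. `summarise_xml` is a hypothetical helper, and the sample content is made up and written to a temporary file so the example does not depend on the real output file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def summarise_xml(path):
    """Parse an XML file and return (tweet count, unique id count)."""
    tree = ET.parse(path)  # raises ET.ParseError if the file is not well-formed
    ids = [tweet.get("id") for tweet in tree.getroot().iter("tweet")]
    return len(ids), len(set(ids))


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<data>\n'
    '<tweets date="2020-04-14">\n'
    '<tweet id="1001">Stay home</tweet>\n'
    '<tweet id="1002">Wash your hands</tweet>\n'
    '</tweets>\n'
    '</data>'
)
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.xml")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    total, unique = summarise_xml(demo_path)
print(total, unique)
```

If the two numbers differ, at least one tweet id appears more than once in the file.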