In this notebook, we take two folders of .txt files, one for each class in our binary classification task, and combine them into a single .csv file that we use to fine-tune our model. A .txt file is added to the .csv file only if its filename does not appear in both folders.

Here, we import the required packages and set the paths to the two folders of data.

In [1]:
#import and move to correct directories
import pandas as pd
import os

path_parent = os.path.dirname(os.getcwd())
os.chdir(path_parent)
print(os.getcwd())
path = "data/raw/cs.AI"
path2 = "data/raw/cs.PL"

C:\Users\danie\Documents\COGS402\cogs402longformer


Using a global list named dataframes, we iterate through all files in the first folder, check that each one ends with a .txt extension, and add it to our list of examples only if no file with the same name exists in the other folder. Each item is a dictionary containing the text and the label, which we convert into a Pandas dataframe afterwards. The helper function takes the path to a single file (built from the folder paths defined above) and a label, reads the file, replaces line breaks with spaces, and appends the resulting dictionary to the list.

In [2]:
dataframes = []

#takes in the path to a single .txt file and the label we want to assign to it
def add_file_to_df(file_path, label):
    with open(file_path, 'r', encoding='utf8') as f:
        contents = f.read()
        contents = contents.replace('\r', ' ').replace('\n', ' ')
        d = {"text":contents, "labels":label}
        dataframes.append(d)
        

# iterate through all files in the AI folder
for file in os.listdir(path):
    # only consider plain-text files
    if file.endswith(".txt"):
        # skip files that also appear in the other class's folder
        if file not in os.listdir(path2):
            file_path = f"{path}/{file}"

            # read the file and append it with label 1 (AI)
            add_file_to_df(file_path, label=1)
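
The helper above replaces each line break with a single space, which can leave runs of consecutive spaces where the source file had blank lines or Windows line endings. A minimal sketch of a stricter alternative, assuming collapsed whitespace is acceptable for the model's tokenizer; `normalize_whitespace` is a hypothetical name and not part of the original notebook:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so joining with " " leaves single spaces.
    return " ".join(text.split())


sample = "First line\r\nSecond line\n\nThird\tline  "
print(normalize_whitespace(sample))  # prints "First line Second line Third line"
```

This changes the stored text slightly compared with the two `replace` calls, so it is worth applying consistently to both classes if it is used at all.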

Now we go through the other folder and do the same, but this time we set the label to 0, as these are all programming languages (PL) documents.

In [3]:
# iterate through all files in the PL folder
for file in os.listdir(path2):
    # only consider plain-text files
    if file.endswith(".txt"):
        # skip files that also appear in the other class's folder
        if file not in os.listdir(path):
            file_path = f"{path2}/{file}"

            # read the file and append it with label 0 (PL)
            add_file_to_df(file_path, label=0)
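
Both loops call `os.listdir` on the other folder once per file, which rescans that directory every time. A hedged sketch of the same exclusion rule using sets, so that each folder is listed only once; `unique_txt_files` is a hypothetical helper, and the temporary folders below are placeholders rather than the real data:

```python
import os
import tempfile


def unique_txt_files(folder_a, folder_b):
    """Return the .txt filenames found only in folder_a and only in folder_b."""
    names_a = {f for f in os.listdir(folder_a) if f.endswith(".txt")}
    names_b = {f for f in os.listdir(folder_b) if f.endswith(".txt")}
    # Set difference drops every filename that appears in both folders.
    return sorted(names_a - names_b), sorted(names_b - names_a)


# Small demonstration on temporary folders (not the notebook's data).
with tempfile.TemporaryDirectory() as root:
    dir_a = os.path.join(root, "a")
    dir_b = os.path.join(root, "b")
    os.makedirs(dir_a)
    os.makedirs(dir_b)
    layout = ((dir_a, ["x.txt", "shared.txt"]), (dir_b, ["y.txt", "shared.txt", "notes.md"]))
    for folder, names in layout:
        for name in names:
            with open(os.path.join(folder, name), "w", encoding="utf8") as f:
                f.write("example")
    only_a, only_b = unique_txt_files(dir_a, dir_b)

print(only_a, only_b)
```

Each returned list could then be passed through `add_file_to_df` with the matching label. Sorting is an extra assumption here: it makes the row order reproducible, whereas `os.listdir` returns names in arbitrary order.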


Finally, we convert the list of dictionaries containing our text and labels into the dataframe we need.

In [4]:
df = pd.DataFrame(dataframes)
df

Unnamed: 0,text,labels
0,arXiv:cs/0405102v1 [cs.PL] 27 May 2004 Under ...,1
1,Adaptive Submodularity: Theory and Application...,1
2,Anthropic decision theory for self-locating be...,1
3,Analysis of a Natural Gradient Algorithm on Mo...,1
4,A Simplified Description of Fuzzy TOPSIS Balwi...,1
...,...,...
5345,"The Alma Project, or How First-Order Logic Can...",0
5346,After Compilers and Operating Systems : The Th...,0
5347,The Rough Guide to Constraint Propagation Krzy...,0
5348,Automatic Generation of Constraint Propagation...,0


Let's check how many instances of each class are in our final dataset before we partition the data.

In [6]:
df['labels'].value_counts()

1    2722
0    2628
Name: labels, dtype: int64
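
The counts above work out to roughly 51% for label 1 and 49% for label 0, so the classes are close to balanced. The same check can be expressed as proportions directly. A small sketch on a toy dataframe (the rows are made up for illustration and are not the notebook's data):

```python
import pandas as pd

# Toy data: three examples of class 1 and one example of class 0.
toy = pd.DataFrame({"text": ["a", "b", "c", "d"], "labels": [1, 1, 1, 0]})

# normalize=True returns each class's share of the rows instead of raw counts.
proportions = toy["labels"].value_counts(normalize=True)
print(proportions)
```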

We save this dataframe as data/longdoc.csv, in the data folder that sits above the two raw class folders. Feel free to change the path to something better suited to your project.

In [10]:
df.to_csv("data/longdoc.csv", index=False)

In [11]:
import datasets
from sklearn.model_selection import train_test_split

As a sanity check, we load our new .csv file and confirm that it reads back correctly.

In [12]:
dataset2 = pd.read_csv("data/longdoc.csv")
dataset2

Unnamed: 0,text,labels
0,arXiv:cs/0405102v1 [cs.PL] 27 May 2004 Under ...,1
1,Adaptive Submodularity: Theory and Application...,1
2,Anthropic decision theory for self-locating be...,1
3,Analysis of a Natural Gradient Algorithm on Mo...,1
4,A Simplified Description of Fuzzy TOPSIS Balwi...,1
...,...,...
5345,"The Alma Project, or How First-Order Logic Can...",0
5346,After Compilers and Operating Systems : The Th...,0
5347,The Rough Guide to Constraint Propagation Krzy...,0
5348,Automatic Generation of Constraint Propagation...,0
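
Comparing the two printed tables by eye only covers the first and last few rows. A hedged sketch of a stricter round-trip check, using an in-memory buffer and a toy dataframe rather than the real file; the variable names are illustrative:

```python
import io

import pandas as pd

# Toy rows, including a comma and quotes to exercise CSV escaping.
toy = pd.DataFrame(
    {"text": ['Title, with a comma and "quotes"', "Plain text"], "labels": [1, 0]}
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(toy, reloaded)
print("round trip OK")
```

With the real data, the equivalent call would compare `df` against `dataset2`. One caveat, assuming default `read_csv` settings: a document whose text is an empty string would likely come back as a missing value and fail the comparison.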


See if it can be partitioned into a train and test set for fine-tuning.

In [13]:
df_train, df_test = train_test_split(dataset2, test_size=0.2, shuffle=True, random_state=42)
print(df_train.shape)
print(df_test.shape)

(4280, 2)
(1070, 2)
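
The split above is a plain random split, so the class ratio in each partition can drift slightly from the ratio in the full dataset. A minimal sketch of a stratified variant, assuming that preserving the class ratio in both partitions is desirable; it uses scikit-learn's `stratify` parameter on a toy dataframe, not the notebook's data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 50 examples of each class.
toy = pd.DataFrame(
    {"text": [f"doc {i}" for i in range(100)], "labels": [1] * 50 + [0] * 50}
)

# stratify keeps the label proportions the same in the train and test sets.
train, test = train_test_split(
    toy, test_size=0.2, shuffle=True, random_state=42, stratify=toy["labels"]
)

print(train["labels"].value_counts().to_dict())
print(test["labels"].value_counts().to_dict())
```

With the real data this would mean passing `stratify=dataset2["labels"]` to the existing call, which is an optional change rather than something the original notebook does.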
