Before we get started, we'll need to clean our dataset. We'll only do a bit of preprocessing here, since each of the approaches we try expects slightly different preprocessing.

I've downloaded the file RTAnews.OriginalTexts.zip from [https://data.mendeley.com/datasets/322pzsdxwy/1](https://data.mendeley.com/datasets/322pzsdxwy/1). It contains training and test folders, with the news items for each category in separate subfolders.

We'd like to get one dataframe with the text and labels, then split it into training, validation, and test sets.

You can rerun this notebook if you like, but the cleaned files are also available in this GitHub repo as CSV files.

In [0]:
#import libraries
import numpy as np
import pandas as pd
import os

In [0]:
#mount Google Drive to access the raw version of the data.
#if you're trying to run this notebook locally, just skip/delete this cell and change the path in the cell below
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive
/content/gdrive/My Drive/RTAnews_raw


We'll load our data by creating empty dataframes, and then looping through the folders that contain text files for each category of news. 

In [0]:
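path = 'gdrive/My Drive/RTAnews_raw'

# start from an empty dataframe and add one row per text file
train = pd.DataFrame()

for folder in os.listdir(os.path.join(path,'training')):
    file_list = os.listdir(os.path.join(path,'training', folder))
    for file_path in file_list:
        with open(os.path.join(path,'training', folder, file_path)) as f_input:
            temp = (f_input.read())
            df = pd.DataFrame({'text':[temp]})
            df['category'] = folder  # the folder name is the news category
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        train = pd.concat([train, df])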
print('training data loaded!')


test = pd.DataFrame()

for folder in os.listdir(os.path.join(path,'test')):
    file_list = os.listdir(os.path.join(path,'test', folder))
    for file_path in file_list:
        with open(os.path.join(path,'test', folder, file_path)) as f_input:
            temp = (f_input.read())
            df = pd.DataFrame({'text':[temp]})
            df['category'] = folder
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        test = pd.concat([test, df])
print('test data loaded!')

training data loaded!
test data loaded!
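
The two loops above differ only in the folder name, so they could be folded into a single helper. The following is a minimal sketch rather than the code this notebook was run with: `load_split` is a hypothetical name, and `encoding='utf-8'` is an assumption about how the text files are stored. It gathers the rows in a list and builds the dataframe once at the end, which avoids copying a growing dataframe on every file.

```python
import os
import pandas as pd

def load_split(root, split, encoding='utf-8'):
    """Read every file under root/split/<category>/ into one dataframe.

    Hypothetical helper; the default encoding is an assumption.
    """
    rows = []
    split_dir = os.path.join(root, split)
    for folder in sorted(os.listdir(split_dir)):
        folder_path = os.path.join(split_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files sitting next to the category folders
        for file_name in sorted(os.listdir(folder_path)):
            file_path = os.path.join(folder_path, file_name)
            with open(file_path, encoding=encoding) as f_input:
                # one row per news item, labelled with its folder name
                rows.append({'text': f_input.read(), 'category': folder})
    return pd.DataFrame(rows, columns=['text', 'category'])
```

With this helper, the two cells would reduce to `train = load_split(path, 'training')` and `test = load_split(path, 'test')`.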


In [0]:
print(f'Training data size:{len(train)}\n',
      f'Test data size:{len(test)}')

Training data size:16610
 Test data size:11060


That's more test data than we need, so we'll combine everything into one dataframe and make our own split into training, validation, and test sets.

In addition to the category names, we'll want integer values for each category. `sklearn` has a `LabelEncoder` utility for exactly this purpose.
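
To make the encoding concrete, here is a small sketch on made-up category names (the list below is illustrative, not taken from the dataset). It shows how `LabelEncoder` assigns integers and how to get the names back.

```python
from sklearn.preprocessing import LabelEncoder

# toy category names, invented for illustration
categories = ['sports', 'economy', 'sports', 'culture']

le = LabelEncoder().fit(categories)
labels = le.transform(categories)

# classes_ holds the sorted unique names; each label is an index into it
print(le.classes_.tolist())  # ['culture', 'economy', 'sports']
print(labels.tolist())       # [2, 1, 2, 0]

# inverse_transform maps the integer labels back to the category names
names = le.inverse_transform(labels)
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns a model's integer predictions back into readable category names.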



In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

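# combine the original training and test folders into one dataframe
df = pd.concat([train, test], ignore_index=True)

# map each category name to an integer label
le = LabelEncoder().fit(df.category)
df['labels'] = le.transform(df.category)

# hold out 20% of the data, then halve it into validation and test sets
train, test = train_test_split(df, test_size=0.2)
val, test = train_test_split(test, test_size=0.5)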

So now we have 80% of our data in a training set, 10% in a validation set, and 10% in a test set. We'll save each set to a CSV file, and we're done!

In [0]:
train.to_csv(os.path.join(path,'arabic_train.csv'), index=False)
val.to_csv(os.path.join(path,'arabic_val.csv'), index=False)
test.to_csv(os.path.join(path,'arabic_test.csv'), index=False)
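
The two-stage split used earlier can be checked on toy data. The sketch below is an optional variant, not what this notebook ran: `random_state` (the seed 42 is arbitrary) and `stratify` are additions that make the split repeatable and keep the class proportions the same in every set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataframe standing in for the combined data: 100 rows, 2 balanced classes
toy = pd.DataFrame({'text': [f'doc {i}' for i in range(100)],
                    'labels': [i % 2 for i in range(100)]})

# first split: keep 80% for training, hold out 20%
toy_train, held_out = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['labels'])

# second split: halve the held-out rows into validation and test sets
toy_val, toy_test = train_test_split(
    held_out, test_size=0.5, random_state=42, stratify=held_out['labels'])

print(len(toy_train), len(toy_val), len(toy_test))  # 80 10 10
```

Without a fixed seed, rerunning the notebook produces a different split each time, so the saved CSV files would not match the ones in the repo.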