# Data Processing Notebook

This notebook processes data from several sources: the Cochrane Library, Pfizer documents, non-plain counterparts of the Pfizer documents retrieved through the Clinical Trials API, and files from Trial Summaries.

In [13]:
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Loading Datasets

We work with three datasets. The first two are divided into training and testing sets, each containing plain (pls) and non-plain (non_pls) texts; the third provides only a plain-language testing set. All of them are loaded from CSV files for further processing and analysis. The data has already been augmented at the paragraph level and processed through an API to obtain linguistic features.

- **Dataset 1 (Cochrane Library)**:
  - `data_cochrane_train_non_pls`: Non-plain language training set.
  - `data_cochrane_train_pls`: Plain language training set.
  - `data_cochrane_test_non_pls`: Non-plain language testing set.
  - `data_cochrane_test_pls`: Plain language testing set.

- **Dataset 2 (Pfizer and Clinical Trials)**:
  - `data_pfizer_train_non_pls`: Non-plain language training set.
  - `data_pfizer_train_pls`: Plain language training set.
  - `data_pfizer_test_non_pls`: Non-plain language testing set.
  - `data_pfizer_test_pls`: Plain language testing set.

- **Dataset 3 (Trial Summaries)**:
  - `data_trial_summaries_test_pls`: Plain language testing set.

In [14]:
# Load Dataset 1 (Cochrane Library) training and testing sets
data_cochrane_train_non_pls = pd.read_csv('data/data_cochrane_train_non_pls.csv')
data_cochrane_train_pls = pd.read_csv('data/data_cochrane_train_pls.csv')
data_cochrane_test_non_pls = pd.read_csv('data/data_cochrane_test_non_pls.csv')
data_cochrane_test_pls = pd.read_csv('data/data_cochrane_test_pls.csv')

# Load Dataset 2 (Pfizer and Clinical Trials) training and testing sets
data_pfizer_train_non_pls = pd.read_csv('data/data_pfizer_train_non_pls.csv')
data_pfizer_train_pls = pd.read_csv('data/data_pfizer_train_pls.csv')
data_pfizer_test_non_pls = pd.read_csv('data/data_pfizer_test_non_pls.csv')
data_pfizer_test_pls = pd.read_csv('data/data_pfizer_test_pls.csv')

# Load Dataset 3 (Trial Summaries), which has a test set only
data_trial_summaries_test_pls = pd.read_csv('data/data_trial_summaries_test_pls.csv')


## Labeling Data

To facilitate the identification and classification of plain and non-plain texts, we add a `label` column to each dataset:
- A label of `0` is assigned to non-plain texts.
- A label of `1` is assigned to plain texts.

In [15]:
# Add a label column to Dataset 1 (Cochrane Library) training sets
data_cochrane_train_non_pls['label'] = 0
data_cochrane_train_pls['label'] = 1

# Add a label column to Dataset 2 (Pfizer and Clinical Trials) training sets
data_pfizer_train_non_pls['label'] = 0
data_pfizer_train_pls['label'] = 1

# Add a label column to Dataset 1 (Cochrane Library) testing sets
data_cochrane_test_non_pls['label'] = 0
data_cochrane_test_pls['label'] = 1

# Add a label column to Dataset 2 (Pfizer and Clinical Trials) testing sets
data_pfizer_test_non_pls['label'] = 0
data_pfizer_test_pls['label'] = 1

# Add a label column to Dataset 3 (TS)
data_trial_summaries_test_pls['label'] = 1
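
A quick sanity check after labeling is to count the examples per class. The sketch below is only illustrative: it uses two tiny stand-in frames (`toy_non_pls` and `toy_pls`, both hypothetical) instead of the real CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for a non-plain and a plain dataset
toy_non_pls = pd.DataFrame({"total_words": [120, 95, 210]})
toy_pls = pd.DataFrame({"total_words": [80, 60]})

# Same convention as in the notebook: 0 = non-plain, 1 = plain
toy_non_pls["label"] = 0
toy_pls["label"] = 1

# Count examples per class to spot class imbalance early
label_counts = pd.concat([toy_non_pls, toy_pls])["label"].value_counts().sort_index()
print(label_counts.to_dict())
```

Running the same `value_counts()` call on the real frames would show how balanced the plain and non-plain classes are before training.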

## Combining and Shuffling Data

To create comprehensive training and testing datasets, we concatenate the individual datasets (both plain and non-plain texts) into single DataFrames. This helps ensure that the machine learning models have a diverse and representative set of examples to learn from.

After concatenating, we shuffle the data to remove any order bias and ensure a random distribution of samples. The Trial Summaries (TS) data is kept separate from the combined sets and is only shuffled.

In [5]:
# Concatenate training data from both datasets
data_train = pd.concat([data_cochrane_train_non_pls, data_cochrane_train_pls, data_pfizer_train_non_pls, data_pfizer_train_pls])

# Concatenate testing data from both datasets
data_test = pd.concat([data_cochrane_test_non_pls, data_cochrane_test_pls, data_pfizer_test_non_pls, data_pfizer_test_pls])

# Shuffle the training data to ensure random distribution
data_train = data_train.sample(frac=1).reset_index(drop=True)

# Shuffle the testing data to ensure random distribution
data_test = data_test.sample(frac=1).reset_index(drop=True)
data_trial_summaries_test_pls = data_trial_summaries_test_pls.sample(frac=1).reset_index(drop=True)
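
The shuffle above draws a new permutation on every run. If reproducibility matters, one option is to pass `random_state` to `sample`. The following is a minimal sketch on a hypothetical toy frame; the seed value `42` is an arbitrary assumption and is not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for data_train
toy = pd.DataFrame({"file": list("abcdef"), "label": [0, 0, 0, 1, 1, 1]})

SEED = 42  # arbitrary value, an assumption of this sketch

# Fixing random_state makes the permutation identical on every run
shuffled_a = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

# Both shuffles give the same order, and no row is lost or duplicated
print(shuffled_a.equals(shuffled_b))
print(sorted(shuffled_a["file"]) == sorted(toy["file"]))
```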

In [20]:
# Ensure the "histograms" folder exists
os.makedirs("histograms", exist_ok=True)

# Concatenate the Cochrane training data (non-plain and plain)
cochrane_train = pd.concat([data_cochrane_train_non_pls, data_cochrane_train_pls])

# Concatenate the Cochrane testing data (non-plain and plain)
cochrane_test = pd.concat([data_cochrane_test_non_pls, data_cochrane_test_pls])

# Merge train and test data under separate names so that the shuffled
# data_train and data_test built above are not overwritten
data = pd.concat([cochrane_train, cochrane_test])

data_plain = data[data['label'] == 1]
data_no_plain = data[data['label'] == 0]

# Map the labels to their names
label_mapping = {0: 'Technical', 1: 'Plain'}
data['label'] = data['label'].map(label_mapping)

# Columns to plot
columns_to_plot = ['Kincaid', 'ARI', 'Coleman-Liau', 'FleschReadingEase', 'GunningFogIndex', 'LIX', 'SMOGIndex', 'RIX', 
                   'DaleChallIndex', 'total_words', 'total_sentences', 'total_characters', 'passive_voice', 'active_voice', 
                   'passive_toks', 'active_toks', 'verbs', 'nouns', 'adjectives', 'adverbs', 'prepositions', 'auxiliaries', 
                   'conjunctions', 'coord_conjunctions', 'determiners', 'interjections', 'numbers', 'particles', 'pronouns', 
                   'proper_nouns', 'punctuations', 'subordinating_conjunctions', 'symbols', 'other', 'money', 'persons', 
                   'norp', 'facilities', 'organizations', 'gpe', 'products', 'events', 'works', 'languages', 'dates', 'times', 
                   'quantities', 'ordinals', 'cardinals', 'percentages', 'stopwords', 'locations', 'laws', 'characters_per_word', 
                   'syll_per_word', 'words_per_sentence', 'sentences_per_paragraph', 'type_token_ratio', 'characters', 'syllables', 
                   'words', 'wordtypes', 'sentences', 'paragraphs', 'long_words', 'complex_words', 'complex_words_dc', 'tobeverb', 
                   'auxverb', 'conjunction', 'nominalization']

# Generate histograms
for column in columns_to_plot:
    plt.figure(figsize=(12, 8))
    # hue_order fixes the class order; seaborn then builds the legend itself,
    # which avoids mislabeling the classes with a manually supplied label list
    ax = sns.histplot(data=data, x=column, hue='label', hue_order=['Technical', 'Plain'],
                      bins=80, element='bars', edgecolor=None, stat='percent')
    ax.get_legend().set_title('Label')
    plt.title(f'Histogram of {column}')
    plt.ylabel('Percentage of texts (%)')
    plt.savefig(f'histograms/{column}.tiff', format='tiff')
    plt.close()

## Dropping Columns

### Dropping Zero-Value Columns

We identify and remove columns that contain only zero values to streamline the datasets.

1. Identify columns with only zero values.
2. Drop these columns from the training and testing sets.

In [6]:
# Identify columns in the training data that contain only zero values
zeros = data_train.columns[(data_train == 0).all()]
print(zeros)

Index(['conjunctions'], dtype='object')
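
To see how the all-zero check works, here is a small sketch on a hypothetical toy frame (the column values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame; only "conjunctions" is zero in every row
toy = pd.DataFrame({
    "nouns": [3, 0, 5],
    "conjunctions": [0, 0, 0],
    "verbs": [0, 2, 1],
})

# (toy == 0) is a boolean frame; .all() reduces each column to a single boolean
all_zero = toy.columns[(toy == 0).all()]
print(list(all_zero))
```

Columns such as `nouns` contain some zeros but are not selected, because `.all()` requires every row to be zero.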


### Dropping Columns With a Small Non-Zero Percentage

We calculate the percentage of non-zero values for each column in the training dataset. This shows which columns consist almost entirely of zeros and therefore carry little information.

In [7]:
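# True where a value is non-zero
mask = (data_train != 0)
# Note: despite its name, percent_zeros holds the percentage of NON-zero values per column
percent_zeros = mask.sum() / mask.count() * 100
percent_zeros.to_frame()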

| Column | Non-zero values (%) |
|---|---|
| file | 100.000000 |
| Kincaid | 100.000000 |
| ARI | 100.000000 |
| Coleman-Liau | 100.000000 |
| FleschReadingEase | 100.000000 |
| ... | ... |
| tobeverb | 99.971117 |
| auxverb | 82.705123 |
| conjunction | 99.995874 |
| nominalization | 99.971117 |


Next, we set a threshold of 3% to identify columns with a low percentage of non-zero values. Columns below this threshold will be listed for removal.

In [8]:
# Set a threshold for the minimum acceptable percentage of non-zero values
threshold = 3

# Identify columns that fall below the threshold
columns_to_drop = percent_zeros[percent_zeros < threshold].index

# List the columns to be dropped
columns_to_drop

Index(['conjunctions', 'events', 'languages'], dtype='object')
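
The two steps above (percentage of non-zero values, then thresholding) can be sketched on a hypothetical toy frame. The name `percent_nonzero` and the threshold of 30 are choices made for this illustration only; the notebook itself uses `percent_zeros` and a threshold of 3.

```python
import pandas as pd

# Hypothetical toy frame with one dense, one sparse, and one all-zero column
toy = pd.DataFrame({
    "nouns":  [3, 1, 5, 2],
    "events": [0, 0, 0, 1],
    "laws":   [0, 0, 0, 0],
})

# The mean of a boolean mask is the fraction of True values per column
percent_nonzero = (toy != 0).mean() * 100

threshold = 30  # illustrative value; the notebook uses 3
sparse_columns = percent_nonzero[percent_nonzero < threshold].index

print(percent_nonzero.to_dict())
print(list(sparse_columns))
```

Note that an all-zero column has 0% non-zero values, so it always falls below any positive threshold.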

### Dropping Redundant Columns

We also drop `characters`, `sentences`, `words`, and `paragraphs`, because their information is already captured by other columns.

In [9]:
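# Add the redundant columns to the drop list
columns_to_drop = columns_to_drop.append(pd.Index(['characters', 'sentences', 'words', 'paragraphs']))
# Add the all-zero columns found earlier (this repeats 'conjunctions', which is already listed)
columns_to_drop = columns_to_drop.append(zeros)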

In [10]:
columns_to_drop

Index(['conjunctions', 'events', 'languages', 'characters', 'sentences',
       'words', 'paragraphs', 'conjunctions'],
      dtype='object')
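
The list above contains `conjunctions` twice because `Index.append` does not remove duplicates. If a duplicate-free list is preferred, one option is to call `unique()` after appending. This is a sketch with hypothetical index contents, not a change to the notebook's result.

```python
import pandas as pd

# Hypothetical column lists that overlap in "conjunctions"
columns_a = pd.Index(["conjunctions", "events", "languages"])
columns_b = pd.Index(["characters", "conjunctions"])

# append keeps duplicates; unique() drops repeats and keeps first-seen order
merged = columns_a.append(columns_b).unique()
print(list(merged))
```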

### Dropping Columns Based on _p_-values

Based on the hypothesis tests carried out earlier, we also drop the columns that did not meet the _p_-value criterion.

In [11]:
columns_to_drop = columns_to_drop.append(pd.Index(['money', 'passive_voice', 'interjections', 'facilities']))
columns_to_drop

Index(['conjunctions', 'events', 'languages', 'characters', 'sentences',
       'words', 'paragraphs', 'conjunctions', 'money', 'passive_voice',
       'interjections', 'facilities'],
      dtype='object')

In [12]:
# Drop all selected columns from the training, testing, and Trial Summaries datasets
data_train = data_train.drop(columns_to_drop, axis=1)
data_test = data_test.drop(columns_to_drop, axis=1)
data_trial_summaries_test_pls = data_trial_summaries_test_pls.drop(columns_to_drop, axis=1)

Then, we define the feature set (`X`) and label set (`y`) for each split.

In [13]:
X_train = data_train.drop(['label'], axis=1)
y_train = data_train['label']

X_test = data_test.drop(['label'], axis=1)
y_test = data_test['label']

X_ts = data_trial_summaries_test_pls.drop(['label'], axis=1)
y_ts = data_trial_summaries_test_pls['label']

## Normalization of Data

In this section, we normalize the data so that all features are on a comparable scale. Certain columns (listed below) are scaled with min-max normalization, while the remaining count columns are divided by the total number of words in each text.

### Special Columns

The following columns require min-max normalization:

- `total_words`
- `total_sentences`
- `total_characters`
- `characters_per_word`
- `syll_per_word`
- `words_per_sentence`
- `sentences_per_paragraph`
- `type_token_ratio`
- `syllables`
- `wordtypes`
- `Kincaid`
- `ARI`
- `Coleman-Liau`
- `FleschReadingEase`
- `GunningFogIndex`
- `LIX`
- `SMOGIndex`
- `RIX`
- `DaleChallIndex`

In [14]:
# Columns that require min-max normalization
special_columns = ['total_words', 'total_sentences', 'total_characters',                 
                   'characters_per_word','syll_per_word', 'words_per_sentence', 'sentences_per_paragraph',
                   'type_token_ratio', 'syllables', 'wordtypes', 'Kincaid', 'ARI', 'Coleman-Liau', 'FleschReadingEase',
                   'GunningFogIndex', 'LIX', 'SMOGIndex', 'RIX', 'DaleChallIndex']

The following columns hold identifiers or raw text, so we exclude them from normalization.

In [15]:
cols_to_ignore = ['file', 'text']

We normalize the columns not listed in the special columns by dividing their values by the `total_words` column. This ensures these features are scaled appropriately relative to the total word count.

In [16]:
# Normalize columns not listed in special columns by dividing by total_words.
# This must run before min-max scaling, while total_words still holds raw counts.
for column in X_train.columns:
    if column in special_columns or column in cols_to_ignore:
        continue
    X_train[column] = X_train[column] / X_train['total_words']
    X_test[column] = X_test[column] / X_test['total_words']
    X_ts[column] = X_ts[column] / X_ts['total_words']
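
The same per-word normalization can be written without an explicit loop. The sketch below is an equivalent formulation on a hypothetical toy frame; the short `special` and `ignored` lists stand in for `special_columns` and `cols_to_ignore`.

```python
import pandas as pd

# Hypothetical toy frame with raw counts
toy = pd.DataFrame({
    "file": ["a.txt", "b.txt"],
    "total_words": [100, 50],
    "nouns": [20, 5],
    "verbs": [10, 10],
})

special = ["total_words"]  # scaled separately with min-max
ignored = ["file"]         # identifiers, never scaled

count_columns = [c for c in toy.columns if c not in special and c not in ignored]

# Divide every count column by total_words, row by row (axis=0 aligns on the row index)
toy[count_columns] = toy[count_columns].div(toy["total_words"], axis=0)
print(toy)
```

After this step each count column holds a per-word rate, while `total_words` and `file` are left untouched.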

### Min-Max Normalization for Special Columns

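For columns identified as requiring min-max normalization, we scale their values using the minimum and maximum observed across the training and testing sets, so both sets fall in the range 0 to 1. The Trial Summaries set is scaled with the same bounds and can therefore fall slightly outside that range.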

In [17]:
# Initialize dictionaries to store the minimum and maximum values for each special column
maximos = {}
minimos = {}

# Perform min-max normalization for each special column
for column in special_columns:
    if column in cols_to_ignore:
        continue
    # Calculate the minimum and maximum values across training and testing datasets
    minimum = min(X_train[column].min(), X_test[column].min())
    maximum = max(X_train[column].max(), X_test[column].max())
    
    # Store the calculated minimum and maximum values
    maximos[str(column)] = maximum
    minimos[str(column)] = minimum

    # Apply min-max normalization to the training, testing, and Trial Summaries datasets
    X_train[column] = (X_train[column] - minimum) / (maximum - minimum)
    X_test[column] = (X_test[column] - minimum) / (maximum - minimum)
    X_ts[column] = (X_ts[column] - minimum) / (maximum - minimum)
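
The cell above derives the bounds from the training and testing sets together. A common alternative, which is not what this notebook does, is to fit the bounds on the training set only and clip unseen data to the range 0 to 1. The sketch below illustrates that variant; `fit_min_max` and `apply_min_max` are hypothetical helper names and the values are made up.

```python
import pandas as pd

# Hypothetical toy data: bounds are fitted on `train` and reused on `unseen`
train = pd.DataFrame({"Kincaid": [2.0, 6.0, 10.0]})
unseen = pd.DataFrame({"Kincaid": [1.0, 8.0, 12.0]})


def fit_min_max(frame, columns):
    """Return {column: (min, max)} computed on `frame` only."""
    return {c: (frame[c].min(), frame[c].max()) for c in columns}


def apply_min_max(frame, bounds, clip=True):
    """Scale columns with previously fitted bounds; optionally clip to [0, 1]."""
    scaled = frame.copy()
    for column, (low, high) in bounds.items():
        span = high - low
        if span == 0:
            # A constant column carries no information; map it to 0
            scaled[column] = 0.0
            continue
        scaled[column] = (frame[column] - low) / span
        if clip:
            scaled[column] = scaled[column].clip(0.0, 1.0)
    return scaled


bounds = fit_min_max(train, ["Kincaid"])
train_scaled = apply_min_max(train, bounds)
unseen_scaled = apply_min_max(unseen, bounds)
print(unseen_scaled["Kincaid"].tolist())
```

Fitting on the training set alone keeps test-set statistics out of the preprocessing, at the cost of clipping values that lie outside the training range.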


## Combining Normalized Data

In this section, we combine the normalized training and testing datasets back into a single DataFrame for further analysis. We also reattach the labels to the combined dataset.

### Steps:
1. Concatenate the normalized training and testing feature sets.
2. Reattach the labels to the combined dataset.

In [18]:
# Combine the normalized training and testing feature sets into a single DataFrame
data = pd.concat([X_train, X_test])

# Reattach the labels to the combined dataset
data['label'] = pd.concat([y_train, y_test])

# Copy the TS features and reattach their labels
data_ts = X_ts.copy()
data_ts['label'] = y_ts
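
Assigning a Series to a DataFrame column aligns on the index, so features and labels need matching indices for the labels to land on the right rows. One way to make this explicit is `ignore_index=True` on both concatenations. This is a sketch with hypothetical toy data, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy features and labels for two splits
X_a = pd.DataFrame({"nouns": [0.1, 0.2]})
X_b = pd.DataFrame({"nouns": [0.3]})
y_a = pd.Series([0, 1], name="label")
y_b = pd.Series([1], name="label")

# ignore_index=True gives both objects a fresh 0..n-1 index, so the
# assignment below pairs row i of the labels with row i of the features
features = pd.concat([X_a, X_b], ignore_index=True)
labels = pd.concat([y_a, y_b], ignore_index=True)
features["label"] = labels
print(features)
```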

In [19]:
data

Unnamed: 0,file,Kincaid,ARI,Coleman-Liau,FleschReadingEase,GunningFogIndex,LIX,SMOGIndex,RIX,DaleChallIndex,...,syllables,wordtypes,long_words,complex_words,complex_words_dc,tobeverb,auxverb,conjunction,nominalization,label
0,10.1002-14651858.CD003237.pub2-abstract.txt,0.036417,0.027587,0.738893,0.921574,0.033140,0.041748,0.104468,0.028943,0.110047,...,0.065053,0.236364,0.393678,0.244253,0.554598,0.048851,0.005747,0.051724,0.054598,0
1,10.1002-14651858.CD011070.pub2-abstract.txt_ac...,0.058573,0.054424,0.670448,0.915326,0.061864,0.067161,0.176646,0.060099,0.138720,...,0.112028,0.392727,0.365159,0.236181,0.584590,0.028476,0.006700,0.043551,0.050251,0
2,10.1002-14651858.CD008856.pub2-pls.txt_accumul...,0.049334,0.043075,0.688413,0.918900,0.045598,0.051959,0.131758,0.042423,0.061509,...,0.136797,0.338182,0.328058,0.195683,0.366906,0.034532,0.017266,0.035971,0.050360,1
3,10.1002-14651858.MR000043.pub2-abstract.txt_ac...,0.028315,0.024618,0.559319,0.954066,0.033862,0.034057,0.104649,0.024164,0.134249,...,0.048826,0.185455,0.287324,0.205634,0.616901,0.022535,0.000000,0.039437,0.039437,0
4,10.1002-14651858.CD011507.pub2-pls.txt_accumul...,0.058157,0.047716,0.590786,0.911095,0.056511,0.053864,0.159376,0.041305,0.065922,...,0.078007,0.249091,0.273148,0.212963,0.363426,0.050926,0.006944,0.030093,0.034722,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11844,10.1002-14651858.CD011511.pub2-abstract.txt_ac...,0.055242,0.044445,0.644924,0.908730,0.055324,0.056145,0.166429,0.047088,0.135721,...,0.103345,0.310909,0.350598,0.258964,0.605578,0.009960,0.000000,0.039841,0.049801,0
11845,10.1002-14651858.CD010019.pub4-pls.txt_accumul...,0.047232,0.041954,0.562071,0.933279,0.043064,0.046814,0.105070,0.033870,0.040756,...,0.060214,0.194545,0.253165,0.139241,0.301266,0.043038,0.010127,0.045570,0.035443,1
11846,10.1002-14651858.CD011426.pub2-abstract.txt_ac...,0.046779,0.037102,0.753543,0.909529,0.043558,0.052363,0.136434,0.042220,0.113444,...,0.095089,0.290909,0.413717,0.258850,0.553097,0.028761,0.008850,0.035398,0.035398,0
11847,10.1002-14651858.CD004378.pub3-abstract.txt_ac...,0.042606,0.036770,0.590955,0.934243,0.041240,0.047766,0.116300,0.037639,0.107453,...,0.110036,0.300000,0.301711,0.171073,0.496112,0.015552,0.009331,0.023328,0.038880,0


In [20]:
data_ts

Unnamed: 0,file,Kincaid,ARI,Coleman-Liau,FleschReadingEase,GunningFogIndex,LIX,SMOGIndex,RIX,DaleChallIndex,...,syllables,wordtypes,long_words,complex_words,complex_words_dc,tobeverb,auxverb,conjunction,nominalization,label
0,11056-MT_MED_9011_page12.txt,0.009082,0.008457,0.632308,0.964573,-0.002733,0.009854,-0.009014,0.001859,0.038490,...,0.035730,0.160000,0.252525,0.090909,0.350168,0.030303,0.010101,0.033670,0.020202,1
1,1252-MT_MED_9011_page4.txt,0.025508,0.020824,0.575991,0.953215,0.019137,0.025435,0.053206,0.016729,0.034890,...,0.041281,0.127273,0.253247,0.123377,0.331169,0.035714,0.012987,0.022727,0.022727,1
2,1017-MT_MED_9011_page8.txt,0.041861,0.036483,0.710361,0.925649,0.036914,0.050692,0.109818,0.040892,0.072429,...,0.033310,0.054545,0.385593,0.190678,0.423729,0.046610,0.021186,0.038136,0.029661,1
3,11555-MT_MED_9011_page8.txt,0.018527,0.017148,0.714199,0.952129,0.006668,0.020163,0.020414,0.010161,0.046016,...,0.030320,0.123636,0.286290,0.108871,0.370968,0.032258,0.012097,0.028226,0.040323,1
4,11638-MT_MED_9011_page5.txt,0.028544,0.021762,0.580259,0.946244,0.022391,0.023072,0.065498,0.014870,0.013188,...,0.028897,0.030909,0.225000,0.141667,0.262500,0.029167,0.000000,0.008333,0.004167,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029,1092-MT_MED_9011_page1.txt,0.019466,0.013107,0.521313,0.959827,0.010495,0.016049,0.027310,0.009707,0.024140,...,0.053950,0.167273,0.220159,0.098143,0.310345,0.053050,0.007958,0.031830,0.005305,1
1030,11547-MT_MED_9011_page3.txt,0.019441,0.011888,0.407538,0.967844,0.011071,0.006432,0.016621,0.002478,0.045562,...,-0.007687,-0.040000,0.109589,0.068493,0.369863,0.041096,0.000000,0.027397,0.000000,1
1031,11230-MT_MED_9011_page5.txt,0.023610,0.016993,0.567945,0.951788,0.020843,0.024448,0.064962,0.015489,0.022852,...,0.035160,0.078182,0.270073,0.164234,0.299270,0.054745,0.014599,0.025547,0.021898,1
1032,11522-MT_MED_9011_page11.txt,0.038504,0.032445,0.623532,0.934907,0.028178,0.036370,0.069718,0.026022,0.041769,...,0.015516,0.054545,0.266272,0.118343,0.331361,0.041420,0.023669,0.035503,0.017751,1


## Saving the Data

After normalizing and combining the data, we save the cleaned and processed datasets as CSV files for future use: the training set, the testing set, and the Trial Summaries set.

In [79]:
# Ensure the output folder exists
os.makedirs('data_clean', exist_ok=True)

# Combine the training features and labels, then save to data_train.csv
data = pd.concat([X_train, y_train], axis=1)
data.to_csv('data_clean/data_train.csv', index=False)

# Combine the testing features and labels, then save to data_test.csv
data = pd.concat([X_test, y_test], axis=1)
data.to_csv('data_clean/data_test.csv', index=False)

# Save the Trial Summaries set (labels already attached) to data_ts.csv
data_ts.to_csv('data_clean/data_ts.csv', index=False)
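
To confirm that a saved file reloads with the same content, a round trip can be checked as follows. The sketch writes a hypothetical toy frame to a temporary directory, so it does not touch `data_clean`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame shaped like the cleaned data
toy = pd.DataFrame({"file": ["a.txt", "b.txt"], "nouns": [0.25, 0.5], "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "data_clean")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "toy.csv")

    # index=False keeps the row index out of the file, so no extra column appears on reload
    toy.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(list(restored.columns))
```

Without `index=False`, the reloaded frame would gain an extra unnamed index column.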

In [24]:
# Ensure the "histograms_normalized" folder exists
os.makedirs("histograms_normalized", exist_ok=True)

url_test = "data_clean/data_test.csv"
url_train = "data_clean/data_train.csv"

data_test = pd.read_csv(url_test)
data_train = pd.read_csv(url_train)

data = pd.concat([data_train, data_test])
data_plain = data[data['label'] == 1]
data_no_plain = data[data['label'] == 0] 

# Map the labels to their names
label_mapping = {0: 'Technical', 1: 'Plain'}
data['label'] = data['label'].map(label_mapping)

# Columns to plot: everything except the first column (file) and the last (label)
columns_to_plot = data.columns[1:-1]

# Generate histograms
for column in columns_to_plot:
    plt.figure(figsize=(20, 12))
    # hue_order fixes the class order; seaborn then builds the legend itself,
    # which avoids mislabeling the classes with a manually supplied label list
    ax = sns.histplot(data=data, x=column, hue='label', hue_order=['Technical', 'Plain'],
                      bins=50, element='bars', edgecolor=None, stat='percent')
    ax.get_legend().set_title('Label')
    plt.title(f'Histogram of {column}')
    plt.ylabel('Percentage of texts (%)')
    plt.xlabel(f'{column} normalized')
    plt.savefig(f'histograms_normalized/{column}.tiff', format='tiff')
    plt.close()