### Suspicious writing/article detection

This notebook contains code and documentation for reproducing results for the replication analysis of the paper "Fake news detection: A hybrid CNN-RNN based deep learning approach." A link to the paper is available [here](https://www.researchgate.net/publication/348379370_Fake_news_detection_A_hybrid_CNN-RNN_based_deep_learning_approach).

If you'd like to run all analyses from the original datasets--comprising 40K+ original and modified news articles and excerpts--start with "__To run analysis from scratch__." If you'd like to skip classifier training and testing and calculate summary statistics only, jump to "__To run analysis from provided files__."

__To run analysis from scratch__:

- Confirm that Python v3.10 (or later) and IPython v8 (or later) are installed. 
- On your own machine, move all data files (`real_nytimes.csv`, `modified_nytimes.csv`, `real_reuters.csv`, `modified_reuters.csv`, `isot.csv`, and `fakes.csv`) to the same directory as the bash script `run.sh`. These data files need to be downloaded from the zip archive at this [GDrive link](https://drive.google.com/file/d/1h8ML2LS8g44M2WpyX2L7Lr_bVMHECftL/view?usp=sharing). The `DataPrep` directory (from the zip archive) should also be in the same directory as the bash script and all six CSVs.  
- Confirm also that pyscripts `CNN_revised.py` and `CNN_RNN.py` are in the same directory as all the above. Descriptions of both follow:    
&nbsp;
    - `CNN_RNN.py` is a lightly edited version of the original classifier provided by the authors. We added an argparser to make the script callable from `run.sh`; all model parameters are unchanged. 
    - `CNN_revised.py` ingests modified and original versions of our custom news datasets, preprocesses and pads them together (this is a necessary step, as the model accepts same-length inputs only), trains the classifier on the ISOT dataset, then tests on original and modified news datasets.    
&nbsp;
- At the conclusion of this run (wall time approximately 7.5 hours on 4 threads; see `run.sh` for more info), you'll have generated four files containing console logs and classification reports for 30 seeded runs on each dataset: `final_metrics_isot.txt`, `final_metrics_fakes.txt`, `final_metrics_reu.txt`, and `final_metrics_nyt.txt`. Note: The modified and original (real) classification reports for each of the Reuters and NYTimes datasets are bundled into the same datafile. So `final_metrics_nyt.txt` will contain classification reports for runs on the real and modified NYTimes datasets. These reports contain accuracy, FPR, FNR, and other performance measurements. 
- You'll also optionally generate the label files for all modified datasets. These are named `y_bin_pred_original_{nyt/reu}_{randseed}.csv`. There are 60 of these: one for each combination of nyt/reu and every seed in our list of 30 random seeds. 
- To complete analysis on output files, continue with instructions in the next section. 
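
Before launching `run.sh`, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_inputs` is hypothetical, and it assumes the files sit directly in the directory you pass in.

```python
from pathlib import Path

# File and directory names taken from the setup steps above.
REQUIRED_FILES = [
    "real_nytimes.csv", "modified_nytimes.csv",
    "real_reuters.csv", "modified_reuters.csv",
    "isot.csv", "fakes.csv",
    "run.sh", "CNN_revised.py", "CNN_RNN.py",
]
REQUIRED_DIRS = ["DataPrep"]


def missing_inputs(workdir="."):
    """Return the required files/directories that are absent from workdir."""
    root = Path(workdir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


absent = missing_inputs(".")
if absent:
    print("Missing inputs:", ", ".join(absent))
else:
    print("All inputs found.")
```

An empty list means every expected input was found; anything else names what still needs to be moved next to `run.sh`.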


__To run analysis from provided files__:

- Confirm that `final_metrics_isot.txt`, `final_metrics_fakes.txt`, `final_metrics_reu.txt`, and `final_metrics_nyt.txt` are available in the same directory as this notebook. 
- Run cells in order.* Note that the classification reports for both modified news datasets in the `final_metrics_` data files cover the whole set of 100 excerpts, including the 50 excerpts that are identical in the `real` and `modified` versions of each news dataset. 
- In our paper, we report accuracy and FNRs over the set of 50 modified excerpts only. To calculate these statistics, see Section "Modified Dataset Statistics." These require the label prediction files `y_bin_pred_original_{nyt/reu}_{randseed}.csv` mentioned previously. These files contain the label (true/false) predictions output by the trained model. If you skipped the previous section, sample label files are provided in the `outputs_and_analysis` directory. __Note__: yes, it'd definitely be easier to compute these stats in the pyscripts themselves and output directly to console or a self-contained datafile. These updates are in progress; we're doing things the kludgy way just for now. 

*For the sake of explainability and transparency, we err on the side of verbosity in this notebook: though in-order execution is recommended, we repeat the same function calls for each unique dataset such that each analysis cell can be run independently and out of order without affecting results. 

In [10]:
import re
import pandas as pd 
import numpy as np
from scipy.stats import ttest_ind

-----------------------------------------------------
__(1) ISOT dataset analysis__

In [35]:
# ISOT analysis: training on 0.8 ISOT, validating/testing on 0.2 ISOT

# pull acc, FPR, FNR from ISOT output files
with open("final_metrics_isot.txt", "r") as f:
    text = f.read()

seed_matches = re.findall(r'RANDOM SEED:\s*(\d+)', text)
accuracy_matches = re.findall(r'accuracy\s+([\d.]+)', text)
fpr_matches = re.findall(r'FPR:\s+\[([^\]]+)\]', text)
fnr_matches = re.findall(r'FNR:\s+\[([^\]]+)\]', text)

# collect one (seed, accuracy, FPR list, FNR list) tuple per seeded run
parsed = []
for seed, acc, fpr_str, fnr_str in zip(seed_matches, accuracy_matches, fpr_matches, fnr_matches):
    seed = int(seed)
    acc = float(acc)
    fpr = [float(x) for x in fpr_str.strip().split()]
    fnr = [float(x) for x in fnr_str.strip().split()]
    parsed.append((seed, acc, fpr, fnr))

# Note: acc, fpr, and fnr hold the values from the last parsed run only;
# per-run values for all seeds are stored in `parsed`.
print(f"acc(mean, var) : ({np.mean(acc):.2f}, {np.var(acc):.2f})")
print(f"fpr(mean, var) : ({np.mean(fpr):.2f}, {np.var(fpr):.2f})")
print(f"fnr(mean, var) : ({np.mean(fnr):.2f}, {np.var(fnr):.2f})")

## optional: uncomment lines below to output extracted stats to tsv    
# with open("final_extracted_metrics_isot.tsv", "w") as out:
#     out.write("Run\tSeed\tAccuracy\tFPR_1\tFNR_1\n")
#     for i, (seed, acc, fpr, fnr) in enumerate(parsed, 1):
#         out.write(f"{i}\t{seed}\t{acc:.4f}\t{fpr[1]:.6f}\t{fnr[1]:.6f}\n")


acc(mean, var) : (1.00, 0.00)
fpr(mean, var) : (0.34, 0.22)
fnr(mean, var) : (0.33, 0.22)
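
Each tuple in `parsed` holds one run, so statistics across all seeded runs can be computed by first collecting one value per run. The following is a minimal sketch with made-up sample values (not results from our runs). It assumes, as the optional TSV writer above does, that index 1 of each FPR/FNR list is the rate of interest; the helper `summarize_runs` is hypothetical.

```python
from statistics import mean, pvariance


def summarize_runs(parsed):
    """Return {metric: (mean, population variance)} across runs.

    `parsed` is a list of (seed, acc, fpr_list, fnr_list) tuples.
    pvariance matches np.var's default (ddof=0).
    """
    accs = [acc for _, acc, _, _ in parsed]
    fprs = [fpr[1] for _, _, fpr, _ in parsed]
    fnrs = [fnr[1] for _, _, _, fnr in parsed]
    return {
        "acc": (mean(accs), pvariance(accs)),
        "fpr": (mean(fprs), pvariance(fprs)),
        "fnr": (mean(fnrs), pvariance(fnrs)),
    }


# Hypothetical values for three runs, for illustration only.
sample = [
    (384, 0.90, [0.05, 0.10], [0.15, 0.20]),
    (328, 0.80, [0.10, 0.20], [0.25, 0.30]),
    (479, 0.70, [0.15, 0.30], [0.35, 0.40]),
]
for name, (m, v) in summarize_runs(sample).items():
    print(f"{name}(mean, var) : ({m:.2f}, {v:.4f})")
```

The same function would apply to the `parsed` list built in either the ISOT or the FAKES cell.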


-----------------------------------------------------
__(2) FAKES dataset analysis__

In [36]:
## FAKES analysis: training on ISOT, testing on FAKES

## pull acc, FPR, FNR from FAKES output files

with open("final_metrics_fakes.txt", "r") as f:
    text = f.read()

seed_matches = re.findall(r'RANDOM SEED:\s*(\d+)', text)
accuracy_matches = re.findall(r'accuracy\s+([\d.]+)', text)
fpr_matches = re.findall(r'FPR:\s+\[([^\]]+)\]', text)
fnr_matches = re.findall(r'FNR:\s+\[([^\]]+)\]', text)

# collect one (seed, accuracy, FPR list, FNR list) tuple per seeded run
parsed = []
for seed, acc, fpr_str, fnr_str in zip(seed_matches, accuracy_matches, fpr_matches, fnr_matches):
    seed = int(seed)
    acc = float(acc)
    fpr = [float(x) for x in fpr_str.strip().split()]
    fnr = [float(x) for x in fnr_str.strip().split()]
    parsed.append((seed, acc, fpr, fnr))

# Note: acc, fpr, and fnr hold the values from the last parsed run only;
# per-run values for all seeds are stored in `parsed`.
print(f"acc(mean, var) : ({np.mean(acc):.2f}, {np.var(acc):.2f})")
print(f"fpr(mean, var) : ({np.mean(fpr):.2f}, {np.var(fpr):.2f})")
print(f"fnr(mean, var) : ({np.mean(fnr):.2f}, {np.var(fnr):.2f})")

## optional: output extracted stats to tsv 
# with open("final_extracted_metrics_fakes.tsv", "w") as out:
#     out.write("Run\tSeed\tAccuracy\tFPR_1\tFNR_1\n")
#     for i, (seed, acc, fpr, fnr) in enumerate(parsed, 1):
#         out.write(f"{i}\t{seed}\t{acc:.4f}\t{fpr[1]:.6f}\t{fnr[1]:.6f}\n")

acc(mean, var) : (0.58, 0.00)
fpr(mean, var) : (0.50, 0.25)
fnr(mean, var) : (0.50, 0.25)


-----------------------------------------------------
__(3) Reuters original and modified dataset analysis__

Note: modified dataset statistics are reported over the whole set of 100 excerpts, including those 50 excerpts that are unchanged between original and modified datasets. For statistics over the set of modified articles only, see the next section (3.5). 

In [22]:
# generate separate extraction files for original and modified Reuters datasets

input_file = "final_metrics_reu.txt"
orig_output_file = "orig_results_reu.tsv"
mod_output_file = "mod_results_reu.tsv"

with open(input_file, 'r') as f:
    lines = f.readlines()

reu_orig_results = []
reu_mod_results = []

i = 0
while i < len(lines):
    line = lines[i]

    # fetch random seed 
    if "RANDOM SEED:" in line:
        seed_match = re.search(r'RANDOM SEED:\s*(\d+)', line)
        seed = seed_match.group(1) if seed_match else "NA"

    # orig block
    if "Original Classification Report:" in line:
        while i < len(lines) and "accuracy" not in lines[i]:
            i += 1
        if i < len(lines):
            acc_match = re.search(r'accuracy\s+(\d+\.\d+)', lines[i])
            orig_acc = acc_match.group(1) if acc_match else "NA"
            reu_orig_results.append([int(seed), float(orig_acc), 1 - float(orig_acc)])

    # modif block
    if "Modified Classification Report:" in line:
        while i < len(lines) and "accuracy" not in lines[i]:
            i += 1
        if i < len(lines):
            acc_match = re.search(r'accuracy\s+(\d+\.\d+)', lines[i])
            mod_acc = acc_match.group(1) if acc_match else "NA"
        fpr, fnr = "NA", "NA"
        while i < len(lines):
            if "FPR:" in lines[i]:
                fpr_line = lines[i]
                fpr_match = re.search(r'FPR:\s+\[([^\]]+)\]', fpr_line)
                if fpr_match:
                    fpr_vals = [float(v.strip()) for v in fpr_match.group(1).split()]
                    fpr = f"{fpr_vals[1]:.4f}" if len(fpr_vals) > 1 else "NA"
            if "FNR:" in lines[i]:
                fnr_line = lines[i]
                fnr_match = re.search(r'FNR:\s+\[([^\]]+)\]', fnr_line)
                if fnr_match:
                    fnr_vals = [float(v.strip()) for v in fnr_match.group(1).split()]
                    fnr = f"{fnr_vals[1]:.4f}" if len(fnr_vals) > 1 else "NA"
                reu_mod_results.append([int(seed), float(mod_acc), float(fpr), float(fnr)])
                break
            i += 1
    i += 1


# # optional: write original results
# with open(orig_output_file, 'w') as f:
#     f.write("seed\taccuracy\tFNR\n")
#     for row in reu_orig_results:
#         f.write("\t".join(str(item) for item in row) + "\n")

# # optional: write modified results
# with open(mod_output_file, 'w') as f:
#     f.write("seed\taccuracy\tFPR\tFNR\n")
#     for row in reu_mod_results:
#         f.write("\t".join(str(item) for item in row) + "\n")
        
reu_orig_df = pd.DataFrame(reu_orig_results, columns = ['seed', 'accuracy', 'FNR'])
reu_modif_df = pd.DataFrame(reu_mod_results, columns = ['seed', 'accuracy', 'FPR', 'FNR'])

print("Summary statistics on both Reuters datasets (modified statistics are over the full dataset of 100 excerpts):")
print(f"orig acc(mean, var) : ({np.mean(reu_orig_df['accuracy']):.2f}, {np.var(reu_orig_df['accuracy']):.2f})")
print(f"orig FNR(mean, var) : ({np.mean(reu_orig_df['FNR']):.2f}, {np.var(reu_orig_df['FNR']):.2f})")
print(f"modif acc(mean, var) : ({np.mean(reu_modif_df['accuracy']):.2f}, {np.var(reu_modif_df['accuracy']):.2f})")


Summary statistics on both Reuters datasets (modified statistics are over the full dataset of 100 excerpts):
orig acc(mean, var) : (0.60, 0.02)
orig FNR(mean, var) : (0.40, 0.02)
modif acc(mean, var) : (0.57, 0.00)


__(3.5) Reuters modified dataset analysis__

In contrast to the modified-dataset analysis in the previous cell, these statistics cover only the _edited_ Reuters excerpts (n = 50). Accuracy is summarized by its mean and variance across the 30 seeded runs; the `flips` statistic is the average number, per run, of the 50 edited excerpts whose predicted label toggles (0 -> 1 or 1 -> 0) between the original and modified datasets.
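
To make the two statistics concrete, here is a small sketch on made-up predictions. It assumes, as the analysis cell for this section does, that the edited excerpts occupy the first rows of each label file and that a prediction of 0 counts as correct for them. The function names are hypothetical, and n is shrunk to 5 for readability.

```python
def modified_accuracy(modif_preds, n_edited):
    """Fraction of the first n_edited predictions equal to 0."""
    edited = modif_preds[:n_edited]
    return (n_edited - sum(edited)) / n_edited


def count_flips(orig_preds, modif_preds, n_edited):
    """Number of edited excerpts whose predicted label changed."""
    pairs = zip(orig_preds[:n_edited], modif_preds[:n_edited])
    return sum(o != m for o, m in pairs)


# Made-up binary predictions; only the first 5 entries are "edited".
orig_preds = [0, 1, 0, 0, 1, 1, 0]
modif_preds = [0, 0, 0, 1, 1, 1, 0]

print(modified_accuracy(modif_preds, 5))        # 3 of 5 predicted as 0
print(count_flips(orig_preds, modif_preds, 5))  # labels differ at 2 positions
```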

In [33]:
seeds=[384, 328, 479, 21, 304, 355, 285, 105, 135,
       263, 91, 88, 73, 177, 7, 66, 492, 344, 402,
       274, 467, 413, 339, 427, 201, 373, 214, 223, 366, 246]

reu_mod_acc = []

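reu_flips = [0] * len(seeds)  # label flips per seeded run

for j in range(len(seeds)):

    # predicted labels for the original and modified Reuters excerpts
    reu_orig = pd.read_csv(f'y_bin_pred_original_reuters_{seeds[j]}.csv')
    reu_orig = np.array(reu_orig)
    
    reu_modif = pd.read_csv(f'y_bin_pred_modified_reuters_{seeds[j]}.csv')
    reu_modif = np.array(reu_modif)
    
    # accuracy over the first 50 rows (the edited excerpts): share predicted as 0
    reu_mod = (50 - reu_modif[0:50].sum()) / 50 
    reu_mod_acc.append(reu_mod)  
    
    # count edited excerpts whose predicted label differs between datasets
    for i in range(50):
        if reu_orig[i] != reu_modif[i]:
            reu_flips[j] += 1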
    
#     print('------------')
#     print('seed: ', seeds[j])
#     print('reu mod acc: ', np.sum(reu_mod))
    
    
print('----------')
print('reu_mod var(acc): ', np.var(reu_mod_acc))
print('reu mod mean(acc): ', np.mean(reu_mod_acc))
print('reu flips average: ', np.mean(reu_flips))



----------
reu_mod var(acc):  0.02203555555555555
reu mod mean(acc):  0.4733333333333334
reu flips average:  1.1


-----------------------------------------------------
__(4) NYTimes original and modified dataset analysis__

Note: modified dataset statistics are reported over the whole set of 100 excerpts, including those 50 excerpts that are unchanged between original and modified datasets. For statistics over the set of modified articles only, see the next section (4.5). 


In [23]:
# generate separate extraction files for original and modified NYTimes datasets

input_file = "final_metrics_nyt.txt"
orig_output_file = "orig_results_nyt.tsv"
mod_output_file = "mod_results_nyt.tsv"

with open(input_file, 'r') as f:
    lines = f.readlines()

nyt_orig_results = []
nyt_mod_results = []

i = 0
while i < len(lines):
    line = lines[i]

    # fetch random seed 
    if "RANDOM SEED:" in line:
        seed_match = re.search(r'RANDOM SEED:\s*(\d+)', line)
        seed = seed_match.group(1) if seed_match else "NA"

    # orig block
    if "Original Classification Report:" in line:
        while i < len(lines) and "accuracy" not in lines[i]:
            i += 1
        if i < len(lines):
            acc_match = re.search(r'accuracy\s+(\d+\.\d+)', lines[i])
            orig_acc = acc_match.group(1) if acc_match else "NA"
            nyt_orig_results.append([int(seed), float(orig_acc), 1 - float(orig_acc)])

    # modif block
    if "Modified Classification Report:" in line:
        while i < len(lines) and "accuracy" not in lines[i]:
            i += 1
        if i < len(lines):
            acc_match = re.search(r'accuracy\s+(\d+\.\d+)', lines[i])
            mod_acc = acc_match.group(1) if acc_match else "NA"
        fpr, fnr = "NA", "NA"
        while i < len(lines):
            if "FPR:" in lines[i]:
                fpr_line = lines[i]
                fpr_match = re.search(r'FPR:\s+\[([^\]]+)\]', fpr_line)
                if fpr_match:
                    fpr_vals = [float(v.strip()) for v in fpr_match.group(1).split()]
                    fpr = f"{fpr_vals[1]:.4f}" if len(fpr_vals) > 1 else "NA"
            if "FNR:" in lines[i]:
                fnr_line = lines[i]
                fnr_match = re.search(r'FNR:\s+\[([^\]]+)\]', fnr_line)
                if fnr_match:
                    fnr_vals = [float(v.strip()) for v in fnr_match.group(1).split()]
                    fnr = f"{fnr_vals[1]:.4f}" if len(fnr_vals) > 1 else "NA"
                nyt_mod_results.append([int(seed), float(mod_acc), float(fpr), float(fnr)])
                break
            i += 1
    i += 1

# # optional: write original results
# with open(orig_output_file, 'w') as f:
#     f.write("seed\taccuracy\tFNR\n")
#     for row in nyt_orig_results:
#         f.write("\t".join(str(item) for item in row) + "\n")

# # optional: write modified results
# with open(mod_output_file, 'w') as f:
#     f.write("seed\taccuracy\tFPR\tFNR\n")
#     for row in nyt_mod_results:
#         f.write("\t".join(str(item) for item in row) + "\n")

nyt_orig_df = pd.DataFrame(nyt_orig_results, columns = ['seed', 'accuracy', 'FNR'])
nyt_modif_df = pd.DataFrame(nyt_mod_results, columns = ['seed', 'accuracy', 'FPR', 'FNR'])

print("Summary statistics on both NYTimes datasets (modified dataset statistics are over the full dataset of 100 excerpts):")
print(f"orig acc(mean, var) : ({np.mean(nyt_orig_df['accuracy']):.2f}, {np.var(nyt_orig_df['accuracy']):.2f})")
print(f"orig FNR(mean, var) : ({np.mean(nyt_orig_df['FNR']):.2f}, {np.var(nyt_orig_df['FNR']):.2f})")
print(f"modif acc(mean, var) : ({np.mean(nyt_modif_df['accuracy']):.2f}, {np.var(nyt_modif_df['accuracy']):.2f})")


Summary statistics on both NYTimes datasets (modified dataset statistics are over the full dataset of 100 excerpts):
orig acc(mean, var) : (0.49, 0.03)
orig FNR(mean, var) : (0.51, 0.03)
modif acc(mean, var) : (0.49, 0.00)


__(4.5) NYTimes modified dataset analysis__

In contrast to the modified-dataset analysis in the previous cell, these statistics cover only the _edited_ NYT excerpts (n = 50). Accuracy is summarized by its mean and variance across the 30 seeded runs; the `flips` statistic is the average number, per run, of the 50 edited excerpts whose predicted label toggles (0 -> 1 or 1 -> 0) between the original and modified datasets.

In [32]:
seeds=[384, 328, 479, 21, 304, 355, 285, 105, 135,
       263, 91, 88, 73, 177, 7, 66, 492, 344, 402,
       274, 467, 413, 339, 427, 201, 373, 214, 223, 366, 246]

nyt_mod_acc = []

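nyt_flips = [0] * len(seeds)  # label flips per seeded run


for j in range(len(seeds)):

    # predicted labels for the original and modified NYTimes excerpts
    nyt_orig = pd.read_csv(f'y_bin_pred_original_nytimes_{seeds[j]}.csv') 
    nyt_orig = np.array(nyt_orig)
    
    nyt_modif = pd.read_csv(f'y_bin_pred_modified_nytimes_{seeds[j]}.csv')
    nyt_modif = np.array(nyt_modif)
    
    # accuracy over the first 50 rows (the edited excerpts): share predicted as 0
    nyt_mod = (50 - nyt_modif[0:50].sum()) / 50
    nyt_mod_acc.append(nyt_mod)
    
    # count edited excerpts whose predicted label differs between datasets
    for i in range(50):
        if nyt_orig[i] != nyt_modif[i]:
            nyt_flips[j] += 1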
        
#   # print per-run statistics:   
#     print('seed: ', seeds[j])
#     print('nyt mod acc: ', np.sum(nyt_mod))
#     print('------------')
    
print('----------')
print('nyt mod variance: ', np.var(nyt_mod_acc))
print('nyt mod average: ', np.mean(nyt_mod_acc))
print('nyt flips average: ', np.mean(nyt_flips))


----------
nyt mod variance:  0.04125333333333333
nyt mod average:  0.52
nyt flips average:  1.4333333333333333


__(5) Pairwise significance test__ 

This section tests whether model performance differs significantly between the original NYTimes and original Reuters datasets. Run the analyses in sections (3) and (4) first so that `nyt_orig_df` and `reu_orig_df` are initialized.

In [34]:

t_stat, p_val = ttest_ind(nyt_orig_df['accuracy'], reu_orig_df['accuracy'])
print(f"p = {p_val}")


p = 0.000374963616536027
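
By default, `ttest_ind` performs Student's t-test with a pooled variance (`equal_var=True`); passing `equal_var=False` switches to Welch's test, which does not assume equal variances. As a sketch of what the default statistic measures, the pooled t statistic can be computed by hand with the standard library. The helper name `pooled_t` and the sample accuracies are made up for illustration and are not values from our runs.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    # variance() is the sample variance (ddof=1), as the t-test requires.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical per-run accuracies for two datasets.
a = [0.50, 0.48, 0.52, 0.47]
b = [0.60, 0.62, 0.58, 0.61]
t = pooled_t(a, b)
print(f"t = {t:.3f}, df = {len(a) + len(b) - 2}")
```

The sign of the statistic follows the argument order: a negative value here means the first sample has the lower mean.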
