# Text feature generation
This notebook generates textual features for later use in the classifier. The features include:

- LIWC features
- BERT embeddings
- CE embeddings
- Alexa ranks

An extraction function is implemented for each feature set. To compute LIWC features and BERT embeddings, place the Celebrity dataset in the project folder (the code expects it under `celebrityDataset/`). The `liwc` library also requires the LIWC2015_English.dic file (available at https://drive.google.com/file/d/1XWJVSVGkSDLOpKQ34lRojgO_iIP_hV34/view?usp=sharing).

Because the Alexa rank API has been retired, the CE embeddings and Alexa ranks are loaded from a pickle file that was precomputed in previous research. The link to that file is given in the corresponding code section.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
PROJECT_FOLDER = "/content/drive/My Drive/Colab Notebooks/Deep Learning/project/"  # replace with your project folder
!ls "$PROJECT_FOLDER"

To begin with, let's examine the dataset.

In [None]:
import glob
import pandas as pd
import re
import pickle

# Read the Celebrity dataset: one .txt file per news item
def dataset_reader(news_path, is_fake):
    rows = []

    regex_pattern = r"[\n.]"
    for filename in glob.glob(news_path + "*.txt"):
        with open(filename, 'r') as file:
            text = file.read().strip()
        text_splitted = re.split(regex_pattern, text)
        text_splitted = [sentence for sentence in text_splitted if len(sentence) > 0]
        # The headline is the shortest run of leading fragments with at least 5 words
        headline = ''
        index = -1  # keeps the slice below valid when a file is empty
        for index, sentence in enumerate(text_splitted):
            headline += sentence
            if len(headline.split(' ')) >= 5:
                break
        content = '.'.join(text_splitted[index + 1:])
        news_file = filename.split('/')[-1]
        rows.append(
            {'file': news_file,
             'headline': headline.strip(),
             'content': content.strip(),
             'fake': is_fake}
        )

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows, columns=['file', 'headline', 'content', 'fake'])
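
To make the headline/content split concrete, here is a minimal standalone sketch of the same logic applied to a made-up article. The helper name `split_headline`, its `min_words` parameter, and the sample text are illustrative assumptions, not part of the dataset or of the notebook's code.

```python
import re

def split_headline(text, min_words=5):
    """Split raw article text into (headline, content) like dataset_reader does."""
    parts = [s for s in re.split(r"[\n.]", text.strip()) if len(s) > 0]
    headline = ''
    index = -1
    for index, sentence in enumerate(parts):
        headline += sentence
        # Stop as soon as the accumulated headline has enough words
        if len(headline.split(' ')) >= min_words:
            break
    content = '.'.join(parts[index + 1:])
    return headline.strip(), content.strip()

# Hypothetical sample: first line is the headline, the rest is the body
sample = "Star denies feud rumours today\nThe actor spoke to reporters. He said nothing new."
headline, content = split_headline(sample)
print(headline)
print(content)
```

Note that fragments are appended to the headline without a separator, so a first line shorter than five words gets glued to the next fragment. The periods consumed by the split are restored only between the remaining content fragments, which is why the final period of the text is lost.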

In [None]:
fakes = dataset_reader(PROJECT_FOLDER + 'celebrityDataset/fake/', 1)
legits = dataset_reader(PROJECT_FOLDER + 'celebrityDataset/legit/', 0)

In [None]:
dataset = pd.concat([fakes, legits]).reset_index()
dataset.sample(4)

Unnamed: 0,index,file,headline,content,fake
130,130,102fake.txt,Taylor Swift Goes Naked in ',Ready for It?' Music Video Teaser.2017-10-23T1...,1
252,2,070legit.txt,Ellen DeGeneres sends her producer to another ...,This is probably the one time of year you don'...,0
305,55,013legit.txt,Kanye West is launching his own makeup line ca...,Rapper-come-designer Kanye West has already co...,0
93,93,052fake.txt,Beyonce Gets a Surprise Visit From Hillary Cli...,Hillary Clinton recently paid a visit to the Q...,1


## LIWC features

In [None]:
!pip install liwc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting liwc
  Downloading liwc-0.5.0-py2.py3-none-any.whl (5.1 kB)
Installing collected packages: liwc
Successfully installed liwc-0.5.0


In [None]:
import pandas as pd
import numpy as np
from collections import Counter
import liwc

parse, category_names = liwc.load_token_parser(PROJECT_FOLDER + 'LIWC2015_English.dic') #https://drive.google.com/file/d/1XWJVSVGkSDLOpKQ34lRojgO_iIP_hV34/view?usp=sharing
def liwc_extractor(dataset):
    features = pd.DataFrame(0, index=np.arange(dataset.shape[0]), columns = category_names)
    
    for index, row in dataset.iterrows():
        tokens = row['content'].split(' ')
        category_counts = Counter(category for token in tokens for category in parse(token))
        for category, value in category_counts.items():
            features.at[index, category] = value
    
    return features
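
The counting step can be tried without the LIWC dictionary. The sketch below swaps the real `parse` function for a hypothetical `toy_parse` backed by a three-word lexicon; the lexicon, the helper names, and the sample sentence are all assumptions made for illustration. It also lowercases the text before matching, which the extractor above does not do; whether that is needed depends on the dictionary entries being lowercase, so treat it as an option to check rather than a required fix.

```python
from collections import Counter

# Hypothetical stand-in for the `parse` function returned by liwc.load_token_parser:
# it maps a token to the list of categories the token belongs to.
TOY_LEXICON = {
    "she": ["pronoun", "shehe"],
    "we": ["pronoun", "we"],
    "the": ["article"],
}

def toy_parse(token):
    return TOY_LEXICON.get(token, [])

def count_categories(text, parse):
    # Lowercasing is an assumption of this sketch (see the note above)
    tokens = text.lower().split(' ')
    # Flatten the categories of every token and count them
    return Counter(category for token in tokens for category in parse(token))

counts = count_categories("She said we saw The film", toy_parse)
print(counts)
```

Each token can contribute to several categories at once, which is why `pronoun` is counted twice here while `shehe`, `we` and `article` are counted once each.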

In [None]:
dataset = pd.concat([liwc_extractor(dataset), dataset], axis="columns")

In [None]:
dataset.sample(4)

Unnamed: 0,function (Function Words),pronoun (Pronouns),ppron (Personal Pronouns),i (I),we (We),you (You),shehe (SheHe),they (They),ipron (Impersonal Pronouns),article (Articles),...,swear (Swear),netspeak (Netspeak),assent (Assent),nonflu (Nonfluencies),filler (Filler Words),index,file,headline,content,fake
487,66,8,3,0,0,0,2,1,5,12,...,0,0,0,0,0,237,243legit.txt,Victoria Beckham and Anna Wintour among those ...,Tennis world number one Andy Murray has been a...,0
111,108,19,13,0,0,0,10,3,6,17,...,0,1,0,0,0,111,077fake.txt,Angelina Jolie & Amal Alamuddin: It's War!,When George and Amal Clooney celebrated their ...,1
63,58,9,8,0,0,0,8,0,1,12,...,0,0,0,0,0,63,203fake.txt,Why Celine Dion and Cher hate each other,Cher wowed the crowd in a barely-there crystal...,1
333,94,12,6,0,2,0,4,0,6,19,...,0,0,0,1,0,83,096legit.txt,"Gwen Stefani is returning to 'The Voice,' but ...",Gwen Stefani is all set to return to the hot s...,0


## BERT embeddings

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [None]:
import torch
from transformers import BertTokenizer, BertModel
from tqdm import tqdm

# Compute BERT's pooled output (output[1], one 768-dimensional vector) for the first 512 tokens of each article's content
def bert_extractor(dataset):
    features = pd.DataFrame(None, index=np.arange(dataset.shape[0]), columns = ["bert_features"])
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased")
    
    for index, row in tqdm(dataset.iterrows()):
        text = row['content']
        encoded_input = tokenizer(text, return_tensors='pt')
        # BERT accepts at most 512 positions, so longer inputs are cut off
        for key in encoded_input:
            encoded_input[key] = encoded_input[key][:, :512]
        with torch.no_grad():  # inference only, no gradients needed
            output = model(**encoded_input)
        features.at[index, "bert_features"] = output[1][0].detach().cpu().numpy()
    
    return features
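
One side effect of slicing the encoded input to 512 positions is that, for long articles, the closing `[SEP]` token is cut off together with the excess text. Whether this matters for the pooled output is not established here. The sketch below shows an alternative truncation on plain lists of token ids that keeps `[SEP]` at the end; `truncate_keep_sep` is a hypothetical helper and the id sequence is a toy example. In the notebook itself a similar effect could probably be obtained by passing `truncation=True, max_length=512` to the tokenizer call instead of slicing.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids in the bert-base-uncased vocabulary

def truncate_keep_sep(input_ids, max_len=512):
    """Cut a token-id list to max_len while keeping [SEP] as the last token."""
    if len(input_ids) <= max_len:
        return list(input_ids)
    # Keep the first max_len - 1 ids and re-append the separator
    return list(input_ids[:max_len - 1]) + [SEP_ID]

# Toy sequence of 535 ids: [CLS], 533 arbitrary word ids, [SEP]
ids = [CLS_ID] + list(range(1000, 1533)) + [SEP_ID]
short = truncate_keep_sep(ids)
print(len(short), short[0], short[-1])
```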

In [None]:
dataset = pd.concat([bert_extractor(dataset), dataset], axis="columns")
dataset.sample(4)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2it [00:02,  1.04s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (535 > 512). Running this sequence 

Unnamed: 0,bert_features,function (Function Words),pronoun (Pronouns),ppron (Personal Pronouns),i (I),we (We),you (You),shehe (SheHe),they (They),ipron (Impersonal Pronouns),...,swear (Swear),netspeak (Netspeak),assent (Assent),nonflu (Nonfluencies),filler (Filler Words),index,file,headline,content,fake
281,"[-0.5825985, -0.6338391, -0.9602135, 0.3774475...",133,21,15,0,1,0,10,4,6,...,0,0,1,0,0,31,045legit.txt,Moving fashion forward: Brooding Tom Hiddlesto...,He's one of Hollywood's most debonair stars.So...,0
402,"[-0.73400617, -0.73961544, -0.9954511, 0.64951...",128,25,12,3,1,0,7,1,13,...,0,0,0,0,0,152,177legit.txt,Vanessa Lachey breaks silence amid ‘DWTS’ feud...,This dancing duo is trying to find their rhyth...,0
347,"[-0.43977165, -0.53327477, -0.9107255, 0.12738...",143,18,8,0,0,0,8,0,10,...,0,0,0,1,0,97,077legit.txt,AMAL CLOONEY IS THE MOST INFLUENTIAL WOMAN IN ...,You'd be forgiven for thinking the most powerf...,0
147,"[-0.4821114, -0.6562978, -0.94312906, 0.064184...",120,17,6,0,0,2,3,1,11,...,0,0,0,0,0,147,178fake.txt,Kelly Ripa and Megyn Kelly in a Ratings Battle...,Things are getting heated! Megyn Kelly Today a...,1


## CE embeddings & Alexa rank

In [None]:
import gdown
import pickle

In [None]:
url = 'https://drive.google.com/file/d/1BujwLdIrdGCBOjs3scSaGmv_ZhRM77lj/view' #file from previous research
gdown.download(url,"./data.zip",quiet=False, fuzzy=True)

Downloading...
From: https://drive.google.com/uc?id=1BujwLdIrdGCBOjs3scSaGmv_ZhRM77lj
To: /content/data.zip
100%|██████████| 161M/161M [00:01<00:00, 160MB/s]


'./data.zip'

In [None]:
!unzip -q ./data.zip

In [None]:
path = '/content/multilingual_evidence_2/celebritydataset_fake_fixed.pkl'
with open(path, 'rb') as file:  # the context manager closes the file after loading
    data_fake = pickle.load(file)
path = '/content/multilingual_evidence_2/celebritydataset_legit_fixed.pkl'
with open(path, 'rb') as file:
    data_legit = pickle.load(file)

In [None]:
data_all = {**data_fake, **data_legit}

In [None]:
# Extract the similarity and Alexa rank of the first n_news scraped articles per language from the data dictionary. If fewer than n_news articles are available, the missing values are filled with None.
def sim_rank_extractor(dataset, n_news = 10):
    langs = ['en', 'fr', 'de', 'es', 'ru']
    columns = []
    for lang in langs:
        for i in range(n_news):
            columns.append(lang + "_" + str(i) + "_sim")
            columns.append(lang + "_" + str(i) + "_rank")

    features = pd.DataFrame(None, index=np.arange(dataset.shape[0]), columns=columns)
    
    for index, row in tqdm(dataset.iterrows()):
        filename = row['file']
        for lang in langs:
            for i in range(n_news):
                col_sim = lang + "_" + str(i) + "_sim"
                col_rank = lang + "_" + str(i) + "_rank"
                try:
                    features.at[index, col_sim] = data_all[filename][lang][i]["similarity"]
                except (KeyError, IndexError, TypeError):  # file, language, article or field is missing
                    features.at[index, col_sim] = None
                try:
                    features.at[index, col_rank] = data_all[filename][lang][i]["alexa_rank"]
                except (KeyError, IndexError, TypeError):
                    features.at[index, col_rank] = None
    
    return features
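
The padding behaviour is easiest to see on a tiny example. The sketch below assumes the layout implied by the lookups above (file name, then language, then a list of scraped articles, each with `similarity` and `alexa_rank` fields); the `lookup` helper and the toy values are hypothetical and are not taken from the real pickle files.

```python
def lookup(data, filename, lang, i, field):
    """Return data[filename][lang][i][field], or None if any level is missing."""
    try:
        return data[filename][lang][i][field]
    except (KeyError, IndexError, TypeError):
        return None

# Toy data in the assumed layout: file -> language -> list of scraped articles
toy = {"001fake.txt": {"en": [{"similarity": 0.71, "alexa_rank": 1500}]}}

row = {}
for lang in ["en", "fr"]:
    for i in range(2):
        row[f"{lang}_{i}_sim"] = lookup(toy, "001fake.txt", lang, i, "similarity")
        row[f"{lang}_{i}_rank"] = lookup(toy, "001fake.txt", lang, i, "alexa_rank")
print(row)
```

Only `en_0_sim` and `en_0_rank` are filled here; the second English slot and both French slots come back as `None`, which is what produces the empty cells in the resulting table.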

In [None]:
dataset = pd.concat([dataset, sim_rank_extractor(dataset)], axis="columns")
dataset.sample(4)

500it [00:02, 221.96it/s]


Unnamed: 0,bert_features,function (Function Words),pronoun (Pronouns),ppron (Personal Pronouns),i (I),we (We),you (You),shehe (SheHe),they (They),ipron (Impersonal Pronouns),...,ru_5_sim,ru_5_rank,ru_6_sim,ru_6_rank,ru_7_sim,ru_7_rank,ru_8_sim,ru_8_rank,ru_9_sim,ru_9_rank
230,"[-0.035090838, -0.27781427, -0.6077346, -0.112...",272,52,22,0,0,0,17,5,30,...,,,,,,,,,,
158,"[-0.37792993, -0.39935485, -0.8798182, 0.32014...",125,31,20,0,2,1,14,3,11,...,0.675714,514863.0,0.640751,1957.0,,,,,,
427,"[-0.6663442, -0.36305353, -0.8525779, 0.580729...",50,14,8,0,1,0,4,3,6,...,0.667702,5263.0,0.717015,44431.0,0.693203,9.223372036854778e+18,,,,
276,"[-0.23331955, -0.33858702, -0.28926602, -0.048...",207,43,28,0,3,0,20,5,15,...,,,,,,,,,,


## Save the dataset

In [None]:
dataset.to_csv(PROJECT_FOLDER + "dataset.csv")

In [None]:
pd.read_csv(PROJECT_FOLDER + "dataset.csv").head() #check

Unnamed: 0.1,Unnamed: 0,bert_features,function (Function Words),pronoun (Pronouns),ppron (Personal Pronouns),i (I),we (We),you (You),shehe (SheHe),they (They),...,ru_5_sim,ru_5_rank,ru_6_sim,ru_6_rank,ru_7_sim,ru_7_rank,ru_8_sim,ru_8_rank,ru_9_sim,ru_9_rank
0,0,[-0.6696718 -0.6551779 -0.99718195 0.610286...,185,36,23,0,0,0,17,6,...,0.786291,100583.0,0.609681,80152.0,0.667345,21506.0,0.615025,1485.0,,
1,1,[-1.86803266e-02 -3.85523289e-01 -9.36701298e-...,74,19,12,0,0,0,12,0,...,0.622169,81315.0,0.731784,22.0,,,,,,
2,2,[-5.27673781e-01 -4.12977397e-01 -5.36640704e-...,144,29,22,1,1,1,19,0,...,0.691262,693315.0,0.632577,684775.0,0.660828,615260.0,0.689651,10669.0,0.656184,2491.0
3,3,[-2.57600158e-01 -2.91617334e-01 -3.44029903e-...,195,48,31,1,1,3,25,1,...,0.678961,5527.0,0.645147,482.0,0.703757,20483.0,,,,
4,4,[-0.5991494 -0.7292267 -0.97551596 0.669945...,157,14,6,0,0,0,4,2,...,0.601203,620.0,0.671526,31887.0,0.576362,10083.0,,,,
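
As the check shows, `to_csv` stores each `bert_features` array as its printed text form, so the column comes back from `read_csv` as strings rather than arrays. A minimal sketch of turning one such cell back into numbers follows; `parse_vector` is a hypothetical helper and the cell value is a toy example in the same bracketed, whitespace-separated format.

```python
def parse_vector(cell):
    """Convert the printed form of a 1-D array back into a list of floats."""
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines)
    return [float(x) for x in cell.strip().strip("[]").split()]

cell = "[-0.5 0.25\n 1.5e-02]"  # toy value, not taken from the dataset
vector = parse_vector(cell)
print(vector)
```

In the notebook this could be applied to the whole column, for example with `df["bert_features"].apply(parse_vector)`, and wrapped in `np.array` if arrays are needed. Because the printed form keeps only a limited number of digits, the round trip may not be exact; saving the frame with `dataset.to_pickle(...)` instead would avoid both the parsing step and the rounding.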
