**Note:** This code was run on Google Colab.

In [1]:
#remove warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings('ignore')

In [None]:
#installing tweet-preprocessor used to preprocess the news
!pip install tweet-preprocessor

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
import json
import re, string, unicodedata
import preprocessor as p
import spacy, nltk, gensim, sklearn
from gensim.parsing.preprocessing import *

### Load the datasets

The datasets can be found in: 

In [3]:
#set the data folder name
DATA_FOLDER = 'data_news'

#set the features data folder name 
FEATURES_FOLDER = 'news_features'

#set the paths of the news datasets using os.path.join to guarantee that they work on different operating systems
FAKE_DATASET= os.path.join(DATA_FOLDER, "Fake.csv")
TRUE_DATASET= os.path.join(DATA_FOLDER, "True.csv")
    
#load the news datasets 
df_Fake = pd.read_csv(FAKE_DATASET )
df_True = pd.read_csv(TRUE_DATASET )

In [None]:
df_Fake.head(5)

In [None]:
df_True.sample(5)

In [None]:
print("shape of Fake news dataset:" , df_Fake.shape , "\nshape of True news dataset:" , df_True.shape)

We will only use the 'text' columns of our datasets to perform our analysis

### Preprocessing

The news datasets require some preprocessing before the analysis. The news items contain many links, tags and similar elements that are useless for the linguistic cues analysis, so we delete them. We also map all the text to lowercase to avoid misleading the models. Finally, we perform some specific modifications (removing empty strings, multiple spaces, etc.) to ensure that we have proper entries both for the analysis and for the models.
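The regex and replacement steps described above can be sketched as a single function. This is a minimal sketch that mirrors the expressions used in the cells of this section; the name `clean_text` is hypothetical, and the URL/mention removal done by `p.clean` is not included.

```python
import re

# Characters that join words and are turned into spaces before filtering
_SEPARATORS = str.maketrans({'-': ' ', '/': ' ', '_': ' '})

# Leftover patterns that are tidied up after filtering, applied in this order
_REPLACEMENTS = (("...", ""), (", ,", ","), (". .", "."),
                 (" ,", ","), (" .", "."), ("  ", " "))

def clean_text(text):
    """Hypothetical helper reproducing the notebook's regex and replace steps."""
    # Split joined words and append a space so the last word can be matched
    spaced = text.translate(_SEPARATORS) + ' '
    # Keep runs of letters that are directly followed by a space or one of ? . ! ,
    kept = "".join(re.findall(r'[a-zA-Z]*[ ?.!,]', spaced))
    # Drop the appended space and lowercase the result
    kept = kept[:-1].lower()
    for old, new in _REPLACEMENTS:
        kept = kept.replace(old, new)
    return kept

example = "The U.S. economy grew 3% in mid-2018, officials said."
print(clean_text(example))
```

Note that digits and symbols such as `%` are dropped, and that a word directly followed by a character outside the kept set (for example `said:`) is dropped entirely, because the pattern only keeps letters that are followed by a space or one of `? . ! ,`.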

#### Fake text

In [5]:
#get the texts of Fake news (copy to avoid a pandas SettingWithCopyWarning when the column is modified below)
df_Fake_text= df_Fake[['text']].copy()

In [11]:
df_Fake_text['text'].iloc[4]

'Pope Francis used his annual Christmas Day message to rebuke Donald Trump without even mentioning his name. The Pope delivered his message just days after members of the United Nations condemned Trump s move to recognize Jerusalem as the capital of Israel. The Pontiff prayed on Monday for the  peaceful coexistence of two states within mutually agreed and internationally recognized borders. We see Jesus in the children of the Middle East who continue to suffer because of growing tensions between Israelis and Palestinians,  Francis said.  On this festive day, let us ask the Lord for peace for Jerusalem and for all the Holy Land. Let us pray that the will to resume dialogue may prevail between the parties and that a negotiated solution can finally be reached. The Pope went on to plead for acceptance of refugees who have been forced from their homes, and that is an issue Trump continues to fight against. Francis used Jesus for which there was  no place in the inn  as an analogy. Today, as

In [None]:
#removal of URLs, Mentions
df_Fake_text['text']= df_Fake_text['text'].apply(p.clean)

In [None]:
df_Fake_text['text'].iloc[19]

In [None]:
#first we separate words that are connected with '-' '/' '_'
#Then we keep only alphabetic characters and some punctuation marks that are useful to detect sentences
df_Fake_text['text']=df_Fake_text['text'].apply( lambda x: "".join(re.findall(r'[a-zA-Z]*[ ?.!,]', x.translate(str.maketrans({'-': ' ', '/': ' ', '_': ' '}))+' ')))

In [None]:
#map all characters to lowercase to make the text uniform
#replace some special cases to remove empty sentences
df_Fake_text['text']= df_Fake_text['text'].apply(lambda x: x[:-1].lower().replace("...", "").replace(", ,", ",").replace(". .", ".").replace(' ,', ',').replace(' .', '.').replace('  ',' '))

In [None]:
#here we can see what an entry looks like after preprocessing
df_Fake_text['text'].iloc[888]

In [None]:
#This is how the entry was before preprocessing
df_Fake['text'].iloc[888]

In [None]:
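#some news items contain only links or tags, so after preprocessing they become empty text. We remove those because they are irrelevant for our analysis
#(isnull() alone does not flag empty strings, so we also test for blank text)
df_Fake_text= df_Fake_text[ df_Fake_text['text'].notnull() & (df_Fake_text['text'].str.strip() != '') ]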

In [None]:
print("shape of the Fake news dataframe after preprocessing: ", df_Fake_text.shape )

In [None]:
#save the new dataframe
df_Fake_text.to_csv(os.path.join(FEATURES_FOLDER,"df_Fake_text.csv"), index=False)

#### True Text

In [12]:
#As we can see, entries in the True news start with the company's name (Reuters) and the location of the news
df_True['text'].iloc[0]

'WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support educat

In [13]:
#get the texts of True news (copy to avoid a pandas SettingWithCopyWarning when the column is modified below)
df_True_text= df_True[['text']].copy()

In [14]:
#We remove the first part of each news ( the origin part ) to avoid having bias
df_True_text['text'] = df_True_text['text'].apply(lambda x : x.split('-', maxsplit=1)[1] if '-' in x else x)
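Splitting on the first `-` also truncates any article that has no dateline but contains a hyphen early in its text. A more conservative alternative is sketched below. It assumes the dateline has the shape `CITY (Reuters) - ` at the very start of the article; the helper name `strip_dateline` and the 80-character limit on the location part are illustrative choices, not part of the original notebook.

```python
import re

# Assumed dateline shape: optional location, "(Reuters)", then a hyphen,
# all at the very start of the article.
_DATELINE = re.compile(r'^\s*[^()\n]{0,80}\(Reuters\)\s*-\s*')

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' prefix; leave other texts untouched."""
    return _DATELINE.sub('', text, count=1)

with_dateline = "WASHINGTON (Reuters) - The head of a conservative faction spoke."
without_dateline = "A well-known senator spoke on Sunday."

print(strip_dateline(with_dateline))
print(strip_dateline(without_dateline))
# For comparison, the plain split drops the start of the second text:
print(without_dateline.split('-', maxsplit=1)[1])
```

Datelines that use a different dash character or another agency name would not be matched by this pattern and would need their own rule.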

In [15]:
df_True_text.head()

Unnamed: 0,text
0,The head of a conservative Republican faction...
1,Transgender people will be allowed for the fi...
2,The special counsel investigation of links be...
3,Trump campaign adviser George Papadopoulos to...
4,President Donald Trump called on the U.S. Pos...


In [16]:
#removal of URLs, Mentions
df_True_text['text']= df_True_text['text'].apply(p.clean)

In [17]:
df_True_text['text'].iloc[0]

'The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a fiscal conservative on Sunday and urged budget restraint in . In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS Face the Nation, drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense discretionary spending on programs that support education, scientific research, infrastr

In [18]:
#we separate words that are connected with '-' '/' '_'
#Then we keep only alphabetic characters and some punctuation marks that are useful to detect sentences
df_True_text['text']=df_True_text['text'].apply( lambda x: "".join(re.findall(r'[a-zA-Z]*[ ?.!,]', x.translate(str.maketrans({'-': ' ', '/': ' ', '_': ' '}))+' ')))

In [19]:
#map all characters to lowercase to make the text uniform
#replace some special cases to remove empty sentences
df_True_text['text']= df_True_text['text'].apply(lambda x: x[:-1].lower().replace("...", "").replace(", ,", ",").replace(". .", ".").replace(' ,', ',').replace(' .', '.').replace('  ',' '))

In [20]:
#here we can see what an entry looks like after preprocessing
df_True_text['text'].iloc[1]

'transgender people will be allowed for the first time to enlist in the u.s. military starting on monday as ordered by federal courts, the pentagon said on friday, after president donald trumps administration decided not to appeal rulings that blocked his transgender ban. two federal appeals courts, one in washington and one in virginia, last week rejected the administrations request to put on hold orders by lower court judges requiring the military to begin accepting transgender recruits on jan. a justice department official said the administration will not challenge those rulings. the department of defense has announced that it will be releasing an independent study of these issues in the coming weeks. so rather than litigate this interim appeal before that occurs, the administration has decided to wait for dods study and will continue to defend the presidents lawful authority in district court in the meantime, the official said, speaking on condition of anonymity. in september, the 

In [21]:
#This is how the entry was before preprocessing
df_True['text'].iloc[1]

'WASHINGTON (Reuters) - Transgender people will be allowed for the first time to enlist in the U.S. military starting on Monday as ordered by federal courts, the Pentagon said on Friday, after President Donald Trump’s administration decided not to appeal rulings that blocked his transgender ban. Two federal appeals courts, one in Washington and one in Virginia, last week rejected the administration’s request to put on hold orders by lower court judges requiring the military to begin accepting transgender recruits on Jan. 1. A Justice Department official said the administration will not challenge those rulings. “The Department of Defense has announced that it will be releasing an independent study of these issues in the coming weeks. So rather than litigate this interim appeal before that occurs, the administration has decided to wait for DOD’s study and will continue to defend the president’s lawful authority in District Court in the meantime,” the official said, speaking on condition 

In [None]:
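#some news items contain only links or tags, so after preprocessing they become empty text. We remove those because they are irrelevant for our analysis
#(isnull() alone does not flag empty strings, so we also test for blank text)
df_True_text= df_True_text[ df_True_text['text'].notnull() & (df_True_text['text'].str.strip() != '') ]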

In [None]:
print("shape of the True news dataframe after preprocessing: ", df_True_text.shape )

In [None]:
#save the new dataframe
df_True_text.to_csv(os.path.join(FEATURES_FOLDER,"df_True_text.csv"), index=False)

### Sentiment analysis with coreNLP

Here we present the code to run on a local computer. Running on Google Colab is recommended; see the next section for that setup.

**Note:** to run the CoreNLP analysis on your local computer, you first need to download the Stanford CoreNLP package from https://stanfordnlp.github.io/CoreNLP/download.html, then start the Java server from another Jupyter notebook as follows:   
cd stanford-corenlp-full-2018-10-05  
!java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

In [None]:
!pip3 install pycorenlp

In [None]:
from pycorenlp import StanfordCoreNLP

#here we connect to the CoreNLP Java server that we started on the corresponding port number
nlp = StanfordCoreNLP('http://localhost:9000')

In [None]:
def stanford_sentiment(text_str, show=False):
    
    res = nlp.annotate(text_str,
                   properties={
                       'annotators': 'sentiment',
                       'outputFormat': 'json',
                       'timeout': 400000,
                   })
    #print(res)
    
    #get parses 
    parses = []
    for elem in res['sentences']:
        parses.append(elem['parse'])
        
    numSentence = len(res["sentences"])
    numWords = len(text_str.split())
    
    # data arrangement
    arraySentVal = np.zeros(numSentence)

    for i, s in enumerate(res["sentences"]):
        arraySentVal[i] = int(s["sentimentValue"])

    # sum of sentiment values 
    totSentiment = sum(arraySentVal)

    # avg. of sentiment values 
    avgSentiment = np.mean(arraySentVal)

    # frequency of sentimentValue
    bins = [0,1,2,3,4,5,6]
    freq = np.histogram(arraySentVal, bins)[0]    # getting freq. only w/o bins
    
    #show the computed values if requested, used to test the method
    if(show):
        for s in res["sentences"]:
            print("%d: '%s': %s %s" % (
                s["index"],
                " ".join([t["word"] for t in s["tokens"]]),
                s["sentimentValue"], s["sentiment"]))

    return(numSentence, numWords, totSentiment, avgSentiment, freq, parses )

#### Fake news

In [None]:
df_Fake_text['text'].iloc[19]

In [None]:
#takes a lot of time
df_Fake_text['sentiment']= df_Fake_text['text'].apply(stanford_sentiment)

In [None]:
#numSentence, numWords, totSentiment, avgSentiment, freq, parses
df_Fake_text[['numSentence', 'numWords', 'totSentiment', 'avgSentiment', 'positive_Sentiment', 'negative_Sentiment',
              'neutral_Sentiment', 'parses']]= None

In [None]:
df_Fake_text.head()

In [None]:
#"Very negative" = 0 "Negative" = 1 "Neutral" = 2 "Positive" = 3 "Very positive" = 4

#in our case we need to map very positive to positive and very neg to  neg

for i in range( len( df_Fake_text)):
    #print(i)
    elem= stanford_sentiment(df_Fake_text['text'].iloc[i])
    freq= elem[4]
    df_Fake_text.iloc[i, 1:]= [elem[0], elem[1], elem[2],elem[3], freq[3]+freq[4] ,freq[0]+freq[1] , freq[2], elem[5]]
    if( i%10 ==0):
        #print("let's save")
        df_Fake_text.to_csv(os.path.join(FEATURES_FOLDER,'df_Fake_text.csv'), index=False)
        
    


#### True news

In [None]:
#numSentence, numWords, totSentiment, avgSentiment, freq, parses
df_True_text[['numSentence', 'numWords', 'totSentiment', 'avgSentiment', 'positive_Sentiment', 'negative_Sentiment',
              'neutral_Sentiment', 'parses']]= None

In [None]:
df_True_text.head()

In [None]:
#"Very negative" = 0 "Negative" = 1 "Neutral" = 2 "Positive" = 3 "Very positive" = 4

#in our case we need to map very positive to positive and very neg to  neg

for i in range( len( df_True_text)):
    #print(i)
    elem= stanford_sentiment(df_True_text['text'].iloc[i])
    freq= elem[4]
    df_True_text.iloc[i, 1:]= [elem[0], elem[1], elem[2],elem[3], freq[3]+freq[4] ,freq[0]+freq[1] , freq[2], elem[5]]
    if( i%10 ==0):
        #print("let's save")
        df_True_text.to_csv(os.path.join(FEATURES_FOLDER,'df_True_text.csv'), index=False)
        

### Sentiment analysis with coreNLP on colab

In this part we perform the same sentiment analysis as the authors did in their work on the Diplomacy game. We use Stanford CoreNLP to quantify the negative, positive and neutral sentiments in each sentence of the news and take the average for each news item.
These computations are very time- and bandwidth-consuming, so we run them on Google Colab. Even so, we were not able to run them on the entire dataset, which is quite big. We performed the computations on the first 3000 entries of the True and Fake news datasets respectively, so that we can compare them. Since the entries of the datasets are independent, we consider that 3000 samples are enough to describe their average trends.
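The aggregation of sentence-level sentiment values into per-article features can be illustrated without a running CoreNLP server. The sketch below assumes the values follow the CoreNLP scale used in this notebook (0 = very negative, 1 = negative, 2 = neutral, 3 = positive, 4 = very positive); the function name `summarize_sentiment` is hypothetical, and the dictionary keys reuse the column names of the dataframes.

```python
def summarize_sentiment(values):
    """Aggregate per-sentence sentiment values (0..4) into article-level features."""
    values = [int(v) for v in values]
    n = len(values)
    total = sum(values)
    return {
        "numSentence": n,
        "totSentiment": total,
        # An article without sentences has no defined average
        "avgSentiment": total / n if n else float("nan"),
        # "Very positive" is merged into positive, "very negative" into negative
        "positive_Sentiment": sum(v >= 3 for v in values),
        "negative_Sentiment": sum(v <= 1 for v in values),
        "neutral_Sentiment": sum(v == 2 for v in values),
    }

# Six sentences: one very negative, one negative, two neutral,
# one positive, one very positive
print(summarize_sentiment([0, 1, 2, 2, 3, 4]))
```

Merging the two extreme classes into their neighbours gives the three categories (negative, neutral, positive) that are compared between the True and Fake news.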

Note: The tutorial on how to run Stanford CoreNLP on Google Colab can be found at this link: https://colab.research.google.com/github/stanfordnlp/stanza/blob/master/demo/Stanza_CoreNLP_Interface.ipynb#scrollTo=mbOBugvd9JaM

In [None]:
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import stanza
import stanza

In [None]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

In [None]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

In [None]:
# Import client module
from stanza.server import CoreNLPClient

In [None]:
# Construct a CoreNLPClient with the sentiment annotator, a memory allocation of 16GB, and port number 9002
client = CoreNLPClient(
    annotators=['sentiment'], 
    outputFormat= 'json',
    memory='16G', 
    endpoint='http://localhost:9002',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed
client.start()
import time; time.sleep(10)

In [None]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java


In [None]:
#used to kill a pipeline if needed
#!kill 908

In [None]:
"""
function that uses the Stanford coreNLP to annotate the sentences of a given text entry and return:
-number of sentences in the given text 
-number of words in the given text
-total number of sentiments 
-average number of snetiments per sentence 
-frequency of each sentiment (negative, positive and neutral)
-parses for each sentence
"""
def stanford_sentiment(text_str, show=False):
    
    res =  client.annotate(text_str,
                   properties={
                       'annotators': 'sentiment',
                       'outputFormat': 'json',
                       'timeout': 400000,
                   })
    
    
    #get parses 
    parses = []
    for elem in res['sentences']:
        parses.append(elem['parse'])

    #nb of sentences and words    
    numSentence = len(res["sentences"])
    numWords = len(text_str.split())
    
    # data arrangement
    arraySentVal = np.zeros(numSentence)

    for i, s in enumerate(res["sentences"]):
        arraySentVal[i] = int(s["sentimentValue"])

    # sum of sentiment values 
    totSentiment = sum(arraySentVal)

    # avg. of sentiment values 
    avgSentiment = np.mean(arraySentVal)

    # frequency of sentimentValue
    bins = [0,1,2,3,4,5,6]
    freq = np.histogram(arraySentVal, bins)[0]    # getting freq. only w/o bins
    if(show):
        for s in res["sentences"]:
            print("%d: '%s': %s %s" % (
                s["index"],
                " ".join([t["word"] for t in s["tokens"]]),
                s["sentimentValue"], s["sentiment"]))

    return(numSentence, numWords, totSentiment, avgSentiment, freq, parses )

In [None]:
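#Use GPU if available
#Note: this only creates a torch device object; the CoreNLP server itself runs as a separate Java process
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")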

#### Fake news

In [None]:
#add empty columns in the Fake dataframe for numSentence, numWords, totSentiment, avgSentiment, freq and parses
df_Fake_text[['numSentence', 'numWords', 'totSentiment', 'avgSentiment', 'positive_Sentiment', 'negative_Sentiment',
              'neutral_Sentiment', 'parses']]= None

In [None]:
df_Fake_text.head()

Some entries caused a read timeout and could not be processed; those entries were deleted.
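One way to keep a long run going when individual entries time out is to catch the error, record the position of the failing entry, and continue. The sketch below is an illustration under assumptions: `annotate_all` is a hypothetical helper, `annotate_fn` stands for any annotation function such as `stanford_sentiment`, and the broad `except Exception` is used because the exact exception raised on a timeout is not shown in this notebook.

```python
def annotate_all(texts, annotate_fn, start=0, limit=None):
    """Annotate texts[start:start+limit]; return (results, failed) keyed by position."""
    stop = len(texts) if limit is None else min(start + limit, len(texts))
    results, failed = {}, []
    for i in range(start, stop):
        try:
            results[i] = annotate_fn(texts[i])
        except Exception as err:  # assumption: a read timeout surfaces as an exception
            failed.append((i, type(err).__name__))
    return results, failed

# Stand-in annotator for demonstration only: counts words and fails on one entry
def fake_annotate(text):
    if text == "slow entry":
        raise TimeoutError("read timeout")
    return len(text.split())

results, failed = annotate_all(["a b", "slow entry", "c d e", "f"], fake_annotate, limit=3)
print(results)
print(failed)
```

With the dataframes of this notebook, `texts` could be `df_Fake_text['text'].tolist()` and `limit=3000` would restrict the run to the first 3000 entries; the positions in `failed` are the entries to delete afterwards.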

In [None]:
#The returned frequencies are ordered as follows: "Very negative" = 0 "Negative" = 1 "Neutral" = 2 "Positive" = 3 "Very positive" = 4
#In our case we need to map "Very positive" to "Positive" and "Very negative" to "Negative" to match the work of the authors on the Diplomacy game


#We run the stanford_sentiment method on each Fake news item and store the returned values in the dataframe 
#we save the dataframe every 10 iterations so that little data is lost if the runtime stops
for i in range( len( df_Fake_text)):
    #print(i)
    elem= stanford_sentiment(df_Fake_text['text'].iloc[i])
    freq= elem[4]
    df_Fake_text.iloc[i, 1:]= [elem[0], elem[1], elem[2],elem[3], freq[3]+freq[4] ,freq[0]+freq[1] , freq[2], elem[5]]
    if( i%10 ==0):
        #print("let's save")
        df_Fake_text.to_csv(os.path.join(FEATURES_FOLDER,'df_Fake_text.csv'), index=False)
        

#### True news

In [None]:
# add empty columns in the True dataframe for numSentence, numWords, totSentiment, avgSentiment, freq and parses
df_True_text[['numSentence', 'numWords', 'totSentiment', 'avgSentiment', 'positive_Sentiment', 'negative_Sentiment',
              'neutral_Sentiment', 'parses']]= None

In [None]:
df_True_text.head()

Some entries caused a read timeout and could not be processed; those entries were deleted.

In [None]:
#The returned frequencies are ordered as follows: "Very negative" = 0 "Negative" = 1 "Neutral" = 2 "Positive" = 3 "Very positive" = 4
#In our case we need to map "Very positive" to "Positive" and "Very negative" to "Negative" to match the work of the authors on the Diplomacy game


#We run the stanford_sentiment method on each True news item and store the returned values in the dataframe 
#we save the dataframe every 10 iterations so that little data is lost if the runtime stops
for i in range( 1398, len( df_True_text)):
    #print(i)
    elem= stanford_sentiment(df_True_text['text'].iloc[i])
    freq= elem[4]
    df_True_text.iloc[i, 1:]= [elem[0], elem[1], elem[2],elem[3], freq[3]+freq[4] ,freq[0]+freq[1] , freq[2], elem[5]]
    if( i%10 ==0):
        #print("let's save")
        df_True_text.to_csv(os.path.join(FEATURES_FOLDER,'df_True_text.csv'), index=False)
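The loop above restarts from a hard-coded position (1398) after an interrupted run. A possible way to derive the restart position from the last saved file is sketched below; `first_unprocessed` is a hypothetical helper, and it assumes that unprocessed rows still have an empty `numSentence` value, as produced by initialising the feature columns to `None`.

```python
import pandas as pd

def first_unprocessed(df, column="numSentence"):
    """Return the position of the first row whose `column` is still empty (NaN/None)."""
    missing = df[column].isna().to_numpy()
    # argmax gives the position of the first True; if nothing is missing, all rows are done
    return int(missing.argmax()) if missing.any() else len(df)

# Small stand-in for a reloaded checkpoint: two processed rows, two pending rows
demo = pd.DataFrame({"text": ["a", "b", "c", "d"],
                     "numSentence": [3, 5, None, None]})
print(first_unprocessed(demo))
```

After reloading the checkpoint with `pd.read_csv`, the returned value could replace the constant in `range(1398, len(df_True_text))`. Because the file is only written every 10 iterations, up to nine rows computed after the last save would be processed again.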
        

In [None]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,cmd | grep java