<a href="https://colab.research.google.com/github/desireedisco/MSDS-Machine-Learning-Supervised/blob/main/2_LeakageTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Data leakage happens when the training data contains information about the target that will not be available when the model is used for prediction. In our case, certain phrases such as (Reuters) appear only in articles labeled real. There are also some phrases in the titles that are associated only with real titles. Similarly, a different set of phrases appears only on the fake side. None of these phrases has anything to do with the story or title. They are extra information that could bias the model.**
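
One quick way to look for this kind of leakage is to list tokens that show up in several documents of one class and never in the other. The sketch below is illustrative only: `class_exclusive_tokens`, `min_docs`, and the toy texts are hypothetical names and data that are not part of this notebook, and the letters-only tokenizer is an assumption. Labels follow the dataset's convention used later on (0 = real, 1 = fake).

```python
import re
from collections import Counter

def class_exclusive_tokens(texts, labels, min_docs=2):
    """Return, per class, tokens found in >= min_docs documents of that class
    and in no document of the other class."""
    doc_counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        # count each token once per document
        doc_counts[label].update(set(re.findall(r"[a-z]+", text.lower())))
    exclusive = {}
    for label, other in ((0, 1), (1, 0)):
        exclusive[label] = sorted(
            tok for tok, n in doc_counts[label].items()
            if n >= min_docs and tok not in doc_counts[other]
        )
    return exclusive

# Toy data (hypothetical): two "real" and two "fake" articles
toy_texts = [
    "WASHINGTON (Reuters) - The Senate passes budget bill",
    "LONDON (Reuters) - Markets rally after vote",
    "You will not believe what the Senate did",
    "Shocking video of markets after the vote",
]
toy_labels = [0, 0, 1, 1]

print(class_exclusive_tokens(toy_texts, toy_labels))
```

A token flagged this way is only a candidate for leakage; it still needs a manual look to decide whether it is a source tag or a legitimate content word.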

In [2]:
import pandas as pd
import numpy as np
import sklearn
import nltk
import re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [3]:
#read csv file
file =('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/WELFake_Dataset.csv')
data = pd.read_csv(file)
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [4]:
#drop dataset index column
data = data.drop(columns=['Unnamed: 0'])

#drop all null values
data = data.dropna().reset_index(drop=True)

#drop duplicate text stories
data = data.drop_duplicates(subset=['text'])

#count of real and fake
data['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,34620
1,27580


In [5]:
data.head()

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
2,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
3,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
4,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1


In [6]:
#load stop words
stop_words = set(stopwords.words('english'))

In [7]:
def clean_tokens(text):
  text_clean = text.lower()

  #get rid of web links
  text_clean = re.sub(r"https?://(?:www\.)?[a-zA-Z0-9./]+", '', text_clean)

  #get rid of #hashtags
  text_clean = re.sub(r"#\w+", '', text_clean)

  #get rid of @mentions
  text_clean = re.sub(r"@\w+", '', text_clean)

  #get rid of all digits
  text_clean = re.sub(r"\d+", '', text_clean)

  #replace non-word characters with a space
  text_clean = re.sub(r"\W", ' ', text_clean)
  #print(text_clean)
  tokens = word_tokenize(text_clean)
  text_list = [word for word in tokens if word not in stop_words and len(word)>1]
  token_to_text = ' '.join(text_list)

  return token_to_text

In [8]:
#testing my text to tokens function
text1 = data.loc[1062,'text']
print(text1)
print(clean_tokens(text1))

Ever since ISIS appeared in Iraq in 2014, both the policies and strategies coming out of Washington have ranged from confused to inept, as politicians and Pentagon officials spar over whether or not to cooperate with various Iranian-affiliated Shia militias and People s Mobilization Units (PMF), led by Hash d  al-Shaabi and Badr Organisation.The driving factor behind Washington s stance is the Israeli Lobby and Gulf state led by Saudi Arabia   who vocally oppose any US cooperation with Shia PMF s in Iraq. This lack of coherency has also helped alienate the Iraq government in Baghdad who appear to be less and less concerned with Washington s sectarian imposition and more concerned with closing-out the ISIS threat in Iraq.This dysfunctional US policy of exclusion in local operational partners on the ground may have helped to prolong the lifespan of ISIS in parts of Iraq. Washington s insistence on playing the sectarian card has led to its inability to openly cooperate with key players   

In [9]:
data.loc[:,'text_tokens_to_text'] = data.loc[:,'text'].apply(clean_tokens)

In [10]:
data.head()

Unnamed: 0,title,text,label,text_tokens_to_text
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,comment expected barack obama members movement...
1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,demonstrators gathered last night exercising c...
2,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,dozen politically active pastors came private ...
3,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,rs sarmat missile dubbed satan replace ss flie...
4,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,say one time someone sued southern poverty law...


In [12]:
# set my features and label
X = data[['text','text_tokens_to_text']]
y = data['label']
print(X.head())

                                                text  \
0  No comment is expected from Barack Obama Membe...   
1   Now, most of the demonstrators gathered last ...   
2  A dozen politically active pastors came here f...   
3  The RS-28 Sarmat missile, dubbed Satan 2, will...   
4  All we can say on this one is it s about time ...   

                                 text_tokens_to_text  
0  comment expected barack obama members movement...  
1  demonstrators gathered last night exercising c...  
2  dozen politically active pastors came private ...  
3  rs sarmat missile dubbed satan replace ss flie...  
4  say one time someone sued southern poverty law...  


In [13]:
# split data in train test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=29)

In [14]:
# vectorize the text data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(x_train['text_tokens_to_text'])
X_test = vectorizer.transform(x_test['text_tokens_to_text'])

In [15]:
# perform a test model to see if the erroneous data is biasing the model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

In [16]:
# the accuracy score is good, but I don't know how this model would perform in the real world.
logistic_model.score(X_test, y_test)
#0.9414790996784566

0.9414790996784566

In [17]:
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[8221,  464],
       [ 446, 6419]])

**In summary, the next section tests whether adding "(Reuters) - " to the beginning of fake news articles changes the model's predictions, which would indicate that the model has learned the leaked source tag rather than the content. The accuracy score and confusion matrix show how many previously correct predictions flip after the modification.**
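
The shape of this test can be sketched independently of the trained model. In the minimal example below, `prefix_flip_rate` and `toy_predict` are hypothetical names introduced for illustration; `toy_predict` is a stand-in keyword rule, not the notebook's logistic regression, so only the procedure carries over, not the numbers.

```python
def prefix_flip_rate(texts, predict, prefix, original_label=1):
    """Share of texts predicted as original_label whose prediction
    changes once `prefix` is prepended."""
    kept = [t for t in texts if predict(t) == original_label]
    if not kept:
        return 0.0
    flipped = sum(predict(prefix + t) != original_label for t in kept)
    return flipped / len(kept)

def toy_predict(text):
    """Stand-in classifier (assumption): 0 = real, 1 = fake."""
    lowered = text.lower()
    return 0 if "reuters" in lowered and "shocking" not in lowered else 1

toy_fake_texts = [
    "Shocking video shows the truth",
    "You will not believe this",
    "Secret plot revealed",
]

rate = prefix_flip_rate(toy_fake_texts, toy_predict, "(Reuters) - ")
print(f"{rate:.0%} of the toy fake articles flip to real")
```

Restricting the test to articles the model already classified correctly, as the cells below do, means every changed prediction can be attributed to the added prefix.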



In [18]:
#I am going to take the correctly labeled fake articles and add (Reuters) to them to see if the model mislabels them.
test_lst_fake = []
for text, prediction, label in zip(x_test['text'], y_pred, y_test):
  if prediction == 1 and label == 1:
    test_lst_fake.append({'text': text, 'prediction': prediction, 'label': label})

# this dataframe is of all the correctly labeled fake articles
test_df_fake = pd.DataFrame(test_lst_fake)
print(test_df_fake.head())
print(test_df_fake.shape)

                                                text  prediction  label
0  Posted 10/31/2016 3:17 pm by PatriotRising wit...           1      1
1  The moral decay continues The Kapiolani Medica...           1      1
2  No wonder she didn t want anyone to see her sp...           1      1
3  Miley Cyrus Crying Over Trump Victory (Video) ...           1      1
4  A “gateway” between Western and Eastern Europe...           1      1
(6419, 3)


In [19]:
#add "(Reuters) - " to the beginning of the text in the new test df, pretending the fake news is real news
test_df_fake['text'] = test_df_fake['text'].apply(lambda text: "(Reuters) - " + text)

In [20]:
#create tokens
test_df_fake.loc[:,'text_tokens_to_text'] = test_df_fake.loc[:,'text'].apply(clean_tokens)

In [21]:
#after (Reuters) is added
test_df_fake.head()

Unnamed: 0,text,prediction,label,text_tokens_to_text
0,(Reuters) - Posted 10/31/2016 3:17 pm by Patri...,1,1,reuters posted pm patriotrising comments democ...
1,(Reuters) - The moral decay continues The Kapi...,1,1,reuters moral decay continues kapiolani medica...
2,(Reuters) - No wonder she didn t want anyone t...,1,1,reuters wonder want anyone see speeches singin...
3,(Reuters) - Miley Cyrus Crying Over Trump Vict...,1,1,reuters miley cyrus crying trump victory video...
4,(Reuters) - A “gateway” between Western and Ea...,1,1,reuters gateway western eastern europe lies ch...


**After running the fake articles with (Reuters) added through the logistic model, a large number of articles were misclassified. So correctly labeled fake articles can be labeled real if we add (Reuters).**

**About 8.7% (561 of 6,419) were mislabeled after adding (Reuters).**

In [22]:
# vectorize the new dataframe of altered fake articles
# TF-IDF is a bag-of-words representation, so where (Reuters) appears in the text doesn't matter
y_test_fake_leakage = test_df_fake['label']
X_test_fake_leakage = vectorizer.transform(test_df_fake['text_tokens_to_text'])
y_pred_fake_leakage = logistic_model.predict(X_test_fake_leakage)
print(logistic_model.score(X_test_fake_leakage, y_test_fake_leakage))
print(confusion_matrix(y_test_fake_leakage, y_pred_fake_leakage))

0.9126032092226204
[[   0    0]
 [ 561 5858]]


**Then we do the opposite and take correctly labeled real articles and strip the (Reuters) reference out.**

**In essence, this code segment tests the logistic_model on a dataset (test_df_real) from which the "(Reuters)" dateline has been removed, to examine how much the model's correct predictions depend on that leaked tag.**

In [23]:
# in this experiment I take articles that were correctly labeled real and strip the (Reuters) information off to see if they will be misclassified
test_lst_real = []
for text, prediction, label in zip(x_test['text'], y_pred, y_test):
  if prediction == 0 and label == 0:
    test_lst_real.append({'text': text, 'prediction': prediction, 'label': label})
# this is a dataframe of all articles correctly predicted as real
test_df_real = pd.DataFrame(test_lst_real)
print(test_df_real.head())
print(test_df_real.shape)

                                                text  prediction  label
0  The reclusive leader of the Taliban hasn't bee...           0      0
1  HONG KONG (Reuters) - Four Hong Kong pro-democ...           0      0
2  Rep. Jared Huffman ( ) has announced that he w...           0      0
3  How Candidates Announce Can Say A Lot About Th...           0      0
4  WASHINGTON (Reuters) - President Barack Obama ...           0      0
(8221, 3)


**Then I use regex to take out the (Reuters) reference**

In [24]:
# I am stripping (Reuters) and other similar patterns involving Reuters using regex
pattern_reu = r"^[“”‘’A-Za-z.,$&()/\-:;0-9  ]*\(Reuters\) - "
test_df_real.loc[:,'text'] = test_df_real.loc[:,'text'].replace(to_replace=pattern_reu, value='', regex=True)

pattern_reu2 = r"^[“”‘’A-Za-z.,$&()/\-:;0-9  ]*\(Reuters\)  —  "
test_df_real.loc[:,'text'] = test_df_real.loc[:,'text'].replace(to_replace=pattern_reu2, value='', regex=True)

pattern_reu3 = r"^[“”‘’A-Za-z.,$&()/\-:;0-9  ]*\(Reuters\)\) - "
test_df_real.loc[:,'text'] = test_df_real.loc[:,'text'].replace(to_replace=pattern_reu3, value='', regex=True)

pattern_reu4 = r"^\(Reuters\)"
test_df_real.loc[:,'text'] = test_df_real.loc[:,'text'].replace(to_replace=pattern_reu4, value='', regex=True)

print(test_df_real.head(10))

                                                text  prediction  label
0  The reclusive leader of the Taliban hasn't bee...           0      0
1  Four Hong Kong pro-democracy lawmakers who wer...           0      0
2  Rep. Jared Huffman ( ) has announced that he w...           0      0
3  How Candidates Announce Can Say A Lot About Th...           0      0
4  President Barack Obama said he could envision ...           0      0
5  “For I delivered to you as of first importance...           0      0
6  The Arctic Ocean may seem remote and forbiddin...           0      0
7  France said on Tuesday it wanted the United Na...           0      0
8  Republican presidential candidate Ted Cruz sai...           0      0
9  U.S. President-elect Donald Trump must divest ...           0      0


In [26]:
# check whether any text still starts with (Reuters); str.match only matches at the beginning of each string, so mentions later in the text are not detected
pattern_reu_corr = r"\(Reuters\)"
match_lst = test_df_real['text'].str.match(pattern_reu_corr)
df_filtered = test_df_real[match_lst]
print(df_filtered.head())

Empty DataFrame
Columns: [text, prediction, label]
Index: []


In [27]:
#create tokens
test_df_real.loc[:,'text_tokens_to_text'] = test_df_real.loc[:,'text'].apply(clean_tokens)

**Only a small share of the correctly labeled real articles (63 of 8,221, about 0.8%) were mislabeled, so the bigger issue is fake articles being labeled as real when (Reuters) is added to the text.**

In [None]:
# vectorize the df and run predictions
y_test_real_leakage = test_df_real['label']
X_test_real_leakage = vectorizer.transform(test_df_real['text_tokens_to_text'])
y_pred_real_leakage = logistic_model.predict(X_test_real_leakage)
print(logistic_model.score(X_test_real_leakage, y_test_real_leakage))
print(confusion_matrix(y_test_real_leakage, y_pred_real_leakage))

0.9923366986984552
[[8158   63]
 [   0    0]]
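
To put the two experiments side by side, the flip rates can be computed directly from the confusion-matrix counts printed above. This is a small arithmetic sketch; `flip_rate` is a hypothetical helper, and the four counts are copied from the notebook's outputs.

```python
def flip_rate(flipped, unchanged):
    """Fraction of previously correct predictions that changed after the edit."""
    return flipped / (flipped + unchanged)

# Counts copied from the two confusion matrices printed in this notebook
fake_with_tag_added = flip_rate(flipped=561, unchanged=5858)
real_with_tag_removed = flip_rate(flipped=63, unchanged=8158)

print(f"fake -> real after adding (Reuters):   {fake_with_tag_added:.1%}")
print(f"real -> fake after removing (Reuters): {real_with_tag_removed:.1%}")
```

Adding the tag flips roughly ten times as large a share of predictions as removing it, which is why the fake-to-real direction is the bigger concern.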



**I concluded that we need to remove (Reuters) and other similar phrases that are attributable to only one class. A source tag would be useful information and could be left in the dataset if sources were provided for all data items, but that is not the case here.**
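
One possible way to apply this to the whole dataset before the train/test split is a single cleaning function. The sketch below is an assumption, not the notebook's method: `strip_reuters_dateline` and `DATELINE` are hypothetical names, and the pattern is a simplified stand-in for the four patterns used above (up to 80 characters of dateline text, then "(Reuters)", then an optional dash).

```python
import re

# Simplified, assumed pattern: a short dateline prefix, "(Reuters)",
# an optional stray ")", and an optional hyphen or em dash (\u2014).
DATELINE = re.compile(r"^[^\n]{0,80}?\(Reuters\)\)?\s*[-\u2014]?\s*")

def strip_reuters_dateline(text):
    """Remove a leading Reuters dateline; leave other text unchanged."""
    return DATELINE.sub("", text, count=1)

examples = [
    "WASHINGTON (Reuters) - President Barack Obama said",
    "(Reuters) - Four lawmakers",
    "The moral decay continues",
]
for example in examples:
    print(strip_reuters_dateline(example))
```

Because the prefix is matched loosely, an article that merely mentions "(Reuters)" within its first 80 characters would also be trimmed, so the result is worth spot-checking before retraining.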



