# **Fake News Data Preprocessing**

# **Datasets:**

`True.csv:` Contains true news articles.

`Fake.csv:` Contains fake news articles.

Load these datasets and combine them into a single dataset with a target variable indicating whether each article is true or fake.
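As a minimal sketch of this step, assuming both files share the same columns, the two frames can be labelled and stacked with `pd.concat`. The inline CSV strings below are made-up stand-ins for `True.csv` and `Fake.csv`, and the names `true_sample`, `fake_sample`, and `combined` are illustrative only.

```python
import io

import pandas as pd

# Tiny made-up stand-ins for True.csv and Fake.csv (illustrative data only).
true_csv = io.StringIO("title,text\nReal headline,Real body\n")
fake_csv = io.StringIO("title,text\nFake headline,Fake body\nOther fake,Other body\n")

true_sample = pd.read_csv(true_csv)
fake_sample = pd.read_csv(fake_csv)

# Target variable: 1 marks a true article, 0 marks a fake one.
true_sample["label"] = 1
fake_sample["label"] = 0

# Stack the rows and rebuild a clean 0..n-1 index.
combined = pd.concat([true_sample, fake_sample], ignore_index=True)
print(combined["label"].value_counts().to_dict())
```

Passing `ignore_index=True` is one way to avoid repeated index labels after stacking; calling `reset_index(drop=True)` on the result has the same effect.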

# **Data Preprocessing**

Steps for preprocessing the text data:

`Loading Data:` Combine True.csv and Fake.csv into a single DataFrame.

`Tokenization:` Splitting text into individual words or tokens.

`Removing Stop Words:` Removing common words that carry little meaning on their own (e.g., "and", "the").

`Stemming/Lemmatization:` Reducing words to their base or root form.
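The three text steps above can be sketched on a single sentence using only the standard library. This is a toy illustration, not the NLTK pipeline used in this notebook: the stop-word list, the suffix rules, and every `toy_*` name are assumptions made up for the example.

```python
import re

# Toy stop-word list for illustration only; NLTK and spaCy ship far larger ones.
TOY_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}


def toy_tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_remove_stop_words(tokens):
    """Drop tokens that appear in the toy stop-word list."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_preprocess(text):
    """Run the three steps in order and return the cleaned tokens."""
    tokens = toy_tokenize(text)
    tokens = toy_remove_stop_words(tokens)
    return [toy_stem(t) for t in tokens]


result = toy_preprocess("The senators are debating the proposed budgets")
print(result)  # ['senator', 'debat', 'propos', 'budget']
```

Stems such as `debat` are not dictionary words, which is the usual trade-off of stemming; a lemmatizer returns dictionary forms instead.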

In [32]:
# Imports
import re
import string
from collections import Counter

import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC




In [33]:
# Checking the data
true_df = pd.read_csv(r'C:\Users\GGPC\IoD_Mini_Projects\Mini_Project_3\data\raw\True.csv')
fake_df = pd.read_csv(r'C:\Users\GGPC\IoD_Mini_Projects\Mini_Project_3\data\raw\Fake.csv')

true_df.sample(10), fake_df.sample(10)

(                                                   title  \
 11765  Mexican governor requests leave to run for pre...   
 10315  South Carolina Governor Haley supports Cruz in...   
 12872  Iraq demands U.S. backtrack on Jerusalem, summ...   
 3178   White House does not yet have plan on debt lim...   
 15665  Riding high, Xi looks to soothe Trump as U.S. ...   
 3874   Trump to make decision on Paris climate pact a...   
 1232   Republican Senator Corker blasts Trump for 'ca...   
 19208  North Korea says rockets to U.S. 'inevitable' ...   
 15995  Russia invites Syrian Kurds to people's congre...   
 16285  U.N. special envoy urges Poland to open up deb...   
 
                                                     text       subject  \
 11765  MEXICO CITY (Reuters) - The governor of Nuevo ...     worldnews   
 10315  WASHINGTON (Reuters) - South Carolina Governor...  politicsNews   
 12872  BAGHDAD (Reuters) - Iraq demanded on Thursday ...     worldnews   
 3178   WASHINGTON (Reuters

In [34]:
# Label the articles (1 = true, 0 = fake) and combine them into one DataFrame
true_df['label'] = 1
fake_df['label'] = 0

df = pd.concat([true_df, fake_df]).reset_index(drop=True)

df

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1
...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [36]:
df.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

In [37]:
df.duplicated().sum()

209

In [38]:
df = df.drop_duplicates()

In [39]:
df.duplicated().sum()

0
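
One detail worth noting: `drop_duplicates` keeps the first occurrence of each row and leaves the original index labels in place, so the deduplicated frame can have fewer rows than its largest index label suggests. A small sketch on made-up data (the frame and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows; the second and third rows are exact duplicates.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "label": [1, 0, 0, 0],
    }
)

deduped = sample.drop_duplicates()

# The surviving rows keep their original index labels, so label 2 is missing.
print(list(deduped.index))

# Optional: renumber the rows 0..n-1 if positional access matters later.
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))
```

Whether to renumber is a design choice: keeping the original labels makes it possible to trace rows back to the combined frame, while a fresh index is simpler for positional work.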

In [40]:
df

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1
...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0


The spaCy-based cleaning below is commented out because it takes too long to run on this dataset.

In [41]:
# nlp = spacy.load('en_core_web_sm')

# def convert_text(text):
#     text = re.sub(r"'", "", text)
#     text = re.sub(r"[^a-zA-Z\s]", "", text)
#     text = re.sub(r"\b[a-zA-Z]\b", "", text)
#     text = re.sub(r"\d+", "", text)
#     text = re.sub(r"\s+", " ", text)
    
#     doc = nlp(text)
    
#     cleaned_text = []
#     for token in doc:
#         if token.text in STOP_WORDS:
#             continue
#         if token.is_alpha:
#             cleaned_text.append(token.lemma_.lower())
#         else:
#             cleaned_text.append(token.text)
#     return ' '.join(cleaned_text)

# df_clean = df.copy()
# df_clean['clean_title'] = df_clean['title'].apply(convert_text)
# df_clean['clean_text'] = df_clean['text'].apply(convert_text)

# df_clean

## Cleaning Texts and Title

In [42]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


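def tokenize(text):
    """Lowercase, tokenize, filter, and lemmatize a string; return the cleaned text."""
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens only, drop stop words and the 'reuters' dateline tag
    # seen in the true articles, then lemmatize what remains.
    tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word.isalpha() and word not in stop_words and word != 'reuters'
    ]
    return ' '.join(tokens)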

df_clean = df.copy()

df_clean['clean_title'] = df_clean['title'].apply(tokenize)
df_clean['clean_text'] = df_clean['text'].apply(tokenize)

df_clean

Unnamed: 0,title,text,subject,date,label,clean_title,clean_text
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1,budget fight loom republican flip fiscal script,washington head conservative republican factio...
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1,military accept transgender recruit monday pen...,washington transgender people allowed first ti...
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1,senior republican senator mueller job,washington special counsel investigation link ...
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1,fbi russia probe helped australian diplomat nyt,washington trump campaign adviser george papad...
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1,trump want postal service charge amazon shipment,president donald trump called postal service f...
...,...,...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0,mcpain john mccain furious iran treated u sail...,century wire say reported earlier week unlikel...
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0,justice yahoo settle privacy lawyer user,century wire say familiar theme whenever dispu...
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0,sunnistan u allied safe zone plan take territo...,patrick henningsen century wireremember obama ...
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0,blow million al jazeera america finally call q...,century wire say al jazeera america go history...


In [43]:
df_clean['clean_text']

0        washington head conservative republican factio...
1        washington transgender people allowed first ti...
2        washington special counsel investigation link ...
3        washington trump campaign adviser george papad...
4        president donald trump called postal service f...
                               ...                        
44893    century wire say reported earlier week unlikel...
44894    century wire say familiar theme whenever dispu...
44895    patrick henningsen century wireremember obama ...
44896    century wire say al jazeera america go history...
44897    century wire say predicted new year look ahead...
Name: clean_text, Length: 44689, dtype: object

## Extracting Features

In [44]:
df_clean['clean_text'] = df_clean['clean_text'].astype(str)
df_clean['clean_title'] = df_clean['clean_title'].astype(str)

In [45]:
nlp = spacy.load('en_core_web_sm')

def count_features(text):
    """Return length and part-of-speech counts for one string as a dict."""
    doc = nlp(text)
    # Count tokens per coarse part-of-speech tag (spaCy's token.pos_).
    pos_counts = Counter(token.pos_ for token in doc)

    char_count = len(text)
    word_count = len(text.split())
    # Average characters per whitespace-separated word (0 for empty text).
    word_density = char_count / word_count if word_count else 0

    features = {
        'char_count': char_count,
        'word_count': word_count,
        'word_density': word_density,
        'adj_count': pos_counts.get('ADJ', 0),
        'adv_count': pos_counts.get('ADV', 0),
        'noun_count': pos_counts.get('NOUN', 0),
        'pron_count': pos_counts.get('PRON', 0),
        'propn_count': pos_counts.get('PROPN', 0),
        'verb_count': pos_counts.get('VERB', 0)
    }

    return features

# Expand each feature dict into columns; the result is already a DataFrame.
text_features_df = df_clean['clean_text'].apply(count_features).apply(pd.Series)
title_features_df = df_clean['clean_title'].apply(count_features).apply(pd.Series)
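
The length-based features need no language model. The following sketch, using a hypothetical helper `length_features` and a made-up input string, shows how `char_count`, `word_count`, and `word_density` are derived; the part-of-speech counts are left out because they depend on spaCy's tagger.

```python
def length_features(text):
    """Return the three length-based features for one string."""
    char_count = len(text)          # counts every character, spaces included
    word_count = len(text.split())  # whitespace-separated tokens
    # Average characters per word; guard against empty strings.
    word_density = char_count / word_count if word_count else 0
    return {
        "char_count": char_count,
        "word_count": word_count,
        "word_density": word_density,
    }


example = length_features("budget fight loom")
print(example)

# Empty input falls back to zeros instead of raising ZeroDivisionError.
empty = length_features("")
print(empty)
```

Because `char_count` includes the spaces between words, `word_density` slightly overstates the average word length.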

## Saving CSVs for EDA

In [46]:
text_features_df.to_csv(r'C:\Users\GGPC\IoD_Mini_Projects\Mini_Project_3\data\raw\text_features.csv')
title_features_df.to_csv(r'C:\Users\GGPC\IoD_Mini_Projects\Mini_Project_3\data\raw\title_features.csv')
df_clean.to_csv(r'C:\Users\GGPC\IoD_Mini_Projects\Mini_Project_3\data\raw\Clean.csv')
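
When these files are read back for EDA, the index written by `to_csv` shows up as an extra `Unnamed: 0` column unless it is handled. A minimal sketch of one option, passing `index=False`, using a temporary directory and a made-up frame (both are placeholders, not the project's paths or data):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the cleaned data.
sample = pd.DataFrame({"clean_title": ["budget fight"], "label": [1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "clean_sample.csv"

    # index=False keeps the row index out of the file.
    sample.to_csv(out_path, index=False)

    # Reading the file back yields only the original columns.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

If the original index labels are needed later, writing the index and reading it back with `index_col=0` in `pd.read_csv` is the alternative.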