## Data Cleaning and Preprocessing Notebook

This notebook is strictly for data cleaning and preprocessing. Steps:

1. Read the dataset.
2. Handle missing values (if any).
3. Create visualizations as required.
4. Explore the data.
5. Save the cleaned and processed dataset as `data/final_dataset.csv`.
6. Split the dataset obtained in step 5 into `input/train.csv`, `input/test.csv`, and `input/validation.csv`.

NO MODELLING WILL BE DONE IN THIS NOTEBOOK!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import re
import string
import nltk # The stop-word corpus and tokenizer models must be downloaded once via nltk.download()
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [3]:
from sklearn.model_selection import StratifiedShuffleSplit


In [4]:
df=pd.read_csv('../data/TARP_Project_Final_Dataset.csv')

In [5]:
df.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [6]:
df.shape

(61144, 3)

In [7]:
df['label'].value_counts()/len(df['label'])

REAL    0.523224
FAKE    0.476776
Name: label, dtype: float64
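
The same class proportions can be obtained directly with `value_counts(normalize=True)`. A minimal sketch on a toy frame (the `toy` data below is made up for illustration and is not drawn from the dataset):

```python
import pandas as pd

# Hypothetical toy labels, for illustration only
toy = pd.DataFrame({"label": ["REAL", "FAKE", "REAL", "REAL"]})

# normalize=True returns fractions instead of raw counts
proportions = toy["label"].value_counts(normalize=True)
print(proportions)
```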

Before wrangling the data, let's glance at one example title from each class.

In [8]:
print("Tagged REAL:",df[df['label']=="REAL"]['title'].values[0])
print("Tagged FAKE:",df[df['label']=="FAKE"]['title'].values[0])

Tagged REAL: Kerry to go to Paris in gesture of sympathy
Tagged FAKE: You Can Smell Hillary’s Fear


In [9]:
df

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...
61139,Powell pushes diplomacy for N. Korea,WASHINGTON -- Outgoing Secretary of State Coli...,REAL
61140,Void is filled with Clement,With the supply of attractive pitching options...,REAL
61141,Martinez leaves bitter,Like Roger Clemens did almost exactly eight ye...,REAL
61142,5 of arthritis patients in Singapore take Bext...,SINGAPORE : Doctors in the United States have ...,REAL


In [10]:
stops=set(stopwords.words('english'))

In [11]:
stops

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [12]:
def clean_text(text):
    """
    Cleans and preprocesses strings for NLP based tasks

    Args:
        text (str): Text that needs to be cleaned
    
    Returns:
        cleaned_text (str): The given text after cleaning and preprocessing
    """
    cleaned_text=text.strip() # Remove leading and trailing whitespace
    cleaned_text=cleaned_text.lower() # Convert the stripped text to lowercase
    cleaned_text=word_tokenize(cleaned_text) # Split into tokens
    cleaned_text=[word for word in cleaned_text if word not in stops] # Drop English stop words
    cleaned_text=" ".join(cleaned_text)
    cleaned_text=cleaned_text.replace("\n"," ") # Replace new lines with a space
    cleaned_text=re.sub(r'http[s]*://[A-Za-z0-9:./?=]*','url',cleaned_text) # Replace URLs with the token 'url'
    cleaned_text=re.sub(r"[^a-z\s]",'',cleaned_text) # Remove anything that is not a lowercase letter or whitespace
    cleaned_text=re.sub(r'[\s]+'," ",cleaned_text) # Collapse runs of whitespace into a single space
    return cleaned_text
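
One thing to watch in `clean_text` is the order of operations: the URL substitution and the newline replacement run after tokenization and stop-word removal, so they see text that has already been split and re-joined. The sketch below shows an alternative ordering that replaces URLs first. It is an illustration, not the notebook's method: it swaps `word_tokenize` for a plain `str.split`, replaces non-letters with a space instead of deleting them, and uses a tiny hypothetical stop-word set (`DEMO_STOPS`) so that it runs without NLTK data.

```python
import re

# Hypothetical stand-in for the NLTK stop-word set, kept tiny for the demo
DEMO_STOPS = {"the", "is", "at", "a", "it", "s"}

def clean_text_sketch(text):
    """Illustrative variant of clean_text: URLs are handled before splitting."""
    cleaned = text.strip().lower()
    cleaned = re.sub(r"https?://\S+", " url ", cleaned)  # replace URLs first
    cleaned = re.sub(r"[^a-z\s]", " ", cleaned)          # non-letters become spaces
    tokens = [w for w in cleaned.split() if w not in DEMO_STOPS]
    return " ".join(tokens)

result = clean_text_sketch("  The report is at https://example.com/a-b?x=1 \n It's NEW!  ")
print(result)
```

Because `str.split` with no arguments discards all whitespace, the result has no leading, trailing, or repeated spaces, and no separate newline step is needed.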

In [13]:
" hello    hi".strip()

'hello    hi'

In [14]:
re.findall(r'http[s]*://[A-Za-z0-9:./?=]*',"https://t.co/VyTT49YvoE pic.twitter.com/wCvSCg4a5I But I am. this is my phone")

['https://t.co/VyTT49YvoE']
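
The character class in this pattern does not include `-`, `_`, `&`, `%`, or `#`, so a URL containing any of them is matched only up to that character, and scheme-less links such as `pic.twitter.com/...` are not matched at all, as the output above shows. A possible broader pattern, offered as an assumption rather than what the notebook uses, is to match everything up to the next whitespace:

```python
import re

sample = "see https://example.com/a-b_c?x=1&y=2 for details"  # made-up URL

narrow = re.findall(r"http[s]*://[A-Za-z0-9:./?=]*", sample)  # pattern from the notebook
broad = re.findall(r"https?://\S+", sample)                   # assumed alternative

print(narrow)
print(broad)
```

The trade-off is that `\S+` also swallows punctuation that directly follows a URL, such as a closing parenthesis or a full stop.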

In [15]:
df.sample(5,random_state=10)['title'].apply(clean_text)

37412    watch sean spicer fails spectacularly asked tr...
54257                         hewitt cruises quarterfinals
30131    tillerson warns region using lebanon proxy con...
38805    trump list unreported terror attacks absurd ci...
238      israel stole classified us information used he...
Name: title, dtype: object

In [16]:
df['text'].values[0]



In [17]:
clean_text(df['text'].values[0])



In [18]:
df['title']=df['title'].apply(clean_text)
df['text']=df['text'].apply(clean_text)
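
`clean_text` calls string methods on its argument, so a missing value in either column would raise an `AttributeError` in this cell. The missing-value step from the checklist could be handled before it. A minimal sketch, assuming empty strings are an acceptable placeholder and using a made-up frame:

```python
import pandas as pd

# Made-up frame with one missing title, for illustration only
toy = pd.DataFrame({"title": ["Hello World", None], "text": ["Some TEXT", "More text"]})

# Replace missing values with empty strings so string methods are safe to call
toy[["title", "text"]] = toy[["title", "text"]].fillna("")

# str.lower stands in for clean_text to keep the sketch free of NLTK
toy["title"] = toy["title"].apply(str.lower)
print(toy)
```

Rows filled this way end up as empty strings, so a later `!= ""` filter would still drop them.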

In [31]:
df=df[df['text']!=""]

In [32]:
df

Unnamed: 0,title,text,label
0,smell hillary fear,daniel greenfield shillman journalism fellow f...,FAKE
1,watch exact moment paul ryan committed politic...,google pinterest digg linkedin reddit stumbleu...,FAKE
2,kerry go paris gesture sympathy,us secretary state john f kerry said monday st...,REAL
3,bernie supporters twitter erupt anger dnc we t...,kaydee king kaydeeking november lesson tonigh...,FAKE
4,battle new york primary matters,s primary day new york frontrunners hillary cl...,REAL
...,...,...,...
61139,powell pushes diplomacy n korea,washington outgoing secretary state colin l po...,REAL
61140,void filled clement,supply attractive pitching options dwindling d...,REAL
61141,martinez leaves bitter,like roger clemens almost exactly eight years ...,REAL
61142,arthritis patients singapore take bextra cele...,singapore doctors united states warned painkil...,REAL


In [33]:
df=df[(df['title']!=" ") & (df['title']!="")]

In [34]:
X=['title','text']
y='label'

In [35]:
df

Unnamed: 0,title,text,label
0,smell hillary fear,daniel greenfield shillman journalism fellow f...,FAKE
1,watch exact moment paul ryan committed politic...,google pinterest digg linkedin reddit stumbleu...,FAKE
2,kerry go paris gesture sympathy,us secretary state john f kerry said monday st...,REAL
3,bernie supporters twitter erupt anger dnc we t...,kaydee king kaydeeking november lesson tonigh...,FAKE
4,battle new york primary matters,s primary day new york frontrunners hillary cl...,REAL
...,...,...,...
61139,powell pushes diplomacy n korea,washington outgoing secretary state colin l po...,REAL
61140,void filled clement,supply attractive pitching options dwindling d...,REAL
61141,martinez leaves bitter,like roger clemens almost exactly eight years ...,REAL
61142,arthritis patients singapore take bextra cele...,singapore doctors united states warned painkil...,REAL


In [36]:
df.isnull().sum()

title    0
text     0
label    0
dtype: int64
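
`isnull()` only counts `NaN`/`None`, so rows whose title or text was reduced to an empty or whitespace-only string by cleaning are not reported here; that is why the explicit string filters in the earlier cells are needed. A sketch of one way to surface such rows in the same report, on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"title": ["kerry paris", " ", ""], "label": ["REAL", "FAKE", "REAL"]})

missing_before = int(toy["title"].isnull().sum())  # blank strings are not counted

# Treat empty or whitespace-only strings as missing
blank = toy["title"].str.strip() == ""
toy.loc[blank, "title"] = np.nan

missing_after = int(toy["title"].isnull().sum())
toy = toy.dropna(subset=["title"])
print(missing_before, missing_after, len(toy))
```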

In [23]:
def stratified_shuffle_split(df,X, y,test_size=0.3):
    """
    Generates a train and a test dataset from the parent dataset using stratified sampling
    (the proportion of labels remains the same in both parts).

    Args:
        df (pandas.DataFrame): The pandas DataFrame object containing your data
        X (list/str): A string or list containing column(s) being used as features
        y (list/str): A string or list of target column(s)
        test_size (float, optional): Number between 0.0 and 1.0 indicating the proportion of the test data. Defaults to 0.3.

    Returns:
        (df_train,df_test): A tuple containing the given DataFrame split into two stratified DataFrames
    """
    # Only one split is needed (the default n_splits=10 would compute ten and keep the last)
    stratify=StratifiedShuffleSplit(n_splits=1,test_size=test_size,random_state=42)
    train_index, test_index=next(stratify.split(df[X],df[y]))
    df_train=df.iloc[train_index]
    df_test=df.iloc[test_index]
    return (df_train,df_test)
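
For a single stratified split, scikit-learn's `train_test_split` with the `stratify` argument is a commonly used shortcut. The sketch below is an alternative to the function above, not something this notebook uses; the toy frame is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced data: 20 REAL and 20 FAKE rows
toy = pd.DataFrame({
    "title": [f"title {i}" for i in range(40)],
    "label": ["REAL"] * 20 + ["FAKE"] * 20,
})

# stratify keeps the REAL/FAKE ratio the same in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=42
)

print(toy_train["label"].value_counts())
print(toy_test["label"].value_counts())
```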

In [37]:
train,test=stratified_shuffle_split(df,X,y, test_size=0.35)
test,validation=stratified_shuffle_split(test,X,y,test_size=0.5)

In [38]:
print(train.shape)
print(test.shape)
print(validation.shape)

(39737, 3)
(10698, 3)
(10699, 3)
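
These shapes follow from the two split fractions: 35% of the rows are held out, and that hold-out is then halved, giving roughly 65% / 17.5% / 17.5%. The arithmetic below reproduces the row counts, assuming the fractional test size is rounded up to a whole number of rows (an assumption that matches the printed shapes):

```python
import math

total = 39737 + 10698 + 10699             # rows remaining after filtering, from the shapes above

held_out = math.ceil(0.35 * total)        # first split: 35% held out
n_train = total - held_out

n_validation = math.ceil(0.5 * held_out)  # second split: half of the hold-out
n_test = held_out - n_validation

print(n_train, n_test, n_validation)
```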


In [39]:
train.isnull().sum()

title    0
text     0
label    0
dtype: int64

In [40]:
train.to_csv('../input/train.csv')
test.to_csv('../input/test.csv')
validation.to_csv("../input/validation.csv")
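
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with a plain `read_csv`. If the original row index is not needed downstream (an assumption; later notebooks may rely on it), passing `index=False` avoids this. A small round-trip sketch using in-memory buffers instead of the real files:

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"], "label": ["REAL", "FAKE"]})

with_index = io.StringIO()
toy.to_csv(with_index)                    # default: the index is written too
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # index left out of the file
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```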