# Data cleansing

#### This **.ipynb** file removes unnecessary columns, converts the text to lowercase, and removes punctuation and stopwords.

Removes the 'date' and 'title' columns, leaving only 'text' and 'label'
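
The cell that drops these columns is not shown in this export. A minimal sketch of how it might look, assuming the raw frame has `title`, `text`, `date` and `label` columns (the toy rows below are placeholders, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the raw dataset; column names are assumed from the description above.
raw = pd.DataFrame({
    "title": ["headline a", "headline b"],
    "text": ["Body of article a.", "Body of article b."],
    "date": ["2017-01-01", "2017-01-02"],
    "label": ["real", "fake"],
})

# Drop the columns that are not used downstream, keeping only text and label.
trimmed = raw.drop(columns=["date", "title"])
print(trimmed.shape)  # (2, 2)
```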

In [23]:
data.shape

(62222, 2)

In [24]:
from sklearn.utils import shuffle
data = shuffle(data)
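
`shuffle` draws a new random order on every run, so the rows shown by `data.head()` will differ between runs. If a repeatable order is wanted, `sklearn.utils.shuffle` accepts a `random_state` argument. A small sketch on a toy frame (the frame contents and the seed value 42 are arbitrary assumptions):

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy frame standing in for the real data.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "label": ["real", "fake", "real", "fake"],
})

# The same seed yields the same row order on every call.
first = shuffle(toy, random_state=42)
second = shuffle(toy, random_state=42)

print(first.index.tolist() == second.index.tolist())  # True
print(sorted(first.index.tolist()))                    # [0, 1, 2, 3]
```

As in the notebook, the original index labels are kept, so they can still be used to trace a row back to its source position.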

In [25]:
data.head()

Unnamed: 0,text,label
5332,['Libertarian presidential nominee Gary Johnso...,real
6740,['Attorney General Jeff Sessions raised concer...,real
8058,WASHINGTON (Reuters) - A behind-the-scenes con...,real
19365,DUBAI (Reuters) - A Saudi cleric who said wome...,real
12296,,fake


Converts the 'text' column to lowercase

In [26]:
data['text'] = data['text'].str.lower()

In [27]:
data.head()

Unnamed: 0,text,label
5332,['libertarian presidential nominee gary johnso...,real
6740,['attorney general jeff sessions raised concer...,real
8058,washington (reuters) - a behind-the-scenes con...,real
19365,dubai (reuters) - a saudi cleric who said wome...,real
12296,,fake


Removes punctuation such as , . ( ) and - from the 'text' column

In [28]:
from string import punctuation

def remove_punctuation(text):
    temp_list = []
    for char in text:
        if char not in punctuation:
            temp_list.append(char)
    return ''.join(temp_list)

# Alternatively, as a list comprehension:
# def remove_punctuation(text):
#     return ''.join([char for char in text if char not in punctuation])

data['text'] = data['text'].apply(remove_punctuation)

In [29]:
data.head()

Unnamed: 0,text,label
5332,libertarian presidential nominee gary johnson ...,real
6740,attorney general jeff sessions raised concerns...,real
8058,washington reuters a behindthescenes congress...,real
19365,dubai reuters a saudi cleric who said women s...,real
12296,,fake
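
Deleting punctuation outright also joins hyphenated words, which is why `behind-the-scenes` becomes `behindthescenes` in the output above. If that is not wanted, one option is to replace each punctuation character with a space and then collapse repeated whitespace. The sketch below uses `str.translate`; the function name `replace_punctuation` is hypothetical and not part of the notebook:

```python
from string import punctuation

# Map every ASCII punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans(punctuation, " " * len(punctuation))

def replace_punctuation(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(replace_punctuation("washington (reuters) - a behind-the-scenes"))
# washington reuters a behind the scenes
```

It could be applied in the same way as the original function, with `data['text'].apply(replace_punctuation)`. Note that `string.punctuation` only covers ASCII characters, so curly quotes and similar symbols are left in place by both versions.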


Removes stopwords such as 'as', 'at', 'be' and 'both' from the 'text' column

In [30]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Store the list as a set so that the membership test in remove_stopwords is fast.
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/belis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
def remove_stopwords(text):
    temp_list = []
    for word in text.split():
        if word not in stopwords:
            temp_list.append(word)
    return ' '.join(temp_list)

# Alternatively, as a list comprehension:
# def remove_stopwords(text):
#     return ' '.join([word for word in text.split() if word not in stopwords])

data['text'] = data['text'].apply(remove_stopwords)

In [32]:
data.head()

Unnamed: 0,text,label
5332,libertarian presidential nominee gary johnson ...,real
6740,attorney general jeff sessions raised concerns...,real
8058,washington reuters behindthescenes congression...,real
19365,dubai reuters saudi cleric said women drive br...,real
12296,,fake
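
The three steps above can be combined into a single helper, which makes it easy to check the behaviour on one string before running it over the whole column. This is only a sketch: `clean_text` is a hypothetical name, and the tiny `STOP_WORDS` set stands in for the NLTK English list so that the example runs without downloading anything.

```python
from string import punctuation

# Small stand-in for stopwords.words('english'); the real NLTK list is much longer.
STOP_WORDS = {"a", "who", "the", "at", "as", "be", "both"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, then drop stop words."""
    text = text.lower()
    text = "".join(char for char in text if char not in punctuation)
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_text("DUBAI (Reuters) - A Saudi cleric who said"))
# dubai reuters saudi cleric said
```

One thing to be aware of with this ordering: the NLTK list contains contractions such as `don't`, which no longer match once apostrophes have been stripped, so a token like `dont` would survive the stopword step.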
