# Data cleansing

#### This **.ipynb** file removes unnecessary columns, converts the text to lowercase, and strips punctuation and stopwords.

Removes the 'date' and 'title' columns, leaving only 'text' and 'label'
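
The cell that performs this step is not shown in this section. A minimal sketch of how it might look, assuming `data` is a pandas DataFrame that still holds the `date` and `title` columns (the one-row frame below is a hypothetical stand-in for the real dataset, not actual data):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the column names matter here.
data = pd.DataFrame({
    "title": ["Example headline"],
    "text": ["Example body text."],
    "date": ["January 1, 2017"],
    "label": ["fake"],
})

# Drop the columns that are not used in the later cleansing steps.
data = data.drop(columns=["date", "title"])

print(list(data.columns))
```

`drop` returns a new DataFrame by default, so the result is assigned back to `data`.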

In [9]:
data.shape

(44898, 2)

In [10]:
from sklearn.utils import shuffle
data = shuffle(data)

In [11]:
data.head()

Unnamed: 0,text,label
12001,I know what I m doing. I m prepared to wear t...,fake
11993,We are supposed to have a government of the pe...,fake
6429,WASHINGTON (Reuters) - U.S. Defense Secretary ...,true
18026,"COX S BAZAR, Bangladesh (Reuters) - At least t...",true
20566,"JUCHITAN, Mexico (Reuters) - At least 61 peopl...",true


Converts all letters in the 'text' column to lowercase

In [12]:
data['text'] = data['text'].str.lower()

In [13]:
data.head()

Unnamed: 0,text,label
12001,i know what i m doing. i m prepared to wear t...,fake
11993,we are supposed to have a government of the pe...,fake
6429,washington (reuters) - u.s. defense secretary ...,true
18026,"cox s bazar, bangladesh (reuters) - at least t...",true
20566,"juchitan, mexico (reuters) - at least 61 peopl...",true


Removes punctuation (for example , . ( ) -) from the 'text' column. Characters are deleted rather than replaced with a space, so "u.s." becomes "us".

In [14]:
from string import punctuation

def remove_punctuation(text):
    """Return text with every ASCII punctuation character deleted."""
    temp_list = []
    for char in text:
        # Keep only characters that are not in string.punctuation
        if char not in punctuation:
            temp_list.append(char)
    return ''.join(temp_list)

# Equivalent version using a list comprehension:
# def remove_punctuation(text):
#     return ''.join([char for char in text if char not in punctuation])

data['text'] = data['text'].apply(remove_punctuation)

In [15]:
data.head()

Unnamed: 0,text,label
12001,i know what i m doing i m prepared to wear th...,fake
11993,we are supposed to have a government of the pe...,fake
6429,washington reuters us defense secretary ash c...,true
18026,cox s bazar bangladesh reuters at least two p...,true
20566,juchitan mexico reuters at least 61 people di...,true
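
The character-by-character loop above can also be written with `str.translate`, which is typically faster on long strings. The sketch below is an alternative, not the notebook's own method. The names `remove_punctuation_fast` and `replace_punctuation_with_space` are hypothetical. The second function shows a variant that substitutes a space for each punctuation character, so that tokens joined by punctuation stay separate.

```python
from string import punctuation

# Translation table that maps every ASCII punctuation character to None (delete).
_DELETE_PUNCT = str.maketrans("", "", punctuation)

def remove_punctuation_fast(text):
    """Delete ASCII punctuation from text in a single translate pass."""
    return text.translate(_DELETE_PUNCT)

# Translation table that maps every ASCII punctuation character to a space.
_SPACE_PUNCT = str.maketrans(punctuation, " " * len(punctuation))

def replace_punctuation_with_space(text):
    """Replace ASCII punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_SPACE_PUNCT).split())

sample = "washington (reuters) - u.s. defense secretary"
print(remove_punctuation_fast(sample))
print(replace_punctuation_with_space(sample))
```

With a DataFrame, either function could be applied the same way as before, for example `data['text'].apply(remove_punctuation_fast)`. Which variant is preferable depends on whether abbreviations such as "u.s." should survive as one token ("us") or be split ("u s").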


Removes English stopwords (for example as, at, be, both) from the 'text' column

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

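# Note: this rebinds the name `stopwords` from the NLTK corpus reader to the word collection.
# Storing the words in a set keeps the membership test in remove_stopwords fast.
stopwords = set(stopwords.words('english'))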

[nltk_data] Downloading package stopwords to /Users/belis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
def remove_stopwords(text):
    """Return text with every whitespace-separated stopword removed."""
    temp_list = []
    for word in text.split():
        # Keep only words that are not in the stopword collection
        if word not in stopwords:
            temp_list.append(word)
    return ' '.join(temp_list)

# Equivalent version using a list comprehension:
# def remove_stopwords(text):
#     return ' '.join([word for word in text.split() if word not in stopwords])

data['text'] = data['text'].apply(remove_stopwords)

In [18]:
data.head()

Unnamed: 0,text,label
12001,know prepared wear consequences conflict famil...,fake
11993,supposed government people people people inste...,fake
6429,washington reuters us defense secretary ash ca...,true
18026,cox bazar bangladesh reuters least two people ...,true
20566,juchitan mexico reuters least 61 people died p...,true
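
The stopword filter above checks every word against the stopword collection, so the lookup cost matters on roughly 45,000 articles. A minimal sketch of the same idea with a set-based lookup follows. `STOP_WORDS` is a tiny hypothetical set used only so the example runs without downloading NLTK data, and `remove_stopwords_set` is an illustrative name, not part of the notebook.

```python
# Hypothetical, tiny stopword set so the sketch runs without NLTK data.
# In the notebook this would be built from stopwords.words('english').
STOP_WORDS = frozenset(
    {"i", "what", "m", "to", "at", "the", "of", "a", "we", "are", "have"}
)

def remove_stopwords_set(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "we are supposed to have a government of the people"
print(remove_stopwords_set(sample))
```

Membership tests on a set or frozenset take roughly constant time, whereas a list is scanned element by element. One thing to keep in mind: the NLTK English list contains contractions written with an apostrophe, so once punctuation has been stripped, forms such as "dont" would likely no longer match and would remain in the text.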
