<a href="https://colab.research.google.com/github/desireedisco/MSDS-Machine-Learning-Supervised/blob/main/1_Data_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
#Notebook for Data Preprocessing
---


This notebook preprocesses the data. We will do the following:
* Load the file, drop the index column, drop null rows, and drop duplicates.
* Separate out the email addresses, web links, mentions, and hashtags.
* Clean the data leakage problem in the text column. There are also some minor problems in the title column that will be fixed.
* Count the number of sentences and calculate the mean sentence length.
* Drop the foreign-language rows, as detected by the langdetect package.
* Check which of the remaining tokens are recognized as words.
* Calculate the title and text similarity.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#install langdetect for when we separate out foreign language rows
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=33fe788de1b181267fa66ac7f8df367050ed6ceac3992c0c396be3b603d6d3d5
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
import pandas as pd
import numpy as np
import re
import nltk

from langdetect import detect
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from gensim.models import Word2Vec

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

***
##Load file, drop index, drop na, drop duplicates
***

In [None]:
#read csv file
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/WELFake_Dataset.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [None]:
# show dataframe info
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB
None


**Note on matching Real and Fake labels:**
In the dataset description, the authors state that there are 72,134 news articles: 35,028 real and 37,106 fake. They then state that the labels are coded as 0=fake and 1=real. These two statements contradict each other given the label counts shown below (37,106 rows labeled 1 and 35,028 rows labeled 0). After inspecting the data, I follow the authors' first statement ('35,028 real and 37,106 fake news articles') and use the mapping 0=real and 1=fake.
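
The mapping decision above can be checked mechanically by matching the observed label counts against the counts quoted in the dataset description. The following is a minimal plain-Python sketch; the helper name `infer_label_mapping` and the dictionary layout are illustrative assumptions, not part of the dataset or of pandas.

```python
# Counts quoted in the dataset description: 35,028 real and 37,106 fake.
DOCUMENTED_COUNTS = {"real": 35028, "fake": 37106}


def infer_label_mapping(observed_counts, documented_counts=DOCUMENTED_COUNTS):
    """Map each label value to a class name by matching row counts (hypothetical helper)."""
    name_by_count = {count: name for name, count in documented_counts.items()}
    mapping = {}
    for label, count in observed_counts.items():
        if count not in name_by_count:
            raise ValueError(f"label {label!r} has an undocumented count: {count}")
        mapping[label] = name_by_count[count]
    return mapping


# Observed counts, as reported by data['label'].value_counts() in this notebook.
observed = {1: 37106, 0: 35028}
label_mapping = infer_label_mapping(observed)
print(label_mapping)
```

This only works because the two class sizes differ; with equal counts the mapping could not be inferred this way.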

In [None]:
# show the number of rows labeled 1 and number of rows labeled 0
print(data['label'].value_counts())

# show unique count for title and text
print(data.describe(include=['object']))

label
1    37106
0    35028
Name: count, dtype: int64
                                                    title   text
count                                               71576  72095
unique                                              62347  62718
top     Factbox: Trump fills top jobs for his administ...       
freq                                                   14    738


In [None]:
# drop dataset index column
data = data.drop(columns=['Unnamed: 0'])

**Check to see how many rows are null and what the percentage of total data is before dropping rows.**

In [None]:
# show the number of rows that are null
for col in data.columns:
    print(f'{data[col].isnull().sum()} rows are null in ' + col)

# show what percentage of total is null for each column
for col in data.columns:
    print(col, f'{round(data[col].isnull().sum() / data.shape[0] * 100, 2)} % is null')

558 rows are null in title
39 rows are null in text
0 rows are null in label
title 0.77 % is null
text 0.05 % is null
label 0.0 % is null


In [None]:
#drop all null values
data = data.dropna().reset_index(drop=True)

**Split the data by label, then check for and drop duplicate 'text' values. Then recombine to check for duplicates across labels.**

I did this because I did not want to silently drop duplicates that carry more than one label. If any duplicate article appeared under different labels, I wanted to know about it. In this dataset, the only duplicate with multiple labels was a blank text row.
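
The idea of looking for texts that appear under more than one label can be sketched without the dataset. This is a plain-Python illustration under stated assumptions: `conflicting_texts` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def conflicting_texts(rows):
    """Return {text: set of labels} for texts seen with more than one label (hypothetical helper)."""
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)
    return {text: labels for text, labels in labels_by_text.items() if len(labels) > 1}


# Made-up (text, label) pairs; the blank text mimics the case found in this dataset.
sample_rows = [
    ("story a", 0),
    ("story a", 0),  # same-label duplicate: safe to drop
    (" ", 0),
    (" ", 1),        # same text under both labels: worth inspecting
    ("story b", 1),
]
conflicts = conflicting_texts(sample_rows)
print(conflicts)
```

Same-label duplicates are harmless to drop, while cross-label duplicates are surfaced for manual inspection.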

In [None]:
# split the data by label to look at the unique counts, in case some duplicates are labeled both real and fake
real_data = data[data['label'] == 0]
fake_data = data[data['label'] == 1]
print(real_data.describe(include=['object']))
print(fake_data.describe(include=['object']))

                                                    title  \
count                                               35028   
unique                                              34409   
top     Factbox: Trump fills top jobs for his administ...   
freq                                                   14   

                                                     text  
count                                               35028  
unique                                              34621  
top     Killing Obama administration rules, dismantlin...  
freq                                                   58  
                                                    title   text
count                                               36509  36509
unique                                              27903  27580
top     Get Ready For Civil Unrest: Survey Finds That ...       
freq                                                    8    737


In [None]:
#drop duplicate text stories
real_data = real_data.drop_duplicates(subset=['text']).reset_index(drop=True)
fake_data = fake_data.drop_duplicates(subset=['text']).reset_index(drop=True)

# combine the separate labels to one to check for duplicates
data = pd.concat([real_data, fake_data], ignore_index=True)
print(data.describe(include=['object']))

                                                    title   text
count                                               62201  62201
unique                                              61400  62200
top     Factbox: Trump fills top jobs for his administ...       
freq                                                   14      2


As you can see below, the only duplicate with multiple labels has a blank text column.

In [None]:
# display duplicate rows that have 2 different labels
print(data[data['text'].duplicated(keep=False)])

                                                   title text  label
920                       Graphic: Supreme Court roundup           0
34626  HOUSE INTEL CHAIR On Trump-Russia Fake Story: ...           1


In [None]:
# they have blank text so drop rows
data = data.drop_duplicates(subset=['text'], keep=False).reset_index(drop=True)

# check total counts and unique counts
print(data.describe(include=['object']))

# display value counts for labels
data['label'].value_counts()

#value counts should be
# 0 - 34620
# 1 - 27579

                                                    title  \
count                                               62199   
unique                                              61398   
top     Factbox: Trump fills top jobs for his administ...   
freq                                                   14   

                                                     text  
count                                               62199  
unique                                              62199  
top     A dozen politically active pastors came here f...  
freq                                                    1  


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,34620
1,27579


**Shuffle and reindex dataset**

In [None]:
# shuffle dataframe and reset index
data = data.sample(frac=1).reset_index(drop=True)

***
##Clean Title - Source Data Leakage
***

In the data_leakage notebook we demonstrated that source information embedded in the 'text' column influenced the model. Because not all articles carry source information, it biased the model. I looked through the 'title' column and found similar source markers that could influence the model; therefore, I decided to remove those as well. I want to focus on the actual title and actual text, not the sources of the articles. We could use the source as a feature if all the articles had source information. In the current dataset, only the real news includes sources, so it needs to be removed.
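
One way to see why a source suffix leaks the label is to measure how often it appears per label. The sketch below uses made-up titles; `suffix_rate_by_label` is a hypothetical helper, and the pattern is the same suffix pattern the title-cleaning step in this notebook targets.

```python
import re

# Source suffixes targeted by the title-cleaning step.
SOURCE_SUFFIX = re.compile(r" - The New York Times$| - Breitbart$")


def suffix_rate_by_label(rows, pattern=SOURCE_SUFFIX):
    """Return {label: share of titles matching pattern} (hypothetical helper)."""
    totals, hits = {}, {}
    for title, label in rows:
        totals[label] = totals.get(label, 0) + 1
        if pattern.search(title):
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


# Made-up titles for illustration only (0 = real, 1 = fake).
sample_rows = [
    ("Senate Passes Budget Bill - The New York Times", 0),
    ("Governor Signs Water Law", 0),
    ("SHOCKING Claim Goes Viral", 1),
    ("You Won't Believe This Speech", 1),
]
rates = suffix_rate_by_label(sample_rows)
print(rates)
```

A large gap between the per-label rates suggests the marker acts as a shortcut for the classifier rather than as a content feature.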

In [None]:
def clean_title_leakage(title):
  # making sure title is a string
  title = str(title)

  # get rid of New York Times and Breitbart reference
  pattern_leakage = r" - The New York Times$| - Breitbart$"
  # find all occurrences, not just the first one
  match_lst = re.findall(pattern_leakage, title)
  # if there is a match, substitute ''
  if match_lst:
    title = re.sub(pattern_leakage, '', title)
    #print(match_lst)

  return title

In [None]:
# create a new column for the clean title and apply the clean_title_leakage method
data['title_clean'] = data['title'].apply(clean_title_leakage)

In [None]:
# just as the real news provides source information, the fake news often refers to a video source, so I decided to clean this as well
def clean_title_video(title):

  # get rid of Video reference - easy match pattern
  pattern_video = r"\[VIDEO\]|\(VIDEO\)"

  # find all matches and ignore case
  match_lst = re.findall(pattern_video, title, re.IGNORECASE)

  if match_lst:
    title = re.sub(pattern_video, '', title, flags=re.IGNORECASE)
    #print(match_lst)

  # search for more match patterns to get rid of all references to (Video)
  pattern_video_re = r"\[video[a-z 0-9/,+-].*\]|\[[a-z 0-9/,+-].*video\]|\(video[a-z 0-9/,+-].*\)|\([a-z 0-9/,+-].*video\)|\([a-z 0-9/,+-].*videos\)"
  match_lst_re = re.findall(pattern_video_re, title, re.IGNORECASE)
  if match_lst_re:
    title = re.sub(pattern_video_re, '', title, flags=re.IGNORECASE)
    #print(match_lst_re)

  return title

In [None]:
# clean the 'title_clean' column of the (Video) references, which are skewed toward fake news
data['title_clean'] = data['title_clean'].apply(clean_title_video)
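
Removing a tag from the middle or end of a title leaves stray whitespace behind. The sketch below is an optional follow-up under stated assumptions: it uses only the simple `[VIDEO]`/`(VIDEO)` pattern from above, the helper `strip_video_tag` is hypothetical, and the titles are made up.

```python
import re

# Only the simple tag pattern; the broader pattern in clean_title_video is not reproduced here.
VIDEO_TAG = re.compile(r"\[VIDEO\]|\(VIDEO\)", re.IGNORECASE)


def strip_video_tag(title):
    """Remove the video tag, then collapse the whitespace it leaves behind (hypothetical helper)."""
    without_tag = VIDEO_TAG.sub("", title)
    return " ".join(without_tag.split())


sample_titles = ["WATCH: Senator Responds [VIDEO]", "Crowd Reacts (Video) At Rally"]
cleaned_titles = [strip_video_tag(title) for title in sample_titles]
print(cleaned_titles)
```

The later tokenization step splits on whitespace anyway, so this normalization mainly matters if the cleaned titles are inspected or exported as-is.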

***
##Clean Text
***

In [None]:
# clean out newline reference
data['text_clean'] = data['text'].str.replace('\n', ' ')

In [None]:
# method to extract email links from text and put in another data frame column
def extract_email_links(text):
  emails = ''

  # get rid of email addresses - match the email pattern
  pattern_email = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
  match_lst = re.findall(pattern_email, text)
  # if there is a match, substitute '' for each email in the text and keep the matches as the emails list
  if match_lst:
    text = re.sub(pattern_email, '', text)
    print(match_lst)
    emails = match_lst
  # return text_clean and emails list
  return pd.Series({'text_clean': text, 'email': emails})
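
The behavior of the email pattern can be illustrated on small made-up strings. This sketch reuses the pattern from `extract_email_links`; the helper `split_emails` and the sample addresses are assumptions for illustration. It also shows why an output such as `gmail.comFrom` can appear: when the next word is glued to the address, the trailing `[a-zA-Z]{2,}` keeps consuming letters.

```python
import re

# Same pattern as in extract_email_links.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def split_emails(text):
    """Return (text without emails, list of emails found) (hypothetical helper)."""
    emails = EMAIL_PATTERN.findall(text)
    return EMAIL_PATTERN.sub("", text), emails


clean_text, clean_emails = split_emails("Contact huma@clintonemail.com for details.")
glued_text, glued_emails = split_emails("Tips: tips@example.comFrom the newsroom")
print(clean_emails)
print(glued_emails)
```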

In [None]:
# get text_clean stripped of emails and put emails in a separate column
data[['text_clean', 'email']] = data['text_clean'].apply(lambda x: pd.Series(extract_email_links(x)))

['themexican@askamexican.net']
['huma@clintonemail.com']
['tcaeditors@tribune.com']
['editor@greanvillepost.com']
['cyberpatriot@hotmail.com']
['bruce.dixon@blackagendareport.com', 'editor@greanvillepost.com']
['FoxNewsFirst@FOXNEWS.COM']
['lendmanstephen@sbcglobal.net']
['hdr22@clintonemail.com']
['HALFTIMEREPORT@FOXNEWS.COM']
['diana.johnstone@wanadoo.fr']
['Paul.Bond@THR.com']
['lendmanstephen@sbcglobal.net']
['Dmmonypeny@att.net']
['Mike_Peril@aol.com']
['onehundredpercentfedup@gmail.comFrom']
['cliff.kincaid@aim.org']
['john.podesta@gmail.com', 'cheryl.mills@gmail.com', 'brianefallon@gmail.com', 'lauren.elena.smith@gmail.com', 'bigcampaign@googlegroups.com']
['cliff.kincaid@aim.org']
['bobama@ameritech.net']
['john.podesta@gmail.com', 'donna@brazileassociates.com', 'john.podesta@gmail.com', 'donna@brazileassociates.com', 'donna@brazileassociates.com', 'john.podesta@gmail.com', 'john.podesta@gmail.com', 'jpalmieri@hillaryclinton.com', 'aelrod@hillaryclinton.com', 'Minyon.Moore@dewe

In [None]:
# method to extract web links and put in another dataframe column
def extract_web_links(text):
  links = ''

  #get rid of web links - we try to match a number of different patterns to try to capture all the weblinks in the text column
  pattern_html = r"https?://(?:www\.)?[a-zA-Z0-9./?=&_-]+|pic.twitter.com[a-zA-Z0-9./]+|[a-zA-Z0-9./]+(?:\.com|\.org)/[a-zA-Z0-9/\-.]+|[a-zA-Z0-9./]+(?:\.com|\.org)"
  match_lst = re.findall(pattern_html, text)
  # if match we will substitute web link for '' in text_clean and add to web link list
  if match_lst:
    text = re.sub(pattern_html, '', text)
    print(match_lst)
    links = match_lst

  # return text_clean and web link list
  return pd.Series({'text_clean': text, 'links': links})
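
The coverage of the link pattern can be checked on a made-up sentence. This sketch reuses the pattern from `extract_web_links`, split across lines for readability; the sample text is an assumption. It shows that links with a scheme are captured regardless of domain, while bare domains are captured only when they end in `.com` or `.org`.

```python
import re

# Same alternatives as in extract_web_links, concatenated into one pattern.
LINK_PATTERN = re.compile(
    r"https?://(?:www\.)?[a-zA-Z0-9./?=&_-]+"
    r"|pic.twitter.com[a-zA-Z0-9./]+"
    r"|[a-zA-Z0-9./]+(?:\.com|\.org)/[a-zA-Z0-9/\-.]+"
    r"|[a-zA-Z0-9./]+(?:\.com|\.org)"
)

sample = "Read https://t.co/NUiy9j4fBt, then Townhall.com and example.net"
links = LINK_PATTERN.findall(sample)
text_without_links = LINK_PATTERN.sub("", sample)
link_count = len(links)
print(links, link_count)
```

In this example the bare `example.net` stays in the text, so bare domains with other endings would need an extra alternative if they matter.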

In [None]:
# get text_clean stripped of web links and put web links in a separate column
data[['text_clean', 'links']] = data['text_clean'].apply(lambda x: pd.Series(extract_web_links(x)))
# record link count in a separate column
data['link_count'] = data['links'].apply(len)

Streaming output truncated to the last 5000 lines.
['abravealabamaatheist.com', 'weneedtotalkaboutmentalhealth.com']
['Townhall.com']
['https://t.co/NUiy9j4fBt']
['www.msn.com', 'one.com']
['FiveThirtyEight.com', 'https://t.co/sjVY67qouE', 'pic.twitter.com/rrc3GuXmGl', 'https://t.co/F455bP3D8I', 'pic.twitter.com/qjr6zLh640', 'https://t.co/WLXtJodIzD', 'pic.twitter.com/GPmyueqczL', 'https://t.co/o4qQf8STlT', 'https://t.co/7oGXVkQDMA', 'https://t.co/sPyCIrTYC9', 'https://t.co/4CTf1FP6EJ', 'https://t.co/DjhyWRB2eD', 'https://t.co/EuTRbYlNys']
['https://www.youtube.com/watch?v=5_s5gs0I4Uw']
['bookpatch.com']
['pic.twitter.com/OUdKXMBfqm']
['pic.twitter.com/uKoKg63Ft5']
['pic.twitter.com/bfgk37LbFC']
['https://api.soundcloud.com/tracks/214633454']
['https://t.co/KNeF02vhbv', 'https://t.co/skw76OndVK']
['Cinncinati.com']
['https://t.co/JDF3VsVEJ1', 'https://t.co/mm8YzH2NPB']
['Amazon.com']
['pic.twitter.com/MDxGhX2GoO']
['pic.twitter.com/dzgYpIIao3', 'https://t.co/Kzuc4r77CK']


In [None]:
# method to extract mentions and put in another dataframe column
def clean_mentions(text):
  mentions = ''

  # get rid of mentions in text and keep mentions in a separate column
  pattern_mentions = r"@\w+"
  match_lst = re.findall(pattern_mentions, text)
  # if match we will substitute mentions for '' in text_clean and add to mention list
  if match_lst:
    text = re.sub(pattern_mentions, '', text)
    print(match_lst)
    mentions = match_lst

  # return text_clean and mention list
  return pd.Series({'text_clean': text, 'mentions': mentions})
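
The order of the extraction steps matters: the mention pattern `@\w+` also matches the domain part of an email address, so emails need to be removed first. A minimal sketch with a made-up sentence (the constant names are illustrative):

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
MENTION = re.compile(r"@\w+")

sample = "Write to editor@example.com or follow @21WIRE"

# Mentions extracted before emails are removed: the email domain is picked up as a mention.
mentions_first = MENTION.findall(sample)

# Mentions extracted after emails are removed: only the real mention remains.
emails_first = MENTION.findall(EMAIL.sub("", sample))

print(mentions_first)
print(emails_first)
```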

In [None]:
# get text_clean stripped of mentions and put mentions in a separate column
data[['text_clean', 'mentions']] = data['text_clean'].apply(lambda x: pd.Series(clean_mentions(x)))
# new column with mention count
data['mentions_count'] = data['mentions'].apply(len)
# turn the mention list into a string so we can vectorize it later on if we want
data['mentions'] = data['mentions'].apply(lambda mention_lst: ' '.join(str(mention) for mention in mention_lst))

Streaming output truncated to the last 5000 lines.
['@hooverwhalen']
['@21WIRE']
['@PostGraphics']
['@MMFlint', '@jas', '@MMFlint', '@MMFlint', '@MMFlint', '@MMFlint']
['@IanHanchett']
['@igorvolsky']
['@LizWFB']
['@PatriotNotPol']
['@JBA_NAFW', '@JBA_NAFW']
['@IanHanchett']
['@StarTribune', '@reneejon', '@JaimeDeLage', '@JaimeDeLage', '@EmmaSapong']
['@nytimes', '@nytimes']
['@nytimes']
['@MagnifiTrent']
['@jakeshieldsajj', '@jakeshieldsajj', '@jakeshieldsajj', '@jakeshieldsajj']
['@maddow', '@JesseFFerguson', '@MaddowBlog', '@realDonaldTrump']
['@MarcACaputo', '@seanhannity', '@marcorubio', '@seanhannity', '@seanhannity', '@TheMattWilstein', '@seanhannity']
['@tciccotta', '@breitbart']
['@21WIRE']
['@JohnCornyn', '@JohnCornyn', '@JohnCornyn', '@JohnCornyn', '@MRSSMH2', '@JulesLorey1', '@ejkmom1998', '@Tex92eye', '@shipp_kenneth', '@EllenMorris1222', '@EricHolder', '@JohnCornyn', '@LoriW66', '@PittieBoo', '@JohnCornyn', '@standsagreenoak', '@marciebp']
['@MagnifiTrent']


In [None]:
# method to extract hashtags and put in another dataframe column
def clean_hashtags(text):
  hashtags = ''
  hashtag_count = 0

  # get rid of hashtags in text and keep hashtags in a separate column
  pattern_hashtags = r"#\w+"
  match_lst = re.findall(pattern_hashtags, text)

  # if there is a match, substitute '' for each hashtag in text_clean and build the hashtag list
  if match_lst:
    text = re.sub(pattern_hashtags, '', text)

    # skip matches such as #1 or #2 (only digits after the #): there are a lot of them and they are not hashtags
    hashtag_lst = [hashtag for hashtag in match_lst if not re.search(r"#\d+$", hashtag)]
    # join the remaining hashtags into a single space-separated string
    hashtags = ' '.join(hashtag_lst)
    print(hashtags)

    # count the hashtags that were kept; counting the list avoids ''.split(' ') returning [''] (a count of 1) when every match was numeric
    hashtag_count = len(hashtag_lst)

  # return text_clean and hashtag list in string form and return hashtag count
  return pd.Series({'text_clean': text, 'hashtags': hashtags, 'hashtag_count': hashtag_count})
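
The numeric filter and the count can be illustrated on made-up text. In this sketch `split_hashtags` is a hypothetical helper. One detail to keep in mind when counting: `''.split(' ')` returns `['']`, so counting the list of kept tags directly is safer than splitting a joined string that may be empty.

```python
import re

HASHTAG = re.compile(r"#\w+")
NUMERIC_ONLY = re.compile(r"#\d+$")


def split_hashtags(text):
    """Return (text without hashtags, list of non-numeric hashtags) (hypothetical helper)."""
    found = HASHTAG.findall(text)
    kept = [tag for tag in found if not NUMERIC_ONLY.search(tag)]
    return HASHTAG.sub("", text), kept


cleaned, kept = split_hashtags("Ranked #1 and #2 by fans #MAGA #2059more")
_, only_numeric = split_hashtags("Step #1, then step #2")

print(kept, len(kept))
print(only_numeric, len(only_numeric))
print(len("".split(" ")), len("".split()))
```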

In [None]:
# process text to return 'text_clean', hashtag list, and hashtag count
data[['text_clean', 'hashtags', 'hashtag_count']] = data['text_clean'].apply(lambda x: pd.Series(clean_hashtags(x)))

#seismic #DPRK #seismic #CTBT

#Saban17
#Cuba
#Comey
#DraftOurDaughters #DraftOurDaughters
#gucci
#DrainTheSwampThe #DrainTheSwamp
#fucktrump #fucktrump #fucktrump #fucktrump
#TheMessyTruth
#buildthatwall
#2059more
#Yulin2016
#Seismic
#whitegirl
#FlintWaterCrisis
#xfbml
#SouthJersey
#PalinOnCNN #PalinOnCNN
#CLT #KeithScottA
#PTSDAwareness
#DNCLeak2 #MadelineMcCann
#xfbml
#StandWithPP
#ThisFlag
#MAGA #TrumpTrain #DTS
#AmericatheBeautiful #SuperBowl #sisterhood #ItAintBrokeDontFixIt #SB51

#CNNSOTU
#Charlottesville
#NeverTrump
#Assange #Clinton #FBI
#DayWithoutImmigrants #protest #Dallas #TX #AntiTrump #DayWithoutImmigrants

#nhpolitics
#APEChottie
#BlackFair #BlackLivesMatter #FTP #ACAB #BlackLivesMatter
#goodbyenukes #FirstCommittee

#TravelBan
#DAPL
#NFB #WhiteHouseFlashingLights #whitehouseflashinglights10
#Durham #durham #Charlottesville #Durham
#WAR
#NoDAPL #DAPL #nodapl #DakotaAccessPipeline #NoDAPL #police #NoDAPL #NoDAPL #NoDAPL
#SendtheComfort
#x27
#NFL #MAGA
#PodestaEmails19 #

In [None]:
# clean the references to (Reuters) from the text column
def clean_reuters_leakage(text):

  # get rid of the reference to (Reuters) in text
  # some articles start with a correction note followed by the (Reuters) reference; we match that whole leading string and substitute it out
  pattern_reuters_1 = r"^[“”‘’A-Za-z.,&$()/\-:;0-9  ]*(\(Reuters\) - |\(Reuters\)\) - |\(Reuters\)  —  )"
  # we only want the occurrence at the start of the text, so we use match and not findall
  match_str_1 = re.match(pattern_reuters_1, text)
  if match_str_1:
    text = re.sub(pattern_reuters_1, '', text)
    print(match_str_1.group())

  return text
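
The dateline removal can be illustrated with a simplified pattern. The sketch below is not the pattern used in `clean_reuters_leakage`: it assumes a reduced character class and only the `(Reuters) - ` form, and `strip_dateline` is a hypothetical helper.

```python
import re

# Simplified stand-in: optional location text, then "(Reuters) - " at the start of the article.
DATELINE = re.compile(r"^[A-Za-z.,/\- ]*\(Reuters\) - ")


def strip_dateline(text):
    """Remove a leading Reuters dateline if present (hypothetical helper)."""
    return DATELINE.sub("", text, count=1)


samples = [
    "WASHINGTON (Reuters) - The Senate voted.",
    "ST PETERSBURG, Russia (Reuters) - Talks resumed.",
    "WASHINGTON/HOUSTON (Reuters) - Oil fell.",
    "(Reuters) - Stocks rose.",
    "Reuters reported the vote.",
]
stripped = [strip_dateline(sample) for sample in samples]
print(stripped)
```

Because the pattern is anchored with `^`, a later mention of Reuters inside the article body is left alone.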

In [None]:
# clean the (Reuters) reference out
data['text_clean'] = data['text_clean'].apply(clean_reuters_leakage)

Streaming output truncated to the last 5000 lines.
LUANDA (Reuters) - 
NEW YORK (Reuters) - 
NEW YORK (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
(Reuters) - 
BERLIN (Reuters) - 
ISLAMABAD (Reuters) - 
WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
ST PETERSBURG, Russia (Reuters) - 
WASHINGTON/HOUSTON (Reuters) - 
NEW YORK (Reuters) - 
MOSCOW/WASHINGTON (Reuters) - 
RIYADH (Reuters) - 
WASHINGTON (Reuters) - 
SITTWE, Myanmar (Reuters) - 
SAN FRANCISCO (Reuters) - 
WASHINGTON (Reuters) - 
(Reuters) - 
DHAKA (Reuters) - 
WASHINGTON (Reuters) - 
SANTIAGO (Reuters) - 
 (This story corrects reference to Russian ambassador in paragraph 10.) By Steve Holland and Jeff Mason WASHINGTON (Reuters) - 
WASHINGTON (Reuters) - 
LONDON (Reuters) - 
UNITED NATIONS (Reuters) - 
WASHINGTON (Reuters) - 
SEOUL (Reuters) - 
HAVANA (Reuters) - 
WINSTON-SALEM, N.C. (Reuters) - 
ROME 

***
##Sentence count and mean sentence length.
***

In [None]:
# using NLTK sentence tokenizer to count the number of sentences in each text passage in the 'text_clean' column
data.loc[:,'text_sent_count'] = data.loc[:,'text_clean'].map(lambda txt: len(sent_tokenize(txt)))

In [None]:
# check whether any rows have a sentence count below 1
# in the resulting rows, the text column is just a link, so the text_clean column is blank
data[data['text_sent_count'] < 1]

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count
887,‘Half my family’ is here illegally!…State Sena...,https://www.youtube.com/watch?v=rUr8pYr5AXs,1,‘Half my family’ is here illegally!…State Sena...,,,[https://www.youtube.com/watch?v=rUr8pYr5AXs],1,,0,,0,0
2262,GOTCHA! CNN PANELIST Called Out For Lying Abou...,https://www.youtube.com/watch?v=ISm-p8e-D7I,1,GOTCHA! CNN PANELIST Called Out For Lying Abou...,,,[https://www.youtube.com/watch?v=ISm-p8e-D7I],1,,0,,0,0
2329,DISGUSTING! USA TODAY Video Suggests “Trump Er...,https://www.youtube.com/watch?v=8dsDdBqF828,1,DISGUSTING! USA TODAY Video Suggests “Trump Er...,,,[https://www.youtube.com/watch?v=8dsDdBqF828],1,,0,,0,0
3989,BRILLIANT: REP KING Calls Out CIA Director For...,https://www.youtube.com/watch?v=HXJZbPAf0sk,1,BRILLIANT: REP KING Calls Out CIA Director For...,,,[https://www.youtube.com/watch?v=HXJZbPAf0sk],1,,0,,0,0
4315,BREAKING: DEMOCRAT Makes Shocking Statement Re...,https://www.youtube.com/watch?v=IioEIUmawRo,1,BREAKING: DEMOCRAT Makes Shocking Statement Re...,,,[https://www.youtube.com/watch?v=IioEIUmawRo],1,,0,,0,0
5908,BRILLIANT! TUCKER CARLSON Humiliates Jill Stei...,https://www.youtube.com/watch?v=uQbAww5wajA,1,BRILLIANT! TUCKER CARLSON Humiliates Jill Stei...,,,[https://www.youtube.com/watch?v=uQbAww5wajA],1,,0,,0,0
8407,WATCH! TRUMP SUPPORTER “BIG JOE” Surrounded By...,https://www.youtube.com/watch?v=IPqrimR8GWw,1,WATCH! TRUMP SUPPORTER “BIG JOE” Surrounded By...,,,[https://www.youtube.com/watch?v=IPqrimR8GWw],1,,0,,0,0
9045,“F*ck Trump! F*ck White People!” LEFTY GOES NU...,https://www.youtube.com/watch?v=zZ7GrEItGoo,1,“F*ck Trump! F*ck White People!” LEFTY GOES NU...,,,[https://www.youtube.com/watch?v=zZ7GrEItGoo],1,,0,,0,0
9757,FULL INTERVIEW: PRESIDENT TRUMP Nails It On Im...,https://www.youtube.com/watch?v=hNPX8ZCIfc0&t=26s,1,FULL INTERVIEW: PRESIDENT TRUMP Nails It On Im...,,,[https://www.youtube.com/watch?v=hNPX8ZCIfc0&t...,1,,0,,0,0
10880,WATCH Huge Crowd Of Muslims Admit That ALL Mus...,https://www.youtube.com/watch?v=8Mehk5eWcZA,1,WATCH Huge Crowd Of Muslims Admit That ALL Mus...,,,[https://www.youtube.com/watch?v=8Mehk5eWcZA],1,,0,,0,0


In [None]:
# drop the rows with a sentence count below 1
data.drop(data.loc[data['text_sent_count'] < 1].index, inplace=True)

In [None]:
# get mean sentence length (in characters)
data.loc[:,'mean_sent_length'] = data.loc[:,'text_clean'].map(lambda txt: np.mean([len(sent) for sent in sent_tokenize(txt)]))
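
The two features can be illustrated without NLTK. The sketch below uses a naive regex splitter as a stand-in for `sent_tokenize` (an assumption; the real tokenizer handles abbreviations and other cases this does not), and `sentence_stats` is a hypothetical helper. It makes two points explicit: the length is measured in characters, not words, and a text with no sentences has no mean, which is why such rows are dropped before this step.

```python
import re
from statistics import mean

# Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    return [sentence for sentence in SENTENCE_BREAK.split(text.strip()) if sentence]


def sentence_stats(text):
    """Return (sentence count, mean sentence length in characters or None) (hypothetical helper)."""
    sentences = naive_sentences(text)
    if not sentences:
        return 0, None
    return len(sentences), mean([len(sentence) for sentence in sentences])


count, mean_length = sentence_stats("Boom! Courtesy of the site.")
empty_count, empty_mean = sentence_stats("   ")
print(count, mean_length)
print(empty_count, empty_mean)
```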

In [None]:
# look at rows with a mean sentence length of less than 20 characters
# we will apply the language check next, and it needs a good number of characters
df = data.loc[data['mean_sent_length'] < 20,['text_clean','label']]
print(df)

                                              text_clean  label
433                                   Boom!Courtesy of:       1
994                                        advertisement      0
1591                                               Ouch!      1
2038                                      Lefty losers        1
2545                                          Brilliant       1
6778          Be the First to Comment!   Search articles      1
9023                                             Via: WT      1
9554                                     Guest   Guest        1
11514                                       11/08/2016        1
11833                                    Zones"confuse        1
13525  Les Deplorables Unite     ??VOTE TRUMP?? () No...      1
14116                                 Take note America       1
14614                                          Trending       1
14721                                           Via: TMZ      1
15761  Notify the CDC. It's spreading.  

In [None]:
# drop the low mean length articles
data.drop(data.loc[data['mean_sent_length'] < 20].index, inplace=True)

***
##Drop other languages
***

In [None]:
# runs for about 16 min
# this method detects the language of a text using the langdetect package
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises an exception when the text has no usable features
        return 'could not detect language'
# we are running the detect_language method on the 'text_clean' column
data.loc[:,'lang'] = data.loc[:,'text_clean'].apply(detect_language)
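
The guard-and-fallback idea can be sketched independently of langdetect. In the sketch below the detector is passed in as a function, so it runs with a stand-in; `detect_or_unknown`, `min_chars`, and `fake_detector` are hypothetical names, and the 20-character threshold is an assumption chosen only for illustration. In this notebook the `detector` argument would be langdetect's `detect`.

```python
def detect_or_unknown(text, detector, min_chars=20, fallback="unknown"):
    """Call detector(text), returning fallback for short texts or on failure (hypothetical helper)."""
    if len(text.strip()) < min_chars:
        # too little text for a meaningful guess
        return fallback
    try:
        return detector(text)
    except Exception:
        return fallback


def fake_detector(text):
    """Stand-in detector for illustration: fails on digit-only text, otherwise answers 'en'."""
    if text.replace(" ", "").isdigit():
        raise ValueError("no features in text")
    return "en"


results = [
    detect_or_unknown("Ouch!", fake_detector),
    detect_or_unknown("1234567890 1234567890 12345", fake_detector),
    detect_or_unknown("The Senate voted on the budget bill today.", fake_detector),
]
print(results)
```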

In [None]:
# langdetect is non-deterministic by default (its README suggests setting DetectorFactory.seed for consistent results), so the English count changes slightly between runs; it should be around 61608
print(data.loc[:,'lang'].value_counts())

lang
en       61608
ru         156
es         141
de          98
fr          32
ar          19
tr           7
pt           7
it           4
no           3
nl           3
hr           3
pl           2
el           2
zh-cn        1
sw           1
vi           1
Name: count, dtype: int64


In [None]:
# show non english rows
df = data.loc[~data['lang'].eq('en')]
df.head()
#df.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/lang.csv', index=False)

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count,mean_sent_length,lang
48,Выкинуть хлам и жить по фэншуй,Общество » Практика » Как отдохнуть Как измени...,1,Выкинуть хлам и жить по фэншуй,Общество » Практика » Как отдохнуть Как измени...,,,0,,0,,0,115,95.304348,ru
91,"Sonntagsfrage: Was sagen Sie dazu, dass Donald...","Sonntag, 13. November 2016 Sonntagsfrage: Was ...",1,"Sonntagsfrage: Was sagen Sie dazu, dass Donald...","Sonntag, 13. November 2016 Sonntagsfrage: Was ...",,,0,,0,,0,9,107.333333,de
123,Трамп разбушевался,"Происшествия \nЧем ближе выборы, тем сильнее н...",1,Трамп разбушевался,"Происшествия Чем ближе выборы, тем сильнее не...",,,0,,0,,0,20,122.55,ru
144,فضيحة جنسية تهز أحد أشهر قارئي القرآن وتحرج سل...,فضيحة جنسية تهز أحد أشهر قارئي القرآن وتحرج سل...,1,فضيحة جنسية تهز أحد أشهر قارئي القرآن وتحرج سل...,فضيحة جنسية تهز أحد أشهر قارئي القرآن وتحرج سل...,,[http://ar.rt.com/i5gt],1,,0,,0,8,188.5,ar
253,Стала известна возможная причина взрыва дома в...,Фото: © Пресс-служба МЧС по Рязанской области ...,1,Стала известна возможная причина взрыва дома в...,Фото: © Пресс-служба МЧС по Рязанской области ...,,,0,,0,,0,15,148.866667,ru


In [None]:
# keep english rows
data = data.loc[data['lang'] == 'en']

# drop lang column
data = data.drop(columns=['lang'])

***
##Tokenize title and text
***

In [None]:
# add additional stop words that aren't in the nltk library list
add_stop_words = {'also'}
print(add_stop_words)

# load the nltk library stop word list
stop_words = set(stopwords.words('english'))

# combine the 2 sets - not really needed now because I only added one extra stop word, but leaving it in as a placeholder
stop_words = stop_words.union(add_stop_words)

# display stop words
print(stop_words)
print('said' in stop_words)

{'also'}
{'herself', 'shouldn', 'under', 'also', 'how', 'why', 'an', 'each', 'up', 'about', 'yourselves', 'down', 'needn', 'myself', 'him', "aren't", 'ourselves', 'only', 't', 'again', 'you', 'most', 'off', "that'll", 'during', 'in', 'after', "isn't", 'into', 'couldn', 're', 'own', 'their', "mustn't", 'be', "hasn't", 'has', 'did', 'above', 'other', 'his', 'ma', 'i', 'ours', 'the', 'hadn', 'below', "it's", 'when', 'is', 'those', 'just', 'no', 'haven', 'your', 'was', "needn't", "mightn't", 'it', "you're", "shan't", 'her', "wouldn't", 'that', 'itself', 'should', "haven't", 'our', 'such', 'doesn', "hadn't", 'isn', 'if', 'this', 'yourself', "don't", 's', 'same', 'as', "doesn't", 'been', "you'll", 'ain', 'between', 'all', "weren't", 'of', 'me', 'than', 'few', 'being', 'while', 'before', 'there', 'more', 'don', 'having', 'who', 'do', 'on', "shouldn't", 'or', 'by', 'does', 'now', 'from', 'll', 'we', 'mustn', 'because', 'and', 'they', "didn't", 'very', 'until', 've', 'hers', 'theirs', 'aren', '

In [None]:
# remove digits and other non-word characters, then tokenize the string
def clean_tokens(text):
  # text to lower
  text_clean = text.lower()

  # get rid of all digits
  text_clean = re.sub(r"\d+", '', text_clean)

  # replace non-word characters with a space
  text_clean = re.sub(r"\W", ' ', text_clean)

  # tokenize text
  tokens = word_tokenize(text_clean)

  # keep tokens that are not stop words and are longer than 1 character, then combine them into a new text string
  # I want to combine into a text string to be able to save to a csv file and load later with minimal processing
  text_list = [word for word in tokens if word not in stop_words and len(word)>1]
  token_to_text = ' '.join(text_list)

  # return text string
  return token_to_text
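
The effect of each cleaning step can be traced on a short sentence. This sketch mirrors the steps of `clean_tokens` in plain Python under stated assumptions: `str.split` stands in for `word_tokenize` (after the non-word characters are replaced, only word characters and spaces remain), and the tiny stop-word set is made up for illustration.

```python
import re

# Tiny stand-in for the NLTK stop-word set.
STOP_WORDS = {"the", "of", "also"}


def clean_tokens_sketch(text, stop_words=STOP_WORDS):
    """Lowercase, drop digits and non-word characters, then filter tokens (hypothetical helper)."""
    lowered = text.lower()
    no_digits = re.sub(r"\d+", "", lowered)
    words_only = re.sub(r"\W", " ", no_digits)
    tokens = words_only.split()
    return " ".join(token for token in tokens if token not in stop_words and len(token) > 1)


result = clean_tokens_sketch("Clinton hit the magic number of 2,383 delegates!")
short_result = clean_tokens_sketch("A 5-4 vote")
print(result)
print(short_result)
```

Note that the digit step removes every digit, including digits inside mixed tokens, not only tokens made up entirely of digits.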

In [None]:
# test clean_tokens method
text1 = data.loc[0,'text_clean']
print(text1)
print(clean_tokens(text1))
print('--------------')
title1 = data.loc[0,'title_clean']
print(title1)
print(clean_tokens(title1))

Corrections and clarifications: An earlier version of this story misstated who Bernie Sanders would be meeting with Thursday at the White House. He is scheduled to meet with President Obama.  BROOKLYN, N.Y. — Hillary Clinton marked her place in American history Tuesday night, declaring victory in the Democratic presidential race.  “Thanks to you, we’ve reached a milestone,” she told cheering supporters in Brooklyn, saying for the “first time in our nation’s history” a woman would lead a major-party ticket.  Clinton hit the magic number of 2,383 delegates needed to clinch the nomination on Monday night, as news organizations called the race for her based on support from superdelegates — party leaders and elected officials who have a vote at the convention and pledged to back her over Vermont Sen. Bernie Sanders.  Clinton waited until six states held a final round of contests Tuesday to declare victory, which will solidify her lead in pledged delegates earned through primaries and caucus

In [None]:
# tokenize the cleaned title and text columns
data.loc[:,'title_tokens_to_text'] = data.loc[:,'title_clean'].apply(clean_tokens)
data.loc[:,'text_tokens_to_text'] = data.loc[:,'text_clean'].apply(clean_tokens)

***
##Check recognized words and non-ascii characters
***

In [None]:
# load the saved text file of additional recognized words that are missing from the NLTK words and WordNet lemma collections
# the list was built from a prior run: each unrecognized word was checked with the MS Word spell checker and added here if it was not flagged as misspelled
with open('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/addl_words.txt', 'r', encoding='utf-8-sig') as file:
  # entries in the file are separated by ', '
  addl_words = file.read().split(', ')

# strip the surrounding ' characters
addl_words = [word.strip('\'') for word in addl_words]
add_words = set(addl_words)

print(add_words)
print(len(add_words)) # the count should be 49483

{'tuchman', 'maguindanao', 'latinxs', 'yolanda', 'keshia', 'iguaçu', 'chronobiology', 'seán', 'galtung', 'rhodesian', 'davante', 'fulford', 'triterpenes', 'banksy', 'billerica', 'bellen', 'neda', 'dail', 'grozny', 'frodo', 'sunbelt', 'schar', 'agoura', 'orientis', 'aireys', 'lesotho', 'xmas', 'zlatko', 'tysons', 'jonsen', 'caveated', 'bordoff', 'wyomia', 'okoth', 'sobek', 'khomeinist', 'volodin', 'rosenbaum', 'jalapeño', 'neonatologist', 'rizk', 'yasuhisa', 'ryann', 'pichincha', 'alper', 'yamashiro', 'zaharie', 'delacruz', 'deshaun', 'ocalan', 'cussler', 'wiens', 'sikorski', 'chillwave', 'subiaco', 'crispin', 'melanesians', 'venegas', 'porifera', 'overconsuming', 'homebuilt', 'adami', 'zollar', 'sokol', 'regnery', 'aptos', 'garneau', 'hanns', 'trainspotting', 'césar', 'doenitz', 'conway', 'shoaf', 'youtuber', 'binford', 'ethiopia', 'subclans', 'sliema', 'corrado', 'swaroop', 'tadamichi', 'valvano', 'zaydis', 'ferreira', 'fencepost', 'clichés', 'khodja', 'mah', 'uninhibitedly', 'ciprian
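
The parsing of the additional word list can be sketched on an in-memory stand-in for the file. The contents of `fake_file` below are hypothetical; the layout (quoted words separated by `', '`) is an assumption inferred from the split and strip calls in the cell above.

```python
import io

# Hypothetical file contents in the same "'word', 'word'" layout as addl_words.txt.
fake_file = io.StringIO("'tuchman', 'maguindanao', 'latinxs', 'tuchman'")

with fake_file as f:
    # split on the ', ' separator between entries
    raw = f.read().split(", ")

# strip the surrounding quote characters and de-duplicate with a set
parsed = {word.strip("'") for word in raw}
print(sorted(parsed))
```

A real file that ends with a newline would leave that newline attached to the last entry, so stripping whitespace before stripping quotes may be worth considering.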

In [None]:
# get NLTK word list
nltk.download('words')
word_list = set(words.words())

# get lemmas from wordnet
# get all synsets from wordnet.all_synsets() then get the lemma names from each synset
# this will help us expand the words list with different words and we just want to add the lemmas
# when we check the words we will take down to lemma form if there is not a direct match right away
wordnet_lemmas = set(lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas())
# combine lists
word_list_master = word_list.union(wordnet_lemmas)
word_list_master = word_list_master.union(add_words)
print(len(word_list_master)) # the count should be 376754

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


376754


In [None]:
# initiate a wordnet lemmatizer
wnl = WordNetLemmatizer()

# method to check if word is a recognized word
def check_words(tokens_to_text):
  # part-of-speech dictionary that maps NLTK (Penn Treebank) tags, such as 'NN' for noun or 'VB' for verb,
  # to the simplified tags ('n', 'v', 'a', 'r') used by WordNet
  pos_ref = {'NN': 'n', 'NNP': 'n', 'NNPS': 'n', 'NNS': 'n', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'RB': 'r', 'RBR': 'r', 'RBS': 'r', 'RP': 'r', 'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}
  token_lst = tokens_to_text.split()
  # list of tokens with pos tags using nltk.pos_tag method
  lst_pos_tags = nltk.pos_tag(token_lst)

  # set of non-words or misspelled words
  not_in_words = set()
  # set of non-ASCII tokens; not all foreign words were removed above, so track where they occur
  not_ascii = set()
  # add recognized words to list
  recog_words = []

  # loop through (word, pos_tag) pairs
  for word, tag in lst_pos_tags:
    # recognized directly, or after default lemmatization
    if word in word_list_master or wnl.lemmatize(word) in word_list_master:
      recog_words.append(word)
      continue
    # otherwise map the NLTK pos tag to a WordNet pos and lemmatize with that information
    pos = pos_ref.get(tag)
    if pos is not None and wnl.lemmatize(word, pos) in word_list_master:
      recog_words.append(word)
    # still not recognized: sort the token into the ASCII or non-ASCII set
    elif word.isascii():
      not_in_words.add(word)
    else:
      not_ascii.add(word)
      print(word)  # print every non-ASCII token
  recog_words_to_text = ' '.join(recog_words)
  # return recog_words_to_text and the not_in_words set and not_ascii set as columns
  return pd.Series({'recog_words_to_text': recog_words_to_text, 'text_not_words': not_in_words, 'not_ascii': not_ascii})
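
The tag mapping used in `check_words` can also be expressed by the first letter of the tag. The sketch below is an alternative formulation, not the notebook's code; `penn_to_wordnet` is a hypothetical helper, and it assumes that no other tag produced by the tagger starts with `N`, `J`, `R`, or `V`.

```python
# Penn Treebank tag -> WordNet POS, as listed in the notebook's pos_ref dictionary.
pos_ref = {'NN': 'n', 'NNP': 'n', 'NNPS': 'n', 'NNS': 'n',
           'JJ': 'a', 'JJR': 'a', 'JJS': 'a',
           'RB': 'r', 'RBR': 'r', 'RBS': 'r', 'RP': 'r',
           'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    return {"N": "n", "J": "a", "R": "r", "V": "v"}.get(tag[:1])

# the first-letter rule agrees with every entry of the explicit dictionary
assert all(penn_to_wordnet(tag) == pos for tag, pos in pos_ref.items())

# tags outside the four groups map to None, like pos_ref.get(tag)
print(penn_to_wordnet("VBG"), penn_to_wordnet("DT"))
```

The explicit dictionary is easier to audit, while the first-letter rule is shorter; either returns `None` for tags such as `DT`, which sends the token down the ASCII check.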

In [None]:
# test check_words function
text1 = data.loc[0,'text_tokens_to_text']
print(text1)

good, bad, not_ascii = check_words(text1)
print(good)
print(bad)
print(not_ascii)

corrections clarifications earlier version story misstated bernie sanders would meeting thursday white house scheduled meet president obama brooklyn hillary clinton marked place american history tuesday night declaring victory democratic presidential race thanks reached milestone told cheering supporters brooklyn saying first time nation history woman would lead major party ticket clinton hit magic number delegates needed clinch nomination monday night news organizations called race based support superdelegates party leaders elected officials vote convention pledged back vermont sen bernie sanders clinton waited six states held final round contests tuesday declare victory solidify lead pledged delegates earned primaries caucuses well advantage overall popular vote clinton picked easy win new jersey claimed victories new mexico south dakota sanders meanwhile north dakota caucuses montana primary ap cnn nbc called california clinton early wednesday clinton celebrated supporters brooklyn 

In [None]:
# run check_words on every row and store the three returned values as new columns
# this takes approximately 20 minutes
data[['recog_words_to_text', 'text_not_words', 'not_ascii']] = data['text_tokens_to_text'].apply(check_words)

forêt
cotterêts
forêt
islāmic
sèvres
sèvres
pájaro
mörner
naïve
funès
bundespräsidentenstichwahlwiederholungsverschiebung
exposé
dieudonné
dieudonné
dieudonné
dieudonné
dieudonné
vercingétorix
mañana
janaé
naïve
didnâ
holocaustâ
particularistâ
centricâ
weâ
countryâ
shinzō
würselen
müntefering
schöneberg
śâ
hereâ
lascañas
lascañas
lascañas
lascañas
avanceña
avanceña
pèlerin
fahrgästen
rüdiger
mädchen
überreicht
caitlín
oñate
oñate
yıldırım
jørgen
tensión
jóvenes
univisión
capturó
ésto
valentía
después
cómo
teléfonos
sí
ésta
lloré
aún
después
jamás
pensé
podría
ésta
saña
ésta
ésta
ésta
únicas
teléfono
micrófono
única
así
movilización
defensoría
musée
schwäbisch
gmünd
km²
km²
lineageالقادمون
ædonis
begründung
atilémilé
castañeda
español
español
kéké
القادمون
åsa
ಠ_ಠ
sucré
salé
baném
ángel
remón
crítico
peñuelas
rumbakuá
etián
señora
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
fariña
àlvaro
çavuşoğlu
süddeutsche
lorén
aqṣa
ḥaram
beyoncés
díazes
campe

In [None]:
# count the number of non_ascii tokens
data.loc[:,'non_ascii_count'] = data.loc[:,'not_ascii'].map(lambda lst: len(lst))

In [None]:
# see which label has the most rows with a high non_ascii_count (more than 5)
# these rows most likely contain text that mixes English and foreign words
df_no_ascii_high = data[data['non_ascii_count'] > 5]
print(df_no_ascii_high['label'].value_counts())

label
1    42
0    19
Name: count, dtype: int64


In [None]:
# calculate non_recog_word_count and the percentage of the text that is not recognized
data.loc[:,'non_recog_word_count'] = data.loc[:,'text_not_words'].map(lambda lst: len(lst))
# note: 'text_tokens_to_text' is a string, so len() here returns its character count, not its number of tokens
data.loc[:,'total_word_count'] = data.loc[:,'text_tokens_to_text'].map(lambda lst: len(lst))
data['non_word_percent'] = (data['non_recog_word_count'] + data['non_ascii_count']) / data['total_word_count'] * 100
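
Because `len` on a string counts characters, and the two sets hold unique tokens only, a token-based variant of the percentage may be closer to the intent. The sketch below is one possible formulation under that assumption; `non_word_percent_tokens` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def non_word_percent_tokens(tokens_to_text, not_in_words, not_ascii):
    """Percentage of tokens that fall in either unrecognized set, counted per occurrence."""
    tokens = tokens_to_text.split()
    if not tokens:
        return 0.0
    unrecognized = not_in_words | not_ascii
    hits = sum(1 for token in tokens if token in unrecognized)
    return hits / len(tokens) * 100

example = "ctbto looking unusual seismic activity ctbto"
print(len(example))          # character count, which is what len() on the string returns
print(len(example.split()))  # token count
print(non_word_percent_tokens(example, {"ctbto"}, set()))
```

Applied row by row with `DataFrame.apply(..., axis=1)`, this would give percentages on a different scale from the ones printed in this notebook, so the two should not be compared directly.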

In [None]:
# compare non_word_percent by label
print(data.loc[data['label'] == 0, 'non_word_percent'].mean())
print(data.loc[data['label'] == 1, 'non_word_percent'].mean())

0.13381513238181675
0.19995186342467755


***
##Remove non-ascii tokens out of text tokens
***

In [None]:
# function to remove the non-ASCII tokens from the 'text_tokens_to_text' column
def clean_ascii(text_tokens, not_ascii):
  words = text_tokens.split()
  text_ascii = [word for word in words if word not in not_ascii]
  text_clean_ascii = ' '.join(text_ascii)
  return text_clean_ascii

In [None]:
# test above method
text_clean = data.loc[7666,'text_tokens_to_text']
not_ascii = data.loc[7666,'not_ascii']
print(clean_ascii(text_clean, not_ascii))

things need know sanctuary cities aaron bandler november donald trump elected presidency one policies likely come fire sanctuary cities cities policies make safe havens illegal aliens issue became front center early portion republican primary trump decried murder kate steinle illegal alien san francisco sanctuary city response election trump leftist city leaders digging officials los angeles chicago boston signaling cooperate federal government deportation efforts five things need know sanctuary cities sanctuary cities blatant violation federal law left tried claim perfectly legal clearly false james walsh former associate general counsel immigration naturalization services explains usc section deals persons knowingly conceal harbor shield undocumented aliens could apply officials sanctuary cities states fact leftists digging heels sanctuary cities means supporting form nullification irony missed victor davis hanson much rural west opposes endangered species act wyoming declare federal

In [None]:
# create a new column of clean tokens that do not have non-ascii tokens
data.loc[:,'text_clean_ascii'] = data.loc[:,['text_tokens_to_text', 'not_ascii']].apply(lambda row: clean_ascii(row['text_tokens_to_text'], row['not_ascii']), axis=1)

***
##Title and Text Similarity with word2vec
***

In [None]:
# check for rows where 'title_tokens_to_text' is blank
# I need to keep the rows
df = data.loc[data['title_tokens_to_text'].str.len()<1,['title_clean','text_clean','title_tokens_to_text','text_tokens_to_text','label']]
print(df)

Empty DataFrame
Columns: [title_clean, text_clean, title_tokens_to_text, text_tokens_to_text, label]
Index: []


In [None]:
# drop any rows whose 'title_tokens_to_text' is empty (length less than 1); the check above found none
data.drop(data.loc[data['title_tokens_to_text'].str.len() < 1].index, inplace=True)

In [None]:
# list of tokens from text and title
# this is to establish a vocabulary
tokens_collection_title = [token_lst.split() for token_lst in data.loc[:,'title_tokens_to_text']]
tokens_collection_text = [token_lst.split() for token_lst in data.loc[:,'text_tokens_to_text']]
tokens_collection = tokens_collection_title + tokens_collection_text
print(tokens_collection[0:10])

[['clinton', 'makes', 'history', 'declares', 'win', 'democratic', 'race'], ['ctbto', 'looking', 'unusual', 'seismic', 'activity', 'north', 'korea'], ['fakenews', 'made', 'mainstream', 'media'], ['push', 'yemen', 'aid', 'warned', 'saudis', 'threats', 'congress'], ['hurricane', 'mathew', 'vs', 'shock', 'awe', 'empire'], ['casey', 'anthony', 'seen', 'crowd', 'trump', 'protesters', 'mar', 'lago'], ['oath', 'office', 'words', 'harder', 'look'], ['four', 'year', 'old', 'dies', 'finding', 'loaded', 'gun', 'friend', 'home'], ['defiant', 'kurds', 'shrug', 'risk', 'trade', 'war', 'independence', 'vote'], ['east', 'timor', 'president', 'swears', 'first', 'minority', 'government']]


In [None]:
# train a word2vec model on the title and text tokens
# vector_size is the dimensionality of the word vectors; a higher dimension can capture more complex relationships
# window sets how many words before and after the target word are used as context
word2vec_model = Word2Vec(tokens_collection, vector_size=1000, window=5, min_count=2, workers=4)

In [None]:
def title_text_similarity(title, text, model=word2vec_model):
  title_tokens = title.split()
  text_tokens = text.split()

  # title_vec and text_vec must have the same size as vector_size in the word2vec model
  title_vec = np.zeros(model.wv.vector_size)
  text_vec = np.zeros(model.wv.vector_size)

  # loop through title tokens
  for token in title_tokens:
    # if token in word2vec model add to title_vec
    if token in model.wv:
      title_vec = np.add(title_vec, model.wv[token])

  # loop through text tokens
  for token in text_tokens:
    # if token in word2vec model add to text_vec
    if token in model.wv:
      text_vec = np.add(text_vec, model.wv[token])


  # if either title_vec or text_vec is a zero vector, cosine similarity is undefined, so return 0
  if np.linalg.norm(title_vec) == 0 or np.linalg.norm(text_vec) == 0:
      return 0
  else:
      # similarity calculation = (dot product of title_vec and text_vec) / (magnitude of title_vec * magnitude of text_vec)
      # cosine similarity is calculated by dividing the dot product of two vectors by the product of their magnitudes
      return round(np.dot(title_vec, text_vec) / (np.linalg.norm(title_vec) * np.linalg.norm(text_vec)), 4)
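
The similarity calculation can be checked by hand on a toy vocabulary. The sketch below is written in plain Python so that it runs without gensim or NumPy; `toy_vectors` is a hypothetical 3-dimensional stand-in for `model.wv`, and `sum_vectors` and `cosine` are illustrative helpers rather than part of the notebook.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "north": [1.0, 0.0, 1.0],
    "korea": [0.0, 1.0, 1.0],
    "nuclear": [1.0, 1.0, 0.0],
}

def sum_vectors(tokens, vectors, size=3):
    """Add up the vectors of the tokens that are present in the vocabulary."""
    total = [0.0] * size
    for token in tokens:
        if token in vectors:
            total = [a + b for a, b in zip(total, vectors[token])]
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

title_vec = sum_vectors("north korea".split(), toy_vectors)
text_vec = sum_vectors("nuclear north korea unknown".split(), toy_vectors)
print(round(cosine(title_vec, text_vec), 4))
```

Cosine similarity does not change when a vector is multiplied by a positive constant, so summing the token vectors gives the same score as averaging them. The length of the title or text therefore does not need to be normalized separately.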

In [None]:
# test title_text_similarity method
print(data.loc[1,'title_tokens_to_text'])
print(data.loc[1,'text_tokens_to_text'])

print(title_text_similarity(data.loc[1,'title_tokens_to_text'], data.loc[1,'text_tokens_to_text']))

ctbto looking unusual seismic activity north korea
nuclear proliferation watchdog ctbto examining unusual seismic activity north korea took place around km miles previous nuclear testing isolated country said saturday analysts looking unusual activity much smaller magnitude ctbto executive secretary lassina zerbo said twitter post korean peninsula unusual activity lat lon mb km prior tests analysts investigating said subsequent post china earthquake administration said detected magnitude quake north korea suspected explosion raising fears pyongyang might conducted another nuclear bomb test ctbto spokeswoman said zerbo remark smaller magnitude referred monitoring event north korea sept agency described consistent man made explosion stopped short calling nuclear blast pending testing airborne radioactivity north korea described sept incident test advanced hydrogen bomb long range missile marking dramatic escalation regime stand united states allies ctbto preparatory commission comprehens

In [None]:
# calculate title_text_similarity on all rows
data.loc[:,'title_text_similarity'] = data.loc[:,['title_tokens_to_text', 'text_tokens_to_text']].apply(lambda row: title_text_similarity(row['title_tokens_to_text'], row['text_tokens_to_text']), axis=1)

In [None]:
# compare the mean title_text_similarity by label
print(data.loc[data['label']==0, 'title_text_similarity'].mean())
print(data.loc[data['label']==1, 'title_text_similarity'].mean())

0.671165579500809
0.6375889016527088


***
##Save data
***

In [None]:
data.head()

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,...,text_tokens_to_text,recog_words_to_text,text_not_words,not_ascii,non_ascii_count,non_recog_word_count,total_word_count,non_word_percent,text_clean_ascii,title_text_similarity
0,"Clinton makes history, declares win in Democra...",Corrections and clarifications: An earlier ver...,0,"Clinton makes history, declares win in Democra...",Corrections and clarifications: An earlier ver...,,,0,,0,...,corrections clarifications earlier version sto...,corrections clarifications earlier version sto...,"{ap, nbc, mcduff, abc, cnn, realclearpolitics}",{},0,6,4116,0.145773,corrections clarifications earlier version sto...,0.7144
1,"CTBTO looking at ""unusual seismic activity"" in...",ZURICH (Reuters) - Nuclear proliferation watch...,0,"CTBTO looking at ""unusual seismic activity"" in...",Nuclear proliferation watchdog CTBTO is examin...,,,0,,0,...,nuclear proliferation watchdog ctbto examining...,nuclear proliferation watchdog examining unusu...,"{mb, ctbto, lassina, zerbo}",{},0,4,1121,0.356824,nuclear proliferation watchdog ctbto examining...,0.719
2,#FakeNews Made By Mainstream Media,21st Century Wire says There is no greater sou...,1,#FakeNews Made By Mainstream Media,21st Century Wire says There is no greater sou...,,[https://api.soundcloud.com/tracks/304139324],1,@21WIRE,1,...,st century wire says greater source called fak...,st century wire says greater source called fak...,"{filessupport, show_comments, hide_related, nb...",{},0,12,541,2.218115,st century wire says greater source called fak...,0.5834
3,"In push for Yemen aid, U.S. warned Saudis of t...",WASHINGTON (Reuters) - The United States has w...,0,"In push for Yemen aid, U.S. warned Saudis of t...",The United States has warned Saudi Arabia that...,,,0,,0,...,united states warned saudi arabia anger congre...,united states warned saudi arabia anger congre...,{},{},0,0,433,0.0,united states warned saudi arabia anger congre...,0.6875
4,Hurricane Mathew vs. shock and awe of empire,Hurricane Mathew vs. shock and awe of empire B...,1,Hurricane Mathew vs. shock and awe of empire,Hurricane Mathew vs. shock and awe of empire B...,,[TheSleuthJournal.com],1,,0,...,hurricane mathew vs shock awe empire philip fa...,hurricane mathew vs shock awe empire philip po...,"{farruggio, ed, op}",{},0,3,2689,0.111566,hurricane mathew vs shock awe empire philip fa...,0.3405


In [None]:
#data.loc[:3000].to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/working.csv', index=False)

In [None]:
# save to csv
data.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/DataClean.csv', index=False)

In [None]:
print(data['label'].value_counts())
#0    34615
#1    26983

label
0    34616
1    26986
Name: count, dtype: int64
