<a href="https://colab.research.google.com/github/desireedisco/MSDS-Machine-Learning-Supervised/blob/main/1_Data_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
#Notebook for Data Preprocessing
---


This notebook is for data preprocessing. We will do the following:
* Load file, drop index, drop na, drop duplicates
* Separate out the email links, web links, hashtag, mentions.
* Clean the data leakage problem in the text column. There are also some minor problems in the title column that will be fixed.
* Count the number of sentences and calculate the mean sentence length.
* Drop the foreign language rows as determined by langdetect
* Check to see of the remaining tokens which are recognized as words and calculate the non-recognized word percentage
* Calculate the title and text similarity


In [174]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [175]:
#install langdetect for when we separate out foreign language rows
!pip install langdetect



In [176]:
import pandas as pd
import numpy as np
import re
import nltk

from langdetect import detect
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from gensim.models import Word2Vec

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

***
##Load file, drop index, drop na, drop duplicates
***

In [177]:
# read csv file
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/WELFake_Dataset.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO],No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others o...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MOST CHARLOTTE RIOTERS WERE “PEACEFUL” PROTESTERS…In Her Home State Of North Carolina [VIDEO],"Now, most of the demonstrators gathered last night were exercising their constitutional and protected right to peaceful protest in order to raise issues and create change. Loretta Lynch aka Er...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid",A dozen politically active pastors came here for a private dinner Friday night to hear a conversion story unique in the context of presidential politics: how Louisiana Gov. Bobby Jindal traveled f...,0
4,4,SATAN 2: Russia unvelis an image of its terrifying new ‘SUPERNUKE’ – Western world takes notice,"The RS-28 Sarmat missile, dubbed Satan 2, will replace the SS-18 Flies at 4.3 miles (7km) per sec and with a range of 6,213 miles (10,000km) The weapons are perceived as part of an increasingly ag...",1


In [178]:
# show dataframe info
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB
None


**Note on matching Real and Fake labels:**
In the dataset description the authors state that there are 72,134 news articles with 35,028 real and 37,106 fake news articles. The authors then go on to state the labels are labeled as follows: 0=fake and 1=real. The two statements are contradictory based on the following label counts. Upon further inspection of the data and following the authors first statement of '35,028 real and 37,106 fake news articles', I am following the mapping for the labels as 0=real and 1=fake.

In [179]:
# show the number of rows labeled 1 and number of rows labeled 0
print(data['label'].value_counts())

# show unique count for title and text
print(data.describe(include=['object']))

label
1    37106
0    35028
Name: count, dtype: int64
                                                       title   text
count                                                  71576  72095
unique                                                 62347  62718
top     Factbox: Trump fills top jobs for his administration       
freq                                                      14    738


In [180]:
# drop dataset index column
data = data.drop(columns=['Unnamed: 0'])

**Check to see how many rows are null and what the percentage of total data is before dropping rows.**

In [181]:
# show the number of rows that are null
for col in data.columns:
    print(f'{data[col].isnull().sum()} rows are null in ' + col)

# show what percentage of total is null for each column
for col in data.columns:
    print(col, f'{round(data[col].isnull().sum() / data.shape[0] * 100, 2)} % is null')

558 rows are null in title
39 rows are null in text
0 rows are null in label
title 0.77 % is null
text 0.05 % is null
label 0.0 % is null


In [182]:
#drop all null values
data = data.dropna().reset_index(drop=True)

**Split the data by label and check and drop duplicate 'text' columns. Then combine to check for duplicates**

I did this because I did not want to drop duplicates that had multiple labels attached. If there were any duplicate articles that had different labels then I wanted to know about that. I this dataset the only duplicate with multiple labels was a blank text row.

In [183]:
#split the data by label to look at unique count in case there are duplicates label both real and fake
real_data = data[data['label'] == 0]
fake_data = data[data['label'] == 1]
print(real_data.describe(include=['object']))
print(fake_data.describe(include=['object']))

                                                       title  \
count                                                  35028   
unique                                                 34409   
top     Factbox: Trump fills top jobs for his administration   
freq                                                      14   

                                                                                                                         text  
count                                                                                                                   35028  
unique                                                                                                                  34621  
top     Killing Obama administration rules, dismantling Obamacare and pushing through tax reform are on the early to-do list.  
freq                                                                                                                       58  
                                       

In [184]:
#drop duplicate text stories
real_data = real_data.drop_duplicates(subset=['text']).reset_index(drop=True)
fake_data = fake_data.drop_duplicates(subset=['text']).reset_index(drop=True)

# combine the separate labels to one to check for duplicates
data = pd.concat([real_data, fake_data], ignore_index=True)
print(data.describe(include=['object']))

                                                       title   text
count                                                  62201  62201
unique                                                 61400  62200
top     Factbox: Trump fills top jobs for his administration       
freq                                                      14      2


**As you can see the duplicate with multiple labels had blank text column.**

In [185]:
# display duplicate rows that have 2 different labels
print(data[data['text'].duplicated(keep=False)])

                                                                                 title  \
920                                                     Graphic: Supreme Court roundup   
34626  HOUSE INTEL CHAIR On Trump-Russia Fake Story: “No evidence of anything” [Video]   

      text  label  
920             0  
34626           1  


In [186]:
# they have blank text so drop rows
data = data.drop_duplicates(subset=['text'], keep=False).reset_index(drop=True)

# check total counts and unique counts
print(data.describe(include=['object']))

# display value counts for labels
data['label'].value_counts()

#value counts should be
# 0 - 34620
# 1 - 27579

                                                       title  \
count                                                  62199   
unique                                                 61398   
top     Factbox: Trump fills top jobs for his administration   
freq                                                      14   

                                                                                                                                                                                                           text  
count                                                                                                                                                                                                     62199  
unique                                                                                                                                                                                                    62199  
top     A dozen politically active pastors came h

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,34620
1,27579


**Shuffle and reindex dataset**

In [187]:
# shuffle dataframe and reset index
data = data.sample(frac=1).reset_index(drop=True)

***
##Clean Title - Source Data Leakage
***

In the data_leakage notebook we demonstrated that some source info that was embedded in the 'text' column influenced the model. Because not all articles have source info, it was unduly providing bias in the model. I looked through the 'title' column and decided there was similar data that could influence the model: therefore, I decided to remove that as well. I just want to focus on the actual title and actual text, not the sources of the articles. We could use the sources as a feature if all the articles have source information. In the current dataset, only the real news includes sources so it needs to be removed.

In [188]:
def clean_title_leakage(title):
  # making sure title is a string
  title = str(title)

  # get rid of New York Times and Breitbart reference
  pattern_leakage = r" - The New York Times$| - Breitbart$"
  # find all occurrences not just the first one
  match_lst = re.findall(pattern_leakage, title)
  # if match then substitue for ''
  if match_lst:
    title = re.sub(pattern_leakage, '', title)
    #print(match_lst)

  return title

In [189]:
# create a new column for the clean title and apply the clean_title_leakage method
data['title_clean'] = data['title'].apply(clean_title_leakage)

In [190]:
# just as well as the real news provides source information the fake news refers to a source of video so I decide to clean this as well
def clean_title_video(title):

  # get rid of Video reference - easy match pattern
  pattern_video = r"\[VIDEO\]|\(VIDEO\)"

  # find all matches and ignore case
  match_lst = re.findall(pattern_video, title, re.IGNORECASE)

  if match_lst:
    title = re.sub(pattern_video, '', title, flags=re.IGNORECASE)
    #print(match_lst)

  # search for more match patterns to get rid of all references to (Video)
  pattern_video_re = r"\[video[a-z 0-9/,+-].*\]|\[[a-z 0-9/,+-].*video\]|\(video[a-z 0-9/,+-].*\)|\([a-z 0-9/,+-].*video\)|\([a-z 0-9/,+-].*videos\)"
  match_lst_re = re.findall(pattern_video_re, title, re.IGNORECASE)
  if match_lst_re:
    title = re.sub(pattern_video_re, '', title, flags=re.IGNORECASE)
    #print(match_lst_re)

  return title

In [191]:
# cleaning the 'title_clean column of the (Video) references which is skewed to fake news
data['title_clean'] = data['title_clean'].apply(clean_title_video)

In [192]:
# increase column width to display wider columns
pd.set_option('display.max_colwidth', 200)

In [193]:
# check to see changes have been made
df_show = data[['title','title_clean']]
print(df_show.head)

<bound method NDFrame.head of                                                                                                             title  \
0                                                           Democrats see chance to reshape map as Trump stumbles   
1                       Spanish government to suspend direct rule in Catalonia if regional election called: media   
2                                         Cory Booker SCORCHES Trump Over New Muslim Ban In BLISTERING Tweetstorm   
3                                         Heart-Rending Testimony as Dylann Roof Trial Opens - The New York Times   
4                                                   Trump Supporter In Phoenix Just Threatened John McCain’s Life   
...                                                                                                           ...   
62194         OBAMA ADMINISTRATION Sues Private Business For Saying NO To Dreadlocks…Just The Tip Of The Iceberg!   
62195              EXCLUSIVE: Form

***
##Clean Text
***

In [194]:
# clean out newline reference
data['text_clean'] = data['text'].str.replace('\n', ' ')

**Extract email links**

In [195]:
# method to extract email links from text and put in another data frame column
def extract_email_links(text):
  emails = ''

  # get rid of web links - match for web links
  pattern_email = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
  match_lst = re.findall(pattern_email, text)
  # if match we will substitute email for '' in text_clean and add to emails list
  if match_lst:
    text = re.sub(pattern_email, '', text)
    #print(match_lst)
    emails = match_lst
  # return text_clean and emails list
  return pd.Series({'text_clean': text, 'email': emails})

In [196]:
# get text_clean striped of emails and put emails in separate column
data[['text_clean', 'email']] = data['text_clean'].apply(lambda x: pd.Series(extract_email_links(x)))

In [197]:
# show changes
data.loc[data['email'].str.len()>0,['text_clean', 'email']].head()

Unnamed: 0,text_clean,email
98,"Accuracy in Media – by Cliff Kincaid In one of her secret speeches, Hillary Clinton said, “My dream is a hemispheric common market, with open trade and open borders…” Before this comment was reve...",[cliff.kincaid@aim.org]
326,"**Want FOX News Halftime Report in your inbox every day? Sign up here.** On the roster - Bernie’s swan song begins - Dubya back in the game - Rubio leans towards a run, GOP insiders say - Save yo...",[HALFTIMEREPORT@FOXNEWS.COM]
407,"BARRY GREY, WSWS.ORG O ne week before Election Day, with the polls tightening and the outcome uncertain, the entire US political system has been thrown into turmoil by the unprecedented interventi...",[editor@greanvillepost.com]
431,"Dispatches from STEPHEN LENDMAN A s the moment of truth approaches, America’s two leading broadsheets, the NYT and WaPo, continue relentlessly bashing Trump, shamelessly supporting a woman belongi...",[lendmanstephen@sbcglobal.net]
525,"Sent: Monday, March 30, 2015 9:58:55 AM To: Amitabh Desai Cc: Jon Davidson; Margaret Steenburg; Jake Sullivan; Dan Schwerin; Huma Abedin; John Podesta Subject: Re: Victor Pinchuk Team HRC – we’ll...",[ami@presidentclinton.com]


**Extract web links**

In [198]:
# method to extract web links and put in another dataframe column
def extract_web_links(text):
  links = ''

  #get rid of web links - we try to match a number of different patterns to try to capture all the weblinks in the text column
  pattern_html = r"https?://(?:www\.)?[a-zA-Z0-9./?=&_-]+|pic.twitter.com[a-zA-Z0-9./]+|[a-zA-Z0-9./]+(?:\.com|\.org)/[a-zA-Z0-9/\-.]+|[a-zA-Z0-9./]+(?:\.com|\.org)"
  match_lst = re.findall(pattern_html, text)
  # if match we will substitute web link for '' in text_clean and add to web link list
  if match_lst:
    text = re.sub(pattern_html, '', text)
    #print(match_lst)
    links = match_lst

  # return text_clean and web link list
  return pd.Series({'text_clean': text, 'links': links})

In [199]:
# get text_clean striped of web links and put web links in separate column
data[['text_clean', 'links']] = data['text_clean'].apply(lambda x: pd.Series(extract_web_links(x)))
# record link count in separate column
data['link_count'] = data['links'].apply(lambda x: len(x))

In [200]:
# show changes
data.loc[data['links'].str.len()>0,['text_clean', 'links']].head()

Unnamed: 0,text_clean,links
4,"Donald Trump, in his ongoing effort to spur on a Civil War, is holding a campaign rally in Phoenix, Arizona and his white supremacist supporters are waiting for him. While violence hasn t yet brok...","[pic.twitter.com/pwjogHfYgH, pic.twitter.com/X6tlFtyZT5]"
13,The victim of the angry Bernie Sanders supporter has an amazing attitude. It s the civility many of us have come to expect from Trump supporters who ve been unjustly attacked by uncivil and in man...,"[pic.twitter.com/QghClfl2ak, pic.twitter.com/AlVPDLym85]"
15,Getty - Jemall Countess/Stringer The Wildfire is an opinion platform and any opinions or information put forth by contributors are exclusive to them and do not represent the views of IJR. Megyn K...,[People.com]
40,"Print Fairfax County, Virginia, voter Jena Jones told WND and Radio America she found this Democrat insert included with her absentee ballot, among others Democratic Party officials in Fairfax Co...",[WND.com]
64,21st Century Wire says The US media s neoMcCarthy-like witch hunt continues as social media giant Facebook creates a newly designed feature that allows the website to curate favored news in its ...,[https://github.com/selfagency/bs-detector/blob/master/chrome/data/data.json]


**Extract mentions**

In [201]:
# method to extract mentions and put in another dataframe column
def clean_mentions(text):
  mentions = ''

  # get rid of mentions in text and keep mentions in a separate column
  pattern_mentions = r"@\w+"
  match_lst = re.findall(pattern_mentions, text)
  # if match we will substitute mentions for '' in text_clean and add to mention list
  if match_lst:
    text = re.sub(pattern_mentions, '', text)
    #print(match_lst)
    mentions = match_lst

  # return text_clean and mention list
  return pd.Series({'text_clean': text, 'mentions': mentions})

In [202]:
# get text_clean striped of mentions and put mentions in separate column
data[['text_clean', 'mentions']] = data['text_clean'].apply(lambda x: pd.Series(clean_mentions(x)))
# new column with mention count
data['mentions_count'] = data['mentions'].apply(lambda x: len(x))
# turn mention list into string so we can vectorize latter on if we want
data['mentions'] = data['mentions'].apply(lambda mention_lst: ' '.join([str(mention) for mention in mention_lst]))

In [203]:
# show changes
data.loc[data['mentions'].str.len()>0,['text_clean', 'mentions']].head()

Unnamed: 0,text_clean,mentions
2,"Well, the moment we have all been anticipating has come to pass. Donald Trump and the bigots in his administration have rolled out a new Muslim Ban. There are very few differences. It excludes Ira...",@SenBookerOffice @SenBookerOffice @SenBookerOffice @SenBookerOffice @SenBookerOffice @SenBookerOffice @SenBookerOffice @realDonaldTrump @SenBookerOffice
4,"Donald Trump, in his ongoing effort to spur on a Civil War, is holding a campaign rally in Phoenix, Arizona and his white supremacist supporters are waiting for him. While violence hasn t yet brok...",@davecatanese @realDonaldTrump @FNorth49 @Lifelibertyguy @BancoLeonard @SamBam39 @marymac169
8,"Four Israeli soldiers were murdered and at least 16 people were wounded, some seriously, when an Arab resident of eastern Jerusalem rammed his truck into pedestrians near Jerusalem’s Armon Hanatzi...",@AaronKleinShow
11,“Modern Family” star Sarah Hyland says her skinny appearance lately is due to a medical condition. [The says in a social media post that critics have accused her of promoting anorexia in pictur...,@Sarah_Hyland @Sarah_Hyland
13,The victim of the angry Bernie Sanders supporter has an amazing attitude. It s the civility many of us have come to expect from Trump supporters who ve been unjustly attacked by uncivil and in man...,@pulsarVision @USAA @pulsarVision @pulsarVision @USAA_help @BreitbartNews @Cernovich @DanScavino @realDonaldTrump @Nero @pulsarVision @pulsarVision


**Extract hashtags**

In [204]:
# method to extract hashtags and put in another dataframe column
def clean_hashtags(text):
  hashtags = ''
  hashtag_count = 0

  # get rid of hashtags in text and keep mentions in a separate column
  pattern_hashtags = r"#\w+"
  match_lst = re.findall(pattern_hashtags, text)

  # if match we will substitute mentions for '' in text_clean and add to mention list
  if match_lst:
    text = re.sub(pattern_hashtags, '', text)

    hashtags = match_lst
    hash_clean = ''

    # we want to search through hashtag list for #1,#2, or other strings with just numbers after # because there is a lot of them and they are not hashtags
    for index, hashtag in enumerate(hashtags):
      # if match is a number string do nothing otherwise add to hashtag list
      if re.search(r"#\d+$", hashtag):
        pass
      else:
        hash_clean = hash_clean + ' ' + hashtag
    # strip left space
    hashtags = hash_clean.lstrip()
    #print(hashtags)

    # count the number of hashtags in list
    hashtag_count = len(hashtags.split(' '))

  # return text_clean and hashtag list in string form and return hashtag count
  return pd.Series({'text_clean': text, 'hashtags': hashtags, 'hashtag_count': hashtag_count})

In [205]:
# process text to return 'text_clean', hashtag list, and hashtag count
data[['text_clean', 'hashtags', 'hashtag_count']] = data['text_clean'].apply(lambda x: pd.Series(clean_hashtags(x)))

In [206]:
# show changes
data.loc[data['hashtags'].str.len()>0,['text_clean', 'hashtags']].head()

Unnamed: 0,text_clean,hashtags
2,"Well, the moment we have all been anticipating has come to pass. Donald Trump and the bigots in his administration have rolled out a new Muslim Ban. There are very few differences. It excludes Ira...",#MuslimBan
4,"Donald Trump, in his ongoing effort to spur on a Civil War, is holding a campaign rally in Phoenix, Arizona and his white supremacist supporters are waiting for him. While violence hasn t yet brok...",#crocodiletears #Deplorables
36,This news is heartbreaking because it s the unnecessary delay in treatment that ultimately determined the fate of this little baby: The window of opportunity has been lost Chris Gard and Connie Y...,#CharlieGard
76,"The Year 2016 Set To Be Hottest On Record By AFP November 14, "" AFP "" - The year 2016 will ""very likely"" be the hottest on record, the UN said Monday, warning of calamitous consequences if the ma...",#StateofClimate #COP22
111,Tune in to the Alternate Current Radio Network (ACR) for another LIVE broadcast of The Boiler Room tonight 6:00 PM PST | 8:00 PM CST | 9:00 PM EST for this special broadcast. Join us for uncensor...,#BernieBro #113Please


**Clean leakage issue**

See notebook 2_LeakageTest for further info

In [207]:
# clean the references to (Reuters) from the text column
def clean_reuters_leakage(text):

  # get rid of reference to (Reuters) in text
  # some articles start out with the is a correction not followed by the (Reuters) reference and we are matching the whole string and substituting out
  pattern_reuters_1 = r"^[“”‘’A-Za-z.,&$()/\-:;0-9  ]*(\(Reuters\) - |\(Reuters\)\) - |\(Reuters\)  —  )"
  # we are matching first occurence so we are using match and not findall
  match_str_1 = re.match(pattern_reuters_1, text)
  if match_str_1:
    text = re.sub(pattern_reuters_1, '', text)
    #print(match_str_1.group())

  return text

In [208]:
# clean the (Reuters) reference out
data['text_clean'] = data['text_clean'].apply(clean_reuters_leakage)

***
##Sentence count and mean sentence length.
***

I want to drop the blank text_clean rows and make sure I have a decent mean sentence length for the langdetect package to work.

In [209]:
# using NLTK sentence tokenizer to count the number of sentences in each text passage in the 'text_clean' column
data.loc[:,'text_sent_count'] = data.loc[:,'text_clean'].map(lambda txt: len(sent_tokenize(txt)))

In [210]:
# check and see if there are any less than 1
# in the result dataframe the text column is just links with the text_clean column blank
data[data['text_sent_count'] < 1].head()

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count
3496,"TOMI LAHREN Blasts The Left For Attacking Trump’s Grandson…Yes, his GRANDSON! [Video]",https://www.youtube.com/watch?time_continue=1&v=NeqMSI6OR5Q,1,"TOMI LAHREN Blasts The Left For Attacking Trump’s Grandson…Yes, his GRANDSON!",,,[https://www.youtube.com/watch?time_continue=1&v=NeqMSI6OR5Q],1,,0,,0,0
4210,BRILLIANT! TUCKER CARLSON Humiliates Jill Stein’s Campaign Manager Over Phony Recount Scam [Video],https://www.youtube.com/watch?v=uQbAww5wajA,1,BRILLIANT! TUCKER CARLSON Humiliates Jill Stein’s Campaign Manager Over Phony Recount Scam,,,[https://www.youtube.com/watch?v=uQbAww5wajA],1,,0,,0,0
5207,WOW! Leftist Bully ROSIE O’DONNELL PUSHES Horrible Rumor On Social Media…Suggests Barron Trump Has Mental Disorder [VIDEO],https://twitter.com/Rosie/status/800939338615824384,1,WOW! Leftist Bully ROSIE O’DONNELL PUSHES Horrible Rumor On Social Media…Suggests Barron Trump Has Mental Disorder,,,[https://twitter.com/Rosie/status/800939338615824384],1,,0,,0,0
6308,SARA CARTER AND JAY SEKULOW With The Latest On Obama Spying On Trump: “I think this goes to the highest levels of the Obama administration” [Video],https://www.youtube.com/watch?v=DRLVvYzG46w,1,SARA CARTER AND JAY SEKULOW With The Latest On Obama Spying On Trump: “I think this goes to the highest levels of the Obama administration”,,,[https://www.youtube.com/watch?v=DRLVvYzG46w],1,,0,,0,0
6409,SARA CARTER WAS RIGHT ABOUT SPYING ON TRUMP! “This goes far beyond what is being reported” [VIDEO],https://www.youtube.com/watch?v=Ws5ojb0PCCo,1,SARA CARTER WAS RIGHT ABOUT SPYING ON TRUMP! “This goes far beyond what is being reported”,,,[https://www.youtube.com/watch?v=Ws5ojb0PCCo],1,,0,,0,0


In [211]:
# we will drop the rows less than 1
data.drop(data.loc[data['text_sent_count'] < 1].index, inplace=True)

In [212]:
# get mean sentence length
data.loc[:,'mean_sent_length'] = data.loc[:,'text_clean'].map(lambda txt: np.mean([len(sent) for sent in sent_tokenize(txt)]))

In [213]:
# look at rows with mean sentance length less than 20
# we will apply the language check and that needs a good amount of charaters
df = data.loc[data['mean_sent_length'] < 20,['text_clean','label']]
print(df)

                                                         text_clean  label
409                                                : Gateway Pundit      1
1843                                  Awkward! I am a progressive        1
2071   Notify the CDC. It's spreading.      BCP () October 14, 2016      1
3233                            Wow! The Dems are so out of touch        1
5158                                              Great interview!       1
5398                                                  RECENT POSTS       1
8024                                                  advertisement      0
8051                                                       Via: TMZ      1
8825   Les Deplorables Unite     ??VOTE TRUMP?? () November 9, 2016      1
10481                                                 I VE HAD IT!       1
10643                            Gary North has the video . 12:56        1
13982                                                Read more: TMZ      1
14528                    

In [214]:
# drop the low mean length articles
data.drop(data.loc[data['mean_sent_length'] < 20].index, inplace=True)

***
##Drop other languages
***

I want to keep only english rows

In [215]:
# runs for 16min
# this method will detect other languages using langdetect package
def detect_language(text):
    try:
        return detect(text)
    except:
        return 'could not detect language'
# we are running the detect_language methon on the 'text_clean' column
data.loc[:,'lang'] = data.loc[:,'text_clean'].apply(detect_language)

In [216]:
# the count of english slightly changes every time I run the above method but the count should be around #61608
print(data.loc[:,'lang'].value_counts())

lang
en       61610
ru         156
es         142
de          97
fr          31
ar          19
pt           7
tr           7
it           4
nl           4
no           3
hr           2
pl           2
el           2
vi           1
zh-cn        1
Name: count, dtype: int64


In [217]:
# show non english rows
df = data.loc[~data['lang'].eq('en')]
df.head()
#df.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/lang.csv', index=False)

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count,mean_sent_length,lang
14,Трамп разбушевался,"Происшествия \nЧем ближе выборы, тем сильнее нервничают кандидаты в президенты США. Особенно тяжко приходится Трампу — мало того, что он в рейтингах отстаёт от своей конкурентки, так ещё и владеть...",1,Трамп разбушевался,"Происшествия Чем ближе выборы, тем сильнее нервничают кандидаты в президенты США. Особенно тяжко приходится Трампу — мало того, что он в рейтингах отстаёт от своей конкурентки, так ещё и владеть ...",,,0,,0,,0,20,122.55,ru
35,Американо-корейские отношения и новый президент США | Новое восточное обозрение,"Регион: Восточная Азия Наблюдая реакцию корейской общественности на победу в президентских выборах Трампа, можно отметить «великую печаль» — политические круги Южной Кореи скорее болели за Клинтон...",1,Американо-корейские отношения и новый президент США | Новое восточное обозрение,"Регион: Восточная Азия Наблюдая реакцию корейской общественности на победу в президентских выборах Трампа, можно отметить «великую печаль» — политические круги Южной Кореи скорее болели за Клинтон...",,,0,,0,,0,45,156.733333,ru
140,Türkei: Kritischer Journalist interviewt Oppositionspolitiker in gemeinsamer Gefängniszelle,"Montag, 7. November 2016 Türkei: Kritischer Journalist interviewt Oppositionspolitiker in gemeinsamer Gefängniszelle Ankara (dpo) - Es ist ein starkes Stück Journalismus und zeigt, dass in der Tür...",1,Türkei: Kritischer Journalist interviewt Oppositionspolitiker in gemeinsamer Gefängniszelle,"Montag, 7. November 2016 Türkei: Kritischer Journalist interviewt Oppositionspolitiker in gemeinsamer Gefängniszelle Ankara (dpo) - Es ist ein starkes Stück Journalismus und zeigt, dass in der Tür...",,,0,,0,,0,12,153.083333,de
252,WORKING CLASS REVOLT! OLD SCHOOL JERSEY PATRIOTS Let The Liberals and Hollywood Have It! [VIDEO],HE WILL NOT DIVIDE US .,1,WORKING CLASS REVOLT! OLD SCHOOL JERSEY PATRIOTS Let The Liberals and Hollywood Have It!,HE WILL NOT DIVIDE US .,,,0,,0,,0,1,23.0,de
273,"""Реакция Венгрии на сюжет Киселева это не угроза России""","Мир » Европа » Восточная Европа Во вторник МИД Венгрии вызвал посла России в связи с ""пренебрежительными комментариями"" в российских государственных СМИ о событиях антисоветского восстания в Венгр...",1,"""Реакция Венгрии на сюжет Киселева это не угроза России""","Мир » Европа » Восточная Европа Во вторник МИД Венгрии вызвал посла России в связи с ""пренебрежительными комментариями"" в российских государственных СМИ о событиях антисоветского восстания в Венгр...",,,0,,0,,0,43,106.162791,ru


In [218]:
# keep english rows
data = data.loc[data['lang'] == 'en']

# drop lang column
data = data.drop(columns=['lang'])

***
##Tokenize title and text
***

**This is where we finally tokenize the clean title and text**

In [219]:
# add additional stop words that aren't in the nltk library list
add_stop_words = {'also'}
print(add_stop_words)

# load the nltk library stop word list
stop_words = set(stopwords.words('english'))

# combine the 2 lists - don't really need this now because I did not add a lot of additional stop words but leaving in as place holder
stop_words = stop_words.union(add_stop_words)

# display stop words
print(stop_words)
print('said' in stop_words)

{'also'}
{'such', 'my', 'yours', 'll', 'm', "couldn't", "mightn't", 'under', "aren't", 'yourselves', 'each', 'couldn', 'these', 'ours', 'who', 'we', "weren't", "needn't", 'ma', 'here', 's', 'further', "she's", 'again', 'hers', 'an', 'against', 'theirs', 'other', 'he', 'off', 'not', "don't", 'all', 'been', 'hadn', "mustn't", "wouldn't", 'when', 'after', 'through', 'do', 'if', 'as', "it's", 'aren', 'into', 'those', "haven't", 'needn', 'about', 'below', "wasn't", 'for', 'didn', 'the', 've', "didn't", 'too', 'are', "hadn't", 'wouldn', 'own', "you'll", 'this', 'by', "won't", 'that', 'will', 'or', 'its', 'y', 'any', 'i', 'both', 'himself', 'no', 'does', 'won', 'him', 'were', 'having', 'd', 'there', 'before', 'you', 'in', "shouldn't", 'it', 'our', 'while', 'herself', 'once', 'up', 'a', 'can', 'shan', 'until', 'she', "you'd", 'so', 'myself', 'wasn', 'weren', 'had', 't', 'shouldn', 'hasn', 'then', 'your', 'but', 'don', 'be', 'at', 'between', "hasn't", 'over', 'am', 'haven', 'of', 'nor', 'his', 

In [220]:
# clean digit only strigs and get rid of other characters that are not words and tokenize string
def clean_tokens(text):
  # text to lower
  text_clean = text.lower()

  # get rid of digit only text strings
  text_clean = re.sub(r"\d+", '', text_clean)

  # set non-words to ''
  text_clean = re.sub("\W", ' ', text_clean)

  # tokenize text
  tokens = word_tokenize(text_clean)

  # check to make sure length of token is greater than 1 and combine into a new text string
  # I want to combine into text string to be able to save to csv file and load later with minimal processing
  text_list = [word for word in tokens if word not in stop_words and len(word)>1]
  token_to_text = ' '.join(text_list)

  # return text string
  return token_to_text

In [221]:
# test clean_tokens method
text1 = data.loc[0,'text_clean']
print(text1)
print(clean_tokens(text1))
print('--------------')
title1 = data.loc[0,'title_clean']
print(title1)
print(clean_tokens(title1))

Salt Lake City (CNN) In a less volatile election cycle, the notion that Democrats would be on offense in red states like Utah, Arizona and Georgia would suggest the presidential race was effectively over.  No one is willing to make that kind of bet in a race that has defied all political norms. But as Donald Trump's downward spiral continues in round after round of battleground polls, and the Hillary Clinton campaign has begun to dabble in ruby-red states, Democrats are clearly feeling bullish. Some are now openly mulling the possibility of a Clinton blowout in November.  Even Trump acknowledged Thursday that his campaign was "having a tremendous problem in Utah," a reliably Republican state where Mitt Romney won more than 70% of the vote in 2012 and the hunger for another choice ushered independent candidate Evan McMullin, who has strong ties to Utah and the LDS community, into the presidential race this week.  There are far too many variables at play over the next three months for an

In [222]:
# tokenize the cleaned title and text columns
data.loc[:,'title_tokens_to_text'] = data.loc[:,'title_clean'].apply(clean_tokens)
data.loc[:,'text_tokens_to_text'] = data.loc[:,'text_clean'].apply(clean_tokens)

***
##Check recognized words and non-ascii characters
***

**I am checking to see if there are unrecognized words. While looking for unrecognized words I found more foreign words. Most of the foreign words are non-ascii so I removed those to create a recogized_words_to_text column. This is probably over kill, but I wanted to look at the percentage of non_recognized words to total_words.**

In [223]:
# load saved csv file of other recognized words that are not included in nltk or wordnet word or lemmas collections
# the saved addl word list was taken from a prior run resulting in a non recognized word and then running through MS Word to see if recognized by spell check and if it was not considered misspelled then I added to addl word list.
file = open('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/addl_words.txt', 'r', encoding='utf-8-sig')

# parse csv file with ', '
addl_words = file.read().split(', ')

# strip the extra ' characters
addl_words = [word.strip('\'') for word in addl_words]
add_words = set(addl_words)
file.close()

#print(add_words)
print(len(add_words)) # the count should be 49483

49483


In [224]:
# get NLTK word list
nltk.download('words')
word_list = set(words.words())

# get lemmas from wordnet
# get all synsets from wordnet.all_synsets() then get the lemma names from each synset
# this will help us expand the words list with different words and we just want to add the lemmas
# when we check the words we will take down to lemma form if there is not a direct match right away
wordnet_lemmas = set(lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas())
# combine lists
word_list_master = word_list.union(wordnet_lemmas)
word_list_master = word_list_master.union(add_words)
print(len(word_list_master)) # the count should be 376754

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


376754


In [225]:
# initiate a wordnet lemmatizer
wnl = WordNetLemmatizer()

# method to check if word is a recognized word
def check_words(tokens_to_text):
  # this is a part of speach dictionary to match coding between NLTK
  # maps part-of-speech tags (like 'NN' for noun, 'VB' for verb) to simplified tags ('n', 'v', 'a', 'r') used by WordNet
  pos_ref = {'NN': 'n', 'NNP': 'n', 'NNPS': 'n', 'NNS': 'n', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'RB': 'r', 'RBR': 'r', 'RBS': 'r', 'RP': 'r', 'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}
  token_lst = tokens_to_text.split()
  # list of tokens with pos tags using nltk.pos_tag method
  lst_pos_tags = nltk.pos_tag(token_lst)

  # set of non words or mispelled words
  not_in_words = set()
  # set of non ascii charater strings because I could not get rid of all foreign words about and want to see where they are at
  not_ascii = set()
  # add recognized words to list
  recog_words = []

  # loop through tokens
  for token in lst_pos_tags:
    # if word in master word list add to recog_words
    if token[0] in word_list_master:
      recog_words.append(token[0])
      pass
    # if word is not recognized right away check is lemma and see if it is recognized
    elif wnl.lemmatize(token[0]) in word_list_master:
      recog_words.append(token[0])
      pass
    # otherwise look at token and pos tag
    else:
      word = token[0]
      # get wordnet pos from NLTK pos tag
      pos = pos_ref.get(token[1])
      # is pos tag is one from the dictionary above then check lemma using lematize with the pos information
      if pos is not None:
        lemma = wnl.lemmatize(word, pos)
        if lemma in word_list_master:
          recog_words.append(token[0])
          pass
        # if the token is still not recognized then check to see if it is ascii
        else:
          if token[0].isascii():
            not_in_words.add(token[0])
          else:
            not_ascii.add(token[0])
            #print(token[0]) #printing all non-ascii tokens
      # for tokens not labeled with pos that is in above pos_ref dictionary then check to see if the token is isascii
      else:
        if token[0].isascii():
          not_in_words.add(token[0])
        else:
          not_ascii.add(token[0])
          #print(token[0])
  recog_words_to_text = ' '.join(recog_words)
  # return recog_words_to_text and the not_in_words set and not_ascii set as columns
  return pd.Series({'recog_words_to_text': recog_words_to_text, 'text_not_words': not_in_words, 'not_ascii': not_ascii})

In [226]:
# test check_words methon
text1 = data.loc[0,'text_tokens_to_text']
print(text1)

good, bad, not_ascii = check_words(text1)
print(good)
print(bad)
print(not_ascii)

salt lake city cnn less volatile election cycle notion democrats would offense red states like utah arizona georgia would suggest presidential race effectively one willing make kind bet race defied political norms donald trump downward spiral continues round round battleground polls hillary clinton campaign begun dabble ruby red states democrats clearly feeling bullish openly mulling possibility clinton blowout november even trump acknowledged thursday campaign tremendous problem utah reliably republican state mitt romney vote hunger another choice ushered independent candidate evan mcmullin strong ties utah lds community presidential race week far many variables play next three months anyone say certainty race end two major candidates intensely disliked electorate week clinton shadowed controversy emails ties clinton foundation secretary state trump contender shown extraordinary level resilience overcoming controversy mitch stewart obama campaign battleground states director said clin

In [227]:
# run check_words method and get new columns
data[['recog_words_to_text', 'text_not_words', 'not_ascii']] = data['text_tokens_to_text'].apply(lambda x: pd.Series(check_words(x)))
# run for approx 20min

In [228]:
# count the number of non_ascii tokens
data.loc[:,'non_ascii_count'] = data.loc[:,'not_ascii'].map(lambda lst: len(lst))

In [229]:
# see which label has the most rows with high _non_ascii_counts
# this is most likely because they have text that is a mix of english and foreign words
df_no_ascii_high = data[data['non_ascii_count'] > 5]
print(df_no_ascii_high['label'].value_counts())

label
1    42
0    19
Name: count, dtype: int64


In [230]:
# calculate non_regog_word count and the percentage that is not recognized
data.loc[:,'non_recog_word_count'] = data.loc[:,'text_not_words'].map(lambda lst: len(lst))
data.loc[:,'total_word_count'] = data.loc[:,'text_tokens_to_text'].map(lambda lst: len(lst))
data['non_word_percent'] = (data['non_recog_word_count'] + data['non_ascii_count']) / data['total_word_count'] * 100

**I did expect the real articles to have a lower non-word percentage**

In [231]:
# compare non_word_percent by label
print(data.loc[data['label'] == 0, 'non_word_percent'].mean())
print(data.loc[data['label'] == 1, 'non_word_percent'].mean())

0.13381513238181678
0.20018089393796737


***
##Remove non-ascii tokens out of text tokens
***

In [232]:
# method to clean out the non_ascii characters 'text_token_to_text' column
def clean_ascii(text_tokens, not_ascii):
  words = text_tokens.split()
  text_ascii = [word for word in words if word not in not_ascii]
  text_clean_ascii = ' '.join(text_ascii)
  return text_clean_ascii

In [233]:
# test above method
text_clean = data.loc[7666,'text_tokens_to_text']
not_ascii = data.loc[7666,'not_ascii']
print(clean_ascii(text_clean, not_ascii))

president donald trump wednesday said would urge fellow republicans senate invoke rule change force simple majority vote supreme court nominee democrats block choice democrats questioning trump choice federal appeals court judge neil gorsuch day president announced trump said would want congressional gridlock interfere gorsuch spoke meeting interest groups support choice gorsuch confirmed restore conservative majority supreme court trump asked whether would urge senate majority leader mitch mcconnell use called nuclear option change rules make easier confirm yes end gridlock washington longer eight years fairness president barack obama lot longer eight years trump said end gridlock would say mitch go nuclear trump said said would absolute shame man quality blocked democrats avoid eventuality would say mitch would say go


In [234]:
# create a new column of clean tokens that do not have non-ascii tokens
data.loc[:,'text_clean_ascii'] = data.loc[:,['text_tokens_to_text', 'not_ascii']].apply(lambda row: clean_ascii(row['text_tokens_to_text'], row['not_ascii']), axis=1)

***
##Title and Text Similarity with word2vec
***

**I thought it would be helpful to look at the similarity between title and text.**

In [236]:
# check to see rows with 'title_tokens_to_text' values blank
# I need to keep the rows
df = data.loc[data['title_tokens_to_text'].str.len()<1,['title_clean','text_clean','title_tokens_to_text','text_tokens_to_text','label']]
print(df)

                 title_clean  \
6315          Won, Now What?   
7058                       :   
8470             If It’s She   
13307  It’s Not You, It’s Me   
16595              What If….   
27246              What Now?   

                                                                                                                                                                                                    text_clean  \
6315   18 Shares 17 0 0 1 (Donald Trump speaking at the 2013 Conservative Political Action Conference (CPAC) in National Harbor, Maryland. Credit: Gage Skidmore / flickr) Trump's victory was a shock to m...   
7058    We the People Against Tyranny: Seven Principles for Free Government By John W. Whitehead As I look at America today, I am not afraid to say that I am afraid.Former presidential advisor Bertr...   
8470   Breitbart October 27, 2016  U.S. Republican presidential nominee Donald Trump said on Tuesday that Democrat Hillary Clinton’s plan for Syr

In [237]:
# we will drop the rows less than 1
data.drop(data.loc[data['title_tokens_to_text'].str.len() < 1].index, inplace=True)

In [238]:
# list of tokens from text and title
# this is to establish a vocabulary
tokens_collection_title = [token_lst.split() for token_lst in data.loc[:,'title_tokens_to_text']]
tokens_collection_text = [token_lst.split() for token_lst in data.loc[:,'text_tokens_to_text']]
tokens_collection = tokens_collection_title + tokens_collection_text
print(tokens_collection[0:10])

[['democrats', 'see', 'chance', 'reshape', 'map', 'trump', 'stumbles'], ['spanish', 'government', 'suspend', 'direct', 'rule', 'catalonia', 'regional', 'election', 'called', 'media'], ['cory', 'booker', 'scorches', 'trump', 'new', 'muslim', 'ban', 'blistering', 'tweetstorm'], ['heart', 'rending', 'testimony', 'dylann', 'roof', 'trial', 'opens'], ['trump', 'supporter', 'phoenix', 'threatened', 'john', 'mccain', 'life'], ['germany', 'keen', 'avoid', 'new', 'ice', 'age', 'ties', 'russia', 'west'], ['life', 'lights'], ['fidel', 'castro', 'sister', 'outspoken', 'critic', 'takes', 'joy', 'death'], ['four', 'killed', 'injured', 'jerusalem', 'truck', 'ramming', 'terror', 'attack'], ['angry', 'patriot', 'assaulted', 'black', 'woman', 'trump', 'rally', 'finds', 'ruined', 'life', 'images']]


In [239]:
# make a trained word2vec model from the title and text tokens
# vector_size is the dimensionality of the word - the higher dimension can capture more complex relationships
# window determines how many words back and forward to look around the word
word2vec_model = Word2Vec(tokens_collection, vector_size=300, window=5, min_count=2, workers=4)

In [240]:
def title_text_similarity(title, text, model=word2vec_model):
  title_tokens = title.split()
  text_tokens = text.split()

  #the title_vec and text_vec need to be at same size as vector_size in the word2vec_model
  title_vec = np.zeros(300)
  text_vec = np.zeros(300)

  # loop through title tokens
  for token in title_tokens:
    # if token in word2vec model add to title_vec
    if token in model.wv:
      title_vec = np.add(title_vec, model.wv[token])

  # loop through text tokens
  for token in text_tokens:
    # if token in word2vec model add to title_vec
    if token in model.wv:
      text_vec = np.add(text_vec, model.wv[token])


  # if either title_vec or text_vec is a zero vector return 0
  if np.linalg.norm(title_vec) == 0 or np.linalg.norm(text_vec) == 0:
      return 0  # or any other default value you prefer
  else:
      # similarity calculation = (dot product of title_vec and text_vec) / (magnitude of title_vec * magnitude of text_vec)
      # cosine similarity is calculated by dividing the dot product of two vectors by the product of their magnitudes
      return round(np.dot(title_vec, text_vec) / (np.linalg.norm(title_vec) * np.linalg.norm(text_vec)), 4)

In [241]:
# test title_text_similarity method
print(data.loc[1,'title_tokens_to_text'])
print(data.loc[1,'text_tokens_to_text'])

print(title_text_similarity(data.loc[1,'title_tokens_to_text'], data.loc[1,'text_tokens_to_text']))

spanish government suspend direct rule catalonia regional election called media
spain government ready suspend application direct rule catalonia catalan head carles puigdemont calls snap regional election la vanguardia newspaper reported thursday citing sources ruling people party puigdemont set call election according political allies move could help break one month deadlock madrid government separatists seeking split spain
0.8214


In [242]:
# calculate title_text_similarity on all rows
data.loc[:,'title_text_similarity'] = data.loc[:,['title_tokens_to_text', 'text_tokens_to_text']].apply(lambda row: title_text_similarity(row['title_tokens_to_text'], row['text_tokens_to_text']), axis=1)

**I did expect the real articles to have a higher title to text similarity score**

In [243]:
# check how title_text_similarity compare based on label
print(data.loc[data['label']==0, 'title_text_similarity'].mean())
print(data.loc[data['label']==1, 'title_text_similarity'].mean())

0.6767813121100069
0.6426477916110863


***
##Save data
***

In [123]:
data.head()

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,...,text_tokens_to_text,recog_words_to_text,text_not_words,not_ascii,non_ascii_count,non_recog_word_count,total_word_count,non_word_percent,text_clean_ascii,title_text_similarity
0,"Oh, What a Lovely War!","Written by Philip Giraldi Tuesday November 8, 2016 The American people don’t know very much about war even if Washington has been fighting on multiple fronts since 9/11. The continental United Sta...",1,"Oh, What a Lovely War!","Written by Philip Giraldi Tuesday November 8, 2016 The American people don’t know very much about war even if Washington has been fighting on multiple fronts since 9/11. The continental United Sta...",,,0,,0,...,written philip giraldi tuesday november american people know much war even washington fighting multiple fronts since continental united states experienced presence hostile military force years war...,written philip giraldi tuesday november american people know much war even washington fighting multiple fronts since continental united states experienced presence hostile military force years war...,"{ed, op, madeleine, stavridis}",{},0,4,8976,0.044563,written philip giraldi tuesday november american people know much war even washington fighting multiple fronts since continental united states experienced presence hostile military force years war...,0.2733
1,Deported Italian Mobster Caught Sneaking Across U.S.-Mexico Border,A previously deported Italian mobster has been arrested trying to illegally enter the U. S. by sneaking across the porous border with Mexico. The man had originally been deported after serving tim...,0,Deported Italian Mobster Caught Sneaking Across U.S.-Mexico Border,A previously deported Italian mobster has been arrested trying to illegally enter the U. S. by sneaking across the porous border with Mexico. The man had originally been deported after serving tim...,,,0,,0,...,previously deported italian mobster arrested trying illegally enter sneaking across porous border mexico man originally deported serving time federal prison connection drug trafficking violent ass...,previously deported italian mobster arrested trying illegally enter sneaking across porous border mexico man originally deported serving time federal prison connection drug trafficking violent ass...,{marciante},{},0,1,1665,0.06006,previously deported italian mobster arrested trying illegally enter sneaking across porous border mexico man originally deported serving time federal prison connection drug trafficking violent ass...,0.454
2,China lodges protest after Trump call with Taiwan president,"BEIJING/WASHINGTON (Reuters) - China lodged a diplomatic protest on Saturday after U.S. President-elect Donald Trump spoke by phone with President Tsai Ing-wen of Taiwan, but blamed the self-ruled...",0,China lodges protest after Trump call with Taiwan president,"China lodged a diplomatic protest on Saturday after U.S. President-elect Donald Trump spoke by phone with President Tsai Ing-wen of Taiwan, but blamed the self-ruled island Beijing claims as its o...",,,0,,0,...,china lodged diplomatic protest saturday president elect donald trump spoke phone president tsai ing wen taiwan blamed self ruled island beijing claims petty move minute telephone call taiwan lead...,china lodged diplomatic protest saturday president elect donald trump spoke phone president tsai ing wen taiwan blamed self ruled island beijing claims petty move minute telephone call taiwan lead...,{cnn},{},0,1,4195,0.023838,china lodged diplomatic protest saturday president elect donald trump spoke phone president tsai ing wen taiwan blamed self ruled island beijing claims petty move minute telephone call taiwan lead...,0.8456
3,The US May Soon Face an Apocalyptic Seismic Event,"Today, an ever increasing number of earthquakes in the United States may soon bring the country to ruin, as geologists, journalists and politicians say. \nVia UsualRoutine \n\nThe University of Wa...",1,The US May Soon Face an Apocalyptic Seismic Event,"Today, an ever increasing number of earthquakes in the United States may soon bring the country to ruin, as geologists, journalists and politicians say. Via UsualRoutine The University of Washi...",,,0,,0,...,today ever increasing number earthquakes united states may soon bring country ruin geologists journalists politicians say via usualroutine university washington already presented seismological cha...,today ever increasing number earthquakes united states may soon bring country ruin geologists journalists politicians say via university washington already presented seismological charts showing g...,"{andtrump, voc, usgs, doonly, undermines, dnc, usualroutine, abc, biryol, rodinia}",{},0,10,3920,0.255102,today ever increasing number earthquakes united states may soon bring country ruin geologists journalists politicians say via usualroutine university washington already presented seismological cha...,0.517
4,JOY BEHAR Still Claims Clinton Won…BUT Wore Bizarre Mourning Item When Hillary Lost [Video],The View s Joy Behar just let everyone know that she s a sore loser who just can t let go that Hillary Clinton lost the 2016 election. Even though she refuses to acknowledge Trump as president and...,1,JOY BEHAR Still Claims Clinton Won…BUT Wore Bizarre Mourning Item When Hillary Lost,The View s Joy Behar just let everyone know that she s a sore loser who just can t let go that Hillary Clinton lost the 2016 election. Even though she refuses to acknowledge Trump as president and...,,,0,,0,...,view joy behar let everyone know sore loser let go hillary clinton lost election even though refuses acknowledge trump president says hillary told view audience wore black veil clinton lost indica...,view joy behar let everyone know sore loser let go hillary clinton lost election even though refuses acknowledge trump president says hillary told view audience wore black veil clinton lost indica...,"{sara, mccain}",{},0,2,1783,0.11217,view joy behar let everyone know sore loser let go hillary clinton lost election even though refuses acknowledge trump president says hillary told view audience wore black veil clinton lost indica...,0.668


In [244]:
# save to csv
data.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/DataClean.csv', index=False)

In [245]:
# remaining label counts
print(data['label'].value_counts())
# label counts should be the following, but can be slightly different because of the variability in how langdetect works
#0    34615
#1    26983

label
0    34616
1    26988
Name: count, dtype: int64
