<a href="https://colab.research.google.com/github/desireedisco/MSDS-Machine-Learning-Supervised/blob/main/1_Data_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
#Notebook for Data Preprocessing
---


This notebook is for data preprocessing. We will do the following:
* Load file, drop index, drop na, drop duplicates
* Separate out the email links, web links, hashtag, mentions.
* Clean the data leakage problem in the text column. There are also some minor problems in the title column that will be fixed.
* Count the number of sentences and calculate the mean sentence length.
* Drop the foreign language rows as determined by langdetect
* Check to see of the remaining tokens which are recognized as words and calculate the non-recognized word percentage
* Calculate the title and text similarity


In [174]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [175]:
#install langdetect for when we separate out foreign language rows
!pip install langdetect



In [176]:
import pandas as pd
import numpy as np
import re
import nltk

from langdetect import detect
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from gensim.models import Word2Vec

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

***
##Load file, drop index, drop na, drop duplicates
***

In [246]:
# read csv file
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/WELFake_Dataset.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO],No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others o...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MOST CHARLOTTE RIOTERS WERE “PEACEFUL” PROTESTERS…In Her Home State Of North Carolina [VIDEO],"Now, most of the demonstrators gathered last night were exercising their constitutional and protected right to peaceful protest in order to raise issues and create change. Loretta Lynch aka Er...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid",A dozen politically active pastors came here for a private dinner Friday night to hear a conversion story unique in the context of presidential politics: how Louisiana Gov. Bobby Jindal traveled f...,0
4,4,SATAN 2: Russia unvelis an image of its terrifying new ‘SUPERNUKE’ – Western world takes notice,"The RS-28 Sarmat missile, dubbed Satan 2, will replace the SS-18 Flies at 4.3 miles (7km) per sec and with a range of 6,213 miles (10,000km) The weapons are perceived as part of an increasingly ag...",1


In [247]:
# show dataframe info
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB
None


**Note on matching Real and Fake labels:**
In the dataset description the authors state that there are 72,134 news articles with 35,028 real and 37,106 fake news articles. The authors then go on to state the labels are labeled as follows: 0=fake and 1=real. The two statements are contradictory based on the following label counts. Upon further inspection of the data and following the authors first statement of '35,028 real and 37,106 fake news articles', I am following the mapping for the labels as 0=real and 1=fake.

In [248]:
# show the number of rows labeled 1 and number of rows labeled 0
print(data['label'].value_counts())

# show unique count for title and text
print(data.describe(include=['object']))

label
1    37106
0    35028
Name: count, dtype: int64
                                                       title   text
count                                                  71576  72095
unique                                                 62347  62718
top     Factbox: Trump fills top jobs for his administration       
freq                                                      14    738


In [249]:
# drop dataset index column
data = data.drop(columns=['Unnamed: 0'])

**Check to see how many rows are null and what the percentage of total data is before dropping rows.**

In [250]:
# show the number of rows that are null
for col in data.columns:
    print(f'{data[col].isnull().sum()} rows are null in ' + col)

# show what percentage of total is null for each column
for col in data.columns:
    print(col, f'{round(data[col].isnull().sum() / data.shape[0] * 100, 2)} % is null')

558 rows are null in title
39 rows are null in text
0 rows are null in label
title 0.77 % is null
text 0.05 % is null
label 0.0 % is null


In [251]:
#drop all null values
data = data.dropna().reset_index(drop=True)

**Split the data by label and check and drop duplicate 'text' columns. Then combine to check for duplicates**

I did this because I did not want to drop duplicates that had multiple labels attached. If there were any duplicate articles that had different labels then I wanted to know about that. I this dataset the only duplicate with multiple labels was a blank text row.

In [252]:
#split the data by label to look at unique count in case there are duplicates label both real and fake
real_data = data[data['label'] == 0]
fake_data = data[data['label'] == 1]
print(real_data.describe(include=['object']))
print(fake_data.describe(include=['object']))

                                                       title  \
count                                                  35028   
unique                                                 34409   
top     Factbox: Trump fills top jobs for his administration   
freq                                                      14   

                                                                                                                         text  
count                                                                                                                   35028  
unique                                                                                                                  34621  
top     Killing Obama administration rules, dismantling Obamacare and pushing through tax reform are on the early to-do list.  
freq                                                                                                                       58  
                                       

In [253]:
#drop duplicate text stories
real_data = real_data.drop_duplicates(subset=['text']).reset_index(drop=True)
fake_data = fake_data.drop_duplicates(subset=['text']).reset_index(drop=True)

# combine the separate labels to one to check for duplicates
data = pd.concat([real_data, fake_data], ignore_index=True)
print(data.describe(include=['object']))

                                                       title   text
count                                                  62201  62201
unique                                                 61400  62200
top     Factbox: Trump fills top jobs for his administration       
freq                                                      14      2


**As you can see the duplicate with multiple labels had blank text column.**

In [254]:
# display duplicate rows that have 2 different labels
print(data[data['text'].duplicated(keep=False)])

                                                                                 title  \
920                                                     Graphic: Supreme Court roundup   
34626  HOUSE INTEL CHAIR On Trump-Russia Fake Story: “No evidence of anything” [Video]   

      text  label  
920             0  
34626           1  


In [255]:
# they have blank text so drop rows
data = data.drop_duplicates(subset=['text'], keep=False).reset_index(drop=True)

# check total counts and unique counts
print(data.describe(include=['object']))

# display value counts for labels
data['label'].value_counts()

#value counts should be
# 0 - 34620
# 1 - 27579

                                                       title  \
count                                                  62199   
unique                                                 61398   
top     Factbox: Trump fills top jobs for his administration   
freq                                                      14   

                                                                                                                                                                                                           text  
count                                                                                                                                                                                                     62199  
unique                                                                                                                                                                                                    62199  
top     A dozen politically active pastors came h

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,34620
1,27579


**Shuffle and reindex dataset**

In [256]:
# shuffle dataframe and reset index
data = data.sample(frac=1).reset_index(drop=True)

***
##Clean Title - Source Data Leakage
***

In the data_leakage notebook we demonstrated that some source info that was embedded in the 'text' column influenced the model. Because not all articles have source info, it was unduly providing bias in the model. I looked through the 'title' column and decided there was similar data that could influence the model: therefore, I decided to remove that as well. I just want to focus on the actual title and actual text, not the sources of the articles. We could use the sources as a feature if all the articles have source information. In the current dataset, only the real news includes sources so it needs to be removed.

In [257]:
def clean_title_leakage(title):
  # making sure title is a string
  title = str(title)

  # get rid of New York Times and Breitbart reference
  pattern_leakage = r" - The New York Times$| - Breitbart$"
  # find all occurrences not just the first one
  match_lst = re.findall(pattern_leakage, title)
  # if match then substitue for ''
  if match_lst:
    title = re.sub(pattern_leakage, '', title)
    #print(match_lst)

  return title

In [258]:
# create a new column for the clean title and apply the clean_title_leakage method
data['title_clean'] = data['title'].apply(clean_title_leakage)

In [259]:
# just as well as the real news provides source information the fake news refers to a source of video so I decide to clean this as well
def clean_title_video(title):

  # get rid of Video reference - easy match pattern
  pattern_video = r"\[VIDEO\]|\(VIDEO\)"

  # find all matches and ignore case
  match_lst = re.findall(pattern_video, title, re.IGNORECASE)

  if match_lst:
    title = re.sub(pattern_video, '', title, flags=re.IGNORECASE)
    #print(match_lst)

  # search for more match patterns to get rid of all references to (Video)
  pattern_video_re = r"\[video[a-z 0-9/,+-].*\]|\[[a-z 0-9/,+-].*video\]|\(video[a-z 0-9/,+-].*\)|\([a-z 0-9/,+-].*video\)|\([a-z 0-9/,+-].*videos\)"
  match_lst_re = re.findall(pattern_video_re, title, re.IGNORECASE)
  if match_lst_re:
    title = re.sub(pattern_video_re, '', title, flags=re.IGNORECASE)
    #print(match_lst_re)

  return title

In [260]:
# cleaning the 'title_clean column of the (Video) references which is skewed to fake news
data['title_clean'] = data['title_clean'].apply(clean_title_video)

In [261]:
# increase column width to display wider columns
pd.set_option('display.max_colwidth', 200)

In [262]:
# check to see changes have been made
df_show = data[['title','title_clean']]
print(df_show.head)

<bound method NDFrame.head of                                                                                                               title  \
0                                      Prosecutors argue against prison time for New Jersey 'Bridgegate' mastermind   
1                                Is Obama preparing a parting shot on Israel? This President must not bind the next   
2              HOLLYWOOD RICH AND FAMOUS Shafted By “Sick” Hillary…Left With Replacement For Mega-Bucks Fundraisers   
3                                                         Hillary Wants Aggressively Interventionist Foreign Policy   
4                                   Donald Trump Says Drugs Are ‘Big Factor’ in Urban Violence - The New York Times   
...                                                                                                             ...   
62194                                                   Trump draws even with Clinton in national White House poll    
62195             

***
##Clean Text
***

In [263]:
# clean out newline reference
data['text_clean'] = data['text'].str.replace('\n', ' ')

**Extract email links**

In [264]:
# method to extract email links from text and put in another data frame column
def extract_email_links(text):
  emails = ''

  # get rid of web links - match for web links
  pattern_email = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
  match_lst = re.findall(pattern_email, text)
  # if match we will substitute email for '' in text_clean and add to emails list
  if match_lst:
    text = re.sub(pattern_email, '', text)
    #print(match_lst)
    emails = match_lst
  # return text_clean and emails list
  return pd.Series({'text_clean': text, 'email': emails})

In [265]:
# get text_clean striped of emails and put emails in separate column
data[['text_clean', 'email']] = data['text_clean'].apply(lambda x: pd.Series(extract_email_links(x)))

In [266]:
# show changes
data.loc[data['email'].str.len()>0,['text_clean', 'email']].head()

Unnamed: 0,text_clean,email
3,10-27-1 6 The first Bill and Hillary Clinton co-presidency included eight years of Balkan and other wars of aggression. Bush/Cheney exceeded their lawlessness. Obama outdid the worst of both previ...,[lendmanstephen@sbcglobal.net]
77,**Want FOX News First in your inbox every day? Sign up here.** Buzz Cut: • Panicky Hillary starts shouting • NYT dumps on Heidi Cruz • Rubio touts pistol purchase • GOP Power Index: Chris...,[FOXNEWSFIRST@FOXNEWS.COM]
423,"It looked like the fix was in. But the ATF says it was just a misfire. As the ATF faces a firestorm of controversy for seeking public comment on a proposal to ban a popular type of bullet, critic...","[APAComments@atf.gov, maxim.lott@foxnews.com]"
447,"21st Century Wire says This week s documentary film curated by our editorial team at 21WIRE. By August 1945, the Allied Manhattan Project had successfully detonated an atomic device in the New Mex...",[MEMBER@21WIRE.TV]
454,"link originally posted by: Kashai Hyperloop one is a about a transportation system that is really fast and technologically feasible to the extent that in relation to investors, easily Is on the fa...","[hyperloop@spacex.com, hyperloop@teslamotors.com]"


**Extract web links**

In [267]:
# method to extract web links and put in another dataframe column
def extract_web_links(text):
  links = ''

  #get rid of web links - we try to match a number of different patterns to try to capture all the weblinks in the text column
  pattern_html = r"https?://(?:www\.)?[a-zA-Z0-9./?=&_-]+|pic.twitter.com[a-zA-Z0-9./]+|[a-zA-Z0-9./]+(?:\.com|\.org)/[a-zA-Z0-9/\-.]+|[a-zA-Z0-9./]+(?:\.com|\.org)"
  match_lst = re.findall(pattern_html, text)
  # if match we will substitute web link for '' in text_clean and add to web link list
  if match_lst:
    text = re.sub(pattern_html, '', text)
    #print(match_lst)
    links = match_lst

  # return text_clean and web link list
  return pd.Series({'text_clean': text, 'links': links})

In [268]:
# get text_clean striped of web links and put web links in separate column
data[['text_clean', 'links']] = data['text_clean'].apply(lambda x: pd.Series(extract_web_links(x)))
# record link count in separate column
data['link_count'] = data['links'].apply(lambda x: len(x))

In [269]:
# show changes
data.loc[data['links'].str.len()>0,['text_clean', 'links']].head()

Unnamed: 0,text_clean,links
3,10-27-1 6 The first Bill and Hillary Clinton co-presidency included eight years of Balkan and other wars of aggression. Bush/Cheney exceeded their lawlessness. Obama outdid the worst of both previ...,"[http://www.claritypress.com/LendmanIII.html, sjlendman.blogspot.com, Rense.com, Rense.com]"
8,Print Islam and sex slavery are like peanut butter and jelly - you always find the one next to the other. Muslims kidnapping vulnerable white girls in the UK and forcing them to be sex slaves has...,[Shoebat.com]
11,"Try as they may, Republicans have just not figured out how to connect with younger generations. They ve locked down a good portion of older white America, but it seems millennials just really don ...","[https://t.co/5Tv5hb0GjP, pic.twitter.com/xc4XF3fmuR, https://t.co/N1zddPOeZR, https://t.co/N1zddPOeZR, https://t.co/N1zddPOeZR]"
13,ST. LOUIS Former St. Louis police Officer Jason Stockley was found not guilty Friday of murdering a man while on duty.St. Louis Circuit Judge Timothy Wilson s highly anticipated verdict found th...,"[https://t.co/eLmJAKkagx, pic.twitter.com/Kvd7aThPZe, https://t.co/ucUlKuWorO, pic.twitter.com/4HjKvPtIwJ]"
16,"The atmosphere created by conservatives and their anti-transgender witch hunt is becoming more and more apparent by the day. The trend started by North Carolina, with Texas in the process of follo...",[https://www.facebook.com/never.shout.aimee/videos/vb.1473214698/10206543775341127/?type=3&theaterFeatured]


**Extract mentions**

In [270]:
# method to extract mentions and put in another dataframe column
def clean_mentions(text):
  mentions = ''

  # get rid of mentions in text and keep mentions in a separate column
  pattern_mentions = r"@\w+"
  match_lst = re.findall(pattern_mentions, text)
  # if match we will substitute mentions for '' in text_clean and add to mention list
  if match_lst:
    text = re.sub(pattern_mentions, '', text)
    #print(match_lst)
    mentions = match_lst

  # return text_clean and mention list
  return pd.Series({'text_clean': text, 'mentions': mentions})

In [271]:
# get text_clean striped of mentions and put mentions in separate column
data[['text_clean', 'mentions']] = data['text_clean'].apply(lambda x: pd.Series(clean_mentions(x)))
# new column with mention count
data['mentions_count'] = data['mentions'].apply(lambda x: len(x))
# turn mention list into string so we can vectorize latter on if we want
data['mentions'] = data['mentions'].apply(lambda mention_lst: ' '.join([str(mention) for mention in mention_lst]))

In [272]:
# show changes
data.loc[data['mentions'].str.len()>0,['text_clean', 'mentions']].head()

Unnamed: 0,text_clean,mentions
11,"Try as they may, Republicans have just not figured out how to connect with younger generations. They ve locked down a good portion of older white America, but it seems millennials just really don ...",@mcbc @louisvirtel @GOP @louisvirtel @GOP @louisvirtel
12,Former House Speaker Newt Gingrich said Sunday in an interview on New York AM 970 radio show “The Cats Roundtable” that it was very clear that former FBI Director James Comey hated President Donal...,@MagnifiTrent
13,ST. LOUIS Former St. Louis police Officer Jason Stockley was found not guilty Friday of murdering a man while on duty.St. Louis Circuit Judge Timothy Wilson s highly anticipated verdict found th...,@SLMPD @KMOV @SLMPD @EricGreitens @tariqnasheed
21,21st Century Wire says .From May Day riots to designer technocracy (Photo Illustration 21WIRE s Shawn Helton)21WIRE s Shawn Helton joins well-known Talk Radio host Jack Blood on The Jack Blood Sh...,@21WIRE
23,"On Friday’s broadcast of the Fox News Channel’s “America’s Newsroom,” House Freedom Caucus Member Representative Ron DeSantis ( ) stated that there’s “definitely a path” to an Obamacare bill, and ...",@IanHanchett


**Extract hashtags**

In [273]:
# method to extract hashtags and put in another dataframe column
def clean_hashtags(text):
  hashtags = ''
  hashtag_count = 0

  # get rid of hashtags in text and keep mentions in a separate column
  pattern_hashtags = r"#\w+"
  match_lst = re.findall(pattern_hashtags, text)

  # if match we will substitute mentions for '' in text_clean and add to mention list
  if match_lst:
    text = re.sub(pattern_hashtags, '', text)

    hashtags = match_lst
    hash_clean = ''

    # we want to search through hashtag list for #1,#2, or other strings with just numbers after # because there is a lot of them and they are not hashtags
    for index, hashtag in enumerate(hashtags):
      # if match is a number string do nothing otherwise add to hashtag list
      if re.search(r"#\d+$", hashtag):
        pass
      else:
        hash_clean = hash_clean + ' ' + hashtag
    # strip left space
    hashtags = hash_clean.lstrip()
    #print(hashtags)

    # count the number of hashtags in list
    hashtag_count = len(hashtags.split(' '))

  # return text_clean and hashtag list in string form and return hashtag count
  return pd.Series({'text_clean': text, 'hashtags': hashtags, 'hashtag_count': hashtag_count})

In [274]:
# process text to return 'text_clean', hashtag list, and hashtag count
data[['text_clean', 'hashtags', 'hashtag_count']] = data['text_clean'].apply(lambda x: pd.Series(clean_hashtags(x)))

In [275]:
# show changes
data.loc[data['hashtags'].str.len()>0,['text_clean', 'hashtags']].head()

Unnamed: 0,text_clean,hashtags
11,"Try as they may, Republicans have just not figured out how to connect with younger generations. They ve locked down a good portion of older white America, but it seems millennials just really don ...",#SOTU #SNAPoftheUnion #SnapOfTheUnion
13,ST. LOUIS Former St. Louis police Officer Jason Stockley was found not guilty Friday of murdering a man while on duty.St. Louis Circuit Judge Timothy Wilson s highly anticipated verdict found th...,#STLVerdict #DowntownSTL #JasonStockley #STLVerdict #STL #JasonStockley
54,"After much back and forth between the white dude wearing a Black Lives Matter t-shirt and the black man who appears to be acting as security for corrupt Hillary, the black man suggests he go to ...",#MSNBCTownhall
136,"The Super Bowl had not yet begun and Trump fans were already throwing a tantrum not over something Lady Gaga did, but because of what is normally one of their favorite songs: America the Beautif...",#AmericatheBeautiful #SuperBowl #sisterhood #ItAintBrokeDontFixIt #SB51
144,"Tonight s the night of the Golden Globe Awards, and comedian Jimmy Fallon was tapped to emcee the event. Before he even started his opening monologue, his teleprompter failed, so he ad libbed the ...",#GoldenGlobes


**Clean leakage issue**

See notebook 2_LeakageTest for further info

In [276]:
# clean the references to (Reuters) from the text column
def clean_reuters_leakage(text):

  # get rid of reference to (Reuters) in text
  # some articles start out with the is a correction not followed by the (Reuters) reference and we are matching the whole string and substituting out
  pattern_reuters_1 = r"^[“”‘’A-Za-z.,&$()/\-:;0-9  ]*(\(Reuters\) - |\(Reuters\)\) - |\(Reuters\)  —  )"
  # we are matching first occurence so we are using match and not findall
  match_str_1 = re.match(pattern_reuters_1, text)
  if match_str_1:
    text = re.sub(pattern_reuters_1, '', text)
    #print(match_str_1.group())

  return text

In [277]:
# clean the (Reuters) reference out
data['text_clean'] = data['text_clean'].apply(clean_reuters_leakage)

***
##Sentence count and mean sentence length.
***

I want to drop the blank text_clean rows and make sure I have a decent mean sentence length for the langdetect package to work.

In [278]:
# using NLTK sentence tokenizer to count the number of sentences in each text passage in the 'text_clean' column
data.loc[:,'text_sent_count'] = data.loc[:,'text_clean'].map(lambda txt: len(sent_tokenize(txt)))

In [279]:
# check and see if there are any less than 1
# in the result dataframe the text column is just links with the text_clean column blank
data[data['text_sent_count'] < 1].head()

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count
970,TRUMP SUPPORTER Whose Brutal Beating By Black Mob Was Caught On Video Asks: “What Happened To America?” [VIDEO],https://youtu.be/kKFQ5i9jXmA,1,TRUMP SUPPORTER Whose Brutal Beating By Black Mob Was Caught On Video Asks: “What Happened To America?”,,,[https://youtu.be/kKFQ5i9jXmA],1,,0,,0,0
2363,https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/,https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/,1,https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/,,,[https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/],1,,0,,0,0
3328,GOTCHA! CNN PANELIST Called Out For Lying About Terror Attacks In The US [Video],https://www.youtube.com/watch?v=ISm-p8e-D7I,1,GOTCHA! CNN PANELIST Called Out For Lying About Terror Attacks In The US,,,[https://www.youtube.com/watch?v=ISm-p8e-D7I],1,,0,,0,0
4233,WATCH Huge Crowd Of Muslims Admit That ALL Muslims Should Be Considered “Extremists”…Any Questions?,https://www.youtube.com/watch?v=8Mehk5eWcZA,1,WATCH Huge Crowd Of Muslims Admit That ALL Muslims Should Be Considered “Extremists”…Any Questions?,,,[https://www.youtube.com/watch?v=8Mehk5eWcZA],1,,0,,0,0
6635,THE VIEW’S Whoopi Goldberg To Co-Host: “This Is Why Black People Don’t Wanna Talk To White People” [VIDEO],https://youtu.be/RTuxvWjH3a4,1,THE VIEW’S Whoopi Goldberg To Co-Host: “This Is Why Black People Don’t Wanna Talk To White People”,,,[https://youtu.be/RTuxvWjH3a4],1,,0,,0,0


In [280]:
# we will drop the rows less than 1
data.drop(data.loc[data['text_sent_count'] < 1].index, inplace=True)

In [281]:
# get mean sentence length
data.loc[:,'mean_sent_length'] = data.loc[:,'text_clean'].map(lambda txt: np.mean([len(sent) for sent in sent_tokenize(txt)]))

In [282]:
# look at rows with mean sentance length less than 20
# we will apply the language check and that needs a good amount of charaters
df = data.loc[data['mean_sent_length'] < 20,['text_clean','label']]
print(df)

                                                         text_clean  label
2428                           How many? I think we ve lost count!       1
3151                                              Great interview!       1
3394                                            Via: GATEWAY PUNDIT      1
4550                                                       Via: TMZ      1
5922                                              Nice Admin Lady        1
6354                                                         Watch:      1
7816                                                Zones"confuse        1
8032   Les Deplorables Unite     ??VOTE TRUMP?? () November 9, 2016      1
8218                                              Spirited debate:       1
8429                                                   Ramzy Baroud      1
9609                                             Boom!Courtesy of:       1
10283                                                       Via: GP      1
11374            Unreal! 

In [283]:
# drop the low mean length articles
data.drop(data.loc[data['mean_sent_length'] < 20].index, inplace=True)

***
##Drop other languages
***

I want to keep only english rows

In [284]:
# runs for 16min
# this method will detect other languages using langdetect package
def detect_language(text):
    try:
        return detect(text)
    except:
        return 'could not detect language'
# we are running the detect_language methon on the 'text_clean' column
data.loc[:,'lang'] = data.loc[:,'text_clean'].apply(detect_language)

In [285]:
# the count of english slightly changes every time I run the above method but the count should be around #61608
print(data.loc[:,'lang'].value_counts())

lang
en       61604
ru         156
es         141
de          98
fr          33
ar          19
tr           7
pt           7
it           5
hr           4
nl           3
no           3
pl           2
el           2
sq           1
zh-cn        1
vi           1
sw           1
Name: count, dtype: int64


In [286]:
# show non english rows
df = data.loc[~data['lang'].eq('en')]
df.head()
#df.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/lang.csv', index=False)

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,hashtags,hashtag_count,text_sent_count,mean_sent_length,lang
71,Ikea crea un carril rápido para solteros,"Ikea crea un carril rápido para solteros LOS CLIENTES CON PAREJA PODRÁN USAR EL CARRIL A LA TERCERA DISCUSIÓN, CUANDO LA RUPTURA SEA IRREVERSIBLE solteros \nCon el fin de que las parejas que discu...",1,Ikea crea un carril rápido para solteros,"Ikea crea un carril rápido para solteros LOS CLIENTES CON PAREJA PODRÁN USAR EL CARRIL A LA TERCERA DISCUSIÓN, CUANDO LA RUPTURA SEA IRREVERSIBLE solteros Con el fin de que las parejas que discut...",,,0,,0,,0,7,254.714286,es
272,Se reencuentran dos gemeliers separados al nacer,Se reencuentran dos gemeliers separados al nacer LOS EXPERTOS CONFIRMAN QUE SON EJEMPLARES DE LA MISMA CAMADA reencuentro \nDos gemeliers separados al nacer se han reencontrado hoy tras 18 años so...,1,Se reencuentran dos gemeliers separados al nacer,Se reencuentran dos gemeliers separados al nacer LOS EXPERTOS CONFIRMAN QUE SON EJEMPLARES DE LA MISMA CAMADA reencuentro Dos gemeliers separados al nacer se han reencontrado hoy tras 18 años sob...,,,0,,0,,0,8,190.75,es
335,Russlandgeschäft: Deutsche Unternehmen halten am russischen Markt fest,28. Oktober 2016 Jekaterina Iwanowa 30 Prozent der befragten deutschen Unternehmen in Russland bezeichnen das Geschäftsklima als leicht verbessert und mehr als die Hälfte rechnet optimistisch mit ...,1,Russlandgeschäft: Deutsche Unternehmen halten am russischen Markt fest,28. Oktober 2016 Jekaterina Iwanowa 30 Prozent der befragten deutschen Unternehmen in Russland bezeichnen das Geschäftsklima als leicht verbessert und mehr als die Hälfte rechnet optimistisch mit ...,,,0,,0,,0,4,91.5,de
534,В машине предполагаемых убийц Немцова обнаружены биоматериалы,"0 комментариев 4 поделились Фото: Fotodom.ru/Коммерсантъ \n""Экспертиза биологических материалов с подголовника, изъятого из указанного автомобиля была проведена экспертами ФСБ. Согласно результата...",1,В машине предполагаемых убийц Немцова обнаружены биоматериалы,"0 комментариев 4 поделились Фото: Fotodom.ru/Коммерсантъ ""Экспертиза биологических материалов с подголовника, изъятого из указанного автомобиля была проведена экспертами ФСБ. Согласно результатам...",,,0,,0,,0,28,90.107143,ru
641,Канадская делегация не приедет для подписания соглашения о свободной торговле с ЕС,"Короткая ссылка 27 октября 2016, 02:47 Делегация Канады не прибудет на запланированный 27 октября саммит, где должно было состояться подписание соглашения о свободной торговле (CETA) с Европейским...",1,Канадская делегация не приедет для подписания соглашения о свободной торговле с ЕС,"Короткая ссылка 27 октября 2016, 02:47 Делегация Канады не прибудет на запланированный 27 октября саммит, где должно было состояться подписание соглашения о свободной торговле (CETA) с Европейским...",,,0,,0,,0,6,156.5,ru


In [287]:
# keep english rows
data = data.loc[data['lang'] == 'en']

# drop lang column
data = data.drop(columns=['lang'])

***
##Tokenize title and text
***

**This is where we finally tokenize the clean title and text**

In [288]:
# add additional stop words that aren't in the nltk library list
add_stop_words = {'also'}
print(add_stop_words)

# load the nltk library stop word list
stop_words = set(stopwords.words('english'))

# combine the 2 lists - don't really need this now because I did not add a lot of additional stop words but leaving in as place holder
stop_words = stop_words.union(add_stop_words)

# display stop words
print(stop_words)
print('said' in stop_words)

{'also'}
{'such', 'my', 'yours', 'll', 'm', "couldn't", "mightn't", 'under', "aren't", 'yourselves', 'each', 'couldn', 'these', 'ours', 'who', 'we', "weren't", "needn't", 'ma', 'here', 's', 'further', "she's", 'again', 'hers', 'an', 'against', 'theirs', 'other', 'he', 'off', 'not', "don't", 'all', 'been', 'hadn', "mustn't", "wouldn't", 'when', 'after', 'through', 'do', 'if', 'as', "it's", 'aren', 'into', 'those', "haven't", 'needn', 'about', 'below', "wasn't", 'for', 'didn', 'the', 've', "didn't", 'too', 'are', "hadn't", 'wouldn', 'own', "you'll", 'this', 'by', "won't", 'that', 'will', 'or', 'its', 'y', 'any', 'i', 'both', 'himself', 'no', 'does', 'won', 'him', 'were', 'having', 'd', 'there', 'before', 'you', 'in', "shouldn't", 'it', 'our', 'while', 'herself', 'once', 'up', 'a', 'can', 'shan', 'until', 'she', "you'd", 'so', 'myself', 'wasn', 'weren', 'had', 't', 'shouldn', 'hasn', 'then', 'your', 'but', 'don', 'be', 'at', 'between', "hasn't", 'over', 'am', 'haven', 'of', 'nor', 'his', 

In [289]:
# clean digit only strigs and get rid of other characters that are not words and tokenize string
def clean_tokens(text):
  # text to lower
  text_clean = text.lower()

  # get rid of digit only text strings
  text_clean = re.sub(r"\d+", '', text_clean)

  # set non-words to ''
  text_clean = re.sub("\W", ' ', text_clean)

  # tokenize text
  tokens = word_tokenize(text_clean)

  # check to make sure length of token is greater than 1 and combine into a new text string
  # I want to combine into text string to be able to save to csv file and load later with minimal processing
  text_list = [word for word in tokens if word not in stop_words and len(word)>1]
  token_to_text = ' '.join(text_list)

  # return text string
  return token_to_text

In [290]:
# test clean_tokens method
text1 = data.loc[0,'text_clean']
print(text1)
print(clean_tokens(text1))
print('--------------')
title1 = data.loc[0,'title_clean']
print(title1)
print(clean_tokens(title1))

The mastermind of the “Bridgegate” lane closure scandal that helped torpedo New Jersey Governor Chris Christie’s presidential bid should not be sentenced to prison due to his cooperation, U.S. prosecutors said in a court document filed on Tuesday. David Wildstein, who helped the government convict two former Christie associates after he pleaded guilty in 2015, is set to be sentenced in federal court in Newark on Wednesday. Wildstein, 55, admitted overseeing a scheme to shut down access lanes at the busy George Washington Bridge in 2013 to create massive traffic gridlock as punishment for a local Democratic mayor who refused to endorse Christie’s reelection campaign. U.S. prosecutors charged Wildstein, former Christie deputy chief of staff Bridget Kelly and a former executive of the Port Authority of New York and New Jersey, Bill Baroni, with concocting the plot. The Port Authority supervises operations for the George Washington Bridge, which connects Manhattan and New Jersey and is one

In [291]:
# tokenize the cleaned title and text columns
data.loc[:,'title_tokens_to_text'] = data.loc[:,'title_clean'].apply(clean_tokens)
data.loc[:,'text_tokens_to_text'] = data.loc[:,'text_clean'].apply(clean_tokens)

***
##Check recognized words and non-ascii characters
***

**I am checking to see if there are unrecognized words. While looking for unrecognized words I found more foreign words. Most of the foreign words are non-ascii so I removed those to create a recogized_words_to_text column. This is probably over kill, but I wanted to look at the percentage of non_recognized words to total_words.**

In [292]:
# load saved csv file of other recognized words that are not included in nltk or wordnet word or lemmas collections
# the saved addl word list was taken from a prior run resulting in a non recognized word and then running through MS Word to see if recognized by spell check and if it was not considered misspelled then I added to addl word list.
file = open('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/addl_words.txt', 'r', encoding='utf-8-sig')

# parse csv file with ', '
addl_words = file.read().split(', ')

# strip the extra ' characters
addl_words = [word.strip('\'') for word in addl_words]
add_words = set(addl_words)
file.close()

#print(add_words)
print(len(add_words)) # the count should be 49483

49483


In [293]:
# get NLTK word list
nltk.download('words')
word_list = set(words.words())

# get lemmas from wordnet
# get all synsets from wordnet.all_synsets() then get the lemma names from each synset
# this will help us expand the words list with different words and we just want to add the lemmas
# when we check the words we will take down to lemma form if there is not a direct match right away
wordnet_lemmas = set(lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas())
# combine lists
word_list_master = word_list.union(wordnet_lemmas)
word_list_master = word_list_master.union(add_words)
print(len(word_list_master)) # the count should be 376754

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


376754


In [294]:
# initiate a wordnet lemmatizer
wnl = WordNetLemmatizer()

# method to check if word is a recognized word
def check_words(tokens_to_text):
  # this is a part of speach dictionary to match coding between NLTK
  # maps part-of-speech tags (like 'NN' for noun, 'VB' for verb) to simplified tags ('n', 'v', 'a', 'r') used by WordNet
  pos_ref = {'NN': 'n', 'NNP': 'n', 'NNPS': 'n', 'NNS': 'n', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'RB': 'r', 'RBR': 'r', 'RBS': 'r', 'RP': 'r', 'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}
  token_lst = tokens_to_text.split()
  # list of tokens with pos tags using nltk.pos_tag method
  lst_pos_tags = nltk.pos_tag(token_lst)

  # set of non words or mispelled words
  not_in_words = set()
  # set of non ascii charater strings because I could not get rid of all foreign words about and want to see where they are at
  not_ascii = set()
  # add recognized words to list
  recog_words = []

  # loop through tokens
  for token in lst_pos_tags:
    # if word in master word list add to recog_words
    if token[0] in word_list_master:
      recog_words.append(token[0])
      pass
    # if word is not recognized right away check is lemma and see if it is recognized
    elif wnl.lemmatize(token[0]) in word_list_master:
      recog_words.append(token[0])
      pass
    # otherwise look at token and pos tag
    else:
      word = token[0]
      # get wordnet pos from NLTK pos tag
      pos = pos_ref.get(token[1])
      # is pos tag is one from the dictionary above then check lemma using lematize with the pos information
      if pos is not None:
        lemma = wnl.lemmatize(word, pos)
        if lemma in word_list_master:
          recog_words.append(token[0])
          pass
        # if the token is still not recognized then check to see if it is ascii
        else:
          if token[0].isascii():
            not_in_words.add(token[0])
          else:
            not_ascii.add(token[0])
            #print(token[0]) #printing all non-ascii tokens
      # for tokens not labeled with pos that is in above pos_ref dictionary then check to see if the token is isascii
      else:
        if token[0].isascii():
          not_in_words.add(token[0])
        else:
          not_ascii.add(token[0])
          #print(token[0])
  recog_words_to_text = ' '.join(recog_words)
  # return recog_words_to_text and the not_in_words set and not_ascii set as columns
  return pd.Series({'recog_words_to_text': recog_words_to_text, 'text_not_words': not_in_words, 'not_ascii': not_ascii})

In [295]:
# test check_words methon
text1 = data.loc[0,'text_tokens_to_text']
print(text1)

good, bad, not_ascii = check_words(text1)
print(good)
print(bad)
print(not_ascii)

mastermind bridgegate lane closure scandal helped torpedo new jersey governor chris christie presidential bid sentenced prison due cooperation prosecutors said court document filed tuesday david wildstein helped government convict two former christie associates pleaded guilty set sentenced federal court newark wednesday wildstein admitted overseeing scheme shut access lanes busy george washington bridge create massive traffic gridlock punishment local democratic mayor refused endorse christie reelection campaign prosecutors charged wildstein former christie deputy chief staff bridget kelly former executive port authority new york new jersey bill baroni concocting plot port authority supervises operations george washington bridge connects manhattan new jersey one world busiest crossings christie denied involvement charged fallout dogged bid republican nomination president contributed current record low approval rating percent new jersey wildstein pleaded guilty conspiracy agreed coopera

In [296]:
# run check_words method and get new columns
data[['recog_words_to_text', 'text_not_words', 'not_ascii']] = data['text_tokens_to_text'].apply(lambda x: pd.Series(check_words(x)))
# run for approx 20min

In [297]:
# count the number of non_ascii tokens
data.loc[:,'non_ascii_count'] = data.loc[:,'not_ascii'].map(lambda lst: len(lst))

In [298]:
# see which label has the most rows with high _non_ascii_counts
# this is most likely because they have text that is a mix of english and foreign words
df_no_ascii_high = data[data['non_ascii_count'] > 5]
print(df_no_ascii_high['label'].value_counts())

label
1    42
0    19
Name: count, dtype: int64


In [299]:
# calculate non_regog_word count and the percentage that is not recognized
data.loc[:,'non_recog_word_count'] = data.loc[:,'text_not_words'].map(lambda lst: len(lst))
data.loc[:,'total_word_count'] = data.loc[:,'text_tokens_to_text'].map(lambda lst: len(lst))
data['non_word_percent'] = (data['non_recog_word_count'] + data['non_ascii_count']) / data['total_word_count'] * 100

**I did expect the real articles to have a lower non-word percentage**

In [300]:
# compare non_word_percent by label
print(data.loc[data['label'] == 0, 'non_word_percent'].mean())
print(data.loc[data['label'] == 1, 'non_word_percent'].mean())

0.13381513238181678
0.19963963905746687


***
##Remove non-ascii tokens out of text tokens
***

In [301]:
# method to clean out the non_ascii characters 'text_token_to_text' column
def clean_ascii(text_tokens, not_ascii):
  words = text_tokens.split()
  text_ascii = [word for word in words if word not in not_ascii]
  text_clean_ascii = ' '.join(text_ascii)
  return text_clean_ascii

In [302]:
# test above method
text_clean = data.loc[7666,'text_tokens_to_text']
not_ascii = data.loc[7666,'not_ascii']
print(clean_ascii(text_clean, not_ascii))

iranian authorities sentenced member iran nuclear negotiating team five years jail tasnim news agency reported wednesday although gave details case iran reached nuclear deal united states five major powers led lifting international sanctions iran return curbs nuclear program potential detente west alarmed iranian hardliners seen flood european trade investment delegations arrive tehran discuss possible deals according iran experts reports last year iranian media said nuclear negotiator dual nationality arrested accused providing sensitive economic information iran enemies may judiciary spokesman gholamhossein mohseni ejei said member negotiating team facing espionage charges sentenced prison term added could provide details since verdict could appealed case reviewed appeal court five year jail sentence upheld tasnim quoted informed source saying wednesday name person tasnim reported july member negotiating team charge banking affairs talks arrested agency semi official media named abdo

In [303]:
# create a new column of clean tokens that do not have non-ascii tokens
data.loc[:,'text_clean_ascii'] = data.loc[:,['text_tokens_to_text', 'not_ascii']].apply(lambda row: clean_ascii(row['text_tokens_to_text'], row['not_ascii']), axis=1)

***
##Title and Text Similarity with word2vec
***

**I thought it would be helpful to look at the similarity between title and text.**

In [304]:
# check to see rows with 'title_tokens_to_text' values blank
# I need to keep the rows
df = data.loc[data['title_tokens_to_text'].str.len()<1,['title_clean','text_clean','title_tokens_to_text','text_tokens_to_text','label']]
print(df)

                 title_clean  \
12570              What Now?   
18030         Won, Now What?   
21195  It’s Not You, It’s Me   
27086              What If….   
39200                      :   
62133            If It’s She   

                                                                                                                                                                                                    text_clean  \
12570  by Thomas Sowell  The good news is that we dodged a bullet in this election. The bad news is that we don’t know how many other bullets are coming, or from what direction.  A Hillary Clinton victor...   
18030  18 Shares 17 0 0 1 (Donald Trump speaking at the 2013 Conservative Political Action Conference (CPAC) in National Harbor, Maryland. Credit: Gage Skidmore / flickr) Trump's victory was a shock to m...   
21195  Kim Severson is filling in for Sam Sifton, who emails readers of Cooking five days a week to talk about food and suggest recipes. That ema

In [305]:
# we will drop the rows less than 1
data.drop(data.loc[data['title_tokens_to_text'].str.len() < 1].index, inplace=True)

In [306]:
# list of tokens from text and title
# this is to establish a vocabulary
tokens_collection_title = [token_lst.split() for token_lst in data.loc[:,'title_tokens_to_text']]
tokens_collection_text = [token_lst.split() for token_lst in data.loc[:,'text_tokens_to_text']]
tokens_collection = tokens_collection_title + tokens_collection_text
print(tokens_collection[0:10])

[['prosecutors', 'argue', 'prison', 'time', 'new', 'jersey', 'bridgegate', 'mastermind'], ['obama', 'preparing', 'parting', 'shot', 'israel', 'president', 'must', 'bind', 'next'], ['hollywood', 'rich', 'famous', 'shafted', 'sick', 'hillary', 'left', 'replacement', 'mega', 'bucks', 'fundraisers'], ['hillary', 'wants', 'aggressively', 'interventionist', 'foreign', 'policy'], ['donald', 'trump', 'says', 'drugs', 'big', 'factor', 'urban', 'violence'], ['trump', 'irs', 'decided', 'going', 'give', 'info', 'equifax', 'got', 'hacked'], ['tyranny', 'obamaphone', 'fraud', 'kept', 'wraps', 'vote', 'expand', 'program'], ['hillary', 'clinton', 'explains', 'say', 'radical', 'islam'], ['uk', 'children', 'charity', 'muslims', 'kidnapping', 'white', 'girls', 'forcing', 'sex', 'slavery'], ['shocking', 'report', 'law', 'enforcement', 'officers', 'killed', 'young', 'black', 'men', 'white', 'cops']]


In [307]:
# make a trained word2vec model from the title and text tokens
# vector_size is the dimensionality of the word - the higher dimension can capture more complex relationships
# window determines how many words back and forward to look around the word
word2vec_model = Word2Vec(tokens_collection, vector_size=300, window=5, min_count=2, workers=4)

In [308]:
def title_text_similarity(title, text, model=word2vec_model):
  title_tokens = title.split()
  text_tokens = text.split()

  #the title_vec and text_vec need to be at same size as vector_size in the word2vec_model
  title_vec = np.zeros(300)
  text_vec = np.zeros(300)

  # loop through title tokens
  for token in title_tokens:
    # if token in word2vec model add to title_vec
    if token in model.wv:
      title_vec = np.add(title_vec, model.wv[token])

  # loop through text tokens
  for token in text_tokens:
    # if token in word2vec model add to title_vec
    if token in model.wv:
      text_vec = np.add(text_vec, model.wv[token])


  # if either title_vec or text_vec is a zero vector return 0
  if np.linalg.norm(title_vec) == 0 or np.linalg.norm(text_vec) == 0:
      return 0  # or any other default value you prefer
  else:
      # similarity calculation = (dot product of title_vec and text_vec) / (magnitude of title_vec * magnitude of text_vec)
      # cosine similarity is calculated by dividing the dot product of two vectors by the product of their magnitudes
      return round(np.dot(title_vec, text_vec) / (np.linalg.norm(title_vec) * np.linalg.norm(text_vec)), 4)

In [309]:
# test title_text_similarity method
print(data.loc[1,'title_tokens_to_text'])
print(data.loc[1,'text_tokens_to_text'])

print(title_text_similarity(data.loc[1,'title_tokens_to_text'], data.loc[1,'text_tokens_to_text']))

obama preparing parting shot israel president must bind next
print last week un premier cultural agency unesco approved resolution viciously condemning israel referred occupying power various alleged trespasses violations temple mount jerusalem except resolution never uses term judaism holiest shrine refers treats exclusively muslim site deliberate attempt eradicate connection let alone centrality jewish people jewish history orwellian absurdity insult judaism christianity makes mockery gospels chronicle story galilean jew whose life ministry unfolded throughout holy land especially jerusalem temple nothing muslim site happens foundation christianity occurred years islam even came unesco resolution merely surreal extreme worldwide campaign delegitimize israel features bds movement boycott divestment sanctions growing western university campuses mainline protestant churches extends even precincts democratic party
0.3405


In [310]:
# calculate title_text_similarity on all rows
data.loc[:,'title_text_similarity'] = data.loc[:,['title_tokens_to_text', 'text_tokens_to_text']].apply(lambda row: title_text_similarity(row['title_tokens_to_text'], row['text_tokens_to_text']), axis=1)

**I did expect the real articles to have a higher title to text similarity score**

In [311]:
# check how title_text_similarity compare based on label
print(data.loc[data['label']==0, 'title_text_similarity'].mean())
print(data.loc[data['label']==1, 'title_text_similarity'].mean())

0.6771524150681766
0.6432999629382551


***
##Save data
***

In [312]:
data.head()

Unnamed: 0,title,text,label,title_clean,text_clean,email,links,link_count,mentions,mentions_count,...,text_tokens_to_text,recog_words_to_text,text_not_words,not_ascii,non_ascii_count,non_recog_word_count,total_word_count,non_word_percent,text_clean_ascii,title_text_similarity
0,Prosecutors argue against prison time for New Jersey 'Bridgegate' mastermind,NEW YORK (Reuters) - The mastermind of the “Bridgegate” lane closure scandal that helped torpedo New Jersey Governor Chris Christie’s presidential bid should not be sentenced to prison due to his ...,0,Prosecutors argue against prison time for New Jersey 'Bridgegate' mastermind,"The mastermind of the “Bridgegate” lane closure scandal that helped torpedo New Jersey Governor Chris Christie’s presidential bid should not be sentenced to prison due to his cooperation, U.S. pro...",,,0,,0,...,mastermind bridgegate lane closure scandal helped torpedo new jersey governor chris christie presidential bid sentenced prison due cooperation prosecutors said court document filed tuesday david w...,mastermind bridgegate lane closure scandal helped torpedo new jersey governor chris presidential bid sentenced prison due cooperation prosecutors said court document filed tuesday david wildstein ...,{christie},{},0,1,1857,0.05385,mastermind bridgegate lane closure scandal helped torpedo new jersey governor chris christie presidential bid sentenced prison due cooperation prosecutors said court document filed tuesday david w...,0.7181
1,Is Obama preparing a parting shot on Israel? This President must not bind the next,"Print \nLast week, the UN’s premier cultural agency, UNESCO, approved a resolution viciously condemning Israel (referred to as “the Occupying Power”) for various alleged trespasses and violations ...",1,Is Obama preparing a parting shot on Israel? This President must not bind the next,"Print Last week, the UN’s premier cultural agency, UNESCO, approved a resolution viciously condemning Israel (referred to as “the Occupying Power”) for various alleged trespasses and violations o...",,,0,,0,...,print last week un premier cultural agency unesco approved resolution viciously condemning israel referred occupying power various alleged trespasses violations temple mount jerusalem except resol...,print last week un premier cultural agency approved resolution viciously condemning israel referred occupying power various alleged trespasses violations temple mount jerusalem except resolution n...,{unesco},{},0,1,864,0.115741,print last week un premier cultural agency unesco approved resolution viciously condemning israel referred occupying power various alleged trespasses violations temple mount jerusalem except resol...,0.3405
2,HOLLYWOOD RICH AND FAMOUS Shafted By “Sick” Hillary…Left With Replacement For Mega-Bucks Fundraisers,Hillary Clinton s campaign charged up to a hundred thousand dollars for fundraisers in L.A. she may well not attend.It s a big deal the rich and famous were expecting to rub elbows with Hillary ...,1,HOLLYWOOD RICH AND FAMOUS Shafted By “Sick” Hillary…Left With Replacement For Mega-Bucks Fundraisers,Hillary Clinton s campaign charged up to a hundred thousand dollars for fundraisers in L.A. she may well not attend.It s a big deal the rich and famous were expecting to rub elbows with Hillary ...,,,0,,0,...,hillary clinton campaign charged hundred thousand dollars fundraisers may well attend big deal rich famous expecting rub elbows hillary seth macfarlane home intimate dinner hillary billionaire bar...,hillary clinton campaign charged hundred thousand dollars fundraisers may well attend big deal rich famous expecting rub elbows hillary seth macfarlane home intimate dinner hillary billionaire bar...,"{hrc, tmz, von}",{},0,3,620,0.483871,hillary clinton campaign charged hundred thousand dollars fundraisers may well attend big deal rich famous expecting rub elbows hillary seth macfarlane home intimate dinner hillary billionaire bar...,0.6129
3,Hillary Wants Aggressively Interventionist Foreign Policy,10-27-1 6 The first Bill and Hillary Clinton co-presidency included eight years of Balkan and other wars of aggression. Bush/Cheney exceeded their lawlessness. Obama outdid the worst of both previ...,1,Hillary Wants Aggressively Interventionist Foreign Policy,10-27-1 6 The first Bill and Hillary Clinton co-presidency included eight years of Balkan and other wars of aggression. Bush/Cheney exceeded their lawlessness. Obama outdid the worst of both previ...,[lendmanstephen@sbcglobal.net],"[http://www.claritypress.com/LendmanIII.html, sjlendman.blogspot.com, Rense.com, Rense.com]",4,,0,...,first bill hillary clinton co presidency included eight years balkan wars aggression bush cheney exceeded lawlessness obama outdid worst previous administrations attacking seven countries destabil...,first bill hillary clinton co presidency included eight years balkan wars aggression bush cheney exceeded lawlessness obama outdid worst previous administrations attacking seven countries destabil...,"{mps, lendman, ww, renseradio}",{},0,4,2590,0.15444,first bill hillary clinton co presidency included eight years balkan wars aggression bush cheney exceeded lawlessness obama outdid worst previous administrations attacking seven countries destabil...,0.6677
4,Donald Trump Says Drugs Are ‘Big Factor’ in Urban Violence - The New York Times,PITTSBURGH — Donald J. Trump said Thursday that drugs were to blame for the violence roiling cities across the nation. Mr. Trump used the first roughly 10 minutes of his remarks to energy execu...,0,Donald Trump Says Drugs Are ‘Big Factor’ in Urban Violence,PITTSBURGH — Donald J. Trump said Thursday that drugs were to blame for the violence roiling cities across the nation. Mr. Trump used the first roughly 10 minutes of his remarks to energy execu...,,,0,,0,...,pittsburgh donald trump said thursday drugs blame violence roiling cities across nation mr trump used first roughly minutes remarks energy executives shale industry conference address current unre...,pittsburgh donald trump said thursday drugs blame violence roiling cities across nation mr trump used first roughly minutes remarks energy executives shale industry conference address current unre...,{},{},0,0,2721,0.0,pittsburgh donald trump said thursday drugs blame violence roiling cities across nation mr trump used first roughly minutes remarks energy executives shale industry conference address current unre...,0.6466


In [313]:
# save to csv
data.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning - Supervised Learning/project/ProjectData/DataClean.csv', index=False)

In [314]:
# remaining label counts
print(data['label'].value_counts())
# label counts should be the following, but can be slightly different because of the variability in how langdetect works
#0    34615
#1    26983

label
0    34616
1    26982
Name: count, dtype: int64
