<a href="https://colab.research.google.com/github/abinashp437/Stance_Detection_FNC_1/blob/main/fnc_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Exploration and Preprocessing**

In [1]:
import pandas as pd

In [2]:
body_url = "https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_bodies.csv"
body = pd.read_csv(body_url)

In [3]:
stance_url = "https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_stances.csv"
stance = pd.read_csv(stance_url)

In [4]:
print(body.shape)
print(stance.shape)

(1683, 2)
(49972, 3)


In [5]:
body.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [6]:
stance.head()

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [7]:
body['Body ID'].nunique()

1683

In [8]:
stance['Body ID'].nunique()

1683

In [9]:
stance['Body ID'].value_counts()

1921    187
1948    175
40      172
524     171
1549    166
       ... 
376       1
140       1
307       1
1066      1
59        1
Name: Body ID, Length: 1683, dtype: int64
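
The counts above show that one article body is paired with many headlines (up to 187), so `Body ID` acts as a many-to-one key from `stance` to `body`. A minimal sketch of joining the two tables on that key, using small made-up frames rather than the FNC-1 data; `toy_body`, `toy_stance`, and `pairs` are hypothetical names that do not appear in the notebook:

```python
import pandas as pd

# Toy stand-ins for the notebook's `body` and `stance` frames (made-up rows).
toy_body = pd.DataFrame({
    "Body ID": [0, 4],
    "articleBody": ["meteorite crashed near the airport", "ebola fears spread"],
})
toy_stance = pd.DataFrame({
    "Headline": ["meteorite hits city", "ebola panic grows", "meteorite was a hoax"],
    "Body ID": [0, 4, 0],
    "Stance": ["agree", "discuss", "disagree"],
})

# Many headlines map to one body; validate= raises if Body ID repeats in toy_body.
pairs = toy_stance.merge(toy_body, on="Body ID", how="left", validate="many_to_one")
print(pairs[["Headline", "Body ID", "Stance"]])
```

A left join keeps every headline row, which makes it easy to spot headlines whose body is missing (their `articleBody` would be NaN).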

In [10]:
stance

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree
...,...,...,...
49967,Urgent: The Leader of ISIL 'Abu Bakr al-Baghda...,1681,unrelated
49968,Brian Williams slams social media for speculat...,2419,unrelated
49969,Mexico Says Missing Students Not Found In Firs...,1156,agree
49970,US Lawmaker: Ten ISIS Fighters Have Been Appre...,1012,discuss


In [11]:
stance['Stance'].isnull().value_counts()

False    49972
Name: Stance, dtype: int64

**Newline (`\n`) Removal**

In [12]:
# Assign the whole column at once; element-wise chained assignment such as
# body['articleBody'][row_id] = ... triggers pandas' SettingWithCopyWarning.
body['articleBody'] = body['articleBody'].str.replace('\n', '', regex=False)
stance['Headline'] = stance['Headline'].str.replace('\n', '', regex=False)
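
Replacing `'\n'` with an empty string joins the last word of one line to the first word of the next. A minimal sketch on a made-up string, comparing that with replacing the newline by a space; whether the space variant is preferable here is a judgment call, and switching would change the word counts reported later in this notebook:

```python
# Made-up text with a line break between two sentences.
text = "Police found the site.\nOfficials declined to comment."

joined = text.replace("\n", "")    # what the cell above does
spaced = text.replace("\n", " ")   # alternative that keeps the word boundary

print(joined)  # Police found the site.Officials declined to comment.
print(spaced)  # Police found the site. Officials declined to comment.
print(len(joined.split()), len(spaced.split()))  # 7 8
```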


**Stopword Removal**

In [13]:
import nltk
nltk.download('stopwords') #for stopword removal
nltk.download('wordnet') #for lemmatisation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [14]:
from nltk.corpus import stopwords

In [15]:
stop_words = set(stopwords.words('english'))  # a set gives constant-time membership tests
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

In [16]:
# Note: the match is case-sensitive and runs before the text is lowercased.
def drop_stopwords(text):
  return ' '.join(word for word in text.split() if word not in stop_words)

body['articleBody'] = body['articleBody'].apply(drop_stopwords)
stance['Headline'] = stance['Headline'].apply(drop_stopwords)
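
NLTK's English stop-word list is lowercase, and the filter above compares tokens as-is before any lowercasing, so capitalized stop words such as "A" or "At" pass through (the later `body.head()` output still begins with "a small meteorite" and "at least 25"). A sketch of a case-insensitive filter, assuming a tiny hand-written stop-word set in place of the NLTK list; `TOY_STOP_WORDS` and `remove_stopwords` are hypothetical names:

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
TOY_STOP_WORDS = {"a", "at", "in", "into", "the", "were"}

def remove_stopwords(text, stop_words, ignore_case=True):
    """Drop whitespace-separated tokens that appear in stop_words."""
    kept = []
    for word in text.split():
        key = word.lower() if ignore_case else word
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)

sentence = "A small meteorite crashed into a wooded area"
print(remove_stopwords(sentence, TOY_STOP_WORDS, ignore_case=False))
# A small meteorite crashed wooded area
print(remove_stopwords(sentence, TOY_STOP_WORDS))
# small meteorite crashed wooded area
```

Lowercasing the text before filtering would have the same effect, at the cost of reordering the preprocessing steps.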


**Word Counts (no_of_words)**

In [17]:
# Word count per article body (whitespace-separated tokens)
word_article = pd.Series(
    [len(w_tokenizer.tokenize(text)) for text in body['articleBody']],
    index=body.index)

# Add the counts to the dataframe as the column at position 2
body.insert(2, "no_of_words", word_article)

# Word count per headline in stance
word_headline = pd.Series(
    [len(w_tokenizer.tokenize(text)) for text in stance['Headline']],
    index=stance.index)

# Add the counts to the dataframe as the column at position 3
stance.insert(3, "no_of_words", word_headline)

print(word_article.describe())
print(word_headline.describe())

count    1683.000000
mean      218.915033
std       178.444123
min         2.000000
25%       116.000000
50%       180.000000
75%       269.500000
max      2925.000000
dtype: float64
count    49972.000000
mean         9.307872
std          3.247104
min          2.000000
25%          7.000000
50%          9.000000
75%         11.000000
max         28.000000
dtype: float64
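
Counts like these can also be computed with pandas string methods instead of a tokenizer call per row. A minimal sketch on a made-up frame, assuming plain whitespace tokenization (`str.split()` with no arguments splits on runs of whitespace); `toy` is a hypothetical name:

```python
import pandas as pd

toy = pd.DataFrame({"Headline": ["spider burrowed tourist's stomach chest",
                                 "hbo apple talks"]})

# Split each string on whitespace, then take the length of each token list.
toy["no_of_words"] = toy["Headline"].str.split().str.len()

print(toy["no_of_words"].tolist())  # [5, 3]
print(toy["no_of_words"].describe())
```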


**Lemmatisation**

In [18]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [19]:
def lemmatize(text):
  return [lemmatizer.lemmatize(word) for word in w_tokenizer.tokenize(text)]

In [20]:
# Lemmatize column-wise; lemmatize() uses WordNet's default noun POS for every token.
body['articleBody'] = body['articleBody'].apply(lambda text: ' '.join(lemmatize(text)))
stance['Headline'] = stance['Headline'].apply(lambda text: ' '.join(lemmatize(text)))


**Case Normalization**

In [21]:
body['articleBody'] = body['articleBody'].str.lower()
stance['Headline'] = stance['Headline'].str.lower()


In [22]:
body.head()

Unnamed: 0,Body ID,articleBody,no_of_words
0,0,a small meteorite crashed wooded area nicaragu...,187
1,4,last week hinted come ebola fear spread across...,74
2,5,(newser) – wonder long quarter pounder cheese ...,156
3,6,"posting photo gun-toting child online, isis su...",292
4,7,at least 25 suspected boko haram insurgent kil...,223


In [23]:
stance.head()

Unnamed: 0,Headline,Body ID,Stance,no_of_words
0,police find mass graf least '15 bodies' near m...,712,unrelated,15
1,hundreds palestinians flee flood gaza israel o...,158,agree,8
2,"christian bale pass role steve jobs, actor rep...",137,unrelated,11
3,hbo apple talks $15/month apple tv streaming s...,1034,unrelated,10
4,spider burrowed tourist's stomach chest,1923,disagree,5


**Corpus Creation**

In [24]:
# One token list per article body, followed by one per headline
corpus = [w_tokenizer.tokenize(text) for text in body['articleBody']]
corpus += [w_tokenizer.tokenize(text) for text in stance['Headline']]

**Trimming article bodies to the median of no_of_words (180) and headlines to the 75th percentile (11)**

In [25]:
pad_art = 180  # median article length (50% row of the summary above)
pad_head = 11  # 75th percentile of headline length (75% row of the summary above)
# Keep only the first pad_art / pad_head whitespace tokens of each text
body['articleBody'] = body['articleBody'].apply(lambda text: ' '.join(w_tokenizer.tokenize(text)[:pad_art]))
stance['Headline'] = stance['Headline'].apply(lambda text: ' '.join(w_tokenizer.tokenize(text)[:pad_head]))
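
The cut-offs 180 and 11 match the 50% row of the article summary and the 75% row of the headline summary printed earlier. A sketch of deriving such a limit from the counts instead of hard-coding it, on made-up data; `counts`, `limit`, and `trim` are hypothetical names:

```python
import pandas as pd

# Made-up word counts standing in for body['no_of_words'].
counts = pd.Series([2, 5, 7, 9, 40])

# Median (0.5 quantile) as the cut-off; use 0.75 for the 75th percentile.
limit = int(counts.quantile(0.5))

def trim(text, max_tokens):
    """Keep only the first max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])

print(limit)  # 7
print(trim("one two three four five six seven eight nine", limit))
# one two three four five six seven
```

Texts shorter than the limit pass through unchanged, so a later padding step is still needed to reach a fixed length.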
  


**Vectorization**

In [26]:
#install gensim
from gensim.models import Word2Vec
import tensorflow as tf
from tensorflow import keras
from google.colab import files

In [27]:
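# gensim 3.x argument names; gensim 4 renamed size -> vector_size and iter -> epochs
model = Word2Vec(corpus, min_count = 1, size = 100, workers = 3, window = 3, iter = 30)
model.save('vector.bin')
files.download('vector.bin')  # Colab-only: downloads the saved model to the local machine
len(model.wv.vocab)  # vocabulary size; gensim 4 uses len(model.wv.key_to_index)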

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

57450

In [28]:
def vectorize(text):
  return [model.wv[word] for word in w_tokenizer.tokenize(text)]

In [29]:
# One list of word vectors per article body and per headline
art_vec = body['articleBody'].apply(vectorize)
head_vec = stance['Headline'].apply(vectorize)


In [30]:
pad_art_vec = keras.preprocessing.sequence.pad_sequences(art_vec, padding = 'post', maxlen = pad_art, dtype = float)
pad_head_vec = keras.preprocessing.sequence.pad_sequences(head_vec, padding = 'post', maxlen = pad_head, dtype = float)

In [31]:
#converting to list for storing in dataframe
pad_art_vec = pad_art_vec.tolist()
pad_head_vec = pad_head_vec.tolist()

In [32]:
print(len(pad_art_vec))
print(len(pad_art_vec[0]))
print(len(pad_art_vec[0][0]))
print(len(pad_head_vec))
print(len(pad_head_vec[0]))
print(len(pad_head_vec[0][0]))

1683
180
100
49972
11
100
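
The printed shapes follow from padding: each document becomes a `maxlen` x 100 array, with zero vectors appended after the real word vectors. A rough NumPy sketch of that post-padding step, not Keras's actual implementation; `pad_post` is a hypothetical helper, and truncating from the front mirrors the Keras default `truncating='pre'`:

```python
import numpy as np

def pad_post(sequences, maxlen, dim):
    """Zero-pad (at the end) or truncate (from the front) lists of vectors."""
    out = np.zeros((len(sequences), maxlen, dim), dtype=float)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty document stays all zeros
        vecs = np.asarray(seq, dtype=float)[-maxlen:]  # keep the last maxlen vectors
        out[i, :len(vecs)] = vecs
    return out

# Two made-up "documents" with 2 and 4 word vectors of dimension 3.
docs = [[[1, 1, 1], [2, 2, 2]],
        [[1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0]]]
padded = pad_post(docs, maxlen=3, dim=3)

print(padded.shape)     # (2, 3, 3)
print(padded[0, 2])     # [0. 0. 0.]
print(padded[1, :, 0])  # [2. 3. 4.]
```

Because the texts in this notebook were already trimmed to `pad_art` and `pad_head` tokens, the truncation branch should not come into play there; only the zero-padding does.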


**Storing the Vectors in DataFrames**

In [33]:
article_vector = pd.DataFrame()
article_vector.insert(0, 'Body ID', body['Body ID'])
article_vector.insert(1, 'Vectors', pad_art_vec)

headline_vector = pd.DataFrame()
headline_vector.insert(0, 'Body ID', stance['Body ID'])
headline_vector.insert(1, 'Vectors', pad_head_vec)