<a href="https://colab.research.google.com/github/abinashp437/Stance_Detection_FNC_1/blob/main/fnc_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Exploration and Preprocessing**

In [1]:
import pandas as pd

In [2]:
body_url = "https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_bodies.csv"
body = pd.read_csv(body_url)

In [3]:
stance_url = "https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_stances.csv"
stance = pd.read_csv(stance_url)

In [4]:
print(body.shape)
print(stance.shape)

(1683, 2)
(49972, 3)


In [5]:
body.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [6]:
stance.head()

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [7]:
len(body['Body ID'].unique())

1683

In [8]:
len(stance['Body ID'].unique())

1683

In [9]:
stance['Body ID'].value_counts()

1921    187
1948    175
40      172
524     171
1549    166
       ... 
376       1
140       1
307       1
1066      1
59        1
Name: Body ID, Length: 1683, dtype: int64
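
Both frames contain the same number of unique `Body ID` values (1683), and a single body can be paired with many headlines. A minimal sketch of how the two tables could be joined on that key is shown here. It uses toy frames whose rows are invented for illustration, and the choice of a left join with `validate="many_to_one"` is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins for the real `body` and `stance` frames (values are made up).
body_toy = pd.DataFrame({
    "Body ID": [0, 4],
    "articleBody": ["meteorite crashed near managua", "ebola fear spread"],
})
stance_toy = pd.DataFrame({
    "Headline": ["meteorite hits nicaragua", "apple tv streaming", "ebola hoax"],
    "Body ID": [0, 0, 4],
    "Stance": ["agree", "unrelated", "disagree"],
})

# Equal ID sets on both sides mean no headline is left without a body.
ids_match = set(body_toy["Body ID"]) == set(stance_toy["Body ID"])

# Many headlines map to one body, so the join is checked as many-to-one.
pairs = stance_toy.merge(body_toy, on="Body ID", how="left", validate="many_to_one")
print(ids_match, pairs.shape)
```

With `validate="many_to_one"`, pandas raises an error if a `Body ID` appears more than once in the right-hand frame, which guards against silently duplicated rows.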

In [10]:
stance

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree
...,...,...,...
49967,Urgent: The Leader of ISIL 'Abu Bakr al-Baghda...,1681,unrelated
49968,Brian Williams slams social media for speculat...,2419,unrelated
49969,Mexico Says Missing Students Not Found In Firs...,1156,agree
49970,US Lawmaker: Ten ISIS Fighters Have Been Appre...,1012,discuss


In [11]:
stance['Stance'].isnull().value_counts()

False    49972
Name: Stance, dtype: int64

In [12]:
# Character lengths of the raw article bodies and headlines,
# computed column-wise instead of row by row with iterrows().
art_len = body['articleBody'].str.len()
head_len = stance['Headline'].str.len()
body.insert(2, 'length', art_len)
stance.insert(3, 'length', head_len)
print(art_len.describe())
print(head_len.describe())

count     1683.000000
mean      2201.877005
std       1791.974864
min         38.000000
25%       1169.500000
50%       1808.000000
75%       2716.500000
max      27579.000000
dtype: float64
count    49972.000000
mean        69.356860
std         24.825253
min          9.000000
25%         54.000000
50%         65.000000
75%         79.000000
max        225.000000
dtype: float64


**Newline (`\n`) Removal**

In [13]:
# Assign whole columns at once. Element-wise chained assignment
# (body['articleBody'][row_id] = ...) raises SettingWithCopyWarning.
body['articleBody'] = body['articleBody'].str.replace('\n', '', regex=False)
stance['Headline'] = stance['Headline'].str.replace('\n', '', regex=False)

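The cell above replaces each `\n` with an empty string. One thing worth checking is whether a newline is the only separator between two words, because in that case deleting it joins them. The sketch below compares that approach with whitespace normalisation on a made-up string; the alternative is a suggestion, not what the notebook does.

```python
# Made-up text with single and double line breaks.
raw = "Government said Sunday.\nResidents reported a boom.\n\nNo injuries."

# Deleting newlines outright glues the neighbouring words together.
deleted = raw.replace("\n", "")

# Splitting on any whitespace and rejoining with single spaces keeps words apart.
spaced = " ".join(raw.split())

print(deleted)
print(spaced)
```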

**Stopword Removal**

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize

In [16]:
stop_words = stopwords.words('english')

In [17]:
# Column-wise apply avoids the SettingWithCopyWarning raised by chained assignment.
def remove_stopwords(text):
  return ' '.join(word for word in text.split() if word not in stop_words)
body['articleBody'] = body['articleBody'].apply(remove_stopwords)
stance['Headline'] = stance['Headline'].apply(remove_stopwords)

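The stop-word filter compares tokens exactly as they appear, and the NLTK English list is lowercase, so capitalised stop words pass through. This is visible further down, where the processed bodies still begin with "a small meteorite" and "at least 25". The sketch below reproduces the effect with a tiny hypothetical stop list (`toy_stop_words` and `drop_stopwords` are illustrative names, not part of the notebook) and shows lowercasing first as one possible remedy.

```python
# Tiny hypothetical stop list standing in for stopwords.words('english').
toy_stop_words = {"a", "at", "in", "into", "the"}

def drop_stopwords(text, lowercase_first=False):
    """Drop whitespace-separated tokens that appear in toy_stop_words."""
    if lowercase_first:
        text = text.lower()
    return " ".join(w for w in text.split() if w not in toy_stop_words)

sentence = "A small meteorite crashed into a wooded area"

# The capitalised "A" does not match the lowercase entry "a", so it survives.
print(drop_stopwords(sentence))

# Lowercasing before filtering removes it as well.
print(drop_stopwords(sentence, lowercase_first=True))
```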

**Lemmatisation**

In [18]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [20]:
def lemmatize(text):
  return [lemmatizer.lemmatize(word) for word in w_tokenizer.tokenize(text)]

In [21]:
# Lemmatise each text and join the tokens back into a single string.
body['articleBody'] = body['articleBody'].apply(lambda text: ' '.join(lemmatize(text)))
stance['Headline'] = stance['Headline'].apply(lambda text: ' '.join(lemmatize(text)))


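**Length Trimming**

Article bodies are trimmed to 300 characters and headlines to 150 characters. The `length` column computed earlier still holds the lengths of the raw text.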

In [22]:
# Slice every string in the column to a fixed number of characters.
body['articleBody'] = body['articleBody'].str[:300]
stance['Headline'] = stance['Headline'].str[:150]

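Cutting at a fixed character count can end the text in the middle of a word. A hedged alternative is to keep only whole words that fit within the limit. The sketch below contrasts the two on a short example; `truncate_chars` and `truncate_words` are hypothetical helpers, not functions from the notebook.

```python
def truncate_chars(text, limit):
    """Cut at a fixed character count."""
    return text[:limit]

def truncate_words(text, limit):
    """Keep whole words while the result stays within `limit` characters."""
    kept, used = [], 0
    for word in text.split():
        # Count one extra character for the joining space after the first word.
        extra = len(word) + (1 if kept else 0)
        if used + extra > limit:
            break
        kept.append(word)
        used += extra
    return " ".join(kept)

text = "government spokeswoman rosario murillo said"
print(truncate_chars(text, 15))
print(truncate_words(text, 15))
```

The word-based variant can return a noticeably shorter string, so the limit may need adjusting if it is used in place of plain slicing.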

**Lowercasing**

All text is lowercased so that later matching is case-insensitive.

In [23]:
# Lowercase every string in the column.
body['articleBody'] = body['articleBody'].str.lower()
stance['Headline'] = stance['Headline'].str.lower()


In [24]:
body.head()

Unnamed: 0,Body ID,articleBody,length
0,0,a small meteorite crashed wooded area nicaragu...,1902
1,4,last week hinted come ebola fear spread across...,621
2,5,(newser) – wonder long quarter pounder cheese ...,1360
3,6,"posting photo gun-toting child online, isis su...",2708
4,7,at least 25 suspected boko haram insurgent kil...,2367


In [25]:
stance.head()

Unnamed: 0,Headline,Body ID,Stance,length
0,police find mass graf least '15 bodies' near m...,712,unrelated,115
1,hundreds palestinians flee flood gaza israel o...,158,agree,65
2,"christian bale pass role steve jobs, actor rep...",137,unrelated,91
3,hbo apple talks $15/month apple tv streaming s...,1034,unrelated,82
4,spider burrowed tourist's stomach chest,1923,disagree,63


In [26]:
body['articleBody'][0] 

"a small meteorite crashed wooded area nicaragua's capital managua overnight, government said sunday. residents reported hearing mysterious boom left 16-foot deep crater near city's airport, associated press reports. government spokeswoman rosario murillo said committee formed government study event "

In [27]:
stance['Headline'][0]

"police find mass graf least '15 bodies' near mexico town 43 student disappeared police clash"
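
The preprocessing steps in this section can be collected into a single function that is applied once per column. The sketch below keeps the notebook's order of operations. `preprocess`, its parameter names, the identity default for `lemmatize_word`, and the small stop list are all assumptions made so that the example runs without NLTK.

```python
def preprocess(text, stop_words, max_chars, lemmatize_word=lambda w: w):
    """Apply this section's steps to one string, in the notebook's order."""
    text = text.replace("\n", "")                             # newline removal
    words = [w for w in text.split() if w not in stop_words]  # stopword removal
    words = [lemmatize_word(w) for w in words]                # lemmatisation
    text = " ".join(words)[:max_chars]                        # trimming
    return text.lower()                                       # lowercasing

toy_stop_words = {"with", "at", "in", "of"}  # hypothetical subset of a stop list
headline = "Police find mass graves with at least 15 bodies\n"
print(preprocess(headline, toy_stop_words, max_chars=150))
```

On the real data this could be applied with something like `stance['Headline'].apply(lambda t: preprocess(t, set(stop_words), 150, lemmatizer.lemmatize))`, which also replaces the repeated row-by-row loops with a single pass over the column.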

**Vectorization**