#### Bag of n_grams: Exercise**
- Fake news refers to misinformation or disinformation in the country which is spread through word of mouth and more recently through digital communication such as What's app messages, social media posts, etc.

- Fake news spreads faster than Real news and creates problems and fear among groups and in society.

- We are going to address these problems using classical NLP techniques and going to classify whether a given message/ text is Real or Fake Message.

- You will use a Bag of n-grams to pre-process the text and apply different classification algorithms.

- Sklearn CountVectorizer has the inbuilt implementations for Bag of Words.

#### About Data: Fake News Detection
- Credits: https://www.kaggle.com/datasets/nopdev/real-and-fake-news-dataset

- This data consists of two columns. - Text - label

- Text is the statements or messages regarding a particular event/situation.

- label feature tells whether the given Text is Fake or Real.

- As there are only 2 classes, this problem comes under the Binary Classification.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Dataset/news.csv")

In [3]:
df.shape

(6335, 4)

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [5]:
df.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [6]:
df['label_num'] = df['label'].apply(lambda x: 1 if x =='REAL' else 0)

In [7]:
df.text = df.title + df.text

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,label_num
0,8476,You Can Smell Hillary’s Fear,"You Can Smell Hillary’s FearDaniel Greenfield,...",FAKE,0
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE,0
2,3608,Kerry to go to Paris in gesture of sympathy,Kerry to go to Paris in gesture of sympathyU.S...,REAL,1
3,10142,Bernie supporters on Twitter erupt in anger ag...,Bernie supporters on Twitter erupt in anger ag...,FAKE,0
4,875,The Battle of New York: Why This Primary Matters,The Battle of New York: Why This Primary Matte...,REAL,1


In [9]:
df.drop(['Unnamed: 0', 'title'], axis = 1, inplace = True)

In [10]:
df.head()

Unnamed: 0,text,label,label_num
0,"You Can Smell Hillary’s FearDaniel Greenfield,...",FAKE,0
1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE,0
2,Kerry to go to Paris in gesture of sympathyU.S...,REAL,1
3,Bernie supporters on Twitter erupt in anger ag...,FAKE,0
4,The Battle of New York: Why This Primary Matte...,REAL,1


#### Modelling without Pre-processing Text data

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.label_num, test_size = 0.2, random_state = 2022, stratify = df.label_num)

In [13]:
X_train.shape

(5068,)

In [14]:
X_test.shape

(1267,)

**Attempt 1 :**
- 1.using sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, bigram, and trigrams.
- use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean' distance.
- print the classification report.

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
clf1 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 3))),
    ('knn', KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean'))

])

In [17]:
clf1.fit(X_train, y_train)

In [18]:
y_pred = clf1.predict(X_test)

In [19]:
from sklearn.metrics import classification_report

In [20]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.80      0.73       633
           1       0.75      0.60      0.67       634

    accuracy                           0.70      1267
   macro avg       0.71      0.70      0.70      1267
weighted avg       0.71      0.70      0.70      1267



**Attempt 2 :**
- 1.using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, bigram, and trigrams.
- use KNN as the classifier with n_neighbors of 10 and metric as 'cosine' distance.
- print the classification report.

In [21]:
clf2 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 3))),
    ('knn', KNeighborsClassifier(n_neighbors = 10, metric = 'cosine'))

])

In [22]:
clf2.fit(X_train, y_train)

In [23]:
y_pred = clf2.predict(X_test)

In [24]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.74      0.74       633
           1       0.74      0.74      0.74       634

    accuracy                           0.74      1267
   macro avg       0.74      0.74      0.74      1267
weighted avg       0.74      0.74      0.74      1267



**Attempt 3 :**
- 1.using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with only trigrams.
- use RandomForest as the classifier.
- print the classification report.

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
clf3 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (3, 3))),
    ('rf', RandomForestClassifier())

])

In [27]:
clf3.fit(X_train, y_train)

In [28]:
y_pred = clf3.predict(X_test)

In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.92      0.86       633
           1       0.91      0.78      0.84       634

    accuracy                           0.85      1267
   macro avg       0.86      0.85      0.85      1267
weighted avg       0.86      0.85      0.85      1267



**Attempt 4 :**
- 1.using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note**
- using CountVectorizer with both unigram and bigrams.
- use Multinomial Naive Bayes as the classifier with an alpha value of 0.75.
- print the classification report.


In [30]:
from sklearn.naive_bayes import MultinomialNB

In [31]:
clf4 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 2))),
    ('mnb', MultinomialNB())

])

In [32]:
clf4.fit(X_train, y_train)

In [33]:
y_pred = clf4.predict(X_test)

In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.81      0.89       633
           1       0.84      0.98      0.91       634

    accuracy                           0.90      1267
   macro avg       0.91      0.90      0.90      1267
weighted avg       0.91      0.90      0.90      1267



#### Use text pre-processing to remove stop words, punctuations and apply lemmatization

In [35]:
import spacy

nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens)

In [36]:
df['preprocessed_text'] = df['text'].apply(preprocess)

In [37]:
df.head()

Unnamed: 0,text,label,label_num,preprocessed_text
0,"You Can Smell Hillary’s FearDaniel Greenfield,...",FAKE,0,smell Hillary FearDaniel Greenfield Shillman J...
1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE,0,watch Exact Moment Paul Ryan Committed Politic...
2,Kerry to go to Paris in gesture of sympathyU.S...,REAL,1,Kerry Paris gesture sympathyu.s. Secretary Sta...
3,Bernie supporters on Twitter erupt in anger ag...,FAKE,0,Bernie supporter Twitter erupt anger DNC try w...
4,The Battle of New York: Why This Primary Matte...,REAL,1,Battle New York Primary MattersIt primary day ...


In [38]:
X_train, X_test, y_train, y_test = train_test_split(df.preprocessed_text, df.label_num, test_size = 0.2, random_state = 2022, stratify = df.label_num)

**Attempt1 :**
- 1.using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with only trigrams.
- use RandomForest as the classifier.
- print the classification report

In [39]:
clf5 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (3, 3))),
    ('rf', RandomForestClassifier())

])

In [40]:
clf5.fit(X_train, y_train)

In [41]:
y_pred = clf5.predict(X_test)

In [42]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.99      0.81       633
           1       0.98      0.56      0.72       634

    accuracy                           0.78      1267
   macro avg       0.84      0.78      0.77      1267
weighted avg       0.84      0.78      0.77      1267



**Attempt2 :**
- 1.using the sklearn pipeline module create a classification pipeline to classify the Data.

**Note:**
- using CountVectorizer with unigram, Bigram, and trigrams.
- use RandomForest as the classifier.
- print the classification report.

In [43]:
clf6 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 3))),
    ('rf', RandomForestClassifier())

])

In [44]:
clf6.fit(X_train, y_train)

In [45]:
y_pred = clf6.predict(X_test)

In [46]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.89      0.89       633
           1       0.89      0.88      0.89       634

    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267

