# Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (a corpus). The score increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus (data set).

## Terminologies:

1. Term Frequency: The term frequency of a word t in a document d is the number of times t occurs in d. A word that appears more often in a document is therefore treated as more relevant to it, which is reasonable. Since the ordering of terms is not significant, we can describe each document as a vector in the bag-of-words model: the vector has one entry for each distinct term in the document, and the value of that entry is the term frequency.

    The weight of a term that occurs in a document is simply proportional to the term frequency.

    tf(t,d) = count of t in d / number of words in d

2. Document Frequency: This measures how common a term is across the whole corpus, and is very similar to TF. The only difference is that TF counts the occurrences of a term t within a single document d, while DF counts how many of the N documents in the corpus contain t. In other words, DF is the number of documents in which the word is present.

    df(t) = number of documents containing t

3. Inverse Document Frequency: This measures how informative the word is. The key aim of a search is to locate the documents that best match the query. Since tf treats all terms as equally significant, term frequencies alone are not enough to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

    df(t) = N(t)

    where,
    
    * df(t) = Document frequency of a term t
    
    * N(t) = Number of documents containing the term t

    Term frequency counts the instances of a term within a single document, whereas document frequency counts the separate documents in which the term appears, so it depends on the entire corpus. Now let's look at the definition of inverse document frequency. The IDF of a word is the number of documents in the corpus divided by the document frequency of the word.

    idf(t) = N/ df(t) = N/N(t)

    A more common word should be considered less significant, but this raw ratio seems too harsh: it grows very quickly as a term becomes rarer, especially in a large corpus. We therefore take the logarithm (with base 2) of the inverse document frequency. So the idf of the term t becomes:

    idf(t) = log(N/ df(t))
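The three definitions above can be sketched in a few lines of plain Python. This is only an illustration: the toy corpus and the function names `tf`, `df`, and `idf` are made up for this example, and the logarithm uses base 2 as stated above.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Term frequency: count of `term` in `doc` / number of words in `doc`."""
    return doc.count(term) / len(doc)

def df(term, docs):
    """Document frequency: number of documents that contain `term`."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    """Inverse document frequency: log base 2 of N / df(t)."""
    return math.log2(len(docs) / df(term, docs))

print(tf("the", corpus[0]))  # "the" occurs 2 times among 6 words
print(df("cat", corpus))     # "cat" appears in two of the three documents
print(idf("the", corpus))    # "the" is in every document, so log2(3/3)
print(idf("mat", corpus))    # "mat" is in a single document, so log2(3/1)
```

Note that `idf` as written fails with a division by zero for a term that appears in no document, which is one reason practical implementations usually smooth the ratio.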

TF-IDF is a widely used metric for determining how significant a term is to a document in a corpus. It is a weighting scheme that assigns a weight to each word in a document by combining its term frequency (tf) with its inverse document frequency (idf): tf-idf(t, d) = tf(t, d) × idf(t). Words with higher weights are deemed to be more significant to that document.
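As a rough illustration of the weighting, the sketch below multiplies tf by idf for every term of one document in a small, made-up corpus. The helper name `tfidf_scores` and the corpus are assumptions for this example only.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tfidf_scores(doc, docs):
    """Return {term: tf(t, d) * idf(t)} for every distinct term in `doc`."""
    n_docs = len(docs)
    scores = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)                        # share of the document
        df = sum(1 for d in docs if term in d)       # documents containing term
        scores[term] = tf * math.log2(n_docs / df)   # tf * idf
    return scores

scores = tfidf_scores(corpus[0], corpus)

# Print the terms of the first document from highest to lowest weight.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>4}  {score:.3f}")
```

In this corpus "the" is the most frequent word in the first document, yet it scores zero because it occurs in every document, while "mat", which occurs in only one document, receives the highest weight.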


## Fake News Detection using TF-IDF

In [35]:
# Import required libraries
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

In [15]:
# Load and inspect data
data = pd.read_csv("./fake_and_real_news.csv")
data.head()

| Unnamed: 0 | Text | label |
| --- | --- | --- |
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [16]:
# Unique values in label column
print(data["label"].unique())

# Class distribution of the data
print(data["label"].value_counts())

['Fake' 'Real']
label
Fake    5000
Real    4900
Name: count, dtype: int64


In [17]:
# So, the data contains two labels: "Fake" and "Real".
# Now we will convert them into integer labels
# Representing Real with 0 and Fake with 1

# Define the label mappings
label_map = {
    "Real": 0,
    "Fake": 1
}

# Apply label mappings to the label column
data["label"] = data["label"].apply(lambda x: label_map[x])
data.head()

| Unnamed: 0 | Text | label |
| --- | --- | --- |
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | 1 |
| 1 | U.S. conservative leader optimistic of common ... | 0 |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | 0 |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | 1 |
| 4 | Democrats say Trump agrees to work on immigrat... | 0 |


In [20]:
# Function to preprocess the text
def preprocess(text):
    text = text.lower()
    doc = nlp(text)
    
    # Keep the lemma of every purely alphabetic token; is_alpha already
    # excludes punctuation, digits, and tokens such as "u.s."
    processed_tokens = [token.lemma_ for token in doc if token.is_alpha]
    
    return " ".join(processed_tokens)

data["clean_text"] = data["Text"].apply(preprocess)
data.head()

| Unnamed: 0 | Text | label | clean_text |
| --- | --- | --- | --- |
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | 1 | top trump surrogate brutally stab he in the ba... |
| 1 | U.S. conservative leader optimistic of common ... | 0 | conservative leader optimistic of common groun... |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | 0 | trump propose tax overhaul stir concern on def... |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | 1 | court force ohio to allow million of illegally... |
| 4 | Democrats say Trump agrees to work on immigrat... | 0 | democrats say trump agree to work on immigrati... |


In [21]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data["clean_text"],
    data["label"],
    shuffle=True,
    random_state=42
)

# Shapes
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (7425,)
X_test shape:  (2475,)
y_train shape:  (7425,)
y_test shape:  (2475,)
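The 7425/2475 shapes follow from the default `test_size` of `train_test_split`, which is 0.25 when neither `test_size` nor `train_size` is given. The sketch below spells that default out and additionally passes `stratify`, an optional argument the notebook does not use, to keep the class ratio identical in both parts. The toy texts and labels are made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 "Real" (0) and 40 "Fake" (1) examples.
texts = [f"document {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,    # the default, written out explicitly
    stratify=labels,   # assumption: not used in the notebook above
    shuffle=True,
    random_state=42,
)

print(Counter(y_tr))  # class counts in the training part
print(Counter(y_te))  # class counts in the test part
```

Stratification matters most for imbalanced data. With classes as close to balanced as the 5000/4900 split here, a plain shuffled split is usually fine.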


In [22]:
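# Extract TF-IDF vectors: fit the vocabulary and idf weights on the training
# set only, then reuse them to transform the test set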
vectorizer = TfidfVectorizer(max_features=2000)

X_train_vect = vectorizer.fit_transform(X_train.values)
X_test_vect = vectorizer.transform(X_test.values)
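`TfidfVectorizer` does not use exactly the textbook formulas from the terminology section. With its default settings it uses raw term counts for tf, a smoothed idf with the natural logarithm, `ln((1 + N) / (1 + df(t))) + 1`, and then scales each document vector to unit L2 norm. The sketch below reproduces this by hand on a made-up toy corpus and compares the result with the library output. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Raw term counts: the tf used by TfidfVectorizer with default settings.
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Smoothed idf with the natural log: ln((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Because of the `+ 1` term, a word that occurs in every document still keeps a non-zero weight, unlike with the plain `log(N / df(t))` definition. The `max_features=2000` argument used above only limits the vocabulary to the 2000 terms with the highest total count in the training corpus.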

In [30]:
# Vocabulary
vectorizer.vocabulary_

{'fact': 635,
 'check': 292,
 'intelligence': 885,
 'prove': 1387,
 'trump': 1844,
 'liar': 1006,
 'he': 787,
 'know': 962,
 'about': 5,
 'russian': 1542,
 'attack': 150,
 'donald': 528,
 'seem': 1583,
 'to': 1812,
 'be': 179,
 'when': 1944,
 'it': 913,
 'come': 338,
 'cyber': 435,
 'security': 1580,
 'threat': 1801,
 'display': 515,
 'this': 1796,
 'once': 1209,
 'again': 48,
 'during': 543,
 'the': 1784,
 'second': 1574,
 'presidential': 1341,
 'state': 1690,
 'that': 1783,
 'didn': 495,
 'whether': 1946,
 'there': 1789,
 'in': 855,
 'regard': 1450,
 'hack': 767,
 'and': 93,
 'election': 563,
 'but': 245,
 'an': 90,
 'lie': 1009,
 'have': 784,
 'brief': 230,
 'on': 1208,
 'very': 1888,
 'nbc': 1144,
 'news': 1157,
 'senior': 1591,
 'official': 1201,
 'government': 750,
 'attempt': 151,
 'interfere': 888,
 'discuss': 511,
 'with': 1966,
 'both': 222,
 'party': 1261,
 'candidate': 257,
 'leadership': 989,
 'since': 1636,
 'august': 156,
 'not': 1175,
 'at': 149,
 'point': 1306,
 'say':

In [None]:
# First sample of training text
X_train.iloc[0]

'fact check intelligence prove trump a liar he know about russian attack donald trump seem to be willfully ignorant when it come to a russian cyber security threat he display this once again during the second presidential state that he didn t know whether there be russian involvement in regard to hack and the election but that s an outright lie intelligence have be brief he on this very to nbc news a senior intelligence official assure nbc news that cybersecurity and the russian government s attempt to interfere in the election have be brief to and discuss extensively with both party candidate surrogate and leadership since mid august to profess not to know at this point be willful misrepresentation say the official the intelligence community have walk a very thin line in not take side but both candidate have all the information they need to be crystal clear thus prove without a doubt that trump either have amnesia or willingly lie during the debate regard cyber security and the debate

In [33]:
# First sample of vectorized training text
print(X_train_vect[0])

  (0, 635)	0.10897484299872526
  (0, 292)	0.04466108081061239
  (0, 885)	0.2376286430090696
  (0, 1387)	0.07421007749313771
  (0, 1844)	0.15989926920727565
  (0, 1006)	0.05224058778800144
  (0, 787)	0.12815259568389134
  (0, 962)	0.21121546315290027
  (0, 5)	0.06330092808473571
  (0, 1542)	0.2773721091147251
  (0, 150)	0.029375278996492168
  (0, 528)	0.041053240575949064
  (0, 1583)	0.06230621566430067
  (0, 1812)	0.22443838502276442
  (0, 179)	0.16242126278479108
  (0, 1944)	0.037470843568818184
  (0, 913)	0.06233617920567271
  (0, 338)	0.04333706573675609
  (0, 435)	0.21597121917856105
  (0, 1580)	0.11456870020661825
  (0, 1801)	0.03728177782649713
  (0, 515)	0.053160189338006925
  (0, 1796)	0.1145151615267912
  (0, 1209)	0.034263639382361245
  (0, 48)	0.03042061325307966
  :	:
  (0, 1112)	0.034236883037227064
  (0, 1947)	0.018072418380519657
  (0, 987)	0.02909231916372689
  (0, 1052)	0.025126633321196384
  (0, 190)	0.029207916026727457
  (0, 1981)	0.016167723758764765
  (0, 1115)	0.

In [37]:
# Training and testing the model
model = RandomForestClassifier()

model.fit(X_train_vect, y_train)

y_pred = model.predict(X_test_vect)

# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1267
           1       1.00      1.00      1.00      1208

    accuracy                           1.00      2475
   macro avg       1.00      1.00      1.00      2475
weighted avg       1.00      1.00      1.00      2475
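The vectorizing and training steps can also be bundled into a single scikit-learn pipeline, so that the vectorizer is always fitted on the training text only. The sketch below is a minimal illustration on made-up sentences, not the news dataset. The fixed `random_state` is an added assumption for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned text (0 = Real, 1 = Fake).
real = "senate passes budget vote"
fake = "shocking secret exposed online"
train_texts = [real, fake] * 5
train_labels = [0, 1] * 5

test_texts = [
    "today the senate passes a budget vote",
    "shocking secret exposed online by insiders",
]
test_labels = [0, 1]

# fit() fits the vectorizer and then the classifier on the training text;
# predict() reuses the fitted vocabulary and idf weights.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(train_texts, train_labels)
y_pred = pipeline.predict(test_texts)

# True labels go first, predictions second.
print(classification_report(test_labels, y_pred, target_names=["Real", "Fake"]))
```

Words in the test sentences that were never seen during training (such as "today" or "insiders") are simply ignored by the fitted vectorizer.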




## Pros of TF-IDF:

* Simplicity: TF-IDF is straightforward to calculate and understand, making it a good starting point for text analysis tasks. 

* Identifies important words: By considering both how often a word appears in a document (term frequency) and how rare it is across the corpus (inverse document frequency), TF-IDF effectively highlights key terms within a document. 

* Scalability: It can be applied to large datasets with a large number of documents efficiently.

* Largely language-agnostic: The weighting relies only on term counts, so it applies to any language once the text has been tokenized. Tokenization, lemmatization, and stop word lists remain language-specific.

## Cons of TF-IDF:

* No semantic understanding: TF-IDF only considers word frequency, not contextual meaning, so it can't distinguish between different senses of the same word (for example, "bank" of a river and "bank" as a financial institution), and it treats synonyms as unrelated terms.

* Ignores word order: As a bag-of-words model, TF-IDF doesn't take into account the sequence of words in a sentence, potentially missing important nuances.

* Sensitivity to stop words: Common words like "the" or "and" receive a low idf, but because they occur so often within a document they can still end up with noticeable TF-IDF weights unless they are handled with stop word removal.

* Potential for misinterpretation: In certain cases, rare words appearing in only a few documents might have high TF-IDF scores even if they are not semantically important.

* Curse of dimensionality: When dealing with a large vocabulary, TF-IDF vectors can become very high-dimensional, potentially causing computational issues. 
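Two of the drawbacks above, stop words and vocabulary size, can be mitigated directly through `TfidfVectorizer` arguments. The sketch below compares the vocabulary built with default settings, with the built-in English stop word list, and with a cap on the number of features. The toy corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

full = TfidfVectorizer().fit(docs)
no_stop = TfidfVectorizer(stop_words="english").fit(docs)  # drop stop words
capped = TfidfVectorizer(max_features=3).fit(docs)         # keep 3 terms only

print(sorted(full.vocabulary_))
print(sorted(no_stop.vocabulary_))
print(sorted(capped.vocabulary_))
```

`max_features` keeps the terms with the highest total count across the corpus, so on its own it tends to retain stop words. Combining it with stop word removal is a common choice.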
