##### Part 1: Data Preprocessing using NLP Techniques

In this first section, we will perform some exploratory data analysis on the data provided and then preprocess it.

1. Import the required libraries.

In [71]:
import pandas as pd
import numpy as np
import nltk 
from nltk import word_tokenize
from nltk.corpus import stopwords

2. Read the dataset from the CSV file

In [72]:
df = pd.read_csv("fake_news_train.csv")

3. After loading the dataset, perform some primary exploratory data analysis to understand the dataset provided. You can use simple pandas methods and attributes such as `head()`, `shape` and `info()`.

In [73]:
# Exploratory data analysis to familiarize yourself with the data
df.head()


|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [74]:
df.shape

(20800, 5)

In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


4. Check whether the dataset contains null values.



In [76]:
# Check for null values and if any, fill them with an empty string 
df.isnull().sum()


id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [77]:
df.fillna('', inplace=True)

In [78]:
df.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [79]:
df['text'] = df['text'].astype(str)


5. For data preprocessing, we will focus on the 'text' column of the DataFrame, which contains the content of each news article. We will apply tokenization, the first text preprocessing method covered in Quest 1.


In [80]:
# Define a function to tokenize the text given
def tokenize_text(text):
    return text.split() 

# Apply the tokenize_text function to the 'text' column of the DataFrame and create a new column 'tokenized_text'
df['tokenized_text'] = df['text'].apply(tokenize_text)
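To see what whitespace tokenization does to a sentence, the sketch below runs the same `tokenize_text` function on a made-up sample string. It also shows a hypothetical alternative, `tokenize_words`, that is not part of this notebook and is included only to illustrate that `split()` keeps punctuation attached to words.

```python
import re

def tokenize_text(text):
    # Same whitespace tokenizer as in the cell above
    return text.split()

def tokenize_words(text):
    # Hypothetical alternative: keep only runs of letters, digits and apostrophes
    return re.findall(r"[A-Za-z0-9']+", text)

# Made-up sample sentence for illustration
sample = "Why the Truth Might Get You Fired, says report."

whitespace_tokens = tokenize_text(sample)
word_tokens = tokenize_words(sample)

print(whitespace_tokens)  # punctuation stays attached, e.g. "Fired," and "report."
print(word_tokens)        # punctuation is dropped
```

Whether attached punctuation matters depends on the downstream vectorizer: with whitespace tokens, "Fired," and "Fired" count as two different vocabulary entries.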

6. Remove the stop words from the tokens.

Create a list `stop_words` that contains the NLTK predefined stopwords.

In [81]:
stop_words = set(nltk.corpus.stopwords.words('english'))

7. Define a function that removes stop words from a list of tokens. Take note that the NLTK predefined stopwords are in lowercase, while some of the tokens in your current DataFrame contain uppercase letters.

In [None]:
# Define a function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    if tokens is None:
        return []  # Return an empty list if tokens is None
    return [word for word in tokens if word.lower() not in stop_words]

# Apply the remove_stopwords function to the 'tokenized_text' column
df['cleaned_text'] = df['tokenized_text'].apply(remove_stopwords)
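The effect of lowercasing before the membership test can be checked on a tiny example. The sketch below uses a hypothetical five-word stop-word set and a made-up token list rather than the full NLTK list, so it runs without any downloads.

```python
# Hypothetical mini stop-word set; the notebook itself uses NLTK's English list
demo_stop_words = {"the", "is", "a", "of", "in"}

# Made-up tokens with mixed casing
tokens = ["The", "report", "is", "A", "summary", "of", "events", "In", "Iran"]

# Case-sensitive test: capitalized stop words slip through
case_sensitive = [word for word in tokens if word not in demo_stop_words]

# Lowercase only for the comparison, so kept tokens retain their original casing
case_insensitive = [word for word in tokens if word.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Only the comparison is lowercased here, which matches the behaviour of `remove_stopwords` above: tokens such as "Iran" keep their capitalization.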

##### Part 2: Separating the dataset and Vectorization

8. We will separate the dataset into features and targets. This allows us to clearly define the inputs and outputs of our model.

The features are the independent variables that we use to predict the target variable, which is the dependent variable we want to predict. In this case, we are using the text from the article to determine if the article is reliable or unreliable. Reliable articles are labelled '0' in the `label` column while unreliable articles are labelled '1'.

In [None]:
X_df = df['cleaned_text']  # Tokenized text with stop words removed (step 7)
y_df = df['label']  # Targets: 0 = reliable, 1 = unreliable

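9. Analyse the `y_df` data and check its data type. The `df.info()` output above reports `label` as `int64`; the cast below guarantees integer targets even if the labels were read in as a `str` or `object` data type.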


In [None]:
y_df = y_df.astype(int)

10. As machine learning models take in numerical values for their inputs, we have to convert our feature data into numerical format as well. This is where we can incorporate our vectorization skills covered in Quest 2!

Import the `TfidfVectorizer` and create a TfidfVectorizer object. Since the features we are working with are in tokens, we have to specify this in the parameter as the vectorizer takes in strings by default. 

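We set the `tokenizer` parameter to a lambda function that simply returns each document as-is. We also set `lowercase=False`: the default lowercasing step expects each document to be a string, so it cannot be applied to lists of tokens, and disabling it leaves the tokens unmodified.

After this, fit and transform the vectorizer on the tokenized documents (the `cleaned_text` column in the cell below). This produces the TFIDF matrix.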

In [None]:
# Perform vectorization using the TFIDF Vectorizer and fit and transform the tokenized documents
from sklearn.feature_extraction.text import TfidfVectorizer

# The lambda function just returns the tokens as they are
tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)

# Fit and transform once; a ValueError (for example, an empty vocabulary) would surface here
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_text'])
print(tfidf_matrix.shape)
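To make the numbers in the TFIDF matrix less opaque, here is a minimal hand-rolled sketch on three made-up token lists. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which are the defaults described in the scikit-learn documentation. The helper `tfidf_rows` is hypothetical and is not how the library is implemented internally.

```python
import math
from collections import Counter

# Made-up tokenized documents
docs = [
    ["fake", "news", "spreads"],
    ["news", "report"],
    ["fake", "fake", "claim"],
]

def tfidf_rows(docs):
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (assumed to match the vectorizer defaults)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])  # L2-normalize each row
    return vocab, rows

vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first document, "spreads" appears in only one document while "news" appears in two, so "spreads" receives the larger weight even though both occur once.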

In [None]:
df['text_length'] = df['cleaned_text'].apply(len)
print(df[['cleaned_text', 'text_length']].head())

##### Part 3: Training and testing the model

11. We will use a `LogisticRegression` model to create our fake news classifier.

We will split the data into a training set and a testing set to evaluate the performance of our Logistic Regression model. The training set is used to train the model, while the testing set is used to evaluate the model's performance on new data that it has not seen before. This helps us to determine how well the model will generalize to new data and to detect overfitting.

Now with our TFIDF matrix and target data, we can split the data into testing and training sets using `train_test_split`. 

In [None]:
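# `label` already stores integers (0 = reliable, 1 = unreliable), so copy it across directly;
# a string comparison such as x == "REAL" never matches these integers and would label every row 1
df['fake'] = df['label'].astype(int)
df = df.drop("label", axis=1)  # Drop the original 'label' column if necessary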
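A quick sanity check on the target column can catch a collapsed class before any model is trained. The sketch below uses a made-up list of integer labels, matching the `int64` dtype reported by `df.info()`, to show how a comparison against a string sends every row to class 1 while a plain integer cast keeps both classes.

```python
from collections import Counter

# Made-up integer labels: 0 = reliable, 1 = unreliable
labels = [1, 0, 1, 1, 0]

# An integer never equals the string "REAL", so the condition is always False
collapsed = [0 if x == "REAL" else 1 for x in labels]

# Casting to int leaves the two classes intact
preserved = [int(x) for x in labels]

print(Counter(collapsed))
print(Counter(preserved))
```

If the count for one of the classes is missing, a classifier fitted on that target has nothing to separate.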

In [None]:
# Split the data into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, df['fake'], test_size=0.3, random_state=0, stratify=df['fake'])

print(y_train.value_counts())
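The `stratify` argument asks for the class proportions to be kept roughly the same in both splits. As a rough illustration of the idea, and not of scikit-learn's actual implementation, the hypothetical helper `stratified_split` below shuffles the row indices of each class separately and sends the same fraction of each class to the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    # Group row indices by class so each class is split on its own
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up targets: 60 reliable (0) and 40 unreliable (1) articles
labels = [0] * 60 + [1] * 40

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print(train_counts)
print(test_counts)
```

Both splits keep the 60/40 class ratio of the full data, which is what printing `value_counts()` for `y_train` and `y_test` lets you verify in the notebook.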


12. Import the necessary modules and create a LogisticRegression object. Fit the model according to the X and y training data produced above. 

In [None]:
print(y_train.value_counts())
print(y_test.value_counts())

In [68]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)     

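For intuition about what a fitted binary logistic regression does at prediction time, the sketch below computes a weighted sum of the features, passes it through the sigmoid function and thresholds the resulting probability at 0.5. The weights, bias and feature vectors are made up for illustration; a real model learns its coefficients from the training data.

```python
import math

def sigmoid(z):
    # Maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(weights, bias, features):
    # Linear score followed by the sigmoid gives the probability of class 1
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two features
weights = [2.0, -1.0]
bias = -0.5

p_first = predict_proba_one(weights, bias, [1.0, 0.0])   # score z = 1.5
p_second = predict_proba_one(weights, bias, [0.0, 1.0])  # score z = -1.5

# Predict class 1 when the probability is at least 0.5
label_first = int(p_first >= 0.5)
label_second = int(p_second >= 0.5)

print(round(p_first, 3), label_first)
print(round(p_second, 3), label_second)
```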

13. Now that the model has been trained, obtain the predictions of the model using the test data set.

In [69]:
y_pred = logreg.predict(X_test)


14. Now we need to evaluate how well the model did. Here, we use three evaluation metrics to assess the performance of our model. These are the metrics we will be working with:

+ **Accuracy**: Accuracy is the proportion of correct predictions made by the model out of all the predictions made. It is calculated as the ratio of the number of correct predictions to the total number of predictions.
+ **Precision**: Precision is the proportion of true positives out of all the positive predictions made by the model. It is calculated as the ratio of the number of true positives to the total number of positive predictions.
+ **Recall**: Recall is the proportion of true positives out of all the actual positive cases in the dataset. It is calculated as the ratio of the number of true positives to the total number of actual positive cases.
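The three definitions above can be worked through by hand on a small example. The sketch below uses made-up true and predicted labels, with 1 as the positive class, and computes each score from the counts of true positives, false positives and false negatives.

```python
# Made-up labels for eight articles (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy_demo = (tp + tn) / len(pairs)  # correct predictions / all predictions
precision_demo = tp / (tp + fp)         # true positives / predicted positives
recall_demo = tp / (tp + fn)            # true positives / actual positives

print(tp, tn, fp, fn)
print(accuracy_demo, precision_demo, recall_demo)
```

Here the model finds three of the four actual positives (recall 0.75), but two of its five positive predictions are wrong (precision 0.6).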

Import the following metrics and calculate the scores by comparing the test targets to the predicted targets of the test set. 

In [67]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)


In machine learning, multiple evaluation metrics are typically used, as a single metric may not provide a complete picture of the model's performance. Different metrics capture different aspects of model performance, and evaluating a model using multiple metrics helps to provide a better understanding of how well the model is performing.

By using multiple metrics for evaluation, we can identify the strengths and weaknesses of the model to make informed decisions about how to improve its performance. It is important to choose evaluation metrics that are relevant to the data you are dealing with, and consider the trade-offs between different metrics.
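A small made-up case shows why a single metric can mislead. Assume 100 articles of which only 5 belong to the positive class, and a trivial model that always predicts the negative class.

```python
# Made-up imbalanced data: 5 positives, 95 negatives
y_true_imb = [1] * 5 + [0] * 95
# A trivial model that always predicts the negative class
y_hat_imb = [0] * 100

correct = sum(1 for t, p in zip(y_true_imb, y_hat_imb) if t == p)
accuracy_imb = correct / len(y_true_imb)

true_pos = sum(1 for t, p in zip(y_true_imb, y_hat_imb) if t == 1 and p == 1)
recall_imb = true_pos / sum(y_true_imb)

print("accuracy:", accuracy_imb)
print("recall:", recall_imb)
```

Accuracy looks strong because the negative class dominates, while recall reveals that not a single positive case was found. Precision is undefined in this case, since the model makes no positive predictions at all.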

15. Print out each of the scores for your model below.

In [None]:
# print() already inserts a space between arguments, so the strings carry no trailing space
print("The accuracy score is", accuracy)
print("The precision score is", precision)
print("The recall score is", recall)