<img src="https://www.cems.uwe.ac.uk/~pa-legg/uwecyber/images/uwe.png" width=300>
<img src="https://www.cems.uwe.ac.uk/~pa-legg/uwecyber/images/uwecyber_acecse_200.jpg" width=300>

# UFCFEL-15-3 Security Data Analytics and Visualisation
# Portfolio Task 4: Fake News Classification (2021)
---

The completion of this worksheet is worth **30%** towards your portfolio for the UFCFEL-15-3 Security Data Analytics and Visualisation (SDAV) module.

### Task
***

In this task, you will need to develop a system that can perform text analytics to classify news articles as either fake or true. You should look to use the ***scikit-learn*** Python library as part of your work - you will find it useful to conduct research and to examine the user guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html).

The provided dataset has two CSV files: one contains 23481 fake news articles, the other contains 21417 true news articles. The data consists of the following columns: title (contains news headlines), text (contains news content/article), subject (type of news), and date (date the news was published).

You will need to consider how to pre-process the data so that it is suitable for further analysis and for use with a machine learning classifier. This will include how the data is structured, how the output class is denoted, and any cleansing of the text that may be required (e.g., removal of stopwords, stemming, n-grams).

You should then show how you can deploy 3 different ML classifiers on the data, using the scikit-learn library to achieve this. As an example, you may choose the following 3 algorithms (or you may explore suitable alternatives):

- logreg_cv = LogisticRegression(random_state=0)
- dt_cv=DecisionTreeClassifier()
- nb_cv=MultinomialNB(alpha=0.1) 

You should make use of visualisation to illustrate the distinguishing characteristics of the dataset classes, including the news categories and the top 20 n-grams of the data in each respective class.

You will need to describe the analytical process you have taken using Markdown, and you will need to report your overall final accuracy for the classifier.

### Assessment and Marking
***

| Criteria | 0-39 | 40-49 | 50-59 | 60-69 | 70-84 | 85-100 |
| --- | --- | --- | --- | --- | --- | --- |
| **Suitable use of text pre-processing (30%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Use of 3 Machine Learning classifiers and reporting their performance (30%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Visualisation techniques to understand the pre-processing and classification stages (20%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Clarity and professional presentation (20%)** | No evidence of progress | A limited attempt to address this criterion | Some evidence of markdown commentary | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |

You will need to implement your final solution in the Notebook format, with Markdown annotation - you should use this notebook file as a template for your submission. You are also expected to complete the assignment self-assessment.

Your submission should include:
- HTML export of your complete assignment in notebook format.
- Original ipynb source file of your notebook.

### Self-Assessment
---

For each criterion, please reflect on the marking rubric and indicate what grade you would expect to receive for the work that you are submitting. For your own personal development and learning, it is important to reflect on your work and to attempt to assess it carefully. Do think carefully about both positive aspects of your work, as well as any limitations you may have faced.

- **Suitable use of text pre-processing (30%)**: You estimate that your grade will be 75.

- **Use of 3 Machine Learning classifiers and reporting their performance (30%)**: You estimate that your grade will be 85.

- **Visualisation techniques to understand the pre-processing and classification stages (20%)**: You estimate that your grade will be 65.

- **Clarity and professional presentation (20%)**: You estimate that your grade will be 65.

Please provide a minimum of two sentences to comment and reflect on your own self-assessment: This task is very challenging for me. I will find a more suitable way to finish the work.


### Contact
---

Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.

---

In [1]:
# !pip install pandas numpy matplotlib seaborn wordcloud scikit-learn
# !pip install nltk

#Basic libraries
import pandas as pd 
import numpy as np 

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Miscellaneous libraries
from collections import Counter

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Loading the data


In [2]:
#reading the fake and true datasets
fake_news = pd.read_csv('./T4_data/Fake.csv')
true_news = pd.read_csv('./T4_data/True.csv')

# print ("Fake news: ", fake_news.shape)
# print ("True news: ", true_news.shape)

In [3]:
## fake_news.head(10)

In [4]:
## true_news.head(10)

The above code will load in the two datasets, showing the overall size of the datasets and also the first 10 rows from each dataset.

Now it is over to you...

## Preprocessing the data

In [5]:
# Label fake news as 0 in a new 'state' column
fake_news.insert(loc=4, column='state', value=0)
fake_news = fake_news.dropna()
# Drop rows whose title is just a URL
fake_news = fake_news[~fake_news['title'].str.startswith('http')]

In [6]:
true_news.insert(loc=4, column='state', value=1)
true_news = true_news.dropna()

In [7]:
data = pd.concat([fake_news,true_news])
# data
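
The cells above label and merge the two datasets but do not yet cleanse the text itself. A minimal sketch of one possible cleaning step is shown below, assuming that lowercasing, URL removal and stripping of non-alphabetic characters are wanted. `clean_text` and the tiny `STOP_WORDS` set are hypothetical names used only for illustration; a fuller stop-word list (for example from NLTK or scikit-learn) would normally be used.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def clean_text(text):
    """Lowercase, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

sample = "BREAKING: The Senate votes on a bill! Read more at https://example.com/news"
cleaned = clean_text(sample)
print(cleaned)
```

In the notebook, a function like this could be applied with `data['text'].apply(clean_text)` before vectorising, although whether each step helps accuracy would need to be checked on the data.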

Import the machine learning and text processing libraries

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from nltk import pos_tag, word_tokenize,sent_tokenize
from nltk.util import ngrams
import re
from nltk.stem import WordNetLemmatizer
from itertools import chain
# from sklearn.utils import shuffle


In [9]:
# def cut_word(text):
#     sens = sent_tokenize(text)
#     cw = [word_tokenize(sentence) for sentence in sens]
#     return cw
# data['title'] = data['title'].apply(word_tokenize)
# data['text'] = data['text'].apply(word_tokenize)
# data.head()


N-gram generation method

In [10]:
def ngram_list(n, word_list, stop_word_list=None):
        """
        Generate ngrams with width n excluding those that are entirely formed of stop words

        Args:
            n (int): i.e. 1, 2, 3...
            word_list (list of str): list of words
            stop_word_list (list of str, Optional): list of words that should be excluded while obtaining
                                                    list of ngrams

        Returns:
            list of str: List of ngrams formed from the given word list except for those that have all their tokens in
                         stop words list
        """
        stop_word_set = set(stop_word_list) if stop_word_list else set()
        all_ngrams = ngrams(word_list, n)
        ngram_list = []
        for ngram in all_ngrams:
            lowered_ngram_tokens = map(lambda token: token.lower(), ngram)
            if any(token not in stop_word_set for token in lowered_ngram_tokens):
                ngram_list.append(' '.join(ngram))
        return ngram_list
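
To obtain the top 20 n-grams for each class, the n-grams from every article in that class can be pooled and counted. The sketch below shows the idea with the standard library only. `top_ngrams` is a hypothetical helper, whitespace splitting stands in for `word_tokenize`, and the two toy documents are made up for illustration.

```python
from collections import Counter

def top_ngrams(texts, n=2, k=20):
    """Return the k most frequent n-grams across an iterable of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # stand-in for nltk.word_tokenize
        # zip over n shifted copies of the token list to form n-grams
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(gram) for gram in grams)
    return counts.most_common(k)

toy_docs = [
    "the white house said on monday",
    "the white house denied the report",
]
top_bigrams = top_ngrams(toy_docs, n=2, k=3)
print(top_bigrams)
```

With the notebook's data, the same helper could be called once on the rows where `state` is 0 and once on the rows where `state` is 1, and the two results plotted side by side.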

Visual analysis

In [None]:
df = pd.DataFrame([],columns=["Category","Trigram"])
df["Category"] = data["subject"]
# Keep the first 20 trigrams of each article
df["Trigram"] = data["text"].apply(lambda x:ngram_list(3, word_tokenize(x))[:20])

In [None]:
plt.title("Number of articles in each category")
sns_plot = sns.countplot(x="Category", data=df, order=df['Category'].value_counts().index)
sns_plot.figure.set_size_inches(12,8)
plt.ylabel("Number of articles")
plt.xlabel("Category")

In [None]:
from wordcloud import WordCloud
# import numpy as np
list_title = true_news['title'].values
txt =  " ".join(np.ravel(list_title))
w = WordCloud(width = 1000,height = 700,background_color = "white",max_words = 30)
w.generate(txt)       # load the text
w.to_file("wordcloud.png")

In [None]:
# Select the feature columns (X) and the target label (Y)
X = data.iloc[:, 0:4]
Y = data.iloc[:, 4]

In [None]:
# Split into training and test sets
# train_test_split returns X_train, X_test, y_train, y_test, in that order
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.25, random_state=1)

In [None]:
# Calculate TF-IDF features, fitted on the training text only
vectorizer = TfidfVectorizer(binary = False, decode_error = 'ignore',stop_words = 'english')
X_train_counts_tf = vectorizer.fit_transform(x_train["text"].values.tolist())
# print(vectorizer.get_feature_names())

In [None]:
# Algorithm evaluation benchmark
# The accuracy of each algorithm is compared using 5-fold cross-validation
num_folds = 5
seed = 0
scoring = 'accuracy'
# Evaluate the algorithms
models = {}
models['LR'] = LogisticRegression()
models['CART'] = DecisionTreeClassifier()
models['MNB'] = MultinomialNB()
results = []

for key in models:
    kfold = KFold(n_splits=num_folds,random_state=seed,shuffle=True)
    cv_results = cross_val_score(models[key], X_train_counts_tf, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
#     print('%s:%f(%f)' %(key,cv_results.mean(),cv_results.std()))

In [None]:
# Parameter tuning
param_grid = {}
param_grid['C'] = [0.1,5,13,15]
model = LogisticRegression()
kfold = KFold(n_splits=num_folds,random_state=seed,shuffle=True)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=y_train)
# print('best : %s param %s' % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Ensemble algorithms
ensembles = {}
ensembles['RF'] = RandomForestClassifier()
ensembles['AB'] = AdaBoostClassifier()
results = []
for key in  ensembles:
    kfold = KFold(n_splits=num_folds, random_state=seed,shuffle=True)
    cv_results = cross_val_score(ensembles[key], X_train_counts_tf, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
#     print('%s : %f (%f)' % (key, cv_results.mean(), cv_results.std()))

In [None]:
# Ensemble algorithm tuning
# param_grid = {}
# param_grid['n_estimators'] = [10,100,150,200]
# model = RandomForestClassifier()
# kfold = KFold(n_splits=num_folds,random_state=seed,shuffle=True)
# grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
# grid_result = grid.fit(X=X_train_counts_tf, y=y_train)
# print('best : %s param %s' % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Generate the final model and evaluate it on the held-out test set
# model = LogisticRegression(C=13)
# model.fit(X_train_counts_tf, y_train)
# X_test_counts_tf = vectorizer.transform(x_test["text"].values.tolist())
# predictions = model.predict(X_test_counts_tf)
# print(accuracy_score(y_test, predictions))
# print(classification_report(y_test, predictions))
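
The held-out test text has to be transformed with the vectorizer that was fitted on the training text before it is passed to `predict`. One way to keep these steps together is a scikit-learn pipeline. The sketch below is an illustration under stated assumptions, not the final solution: it uses a tiny made-up corpus rather than the assignment data, and `MultinomialNB(alpha=0.1)` is chosen only as an example classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 0 = fake, 1 = true (same coding as the 'state' column)
texts = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking hoax",
    "secret hoax exposed in shocking video",
    "unbelievable shocking secret revealed",
    "senate passes budget bill on tuesday",
    "minister announces new budget measures",
    "parliament votes on trade bill",
    "court rules on budget dispute tuesday",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Note the return order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# The pipeline fits TF-IDF on the training text only and reuses it at predict time
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB(alpha=0.1))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print("Test accuracy:", test_accuracy)
```

Because the vectorizer lives inside the pipeline, the same object can also be passed to `cross_val_score` or `GridSearchCV` with the raw text, so that TF-IDF is refitted within each fold.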