# Final Project in Data Science
## Fake news Predictor

### Group 8
*Lykke Laura Sørensen (ltm712)* <br>
*Jeppe Ram Pedersen (lxd520)*

In [None]:
# importing the relevant libraries
import numpy as np
import pandas as pd
import re  # to be able to clean the text using Regular Expression
import nltk
nltk.download('punkt')
from nltk import word_tokenize 
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
!pip install sqlalchemy
import sqlalchemy

# reading the csv-file
# define relevant path for file
dataForMilestone2 = '../250t-raw.csv'

# read the csv-file and load into dataframe
dataRaw = pd.read_csv(dataForMilestone2)

data = dataRaw.copy()  # work on a copy so the original data stays easily accessible

We make a preliminary investigation of our data in order to filter out irrelevant tables. We also begin cleaning up and structuring the data.

In [None]:
# inspecting the data
print(data.content,"\n")
print(data.shape, "\n")

def fieldLengthPrinter(column):
  # print the column name and its number of distinct (non-null) values
  print(column, data[column].nunique())

for column in data:
  fieldLengthPrinter(column)

data[:4]

In [None]:
# TODO: More date formats need to be recognized by the regex

def clean_text(text):
    text = str(text).lower()  # lowercase
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace, tabs or newlines into a single space
    text = re.sub(r"[a-z]+ [0-9]{1,2}, [0-9]{4}", "<DATE>", text)  # dates in the format "month[letters] day, year" replaced by <DATE>
    # e-mail addresses replaced by <EMAIL>; this must run before the URL rule, which would otherwise match them first
    text = re.sub(r"\w+@\w+\.com", "<EMAIL>", text)
    text = re.sub(r"(https?://)?(\w+\.)+(com|net)[^\s\]]*", "<URL>", text)  # URLs replaced by <URL>
    text = re.sub(r"[0-9]+", "<NUM>", text)  # remaining numbers replaced by <NUM>
    return text
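
The TODO above notes that more date formats should be recognized. The following is a minimal sketch of one way to widen the date rule, assuming the text is lowercased first as in `clean_text`. The names `MONTHS`, `DATE_PATTERN` and `replace_dates` are hypothetical, and the set of formats covered is an assumption rather than a complete list.

```python
import re

# Month names and common abbreviations (lowercase), with an optional trailing dot.
MONTHS = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\.?"
)

DATE_PATTERN = re.compile(
    r"\b(?:"
    + MONTHS + r" [0-9]{1,2}(?:st|nd|rd|th)?,? [0-9]{4}"        # "january 5, 2018"
    + r"|[0-9]{1,2}(?:st|nd|rd|th)? " + MONTHS + r" [0-9]{4}"   # "5 january 2018"
    + r"|[0-9]{4}-[0-9]{2}-[0-9]{2}"                            # "2018-01-05"
    + r"|[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"                        # "05/01/2018"
    + r")\b"
)


def replace_dates(text):
    """Lowercase the text and replace every recognized date with <DATE>."""
    return DATE_PATTERN.sub("<DATE>", text.lower())


print(replace_dates("Published January 5, 2018 and updated 7 march 2019."))
```

Because the pattern requires a year, phrases such as "may be 5 people" are left untouched. Running this rule before the generic number rule keeps the digits of a date from being turned into `<NUM>` first.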

In [None]:
## NOTE: THIS TAKES A LONG TIME! ONLY RUN IT IF THE CLEANED "CONTENT" COLUMN IS NEEDED.

data["content"] = data["content"].apply(clean_text)

In [None]:
# For testing purposes
relString = 6
print(dataRaw["content"][relString],"\n")

print(data["content"][relString])

#print(clean_text(dataRaw["content"][relString]))

In [None]:
# To process the text we use the NLTK library (https://www.nltk.org/).
# We have chosen to clean the text with our own homemade function (clean_text), so that we are able to remove dates etc.


# tokens have only been 

tokens = nltk.word_tokenize(data["content"][relString]) # to tokenize the text

print("Number of tokens in text: ", len(tokens))
#print(tokens[:10])

# stopwords
stopWords = set(stopwords.words('english'))

# removal of stopwords in tokens
tokens_without_stopWords = [w for w in tokens if w not in stopWords]
print("Number of tokens without stopwords: ", len(tokens_without_stopWords))

#print(tokens_without_stopWords)
reduction_rate = (len(tokens)-len(tokens_without_stopWords))/len(tokens)

print("We have removed %d stopwords from the text" %(len(tokens)-len(tokens_without_stopWords)))
print("Reduction rate: ", reduction_rate)

# Stemming
ps = PorterStemmer()

stem = [ps.stem(w) for w in tokens_without_stopWords]
print("Number of tokens in text without stopwords and after stemming: ", len(stem))

print("\nTo illustrate the stemming of the words")
print(tokens_without_stopWords[10:20])
print(stem[10:20])


In the final project, we will work toward actually creating a Fake News predictor. This will build on the work you have done in the Milestones, combined with the topics covered in the lectures in this second block of the course. 

### 1. Know your data (~1 page)
Your milestones were primarily about getting to know your data and representing it in a reasonable way. The first part of your final project is to summarize the main findings from this process (possibly incorporating feedback that you got in Peergrade):

- Describe how you ended up representing the FakeNewsCorpus dataset (for instance by describing your ER diagram). Argue for why you chose this design.
- Did you discover any inherent problems with the data while working with it?
- Report key properties of the data set, for instance through statistics or visualization. If you use non-trivial SQL queries to extract these properties, please describe them.
- What were your experiences with scraping your assigned fragment of the "Politics and Conflict" section of the Wikinews site?

To go further on the work you started with the milestones, we ask you to take the following steps:

- **Create a relational database schema to represent the dataset you scraped from the "Politics and Conflict" section of the Wikinews site and import the data you scraped into this schema. Document your schema design in an ER diagram and briefly discuss how you dealt with the metadata in this source.**
- **Use SQL to report basic statistics on this additional data source, e.g., number of articles or distribution over dates.**
- **Now that you have two different sources in the database, corresponding to the FakeNewsCorpus and to the Wikinews fragment you scraped, create a view that integrates the article information from the two sources. How do you map the different metadata from the sources into a common schema? NOTE: You need at a minimum to create a view schema that will suffice for the modeling task below, though you may include more metadata in the view if possible.**

**Finally, conclude by specifying how you will use the data to train a Fake News predictor:**

- **Specify which data you will be using to train and test the models in the remaining part of this project. Does it make sense to include the Wikinews data, or will you limit yourself to (a subset of) FakeNewsCorpus? Argue why.**
- **In this project, we will consider fake news detection as a binary classification problem. Find a good way to aggregate the many output classes of FakeNewsCorpus into 2 classes (FAKE/REAL). Argue why this is a reasonable choice.**
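
The integrated view and the two-class aggregation asked for above could be sketched as follows, using an in-memory SQLite database. All table and column names (`fakenews_article`, `wikinews_article`, `article_view`, `type`, `published`) are hypothetical, and the grouping of FakeNewsCorpus types into FAKE and REAL is only one possible choice that still has to be argued for, not a settled decision. Wikinews rows are left unlabeled here, which is also an assumption.

```python
import sqlite3

# Hypothetical schema: adapt table and column names to your own database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fakenews_article (id INTEGER PRIMARY KEY, content TEXT, type TEXT);
CREATE TABLE wikinews_article (id INTEGER PRIMARY KEY, content TEXT, published TEXT);

-- Common schema for both sources: (source, source_id, content, label).
CREATE VIEW article_view AS
SELECT 'fakenewscorpus' AS source, id AS source_id, content,
       CASE
           WHEN type IN ('reliable', 'political') THEN 'REAL'
           WHEN type IN ('fake', 'satire', 'bias', 'conspiracy', 'junksci',
                         'hate', 'unreliable', 'clickbait') THEN 'FAKE'
           ELSE NULL  -- unknown or missing type: no label
       END AS label
FROM fakenews_article
UNION ALL
SELECT 'wikinews', id, content, NULL  -- no label assigned to this source
FROM wikinews_article;
""")

# Toy rows, only to show that the view behaves as intended.
conn.executemany(
    "INSERT INTO fakenews_article (content, type) VALUES (?, ?)",
    [("text a", "fake"), ("text b", "reliable"), ("text c", "unknown")],
)
conn.execute(
    "INSERT INTO wikinews_article (content, published) VALUES (?, ?)",
    ("text d", "2020-01-01"),
)

# Basic statistics: number of articles per source and label.
label_counts = conn.execute(
    "SELECT source, label, COUNT(*) FROM article_view GROUP BY source, label"
).fetchall()
print(label_counts)
```

Keeping the mapping inside the view means the modeling code only ever sees `content` and `label`, so a different grouping can be tried by changing one `CASE` expression.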

### 2. Establish a baseline (~0.5 page)
The next step is to create one or more reasonable baselines for your Fake News predictor. These should be simple models that you can use to benchmark your more advanced models against later.

- Start by considering only features extracted from the main text (content) field. Choose one or more simple baseline models, train them, and report their accuracies. Also remember to report any necessary details about your baseline models (e.g., the choice of relevant parameters and how you chose them). Describe why you chose these baseline models. Why are they reasonable?
- Consider whether it would make sense to include meta-data features as well. If so, which ones, and why? If relevant, report the performance when including these additional features and compare it to the first baselines. Discuss whether these results match your expectations.

For the remainder of the project, we will limit ourselves to main-text data only (i.e. no meta-data). This makes it easier to do the cross-domain experiment in question 4 (which does not have the same set of meta-data fields).
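
One candidate for a simple text-only baseline is a bag-of-words Naive Bayes classifier. The sketch below is a minimal pure-Python version with add-one smoothing, written under the assumption that texts are tokenized by whitespace. The class name `NaiveBayesBaseline`, the helper `accuracy` and the toy sentences are made up for illustration, and a library implementation would normally be preferred on the real data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesBaseline:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, invented purely to exercise the code.
train_texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
    "parliament passed the budget on tuesday",
    "the minister announced a new budget",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

model = NaiveBayesBaseline().fit(train_texts, train_labels)
predictions = model.predict(["shocking secret cure", "the budget was announced"])
print(predictions, accuracy(["FAKE", "REAL"], predictions))
```

A model like this has essentially one parameter to report (the smoothing constant), which makes it easy to describe and to benchmark against.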

### 3. Create a Fake News predictor (~1 page)
Create the best Fake News predictor that you can come up with. This should be a more complex model than the one(s) you used as baseline, either in the sense that it uses a more advanced method, or because it uses a more elaborate set of features. Report necessary details about your models ensuring full reproducibility. This could include, for example, the choice of relevant parameters and how you chose them. 

- Quantify the performance of your Fake News predictor against your baseline(s).
- Argue for why you chose this approach over potential alternatives.

### 4. Performance beyond the original dataset (~0.5 page)
Now, we will test how well the model works beyond the dataset that you described in question 1.

- We have set up a friendly competition between the groups. The idea is that we provide a dataset *without* labels (CSV format, two columns: ID, text), and that you all use your model to try to predict the labels. You will then upload a file with the ID and the labels (CSV format, two columns: ID, "REAL" or "FAKE"). We will then compare your predictions against the true labels and create an online leaderboard where you can see your rank compared to the other groups. The leaderboard is hosted as a Kaggle competition, where you can also find the test data set. Please don't try to reverse-engineer the source of the data we provide in order to download and train on it (we will be able to tell).
- In order to allow you to play around with cross-domain performance locally as well, try the same exercise on the LIAR dataset (https://www.cs.ucsb.edu/~william/data/liar_dataset.zip), where you know the labels and can thus immediately calculate the performance.
- Compare the results of these two experiments to the results you obtained in question 3. Report both your LIAR results and the leaderboard results as part of your report. Remember to test the performance of your baseline model as well.
- Arrange all these results in a table to facilitate a comparison between them.
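
The two cross-domain steps above (scoring against known LIAR labels and writing a submission file) could be sketched as follows. How LIAR's six truthfulness labels are collapsed into FAKE/REAL is an assumption that needs its own justification, and the header names `ID` and `label` are also assumed and should be checked against the competition's expected format. The predictions shown are placeholders, not results.

```python
import csv
import io

# LIAR's six labels collapsed into two classes; this split is an assumption.
LIAR_TO_BINARY = {
    "pants-fire": "FAKE", "false": "FAKE", "barely-true": "FAKE",
    "half-true": "REAL", "mostly-true": "REAL", "true": "REAL",
}


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def write_submission(ids, labels, handle):
    """Write one (ID, label) row per prediction as CSV to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(["ID", "label"])  # assumed header names
    writer.writerows(zip(ids, labels))


# Placeholder labels and predictions, only to exercise the functions.
liar_labels = ["pants-fire", "mostly-true", "half-true"]
y_true = [LIAR_TO_BINARY[label] for label in liar_labels]
y_pred = ["FAKE", "REAL", "FAKE"]
liar_accuracy = accuracy(y_true, y_pred)
print(liar_accuracy)

buffer = io.StringIO()
write_submission([1, 2, 3], y_pred, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a handle opened with `open(path, "w", newline="")` instead of the `StringIO` buffer, so that the `csv` module controls the line endings.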

### 5. Discussion (~1 page)
Conclude your report by discussing the results you obtained.

- Explain the discrepancy between the performance on your test set and on the LIAR set and leaderboard. If relevant, use visualizations or report relevant statistics to point out differences in the datasets.
- Conclude by describing the overall lessons learned during the project, for instance considering questions like: Does the discrepancy between performance on different data sets surprise you? What can be done to improve the performance of Fake News prediction? Will further progress be driven primarily by better models or by better data? Is it even a solvable problem?

Please note that this question is not merely a summary of what you have done in the other questions. We expect to see some non-trivial reflection in this section.
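
One simple statistic for pointing out differences between datasets is the distribution of document lengths. The sketch below compares token counts for two collections of texts; the function name `length_summary` and the toy texts are hypothetical, and whitespace tokenization is an assumption.

```python
import statistics


def length_summary(texts):
    """Summarize document length in whitespace-separated tokens."""
    lengths = [len(text.split()) for text in texts]
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Toy collections standing in for two datasets.
corpus_a = [
    "a long article with many many words in it",
    "another fairly long article text",
]
corpus_b = ["short claim", "one more short claim"]

summary_a = length_summary(corpus_a)
summary_b = length_summary(corpus_b)
print(summary_a)
print(summary_b)
```

The same summaries can be reported side by side in a table or plotted as histograms to support the discussion.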

He believes that we should know what neural networks are, so that we can apply them in our final project. We can use them via embeddings. (See the end of video 3, Lecture on Neural networks in Week 18, Thursday.) We do not need to build the word embeddings ourselves. Pre-trained models exist that we can simply download.
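
A common way to use pre-trained word embeddings as features is to average the vectors of the words in a document. The sketch below assumes the embeddings have already been loaded into a plain dictionary from word to vector; the function name `document_vector` and the tiny two-dimensional vectors are made up for illustration and do not come from any real pre-trained model.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embedding vectors of all tokens found in the vocabulary."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical toy embeddings; real pre-trained vectors have hundreds of dimensions.
toy_embeddings = {"news": [1.0, 0.0], "fake": [0.0, 1.0]}

doc_vec = document_vector(["fake", "news", "unknownword"], toy_embeddings, dim=2)
print(doc_vec)
```

The resulting fixed-length vector can be fed to any classifier, which keeps the model itself simple while still benefiting from the pre-trained representation.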

Feedback: We can compare 'content'.