# 1. Import the data
We start by importing the two given datasets.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv("fake_or_real_news_training.csv")
test = pd.read_csv("fake_or_real_news_test.csv")

In [3]:
#Here we display the first rows of the test set to get a feel for its structure.
test.head(2)

Unnamed: 0,ID,title,text
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...


In [4]:
#Here we display the first rows of the training set to get a feel for its structure.
train.head(2)

Unnamed: 0,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,


In [5]:
#Here, we will fill NAs in the last two columns of the training dataset, which we know to be often empty.
train['X1'] = train['X1'].fillna("")
train['X2'] = train['X2'].fillna("")

# 2. Data preparation
In this section, we prepare the data so that it is in a suitable format to be analyzed and processed. We start by building the model inputs and the labels from the two datasets. After this first step, we use BeautifulSoup and a regular expression to clean the text of unwanted characters. We then create two different inputs for the modelling stage: a bag of words (BOW) built with the CountVectorizer, and TF-IDF weights for every token in the dataset built with the TfidfVectorizer.

We chose not to hold out a validation set from the training data, as we wanted to keep as many observations as possible for training the models (the test set is not labelled, so we cannot use it for evaluation either). Instead, we evaluate every model with 5-fold cross-validation on the training set.

In [6]:
#Here we build the model inputs by concatenating the text columns of each dataset, and we extract the training labels.
X_train = train['text'] + train['title'] + train['X1'] + train['X2'] 
X_test = test['text'] + test['title']
y_train = train.label.values
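
One detail worth checking in the cell above: the `+` operator joins strings without any separator, so the last word of one column is glued to the first word of the next. The sketch below uses a hypothetical one-row DataFrame (`toy` is not part of the dataset) to show the effect, together with a possible variant that adds an explicit space. The results reported in this notebook were obtained without the separator.

```python
import pandas as pd

# Hypothetical one-row example, not taken from the news dataset.
toy = pd.DataFrame({"text": ["markets fell"], "title": ["Stocks drop"]})

# Plain concatenation, as in the cell above: no separator between the fields.
joined_without_space = (toy["text"] + toy["title"])[0]

# Possible variant: add an explicit space between the fields.
joined_with_space = (toy["text"] + " " + toy["title"])[0]

print(joined_without_space)
print(joined_with_space)
```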

## 2.1 Data cleaning
BeautifulSoup helps us strip HTML tags from the text. A regular expression then removes every character that is not a letter or a space, namely punctuation and numbers.

In [7]:
import re
from bs4 import BeautifulSoup

In [8]:
# defining the function to keep only the letters.
def cleaning(text):
    # defining a BeautifulSoup object, which represents the document as a nested data structure
    soup = BeautifulSoup(text)
    # extracting all the text, without the HTML tags
    text = soup.get_text()
    # return the cleaned text: everything except letters and spaces is removed
    return re.sub("[^A-Za-z ]+", "", str(text))

In [9]:
X_train = X_train.apply(cleaning)
X_test = X_test.apply(cleaning)

In [10]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
X_train.columns = ['text']
X_test.columns = ['text']

In [11]:
X_train.head()

Unnamed: 0,text
0,Daniel Greenfield a Shillman Journalism Fellow...
1,Google Pinterest Digg Linkedin Reddit Stumbleu...
2,US Secretary of State John F Kerry said Monday...
3,Kaydee King KaydeeKing November The lesson ...
4,Its primary day in New York and frontrunners H...


In [12]:
# Here we make the whole text lowercase so that the same word in different cases is counted as a single token.
X_train['text'] = [entry.lower() for entry in X_train['text']]
X_test['text'] = [entry.lower() for entry in X_test['text']]

In [13]:
dataset = pd.concat([X_train, X_test])
dataset.head()

Unnamed: 0,text
0,daniel greenfield a shillman journalism fellow...
1,google pinterest digg linkedin reddit stumbleu...
2,us secretary of state john f kerry said monday...
3,kaydee king kaydeeking november the lesson ...
4,its primary day in new york and frontrunners h...


## 2.2 Bag of Words
In this part, we preprocess the data so that it is in a suitable format to be processed by the algorithms we will use later on.
Bag of Words (BOW) is a method to extract features from text documents, which can then be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in the documents of the corpus and represents each document by how many times each vocabulary word appears in it (Source: freecodecamp).

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andrea_salvati/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
print(stopwords.words("english")[1:5])

['me', 'my', 'myself', 'we']


In [16]:
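#Here we fit the vectorizer on the combined dataset. Note that stop_words="english" uses scikit-learn's built-in stop word list, not the NLTK list printed above.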
vectorizer = CountVectorizer(min_df=2, stop_words="english")
vectorizer.fit(dataset['text'])

dataset_bw = vectorizer.transform(dataset['text'])

print(dataset_bw.shape)

(6320, 48635)


## 2.3 TF-IDF weighting
Here we apply a different vectorizer, so that we have two different inputs and can compare the results of the two techniques. We also try a third version of the input, which is still based on the TF-IDF vectorizer but adds n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech (Source: Wikipedia). Adding n-grams is expected to improve the results, because sequences of words capture some lexical and semantic information that single words miss.

Term Frequency (TF) gives us the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document (Source: freecodecamp). Inverse Document Frequency (IDF) down-weights words that appear in many documents, so that common words contribute less than rare, more distinctive ones.
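
As a sanity check on the weighting, the sketch below recomputes TF-IDF by hand on a hypothetical three-sentence corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings, which differ slightly from the ratio definition above: raw counts are used as TF, the IDF is smoothed as `ln((1 + n) / (1 + df)) + 1`, and each row is then L2-normalised.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not taken from the news dataset).
docs = ["the cat sat", "the dog sat", "the dog barked"]

# Raw term counts: one row per document, one column per word.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears.
df = (counts > 0).sum(axis=0)
# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF and L2-normalise each document vector.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

The word "the" appears in every document, so it receives the lowest IDF of the vocabulary.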

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
#Here we apply the vectorizer to the dataset.
tfidf = TfidfVectorizer(min_df=2, stop_words="english")
tfidf.fit(dataset['text'])

dataset_tfidf = tfidf.transform(dataset['text'])

print(dataset_tfidf.shape)

(6320, 48635)


In [19]:
# Here we apply the vectorizer together with n-grams
ng = TfidfVectorizer(min_df=2, stop_words="english", ngram_range=(1,3))
ng.fit(dataset['text'])

dataset_tfidf_ng = ng.transform(dataset['text'])

print(dataset_tfidf_ng.shape)

(6320, 444384)


## 2.4 Definition of X and y

In [20]:
print("train_len: " + str(len(X_train)))
print("test_len: " + str(len(X_test)))

train_len: 3999
test_len: 2321


In [21]:
#Here we re-split the data into training and test after we have prepared it and cleaned it for the TF-IDF approach
X_train_tfidf = dataset_tfidf[:3999]
X_test_tfidf = dataset_tfidf[3999:]

In [22]:
#Here we re-split the data into training and test after we have prepared it and cleaned it for the BOW approach
X_train_bw = dataset_bw[:3999]
X_test_bw = dataset_bw[3999:]

In [23]:
#Here we re-split the data into training and test after we have prepared it and cleaned it for the TF_IDF + n-gram approach
X_train_tfidf_ng = dataset_tfidf_ng[:3999]
X_test_tfidf_ng = dataset_tfidf_ng[3999:]

## 2.5 Lemmatization
Lemmatization reduces each word to its lemma (its dictionary form), so that inflected forms of the same word are counted as a single feature. This helps to achieve greater precision in the modelling stage.



In [24]:
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag
from collections import defaultdict
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

In [25]:
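# Note: plain assignment does not copy the DataFrame, so the lemmatization below also modifies X_train and X_test in place.
X_train_lem = X_train
X_test_lem = X_test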

In [26]:
# Source: https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34
def lemmatization(dataset):
    # Tokenization: each entry in the corpus is broken into a list of words
    dataset['text'] = [word_tokenize(entry) for entry in dataset['text']]
    # WordNetLemmatizer requires POS tags to know whether a word is a noun, verb, adjective or adverb.
    # By default the tag is set to noun.
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map["J"] = wn.ADJ
    tag_map["V"] = wn.VERB
    tag_map["R"] = wn.ADV
    # Initializing WordNetLemmatizer() once, outside the loop
    word_Lemmatized = WordNetLemmatizer()
    for index, entry in enumerate(dataset['text']):
        # pos_tag provides the tag of each word, i.e. whether it is a noun (N), a verb (V) or something else.
        Final_words = [word_Lemmatized.lemmatize(word, tag_map[tag[0]]) for word, tag in pos_tag(entry)]
        # The lemmatized words of each entry are stored back in 'text' as the string form of the list
        dataset.loc[index, 'text'] = str(Final_words)
    return dataset

In [27]:
X_train_lem = lemmatization(X_train_lem)
X_test_lem = lemmatization(X_test_lem)

In [28]:
dataset_lem = pd.concat([X_train_lem, X_test_lem])
dataset_lem.head()

Unnamed: 0,text
0,"['daniel', 'greenfield', 'a', 'shillman', 'jou..."
1,"['google', 'pinterest', 'digg', 'linkedin', 'r..."
2,"['u', 'secretary', 'of', 'state', 'john', 'f',..."
3,"['kaydee', 'king', 'kaydeeking', 'november', '..."
4,"['it', 'primary', 'day', 'in', 'new', 'york', ..."


In [29]:
#Here we apply the vectorizer to the dataset.
tfidf = TfidfVectorizer(min_df=2)
tfidf.fit(dataset_lem['text'])
dataset_tfidf_lem = tfidf.transform(dataset_lem['text'])

print(dataset_tfidf_lem.shape)

(6320, 40757)


In [30]:
#Here we re-split the data into training and test after we have prepared it and cleaned it for the lemmatization + TF-IDF approach
X_train_tfidf_lem = dataset_tfidf_lem[:3999]
X_test_tfidf_lem = dataset_tfidf_lem[3999:]

# 3. Feature Selection
In this section, we select only the most important features, in the hope of making computation more efficient and improving our results. To do so, we adopt a Chi-Squared test, a statistical test of independence between two variables. Here it scores how strongly each feature depends on the class label, so that we can keep the k features with the highest scores and discard those that appear to be independent of the label.
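
A minimal sketch of the selection step on a hypothetical count matrix (`X_toy` and `y_toy` are made up for illustration): two terms each occur in only one class, while the third occurs equally in both and therefore gets a score of zero.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy count matrix (hypothetical): 4 documents x 3 terms.
# Term 0 appears only in FAKE documents, term 1 only in REAL ones,
# term 2 appears equally often in both classes.
X_toy = np.array([[3, 0, 1],
                  [2, 0, 1],
                  [0, 3, 1],
                  [0, 2, 1]])
y_toy = np.array(["FAKE", "FAKE", "REAL", "REAL"])

# Keep the 2 terms with the highest chi-squared score.
toy_selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)
kept = toy_selector.get_support().nonzero()[0]

print(toy_selector.scores_)
print(kept)
```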

In [31]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn import naive_bayes
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

import warnings

In [32]:
# Here we search for the k (number of features kept by the Chi-Squared test)
# that gives the best cross-validated accuracy for a given estimator
def chi_optimization(dataset, k, estimator):
    results = []
    warnings.simplefilter("ignore")

    for i in k:
        # Pick only the i most informative columns in the data.
        selector = SelectKBest(chi2, k=i)
        X_train_chi = selector.fit_transform(dataset, y_train)

        # Here we get the metrics for the model with 5-fold cross-validation.
        # Note that we keep the score of the best fold, not the mean over the folds.
        accuracy_chi = cross_val_score(estimator=estimator, X=X_train_chi, y=y_train, cv=5)
        results.append(max(accuracy_chi).round(3))

    opt = pd.DataFrame({"k": k, "results": results})

    best_k = int(opt.loc[opt["results"].idxmax()]["k"])
    print('the best k is: ' + str(best_k))
    return best_k

In [33]:
chi_optimization(X_train_tfidf, np.arange(3000,3500,50),naive_bayes.MultinomialNB())

the best k is: 3050


3050

In [34]:
# Find the 3050 most informative columns for tfidf
selector = SelectKBest(chi2, k=3050)
selector.fit(X_train_tfidf, y_train)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
X_train_tfidf_chi = X_train_tfidf[:,top_words[0]]
X_test_tfidf_chi = X_test_tfidf[:,top_words[0]]

In [35]:
chi_optimization(X_train_tfidf_lem, np.arange(12000,20000,1000),LinearSVC())

the best k is: 13000


13000

In [36]:
# Find the 13000 most informative columns for the lemmatized tfidf
selector = SelectKBest(chi2, k=13000)
selector.fit(X_train_tfidf_lem, y_train)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
X_train_lem_chi = X_train_tfidf_lem[:,top_words[0]]
X_test_lem_chi = X_test_tfidf_lem[:,top_words[0]]

In [37]:
chi_optimization(X_train_tfidf_lem, np.arange(2000,5000,500),LogisticRegression())

the best k is: 3500


3500

In [38]:
# Find the 3500 most informative columns for the lemmatized tfidf
selector = SelectKBest(chi2, k=3500)
selector.fit(X_train_tfidf_lem, y_train)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
X_train_lem_chi_me = X_train_tfidf_lem[:,top_words[0]]
X_test_lem_chi_me = X_test_tfidf_lem[:,top_words[0]]

# 4. Modelling
In this section, we start the modelling task. We experiment with different models, namely Naive Bayes, SVM and MaxEnt (aka Logistic Regression), and in the end we try to put the steps together in a pipeline. The procedure is to train each model on several versions of the input dataset (BOW, TF-IDF, TF-IDF with n-grams, lemmatized TF-IDF and the Chi-Squared selections) and to compare the cross-validated accuracy, to see which input provides the best results.

## 4.1 Naive Bayes
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. (Source: Wikipedia)
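
For word counts, the multinomial variant scores each class as `log P(c) + sum_i x_i * log P(w_i | c)` and predicts the class with the highest score. The sketch below checks this by hand against `MultinomialNB` on hypothetical counts; `X_toy`, `y_toy` and the two column meanings are assumptions made for the example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word counts (hypothetical); columns stand for the words "shocking" and "senate".
X_toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_toy = np.array(["FAKE", "FAKE", "REAL", "REAL"])

# alpha=1.0 (Laplace smoothing) is the scikit-learn default.
nb = MultinomialNB()
nb.fit(X_toy, y_toy)

# A new document that contains "shocking" twice and "senate" never.
new_doc = np.array([[2, 0]])

# Manual score per class: log prior + counts weighted by the log word likelihoods.
joint = nb.class_log_prior_ + new_doc @ nb.feature_log_prob_.T
manual_label = nb.classes_[joint.argmax(axis=1)]

print(manual_label, nb.predict(new_doc))
```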

In [39]:
from sklearn import naive_bayes, metrics

In [40]:
#Here we start with inputting the TF-IDF processed data
naive_bayes = naive_bayes.MultinomialNB()
naive_bayes.fit(X_train_tfidf, y_train)

predictions_nb_tfidf = naive_bayes.predict(X_test_tfidf)

warnings.simplefilter("ignore")
accuracy_nb_tfidf = cross_val_score(
    estimator=naive_bayes, X=X_train_tfidf, y=y_train, cv=5
)
print("accuracy with cross validation: " + str(max(accuracy_nb_tfidf).round(3)))

accuracy with cross validation: 0.854


In [41]:
#Here we start with inputting the BOW processed data
from sklearn import naive_bayes

naive_bayes = naive_bayes.MultinomialNB()
naive_bayes.fit(X_train_bw, y_train)

predictions_nb_bw = naive_bayes.predict(X_test_bw)

warnings.simplefilter("ignore")
accuracy_nb_bw = cross_val_score(estimator=naive_bayes, X=X_train_bw, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_nb_bw).round(3)))

accuracy with cross validation: 0.897


In [42]:
#Here we start with inputting the TF-IDF + Chi2 processed data
from sklearn import naive_bayes

naive_bayes = naive_bayes.MultinomialNB()
naive_bayes.fit(X_train_tfidf_chi, y_train)

predictions_nb_tfidf_chi = naive_bayes.predict(X_test_tfidf_chi)

warnings.simplefilter("ignore")
accuracy_nb_tfidf_chi = cross_val_score(estimator=naive_bayes, X=X_train_tfidf_chi, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_nb_tfidf_chi).round(3)))

accuracy with cross validation: 0.918


In [43]:
#Here we start with inputting the lemmatized TF-IDF processed data
from sklearn import naive_bayes

naive_bayes = naive_bayes.MultinomialNB()
naive_bayes.fit(X_train_tfidf_lem, y_train)

predictions_nb_tfidf_lem = naive_bayes.predict(X_test_tfidf_lem)

warnings.simplefilter("ignore")
accuracy_nb_tfidf_lem = cross_val_score(estimator=naive_bayes, X=X_train_tfidf_lem, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_nb_tfidf_lem).round(3)))

accuracy with cross validation: 0.826


In [44]:
#Here we start with inputting the TF-IDF + n-gram processed data
from sklearn import naive_bayes

naive_bayes = naive_bayes.MultinomialNB()
naive_bayes.fit(X_train_tfidf_ng, y_train)



warnings.simplefilter("ignore")
accuracy_nb_tfidf_ng = cross_val_score(estimator=naive_bayes, X=X_train_tfidf_ng, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_nb_tfidf_ng).round(3)))

accuracy with cross validation: 0.808


## 4.2 SVM
Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (Source: Wikipedia).
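
For a linear SVM, the "assignment to one category or the other" comes down to the sign of a linear score `w · x + b`. The sketch below illustrates this with `LinearSVC` on four hypothetical 2-D points; the data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D points (hypothetical), two per class.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y_toy = np.array(["FAKE", "FAKE", "REAL", "REAL"])

toy_svm = LinearSVC(C=1.0).fit(X_toy, y_toy)

# Linear score for each point: w . x + b
scores = X_toy @ toy_svm.coef_.ravel() + toy_svm.intercept_[0]

# A positive score means the second class in classes_, otherwise the first.
manual_labels = np.where(scores > 0, toy_svm.classes_[1], toy_svm.classes_[0])

print(scores)
print(manual_labels)
```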

In [45]:
from sklearn import svm

In [46]:
# fit the training dataset on the classifier
SVM = LinearSVC()
SVM.fit(X_train_tfidf,y_train)

# predict the labels on the unlabelled test dataset
predictions_SVM_tfidf = SVM.predict(X_test_tfidf)

warnings.simplefilter("ignore")
accuracy_svm_tfidf = cross_val_score(estimator=SVM, X=X_train_tfidf, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_svm_tfidf).round(3)))

accuracy with cross validation: 0.933


In [47]:
# fit the training dataset on the classifier
SVM = LinearSVC()
SVM.fit(X_train_bw,y_train)

# predict the labels on the unlabelled test dataset
predictions_SVM_bw = SVM.predict(X_test_bw)

warnings.simplefilter("ignore")
accuracy_svm_bw = cross_val_score(estimator=SVM, X=X_train_bw, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_svm_bw).round(3)))

accuracy with cross validation: 0.875


In [48]:
# fit the training dataset on the classifier
SVM = LinearSVC()
SVM.fit(X_train_tfidf_ng, y_train)

# predict the labels on the unlabelled test dataset
predictions_SVM_ng = SVM.predict(X_test_tfidf_ng)

# evaluate the model with 5-fold cross-validation
warnings.simplefilter("ignore")
accuracy_svm_tfidf_ng = cross_val_score(estimator=SVM, X=X_train_tfidf_ng, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_svm_tfidf_ng).round(3)))

accuracy with cross validation: 0.927


In [49]:
from sklearn.svm import LinearSVC
# fit the training dataset on the classifier
SVM = LinearSVC()
SVM.fit(X_train_tfidf_lem, y_train)

# predict the labels on the unlabelled test dataset
predictions_SVM_lem = SVM.predict(X_test_tfidf_lem)

warnings.simplefilter("ignore")
accuracy_svm_tfidf_lem = cross_val_score(estimator=SVM, X=X_train_tfidf_lem, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_svm_tfidf_lem).round(3)))

accuracy with cross validation: 0.933


In [50]:
from sklearn.svm import LinearSVC
# fit the training dataset on the classifier
SVM = LinearSVC()
SVM.fit(X_train_lem_chi, y_train)

# predict the labels on the unlabelled test dataset
predictions_svm_lem_chi = SVM.predict(X_test_lem_chi)

warnings.simplefilter("ignore")
accuracy_svm_lem_chi = cross_val_score(estimator=SVM, X=X_train_lem_chi, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_svm_lem_chi).round(4)))

accuracy with cross validation: 0.947


### 4.2.1 SVM Optimized
Grid search builds a model for every combination of the specified hyperparameters and evaluates each one. This is the approach we pick to optimize the hyperparameter C of our model.
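
A minimal sketch of the mechanics on synthetic data, assuming a small hypothetical grid of `C` values (the data from `make_classification` and the grid itself are not the ones used in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: one model is trained and cross-validated per value of C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LinearSVC(), param_grid, cv=5)
search.fit(X_toy, y_toy)

best_C = search.best_params_["C"]
# best_score_ is the mean cross-validated accuracy of the best setting.
print(best_C, round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is already refitted on the whole training data with the best setting. When the best value sits on the edge of the grid, it may be worth widening the grid in that direction.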

In [51]:
from sklearn.model_selection import GridSearchCV

In [52]:
# Create regularization hyperparameter space. After several trials, it was possible to see that the best C was around 0.1.
# In order to reduce the computation time, we build the C space around that value
C = np.arange(0.1, 1.3, 0.05)

# Create hyperparameter options
hyperparameters = dict(C=C)
print(hyperparameters)

{'C': array([0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 ,
       0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  , 1.05, 1.1 , 1.15,
       1.2 , 1.25])}


In [53]:
clf_SVM = GridSearchCV(SVM, hyperparameters, cv=5, verbose=0)
best_model = clf_SVM.fit(X_train_lem_chi, y_train)
print("Best C:", best_model.best_estimator_.get_params()["C"])

Best C: 1.2500000000000004


In [54]:
# refitting our SVM model with the best C found by the grid search
SVM_GS = LinearSVC(C=best_model.best_estimator_.get_params()["C"])
SVM_GS.fit(X_train_lem_chi,y_train)

y_pred_GS_SVM_tfidf_lem = SVM_GS.predict(X_test_lem_chi)

warnings.simplefilter("ignore")
accuracy_GS_SVM_lem_chi = cross_val_score(estimator=SVM_GS, X=X_train_lem_chi, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_GS_SVM_lem_chi).round(4)))

accuracy with cross validation: 0.9458


## 4.3 MaxEnt
Here we adopt the MaxEnt model, aka Logistic Regression (or logit model), a widely used statistical model that in its basic form uses a logistic function to model a binary dependent variable (Source: Techopedia).
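
The logistic function turns a linear score `z = w · x + b` into a probability `1 / (1 + exp(-z))`. The sketch below recomputes the probabilities of a fitted binary `LogisticRegression` by hand on hypothetical 1-D data (`X_toy` and `y_toy` are made up for the example).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data (hypothetical): small values are class 0, large values class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_logreg = LogisticRegression().fit(X_toy, y_toy)

# Linear score, then the logistic (sigmoid) function.
z = X_toy @ toy_logreg.coef_.ravel() + toy_logreg.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as computed by scikit-learn.
sklearn_proba = toy_logreg.predict_proba(X_toy)[:, 1]
print(manual_proba)
print(sklearn_proba)
```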

In [55]:
from sklearn.linear_model import LogisticRegression

In [56]:
# defining our Logistic Regression model, input as BOW
logreg = LogisticRegression()
logreg.fit(X_train_bw, y_train)
y_pred_logreg_bw = logreg.predict(X_test_bw)

warnings.simplefilter("ignore")
accuracy_logreg_bw = cross_val_score(estimator=logreg, X=X_train_bw, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_logreg_bw).round(3)))

accuracy with cross validation: 0.907


In [57]:
# defining our Logistic Regression model, input processed with TF-IDF
logreg = LogisticRegression()
logreg.fit(X_train_tfidf,y_train)
y_pred_logreg_tfidf = logreg.predict(X_test_tfidf)

warnings.simplefilter("ignore")
accuracy_logreg_tfidf = cross_val_score(estimator=logreg, X=X_train_tfidf, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_logreg_tfidf).round(3)))

accuracy with cross validation: 0.907


In [58]:
# defining our Logistic Regression model, input processed with lemmatization and TF-IDF
logreg = LogisticRegression()
logreg.fit(X_train_tfidf_lem,y_train)
y_pred_logreg_tfidf_lem = logreg.predict(X_test_tfidf_lem)

warnings.simplefilter("ignore")
accuracy_logreg_tfidf_lem = cross_val_score(estimator=logreg, X=X_train_tfidf_lem, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_logreg_tfidf_lem).round(3)))

accuracy with cross validation: 0.913


In [59]:
# defining our Logistic Regression model, input processed with lemmatization, TF-IDF and Chi2
logreg = LogisticRegression()
logreg.fit(X_train_lem_chi_me,y_train)

y_pred_logreg_lem_chi = logreg.predict(X_test_lem_chi_me)

warnings.simplefilter("ignore")
accuracy_logreg_lem_chi = cross_val_score(estimator=logreg, X=X_train_lem_chi_me, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_logreg_lem_chi).round(3)))

accuracy with cross validation: 0.914


### 4.3.1 MaxEnt Optimized

In [60]:
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

In [61]:
# Create regularization hyperparameter space. After several trials, it was possible to see that the best C was around 12.
# In order to reduce the computation time, we build the C space around that value
C = np.arange(11, 13.25, 0.25)

# Create hyperparameter options
hyperparameters = dict(C=C)

print(hyperparameters)

{'C': array([11.  , 11.25, 11.5 , 11.75, 12.  , 12.25, 12.5 , 12.75, 13.  ])}


In [62]:
#here we run the grid search with the lemmatized TF-IDF + Chi2 processed input
clf_ME = GridSearchCV(logreg, hyperparameters, cv=5, verbose=0)
best_model = clf_ME.fit(X_train_lem_chi_me, y_train)
print("Best C:", best_model.best_estimator_.get_params()["C"])

Best C: 12.0


In [63]:
# refitting our Logistic Regression model with the optimal hyperparameters
logreg_GS = LogisticRegression(C=best_model.best_estimator_.get_params()["C"])
logreg_GS.fit(X_train_lem_chi_me, y_train)
y_pred_GS_logreg_tfidf = logreg_GS.predict(X_test_lem_chi_me)

warnings.simplefilter("ignore")
accuracy_GS_logreg_lem_chi = cross_val_score(estimator=logreg_GS, X=X_train_lem_chi_me, y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy_GS_logreg_lem_chi).round(3)))

accuracy with cross validation: 0.941


## 4.4 Pipeline
Here we try to stack the different steps of the process in a pipeline, so that they are executed more efficiently and in a more automated fashion.

In [64]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

In [65]:
# this calculates a vector of term frequencies for
# each document
vect = CountVectorizer()

# this normalizes each term frequency by the
# number of documents having that term
tfidf = TfidfTransformer()

# this is a linear SVM classifier
clf = LinearSVC()

pipeline = Pipeline([("vect", vect), ("tfidf", tfidf), ("clf", clf)])

# call fit as you would on any classifier
pipeline.fit(X_train['text'], y_train)

# predict test instances
pred_pip = pipeline.predict(X_test['text'])

# calculate accuracy with 5-fold cross-validation
warnings.simplefilter("ignore")
accuracy = cross_val_score(estimator=pipeline, X=X_train['text'], y=y_train, cv=5)
print("accuracy with cross validation: " + str(max(accuracy).round(3)))

accuracy with cross validation: 0.932
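
A pipeline can also hold the feature selection step, so that the vectorizer and the Chi-Squared selector are refitted on the training folds only during cross-validation. The sketch below is a possible variant on a hypothetical eight-sentence corpus, not the configuration used for the results above; `toy_texts`, `toy_labels` and `k=5` are assumptions made for the example. It reports the mean and standard deviation over the folds instead of the best fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical, not taken from the news dataset).
toy_texts = [
    "shocking secret they hide",
    "you will not believe this shocking trick",
    "secret cure doctors hide",
    "shocking miracle trick revealed",
    "senate passes budget bill",
    "court rules on budget appeal",
    "senate committee votes on bill",
    "court hears appeal on election bill",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

# Vectorizing, feature selection and classification in a single estimator.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("chi2", SelectKBest(chi2, k=5)),
    ("clf", LinearSVC()),
])

# Every step is fitted again on the training part of each fold.
scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=2)
print(scores.mean().round(3), scores.std().round(3))
```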


# 5. Create CSV with submission

In [66]:
accuracy = {
    "best_Naive_Bayes": max(accuracy_nb_tfidf_chi).round(2),
    "best_SVM": max(accuracy_svm_lem_chi).round(2),
    "best_MaxEnt": max(accuracy_GS_logreg_lem_chi).round(2),
}

In [67]:
test['pre_label'] = predictions_svm_lem_chi
test.head()

Unnamed: 0,ID,title,text,pre_label
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...,FAKE
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...,REAL
2,864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...,REAL
3,4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...,REAL
4,662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...,REAL


In [68]:
submission = test[['ID','pre_label']]
submission.to_csv('submission.csv', index=False)

# 6. Conclusions

We will now quickly summarize the flow of the work and briefly describe the main steps we took.
At first, we imported the data and started preparing it for analysis (after having performed some EDA).
The main steps in the preparation were cleaning the text and preprocessing it with vectorizers to get the data into a suitable format for the chosen algorithms. The cleaning was performed for the most part using BeautifulSoup from bs4.

Before the modelling phase, though, there were two more steps, lemmatization being the first one. Lemmatization helps to achieve greater precision in the modelling stage as it considers the lemma (dictionary form) of the word and not the word itself. On top of this, we decided to adopt a Chi-Squared test for feature selection, to make processing leaner and more efficient while filtering out noise and achieving better results.

The modelling phase consisted of iteratively running the same models multiple times with different input data (in terms of preprocessing) to discover which treatment generated the best results.
Naive Bayes, Support Vector Machines and Logistic Regression (MaxEnt) were the models we adopted to classify the news as fake or real. The best results were achieved with the optimized SVM model, where the accuracy (best fold of the 5-fold cross-validation) was as high as 0.9458, when the input was the data preprocessed through lemmatization, TF-IDF and Chi-Squared feature selection. Below is a summary of the results achieved with the different models.