## Problem Descriptions

#### 1. The problem given is False News Detection using Natural Language Processing by using WORD2VEC and TFIDF vectorizier for a sample dataset

Given dataset having 4 features

**title** - Title of the news article

**author** - author of the article

**text** - Body of the article

**label** - Whether the news provided in the article is fake or not

#### 2 .Reading the Data

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
train_data = pd.read_csv("/content/drive/MyDrive/train.csv")
#test_data = pd.read_csv(r"C:\Users\PREETI\Downloads\Fake News Detection Data\test.csv")

In [4]:
pd.set_option("display.max_rows", None)
train_data.sample(10)

Unnamed: 0,id,title,author,text,label
13,13,US Officials See No Link Between Trump and Russia,Jason Ditz,Clinton Campaign Demands FBI Affirm Trump's Ru...,1
53,55,The Trump Election Will Spark More Individual ...,Lance Schuttler,The Trump Election Will Spark More Individual ...,1
63,65,U.S. General: Islamic State Chemical Attack Ha...,Kristina Wong,WASHINGTON — U. S. and Australian troops ad...,0
74,76,News: PR Disaster: The President Of Panasonic ...,,Email \nGet ready for the most cringeworthy st...,1
67,69,Bernie Sanders Says What The Media Won’t: Trum...,Jason Easley,"— Bernie Sanders (@BernieSanders) October 27, ...",1
75,77,Judge spanks transgender-obsessed Obama: You l...,Redflag Newsdesk,"\nBob Unruh | WND \nFor a third time, a federa...",1
77,79,Franken Calls for ’Independent Investigation’ ...,Pam Key,"Sunday on CNN’s “State of the Union,” in react...",0
93,95,White House Confirms More Gitmo Transfers Befo...,Edwin Mora,President Barack Obama will likely release mor...,0
55,57,Cognition and True Islam - A Book Review,M.R. Islam,1 Shares\n1 0 0 0\nBook Review – Dr. Rafiq Isl...,1
21,22,Rob Reiner: Trump Is ’Mentally Unstable’ - Bre...,Pam Key,"Sunday on MSNBC’s “AM Joy,” actor and director...",0


In [5]:
#dropping the column id
train_data.drop("id", inplace = True, axis = 1)

In [6]:
train_data.head()

Unnamed: 0,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [7]:
train_data['label'].value_counts()

0    54
1    44
Name: label, dtype: int64

We can observe that the dataset is almost balanced as there are almost equal number of data points with label 0 and 1.

### 3. Data Cleaning 

3.1 Checking for Duplicates and Droping them

In [8]:
train_data.duplicated().sum()

0

There are 109 duplicate rows in train data

In [9]:
#dropping duplicate rows in the train dataset
train_data.drop_duplicates(inplace = True)

In [10]:
#checking again duplicates
train_data.duplicated().sum()

0

#### 3.2 Checking Missing Values

In [11]:
train_data.isnull().sum()

title      1
author    11
text       0
label      0
dtype: int64

In [12]:
train_data.isnull().sum()/len(train_data)*100

title      1.020408
author    11.224490
text       0.000000
label      0.000000
dtype: float64

We can observe that there are many missing values. As there are many rows, we shouldn't drop all the rows. As author is a categorical
feature, we create a new category (missing) for missing authors. For missing values of title text we just replace NAN with space

In [13]:
train_data['title'].fillna(" ", inplace = True)
train_data['text'].fillna(" ", inplace = True)
train_data['author'].fillna("missing", inplace = True)

In [14]:
train_data.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64

### 4. Data Preprocessing

In [15]:

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"’", "'", phrase)
    phrase = re.sub(r"“", """, phrase)
    phrase = re.sub(r"“", """, phrase)
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    
    return phrase


In [16]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [17]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [18]:
stop_words = stopwords.words("english")

In [19]:
lemmatizer = WordNetLemmatizer()

#### 4.1 Title

In [22]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [23]:
from tqdm import tqdm
preprocessed_titles = []

#tqdm is for printing the status bar
for sentence in tqdm(train_data['title'].values):
    sentence = decontracted(sentence)
    sentence = re.sub(r'^https?:\/\/.*[\r\n]*', '', sentence) # remove hyperlinks
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence) # removing special characters
    sentence = ''.join([i for i in sentence if not i.isdigit()]) # removing numbers
    sentence = ' '.join(e for e in sentence.split() if e not in stop_words) # removing stop words
    sentence = ' '.join(lemmatizer.lemmatize(e) for e in sentence.split()) # Lemmatization
    preprocessed_titles.append(sentence.lower().strip())

100%|██████████| 98/98 [00:01<00:00, 72.42it/s]


In [24]:
train_data['title'] = preprocessed_titles

#### 4.2 Text

In [25]:
from tqdm import tqdm
preprocessed_texts = []

#tqdm is for printing the status bar
for sentence in tqdm(train_data['text'].values):
    sentence = decontracted(sentence)
    sentence = re.sub(r'^https?:\/\/.*[\r\n]*', '', sentence) # remove hyperlinks
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence) # removing special characters
    sentence = ''.join([i for i in sentence if not i.isdigit()]) # removing numbers
    sentence = ' '.join(e for e in sentence.split() if e not in stop_words) # removing stop words
    sentence = ' '.join(lemmatizer.lemmatize(e) for e in sentence.split()) # Lemmatization
    preprocessed_texts.append(sentence.lower().strip())

100%|██████████| 98/98 [00:00<00:00, 200.25it/s]


In [26]:
train_data['text'] = preprocessed_texts

In [27]:
train_data.head(50)

Unnamed: 0,title,author,text,label
0,house dem aide we did even see comey letter un...,Darrell Lucus,house dem aide we did even see comey letter un...,1
1,flynn hillary clinton big woman campus breitbart,Daniel J. Flynn,ever get feeling life circle roundabout rather...,0
2,why truth might get you fired,Consortiumnews.com,why truth might get you fired october the tens...,1
3,civilians killed in single us airstrike have b...,Jessica Purkiss,videos civilians killed in single us airstrike...,1
4,iranian woman jailed fictional unpublished sto...,Howard Portnoy,print an iranian woman sentenced six year pris...,1
5,jackie mason hollywood would love trump he bom...,Daniel Nussbaum,in trying time jackie mason voice reason in we...,0
6,life life of luxury elton john favorite shark ...,missing,ever wonder britain iconic pop pianist get lon...,1
7,beno hamon wins french socialist party preside...,Alissa J. Rubin,paris france chose idealistic traditional cand...,0
8,excerpts from draft script donald trump q ampa...,missing,donald j trump scheduled make highly anticipat...,0
9,a back channel plan ukraine russia courtesy tr...,Megan Twohey and Scott Shane,a week michael t flynn resigned national secur...,0


### 5. Splitting data into train, cv, test

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
output = train_data['label']

In [30]:
train, test, train_output, test_output = train_test_split(train_data.drop(columns = {'label'}), 
                                                         output, 
                                                         test_size = 0.3,
                                                         stratify = output,
                                                         random_state = 0)

train, cv, train_output, cv_output = train_test_split(train, 
                                                      train_output,
                                                     test_size = 0.3,
                                                     stratify = train_output,
                                                     random_state = 0)

In [31]:
train.shape, cv.shape, test.shape

((47, 3), (21, 3), (30, 3))

In [32]:
train_output.shape, cv_output.shape, test_output.shape

((47,), (21,), (30,))

### 6. Data Encoding

#### 6.1 title - TFIDF Vectorization

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
title_tfidf_vectorizer = TfidfVectorizer(min_df = 5)

train_title_tdifdf = title_tfidf_vectorizer.fit_transform(train['title'].values)
cv_title_tdifdf = title_tfidf_vectorizer.transform(cv['title'].values)
test_title_tdifdf = title_tfidf_vectorizer.transform(test['title'].values)

In [36]:
# #saving the tfidf vectorizer
# import pickle
# with open('title_tfidf_vectorizer.pickle', 'wb') as fp:
#     pickle.dump(title_tfidf_vectorizer, fp, protocol = pickle.HIGHEST_PROTOCOL)

In [37]:
title_tfidf_vectorizer.get_feature_names()[:10]



['breitbart', 'new', 'the', 'times', 'trump', 'york']

#### 6.2 author Response Coding

In [38]:
train['author'].unique().shape

(43,)

In [39]:
train.keys()

Index(['title', 'author', 'text'], dtype='object')

We can observe that there are 2707 authors in total train data. As dimenions will be high with one hot encoding, we use response
coding

In [69]:
prob_dict = {}
train_author = train.copy()
train_author['label'] = train_output
train_author_1 = train_author.groupby('author')
for i in (train_author_1.groups):
    group = train_author_1.get_group(i)
    tot = group.shape[0]
    fake = group[group['label'] == 1].shape[0]
    prob_fake = fake/tot
    prob_not_fake = 1 - prob_fake
    prob_dict.update({i:[prob_not_fake, prob_fake]})
    
keys = prob_dict.keys()

train_author_response_code = []
for author in train['author']:
    if author not in keys:
        train_author_response_code.append([.5,.5])
    else:
        train_author_response_code.append(prob_dict.get(author))

cv_author_response_code = []
for author in cv['author']:
    if author not in keys:
        cv_author_response_code.append([.5,.5])
    else:
        cv_author_response_code.append(prob_dict.get(author))

test_author_response_code = []
for author in test['author']:
    if author not in keys:
        test_author_response_code.append([.5,.5])
    else:
        test_author_response_code.append(prob_dict.get(author))


In [42]:
# #saving the probability dictionary
# import pickle
# with open('prob_dict.pickle', 'wb') as fp:
#     pickle.dump(prob_dict, fp, protocol = pickle.HIGHEST_PROTOCOL)

#### 6.3 text TFIDF Vectorization

In [43]:
text_tfidf_vectorizer = TfidfVectorizer(min_df = 5)

train_text_tdifdf = text_tfidf_vectorizer.fit_transform(train['text'].values)
cv_text_tdifdf = text_tfidf_vectorizer.transform(cv['text'].values)
test_text_tdifdf = text_tfidf_vectorizer.transform(test['text'].values)

In [44]:
# #saving the text tfidf vectorizer
# import pickle
# with open('text_tfidf_vectorizer.pickle', 'wb') as fp:
#     pickle.dump(text_tfidf_vectorizer, fp, protocol = pickle.HIGHEST_PROTOCOL)

In [45]:
text_tfidf_vectorizer.get_feature_names()[0:10]



['able',
 'about',
 'access',
 'according',
 'accused',
 'achieve',
 'across',
 'act',
 'acting',
 'action']

#### 6.4 Word2Vec

In [46]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [47]:
from gensim.models import Word2Vec

In [48]:
import numpy as np


def load_glove_model(glove_file):
    print("Loading Glove Model")
    f = open(glove_file, 'r', errors = 'ignore', encoding = 'utf8')
    model = {}
    vector_size = 300
    for line in tqdm(f):
        split_line = line.split()
        word = split_line[0]
        embedding = np.array([float(val) for val in split_line[1:]])
        model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model

model = load_glove_model('/content/drive/MyDrive/glove.42B.300d.txt')  #sample check with one word

Loading Glove Model


1917495it [03:17, 9705.18it/s] 

Done.
1917495 words loaded!





In [49]:
 # #saving the glove model
# import pickle
# with open('model.pickle', 'wb') as fp:
#     pickle.dump(model, fp, protocol = pickle.HIGHEST_PROTOCOL)

In [50]:
glove_words = set(model.keys())

#### 6.4.1 title Word2Vec

In [51]:
avg_w2v_vectors_title_train = []  # the avg-w2v for each title is stored in this list
avg_w2v_vectors_title_cv = []
avg_w2v_vectors_title_test = []

for sentence in tqdm(train['title'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_title_train.append(vector)
avg_w2v_vectors_title_train = np.array(avg_w2v_vectors_title_train)

for sentence in tqdm(cv['title'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        #vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_title_cv.append(vector)
avg_w2v_vectors_title_cv = np.array(avg_w2v_vectors_title_cv)

for sentence in tqdm(test['title'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_title_test.append(vector)
avg_w2v_vectors_title_test = np.array(avg_w2v_vectors_title_test)


100%|██████████| 47/47 [03:41<00:00,  4.72s/it]
100%|██████████| 21/21 [00:08<00:00,  2.36it/s]
100%|██████████| 30/30 [02:16<00:00,  4.56s/it]


In [52]:
avg_w2v_vectors_text_train = []  # the avg-w2v for each title is stored in this list
avg_w2v_vectors_text_cv = []
avg_w2v_vectors_text_test = []

for sentence in tqdm(train['text'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_text_train.append(vector)
avg_w2v_vectors_text_train = np.array(avg_w2v_vectors_text_train)

for sentence in tqdm(cv['text'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        #vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_text_cv.append(vector)
avg_w2v_vectors_text_cv = np.array(avg_w2v_vectors_text_cv)

for sentence in tqdm(test['text'].values): #for each title
    vector = np.zeros(300) #as word vectors are of zero length
    cnt_words = 0; # num of words with a valid vector in the sentence
    for word in glove_words: # for each word in a title
        vector += model[word]
        cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words # normalize the vector / vector decomposition
    avg_w2v_vectors_text_test.append(vector)
avg_w2v_vectors_text_test = np.array(avg_w2v_vectors_text_test)


100%|██████████| 47/47 [04:15<00:00,  5.43s/it]
100%|██████████| 21/21 [00:10<00:00,  1.97it/s]
100%|██████████| 30/30 [02:21<00:00,  4.72s/it]


### Combining all Encoded Features

In [53]:
from scipy.sparse import hstack

####7.1 Combining TFIDF encoded features

In [65]:
cv_title_tdifdf.shape,  cv_text_tdifdf.shape

((21, 6), (21, 691))

In [67]:
len(cv_author_response_code)

47

In [71]:
train_data_final_tfidf = hstack((train_title_tdifdf, np.array(train_author_response_code), train_text_tdifdf))
cv_data_final_tfidf = hstack((cv_title_tdifdf, np.array(cv_author_response_code), cv_text_tdifdf))
test_data_final_tfidf = hstack((test_title_tdifdf, np.array(test_author_response_code), test_text_tdifdf))

In [72]:
train_data_final_tfidf.shape, cv_data_final_tfidf.shape, test_data_final_tfidf.shape

((47, 699), (21, 699), (30, 699))

####7.2 Combining Word2Vec encoded features

<47x6 sparse matrix of type '<class 'numpy.float64'>'
	with 94 stored elements in Compressed Sparse Row format>

In [96]:
train_data_final_w2v = np.concatenate((avg_w2v_vectors_title_train, np.array(train_author_response_code), avg_w2v_vectors_text_train),axis = 1)
cv_data_final_w2v = np.concatenate((avg_w2v_vectors_title_cv, np.array(cv_author_response_code), avg_w2v_vectors_text_cv),axis = 1)
test_data_final_w2v = np.concatenate((avg_w2v_vectors_title_test, np.array(test_author_response_code), avg_w2v_vectors_text_test),axis = 1)

In [97]:
#standardizing the data 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data_final_w2v = scaler.fit_transform(train_data_final_w2v)
cv_data_final_w2v = scaler.transform(cv_data_final_w2v)
test_data_final_w2v = scaler.transform(test_data_final_w2v)

#8. Modelling

In [100]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, classification_report

8.1 Multinomial Naive bayes with TFIDF ecoded features

In [103]:
alpha_range = [0.0001, 0.001, 0.01, 0.1, 1]
for i in alpha_range:
  nb_clf = MultinomialNB(alpha = i, fit_prior = True)
  nb_clf.fit(train_data_final_tfidf, train_output )

  train_prob = nb_clf.predict_proba(train_data_final_tfidf)[:,1]
  train_AUC = roc_auc_score(train_output, train_prob)
  print("for alpha = %f, train AUC = %f" % (i, train_AUC))

  
  cv_prob = nb_clf.predict_proba(cv_data_final_tfidf)[:,1]
  cv_AUC = roc_auc_score(cv_output, cv_prob)
  print("for alpha = %f, CV AUC = %f" % (i, cv_AUC))

  train_scores = nb_clf.predict(train_data_final_tfidf)
  train_f1 = f1_score(train_output, train_scores)
  print("for alpha = %f, train f1 score = %f" % (i, train_f1))

  cv_scores = nb_clf.predict(cv_data_final_tfidf)
  cv_f1 = f1_score(cv_output, cv_scores)
  print("for alpha = %f, cv f1 score = %f" % (i, cv_f1))
  
  print("-"*50)

for alpha = 0.000100, train AUC = 1.000000
for alpha = 0.000100, CV AUC = 1.000000
for alpha = 0.000100, train f1 score = 1.000000
for alpha = 0.000100, cv f1 score = 0.952381
--------------------------------------------------
for alpha = 0.001000, train AUC = 1.000000
for alpha = 0.001000, CV AUC = 1.000000
for alpha = 0.001000, train f1 score = 1.000000
for alpha = 0.001000, cv f1 score = 0.952381
--------------------------------------------------
for alpha = 0.010000, train AUC = 1.000000
for alpha = 0.010000, CV AUC = 0.990909
for alpha = 0.010000, train f1 score = 1.000000
for alpha = 0.010000, cv f1 score = 0.952381
--------------------------------------------------
for alpha = 0.100000, train AUC = 1.000000
for alpha = 0.100000, CV AUC = 0.981818
for alpha = 0.100000, train f1 score = 1.000000
for alpha = 0.100000, cv f1 score = 0.857143
--------------------------------------------------
for alpha = 1.000000, train AUC = 1.000000
for alpha = 1.000000, CV AUC = 0.972727
for alpha

### We can observe that alpha = 0.01 having good auc and f1 score with lower differences

In [113]:
  nb_clf = MultinomialNB(alpha = 0.01, fit_prior = True)
  nb_clf.fit(train_data_final_tfidf, train_output )

  
  test_prob = nb_clf.predict_proba(test_data_final_tfidf)[:,1]
  test_AUC = roc_auc_score(test_output, test_prob)
  print("for alpha = %f, test AUC = %f" % (0.01, test_AUC))

  test_scores = nb_clf.predict(test_data_final_tfidf)
  test_f1 = f1_score(test_output, test_scores)
  print("for alpha = %f, test f1 score = %f" % (0.01, test_f1))



for alpha = 0.010000, test AUC = 0.941176
for alpha = 0.010000, test f1 score = 0.846154


In [110]:
print(classification_report(test_output, test_scores, target_names = ['Label 0', 'Label 1']))

              precision    recall  f1-score   support

     Label 0       0.88      0.88      0.88        17
     Label 1       0.85      0.85      0.85        13

    accuracy                           0.87        30
   macro avg       0.86      0.86      0.86        30
weighted avg       0.87      0.87      0.87        30



8.2 Logistic Regression with Word2Vec Encoded Features

In [112]:
c_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
for i in c_range:
  logistic_clf = LogisticRegression(C = i, max_iter =300)
  logistic_clf.fit(train_data_final_w2v, train_output )

  train_prob = logistic_clf.predict_proba(train_data_final_w2v)[:,1]
  train_AUC = roc_auc_score(train_output, train_prob)
  print("for alpha = %f, train AUC = %f" % (i, train_AUC))

  
  cv_prob = logistic_clf.predict_proba(cv_data_final_w2v)[:,1]
  cv_AUC = roc_auc_score(cv_output, cv_prob)
  print("for alpha = %f, CV AUC = %f" % (i, cv_AUC))

  train_scores = logistic_clf.predict(train_data_final_w2v)
  train_f1 = f1_score(train_output, train_scores)
  print("for alpha = %f, train f1 score = %f" % (i, train_f1))

  cv_scores = logistic_clf.predict(cv_data_final_w2v)
  cv_f1 = f1_score(cv_output, cv_scores)
  print("for alpha = %f, cv f1 score = %f" % (i, cv_f1))
  
  print("-"*50)

for alpha = 0.000100, train AUC = 1.000000
for alpha = 0.000100, CV AUC = 0.650000
for alpha = 0.000100, train f1 score = 0.000000
for alpha = 0.000100, cv f1 score = 0.000000
--------------------------------------------------
for alpha = 0.001000, train AUC = 1.000000
for alpha = 0.001000, CV AUC = 0.650000
for alpha = 0.001000, train f1 score = 0.000000
for alpha = 0.001000, cv f1 score = 0.000000
--------------------------------------------------
for alpha = 0.010000, train AUC = 1.000000
for alpha = 0.010000, CV AUC = 0.650000
for alpha = 0.010000, train f1 score = 1.000000
for alpha = 0.010000, cv f1 score = 0.461538
--------------------------------------------------
for alpha = 0.100000, train AUC = 1.000000
for alpha = 0.100000, CV AUC = 0.650000
for alpha = 0.100000, train f1 score = 1.000000
for alpha = 0.100000, cv f1 score = 0.461538
--------------------------------------------------
for alpha = 1.000000, train AUC = 1.000000
for alpha = 1.000000, CV AUC = 0.650000
for alpha

### We can observe that C = 0.1 having auc and f1 score with lower differences

In [114]:
  logistic_clf = LogisticRegression(C = 0.1, max_iter =300)
  logistic_clf.fit(train_data_final_w2v, train_output )

  
  test_prob = logistic_clf.predict_proba(test_data_final_w2v)[:,1]
  test_AUC = roc_auc_score(test_output, test_prob)
  print("for alpha = %f, test AUC = %f" % (0.1, test_AUC))

  test_scores = logistic_clf.predict(test_data_final_w2v)
  test_f1 = f1_score(test_output, test_scores)
  print("for alpha = %f, test f1 score = %f" % (0.1, test_f1))



for alpha = 0.100000, test AUC = 0.585973
for alpha = 0.100000, test f1 score = 0.352941


In [115]:
print(classification_report(test_output, test_scores, target_names = ['Label 0', 'Label 1']))

              precision    recall  f1-score   support

     Label 0       0.62      0.94      0.74        17
     Label 1       0.75      0.23      0.35        13

    accuracy                           0.63        30
   macro avg       0.68      0.59      0.55        30
weighted avg       0.67      0.63      0.57        30



#9. Conclusion

After observing we find that working of word2vec and TFIDF vectorizer and find for the data that TFIDF shows good results 

In [116]:
# #saving the model
# with open("nb.clf_best.pickle", "wb") as fp:
#   pickle.dump(nb_clf_best, fp, protocol = pickle.HIGHEST_PROTOCOL)