# Import main libraries

In [1]:
import pandas as pd
import numpy as np
import re
import pickle
import nltk 
from sklearn.pipeline import Pipeline
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import warnings 
warnings.filterwarnings('ignore')

# Load Data

Read the train and test data using the pd.read_csv function.

In [2]:
train_data=pd.read_csv('xy_train.csv',index_col='id') # read train data (xy_train.csv) file
test_data=pd.read_csv('x_test.csv',index_col='id')    # read test data (x_test.csv) file

In [3]:
# print train data
train_data 

id,text,label
265723,A group of friends began to volunteer at a hom...,0.0
284269,British Prime Minister @Theresa_May on Nerve A...,0.0
207715,"In 1961, Goodyear released a kit that allows P...",0.0
551106,"Happy Birthday, Bob Barker! The Price Is Right...",0.0
8584,"Obama to Nation: ""Innocent Cops and Unarmed Y...",0.0
...,...,...
244580,Convention Crowd Really Hoping Bill Clinton Br...,0.0
309472,"North Korean officials playing in a band, down...",1.0
202030,Deleted scene from Star Wars where the rebel a...,0.0
503084,I bought a grinder at a government-run cannabi...,1.0


Here I found that the text data contains noise in the form of punctuation, mixed letter case, special characters, URLs, and numbers, so we need to clean it to get good modelling results.

# Cleaning and preprocessing text data


**What preprocessing steps are used?**

For good modelling results, we need consistent and clean data. No matter how advanced your model is, the basic principle remains the same: garbage in, garbage out. As a result, data preprocessing is an important stage in developing a machine learning model, and the outcomes depend on how effectively the data has been preprocessed.

Text preprocessing is the initial stage in the NLP model construction process.


Preprocessing steps:

* Removing the URL 

* Removing all non-letter characters (numbers, punctuation, and special characters)

* Convert all characters to lowercase.

* Tokenization.

* Removing stopwords

* Lemmatization

* Stemming 

* Removing words with a length of 2 or less

* Joining the list of tokens back into a string.

**URLs** reference a location on the web but add no useful information for this task.

So it is better to remove them using the re library.
We scan each text and remove every substring that starts with http, up to the next whitespace.

In [4]:
# function removing URLs from text data
def remove_urls(_text):
    return re.sub(r'http\S+','',_text)
train_data['clean_text']=train_data['text'].apply(remove_urls) # apply function on train data
test_data['clean_text']=test_data['text'].apply(remove_urls)   # apply function on test data
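
As a quick check of what this pattern does, the sketch below applies the same regular expression to a made-up headline (the sample string is an assumption, not taken from the dataset). It suggests two things worth knowing: the match runs up to the next whitespace, and links without an http prefix are left untouched.

```python
import re

def remove_urls(text):
    # Delete every run of non-whitespace characters that starts with "http"
    return re.sub(r'http\S+', '', text)

# Hypothetical sample text, for illustration only
sample = "Full story at https://example.com/a?b=1 and www.example.org"
cleaned = remove_urls(sample)
print(cleaned)
```

The double space left behind is harmless here, because later steps split the text on whitespace.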


Remove any numbers (0–9) that aren't important to our analysis.

Punctuation will also be removed. Punctuation is a collection of symbols. [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]

In [5]:
# function removing all non-letter characters (special characters, numbers and punctuation)
# note: despite its name, this function removes digits as well
def remove_non_alphanumeric(_text):
    return re.sub(r'[^a-zA-Z]',' ',_text)
train_data['clean_text']=train_data['clean_text'].apply(remove_non_alphanumeric) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(remove_non_alphanumeric)   # apply function on test data
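
The character class used above keeps only the unaccented letters a-z and A-Z. A minimal sketch on an invented sentence (the helper name keep_letters_only and the sample are assumptions) shows that digits, punctuation, and accented letters are all replaced by spaces:

```python
import re

def keep_letters_only(text):
    # Replace every character outside a-z / A-Z with a single space
    return re.sub(r'[^a-zA-Z]', ' ', text)

# Hypothetical sample text, for illustration only
sample = "In 1961, café prices rose 5%!"
tokens = keep_letters_only(sample).split()
print(tokens)
```

Note that "café" becomes "caf", so this step may truncate non-English words.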

To avoid duplicates, all words are converted to lowercase. If this step is skipped, "Phone" and "phone" will be treated as two different words.

In [6]:
# function converting all uppercase to lowercase
def to_lowercase(_text):
    return str(_text).lower()
train_data['clean_text']=train_data['clean_text'].apply(to_lowercase) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(to_lowercase)  # apply function on test data

**Tokenization** is the process of breaking down a text into smaller pieces known as tokens. Tokens can include words, numbers, punctuation marks, and other symbols.

I use it here to split each sentence into words, which makes it easier to remove stop words.

In [7]:
nltk.download('punkt')
# function to tokenize data
def _tokenization(_text):
   return word_tokenize(_text)
train_data['clean_text']=train_data['clean_text'].apply(_tokenization) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(_tokenization)   # apply function on test data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Stopwords** are the most commonly used words in a language, such as "an", "a", "our", "are", "into", and "they". These words carry little meaning on their own and are frequently removed from text, so we remove them here.

In [8]:
nltk.download('stopwords')
stop_words=set(stopwords.words('english'))

# function removing stop words
def remove_stopwords(token):
    return[item for item in token if item not in stop_words]    # return words which are not stop words 
train_data['clean_text']=train_data['clean_text'].apply(remove_stopwords) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(remove_stopwords)   # apply function on test data

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Lemmatization** uses a vocabulary and morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.[6] In the code below, every token is lemmatized as a verb (pos='v').



In [9]:
nltk.download('wordnet')

lemma=WordNetLemmatizer()

# lemmatization function: treats every token as a verb (pos='v')
def lemmatization(token):
    return [lemma.lemmatize(word=w,pos='v') for w in token]
train_data['clean_text']=train_data['clean_text'].apply(lemmatization)
test_data['clean_text']=test_data['clean_text'].apply(lemmatization)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


**Stemming** is a natural language processing technique that reduces inflected words to a common root (stem) by stripping suffixes. The resulting stem is not always a dictionary word.

In [10]:
stemmer=PorterStemmer()
# stemming function 
def stem(token):
    return[stemmer.stem(i) for i in token] # return  a common base or root of word 
train_data['clean_text']=train_data['clean_text'].apply(stem) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(stem)   # apply function on test data
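
To make the effect of stemming concrete, here is a small sketch that runs NLTK's PorterStemmer on a few words chosen for illustration (the word list is an assumption). The stems it produces are truncated roots rather than dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; their stems also appear in the cleaned output of this notebook
words = ["volunteer", "friends", "happy", "minister"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Stems such as "volunt" and "happi" are why the cleaned text looks clipped; the classifier only needs them to be consistent, not readable.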

Some noise remains in the text after the previous processing steps, so I remove very short words (two characters or fewer).

In [11]:
# function removing the words having length of 2 or less
def length(token):
    return [i for i in token if len(i) > 2] # return words which having length more than 2 characters 
train_data['clean_text']=train_data['clean_text'].apply(length) # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(length)  # apply function on test data

In [12]:
train_data['clean_text'] # text in a form of list, so need to convert it into string form 

id
265723    [group, friend, begin, volunt, homeless, shelt...
284269    [british, prime, minist, theresa, may, nerv, a...
207715    [goodyear, releas, kit, allow, bring, heel, zw...
551106    [happi, birthday, bob, barker, price, right, h...
8584      [obama, nation, innoc, cop, unarm, young, blac...
                                ...                        
244580    [convent, crowd, realli, hop, bill, clinton, b...
309472    [north, korean, offici, play, band, downtown, ...
202030    [delet, scene, star, war, rebel, allianc, atta...
503084    [buy, grinder, govern, run, cannabi, retail, c...
62769     [billionair, richard, branson, offer, makepeac...
Name: clean_text, Length: 40949, dtype: object

In [13]:
# function converting a list of tokens back into a string.
def convert_to_string(list_text):
    return ' '.join(list_text) # return a string of our text 
train_data['clean_text']=train_data['clean_text'].apply(convert_to_string)   # apply function on train data
test_data['clean_text']=test_data['clean_text'].apply(convert_to_string)     # apply function on test data
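
The steps above can be sketched end to end on a single string. This is a simplified, standard-library-only sketch: it splits on whitespace instead of calling word_tokenize, uses a tiny hand-written stop-word set instead of NLTK's list, and skips lemmatization and stemming. The function name clean and the sample sentence are assumptions.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "are", "in", "a"}

def clean(text):
    text = re.sub(r'http\S+', '', text)          # drop URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)       # keep letters only
    tokens = text.lower().split()                # lowercase, then split on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if len(t) > 2]   # drop tokens of length 2 or less
    return ' '.join(tokens)                      # back to a single string

result = clean("Check https://t.co/abc The 3 Cats are in a box!")
print(result)
```

Keeping the whole chain in one function also makes it easier to apply exactly the same cleaning to the train and test data.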

In [14]:
train_data['clean_text'] # after applying  (convert_to_string ) function to convert into a string form

id
265723    group friend begin volunt homeless shelter nei...
284269    british prime minist theresa may nerv attack f...
207715    goodyear releas kit allow bring heel zwillc fi...
551106    happi birthday bob barker price right host lik...
8584      obama nation innoc cop unarm young black men d...
                                ...                        
244580    convent crowd realli hop bill clinton break te...
309472    north korean offici play band downtown pyongya...
202030    delet scene star war rebel allianc attack impe...
503084    buy grinder govern run cannabi retail come wra...
62769     billionair richard branson offer makepeac isla...
Name: clean_text, Length: 40949, dtype: object

In [15]:
pd.DataFrame(train_data) # print train data to show 'text' after apply cleaning and pre processing

id,text,label,clean_text
265723,A group of friends began to volunteer at a hom...,0.0,group friend begin volunt homeless shelter nei...
284269,British Prime Minister @Theresa_May on Nerve A...,0.0,british prime minist theresa may nerv attack f...
207715,"In 1961, Goodyear released a kit that allows P...",0.0,goodyear releas kit allow bring heel zwillc fi...
551106,"Happy Birthday, Bob Barker! The Price Is Right...",0.0,happi birthday bob barker price right host lik...
8584,"Obama to Nation: ""Innocent Cops and Unarmed Y...",0.0,obama nation innoc cop unarm young black men d...
...,...,...,...
244580,Convention Crowd Really Hoping Bill Clinton Br...,0.0,convent crowd realli hop bill clinton break te...
309472,"North Korean officials playing in a band, down...",1.0,north korean offici play band downtown pyongya...
202030,Deleted scene from Star Wars where the rebel a...,0.0,delet scene star war rebel allianc attack impe...
503084,I bought a grinder at a government-run cannabi...,1.0,buy grinder govern run cannabi retail come wra...


# Drop rows

Here I drop the rows whose label equals 2, because this is a binary classification task:

we want to predict whether the title represents fake news (1) or not (0).

In [16]:
# keep a binary label (0 = not fake, 1 = fake): blank out rows whose label is greater than 1
train_data.loc[train_data["label"] > 1] = np.nan  # np.nan, since the np.NaN alias was removed in NumPy 2.0
# drop rows whose cleaned text is empty, then rows with any missing value
train_data = train_data[(train_data["clean_text"] != "") & (train_data["clean_text"] != "null")]
train_data = train_data.dropna(axis="index", subset=["label", "text", "clean_text"]).reset_index(drop=True)
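
The same filtering idea can be written as a single boolean mask. The sketch below uses a small made-up DataFrame (the rows are assumptions) and is an alternative formulation, not the code this notebook ran:

```python
import pandas as pd

# Hypothetical miniature of the training table
df = pd.DataFrame({
    "label": [0.0, 2.0, 1.0, 1.0],
    "clean_text": ["group friend", "some text", "", "north korean"],
})

# Keep rows with a binary label and a non-empty cleaned text
mask = df["label"].isin([0, 1]) & (df["clean_text"].str.strip() != "")
binary = df[mask].reset_index(drop=True)
print(binary)
```

Only the first and last rows survive: one is dropped for its label, the other for its empty text.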

# Descriptive data 



Show the most common and least common words.

Even when dealing with text, descriptive analysis gives a better understanding of the data.


In [17]:
# word frequency: the slice [10:20] shows the words ranked 11th to 20th by frequency
word_freq = pd.Series(" ".join(train_data["clean_text"]).split()).value_counts()
pd.DataFrame(word_freq[10:20].reset_index(name="freq"))

,index,freq
0,peopl,1879
1,trump,1855
2,man,1832
3,color,1822
4,use,1787
5,first,1707
6,old,1621
7,time,1582
8,poster,1562
9,day,1503


In [18]:
# list the least common words
word_freq[-10:].reset_index(name="freq")

,index,freq
0,eppa,1
1,plotlin,1
2,ceylon,1
3,mandala,1
4,walkman,1
5,adden,1
6,shermin,1
7,marquez,1
8,britian,1
9,northkoreap,1


Show the distribution of the label column to check whether the data is balanced.


In [19]:
# distribution of label
train_data["label"].value_counts(normalize=True) 

0.0    0.575723
1.0    0.424277
Name: label, dtype: float64

# Split data into features and target

In [20]:
# split train data into X (features) and y (target)
X=train_data['clean_text'] 
y=train_data['label'] 

# Pipeline: tuning the vectorizer and classification models

**TfidfVectorizer**


**Vectorizer**: used to convert the text data into numerical data.


**TF-IDF** is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common way to transform text into a meaningful numerical representation that can be used to fit a machine learning algorithm for prediction.[1]
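
For reference, the weighting can be written out as follows. This assumes scikit-learn's default TfidfVectorizer settings (smooth_idf=True, norm='l2'); the symbols n (number of documents), df(t) (number of documents containing term t), and tf(t, d) (count of t in document d) are notation introduced here.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean (L2) norm, so rare terms get a larger weight and long documents do not dominate.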

**max_df** is used for removing terms that appear too frequently.

For example: max_df = 0.50 means "ignore terms that appear in more than 50% of the documents"

**min_df** is used for removing terms that appear too infrequently.

For example: min_df = 5 means "ignore terms that appear in fewer than 5 documents"

**ngram_range** sets the lower and upper bound on the n-gram length, where an n-gram is a sequence of n words in a row (or n characters when analyzer="char").

*    ngram_range=(1, 2) means use single words and pairs of words (unigrams and bigrams)
*    ngram_range=(1, 3) means use single words, pairs, and triples of words (unigrams, bigrams, and trigrams).
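
The three parameters above can be checked on a toy corpus. In this sketch the four short documents are invented for illustration; it fits TfidfVectorizer once with min_df and max_df, and once with ngram_range, and lists which terms end up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "news" is in 3 of 4 documents, "fake" and "story" in 2
docs = ["news fake story", "news real story", "news fake claim", "weather report"]

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.5 drops terms found in more than 50% of the documents
filtered = TfidfVectorizer(min_df=2, max_df=0.5).fit(docs)
kept_terms = sorted(filtered.vocabulary_)

# ngram_range=(1, 2) keeps single words and adjacent word pairs
ngrams = TfidfVectorizer(ngram_range=(1, 2)).fit(["fake news spreads fast"])
ngram_terms = sorted(ngrams.vocabulary_)

print(kept_terms)
print(ngram_terms)
```

Only "fake" and "story" pass both frequency filters, and the (1, 2) range adds the three bigrams to the four single words.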




In [21]:
# Pipeline tuning 
# tune model's hyperparameter
# tune Vectorizer 
# here I used (word-level vectorizer)

pipeline = Pipeline([("tfidf", TfidfVectorizer(analyzer="word")), ("lr_classifier", LogisticRegression())])
params = {    
    "tfidf__ngram_range": [(1, 2),(1,3)],                                   # tfidf__ngram_range points to tfidf->ngram_range
    "tfidf__max_df": np.arange(0.3, 0.6),                                   # tfidf->max_df; with the default step of 1 this yields only [0.3]
    "tfidf__min_df": np.arange(5,30 ),                                      # tfidf__min_df points to tfidf->min_df
    'lr_classifier__penalty' : ['l1','l2'],                                 # lr_classifier->penalty; of the solvers below, only 'liblinear' supports 'l1'
    'lr_classifier__C' : np.logspace(-1,1,10),                              # lr_classifier__C points to lr_classifier->C
    'lr_classifier__solver': [ 'liblinear','newton-cg', 'lbfgs']            # lr_classifier__solver points to lr_classifier->solver
}
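
It can help to count how many candidates this search space contains before running it. The sketch below rebuilds the same dictionary and counts the combinations with scikit-learn's ParameterGrid; it also shows that np.arange(0.3, 0.6), with its default step of 1, produces a single max_df value.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "tfidf__max_df": np.arange(0.3, 0.6),    # default step is 1, so only 0.3 is generated
    "tfidf__min_df": np.arange(5, 30),       # 25 integer values, 5 to 29
    "lr_classifier__penalty": ["l1", "l2"],
    "lr_classifier__C": np.logspace(-1, 1, 10),
    "lr_classifier__solver": ["liblinear", "newton-cg", "lbfgs"],
}

max_df_values = params["tfidf__max_df"].tolist()
n_candidates = len(ParameterGrid(params))    # 2 * 1 * 25 * 2 * 10 * 3
print(max_df_values, n_candidates)
```

To search several max_df values, an explicit list such as [0.3, 0.4, 0.5] would be one option.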

# What is the experimental protocol used and how was it carried out?

Here I used a single held-out validation set with both random search and grid search.

A validation set is a set of data that is kept separate from the training set and is used to evaluate the model's performance while tuning it.
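
In scikit-learn, a fixed validation set can be expressed with PredefinedSplit. The sketch below uses a made-up list of five samples to show the convention: -1 marks a sample that always stays in training, and 0 marks a member of the single validation fold.

```python
from sklearn.model_selection import PredefinedSplit

# -1 = always in the training part, 0 = member of validation fold 0
test_fold = [-1, -1, 0, -1, 0]
pds = PredefinedSplit(test_fold=test_fold)

n_splits = pds.get_n_splits()
train_idx, val_idx = next(pds.split())
print(n_splits, train_idx.tolist(), val_idx.tolist())
```

Because there is only one split, a search object that receives cv=pds fits each candidate once instead of once per cross-validation fold.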

# Logistic regression model

Logistic Regression is very effective on text data, and the underlying algorithm is also fairly easy to understand. More importantly, in the NLP world, it's generally accepted that Logistic Regression is a great starter algorithm for text-related classification.

**Hyperparameter Tuning** 

Hyperparameters are an essential aspect of the machine learning process, as they control the overall behaviour of a model. The ultimate goal is to find the combination of hyperparameters that gives the best results.

So I tuned the following Logistic Regression hyperparameters: **(penalty, C, solver)**

**penalty**: its possible values are 'l1', 'l2', 'elasticnet', 'none'.

Penalty specifies the norm used in the penalization.

The default value is 'l2'.

**C** is the inverse of the regularization strength in Logistic Regression; smaller values mean stronger regularization.
Here it is searched over 10 log-spaced values between 0.1 and 10 (np.logspace(-1,1,10)).

**solver** selects the algorithm used for optimization.
The values tried here are 'liblinear', 'newton-cg', 'lbfgs'.
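
Not every solver supports every penalty: in scikit-learn, 'liblinear' accepts both 'l1' and 'l2', while 'newton-cg' and 'lbfgs' accept only 'l2' (or no penalty). One way to avoid spending fits on unsupported pairs is to pass a list of dictionaries, as in this sketch. The grid below is an illustration, not the grid used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-1, 1, 10)   # 10 values from 0.1 to 10, evenly spaced on a log scale

# Pair each solver only with penalties it supports
lr_grid = [
    {"penalty": ["l1", "l2"], "solver": ["liblinear"], "C": C_values},
    {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs"], "C": C_values},
]
combos = list(ParameterGrid(lr_grid))
print(len(combos))
```

With a single flat dictionary, the unsupported pairs are still generated; their fits fail and, under the default error_score, receive a NaN score.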

### Trial_1
**Using Logistic regression model and grid search method (word-level vectorizer)**


In **grid search**, we test every combination of values from a pre-defined list of hyperparameters and pick the best one based on the cross-validation score or the validation-set score.

**Expectation**

After cleaning and preprocessing the data using several techniques 
(removing URLs, removing all non-letter characters (numbers and punctuation), converting all characters to lowercase, tokenization, removing stopwords, lemmatization, stemming, removing words with a length of 2 or less) and using TfidfVectorizer (word-level)

I tried grid search with a Logistic Regression classifier, tuning its hyperparameters together with the word-level vectorizer, to get high performance.

My expectation is that the score may be high.

**Observation** 

After running the model, I found that the best hyperparameters selected by the grid search method are



        best score 0.8730260693722905 ---> validation ROC AUC reported by the search (scoring='roc_auc')
        best hyperparameter {'lr_classifier__C': 2.1544346900318834, 'lr_classifier__penalty': 'l2', 'lr_classifier__solver': 'newton-cg', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3)}

        Score :  0.83985 on kaggle



**Plan** 

I will try the random search method with the Logistic Regression model and the same hyperparameters in trial_2


In [None]:
# Further split the original training set into a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size = 0.8, stratify = y, random_state = 42)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Grid search evaluates every candidate on the single predefined validation fold
grid_search_lr = GridSearchCV(
    pipeline, params, cv=pds, verbose=1, n_jobs=2, 
    scoring='roc_auc')

grid_search_lr.fit(X,y)

print('best score {}'.format(grid_search_lr.best_score_))
print('best hyperparameter {}'.format(grid_search_lr.best_params_))

Fitting 1 folds for each of 3000 candidates, totalling 3000 fits
best score 0.8730260693722905
best hyperparameter {'lr_classifier__C': 2.1544346900318834, 'lr_classifier__penalty': 'l2', 'lr_classifier__solver': 'newton-cg', 'tfidf__max_df': 0.3, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 3)}
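
Because the search is configured with scoring='roc_auc', the "best score" printed above is a ROC AUC value on the validation fold rather than an accuracy. The two metrics can differ noticeably, as this sketch with invented labels and probabilities suggests:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.2, 0.4]

auc = roc_auc_score(y_true, y_prob)           # how well positives are ranked above negatives
y_pred = [int(p >= 0.5) for p in y_prob]      # hard labels at a 0.5 threshold
acc = accuracy_score(y_true, y_pred)
print(auc, acc)
```

Here three of the four positive-negative pairs are ranked correctly (AUC 0.75), yet every sample falls below the 0.5 threshold, so accuracy is only 0.5.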


In [None]:
submission = pd.DataFrame()

submission['id'] = test_data.index

submission['label'] = grid_search_lr.predict_proba(test_data.clean_text)[:,1]

submission.to_csv('trial1_Logistic_grid_word.csv', index=False)

# Score: 0.83985 on kaggle

### Trial_2
**Using Logistic regression model and random search method (word-level vectorizer)**

**Random search** is a technique where random combinations of the hyperparameters are sampled from the search space and evaluated, instead of trying every combination. It is much cheaper than grid search and often finds a comparably good configuration.
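
The difference in cost between the two search methods can be sketched with scikit-learn's ParameterGrid and ParameterSampler. The small search space below is an assumption chosen for illustration, not the one used in the trials.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space
space = {
    "min_df": list(range(5, 30)),
    "C": [0.1, 1.0, 10.0],
    "solver": ["liblinear", "lbfgs"],
}

all_candidates = list(ParameterGrid(space))                          # every combination
sampled = list(ParameterSampler(space, n_iter=10, random_state=0))   # 10 random draws
print(len(all_candidates), len(sampled))
```

Grid search would fit all 150 candidates, while random search with n_iter=10 fits only 10 of them, which is why it runs faster but may miss the best combination.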

**Expectation**

After cleaning and preprocessing the data using several techniques 
(removing URLs, removing all non-letter characters (numbers and punctuation), converting all characters to lowercase, tokenization, removing stopwords, lemmatization, stemming, removing words with a length of 2 or less) and using TfidfVectorizer (word-level)

I tried random search with a Logistic Regression classifier and the same set of hyperparameters, which must be tuned to get high performance from the model.

My expectation is that I may get a high score and that the search will run faster than grid search.


**Observation** 

After running the model, I found that the best hyperparameters selected by the random search method with TfidfVectorizer are

        best score 0.8650467370071189
        best hyperparameter {'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 17, 'tfidf__max_df': 0.3, 'lr_classifier__solver': 'liblinear', 'lr_classifier__penalty': 'l2', 'lr_classifier__C': 0.774263682681127}

        Score : 0.83719 on kaggle

**Plan**

I will try the same search method with the Logistic Regression model, but with a character-level vectorizer, in trial_3

In [None]:
# Further split the original training set into a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size = 0.8, stratify = y, random_state = 42)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Random search evaluates n_iter sampled candidates on the single validation fold
random_search_lr = RandomizedSearchCV(
    pipeline, params, cv=pds, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

random_search_lr.fit(X,y)

print('best score {}'.format(random_search_lr.best_score_))
print('best hyperparameter {}'.format(random_search_lr.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8650467370071189
best hyperparameter {'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 17, 'tfidf__max_df': 0.3, 'lr_classifier__solver': 'liblinear', 'lr_classifier__penalty': 'l2', 'lr_classifier__C': 0.774263682681127}


In [None]:
submission = pd.DataFrame()

submission['id'] = test_data.index

submission['label'] = random_search_lr.predict_proba(test_data.clean_text)[:,1]

submission.to_csv('trial2_Logistic_random_word.csv', index=False)

# Score: 0.83719 on kaggle

### Trial_3
**Using Logistic regression model and random search method (character-level vectorizer)**

**Expectation**

After cleaning and preprocessing the data using several techniques (removing URLs, removing all non-letter characters (numbers and punctuation), converting all characters to lowercase, tokenization, removing stopwords, lemmatization, stemming, removing words with a length of 2 or less) and using TfidfVectorizer (character-level)

I tried random search with a Logistic Regression classifier and the set of hyperparameters that must be tuned to get high performance from the model.

My expectation is that the score will not be higher than with the word-level vectorizer.

**Observation**

After running the model, I found that the score is lower than with the word-level vectorizer, and the best hyperparameters selected by the random search method with TfidfVectorizer are

        best score 0.8398320275772793
        best hyperparameter {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 7, 'tfidf__max_df': 0.3, 'lr_classifier__solver': 'newton-cg', 'lr_classifier__penalty': 'l2', 'lr_classifier__C': 3.593813663804626}

        Score : 0.77987 on kaggle
When using analyzer="char", the model's score became lower than when using analyzer="word".

**Plan**

I will try the random search method with an XGBoost model (word-level vectorizer) in trial_4

In [None]:
# Pipeline tuning 
# tune model's hyperparameter
# tune Vectorizer 
# here I used (char-level vectorizer)

pipeline = Pipeline([("tfidf", TfidfVectorizer(analyzer="char")), ("lr_classifier", LogisticRegression())]) 

params = {    
    "tfidf__ngram_range": [(1, 2),(1,3)],                                   # tfidf__ngram_range points to tfidf->ngram_range
    "tfidf__max_df": np.arange(0.3, 0.6),                                   # tfidf->max_df; with the default step of 1 this yields only [0.3]
    "tfidf__min_df": np.arange(5,30 ),                                      # tfidf__min_df points to tfidf->min_df
    'lr_classifier__penalty' : ['l1','l2'],                                 # lr_classifier->penalty; of the solvers below, only 'liblinear' supports 'l1'
    'lr_classifier__C' : np.logspace(-1,1,10),                              # lr_classifier__C points to lr_classifier->C
    'lr_classifier__solver': [ 'liblinear','newton-cg', 'lbfgs']            # lr_classifier__solver points to lr_classifier->solver
}


In [None]:

# Further split the original training set into a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y, random_state = 40)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
random_search_lr_char = RandomizedSearchCV(
    pipeline, params, cv=pds, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

random_search_lr_char.fit(X,y)

print('best score {}'.format(random_search_lr_char.best_score_))
print('best hyperparameter {}'.format(random_search_lr_char.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8398320275772793
best hyperparameter {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 7, 'tfidf__max_df': 0.3, 'lr_classifier__solver': 'newton-cg', 'lr_classifier__penalty': 'l2', 'lr_classifier__C': 3.593813663804626}


In [None]:
submission = pd.DataFrame()

submission['id'] = test_data.index

submission['label'] = random_search_lr_char.predict_proba(test_data.clean_text)[:,1]

submission.to_csv('trial3_Logistic_random_char.csv', index=False)

# best score 0.8221684598839467
# best hyperparameter {'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 15, 'tfidf__max_df': 0.3, 'lr_classifier__solver': 'newton-cg', 'lr_classifier__penalty': 'l2', 'lr_classifier__C': 0.2782559402207124}

# XGBoost model


XGBoost is a library for building gradient-boosted tree models that are both fast and high-performing. It often offers a good combination of prediction performance and processing time compared to other algorithms, and it has performed well on a variety of difficult machine learning tasks because it is robust to many kinds of irregularities in the data. So I will use this model to try to get good predictions.


**To get the full benefit of the XGBoost model, hyperparameter tuning is a must.**

It is very difficult to get answers to practical questions like

*   Which set of parameters should you tune?
*   What is the ideal value of these parameters to obtain optimal output?

So I tuned the following XGBoost hyperparameters: **(learning_rate, max_depth, gamma, n_estimators)**

**learning_rate:** Step size shrinkage used in each update to prevent overfitting.[3]

Its value must be between 0 and 1; smaller values make each boosting step more conservative.

I chose the list of values [0.01,0.05,0.1],

which lies in the range from 0.01 to 0.3.

Its default is 0.3.




**max_depth:** the maximum depth of each tree.

If I use a high max_depth, performance might increase, but the model's complexity also increases, which can lead to overfitting. On the other hand, a very small max_depth prevents the model from learning well and leads to underfitting.
So I tried a list of max_depth values to find the one that gives the best performance.

I chose the list of values [60,70,75]

**gamma:** a pseudo-regularisation parameter (Lagrangian multiplier) that depends on the other parameters. The higher gamma is, the stronger the regularization. It can be any non-negative number. Default is 0 [4]

I chose the list of values [0.1, 0.2, 0.3]


**n_estimators:** the number of trees in our ensemble.

Why do we tune this hyperparameter in our model?

The reason lies in the way the boosted tree model is constructed: sequentially, where each new tree attempts to model and correct the errors made by the sequence of previous trees. [5]
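
This sequential construction is often written as an additive model. In the sketch below, the symbols are notation introduced here: $F_m$ is the ensemble after $m$ trees, $f_m$ is the $m$-th tree, $\eta$ corresponds to learning_rate, and $M$ corresponds to n_estimators.

$$
F_m(x) = F_{m-1}(x) + \eta\, f_m(x), \qquad m = 1, \dots, M
$$

Each $f_m$ is fitted to reduce the loss of $F_{m-1}$, so a smaller $\eta$ usually calls for a larger $M$ to reach the same fit.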

I chose the list of values [150,200,250]

Its default is 100.


In [None]:
# Pipeline tuning 
# tune model's hyperparameter
# tune Vectorizer 
# here I used (word-level vectorizer)

pipeline = Pipeline([("tfidf", TfidfVectorizer(analyzer="word")), ("xgb_classifier",XGBClassifier())])

params = {    
    "tfidf__ngram_range": [(1, 2),(1,3)],                    # tfidf__ngram_range points to tfidf->ngram_range
    "tfidf__max_df": np.arange(0.3, 0.5),                    # tfidf->max_df; with the default step of 1 this yields only [0.3]
    "tfidf__min_df": np.arange(5,30),                        # tfidf__min_df points to tfidf->min_df
    'xgb_classifier__learning_rate' : [0.01,0.05,0.1],       # xgb_classifier__learning_rate points to xgb_classifier->learning_rate
    'xgb_classifier__max_depth' : [60,70,75],                # xgb_classifier__max_depth points to xgb_classifier->max_depth 
    'xgb_classifier__gamma': [ 0.1, 0.2 , 0.3],              # xgb_classifier__gamma points to xgb_classifier->gamma
    'xgb_classifier__n_estimators': [150,200,250]            # xgb_classifier__n_estimators points to xgb_classifier->n_estimators
}


### Trial_4
**Using XGBoost model and random search method (word-level vectorizer)**

**Expectation**

After cleaning and preprocessing the data using several techniques 
(removing URLs, removing all non-letter characters (numbers and punctuation), converting all characters to lowercase, tokenization, removing stopwords, lemmatization, stemming, removing words with a length of 2 or less) and using TfidfVectorizer (word-level)

I tried random search with an XGBoost classifier and the set of hyperparameters that must be tuned to get high performance from the model.

My expectation is that I may get a high score and find the hyperparameters the model needs to perform well.

**Observation**

After running the model, I found that the best hyperparameters selected by the random search method are


        best score 0.8563903704660246
        best hyperparameter {'xgb_classifier__n_estimators': 250, 'xgb_classifier__max_depth': 60, 'xgb_classifier__learning_rate': 0.1, 'xgb_classifier__gamma': 0.1, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 15, 'tfidf__max_df': 0.3}
        Score: 0.80346 on kaggle

**Plan**

I will try the grid search method with a RandomForest model and analyzer = char

In [None]:
# Further split the original training set into a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y, random_state = 40)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
random_xgb_word = RandomizedSearchCV(
    pipeline, params, cv=pds, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

random_xgb_word.fit(X,y)

print('best score {}'.format(random_xgb_word.best_score_))
print('best hyperparameter {}'.format(random_xgb_word.best_params_))

Fitting 1 folds for each of 10 candidates, totalling 10 fits
best score 0.8551628296279068
best hyperparameter {'xgb_classifier__n_estimators': 250, 'xgb_classifier__max_depth': 60, 'xgb_classifier__learning_rate': 0.1, 'xgb_classifier__gamma': 0.3, 'tfidf__ngram_range': (1, 3), 'tfidf__min_df': 19, 'tfidf__max_df': 0.3}


In [None]:
submission = pd.DataFrame()

submission['id'] = test_data.index

submission['label'] = random_xgb_word.predict_proba(test_data.clean_text)[:,1]

submission.to_csv('trial4_xgb_random_word.csv', index=False)

# RandomForest model

### Trial_5
**Using RandomForest model and grid search method (character-level vectorizer)**

**Random Forest Classifier** is a classification algorithm made up of several decision trees. The algorithm uses randomness to build each individual tree so that the trees are largely uncorrelated, and then combines the trees' predictions to make a more accurate decision.

Hyperparameters in random forest are used either to increase the predictive power of the model or to make the model faster.

**n_estimators** is the number of trees the algorithm builds before taking the majority vote or the average of the predictions. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation. [7]

In [None]:
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([("tfidf", TfidfVectorizer(analyzer="char")), ("randonforest_classifier",RandomForestClassifier())])

params = {"tfidf__ngram_range": [(1, 2),(1,3)],                    # tfidf__ngram_range points to tfidf->ngram_range
          "tfidf__max_df": np.arange(0.3, 0.5),                    # tfidf->max_df; with the default step of 1 this yields only [0.3]
          "tfidf__min_df": np.arange(5,15),                        # tfidf__min_df points to tfidf->min_df
          'randonforest_classifier__n_estimators': [250]           # randonforest_classifier__n_estimators points to randonforest_classifier->n_estimators
}

**Expectation**

After cleaning and preprocessing the data using several techniques (removing URLs, removing all non-essential characters (numbers and punctuation), converting all characters to lowercase, tokenization, removing stopwords, lemmatization, and removing words with a length of 2) and using TfidfVectorizer (character-level), I tried grid search with the Random Forest classifier model and its hyperparameter (n_estimators), which must be tuned to get high performance with the Random Forest model.

My expectation is that I will get a high accuracy and find the best hyperparameters that the model needs to reach high performance.

**Observation**

After running the model, I found that the best hyperparameters selected by the grid search method are

      best score 0.8151476406921723
      best hyperparameter {'randonforest_classifier__n_estimators': 250, 'tfidf__max_df': 0.3, 'tfidf__min_df': 9, 'tfidf__ngram_range': (1, 3)}
 
      Score: 0.84910 on kaggle

In [None]:
#Further split the original training set to a train and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size = 0.8, stratify = y, random_state = 42)

# Create a list where train data indices are -1 and validation data indices are 0
# (-1 keeps a row in the training set, 0 assigns it to the single validation fold)
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

# Exhaustive grid search over params, scored by ROC AUC on the predefined validation fold
grid_search_random = GridSearchCV(
    pipeline, params, cv=pds, verbose=1, n_jobs=2,
    scoring='roc_auc')

grid_search_random.fit(X,y)

print('best score {}'.format(grid_search_random.best_score_))
print('best hyperparameter {}'.format(grid_search_random.best_params_))

Fitting 1 folds for each of 4050 candidates, totalling 4050 fits
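The "Fitting 1 folds" message follows from how PredefinedSplit reads the fold labels: rows marked -1 are never used for validation, and every other distinct label becomes one validation fold. A minimal sketch with made-up labels (`toy_split_index` and `toy_pds` are hypothetical names, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold labels: -1 keeps a row in training for every split, 0 puts it in validation fold 0
toy_split_index = np.array([-1, -1, 0, -1, 0, -1])
toy_pds = PredefinedSplit(test_fold=toy_split_index)

print('number of folds:', toy_pds.get_n_splits())
for train_idx, val_idx in toy_pds.split():
    print('train rows:', train_idx, 'validation rows:', val_idx)
```

Because only one label other than -1 appears, the search evaluates each candidate on a single fixed train/validation split instead of on several cross-validation folds.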


In [None]:
submission = pd.DataFrame()

submission['id'] = test_data.index

submission['label'] = grid_search_random.predict_proba(test_data.clean_text)[:,1]

submission.to_csv('trial5_randomforest_char.csv', index=False)

# Problem Formulation

**Define the problem** 

Because of the growth of social networks and their involvement in spheres such as politics, false information on the Internet has caused many social problems. So we need to predict whether a specific Reddit post is fake news or not by looking at its title. The provided data is raw (it contains various forms of words), so we need to apply text preprocessing techniques to clean it and make a good prediction.

**What is the input?**

The input is text data containing the title of the news post for which we want to predict whether it is fake news or not.

**What is the output?**

The output is a prediction of whether a specific Reddit post is fake news or not, based on its title.

**What data mining function is required?** 

The data mining function is binary classification

**What could be the challenges?**

The challenges are:

The data is raw (it contains various forms of words) and needs text preprocessing techniques applied to it.

We need to implement a model to predict whether a post is fake or not, and search for the hyperparameters that give the highest performance.

**What is the impact?** 

The impact is a model that can predict whether a post is fake news or not based only on the title of the news, which may have no impact.


**What is an ideal solution?** 

The best score I got was with the Random Forest model and the grid search method.

# The Questions :

**1- What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?**

**Character-level n-grams** divide a text into sequences of characters. They require far less storage space and, as a result, carry much less information.

**Word-level n-grams** divide a sentence into sequences of words. They may serve the same purposes and much more, but they need much more storage.

Word n-grams are more prone to OOV (Out-Of-Vocabulary) problems, because new terms can exist in the testing dataset that do not appear in the training dataset.

Character tokenizers handle OOV words gracefully by preserving the word's information: they decompose the OOV word into characters and represent it using those characters.
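The OOV difference can be sketched with two CountVectorizer instances, one per analyzer. The titles below are invented for illustration and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training titles; "governments" never appears in them as a whole word
train_titles = ["government announces new policy", "scientists discover new planet"]
unseen_title = ["governments"]

word_vec = CountVectorizer(analyzer="word").fit(train_titles)
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit(train_titles)

# Sum of all feature counts produced for the unseen title
word_hits = word_vec.transform(unseen_title).sum()
char_hits = char_vec.transform(unseen_title).sum()
print('word-level features matched:', word_hits)
print('char-level features matched:', char_hits)
```

The word-level vectorizer has no feature for the unseen word, so its vector is all zeros, while the character-level vectorizer still matches the character n-grams it shares with "government".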



---



**2- What is the difference between stop word removal and stemming? Are these techniques language-dependent?**

**Stop word removal:** stop words are the most common words in a text and provide little useful information. Examples are they, there, this, and where. Stop word removal deletes these words from the text.

**Stemming:** stemming is the process of removing part of a word to reduce it to its stem or root.

Yes, these techniques are considered language-dependent.
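A minimal sketch of both steps, assuming NLTK is installed. The stop word set here is a tiny hand-picked list for illustration only (the notebook itself imports NLTK's English stopwords corpus), and the sentence is invented.

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked English stop word list (an assumption for this example)
toy_stop_words = {"they", "there", "this", "where", "were", "to", "the"}

tokens = "they were running to the stores where the cats jumped".split()

# Stop word removal: drop whole words that carry little information
kept = [t for t in tokens if t not in toy_stop_words]

# Stemming: strip suffixes so that related word forms share one stem
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in kept]

print(kept)
print(stems)
```

Both pieces are tied to English: the stop word list contains English words, and the Porter stemmer applies suffix rules written for English, which is why another language needs its own list and its own stemmer.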



---

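**3- Are tokenization techniques language-dependent? Why?**

Yes, tokenization techniques are highly language-dependent, because the rules for splitting text into tokens come from the language itself. For example, English words can mostly be separated on whitespace and punctuation, while languages such as Chinese are written without spaces between words.

Tokenization in natural language processing is similar to how programming languages are processed: raw code is broken down into tokens, which are then combined using logic (the program's grammar).

By breaking text up into small, known fragments, we can apply a modest set of rules to combine them into some bigger meaning.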
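A small illustration of the language dependence, using plain whitespace splitting as the tokenizer. The two example sentences are assumptions chosen for this sketch; the Chinese one says roughly the same as the English one.

```python
english_title = "I like green tea"
chinese_title = "我喜欢绿茶"  # written without spaces between the words

# str.split() with no arguments splits on runs of whitespace
english_tokens = english_title.split()
chinese_tokens = chinese_title.split()

print(len(english_tokens), english_tokens)
print(len(chinese_tokens), chinese_tokens)
```

Whitespace splitting recovers the individual English words but returns the whole Chinese sentence as a single token, so a tokenizer that works for one language cannot simply be reused for another.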


---



**4- What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?**

TfidfVectorizer and CountVectorizer are both methods for turning text input into vectors, because the model can only handle numerical data. CountVectorizer only counts the number of times a word appears in the document, which biases the result in favour of the most frequent words.

TF-IDF is preferable to CountVectorizer in that it considers not only the frequency of words in a document, but also their importance across the corpus.

No, it is not feasible to use all possible n-grams. Which n-grams are useful depends on the application, so we should experiment with different n-gram ranges on our data to see which one performs best in the model.
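Both points can be sketched on a toy corpus (the three sentences and all variable names are invented for this example). The first part compares raw counts with TF-IDF weights, and the second part shows how the vocabulary grows with the n-gram range and how min_df, one of the parameters tuned in the trials above, prunes it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

counts = CountVectorizer().fit(toy_corpus)
tfidf = TfidfVectorizer().fit(toy_corpus)

# Compare the frequent word "the" with the rare word "mat" in the first document
first_counts = counts.transform(toy_corpus[:1]).toarray()[0]
first_tfidf = tfidf.transform(toy_corpus[:1]).toarray()[0]
count_ratio = first_counts[counts.vocabulary_["the"]] / first_counts[counts.vocabulary_["mat"]]
tfidf_ratio = first_tfidf[tfidf.vocabulary_["the"]] / first_tfidf[tfidf.vocabulary_["mat"]]
print('the/mat ratio with counts:', count_ratio)
print('the/mat ratio with tf-idf:', round(tfidf_ratio, 3))

# Vocabulary size for growing n-gram ranges, with and without a min_df threshold
full_sizes, pruned_sizes = {}, {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    full = CountVectorizer(ngram_range=ngram_range).fit(toy_corpus)
    pruned = CountVectorizer(ngram_range=ngram_range, min_df=2).fit(toy_corpus)
    full_sizes[ngram_range] = len(full.vocabulary_)
    pruned_sizes[ngram_range] = len(pruned.vocabulary_)
    print(ngram_range, 'all n-grams:', full_sizes[ngram_range],
          'with min_df=2:', pruned_sizes[ngram_range])
```

TF-IDF shrinks the advantage of "the" over "mat" because "the" occurs in every document, and the vocabulary keeps growing as longer n-grams are allowed, which is why the trials restrict it with ngram_range, min_df and max_df instead of keeping every n-gram.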

# References


[1] https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

[2] https://blog.ekbana.com/pre-processing-text-in-python-ad13ea544dae

[3] https://xgboost.readthedocs.io/en/latest/parameter.html

[4] https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663

[5] https://analyticsindiamag.com/why-is-random-search-better-than-grid-search-for-machine-learning/

[6] https://medium.com/analytics-vidhya/text-preprocessing-for-nlp-natural-language-processing-beginners-to-master-fd82dfecf95

[7] https://builtin.com/data-science/random-forest-algorithm