## 1) Import the necessary libraries and get the machine ready

1) Pandas- open source data analysis library for providing easy-to-use data structures and data analysis tools

2) Numpy - general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays. It is the fundamental package for scientific computing with Python.

3) BeautifulSoup- Python library for pulling data out of HTML and XML files.

4) unicodedata- This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. 

5) contractions- Expands contractions, for example `you're` to `you are`

6) re- This module provides regular expression matching operations similar to those found in Perl.

7) nltk- NLTK stands for Natural Language Toolkit. It is one of the most widely used NLP libraries and bundles modules for tasks such as tokenization, stemming, lemmatization, punctuation handling, and character and word counts.

8) RegexpTokenizer- splits a string into substrings using a regular expression.

9) WordNetLemmatizer- Lemmatizes using WordNet's built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet.

10) CountVectorizer- Convert a collection of text documents to a matrix of token counts

11) TfidfVectorizer- Convert a collection of raw documents to a matrix of TF-IDF features.

12) sklearn.model_selection- Module that provides `train_test_split` (splits arrays or matrices into random train and test subsets) and cross-validation utilities such as `KFold`

In [1]:
#python -m pip install --upgrade pip

In [2]:
#!pip install contractions

In [3]:
#!pip install gensim

In [4]:
#!pip install scikit-plot

In [5]:
#!pip install tensorflow

In [None]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import unicodedata
import contractions
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import sklearn.model_selection

## 2) Analyze the dataset

In [7]:
# read file into pandas
df=pd.read_excel("dataset2_11k.xlsx")

In [8]:
# examine the shape (rows, columns)
df.shape

(10743, 2)

In [9]:
# examine the first 5 rows
df.head()

Unnamed: 0,title,result
0,- Pandemonium In Aba As Woman Delivers Baby W...,0
1,#OTRAMETLIFE ' I SWEAR TO GOD I DIDNT EVEN RE...,0
2,Dear Lord Forgive Me For Body Bagging @MeekMi...,0
3,no pharrell only YOU can prevent forest fire...,0
4,-- small bag from the bottom the wounded hero ...,0


In [10]:
df.tail()

Unnamed: 0,title,result
10738,Zouma has just absolutely flattened that guy ??,0
10739,Zouma! Runaway train. Absolutely flattened the...,0
10740,ו_? New Ladies Shoulder Tote #Handbag Faux Lea...,0
10741,ו₪} New Ladies Shoulder Tote #Handbag Faux Lea...,0
10742,וָMGN-AFRICAו¨ pin:263789F4 וָ Correction: Ten...,0


In [11]:
# examine the class distribution
df.result.value_counts()

result
0    8230
1    2513
Name: count, dtype: int64

In [12]:
# generate summary statistics, excluding NaN values
df.describe()

Unnamed: 0,result
count,10743.0
mean,0.23392
std,0.423341
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [13]:
# concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10743 entries, 0 to 10742
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   10743 non-null  object
 1   result  10743 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 168.0+ KB


In [14]:
# check for missing values
df.isnull().sum()

title     0
result    0
dtype: int64

## 3) Preprocessing of data
To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. 

Natural Language Processing (NLP) is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human languages.

The workflow that will be followed is:

#### 1) Change to lower case:
Lowercasing is one of the simplest and most effective preprocessing steps. Variation in input capitalization can give us different output. For example, `Canada` and `canada` mean the same thing, but a model treats them as two distinct tokens unless the text is lowercased first.

#### 2) Remove HTML tags: 
Since the data was obtained through web scraping, the text might contain a lot of noise. HTML tags do not add much value towards understanding and analyzing text, so we will remove them.

#### 3) Remove accented characters:
A text corpus often contains accented characters/letters. If you only want to analyze English text, these characters should be converted and standardized into ASCII characters. A simple example: converting é to e.

#### 4) Remove URL:
URLs are of no use for this task, so we will remove them by replacing each one with an empty string.

#### 5) Expanding contractions:
Contractions are shortened versions of words or syllables. In English, they are often created by removing one of the vowels from the word. Examples are `do not` to `don't` and `I would` to `I'd`. Converting each contraction to its expanded, original form helps with text standardization.

#### 6) Removing special characters (punctuations, hashtags and @):
Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

#### 7) Tokenization:
This breaks up the strings into a list of words or pieces based on a specified pattern using Regular Expressions, aka RegEx. The pattern used here (`r'\w+'`) matches runs of word characters, so it also drops any remaining punctuation, which suits this data in particular.

#### 8) Remove stop words:
We imported a list of the most frequently used words from NLTK with `from nltk.corpus import stopwords`. There are 179 English words in the list (the exact count can differ between NLTK releases), including 'i', 'me', 'my', 'myself', 'we', 'you', 'he', 'his', for example. We usually want to remove these because they have low predictive power.

#### 9) Lemmatization or stemming
Both tools reduce words to a root form. Stemming is a little more aggressive: it cuts off common prefixes and/or endings of words by rule. It can sometimes be helpful, but not always, because the result is often so truncated that it is no longer a real word and loses its meaning. Lemmatizing, on the other hand, maps the inflected forms of a word to one base form. Unlike stemming, it aims to return a proper word that can be found in the dictionary.
We will compare the two to see which one works better. For stemming we will use the Snowball stemmer, which is generally regarded as an improvement over the Porter stemmer.
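
As a rough illustration of steps 1 to 7 on a single string, here is a minimal sketch that uses only the Python standard library. The `CONTRACTIONS` map, the `clean_title` helper and the sample title are invented for this example; the notebook itself uses BeautifulSoup, the `contractions` package and NLTK for the same steps. Stop word removal and lemmatization/stemming are left out because they need NLTK data.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction map used only in this sketch.
CONTRACTIONS = {"didn't": "did not", "i've": "i have", "you're": "you are"}

def clean_title(text):
    text = text.lower()                                    # 1) lowercase
    text = re.sub(r"<[^>]+>", " ", text)                   # 2) crude HTML tag removal
    text = unicodedata.normalize("NFKD", text)             # 3) split accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")  #    and drop the accents
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # 4) remove URLs
    text = " ".join(CONTRACTIONS.get(w, w) for w in text.split())  # 5) expand contractions
    text = re.sub(r"@\w+", " ", text)                      # 6) remove @mentions
    text = re.sub(r"[^0-9a-z \t]", " ", text)              #    and other special characters
    return re.findall(r"\w+", text)                        # 7) tokenize

sample = "<b>Café</b> FLOOD: I didn't see @user https://t.co/abc #rescue"
tokens = clean_title(sample)
print(tokens)
# ['cafe', 'flood', 'i', 'did', 'not', 'see', 'rescue']
```

The order matters: contractions are expanded before special characters are removed, otherwise the apostrophe in `didn't` would already be gone.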

In [15]:
# Lowercasing
df['title']=df['title'].str.lower()
df.head()

Unnamed: 0,title,result
0,- pandemonium in aba as woman delivers baby w...,0
1,#otrametlife ' i swear to god i didnt even re...,0
2,dear lord forgive me for body bagging @meekmi...,0
3,no pharrell only you can prevent forest fire...,0
4,-- small bag from the bottom the wounded hero ...,0


In [16]:
# Remove HTML tags
def remove_html(text):
    # 'html.parser' ships with Python, so no extra parser library (such as lxml) is needed
    soup=BeautifulSoup(text,'html.parser')
    html_free=soup.get_text()
    return html_free

df['title']=df['title'].apply(lambda x: remove_html(x))

In [None]:
# Removing accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
df['title']=df['title'].apply(lambda x: remove_accented_chars(x))

In [None]:
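# Remove URLs (regex=True is needed because recent pandas versions treat the pattern as a literal string by default)
df['title']=df['title'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)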
#df.iloc[9977]

In [None]:
# Expanding contractions   
df['title']=df['title'].apply(lambda x: [contractions.fix(word) for word in x.split()])
# The above statement tokenizes the sentence in such a 
# way that if word is "I've" the token formed is [I have] after 
# expanding. So we use the below command to contract the tokens
# back to sentence
df['title']=[' '.join(map(str,l)) for l in df['title']]
#df.iloc[9977]

In [None]:
# Remove @mentions, URLs and special characters (the '#' of a hashtag is dropped, its word is kept)
def hashtag(x):
    text=' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
    return text

df['title']=df['title'].apply(lambda x: hashtag(x))
#df.iloc[9975]

In [None]:
# Tokenization
tokenizer= RegexpTokenizer(r'\w+')
df['title']=df['title'].apply(lambda x: tokenizer.tokenize(x))
df.head()

Unnamed: 0,title,result
0,"[pandemonium, in, aba, as, woman, delivers, ba...",0
1,"[otrametlife, i, swear, to, god, i, did, not, ...",0
2,"[dear, lord, forgive, me, for, body, bagging, ...",0
3,"[no, pharrell, only, you, can, prevent, forest...",0
4,"[small, bag, from, the, bottom, the, wounded, ...",0


In [None]:
# Remove stop words
nltk.download('stopwords')  # one-time download of the stop word corpus
stop_words=set(stopwords.words('english'))  # build the set once for fast lookups
print(stopwords.words('english'))

def remove_stopwords(text):
    words=[w for w in text if w not in stop_words]
    return words

df['title']=df['title'].apply(lambda x: remove_stopwords(x))
df.head()


In [None]:
# Lemmatization
nltk.download('wordnet')  # one-time download of the WordNet corpus used by the lemmatizer
lemmatizer=WordNetLemmatizer()
def word_lemmatizer(text):
    lem_text=[lemmatizer.lemmatize(i) for i in text]
    return lem_text

df['title']=df['title'].apply(lambda x: word_lemmatizer(x))
df.head()

In [None]:
# Stemming (runs on the lemmatized tokens and joins the stems back into one string per title)
stemmer=SnowballStemmer("english")
def word_stemmer(text):
    stem_text=" ".join([stemmer.stem(i) for i in text])
    return stem_text

df['title']=df['title'].apply(lambda x: word_stemmer(x))
df.head()

## 4) Splitting of columns

In [None]:
# instances to learn from
X=df.title 
# target/responses the model is trying to learn to predict
y=df.result

In [None]:
# first 5 instances
X.head()

In [None]:
# first 5 target/responses
y.head()

## 5) Splitting the dataset into train and test
In machine learning we usually split our data into two subsets: training data and testing data (and sometimes into three: train, validate and test). We fit the model on the training data and then make predictions on the test data to check how well it generalizes.

scikit-learn provides the `sklearn.model_selection` module. It contains a function aptly named `train_test_split`, which splits a dataset into training and testing sets in various proportions.

#### sklearn.model_selection.train_test_split(*arrays, **options)
#### Parameters:
#### test_size:
The size of the test set. A float between 0.0 and 1.0 is read as a fraction of the dataset; for example, 0.5 puts 50% of the samples in the test set. If you specify this parameter, you can ignore train_size. If None, the value is set to the complement of train_size. If train_size is also None, it is set to 0.25.

#### train_size:
Only needed if you are not specifying test_size. It works the same way as test_size, but tells the function what fraction of the dataset to use as the training set.

#### random_state:
An integer that acts as the seed for the random number generator during the split, which makes the split reproducible. You can also pass an instance of the RandomState class, which then becomes the number generator. If you pass nothing, the RandomState instance used by np.random is used.

#### shuffle:
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
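
To make the roles of `test_size`, `random_state` and `shuffle` concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not scikit-learn's implementation: the helper name `simple_train_test_split` is made up, and details such as how the test count is rounded may differ from the library.

```python
import random

def simple_train_test_split(items, test_size=0.25, random_state=None, shuffle=True):
    """Illustrative sketch of a shuffled train/test split (not scikit-learn's code)."""
    indices = list(range(len(items)))
    if shuffle:
        # A seeded generator makes the shuffle, and so the split, reproducible.
        random.Random(random_state).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

data = list(range(10))
train, test = simple_train_test_split(data, test_size=0.2, random_state=100)
print(len(train), len(test))  # 8 2
```

Calling the function again with the same `random_state` returns the same split, which is why a fixed seed is useful when comparing models.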

In [None]:
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=100, shuffle=True)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape) 

## 6) Splitting the dataset using cross validation technique

#### The validation set approach
In this approach, we reserve 50% of the dataset for validation and the remaining 50% for model training. 
#### Disadvantage
Since we are training a model on only 50% of the dataset, there is a high chance that we miss some interesting information about the data, which leads to a higher bias.


#### Leave one out cross validation (LOOCV)
In this approach, we reserve only one data point from the available dataset, and train the model on the rest of the data. This process iterates for each data point.
#### Advantages and disadvantages
1) We make use of all data points, hence the bias will be low
2) We repeat the cross validation process n times (where n is the number of data points), which results in a higher execution time
3) This approach leads to higher variation in the estimate of model effectiveness, because each test uses a single data point. If that data point turns out to be an outlier, the estimate is strongly affected.

#### k-fold cross validation
From the above two validation methods, we’ve learnt:

1) We should train the model on a large portion of the dataset. Otherwise we’ll fail to read and recognise the underlying trend in the data. This will eventually result in a higher bias

2) We also need a good ratio of testing data points. As we have seen above, too few data points can lead to a variance error while testing the effectiveness of the model

3) We should iterate on the training and testing process multiple times. We should change the train and test dataset distribution. This helps in validating the model effectiveness properly

Do we have a method which takes care of all these 3 requirements?

Yes! That method is known as “k-fold cross validation”. It’s easy to follow and implement. Below are the steps for it:

1) Randomly split your entire dataset into k "folds"

2) For each fold in your dataset, build your model on the other k – 1 folds. Then, test the model to check its effectiveness on the held-out fold

3) Record the error you see on each of the predictions

4) Repeat this until each of the k-folds has served as the test set

5) The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model
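
The five steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not scikit-learn's `KFold`: the helper names `k_fold_indices` and `cross_validation_error` are hypothetical, and the `dummy_error` function only stands in for "fit a model on the training folds and measure its error on the test fold".

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # step 1: random split
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]     # the held-out fold
        train_idx = indices[:start] + indices[start + size:]  # the other k - 1 folds
        yield train_idx, test_idx
        start += size

def cross_validation_error(n_samples, k, fit_and_score):
    # steps 2-4: score every fold; step 5: average the recorded errors
    errors = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)

def dummy_error(train_idx, test_idx):
    # Placeholder "error" so the sketch runs without a real model.
    return len(test_idx) / 10

folds = list(k_fold_indices(10, 3))
print([len(te) for _, te in folds])                # [4, 3, 3]
cv_error = cross_validation_error(10, 3, dummy_error)
```

Every sample lands in exactly one test fold, so each data point is used for both training and testing across the k iterations.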

#### How to choose the right value of k?

Always remember, a lower value of k is more biased, and hence undesirable. On the other hand, a higher value of k is less biased, but can suffer from large variability. A smaller value of k takes us towards the validation set approach (k = 2 is close to it), whereas a higher value of k takes us towards the LOOCV approach (k = n is exactly LOOCV).

#### class sklearn.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)

#### parameters:
#### n_splits: 
Number of folds. default is 5.

#### shuffle:
Whether to shuffle the data before splitting into batches.

#### random_state:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Only used when shuffle is True. This should be left to None if shuffle is False.

To calculate an error metric over the test folds produced by KFold:
`model_selection.cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error')`



In [None]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=1000,shuffle=True)

In [None]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
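    # .iloc selects by position, which is what the index arrays from kf.split() contain
    X_train_cv, X_test_cv = X.iloc[train_index], X.iloc[test_index]
    y_train_cv, y_test_cv = y.iloc[train_index], y.iloc[test_index]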

## 7) Text vectorization
A dataset can contain numerical values, strings, characters, categorical values, etc. Converting such features into numerical features is called featurization; for text it is called text vectorization. 

Word embeddings are numerical representations of text, and the same text can have several different numerical representations. 
The different types of word embeddings can be broadly classified into two categories:

Frequency based Embedding

Prediction based Embedding

#### Frequency based Embedding:
There are generally three types of vectors that we encounter under this category.

1) Count Vector
2) TF-IDF Vector
3) Co-Occurrence Vector

#### Prediction based Embedding:
1) Continuous bag of words (CBOW)
2) Skip-Gram model

#### Using Pre-trained word vectors:
1) Word2Vec
2) fastText
3) GloVe


#### 1) Bag of words:
It is called bag of words because the order of the words in the document is discarded; it only tells us whether (or how often) a word is present in the document. 

“There used to be Stone Age”

“There used to be Bronze Age”

“There used to be Iron Age”

“There was Age of Revolution”

“Now it is Digital Age”

Here each sentence is a separate document. If we make a list of the words such that each word occurs only once, the list looks as follows:

“There”, “was”, “to”, “be”, “used”, “Stone”, “Bronze”, “Iron”, “Revolution”, “Digital”, “Age”, “of”, “Now”, “it”, “is”

A document can be converted to a vector with a simple word count: for each word in the list, we count how often it occurs in the document. For example, the sentence “There used to be Stone Age” can be represented as:

“There” = 1

“was” = 0

“to” = 1

“be” = 1

“used” = 1

“Stone” = 1

“Bronze” = 0

“Iron” = 0

“Revolution” = 0

“Digital” = 0

“Age” = 1

“of” = 0

“Now” = 0

“it” = 0

“is” = 0

So here we basically convert each document into a vector. Written as a vector, the sentence above is [1,0,1,1,1,1,0,0,0,0,1,0,0,0,0]. Following the same approach, the other vectors are as follows:

“There used to be Bronze Age” = [1,0,1,1,1,0,1,0,0,0,1,0,0,0,0]

“There used to be Iron Age” = [1,0,1,1,1,0,0,1,0,0,1,0,0,0,0]

“There was Age of Revolution” = [1,1,0,0,0,0,0,0,1,0,1,1,0,0,0]

“Now it is Digital Age” = [0,0,0,0,0,0,0,0,0,1,1,0,1,1,1]
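
The vectors above can be reproduced with a few lines of plain Python. This is a minimal sketch of the counting idea, using the word list in the same order as in the text; the helper name `count_vector` is made up for the example.

```python
docs = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]
# Vocabulary in the same order as the word list in the text.
vocab = ["There", "was", "to", "be", "used", "Stone", "Bronze", "Iron",
         "Revolution", "Digital", "Age", "of", "Now", "it", "is"]

def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

vectors = [count_vector(d, vocab) for d in docs]
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Note that word order plays no role: shuffling the words of a sentence yields the same vector.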

The approach discussed above is a unigram model because we consider only one word at a time. Similarly we have bigrams (two words at a time, for example: There used, used to, to be, be Stone, Stone Age), trigrams (three words at a time, for example: There used to, used to be, to be Stone, be Stone Age) and n-grams (n words at a time).
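
A small sketch of how word n-grams can be generated, assuming simple whitespace tokenization (the helper name `ngrams` is chosen for this example):

```python
def ngrams(sentence, n):
    """Return all runs of n consecutive words in the sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "There used to be Stone Age"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)   # ['There used', 'used to', 'to be', 'be Stone', 'Stone Age']
print(trigrams)  # ['There used to', 'used to be', 'to be Stone', 'be Stone Age']
```

A sentence of m words has m - n + 1 n-grams, so longer n-grams give fewer but more specific features.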

Using the CountVectorizer class we can convert text documents to a matrix of word counts. The matrix it produces is a sparse matrix. Applying CountVectorizer to the documents above gives a 5×15 sparse matrix with dtype numpy.int64.

After applying the CountVectorizer we can map each word to feature indices as shown below:
<img src="images/feature_indices_mapping.png">

This can be transformed into a sparse matrix as shown below:
<img src="images/sparse_matrix.png">

CountVectorizer produces a sparse matrix. Some machine learning models do not accept sparse input; for those, first convert the sparse matrix to a dense matrix and then apply the model.
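
As a quick check of the shape mentioned above, the following sketch runs `CountVectorizer` with its default settings on the five example sentences and converts the result to a dense array. It assumes scikit-learn is installed; the variable names are chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

vectorizer = CountVectorizer()            # defaults: lowercasing, word unigrams
matrix = vectorizer.fit_transform(docs)   # SciPy sparse matrix of token counts
print(matrix.shape)                       # (5, 15)
print(vectorizer.vocabulary_)             # mapping: term -> column index

dense = matrix.toarray()                  # dense NumPy array, one row per document
print(dense[0])
```

With the defaults, the text is lowercased and the columns follow the alphabetical order of the terms, so the column order differs from the hand-built word list above.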

#### sklearn.feature_extraction.text.CountVectorizer() 
#### parameters:
#### token_pattern: string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

#### ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

#### analyzer: string, {‘word’, ‘char’, ‘char_wb’} or callable
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

#### 2) TF-IDF:
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document within a corpus or dataset. TF-IDF combines two concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency is defined as how frequently a word appears in a document. Because documents differ in length, a word may occur more often in a long document than in a shorter one simply because there is more text. Term frequency can be defined as:
<img src="images/tf.png">

Suppose we have the sentence “The TFIDF Vectorization Process is Beautiful Concept” and we have to find the frequency count of these words in five different documents
<img src="images/tf2.png">

As shown in Table 1, the frequency of ‘The’ is the highest in every document. Suppose the frequency of ‘The’ in Document6 is 2 million while the frequency of ‘The’ in Document7 is 3 million. These counts are very large, so we apply a logarithm to dampen them (for example, log2(2 million) ≈ 21). Taking the log keeps very large raw counts from dominating the TF value. Hence the formula of TF can be defined as:
<img src="images/tf3.png">

When tf = 1 the log term becomes zero and the value becomes 1. Adding 1 is just to differentiate between tf = 0 and tf = 1.
Hence Table 1 can be modified to:
<img src="images/tf4.png">

Inverse Document Frequency is another concept used to find the importance of a word. It is based on the fact that less frequent words are more informative and important. IDF is represented by the formula:
<img src="images/idf.png">

Let us consider the above example again
<img src="images/idf2.png">
In Table 3 the most frequent words are ‘The’ and ‘is’, but they are the least important according to IDF, while words that appear rarely, such as ‘TFIDF’ and ‘Concept’, are the important ones. Hence, the IDF of a rare term is high and the IDF of a frequent term is low.

TF-IDF is basically the multiplication of Table 2 (TF table) and Table 3 (IDF table). It reduces the values of common words that are used across different documents. As we can see in Table 4, the most important word after multiplying TF and IDF is ‘TFIDF’, while the most frequent words such as ‘The’ and ‘is’ are not that important.
<img src="images/tfidf.png">
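
The calculation can be sketched in plain Python. The formulas in the images are not reproduced here, so this sketch makes two assumptions: TF is taken as 1 + log2(count) for a word that occurs (0 otherwise), following the description above, and IDF is taken as log2(N / df), a common textbook form. The three toy documents are invented for the example.

```python
import math

# Toy corpus, invented for this sketch.
docs = [
    "the flood is rising",
    "the storm is over",
    "the tfidf concept is beautiful",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    """Log-scaled term frequency: 1 + log2(count), or 0 if the term is absent."""
    count = tokens.count(term)
    return 1 + math.log2(count) if count > 0 else 0.0

def idf(term):
    """Assumed form log2(N / df); the term must occur in at least one document."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log2(n_docs / df)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

common = tfidf("the", tokenized[2])    # 'the' is in every document -> idf = 0
rare = tfidf("tfidf", tokenized[2])    # 'tfidf' is in one document -> idf = log2(3)
print(common, round(rare, 3))          # 0.0 1.585
```

scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch exactly, but the ranking of rare versus common words follows the same idea.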

#### 3) Word2Vec:
Word2Vec stands for word representations in vector space. It is a strategy where each word is represented as a vector of numbers. These vectors are not assigned at random; they are assigned in such a way that similar words end up close together in the vector space.

After the words are converted to vectors, we need a measure such as Euclidean distance or cosine similarity to identify similar words. word2vec uses cosine similarity to find the most similar words.

Why Cosine Similarity?
A common approach to matching similar documents is to count the words they share, or to compute the Euclidean distance between their count vectors.
This approach has a flaw: as documents get longer, the number of common words tends to increase even when the documents talk about different topics. To overcome this flaw, the “Cosine Similarity” approach is used to find the similarity between the documents.
<img src="images/cosine.png">
Mathematically, it measures the cosine of the angle between two vectors (item1, item2) projected in an N-dimensional vector space. The advantage of cosine similarity is that it can still identify two documents as similar when the Euclidean distance between them is large, for example because one document is much longer than the other.
The rule of thumb for cosine similarity: “the smaller the angle, the higher the similarity”.

Let’s see an example.

Julie loves John more than Linda loves John

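Jane likes John more than Julie loves John

| Word  | Sentence 1 | Sentence 2 |
|-------|------------|------------|
| John  | 2 | 2 |
| Jane  | 0 | 1 |
| Julie | 1 | 1 |
| Linda | 1 | 0 |
| likes | 0 | 1 |
| loves | 2 | 1 |
| more  | 1 | 1 |
| than  | 1 | 1 |

The two vectors are,

Item 1: [2, 0, 1, 1, 0, 2, 1, 1]

Item 2: [2, 1, 1, 0, 1, 1, 1, 1]

The cosine similarity between the two vectors is 0.822. This is close to 1, which means the angle between them is small and the two sentences are similar.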
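
The value 0.822 can be verified with a short pure-Python sketch of the cosine similarity formula (dot product divided by the product of the vector lengths); the function name is chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

item1 = [2, 0, 1, 1, 0, 2, 1, 1]
item2 = [2, 1, 1, 0, 1, 1, 1, 1]
similarity = cosine_similarity(item1, item2)
print(round(similarity, 3))  # 0.822
```

Here the dot product is 9 and the vector lengths are sqrt(12) and sqrt(10), so the result is 9 / sqrt(120) ≈ 0.822. Scaling either vector, for example by doubling every count, leaves the result unchanged, which is why document length does not distort the comparison.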

Now let’s look at ways to convert sentences into vectors.

Word embeddings can come from pre-trained models such as:

Word2Vec, from Google

fastText, from Facebook

GloVe, from Stanford

We will use Word2Vec by Google. It is a shallow, two-layer neural network. Word2vec takes a large text corpus as input (in this scenario, the Google News dataset) and maps each word into a vector space; the Google model comes pre-trained on that dataset. Word2vec places words in the feature space in such a way that their location is determined by their meaning, i.e. words with similar meanings are clustered together and the distance between two words also carries meaning.  
<img src="images/w2v.png">

We use the model pre-trained on the Google dataset because it covers most of the words we need. Before using Google Word2vec we must install and import gensim. Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python.

#### Unigram and bigram explained:
Example: Consider the sentence "I ate banana".

In Unigram we assume that the occurrence of each word is independent of its previous word. Hence each word becomes a gram(feature) here.

For unigram, we will get 3 features - 'I', 'ate', 'banana' - and all 3 are treated as independent of each other, although this is not the case in real languages.

In Bigram we assume that each occurrence of each word depends only on its previous word. Hence two words are counted as one gram(feature) here.

For bigram, we will get 2 features - 'I ate' and 'ate banana'. This makes sense since the model will learn that 'banana' comes after 'ate' and not the other way around.

Similarly, we can have trigrams and, in general, n-grams.

In [None]:
# Bag of Words (unigram) 
# the default token_pattern keeps tokens of 2 or more word characters
count_vectorizer1=CountVectorizer(analyzer='word')
count_vectorizer1.fit(X)
count_vect1_tr= count_vectorizer1.transform(X_train)
count_vect1_te= count_vectorizer1.transform(X_test)

# dense copies for models that cannot work with sparse input
count_vect1_tr_dtm= count_vect1_tr.toarray()
count_vect1_te_dtm= count_vect1_te.toarray()
#print(count_vectorizer1.get_feature_names_out())
#print(count_vect1_tr.toarray())
#print(count_vect1_tr.shape)

# Bag of Words (bigram)
count_vectorizer2=CountVectorizer(analyzer='word',ngram_range=(2, 2))
count_vectorizer2.fit(X)
count_vect2_tr= count_vectorizer2.transform(X_train)
count_vect2_te= count_vectorizer2.transform(X_test)
#print(vectorizer2.get_feature_names())
#print(count_vect2.toarray())

# Bag of Words (trigram)
count_vectorizer3=CountVectorizer(analyzer='word',ngram_range=(3, 3))
count_vectorizer3.fit(X)
count_vect3_tr= count_vectorizer3.transform(X_train)
count_vect3_te= count_vectorizer3.transform(X_test)
#print(vectorizer3.get_feature_names())
#print(count_vect3.toarray())

# Bag of Words (unigram and bigram)
count_vectorizer4=CountVectorizer(analyzer='word',ngram_range=(1, 2))
count_vectorizer4.fit(X)
count_vect4_tr= count_vectorizer4.transform(X_train)
count_vect4_te= count_vectorizer4.transform(X_test)
#print(vectorizer4.get_feature_names())
#print(count_vect4.toarray())

# Bag of Words (character level,unigram)
count_vectorizer5=CountVectorizer(analyzer='char')
count_vectorizer5.fit(X)
count_vect5_tr= count_vectorizer5.transform(X_train)
count_vect5_te= count_vectorizer5.transform(X_test)
#print(vectorizer5.get_feature_names())
#print(count_vect5.toarray())

# Bag of Words (character level,bigram)
count_vectorizer6=CountVectorizer(analyzer='char',ngram_range=(2, 2))
count_vectorizer6.fit(X)
count_vect6_tr= count_vectorizer6.transform(X_train)
count_vect6_te= count_vectorizer6.transform(X_test)
#print(vectorizer6.get_feature_names())
#print(count_vect6.toarray())

# Bag of Words (character level,trigram)
count_vectorizer7=CountVectorizer(analyzer='char',ngram_range=(3, 3))
count_vectorizer7.fit(X)
count_vect7_tr= count_vectorizer7.transform(X_train)
count_vect7_te= count_vectorizer7.transform(X_test)
#print(vectorizer7.get_feature_names())
#print(count_vect7.toarray())

# Bag of Words (character level,bigram and trigram)
count_vectorizer8=CountVectorizer(analyzer='char',ngram_range=(2, 3))
count_vectorizer8.fit(X)
count_vect8_tr= count_vectorizer8.transform(X_train)
count_vect8_te= count_vectorizer8.transform(X_test)
#print(vectorizer8.get_feature_names())
#print(count_vect8.toarray())

In [None]:
# TF-IDF (unigram)
tfidf_vectorizer1=TfidfVectorizer(analyzer='word')
tfidf_vectorizer1.fit(X)
tfidf_vect1_tr= tfidf_vectorizer1.transform(X_train)
tfidf_vect1_te= tfidf_vectorizer1.transform(X_test)
#print(tfidf_vectorizer1.get_feature_names_out())
#print(tfidf_vect1_tr.toarray())
#print(tfidf_vect1_tr.shape)

# TF-IDF (bigram)
tfidf_vectorizer2=TfidfVectorizer(analyzer='word',ngram_range=(2, 2))
tfidf_vectorizer2.fit(X)
tfidf_vect2_tr= tfidf_vectorizer2.transform(X_train)
tfidf_vect2_te= tfidf_vectorizer2.transform(X_test)
#print(tfidf_vectorizer2.get_feature_names())
#print(tfidf_vect2.toarray())

# TF-IDF (trigram)
tfidf_vectorizer3=TfidfVectorizer(analyzer='word',ngram_range=(3, 3))
tfidf_vectorizer3.fit(X)
tfidf_vect3_tr= tfidf_vectorizer3.transform(X_train)
tfidf_vect3_te= tfidf_vectorizer3.transform(X_test)
#print(tfidf_vectorizer3.get_feature_names())
#print(tfidf_vect3.toarray())

# TF-IDF (bigram and trigram)
tfidf_vectorizer4=TfidfVectorizer(analyzer='word',ngram_range=(2, 3))
tfidf_vectorizer4.fit(X)
tfidf_vect4_tr= tfidf_vectorizer4.transform(X_train)
tfidf_vect4_te= tfidf_vectorizer4.transform(X_test)
#print(tfidf_vectorizer4.get_feature_names())
#print(tfidf_vect4.toarray())

# TF-IDF (character level,unigram)
tfidf_vectorizer5=TfidfVectorizer(analyzer='char')
tfidf_vectorizer5.fit(X)
tfidf_vect5_tr= tfidf_vectorizer5.transform(X_train)
tfidf_vect5_te= tfidf_vectorizer5.transform(X_test)
#print(tfidf_vectorizer5.get_feature_names())
#print(tfidf_vect5.toarray())

# TF-IDF (character level,bigram)
tfidf_vectorizer6=TfidfVectorizer(analyzer='char',ngram_range=(2, 2))
tfidf_vectorizer6.fit(X)
tfidf_vect6_tr= tfidf_vectorizer6.transform(X_train)
tfidf_vect6_te= tfidf_vectorizer6.transform(X_test)
#print(tfidf_vectorizer6.get_feature_names())
#print(tfidf_vect6.toarray())

# TF-IDF (character level,trigram)
tfidf_vectorizer7=TfidfVectorizer(analyzer='char',ngram_range=(3, 3))
tfidf_vectorizer7.fit(X)
tfidf_vect7_tr= tfidf_vectorizer7.transform(X_train)
tfidf_vect7_te= tfidf_vectorizer7.transform(X_test)
#print(tfidf_vectorizer7.get_feature_names())
#print(tfidf_vect7.toarray())

# TF-IDF (character level,bigram and trigram)
tfidf_vectorizer8=TfidfVectorizer(analyzer='char',ngram_range=(2, 3))
tfidf_vectorizer8.fit(X)
tfidf_vect8_tr= tfidf_vectorizer8.transform(X_train)
tfidf_vect8_te= tfidf_vectorizer8.transform(X_test)
#print(tfidf_vectorizer8.get_feature_names())
#print(tfidf_vect8.toarray())

In [None]:
# Pre-trained word embeddings using fastText
# load the pre-trained word-embedding vectors 

from keras.preprocessing import text
from keras.preprocessing import sequence

embeddings_index = {}
# read the vectors inside a context manager so the file is closed afterwards
with open('wiki-news-300d-1M.vec', encoding="utf8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# create a tokenizer 
token = text.Tokenizer()
token.fit_on_texts(X)
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors 
fstxt_tr = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=70)
fstxt_te = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=70)

# create token-embedding mapping
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
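
The token-embedding mapping can be illustrated on a tiny scale. This is a sketch with made-up 3-dimensional vectors standing in for the 300-dimensional file; `toy_embeddings`, `toy_word_index` and `toy_matrix` are hypothetical names. Words that have no pre-trained vector keep an all-zero row, and row 0 stays zero because tokenizer indices start at 1.

```python
import numpy as np

# Hypothetical 3-d "pre-trained" vectors (the real file has 300 dimensions).
toy_embeddings = {
    "flood": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "water": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

# Word-to-index mapping as a tokenizer would produce it (indices start at 1).
toy_word_index = {"flood": 1, "water": 2, "zzzunknown": 3}

embedding_dim = 3
# One extra row because index 0 is reserved for padding.
toy_matrix = np.zeros((len(toy_word_index) + 1, embedding_dim))
for word, idx in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        # Copy the pre-trained vector into the row of this word's index.
        toy_matrix[idx] = vector

print(toy_matrix)
```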

In [None]:
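# Word2Vec using the pre-trained Google News vectors
from gensim.models import KeyedVectors

w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# load_word2vec_format returns a KeyedVectors object, so most_similar is called on it directly
print(w2v_model.most_similar('flood'))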

## 8) Metrics to Evaluate Machine Learning Algorithm

## Refer to this also: https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226

#### 1) Classification Accuracy:
Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
<img src="images/accu.gif">
It works well only if each class has roughly the same number of samples.
For example, suppose 98% of the samples in our training set belong to class A and 2% to class B. Our model can then easily reach 98% training accuracy by simply predicting class A for every training sample.
When the same model is tested on a test set with 60% class A and 40% class B samples, the test accuracy drops to 60%. Classification accuracy is a useful metric, but on imbalanced data it can give a false sense of high performance.
<img src="images/accu1.gif">
Accuracy works best if false positives and false negatives have similar costs. If their costs are very different, it is better to look at both Precision and Recall.

#### sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

parameters:

y_true: 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) labels.

y_pred: 1d array-like, or label indicator array / sparse matrix
Predicted labels, as returned by a classifier.

normalize: bool, optional (default=True)
If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.

sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
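
The 98% / 2% example above can be reproduced in a few lines. This is a minimal sketch with hypothetical labels, where the "model" simply predicts the majority class for every sample.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 98 samples of class 0 and 2 of class 1.
y_true = np.array([0] * 98 + [1] * 2)
# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Fraction of correct predictions (normalize=True is the default).
fraction_correct = accuracy_score(y_true, y_pred)
# Number of correct predictions.
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct, count_correct)
```

The score is high even though not a single sample of the minority class is found, which is why accuracy alone is not enough on imbalanced data.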

#### 2) Logarithmic Loss:
Logarithmic Loss, or Log Loss, penalises false classifications, and the penalty grows the more confident the wrong prediction is. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for every sample.

#### sklearn.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None)

parameters:

y_true: array-like or label indicator matrix
Ground truth (correct) labels for n_samples samples.

y_pred: array-like of float, shape = (n_samples, n_classes) or (n_samples,)
Predicted probabilities, as returned by a classifier’s predict_proba method. 

eps: float
Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).

normalize: bool, optional (default=True)
If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.

sample_weight: array-like of shape (n_samples,), default=None
Sample weights.

labels: array-like, optional (default=None)
If not provided, labels will be inferred from y_true. If labels is None and y_pred has shape (n_samples,) the labels are assumed to be binary and are inferred from y_true.
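
For the binary case, log loss is the mean of -[y log(p) + (1 - y) log(1 - p)] over all samples, where p is the predicted probability of class 1. The sketch below, on made-up probabilities, computes this by hand and compares it with `log_loss`; it also shows that a single confident but wrong prediction dominates the total.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.05])  # last sample: confident and wrong

# Per-sample binary cross-entropy.
per_sample = -(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
manual = per_sample.mean()

# A 1-d y_pred is interpreted as the probability of the positive class.
from_sklearn = log_loss(y_true, p_pos)

print(per_sample)
print(manual, from_sklearn)
```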


#### 3) Confusion Matrix:
A confusion matrix, as the name suggests, gives us a matrix as output and describes the complete performance of the model: entry (i, j) counts the samples whose true class is i and whose predicted class is j.
<img src="images/confu.png">

#### sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None, normalize=None)

parameters:

y_true: array-like of shape (n_samples,)
Ground truth (correct) target values.

y_pred: array-like of shape (n_samples,)
Estimated targets as returned by a classifier.

labels: array-like of shape (n_classes), default=None
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.

sample_weight: array-like of shape (n_samples,), default=None
Sample weights.

normalize: {‘true’, ‘pred’, ‘all’}, default=None
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.
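
As a small illustration with hypothetical binary labels, the matrix can be unpacked into the four counts used by the metrics that follow. In scikit-learn, rows correspond to true classes and columns to predicted classes, so for labels 0 and 1 the flattened order is TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary problem.
y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = cm.ravel()

# normalize='true' divides each row by the number of true samples in it.
cm_row_normalized = confusion_matrix(y_true, y_pred, normalize="true")

print(cm)
print(tn, fp, fn, tp)
print(cm_row_normalized)
```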

#### 4) Area under Curve:
Area Under Curve (AUC) is one of the most widely used evaluation metrics. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before defining AUC, let us understand two basic terms:
True Positive Rate (Sensitivity): True Positive Rate is defined as TP / (FN + TP). It is the proportion of positive data points that are correctly classified as positive, out of all positive data points.
<img src="images/auc.gif">
False Positive Rate (1 - Specificity): False Positive Rate is defined as FP / (FP + TN). It is the proportion of negative data points that are mistakenly classified as positive, out of all negative data points.
<img src="images/auc1.gif">
False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at threshold values such as (0.00, 0.02, 0.04, …, 1.00) and a graph is drawn. AUC is the area under the curve obtained by plotting True Positive Rate against False Positive Rate at these thresholds.
AUC has a range of [0, 1]; a classifier that ranks at random scores about 0.5. The greater the value, the better the performance of our model.

#### sklearn.metrics.auc(x, y)

parameters:
x: array, shape = [n]
x coordinates. These must be either monotonic increasing or monotonic decreasing.

y: array, shape = [n]
y coordinates.

For binary classification, however, we will use roc_auc_score, which computes the Area Under the Receiver Operating Characteristic Curve (ROC AUC) directly from the true labels and the prediction scores. The auc function is a general trapezoidal-rule routine and needs the x and y coordinates of the curve as input.

#### sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)

parameters:

max_fpr: float > 0 and <= 1, default=None
If not None, the standardized partial AUC [2] over the range [0, max_fpr] is returned. For the multiclass case, max_fpr, should be either equal to None or 1.0 as AUC ROC partial computation currently is not supported for multiclass.

multi_class: {‘raise’, ‘ovr’, ‘ovo’}, default=’raise’
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
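
The relation between `auc`, `roc_auc_score` and the ranking interpretation of AUC can be checked on a small example. This is a sketch with hypothetical scores: the area is computed once from the ROC curve coordinates, once directly from the scores, and once as the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores (e.g. probabilities of class 1).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 1) Build the ROC curve, then integrate it with the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

# 2) Compute the same area directly from labels and scores.
auc_direct = roc_auc_score(y_true, y_score)

# 3) Ranking interpretation: share of positive/negative pairs ranked correctly.
pos_scores = y_score[y_true == 1]
neg_scores = y_score[y_true == 0]
auc_pairwise = np.mean([p > n for p in pos_scores for n in neg_scores])

print(auc_from_curve, auc_direct, auc_pairwise)
```

All three values agree here because the scores contain no ties. Note that `roc_auc_score` expects scores or probabilities; passing hard 0/1 predictions reduces the curve to a single threshold.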

#### 5) F1 Score:
How well does the classifier balance precision and recall?
F1 Score is the harmonic mean of precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it labels positive are actually positive), as well as how robust it is (it does not miss a significant number of positive instances).
The greater the F1 Score, the better is the performance of our model.
<img src="images/f1.gif">

#### sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

parameters:
y_true: 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.

y_pred: 1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.

labels: list, optional
The set of labels to include when average != 'binary', and their order if average is None.

pos_label: str or int, 1 by default
The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

average: string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. 

sample_weight: array-like of shape (n_samples,), default=None
Sample weights.

zero_division: “warn”, 0 or 1, default=”warn”
Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

#### 6) Precision:
What percent of your positive predictions were correct?
Precision is the ability of a classifier not to label an instance positive that is actually negative. High precision corresponds to a low number of false positives.
<img src="images/prec.gif">

#### sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')


#### 7) Recall:
What percent of the positive cases did you catch? 
Recall is the ability of a classifier to find all positive instances.
<img src="images/reca.gif">

#### sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
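
Precision, recall and F1 can all be derived from the counts TP, FP and FN. The following sketch uses hypothetical labels (2 true positives, 1 false positive, 2 false negatives) and compares the hand-computed values with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 positives, 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly found positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # negatives labelled positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # positives that were missed

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1)
```

The harmonic mean lies closer to the smaller of the two values, so F1 is only high when precision and recall are both high.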

#### 8) Mean Absolute Error:
Mean Absolute Error is the average of the absolute differences between the original values and the predicted values. It measures how far the predictions were from the actual output. However, it does not give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. Mathematically, it is represented as:
<img src="images/mae.gif">

#### sklearn.metrics.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')

parameters:

y_true: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_pred: array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weight: array-like of shape (n_samples,), optional
Sample weights.

multioutput: string in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’: Returns a full set of errors in case of multioutput input.

‘uniform_average’: Errors of all outputs are averaged with uniform weight.


#### 9) Mean Squared Error:
Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the square of the difference between the original values and the predicted values. The advantage of MSE is that its gradient is easy to compute, whereas the absolute value in Mean Absolute Error is not differentiable at zero, which makes gradient-based optimisation less convenient. Because we square the error, larger errors become more pronounced than smaller ones, so the model focuses more on the larger errors.
<img src="images/mse.gif">

#### sklearn.metrics.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)

parameters:
squared: boolean value, optional (default = True)
If True returns MSE value, if False returns RMSE value.
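
The two regression metrics can be compared on a small example. This sketch uses hypothetical values and takes the square root of MSE with NumPy to obtain RMSE, so it does not rely on the `squared` argument.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical original and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred            # signed errors: 0.5, -0.5, 0.0, -1.0

mae = mean_absolute_error(y_true, y_pred)   # mean of |errors|
mse = mean_squared_error(y_true, y_pred)    # mean of errors squared
rmse = np.sqrt(mse)                         # back in the units of y

print(mae, mse, rmse)
# The largest error (-1.0) contributes 1.0 of the 1.5 total squared error,
# but only 1.0 of the 2.0 total absolute error.
```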

## 9) Model Building
The final step in the text classification framework is to train a classifier using the features created in the previous step. There are many machine learning models that can be used to train the final model. We will implement the following classifiers for this purpose:

Naive Bayes Classifier

Linear Classifier

Support Vector Machine

Bagging Models

Boosting Models

Shallow Neural Networks

Deep Neural Networks

Convolutional Neural Network (CNN)

Long Short-Term Memory (LSTM)

Gated Recurrent Unit (GRU)

Bidirectional RNN

Recurrent Convolutional Neural Network (RCNN)

Other Variants of Deep Neural Networks

In [None]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # neural networks return per-class scores, so take the most likely class
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)

    # NOTE: y_test is read from the notebook's global scope, so feature_vector_valid
    # must hold the test-split features. roc_auc_score receives hard labels here.
    return (metrics.accuracy_score(y_test, predictions),
            metrics.confusion_matrix(y_test, predictions),
            metrics.precision_score(y_test, predictions, average='weighted'),
            metrics.f1_score(y_test, predictions, average='weighted'),
            metrics.recall_score(y_test, predictions, average='weighted'),
            metrics.roc_auc_score(y_test, predictions),
            predictions)

## 9.1) Naive Bayes Classifier: 

A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The classifier is based on Bayes' theorem, together with the "naive" assumption that the features are conditionally independent given the class.
<img src="images/baye.png">
The different types are:
1. Gaussian Naive Bayes:
Implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.

2. Multinomial Naive Bayes:
Implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). Therefore we will use this variant.

3. Complement Naive Bayes:
Implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets.

4. Bernoulli Naive Bayes:
Implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

5. Categorical Naive Bayes:
Implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index i, has its own categorical distribution.

6. Out-of-core naive Bayes model fitting:
Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally as done with other classifiers.

#### sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None) 

#### Parameters:
alpha: float, optional (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_prior: boolean, optional (default=True)
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.

class_prior: array-like, size (n_classes,), optional (default=None)
Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
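
A minimal sketch of how `MultinomialNB` works on word counts, using a hypothetical four-document corpus (the texts and labels below are made up for illustration and are not the notebook's data). With `alpha=1.0`, every word count is incremented by one before the per-class word probabilities are estimated, so words unseen in a class do not produce zero probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: label 1 = disaster-related, 0 = not.
train_texts = [
    "flood water rising fast",
    "storm flood warning issued",
    "lovely sunny day today",
    "enjoying sunny picnic today",
]
train_labels = [1, 1, 0, 0]

# Turn the texts into a document-term matrix of word counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)

# alpha=1.0 is Laplace smoothing; fit_prior=True learns class priors from the data.
clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(X_counts, train_labels)

# Classify an unseen text made of words seen only in class 1.
new_counts = vectorizer.transform(["flood water warning"])
predicted = clf.predict(new_counts)
probabilities = clf.predict_proba(new_counts)  # columns follow clf.classes_

print(predicted, probabilities)
```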


In [None]:
# Multinomial Naive Bayes
from sklearn import naive_bayes, metrics

# Bag of Words (unigram) 
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect1_tr, y_train, count_vect1_te)
print("Multinomial Naive Bayes,Bag of Words (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect2_tr, y_train, count_vect2_te)
print("Multinomial Naive Bayes,Bag of Words (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect3_tr, y_train, count_vect3_te)
print("Multinomial Naive Bayes,Bag of Words (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect4_tr, y_train, count_vect4_te)
print("Multinomial Naive Bayes,Bag of Words (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect5_tr, y_train, count_vect5_te)
print("Multinomial Naive Bayes,Bag of Words (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect6_tr, y_train, count_vect6_te)
print("Multinomial Naive Bayes,Bag of Words (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect7_tr, y_train, count_vect7_te)
print("Multinomial Naive Bayes,Bag of Words (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), count_vect8_tr, y_train, count_vect8_te)
print("Multinomial Naive Bayes,Bag of Words (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)



In [None]:
# TF-IDF (unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect1_tr, y_train, tfidf_vect1_te)
print("Multinomial Naive Bayes,TF-IDF (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect2_tr, y_train, tfidf_vect2_te)
print("Multinomial Naive Bayes,TF-IDF (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect3_tr, y_train, tfidf_vect3_te)
print("Multinomial Naive Bayes,TF-IDF (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram and trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect4_tr, y_train, tfidf_vect4_te)
print("Multinomial Naive Bayes,TF-IDF (bigram and trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect5_tr, y_train, tfidf_vect5_te)
print("Multinomial Naive Bayes,TF-IDF (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect6_tr, y_train, tfidf_vect6_te)
print("Multinomial Naive Bayes,TF-IDF (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect7_tr, y_train, tfidf_vect7_te)
print("Multinomial Naive Bayes,TF-IDF (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram and trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(naive_bayes.MultinomialNB(), tfidf_vect8_tr, y_train, tfidf_vect8_te)
print("Multinomial Naive Bayes,TF-IDF (character level,bigram and trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)
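
The evaluation blocks above repeat the same steps for every feature set. One possible way to express that pattern more compactly is a loop over named vectorizers, sketched below on a small hypothetical corpus. The names `feature_sets`, `evaluate` and `results` are illustrative and not part of the notebook. Two choices differ from the cells above and are assumptions of this sketch: each vectorizer is fitted on the training texts only, and the AUC is computed from predicted probabilities rather than hard labels.

```python
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy data: label 1 = disaster-related, 0 = not.
train_texts = [
    "flood water rising fast", "storm flood warning issued",
    "earthquake shakes the city", "wildfire spreads near town",
    "lovely sunny day today", "enjoying sunny picnic today",
    "great movie last night", "new cafe opened downtown",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["flood warning near city", "sunny picnic downtown",
              "storm water rising", "great day today"]
test_labels = [1, 0, 1, 0]

# One entry per feature representation to compare.
feature_sets = {
    "Bag of Words (unigram)": CountVectorizer(analyzer="word"),
    "TF-IDF (unigram)": TfidfVectorizer(analyzer="word"),
    "TF-IDF (character level, bigram)": TfidfVectorizer(analyzer="char", ngram_range=(2, 2)),
}

def evaluate(classifier, X_tr, y_tr, X_te, y_te):
    """Fit the classifier and return a dictionary of test metrics."""
    classifier.fit(X_tr, y_tr)
    predictions = classifier.predict(X_te)
    positive_scores = classifier.predict_proba(X_te)[:, 1]
    return {
        "accuracy": metrics.accuracy_score(y_te, predictions),
        "f1": metrics.f1_score(y_te, predictions, average="weighted"),
        "auc": metrics.roc_auc_score(y_te, positive_scores),
    }

results = {}
for name, vectorizer in feature_sets.items():
    X_tr = vectorizer.fit_transform(train_texts)  # fit on training texts only
    X_te = vectorizer.transform(test_texts)
    results[name] = evaluate(naive_bayes.MultinomialNB(), X_tr, train_labels, X_te, test_labels)
    print("Multinomial Naive Bayes,", name, ":", results[name])
```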

## 9.2) Linear Classifier

Implementing a Linear Classifier (Logistic Regression)

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. 

#### sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None) 

parameters:

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers.

penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’
Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.


dual: bool, default=False
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.

tol: float, default=1e-4
Tolerance for stopping criteria.

C: float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

fit_intercept: bool, default=True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

intercept_scaling: float, default=1
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

class_weight: dict or ‘balanced’, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

random_state: int, RandomState instance, default=None
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’ or ‘liblinear’.

solver: {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.

max_iter: int, default=100
Maximum number of iterations taken for the solvers to converge.

multi_class: {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

verbose: int, default=0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

warm_start: bool, default=False
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. 

n_jobs: int, default=None
Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

l1_ratio: float, default=None
The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
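
Two of the parameters above can be illustrated with a short sketch on synthetic data (the data set and variable names are hypothetical). The first part reproduces the "balanced" class-weight formula n_samples / (n_classes * np.bincount(y)); the second part shows that a smaller C, meaning stronger regularization, shrinks the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# --- class_weight='balanced' ---
# Hypothetical labels: 8 samples of class 0 and 2 of class 1.
y_imbalanced = np.array([0] * 8 + [1] * 2)
n_classes = 2
manual_weights = len(y_imbalanced) / (n_classes * np.bincount(y_imbalanced))
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imbalanced
)
print(manual_weights, sklearn_weights)  # the rarer class gets the larger weight

# --- C: inverse of regularization strength ---
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
strong_reg = LogisticRegression(C=0.01, solver="lbfgs", max_iter=1000).fit(X, y)
weak_reg = LogisticRegression(C=100.0, solver="lbfgs", max_iter=1000).fit(X, y)

norm_strong = np.linalg.norm(strong_reg.coef_)
norm_weak = np.linalg.norm(weak_reg.coef_)
print(norm_strong, norm_weak)  # smaller C gives a smaller coefficient norm
```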

In [None]:
# Logistic regression
from sklearn import linear_model

# Bag of Words (unigram) 
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect1_tr, y_train, count_vect1_te)
print("Logistic regression,Bag of Words (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect2_tr, y_train, count_vect2_te)
print("Logistic regression,Bag of Words (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect3_tr, y_train, count_vect3_te)
print("Logistic regression,Bag of Words (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect4_tr, y_train, count_vect4_te)
print("Logistic regression,Bag of Words (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect5_tr, y_train, count_vect5_te)
print("Logistic regression,Bag of Words (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect6_tr, y_train, count_vect6_te)
print("Logistic regression,Bag of Words (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect7_tr, y_train, count_vect7_te)
print("Logistic regression,Bag of Words (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs',max_iter=500), count_vect8_tr, y_train, count_vect8_te)
print("Logistic regression,Bag of Words (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

In [None]:
# TF-IDF (unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect1_tr, y_train, tfidf_vect1_te)
print("Logistic regression,TF-IDF (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect2_tr, y_train, tfidf_vect2_te)
print("Logistic regression,TF-IDF (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect3_tr, y_train, tfidf_vect3_te)
print("Logistic regression,TF-IDF (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect4_tr, y_train, tfidf_vect4_te)
print("Logistic regression,TF-IDF (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect5_tr, y_train, tfidf_vect5_te)
print("Logistic regression,TF-IDF (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect6_tr, y_train, tfidf_vect6_te)
print("Logistic regression,TF-IDF (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect7_tr, y_train, tfidf_vect7_te)
print("Logistic regression,TF-IDF (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(linear_model.LogisticRegression(solver='lbfgs'), tfidf_vect8_tr, y_train, tfidf_vect8_te)
print("Logistic regression,TF-IDF (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

## 9.3) Support Vector Machine (SVM)

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of both classes. Maximizing the margin gives some robustness, so that future data points can be classified with more confidence.
<img src="images\svm.jpg">
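
As a minimal sketch of the maximum-margin idea, a linear `SVC` can be fitted on four separable points and the margin width recovered as 2 / ||w||. The toy points, the very large `C` (used here to approximate a hard margin) and the variable names are assumptions for illustration only; they are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 sits on x=0, class 1 sits on x=2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM (assumption for this sketch).
svm_toy = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm_toy.coef_[0]                     # normal vector of the separating hyperplane
b = svm_toy.intercept_[0]                # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries

print("w =", w, "b =", b)
print("margin width:", margin_width)
print("support vectors:\n", svm_toy.support_vectors_)
```

For this layout the widest possible gap between the classes is the full distance between the two columns of points, which is what the fitted margin should approach.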

#### sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

parameters:

C: float, optional (default=1.0)
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

kernel: string, optional (default=’rbf’)
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’ (text classification), ‘poly’ (image processing), ‘rbf’ (radial basis function), ‘sigmoid’ (proxy for neural network), ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).

degree: int, optional (default=3)
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.

gamma: {‘scale’, ‘auto’} or float, optional (default=’scale’)
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.

if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,

if ‘auto’, uses 1 / n_features.

coef0: float, optional (default=0.0)
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.

shrinking: boolean, optional (default=True)
Whether to use the shrinking heuristic.

probability: boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict.

tol: float, optional (default=1e-3)
Tolerance for stopping criterion.

cache_size: float, optional
Specify the size of the kernel cache (in MB).

class_weight: {dict, ‘balanced’}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

verbose: bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.

max_iter: int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.

decision_function_shape: ‘ovo’, ‘ovr’, default=’ovr’
Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one (‘ovo’) is always used as multi-class strategy.

break_ties: bool, optional (default=False)
If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.


random_state: int, RandomState instance or None, optional (default=None)
The seed of the pseudo random number generator used when shuffling the data for probability estimates. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
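
Two of the formulas in the parameter list above can be checked by hand. The sketch below computes the `gamma='scale'` and `gamma='auto'` values and the “balanced” class weights on a tiny made-up array; `X_demo`, `y_demo` and the other names are hypothetical, and `compute_class_weight` is used only as a cross-check of the formula quoted for `class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical data: 4 samples, 2 features, imbalanced labels (three 0s, one 1).
X_demo = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y_demo = np.array([0, 0, 0, 1])
n_samples, n_features = X_demo.shape

# gamma='scale' uses 1 / (n_features * X.var()); gamma='auto' uses 1 / n_features.
gamma_scale = 1.0 / (n_features * X_demo.var())
gamma_auto = 1.0 / n_features

# "balanced" weights: n_samples / (n_classes * np.bincount(y)).
classes = np.unique(y_demo)
manual_weights = n_samples / (len(classes) * np.bincount(y_demo))
sklearn_weights = compute_class_weight("balanced", classes=classes, y=y_demo)

print("gamma (scale):", gamma_scale)
print("gamma (auto):", gamma_auto)
print("manual weights:", manual_weights)
print("sklearn weights:", sklearn_weights)
```

The rarer class receives the larger weight, which is the sense in which the weights are "inversely proportional to class frequencies".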

In [None]:
# SVM
from sklearn import svm

# Bag of Words (unigram) 
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect1_tr, y_train, count_vect1_te)
print("SVM,Bag of Words (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect2_tr, y_train, count_vect2_te)
print("SVM,Bag of Words (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect3_tr, y_train, count_vect3_te)
print("SVM,Bag of Words (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect4_tr, y_train, count_vect4_te)
print("SVM,Bag of Words (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect5_tr, y_train, count_vect5_te)
print("SVM,Bag of Words (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect6_tr, y_train, count_vect6_te)
print("SVM,Bag of Words (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect7_tr, y_train, count_vect7_te)
print("SVM,Bag of Words (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), count_vect8_tr, y_train, count_vect8_te)
print("SVM,Bag of Words (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

In [None]:
# TF-IDF (unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect1_tr, y_train, tfidf_vect1_te)
print("SVM,TF-IDF (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect2_tr, y_train, tfidf_vect2_te)
print("SVM,TF-IDF (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect3_tr, y_train, tfidf_vect3_te)
print("SVM,TF-IDF (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect4_tr, y_train, tfidf_vect4_te)
print("SVM,TF-IDF (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect5_tr, y_train, tfidf_vect5_te)
print("SVM,TF-IDF (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect6_tr, y_train, tfidf_vect6_te)
print("SVM,TF-IDF (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect7_tr, y_train, tfidf_vect7_te)
print("SVM,TF-IDF (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(svm.SVC(kernel='linear'), tfidf_vect8_tr, y_train, tfidf_vect8_te)
print("SVM,TF-IDF (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

## 9.4) Ensemble methods

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees

By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Examples: AdaBoost, Gradient Tree Boosting, …
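
The variance-reduction argument for averaging methods can be illustrated numerically. This is a sketch under a strong assumption: the "base estimators" are simulated as independent, unbiased, equally noisy guesses of a single true value, which real models trained on overlapping data only approximate. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_trials, n_estimators = 10_000, 25

# Each row is one trial; each column is one independent, noisy base estimator.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_estimators))

var_single = predictions[:, 0].var()             # variance of one estimator alone
var_averaged = predictions.mean(axis=1).var()    # variance of the 25-estimator average

print("variance of a single estimator:", var_single)
print("variance of the averaged estimator:", var_averaged)
```

With independent estimators the variance of the average shrinks roughly in proportion to the number of estimators; correlated estimators, as in practice, give a smaller reduction.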

## 9.4.1) Bagging Models

We will be implementing a Random Forest model. Random forests are a type of ensemble model, specifically a bagging model, and belong to the tree-based model family. A forest is made up of trees, and it is often said that the more trees it has, the more robust the forest is. A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree and selects the final answer by voting. It also provides a fairly good indicator of feature importance.
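
The bootstrap-then-vote procedure described above can be sketched by hand with plain decision trees. This is an illustration on synthetic data, not how the notebook trains its model, and not necessarily how `RandomForestClassifier` combines its trees internally; the dataset, the number of trees and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data.
X_bag, y_bag = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrap sample: draw n_samples row indices with replacement.
    idx = rng.integers(0, len(X_bag), size=len(X_bag))
    # Each split considers a random subset of features, as in a random forest.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_bag[idx], y_bag[idx]))

# One row of 0/1 predictions per tree, then a majority vote per sample.
votes = np.stack([t.predict(X_bag) for t in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
train_acc = (majority == y_bag).mean()

print("votes shape:", votes.shape)
print("training accuracy of the majority vote:", train_acc)
```

Accuracy is measured on the training data here only to keep the sketch short; a held-out set would be needed for an honest estimate.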

#### sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

parameters:

n_estimators: integer, optional (default=100)
The number of trees in the forest.

criterion: string, optional (default=”gini”)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. 

max_depth: integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split: int, float, optional (default=2)
The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.

If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

min_samples_leaf: int, float, optional (default=1)
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

min_weight_fraction_leaf: float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features: int, float, string or None, optional (default=”auto”)
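The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

If “auto” or “sqrt”, then max_features=sqrt(n_features).

If “log2”, then max_features=log2(n_features).

If None, then max_features=n_features.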

max_leaf_nodes: int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease: float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

min_impurity_split: float, (default=1e-7)
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

bootstrap: boolean, optional (default=True)
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_score: bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.

n_jobs: int or None, optional (default=None)
The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. 

random_state: int, RandomState instance or None, optional (default=None)
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). 

verbose: int, optional (default=0)
Controls the verbosity when fitting and predicting.

warm_start: bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

class_weight: dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

ccp_alpha: non-negative float, optional (default=0.0)
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. 

max_samples: int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
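
A few of the parameters above are easier to understand in action. The sketch below fits a forest on synthetic data with `oob_score=True` and reads back the out-of-bag accuracy and the feature importances. The dataset and the chosen values are assumptions for illustration, not tuned settings for the notebook's text features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 8 features, of which only 3 are informative.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    oob_score=True,        # score each sample using trees that did not see it
    random_state=0,
)
forest.fit(X_syn, y_syn)

oob_accuracy = forest.oob_score_
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # feature indices, most important first

print("out-of-bag accuracy:", oob_accuracy)
print("feature ranking:", ranking)
```

The out-of-bag score gives a generalization estimate without a separate validation split, which can be convenient when the training set is small.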

In [None]:
# Random Forest
from sklearn import ensemble

# Bag of Words (unigram) 
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect1_tr, y_train, count_vect1_te)
print("Random Forest,Bag of Words (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect2_tr, y_train, count_vect2_te)
print("Random Forest,Bag of Words (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect3_tr, y_train, count_vect3_te)
print("Random Forest,Bag of Words (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect4_tr, y_train, count_vect4_te)
print("Random Forest,Bag of Words (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect5_tr, y_train, count_vect5_te)
print("Random Forest,Bag of Words (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect6_tr, y_train, count_vect6_te)
print("Random Forest,Bag of Words (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect7_tr, y_train, count_vect7_te)
print("Random Forest,Bag of Words (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), count_vect8_tr, y_train, count_vect8_te)
print("Random Forest,Bag of Words (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

In [None]:
# TF-IDF (unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect1_tr, y_train, tfidf_vect1_te)
print("Random Forest,TF-IDF (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect2_tr, y_train, tfidf_vect2_te)
print("Random Forest,TF-IDF (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect3_tr, y_train, tfidf_vect3_te)
print("Random Forest,TF-IDF (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect4_tr, y_train, tfidf_vect4_te)
print("Random Forest,TF-IDF (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect5_tr, y_train, tfidf_vect5_te)
print("Random Forest,TF-IDF (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect6_tr, y_train, tfidf_vect6_te)
print("Random Forest,TF-IDF (character level,bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect7_tr, y_train, tfidf_vect7_te)
print("Random Forest,TF-IDF (character level,trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(ensemble.RandomForestClassifier(), tfidf_vect8_tr, y_train, tfidf_vect8_te)
print("Random Forest,TF-IDF (character level,unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

## 9.4.2) Boosting Models

Boosting models are another type of ensemble model, and are often built on trees. Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples only somewhat better than random guessing).

We will be implementing an Extreme Gradient Boosting (XGBoost) model.
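
To make the weak-learner idea concrete, the sketch below compares a single depth-1 decision tree (a "stump") with an AdaBoost ensemble on synthetic data. AdaBoost is swapped in here purely as a simple, readily available boosting example; it is not the XGBoost model used in this notebook, and the dataset and names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data.
X_b, y_b = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_b, y_b, test_size=0.25, random_state=0)

# A weak learner: a tree that is allowed a single split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

# Boosting: fit weak learners in sequence, re-weighting the training samples
# so that later learners focus on previously misclassified examples.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

stump_acc = stump.score(X_te, y_te)
boosted_acc = boosted.score(X_te, y_te)

print("single stump accuracy:", stump_acc)
print("boosted ensemble accuracy:", boosted_acc)
```

On data like this the ensemble would typically be expected to score higher than the lone stump, though the size of the gap depends on the dataset.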

General Parameters: 
These define the overall functionality of XGBoost.

booster: [default=gbtree]
Select the type of model to run at each iteration. It has 2 options:
gbtree: tree-based models
gblinear: linear models

silent: [default=0]
Silent mode is activated if set to 1, i.e. no running messages will be printed.
It’s generally good to keep it 0 as the messages might help in understanding the model.

nthread: [default to maximum number of threads available if not set]
This is used for parallel processing; the number of cores in the system should be entered.
If you wish to run on all cores, leave the value unset and the algorithm will detect it automatically.

Booster Parameters:
Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, and thus the latter is rarely used.

eta: [default=0.3]
Analogous to learning rate in GBM
Makes the model more robust by shrinking the weights on each step
Typical final values to be used: 0.01-0.2

min_child_weight: [default=1]
Defines the minimum sum of weights of all observations required in a child.
Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
Values that are too high can lead to under-fitting; hence, it should be tuned using cross-validation (CV).

max_depth: [default=6]
The maximum depth of a tree
Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
Should be tuned using CV.
Typical values: 3-10

max_leaf_nodes:
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
If this is defined, GBM will ignore max_depth.

gamma: [default=0]
A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

max_delta_step: [default=0]
The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative.

subsample: [default=1]
Denotes the fraction of observations to be randomly sampled for each tree.
Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
Typical values: 0.5-1

colsample_bytree: [default=1]
Denotes the fraction of columns to be randomly sampled for each tree.
Typical values: 0.5-1

colsample_bylevel: [default=1]
Denotes the subsample ratio of columns for each split, in each level.

lambda: [default=1]
L2 regularization term on weights (analogous to Ridge regression)
This is used to handle the regularization part of XGBoost.

alpha: [default=0]
L1 regularization term on weight (analogous to Lasso regression)
Can be used in case of very high dimensionality so that the algorithm runs faster when implemented

scale_pos_weight: [default=1]
A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
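
The names above are the native XGBoost parameter names. The scikit-learn style wrapper used in the cells below (`XGBClassifier`) accepts some of them under different keyword names. The sketch below shows one way to translate a native-style dictionary; the helper `to_sklearn_kwargs`, the mapping table and the example values are hypothetical and are not taken from this notebook. It deliberately avoids importing `xgboost` so it can be run on its own.

```python
# Native XGBoost names on the left, scikit-learn wrapper keyword names on the right.
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
    "nthread": "n_jobs",
}


def to_sklearn_kwargs(native_params):
    """Rename native-style keys; keys without an alias pass through unchanged."""
    return {NATIVE_TO_SKLEARN.get(key, key): value for key, value in native_params.items()}


# Example values only (assumptions), loosely following the typical ranges listed above.
native_params = {
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 1,
    "alpha": 0,
    "scale_pos_weight": 1,
}

xgb_kwargs = to_sklearn_kwargs(native_params)
print(xgb_kwargs)
```

If `xgboost` is installed, the result could be passed on as `xgboost.XGBClassifier(**xgb_kwargs)`. Parameter names have changed between XGBoost releases, so it is worth checking the names against the installed version before relying on this mapping.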

In [None]:
#!pip install xgboost

In [None]:
# Extreme Gradient Boosting Model
import xgboost

# Bag of Words (unigram) 
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect1_tr, y_train, count_vect1_te)
print("Extreme Gradient Boosting Model,Bag of Words (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect2_tr, y_train, count_vect2_te)
print("Extreme Gradient Boosting Model,Bag of Words (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect3_tr, y_train, count_vect3_te)
print("Extreme Gradient Boosting Model,Bag of Words (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect4_tr, y_train, count_vect4_te)
print("Extreme Gradient Boosting Model,Bag of Words (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect5_tr, y_train, count_vect5_te)
print("Extreme Gradient Boosting Model,Bag of Words (character level,unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect6_tr, y_train, count_vect6_te)
print("Extreme Gradient Boosting Model, Bag of Words (character level, bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect7_tr, y_train, count_vect7_te)
print("Extreme Gradient Boosting Model, Bag of Words (character level, trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# Bag of Words (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), count_vect8_tr, y_train, count_vect8_te)
print("Extreme Gradient Boosting Model, Bag of Words (character level, unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

In [None]:
# TF-IDF (unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect1_tr, y_train, tfidf_vect1_te)
print("Extreme Gradient Boosting Model, TF-IDF (unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect2_tr, y_train, tfidf_vect2_te)
print("Extreme Gradient Boosting Model, TF-IDF (bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect3_tr, y_train, tfidf_vect3_te)
print("Extreme Gradient Boosting Model, TF-IDF (trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect4_tr, y_train, tfidf_vect4_te)
print("Extreme Gradient Boosting Model, TF-IDF (unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect5_tr, y_train, tfidf_vect5_te)
print("Extreme Gradient Boosting Model, TF-IDF (character level, unigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect6_tr, y_train, tfidf_vect6_te)
print("Extreme Gradient Boosting Model, TF-IDF (character level, bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,trigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect7_tr, y_train, tfidf_vect7_te)
print("Extreme Gradient Boosting Model, TF-IDF (character level, trigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)

# TF-IDF (character level,unigram and bigram)
(accuracy,cm, precision,f1, recall, auc, model) = train_model(xgboost.XGBClassifier(), tfidf_vect8_tr, y_train, tfidf_vect8_te)
print("Extreme Gradient Boosting Model, TF-IDF (character level, unigram and bigram): ")
print ("Accuracy: ", accuracy)
print ("Confusion matrix:\n ", cm)
print ("Precision: ", precision)
print ("F1 score: ", f1)
print ("Recall: ", recall)
print ("Area under curve: ", auc)
print ("--" *50)
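
The evaluation cells above repeat the same seven print statements for every feature set. A minimal sketch of a helper that formats those metrics in one place is shown below. `format_report` and `METRIC_NAMES` are hypothetical names that are not part of the notebook, and the example tuple holds made-up values that only mimic the shape of what `train_model` returns (six metrics followed by the fitted model).

```python
# Labels for the first six values returned by train_model, in order.
METRIC_NAMES = (
    "Accuracy",
    "Confusion matrix",
    "Precision",
    "F1 score",
    "Recall",
    "Area under curve",
)


def format_report(title, results):
    """Build the printable report for one (model, feature set) run.

    `results` is assumed to be the tuple returned by train_model:
    (accuracy, cm, precision, f1, recall, auc, model). The trailing
    model object is ignored.
    """
    lines = [f"{title}: "]
    for name, value in zip(METRIC_NAMES, results[:6]):
        # Put the confusion matrix on its own line, as the cells above do.
        sep = "\n" if name == "Confusion matrix" else " "
        lines.append(f"{name}:{sep}{value}")
    lines.append("--" * 50)
    return "\n".join(lines)


# Made-up stand-in values, for illustration only (not real results).
example = (0.8, [[50, 10], [10, 30]], 0.75, 0.75, 0.75, 0.79, None)
report = format_report("Extreme Gradient Boosting Model, TF-IDF (unigram)", example)
print(report)
```

In the notebook, each block could then shrink to a single call such as `print(format_report("Extreme Gradient Boosting Model, TF-IDF (unigram)", train_model(xgboost.XGBClassifier(), tfidf_vect1_tr, y_train, tfidf_vect1_te)))`, which also keeps the labels consistent across feature sets.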

## Save Model

In [None]:
#!pip install joblib

In [None]:
from sklearn import linear_model
model1=linear_model.LogisticRegression(solver='lbfgs',max_iter=500)

In [None]:
model1.fit(count_vect4_tr,y_train)

In [None]:
import joblib  # joblib handles the serialization below; pickle is not used

In [None]:
joblib.dump(model1, 'model1.pkl')
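
Saving only the classifier is not enough to score new text later: the fitted vectorizer is needed as well, because the model expects exactly the columns it was trained on. The sketch below shows one possible way to store both objects together and load them back. The toy corpus, the labels, the `bundle.pkl` file name, and the `ngram_range=(1, 2)` setting are assumptions made for illustration and are not taken from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, invented for illustration only.
texts = [
    "wildfire spreads across the hills",
    "flood waters rise in the city",
    "new cafe opens downtown",
    "local team wins the match",
]
labels = [1, 1, 0, 0]

# Assumed unigram + bigram setup, mirroring the "unigram and bigram" features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
model = LogisticRegression(solver="lbfgs", max_iter=500).fit(X, labels)

# Persist the vectorizer and the model in one file, then reload them.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")
    joblib.dump({"vectorizer": vectorizer, "model": model}, path)
    bundle = joblib.load(path)

# The reloaded pair should reproduce the in-memory prediction.
sample = ["wildfire near the city"]
new_X = bundle["vectorizer"].transform(sample)
restored_pred = bundle["model"].predict(new_X)
original_pred = model.predict(vectorizer.transform(sample))
print(restored_pred, original_pred)
```

Files written by `joblib.dump` should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.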

## Custom testing

In [None]:
# example text for model testing
simple_test = ["Some Flee, Others Restock Before Australian Wildfires Worsen"]

In [None]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = count_vectorizer4.transform(simple_test)
simple_test_dtm.toarray()

In [None]:
# examine the vocabulary and document-term matrix together
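# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
pd.DataFrame(simple_test_dtm.toarray(), columns=count_vectorizer4.get_feature_names_out())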

In [None]:
pred=model1.predict(simple_test_dtm)

In [None]:
print(pred)