# Word List

In the first part of this laboratory exercise, I will demonstrate how to use the word list method to measure the sentiment of movie reviews.

## Simple Illustration

In [1]:
sentence = "'Glorious new chapter for Hong Kong swimming': top officials congratulate Siobhan Haughey for Olympic win"

# Convert all text to lower case 
sentence = sentence.lower()

# Remove special characters
sentence = sentence.replace("'", '')
sentence = sentence.replace(":", '')

# Simple tokenization
words= sentence.split(' ')

print(words)

['glorious', 'new', 'chapter', 'for', 'hong', 'kong', 'swimming', 'top', 'officials', 'congratulate', 'siobhan', 'haughey', 'for', 'olympic', 'win']


In [2]:
positive_words = ['awesome', 'glorious', 'nice', 'super', 'win', 'delightful', 'congratulate']
negative_words = ['awful', 'lame', 'horrible', 'bad', 'scare']

In [3]:
set(words) - set(positive_words)

{'chapter',
 'for',
 'haughey',
 'hong',
 'kong',
 'new',
 'officials',
 'olympic',
 'siobhan',
 'swimming',
 'top'}

In [4]:
set(words).intersection(set(positive_words))

{'congratulate', 'glorious', 'win'}

In [5]:
pos = len(set(words).intersection(set(positive_words))) / len(words)
print('Positive sentiment score: {}'.format(pos))

Positive sentiment score: 0.2


In [6]:
neg = len(set(words).intersection(set(negative_words))) / len(words)
print('Negative sentiment score: {}'.format(neg))

Negative sentiment score: 0.0
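
The steps above can be wrapped into two small helpers so that the same score is reusable for any sentence. This is only a sketch: the names `tokenize` and `word_list_score` are my own, and the scoring rule mirrors the cells above (distinct matched words divided by the total number of tokens).

```python
def tokenize(sentence):
    # Lower-case, drop the two punctuation marks handled above, split on spaces
    cleaned = sentence.lower().replace("'", "").replace(":", "")
    return cleaned.split(" ")


def word_list_score(words, word_list):
    # Distinct matched words divided by the total number of tokens
    if not words:
        return 0.0
    return len(set(words) & set(word_list)) / len(words)


headline = ("'Glorious new chapter for Hong Kong swimming': top officials "
            "congratulate Siobhan Haughey for Olympic win")
positive_words = ['awesome', 'glorious', 'nice', 'super', 'win', 'delightful', 'congratulate']
negative_words = ['awful', 'lame', 'horrible', 'bad', 'scare']

tokens = tokenize(headline)
print('Positive sentiment score: {}'.format(word_list_score(tokens, positive_words)))
print('Negative sentiment score: {}'.format(word_list_score(tokens, negative_words)))
```

Because the numerator uses a set, a positive word that appears twice is counted only once. Counting every occurrence instead would be an equally reasonable design choice.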


## Movie Review Sentiment Analysis

IMDB lets users rate movies on a scale from 1 to 10. The data curator labeled reviews with a rating of 4 or lower as **negative** and reviews with a rating of 7 or higher as **positive**. Reviews rated 5 or 6 were excluded. These labels serve as the reference standard for the comparisons that follow.
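
The labeling rule can be written as a small function. This is an illustrative sketch only: the dataset used below already contains the `sentiment` labels, and the helper name `label_from_rating` is hypothetical.

```python
def label_from_rating(rating):
    # Ratings of 4 or lower are negative, 7 or higher are positive;
    # ratings of 5 and 6 are left unlabeled (None) and would be dropped
    if rating <= 4:
        return 'negative'
    if rating >= 7:
        return 'positive'
    return None


for rating in range(1, 11):
    print(rating, label_from_rating(rating))
```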

In [7]:
# pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Import useful libraries used for data management
import pandas as pd
import numpy as np
import re
import nltk

In [15]:
dataset = pd.read_csv('IMDB.csv')
dataset

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [16]:
dataset['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [17]:
# Sample 1000 reviews from each sentiment
dataset = dataset.groupby('sentiment').apply(lambda x: x.sample(1000),include_groups=True).reset_index(drop = True)



In [19]:
pip list

Package                   Version
------------------------- ---------------
annotated-types           0.6.0
anyio                     4.3.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
attrs                     23.2.0
Babel                     2.14.0
beautifulsoup4            4.12.3
bleach                    6.1.0
blis                      0.7.11
catalogue                 2.0.10
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpathlib              0.16.0
colorama                  0.4.6
comm                      0.2.1
confection                0.1.4
contourpy                 1.2.0
cycler                    0.12.1
cymem                     2.0.8
debugpy                   1.8.1
decorator                 5.1.1
defusedxml                0.7.1
en-core-web-sm            3.7.1
exceptiongroup 




In [12]:
# Create a numeric 'label' column from 'sentiment'; 'negative' as 0, 'positive' as 1
dataset['label'] = dataset['sentiment'].map({'negative':0, 'positive':1})
dataset

KeyError: 'sentiment'

### Text Preprocessing

In [13]:
# Convert to list
review = dataset.review.values.tolist()

# Remove all html tags
review = [re.sub("<.*?>", " ", i) for i in review]

# Remove unnecessary characters
review = [re.sub("[^A-Za-z0-9]+", " ", i) for i in review]

# Change to lower case
review = [i.lower() for i in review]

# Define functions for stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(texts):
    return [[word for word in doc.split(' ') if word not in stop_words] for doc in texts]

# Remove Stop Words
review = remove_stopwords(review)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
print(dataset.review[0])
print('\n')
print(review[0])

A fabulous book about a fox and his family who does what foxs do. that being stealing from farms and killing prey. until a trio of farmers decide they've had enough of this fox and try in various ways to have the problem "solved". They are of course "out foxed" at every turn and while the trio are camped out at the fox hole the family perform raids against the three farmers land.<br /><br />The"film" version ,and I use the term film very loosely, is more of a god awful pastiche of American heist movies particularly the Oceans movies. They they even have George clooney as Mr fox to to add to the insult and manage to miss the point of the story quite completely. So kudos to them .They'll make lots of money and destroy another classic Roald Dahl children book.


['fabulous', 'book', 'fox', 'family', 'foxs', 'stealing', 'farms', 'killing', 'prey', 'trio', 'farmers', 'decide', 'enough', 'fox', 'try', 'various', 'ways', 'problem', 'solved', 'course', 'foxed', 'every', 'turn', 'trio', 'camped

### Load External Emotion Lexicon

Our analysis uses the NRC Word-Emotion Association Lexicon, which associates English words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

For more information, please visit: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm#:~:text=The%20NRC%20Emotion%20Lexicon%20is,sentiments%20(negative%20and%20positive).

<div>
   <img src="img/summary.png" width="1200">
</div>

In [15]:
lexicon = pd.read_csv('NRC-Emotion-Lexicon.txt')
lexicon.head()

Unnamed: 0,aback\tanger\t0
0,aback\tanticipation\t0
1,aback\tdisgust\t0
2,aback\tfear\t0
3,aback\tjoy\t0
4,aback\tnegative\t0


The output above shows that the file is **tab**-delimited, so we need to change the delimiter from the default **comma** (',') to **tab** ('\t'). In addition, the file has no header row, so we need to set the column names manually.

In [16]:
lexicon = pd.read_csv('NRC-Emotion-Lexicon.txt', sep = '\t', names = ['term', 'category', 'associated'])
lexicon.head()

Unnamed: 0,term,category,associated
0,aback,anger,0
1,aback,anticipation,0
2,aback,disgust,0
3,aback,fear,0
4,aback,joy,0


#### Random Sample 10 Positive Words

In [17]:
(lexicon['category'] == 'positive')

0         False
1         False
2         False
3         False
4         False
          ...  
141565    False
141566     True
141567    False
141568    False
141569    False
Name: category, Length: 141570, dtype: bool

In [18]:
(lexicon['associated'] == 1)

0         False
1         False
2         False
3         False
4         False
          ...  
141565    False
141566    False
141567    False
141568    False
141569    False
Name: associated, Length: 141570, dtype: bool

In [19]:
lexicon[(lexicon['category'] == 'positive') & (lexicon['associated'] == 1)]

Unnamed: 0,term,category,associated
76,abba,positive,1
206,ability,positive,1
366,abovementioned,positive,1
486,absolute,positive,1
496,absolution,positive,1
...,...,...,...
141216,yearning,positive,1
141396,youth,positive,1
141426,zeal,positive,1
141446,zealous,positive,1


In [20]:
list(lexicon[(lexicon['category'] == 'positive') & (lexicon['associated'] == 1)].term.sample(10))

['unanimity',
 'excellent',
 'articulate',
 'launch',
 'render',
 'irreducible',
 'improved',
 'harvest',
 'exhortation',
 'partnership']

#### Random Sample 10 Negative Words

In [21]:
list(lexicon[(lexicon['category'] == 'negative') & (lexicon['associated'] == 1)].term.sample(10))

['unsavory',
 'gasping',
 'unlawful',
 'litter',
 'disconnection',
 'miserable',
 'disbelieve',
 'disagree',
 'battle',
 'servile']

### Calculate Sentiment of Movie Reviews

In [22]:
pos_list = list(lexicon[(lexicon['category'] == 'positive') & (lexicon['associated'] == 1)].term)
neg_list = list(lexicon[(lexicon['category'] == 'negative') & (lexicon['associated'] == 1)].term)

In [23]:
print(len(pos_list))
print(len(neg_list))

2308
3318


In [24]:
# A Function to Construct a Sentiment Variable Using a Lexicon-Based Approach

def sentiment_score(text, sen_list):
    temp_list = []
    for t in text:
        temp = 0
        for w in sen_list:
            temp += t.count(w)
        temp_list.append(temp/len(t))
    return temp_list
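
The function above scans each review once per lexicon term, which becomes slow for long word lists. A possible alternative, sketched below under the assumption that the lexicon list contains no duplicate terms, checks each token against a set instead. The name `sentiment_score_fast` is my own, and the guard for empty reviews is an addition that the original function does not have.

```python
def sentiment_score_fast(texts, sen_list):
    # Set membership is O(1), so each review is scanned only once
    lexicon_terms = set(sen_list)
    scores = []
    for tokens in texts:
        if not tokens:
            # Avoid division by zero for empty reviews
            scores.append(0.0)
            continue
        hits = sum(1 for token in tokens if token in lexicon_terms)
        scores.append(hits / len(tokens))
    return scores


toy_reviews = [['good', 'plot', 'good', 'cast'], ['bad', 'film'], []]
print(sentiment_score_fast(toy_reviews, ['good', 'great']))
```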

In [25]:
dataset['Pos_Dic'] = sentiment_score(review, pos_list)
dataset['Neg_Dic'] = sentiment_score(review, neg_list)

# Calculate polarity = positive - negative
dataset['Sentiment_Dic'] = dataset['Pos_Dic'] - dataset['Neg_Dic']
dataset.head()

Unnamed: 0,review,sentiment,label,Pos_Dic,Neg_Dic,Sentiment_Dic
0,A fabulous book about a fox and his family who...,negative,0,0.09589,0.082192,0.013699
1,i searched video store everywhere to find this...,negative,0,0.130081,0.056911,0.073171
2,I saw this movie at the Edmonton International...,negative,0,0.070707,0.161616,-0.090909
3,I recently watched this film on The Sundance C...,negative,0,0.054348,0.054348,0.0
4,The only reason I know this film exists is bec...,negative,0,0.05814,0.081395,-0.023256


In [26]:
dataset[['label', 'Sentiment_Dic']].corr(method ='spearman')

Unnamed: 0,label,Sentiment_Dic
label,1.0,0.435533
Sentiment_Dic,0.435533,1.0
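
Besides the rank correlation, the polarity score can also be read as a simple classifier by predicting "positive" whenever the score is above zero. The sketch below applies that rule to the five rows printed earlier. The zero cutoff and the helper names are assumptions of mine, not part of the original analysis.

```python
def polarity_to_label(polarity, threshold=0.0):
    # Predict 1 (positive) when the polarity exceeds the threshold, else 0
    return 1 if polarity > threshold else 0


def accuracy(labels, predictions):
    # Share of predictions that match the reference labels
    matches = sum(1 for l, p in zip(labels, predictions) if l == p)
    return matches / len(labels)


# Sentiment_Dic and label values of the five rows shown in dataset.head()
polarities = [0.013699, 0.073171, -0.090909, 0.0, -0.023256]
labels = [0, 0, 0, 0, 0]

predictions = [polarity_to_label(p) for p in polarities]
print(predictions)
print(accuracy(labels, predictions))
```

Five rows are far too few to judge the method; the same two helpers could be applied to the full `Sentiment_Dic` and `label` columns.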


### Word Power

<div>
   <img src="img/example.png" width="600">
</div>

# Text Regression

In the second part of the laboratory exercise, I will demonstrate the application of various regression models to identify confirmed instances of COVID-19-related misinformation.

## Fake News Detection

The dataset on COVID-19 misinformation was sourced from a repository known as "ReCOVery," as documented by Zhou et al. (2020). This dataset comprises news items specifically related to COVID-19, which have been corroborated through two independent verification platforms: NewsGuard and Media Bias/Fact Check.

The identification of pertinent news articles was conducted using a predefined set of keywords including **"SARS-CoV-2,"** **"COVID-19,"** and **"Coronavirus."**

For comprehensive information, please refer to the following document: https://arxiv.org/pdf/2006.05557.pdf.

In [20]:
# pip install spacy

In [21]:
import pandas as pd
import numpy as np

import spacy 
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

from nltk.stem import WordNetLemmatizer

import warnings
warnings.filterwarnings("ignore")

In [22]:
fakeNews = pd.read_csv("fake_news.csv")
fakeNews.head()

Unnamed: 0,news_id,url,publisher,publish_date,author,title,image,body_text,news_guard_score,mbfc_level,political_bias,country,reliability
0,0,https://www.nytimes.com/article/what-is-corona...,The New York Times,2020-01-21,"['Knvul Sheikh', 'Roni Caryn Rabin']",The Coronavirus: What Scientists Have Learned ...,https://static01.nyt.com/images/2020/03/12/sci...,\nA novel respiratory virus that originated in...,100.0,High,Left,USA,1
1,1,https://www.npr.org/2020/01/22/798392172/chine...,National Public Radio (NPR),2020-01-22,['Emily Feng'],Chinese Health Officials: More Die From Newly ...,https://media.npr.org/include/images/facebook-...,Chinese Health Officials: More Die From Newly ...,100.0,Very high,Center,USA,1
2,2,https://www.theverge.com/2020/1/23/21078457/co...,The Verge,2020-01-23,['Nicole Wetsman'],Everything you need to know about the coronavirus,https://cdn.vox-cdn.com/thumbor/a9_Oz7cvSBKyal...,Public health experts around the globe are scr...,100.0,High,Left-center,USA,1
3,3,https://www.worldhealth.net/news/novel-coronav...,WorldHealth.Net,2020-01-24,[],Novel Coronavirus Cases Confirmed To Be Spreading,https://www.worldhealth.net/media/original_ima...,The first two coronavirus cases in Europe have...,30.0,Low,,USA,0
4,4,https://www.theverge.com/2020/1/24/21080845/co...,The Verge,2020-01-24,"['Nicole Wetsman', 'Zoe Schiffer', 'Jay Peters...",Coronavirus disrupts the world: updates on the...,https://cdn.vox-cdn.com/thumbor/t2gt1SmEni4Mcr...,"A new coronavirus appeared in Wuhan, China, at...",100.0,High,Left-center,USA,1


In [23]:
# Simple preprocessing by removing extra lines and lowercasing all text
fakeNews['body_text'] = fakeNews['body_text'].replace('\n','', regex=True)
fakeNews['body_text'] = fakeNews['body_text'].replace('\r','', regex=True)
fakeNews['body_text'] = [x.lower() for x in fakeNews['body_text']]


# Further preprocessing by removing all stopwords and lemmatizing all text
documents = []

# WordNetLemmatizer lemmatizes words (it is not a stemmer, despite the variable name)
stemmer = WordNetLemmatizer()

# Create a blank English pipeline (tokenizer only) once, outside the loop
nlp = English()

for text in fakeNews['body_text']:
    # The "nlp" object tokenizes the text into a spaCy Doc
    my_doc = nlp(text)

    # Create list of word tokens
    token_list = []
    for token in my_doc:
        token_list.append(token.text)

    # Create list of word tokens after removing stopwords
    filtered_sentence =[] 

    for word in token_list:
        lexeme = nlp.vocab[word]
        if lexeme.is_stop == False:
            filtered_sentence.append(word) 

    document = [stemmer.lemmatize(word) for word in filtered_sentence]
    document = ' '.join(document)

    documents.append(document)

In [24]:
fakeNews['body_text_process'] = documents
fakeNews['fake'] = 1 - fakeNews['reliability']
fakeNews = fakeNews[['fake', 'body_text_process']]
fakeNews.head()

Unnamed: 0,fake,body_text_process
0,0,"novel respiratory virus originated wuhan , chi..."
1,0,chinese health official : die newly identified...
2,0,public health expert globe scrambling understa...
3,1,"coronavirus case europe detected france , seco..."
4,0,"new coronavirus appeared wuhan , china , start..."


In [25]:
# import cross validation and other evaluation tool
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [26]:
from sklearn.model_selection import train_test_split

News = fakeNews['body_text_process'].values
y = fakeNews['fake'].values

News_train, News_test, y_train, y_test = train_test_split(News, y, test_size=0.2, random_state=1)

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df = 0.5, min_df = 0.02)
vectorizer.fit(News_train)

X = vectorizer.transform(News).toarray()
X_train = vectorizer.transform(News_train).toarray()
X_test = vectorizer.transform(News_test).toarray()
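
The `max_df` and `min_df` arguments filter the vocabulary by document frequency: with float values, a term is kept only if the share of documents containing it lies between `min_df` and `max_df`. The pure-Python sketch below imitates that filter on a toy corpus. It splits on whitespace, which is simpler than the tokenization `TfidfVectorizer` applies, and the function name is hypothetical.

```python
from collections import Counter


def document_frequency_filter(docs, min_df=0.02, max_df=0.5):
    # Count in how many documents each term appears (once per document)
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    # Keep terms whose document count lies within [min_df * n, max_df * n]
    return sorted(term for term, count in df.items()
                  if min_df * n_docs <= count <= max_df * n_docs)


toy_docs = ["virus spreads fast",
            "virus vaccine trial",
            "virus lockdown rules",
            "vaccine rollout begins"]

print(document_frequency_filter(toy_docs, min_df=0.5, max_df=0.75))
print(document_frequency_filter(toy_docs, min_df=0.5, max_df=0.5))
```

In the second call, "virus" is dropped because it appears in three of the four documents, which exceeds the 50% cap.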

In [28]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(2029, 2789)
(1623, 2789)
(406, 2789)


## Linear Models

### Logistic Regression

In [29]:
# import Logistic Regression from sklearn
from sklearn.linear_model import LogisticRegression

# define model to be logistic regression
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# 'saga' is the algorithm to use in the optimization problem (finding the optimal coefficient values)
# Recent scikit-learn versions expect penalty=None (not the string 'none') for an unpenalized model
lr = LogisticRegression(penalty=None, random_state=0, solver='saga')

In [30]:
score_cv = cross_val_score(lr, X, y, cv=10)
score_cv.mean()


In [38]:

#predict value of target based on cross validation
pred_y = cross_val_predict(lr, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [39]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1272   92]
 [ 174  491]]
              precision    recall  f1-score   support

           0       0.88      0.93      0.91      1364
           1       0.84      0.74      0.79       665

    accuracy                           0.87      2029
   macro avg       0.86      0.84      0.85      2029
weighted avg       0.87      0.87      0.87      2029
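
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, and treats class 1 (fake) as the positive class. The helper name `binary_metrics` is my own.

```python
def binary_metrics(cm):
    # cm = [[TN, FP], [FN, TP]] with rows = true class, columns = predicted class
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy


# Confusion matrix printed above for the unpenalized logistic regression
cm_default = [[1272, 92], [174, 491]]
precision, recall, f1, acc = binary_metrics(cm_default)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(acc, 2))
```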


In [40]:
# train model using training dataset
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
lr.fit(X_train, y_train)

In [41]:
# show the intercept of the trained model (Theta_0)
lr.intercept_

array([-1.45053734])

In [42]:
# show the coefficients of independent attributes
# lr.coef_ has shape (1, n_features); .flatten() converts it to a 1-D array with one coefficient per feature
coeff_df = pd.DataFrame(lr.coef_.flatten(), vectorizer.get_feature_names_out(), columns=['Coefficient'])  
coeff_df

Unnamed: 0,Coefficient
000,-5.323127
05,-1.945742
10,-5.236894
100,3.331025
11,-3.232239
...,...
younger,-1.000704
youth,3.315562
youtube,0.462407
zero,5.306109


### Cost Benefit Analysis

In [43]:
# probabilities for each prediction
proba_y = cross_val_predict(lr, X, y, cv=10, method='predict_proba')

In [44]:
proba_y

array([[9.90672087e-01, 9.32791268e-03],
       [9.97789908e-01, 2.21009174e-03],
       [9.99504084e-01, 4.95916169e-04],
       ...,
       [9.99432189e-01, 5.67810700e-04],
       [9.59587203e-01, 4.04127975e-02],
       [1.64482713e-04, 9.99835517e-01]])

<div>
   <img src="img/Confusion_Matrix.png" width="700">
</div>

In [45]:
proba_y_1 = proba_y[:, 1]

In [46]:
# change the default threshold from 0.5 to 0.2
pred_y_1 = [0 if i < 0.2 else 1 for i in proba_y_1]

In [47]:
print(confusion_matrix(y, pred_y_1))
print(classification_report(y, pred_y_1))

[[1191  173]
 [ 112  553]]
              precision    recall  f1-score   support

           0       0.91      0.87      0.89      1364
           1       0.76      0.83      0.80       665

    accuracy                           0.86      2029
   macro avg       0.84      0.85      0.84      2029
weighted avg       0.86      0.86      0.86      2029
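
Whether the lower threshold is worth it depends on how costly each type of error is. The sketch below compares the two confusion matrices reported above under purely hypothetical costs: a missed fake article (false negative) is assumed to cost 5 units and a false alarm (false positive) 1 unit. Both numbers are illustrative assumptions, not values from the dataset.

```python
def total_cost(cm, cost_fp, cost_fn):
    # cm = [[TN, FP], [FN, TP]]; correct predictions are assumed to cost nothing
    (_, fp), (fn, _) = cm
    return fp * cost_fp + fn * cost_fn


cm_threshold_05 = [[1272, 92], [174, 491]]   # default threshold of 0.5
cm_threshold_02 = [[1191, 173], [112, 553]]  # threshold lowered to 0.2

# Hypothetical costs per error type
COST_FP, COST_FN = 1, 5

print(total_cost(cm_threshold_05, COST_FP, COST_FN))
print(total_cost(cm_threshold_02, COST_FP, COST_FN))
```

Under these assumed costs the 0.2 threshold is cheaper, because the drop in false negatives outweighs the extra false positives. With equal costs the ranking reverses.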


### Penalized Logistic Regression

<div>
   <img src="img/L1L2.png" width="700">
</div>

#### L1 Penalty

In [48]:
# penalty='l1' means L1 regularization (recall LASSO regression); the default is penalty='l2' (L2 regularization). C=1.0 is the inverse of regularization strength; it must be a positive float.
lr_L1 = LogisticRegression(penalty='l1', C=1.0, random_state=0, solver='saga')

In [49]:
score_cv = cross_val_score(lr_L1, X, y, cv=10)
score_cv.mean()

0.8279763936984832

In [50]:
#predict value of target based on cross validation
pred_y = cross_val_predict(lr_L1, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [51]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1301   63]
 [ 286  379]]
              precision    recall  f1-score   support

           0       0.82      0.95      0.88      1364
           1       0.86      0.57      0.68       665

    accuracy                           0.83      2029
   macro avg       0.84      0.76      0.78      2029
weighted avg       0.83      0.83      0.82      2029


In [52]:
# train model using training dataset
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
lr_L1.fit(X_train, y_train)

# show the coefficients of independent attributes
# lr_L1.coef_ has shape (1, n_features); .flatten() converts it to a 1-D array with one coefficient per feature
coeff_df['Coefficient_L1'] = lr_L1.coef_.flatten()
coeff_df

Unnamed: 0,Coefficient,Coefficient_L1
000,-5.323127,-0.671078
05,-1.945742,0.000000
10,-5.236894,0.000000
100,3.331025,0.000000
11,-3.232239,0.000000
...,...,...
younger,-1.000704,0.000000
youth,3.315562,0.000000
youtube,0.462407,0.000000
zero,5.306109,0.000000


#### L2 Penalty

In [53]:
# penalty='l2' means L2 regularization (recall ridge regression), which is the default. C=1.0 is the inverse of regularization strength; it must be a positive float.
lr_L2 = LogisticRegression(penalty='l2', C=1.0, random_state=0, solver='saga')

score_cv = cross_val_score(lr_L2, X, y, cv=10)
score_cv.mean()

0.843269277666683

In [54]:
#predict value of target based on cross validation
pred_y = cross_val_predict(lr_L2, X, y, cv=10)
pred_y

print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1311   53]
 [ 265  400]]
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1364
           1       0.88      0.60      0.72       665

    accuracy                           0.84      2029
   macro avg       0.86      0.78      0.80      2029
weighted avg       0.85      0.84      0.83      2029


In [55]:
# train model using training dataset
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
lr_L2.fit(X_train, y_train)

# show the coefficients of independent attributes
# lr_L2.coef_ has shape (1, n_features); .flatten() converts it to a 1-D array with one coefficient per feature
coeff_df['Coefficient_L2'] = lr_L2.coef_.flatten()
coeff_df

Unnamed: 0,Coefficient,Coefficient_L1,Coefficient_L2
000,-5.323127,-0.671078,-0.882948
05,-1.945742,0.000000,-0.131950
10,-5.236894,0.000000,-0.461151
100,3.331025,0.000000,0.187767
11,-3.232239,0.000000,-0.226032
...,...,...,...
younger,-1.000704,0.000000,-0.065336
youth,3.315562,0.000000,0.024792
youtube,0.462407,0.000000,0.139082
zero,5.306109,0.000000,0.395460
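
The table shows the typical pattern: the L1 penalty sets most coefficients exactly to zero, while the L2 penalty only shrinks them. A one-dimensional simplification makes this visible. For a single coefficient in a least-squares problem with an orthonormal design, the L1 solution is soft-thresholding and the L2 solution is proportional shrinkage. This is an illustration under those assumptions, not what the `saga` solver computes for logistic regression.

```python
def l1_shrink(w, lam):
    # Soft-thresholding: move w toward zero by lam, and stop at zero
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    # Proportional shrinkage: never exactly zero unless w is zero
    return w / (1.0 + lam)


for w in [3.0, 0.5, -0.2, -3.0]:
    print(w, l1_shrink(w, 1.0), l2_shrink(w, 1.0))
```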


## Dimension Reduction

### Principal Components Analysis

<div>
   <img src="img/PCA.png" width="900">
</div>

In [56]:
from sklearn.decomposition import PCA

pca_100 = PCA(n_components=100)
pca_100.fit(X)

In [57]:
print(pca_100.explained_variance_ratio_)
print(sum(pca_100.explained_variance_ratio_))

[0.02318243 0.0156609  0.01471188 0.01081118 0.01072263 0.00951713
 0.0090249  0.00845664 0.00794627 0.00768835 0.00702376 0.00675158
 0.00640983 0.0056671  0.00561131 0.00546869 0.00537474 0.00495723
 0.00484049 0.0045945  0.00444083 0.00435471 0.00430546 0.00422697
 0.00409899 0.0040267  0.00394021 0.0039076  0.0037929  0.00376998
 0.00366975 0.00358502 0.00345716 0.00342659 0.00332768 0.00329247
 0.00323686 0.00317153 0.00311429 0.00308895 0.0030058  0.00296875
 0.00293283 0.00288956 0.00284374 0.00282723 0.00275749 0.00269043
 0.00265216 0.00262501 0.0025905  0.0025369  0.00253041 0.0024983
 0.00248354 0.00245157 0.00242757 0.00239511 0.00237723 0.00235289
 0.0023407  0.00229023 0.00228098 0.00222812 0.00221858 0.00220445
 0.00217676 0.00217086 0.00215828 0.00213469 0.00210882 0.00208723
 0.00206406 0.00204568 0.00202753 0.00202187 0.00201185 0.00198141
 0.00197984 0.0019546  0.00193933 0.0019118  0.00189821 0.00189127
 0.00188629 0.00187496 0.00183082 0.00182613 0.00181128 0.00180

<div>
   <img src="img/PCR.png" width="900">
</div>

In [58]:
X_PCA_100 = pca_100.fit_transform(X)

In [59]:
score_cv = cross_val_score(lr, X_PCA_100, y, cv=10)
score_cv.mean()

0.8501755840608691

In [60]:
#predict value of target based on cross validation
pred_y = cross_val_predict(lr, X_PCA_100, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [61]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1241  123]
 [ 181  484]]
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      1364
           1       0.80      0.73      0.76       665

    accuracy                           0.85      2029
   macro avg       0.84      0.82      0.83      2029
weighted avg       0.85      0.85      0.85      2029


In [62]:
pca_50 = PCA(n_components=50)
pca_50.fit(X)

In [63]:
print(pca_50.explained_variance_ratio_)
print(sum(pca_50.explained_variance_ratio_))

[0.02318243 0.0156609  0.01471188 0.01081118 0.01072263 0.00951713
 0.0090249  0.00845664 0.00794627 0.00768835 0.00702376 0.00675158
 0.00640983 0.00566709 0.0056113  0.00546868 0.00537474 0.00495721
 0.00484047 0.00459432 0.00444073 0.00435459 0.00430536 0.00422649
 0.00409808 0.0040264  0.00393869 0.00390563 0.00379204 0.00376837
 0.00366382 0.0035793  0.00345534 0.00342236 0.00332419 0.00328634
 0.00322137 0.00316734 0.0031029  0.0030766  0.00298259 0.00293986
 0.00291884 0.00286185 0.00280201 0.00277176 0.00269067 0.00264068
 0.00261734 0.00254743]
0.2763502767928231


In [64]:
X_PCA_50 = pca_50.fit_transform(X)

score_cv = cross_val_score(lr, X_PCA_50, y, cv=10)
score_cv.mean()

0.8393235136321513

In [65]:
#predict value of target based on cross validation
pred_y = cross_val_predict(lr, X_PCA_50, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 1, 1])

In [66]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1232  132]
 [ 194  471]]
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      1364
           1       0.78      0.71      0.74       665

    accuracy                           0.84      2029
   macro avg       0.82      0.81      0.81      2029
weighted avg       0.84      0.84      0.84      2029


## Nonlinear Models

### Polynomial Regression

In [67]:
from sklearn.preprocessing import PolynomialFeatures

In [68]:
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_PCA_50)
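
A degree-2 expansion grows the feature space quickly, which is one reason to apply it to the 50 principal components rather than to the full TF-IDF matrix. Assuming the defaults of `PolynomialFeatures` (bias column and interaction terms included), the number of output columns for `n` inputs and degree `d` is the binomial coefficient C(n + d, d). The helper name below is my own.

```python
from math import comb


def n_polynomial_features(n_inputs, degree):
    # Number of monomials of total degree <= degree in n_inputs variables,
    # including the constant (bias) term
    return comb(n_inputs + degree, degree)


print(n_polynomial_features(2, 2))     # 1, a, b, a^2, ab, b^2
print(n_polynomial_features(50, 2))    # the 50 principal components used here
print(n_polynomial_features(2789, 2))  # the full TF-IDF vocabulary
```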

In [69]:
score_cv = cross_val_score(lr, X_poly, y, cv=10)
score_cv.mean()

0.8560869141101302

In [70]:
#predict value of target based on cross validation
pred_y = cross_val_predict(lr, X_poly, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 1, 1])

In [71]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1247  117]
 [ 175  490]]
              precision    recall  f1-score   support

           0       0.88      0.91      0.90      1364
           1       0.81      0.74      0.77       665

    accuracy                           0.86      2029
   macro avg       0.84      0.83      0.83      2029
weighted avg       0.85      0.86      0.85      2029


### Support Vector Machine

<div>
   <img src="img/SVM.png" width="600">
</div>

In [72]:
from sklearn import svm

In [73]:
model_svm = svm.SVC()

In [74]:
score_cv = cross_val_score(model_svm, X, y, cv=10)
score_cv.mean()

0.8629883431692923

In [75]:
#predict value of target based on cross validation
pred_y = cross_val_predict(model_svm, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [76]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1311   53]
 [ 225  440]]
              precision    recall  f1-score   support

           0       0.85      0.96      0.90      1364
           1       0.89      0.66      0.76       665

    accuracy                           0.86      2029
   macro avg       0.87      0.81      0.83      2029
weighted avg       0.87      0.86      0.86      2029


In [77]:
model_svm.fit(X, y)

# get support vectors
model_svm.support_vectors_

array([[0.01331069, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01002866, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02679912, 0.        , 0.        , ..., 0.0579478 , 0.        ,
        0.        ]])

In [78]:
# get indices of support vectors
model_svm.support_

array([   0,    1,    4, ..., 2025, 2026, 2027], dtype=int32)

In [79]:
# get number of support vectors for each class
model_svm.n_support_

array([975, 575], dtype=int32)

### Decision Tree

<div>
   <img src="img/decisiontree.png" width="800">
</div>

In [80]:
from sklearn.tree import DecisionTreeClassifier

In [81]:
# define our model with entropy as the split criterion and at most 100 leaf nodes
model_dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes = 100)
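
With `criterion='entropy'`, each split is chosen to maximize the reduction in entropy (the information gain). The sketch below computes both quantities from class counts on a toy example. The function names are hypothetical, and the code is a hand calculation rather than scikit-learn's implementation.

```python
from math import log2


def entropy(counts):
    # Shannon entropy (in bits) of a class-count vector
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    # Parent entropy minus the size-weighted entropy of the child nodes
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


# Toy node with 4 reliable and 4 fake articles, split perfectly by one term
print(entropy([4, 4]))
print(information_gain([4, 4], [[4, 0], [0, 4]]))
```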

In [82]:
score_cv = cross_val_score(model_dt, X, y, cv=10)
score_cv.mean()

0.7343486319075259

In [83]:
#predict value of target based on cross validation
pred_y = cross_val_predict(model_dt, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [84]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1073  291]
 [ 257  408]]
              precision    recall  f1-score   support

           0       0.81      0.79      0.80      1364
           1       0.58      0.61      0.60       665

    accuracy                           0.73      2029
   macro avg       0.70      0.70      0.70      2029
weighted avg       0.73      0.73      0.73      2029


### Neural Network

<div>
   <img src="img/NN.png" width="800">
</div>

In [85]:
from sklearn.neural_network import MLPClassifier

In [86]:
print(X.shape)

(2029, 2789)


In [87]:
model_MLP = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(1000, 500, 100,), random_state=1)
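
To get a feeling for the size of this network, the number of trainable parameters can be counted layer by layer: each fully connected layer has (inputs × outputs) weights plus one bias per output. The sketch assumes 2789 input features (the shape printed above) and a single output unit for the binary target. The helper name is hypothetical.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    # Sum weights and biases over all consecutive pairs of layers
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total


n_params = count_mlp_parameters(2789, (1000, 500, 100))
print(n_params)
```

With roughly 3.3 million parameters and only 1623 training documents, the network has far more parameters than observations, so the held-out test set is essential for judging it.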

In [88]:
# train model
model_MLP.fit(X_train, y_train)

# test model
pred_y = model_MLP.predict(X_test)
pred_y

array([0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,

In [89]:
# evaluate result 
print("Accuracy:", accuracy_score(y_test, pred_y, normalize=True, sample_weight=None))
print("Classification Report:", classification_report(y_test, pred_y))
print("Confusion Matrix:", confusion_matrix(y_test, pred_y))

Accuracy: 0.8694581280788177
Classification Report:               precision    recall  f1-score   support

           0       0.87      0.94      0.91       271
           1       0.86      0.73      0.79       135

    accuracy                           0.87       406
   macro avg       0.87      0.83      0.85       406
weighted avg       0.87      0.87      0.87       406

Confusion Matrix: [[255  16]
 [ 37  98]]


#### Modify the Hidden Layers

In [90]:
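# two wider hidden layers instead of three; note that the solver also changes from 'lbfgs' to 'adam'
model_MLP2 = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(2000, 500,), random_state=1)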

In [91]:
# train model
model_MLP2.fit(X_train, y_train)

# test model
pred_y = model_MLP2.predict(X_test)
pred_y

array([0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,

In [92]:
# evaluate result 
print("Accuracy:", accuracy_score(y_test, pred_y, normalize=True, sample_weight=None))
print("Classification Report:", classification_report(y_test, pred_y))
print("Confusion Matrix:", confusion_matrix(y_test, pred_y))

Accuracy: 0.8669950738916257
Classification Report:               precision    recall  f1-score   support

           0       0.88      0.93      0.90       271
           1       0.84      0.74      0.79       135

    accuracy                           0.87       406
   macro avg       0.86      0.84      0.85       406
weighted avg       0.87      0.87      0.86       406

Confusion Matrix: [[252  19]
 [ 35 100]]


<div>
   <img src="img/activationfunction.png" width="600">
</div>

`MLPClassifier` uses **ReLU** as its default hidden-layer activation function (`activation='relu'`).

See the scikit-learn documentation for the other options: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
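As a rough illustration of how the activation functions differ, the sketch below evaluates ReLU, the logistic sigmoid, and tanh on a few sample inputs with plain NumPy. The helper functions are written here for illustration only and are not scikit-learn's internal implementation.

```python
import numpy as np

def relu(z):
    # max(0, z): passes positive inputs through, zeroes out negative ones
    return np.maximum(0.0, z)

def logistic(z):
    # sigmoid: squashes any input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(logistic(z))
print(np.tanh(z))   # tanh squashes inputs into (-1, 1)
```

To try a different activation in the models above, pass `activation='logistic'`, `'tanh'`, or `'identity'` to `MLPClassifier`.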

## Ensemble Learning

<div>
   <img src="img/ensemble.png" width="800">
</div>

In [93]:
#https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
from sklearn.ensemble import VotingClassifier

In [94]:
# create a list of (name, estimator) tuples for the ensemble
estimators=[('lr', lr_L1), ('svm', model_svm), ('dt', model_dt)]

In [95]:
# create the voting classifier from the named estimators
model_ensemble = VotingClassifier(estimators, voting='hard')

# voting='hard': majority vote over the predicted class labels.
# voting='soft': argmax of the summed predicted probabilities, which is
# recommended for an ensemble of well-calibrated classifiers.
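The difference between the two voting modes is easiest to see on a single sample. The sketch below uses invented class probabilities for three hypothetical classifiers; it mimics the two rules with NumPy and does not call `VotingClassifier` itself.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (columns: class 0, class 1); the numbers are invented for illustration.
proba = np.array([[0.55, 0.45],
                  [0.55, 0.45],
                  [0.10, 0.90]])

# Hard voting: each classifier casts one vote for its most likely class
votes = proba.argmax(axis=1)
hard = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft = proba.mean(axis=0).argmax()

print(hard, soft)
```

Here two lukewarm votes for class 0 win under hard voting, while one confident vote for class 1 wins under soft voting. Soft voting requires every estimator to provide `predict_proba`; for an `SVC` that means it must be created with `probability=True`.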

In [96]:
score_cv = cross_val_score(model_ensemble, X, y, cv=10)
score_cv.mean()

0.8506657562307955

In [97]:
# out-of-fold predictions: each sample is predicted by a model fitted on the other folds
pred_y = cross_val_predict(model_ensemble, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [98]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1309   55]
 [ 252  413]]
              precision    recall  f1-score   support

           0       0.84      0.96      0.90      1364
           1       0.88      0.62      0.73       665

    accuracy                           0.85      2029
   macro avg       0.86      0.79      0.81      2029
weighted avg       0.85      0.85      0.84      2029


### **Bonus: Random Forest**

<div>
   <img src="img/randomforest.png" width="800">
</div>

In [99]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=50)

In [100]:
score_cv = cross_val_score(model_rf, X, y, cv=10)
score_cv.mean()

0.8299614690533093

In [101]:
# out-of-fold predictions: each sample is predicted by a model fitted on the other folds
pred_y = cross_val_predict(model_rf, X, y, cv=10)
pred_y

array([0, 0, 0, ..., 0, 0, 1])

In [102]:
print(confusion_matrix(y, pred_y))
print(classification_report(y, pred_y))

[[1318   46]
 [ 283  382]]
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      1364
           1       0.89      0.57      0.70       665

    accuracy                           0.84      2029
   macro avg       0.86      0.77      0.79      2029
weighted avg       0.85      0.84      0.83      2029
