# ADVANCED CLASSIFICATION PREDICT
#### By Dawie Loots

### Honour Code

I, Dawie Loots, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Predict overview</a>

<a href=#two>2. Importing packages</a>

<a href=#three>3. Loading the data</a>

<a href=#four>4. Data preprocessing</a>

<a href=#five>5. Exploratory Data Analysis</a>

<a href=#six>6. Modelling</a>

<a href=#seven>7. Model performance evaluation</a>

<a href=#eight>8. Model analysis and conclusion</a>

<a id="one"></a>
### 1. Predict overview

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

<a id="two"></a>
### 2. Importing packages

In [293]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import chardet # To provide a best estimate of the encoding that was used in the text data
import io # For string operations
%matplotlib inline

# Libraries for data preparation and model building
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import math
import re
from sklearn.utils import resample
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import feature_selection
from sklearn.feature_selection import f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Setting global constants to ensure notebook results are reproducible
PARAMETER_CONSTANT = 42  # This is the seed value for random number generation
# Vectorizer constants
MAX_DF = 0.5
MIN_DF = 2
NGRAM_RANGE = (1,1)
MAX_FEATURES = None

<a id="three"></a>
### 3. Loading the data

In [278]:
df_train = pd.read_csv('G:/My Drive/Professionele ontwikkeling/Data Science/Explore Data Science Course/Sprint 6_Advanced Classification/Predict/advanced-classification-predict/data/train.csv')
df_train.head()

Unnamed: 0,sentiment,message,tweetid
0,1,"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable",625221
1,1,It's not like we lack evidence of anthropogenic global warming,126103
2,2,RT @RawStory: Researchers say we have three years to act on climate change before it’s too late https://t.co/WdT0KdUr2f https://t.co/Z0ANPT…,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year in the war on climate change https://t.co/44wOTxTLcD,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #ElectionNight",466954


In [279]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


<a id="four"></a>
### 4. Data preprocessing

Check for missing values.

In [280]:
df_train.isna().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

There is no missing data, so let's proceed by checking for class imbalance.

In [281]:
class_count = df_train['sentiment'].value_counts()
class_count

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64

Most of the tweets belong to class 1 (supporting the belief in man-made climate change).
Dividing the 15,819 tweets evenly across the 4 classes gives 3,954 per class (using integer division). We will therefore need to upsample classes -1, 0 and 2, and downsample class 1.

In [282]:
class_min1 = df_train[df_train['sentiment']==-1]
class_0 = df_train[df_train['sentiment']==0]
class_1 = df_train[df_train['sentiment']==1]
class_2 = df_train[df_train['sentiment']==2]
balance = len(df_train) // 4 # The number of samples that will result in class balance
df_train_class1_resampled = resample(class_1,
                            replace=False, # sample without replacement (no need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_classmin1_resampled = resample(class_min1,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_class0_resampled = resample(class_0,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_class2_resampled = resample(class_2,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results

# Combine the resampled classes. The original row labels are kept, so duplicated
# (upsampled) rows share the same index label.
df_train = pd.concat([df_train_class1_resampled, df_train_classmin1_resampled, 
                                df_train_class0_resampled, df_train_class2_resampled])

# Check new class counts
df_train['sentiment'].value_counts()

 1    3954
-1    3954
 0    3954
 2    3954
Name: sentiment, dtype: int64
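
One caveat with resampling the full dataset: because upsampling duplicates rows, a later train/test split can place copies of the same tweet on both sides, which tends to inflate test scores. A minimal sketch of an alternative ordering, splitting first and balancing only the training portion, is shown below. The toy frame, the helper name `balance_classes`, and the 25% test size are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

SEED = 42  # assumed seed, mirroring PARAMETER_CONSTANT in this notebook


def balance_classes(df, label_col, n_samples, seed=SEED):
    """Resample every class to exactly n_samples rows (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(
            resample(
                group,
                replace=len(group) < n_samples,  # upsample small classes, downsample large ones
                n_samples=n_samples,
                random_state=seed,
            )
        )
    return pd.concat(parts).reset_index(drop=True)


# Toy data standing in for the tweets (made up for illustration)
toy = pd.DataFrame({
    "sentiment": [1] * 12 + [0] * 4 + [-1] * 4 + [2] * 4,
    "message": [f"tweet {i}" for i in range(24)],
})

# 1) Hold out a test set first, stratified so every class is represented
train_part, test_part = train_test_split(
    toy, test_size=0.25, stratify=toy["sentiment"], random_state=SEED
)

# 2) Balance only the training portion; the test set keeps its natural distribution
balanced_train = balance_classes(train_part, "sentiment", n_samples=len(train_part) // 4)

print(balanced_train["sentiment"].value_counts())
```

With this ordering, no row of `test_part` can appear in `balanced_train`, so duplicated tweets cannot leak across the split.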

Now that we have class balance, let's proceed with the following steps to convert text into numerical values, so that it can be used for this classification task:

- Removing noise (such as web-urls)
- Removing punctuation
- Tokenization
- Removal of stop words
- Lemmatization
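
The steps listed above can be sketched end to end on a single string. This is a simplified, standard-library-only illustration: the tiny stop-word set and the whitespace tokenizer are stand-ins (assumptions) for NLTK's stop-word list and `TweetTokenizer`, and lemmatization is left out because it needs the WordNet corpus.

```python
import re
import string

# Illustrative stand-in for nltk's English stop-word list (assumption)
TOY_STOP_WORDS = {"the", "is", "a", "of", "on", "to", "and"}

# Simplified URL pattern for the sketch; the notebook uses a more detailed one
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(message):
    """Apply noise removal, punctuation removal, tokenization and stop-word removal."""
    no_urls = URL_PATTERN.sub("", message)  # remove noise (hyperlinks)
    no_punct = "".join(ch for ch in no_urls if ch not in string.punctuation)
    tokens = no_punct.split()  # naive whitespace tokenization
    # Lowercase before comparing, so capitalised stop words such as "The" are dropped too
    return [t for t in tokens if t.lower() not in TOY_STOP_WORDS]


print(preprocess("The impact of climate change on trees is real! https://t.co/abc123"))
```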



In [283]:
# Remove noise (all hyperlinks)

def remove_noise(df):
    pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'   # Find all hyperlinks
    subs_url = r''
    df['message'] = df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
    return df

remove_noise(df_train)
df_train.head()

Unnamed: 0,sentiment,message,tweetid
11729,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844
8308,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956
7159,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938
5644,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737
6732,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767


In [284]:
# Handle emoticons
def process_emoticons(df):
    # Keys are regular expressions, so raw strings are used to avoid invalid escape sequences
    emoticon_dictionary = {r':\)': 'smiley_face_emoticon',
                           r':\(': 'frowning_face_emoticon',
                           ':D': 'grinning_face_emoticon',
                           ':P': 'sticking_out_tongue_emoticon',
                           r';\)': 'winking_face_emoticon',
                           ':o': 'surprised_face_emoticon',
                           r':\|': 'neutral_face_emoticon',
                           r":'\)": 'tears_of_joy_emoticon',
                           r":'\(": 'crying_face_emoticon'}

    df['message_encoded_emojis'] = df['message'].replace(emoticon_dictionary, regex=True)
    return df

process_emoticons(df_train)
# Check that the emoticons were replaced correctly
emoji_rows = df_train[df_train['message'].str.contains(r':\(')]
emoji_rows.head(10)

Unnamed: 0,sentiment,message,tweetid,message_encoded_emojis
2549,1,RT @Greenpeace: Sad :( Animals and birds which migrate around the world are struggling to adapt to climate change…,909808,RT @Greenpeace: Sad frowning_face_emoticon Animals and birds which migrate around the world are struggling to adapt to climate change…
4687,1,@timesofindia why do fishermen always fish much beyond limits to catch fish? Must be global warming effect that has less fish :(,855852,@timesofindia why do fishermen always fish much beyond limits to catch fish? Must be global warming effect that has less fish frowning_face_emoticon
11386,1,"RT @savitriyaca: a must watch, cause climate change is here! :((( Before the Flood - Full Movie | National Geographic Ã¢â‚¬Â¦",891447,"RT @savitriyaca: a must watch, cause climate change is here! frowning_face_emoticon(( Before the Flood - Full Movie | National Geographic Ã¢â‚¬Â¦"
11203,1,@trees_r_cool animal agriculture is the main contributor to climate change :( please watch cowspiracy you'll see the truth!!,164712,@trees_r_cool animal agriculture is the main contributor to climate change frowning_face_emoticon please watch cowspiracy you'll see the truth!!
15225,1,"@KamalaHarris Start with a Pareto of biggest contributors to climate change:(1) China, (2) Hollywood Trains, Plains, Automobiles, and Yachts",307036,"@KamalaHarris Start with a Pareto of biggest contributors to climate changefrowning_face_emoticon1) China, (2) Hollywood Trains, Plains, Automobiles, and Yachts"
1038,1,Whenever it randomly snows like this I get v worried about global warming and the poor polar bears :((,166148,Whenever it randomly snows like this I get v worried about global warming and the poor polar bears frowning_face_emoticon(
5200,0,EPA head Scott Pruitt denies that carbon dioxide causes global warming \nChok uden overraskelse: Faktaresistens :(,152968,EPA head Scott Pruitt denies that carbon dioxide causes global warming \nChok uden overraskelse: Faktaresistens frowning_face_emoticon
5200,0,EPA head Scott Pruitt denies that carbon dioxide causes global warming \nChok uden overraskelse: Faktaresistens :(,152968,EPA head Scott Pruitt denies that carbon dioxide causes global warming \nChok uden overraskelse: Faktaresistens frowning_face_emoticon
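
The sample above shows two side effects of plain substring matching: `:(((` leaves stray parentheses behind, and `change:(1)` is treated as an emoticon although it is just a colon before a numbered list. One possible refinement, sketched below under the assumption that emoticons are separated from words by whitespace, is to require a whitespace boundary on both sides and allow the mouth character to repeat. The pattern and the function name `replace_frowns` are illustrative, not part of the original notebook.

```python
import re

# A frown counts only when it is preceded by whitespace (or the start of the text)
# and followed by whitespace (or the end); "\(+" absorbs repeated mouths such as ":(((".
FROWN_PATTERN = re.compile(r"(?<!\S):\(+(?!\S)")


def replace_frowns(text):
    """Replace stand-alone frowning emoticons with a word token."""
    return FROWN_PATTERN.sub("frowning_face_emoticon", text)


print(replace_frowns("climate change is here! :((( Before the Flood"))
print(replace_frowns("contributors to climate change:(1) China, (2) Hollywood"))
```

The trade-off is that an emoticon glued directly to a word, such as `sad:(`, would no longer be picked up.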


In [285]:
# Remove punctuation and expand all contracted words
def remove_punctuation(message):
    contractions = {"'t": " not","'s": " is","'re": " are","'ll": " will", "'m": " am"}
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in contractions.keys()) + r")\b")
    message = re.sub(r"n't\b", " not", message) # Replace "n't" with " not"
    message = pattern.sub(lambda match: contractions[match.group(0)], message) # Replace all other contractions except for "n't"
    return ''.join([l for l in message if l not in string.punctuation])

df_train['message_clean'] = df_train['message_encoded_emojis'].apply(remove_punctuation)
df_train.head()

Unnamed: 0,sentiment,message,tweetid,message_encoded_emojis,message_clean
11729,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,RT ubcforestry Funding from GenomeBC will support SallyNAitken is team as they address the impact of climate change on trees…
8308,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,YadiMoIina gag orders Sure He is definitely green and does not think climate change was a hoax made by CHINA
7159,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),RT pattonoswalt Not ominous at all He also wants the names of anyone working on climate change research
5644,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,RT MelissaJPeltier In case you forgot about that Chinese Hoax global warming climatechange
6732,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,RT SethMacFarlane HRC proposes installing half a billion solar panels by the end of her first term Trump thinks climate change is a hoaxÃ¢â‚¬Â¦


In [286]:
# Tokenization
def tokenize(df):
    tokenizer = TweetTokenizer()
    df['tokens'] = df['message_clean'].apply(tokenizer.tokenize)
    return df

tokenize(df_train)

Unnamed: 0,sentiment,message,tweetid,message_encoded_emojis,message_clean,tokens
11729,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,RT ubcforestry Funding from GenomeBC will support SallyNAitken is team as they address the impact of climate change on trees…,"[RT, ubcforestry, Funding, from, GenomeBC, will, support, SallyNAitken, is, team, as, they, address, the, impact, of, climate, change, on, trees, …]"
8308,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,YadiMoIina gag orders Sure He is definitely green and does not think climate change was a hoax made by CHINA,"[YadiMoIina, gag, orders, Sure, He, is, definitely, green, and, does, not, think, climate, change, was, a, hoax, made, by, CHINA]"
7159,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),RT pattonoswalt Not ominous at all He also wants the names of anyone working on climate change research,"[RT, pattonoswalt, Not, ominous, at, all, He, also, wants, the, names, of, anyone, working, on, climate, change, research]"
5644,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,RT MelissaJPeltier In case you forgot about that Chinese Hoax global warming climatechange,"[RT, MelissaJPeltier, In, case, you, forgot, about, that, Chinese, Hoax, global, warming, climatechange]"
6732,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,RT SethMacFarlane HRC proposes installing half a billion solar panels by the end of her first term Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,"[RT, SethMacFarlane, HRC, proposes, installing, half, a, billion, solar, panels, by, the, end, of, her, first, term, Trump, thinks, climate, change, is, a, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]"
...,...,...,...,...,...,...
12292,2,Video: Statoil produces climate change 'roadmap' - News for the Oil and Gas Sector,633554,Video: Statoil produces climate change 'roadmap' - News for the Oil and Gas Sector,Video Statoil produces climate change roadmap News for the Oil and Gas Sector,"[Video, Statoil, produces, climate, change, roadmap, News, for, the, Oil, and, Gas, Sector]"
15209,2,"RT @Reuters: In rare move, China criticizes Trump plan to exit climate change pact",724243,"RT @Reuters: In rare move, China criticizes Trump plan to exit climate change pact",RT Reuters In rare move China criticizes Trump plan to exit climate change pact,"[RT, Reuters, In, rare, move, China, criticizes, Trump, plan, to, exit, climate, change, pact]"
7757,2,RT @DonaldMacDona18: Global climate change battles being won in court @IIGCCNews #climatechange #carbonbubble #COP2…,878987,RT @DonaldMacDona18: Global climate change battles being won in court @IIGCCNews #climatechange #carbonbubble #COP2…,RT DonaldMacDona18 Global climate change battles being won in court IIGCCNews climatechange carbonbubble COP2…,"[RT, DonaldMacDona, 18, Global, climate, change, battles, being, won, in, court, IIGCCNews, climatechange, carbonbubble, COP, 2, …]"
5707,2,"RT @BBCBreaking: UK government signs Paris Agreement, world's first comprehensive treaty on tackling climate change Ã¢â‚¬Â¦",151024,"RT @BBCBreaking: UK government signs Paris Agreement, world's first comprehensive treaty on tackling climate change Ã¢â‚¬Â¦",RT BBCBreaking UK government signs Paris Agreement world is first comprehensive treaty on tackling climate change Ã¢â‚¬Â¦,"[RT, BBCBreaking, UK, government, signs, Paris, Agreement, world, is, first, comprehensive, treaty, on, tackling, climate, change, Ã, ¢, â, ‚, ¬, Â, ¦]"


In [287]:
# Remove stopwords
def remove_stop_words(df):
    stop_words = set(stopwords.words('english'))
    # Drop stop words from each token list (the match is case-sensitive, so capitalised tokens such as 'He' are kept)
    df['tokens_without_stopwords'] = df['tokens'].apply(lambda tokens: [t for t in tokens if t not in stop_words])
    return df

remove_stop_words(df_train)
df_train.head()

Unnamed: 0,sentiment,message,tweetid,message_encoded_emojis,message_clean,tokens,tokens_without_stopwords
11729,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,RT ubcforestry Funding from GenomeBC will support SallyNAitken is team as they address the impact of climate change on trees…,"[RT, ubcforestry, Funding, from, GenomeBC, will, support, SallyNAitken, is, team, as, they, address, the, impact, of, climate, change, on, trees, …]","[RT, ubcforestry, Funding, GenomeBC, support, SallyNAitken, team, address, impact, climate, change, trees, …]"
8308,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,YadiMoIina gag orders Sure He is definitely green and does not think climate change was a hoax made by CHINA,"[YadiMoIina, gag, orders, Sure, He, is, definitely, green, and, does, not, think, climate, change, was, a, hoax, made, by, CHINA]","[YadiMoIina, gag, orders, Sure, He, definitely, green, think, climate, change, hoax, made, CHINA]"
7159,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),RT pattonoswalt Not ominous at all He also wants the names of anyone working on climate change research,"[RT, pattonoswalt, Not, ominous, at, all, He, also, wants, the, names, of, anyone, working, on, climate, change, research]","[RT, pattonoswalt, Not, ominous, He, also, wants, names, anyone, working, climate, change, research]"
5644,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,RT MelissaJPeltier In case you forgot about that Chinese Hoax global warming climatechange,"[RT, MelissaJPeltier, In, case, you, forgot, about, that, Chinese, Hoax, global, warming, climatechange]","[RT, MelissaJPeltier, In, case, forgot, Chinese, Hoax, global, warming, climatechange]"
6732,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,RT SethMacFarlane HRC proposes installing half a billion solar panels by the end of her first term Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,"[RT, SethMacFarlane, HRC, proposes, installing, half, a, billion, solar, panels, by, the, end, of, her, first, term, Trump, thinks, climate, change, is, a, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[RT, SethMacFarlane, HRC, proposes, installing, half, billion, solar, panels, end, first, term, Trump, thinks, climate, change, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]"


In [288]:
# Lemmatization
def lemmatize(words, lemmatizer):
    # Reuse the lemmatizer that is passed in instead of creating a new one for every row
    return [lemmatizer.lemmatize(word) for word in words]

lemmatizer = WordNetLemmatizer()
df_train['lemma'] = df_train['tokens_without_stopwords'].apply(lemmatize, args=(lemmatizer, ))
df_train.head()

Unnamed: 0,sentiment,message,tweetid,message_encoded_emojis,message_clean,tokens,tokens_without_stopwords,lemma
11729,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,RT ubcforestry Funding from GenomeBC will support SallyNAitken is team as they address the impact of climate change on trees…,"[RT, ubcforestry, Funding, from, GenomeBC, will, support, SallyNAitken, is, team, as, they, address, the, impact, of, climate, change, on, trees, …]","[RT, ubcforestry, Funding, GenomeBC, support, SallyNAitken, team, address, impact, climate, change, trees, …]","[RT, ubcforestry, Funding, GenomeBC, support, SallyNAitken, team, address, impact, climate, change, tree, …]"
8308,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,YadiMoIina gag orders Sure He is definitely green and does not think climate change was a hoax made by CHINA,"[YadiMoIina, gag, orders, Sure, He, is, definitely, green, and, does, not, think, climate, change, was, a, hoax, made, by, CHINA]","[YadiMoIina, gag, orders, Sure, He, definitely, green, think, climate, change, hoax, made, CHINA]","[YadiMoIina, gag, order, Sure, He, definitely, green, think, climate, change, hoax, made, CHINA]"
7159,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),RT pattonoswalt Not ominous at all He also wants the names of anyone working on climate change research,"[RT, pattonoswalt, Not, ominous, at, all, He, also, wants, the, names, of, anyone, working, on, climate, change, research]","[RT, pattonoswalt, Not, ominous, He, also, wants, names, anyone, working, climate, change, research]","[RT, pattonoswalt, Not, ominous, He, also, want, name, anyone, working, climate, change, research]"
5644,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,RT MelissaJPeltier In case you forgot about that Chinese Hoax global warming climatechange,"[RT, MelissaJPeltier, In, case, you, forgot, about, that, Chinese, Hoax, global, warming, climatechange]","[RT, MelissaJPeltier, In, case, forgot, Chinese, Hoax, global, warming, climatechange]","[RT, MelissaJPeltier, In, case, forgot, Chinese, Hoax, global, warming, climatechange]"
6732,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,RT SethMacFarlane HRC proposes installing half a billion solar panels by the end of her first term Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,"[RT, SethMacFarlane, HRC, proposes, installing, half, a, billion, solar, panels, by, the, end, of, her, first, term, Trump, thinks, climate, change, is, a, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[RT, SethMacFarlane, HRC, proposes, installing, half, billion, solar, panels, end, first, term, Trump, thinks, climate, change, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[RT, SethMacFarlane, HRC, proposes, installing, half, billion, solar, panel, end, first, term, Trump, think, climate, change, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]"


<a id="five"></a>
### 5. Exploratory Data Analysis

In [294]:
# Convert into Bag Of Words using CountVectorizer
def vectorize(df, max_df, min_df, ngram_range, max_features):
    # Join each row's list of lemmas into a single space-separated string
    df['flattened_lemma'] = df['lemma'].apply(lambda word_list: ' '.join(word_list))
    # Create and fit the CountVectorizer
    vect = CountVectorizer(lowercase=True, max_df=max_df, min_df=min_df, ngram_range=ngram_range,
                           max_features=max_features)
    vect.fit(df_train['flattened_lemma'])  # Note that the vectorizer is always fit on the train data, so that both
                                           # train and test sets are vectorized on the same vocabulary
    X = vect.transform(df['flattened_lemma'])
    bag_of_words = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
    # Merge original dataset with Bag Of Words
    bag_of_words.reset_index(drop=True, inplace=True)
    df.reset_index(drop=True, inplace=True)
    vectorized_df = pd.concat([bag_of_words, df],axis=1)
    return vectorized_df
    
df_train = vectorize(df_train, MAX_DF, MIN_DF, NGRAM_RANGE, MAX_FEATURES)
df_train.head()

Unnamed: 0,000111,004,00kevin7,01,010536,012015,02,020,04,07,...,sentiment,message,tweetid,message_encoded_emojis,message_clean,tokens,tokens_without_stopwords,lemma,differences,flattened_lemma
0,0,0,0,0,0,0,0,0,0,0,...,1,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,977844,RT @ubcforestry: Funding from @GenomeBC will support @SallyNAitken's team as they address the impact of climate change on trees.…,RT ubcforestry Funding from GenomeBC will support SallyNAitken is team as they address the impact of climate change on trees…,"[RT, ubcforestry, Funding, from, GenomeBC, will, support, SallyNAitken, is, team, as, they, address, the, impact, of, climate, change, on, trees, …]","[RT, ubcforestry, Funding, GenomeBC, support, SallyNAitken, team, address, impact, climate, change, trees, …]","[RT, ubcforestry, Funding, GenomeBC, support, SallyNAitken, team, address, impact, climate, change, tree, …]",[tree],RT ubcforestry Funding GenomeBC support SallyNAitken team address impact climate change tree …
1,0,0,0,0,0,0,0,0,0,0,...,1,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,441956,@YadiMoIina gag orders? Sure. He's definitely green and doesn't think climate change was a hoax made by CHINA.,YadiMoIina gag orders Sure He is definitely green and does not think climate change was a hoax made by CHINA,"[YadiMoIina, gag, orders, Sure, He, is, definitely, green, and, does, not, think, climate, change, was, a, hoax, made, by, CHINA]","[YadiMoIina, gag, orders, Sure, He, definitely, green, think, climate, change, hoax, made, CHINA]","[YadiMoIina, gag, order, Sure, He, definitely, green, think, climate, change, hoax, made, CHINA]",[order],YadiMoIina gag order Sure He definitely green think climate change hoax made CHINA
2,0,0,0,0,0,0,0,0,0,0,...,1,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),978938,RT @pattonoswalt: Not ominous at all! (He also wants the names of anyone working on climate change research),RT pattonoswalt Not ominous at all He also wants the names of anyone working on climate change research,"[RT, pattonoswalt, Not, ominous, at, all, He, also, wants, the, names, of, anyone, working, on, climate, change, research]","[RT, pattonoswalt, Not, ominous, He, also, wants, names, anyone, working, climate, change, research]","[RT, pattonoswalt, Not, ominous, He, also, want, name, anyone, working, climate, change, research]","[want, name]",RT pattonoswalt Not ominous He also want name anyone working climate change research
3,0,0,0,0,0,0,0,0,0,0,...,1,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,587737,RT @MelissaJPeltier: In case you forgot about that 'Chinese Hoax' global warming: #climatechange,RT MelissaJPeltier In case you forgot about that Chinese Hoax global warming climatechange,"[RT, MelissaJPeltier, In, case, you, forgot, about, that, Chinese, Hoax, global, warming, climatechange]","[RT, MelissaJPeltier, In, case, forgot, Chinese, Hoax, global, warming, climatechange]","[RT, MelissaJPeltier, In, case, forgot, Chinese, Hoax, global, warming, climatechange]",[],RT MelissaJPeltier In case forgot Chinese Hoax global warming climatechange
4,0,0,0,0,0,0,0,0,0,0,...,1,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,804767,RT @SethMacFarlane: HRC proposes installing half a billion solar panels by the end of her first term. Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,RT SethMacFarlane HRC proposes installing half a billion solar panels by the end of her first term Trump thinks climate change is a hoaxÃ¢â‚¬Â¦,"[RT, SethMacFarlane, HRC, proposes, installing, half, a, billion, solar, panels, by, the, end, of, her, first, term, Trump, thinks, climate, change, is, a, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[RT, SethMacFarlane, HRC, proposes, installing, half, billion, solar, panels, end, first, term, Trump, thinks, climate, change, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[RT, SethMacFarlane, HRC, proposes, installing, half, billion, solar, panel, end, first, term, Trump, think, climate, change, hoaxÃ, ¢, â, ‚, ¬, Â, ¦]","[panel, think]",RT SethMacFarlane HRC proposes installing half billion solar panel end first term Trump think climate change hoaxÃ ¢ â ‚ ¬ Â ¦


In [None]:
# Reduce features



<a id="six"></a>
### 6. Modelling

In [296]:
# Split into training and test data
def split_train_test(df):
    X = df.copy()
    y = X.sentiment
    # Keep only the numeric bag-of-words columns as features
    columns_to_drop = X.select_dtypes(include=['object']).columns
    columns_to_drop = list(columns_to_drop) + ['tweetid', 'sentiment']
    X.drop(columns=columns_to_drop,inplace=True)
    # Reduce features
    # Set up selector, choosing score function and number of features to retain
    selector_kbest = feature_selection.SelectKBest(score_func=f_classif, k=200)
    # Fit the selector on the full feature matrix and keep the 200 best-scoring features
    X_kbest = selector_kbest.fit_transform(X, y)
    # Use the global seed so that the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(X_kbest, y.values, random_state=PARAMETER_CONSTANT)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_train_test(df_train)

MemoryError: Unable to allocate 322. MiB for an array with shape (3954, 10672) and data type float64
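
The size in the error message can be reproduced by hand: a dense `float64` array needs 8 bytes per element. The short calculation below uses the shape reported in the traceback, whose row count matches a single balanced class, and, for comparison, all four classes of 3,954 rows with the same number of columns. The function name `dense_size_mib` is illustrative.

```python
BYTES_PER_FLOAT64 = 8
MIB = 1024 ** 2


def dense_size_mib(n_rows, n_cols, bytes_per_element=BYTES_PER_FLOAT64):
    """Memory needed for a dense array of the given shape, in MiB."""
    return n_rows * n_cols * bytes_per_element / MIB


# Shape taken from the MemoryError message
print(round(dense_size_mib(3954, 10672), 1))      # 321.9

# All four balanced classes together
print(round(dense_size_mib(4 * 3954, 10672), 1))  # 1287.8
```

A bag-of-words matrix is mostly zeros, so keeping the sparse matrix returned by `vect.transform` instead of calling `.toarray()` is a common way to reduce memory use; scikit-learn's `SelectKBest` with `f_classif` and most of its linear models accept sparse input.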

In [None]:
# Define all the classifiers
names = [
         'Logistic Regression', 
         'Nearest Neighbors',
         #'Linear SVM',
         'RBF SVM',
         'Decision Tree',
         'Random Forest',
         'AdaBoost'
         ]

classifiers = [
               LogisticRegression(max_iter=1000),
               KNeighborsClassifier(3),
               #SVC(kernel="linear", C=0.025),
               SVC(gamma=1, C=1),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
               AdaBoostClassifier()
              ]


In [None]:
# Train all the models

results = []
models = {}
confusion = {}
class_report = {}

for name, clf in zip(names, classifiers):
    print ('Fitting {:s} model...'.format(name))
    run_time = %timeit -q -o clf.fit(X_train, y_train)

    print ('... predicting')
    y_pred = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)

    print ('... scoring')
    accuracy  = metrics.accuracy_score(y_train, y_pred)
    precision = metrics.precision_score(y_train, y_pred, average='weighted')
    recall    = metrics.recall_score(y_train, y_pred, average='weighted')

    f1        = metrics.f1_score(y_train, y_pred, average='weighted')
    f1_test   = metrics.f1_score(y_test, y_pred_test, average='weighted')

    # Save the results to dictionaries
    models[name] = clf
    confusion[name] = metrics.confusion_matrix(y_train, y_pred)
    class_report[name] = metrics.classification_report(y_train, y_pred)

    results.append([name, accuracy, precision, recall, f1, f1_test, run_time.best])


results = pd.DataFrame(results, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1 Train', 'F1 Test', 'Train Time'])
results.set_index('Classifier', inplace= True)

print ('... All done!')

In [None]:
results.sort_values('F1 Train', ascending=False)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
results.sort_values('F1 Train', ascending=False, inplace=True)
results.plot(y=['F1 Test'], kind='bar', ax=ax[0], ylim=[0.05,0.99])
results.plot(y='Train Time', kind='bar', ax=ax[1])

<a id="seven"></a>
### 7. Model performance evaluation

<a id="eight"></a>
### 8. Model analysis and conclusion

## Process test data