<a href="https://colab.research.google.com/github/dawieloots/explore-integrated-project/blob/main/advanced_classification_predict_COLAB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ADVANCED CLASSIFICATION PREDICT
#### By Dawie Loots

### Honour Code

I, Dawie Loots, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Predict overview</a>

<a href=#two>2. Importing packages</a>

<a href=#three>3. Loading the data</a>

<a href=#four>4. Data preprocessing</a>

<a href=#five>5. Exploratory Data Analysis</a>

<a href=#six>6. Modelling</a>

<a href=#seven>7. Model performance evaluation</a>

<a href=#eight>8. Model analysis and conclusion</a>

<a id="one"></a>
### 1. Predict overview

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

<a id="two"></a>
### 2. Importing packages

In [2]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import chardet # To provide a best estimate of the encoding that was used in the text data
import io # For string operations
%matplotlib inline

# Libraries for data preparation and model building
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
import math
import re
from sklearn.utils import resample
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Setting global constants to ensure notebook results are reproducible
PARAMETER_CONSTANT = 42  # This is the seed value for random number generation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


<a id="three"></a>
### 3. Loading the data

In [3]:
df_train = pd.read_csv('train.csv')
df_train.head()

FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'

In [None]:
df_train.info()

<a id="four"></a>
### 4. Data preprocessing

Check for missing values.

In [None]:
df_train.isna().sum()

There is no missing data, so let's proceed by checking for class imbalance.

In [None]:
class_count = df_train['sentiment'].value_counts()
class_count

Most of the tweets belong to class 1 (supporting the belief in man-made climate change).
Dividing the total of 15,819 tweets by 4 gives roughly 3,955 per class. We will need to upsample classes 0, -1 and 2, and downsample class 1.

In [None]:
class_min1 = df_train[df_train['sentiment']==-1]
class_0 = df_train[df_train['sentiment']==0]
class_1 = df_train[df_train['sentiment']==1]
class_2 = df_train[df_train['sentiment']==2]
balance = len(df_train) // 4 # The number of samples that will result in class balance
df_train_class1_resampled = resample(class_1,
                            replace=False, # sample without replacement (no need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_classmin1_resampled = resample(class_min1,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_class0_resampled = resample(class_0,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results
df_train_class2_resampled = resample(class_2,
                            replace=True, # sample with replacement (we need to duplicate observations)
                            n_samples=balance, # make all classes equal
                            random_state=27) # reproducible results

# Combine the resampled classes. Upsampling with replacement duplicates the original
# index labels, so ignore_index=True is used to build a fresh, unique integer index.
df_train = pd.concat([df_train_class1_resampled, df_train_classmin1_resampled,
                      df_train_class0_resampled, df_train_class2_resampled],
                     ignore_index=True)

# Check new class counts
df_train['sentiment'].value_counts()
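
The resampling cell above repeats the same `resample` call once per class. As a minimal sketch of the same idea, the loop below balances a small made-up DataFrame (the `toy` data and the names `target`, `parts` and `balanced` are illustrative assumptions, not part of the original notebook). A class is sampled with replacement only when it has fewer rows than the target.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for df_train (hypothetical data, not the real tweets)
toy = pd.DataFrame({
    'sentiment': [1] * 6 + [0] * 3 + [-1] * 2 + [2] * 1,
    'message': [f'tweet {i}' for i in range(12)],
})

# Number of rows per class that results in class balance (12 // 4 = 3)
target = len(toy) // toy['sentiment'].nunique()

parts = []
for label, group in toy.groupby('sentiment'):
    parts.append(resample(group,
                          replace=len(group) < target,  # duplicate rows only when upsampling
                          n_samples=target,
                          random_state=27))

# ignore_index=True avoids the duplicated index labels that upsampling creates
balanced = pd.concat(parts, ignore_index=True)
print(balanced['sentiment'].value_counts())
```

Every class should end up with `target` rows, and the index of `balanced` stays unique.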

Now that we have class balance, let's proceed with the following steps to convert text into numerical values, so that it can be used for this classification task:

- Removing noise (such as URLs)
- Removing punctuation
- Tokenization
- Removal of stop words
- Lemmatization
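
To make the steps above concrete, here is a minimal sketch that runs the first four of them on a single made-up tweet using only the standard library. The simplified URL pattern, the tiny `STOP_WORDS` set, the whitespace tokenizer and the case-insensitive stop word check are all assumptions made for illustration; the notebook itself uses a longer URL regex, NLTK's `TweetTokenizer` and NLTK's full English stop word list. Lemmatization is left out because it needs the WordNet corpus.

```python
import re
import string

# Hypothetical mini stop word list; the notebook uses NLTK's full English list
STOP_WORDS = {'the', 'is', 'a', 'and', 'to', 'of'}

def preprocess(message):
    # 1. Remove noise (a simplified URL pattern, assumed for this sketch)
    message = re.sub(r'https?://\S+', '', message)
    # 2. Remove punctuation
    message = ''.join(ch for ch in message if ch not in string.punctuation)
    # 3. Tokenize (plain whitespace split instead of TweetTokenizer)
    tokens = message.split()
    # 4. Remove stop words (compared in lowercase)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(preprocess('Climate change is real, and the data proves it! https://t.co/abc'))
# ['Climate', 'change', 'real', 'data', 'proves', 'it']
```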



In [None]:
# Remove noise (all hyperlinks)

pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'   # Find all hyperlinks
subs_url = r''
df_train['message'] = df_train['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
df_train.head()

In [None]:
# Handle emoticons

# The keys are regular expressions, so they are written as raw strings
emoticon_dictionary = {r':\)': 'smiley_face_emoticon',
                       r':\(': 'frowning_face_emoticon',
                       r':D': 'grinning_face_emoticon',
                       r':P': 'sticking_out_tongue_emoticon',
                       r';\)': 'winking_face_emoticon',
                       r':o': 'surprised_face_emoticon',
                       r':\|': 'neutral_face_emoticon',
                       r":'\)": 'tears_of_joy_emoticon',
                       r":'\(": 'crying_face_emoticon'}

df_train['message_encoded_emojis'] = df_train['message'].replace(emoticon_dictionary, regex=True)
# Check that the emoticons were encoded correctly
emoji_rows = df_train[df_train['message'].str.contains(r':\(')]
emoji_rows.head(10)


In [None]:
# Remove punctuation and expand all contracted words
def remove_punctuation(message):
    contractions = {"'t": " not","'s": " is","'re": " are","'ll": " will", "'m": " am"}
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in contractions.keys()) + r")\b")
    message = re.sub(r"n't\b", " not", message) # Replace "n't" with " not"
    message = pattern.sub(lambda match: contractions[match.group(0)], message) # Replace all other contractions except for "n't"
    return ''.join([l for l in message if l not in string.punctuation])

df_train['message_clean'] = df_train['message_encoded_emojis'].apply(remove_punctuation)
df_train.head()

In [None]:
# Tokenization
tokenizer = TweetTokenizer()
df_train['tokens'] = df_train['message_clean'].apply(tokenizer.tokenize)
df_train.head()

In [None]:
# Remove stopwords
stop_words = set(stopwords.words('english'))  # Load the list once instead of once per token

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words]

df_train['tokens_without_stopwords'] = df_train['tokens'].apply(remove_stop_words)
df_train.head()

In [None]:
# Lemmatization

def mbti_lemma(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words]

lemmatizer = WordNetLemmatizer()
df_train['lemma'] = df_train['tokens_without_stopwords'].apply(mbti_lemma, args=(lemmatizer, ))
df_train.head()

<a id="five"></a>
### 5. Exploratory Data Analysis

In [None]:
# Convert into Bag Of Words

# Flatten the list of lists into a single list of strings
df_train['flattened_lemma'] = df_train['lemma'].apply(lambda word_list: ' '.join(word_list))
df_train.head()

# Create and fit the CountVectorizer
vect1 = CountVectorizer(lowercase=True,max_df=0.5, min_df=2,ngram_range=(1,1), max_features=200)
vect2 = CountVectorizer(lowercase=True,max_df=0.5, min_df=2,ngram_range=(1,2), max_features=200)
vect3 = CountVectorizer(lowercase=True,max_df=0.5, min_df=2,ngram_range=(1,3), max_features=200)
vect1.fit(df_train['flattened_lemma'])
vect2.fit(df_train['flattened_lemma'])
vect3.fit(df_train['flattened_lemma'])

In [None]:
X1 = vect1.transform(df_train['flattened_lemma'])
bag_of_words1 = pd.DataFrame(X1.toarray(), columns=vect1.get_feature_names_out())
X2 = vect2.transform(df_train['flattened_lemma'])
bag_of_words2 = pd.DataFrame(X2.toarray(), columns=vect2.get_feature_names_out())
X3 = vect3.transform(df_train['flattened_lemma'])
bag_of_words3 = pd.DataFrame(X3.toarray(), columns=vect3.get_feature_names_out())


In [None]:
df_train.index.is_unique
#bag_of_words2.index.is_unique

In [None]:
pd.set_option('display.max_colwidth', None)
# Give both frames the same 0..n-1 index so the column-wise concat pairs rows correctly
abridged_train_df = df_train[['message', 'sentiment']].reset_index(drop=True)
bag_of_words2 = bag_of_words2.reset_index(drop=True)
word_count2 = pd.concat([bag_of_words2, abridged_train_df], axis=1)
# Drop the text column before summing so that only the word counts are aggregated
grouped2 = word_count2.drop(columns='message').groupby('sentiment').sum()
top_n = 10
top_words_per_class2 = {}
for class_name, row in grouped2.iterrows():
    top_words2 = row.sort_values(ascending=False)[:top_n]
    top_words_per_class2[class_name] = top_words2

# Create bar plots for the top words per class
for class_name, top_words in top_words_per_class2.items():
    plt.figure(figsize=(10, 6))
    top_words.plot(kind='bar', color='skyblue')
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.title(f'Top {top_n} Words in {class_name}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()


<a id="six"></a>
### 6. Modelling

In [None]:
# Split into training and test data
X = word_count2.copy()
X.drop(columns=['sentiment','message'],inplace=True)
y = word_count2.sentiment
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=PARAMETER_CONSTANT)  # Seeded for reproducibility
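
Because the classes were balanced before this split, a tweet and its upsampled copy can land on both sides of it, which may make the test scores look better than they would on unseen tweets. One possible alternative, sketched below on made-up data, is to split first and upsample only the training portion. The `toy` frame and the names `majority`, `minority_up` and `train_balanced` are assumptions for illustration, not the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data with unique messages
toy = pd.DataFrame({
    'message': [f'tweet {i}' for i in range(20)],
    'sentiment': [1] * 14 + [0] * 6,
})

# Split first, so every test row is a tweet the model has never seen
train, test = train_test_split(toy, test_size=0.25,
                               stratify=toy['sentiment'], random_state=42)

# Upsample the minority class inside the training portion only
majority = train[train['sentiment'] == 1]
minority = train[train['sentiment'] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up], ignore_index=True)

# No message appears in both the balanced training set and the test set
overlap = set(train_balanced['message']) & set(test['message'])
print(len(overlap))
```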


In [None]:
names = ['Logistic Regression', 'Nearest Neighbors',
         'Linear SVM', 'RBF SVM',
         'Decision Tree', 'Random Forest',  'AdaBoost']

classifiers = [LogisticRegression(max_iter=1000),
               KNeighborsClassifier(3),
               SVC(kernel="linear", C=0.025),
               SVC(gamma=2, C=1),
               DecisionTreeClassifier(max_depth=5, random_state=PARAMETER_CONSTANT),
               RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1,
                                      random_state=PARAMETER_CONSTANT),
               AdaBoostClassifier(random_state=PARAMETER_CONSTANT)
               ]

In [None]:
results = []

models = {}
confusion = {}
class_report = {}


for name, clf in zip(names, classifiers):
    print ('Fitting {:s} model...'.format(name))
    run_time = %timeit -q -o clf.fit(X_train, y_train)

    print ('... predicting')
    y_pred = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)

    print ('... scoring')
    accuracy  = metrics.accuracy_score(y_train, y_pred)
    precision = metrics.precision_score(y_train, y_pred, average='weighted')
    recall    = metrics.recall_score(y_train, y_pred, average='weighted')

    f1        = metrics.f1_score(y_train, y_pred, average='weighted')
    f1_test   = metrics.f1_score(y_test, y_pred_test, average='weighted')

    # Save the results to dictionaries
    models[name] = clf
    confusion[name] = metrics.confusion_matrix(y_train, y_pred)
    class_report[name] = metrics.classification_report(y_train, y_pred)

    results.append([name, accuracy, precision, recall, f1, f1_test, run_time.best])


results = pd.DataFrame(results, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1 Train', 'F1 Test', 'Train Time'])
results.set_index('Classifier', inplace= True)

print ('... All done!')

In [None]:
results.sort_values('F1 Train', ascending=False)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
results.sort_values('F1 Train', ascending=False, inplace=True)
results.plot(y=['F1 Test'], kind='bar', ax=ax[0], xlim=[0,1.1], ylim=[0.85,0.92])
results.plot(y='Train Time', kind='bar', ax=ax[1])

<a id="seven"></a>
### 7. Model performance evaluation

In [None]:
# Use K-Fold cross validation

model = models['Logistic Regression']
print(cross_val_score(model, X.values, y.values))

cv = []
for name, model in models.items():
    print ()
    print(name)
    scores = cross_val_score(model, X=X.values, y=y.values, cv=10)
    print("Accuracy: {:0.2f} (+/- {:0.4f})".format(scores.mean(), scores.std()))
    cv.append([name, scores.mean(), scores.std() ])

cv = pd.DataFrame(cv, columns=['Model', 'CV_Mean', 'CV_Std_Dev'])
cv.set_index('Model', inplace=True)

cv.plot(y='CV_Mean', yerr='CV_Std_Dev',kind='bar', ylim=[0.25, 0.85])
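
The loop above relies on the defaults of `cross_val_score`: for a classifier with an integer `cv`, the folds are stratified but not shuffled, and the score is accuracy. As a minimal sketch, the example below passes an explicit, shuffled `StratifiedKFold` and scores with weighted F1, the metric used in the modelling section. The synthetic data from `make_classification` and the names `X_toy`, `y_toy` and `folds` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the bag-of-words features (hypothetical data)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# Five shuffled, stratified folds with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=folds, scoring='f1_weighted')
print("F1 (weighted): {:0.2f} (+/- {:0.4f})".format(scores.mean(), scores.std()))
```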

<a id="eight"></a>
### 8. Model analysis and conclusion

In [None]:
# GridSearchCV for SVM (RBF)
param_grid = {'kernel': ['rbf'],
              'gamma': (0.5, 1,2),
              'C': (0.5,0.75, 1.0)}
svm = SVC()
clf = GridSearchCV(svm, param_grid, scoring='f1_macro', cv=2)
clf.fit(X_train, y_train)
clf.best_params_
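
With its default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full data passed to `fit`, and exposes that model as `best_estimator_`. The sketch below shows this on synthetic data, so the winning parameters do not have to be copied by hand. `X_toy`, `y_toy`, `search` and `best_svm` are hypothetical names, and the data comes from `make_classification` rather than the tweets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (hypothetical data)
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=42)

param_grid = {'kernel': ['rbf'],
              'gamma': [0.5, 1, 2],
              'C': [0.5, 0.75, 1.0]}

search = GridSearchCV(SVC(), param_grid, scoring='f1_macro', cv=2)
search.fit(X_toy, y_toy)

# best_estimator_ is already refitted with the best parameters
best_svm = search.best_estimator_
print(search.best_params_)
print(best_svm.predict(X_toy[:5]))
```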

In [None]:
# Retrain on best params

svm = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf = svm.fit(X_train, y_train)

y_pred = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

print ('... scoring')
accuracy  = metrics.accuracy_score(y_train, y_pred)
precision = metrics.precision_score(y_train, y_pred, average='weighted')
recall    = metrics.recall_score(y_train, y_pred, average='weighted')

f1        = metrics.f1_score(y_train, y_pred, average='weighted')
f1_test   = metrics.f1_score(y_test, y_pred_test, average='weighted')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'f1: {f1}')
print(f'f1-test: {f1_test}')
