## Research Questions Focus Areas

- Note: Please endeavour to explicitly comment your codes and properly document whichever functions created so as to help other collaborators learn from your codes quicker. Remember that the project is also a learning process.

---

### Dataset Import

In [34]:
#!pip install langdetect

In [35]:
#!pip install googletrans==3.1.0a0

In [6]:
import urllib.request, random, re, os, os.path
import pandas as pd
import numpy as np

#Data Cleaning 
from langdetect import detect
from langdetect.detector import LangDetectException
from googletrans import Translator

#Datas Encoding
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim.models import word2vec, KeyedVectors
from gensim.models.word2vec import Word2Vec
from sklearn.decomposition import PCA
from sklearn import preprocessing

#Sentiment Analysis
import pickle
from tensorflow.keras.models import load_model
from sklearn import model_selection
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout, SimpleRNN, LSTM, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from keras.utils import to_categorical
from keras_preprocessing.sequence import pad_sequences
from keras.wrappers.scikit_learn import KerasClassifier
from keras.preprocessing.text import Tokenizer

#Topic Modelling
import glob, pprint, spacy
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pyLDAvis, pyLDAvis.gensim_models
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [5]:
#Corpus, Dictionary etc declaration

wv = KeyedVectors.load('./Election-Campaign-Application-Phase2-Implementation/app/corpus/word2vec-google-news-300')
stop_words = set(stopwords.words('english'))
ps = nltk.PorterStemmer()

---

In [7]:
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

tweets = pd.read_csv("citizensvoice_dataset.csv", index_col=0, encoding='latin-1')

In [4]:
tweets.head()

Unnamed: 0,time_created,tweet,loca_tion,sentiment,discourse_area
0,2022-10-25T23:44:56+00:00,"Tinubu Is An Emperor; Buhari, Osinbajo, Governors Begged Him To Forgive Ambode But He Refused âDele Momodu | Sahara Reporters https://t.co/mO1zWgXQxn",Nigeria,negative,unrelated
1,2022-10-25T23:37:40+00:00,Dear @PeterObi please stop putting our future at risk. \r\nYou are the only reason I still believe in Nigeria. \r\nMy vote is for you https://t.co/nKhLhzrV8H,"Lagos, Nigeria",positive,unrelated
2,2022-10-25T23:31:19+00:00,"Wike pointed to how the PDP presidential candidate, Alhaji Atiku Abubakar picked people from Rivers State as members of the presidential campaign council without any input from him.\r\n\r\nhttps://t.co/H2cicBJlu1",Nigeria,indifferent,unrelated
3,2022-10-25T23:03:57+00:00,@fkeyamo @apc_lagos https://t.co/KrKdTG8prX,"Ogun, Nigeria",indifferent,unrelated
4,2022-10-27T23:59:39+00:00,"PDP is in total chaos in Ogun, dead in Lagos, Oyo PDP refusing to work for Atiku, the leaders in Ekiti and Ondo are refusing to mount a challenge, only in Osun does the party have a deem hope. https://t.co/w3r3dmSa0i","Ogun, Nigeria",negative,unrelated


### Dataset Wrangling

+ Using one of our Research Questions to guide the data wrangling. If we consider a simple question of "What is being said about Peter Obi?";

In [8]:
tweets.tweet = tweets.tweet.str.lower()

In [6]:
tweets.head(1)

Unnamed: 0,time_created,tweet,loca_tion,sentiment,discourse_area
0,2022-10-25T23:44:56+00:00,"tinubu is an emperor; buhari, osinbajo, governors begged him to forgive ambode but he refused âdele momodu | sahara reporters https://t.co/mo1zwgxqxn",Nigeria,negative,unrelated


#### 1. Cleaning

+ Cleaning: Clean any element of the dataset that might affect our NLP algorithm.
        - Remove "/n", links and emojis.
        - Replace &amp; with and.
+ In future versions of this project, we might try to analyse some of these element, like the emojis as they could be essential for our sentimental analysis, but for now we keep it simple and focus on the execution.

In [9]:
# Unicode for emojis
emojis = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)

In [10]:
def clean_tweet(tweet):
    """
    Function to clean tweet by:
    - Changing '&' sign to and
    - Removing newlines, carriage returns, links, emojis, handles, hashtags and punctuations

    Parameters:
        tweet (string): The tweet
    """      
            
    tweet = re.sub("@[^\s]+", "",tweet) # removes handles
    tweet = re.sub("\n", " ", tweet) # remove newlines
    tweet = re.sub("\r", "", tweet) # remove carriage returns
    tweet = re.sub(r"http\S+", "", tweet) # removes links
    tweet = re.sub(emojis, "", tweet) # remove emojis
    tweet = re.sub(r"#(\w+)", "", tweet) # remove hashtags
    tweet = re.sub("&", "and", tweet) # changes & sign to and
    tweet = re.sub(r"[^\w\s\@]","",tweet) # removes punctuation
    tweet = tweet.strip()

    return tweet

In [11]:
tweets["clean_tweet"] = tweets.tweet.apply(clean_tweet)

In [10]:
tweets[["tweet", "clean_tweet"]].sample(5)

Unnamed: 0,tweet,clean_tweet
16854,apc inaugurates presidential campaign council in canada https://t.co/wmrqdmnw7h,apc inaugurates presidential campaign council in canada
36557,"@aidelomo_happy @mysteriousdoct3 @shehusani that is ur id card for sure. you can keep piling till next year december no wahala, but one thing you shud know is that your principal can't win the election, simple as apc",that is ur id card for sure you can keep piling till next year december no wahala but one thing you shud know is that your principal cant win the election simple as apc
693,@mahdeeus @rbleipzig_en see what vini cause us,see what vini cause us
10129,"as tinubu dey lose, peter obi dey gain. north central is locked down for po.\r\n\r\nfirst lady arise tv access bank osinbajo tinubu igbo yorubas ambode muslim-muslim funsho williams yaba https://t.co/acyfev0l5n",as tinubu dey lose peter obi dey gain north central is locked down for po first lady arise tv access bank osinbajo tinubu igbo yorubas ambode muslimmuslim funsho williams yaba
9482,@peterobi our incoming presidentð,our incoming presidentð


In [11]:
tweets.head(1)

Unnamed: 0,time_created,tweet,loca_tion,sentiment,discourse_area,clean_tweet
0,2022-10-25T23:44:56+00:00,"tinubu is an emperor; buhari, osinbajo, governors begged him to forgive ambode but he refused âdele momodu | sahara reporters https://t.co/mo1zwgxqxn",Nigeria,negative,unrelated,tinubu is an emperor buhari osinbajo governors begged him to forgive ambode but he refused âdele momodu sahara reporters


#### 2. Translate

+ Convert all tweets to lower case.
+ Translate: Here we convert all non-english tweet to English for a smooth and uniform analysis
        - If text is not in English convert to English (using google translate or any other suitable library or api).
+ This is much time saving as systems have already been develped for such translation, instead of us having to develop our NLP kit or algorithm for each language used in the Nigerian twitter space. We can simply translate with the already existing systems and then analyse with the already trained systems.
+ In future versions of this project, we could look into developing our own custom NLP algorithm and kit tailored to our own native languages.

In [47]:
translator = Translator()

In [48]:
def translate_tweet(tweet):

    """
    Function to translate non-english tweets by checking if the tweet is non-english
    and translating it using the googletrans library.
    It returns the unchanaged tweet if it is in english, a space, empty or numeric.

    Parameters:
        tweet (string): The tweet
    """ 
    try:
        if detect(tweet) != "en":
            return translator.translate(tweet).text
        return tweet
    
    except LangDetectException:
        return tweet

In [49]:
def find_non_english(tweet):
    """
    Function that finds non-english tweets.
    
    Parameters:
        tweet (string): The tweet
    """ 
    try:
        if detect(tweet) != "en":
            return True
        return False
    
    except LangDetectException:
        return False

In [None]:
%%time
#tweets["translated_tweet"] = tweets.clean_tweet.apply(translate_tweet)

In [24]:
%%time
#non_english = tweets.clean_tweet.apply(find_non_english)

CPU times: total: 16min 1s
Wall time: 16min 5s


In [None]:
#tweets[non_english][["clean_tweet", "translated_tweet"]]

#### 3. Filtering

+ Filtering: Filter for tweets directed at Peter Obi, based on the following rules:
        - Peter Obi's handle appears first in tweet.
        - Peter Obi's name (not handle) appears any where in tweet.
        - Peter Obi's handle appears in tweet but not after another handle.
+ These rule help us focus the results on tweet directed to or about Peter Obi, instead of including tweets that could simply be replies to other twitter users under Peter Obi's tweet or replies to other twitter users who posted a tweet with Peter Obi's handle in it.
+ These rules were derived from domain knowledge of the platform.

In [53]:
def filter_tweet(tweet, handle, mentions):
    """
    Function that filters tweet Filter for tweets directed at handle, based on the following rules:
    - The handle appears first in tweet.
    - The handle appears in tweet but not after another handle.
    - The person is mentioned any where in tweet based on the list of metions.

    Parameters:
        tweet (string): The tweet
        handle (string): The username of the subject to be filtered for should start with '@'
        mentions (list): A list of other ways the subject could be mentioned in the text
    """

    # Split text into tokens
    tokens = tweet.split()

    # Check for tokens that have the handle
    indices = [i for i, token in enumerate(tokens) if token == handle]

    for index in indices:

        # Checks if the handle appear first
        if index==0:
            return True

        # Checks if the another handle appears before it
        if not tokens[index-1].startswith("@"):
            return True

    # Checks if the person is mentioned anywhere in the tweet
    for mention in mentions:
        if mention in tweet:
            return True
    
    return False

In [30]:
#po_tweets = tweets[tweets.tweet.apply(filter_tweet, handle="@peterobi", mentions=["peter obi", " peterobi", " po "])].copy()

In [None]:
#po_tweets.sample(5)

In [47]:
#len(po_tweets)

9134

## Sentiment Analysis

In [12]:
#Dataset for Sentiment Analysis
senttweets = tweets[["clean_tweet","sentiment"]][~tweets.sentiment.isnull()]
senttweets.shape

(2097, 2)

### Dataset Encoding

In [13]:
num_classes = 3

In [18]:
le = preprocessing.LabelEncoder()

sent = le.fit_transform(senttweets['sentiment'])
text = senttweets['clean_tweet']

In [19]:
sent

array([1, 2, 0, ..., 0, 0, 1])

In [20]:
senttweets['sentiment'][:10]

0       negative
1       positive
2    indifferent
3    indifferent
4       negative
5       negative
6    indifferent
7       positive
8       negative
9       negative
Name: sentiment, dtype: object

In [23]:
import collections
 
collections.Counter(sent)

Counter({1: 505, 2: 502, 0: 1090})

- 1. Bag of Words Approach

In [258]:
#Functions

def bagofwords_preprocessing(text, label):
    """This function removes stop words and lemmatizes the dataset, after which it uses the most frequent
    words to create a feature set that it will use to generate the encoded dataset with binary values."""
    
    #remove stope words from text
    processed = text.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

    #remove word stems using a Porter stemmer
    processed = processed.apply(lambda x: ' '.join(ps.stem(term) for term in x.split()))

    #creating a bag-of-words
    all_words = []

    for message in processed:
        words = word_tokenize(message)
        for w in words:
            all_words.append(w)

    all_words = nltk.FreqDist(all_words)
    #This bag of words will help us selects the words that will be used for our features.

    #use the 1500 most common words as features
    word_features = list(all_words.keys())[:1500]

    #This merges our pre-processed data with it label
    messages = zip(processed, label)
    
    return messages, word_features

def find_features(message, word_features):
    """This function basically picks a text and creates a row,
    by checking if it contains any of the common words or not."""
    words = word_tokenize(message)
    features = {}
    for word in word_features:
        features[word] = (word in words)
        
    return features

In [69]:
messages, word_features = bagofwords_preprocessing(text, sent)

%%time
#building dataset using our function for generating rows using our word features
dataset_1 = [(find_features(text, word_features), label) for (text, label) in messages]

- 2. Sentence Character-Encoder (Custom Encoder)

In [80]:
#Functions

def custom_dictionary():
    """Dictionary to be used for the sentence-character vector encoding."""

    alphabets = [' ','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
    alpha_codes = ['%.2f' % elem for elem in np.linspace(0, 1, 27)]
    code_dict = dict(zip(alphabets, alpha_codes))
    
    return code_dict

def custom_sentencencoder(text):
    """Vector Senctence Encoding for each tweet.
    
    This approach does not omit stop words, nor lemmatize so as to retain all information within the sentence. 
    
    However the context of each word is lost, therefore if the same sentence is constructed using different words, 
    the algorithm might not recognize the similarity."""

    #Creating dataframe with 300 features meant to represent each character in the sentence
    twitterchar_len = 300
    cols = ['char_'+str(index) for index in range(1,twitterchar_len+1)]
    dataset = pd.DataFrame(columns=cols)
    
    for txt in text:
        instance = []
        for index in range(dataset.shape[1]):
            if index < len(txt):
                if txt[index] in code_dict.keys():
                    instance.append(code_dict[txt[index]])
                else:
                    instance.append(code_dict[' '])
            else:
                #Vector Padding
                instance.append(code_dict[' '])
        dataset.loc[len(dataset.index)] = instance
        
    return dataset 

In [78]:
code_dict = custom_dictionary()

In [81]:
%%time
dataset_2 = custom_sentencencoder(text)
dataset_2 = dataset_2.astype(float)

training2 = dataset_2.to_numpy()
label2 = np.array(sent)
label2 = to_categorical(label2, num_classes)

- 3. Sentence Word-Encoder (GENSIM + PCA)

In [84]:
#Functions

def sent_vect3(sentence):
    """This function tokenizes each text and encodes each word in each text with it's vector representation
    in the word2vec-google-news-300 GENSIM dictionary, after which it reduces the size of each word in the
    text using Principal Component Analysis (PCA)."""
    
    word_token = word_tokenize(sentence)
    sample_vector = np.array([wv[word] for word in word_token if word in wv.index_to_key])
    if sample_vector.shape[0] > 0:
        sample_vector = pca.fit_transform(sample_vector)
        sample_vector = sample_vector.T.tolist()[0]
    else:
        sample_vector = [0.0]
        
    return sample_vector

In [85]:
pca = PCA(n_components=1)

In [86]:
%%time
dataset_3 = text.apply(sent_vect3)
dataset_3 = pd.DataFrame(dataset_3.tolist())
dataset_3.fillna(0, inplace=True)

training3 = dataset_3.to_numpy()
label3 = np.array(sent)
label3 = to_categorical(label3, num_classes)

- 4. Sentence Word-Encoder (GENSIM + Avg())

In [17]:
#Functions

def sent_vect4(sentence):
    """This function tokenizes each text and encodes each word in each text with it's vector representation
    in the word2vec-google-news-300 GENSIM dictionary, after which it reduces the size of each word in the
    text by replacing their respective vectors with the average values of each vector."""
    
    word_token = word_tokenize(sentence)
    sample_vector = np.array([wv[word] for word in word_token if word in wv.index_to_key])
    if sample_vector.shape[0] > 0:
        sample_vector = sample_vector.mean(axis=1)
        sample_vector = sample_vector.tolist()
    else:
        sample_vector = [0.0]
    return sample_vector

In [18]:
%%time
dataset_4 = text.apply(sent_vect4)
dataset_4 = pd.DataFrame(dataset_4.tolist())
dataset_4.fillna(0, inplace=True)

training4 = dataset_4.to_numpy()
label4 = np.array(sent)
label4 = to_categorical(label4, num_classes)

CPU times: total: 6min 29s
Wall time: 6min 30s


- 5. Sentence Word-Encoder (GENSIM to Tensor)

In [19]:
#Functions

def sent_vect5(series):
    """This function tokenizes each text and encodes each word in each text with it's vector representation
    in the word2vec-google-news-300 GENSIM dictionary.
    
    This nested list/array will later be converted into a tensor, and fed directly into an RNN"""
    
    shape = series.shape[0]
    series = list(series.values)
    array = []
    pad_array = np.zeros(300)
    for i in range(shape):
        word_token = word_tokenize(series[i])
        sample_vector = np.array([list(wv[word]) for word in word_token if word in wv.index_to_key])
        if sample_vector.shape[0] > 0:
            deficit = 50-sample_vector.shape[0]
            for i in range(deficit):
                sample_vector = np.vstack((sample_vector, pad_array))
        else:
            sample_vector = np.zeros((50, 300))
        array.append(sample_vector.tolist())
    return array

In [20]:
%%time
dataset_5 = sent_vect5(text)

training5 = np.array(dataset_5)
label5 = np.array(sent)

CPU times: total: 6min 40s
Wall time: 6min 43s


- 6. Word Encoder Keras Tokenizer

In [38]:
tokenizer = Tokenizer(num_words=1000, lower=True)
tokenizer.fit_on_texts(text)

x_train = tokenizer.texts_to_sequences(text)

#adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

In [39]:
%%time
maxlen = 50 #we set the maximum size of each list to 100

training6 = pad_sequences(x_train, padding='post', maxlen=maxlen)
label6 = np.array(sent)

CPU times: total: 31.2 ms
Wall time: 26.9 ms


#### Sentiment Analysis

- National

In [45]:
scores_dict = {'model1':[],'model2':[],'model2_5':[],'model3':[],'model3_5':[],'model4':[],'model4_5':[],'model5':[],'model6':[]}
epochs = 50

- Senitment Method 1 (with Dataset 1)

In [141]:
training1, testing1 = model_selection.train_test_split(dataset_1, test_size=0.2, random_state=1)

In [147]:
#Create a model list, just as it is done when creating a pipeline.
names = ['Naive Bayes']
classifier = [MultinomialNB()]

models = dict(zip(names, classifier))

In [149]:
model1 = SklearnClassifier(models['Naive Bayes'])
model1.train(training1)

<SklearnClassifier(MultinomialNB())>

In [183]:
scores_dict['model1'].append(nltk.classify.accuracy(model1, testing1))

In [254]:
file_name = "./models/sentmodel1.pkl"
pickle.dump(model1, open(file_name, 'wb'))

- Senitment Method 2 (with Dataset 2)

In [160]:
model2 = Sequential([
    Dense(units=50, input_shape=(training2.shape[1],), activation='relu'),
    Dense(units=100, activation='relu'),
    Dense(units=100, activation='relu'),
    Dense(units=num_classes, activation='softmax')
])

In [None]:
model2.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])
hist2 = model2.fit(x=training2, y=label2, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [184]:
scores_dict['model2'].append(hist2.history["val_accuracy"][epochs - 1])

In [237]:
if os.path.isfile('./models/sentmodel2.h5') is False:
    model2.save('./models/sentmodel2.h5')

In [165]:
#Dataset2 combined with SimpleRNN

training2_5 = np.array(training2).reshape((training2.shape[0]), training2.shape[1], 1)

label2_5 = to_categorical(np.array(sent), num_classes)

In [176]:
model2_5 = Sequential([
    SimpleRNN(50, input_shape = (training2.shape[1], 1), return_sequences = False),
    Dense(num_classes, activation='softmax'),
])

In [None]:
model2_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0005), metrics = ['accuracy'])
hist2_5 = model2_5.fit(training2_5, label2_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [185]:
scores_dict['model2_5'].append(hist2_5.history["val_accuracy"][epochs - 1])

In [238]:
if os.path.isfile('./models/sentmodel2_5.h5') is False:
    model2_5.save('./models/sentmodel2_5.h5')

- Senitment Method 3 (with Dataset 3)

In [187]:
model3 = Sequential([
    Dense(units=150, input_shape=(training3.shape[1],), activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=num_classes, activation='softmax')
])

In [None]:
model3.compile(optimizer=Adam(learning_rate=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
hist3 = model3.fit(x=training3, y=label3, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [189]:
scores_dict['model3'].append(hist3.history["val_accuracy"][epochs - 1])

In [239]:
if os.path.isfile('./models/sentmodel3.h5') is False:
    model3.save('./models/sentmodel3.h5')

In [190]:
#Dataset3 combined with SimpleRNN

training3_5 = np.array(training3).reshape((training3.shape[0]), training3.shape[1], 1)

label3_5 = to_categorical(np.array(sent), num_classes)

In [191]:
model3_5 = Sequential([
    SimpleRNN(50, input_shape = (training3.shape[1], 1), return_sequences = False),
    Dense(num_classes, activation='softmax'),
])

In [None]:
model3_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist3_5 = model3_5.fit(training3_5, label3_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [193]:
scores_dict['model3_5'].append(hist3_5.history["val_accuracy"][epochs - 1])

In [240]:
if os.path.isfile('./models/sentmodel3_5.h5') is False:
    model3_5.save('./models/sentmodel3_5.h5')

- Sentiment Method 4 (with Dataset 4)

In [195]:
model4 = Sequential([
    Dense(units=150, input_shape=(training4.shape[1],), activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=num_classes, activation='softmax')
])

In [None]:
model4.compile(optimizer=Adam(learning_rate=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
hist4 = model4.fit(x=training4, y=label4, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [197]:
scores_dict['model4'].append(hist4.history["val_accuracy"][epochs - 1])

In [241]:
if os.path.isfile('./models/sentmodel4.h5') is False:
    model4.save('./models/sentmodel4.h5')

In [21]:
#Dataset4 combined with SimpleRNN

training4_5 = np.array(training4).reshape((training4.shape[0]), training4.shape[1], 1)

label4_5 = to_categorical(np.array(sent), num_classes)

In [200]:
model4_5 = Sequential([
    SimpleRNN(50, input_shape = (training4.shape[1], 1), return_sequences = False),
    Dense(num_classes, activation='softmax'),
])

In [None]:
model4_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist4_5 = model4_5.fit(training4_5, label4_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [202]:
scores_dict['model4_5'].append(hist4_5.history["val_accuracy"][epochs - 1])

In [242]:
if os.path.isfile('./models/sentmodel4_5.h5') is False:
    model4_5.save('./models/sentmodel4_5.h5')

In [23]:
#model4_6 = Sequential([
    #LSTM(50, input_shape = (training4.shape[1], 1), return_sequences = False),
    #Dense(num_classes, activation='softmax'),
#])

In [None]:
#model4_6.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
#hist4_6 = model4_6.fit(training4_5, label4_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [202]:
#scores_dict['model4_6'].append(hist4_6.history["val_accuracy"][epochs - 1])

In [242]:
#if os.path.isfile('./models/sentmodel4_6.h5') is False:
    #model4_6.save('./models/sentmodel4_6.h5')

- Sentiment Method 5 (with Dataset 5)

In [34]:
#Dataset5 combined with SimpleRNN

training5 = tf.convert_to_tensor(training5, dtype=tf.int64)

label5 = to_categorical(label5, num_classes)

In [205]:
model5 = Sequential([
    SimpleRNN(50, input_shape = (training5.shape[1], training5.shape[2]), return_sequences = False),
    Dense(num_classes, activation='softmax'),
])

In [None]:
model5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist5 = model5.fit(training5, label5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [207]:
scores_dict['model5'].append(hist5.history["val_accuracy"][epochs - 1])

In [243]:
if os.path.isfile('./models/sentmodel5.h5') is False:
    model5.save('./models/sentmodel5.h5')

In [30]:
#model5_5 = Sequential([
    #LSTM(50, input_shape = (training5.shape[1], training5.shape[2]), return_sequences = False),
    #Dense(num_classes, activation='softmax'),
#])

In [None]:
#model5_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
#hist5_5 = model5_5.fit(training5, label5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [207]:
#scores_dict['model5_5'].append(hist5_5.history["val_accuracy"][epochs - 1])

In [243]:
#if os.path.isfile('./models/sentmodel5_5.h5') is False:
    #model5_5.save('./models/sentmodel5_5.h5')

- Sentiment Method 6 (with Dataset 6)

In [43]:
#Dataset6 combined with SimpleRNN

training6 = np.array(training6).reshape((training6.shape[0]), training6.shape[1], 1)

label6 = to_categorical(label6, num_classes)

In [210]:
model6 = Sequential([
    SimpleRNN(50, input_shape = (training6.shape[1], 1), return_sequences = False),
    Dense(num_classes, activation='softmax'),
])

In [41]:
#model6 = Sequential([
    #LSTM(50, input_shape = (training6.shape[1], 1), return_sequences = False),
    #Dense(num_classes, activation='softmax'),
#])

In [None]:
model6.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist6 = model6.fit(training6, label6, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [212]:
scores_dict['model6'].append(hist6.history["val_accuracy"][epochs - 1])

In [244]:
if os.path.isfile('./models/sentmodel6.h5') is False:
    model6.save('./models/sentmodel6.h5')

In [213]:
index = ['Validation_Accuracy']
scores_df = pd.DataFrame(scores_dict, index = index)
scores_df

Unnamed: 0,model1,model2,model2_5,model3,model3_5,model4,model4_5,model5,model6
Validation_Accuracy,0.511905,0.45,0.607143,0.428571,0.540476,0.595238,0.607143,0.607143,0.552381


In [246]:
scores_df.to_csv('./models/sentmodels_scores.csv')

- Loading already trained model

In [None]:
#Loading model from local repository
model1 = pickle.load(open('./models/sentmodel1.pkl', 'rb'))
model2 = load_model('./models/sentmodel2.h5')
model2_5 = load_model('./models/sentmodel2_5.h5')
model3 = load_model('./models/sentmodel3.h5')
model3_5 = load_model('./models/sentmodel3_5.h5')
model4 = load_model('./models/sentmodel4.h5')
model4_5 = load_model('./models/sentmodel4_5.h5')
model5 = load_model('./models/sentmodel5.h5')
model6 = load_model('./models/sentmodel6.h5')

In [2]:
#Loading model from remote repository
urllib.request.urlretrieve(
        'https://github.com/dub-em/Election-Campaign-Application-Phase2/raw/main/models/damodel8.h5', 'damodel8.h5')

link = './damodel8.h5'

model = load_model(link)

In [4]:
model.summary()

Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_29 (Dense)            (None, 50)                15050     
                                                                 
 dense_30 (Dense)            (None, 100)               5100      
                                                                 
 dense_31 (Dense)            (None, 100)               10100     
                                                                 
 dense_32 (Dense)            (None, 6)                 606       
                                                                 
Total params: 30,856
Trainable params: 30,856
Non-trainable params: 0
_________________________________________________________________


## Discourse Area Analysis

In [268]:
#Dataset for Discourse Area Analysis
datweets = tweets[["clean_tweet","discourse_area"]][~tweets.discourse_area.isnull()]
datweets.shape

(2096, 2)

### Dataset Encoding

In [269]:
num_classes2 = 6

- 7. Bag of Words Approach

In [270]:
disc_area = le.fit_transform(datweets['discourse_area'])
datext = datweets['clean_tweet']

In [272]:
%%time
messages, word_features = bagofwords_preprocessing(datext, disc_area)

#building dataset using our function for generating rows using our word features
dataset_7 = [(find_features(datext, word_features), label) for (datext, label) in messages]

CPU times: total: 3.72 s
Wall time: 3.78 s


- 8. Sentence Character-Encoder (Custom Encoder)

In [274]:
%%time
dataset_8 = custom_sentencencoder(datext)
dataset_8 = dataset_8.astype(float)

training8 = dataset_8.to_numpy()
label8 = np.array(disc_area)
label8 = to_categorical(label8, num_classes2)

CPU times: total: 1min 57s
Wall time: 2min 13s


- 9. Sentence Word-Encoder (GENSIM + PCA)

In [275]:
%%time
dataset_9 = datext.apply(sent_vect3)
dataset_9 = pd.DataFrame(dataset_9.tolist())
dataset_9.fillna(0, inplace=True)

training9 = dataset_9.to_numpy()
label9 = np.array(disc_area)
label9 = to_categorical(label9, num_classes2)

CPU times: total: 14min 10s
Wall time: 11min 21s


- 10. Sentence Word-Encoder (GENSIM + Avg())

In [276]:
%%time
dataset_10 = datext.apply(sent_vect4)
dataset_10 = pd.DataFrame(dataset_10.tolist())
dataset_10.fillna(0, inplace=True)

training10 = dataset_10.to_numpy()
label10 = np.array(disc_area)
label10 = to_categorical(label10, num_classes2)

CPU times: total: 9min 45s
Wall time: 9min 53s


- 11. Sentence Word-Encoder (GENSIM to Tensor)

In [277]:
%%time
dataset_11 = sent_vect5(datext)

training11 = np.array(dataset_11)
label11 = np.array(disc_area)

CPU times: total: 9min 58s
Wall time: 10min 5s


- 12. Word Encoder Keras Tokenizer

In [278]:
tokenizer2 = Tokenizer(num_words=1000, lower=True)
tokenizer2.fit_on_texts(datext)

x_train2 = tokenizer2.texts_to_sequences(datext)

#adding 1 because of reserved 0 index
vocab_size2 = len(tokenizer2.word_index) + 1

In [279]:
%%time
maxlen = 50 #we set the maximum size of each list to 100

training12 = pad_sequences(x_train2, padding='post', maxlen=maxlen)
label12 = np.array(disc_area)

CPU times: total: 15.6 ms
Wall time: 27.9 ms


#### Disourse Area Analysis

- National

In [280]:
scores_dict2 = {'model7':[],'model8':[],'model8_5':[],'model9':[],'model9_5':[],'model10':[],'model10_5':[],'model11':[],'model12':[]}
epochs = 50

- Disourse Method 7 (with Dataset 7)

In [281]:
training7, testing7 = model_selection.train_test_split(dataset_7, test_size=0.2, random_state=1)

In [282]:
#Create a model list, just as it is done when creating a pipeline.
names = ['Naive Bayes']
classifier = [MultinomialNB()]

models = dict(zip(names, classifier))

In [283]:
model7 = SklearnClassifier(models['Naive Bayes'])
model7.train(training7)

<SklearnClassifier(MultinomialNB())>

In [284]:
scores_dict2['model7'].append(nltk.classify.accuracy(model7, testing7))

In [285]:
file_name = "./models/damodel7.pkl"
pickle.dump(model7, open(file_name, 'wb'))

- Discourse Method 8 (with Dataset 8)

In [286]:
model8 = Sequential([
    Dense(units=50, input_shape=(training8.shape[1],), activation='relu'),
    Dense(units=100, activation='relu'),
    Dense(units=100, activation='relu'),
    Dense(units=num_classes2, activation='softmax')
])

In [None]:
model8.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])
hist8 = model8.fit(x=training8, y=label8, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [288]:
scores_dict2['model8'].append(hist8.history["val_accuracy"][epochs - 1])

In [289]:
if os.path.isfile('./models/damodel8.h5') is False:
    model8.save('./models/damodel8.h5')

In [290]:
#Dataset8 combined with SimpleRNN

training8_5 = np.array(training8).reshape((training8.shape[0]), training8.shape[1], 1)

label8_5 = to_categorical(np.array(disc_area), num_classes2)

In [291]:
model8_5 = Sequential([
    SimpleRNN(50, input_shape = (training8.shape[1], 1), return_sequences = False),
    Dense(num_classes2, activation='softmax'),
])

In [None]:
model8_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0005), metrics = ['accuracy'])
hist8_5 = model8_5.fit(training8_5, label8_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [293]:
scores_dict2['model8_5'].append(hist8_5.history["val_accuracy"][epochs - 1])

In [294]:
if os.path.isfile('./models/damodel8_5.h5') is False:
    model8_5.save('./models/damodel8_5.h5')

- Discourse Method 9 (with Dataset 9)

In [295]:
model9 = Sequential([
    Dense(units=150, input_shape=(training9.shape[1],), activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=num_classes2, activation='softmax')
])

In [None]:
model9.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])
hist9 = model9.fit(x=training9, y=label9, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [301]:
scores_dict2['model9'].append(hist9.history["val_accuracy"][epochs - 1])

In [302]:
if os.path.isfile('./models/damodel9.h5') is False:
    model9.save('./models/damodel9.h5')

In [303]:
#Dataset9 combined with SimpleRNN

training9_5 = np.array(training9).reshape((training9.shape[0]), training9.shape[1], 1)

label9_5 = to_categorical(np.array(disc_area), num_classes2)

In [304]:
model9_5 = Sequential([
    SimpleRNN(50, input_shape = (training9.shape[1], 1), return_sequences = False),
    Dense(num_classes2, activation='softmax'),
])

In [None]:
model9_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist9_5 = model9_5.fit(training9_5, label9_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [306]:
scores_dict2['model9_5'].append(hist9_5.history["val_accuracy"][epochs - 1])

In [307]:
if os.path.isfile('./models/damodel9_5.h5') is False:
    model9_5.save('./models/damodel9_5.h5')

- Discourse Method 10 (with Dataset 10)

In [308]:
model10 = Sequential([
    Dense(units=150, input_shape=(training10.shape[1],), activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=200, activation='relu'),
    Dense(units=num_classes2, activation='softmax')
])

In [None]:
model10.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])
hist10 = model10.fit(x=training10, y=label10, validation_split=0.2, batch_size=50, epochs=epochs, shuffle=True, verbose=2)

In [310]:
scores_dict2['model10'].append(hist10.history["val_accuracy"][epochs - 1])

In [311]:
if os.path.isfile('./models/damodel10.h5') is False:
    model10.save('./models/damodel10.h5')

In [312]:
#Dataset10 combined with SimpleRNN

training10_5 = np.array(training10).reshape((training10.shape[0]), training10.shape[1], 1)

label10_5 = to_categorical(np.array(disc_area), num_classes2)

In [313]:
model10_5 = Sequential([
    SimpleRNN(50, input_shape = (training10.shape[1], 1), return_sequences = False),
    Dense(num_classes2, activation='softmax'),
])

In [None]:
model10_5.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist10_5 = model10_5.fit(training10_5, label10_5, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [315]:
scores_dict2['model10_5'].append(hist10_5.history["val_accuracy"][epochs - 1])

In [316]:
if os.path.isfile('./models/damodel10_5.h5') is False:
    model10_5.save('./models/damodel10_5.h5')

- Discourse Method 11 (with Dataset 11)

In [317]:
#Dataset11 combined with SimpleRNN

training11 = tf.convert_to_tensor(training11, dtype=tf.int64)

label11 = to_categorical(label11, num_classes2)

In [318]:
model11 = Sequential([
    SimpleRNN(50, input_shape = (training11.shape[1], training11.shape[2]), return_sequences = False),
    Dense(num_classes2, activation='softmax'),
])

In [None]:
model11.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist11 = model11.fit(training11, label11, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [320]:
scores_dict2['model11'].append(hist11.history["val_accuracy"][epochs - 1])

In [321]:
if os.path.isfile('./models/damodel11.h5') is False:
    model11.save('./models/damodel11.h5')

- Discourse Method 12 (with Dataset 12)

In [322]:
#Dataset12 combined with SimpleRNN

training12 = np.array(training12).reshape((training12.shape[0]), training12.shape[1], 1)

label12 = to_categorical(label12, num_classes2)

In [323]:
model12 = Sequential([
    SimpleRNN(50, input_shape = (training12.shape[1], 1), return_sequences = False),
    Dense(num_classes2, activation='softmax'),
])

In [None]:
model12.compile(loss='categorical_crossentropy', optimizer = Adam(learning_rate=0.0001), metrics = ['accuracy'])
hist12 = model12.fit(training12, label12, epochs = epochs, batch_size = 50, validation_split=0.2, verbose=2)

In [325]:
scores_dict2['model12'].append(hist12.history["val_accuracy"][epochs - 1])

In [326]:
if os.path.isfile('./models/damodel12.h5') is False:
    model12.save('./models/damodel12.h5')

In [327]:
scores_df2 = pd.DataFrame(scores_dict2, index = index)
scores_df2

Unnamed: 0,model7,model8,model8_5,model9,model9_5,model10,model10_5,model11,model12
Validation_Accuracy,0.942857,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [328]:
scores_df2.to_csv('./models/damodels_scores.csv')

- Loading already trained model

In [None]:
model1 = pickle.load(open('./models/sentmodel1.pkl', 'rb'))
model2 = load_model('./models/sentmodel2.h5')
model2_5 = load_model('./models/sentmodel2_5.h5')
model3 = load_model('./models/sentmodel3.h5')
model3_5 = load_model('./models/sentmodel3_5.h5')
model4 = load_model('./models/sentmodel4.h5')
model4_5 = load_model('./models/sentmodel4_5.h5')
model5 = load_model('./models/sentmodel5.h5')
model6 = load_model('./models/sentmodel6.h5')

#### GENSIM LDA Topic Model

- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [221]:
#Functions

def sent_to_word(sentences):
    """ This function removes stop words usign a different method from the precious one(s) used."""
    
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
        
def make_bigrams(texts, bigram_mod):
    return [bigram_mod[doc] for doc in texts]

def lemmatization(texts, nlp, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

def LDA_parameters(text):
    """This function takes the text, processes it, and return the parameters to build the LDA Topic Model."""
    
    data = text.tolist()
    data_words = list(sent_to_word(data))

    # Build the bigram and trigram models
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
    #trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

    # Faster way to get a sentence clubbed as a bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    # Form Bigrams
    data_words_bigrams = make_bigrams(data_words, bigram_mod)
    data_words_bigrams[:1]

    # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
    # python3 -m spacy download en
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    # Do lemmatization keeping only noun, adj, vb, adv
    data_lemmatized = lemmatization(data_words_bigrams, nlp, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    # Create Dictionary
    id2word = corpora.Dictionary(data_lemmatized)

    # Create Corpus
    texts = data_lemmatized

    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in texts]
    
    return corpus, id2word

In [226]:
%%time
corpus, id2word = LDA_parameters(text)

CPU times: total: 16.3 s
Wall time: 18.5 s


In [228]:
%%time
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                           num_topics=10, random_state=100,
                                           update_every=1, chunksize=100,
                                           passes=10, alpha='auto',
                                           per_word_topics=True)

CPU times: total: 6.61 s
Wall time: 7.13 s


In [229]:
lda_model.print_topics()

#doc_lda = lda_model[corpus]

[(0,
  '0.058*"know" + 0.052*"well" + 0.037*"flood" + 0.036*"do" + 0.027*"use" + 0.022*"bad" + 0.017*"many" + 0.013*"pray" + 0.013*"issue" + 0.011*"speak"'),
 (1,
  '0.057*"leave" + 0.031*"start" + 0.029*"oil" + 0.026*"buhari" + 0.021*"child" + 0.019*"office" + 0.018*"accuse" + 0.017*"dele_momodu" + 0.012*"human" + 0.011*"position"'),
 (2,
  '0.061*"make" + 0.039*"time" + 0.037*"want" + 0.021*"first" + 0.020*"call" + 0.020*"manifesto" + 0.017*"money" + 0.016*"all" + 0.016*"look" + 0.013*"anambra"'),
 (3,
  '0.059*"more" + 0.043*"government" + 0.034*"bring" + 0.029*"re" + 0.028*"seyi" + 0.024*"promise" + 0.020*"attack" + 0.018*"always" + 0.017*"also" + 0.017*"failure"'),
 (4,
  '0.039*"s" + 0.035*"tell" + 0.035*"kano" + 0.035*"think" + 0.034*"thing" + 0.022*"very" + 0.017*"visit" + 0.017*"here" + 0.014*"obi" + 0.013*"continue"'),
 (5,
  '0.047*"nigerian" + 0.045*"support" + 0.039*"work" + 0.026*"help" + 0.023*"right" + 0.021*"follow" + 0.021*"word" + 0.018*"head" + 0.017*"father" + 0.01

---

### General Trends 

- This sections covers the generic trends existing amongst citizens' discussion groups.
        - What is most talked about (regardless of area or topic)

### Citizens' Sentiment

- This section covers citizens' reactions and general sentiment towards certain topic (e.g areas of developments, policies, politically significant events, public office holders' performance and so on).
        - What is the general sentiment of the citizens?
		- What is most discussed(election and governance related)?
		- What is the sentiment towards what is being discussed? 

### Complaint Areas

- This section covers the extraction of various areas of complaints and dissatisfaction amongst citizens (in different aspects of government).
        - What are the various areas of complaints as regards to governance?
		- What are the levels of sentiment towards the various area of complaints?

### Politician's Reputation

- This section covers what citizens's think about certain public office holders, their sentiment towards these individuals and their general popularity or notoriety.
        - Who is most talked about?
		- Popularity or notoreity of the most talked about.
		- Most popular, and most notorious candidates/politician.
		- How much is a certain candidate being talked about?
		- What is being said about each candidate?
		- What is the general sentiment of what is being said?