<a href="https://colab.research.google.com/github/anmaxwell/UniNotebooks/blob/master/assessmentB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To develop a classifier with a convolutional neural network (CNN), we follow these steps:
1. Split the data into training and test sets and extract features.
2. Define the architecture of the convolutional neural network.
3. Train the model.
4. Evaluate it.
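
The first step, splitting the data, can be sketched without any library. This is a minimal illustration that assumes a stratified 80/20 split, so each aggression level keeps roughly the same share in both sets. The helper `stratified_split`, its `test_fraction` and `seed` parameters, and the toy rows are hypothetical and do not come from the notebook.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping each label's share similar.

    Hypothetical helper for illustration; not part of the notebook.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]  # copy so the input list is not mutated
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data (made up): (text, label) pairs using the notebook's three labels.
rows = [(f"comment {i}", label)
        for label, n in [("NAG", 50), ("CAG", 40), ("OAG", 30)]
        for i in range(n)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))
```

In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would likely replace this helper.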


In [1]:
!pip install scattertext



In [0]:
import pandas as pd
import scattertext as st
import spacy

from scattertext import CorpusFromPandas, produce_scattertext_explorer

In [0]:
#read data into a dataframe
df = pd.read_csv('agr_en_train.csv', names=['unique_id','text','aggression-level'], sep=',')

In [17]:
#preview the data without the unique_id column (drop returns a new frame; df itself keeps the column)
df.drop(columns=['unique_id'])

Unnamed: 0,text,aggression-level
0,Well said sonu..you have courage to stand agai...,OAG
1,"Most of Private Banks ATM's Like HDFC, ICICI e...",NAG
2,"Now question is, Pakistan will adhere to this?",OAG
3,Pakistan is comprised of fake muslims who does...,OAG
4,"??we r against cow slaughter,so of course it w...",NAG
...,...,...
11994,They belong to you flight dirty terrorist coun...,OAG
11995,"Really motivating programme, congratulations t...",NAG
11996,fabricated news,OAG
11997,What's wrong with you secular idiots,OAG


In [18]:
#check for missing values
df.isna().values.any()

False

In [19]:
#count the occurrences of each level of aggression
df['aggression-level'].value_counts() 

NAG    5051
CAG    4240
OAG    2708
Name: aggression-level, dtype: int64
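
The counts above show that the three classes are uneven. The sketch below turns them into class shares and, as one possible way to compensate during training, "balanced" class weights. The weighting formula `total / (n_classes * count)` is an assumption added for illustration; the notebook itself does not apply any weighting.

```python
# Counts copied from the value_counts() output above.
counts = {"NAG": 5051, "CAG": 4240, "OAG": 2708}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the training data.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" weights (an assumption, not from the notebook):
# total / (n_classes * count), so rarer classes receive larger weights.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f} weight={weights[label]:.3f}")
```

The share of the largest class (NAG) is also the accuracy of a trivial classifier that always predicts NAG, which gives a baseline for the evaluation step.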

In [0]:
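#parse each text with spaCy, then build a scattertext corpus with stop words removed
nlp = spacy.load('en')  # 'en' is the spaCy 2 shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'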
df['parsed'] = df.text.apply(nlp)
corpus = st.CorpusFromParsedDocuments(df, category_col='aggression-level', 
                                      parsed_col='parsed').build().remove_terms(nlp.Defaults.stop_words, ignore_absences=True)

In [21]:
df.head()

Unnamed: 0,unique_id,text,aggression-level,parsed
0,facebook_corpus_msr_1723796,Well said sonu..you have courage to stand agai...,OAG,"(Well, said, sonu, .., you, have, courage, to,..."
1,facebook_corpus_msr_466073,"Most of Private Banks ATM's Like HDFC, ICICI e...",NAG,"(Most, of, Private, Banks, ATM, 's, Like, HDFC..."
2,facebook_corpus_msr_1493901,"Now question is, Pakistan will adhere to this?",OAG,"(Now, question, is, ,, Pakistan, will, adhere,..."
3,facebook_corpus_msr_405512,Pakistan is comprised of fake muslims who does...,OAG,"(Pakistan, is, comprised, of, fake, muslims, w..."
4,facebook_corpus_msr_1521685,"??we r against cow slaughter,so of course it w...",NAG,"(?, ?, we, r, against, cow, slaughter, ,, so, ..."


In [0]:
freq_df = corpus.get_term_freq_df()
oag_df = freq_df.sort_values(by=['OAG freq'], ascending=False)
nag_df = freq_df.sort_values(by=['NAG freq'], ascending=False)
cag_df = freq_df.sort_values(by=['CAG freq'], ascending=False)
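
The idea behind the cell above, a per-category term count sorted by one category's column, can be sketched in plain Python. This is only an illustration of the concept, not how scattertext implements `get_term_freq_df` (which, for instance, also counts bigrams). The toy documents and the helper `top_terms` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy documents (made up) as (tokens, category) pairs.
docs = [
    (["india", "people", "good"], "NAG"),
    (["people", "like", "india"], "CAG"),
    (["people", "people", "country"], "OAG"),
]

# One Counter per category: term -> number of occurrences in that category.
freq = defaultdict(Counter)
for tokens, category in docs:
    freq[category].update(tokens)


def top_terms(category, n=10):
    """Terms of one category, sorted by descending frequency."""
    return freq[category].most_common(n)


print(top_terms("OAG"))
```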

In [9]:
oag_df.head(10)

term,OAG freq,NAG freq,CAG freq
u,420,313,466
india,396,616,493
people,372,357,554
like,317,315,337
indian,279,350,327
do n't,220,178,316
bjp,218,188,357
pakistan,189,215,141
country,182,158,239
muslims,178,32,115


In [10]:
nag_df.head(10)

term,OAG freq,NAG freq,CAG freq
india,396,616,493
people,372,357,554
good,80,355,189
indian,279,350,327
like,317,315,337
u,420,313,466
of the,142,279,202
&,165,270,155
in the,171,254,209
time,148,248,200


In [11]:
cag_df.head(10)

term,OAG freq,NAG freq,CAG freq
people,372,357,554
india,396,616,493
u,420,313,466
bjp,218,188,357
like,317,315,337
indian,279,350,327
do n't,220,178,316
😂,124,154,314
modi,125,205,288
money,112,210,242


In [12]:
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)

Using TensorFlow backend.


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
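
To make the tokenization step less opaque, here is a rough plain-Python stand-in for `text_to_word_sequence`: lowercase the text, replace punctuation with the split character, then split. The function name `simple_word_sequence` is hypothetical, and the `FILTERS` string is assumed to match the Keras default, so treat it as illustrative rather than exact.

```python
# Assumed to match the default `filters` argument in Keras; illustrative only.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical stand-in for text_to_word_sequence.

    Lowercases the text, replaces every filtered character with the
    split character, then splits and drops empty tokens.
    """
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]


print(simple_word_sequence("The quick brown fox jumped over the lazy dog."))
```

Note that the apostrophe is not in the filter list, so a word like "don't" stays a single token under this scheme.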


In [22]:
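#this call fails: the corpus has no get_term_freq_result method (get_term_freq_df, used above, is the working call)
new_freq_df = corpus.get_term_freq_result()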

AttributeError: ignored

In [13]:
from keras.preprocessing.text import Tokenizer

# create the tokenizer
t = Tokenizer()
# fit the tokenizer; `result` is a list of words, so each word counts as its own document
t.fit_on_texts(result)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# encode each document as a vector of word counts (one row per word here)
encoded_docs = t.texts_to_matrix(result, mode='count')
print(encoded_docs)

OrderedDict([('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1), ('over', 1), ('lazy', 1), ('dog', 1)])
9
{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}
defaultdict(<class 'int'>, {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumped': 1, 'over': 1, 'lazy': 1, 'dog': 1})
[[0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
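
The matrix above has nine one-hot rows because the tokenizer was given a list of single words, and it treats every list element as a separate document. The plain-Python sketch below reproduces that behaviour and contrasts it with passing the whole sentence as one document. The helpers `build_word_index` and `count_matrix` are hypothetical stand-ins, not the Keras implementation.

```python
from collections import Counter


def build_word_index(documents):
    """Rank words by total count (most frequent first); index 0 stays unused."""
    counts = Counter(word for doc in documents for word in doc.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}


def count_matrix(documents, word_index):
    """One row per document, one column per word index, cells hold counts."""
    rows = []
    for doc in documents:
        row = [0] * (len(word_index) + 1)
        for word in doc.split():
            row[word_index[word]] += 1
        rows.append(row)
    return rows


words = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# Each word as its own document (as in the cell above): nine one-hot rows.
index = build_word_index(words)
per_word = count_matrix(words, index)

# The whole sentence as a single document: one row of counts.
sentence = [" ".join(words)]
per_sentence = count_matrix(sentence, build_word_index(sentence))
print(per_sentence)
```

For the classifier, each comment in the `text` column would be one document, so the second form is the one that carries over to the real data.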
