First Goal is to make the computer understand the language<br>

Here are the steps

1. Sentence segmentation
2. Word tokenization
3. Stemming
4. Dependency parsing
5. Part-of-speech (POS) tagging

In [47]:
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import re
import spacy
from spacy import displacy
import nltk
from nltk.parse.dependencygraph import DependencyGraph
from prettytable import PrettyTable

**Prepare a proper storage**

In [48]:
class Tweet:
    def __init__(self, count, hate_speech_count, offensive_language_count, neither_count, classs, tweet):
        """
        - count (int): The total count of the tweet.
        - hate (int): The count of hate speech in the tweet.
        - offensive (int): The count of offensive language in the tweet.
        - neither (int): The count of content classified as neither hate speech nor offensive.
        - classs (str): The classification of the tweet.
        - tweet (str): The text content of the tweet.
        """
        self.count = count
        self.hate = hate_speech_count
        self.offensive = offensive_language_count
        self.neither = neither_count
        self.classs = classs
        self.tweet = tweet
        self.stems = []
        self.tokens = []
        self.code = []
        self.tags = []
        self.t_code = []

    def __str__(self):
        return f"{self.count} ; {self.hate} ; {self.offensive} ; {self.neither} ; {self.classs} ;; {self.tweet}"
    
    def peacefullness(self):
        return self.neither_count / self.count
    
    def offensiveness(self):
        return self.offensive / self.count
    
    def hateness(self):
        return self.hate / self.count

**Load the data**

In [49]:
Tweets = []

In [50]:
with open('./archive/train.csv', 'r+') as file:
    previous_line = ''

    # Initialize a list to accumulate the modified content
    final_content_lines = []

    # Read and accumulate non-empty lines
    for line in file:
        stripped_line = line.strip()

        if stripped_line and stripped_line[0].isdigit():
            # If the current line is not empty and starts with an integer, accumulate it
            final_content_lines.append(stripped_line)
            previous_line = stripped_line
        else:
            # If the current line doesn't start with an integer, append it to the previous line
            previous_line += stripped_line

In [51]:
for line in final_content_lines:
    
    comma_indices = [index for index, char in enumerate(line) if char == ',']

    # Extracting substrings between commas
    count_str = line[0:comma_indices[0]].strip()
    hate_str = line[comma_indices[0]+1:comma_indices[1]].strip()
    offensive_str = line[comma_indices[1]+1:comma_indices[2]].strip()
    neither_str = line[comma_indices[2]+1:comma_indices[3]].strip()
    classs_str = line[comma_indices[3]+1:comma_indices[4]].strip()
    tweet_str = line[comma_indices[4]+1:].strip()

    # Converting to integers
    count = int(count_str) if count_str.isdigit() else None
    hate = int(hate_str) if hate_str.isdigit() else None
    offensive = int(offensive_str) if offensive_str.isdigit() else None
    neither = int(neither_str) if neither_str.isdigit() else None
    classs = int(classs_str) if classs_str.isdigit() else None

    # Creating an instance of the Tweet class
    tweet_instance = Tweet(count, hate, offensive, neither, classs_str, tweet_str)

    # Append the tweet instance to a list or do whatever you need to do with it
    Tweets.append(tweet_instance)

In [52]:
for t in Tweets[:10]:
    print(t)

3 ; 0 ; 0 ; 3 ; 2 ;; !!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
3 ; 0 ; 3 ; 0 ; 1 ;; !!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
3 ; 0 ; 3 ; 0 ; 1 ;; !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3 ; 0 ; 2 ; 1 ; 1 ;; !!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
6 ; 0 ; 6 ; 0 ; 1 ;; !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
3 ; 1 ; 2 ; 0 ; 1 ;; "!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""
3 ; 0 ; 3 ; 0 ; 1 ;; "!!!!!!""@__BrighterDays: I can not just sit up and HATE on another bitch .. I got too much shit going on!"""
3 ; 0 ; 3 ; 0 ; 1 ;; !!!!&#8220

**Tweet formatting**

remove all points when it's not the end of a sentence

In [53]:
def remove_points(line):
    l = list(line)

    for i in range(len(l)):
        if i < len(l) - 2 and l[i] == '.' and 'A' <= l[i+2] <= 'Z':
            l[i] = ' '

    line = ' '.join(l)

apply first format

Let's change:
- all `@<username>` to `username`
- all URLs to `weblink`
- all `&amp` to `&`

In [54]:
for t in Tweets:
    t.tweet = re.sub(r'@([a-zA-Z0-9_]+)', 'username', t.tweet).replace('username:', '') # replace first username
    t.tweet = re.sub(r'http?://\S+', 'weblink', t.tweet)
    t.tweet = re.sub(r'&amp', '&', t.tweet)

    if t.tweet and t.tweet[0] == '.':
        t.tweet = t.tweet[1:]

In [55]:
for t in Tweets:
    remove_points(t.tweet)
    t.tweet = t.tweet   .replace('RT', '').replace('!', ' ').replace('"', '').replace("\n", ' ')\
                        .replace(';', ' ').replace('-', ' ').replace(' and ', ' & ').replace('\'', '')\
                        .replace('?', '.').replace(',', '').replace('~', ' ').replace('|', ' ').replace('°', ' ')\
                        .replace('`', ' ').replace('~', ' ').replace('*', ' ').replace('+', ' ').replace('/', ' ')\
                        .replace(' # ', ' ').replace('http', ' ').replace('t.co', ' ').replace('\\', ' ').replace('&#', ' ')
    for _ in range(4):
        t.tweet = t.tweet.replace('  ', ' ').replace('..', '.').replace(' .', '.') # remove multiple points & space
    
    if t.tweet and t.tweet[0] == ' ':
        t.tweet = t.tweet[1:]


for t in Tweets[:10]:
    print(t)


3 ; 0 ; 0 ; 3 ; 2 ;; As a woman you shouldnt complain about cleaning up your house. & as a man you should always take the trash out.
3 ; 0 ; 3 ; 0 ; 1 ;; boy dats cold.tyga dwn bad for cuffin dat hoe in the 1st place 
3 ; 0 ; 3 ; 0 ; 1 ;; username Dawg You ever fuck a bitch & she start to cry. You be confused as shit
3 ; 0 ; 2 ; 1 ; 1 ;; username she look like a tranny
6 ; 0 ; 6 ; 0 ; 1 ;; The shit you hear about me might be true or it might be faker than the bitch who told it to ya 57361 
3 ; 1 ; 2 ; 0 ; 1 ;;  The shit just blows me.claim you so faithful & down for somebody but still fucking with hoes 128514 128514 128514 
3 ; 0 ; 3 ; 0 ; 1 ;; I can not just sit up & HATE on another bitch. I got too much shit going on 
3 ; 0 ; 3 ; 0 ; 1 ;; 8220 cause Im tired of you big bitches coming for us skinny girls 8221 
3 ; 0 ; 3 ; 0 ; 1 ;; & you might not get ya bitch back & thats that 
3 ; 1 ; 2 ; 0 ; 1 ;; username :hobbies include: fighting Mariam


Change everything in **lowercase**

In [56]:
for t in Tweets:
    t.tweet = t.tweet.lower()

Change some **abbreviations**<br>
Here it is important not to change all abbreviations. some abbreviations are used by 50% of population. our model will think they are different if we don't change those 50% - 50% to a 100% unique expressions. But on the other hand, some are used by all the population (eg. `lol`) so we can keep them

In [57]:
abbr = {
    'ninstagram'        : 'instagram',
    'instagramgram'     : 'instagram',
    'ig'                : 'instagram',
    'strainstagramht'   : 'instagram',
    'insta'             : 'instagram',
    'rinstagramht'      : 'instagram',
    'ninstagramguh'     : 'instagram',
    'instagramz'        : 'instagram',
    'sinstagramn'       : 'instagram',
    'binstagramgest'    : 'instagram',
    'pinstagram'        : 'instagram',
    'linstagramht'      : 'instagram',
    'ninstagramg'       : 'instagram',
    'instagramh'        : 'instagram',
    'instagramnor'      : 'instagram',
    'ninstagramht'      : 'instagram',
    'ninstagramgramga'  : 'instagram',
    'finstagramht'      : 'instagram',
    'binstagram'        : 'instagram',
    'hinstagramh'       : 'instagram',
    'ninstagramga'      : 'instagram',
    'toninstagramht'    : 'instagram',
    'minstagramht'      : 'instagram',
    'minstagramt'       : 'instagram',
    'dwn'               : 'down',
    'dawn'              : 'down',
    'ta'                : 'that',
    'dat'               : 'that',
    'dawg'              : 'dude',
    'smh'               : 'head',
    'fr'                : 'real',
    'plz'               : 'please',
    'tf'                : 'wtf',
    'theyr'             : 'are',
    'bc'                : 'because',
    'af'                : 'lot',
    'u'                 : 'you',
    'ppl'               : 'people',
    'dm'                : 'message',
    'bf'                : 'friend',
    'gt'                : 'getting',
    'ya'                : 'yes',
    'na'                : 'no',
    'ur'                : 'your',
    'tryna'             : 'to',
    'lmfao'             : 'lmao',
    'ive'               : 'have'
}

Apply the modification 

In [58]:
for t in Tweets:
    for old in abbr:
        new = abbr[old]
        t.tweet = t.tweet.replace( ' ' + old + ' ', ' ' + new + ' ' ) # add some space arround the world to avoid matching a part of a word

Create a function that reduce the **repetitions** (people use to add a lot of letters of type a lot of emojis)<br>This reduce the data to handle

In [59]:
def reduce_repetition(s):
    # Use regular expression to find repeated substrings
    pattern = re.compile(r'(.+?)\1{%d,}' % 2)
    match = pattern.search(s)

    # Reduce repetition to two occurrences
    while match:
        repeated_substring = match.group(1)
        s = s.replace(match.group(), repeated_substring, 1)
        match = pattern.search(s)

    return s

In [60]:
for t in Tweets:
    t.tweet = reduce_repetition(t.tweet)

**Tokenisation**

In [61]:
for t in Tweets:
    t.tokens += nltk.word_tokenize(t.tweet)

In [62]:
for t in Tweets:
    for i in range(len(t.tokens) - 1, 0, -1):
        if len(t.tokens[i].strip()) == 0 or (len(t.tokens[i].strip()) == 1 and t.tokens[i].strip() != 'a' and t.tokens[i].strip() != 'i' and t.tokens[i].strip() != '&'):
            t.tokens.pop(i)

**Stemmatisation**

In [63]:
stemmer = SnowballStemmer('english')

let's **compare** original string aigains its stemmed version to see the difference and say if the **stemmatisation** is **effective**

Split according to multiple charachters

In [64]:
for t in Tweets:
    for tok in t.tokens:
        t.stems.append(stemmer.stem(tok))

In [65]:
for t in Tweets:
    for i in range(len(t.stems) - 1, 0, -1):
        if len(t.stems[i].strip()) == 0:
            t.stems.pop(i)
        elif len(t.stems[i].strip()) == 1:
            if 'a' < t.stems[i].strip() <= 'z'and t.stems[i].strip() != 'i':
                t.stems.pop(i)

In [66]:
for t in Tweets:
    for i in range(len(t.stems) - 1, 0, -1):
        arr = t.stems[i].split('.')
        t.stems[i] = arr[0]
        for j in range(1, len(arr)):
            t.stems.insert(i+j, arr[j])

Re-clean one time

In [67]:
for t in Tweets:
    for i in range(len(t.stems)):
        if t.stems[i] in abbr.keys():
            t.stems[i] = abbr[t.stems[i]]

**Dependency parsing**

Example

In [68]:
nlp = spacy.load("en_core_web_sm")

In [69]:
print(Tweets[12].tweet)
print(Tweets[12].tokens)
print(Tweets[12].stems, "\n")

doc = nlp(' '.join(Tweets[12].stems))
for token in doc:
    print(f"{token.text}: {token.dep_} -> {token.head.text}")

so hoes that smoke are losers. yea. go on ig
['so', 'hoes', 'that', 'smoke', 'are', 'losers', 'yea', 'go', 'on', 'ig']
['so', 'hoe', 'that', 'smoke', 'are', 'loser', 'yea', 'go', 'on', 'instagram'] 

so: advmod -> hoe
hoe: ROOT -> hoe
that: mark -> are
smoke: nsubj -> are
are: ccomp -> hoe
loser: amod -> yea
yea: compound -> go
go: attr -> are
on: prep -> go
instagram: pobj -> on


In [70]:
# displacy.serve(doc, style="dep")

This allow us to know the role of a word in the sentence<br>
let's store those information

In [71]:
for t in Tweets:
    doc = nlp(' '.join(t.stems))
    for token in doc:
        t.tags.append(token.dep_)

List the most used words

In [72]:
TRESHOLD = 3
code_table = []

We will record words if they are used more than `TRESHOLD`

In [73]:
used_w = {}

for t in Tweets:
    for s in t.stems:
        if not s in used_w.keys():
            used_w[s] =  1
        else:
            used_w[s] += 1

In [74]:
for k, v in sorted(used_w.items(), key=lambda item: item[1], reverse=True):
    if v < TRESHOLD:
        break
    code_table.append(k)

In [75]:
print(len(code_table))

5707


**Encode-Decode** functions

In [76]:
def enc(word):
    index = -1
    try:
        index = code_table.index(word)
    except ValueError:
        index = -1
    
    return index

In [77]:
def dec(index):
    if index >= len(code_table) or index < 0: return ''
    return code_table(index)

In [78]:
for t in Tweets:
    for s in t.stems:
        t.code.append(enc(s))

Let's **encode** also the **tags** for the words. It will be **easier** to **handle**

In [79]:
d_tags = {}

for t in Tweets:
    for g in t.tags:
        if not g in d_tags.keys():
            d_tags[g]  = 1
        else:
            d_tags[g] += 1

**Note**: Here it's not necessary to **sort** the tags usage count but it's not slowing the learning process

In [80]:
tag_code_table = []
for k, v in sorted(d_tags.items(), key=lambda item: item[1], reverse=True):
    tag_code_table.append(k)

In [81]:
print(tag_code_table)

['nsubj', 'compound', 'ROOT', 'det', 'dobj', 'prep', 'pobj', 'amod', 'advmod', 'aux', 'ccomp', 'npadvmod', 'nummod', 'conj', 'poss', 'cc', 'attr', 'neg', 'advcl', 'appos', 'relcl', 'mark', 'xcomp', 'acomp', 'intj', 'prt', 'nmod', 'dep', 'punct', 'auxpass', 'dative', 'nsubjpass', 'predet', 'oprd', 'parataxis', 'pcomp', 'csubj', 'acl', 'expl', 'quantmod', 'preconj', 'agent', 'meta', 'case', 'csubjpass']


In [82]:
def t_enc(tag):
    index = -1
    try:
        index = tag_code_table.index(tag)
    except ValueError:
        index = -1
    
    return index

In [83]:
def t_dec(index):
    if index >= len(tag_code_table) or index < 0: return ''
    return tag_code_table(index)

Record that data

In [84]:
for t in Tweets:
    for g in t.tags:
        t.t_code.append(t_enc(g))

Here's how our data looks so far!

In [85]:
table = PrettyTable(["TAG", "tag CODE", "STEM", "stem CODE"])

for t in Tweets[:5]:
    for i in range(len(t.stems)):
        table.add_row([t.tags[i], t.t_code[i], t.stems[i], t.code[i]])
    table.add_row(['-', '-', '-', '-'])

print(table)

+----------+----------+----------+-----------+
|   TAG    | tag CODE |   STEM   | stem CODE |
+----------+----------+----------+-----------+
|   prep   |    5     |    as    |     77    |
|   det    |    3     |    a     |     1     |
|   pobj   |    6     |  woman   |    369    |
|  nsubj   |    0     |   you    |     4     |
|   aux    |    9     | shouldnt |    990    |
|   neg    |    17    | complain |    717    |
|   ROOT   |    2     |  about   |     60    |
|   prep   |    5     |  clean   |    645    |
|  pcomp   |    35    |    up    |     35    |
|   prt    |    25    |   your   |     20    |
|   poss   |    14    |   hous   |    306    |
|   dobj   |    4     |    &     |     7     |
|    cc    |    15    |    as    |     77    |
|   prep   |    5     |    a     |     1     |
|   det    |    3     |   man    |     88    |
|   pobj   |    6     |   you    |     4     |
|  nsubj   |    0     |  should  |    213    |
|   aux    |    9     |  alway   |    163    |
|  advmod  | 

Time for making a model !

Here are my ideas for the input of the model 

1. An array of this type:
  - [$stem_0$, $tag_0$, $stem_1$, $tag_1$, $stem_2$, $tag_2$ ... $stem_n$, $tag_n$]

2. Or an array of this type: 
  - [$stem_0$, $stem_1$, $stem_2$ ... $stem_n$, $tag_0$, $tag_1$, $tag_2$ ... $tag_n$]

3. Or two input arrays:
  - stem code
  - tag code

**Keras** offers us the `Embedding` layer with those arguments:

- input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

- output_dim: Integer. Dimension of the dense embedding.

- embeddings_initializer: Initializer for the embeddings matrix (see keras.initializers).

- embeddings_regularizer: Regularizer function applied to the embeddings matrix (see keras.regularizers).

- embeddings_constraint: Constraint function applied to the embeddings matrix (see keras.constraints).

- mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out.

This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

> Source: [Keras' documentation](https://keras.io/api/layers/core_layers/embedding/)

We have to figure the maximum integer index. in the 1st and 2nd case, it is equals to two times the number of words.

In [89]:
max = 0
for t in Tweets:
    if max < len(t.stems):
        max = len(t.stems)

print("max   =", max, "\n2*max =", 2*max)

max   = 67 
2*max = 134


A maximal input dimension of **500** seems enough and will allow the user to input a large text and **compute** it in the **model**