# **A Hadoop based platform for natural language processing of web pages and documents**
<p>This project presents a Hadoop-based platform for natural language processing of offensive tweets.</p>
<p>The goal is to run a WordCount over a specific set of words, including multi-word expressions, within a dataset. The data used are openly available for research purposes.</p>

In [None]:
# download Hadoop from the Apache archive and
# copy it to /usr/local in the Colab environment

!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

!tar -xzvf hadoop-3.3.0.tar.gz

!cp -r hadoop-3.3.0/ /usr/local/

# Java configuration on Google Colab
# setting these environment variables makes Hadoop easier to run
# to do so, we use the following Python script

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.0"

Creating a directory to store the files that will be processed with Hadoop

In [None]:
!$HADOOP_HOME/bin/hadoop fs -mkdir hadoop_data

Downloading the selected datasets

In [None]:
# the URLs are quoted so that the shell does not treat "&" as a background operator
!wget -O OLIDv1.0.zip "https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip?attredirects=0&d=1"
!wget -O bad-words.zip "https://drive.google.com/u/0/uc?id=1-ybjDIfP8pv81uZHfwKL5D1hZQn3cJB7&export=download"

Unzipping the downloaded files

In [None]:
!unzip OLIDv1.0.zip -d OLID_files
!unzip bad-words.zip -d OLID_files

Selecting the data of interest

In [None]:
import pandas as pd
print(pd.read_csv('OLID_files/olid-training-v1.0.tsv', sep='\t'))

          id                                              tweet subtask_a  \
0      86426  @USER She should ask a few native Americans wh...       OFF   
1      90194  @USER @USER Go home you’re drunk!!! @USER #MAG...       OFF   
2      16820  Amazon is investigating Chinese employees who ...       NOT   
3      62688  @USER Someone should'veTaken" this piece of sh...       OFF   
4      43605  @USER @USER Obama wanted liberals &amp; illega...       NOT   
...      ...                                                ...       ...   
13235  95338  @USER Sometimes I get strong vibes from people...       OFF   
13236  67210  Benidorm ✅  Creamfields ✅  Maga ✅   Not too sh...       NOT   
13237  82921  @USER And why report this garbage.  We don't g...       OFF   
13238  27429                                        @USER Pussy       OFF   
13239  46552  #Spanishrevenge vs. #justice #HumanRights and ...       NOT   

      subtask_b subtask_c  
0           UNT       NaN  
1           TIN    

In [None]:
print(pd.read_csv('OLID_files/bad-words.csv'))

             jigaboo
0     mound of venus
1           asslover
2                s&m
3              queaf
4         whitetrash
...              ...
1611           cocky
1612     transsexual
1613      unfuckable
1614      bestiality
1615      cocklicker

[1616 rows x 1 columns]


In [None]:
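import pandas as pd

# keep only the tweet text from the OLID training set
tweets = pd.read_csv('OLID_files/olid-training-v1.0.tsv', sep='\t')['tweet']

# bad-words.csv has no header row; header=None keeps its first entry
# ('jigaboo') as data instead of consuming it as the column name
badwords = pd.read_csv('OLID_files/bad-words.csv', header=None)[0]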


Installing a library to transliterate words containing non-ASCII characters

In [None]:
!pip install unidecode
from unidecode import unidecode
import re

Processing the tweets: converting every word to lower case and keeping only letters.

In [None]:
def clean_tweet(text):
  # transliterate to ASCII and convert to lower case
  text = unidecode(text).lower()
  # drop the anonymized @USER mentions, then keep letters only
  text = re.sub(r'@user\b', ' ', text)
  return re.sub(r'[^a-z]', ' ', text)

tweets = tweets.apply(clean_tweet)
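To see the effect of the cleaning step on a single tweet, here is a minimal sketch. The sample text, the function name `normalize_text`, the regular expression used for the @USER placeholder, the final whitespace collapse and the standard-library fallback for `unidecode` are all assumptions of this sketch, not part of the notebook's pipeline.

```python
import re
import unicodedata

try:
    from unidecode import unidecode  # the library installed in this notebook
except ImportError:
    # Fallback (assumption): a rough stdlib stand-in that only strips accents;
    # it does not cover everything unidecode handles.
    def unidecode(text):
        decomposed = unicodedata.normalize('NFKD', text)
        return decomposed.encode('ascii', 'ignore').decode('ascii')


def normalize_text(text):
    """Lower-case a tweet, drop @USER mentions and keep letters only."""
    text = unidecode(text).lower()
    text = re.sub(r'@user\b', ' ', text)   # anonymized mentions
    text = re.sub(r'[^a-z]', ' ', text)    # everything that is not a letter
    return ' '.join(text.split())          # collapse repeated spaces


sample = '@USER Go home, you are DRUNK!!! #MAGA café'
cleaned = normalize_text(sample)
print(cleaned)
```

Punctuation, the hashtag symbol and the accent are removed, so the later word matching only has to deal with lower-case ASCII words separated by spaces.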

Generating one text file per dataset, containing the data of interest.

In [None]:
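# one cleaned tweet per line; the with block closes the file automatically
with open('OLID_files/tweets.txt', 'w') as f:
  for line in tweets:
    f.write(line)
    f.write('\n')

# one target word (or multi-word expression) per line
with open('OLID_files/badwords.txt', 'w') as f:
  for badword in badwords:
    f.write(badword)
    f.write('\n')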

In [None]:
# !rm hadoop_data/badwords.txt
# !rm hadoop_data/tweets.txt

Copying the files to the Hadoop file system

In [None]:
!$HADOOP_HOME/bin/hadoop fs -put /content/OLID_files/tweets.txt hadoop_data/
!$HADOOP_HOME/bin/hadoop fs -put /content/OLID_files/badwords.txt hadoop_data/

Checking that the copies were created successfully.

In [None]:
!$HADOOP_HOME/bin/hadoop fs -ls hadoop_data/

!$HADOOP_HOME/bin/hadoop fs -tail hadoop_data/tweets.txt
!$HADOOP_HOME/bin/hadoop fs -tail hadoop_data/badwords.txt

Found 3 items
drwxr-xr-x   - root root       4096 2022-08-31 13:59 hadoop_data/.ipynb_checkpoints
-rw-r--r--   1 root root      13649 2022-08-31 14:57 hadoop_data/badwords.txt
-rw-r--r--   1 root root    1476392 2022-08-31 14:57 hadoop_data/tweets.txt
z dost and how much she needs that touch to comfort her restless head right now is all evident in this one freaking scene  bow to these amazing actors  jenshad is major actors and couple goals  adiya  bepannaah
billy you have a short memory  obama tried to get in commonsense gun control is especially after sandyhook  the parents even came in and begged congress to do something about automatic weapons  but the nra had such a hold on congress democrats and repugs nothing was done
but gun control   
she is not the brightest light on the tree 
 if i say you are mad now you will say i m tired of you 
retweet complete  amp  followed all patriots 
sometimes i get strong vibes from people and this man s vibe is tens of millions of murders   he is

Granting execute permission to the mapper and reducer scripts.

In [None]:
!chmod 755 mapper.py reducer.py

In [None]:
# !rm -r output

Running the mapper and reducer scripts through Hadoop Streaming, where they replace Hadoop's native map and reduce functions so the job can run on a Hadoop cluster

In [None]:
!$HADOOP_HOME/bin/hadoop jar /content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /content/hadoop_data/tweets.txt -output /content/output

2022-08-31 14:57:37,188 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-08-31 14:57:37,591 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-08-31 14:57:37,592 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-08-31 14:57:37,662 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-08-31 14:57:37,971 INFO mapred.FileInputFormat: Total input files to process : 1
2022-08-31 14:57:37,998 INFO mapreduce.JobSubmitter: number of splits:1
2022-08-31 14:57:38,317 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local2128545062_0001
2022-08-31 14:57:38,317 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-08-31 14:57:38,786 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local2128545062_0001_94b81df5-a542-4fbc-987d-2cb33b327525/mapper.py
2022-08-31 14:57:38,829 INFO mapred.LocalDistributedCacheMa
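The data flow of a Streaming job (map, sort by key, reduce) can be reproduced locally on a handful of lines before a job is submitted. The sketch below is a simplified stand-in, not the notebook's scripts: `toy_mapper`, `toy_reducer`, `run_local_job` and the sample sentences are hypothetical names and data, it only matches single words, and Python's `sorted` plays the role of Hadoop's shuffle-and-sort phase.

```python
def toy_mapper(lines, targets):
    """Emit 'word<TAB>1' for every target word found in the input lines."""
    for line in lines:
        for word in line.split():
            if word in targets:
                yield '%s\t1' % word


def toy_reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split('\t', 1)
        if key != current:
            if current is not None:
                yield '%s\t%d' % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield '%s\t%d' % (current, total)


def run_local_job(lines, targets):
    """Map, sort by key, then reduce, mirroring the Streaming data flow."""
    mapped = toy_mapper(lines, targets)
    shuffled = sorted(mapped)  # stands in for Hadoop's shuffle-and-sort
    return list(toy_reducer(shuffled))


tweets_sample = ['gun control now', 'no more gun violence', 'violence is bad']
result = run_local_job(tweets_sample, {'gun', 'violence'})
print(result)
```

Each stage only exchanges tab-separated text lines with the next one, which is the same contract Hadoop Streaming imposes on `mapper.py` and `reducer.py` through STDIN and STDOUT.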

Printing the job output

In [None]:
!$HADOOP_HOME/bin/hadoop fs -cat output/part-00000

abortion	39
abuse	37
addict	4
addicts	1
adult	11
africa	6
african	9
amateur	1
american	131
anal	1
angie	2
angry	18
arsehole	1
asian	5
ass	183
assassin	1
assassinate	3
assassination	6
assault	65
asses	7
asshole	20
assholes	9
attack	63
australian	8
babe	8
babies	24
balls	12
banging	1
baptist	2
barf	1
bastard	5
beast	7
bi	4
bible	18
bigger	16
bimbos	1
bitch	100
bitches	15
bitching	3
bitchy	1
black	141
blackout	2
blacks	10
blind	19
blow	15
bomb	11
bombs	3
bondage	1
boner	2
boob	1
boobies	1
boobs	7
boom	3
booty	9
breast	1
brothel	1
bullcrap	1
bullshit	56
bunga	1
buried	3
burn	19
butt	24
butthead	2
buttmunch	1
canadian	18
cancer	17
catholic	13
catholics	4
chin	2
chinese	13
christ	12
christian	33
church	27
cigarette	1
cock	7
cocks	2
cocktail	1
cocky	1
color	24
colored	3
commie	9
communist	24
condom	1
conservative	115
conspiracy	17
cornhole	1
corruption	28
crack	4
crap	47
crapper	1
crappy	3
crash	7
crime	69
crimes	29
criminal	43
criminals	57
cum	5
cumm	1
cumming	2
cunt	6
dammit	2
damn	60
dead	

Reading the Hadoop output and sorting the words in descending order of frequency.

In [None]:
# read the reducer output, one "word<TAB>count" pair per line
with open('/content/output/part-00000', 'r') as f:
  txt = {word.strip() for word in f}

badword = []
qtd = []

for item in txt:
  bd_aux, qtd_aux = item.split('\t')
  badword.append(bd_aux)
  qtd.append(int(qtd_aux))

# build the table and sort it by count, most frequent first
data = pd.DataFrame({'Badwords': badword, 'Count': qtd}).sort_values(by='Count', ascending=False)
print(data.head(10))

     Badwords Count
320       gun  1378
346      shit   361
325   liberal   222
392       god   197
15       fuck   183
229       ass   183
95    fucking   147
60      black   141
7    violence   138
366  american   131
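The same ranking can be obtained without building a DataFrame. This is a minimal sketch using only the standard library; the helper name `top_words` is hypothetical, and the three sample lines reuse counts that appear in the job output above.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from reducer output lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last tab so multi-word keys stay intact
        word, count = line.rsplit('\t', 1)
        counts[word] += int(count)
    return counts.most_common(n)


sample_output = ['abuse\t37', 'gun\t1378', 'american\t131', '']
ranking = top_words(sample_output, n=2)
print(ranking)
```

`Counter.most_common` sorts by count in descending order, so it plays the role of `sort_values(by='Count', ascending=False)` followed by `head`.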


Mapper

In [None]:
'''#!/usr/bin/env python

# sys gives access to STDIN and STDOUT, which Hadoop Streaming uses
# to feed input to the mapper and to collect its output
import sys

# load the target words (single words and multi-word expressions)
with open('/content/hadoop_data/badwords.txt', 'r') as f:
    bad_words = {word.strip() for word in f}

# read the input line by line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace and split the line into words
    words = line.strip().split()

    # at each position, build the 3-, 2- and 1-word sequences that start
    # there and write the sequence, a tab and the count 1 to STDOUT
    # (standard output) for every sequence found in the list;
    # what we output here will be the input for the
    # Reduce step, i.e. the input for reducer.py
    for i in range(len(words)):
        for n in (3, 2, 1):
            ngram = ' '.join(words[i:i + n])
            if i + n <= len(words) and ngram in bad_words:
                print('%s\t%s' % (ngram, 1))'''
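To look at the multi-word matching in isolation, the sketch below checks every sequence of one to three consecutive words against a set. The function name `match_ngrams`, the `max_n` parameter and the small vocabulary are hypothetical and chosen only for illustration; the sentence is adapted from one of the cleaned tweets shown earlier.

```python
def match_ngrams(words, vocabulary, max_n=3):
    """Return every 1- to max_n-word sequence of `words` found in `vocabulary`."""
    matches = []
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            if start + n > len(words):
                break  # not enough words left for a longer sequence
            candidate = ' '.join(words[start:start + n])
            if candidate in vocabulary:
                matches.append(candidate)
    return matches


# Hypothetical vocabulary with one single word and two multi-word entries
vocabulary = {'gun', 'gun control', 'short memory'}
words = 'billy you have a short memory about gun control'.split()
matches = match_ngrams(words, vocabulary)
print(matches)
```

Because membership in a set is tested in constant time on average, the cost grows with the number of words in the line rather than with the size of the word list. Note that overlapping entries are all counted: here both `gun` and `gun control` are reported for the same position.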

Reducer

In [None]:
'''#!/usr/bin/env python

######################## REDUCER ############################
import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    try:
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word, if there was any input
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
'''
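The reducer relies on its input being sorted by key. The following sketch makes that dependency visible with `itertools.groupby`, which, like the reducer, only merges adjacent lines. The function name `reduce_counts` and the sample lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(lines):
    """Sum the counts of consecutive 'key<TAB>count' lines that share a key."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(int(count) for _, count in group)


mapper_output = ['gun\t1', 'crime\t1', 'gun\t1']

# Unsorted input: the two 'gun' lines are not adjacent, so they are not merged.
unsorted_result = list(reduce_counts(mapper_output))

# Sorted input, as Hadoop delivers it: each key appears exactly once.
sorted_result = list(reduce_counts(sorted(mapper_output)))

print(unsorted_result)
print(sorted_result)
```

On unsorted input the key `gun` is emitted twice with a partial count each time, which is why the sort performed between the map and reduce phases matters for this style of reducer.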