#TFM - Analysis of the evolution of fake news on Twitter

##Phase 02 - Text to Numbers

In this phase we process the text of the tweets and pass it through the model to obtain the embeddings.
We also try to locate the oldest tweet that mentions this piece of fake news.


##Sources
###Sentence-BERT
- [Sentence Embeddings with BERT & XLNet](https://pythonrepo.com/repo/UKPLab-sentence-transformers-python-natural-language-processing)
- [Quickstart Sentence-BERT](https://www.sbert.net/docs/quickstart.html)
- [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html)

###BERT Twitter analysis
- [How I used Bidirectional Encoder Representations from Transformers (BERT) to Analyze Twitter Data](https://analyticsindiamag.com/how-i-used-bidirectional-encoder-representations-from-transformers-bert-to-analyze-twitter-data/)
- [Hands-On Guide to Download, Analyze and Visualize Twitter Data](https://analyticsindiamag.com/hands-on-guide-to-download-analyze-and-visualize-twitter-data/)
- [Guide To Pysentimiento Toolkit | Text Classification Using Transformers](https://analyticsindiamag.com/guide-to-pysentimiento-toolkit-text-classification-using-transformers)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)

###Sentence-BERT models for Spanish texts
- [Multi-Lingual Models](https://www.sbert.net/docs/pretrained_models.html)
- [BETO: Spanish BERT](https://medium.com/dair-ai/beto-spanish-bert-420e4860d2c6)
- [BETO: Spanish BERT on GitHub](https://github.com/dccuchile/beto)

@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}

###Distance calculation
- [scipy.spatial.distance.cosine](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html)
- [Cosine Similarity – Understanding the math and how it works (with python codes)](https://www.machinelearningplus.com/nlp/cosine-similarity/)
- [Document Similarity](https://shravan-kuchkula.github.io/document_similarity/#)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)

###CSV
- [CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

###Development help
- https://stackoverflow.com/questions/62710872/how-to-store-word-vector-embeddings
- https://stackoverflow.com/questions/66537949/convert-twitter-new-date-format-to-date-time-y-m-d-hms
- https://stackoverflow.com/questions/8200342/removing-duplicate-strings-from-a-list-in-python



In [1]:
#################################################
# Install sentence-transformers
#################################################
import datetime  # used to timestamp the log messages

# Install (or upgrade) the sentence-transformers package
!pip install -U sentence-transformers

print('\n\nInstalación realizada a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)

In [37]:
#################################################
# Importing Libraries
#################################################
# Sentence Transformers: model loading and similarity utilities
from sentence_transformers import SentenceTransformer, util

# Needed to load the JSON files that contain the tweets
import json

# NumPy (used here to store/load embeddings in the cloud)
import numpy as np

# Note: we import the `datetime` module, not the `datetime` type inside it,
# so timestamps are obtained with datetime.datetime.now()
###from datetime import datetime

# Google Drive access from Colab
from google.colab import drive

import os
import pandas as pd
import math

import importlib.util
import sys

import datetime  # date handling
print('Librerias cargadas a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Librerias cargadas a las  08 Oct 2021 - 07:45:30 ...


In [38]:
######################################################
# Mount google drive and use folder of data
######################################################
drive.mount('/content/drive/')
BASE_FOLDER = '/content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/'



print('\nDRIVE montada a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).

DRIVE montada a las  08 Oct 2021 - 07:45:38 ...


In [39]:
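#################################################################
# Replace or remove selected Unicode characters
#
# https://en.wikipedia.org/wiki/List_of_Unicode_characters
# https://apps.timwhitlock.info/emoji/tables/unicode
#################################################################

# Translation table built once: code point -> replacement (None deletes the character)
UNICODE_FILTER_TABLE = {
    0x0015: "'",   # NOTE: U+0015 is a control character; U+2019 (right single quote,
                   # printed in the debug code below) may have been the intended one
    0x2026: None,  # horizontal ellipsis
    0x2015: " ",   # horizontal bar
}
# Remove emojis and other pictographs in the range U+1F195..U+1F918
UNICODE_FILTER_TABLE.update({x: None for x in range(127381, 129305)})

def UnicodeFilter(var):
    # str.translate applies every substitution in a single pass over the text
    return str(var.translate(UNICODE_FILTER_TABLE))
#end_def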

#DEBUG code
##print(chr(0x0015))
##print(chr(0x2026))
##print(chr(0x2019)) 
##for x in range(127381, 129305):
##    print(chr(x))

print('Método creado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Método creado a las  08 Oct 2021 - 07:45:41 ...
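
The code point range used in `UnicodeFilter` (127381 up to, but not including, 129305) covers part of the emoji blocks, not every symbol that can appear in a tweet. The following is a small sketch to check which characters fall inside that range; `in_filter_range` and the sample dictionary are hypothetical helpers written for illustration and are not part of the notebook.

```python
FILTER_START = 127381  # U+1F195, first code point removed by the filter
FILTER_STOP = 129305   # exclusive upper bound, as in range(127381, 129305)

def in_filter_range(char):
    """Return True if the character's code point lies inside the removed range."""
    return FILTER_START <= ord(char) < FILTER_STOP

# Sample characters (chosen for illustration) that occur in the tweets
samples = {
    'red circle': '\U0001F534',            # U+1F534
    'regional indicator I': '\U0001F1EE',  # first half of a flag emoji
    'heavy large circle': '\u2B55',        # U+2B55
    'play button': '\u25B6',               # U+25B6
}

results = {name: in_filter_range(ch) for name, ch in samples.items()}
for name, inside in results.items():
    print(f'{name}: {"removed" if inside else "kept"}')
```

Symbols outside the range, such as U+2B55 and U+25B6, are left untouched by the filter, which is consistent with the preprocessed texts printed further down, where they still appear.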


In [40]:
#################################################
# Lightly preprocess the data in order to get
# the tweet text normalized for the Spanish
# language
#
# TODO: can be improved
#################################################
def preprocess_tweets_text_v2(tweets_df):

    # Apply the Unicode filter to every tweet text (the dataframe is modified in place)
    tweets_df["text"] = tweets_df["text"].apply(UnicodeFilter)

    # Get the texts as a list, in the same order as the dataframe rows
    sentences_processed_v2 = tweets_df['text'].tolist()

    return tweets_df, sentences_processed_v2
#end_def


print('Método creado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Método creado a las  08 Oct 2021 - 07:45:44 ...


In [41]:
#################################################
# Load tweets from a JSON file and store them
# in a dataframe
#
# TODO: can be improved
#################################################
def load_tweets_from_json(json_for_load, process_jsons):

    if process_jsons:
        ###########################################
        # Load JSON Tweets in a DataFrame
        with open(json_for_load) as f:
            data = json.load(f)
            ##print(data)

        # JSON to DataFrame
        df = pd.DataFrame.from_dict(data, orient='columns')

        # Keep only the columns we need and rename some of them.
        # rename() returns a new dataframe, so no slice of df is modified in place.
        full_tweets_frame = df[["id","created_at","full_text","retweet_count","favorite_count"]].rename(
            columns={'id': 'tweet_id', 'full_text': 'text'})

        # Show first N rows
        ##selected_columns.head(15)

        print('\nDatos cargados desde JSONs '+json_for_load+' a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

        return full_tweets_frame
    else:
        print('\nNot working with data stored in JSON files')
        return None
    #end_if
#end_method

In [42]:
#################################################
# Tweets data load from JSONs on Dataframe
#################################################
if 'full_cumvieja_tweets_frame' in globals():
  del full_cumvieja_tweets_frame

if 'full_actualidad_tweets_frame' in globals():
  del full_actualidad_tweets_frame

if 'full_megatun_frame' in globals():
  del full_megatun_frame

if 'full_megatun_frame2' in globals():
  del full_megatun_frame2

if 'full_megatun_frame3' in globals():
  del full_megatun_frame3

if 'full_megatun_frame4' in globals():
  del full_megatun_frame4

###########################################
# Load Cumbre Vieja Tweets in a DataFrame
need2load_jsons = True
full_cumvieja_tweets_frame = load_tweets_from_json(BASE_FOLDER+'20211004_twits_cumbre_vieja.json', need2load_jsons)

###########################################
# Load actualidad/noticias/ciencia Tweets in a DataFrame
need2load_jsons = True
full_actualidad_tweets_frame = load_tweets_from_json(BASE_FOLDER+'20211008_tweets_actualidad.json', need2load_jsons)

###########################################
# Load Mega-Tsunami Tweets in a DataFrame
need2load_jsons = True
full_megatun_frame = load_tweets_from_json(BASE_FOLDER+'20210922_twits_Mega-Tsunami.json', need2load_jsons)

###########################################
# Load Megatsunami Tweets in a DataFrame
need2load_jsons = True
full_megatun_frame2 = load_tweets_from_json(BASE_FOLDER+'20210922_twits_megatsunami.json', need2load_jsons)

###########################################
# Load Megatsunami Tweets in a DataFrame
need2load_jsons = True
full_megatun_frame3 = load_tweets_from_json(BASE_FOLDER+'20211001_twits_Mega-Tsunami.json', need2load_jsons)

###########################################
# Load Megatsunami Tweets in a DataFrame
need2load_jsons = True
full_megatun_frame4 = load_tweets_from_json(BASE_FOLDER+'20211001_twits_megatsunami.json', need2load_jsons)

display('Numero total de tweets en df0: '+str(len(full_cumvieja_tweets_frame)))
display('Numero total de tweets en df1: '+str(len(full_actualidad_tweets_frame)))
display('Numero total de tweets en df2: '+str(len(full_megatun_frame)))
display('Numero total de tweets en df3: '+str(len(full_megatun_frame2)))
display('Numero total de tweets en df4: '+str(len(full_megatun_frame3)))
display('Numero total de tweets en df5: '+str(len(full_megatun_frame4)))

total = len(full_cumvieja_tweets_frame)+ len(full_actualidad_tweets_frame)+ len(full_megatun_frame) + len(full_megatun_frame2) + len(full_megatun_frame3)+ len(full_megatun_frame4)
display('Numero total de tweets a tratar: '+str(total))

print('\nTodos los datos cargados desde JSONs a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')    





Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20211004_twits_cumbre_vieja.json a las  08 Oct 2021 - 07:45:54 ...

Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20211008_tweets_actualidad.json a las  08 Oct 2021 - 07:45:55 ...

Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20210922_twits_Mega-Tsunami.json a las  08 Oct 2021 - 07:45:55 ...

Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20210922_twits_megatsunami.json a las  08 Oct 2021 - 07:45:55 ...

Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20211001_twits_Mega-Tsunami.json a las  08 Oct 2021 - 07:45:55 ...

Datos cargados desde JSONs /content/drive/My Drive/Colab Notebooks/09_TFM/c_vieja_data/20211001_twits_megatsunami.json a las  08 Oct 2021 - 07:45:55 ...


'Numero total de tweets en df0: 2000'

'Numero total de tweets en df1: 4000'

'Numero total de tweets en df2: 623'

'Numero total de tweets en df3: 535'

'Numero total de tweets en df4: 334'

'Numero total de tweets en df5: 1190'

'Numero total de tweets a tratar: 8682'


Todos los datos cargados desde JSONs a las  08 Oct 2021 - 07:45:55 ...


In [43]:
####################################
# Merge of all dataframes into one

if 'full_tweets_frame' in globals():
  del full_tweets_frame

# Concatenate all the dataframes; the original row indices are kept here and reset later.
# DataFrame.append is deprecated and was removed in pandas 2.0, so pd.concat is used.
full_tweets_frame = pd.concat([
    full_cumvieja_tweets_frame,
    full_actualidad_tweets_frame,
    full_megatun_frame,
    full_megatun_frame2,
    full_megatun_frame3,
    full_megatun_frame4,
])

display('Numero total de tweets en dataframe final: '+str(len(full_tweets_frame)))


#Remove previous dataframe in order to release memory and avoid problems
del full_cumvieja_tweets_frame
del full_actualidad_tweets_frame
del full_megatun_frame
del full_megatun_frame2
del full_megatun_frame3
del full_megatun_frame4


print('\nDatos mezclados en un único JSONs a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')   

'Numero total de tweets en dataframe final: 8682'


Datos mezclados en un único JSONs a las  08 Oct 2021 - 07:46:03 ...


In [44]:
#################################################
# Review Dataframe tweets
#################################################
show_detailed_feedback = True

display('Numero total de tweets: '+str(len(full_tweets_frame)))

if show_detailed_feedback:
    display(full_tweets_frame.describe())
    display(full_tweets_frame.columns)


##full_tweets_frame.index.tolist()

'Numero total de tweets: 8682'

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,8682.0,8682.0,8682.0
mean,1.444445e+18,179.303847,1.089611
std,2233407000000000.0,820.752198,11.217521
min,1.438543e+18,0.0,0.0
25%,1.442235e+18,1.0,0.0
50%,1.444851e+18,14.0,0.0
75%,1.446348e+18,105.0,0.0
max,1.446358e+18,46572.0,513.0


Index(['tweet_id', 'created_at', 'text', 'retweet_count', 'favorite_count'], dtype='object')

In [45]:
#################################################
# Tweets pre-processing
#################################################

# PRE-PROCESS on own dataframe
# We add a column with tweet text char number
###full_tweets_frame['tweet_chars_number'] = full_tweets_frame['text'].str.len()

# Remove duplicated tweets (same text)
full_tweets_frame = full_tweets_frame.drop_duplicates(subset=['text'])

# Add a new column for the cosine score to the dataframe
full_tweets_frame['cosine_scores'] = 0.0

# Sort tweets by their creation date
# NOTE: sort_values returns a new dataframe and the result is not assigned here,
# so the row order is left unchanged. 'created_at' is also still a string at this
# point, so it would sort alphabetically rather than chronologically.
full_tweets_frame.sort_values('created_at')

# Reset index
##full_tweets_frame.reset_index(drop=True, inplace=True)
##full_tweets_frame.set_index('tweet_id', inplace=True)
full_tweets_frame.reset_index(drop=True, inplace=True)

# Show first N rows
full_tweets_frame.head(15)

print('Tweets pre-procesing hecho a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')    


Tweets pre-procesing hecho a las  08 Oct 2021 - 07:46:47 ...
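
Twitter's `created_at` strings, such as `Mon Oct 04 05:15:34 +0000 2021`, do not sort chronologically when compared as text, which matters when looking for the oldest tweet. The following is a minimal sketch, using only the standard library, of how they could be parsed and sorted. The three records reuse IDs and dates that appear in this notebook's output; `parse_created_at` is a hypothetical helper, and the format string assumes English day and month names (the default C locale).

```python
import datetime as dt

TWITTER_DATE_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_created_at(value):
    """Parse a Twitter 'created_at' string into a timezone-aware datetime."""
    return dt.datetime.strptime(value, TWITTER_DATE_FORMAT)

tweets = [
    {'tweet_id': 1444893940746964995, 'created_at': 'Mon Oct 04 05:15:34 +0000 2021'},
    {'tweet_id': 1441848236768067586, 'created_at': 'Sat Sep 25 19:33:02 +0000 2021'},
    {'tweet_id': 1443007736195817481, 'created_at': 'Wed Sep 29 00:20:28 +0000 2021'},
]

# Chronological order: sort on the parsed datetime
by_date = sorted(tweets, key=lambda t: parse_created_at(t['created_at']))
oldest = by_date[0]

# Text order: sorts on the weekday name first, which is not chronological
by_text = sorted(tweets, key=lambda t: t['created_at'])

print('Oldest by date:', oldest['tweet_id'], oldest['created_at'])
print('First by text :', by_text[0]['tweet_id'], by_text[0]['created_at'])
```

With a dataframe, the same idea would be to convert the column to datetimes first and assign the sorted result back, since sorting returns a new object.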


In [46]:
# Show first N rows
##display(full_tweets_frame.head(-15))

##display(full_tweets_frame.head(-15).index.tolist())

display(full_tweets_frame.iloc[2,0])
display(full_tweets_frame.iloc[2,1])
display(full_tweets_frame.iloc[2,2])
display(full_tweets_frame.iloc[2,3])
display(full_tweets_frame.iloc[2,4])
display(full_tweets_frame.iloc[2,5])

1444893689667538945

'Mon Oct 04 05:14:35 +0000 2021'

'RT @apoyoasanchez: 🔴ÚLTIMA HORA🔴\n\nEl presidente @sanchezcastejon se encuentra ya en la isla canaria🇮🇨 de La Palma para seguir la evolución…'

70

0

0.0

In [47]:
#################################################
# Tweets pre-procesing
#################################################

if 'full_tweets_frame_processed' in globals():
  del full_tweets_frame_processed

if 'preproc_txt' in globals():
  del preproc_txt

# PRE-PROCESS text
full_tweets_frame_processed, preproc_txt = preprocess_tweets_text_v2(full_tweets_frame)
print(len(preproc_txt))
print(preproc_txt)

print('Tweets pre-procesing hecho a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')    


3857
['este flujo incandescente ya ha arrasado con varias comunidades, obligando a miles de personas a abandonar sus hogares. El día de ayer la lava finalmente llegó al mar tras recorrer 6 km por las laderas de Cumbre Vieja.', 'RT @EarthquakeChil1: ATENCION CARIBE | Las partículas de ceniza del volcán de cumbre vieja actualmente se sitúan en Puerto Rico,ahora en ho', 'RT @apoyoasanchez: ÚLTIMA HORA\n\nEl presidente @sanchezcastejon se encuentra ya en la isla canaria de La Palma para seguir la evolución', 'RT @HoyPorHoy: ⭕ Así se ve la isla de lava que el volcán de Cumbre Vieja ha creado sobre el mar: 540 metros, 35 de profundidad y cubre 27 h', 'RT @A3Noticias: ▶  El volcán de Cumbre Vieja ha arrasado con una de las zonas de surf más famosas de Canarias, Los Guirres: "Era un sitio', 'RT @NonNobis10:  ÚLTIMA HORA |  | SE DERRUMBA PARTE DEL CONO DEL VOLCÁN DE LA PALMA. \n\n☝️El cono del volcán de Cumbre Vieja, en La Palma,', 'Últimas noticias en directo sobre el volcán de La Palma: el de

In [48]:
#################################################
# Review Dataframe tweets
#################################################
show_detailed_feedback = True

display('Numero total de tweets: '+str(len(full_tweets_frame)))

if show_detailed_feedback:
    display(full_tweets_frame.describe())
    display(full_tweets_frame.columns)


##full_tweets_frame.index.tolist()

'Numero total de tweets: 3857'

Unnamed: 0,tweet_id,retweet_count,favorite_count,cosine_scores
count,3857.0,3857.0,3857.0,3857.0
mean,1.44502e+18,45.941405,1.993518,0.0
std,2197085000000000.0,817.048867,14.824499,0.0
min,1.438543e+18,0.0,0.0,0.0
25%,1.444763e+18,0.0,0.0,0.0
50%,1.446345e+18,1.0,0.0,0.0
75%,1.446353e+18,6.0,0.0,0.0
max,1.446358e+18,46572.0,513.0,0.0


Index(['tweet_id', 'created_at', 'text', 'retweet_count', 'favorite_count',
       'cosine_scores'],
      dtype='object')

In [49]:
#################################################
# Review Dataframe tweets
#################################################
# Show first N rows
full_tweets_frame.head(15)

Unnamed: 0,tweet_id,created_at,text,retweet_count,favorite_count,cosine_scores
0,1444893940746964995,Mon Oct 04 05:15:34 +0000 2021,este flujo incandescente ya ha arrasado con va...,0,0,0.0
1,1444893859591462913,Mon Oct 04 05:15:15 +0000 2021,RT @EarthquakeChil1: ATENCION CARIBE | Las par...,691,0,0.0
2,1444893689667538945,Mon Oct 04 05:14:35 +0000 2021,RT @apoyoasanchez: ÚLTIMA HORA\n\nEl president...,70,0,0.0
3,1444892981258039299,Mon Oct 04 05:11:46 +0000 2021,RT @HoyPorHoy: ⭕ Así se ve la isla de lava que...,11,0,0.0
4,1444892449374101508,Mon Oct 04 05:09:39 +0000 2021,RT @A3Noticias: ▶ El volcán de Cumbre Vieja h...,1,0,0.0
5,1444892395443720193,Mon Oct 04 05:09:26 +0000 2021,RT @NonNobis10: ÚLTIMA HORA | | SE DERRUMBA ...,2,0,0.0
6,1444892285049606146,Mon Oct 04 05:09:00 +0000 2021,Últimas noticias en directo sobre el volcán de...,0,3,0.0
7,1444891777614426112,Mon Oct 04 05:06:59 +0000 2021,RT @lilianaf523: ISLA LA PALMA. Un nuevo fenó...,1292,0,0.0
8,1444891595711602691,Mon Oct 04 05:06:15 +0000 2021,RT @AlertaCambio: Actualización - Volcán Cumbr...,28,0,0.0
9,1444891137756512257,Mon Oct 04 05:04:26 +0000 2021,RT @kokehtz: 8 minutos de la erupción del volc...,53,0,0.0


In [50]:
#################################################
# Review Dataframe tweets
#################################################
# Show last N rows
full_tweets_frame.iloc[-15:]


Unnamed: 0,tweet_id,created_at,text,retweet_count,favorite_count,cosine_scores
3842,1442025271217979395,Sun Sep 26 07:16:30 +0000 2021,@ManzanaDori Hace tiempo vi un documental sobr...,1,1,0.0
3843,1442005465286823937,Sun Sep 26 05:57:48 +0000 2021,RT @PauGenestra: Miles de surferos conspiranoi...,13,0,0.0
3844,1441923104595341314,Sun Sep 26 00:30:32 +0000 2021,Octubre 1963: Real Madrid y Glasgow Rangers se...,0,0,0.0
3845,1441911719261966342,Sat Sep 25 23:45:17 +0000 2021,"Joan Martí, el sabio de los volcanes: ""La teor...",6,6,0.0
3846,1441903749719474178,Sat Sep 25 23:13:37 +0000 2021,RT @AlejandroPenles: El volcán de Cumbre Vieja...,1,0,0.0
3847,1441895927883698183,Sat Sep 25 22:42:32 +0000 2021,@rsarille4 @CharlieWings07 Esta teoría del des...,0,2,0.0
3848,1441872832330358784,Sat Sep 25 21:10:46 +0000 2021,Octubre 1963: Real Madrid y Glasgow Rangers se...,26,26,0.0
3849,1441861718087376899,Sat Sep 25 20:26:36 +0000 2021,RT @Amor_y_Rabia: (INFOGRAFICO)\nLa teoría del...,1,0,0.0
3850,1441849300367994890,Sat Sep 25 19:37:16 +0000 2021,"(INFOGRAFICO)\nLa teoría del ""megatsunami"" que...",0,0,0.0
3851,1441848236768067586,Sat Sep 25 19:33:02 +0000 2021,"(INFOGRAFICO)\nLa teoría del ""megatsunami"" que...",0,1,0.0


In [51]:
#################################################
# Using Sentence Transformers
#################################################

# TODO: try to use BETO 
# https://medium.com/dair-ai/beto-spanish-bert-420e4860d2c6
# https://github.com/dccuchile/beto

#load model
##model = SentenceTransformer('paraphrase-distilroberta-base-v1')
##model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

# Raise the maximum input length to 400 tokens (longer inputs are truncated)
model.max_seq_length = 400

print('Model cargado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')    


Model cargado a las  08 Oct 2021 - 07:47:41 ...


In [52]:
#################################################
# Using BERT Sentence to process all stored tweets
#################################################

if 'sentence_embeddings' in globals():
  del sentence_embeddings

#DEBUG code
##sentences_test = ['This framework generates embeddings for each input sentence',
##    'Sentences are passed as a list of string.', 
##    'The quick brown fox jumps over the lazy dog.']
##print(type(sentences_test))
##print(type(sentences_test) == type(preproc_txt))

# Compute the embeddings:
# pass the tweet texts to the model
sentence_embeddings = model.encode(preproc_txt)

print('\nCodificación de embeddings de todos los tweets a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')


Codificación de embeddings de todos los tweets a las  08 Oct 2021 - 07:48:18 ...


In [53]:
#################################################
# Get by hand my selected fake news tweet
#################################################
# Fake news target:
##jose luis araque
##@1joseluis752
##Sep 29
my_selected_tweet0 = 'Según los modelos elaborados por los investigadores Steven Ward y Simon Day, la actividad sísmica del Cumbre Vieja; podría provocar el desprendimiento de rocas de hasta 500 kilómetros cúbicos, haciendo que se deslicen y generando un mega tsunami'

my_selected_tweet_full = my_selected_tweet0
###print(my_selected_tweet_full)


# Find the selected fake news among all the tweets and get
# the index of the selected tweet in the dataframe
##full_tweets_frame.iloc[2,0]
# regex=False searches for the literal text instead of treating it as a regular expression
the_index = full_tweets_frame.loc[full_tweets_frame['text'].str.contains(my_selected_tweet_full, case=False, regex=False)].index.values[0]
print('Index of tweet: '+str(the_index))
rowData = full_tweets_frame.loc[the_index,:]
print(rowData)

# Compute the embedding:
# pass the selected tweet to the model
my_selected_tweet_embedding = model.encode(my_selected_tweet_full)
##print(my_selected_tweet_embedding)

print('Codificación de embeddings del tweet seleccionado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')

Index of tweet: 3622
tweet_id                                        1443007736195817481
created_at                           Wed Sep 29 00:20:28 +0000 2021
text              Según los modelos elaborados por los investiga...
retweet_count                                                     1
favorite_count                                                    0
cosine_scores                                                     0
Name: 3622, dtype: object
Codificación de embeddings del tweet seleccionado a las  08 Oct 2021 - 07:48:41 ...
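
pandas' `str.contains` treats its argument as a regular expression unless `regex=False` is passed, and tweet text often contains regex metacharacters. The following is a minimal sketch, with the standard `re` module, of why a tweet used as a pattern may fail to match itself. The sample text is a shortened version of a tweet shown in this notebook's output; the variable names are illustrative.

```python
import re

# Shortened text of a tweet that appears in the dataframe output
tweet_text = '(INFOGRAFICO)\nLa teoría del "megatsunami"'

# As a regular expression, "(INFOGRAFICO)" is a capture group, so the pattern
# expects "INFOGRAFICO" directly followed by a newline, without the parentheses
matches_as_regex = re.search(tweet_text, tweet_text) is not None

# Escaping the metacharacters restores a literal search
matches_escaped = re.search(re.escape(tweet_text), tweet_text) is not None

# A plain substring test never interprets metacharacters
matches_literal = tweet_text in tweet_text

print(matches_as_regex, matches_escaped, matches_literal)
```

Text with an unbalanced parenthesis goes one step further and makes the pattern invalid, which raises `re.error` instead of simply not matching.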


##Phase 03 - Distance calculation

In this phase we use the stored embeddings to compute the similarity between the tweet texts using cosine distance.


###Cosine distance

A commonly used approach to matching similar documents is to count the number of words they have in common. This approach is clearly flawed: as document size grows, the number of common words tends to grow even when the documents deal with different topics. Cosine similarity avoids this problem because it measures the angle between the two embedding vectors rather than their magnitude.
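
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends on direction only. The following is a minimal sketch in plain Python with toy three-dimensional vectors; `cosine_similarity` is an illustrative helper (the notebook itself relies on `util.pytorch_cos_sim`), and returning 0.0 for a zero vector is an assumption made here to avoid a division by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the magnitude
other_doc = [0.0, 0.0, 3.0]    # orthogonal to both

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity(short_doc, other_doc)

print(round(same_direction, 6), round(orthogonal, 6))
```

The two vectors that point in the same direction score 1 regardless of their length, while the orthogonal pair scores 0.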

In [54]:
#################################################
# COSINE similarity, 1 to n
# Our objective is to quantitatively estimate the
# similarity between the selected tweet and each
# of the other tweets
#################################################

# Compute the cosine similarity between the selected tweet and every other tweet
cosine_scores = util.pytorch_cos_sim(my_selected_tweet_embedding, sentence_embeddings)
##print(cosine_scores)

#################################################
# Write output in file, in order to see data better
write_on_file = True

if write_on_file:
  # Mode 'a' appends, so re-running this cell adds the results to the file again
  with open(BASE_FOLDER+'20211008_tweets_with_scores.txt', 'a') as the_file:

    #Output the pairs with their score
    for i in range(len(preproc_txt)):
        print('\n --- Tweet ',i,' ---', file=the_file)
        print("Tweet Seleccionado: {} \nTweet Comparado: {} \n    Similitud: {:.4f}".format(my_selected_tweet_full, preproc_txt[i], cosine_scores[0][i]), file=the_file)
    #end_for

  # The with block closes the file on exit, so no explicit close() is needed

  print('Fichero de similitudes creado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')  
else:
  print('¡ATENCIÓN! Fichero no escrito...', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')   
#end_if


Fichero de similitudes creado a las  08 Oct 2021 - 07:48:50 ...


In [55]:
#################################################
# Add distance score to the dataframe for next using
#################################################

# preproc_txt was built from full_tweets_frame['text'] after the index was reset,
# so score i belongs to row i and the whole column can be assigned at once.
# This also avoids looking rows up with str.contains, which treats tweet text as a regex.
full_tweets_frame['cosine_scores'] = cosine_scores[0].tolist()

# avoid NaNs
full_tweets_frame['cosine_scores'] = full_tweets_frame['cosine_scores'].replace(np.nan, 0)

print('Dataset reestructurado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...') 



Dataset reestructurado a las  08 Oct 2021 - 07:49:18 ...


In [56]:
#################################################
# Review Dataframe tweets
#################################################

# Show first N rows
full_tweets_frame.head(15)

Unnamed: 0,tweet_id,created_at,text,retweet_count,favorite_count,cosine_scores
0,1444893940746964995,Mon Oct 04 05:15:34 +0000 2021,este flujo incandescente ya ha arrasado con va...,0,0,0.169823
1,1444893859591462913,Mon Oct 04 05:15:15 +0000 2021,RT @EarthquakeChil1: ATENCION CARIBE | Las par...,691,0,0.420222
2,1444893689667538945,Mon Oct 04 05:14:35 +0000 2021,RT @apoyoasanchez: ÚLTIMA HORA\n\nEl president...,70,0,0.461877
3,1444892981258039299,Mon Oct 04 05:11:46 +0000 2021,RT @HoyPorHoy: ⭕ Así se ve la isla de lava que...,11,0,0.501631
4,1444892449374101508,Mon Oct 04 05:09:39 +0000 2021,RT @A3Noticias: ▶ El volcán de Cumbre Vieja h...,1,0,0.436091
5,1444892395443720193,Mon Oct 04 05:09:26 +0000 2021,RT @NonNobis10: ÚLTIMA HORA | | SE DERRUMBA ...,2,0,0.211224
6,1444892285049606146,Mon Oct 04 05:09:00 +0000 2021,Últimas noticias en directo sobre el volcán de...,0,3,0.157044
7,1444891777614426112,Mon Oct 04 05:06:59 +0000 2021,RT @lilianaf523: ISLA LA PALMA. Un nuevo fenó...,1292,0,0.51331
8,1444891595711602691,Mon Oct 04 05:06:15 +0000 2021,RT @AlertaCambio: Actualización - Volcán Cumbr...,28,0,0.38986
9,1444891137756512257,Mon Oct 04 05:04:26 +0000 2021,RT @kokehtz: 8 minutos de la erupción del volc...,53,0,0.441965


In [57]:
######################################################
# Write output in file, in order to see data better
######################################################

write_on_file = True

only_fakes_threshold = 0.7

if write_on_file:
  # Mode 'a' appends, so re-running this cell adds the results to the file again
  with open(BASE_FOLDER+'20211008_tweets_fakes.txt', 'a') as the_file:
      # Row i of full_tweets_frame corresponds to preproc_txt[i] (the index was reset
      # earlier), so i is used directly as the row label
      for i in range(len(preproc_txt)):
          score = float(cosine_scores[0][i])
          if score > only_fakes_threshold:
              print('\n --- Tweet ',i,' ---', file=the_file)
              print("Tweet Seleccionado: {} \nTweet Comparado: {} \n    Cuando: {} \n    Similitud: {:.4f}"
                      .format(my_selected_tweet_full, full_tweets_frame.loc[i,'text'], full_tweets_frame.loc[i,'created_at'], score)
                      , file=the_file)
          #end_if
      #end_for

  # The with block closes the file on exit, so no explicit close() is needed

  print('Fichero de noticias falsas creado a las ', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')  
else:
  print('¡ATENCIÓN! Fichero no escrito...', datetime.datetime.now().strftime("%d %b %Y - %H:%M:%S"),'...')   
#end_if      



Fichero de noticias falsas creado a las  08 Oct 2021 - 07:53:20 ...
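
The value of `only_fakes_threshold` decides how many tweets are treated as variants of the selected fake news. The following is a small sketch to see how the selection changes with the threshold. The scores are copied from the first ten rows of the dataframe shown earlier; `indices_above` is a hypothetical helper, and the 0.5 value is only used here for comparison.

```python
# Cosine scores of the first ten rows, as displayed in the dataframe output
sample_scores = [0.169823, 0.420222, 0.461877, 0.501631, 0.436091,
                 0.211224, 0.157044, 0.51331, 0.38986, 0.441965]

def indices_above(scores, threshold):
    """Return the positions whose score is strictly greater than the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]

selected_strict = indices_above(sample_scores, 0.7)  # threshold used in the notebook
selected_loose = indices_above(sample_scores, 0.5)   # lower value, for comparison

print('Above 0.7:', selected_strict)
print('Above 0.5:', selected_loose)
```

None of these ten tweets passes the 0.7 threshold, while a lower value would already let in tweets about the eruption in general.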


In [89]:
######################################################
# Draw time line of fake news
# https://stackoverflow.com/questions/7670280/tree-plotting-in-python
# https://pypi.org/project/treelib/
######################################################

if 'treelib' in sys.modules:
    print(f"{'treelib'!r} already in sys.modules")
else:
  !pip install treelib

from datetime import datetime
from treelib import Node, Tree

aux_df = pd.DataFrame(columns=['tweet_id', 'created_at', 'cosine_scores', 'full_txt'])

##tree = Tree()
##the_parent = 'root'
##root_txt = 'Notica Falsa de origen: ' + my_selected_tweet_full.replace('\n','')
##tree.create_node(root_txt, the_parent)  # No parent means its the root node

for i in range(len(preproc_txt)):

    try:
      found_index = full_tweets_frame.loc[full_tweets_frame[:]['text'].str.contains(preproc_txt[i], case=False)].index.values[0]
      #print(found_index)    
    except Exception as e:
      # No match (or invalid pattern): reset so a stale index from a
      # previous iteration is not reused below
      found_index = None
    #end_try

    if found_index in full_tweets_frame.index:
      try:
        if cosine_scores[0][i] > only_fakes_threshold:
            #print('\n --- Tweet ',i,' ---') #, file=the_file)
            txt = full_tweets_frame.loc[found_index,'text']

            # Skip the selected fake-news tweet itself
            if (txt != my_selected_tweet_full) and (my_selected_tweet_full not in txt):
                txt = txt.replace('\n','')
                #print(txt)

                t_id = full_tweets_frame.loc[found_index,'tweet_id']
                #print(t_id)

                when = full_tweets_frame.loc[found_index,'created_at']
                #print(when)

                score = full_tweets_frame.loc[found_index,'cosine_scores']
                #print(score)

                # Example created_at value: Mon Oct 04 00:34:33 +0000 2021
                when2 = datetime.strptime(when,'%a %b %d %H:%M:%S %z %Y')
                #full_txt = str(t_id) +' - '+ when2.strftime("%Y-%m-%d %H:%M:%S") + ' - ' + str(round(score,5)) + ' - ' + txt ##txt[0:120] + '...'
                full_txt = when2.strftime("%Y-%m-%d %H:%M:%S") + ' - ' + str(round(score,5)) + ' - ' + txt ##txt[0:120] + '...'

                #
                ##date_time_obj = datetime.strptime(when2, '%d/%m/%y %H:%M:%S')
                aux_df.loc[i] = [t_id, when2, score, full_txt]

                #tree branch
                ###tree.create_node(full_txt, t_id, parent=the_parent)
                ###the_parent = t_id
            #end_if
      except Exception as e2:
        print(e2)       
        msg = 'Nothing to do'
      #end_try
    else:
      print('No index found')
    #end_if
#end_for

# Sort by date, newest first
aux_df = aux_df.sort_values('created_at', ascending=False)

###########################
#
tree = Tree()
the_parent = 'root'
root_txt = 'Source fake news: ' + my_selected_tweet_full.replace('\n','')
tree.create_node(root_txt, the_parent)  # No parent means it is the root node

for index, row in aux_df.iterrows():
    t_id_aux = row['tweet_id']
    full_txt_aux = row['full_txt']

    # Chain each tweet under the previous (newer) one
    try:
      tree.create_node(full_txt_aux, t_id_aux, parent=the_parent)
      the_parent = t_id_aux
    except Exception as e:
      # e.g. a tweet_id that is already in the tree: skip it and keep the current parent
      msg = 'Nothing to do'
    #end_try    
#end_for


tree.show()

'treelib' already in sys.modules


Source fake news: Según los modelos elaborados por los investigadores Steven Ward y Simon Day, la actividad sísmica del Cumbre Vieja; podría provocar el desprendimiento de rocas de hasta 500 kilómetros cúbicos, haciendo que se deslicen y generando un mega tsunami
└── 2021-09-30 15:20:44 - 0.70001 - Por qué no hay riesgo de que el volcán de La Palma genere un ‘megatsunami’ https://t.co/xdpDOqIcLf @Lidia_San_Jose
    └── 2021-09-30 14:27:05 - 0.72693 - RT @Arachnofool: @AugeAabye Así sería el Mega tsunami causado por el volcán. Puede que la erupción pierda fuerza en estos días, pero el ev
        └── 2021-09-30 13:20:12 - 0.70693 - El riesgo del volcán es que pegue un petardazo y se desprenda la ladera y está llega al mar, ya entonces cagamos, megatsunami
            └── 2021-09-30 11:14:38 - 0.74452 - @AugeAabye Así sería el Mega tsunami causado por el volcán. Puede que la erupción pierda fuerza en estos días, pero el eventual desenlace parece inevitable https://t.co/AVFMB2ykKK
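
The chain above comes from parsing Twitter's `created_at` strings, sorting the matches from newest to oldest, and making each tweet the parent of the next older one. A rough, standard-library-only sketch of that ordering step follows. The function names and the placeholder texts (`tweet A` to `tweet D`) are illustrative, the timestamps and scores are taken from the output above and rewritten in Twitter's raw format, and the plain-text rendering merely stands in for `treelib`.

```python
from datetime import datetime

# Twitter's raw created_at layout, e.g. "Mon Oct 04 00:34:33 +0000 2021".
# %a and %b are locale-dependent; this assumes English day/month names.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value: str) -> datetime:
    """Parse a raw created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_FORMAT)

def render_chain(root_label, matches):
    """Return the lines of a linear chain, newest match first (hypothetical helper)."""
    ordered = sorted(matches, key=lambda m: parse_created_at(m[0]), reverse=True)
    lines = [root_label]
    for depth, (created_at, score, text) in enumerate(ordered):
        when = parse_created_at(created_at).strftime("%Y-%m-%d %H:%M:%S")
        # Each level is indented one step further than its parent
        lines.append(f"{'    ' * depth}└── {when} - {round(score, 5)} - {text}")
    return lines

matches = [
    ("Thu Sep 30 11:14:38 +0000 2021", 0.744521, "tweet D"),
    ("Thu Sep 30 15:20:44 +0000 2021", 0.700008, "tweet A"),
    ("Thu Sep 30 13:20:12 +0000 2021", 0.706928, "tweet C"),
    ("Thu Sep 30 14:27:05 +0000 2021", 0.726933, "tweet B"),
]

chain = render_chain("root", matches)
print("\n".join(chain))
```

Sorting in descending order puts the most recent tweet directly under the root, so the oldest match ends up as the deepest leaf of the chain.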
  

In [86]:
len(aux_df)

##aux_df = aux_df.dropna(thresh=2)
##cols_as_date = [datetime.strptime(x,'%d-%m-%Y') for x in aux_df.created_at]
##aux_df = aux_df[sorted(cols_as_data)]
aux_df = aux_df.sort_values('created_at', ascending=False)
aux_df.head(50)



| | tweet_id | created_at | cosine_scores | full_txt |
|---|---|---|---|---|
| 3713 | 1443596681879097346 | 2021-09-30 15:20:44+00:00 | 0.700008 | 2021-09-30 15:20:44 - 0.70001 - Por qué no hay... |
| 3592 | 1443583180565483535 | 2021-09-30 14:27:05+00:00 | 0.726933 | 2021-09-30 14:27:05 - 0.72693 - RT @Arachnofoo... |
| 3714 | 1443566347477155840 | 2021-09-30 13:20:12+00:00 | 0.706928 | 2021-09-30 13:20:12 - 0.70693 - El riesgo del ... |
| 3593 | 1443534751009230855 | 2021-09-30 11:14:38+00:00 | 0.744521 | 2021-09-30 11:14:38 - 0.74452 - @AugeAabye Así... |
| 3600 | 1443406598240624643 | 2021-09-30 02:45:24+00:00 | 0.749414 | 2021-09-30 02:45:24 - 0.74941 - Hoy youtube me... |
| 3599 | 1443406598240624643 | 2021-09-30 02:45:24+00:00 | 0.749414 | 2021-09-30 02:45:24 - 0.74941 - Hoy youtube me... |
| 3607 | 1443273150784057354 | 2021-09-29 17:55:08+00:00 | 0.729614 | 2021-09-29 17:55:08 - 0.72961 - @el_pais Unos ... |
| 3621 | 1443008511236059142 | 2021-09-29 00:23:33+00:00 | 0.803796 | 2021-09-29 00:23:33 - 0.8038 - RT @1joseluis75... |
| 3649 | 1442624099617067014 | 2021-09-27 22:56:02+00:00 | 0.703976 | 2021-09-27 22:56:02 - 0.70398 - Cientificos de... |
| 3799 | 1442318471040278531 | 2021-09-27 02:41:35+00:00 | 0.735151 | 2021-09-27 02:41:35 - 0.73515 - Aclarar la pro... |