*Notes:*

-Messages from: https://t.me/QANONSCHILE

-Exported the chats using the Telegram Lite application

-Scraped the exported chats with the script "chile.py" and created a CSV with all the messages, called "mensajes.csv"

-Next step: analyzing the messages

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)



Importing the messages and naming the columns

In [2]:
df = pd.read_csv("mensajes.csv", header=None)
df.columns = ["date_time", "fowarded", "user", "message", "in_reply"]
df.head(30)

Unnamed: 0,date_time,fowarded,user,message,in_reply
0,,,Service message,9 May 2020,
1,,,Service message,Q ANON CHILE converted a basic group to this supergroup «Q ANON CHILE»,
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME,
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,
5,,True,MAJRM 09.05.2020 00:10:22,https://youtu.be/DL35ZRhEaJU,
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",
7,,True,Deleted Account 07.05.2020 21:35:10,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",
8,,,,Media not included,
9,09.05.2020 12:59:41,,MAJRM,Media not included,


Eliminating the "in_reply" column, which I won't use for this analysis:

In [3]:
df = df.drop(labels='in_reply', axis=1)
df.head()

Unnamed: 0,date_time,fowarded,user,message
0,,,Service message,9 May 2020
1,,,Service message,Q ANON CHILE converted a basic group to this supergroup «Q ANON CHILE»
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU


Getting rid of "Service message" rows, "Media not included" rows, and forwarded messages. Forwarded messages were imported twice: once under the user who sent the message to this group chat and once under the user who originally wrote it, who might not be a member of the group.

In [4]:
df_no_sm = df[(df.user != 'Service message') & (df.message != 'Media not included') & (df.fowarded != True)]

Looking at the types. I will be changing the date type later on in this notebook.

In [5]:
df_no_sm.dtypes

date_time    object
fowarded     object
user         object
message      object
dtype: object

Looking at the number of columns and rows

In [6]:
df_no_sm.shape

(95865, 4)

Replacing the nulls with empty strings because nltk and CountVectorizer don't work with null values

In [7]:
df_no_sm.message = df_no_sm.message.fillna('')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
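
The warning above appears because df_no_sm is a filtered slice of df. A minimal sketch of one common way to avoid it, assuming the same column names as above and using a few made-up rows in place of the real export: take an explicit .copy() of the filtered frame before assigning to it.

```python
import pandas as pd

# Toy stand-in for the exported chat (hypothetical rows, same column names as above)
df = pd.DataFrame({
    "user": ["Service message", "MAJRM", "Carlos"],
    "message": ["9 May 2020", None, "Hola"],
    "fowarded": [None, None, None],
})

# .copy() makes the filtered frame independent of df, so later
# assignments do not trigger SettingWithCopyWarning
df_no_sm = df[(df.user != "Service message")
              & (df.message != "Media not included")
              & (df.fowarded != True)].copy()

# Replace missing messages with empty strings
df_no_sm["message"] = df_no_sm["message"].fillna("")
print(df_no_sm)
```

The original df is left untouched, which matters here because the notebook goes back to it later for the service messages.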


In [8]:
df_no_sm.head()

Unnamed: 0,date_time,fowarded,user,message
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup..."
12,09.05.2020 13:05:09,,Q,Hola buenas tardes desde España!!
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY


Count vectorizer in Spanish. Found this guide: https://pybonacci.org/2015/11/24/como-hacer-analisis-de-sentimiento-en-espanol-2/

In [9]:
!pip install nltk

You should consider upgrading via the '/Users/biancapallaro/.pyenv/versions/3.8.2/bin/python3.8 -m pip install --upgrade pip' command.


Using the code from the webpage above.
At first, I tried using the stemmer, but the results weren't accurate, so I disabled it by commenting out its code.

In [10]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer

spanish_stopwords = stopwords.words('spanish')

# Characters to strip before tokenizing: punctuation, Spanish opening marks, digits
non_words = list(punctuation)
non_words.extend(['¿', '¡'])
non_words.extend(map(str, range(10)))

# Stemming is disabled because the stemmed results weren't accurate
#stemmer = SnowballStemmer('spanish')
#def stem_tokens(tokens, stemmer):
    #stemmed = []
    #for item in tokens:
        #stemmed.append(stemmer.stem(item))
    #return stemmed

def tokenize(text):
    # Remove the unwanted characters, then split into word tokens
    text = ''.join([c for c in text if c not in non_words])
    tokens = word_tokenize(text)
    # stem (disabled)
    #try:
        #stems = stem_tokens(tokens, stemmer)
    #except Exception as e:
        #print(e)
        #print(text)
        #stems = ['']
    #return stems
    return tokens
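
To see what the character filter in tokenize does on its own, here is a small sketch that needs no nltk download. The sample sentence is made up, and str.split() is only a stand-in for word_tokenize. It also shows why whole URLs later turn up as single "words" such as httpstmetrumpintel: the punctuation that separates the parts of a link is removed before tokenizing.

```python
from string import punctuation

# Same characters as the non_words list in the cell above, stored as a set
non_words = set(punctuation) | {"¿", "¡"} | set(map(str, range(10)))

def strip_non_words(text):
    """Drop punctuation, Spanish opening marks and digits, character by character."""
    return "".join(c for c in text if c not in non_words)

sample = "¿Qué pasó el 9 de mayo? https://t.me/abc"  # hypothetical message
cleaned = strip_non_words(sample)
print(cleaned)

# str.split() is only a stand-in here for nltk's word_tokenize
print(cleaned.lower().split())
```

One possible refinement, not used in this notebook, would be to remove URLs from the text before this step so they don't end up in the vocabulary.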

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/biancapallaro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/biancapallaro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Building the CountVectorizer. For more information, see: https://investigate.ai/text-analysis/counting-words-with-scikit-learns-countvectorizer/

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Make a vectorizer
vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = spanish_stopwords,
                min_df=20)

# Learn and count the words in df_no_sm.message
matrix = vectorizer.fit_transform(df_no_sm.message)

# Convert the matrix of counts to a dataframe
# (on scikit-learn 1.0+ use get_feature_names_out(); get_feature_names() was removed in 1.2)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())

In [12]:
words_df

Unnamed: 0,abajo,abandonar,abc,aberración,abierta,abiertamente,abierto,abiertos,abogada,abogado,abogados,abortados,aborto,abortos,about,abramovic,abrazo,abrazos,abre,abren,abriendo,abrieron,abril,abrir,abrió,absoluta,absolutamente,absoluto,absurdas,absurdo,abuela,abuelo,abuelos,abuso,abusos,ac,aca,acaba,acabado,acaban,acabar,acabe,acabo,acabó,acaso,acceder,acceso,accidentalmente,accidente,acciones,...,🔥,🔴,🔵,😁,😂,😂😂,😂😂😂,😂😂😂😂,😂😂😂😂😂,😅,😆,😉,😉😇🍀,😊,😍,😎,😑,😒,😓,😔,😜,😡,😢,😭,😮,😱,😱😱😱,😱😱😱😱,😳,🙄,🙈,🙌,🙏,🚨,🤔,🤔🤔,🤔🤔🤔,🤝,🤡,🤣,🤣🤣,🤣🤣🤣,🤣🤣🤣🤣,🤦,🤦🏻‍♀️,🤦🏻‍♂️,🤬,🤮,🤯,🧐
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95860,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
95861,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
95862,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
95863,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Showing the top 11 most common words

In [13]:
words_df.sum().sort_values(ascending=False).head(11)

q        8819
si       8785
trump    3351
the      3058
gente    3003
ser      2921
ahora    2819
así      2529
chile    2504
solo     2463
covid    2196
dtype: int64

Creating a dataframe with the most common words

In [14]:
common_words = words_df.sum().sort_values(ascending=False).to_frame(name='count_words')

In [15]:
common_words.head(100)

Unnamed: 0,count_words
q,8819
si,8785
trump,3351
the,3058
gente,3003
ser,2921
ahora,2819
así,2529
chile,2504
solo,2463


In [16]:
common_words = common_words.reset_index()

In [17]:
common_words = common_words.rename(columns={'index': 'common_word', 'count_words': 'count'})
common_words.head()

Unnamed: 0,common_word,count
0,q,8819
1,si,8785
2,trump,3351
3,the,3058
4,gente,3003


How many times, and when, do the following words appear? The president's name (Piñera), Trump, and Facebook.

In [18]:
common_words[common_words['common_word'].str.contains("piñera")]

Unnamed: 0,common_word,count
159,piñera,567


In [19]:
df_no_sm[df_no_sm["message"].str.contains("Piñera")]

Unnamed: 0,date_time,fowarded,user,message
3506,13.05.2020 02:21:46,,Deleted Account,Piñera termina su mandato
7560,18.05.2020 15:01:57,,Claudia L,A la gente que quiere aprobar el cambio de Constitución y que ademas usa siempre el #RenunciaPiñera deberiamos aprovechar la ingeniería social y decirles que detrás de los planes pencas del presi ...
7657,18.05.2020 15:56:13,,Deleted Account,Piñera no lo creo es empresario
7746,18.05.2020 16:05:10,,Deleted Account,Piñeralos compro
8530,19.05.2020 22:12:25,,Claudia L,Seguirles la corriente y decirles que junto a su #RenunciaPiñera debe ir el #FueraOnu
...,...,...,...,...
164626,08.07.2021 13:55:47,,Felipe Vial,"Confirmado, Piñera sigue instrucciones de Fauci https://twitter.com/presidencia_cl/status/1413177177923887110"
165071,09.07.2021 16:20:34,,J,"Rodrigo Rojas, ""Pelao Vade"", constituyente de la Lista del Pueblo, apoyando la campaña de firmas online, Piñera a La Haya!"
165348,10.07.2021 12:17:19,,Julito Martinez,"Qué son 20 millones para Piñera, esto no se trata de dinero."
165744,11.07.2021 18:14:57,,Gonzalo Casas Errazuriz,"PIÑERA, PARIS Y MINSAL, CADA VEZ SE HUNDEN MÁSHasta hace pocos meses, cada acción judicial que se presentaba apuntando contra la validez de las medidas liberticidas de la plandemia era sistemática..."


In [20]:
common_words[common_words['common_word'].str.contains("trump")]

Unnamed: 0,common_word,count
2,trump,3351
3353,httpstmetrumpintel,44
5058,httpstwittercomrealdonaldtrumpstatuss,27
5156,trumptmeqbutchernews,27
5240,realdonaldtrump,26


In [21]:
df_no_sm[df_no_sm["message"].str.contains("trump")]

Unnamed: 0,date_time,fowarded,user,message
320,09.05.2020 20:08:59,,maarmapa,Vice es ds supuestamente pero siempre hay algo de resistenciaY eso es lo mas interesante david wilcock tambien siempre lo comentaReptilianosTrabajan para ambos lados Ese es el miedo con trumpJaja ...
1089,11.05.2020 12:43:18,,Deleted Account,"Hola! Pero con la noticia en la que trump Llamo a piñera, no crees q cambio un poco la narrativa? De hecho frenaron el carnet del covid, entonces no están siguiendo la pauta de la oms, la oms e..."
1213,11.05.2020 20:42:51,,Deleted Account,"Pero positivo. Si obviamente debe tener la campaña digital médios algoritmos... para llegar a mas gente, pero Q es según yo alguien q sabe mucho, sobre distintos futuros, el ejemplo dentro del ..."
1250,11.05.2020 20:56:12,,maarmapa,El mismo dia que trump tiro el anuncio de la space force
1326,11.05.2020 21:25:05,,MAJRM,Hoy con lo k hablo trump creo un hastagh #obamagate
...,...,...,...,...
159531,22.06.2021 14:31:32,,Q BUTCHER,"El Presidente Donald J. Trump, 45º Presidente de los Estados Unidos de América, celebrará un importante rally en Wellington, Ohio, el sábado 26 de junio de 2021 a las 7:00PM EDT.Este rally de Save..."
163625,05.07.2021 12:05:37,,claudio bustos,"Mas perdido que partidario de biden, en marcha de trump🤔😁💪"
164167,07.07.2021 08:04:13,,Q BUTCHER,"El que dejó caer la bandera ""TRUMP GANÓ"" en los estadios de la MLB está prohibido en todos los estadios e instalaciones - NO HA TERMINADOEl héroe de las Grandes Ligas de Béisbol 'TRUMP WON' ha sid..."
164305,07.07.2021 14:09:24,,Leo Brito,https://elamerican.com/trump-batalla-legal-facebook-y-twitter/?lang=es
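
Note that str.contains is case-sensitive by default, so searching for "trump" skips messages that only contain "Trump" or "TRUMP". A small sketch of the difference, using made-up messages:

```python
import pandas as pd

# Hypothetical messages, including a missing value
messages = pd.Series(["Trump firma una orden", "con trump", "TRUMP GANÓ", None, "hola"])

# Default: exact-case match only
sensitive = messages.str.contains("trump", na=False)

# case=False matches any capitalization; na=False treats missing messages as no match
insensitive = messages.str.contains("trump", case=False, na=False)

print(sensitive.sum(), insensitive.sum())
```

The vectorizer counts above are already lowercased (lowercase=True), so a case-insensitive search is the closer comparison to those totals.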


In [22]:
common_words[common_words['common_word'].str.contains("facebook")]

Unnamed: 0,common_word,count
289,facebook,402
2134,httpsmfacebookcomstoryphpstoryfbidid,72
2177,httpswwwfacebookcomposts,71
2233,httpsbitlyqduxqwfacebook,70
2633,httpswwwfacebookcompostssfnsnmo,57
3023,httpsmfacebookcomstoryphpstoryfbididsfnsnmo,50


In [23]:
df_no_sm[df_no_sm["message"].str.contains("Facebook")].head(50)

Unnamed: 0,date_time,fowarded,user,message
8533,19.05.2020 22:17:52,,Andres Ss,Facebook muy buen grupo de mms y cds mucha Info aunq siempre seguir protocolos de kalcker
10109,22.05.2020 22:04:25,,Rolando Rebel,Empezaron con Facebook
10271,23.05.2020 03:39:23,,beby 005,Facebook se asocia con Reuters para hacer fact check en la plataformaPor Diego Bastarrica12 de febrero de 2020Un fuerte compromiso con borrar cualquier duda sobre su imparcialidad en tiempos elect...
10517,24.05.2020 17:02:49,,Carlos,Audiencia de Mark Zuckerberg sobre la censura de Facebook - (Video aquí)El juez le pregunta: ¿Creó su red social para conectar a la gente y compartir información útil o no útil entre ellos? Zucker...
10651,24.05.2020 17:55:13,,Qanon_LatamForce,"Facebook anunció que ha creado una Junta de Supervisión para ayudar a decidir qué contenido queda para ver y qué contenido se eliminará del sitio de redes sociales más grande del mundo, y parece..."
10973,26.05.2020 20:53:21,,Claudio Santana,Los argentinos van por delante en la conciencia sobre la conspiración del Covid-19. https://www.infobae.com/politica/2020/05/26/solo-tres-de-cada-diez-argentinos-creen-lo-que-dice-la-oms-sobre-el-...
12728,27.05.2020 23:13:24,,MAJRM,"Trump firma una orden ejecutiva para las redes sociales el jueves después de la ""verificación de hechos""Actualización (1830ET): Tras amenazas anteriores, un portavoz de la Casa Blanca confirmó que..."
16045,31.05.2020 23:02:00,,Lucy C,https://www.infobae.com/america/mundo/2020/05/30/un-experto-italiano-asegura-que-el-coronavirus-se-esta-agotando-solo/?utm_medium=Echobox&utm_source=Facebook&fbclid=IwAR3f0WNN1KDyFVWOqgUoyHs8BAnNz...
16206,01.06.2020 01:44:04,,Deleted Account,Facebook estan todos posteando lo de anonymus y se creen todo
16393,01.06.2020 09:21:12,,Pau,Ayer en mi Facebook todos compartiendo la noticia de Trump y la red de pedofilia 🤦🤦


Looking at who is talking the most

In [24]:
df_no_sm.user.value_counts().head(10)

Deleted Account       9614
Claudia L             6107
maarmapa              5030
@katherine Charlot    4519
Q BUTCHER             4264
Nortina❤              2852
Carlos                2609
dani2020              1932
Seba Horton           1607
Luve                  1551
Name: user, dtype: int64

Converting the date_time strings to datetime format

In [25]:
df_no_sm['date_column'] = pd.to_datetime(df_no_sm['date_time'], format='%d.%m.%Y %H:%M:%S', errors='coerce')
df_no_sm.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_sm['date_column'] = pd.to_datetime(df_no_sm['date_time'], format='%d.%m.%Y %H:%M:%S', errors='coerce')


Unnamed: 0,date_time,fowarded,user,message,date_column
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",2020-05-09 12:59:37
12,09.05.2020 13:05:09,,Q,Hola buenas tardes desde España!!,2020-05-09 13:05:09
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16


Creating a column just for the day (without time)

In [26]:
df_no_sm['day_date'] = df_no_sm.date_column.dt.strftime('%Y-%m-%d')
df_no_sm.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_sm['day_date'] = df_no_sm.date_column.dt.strftime('%Y-%m-%d')


Unnamed: 0,date_time,fowarded,user,message,date_column,day_date
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",2020-05-09 12:59:37,2020-05-09
12,09.05.2020 13:05:09,,Q,Hola buenas tardes desde España!!,2020-05-09 13:05:09,2020-05-09
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16,2020-05-09


Creating a column just for the month

In [27]:
df_no_sm['month_date'] = df_no_sm.date_column.dt.strftime('%m-%Y')
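df_no_sm.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_sm['month_date'] = df_no_sm.date_column.dt.strftime('%m-%Y')


Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09,05-2020
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09,05-2020
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",2020-05-09 12:59:37,2020-05-09,05-2020
12,09.05.2020 13:05:09,,Q,Hola buenas tardes desde España!!,2020-05-09 13:05:09,2020-05-09,05-2020
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16,2020-05-09,05-2020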
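
One caveat about the '%m-%Y' month labels: they are plain text, so sorting them puts "01-2021" before "05-2020". A sketch of an alternative, using made-up timestamps in the same format as the export: keep the month as a pandas Period, which sorts chronologically.

```python
import pandas as pd

# Hypothetical timestamps; the last one is invalid on purpose
stamps = pd.to_datetime(
    pd.Series(["09.05.2020 12:59:18", "10.01.2021 08:00:00", "not a date"]),
    format="%d.%m.%Y %H:%M:%S",
    errors="coerce",  # unparseable values become NaT instead of raising
)

as_text = stamps.dt.strftime("%m-%Y")   # month labels as strings
as_period = stamps.dt.to_period("M")    # month labels as Period values

print(sorted(as_text.dropna()))    # text order: January 2021 comes first
print(sorted(as_period.dropna()))  # chronological order: May 2020 comes first
```

A '%Y-%m' string format would also sort correctly if plain text labels are preferred.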


Before moving forward, I want to know when most users joined the chat. So, I am going back to the original df, which includes the "Service messages."

In [28]:
df['date_column'] = pd.to_datetime(df['date_time'], format='%d.%m.%Y %H:%M:%S', errors='coerce')
df.head()

Unnamed: 0,date_time,fowarded,user,message,date_column
0,,,Service message,9 May 2020,NaT
1,,,Service message,Q ANON CHILE converted a basic group to this supergroup «Q ANON CHILE»,NaT
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME,NaT
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32


Creating a day column in the original df

In [29]:
df['day_date'] = df.date_column.dt.strftime('%Y-%m-%d')
df.head()

Unnamed: 0,date_time,fowarded,user,message,date_column,day_date
0,,,Service message,9 May 2020,NaT,
1,,,Service message,Q ANON CHILE converted a basic group to this supergroup «Q ANON CHILE»,NaT,
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME,NaT,
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09


Creating a month column in the original df

In [30]:
df['month_date'] = df.date_column.dt.strftime('%m-%Y')
df.head()

Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date
0,,,Service message,9 May 2020,NaT,,
1,,,Service message,Q ANON CHILE converted a basic group to this supergroup «Q ANON CHILE»,NaT,,
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09,05-2020
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME,NaT,,
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09,05-2020


In [31]:
df.message = df.message.fillna('')

Note: "Invited" means "joined the group"

In [32]:
df_invited = df[df["message"].str.contains("invited")]
df_invited.head(5)

Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date
18,09.05.2020 13:09:43,,Service message,Sebastian Brigman invited Sebastian Brigman,2020-05-09 13:09:43,2020-05-09,05-2020
23,09.05.2020 13:13:41,,Service message,Deleted invited Deleted Account,2020-05-09 13:13:41,2020-05-09,05-2020
34,09.05.2020 14:18:46,,Service message,Carolina invited Carolina,2020-05-09 14:18:46,2020-05-09,05-2020
35,09.05.2020 14:18:46,,Service message,Deleted invited Deleted Account,2020-05-09 14:18:46,2020-05-09,05-2020
36,09.05.2020 14:18:46,,Service message,Nutrita 🇨🇱 invited Nutrita 🇨🇱,2020-05-09 14:18:46,2020-05-09,05-2020
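
Most of the rows above have the same name on both sides of "invited", which suggests the person joined on their own. A rough sketch for separating the two names, assuming every join message follows the "X invited Y" pattern (the third row is a made-up case of one member adding another):

```python
import pandas as pd

service = pd.Series([
    "Sebastian Brigman invited Sebastian Brigman",
    "Carolina invited Carolina",
    "Ana invited Pedro",  # hypothetical: one member adding another
])

# Named groups become the column names of the resulting frame
parts = service.str.extract(r"^(?P<inviter>.+?) invited (?P<invitee>.+)$")

# Same name on both sides: treated here as a self-join
parts["self_join"] = parts["inviter"] == parts["invitee"]
print(parts)
```

Rows such as "Deleted invited Deleted Account" would not be flagged as self-joins by this comparison, so deleted accounts would need separate handling.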


The day when most people joined the group

In [33]:
df_invited.day_date.value_counts()

2021-01-10    55
2021-01-11    39
2021-01-24    38
2021-01-12    37
2021-01-20    34
              ..
2020-07-01     1
2021-02-27     1
2021-03-04     1
2020-05-28     1
2020-05-11     1
Name: day_date, Length: 433, dtype: int64

In [34]:
joined_day = df_invited.day_date.value_counts()
joined_day = joined_day.to_frame()
joined_day = joined_day.reset_index()
joined_day

Unnamed: 0,index,day_date
0,2021-01-10,55
1,2021-01-11,39
2,2021-01-24,38
3,2021-01-12,37
4,2021-01-20,34
...,...,...
428,2020-07-01,1
429,2021-02-27,1
430,2021-03-04,1
431,2020-05-28,1


In [35]:
joined_day.to_csv(r'/Users/biancapallaro/Desktop/chile_joined.csv', index = False, header=True)

In [36]:
import altair as alt
alt.Chart(joined_day).mark_line().encode(
    x='index',
    y='day_date'
)
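
The column names produced by value_counts().to_frame().reset_index() ("index" and "day_date" here) are easy to mix up and differ between pandas versions. A sketch of a version-independent way to build the same table with explicit, hypothetical column names ("day" and "joins"), sorted by date, using a few toy values:

```python
import pandas as pd

# Toy day labels standing in for df_invited.day_date
day_date = pd.Series(["2021-01-10", "2020-05-11", "2021-01-10", "2021-01-11"])

joined_day = (
    day_date.value_counts()
    .rename_axis("day")          # name the index explicitly
    .reset_index(name="joins")   # name the count column explicitly
)

# Real datetimes let the table (and any chart) follow calendar order
joined_day["day"] = pd.to_datetime(joined_day["day"])
joined_day = joined_day.sort_values("day").reset_index(drop=True)
print(joined_day)
```

With these names, the Altair encoding could be written as x='day:T' and y='joins:Q', so the x axis is treated as a time scale rather than as text labels.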

The month when most of the people joined the group 

In [36]:
joined = df_invited.month_date.value_counts()
joined = joined.to_frame()
joined = joined.reset_index()
joined

Unnamed: 0,index,month_date
0,01-2021,647
1,05-2021,364
2,02-2021,330
3,04-2021,328
4,11-2020,279
5,06-2021,276
6,03-2021,270
7,10-2020,229
8,12-2020,196
9,07-2021,193


In [37]:
import altair as alt
alt.Chart(joined).mark_line().encode(
    x='index',
    y='month_date'
)

The oldest message I have access to, from when the basic group was converted to the supergroup «Q ANON CHILE»

In [38]:
df.message.iloc[0]

'9 May 2020'

Group activity: the day when most of the messages were sent. For this, we will use the df_no_sm dataframe that has no service messages.

In [39]:
activity = df_no_sm.day_date.value_counts()
activity = activity.to_frame()
activity= activity.reset_index()
activity

Unnamed: 0,index,day_date
0,2020-05-13,2309
1,2020-05-31,2264
2,2020-06-02,1943
3,2020-05-27,1619
4,2020-07-05,1595
...,...,...
429,2020-10-04,34
430,2020-10-21,29
431,2020-07-25,28
432,2020-09-21,22


In [40]:
activity.to_csv(r'/Users/biancapallaro/Desktop/chile_activity.csv', index = False, header=True)

In [41]:
import altair as alt
alt.Chart(activity).mark_line().encode(
    x='index',
    y='day_date'
)

Group activity: the month when most of the messages were sent. 

In [42]:
activity_month = df_no_sm.month_date.value_counts()
activity_month = activity_month.to_frame()
activity_month= activity_month.reset_index()
activity_month

Unnamed: 0,index,month_date
0,06-2020,15919
1,05-2020,14586
2,01-2021,12899
3,04-2021,6359
4,02-2021,6188
5,06-2021,5610
6,07-2020,5396
7,05-2021,5171
8,03-2021,4679
9,12-2020,3744


In [43]:
import altair as alt
alt.Chart(activity_month).mark_line().encode(
    x='index',
    y='month_date'
)

Looking at the number of total messages:

In [44]:
df_no_sm.shape

(95865, 7)

Where do most forwarded messages come from?
The problem is that the user name on forwarded messages also contains the date and time of the original message.
As a quick fix, I used a regex to strip the digits so that each source collapses into one unique user. The dots and colons of the timestamp are left behind, as the output shows.

In [45]:
df_fowarded = df[df.fowarded == True]
df_fowarded['user'] = df_fowarded['user'].str.replace(r'\d+', '', regex=True)
df_fowarded.user.value_counts().head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_fowarded['user'] = df_fowarded['user'].str.replace(r'\d+', '', regex=True)


Q BUTCHER  .. ::                                                              4038
Fernando  .. ::                                                               1669
Noticias Rafapal  .. ::                                                        828
Warriorsystem  .. ::                                                           468
Rebelión NOM  .. ::                                                            341
Despertador De La Matrix  .. ::                                                340
We The People ⭐️⭐️⭐️🌩🌪🍷🐸- Elite Falling / Elite en Caida / Noticias  .. ::     323
VERDADES OCULTAS/ ACONTECER MUNDIAL  .. ::                                     317
@Unidos pQr Chile 🇨🇱  .. ::                                                    313
 SPEND  .. ::                                                                  249
Name: user, dtype: int64
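
Removing only the digits leaves the " .. ::" residue seen above and would also alter user names that contain numbers, such as dani2020. A sketch of a narrower pattern that removes just the trailing "dd.mm.yyyy hh:mm:ss" timestamp, assuming every forwarded user name ends with one (the third row uses a made-up timestamp):

```python
import pandas as pd

users = pd.Series([
    "MAJRM 26.04.2020 18:37:56",
    "Deleted Account 07.05.2020 21:35:10",
    "Q BUTCHER 22.06.2021 14:31:32",  # hypothetical timestamp
    "dani2020 01.06.2020 09:21:12",   # hypothetical timestamp
])

# Remove a trailing timestamp together with the whitespace before it
pattern = r"\s*\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}\s*$"
clean = users.str.replace(pattern, "", regex=True)
print(clean.tolist())
```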

Searching for the most forwarded messages

In [46]:
df_fowarded = df[(df.fowarded == True) & (df.message != 'Media not included')]
df_fowarded.head(50)

Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date
3,,True,MAJRM 26.04.2020 18:37:56,https://youtu.be/ugE-x0-5OME,NaT,,
5,,True,MAJRM 09.05.2020 00:10:22,https://youtu.be/DL35ZRhEaJU,NaT,,
7,,True,Deleted Account 07.05.2020 21:35:10,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",NaT,,
449,,True,GOMELO ANONYMUS 07.05.2020 20:01:51,el divulgados los saluda a todos y les comparte informacion para que sepan como va todo claro esta que no podemos confiarnos esta gente tienen muchas mañas para engañar a la humanidad hasta con fa...,NaT,,
461,,True,W 08.05.2020 18:21:09,Video de Euskaldun,NaT,,
473,,True,Deleted Account 07.05.2020 16:14:01,Video de darobroker,NaT,,
479,,True,Advaita 09.05.2020 12:11:45,Doctor White 2.0 (@FuckingDrWhite) Tweeted:➡️El norte de Italia se rebela: los milaneses salen en masa a tomar el aperitivo➡️La provincia de Alto Adigio ignora la orden del Gobierno y reabre sus n...,NaT,,
1033,,True,oʌɐʇsnƃ ɐɹɹɐd¯\_(ツ)_/¯ 10.05.2020 21:59:13,https://youtu.be/Buo9ZdyDu5I,NaT,,
1108,,True,Gonza 11.05.2020 14:19:09,Julian Assange 💪 aqui hay huevos. Ser humano en todas sus letras!,NaT,,
1295,,True,Deleted Account 11.05.2020 21:02:54,Les informamos a todos los miembros de este grupo que hoy en *60 minutos* exactamente levantaremos nuestra voz con el hashtag #covidgates666 En todas las redes sociales y el siguiente video,NaT,,


In [47]:
df_fowarded.message.value_counts().head(10)

Movimiento para la liberación planetaria🌐 @bioevolucion Con amor por la vida y la verdad.    27
Únete a @CONSPIRANOICOS                                                                      23
                                                                                             21
t.me/fufmedia                                                                                20
T.me/estructurascolapsando                                                                   16
#memes                                                                                       12
🤯🤯🤯🤯🤯🤯🤯🤯                                                                                     11
https://t.me/PLANDEMIA_MUNDIAL_COVID                                                         11
https://t.me/despertadordelamatrix                                                           11
Name: message, dtype: int64

Creating a dataframe with the urls shared on the group

In [48]:
df_no_sm['url'] = df_no_sm["message"].str.extract(r'(?P<url>https?://[^\s]+)')
df_no_sm.head(25)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_sm['url'] = df_no_sm["message"].str.extract(r'(?P<url>https?://[^\s]+)')


Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date,url
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09,05-2020,https://youtu.be/ugE-x0-5OME
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09,05-2020,https://youtu.be/DL35ZRhEaJU
6,09.05.2020 12:59:37,,MAJRM,"David Rockefeller es confrontado en Chile, lo curioso de esto es que en ningún medio de comunicación mostro esto a la luz. El sujeto que grabo y confronto a Rockefeller, despues de varios dias sup...",2020-05-09 12:59:37,2020-05-09,05-2020,
12,09.05.2020 13:05:09,,Q,Hola buenas tardes desde España!!,2020-05-09 13:05:09,2020-05-09,05-2020,
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16,2020-05-09,05-2020,https://www.youtube.com/watch?v=X8YYAGDPGaY
14,09.05.2020 13:08:38,,Claudia L,"A propósito de lo que dice la amiga en el video, les comparto un libro de Jean Michel Foucault “vigilar y castigar” donde se hace el comparativo del sistema penitenciario con el sistema educativo...",2020-05-09 13:08:38,2020-05-09,05-2020,
17,09.05.2020 13:09:43,,MAJRM,👍,2020-05-09 13:09:43,2020-05-09,05-2020,
19,09.05.2020 13:12:49,,Sebastian Brigman,hola saludos de Argentina,2020-05-09 13:12:49,2020-05-09,05-2020,
22,09.05.2020 13:13:41,,Sebastian Brigman,Cree esta imagen para ir pasandola,2020-05-09 13:13:41,2020-05-09,05-2020,
24,09.05.2020 13:16:32,,Claudia L,"Les mando este video, véanlo y traten de investigar...porque capaz que en los días que siguen igual nos sirva👀",2020-05-09 13:16:32,2020-05-09,05-2020,
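
str.extract keeps only the first match in each message, so a message with several links contributes a single URL to the column above. If every link should be counted, one option is str.findall followed by explode. A sketch with made-up messages:

```python
import pandas as pd

messages = pd.Series([
    "https://youtu.be/ugE-x0-5OME",
    "dos enlaces: https://t.me/fufmedia y http://example.com/a",  # hypothetical message
    "Hola buenas tardes",
])

# findall returns a list of matches per message; explode gives one row per URL.
# Messages without links produce NaN after explode and are dropped.
all_urls = messages.str.findall(r"https?://\S+").explode().dropna()
print(all_urls)
```

The index of all_urls still points back to the original message, so the URLs can be joined to the user and date columns if needed.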


In [49]:
urls = df_no_sm[df_no_sm['url'].notnull()]
urls.head()

Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date,url
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09,05-2020,https://youtu.be/ugE-x0-5OME
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09,05-2020,https://youtu.be/DL35ZRhEaJU
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16,2020-05-09,05-2020,https://www.youtube.com/watch?v=X8YYAGDPGaY
25,09.05.2020 13:16:35,,Claudia L,https://youtu.be/_ovP5ZPx6c8,2020-05-09 13:16:35,2020-05-09,05-2020,https://youtu.be/_ovP5ZPx6c8
29,09.05.2020 13:19:05,,Claudia L,https://www.elespanol.com/omicrono/tecnologia/20200508/perros-roboticos-patrullan-parques-singapur-aglomeraciones-personas/488452015_0.html,2020-05-09 13:19:05,2020-05-09,05-2020,https://www.elespanol.com/omicrono/tecnologia/20200508/perros-roboticos-patrullan-parques-singapur-aglomeraciones-personas/488452015_0.html


Number of messages that contain at least one url (only the first url in each message is extracted):

In [50]:
urls.shape

(13754, 8)

Create a new column with the domains:

In [51]:
# Work on an explicit copy so the new column is not written to a slice of df_no_sm
# (this avoids pandas' SettingWithCopyWarning)
urls = urls.copy()
# expand=False returns a Series for the single capture group
urls['domain'] = urls['url'].str.extract(r'^(?:.*://)?(?:www\.)?([^:/]*).*$', expand=False)
urls.head(200)


Unnamed: 0,date_time,fowarded,user,message,date_column,day_date,month_date,url,domain
2,09.05.2020 12:59:18,,MAJRM,https://youtu.be/ugE-x0-5OME,2020-05-09 12:59:18,2020-05-09,05-2020,https://youtu.be/ugE-x0-5OME,youtu.be
4,09.05.2020 12:59:32,,MAJRM,https://youtu.be/DL35ZRhEaJU,2020-05-09 12:59:32,2020-05-09,05-2020,https://youtu.be/DL35ZRhEaJU,youtu.be
13,09.05.2020 13:05:16,,Carlos,https://www.youtube.com/watch?v=X8YYAGDPGaY,2020-05-09 13:05:16,2020-05-09,05-2020,https://www.youtube.com/watch?v=X8YYAGDPGaY,youtube.com
25,09.05.2020 13:16:35,,Claudia L,https://youtu.be/_ovP5ZPx6c8,2020-05-09 13:16:35,2020-05-09,05-2020,https://youtu.be/_ovP5ZPx6c8,youtu.be
29,09.05.2020 13:19:05,,Claudia L,https://www.elespanol.com/omicrono/tecnologia/20200508/perros-roboticos-patrullan-parques-singapur-aglomeraciones-personas/488452015_0.html,2020-05-09 13:19:05,2020-05-09,05-2020,https://www.elespanol.com/omicrono/tecnologia/20200508/perros-roboticos-patrullan-parques-singapur-aglomeraciones-personas/488452015_0.html,elespanol.com
32,09.05.2020 13:31:30,,Claudia L,https://eluniversal.cl/contenido/10028/llegan-a-chile-los-nuevos-carros-de-carabineros-para-cuidar-las-protestas-en-mar,2020-05-09 13:31:30,2020-05-09,05-2020,https://eluniversal.cl/contenido/10028/llegan-a-chile-los-nuevos-carros-de-carabineros-para-cuidar-las-protestas-en-mar,eluniversal.cl
58,09.05.2020 18:42:55,,maarmapa,https://youtu.be/Xv4d6rTx27U,2020-05-09 18:42:55,2020-05-09,05-2020,https://youtu.be/Xv4d6rTx27U,youtu.be
60,09.05.2020 18:43:54,,maarmapa,https://instagram.com/mikloslukacs8?igshid=8qn2ccxlv9fv,2020-05-09 18:43:54,2020-05-09,05-2020,https://instagram.com/mikloslukacs8?igshid=8qn2ccxlv9fv,instagram.com
118,09.05.2020 19:15:59,,maarmapa,Despues desde mi punto de vista porque creo en kiu y porque tengo esperanza de que no sea un psyopY que realmente sea algo que sea beneficioso para tod@sParte por el entendimiento del project look...,2020-05-09 19:15:59,2020-05-09,05-2020,https://youtu.be/PGPqLX9XYbk,youtu.be
129,09.05.2020 19:18:50,,maarmapa,https://youtu.be/b5tMxBnWpac,2020-05-09 19:18:50,2020-05-09,05-2020,https://youtu.be/b5tMxBnWpac,youtu.be
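
As a cross-check on the regular expression above, the hostname can also be read with the standard library's URL parser. This is a minimal sketch, not the method used in this notebook; `extract_domain` is a hypothetical helper name, and it assumes that links without a scheme (such as `youtu.be/abc`) should still be parsed as web addresses.

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Return the hostname of a URL without a leading 'www.' (hypothetical helper)."""
    # urlparse only fills the network location when a scheme or '//' is present,
    # so prepend '//' to bare links such as 'youtu.be/abc'.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(extract_domain("https://www.youtube.com/watch?v=X8YYAGDPGaY"))
print(extract_domain("https://youtu.be/ugE-x0-5OME"))
```

It could be applied to the dataframe with `urls['url'].map(extract_domain)` and compared against the `domain` column.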


Saving the urls:

In [52]:
urls.to_csv(r'/Users/biancapallaro/Documents/Data_Studio/chile.csv', index = False, header=True)

The most common domains:

In [53]:
urls.domain.value_counts().head(50)

youtu.be                    3549
t.me                        1585
twitter.com                 1143
youtube.com                  686
facebook.com                 482
instagram.com                428
m.facebook.com               219
t.co                         160
thegatewaypundit.com         136
tierrapura.org               115
bles.com                     103
anarcolibertad.com            88
divulgaciontotal.com          87
rumble.com                    79
m.youtube.com                 78
mobile.twitter.com            72
infobae.com                   71
bitchute.com                  69
eldiestro.es                  65
qalerts.pub                   57
breitbart.com                 56
biobiochile.cl                56
actualidad.rt.com             53
lbry.tv                       49
gab.com                       47
nosmintieron.tv               46
trikooba.com                  45
es-mb.theepochtimes.com       44
euskalnews.com                44
google.com                    43
dailymail.
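
The counts above split some sites across several hostnames (`youtu.be`, `youtube.com`, and `m.youtube.com`; `facebook.com` and `m.facebook.com`; `twitter.com` and `mobile.twitter.com`). A rough sketch of how they could be merged before counting follows. The alias table is an assumption made for illustration and is not part of the original analysis; `t.co` is left alone because those short links may resolve to any site.

```python
from collections import Counter

# Hypothetical alias table: maps hostname variants to one canonical name.
ALIASES = {
    "youtu.be": "youtube.com",
    "m.youtube.com": "youtube.com",
    "m.facebook.com": "facebook.com",
    "mobile.twitter.com": "twitter.com",
}

def canonical_domain(domain):
    """Map a hostname variant to its canonical site; leave others unchanged."""
    return ALIASES.get(domain, domain)

# Counts copied from the output above (only the rows affected by the aliases, plus t.me).
raw_counts = {
    "youtu.be": 3549, "t.me": 1585, "twitter.com": 1143, "youtube.com": 686,
    "facebook.com": 482, "m.facebook.com": 219, "m.youtube.com": 78,
    "mobile.twitter.com": 72,
}

merged = Counter()
for domain, count in raw_counts.items():
    merged[canonical_domain(domain)] += count

print(merged.most_common())
```

On the full dataframe the same mapping could be applied with `urls['domain'].map(canonical_domain).value_counts()`.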

The most shared URLs:

In [54]:
urls.url.value_counts().head(50)

https://t.me/despertadordelamatrix    162
https://t.me/movimientoqanonperu

Looking at all the Telegram links:

In [55]:
urls[urls.domain == 't.me'].url.value_counts().head(10)

https://t.me/despertadordelamatrix                                 162
https://t.me/movimientoqanonperu                                    84
https://t.me/PLANDEMIA_MUNDIAL_COVID                                64
https://t.me/trumpintel                                             42
https://t.me/unidosporlaverdadQanonypatriotas@NancyQanon            28
https://t.me/LaVerdadNosHaraLibresOficial#LaVerdadNosHaráLibres     20
https://t.me/siemprelaverdad12                                      20
https://t.me/joinchat/JUC9BFRis1-GMm14qoQVdQ                        19
https://t.me/qanonlatinoamerica                                     18
https://t.me/altavozlibre                                           18
Name: url, dtype: int64
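
Some of the links above have trailing text attached (`@NancyQanon`, `#LaVerdadNosHaráLibres`), so the same channel may be counted under more than one URL. The sketch below keeps only the channel handle. It assumes that public handles consist of letters, digits, and underscores and can be compared case-insensitively, and it treats `joinchat` invite links as a separate case; `telegram_handle` is a hypothetical helper.

```python
import re

# Public channel handle right after t.me/ (letters, digits, underscore).
HANDLE_RE = re.compile(r"^(?:https?://)?t\.me/([A-Za-z0-9_]+)")

def telegram_handle(url):
    """Return the lowercased channel handle of a t.me link, or None for invite links."""
    match = HANDLE_RE.match(url)
    if not match or match.group(1) == "joinchat":
        return None
    return match.group(1).lower()

links = [
    "https://t.me/unidosporlaverdadQanonypatriotas@NancyQanon",
    "https://t.me/LaVerdadNosHaraLibresOficial#LaVerdadNosHaráLibres",
    "https://t.me/joinchat/JUC9BFRis1-GMm14qoQVdQ",
]
for link in links:
    print(telegram_handle(link))
```

Counting `urls[urls.domain == 't.me'].url.map(telegram_handle)` instead of the raw URL would group these variants under one channel.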

Clustering the messages

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Make a vectorizer
vectorizer = TfidfVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = spanish_stopwords,
                min_df=50,
                max_df=0.15)

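# Learn the vocabulary and compute TF-IDF weights for df_no_sm.message
matrix = vectorizer.fit_transform(df_no_sm.message)

# Convert the sparse TF-IDF matrix to a dense dataframe, one column per term
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names())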
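
To make the `min_df` and `max_df` settings concrete, the sketch below applies the same kind of document-frequency filter to a toy corpus in plain Python. It only illustrates the filtering rule (an integer `min_df` is an absolute document count, a float `max_df` is a share of all documents); it does not compute TF-IDF weights, and `surviving_terms` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def surviving_terms(tokenized_docs, min_df, max_df):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most max_df of all documents (proportion)."""
    n_docs = len(tokenized_docs)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

docs = [
    ["gracias", "por", "compartir"],
    ["gracias", "por", "la", "info"],
    ["chile", "por", "la", "verdad"],
    ["chile", "por", "siempre"],
]
# 'por' is in every document, so max_df=0.5 drops it;
# words seen in only one document fail min_df=2.
print(surviving_terms(docs, min_df=2, max_df=0.5))
```

With `min_df=50` and `max_df=0.15`, the vectorizer above keeps only terms that appear in at least 50 messages and in no more than 15% of them.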

In [57]:
words_df

Unnamed: 0,abajo,abierta,abierto,abogado,abogados,aborto,about,abrazo,abre,abril,abrir,absoluta,absolutamente,absoluto,abuso,abusos,aca,acaba,acabar,acabo,acabó,acaso,acceso,acciones,acción,aceptar,acerca,actividad,actividades,activos,acto,actor,actores,actos,actual,actuales,actualización,actualmente,actuar,acuerdo,acusaciones,acusado,acá,adelante,ademas,además,admin,administración,administrador,admite,...,éste,éxito,órdenes,órganos,última,últimas,último,últimos,únete,única,único,–,‘,’,“,”,❤️,🇨🇱,🇪🇸,🇵🇪,🇺🇸,🍿,🐸,👀,👆,👇,👉,👌,👍,👍👍👍,👏👏👏,💪,🔴,😁,😂,😂😂,😂😂😂,😂😂😂😂,😅,😉,😊,😱,😳,🙌,🙏,🚨,🤔,🤡,🤣,🤣🤣🤣
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.156341,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.079833,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.444088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
from sklearn.cluster import KMeans
number_of_clusters=5
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)

KMeans(n_clusters=5)

In [59]:
print("Top terms per cluster:")
# Sort each centroid's term weights from highest to lowest
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    # Keep the 15 highest-weighted terms of cluster i
    top_words = [terms[ind] for ind in order_centroids[i, :15]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: q si creo mas gente anon dice ahora trump igual solo bueno pasa así hace
Cluster 1: si claro verdad igual ahora alguien bueno así vi hace po puede pasa bien creo
Cluster 2: así si gente ahora trump mas solo verdad ser video creo hace jaja mismo bien
Cluster 3: gracias muchas 👍 información compartir ok si info buena voy dato 😊 mil excelente 🙏
Cluster 4: chile claro q si onu acá 🇨🇱 país china mas ahora verdad grupo aquí igual
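
The top terms of several clusters overlap heavily (`si`, `ahora`, `verdad`), so it may help to check how many messages fall into each cluster and to read a few of them. The helper below is a sketch in plain Python; `summarize_clusters` is a hypothetical name, and it assumes the labels come in the same order as the messages, as `km.labels_` would for `df_no_sm.message`.

```python
from collections import defaultdict

def summarize_clusters(messages, labels, n_examples=3):
    """Return {cluster: (size, first n_examples messages)} for aligned inputs."""
    grouped = defaultdict(list)
    for message, label in zip(messages, labels):
        grouped[int(label)].append(message)
    return {
        cluster: (len(items), items[:n_examples])
        for cluster, items in sorted(grouped.items())
    }

# Toy data standing in for df_no_sm.message and km.labels_.
toy_messages = ["muchas gracias", "gracias 👍", "chile 🇨🇱", "hola", "ok gracias"]
toy_labels = [3, 3, 4, 1, 3]
summary = summarize_clusters(toy_messages, toy_labels, n_examples=2)
for cluster, (size, examples) in summary.items():
    print(cluster, size, examples)
```

In the notebook this could be called as `summarize_clusters(df_no_sm.message, km.labels_)`. Because `KMeans` is created without a `random_state`, cluster numbering and membership can differ between runs.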


Trying to recognize entities in Spanish with spaCy:

In [60]:
!python -m spacy download es_core_news_sm

You should consider upgrading via the '/Users/biancapallaro/.pyenv/versions/3.8.2/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [61]:
#!pip install spacy
import pandas as pd
import spacy
import requests

nlp = spacy.load("es_core_news_sm")

pd.set_option("display.max_rows", 200)

In [62]:
text = '\n'.join(df_no_sm.message.values)

In [None]:
# Raise spaCy's default limit of 1,000,000 characters so the joined text can be processed
nlp.max_length = 11554070
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df_entity = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
df_entity.head(25)
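
Joining every message into one string requires raising `nlp.max_length` and loses the link between an entity and the message it came from. One alternative, sketched here under the assumption that each message can be analyzed on its own, is to stream the messages through `nlp.pipe`. `entities_by_message` is a hypothetical helper; it takes the loaded pipeline as an argument, so the block itself does not import spaCy.

```python
def entities_by_message(nlp, messages, batch_size=500):
    """Yield (message_index, text, label, lemma) for every entity found.

    `nlp` is expected to be a loaded spaCy pipeline, for example
    spacy.load("es_core_news_sm"); nlp.pipe processes texts in batches
    and yields one Doc per input, in order.
    """
    docs = nlp.pipe(messages, batch_size=batch_size)
    for index, doc in enumerate(docs):
        for ent in doc.ents:
            yield index, ent.text, ent.label_, ent.lemma_

# Possible use with the objects defined in this notebook:
# rows = list(entities_by_message(nlp, df_no_sm.message.tolist()))
# df_entity = pd.DataFrame(rows, columns=["message", "text", "type", "lemma"])
```

The extra `message` column holds the position of the message in the input list, which makes it possible to trace an entity back to its row.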

In [None]:
df_entity[df_entity.type == 'PER'].lemma.value_counts().head(30)

In [None]:
df_entity[df_entity.type == 'ORG'].lemma.value_counts().head(30)

In [None]:
df_entity[df_entity.type == 'LOC'].lemma.value_counts().head(30)

In [None]:
df_entity[df_entity.type == 'MISC'].lemma.value_counts().head(30)