# Twitter Followers by Climate Figures Lexicon Expansion
Logan Beskoon

In [58]:
import pandas as pd
import html
import re
from re import search
import emoji
import random
import numpy as np
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
from itertools import islice

from collections import Counter

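# English stopwords only; Spanish function words (de, la, el, ...) stay in the text.
# To also load Spanish, pass a list: stopwords.words(['english', 'spanish'])
sw = stopwords.words('english')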


For this notebook, we will take Twitter profile descriptions from the followers of two accounts, CFigueres and EcoSenseNow. The two occupy an interesting corner of the climate space: one calls for climate action now, while the other tries to make sense of the climate discussion and positions itself against fear. I will run a lexicon expansion, creating for each follower group a subgroup that is seeded with an education-related word. Then I will do a final group comparison of those two subgroups based on the lexicons I created.

This data was sourced from Twitter. I pulled the followers of two fairly well-known individuals with strong views on climate action. The account CFigueres belongs to Christiana Figueres. She is an internationally recognized leader on climate change and typically calls for strong action. The second account, EcoSenseNow, belongs to Patrick Moore. He considers himself a sensible environmentalist and is strongly against taking drastic action in the name of climate change. He is also the director of the CO2 Coalition (https://co2coalition.org/), which advocates “to engage in an informed and dispassionate discussion of climate change, humans’ role in the climate system, the limitations of climate models, and the consequences of mandated reductions in CO2 emissions”. Looking at the profile descriptions of the users who follow each figure, I would hypothesize that we can find an interesting contrast in word usage.

First, there are a few functions we are going to define. The important one is quick_compare, which we will use to build out the lexicon for each subgroup. group_compare will be called for our final group comparison.

In [59]:
def get_patterns(text, num_words):
    """
    This function takes a list of (already cleaned) tokens as input and
    returns a dictionary of descriptive statistics.
    """

    # Calculate the statistics
    total_tokens = len(text)
    unique_tokens = len(set(text))
    # token length vector
    text_token_len = [len(w) for w in text]
    avg_token_len = np.mean(text_token_len)
    lex_diversity = unique_tokens / total_tokens
    # the key is named 'top_10', but it holds the num_words most common tokens
    top_10 = Counter(text).most_common(num_words)

    # Now we'll fill out the dictionary.
    stats = {'tokens': total_tokens,
             'unique_tokens': unique_tokens,
             'avg_token_length': avg_token_len,
             'lexical_diversity': lex_diversity,
             'top_10': top_10}

    return stats
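
As a quick sanity check on what these statistics mean, here is a minimal sketch on a made-up five-token list. It uses only the standard library (`statistics.mean` instead of NumPy), and the toy tokens are an assumption for illustration, not data from the follower files.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy tokens, not taken from the follower data
tokens = "climate action now climate justice".split()

# Lexical diversity: unique tokens divided by total tokens (4 / 5)
lexical_diversity = len(set(tokens)) / len(tokens)

# Average token length: (7 + 6 + 3 + 7 + 7) / 5
avg_token_length = mean(len(t) for t in tokens)

# Most common tokens with their counts
top_words = Counter(tokens).most_common(2)

print(lexical_diversity, avg_token_length, top_words)
```

A lexical diversity close to 1 means almost every token is unique; values near 0 mean heavy repetition.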

In [60]:
def concentration(text, ratio_cutoff):
    """
    This function gets the concentration (share of all tokens) for every word
    that occurs more than ratio_cutoff times in a text.
    """
    con = {}
    fd = nltk.FreqDist(text)
    for word, count in fd.items():
        if count > ratio_cutoff:
            con_word = fd.freq(word)
            con.update({word: con_word})
        else:
            continue
    return(con)

In [61]:
def relative_ratios(con_dict1, con_dict2):
    """
    This function creates our relative ratios from two incoming concentration dicts.
    Only words present in both dicts get a ratio.
    """
    
    ratiocon = {}
    for word, con in con_dict1.items():
        if word in con_dict2.keys():
            ratio =(con_dict1[word])/(con_dict2[word])
            ratiocon.update({word:ratio})
        else: 
            continue
    return(ratiocon)
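
To make the two steps concrete, here is a small worked sketch on toy data. It mirrors the idea of `concentration` and `relative_ratios` with `collections.Counter` instead of `nltk.FreqDist`; the helper name `toy_concentration` and both token lists are made up for illustration.

```python
from collections import Counter

def toy_concentration(tokens, cutoff=0):
    """Share of all tokens taken up by each word seen more than `cutoff` times."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items() if count > cutoff}

# Hypothetical token lists standing in for two follower subgroups
group_a = "climate teacher teacher school climate teacher".split()
group_b = "climate business climate policy teacher business climate business".split()

con_a = toy_concentration(group_a)   # teacher: 3/6, climate: 2/6, school: 1/6
con_b = toy_concentration(group_b)   # teacher: 1/8, climate: 3/8, ...

# A ratio is only defined for words that appear in both groups
ratios = {word: con_a[word] / con_b[word] for word in con_a if word in con_b}
print(ratios)
```

Here "teacher" makes up half of group A but only one eighth of group B, so its ratio is 4. "school" never appears in group B, so it gets no ratio at all, which is worth keeping in mind: words exclusive to one group are invisible to this measure.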

In [62]:
def quick_compare(text1, text2, num_words=10, ratio_cutoff=5):
    """
    This function takes two corpora, and outputs a nested dictionary without text stats.
    
    """
    
    # our master nested dictionary
    results = {}
   
    
    # Now let's get our concentrations
    text1_con = concentration(text1, ratio_cutoff)
    text2_con = concentration(text2, ratio_cutoff)
    
    
    # Now we can get the top num_words relative frequencies. 
    text1_ratios = relative_ratios(text1_con, text2_con)
    text2_ratios = relative_ratios(text2_con, text1_con)
            
    #Text 1 relative ratio        
    sorted_dict1 = {}
    for word, ratio in sorted(text1_ratios.items(), key=lambda x: x[1], reverse=True):
        sorted_dict1[word] = ratio
    text1_final = dict(islice(sorted_dict1.items(), num_words)) 
    
    #Text 2 relative ratio
    sorted_dict2 = {}
    for word, ratio in sorted(text2_ratios.items(), key=lambda x: x[1], reverse=True):
        sorted_dict2[word] = ratio
    text2_final = dict(islice(sorted_dict2.items(), num_words))
    
    #add these ratios to our results dict
    results.update({"One_vs_two":text1_final , "Two_vs_one": text2_final})
    
    
    
    return(results)

In [63]:
def group_compare(text1, text2, num_words=10, ratio_cutoff=5):
    """
    This function takes two corpora, and outputs a nested dictionary.
    
    """
    
    # our master nested dictionary
    results = {}



    #apply stats function, put into results
    text1_stats = get_patterns(text1, num_words)
    text2_stats = get_patterns(text2, num_words)
    
    results = {"One": text1_stats, "Two": text2_stats}
   
    
    
    # Now let's get our concentrations
    text1_con = concentration(text1, ratio_cutoff)
    text2_con = concentration(text2, ratio_cutoff)
    
    
    # Now we can get the top num_words relative frequencies. 
    text1_ratios = relative_ratios(text1_con, text2_con)
    text2_ratios = relative_ratios(text2_con, text1_con)
            
    #Text 1 relative ratio        
    sorted_dict1 = {}
    for word, ratio in sorted(text1_ratios.items(), key=lambda x: x[1], reverse=True):
        sorted_dict1[word] = ratio
    text1_final = dict(islice(sorted_dict1.items(), num_words)) 
    
    #Text 2 relative ratio
    sorted_dict2 = {}
    for word, ratio in sorted(text2_ratios.items(), key=lambda x: x[1], reverse=True):
        sorted_dict2[word] = ratio
    text2_final = dict(islice(sorted_dict2.items(), num_words ))
    
    #add these ratios to our results dict
    results.update({"One_vs_two":text1_final , "Two_vs_one": text2_final})
    
    
    
    return(results)


## First Group

We will start with CFigueres' followers. Let's read the file in, keep only the rows that have a description, and clean it up.


In [64]:
pd.set_option('display.max_colwidth', None)

CF = pd.read_csv('CFigueres_followers.txt', encoding='utf8', header = 0, sep='\t', lineterminator='\r',
                   dtype={"screen_name": str,
                          "user_id": str,
                          "name": str,
                          "location": str,
                          "followers_count": str,
                          "friends_count": str,
                          "description": str})


CF_followers = CF.dropna(axis = 0, subset=['description'])
CF_followers

Unnamed: 0,screen_name,name,ID,Location,Followers_Count,Friends_Count,description
0,\nryangzepeda,Ryan Zepeda,8.113401e+17,,145.0,702.0,"Houston, Texas | Texas A&M Alum"
1,\nStefanoutdoors,Stefan Zeeman,1.326808e+18,,20.0,99.0,"All things wild and environmental, @carbonrewild"
3,\nkosickey,Karen O'Sickey,1.033585e+08,"Hiram Township, Ohio",42.0,98.0,"change agent, life-long learner; health nut; love the outdoors, technology, and the arts"
5,\nJackie_OPR,Jackie Simpson,1.048317e+18,"Scotland, United Kingdom",137.0,553.0,Founder @outletplayre | Play Enabler | Hitter of drums in @sambayabamba | Home stuff & work stuff 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | Views my own | She/Her
6,\nClara_Mottura,Clara,1.457754e+18,,1.0,65.0,Conscious Sustainability Enthusiast
...,...,...,...,...,...,...,...
158734,\ndavilalu,Luis Dávila,9.351621e+07,"Seattle, WA",2148.0,1845.0,Leading sustainability communications @Amazon. ex @Sunrun ex @UNFCCC. All tweets are my personal views. RT≠ endorsement.
158735,\nNaimavRF,Naima v Ritter Figue,2.480277e+08,,397.0,199.0,Transformational Coach and Space Facilitator. Head of Community & Wellbeing for Conscious Coliving
158736,\nmedinagomez,Sonia Medina,2.729387e+07,London,1867.0,1200.0,Executive Director Climate Change at @CIFFchild. WEF Young Global Leader. Mother. Internationalist. Views my own. RTs not endorsem.
158737,\nAfriRenBiomass,AfriRenBiomass,2.265923e+08,"London, Accra, Abidjan",380.0,198.0,"AfriRen provides European industry with stable, long term supplies of quality biomass. Follow us for all the latest news on Renewable Energy"


In [66]:
# Convert to correct data types
CF_followers = CF_followers.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)

#CF_followers.dtypes

Now we need to clean the descriptions up a bit. This includes tokenization and normalization. We can also see that there are emojis in the descriptions, and for this analysis we want to remove those as well.

In [67]:
CF_descs = CF_followers['description']
CF_clean_descs = []
for desc in CF_descs:
    temp = desc.lower()
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)   # drop @mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)   # drop #hashtags
    temp = re.sub(r"http\S+", "", temp)          # drop URLs
    temp = re.sub(r"www\.\S+", "", temp)
    temp = re.sub(r"[()|!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)         # drop bracketed text
    # keep only a-z and 0-9; this removes emojis and punctuation, and also
    # accented letters, which is why "educación" becomes "educaci n"
    temp = re.sub(r"[^a-z0-9]", " ", temp)
    temp = re.sub(r"[0-9]", " ", temp)           # drop digits
    temp = temp.split()
    temp = [w for w in temp if w not in sw]      # drop stopwords
    temp = " ".join(temp)
    CF_clean_descs.append(temp)


print(CF_clean_descs[:10])

['houston texas texas alum', 'things wild environmental', 'change agent life long learner health nut love outdoors technology arts', 'founder play enabler hitter drums home stuff work stuff views', 'conscious sustainability enthusiast', 'passionate places live love dish one spoon year banking career previous fishing lodge owner serving wainfleet councilor', 'roaming entrepreneur', 'need followers listen agree disagree think global friend female made brum oh european absolutely european marched', 'veces veces vengo llegando de un viaje en el tiempo soy parte de team pero digan nada', 'geographer urban ecologist w risa program fellow ej climate justice spatial politics climate adaptation floodplain mgmt']
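
One side effect of the `[^a-z0-9]` filter is that accented letters are replaced by spaces, so Spanish words get split into fragments. The sketch below shows the effect and one possible alternative, folding accents to their base letters before filtering. The helper `fold_accents` is hypothetical and is not used anywhere else in this notebook; the outputs shown in the remaining cells come from the unfolded cleaning.

```python
import re
import unicodedata

def fold_accents(text):
    """Decompose accented characters and drop the combining marks (ó -> o)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = "Educaci\u00f3n Ambiental"  # "Educación Ambiental"

# Filter as in the cleaning loop: the accented letter becomes a space
stripped = re.sub(r"[^a-z0-9]", " ", raw.lower()).split()

# Fold accents first, then apply the same filter
folded = re.sub(r"[^a-z0-9]", " ", fold_accents(raw.lower())).split()

print(stripped)  # ['educaci', 'n', 'ambiental']
print(folded)    # ['educacion', 'ambiental']
```

With folding, the stems used later (such as educaci) would still match as substrings, but stray tokens like n would no longer clutter the word lists.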


Our descriptions look pretty clean now. Let's insert them back into our dataframe in a new column called Clean.

In [68]:
CF_descs = CF_descs.to_frame()
CF_descs['Clean'] = CF_clean_descs
CF_descs.head()

Unnamed: 0,description,Clean
0,"Houston, Texas | Texas A&M Alum",houston texas texas alum
1,"All things wild and environmental, @carbonrewild",things wild environmental
3,"change agent, life-long learner; health nut; love the outdoors, technology, and the arts",change agent life long learner health nut love outdoors technology arts
5,Founder @outletplayre | Play Enabler | Hitter of drums in @sambayabamba | Home stuff & work stuff 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | Views my own | She/Her,founder play enabler hitter drums home stuff work stuff views
6,Conscious Sustainability Enthusiast,conscious sustainability enthusiast


### Lexicon GO
First we want to pick a seed word from the descriptions. To do this, we will pull 50 random descriptions and choose a word that plausibly defines some subgroup. Remember, we are looking for education-focused words.

In [69]:
CF_descs.sample(50)


Unnamed: 0,description,Clean
58760,Soy una persona humilde y devertido Me gustan los ANIMALES😚😚,soy una persona humilde devertido gustan los animales
33006,“The future belongs to those who believe in the beauty of their dreams.” — Eleanor Roosevelt,future belongs believe beauty dreams eleanor roosevelt
2212,"#blacklivesmatter, 🏳️‍🌈, walkable/transit-abundant cities, housing justice + historic preservation, Vermont native, he/him/his. World peace through music.",walkable transit abundant cities housing justice historic preservation vermont native world peace music
22092,Traveler moto tour adventure,traveler moto tour adventure
27197,"Learner , Nature love , Dreamer , Meditation., Heritage -",learner nature love dreamer meditation heritage
22817,We fund the best projects to help offset carbon impact. Want to make a difference? Join us in our mission to stop and reverse climate change today.,fund best projects help offset carbon impact want make difference join us mission stop reverse climate change today
59801,Supporting campaigners around the UK with Friends of the Earth. Occasional barman. Opinions are my own,supporting campaigners around uk friends earth occasional barman opinions
53114,"GWNET aims to advance the global energy transitions by empowering women in energy through interdisciplinary networking, advocacy, training and mentoring.",gwnet aims advance global energy transitions empowering women energy interdisciplinary networking advocacy training mentoring
84826,American Wop about people who suffer the destruction of beautiful America. #Author #Environment #green #fish #carpenter #NYC #Sail #Navy #ET #MBA,american wop people suffer destruction beautiful america
31030,ING eléctrica /CR🇨🇷,ing el ctrica cr


Let's choose a word and go. My choice is educaci (what remains of "educación" once the cleaning step strips the accented letter), as I am curious to see how this will form a lexicon. We'll need to create a lexicon list and put the word in it. Also, now is a good time to remember that we will get a lot of Spanish and Portuguese for this account.

In [70]:
start_word = 'educaci'
lexicon_list1 = []
lexicon_list1.append(start_word)

This is a crucial step. We want to create two groups of text, one that includes our start word and one that doesn't. Then we can take a look at the descriptions that include that word.

In [71]:
CF_group_with_word1 = CF_descs[CF_descs['Clean'].str.contains(start_word, case=False, na=False)]
CF_group_NO_word1 = CF_descs[~CF_descs['Clean'].str.contains(start_word, case=False, na=False)]

CF_group_with_word1.head()

Unnamed: 0,description,Clean
657,"Campus virtual de Educación Ambiental. Desde cualquier profesión, área de trabajo u oficio se puede ser parte de la lucha contra el #cambioclimático 🌎💻",campus virtual de educaci n ambiental desde cualquier profesi n rea de trabajo u oficio se puede ser parte de la lucha contra el tico
1304,"Periodista colombiano; experiencia en proyectos de cultura, educación y medio ambiente. Fotografo. Mas aprendiz que experto. Entre el diletante y el militante.",periodista colombiano experiencia en proyectos de cultura educaci n medio ambiente fotografo mas aprendiz que experto entre el diletante el militante
2832,Académica. Apasionada de los temas de educación superior universitaria.,acad mica apasionada de los temas de educaci n superior universitaria
3061,Profesional en Educación Física!! Yo ❤️ la EduFi,profesional en educaci n f sica yo la edufi
6506,"Hijo, Esposo, Padre. Pasión por la investigación. Asesor en Tecnologías y Educación. Interesado en el comportamiento humano.",hijo esposo padre pasi n por la investigaci n asesor en tecnolog educaci n interesado en el comportamiento humano


Now we have a subgroup with our starting word and a second subgroup without it. From there, we can do a group comparison of the most concentrated words to discover words important to our subgroup. Before we can run our comparison function, we want to make a list of words for each group.

In [72]:

with_word1 =' '.join([i for i in CF_group_with_word1['Clean']]).split()
no_word1 =' '.join([i for i in CF_group_NO_word1['Clean']]).split()
print(with_word1[:40])


['campus', 'virtual', 'de', 'educaci', 'n', 'ambiental', 'desde', 'cualquier', 'profesi', 'n', 'rea', 'de', 'trabajo', 'u', 'oficio', 'se', 'puede', 'ser', 'parte', 'de', 'la', 'lucha', 'contra', 'el', 'tico', 'periodista', 'colombiano', 'experiencia', 'en', 'proyectos', 'de', 'cultura', 'educaci', 'n', 'medio', 'ambiente', 'fotografo', 'mas', 'aprendiz', 'que']


In [73]:
quick_compare(with_word1, no_word1)

{'One_vs_two': {'superior': 37.5601904929888,
  'investigaci': 26.36266891409308,
  'tecnolog': 17.19172898186552,
  'acci': 15.997858913680416,
  'ncia': 14.894558298943833,
  'ii': 14.239852439649601,
  'england': 13.168969227724732,
  'cient': 12.341205447696318,
  'profesor': 12.154217486367587,
  'conservaci': 11.874699399845255},
 'Two_vs_one': {'sustainability': 4.304147400516224,
  'world': 3.2247370830838586,
  'business': 3.204132473966588,
  'working': 2.9994093111481197,
  'health': 2.881172589488019,
  'co': 2.5985620198137367,
  'environmental': 2.5929395743082218,
  'research': 2.549944402795458,
  'policy': 2.4964760484783026,
  'views': 2.4118724592977885}}

We want to look specifically at One_vs_two, since we put our with_word1 list in the first position. We can see that investigaci is highly concentrated, so let's pick that word, add it to our lexicon list, and run again.

In [74]:
lexicon_list1.append('investigaci')
lexicon_list1

['educaci', 'investigaci']
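
One round of this expansion (split on the lexicon, compare concentrations, inspect the top candidates) can be sketched as a single function. This is a simplified stand-in under stated assumptions, not the notebook's own code: `expansion_candidates` and the toy descriptions are hypothetical, matching is done by plain substring test, and the choice of which candidate to add is still left to the analyst.

```python
from collections import Counter

def expansion_candidates(descriptions, lexicon, min_count=0, top_n=5):
    """Rank words by how over-represented they are in descriptions matching the lexicon."""
    inside, outside = [], []
    for desc in descriptions:
        # A description joins the subgroup if it contains any lexicon term
        target = inside if any(term in desc for term in lexicon) else outside
        target.extend(desc.split())
    in_counts, out_counts = Counter(inside), Counter(outside)
    ratios = {}
    for word, count in in_counts.items():
        # Both groups must contain the word often enough for a ratio to exist
        if count > min_count and out_counts[word] > min_count:
            ratios[word] = (count / len(inside)) / (out_counts[word] / len(outside))
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical cleaned descriptions
toy_descs = [
    "profesor educaci n superior",
    "educaci n ambiental profesor",
    "profesor business policy",
    "business policy world",
    "business world superior",
]
candidates = expansion_candidates(toy_descs, ["educaci"])
print(candidates)
```

On this toy input "profesor" ranks first, because it takes up 2 of 8 tokens inside the subgroup but only 1 of 9 outside it.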

In [75]:
words ="|".join(word for word in lexicon_list1)
CF_group_with_word2 = CF_descs[CF_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
CF_group_NO_word2 = CF_descs[~CF_descs['Clean'].str.contains(words, case=False, na=False)]

CF_group_with_word2.head()

Unnamed: 0,description,Clean
657,"Campus virtual de Educación Ambiental. Desde cualquier profesión, área de trabajo u oficio se puede ser parte de la lucha contra el #cambioclimático 🌎💻",campus virtual de educaci n ambiental desde cualquier profesi n rea de trabajo u oficio se puede ser parte de la lucha contra el tico
1304,"Periodista colombiano; experiencia en proyectos de cultura, educación y medio ambiente. Fotografo. Mas aprendiz que experto. Entre el diletante y el militante.",periodista colombiano experiencia en proyectos de cultura educaci n medio ambiente fotografo mas aprendiz que experto entre el diletante el militante
2832,Académica. Apasionada de los temas de educación superior universitaria.,acad mica apasionada de los temas de educaci n superior universitaria
3061,Profesional en Educación Física!! Yo ❤️ la EduFi,profesional en educaci n f sica yo la edufi
4816,"Somos una fundación dedicada a la investigación, formación, asesoría y ejecución de planes, programas y proyectos en la búsqueda de enfrentar la desigualdad.",somos una fundaci n dedicada la investigaci n formaci n asesor ejecuci n de planes programas proyectos en la b squeda de enfrentar la desigualdad


In [76]:
with_word2 =' '.join([i for i in CF_group_with_word2['Clean']]).split()
no_word2 =' '.join([i for i in CF_group_NO_word2['Clean']]).split()
#print(with_word2[:100])

In [77]:
#runs our comparison 
quick_compare(with_word2, no_word2, num_words=20)

{'One_vs_two': {'docencia': 69.68123598611665,
  'centro': 33.791201788448134,
  'superior': 31.810999037140213,
  'cient': 20.63636604204224,
  'acci': 19.44005496714124,
  'fica': 18.16159874106232,
  'tecnolog': 15.58659226005241,
  'profesor': 12.92102389146534,
  'ncia': 12.614706514728015,
  'estado': 12.400897929732626,
  'ii': 12.060213920674036,
  'docente': 11.870476041882705,
  'organizaci': 11.47690945653686,
  'proyectos': 11.457652897046028,
  'conservaci': 11.432077778972262,
  'ambiental': 11.338075874401333,
  'england': 11.153246613631477,
  'cultura': 10.161846914642013,
  'ster': 9.380166382746474,
  'ciencia': 9.261430099420568},
 'Two_vs_one': {'sustainability': 5.0820345530344415,
  'world': 3.8075427618297,
  'business': 3.783214288443036,
  'working': 3.5414915753395846,
  'health': 3.4018859696295944,
  'environmental': 3.0615606958498556,
  'policy': 2.947663348534013,
  'views': 2.847769380340792,
  'phd': 2.7185018857345375,
  'co': 2.683649297455889,
  'su

In [78]:
lexicon_list1.append('docencia')
words ="|".join(word for word in lexicon_list1)  
CF_group_with_word3 = CF_descs[CF_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
CF_group_NO_word3 = CF_descs[~CF_descs['Clean'].str.contains(words, case=False, na=False)]
CF_group_with_word3

Unnamed: 0,description,Clean
657,"Campus virtual de Educación Ambiental. Desde cualquier profesión, área de trabajo u oficio se puede ser parte de la lucha contra el #cambioclimático 🌎💻",campus virtual de educaci n ambiental desde cualquier profesi n rea de trabajo u oficio se puede ser parte de la lucha contra el tico
1304,"Periodista colombiano; experiencia en proyectos de cultura, educación y medio ambiente. Fotografo. Mas aprendiz que experto. Entre el diletante y el militante.",periodista colombiano experiencia en proyectos de cultura educaci n medio ambiente fotografo mas aprendiz que experto entre el diletante el militante
2832,Académica. Apasionada de los temas de educación superior universitaria.,acad mica apasionada de los temas de educaci n superior universitaria
3061,Profesional en Educación Física!! Yo ❤️ la EduFi,profesional en educaci n f sica yo la edufi
3084,"Familia, docencia, biología, lectura, fotografía, aves",familia docencia biolog lectura fotograf aves
...,...,...
156070,Sembrando empatía 🌱 Educación - Biodiversidad - Ecosistemas - Cambio Climático,sembrando empat educaci n biodiversidad ecosistemas cambio clim tico
156102,El CIET fue creado en 1977 por un acuerdo entre la UNESCO y Venezuela.Una de las principales metas del CIET es promover la investigación científica en ecología,el ciet fue creado en por un acuerdo entre la unesco venezuela una de las principales metas del ciet es promover la investigaci n cient fica en ecolog
156485,intereses: Educación - Salud Holística-Relaciones Intls-Comunicación-yoga-cocina-Patrimonio cultural- bienestar animal-TEFL-ESL- 0% alcohol al volante,intereses educaci n salud hol stica relaciones intls comunicaci n yoga cocina patrimonio cultural bienestar animal tefl esl alcohol al volante
156892,"Biologa, M.C.en Ecol y Cienc. Amb. Máster en Agroecología. Investigación Acción Socio-Ecológica, Soberanía Alimentaria, Desarrollo Rural Sustentable",biologa c en ecol cienc amb ster en agroecolog investigaci n acci n socio ecol gica soberan alimentaria desarrollo rural sustentable


We can run our function to get the concentrations of interesting words one last time. 

In [79]:
with_word3 =' '.join([i for i in CF_group_with_word3['Clean']]).split()
no_word3 =' '.join([i for i in CF_group_NO_word3['Clean']]).split()
quick_compare(with_word3, no_word3, num_words=20)

{'One_vs_two': {'superior': 45.6330094352908,
  'centro': 33.193890899466055,
  'cient': 20.271586883754182,
  'acci': 19.096422426724956,
  'fica': 17.84056485900997,
  'tecnolog': 15.311075534209415,
  'profesor': 13.575820306999015,
  'ncia': 12.391722389755692,
  'estado': 12.181693196708984,
  'proyectos': 12.140538827801185,
  'ambiental': 12.13899127664404,
  'ii': 11.847031295700496,
  'docente': 11.660647322513467,
  'organizaci': 11.274037625189493,
  'conservaci': 11.229998415716096,
  'england': 10.956096015332776,
  'finanzas': 10.727162665758659,
  'turismo': 9.982220813969864,
  'cultura': 9.982220813969862,
  'ciencia': 9.91863978967706},
 'Two_vs_one': {'sustainability': 5.173483747282397,
  'world': 3.876057982259686,
  'business': 3.8512917276526712,
  'working': 3.605219310289231,
  'health': 3.4631015571269863,
  'environmental': 3.1166522651524504,
  'policy': 3.000705380288148,
  'views': 2.89901385979453,
  'phd': 2.7674202479411716,
  'co': 2.7319405011726947,


From looking at the ratios, we can add a few more words to fill out the lexicon. Admittedly, it's a lot harder to do this when it's in another language. 

In [80]:
lexicon_list1 = lexicon_list1 + ['superior', 'ambiental']
print('Our final lexicon for this subgroup:')
lexicon_list1

Our final lexicon for this subgroup:


['educaci', 'investigaci', 'docencia', 'superior', 'ambiental']

In [81]:
words ="|".join(word for word in lexicon_list1)  
CF_group_with_word_final = CF_descs[CF_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
CF_group_NO_word_final = CF_descs[~CF_descs['Clean'].str.contains(words, case=False, na=False)]
CF_with_word=' '.join([i for i in CF_group_with_word_final['Clean']]).split()
CF_no_word =' '.join([i for i in CF_group_NO_word_final['Clean']]).split()

##  Second Group

Now let's do the same for EcoSenseNow's followers. 

In [86]:
pd.set_option('display.max_colwidth', None)

ESN = pd.read_csv('EcoSenseNow_followers.txt', encoding='utf8', header = 0, sep='\t', on_bad_lines='skip', lineterminator='\r',
                   dtype={"screen_name": str,
                          "user_id": str,
                          "name": str,
                          "location": str,
                          "followers_count": str,
                          "friends_count": str,
                          "description": str})


ESN_followers = ESN.dropna(axis = 0, subset=['description'])
ESN_followers

Unnamed: 0,screen_name,name,ID,Location,Followers_Count,Friends_Count,description
0,\nnotfakhongs1,John Smith,1.456148e+18,,11.0,483.0,"Banned for saying Colin Powell should have been nixed for war crimes, been on twitter since 2013"
1,\nadios_mafia,Adios Mafia,8.245604e+17,North Haverbrook,1216.0,344.0,"Always blindly follow calls 'cos DYOR is boring, very interested in looking cool on social media, more important than irl success tbh. This account = very good"
2,\npgillam,paul gillam,4.604376e+07,"Peterborough, Ontario",439.0,1467.0,"My tweets are my own. I may not agree 100% with retweets. BBM Channel, Leadership Quotes C00069417"
3,\nDenningJimbo,Jimbo Denning,1.456665e+18,,36.0,208.0,Twit twitter twoo. It's amazing what you can get suspended for these days.
4,\nWayneWwoods,wayne woods,5.115411e+08,Alaska,1513.0,4093.0,Gold Star father. Disciple. Purveyor of the manly arts. IMI for life. Jesus Reigns!
...,...,...,...,...,...,...,...
100838,\nprometheusgreen,Val Giddings,1.151363e+09,,5104.0,5353.0,"ITIF life sci guru, keynote speaker, professional skeptic, biotech expert, policy wonk, beekeeper, lover of wilderness. will travel miles for dark night skies"
100839,\nDiscoveryFarms,Discovery Farms,3.347478e+08,"Pigeon Falls, Wisconsin",1126.0,1231.0,Program of @UWMadisonExt working with farmers across Wisconsin on water quality and farm management.
100840,\nLucas_MN_Dairy,Lucas Sjostrom,1.815640e+07,brooten mn,2009.0,1742.0,"Dairy farmer, exec director @mnmilk and @midwestdairy, tour guide @redheadcreamery. Husband of @amsjost. Tweets mine unless they're someone else's."
100841,\ncga_acg,Canadian Gas Association,2.736227e+08,,4765.0,2363.0,"Over 20 million Canadians in homes, businesses, schools, hospitals, and industry use affordable, clean, safe, and reliable natural gas energy solutions."


In [87]:
# Convert to correct data types
ESN_followers = ESN_followers.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
    
ESN_descs = ESN_followers['description']

ESN_clean_descs = []
for desc in ESN_descs:
    temp = desc.lower()
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)   # drop @mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)   # drop #hashtags
    temp = re.sub(r"http\S+", "", temp)          # drop URLs
    temp = re.sub(r"www\.\S+", "", temp)
    temp = re.sub(r"[()|!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)         # drop bracketed text
    temp = re.sub(r"[^a-z0-9]", " ", temp)       # keep only a-z and 0-9
    temp = re.sub(r"[0-9]", " ", temp)           # drop digits
    temp = temp.split()
    temp = [w for w in temp if w not in sw]      # drop stopwords
    temp = " ".join(temp)
    ESN_clean_descs.append(temp)


print(ESN_clean_descs[:10])

['banned saying colin powell nixed war crimes twitter since', 'always blindly follow calls cos dyor boring interested looking cool social media important irl success tbh account good', 'tweets may agree retweets bbm channel leadership quotes c', 'twit twitter twoo amazing get suspended days', 'gold star father disciple purveyor manly arts imi life jesus reigns', 'ear learner market', 'writer photographer psychologist retired teacher lover sciences art romance quantum physics nerd', 'pension r och sd sedan', 'sports news laugh political entertainment care opinions expect care mine xrp csc xdc', 'carbon based sombitch']


Let's put the clean Twitter descriptions back in our dataframe.

In [88]:
ESN_descs = ESN_descs.to_frame()
ESN_descs['Clean'] = ESN_clean_descs
ESN_descs.head()

Unnamed: 0,description,Clean
0,"Banned for saying Colin Powell should have been nixed for war crimes, been on twitter since 2013",banned saying colin powell nixed war crimes twitter since
1,"Always blindly follow calls 'cos DYOR is boring, very interested in looking cool on social media, more important than irl success tbh. This account = very good",always blindly follow calls cos dyor boring interested looking cool social media important irl success tbh account good
2,"My tweets are my own. I may not agree 100% with retweets. BBM Channel, Leadership Quotes C00069417",tweets may agree retweets bbm channel leadership quotes c
3,Twit twitter twoo. It's amazing what you can get suspended for these days.,twit twitter twoo amazing get suspended days
4,Gold Star father. Disciple. Purveyor of the manly arts. IMI for life. Jesus Reigns!,gold star father disciple purveyor manly arts imi life jesus reigns


In [111]:
ESN_descs.sample(50)

Unnamed: 0,description,Clean
23981,"Just another well balanced community minded guy, who loves animals, hugs trees and believes in ‘Equal Rights And Justice for All Earthbound Inhabitants’.",another well balanced community minded guy loves animals hugs trees believes equal rights justice earthbound inhabitants
38940,I am single mother of beautiful boy.... I am simple loyal and honest to move with,single mother beautiful boy simple loyal honest move
41953,• Co-founder @HumaneTech_ • Former Google Design Ethicist • Featured in Netflix's @SocialDilemma_ • #TIME100Next • Host #YourUndividedAttention podcast,co founder former google design ethicist featured netflix host podcast
32766,"Carbonate geologist | Passionate about ➡ motorcycles 🏍 , snowmobiles, post-hardcore/metalcore 🎶, #Ravens",carbonate geologist passionate motorcycles snowmobiles post hardcore metalcore
26792,A perfect contradiction Book Devourer,perfect contradiction book devourer
92439,"Space Shuttle Door Gunner (Ret.), Chicken Gun Firing Coordinator(Ret.), Conservative not Republican",space shuttle door gunner ret chicken gun firing coordinator ret conservative republican
24127,Saskatchewan Roughriders Sect.618,saskatchewan roughriders sect
99510,Calling All Ears for Prosperity & Sustainable Government. My tweets for policy & politics. Views tweeted are strictly my own & not attributable to any cows.,calling ears prosperity sustainable government tweets policy politics views tweeted strictly attributable cows
42294,. #Blacklivesmatter,
87306,Wonderer.,wonderer


Since these followers mostly write in English, this round will be easier. That said, it is notably hard to find education-focused words in the sample. Interestingly, we do see a lot of striking words, some not so surprising for followers of a figure who argues against drastic climate action.

In [112]:
start_word = 'academic'
lexicon_list2 = []
lexicon_list2.append(start_word)

In [113]:
ESN_group_with_word1 = ESN_descs[ESN_descs['Clean'].str.contains(start_word, case=False, na=False)]
ESN_group_NO_word1 = ESN_descs[~ESN_descs['Clean'].str.contains(start_word, case=False, na=False)]

ESN_group_with_word1.head()

Unnamed: 0,description,Clean
965,Academic interests: #Extremism and #CT | @BrockUniversity #Philosophy | Found with espresso and never enough books | @HdxAcademy | Podcast🎙: @Modesofinquiry.,academic interests found espresso never enough books podcast
1270,"Academic, golfer, husband, father, loves data",academic golfer husband father loves data
1708,"Disgracefully growing old hedonist, urban cyclist, business professional, academic & writer.",disgracefully growing old hedonist urban cyclist business professional academic writer
4078,"Academic, PhD. Superpower: reading the primary source data, studies, and documents.",academic phd superpower reading primary source data studies documents
6200,"Former Greater Manchester firefighter, FE and HE lecturer in politics, social science and public services. Academic freedom and free speech.",former greater manchester firefighter fe lecturer politics social science public services academic freedom free speech


In [114]:
with_word1 =' '.join([i for i in ESN_group_with_word1['Clean']]).split()
no_word1 =' '.join([i for i in ESN_group_NO_word1['Clean']]).split()
print(with_word1[:200])



['academic', 'interests', 'found', 'espresso', 'never', 'enough', 'books', 'podcast', 'academic', 'golfer', 'husband', 'father', 'loves', 'data', 'disgracefully', 'growing', 'old', 'hedonist', 'urban', 'cyclist', 'business', 'professional', 'academic', 'writer', 'academic', 'phd', 'superpower', 'reading', 'primary', 'source', 'data', 'studies', 'documents', 'former', 'greater', 'manchester', 'firefighter', 'fe', 'lecturer', 'politics', 'social', 'science', 'public', 'services', 'academic', 'freedom', 'free', 'speech', 'kentucky', 'born', 'buried', 'see', 'digital', 'wasteland', 'people', 'wanted', 'academic', 'college', 'happened', 'pronouns', 'dead', 'jim', 'academic', 'secular', 'humanist', 'ex', 'academic', 'wife', 'mother', 'libertarian', 'monotonously', 'alike', 'great', 'tyrants', 'conquerors', 'gloriously', 'different', 'saints', 'cs', 'lewis', 'ex', 'academic', 'ex', 'burger', 'flipper', 'permaculture', 'gardener', 'flamencomane', 'tyrannophobe', 'erratic', 'midnight', 'gab', '

In [115]:
quick_compare(with_word1, no_word1, num_words=20)

{'One_vs_two': {'netherlands': 34.755077422034134,
  'columbia': 26.14057105247012,
  'calgary': 11.028053412760832,
  'alberta': 7.98550081759531,
  'london': 6.742607612740308,
  'england': 6.620014747054121,
  'ca': 6.559169023276049,
  'msc': 6.535142763117529,
  'california': 5.706057487199633,
  'usa': 4.941368150479683,
  'education': 4.207768807384635,
  'canada': 4.191950127657625,
  'united': 3.8616752691149037,
  'british': 3.6540583191624894,
  'engineering': 2.9984772677833367,
  'g': 2.9183652797127895,
  'uk': 2.593159846411462,
  'books': 2.5151700765945755,
  'van': 2.48481054920764,
  'p': 2.37246539139772},
 'Two_vs_one': {'father': 2.6905813645829775,
  'proud': 2.560776916686582,
  'family': 2.3005140942041358,
  'conservative': 2.1948395411559867,
  'people': 2.140955981928465,
  'love': 2.026367905079796,
  'like': 1.94347012165284,
  'fan': 1.910337836043684,
  'god': 1.8911559864804888,
  'dad': 1.8519203851012256,
  'husband': 1.732251800894473,
  'good': 1.65

Let's pick a couple of words. Interestingly, this group was producing a lot of numbers, so I removed numbers from the text.

In [116]:
lexicon_list2.append('netherlands')

lexicon_list2

['academic', 'netherlands']

In [117]:
words ="|".join(word for word in lexicon_list2)
ESN_group_with_word2 = ESN_descs[ESN_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
ESN_group_NO_word2 = ESN_descs[~ESN_descs['Clean'].str.contains(words, case=False, na=False)]

ESN_group_with_word2.head()




Unnamed: 0,description,Clean
965,Academic interests: #Extremism and #CT | @BrockUniversity #Philosophy | Found with espresso and never enough books | @HdxAcademy | Podcast🎙: @Modesofinquiry.,academic interests found espresso never enough books podcast
1270,"Academic, golfer, husband, father, loves data",academic golfer husband father loves data
1708,"Disgracefully growing old hedonist, urban cyclist, business professional, academic & writer.",disgracefully growing old hedonist urban cyclist business professional academic writer
3812,Texas ➡️ Ireland ➡️ Netherlands. Also tweets for @b4place,texas ireland netherlands also tweets
4078,"Academic, PhD. Superpower: reading the primary source data, studies, and documents.",academic phd superpower reading primary source data studies documents


In [118]:
with_word2 =' '.join([i for i in ESN_group_with_word2['Clean']]).split()
no_word2 =' '.join([i for i in ESN_group_NO_word2['Clean']]).split()
print(with_word2[:100])

['academic', 'interests', 'found', 'espresso', 'never', 'enough', 'books', 'podcast', 'academic', 'golfer', 'husband', 'father', 'loves', 'data', 'disgracefully', 'growing', 'old', 'hedonist', 'urban', 'cyclist', 'business', 'professional', 'academic', 'writer', 'texas', 'ireland', 'netherlands', 'also', 'tweets', 'academic', 'phd', 'superpower', 'reading', 'primary', 'source', 'data', 'studies', 'documents', 'director', 'netherlands', 'gazelle', 'global', 'interests', 'people', 'recruitment', 'sourcing', 'ict', 'sales', 'agile', 'scrum', 'politics', 'wife', 'kids', 'former', 'greater', 'manchester', 'firefighter', 'fe', 'lecturer', 'politics', 'social', 'science', 'public', 'services', 'academic', 'freedom', 'free', 'speech', 'kentucky', 'born', 'buried', 'see', 'digital', 'wasteland', 'people', 'wanted', 'academic', 'college', 'happened', 'pronouns', 'dead', 'jim', 'academic', 'secular', 'humanist', 'ex', 'academic', 'wife', 'mother', 'libertarian', 'monotonously', 'alike', 'great', 

In [119]:
quick_compare(with_word2, no_word2, num_words=40)

{'One_vs_two': {'dennis': 32.5325837428867,
  'ryan': 26.557211218683022,
  'greg': 24.786730470770824,
  'belgium': 23.660060903917604,
  'jay': 22.30805742369374,
  'queensland': 21.24576897494642,
  'williams': 18.59004785307812,
  'kevin': 18.590047853078115,
  'brian': 18.590047853078115,
  'columbia': 17.428169862260734,
  'daniel': 17.350711329539575,
  'rob': 15.934326731209815,
  'chris': 15.147446398804393,
  'steve': 14.606466170275663,
  'ct': 14.458926107949647,
  'mike': 13.83445421624418,
  'david': 13.409214844843232,
  'michael': 12.657053857414887,
  'matt': 12.393365235385412,
  'scott': 12.393365235385412,
  'nederland': 12.028854493168193,
  'eric': 11.830030451958802,
  'tim': 11.538650391565728,
  'ontario': 11.15402871184687,
  'cape': 10.62288448747321,
  'tony': 10.140026101678973,
  'peter': 10.140026101678972,
  'tx': 9.442563988865075,
  'england': 9.152023558438456,
  'richard': 9.043807063659624,
  'calgary': 8.988594566323485,
  'toronto': 8.637800012541

In [120]:
lexicon_list2.append('england')
words ="|".join(word for word in lexicon_list2)  
ESN_group_with_word3 = ESN_descs[ESN_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
ESN_group_NO_word3 = ESN_descs[~ESN_descs['Clean'].str.contains(words, case=False, na=False)]
ESN_group_with_word3

Unnamed: 0,description,Clean
611,"On a winters afternoon. In a secluded chapel, history is now and England. Traditionalist...",winters afternoon secluded chapel history england traditionalist
965,Academic interests: #Extremism and #CT | @BrockUniversity #Philosophy | Found with espresso and never enough books | @HdxAcademy | Podcast🎙: @Modesofinquiry.,academic interests found espresso never enough books podcast
999,Guatemala - Texas A&M - Real Madrid - New England Patriots - Options Trader,guatemala texas real madrid new england patriots options trader
1052,"Leeds,West Riding of Yorkshire,England,UK,centre right.......lover of rock music!.......👺🤘🎸",leeds west riding yorkshire england uk centre right lover rock music
1270,"Academic, golfer, husband, father, loves data",academic golfer husband father loves data
...,...,...
97733,"Haydon Manning is a “retired” academic; for four decades he taught Australian politics and electoral behaviour at Flinders University, South Australia.",haydon manning retired academic four decades taught australian politics electoral behaviour flinders university south australia
97989,Scientific Research Publishing (SCIRP) is an academic publisher of open access journals. It also publishes academic books and conference proceedings.,scientific research publishing scirp academic publisher open access journals also publishes academic books conference proceedings
99376,"Father, @TDEM meteorologist, skier, hockey player, #FSU alumnus, native New Englander, naturalized Texan. #IamSecond #TXwx #Noles #skiing",father meteorologist skier hockey player alumnus native new englander naturalized texan
100594,Scientist. Dad to 4. Recovering academic. Anti-Malthusian. Pro-humanity. Climate realist. Anti-'Woke'. Victorious Brexiteer. Neurodiverse. Each life matters.,scientist dad recovering academic anti malthusian pro humanity climate realist anti woke victorious brexiteer neurodiverse life matters
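
One thing to watch in the split above: `str.contains` does substring matching, so the pattern `england` also picks up "native new englander" (row 99376). A minimal sketch of a whole-word alternative follows, assuming a small made-up DataFrame in place of `ESN_descs`; the helper name `lexicon_pattern` is hypothetical and not part of the notebook.

```python
import re
import pandas as pd

def lexicon_pattern(lexicon):
    """Build a regex that matches any lexicon word as a whole token."""
    escaped = (re.escape(word) for word in lexicon)
    return r"\b(?:" + "|".join(escaped) + r")\b"

# Made-up rows standing in for ESN_descs['Clean'] (assumption for illustration).
demo = pd.DataFrame({"Clean": [
    "academic golfer husband father loves data",
    "native new englander naturalized texan",
    "leeds west riding yorkshire england uk",
    None,
]})

pattern = lexicon_pattern(["academic", "england"])
mask = demo["Clean"].str.contains(pattern, case=False, na=False, regex=True)

with_word = demo[mask]   # rows containing a lexicon word as a full token
no_word = demo[~mask]    # everything else, including missing descriptions
print(with_word["Clean"].tolist())
```

Word boundaries only remove partial matches. A description like "new england patriots" (row 999) would still match, because `england` is a full token there.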


In [121]:
with_word3 =' '.join([i for i in ESN_group_with_word3['Clean']]).split()
no_word3 =' '.join([i for i in ESN_group_NO_word3['Clean']]).split()
quick_compare(with_word3, no_word3, num_words=40)


{'One_vs_two': {'greg': 36.1397424454928,
  'kevin': 33.73042628245994,
  'anthony': 32.12421550710471,
  'stuart': 24.09316163032853,
  'ryan': 24.09316163032853,
  'todd': 24.09316163032853,
  'craig': 24.09316163032853,
  'brian': 21.902874209389573,
  'jon': 21.41614367140314,
  'dennis': 21.081516426537465,
  'eric': 19.575693824641935,
  'williams': 19.274529304262824,
  'belgium': 19.274529304262824,
  'jay': 18.739125712477747,
  'douglas': 18.739125712477747,
  'pennsylvania': 18.069871222746396,
  'columbia': 17.84678639283595,
  'matt': 17.20940116452038,
  'daniel': 16.679881128688983,
  'queensland': 16.679881128688983,
  'denmark': 16.062107753552354,
  'michael': 14.826561003279096,
  'ray': 14.455896978197117,
  'mike': 14.325663672087236,
  'murray': 14.05434428435831,
  'ct': 13.5524034170598,
  'steve': 13.492170512983979,
  'singapore': 13.141724525633743,
  'sean': 13.141724525633743,
  'rob': 12.973240877869207,
  'david': 12.938920134806061,
  'brisbane': 12.7552

With the final comparison for our second subgroup, let's choose our last words. As with the first group, I'm finding a strange occurrence of a lot of names.
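
One plausible reason for the names: when the "with" group is small, a word that shows up only a handful of times there and rarely elsewhere gets a huge ratio. `quick_compare` is defined earlier in the notebook and is not reproduced here, so the following is only a rough sketch of one way to compute such a ratio with a minimum-count filter. The function name `ratio_compare`, the `min_count` parameter, and the add-one smoothing are all assumptions.

```python
from collections import Counter

def ratio_compare(tokens_one, tokens_two, num_words=10, min_count=5):
    """Rank words by how much more frequent they are in corpus one than two.

    Hypothetical helper: this is not the notebook's quick_compare.
    """
    counts_one = Counter(tokens_one)
    counts_two = Counter(tokens_two)
    total_one = len(tokens_one)
    total_two = len(tokens_two)

    ratios = {}
    for word, count in counts_one.items():
        if count < min_count:
            continue  # skip words too rare to give a stable ratio
        rate_one = count / total_one
        # Add-one smoothing avoids division by zero for unseen words.
        rate_two = (counts_two[word] + 1) / (total_two + 1)
        ratios[word] = rate_one / rate_two

    top = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return dict(top[:num_words])

# Toy corpora (made up): 'dennis' appears once and is filtered out.
one = ["academic"] * 6 + ["phd"] * 5 + ["dennis"] * 1 + ["love"] * 8
two = ["love"] * 60 + ["phd"] * 4 + ["father"] * 36
result = ratio_compare(one, two, num_words=2, min_count=5)
print(result)
```

Raising `min_count` trades recall for stability: rare but genuinely distinctive words drop out along with the one-off names.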

In [122]:
lexicon_list2 = lexicon_list2 + ['belgium', 'pennsylvania']
print('Our final lexicon for our subgroup 2 we determined')
lexicon_list2

Our final lexicon for our subgroup 2 we determined


['academic', 'netherlands', 'england', 'belgium', 'pennsylvania']

In [128]:
# Build one regex that matches any word in the lexicon.
words = "|".join(lexicon_list2)
ESN_group_with_word_final = ESN_descs[ESN_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
ESN_group_NO_word_final = ESN_descs[~ESN_descs['Clean'].str.contains(words, case=False, na=False, regex=True)]
# Flatten each group's cleaned descriptions into a single token list.
ESN_with_word = ' '.join(ESN_group_with_word_final['Clean']).split()
ESN_no_word = ' '.join(ESN_group_NO_word_final['Clean']).split()
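
The split-and-flatten steps repeated in the cells above could be wrapped in one helper so each expansion round is a single call. Here's a minimal sketch, assuming a DataFrame with a `Clean` column like `ESN_descs`; the name `split_by_lexicon` and the sample rows are made up for illustration.

```python
import pandas as pd

def split_by_lexicon(descs, lexicon, column="Clean"):
    """Return (tokens_with, tokens_without) for rows matching any lexicon word."""
    pattern = "|".join(lexicon)
    mask = descs[column].str.contains(pattern, case=False, na=False, regex=True)
    # dropna() guards against missing descriptions, which str.join cannot handle.
    tokens_with = " ".join(descs.loc[mask, column].dropna()).split()
    tokens_without = " ".join(descs.loc[~mask, column].dropna()).split()
    return tokens_with, tokens_without

# Made-up rows standing in for the real follower descriptions.
demo = pd.DataFrame({"Clean": [
    "academic phd superpower reading",
    "texas ireland netherlands also tweets",
    "proud father conservative",
    None,
]})

with_tokens, no_tokens = split_by_lexicon(demo, ["academic", "netherlands"])
print(len(with_tokens), len(no_tokens))
```

Each round would then be: append a word to the lexicon, call the helper, and pass the two token lists to the comparison function.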

## The final group comparison

Now that we have our lexicon of words, we can split our corpus into two groups: one whose descriptions contain at least one of those words, and one without any of them.


In [124]:
group_compare(CF_with_word, ESN_with_word, num_words=30)

{'One': {'tokens': 31711,
  'unique_tokens': 11004,
  'avg_token_length': 6.163634070196462,
  'lexical_diversity': 0.34700892434801806,
  'top_10': [('de', 1103),
   ('n', 829),
   ('en', 612),
   ('ambiental', 576),
   ('la', 561),
   ('el', 233),
   ('educaci', 178),
   ('e', 169),
   ('ambientalista', 144),
   ('del', 138),
   ('climate', 131),
   ('para', 119),
   ('por', 116),
   ('ambientales', 113),
   ('tica', 103),
   ('desarrollo', 102),
   ('los', 98),
   ('rica', 97),
   ('gesti', 96),
   ('pol', 93),
   ('tico', 91),
   ('con', 91),
   ('social', 88),
   ('un', 85),
   ('es', 85),
   ('investigaci', 84),
   ('que', 81),
   ('london', 81),
   ('clim', 80),
   ('las', 76)]},
 'Two': {'tokens': 24634,
  'unique_tokens': 11245,
  'avg_token_length': 6.327230656815783,
  'lexical_diversity': 0.45648290979946415,
  'top_10': [('england', 162),
   ('canada', 125),
   ('usa', 123),
   ('new', 95),
   ('life', 81),
   ('love', 74),
   ('alberta', 73),
   ('academic', 70),
   ('con

And now we have our final group comparison for the two subgroups, one from each group of followers' descriptions, based on the education lexicon. Remember, our first group had some good education words. Our second, not so much.
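
The `lexical_diversity` figures in the output above can be checked by hand: 11004 / 31711 ≈ 0.347 for group one and 11245 / 24634 ≈ 0.456 for group two, so it is unique tokens divided by total tokens. The sketch below computes the same kind of summary. `group_compare` itself is defined earlier in the notebook, so the function name `describe_tokens` and the reading of average token length as mean characters per token are assumptions.

```python
from collections import Counter

def describe_tokens(tokens, num_words=3):
    """Summarize a token list: size, vocabulary, diversity, top words."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "unique_tokens": len(counts),
        # Assumed definition: mean number of characters per token.
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
        "lexical_diversity": len(counts) / len(tokens),
        "top_words": counts.most_common(num_words),
    }

# Made-up sample, not data from the notebook.
sample = "climate action climate policy education climate".split()
stats = describe_tokens(sample, num_words=1)
print(stats)

# Ratios recomputed from the counts printed by group_compare above.
print(round(11004 / 31711, 3), round(11245 / 24634, 3))
```

Because this ratio tends to fall as a corpus grows, part of the gap between the two subgroups may reflect their different sizes rather than vocabulary alone.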

As we can see from the comparison above, there are more tokens in our first group, where climate action is more popular (31,711 vs. 24,634). We also see a lot of climate action words present. However, average token length and lexical diversity are both higher for the second group. A big caveat is that the first group contains a lot of Spanish, even though I used Spanish stopwords.
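
Two things in the output hint at why the Spanish is still there. First, `de`, `la`, `en`, and `el` top the list, so the Spanish stopwords do not seem to have been applied. In NLTK, the second positional argument of `stopwords.words` is `ignore_lines_startswith`, not a second language, so `stopwords.words('english', 'spanish')` would load English only. Second, fragments such as `educaci`, `n`, `gesti`, and `pol` suggest the cleaning step dropped accented characters, splitting `educación` into `educaci` and `n`. The cleaning code is not shown in this section, so that part is a guess. Here's a sketch of an accent-preserving tokenizer; the name `tokenize_keep_accents` and the tiny stopword set are illustrative assumptions, not NLTK's list.

```python
import re
import unicodedata

# Tiny illustrative stopword set (assumption). With NLTK data installed, a
# combined list could be built as:
#   stopwords.words('english') + stopwords.words('spanish')
STOPWORDS = {"de", "la", "el", "en", "y", "the", "and", "of"}

def tokenize_keep_accents(text, stopwords=STOPWORDS):
    """Lowercase, keep Unicode letters (accents included), drop stopwords."""
    # NFC merges a base letter and a combining accent into one character.
    text = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w not in stopwords]

tokens = tokenize_keep_accents(
    "Educación ambiental y gestión de la política climática"
)
print(tokens)
```

With accents kept, words like `educación` stay whole, and a Spanish stopword list has a chance of matching the tokens it is meant to remove.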

I find some of the relatively high-ranking words in the second group especially interesting, such as father, proud, usa, and free, particularly since I started these groups from education words.

Just for fun, I'll end with the subgroups that weren't formed by the education-focused start words.

In [129]:
group_compare(CF_no_word, ESN_no_word, num_words=30)

{'One': {'tokens': 1111897,
  'unique_tokens': 96049,
  'avg_token_length': 6.434593312150316,
  'lexical_diversity': 0.08638300130317826,
  'top_10': [('de', 13031),
   ('climate', 11859),
   ('views', 6547),
   ('energy', 6308),
   ('change', 6294),
   ('la', 6127),
   ('en', 5820),
   ('n', 5077),
   ('world', 4605),
   ('sustainable', 4496),
   ('environmental', 4419),
   ('sustainability', 4313),
   ('development', 4031),
   ('social', 3972),
   ('global', 3940),
   ('tweets', 3771),
   ('director', 3694),
   ('policy', 3207),
   ('environment', 3193),
   ('working', 3005),
   ('science', 2948),
   ('founder', 2945),
   ('love', 2891),
   ('people', 2811),
   ('life', 2799),
   ('business', 2752),
   ('el', 2711),
   ('international', 2663),
   ('co', 2603),
   ('research', 2542)]},
 'Two': {'tokens': 565783,
  'unique_tokens': 61699,
  'avg_token_length': 6.095663885270501,
  'lexical_diversity': 0.10905064309107909,
  'top_10': [('love', 3296),
   ('life', 2754),
   ('conservati

I'll just leave that right there. That's probably more of a comparison of the overall followers...