# PREPROCESADO TWEETS ABRIL 2019

Somos conscientes de que hay muchas cosas por hacer en este dataset para sacarle el máximo partido, y también de la limitación temporal que tenemos. 

Por ello, decidimos plantear una limpieza preliminar para un análisis de sentimiento que tendrá que ser mejorado en versiones futuras. 

Como objetivo fundamental se trata de rescatar el mayor número posible de tweets con localización por lo menos a nivel de comunidad para poder realizar el cruce con los datos del INE.


In [2]:
# Imports
import os
import re
import numpy as np
import pandas as pd
from collections import Counter
from nltk import ngrams
from nltk.probability import FreqDist
from tqdm import tqdm
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
import boto3

BUCKET_NAME = 'electomedia' 

# sustituir por credenciales de acceso. 
s3 = boto3.resource('s3', aws_access_key_id = 'XXXXXXX', 
                          aws_secret_access_key= 'XXXXXXXXX')

In [4]:
import botocore.exceptions

KEY = 'EstimacionOtrasFuentes/tweets_df_elecciones04_19.csv' 

try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'tweets_df_elecciones04_19.csv')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

In [5]:
df_abril_19= pd.read_csv('tweets_df_elecciones04_19.csv', delimiter=';') 

In [6]:
df_abril_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340210 entries, 0 to 340209
Data columns (total 12 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Datetime         340210 non-null  object
 1   Tweet Id         340210 non-null  int64 
 2   Text             340210 non-null  object
 3   user_id          340210 non-null  int64 
 4   Location         238217 non-null  object
 5   followers_count  340210 non-null  int64 
 6   friends_count    340210 non-null  int64 
 7   retweet_count    340210 non-null  int64 
 8   reply_count      340210 non-null  int64 
 9   like_count       340210 non-null  int64 
 10  language         340210 non-null  object
 11  place            13807 non-null   object
dtypes: int64(7), object(5)
memory usage: 31.1+ MB


Para detectar el mayor número de lugares comunes tenemos que normalizar y estandarizar el texto del campo location y place y aprovechamos y lo hacemos con todo el dataset. 

In [7]:
#quitamos tildes a todas las palabras del dataset y lo que no sea letra y pasamos todo a minúsculas

cols = df_abril_19.select_dtypes(include=[np.object]).columns
df_abril_19[cols] = df_abril_19[cols].apply(lambda x: x.str.normalize('NFKD').str.lower().str.encode('ascii', errors='ignore').str.decode('utf-8'))
#df_abril_19.head()

### limpieza de location

In [8]:
#para agrupar valores creamos un df sólo de location
df_location= df_abril_19['Location']
#df_location.head(5)

In [9]:
#sumamos nulos. 102000 casi de 340. un tercio. 
df_abril_19.loc[df_abril_19['Location'].isnull()].count()

Datetime           101993
Tweet Id           101993
Text               101993
user_id            101993
Location                0
followers_count    101993
friends_count      101993
retweet_count      101993
reply_count        101993
like_count         101993
language           101993
place                2106
dtype: int64

In [9]:
# sustituimos na por unknown en location y en place
df_abril_19['Location'].fillna('unknown',inplace=True)
df_abril_19['place'].fillna('unknown',inplace=True)

Agrupamos en principio por provincia los nombres de los municipios con un número de usuarios por encima de 15 en un principio, aunque en sucesivas revisiones creemos que hemos bajado a los municipios con 5 usuarios..

In [19]:
#con cada provincia: tiene que haber una forma de hacer una función .. pero esta vez está hecho
madrid= df_location.str.contains('madrid')| df_location.str.contains('madriz')|df_location.str.contains('pinto')|df_location.str.contains('brunete')|df_location.str.contains('alcorqueens')|df_location.str.contains('gato')|df_location.str.contains('torrelodones')| df_location.str.contains('madri') | df_location.str.contains('madrit')   | df_location.str.contains('getafe')| df_location.str.contains('madrileno')| df_location.str.contains('carabanchel') |df_location.str.contains('matritum')| df_location.str.contains('alcorcon') | df_location.str.contains('fuenlabrada') | df_location.str.contains('alcala de henares')| df_location.str.contains('mostoles') | df_location.str.contains('leganes')| df_location.str.contains('colmenarejo')| df_location.str.contains('guadalix')| df_location.str.contains('hortaleza')| df_location.str.contains('el boalo')| df_location.str.contains('torrejon')| df_location.str.contains('alcobendas')| df_location.str.contains('san sebastian de los reyes')| df_location.str.contains('majadahonda')| df_location.str.contains('valdemoro')| df_location.str.contains('villalba')| df_location.str.contains('pozuelo')| df_location.str.contains('coslada')| df_location.str.contains('parla')| df_location.str.contains('vallekas')| df_location.str.contains('boadilla')| df_location.str.contains('vallecas')| df_location.str.contains('colmenar viejo')| df_location.str.contains('galapagar')| df_location.str.contains('aranjuez')| df_location.str.contains('tres cantos')
valencia= df_location.str.contains('valencia') |df_location.str.contains('xativa')|df_location.str.contains('mestalla')| df_location.str.contains('vlc')| df_location.str.contains('gandia')| df_location.str.contains('segorbe')
barcelona= df_location.str.contains('barcelona')|df_location.str.contains('osona')|df_location.str.contains('calella')|df_location.str.contains('cardedeu')|df_location.str.contains('maresme')|df_location.str.contains('collblanc')| df_location.str.contains('barcelon')|  df_location.str.contains('gava') | df_location.str.contains('bcn')| df_location.str.contains('cerdanyola')| df_location.str.contains('molins de rei')| df_location.str.contains('terrassa')| df_location.str.contains('vilassar de mar')| df_location.str.contains('manresa')| df_location.str.contains('sant cugat del valles')| df_location.str.contains('granollers')| df_location.str.contains('mataro') | df_location.str.contains('sabadel')| df_location.str.contains('badalona') | df_location.str.contains('llobregat')| df_location.str.contains('gramenet')| df_location.str.contains('geltru')| df_location.str.contains('viladecans')| df_location.str.contains('castelldefels')| df_location.str.contains('sitges')| df_location.str.contains('vic')| df_location.str.contains('despi')| df_location.str.contains('igualada')|df_location.str.contains('sant cugat') | df_location.str.contains('mollet del valles')| df_location.str.contains('seva')| df_location.str.contains('l\'hospitalet')| df_location.str.contains('torello')  | df_location.str.contains('masnou') |df_location.str.contains('sant just desvern')| df_location.str.contains('vilafranca del penedes')
malaga= df_location.str.contains('malaga')|df_location.str.contains('mijas')| df_location.str.contains('nacion palena') | df_location.str.contains('marbella')  | df_location.str.contains('torremolinos')| df_location.str.contains('fuengirola')| df_location.str.contains('estepona')| df_location.str.contains('benalmadena')| df_location.str.contains('antequera')| df_location.str.contains('rincon de la victoria')
castellon= df_location.str.contains('castellon')| df_location.str.contains('castello')| df_location.str.contains('vila-real')| df_location.str.contains('vinaros')
cadiz= df_location.str.contains('cadiz')| df_location.str.contains('jerez') | df_location.str.contains('algeciras') | df_location.str.contains('puerto de santa maria')| df_location.str.contains('san fernando')| df_location.str.contains('sanfernando')| df_location.str.contains('sanlucar')| df_location.str.contains('la linea de la concepcion')
cantabria= df_location.str.contains('cantabria')| df_location.str.contains('santander')| df_location.str.contains('santader') | df_location.str.contains('castro-urdiales')
almeria= df_location.str.contains('almeria')| df_location.str.contains('roquetas')
sevilla= df_location.str.contains('sevilla')| df_location.str.contains('seville') | df_location.str.contains('dos hermanas')| df_location.str.contains('alcala de guadaira')| df_location.str.contains('ciudad del betis')| df_location.str.contains('utrera') | df_location.str.contains('ecija')  
alicante= df_location.str.contains('ontinyent')|df_location.str.contains('alicante')| df_location.str.contains('alcoy')| df_location.str.contains('denia') | df_location.str.contains('elda')| df_location.str.contains('santa pola')| df_location.str.contains('alacant')| df_location.str.contains('elche')| df_location.str.contains('elx')| df_location.str.contains('benidorm')| df_location.str.contains('torrevieja')| df_location.str.contains('orihuela') 
murcia= df_location.str.contains('murcia') | df_location.str.contains('cartagena') | df_location.str.contains('lorca')| df_location.str.contains('molina de segura')| df_location.str.contains('yecla')| df_location.str.contains('alcantarilla')
gipuzkoa= df_location.str.contains('guipuzcoa')| df_location.str.contains('guipuzcoa')| df_location.str.contains('san sebastian')| df_location.str.contains('donostia')| df_location.str.contains('gipuzkoa')| df_location.str.contains('irun')| df_location.str.contains('donosti')
bizkaia=df_location.str.contains('getxo')|df_location.str.contains('sopelana')| df_location.str.contains('vizcaya')| df_location.str.contains('bizkaia')| df_location.str.contains('bilbao')  | df_location.str.contains('bilbo')| df_location.str.contains('barakaldo')| df_location.str.contains('plentzia')
coruna= df_location.str.contains('coruna') | df_location.str.contains('santiago') | df_location.str.contains('ferrol')| df_location.str.contains('compostela') 
laspalmas= df_location.str.contains('las palmas')| df_location.str.contains('arrecife') | df_location.str.contains('fuerteventura') | df_location.str.contains('gran canaria') | df_location.str.contains('lanzarote') | df_location.str.contains('telde')
asturias= df_location.str.contains('asturias')| df_location.str.contains('oviedo')| df_location.str.contains('gijon')| df_location.str.contains('gijon')| df_location.str.contains('aviles')| df_location.str.contains('asturies')| df_location.str.contains('xixon')| df_location.str.contains('reconquista')
tenerife= df_location.str.contains('tenerif')|  df_location.str.contains('garachico')| df_location.str.contains('san cristobal de la laguna')| df_location.str.contains('la laguna')
zaragoza= df_location.str.contains('zaragoza') 
pontevedra=df_location.str.contains('pontevedra')| df_location.str.contains('porrino')| df_location.str.contains('rias baixas') | df_location.str.contains('vigo') | df_location.str.contains('vilagarcia de arousa')
granada= df_location.str.contains('granada')|  df_location.str.contains('sierra nevada') | df_location.str.contains('grana')
tarragona= df_location.str.contains('tarragon') | df_location.str.contains('reus')| df_location.str.contains('cambrils')| df_location.str.contains('tortosa')| df_location.str.contains('salou')
cordoba= df_location.str.contains('cordoba')| df_location.str.contains('cordoba')
girona= df_location.str.contains('gerona')| df_location.str.contains('girona')| df_location.str.contains('blanes')|df_location.str.contains('vilamalla')| df_location.str.contains('palamos')
toledo= df_location.str.contains('toledo') | df_location.str.contains('talavera')
baleares= df_location.str.contains('baleares')| df_location.str.contains('illes balears')| df_location.str.contains('mallorca')| df_location.str.contains('palma')| df_location.str.contains('menorca')| df_location.str.contains('ibiza') | df_location.str.contains('formentera') | df_location.str.contains('cabrera') | df_location.str.contains('eivissa')
badajoz= df_location.str.contains('badajoz') | df_location.str.contains('merida')| df_location.str.contains('almendralejo')| df_location.str.contains('villanueva de la serena')
jaen= df_location.str.contains('jaen') | df_location.str.contains('linares')
navarra= df_location.str.contains('nafarroa')|df_location.str.contains('navarra') | df_location.str.contains('pamplona') |df_location.str.contains('navarre') 
valladolid= df_location.str.contains('valladolid')| df_location.str.contains('cabezon de pisuerga')| df_location.str.contains('pucela') | df_location.str.contains('rueda')
ciudadreal= df_location.str.contains('ciudad real') | df_location.str.contains('valdepenas')| df_location.str.contains('llanos del caudillo')| df_location.str.contains('puertollano')
huelva= df_location.str.contains('huelva') 
leon= df_location.str.contains('leon') | df_location.str.contains('ponferrada')| df_location.str.contains('el bierzo') 
lleida= df_location.str.contains('lleida')| df_location.str.contains('lerida')  
caceres= df_location.str.contains('caceres')| df_location.str.contains('plasencia')  
albacete= df_location.str.contains('albacete') 
burgos= df_location.str.contains('burgos')| df_location.str.contains('miranda de ebro') | df_location.str.contains('vivar')| df_location.str.contains('aranda de duero')  
lugo= df_location.str.contains('lugo') | df_location.str.contains('monforte de lemos')
salamanca= df_location.str.contains('salamanca') | df_location.str.contains('salmantino')
ourense= df_location.str.contains('orense')| df_location.str.contains('ourense')  
alava= df_location.str.contains('alava') | df_location.str.contains('vitoria') | df_location.str.contains('gasteiz')   
larioja= df_location.str.contains('rioja')| df_location.str.contains('logrono')  
guadalajara= df_location.str.contains('guadalajara') | df_location.str.contains('azuqueca de henares')
huesca= df_location.str.contains('huesca') 
cuenca= df_location.str.contains('cuenca')
zamora= df_location.str.contains('zamora') 
palencia= df_location.str.contains('palencia') 
segovia= df_location.str.contains('segovia') 
teruel= df_location.str.contains('teruel') | df_location.str.contains('alcaniz') 
soria= df_location.str.contains('soria') 
ceuta= df_location.str.contains('ceuta') 
melilla= df_location.str.contains('melilla') 
avila= df_location.str.contains('avila') 


In [20]:
df_abril_19['Location']= np.where(madrid, 'madrid',     
                                  np.where(valencia, 'valencia',
                                           np.where(barcelona, 'barcelona',
                                                    np.where(malaga, 'malaga',
                                                             np.where(cordoba, 'cordoba',
                                                                      np.where(tarragona, 'tarragona',
                                                                               np.where(castellon, 'castellon',
                                                                                        np.where(cadiz, 'cadiz',
                                                                                                 np.where(cantabria, 'cantabria',
                                                                                                          np.where(almeria, 'almeria',
                                                                                                                   np.where(sevilla, 'sevilla',
                                                                                                                            np.where(alicante, 'alicante',
                                                                                                                                     np.where(murcia, 'murcia',
                                                                                                                                              np.where(gipuzkoa, 'gipuzkoa',
                                                                                                                                                       np.where(bizkaia, 'bizkaia',
                                                                                                                                                                np.where(coruna, 'coruna',df_abril_19['Location']))))))))))))))))
                                                    
                                                   

In [21]:
df_abril_19['Location']= np.where(laspalmas, 'laspalmas', 
                                  np.where(baleares, 'baleares',
                                           np.where(asturias, 'asturias',
                                                    np.where(tenerife, 'tenerife',
                                                             np.where(zaragoza, 'zaragoza',
                                                                      np.where(pontevedra, 'pontevedra',
                                                                               np.where(granada, 'granada',
                                                                                        np.where(girona, 'girona',
                                                                                                 np.where(toledo, 'toledo',
                                                                                                          np.where(badajoz, 'badajoz',
                                                                                                                   np.where(jaen, 'jaen',
                                                                                                                            np.where(navarra, 'navarra',
                                                                                                                                     np.where(valladolid, 'valladolid',
                                                                                                                                              np.where(ciudadreal, 'ciudadreal',df_abril_19['Location']))))))))))))))
                                                                                                                                                       
                                                                                                                                              

In [22]:
df_abril_19['Location']= np.where(huelva, 'huelva', 
                                  np.where(leon, 'leon',
                                           np.where(lleida, 'lleida',
                                                    np.where(caceres, 'caceres',
                                                             np.where(albacete, 'albacete',
                                                                      np.where(burgos,'burgos',
                                                                               np.where(lugo,'lugo',
                                                                                        np.where(salamanca, 'salamanca',
                                                                                                 np.where(ourense, 'ourense',
                                                                                                          np.where(alava, 'alava',
                                                                                                                   np.where(larioja, 'larioja',
                                                                                                                            np.where(guadalajara, 'guadalajara',
                                                                                                                                     np.where(huesca, 'huesca',
                                                                                                                                              np.where(cuenca, 'cuenca',
                                                                                                                                                       np.where(zamora, 'zamora',
                                                                                                                                                                np.where(palencia, 'palencia',
                                                                                                                                                                         np.where(segovia, 'segovia',
                                                                                                                                                                                  np.where(teruel, 'teruel',
                                                                                                                                                                                           np.where(soria, 'soria',
                                                                                                                                                                                                    np.where(ceuta, 'ceuta',  
                                                                                                                                                                                                             np.where(melilla, 'melilla',
                                                                                                                                                                                                                      np.where(avila, 'avila',df_abril_19['Location']))))))))))))))))))))))

In [23]:
catalunya= df_location.str.contains('catalunya') |  df_location.str.contains('cataluna')|df_location.str.contains('catalans')|df_location.str.contains('catalonia')|df_location.str.contains('catalana')|df_location.str.contains('tabarnia')
andalucia= df_location.str.contains('andalucia') 
canarias= df_location.str.contains('canarias')
galicia= df_location.str.contains('galicia') |  df_location.str.contains('galiza')#|df_location.str.contains('galicia,')
euskadi= df_location.str.contains('euskal herria') |  df_location.str.contains('euskadi')|df_location.str.contains('pais vasco')| df_location.str.contains('basque country')
castillalamancha= df_location.str.contains('castilla-la mancha') |  df_location.str.contains('la mancha')
extremadura = df_location.str.contains('extremadura') 
aragon = df_location.str.contains('aragon') 

In [24]:
df_abril_19['Location']= np.where(catalunya, 'catalunya', 
                                  np.where(andalucia, 'andalucia',
                                           np.where(canarias, 'canarias',
                                                    np.where(galicia, 'galicia',
                                                             np.where(euskadi, 'euskadi',
                                                                      np.where(castillalamancha,'castillalamancha',
                                                                               np.where(extremadura,'extremadura',
                                                                                        np.where(aragon, 'aragon',df_abril_19['Location']))))))))

In [25]:
espana= df_location.str.contains('espana') |  df_location.str.contains('spain')| df_location.str.contains('peninsula iberica')|  df_location.str.contains('hispania')|df_location.str.contains('espanya')|df_location.str.contains('espanita')
europa= df_location.str.contains('europa') |  df_location.str.contains('europe')


In [26]:
df_abril_19['Location']= np.where(espana, 'espana', 
                                  np.where(europa, 'europe',df_abril_19['Location']))

In [59]:
#por motivos de tamaño del notebook y dificultades para subirlo al github, sólo ponemos una muestra de algunos location....se puede reproducir y ver joyas.
df_abril_19.groupby(['Location']).nunique().sort_values('user_id', ascending= False).tail(50)

Unnamed: 0_level_0,Datetime,Tweet Id,Text,user_id,Location,followers_count,friends_count,retweet_count,reply_count,like_count,language,place,com_prov,comunidad,place_clean,provcomplace,comunidad_autonoma2,comunidad_autonoma
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
en la noche oscura del alma.,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la nube. o en las nubes.,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la oficina,4,4,4,1,1,1,1,1,2,2,2,1,1,1,0,0,1,1
en la pagina de algun libro,3,3,3,1,1,1,1,1,2,1,1,1,1,1,0,0,1,1
en la papada de @lyromas,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la papelera de reciclaje,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la parte libre,9,9,9,1,1,1,1,2,3,3,1,1,1,1,0,0,1,1
en la patria,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la pista!,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1
en la meva branca,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1


In [28]:
com_autonoma_list= ['andalucia', 'aragon', 'asturias', 'baleares', 'canarias', 'cantabria', 'castillalamancha', 'castillaleon', 'catalunya', 'ComunidadValenciana', 'Extremadura', 'Galicia', 'LaRioja', 'Madrid', 'Murcia', 'Navarra', 'euskadi','Ceuta','Melilla']
for i in range(len(com_autonoma_list)):    
    com_autonoma_list[i] = com_autonoma_list[i]. lower() #si queda alguna mayúscula

In [29]:
provincia_list=['albacete','alicante','almeria','alava','asturias','avila','badajoz','Baleares','Barcelona','bizkaia','Burgos','Caceres','Cadiz','Cantabria','Castellon','CiudadReal'\
                'Cordoba','Coruna','Cuenca','Gipuzkoa','Girona','Granada','Guadalajara','Huelva','Huesca','Jaen','Leon','Lleida',\
                'Lugo','Madrid','Malaga','Murcia','Navarra','ourense','Palencia','lasPalmas','Pontevedra','laRioja','Salamanca','Tenerife',\
                'Segovia','Sevilla','Soria','Tarragona','Teruel','Toledo','Valencia','Valladolid','Zamora','Zaragoza','Ceuta','Melilla']
for i in range(len(provincia_list)):    
    provincia_list[i] = provincia_list[i]. lower() #si queda alguna mayúscula

In [30]:
andalucia = ["almeria","cadiz","cordoba","granada","huelva","jaen","malaga","sevilla"];
aragon = ["huesca","teruel","zaragoza"];
canarias = ["laspalmas","tenerife"];
cantabria = ["cantabria"];
castillaleon = ["avila","burgos","leon","palencia","salamanca","segovia","soria","valladolid","zamora"];
castillalamancha = ["albacete","ciudadreal","cuenca","guadalajara","toledo"];
catalunya = ['lleida','barcelona','tarragona','girona'];
ceuta = ["ceuta"];
comunidadvalenciana = ["alicante","castellon","valencia"];
madrid = ["madrid"];
extremadura = ["badajoz","caceres"];
galicia = ["coruna","lugo","ourense","pontevedra"];
baleares = ["baleares"];
larioja = ["larioja"];
melilla = ["melilla"];
navarra = ["navarra"];
euskadi = ["alava","gipuzkoa","bizkaia"];
asturias = ["asturias"];
murcia = ["murcia"];

In [31]:
com_prov= provincia_list + com_autonoma_list #combino listas

In [32]:
def extractname(x): #columna para extraer para los valores en location que coinciden con comunidad y provincia 
    if x in com_prov:
        return x
    else:
        return 'unknown'

df_abril_19['com_prov'] = df_abril_19['Location'].apply(lambda x : extractname(x))

In [33]:
def extractcomname(x): #columna para convertir a comunidad las provincias
    if x in com_autonoma_list:
        return x
    if x in andalucia:
        return 'andalucia'
    if x in aragon:
        return 'aragon'
    if x in asturias:
        return 'asturias'
    if x in canarias:
        return 'canarias'
    if x in cantabria:
        return 'cantabria'
    if x in castillalamancha:
        return 'castillalamancha'
    if x in castillaleon:
        return 'castillaleon'
    if x in catalunya:
        return 'catalunya'
    if x in ceuta:
        return 'ceuta'
    if x in comunidadvalenciana:
        return 'comunidadvalenciana'
    if x in madrid:
        return 'madrid'
    if x in extremadura:
        return 'extremadura'
    if x in galicia:
        return 'galicia'
    if x in baleares:
        return 'baleares'
    if x in larioja:
        return 'larioja'
    if x in melilla:
        return 'melilla'
    if x in navarra:
        return 'navarra'
    if x in euskadi:
        return 'euskadi'
    if x in murcia:
        return 'murcia'
    else:
        return 'unknown'
    
df_abril_19['comunidad'] = df_abril_19['Location'].apply(lambda x : extractcomname(x))
#df_abril_19['comunidad'] = df_abril_19['place'].apply(lambda x : extractname(x))


In [52]:
#df_abril_19.com_prov.value_counts()

In [36]:
#extraer de place los nombres que coincidan con comunidad o provincia
df_abril_19['place_clean'] = df_abril_19['place'].str.extract("(" + "|".join(com_prov) +")", expand=False)


In [37]:
#provcomplace tiene provincias,comunidades y lo extraido de place
df_abril_19['provcomplace'] = df_abril_19['com_prov']
df_abril_19['provcomplace'] = np.where(df_abril_19['provcomplace']=='unknown', df_abril_19['place_clean'], df_abril_19['provcomplace'])

In [38]:
df_abril_19['provcomplace'].value_counts(dropna=False)

NaN                 190639
madrid               35843
barcelona            20037
catalunya             9176
valencia              9012
sevilla               6426
baleares              5870
malaga                4917
asturias              3475
murcia                3441
granada               3008
alicante              2715
coruna                2435
andalucia             2418
cadiz                 2337
zaragoza              2336
bizkaia               2086
galicia               2067
tenerife              1881
leon                  1874
pontevedra            1864
valladolid            1503
tarragona             1472
canarias              1466
cantabria             1419
euskadi               1199
girona                1196
navarra               1177
almeria               1161
badajoz                943
salamanca              862
gipuzkoa               841
burgos                 831
castillalamancha       809
larioja                802
toledo                 799
jaen                   799
h

In [40]:
#para convertir las provincias de provcomplace a comunidad autónoma
df_abril_19['comunidad_autonoma2'] = df_abril_19['provcomplace'].apply(lambda x : extractcomname(x))

In [41]:
#por si queda algún unknown en ca2 que en comunidad tenga nombre
df_abril_19['comunidad_autonoma'] = df_abril_19['comunidad_autonoma2']
df_abril_19['comunidad_autonoma'] = np.where(df_abril_19['comunidad_autonoma2']=='unknown', df_abril_19['comunidad'], df_abril_19['comunidad_autonoma2'])


In [42]:
df_abril_19.comunidad_autonoma.value_counts(dropna=False)

unknown                188323
madrid                  35843
catalunya               32386
andalucia               23675
comunidadvalenciana     12388
galicia                  7383
castillaleon             6355
baleares                 5870
euskadi                  4780
castillalamancha         3579
asturias                 3475
murcia                   3441
aragon                   3401
canarias                 3347
extremadura              2121
cantabria                1419
navarra                  1177
larioja                   802
melilla                   240
ceuta                     205
Name: comunidad_autonoma, dtype: int64

In [53]:
#df_abril_19.comunidad_autonoma2.value_counts(dropna=False)#confirmamos que así era..

In [45]:
df_abril_19['comunidad_autonoma'].value_counts(normalize=True) #para ver porcentajes
#df_abril_19['comunidad'].value_counts(dropna=False) #para ver también na

unknown                0.553549
madrid                 0.105356
catalunya              0.095194
andalucia              0.069589
comunidadvalenciana    0.036413
galicia                0.021701
castillaleon           0.018680
baleares               0.017254
euskadi                0.014050
castillalamancha       0.010520
asturias               0.010214
murcia                 0.010114
aragon                 0.009997
canarias               0.009838
extremadura            0.006234
cantabria              0.004171
navarra                0.003460
larioja                0.002357
melilla                0.000705
ceuta                  0.000603
Name: comunidad_autonoma, dtype: float64

Seguimos teniendo muchos usuarios que no han indicado localización pero los que sí están más agrupados. 

In [46]:
df_abril_19_twitter_com= df_abril_19.drop(['Location','place','com_prov','comunidad', 'place_clean','provcomplace','comunidad_autonoma2'],axis=1)



In [55]:
df_abril_19_twitter_com.groupby(['comunidad_autonoma']).nunique().sort_values('user_id', ascending= False)

Unnamed: 0_level_0,Datetime,Tweet Id,Text,user_id,followers_count,friends_count,retweet_count,reply_count,like_count,language,comunidad_autonoma
comunidad_autonoma,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
unknown,108540,188323,186524,70305,9762,5973,561,195,892,41,1
catalunya,27949,32386,32223,12012,4241,3523,262,93,418,32,1
madrid,30594,35843,35677,11830,4760,3258,326,118,501,32,1
andalucia,20833,23675,23572,8557,3107,2545,176,63,263,25,1
comunidadvalenciana,11741,12388,12344,4443,2202,2040,126,36,190,24,1
galicia,7063,7383,7348,2581,1571,1492,106,33,155,22,1
castillaleon,6137,6355,6351,2296,1455,1360,72,26,112,18,1
baleares,5609,5870,5852,1915,1278,1237,71,26,111,19,1
euskadi,4642,4780,4770,1570,1117,1071,98,31,139,16,1
murcia,3369,3441,3433,1261,962,945,63,26,102,16,1


In [47]:
df_abril_19_twitter_com.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340210 entries, 0 to 340209
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Datetime            340210 non-null  object
 1   Tweet Id            340210 non-null  int64 
 2   Text                340210 non-null  object
 3   user_id             340210 non-null  int64 
 4   followers_count     340210 non-null  int64 
 5   friends_count       340210 non-null  int64 
 6   retweet_count       340210 non-null  int64 
 7   reply_count         340210 non-null  int64 
 8   like_count          340210 non-null  int64 
 9   language            340210 non-null  object
 10  comunidad_autonoma  340210 non-null  object
dtypes: int64(7), object(4)
memory usage: 28.6+ MB


In [48]:
#nos parece tentador hacer la limpieza de usuario en este punto pero decidimos esperar a tener el sentimiento de los tweets y quitar usuarios duplicados de los que tengamos sentimiento, para no eliminar de antemano tweets con información.  
df_abril_19_twitter_com['user_id'].nunique() 

123129

In [49]:
df_abril_19_twitter_com.to_csv('A19_twitter_comunidades.csv')

In [50]:
#para guardar el archivo en s3:

from botocore.exceptions import ClientError

s3_client = boto3.client(
    's3',
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXXXX',    
)

def upload_file(file_name, bucket, object_name=None):
    """Subir un archivo a un bucket
    :param file_name: archivo que hay que subir
    :param bucket: Bucket al que hay que subirlo
    :param object_name: S3 object name. Incluye la carpeta en la que hay que guardarlo. si no hay no se pone nada
    :return: True si sube el archivo, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    #s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [51]:
upload_file('A19_twitter_comunidades.csv',
            'electomedia',
            object_name = "EstimacionOtrasFuentes/" + 'A19_twitter_comunidades.csv'
           )

True