# Geolocation

In this notebook we attempt to geolocate users from the text of their tweets.

In [1]:
from pymongo import MongoClient

client = MongoClient('localhost', 27018)

db = client['contrastes']

In [2]:
db.tweets.find_one()

{'_id': ObjectId('5ba53c9827a5141aaa383eb9'),
 'created_at': 'Tue Nov 05 14:48:51 +0000 2013',
 'id': 397737276736040960,
 'place': None,
 'provincia': 'larioja',
 'text': 'Estoy tan asustada :(',
 'tokens': ['estoy', 'tan', 'asustada'],
 'user_id': 301800629}

In [3]:
user_ids = list(db.users.distinct('id'))
print("Tenemos {} usuarios".format(len(user_ids)))

Tenemos 56308 usuarios


Let's do the following:

- Train a logistic regression on unigrams to predict each user's province
- Then try with the regionalisms

First, let's split the users into train and test sets.

In [4]:
import sklearn
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split

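# Seed both generators: train_test_split relies on NumPy's global state,
# while random.sample uses Python's own `random` module.
np.random.seed(2019)
random.seed(2019)

sample_user_ids = random.sample(user_ids, 5000)

# Keep only the fields we care about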
users = list(db.users.find({"id": {"$in": sample_user_ids}}, {"id": 1, "_id": 0, "text": 1, "provincia": 1}))



train_users, test_users = train_test_split(users)

df_train = pd.DataFrame(train_users)
df_train.set_index("id", inplace=True)


df_test = pd.DataFrame(test_users)
df_test.set_index("id", inplace=True)

df_train.groupby("provincia").count()


| provincia | text |
|---|---|
| buenosaires | 156 |
| catamarca | 162 |
| chaco | 159 |
| chubut | 164 |
| cordoba | 166 |
| corrientes | 171 |
| entrerios | 167 |
| formosa | 151 |
| jujuy | 189 |
| lapampa | 164 |
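A note on reproducibility: `random.sample` draws from Python's built-in `random` module, which keeps its own generator, separate from NumPy's. The minimal sketch below uses a made-up list of ids (`fake_user_ids` is a stand-in, not the real `user_ids`) to show that seeding `random` itself is what makes the draw repeatable.

```python
import random

# Hypothetical stand-in for user_ids; the real ids come from MongoDB.
fake_user_ids = list(range(1000))

def draw_sample(seed, k=5):
    """Seed the stdlib generator, then draw k ids without replacement."""
    random.seed(seed)
    return random.sample(fake_user_ids, k)

first = draw_sample(2019)
second = draw_sample(2019)
print(first == second)  # True: same seed, same sample
```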


## Precomputed words

First, let's load the words that we know occur a reasonable number of times.

In [5]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize, stop_words=stopwords.words('spanish'),
    min_df=0.0007, max_df=0.15, ngram_range=(1, 2),
)

vectorizer.fit(df_train["text"])

CPU times: user 5min 18s, sys: 2.72 s, total: 5min 21s
Wall time: 5min 20s
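When `min_df` and `max_df` are floats, `CountVectorizer` reads them as proportions of documents: terms that appear in too few or too many documents are dropped. The following is a rough, stdlib-only sketch of that idea on invented toy documents. It is not scikit-learn's actual implementation, and `filter_by_doc_frequency` is a name made up for this example.

```python
from collections import Counter

def filter_by_doc_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df],
    both given as proportions of the number of documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

# Toy documents (invented for illustration).
docs = [
    ["che", "que", "frio"],
    ["que", "calor", "che"],
    ["que", "lindo", "dia"],
    ["que", "frio", "hace"],
]
kept = filter_by_doc_frequency(docs, min_df=0.5, max_df=0.75)
print(kept)  # ['che', 'frio']
```

Here "que" is dropped for appearing in every document, and the words that appear only once are dropped for being too rare.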


In [6]:
print("Vocabulario del vectorizador: {} palabras".format(len(vectorizer.vocabulary_)))

Vocabulario del vectorizador: 1375444 palabras


In [7]:
X_train = vectorizer.transform(df_train["text"])
X_test = vectorizer.transform(df_test["text"])

In [8]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

LabelEncoder()

In [9]:
y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)

We want the logistic regression to be a softmax over all the provinces, so I set `multi_class='multinomial'`.
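As a reminder of what "softmax" means here: the model produces one score per class, and softmax turns those scores into probabilities that sum to 1. Below is a minimal sketch using only the standard library; the scores are made-up numbers, not outputs of the model in this notebook.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three provinces (made-up numbers).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(round(sum(probs), 6), predicted)  # 1.0 0
```

The predicted class is simply the one with the highest probability, which is also the one with the highest score.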

In [10]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(multi_class='multinomial', solver='saga')

In [11]:
%%time
clf.fit(X_train, y_train)



CPU times: user 18min 30s, sys: 744 ms, total: 18min 31s
Wall time: 18min 31s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [12]:
%%time
clf.score(X_train, y_train)

CPU times: user 1.68 s, sys: 112 ms, total: 1.79 s
Wall time: 1.79 s


0.6493333333333333

In [13]:
%%time
clf.score(X_test, y_test)

CPU times: user 544 ms, sys: 84 ms, total: 628 ms
Wall time: 626 ms


0.3856

Test accuracy is about 38.6%, well below the 64.9% obtained on the training set.
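To put a test accuracy in context, one common reference point is a majority-class baseline: always predict the most frequent training label. This is a small sketch on invented labels; `majority_baseline_accuracy` is a helper written for this example, and in the notebook one would pass something like `list(y_train)` and `list(y_test)` instead.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Toy labels (invented); not the real y_train / y_test.
toy_train = ["jujuy", "jujuy", "chaco", "cordoba", "jujuy"]
toy_test = ["jujuy", "chaco", "chubut", "jujuy"]
print(majority_baseline_accuracy(toy_train, toy_test))  # 0.5
```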

## Using only "regionalisms" or LIWs (Location Indicative Words)

Now let's use our own "features". That is, let's try with increasingly large portions of the word list we found.
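Restricting the features to a word list means that each user is represented only by the counts of the listed words, and every other token is ignored. A minimal sketch of that representation, assuming already-tokenized text; `count_vector` is a helper made up for this example, and the tweet is invented.

```python
def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens.
    Words outside the vocabulary are ignored."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for token in tokens:
        position = index.get(token)
        if position is not None:
            counts[position] += 1
    return counts

# Three words from this notebook's word listing; the token list is invented.
liw_vocabulary = ["chivil", "aijue", "tolhuin"]
tokens = ["aijue", "que", "calor", "en", "tolhuin", "aijue"]
print(count_vector(tokens, liw_vocabulary))  # [0, 2, 1]
```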

In [14]:
df_words = pd.read_csv("../output/listados/listado_completo.csv")
df_words.set_index("palabra", inplace=True)
df_words.sort_values("rank_personas", ascending=True, inplace=True)

df_words.iloc[:10]

| palabra | buenosaires_ocurrencias | buenosaires_usuarios | catamarca_ocurrencias | catamarca_usuarios | chaco_ocurrencias | chaco_usuarios | chubut_ocurrencias | chubut_usuarios | cordoba_ocurrencias | cordoba_usuarios | ... | guaranitica_usuarios | noroeste_ocurrencias | fnorm_noroeste | noroeste_usuarios | fnorm_region_max | region_max | fnorm_region_min | region_min | region_sin_palabra | max_dif_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chivil | 1970 | 515 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0.0 | 0 | 7.424642 | litoral | 0.009863 | guaranitica | 3 | 752.782165 |
| ush | 12 | 6 | 4 | 2 | 4 | 3 | 4 | 4 | 5 | 3 | ... | 11 | 42 | 0.276578 | 18 | 18.69506 | litoral | 0.082872 | cuyo | 0 | 225.590394 |
| poec | 5 | 2 | 0 | 0 | 0 | 0 | 9 | 1 | 1 | 1 | ... | 0 | 0 | 0.0 | 0 | 4.605158 | litoral | 0.01653 | central | 2 | 278.585597 |
| malpegue | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0.0 | 0 | 31.491247 | cuyo | 0.033061 | central | 2 | 952.519835 |
| aijue | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 705 | 0 | 0.0 | 0 | 21.560377 | guaranitica | 0.016574 | cuyo | 2 | 1300.828621 |
| tolhuin | 1 | 1 | 0 | 0 | 4 | 1 | 4 | 3 | 3 | 1 | ... | 2 | 5 | 0.032926 | 5 | 11.059897 | litoral | 0.016574 | cuyo | 0 | 667.290306 |
| vallerga | 1932 | 436 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0.0 | 0 | 7.262991 | litoral | 0.01653 | central | 3 | 439.369284 |
| yarca | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | 1 | ... | 0 | 3 | 0.019756 | 3 | 16.408597 | cuyo | 0.007519 | litoral | 1 | 2182.393459 |
| blv | 268 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 6 | 0.039511 | 4 | 85.561838 | central | 0.039452 | guaranitica | 0 | 2168.772148 |
| portho | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0.0 | 0 | 12.530202 | cuyo | 0.018797 | litoral | 3 | 666.622002 |


Let's see how it performs using the top 500, 750, 1000 words, and so on, in steps of 250.

In [15]:
from sklearn.linear_model import LogisticRegression

clfs = {}
scores = {}

for num_words in range(500, 30000, 250):    
    liw_vectorizer = CountVectorizer(
        tokenizer=tokenizer.tokenize,
        vocabulary=df_words.index[:num_words])

    X_train = liw_vectorizer.transform(df_train["text"])
    X_test = liw_vectorizer.transform(df_test["text"])

    clf = LogisticRegression(multi_class='multinomial', solver='saga')
    clf.fit(X_train, y_train)
    
    scores[num_words] = clf.score(X_test, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores[num_words]*100))
    clfs[num_words] = clf
    



500 palabras ----> accuracy 60.40
750 palabras ----> accuracy 63.12
1000 palabras ----> accuracy 65.68
1250 palabras ----> accuracy 66.64
1500 palabras ----> accuracy 67.92
1750 palabras ----> accuracy 69.12
2000 palabras ----> accuracy 69.76
2250 palabras ----> accuracy 71.20
2500 palabras ----> accuracy 71.44
2750 palabras ----> accuracy 70.72
3000 palabras ----> accuracy 69.76
3250 palabras ----> accuracy 69.84
3500 palabras ----> accuracy 70.48
3750 palabras ----> accuracy 70.80
4000 palabras ----> accuracy 69.04
4250 palabras ----> accuracy 69.52
4500 palabras ----> accuracy 69.52
4750 palabras ----> accuracy 69.44
5000 palabras ----> accuracy 69.84
5250 palabras ----> accuracy 67.60
5500 palabras ----> accuracy 68.24
5750 palabras ----> accuracy 68.24
6000 palabras ----> accuracy 68.00
6250 palabras ----> accuracy 68.24
6500 palabras ----> accuracy 68.72
6750 palabras ----> accuracy 68.88
7000 palabras ----> accuracy 69.12
7250 palabras ----> accuracy 68.32
7500 palabras ----> accuracy 68.40
...

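With 2,500 words we get 71.4% accuracy, which is quite good. Performance then drops (to between 67.6% and 70.8% for 2,750 to 8,750 words), although the full `scores` listing below shows it climbing again, up to 72.4% at 12,750 words.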
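Rather than reading the best vocabulary size off the printed log, it can be looked up directly from a dictionary of scores. A minimal sketch, using a handful of entries copied from this notebook's results (`some_scores` is just an illustrative name):

```python
# A few entries copied from the scores reported in this notebook.
some_scores = {2500: 0.7144, 5250: 0.676, 9000: 0.7096, 12750: 0.724}

# The key whose value (accuracy) is largest.
best_size = max(some_scores, key=some_scores.get)
print(best_size, some_scores[best_size])  # 12750 0.724
```

Since these accuracies are measured on the test set, picking the size this way is optimistic; a separate validation split would give a fairer estimate.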

In [16]:

scores

{500: 0.604,
 750: 0.6312,
 1000: 0.6568,
 1250: 0.6664,
 1500: 0.6792,
 1750: 0.6912,
 2000: 0.6976,
 2250: 0.712,
 2500: 0.7144,
 2750: 0.7072,
 3000: 0.6976,
 3250: 0.6984,
 3500: 0.7048,
 3750: 0.708,
 4000: 0.6904,
 4250: 0.6952,
 4500: 0.6952,
 4750: 0.6944,
 5000: 0.6984,
 5250: 0.676,
 5500: 0.6824,
 5750: 0.6824,
 6000: 0.68,
 6250: 0.6824,
 6500: 0.6872,
 6750: 0.6888,
 7000: 0.6912,
 7250: 0.6832,
 7500: 0.684,
 7750: 0.6856,
 8000: 0.688,
 8250: 0.6936,
 8500: 0.696,
 8750: 0.7024,
 9000: 0.7096,
 9250: 0.7128,
 9500: 0.7136,
 9750: 0.7128,
 10000: 0.7144,
 10250: 0.7168,
 10500: 0.7152,
 10750: 0.7136,
 11000: 0.7168,
 11250: 0.7184,
 11500: 0.7216,
 11750: 0.716,
 12000: 0.716,
 12250: 0.7216,
 12500: 0.7224,
 12750: 0.724,
 13000: 0.7016,
 13250: 0.7032,
 13500: 0.7024,
 13750: 0.7024,
 14000: 0.7016,
 14250: 0.7032,
 14500: 0.7064,
 14750: 0.708,
 15000: 0.7072,
 15250: 0.708,
 15500: 0.7072,
 15750: 0.7096,
 16000: 0.7104,
 16250: 0.712,
 16500: 0.7104,
 16750: 0.7104,

## With words

What happens if we rank the list by words (`rank_palabras`) instead of by users (`rank_personas`)?

In [17]:
df_words.sort_values("rank_palabras", ascending=True, inplace=True)

clfs_palabras = {}
scores_palabras = {}

for num_words in range(500, 30000, 250):    
    liw_vectorizer = CountVectorizer(
        tokenizer=tokenizer.tokenize,
        vocabulary=df_words.index[:num_words])

    X_train = liw_vectorizer.transform(df_train["text"])
    X_test = liw_vectorizer.transform(df_test["text"])

    clf = LogisticRegression(multi_class='multinomial', solver='saga')
    clf.fit(X_train, y_train)
    
    scores_palabras[num_words] = clf.score(X_test, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores_palabras[num_words]*100))
    clfs_palabras[num_words] = clf
    



500 palabras ----> accuracy 62.32
750 palabras ----> accuracy 67.04
1000 palabras ----> accuracy 65.92
1250 palabras ----> accuracy 65.68
1500 palabras ----> accuracy 66.56
1750 palabras ----> accuracy 66.08
2000 palabras ----> accuracy 64.72
2250 palabras ----> accuracy 65.52
2500 palabras ----> accuracy 65.20
2750 palabras ----> accuracy 65.20
3000 palabras ----> accuracy 65.36
3250 palabras ----> accuracy 65.68
3500 palabras ----> accuracy 64.64
3750 palabras ----> accuracy 65.52
4000 palabras ----> accuracy 65.60
4250 palabras ----> accuracy 65.68
4500 palabras ----> accuracy 66.00
4750 palabras ----> accuracy 66.24
5000 palabras ----> accuracy 66.40
5250 palabras ----> accuracy 66.40
5500 palabras ----> accuracy 66.40
5750 palabras ----> accuracy 67.04
6000 palabras ----> accuracy 66.80
6250 palabras ----> accuracy 67.28
6500 palabras ----> accuracy 65.76
6750 palabras ----> accuracy 65.68
7000 palabras ----> accuracy 66.16
7250 palabras ----> accuracy 66.56
7500 palabras ----> accuracy 66.40
...

In [19]:
scores_palabras

{500: 0.6232,
 750: 0.6704,
 1000: 0.6592,
 1250: 0.6568,
 1500: 0.6656,
 1750: 0.6608,
 2000: 0.6472,
 2250: 0.6552,
 2500: 0.652,
 2750: 0.652,
 3000: 0.6536,
 3250: 0.6568,
 3500: 0.6464,
 3750: 0.6552,
 4000: 0.656,
 4250: 0.6568,
 4500: 0.66,
 4750: 0.6624,
 5000: 0.664,
 5250: 0.664,
 5500: 0.664,
 5750: 0.6704,
 6000: 0.668,
 6250: 0.6728,
 6500: 0.6576,
 6750: 0.6568,
 7000: 0.6616,
 7250: 0.6656,
 7500: 0.664,
 7750: 0.664,
 8000: 0.6456,
 8250: 0.6448,
 8500: 0.6456,
 8750: 0.6576,
 9000: 0.6568,
 9250: 0.656,
 9500: 0.6568,
 9750: 0.6584,
 10000: 0.6592,
 10250: 0.6568,
 10500: 0.6584,
 10750: 0.6592,
 11000: 0.652,
 11250: 0.6512,
 11500: 0.6496,
 11750: 0.6584,
 12000: 0.6528,
 12250: 0.6504,
 12500: 0.6496,
 12750: 0.6464,
 13000: 0.6504,
 13250: 0.652,
 13500: 0.6536,
 13750: 0.6528,
 14000: 0.652,
 14250: 0.652,
 14500: 0.652,
 14750: 0.6528,
 15000: 0.656,
 15250: 0.6584,
 15500: 0.6592,
 15750: 0.66,
 16000: 0.6576,
 16250: 0.6592,
 16500: 0.6576,
 16750: 0.66,
 17000