# Geolocation

In this notebook we will attempt to geolocate users from the text they write...

In [24]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
from pymongo import MongoClient

client = MongoClient('localhost', 27018)

db = client['contrastes']

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
db.tweets.find_one()

{'_id': ObjectId('5ba53c9827a5141aaa383eb9'),
 'created_at': 'Tue Nov 05 14:48:51 +0000 2013',
 'id': 397737276736040960,
 'place': None,
 'provincia': 'larioja',
 'text': 'Estoy tan asustada :(',
 'tokens': ['estoy', 'tan', 'asustada'],
 'user_id': 301800629}

In [3]:
users = list(
    db.users.aggregate([{"$sample": {"size": 10000}}], allowDiskUse=True)
)

Let's do the following:

- Train a logistic regression on unigrams to predict each user's province
- Then try with the regionalisms

First, let's split into train and test.

In [4]:
import sklearn
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


train_users, test_users = train_test_split(users, random_state=20192019)

print("Usuarios de train: {}".format(len(train_users)))
print("Usuarios de test: {}".format(len(test_users)))

Usuarios de train: 7500
Usuarios de test: 2500
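
The split above is purely random. If keeping the same province proportions in both partitions matters, one option is a stratified split. The sketch below uses made-up toy users (not data from this notebook); `stratify` is a standard parameter of `train_test_split`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the sampled users (hypothetical data, not from the notebook).
toy_users = [
    {"id": i, "provincia": prov}
    for i, prov in enumerate(["larioja"] * 40 + ["chubut"] * 40 + ["jujuy"] * 20)
]
labels = [u["provincia"] for u in toy_users]

# stratify keeps the province proportions the same in both partitions.
toy_train, toy_test = train_test_split(
    toy_users, test_size=0.25, stratify=labels, random_state=20192019
)

train_counts = Counter(u["provincia"] for u in toy_train)
test_counts = Counter(u["provincia"] for u in toy_test)
print(train_counts)
print(test_counts)
```

With 100 toy users and a 25% test share, each province contributes a quarter of its users to the test partition.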


In [5]:

df_train = pd.DataFrame(train_users, columns=["id", "text", "provincia"])
df_train.set_index("id", inplace=True)


df_test = pd.DataFrame(test_users, columns=["id", "text", "provincia"])
df_test.set_index("id", inplace=True)

df_train.groupby("provincia").count()


| provincia | text |
|---|---|
| buenosaires | 337 |
| catamarca | 341 |
| chaco | 331 |
| chubut | 328 |
| cordoba | 317 |
| corrientes | 345 |
| entrerios | 338 |
| formosa | 286 |
| jujuy | 339 |
| lapampa | 324 |


## Precomputed words

First, let's load the words that we know occur a reasonable number of times.

In [6]:
%%time
from contrastes.processing import build_dataframe_from_users

word_df = build_dataframe_from_users(train_users)

CPU times: user 8min 41s, sys: 528 ms, total: 8min 41s
Wall time: 10min 44s


In [7]:
from contrastes.processing import preprocess_raw_df

word_df = preprocess_raw_df(word_df, filter_words=(10, 2))

  df.columnas_palabras = cant_palabras
  df.columnas_personas = cant_personas


In [8]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize, vocabulary=word_df.index
)

vectorizer.fit(df_train["text"])

CPU times: user 6min 51s, sys: 328 ms, total: 6min 51s
Wall time: 6min 51s
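
To make the tokenizer options concrete, here is a small sketch of what `TweetTokenizer` does with the same three flags used above. The tweet text is made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

# Made-up tweet: a user handle, mixed case, and an elongated word.
tokens = tokenizer.tokenize("@usuario Estoy taaaaaan asustada")
print(tokens)
```

`strip_handles` drops the `@usuario` mention, `preserve_case=False` lowercases the words, and `reduce_len` shortens runs of a repeated character to three.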


In [9]:
print("Vocabulario del vectorizador: {} palabras".format(len(vectorizer.vocabulary_)))

Vocabulario del vectorizador: 103783 palabras


In [10]:
X_train = vectorizer.transform(df_train["text"])
X_test = vectorizer.transform(df_test["text"])

In [11]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

LabelEncoder()

In [12]:
y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)
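
As a reminder of how the encoding behaves, the sketch below fits a `LabelEncoder` on a few province names and maps integer codes back to names, which is what one would do with the classifier's predictions. The short list of labels is only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["larioja", "chubut", "jujuy", "chubut"])

# Classes are stored sorted, so integer codes follow alphabetical order.
print(encoder.classes_.tolist())   # ['chubut', 'jujuy', 'larioja']

codes = encoder.transform(["jujuy", "larioja"])
print(codes.tolist())              # [1, 2]

# inverse_transform maps predicted codes back to province names.
names = encoder.inverse_transform(codes)
print(names.tolist())              # ['jujuy', 'larioja']
```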

The logistic regression will be a softmax, so I choose `multi_class='multinomial'`.
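
For readers unfamiliar with the term, a softmax turns one linear score per class into a probability distribution over the classes. The following is a minimal NumPy sketch of that function, with hypothetical scores for three provinces; it is not scikit-learn's internal code.

```python
import numpy as np


def softmax(scores):
    """Turn a vector of per-class scores into probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical linear scores (w_k . x + b_k) for three provinces.
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)
print(probs, probs.sum())
```

The predicted province is the one with the highest probability, which is also the one with the highest score.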

In [13]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(multi_class='multinomial', solver='saga', n_jobs=10, penalty='l2')

In [14]:
%%time
clf.fit(X_train, y_train)

CPU times: user 5min 55s, sys: 168 ms, total: 5min 55s
Wall time: 5min 55s




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=10, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [15]:
%%time
clf.score(X_train, y_train)

CPU times: user 628 ms, sys: 56 ms, total: 684 ms
Wall time: 686 ms


0.3841333333333333

38% accuracy. Note that this score is computed on the training set (`X_train`), not on the test set.
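
To put an accuracy figure in context, it helps to look at held-out accuracy next to a trivial baseline. The sketch below does this on synthetic data (made up here, unrelated to the tweets), using scikit-learn's `DummyClassifier` as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic 3-class problem (made-up data): each class shifts one feature.
y = rng.randint(0, 3, size=600)
X = rng.normal(size=(600, 3))
X[np.arange(600), y] += 2.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("baseline test accuracy:", baseline.score(X_te, y_te))
print("model train accuracy:  ", model.score(X_tr, y_tr))
print("model test accuracy:   ", model.score(X_te, y_te))
```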

## Using only "regionalisms" or LIW (Location Indicative Words)

Now let's use our own "features". That is, let's try using only a portion of the words found, taking the top-ranked ones.

In [16]:
from contrastes.lists import add_ival

add_ival(word_df, normalize=True)

Calculating information values...
Calculating ranks...


In [17]:
word_df.sort_values("rank_personas", ascending=True, inplace=True)

word_df.iloc[:10]

| | buenosaires_ocurrencias | buenosaires_usuarios | catamarca_ocurrencias | catamarca_usuarios | chaco_ocurrencias | chaco_usuarios | chubut_ocurrencias | chubut_usuarios | cordoba_ocurrencias | cordoba_usuarios | ... | tucuman_usuarios | cant_provincias | cant_palabra | cant_usuarios | ival_palabras | ival_personas | ival_palper | rank_palabras | rank_personas | rank_palper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ush | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 6 | 638.0 | 173.0 | 1.241734 | 1.600248 | 1.987082 | 37.0 | 1.0 | 4.0 |
| chivilcoy | 2331.0 | 125.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 6 | 2337.0 | 131.0 | 1.599973 | 1.553529 | 2.485604 | 4.0 | 2.0 | 1.0 |
| poec | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1 | 163.0 | 82.0 | 1.059592 | 1.550397 | 1.642788 | 117.5 | 3.0 | 18.0 |
| plottier | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 3 | 848.0 | 103.0 | 1.387099 | 1.547079 | 2.145952 | 14.0 | 4.0 | 3.0 |
| chivil | 332.0 | 79.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1 | 332.0 | 79.0 | 1.207573 | 1.537284 | 1.856382 | 46.0 | 5.0 | 6.0 |
| vallerga | 291.0 | 72.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1 | 291.0 | 72.0 | 1.180154 | 1.504641 | 1.775707 | 59.0 | 6.0 | 8.0 |
| tolhuin | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 5 | 373.0 | 100.0 | 1.184478 | 1.501452 | 1.778436 | 57.0 | 7.0 | 7.0 |
| yarca | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1 | 160.0 | 71.0 | 1.055728 | 1.49972 | 1.583296 | 120.0 | 8.0 | 22.0 |
| fsa | 0.0 | 0.0 | 0.0 | 0.0 | 22.0 | 2.0 | 0.0 | 0.0 | 4.0 | 1.0 | ... | 0.0 | 6 | 321.0 | 116.0 | 1.033974 | 1.489231 | 1.539826 | 145.0 | 9.0 | 27.0 |
| malpegue | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 3 | 264.0 | 85.0 | 1.128838 | 1.489024 | 1.680868 | 77.0 | 10.0 | 16.0 |


Let's see how it performs using the top 250, 500, 750 words, and so on...

In [18]:
%%time 

liw_vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 9min 8s, sys: 432 ms, total: 9min 9s
Wall time: 9min 8s


The texts are now vectorized with the columns in the expected order!
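
The reason column order matters: when `CountVectorizer` receives an explicit `vocabulary`, the columns of the resulting matrix follow the order of that vocabulary, so taking the first `k` columns keeps the `k` best-ranked words. A small sketch with three words from the ranking and made-up texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A few words already sorted by rank, best first.
ranked_words = ["ush", "chivilcoy", "fsa"]

vectorizer = CountVectorizer(vocabulary=ranked_words)
docs = ["ush ush chivilcoy", "fsa chivilcoy"]  # made-up texts
X = vectorizer.fit_transform(docs)

print(X.toarray())      # columns follow the order of ranked_words
top2 = X[:, :2]         # keep only the two best-ranked words
print(top2.toarray())
```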

In [None]:
from sklearn.linear_model import LogisticRegression

clfs = {}
scores = {}

for num_words in range(250, 5000, 250):    
    X_tr = X_train[:, :num_words].todense()
    X_tst = X_test[:, :num_words].todense()
    
    clf = LogisticRegression(
        multi_class='multinomial', solver='saga', penalty='l2', 
        max_iter=200, n_jobs=-1)
    clf.fit(X_tr, y_train)
    
    scores[num_words] = clf.score(X_tst, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores[num_words]*100))
    clfs[num_words] = clf
    



250 palabras ----> accuracy 57.40
500 palabras ----> accuracy 65.08
750 palabras ----> accuracy 68.84
1000 palabras ----> accuracy 70.64
1250 palabras ----> accuracy 71.32
1500 palabras ----> accuracy 71.68
1750 palabras ----> accuracy 69.92
2000 palabras ----> accuracy 70.68
2250 palabras ----> accuracy 71.40
2500 palabras ----> accuracy 71.16
2750 palabras ----> accuracy 71.44
3000 palabras ----> accuracy 70.00
3250 palabras ----> accuracy 70.56
3500 palabras ----> accuracy 70.68
3750 palabras ----> accuracy 71.32
4000 palabras ----> accuracy 71.48
4250 palabras ----> accuracy 71.80
4500 palabras ----> accuracy 71.80


With 2,500 words we reach about 71% accuracy, which is quite good. Beyond that point accuracy stays roughly flat, between 70% and 72%, up to 4,500 words.

In [23]:
for num_words in range(5000, 20000, 500):    
    X_tr = X_train[:, :num_words].todense()
    X_tst = X_test[:, :num_words].todense()
    
    clf = LogisticRegression(
        multi_class='multinomial', solver='saga', penalty='l2', 
        max_iter=200, n_jobs=-1)
    clf.fit(X_tr, y_train)
    
    scores[num_words] = clf.score(X_tst, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores[num_words]*100))
    clfs[num_words] = clf



5000 palabras ----> accuracy 71.88
5500 palabras ----> accuracy 71.68
6000 palabras ----> accuracy 72.00
6500 palabras ----> accuracy 72.24
7000 palabras ----> accuracy 72.40
7500 palabras ----> accuracy 72.56
8000 palabras ----> accuracy 73.28
8500 palabras ----> accuracy 73.40
9000 palabras ----> accuracy 73.00
9500 palabras ----> accuracy 73.00
10000 palabras ----> accuracy 73.56
10500 palabras ----> accuracy 74.00
11000 palabras ----> accuracy 74.24
11500 palabras ----> accuracy 74.00
12000 palabras ----> accuracy 74.08
12500 palabras ----> accuracy 74.08
13000 palabras ----> accuracy 74.12
13500 palabras ----> accuracy 74.20
14000 palabras ----> accuracy 69.88
14500 palabras ----> accuracy 70.00
15000 palabras ----> accuracy 69.92
15500 palabras ----> accuracy 70.56
16000 palabras ----> accuracy 70.52
16500 palabras ----> accuracy 70.48
17000 palabras ----> accuracy 70.64
17500 palabras ----> accuracy 70.60
18000 palabras ----> accuracy 70.60
18500 palabras ----> accuracy 70.80
19000 palabras ----> accuracy 70.84
19500 palabras ----> accuracy 70.92


In [26]:

scores

{250: 0.574,
 500: 0.6508,
 750: 0.6884,
 1000: 0.7064,
 1250: 0.7132,
 1500: 0.7168,
 1750: 0.6992,
 2000: 0.7068,
 2250: 0.714,
 2500: 0.7116,
 2750: 0.7144,
 3000: 0.7,
 3250: 0.7056,
 3500: 0.7068,
 3750: 0.7132,
 4000: 0.7148,
 4250: 0.718,
 4500: 0.718,
 4750: 0.7196,
 5000: 0.7188,
 5500: 0.7168,
 6000: 0.72,
 6500: 0.7224,
 7000: 0.724,
 7500: 0.7256,
 8000: 0.7328,
 8500: 0.734,
 9000: 0.73,
 9500: 0.73,
 10000: 0.7356,
 10500: 0.74,
 11000: 0.7424,
 11500: 0.74,
 12000: 0.7408,
 12500: 0.7408,
 13000: 0.7412,
 13500: 0.742,
 14000: 0.6988,
 14500: 0.7,
 15000: 0.6992,
 15500: 0.7056,
 16000: 0.7052,
 16500: 0.7048,
 17000: 0.7064,
 17500: 0.706,
 18000: 0.706,
 18500: 0.708,
 19000: 0.7084,
 19500: 0.7092}
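
A quick way to read this dictionary is to look for the best vocabulary size and for the largest drop between consecutive sizes. The sketch below does so on a subset of the values printed above; the variable names are illustrative. In this subset, accuracy peaks at 11,000 words and falls sharply between 13,500 and 14,000; the notebook does not establish the cause of that drop.

```python
# A subset of the test accuracies printed above (number of words -> accuracy).
scores_subset = {
    2500: 0.7116, 5000: 0.7188, 10000: 0.7356, 11000: 0.7424,
    13500: 0.742, 14000: 0.6988, 19500: 0.7092,
}

# Vocabulary size with the highest test accuracy.
best_k = max(scores_subset, key=scores_subset.get)
print("best:", best_k, scores_subset[best_k])

# Largest drop in accuracy between consecutive sizes in this subset.
sizes = sorted(scores_subset)
drops = {
    (a, b): scores_subset[a] - scores_subset[b]
    for a, b in zip(sizes, sizes[1:])
}
worst_step = max(drops, key=drops.get)
print("largest drop:", worst_step, round(drops[worst_step], 4))
```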

## With words

What happens if we rank by words (`rank_palabras`) instead?

In [28]:
word_df.sort_values("rank_palabras", ascending=True, inplace=True)

word_df.iloc[:10]

| | buenosaires_ocurrencias | buenosaires_usuarios | catamarca_ocurrencias | catamarca_usuarios | chaco_ocurrencias | chaco_usuarios | chubut_ocurrencias | chubut_usuarios | cordoba_ocurrencias | cordoba_usuarios | ... | tucuman_usuarios | cant_provincias | cant_palabra | cant_usuarios | ival_palabras | ival_personas | ival_palper | rank_palabras | rank_personas | rank_palper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hoa | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 6 | 3281.0 | 8.0 | 1.673924 | 0.33152 | 0.554939 | 1.0 | 13613.0 | 692.0 |
| rioja | 21.0 | 8.0 | 219.0 | 73.0 | 3.0 | 3.0 | 18.0 | 11.0 | 23.0 | 14.0 | ... | 16.0 | 23 | 8020.0 | 506.0 | 1.658451 | 0.849382 | 1.408657 | 2.0 | 551.0 | 40.0 |
| ushuaia | 19.0 | 5.0 | 2.0 | 2.0 | 3.0 | 2.0 | 16.0 | 13.0 | 3.0 | 3.0 | ... | 2.0 | 23 | 5874.0 | 352.0 | 1.644964 | 1.059067 | 1.742126 | 3.0 | 161.0 | 11.0 |
| chivilcoy | 2331.0 | 125.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 6 | 2337.0 | 131.0 | 1.599973 | 1.553529 | 2.485604 | 4.0 | 2.0 | 1.0 |
| bragado | 1757.0 | 89.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 5 | 1766.0 | 96.0 | 1.535212 | 1.417497 | 2.176158 | 5.0 | 12.0 | 2.0 |
| jujuy | 17.0 | 8.0 | 53.0 | 23.0 | 20.0 | 18.0 | 22.0 | 16.0 | 50.0 | 21.0 | ... | 49.0 | 23 | 8334.0 | 732.0 | 1.496421 | 0.599929 | 0.897746 | 6.0 | 2146.0 | 197.0 |
| tilly | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1409.0 | 79.0 | 1.0 | 1.0 | ... | 0.0 | 6 | 1423.0 | 90.0 | 1.473581 | 1.293685 | 1.906349 | 7.0 | 35.0 | 5.0 |
| tdf | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 11 | 1399.0 | 89.0 | 1.442562 | 1.128235 | 1.627548 | 8.0 | 112.0 | 19.0 |
| rada | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1914.0 | 169.0 | 7.0 | 3.0 | ... | 1.0 | 18 | 1988.0 | 212.0 | 1.43746 | 1.233813 | 1.773557 | 9.0 | 62.0 | 9.0 |
| gestionando | 6.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | ... | 2.0 | 20 | 1646.0 | 40.0 | 1.418463 | 0.149234 | 0.211683 | 10.0 | 67507.0 | 3810.0 |


In [29]:
%%time 

liw_vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 9min 53s, sys: 1.89 s, total: 9min 55s
Wall time: 10min 17s


In [None]:

clfs_palabras = {}
scores_palabras = {}

In [None]:


for num_words in range(250, 5000, 250):
    if num_words in clfs_palabras and num_words in scores_palabras:
        print("{} palabras ----> accuracy {:.2f}".format(num_words, scores_palabras[num_words]*100))
        continue
        
    X_tr = X_train[:, :num_words].todense()
    X_tst = X_test[:, :num_words].todense()
    
    clf = LogisticRegression(
        multi_class='multinomial', solver='saga', penalty='l2', 
        max_iter=200, n_jobs=-1)
    clf.fit(X_tr, y_train)
    
    scores_palabras[num_words] = clf.score(X_tst, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores_palabras[num_words]*100))
    clfs_palabras[num_words] = clf
    

250 palabras ----> accuracy 58.80
500 palabras ----> accuracy 61.40
750 palabras ----> accuracy 63.52
1000 palabras ----> accuracy 62.44
1250 palabras ----> accuracy 62.08
1500 palabras ----> accuracy 62.88
1750 palabras ----> accuracy 65.40
2000 palabras ----> accuracy 65.56
2250 palabras ----> accuracy 65.32
2500 palabras ----> accuracy 65.24
2750 palabras ----> accuracy 65.44
3000 palabras ----> accuracy 65.88
3250 palabras ----> accuracy 66.60
3500 palabras ----> accuracy 66.76
3750 palabras ----> accuracy 66.64
4000 palabras ----> accuracy 66.60
4250 palabras ----> accuracy 66.80
4500 palabras ----> accuracy 67.24




In [60]:
for num_words in range(5000, 20000, 500):
    if num_words in clfs_palabras:
        print("{} palabras ----> accuracy {:.2f}".format(num_words, scores_palabras[num_words]*100))
        continue
        
    X_tr = X_train[:, :num_words].todense()
    X_tst = X_test[:, :num_words].todense()
    
    clf = LogisticRegression(
        multi_class='multinomial', solver='saga', penalty='l2', 
        max_iter=200, n_jobs=-1)
    clf.fit(X_tr, y_train)
    
    scores_palabras[num_words] = clf.score(X_tst, y_test)
    print("{} palabras ----> accuracy {:.2f}".format(num_words, scores_palabras[num_words]*100))
    clfs_palabras[num_words] = clf

5000 palabras ----> accuracy 67.12
5500 palabras ----> accuracy 67.28
6000 palabras ----> accuracy 67.36
6500 palabras ----> accuracy 67.16
7000 palabras ----> accuracy 66.96
7500 palabras ----> accuracy 67.20
8000 palabras ----> accuracy 66.84
8500 palabras ----> accuracy 66.88
9000 palabras ----> accuracy 67.24
9500 palabras ----> accuracy 65.12
10000 palabras ----> accuracy 65.36
10500 palabras ----> accuracy 65.44
11000 palabras ----> accuracy 66.24
11500 palabras ----> accuracy 66.16
12000 palabras ----> accuracy 65.88
12500 palabras ----> accuracy 66.24
13000 palabras ----> accuracy 66.36
13500 palabras ----> accuracy 66.32
14000 palabras ----> accuracy 66.44
14500 palabras ----> accuracy 66.44
15000 palabras ----> accuracy 66.32
15500 palabras ----> accuracy 66.40
16000 palabras ----> accuracy 66.40
16500 palabras ----> accuracy 66.36
17000 palabras ----> accuracy 66.08
17500 palabras ----> accuracy 65.96
18000 palabras ----> accuracy 66.04
18500 palabras ----> accuracy 66.12
19000 palabras ----> accuracy 66.16
19500 palabras ----> accuracy 66.40
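
To compare the two rankings side by side, the sketch below places a few of the test accuracies reported in this notebook into a single table. The values are copied from the outputs of the two runs; the DataFrame and column layout are only one possible way to present them.

```python
import pandas as pd

# Test accuracies copied from the two runs, at a few vocabulary sizes.
comparison = pd.DataFrame(
    {
        "rank_personas": {250: 0.574, 1000: 0.7064, 5000: 0.7188, 10000: 0.7356, 19500: 0.7092},
        "rank_palabras": {250: 0.588, 1000: 0.6244, 5000: 0.6712, 10000: 0.6536, 19500: 0.664},
    }
)

# Positive values mean the user-based ranking did better at that size.
comparison["difference"] = comparison["rank_personas"] - comparison["rank_palabras"]
print(comparison)
```

Only at 250 words does the word-based ranking come out ahead; from 1,000 words onwards the user-based ranking is better by roughly 4 to 8 points.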


In [62]:
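import pickle

# Use context managers so the files are flushed and closed after writing.
with open("clfs_personas.pkl", "wb") as f:
    pickle.dump(clfs, f)
with open("clfs_palabras.pkl", "wb") as f:
    pickle.dump(clfs_palabras, f)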

In [63]:
with open("clfs_personas.pkl", "rb") as f:
    clfs_personas = pickle.load(f)

In [66]:
word_df.sort_values("rank_personas", ascending=True, inplace=True)

In [67]:
%%time 

liw_vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 9min 13s, sys: 456 ms, total: 9min 14s
Wall time: 9min 14s


In [69]:
X_tr = X_train[:, :10000]
X_tst = X_test[:, :10000]

clfs_personas[10000].score(X_tst, y_test)

0.7344
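
This last cell scores directly on sparse column slices, whereas the training loops call `.todense()` first. scikit-learn's `LogisticRegression` accepts SciPy sparse matrices for both `fit` and `score`, so the dense conversion is probably unnecessary, and recent scikit-learn releases may reject the `np.matrix` that `.todense()` returns. A minimal sketch on made-up count data:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Made-up count matrix: 200 "users" x 50 "words", mostly zeros.
y = rng.randint(0, 3, size=200)
dense = rng.poisson(0.1, size=(200, 50))
dense[np.arange(200), y] += 3      # column k is indicative of class k
X = sparse.csr_matrix(dense)

top_k = X[:, :10]                  # column slicing keeps the matrix sparse
clf = LogisticRegression(solver="saga", max_iter=2000)
clf.fit(top_k, y)                  # no .todense() needed
print(sparse.issparse(top_k), clf.score(top_k, y))
```

Keeping the matrix sparse also avoids allocating a dense array of users by words at every step of the loop.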