# Geolocation with Information Value (Users)

In this notebook we attempt to geolocate users from their texts...

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd

df_train = pd.read_json("../data/geoloc/users_train.json")
df_test = pd.read_json("../data/geoloc/users_test.json")

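Let's do the following:

- Train a logistic regression on unigrams
- Then try it with the regionalisms

First, the train/test split. The data is already stored as separate train and test files, so here we just check how many training users each province has.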

In [2]:
df_train.groupby("provincia").count()

provincia,text
buenosaires,337
catamarca,341
chaco,331
chubut,328
cordoba,317
corrientes,345
entrerios,338
formosa,286
jujuy,339
lapampa,324


## Precomputed words

First, let's load the words that we know occur a reasonable number of times.

In [3]:
%%time
from contrastes.processing import build_dataframe_from_users
from contrastes.processing import preprocess_raw_df


#word_df = build_dataframe_from_users(row for index, row in df_train.iterrows())

word_df = pd.read_csv("train_word_df_filtered.csv", index_col=0)
word_df = preprocess_raw_df(word_df, filter_words=(10, 2))

CPU times: user 1.5 s, sys: 204 ms, total: 1.7 s
Wall time: 1.77 s


  df.columnas_palabras = cant_palabras
  df.columnas_personas = cant_personas


The logistic regression will be a softmax, so I choose `multi_class='multinomial'`.
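
As a reminder of what "softmax" means here, the following minimal sketch turns per-class scores into probabilities, which is how a multinomial logistic regression produces its prediction. The function name `softmax` and the example scores are illustrative assumptions, not part of the `contrastes` package.

```python
import math


def softmax(scores):
    """Map a list of real-valued class scores to probabilities that sum to 1."""
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three provinces (illustrative values only).
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

Newer scikit-learn releases deprecate the `multi_class` argument, since multinomial is already what solvers such as `lbfgs` use for multiclass targets, so the explicit setting may be unnecessary depending on the installed version.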

In [4]:
from contrastes.lists import add_ival

add_ival(word_df, normalize=True)

Calculating information values...
Calculating ranks...


In [5]:
word_df.sort_values("rank_personas", ascending=True, inplace=True)

word_df.iloc[:10]

word,buenosaires_ocurrencias,buenosaires_usuarios,catamarca_ocurrencias,catamarca_usuarios,chaco_ocurrencias,chaco_usuarios,chubut_ocurrencias,chubut_usuarios,cordoba_ocurrencias,cordoba_usuarios,...,tucuman_usuarios,cant_provincias,cant_palabra,cant_usuarios,ival_palabras,ival_personas,ival_palper,rank_palabras,rank_personas,rank_palper
ush,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,6,638.0,346.0,1.241734,1.612927,2.002826,37.0,1.0,4.0
poec,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1,163.0,164.0,1.059592,1.584212,1.678618,117.5,2.0,18.0
chivil,332.0,79.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1,332.0,158.0,1.207573,1.57452,1.901348,46.0,3.0,6.0
plottier,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,3,848.0,206.0,1.387099,1.570455,2.178376,14.0,4.0,3.0
chivilcoy,2331.0,125.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,...,0.0,6,2337.0,262.0,1.599973,1.569203,2.510682,4.0,5.0,1.0
vallerga,291.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1,291.0,144.0,1.180154,1.545664,1.824121,59.0,6.0,8.0
yarca,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1,160.0,142.0,1.055728,1.545184,1.631294,120.0,7.0,23.0
tolhuin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,5,373.0,200.0,1.184478,1.534011,1.817002,57.0,8.0,9.0
fsa,0.0,0.0,0.0,0.0,22.0,2.0,0.0,0.0,4.0,1.0,...,0.0,6,321.0,232.0,1.033974,1.532793,1.584868,145.0,9.0,26.0
malpegue,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3,264.0,170.0,1.128838,1.523059,1.719288,77.0,10.0,16.0


Let's see how it performs using the top 1000, 2000, 3000 words, and so on...

In [6]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from contrastes.text import tokenize

liw_vectorizer = CountVectorizer(
    tokenizer=tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 10min 14s, sys: 964 ms, total: 10min 15s
Wall time: 10min 15s


The texts are now vectorized with the columns in the expected order: the vocabulary is `word_df.index`, which was sorted by `rank_personas`.
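
Because the vocabulary handed to `CountVectorizer` is `word_df.index` sorted by `rank_personas`, column `j` of the matrix should correspond to the word ranked `j`-th, so restricting to the best `k` words is presumably a matter of keeping the first `k` columns. The sketch below imitates this with plain Python lists. `count_vector`, the whitespace tokenizer, and the toy vocabulary order are assumptions for illustration, and the slicing is a guess at what `fit_classifiers` does internally rather than its confirmed behavior.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Count occurrences of each vocabulary word in text, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[word] for word in vocabulary]


# Toy vocabulary, already sorted from most to least informative (hypothetical order).
vocabulary = ["chivilcoy", "plottier", "tolhuin", "fsa"]
texts = ["me voy a chivilcoy", "plottier y tolhuin y plottier"]

X = [count_vector(t, vocabulary) for t in texts]

# Keeping the first k columns keeps exactly the k best-ranked words.
k = 2
X_top_k = [row[:k] for row in X]
```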

In [7]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)

In [13]:
%%time
from contrastes.classifiers import fit_classifiers

num_words_to_fit = list(range(250, 5000, 250))

ret = fit_classifiers(X_train, y_train, X_test, y_test, province_encoder=province_encoder,
                      range_num_words=num_words_to_fit, num_jobs=6)

Entrenando con 500 palabras
Entrenando con 750 palabras
Entrenando con 1000 palabras
Entrenando con 250 palabras
Entrenando con 1500 palabras
Entrenando con 1250 palabras
250   palabras ----> accuracy 56.24 mean distance 420.5352
Entrenando con 1750 palabras
500   palabras ----> accuracy 63.72 mean distance 339.35
Entrenando con 2000 palabras
750   palabras ----> accuracy 66.64 mean distance 303.342
Entrenando con 2250 palabras
1000  palabras ----> accuracy 69.60 mean distance 262.0232
Entrenando con 2500 palabras
1250  palabras ----> accuracy 71.48 mean distance 223.064
Entrenando con 2750 palabras
1500  palabras ----> accuracy 71.44 mean distance 220.0616
Entrenando con 3000 palabras
1750  palabras ----> accuracy 72.36 mean distance 214.5224
Entrenando con 3250 palabras
2000  palabras ----> accuracy 72.48 mean distance 208.0372
Entrenando con 3500 palabras
2250  palabras ----> accuracy 72.80 mean distance 209.7624
Entrenando con 3750 palabras
2500  palabras ----> accuracy 73.08 mean distance 206.1368
Entrenando con 4000 palabras
2750  palabras ----> accuracy 71.48 mean distance 214.248
Entrenando con 4250 palabras
3000  palabras ----> accuracy 72.20 mean distance 209.75
Entrenando con 4500 palabras
3250  palabras ----> accuracy 72.32 mean distance 208.8172
Entrenando con 4750 palabras
3500  palabras ----> accuracy 73.12 mean distance 201.0868
3750  palabras ----> accuracy 73.04 mean distance 203.1096
4000  palabras ----> accuracy 72.88 mean distance 206.7296
4250  palabras ----> accuracy 73.00 mean distance 206.254
4500  palabras ----> accuracy 73.48 mean distance 203.2804
4750 

In [16]:
for r in ret:
    num_words = r["num_words"]
    acc = r["accuracy"]
    md = r["mean_distance"]
    print("{:<5} palabras ----> accuracy {:.2f} mean distance {}".format(
        num_words, acc*100, md
    ))

250   palabras ----> accuracy 56.24 mean distance 420.5352
500   palabras ----> accuracy 63.72 mean distance 339.35
750   palabras ----> accuracy 66.64 mean distance 303.342
1000  palabras ----> accuracy 69.60 mean distance 262.0232
1250  palabras ----> accuracy 71.48 mean distance 223.064
1500  palabras ----> accuracy 71.44 mean distance 220.0616
1750  palabras ----> accuracy 72.36 mean distance 214.5224
2000  palabras ----> accuracy 72.48 mean distance 208.0372
2250  palabras ----> accuracy 72.80 mean distance 209.7624
2500  palabras ----> accuracy 73.08 mean distance 206.1368
2750  palabras ----> accuracy 71.48 mean distance 214.248
3000  palabras ----> accuracy 72.20 mean distance 209.75
3250  palabras ----> accuracy 72.32 mean distance 208.8172
3500  palabras ----> accuracy 73.12 mean distance 201.0868
3750  palabras ----> accuracy 73.04 mean distance 203.1096
4000  palabras ----> accuracy 72.88 mean distance 206.7296
4250  palabras ----> accuracy 73.00 mean distance 206.254
4500 
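
The notebook does not show how `mean_distance` is computed. One plausible reading, assumed here, is the average great-circle distance between the true and the predicted province, with each province represented by a single point, so that a correct prediction contributes zero. In the sketch below, `haversine_km`, `mean_distance`, `CENTROIDS`, and the rough coordinates of the provincial capitals are all illustrative assumptions, not the `contrastes` implementation.

```python
import math

# Approximate (lat, lon) of provincial capitals; rough illustrative values only.
CENTROIDS = {
    "buenosaires": (-34.92, -57.95),
    "cordoba": (-31.42, -64.18),
    "jujuy": (-24.19, -65.30),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def mean_distance(true_labels, predicted_labels):
    """Average distance between true and predicted provinces; 0 for exact hits."""
    distances = [haversine_km(CENTROIDS[t], CENTROIDS[p])
                 for t, p in zip(true_labels, predicted_labels)]
    return sum(distances) / len(distances)


y_true = ["buenosaires", "cordoba", "jujuy"]
y_pred = ["buenosaires", "cordoba", "cordoba"]  # one mistake: jujuy -> cordoba
md = mean_distance(y_true, y_pred)
```

Under this reading, the metric rewards near misses: confusing two neighbouring provinces costs less than confusing two distant ones, which plain accuracy cannot express.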

In [18]:
%%time
from contrastes.classifiers import fit_classifiers

num_words_to_fit = list(range(5000, 20000, 500))

ret2 = fit_classifiers(X_train, y_train, X_test, y_test, province_encoder, 
                       num_words_to_fit, num_jobs=8)

Entrenando con 5000 palabras
Entrenando con 6000 palabras
Entrenando con 7500 palabras
Entrenando con 7000 palabras
Entrenando con 5500 palabras
Entrenando con 6500 palabras
Entrenando con 8000 palabras
Entrenando con 8500 palabras
5000  palabras ----> accuracy 72.40 mean distance 214.4532
Entrenando con 9000 palabras
5500  palabras ----> accuracy 72.68 mean distance 211.2316
Entrenando con 9500 palabras
6000  palabras ----> accuracy 72.44 mean distance 213.5676
Entrenando con 10000 palabras
6500  palabras ----> accuracy 72.84 mean distance 212.2288
Entrenando con 10500 palabras
7000  palabras ----> accuracy 72.76 mean distance 214.5164
Entrenando con 11000 palabras
8000  palabras ----> accuracy 72.96 mean distance 215.2168
Entrenando con 11500 palabras
7500  palabras ----> accuracy 72.88 mean distance 215.2732
Entrenando con 12000 palabras
8500  palabras ----> accuracy 72.84 mean distance 213.4204
Entrenando con 12500 palabras
9000  palabras ----> accuracy 72.88 mean distance 214.5344
Entrenando con 13000 palabras
9500  palabras ----> accuracy 72.92 mean distance 214.8076
Entrenando con 13500 palabras
10000 palabras ----> accuracy 73.08 mean distance 216.7036
Entrenando con 14000 palabras
10500 palabras ----> accuracy 73.20 mean distance 215.1668
Entrenando con 14500 palabras
11000 palabras ----> accuracy 73.40 mean distance 215.0912
Entrenando con 15000 palabras
11500 palabras ----> accuracy 73.80 mean distance 212.7644
Entrenando con 15500 palabras
12000 palabras ----> accuracy 73.60 mean distance 214.66
Entrenando con 16000 palabras
12500 palabras ----> accuracy 69.16 mean distance 252.2992
Entrenando con 16500 palabras
13000 palabras ----> accuracy 69.08 mean distance 253.2592
Entrenando con 17000 palabras
13500 palabras ----> ac

In [19]:
total = ret + ret2

for r in total:
    num_words = r["num_words"]
    acc = r["accuracy"]
    md = r["mean_distance"]
    print("{:<5} palabras ----> accuracy {:.2f} mean distance {}".format(
        num_words, acc*100, md
    ))

250   palabras ----> accuracy 56.24 mean distance 420.5352
500   palabras ----> accuracy 63.72 mean distance 339.35
750   palabras ----> accuracy 66.64 mean distance 303.342
1000  palabras ----> accuracy 69.60 mean distance 262.0232
1250  palabras ----> accuracy 71.48 mean distance 223.064
1500  palabras ----> accuracy 71.44 mean distance 220.0616
1750  palabras ----> accuracy 72.36 mean distance 214.5224
2000  palabras ----> accuracy 72.48 mean distance 208.0372
2250  palabras ----> accuracy 72.80 mean distance 209.7624
2500  palabras ----> accuracy 73.08 mean distance 206.1368
2750  palabras ----> accuracy 71.48 mean distance 214.248
3000  palabras ----> accuracy 72.20 mean distance 209.75
3250  palabras ----> accuracy 72.32 mean distance 208.8172
3500  palabras ----> accuracy 73.12 mean distance 201.0868
3750  palabras ----> accuracy 73.04 mean distance 203.1096
4000  palabras ----> accuracy 72.88 mean distance 206.7296
4250  palabras ----> accuracy 73.00 mean distance 206.254
4500 
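
With all results in one list, picking the best configuration takes a single call. The sketch below uses three entries copied from the printed output as stand-ins for `total` (accuracies rounded as printed). It assumes each result is a dict with the `num_words`, `accuracy`, and `mean_distance` keys used in the printing loop; the names `sample`, `best_by_accuracy`, and `best_by_distance` are hypothetical.

```python
# Three results copied from the printed output, standing in for `total`.
sample = [
    {"num_words": 250, "accuracy": 0.5624, "mean_distance": 420.5352},
    {"num_words": 2500, "accuracy": 0.7308, "mean_distance": 206.1368},
    {"num_words": 3500, "accuracy": 0.7312, "mean_distance": 201.0868},
]

# Higher accuracy is better; lower mean distance is better.
best_by_accuracy = max(sample, key=lambda r: r["accuracy"])
best_by_distance = min(sample, key=lambda r: r["mean_distance"])
```

The two criteria need not agree in general, so it can be worth reporting both.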

In [20]:
import pickle

# Use a context manager so the file is flushed and closed after writing
with open("res_iv_personas.pkl", "wb") as f:
    pickle.dump(total, f)
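
To reuse the results later, the saved file can be read back with `pickle.load`. The sketch below round-trips a one-element stand-in for `total` through a temporary directory, so it does not touch the real `res_iv_personas.pkl`; the variable names `results` and `loaded` are illustrative.

```python
import os
import pickle
import tempfile

# Stand-in for `total`: one entry copied from the printed output.
results = [{"num_words": 250, "accuracy": 0.5624, "mean_distance": 420.5352}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "res_iv_personas.pkl")
    # Context managers close the file handles even if pickling fails.
    with open(path, "wb") as f:
        pickle.dump(results, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)
```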