# Geolocation with Multiprocessing + IGR over Users

In this notebook we attempt to geolocate users from the texts they wrote.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd

df_train = pd.read_json("../data/geoloc/users_train.json")
df_test = pd.read_json("../data/geoloc/users_test.json")

The plan:

- Train a logistic regression on unigrams
- Then try it with the regionalisms

First, split into train and test. The two sets are loaded from separate files above, so here we just count the training users per province.

In [2]:
df_train.groupby("provincia").count()

| provincia | text |
|---|---|
| buenosaires | 337 |
| catamarca | 341 |
| chaco | 331 |
| chubut | 328 |
| cordoba | 317 |
| corrientes | 345 |
| entrerios | 338 |
| formosa | 286 |
| jujuy | 339 |
| lapampa | 324 |


## Precomputed words

First, load the words that we know occur a reasonable number of times.

In [3]:
%%time
from contrastes.processing import build_dataframe_from_users
from contrastes.processing import preprocess_raw_df


#word_df = build_dataframe_from_users(row for index, row in df_train.iterrows())

word_df = pd.read_csv("train_word_df_filtered.csv", index_col=0)
word_df = preprocess_raw_df(word_df, filter_words=(10, 2))

CPU times: user 1.47 s, sys: 224 ms, total: 1.7 s
Wall time: 1.69 s




The logistic regression will be a softmax, so I choose `multi_class='multinomial'`.
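For reference, a multinomial (softmax) model turns one linear score per province into a single probability distribution over all provinces, rather than fitting independent one-vs-rest classifiers. The sketch below shows only that final step in pure Python; the function name `softmax` and the example scores are illustrative and not taken from `contrastes`.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the max score avoids overflow and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical per-province scores for a single user.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))  # index of the most probable province
```

Newer scikit-learn versions deprecate the `multi_class` argument and use the multinomial formulation by default for multiclass problems, so it is worth checking the installed version before passing it.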

In [4]:
from contrastes.igr import igr

word_df["igr"] = igr(word_df, word_df.columnas_personas)
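The source of `contrastes.igr.igr` is not shown in this notebook. As a rough guide, and assuming IGR stands for information gain ratio computed over the per-province user counts (the `columnas_personas` columns), the score for one word could be sketched as follows. The function names and the counts are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain_ratio(users_with_word, users_total):
    """IGR of the province label with respect to 'user wrote this word'.

    users_with_word: province -> number of users who used the word
    users_total:     province -> total number of users
    """
    provinces = list(users_total)
    with_word = [users_with_word.get(p, 0) for p in provinces]
    without_word = [users_total[p] - w for p, w in zip(provinces, with_word)]
    n = sum(users_total.values())
    n_with = sum(with_word)
    n_without = n - n_with

    h_class = entropy([users_total[p] for p in provinces])
    h_conditional = (n_with / n) * entropy(with_word) \
        + (n_without / n) * entropy(without_word)
    split_info = entropy([n_with, n_without])
    if split_info == 0:
        return 0.0  # word used by everyone or by no one: uninformative
    return (h_class - h_conditional) / split_info


# Hypothetical counts: two provinces with 10 users each.
totals = {"formosa": 10, "salta": 10}
regional = information_gain_ratio({"formosa": 10}, totals)
uniform = information_gain_ratio({"formosa": 5, "salta": 5}, totals)
```

Under this reading, a word used only in one province scores high and a word spread evenly across provinces scores close to zero, which matches the kind of words that surface at the top of the ranking.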

In [5]:
word_df.sort_values("igr", ascending=False, inplace=True)

word_df[:30][word_df.columnas_palabras]

| word | buenosaires_ocurrencias | catamarca_ocurrencias | chaco_ocurrencias | chubut_ocurrencias | cordoba_ocurrencias | corrientes_ocurrencias | entrerios_ocurrencias | formosa_ocurrencias | jujuy_ocurrencias | lapampa_ocurrencias | ... | neuquen_ocurrencias | rionegro_ocurrencias | salta_ocurrencias | sanjuan_ocurrencias | sanluis_ocurrencias | santacruz_ocurrencias | santafe_ocurrencias | santiago_ocurrencias | tierradelfuego_ocurrencias | tucuman_ocurrencias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bombola | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 75.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| logroño | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sanagasta | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| bombolo | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| aijue | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 237.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mitai | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 123.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| aij | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fne | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 75.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| unse | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 |
| quintela | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Let's see how the classifier performs using the top 1000, 2000, 3000 words, and so on.

In [6]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from contrastes.text import tokenize

liw_vectorizer = CountVectorizer(
    tokenizer=tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 10min 22s, sys: 1.16 s, total: 10min 23s
Wall time: 10min 23s


The texts are now vectorized with the columns in the expected order: the vocabulary is `word_df.index`, which was sorted by decreasing IGR.
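The column order matters because a fixed vocabulary keeps the order it is given, so the first `k` columns correspond to the `k` best-ranked words. A minimal pure-Python sketch of that idea follows; `count_vector` and the tiny vocabulary are hypothetical stand-ins for what `CountVectorizer` does with `vocabulary=word_df.index`.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Count tokens against a fixed vocabulary, preserving its order."""
    counts = Counter(tokens)
    # Tokens outside the vocabulary are simply ignored.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary, already sorted by decreasing IGR.
vocabulary = ["aijue", "mitai", "bombola", "unse"]
tokens = "che mitai aijue aijue".split()

row = count_vector(tokens, vocabulary)
top_2 = row[:2]  # features for a model restricted to the 2 best-ranked words
```

With the sparse matrices built above, the equivalent slice would be `X_train[:, :k]`. Whether `fit_classifiers` selects features exactly this way is an assumption, since its source is not shown here.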

In [7]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)
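As a reminder of what the encoder does, the sketch below mimics `LabelEncoder` in pure Python: classes are the sorted unique labels seen during fitting, and each label maps to its position in that list. The helper names are illustrative only.

```python
def fit_labels(labels):
    """Return the sorted unique classes, analogous to LabelEncoder.classes_."""
    return sorted(set(labels))


def transform(labels, classes):
    """Map each label to its integer index in `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]


def inverse_transform(codes, classes):
    """Map integer codes back to the original labels."""
    return [classes[i] for i in codes]


classes = fit_labels(["chaco", "formosa", "chaco", "buenosaires"])
codes = transform(["formosa", "buenosaires"], classes)
names = inverse_transform(codes, classes)
```

Because the encoder is fitted on the training labels only, a province that appears only in the test set would fail to encode (a `KeyError` in this sketch, a `ValueError` in scikit-learn).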

In [8]:
%%time
from contrastes.classifiers import fit_classifiers

num_words_to_fit = list(range(250, 5000, 250)) + list(range(5000, 20000, 500))

ret = fit_classifiers(X_train, y_train, X_test, y_test, 
                      province_encoder=province_encoder,
                      range_num_words=num_words_to_fit, num_jobs=8)

Entrenando con 250 palabras
Entrenando con 750 palabras
Entrenando con 1250 palabras
Entrenando con 1750 palabras
Entrenando con 2250 palabras
Entrenando con 2750 palabras
Entrenando con 3250 palabras
Entrenando con 3750 palabras
250   palabras ----> accuracy 35.32 mean distance 512.7888
Entrenando con 500 palabras
750   palabras ----> accuracy 49.44 mean distance 406.2552
Entrenando con 1000 palabras
1250  palabras ----> accuracy 58.72 mean distance 338.7972
Entrenando con 1500 palabras
1750  palabras ----> accuracy 63.04 mean distance 361.7928
Entrenando con 2000 palabras
2250  palabras ----> accuracy 68.00 mean distance 321.3048
Entrenando con 2500 palabras
2750  palabras ----> accuracy 68.64 mean distance 300.944
Entrenando con 3000 palabras
3250  palabras ----> accuracy 69.88 mean distance 283.8432
Entrenando con 3500 palabras
3750  palabras ----> accuracy 70.48 mean distance 274.104
Entrenando con 4000 palabras
500   palabras ----> accuracy 43.32 mean distance 448.2648
Entrenando con 4250 palabras
1000  palabras ----> accuracy 54.40 mean distance 367.8736
Entrenando con 4750 palabras
1500  palabras ----> accuracy 61.88 mean distance 380.0864
Entrenando con 5500 palabras
2000  palabras ----> accuracy 65.96 mean distance 323.2856
Entrenando con 6500 palabras
2500  palabras ----> accuracy 68.84 mean distance 305.3228
Entrenando con 7500 palabras
3000  palabras ----> accuracy 69.40 mean distance 289.4456
Entrenando con 8500 palabras
3500  palabras ----> accuracy 70.24 mean distance 277.3524
Entrenando con 9500 palabras
4000  palabras ----> accuracy 70.36 mean distance 270.686
Entrenando con 10500 palabras
4250  palabras ----> accuracy 70.60 mean distance 266.756
Entrenando con 4500 palabras
4750  palabras ----> accuracy 71.76 mean distance 256.4124
Entrenando con 5000 palabras
5500  palabras ----> accuracy 72.4
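The log lines above are interleaved because the jobs run in parallel (`num_jobs=8`) and each one prints as soon as it starts or finishes. The sketch below reproduces the pattern of submitting one job per vocabulary size and restoring the order afterwards. It uses threads instead of processes so that it stays self-contained; `ProcessPoolExecutor` exposes the same `submit` interface. The `evaluate` function is a hypothetical stand-in, not the code inside `fit_classifiers`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate(num_words):
    """Hypothetical stand-in for training and scoring one classifier."""
    # A real job would fit a model on the first `num_words` columns.
    return {"num_words": num_words, "placeholder": sum(range(num_words))}


num_words_to_fit = list(range(250, 1250, 250))

results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(evaluate, n) for n in num_words_to_fit]
    # as_completed yields jobs in finishing order, not submission order.
    for future in as_completed(futures):
        results.append(future.result())

# Restore a deterministic order before reporting.
results.sort(key=lambda r: r["num_words"])
```

Sorting by `num_words` at the end is what makes a later report readable regardless of which job finished first.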

In [9]:
for r in ret:
    num_words = r["num_words"]
    acc = r["accuracy"]
    md = r["mean_distance"]
    print("{:<5} palabras ----> accuracy {:.2f} mean distance {}".format(
        num_words, acc*100, md
    ))

250   palabras ----> accuracy 35.32 mean distance 512.7888
500   palabras ----> accuracy 43.32 mean distance 448.2648
750   palabras ----> accuracy 49.44 mean distance 406.2552
1000  palabras ----> accuracy 54.40 mean distance 367.8736
1250  palabras ----> accuracy 58.72 mean distance 338.7972
1500  palabras ----> accuracy 61.88 mean distance 380.0864
1750  palabras ----> accuracy 63.04 mean distance 361.7928
2000  palabras ----> accuracy 65.96 mean distance 323.2856
2250  palabras ----> accuracy 68.00 mean distance 321.3048
2500  palabras ----> accuracy 68.84 mean distance 305.3228
2750  palabras ----> accuracy 68.64 mean distance 300.944
3000  palabras ----> accuracy 69.40 mean distance 289.4456
3250  palabras ----> accuracy 69.88 mean distance 283.8432
3500  palabras ----> accuracy 70.24 mean distance 277.3524
3750  palabras ----> accuracy 70.48 mean distance 274.104
4000  palabras ----> accuracy 70.36 mean distance 270.686
4250  palabras ----> accuracy 70.60 mean distance 266.756
4
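The notebook does not define the `mean distance` column. One plausible reading, assumed here, is the average great-circle distance in kilometers between a reference point of the predicted province and one of the true province. The sketch below implements that with the haversine formula; the coordinates are rough illustrative values and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 \
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def mean_distance(y_true, y_pred, coords):
    """Average distance between true and predicted provinces."""
    dists = [haversine_km(coords[t], coords[p]) for t, p in zip(y_true, y_pred)]
    return sum(dists) / len(dists)


# Approximate reference points, for illustration only.
coords = {"buenosaires": (-34.61, -58.38), "cordoba": (-31.42, -64.18)}

md = mean_distance(
    ["buenosaires", "cordoba"],      # true provinces
    ["buenosaires", "buenosaires"],  # predicted provinces
    coords,
)
```

A correct prediction contributes zero, so under this reading the metric rewards near misses (a neighboring province) more than accuracy alone does.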

In [11]:
import pickle

# Use a context manager so the file is flushed and closed after writing.
with open("res_igr_personas.pkl", "wb") as f:
    pickle.dump(ret, f)

In [13]:
with open("res_igr_personas.pkl", "rb") as f:
    new_ret = pickle.load(f)

clf = new_ret[-1]["clf"]

clf.coef_.shape

(23, 19500)
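The shape reads as 23 provinces by 19500 words: one weight per (province, word) pair in the last, largest model. A simple way to inspect such a matrix is to list the highest-weighted words for each province. The sketch below does this on a toy matrix in pure Python; the weights are made up, and `top_features` is a hypothetical helper.

```python
def top_features(coef, feature_names, class_names, k=2):
    """Return the k highest-weighted feature names for each class."""
    top = {}
    for class_name, weights in zip(class_names, coef):
        # Indices of the weights sorted from largest to smallest.
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        top[class_name] = [feature_names[j] for j in ranked[:k]]
    return top


# Toy 2 x 3 coefficient matrix with made-up weights.
toy_coef = [
    [1.5, -0.2, 0.3],
    [-0.7, 0.9, 0.1],
]
top = top_features(toy_coef, ["aijue", "unse", "mitai"], ["formosa", "santiago"])
```

With the real objects this would presumably be called as `top_features(clf.coef_, word_df.index, province_encoder.classes_)`, assuming the columns still follow the `word_df` order.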