# Geolocation with MultiProcessing + User-Based IGR

In this notebook we attempt to geolocate users from the texts they wrote.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd

df_train = pd.read_json("../data/geoloc/users_train.json")
df_test = pd.read_json("../data/geoloc/users_test.json")

The plan:

- Train a logistic regression on unigrams
- Then try it with the regionalisms

The data already comes split into train and test sets, so first check how many training users each province has:

In [2]:
df_train.groupby("provincia").count()

provincia,text
buenosaires,337
catamarca,341
chaco,331
chubut,328
cordoba,317
corrientes,345
entrerios,338
formosa,286
jujuy,339
lapampa,324


## Precomputed words

First, load the words that we know occur a reasonable number of times.

In [3]:
%%time
from contrastes.processing import build_dataframe_from_users
from contrastes.processing import preprocess_raw_df


#word_df = build_dataframe_from_users(row for index, row in df_train.iterrows())

word_df = pd.read_csv("train_word_df_filtered.csv", index_col=0)
word_df = preprocess_raw_df(word_df, filter_words=(10, 2))

CPU times: user 1.02 s, sys: 200 ms, total: 1.22 s
Wall time: 1.22 s




The logistic regression will be a softmax (multinomial) model, so I choose `multi_class='multinomial'`.
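As a reminder of what the softmax does, here is a minimal pure-Python sketch. It assumes one linear score per class is already available; the `softmax` helper and the example scores are hypothetical and are not taken from the notebook's classifier.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (e.g. w_c . x + b_c for each class c).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# The predicted class is the one with the highest probability.
predicted = probs.index(max(probs))
print(probs, predicted)
```

A multinomial logistic regression fits all class weights jointly so that these probabilities match the observed labels, instead of training one binary classifier per class.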

In [4]:
from contrastes.igr import igr

word_df["igr"] = igr(word_df, word_df.columnas_personas)
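The source does not show how `contrastes.igr.igr` is implemented. As a rough sketch, assuming the common definition of information gain ratio (the information a word's presence gives about the province, divided by the entropy of the presence/absence split) and per-province user counts as input, it could look like the following. The function name `information_gain_ratio` and the toy counts are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain_ratio(users_with_word, users_per_class):
    """IGR of one word.

    users_with_word[c]: users of class c who used the word.
    users_per_class[c]: all users of class c.
    """
    total = sum(users_per_class)
    n_with = sum(users_with_word)
    n_without = total - n_with
    if n_with == 0 or n_without == 0:
        return 0.0  # the word does not split the users at all

    h_class = entropy([n / total for n in users_per_class])
    h_with = entropy([k / n_with for k in users_with_word])
    h_without = entropy([(n - k) / n_without
                         for k, n in zip(users_with_word, users_per_class)])

    p_with = n_with / total
    info_gain = h_class - (p_with * h_with + (1 - p_with) * h_without)
    intrinsic_value = entropy([p_with, 1 - p_with])
    return info_gain / intrinsic_value

# Toy counts (made up): three classes with 100 users each.
users_per_class = [100, 100, 100]
igr_local = information_gain_ratio([50, 0, 0], users_per_class)
igr_uniform = information_gain_ratio([50, 50, 50], users_per_class)
print(igr_local, igr_uniform)
```

Under this definition a word used in a single class scores well above one used evenly everywhere, which fits the strongly regional words at the top of the ranking.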

In [5]:
word_df.sort_values("igr", ascending=False, inplace=True)

word_df[:30][word_df.columnas_palabras]

word,buenosaires_ocurrencias,catamarca_ocurrencias,chaco_ocurrencias,chubut_ocurrencias,cordoba_ocurrencias,corrientes_ocurrencias,entrerios_ocurrencias,formosa_ocurrencias,jujuy_ocurrencias,lapampa_ocurrencias,...,neuquen_ocurrencias,rionegro_ocurrencias,salta_ocurrencias,sanjuan_ocurrencias,sanluis_ocurrencias,santacruz_ocurrencias,santafe_ocurrencias,santiago_ocurrencias,tierradelfuego_ocurrencias,tucuman_ocurrencias
bombola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
logroño,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sanagasta,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bombolo,0.0,0.0,4.0,0.0,0.0,0.0,0.0,69.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aijue,0.0,0.0,0.0,0.0,0.0,2.0,1.0,237.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mitai,0.0,0.0,0.0,0.0,0.0,7.0,0.0,123.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aij,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
fne,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
unse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,0.0,0.0
quintela,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's see how it performs using the top 1000, 2000, 3000 words, and so on.

In [6]:
print("Tenemos {} palabras".format(len(word_df)))

Tenemos 103782 palabras


In [7]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from contrastes.text import tokenize

# Fixed vocabulary: no vocabulary is learned from the texts.
liw_vectorizer = CountVectorizer(
    tokenizer=tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 10min 16s, sys: 1.26 s, total: 10min 17s
Wall time: 10min 17s


The texts are now vectorized in the expected order: column i corresponds to the i-th word of `word_df`, which is sorted by decreasing IGR.
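To see why the column order matters, here is a minimal pure-Python stand-in for a count vectorizer with a fixed vocabulary. It is not scikit-learn's implementation, and `count_vectorize` and the toy texts are hypothetical. Because each column is tied to a vocabulary position, keeping the first k columns keeps the k highest-ranked words, which is presumably how `fit_classifiers` restricts the features, although the source does not show it.

```python
from collections import Counter

def count_vectorize(texts, vocabulary):
    """Count each vocabulary word per text, keeping the vocabulary order."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Toy vocabulary, assumed to be already sorted by decreasing IGR.
vocabulary = ["aijue", "mitai", "bombola"]
texts = ["aijue bombola bombola", "mitai mitai bombola"]

X = count_vectorize(texts, vocabulary)

# Keeping the first k columns restricts the features to the top-k words.
k = 2
X_top_k = [row[:k] for row in X]
print(X)
print(X_top_k)
```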

In [8]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)

In [9]:
%%time
from contrastes.classifiers import fit_classifiers

num_words_to_fit = list(range(250, 5000, 250)) + list(range(5000, 20000, 500))

params = {"max_iter": 7000}

ret = fit_classifiers(X_train, y_train, X_test, y_test, 
                      province_encoder=province_encoder, clf_params=params,
                      range_num_words=num_words_to_fit, num_jobs=12)

Classifier params: {'multi_class': 'multinomial', 'solver': 'saga', 'penalty': 'l2', 'max_iter': 7000}
Entrenando con 750 palabras
Entrenando con 250 palabras
Entrenando con 2250 palabras
Entrenando con 1250 palabras
Entrenando con 1750 palabras
Entrenando con 3750 palabras
Entrenando con 3250 palabras
Entrenando con 2750 palabras
Entrenando con 4250 palabras
Entrenando con 5500 palabras
Entrenando con 4750 palabras
Entrenando con 6500 palabras
250   palabras ----> accuracy 36.16 mean distance 632.0648
Entrenando con 500 palabras
750   palabras ----> accuracy 50.80 mean distance 492.4676
Entrenando con 1000 palabras
1250  palabras ----> accuracy 60.16 mean distance 393.7628
Entrenando con 1500 palabras
1750  palabras ----> accuracy 64.16 mean distance 356.9404
Entrenando con 2000 palabras
2250  palabras ----> accuracy 69.24 mean distance 314.4144
Entrenando con 2500 palabras
2750  palabras ----> accuracy 70.32 mean distance 234.2132
Entrenando con 3000 palabras
3250  palabras ----> acc

In [12]:
for r in ret:
    num_words = r["num_words"]
    acc = r["accuracy"]
    md = r["mean_distance"]
    print("{:<5} palabras ----> accuracy {:.2f} mean distance {}".format(
        num_words, acc*100, md
    ))

250   palabras ----> accuracy 36.16 mean distance 632.0648
500   palabras ----> accuracy 44.60 mean distance 542.0792
750   palabras ----> accuracy 50.80 mean distance 492.4676
1000  palabras ----> accuracy 55.64 mean distance 436.9324
1250  palabras ----> accuracy 60.16 mean distance 393.7628
1500  palabras ----> accuracy 62.32 mean distance 376.7876
1750  palabras ----> accuracy 64.16 mean distance 356.9404
2000  palabras ----> accuracy 67.40 mean distance 316.8856
2250  palabras ----> accuracy 69.24 mean distance 314.4144
2500  palabras ----> accuracy 70.28 mean distance 236.2344
2750  palabras ----> accuracy 70.32 mean distance 234.2132
3000  palabras ----> accuracy 71.64 mean distance 225.6068
3250  palabras ----> accuracy 71.96 mean distance 224.1312
3500  palabras ----> accuracy 72.28 mean distance 220.1132
3750  palabras ----> accuracy 73.04 mean distance 257.0116
4000  palabras ----> accuracy 73.36 mean distance 250.6696
4250  palabras ----> accuracy 73.72 mean distance 249.67

In [14]:
import pickle

with open("res_igr_personas.pkl", "wb") as f:
    pickle.dump(ret, f)

In [15]:
with open("res_igr_personas.pkl", "rb") as f:
    new_ret = pickle.load(f)

clf = new_ret[-1]["clf"]

clf.coef_.shape

(23, 19500)
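This shape is consistent with the run above: 23 rows, one per province class, and 19500 columns, one per word. Assuming the results are ordered by increasing number of words, as the printed listing suggests, the last classifier is the one trained on the largest vocabulary size in `num_words_to_fit`. A quick check of that grid:

```python
# Same grid as in the training cell: steps of 250 up to 4750,
# then steps of 500 from 5000 up to 19500.
num_words_to_fit = list(range(250, 5000, 250)) + list(range(5000, 20000, 500))

print(len(num_words_to_fit))   # 49 vocabulary sizes in total
print(num_words_to_fit[0])     # 250
print(num_words_to_fit[-1])    # 19500
```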