# Geolocation with TF-IPF

In this notebook we attempt to geolocate users from the texts they write.

We will do something different, though: we use Term Frequency - Inverse Province Frequency (TF-IPF).


[Geolocation prediction in social media data by finding location indicative words](http://www.aclweb.org/anthology/C12-1064)
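
The notebook does not spell out the TF-IPF formula, so the following is only a sketch. It assumes that TF-IPF mirrors TF-IDF with provinces taking the place of documents: a word's count in a province is weighted by `log(P / p_w)`, where `P` is the number of provinces and `p_w` the number of provinces in which the word occurs. The toy counts and the `tf_ipf` helper are hypothetical.

```python
import math

# Toy word counts per province (hypothetical data, for illustration only).
counts = {
    "cordoba": {"fernet": 12, "hola": 30},
    "jujuy": {"hola": 25, "chango": 7},
    "chaco": {"hola": 40},
}


def tf_ipf(counts):
    """Return {province: {word: tf * log(P / provinces_with_word)}}."""
    num_provinces = len(counts)
    # Number of provinces in which each word occurs at least once.
    province_freq = {}
    for word_counts in counts.values():
        for word in word_counts:
            province_freq[word] = province_freq.get(word, 0) + 1
    return {
        province: {
            word: tf * math.log(num_provinces / province_freq[word])
            for word, tf in word_counts.items()
        }
        for province, word_counts in counts.items()
    }


scores = tf_ipf(counts)
# "hola" occurs in every province, so its weight is zero everywhere;
# "fernet" occurs only in cordoba, so it keeps a positive weight there.
print(scores)
```

Under this assumed weighting, words spread evenly across the country contribute nothing, which is the property a location-indicative vocabulary needs.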

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd

df_train = pd.read_json("../data/geoloc/users_train.json")
df_test = pd.read_json("../data/geoloc/users_test.json")

The plan is as follows:

- Train a logistic regression on unigrams to predict each user's province
- Then try the regionalisms

First, the train/test split. Here are the per-province user counts in the training set:

In [2]:
df_train.groupby("provincia").count()


provincia,text
buenosaires,337
catamarca,341
chaco,331
chubut,328
cordoba,317
corrientes,345
entrerios,338
formosa,286
jujuy,339
lapampa,324


## Precomputed words

First, let's load the words that we know occur a reasonable number of times.

In [3]:
%%time
from contrastes.processing import build_dataframe_from_users
from contrastes.processing import preprocess_raw_df


#word_df = build_dataframe_from_users(row for index, row in df_train.iterrows())

word_df = pd.read_csv("train_word_df_filtered.csv", index_col=0)
word_df = preprocess_raw_df(word_df, filter_words=(10, 2))

CPU times: user 1.04 s, sys: 140 ms, total: 1.18 s
Wall time: 1.18 s




In [4]:
word_df.sort_values(["cant_provincias", "cant_palabra"], ascending=[True, False], inplace=True)

word_df.iloc[:10][["cant_palabra", "cant_provincias"]]

,cant_palabra,cant_provincias
tiemposur,883.0,1
logroño,711.0,1
nihuil,450.0,1
chivil,332.0,1
ipauss,315.0,1
vallerga,291.0,1
asprodema,290.0,1
cdelu,244.0,1
calahorra,216.0,1
canicross,202.0,1


Let's see how the classifier performs using the first 1,000, 2,000, 3,000 words, and so on.
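
A sweep of this kind can be sketched as follows. It assumes the feature columns are already sorted from most to least location-indicative, so that the first `k` columns are the top `k` words. The synthetic data and the `sweep` helper are hypothetical; this is not the `fit_classifiers` implementation from `contrastes`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the vectorized users: 200 users, 30 word columns, 3 classes.
y = rng.integers(0, 3, size=200)
X = rng.poisson(1.0, size=(200, 30)).astype(float)
# Make the first three columns informative: column c is boosted for class c.
for c in range(3):
    X[y == c, c] += 3

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def sweep(sizes):
    """Fit one classifier per vocabulary size, using only the first k columns."""
    results = []
    for k in sizes:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[:, :k], y_train)
        acc = clf.score(X_test[:, :k], y_test)
        results.append({"num_words": k, "accuracy": acc})
    return results


results = sweep([1, 3, 10, 30])
for r in results:
    print("{:<5} words ----> accuracy {:.2f}".format(r["num_words"], r["accuracy"] * 100))
```

Slicing columns rather than re-vectorizing keeps each run cheap, which matters when the vectorization step itself takes minutes.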

In [5]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from contrastes.text import tokenize

liw_vectorizer = CountVectorizer(
    tokenizer=tokenize,
    vocabulary=word_df.index)

X_train = liw_vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = liw_vectorizer.transform(df_test["text"])

Vectorizing
CPU times: user 9min 22s, sys: 928 ms, total: 9min 23s
Wall time: 9min 23s


The texts are now vectorized, with the columns in the expected order.
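
The column order comes from passing a fixed `vocabulary` to `CountVectorizer`: column `j` counts the `j`-th word of that list. A minimal sketch, using a made-up ranked word list and the default tokenizer instead of the notebook's `tokenize`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical ranked vocabulary: most location-indicative word first.
ranked_words = ["ura", "che", "boludo"]
docs = ["che boludo che", "ura che"]

vectorizer = CountVectorizer(vocabulary=ranked_words)
# With a fixed vocabulary nothing is learned from the data, so transform() suffices.
X = vectorizer.transform(docs).toarray()

# Column j counts ranked_words[j], so X[:, :k] keeps exactly the top-k words.
print(X)
```

Because the ranking is baked into the column order, restricting a model to the top `k` words is a plain column slice.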

In [9]:
import numpy as np

# Incomplete cell: np.save expects np.save(file, arr), and the target path was left empty.
# np.save(open(""))


In [13]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)

In [14]:
%%time
from contrastes.classifiers import fit_classifiers

num_words_to_fit = list(range(250, 5000, 250)) + list(range(5000, 20000, 500))

params = {"max_iter": 7000}

ret = fit_classifiers(X_train, y_train, X_test, y_test, 
                      province_encoder=province_encoder, clf_params=params,
                      range_num_words=num_words_to_fit, num_jobs=8)

Classifier params: {'multi_class': 'multinomial', 'solver': 'saga', 'penalty': 'l2', 'max_iter': 7000}
Entrenando con 250 palabras
Entrenando con 2250 palabras
Entrenando con 1750 palabras
Entrenando con 750 palabras
Entrenando con 1250 palabras
Entrenando con 2750 palabras
Entrenando con 3750 palabras
Entrenando con 3250 palabras
250   palabras ----> accuracy 25.12 mean distance 750.3752
Entrenando con 500 palabras
750   palabras ----> accuracy 35.24 mean distance 643.978
Entrenando con 1000 palabras
1250  palabras ----> accuracy 44.24 mean distance 568.6348
Entrenando con 1500 palabras
1750  palabras ----> accuracy 51.92 mean distance 496.7968
Entrenando con 2000 palabras
2750  palabras ----> accuracy 57.28 mean distance 438.098
Entrenando con 3000 palabras
2250  palabras ----> accuracy 53.72 mean distance 475.3536
Entrenando con 2500 palabras
3250  palabras ----> accuracy 60.12 mean distance 419.0848
Entrenando con 3500 palabras
3750  palabras ----> accuracy 61.00 mean distance 402.



8500  palabras ----> accuracy 68.24 mean distance 282.8416
Entrenando con 9000 palabras
5000  palabras ----> accuracy 65.44 mean distance 363.3788
Entrenando con 12500 palabras
6000  palabras ----> accuracy 67.40 mean distance 325.9384
Entrenando con 13500 palabras




9500  palabras ----> accuracy 68.44 mean distance 280.7484
Entrenando con 10000 palabras
7000  palabras ----> accuracy 67.52 mean distance 319.4756
Entrenando con 14500 palabras




10500 palabras ----> accuracy 68.48 mean distance 281.9504
Entrenando con 11000 palabras
8000  palabras ----> accuracy 67.64 mean distance 315.458
Entrenando con 15500 palabras
9000  palabras ----> accuracy 68.40 mean distance 280.7088
Entrenando con 16500 palabras




11500 palabras ----> accuracy 69.36 mean distance 272.6472
Entrenando con 12000 palabras
10000 palabras ----> accuracy 68.44 mean distance 281.9068
Entrenando con 17500 palabras




12500 palabras ----> accuracy 69.80 mean distance 269.2596
Entrenando con 13000 palabras
11000 palabras ----> accuracy 68.56 mean distance 282.1152
Entrenando con 18500 palabras




13500 palabras ----> accuracy 69.84 mean distance 272.1724
Entrenando con 14000 palabras




14500 palabras ----> accuracy 69.92 mean distance 270.7404
Entrenando con 15000 palabras




15500 palabras ----> accuracy 69.80 mean distance 272.006
Entrenando con 16000 palabras
12000 palabras ----> accuracy 69.60 mean distance 272.0088
Entrenando con 19500 palabras
13000 palabras ----> accuracy 69.80 mean distance 270.3984
16500 palabras ----> accuracy 70.08 mean distance 270.608
Entrenando con 17000 palabras
14000 palabras ----> accuracy 69.88 mean distance 272.28
15000 palabras ----> accuracy 69.92 mean distance 271.1168
18500 palabras ----> accuracy 70.20 mean distance 271.2784
Entrenando con 19000 palabras
16000 palabras ----> accuracy 69.72 mean distance 270.9292
19500 palabras ----> accuracy 70.32 mean distance 270.548
17000 palabras ----> accuracy 70.12 mean distance 270.9364
18000 palabras ----> accuracy 70.16 mean distance 270.6296
19000 palabras ----> accuracy 70.36 mean distance 270.8848
CPU times: user 2.5 s, sys: 1.71 s, total: 4.21 s
Wall time: 26min 24s


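Accuracy climbs quickly at first (55.08% with 2,500 words, 65.44% with 5,000) and then plateaus at around 70% from roughly 12,500 words onward. The best run shown, with 19,000 words, reaches 70.36%, and the mean distance levels off at about 270. Quite good. Adding more words beyond that point brings no real improvement.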
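
The log does not say how `mean distance` is computed. One plausible reading, offered only as an assumption, is the average great-circle distance between the predicted and the true province, each represented by a single reference point. The sketch below uses the haversine formula; the coordinates are rough capital-city approximations and the `mean_distance` helper is hypothetical.

```python
import math

# Approximate capital-city coordinates (lat, lon) used as province reference points.
# Illustrative values only.
CENTROIDS = {
    "cordoba": (-31.42, -64.18),
    "jujuy": (-24.19, -65.30),
    "chaco": (-27.45, -58.99),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def mean_distance(y_true, y_pred):
    """Average distance between true and predicted provinces; 0 for a correct hit."""
    dists = [haversine_km(CENTROIDS[t], CENTROIDS[p]) for t, p in zip(y_true, y_pred)]
    return sum(dists) / len(dists)


y_true = ["cordoba", "jujuy", "chaco"]
y_pred = ["cordoba", "chaco", "chaco"]
print(mean_distance(y_true, y_pred))
```

A metric like this rewards near misses: confusing two neighbouring provinces costs less than confusing two distant ones, which plain accuracy cannot express.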

In [15]:
for r in ret:
    num_words = r["num_words"]
    acc = r["accuracy"]
    md = r["mean_distance"]
    print("{:<5} palabras ----> accuracy {:.2f} mean distance {}".format(
        num_words, acc*100, md
    ))

250   palabras ----> accuracy 25.12 mean distance 750.3752
500   palabras ----> accuracy 31.92 mean distance 676.392
750   palabras ----> accuracy 35.24 mean distance 643.978
1000  palabras ----> accuracy 38.56 mean distance 614.062
1250  palabras ----> accuracy 44.24 mean distance 568.6348
1500  palabras ----> accuracy 49.32 mean distance 516.4276
1750  palabras ----> accuracy 51.92 mean distance 496.7968
2000  palabras ----> accuracy 53.00 mean distance 484.152
2250  palabras ----> accuracy 53.72 mean distance 475.3536
2500  palabras ----> accuracy 55.08 mean distance 462.016
2750  palabras ----> accuracy 57.28 mean distance 438.098
3000  palabras ----> accuracy 59.12 mean distance 424.4572
3250  palabras ----> accuracy 60.12 mean distance 419.0848
3500  palabras ----> accuracy 60.68 mean distance 411.8884
3750  palabras ----> accuracy 61.00 mean distance 402.1748
4000  palabras ----> accuracy 61.28 mean distance 396.7428
4250  palabras ----> accuracy 63.44 mean distance 352.6272
450

In [16]:
import pickle

# Use a context manager so the file is flushed and closed
with open("res_tf_ipf.pkl", "wb") as f:
    pickle.dump(ret, f)


In [17]:
with open("res_tf_ipf.pkl", "rb") as f:
    new_ret = pickle.load(f)

clf = new_ret[-1]["clf"]

clf.coef_.shape

(23, 19500)
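
The shape `(23, 19500)` corresponds to one row of weights per province class and one column per word. Assuming the columns follow the vocabulary order, the highest-weighted words for each province could be read off along these lines. The tiny coefficient matrix, class names, word list and `top_words` helper are made up for illustration.

```python
import numpy as np

# Made-up stand-ins for clf.coef_, province_encoder.classes_ and word_df.index.
coef = np.array([
    [0.9, -0.2, 0.1, 0.0],
    [-0.5, 1.3, 0.2, 0.4],
])
classes = ["chaco", "cordoba"]
vocabulary = np.array(["ura", "fernet", "che", "chango"])


def top_words(coef, classes, vocabulary, k=2):
    """Map each class to its k highest-weighted words, strongest first."""
    # argsort is ascending, so reverse each row and keep the first k columns.
    order = np.argsort(coef, axis=1)[:, ::-1][:, :k]
    return {cls: vocabulary[idx].tolist() for cls, idx in zip(classes, order)}


print(top_words(coef, classes, vocabulary))
```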