# Bag-of-Words Geolocation

In this notebook we build a geolocation baseline using a bag-of-words representation over all the words in each user's text.

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd

df_train = pd.read_json("../data/geoloc/users_train.json")
df_test = pd.read_json("../data/geoloc/users_test.json")

In [2]:
df_train.groupby("provincia").count()


provincia,text
buenosaires,337
catamarca,341
chaco,331
chubut,328
cordoba,317
corrientes,345
entrerios,338
formosa,286
jujuy,339
lapampa,324


## Precomputed words

First, let's load the words that we already know occur a reasonable number of times.

In [3]:
%%time
from contrastes.processing import build_dataframe_from_users
from contrastes.processing import preprocess_raw_df


#word_df = build_dataframe_from_users(row for index, row in df_train.iterrows())

word_df = pd.read_csv("train_word_df_filtered.csv", index_col=0)
word_df = preprocess_raw_df(word_df, filter_words=(10, 2))


CPU times: user 1.01 s, sys: 152 ms, total: 1.16 s
Wall time: 1.16 s
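The implementation of `preprocess_raw_df` is not shown in this notebook. One plausible reading of `filter_words=(10, 2)` is "keep words that occur at least 10 times and are used by at least 2 users"; that reading is an assumption. A minimal sketch of such a filter on a toy table, with hypothetical column names and a hypothetical helper `filter_rare_words`:

```python
import pandas as pd

# Toy word table, one row per word. Column names are hypothetical.
toy_word_df = pd.DataFrame(
    {
        "cant_palabras": [25, 3, 40, 12],  # total occurrences of the word
        "cant_personas": [8, 1, 1, 5],     # number of distinct users who used it
    },
    index=["che", "asdfgh", "spamword", "mate"],
)


def filter_rare_words(df, min_occurrences, min_users):
    """Keep words with enough total occurrences AND enough distinct users."""
    mask = (df["cant_palabras"] >= min_occurrences) & (df["cant_personas"] >= min_users)
    return df[mask]


# Assumed meaning of filter_words=(10, 2): 10 occurrences, 2 users.
filtered = filter_rare_words(toy_word_df, min_occurrences=10, min_users=2)
print(list(filtered.index))
```

Requiring a minimum number of users, and not only a minimum count, would drop words that a single very active account repeats many times.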




In [4]:
word_df.sort_index(inplace=True)

In [5]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
from contrastes.text import tokenize

vectorizer = CountVectorizer(
    tokenizer=tokenize,
    vocabulary=word_df.index)

X_train = vectorizer.fit_transform(df_train["text"])
print("Vectorizing")
X_test = vectorizer.transform(df_test["text"])


Vectorizing
CPU times: user 9min 14s, sys: 672 ms, total: 9min 15s
Wall time: 9min 15s
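To make the role of the fixed `vocabulary` argument concrete, here is a small sketch on toy documents. It uses scikit-learn's default tokenizer instead of `contrastes.text.tokenize` (which is not shown here), and the words and documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# With a fixed vocabulary, the columns are exactly these words, in this order.
toy_vocabulary = ["che", "mate", "asado"]
toy_vectorizer = CountVectorizer(vocabulary=toy_vocabulary)

toy_docs = ["che che mate", "hola mundo", "asado y mate"]

# No fitting is needed to learn the vocabulary: words outside it are ignored.
X_toy = toy_vectorizer.transform(toy_docs)
print(X_toy.toarray())
```

Each row is a document and each column counts one vocabulary word, so the second document, which contains no vocabulary words, becomes an all-zero row.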


In [6]:
print("Vectorizer vocabulary: {} words".format(len(vectorizer.vocabulary_)))

Vectorizer vocabulary: 103782 words


In [7]:
from sklearn.preprocessing import LabelEncoder

province_encoder = LabelEncoder()

province_encoder.fit(df_train["provincia"].values)

y_train = province_encoder.transform(df_train["provincia"].values)
y_test = province_encoder.transform(df_test["provincia"].values)
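As a quick illustration of what `LabelEncoder` does with the province names, here is a sketch on a handful of toy labels (the label list is made up for the example):

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
# Classes are stored sorted and without duplicates.
toy_encoder.fit(["cordoba", "chaco", "jujuy", "chaco"])

# Each name maps to its position in the sorted class list.
codes = toy_encoder.transform(["jujuy", "chaco"])
# inverse_transform recovers the names from the integer codes.
names = toy_encoder.inverse_transform(codes)

print(list(toy_encoder.classes_), list(codes), list(names))
```

The inverse mapping is what lets a scoring function translate predicted integer classes back into province names.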

The logistic regression will be a softmax classifier, so I set `multi_class='multinomial'`.

In [8]:
from sklearn.linear_model import LogisticRegression

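# Multinomial (softmax) logistic regression with L2 regularization.
# Note: scikit-learn documents n_jobs as parallelizing over classes when multi_class='ovr'.
clf = LogisticRegression(multi_class='multinomial', solver='saga', n_jobs=10, penalty='l2')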
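For reference, a multinomial logistic regression scores every class with a linear function of the features and turns the scores into probabilities with a softmax. A minimal NumPy sketch of that prediction step, with made-up weights and inputs (this is not the fitted `clf`):

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical parameters: 3 classes, 2 features.
W = np.array([[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]])
b = np.zeros(3)

# Two toy feature vectors (for example, counts of two words).
X = np.array([[2.0, 0.0], [0.0, 2.0]])

probs = softmax(X @ W.T + b)  # shape (n_samples, n_classes)
predicted = probs.argmax(axis=1)
print(probs.round(3), predicted)
```

Each row of `probs` sums to 1, and the predicted class is the one with the highest probability.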

In [10]:
%%time
clf.fit(X_train, y_train)

CPU times: user 6min 4s, sys: 212 ms, total: 6min 4s
Wall time: 6min 4s




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=10, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [12]:
%%time
clf.score(X_train, y_train)

CPU times: user 652 ms, sys: 76 ms, total: 728 ms
Wall time: 728 ms


0.3838666666666667

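38% accuracy. Note that this is measured on the training set, since `clf.score` was called with `X_train` and `y_train`.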
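One way to put an accuracy figure in context is to compare it against a trivial classifier that always predicts the most frequent class. The sketch below does this on toy labels with scikit-learn's `DummyClassifier`; the data is made up and the numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: class 0 appears 3 times out of 6.
y_demo = np.array([0, 0, 0, 1, 1, 2])
# DummyClassifier ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Accuracy of always predicting the majority class.
baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

The same pattern would apply here by fitting on `X_train`, `y_train` and scoring on `X_test`, `y_test`, alongside `clf.score(X_test, y_test)`.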

In [14]:
from contrastes.geo import mean_distance_score

mean_distance_score(clf, X_test, y_test, province_encoder)

599.8988
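The implementation of `mean_distance_score` lives in `contrastes.geo` and is not shown here. A plausible version, assuming each province is represented by a single reference point and the error is the great-circle distance between the true and predicted provinces, could look like the sketch below. The coordinates are rough values for the cities of Buenos Aires and Córdoba, and `COORDS`, `haversine_km` and `mean_distance` are hypothetical names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

# Rough (lat, lon) reference points, for illustration only.
COORDS = {
    "buenosaires": (-34.6, -58.4),
    "cordoba": (-31.4, -64.2),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*a, *b))
    h = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


def mean_distance(true_names, predicted_names):
    """Average distance between true and predicted provinces (0 km when correct)."""
    distances = [
        haversine_km(COORDS[t], COORDS[p])
        for t, p in zip(true_names, predicted_names)
    ]
    return float(np.mean(distances))


# One correct prediction and one wrong one.
score = mean_distance(["buenosaires", "cordoba"], ["buenosaires", "buenosaires"])
print(score)
```

Unlike accuracy, a metric of this kind penalizes a prediction in a neighbouring province less than one on the other side of the country.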

### Results

- Accuracy (training set): 38%
- Mean distance (test set): approximately 600 km