In [1]:
import sklearn
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 1000)

# Step 1. Cleaning and Preparing Data

In [3]:
df = pd.read_csv('./data/data-sample-invoices.csv', index_col=0)

In [4]:
len(df)

26040

- Integer columns with at least one `NaN` are automatically converted by pandas to `float64`.
- To allow these columns to hold integers and null values, we convert them to the `'Int64'` dtype (nullable integer array).
- Finally, we convert them to the `'category'` dtype.
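
The three steps above can be checked on a toy series. This is a minimal sketch: the series `s` and its values are made up for illustration and are not taken from the invoice data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: integer codes with one missing value.
s = pd.Series([3461, np.nan, 3462])
print(s.dtype)        # float64, because NaN forces a float representation

# Nullable integer dtype: keeps whole numbers and marks the gap as <NA>.
s_int = s.astype('Int64')
print(s_int.dtype)    # Int64

# Categorical dtype: the missing value stays missing and is not a category.
s_cat = s_int.astype('category')
print(len(s_cat.cat.categories))   # 2
```

Going through `Int64` first means the category labels are integers such as `3461` rather than floats such as `3461.0`.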

In [5]:
df['nature'] = df['nature'].astype('Int64')
df['nature'] = df['nature'].astype('category')

In [6]:
df['cost_center'] = df['cost_center'].astype('Int64')
df['cost_center'] = df['cost_center'].astype('category')

### Targets: `nature`, `cost_center`, `prepayment` values in extra_data

In [7]:
len(df['nature'].unique())  # 172 classes over 26,040 examples

172

In [8]:
len(df['cost_center'].unique())  # 274 classes over 26,040 examples

274
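
One caveat about the counts above: `unique()` reports a missing value as its own entry, so `len(...unique())` can be one higher than the number of real classes. The sketch below uses a made-up `target` series (an assumption, not the invoice data) to show the difference and to list how many examples each class has.

```python
import pandas as pd

# Hypothetical toy target with one missing label.
target = pd.Series([3461, 3462, None, 3462]).astype('Int64').astype('category')

print(len(target.unique()))   # 3: two classes plus the missing value
print(target.nunique())       # 2: missing values are excluded by default

# Examples per class, most frequent first; rare classes show up at the tail.
print(target.value_counts())
```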

In [9]:
#df["prepayment"].replace({"FALSO": False, "VERDADERO": True}, inplace=True)  # highly skewed: 26,005 of 26,040 rows are False (about 99.9%)
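
The skew noted in the comment above can be measured directly. This is a sketch on a made-up `prepayment` series that only reuses the two string labels; it maps them to booleans without `inplace` and prints the class shares.

```python
import pandas as pd

# Hypothetical toy column using the same string labels as `prepayment`.
prepayment = pd.Series(["FALSO", "FALSO", "VERDADERO", "FALSO"])

# Map the Spanish labels to booleans; any unmapped value would become NaN.
as_bool = prepayment.map({"FALSO": False, "VERDADERO": True})

# Share of each class; a value near 1.0 for one class signals heavy skew.
print(as_bool.value_counts(normalize=True))
```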

#### Benchmark with Logistic Regression or Multinomial?
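
A possible shape for such a benchmark is sketched below. Several things here are assumptions: "Multinomial" is read as scikit-learn's `MultinomialNB`, TF-IDF is picked as the vectorizer, and the corpus, the label `5000` and the query string are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels; the real inputs would be the
# concatenated text column and one of the target columns.
texts = [
    "honorarios por gestion",
    "honorarios por asesoria",
    "renta de oficina",
    "renta de bodega",
]
labels = [3461, 3461, 5000, 5000]

# Two candidate baselines sharing the same vectorization step.
candidates = {
    "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "multinomial_nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

predictions = {}
for name, model in candidates.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(["honorarios por gestion de renta"])
    print(name, predictions[name])
```

On the real data the two pipelines would be compared on held-out rows, for example with `cross_val_score`, rather than on the training texts.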

In [17]:
df.loc[[1344]]
# in this sample, text is not provided.

Unnamed: 0,counterparty_name,counterparty_alias,counterparty_rfc,descriptions,id,prepayment,nature,cost_center,all_text
1344,REMAPA,,REM120119US5,HONORARIOS POR GESTI&Oacute;N PARA TRABAJOS E...,45765920,FALSO,3461,10031,REMAPA REM120119US5 HONORARIOS POR GESTI&Oacu...


In [13]:
df.drop('text', inplace=True, axis=1)

In [42]:
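# Concatenate the text features into one string per row; it is tokenized and vectorized later.
# Note: a NaN in any of the three columns makes the whole result NaN, which astype(str) turns into 'nan'.
df['all_text'] = df['counterparty_name'] + ' ' + df['counterparty_rfc'] + ' ' + df['descriptions']
df['all_text'] = df['all_text'].astype(str)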
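
Concatenating string columns with `+` propagates missing values, which matters if any of the three columns has gaps. The sketch below uses a made-up two-row frame (`toy`, with a hypothetical "ACME" row) to show the effect and one possible workaround with `fillna('')`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has no RFC.
toy = pd.DataFrame({
    "counterparty_name": ["REMAPA", "ACME"],
    "counterparty_rfc": ["REM120119US5", np.nan],
    "descriptions": ["HONORARIOS", "RENTA"],
})

# Plain concatenation: one missing piece turns the whole row into NaN, then 'nan'.
plain = (toy["counterparty_name"] + " " + toy["counterparty_rfc"] + " " + toy["descriptions"]).astype(str)
print(plain.tolist())   # ['REMAPA REM120119US5 HONORARIOS', 'nan']

# Filling gaps with empty strings first keeps the remaining text.
filled = (
    toy["counterparty_name"].fillna("") + " "
    + toy["counterparty_rfc"].fillna("") + " "
    + toy["descriptions"].fillna("")
)
print(filled.tolist())  # ['REMAPA REM120119US5 HONORARIOS', 'ACME  RENTA']
```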

In [43]:
import re
def convertAccented(text, pattobj):
    '''
    Replaces lowercase HTML entities for acute-accented vowels
    (e.g. "&oacute;") with the accented character ("ó").
    `pattobj` is a compiled regex that captures the vowel in group 1.
    '''
    accented = {
        'a':'á',
        'e':'é',
        'i':'í',
        'o': 'ó',
        'u':'ú'
    }
    
    def accentRepl(matchobj):
        letter = matchobj.group(1)
        return accented[letter]
    
    text = pattobj.sub(accentRepl, text)
    return text

In [44]:
def normalizeTextColumn(dataframe):
    # Lowercase the `all_text` column and decode acute-vowel HTML entities (modifies the dataframe in place)
    patt = r'&([aeiou])acute;'  # vowel is captured by group 1
    rgx = re.compile(patt)
    dataframe['all_text'] = dataframe['all_text'].apply(lambda x: convertAccented(x.lower(), rgx))

In [46]:
patt = r'&([aeiou])acute;'  # vowel is captured by group 1
rgx = re.compile(patt)

convertAccented(df.loc[1344, 'all_text'].lower(), rgx)

'remapa rem120119us5  honorarios por gestión para trabajos especializados cc 10031 proveedor: 326824'

In [47]:
normalizeTextColumn(df)
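
The regex above only covers the five acute-accented vowels. As a hedged alternative, the standard library's `html.unescape` decodes any HTML entity; the sketch below applies it to a fragment of the sample row and to a made-up `&Ntilde;` example. Swapping it in is a suggestion, not what the notebook does.

```python
import html

# Raw fragment as it appears in the sample row above.
raw = "HONORARIOS POR GESTI&Oacute;N"

# html.unescape decodes named and numeric entities, not only acute vowels.
decoded = html.unescape(raw).lower()
print(decoded)   # honorarios por gestión

# Decoding before lowercasing avoids depending on lowercased entity names.
print(html.unescape("A&Ntilde;O").lower())   # año
```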

At this point the features x1, x2, ..., xp are concatenated into a single text string.
We have to vectorize the text of each observation before passing it to a classification algorithm.

In scikit-learn, the vectorizers implement tokenization.
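
To see what that tokenization does, the analyzer of a vectorizer can be called on its own. The sketch below uses `CountVectorizer` with default settings; the input string is adapted from the sample row, with a comma added for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lowercase, then keep tokens of two or more word characters.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("REMAPA REM120119US5 honorarios por gestión, cc 10031")
print(tokens)
# ['remapa', 'rem120119us5', 'honorarios', 'por', 'gestión', 'cc', '10031']
```

With the defaults, punctuation is dropped, accents are kept, and the RFC survives as a single token.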

##### Text pre-cleaning
1. A good practice is to remove punctuation first.
2. In Spanish, we perhaps should not strip accents (although sometimes the text does not even include them).
3. Do we want to keep only relevant words that exist in Spanish?
4. Without cleaning, which tokens are the most frequent? How much predictive power does an RFC have?
5. Should we use a preprocessing function or rely on the parameters of a Vectorizer?
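
Some of these questions can be explored with the vectorizer's own parameters. The sketch below is one possible approach on a made-up three-row corpus (the "acme" row is hypothetical): it folds accents with `strip_accents` and ranks tokens by total count. The default token pattern already discards punctuation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the `all_text` column.
corpus = [
    "remapa rem120119us5 honorarios por gestión",
    "remapa rem120119us5 honorarios por asesoría",
    "acme renta de oficina",
]

# strip_accents='unicode' folds "gestión" into "gestion"; None (the default) keeps accents.
vectorizer = CountVectorizer(strip_accents="unicode")
counts = vectorizer.fit_transform(corpus)

# Total occurrences of each token across the corpus, most frequent first.
totals = np.asarray(counts.sum(axis=0)).ravel()
frequency = sorted(
    ((token, int(totals[idx])) for token, idx in vectorizer.vocabulary_.items()),
    key=lambda pair: (-pair[1], pair[0]),
)
print(frequency[:5])
```

Running the same ranking on the full column would show whether RFC tokens dominate the vocabulary, which is a first hint of how much signal they carry.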

https://scikit-learn.org/0.15/modules/feature_extraction.html#text-feature-extraction