## Introduction to ML
Machine learning (ML) is a subfield of artificial intelligence that aims to build systems capable of performing tasks without being explicitly programmed to do so. ML algorithms use mathematical models that learn from existing data to carry out tasks such as prediction, classification, and decision-making. The 'learning' part is also called 'training': the model analyzes large volumes of data to identify patterns. This process is computationally intensive, since the model must perform many calculations over the data it is given. However, as the computing power available to us keeps growing, training and deploying ML models has become considerably more accessible and very popular. Because natural language processing (NLP) also requires analyzing large volumes of data, ML algorithms are widely applied to text processing.

ML algorithms are commonly grouped into two broad families:

* Supervised learning
* Unsupervised learning

## Naive Bayes

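Naive Bayes classifiers are built on Bayes' theorem, which relates the conditional probabilities of two events $A$ and $B$:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$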
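
As a quick numeric illustration of the theorem, the sketch below wraps the formula in a small helper. The function name `bayes_posterior` and the probabilities passed to it are made-up values for illustration only; they do not come from this notebook.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: P(B|A) = 0.9, P(A) = 0.2, P(B) = 0.3
posterior = bayes_posterior(p_b_given_a=0.9, p_a=0.2, p_b=0.3)
print(round(posterior, 3))
```

With these inputs the posterior is 0.9 × 0.2 / 0.3, that is, 0.6.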

Suppose we have the following fictitious dataset containing information about applications to Ivy League universities. The independent variables are the applicant's SAT score, the applicant's GPA, and whether the applicant's parents are alumni of an Ivy League university. The dependent variable is the outcome of the application. Based on this data, we want to calculate the probability that an applicant is admitted to an Ivy League university given that their SAT score is above 1,500, their GPA is above 3.2, and their parents are not alumni.

| SAT Score | GPA  | Alumni Parents | Ivy League Admission? |
|----------|------|----------------|------------------------|
| 1,580    | 4.0  | 0              |         1              |
| 1,450    | 3.1  | 1              |         1              |
| 1,480    | 3.6  | 0              |         0              |
| 1,410    | 3.33 | 0              |         0              |
| 1,280    | 3.0  | 1              |         1              |
| 1,440    | 3.7  | 0              |         0              |
| 1,560    | 3.9  | 1              |         1              |
| >1,500   | >3.2 | 0              |         ?              |


The probability that such an applicant is admitted to an Ivy League school can be written as the following conditional probability, where AP denotes the Alumni Parents column:

$$P(\text{Ivy League} \mid \text{SAT}>1500, \text{GPA}>3.2, \text{AP}=0)$$

Using Bayes' theorem, this becomes:

$$\frac{P(\text{SAT}>1500, \text{GPA}>3.2, \text{AP}=0 \mid \text{Ivy League}) \cdot P(\text{Ivy League})}{P(\text{SAT}>1500, \text{GPA}>3.2, \text{AP}=0)}$$

We could solve this equation by computing the respective joint probabilities from the table above. For larger datasets, however, computing the joint probability can be challenging. To get around this, we use Naive Bayes, which assumes that all features are independent of one another, so that the joint probability is simply the product of the individual probabilities. The assumption is 'naive' because it is almost always wrong: even in this example, an applicant with a high SAT score is more likely to have a high GPA, so the two events are not independent. Nevertheless, Naive Bayes has been shown to work well in practice for classification problems. Under this assumption the expression becomes:

$$
\frac{P(\text{SAT} > 1500 \mid \text{Ivy League}) \cdot P(\text{GPA} > 3.2 \mid \text{Ivy League}) \cdot P(\text{AP} = 0 \mid \text{Ivy League}) \cdot P(\text{Ivy League})}{P(\text{SAT} > 1500)\cdot P(\text{GPA} > 3.2)\cdot P(\text{AP} = 0)}
$$
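
The individual probabilities in this expression can be read directly off the table. The sketch below is one way to do that count in plain Python, using exact fractions; the helper `prob` and the variable names are illustrative choices, and the numerator includes the prior P(Ivy League) = 4/7, as Bayes' theorem requires.

```python
from fractions import Fraction

# Rows of the table above: (SAT score, GPA, alumni parents, admitted)
rows = [
    (1580, 4.0, 0, 1),
    (1450, 3.1, 1, 1),
    (1480, 3.6, 0, 0),
    (1410, 3.33, 0, 0),
    (1280, 3.0, 1, 1),
    (1440, 3.7, 0, 0),
    (1560, 3.9, 1, 1),
]

# One boolean test per feature, matching the query SAT > 1500, GPA > 3.2, AP = 0
conditions = [
    lambda r: r[0] > 1500,
    lambda r: r[1] > 3.2,
    lambda r: r[2] == 0,
]


def prob(condition, subset):
    """Fraction of rows in `subset` that satisfy `condition`."""
    return Fraction(sum(1 for r in subset if condition(r)), len(subset))


admitted = [r for r in rows if r[3] == 1]
prior = Fraction(len(admitted), len(rows))  # P(Ivy League) = 4/7

numerator = prior
denominator = Fraction(1)
for cond in conditions:
    numerator *= prob(cond, admitted)   # P(feature | Ivy League)
    denominator *= prob(cond, rows)     # P(feature)

posterior = numerator / denominator
print(posterior, float(posterior))
```

The result, 49/160 = 0.30625, matches the value obtained by multiplying the fractions by hand.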


## Building a sentiment analyzer using the Naive Bayes algorithm
Sentiment analysis, also known as opinion mining or polarity detection, refers to the set of algorithms and techniques used to extract the polarity of a given document; that is, to determine whether the sentiment of a document is positive, negative, or neutral. Sentiment analysis is gaining popularity in industry because it lets organizations mine the opinions of a large group of users or potential customers cost-effectively. Today it is widely used in advertising campaigns, political campaigns, stock analysis, and more.

Now that we understand the math behind the Naive Bayes algorithm, we will build a sentiment analyzer by training a Naive Bayes model on a labeled dataset of product reviews collected from Amazon. This dataset was created for the paper "From Group to Individual Labels using Deep Features", Kotzias et al., KDD 2015, and can be accessed at http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

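In [3]:
# Numerator for the admissions example:
# P(SAT>1500 | Ivy) * P(GPA>3.2 | Ivy) * P(AP=0 | Ivy) * P(Ivy)
num = (2/4)*(2/4)*(1/4)*(4/7)

In [4]:
# Denominator: P(SAT>1500) * P(GPA>3.2) * P(AP=0)
den = (2/7)*(5/7)*(4/7)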

In [5]:
num/den


0.30625

## Naive Bayes model

In [6]:
import pandas as pd
import numpy as np

In [7]:
# read the amazon_cells_labelled.txt file (tab-separated, no header row)
df = pd.read_csv('../Datos/amazon_cells_labelled.txt', sep='\t', header=None)
df

|     | 0 | 1 |
|-----|---|---|
| 0   | So there is no way for me to plug it in here i... | 0 |
| 1   | Good case, Excellent value. | 1 |
| 2   | Great for the jawbone. | 1 |
| 3   | Tied to charger for conversations lasting more... | 0 |
| 4   | The mic is great. | 1 |
| ... | ... | ... |
| 995 | The screen does get smudged easily because it ... | 0 |
| 996 | What a piece of junk.. I lose more calls on th... | 0 |
| 997 | Item Does Not Match Picture. | 0 |
| 998 | The only thing that disappoint me is the infra... | 0 |


In [8]:
# column 0 holds the review text, column 1 the sentiment label (1 = positive, 0 = negative)
X = df.iloc[:,0]
y = df.iloc[:,1]

In [9]:
X

0      So there is no way for me to plug it in here i...
1                            Good case, Excellent value.
2                                 Great for the jawbone.
3      Tied to charger for conversations lasting more...
4                                      The mic is great.
                             ...                        
995    The screen does get smudged easily because it ...
996    What a piece of junk.. I lose more calls on th...
997                         Item Does Not Match Picture.
998    The only thing that disappoint me is the infra...
999    You can not answer calls with the unit, never ...
Name: 0, Length: 1000, dtype: object

In [10]:
y

0      0
1      1
2      1
3      0
4      1
      ..
995    0
996    0
997    0
998    0
999    0
Name: 1, Length: 1000, dtype: int64

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# download the NLTK resources used below (stop word list and WordNet for lemmatization)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aluca\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aluca\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Corpus cleaning

In [13]:
def text_clean(corpus, keep_list):
    '''
    Purpose : Function to keep only alphabetic characters and the words in 'keep_list'
              (digits, punctuation, question marks, tabs etc. are removed)

    Input : Takes a text corpus, 'corpus' to be cleaned along with a list of words, 'keep_list', which have to be retained
            unchanged even after the cleaning process

    Output : Returns the cleaned text corpus as a pandas Series

    '''
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # replace every non-letter character with a space, then lowercase
                p1 = re.sub(pattern='[^a-zA-Z]', repl=' ', string=word)
                qs.append(p1.lower())
            else:
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    # build the Series once instead of concatenating inside the loop
    return pd.Series(cleaned_rows, dtype=object)

def stopwords_removal(corpus):
    # keep wh-words in the corpus by removing them from the stop word list
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.discard(word)
    # tokenize each document and drop the remaining stop words
    return [[word for word in doc.split() if word not in stop] for doc in corpus]

def lemmatize(corpus):
    # lemmatize every token as a verb (pos='v')
    lem = WordNetLemmatizer()
    return [[lem.lemmatize(word, pos='v') for word in doc] for doc in corpus]

def stem(corpus, stem_type=None):
    # Snowball stemmer if requested, otherwise the Porter stemmer
    if stem_type == 'snowball':
        stemmer = SnowballStemmer(language='english')
    else:
        stemmer = PorterStemmer()
    return [[stemmer.stem(word) for word in doc] for doc in corpus]


def preprocess(corpus, keep_list, cleaning = True, stemming = False, stem_type = None, lemmatization = False, remove_stopwords = True):
    '''
    Purpose : Function to perform all pre-processing tasks (cleaning, stemming, lemmatization, stopwords removal etc.)

    Input :
    'corpus' - Text corpus on which pre-processing tasks will be performed
    'keep_list' - List of words to be retained during cleaning process
    'cleaning', 'stemming', 'lemmatization', 'remove_stopwords' - Boolean variables indicating whether a particular task should
                                                                  be performed or not
    'stem_type' - Choose between Porter stemmer or Snowball(Porter2) stemmer. Default is "None", which corresponds to Porter
                  Stemmer. 'snowball' corresponds to Snowball Stemmer

    Note : Either stemming or lemmatization should be used. There's no benefit of using both of them together

    Output : Returns the processed text corpus

    '''

    if cleaning:
        corpus = text_clean(corpus, keep_list)

    if remove_stopwords:
        corpus = stopwords_removal(corpus)
    else:
        # no stop word removal: only split each document into tokens
        corpus = [doc.split() for doc in corpus]

    if lemmatization:
        corpus = lemmatize(corpus)

    if stemming:
        corpus = stem(corpus, stem_type)

    # join the tokens of each document back into a single string
    corpus = [' '.join(tokens) for tokens in corpus]

    return corpus
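
To see what the cleaning rule does to a single sentence, here is a minimal standalone sketch. The helper `clean_sentence` and the example sentence are hypothetical; the helper only mirrors the rule applied during cleaning (tokens in `keep_list` pass through untouched, everything else is lowercased and stripped of non-letters) and is not part of the notebook's pipeline.

```python
import re


def clean_sentence(sentence, keep_list):
    """Lowercase tokens and strip non-letters, except for tokens in keep_list."""
    cleaned = []
    for token in sentence.split():
        if token in keep_list:
            cleaned.append(token)
        else:
            cleaned.append(re.sub(r"[^a-zA-Z]", " ", token).lower())
    # Re-split to collapse the extra spaces left behind by removed characters
    return " ".join(" ".join(cleaned).split())


example = clean_sentence("Mr. Smith paid $20 for 2 cases!", keep_list=["Mr."])
print(example)
```

Note that digits are removed along with punctuation, so "$20" and "2" disappear entirely, while "Mr." survives because it is listed in `keep_list`.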

In [14]:

common_dot_words = ['U.S.', 'Mr.', 'Mrs.', 'D.C.']

In [15]:
preprocessed_corpus = preprocess(X, keep_list = common_dot_words, stemming = False, stem_type = None,
                                lemmatization = True, remove_stopwords = True)

In [16]:
preprocessed_corpus

['way plug us unless go converter',
 'good case excellent value',
 'great jawbone',
 'tie charger conversations last minutes major problems',
 'mic great',
 'jiggle plug get line right get decent volume',
 'several dozen several hundred contact imagine fun send one one',
 'razr owner must',
 'needless say waste money',
 'what waste money time',
 'sound quality great',
 'impress when go original battery extend battery',
 'two seperated mere ft start notice excessive static garble sound headset',
 'good quality though',
 'design odd ear clip comfortable',
 'highly recommend one who blue tooth phone',
 'advise everyone fool',
 'far good',
 'work great',
 'click place way make wonder how long mechanism would last',
 'go motorola website follow directions could get pair',
 'buy use kindle fire absolutely love',
 'commercials mislead',
 'yet run new battery two bar three days without charge',
 'buy mother problem battery',
 'great pocket pc phone combination',
 'own phone months say best mob

### Vocabulary

In [17]:
set_of_words = set()
for sentence in preprocessed_corpus:
    print(sentence.split())
    for word in sentence.split():
        set_of_words.add(word)
vocab = list(set_of_words)
print('----------')
print(vocab)

['way', 'plug', 'us', 'unless', 'go', 'converter']
['good', 'case', 'excellent', 'value']
['great', 'jawbone']
['tie', 'charger', 'conversations', 'last', 'minutes', 'major', 'problems']
['mic', 'great']
['jiggle', 'plug', 'get', 'line', 'right', 'get', 'decent', 'volume']
['several', 'dozen', 'several', 'hundred', 'contact', 'imagine', 'fun', 'send', 'one', 'one']
['razr', 'owner', 'must']
['needless', 'say', 'waste', 'money']
['what', 'waste', 'money', 'time']
['sound', 'quality', 'great']
['impress', 'when', 'go', 'original', 'battery', 'extend', 'battery']
['two', 'seperated', 'mere', 'ft', 'start', 'notice', 'excessive', 'static', 'garble', 'sound', 'headset']
['good', 'quality', 'though']
['design', 'odd', 'ear', 'clip', 'comfortable']
['highly', 'recommend', 'one', 'who', 'blue', 'tooth', 'phone']
['advise', 'everyone', 'fool']
['far', 'good']
['work', 'great']
['click', 'place', 'way', 'make', 'wonder', 'how', 'long', 'mechanism', 'would', 'last']
['go', 'motorola', 'website', 

In [18]:
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(preprocessed_corpus)

In [19]:
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())

['abhor' 'ability' 'able' ... 'yes' 'yet' 'zero']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [20]:
X_vec = bow_matrix.toarray()
X_vec.shape

(1000, 1440)
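
To make the contents of the count matrix concrete, the following sketch builds a bag-of-words representation by hand for three made-up sentences. It is a simplified illustration rather than a reimplementation of `CountVectorizer`, which additionally lowercases the text and, by default, ignores single-character tokens. The names `toy_docs`, `toy_vocab`, and `toy_bow` are illustrative.

```python
from collections import Counter

# Toy corpus (made-up sentences, assumed to be already preprocessed)
toy_docs = ["great phone great battery", "battery die fast", "great value"]

# Vocabulary: sorted unique tokens, one matrix column per token
toy_vocab = sorted({token for doc in toy_docs for token in doc.split()})

# Count matrix: one row per document, one column per vocabulary entry
toy_bow = []
for doc in toy_docs:
    counts = Counter(doc.split())
    toy_bow.append([counts[token] for token in toy_vocab])

print(toy_vocab)
for row in toy_bow:
    print(row)
```

Each row records how many times every vocabulary word occurs in one document, which is why the matrix for the full corpus has one row per review and one column per distinct word.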

### Building a basic TF-IDF vectorizer

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
vectorizer = TfidfVectorizer()
bow_matrix_x = vectorizer.fit_transform(preprocessed_corpus)
print(vectorizer.get_feature_names_out())
print(bow_matrix_x.toarray())
# from here on, X_vec holds the TF-IDF features instead of the raw counts
X_vec = bow_matrix_x.toarray()
X_vec.shape

['abhor' 'ability' 'able' ... 'yes' 'yet' 'zero']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


(1000, 1440)

In [23]:
bow_matrix_x.shape

(1000, 1440)
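
For intuition about what the TF-IDF weights represent, the sketch below computes them by hand on a made-up three-sentence corpus. It assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts as the term frequency, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. All variable names are illustrative.

```python
import math
from collections import Counter

# Toy corpus (made-up sentences, assumed to be already preprocessed)
toy_docs = ["great phone great battery", "battery die fast", "great value"]
toy_vocab = sorted({t for d in toy_docs for t in d.split()})
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term
doc_freq = {t: sum(1 for d in toy_docs if t in d.split()) for t in toy_vocab}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in toy_vocab}

toy_tfidf = []
for d in toy_docs:
    counts = Counter(d.split())
    row = [counts[t] * idf[t] for t in toy_vocab]   # tf * idf
    norm = math.sqrt(sum(v * v for v in row))       # L2 norm of the row
    toy_tfidf.append([v / norm for v in row])

for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

Within a document, a word that appears in few documents (such as "die") receives a larger weight than one that appears in many (such as "battery"), and every row ends up with unit length.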

In [24]:
# create a DataFrame with one column per vocabulary term
df_bow = pd.DataFrame(bow_matrix_x.toarray(), columns=vectorizer.get_feature_names_out())
df_bow


Unnamed: 0,abhor,ability,able,abound,absolutel,absolutely,ac,accept,acceptable,access,...,would,wow,wrong,wrongly,year,years,yell,yes,yet,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# concatenate with the target variable y
df_bow = pd.concat([df_bow, y], axis=1)
df_bow

Unnamed: 0,abhor,ability,able,abound,absolutel,absolutely,ac,accept,acceptable,access,...,wow,wrong,wrongly,year,years,yell,yes,yet,zero,1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [26]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)


In [27]:
# Naive Bayes model implementation
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf = MultinomialNB()
clf.fit(X_train, y_train)

# generate predictions
y_pred = clf.predict(X_test)



In [28]:
y_pred

array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1])
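
To connect these predictions back to the math, here is a minimal multinomial Naive Bayes classifier written from scratch. It is a sketch on made-up training sentences, using word counts and Laplace smoothing with `alpha = 1.0`; it is not the scikit-learn implementation, and the names `train`, `word_counts`, and `predict` are illustrative. The denominator of Bayes' theorem is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (preprocessed text, label), 1 = positive, 0 = negative
train = [
    ("great phone love", 1),
    ("excellent value great", 1),
    ("waste money", 0),
    ("poor battery waste", 0),
]

alpha = 1.0  # Laplace smoothing constant
toy_vocab = sorted({t for text, _ in train for t in text.split()})
class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())


def predict(text):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for token in text.split():
            if token not in toy_vocab:
                continue  # ignore words never seen during training
            likelihood = (word_counts[label][token] + alpha) / (
                total + alpha * len(toy_vocab)
            )
            score += math.log(likelihood)  # log P(word | class)
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("great value"), predict("waste battery"))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long documents.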

In [29]:
# evaluate the model with accuracy_score, precision_score, recall_score and f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# compute and display the confusion matrix
from sklearn.metrics import confusion_matrix
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

# display the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Confusion matrix:
[[74 19]
 [14 93]]
Accuracy: 0.835
Precision: 0.8303571428571429
Recall: 0.8691588785046729
F1 Score: 0.8493150684931506
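
The four metrics can be recomputed directly from the confusion matrix, which makes their definitions explicit. The sketch below uses the matrix reported above and assumes scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]]
cm = [[74, 19], [14, 93]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are truly positive
recall = tp / (tp + fn)                      # true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

These values agree with the scores printed by scikit-learn: 167 of 200 test reviews are classified correctly, 93 of the 112 predicted positives are truly positive, and 93 of the 107 actual positives are recovered.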


In [30]:
# try a support vector machine (SVC) on the same features
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
clf = SVC()
clf.fit(X_train, y_train)

# generate predictions
y_pred = clf.predict(X_test)

# compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# display the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.805
Precision: 0.8777777777777778
Recall: 0.7383177570093458
F1 Score: 0.8020304568527918


In [32]:
# train an XGBoost model on the same features
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
clf = XGBClassifier()
clf.fit(X_train, y_train)

# generate predictions
y_pred = clf.predict(X_test)

# compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# display the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)




Accuracy: 0.705
Precision: 0.7790697674418605
Recall: 0.6261682242990654
F1 Score: 0.694300518134715
