# Sentiment Analysis

**Installing and importing libraries**

In [1]:
!pip install pandas
!pip install numpy
!pip install scikit-learn

Collecting pandas
  Downloading pandas-1.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting numpy>=1.18.5
  Downloading numpy-1.23.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting pytz>=2020.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.23.1 pandas-1.4.3 pytz-2022.1

In [2]:
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB # the multinomial variant is the usual choice for text (word counts)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

**Loading the data**

In [3]:
url='https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews_dataset.csv'
df = pd.read_csv(url)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


The dataset has 891 observations and 3 variables, none with missing values. Two of the variables are stored as `object` and one as an integer (`int64`).

In [5]:
df.sample(10)

| | package_name | review | polarity |
|---|---|---|---|
| 53 | com.twitter.android | bug in changing notification sound i gave 1 s... | 0 |
| 434 | com.facebook.orca | makes top left of my screen unresponsive many... | 0 |
| 441 | com.whatsapp | it's ok !!! only 1 request... its been now mo... | 1 |
| 487 | com.Slack | great app, i mainly use it to keep the team ... | 1 |
| 836 | com.hamropatro | usefull no others app like this.... | 1 |
| 788 | org.mozilla.firefox | shite. crashes constantly! test 1st does noth... | 0 |
| 693 | com.hamrokeyboard | great app i really liked this app and its ver... | 1 |
| 733 | com.opera.mini.native | keeps crashing it only works well in extreme ... | 0 |
| 324 | com.viber.voip | new message notification fault it doesn't not... | 0 |
| 553 | com.dropbox.android | 5-stars.! this app has saved my life on multi... | 1 |


The first two variables should be strings and the last one should be categorical, so the corresponding type conversions are applied.

In [6]:
df[df.select_dtypes('object').columns]=df[df.select_dtypes('object').columns].astype('string')
df['polarity']=df['polarity'].astype('category')

Next, some unrecognized characters found in the text are removed, to reduce the set of distinct characters before the algorithm is applied.

In [7]:
df['review']=df['review'].str.strip() # removes leading and trailing whitespace
df['review']=df['review'].str.lower() # converts everything to lowercase

In [8]:
# Normalizes the text to Unicode Normal Form Decomposed (NFD), then encodes it to ASCII ignoring errors, which drops accents and any other non-ASCII characters.
def normalize_str(text_string):
    if text_string is not None:
        result=unicodedata.normalize('NFD',text_string).encode('ascii','ignore').decode()
    else:
        result=None 
    return result

In [9]:
df['review']=df['review'].apply(normalize_str)
df['review']=df['review'].str.replace('!','',regex=False)
df['review']=df['review'].str.replace(',','',regex=False)
df['review']=df['review'].str.replace('&','',regex=False)
df['review']=df['review'].str.normalize('NFKC')
df['review']=df['review'].str.replace(r'([a-zA-Z])\1{2,}',r'\1',regex=True) # collapses any letter repeated three or more times in a row into a single occurrence
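
To make the effect of these cleaning steps concrete, the sketch below applies the same sequence (strip, lowercase, NFD to ASCII, punctuation removal, repeated-letter collapse) to a single plain string using only the standard library. The helper name `clean_review` and the sample text are hypothetical and are not part of the notebook.

```python
import re
import unicodedata

def clean_review(text):
    """Illustrative helper: mirrors the notebook's cleaning steps on one string."""
    text = text.strip().lower()
    # Decompose accented characters, then drop everything that is not ASCII
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode()
    # Remove the same three punctuation characters as the notebook
    for ch in '!,&':
        text = text.replace(ch, '')
    # Collapse a letter repeated three or more times into a single letter
    return re.sub(r'([a-zA-Z])\1{2,}', r'\1', text)

sample = "  Soooo GOOD, café!!!  "   # made-up review text
cleaned = clean_review(sample)
print(cleaned)
```

Note that a letter repeated only twice (as in "good") is left alone, because the pattern requires at least two repetitions after the first occurrence.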


In [10]:
df.sample(10)

| | package_name | review | polarity |
|---|---|---|---|
| 730 | com.opera.mini.native | old version was better this new version needs ... | 0 |
| 713 | com.opera.mini.native | use to be a 5 star app i gave this app 1 star ... | 0 |
| 741 | com.shirantech.kantipur | it is best app for regular news update but ...... | 1 |
| 703 | com.opera.mini.native | classic browsing at it's best i've recently up... | 1 |
| 644 | com.uc.browser.en | simple and powerfull it's fast small and perfe... | 1 |
| 753 | com.shirantech.kantipur | good app it help us to get fast news. | 1 |
| 688 | com.hamrokeyboard | best app i have seen so far..... rsrzrzrlrrl r... | 1 |
| 343 | com.viber.voip | contacts and delays is it just me? with the ne... | 0 |
| 12 | com.facebook.katana | connection issues everytime i try and click on... | 0 |
| 612 | com.evernote | really cool and organized. the new update is r... | 1 |


In [11]:
df.iloc[675,]

package_name                                    com.hamrokeyboard
review          loved it rzrl app rrlrrl rrsrzrlrrl r rrsrzrlr...
polarity                                                        1
Name: 675, dtype: object

The target and feature variables are defined.

In [12]:
X=df['review']
y=df['polarity']

The sample is then split into training and test sets, stratifying on the target variable so that both sets keep the same class proportions.

In [21]:
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=2007,stratify=y)
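
As a rough illustration of what `stratify` does, the sketch below splits a small made-up dataset with a 2:1 class imbalance and counts the labels on each side. The toy data and variable names are assumptions for this example only. The default `test_size` of 0.25 is used, as in the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 16 negative and 8 positive labels, a 2:1 imbalance
toy_reviews = [f"review {i}" for i in range(24)]
toy_labels = [0] * 16 + [1] * 8

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, random_state=0, stratify=toy_labels
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both subsets keep the original 2:1 ratio between classes, which is the property the notebook relies on when it evaluates the minority class on the test set.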

Before defining the classification model with the Naive Bayes algorithm, a sparse matrix is built in which each column corresponds to one of the distinct words or tokens found in the reviews. Each row represents a review, and each cell holds the number of times that word or token appears in it.

In [22]:
vec=CountVectorizer(stop_words='english')


In [23]:
X_train=vec.fit_transform(X_train).toarray()
X_test=vec.transform(X_test).toarray()

In [27]:
X_train.shape

(668, 3142)

The matrix has 3142 columns, which is the number of distinct words or tokens in the training reviews. The first 10 are printed below.

In [31]:
vec.get_feature_names_out()[:10]

array(['10', '100', '101', '11', '1186', '12', '13', '14', '14th', '15'],
      dtype=object)

The Multinomial Naive Bayes model (the variant normally used for text) is now fitted to the word-count matrix.

In [32]:
nb=MultinomialNB()
nb.fit(X_train,y_train)
print('Train accuracy:', nb.score(X_train,y_train))

Train accuracy: 0.9580838323353293
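
For a scikit-learn classifier, `score` returns the mean accuracy on the data it is given, not an R² value. The sketch below checks this on a tiny made-up count matrix; the array values and names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary terms
toy_X = np.array([[2, 0, 1],
                  [1, 0, 2],
                  [0, 3, 0],
                  [0, 2, 1]])
toy_y = np.array([1, 1, 0, 0])

toy_nb = MultinomialNB()   # default alpha=1.0 applies Laplace smoothing
toy_nb.fit(toy_X, toy_y)

score_value = toy_nb.score(toy_X, toy_y)
accuracy_value = accuracy_score(toy_y, toy_nb.predict(toy_X))
print(score_value, accuracy_value)
```

Because this figure is computed on the training data, it is best read together with the test-set metrics that follow.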


In [33]:
y_predict=nb.predict(X_test)

In [34]:
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.83      0.90      0.87       146
           1       0.78      0.65      0.71        77

    accuracy                           0.82       223
   macro avg       0.81      0.78      0.79       223
weighted avg       0.81      0.82      0.81       223



The fit on the test sample is fairly good for both classes: overall accuracy is 0.82, and even for the minority class (polarity 1) recall is 0.65, above 60%.
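
The notebook imports `confusion_matrix` and `ConfusionMatrixDisplay`, but the cells shown here do not use them. As a minimal sketch of how per-class recall relates to the raw counts, the example below builds a confusion matrix from made-up labels. With the notebook's variables the call would presumably be `confusion_matrix(y_test, y_predict)`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; they are not the notebook's predictions
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(toy_true, toy_pred)   # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

recall_positive = tp / (tp + fn)   # share of true 1s that were predicted as 1
print(cm)
print("recall for class 1:", recall_positive)
```

Recall for class 1 is the bottom-right cell divided by the sum of the bottom row, which is the quantity reported as 0.65 in the classification report above.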

In [None]:
import pickle
filename = '/workspace/Naive-Bayes/models/finalized_model.sav'
# Use a context manager so the file handle is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(nb, f)
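
A saved classifier can only score new text if the same fitted vectorizer is available to turn that text into counts. The sketch below is one possible way to handle this, assuming it is acceptable to pickle both objects together. The toy training data, the dictionary keys and the temporary file path are hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, only so that the example runs on its own
toy_texts = ["love this app", "great app works well",
             "app crashes constantly", "terrible app keeps crashing"]
toy_polarity = [1, 1, 0, 0]

toy_vec = CountVectorizer(stop_words='english')
toy_model = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_polarity)

# Store the vectorizer and the model together in a temporary location
bundle_path = os.path.join(tempfile.mkdtemp(), "model_and_vectorizer.pkl")
with open(bundle_path, 'wb') as f:
    pickle.dump({'vectorizer': toy_vec, 'model': toy_model}, f)

# Load the bundle back and predict on unseen text
with open(bundle_path, 'rb') as f:
    bundle = pickle.load(f)

new_reviews = ["app crashes"]
restored_pred = bundle['model'].predict(bundle['vectorizer'].transform(new_reviews))
original_pred = toy_model.predict(toy_vec.transform(new_reviews))
print(restored_pred, original_pred)
```

Pickle files should only be loaded from trusted sources, and they are generally safest to load with the same scikit-learn version that produced them.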