# Text Naive Bayes Classification

In this code notebook we will train a Naive Bayes Classifier to automatically
recognize good comments and bad comments about a selected subject.



**Figure 1.** Supervised classification. (a) During training, a feature extractor converts each input value to a feature set. These feature sets capture the basic information about each input that should be used to classify it. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor converts unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels (figure taken from www.nltk.org/book/ch06.html).
<img src="https://www.nltk.org/images/supervised-classification.png">
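
The two phases in Figure 1 can be illustrated with a very small feature extractor. The sketch below is only an illustration: `contains_features`, `toy_train` and the example sentences are hypothetical names and data, and the `contains(word)` feature style is an assumption chosen to resemble the features printed later in this notebook.

```python
# Hypothetical helper: a bag-of-words feature extractor (illustration only)
def contains_features(text, vocabulary):
    """Map a text to {'contains(word)': bool} for every vocabulary word."""
    tokens = set(text.lower().split())
    return {f"contains({word})": (word in tokens) for word in sorted(vocabulary)}

# Toy labeled data, invented for this example
toy_train = [("the food was great", "POS"), ("the service was awful", "NEG")]

# The vocabulary is every word seen in the training texts
vocabulary = {w for text, _ in toy_train for w in text.lower().split()}

# (a) Training: build (feature set, label) pairs for the learning algorithm
train_pairs = [(contains_features(text, vocabulary), label)
               for text, label in toy_train]

# (b) Prediction: the same extractor is applied to an unseen input
unseen_features = contains_features("great food", vocabulary)
print(unseen_features)
```

The key point is that training and prediction share one extractor, so the model always sees feature sets with the same keys.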




Created by [Alejandro Molina](https://www.centrogeo.org.mx/areas-profile/amolina) on April 14, 2022 and adapted by [Gandhi Hernández](https://www.centrogeo.org.mx/areas-profile/ghernandez) on May 15, 2024

## STAGE 1 Preparation


In [None]:
# import libraries
from google.colab import drive
import pandas as pd
import nltk
nltk.download('punkt')
from textblob.classifiers import NaiveBayesClassifier

In [None]:
# mount the drive
drive.mount('/content/drive/')


In [None]:
%cd '/content/drive/MyDrive/CentroGeo/Eventos/IRAP_2024'

In [None]:
# Data from kaggle Sentiment Analysis of Restaurant Reviews
# https://www.kaggle.com/code/apekshakom/sentiment-analysis-of-restaurant-reviews/

# THIS IS  FOR TSV DATA
df = pd.read_csv('Restaurant_Reviews.tsv',
                 sep='\t',
                 engine='python')

In [None]:
df.head()

In [None]:
len(df)

In [None]:
texts = df['Review'].tolist()
classes = df['Liked'].tolist()

In [None]:
texts[0:5]

In [None]:
classes[0:5]

In [None]:
uniques = set(classes)
uniques

# Now we are going to separate the positive and the negative reviews

In [None]:
positive = []
negative = []

# We will iterate through the DataFrame row by row. If the Liked column is 1, we put the review in the positive list; if it is 0, we put the review in the negative list.

In [None]:
for index, row in df.iterrows():
  text = row['Review']
  label = row['Liked']
  if label == 1:
    positive.append(text)
  else:
    negative.append(text)

In [None]:
positive[:5]

In [None]:
negative[:5]

## STAGE 2 Preprocessing


# For the preprocessing stage we need to clean the text, so it is necessary to remove the stopwords.

In [None]:
# From Oracle Text Stoplists for English
# https://docs.oracle.com/cd/B19306_01/text.102/b14218/astopsup.htm#i634475

with open('english_stoplist.txt','r') as f:
    lines = f.readlines()
    stopwords = [l.strip('\n') for l in lines]

In [None]:
stopwords[0:10]

The labeled_tuples function takes a list of sentences and a label as arguments and pairs each sentence with that label, returning a list of (sentence, label) tuples.

In [None]:
# Creating nice labeled tuples
def labeled_tuples(sentences_list, label):
    labeled = [(s, label) for s in sentences_list]
    return labeled

In [None]:
example = labeled_tuples(['this is nice', 'the food was very good'], 'POS')
print(example)

In [None]:
def filter_stopw(sentence, stopwords):
    # lowercase first so that capitalized stopwords ("This", "The") are removed too
    # (this assumes the stoplist entries are lowercase)
    words = sentence.lower().split()
    words_nostops = [w for w in words if w not in stopwords]
    return ' '.join(words_nostops)

def remove_stopw(sentences_list, stopwords):
    # apply filter_stopw to every sentence in the list
    return [filter_stopw(s, stopwords) for s in sentences_list]

In [None]:
example = filter_stopw('this is a test', stopwords)
print(example)

example = remove_stopw(['this is a test', 'Hola a todos mis amigos', "Howdy, my name is Jose"], stopwords)
print(example)

## STAGE 3 Training a Naive Bayes Classifier Model



Bayes' theorem defines a way to compute conditional probabilities.

Let $ P(POSITIVE) $ be the prior probability that a user text is positive. $ P(POSITIVE|x) $ is then the posterior probability that the text is positive given the observation $x$.
Using this theorem, we can estimate the probability distribution of each category (class) from many labeled examples, taking the words that occur in the texts of each class as the observations (variables $x$).

$ P(POSITIVE | x) = \frac{ P(x|POSITIVE) P(POSITIVE) }{ P(x)  } $


Because we are classifying documents, the hypothesis is that the document belongs to a class. The evidence is the occurrence of a word $x$.
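
As a worked illustration of the formula above, the sketch below computes the posterior for a single word from toy counts. All the numbers are invented for the example and are not taken from the restaurant dataset; $P(x)$ is obtained with the law of total probability over the two classes.

```python
# Toy counts (invented for illustration, not taken from the dataset)
n_pos, n_neg = 500, 500      # number of positive / negative reviews
n_pos_with_x = 50            # positive reviews that contain the word x
n_neg_with_x = 5             # negative reviews that contain the word x

p_pos = n_pos / (n_pos + n_neg)          # prior P(POSITIVE)
p_neg = 1 - p_pos                        # prior P(NEGATIVE)
p_x_given_pos = n_pos_with_x / n_pos     # likelihood P(x | POSITIVE)
p_x_given_neg = n_neg_with_x / n_neg     # likelihood P(x | NEGATIVE)

# Evidence P(x): total probability of seeing x in any review
p_x = p_x_given_pos * p_pos + p_x_given_neg * p_neg

# Bayes' theorem: posterior P(POSITIVE | x)
p_pos_given_x = p_x_given_pos * p_pos / p_x
print(round(p_pos_given_x, 3))  # → 0.909
```

With these counts the word appears ten times more often in positive reviews, so observing it moves the probability of the positive class from the prior of 0.5 to about 0.91.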

In [None]:
positive_train = remove_stopw(positive, stopwords)
negative_train = remove_stopw(negative, stopwords)

In [None]:
positive_train

In [None]:
negative_train

In [None]:
positive_train = labeled_tuples(positive_train, 'POS')
negative_train = labeled_tuples(negative_train, 'NEG')

In [None]:
negative_train

In [None]:
train = positive_train+negative_train

In [None]:
print(train[:3],'...',train[-3:])

### Training NaiveBayesClassifier with the data

In [None]:
# train the model with the data
my_nbclassifier = NaiveBayesClassifier(train)

# show the features that the model will use
features = my_nbclassifier.informative_features()
print('model features: ', features )


### Explore the features

In [None]:
# observe the particular features in a particular phrase
phrase = 'literally the worst food ever'
phrase_features = my_nbclassifier.extract_features(phrase)
print('phrase features', phrase_features)


In [None]:
type(phrase_features)

In [None]:
for element in phrase_features:
  value = phrase_features[element]
  if value:
    print(element,' ',value)

## STAGE 4 Testing

In [None]:

test = [('i love this sandwich', 'POS'),
         ('This is an amazing place', 'POS'),
         ('i feel very good about these beers', 'POS'),
         ('a great touch of french style', 'POS'),
         ('i do not like this restaurant', 'NEG'),
         ('horrible restaurant', 'NEG'),
         ('i hated  i will never return', 'NEG'),
         ('the atmosphere is horrible in this place', 'NEG')]

for t, l in test:
    prob_dist = my_nbclassifier.prob_classify(t)
    prob_pos = round(prob_dist.prob("POS"), 3)
    prob_neg = round(prob_dist.prob("NEG"), 3)
    print(prob_pos, prob_neg, t, prob_dist.max())


In [None]:
# evaluation
acc = my_nbclassifier.accuracy(test)
print('model accuracy:', acc)
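
Accuracy here means the fraction of test examples whose predicted label matches the given label. The sketch below spells out that computation; `accuracy`, `toy_predict` and `toy_test` are hypothetical names, and `toy_predict` is a hand-written stand-in rule, not the trained classifier.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (text, label) pairs for which predict(text) == label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in for a trained model: a simple keyword rule
def toy_predict(text):
    words = text.split()
    return "NEG" if "horrible" in words or "not" in words else "POS"

# Toy test set, invented for this example
toy_test = [("i love this sandwich", "POS"),
            ("horrible restaurant", "NEG"),
            ("i hated it", "NEG"),
            ("i do not like it", "NEG")]

print(accuracy(toy_predict, toy_test))  # → 0.75
```

The rule misses "i hated it", so three of the four examples are correct. With only a handful of test sentences, a single mistake changes the score considerably, which is worth keeping in mind when reading the accuracy printed above.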


Read more...

https://www.nltk.org/book/ch06.html

https://www.adamsmith.haus/python/docs/textblob.classifiers



## Now it's your turn!!!
## Download the Train.csv file and open it
## In the label column, write 0 if you think the text is depression-related, or 1 if you think it is not
## Save the labeled file as Train.csv

## Read the Train.csv file and load it into a DataFrame

In [None]:
df = pd.read_csv('/content/drive/MyDrive/IRAP24/Train.csv',
                 sep=',',
                 engine='python')
texts = df['message'].tolist()
classes = df['label'].tolist()
texts[0:5]

In [None]:
classes[0:5]

In [None]:
# texts labeled 0 (depression-related) go into the `positive` list
positive = [t for t, c in zip(texts, classes) if c == 0]

In [None]:
positive[:5]

In [None]:
# texts labeled 1 (not depression-related) go into the `negative` list
negative = [t for t, c in zip(texts, classes) if c == 1]

In [None]:
negative[:5]

## Training a Naive Bayes Classifier Model

In [None]:
positive_train = remove_stopw(positive, stopwords)
negative_train = remove_stopw(negative, stopwords)

positive_train = labeled_tuples(positive_train, 'POS')
negative_train = labeled_tuples(negative_train, 'NEG')

train = positive_train+negative_train

In [None]:
negative_train

In [None]:
print(train[:3],'...',train[-3:])

### Training NaiveBayesClassifier with the data


In [None]:
# train the model with the data
my_nbclassifier = NaiveBayesClassifier(train)

# show the features that the model will use
features = my_nbclassifier.informative_features()
print('model features: ', features )


## Testing

## Read the Test.csv file and load it into a DataFrame. Note that this DataFrame does not have a label column

In [None]:
df = pd.read_csv('Test.csv')
df.head()

## Read the DataFrame row by row, take the text in the message column and pass it to the model for classification. The predicted label (POS or NEG) is appended to the labels list and, finally, the labels list is added as a new column in the DataFrame

In [None]:
labels = []  # list of labels generated by the model
for index, row in df.iterrows():
  message = row['message']
  prob_dist = my_nbclassifier.prob_classify(message)
  prob_pos = round(prob_dist.prob('POS'), 3)
  prob_neg = round(prob_dist.prob('NEG'), 3)
  labels.append(prob_dist.max())

df['labels'] = labels
df.head()

## Write the resulting DataFrame to the labeled_texts.csv file. Upload the file to the shared folder in Google Drive

In [None]:
df.to_csv('labeled_texts.csv')

## Let's try in Spanish

In [1]:
# import libraries
from google.colab import drive
import pandas as pd
import nltk
nltk.download('punkt')
from textblob.classifiers import NaiveBayesClassifier

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# mount the drive
drive.mount('/content/drive/')

In [None]:
%cd '/content/drive/MyDrive/CentroGeo/Eventos/IRAP_2024'

In [6]:
df = pd.read_csv('spanish_tweets_train.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Text | Latitude | Longitude | Sentiment |
|---|---|---|---|---|---|
| 0 | 7638133 | Jajajajaja se pasó la candidata. Ya anda como ... | 25.032632 | -101.166961 | positivo |
| 1 | 11516926 | Vacunados y Felices @DrAngelEf #EsposisDoctore... | 20.506148 | -97.682239 | positivo |
| 2 | 5990312 | Que emoción te quedó super perros la rolita mi... | 20.974213 | -101.44848 | positivo |
| 3 | 332691 | @bonitafeliz100 Buenos días, dios te bendiga h... | 15.772467 | -90.224221 | positivo |
| 4 | 803782 | Acaba de publicar una foto en San Francisco de... | 19.8446 | -90.5368 | positivo |


In [None]:
df['Sentiment'].unique()

In [None]:
df_positivos = df[df.Sentiment == 'positivo']
df_positivos = df_positivos.sample(1000)
len(df_positivos)

In [None]:
df_negativos = df[df.Sentiment == 'negativo']
df_negativos = df_negativos.sample(1000)
len(df_negativos)

In [None]:
df = pd.concat([df_positivos, df_negativos])
len(df)

## Change 'positivo' to POS and 'negativo' to NEG

In [None]:
label = []
positives = []
negatives = []
for index, row in df.iterrows():
  sentiment = row['Sentiment']
  text = row['Text']
  if sentiment == 'positivo':
    label.append('POS')
    positives.append(text)
  else:
    label.append('NEG')
    negatives.append(text)

df['Label'] = label
df = df.drop(['Sentiment'], axis=1)
df.head()

In [None]:
df['Label'].unique()

In [None]:
positives[:5]

In [None]:
negatives[:5]

In [21]:
with open('spanish_stoplist.txt','r') as f:
    lines = f.readlines()
    sp_stopwords = [l.strip('\n') for l in lines]

In [22]:
# Creating nice labeled tuples
def labeled_tuples(sentences_list, label):
    labeled = [(s, label) for s in sentences_list]
    return labeled

In [26]:
def filter_sp_stopw(sentence, sp_stopwords):
    # lowercase first so that capitalized stopwords ("El", "Que") are removed too
    # (this assumes the stoplist entries are lowercase)
    words = sentence.lower().split()
    words_nostops = [w for w in words if w not in sp_stopwords]
    return ' '.join(words_nostops)

def remove_sp_stopw(sentences_list, sp_stopwords):
    # apply filter_sp_stopw to every sentence in the list
    return [filter_sp_stopw(s, sp_stopwords) for s in sentences_list]

In [27]:
positive_train = remove_sp_stopw(positives, sp_stopwords)
negative_train = remove_sp_stopw(negatives, sp_stopwords)

In [28]:
# label the stopword-filtered texts (not the raw `positives` / `negatives` lists)
positive_train = labeled_tuples(positive_train, 'POS')
negative_train = labeled_tuples(negative_train, 'NEG')

In [None]:
positive_train[:5]

In [None]:
negative_train[:5]

In [None]:
train = positive_train + negative_train
len(train)

In [None]:
# train the model with the data
my_nbclassifier = NaiveBayesClassifier(train)

# show the features that the model will use
features = my_nbclassifier.informative_features()
print('model features: ', features )

## Let's label some texts

In [None]:
df_test = pd.read_csv('spanish_tweets_test.csv')
df_test.head()

In [None]:
len(df_test)

In [36]:
df_test = df_test.sample(1000)

In [None]:
%%time
labels = []  # list of labels generated by the model
for index, row in df_test.iterrows():
  message = row['Text']
  print(index)
  prob_dist = my_nbclassifier.prob_classify(message)
  prob_pos = round(prob_dist.prob('POS'), 3)
  prob_neg = round(prob_dist.prob('NEG'), 3)
  labels.append(prob_dist.max())

df_test['labels'] = labels

In [None]:
df_test.head()

## Now, let's make the map

In [None]:
# .copy() lets us add columns later without a SettingWithCopyWarning
df_positives = df_test[df_test.labels=='POS'].copy()
df_positives.head()

In [None]:
len(df_positives)

In [41]:
import geopandas as gpd
import shapely
from shapely import wkt
from shapely.geometry import Polygon, Point
from google.colab import drive

In [None]:
# mount the drive
drive.mount('/content/drive/')

In [None]:
%cd '/content/drive/MyDrive/CentroGeo/Eventos/IRAP_2024'

In [102]:
mexico = gpd.read_file('inegi_estatal/INEGI_Estatal_.shp')

In [None]:
mexico.head()

In [None]:
mexico.plot()

In [None]:
mexico['NOMBRE'].unique()

In [None]:
size = len(mexico)
size

In [107]:
def give_me_the_estate(lat, lon):
    estado = None
    point = Point(lon, lat)
    for i in range(size):
        try:
            pol = mexico['geometry'][i]
            if pol.contains(point):
                estado = mexico['NOMBRE'][i]
                print(estado)
                break
        except Exception:
            # skip rows whose geometry cannot be tested (e.g. a missing geometry)
            pass
    return estado

In [None]:
lat = 21.1156633
lon = -89.766103
give_me_the_estate(lat, lon)

In [None]:
df_positives['Estado'] = df_positives.apply(lambda x: give_me_the_estate(x['Latitude'], x['Longitude']), axis=1)

In [None]:
df_positives.head()

In [None]:
len(df_positives)

In [None]:
df_positives = df_positives.dropna()
len(df_positives)

In [84]:
# ISO 3166-2 codes keyed by the state names found in the 'NOMBRE' column
STATE_KEYS = {
    'Aguascalientes': 'MX-AGU',
    'Baja California': 'MX-BCN',
    'Baja California Sur': 'MX-BCS',
    'Campeche': 'MX-CAM',
    'Chiapas': 'MX-CHP',
    'Chihuahua': 'MX-CHH',
    'Ciudad de México': 'MX-CMX',
    'Coahuila de Zaragoza': 'MX-COA',
    'Colima': 'MX-COL',
    'Durango': 'MX-DUR',
    'Guanajuato': 'MX-GUA',
    'Guerrero': 'MX-GRO',
    'Hidalgo': 'MX-HID',
    'Jalisco': 'MX-JAL',
    'México': 'MX-MEX',
    'Michoacán de Ocampo': 'MX-MIC',
    'Morelos': 'MX-MOR',
    'Nayarit': 'MX-NAY',
    'Nuevo León': 'MX-NLE',
    'Oaxaca': 'MX-OAX',
    'Puebla': 'MX-PUE',
    'Querétaro': 'MX-QUE',
    'Quintana Roo': 'MX-ROO',
    'San Luis Potosí': 'MX-SLP',
    'Sinaloa': 'MX-SIN',
    'Sonora': 'MX-SON',
    'Tabasco': 'MX-TAB',
    'Tamaulipas': 'MX-TAM',
    'Tlaxcala': 'MX-TLA',
    'Veracruz de Ignacio de la Llave': 'MX-VER',
    'Yucatán': 'MX-YUC',
    'Zacatecas': 'MX-ZAC',
}

def put_state_key(abr_ent):
    # returns None when the state name is not in the table
    return STATE_KEYS.get(abr_ent)

In [85]:
df_positives['CVE_ESTADO'] = df_positives.apply(lambda x: put_state_key(x['Estado']), axis=1)

In [None]:
df_positives.head()

In [None]:
df = df_positives.groupby(['CVE_ESTADO']).count()
df.head()

In [88]:
df.reset_index(inplace = True)

In [None]:
df.head()

In [None]:
size = len(df)
size

In [None]:
data = "[ "
data

In [92]:
# build one '{ id: ..., "value": ... }' entry per state and join them with commas
entries = ['{ id: "'+row['CVE_ESTADO']+'", "value": '+str(row['Estado'])+'}'
           for index, row in df.iterrows()]
data += ','.join(entries) + ' ]'

In [None]:
data
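
The string built above is a JavaScript-style array of objects. If the HTML page can also read standard JSON (an assumption, since the page is not shown here), the standard `json` module can produce an equivalent string without manual concatenation. In this sketch `counts` is a hypothetical dictionary standing in for the grouped DataFrame, and its numbers are invented.

```python
import json

# Hypothetical per-state counts standing in for the grouped DataFrame
counts = {"MX-JAL": 12, "MX-NLE": 7}

# One {"id": ..., "value": ...} record per state
records = [{"id": key, "value": value} for key, value in counts.items()]

# json.dumps handles the quoting, commas and brackets
data_json = json.dumps(records)
print(data_json)
```

When the values come from a pandas DataFrame they are usually NumPy integers, which `json.dumps` does not serialize directly, so each value would need a cast such as `int(value)` first.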

## Let's move to the html file