# Airline Twitter

## Data source

Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

A sentiment analysis job about the problems of each major U.S. airline. The Twitter data was scraped from February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the reasons behind the negative ones (such as "late flight" or "rude service").

## Project goal

Find patterns in the data that add value to the business, and build a model that recognizes the sentiment of tweets about U.S. airlines.

## Loading the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("datos/Tweets.csv", sep=",", encoding="latin1")
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
# number of rows and columns in the dataset
df.shape

(14640, 15)

## Data analysis

### Number of tweets per sentiment

In [4]:
import altair as alt

In [5]:
# take the labels from value_counts() so each label always matches its count
sentimientos = df["airline_sentiment"].value_counts()
source = pd.DataFrame({
    'clases': sentimientos.index,
    'tweets': sentimientos.values
})
alt.Chart(source).mark_bar().encode(

    x=alt.X('clases',axis=alt.Axis(
                                    labelAngle=0,
                                    )),
    y='tweets',
    tooltip=[
        alt.Tooltip('tweets:Q', title="Total tweets"),
    ]
).properties(
    width=400,
    height=300
)

### Total number of tweets per airline

In [6]:
aerolineas = df["airline"].value_counts()
source = pd.DataFrame({
    'aerolineas':aerolineas.index,
    'tweets': aerolineas.values
})
alt.Chart(source).mark_bar().encode(

    x=alt.X('aerolineas',axis=alt.Axis(
                                    labelAngle=-45,
                                    )),
    y='tweets',
    tooltip=[
        alt.Tooltip('tweets:Q', title="Total tweets"),
    ]
).properties(
    width=400,
    height=300
)

### Number of tweets per sentiment and airline

In [7]:
df_filter = df[["airline_sentiment", "airline"]]
# group the sentiment counts by airline
serie = df_filter.groupby(["airline","airline_sentiment"])["airline"].count()
df_airline_sent = pd.DataFrame(columns=["airline", "sentiment", "cantidad"])
for air, atr in serie.index:
    df_airline_sent.loc[df_airline_sent.shape[0]] = {
        "airline": air,
        "sentiment": atr,
        "cantidad": serie[air][atr],
    }

In [8]:
gp_chart = alt.Chart(df_airline_sent).mark_bar().encode(
  alt.Column('airline'), 
  alt.X('sentiment', axis=alt.Axis(
                                    labelAngle=-45,
                                    )),
  alt.Y('cantidad', axis=alt.Axis(grid=False)),
  alt.Color('airline'),
    tooltip=[
      alt.Tooltip('cantidad:Q', title="Total tweets"),
  ]
)
  
gp_chart.display()

### Total number of incidents

Negative tweets are categorized by the reason for the complaint (the incident).

In [9]:
incidencias = df["negativereason"].value_counts()
source = pd.DataFrame({
    'incidencias':incidencias.index,
    'tweets': incidencias.values
})
alt.Chart(source).mark_bar().encode(

    x=alt.X('incidencias',axis=alt.Axis(
                                    labelAngle=-45,
                                    )),
    y='tweets',
    tooltip=[
        alt.Tooltip('tweets:Q', title="Total tweets"),
    ]
).properties(
    width=400,
    height=300
)

### Percentage of incidents per airline

In [10]:
df_filter = df[["negativereason", "airline"]]
serie = df_filter.groupby(["airline","negativereason"])["airline"].count()
df_airline_reason = pd.DataFrame(columns=["airline", "negativereason", "cantidad"])
for air, atr in serie.index:
    valor = np.round((serie[air][atr]/serie[air].sum())*100,1)
    df_airline_reason.loc[df_airline_reason.shape[0]] = {
        "airline": air,
        "negativereason": atr,
        "cantidad": valor,
    }

In [11]:
alt.Chart(df_airline_reason).mark_rect().encode(
    x='airline:O',
    y='negativereason:O',
    tooltip=[
        alt.Tooltip('cantidad:Q', title="% issue"),
    ],
    color='cantidad:Q'
).properties(
    width=400,
    height=400
)

### Timeline of the tweets over the 7 days

Searching Google with a simple date filter, “delayed flights after:2015-02-22 before:2015-02-24”, I found articles mentioning storms that forced flight cancellations.

[news article on the cancelled flights](https://www.usatoday.com/story/todayinthesky/2015/02/23/airlines-cancel-1250-flights-as-yet-another-storm-hits-dfw-dallas-texas/23873767/)

In [12]:
df["tweet_created"] = df["tweet_created"].astype("datetime64[ns]")

In [13]:
# build a DataFrame with the date parts used for grouping
df_fecha = pd.DataFrame()
df_fecha["year"] = df["tweet_created"].dt.year
df_fecha["month"] = df["tweet_created"].dt.month
df_fecha["day"] = df["tweet_created"].dt.day
df_fecha["hour"] = df["tweet_created"].dt.hour

In [14]:
# group the tweets by hour
#airline after:2015-02-24 before:2015-02-17
grupo_hora = df_fecha.groupby(["year", "month", "day", "hour"])

In [15]:
# count the tweets per hour
serie_tiempo = grupo_hora["hour"].count()

In [16]:
df_x_time = serie_tiempo.index.to_frame(index=None)

In [17]:
df_fecha_tweets = pd.DataFrame()
df_fecha_tweets["fecha"] = pd.to_datetime(df_x_time)
df_fecha_tweets["cantidad"] = serie_tiempo.values

In [18]:
alt.Chart(df_fecha_tweets).mark_line().encode(
    x='fecha:T',
    y='cantidad:Q'
).properties(
    width=600,
    height=300
)

### Timeline of all tweets on the 18th

In [19]:
# >= keeps the 00:00 hour of the 18th, which a strict > would drop
dia = df_fecha_tweets[(df_fecha_tweets["fecha"] >= "2015-02-18") &
                      (df_fecha_tweets["fecha"] < "2015-02-19")]
alt.Chart(dia).mark_line().encode(
    x='fecha:T',
    y='cantidad:Q'
).properties(
    width=600,
    height=300
)

### Timeline of all tweets grouped by sentiment

In [20]:
df_fecha_sent = df_fecha
df_fecha_sent["sentiment"] = df["airline_sentiment"]
grupo_hora_sent = df_fecha_sent.groupby(["year", "month", "day", "hour", "sentiment"])
serie_tiempo = grupo_hora_sent["sentiment"].count()

In [21]:
df_x_time = serie_tiempo.index.to_frame(index=None)

In [22]:
df_fecha_tweets_sent = pd.DataFrame()
df_fecha_tweets_sent["sentiment"] = df_x_time["sentiment"]
df_fecha_tweets_sent["fecha"] = pd.to_datetime(df_x_time[["year","month","day", "hour"]])
df_fecha_tweets_sent["cantidad"] = serie_tiempo.values

In [23]:
alt.Chart(df_fecha_tweets_sent).mark_line().encode(
    x='fecha:T',
    y='cantidad:Q',
    color='sentiment:N'
).properties(
    width=800,
    height=300
)

### Timeline of tweet sentiment on the 18th

In [24]:
# >= keeps the 00:00 hour of the 18th, which a strict > would drop
dia = df_fecha_tweets_sent[(df_fecha_tweets_sent["fecha"] >= "2015-02-18") &
                           (df_fecha_tweets_sent["fecha"] < "2015-02-19")]
alt.Chart(dia).mark_line().encode(
    x='fecha:T',
    y='cantidad:Q',
    color='sentiment:N'
).properties(
    width=800,
    height=300
)

### Average incidents over the 7 days within a 24-hour window

The chart is meant to show whether some incident peaks at a different time of day than the others, but they all appear to rise and fall at the same hours.
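A minimal sketch of the averaging step on a small made-up table rather than the real tweets: count the tweets per hour and reason, then divide by the number of sampled days. The toy rows, the name `n_dias`, and its value of 2 are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): hour of day and negative reason of six tweets
toy = pd.DataFrame({
    "hour": [8, 8, 8, 9, 9, 20],
    "negativereason": ["Late Flight", "Late Flight", "Lost Luggage",
                       "Late Flight", "Late Flight", "Lost Luggage"],
})
n_dias = 2  # assumed number of sampled days in this toy example

# Count tweets per (hour, reason) and turn the counts into a daily average
promedio = (
    toy.groupby(["hour", "negativereason"])
       .size()
       .div(n_dias)
       .reset_index(name="cantidad")
)
print(promedio)
```

Each row of `promedio` is the average number of tweets for one reason at one hour of the day, which is the shape the line chart expects.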

In [25]:
# .copy() gives an independent DataFrame, so adding a column raises no SettingWithCopyWarning
df_hora_reason = df_fecha[["hour"]].copy()
df_hora_reason["negativereason"] = df["negativereason"]
# keep only the negative tweets
df_hora_reason = df_hora_reason[df["airline_sentiment"] == "negative"]

In [26]:
grupo_hora_reason = df_hora_reason.groupby(["hour", "negativereason"])
serie_tiempo = grupo_hora_reason["hour"].count()
serie_tiempo = serie_tiempo / 7  # 7 sampled days

In [27]:
df_x_time = serie_tiempo.index.to_frame(index=None)

In [28]:
df_fecha_tweets_reason = pd.DataFrame()
df_fecha_tweets_reason["negativereason"] = df_x_time["negativereason"]
df_fecha_tweets_reason["hora"] = df_x_time["hour"]
df_fecha_tweets_reason["cantidad"] = serie_tiempo.values

In [29]:
alt.Chart(df_fecha_tweets_reason).mark_line().encode(
    x='hora',
    y='cantidad:Q',
    color='negativereason:N',
    tooltip=[
        alt.Tooltip('cantidad:Q', title="issue"),
    ]
).properties(
    width=800,
    height=300
)

### Heat map of the number of tweets per state

In [30]:
from vega_datasets import data

In [31]:
pop = data.population_engineers_hurricanes()
pop.head()

Unnamed: 0,state,id,population,engineers,hurricanes
0,Alabama,1,4863300,0.003422,22
1,Alaska,2,741894,0.001591,0
2,Arizona,4,6931071,0.004774,0
3,Arkansas,5,2988248,0.00244,0
4,California,6,39250017,0.007126,0


In [32]:
pop = pop.drop(['population', 'engineers', 'hurricanes'], axis=1)

#### Dataset of cities and states

https://github.com/grammakov/USA-cities-and-states

In [33]:
bp_data = pd.read_csv("datos/us_cities_states_counties.csv", sep="|")
bp_data.head(2)

Unnamed: 0,City,State short,State full,County,City alias
0,Holtsville,NY,New York,SUFFOLK,Internal Revenue Service
1,Holtsville,NY,New York,SUFFOLK,Holtsville


In [34]:
# drop unneeded columns
bp_data = bp_data.drop(['County', 'City alias'], axis=1)
# drop duplicate rows
bp_data = bp_data.drop_duplicates()
# drop rows with a missing city name
bp_data = bp_data[bp_data["City"].notna()]
bp_data.head()

Unnamed: 0,City,State short,State full
0,Holtsville,NY,New York
2,Adjuntas,PR,Puerto Rico
6,Aguada,PR,Puerto Rico
12,Aguadilla,PR,Puerto Rico
45,Maricao,PR,Puerto Rico


#### Extracting the cities with valid names

The city names from the dataset are matched against the location of every tweet that has one.
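When a city name is used directly as a regular expression, it can match inside a longer word, and any special character in the name is read as regex syntax. That may explain rows in the results such as "cheektowaga, ny" listed under Alaska. A hedged sketch of a stricter match follows; the helper name `coincide_ciudad` and the sample strings are hypothetical and not part of the notebook.

```python
import re

def coincide_ciudad(ciudad, localidad):
    """Return True when `ciudad` appears in `localidad` as a whole word."""
    # re.escape treats the city name literally; \b rejects matches inside longer words
    patron = r"\b" + re.escape(ciudad) + r"\b"
    return re.search(patron, localidad, re.IGNORECASE) is not None

# Whole-word match, case-insensitive
print(coincide_ciudad("Boston", "boston, ma"))
# A short name that only appears inside another word is rejected
print(coincide_ciudad("Eek", "cheektowaga, ny"))
```

Word boundaries do not resolve cities that share a name across states; for those, the state abbreviation in the location text would also have to be compared.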

In [35]:
import re
from os import path

In [36]:
ruta_csv = "datos/extraccion_estados.csv"

if path.exists(ruta_csv):
    df_cities = pd.read_csv(ruta_csv)
else:
    location = df["tweet_location"].value_counts()
    index_loc = pd.Series(location.index)
    index_loc=index_loc[index_loc.notna()]

    df_cities = pd.DataFrame(columns=["id_state","state", "cantidad", "localidad"])

    index_loc = list(index_loc)
    for i_city in range(bp_data.shape[0]):
        fila = bp_data.iloc[i_city]
        for loc in index_loc:
            if re.search(fila["City"], loc, re.IGNORECASE):
                df_cities.loc[df_cities.shape[0]] = {
                    "state": fila["State full"],
                    "cantidad": location[loc],
                    "localidad": loc
                }
                index_loc.remove(loc)
                break
    df_cities.to_csv(ruta_csv, index = False)

In [91]:
df_cities

Unnamed: 0,state,cantidad,localidad
0,Puerto Rico,96,"Los Angeles, CA"
1,Puerto Rico,6,Dorado
2,Puerto Rico,13,"Englewood, Florida"
3,Puerto Rico,5,"San Antonio, TX"
4,Puerto Rico,6,"Yauco, Puerto Rico"
...,...,...,...
1668,Alaska,1,"cheektowaga, ny"
1669,Alaska,1,"H-Town Houston, Texas"
1670,Alaska,2,"Western Massachusetts, USA"
1671,Alaska,2,central mass


#### Grouping by state

In [38]:
states_issue = df_cities.groupby("state").aggregate(
    cantidad = ("cantidad", "sum")  # the string name avoids passing the builtin sum
)
states_issue = states_issue.reset_index()
states_issue.head()

Unnamed: 0,state,cantidad
0,Alabama,103
1,Alaska,12
2,Arizona,19
3,Arkansas,58
4,California,156


#### Join between the hurricanes dataset and the tweet-count dataset

The hurricanes dataset has the id of each state but not its cities, so I used the state name to join the two datasets.
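A small sketch of a join on the state name, using made-up frames. The default inner join silently drops any state name that is missing on either side, so the sketch also runs a left join with `indicator=True` to list the rows that found no id. The frame names and counts are assumptions for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): tweet counts per state and state ids
conteos = pd.DataFrame({"state": ["Texas", "Ohio", "Puerto Rico"],
                        "cantidad": [5, 3, 2]})
ids = pd.DataFrame({"state": ["Texas", "Ohio"], "id": [48, 39]})

# Inner join (the default): rows without a matching state name are dropped
unidos = pd.merge(conteos, ids, on="state")

# Left join with indicator=True to see which rows found no match
revision = pd.merge(conteos, ids, on="state", how="left", indicator=True)
sin_id = revision.loc[revision["_merge"] == "left_only", "state"].tolist()

print(unidos)
print(sin_id)
```

Checking `sin_id` before plotting shows how many tweets are lost because their state has no id in the lookup table.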

In [130]:
states_issue_merge = pd.merge(states_issue,pop,on='state')
states_issue_merge.sort_values(by=['cantidad'], ascending=False)[["state", "cantidad"]].iloc[:10]

Unnamed: 0,state,cantidad
31,New York,785
37,Pennsylvania,536
20,Massachusetts,526
18,Maine,403
28,New Hampshire,309
48,West Virginia,257
12,Illinois,238
46,Virginia,229
43,Texas,227
13,Indiana,193


In [40]:
states = alt.topo_feature(data.us_10m.url, 'states')

alt.Chart(states).mark_geoshape().encode(
    color='cantidad:Q',
    tooltip=[
        alt.Tooltip('state:O'),
        alt.Tooltip('cantidad:Q', title="tweets"),
    ],
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(states_issue_merge, 'id', 
                         list(states_issue_merge.columns))
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

### Number of incidents per state

First an incident type is chosen, then the data is grouped by airline, and finally the incidents are counted for each state.
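A rough sketch of these steps on made-up data. The airline and state values are hypothetical, and `MultiIndex.from_product` with `reindex` is only one possible way to add the airline/state combinations that have no tweets, so that every map shows all states.

```python
import pandas as pd

# Toy data (hypothetical): one row per negative tweet of a single incident type
toy = pd.DataFrame({
    "airline": ["Delta", "Delta", "United"],
    "state": ["Texas", "Texas", "Ohio"],
})

# Count tweets per (airline, state)
conteo = toy.groupby(["airline", "state"]).size()

# Build every airline/state combination and fill the missing ones with 0
todas = pd.MultiIndex.from_product(
    [["Delta", "United"], ["Ohio", "Texas"]], names=["airline", "state"]
)
completo = conteo.reindex(todas, fill_value=0).reset_index(name="cantidad")
print(completo)
```

The result has one row per combination, with a count of 0 where an airline received no tweets from a state.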

In [None]:
# choose the incident to plot
incidencias = df["negativereason"].unique()[1:]
select_incidencia = incidencias[2]

In [211]:
# select the relevant columns
df_localidad = df[["negativereason", "airline", "tweet_location"]]
# rename the column so both frames share the join key
df_localidad = df_localidad.rename(columns={'tweet_location': 'localidad'})
# join the two dataframes
df_state_reason = pd.merge(df_localidad, df_cities[["localidad","state"]], on='localidad')
# drop the column that only served as a bridge
df_state_reason = df_state_reason.drop(['localidad'], axis=1)
# drop the tweets that are not negative
df_state_reason = df_state_reason.dropna(subset=["negativereason"])
df_state_reason = pd.merge(df_state_reason, pop, on='state')

# filter the dataframe by the incident to plot
group_issue_state = df_state_reason[df_state_reason["negativereason"] == select_incidencia]
# drop the negativereason column
group_issue_state = group_issue_state[["airline", "state", "id"]]
# group by airline and state, counting the tweets
serie = group_issue_state.groupby(["airline", "state", "id"]).size()
df_aux = serie.index.to_frame(index=None)
df_aux["cantidad"] = serie.values

# the dataframe does not contain every state, so the maps would look incomplete;
# build a dataframe with every airline/state combination for an outer join
lista_aerolineas = df_aux["airline"].unique()
df_sta_aero = pd.DataFrame(columns=["airline", "state", "id"])
for aero in lista_aerolineas:
    for i_state in range(pop.shape[0]):
        state = pop.iloc[i_state]["state"]
        id_ = pop.iloc[i_state]["id"]
        df_sta_aero.loc[df_sta_aero.shape[0]] = {"airline": aero,
                                                 "state": state,
                                                 "id": id_}
# the outer join adds the missing states and airlines
df_all_state_air = pd.merge(df_aux, df_sta_aero, on=['airline', 'state', 'id'], how='outer')
# the new combinations get a count of 0
df_all_state_air["id"] = df_all_state_air["id"].astype(int)
df_all_state_air["cantidad"] = df_all_state_air["cantidad"].fillna(value=0)

In [212]:
alt.Chart(df_all_state_air).mark_geoshape().encode(
    shape='geo:G',
    color='cantidad:Q',
    tooltip=['state:N', 'cantidad:Q'],
    facet=alt.Facet('airline:N', columns=2),
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=states, key='id'),
    as_='geo'
).properties(
    width=300,
    height=175,
).project(
    type='albersUsa'
)

## Sentiment classification model

### Splitting the dataset to validate the model

In [44]:
from sklearn.model_selection import train_test_split

In [45]:
target = "airline_sentiment"

In [46]:
rest, test = train_test_split(df, test_size=0.2, stratify=df[target])
train, val = train_test_split(rest, test_size=0.2, stratify=rest[target])
len(train), len(val), len(test)

(9369, 2343, 2928)

In [47]:
def normalizarTarget(df):
    train_y = df[target]
    train_y = np.where(train_y == "negative", 1, train_y)
    train_y = np.where(train_y == "neutral", 0, train_y)
    train_y = np.where(train_y == "positive", 2, train_y)
    return train_y.astype('int')

In [48]:
#negative: 1
#neutral : 0
#positive: 2

train_y = normalizarTarget(train)
val_y = normalizarTarget(val)
test_y = normalizarTarget(test)
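The same encoding can be sketched with a dictionary and `Series.map` in place of the chained `np.where` calls. The name `codigos` and the toy labels are introduced here for illustration; the codes are the ones listed in the cell above.

```python
import pandas as pd

# Same codes as in the cell above: neutral 0, negative 1, positive 2
codigos = {"neutral": 0, "negative": 1, "positive": 2}

# Toy labels (hypothetical) standing in for df["airline_sentiment"]
etiquetas = pd.Series(["negative", "neutral", "positive", "negative"])

# Look up each label in the dictionary and return a NumPy array
y = etiquetas.map(codigos).to_numpy()
print(y)
```

A label that is missing from the dictionary becomes `NaN` with `map`, which makes an unexpected class easy to spot.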

### Word tokenization

In [49]:
import math
import re
# the text comes in several formats; BeautifulSoup is used to clean it
from bs4 import BeautifulSoup

In [50]:
def tokenize(tweet):
    # strip any HTML markup and decode entities
    tweet = BeautifulSoup(tweet, "lxml").get_text()
    # replace each @mention with a placeholder token
    tweet = re.sub(r"@[A-Za-z0-9]+", '_Entidad_', tweet)
    # remove URLs
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)
    # keep only letters and basic punctuation
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)
    # collapse repeated spaces
    tweet = re.sub(r" +", ' ', tweet)
    return tweet.split()

### Bag of words (word counts)

In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

In [53]:
vectorizador_tweet = CountVectorizer(binary=False, analyzer=tokenize, max_features=5000)

In [54]:
vectorizador_tweet.fit(train["text"])

CountVectorizer(analyzer=<function tokenize at 0x0000021922ABA4C0>,
                max_features=5000)

In [55]:
train_vector = vectorizador_tweet.transform(train["text"])
val_vector = vectorizador_tweet.transform(val["text"])
test_vector = vectorizador_tweet.transform(test["text"])

In [56]:
lr = LogisticRegression(max_iter=1000, class_weight="balanced")

In [57]:
lr.fit(train_vector, train_y)

LogisticRegression(class_weight='balanced', max_iter=1000)

In [58]:
train_pred = lr.predict(train_vector)
val_pred = lr.predict(val_vector)

In [59]:
print(classification_report(train_y, train_pred))

              precision    recall  f1-score   support

           0       0.81      0.95      0.88      1983
           1       0.99      0.91      0.95      5873
           2       0.88      0.97      0.92      1513

    accuracy                           0.93      9369
   macro avg       0.90      0.94      0.92      9369
weighted avg       0.94      0.93      0.93      9369



In [60]:
print(classification_report(val_y, val_pred))

              precision    recall  f1-score   support

           0       0.54      0.69      0.60       496
           1       0.89      0.80      0.84      1469
           2       0.65      0.67      0.66       378

    accuracy                           0.75      2343
   macro avg       0.69      0.72      0.70      2343
weighted avg       0.77      0.75      0.76      2343



### Word Embedding

http://old.tacosdedatos.com/word-to-vec-ilustrado

In [61]:
import multiprocessing

In [62]:
from gensim.models import Word2Vec

In [63]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

In [64]:
#https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
w2v_model = Word2Vec(min_count=3,
                     window=2,
                     vector_size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

In [65]:
train_sentence = [tokenize(train["text"].iloc[i_row]) for i_row in range(train.shape[0])]

In [66]:
w2v_model.build_vocab(train_sentence, progress_per=10000)

In [67]:
w2v_model.train(train_sentence, total_examples=w2v_model.corpus_count, epochs=100, report_delay=1)

(5490530, 16097700)

#### Creating a new dataset with the embedding


The vectors of all the words in a tweet are added up and averaged; the resulting mean vector represents the tweet with all its words.
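A minimal sketch of this averaging with NumPy. The three-dimensional toy embedding and the helper name `vector_tweet` are assumptions for illustration and stand in for the trained Word2Vec vectors.

```python
import numpy as np

# Toy embedding (hypothetical 3-dimensional vectors, not the trained model)
embedding = {
    "flight": np.array([1.0, 0.0, 2.0]),
    "late":   np.array([3.0, 2.0, 0.0]),
}

def vector_tweet(palabras, embedding, num_dim=3):
    # Keep only the words the embedding knows
    vectores = [embedding[p] for p in palabras if p in embedding]
    if not vectores:
        # No known word: fall back to a zero vector
        return np.zeros(num_dim)
    # Element-wise mean of the word vectors
    return np.mean(vectores, axis=0)

print(vector_tweet(["flight", "late", "again"], embedding))
```

Words missing from the embedding ("again" here) are skipped, so they do not pull the mean toward zero.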


In [68]:
def docsVector(embbeding, datset):
    # tokenize every tweet of the dataset
    sentences = [tokenize(texto) for texto in datset["text"]]
    num_dim = embbeding.vector_size
    filas = []
    for lista_palabras in sentences:
        # keep only the vectors of the words known to the embedding
        vectores = [np.round(embbeding[palabra], 5)
                    for palabra in lista_palabras if palabra in embbeding]
        if len(vectores) > 0:
            # the tweet is represented by the mean of its word vectors
            filas.append(np.mean(vectores, axis=0))
        else:
            # no known word: use a zero vector instead of reusing the previous mean
            filas.append(np.zeros(num_dim))
    # build the DataFrame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(filas)

In [69]:
docsVector_train = docsVector(w2v_model.wv, train)
docsVector_val = docsVector(w2v_model.wv, val)
docsVector_test = docsVector(w2v_model.wv, test)

#### Logistic regression

In [72]:
lr = LogisticRegression(max_iter=1000, class_weight="balanced")
lr.fit(docsVector_train, train_y)

LogisticRegression(class_weight='balanced', max_iter=1000)

In [73]:
train_pred = lr.predict(docsVector_train)

In [74]:
accuracy_score(train_y, train_pred)

0.7519479133311986

In [75]:
print(classification_report(train_y, train_pred))

              precision    recall  f1-score   support

           0       0.56      0.70      0.62      1983
           1       0.90      0.77      0.83      5873
           2       0.61      0.75      0.67      1513

    accuracy                           0.75      9369
   macro avg       0.69      0.74      0.71      9369
weighted avg       0.78      0.75      0.76      9369



In [76]:
val_pred = lr.predict(docsVector_val)

In [77]:
accuracy_score(val_y, val_pred)

0.734955185659411

In [78]:
print(classification_report(val_y, val_pred))

              precision    recall  f1-score   support

           0       0.53      0.67      0.59       496
           1       0.88      0.78      0.83      1469
           2       0.58      0.65      0.61       378

    accuracy                           0.73      2343
   macro avg       0.66      0.70      0.68      2343
weighted avg       0.76      0.73      0.74      2343



#### Multilayer neural network

In [79]:
from sklearn.neural_network import MLPClassifier

In [80]:
clf_bk = MLPClassifier(solver='sgd', alpha=1e-5,
                    hidden_layer_sizes=(300,100), max_iter=500, random_state=1)

In [81]:
clf_bk.fit(docsVector_train, train_y)



MLPClassifier(alpha=1e-05, hidden_layer_sizes=(300, 100), max_iter=500,
              random_state=1, solver='sgd')

In [82]:
train_pred = clf_bk.predict(docsVector_train)

In [83]:
accuracy_score(train_y, train_pred)

0.8679688333867008

In [84]:
print(classification_report(train_y, train_pred))

              precision    recall  f1-score   support

           0       0.81      0.72      0.76      1983
           1       0.90      0.95      0.92      5873
           2       0.82      0.75      0.78      1513

    accuracy                           0.87      9369
   macro avg       0.84      0.81      0.82      9369
weighted avg       0.86      0.87      0.87      9369



In [85]:
val_pred = clf_bk.predict(docsVector_val)

In [86]:
accuracy_score(val_y, val_pred)

0.7686726419120785

In [87]:
print(classification_report(val_y, val_pred))

              precision    recall  f1-score   support

           0       0.65      0.49      0.56       496
           1       0.81      0.92      0.86      1469
           2       0.69      0.56      0.62       378

    accuracy                           0.77      2343
   macro avg       0.72      0.66      0.68      2343
weighted avg       0.76      0.77      0.76      2343

