### NOTEBOOKS FOR TESTING AND EXPLORATORY ANALYSIS

In [1]:
import pandas as pd
import os

### 1. Building the corpus

Starting from the individual Reuters files, we build a single .csv file that holds all of the texts. The process is:
1. Iterate over the folders /data/test and /data/training
2. Iterate over each file inside those folders
3. Read the contents of the file
4. Append a row [file id, file contents, subfolder name] to a list

Once all iterations finish, the list is converted into a pandas DataFrame and then written to a .csv file with the three columns listed above.


In [31]:
directorio_base = os.path.join(os.getcwd(), "..")  # Go back up to the RI_PROYECTO_1B directory
# Build the paths with os.path.join so they do not depend on Windows separators
subcarpeta_reuters = [os.path.join(directorio_base, "data", "training"),
                      os.path.join(directorio_base, "data", "test")]

noticias = []

for subcarpeta in subcarpeta_reuters:
    subcarpeta_texto = os.path.basename(subcarpeta)

    # os.scandir yields one DirEntry per item; the context manager closes the iterator
    with os.scandir(subcarpeta) as textos:
        for texto in textos:
            nombre_archivo = texto.name
            print(f"Iterating over text {nombre_archivo} in subfolder {subcarpeta_texto}")
            # errors='ignore' silently drops any bytes that are not valid UTF-8
            with open(texto.path, 'r', encoding='utf-8', errors='ignore') as file:
                contenido = file.read()

            noticias.append([nombre_archivo, contenido, subcarpeta_texto])

Iterating over text 1 in subfolder training
Iterating over text 10 in subfolder training
Iterating over text 100 in subfolder training
Iterating over text 1000 in subfolder training
Iterating over text 10000 in subfolder training
Iterating over text 10002 in subfolder training
Iterating over text 10005 in subfolder training
Iterating over text 10008 in subfolder training
Iterating over text 10011 in subfolder training
Iterating over text 10014 in subfolder training
Iterating over text 10015 in subfolder training
Iterating over text 10018 in subfolder training
Iterating over text 10023 in subfolder training
Iterating over text 10025 in subfolder training
Iterating over text 10027 in subfolder training
Iterating over text 1003 in subfolder training
Iterating over text 10032 in subfolder training
Iterating over text 10035 in subfolder training
Iterating over text 10037 in subfolder training
Iterating over text 10038 in subfolder training
...

In [32]:
df_textos_corpus = pd.DataFrame(noticias, columns=["id", "contenido", "subcarpeta"])

In [35]:
df_textos_corpus.to_csv(os.path.join(directorio_base, "data", "corpus_sin_procesar.csv"), index=False)
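
A quick sanity check on the exported file can catch problems early, for example a row count that does not match the number of files in each folder. The following is a minimal sketch using only the standard library; `contar_por_subcarpeta` is a hypothetical helper that is not part of the notebook, and it assumes the CSV has the `id`, `contenido`, and `subcarpeta` columns created above.

```python
import csv
from collections import Counter


def contar_por_subcarpeta(ruta_csv):
    """Count the rows for each value of the "subcarpeta" column (hypothetical helper)."""
    conteo = Counter()
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which matters here because each "contenido" cell spans many lines
    with open(ruta_csv, newline="", encoding="utf-8") as archivo:
        for fila in csv.DictReader(archivo):
            conteo[fila["subcarpeta"]] += 1
    return conteo
```

Calling it on the path passed to `to_csv` should return one count for `training` and one for `test`, which can be compared against the number of files in each folder.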

### 2. Preprocessing the corpus

We have collected the texts that make up our corpus and combined them into a single .csv file, which is much easier and more efficient to work with. We now preprocess this corpus. The Reuters texts are in English, so we install spaCy and download its small English model:

In the console:
python -m pip install spacy
python -m spacy download en_core_web_sm
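
The Reuters files are hard-wrapped, so every article contains many line breaks. spaCy tokenizes line breaks as whitespace tokens, so they carry over into any string rebuilt by joining lemmas with spaces. One option is to collapse whitespace before passing the text to the model. This is a minimal sketch under that assumption; `normalizar_espacios` is a hypothetical helper, not part of the notebook.

```python
def normalizar_espacios(texto):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(texto.split())


ejemplo = "\nBAHIA COCOA REVIEW\nShowers continued throughout the week in\nthe Bahia cocoa zone\n"
print(normalizar_espacios(ejemplo))
# BAHIA COCOA REVIEW Showers continued throughout the week in the Bahia cocoa zone
```

The trade-off is that paragraph boundaries are lost as well, so this only fits pipelines that treat each article as a flat sequence of tokens.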

In [46]:
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

# The text to lemmatize
texto = """
BAHIA COCOA REVIEW
Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
"""

# Process the text
doc = nlp(texto)

# Lemmatize the text: join the lemma of every token with single spaces
lematizado = " ".join([token.lemma_ for token in doc])

print(lematizado)



 bahia COCOA REVIEW 
 Showers continue throughout the week in 
 the Bahia cocoa zone , alleviate the drought since early 
 January and improve prospect for the come temporao , 
 although normal humidity level have not be restore , 
 Comissaria Smith say in its weekly review . 
 the dry period mean the temporao will be late this year . 
 arrival for the week end February 22 be 155,221 bag 
 of 60 kilo make a cumulative total for the season of 5.93 
 mln against 5.81 at the same stage last year . again it seem 
 that cocoa deliver early on consignment be include in the 
 arrival figure . 
 Comissaria Smith say there be still some doubt as to how 
 much old crop cocoa be still available as harvesting have 
 practically come to an end . with total Bahia crop estimate 
 around 6.4 mln bag and sale stand at almost 6.2 mln there 
 be a few hundred thousand bag still in the hand of farmer , 
 middleman , exporter and processor . 
 there be doubt as to how much of this cocoa would be fit 
 for