#!/usr/bin/env python
# coding: utf-8
## Culture Maps Project
    ## Author: Nicolás Torres-Echeverry
    ## Created: July 2021
    ## Date(last modified): July 20th 2023
    ## Data Cleaning
    ## Working with El Corpus del Español
    
# This notebook creates the pipeline to save data as a dictionary (stored as a JSON file). It does the initial text preprocessing steps for:
    # 1. Topic Models
    # 2. Dynamic Topic Models
    # 3. Word2Vec
    # 4. Diachronic Word Embeddings

        # `1` and `2` require the data in a different format from `3` and `4`. 
 
## Notebook index:
    # 1. Libraries
    # 2. Helper functions
    # 3. Pipeline
    # 4. Save dictionary as json

# 1. Libraries 

In [1]:

import re
import zipfile
import os
import sys
import pandas as pd
import json


# 2. Helper functions

In [2]:
def loadcorpus(corpus_name, corpus_style="text"):
    '''
    Iterates through the zip files in the corpus folder and reads
    them, storing the contents in a dictionary in which each file
    inside a zip archive maps to a list of its lines.
    
    Input:
        corpus_name (str): path to the folder that contains the corpus
        corpus_style (str): substring that a zip file name must contain
                        to be loaded

    Output:
        texts_raw (dict):
            key - name of the enclosing file (folder) inside a zip archive
            value - list of raw lines (bytes) read from that file
    '''
    texts_raw = {}
    for zip_name in os.listdir(corpus_name):
        if corpus_style in zip_name:
            print(zip_name)
            # Close the archive when done; use distinct names for the
            # archive and its members so one does not shadow the other
            with zipfile.ZipFile(os.path.join(corpus_name, zip_name)) as zfile:
                for member in zfile.namelist():
                    with zfile.open(member) as f:
                        texts_raw[member] = list(f)
    return texts_raw
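
To see the shape of the dictionary that `loadcorpus` returns without loading the full corpus, the sketch below builds a tiny zip archive in a temporary folder and reads it back in the same way. The archive name `text_XX-demo.zip`, the member name `XX-B-1.txt`, the sample lines, and the helper `read_zip_lines` are all made up for illustration; they are not part of the corpus or of this notebook.

```python
import os
import tempfile
import zipfile


def read_zip_lines(folder, corpus_style="text"):
    """Map each file inside the matching zip archives to its list of raw lines (bytes)."""
    texts_raw = {}
    for zip_name in sorted(os.listdir(folder)):
        if corpus_style not in zip_name:
            continue
        with zipfile.ZipFile(os.path.join(folder, zip_name)) as zfile:
            for member in zfile.namelist():
                with zfile.open(member) as f:
                    texts_raw[member] = list(f)
    return texts_raw


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical archive with one member holding two one-line documents
    with zipfile.ZipFile(os.path.join(folder, "text_XX-demo.zip"), "w") as zf:
        zf.writestr("XX-B-1.txt", "@@111111 Hola , mundo .\n@@222222 Adiós .\n")
    demo_raw = read_zip_lines(folder)

print(list(demo_raw))                    # member names become the keys
print(type(demo_raw["XX-B-1.txt"][0]))   # lines come back as bytes, not str
print(len(demo_raw["XX-B-1.txt"]))       # one list entry per line
```

Because the lines are read in binary mode, they still need to be decoded before any text processing.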

In [3]:
def clean_raw_text(raw_texts):
    '''
    Decodes a list of byte strings and removes some characters from each
    Characters removed (regular expression class): [¡!@#$:);,¿?&]
    Notice that I don't remove dots (.) to be able to mark sentences
    
    Input:
        raw_texts (list of bytes): undecoded texts, one per list element

    Output:
        clean_texts (list of str): list with the clean texts;
                        texts that fail to decode are skipped
    '''
    clean_texts = []
    for text in raw_texts:
        try:
            text = text.decode("utf-8")
            text = re.sub('[¡!@#$:);,¿?&]', '', text)
            clean_texts.append(text)
        except AttributeError:
            print("ERROR CLEANING", "Text:")
            print(text)
            continue
        except UnicodeDecodeError:
            print("Unicode Error, Skip")
            continue
    return clean_texts
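
As a quick check of what the character class in `clean_raw_text` removes, the sketch below applies the same decode and `re.sub` steps to a single made-up line (the ID and the sentence are hypothetical). Note that the class contains `)` but not `(`, so opening parentheses survive, and that removing `@` leaves the numeric ID at the start of the line.

```python
import re

# Hypothetical one-line document, as raw bytes, in the corpus format
raw_line = "@@123456 ¿Qué tal ? ¡Bien ! ( eso creo ) , gracias .\n".encode("utf-8")

decoded = raw_line.decode("utf-8")
cleaned = re.sub('[¡!@#$:);,¿?&]', '', decoded)

# repr() makes the leftover double spaces and the trailing newline visible
print(repr(cleaned))
```

The removed characters leave their surrounding spaces behind, which is why the cleaned text contains runs of two or more spaces.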


In [4]:
def dic_match_key_text(raw_dic_texts, max_texts):
    '''
    Creates dictionary of strings to match text and sources
    
    Input:
        raw_dic_texts (dict):
            key - name of the enclosing folder
            value - list of byte strings, one per website entry
        
        max_texts (int): approximate limit on the number of texts
                        included in the dictionary; it is checked once
                        per key, so the result can exceed it

    Output:
        websites_text (dict):
            key - id that matches the text and the source
            value - (str) text
    '''
    websites_text = {}
    
    for key in raw_dic_texts:
        if len(websites_text) > max_texts:
            break
        texts_for_key = clean_raw_text(raw_dic_texts[key])
        for one_text in texts_for_key:
            try:
                text_id = one_text.split()[0]
            except IndexError:
                # Blank line: there is no ID to use as a key
                print('IndexError')
                continue
            # Slice off the ID by its actual length rather than a fixed
            # six characters, so longer IDs do not leak digits into the
            # text (assumes the ID starts the line)
            websites_text[text_id] = one_text[len(text_id):]
    return websites_text
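
A fixed-width slice such as `one_text[6:]` only strips the ID cleanly when the ID is exactly six characters long. The sketch below contrasts it with slicing by the actual ID length, using two made-up cleaned lines. Whether the corpus really contains IDs longer than six digits is not checked here.

```python
# Hypothetical cleaned lines: one with a six-digit ID, one with a seven-digit ID
samples = ["389981 Crítica de cine .", "1234567 Otra entrada ."]

results = {}
for one_text in samples:
    text_id = one_text.split()[0]
    fixed_slice = one_text[6:]             # assumes every ID is six characters long
    by_length = one_text[len(text_id):]    # adapts to the actual ID length
    results[text_id] = (fixed_slice, by_length)
    print(text_id, repr(fixed_slice), repr(by_length))
```

For the six-digit ID both slices agree; for the seven-digit ID the fixed slice keeps the last digit of the ID in the text.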

# 3. Pipeline - with some Tests

In [5]:
# loads corpus as a dictionary [key: folder name; value: text]
raw_span = loadcorpus("../SPAN")

text_AR-tez.zip
text_BO-teh.zip
text_CL-wts.zip
text_CO-pem.zip
text_CR-jfy.zip
text_CU-rag.zip
text_DO-egn.zip
text_EC-jss.zip
text_ES-sbo.zip
text_GT-miv.zip
text_HN-paj.zip
text_MX-vzo.zip
text_NI-exu.zip
text_PA-qlz.zip
text_PE-tae.zip
text_PR-epz.zip
text_PY-ukd.zip
text_SV-xkl.zip


    ### This is correct: it printed the names of the 21 zip files in the folder.

In [5]:
#checking data type

type(raw_span)

dict

In [6]:
#checking number of keys (aka folders)

len(list(raw_span.keys()))

780

The raw_span dictionary that maps folders to texts has 780 keys (aka folders).

In [7]:
#extracting one key from raw_span

list(raw_span.keys())[1]

'AR-B-01.txt'

In [8]:
# seeing what is in the key extracted

raw_span['EC-B-1.txt']

[b'@@389981 Cr\xc3\xadtica : Indiana Jones y el Reino de la Calavera de Cristal En primer t\xc3\xa9rmino , una consideraci\xc3\xb3n juguetona . Supongo que me lo permiten . Todo ser pensante que escribe en alg\xc3\xban blog , bit\xc3\xa1cora , peri\xc3\xb3dico o revista ha estado esperando este mismo instante desde hace muchos meses : el de poner se a escribir sobre la \xc3\xbaltima pel\xc3\xadcula de Indiana Jones . Tocar con las teclas algo casi legendario pone mucho y m\xc3\xa1s si tenemos entre manos una reliquia de el pasado , una joya de la nostalgia cin\xc3\xa9fila y un personaje tan m\xc3\xadtico como admirado . Una vez llegado el estreno y el d\xc3\xada en que uno se dirige a ver la de forma fren\xc3\xa9tica ( d\xc3\xa9nse cuenta que las mayor\xc3\xada de cr\xc3\xadticas ya estar\xc3\xa1n colgadas ) , llega el momento refocilante : esta es mi cr\xc3\xadtica sobre Indiana Jones . Tremendo instante . Me he recreado y lo s\xc3\xa9 pero es que parece el cine tan aburrido en sus pr

In [9]:
# checking the type of the first element of the list that the key maps to

print(type(raw_span['EC-B-1.txt'][0]))

# checking the length of one list
print(len(raw_span['EC-B-1.txt']))

<class 'bytes'>
3615


    ### The key maps to a list of bytes objects. Each element in the list is a website. 
    ### When we extract the list for key 'EC-B-1.txt' => we get 3,615 websites

This means that if we iterate over the keys, count the length of each list, and add the counts up => we get the total number of websites.

In [10]:
#checking total number of websites in dictionary

c = 0 
for key in raw_span:
    n = len(raw_span[key])
    c = c + n
print ('total number of websites =', c) 

total number of websites = 2086902


    ### The total number of websites == 2,086,902
    ### This is very near the number from the sources files == 2,096,914
    ### The difference is around 10,000 missing websites 
    ### But these seem to be missing in the raw_data. IT DOES NOT SEEM TO BE A BUG IN THE CODE.
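
The size of the gap can be worked out directly from the two counts quoted above. This is only arithmetic on those two numbers; the variable names are illustrative.

```python
loaded = 2_086_902      # websites counted in raw_span
reported = 2_096_914    # number from the sources files

missing = reported - loaded
share = missing / reported

print(missing)
print(f"{share:.2%} of the reported total")
```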

In [11]:
# Test - trying the clean text function over the list extracted with the key above (one key)
clean_text = clean_raw_text(raw_span['EC-B-1.txt'])
clean_text

NameError: name 'clean_raw_text' is not defined

In [None]:
#Turning a FRACTION OF the dictionary into a json file 

dict_span_20k = dic_match_key_text(raw_span, 20000)

In [None]:
len(dict_span_20k)

21593

    ### Notice that the length of the dictionary is over 20,000 documents. This is ok. 
    ### The function above opens each folder, which contains thousands of websites (documents), and adds all of them. 
    ### The limit is only checked between folders, so the function stops once a folder brings the total above 20,000.
    ### So the number of documents included won't be exactly 20k.
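
The overshoot can be reproduced on a toy dictionary. The sketch below mirrors the loop structure of `dic_match_key_text` (limit checked once per key, then every text under that key added), with made-up keys, IDs, and a small limit standing in for the real data.

```python
# Toy stand-in for raw_span: three "folders" with four documents each
toy = {
    "a.txt": ["1 x", "2 x", "3 x", "4 x"],
    "b.txt": ["5 x", "6 x", "7 x", "8 x"],
    "c.txt": ["9 x", "10 x", "11 x", "12 x"],
}
max_texts = 5

collected = {}
for key in toy:
    # The limit is only checked here, before a whole folder is added
    if len(collected) > max_texts:
        break
    for one_text in toy[key]:
        text_id, _, body = one_text.partition(" ")
        collected[text_id] = body

print(len(collected))
```

The second folder is still added in full because the dictionary held fewer than `max_texts` entries when the check ran, so the final size lands above the limit.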

In [None]:
# FULL DICTIONARY

dict_span_full = dic_match_key_text(raw_span, 20000000)

In [None]:
len(dict_span_full)

2086608

# 4. Saving dictionaries as JSON files

In [21]:
# writing json

# with open("dict_span_20k.json", "w") as write_file:
#     json.dump(dict_span_20k, write_file)

# FULL

with open("dict_span_full.json", "w") as write_file:
    json.dump(dict_span_full, write_file)
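
A small round-trip check, sketched on a made-up two-entry dictionary and a temporary file rather than on the real output. The `json.dump` call above with default settings also round-trips correctly. The `ensure_ascii=False` and explicit `encoding="utf-8"` arguments shown here are an optional variation that keeps accented characters readable in the file itself.

```python
import json
import os
import tempfile

# Hypothetical dictionary in the same {id: text} shape as dict_span_full
sample = {"111111": " Crítica de una película .", "222222": " Otra reseña ."}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")
    # ensure_ascii=False writes accented characters as-is instead of \u escapes;
    # the explicit encoding avoids depending on the platform default
    with open(path, "w", encoding="utf-8") as write_file:
        json.dump(sample, write_file, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as read_file:
        restored = json.load(read_file)

print(restored == sample)
```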


In [16]:
# testing reading json
# with open("dict_span_20k.json", "r") as read_file:
#     test_data_20K = json.load(read_file)


In [17]:
#checking length.

# len(test_data_20K)

21593

In [18]:
#checking what is in one key. 
# test_data_20K['389981'] 

' Crítica  Indiana Jones y el Reino de la Calavera de Cristal En primer término  una consideración juguetona . Supongo que me lo permiten . Todo ser pensante que escribe en algún blog  bitácora  periódico o revista ha estado esperando este mismo instante desde hace muchos meses  el de poner se a escribir sobre la última película de Indiana Jones . Tocar con las teclas algo casi legendario pone mucho y más si tenemos entre manos una reliquia de el pasado  una joya de la nostalgia cinéfila y un personaje tan mítico como admirado . Una vez llegado el estreno y el día en que uno se dirige a ver la de forma frenética ( dénse cuenta que las mayoría de críticas ya estarán colgadas   llega el momento refocilante  esta es mi crítica sobre Indiana Jones . Tremendo instante . Me he recreado y lo sé pero es que parece el cine tan aburrido en sus propios refritos que cuando se resucita algo jugoso a uno le sale la vena lírica . Y versar sobre algo que ha puesto en           a la crítica que debe em

################
# Everything looks good.
# Closing.
################