# **SentimentArcs (Part 1): Text Preprocessing**

```
Jon Chun
12 Jun 2021: Started
04 Mar 2022: Last Update
```

Welcome! 

SentimentArcs is a methodology and software framework for analyzing narrative in text. Virtually all long text contains narrative elements...(TODO: Insert excerpts from Paper Abstract/Intro Sections here)

***

* **SentimentArcs: Cloning the Github repository to your gDrive**

If this is the first time you are using SentimentArcs, you will need to copy the software from our GitHub repository. The default recommended gDrive path is './gdrive/MyDrive/research/sentiment_arcs/'.

The first time you run this notebook and connect your Google gDrive, it will allow you to specify the path to your SentimentArcs subdirectory. If that path does not exist, this notebook will clone the SentimentArcs GitHub repository to your gDrive at the path you specify.


***

* **NovelText: A Reference Corpus of 24 Diverse Novels**

SentimentArcs comes with a carefully curated reference corpus of novels that illustrates the diachronic sentiment characteristics unique to long-form fictional narratives. This corpus of 24 diverse novels also provides a baseline for exploring and comparing new novels analyzed with SentimentArcs.

***

* **Preparing New Novels: Formatting and adding to subdirectory**

To analyze new novels with SentimentArcs, the body of the text should consist of plain text organized into blocks separated by two newlines, which appear as a single blank line between blocks. These blocks are usually paragraphs but can also include title headers or separate lines of dialog or quotes. Please reference any of the 24 novels in the NovelText corpus for examples of this expected format.
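
As a rough illustration of this block format, the sketch below splits a text on blank lines using only the standard library. The function name `split_into_blocks` and the sample text are hypothetical and are not part of SentimentArcs.

```python
import re

def split_into_blocks(text):
    """Split plain text into blocks separated by one or more blank lines."""
    # Normalize Windows line endings so '\r\n\r\n' behaves like '\n\n'
    text = text.replace('\r\n', '\n')
    # A "blank" line may still hold stray spaces or tabs, so allow whitespace between newlines
    blocks = re.split(r'\n\s*\n', text)
    # Trim each block and drop any that are empty
    return [block.strip() for block in blocks if block.strip()]

sample = 'CHAPTER 1\n\nIt was a dark night.\nThe wind howled.\n\n"Who is there?" she asked.\n'
for block in split_into_blocks(sample):
    print(repr(block))
```

A single newline inside a block (as in the second block of the sample) is preserved, so hard-wrapped paragraphs stay together.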

Once the new novel is correctly formatted as a plain text file, it should follow this standard file naming convention:

[first letter of first name]+[full lastname]_[abbreviated book title].txt

Examples:

* fdouglass_narrativelifeofaslave.txt
* fscottfitzgerald_thegreatgatsby.txt
* vwoolf_mrsdalloway.txt
* homer-ewilson_odyssey.txt (trans. E.Wilson)
* mproust-mtreharne_3guermantesway.txt (Book 3, trans. M.Treharne)
* staugustine_confessions9end.txt (Up to and incl. Book 9)

Note the optional author suffix (-translator) and optional title suffix (-selected chapters/books)
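
A filename can be checked against this convention before it is added to the corpus. The regular expression below is an assumption inferred from the examples above (lowercase letters and digits, an optional `-translator` suffix on the author, one underscore, then the title); it is not taken from the SentimentArcs code.

```python
import re

# Assumed pattern: author[-translator]_title.txt, lowercase letters and digits only
FILENAME_RE = re.compile(r'^[a-z0-9]+(?:-[a-z0-9]+)?_[a-z0-9]+\.txt$')

def is_valid_novel_filename(fname):
    """Return True if fname follows the (assumed) naming convention."""
    return FILENAME_RE.match(fname) is not None

candidates = [
    'vwoolf_mrsdalloway.txt',
    'homer-ewilson_odyssey.txt',
    'mproust-mtreharne_3guermantesway.txt',
    'V Woolf - Mrs Dalloway.txt',   # spaces and capitals: rejected
]
for fname in candidates:
    print(f'{fname:40} {is_valid_novel_filename(fname)}')
```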

***

* **Adding New Novels: Add file to subdirectory and Update this Notebook**

Once you have a cleaned text file named according to the standard rule above, you must move that file to the subdirectory of all input novels and update the global variable in this notebook that defines which novels to analyze.

First, copy your cleaned text file to the subdirectory containing all novels read by this notebook. This subdirectory is defined by the program variable 'subdir_novels' with the default value './in1_novels/'.

Second, update the program variable 'novels_dt'. This is a Dictionary data structure that follows the pattern below:
```
novels_dt = {
  'cdickens_achristmascarol':['A Christmas Carol by Charles Dickens ',1843,1399],
```
Where the first string (the dictionary key) must match the filename root without the '.txt' suffix (e.g. cdickens_achristmascarol). The Dictionary value after the ':' is a list of three elements:

* A nicely formatted string of the form '(title) by (full first and last name of author)' that should be a human friendly string used to label plots and saved files.

* The (publication year) and the (sentence count). Both are optional, but should have placeholder string '0' if unknown. These are intended for future reference and analytics.

* Your future self will thank you if you insert new novels into the 'novels_dt' in alphabetic order for faster and more accurate reference.
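
The two steps above can be double-checked with a short script. This is a minimal sketch: the helper names `find_missing_files` and `keys_are_sorted` and the second dictionary entry are hypothetical, and the integer `0` is used here as the placeholder for unknown values.

```python
import os

novels_dt = {
    'cdickens_achristmascarol': ['A Christmas Carol by Charles Dickens', 1843, 1399],
    # Hypothetical new entry: 0 marks an unknown publication year / sentence count
    'vwoolf_mrsdalloway': ['Mrs Dalloway by Virginia Woolf', 0, 0],
}

def find_missing_files(novels_dt, subdir_novels='./in1_novels/'):
    """Return the keys that have no matching '<key>.txt' file in subdir_novels."""
    return [key for key in sorted(novels_dt)
            if not os.path.isfile(os.path.join(subdir_novels, f'{key}.txt'))]

def keys_are_sorted(novels_dt):
    """Return True if the dictionary keys were inserted in alphabetical order."""
    keys = list(novels_dt)
    return keys == sorted(keys)

print('Keys in alphabetical order:', keys_are_sorted(novels_dt))
print('Keys without a text file:  ', find_missing_files(novels_dt))
```

An empty list from `find_missing_files` means every dictionary key has a corresponding text file in the novels subdirectory.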

***

* **How to Execute SentimentArcs Notebooks:**

This is a Jupyter Notebook created to run on Google's free Colab service using only a browser and your existing Google email account. We chose Google Colab because it is relatively fast, free, easy to use and makes collaboration as simple as web browsing.

A few reminders about using Jupyter Notebooks in general and SentimentArcs in particular:

* All cells must be run ***in order*** as later code cells often depend upon the output of earlier code cells

* ***Cells that take more time to execute*** (> 1 min) usually begin with *%%time* which outputs the *total execution time* of the last run.  This timing output is deleted and recalculated each time the code cell is executed.

* **[OPTIONAL]** at the top of a cell indicates you *may* change a setting in that cell to customize behavior.

* **[CUSTOMIZE]** at the top of a cell indicates you *must* change a setting in that cell.

* **[RESTART REQUIRED]** at the top of a cell indicates you *may* see a *[RESTART REQUIRED] button* at the end of the output. *If you see this button, you must select [Runtime]->[Restart Runtime] from the top menubar.*

* **[INPUT REQUIRED]** at the top of a cell indicates you will be required to take some action for execution to proceed, usually by clicking a button or entering the response to a prompt.

In code cells these tags appear as the top comment: the prefix `# [OPTIONAL]:` means you can change a setting to customize behavior, while the prefix `# [CUSTOMIZE]:` means you MUST set or change a setting.

* SentimentArcs divides workflow into a series of chronological Jupyter Notebooks that must be run in order. Here is an overview of the workflow:

***

**SentimentArcs Notebooks Workflow**
1. Notebook #1: Preprocess Text
2. Notebook #2: Compute Sentiment Values (Simple Models/CPUs)
3. Notebook #3: Compute Sentiment Values (Complex Models/GPUs)
4. Notebook #4: Combine all Sentiment Values, perform Time Series analysis, and extract Crux points and surrounding text

If you are unfamiliar with setting up and using Google Colab or Jupyter Notebooks, here is a series of resources to quickly bring you up to speed. If you are using SentimentArcs with the Cambridge University Press Elements textbook, there is also a series of videos by Profs. Elkins and Chun stepping you through these notebooks.

***

**Additional Resources and Tutorials**


**Google Colab and Jupyter Resources:**

* Coming...
* [IPython, Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html) 

**Cambridge University Press Videos:**

* Coming...




# **[STEP 1] Manual Configuration/Setup**



## (Popups) Connect Google gDrive

In [1]:
# [INPUT REQUIRED]: Authorize access to Google gDrive

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  # The google.colab package is only available inside a Colab runtime
  IN_COLAB = False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive', force_remount=True)
else:
  print("Not running in Google Colab: skipping the Google gDrive mount")

Attempting to attach your Google gDrive to this Colab Jupyter Notebook
Mounted at /gdrive


## (3 Inputs) Define Directory Tree

In [104]:
# [CUSTOMIZE]: Set Path_to_SentimentArcs below (the target of the '%cd' change-directory command)
#              to match the full path to your gDrive subdirectory, which should be the
#              root directory cloned from the SentimentArcs GitHub repo.

# NOTE: Make sure this subdirectory already exists and there are
#       no typos, spaces or illegal characters (e.g. periods) in the full path after %cd

# NOTE: To be safe, every directory name in the path should begin with an upper or lowercase
#         letter and contain only letters, numbers and underscores ('_').
#         Make sure your full path after %cd obeys this constraint or errors may appear.

# #@markdown **Instructions**

# #@markdown Set Directory and Corpus names:
# #@markdown <li> Set <b>Path_to_SentimentArcs</b> to the project root in your **GDrive folder**
# #@markdown <li> Set <b>Corpus_Genre</b> = [novels, finance, social_media]
# #@markdown <li> <b>Corpus_Type</b> = [reference_corpus, new_corpus]
# #@markdown <li> <b>Corpus_Number</b> = [1-20] (id number if a new_corpus)

#@markdown <hr>

# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/sentimentarcs_notebooks/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}


#@markdown Set this to the project root in your <b>GDrive folder</b>
#@markdown <br> (e.g. /<wbr><b>gdrive/MyDrive/research/sentiment_arcs/</b>)

#@markdown <hr>

#@markdown **Which type of texts are you cleaning?** \

Corpus_Genre = "finance" #@param ["novels", "social_media", "finance"]

# Corpus_Type = "reference" #@param ["new", "reference"]
Corpus_Type = "reference" #@param ["new", "reference"]


Corpus_Number = 2 #@param {type:"slider", min:0, max:10, step:1}


#@markdown Put in the corresponding Subdirectory under **./text_raw**:
#@markdown <li> All Texts as clean <b>plaintext *.txt</b> files 
#@markdown <li> A <b>YAML Configuration File</b> describing each Text

#@markdown Please verify the required textfiles and YAML file exist in the correct subdirectories before continuing.

print('Current Working Directory:')
%cd $Path_to_SentimentArcs

print('\n')

if Corpus_Type == 'reference':
  SUBDIR_TEXT_RAW = f'text_raw_{Corpus_Genre}_reference'
  SUBDIR_TEXT_CLEAN = f'text_clean_{Corpus_Genre}_reference'
else:
  SUBDIR_TEXT_RAW = f'text_raw_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}/'
  SUBDIR_TEXT_CLEAN = f'text_clean_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}/'

PATH_TEXT_RAW = f'./text_raw/{SUBDIR_TEXT_RAW}'
PATH_TEXT_CLEAN = f'./text_clean/{SUBDIR_TEXT_CLEAN}'

# TODO: Clean up
SUBDIR_TEXT_CLEAN = PATH_TEXT_CLEAN

print(f'SUBDIR_TEXT_RAW:\n  [{SUBDIR_TEXT_RAW}]')
print(f'PATH_TEXT_RAW:\n  [{PATH_TEXT_RAW}]')

print(f'SUBDIR_TEXT_CLEAN:\n  [{SUBDIR_TEXT_CLEAN}]')
print(f'PATH_TEXT_CLEAN:\n  [{PATH_TEXT_CLEAN}]')

Current Working Directory:
/gdrive/MyDrive/sentimentarcs_notebooks


SUBDIR_TEXT_RAW:
  [text_raw_finance_reference]
PATH_TEXT_RAW:
  [./text_raw/text_raw_finance_reference]
SUBDIR_TEXT_CLEAN:
  [./text_clean/text_clean_finance_reference]
PATH_TEXT_CLEAN:
  [./text_clean/text_clean_finance_reference]
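
The form above asks you to verify that the text files and the YAML file exist before continuing. A minimal sketch of such a check follows; the helper name `check_corpus_dir` is hypothetical and the `.yaml` extension is assumed.

```python
import glob
import os

def check_corpus_dir(path_text_raw):
    """Return (txt_files, yaml_files) found in path_text_raw; raise if the directory is missing."""
    if not os.path.isdir(path_text_raw):
        raise FileNotFoundError(f'Directory not found: {path_text_raw}')
    txt_files = sorted(glob.glob(os.path.join(path_text_raw, '*.txt')))
    yaml_files = sorted(glob.glob(os.path.join(path_text_raw, '*.yaml')))
    return txt_files, yaml_files

# Example using the path printed above (adjust to your own setup)
try:
    txt_files, yaml_files = check_corpus_dir('./text_raw/text_raw_finance_reference')
    print(f'{len(txt_files)} text file(s), {len(yaml_files)} YAML file(s)')
except FileNotFoundError as err:
    print(err)
```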


In [3]:
# Add PATH for ./utils subdirectory

import sys
import os

!python --version

print('\n')

PATH_UTILS = f'{Path_to_SentimentArcs}utils'
PATH_UTILS

sys.path.append(PATH_UTILS)

print('Contents of Subdirectory [./sentiment_arcs/utils/]\n')
!ls $PATH_UTILS

# More Specific than PATH for searching libraries
# !echo $PYTHONPATH

Python 3.7.13


Contents of Subdirectory [./sentiment_arcs/utils/]

config_matplotlib.py   global_constants.py    sentiment_arcs_config.py
config_seaborn.py      global_vars.py	      set_globals.py
file_utils.py	       __init__.py	      subdir_constants.py
get_fullpath.py        __pycache__	      test.py
get_model_families.py  read_yaml.py	      text_cleaners_new.py
get_sentimentr.R       sa_config_20220404.py  text_cleaners.py
get_sentiments.py      sa_config.py
get_subdirs.py	       sentiment_analysis.py
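
Building `PATH_UTILS` with an f-string works only while `Path_to_SentimentArcs` ends in a slash. The sketch below shows one way to make the step tolerant of either form and safe to re-run; the helper name `add_to_sys_path` is hypothetical.

```python
import os
import sys

def add_to_sys_path(root_dir, subdir='utils'):
    """Append root_dir/subdir to sys.path once and return the joined path."""
    # os.path.join inserts a separator only when root_dir does not already end with one
    path = os.path.join(root_dir, subdir)
    # Re-running the cell should not add duplicate entries
    if path not in sys.path:
        sys.path.append(path)
    return path

print(add_to_sys_path('/gdrive/MyDrive/sentimentarcs_notebooks/'))
```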


In [4]:
# Review Global Variables and set the first few

import global_vars

global_vars.SUBDIR_SENTIMENTARCS = Path_to_SentimentArcs
global_vars.Corpus_Genre = Corpus_Genre
global_vars.Corpus_Type = Corpus_Type
global_vars.Corpus_Number = Corpus_Number

global_vars.SUBDIR_TEXT_RAW = SUBDIR_TEXT_RAW
global_vars.PATH_TEXT_RAW = PATH_TEXT_RAW

dir(global_vars)

['Corpus_Genre',
 'Corpus_Number',
 'Corpus_Type',
 'FNAME_SENTIMENT_RAW',
 'MIN_PARAG_LEN',
 'MIN_SENT_LEN',
 'NotebookModels',
 'PATH_TEXT_RAW',
 'PATH_TEXT_RAW_CORPUS',
 'SLANG_DT',
 'STOPWORDS_ADD_EN',
 'STOPWORDS_DEL_EN',
 'SUBDIR_CRUXES',
 'SUBDIR_DATA',
 'SUBDIR_GRAPHS',
 'SUBDIR_SENTIMENTARCS',
 'SUBDIR_SENTIMENT_CLEAN',
 'SUBDIR_SENTIMENT_RAW',
 'SUBDIR_TEXT_CLEAN',
 'SUBDIR_TEXT_RAW',
 'SUBDIR_TIMESERIES_CLEAN',
 'SUBDIR_TIMESERIES_RAW',
 'SUBDIR_UTILS',
 'TEST_SENTENCES_LS',
 'TEST_WORDS_LS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'corpus_texts_dt',
 'corpus_titles_dt',
 'corpus_titles_ls',
 'lexicons_dt',
 'model_titles_dt',
 'models_ensemble_dt']

# **[STEP 2] Automatic Configuration/Setup**

## Custom Libraries & Define Globals

In [51]:
# Define Global Dict to hold cleaned Texts

global_vars.corpus_texts_dt = {}

In [9]:
!ls utils/*.py

utils/config_matplotlib.py   utils/read_yaml.py
utils/config_seaborn.py      utils/sa_config_20220404.py
utils/file_utils.py	     utils/sa_config.py
utils/get_fullpath.py	     utils/sentiment_analysis.py
utils/get_model_families.py  utils/sentiment_arcs_config.py
utils/get_sentiments.py      utils/set_globals.py
utils/get_subdirs.py	     utils/subdir_constants.py
utils/global_constants.py    utils/test.py
utils/global_vars.py	     utils/text_cleaners_new.py
utils/__init__.py	     utils/text_cleaners.py


In [10]:
!head -n 40 ./utils/sa_config.py


import global_vars

def get_subdirs(SA_root,Corpus_Genre, Corpus_Type, Corpus_Number, NotebookModels):
    '''
    Given a two strings: Corpus, Text_type
    Set all global SUB/DIR constants
    '''

    # NotebookModels indicates which notebook is currently running that imported this get_subdirs() function
    if global_vars.NotebookModels == 'syuzhetr2sentimentr':
        global_vars.FNAME_SENTIMENT_RAW = f'sentiment_raw_{Corpus_Genre}_{Corpus_Type}_syuzhetr2sentimentr.json'
    elif NotebookModels == 'lex2ml':
        global_vars.FNAME_SENTIMENT_RAW = f'sentiment_raw_{Corpus_Genre}_{Corpus_Type}_lex2ml.json'
    elif NotebookModels == 'dnn2transformers':
        global_vars.FNAME_SENTIMENT_RAW = f'sentiment_raw_{Corpus_Genre}_{Corpus_Type}_dnn2transformers.json'
    elif NotebookModels == 'none':
        global_vars.FNAME_SENTIMENT_RAW = f'[NONE]'
    else:
        print(f'ERROR: Illegal value for NotebookModels: {global_vars.NotebookModels}')
        return

    # Define a univers
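
The if/elif chain in the excerpt above maps each notebook family to one filename pattern. The same mapping can be sketched as a membership test, shown here as a standalone function that returns the name instead of setting a global. This is an illustrative alternative, not the code in `sa_config.py`; the function name and the choice to raise `ValueError` are assumptions.

```python
# Values of NotebookModels that appear in the excerpt above
MODEL_FAMILIES = ('syuzhetr2sentimentr', 'lex2ml', 'dnn2transformers')

def get_fname_sentiment_raw(corpus_genre, corpus_type, notebook_models):
    """Build the raw-sentiment filename for the notebook family that is running."""
    if notebook_models == 'none':
        return '[NONE]'
    if notebook_models not in MODEL_FAMILIES:
        # Fail loudly rather than printing and returning None
        raise ValueError(f'Illegal value for NotebookModels: {notebook_models}')
    return f'sentiment_raw_{corpus_genre}_{corpus_type}_{notebook_models}.json'

print(get_fname_sentiment_raw('finance', 'reference', 'lex2ml'))
```

Testing the function argument in every branch also avoids mixing the parameter with the module-level `global_vars.NotebookModels`.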

In [11]:
# Import SentimentArcs Utilities to define Directory Structure
#   based the Selected Corpus Genre, Type and Number

!pwd 
print('\n')

# from utils import sa_config # .sentiment_arcs_utils
from utils import sa_config

print('Objects in sa_config()')
print(dir(sa_config))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
sa_config.get_subdirs(Path_to_SentimentArcs, Corpus_Genre, Corpus_Type, Corpus_Number, 'none')


/gdrive/MyDrive/sentimentarcs_notebooks


Objects in sa_config()
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'get_subdirs', 'global_vars', 'set_globals']


Verify the Directory Structure:

-------------------------------

           [Corpus Genre]: finance

            [Corpus Type]: reference


    [FNAME_SENTIMENT_RAW]: [NONE]




INPUTS:
-------------------------------

   [SUBDIR_SENTIMENTARCS]: /gdrive/MyDrive/sentimentarcs_notebooks/


STEP 1: Clean Text
--------------------

        [SUBDIR_TEXT_RAW]: ./text_raw/text_raw_finance_reference/

      [SUBDIR_TEXT_CLEAN]: ./text_clean/text_clean_finance_reference/


STEP 2: Get Sentiments
--------------------

   [SUBDIR_SENTIMENT_RAW]: ./sentiment_raw/sentiment_raw_finance_reference/

 [SUBDIR_SENTIMENT_CLEAN]: ./sentiment_clean/sentiemnt_clean_finance_reference/


STEP 3: Smooth Time Series and Get Crux Points
--------------------

  [SUBDIR_TIMESERIES_RAW]: ./sentiment

In [12]:
# Call SentimentArcs Utility to define Global Variables

sa_config.set_globals()

# Verify sample global var set
print(f'MIN_PARAG_LEN: {global_vars.MIN_PARAG_LEN}')
print(f'STOPWORDS_ADD_EN: {global_vars.STOPWORDS_ADD_EN}')
print(f'TEST_WORDS_LS: {global_vars.TEST_WORDS_LS}')
print(f'SLANG_DT: {global_vars.SLANG_DT}')

MIN_PARAG_LEN: 10
STOPWORDS_ADD_EN: ['a', 'the', 'an']
TEST_WORDS_LS: ['Love', 'Hate', 'bizarre', 'strange', 'furious', 'elated', 'curious', 'beserk', 'gambaro']
SLANG_DT: {'$': ' dollar ', '€': ' euro ', '4ao': 'for adults only', 'a.m': 'before midday', 'a3': 'anytime anywhere anyplace', 'aamof': 'as a matter of fact', 'acct': 'account', 'adih': 'another day in hell', 'afaic': 'as far as i am concerned', 'afaict': 'as far as i can tell', 'afaik': 'as far as i know', 'afair': 'as far as i remember', 'afk': 'away from keyboard', 'app': 'application', 'approx': 'approximately', 'apps': 'applications', 'asap': 'as soon as possible', 'asl': 'age, sex, location', 'atk': 'at the keyboard', 'ave.': 'avenue', 'aymm': 'are you my mother', 'ayor': 'at your own risk', 'b&b': 'bed and breakfast', 'b+b': 'bed and breakfast', 'b.c': 'before christ', 'b2b': 'business to business', 'b2c': 'business to customer', 'b4': 'before', 'b4n': 'bye for now', 'b@u': 'back at you', 'bae': 'before anyone else', '
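
How `SLANG_DT` is applied is not shown at this point in the notebook. As a hedged sketch of one plausible use, the function below expands slang by whole-word lookup; the function name `expand_slang` and the three-entry sample dictionary (copied from the output above) are for illustration only.

```python
def expand_slang(text, slang_dt):
    """Replace whitespace-delimited tokens that appear in slang_dt (case-insensitive)."""
    words = []
    for word in text.split():
        # Look up the lowercased token; keep the original word when it is not slang
        words.append(slang_dt.get(word.lower(), word))
    return ' '.join(words)

# A few entries copied from the SLANG_DT output above
slang_sample = {'asap': 'as soon as possible', 'b4': 'before', 'afaik': 'as far as i know'}

print(expand_slang('Sell ASAP b4 the close', slang_sample))
```

This simple version misses tokens with attached punctuation (for example `asap,`) and symbol keys such as `$` that occur inside a token, so a fuller cleaner would need extra handling for both.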

## Configure Jupyter Notebook

In [13]:
# Configure Jupyter

# To reload modules under development

# Option (a)
%load_ext autoreload
%autoreload 2
# Option (b)
# import importlib
# importlib.reload(functions.readfunctions)


# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Read YAML Configuration for Corpus and Models 

In [80]:
# from utils import sa_config # .sentiment_arcs_utils

import yaml

from utils import read_yaml

print('Objects in read_yaml()')
print(dir(read_yaml))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
read_yaml.read_corpus_yaml(Corpus_Genre, Corpus_Type, Corpus_Number)

print('SentimentArcs Model Ensemble ------------------------------\n')
model_titles_ls = global_vars.models_titles_dt.keys()
print('\n'.join(model_titles_ls))


print('\n\nCorpus Texts ------------------------------\n')
corpus_titles_ls = list(global_vars.corpus_titles_dt.keys())
print('\n'.join(corpus_titles_ls))


print(f'\n\nThere are {len(model_titles_ls)} Models in the SentimentArcs Ensemble above.\n')
print(f'\nThere are {len(corpus_titles_ls)} Texts in the Corpus above.\n')
print('\n')


Objects in read_yaml()
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'global_vars', 'read_corpus_yaml', 'yaml']


YAML Directory: text_raw/text_raw_finance_reference
YAML File: text_raw_finance_reference_info.yaml
SentimentArcs Model Ensemble ------------------------------

AutoGluon_Text
BERT_2IMDB
BERT_Dual_Coding
BERT_Multilingual
BERT_Yelp
CNN_DNN
Distilled_BERT
FLAML_AutoML
Fully_Connected_Network
HyperOpt_CNN_Flair_AutoML
LSTM_DNN
Logistic_Regression
Logistic_Regression_CV
Multilingual_CNN_Stanza_AutoML
Multinomial_Naive_Bayes
Pattern
Random_Forest
RoBERTa_Large_15DB
RoBERTa_XML_8Language
SentimentR_JockersRinker
SentimentR_Jockers
SentimentR_Bing
SentimentR_NRC
SentimentR_SentiWord
SentimentR_SenticNet
SentimentR_LMcD
SentimentR_SentimentR
PySentimentR_JockersRinker
PySentimentR_Huliu
PySentimentR_NRC
PySentimentR_SentiWord
PySentimentR_SenticNet
PySentimentR_LMcD
SyuzhetR_AFINN
SyuzhetR_Bing
SyuzhetR_NRC
SyuzhetR_Syuz

## Install Libraries

In [21]:
# Library to Read R datafiles from within Python programs

!pip install pyreadr

Collecting pyreadr
  Downloading pyreadr-0.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (361 kB)

In [22]:
# Powerful Industry-Grade NLP Library

!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.1-py3-none-any.whl (27 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.2-py3-none-any.whl (7.2 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-many

In [23]:
# NLP Library to Simplify Cleaning Text

!pip install texthero

Collecting texthero
  Downloading texthero-1.1.0-py3-none-any.whl (24 kB)
Collecting spacy<3.0.0
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
Collecting nltk>=3.3
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.4-py3-none-any.whl (235 kB)
Collecting regex>=2021.8.3
  Downloading regex-2022.3.15-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (749 kB)
Collecting srsly<1.1.0,>=1.0.2
  Downloading srsly-1.0.5-cp37-cp37m-manylinux2014_x86_64.whl (184 kB)
Collecting catalogue<1.1.0,>=0.0.7
  Downloading catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Collecting thinc<7.5.0

In [24]:
# Advanced Sentence Boundary Detection Python Library
#   for splitting raw text into grammatical sentences
#   (can be difficult due to common motifs like Mr., ..., ?!?, etc)

!pip install pysbd

Collecting pysbd
  Downloading pysbd-0.3.4-py3-none-any.whl (71 kB)
Installing collected packages: pysbd
Successfully installed pysbd-0.3.4
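
To see why a dedicated sentence boundary library is worth installing, consider a deliberately naive splitter built on a single regular expression. This sketch uses only the standard library and is not how pysbd works internally; the function name and sample text are hypothetical.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (deliberately simplistic)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

text = 'Mr. Smith arrived late. Was he angry?!? Nobody knew.'
for sentence in naive_sentence_split(text):
    print(repr(sentence))
```

The text contains three sentences, but the naive rule returns four pieces because it also breaks after the abbreviation `Mr.`.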


In [25]:
# Python Library to expand contractions to aid in Sentiment Analysis
#   (e.g. aren't -> are not, can't -> can not)

!pip install contractions

Collecting contractions
  Downloading contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.0 contractions-0.1.68 pyahocorasick-1.4.4 textsearch-0.0.21
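
The idea behind contraction expansion can be sketched with a small lookup table and one regular expression. The three-entry table below is hand-written for illustration and stands in for the much larger coverage of the installed library; the names `CONTRACTIONS_SAMPLE` and `expand_contractions` are hypothetical.

```python
import re

# Tiny hand-written table for illustration only
CONTRACTIONS_SAMPLE = {
    "aren't": 'are not',
    "can't": 'can not',
    "won't": 'will not',
}

# Match any key as a whole word, ignoring case
_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(key) for key in CONTRACTIONS_SAMPLE) + r')\b',
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS_SAMPLE[m.group(0).lower()], text)

print(expand_contractions("Boots and hippos aren't similar, and I can't explain why."))
```

Expanding `aren't` to `are not` exposes the negation as a separate token. Note that this sketch only matches the straight ASCII apostrophe, not the typographic one.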


In [26]:
# Library for dealing with Emoticons (punctuation) and Emojis (icons)

!pip install emot

Collecting emot
  Downloading emot-3.1-py3-none-any.whl (61 kB)
Installing collected packages: emot
Successfully installed emot-3.1


## Load Libraries

In [27]:
# Core Python Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import re
import string
from datetime import datetime
import os
import sys
import glob
import json
from pathlib import Path
from copy import deepcopy

2022-04-05 20:06:10,057 : INFO : NumExpr defaulting to 2 threads.


In [28]:
# More advanced Sentence Tokenizer Object from PySBD
from pysbd.utils import PySBDFactory

In [29]:
# Simpler Sentence Tokenizer Object from NLTK
import nltk 
from nltk.tokenize import sent_tokenize

# Download required NLTK tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [30]:
# Instantiate and Import Text Cleaning Objects into Global Variable space
import texthero as hero
from texthero import preprocessing

2022-04-05 20:06:12,472 : INFO : 'pattern' package not found; tag filters are not available for English
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [31]:
# Expand contractions (e.g. can't -> can not)
import contractions

# Translate emoticons :0 and emoji icons to text
import emot 
emot_obj = emot.core.emot() 

from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO

# Test
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))" 
emot_obj.emoticons(text)

{'flag': True,
 'location': [[20, 23], [24, 27], [28, 33]],
 'mean': ['Happy face smiley',
  'Frown, sad, andry or pouting',
  'Very very Happy face or smiley'],
 'value': [':-)', ':-(', ':-)))']}
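
The `location` and `mean` fields in this result are enough to substitute each emoticon with its textual meaning. The sketch below works on a literal copy of the dictionary printed above, so it runs without the `emot` package; the function name `replace_emoticons` is hypothetical.

```python
def replace_emoticons(text, emot_result):
    """Replace each detected emoticon span with its textual meaning."""
    if not emot_result['flag']:
        return text
    spans = zip(emot_result['location'], emot_result['mean'])
    # Work from right to left so earlier offsets stay valid after each replacement
    for (start, end), meaning in sorted(spans, key=lambda item: item[0][0], reverse=True):
        text = text[:start] + meaning + text[end:]
    return text

text = "I love python ☮ 🙂 ❤ :-) :-( :-)))"

# Literal copy of the output shown above
emot_result = {
    'flag': True,
    'location': [[20, 23], [24, 27], [28, 33]],
    'mean': ['Happy face smiley',
             'Frown, sad, andry or pouting',
             'Very very Happy face or smiley'],
    'value': [':-)', ':-(', ':-)))'],
}

print(replace_emoticons(text, emot_result))
```

The emoji characters are left untouched because this result only covers emoticons.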

In [32]:
# Import spaCy, language model and setup minimal pipeline

import spacy

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
# nlp.max_length = 1027203
nlp.max_length = 2054406
nlp.add_pipe(nlp.create_pipe('sentencizer')) # https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization

# Test some edge cases, try to find examples that break spaCy
doc= nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

print('\nAnother Test:\n')
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

for token in doc:
    print("{:<12}{:<30}{:<12}".format(token.text, token.lemma, token.lemma_))



Token Attributes: 
 token.text, token.pos_, token.tag_, token.dep_, token.lemma_
Apples                                          Apples      
and                                             and         
oranges                                         orange      
are                                             be          
similar                                         similar     
.                                               .           
Boots                                           Boots       
and                                             and         
hippos                                          hippo       
are         AUX         VBP                     be          
n't         PART        RB                      not         
.                                               .           

Another Test:

Apples      9297668116247400838           Apples      
and         2283656566040971221           and         
oranges     2208928596161743350           orange      
are 

## Define/Customize Stopwords

In [43]:
# Define Globals
"""
# Main data structure: Dictionary (key=text_name) of DataFrames (cols: text_raw, text_clean)
corpus_texts_dt = {}

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_globals.py'

SLANG_DT.keys()
""";

In [44]:
global_vars.SLANG_DT.keys()

dict_keys(['$', '€', '4ao', 'a.m', 'a3', 'aamof', 'acct', 'adih', 'afaic', 'afaict', 'afaik', 'afair', 'afk', 'app', 'approx', 'apps', 'asap', 'asl', 'atk', 'ave.', 'aymm', 'ayor', 'b&b', 'b+b', 'b.c', 'b2b', 'b2c', 'b4', 'b4n', 'b@u', 'bae', 'bak', 'bbbg', 'bbc', 'bbias', 'bbl', 'bbs', 'be4', 'bfn', 'blvd', 'bout', 'brb', 'bros', 'brt', 'bsaaw', 'btw', 'bwl', 'c/o', 'cet', 'cf', 'cia', 'csl', 'cu', 'cul8r', 'cv', 'cwot', 'cya', 'cyt', 'dae', 'dbmib', 'diy', 'dm', 'dwh', 'e123', 'eet', 'eg', 'embm', 'encl', 'encl.', 'etc', 'faq', 'fawc', 'fb', 'fc', 'fig', 'fimh', 'ft.', 'ft', 'ftl', 'ftw', 'fwiw', 'fyi', 'g9', 'gahoy', 'gal', 'gcse', 'gfn', 'gg', 'gl', 'glhf', 'gmt', 'gmta', 'gn', 'g.o.a.t', 'goat', 'goi', 'gps', 'gr8', 'gratz', 'gyal', 'h&c', 'hp', 'hr', 'hrh', 'ht', 'ibrb', 'ic', 'icq', 'icymi', 'idc', 'idgadf', 'idgaf', 'idk', 'ie', 'i.e', 'ifyp', 'IG', 'iirc', 'ilu', 'ily', 'imho', 'imo', 'imu', 'iow', 'irl', 'j4f', 'jic', 'jk', 'jsyk', 'l8r', 'lb', 'lbs', 'ldr', 'lmao', 'lmfao', 

In [45]:
dir(global_vars)

['Corpus_Genre',
 'Corpus_Number',
 'Corpus_Type',
 'FNAME_SENTIMENT_RAW',
 'MIN_PARAG_LEN',
 'MIN_SENT_LEN',
 'NotebookModels',
 'PATH_TEXT_RAW',
 'PATH_TEXT_RAW_CORPUS',
 'SLANG_DT',
 'STOPWORDS_ADD_EN',
 'STOPWORDS_DEL_EN',
 'SUBDIR_CRUXES',
 'SUBDIR_DATA',
 'SUBDIR_GRAPHS',
 'SUBDIR_SENTIMENTARCS',
 'SUBDIR_SENTIMENT_CLEAN',
 'SUBDIR_SENTIMENT_RAW',
 'SUBDIR_TEXT_CLEAN',
 'SUBDIR_TEXT_RAW',
 'SUBDIR_TIMESERIES_CLEAN',
 'SUBDIR_TIMESERIES_RAW',
 'SUBDIR_UTILS',
 'TEST_SENTENCES_LS',
 'TEST_WORDS_LS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'corpus_texts_dt',
 'corpus_titles_dt',
 'corpus_titles_ls',
 'lexicons_dt',
 'model_titles_dt',
 'models_ensemble_dt',
 'models_titles_dt']

In [46]:
%whos

Variable                Type             Data/Info
--------------------------------------------------
Corpus_Genre            str              finance
Corpus_Number           int              2
Corpus_Type             str              reference
EMOTICONS_EMO           dict             n=221
IN_COLAB                bool             True
Image                   type             <class 'IPython.core.display.Image'>
InteractiveShell        MetaHasTraits    <class 'IPython.core.inte<...>eshell.InteractiveShell'>
PATH_TEXT_CLEAN         str              ./text_clean/text_clean_finance_ref
PATH_TEXT_RAW           str              ./text_raw/text_raw_finance_ref
PATH_UTILS              str              /gdrive/MyDrive/sentimentarcs_notebooks/utils
Path                    type             <class 'pathlib.Path'>
Path_to_SentimentArcs   str              /gdrive/MyDrive/sentimentarcs_notebooks/
PySBDFactory            type             <class 'pysbd.utils.PySBDFactory'>
SUBDIR_TEXT_CLEAN       str 

In [47]:
# Verify English Stopword List

stopwords_spacy_en_ls = nlp.Defaults.stop_words

','.join([x for x in stopwords_spacy_en_ls])

stopwords_en_ls = stopwords_spacy_en_ls

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

"mine,will,that,made,until,our,both,off,are,’ll,this,many,whether,but,herself,by,get,since,move,two,’m,‘ll,as,thus,with,much,me,doing,it,do,him,anyway,‘s,so,already,full,sometimes,might,whoever,seeming,please,after,everything,none,bottom,anyhow,become,across,very,always,’d,mostly,eleven,most,quite,empty,although,n‘t,yours,seems,not,then,n’t,really,also,nor,these,never,own,front,becoming,who,every,various,amongst,seemed,its,'ve,indeed,am,perhaps,few,thence,together,myself,through,a,ever,forty,above,down,via,on,side,others,ours,itself,however,here,himself,throughout,behind,amount,four,i,while,meanwhile,anyone,where,were,can,other,all,back,see,three,up,nevertheless,be,hereafter,make,serious,go,‘m,top,may,now,nothing,how,take,further,there,in,whence,once,show,if,formerly,beyond,wherein,during,had,becomes,no,'m,keep,next,whatever,could,somehow,last,even,latterly,twelve,everywhere,due,’s,to,seem,should,name,along,part,being,did,'s,from,upon,still,something,whereupon,them,under,which,elsewher



There are 326 default English Stopwords from spaCy



## (Optional) Customize Stopword List (add/del)

In [48]:
# Customize Default SpaCy English Stopword List

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

# [CUSTOMIZE] Stopwords to ADD or DELETE from default spaCy English stopword list
# NOTE: Deletions are applied first and additions last, so a word that appears in
#       BOTH lists (as in the identical example lists below) remains a stopword.
LOCAL_STOPWORDS_DEL_EN = set(global_vars.STOPWORDS_DEL_EN).union(set(['a','an','the','but','yet']))
print(f'    Deleting these stopwords: {LOCAL_STOPWORDS_DEL_EN}')
LOCAL_STOPWORDS_ADD_EN = set(global_vars.STOPWORDS_ADD_EN).union(set(['a','an','the','but','yet']))
print(f'    Adding these stopwords: {LOCAL_STOPWORDS_ADD_EN}\n')

stopwords_en_ls = list(set(stopwords_spacy_en_ls).difference(set(LOCAL_STOPWORDS_DEL_EN)).union(set(LOCAL_STOPWORDS_ADD_EN)))
print(f'Final Count: {len(stopwords_en_ls)} Stopwords')



There are 326 default English Stopwords from spaCy

    Deleting these stopwords: {'a', 'jimmy', 'but', 'an', 'the', 'dean', 'yet'}
    Adding these stopwords: {'a', 'an', 'but', 'the', 'yet'}

Final Count: 326 Stopwords
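
The final count is unchanged at 326 because the same example words appear in both the delete and the add lists. A minimal sketch of that set algebra, assuming a tiny hypothetical stopword set (not spaCy's list) and an illustrative helper name `customize_stopwords`:

```python
def customize_stopwords(base, to_delete=(), to_add=()):
    """Return base minus to_delete, then plus to_add (additions win on conflict)."""
    return (set(base) - set(to_delete)) | set(to_add)

# Hypothetical miniature stopword list, for illustration only
base_stopwords = {'a', 'an', 'the', 'but', 'yet', 'and'}

# A word listed for both deletion and addition stays in the result
same = customize_stopwords(base_stopwords, to_delete={'but', 'yet'}, to_add={'but', 'yet'})
print(same == base_stopwords)

# Deleting contrast words while adding corpus-specific ones
custom = customize_stopwords(base_stopwords, to_delete={'but', 'yet'}, to_add={'mr', 'mrs'})
print(sorted(custom))
```

Keeping the two lists disjoint is what makes the customization take effect.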


## Setup Matplotlib Style

In [49]:
# Configure Matplotlib

# View available styles
# plt.style.available

# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

%run -i './utils/config_matplotlib.py'

config_matplotlib()

print('Matplotlib Configuration ------------------------------')
print('\n  (Uncomment to view)')
# plt.rcParams.keys()
print('\n  Edit ./utils/config_matplotlib.py to change')




 New figure size:  (20, 10)
Matplotlib Configuration ------------------------------

  (Uncomment to view)

  Edit ./utils/config_matplotlib.py to change


## Setup Seaborn Style

In [50]:
# Configure Seaborn

# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

%run -i './utils/config_seaborn.py'

config_seaborn()

print('Seaborn Configuration ------------------------------\n')
# print('\n  Update ./utils/config_seaborn.py to display seaborn settings')




Seaborn Configuration ------------------------------



## **Utility Functions**

### Generate Convenient Data Lists

In [83]:
# Derive Lists of the Texts in the Corpus: a) dictionary keys and b) full natural-language titles

print('Dictionary: corpus_titles_dt')
global_vars.corpus_titles_dt
print('\n')

corpus_texts_ls = list(global_vars.corpus_titles_dt.keys())
print(f'\nCorpus Texts:')
for akey in corpus_texts_ls:
  print(f'  {akey}')
print('\n')

print(f'\nNatural Corpus Titles:')
corpus_titles_ls = [x[0] for x in list(global_vars.corpus_titles_dt.values())]
for akey in corpus_titles_ls:
  print(f'  {akey}')


Dictionary: corpus_titles_dt


{'bogfederalreserve_speech_1997-2022': ['Federal Reserve Board of Governor Speeches (Jan 1997 - Feb 2022)',
  datetime.date(1997, 1, 8),
  datetime.date(2022, 2, 25)],
 'eucentralbank_speeches_1998-2022': ['European Central Bank Speeches (Jul 1998 - Mar 2022)',
  datetime.date(1998, 7, 17),
  datetime.date(2022, 3, 1)]}




Corpus Texts:
  bogfederalreserve_speech_1997-2022
  eucentralbank_speeches_1998-2022



Natural Corpus Titles:
  Federal Reserve Board of Governor Speeches (Jan 1997 - Feb 2022)
  European Central Bank Speeches (Jul 1998 - Mar 2022)


In [84]:
# Get Model Families of Ensemble

from utils.get_model_families import get_ensemble_model_famalies

global_vars.models_ensemble_dt = get_ensemble_model_famalies(global_vars.models_titles_dt)

print('\nTest: Lexicon Family of Models:')
global_vars.models_ensemble_dt['lexicon']


There are 12 Lexicon Models
  Lexicon Model #0: sentimentr_sentimentr
  Lexicon Model #1: pysentimentr_jockersrinker
  Lexicon Model #2: pysentimentr_huliu
  Lexicon Model #3: pysentimentr_nrc
  Lexicon Model #4: pysentimentr_sentiword
  Lexicon Model #5: pysentimentr_senticnet
  Lexicon Model #6: pysentimentr_lmcd
  Lexicon Model #7: syuzhetr_afinn
  Lexicon Model #8: syuzhetr_bing
  Lexicon Model #9: syuzhetr_nrc
  Lexicon Model #10: syuzhetr_syuzhetr
  Lexicon Model #11: afinn

There are 9 Heuristic Models
  Heuristic Model #0: pattern
  Heuristic Model #1: sentimentr_jockersrinker
  Heuristic Model #2: sentimentr_jockers
  Heuristic Model #3: sentimentr_bing
  Heuristic Model #4: sentimentr_nrc
  Heuristic Model #5: sentimentr_sentiword
  Heuristic Model #6: sentimentr_senticnet
  Heuristic Model #7: sentimentr_lmcd
  Heuristic Model #8: vader

There are 8 Traditional ML Models
  Traditional ML Model #0: autogluon
  Traditional ML Model #1: flaml
  Traditional ML Model #2: logreg


['sentimentr_sentimentr',
 'pysentimentr_jockersrinker',
 'pysentimentr_huliu',
 'pysentimentr_nrc',
 'pysentimentr_sentiword',
 'pysentimentr_senticnet',
 'pysentimentr_lmcd',
 'syuzhetr_afinn',
 'syuzhetr_bing',
 'syuzhetr_nrc',
 'syuzhetr_syuzhetr',
 'afinn']

### Text Cleaning 

In [53]:
# [VERIFY]: Texthero preprocessing pipeline

hero.preprocessing.get_default_pipeline()



# Create Default and Custom Stemming TextHero pipelines

# Default-style cleaning pipeline (stopwords are kept)
def_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                # , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace]

# Stemming pipeline (removes stopwords, then stems)
stem_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace
                , preprocessing.stem]
                   
# Test: pass one of the pipelines above to the pipeline argument
# df['clean_title'] = hero.clean(df['title'], pipeline=def_pipeline)
# df.head()

[<function texthero.preprocessing.fillna>,
 <function texthero.preprocessing.lowercase>,
 <function texthero.preprocessing.remove_digits>,
 <function texthero.preprocessing.remove_punctuation>,
 <function texthero.preprocessing.remove_diacritics>,
 <function texthero.preprocessing.remove_stopwords>,
 <function texthero.preprocessing.remove_whitespace>]
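
To make the effect of each pipeline step concrete, here is a minimal standard-library sketch that roughly mirrors the named steps on a single string. It is not TextHero's implementation (TextHero operates on pandas Series and its defaults may differ in detail); the function bodies and `apply_pipeline` are assumptions for illustration.

```python
import re
import string
import unicodedata

def lowercase(text):
    return text.lower()

def remove_digits(text):
    # Replace runs of digits with a space
    return re.sub(r'\d+', ' ', text)

def remove_punctuation(text):
    # Replace ASCII punctuation with a space
    return re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)

def remove_diacritics(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_whitespace(text):
    # Collapse repeated whitespace and trim both ends
    return ' '.join(text.split())

sketch_pipeline = [lowercase, remove_digits, remove_punctuation,
                   remove_diacritics, remove_whitespace]

def apply_pipeline(text, pipeline):
    """Apply each cleaning step in order, like a TextHero pipeline list."""
    for step in pipeline:
        text = step(text)
    return text

cleaned = apply_pipeline('The  RAin in SPain, 1998: café!', sketch_pipeline)
print(cleaned)
```

Because the steps run in list order, whitespace cleanup belongs last: the earlier steps leave stray spaces behind.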

In [54]:
# Test Text Cleaning Functions

# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

%run -i './utils/text_cleaners.py'

test_suite_ls = ['text2lemmas',
                 'text_str2sents',
                 'textfile2df',
                 'emojis2text',
                 'all_emos2text',
                 'expand_slang',
                 'clean_text',
                 'lemma_pipe'
                 ]

# test_suite_ls = []

# Test: text2lemmas()
if 'text2lemmas' in test_suite_ls:
  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)
  print('\n')

# Test: text_str2sents()
if 'text_str2sents' in test_suite_ls:
  text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False) # !?! Dr. and Mrs. Elipses...', pysbd_only=True)
  print('\n')

# Test: textfile2df()
if 'textfile2df' in test_suite_ls:
  # TODO: Add a test for textfile2df()
  print('\n')

# Test: emojis2text()
if 'emojis2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
  test_str = emojis2text(test_str)
  print(f'test_str: [{test_str}]')
  print('\n')

# Test: all_emos2text()
if 'all_emos2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
  all_emos2text(test_str)
  print('\n')

# Test: expand_slang():
if 'expand_slang' in test_suite_ls:
  expand_slang('idk LOL you suck!')
  print('\n')

# Test: clean_text()
if 'clean_text' in test_suite_ls:
  test_df = pd.DataFrame({'text_dirty':['The RAin in SPain','WTF?!?! Do you KnoW...']})
  clean_text(test_df, 'text_dirty', text_type='formal')
  print('\n')

# Test: lemma_pipe()
if 'lemma_pipe' in test_suite_ls:
  print('\nTest #1:\n')
  test_ls = ['I am running late for a meetings with all the many people.',
            'What time is it when you fall down running away from a growing problem?',
            "You've got to be kidding me - you're joking right?"]
  lemma_pipe(test_ls)
  print('\nTest #2:\n')
  texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])
  for doc in nlp.pipe(texts):
    print([tok.lemma_ for tok in doc])
  print('\nTest #3:\n')
  lemma_pipe(texts)


'i be go to start study much often and work hard .'



BEFORE stripping out headings len: 96
   Parag count before processing sents: 2
pysbd found 3 Sentences in Paragraph #0
      3 Sentences remain after cleaning
pysbd found 3 Sentences in Paragraph #1
      3 Sentences remain after cleaning
Processing asent: Hello.
Processing asent: You are a great dude!
Processing asent: WTF?
Processing asent: You are a goat.
Processing asent: What is a goat?!? A big lazy GOAT...
Processing asent: No way-
About to return sents_ls with len = 7


['Hello.',
 'You are a great dude!',
 'WTF?',
 'You are a goat.',
 'What is a goat?!?',
 'A big lazy GOAT...',
 'No way-']





test_str: [Hilarious face with tears of joy. The feeling of making a sale smiling face with sunglasses, The feeling of actually ;) fulfilling orders unamused face]




'Hilarious face with tears of joy. The feeling Surprise of making a sale smiling face with sunglasses, The feeling Frown sad andry or pouting of actually Wink or smirk fulfilling orders unamused face'





'i do not know laughing out loud you suck!'





0    the rain in spain
1      wtf do you know
Name: text_dirty, dtype: object




Test #1:



['i be run late for a meeting with all the many people .',
 'what time be it when you fall down run away from a grow problem ?',
 'you have get to be kid me - you be joke right ?']


Test #2:

['I', 'will', 'not', 'go', 'and', 'you', 'can', 'not', 'make', 'me', '.']
['Billy', 'be', 'run', 'really', 'quickly', 'and', 'with', 'great', 'haste', '.']
['Eating', 'freshly', 'catch', 'seafood', '.']

Test #3:



['i will not go and you can not make me .',
 'billy be run really quickly and with great haste .',
 'eating freshly catch seafood .']

In [55]:
# Test Text Cleaning Functions

%run -i './utils/text_cleaners.py'
# from utils.text_cleaners import text2lemmas, text_str2sents, emojis2text, expand_slang, clean_text, lemma_pipe

test_suite_ls = ['text2lemmas',
                 'text_str2sents',
                 'textfile2df',
                 'emojis2text',
                 'all_emos2text',
                 'expand_slang',
                 'clean_text',
                 'lemma_pipe'
                 ]

# Comment out the next line to activate the tests above
# test_suite_ls = []


# Test: text2lemmas()
if 'text2lemmas' in test_suite_ls:
  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)
  print('\n')

# Test: text_str2sents()
if 'text_str2sents' in test_suite_ls:
  text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False) # !?! Dr. and Mrs. Elipses...', pysbd_only=True)
  print('\n')

# Test: textfile2df()
if 'textfile2df' in test_suite_ls:
  # TODO: Add a test for textfile2df()
  print('\n')

# Test: emojis2text()
if 'emojis2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
  test_str = emojis2text(test_str)
  print(f'test_str: [{test_str}]')
  print('\n')

# Test: all_emos2text()
if 'all_emos2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
  all_emos2text(test_str)
  print('\n')

# Test: expand_slang():
if 'expand_slang' in test_suite_ls:
  expand_slang('idk LOL you suck!')
  print('\n')

# Test: clean_text()
if 'clean_text' in test_suite_ls:
  test_df = pd.DataFrame({'text_dirty':['The RAin in SPain','WTF?!?! Do you KnoW...']})
  clean_text(test_df, 'text_dirty', text_type='formal')
  print('\n')
"""
# Test: lemma_pipe()
if 'lemma_pipe' in test_suite_ls:
  print('\nTest #1:\n')
  test_ls = ['I am running late for a meetings with all the many people.',
            'What time is it when you fall down running away from a growing problem?',
            "You've got to be kidding me - you're joking right?"]
  lemma_pipe(test_ls)
  print('\nTest #2:\n')
  texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])
  for doc in nlp.pipe(texts):
    print([tok.lemma_ for tok in doc])
  print('\nTest #3:\n')
  lemma_pipe(texts)
"""

'i be go to start study much often and work hard .'



BEFORE stripping out headings len: 96
   Parag count before processing sents: 2
pysbd found 3 Sentences in Paragraph #0
      3 Sentences remain after cleaning
pysbd found 3 Sentences in Paragraph #1
      3 Sentences remain after cleaning
Processing asent: Hello.
Processing asent: You are a great dude!
Processing asent: WTF?
Processing asent: You are a goat.
Processing asent: What is a goat?!? A big lazy GOAT...
Processing asent: No way-
About to return sents_ls with len = 7


['Hello.',
 'You are a great dude!',
 'WTF?',
 'You are a goat.',
 'What is a goat?!?',
 'A big lazy GOAT...',
 'No way-']





test_str: [Hilarious face with tears of joy. The feeling of making a sale smiling face with sunglasses, The feeling of actually ;) fulfilling orders unamused face]




'Hilarious face with tears of joy. The feeling Surprise of making a sale smiling face with sunglasses, The feeling Frown sad andry or pouting of actually Wink or smirk fulfilling orders unamused face'





'i do not know laughing out loud you suck!'





0    the rain in spain
1      wtf do you know
Name: text_dirty, dtype: object





'\n# Test: lemma_pipe()\nif \'lemma_pipe\' in test_suite_ls:\n  print(\'\nTest #1:\n\')\n  test_ls = [\'I am running late for a meetings with all the many people.\',\n            \'What time is it when you fall down running away from a growing problem?\',\n            "You\'ve got to be kidding me - you\'re joking right?"]\n  lemma_pipe(test_ls)\n  print(\'\nTest #2:\n\')\n  texts = pd.Series(["I won\'t go and you can\'t make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])\n  for doc in nlp.pipe(texts):\n    print([tok.lemma_ for tok in doc])\n  print(\'\nTest #3:\n\')\n  lemma_pipe(texts)\n'

### File Functions

In [56]:
# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

%run -i './utils/file_utils.py'
# from utils.file_utils import *

# %run -i './utils/file_utils.py'

# TODO: Not used? Delete?
# get_fullpath(text_title_str, ftype='data_clean', fig_no='', first_note = '',last_note='', plot_ext='png', no_date=False)

# **[STEP 2] Read in Corpus and Clean**

## Create List of Raw Textfiles

In [57]:
global_vars.SUBDIR_SENTIMENTARCS

'/gdrive/MyDrive/sentimentarcs_notebooks/'

In [58]:
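# TODO: Temporary fix for building the path to the raw text subdir
# print(f'Original: {global_vars.SUBDIR_TEXT_RAW}\n')
path_text_raw = './' + '/'.join(global_vars.SUBDIR_TEXT_RAW.split('/')[1:-1])
print(f'path_text_raw: {path_text_raw}\n')
# SUBDIR_TEXT_RAW = path_text_raw + '/'
# NOTE: SUBDIR_TEXT_RAW already begins with './text_raw/', so the prefix printed below is doubled
print(f'Full Path to Corpus text_raw: ./text_raw/{global_vars.SUBDIR_TEXT_RAW}')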

path_text_raw: ./text_raw/text_raw_finance_reference

Full Path to Corpus text_raw: ./text_raw/./text_raw/text_raw_finance_reference/


In [59]:
!pwd

/gdrive/MyDrive/sentimentarcs_notebooks


In [60]:
# Get a list of all the Textfile filename roots in Subdir text_raw

# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

corpus_titles_ls = list(global_vars.corpus_titles_dt.keys())

print(f'Corpus_Genre: {global_vars.Corpus_Genre}')
print(f'Corpus_Type: {global_vars.Corpus_Type}\n')

# Build path to Corpus Subdir
# TODO: Temporary fix for building the path to the raw text subdir
# path_text_raw = './' + '/'.join(SUBDIR_TEXT_RAW.split('/')[1:-1]) + '/' + SUBDIR_TEXT_RAW
# NOTE: SUBDIR_TEXT_RAW already begins with './text_raw/', so prepending './text_raw/'
#       doubles the prefix and the glob below matches no files (see this cell's output)
path_text_raw = './text_raw/' + global_vars.SUBDIR_TEXT_RAW
print(f'Corpus Subdir: {path_text_raw}')

# Create a List (texts_raw_root_ls) of the filename roots of all raw text files
# NOTE: glob.glob() returns an empty list rather than raising when nothing matches
try:
  # texts_raw_ls = glob.glob(f'{SUBDIR_TEXT_RAW}*.txt')
  texts_raw_root_ls = glob.glob(f'{path_text_raw}/*.txt')
  texts_raw_root_ls = [x.split('/')[-1] for x in texts_raw_root_ls]
  texts_raw_root_ls = [x.split('.')[0] for x in texts_raw_root_ls]
except IndexError:
  raise RuntimeError('No *.txt files found')

print(f'\ntexts_raw_root_ls:\n  {texts_raw_root_ls}\n')

text_ct = 0
for afile_root in texts_raw_root_ls:
  # file_root = file_fullpath.split('/')[-1].split('.')[0]
  text_ct += 1
  print(f'{afile_root}: ') # {corpus_titles_dt[afile_root]}')

print(f'\nThere are {text_ct} Texts defined in SentimentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]')

Corpus_Genre: finance
Corpus_Type: reference

Corpus Subdir: ./text_raw/./text_raw/text_raw_finance_reference/

texts_raw_root_ls:
  []


There are 0 Texts defined in SentimentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]


In [61]:
!ls -altr $path_text_raw

ls: cannot access './text_raw/./text_raw/text_raw_finance_reference/': No such file or directory


In [62]:
glob.glob(f'{path_text_raw}/*.txt')

[]
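
The empty list above is consistent with the doubled prefix in the printed corpus path (`./text_raw/./text_raw/...`). A minimal sketch of one way to guard against this, assuming hypothetical helpers `corpus_subdir` and `list_text_roots` that are not part of SentimentArcs:

```python
from pathlib import Path

def corpus_subdir(subdir_text_raw, root='text_raw'):
    """Build the corpus path, adding the root folder only if it is missing."""
    p = Path(subdir_text_raw)
    if root in p.parts:
        return p
    return Path(root) / p

def list_text_roots(corpus_path):
    """Return sorted filename stems of *.txt files, or raise if none are found."""
    roots = sorted(f.stem for f in Path(corpus_path).glob('*.txt'))
    if not roots:
        raise RuntimeError(f'No *.txt files found in {corpus_path}')
    return roots

# Both spellings of the subdir resolve to the same relative path
print(corpus_subdir('./text_raw/text_raw_finance_reference/'))
print(corpus_subdir('text_raw_finance_reference'))
```

Raising on an empty result surfaces a bad path immediately, instead of letting later cells run against zero texts.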

## Read and Segment into Sentences

In [63]:
corpus_titles_ls

['federalreserve_speech_bog', 'european_central_bank']

In [68]:
%%time
%%capture

# Read all Corpus Textfiles and Segment each into Sentences

# NOTE:   3m30s Entire Corpus of 25 
#         7m30s Ref Corpus 32 Novels
#         7m24s Ref Corpus 32 Novels
#         1m00s New Corpus 2 Novels

#        13m55s Finance: Federal Reserve Board of Governors Speeches (32M) + European Central Bank Speeches (38M)

# Read all novel files into a Dictionary of DataFrames
#   Dict.keys() are novel names
#   Dict.values() are DataFrames with one row per Sentence

# Continue here ONLY if last cell completed WITHOUT ERROR

# anovel_df = pd.DataFrame()

for i, file_root in enumerate(corpus_titles_ls):
  file_fullpath = f'{global_vars.SUBDIR_TEXT_RAW}{file_root}.txt'
  # print(f'Processing Novel #{i}: {file_fullpath}') # {file_root}')
  # fullpath_str = novels_subdir + asubdir + '/' + asubdir + '.txt'
  # print(f"  Size: {os.path.getsize(file_fullpath)}")

  global_vars.corpus_texts_dt[file_root] = textfile2df(file_fullpath)
  
# corpus_dt.keys()

# Verify First Text is Segmented into text_raw Sentences
print('\n\n')

# global_vars.corpus_texts_dt[corpus_titles_ls[0]].head()
text_no = 0
print(f'Verify sample segmented Text: \n    {corpus_texts_ls[text_no]}\n')
global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].head()


TypeError: ignored

CPU times: user 13min 48s, sys: 2.57 s, total: 13min 51s
Wall time: 13min 55s


## Clean Sentences

In [94]:
%%time

# NOTE: (no stem) 4m09s (24 Novels)
#       (w/ stem) 4m24s (24 Novels)


#         4m10s Finance: Federal Reserve Board of Governors Speeches (32M) + European Central Bank Speeches (38M)

for i, (key_novel, atext_df) in enumerate(global_vars.corpus_texts_dt.items()):

  print(f'Processing Novel #{i}: {key_novel}...')

  atext_df['text_clean'] = clean_text(atext_df, 'text_raw', text_type='formal')
  atext_df['text_clean'] = lemma_pipe(atext_df['text_clean'])
  atext_df['text_clean'] = atext_df['text_clean'].astype('string')

  # TODO: Fill in all blank 'text_clean' rows with filler semaphore
  # NOTE: fillna() only replaces missing values; empty strings are left as-is
  atext_df.text_clean = atext_df.text_clean.fillna('empty_placeholder')

  atext_df.head(2)

  print(f'  shape: {atext_df.shape}')

Processing Novel #0: bogfederalreserve_speech_1997-2022...
  shape: (219966, 2)
Processing Novel #1: eucentralbank_speeches_1998-2022...
  shape: (298570, 2)
CPU times: user 3min 57s, sys: 2.26 s, total: 3min 59s
Wall time: 4min 10s


In [112]:
# Verify that a Text in the Corpus is cleaned (text_no = 1 selects the second Text)

text_no = 1
global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].head(10)
global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].tail(10)
global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].info()

Unnamed: 0,text_raw,text_clean
0,Mr. Duisenberg reports on the outcome of the s...,mr duisenberg report on the outcome of the 2 m...
1,The Governing Council first assessed current e...,the govern council ﻿1 assess current economic ...
2,The general picture is one of continued econom...,the general picture be one of continue economi...
3,Several forecasts made during spring 1998 have...,several forecast make during spring have even ...
4,"As far as price developments are concerned, in...",a far a price development be concern inflation...
5,Output growth has remained strong in recent qu...,output growth have remain strong in recent qua...
6,Economic growth has been driven increasingly b...,economic growth have be drive increasingly by ...
7,Private consumption and stockbuilding have bee...,private consumption and stockbuilding have be ...
8,The favourable conjunctural situation has star...,the favourable conjunctural situation have sta...
9,"It is evident, however, that economic growth a...",it be evident however that economic growth alo...


Unnamed: 0,text_raw,text_clean
298560,"@87ABCD6 )$1113631E21)2!191235192F913""62606133...",87abcd6 1113631e21 191235192f913 21325 g 11h21...
298561,[5!612316261E1 269213121!,612316261e1
298562,"362!09251!31111"" B)91291123161!9231\91]359I612...",b \ 359i61230 \
298563,151!632269131912!313632512626!2626163 99236166...,y1365321131
298564,"=36128132(2""",
298565,_abcdeafdbhbibafjkellmicfdein,abcdeafdbhbibafjkellmicfdein
298566,"$612321 BK >32""32I2J1320 p ABAK C XX 1!2q113621",bk 32i2j1320 p abak c xx 2q113621
298567,24185152458824158788714516 !,
298568,#$%&!,
298569,012345678 412084,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298570 entries, 0 to 298569
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   text_raw    298570 non-null  object
 1   text_clean  298570 non-null  string
dtypes: object(1), string(1)
memory usage: 4.6+ MB


## Detect Language and Filter out Noise

To deal with noisy text:

* ftfy: fix bad encodings
* chardet: detect encoding
* langdetect: detect language
* (custom): skip nonsense sentences
 
References:

* http://cs229.stanford.edu/proj2014/Ian%20Tenney,%20A%20General-Purpose%20Sentence-Level%20Nonsense%20Detector.pdf
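
As one possible shape for the custom "skip nonsense sentences" step, here is a minimal heuristic sketch. The function name `looks_like_noise` and both thresholds are assumptions chosen for illustration, not values from SentimentArcs or from the reference above. The sample strings are taken from the tail of the cleaned European Central Bank text shown earlier.

```python
import re

def looks_like_noise(sent, min_alpha_ratio=0.6, min_vowel_ratio=0.2):
    """Heuristic flag for sentences that are mostly digits/symbols or vowel-poor."""
    chars = [ch for ch in sent if not ch.isspace()]
    if not chars:
        return True
    letters = re.sub('[^a-zA-Z]', '', sent)
    # Too few letters relative to all non-space characters
    if len(letters) / len(chars) < min_alpha_ratio:
        return True
    # Too few vowels among the letters
    vowels = sum(ch in 'aeiouAEIOU' for ch in letters)
    return vowels / len(letters) < min_vowel_ratio

samples = [
    'The Governing Council first assessed current economic developments in the euro area.',
    '012345678 412084',
    '#$%&!',
    '$612321 BK >32"32I2J1320 p ABAK C XX 1!2q113621',
]
for s in samples:
    print(looks_like_noise(s), s)
```

A character-level heuristic like this only catches the obvious cases. A letters-only string such as `_abcdeafdbhbibafjkellmicfdein` passes both thresholds, which is where a language detector or dictionary check would be needed.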

In [140]:
!pip install ftfy

Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1


In [150]:
from ftfy import fix_and_explain, apply_plan, fix_text

In [149]:
sent_no = -2

sample_str = global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].iloc[sent_no]['text_raw']
# Override with a known mojibake example to demonstrate ftfy's repair plan
sample_str = "L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ"

fixed, explanation = fix_and_explain(sample_str)
fixed
explanation

'LóPEZ'

[('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('encode', 'latin-1'),
 ('decode', 'utf-8')]
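
The explanation above is a repair plan: unescape HTML entities three times, then re-encode as Latin-1 and decode as UTF-8. The sketch below replays the same kind of plan with the standard library only, to show what each step does. It is not how ftfy is implemented, and it uses standard-cased entity names (`&Atilde;`, `&sup3;`) because the standard library may not recognize the upper-cased variants that ftfy handles. The helper names are illustrative.

```python
import html

def unescape_repeatedly(text, max_rounds=5):
    """Apply html.unescape until the text stops changing (or max_rounds is hit)."""
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text

def undo_latin1_mojibake(text):
    """Re-encode as Latin-1 and decode as UTF-8; return the input if that fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

sample = 'L&amp;amp;Atilde;&amp;amp;sup3;PEZ'
step1 = unescape_repeatedly(sample)   # entities resolved, still mojibake
step2 = undo_latin1_mojibake(step1)   # UTF-8 bytes misread as Latin-1, undone
print(step1)
print(step2)
```

Unlike this sketch, ftfy decides whether a repair is plausible before applying it, which matters on text that is already correct.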

In [191]:
tail_indx_ls = list(range(-20, 0))
tail_indx_ls.reverse()
tail_indx_ls

head_indx_ls = list(range(1, 10))

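# Pick one: the last assignment wins, so comment out the option you do not want
# Option (a): first sentences
indx_ls = head_indx_ls
# Option (b): last sentences
indx_ls = tail_indx_ls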

for sent_no in indx_ls:

  print(f'Sentence Index = {sent_no}:')
  sample_str = global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].iloc[sent_no]['text_raw']
  # sample_str = "L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ"

  fixed = fix_text(sample_str)

  print(f'  Original Text:\n    [{sample_str}]')

  print(f'  Fixed Text:\n    [{fixed}]')

  fixed_alpha= re.sub('[^a-zA-Z]','',fixed)

  print(f'  Fixed Text w/o punct or numbers:\n    [{fixed_alpha}]\n\n')

[-1,
 -2,
 -3,
 -4,
 -5,
 -6,
 -7,
 -8,
 -9,
 -10,
 -11,
 -12,
 -13,
 -14,
 -15,
 -16,
 -17,
 -18,
 -19,
 -20]

Sentence Index = -1:
  Original Text:
    [012345678 412084]
  Fixed Text:
    [012345678 412084]
  Fixed Text w/o punct or numbers:
    []


Sentence Index = -2:
  Original Text:
    [#$%&!]
  Fixed Text:
    [#$%&!]
  Fixed Text w/o punct or numbers:
    []


Sentence Index = -3:
  Original Text:
    [24185152458824158788714516 !]
  Fixed Text:
    [24185152458824158788714516 !]
  Fixed Text w/o punct or numbers:
    []


Sentence Index = -4:
  Original Text:
    [$612321 BK >32"32I2J1320 p ABAK C XX 1!2q113621]
  Fixed Text:
    [$612321 BK >32"32I2J1320 p ABAK C XX 1!2q113621]
  Fixed Text w/o punct or numbers:
    [BKIJpABAKCXXq]


Sentence Index = -5:
  Original Text:
    [_abcdeafdbhbibafjkellmicfdein]
  Fixed Text:
    [_abcdeafdbhbibafjkellmicfdein]
  Fixed Text w/o punct or numbers:
    [abcdeafdbhbibafjkellmicfdein]


Sentence Index = -6:
  Original Text:
    [=36128132(2"]
  Fixed Text:
    [=36128132(2"]
  Fixed Text w/o punct or numbers:
    []


Sentence Index = -7:
  Or

In [126]:
!pip install chardet



In [127]:
import chardet


In [139]:
sent_no = 1

sample_str = global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].iloc[sent_no]['text_raw']

sample_bin = bytearray(sample_str, encoding ='utf-8')

print(f'Sentence #{sent_no}:\n\n    {sample_str}\n\n')

try:
  aencoding = chardet.detect(sample_bin)
except Exception:
  print('ERROR: Unable to detect encoding')

# NOTE: chardet detects the character encoding, not the language
print(f'ENCODING: {chardet.detect(sample_bin)}')

TypeError: ignored

In [113]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)

In [128]:
# from langdetect import detect
import langdetect

In [130]:
sent_no = 1

sample_str = global_vars.corpus_texts_dt[corpus_texts_ls[text_no]].iloc[sent_no]['text_raw']

print(f'Sentence #{sent_no}:\n\n    {sample_str}\n\n')

try:
  alang = langdetect.detect(sample_str)
except Exception:
  print('ERROR: Unrecognized language')

print(f'LANGUAGE: {langdetect.detect(sample_str)}')

Sentence #1:

    The Governing Council first assessed current economic developments in the euro area.


LANGUAGE: en


## Save Cleaned Corpus

In [99]:
# Verify in SentimentArcs Root Directory
os.chdir(Path_to_SentimentArcs)

print('Currently in SentimentArcs root directory:')
!pwd

# Verify the Subdir where Cleaned Texts will be saved and which Texts will be saved

print(f'\nSaving Clean Texts to Subdir: {SUBDIR_TEXT_CLEAN}')
print(f'\nSaving these Texts:\n  {global_vars.corpus_texts_dt.keys()}')

Currently in SentimentArcs root directory:
/gdrive/MyDrive/sentimentarcs_notebooks

Saving Clean Texts to Subdir: ./text_clean/text_clean_finance_ref

Saving these Texts:
  dict_keys(['bogfederalreserve_speech_1997-2022', 'eucentralbank_speeches_1998-2022'])


In [108]:
# Save the cleaned Textfiles

for i, (key_novel, anovel_df) in enumerate(global_vars.corpus_texts_dt.items()):
  anovel_fname = f'{key_novel}.csv'

  anovel_fullpath = f'{SUBDIR_TEXT_CLEAN}/{anovel_fname}'
  print(f'Saving Novel #{i} to {anovel_fullpath}')
  anovel_df.to_csv(anovel_fullpath)

Saving Novel #0 to ./text_clean/text_clean_finance_reference/bogfederalreserve_speech_1997-2022.csv
Saving Novel #1 to ./text_clean/text_clean_finance_reference/eucentralbank_speeches_1998-2022.csv
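
The save step assumes the target subdirectory already exists. A minimal standard-library sketch of the same idea that creates the directory first and reads the file back as a check; the helper `save_sentences_csv`, the sample rows, and the temporary directory are all hypothetical:

```python
import csv
import tempfile
from pathlib import Path

def save_sentences_csv(out_dir, text_key, rows):
    """Write (text_raw, text_clean) rows to <out_dir>/<text_key>.csv and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the subdir if it is missing
    csv_path = out_path / f'{text_key}.csv'
    with open(csv_path, 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(['text_raw', 'text_clean'])
        writer.writerows(rows)
    return csv_path

# Hypothetical rows and a temporary directory, for illustration only
rows = [('The RAin in SPain', 'the rain in spain'),
        ('WTF?!?! Do you KnoW...', 'wtf do you know')]
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_sentences_csv(Path(tmp_dir) / 'text_clean_demo', 'demo_text', rows)
    with open(saved, newline='', encoding='utf-8') as fp:
        loaded = list(csv.reader(fp))
print(loaded[0])
```

Note that pandas `DataFrame.to_csv()` also writes the index as an extra first column by default; passing `index=False` omits it.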


In [102]:
%whos str


Variable                Type    Data/Info
-----------------------------------------
Corpus_Genre            str     finance
Corpus_Type             str     reference
PATH_TEXT_CLEAN         str     ./text_clean/text_clean_finance_ref
PATH_TEXT_RAW           str     ./text_raw/text_raw_finance_ref
PATH_UTILS              str     /gdrive/MyDrive/sentimentarcs_notebooks/utils
Path_to_SentimentArcs   str     /gdrive/MyDrive/sentimentarcs_notebooks/
SUBDIR_TEXT_CLEAN       str     ./text_clean/text_clean_finance_ref
SUBDIR_TEXT_RAW         str     text_raw_finance_ref
akey                    str     European Central Bank Spe<...>hes (Jul 1998 - Mar 2022)
anovel_fname            str     eucentralbank_speeches_1998-2022.csv
anovel_fullpath         str     ./text_clean/text_clean_f<...>nk_speeches_1998-2022.csv
file_fullpath           str     ./text_raw/text_raw_finan<...>nk_speeches_1998-2022.txt
file_root               str     eucentralbank_speeches_1998-2022
key_novel               str     

In [107]:
SUBDIR_TEXT_CLEAN

'./text_clean/text_clean_finance_reference'

In [105]:
PATH_TEXT_CLEAN

'./text_clean/text_clean_finance_reference'

# **[END OF NOTEBOOK]**