### Reports Automation
- The objective is to automatically populate reports with the company name, the text sentences grouped into their categories (predicted by the text classification model), and the company's overall sentiment (Positive/Negative/Neutral) from the ABSA sentiment analysis model
- We use the **Gramex** library to achieve this objective

**Prepare Data**
- Input file: JSON file produced by the ABSA model's inference step
- Output: pandas data frame
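
As a rough illustration of this conversion, the sketch below flattens the document-level fields of one JSON-lines record into a data frame row. The field names match the ones read in this notebook. The sample values and the shape of `scores` are assumptions invented for the example, not real model output.

```python
import json
import pandas as pd

# Hypothetical record; all values are invented for illustration only.
sample_line = json.dumps({
    "sent_id": 1,
    "_vendor_name": "Acme Corp",
    "_news_text": "Acme Corp reported strong growth.",
    "_doc_polarity": "Positive",
    "scores": {"Positive": 0.8, "Negative": 0.1, "Neutral": 0.1},
})

def record_to_row(line):
    """Flatten the document-level fields of one JSON line into a dict."""
    record = json.loads(line)
    scores = record.get("scores", {})
    return {
        "doc_id": record["sent_id"],
        "company_name": record["_vendor_name"],
        "news_text": record["_news_text"],
        "d_pol": record["_doc_polarity"],
        # Raw scores are kept as numbers here (an assumption about their type)
        "positive": scores.get("Positive"),
        "negative": scores.get("Negative"),
        "neutral": scores.get("Neutral"),
    }

# Collect rows in a list and build the frame once, which is cheaper
# than growing a DataFrame cell by cell.
rows = [record_to_row(sample_line)]
documents_df = pd.DataFrame(rows)
```

Sentence-level fields would need a similar helper once the exact layout of each `_sentences` entry is confirmed.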

In [None]:
# Import libraries
import pandas as pd
from tqdm import tqdm
import json, re
from nltk import flatten
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.svm import LinearSVC
import joblib

In [None]:
# convert json file contents into data frame
with open('../output/step2_competitor_news_data_sentiment_scores.json', 'r') as jsonFile:
    lines = jsonFile.readlines()
    sentiment_labels = pd.DataFrame()
    for l_id, line in enumerate(tqdm(lines)):
        line = json.loads(line)
        document_df = pd.DataFrame()
        document_df.loc[0, 'news_text'] = line['_news_text']
        document_df.loc[0, 'd_pol'] = line['_doc_polarity']
        document_df.loc[0, 'doc_id'] = line['sent_id']
        document_df.loc[0, 'company_name'] = line["_vendor_name"]
        document_df.loc[0, 'positive'] = str(list(filter(None, [v if k == 'Positive' else 0 for k, v in line['scores'].items()])))
        document_df.loc[0, 'negative'] = str(list(filter(None, [v if k == 'Negative' else 0 for k, v in line['scores'].items()])))
        document_df.loc[0, 'neutral'] = str(list(filter(None, [v if k == 'Neutral' else 0 for k, v in line['scores'].items()])))
        sentence_df = pd.DataFrame()
        for s_id, sent in enumerate(line['_sentences']):
            sentence_df.loc[s_id, 'sents'] = [v for k, v in sent.items() if v][0]
            sentence_df.loc[s_id, 's_pol'] = [v if v else '' for k, v in sent.items()][3]
            words_dict = [v if v else '' for k, v in sent.items()][1]
            if type(words_dict) == dict:
                sentence_df.loc[s_id, 'terms_neg'] = str(list(filter(None, flatten([v if k == 'NEG' else '' for k, v in words_dict.items()]))))
                sentence_df.loc[s_id, 'terms_pos'] = str(list(filter(None, flatten([v if k == 'POS' else '' for k, v in words_dict.items()]))))
            else:
                # no term dictionary for this sentence; move on to the next one
                continue
        document_df = pd.concat([document_df, sentence_df], axis = 1)
        sentiment_labels = pd.concat([sentiment_labels, document_df], axis = 0)

In [None]:
# remove pos / neg tags attached in the sentences column
sentiment_labels['sents'] = sentiment_labels['sents'].apply(lambda x: re.sub(r'<NEG>|<POS>', '', str(x)))
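
To make the tag removal concrete, here is a small standalone sketch of the same substitution. The tagged sentence is hypothetical, and the exact position of the tags in real ABSA output may differ.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
TAG_PATTERN = re.compile(r"<NEG>|<POS>")

def strip_sentiment_tags(sentence):
    """Remove <NEG>/<POS> markers and leave the rest of the text untouched."""
    return TAG_PATTERN.sub("", str(sentence))

# Hypothetical tagged sentence, invented for illustration
tagged = "Sales <POS>improved while costs <NEG>rose."
cleaned = strip_sentiment_tags(tagged)
```

Because the value is passed through `str()`, a missing sentence (NaN) becomes the literal string `nan` rather than staying empty, so such rows may need filtering before embedding.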

**Predict Categories**

- We use the pre-trained **Universal Sentence Encoder** model from **TensorFlow Hub** to obtain vector representations of the sentences in our corpus
- We then pass these vectors to the classifier model we developed to predict a category for each sentence
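
The two steps above (embed, then classify) can be sketched without downloading any model. In the sketch below, small synthetic vectors stand in for the sentence embeddings, and a freshly trained `LinearSVC` stands in for the saved classifier. The labels `category_a` and `category_b` are hypothetical, and whether the saved model is a `LinearSVC` is an assumption based on the notebook's imports.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in "embeddings": two well-separated clusters in 4 dimensions.
# Real Universal Sentence Encoder vectors have 512 dimensions.
rng = np.random.default_rng(0)
train_vectors = np.vstack([
    rng.normal(loc=-1.0, scale=0.1, size=(10, 4)),
    rng.normal(loc=1.0, scale=0.1, size=(10, 4)),
])
train_labels = ["category_a"] * 10 + ["category_b"] * 10  # hypothetical labels

# Train a linear SVM on the vectors, as a stand-in for the saved model
classifier = LinearSVC()
classifier.fit(train_vectors, train_labels)

# Predict one category per new vector (one vector per sentence)
new_vectors = np.array([[-1.0, -1.0, -1.0, -1.0], [1.0, 1.0, 1.0, 1.0]])
predictions = classifier.predict(new_vectors)
```

The classifier only ever sees fixed-length vectors, which is why the same embedding model must be used at training and prediction time.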

In [None]:
# Load the Universal Sentence Encoder model from TensorFlow Hub (downloaded on first use)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

In [None]:
# helper function to generate embeddings
def embed_text(text):
    '''
    args: list of sentences
    returns: list of embedding vectors (one list of floats per sentence)
    '''
    embeddings = embed(text)
    return [vector.numpy().tolist() for vector in embeddings]

In [None]:
# helper function to load vectors into a dataframe
def vectors_to_df(embed_vectors):
    # one row per sentence, one column per embedding dimension
    # (built in a single call; DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(embed_vectors)

In [None]:
# load svm model 
svm_model = joblib.load('../input/sent_classifier_model/sent_classifier/svm_model_wt_use.pkl')

In [None]:
# collect the sentences and prepare the embeddings
sents = sentiment_labels['sents'].tolist()
text_vectors = embed_text(sents)
vectors_df = vectors_to_df(text_vectors)

In [None]:
# predict categories
preds = svm_model.predict(vectors_df)

In [None]:
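# attach prediction results to the original data frame by row position
# (sentiment_labels has a repeated index after the per-document concat, so it is reset first;
# the 'category' column name is our choice)
results_df = sentiment_labels.reset_index(drop = True)
results_df['category'] = preds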
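
One point to watch when combining predictions with the data frame: frames stacked with `pd.concat(..., axis = 0)` keep their per-document index, so index labels repeat and an index-based join can pair rows with the wrong prediction. A minimal sketch with made-up frames, showing alignment by row position instead:

```python
import pandas as pd

# Two hypothetical per-document frames, each indexed from 0
doc_a = pd.DataFrame({"sents": ["a1", "a2"]})
doc_b = pd.DataFrame({"sents": ["b1"]})
stacked = pd.concat([doc_a, doc_b], axis=0)  # index labels are 0, 1, 0

# One prediction per row, in row order (labels are invented)
preds = ["cat_x", "cat_y", "cat_z"]

# Align by position: reset the index, then assign predictions as a column
aligned = stacked.reset_index(drop=True)
aligned["category"] = preds
```

This relies on the predictions being produced in the same row order as the data frame, which holds when the sentences are embedded and classified without reordering.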

In [None]:
# write dataframe to flat file
results_df.to_csv('../output/step2_output_sentiment_scores_categories.csv', index = False)