# Contents and Description
In this notebook, we
1. create training data for fine-tuning our LLM sentiment analyzer <br>
&nbsp; - The output is a 1_LLM_finetuning_training_dataset.pkl file<br><br>
2. experiment with creating a firm summarizer for multiple reviews.<br>
&nbsp; - The actual summarizing is not performed in this notebook but is instead carried out in a 1_summarizer.py file<br>
&nbsp; - The output will be summary text files in an output folder (not included in GitHub repo)<br><br>
3. create a dataframe from the firm summary reviews <br>
&nbsp; - The output will be a 1_summary_reviews.csv file<br><br>

We also explored RAG for our sentiment analyzer but opted not to continue with the approach.

Contents:
1. [Imports and installs](#imports)
2. [Create training data for fine-tuning](#createdata)
3. [Create firm summary reviews experiment](#summarizer)
4. [Create dataframe from firm summaries](#output)
5. [Exploration on using RAG (not utilized)](#rag)

---
# Imports and installs <a name="imports"></a>
1. Install Ollama.
2. Run ollama pull \<model to be used\> (or ollama run \<model to be used\>) to download the model.
3. Run 'ollama serve' in a terminal if the service is not already running.

Check CUDA and other driver requirements if Ollama should run on a GPU.

In [1]:
import ollama
import os
import random
import pickle
import json
import re
import math
import pandas as pd
from tqdm.notebook import tqdm
# import below for RAG
#from langchain_chroma import Chroma
#from langchain_text_splitters import RecursiveCharacterTextSplitter
#from langchain_community.embeddings import OllamaEmbeddings

In [2]:
df = pd.read_csv('all_reviews.csv')


In [3]:
df.head()

Unnamed: 0,rating,title,status,pros,cons,advice,Recommend,CEO Approval,Business Outlook,Career Opportunities,Compensation and Benefits,Senior Management,Work/Life Balance,Culture & Values,Diversity & Inclusion,firm_link,date,job,index
0,5.0,Good,"Current Employee, more than 10 years",Knowledge gain of complete project,Financial growth and personal growth,,v,o,v,3.0,3.0,3.0,3.0,3.0,3.0,Reviews/Baja-Steel-and-Fence-Reviews-E5462645.htm,"Nov 19, 2022",Manager Design,
1,4.0,Good,"Former Employee, less than 1 year","Good work,good work , flexible, support","Good,work, flexible,good support, good team work",,v,o,o,4.0,4.0,4.0,4.0,4.0,4.0,Reviews/Baja-Steel-and-Fence-Reviews-E5462645.htm,"Jan 29, 2022",Anonymous Employee,
2,4.0,"Supervising the manufacturing the processes, e...","Current Employee, more than 1 year",This company is a best opportunity for me to l...,"Monthly Target work,Maintain production schedu...",,v,o,v,2.0,3.0,2.0,2.0,2.0,2.0,Reviews/Baja-Steel-and-Fence-Reviews-E5462645.htm,"Aug 12, 2021",Production Engineer,
3,1.0,terrible,"Current Employee, more than 1 year",I wish there were some to list,too many to list here,,x,x,x,1.0,3.0,1.0,3.0,1.0,,https://www.glassdoor.com/Reviews/Calgary-Flam...,"Sep 24, 2020",Senior Account Executive,
4,4.0,"It could be so good, but it isn’t","Current Employee, more than 3 years",Fast Paced. Endless challenges. Inclusive envi...,The biggest perk of the job provides no value ...,,o,o,o,3.0,3.0,3.0,1.0,4.0,5.0,https://www.glassdoor.com/Reviews/Calgary-Flam...,"Mar 25, 2023",Assistant Manager,


---
# Create training data for fine-tuning <a name="createdata"></a>

In [76]:
# create mapping dict
cols = dict(zip(list(df.columns)[9:15],['co','cb','sm','wlb','cv','di']))
cols

{'Career Opportunities': 'co',
 'Compensation and Benefits': 'cb',
 'Senior Management': 'sm',
 'Work/Life Balance': 'wlb',
 'Culture & Values': 'cv',
 'Diversity & Inclusion': 'di'}

In [None]:
train={}
for k,v in cols.items(): # iterate through 6 categories
    col = pd.to_numeric(df[k],errors='coerce') # set values to numeric, else NA
    train[v]={}
    for rating in range(1,6): # create 1k training reviews for each rating value (1 to 5)
        i=0
        train[v][rating]=[]
        indices = list(col[col==rating].index) # indices of rows where rating of category matches
        pbar = tqdm(total=1000,desc=f'{v} {rating} rating') # instantiate progress bar
        while i<1000 and indices: # iterate till 1k rows are obtained, or until matching rows run out
            index = indices.pop(random.randrange(len(indices))) # randomly choose (and remove) a matching row index
            row = df.loc[index]
            review = f"pros: {row['pros']}\ncons: {row['cons']}" # set pros and cons to review string
            # prompt ollama to id if reviews have anything to do with category
            # if yes, store review as training data and increment counter i
            # if no, *do not* increment counter, move on to another row
            response = ollama.chat(
                model="llama3.2",
                messages=[
                    {
                        "role": "user",
                        "content": f"Does the following review have anything to do with {k.lower()}?\n"\
                        +review\
                        +"\nreply with only yes or no",
                    },
                ],
            )

            if 'yes' in response["message"]["content"].lower():
                train[v][rating].append(review)
                i = i+1
                pbar.update(1)
        pbar.close()

co 1 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

co 2 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

co 3 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

co 4 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cb 3 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cb 4 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cb 5 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

sm 1 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

sm 2 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

sm 3 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

sm 4 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

sm 5 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cv 1 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cv 2 rating:   0%|          | 0/1000 [00:00<?, ?it/s]

cv 3 rating:   0%|          | 0/1000 [00:00<?, ?it/s]
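
The generation cell above does not show how `train` is written to disk, although the cell that follows loads it from `1_LLM_finetuning_training_dataset.pkl`. Below is a minimal sketch of that save step, assuming the nested `{category: {rating: [reviews]}}` layout built above. The `save_training_data` helper and the tiny `train_example` dict are hypothetical and only for illustration.

```python
import os
import pickle
import tempfile


def save_training_data(train, path):
    """Serialize the nested {category: {rating: [reviews]}} dict to a pickle file."""
    with open(path, 'wb') as file:
        pickle.dump(train, file)


# hypothetical stand-in for the real `train` dict
train_example = {
    'co': {
        1: ['pros: none\ncons: no growth'],
        5: ['pros: fast promotions\ncons: none'],
    }
}

# written to a temporary folder here; the notebook would instead use
# '1_LLM_finetuning_training_dataset.pkl'
path = os.path.join(tempfile.mkdtemp(), 'training_dataset_example.pkl')
save_training_data(train_example, path)

# reload to confirm the round trip preserves the structure
with open(path, 'rb') as file:
    reloaded = pickle.load(file)
```

Since the generation loop can run for many hours, it may also be worth calling such a helper after each category finishes so that partial progress survives an interruption.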

In [89]:
# load training data from pkl
with open('1_LLM_finetuning_training_dataset.pkl','rb') as file:
    train=pickle.load(file)
train.keys() # check that all categories are present in training data

dict_keys(['co', 'cb', 'sm', 'cv', 'di', 'wlb'])

---
# Create summary reviews per firm (experiment)<a name="summarizer"></a>

In [2]:
# load processed df
with open('1_df.pkl','rb') as file:
    df = pickle.load(file)

In [3]:
# show top 10 firms by review count
df.groupby('firm').size().reset_index(name='count').sort_values(by='count',ascending=False).head(10)

Unnamed: 0,firm,count
1652,Amazon,163396
29274,Tata-Consultancy-Services,107218
32691,Walmart,102152
6831,Cognizant-Technology-Solutions,84171
19010,McDonald-s,76777
779,Accenture,69026
29243,Target,67885
13178,HP-Inc,63787
28064,Starbucks,55325
15053,Infosys,53189


In [4]:
firm='Infosys' # choose firm to test on
firm_df = df[df.firm==firm]

We run the summarizer by sending the LLM 5 reviews at a time (pros and cons) and prompting it to summarize them into one sentence per category for the pros and one for the cons, across the 6 categories (an empty string should be returned if no review covers a category).<br>
We then chain the process by adding each summary to the next batch of reviews and having the LLM summarize again.<br>
We use 5 reviews per iteration because testing with 10 produced problematic outputs with Llama 3.2. This is likely a context window issue; other models may fare better or worse.<br>
One problem with this iterative method is that reviews from later iterations carry more weight in the final summary. A better approach would be to summarize the reviews in batches, then summarize those batch summaries with equal weight until a single final summary remains. Due to time and computation limitations, we use the iterative method.
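
The batched alternative described above was not implemented in this notebook. The following is only a sketch of the idea, assuming a `summarize` callable that takes a list of texts and returns one summary string (in practice this would wrap an LLM call). The names `batch`, `hierarchical_summary` and `fake_summarize` are hypothetical.

```python
def batch(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def hierarchical_summary(texts, summarize, batch_size=5):
    """Summarize batches, then summarize the summaries, until one text remains."""
    if batch_size < 2:
        raise ValueError('batch_size must be at least 2, otherwise levels never shrink')
    if not texts:
        return ''
    # first level: summarize the raw reviews in batches
    level = [summarize(chunk) for chunk in batch(list(texts), batch_size)]
    # later levels: summarize the summaries until a single one is left
    while len(level) > 1:
        level = [summarize(chunk) for chunk in batch(level, batch_size)]
    return level[0]


# stand-in summarizer that records how many texts it received per call
calls = []


def fake_summarize(chunk):
    calls.append(len(chunk))
    return f'summary of {len(chunk)}'


reviews_example = [f'review {i}' for i in range(12)]
final = hierarchical_summary(reviews_example, fake_summarize, batch_size=5)
```

With 12 reviews and a batch size of 5, the stand-in is called on batches of 5, 5 and 2 reviews, then once on the 3 resulting summaries. Weighting is only approximately equal when the last batch is smaller than the others.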

In [None]:
test = False # if True, test with 2 iterations
start_from = 0 # change if continuing iterations part way

if test:
    rng = range(2)
else:
    rng = range(start_from,math.ceil(len(firm_df)/5))

# if continuing from part way, change the pros and cons lists to the output from the previous iteration.
pros=[]
cons=[]
    
for i in tqdm(rng): # set up progress bar
    for attempt in range(5): 
        # we allow 5 tries as the LLM occasionally gives badly formatted output which results in errors.
        # within 5 attempts the LLM should produce an output we can use.
        try:
            chunk = firm_df.iloc[i*5:(i*5)+5] # extract chunk of 5 reviews sets

            response = ollama.chat(
                model="llama3.2",
                messages=[
                    {
                        "role": "system",
                        "content": """
                        You reply in json format like this:
                        {
                            "pros": { "career opportunities": "one positive sentence describing career opportunities else return an empty string",
                                    "compensation and benefits": "one positive sentence describing compensation and benefits else return an empty string",
                                    "senior management": "one positive sentence describing senior management else return an empty string",
                                    "work life balance": "one positive sentence describing work life balance else return an empty string",
                                    "culture and values": "one positive sentence describing culture and values else return an empty string",
                                    "diversity and inclusion": "one positive sentence describing diversity and inclusion else return an empty string"},
                            "cons": { "career opportunities": "one negative sentence describing career opportunities else return an empty string",
                                    "compensation and benefits": "one negative sentence describing compensation and benefits else return an empty string",
                                    "senior management": "one negative sentence describing senior management else return an empty string",
                                    "work life balance": "one negative sentence describing work life balance else return an empty string",
                                    "culture and values": "one negative sentence describing culture and values else return an empty string",
                                    "diversity and inclusion": "one negative sentence describing diversity and inclusion else return an empty string"}
                        }
                        There should be no escape characters in the output.
                        """,
                    },
                    {
                        "role": "user",
                        "content": f"""
                        Here are positive reviews {chunk.pros.to_list()+pros}.\n
                        Here are negative reviews {chunk.cons.to_list()+cons}.""",
                    },
                ],
                options={'num_ctx':6144} # set context window. default window is 2048 tokens
            )

            resp = response['message']['content']
            if i%100==0: # write summary output to text file every 100 iterations
                with open('resp.txt','w') as f:
                    f.write(resp)    
            resp_dict = json.loads(resp[resp.find('{'):resp.rfind('}')+1]) # load json output string to dictionary
            pros = list(resp_dict['pros'].values()) # set pros values to list
            cons = list(resp_dict['cons'].values()) # set cons values to list
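        except Exception: # e.g. malformed JSON or missing keys; a bare except would also swallow KeyboardInterrupt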
            print(f'exception in attempt {attempt+1}')
        else:
            break
with open('resp.txt','w') as f: # write final summary output to text file
    f.write(resp)    

  0%|          | 0/10638 [00:00<?, ?it/s]

exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 1
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 1
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 1
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 1
exception in attempt 0
exception in attempt 0
exception in attempt 0
exception in attempt 1
exception in attempt 0
exception in attempt 0
exception i
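
The exceptions logged above are presumably caused by replies that are not valid JSON or that lack the expected keys. As a sketch only, the parsing step could be isolated in a helper that extracts the outermost `{...}` span, validates its shape, and raises a single error type for the retry loop to catch. The names `CATEGORIES` and `parse_summary` are hypothetical and not part of the notebook.

```python
import json

# category keys requested in the system prompt
CATEGORIES = [
    'career opportunities',
    'compensation and benefits',
    'senior management',
    'work life balance',
    'culture and values',
    'diversity and inclusion',
]


def parse_summary(resp):
    """Return (pros, cons) lists in CATEGORIES order, or raise ValueError."""
    start, end = resp.find('{'), resp.rfind('}')
    if start == -1 or end <= start:
        raise ValueError('no JSON object found in response')
    try:
        data = json.loads(resp[start:end + 1])
    except json.JSONDecodeError as err:
        raise ValueError(f'malformed JSON: {err}') from err
    if not isinstance(data, dict):
        raise ValueError('top-level JSON value is not an object')
    for section in ('pros', 'cons'):
        if not isinstance(data.get(section), dict):
            raise ValueError(f'missing or invalid "{section}" section')
    # missing categories fall back to an empty string
    pros = [str(data['pros'].get(c, '')) for c in CATEGORIES]
    cons = [str(data['cons'].get(c, '')) for c in CATEGORIES]
    return pros, cons


# hypothetical reply with extra chatter around the JSON object
reply = (
    'Sure, here it is:\n'
    '{"pros": {"career opportunities": "Good growth."}, '
    '"cons": {"senior management": "Poor communication."}}\n'
    'Hope this helps.'
)
pros_example, cons_example = parse_summary(reply)
```

Returning fixed-length lists keeps the chained prompt stable even when the model omits a category.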

In [12]:
print(resp)

{
  "pros": {
    "career opportunities": "Some of the other contractors I worked with who were also employed by InfoSys were nice, Very good projects to work on, Lots of opportunities to move around different technologies, High Job Security and you will lot of support from other team members., It’s Good company to work., Lots of learning opportunities and career paths Experience working in varying environments.",
    "compensation and benefits": "decent pay for entry-level, Excellent benefits, amazing training opportunities, good pay, and an excellent 401k plan with low monthly premiums and low deductible.", 
    "senior management": "",
    "work life balance": "Flexibility working from home\nNumerous PTO", 
    "culture and values": "", 
    "diversity and inclusion": ""
  },
  "cons": {
    "career opportunities": "They didn’t allow anyone to take any vacation or PTO of any kind (lol you probably think I’m joking but I’m not at all)., - travelling\n- not very flexible", 
    "compe

---
# Outputs to df<a name="output"></a>

In [3]:
filenames = os.listdir('./output')

In [11]:
firms = [x.split('_')[1] for x in filenames[:] if x.endswith('_19.txt')]
df_files = [x for x in filenames[:] if x.endswith('_19.txt')]

In [50]:
reviews = {}
zipped = list(zip(firms[:],df_files[:]))
for i in (pbar := tqdm(range(len(zipped)))):
    firm,df_file = zipped[i]
    pbar.set_description(df_file)
    try:
        with open(os.path.join('./output',df_file),'r') as f:
            filetxt = f.read()
    except UnicodeDecodeError: # default encoding failed, retry as utf-8
        with open(os.path.join('./output',df_file),'r',encoding='utf-8') as f:
            filetxt = f.read()
    filedict = json.loads('{'+filetxt.split('{',maxsplit=1)[1].rsplit('}',maxsplit=1)[0]+'}')
    reviews[firm] = filedict

  0%|          | 0/1000 [00:00<?, ?it/s]

In [93]:
col_headers = ['pros: '+x for x in list(reviews['AMR']['pros'].keys())]+\
                ['cons: '+x for x in list(reviews['AMR']['cons'].keys())]
col_headers

['pros: career opportunities',
 'pros: compensation and benefits',
 'pros: senior management',
 'pros: work life balance',
 'pros: culture and values',
 'pros: diversity and inclusion',
 'cons: career opportunities',
 'cons: compensation and benefits',
 'cons: senior management',
 'cons: work life balance',
 'cons: culture and values',
 'cons: diversity and inclusion']

In [90]:
def reviews_to_list(review_dict):
    p_co = review_dict['pros'].get('career opportunities','')
    p_cb = review_dict['pros'].get('compensation and benefits','')
    p_cv = review_dict['pros'].get('culture and values','')
    p_di = review_dict['pros'].get('diversity and inclusion','')
    p_sm = review_dict['pros'].get('senior management','')
    p_wlb = review_dict['pros'].get('work life balance','')
    n_co = review_dict['cons'].get('career opportunities','')
    n_cb = review_dict['cons'].get('compensation and benefits','')
    n_cv = review_dict['cons'].get('culture and values','')
    n_di = review_dict['cons'].get('diversity and inclusion',review_dict['cons'].get('Diversity and inclusion',''))
    n_sm = review_dict['cons'].get('senior management','')
    n_wlb = review_dict['cons'].get('work life balance','')
    review_list = [p_co,p_cb,p_sm,p_wlb,p_cv,p_di,n_co,n_cb,n_sm,n_wlb,n_cv,n_di]
    return review_list
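
The function above special-cases one capitalized key ('Diversity and inclusion'), which suggests the LLM does not always return keys in the requested case. A possible generalization, shown here only as a sketch, is to lowercase all keys before lookup. `normalize_keys`, `CATEGORIES` and `reviews_to_list_normalized` are hypothetical names.

```python
CATEGORIES = [
    'career opportunities',
    'compensation and benefits',
    'senior management',
    'work life balance',
    'culture and values',
    'diversity and inclusion',
]


def normalize_keys(section):
    """Return a copy of a pros/cons dict with lowercased, trimmed keys."""
    return {str(key).strip().lower(): value for key, value in section.items()}


def reviews_to_list_normalized(review_dict):
    """Return 6 pros then 6 cons in CATEGORIES order, ignoring key case."""
    pros = normalize_keys(review_dict.get('pros', {}))
    cons = normalize_keys(review_dict.get('cons', {}))
    return [pros.get(c, '') for c in CATEGORIES] + [cons.get(c, '') for c in CATEGORIES]


# hypothetical summary with inconsistent key capitalization
example = {
    'pros': {'Career Opportunities': 'Clear promotion path'},
    'cons': {'Diversity and inclusion': 'Little diversity in leadership'},
}
row = reviews_to_list_normalized(example)
```

The output order matches the column headers built earlier (all pros categories, then all cons categories).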

In [92]:
df_cols = []
firms = reviews.keys()
for firm in firms:
    df_cols.append(reviews_to_list(reviews[firm]))

In [107]:
summ_df = pd.DataFrame(df_cols,index=firms,columns=col_headers).reset_index().rename(columns={'index':'firm'})
summ_df.head(3)

Unnamed: 0,firm,pros: career opportunities,pros: compensation and benefits,pros: senior management,pros: work life balance,pros: culture and values,pros: diversity and inclusion,cons: career opportunities,cons: compensation and benefits,cons: senior management,cons: work life balance,cons: culture and values,cons: diversity and inclusion
0,AMR,AMR has multiple locations throughout the glob...,Good pay with ability to make a lot more with ...,,Only work 3 days a week. Every other 3 day wee...,"Great company culture, fast moving industry, o...",,You will get called in a lot. Mandatory overti...,Have to work a lot of O/T because always short...,Management has no idea what they’re doing or h...,"On Call hours can be very high, high stress en...","Very bad upper management, Lack of culture Shi...",-Two Separate Division with Different Contract...
1,International-Flavors-&-Fragrances,The company offers always new opportunities an...,Good health care benefits and pay; Really good...,Agile & flexible team; management that cares f...,Work-life balance as the manufacturing sides a...,Financially Solid Company with good values,Most people are good to deal with; The company...,"Flat structure, limited hierarchy and presumab...","Pay is not great, especially when you’re short...",Micromanagement runs rampant. Procurement etc....,No work life balance at all; High pressure and...,Too much bereaucracy i guess.,the wages are a bit lower compared to other mu...
2,Interserve,Opportunity to grow and development,"Good salary, excellent pay, Good hours, paid o...",Some senior managers willing to help and advise,Good work life balance - plenty of opportuniti...,"good and friendly culture and friends, Good pe...",,None. Shocking health and safety. No wellbeing...,"Low salaries not clear career path, Minimum ho...",Senior management. Shareholders. Opaque (no or...,"Work can be boring, Work life balance could be...","very flat culture, no promotions",


In [111]:
ratings_df = pd.read_csv('1_firm_summary.csv').rename(columns={'Unnamed: 0':'index'})

In [119]:
comb_df = summ_df.merge(ratings_df,on='firm',how='left')
comb_df.head(3)

Unnamed: 0,firm,pros: career opportunities,pros: compensation and benefits,pros: senior management,pros: work life balance,pros: culture and values,pros: diversity and inclusion,cons: career opportunities,cons: compensation and benefits,cons: senior management,...,cons: culture and values,cons: diversity and inclusion,index,opportunities,compensation,management,worklife_balance,culture,diversity,kmeans_labels
0,AMR,AMR has multiple locations throughout the glob...,Good pay with ability to make a lot more with ...,,Only work 3 days a week. Every other 3 day wee...,"Great company culture, fast moving industry, o...",,You will get called in a lot. Mandatory overti...,Have to work a lot of O/T because always short...,Management has no idea what they’re doing or h...,...,"Very bad upper management, Lack of culture Shi...",-Two Separate Division with Different Contract...,877,3.02,2.82,2.61,2.64,2.69,3.3,3
1,International-Flavors-&-Fragrances,The company offers always new opportunities an...,Good health care benefits and pay; Really good...,Agile & flexible team; management that cares f...,Work-life balance as the manufacturing sides a...,Financially Solid Company with good values,Most people are good to deal with; The company...,"Flat structure, limited hierarchy and presumab...","Pay is not great, especially when you’re short...",Micromanagement runs rampant. Procurement etc....,...,Too much bereaucracy i guess.,the wages are a bit lower compared to other mu...,1970,3.28,3.41,3.0,3.35,3.49,3.75,4
2,Interserve,Opportunity to grow and development,"Good salary, excellent pay, Good hours, paid o...",Some senior managers willing to help and advise,Good work life balance - plenty of opportuniti...,"good and friendly culture and friends, Good pe...",,None. Shocking health and safety. No wellbeing...,"Low salaries not clear career path, Minimum ho...",Senior management. Shareholders. Opaque (no or...,...,"very flat culture, no promotions",,2964,2.59,2.78,2.58,2.88,2.77,3.24,3


In [120]:
comb_df.to_csv('1_summary_reviews.csv')

---
# RAG method (explored part way, not utilized) <a name="rag"></a>

In the RAG method, we give the LLM fast access to relevant context through semantic search over a vector store.<br>
Steps:
- Split documents into chunks
- Convert document chunks to embeddings
- Use ChromaDB to store embeddings in a vector store (sqlite db will be created)
- Perform semantic search on keywords related to the prompt, return top <i>n</i> matched documents
- Matched documents are used as prompt context

For our purposes, an example of this process would be to run a semantic search for the 10 reviews most related to 'work life balance' and prompt the LLM to summarize these reviews.
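
To make the semantic search step concrete, here is a toy sketch that ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are hand-made for illustration and are not real embeddings; in the notebook's setting the vectors would come from an embedding model and the ranking would be done by the vector store. `cosine_similarity` and `top_n` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, doc_vecs, docs, n=2):
    """Return the n documents whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(docs, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:n]]


# toy data: the first axis loosely stands for "work life balance"
docs = [
    'Flexible hours and remote work',
    'Great pay and bonuses',
    'Long shifts, little time off',
]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.0, 0.3]]
query_vec = [1.0, 0.0, 0.0]

matches = top_n(query_vec, doc_vecs, docs, n=2)
```

The matched documents would then be placed in the prompt as context for the LLM to summarize.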

In [9]:
nvy_pros = df[df.firm=='US-Navy'].pros.to_list()

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

In [11]:
all_splits = text_splitter.create_documents(nvy_pros)

The maximum batch size was exceeded when creating the vector store below.<br>
Embedding generation and insertion into the vector store should instead be done in batched loops.<br>
We did not continue the RAG exploration beyond this point.

In [12]:
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(model="llama3.2", show_progress=True),
    persist_directory="./chroma_db",
)
vectorstore.persist()

OllamaEmbeddings: 100%|█████████████████████████████████████████████████████████| 43697/43697 [1:09:06<00:00, 10.54it/s]


ValueError: Batch size 43697 exceeds maximum batch size 41666
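
A batched loop of the kind suggested above might look like the sketch below. It is written against a generic `add_fn` callable so that it runs without a vector store. In the notebook, `add_fn` could plausibly be the `add_documents` method of an already created `Chroma` store, but that wiring is an assumption and was not tested here. `iter_batches` and `add_in_batches` are hypothetical names, and the batch size of 5000 is arbitrary (it only needs to stay below the 41666 limit reported in the error).

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(add_fn, documents, batch_size=5000):
    """Call add_fn once per batch and return the number of documents passed on."""
    total = 0
    for chunk in iter_batches(documents, batch_size):
        add_fn(chunk)  # assumption: e.g. vectorstore.add_documents in the notebook
        total += len(chunk)
    return total


# stand-in for the vector store: collect the batches in a list
received = []
added = add_in_batches(received.append, list(range(43697)), batch_size=5000)
```

With 43697 chunks, as in the failed run, this produces 9 calls, the last one carrying the remaining 3697 chunks.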