## Scaling ML Prototype

We initially wanted to run the model directly on the server as a standalone tool. However, prototype testing made it apparent that running an LLM locally requires a lot of computing power, especially GPU capacity. With the currently available resources, the latency for each request with Llama 2 7B came to around 5 minutes, which means analyzing 5 pieces of feedback would take roughly 25 minutes. This was a showstopper and ultimately led me to change course and use a paid API. With that decision made, there was no specific reason to stay with Llama 2.
Therefore, the rest of the project will be done via the OpenAI API with GPT-3.5.
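
The latency figures above can be reproduced with a small timing helper. This is a minimal sketch, not the measurement code used for the prototype: `timed` and `estimate_batch_minutes` are hypothetical names, and the estimate assumes requests run one after another.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def estimate_batch_minutes(n_requests, minutes_per_request):
    """Estimated wall-clock minutes for n sequential requests."""
    return n_requests * minutes_per_request


# 5 pieces of feedback at roughly 5 minutes each
print(estimate_batch_minutes(5, 5))  # 25
```

Wrapping a model call as `timed(model_call, prompt)` (with `model_call` standing in for whatever function runs the model) gives the per-request number to plug into the estimate.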

As the next step, we will give the model more detailed instructions using few-shot prompting so that we get better and more consistent results.

In [2]:
!pip install openai

Collecting openai
  Downloading openai-1.14.1-py3-none-any.whl.metadata (18 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Downloading openai-1.14.1-py3-none-any.whl (257 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, openai
Successfully installed distro-1.9.0 openai-1.14.1


In [1]:
import os

from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_response(role, prompt):
    # Send the system prompt and the feedback text; return the reply text.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f""" {role} """},
            {"role": "user", "content": f"""{prompt} """},
        ],
        temperature=0.25,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content

In [2]:
# prompt
Role = """You are a feedback analyzing assistant.  For the feedback provided, you are to come up with two scores.  
A workplace appropriateness score called 'AS' that ranges from 1 to 10 on how appropriate the feedback is for professional setting.
A feedback score called 'FS' that ranges from 1 to 10 on overall score of the feedback itself.
Beside each scores, you are to provide a short rationale for your scoring.  
The feedback to analyze will be input in format like, 'Feedback: <text>'"""

Feedback = """James is a highly professional attorney who consistently delivers exceptional results. 
He is well-respected by clients and colleagues alike. 
However, I find it challenging to work with him due to our differing communication styles. 
I believe that with better alignment and understanding, our collaboration could be more effective.
"""
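
The system prompt says the input arrives as 'Feedback: <text>', while the `Feedback` string defined above is raw text. A small helper could bring the two in line. This is only a sketch: `format_feedback` is a hypothetical name, and collapsing whitespace is an assumption about what is desirable here.

```python
def format_feedback(text):
    """Wrap raw feedback in the 'Feedback: <text>' form named in the system prompt."""
    # Collapse newlines and repeated spaces into single spaces.
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("feedback text is empty")
    return f"Feedback: {cleaned}"


print(format_feedback("James is great.\nHe helps. \n"))
# Feedback: James is great. He helps.
```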


In [3]:
print(get_response(Role, Feedback))


AS: 9
The feedback is mostly appropriate for a professional setting as it focuses on the individual's professional qualities and areas for improvement in a constructive manner.

FS: 8
The feedback is well-structured and provides specific examples to support the points made. However, it could benefit from offering more actionable suggestions for improving communication alignment.


Test case AS: 10 / FS: 7
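
To compare a reply like the one above with the expected scores without reading it by hand, the two numbers can be pulled out with a regular expression. The sketch below assumes the reply keeps the `AS: <n>` / `FS: <n>` labelling seen in the output; `parse_scores` is a hypothetical helper, not part of the OpenAI library.

```python
import re

# Matches labels such as "AS: 9" or "FS:  8." followed by an integer score.
SCORE_PATTERN = re.compile(r"\b(AS|FS)\s*:\s*(\d{1,2})\b")


def parse_scores(reply):
    """Return a dict like {'AS': 9, 'FS': 8} for the labels found in a reply."""
    scores = {}
    for label, value in SCORE_PATTERN.findall(reply):
        # Keep the first occurrence of each label; later mentions are ignored.
        scores.setdefault(label, int(value))
    return scores


reply = """AS: 9
The feedback is mostly appropriate for a professional setting.

FS: 8
The feedback is well-structured."""

print(parse_scores(reply))  # {'AS': 9, 'FS': 8}
```

If a label is missing from the reply, it is simply absent from the returned dict, so callers should check for both keys before comparing.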

The model performs very well, with scores close to the test case. They are not identical, but we are not concerned with the LLM producing the same score every time; what matters is that its rationale makes sense. We will now add some examples.

In [4]:

def get_response(role, examples, prompt):
    # Build the conversation: system prompt, then the few-shot examples as
    # alternating user/assistant turns, then the feedback to analyze.
    messages = [{"role": "system", "content": f" {role} "}]
    for i, example in enumerate(examples):
        speaker = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": speaker, "content": f" {example} "})
    messages.append({"role": "user", "content": f"{prompt} "})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.25,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content
     

In [5]:
# Add few-shot examples (A = feedback, B = the analysis we want the model to imitate)

Example1A = "Bob is a cool attorney to work with.  He is alright when assisting clients and pretty knowledgeable when it comes to his area of expertise."
Example1B = """AS: 10. This feedback is appropriate for a professional workplace setting. It does not contain any inappropriate language or derogatory remarks.

FS:  8. This feedback is generally positive, stating that Bob is cool to work with, assists clients adequately, and has good knowledge in his area of expertise. However, the tone is somewhat neutral and does not convey a high level of enthusiasm or excitement.
"""

Example2A = """Bob is an ass. I hate everything about him and I wish he would leave the firm."""
Example2B = """AS: 2. This feedback contains derogatory language and expresses a strong negative sentiment towards Bob. It is not appropriate for a professional workplace setting.

FS: 2. The feedback is highly negative, expressing hatred towards Bob and a desire for him to leave the firm. The tone is filled with frustration and dissatisfaction.
"""


Example3A = """Bob is freaking amazing.  He is a goddamn angel when it comes to saving my ass when shit hits the fan."""
Example3B = """AS: 4. This feedback contains inappropriate language and profanity, which is not suitable for a professional workplace setting.

FS: 9. The feedback is highly positive, praising Bob for his exceptional abilities in resolving difficult situations. The use of strong language and emphasis conveys a high level of enthusiasm and satisfaction.
"""


Example4A = """Bob is a colleague that I do not want to work with going forward.  While Bob is recognized as a high performing attorney within his practice area, the projects that I ran with him proved to be very difficult and emotional.   Though I appreciate his time, I would like to work with someone else.
"""
Example4B = """AS: 10. This feedback is appropriate for a professional workplace setting. It does not contain any inappropriate language or derogatory remarks.

FS:  7. The reviewer acknowledges Bob's high performance but expresses difficulty and emotional challenges while working with him. The overall tone is neutral and suggests a preference to work with someone else.
"""


Exmps = [Example1A,Example1B,Example2A,Example2B,Example3A,Example3B,Example4A,Example4B]

In [6]:
print(get_response(Role, Exmps, Feedback))

 AS: 10. This feedback is appropriate for a professional workplace setting. It provides a balanced and constructive assessment without any inappropriate language.

FS: 9. The feedback is highly professional and balanced, highlighting James' strengths while also addressing the challenges faced in working with him. The reviewer offers a constructive suggestion for improvement, indicating a willingness to enhance collaboration for better outcomes.


In [7]:
print(Feedback)

James is a highly professional attorney who consistently delivers exceptional results. 
He is well-respected by clients and colleagues alike. 
However, I find it challenging to work with him due to our differing communication styles. 
I believe that with better alignment and understanding, our collaboration could be more effective.



Test Case AS,FS: 10,7

Even with few-shot examples, the LLM judged the feedback much as it did before. Its AS agrees completely with the test case: the feedback is perfectly professional. I also acknowledge that this specific example counts as good feedback. There are negative sentiments, but overall the feedback is well written.

This raises a question about FS: is the model rating how well the feedback is written rather than what it says about the reviewee? If so, FS may be high whenever the feedback is well written, even when the sentiment is not positive, and we will need to update the system prompt.
Let's go through more test cases.


### Testing

In [11]:
import pandas as pd

#import test data
tc = pd.read_csv('out.csv')

In [12]:
tc.head()

| Unnamed: 0 | text | appro | feed |
|---|---|---|---|
| 0 | James is a terrible person to work with. I hat... | 3 | 3 |
| 1 | James is freaking amazing! I love working with... | 6 | 9 |
| 2 | I hate Jamess face and guts. He may be good at... | 3 | 3 |
| 3 | I love James and his ability to ruin the relat... | 2 | 1 |
| 4 | James is a highly professional attorney who co... | 10 | 7 |


In [19]:
for i in range(len(tc)):

    analyzed = get_response(Role, Exmps, tc.loc[i].text)
    print('Test Case ',i)
    print('Feedback: ',tc.loc[i].text)
    print('LLM Analysis: \n',analyzed)
    print('Actual AS: ',tc.loc[i].appro)
    print('Actual FS: ',tc.loc[i].feed)
    print('\n')


Test Case  0
Feedback:  James is a terrible person to work with. I hate how James smells and I just dont want to be near him whatsoever.  Hes good at his job though.
LLM Analysis: 
  AS: 2. This feedback contains personal attacks on James, including comments about his personal hygiene, which is not appropriate for a professional workplace setting.

FS: 4. The feedback is mixed, acknowledging James' competence at his job but expressing strong negative feelings towards him based on personal reasons. The comments about his smell and the desire to avoid him detract from the professionalism of the feedback.
Actual AS:  3
Actual FS:  3


Test Case  1
Feedback:  James is freaking amazing! I love working with James cause he is my bro.  He knows his shit so he should get a raise.
LLM Analysis: 
 AS: 6. This feedback contains some inappropriate language (e.g., "freaking" and "shit"), which may not be suitable for all professional workplace settings.

FS: 8. The feedback is highly positive, prais

Generally, the model is doing quite well. In Test Case 1, for example, the LLM gave AS 6 and FS 8, where the labeled scores were AS 6 and FS 9. This feedback is strongly positive but delivered in an unprofessional manner, so we expected a relatively low AS and a high FS, and the model captured that. In the rationale for AS, words like "freaking" and "shit" are flagged as inappropriate, which is exactly what we want to capture.

Test Case 2 is a low-AS, low-FS example: the feedback content is negative and includes hurtful remarks and personal attacks, which warrants low scores. The LLM gave AS 3 and FS 3, matching the test case exactly.

The final test case to showcase is Test Case 3, where I threw a curveball. The feedback text reads: "I love James and his ability to ruin the relationship with the clients.  I find it so funny and I want to work more with him because its entertaining to watch him crash and burn.  It makes me look good!".
This feedback is full of positive words meant to trick the LLM into thinking it is appropriate, maybe even good, feedback. A human reader would recognize it as the worst kind of feedback, where the reviewer is happy about the reviewee's failure. The LLM captured this and gave it AS 2 and FS 3, explaining that the feedback scored low because it "is highly negative and expresses enjoyment in watching a colleague fail."
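
The per-case comparisons above can be summarized with a mean absolute error for each score. This is a sketch over Test Cases 0 to 3 only, using the LLM and labeled scores reported in this section; `mean_absolute_error` is a hypothetical helper, and the choice of metric is an assumption rather than part of the original evaluation.

```python
def mean_absolute_error(predicted, expected):
    """Average of |predicted - expected| over paired scores."""
    if len(predicted) != len(expected):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("score lists must not be empty")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


# (LLM AS, LLM FS, labeled AS, labeled FS) for Test Cases 0-3 as reported above.
results = [
    (2, 4, 3, 3),
    (6, 8, 6, 9),
    (3, 3, 3, 3),
    (2, 3, 2, 1),
]

as_mae = mean_absolute_error([r[0] for r in results], [r[2] for r in results])
fs_mae = mean_absolute_error([r[1] for r in results], [r[3] for r in results])
print(as_mae, fs_mae)  # 0.25 1.0
```

On these four cases the AS scores are closer to the labels than the FS scores are, in line with the earlier question about what FS is actually measuring.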