This notebook is used to test prompt generation.

1. Summarization of critic reviews.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import psycopg2



In [2]:
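# BERT was not capable enough for this task, so Llama 3 is used as the summarizer
import os

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
# Read the Hugging Face access token from the environment instead of hard-coding it in the notebook
HF_API_TOKEN = os.environ["HF_TOKEN"]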

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=HF_API_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, token=HF_API_TOKEN)
model = model.to(torch.bfloat16)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00,  2.85s/it]


In [3]:
# brief test of llama3-instruct
messages = [
    {"role": "system", "content": "You are a summarizer who summarize critic reviews of video game, the work is to highlight the shared view points and different view points. Also you need to summarize the distribution of their scores (0-100)."},
    {"role": "user", "content": "Who are you?"}
]


In [4]:
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)

In [5]:
inputs

tensor([[128000, 128006,   9125, 128007,    271,   2675,    527,    264,  29385,
           3213,    889,  63179,   9940,   8544,    315,   2835,   1847,     11,
            279,    990,    374,    311,  11415,    279,   6222,   1684,   3585,
            323,   2204,   1684,   3585,     13,   7429,    499,   1205,    311,
          63179,    279,   8141,    315,    872,  12483,    320,     15,     12,
           1041,    570, 128009, 128006,    882, 128007,    271,  15546,    527,
            499,     30, 128009, 128006,  78191, 128007,    271]])

In [6]:
input_text = tokenizer.decode(inputs[0], skip_special_tokens=False)
print(input_text)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a summarizer who summarize critic reviews of video game, the work is to highlight the shared view points and different view points. Also you need to summarize the distribution of their scores (0-100).<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
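
The decoded prompt above shows the layout that the chat template produces: each message is wrapped in a role header and closed with `<|eot_id|>`, and the prompt ends with an open assistant header. As a rough illustration only, the sketch below rebuilds that layout by hand. `format_llama3_prompt` is a hypothetical helper, not part of `transformers`, and it does not try to replicate any whitespace handling the real template may perform; `tokenizer.apply_chat_template` remains the source of truth.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Llama 3 chat layout printed above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message is a role header, a blank line, the content, then <|eot_id|>
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that it should answer next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Shortened demo messages (the system text here is made up for brevity)
demo_messages = [
    {"role": "system", "content": "You are a summarizer."},
    {"role": "user", "content": "Who are you?"},
]
demo_prompt = format_llama3_prompt(demo_messages)
print(demo_prompt)
```

Seeing the layout spelled out makes it easier to understand why generation is stopped on `<|eot_id|>` in the cells that follow.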




In [7]:
# Set the generation parameters
max_new_tokens = 256
temperature = 0.6
top_p = 0.9

In [8]:
outputs = model.generate(
    inputs,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    do_sample=True,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


In [9]:
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(generated_text)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a summarizer who summarize critic reviews of video game, the work is to highlight the shared view points and different view points. Also you need to summarize the distribution of their scores (0-100).<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I am a summarizer, specifically designed to summarize critic reviews of video games. My task is to analyze the reviews from various critics and journalists, identify the shared viewpoints and differing opinions, and summarize the overall sentiment and scores given to the game.

I'll be highlighting the common themes, praises, and criticisms found in the reviews, as well as noting any significant disagreements or controversies. Additionally, I'll provide a summary of the score distribution, ranging from 0 to 100, to give you an idea of the general consensus among critics.

So, if you're looking for a com
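
The decoded text above contains the whole prompt as well as the reply, and the reply may be cut off when `max_new_tokens` is reached. A minimal sketch of keeping only the assistant's reply, using plain string operations, is shown below. `extract_assistant_reply` is a hypothetical helper introduced for illustration.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
EOT = "<|eot_id|>"


def extract_assistant_reply(decoded_text):
    """Return the text after the last assistant header, without special tokens."""
    _, header, tail = decoded_text.rpartition(ASSISTANT_HEADER)
    if not header:
        # No assistant header found: return the text unchanged (trimmed)
        return decoded_text.strip()
    # Drop the end-of-turn token if generation finished; a truncated reply has none
    reply, _, _ = tail.partition(EOT)
    return reply.strip()


demo_text = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "I am a summarizer.<|eot_id|>"
)
print(extract_assistant_reply(demo_text))
```

An alternative that avoids string matching is to decode only the newly generated tokens, for example `tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)`, since `generate` returns the prompt tokens followed by the new ones for this kind of model.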

In [58]:
# Summarizer class: builds a prompt from Steam and Metacritic data and asks the model to summarize it
class summarizer:

    def __init__(self, model, tokenizer, psyconn):
        self.module = model
        self.tokenizer = tokenizer
        self.conn = psyconn

    def summarize_game(self, game_name, data_from_metacritic=False, detailed_steam_info=False):

        # game_name is passed as a query parameter so that names containing quotes cannot break the SQL
        steam_sql = """
            SELECT description, positive_rate, developer, genes, release_date
            FROM games
            WHERE games.name = %s;
        """
        cursor = self.conn.cursor()
        cursor.execute(steam_sql, (game_name,))

        description, positive_rate, developer, genes, release_date = cursor.fetchone()
        # NOTE: msg is assembled here but is not yet included in the prompt sent to the model
        msg = f"Please help me analyze the game: {game_name}. {game_name} was developed by {developer}, {developer} characterizes this game with tags: {genes} and gives a short description: {description}\n"

        if detailed_steam_info:
            msg = msg + f"So far, {round(positive_rate)} percent of its customers give positive reviews."

        if data_from_metacritic:
            meta_sql = """
                SELECT game_name, AVG(score)
                FROM meta_critic_reviews
                WHERE game_name = %s
                GROUP BY game_name;
            """

            cursor.execute(meta_sql, (game_name,))
            _, avg_score = cursor.fetchone()
            msg = msg + f"\nNext are some reviews from critic reviewers on Metacritic. On Metacritic the game gets an average score of {round(avg_score)} out of 100.\n"

            meta_sql = """
                SELECT reviewer_name, score, review
                FROM meta_critic_reviews
                WHERE game_name = %s
                ORDER BY score DESC;
            """
            cursor.execute(meta_sql, (game_name,))
            reviews = cursor.fetchall()

            # prompt messages for llama3-8B-instruct
            prompt_messages = [
                {"role": "system", "content": "You are a summarizer who condenses critic reviews of a video game into one paragraph. Your job is to highlight the viewpoints the critics share and the ones on which they differ. You also need to summarize the distribution of their scores (0-100).\n"},
                {"role": "user", "content": f"Please summarize the following Metacritic critic reviews of the video game {game_name}:\n"}
            ]

            for (reviewer, score, review_text) in reviews:
                # A score of -100 marks a review that has no score
                score_note = (f" In conclusion this reviewer gives a score of {score}.\n"
                              if score != -100 else " This reviewer does not give a score.\n")
                prompt_messages[1]["content"] += reviewer + ": " + review_text + score_note

            # Use the model and tokenizer stored on the instance, and send the inputs to the model's device
            inputs = self.tokenizer.apply_chat_template(prompt_messages, return_tensors="pt", add_generation_prompt=True).to(self.module.device)
            outputs = self.module.generate(
                inputs,
                max_new_tokens=384,
                temperature=0.2,
                top_p=0.9,
                do_sample=True,
                eos_token_id=[self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|eot_id|>")]
            )

            # Keep only the part of the decoded text that starts at the assistant header
            output = self.tokenizer.decode(outputs[0], skip_special_tokens=False)
            idx = output.find("<|start_header_id|>assistant<|end_header_id|>")
            output = output[idx:]
            print(output)
        cursor.close()
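
Game names can contain apostrophes, so it matters that they reach the database as query parameters rather than being formatted into the SQL text. The sketch below illustrates the idea with the standard-library `sqlite3` module as a stand-in for PostgreSQL, purely so that it runs without a database server. The two-column `games` table and the `find_developer` helper are simplified assumptions for illustration; note that `sqlite3` uses `?` placeholders whereas `psycopg2` uses `%s`.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: simplified two-column table)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE games (name TEXT, developer TEXT)")
demo_conn.execute("INSERT INTO games VALUES (?, ?)", ("Assassin's Creed", "Ubisoft"))


def find_developer(connection, game_name):
    """Look up a developer, passing game_name as a bound parameter."""
    row = connection.execute(
        "SELECT developer FROM games WHERE name = ?", (game_name,)
    ).fetchone()
    # fetchone() returns None when no row matches
    return row[0] if row is not None else None


developer = find_developer(demo_conn, "Assassin's Creed")
missing = find_developer(demo_conn, "x' OR '1'='1")
print(developer, missing)
```

With `psycopg2` the same pattern would be `cursor.execute("... WHERE name = %s", (game_name,))`. Checking the result of `fetchone()` for `None` also avoids an unpacking error when a game is not in the table.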
                


In [59]:
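import os

# Connection settings; the password is read from the environment rather than hard-coded
postgres_conn_param = {
    'host': 'localhost',
    'port': '5432',
    'user': 'postgres',
    'password': os.environ['PGPASSWORD'],
    'database': 'steam'
}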

conn = psycopg2.connect(**postgres_conn_param)
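# Use a separate name for the instance so that the summarizer class is not shadowed
game_summarizer = summarizer(model, tokenizer, conn)

In [60]:
game_summarizer.summarize_game('Battlefield V', data_from_metacritic=True, detailed_steam_info=True)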

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


<|start_header_id|>assistant<|end_header_id|>

Here is a summary of the critic reviews for Battlefield V:

**Shared viewpoints:**

* Most critics agree that Battlefield V is an improvement over its predecessors, with a more refined gunplay and a new squad system that adds depth to the gameplay.
* The game's multiplayer mode is praised for its intensity and excitement, with many critics enjoying the large-scale battles and objective-based gameplay.
* The game's visuals and sound design are also widely praised, with many critics noting that the game is one of the most visually stunning FPS games on the market.

**Different viewpoints:**

* Some critics feel that the game is lacking in content, particularly at launch, with too few maps and modes available.
* Others criticize the game's single-player campaign, which is seen as forgettable and lacking in depth.
* A few critics note that the game is still rough around the edges, with technical issues and bugs that detract from the overall ex
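
The reply above is cut off before it reaches the score distribution that the system prompt asks for. One way to cross-check that part of a summary is to compute the distribution directly from the scores. The sketch below is only an illustration under stated assumptions: `score_distribution` is a hypothetical helper, the demo scores are made up, and `-100` is treated as "no score" in the same way as in the class above.

```python
from collections import Counter

NO_SCORE = -100  # sentinel for a review without a score, as in the class above


def score_distribution(scores, bucket_size=10):
    """Count scored reviews per bucket and compute their mean."""
    valid = [s for s in scores if s != NO_SCORE]
    top_bucket = 100 - bucket_size
    # Buckets are keyed by their lower bound; a score of 100 joins the top bucket
    counts = Counter(min(s // bucket_size * bucket_size, top_bucket) for s in valid)
    mean = sum(valid) / len(valid) if valid else None
    return {"count": len(valid), "mean": mean, "buckets": dict(sorted(counts.items()))}


# Made-up scores for illustration only, not taken from the database
demo_scores = [90, 85, 80, 100, 72, -100, 60]
summary = score_distribution(demo_scores)
print(summary["count"], summary["buckets"])
```

Such numbers could be compared with the model's text, or added to the prompt so that the model does not have to tally scores itself.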

2. Assistant response based on an aggregation of reviews.

In [3]:
# Assistant response generator based on reviews from a cluster of customers
class assisstant_response:

    def __init__(self, model, tokenizer):
        self.module = model
        self.tokenizer = tokenizer

    def gen_response(self, reviews, game_name):

        prompt_message = [
            {"role": "system", "content": ("You are a video game recommender. Your task is to give advice to someone deciding whether to purchase a video game. "
                                           "Your advice should be based on the reviews and play-time status of players who purchased this game. "
                                           "You should predict the user's attitude toward the game and how much time they would spend on it. Examples: "
                                           "\"You will probably like this game and spend a lot of time on it\", "
                                           "\"You won't like this game, but it is a good way to kill time if you really have nothing to do\".")},
            {"role": "user", "content": f"Please summarize these reviews and playing time status (in brackets) of the game {game_name}:\n"}
        ]

        for i, review in enumerate(reviews):
            # One review per line so that consecutive reviews do not run together
            prompt_message[1]['content'] += f"reviewer {i}: {review}\n"

        # Use the model and tokenizer stored on the instance, and send the inputs to the model's device
        inputs = self.tokenizer.apply_chat_template(prompt_message, return_tensors="pt", add_generation_prompt=True).to(self.module.device)
        outputs = self.module.generate(
            inputs,
            max_new_tokens=256,
            temperature=0.2,
            top_p=0.9,
            do_sample=True,
            eos_token_id=[self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|eot_id|>")]
        )

        # Keep only the part of the decoded text that starts at the assistant header
        output = self.tokenizer.decode(outputs[0], skip_special_tokens=False)
        idx = output.find("<|start_header_id|>assistant<|end_header_id|>")
        output = output[idx:]
        return output


In [4]:
ar = assisstant_response(model=model, tokenizer=tokenizer)

In [5]:
sample_reviews = [
    "At least the uninstall button works (spent little time)",
    "Words cant describe how great this game is. But numbers can. 1/10 (spent little time)",
    "damn, efootball and EAFC24 really go all the way down to the garbage. I miss the old PES and FIFA (spend some time)",
    """2 years. People waited for TWO full years under the promise that this game was going to be groundbreaking .
The only logical and possible explanation for how bad this game is, is that KONAMI wants to declare bunkruptcy or something.
It's equally laughable and sad.
They launched a "full" game that lacks content left and right. They even dumbed down the Game Plan settings.
Terrible, just terrible.
KONAMI just enjoys shooting themselves in the foot. (spent some time)""",
    "This is (probably) the only game my wife manages to beat me to. Game sucks, I hate it but my wife's smile and joy when she scores a goal beats all the cons this game has. (spent some time)",
    "Worst football game ever. Konami, go home, you are drunk. (nearly did not spend any time)"
]
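
Each sample review above ends with a play-time status in brackets. When building a dataset from such strings, it can be useful to separate the review text from that status. The sketch below is one possible way to do it with a regular expression. `split_review` is a hypothetical helper, and it assumes the status is always the last bracketed group in the string.

```python
import re

# Matches a final "(...)" group at the very end of the string
_STATUS_RE = re.compile(r"\s*\(([^()]*)\)\s*$")


def split_review(review):
    """Split 'text (status)' into (text, status); status is None if absent."""
    match = _STATUS_RE.search(review)
    if match is None:
        return review.strip(), None
    return review[:match.start()].strip(), match.group(1).strip()


text, status = split_review("At least the uninstall button works (spent little time)")
print(text, "|", status)
```

Brackets earlier in a review, such as "(probably)", are left untouched because only a group at the end of the string is matched.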

In [6]:
output = ar.gen_response(sample_reviews, "FIFA 2024")
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


<|start_header_id|>assistant<|end_header_id|>

Based on the reviews and playing time status, here's a summary:

**Overall Attitude:** The game has received overwhelmingly negative reviews, with most reviewers expressing disappointment, frustration, and even anger towards the game's poor quality.

**Playing Time:** The playing time varies, but most reviewers have spent little to some time playing the game, indicating that it may not be engaging or enjoyable for many players.

**Predictions:**

* You probably won't like this game, as it seems to have received widespread criticism for its poor quality and lack of content.
* You might spend some time playing it, but it's unlikely to hold your attention for long due to its numerous flaws.
* If you're a fan of football games, you might be disappointed by the game's poor performance and lack of innovation.

Keep in mind that opinions are subjective, and some players might still find enjoyment in the game. However, based on the reviews, it see

This section will generate the training dataset for the general-passive assistant.