# Comparison between the two models
According to the benchmarks published by Mistral AI, Mistral 7B outperforms Llama 2 13B on all reported metrics:


https://mistral.ai/news/announcing-mistral-7b/

## Load dataset with responses


In [20]:
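import os

import pandas as pd

# The data folder is expected one level above the current working directory.
cwd = os.getcwd()
parent_dir = os.path.abspath(os.path.join(cwd, os.pardir))
data_path = os.path.join(parent_dir, "data")

# Load the pickled DataFrame with the questions and both models' responses.
responses_df = pd.read_pickle(os.path.join(data_path, "responses_df"))
responses_df.head()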

| Unnamed: 0 | question | answer | category | year | mistral_responses_iter1 | llama2_responses_iter1 |
|---|---|---|---|---|---|---|
| 0 | In the cricket IPL game, scheduled for May 14t... | Bangalore | sport | 2023 | I'm an AI, I don't have the ability to predict... | `{\n"probs": 0.7,\n"confidence_interval": [0.6,...` |
| 1 | In the NBA basketball game, scheduled for Octo... | Pistons | sport | 2023 | `{\n "probability": 0.45,\n "confidence_inter...` | `{\n"probability": 0.7,\n"confidenceInterval": ...` |
| 2 | In the cricket IPL game, scheduled for May 14t... | Bangalore | sport | 2023 | `{\n "probability": 0.51,\n "confidence_inter...` | `{\n"probability": 0.7,\n"confidence_interval":...` |
| 3 | In the NBA basketball game, scheduled for Octo... | Pistons | sport | 2023 | `{\n "probability": 0.5,\n "confidence_interv...` | `{\n"probability": 0.75,\n"confidence": 0.9\n}\n` |
| 4 | Will Netherlands qualify for the World Cup in ... | Yes | sport | 2023 | `{\n"probability": 0.8,\n"confidence_interval":...` | `{\n"proba": 0.75,\n"conf": [0.6, 0.9]\n}` |


For some reason the first two questions appear twice, so we drop the repeated rows. Interestingly, the models gave different answers to the very same question.

In [21]:
responses_df.drop([2,3], inplace=True)
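
The repeated rows can also be found programmatically instead of hard-coding their labels. The snippet below is a minimal sketch on a toy frame (`toy_df` and its contents are made up for illustration); it assumes that a repeated question has exactly the same text in the `question` column.

```python
import pandas as pd

# Toy stand-in for responses_df: rows 2 and 3 repeat the questions of rows 0 and 1.
toy_df = pd.DataFrame(
    {
        "question": ["Q-A", "Q-B", "Q-A", "Q-B", "Q-C"],
        "mistral_responses_iter1": ["r0", "r1", "r2", "r3", "r4"],
    }
)

# Mark every repeat of an already-seen question (the first occurrence is kept).
repeated_mask = toy_df.duplicated(subset="question", keep="first")
repeated_index = toy_df.index[repeated_mask].tolist()

# Same result as toy_df.drop([2, 3]) here, but without hard-coding the labels.
deduplicated_df = toy_df.drop_duplicates(subset="question", keep="first")
```

Comparing only the `question` column matters here, because the model responses differ between the repeated rows.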

In [6]:
# Lists to collect one score per question (1 = correct, 0 = wrong)
mistral_correct_responses = []
llama2_correct_responses = []

## Question 1


In [4]:
responses_df.iloc[0].question

'In the cricket IPL game, scheduled for May 14th in 2023, which team is going to win: Rajasthan or Bangalore?'

### Mistral response
It gives a very long answer (note for the prompt engineering: add a parameter that caps the number of output tokens). It assigns the higher probability to Rajasthan, although by its own account both teams won the same number of matches. So it did not guess the correct answer, even though this model might even have had access to the result of this past match.

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #31708f; background-color: #d9edf7; border-color: #bce8f1;">
'I\'m an AI, I don\'t have the ability to predict or know the outcome of a cricket match before it happens. However, I can give you some general information or historical data that might help you make an informed decision. For instance, if we look at the most recent performance and head-to-head records, we find that:\n\n1. Rajasthan won 5 of the last 10 matches against Bangalore.\n2. Bangalore won 5 of the last 10 matches against Rajasthan.\n\nSo, based on this data, we could estimate that the probability of Rajasthan winning is around 0.52 (52%) and the probability of Bangalore winning is around 0.48 (48%). However, these probabilities only reflect historic performance and do not account for any potential injuries, team composition changes, or unexpected factors that could influence the outcome of the match.\n\nAs for the confidence interval, it\'s important to keep in mind that predicting sports outcomes is inherently uncertain. Based on the binary nature of the question ( Rajasthan or Bangalore), a confidence interval can\'t be directly calculated using a standard normal distribution since the data is not continuous. Nevertheless, we could represent the uncertainty using a range or interval based on the probabilities and some assumption of potential swing in team performance. For instance, considering the close probability values, a reasonable confidence interval could be something like [0.45, 0.59] for Rajasthan\'s win percentage or [0.41, 0.54] for Bangalore\'s win percentage. 
Please note that I cannot guarantee the validity of these confidence intervals, as cricket match outcomes include a significant element of randomness and uncertainty.\n\nThe output for the JSON response could look like:\n{\n  "probability_Rajasthan": 0.52,\n  "confidence_interval_Rajasthan": [0.45, 0.59]\n},\n{\n  "probability_Bangalore": 0.48,\n  "confidence_interval_Bangalore": [0.41, 0.54]\n}\n\nIt\'s important to keep in mind that historical data is not always a perfect indicator of future performance and there are many factors that can influence the outcome of a single match. Careful consideration of all relevant information and expertise is required when making predictions.'</div>

In [7]:
mistral_correct_responses.append(0)

### Llama2 response
This model gives a short answer with a probability, but no explanation and no indication of which team the probability refers to, so it may not have understood the instruction. Llama2 is older than Mistral and less advanced, so it probably does not reason as well. The same prompt therefore does not work for this type of question. We count it as a failed answer.

In [3]:
responses_df.iloc[0].llama2_responses_iter1

'{\n"probs": 0.7,\n"confidence_interval": [0.6, 0.85]\n}'

In [8]:
llama2_correct_responses.append(0)

## Question 2

In [10]:
responses_df.iloc[1].question

'In the NBA basketball game, scheduled for October 27th in 2023, which team is going to win: Pistons or Hornets?'

### Mistral response
This time it did not answer the question and only gave a probability. The same question was in fact asked twice, due to an error when sending the requests, and the second time it answered with a different probability.

In [11]:
responses_df.iloc[1].mistral_responses_iter1

'{\n  "probability": 0.45,\n  "confidence_interval": [0.4, 0.5]\n}'

### Llama2 response
Again it did not answer the question.

In [12]:
responses_df.iloc[1].llama2_responses_iter1

'{\n"probability": 0.7,\n"confidenceInterval": [0.5,0.9]\n}\n'

In [13]:
mistral_correct_responses.append(0)
llama2_correct_responses.append(0)
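
The responses shown so far name their fields inconsistently (`probs`, `probability`, `proba`, `confidence_interval`, `confidenceInterval`, `conf`), and some wrap the JSON in free text. A minimal sketch of a parser that tolerates this is given below. The function name and the key lists are assumptions made for illustration, and the regular expression assumes the JSON object is flat (no nested braces).

```python
import json
import re

# Key spellings observed in the responses; treating them as synonyms is an assumption.
PROBABILITY_KEYS = ("probability", "probs", "proba")
INTERVAL_KEYS = ("confidence_interval", "confidenceInterval", "conf")


def parse_response(text):
    """Return (probability, interval) from the first usable JSON object in text.

    Returns None when no JSON object with a recognised probability key is found.
    """
    # Non-greedy match of each {...} block; assumes the objects are not nested.
    for match in re.finditer(r"\{.*?\}", text, flags=re.DOTALL):
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        probability = next(
            (payload[key] for key in PROBABILITY_KEYS if key in payload), None
        )
        if probability is None:
            continue
        interval = next(
            (payload[key] for key in INTERVAL_KEYS if key in payload), None
        )
        return float(probability), interval
    return None


llama2_q1 = parse_response('{\n"probs": 0.7,\n"confidence_interval": [0.6, 0.85]\n}')
llama2_q2 = parse_response('{\n"probability": 0.7,\n"confidenceInterval": [0.5,0.9]\n}\n')
no_json = parse_response("I'm an AI, I don't have the ability to predict")
```

A response without a recognisable probability, such as a refusal, yields `None`, which can then be scored as a failed answer.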

## Question 3

In [24]:
responses_df.iloc[2].question

'Will Netherlands qualify for the World Cup in 2023?'

### Mistral response
Again it is not clear whether it understood the question, but this time it is a yes/no question, so we can take this probability as the probability that the Netherlands qualifies. In that case the answer is yes.

In [23]:
responses_df.iloc[2].mistral_responses_iter1

'{\n"probability": 0.8,\n"confidence_interval": [0.7, 0.9]\n}\n'

In [25]:
mistral_correct_responses.append(1)

### Llama2 response
As with the Mistral response, we take it as a yes.

In [26]:
responses_df.iloc[2].llama2_responses_iter1

'{\n"proba": 0.75,\n"conf": [0.6, 0.9]\n}'

In [27]:
llama2_correct_responses.append(1)
llama2_correct_responses

[0, 0, 1]
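
The scoring rule used for this yes/no question can be written down explicitly. The sketch below assumes a 0.5 threshold and treats a probability of exactly 0.5 as "No". Both choices, and the helper name `score_yes_no`, are assumptions for illustration rather than something fixed by the dataset.

```python
def score_yes_no(probability, true_answer, threshold=0.5):
    """Return 1 if the thresholded probability matches the true answer, else 0."""
    # Assumption: anything strictly above the threshold counts as "Yes".
    predicted = "Yes" if probability > threshold else "No"
    return int(predicted.lower() == true_answer.strip().lower())


# Question 3: both models lean towards "Yes" and the recorded answer is "Yes".
mistral_score = score_yes_no(0.8, "Yes")
llama2_score = score_yes_no(0.75, "Yes")
```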

## Questions 4, 5, 6 and 7
They all ask whether a specific country will qualify for the World Cup in 2023.

Will Senegal qualify for the World Cup in 2023?

Will France qualify for the World Cup in 2023?

Will Spain qualify for the World Cup in 2023?

Will Qatar qualify for the World Cup in 2023?

### Mistral responses
It says yes for all of them. Qatar gets the highest probability, with the rationale that it is the host country.

In [33]:
mistral_correct_responses.append(1)
mistral_correct_responses.append(1)
mistral_correct_responses.append(1)
mistral_correct_responses.append(0)
mistral_correct_responses

[0, 0, 1, 1, 1, 1, 0]

### Llama2 responses
This model also says yes for all of them, but with lower probabilities.

In [38]:
llama2_correct_responses.append(1)
llama2_correct_responses.append(1)
llama2_correct_responses.append(1)
llama2_correct_responses.append(0)
llama2_correct_responses

[0, 0, 1, 1, 1, 1, 0]

## Questions 8-17
Will Bryce Young be first QB selected in the 2023 NFL draft?

Will Tadej Pogačar win the 2023 Tour de France?

In the 2023 Qatar Open Final in Tennis, who is going to win Swiatek or Pegula?

Will Coco Gauff win Women's US Open 2023 in Tennis?

On December the 19th of 2022 who is going to win the NFL Rams or Packers

In the cricket IPL game, scheduled for April the 1st in 2023, which team is going to win: Punjab or Kolkata

Will Hans Niemann finish top 3 in the U.S. Chess Championship in 2022?

Will Magnus Carlsen win the Tata Steel Masters 2023?

Within the Super Bowl LVII in February 2023, which team is going to win Eagles or Chiefs

In the World Cup Final, scheduled for December 18th in 2022, which country is going to win France or Argentina?

These are all sports questions.


### Mistral responses
For most of the questions it gives a probability close to 50%, which means it really has no idea. For instance, its answer to question 8 amounts to answering both yes and no:

'{\n"probability": 0.55,\n"confidence_interval": [0.45, 0.65]\n}
Bryce Young has a 55% chance of being the first quarterback selected in the NFL draft. However, the confidence interval is quite wide, ranging from 45% to 65%, due to the uncertainty of team needs, player performance, and other factors that can influence draft decisions.'

Only for a couple of questions did the model give a strong No or a strong Yes, and in those cases it did not guess right.

We add the scores to the list that collects the right answers.
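
One way to make "has no idea" concrete is to check whether the reported confidence interval contains the 50% mark. The helper below is a hypothetical sketch of that check, not a calibrated statistical test.

```python
def is_uninformative(confidence_interval, midpoint=0.5):
    """Return True when the interval contains the 50% mark."""
    lower, upper = confidence_interval
    return lower <= midpoint <= upper


# Question 8 (Mistral): probability 0.55 with interval [0.45, 0.65].
question_8_flag = is_uninformative([0.45, 0.65])
# Question 3 (Mistral): probability 0.8 with interval [0.7, 0.9].
question_3_flag = is_uninformative([0.7, 0.9])
```

With this rule the answer to question 8 is flagged as uninformative, while the Netherlands answer from question 3 is not.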

In [51]:
mistral_correct_responses.append(1) # 8
mistral_correct_responses.append(0) # 9
mistral_correct_responses.append(1) # 10
mistral_correct_responses.append(0) # 11
mistral_correct_responses.append(0) # 12 
mistral_correct_responses.append(1) # 13
mistral_correct_responses.append(1) # 14
mistral_correct_responses.append(0) # 15
mistral_correct_responses.append(0) # 16
mistral_correct_responses.append(0) # 17
mistral_correct_responses

[0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

The Mistral model is more advanced: in some of its answers it already comments that what we are trying to achieve is the wrong use case, since these LLMs are reasoning engines and not fact-based databases that could be used to get accurate predictions.

Some of the warnings from the answers:

<div style="padding: 15px; border: 1px solid transparent; border-color: transparent; margin-bottom: 20px; border-radius: 4px; color: #8a6d3b;; background-color: #fcf8e3; border-color: #faebcc;">
'Please note that this is a simulated answer and should not be considered as an accurate prediction.'

' Real-life predictions would require access to real-time data and up-to-date injury reports.'

' For the most accurate information, refer to reliable sports news sources or betting odds just before the match.\n\nThis is the response you\'d get on average, but the model might change depending on the latest news, injuries or other unpredicted factors. So every time you ask this question, the probability might change.\n\nNote: I don\'t have access to real-time data, I\'m just a Large Language Model, my goal is to mimic human-like conversation and generate outputs that seem plausible given the context.'
</div>

### Llama2 responses
The model just answers with a probability and no explanation. The probabilities are slightly higher than those from the Mistral model. Besides, when given two choices it almost always says yes to the first option, a trend similar to the one seen with Mistral.

In [63]:
llama2_correct_responses.append(1) # 8
llama2_correct_responses.append(0) # 9
llama2_correct_responses.append(1) # 10
llama2_correct_responses.append(1) # 11
llama2_correct_responses.append(0) # 12 
llama2_correct_responses.append(1) # 13
llama2_correct_responses.append(1) # 14
llama2_correct_responses.append(0) # 15
llama2_correct_responses.append(0) # 16
llama2_correct_responses.append(0) # 17
llama2_correct_responses

[0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0]

## Questions 18-32: Science questions
Most of the questions found are related to weather disasters and hypothetical pseudo-scientific discoveries.

Will Idalia hit Florida as a major hurricane in August 2023?

Will a hurricane make landfall in the US in November 2023?

Will Lee become a Category 5 hurricane in September 2023?

Will Artemis or Starship reach space first in 2023?

Will SpaceX's Starship successfully reach outer space by March 31, 2023?

Are the Peruvian Aliens real?

Will NASA record 2023 as the hottest year on record (1.03°C or higher)?

Will there be widespread flooding in LA in August 2023?

Iceland volcanic eruption by December 31st 2023?

Will Hilary hit US as a hurricane in August 2023?

Alien contact in 2023?

Will SpaceX's Starship lift 500m from pad by November 19th 2023?

Is the Room-Temperature Superconductor discovered in 2023 real?

Iceland volcanic eruption by November 17th 2023?

Is the Earth flat?

### Mistral responses
For this category we can identify more negative answers. Except for the pseudo-science questions, where the model shows strong confidence, the rest of the answers are close to a 0.5 probability. This is close to saying "I do not know, so yes and no". One example: {\n"probability": 0.45,\n"confidence_interval": [0.3, 0.6]\n}

However, it was able to retrieve some information related to the questions, though there is no guarantee that it is true.

It advises consulting reliable sources of information:

"Keep in mind that this response is a rough estimate, based on general knowledge and historical data. Actual conditions could differ materially. I strongly recommend consulting meteorological and disaster response agencies for the most accurate and up-to-date information"

"The best source for up-to-date information on storm track and intensity would be NOAA's National Hurricane Center"

And it gives us a very clear warning:

"I must first clarify that I don't have real-time or predictive capabilities. I'm an AI language model, not a weather or disaster prediction service."

"I strongly recommend consulting meteorological and disaster response agencies for the most accurate and up-to-date information."


In [94]:
mistral_correct_responses.append(0) # 18 Florida
mistral_correct_responses.append(1) # 19 landfall
mistral_correct_responses.append(0) # 20 Lee
mistral_correct_responses.append(0) # 21 Artemis
mistral_correct_responses.append(0) # 22 Space X
mistral_correct_responses.append(1) # 23 Peruvian aliens
mistral_correct_responses.append(0) # 24 hottest year
mistral_correct_responses.append(0) # 25 flooding in LA
mistral_correct_responses.append(0) # 26 eruption in Iceland in December
mistral_correct_responses.append(1) # 27 Hilary hit
mistral_correct_responses.append(1) # 28 Aliens
mistral_correct_responses.append(0) # 29 SpaceX lift 500m
mistral_correct_responses.append(1) # 30 Superconductor
mistral_correct_responses.append(1) # 31 Iceland november
mistral_correct_responses.append(1) # 32 flat Earth
mistral_correct_responses

[0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1]

In [95]:
len(mistral_correct_responses)

32

### Llama2 responses
This time we highlight some interesting outputs.

It refused to answer the question about the hottest year:
'I cannot predict the future or provide a definitive answer to your question. I can only provide information based on the data and models available to me.\n'

For some of the questions it gave the confidence interval with excessive precision:

'{\n"probability": 0.8,\n"confidenceInterval": ["0.6953031763555094", "0.9046968236444906"]\n}\n'

'{\n"probability": 0.000000456,\n"confidenceInterval": [0.000000241, 0.000000718]\n}\n'
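
Note that the bounds in the first of these outputs are strings, not numbers. Before such intervals are compared or plotted they can be coerced and sanity-checked. The helper below is a sketch under assumptions: the name `normalize_interval` is hypothetical and rounding to two digits is an arbitrary choice (it would collapse the second interval above to `[0.0, 0.0]`).

```python
def normalize_interval(bounds, digits=2):
    """Coerce interval bounds to floats, round them, and check they are valid."""
    lower, upper = (round(float(bound), digits) for bound in bounds)
    # A probability interval must be ordered and stay within [0, 1].
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError(f"not a valid probability interval: {bounds!r}")
    return [lower, upper]


cleaned = normalize_interval(["0.6953031763555094", "0.9046968236444906"])
```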

In [115]:
llama2_correct_responses.append(1) # 18
llama2_correct_responses.append(0) # 19
llama2_correct_responses.append(0) # 20
llama2_correct_responses.append(1) # 21 Artemis
llama2_correct_responses.append(0) # 22 
llama2_correct_responses.append(0) # 23
llama2_correct_responses.append(0) # 24
llama2_correct_responses.append(1) # 25
llama2_correct_responses.append(0) # 26
llama2_correct_responses.append(1) # 27
llama2_correct_responses.append(1) # 28
llama2_correct_responses.append(1) # 29
llama2_correct_responses.append(0) # 30
llama2_correct_responses.append(1) # 31
llama2_correct_responses.append(1) # 32

llama2_correct_responses

[0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1]

In [116]:
len(llama2_correct_responses)

32

Just to get an idea of how many right answers each model got from these fairly random outputs:

In [119]:
sum(mistral_correct_responses) / len(mistral_correct_responses)

0.46875

In [120]:
sum(llama2_correct_responses) / len(llama2_correct_responses)

0.53125
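
With only 32 questions the gap between the two accuracies is small. As a rough check, the sketch below attaches a binomial standard error to each accuracy. This treats the questions as independent trials, which is an assumption, and the helper name is hypothetical.

```python
import math


def accuracy_with_standard_error(scores):
    """Return (accuracy, binomial standard error) for a list of 0/1 scores."""
    n = len(scores)
    accuracy = sum(scores) / n
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, standard_error


# Counts taken from the lists above: 15 and 17 correct answers out of 32.
mistral_acc, mistral_se = accuracy_with_standard_error([1] * 15 + [0] * 17)
llama2_acc, llama2_se = accuracy_with_standard_error([1] * 17 + [0] * 15)
gap = llama2_acc - mistral_acc
```

The gap of 0.0625 is smaller than either standard error (about 0.09), so the two scores cannot be told apart on this sample.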

## Conclusion and takeaways
The Mistral model is clearly more advanced than the Llama2 one. It gives more reasoning in its answers and points to reliable sources of information for those questions. Llama2 just outputs probabilities and got more right answers than Mistral only by luck. However, this is not a reliable metric for comparing the two models.

These models are not designed for this type of prediction task. In particular, Mistral repeatedly warned us that its information is not to be trusted, and that we should use proper statistical algorithms and collect real-time data to get better answers.

Finally, most of the answers sit around a 50% probability, which is the same as saying "I do not know".
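
The "I do not know" reading of a 50% answer can be illustrated with the Brier score, a standard measure for probabilistic forecasts that was not used in this notebook. The sketch below uses made-up outcomes and probabilities purely for illustration. A forecaster that always answers 0.5 scores 0.25 whatever happens, and lower is better.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical outcomes, for illustration only (not taken from the dataset).
outcomes = [1, 0, 1, 1]
always_unsure = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
confident_and_right = brier_score([0.9, 0.1, 0.8, 0.7], outcomes)
```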

# References
Using an API
https://replicate.com/blog/run-mistral-7b-with-api