
# Introduction

Coursework in the Information Retrieval (H/M) course is based on the [PyTerrier](https://github.com/terrier-org/pyterrier) framework. PyTerrier uses Pandas for input and output, so we also assume some Pandas knowledge. This notebook contains:

1. an overview/refresher of Pandas for relational data manipulation, including:
  - creating Pandas dataframes
  - applying relational algebra operations on dataframes (projection, restriction)
  - applying functions on dataframes

2. a verification that PyTerrier works on your machine.

# Part 1 - Pandas

Part 1 aims to refresh your understanding of Pandas and of the relational data manipulation you will need when using PyTerrier.

In [2]:
# we need to import pandas. We commonly rename it to pd, to make commands shorter
import pandas as pd

# let's not truncate Pandas output too much
pd.set_option('display.max_colwidth', 150)
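# this pandas display option controls how many characters of a string column are shown in the output
# 'display.max_colwidth': the maximum number of characters displayed for a single cell (e.g. a long sentence)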

### Constructing a Pandas dataframe

In [3]:
# let's take our data from a list, where each element is one row of the data.
population_data = [
  ['California', 38332521],
  ['Texas', 26448193],
  ['Illinois', 12882135]
]

# now we construct a dataframe object. this is our relation
# we need to name the columns
population_df = pd.DataFrame(population_data, columns=['State', 'Population'])
# if we put a variable last in the code cell, its content will be printed
population_df

| | State | Population |
| --- | --- | --- |
| 0 | California | 38332521 |
| 1 | Texas | 26448193 |
| 2 | Illinois | 12882135 |


As you can see, this is very much like a relation, with a name (`population_df`), a header with attribute names, and rows.

We can also make a dataframe using a dictionary:

In [4]:
population_df = pd.DataFrame({
      'State' : ['California', 'Texas', 'Illinois'],
      'Population' : [38332521, 26448193, 12882135]
    })
population_df

| | State | Population |
| --- | --- | --- |
| 0 | California | 38332521 |
| 1 | Texas | 26448193 |
| 2 | Illinois | 12882135 |


Dataframes have properties, such as a length (the number of rows):


In [5]:
len(population_df)

3
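
Beyond `len()`, a few other attributes are handy for a quick look at a dataframe. The following is a minimal sketch that rebuilds the same dataframe so that the cell runs on its own; `shape` and `columns` are standard Pandas attributes.

```python
import pandas as pd

population_df = pd.DataFrame({
    'State': ['California', 'Texas', 'Illinois'],
    'Population': [38332521, 26448193, 12882135]
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = population_df.shape
# columns holds the attribute names of the "relation"
column_names = list(population_df.columns)

print(n_rows, n_cols)
print(column_names)
```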

Each column is typed (actually using NumPy datatypes):

In [6]:
population_df.dtypes

| | 0 |
| --- | --- |
| State | object |
| Population | int64 |



### Projection

Ok, we now have a dataframe. Unlike a relation, a dataframe is ordered, so we can *select* a row by its position, e.g. the second row:

In [7]:
population_df.iloc[1]
# select the 2nd row by integer position (not by index label)
# this returns a Series (one row of data)

| | 1 |
| --- | --- |
| State | Texas |
| Population | 26448193 |


We'll return to selection shortly.

We can *project* one column:

In [8]:
population_df['Population']

| | Population |
| --- | --- |
| 0 | 38332521 |
| 1 | 26448193 |
| 2 | 12882135 |


When projecting either a single row or a single column, we get an object of type [Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).

In [9]:
type(population_df['Population'])

A Pandas Series can be thought of as a kind of dictionary/key-value store. We can ask for a given value using 'dot notation', or square brackets:

In [10]:
population_df.iloc[1].State

'Texas'

In [11]:
population_df.iloc[1]["State"]  # safer; does the same as the dot notation above

'Texas'
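
To see why the square brackets are the safer choice, here is a small sketch. It assumes a column whose name contains a space (`Pop Human`, used here purely for illustration): dot notation cannot express such a name, while square brackets work for any column name.

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['California', 'Texas'],
    'Pop Human': ['38M', '26M'],   # column name containing a space
})
row = df.iloc[1]           # a Series holding the second row

state = row.State          # dot notation: fine for simple names
pop = row['Pop Human']     # square brackets: needed here, because
                           # `row.Pop Human` is not valid Python
print(state, pop)
```

Dot notation can also clash with existing Series methods and attributes, which is another reason to prefer square brackets in code you intend to keep.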

On the other hand, if we wanted to project many columns, we would end up with a *dataframe* that looks similar to the original one (here, just with the columns reordered).

Note how we use an additional `[]` when we want to project many columns - i.e. we are projecting a *list* of columns.

In [12]:
population_df[['Population', 'State']]
# select two columns from population_df at once (Population and State); the result is a DataFrame
# double brackets [[...]]: the outer [] indexes the DataFrame, the inner [...] is a list of column names.
# df[[...]] always returns a DataFrame, never a Series

| | Population | State |
| --- | --- | --- |
| 0 | 38332521 | California |
| 1 | 26448193 | Texas |
| 2 | 12882135 | Illinois |


### Selection

Selection is the filtering of rows. We can do this based on conditions, for instance keeping only the states with a population greater than a threshold, e.g. 30M.

In [14]:
population_df[population_df['Population'] > 3e7]
# this is boolean indexing
# to combine conditions, use & / | and wrap each condition in parentheses, e.g.
# population_df[(population_df['Population'] > 3e7) & (population_df['State'] != 'Texas')]

| | State | Population |
| --- | --- | --- |
| 0 | California | 38332521 |


What happened here? Let's break this down.

Inside the brackets is
```python
population_df['Population'] > 3e7
```
This identifies all rows that have a population greater than 30 million (I'm using 3e7, scientific-notation shorthand for 30 million). Let's run that by itself:

In [15]:
population_df['Population'] > 3e7
# this step only evaluates the condition, it does not filter; it returns a boolean Series with one True / False per row

| | Population |
| --- | --- |
| 0 | True |
| 1 | False |
| 2 | False |


It evaluates the expression `Population > 3e7` for each row, and returns a kind of list (actually a Pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)), with a `True` or `False` value for each row, indicating whether the row meets the selection condition.

Aside: a Pandas Series is essentially a wrapper around a NumPy array - you can access the underlying `np.ndarray` object through the `.values` attribute.
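
As a quick illustration of this aside, the sketch below builds a numeric Series by hand and pulls out the underlying array. `to_numpy()` is the method form that does the same job.

```python
import numpy as np
import pandas as pd

population = pd.Series([38332521, 26448193, 12882135], name='Population')

as_array = population.values        # attribute, so no parentheses
also_array = population.to_numpy()  # method form of the same idea

print(type(as_array))
print(as_array.sum())
```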

By inserting this into `population_df[]`, we determine which rows to return (i.e. only those for California).
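
Selection conditions can also be combined. The sketch below rebuilds the dataframe so that it runs on its own; the 20M threshold and the variable names are chosen only for illustration. Each condition is a boolean Series, and the Series are combined with `&`, `|` or `~` rather than Python's `and`, `or` and `not`.

```python
import pandas as pd

population_df = pd.DataFrame({
    'State': ['California', 'Texas', 'Illinois'],
    'Population': [38332521, 26448193, 12882135]
})

# Each condition is its own boolean Series
big = population_df['Population'] > 2e7
not_texas = population_df['State'] != 'Texas'

# AND: rows where both conditions hold
both = population_df[big & not_texas]

# OR: written inline, each condition needs its own parentheses
either = population_df[(population_df['Population'] > 3e7) | (population_df['State'] == 'Illinois')]

# .loc selects rows and projects a column in a single step
names = population_df.loc[big, 'State']

print(both)
print(either)
print(list(names))
```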

### Apply

Sometimes we want to apply a function to each row of a dataframe. The [Pandas apply() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) is very useful for this.

For instance, perhaps we want to add a string representation of each state's population, e.g. `"38M"` for the population of California.

We're going to use a nice Python function from [Stack Overflow](https://stackoverflow.com/a/3155023):


In [16]:
import math
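millnames = ['','K','M','B','T']  # unit suffixes, to format large numbers in a human-friendly K / M / B / T form
def millify(n):
    n = float(n)  # convert to float so the maths below works uniformly
    # work out which unit to use: log10(abs(n)) gives the order of magnitude
    # (abs() copes with negative numbers), dividing by 3 gives one step per
    # three digits (K / M / B), floor rounds down, and max/min clamp the
    # index to the valid range of millnames (never below 0)
    millidx = max(0,min(len(millnames)-1,int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))
    return '{:.0f}{}'.format(n / 10**(3 * millidx), millnames[millidx])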

millify(38332521)

'38M'
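
To get a feel for how the unit is chosen, here is the same function (repeated so that the cell runs on its own) applied to a handful of illustrative values of different magnitudes. The test values are arbitrary.

```python
import math

millnames = ['', 'K', 'M', 'B', 'T']

def millify(n):
    n = float(n)
    # index of the unit: one step per three decimal digits, clamped to the list
    millidx = max(0, min(len(millnames) - 1,
                         int(math.floor(0 if n == 0 else math.log10(abs(n)) / 3))))
    return '{:.0f}{}'.format(n / 10**(3 * millidx), millnames[millidx])

# print each value next to its human-friendly form
for value in [0, 999, 1234, 5_600_000, 7_000_000_000]:
    print(value, '->', millify(value))
```

Note that `'{:.0f}'` rounds to the nearest whole number, so 5,600,000 is shown as `6M` rather than `5M`.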

Ok, so how do we "apply" this function to our dataframe?

Well, let's take the Series for the Population column, and apply the function. That returns a new Series with the new string form of the column.

In [17]:
population_df["Population"].apply(millify)
# the function is called once for every row in this column

| | Population |
| --- | --- |
| 0 | 38M |
| 1 | 26M |
| 2 | 13M |


In this case, for *each* Population value, millify is called on the numeric value.

We can also apply a function to an entire dataframe. Here, we use a lambda function to call millify on the attribute we care about. The lambda is called for each row of population_df, and is passed that row as a Pandas Series. We use the square brackets notation to get the Population value. Finally, `axis=1` tells Pandas to operate row-by-row, rather than column-by-column (`axis=0`, the default).

In [18]:
population_df.apply(lambda row: millify(row["Population"]), axis=1)
# apply over the whole dataframe
# axis=1: go through it one row at a time
# row: one entire row of data

| | 0 |
| --- | --- |
| 0 | 38M |
| 1 | 26M |
| 2 | 13M |
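
The difference between the two axes is easiest to see side by side. The sketch below uses a small made-up dataframe (`numbers_df`, not part of the course data): with the default `axis=0` the function receives each column in turn, and with `axis=1` it receives each row.

```python
import pandas as pd

numbers_df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
})

# axis=0 (default): the function receives each column as a Series
column_totals = numbers_df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = numbers_df.apply(lambda row: row['a'] + row['b'], axis=1)

print(column_totals)
print(row_totals)
```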


Ok, so how can we make a new column? Well, we can assign columns to dataframes too.

In [19]:
population_df["Pop Human"] = population_df["Population"].apply(millify)
population_df

| | State | Population | Pop Human |
| --- | --- | --- | --- |
| 0 | California | 38332521 | 38M |
| 1 | Texas | 26448193 | 26M |
| 2 | Illinois | 12882135 | 13M |


Beautiful!
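
Assigning a column changes the dataframe in place. If you would rather keep the original untouched, `assign()` returns a new dataframe instead. The following is a minimal sketch; the column name `PopMillions` is made up for illustration.

```python
import pandas as pd

population_df = pd.DataFrame({
    'State': ['California', 'Texas', 'Illinois'],
    'Population': [38332521, 26448193, 12882135]
})

# assign() returns a new dataframe and leaves population_df unchanged
with_millions = population_df.assign(
    PopMillions=population_df['Population'] / 1e6
)

print(with_millions)
print(list(population_df.columns))
```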

### Pandas Exercises


#### Q1. Creating a Pandas dataframe

Create a dataframe `area_df` for the following information about states:

State| Area
--- | ---
California | 423967
Texas | 695662
New York | 141297
Florida | 170312
Illinois | 149995



In [20]:
#YOUR SOLUTION
area_df = pd.DataFrame({
      'State' : ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
      'Area' : [423967, 695662, 141297, 170312, 149995]
    })
area_df

| | State | Area |
| --- | --- | --- |
| 0 | California | 423967 |
| 1 | Texas | 695662 |
| 2 | New York | 141297 |
| 3 | Florida | 170312 |
| 4 | Illinois | 149995 |


#### Q2. What are the names of states which have an area less than 150,000?

In [21]:
#YOUR SOLUTION
area_df[area_df['Area'] < 150000]

| | State | Area |
| --- | --- | --- |
| 2 | New York | 141297 |
| 4 | Illinois | 149995 |
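
Since Q2 asks only for the *names*, one possible refinement is to combine the selection with a projection onto the `State` column. This sketch rebuilds `area_df` from the table in Q1 so that it runs on its own.

```python
import pandas as pd

area_df = pd.DataFrame({
    'State': ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
    'Area': [423967, 695662, 141297, 170312, 149995]
})

# restrict to the small states, then project just the State column
small_states = area_df.loc[area_df['Area'] < 150000, 'State']

print(small_states.tolist())
```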


# Part 2: PyTerrier installation & verification

PyTerrier is usable on *free* [Google Colab](https://colab.research.google.com/), or using a Jupyter notebook on your personal computer. The requirements for using PyTerrier this year are:
  - Linux, macOS or Windows
  - Python 3.9 - 3.13
  - Java 11 or newer
  - 3GB local free disk space

NB: Apple Silicon may need the use of [Anaconda Python](https://www.anaconda.com/download). Apple Silicon won't work for Exercise 2 (more information below).

The purpose of this notebook is for you to test your environment before the course starts.

Our recommended platform is Google Colab.

NB: If your personal environment does not work, you MUST resort to using Google Colab, as we cannot offer support for local installations.

## Check Python version

In [22]:
import sys
assert sys.version_info >= (3, 9), "Python too old - Python 3.9 or newer required"
assert sys.version_info < (3, 14), "Python too new! - Python 3.14+ not yet supported"
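
If you would rather see a report than an assertion error, a check along the following lines is one option. It is only a sketch: the helper name `python_supported` is made up, and the bounds are taken from the requirements list above (Python 3.9 to 3.13).

```python
import sys

def python_supported(version, minimum=(3, 9), below=(3, 14)):
    """Return True if minimum <= version < below, comparing (major, minor) only."""
    major_minor = tuple(version[:2])
    return minimum <= major_minor < below

# print the running Python version and whether it falls in the supported range
print(sys.version.split()[0], python_supported(sys.version_info))
```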

In [23]:
import os
import platform
import sys

# Apple Silicon: macOS ('darwin') reporting an ARM processor
apple_silicon = sys.platform == 'darwin' and platform.processor() == "arm"

# a conda-managed environment keeps a history file under <prefix>/conda-meta
is_conda = os.path.exists(os.path.join(sys.prefix, 'conda-meta', 'history'))

if apple_silicon:
    assert is_conda, "PyTerrier requires use of an Anaconda Python version"
    print("Running on Apple Silicon - let us know of any problems")
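
An alternative way to detect Apple Silicon is `platform.machine()`, which typically reports `arm64` on such machines when Python runs natively. The sketch below is a heuristic under that assumption, and the helper name is made up for illustration. An x86_64 Python running under Rosetta would not be detected.

```python
import platform
import sys

def looks_like_apple_silicon(sys_platform, machine):
    """Heuristic: macOS ('darwin') on an ARM64 machine."""
    return sys_platform == 'darwin' and machine.lower() in ('arm64', 'aarch64')

# check the machine this cell is running on
print(looks_like_apple_silicon(sys.platform, platform.machine()))
```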

## Install PyTerrier

The IRHM package installs `pyterrier` and other dependencies you will need. It may take a few minutes.

In [24]:
%pip install -q git+https://github.com/cmacdonald/irhm.git


Later in the year, Exercise 2 relies on a specific version of a package called LightGBM. This particular package version is available pre-compiled for Linux, Windows and macOS (x86_64), but not for e.g. Apple Silicon.

In [25]:
%pip install -q 'irhm[ex2] @ git+https://github.com/cmacdonald/irhm.git'


If the `ex2` installation fails (LightGBM), you can still do both the Warmup Lab and Exercise 1 on your personal machine, but you'll need to use Google Colab for Exercise 2.

## Use PyTerrier

 - If you have errors here, you may need to set your JAVA_HOME environment variable.
 - Any warnings about the "Panel class is removed from pandas" can be ignored.
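
If the import fails, a quick way to see what Java setup Python can find is sketched below. It only inspects the environment using the standard library; it does not check the Java version, and it does not reproduce how PyTerrier itself locates Java.

```python
import os
import shutil

java_home = os.environ.get('JAVA_HOME')   # None if the variable is not set
java_on_path = shutil.which('java')       # None if no `java` executable is found

print('JAVA_HOME:', java_home)
print('java on PATH:', java_on_path)
if java_home is None and java_on_path is None:
    print('No Java installation detected; install Java 11 or newer, or set JAVA_HOME.')
```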

In [26]:
import pyterrier as pt

Let's use a small index to test retrieval. Here we download the [`pyterrier/vaswani.terrier`](https://huggingface.co/datasets/pyterrier/vaswani.terrier) index from Hugging Face.

You may get a warning about `HF_TOKEN` in Colab. You do not need to worry about this because the index is publicly available.

In [27]:
index = pt.terrier.TerrierIndex.from_hf('pyterrier/vaswani.terrier')
retriever = index.bm25() # you'll learn what BM25 is later in the course
retriever

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


extracting data.direct.bf [387.6 KB]
extracting data.document.fsarrayfile [234.4 KB]
extracting data.inverted.bf [361.9 KB]
extracting data.lexicon.fsomapfile [681.7 KB]
extracting data.lexicon.fsomaphash [777 B]
extracting data.lexicon.fsomapid [30.3 KB]
extracting data.meta-0.fsomapfile [725.5 KB]
extracting data.meta.idx [89.3 KB]
extracting data.meta.zdata [223.5 KB]
extracting data.properties [4.3 KB]
extracting files [272 B]
extracting md5sums [619 B]
extracting pt_meta.json [79 B]
terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started (triggered by Retriever.__init__) and loaded: pyterrier.java.colab, pyterrier.java, pyterrier.java.24, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


Input columns:

Column | Type | Description
--- | --- | ---
qid | str | (Query ID) ID of query in frame
query | str | Query text

Parameters:

Parameter | Value
--- | ---
index_location | `<org.terrier.querying.IndexRef at 0x7df4f26b8730 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x1c0a9dd8 at 0x7df4f2f23210>>`
num_results | 1000
metadata | ['docno']
wmodel | BM25
threads | 1
verbose | False
terrierql | on
parsecontrols | on
parseql | on
applypipeline | on

Output columns:

Column | Type | Description
--- | --- | ---
qid | str | (Query ID) ID of query in frame
docid | int | (Internal Document ID) Integer ID of document in a specific index
docno | str | (External Document ID) String ID of document in collection
rank | int | Ranking order of document to query (lower=better)
score | float | Ranking score of document to query (higher=better)
query | str | Query text


When the result of a cell is a PyTerrier transformer object, it appears as a visual "schematic". Here, we can see that BM25 takes `Q` ("queries") as input and returns `R` ("results"). We will cover more about this later in the course. But for now, let's try running it by providing an example query.

In [28]:
retriever.search("chemical reactions").head(10)

| | qid | docid | docno | rank | score | query |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 9373 | 9374 | 0 | 22.076426 | chemical reactions |
| 1 | 1 | 8765 | 8766 | 1 | 20.498801 | chemical reactions |
| 2 | 1 | 7048 | 7049 | 2 | 20.159044 | chemical reactions |
| 3 | 1 | 4686 | 4687 | 3 | 19.323491 | chemical reactions |
| 4 | 1 | 10702 | 10703 | 4 | 13.472012 | chemical reactions |
| 5 | 1 | 2999 | 3000 | 5 | 12.88185 | chemical reactions |
| 6 | 1 | 5433 | 5434 | 6 | 12.88185 | chemical reactions |
| 7 | 1 | 1055 | 1056 | 7 | 12.517082 | chemical reactions |
| 8 | 1 | 2420 | 2421 | 8 | 12.384649 | chemical reactions |
| 9 | 1 | 3079 | 3080 | 9 | 12.327595 | chemical reactions |
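
Because the results come back as an ordinary Pandas dataframe, the selection and projection operations from Part 1 apply directly. The sketch below re-types the first five rows of the table above by hand, so that it runs without PyTerrier. The score threshold of 20 is arbitrary.

```python
import pandas as pd

# first five rows of the result table above, re-typed by hand for illustration
results = pd.DataFrame({
    'qid': ['1'] * 5,
    'docno': ['9374', '8766', '7049', '4687', '10703'],
    'rank': [0, 1, 2, 3, 4],
    'score': [22.076426, 20.498801, 20.159044, 19.323491, 13.472012],
})

# selection (score above 20) followed by projection (docno and score)
strong = results[results['score'] > 20][['docno', 'score']]

print(strong)
```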


We will also quickly run an experiment. This will download some evaluation data.

In [29]:
dataset = pt.get_dataset('vaswani')
pt.Experiment(
    [retriever],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map", "recip_rank"]
)

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


query-text.trec:   0%|          | 0.00/3.05k [00:00<?, ?iB/s]

Downloading vaswani qrels to /root/.pyterrier/corpora/vaswani/qrels


qrels:   0%|          | 0.00/6.63k [00:00<?, ?iB/s]

| | name | map | recip_rank |
| --- | --- | --- | --- |
| 0 | TerrierRetr(BM25) | 0.296517 | 0.725665 |


## Check disk space

You will need about 3GB of free disk space in your home directory to conduct the IR(M) experiments.


In [30]:
import shutil
_,_, free = shutil.disk_usage(pt.io.pyterrier_home())
assert free > 3 * 1024 * 1024 * 1024, "You don't have enough free disk space"
print("You are good for disk space")

You are good for disk space
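
If the assertion fails, it helps to know how much space you actually have. The sketch below reports the free space in GB using only the standard library. It assumes that your home directory is a reasonable stand-in for wherever PyTerrier stores its files, and falls back to the current directory if no home directory exists.

```python
import os
import shutil

# assumption: the home directory is on the same disk PyTerrier will use
path = os.path.expanduser('~')
if not os.path.isdir(path):
    path = os.getcwd()

total, used, free = shutil.disk_usage(path)
free_gb = free / 1024**3   # bytes -> GB (1024-based, as in the check above)

print(f"{free_gb:.1f} GB free at {path}")
print("enough" if free_gb > 3 else "less than the 3GB needed")
```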


## Ok, validation completed

You have one action remaining - you must now complete the User Agreement Submission instance, before the deadline stated on the Moodle page.


