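## Data Structure Design

Before collecting anything, we need to decide what data we need and what we will do with it.

That means making the goal concrete.

    Goal: use the injury histories of players in the top five European leagues to analyze injury patterns and make predictions.

To do this we need a model: an abstraction of the player, and an abstraction of the player's injury behavior.

    Player abstraction: position, height, weight, playing style, average minutes per match, average distance covered per match.
    Injury behavior abstraction: a one-way linear structure. It can be viewed either as a function of time or as a sequence.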
   

<figure>
    <img src="images/injury_datastructure.png" alt="missing" width="800">
</figure>

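What is actually available: position, height, nationality (weight is not provided, probably because it lacks a uniform standard and fluctuates; we can do without it)<br>
Availability rate, average minutes per match (playing style is too abstract to describe; average distance covered is not provided)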

<br>

___

In [28]:
import requests
from bs4 import BeautifulSoup
import re
import time
import random

# Shared request headers that mimic a desktop browser.
# Note: advertising 'br' assumes a Brotli decoder (e.g. the `brotli` package) is installed;
# otherwise a Brotli-compressed response cannot be decoded.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive'
}

def get_page_source(url):
    """Fetch a page once; return its HTML text, or None on failure."""
    try:
        # A timeout keeps a stalled connection from hanging the notebook.
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        response.encoding = "utf-8"
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def get_page_source_r(url, max_retries=4, delay=20):
    """Fetch a page with retries; raise once all attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()
            response.encoding = "utf-8"
            return response.text

        except requests.exceptions.RequestException as e:
            print(f"Request failed (attempt {attempt}) for {url}: {e}")
            if attempt < max_retries:
                print(f"Waiting {delay}s before retrying...")
                time.sleep(delay)
            else:
                # Chain the last error so the root cause stays visible.
                raise Exception(f"Failed to fetch {url} after {max_retries} retries.") from e



>`namedtuple`: a lightweight, immutable struct. It is implemented on top of `tuple` and is typically faster than a `dataclass`.
>
>Usage:
>``` python 
>from collections import namedtuple
>
>MyClass = namedtuple('PlayerNT', ['position', 'height', 'nationality'])
>p = MyClass('Forward', '185', 'Germany')
>
>print(type(p))  # <class '__main__.PlayerNT'>
>print(p)        # PlayerNT(position='Forward', height='185', nationality='Germany')
>```


In [38]:
from collections import namedtuple

# Use namedtuple as a simple data container
Healthy = namedtuple('Healthy',['days'])
Injured = namedtuple('Injured',['days', 'injury','games_missed'])    # int str int


class Player:
    def __init__(self, name, position, height, nationality):  
        if not all(isinstance(arg, str) for arg in (name, position, height, nationality)):
            raise TypeError("All attributes must be strings.")
        self.name = name
        self.position = position
        self.height = height
        self.nationality = nationality
        self.injury_history = []
    
    def add_healthy(self, days):
        history = self.injury_history
        if history and isinstance(history[-1], Healthy):
            raise TypeError("Last history is healthy")
        history.append(Healthy(days))
        
    def add_injured(self, days, injury, games_missed):
        history = self.injury_history
        if history and isinstance(history[-1], Injured):
            raise TypeError("Last history is injured")
        history.append(Injured(days, injury, games_missed))
        
    def get_all_injured(self):
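        # Injured entries sit at odd indices as long as the history starts with Healthy.
        # Slicing copies the selected items, so this is O(n).
        return self.injury_history[1::2]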
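
The "function of time" view of the injury sequence described earlier can be sketched as a lookup: given a day offset, find the spell that covers it. This is only a minimal sketch and not part of the class above. `status_at` is a hypothetical helper, the sample values are made up for illustration, and it assumes every entry carries a non-negative `days` value.

```python
from collections import namedtuple

# Same containers as in the cell above, repeated so this sketch runs on its own.
Healthy = namedtuple('Healthy', ['days'])
Injured = namedtuple('Injured', ['days', 'injury', 'games_missed'])


def status_at(history, day):
    """Return the history entry that covers `day` (0-based), or None if out of range.

    Hypothetical helper: assumes every entry has a non-negative `days` value.
    """
    if day < 0:
        return None
    elapsed = 0
    for entry in history:
        elapsed += entry.days      # first day index NOT covered by this entry
        if day < elapsed:
            return entry
    return None


# Illustrative values only, not real player data.
history = [
    Healthy(100),
    Injured(19, 'Muscle injury', 5),
    Healthy(459),
    Injured(2, 'Knock', 0),
]

print(status_at(history, 0))       # inside the first healthy spell
print(status_at(history, 105))     # inside the muscle injury
print(status_at(history, 10_000))  # beyond the recorded timeline
```

The lookup is a linear scan, which is fine for a few dozen entries per player; a prefix-sum array with binary search would be the natural next step if many lookups were needed.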

        

<br>

---
TEST

In [46]:
p = Player('Rodri','Defensive Midfield','1,91 m', 'Spain')
# print(dir(p))
print(p.__dict__)

{'name': 'Rodri', 'position': 'Defensive Midfield', 'height': '1,91 m', 'nationality': 'Spain', 'injury_history': []}


In [47]:
p.add_healthy(123)
p.add_injured(0, 'no injury', 0)

In [48]:
print(p.__dict__)

{'name': 'Rodri', 'position': 'Defensive Midfield', 'height': '1,91 m', 'nationality': 'Spain', 'injury_history': [Healthy(days=123), Injured(days=0, injury='no injury', games_missed=0)]}


<br>

## Fetching the Data and Storing It in the Data Structure

In [67]:
site = 'https://www.transfermarkt.com'

In [29]:
url = 'https://www.transfermarkt.com/rodri/verletzungen/spieler/357565'
html = get_page_source(url)

In [31]:
soup = BeautifulSoup(html, 'html.parser')
soup


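Injury data from before a player officially joins the first team is incomplete, so we cannot tell whether the `days` of the first `Healthy` entry really equals the time from the start of the career to the first injury. We therefore set the first element of the `injury_history` list to `Healthy(-1)`; the next element, `Injured(...)`, is the first recorded injury.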
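
The convention above can be sketched as a small conversion step from scraped rows to an alternating timeline. This is a hedged sketch, not the notebook's own implementation: `parse_int` and `build_history` are hypothetical helpers, the rows are assumed to arrive newest first with both dates present, the `'util'` key follows the spelling used by the scraping cells in this notebook, `%b` assumes an English locale, and counting both boundary dates as injured days is an assumption about how the site counts.

```python
from collections import namedtuple
from datetime import datetime

Healthy = namedtuple('Healthy', ['days'])
Injured = namedtuple('Injured', ['days', 'injury', 'games_missed'])

DATE_FORMAT = '%b %d, %Y'   # e.g. 'Sep 23, 2024'


def parse_int(text):
    """Return the leading integer of strings like '236 days'; '-' or '' become 0."""
    first = text.split()[0] if text.strip() else ''
    return int(first) if first.isdigit() else 0


def build_history(records):
    """Turn scraped rows (newest first) into an alternating Healthy/Injured list."""
    history = [Healthy(-1)]          # unknown length of the first healthy spell
    previous_end = None
    for rec in reversed(records):    # walk from the oldest injury to the newest
        start = datetime.strptime(rec['from'], DATE_FORMAT)
        end = datetime.strptime(rec['util'], DATE_FORMAT)
        if previous_end is not None:
            # Assumption: both boundary dates count as injured days.
            gap = (start - previous_end).days - 1
            history.append(Healthy(max(gap, 0)))   # clamp overlapping spells to 0
        history.append(Injured(parse_int(rec['days']), rec['injury'],
                               parse_int(rec['games_missed'])))
        previous_end = end
    return history


# Two Rodri rows, copied from the scraped output shown in this notebook.
records = [
    {'injury': 'Knock', 'from': 'Feb 11, 2021', 'util': 'Feb 12, 2021',
     'days': '2 days', 'games_missed': '-'},
    {'injury': 'Muscle injury', 'from': 'Oct 22, 2019', 'util': 'Nov 9, 2019',
     'days': '19 days', 'games_missed': '5'},
]
history = build_history(records)
print(history)
```

With these two rows the result is `Healthy(-1)`, the 19-day muscle injury, a 459-day healthy spell, and the 2-day knock.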

Inspecting the page shows that the injury list is paginated. Fortunately, it is not rendered by JavaScript.

``` html
<!-- first box -->
<div class="box">
    ...
    <div class="responsive-table">
        <div id="yw1" class="grid-view">
            <!-- data table -->
            <table class="items">...</table>

            <!-- pager; absent when there is only one page -->
            <div class="pager">
                <ul class="tm-pagination">
                    <!-- each li is one page -->
                    <li class="tm-pagination__list-item tm-pagination__list-item--active">
                        <a href="..." title="Page 1" class="tm-pagination__link">1</a>
                    </li>
                    <li class="tm-pagination__list-item">
                        ...
                    </li>
                    ...
                </ul>
            </div>
        </div>
    </div>
</div>
```

In [37]:
table = soup.select_one('#yw1 table.items')
rows = table.select(".even, .odd")
print(rows[0].prettify())

<tr class="odd">
 <td class="zentriert">
  24/25
 </td>
 <td class="hauptlink">
  Cruciate ligament tear
 </td>
 <td class="zentriert">
  Sep 23, 2024
 </td>
 <td class="zentriert">
  May 16, 2025
 </td>
 <td class="rechts">
  236 days
 </td>
 <td class="rechts hauptlink wappen_verletzung">
  <a href="/manchester-city/startseite/verein/281/saison_id/2024" title="Manchester City">
   <img alt="Manchester City" class="tiny_wappen" src="https://tmssl.akamaized.net//images/wappen/tiny/281.png?lm=1467356331" title="Manchester City"/>
  </a>
  <span>
   47
  </span>
 </td>
</tr>



In [50]:
injury_infos = []
for row in rows:
    injury_data = {}
    tds = row.find_all('td')
    if len(tds) < 6:
        raise IndexError("wrong table")
    injury_data['injury'] = tds[1].get_text(strip=True)
    injury_data['from'] = tds[2].get_text(strip=True)
    injury_data['util'] = tds[3].get_text(strip=True)
    injury_data['days'] = tds[4].get_text(strip=True)
    injury_data['games_missed'] = tds[5].get_text(strip=True)
    injury_infos.append(injury_data)

Because the stored objects include types that have to be imported, `namedtuple` must be imported before loading the objects' data from JSON.
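
The notebook does not show how the JSON is written, so the following is only one possible scheme, sketched under that assumption. Passing a namedtuple directly to `json.dumps` produces a plain JSON array, which loses the entry type, so this sketch stores each entry as a dict with a `type` tag. `history_to_json`, `history_from_json`, and `ENTRY_TYPES` are hypothetical names.

```python
import json
from collections import namedtuple

# The classes must exist before loading, since the loader rebuilds them by name.
Healthy = namedtuple('Healthy', ['days'])
Injured = namedtuple('Injured', ['days', 'injury', 'games_missed'])

# Map the tag stored in the JSON back to the class that rebuilds the entry.
ENTRY_TYPES = {'Healthy': Healthy, 'Injured': Injured}


def history_to_json(history):
    """Serialize entries as tagged dicts so the entry type survives the round trip."""
    payload = [{'type': type(entry).__name__, **entry._asdict()} for entry in history]
    return json.dumps(payload)


def history_from_json(text):
    """Rebuild Healthy/Injured entries from the tagged dicts."""
    history = []
    for item in json.loads(text):
        fields = dict(item)
        cls = ENTRY_TYPES[fields.pop('type')]   # KeyError on an unknown tag
        history.append(cls(**fields))
    return history


history = [Healthy(-1), Injured(19, 'Muscle injury', 5)]
restored = history_from_json(history_to_json(history))
print(restored == history)   # True
```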

In [51]:
injury_infos

[{'injury': 'Cruciate ligament tear',
  'from': 'Sep 23, 2024',
  'util': 'May 16, 2025',
  'days': '236 days',
  'games_missed': '47'},
 {'injury': 'Hamstring injury',
  'from': 'Jul 14, 2024',
  'util': 'Aug 25, 2024',
  'days': '43 days',
  'games_missed': '3'},
 {'injury': 'Knock',
  'from': 'Feb 11, 2021',
  'util': 'Feb 12, 2021',
  'days': '2 days',
  'games_missed': '-'},
 {'injury': 'Muscle injury',
  'from': 'Oct 22, 2019',
  'util': 'Nov 9, 2019',
  'days': '19 days',
  'games_missed': '5'}]

---
### Scraping a paginated list
Neymar's injury list runs to three pages.

In [52]:
url = 'https://www.transfermarkt.com/neymar/verletzungen/spieler/68290'
html = get_page_source(url)
soup = BeautifulSoup(html, 'html.parser')

First, collect the URLs of all pages that are not currently displayed.

In [68]:
Table = soup.select_one('#yw1')  # container holding both the data table and the pager
table1 = soup.select_one('#yw1 table.items')   # data table of the first page
Table

<div class="grid-view" id="yw1">
<div class="summary"></div>
<table class="items">
<thead>
<tr>
<th class="zentriert" id="yw1_c0"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/saison_id.desc">Season</a></th><th class="" id="yw1_c1"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/translated.desc">Injury</a></th><th class="zentriert" id="yw1_c2"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/datum_von.desc">from</a></th><th class="zentriert" id="yw1_c3"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/datum_bis.desc">until</a></th><th class="rechts" id="yw1_c4"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/tage.desc">Days</a></th><th class="rechts" id="yw1_c5"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/anzahl_spiele.desc">Games missed</a></th></tr>
</thead>
<tbody>
<tr class="odd">
<td class="zentriert">24/25</td><td class="hauptlink">Hamstring injury</td><td class="

In [75]:
print(table1,20*'\n')

<table class="items">
<thead>
<tr>
<th class="zentriert" id="yw1_c0"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/saison_id.desc">Season</a></th><th class="" id="yw1_c1"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/translated.desc">Injury</a></th><th class="zentriert" id="yw1_c2"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/datum_von.desc">from</a></th><th class="zentriert" id="yw1_c3"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/datum_bis.desc">until</a></th><th class="rechts" id="yw1_c4"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/tage.desc">Days</a></th><th class="rechts" id="yw1_c5"><a class="sort-link" href="/neymar/verletzungen/spieler/68290/sort/anzahl_spiele.desc">Games missed</a></th></tr>
</thead>
<tbody>
<tr class="odd">
<td class="zentriert">24/25</td><td class="hauptlink">Hamstring injury</td><td class="zentriert">Apr 17, 2025</td><td class="zentriert">May 11, 202

In [77]:
all_li = Table.select('li.tm-pagination__list-item')
last_li = [li for li in all_li if li.get('class') == ['tm-pagination__list-item']]     # keep only the pages that are not currently displayed

last_hrefs = [li.find('a')['href'] for li in last_li if li.find('a')]

all_table = [table1]
for hrefs in last_hrefs:
    html = get_page_source_r(site+hrefs)
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select_one('#yw1 table.items')
    all_table.append(table)
    time.sleep(random.uniform(8,10))

In [79]:
neymar_injury_infos = []

for table in all_table:
    rows = table.select(".even, .odd")
    for row in rows:
        injury_data = {}
        tds = row.find_all('td')
        if len(tds) < 6:
            raise IndexError("wrong table")
        injury_data['injury'] = tds[1].get_text(strip=True)
        injury_data['from'] = tds[2].get_text(strip=True)
        injury_data['util'] = tds[3].get_text(strip=True)
        injury_data['days'] = tds[4].get_text(strip=True)
        injury_data['games_missed'] = tds[5].get_text(strip=True)
        neymar_injury_infos.append(injury_data)

In [83]:
print(len(neymar_injury_infos))  # expected output: 43
neymar_injury_infos

43


[{'injury': 'Hamstring injury',
  'from': 'Apr 17, 2025',
  'util': 'May 11, 2025',
  'days': '25 days',
  'games_missed': '5'},
 {'injury': 'Hamstring injury',
  'from': 'Mar 4, 2025',
  'util': 'Apr 12, 2025',
  'days': '40 days',
  'games_missed': '3'},
 {'injury': 'Fitness',
  'from': 'Dec 17, 2024',
  'util': 'Feb 3, 2025',
  'days': '49 days',
  'games_missed': '7'},
 {'injury': 'Hamstring injury',
  'from': 'Nov 4, 2024',
  'util': 'Dec 16, 2024',
  'days': '43 days',
  'games_missed': '7'},
 {'injury': 'Fitness',
  'from': 'Sep 23, 2024',
  'util': 'Nov 3, 2024',
  'days': '42 days',
  'games_missed': '9'},
 {'injury': 'Cruciate ligament tear',
  'from': 'Oct 19, 2023',
  'util': 'Sep 22, 2024',
  'days': '340 days',
  'games_missed': '48'},
 {'injury': 'muscular problems',
  'from': 'Aug 4, 2023',
  'util': 'Sep 3, 2023',
  'days': '31 days',
  'games_missed': '5'},
 {'injury': 'Ankle surgery',
  'from': 'Feb 20, 2023',
  'util': 'Jun 30, 2023',
  'days': '131 days',
  'games_

In [94]:
def get_all_injuries_r(url):
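    html = get_page_source_r(url)   # retrying fetch: raises instead of returning None
    soup = BeautifulSoup(html, 'html.parser')

    Table = soup.select_one('#yw1')
    table1 = soup.select_one('#yw1 table.items')
    if Table is None or table1 is None:
        return []   # page has no injury table (assumed to mean no recorded injuries)

    # From here on, if there is no pager, the three lists below are all empty and the code still runs
    all_li = Table.select('li.tm-pagination__list-item')
    last_li = [li for li in all_li if li.get('class') == ['tm-pagination__list-item']]     # keep only the pages that are not currently displayed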
    
    last_hrefs = [li.find('a')['href'] for li in last_li if li.find('a')]
#     print(all_li, last_li, last_hrefs)

    all_table = [table1]
    for hrefs in last_hrefs:
        html = get_page_source_r(site+hrefs)
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.select_one('#yw1 table.items')
        all_table.append(table)
        time.sleep(random.uniform(8,10))

    injury_infos = []
    for table in all_table:
        rows = table.select(".even, .odd")
        for row in rows:
            injury_data = {}
            tds = row.find_all('td')
            if len(tds) < 6:
                raise IndexError("wrong table")
            injury_data['injury'] = tds[1].get_text(strip=True)
            injury_data['from'] = tds[2].get_text(strip=True)
            injury_data['util'] = tds[3].get_text(strip=True)
            injury_data['days'] = tds[4].get_text(strip=True)
            injury_data['games_missed'] = tds[5].get_text(strip=True)
            injury_infos.append(injury_data)
            
    return injury_infos

In [95]:
get_all_injuries_r('https://www.transfermarkt.com/rodri/verletzungen/spieler/357565')

[{'injury': 'Cruciate ligament tear',
  'from': 'Sep 23, 2024',
  'util': 'May 16, 2025',
  'days': '236 days',
  'games_missed': '47'},
 {'injury': 'Hamstring injury',
  'from': 'Jul 14, 2024',
  'util': 'Aug 25, 2024',
  'days': '43 days',
  'games_missed': '3'},
 {'injury': 'Knock',
  'from': 'Feb 11, 2021',
  'util': 'Feb 12, 2021',
  'days': '2 days',
  'games_missed': '-'},
 {'injury': 'Muscle injury',
  'from': 'Oct 22, 2019',
  'util': 'Nov 9, 2019',
  'days': '19 days',
  'games_missed': '5'}]

---
### Scraping all players
The results are stored in a dictionary.