# University Application Letter Generator

- Purpose: when applying to graduate schools, writing the statement of purpose takes a tremendous amount of time. Application agencies charge tens of thousands of CNY, and they often make ***mistakes***.

- Work: use Python to generate 90 copies of the statement of purpose efficiently

- Details:
    - step1: find an application letter template online
    - step2: replace the personal information with your own to get an `MS Word template` (one page)
    - step3: create an `excel list` of 30 universities from the economics department ranking at https://ideas.repec.org/top/top.econdept.html (10 from the top 30, 10 from the top 60, and 10 from the top 90), and keep only the university names
    - step4: create an `excel list` of 3 research areas you are interested in (economics, management, finance, information management, etc.) from https://www.scmor.com/view/10554
    - step5: add to the `excel list` from step4 the top journals for each research area you selected (3 for each area)
    - step6: add to the `excel list` from step4 the skills you find on [Glassdoor Jobs](https://www.glassdoor.com.hk/Job/index.htm)
    - step7: loop over the two lists above to fill the template
    - step8: use docxtpl to generate the MS Word documents
    - step9: use docx2pdf to generate the PDF documents (Windows users only)
    - step10: create a subdirectory named "HW_School_Application" under your home directory and upload your `code, excel lists, Word template, and only 1 copy of the PDF or Word output` to ***GitHub*** by next class

Example:
    
Dear Admission Committee,

My name is `Lei Ge`, and I am pleased to apply for the `Master of Finance program` at `Renmin University of China`.

In my free time, I enjoy reading top-tier academic research to stay updated with the latest advancements in `finance`. I occasionally study articles from leading ABS 4+ rated journals such as the `Review of Financial Studies (RFS), Journal of Finance (JF), and Management Science (MS)`, among others. This habit not only deepens my understanding of theoretical and empirical approaches in finance but also sharpens my ability to critically analyze complex economic phenomena.

I want to be a `quant researcher`. To achieve my dream, I have practical skills such as `Python, SQL, Math, PowerBI, Tableau, etc.`

I am particularly drawn to Renmin University of China due to its strong academic environment and research-oriented approach. 

Thank you for considering my application. I am eager to contribute to and benefit from the rigorous academic culture at `Renmin University of China`.

Sincerely,

Lei Ge







- Tools: functions, loops, docxtpl, docx2pdf
- Results: you get 90 copies of the statement of purpose
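
The figure of 90 comes from pairing each of the 30 universities with each of the 3 research areas. A minimal sketch of that pairing, assuming placeholder names (`University 1`, ...) in place of the real Excel lists:

```python
from itertools import product

# Placeholder data standing in for the two Excel lists (hypothetical names).
universities = [f"University {i}" for i in range(1, 31)]              # 30 schools
research_areas = ["economics", "finance", "information management"]  # 3 areas

# Every (university, area) pair becomes the context for one letter.
letters = [
    {"university_name": uni, "research_area": area}
    for uni, area in product(universities, research_areas)
]

print(len(letters))  # 30 * 3 = 90
print(letters[0])
```

`product` iterates the areas fastest, so the three letters for one university come out next to each other.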

In [24]:
# Step1: build the SOP template for docxtpl (with date and contact fields)
import os
from docx import Document

cwd = os.getcwd()
TARGET_PATH = os.path.join(cwd, "sop_template_docxtpl.docx")

doc = Document()

# Opening
doc.add_paragraph("Dear Admission Committee,")

doc.add_paragraph(
    "My name is {{ applicant_name }}, and I am pleased to apply for the {{ program_name }} at {{ university_name }}."
)

doc.add_paragraph(
    "In my free time, I enjoy reading top-tier academic research to stay updated with the latest advancements in "
    "{{ research_area }}. I occasionally study articles from leading journals such as the {{ top_journals }}, among others. "
    "This habit not only deepens my understanding of theoretical and empirical approaches but also sharpens my ability "
    "to critically analyze complex economic phenomena."
)

doc.add_paragraph(
    "I want to be a {{ career_goal }}. To achieve my dream, I have practical skills such as {{ skills }}."
)

doc.add_paragraph(
    "I am particularly drawn to {{ university_name }} due to its strong academic environment and research-oriented approach."
)

doc.add_paragraph(
    "Thank you for considering my application. I am eager to contribute to and benefit from the rigorous academic culture at "
    "{{ university_name }}."
)

# Closing and signature
doc.add_paragraph("Sincerely,")
doc.add_paragraph("{{ applicant_name }}")

# Date and contact details
doc.add_paragraph("Date: {{ date }}")
doc.add_paragraph("Contact: {{ contact }}")

# Save
doc.save(TARGET_PATH)
print("Step1 done (docxtpl template); template saved in the current directory:", TARGET_PATH)


Step1 done (docxtpl template); template saved in the current directory: /Users/luok/Desktop/2023200211/sop_template_docxtpl.docx
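
docxtpl renders the template with Jinja2, which by default turns an undefined variable into an empty string, so a misspelled key leaves a silent blank in the letter. The sketch below is one way to catch that before rendering. It assumes the template text is available as a plain string and handles only simple `{{ name }}` tags (no filters or expressions); `find_placeholders` and `missing_keys` are hypothetical helper names, not part of docxtpl.

```python
import re

# Matches simple Jinja2 variable tags such as {{ university_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_placeholders(text: str) -> set:
    """Return the variable names used in {{ ... }} tags within text."""
    return set(PLACEHOLDER.findall(text))


def missing_keys(text: str, context: dict) -> set:
    """Return placeholders in text that the context does not provide."""
    return find_placeholders(text) - set(context)


template_text = (
    "My name is {{ applicant_name }}, and I am pleased to apply for the "
    "{{ program_name }} at {{ university_name }}. Date: {{ date }}"
)
context = {"applicant_name": "Lei Ge", "program_name": "Master of Finance program"}

print(sorted(missing_keys(template_text, context)))  # ['date', 'university_name']
```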


In [16]:
# Step2 (docxtpl): render sop_template_docxtpl.docx into a finished letter (with date and contact details)
import os, sys, subprocess
from datetime import date

# Make sure docxtpl is available
try:
    from docxtpl import DocxTemplate  # type: ignore[reportMissingImports]
except Exception:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "docxtpl"], check=False)
    from docxtpl import DocxTemplate  # type: ignore[reportMissingImports]

cwd = os.getcwd()
TEMPLATE_PATH = os.path.join(cwd, "sop_template_docxtpl.docx")
OUTPUT_DOCX = os.path.join(cwd, "sop_docxtpl_filled.docx")

if not os.path.exists(TEMPLATE_PATH):
    raise FileNotFoundError(f"Template not found: {TEMPLATE_PATH}; run Step1 first to generate it")

# Personal information (edit as needed)
context = {
    "applicant_name": "Qimin Lin",
    "program_name": "Master of Finance program",
    "university_name": "Renmin University of China",
    "research_area": "finance",
    "top_journals": "Review of Financial Studies (RFS), Journal of Finance (JF), Management Science (MS)",
    "career_goal": "quant researcher",
    "skills": "Python, SQL, Math, solid grasp of economic and financial knowledge, proficient application of AI tools for academic and analytical scenarios",
    "date": date.today().isoformat(),
    "contact": "+86-139-0000-0000 | linqimin@qq.com",
}

# Render
tpl = DocxTemplate(TEMPLATE_PATH)
tpl.render(context)
tpl.save(OUTPUT_DOCX)
print("Step2 (docxtpl) done; generated in the current directory:", OUTPUT_DOCX)


Step2 (docxtpl) done; generated in the current directory: /Users/luok/Desktop/2023200211/sop_docxtpl_filled.docx
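
Each cell in this notebook repeats the same "try to import, otherwise pip install" pattern. As a sketch only, that pattern could be factored into one helper. `ensure_package` is a hypothetical name, and the approach assumes `pip` is available for the running interpreter.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import Optional


def ensure_package(import_name: str, pip_name: Optional[str] = None) -> bool:
    """Return True if import_name is importable, trying a pip install first if it is not."""
    if importlib.util.find_spec(import_name) is not None:
        return True
    # The pip name can differ from the import name (e.g. beautifulsoup4 vs bs4).
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", pip_name or import_name],
        check=False,
    )
    importlib.invalidate_caches()
    return importlib.util.find_spec(import_name) is not None


# A standard-library module is already importable, so nothing gets installed.
print(ensure_package("json"))
```

A call such as `ensure_package("bs4", "beautifulsoup4")` would cover the case where the two names differ.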


In [17]:
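# Step3: scrape the RePEc economics department ranking and build an Excel file with university names only (30 schools)
# Requirement: scrape the economics department ranking from https://ideas.repec.org/top/top.econdept.html,
#       pick 10 schools from the Top 30, 10 from the Top 60 (ranks 31-60), and 10 from the Top 90 (ranks 61-90),
#       keep only the university/institution names, and save them to an Excel file.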

import os
import sys
import subprocess
from typing import List, Dict, Tuple

# Make sure the dependencies are installed (pandas needs openpyxl to write .xlsx files)
for pkg in ["requests", "beautifulsoup4", "pandas", "lxml", "openpyxl"]:
    try:
        __import__(pkg if pkg != "beautifulsoup4" else "bs4")
    except Exception:
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", pkg], check=False)

import requests
from bs4 import BeautifulSoup  # type: ignore
import pandas as pd  # type: ignore

REPEC_URL = "https://ideas.repec.org/top/top.econdept.html"
OUTPUT_XLSX = os.path.join(os.getcwd(), "step3_universities.xlsx")


def fetch_html(url: str) -> str:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,zh-CN,zh;q=0.8",
    }
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text


def locate_ranking_table(soup: BeautifulSoup):
    """Locate the table on the page that has Rank / Institution columns."""
    for table in soup.find_all("table"):
        # Collect the header cells
        headers = []
        thead = table.find("thead")
        if thead:
            ths = thead.find_all("th")
            headers = [th.get_text(strip=True) for th in ths]
        else:
            # There may be no <thead>; fall back to the first row as the header
            first_tr = table.find("tr")
            if first_tr:
                headers = [th.get_text(strip=True) for th in first_tr.find_all(["th", "td"])]
        # Check whether this is the target table
        if not headers:
            continue
        normalized = [h.lower() for h in headers]
        if "rank" in normalized and "institution" in normalized:
            return table
    return None


def parse_departments(html: str) -> List[Tuple[int, str]]:
    soup = BeautifulSoup(html, "lxml")
    table = locate_ranking_table(soup)
    if table is None:
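        raise RuntimeError("Could not find a ranking table with Rank/Institution columns; check whether the page structure has changed")

    rows = []
    # Header rows are skipped: they have fewer than two <td> cells or a non-numeric first cell
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue
        # Parse the rank
        rank_text = tds[0].get_text(strip=True)
        try:
            rank = int(rank_text.split()[0])
        except Exception:
            continue
        # Parse the institution name (prefer the text of the <a> element)
        inst_cell = tds[1]
        a = inst_cell.find("a")
        if a and a.get_text(strip=True):
            name = a.get_text(strip=True)
        else:
            # Fallback: use the full cell text (it may still include the location)
            name = inst_cell.get_text(" ", strip=True)
        rows.append((rank, name))

    # Deduplicate by rank (keep the first occurrence) and sort by rank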
    dedup = {}
    for r, n in rows:
        if r not in dedup:
            dedup[r] = n
    items = sorted(dedup.items(), key=lambda x: x[0])
    return items


def select_top_30(items: List[Tuple[int, str]]) -> List[str]:
    """Take the first 10 schools from each of ranks 1-30, 31-60, and 61-90 (30 in total). Returns names only."""
    seg1 = [n for r, n in items if 1 <= r <= 30][:10]
    seg2 = [n for r, n in items if 31 <= r <= 60][:10]
    seg3 = [n for r, n in items if 61 <= r <= 90][:10]
    combined = seg1 + seg2 + seg3
    if len(seg1) < 10 or len(seg2) < 10 or len(seg3) < 10:
        print("Warning: some rank bands have fewer than 10 schools; returning what is available.")
    return combined


def save_to_excel(names: List[str], path: str) -> None:
    df = pd.DataFrame({"university_name": names})
    # A single column, so that only the university names are kept
    df.to_excel(path, index=False)


if __name__ == "__main__":
    html = fetch_html(REPEC_URL)
    items = parse_departments(html)
    names = select_top_30(items)
    save_to_excel(names, OUTPUT_XLSX)
    print(f"Step3 done: saved the names of 30 universities to Excel: {OUTPUT_XLSX}")



Step3 done: saved the names of 30 universities to Excel: /Users/luok/Desktop/2023200211/step3_universities.xlsx
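
The cell above always takes the first 10 schools in each rank band. The assignment only asks for 10 from each band, so a seeded random sample would also satisfy it. The sketch below shows both options on a synthetic ranking; `pick_per_band` and the `School N` names are hypothetical and stand in for the scraped table.

```python
import random
from typing import List, Optional, Tuple

# Rank bands from the assignment: top 30, ranks 31-60, ranks 61-90.
BANDS = [(1, 30), (31, 60), (61, 90)]


def pick_per_band(items: List[Tuple[int, str]], per_band: int = 10,
                  seed: Optional[int] = None) -> List[str]:
    """Pick per_band names from each band: the first ones if seed is None, else a seeded sample."""
    rng = random.Random(seed)
    picked: List[str] = []
    for low, high in BANDS:
        band = [name for rank, name in items if low <= rank <= high]
        if seed is None:
            chosen = band[:per_band]
        else:
            chosen = rng.sample(band, min(per_band, len(band)))
        picked.extend(chosen)
    return picked


# Synthetic ranking (hypothetical names) standing in for the scraped table.
ranking = [(r, f"School {r}") for r in range(1, 91)]
top = pick_per_band(ranking)
print(len(top), top[0], top[10], top[20])  # 30 School 1 School 31 School 61
```

Passing a fixed `seed` keeps a random selection reproducible between runs.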


In [18]:
# Step3 (update): keep only the university name (if the name contains commas, take the last segment)
import os, sys, subprocess

for pkg in ["requests", "beautifulsoup4", "pandas", "lxml", "openpyxl"]:
    try:
        __import__(pkg if pkg != "beautifulsoup4" else "bs4")
    except Exception:
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", pkg], check=False)

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://ideas.repec.org/top/top.econdept.html"
OUTPUT_XLSX = os.path.join(os.getcwd(), "step3_universities.xlsx")


def fetch(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    return resp.text


def locate_table(soup: BeautifulSoup):
    for table in soup.find_all("table"):
        headers = []
        thead = table.find("thead")
        if thead:
            headers = [th.get_text(strip=True).lower() for th in thead.find_all("th")]
        else:
            tr = table.find("tr")
            if tr:
                headers = [td.get_text(strip=True).lower() for td in tr.find_all(["th","td"])]
        if "rank" in headers and "institution" in headers:
            return table
    return None


def extract_university_name(name: str) -> str:
    # If the name contains commas, take the last segment (usually the university name)
    parts = [p.strip() for p in name.split(",") if p.strip()]
    if parts:
        return parts[-1]
    return name.strip()


def parse_items(html: str):
    soup = BeautifulSoup(html, "lxml")
    table = locate_table(soup)
    if table is None:
        raise RuntimeError("Ranking table not found")
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue
        try:
            rank = int(tds[0].get_text(strip=True).split()[0])
        except Exception:
            continue
        a = tds[1].find("a")
        name = a.get_text(strip=True) if a and a.get_text(strip=True) else tds[1].get_text(" ", strip=True)
        rows.append((rank, name))
    rows = sorted({r:n for r,n in rows}.items(), key=lambda x:x[0])
    return rows


def select_top_30(items):
    seg1 = [n for r,n in items if 1<=r<=30][:10]
    seg2 = [n for r,n in items if 31<=r<=60][:10]
    seg3 = [n for r,n in items if 61<=r<=90][:10]
    names = seg1+seg2+seg3
    # Clean up: keep only the university name
    names = [extract_university_name(n) for n in names]
    return names

html = fetch(URL)
items = parse_items(html)
names = select_top_30(items)
pd.DataFrame({"university_name": names}).to_excel(OUTPUT_XLSX, index=False)
print("Step3 update done: kept only the university names; saved to:", OUTPUT_XLSX)



Step3 update done: kept only the university names; saved to: /Users/luok/Desktop/2023200211/step3_universities.xlsx
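
Taking the last comma-separated segment works when the university comes last, but it picks the wrong part when the university is listed first. A hedged alternative is to prefer the segment that contains a keyword such as "university" and fall back to the last segment otherwise. The helper name, the keyword list, and the sample strings below are all assumptions for illustration, not names taken from the RePEc page.

```python
# Keywords are checked in priority order, so a "university" segment wins over the others.
KEYWORDS = ("university", "institute", "college")


def pick_university_segment(name: str, keywords=KEYWORDS) -> str:
    """Return the comma-separated segment most likely to be the university name."""
    parts = [p.strip() for p in name.split(",") if p.strip()]
    if not parts:
        return name.strip()
    for keyword in keywords:
        for part in parts:
            if keyword in part.lower():
                return part
    # No keyword found: fall back to the last segment, as the notebook cell does.
    return parts[-1]


# Hypothetical input strings for illustration.
print(pick_university_segment("Department of Economics, Harvard University"))
print(pick_university_segment("University of Chicago, Booth School of Business"))
print(pick_university_segment("Paris School of Economics"))
```

Either rule can map two departments of the same university to the same name, so checking the final list for duplicates is still worthwhile.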


In [19]:
# Step4: build the Excel file of research areas of interest (area codes taken from the SCMOR page)
# Target areas: ECON, FINANCE, INFO MAN
import os
import pandas as pd

areas = [
    {"area_code": "ECON", "area_name": "economics"},
    {"area_code": "FINANCE", "area_name": "finance"},
    {"area_code": "INFO MAN", "area_name": "information management"},
]

STEP4_XLSX = os.path.join(os.getcwd(), "step4_research_areas.xlsx")

pd.DataFrame(areas).to_excel(STEP4_XLSX, index=False)
print("Step4 done:", STEP4_XLSX)


Step4 done: /Users/luok/Desktop/2023200211/step4_research_areas.xlsx


In [22]:
# Step5: scrape the first 3 journals listed for each area on the SCMOR page and export them to Excel
# Page: https://www.scmor.com/view/10554
import os, sys, subprocess

for pkg in ["requests", "beautifulsoup4", "pandas", "lxml", "openpyxl"]:
    try:
        __import__(pkg if pkg != "beautifulsoup4" else "bs4")
    except Exception:
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", pkg], check=False)

import requests
from bs4 import BeautifulSoup  # type: ignore
import pandas as pd  # type: ignore

URL = "https://www.scmor.com/view/10554"
STEP5_XLSX = os.path.join(os.getcwd(), "step5_area_top_journals.xlsx")

TARGET_CODES = ["ECON", "FINANCE", "INFO MAN"]


def fetch_html(url: str) -> str:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,zh-CN,zh;q=0.8",
    }
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.text


def tables_from_page(html: str):
    soup = BeautifulSoup(html, "lxml")
    return soup.find_all("table")


def normalize_header(text: str) -> str:
    return text.strip().lower().replace(" ", "").replace("*", "")


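def parse_table(table) -> "pd.DataFrame | None":
    # Try to parse the table into a DataFrame; it must contain a field ("领域") column and a journal ("期刊") column
    # Header variants are tolerated: 领域, field, 领域代码 (area code); 期刊, 期刊名称 (journal name), journal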
    headers = []
    thead = table.find("thead")
    if thead:
        headers = [th.get_text(strip=True) for th in thead.find_all("th")]
    else:
        first_tr = table.find("tr")
        if first_tr:
            headers = [td.get_text(strip=True) for td in first_tr.find_all(["th", "td"])]
    if not headers:
        return None

    rows = []
    body_trs = table.find_all("tr")
    # Skip the first row (the header)
    for tr in body_trs[1:]:
        tds = tr.find_all(["td", "th"])  # some tables use <th> for the first column
        if len(tds) < len(headers):
            continue
        row = [td.get_text(" ", strip=True) for td in tds[:len(headers)]]
        rows.append(row)

    try:
        df = pd.DataFrame(rows, columns=headers)
    except Exception:
        return None

    # Normalize the column names
    col_map = {}
    for col in df.columns:
        key = normalize_header(col)
        if key in ("领域", "field", "领域field", "领域/field", "领域代码", "领域domain", "domain", "领域列"):
            col_map[col] = "field"
        elif key in ("期刊", "期刊名称", "journal", "期刊名"):
            col_map[col] = "journal"
        else:
            # Keep the original column name
            col_map[col] = col
    df = df.rename(columns=col_map)

    if "field" not in df.columns or "journal" not in df.columns:
        return None

    # Clean up the field and journal values
    df["field"] = df["field"].astype(str).str.strip()
    df["journal"] = df["journal"].astype(str).str.strip()
    return df[["field", "journal"]]


html = fetch_html(URL)
all_tables = tables_from_page(html)
frames = []
for tbl in all_tables:
    d = parse_table(tbl)
    if d is not None and not d.empty:
        frames.append(d)

if not frames:
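    raise RuntimeError("Could not parse any table with field/journal columns from the page; check the page structure")

merged = pd.concat(frames, ignore_index=True)

# Find the first 3 journals listed for each area (in page order)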
records = []
for code in TARGET_CODES:
    sub = merged[merged["field"].str.upper() == code]
    top3 = sub.head(3)["journal"].tolist()
    for idx, jn in enumerate(top3, start=1):
        records.append({"area_code": code, "journal": jn, "rank_order": idx})

result = pd.DataFrame(records)
result.to_excel(STEP5_XLSX, index=False)
print("Step5 done:", STEP5_XLSX)



Step5 done: /Users/luok/Desktop/2023200211/step5_area_top_journals.xlsx
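
The selection rule in this step is "the first 3 journals per area, in page order", with the area code compared case-insensitively. A small standard-library sketch of that rule follows. `first_n_per_field` is a hypothetical name, and the `Journal A` to `Journal F` rows are made-up placeholders rather than data from the SCMOR page.

```python
from typing import Dict, Iterable, List, Tuple


def first_n_per_field(rows: Iterable[Tuple[str, str]], targets: List[str],
                      n: int = 3) -> Dict[str, List[str]]:
    """Keep the first n journals seen for each target field, preserving input order."""
    picked: Dict[str, List[str]] = {t.upper(): [] for t in targets}
    for field, journal in rows:
        key = field.strip().upper()
        if key in picked and len(picked[key]) < n:
            picked[key].append(journal.strip())
    return picked


# Made-up placeholder rows in "page order".
rows = [
    ("ECON", "Journal A"),
    ("FINANCE", "Journal B"),
    ("ECON", "Journal C"),
    ("econ ", "Journal D"),
    ("ECON", "Journal E"),
    ("INFO MAN", "Journal F"),
]
result = first_n_per_field(rows, ["ECON", "FINANCE", "INFO MAN"])
print(result["ECON"])  # ['Journal A', 'Journal C', 'Journal D']
```

An area that ends up with fewer than 3 journals, like `FINANCE` here, is a sign that the page layout or the area codes need checking.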


In [23]:
# Step6: combine research areas, top journals, and skills, then export to Excel
import os, sys, subprocess

# Make sure pandas is available
try:
    import pandas as pd  # type: ignore
except Exception:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "pandas", "openpyxl"], check=False)
    import pandas as pd  # type: ignore

CWD = os.getcwd()
STEP4_XLSX = os.path.join(CWD, "step4_research_areas.xlsx")
STEP5_XLSX = os.path.join(CWD, "step5_area_top_journals.xlsx")
STEP6_XLSX = os.path.join(CWD, "step6_area_journals_skills.xlsx")

# Skills supplied by the user (per the assignment these are added to the Step4 list; here the given string is used directly)
skills_str = "Python, SQL, Math, solid grasp of economic and financial knowledge, proficient application of AI tools for academic and analytical scenarios"

# Load the Step4/Step5 files
if not os.path.exists(STEP4_XLSX):
    raise FileNotFoundError(f"Step4 file not found: {STEP4_XLSX}")
if not os.path.exists(STEP5_XLSX):
    raise FileNotFoundError(f"Step5 file not found: {STEP5_XLSX}")

areas_df = pd.read_excel(STEP4_XLSX)
journals_df = pd.read_excel(STEP5_XLSX)

# Normalize the column names (strip stray whitespace from the headers)
areas_df = areas_df.rename(columns=lambda c: str(c).strip())
journals_df = journals_df.rename(columns=lambda c: str(c).strip())

# Validate the required columns
for req in ["area_code", "area_name"]:
    if req not in areas_df.columns:
        raise KeyError(f"Step4 is missing a required column: {req}")
for req in ["area_code", "journal", "rank_order"]:
    if req not in journals_df.columns:
        raise KeyError(f"Step5 is missing a required column: {req}")

# Merge (a left join keeps every research area)
merged = areas_df.merge(journals_df[["area_code", "journal", "rank_order"]], on="area_code", how="left")

# Aggregate the journals of each area into a comma-separated string
agg = (merged
    .sort_values(["area_code", "rank_order"], na_position="last")
    .groupby(["area_code", "area_name"], as_index=False)
    .agg({"journal": lambda s: ", ".join([x for x in s.dropna().astype(str) if x])})
)
agg = agg.rename(columns={"journal": "top_journals"})

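# Add the skills column (the same skills string for every area, for the template rendering later)
agg["skills"] = skills_str

# Write the Excel file
agg.to_excel(STEP6_XLSX, index=False)
print("Step6 done: combined areas, journals, and skills; output file:", STEP6_XLSX)



Step6 done: combined areas, journals, and skills; output file: /Users/luok/Desktop/2023200211/step6_area_journals_skills.xlsx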
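
The cell above joins the journals with plain commas, while the example letter reads "..., Journal of Finance (JF), and Management Science (MS)". If that phrasing is wanted, the join could add the final "and". This is a sketch with a hypothetical helper name, `join_with_and`, and is not something the notebook does.

```python
from typing import Iterable


def join_with_and(items: Iterable[str]) -> str:
    """Join names as 'A', 'A and B', or 'A, B, and C', skipping empty entries."""
    cleaned = [s.strip() for s in items if s and s.strip()]
    if len(cleaned) <= 1:
        return "".join(cleaned)
    if len(cleaned) == 2:
        return " and ".join(cleaned)
    return ", ".join(cleaned[:-1]) + ", and " + cleaned[-1]


journals = [
    "Review of Financial Studies (RFS)",
    "Journal of Finance (JF)",
    "Management Science (MS)",
]
print(join_with_and(journals))
# Review of Financial Studies (RFS), Journal of Finance (JF), and Management Science (MS)
```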


In [None]:
# Step7: loop over 30 universities × 3 research areas to generate the SOPs (docxtpl)
import os, sys, subprocess
from datetime import date

# Make sure the dependencies are available
try:
    from docxtpl import DocxTemplate  # type: ignore
except Exception:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "docxtpl"], check=False)
    from docxtpl import DocxTemplate  # type: ignore

try:
    import pandas as pd  # type: ignore
except Exception:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "pandas", "openpyxl"], check=False)
    import pandas as pd  # type: ignore

CWD = os.getcwd()
TEMPLATE_PATH = os.path.join(CWD, "sop_template_docxtpl.docx")
UNIV_XLSX = os.path.join(CWD, "step3_universities.xlsx")
AREA_SKILL_XLSX = os.path.join(CWD, "step6_area_journals_skills.xlsx")
OUTPUT_DIR = os.path.join(CWD, "step7_outputs")

# Fixed information (edit as needed)
APPLICANT_NAME = "Qimin Lin"
PROGRAM_NAME = "Master of Finance program"
CAREER_GOAL = "quant researcher"
CONTACT = "+86-139-0000-0000 | linqimin@qq.com"

# Validate the input files
if not os.path.exists(TEMPLATE_PATH):
    raise FileNotFoundError(f"Template not found: {TEMPLATE_PATH}; run Step1 first to generate it")
if not os.path.exists(UNIV_XLSX):
    raise FileNotFoundError(f"Step3 university list not found: {UNIV_XLSX}")
if not os.path.exists(AREA_SKILL_XLSX):
    raise FileNotFoundError(f"Step6 area-journal-skill table not found: {AREA_SKILL_XLSX}")

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Load the data
univs = pd.read_excel(UNIV_XLSX)
areas = pd.read_excel(AREA_SKILL_XLSX)

# Check the columns
if "university_name" not in univs.columns:
    raise KeyError("Step3 is missing a required column: university_name")
for col in ["area_code", "area_name", "top_journals", "skills"]:
    if col not in areas.columns:
        raise KeyError(f"Step6 is missing a required column: {col}")

# Build a filesystem-safe file name
def safe_filename(name: str) -> str:
    keep = "-_.() []{}"
    return "".join(c if c.isalnum() or c in keep else "_" for c in str(name)).strip(" ._")

# Empty Excel cells are read back as NaN; turn them into "" instead of the string "nan"
def cell_text(value) -> str:
    return "" if pd.isna(value) else str(value).strip()

# Generate the documents
count = 0
for _, urow in univs.iterrows():
    uni = cell_text(urow["university_name"])
    if not uni:
        continue
    for _, arow in areas.iterrows():
        area_code = cell_text(arow["area_code"])
        area_name = cell_text(arow["area_name"])
        top_journals = cell_text(arow["top_journals"])
        skills = cell_text(arow["skills"])

        context = {
            "applicant_name": APPLICANT_NAME,
            "program_name": PROGRAM_NAME,
            "university_name": uni,
            "research_area": area_name or area_code,
            "top_journals": top_journals,
            "career_goal": CAREER_GOAL,
            "skills": skills,
            "date": date.today().isoformat(),
            "contact": CONTACT,
        }

        # Load a fresh template for every letter so each render starts from the unrendered file
        tpl = DocxTemplate(TEMPLATE_PATH)
        tpl.render(context)
        fname = f"SOP_{safe_filename(uni)}_{safe_filename(area_code or area_name)}.docx"
        out_path = os.path.join(OUTPUT_DIR, fname)
        tpl.save(out_path)
        count += 1

print(f"Step7 done: generated {count} SOPs in directory: {OUTPUT_DIR}")
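
Step 9 of the assignment (PDF export with docx2pdf) has no cell in this notebook. The sketch below is one possible shape for it, not the author's implementation. It assumes docx2pdf's `convert(input_path, output_path)` function and a machine with Microsoft Word installed (the assignment limits this step to Windows users). `plan_pdf_paths`, `convert_all`, and the `step7_pdfs` folder are hypothetical names. The demo at the end only plans the output paths on empty placeholder files, so it runs without docx2pdf.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_pdf_paths(docx_dir: str, pdf_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every .docx in docx_dir with the .pdf path it should be converted to."""
    out = Path(pdf_dir)
    return [
        (src, out / (src.stem + ".pdf"))
        for src in sorted(Path(docx_dir).glob("*.docx"))
        if not src.name.startswith("~$")  # skip Word's temporary lock files
    ]


def convert_all(docx_dir: str, pdf_dir: str) -> int:
    """Convert each planned pair with docx2pdf and return the number of files handled."""
    from docx2pdf import convert  # imported lazily so the planning step works without it
    Path(pdf_dir).mkdir(parents=True, exist_ok=True)
    pairs = plan_pdf_paths(docx_dir, pdf_dir)
    for src, dst in pairs:
        convert(str(src), str(dst))
    return len(pairs)


# Demo of the planning step only, using empty placeholder files in a temp folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["SOP_b.docx", "SOP_a.docx", "~$SOP_a.docx", "notes.txt"]:
        (Path(tmp) / name).touch()
    plan = plan_pdf_paths(tmp, str(Path(tmp) / "pdf"))
    print([dst.name for _, dst in plan])  # ['SOP_a.pdf', 'SOP_b.pdf']

# On a machine with Word, something like this would convert the Step7 output:
# convert_all("step7_outputs", "step7_pdfs")
```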

