fipl-hse/2023-2-level-ctlr

Technical Track of Computer Tools for Linguistic Research (2023/2024)

This track is part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.

This technical track builds basic skills in retrieving data from external web resources and processing it for future linguistic research. The idea is to automatically obtain a dataset with a defined structure and appropriate content, and then to perform morphological analysis on it using various natural language processing (NLP) libraries. Dataset requirements: dataset-label.

Instructors:

Project Timeline

  1. Scrapper:
    1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
    2. Deadline: April 29.
    3. Format: each student works in their own PR.
    4. Dataset volume: 5-7 articles.
    5. Design document: scrapper-label.
  2. Pipeline:
    1. Short summary: Your code can automatically process the raw texts from the previous step and perform part-of-speech tagging and basic morphological analysis.
    2. Deadline: May 27.
    3. Format: each student works in their own PR.
    4. Dataset volume: 5-7 articles.
    5. Design document: pipeline-label.
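
The scrapper step ends with each article's text and metadata stored on disk. A minimal sketch of that last part, using only `pathlib` and `json`, might look as follows. The function name `save_article`, the file names `<id>_raw.txt` and `<id>_meta.json`, and the metadata fields are assumptions made for illustration; the dataset requirements define the actual expected layout.

```python
from __future__ import annotations

import json
from pathlib import Path


def save_article(base_dir: Path, article_id: int, text: str, meta: dict) -> tuple[Path, Path]:
    """Write one article's raw text and metadata into base_dir."""
    # Create the target folder (and any missing parents) if it is absent.
    base_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme, not taken from the course documents.
    raw_path = base_dir / f"{article_id}_raw.txt"
    meta_path = base_dir / f"{article_id}_meta.json"

    raw_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII text (e.g. Cyrillic) readable in the file.
    meta_path.write_text(
        json.dumps(meta, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return raw_path, meta_path
```

Returning both paths lets the caller log or verify what was written without rebuilding the file names.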

Lectures history

Date | Lecture topic | Important links
01.04.2024 | Lecture: Introduction to technical track. | Lab no. 5 description
01.04.2024 | Seminar: Local setup. Choose website. | N/A
08.04.2024 | Lecture: 3rd party libraries. Browser headers. | N/A
08.04.2024 | Seminar: requests: install, API. | Listing <./seminars/seminar_04_08_2024/try_requests.py>
15.04.2024 | Lecture: HTML structure. bs4 library. | N/A
15.04.2024 | Seminar: bs4: install, API. | Listing <./seminars/seminar_04_15_2024/try_bs.py>
22.04.2024 | Lecture: Filesystem with pathlib. Dates. | N/A
22.04.2024 | Seminar: filesystem with pathlib, dates. | Listing <./seminars/seminar_04_22_2024/try_fs.py>, Listing <./seminars/seminar_04_22_2024/try_json.py>, Listing <./seminars/seminar_04_22_2024/try_dates.py>
29.04.2024 | Introduction to lab 6. CoNLLU format. | N/A
29.04.2024 | Lab 5 handover. | N/A

You can find a more complete summary from lectures in ctlr-lectures-label.
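
Lab 6 introduces the CoNLL-U format, in which each token occupies one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and an unspecified value is written as an underscore. The sketch below formats a token by hand only to show that layout. The `Token` class and the sample analysis are hypothetical; in the pipeline the values would come from an NLP library rather than be hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical container for one analysed token."""
    index: int
    form: str
    lemma: str = "_"
    upos: str = "_"
    feats: str = "_"


def to_conllu_line(token: Token) -> str:
    """Render a token as one CoNLL-U line with all ten fields."""
    fields = [
        str(token.index),  # ID
        token.form,        # FORM
        token.lemma,       # LEMMA
        token.upos,        # UPOS
        "_",               # XPOS (left unspecified in this sketch)
        token.feats,       # FEATS
        "_",               # HEAD
        "_",               # DEPREL
        "_",               # DEPS
        "_",               # MISC
    ]
    return "\t".join(fields)


line = to_conllu_line(
    Token(1, "кошки", "кошка", "NOUN", "Animacy=Anim|Case=Nom|Gender=Fem|Number=Plur")
)
```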

Technical solution

Module | Description | Component | Needed for mark
pathlib <https://pypi.org/project/pathlib/> | working with file paths | scrapper | 4
requests <https://pypi.org/project/requests/2.25.1/> | downloading web pages | scrapper | 4
BeautifulSoup4 <https://pypi.org/project/beautifulsoup4/4.11.1/> | finding information on web pages | scrapper | 4
lxml <https://pypi.org/project/lxml/> | parsing HTML (optional) | scrapper | 6
datetime | working with dates | scrapper | 6
json | working with the JSON text format | scrapper, pipeline | 4
spacy_udpipe <https://pypi.org/project/spacy-udpipe/> | module for morphological analysis | pipeline | 6
stanza <https://pypi.org/project/stanza/> | module for morphological analysis | pipeline | 8
networkx <https://pypi.org/project/networkx/> | working with graphs | pipeline | 10
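
As a small illustration of the `datetime` and `json` rows, the sketch below parses a publication date from a string and stores it in a JSON-serialisable form. The input format `%d.%m.%Y`, the output format, and the field names are assumptions; real media websites use a variety of date formats, so the format string usually has to be adapted per site.

```python
import json
from datetime import datetime


def parse_date(raw: str, fmt: str = "%d.%m.%Y") -> datetime:
    """Convert a date string scraped from a page into a datetime object."""
    # strip() removes the stray whitespace that often surrounds text in HTML.
    return datetime.strptime(raw.strip(), fmt)


published = parse_date(" 29.04.2024 ")

# datetime objects are not JSON-serialisable, so format them as strings first.
meta = {
    "title": "Example article",  # hypothetical field
    "date": published.strftime("%Y-%m-%d %H:%M:%S"),
}
serialised = json.dumps(meta, ensure_ascii=False)
restored = json.loads(serialised)
```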

The software solution is built on top of three components:

  1. scrapper.py

    - a module for finding articles from the given media, extracting text and dumping it to the file system. Students need to implement it.

  2. pipeline.py

    - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.

  3. article.py

    - a module for article abstraction that encapsulates low-level manipulations with the article.
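
To illustrate what an article abstraction can mean here, the sketch below keeps an article's fields in one object and derives its file path and metadata in one place, so that neither the scrapper nor the pipeline has to build them by hand. This is not the course's article.py: the class name `ArticleSketch`, its fields, and the file naming are all assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ArticleSketch:
    """Hypothetical stand-in for an article abstraction."""
    article_id: int
    url: str
    title: str = ""
    text: str = ""
    topics: list = field(default_factory=list)

    def raw_text_path(self, base_dir: Path) -> Path:
        """Return where this article's raw text would be stored."""
        # Assumed naming scheme, used here only for illustration.
        return base_dir / f"{self.article_id}_raw.txt"

    def meta(self) -> dict:
        """Collect the fields that would be dumped as metadata."""
        return {
            "id": self.article_id,
            "url": self.url,
            "title": self.title,
            "topics": self.topics,
        }


article = ArticleSketch(3, "https://example.com/news/3", title="Example")
```

Keeping path construction inside the class means a change to the storage layout touches one method instead of every caller.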

Handing over your work

  1. The lab work is accepted for oral presentation.
  2. The student explains how the program works and shows it in action.
  3. The student completes a mini-task from a mentor that requires slight code modifications.
  4. The student receives a mark:
    1. The expected mark, if all the steps above are completed and the mentor is satisfied with the answer.
    2. One point higher than the expected mark, if all the steps above are completed and the mentor is very satisfied with the answer.
    3. One point lower than the expected mark, if the lab is handed over up to one week after the deadline and the criteria from 4.1 are satisfied.
    4. Two points lower than the expected mark, if the lab is handed over more than one week after the deadline and the criteria from 4.1 are satisfied.

Note

A student may improve their mark for a lab by completing the tasks of the next level after handing the lab over.

A lab work is accepted for oral presentation if all the criteria below are satisfied:

  1. There is a Pull Request (PR) with a correctly formatted name: Scrapper, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>.
    1. Example: Scrapper, Irina Novikova - 20FPL2.
  2. The PR contains a filled-in settings.json file with an expected mark. Acceptable values: 4, 6, 8, 10.
  3. The PR has a green status.
  4. The PR has the label done, set by a mentor.
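
The first two criteria can be checked locally before asking for a review. The sketch below is only an illustration: the regular expression is one reading of the naming pattern above (single-word name, surname and group), the `target_score` key is a hypothetical name for the expected-mark field (check your own settings.json for the actual key), and none of this reproduces the repository's real checks.

```python
import json
import re

ACCEPTABLE_MARKS = (4, 6, 8, 10)


def is_valid_pr_title(title: str, lab_name: str = "Scrapper") -> bool:
    """Check a title against '<LAB>, <NAME> <SURNAME> - <GROUP>'."""
    # \S+ matches one whitespace-free word; this is a simplifying assumption.
    pattern = rf"{re.escape(lab_name)}, \S+ \S+ - \S+"
    return re.fullmatch(pattern, title) is not None


def is_valid_expected_mark(settings_json: str, key: str = "target_score") -> bool:
    """Check that the settings contain an acceptable expected mark."""
    settings = json.loads(settings_json)
    # The key name is hypothetical; a missing key yields None, which is rejected.
    return settings.get(key) in ACCEPTABLE_MARKS
```

For example, `is_valid_pr_title("Scrapper, Irina Novikova - 20FPL2")` accepts the title given in the criteria, while a title without the comma or the dash is rejected.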

Resources

  1. Academic performance
  2. Media websites list
  3. Documentation website
  4. Python programming course from previous semester
  5. Scraping tutorials (Russian)
  6. Scraping tutorials (English)
  7. starting-guide-en-label
  8. ctlr-tests-label
  9. run-in-terminal-label
  10. ctlr-faq-label