---
title: Project 2
bibliography:
  - myref.bib
---

In [None]:
import os

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

## Requirements

Project 2 is a group project where each group should identify a relevant topic that involves all the following elements:

1. An objective that involves extracting knowledge from a real-world dataset.
2. A data preprocessing step that prepares the data for mining.
3. A learning algorithm that extracts knowledge from the data.
4. An evaluation of the mined knowledge.
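
To make the four elements concrete, here is a toy sketch using only the Python standard library. Everything in it is an assumption for illustration: the records are made-up placeholders rather than real data, and mean imputation, a 1-nearest-neighbour rule, and a single holdout split merely stand in for whichever preprocessing, learning, and evaluation methods your group chooses.

```python
from statistics import mean

# Hypothetical toy records: (feature_1, feature_2, label); None marks a missing value.
# Objective (element 1): predict the label from the two features.
records = [
    (1.0, 2.0, "low"),
    (1.2, None, "low"),
    (0.8, 1.8, "low"),
    (3.0, 4.1, "high"),
    (3.2, 3.9, "high"),
    (None, 4.0, "high"),
    (1.1, 2.1, "low"),
    (2.9, 4.2, "high"),
]


def column_means(rows):
    """Preprocessing (element 2): per-feature means, ignoring missing values."""
    n_features = len(rows[0]) - 1
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_features)]


def impute(rows, means):
    """Preprocessing (element 2): fill missing feature values with the given means."""
    return [
        tuple(m if v is None else v for v, m in zip(r[:-1], means)) + (r[-1],)
        for r in rows
    ]


def predict_1nn(train, x):
    """Learning (element 3): label of the nearest training record (squared Euclidean)."""
    nearest = min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[:-1], x)))
    return nearest[-1]


def accuracy(train, test):
    """Evaluation (element 4): fraction of test records predicted correctly."""
    hits = sum(predict_1nn(train, r[:-1]) == r[-1] for r in test)
    return hits / len(test)


# Split first, then compute the means on the training part only,
# so that no information from the test records leaks into preprocessing.
train_raw, test_raw = records[:6], records[6:]
means = column_means(train_raw)
train, test = impute(train_raw, means), impute(test_raw, means)
holdout_accuracy = accuracy(train, test)
print(f"Holdout accuracy: {holdout_accuracy:.2f}")
```

A real project would replace each function with a method suited to the chosen dataset and justify that choice in the report.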

::::{caution}

The project may be used in a data-mining competition, but previously completed projects should not be reused. Take proper measures to avoid any suspicion of (self-)plagiarism. 

::::

There are several tasks associated with Project 2:

1. **Presentation video**: Submit one 15-minute video per group to the [Group presentation assignment](https://canvas.cityu.edu.hk/courses/62414/assignments/279414) on Canvas.
2. **Peer Reviews of Group Presentations**: Each student will be assigned 3 group presentations to review from the [Group presentation assignment](https://canvas.cityu.edu.hk/courses/62414/assignments/279414) on Canvas.
3. **Report**: Submit one report per group to the [Group report assignment](https://canvas.cityu.edu.hk/courses/62414/assignments/275136) on Canvas.

::::{caution}

The tasks above have different submission deadlines. See the Canvas assignment pages for more details on the specific requirements.

::::

The project is worth 15 points, which accounts for 15% of the entire course assessment. The assessment is divided into 5 categories, each of which is scored on a scale of 0-3 points:

- **3**: Excellent
- **2**: Satisfactory
- **1**: Unsatisfactory
- **0**: Incomplete

:::::{admonition} Rubrics

1. **Presentation** (3 points)

   - Is the problem well formulated and motivated?
   - Is there proper use of visualization techniques to convey the results concisely within the time limit?
   - Are questions from the audience addressed well?

2. **Report** (3 points)
   - Does the report contain the essential elements such as the title, abstract, introduction, conclusion, and references?
   - Are the problem and results described clearly with proper citations?  
   - Are the results reproducible by well-documented code?  

3. **Correctness** (3 points)
   - Does the project contain the required elements?
   - Are the results correct?
   - Are the learning process and evaluation methods appropriate, leading to correct results?

4. **Technical elements** (3 points)
   - How sophisticated are the techniques used for preprocessing, learning, and evaluation?
   - Is the quality of the mined knowledge better than that of existing work?
   - Are there meaningful generalizations to related classes of problems?

5. **Team spirit** (3 points)
   - Can the team challenge other teams successfully for the group presentation?
   - Can the team members work together efficiently?
   - Is the workload evenly divided among members?

:::::

## Data Sources

Here are some websites that provide trustworthy real-world datasets:

::::{tip} Recommended sources of data

- [UNICEF Data](https://data.unicef.org/): Offers comprehensive data on the well-being of children around the world.
- [World Bank Open Data](https://data.worldbank.org/): Grants free and open access to global development data.
- [Data.gov.hk](https://data.gov.hk/): Hosts open government data in Hong Kong, similar to [Data.gov](https://www.data.gov/) for the US.
- [Eurostat](https://ec.europa.eu/eurostat): Provides statistical data on Europe, covering a wide range of topics from economy to environment.
- [Challenge Data](https://challengedata.ens.fr/): Features data mining challenges from data provided by public services, companies and laboratories.

::::

The following sources of data are widely popular, but their datasets may be synthetic or heavily studied. If you plan to use datasets from these sources, you will need to put extra effort into the following aspects:

1. Verify the dataset’s authenticity by citing trustworthy and original sources properly.
2. Clearly differentiate your approach and results from existing works.

::::{caution} Sources to use with caution

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/): Provides a wide range of machine learning datasets for educational and research purposes. However, be aware that many of these datasets have been extensively studied.
- [Kaggle](https://www.kaggle.com/datasets): Offers a diverse array of datasets for practicing data mining. Be cautious, as some datasets may be synthetic or lack proper references, and there are often hundreds of submitted solutions in the form of Jupyter notebooks.
- [OpenML](https://www.openml.org/): Facilitates the easy sharing and discovery of datasets, algorithms, and experiments. However, some datasets may be synthetic or lack proper references.

::::

There are many other ways to look for reliable data sources. For instance, you might use [Google Dataset Search](https://datasetsearch.research.google.com/) to find datasets, and ask large language models (LLMs) for concrete examples.

In [None]:
%%ai chatgpt -f markdown
I am doing a group project in a data mining course. Explain how I can find
a good real-world dataset and a corresponding data mining objective. Give me
a list of 10 examples. (Do not use any headers in your reply.)

## Group Server

Members of the same group can work collaboratively on the same notebook using a group (Jupyter) server that has higher resource limits than the individual user servers:

- Storage: 100GB
- Memory: 100GB
- CPU: 32 cores for default servers without GPU, 8 cores for GPU servers
- GPU: 48GB for GPU servers
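
To see what a running server actually reports, a quick check along the following lines may help. It is only a sketch: the function name `resource_summary` is made up for illustration, and on a containerized server the numbers may describe the underlying host rather than the limits listed above.

```python
import os
import shutil
from pathlib import Path


def resource_summary(path=None):
    """Return the CPU count and disk usage (in GB) visible to this process."""
    path = Path.home() if path is None else Path(path)
    usage = shutil.disk_usage(path)
    gb = 1e9
    return {
        "cpu_count": os.cpu_count(),  # may be None if it cannot be determined
        "disk_total_gb": round(usage.total / gb, 1),
        "disk_free_gb": round(usage.free / gb, 1),
    }


summary = resource_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```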

Group servers also run JupyterLab in collaborative mode, which provides real-time collaboration features, allowing multiple users to see each other and work on the same notebook simultaneously. For more details, see [](#fig:collab) and the [`jupyterlab-collaboration` package](https://jupyterlab-realtime-collaboration.readthedocs.io/en/latest/).

:::::{figure} images/collab.dio.svg
:label: fig:collab
:alt: Collaborative mode
:align: left

Collaborative mode in JupyterLab.
:::::

To access and manage the group server:

1. **Access the Hub Control Panel:**

   - Within the JupyterLab interface, click `File->Hub Control Panel`.

2. **Select the Admin Panel:**

   - In the top navigation bar of the Hub Control Panel, select the `Admin` Panel as shown in [](#fig:admin).

3. **Locate the Group User:**

   - Within the Admin Panel, look for the user named `group{n}`, where `{n}` corresponds to the group number.

4. **Manage the Group Server:**
    - **If the group server has not started:**
        - Click the action button labeled <kbd>Spawn Page</kbd> to select the server options with higher resource limits.
        - If you click the action button labeled <kbd>Start Server</kbd>, the server will start with lower resource limits that apply to individual user servers.
    - **If the group server is already running:**
        - Click the action button labeled <kbd>Access Server</kbd> to access the currently running server.
        - If necessary, click the action button labeled <kbd>Stop Server</kbd> to terminate the existing server.

:::::{figure} images/admin.dio.svg
:label: fig:admin
:alt: Admin panel
:align: left

Admin panel for managing group server.
:::::

::::{seealso}

Members can also collaborate on their individual Jupyter servers using the [Live Share extension](https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare) installed in the VSCode interface. Signing in with a GitHub or Microsoft account is required.

::::

To facilitate file transfer and sharing among members, each member of a project group can access the group home directory from their individual user servers:

- **Accessing from a terminal**:
    - Members can access the group home directory by navigating to the mounted path in a terminal app in the JupyterLab or VSCode interface.
    - For example, if the group is `group0`, they can use the following command in the terminal:
      ```bash
      cd /group0
      ```
- **Accessing via a soft link**
    - To make it easier to access the group home directory from the JupyterLab file explorer, members can create a soft link.
    - This can be done using the `ln -s` command. For example, if the group is `group1`, they can create a soft link named `group_home` in the user home directory:
      ```bash
      ln -s /group1 ~/group_home
      ```
      Refresh the file browser to see the `group_home` folder in the JupyterLab file explorer as shown in [](#fig:group_home).

:::::{figure} images/group_home.dio.svg
:label: fig:group_home
:alt: Group home directory
:align: left

Group home directory mounted in member server.
:::::
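
The soft link can also be created from a notebook cell instead of a terminal. The following is a minimal sketch, assuming the group directory is mounted at `/group1` as in the example above; the helper name `link_group_home` is hypothetical.

```python
from pathlib import Path


def link_group_home(group_dir, link_path):
    """Create a soft link at link_path pointing to group_dir, unless something is already there."""
    group_dir, link_path = Path(group_dir), Path(link_path)
    if link_path.is_symlink() or link_path.exists():
        return link_path  # leave any existing file, folder, or link untouched
    link_path.symlink_to(group_dir, target_is_directory=True)
    return link_path


# Only attempt the link where the assumed mount point actually exists.
if Path("/group1").is_dir():
    link_group_home("/group1", Path.home() / "group_home")
```

Unlike a bare `ln -s`, the helper does nothing when `~/group_home` already exists, so the cell can be rerun safely.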

::::{caution}

Multiple users editing the same file can cause conflicts or data loss. To mitigate this risk, use a version control system such as Git to manage your changes and collaborate more effectively:

- The `jupyterlab-git` extension provides a graphical interface for Git within JupyterLab.
- The VSCode interface has Git-related extensions such as `GitLens`.

::::

## Custom Packages

To ensure the reproducibility of your results, you are required to *use programming instead of the WEKA graphical interface* to complete the project. Specifically, you can access WEKA’s tools through the `python-weka-wrapper3` module, which allows you to use Python instead of Java. You can also install additional packages using the commands 

- [`conda install`](https://docs.conda.io/projects/conda/en/stable/commands/install.html), if the package is available on [Anaconda](https://anaconda.org/), or
- [`pip install`](https://packaging.python.org/en/latest/tutorials/installing-packages/), if the package is available on [PyPI](https://pypi.org/search/).
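
As a rough illustration of the programmatic route, the sketch below cross-validates WEKA's J48 decision tree through `python-weka-wrapper3`, whose import name is `weka`. It assumes the package and a Java runtime are available; the function name, the choice of J48, the 10 folds, and the file name in the usage comment are illustrative assumptions, not project requirements.

```python
def cross_validate_j48(arff_path, folds=10, seed=1):
    """Return the cross-validated accuracy (in %) of J48 on an ARFF file.

    Assumes the JVM has already been started with weka.core.jvm.start().
    """
    # Imported inside the function so that defining it does not require the package.
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random
    from weka.core.converters import Loader

    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file(arff_path)
    data.class_is_last()  # treat the last attribute as the class label

    classifier = Classifier(classname="weka.classifiers.trees.J48")
    evaluation = Evaluation(data)
    evaluation.crossvalidate_model(classifier, data, folds, Random(seed))
    return evaluation.percent_correct


# Usage (hypothetical file name):
#   import weka.core.jvm as jvm
#   jvm.start()
#   print(cross_validate_j48("mydata.arff"))
#   jvm.stop()
```

Fixing the seed keeps the fold assignment, and hence the reported accuracy, reproducible across runs. The JVM generally cannot be restarted after it is stopped in the same process, so it is safest to call `jvm.stop()` only at the end of a session.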

In [None]:
%%ai chatgpt -f text
What are the pros and cons of conda install vs pip install?

The installation might not persist after restarting the Jupyter server because the default environment is not saved permanently. To keep the installation, create a conda environment in your home directory, which will be saved permanently.

For instance, if you would like to use `xgboost` and `python-weka-wrapper3` in the same notebook, run the following to create a conda environment:[^conda]

```bash
myenv=myenvname
cat <<EOF > /tmp/myenv.yaml && mamba env create -n "${myenv}" -f /tmp/myenv.yaml
dependencies:
  - python=3.11
  - pip
  - ipykernel
  - xgboost
  - pip:
    - python-weka-wrapper3
EOF
```

where `myenvname` can be any valid environment name.

[^conda]: See the [documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) for more details on managing conda environments.

Afterwards, you can create a kernel using the command:[^kernel]

```bash
conda activate ${myenv}
python -m ipykernel install \
    --user \
    --name "${myenv}" --display-name "${myenv}"
```

[^kernel]: See the [documentation](https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernels-for-different-environments) for more details on creating kernels for conda environments.

Reload the browser window for the kernel to take effect.
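
After switching a notebook to the new kernel, a short check like the one below can confirm that it runs in the intended environment. This is a sketch only: `environment_report` is a hypothetical helper, and `weka` is assumed to be the import name provided by `python-weka-wrapper3`.

```python
import sys
from importlib.util import find_spec


def environment_report(modules=("xgboost", "weka")):
    """Return the active Python prefix and whether each module can be found."""
    return {
        "prefix": sys.prefix,
        "modules": {name: find_spec(name) is not None for name in modules},
    }


report = environment_report()
print("Python prefix:", report["prefix"])
for name, found in report["modules"].items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the kernel was set up as described, the printed prefix should be the folder of the new conda environment rather than that of the default one.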

::::{tip} How to clean up a conda environment?

To deactivate the conda environment in a terminal, run

```bash
conda deactivate
```

To delete the kernel, run the command

```bash
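# ${myenv:?} aborts the command if myenv is unset, so the whole kernels folder is never deleted
rm -rf ~/.local/share/jupyter/kernels/"${myenv:?}"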
```

To delete the conda environment, run

```bash
conda deactivate
mamba env remove -n ${myenv}
```

::::

In [None]:
%%ai chatgpt -f text
How do I create a conda environment that inherits all the packages from the base
environment? Will this take a long time and create duplicate files?