# Machine Learning / Aprendizagem Automática

### Sara C. Madeira, 2025/26

# ML Project  - Learning about Donations

## Logistics

**_Read Carefully_**

**Students are encouraged to work in teams of 3 people**. 

Projects with smaller teams are allowed in exceptional cases, but will not receive better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's workload was planned for a team of 3 people and 4 weeks of work. The solution should be uploaded to Moodle before the end of Sunday, December 21st 2025.**  

Teams should **upload a `.zip` file** containing all the files necessary for project evaluation. Teams should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=299599) and the zip file, uploaded by one of the group members, should be identified as `AA202526nn.zip` where `nn` is the group number.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `AA_202526_Project.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs.**

**Decisions should be justified and results should be critically discussed.** 

Remember that **your notebook should be as clear and organized as possible**, that is, only the relevant code and experiments should be presented, not everything you tried and did not work (that can be discussed in the text, if relevant)! 

_Project solutions containing **only code and outputs without discussions** will achieve a **maximum grade of 10 out of 20**._

## Tools

The team should use [Python 3](https://www.python.org) and [Jupyter Notebook](http://jupyter.org), together with **[Scikit-Learn](http://scikit-learn.org/stable/)**, **[Orange3](https://orange.biolab.si)**, or **both**.

**[Orange3](https://orange.biolab.si)** can be used through its **[programmatic version](https://docs.orange.biolab.si/3/data-mining-library/)**, by importing and using its packages as done with Scikit-Learn, or through its **workflow version**. 

**It is up to the team to decide when to use Scikit-learn, Orange, or both.**

In this context, your Jupyter notebook might have a mix of code, results, text explanations, workflow figures, etc. 

If you use Orange workflows for some tasks, you should also deliver the workflow files. Your notebook should include figures of the workflows used, together with an overall explanation and specific descriptions of the options chosen in each of their widgets.

**You should use this notebook and the sections below as a template for your solutions. Sections might be added or changed.**

## Dataset

The dataset to be analysed is **`Donors_dataset.csv`**, made available together with this project description. This dataset, downloaded from [Kaggle](https://www.kaggle.com), contains selected data from the following competition: [Donors-Prediction](https://www.kaggle.com/momohmustapha/donorsprediction/). 


**In this project, your team is supposed to use machine learning in the challenging tasks of predicting donations and understanding the donors. You should use both supervised and unsupervised learning to tackle 2 tasks:**

1. **Task 1 (Supervised Learning) - Predicting Donation and Donation Type**
2. **Task 2 (Unsupervised Learning) - Characterizing Donors**

The **`Donors_dataset.csv`** you should learn from has **19,372 instances** described by **50 data fields** that you might use as **categorical/numerical features.** 

### File Description

* **Donors_dataset.csv** - Tabular/text data to be used in the machine learning tasks.


### Data Fields (in alphabetic order)

* **CARD_PROM_12** - number of card promotions sent to the individual by the charitable organization in the past 12 months
* **CLUSTER_CODE** - one of 54 possible cluster codes, which are unique in terms of socioeconomic status, urbanicity, ethnicity, and other demographic characteristics
* **CONTROL_NUMBER** - unique identifier of each individual
* **DONOR_AGE** - age as of last year's mail solicitation
* **DONOR_GENDER** - actual or inferred gender
* **FILE_AVG_GIFT** - this variable is identical to LIFETIME_AVG_GIFT_AMT
* **FILE_CARD_GIFT** - lifetime average donation (in \\$) from the individual in response to all card solicitations from the charitable organization
* **FREQUENCY_STATUS_97NK** - based on the period of recency (determined by RECENCY_STATUS_96NK), which is the past 12 months for all groups except L and E. L and E are 13–24 months ago and 25–36 months ago, respectively: 1 if one donation in this period, 2 if two donations in this period, 3 if three donations in this period, and 4 if four or more donations in this period.
* **HOME_OWNER** - H if the individual is a homeowner, U if this information is unknown
* **INCOME_GROUP** - one of 7 possible income level groups based on a number of demographic characteristics
* **IN_HOUSE** - 1 if the individual has ever donated to the charitable organization's In House program, 0 if not
* **LAST_GIFT_AMT** - amount of the most recent donation from the individual to the charitable organization
* **LIFETIME_AVG_GIFT_AMT** - lifetime average donation (in \\$) from the individual to the charitable organization
* **LIFETIME_CARD_PROM** - total number of card promotions sent to the individual by the charitable organization
* **LIFETIME_GIFT_AMOUNT** - total lifetime donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_GIFT_COUNT** - total number of donations from the individual to the charitable organization
* **LIFETIME_GIFT_RANGE** - maximum donation amount from the individual minus minimum donation amount from the individual
* **LIFETIME_MAX_GIFT_AMT** - maximum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_MIN_GIFT_AMT** - minimum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_PROM** - total number of promotions sent to the individual by the charitable organization
* **MEDIAN_HOME_VALUE** - median home value (in 100\\$) as determined by other input variables
* **MEDIAN_HOUSEHOLD_INCOME** - median household income (in 100\\$) as determined by other input variables
* **MONTHS_SINCE_FIRST_GIFT** - number of months since the first donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_GIFT** - number of months since the most recent donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_PROM_RESP** - number of months since the individual has responded to a promotion by the charitable organization
* **MONTHS_SINCE_ORIGIN** - number of months that the individual has been in the charitable organization's database
* **MOR_HIT_RATE** - total number of known times the donor has responded to a mailed solicitation from a group other than the charitable organization
* **NUMBER_PROM_12** - number of promotions (card or other) sent to the individual by the charitable organization in the past 12 months
* **OVERLAY_SOURCE** - the data source against which the individual was matched: M if Metromail, P if Polk, B if both
* **PCT_ATTRIBUTE1** - percent of residents in the neighborhood in which the individual lives that are males and active military
* **PCT_ATTRIBUTE2** - percent of residents in the neighborhood in which the individual lives that are males and veterans
* **PCT_ATTRIBUTE3** - percent of residents in the neighborhood in which the individual lives that are Vietnam veterans
* **PCT_ATTRIBUTE4** - percent of residents in the neighborhood in which the individual lives that are WWII veterans
* **PCT_OWNER_OCCUPIED** - percent of owner-occupied housing in the neighborhood in which the individual lives
* **PEP_STAR** - 1 if individual has ever achieved STAR donor status, 0 if not
* **PER_CAPITA_INCOME** - per capita income (in \\$) of the neighborhood in which the individual lives
* **PUBLISHED_PHONE** - 1 if the individual's telephone number is published, 0 if not
* **RECENCY_STATUS_96NK** - recency status as of two years ago: A if active donor, S if star donor, N if new donor, E if inactive donor, F if first time donor, L if lapsing donor
* **RECENT_AVG_CARD_GIFT_AMT** - average donation from the individual in response to a card solicitation from the charitable organization since four years ago
* **RECENT_AVG_GIFT_AMT** - average donation (in \\$) from the individual to the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_COUNT** - number of times the individual has responded to a card solicitation from the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_PROP** - proportion of the individual's responses relative to the number of card solicitations from the charitable organization since four years ago
* **RECENT_RESPONSE_COUNT** - number of times the individual has responded to a promotion (card or other) from the charitable organization since four years ago
* **RECENT_RESPONSE_PROP** - proportion of the individual's responses relative to the number of (card or other) solicitations from the charitable organization since four years ago
* **RECENT_STAR_STATUS** - 1 if individual has achieved star donor status since four years ago, 0 if not
* **SES** - one of 5 possible socioeconomic codes classifying the neighborhood in which the individual lives
* **TARGET_B** - 1 if individual donated in response to last year's 97NK mail solicitation from the charitable organization, 0 if individual did not
* **TARGET_D** - amount of donation (in \\$) from the individual in response to last year's 97NK mail solicitation from the charitable organization
* **URBANICITY** - classification of the neighborhood in which the individual lives: U if urban, C if city, S if suburban, T if town, R if rural, ? if missing
* **WEALTH_RATING** - one of 10 possible wealth rating groups based on a number of demographic characteristics


### Donation TYPE

You are supposed to create a new column/feature named `DONATION_TYPE`, whose 5 values {`A`, `B`, `C`, `D`, `E`} describe ranges of the donation amount (DA) reported in feature `TARGET_D` as follows:
* `A` - DA >= 50
* `B` - DA in interval [20,50[ 
* `C` - DA in interval [13,20[ 
* `D` - DA in interval [10,13[ 
* `E` - DA < 10
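
The binning above can be sketched with `pandas.cut`. This is a minimal illustration on made-up amounts, not the project's solution: the `toy` frame and its values are assumptions, and how rows without a donation (missing or zero `TARGET_D`) should be treated is left to the team.

```python
import numpy as np
import pandas as pd

# Toy donation amounts standing in for TARGET_D (hypothetical values)
toy = pd.DataFrame({"TARGET_D": [5.0, 10.0, 12.99, 13.0, 20.0, 49.99, 50.0, 100.0]})

# Left-closed, right-open bins: [-inf,10), [10,13), [13,20), [20,50), [50,inf)
bins = [-np.inf, 10, 13, 20, 50, np.inf]
labels = ["E", "D", "C", "B", "A"]

# right=False makes each interval include its lower bound and exclude its upper bound
toy["DONATION_TYPE"] = pd.cut(toy["TARGET_D"], bins=bins, labels=labels, right=False)
print(toy)
```

With `right=False` the boundaries match the intervals listed above (for example, exactly 20 falls in `B`, not `C`). Missing amounts stay missing after `pd.cut`, so non-donors would need an explicit decision.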

### **Important Notes on Data Cleaning and Preprocessing**

   1. Data can contain **errors/typos**, whose correction might improve the analysis.
   2. Some features can contain **many values**, whose grouping in categories (aggregation into bins) might improve the analysis.
   3. Data can contain **missing values**, that you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
   4. **Not all features are necessarily important** for the analysis.
   5. Depending on the analysis, **some features might have to be excluded**.
   6. Class distribution is an important characteristic of the dataset that should be carefully taken into consideration. **Class imbalance** might impair machine learning.  
  
Some potentially useful links:

* Data Cleaning and Preprocessing in Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html#
* Data Cleaning and Preprocessing in Orange: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html
* Dealing with imbalanced datasets: https://pypi.org/project/imbalanced-learn/ and https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t7
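
As a starting point for the note on missing values, the share of missing entries per feature can be summarized as below. This is a sketch on a toy frame: `missing_summary` is a hypothetical helper, and treating `?` as a missing marker is an assumption based on the `URBANICITY` description; other features may use different markers.

```python
import numpy as np
import pandas as pd

def missing_summary(frame, missing_tokens=("?",)):
    """Return the percentage of missing values per column, highest first."""
    # Treat placeholder tokens such as "?" as missing, alongside real NaNs
    cleaned = frame.replace(list(missing_tokens), np.nan)
    return (cleaned.isna().mean() * 100).sort_values(ascending=False)

# Toy data (hypothetical values), not the real dataset
toy = pd.DataFrame({
    "URBANICITY": ["U", "?", "C", "?"],
    "DONOR_AGE": [45.0, np.nan, 60.0, 38.0],
    "TARGET_B": [0, 1, 0, 0],
})
summary = missing_summary(toy)
print(summary)
```

A table like this can support the decision to fill values, drop instances, or drop features with a high percentage of missing entries.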

## Task 0 (Know your Data) - Exploratory Data Analysis

## 0.1. Loading Data

In [1]:
import pandas as pd
import numpy as np
from sklearn import neighbors

# Relative path: assumes Donors_dataset.csv is in the same folder as the notebook
path = "Donors_dataset.csv"

df = pd.read_csv(path)

## 0.2. Understanding Data

In this task you should **better understand the features**, their distribution of values, potential errors, etc., and plan/describe what data preprocessing steps should be performed next. The distribution of values in the target (class distribution) is also very important. 

Here you can find a notebook with some examples of what you can do in **Exploratory Data Analysis**: https://www.kaggle.com/artgor/exploration-of-data-step-by-step/notebook. You can also use Orange widgets for this.
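
A quick way to inspect the class distribution mentioned above is `value_counts`. The sketch below uses a made-up series in place of `TARGET_B`; the values and the `imbalance_ratio` name are assumptions for illustration only.

```python
import pandas as pd

# Toy target column standing in for TARGET_B (hypothetical values)
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="TARGET_B")

counts = toy_target.value_counts()                # absolute frequency per class
shares = toy_target.value_counts(normalize=True)  # relative frequency per class

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(pd.DataFrame({"count": counts, "share": shares}))
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The same two calls applied to the real target would show whether resampling or class weighting is worth considering later.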

In [None]:
column_names = df.columns

df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.describe())
print(df.head(10))
print(df["URBANICITY"][:10])

## 0.3. Preprocessing Data

Here you might perform data preprocessing that will be used for both supervised and unsupervised learning tasks.

### <font color="red">Todo1:</font>
[Sklearn documentation](https://scikit-learn.org/stable/modules/preprocessing.html)

- possibly drop rows/columns with missing values (`dropna`)
- imputation (`sklearn.impute`)
<br>

In [None]:
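# Drop rows that contain missing values
# df = df.dropna(axis=0)
# pandas has no dropinf(); map infinities to NaN first, then drop those rows
# df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)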
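
As an alternative to dropping rows, imputation with `sklearn.impute` could look like the sketch below. The toy frame, the two chosen columns, and the strategies (median for numeric, most frequent for categorical) are assumptions for illustration, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "DONOR_AGE": [45.0, np.nan, 60.0, 30.0],
    "URBANICITY": ["U", np.nan, "C", "U"],
})

# Numeric columns get the median, categorical columns the most frequent value
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["DONOR_AGE"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["URBANICITY"]),
])

# Output columns follow the order of the transformers listed above
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=["DONOR_AGE", "URBANICITY"])
print(imputed)
```

When the imputer is later combined with a classifier, fitting it inside a pipeline keeps the imputation statistics from leaking across cross-validation folds.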

## Task 1 (Supervised Learning) - Predicting Donation and Donation Type

In this task you should target 2 classification tasks:
1. **Predicting Donation (binary classification task)** 
2. **Predicting Donation TYPE (multiclass classification)**

**You should:**

* Choose **5 classifiers** from **at least 3 of the following categories**: Tree models, Probabilistic models, Distance-based models and Linear models. You can also try one Ensemble Classifier (https://scikit-learn.org/1.5/modules/ensemble.html). 
* Use **cross-validation** to evaluate the results.
* Describe the parameters used for each classifier and how their choice impacted or not the results.
* Choose the **best classifier** and justify your choice.
* Present and discuss the results for different evaluation measures, present confusion matrices. Remember that not only overall results are important. Check what happens when learning to predict each class. Remember also that some metrics might be more adequate than others according to the problem at hand.

* **Discuss critically your choices and the results!**
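
The cross-validation and per-class evaluation asked for above can be sketched as follows. The data is synthetic (`make_classification`), and the classifier, its parameters, and the fold count are arbitrary assumptions chosen only to show the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the donor data (not the real dataset)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)

# Each sample is predicted by a model that did not see it during training
y_pred = cross_val_predict(clf, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y, y_pred, digits=3))
```

The per-class precision and recall in the report show what happens for each class, which overall accuracy can hide when classes are imbalanced.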

### <font color="red">Todo2:</font>
Potential classifiers:
- **Tree models**
    - [Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- **Linear models**
    - [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- **Probabilistic models**
    - [Gaussian NB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
- **Distance-based models**
    - [K Neighbours Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- **(Ensemble classifier)**
    - [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
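
One way to compare the candidate classifiers listed above under the same cross-validation scheme is sketched below. The data is synthetic, and the hyperparameters, the scaling step for the linear and distance-based models, and the `balanced_accuracy` metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the donor data (not the real dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scaling is placed inside the pipeline so it is refit on each training fold
models = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gnb": GaussianNB(),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>7}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Using the same `cv` object for every model means all classifiers are evaluated on identical folds, which makes the scores directly comparable.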


## 1.1. Specific Data Preprocessing for Classification

## 1.2. Learning and Evaluating Classifiers

...

## 1.3. Classification - Final Discussion and Conclusions 

...

## Task 2 (Unsupervised Learning) - Characterizing Donors and Donation Type

In this task you should **use unsupervised learning to characterize donors (people who actually made a donation) and their donation type**.
1. **Use clustering algorithms to find similar groups of donors**. Is it possible to find groups of donors with the same/similar `DONATION_TYPE`? Evaluate clustering results using **internal and external metrics**.
2. **Be creative and define and explore your own clustering task!** What else would be interesting to find out?
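
The difference between internal and external metrics can be illustrated with the sketch below. It uses synthetic blobs rather than the donor data; the choice of k-means, three clusters, the silhouette score (internal), and the adjusted Rand index (external) are assumptions for the example, and `true_labels` plays the role that a known label such as the donation type would play.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with known group labels (not the real dataset)
X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Distance-based clustering is sensitive to feature scales
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Internal metric: uses only the data and the cluster assignment
sil = silhouette_score(X_scaled, cluster_ids)
# External metric: compares the assignment with known labels
ari = adjusted_rand_score(true_labels, cluster_ids)

print(f"silhouette: {sil:.3f}")
print(f"adjusted Rand index: {ari:.3f}")
```

An internal metric can be high even when clusters do not match the known labels, so reporting both kinds gives a fuller picture.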

## 2.1. Preprocessing Data for Clustering

...

## 2.2. Learning and Evaluating Clusterings

...

## 2.3. Clustering - Final Discussion and Conclusions

...

## 3. Final Comments and Conclusions

...