This was the final project that my group member (Shadi Chamseddine) and I completed for our STAT 5703 W Data Mining course.
Given a small sample of clients’ financial statements, we aim to predict the binary variable TARGET_Adjusted, which indicates whether a client’s financial statement should be targeted for an audit and may have adjustments applied to it. We also aim to predict the continuous variable RISK_Adjustment, which records the monetary amount of any adjustment made to the person’s financial status as a result of a productive audit, that is, an audit that results in an adjustment to the client’s financial statement. This variable measures the size of the risk associated with the person. Given these goals, we prefer to over-classify financial statements as requiring an audit rather than under-classify them as not requiring one: it is easier to review a flagged statement and conclude that no adjustment is necessary than to completely miss a statement that requires an adjustment.
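
One way to picture the preference for over-classifying is through the decision threshold of a classifier. The following is a minimal sketch rather than the project's code: it assumes a model that outputs a score between 0 and 1 for each statement, and the labels, scores, and thresholds are made up purely for illustration. The helper `confusion_counts` is a hypothetical name.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when every score >= threshold is flagged for audit."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and actual == 1:
            tp += 1  # flagged, and the audit would produce an adjustment
        elif flagged and actual == 0:
            fp += 1  # flagged, but no adjustment was needed
        elif not flagged and actual == 1:
            fn += 1  # missed a statement that needed an adjustment
        else:
            tn += 1  # correctly left alone
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Made-up data: 1 means the audit produced an adjustment, 0 means it did not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.45, 0.35, 0.6, 0.2, 0.4, 0.7, 0.1]

default = confusion_counts(y_true, scores, threshold=0.5)
lenient = confusion_counts(y_true, scores, threshold=0.3)

print("threshold 0.5:", default)
print("threshold 0.3:", lenient)
```

In this toy data, lowering the threshold removes the missed statements (false negatives) at the cost of one extra unnecessary flag (false positive), which is the trade-off described above.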
This project covered a range of data science topics, including:
- Visualization
- Dimension reduction
- Data reduction
- Unsupervised learning (clustering)
- Supervised learning (classification)
After cleaning the data, we ran a Random Forest on the dataset to determine which variables matter for prediction. Variables that contributed less than 10% of the overall cumulative Mean Decrease in Accuracy were dropped from the dataset. We then tried to reduce the dataset further through dimension reduction, using three methods: Principal Component Analysis (PCA), Multiple Correspondence Analysis (MCA), and a joint PCA and MCA. Because the dataset contains different types of variables, we used several methods to check that the results are robust.

With the reduced dataset, we proceeded to predict which clients’ financial statements to flag for an audit and, where necessary, what the audit adjustment would be. We approached this prediction in two ways, with unsupervised and with supervised learning algorithms, because the given dataset can be treated either as the full dataset or as a subset of a larger one.

Treating it as the full dataset, we used unsupervised learning to predict whether an audit of a client’s financial statement is necessary. The algorithms were K-Means clustering and hierarchical clustering, each used to identify two clusters (audit or no audit) in the data.

Treating it as a subset of a larger dataset, we used supervised learning. There are two values to predict for each client’s financial statement: TARGET_Adjusted and RISK_Adjustment. When predicting TARGET_Adjusted, we dropped RISK_Adjustment from the models because it is determined only after TARGET_Adjusted is known.
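
The importance-based screening step can be sketched as follows. This is one possible reading of the 10% rule, not necessarily the exact procedure we ran: variables are sorted from least to most important, and a variable is dropped while the running share of total importance stays below the cutoff. The function name `drop_low_importance`, the variable names, and the importance values are hypothetical, and the sketch assumes all importance values are non-negative.

```python
def drop_low_importance(importances, cutoff=0.10):
    """Split variables into (kept, dropped) using a cumulative-importance cutoff.

    Assumption: a variable is dropped while the cumulative share of importance,
    accumulated from the least important variable upward, is below `cutoff`.
    """
    total = sum(importances.values())
    # Sort ascending so the least important variables are considered first.
    ranked = sorted(importances.items(), key=lambda item: item[1])

    kept, dropped = [], []
    running = 0.0
    for name, value in ranked:
        running += value
        if running / total < cutoff:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Made-up Mean Decrease in Accuracy values for hypothetical variables.
importances = {
    "var_a": 40.0,
    "var_b": 29.0,
    "var_c": 20.0,
    "var_d": 7.0,
    "var_e": 3.0,
    "var_f": 1.0,
}

kept, dropped = drop_low_importance(importances, cutoff=0.10)
print("kept:", kept)
print("dropped:", dropped)
```

With these toy numbers, the two least important variables together account for 4% of the total and are dropped; adding the third would push the running share to 11%, so it and every more important variable are kept.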
We used four models to predict a client’s TARGET_Adjusted value: Random Forest, K-Nearest Neighbour, Naïve Bayes, and Neural Networks. When predicting a client’s RISK_Adjustment, we used all variables in the models, because RISK_Adjustment is determined after TARGET_Adjusted. We used two models to predict a client’s RISK_Adjustment: Random Forest and a regression.
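
The choice of predictors for the two targets can be summarized in a short sketch. It assumes that "all variables" for RISK_Adjustment includes TARGET_Adjusted, which is our reading of the description above; the helper `predictors_for` and the `feature_*` column names are hypothetical placeholders, not columns from the actual dataset.

```python
TARGETS = {"TARGET_Adjusted", "RISK_Adjustment"}


def predictors_for(target, columns):
    """Return the columns allowed as predictors for the given target."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    if target == "TARGET_Adjusted":
        # RISK_Adjustment is only known after the audit decision, so using it
        # here would leak information about the outcome.
        excluded = {"TARGET_Adjusted", "RISK_Adjustment"}
    else:
        # For RISK_Adjustment, every other variable is available,
        # including TARGET_Adjusted (an assumption of this sketch).
        excluded = {"RISK_Adjustment"}
    return [column for column in columns if column not in excluded]


# Placeholder column names; the real dataset's columns are not listed here.
columns = ["feature_1", "feature_2", "TARGET_Adjusted", "RISK_Adjustment"]

target_predictors = predictors_for("TARGET_Adjusted", columns)
risk_predictors = predictors_for("RISK_Adjustment", columns)

print(target_predictors)
print(risk_predictors)
```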