# CM 3070 Final Project

## Co-Curricular Classification for University Advancement

Brian Van Steen 210182781

## 1. Set Up

This project was built in JupyterLab (Anaconda distribution), inside a dedicated virtual environment created for this notebook.

The following libraries were installed using pip in the virtual environment terminal:

- NumPy
- Pandas
- Matplotlib
- Seaborn
- scikit-learn
- jupyterlab-git
- pyLDAvis
- TensorFlow
- openpyxl

Each of these libraries can now be imported for use.
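Before running the import cell, it can help to confirm that the packages are visible from the notebook kernel. The following is a minimal sketch, not part of the original notebook: the helper name `missing_packages` is hypothetical, and the list uses import names rather than pip names (for example, scikit-learn is imported as `sklearn`). `jupyterlab-git` is left out because it is a JupyterLab extension rather than a library imported in the notebook.

```python
import importlib.util

# Import names (assumed) for the pip packages listed above.
required = ["numpy", "pandas", "matplotlib", "seaborn",
            "sklearn", "pyLDAvis", "tensorflow", "openpyxl"]


def missing_packages(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every package can be found by the current kernel.
print("Missing packages:", missing_packages(required))
```

If the list is not empty, the kernel is probably not running in the virtual environment where pip installed the packages.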

In [1]:
# import all libraries

import numpy as np # Python library for working with arrays
import pandas as pd # Python library for data processing, working with CSV files
import matplotlib.pyplot as plt # used for basic visualizations and graph creation
%matplotlib inline
import seaborn as sns # used for advanced visualizations and graph creations

import nltk # import NLTK library for natural language processing
import re # import regular expression library for text pre-processing
import tokenization # import tokenization library

from wordcloud import WordCloud

from sklearn.model_selection import train_test_split # used for splitting the data into training and test sets
from sklearn.linear_model import LinearRegression # for creating the Linear Regression Model
from sklearn.preprocessing import MinMaxScaler # for normalization
from sklearn.preprocessing import PolynomialFeatures # for multivariate polynomial regression
from sklearn.tree import DecisionTreeClassifier # decision tree classifier
from sklearn.linear_model import LogisticRegressionCV # logistic regression with built-in cross-validation
from sklearn.naive_bayes import MultinomialNB # Naive Bayes
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline

from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.metrics import mean_squared_error # regression metric: mean squared error
from sklearn.metrics import r2_score # regression metric: coefficient of determination (R^2)
from sklearn.metrics import accuracy_score

import tensorflow as tf
import keras
from keras import layers

import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings('ignore')




In [2]:
# as there are 23 attributes, show all columns when examining all data
pd.set_option('display.max_columns', None)
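`pd.set_option` changes the display setting for the whole session. As an alternative sketch (not what the notebook does), `pd.option_context` applies the same setting only inside a `with` block. The 30-column frame `wide` below is made-up illustration data.

```python
import pandas as pd

# Hypothetical wide frame standing in for the 23-column dataset.
wide = pd.DataFrame([range(30)], columns=[f"col{i}" for i in range(30)])

default_max = pd.get_option("display.max_columns")

# Show every column only while inside the block.
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# The previous setting is restored automatically on exit.
after = pd.get_option("display.max_columns")
```

The session-wide `set_option` call is simpler when every cell should show all columns; the context manager is useful when only one or two displays need it.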

## 2. Dataset

To start, the co-curricular dataset will be imported.

This dataset was provided by a Canadian university. It covers all students who have completed a degree, or are in the process of completing one, and who participated in co-curricular activities.

In [5]:
# import university co-curricular dataset as DataFrame from .xlsx file
# this required the install of openpyxl from the virtual environment terminal
prelimDataset = pd.read_excel('..\\Preliminary\\preliminaryReportData.xlsx')
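The backslash-separated path above resolves only on Windows. A minimal sketch of a portable alternative, assuming the same relative folder layout, builds the path with `pathlib`. The variable names `parts` and `data_path` are illustrative, and the `read_excel` call is commented out because the file is not distributed with this document.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Same relative location as in the cell above, expressed as separate parts.
parts = ("..", "Preliminary", "preliminaryReportData.xlsx")

# Path joins the parts with the separator of the current operating system.
data_path = Path(*parts)

# pd.read_excel accepts a Path object directly:
# prelimDataset = pd.read_excel(data_path)

# How the same parts render on each platform family:
print(PureWindowsPath(*parts))  # ..\Preliminary\preliminaryReportData.xlsx
print(PurePosixPath(*parts))    # ../Preliminary/preliminaryReportData.xlsx
```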

In [7]:
# initial summary view of the DataFrame, the preliminary report dataset
prelimDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60996 entries, 0 to 60995
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ClassifierCode       60996 non-null  object        
 1   ActivityYear         60996 non-null  object        
 2   ActivityCategory     60996 non-null  object        
 3   PostCode             55836 non-null  object        
 4   Constituent          60996 non-null  object        
 5   GraduationYear       56932 non-null  datetime64[ns]
 6   Faculty              56932 non-null  object        
 7   Degree               56932 non-null  object        
 8   Exclusions           2587 non-null   object        
 9   LifetimeDollars      15377 non-null  float64       
 10  LifetimeGifts        15377 non-null  float64       
 11  RFMRC                15409 non-null  float64       
 12  RFMFR                15409 non-null  float64       
 13  RFMMO                15409 non-

In [12]:
prelimDataset.head()

Unnamed: 0,ClassifierCode,ActivityYear,ActivityCategory,PostCode,Constituent,GraduationYear,Faculty,Degree,Exclusions,LifetimeDollars,LifetimeGifts,RFMRC,RFMFR,RFMMO,RFMTO,FirstGiftYear,FirstAmount,FirstArea,FirstSolicitation,LargestGiftYear,LargestAmount,LargestArea,LargestSolicitation
0,6K223BZ,2010-2011,"Education, Training and Outreach",K2,Alumnus,2008-10-24,Faculty:Arts & Social Sciences,Bachelor of Arts,,,,,,,,NaT,,,,NaT,,,
1,KBZ43K4,2015-2016,Student Government and Student and Residence Life,K2,Alumnus Parent employee,2017-02-17,Faculty:Public Affairs,Bachelor of Arts,,,,,,,,NaT,,,,NaT,,,
2,KBZ43K4,2014-2015,"Panels, Events, Committees and Conferences",K2,Alumnus Parent employee,2017-02-17,Faculty:Public Affairs,Bachelor of Arts,,,,,,,,NaT,,,,NaT,,,
3,KBZ43K4,2014-2015,"Panels, Events, Committees and Conferences",K2,Alumnus Parent employee,2017-02-17,Faculty:Public Affairs,Bachelor of Arts,,,,,,,,NaT,,,,NaT,,,
4,KKK0BZ0,2012-2013,"Academics, Awards and Research",K1,Alumnus,2016-05-27,Faculty:Public Affairs,Bachelor of Arts,,,,,,,,NaT,,,,NaT,,,
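The `info()` summary shows that several columns are mostly empty (for example, `LifetimeDollars` has 15377 non-null values out of 60996 rows), and rows 2 and 3 of the preview are identical. A minimal sketch of how both could be quantified is shown below. It uses a small made-up frame named `toy` rather than the real data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the dataset: one fully populated column,
# two partly empty columns, and one exact duplicate row.
toy = pd.DataFrame({
    "ClassifierCode": ["A", "B", "B", "C"],
    "PostCode": ["K2", "K2", "K2", None],
    "LifetimeDollars": [np.nan, 50.0, 50.0, np.nan],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)
```

Applied to `prelimDataset`, the same two calls would show which attributes are too sparse to use directly and how many activity records are repeated. Whether repeated rows are errors or genuine repeat participation is a question for the data owner.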


In [9]:
# import university co-curricular dataset as DataFrame from .xlsx file
# this required the install of openpyxl
studentDataset = pd.read_excel('..\\..\\Data\\coCurricular.xlsx')

In [10]:
studentDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17714 entries, 0 to 17713
Data columns (total 66 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Sort                 17714 non-null  object        
 1   RFMRC                4377 non-null   float64       
 2   RFMFR                4377 non-null   float64       
 3   RFMMO                4377 non-null   float64       
 4   RFMTO                4377 non-null   float64       
 5   Exclusions           707 non-null    object        
 6   Postal               17714 non-null  object        
 7   StartYear            17714 non-null  int64         
 8   Cat1                 17714 non-null  object        
 9   Cat2                 10712 non-null  object        
 10  Cat3                 7318 non-null   object        
 11  Cat4                 5314 non-null   object        
 12  Cat5                 4012 non-null   object        
 13  Cat6                 3115 non-n