This repo contains all the code from the BD_ML course.
Scripts in this repo should be run from an IDE, with the required libraries installed in the local environment.
- Windows 11
  - Keras
  - TensorFlow (with GPU acceleration; see the installation steps below)
  - pandas
  - numpy
  - sklearn
  - matplotlib
  - wandb
- Anaconda Jupyter Notebook
  - Keras
  - TensorFlow
  - pandas
  - numpy
  - sklearn
  - matplotlib
No. | Name | Source |
---|---|---|
1 | Banking Dataset (EDA and binary classification) | https://www.kaggle.com/code/rashmiranu/banking-dataset-eda-and-binary-classification/data |
2 | CIFAR-10/100 | https://www.cs.toronto.edu/~kriz/cifar.html |
3 | COVID-19 Dataset by Our World in Data | https://github.com/owid/covid-19-data |
4 | Pumpkin_Seeds_Dataset | https://www.kaggle.com/datasets/muratkokludataset/pumpkin-seeds-dataset |
5 | Breast Cancer Wisconsin | https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data |
6 | World Mortality Dataset | https://github.com/akarlinsky/world_mortality |
7 | MNIST in TensorFlow | https://www.tensorflow.org/datasets/catalog/mnist |
8 | Taiwan Death Detail | Unknown |
9 | House Price | https://kaggle.com/datasets/rsizem2/house-prices-ames-cleaned-dataset |
No. | Filename | Description |
---|---|---|
1 | ML_w2_BasicProgStructure.ipynb | Original Decision Tree algorithm demonstration in Jupyter Notebook. |
2 | ML_w2_DecisionTree_basic.py | Perform a single round of Decision Tree training and testing on tumor-cell data. |
3 | ML_w2_DecisionTree_consistency.py | Perform multiple rounds of Decision Tree training and testing on tumor-cell data to check the algorithm's consistency. |
4 | BD_w3_OWID_COVID19_class.ipynb | Code follows teacher's video instruction. |
5 | BD_w3_OWID_COVID19_hw.ipynb | BD w3 Homework. |
6 | ML_w3_BasicProgStructure_MoreModel.ipynb | Add different models. |
7 | ML_w3_BasicProgStructure_AddStdScaler.ipynb | Add scaler to preprocess data before using models. |
8 | ML_w3_hw_q1.ipynb | Question inside, output file in data/w3q1data/ |
9 | ML_w3_hw_q2.ipynb | Question inside, output file in data/w3q2data/ |
10 | BD_w3_OWID_COVID19_hw_adjusted.ipynb | Adjusted code based on hw last week and teacher's answer. |
11 | BD_w4_hw.ipynb | Different approach to display specified columns from the original dataset. |
12 | ML_w4_PrepareTrainingSetWithFewerCancer.ipynb | Show that dropping some rows of cancer data can improve accuracy. |
13 | ML_w4_hw_q1.ipynb | Try tuning the ratio between cancer-positive and cancer-negative data to achieve higher accuracy. |
14 | ML_w4_hw_q2.ipynb | Try removing some features of the original dataset to see whether this can improve accuracy. |
15 | ML_w5_TuningOnTrainData_BASIC.ipynb | W5 class material + hw. |
16 | BD_w6_hw.ipynb | W6 class homework. Q1: scatter plot data. Q2: plot two data comparison. |
17 | ML_w6_Homework_wdbc.ipynb | ML result analysis with wdbc cancer cell dataset. |
18 | ML_w6_Homework_Pumpkin.ipynb | ML result analysis with Pumpkin seed dataset. |
19 | ML_w6_module.py | Modularize the w6 homeworks. |
20 | BD_w7_hw.ipynb | Analyze the causes behind the trend of "% deaths by cases". |
21 | ML_w7_hw.ipynb | Machine learning with mostly text data. |
22 | BD_midterm.ipynb | BD midterm answer. |
23 | ML_midterm.ipynb | ML midterm answer. |
24 | BD_w10_hw.ipynb | COVID-19-caused excess deaths across the globe. |
25 | ML_w10_cn.ipynb | Overfitting, details about DT and RF. |
26 | ML_w10_DT_RF_DiveIn_Homework.ipynb | Plot tree structure of trained result of given dataset. |
27 | BD_w11_ExcessDeathAndCovidDeathPhase1.ipynb | A clean and fast way to calculate excess death. |
28 | BD_w11_ReadMeFirst.ipynb | Why the excess-death calculation approach is not logical. |
29 | ML_w11_SVM_Homework.ipynb | Details of SVM. |
30 | BD_w12_hw.ipynb | Demonstrate excess death changes in the range of 2020 to 2022. |
31 | ML_w12_hw.ipynb | Start using tf.keras. |
32 | BD_w13_hw.ipynb | Adjust the result of BD_w12_hw.ipynb. |
33 | ML_w13_redoprehw.ipynb | Redo "ML_w12_hw" with some adjustments. |
34 | ML_w13_hw.ipynb | CNN on CIFAR 10/100 dataset. |
35 | ML_w14_redoprehw.ipynb | CNN model fits successfully, accelerated by GPU. |
36 | ML_w14_hw.ipynb | Build CNN module for multiple models and extra features like export models, learning curve, prediction image. |
37 | ML_w14_hw.py | Loop executing built CNN module in both building and testing models. |
38 | BD_w14_hw.ipynb | Try to use really messed up Taiwan COVID-19 Death Detail dataset. |
39 | BD_w15_redoprehw.ipynb | Redo COVID-19 mortality scatter plot with correct result. |
40 | BD_w15_hw.ipynb | Try to use really messed up Taiwan COVID-19 Death Detail dataset. |
41 | BD_w16_class.ipynb | Processing dataset column with complex content. |
42 | ML_w16_hw.ipynb | Build CNN module for multiple models and extra features like export models, learning curve, prediction image, and module for preprocessing images. |
43 | ML_w16_hw.py | Execute the modules in ML_w16_hw.ipynb. |
44 | BD_final.ipynb | BD Final. |
- ML_w2_DecisionTree_DevelopmentTrend.py: use the data gathered while looping the algorithm and display it as a bar chart or line graph.
- ML_w3_MultiModelComparison.py: compare different models mentioned in class.
- ML_w3_StdScalerPerformance.py: compare the difference with and without using the scaler.
- Prepare/Preprocess Data
- Remove unused columns.
- Convert datatype. (str2timestamp, str2int, str2float, obj2str...)
- Plot Data (scatter plot, not line plot)
- Column to column.
- Interaction between two columns.
Don't use 'concat' in most cases, since some data in a column might be missing, which can cause a lot of errors.
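The prepare-then-plot flow above can be sketched as follows, using a small in-memory DataFrame in place of the real CSV (the column names here are hypothetical, not the actual dataset's):

```python
# Sketch of the prepare/preprocess + scatter-plot steps above.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "new_cases": ["100", "150", None],   # str column with a missing value
    "new_deaths": ["1", "2", "3"],
    "notes": ["a", "b", "c"],            # unused column
})

df = df.drop(columns=["notes"])                   # remove unused columns
df["date"] = pd.to_datetime(df["date"])           # str2timestamp
df["new_cases"] = pd.to_numeric(df["new_cases"])  # str2float (NaN survives)
df["new_deaths"] = df["new_deaths"].astype(int)   # str2int

# Scatter one column against another; rows with NaN are simply not drawn,
# which is one reason blindly joining misaligned columns is risky.
plt.scatter(df["new_cases"], df["new_deaths"])
plt.xlabel("new_cases")
plt.ylabel("new_deaths")
```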
- Prepare/Preprocess Data (the dataset needs to be appropriate for the question)
- Read the dataset from the file. (pandas/scipy.io.arff/python)
- Check features (X) correlation with the result (y), and drop features with low correlation. Not necessary, but might help reduce time and not mislead the algorithm. (pandas)
- Deal with missing values (NAN/NA). (pandas)
- Convert non-numerical data to numbers representing them. (pandas)
- (optional) Balance out an imbalanced dataset. (imbalanced: one result category, e.g. y=0, has many more rows than the other/others.) (python)
- Split features (X) and result (y). (python)
- (optional) Scale dataset. (sklearn)
- Split the dataset into either train+test or train+cross-validation+test subsets. (random state is optional) (sklearn)
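The preprocessing steps above could look like this on a toy frame standing in for a dataset like wdbc (the column names are made up for illustration):

```python
# Minimal sketch: missing values -> label encoding -> X/y split -> scale -> split.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "radius": [10.1, 20.3, None, 15.2, 11.0, 18.4, 12.5, 19.9],
    "texture": [5.0, 9.1, 7.2, 8.8, 5.5, 9.0, 6.1, 8.7],
    "diagnosis": ["B", "M", "B", "M", "B", "M", "B", "M"],
})

df = df.dropna()                                          # deal with missing values
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})   # non-numerical -> numbers

X = df.drop(columns=["diagnosis"])                        # features
y = df["diagnosis"]                                       # result

X_scaled = StandardScaler().fit_transform(X)              # (optional) scale

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)          # train/test split
```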
- Deploy Model
- Select model. (sklearn->supervised/unsupervised)
- Logistic Regression, LR.
- Decision Tree, DT.
- Random Forest, RF.
- Support Vector Machine, SVM. (SVC)
- K-Nearest Neighbor, KNN.
- Train model. (sklearn->fit)
- Test model. (sklearn->predict)
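The select / train / test loop over the models listed above can be sketched like this, using sklearn's bundled breast-cancer data as a stand-in for the course datasets:

```python
# Sketch: fit and score each of the five classifiers named above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=5000),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train
    scores[name] = model.score(X_test, y_test)   # test (accuracy)
```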
- Result Analysis
- accuracy (sklearn)
- precision (sklearn)
- recall (sklearn)
- f1 (sklearn)
- confusion matrix (sklearn)
- separate results from different models (pandas->groupby)
- scale ratio (dataset/python)
- label ratio (dataset/python)
- model (sklearn)
- model parameters (ex: class_weight) (sklearn)
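The metrics listed above all come from sklearn.metrics; a quick sketch on a hand-made pair of label vectors:

```python
# accuracy / precision / recall / f1 / confusion matrix from sklearn.metrics.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
pre = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class
```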
Need to code a module for the final test:
- ML accessible features: process time (train/test), dataset size, weight, scale info, acc, pre, rec, f1, fitting based on iteration
- supports multiple different datasets
- preprocess dataset and comment info about accessible features
- support for multi-run tests to average out accessible features
- supports multiple algorithms
- with commands to test different aspects of the accessible features
- provide a table for questions like Q9 in ML_midterm.ipynb.
- BD accessible features:
- supports multiple different datasets
- preprocess dataset
- support all questions asked in hw
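A minimal sketch of the ML module idea above: loop algorithms over several runs, time the fit calls, and average the metrics (the function name `evaluate` and the use of the bundled breast-cancer data are illustrative, not the actual module):

```python
# Multi-run, multi-algorithm evaluation with averaged accuracy and fit time.
import time
from statistics import mean
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def evaluate(model_factory, X, y, runs=3):
    accs, fit_times = [], []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
        model = model_factory()
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)                     # timed training
        fit_times.append(time.perf_counter() - t0)
        accs.append(model.score(X_te, y_te))      # per-run test accuracy
    return {"acc": mean(accs), "fit_time": mean(fit_times)}

X, y = load_breast_cancer(return_X_y=True)
results = {name: evaluate(factory, X, y)
           for name, factory in [("DT", DecisionTreeClassifier),
                                 ("RF", RandomForestClassifier)]}
```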
- It might perform better if the output node is strictly 1 result per node. (ML_w12_hw/w13 videos).
- It might have better performance if the input data is properly scaled.
- Sometimes it's not necessary to specify batch_size.
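The three tips above can be sketched in tf.keras: inputs already scaled to [0, 1], a single-unit sigmoid output layer, and `batch_size` left at its Keras default of 32. The toy data is made up, and the block is guarded so it still runs where TensorFlow is absent:

```python
# Scaled input + single output node + default batch_size, in tf.keras.
import numpy as np

X = np.random.rand(64, 8).astype("float32")      # already in [0, 1]
y = (X.sum(axis=1) > 4).astype("float32")        # toy binary labels

try:
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # one result per node
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # batch_size not specified: Keras defaults to 32
    model.fit(X, y, epochs=2, verbose=0)
    built = True
except ImportError:
    built = False
```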
- Install Visual Studio Code
- Install Python 3.10.8
- Install VC++ https://learn.microsoft.com/en-US/cpp/windows/latest-supported-vc-redist?view=msvc-170
- Install NVIDIA GPU drivers https://www.nvidia.com/download/index.aspx?lang=en-us
- Install CUDA Toolkit 11.2 https://developer.nvidia.com/cuda-toolkit-archive
- Install cuDNN SDK 8.1.0 https://developer.nvidia.com/rdp/cudnn-archive
- Verify the installation: https://www.tensorflow.org/install/pip (step 7, "Verify install")
- Enable TensorFlow GPU acceleration: https://stackoverflow.com/questions/45662253/can-i-run-keras-model-on-gpu
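The verification step boils down to the one-liner from the TensorFlow install guide, wrapped here so it also runs where TensorFlow is not installed:

```python
# Check whether TensorFlow can see a GPU after the steps above.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    status = f"TensorFlow sees {len(gpus)} GPU(s)"
except ImportError:
    status = "TensorFlow is not installed in this environment"
print(status)
```

An empty GPU list usually means a CUDA/cuDNN version mismatch with the installed TensorFlow build.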