
Machine Learning to detect counterfeit bills 💵

Using machine learning to defeat criminals who try to print their own currency

⭐ Star me on GitHub — it helps!

Table of contents

- Background
- Installation
- Solution
- License
- Links

Background

You work for Acme Money Analysis and Prediction Enterprises (AMAPE for short). As engineers for this company, you are developing an app for the U.S. Treasury to aid in the detection of counterfeit bills. You will be supplied with a data set that provides four features per bill, along with whether that bill was genuine or counterfeit. You have been assigned to develop AMAPE's official prediction model, which will be used by anybody who accepts cash.

As with any problem, you first want to study the data. Read in the database given in the problem; it is in the CSV file data_banknote_authentication.txt provided to you. You may want to use pandas read_csv, which places the data in a DataFrame, similar to, but not to be confused with, a dictionary. This data set contains observations based on measurements made on a number of bills. The last column indicates whether the bill is genuine (1) or counterfeit (0). Based on the measurements, you need to build a predictor to determine whether a bill is genuine or counterfeit. The columns in the database are:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)
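As a minimal sketch of that first step (assuming the file sits next to the script and, as in the original UCI distribution, has no header row):

    import pandas as pd

    # The raw file has no header row, so supply the column names ourselves
    columns = ["variance", "skewness", "curtosis", "entropy", "class"]
    df = pd.read_csv("data_banknote_authentication.txt", header=None, names=columns)

    print(df.head())                     # first few rows
    print(df["class"].value_counts())    # genuine vs. counterfeit counts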

Installation

This project uses SciPy as its primary library to solve machine learning tasks.

Installing via pip

You can install packages via the command line by entering:

python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn

Install system-wide via a package manager

System package managers can install the most common Python packages. They install packages for the entire computer, often carry older versions, and don't offer as many packages as pip.

Ubuntu

using apt-get:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

Mac

Homebrew has incomplete coverage of the SciPy ecosystem, but it does install these packages:

brew install numpy scipy ipython jupyter

Solution

proj1_A.py statistically analyzes the data and computes the correlation of each variable. The input file ('data_banknote_authentication.txt') is read with the pandas package and analyzed for covariance and correlation. The covariance and correlation matrices are printed to the console, and a heat-map figure is displayed for visualization, along with a pair-plot of every variable (dependent and independent).

From this analysis it can be inferred with confidence that variance, skewness, curtosis, and entropy are the most correlated with the dependent variable class, in that order. Among the independent variables, curtosis and skewness are the most correlated pair, followed by entropy-skewness, curtosis-variance, entropy-curtosis, and entropy-variance; skewness-variance is the least correlated pair. The covariance results show a similar trend: the curtosis-skewness pair has the highest covariance, followed by entropy-skewness, curtosis-variance, skewness-variance, entropy-curtosis, and entropy-variance.

Based on this analysis, since entropy is the least correlated with class, we can drop it, leaving variance, skewness, and curtosis as potential predictors of the class. But since curtosis is highly correlated with skewness, only variance and skewness remain as the best candidates among the independent variables.
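proj1_A.py itself is not reproduced here; the following is a minimal sketch of the same analysis, assuming pandas, seaborn, and matplotlib are installed and the data file is in the working directory:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    columns = ["variance", "skewness", "curtosis", "entropy", "class"]
    df = pd.read_csv("data_banknote_authentication.txt", header=None, names=columns)

    # Covariance and correlation matrices, printed to the console
    print(df.cov())
    print(df.corr())

    # Heat-map of the correlations, plus a pair-plot of every variable
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
    plt.show()
    sns.pairplot(df, hue="class")
    plt.show()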

proj2_A.py uses:

🟢 Perceptron
🟢 Logistic Regression
🟢 Support Vector Machine
🟢 Decision Tree Learning
🟢 Random Forest
🟢 K-Nearest Neighbor

All of the ML algorithms follow the same function structure:

    # requires: from sklearn.svm import SVC
    def support_vector_machine(self, verbose=0):
        # Optimize over the regularization strength C
        param_grid = [{'C': [0.1, 1.0, 5.0, 10.0, 20.0]}]
        svm = SVC(kernel='linear', random_state=0)

        # Grid-search for the best model
        best_model, prospective_models = self.best_model(svm, param_grid, verbose)

        # Print the parameters that give the highest accuracy
        print("Parameters that give the highest accuracy achieved for SVM:",
              prospective_models.best_params_)

        # Return the best model's test set accuracy
        return self.accuracy(best_model, verbose)
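The helpers self.best_model and self.accuracy are not shown in the excerpt above. The following is a plausible sketch of what they do, assuming the class stores the 70/30 train/test split as self.X_train, self.y_train, self.X_test, and self.y_test (hypothetical attribute names, not confirmed by the source):

    from sklearn.model_selection import GridSearchCV

    def best_model(self, estimator, param_grid, verbose=0):
        # Exhaustive grid search with 5-fold cross-validation on the training set
        search = GridSearchCV(estimator, param_grid, cv=5, verbose=verbose)
        search.fit(self.X_train, self.y_train)
        return search.best_estimator_, search

    def accuracy(self, model, verbose=0):
        # Score the chosen model on the held-out 30% test set
        score = model.score(self.X_test, self.y_test)
        if verbose:
            print("Test accuracy: {:.3f}".format(score))
        return score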

For each machine learning method used, the best parameter values are as follows:

| Method | Best parameters | Best accuracy |
| --- | --- | --- |
| Perceptron | max_iter = 10 | 0.983 |
| Logistic Regression | C = 30 | 0.987 |
| Support Vector Machine | C = 10 | 0.987 |
| Decision Tree Learning | criterion = 'gini', max_depth = 7 | 0.975 |
| Random Forest | criterion = 'gini', n_estimators = 100 | 0.99 |
| K-Nearest Neighbor | n_neighbors = 10, p = 2 | 0.997 |

Based on the table above, one can observe that KNN gives the highest accuracy, with 10 neighbors and a Euclidean distance metric (p = 2). To get this result I used GridSearchCV in my program (proj1_B.py) to run each ML algorithm over a range of parameters and return the best ones. The data is split into a 70% training set and a 30% test set, and 5-fold cross-validation is chosen in the GridSearchCV parameters.
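As a minimal, self-contained sketch of that search for the winning KNN model (the parameter grid here is an illustrative assumption, not the exact grid used in the project):

    import pandas as pd
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    columns = ["variance", "skewness", "curtosis", "entropy", "class"]
    df = pd.read_csv("data_banknote_authentication.txt", header=None, names=columns)
    X, y = df[columns[:-1]].values, df["class"].values

    # 70% training / 30% test split, as described above
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Search over the neighborhood size and the Minkowski power p (p=2 is Euclidean)
    param_grid = {"n_neighbors": [1, 5, 10, 15, 20], "p": [1, 2]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)

    print("Best parameters:", search.best_params_)
    print("Test accuracy:", search.best_estimator_.score(X_test, y_test))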

License

MIT License

Links

✳️ Website
✳️ LinkedIn
