Training six different supervised machine learning models in the scikit-learn framework, then tuning and optimizing the best of them to increase accuracy.
This project is part of Udacity Machine Learning Nanodegree projects.
In this project, I used six different supervised machine learning models to train and test on the dataset. As the evaluation metric I used the F-beta score, which considers both precision and recall:

F𝛽 = (1 + 𝛽²) · (precision · recall) / (𝛽² · precision + recall)

In particular, when 𝛽 = 0.5, more emphasis is placed on precision. This is called the F(0.5) score (or F-score for simplicity).
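As a quick illustration, scikit-learn's `fbeta_score` computes this metric directly (the toy labels below are made up, just to show the call):

```python
from sklearn.metrics import fbeta_score

# Toy labels (hypothetical), just to demonstrate the call.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# beta=0.5 weights precision more heavily than recall.
score = fbeta_score(y_true, y_pred, beta=0.5)
print(round(score, 2))  # 0.75
```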
The tested models are:
1. Random Forest
2. Gradient Boosting
3. Logistic Regression
4. Decision Trees
5. AdaBoost
6. Support Vector Machine
I then visualized graphs comparing:
- Accuracy Score on Training & Testing Subsets.
- F-Score on Training & Testing Subsets.
- Time of Model Training & Testing.
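A minimal sketch of that comparison loop, using synthetic data as a stand-in for the census subsets (model settings and data shapes here are illustrative, not the notebook's exact configuration):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census subsets (shapes are illustrative).
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for clf in (RandomForestClassifier(random_state=0),
            LogisticRegression(max_iter=1000),
            AdaBoostClassifier(random_state=0)):
    start = time.time()
    clf.fit(X_train, y_train)       # measure training time
    train_time = time.time() - start

    start = time.time()
    preds = clf.predict(X_test)     # measure prediction time
    pred_time = time.time() - start

    print(f"{type(clf).__name__}: "
          f"acc={accuracy_score(y_test, preds):.3f} "
          f"f0.5={fbeta_score(y_test, preds, beta=0.5):.3f} "
          f"train={train_time:.3f}s predict={pred_time:.3f}s")
```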
1. Gradient Boosting Classifier | 2. Random Forest Classifier | 3. Logistic Regression |
---|---|---|
4. Decision Tree Classifier | 5. AdaBoost Classifier | 6. Support Vector Machine Classifier |
From the graphs, it is clear that the AdaBoost Classifier performs best on both Accuracy and F-score. We therefore tune it with the grid search technique to optimize its hyperparameters and further increase the Accuracy and F-score, as shown in the results table below.
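A hedged sketch of that tuning step with scikit-learn's `GridSearchCV`, again on synthetic stand-in data (the parameter grid here is illustrative, not necessarily the one used in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the census training subset.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score candidates with F(0.5), matching the project's metric.
scorer = make_scorer(fbeta_score, beta=0.5)

grid = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100],
                "learning_rate": [0.5, 1.0]},
    scoring=scorer,
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
print(grid.best_params_)
print(round(fbeta_score(y_test, best_model.predict(X_test), beta=0.5), 3))
```

`best_estimator_` is the AdaBoost model refit on the full training subset with the winning parameter combination, ready for final evaluation on the test subset.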
I applied supervised learning techniques and an analytical mind on data collected for the U.S. census to help CharityML (a fictitious charity organization) identify people most likely to donate to their cause. I started by exploring the data to learn how the census data is recorded. Next, I applied a series of transformations and preprocessing techniques to manipulate the data into a workable format. Then I evaluated several supervised models on the data and considered which is best suited for the solution. Afterwards, I optimized the selected model.
This project uses the following software and Python libraries:
You will also need to have software installed to run and execute a Jupyter Notebook.
If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.
This project contains three files:

- `finding-donors-for-charityML.ipynb`: The main notebook file, where you will find all the work on the project.
- `census.csv`: The project dataset, which is loaded in the notebook.
- `visuals.py`: A Python file containing visualization code that is run behind the scenes. Do not modify it.
Template code is provided in the `finding-donors-for-charityML.ipynb` notebook file. The `visuals.py` Python file is also required for the visualization functions, as is the `census.csv` dataset file.
In a terminal or command window, navigate to the top-level project directory Finding-Donors-for-CharityML/
(that contains this README) and run one of the following commands:
ipython notebook finding-donors-for-charityML.ipynb
or
jupyter notebook finding-donors-for-charityML.ipynb
This will open the Jupyter Notebook software and project file in your browser.
The modified census dataset consists of approximately 32,000 data points, each with 13 features. It is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid" by Ron Kohavi. You may find this paper online, with the original dataset hosted on the UCI Machine Learning Repository.
**Features**

- `age`: Age
- `workclass`: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- `education_level`: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
- `education-num`: Number of educational years completed
- `marital-status`: Marital Status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- `occupation`: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- `relationship`: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- `race`: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- `sex`: Sex (Female, Male)
- `capital-gain`: Monetary Capital Gains
- `capital-loss`: Monetary Capital Losses
- `hours-per-week`: Average Hours Worked Per Week
- `native-country`: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

**Target Variable**

- `income`: Income Class (<=50K, >50K)
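To illustrate the kind of preprocessing the project applies to this schema, here is a minimal sketch on a tiny made-up sample (the column values below are hypothetical; the notebook works on the full dataset):

```python
import numpy as np
import pandas as pd

# Tiny made-up sample mimicking the census schema.
data = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "capital-gain": [2174, 0],
    "income": ["<=50K", ">50K"],
})

# Log-transform the highly skewed monetary feature.
data["capital-gain"] = np.log1p(data["capital-gain"])

# One-hot encode categorical features and binarize the target.
features = pd.get_dummies(data.drop("income", axis=1))
income = (data["income"] == ">50K").astype(int)

print(list(features.columns))
print(income.tolist())  # [0, 1]
```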
Metric | Naive Predictor | Unoptimized Model | Optimized Model |
---|---|---|---|
Accuracy Score | 0.2478 | 0.8638 | 0.8709 |
F-score | 0.2917 | 0.7333 | 0.7446 |
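The naive-predictor numbers in the table can be reproduced from the class balance alone: always predicting ">50K" makes accuracy and precision equal to the positive-class fraction p, with recall of 1. Taking p = 0.2478 from the table:

```python
# Naive predictor: predict ">50K" for every individual.
p = 0.2478        # fraction of individuals actually earning >50K
beta = 0.5

accuracy = p      # every prediction is positive, so accuracy = positive fraction
precision = p     # same reasoning
recall = 1.0      # every true positive is "found"

fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(round(accuracy, 4), round(fbeta, 4))  # 0.2478 0.2917
```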
- You can download the optimized model and load it with the following commands:

```python
import pickle

# Load the saved optimized model from disk.
filename = 'optimized_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
```
- Random Forest Simple Explanation
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning | Machine Learning Mastery
- Logistic Regression — Detailed Overview | Towards Data Science
- Decision Trees in Machine Learning | Towards Data Science
- Boosting and AdaBoost for Machine Learning | Machine Learning Mastery
- Chapter 2 : SVM (Support Vector Machine) — Theory | Machine Learning 101
- Grid Searching in Machine Learning: Quick Explanation and Python Implementation
- Ahmed Hamido
Inspired by Udacity Machine Learning Engineer Nanodegree.