Skip to content

This repository houses small projects (R and Python) exploring Machine Learning methods and algorithm description & implementation. Larger ML projects are housed in the Projects Repo.

Notifications You must be signed in to change notification settings

cordero-c-perez/Machine-Learning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ML Repo

Note: some data is not provided in the repository. Links are available in the README when the data is not included.

This repository houses an ongoing series of small projects exploring Machine Learning models/implementations. There will typically be three files for applications in R (.r, .rmd, .md), and two files for applications in Python (.ipynb, .md) The README file descriptions below include the link to the markdown file and a small description of the ML implementation.

Files

Applies kNN classification (R) to speech recognition data and aims to identify the “best k” via manual cross-validation under 17 values of k. Overall classification accuracy serves as the measure of performance here and the data used for this can be found here.

Applies kNN classification (Python) to speech recognition data and aims to evalaute performance for the out of box model vs. a tuned model. Overall classification accuracy, precision, and recall, all serve as the measures of performance here and the data used for this project can be found here.

Applies 4 decision tree algorithms (C5.0, OneR, rpart, and randomForest (R)) to a diabetes dataset offered in the UCI machine learning repository with the aim to correctly classify the presence of diabetes given the presence of other conditions. The goal is to then improve on the models with a business objective in mind, NOT to improve the overall accuracy, and compare. The business objective being that the presence of false negatives in trying to predict the presence of a condition outweighs the overall accuracy of correct classification.Data used for this project can be found here.

This can be considered a supervised learning application (R) with a commonly used dataset (iris) to explore distinctions between three techniques, Hierarchical Clustering, Kmeans and Density Based Spatial Cluster Applications w/ Noise (DBSCAN) in R. Although this is not supervised learning in the traditional sense, this project explores applying three methods in a supervised setting as the data has the correct cluster classification available (species) to check the results against. This is important for identifying subtle nuances between methods and understanding the built-in assumptions of these functions. The Iris dataset is used as there are only quantitative variables present and thus removes the trouble of creating a proper distance or dissimilarity matrix for mixed data types (explored in later projects with mixed data). The real takeaway for this application is that in order to cluster effectively, the measurements chosen for features are far more important than the clustering algorithm itself.

Applies Logistic Regression (Python) to the breast cancer dataset with the aim of identifying the presence of malignant tissue samples given the sample features. This model does very well achieving 98% recall and 100% precision. Here recall is probably more important from the business' perspective so an iteration which trades precision for recall would be better suited for business implementation.

About

This repository houses small projects (R and Python) exploring Machine Learning methods and algorithm description & implementation. Larger ML projects are housed in the Projects Repo.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published