# A Broad Overview

This notebook gives an overview of the topics covered in this series of notebooks.

The goal of these materials is to give you the tools to complete end-to-end data science/machine learning projects with the help of Python. To that end, we will cover the following topics in varying amounts of detail.

## Data Collection

In order to complete a data-based project, you first need to have data. We will cover a few different ways that you can collect your own data including:
- Data competition websites,
- Data repositories,
- Databases and
- Web Scraping.

## Data Analysis/Exploration

Prior to placing a data science solution into production you often will have to perform some rudimentary or exploratory data analysis (or EDA). We will see common approaches/techniques while learning data science algorithms including:
- Exploratory plotting,
- Examining basic statistics and
- Data manipulation with `pandas` and `numpy`.
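As a minimal sketch of the last two items, assuming a small made-up data set (the column names `height_cm` and `weight_kg` are hypothetical and chosen only for illustration), basic statistics and simple manipulation with `pandas` and `numpy` might look like this. Plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# Basic statistics: count, mean, std, min, quartiles and max per column
summary = df.describe()

# Data manipulation: pull a column out as a numpy array and average it
mean_height = np.mean(df["height_cm"].to_numpy())

# Pearson correlation between the two columns
corr = df["height_cm"].corr(df["weight_kg"])

print(summary)
print(mean_height, corr)
```

In a real project the `DataFrame` would come from one of the data sources listed under Data Collection rather than being typed in by hand.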

## Data Cleaning

Perhaps surprisingly, perhaps unsurprisingly, a majority of any data science project tends to be cleaning and preparing your data for whatever solution you have built/are going to build. Throughout our notebooks we will see an array of data cleaning steps including:
- Cleaning data files,
- Cleaning text data with `str` functionality,
- Imputing missing values,
- Creating new columns from existing columns and
- More.
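To make a few of these steps concrete, here is a small sketch on a made-up table (the columns `city`, `price` and `quantity` are hypothetical). It cleans text with the `str` accessor, imputes a missing value with the column median (one of several possible imputation strategies) and creates a new column from existing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data, invented for illustration
raw = pd.DataFrame({
    "city": ["  Columbus", "cleveland ", "COLUMBUS"],
    "price": [10.0, np.nan, 30.0],
    "quantity": [1, 2, 3],
})

clean = raw.copy()

# Clean text data: strip stray whitespace and normalize capitalization
clean["city"] = clean["city"].str.strip().str.title()

# Impute the missing price with the median of the observed prices
clean["price"] = clean["price"].fillna(clean["price"].median())

# Create a new column from existing columns
clean["total"] = clean["price"] * clean["quantity"]

print(clean)
```

Working on a copy (`raw.copy()`) keeps the original data intact, which makes it easier to check each cleaning step against what you started with.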

## Supervised Learning Models

When we work with data that has a label that we want to predict we will use <i>supervised learning</i> models. Supervised learning covers a wide array of problems including:
- Regression, which uses techniques like:
    - Simple linear regression,
    - Multiple linear regression,
    - Polynomial regression,
    - Regularized regression and
    - Time series analysis/forecasting
- Classification, which uses techniques like:
    - Naive Bayes,
    - $k$-Nearest neighbors,
    - Logistic regression,
    - Decision trees,
    - Random forests and
    - Support vector machines.
    
We should note, however, that there are also more general-purpose techniques that can be used for either regression or classification, which we will discuss, including:
- Bagging/Pasting,
- AdaBoost and 
- Gradient boosting.

<i>A quick note: technically, most of the techniques listed under Classification can also be adapted to work for regression problems, as we will see.</i>
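As a rough illustration of how a classification technique can be adapted for regression, the sketch below implements a bare-bones $k$-nearest neighbors predictor in `numpy`. The function `knn_predict` and the toy data are hypothetical, written only for this example. For classification it takes a majority vote among the nearest neighbors; for regression it averages their values instead.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Hypothetical helper: predict for a single point x_new using its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbors = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        labels, counts = np.unique(neighbors, return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: average the neighbors' target values
    return neighbors.mean()

# Made-up one-dimensional training data
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                  # labels for classification
y_value = np.array([1.0, 2.0, 3.0, 20.0, 22.0, 24.0])   # targets for regression
x_new = np.array([1.5])

predicted_label = knn_predict(X_train, y_class, x_new, k=3, task="classification")
predicted_value = knn_predict(X_train, y_value, x_new, k=3, task="regression")
print(predicted_label, predicted_value)
```

Only the final aggregation step changes between the two tasks; the neighbor search is identical.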

## Unsupervised Learning

Sometimes you will want to work on a problem where your data set does not have a label. Such a problem falls under the purview of <i>unsupervised learning</i>, for example:
- Dimensionality reduction, which uses techniques like:
    - Principal components analysis (PCA),
    - t-Distributed stochastic neighbor embedding (t-SNE),
    - Uniform manifold approximation and projection (UMAP) and
    - Singular value decompositions (SVD).
- Clustering, which uses techniques like:
    - $k$-Means clustering and
    - Hierarchical clustering.

## Neural Networks

One modelling framework that has been adapted to work for both supervised and unsupervised problems is that of a neural network. We will also touch on those, with our main focus being on neural nets designed for supervised learning including:
- Perceptrons,
- Multilayer perceptrons or feed forward networks,
- Basic convolutional networks and 
- Recurrent neural networks (if time).

The only neural network framework for unsupervised learning that we may touch on is:
- Autoencoders.

## Presenting/Disseminating Findings

When you finally complete all of your hard work, you may have to (or want to) present your findings to your peers, supervisors or the world. Along the way we will also touch on concepts critical to presenting your work clearly and concisely, such as:
- Knowing your audience,
- Framing your problem,
- Making clear and readable plots and
- More.

# Mamma Mia!

That's a lot of content!

It is, so let us get started! :)

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).