## Difference between a Machine Learning Engineer and a Data Scientist:

The Data Scientist does the statistical analysis and research to determine which machine learning approach to use, models the algorithm, and prototypes it for testing, usually in R, Python/PySpark, etc.

The Machine Learning Engineer partners with the Data Scientist to take the ML model prototyped by the Data Scientist and make it work well in a production environment at scale (i.e. lots of concurrent users). This usually means coding it in a more robust language such as Scala, Java or C++ and using faster data pipelines and parallel processing (Spark, MapReduce, etc.).

The Data Scientist is typically trained to be stronger in Statistics, while the ML Engineer is typically trained to be stronger in Computer Science. However, the two usually know a lot about what the other does and can work together to iterate and optimize.

In the market today, both make similar salaries. We consider them to be fairly close in profile (skills, education, critical thinking and problem solving). The Data Scientist leans more toward statistical analysis and research, and the ML Engineer leans more toward data and engineering.


## Types of Data

* **Nominal Data**:
A categorical variable, also called a nominal variable, is for mutually exclusive, but not ordered, categories. For example, your study might compare five different genotypes. You can code the five genotypes with numbers if you want, but the order is arbitrary and any calculations (for example, computing an average) would be meaningless.
 
* **Ordinal Data**:
With ordinal scales, the order of the values is what is important and significant, but the differences between them are not really known. On a ranked scale such as a satisfaction rating, we know that a #4 is better than a #3 or #2, but we don’t know, and cannot quantify, how much better it is. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say.

* **Interval**:
Interval scales are numeric scales in which we know not only the order, but also the exact differences between the values.  The classic example of an interval scale is Celsius temperature because the difference between each value is the same.  For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.  Time is another good example of an interval scale in which the increments are known, consistent, and measurable.

* **Ratio**:
A ratio variable has all the properties of an interval variable and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight and enzyme activity are ratio variables. Temperature, expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those scales does not mean 'no heat'. However, temperature in Kelvin is a ratio variable, as 0.0 Kelvin really does mean 'no heat'.
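
As a rough illustration of which calculations are meaningful on each scale, here is a small sketch using only the Python standard library. The sample values, the ordering of the satisfaction labels, and all variable names are assumptions made up for this example.

```python
from statistics import median_low, mode

# Nominal: labels with no order. Counting (and the mode) is meaningful;
# an average of arbitrary numeric codes would not be.
genotypes = ["AA", "Ab", "bb", "AA", "Ab", "AA"]  # hypothetical sample
most_common_genotype = mode(genotypes)

# Ordinal: the order is meaningful, the spacing is not.
# Ranks and medians are safe; differences between ranks are not.
SATISFACTION = ["Unhappy", "OK", "Happy", "Very Happy"]  # assumed ordering
rank = {label: position for position, label in enumerate(SATISFACTION)}
responses = ["Happy", "OK", "Very Happy", "Unhappy", "Happy"]
median_response = SATISFACTION[median_low(rank[r] for r in responses)]

# Interval: differences are meaningful (both gaps are 10 degrees Celsius).
celsius_gap_low = 60 - 50
celsius_gap_high = 80 - 70

# Ratio: only a scale with a true zero supports "twice as much".
def celsius_to_kelvin(celsius):
    return celsius + 273.15

naive_ratio = 20 / 10  # 2.0, but 20 C is not "twice as hot" as 10 C
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)  # about 1.035

print(most_common_genotype, median_response)
print(celsius_gap_low, celsius_gap_high)
print(naive_ratio, round(kelvin_ratio, 3))
```

The ratio computed in Kelvin is the physically meaningful one, which is why Celsius counts as an interval scale and Kelvin as a ratio scale.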

## Supervised vs Unsupervised learning

### Supervised Machine Learning

The majority of practical machine learning uses supervised learning.

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
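
The "predict, get corrected, stop when good enough" loop described above can be sketched with gradient descent on a one-variable linear model. This is only one possible way to learn the mapping, not the only one. The toy data (generated from the assumed rule y = 2x + 1), the learning rate, and the error threshold are all assumptions chosen for illustration.

```python
# Training data with known correct answers (assumed toy rule: y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(w, b):
    """Average squared gap between predictions w*x + b and the known answers."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0            # initial guess for the mapping f(x) = w*x + b
learning_rate = 0.05       # assumed step size
acceptable_error = 1e-6    # "acceptable level of performance"
max_steps = 10_000         # safety limit so the loop always ends

steps = 0
while mean_squared_error(w, b) > acceptable_error and steps < max_steps:
    # Predict on the training data and measure how wrong each prediction is.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # The "teacher" correction: gradient of the mean squared error.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    steps += 1

# Use the learned mapping on a new, unseen input.
prediction = w * 10.0 + b
print(f"w={w:.3f}, b={b:.3f}, steps={steps}, f(10)={prediction:.3f}")
```

The learned `w` and `b` should end up close to 2 and 1, so the prediction for the unseen input 10 lands near 21.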

Supervised learning problems can be further grouped into regression and classification problems.

* **Classification**: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
* **Regression**: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Some common types of problems built on top of classification and regression include recommendation and time series prediction, respectively.

Some popular examples of supervised machine learning algorithms are:

* Linear regression for regression problems.
* Random forest for classification and regression problems.
* Support vector machines for classification problems.

### Unsupervised Machine Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

* **Clustering**: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
* **Association**: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:

* k-means for clustering problems.
* Apriori algorithm for association rule learning problems.
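
To make the clustering idea concrete, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The customer data, the two features (visits per month and average spend), and the choice of starting centroids are hypothetical. A real project would more likely use a library implementation and scale the features first.

```python
from math import dist

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 20), (2, 25), (1, 30), (10, 200), (12, 220), (11, 210)]

# Assumed simple initialisation: pick two of the points as starting centroids.
centroids, clusters = k_means(customers, [customers[0], customers[3]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are supplied anywhere; the two groups (occasional low spenders and frequent high spenders) emerge from the distances alone.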


### Semi-Supervised Machine Learning
Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.

These problems sit between supervised and unsupervised learning.

A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.

Many real-world machine learning problems fall into this area, because labeling data can be expensive or time-consuming and may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.

You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed those predictions back into the supervised learning algorithm as training data, and then use the resulting model to make predictions on new, unseen data.
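
The last approach is often called self-training or pseudo-labeling. Below is a minimal sketch of that loop. It assumes a deliberately simple nearest-centroid classifier and a single made-up numeric feature standing in for each image; the function names and data are hypothetical.

```python
def fit_centroids(labeled):
    """Supervised step: compute one mean feature value per class."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# A few labeled examples and many more unlabeled ones (hypothetical values).
labeled = [(1.0, "cat"), (9.0, "dog")]
unlabeled = [1.5, 2.0, 2.5, 7.5, 8.0, 8.5]

# 1. Train on the small labeled set.
model = fit_centroids(labeled)

# 2. Make best-guess predictions (pseudo-labels) for the unlabeled data.
pseudo_labeled = [(x, predict(model, x)) for x in unlabeled]

# 3. Feed the pseudo-labeled data back in as extra training data.
model = fit_centroids(labeled + pseudo_labeled)

# 4. Use the retrained model on new, unseen data.
new_prediction = predict(model, 4.0)
print(model, new_prediction)
```

In practice, only high-confidence pseudo-labels are usually fed back, since wrong guesses get reinforced when the model is retrained on them.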

## Terminologies in Machine Learning

* **Target Variable**: The response, usually denoted by Y, is the variable being predicted in supervised learning. It is also called the dependent variable, output variable or outcome variable. Example: house prices in the Boston dataset.

* **Independent Variable**: An independent variable is the variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable. In machine learning, the independent variables are the inputs to the model, also known as features.

* **Train Data and Test Data**: The data we use is usually split into training data and test data. The training set contains known outputs, and the model learns from this data so that it can generalize to other data later on. The test dataset (or subset) is held back so that we can test the model’s predictions on data it has not seen.

* **Overfitting**: Overfitting means that the model we trained has trained “too well” and is now fit too closely to the training dataset. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be much less accurate on unseen or new data. The model is not generalized, meaning you can’t generalize the results or make inferences on other data, which is ultimately what you are trying to do. When this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise isn’t part of any new dataset, so what the model learned from it cannot be applied to new data.

* **Underfitting**: In contrast to overfitting, an underfitted model does not fit the training data well and therefore misses the trends in the data. It also means the model cannot be generalized to new data. This is usually the result of a model that is too simple (not enough predictors/independent variables). It can also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. Such a model will have poor predictive ability, both on the training data and on other data.

![1_tBErXYVvTw2jSUYK7thU2A.png](attachment:1_tBErXYVvTw2jSUYK7thU2A.png)

* **Cross Validation**: See the PPT.

* **Confusion Matrix**: See the PPT.
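
The train/test split and the difference between underfitting and overfitting can be seen together in one small experiment. The sketch below assumes synthetic data (a linear trend plus random noise) and three hand-written models, all hypothetical: one that is too simple, a least-squares line, and one that memorizes the training rows.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: a linear trend y = 3x + 2 plus random noise.
data = [(x, 3 * x + 2 + rng.gauss(0, 2)) for x in range(40)]

# Train/test split: shuffle, then hold back 25% of the rows for testing.
rng.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

def mse(model, rows):
    """Mean squared error of a model's predictions on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Underfit: too simple, always predicts the mean of the training targets.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

def underfit(x):
    return mean_y

# Reasonable fit: a least-squares line through the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def linear(x):
    return slope * x + intercept

# Overfit: memorizes every training row, noise included, and answers
# with the stored target of the closest memorized input.
def overfit(x):
    return min(train, key=lambda row: abs(row[0] - x))[1]

for name, model in [("underfit", underfit), ("linear", linear), ("overfit", overfit)]:
    print(f"{name:8s} train MSE = {mse(model, train):8.2f}   "
          f"test MSE = {mse(model, test):8.2f}")
```

The memorizing model scores a perfect zero error on the training rows because it has stored their noise, while the mean-only model does poorly on both sets. Comparing the train and test columns, rather than the training error alone, is what reveals either problem.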

# Workflow for a Data Science Project:


### Overview:

* Objective
* Importing Data
* Data Exploration and Data Cleaning
* Baseline Modeling
* Secondary Modeling
* Communicating Results
* Conclusion

