#  Data Science - Machine Learning

## What is Data Science?
Data science is the study of data. It involves developing methods for recording, storing, and analyzing data in order to extract useful information. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.


Another way to put this definition is that data science means uncovering the insights, patterns, and trends that are *hidden* in the data, i.e., using data to understand things. Among the “things” we would like to understand better are:
-	Your customers
-	Your business processes
-	How your product is consumed / perceived / liked by your customers.


This brings us to a further wording of the same concept, one that I especially like: data science means transforming data into knowledge that can guide decision making.

The key insight here is that **data in itself has practically zero value when it comes to making decisions**. Data is no more than a (sometimes huge) collection of recorded facts about some entities, such as your customers, your sales transactions, or your web interactions. Although managing this repository of facts might cost you and your company a lot of money, time, and effort, that cost does not automatically translate into value.



### Data Science as an Umbrella Term

- Artificial intelligence
- Machine learning
- Statistics
- ETL and data cleanup: ETL stands for extract, transform, load, i.e., the stages that come before data exploration, hypothesis formulation, and model construction.
- Data Visualisation
- Algorithms and infrastructure
- Efficient data storage and retrieval architectures
- Business knowledge / understanding of the application domain
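
To make the ETL item above more concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names, and the `sales` table are made up for illustration; a real pipeline would read from actual sources and apply its own cleanup rules.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: column names and values are invented for this sketch.
RAW_CSV = """customer_id,country,amount
1, us ,19.99
2,DE,
3,us,5.00
3,us,5.00
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, normalise fields, remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # missing amount: skip the row
        record = (
            int(row["customer_id"]),
            row["country"].strip().upper(),
            float(row["amount"]),
        )
        if record in seen:
            continue  # exact duplicate: keep only the first copy
        seen.add(record)
        clean.append(record)
    return clean

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT customer_id, country, amount FROM sales ORDER BY customer_id"
).fetchall()
print(rows)
```

Only after these three stages is the data in a shape where exploration and modelling can start.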


### Why is data science “a thing” nowadays?

1.	The advent of the Internet economy and the explosion of mobile apps have caused a deluge of data waiting to be turned into value.
2.	The sharp decrease in the cost of data storage and processing. According to published estimates, the cost of a hard drive per GB dropped from around 100 USD in 1997 to 1 USD in 2005, and then to around 0.03 USD in 2017. This has fueled trends such as “Big Data” and “distributed computing”.
3.	The development of a wealth of innovative machine learning (ML) and deep learning (DL) algorithms.
4.	On the hardware side, the availability of GPUs, which are used to run the heavy computations required by deep learning algorithms.


![img](images/oneminofinternet.jpg)

## Applications of Data Science

![img](images/Data-Science-Applications-01.jpg)

## What It Takes to Be a Data Scientist

![img](images/venn_diag.png)

### Opportunities in Data Science
- Business Intelligence Developer
- Data Architect
- Applications Architect
- Infrastructure Architect
- Enterprise Architect
- Data Analyst
- Data Scientist
- Data Engineer
- Machine Learning Scientist
- Machine Learning Engineer
- Statistician

## What is Machine Learning?
- Machine Learning is the science (and art) of programming computers so they can learn from data.

- Here is a slightly more general definition:
  "[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959)

- And a more engineering-oriented one:
  "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (Tom Mitchell, 1997)

### How Machines Learn from Experience
There are three common ways in which machines learn from experience (data):
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

### Supervised Learning
In supervised learning, the algorithm is given a dataset in which each example consists of many features together with an associated label (or target). In other words, the training data you feed to the algorithm includes the desired solutions, called labels.

![img](images/image2.png)

Examples of learning problems/tasks:
- Classification: assign a category to each item, e.g., cat-or-dog prediction.
- Regression: predict a value for each item, e.g., customer salary prediction.

Here are some of the most important supervised learning algorithms:
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks
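
As a rough illustration of supervised learning, here is a minimal sketch of k-Nearest Neighbors applied to the cat-or-dog classification task. The features (weight in kg, height in cm) and all the numbers are invented for the example, and `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

# Labelled training data: (features, label). Features are (weight_kg, height_cm).
# All values are made up for illustration.
training_data = [
    ((4.0, 25.0), "cat"),
    ((4.5, 23.0), "cat"),
    ((3.8, 24.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((25.0, 60.0), "dog"),
    ((18.0, 50.0), "dog"),
]

def knn_predict(examples, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest examples."""
    # Sort the labelled examples by Euclidean distance to the query point.
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    # Count the labels of the k closest examples and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training_data, (4.2, 24.0)))
print(knn_predict(training_data, (22.0, 57.0)))
```

The labels in `training_data` are the "desired solutions": the algorithm never sees a rule for telling cats from dogs, only labelled examples.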

![img](images/image6.png)

![img](images/image7.png)

![img](images/image9.png)

![img](images/image10.png)

### Unsupervised Learning

In contrast, in unsupervised learning it is not known in advance what the computer needs to learn: you only have input data, without corresponding labels. The machine learning algorithm must figure out by itself how to interpret and structure the data. There are two main tasks in unsupervised learning:

- Unsupervised learning for clustering: the algorithm tries to group the input data into categories, for example red, blue, and green. The category an input ends up in then serves as its label.
- Unsupervised learning for association: the algorithm tries to discover rules that describe large portions of the data, e.g., people who smoke tend to get cancer.

![img](images/image3.png)

![img](images/image4.png)
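
The clustering task described above can be sketched with k-means, one common clustering algorithm (the text does not name a specific one, so this choice is an assumption). The points and the starting centroids are made up, and `kmeans` is a hypothetical helper written for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Very small k-means: alternate between assigning points and moving centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabelled input data: there is no "cat"/"dog" column here, only coordinates.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

# Start from two of the data points; real implementations usually pick starts at random.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Note that the algorithm only returns groups; deciding what each group means is left to whoever reads the result.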

### Reinforcement Learning
Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in the figure below). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

![img](images/image5.png)
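
The agent, reward, and policy vocabulary can be made concrete with a tiny sketch. It uses tabular Q-learning, which is one particular reinforcement learning algorithm chosen here for brevity, not something the text above prescribes. The environment (a corridor of five cells with a reward at the right end) and all constants are invented for illustration.

```python
import random

N_STATES = 5                 # cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)           # move one cell left or one cell right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(q, state, rng):
    """Epsilon-greedy: mostly pick the best known action, sometimes explore."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=300, seed=0):
    rng = random.Random(seed)
    # q[state][action] estimates the long-term reward of taking `action` in `state`.
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(q, state, rng)
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted future value.
            target = reward + GAMMA * max(q[next_state].values())
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# The learned policy: for each non-terminal cell, the action with the highest estimate.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Nobody tells the agent that "right" is the correct direction; it works that out from the rewards it collects.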


## Evaluating the performance of a model

Associated with every ML algorithm is a loss function: a measure of how incorrect the output is for a given input. It is closely related to the performance metrics used to judge a model. The aim of the algorithm is to minimize this loss as much as possible, iteratively, using the training data. The performance of the algorithm cannot be measured on the same training data it was fitted with; it has to be measured on data that was held out from training.

![img](images/image11.png)
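
Here is a minimal sketch of the idea, under some simplifying assumptions: the model is a single weight `w` with prediction `w * x`, the loss is the mean squared error, and the minimization uses plain gradient descent. The data points, the learning rate, and the number of steps are made up for the example.

```python
# Tiny dataset, split by hand into a training part and a held-out test part.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.1, 5.9, 9.2, 11.8]
test_x, test_y = [5.0, 6.0], [15.1, 17.9]

def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0
learning_rate = 0.01
history = []  # training loss before each update, to watch it shrink

for _ in range(200):
    history.append(mse_loss(w, train_x, train_y))
    # Move w a small step in the direction that reduces the training loss.
    w -= learning_rate * gradient(w, train_x, train_y)

train_loss = mse_loss(w, train_x, train_y)
test_loss = mse_loss(w, test_x, test_y)  # measured on data the model never saw
print(f"w={w:.3f} train_loss={train_loss:.4f} test_loss={test_loss:.4f}")
```

The number to report is `test_loss`: `train_loss` only says how well the model fits the examples it was tuned on.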

## Main Challenges of Machine Learning

- Insufficient Quantity of Training Data
- Nonrepresentative Training Data
- Poor-Quality Data

If some instances are clearly outliers, it may help to simply discard them or try to
fix the errors manually.<br>
If some instances are missing a few features (e.g., 5% of your customers did not
specify their age), you must decide whether you want to ignore this attribute altogether,
ignore these instances, fill in the missing values (e.g., with the median
age), or train one model with the feature and one model without it, and so on.
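
Two of these options, ignoring the incomplete instances and filling in the median, can be sketched in a few lines. The customer records below are invented for the example.

```python
from statistics import median

# Hypothetical customer records; None marks a customer who did not give an age.
customers = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
    {"id": 4, "age": 31},
    {"id": 5, "age": None},
    {"id": 6, "age": 52},
]

# Option 1: ignore the instances that are missing the feature.
complete_only = [c for c in customers if c["age"] is not None]

# Option 2: fill in the missing values with the median of the known ages.
median_age = median(c["age"] for c in complete_only)
imputed = [
    {**c, "age": c["age"] if c["age"] is not None else median_age}
    for c in customers
]

print(len(complete_only), median_age)
print(imputed)
```

Whichever option you pick, the same rule (for example, the same median value) has to be applied later to any new data the model sees.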

- Overfitting and Underfitting

![img](images/image12.png)

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:

- Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
- Gather more training data.
- Reduce the noise in the training data (e.g., fix data errors and remove outliers).
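
To see the linear versus high-degree polynomial contrast in numbers, here is a small sketch. It assumes the true relationship is `y = 2x` and adds hand-picked noise to six training points; the polynomial is the degree-5 curve that passes exactly through all of them. The data and the helper names are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares straight line (two parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: y = 2x plus hand-picked noise. Test data: noise-free points in between.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_y = [2 * x + e for x, e in zip(train_x, noise)]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2 * x for x in test_x]

line_model = fit_line(train_x, train_y)
poly_model = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("line", line_model), ("polynomial", poly_model)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The polynomial reproduces the training points perfectly because it has fitted the noise as well as the trend, and that is what hurts it on the points in between.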

**Underfitting the Training Data**

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples. The main options to fix this problem are:

- Select a more powerful model, with more parameters.
- Feed better features to the learning algorithm (feature engineering).
- Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).
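
The effect of the regularization hyperparameter can be sketched with a one-weight ridge regression. This is an illustrative choice of regularized model, assuming a prediction of the form `w * x` and a penalty `lam * w**2` added to the sum of squared errors. The data and the values of `lam` are made up.

```python
def fit_ridge(xs, ys, lam):
    """Weight minimising sum((w*x - y)**2) + lam * w**2, in closed form."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# From a heavily constrained model (large lam) to an unconstrained one (lam = 0).
lambdas = [100.0, 10.0, 0.0]
weights = [fit_ridge(xs, ys, lam) for lam in lambdas]
errors = [mse(w, xs, ys) for w in weights]

for lam, w, err in zip(lambdas, weights, errors):
    print(f"lam={lam:6.1f}  w={w:.3f}  training MSE={err:.3f}")
```

With a large `lam` the weight is pulled toward zero and the model cannot follow the data, which is underfitting; lowering `lam` removes that constraint. Pushing it too far in the other direction is what opens the door to overfitting on more flexible models.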