# DATA MINING AA18-19 F008
<img src="figures/data-mining.png" width=50%>

## Notebook sources

<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover.png" width="100" height="150">
<img align="left" style="padding-right:10px;" src="figures/Hands-on-Machine-Learning-with-Scikit-Learn-and-Tensorflow.png" width="100" height="150">

*The notebooks of this course contain text and examples from* 
- *the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
- *the [Hands-On Machine Learning With Scikit-Learn and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems]() by Aurelien Geron*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). 
If you find this content useful, please consider supporting the work by buying the books*

These notebooks started from material in the books above; much of it was then added to, changed, or removed to suit the [Data Mining course](https://www.uninsubria.it/ugov/degreecourse/107866).

# Preface

Data mining **isn’t a new invention** that came with the digital age.  
The concept has been around for over a century.

Two short videos to introduce the topics of this course: 
- "[What is Data Mining?](https://www.youtube.com/watch?v=R-sGvh6tI04)"
- "[Big data - Superquark 12/07/2017](https://www.youtube.com/watch?v=A2pUx5B_C4A)".

## Data Mining vs. Machine Learning vs. Data Science
REF. [import.io](https://www.import.io/post/data-mining-machine-learning-difference/)

With **big data** becoming so prevalent in the business world, a lot of data terms tend to be thrown around, with many not quite understanding what they mean. 
 * What is data mining? 
 * Is there a difference between machine learning vs. data science? 
 * How do they connect to each other? 
 * Isn’t machine learning just artificial intelligence? 

Both **data mining** and **machine learning** are rooted in **data science** and generally fall under that umbrella.  
They often intersect or are confused with each other, but there are a few key distinctions between the two. 

## Data Mining and Machine Learning differences

One **key difference** between machine learning and data mining is <br> 
``how they are used and applied in our everyday lives``. <br>
For example, 
* <span style="color:blue">data mining is often used by machine learning to see the connections between relationships</span>. 
* <span style="color:blue">Uber uses machine learning</span> to calculate ETAs for rides or meal delivery times for UberEATS.<br>
<img align="left" style="padding-right:10px;" src="figures/uber-ETA.jpg"><img src="figures/uber-eat-ETA.jpg"><br>


* **Machine learning** can look at **patterns** and **learn** from them to adapt behavior for future incidents, while <br>
* **data mining** is typically used as an **information source** for machine learning to pull from.

But some experts have a <span style="color:blue">different idea about data mining and machine learning</span> altogether.  
Instead of focusing on their differences, you could argue that 
* they <span style="color:blue">both</span> concern themselves with the same question: “<span style="color:blue"><b>How can we learn from data?</b>”</span>. <br>
* How we ``acquire and learn from data`` is really the foundation for emerging technology. 

### Data Mining
Data mining can be used for a variety of **purposes** 

* IDENTIFY INDIVIDUAL TARGET GROUPS<img align="right" style="padding-left:10px;" src="figures/clustering-persons.png" width="30%"><br>**Cluster analysis** makes it possible to identify a group of users within an archive according to common characteristics (``age``, ``geographical origin``, ``educational qualification``, etc.). Useful to send, for example, a certain **promotion to the right target** for that product or service (young people, mothers, retirees, etc.). <img align="right" style="padding-left:10px;" src="figures/financial-research.jpg" width="30%">
* MARKETING FORECASTS<br> **Regression analysis** relates changes in customer habits, satisfaction levels, and similar factors to parameters such as the budget of an advertising campaign. When you change one of these parameters, you get a reasonably reliable ``idea of what will happen to your audience of users``. 
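
The marketing-forecast idea above can be sketched in a few lines of plain Python. The figures are invented purely for illustration (they are not course data) and the helper name `fit_line` is hypothetical. The sketch fits a least-squares line relating campaign budget to the number of customers reached, then asks what a new budget would likely yield.

```python
# Illustrative data only (invented): campaign budget in thousands of euros
# and the number of customers reached by each campaign.
budgets = [10, 20, 30, 40, 50]
customers = [120, 190, 310, 405, 480]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(budgets, customers)

# Forecast: what is likely to happen if the budget is changed to 60?
forecast = slope * 60 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.1f}")
```

Later in the course the same fit would normally be done with a library such as Scikit-Learn; the hand-written version is only meant to show what the regression computes.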

### Machine Learning<img align="right" style="padding-left:10px;" src="figures/google-car.jpeg" width="30%">
Machine learning can be used for a variety of purposes:
* **Machine learning** is the technology behind **self-driving cars** that can quickly ``adjust to new conditions while driving``. 
* **Machine learning** also provides **instant recommendations** when a buyer purchases a product from Amazon. 
* Banks are already using and investing in **machine learning** to help look for **fraud** when **credit cards** are swiped by a vendor.

**Machine learning isn’t artificial intelligence**, but the ability to learn and improve is still an impressive feat.

### The Future of Data Mining and Machine Learning

The **future is bright for data science** as the amount of **data will** only **increase**. <br>
By 2020, our accumulated digital universe of data will grow <span style="color:blue">from 4.4 zettabytes (1 zettabyte = $10^{21}$ bytes) to 44 zettabytes</span>, as reported by Forbes.  
We’ll also create **1.7 megabytes** of new information every **second** for every human being on the planet.
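
To get a feeling for these figures, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal units (1 GB = 1000 MB); the variable names are illustrative.

```python
# Figures quoted in the text above.
start_zb = 4.4        # zettabytes at the start of the period
end_zb = 44           # zettabytes projected for 2020
mb_per_second = 1.7   # new data per person per second

growth_factor = end_zb / start_zb

# Per-person volume per day, in decimal units (1 GB = 1000 MB).
seconds_per_day = 24 * 60 * 60
gb_per_person_per_day = mb_per_second * seconds_per_day / 1000

print(f"growth factor: {growth_factor:.0f}x")
print(f"per person per day: {gb_per_person_per_day:.1f} GB")
```

In other words, the quoted projection is a tenfold increase, and 1.7 MB per second adds up to roughly 147 GB per person per day.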

As we **amass more data**, the demand for advanced data mining and machine learning techniques will force the industry to evolve in order to keep up. 

We’ll likely see **more overlap between data mining and machine learning** as the two intersect to enhance the collection and usability of large amounts of data for analytics purposes.

## What Is Data Science?

It's a surprisingly **hard definition** to nail down.
<img src="figures/Data_Science_VD.png" width="40%" align="right" style="padding-left:10px;">

<span style="color:blue">Data science is</span> perhaps the best label we have  <span style="color:blue"> for the cross-disciplinary set of skills</span> that are becoming increasingly important in many applications across industry and academia.

This cross-disciplinary piece is key: the best existing definition of data science is illustrated by Drew Conway's Data Science Venn Diagram, first published on his blog in September 2010.

**Data science** <span style="color:blue">is fundamentally an **interdisciplinary** subject</span>.

Data science comprises three distinct and overlapping areas: 
* the <span style="color:blue">skills of a *statistician*</span> who knows how to model and summarize datasets (which are growing ever larger); 
* <span style="color:blue">the skills of a *computer scientist*</span> who can design and use algorithms to efficiently store, process, and visualize this data; 
* <span style="color:blue">the *domain expertise*</span> necessary both to formulate the right questions and to put their answers in context.

With this in mind, think of **data science** not as a new domain of knowledge to learn, but as a ``new set of skills that you can apply within your current area of expertise``.

The **goal of this course** is to give you the ``ability to ask and answer new questions about your chosen subject area``.

In a nutshell, **data science** is **more about data** than it is **about science**.

* If you have <span style="color:blue">data</span>, and 
* you have <span style="color:blue">curiosity</span>, and 
* you're working with data, and 
* you're <span style="color:blue">manipulating</span> it, 
* you're <span style="color:blue">exploring</span> it, the very exercise of going through <span style="color:blue">analyzing data</span>, trying to ``get some answers from it, is data science``.

## What Is Data Mining?

Data mining is the study of <span style="color:blue">collecting</span>, <span style="color:blue">cleaning</span>, <span style="color:blue">processing</span>, <span style="color:blue">analyzing</span>, and gaining useful <span style="color:blue">insights from data</span>. 

[comment]: <> (A wide variation exists in terms of the <span style="color:blue">problem domains</span>, applications, formulations, and data representations that are encountered in real applications. 
Therefore, “data mining” is a broad umbrella term that is used to describe these different aspects of data processing.)

**Google Search example**:  
A <span style="color:blue">good example</span> is Google Search.
Google published a paper reporting that it could <span style="color:blue">predict flu epidemics</span> before the Centers for Disease Control and Prevention (CDC).  
It did so by looking at what people were searching for on Google, in particular flu symptoms.  
Google saw the flu-symptom searches before anyone else did, so it was able to predict outbreaks early.  

**Financial interactions example**:
Most common transactions of everyday life, such as using an
automated teller machine (**ATM**) **card** or a **credit card**, can create data in an automated way.  
Such transactions can be mined for many useful insights such as <span style="color:blue">fraud</span> or other unusual activity.
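
As a minimal sketch of how unusual activity might be flagged, the code below scores each transaction by its distance from the median, measured in units of the median absolute deviation (the "modified z-score"). The amounts are invented, the function name `flag_unusual` is hypothetical, and the threshold of 3.5 is a common rule of thumb rather than anything prescribed by the course; real fraud detection uses far richer features.

```python
from statistics import median

# Invented card transactions for one customer, in euros.
amounts = [12.5, 40.0, 23.9, 18.0, 35.5, 27.0, 950.0, 31.2]


def flag_unusual(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The median and the median absolute deviation (MAD) are used because
    they are less distorted by outliers than the mean and the standard
    deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values equal the median: no usable scale
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


suspicious = flag_unusual(amounts)
print(suspicious)
```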

**User interactions**: Many forms of user interactions create large volumes of data.  
For example, <span style="color:blue">the use of a telephone</span> typically creates a record at the telecommunication company with details about the duration and destination of the call.  
Many phone companies routinely analyze such data to determine relevant patterns of behavior that can be used to <span style="color:blue">make decisions about network capacity, promotions, pricing, or customer targeting</span>.
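
The kind of analysis described here can be illustrated with a toy aggregation. The call records below are invented and the field layout (hour, destination, duration) is an assumption made for the example; the sketch simply sums call traffic per hour to find the busiest one.

```python
from collections import defaultdict

# Invented call detail records: (hour of day, destination, duration in seconds).
call_records = [
    (9, "Milan", 120),
    (9, "Rome", 300),
    (13, "Milan", 60),
    (18, "Varese", 600),
    (18, "Milan", 420),
    (18, "Rome", 180),
]

# Total traffic (seconds of calls) for each hour of the day.
seconds_per_hour = defaultdict(int)
for hour, _destination, duration in call_records:
    seconds_per_hour[hour] += duration

# The busiest hour hints at where network capacity is most needed.
busiest_hour = max(seconds_per_hour, key=seconds_per_hour.get)
print(dict(seconds_per_hour), busiest_hour)
```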

**Sensor technologies and the Internet of Things**: A recent trend is the development of **low-cost wearable sensors**, **smartphones**, and other smart devices that can communicate with one another.  
The implications of such massive data collection are significant for mining algorithms.

## Data Science vs Data Mining 
* <img src="figures/datamining-vs-datascience.webp" width="50%" align="right" style="padding-left:10px;"> **Data mining** refers to the science of **collecting** all the past **data** and then **searching for patterns** in this data.  
* You look for consistent patterns and / or **relationships between variables**. 
* Once you find these insights, you **validate the findings** by applying the detected patterns to new subsets of data. 
* The ultimate goal of data mining is <span style="color:red">prediction</span>.
* Some activities under **Data Mining** can **intersect** with **Data Science** such as 
  * <span style="color:red">statistical analysis</span>, 
  * writing data flows and <span style="color:red">pattern recognition</span>. 

Hence, Data Mining becomes a subset of Data Science. 
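
The validation step listed above (applying detected patterns to new subsets of data) can be sketched as follows. Everything here is an illustrative assumption: the records are synthetic, the "pattern" is a simple threshold rule, and the 75/25 split is arbitrary. The point is only that the rule is chosen on one subset and then checked on data it has never seen.

```python
import random

# Synthetic records: (monthly visits, bought_premium). Purely illustrative.
random.seed(0)
records = [(v, v >= 12) for v in range(1, 25)]
random.shuffle(records)

# Hold out part of the data: patterns are searched for in `train` only.
split = int(0.75 * len(records))
train, holdout = records[:split], records[split:]


def accuracy(threshold, data):
    """Share of records where the rule 'visits >= threshold' matches the label."""
    hits = sum((visits >= threshold) == label for visits, label in data)
    return hits / len(data)


# "Mining" step: pick the threshold that best explains the training subset.
best_threshold = max(range(1, 25), key=lambda t: accuracy(t, train))

# Validation step: apply the detected pattern to the held-out subset.
print(best_threshold, accuracy(best_threshold, holdout))
```

If the held-out accuracy is much lower than the training accuracy, the pattern found is probably an artifact of the particular training subset.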



* **Data Science** is an umbrella that contains many other fields like <span style="color:red">Machine learning</span>, <span style="color:red">Data Mining</span>, <span style="color:red">Big Data</span>, <span style="color:red">statistics</span>, Data visualization, data analytics,…
* Data science is the process of **using data to understand** different things, to understand the world.
* <span style="color:blue">Data science is the art of uncovering the insights and trends that are hiding behind data.</span>
* Data science is the study of data, just as the biological sciences are the study of biology and the physical sciences are the study of physical reactions.

Thus, Data Science and Data Mining are different terms & techniques that are used for data processing.

##  [Data Mining](https://www.uninsubria.it/ugov/degreecourse/107866) Course Program

0. Introduction to Data Mining
1. Introduction to the Python language and some of its libraries, as tools for trying out first-hand what was seen in class.
    1. IPython and Jupyter
    2. Python, NumPy, Pandas, Matplotlib
2. Machine Learning
    1. Introduction to Machine Learning
    2. Scikit-Learn
    3. Association Rules, Decision Trees, Regression, Ensemble methods, Deep learning
3. Problems and methods of learning structured information
    1. collaborative filtering, ranking, etc.
4. Data mining problems with large data that can be solved with deep learning algorithms.

The various topics covered will be accompanied by practical examples and the Python code necessary for their solution.

## Prerequisites

- Basic contents of the ``Intelligent Systems`` course delivered to the first year of the MSc program.

- The course is recommended for those who already know at least one ``programming or scripting language``.

- Students are advised to get a ``laptop`` (Windows, Mac or Linux) that runs the Python interpreter and IPython.

## Exam

The exam consists of 
* a ``project`` <br>
  The project is proposed by the student based on their interests.<br>
  In the absence of specific proposals, the project is proposed by the teacher.
* an ``oral interview``<br>
  The oral test consists of an interview whose first question is always the discussion of the results obtained in the project.<br>
  During the oral examination the student must show understanding of the methods covered in class, their advantages and their disadvantages.

With the ``project``, students are typically called to **implement simple methods** of experimental investigation on data made available to them by web sites and / or other **benchmarking data** available on online repositories.  
These investigations are aimed at ascertaining the students' ability to adapt the studied methods to the real cases, possibly understanding their specificities.
The project must be accompanied by a short ``report`` describing the contents and the results obtained.

The **project** is passed (and gives access to the subsequent ``oral`` exam) if it receives a score of at least **18/30**.

The overall exam is passed with a final grade of at least **18/30**.
The project grade contributes significantly to the final grade.

## Why Python?

Python has emerged over the last couple decades as a first-class tool ``for scientific computing tasks``, including the analysis and visualization of large datasets.

This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.

The usefulness of Python for data science stems primarily from the ``large and active ecosystem of third-party packages``: **NumPy** for manipulation of homogeneous array-based data, **Pandas** for manipulation of heterogeneous and labeled data, **SciPy** for common scientific computing tasks, **Matplotlib** for publication-quality visualizations, **IPython** for interactive execution and sharing of code, **Scikit-Learn** for machine learning, and many more tools that will be mentioned in the following pages.

<img src="figures/00.00-python4datascience.png" width="50%">


I’ve discovered, in the classes I teach, that ``programmers generally grasp principles more readily by seeing simple code illustrations than by looking at math``.<br>

Among general-purpose programming languages, Python developers have been in the forefront, building state-of-the-art machine learning tools, but <span style="color:red">there is a gap between having the tools and being able to use them efficiently</span>.

### Practice 
* Print "Hello world!" using Python 3.

```python
# write your code here
print("Hello world!")
```

```
Hello world!
```


### Python 2 vs Python 3

* We will use the **syntax of Python 3**, which contains language enhancements that are not compatible with the 2.x series of Python.
* Though Python 3.0 was first released in 2008, adoption has been relatively slow, particularly in the scientific and web development communities.
* This is primarily because it took some time for many of the essential third-party packages and toolkits to be made compatible with the new language internals.
* **Since 2014**, however, stable releases of **the most important tools** in the data science ecosystem have been **fully compatible** with both Python 2 and 3.
* The vast majority of code snippets in this course will also work without modification in Python 2; where Py2-incompatible syntax is used, I will make every effort to note it explicitly.
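
A few of the best-known incompatibilities are shown below, as a short illustration rather than a complete list. The snippet runs under Python 3; the comments describe how Python 2 would behave.

```python
# print is a function in Python 3 (in Python 2 it was a statement: print "hi").
print("Hello", "world", sep=", ")

# "/" is always true division in Python 3; "//" is floor division.
half = 3 / 2         # 1.5 in Python 3 (Python 2 gives 1 for two integers)
floor_half = 3 // 2  # 1 in both versions

# Strings are Unicode text by default; bytes are a separate type.
text = "dati"
raw = text.encode("utf-8")

print(half, floor_half, type(text).__name__, type(raw).__name__)
```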

## Why IPython?
* <img src="figures/Jupyter_logo.png" width="20%" align="right" style="padding-left:10px;"> One of the most significant advances in the scientific computing arena is underway with the explosion of interest in <span style="color:red">Jupyter</span> (formerly, IPython) Notebook technology.  
* The scientific publication *Nature* recently featured 
[an article on the benefits of Jupyter Notebooks](http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261)
for scientific research. 
* There are now <span style="color:red">Jupyter Notebooks on numerous topics in **many** scientific disciplines</span>.  
Here are a few examples of IPython Notebooks for science:
  * [Machine Learning](https://nbviewer.jupyter.org/github/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%2520Machine%2520Learning%2520Notebook.ipynb)
  * [Computer Vision](http://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/Labeled%2520Faces%2520in%2520the%2520Wild%2520recognition.ipynb)
  * [Satellite Imagery Analysis](http://unidata.github.io/python-gallery/examples/Satellite_Example.html)
...

The reason for Jupyter’s immense success is that it excels in a form of programming called  
"``literate programming``".  
* Literate programming is a **software development style** pioneered by Stanford computer scientist Donald Knuth. 
* This type of programming emphasizes a **prose-first approach** where exposition with **human-friendly text** is punctuated with **code blocks**. 
* It is well suited to demonstration, research, and teaching, especially in science. 
* Literate programming allows users to 
  * formulate and <span style="color:red">describe their thoughts</span> with prose,
  * supplemented by <span style="color:red">mathematical equations</span>, 
  * as they prepare to <span style="color:red">write code blocks</span>. 
  
This mindset is the opposite of how we usually think about code. 
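
As a small invented illustration of the style, a notebook might first state an idea in prose, then give the formula, and only then show the code. For instance: the sample mean of $n$ observations $x_1, \dots, x_n$ is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and the code cell that follows implements exactly that expression.

```python
# Code cell that follows the prose and the equation above.
observations = [2.0, 4.0, 4.0, 5.0, 10.0]  # invented values

n = len(observations)
sample_mean = sum(observations) / n  # x-bar = (1/n) * sum of the x_i

print(sample_mean)
```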

([continue to read](https://unidata.github.io/online-python-training/introduction.html)...)

### Data Mining for computer science students

<img src="figures/es-slide-nn-4-computer-science.png" width="70%">



### Data Mining for maths students

<img src="figures/es-slide-nn-4-maths.png" width="70%">



### Data Mining for statistics students

<img src="figures/es-slide-nn-4-statistic.png" width="70%">

## Installation Considerations

- [Python 3 Installation & Setup Guide](https://realpython.com/installing-python/)
- [Installing the Jupyter Notebook](http://jupyter.org/install)

Installing Python and the suite of libraries that enable scientific computing is straightforward.

Though there are various ways to install Python, the one I would suggest for use in data science is the **Anaconda distribution**, which works similarly whether you use Windows, Linux, or Mac OS X.
The Anaconda distribution comes in two flavors:
- [Miniconda](http://conda.pydata.org/miniconda.html) gives you the Python interpreter itself, along with a command-line tool called ``conda`` which operates as a cross-platform package manager geared toward Python packages, similar in spirit to the apt or yum tools that Linux users might be familiar with.
- [Anaconda](https://www.continuum.io/downloads) includes both Python and conda, and additionally bundles a suite of other pre-installed packages geared toward scientific computing. Because of the size of this bundle, expect the installation to consume several gigabytes of disk space.



Any of the packages included with Anaconda can also be installed manually on top of Miniconda; for this reason I suggest starting with Miniconda.

To get started, download and install the Miniconda package (make sure to choose a version with Python 3), and then install the core packages used in this course:

```
[~]$ conda install numpy pandas scikit-learn matplotlib seaborn jupyter
```
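
Once the command above has finished, a quick way to confirm the setup is to ask Python which of those packages it can find. The snippet below is a minimal sketch using only the standard library; the list of import names is an assumption based on the command above (note that scikit-learn is imported as `sklearn`), and Jupyter itself is left out because it is normally launched from the command line.

```python
import sys
from importlib.util import find_spec

# Import names of the libraries installed by the conda command above.
# scikit-learn is imported under the name "sklearn".
required = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]

print("Python", sys.version.split()[0])
assert sys.version_info.major == 3, "this course uses Python 3 syntax"

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if find_spec(name) is None]
print("missing packages:", missing or "none")
```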

We will also make use of other more specialized tools in Python's scientific ecosystem;  
installation is usually as easy as typing 
```
[~]$ conda install packagename
```
For more information on conda, including information about creating and using conda environments (which I would *highly* recommend), refer to [conda's online documentation](http://conda.pydata.org/docs/).