# 2. Data Mining
Gather and scrape the data necessary for the project.

**`Note: This notebook is only an example. Because the datasets (known.csv and unknown.csv) are small, it is not necessary to execute it. We can continue with notebook '3_data_cleaning'.`**

In general, data mining is the process by which organizations detect patterns in data to gain insights relevant to their business needs. It is essential for both business intelligence and data science. Organizations can use many data mining techniques to turn raw data into actionable insights. These range from cutting-edge artificial intelligence to the basics of data preparation, both of which are key to maximizing the value of data investments. Common techniques include:

 1. Data cleaning and preparation
 2. Tracking patterns
 3. Classification
 4. Association
 5. Outlier detection
 6. Clustering
 7. Regression
 8. Prediction
 9. Sequential patterns
10. Decision trees
11. Statistical techniques
12. Visualization
13. Neural networks
14. Data warehousing
15. Long-term memory processing
16. Machine learning and artificial intelligence


In this case, we have received two datasets from one or more unknown sources.

Datasets:
1. known.csv: a dataset containing 15,000 data samples, each with longitude, latitude, six input variables (f1 to f6), and one target variable (t1).
2. unknown.csv: a dataset containing 767 data samples in a similar format, where the target values are unknown.

## Import Libraries

In [1]:
# Basic import(s)
import pandas as pd

# Helper script(s)
from scripts.helper import reduce_mem_usage

# Hide all warnings in ipython
import warnings
warnings.filterwarnings('ignore')

## Import Datasets

In [2]:
# Loading of the known.csv dataset via pandas.
known_dataset = pd.read_csv('data/known.csv', sep=';', na_filter=False)  # Fields are separated by semicolons; na_filter=False skips missing-value detection.

In [3]:
# Loading of the unknown.csv dataset via pandas.
unknown_dataset = pd.read_csv('data/unknown.csv', sep=';', na_filter=False)  # Fields are separated by semicolons; na_filter=False skips missing-value detection.
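
To see what the two `read_csv` arguments used above do, here is a minimal sketch on an in-memory sample. The sample text and the names `sample`, `with_filter` and `without_filter` are made up for this illustration and are not part of the project data. With the default setting, an empty field is read as a missing value; with `na_filter=False`, it is kept as an empty string.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample with one empty field in column f1.
sample = "index;f1;t1\n0;11;120800\n1;;312000\n"

# Default behaviour: empty fields are detected and stored as NaN.
with_filter = pd.read_csv(io.StringIO(sample), sep=";")

# na_filter=False: no missing-value detection, the empty field stays a string.
without_filter = pd.read_csv(io.StringIO(sample), sep=";", na_filter=False)

print(with_filter["f1"].isna().sum())
print(without_filter["f1"].tolist())
```

Skipping the missing-value detection can speed up loading, but a column that contains empty fields is then read as text rather than as numbers, which is worth keeping in mind before converting data types.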

In [4]:
# Show head for known_dataset.
# We want to make sure that the known.csv dataset has been read in correctly.
known_dataset.head()

Unnamed: 0,"index,",longitude,latitude,f1,f2,f3,f4,f5,f6,t1
0,0,1425,5217,11,2403,890,344,3,497,120800
1,1,1411,522,15,5644,2659,783,67559,757,312000
2,2,1426,521,52,2084,1438,516,23087,550,258600
3,3,1147,5309,32,3011,1287,525,50605,529,311000
4,4,1146,5303,33,2824,1797,493,36359,523,135100


In [5]:
# Show head for unknown_dataset.
# We want to make sure that the unknown.csv dataset has been read in correctly.
unknown_dataset.head()

Unnamed: 0,index,longitude,latitude,f1,f2,f3,f4,f5,f6,t1
0,0,1328,5263,34,3850,1619,602,50465,608,0
1,1,1331,5245,21,5041,2719,1420,35335,1491,0
2,2,1339,524,52,1509,674,244,49306,225,0
3,3,1346,5259,42,1291,1535,332,19083,345,0
4,4,1322,5251,27,4742,1682,696,6194,775,0


## Reduce dataset size
Datasets are often very large, up to several gigabytes. Converting each column to the smallest data type that can still hold its values can substantially reduce the memory the dataset occupies.
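
The helper `reduce_mem_usage` is imported from `scripts.helper`, and its source is not shown in this notebook. As a rough illustration of the idea only, the following sketch downcasts integer columns with `pd.to_numeric`. The function name `downcast_integers` and the sample values are assumptions made for this example, not the helper's actual implementation.

```python
import numpy as np
import pandas as pd


def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each integer column uses the smallest fitting integer dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Unsigned types offer a larger positive range when there are no negative values.
        target = "unsigned" if out[col].min() >= 0 else "integer"
        out[col] = pd.to_numeric(out[col], downcast=target)
    return out


# Hypothetical sample with the default int64 dtype.
demo = pd.DataFrame({
    "f1": np.array([11, 15, 52], dtype="int64"),
    "t1": np.array([120800, 312000, 258600], dtype="int64"),
})

small = downcast_integers(demo)
print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Downcasting is only safe as long as later values also fit into the smaller type, so it is usually applied after the full dataset has been loaded.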

In [6]:
# Reduce the size of known_dataset by changing the data types.
known_dataset, NAlist = reduce_mem_usage(known_dataset)

Memory usage of properties dataframe is : 1.14453125  MB
******************************
Column:  index,
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  f1
dtype before:  int64
dtype after:  uint8
******************************
******************************
Column:  f3
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  f4
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  t1
dtype before:  int64
dtype after:  uint32
******************************
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  0.7296829223632812  MB
This is  63.753866254266214 % of the initial size


In [7]:
# Reduce the size of unknown_dataset by changing the data types.
unknown_dataset, NAlist = reduce_mem_usage(unknown_dataset)

Memory usage of properties dataframe is : 0.0586395263671875  MB
******************************
Column:  index
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  f1
dtype before:  int64
dtype after:  uint8
******************************
******************************
Column:  f2
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  f3
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  f4
dtype before:  int64
dtype after:  uint16
******************************
******************************
Column:  t1
dtype before:  int64
dtype after:  uint8
******************************
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  0.030843734741210938  MB
This is  52.59888108248764 % of the initial size


## Export Compressed Datasets

In [8]:
# Save the compressed known dataset to a new .csv file.
path = 'data/compressed_known.csv'
known_dataset.to_csv(path, index=False)  # Do not write the DataFrame index to the .csv file.

In [9]:
# Save the compressed unknown dataset to a new .csv file.
path = 'data/compressed_unknown.csv'
unknown_dataset.to_csv(path, index=False)  # Do not write the DataFrame index to the .csv file.
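
A CSV file is plain text and does not store column data types, so the smaller types are not carried over automatically when the file is read again. The following sketch shows one possible way to restore them by passing a `dtype` mapping to `pd.read_csv`. The sample values, the in-memory buffer, and the mapping are assumptions for this illustration, not a step of the original notebook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame that already uses compact dtypes.
compact = pd.DataFrame({
    "f1": np.array([11, 15], dtype="uint8"),
    "t1": np.array([120800, 312000], dtype="uint32"),
})

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
compact.to_csv(buffer, index=False)

# Reading without further arguments lets pandas infer the dtypes again.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing an explicit mapping restores the compact dtypes on load.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"f1": "uint8", "t1": "uint32"})

print(plain.dtypes)
print(typed.dtypes)
```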