# Machine Learning (ML) Basics
## Dataset
Data is an essential component of any ML model. In ML, a dataset is defined as “a collection of data that is treated as a single unit by a computer”. A dataset therefore contains many separate pieces of data, yet it is used as a whole to teach a machine learning algorithm to find predictable patterns.

**Splitting the Data: Training, Testing, and Validation Datasets in ML** - A dataset is usually used for more than training. Once the data has been processed, it is typically split into several subsets so that you can check how well the model was trained. A testing dataset is held out from the data and used only to measure performance on examples the model has not seen. A validation dataset, while not strictly required, is helpful for tuning and comparing models during development, so that the model is not evaluated on the same data it was trained on and the resulting performance estimate is not biased.
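
The split described above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration rather than a prescribed recipe: the toy arrays, the 60/20/20 proportions, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 2 features, binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% of the samples as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remaining 80 samples into training and validation sets:
# 25% of 80 is 20 samples for validation, leaving 60 for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Splitting in two steps keeps the test set untouched while the training and validation sets are used during model development.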

## Simple data


In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data_origin = [[30, 100], [20, 50], [35, np.nan],
               [25, 80], [30, 70], [40, 60]]
data_origin

[[30, 100], [20, 50], [35, nan], [25, 80], [30, 70], [40, 60]]

In [4]:
# imputation: replace missing values with the column mean
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit(data_origin)

data_mean_imp = imp_mean.transform(data_origin)
print(data_mean_imp)

[[ 30. 100.]
 [ 20.  50.]
 [ 35.  72.]
 [ 25.  80.]
 [ 30.  70.]
 [ 40.  60.]]


In [5]:
# imputation: apply the fitted imputer to new data
new = [[20, np.nan],
       [30, np.nan],
       [np.nan, 70],
       [np.nan, np.nan]]

new_mean_imp = imp_mean.transform(new)
print(new_mean_imp)

[[20. 72.]
 [30. 72.]
 [30. 70.]
 [30. 72.]]
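
The outputs above follow from how `SimpleImputer(strategy='mean')` works: `fit` stores the mean of the non-missing values in each column, and `transform` writes those stored means into every missing slot, including in new data. A minimal sketch that checks this against NumPy's `nanmean`, reusing the same toy data (the names `col_means` and `filled` are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

data_origin = [[30, 100], [20, 50], [35, np.nan],
               [25, 80], [30, 70], [40, 60]]

imp_mean = SimpleImputer(strategy='mean').fit(data_origin)

# Column means computed while ignoring missing values.
col_means = np.nanmean(np.array(data_origin, dtype=float), axis=0)

print(imp_mean.statistics_)  # means learned by fit
print(col_means)             # same values computed with NumPy

# A row with every value missing is filled entirely with the learned means.
filled = imp_mean.transform([[np.nan, np.nan]])
print(filled)
```

Because the means are learned once in `fit`, new data is always filled with statistics from the original data, not from the new rows themselves.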


## Load dataset
One of the most well-known repositories for machine learning datasets is the [UCI Machine Learning Repository](https://archive.ics.uci.edu/). Most of its datasets are small because the technology available when they were collected could not handle larger data. Newer datasets are usually larger; for example, the ImageNet dataset is over 160 GB. Such datasets are commonly found on [Kaggle](https://www.kaggle.com/datasets), where you can search for them by name and download them. [OpenML](https://www.openml.org/) is another repository that hosts many datasets. It is convenient because, in addition to searching by name, it offers a standardized web API for retrieving data.

The repositories described above are a good place to start, but you can also obtain datasets by downloading them from the web with a browser, a command-line tool such as wget, or a network library such as `requests` in Python. Because some datasets have become standards or benchmarks, many machine learning libraries provide functions to help you retrieve them, and *scikit-learn* is one example whose API can download datasets. For practical reasons, the datasets often don't ship with the library and are downloaded when you call the function, so you'll need a stable internet connection to use them. In scikit-learn, the related functions are defined under [`sklearn.datasets`](https://scikit-learn.org/stable/api/sklearn.datasets.html), where you can see the full list. For example, the California housing dataset can be loaded as follows:

In [6]:
import sklearn.datasets

# load california housing dataset (regression)
data = sklearn.datasets.fetch_california_housing(return_X_y=False, as_frame=True)
data = data["frame"]
print(data)

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
...       ...       ...       ...        ...         ...       ...       ...   
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   

       Longitude  MedHouseVal  
0      
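
The functions under `sklearn.datasets` follow a naming convention: most `load_*` functions read small datasets bundled with the library (a few, such as `load_files` and `load_svmlight_file`, read files you supply), while `fetch_*` functions download data on first use. As a rough sketch, the available names can be listed by inspecting the module, and a bundled dataset such as iris can be loaded without a network connection:

```python
import sklearn.datasets

# Collect public loader names by prefix.
names = dir(sklearn.datasets)
bundled = [n for n in names if n.startswith("load_")]
downloaded = [n for n in names if n.startswith("fetch_")]
print("load_* functions:", bundled)
print("fetch_* functions:", downloaded)

# load_iris reads a file shipped with scikit-learn, so no download is needed.
iris = sklearn.datasets.load_iris(as_frame=True)
print(iris["frame"].shape)  # (150, 5): 4 features plus the target column
```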

A convenient feature of scikit-learn is its ability to read any dataset from OpenML. The following reads a dataset and runs logistic regression on it:

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml

#data = fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)
data = fetch_openml("titanic", version=6, as_frame=True, return_X_y=False)
print(data["frame"])

x, y = fetch_openml("titanic", version=6, as_frame=False, return_X_y=True)
clf = LogisticRegression(random_state=0).fit(x, y)
print("Accuracy: {:,}".format(clf.score(x, y)))   # accuracy on the training data
print("Coefficient:", clf.coef_)                  # coefficient in logistic regression

     Survived  Pclass  Sex  Age  Fare  Embarked  relatives  Title
0           0       3    0    2     0         0          1      1
1           1       1    1    5     3         1          1      3
2           1       3    1    3     0         0          0      2
3           1       1    1    5     3         0          1      3
4           0       3    0    5     1         0          0      1
..        ...     ...  ...  ...   ...       ...        ...    ...
886         0       2    0    3     1         0          0      5
887         1       1    1    2     2         0          0      2
888         0       3    1    1     2         0          3      2
889         1       1    0    3     2         1          0      1
890         0       3    0    4     0         2          0      1

[891 rows x 8 columns]
Accuracy: 0.8114478114478114
Coefficient: [[-0.75486619  2.23758245 -0.20788379  0.2806995   0.24340651 -0.36705508
   0.47881365]]
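
Note that the accuracy printed above is computed on the same data the model was fitted on. To connect this with the training/testing split discussed earlier, here is a hedged sketch that scores a logistic regression on held-out data instead. It uses synthetic data from `make_classification` so that it runs offline; it is not the Titanic dataset, and the sample count, feature count, and split ratio are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumed for illustration only).
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# Hold out 25% of the samples; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_train, y_train)

print("Training accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))
```

The test accuracy is the more honest estimate of how the model would perform on unseen data.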


# References
- [Kaggle](https://www.kaggle.com/datasets)
- [OpenML](https://www.openml.org/)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/)