# NumPy Primer - 3
- Importing datasets
    - loadtxt
    - genfromtxt
- Processing data
    - Splitting data
- Summarizing data

In [None]:
import numpy as np

## 1. Importing Datasets
- Datasets in text files (```.txt, .csv```, etc.) can be imported using ```loadtxt()``` or ```genfromtxt()```

### loadtxt()
- ```loadtxt()``` is used when the dataset has no missing values
- Hence, every row must contain the same number of values
- Key parameters
    - ```fname```: designates name of text file to be imported
    - ```dtype```: data type of array to be created as result (default: ```float```)
    - ```delimiter```: string used to separate values (default: whitespace)
    - ```skiprows```: skip first *n* rows (default: 0)
    - ```usecols```: column indices to be read (default: ```None``` => all cols are used)
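
The parameters above can be tried without any file on disk. The following is a minimal sketch that assumes a hypothetical four-column, whitespace-separated file with one header line, emulated here with ```io.StringIO```; the contents are made up for illustration.

```python
import io
import numpy as np

# Hypothetical in-memory "file": one header line, then three rows of four values
text = io.StringIO("a b c d\n2 4 6 8\n10 12 14 16\n18 20 22 24\n")

# Skip the header, read values as integers, keep only columns 1 and 3
demo = np.loadtxt(text, dtype=int, skiprows=1, usecols=(1, 3))
print(demo)
print(demo.shape)
```

Since two columns are selected, the result is a 2-D array with one row per data line.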

In [None]:
# importing dataset with default settings
data = np.loadtxt('even_numbers.txt')
print(data)

In [None]:
# importing dataset with parameter settings
# skiprows is especially useful when you want to get rid of "header" of dataset 
data = np.loadtxt('even_numbers.txt', dtype = int, skiprows = 1, usecols = (1,3))
print(data)

In [None]:
# importing dataset with parameter settings
data = np.loadtxt('even_numbers.txt', dtype = np.str_, usecols = 0)
print(data)        # note that resulting array is 1-D

In [None]:
# .csv files can be imported as well
# in this case, delimiter should be set to ','
data = np.loadtxt('glass.csv', delimiter = ',')
print(data.shape)        # dataset with 214 rows & 11 columns
print(data.dtype)        # dtype is float64 as default

### genfromtxt()
- ```genfromtxt()``` is used when dataset has some missing values
- Otherwise, it is largely identical to ```loadtxt()```
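
As a quick illustration of the difference, the sketch below reads a small comma-separated dataset with one empty field. The contents are hypothetical and held in memory with ```io.StringIO``` rather than read from a file.

```python
import io
import numpy as np

# Hypothetical file contents: the middle value of the second row is missing
csv_text = "1,3,5\n7,,11\n13,15,17\n"

# Default behaviour: the missing entry becomes nan
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(raw)

# Same contents, but the missing entry is replaced with -1
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=",", filling_values=-1)
print(filled)
```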

In [None]:
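# when whitespace is used as delimiter, a missing value leaves a row with fewer columns
# invalid_raise = False skips such rows (with a warning) instead of raising an error
data = np.genfromtxt('odd_numbers.txt', invalid_raise = False)
print(data)      # the last row is dropped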

In [None]:
# when ',' is used as delimiter
data = np.genfromtxt('odd_numbers.csv', delimiter = ',')
print(data)      # the missing last element is set to nan

In [None]:
# the value used to fill missing elements can be designated
data = np.genfromtxt('odd_numbers.csv', delimiter = ',', filling_values = 100)
print(data)       # missing value is filled with 100.

In [None]:
# if a specific string marks missing values ('?' here)
data = np.genfromtxt('odd_numbers_2.csv', delimiter = ',', missing_values = '?', filling_values = 99)
print(data)       # missing value is filled with 99.

### Exercise 3-1.
- Import ```highway.csv``` using ```genfromtxt()```
    - Set ```dtype``` as string
    - Replace missing values (```'?'```) with ```'Unknown'```
    - Print first three rows of resulting array

In [None]:
### Your answer
data = np.genfromtxt('highway.csv', dtype = np.str_, delimiter = ',')
data[data == '?'] = 'Unknown'   # use boolean indexing 
print(data[:3])

## 2. Processing data
- Processing an imported dataset is an essential task in data mining
- Many of the techniques covered so far are used

### Splitting data
- In most cases, a dataset is first split into ```X data``` (input variables) and ```Y data``` (output variables), or into other groups of variables
    - Here, the dataset is split *column-wise*
- Then, the dataset is split into ```training data``` and ```validation/test data```, or cross-validated
    - Here, the dataset is split *row-wise*
- In either case, array indexing & slicing are utilized
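
The two kinds of split can be sketched on a small array first. The array below is hypothetical: 6 rows and 4 columns, where column 0 plays the role of an ID, columns 1-2 are inputs, and the last column is the output.

```python
import numpy as np

# Hypothetical dataset: 6 rows, columns are [ID, x1, x2, y]
toy = np.arange(24).reshape(6, 4)

# Column-wise split: drop the ID column, separate inputs from the output
X = toy[:, 1:-1]     # columns 1 and 2
y = toy[:, -1]       # last column only, so the result is 1-D

# Row-wise split: first 4 rows for training, remaining 2 rows for testing
X_train, X_test = X[:4], X[4:]
y_train, y_test = y[:4], y[4:]

print(X.shape, y.shape)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```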

**glass dataset**
- source: https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.names
- Number of instances (# rows): 214
- Number of attributes (# columns): 11 (10 attributes including ID number, plus the class attribute)
    - ID number: 1 to 214
    - RI: refractive index
    - Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
    - Mg: Magnesium
    - Al: Aluminum
    - Si: Silicon
    - K: Potassium
    - Ca: Calcium
    - Ba: Barium
    - Fe: Iron
    - Type of glass: (class attribute)
        - 1: building_windows_float_processed
        - 2: building_windows_non_float_processed
        - 3: vehicle_windows_float_processed
        - 4: vehicle_windows_non_float_processed (none in this database)
        - 5: containers
        - 6: tableware
        - 7: headlamps

In [None]:
# import dataset
data = np.loadtxt('glass.csv', delimiter = ',')
print(data.shape)
print(data[0])

In [None]:
# excluding ID number variable
data_1 = data[:, 1:]
print(data_1.shape)        # 10 columns now

In [None]:
# selecting only X data (excluding ID number & Type of glass variables)
X_data = data[:, 1:-1]
print(X_data.shape)        # 9 columns now

In [None]:
# selecting only RI, Na, Mg variables
X_partial = data[:, 1:4]
print(X_partial.shape)

In [None]:
# selecting only Y data
Y_data = data[:, -1]
print(Y_data.shape)

In [None]:
# splitting data into train-test set
# first 150 data instances into train set, others into test set
X_train = X_data[:150,:]
X_test = X_data[150:, :]
Y_train = Y_data[:150]
Y_test = Y_data[150:]

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

In [None]:
# randomly performing train-test split
data_shuffled = data.copy()         # copy data so that the original array is left untouched
np.random.shuffle(data_shuffled)    # shuffle rows of the copy in place

X_data = data_shuffled[:, 1:-1]
Y_data = data_shuffled[:, -1]

X_train = X_data[:150,:]
X_test = X_data[150:, :]
Y_train = Y_data[:150]
Y_test = Y_data[150:]

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
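
Note that ```np.random.shuffle()``` reorders the rows of its argument in place. As an alternative sketch, the rows can be reordered through a random permutation of row indices, which leaves the original array unchanged. The array, the seed, and the 7/3 split below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily, for reproducibility
toy = np.arange(40).reshape(10, 4)      # hypothetical dataset: 10 rows, 4 columns

# Random ordering of the row indices 0..9
idx = rng.permutation(len(toy))
shuffled = toy[idx]                     # integer-array indexing returns a new array

X_toy = shuffled[:, :-1]
y_toy = shuffled[:, -1]

# First 7 shuffled rows for training, remaining 3 for testing
X_tr, X_te = X_toy[:7], X_toy[7:]
y_tr, y_te = y_toy[:7], y_toy[7:]

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```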

### Exercise 3-2.
- Import ```dermatology.csv``` using ```genfromtxt()``` function
    - Replace missing values ('?') with 0
- Assign data in first 34 columns into X_data
- Assign data in last column into Y_data
- Perform train-test split
    - Randomly assign half of the rows to the train set and the other half to the test set

In [None]:
## Your answer
data = np.genfromtxt('dermatology.csv', delimiter = ',', missing_values = '?', filling_values = 0)
#print(data.shape)

np.random.shuffle(data)         # shuffle rows in place
X_data = data[:, :-1]           # every column except the last one
Y_data = data[:, -1]            # last column

train_len = len(data)//2        # half of the rows go to the train set
X_train = X_data[:train_len]
X_test = X_data[train_len:]
Y_train = Y_data[:train_len]
Y_test = Y_data[train_len:]

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)