<b>OUTLIER DETECTION AND HANDLING</b> https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

## Feature Engineering

### Tasks

Feature Extraction & Feature Engineering

Feature Transformation

Feature Selection

### Feature Extraction

#### Text

Text analysis is a major application field for machine learning algorithms. However, raw text, which is a sequence of symbols, cannot be fed directly to the algorithms, as most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length.
    
The general process of turning a collection of text documents into numerical feature vectors is called vectorization. The specific strategy of tokenization, counting and normalization is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences, while the relative position of the words in the document is completely ignored.
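To make the tokenize-and-count idea concrete, here is a minimal hand-rolled sketch in plain Python. It assumes whitespace tokenization and lowercasing, which is a simplification of what a real vectorizer does, and the names `docs`, `tokenized`, `vocabulary` and `counts` are illustrative only.

```python
from collections import Counter

docs = ["it was the best of times", "it was the worst of times"]

# Tokenize: lowercase and split on whitespace (a simplification).
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct token, sorted so each one gets a stable column index.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Count: one row per document, one column per vocabulary entry.
counts = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    counts.append([token_counts[term] for term in vocabulary])

print(vocabulary)
print(counts)
```

Word order is lost: each row only records how often every vocabulary term occurs in that document.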

We will use CountVectorizer to "convert text into a matrix of token counts".

BAG OF WORDS:<br />
https://machinelearningmastery.com/gentle-introduction-bag-words-model/

TF-IDF:<br />
https://www.commonlounge.com/discussion/99e86c9c15bb4d23a30b111b23e7b7b1

CODE EXAMPLE:<br />
https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

lst_text=['it was the best of times','it was the worst of times',\
          'it was the age of wisdom','it was the age of foolishness']

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vocab = CountVectorizer()

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
dtm = vocab.fit_transform(lst_text)

# Equivalent two-step form:
# vocab.fit(lst_text)
# dtm = vocab.transform(lst_text)

In [2]:
vocab.vocabulary_

{'it': 3,
 'was': 7,
 'the': 5,
 'best': 1,
 'of': 4,
 'times': 6,
 'worst': 9,
 'age': 0,
 'wisdom': 8,
 'foolishness': 2}

In [4]:
print(type(dtm))

print(dtm.shape)

# print(dtm)

print(dtm.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
(4, 10)
[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]


In [16]:
# Unigrams and bigrams (1-grams and 2-grams)

vocab = CountVectorizer(ngram_range=(1, 2))

dtm = vocab.fit_transform(lst_text)

print(vocab.vocabulary_)

print(dtm.toarray()) # convert sparse matrix to nparray

{'it': 5, 'was': 16, 'the': 11, 'best': 2, 'of': 7, 'times': 15, 'it was': 6, 'was the': 17, 'the best': 13, 'best of': 3, 'of times': 9, 'worst': 19, 'the worst': 14, 'worst of': 20, 'age': 0, 'wisdom': 18, 'the age': 12, 'age of': 1, 'of wisdom': 10, 'foolishness': 4, 'of foolishness': 8}
[[0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0]
 [0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1]
 [1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0]
 [1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0]]


<b>Summary:</b>
<ul>
    <li> <code>vocab.fit(lst_text)</code> <b>learns the vocabulary</b></li>
    <li> <code>vocab.transform(lst_text)</code> <b>uses the fitted vocabulary</b> to build a <b>document-term matrix</b></li>
</ul>
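The split between fitting and transforming matters most when new documents arrive. The sketch below fits on two sentences and then transforms an unseen one. `train_docs` and `new_docs` are made-up examples; the point is that words missing from the fitted vocabulary are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["it was the best of times", "it was the worst of times"]
new_docs = ["it was the season of light"]  # "season" and "light" were never seen

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                  # learn the vocabulary only
new_dtm = vectorizer.transform(new_docs)    # reuse it for unseen text

print(sorted(vectorizer.vocabulary_))
print(new_dtm.toarray())
```

The resulting row has one column per fitted vocabulary term, so the matrix shape stays compatible with whatever model was trained on the original document-term matrix.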

In [20]:
# TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
lst_text=['It was the best of times','it was the worst of times',\
          'it was the age of wisdom','it was the age of foolishness']

vectorizer = TfidfVectorizer()

dtm = vectorizer.fit_transform(lst_text)

print(vectorizer.vocabulary_)

print('*'*50)

print(dtm.toarray()) # convert sparse matrix to nparray

{'it': 3, 'was': 7, 'the': 5, 'best': 1, 'of': 4, 'times': 6, 'worst': 9, 'age': 0, 'wisdom': 8, 'foolishness': 2}
**************************************************
[[0.         0.60735961 0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.        ]
 [0.         0.         0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.60735961]
 [0.4788493  0.         0.         0.31694544 0.31694544 0.31694544
  0.         0.31694544 0.60735961 0.        ]
 [0.4788493  0.         0.60735961 0.31694544 0.31694544 0.31694544
  0.         0.31694544 0.         0.        ]]
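The numbers above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings as I understand them: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of every row. It uses whitespace tokenization, which happens to match the vectorizer for this small corpus; the helper `tfidf_row` is illustrative.

```python
import math
from collections import Counter

docs = ["it was the best of times", "it was the worst of times",
        "it was the age of wisdom", "it was the age of foolishness"]

tokenized = [doc.lower().split() for doc in docs]
terms = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in tokens for tokens in tokenized) for term in terms}

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in terms}

def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    token_counts = Counter(tokens)
    raw = [token_counts[term] * idf[term] for term in terms]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(tokens) for tokens in tokenized]
print([round(value, 4) for value in tfidf[0]])
```

Terms that appear in every document ("it", "was", "the", "of") get the smallest weight, while a term unique to one document ("best") gets the largest.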


### Feature Transformation

#### Normalization & Changing Distribution

Min-Max Scaling (Column Normalization)

Standard Scaling (Z score normalization)
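
For reference, the two transformations applied to a feature column $x$ can be written as follows. These are the standard definitions; $\mu$ is the column mean and $\sigma$ the population standard deviation (dividing by $n$, not $n - 1$), which is the convention `StandardScaler` follows.

```latex
x'_{\text{minmax}} = \frac{x - \min(x)}{\max(x) - \min(x)},
\qquad
x'_{\text{std}} = \frac{x - \mu}{\sigma}
```

Min-max scaling maps the column onto the interval $[0, 1]$, while standard scaling yields a column with mean 0 and standard deviation 1.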

In [7]:
import numpy as np

a = np.array([[1,2,3],[4,5,6]])

print(a.reshape(6))

print('*'*50)

print(a.reshape(3,-1))

[1 2 3 4 5 6]
**************************************************
[[1 2]
 [3 4]
 [5 6]]


In [6]:
import numpy as np

data = np.array([1,1,0,-1,2,1,3,-2,4,100], dtype='f').reshape(-1,1)

data

array([[  1.],
       [  1.],
       [  0.],
       [ -1.],
       [  2.],
       [  1.],
       [  3.],
       [ -2.],
       [  4.],
       [100.]], dtype=float32)

In [38]:
# Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler

MinMaxScaler().fit_transform(data)

array([[0.02941177],
       [0.02941177],
       [0.01960784],
       [0.00980392],
       [0.03921569],
       [0.02941177],
       [0.04901961],
       [0.        ],
       [0.05882353],
       [1.0000001 ]], dtype=float32)

In [39]:
(data-data.min())/(data.max() - data.min())

array([[0.02941176],
       [0.02941176],
       [0.01960784],
       [0.00980392],
       [0.03921569],
       [0.02941176],
       [0.04901961],
       [0.        ],
       [0.05882353],
       [1.        ]], dtype=float32)

In [40]:
# Standard Scaling

from sklearn.preprocessing import StandardScaler

StandardScaler().fit_transform(data)

array([[-0.3328055 ],
       [-0.3328055 ],
       [-0.36642224],
       [-0.40003896],
       [-0.2991888 ],
       [-0.3328055 ],
       [-0.26557207],
       [-0.43365568],
       [-0.23195536],
       [ 2.9952497 ]], dtype=float32)

In [41]:
(data-data.mean())/data.std()

array([[-0.3328055 ],
       [-0.3328055 ],
       [-0.36642224],
       [-0.40003896],
       [-0.2991888 ],
       [-0.3328055 ],
       [-0.26557207],
       [-0.43365568],
       [-0.23195536],
       [ 2.9952497 ]], dtype=float32)

#### Filling Missing Values

<code>sklearn.impute.SimpleImputer()</code> (replaces <code>sklearn.preprocessing.Imputer()</code>, which was removed in scikit-learn 0.22)<br />
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
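
A minimal sketch of mean imputation, assuming a recent scikit-learn release in which the imputer lives in `sklearn.impute` as `SimpleImputer`. The array `X_missing` is made-up illustration data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (illustrative only).
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [7.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)

print(X_filled)
```

Other values for `strategy` include `"median"`, `"most_frequent"` and `"constant"`; which one is appropriate depends on the feature.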

### Feature Selection

#### Statistical Approach

A feature whose values barely vary carries little information for distinguishing observations. A simple statistical approach is therefore to drop features whose variance falls below a chosen threshold.

In [3]:
from sklearn.feature_selection import VarianceThreshold

from sklearn.datasets import make_classification

# Generate an ndarray of shape (100, 20), the make_classification defaults
x_data_generated, y_data_generated = make_classification()

x_data_generated.shape

(100, 20)

In [9]:
type(x_data_generated)

numpy.ndarray

In [4]:
VarianceThreshold(0.7).fit_transform(x_data_generated).shape

(100, 19)

In [5]:
VarianceThreshold(0.8).fit_transform(x_data_generated).shape

(100, 18)

In [6]:
VarianceThreshold(0.9).fit_transform(x_data_generated).shape

(100, 15)

In [21]:
# https://chrisalbon.com/machine_learning/feature_selection/variance_thresholding_for_feature_selection/

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# Load iris data
iris = datasets.load_iris()

print(type(iris))

# Create features and target
X = iris.data
y = iris.target

print(X.shape)
print(X[0:5])
# Note: these are standard deviations; VarianceThreshold compares variances (std ** 2)
print(X.std(axis=0))
print('*'*50)

# Create a VarianceThreshold object with a variance threshold of 0.5
thresholder = VarianceThreshold(threshold=.5)

# Conduct variance thresholding
X_high_variance = thresholder.fit_transform(X)

print(X_high_variance.shape)

# View first five rows with features with variances above threshold
print(X_high_variance[0:5])

<class 'sklearn.utils.Bunch'>
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0.82530129 0.43214658 1.75852918 0.76061262]
**************************************************
(150, 3)
[[5.1 1.4 0.2]
 [4.9 1.4 0.2]
 [4.7 1.3 0.2]
 [4.6 1.5 0.2]
 [5.  1.4 0.2]]
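
To see which iris features survive the threshold and why, a fitted selector can be inspected. This sketch relies on the `variances_` attribute and the `get_support()` method of `VarianceThreshold`; the names `selector`, `mask` and `kept` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()

# Fit only; the selector stores the per-feature variances it computed.
selector = VarianceThreshold(threshold=0.5).fit(iris.data)
print(selector.variances_)

# Boolean mask over the original columns: True means the feature is kept.
mask = selector.get_support()
kept = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept)
```

Only sepal width falls below the threshold, which matches the three remaining columns in the output above.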


<b>Very Important: </b>

1. Variance depends on the scale of a feature. If variables represent different physical quantities, their scales can differ, and a change of units (e.g. measuring distance in nanometers instead of kilometers) changes the variance of a variable arbitrarily. A single variance threshold is therefore only meaningful when the features are on comparable scales.

2. If the variance is zero, the feature is constant and cannot improve the performance of the model, so it should be removed. The variance is also very low when only a handful of observations differ from a constant value.

3. If two features are highly correlated, one of them can usually be discarded. Features removed by a variance threshold should have variance close to zero. Always test the effect on the existing data before discarding any feature.
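
The first point can be checked numerically. The sketch below uses made-up distances and shows that the same measurements have very different variances in kilometers and meters, while min-max scaling removes the dependence on units. The helper `min_max` is illustrative.

```python
import numpy as np

# The same five distances expressed in two units (illustrative data).
distance_km = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
distance_m = distance_km * 1000

print(distance_km.var())  # variance in km^2
print(distance_m.var())   # variance in m^2, larger by a factor of 1000**2

def min_max(x):
    """Rescale a 1-D array to the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# After min-max scaling, the unit of measurement no longer matters.
print(min_max(distance_km).var(), min_max(distance_m).var())
```

A threshold such as 0.7 would drop the kilometer column but keep the meter column, even though both carry exactly the same information.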

#### Grid Search
(covered later)

## Pearson Correlation & p-value
(covered later)

<code>from scipy.stats import pearsonr</code>
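
Until this topic is covered in full, here is a minimal sketch of how `pearsonr` is called. The data are synthetic (a linear relationship plus noise, generated with a fixed seed), so the exact numbers are not meaningful beyond illustrating the call.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data: y depends linearly on x, plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(x, y)
print(r, p_value)
```

A coefficient close to 1 or -1 between two features is the kind of high correlation that suggests one of them may be redundant.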

## TODO

<b>Topics Left</b>

1. CDF
2. shapiro
3. feature selection - grid search
4. qq plot
5. pearsonr
6. Missing values using classification