### Pandas
Basic data structures:
1. Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
2. DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
3. Panel was a less-used container for 3-dimensional data. It was deprecated in pandas 0.20 and removed in later releases; a DataFrame with a MultiIndex (or the xarray package) is the usual replacement. The term "panel data" comes from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s.
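
A minimal sketch of these containers. The names `s`, `people`, and `panel_like` and all values are made up for illustration; the MultiIndex frame is one common way to hold data with a third dimension, shown here as an assumption about what you might need rather than a prescribed approach.

```python
import pandas as pd

# Series: one-dimensional, every value carries an index label.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# DataFrame: two-dimensional, each column may have its own dtype.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, 27],
    "height_m": [1.62, 1.80, 1.75],
})

# 3-dimensional data (item x year x column) held in a MultiIndex DataFrame.
idx = pd.MultiIndex.from_product([["x", "y"], [2020, 2021]],
                                 names=["item", "year"])
panel_like = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

print(s["b"])                # label-based access on a Series
print(people.dtypes)         # one dtype per column
print(panel_like.loc["x"])   # the 2-D slice for item "x"
```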

#### Examples:
DataFrame (quick tutorial: https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm):
1. Create a DataFrame
2. Select and index
3. Convert

In [7]:
import pandas as pd
import numpy as np
#df=pd.read_csv('clean_dataset.csv', sep=',',header='infer')
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# boolean mask and the subset it selects (the subset is printed further below):
# print(df['A'] < 0)
# print("subset:")
# print(df[df['A'] < 0])
## index by column position
print("index by column numbers:\n", df.iloc[:,1:3].head())
## index by column names:
print("index by column names: one column\n", df.loc[:,'A'].head())
print("index by column names: multiple columns\n", df[['A','B']].head())
## select rows by condition
tmp_data = df[df['A'] < 0]
print( "select rows by conditions\n:", tmp_data)
## access the underlying values as a NumPy array
print( "access dataframe values \n:",  type(tmp_data.values))

index by column numbers:
           B         C
0 -0.273562 -0.485425
1 -1.658433 -0.015363
2 -0.274297  0.294145
3 -0.051012  0.227535
4 -2.296558 -0.805188
index by column names: one column
 0    0.266964
1   -1.273495
2   -0.286125
3    0.959914
4   -0.747683
Name: A, dtype: float64
index by column names: multiple columns
           A         B
0  0.266964 -0.273562
1 -1.273495 -1.658433
2 -0.286125 -0.274297
3  0.959914 -0.051012
4 -0.747683 -2.296558
select rows by conditions
:           A         B         C         D
1 -1.273495 -1.658433 -0.015363 -2.241884
2 -0.286125 -0.274297  0.294145 -0.450299
4 -0.747683 -2.296558 -0.805188 -0.865138
5 -0.194597 -0.355680 -0.021988  0.562281
7 -0.415250 -0.911368  1.373268 -0.287063
access dataframe values 
: <class 'numpy.ndarray'>
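
The cell above covers creating and selecting; the "convert" item is only touched via `.values`. A minimal sketch of a few common conversions follows, using a made-up frame `demo` (its columns and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": ["4", "5", "6"]})

# Column dtype conversion: strings -> integers.
demo["B"] = demo["B"].astype(int)

# DataFrame -> NumPy array (to_numpy() is the recommended spelling of .values).
as_array = demo.to_numpy()

# DataFrame -> list of per-row dictionaries.
as_records = demo.to_dict("records")

print(as_array.dtype, as_array.shape)
print(as_records[0])
```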


## Feature Engineering:
ref: 
https://www.youtube.com/watch?v=LMlzHfJPvjI&list=PL7tqo8Xk0expKfOuKz9AWWQ0rqIhtjjfP&index=16

From raw data to useful features
1. Feature extraction and preprocessing
2. Feature selection
    1. Remove useless features
    2. Use sklearn selectors: univariate statistics, correlation, model-based
3. Feature generation
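
The selection approaches listed above can be sketched with scikit-learn's `feature_selection` module. The synthetic data from `make_regression`, the added constant column, and the choice of `k=3` are assumptions made purely for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       VarianceThreshold, f_regression)

# Synthetic regression data plus one constant (useless) column.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)
X = np.hstack([X, np.ones((200, 1))])

# Remove useless features: drop columns with zero variance.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Statistics / correlation: keep the k features with the best univariate F-score.
X_kbest = SelectKBest(score_func=f_regression, k=3).fit_transform(X_var, y)

# Model-based: keep features whose importance reaches the mean importance
# (the default threshold for estimators that expose feature_importances_).
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
X_model = selector.fit_transform(X_var, y)

print(X.shape, X_var.shape, X_kbest.shape, X_model.shape)
```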

E.g. Automatic Feature Generation:
1. multiplicative interactions
2. Function transformations: $x^2$, $\sqrt{x}$, $\ln(x)$
3. Automated Threshold Selection:
    1. Turn a numerical variable into a binary
    2. Find a cut off point automatically
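
A rough sketch of the three generation ideas above. The toy data, the cut-off at 6.0 in the target, and the use of a depth-1 decision tree to find the threshold are all assumptions chosen for illustration; other threshold-search methods would work as well.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 2))
y = (X[:, 0] > 6.0).astype(int)   # toy target that depends on a cut-off at 6.0

# 1. Multiplicative interactions: adds the product x0 * x1 as a new column.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)  # columns: x0, x1, x0*x1

# 2. Function transformations of a single (positive) column.
x0 = X[:, 0]
X_func = np.column_stack([x0 ** 2, np.sqrt(x0), np.log(x0)])

# 3. Automated threshold selection: a depth-1 tree learns a single cut-off,
#    which then turns the numerical variable into a binary one.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(x0.reshape(-1, 1), y)
cutoff = stump.tree_.threshold[0]
x0_binary = (x0 > cutoff).astype(int)

print("learned cut-off:", cutoff)
```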
    
Automatic Feature Selection:
1. Correlation filtering: when several features are strongly correlated with each other, which of them should be kept?
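
One possible heuristic for that question, offered as a sketch rather than a settled answer: within each highly correlated pair, keep the feature that is more correlated with the target. The helper `drop_correlated`, the 0.9 threshold, and the synthetic columns are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
features = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.05 * rng.normal(size=n),  # nearly a copy of "a"
    "b": rng.normal(size=n),
})
target = 2.0 * features["a"] + features["b"] + 0.1 * rng.normal(size=n)


def drop_correlated(X, y, threshold=0.9):
    """For each pair with |corr| > threshold, drop the member less correlated with y."""
    corr = X.corr().abs()
    relevance = X.corrwith(y).abs()
    dropped = set()
    cols = list(X.columns)
    for i, c1 in enumerate(cols):
        for c2 in cols[i + 1:]:
            if c1 in dropped or c2 in dropped:
                continue
            if corr.loc[c1, c2] > threshold:
                dropped.add(c1 if relevance[c1] < relevance[c2] else c2)
    return X.drop(columns=sorted(dropped))


filtered = drop_correlated(features, target)
print(list(filtered.columns))
```

Only one of `a` and `a_noisy` survives; which one depends on the sample, since both carry almost the same signal.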


## Feature preprocessing
1. Scaling
     1. to [0,1]:
        sklearn.preprocessing.MinMaxScaler
     2. to mean=0, std=1:
         sklearn.preprocessing.StandardScaler
     3. comparison of the available scalers: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
2. Outlier handling
3. Encoding
     1. rank: replace each value by its rank, so that the spacing between sorted values becomes equal
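
A small sketch comparing the two scalers with a rank transform on a toy column that contains one outlier. The data is made up, and `scipy.stats.rankdata` is used here as one convenient way to compute ranks.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # last value is an outlier

x_minmax = MinMaxScaler().fit_transform(x)       # rescaled to [0, 1]
x_standard = StandardScaler().fit_transform(x)   # mean 0, std 1
x_rank = rankdata(x.ravel())                     # ranks, equally spaced

print(x_minmax.ravel())
print(x_standard.ravel())
print(x_rank)
```

With min-max or standard scaling the outlier still squeezes the other four values together, whereas the rank transform places all five values at equal distances.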

### Feature encoding
reference: https://zh.coursera.org/learn/competitive-data-science/lecture/wckTQ/datetime-and-coordinates
#### Type of features
1. Numerical features
2. Categorical features
3. Ordinal features
4. Datetime and coordinates
5. Handling missing values
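
A minimal sketch of how a few of these feature types might be handled with pandas. The frame `raw`, its column names, and the choice of median imputation are assumptions for illustration, not a recommendation from the referenced lecture.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],                        # ordinal
    "signup": ["2021-01-04", "2021-03-15", "2021-03-20", "2021-12-31"],   # datetime
    "income": [52000.0, np.nan, 61000.0, 48000.0],                        # numerical, one gap
})

# Ordinal: map categories to integers that respect their order.
raw["size_code"] = raw["size"].map({"small": 0, "medium": 1, "large": 2})

# Datetime: extract calendar parts as separate features.
signup = pd.to_datetime(raw["signup"])
raw["signup_month"] = signup.dt.month
raw["signup_weekday"] = signup.dt.dayofweek   # Monday = 0

# Missing values: flag the gap first, then fill it with the median.
raw["income_missing"] = raw["income"].isna().astype(int)
raw["income"] = raw["income"].fillna(raw["income"].median())

print(raw)
```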




#### Sample code for one hot encoding

In [27]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
test_ft = [[0, 0, 3], [1, 1, 1], [0, 2, 1], [1, 0, 2]]
enc.fit(test_ft)
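# Legacy attributes: deprecated in scikit-learn 0.20 and removed in 0.22.
# The first two output lines below come from an older version.
print(enc.n_values_)
print(enc.feature_indices_)
print(enc.transform(test_ft))

# Recommended in newer versions: set categories='auto' and read categories_ instead
enc2 = OneHotEncoder(categories='auto')
enc2.fit(test_ft)
#print(enc2.n_values_)
#print(enc2.feature_indices_)
print(enc2.categories_)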


[2 3 4]
[0 2 5 9]
  (0, 7)	1.0
  (0, 2)	1.0
  (0, 0)	1.0
  (1, 5)	1.0
  (1, 3)	1.0
  (1, 1)	1.0
  (2, 5)	1.0
  (2, 4)	1.0
  (2, 0)	1.0
  (3, 6)	1.0
  (3, 2)	1.0
  (3, 1)	1.0
[array([0, 1]), array([0, 1, 2]), array([1, 2, 3])]


If you previously ran a LabelEncoder before the OneHotEncoder to convert string categories to integers, that step is no longer needed: OneHotEncoder now accepts string categories directly.
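
A short sketch of encoding string categories without a LabelEncoder step (supported since scikit-learn 0.20). The `regions` array is made-up sample data, and `handle_unknown="ignore"` is an optional choice made here so that unseen categories would not raise at transform time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

regions = np.array([["northeast"], ["southwest"], ["northeast"], ["southeast"]])

enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(regions).toarray()   # output is sparse by default

print(enc.categories_)   # categories are sorted alphabetically
print(onehot)
```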


In [3]:
#### using pandas 
import pandas  as pd 
import numpy as np 
import matplotlib.pyplot as plt
import sklearn
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.preprocessing import StandardScaler,MinMaxScaler 
from sklearn.pipeline import make_pipeline

path = './'
df = pd.read_csv(path+'insurance_data.csv')
ft0 = ['ft1','ft2','ft3','ft4', 'ft5', 'ft6']
categorical_columns = ft0
df_encode = pd.get_dummies(data=df, prefix=None, prefix_sep='_',
                           columns=categorical_columns, drop_first=True)

#### Transform columns of dataframe separately. 

In [4]:
preprocess = make_column_transformer(
    (StandardScaler(),['ft1', 'ft2']),
    # scikit-learn >= 1.2 renames this argument: use sparse_output=False
    (OneHotEncoder(sparse=False), ['ft6']),
    remainder='passthrough'
)

ft_res = preprocess.fit_transform(df[ft0])
## access one intermediate step: the fitted OneHotEncoder and its categories
preprocess.transformers_[1][1].categories_


[array(['northeast', 'northwest', 'southeast', 'southwest'], dtype=object)]

### Introduction to feature importance in random forests
ref: https://zh.coursera.org/learn/python-machine-learning/lecture/lF9QN/random-forests
<img src=rf.png>

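1. Randomized bootstrap copies: each tree is trained on a bootstrap sample of the training data
2. Randomized feature splits: each split considers only a random subset of the features
3. Predictions for a regression task: the mean of the individual tree predictions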
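
The averaging step and the feature importances can be checked directly on a fitted forest. This is a sketch on synthetic data from `make_regression`; the sample sizes and hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       random_state=0)

forest = RandomForestRegressor(n_estimators=20, max_features=3, random_state=0)
forest.fit(X, y)

# The forest prediction is the mean of the individual tree predictions.
per_tree = np.stack([t.predict(X[:5]) for t in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
print(np.allclose(manual_mean, forest.predict(X[:5])))

# Impurity-based feature importances, normalised to sum to 1.
importances = forest.feature_importances_
print(importances)
```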

In [7]:
# visualize one tree of the random forest
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import graphviz
# best tree: 3 features, 20 estimators
RANDOM_STATE = 42
forest1 = RandomForestRegressor(n_estimators=20, oob_score=True,
                                max_features=3, max_depth=4,
                                random_state=RANDOM_STATE)

# One-hot encode the string columns first: fitting on the raw values fails with
# "ValueError: could not convert string to float: 'northeast'".
# (Assumes every string column among the first six is categorical.)
X_df = pd.get_dummies(df.iloc[:200, 0:6])
X = X_df.to_numpy(dtype=float)
y = df.iloc[:200, 6].values

forest1.fit(X, y)
tree1 = forest1.estimators_[1]
# feature names come from the encoded frame; class_names only applies to classifiers
dot_data = tree.export_graphviz(tree1, out_file=None, feature_names=list(X_df.columns),
                                filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
# writes random_forest.png; render() appends the format extension itself
graph.render('random_forest', format='png', view=False)