# __scikit-learn Cheat Sheet__

---

## __Preprocessing__

### 1. train_test_split: split the dataset into training set and test set
    - Parameters:
        - X
        - y
        - random_state
        - test_size
    - Return: X_train, X_test, y_train, y_test
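
A minimal sketch of the call above, assuming a small synthetic dataset built with NumPy (the arrays are illustrative and not part of these notes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 samples, 2 features, binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the rows; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```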

### 2. Scaling 
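    1. StandardScaler: centers each feature to zero mean and scales it to unit variance
    2. RobustScaler: centers and scales using the median and quartiles, so outliers have less influence
    3. MinMaxScaler: rescales each feature to lie between 0 and 1 (default range)
    4. Normalizer: rescales each sample (row) to unit norm, i.e. projects it onto a circle (sphere) with radius 1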
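
A short sketch of how the scalers are typically used, assuming toy arrays (illustrative only). The scaler is fitted on the training data and the same fitted transformation is then applied to the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Toy data (illustrative only).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Learn mean and standard deviation on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuses the training statistics

# Per-feature rescaling to [0, 1].
X_train_minmax = MinMaxScaler().fit_transform(X_train)

# Per-sample (row-wise) rescaling to unit Euclidean norm.
X_train_unit = Normalizer().fit_transform(X_train)
```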
    
### 3. Imputation
    1. SimpleImputer: fill missing values with a per-column statistic or a constant
        - Modules: sklearn.impute
        - Parameters: 
            - strategy: 'mean' (default) imputes missing values with the column's mean; 'median', 'most_frequent', and 'constant' are also available

### 4. LabelEncoder:
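    - Modules: sklearn.preprocessing
    - Note: intended for encoding target labels (y); for input features, OrdinalEncoder or OneHotEncoder is usually the better fit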
    - Ex: 
        - encoder = LabelEncoder().fit(X_train_categories)
        - X_train_encoded = encoder.transform(X_train_categories)

### 5. OneHotEncoder:
    - Modules: sklearn.preprocessing
    - Ex: 
        - encoder = OneHotEncoder().fit(X_train_categories)
        - X_train_encoded = encoder.transform(X_train_categories)
    - Attributes:
        - categories_: the categories found for each input feature (use get_feature_names_out() for the labels of the one-hot encoded columns)
    
    

----

## __Supervised Learning__

## I. Classification


### 1. k-Nearest Neighbors:
    - Idea: Find the n_neighbors training points closest to the new data point and assign it the majority class among them
    - Modules: sklearn.neighbors
    - Class: KNeighborsClassifier
    - Parameters:
        - n_neighbors: number of neighbors used to classify
        - weights: 
            - distance: further points will be less influential 
            - uniform: no weights used (default)
        - metric: minkowski (default)
        - p: 1 for manhattan distance, 2 for euclidean distance (default)
    - Methods: 
        - fit, predict
        - predict_proba: the probability estimates for test data X
    - Attributes:
        - classes_: class labels known to the classifier
    - Strengths:
        - Good for first model to consider
        - Fast and easy to understand
    - Weaknesses:
        - Not good for sparse dataset
        - Not a complex model
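
A small sketch of the classifier on a one-dimensional toy dataset (illustrative only) with two well-separated groups, so the three nearest neighbors of each query point all share one class:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (illustrative only): class 0 near 0-2, class 1 near 10-12.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)

X_new = np.array([[1.5], [10.5]])
pred = knn.predict(X_new)         # majority class among the 3 nearest points
proba = knn.predict_proba(X_new)  # fraction of neighbors in each class

print(pred)
print(proba)
```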
        
### 2. Naive Bayes:
    - Idea: Learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature.
    - Modules: sklearn.naive_bayes
    - Classes: 
        - GaussianNB
        - MultinomialNB
        - BernoulliNB
        - ComplementNB
    - Parameters:
        - alpha (MultinomialNB, BernoulliNB, ComplementNB): control model complexity. large alpha means more smoothing, resulting in less complex model
    - Methods:
        - fit, predict
        - predict_proba: the probability estimates for test data X
    - Attributes:
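        - feature_log_prob_ (MultinomialNB): log probability of each feature given a class (replaces coef_, which was removed in recent scikit-learn releases)
        - class_log_prior_ (MultinomialNB): smoothed log prior of each class (replaces intercept_)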
        - feature_count_ (MultinomialNB, BernoulliNB, ComplementNB): Number of samples encountered for each (class, feature) during fitting. This value is weighted by the sample weight when provided.
    - Strengths:
        - Faster than linear models
        - Fast to train and to predict
        - GaussianNB for continuous data, including very high-dimensional data
        - MultinomialNB for count data, for sparse data (text)
        - BernoulliNB for binary data, for sparse data (text)
        - ComplementNB for imbalanced data
    - Weaknesses:
        - Generalization performance is often slightly worse than that of linear models
        
### 3. Linear Models
    - Idea: separate the classes with a linear decision boundary: a threshold, line, plane, or hyperplane
    - Modules: 
        - sklearn.svm: for LinearSVC
        - sklearn.linear_model: for LogisticRegression and SGDClassifier
    - Classes:
        - LogisticRegression
        - SGDClassifier
        - LinearSVC
    - Parameters:
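        - C (LogisticRegression, LinearSVC): regularization control parameter; increasing C means less regularization
        - penalty: 'l1' or 'l2'
        - multi_class (LogisticRegression): 'ovr' trains one-vs-rest classifiers, 'multinomial' trains a softmax regression classifier (deprecated in recent scikit-learn releases)
        - solver (LogisticRegression): 'lbfgs' (default) supports the multinomial setting
        - loss (SGDClassifier, LinearSVC): 'hinge' gives a linear SVM
        - dual (LinearSVC): False, unless there are more features than training instances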
    - Methods:
        - fit, predict
        - decision_function: Predict confidence scores for samples.
    - Attributes:
        - coef_: coefficients of linear function
        - intercept_: constant term of the linear function
    - Strengths:
        - Fast to train and predict
        - Good for high-dimensional data and sparse data
    - Weaknesses:
        - Not so good on small data
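
A hedged sketch of how C affects a linear classifier, assuming a synthetic dataset from `make_classification` (illustrative only). A smaller C means stronger regularization, which shrinks the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Large C: weak regularization. Small C: strong regularization.
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

norm_weak = np.linalg.norm(weak_reg.coef_)
norm_strong = np.linalg.norm(strong_reg.coef_)
print(norm_weak, norm_strong)

# Signed distances to the decision boundary for the first five samples.
scores = weak_reg.decision_function(X[:5])
```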
        
### 4. Decision Trees
    - Idea: Using series of tests (hierarchical if/else questions)
    - Modules: sklearn.tree
    - Classes: DecisionTreeClassifier
    - Parameters:
        - max_depth: the maximum depth of the tree
        - min_samples_split: the minimum number of samples required to split an internal node
        - min_samples_leaf: the minimum number of samples required to be at a leaf node.
    - Methods:
        - fit, predict
        - predict_proba
    - Attributes:
        - feature_importances_: return the feature importances
    - Strengths: 
        - Easily visualized and understood (at least for small trees).
        - Invariant to feature scaling. Works well without preprocessing the data.
    - Weaknesses: 
        - Tend to overfit and generalize poorly unless pruned.

### 5. Random Forest
    - Idea: An ensemble of slightly different decision trees whose predictions are combined
    - Modules: sklearn.ensemble
    - Classes: RandomForestClassifier
    - Parameters:
        - n_estimators: more trees generally give better, more stable results with diminishing returns, but need more memory and more time to train.
        - max_features: determines how random each tree is; a smaller value reduces overfitting.
        - max_depth (for pre-pruning)
    - Methods:
        - fit, predict
        - predict_proba
    - Attributes:
        - estimators_ : list of DecisionTreeClassifier
        - feature_importances_
    - Strengths:
        - Among the most widely used machine learning methods.
        - Very powerful, often work well without heavy tuning.
        - No need for scaling.
        - Share benefits from decision tree.
    - Weaknesses:
        - Large datasets -> time consuming
        - Don't tend to perform well on very high dimensional, sparse data (linear model is more appropriate).
        - Require more memory and slower to train and to predict than linear models.
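
A minimal sketch of fitting a forest and reading its attributes, assuming synthetic data from `make_classification` (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features="sqrt",  # features considered at each split
    random_state=0,
    n_jobs=-1,            # use all available cores
).fit(X, y)

# One importance value per feature; the values sum to 1.
importances = forest.feature_importances_
print(importances)
print(len(forest.estimators_))  # the individual fitted trees
```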
        
### 6. Gradient Boosting
    - Idea: Bunch of decision trees, where each tree tries to correct the previous one
    - Modules: sklearn.ensemble
    - Classes: GradientBoostingClassifier
    - Parameters:
        - n_estimators
        - learning_rate: controls how strongly each tree corrects the mistakes of the previous trees.
        - max_depth: a small value (1-5) reduces model complexity.
        - max_leaf_nodes
    - Methods:
        - fit, predict
        - predict_proba
        - decision_function
    - Attributes:
        - estimators_
        - feature_importances_
    - Strengths:
        - Among the most powerful and widely used models for supervised learning.
        - Like other tree-based algorithms, gradient boosted regression trees (GBRT) work well without scaling.
    - Weaknesses:
        - Requires careful tuning of parameters and may take a long time to train.
        - Doesn't work well on high dimensional sparse data.
        
### 7. Support Vector Machines
    - Idea: find the maximum-margin decision boundary, which is defined by the training points closest to it (the support vectors)
    - Modules: sklearn.svm
    - Classes: SVC
    - Parameters:
        - C: regularization parameter
        - kernel: 'rbf' uses the Gaussian radial basis function, 'poly' uses polynomial features
        - gamma: parameter for certain kernels (e.g. rbf); increasing it gives a more complex model
    - Methods:
        - fit, predict
        - decision_function
    - Attributes:
        - support_: indices of support vectors
        - support_vectors_
        - coef_ (linear kernel only)
        - intercept_
    - Strengths:
        - Powerful model and performs well on a variety of datasets
        - Allows complex decision boundaries.
        - Works well on high dimensional or low dimensional data.
        - Works well if features use the same unit.
    - Weaknesses:
        - Don't scale very well with the number of samples.
        - Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
        - Require scaling and careful parameters tuning.
        - Hard to inspect, difficult to understand why a particular prediction was made.
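
Because SVMs are sensitive to feature scale, a common pattern is to put a scaler in front of the SVC. A hedged sketch, assuming the synthetic two-moons dataset (illustrative only):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (illustrative only).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data inside the pipeline.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)

test_accuracy = svm.score(X_test, y_test)
n_support = len(svm.named_steps["svc"].support_)  # number of support vectors
print(test_accuracy, n_support)
```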

## II. Regression


### 1. k-Nearest Neighbors:
    - Idea: Find the n_neighbors training points closest to the new data point and predict the average of their target values
    - Modules: sklearn.neighbors
    - Classes: KNeighborsRegressor
    - Parameters:
        - n_neighbors: number of neighbors used to predict
        - weights: 
            - distance: further points will be less influential 
            - uniform: no weights used (default)
        - metric: minkowski (default)
        - p: 1 for manhattan distance, 2 for euclidean distance (default)
    - Methods: 
        - fit, predict
    - Attributes:
    - Strengths:
        - Good for first model to consider
        - Fast and easy to understand
    - Weaknesses:
        - Not good for sparse dataset
        - Not a complex model
        
### 2. Linear Models:
    - Idea: Make predictions using linear function of the input features
    - Modules: sklearn.linear_model
    - Classes: 
        - LinearRegression: ordinary least squares
        - SGDRegressor: Stochastic Gradient Descent
        - Ridge: L2 regularization; shrinks the magnitude of the coefficients toward zero
        - Lasso: L1 regularization; makes the coefficients of some features exactly zero
        - ElasticNet: combines the L1 and L2 penalties
        - LinearSVR
    - Parameters:
        - alpha (Ridge, Lasso, ElasticNet): regularization control parameter; increasing alpha makes the regularization stronger
        - l1_ratio (ElasticNet): mix between the two penalties; 1 is pure L1 (Lasso), 0 is pure L2 (Ridge)
    - Methods:
        - fit, predict
    - Attributes:
        - coef_: coefficients of linear function
        - intercept_: constant term of the linear function
    - Strengths:
        - Fast to train and predict
        - Good for high-dimensional data and sparse data
    - Weaknesses:
        - Not so good on small data
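
A hedged sketch contrasting the two penalties, assuming synthetic data from `make_regression` (illustrative only). Ridge shrinks all coefficients, while Lasso sets some of them exactly to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative only): 20 features, only 5 informative.
X, y = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)
```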

### 3. Decision Trees
    - Idea: Using series of tests (hierarchical if/else questions)
    - Modules: sklearn.tree
    - Classes: DecisionTreeRegressor
    - Parameters:
        - max_depth: the maximum depth of the tree
        - min_samples_split: the minimum number of samples required to split an internal node
        - min_samples_leaf: the minimum number of samples required to be at a leaf node.
    - Methods:
        - fit, predict
    - Attributes:
        - feature_importances_: return the feature importances
    - Strengths: 
        - Easily visualized and understood (at least for small trees).
        - Invariant to feature scaling. Works well without preprocessing the data.
    - Weaknesses: 
        - Tend to overfit and generalize poorly unless pruned.
        
### 4. Random Forest
    - Idea: An ensemble of slightly different decision trees whose predictions are averaged
    - Modules: sklearn.ensemble
    - Classes: RandomForestRegressor
    - Parameters:
        - n_estimators: more trees generally give better, more stable results with diminishing returns, but need more memory and more time to train.
        - max_features: determines how random each tree is; a smaller value reduces overfitting.
        - max_depth (for pre-pruning)
    - Methods:
        - fit, predict
    - Attributes:
        - estimators_ : list of DecisionTreeRegressor
        - feature_importances_
    - Strengths:
        - Among the most widely used machine learning methods.
        - Very powerful, often work well without heavy tuning.
        - No need for scaling.
        - Share benefits from decision tree.
    - Weaknesses:
        - Large datasets -> time consuming
        - Don't tend to perform well on very high dimensional, sparse data (linear model is more appropriate).
        - Require more memory and slower to train and to predict than linear models.
        
### 5. Gradient Boosting
    - Idea: Bunch of decision trees, where each tree tries to correct the previous one
    - Modules: sklearn.ensemble
    - Classes: GradientBoostingRegressor
    - Parameters:
        - n_estimators
        - learning_rate: controls how strongly each tree corrects the mistakes of the previous trees.
        - max_depth: a small value (1-5) reduces model complexity.
        - max_leaf_nodes
    - Methods:
        - fit, predict
    - Attributes:
        - estimators_
        - feature_importances_
    - Strengths:
        - Among the most powerful and widely used models for supervised learning.
        - Like other tree-based algorithms, gradient boosted regression trees (GBRT) work well without scaling.
    - Weaknesses:
        - Requires careful tuning of parameters and may take a long time to train.
        - Doesn't work well on high dimensional sparse data.

### 6. Support Vector Machines
    - Idea: fit a regression function that depends only on a subset of the training points (the support vectors)
    - Modules: sklearn.svm
    - Classes: SVR
    - Parameters:
        - C: regularization parameter
        - kernel
        - gamma: parameter for certain kernels
    - Methods:
        - fit, predict
    - Attributes:
        - support_: indices of support vectors
        - support_vectors_
        - coef_ (linear kernel only)
        - intercept_
    - Strengths:
        - Powerful model and performs well on a variety of datasets
        - Allows complex, nonlinear fits.
        - Works well on high dimensional or low dimensional data.
        - Works well if features use the same unit.
    - Weaknesses:
        - Don't scale very well with the number of samples.
        - Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
        - Require scaling and careful parameters tuning.
        - Hard to inspect, difficult to understand why a particular prediction was made.

### XGBoost: 
    - Modules: xgboost
    - Classes: XGBRegressor
    - Parameters:
        - n_estimators: number of boosting rounds (trees added to the ensemble), advice: 100-1000
        - early_stopping_rounds: stop iterating when the validation score has not improved for this many rounds, advice: 5
        - learning_rate
        - n_jobs: for parallel processing, spreads the work across several processors
    - Methods:
        - fit, predict
    - Ex: 
        - model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5) # recent xgboost versions take early_stopping_rounds in the constructor; older ones took it in fit()
        - model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)


----

## __Unsupervised Learning__


### 1. Principal Component Analysis
    - Idea: find the principal components, the directions of highest variance, which carry most of the information
    - Modules: sklearn.decomposition
    - Classes: PCA
    - Parameters:
        - n_components: number of principal components
        - whiten (boolean)
    - Methods:
        - fit
        - transform
        - inverse_transform
    - Attributes:
        - components_: the principal components
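
A minimal sketch of the fit / transform / inverse_transform cycle, assuming random data with one strongly correlated pair of features (illustrative only). `explained_variance_ratio_` is an additional fitted attribute not listed above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data (illustrative only); feature 1 is nearly a multiple of feature 0.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                   # project onto 2 components
X_restored = pca.inverse_transform(X_reduced)  # map back to 5 features (lossy)

print(pca.components_.shape)          # one row per component
print(pca.explained_variance_ratio_)  # share of variance per component
```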

### 2. Non Negative Matrix Factorization
    - Idea: finding interesting patterns within the data, for feature extraction
    - Modules: sklearn.decomposition
    - Classes: NMF
    - Parameters: 
        - n_components: number of components to extract
    - Methods:
        - fit
        - transform
        - inverse_transform
    - Attributes:
        - components_

### 3. Manifold Learning (t-SNE)
    - Idea: find a two-dimensional representation of the data that preserves the distances between points as best as possible, good for visualization
    - Modules: sklearn.manifold
    - Classes: TSNE
    - Parameters:
        - n_components
    - Methods:
        - fit
        - fit_transform (there is no transform method, so new data cannot be projected)
    - Attributes:
    
### 4. K-Means Clustering
    - Idea: data points that are close together are labeled as the same group
    - Modules: sklearn.cluster
    - Classes: KMeans
    - Parameters:
        - n_clusters: number of clusters
    - Methods:
        - fit
        - predict: assigns new data to the nearest cluster center; it does not change the model or the existing labels
        - transform: transform X to a cluster-distance space.
    - Attributes:
        - labels_: label for each training data points
        - cluster_centers_: center of each cluster
    - Strengths:
        - Can add new features to simple data.
        - Easy to understand and to implement.
        - Relatively quick.
    - Weaknesses:
        - Works well only on data with relatively simple cluster shapes.
        - Assumes all directions matter equally.
        - Relies on random initialization. 
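
A small sketch of the three methods, assuming two tight, well-separated toy groups (illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (illustrative only): one group near (0, 0), one near (5, 5).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict assigns new points to the nearest existing center.
new_labels = kmeans.predict(np.array([[0.05, 0.05], [5.05, 5.05]]))

# transform returns the distance of each point to each cluster center.
distances = kmeans.transform(X)

print(kmeans.labels_)
print(kmeans.cluster_centers_)
```

The cluster indices themselves (0 or 1) are arbitrary; only the grouping is meaningful.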

### 5. Agglomerative Clustering
    - Idea: the algorithm starts by declaring each point its own cluster, and then merges the two most similar clusters until some stopping criterion is satisfied
    - Modules: sklearn.cluster
    - Classes: AgglomerativeClustering
    - Parameters:
        - n_clusters: number of clusters to find
        - linkage: 'ward', 'average', and 'complete'
    - Methods:
        - fit_predict
    - Attributes:
        - labels_
    - Externals:
        - ward (linkage) from scipy.cluster.hierarchy: returns linkage_array
        - dendrogram from scipy.cluster.hierarchy: visualize the decision making of Agglomerative Clustering
    - Ex, syntax:
        - linkage_array = ward(X)
        - dendrogram(linkage_array)
    - Strengths:
        - The decision making is easy to understand and visualize
    - Weaknesses:
        - Cannot separate clusters with complex shapes

### 6. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    - Idea: works by identifying points that are in “crowded” regions of the feature space, where many data points are close together.
    - Modules: sklearn.cluster
    - Classes: DBSCAN
    - Parameters:
        - eps: increasing eps means that more points will be included in a cluster.
        - min_samples: increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.
    - Methods:
        - fit_predict
    - Attributes:
    - Strengths:
        - We do not have to specify the number of clusters.
        - Can capture clusters with complex shapes.
        - Can identify points that are not part of any cluster.
    - Weaknesses:
        - Slower than KMeans and AgglomerativeClustering, but still scales to relatively large datasets.
        - Complex process.
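
A minimal sketch, assuming two dense toy groups plus one isolated point (illustrative only). DBSCAN marks points that belong to no cluster with the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative only): two dense groups and one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

# eps: neighborhood radius; min_samples: neighbors (including the point
# itself) needed for a point to count as a core point.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

n_clusters = len(set(labels) - {-1})
print(labels, n_clusters)
```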
        
### Clustering Metrics:
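    - Adjusted Rand Score: compares a clustering with ground-truth labels
    - Silhouette Score: computes the compactness of a cluster; needs no ground truth
    - Normalized Mutual Information: compares a clustering with ground-truth labels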


---

## __Feature Engineering__

### 1. LabelEncoder:
    - Modules: sklearn.preprocessing
    - Ex: 
        - encoder = LabelEncoder().fit(X_train_categories)
        - X_train_encoded = encoder.transform(X_train_categories)

### 2. OneHotEncoder:
    - Modules: sklearn.preprocessing
    - Ex: 
        - encoder = OneHotEncoder().fit(X_train_categories)
        - X_train_encoded = encoder.transform(X_train_categories)
        
### 3. Binning and Discretization
    - Idea: convert a single continuous feature into several categorical (one-hot) features
    - Tools: 
        - np.linspace
        - np.digitize
        - ML Algorithm
    - Example:
        - bins = np.linspace(-3, 3, 11)
        - which_bin = np.digitize(X, bins=bins) # maps each data point in X to the index of the bin it belongs to
        - X_encoded = OneHotEncoder(sparse_output=False).fit_transform(which_bin) # sparse_output is named sparse in scikit-learn versions before 1.2
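
The binning steps above can be sketched end to end as follows, assuming a toy single-feature array (illustrative only):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy continuous feature (illustrative only), shape (n_samples, 1).
X = np.array([[-2.9], [-0.1], [0.4], [2.7]])

# 11 edges between -3 and 3 define 10 bins of width 0.6.
bins = np.linspace(-3, 3, 11)

# Index of the bin each value falls into (1-based for values inside the range).
which_bin = np.digitize(X, bins=bins)

# One column per bin index that actually occurs in the data.
X_binned = OneHotEncoder().fit_transform(which_bin).toarray()

print(which_bin.ravel())
print(X_binned)
```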

### 4. Interactions and Polynomials
    - Idea: Creating more complex features
    - Ways:
        - Combining real dataset with binned dataset
        - Using PolynomialFeatures
    - PolynomialFeatures:
        - Module: sklearn.preprocessing
        - Parameters:
            - degree
            - include_bias (bool)
        - Methods:
            - fit
            - transform
            - fit_transform
            - get_feature_names_out (get_feature_names in older scikit-learn versions)
        - Attributes:

### 5. Feature Selection: Univariate Statistics
    - Idea: test whether there is a statistically significant relationship between each feature and the target, then select the features that are related with the highest confidence
    - Modules: sklearn.feature_selection
    - Classes: 
        - SelectPercentile
        - SelectKBest
    - Parameters: 
        - score_func: the statistical test, e.g. f_classif (default) or f_regression
        - percentile (SelectPercentile)
        - k (SelectKBest)
    - Methods:
        - fit
        - transform
        - get_support
    - Attributes:
    - Strengths:
    - Weaknesses:
    
### 6. Feature Selection: Model Based
    - Idea: uses a supervised machine learning model to judge the importance of each feature, and keeps the important ones.
    - Modules: sklearn.feature_selection
    - Classes: SelectFromModel
    - Parameters:
        - ML model
        - threshold: features whose importance is at or above this value are kept (e.g. 'median', 'mean', or a number)
    - Methods:
        - fit
        - transform
        - get_support
    - Attributes:
    - Strengths:
    - Weaknesses:
    
### 7. Feature Selection: Iterative (RFE: recursive feature elimination)
    - Idea: a series of models is built; each round drops the least important features until n_features_to_select remain
    - Modules: sklearn.feature_selection
    - Classes: RFE
    - Parameters:
        - ML model
        - n_features_to_select
    - Methods:
        - fit
        - transform
        - get_support
    - Attributes:
    - Strengths:
    - Weaknesses:
        - Computationally expensive

---

## __Model Evaluation__

### 1. cross_val_score
    - Module: sklearn.model_selection 
    - Function: cross_val_score
    - Parameters:
        - ML model
        - X
        - y
        - cv: number of folds, or one of the splitter objects listed below
        - scoring
    - Splitters to control cross_val_score (pass as cv):
        - KFold
            - n_splits
            - shuffle (bool)
        - StratifiedKFold
        - LeaveOneOut
        - ShuffleSplit
            - n_splits
            - test_size
            - train_size
        - StratifiedShuffleSplit
        - GroupKFold
    - Strengths:
        - Efficient use of data
    - Weaknesses:
        - Computationally more expensive than a single split
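
A hedged sketch of both ways to set `cv`, assuming the iris dataset and a logistic regression model (both chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Integer cv: for classifiers this uses stratified folds.
scores_default = cross_val_score(model, X, y, cv=5)

# Splitter object: full control over shuffling and the number of folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores_shuffled = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print(scores_default.mean(), scores_shuffled.mean())
```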
        
### 2. cross_val_predict
    - Module: sklearn.model_selection 
    - Function: cross_val_predict
    - Parameters:
        - ML Model
        - X
        - y
        - cv
        
### 3. Grid Search
    - Module: sklearn.model_selection
    - Class: GridSearchCV
    - Parameters:
        - estimator: the ML model
        - param_grid
        - cv
        - scoring
    - Methods: (estimator methods)
        - fit
        - predict
        - transform
    - Attributes:
        - best_params_: the best parameter combination found
        - best_score_: the mean cross-validation score of the best parameter combination
        - best_estimator_: the best estimator, refit on the whole training set
        - cv_results_: the detailed results of the grid search
        
### 4. Randomized Search
    - Module: sklearn.model_selection
    - Class: RandomizedSearchCV
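
A hedged sketch of both search classes, assuming the iris dataset, an SVC, and illustrative parameter ranges (none of these come from the notes above). `loguniform` from `scipy.stats` is one possible choice of sampling distribution for the randomized search:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exhaustive search: every combination in param_grid is cross-validated.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Randomized search: n_iter combinations are sampled from the distributions.
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)}
random_search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=5, random_state=0
)
random_search.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(random_search.best_params_, random_search.best_score_)
print(grid.score(X_test, y_test))  # evaluated with the refit best estimator
```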

---

## __Evaluation Metrics and Scoring__

### 1. Confusion Matrix
    - Idea: Matrix with rows as true classes and columns as predicted classes
    - Module: sklearn.metrics
    - Function: confusion_matrix
    - Parameters: y_test, y_pred
    
### 2. Precision Score
    - Idea: fraction of predicted positives that are truly positive, TP / (TP + FP)
    - Module: sklearn.metrics
    - Function: precision_score
    - Parameters: y_test, y_pred

### 3. Recall Score
    - Idea: fraction of actual positives that are found, TP / (TP + FN) (also called sensitivity or true positive rate)
    - Module: sklearn.metrics
    - Function: recall_score
    - Parameters: y_test, y_pred
    
### 4. f1 Score
    - Idea: harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall)
    - Module: sklearn.metrics
    - Function: f1_score
    - Parameters: y_test, y_pred

### 5. Classification Report
    - Idea: Summary of precision, recall, and f1-score
    - Module: sklearn.metrics
    - Function: classification_report
    - Parameters: y_test, y_pred, target_names
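
A worked sketch of the label-based metrics above, assuming hand-made labels (illustrative only) with 3 true positives, 1 false negative, 1 false positive, and 5 true negatives. The class names passed as `target_names` are hypothetical:

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hand-made labels (illustrative only).
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[5 1]
#  [1 3]]
cm = confusion_matrix(y_test, y_pred)

precision = precision_score(y_test, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_test, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_test, y_pred)                # 0.75

print(cm)
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))
```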
    
### 6. Precision-Recall Curve 
    - Idea: To visualize precision-recall trade offs
    - Module: sklearn.metrics
    - Function: precision_recall_curve
    - Parameters: y_test, predict_proba/decision_function
    - Returns: precision, recall, thresholds
    
### 7. Average Precision Score
    - Idea: computing the integral or area under the curve of the precision-recall curve
    - Module: sklearn.metrics
    - Function: average_precision_score
    - Parameters: y_test, predict_proba/decision_function
    
### 8. Receiver Operating Characteristic (ROC) Curve
    - Idea: False positive rate vs True positive rate
    - Module: sklearn.metrics
    - Function: roc_curve
    - Parameters: y_test, predict_proba/decision_function
    - Returns: fpr, tpr, thresholds
    
### 9. ROC AUC Score
    - Idea: Summarize the ROC curve with a single number, the area under the curve (AUC)
    - Module: sklearn.metrics
    - Function: roc_auc_score
    - Parameters: y_test, predict_proba/decision_function
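
The score-based metrics in sections 6 to 9 take continuous scores rather than predicted labels. A minimal sketch, assuming four hand-made scores (illustrative only) standing in for the positive-class column of `predict_proba` or the output of `decision_function`:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Hand-made labels and scores (illustrative only).
y_test = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75.
auc = roc_auc_score(y_test, scores)
ap = average_precision_score(y_test, scores)

print(auc, ap)
```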

---


## Pipeline

### 1. Pipeline: 
    - Modules: sklearn.pipeline
    - Syntax (example): pipeline = Pipeline([('impute', SimpleImputer()), ('scaling', MinMaxScaler()), ..., ('model', LinearRegression())])
    - Attributes:
        - steps
        - named_steps: dictionary-like object that maps each step name to its estimator

### 2. make_pipeline:
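    - Module: sklearn.pipeline
    - Syntax (example): pipeline = make_pipeline(SimpleImputer(), MinMaxScaler(), LinearRegression()) # step names are generated from the lowercased class names
    - Attributes:
        - steps
        - named_steps: dictionary-like object that maps each step name to its estimator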

### 3. FeatureUnion: runs several transformers in parallel and concatenates their outputs, commonly for numerical and categorical features
    - Module: sklearn.pipeline
    
### 4. ColumnTransformer: Bundle together different preprocessing steps for different columns (ex: one pipeline for categorical features and another for numerical features)
    - Modules: sklearn.compose
    - Syntax (example): preprocess = ColumnTransformer(transformers=[('num', SimpleImputer(), numerical_columns), ('cat', pipeline, categorical_columns)]) (numerical_columns and categorical_columns are lists of column names)
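
A hedged end-to-end sketch that combines Pipeline and ColumnTransformer, assuming pandas is available and using a tiny hypothetical table (the column names `size` and `city` and all values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data: one numerical column with a missing value, one categorical.
df = pd.DataFrame({
    "size": [50.0, np.nan, 80.0, 120.0],
    "city": ["A", "B", "A", "C"],
})
y = np.array([100.0, 150.0, 160.0, 240.0])

# Numerical columns: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# Each transformer entry is (name, transformer, columns).
preprocess = ColumnTransformer(transformers=[
    ("num", numeric, ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
model.fit(df, y)
predictions = model.predict(df)

print(list(model.named_steps))
print(predictions)
```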

### 5. joblib: to save the model as pickle 
    - Module: joblib (standalone package; sklearn.externals.joblib was removed from scikit-learn)
    - Ex: 
        - import joblib
        - joblib.dump(my_model, 'my_model.pkl')
        - my_model_loaded = joblib.load('my_model.pkl')

---

## Custom Estimators

### 1. BaseEstimator
    - Module: sklearn.base

### 2. TransformerMixin
    - Module: sklearn.base
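
A minimal sketch of a custom transformer built from the two base classes. `ClipOutliers` and its parameters are hypothetical names invented for this example. BaseEstimator supplies `get_params` / `set_params`, and TransformerMixin supplies `fit_transform`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each feature to percentiles learned in fit."""

    def __init__(self, lower=5.0, upper=95.0):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention.
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


# Toy data (illustrative only) with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
clipper = ClipOutliers(lower=0.0, upper=75.0)
X_clipped = clipper.fit_transform(X)  # provided by TransformerMixin

print(X_clipped.ravel())
print(clipper.get_params())           # provided by BaseEstimator
```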

---

## __Additional__

1. One vs One Classifier
    - Module: sklearn.multiclass
    - Class: OneVsOneClassifier
    - Syntax: OneVsOneClassifier(SGDClassifier())
    
2. One vs Rest classifier
    - Module: sklearn.multiclass
    - Class: OneVsRestClassifier
    - Syntax: OneVsRestClassifier(SGDClassifier())

---

## Code to make decision boundary

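import matplotlib.pyplot as plt
import numpy as np

def draw_decision_boundary(model, X, y, intensity=0.01):
    """Plot the decision regions of a fitted classifier over two features."""
    # Bounds of the plotting area, taken from the first two columns of X.
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()

    # Grid of points covering the area; intensity is the grid step size.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, intensity),
                         np.arange(y_min, y_max, intensity))

    # Predict a class for every grid point and reshape back to the grid.
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Filled contours show the regions; the scatter shows the data points.
    plt.contourf(xx, yy, Z)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)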