## Supervised Learning

**Typical Model Syntax**
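- scikit-learn expects the feature matrix X to be 2-D, with shape (n_samples, n_features)

- df[['column']] returns a DataFrame (2-D), while df['column'] returns a Series (1-D). Use the DataFrame form for the features, even with a single column; the target y can stay a 1-D Series.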
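
The difference between the two indexing forms is easiest to see from the shapes. The sketch below uses a small made-up DataFrame (the column names are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data, used only to show the shapes
df = pd.DataFrame({"height": [150, 160, 170], "weight": [50, 60, 70]})

X_2d = df[["height"]]   # double brackets -> DataFrame
x_1d = df["height"]     # single brackets -> Series

print(type(X_2d).__name__, X_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
```

The DataFrame keeps a column axis, which is the 2-D layout that fit and predict expect for X.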

In [None]:
from sklearn.module import Model
model = Model()
model.fit(X,y)
predictions = model.predict(X_new)


**Train vs Test Data Split**
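- Needed for every model, so that performance is measured on data the model has not seen
- random_state seeds the random split, which makes it reproducible
- test_size=0.3 holds out 30% of the data for testing
- stratify=y keeps the proportion of each value of y the same in the train and test sets as in the full dataset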

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
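
To see what stratify does, the sketch below splits a made-up, imbalanced label list (80 samples of class 0 and 20 of class 1, both assumptions for illustration) and counts the classes on each side of the split:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

print(Counter(y_train))  # class counts in the 70 training samples
print(Counter(y_test))   # class counts in the 30 test samples
```

Both parts keep the original 80/20 class ratio, which matters most when one class is rare.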

#### Pipelines
- Chains several steps (transformers followed by a final estimator) so they are fitted and applied as one object

In [None]:
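from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Build the pipeline example
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reducer', PCA(n_components=2)),
        ('classifier', RandomForestClassifier(random_state=0))])

# Fit every step on the training data (the last step is a classifier, so use fit, not fit_transform)
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)

# Principal components: apply only the transformer steps (scaler + reducer)
pc = pipe[:-1].transform(X_train)

# Individual steps can be accessed by name
pca = pipe['reducer']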

**K Nearest Neighbours**
- ALWAYS SCALE DATA FIRST

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')  # weights: 'uniform', 'distance' or a callable
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
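
Why scaling matters for KNN: the neighbours are chosen by distance, so a feature with a large numeric range dominates the result. The sketch below is a hand-rolled illustration with made-up points (ages and incomes are assumptions), not scikit-learn's implementation; it standardizes each feature the way StandardScaler does (mean 0, population standard deviation 1).

```python
import math
import statistics

# Hypothetical training points: (age in years, income in dollars)
train = {"A": (26, 53000), "B": (60, 51000), "C": (40, 90000), "D": (35, 20000)}
query = (25, 50000)

def nearest(points, q):
    """Return the name of the point with the smallest Euclidean distance to q."""
    return min(points, key=lambda name: math.dist(points[name], q))

# Standardize each feature with the training mean and standard deviation
columns = list(zip(*train.values()))
means = [statistics.mean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

def scale(point):
    return tuple((v - m) / s for v, m, s in zip(point, means, stds))

train_scaled = {name: scale(p) for name, p in train.items()}

nearest_raw = nearest(train, query)                   # income dominates the distance
nearest_scaled = nearest(train_scaled, scale(query))  # both features contribute
print(nearest_raw, nearest_scaled)
```

Without scaling, the nearest neighbour is B (closest income, but 35 years older). After scaling it is A, which is close in both age and income.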

**Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
# R-squared
r_squared = reg.score(X_test, y_test)

# RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is deprecated in recent scikit-learn releases
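
Both metrics can be checked by hand. The sketch below computes them from their definitions on made-up values (y_true and y_hat are assumptions for illustration):

```python
import math

y_true = [3, 5, 7, 9]
y_hat = [2, 5, 8, 9]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)            # total sum of squares

r_squared = 1 - ss_res / ss_tot         # share of variance explained
rmse = math.sqrt(ss_res / len(y_true))  # typical error, in the units of y
print(r_squared, rmse)
```

Here ss_res = 2 and ss_tot = 20, so R-squared is 0.9 and RMSE is the square root of 0.5.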


**Cross Validation**
- The shuffle argument shuffles the data before it is divided into n_splits folds

In [None]:
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # random_state: any int, for reproducibility
cv_results = cross_val_score(model, X, y, cv=kf)  # one score per fold; the default score for regressors is R-squared
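
What KFold actually produces is a set of index splits. The sketch below prints them for ten made-up samples (the data and random_state=0 are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx.tolist())
```

Every sample lands in the test fold exactly once, so each of the five scores is computed on data that was held out for that fit.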

**Ridge and Lasso Regression** (regularized Linear Regression)

In [None]:
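# Ridge Regression (L2-regularized Linear Regression)
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)  # alpha: regularization strength (float); 0.1 is only an example value
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

# Lasso Regression (L1-regularized Linear Regression)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)  # the remaining steps are the same as for Ridge
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)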
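
The role of alpha can be seen in the simplest case of one centered feature, where the ridge slope has the closed form sum(x*y) / (sum(x*x) + alpha). The sketch below is a hand calculation on made-up data, not a call into scikit-learn:

```python
# Centered toy data where the least-squares slope is exactly 2
x = [-2, -1, 0, 1, 2]
y = [-4, -2, 0, 2, 4]

def ridge_slope(x, y, alpha):
    """Slope minimizing sum((y - w*x)^2) + alpha * w^2 for one centered feature."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    return s_xy / (s_xx + alpha)

slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0, 10, 90)}
print(slopes)
```

With alpha = 0 the slope is the ordinary least-squares value; larger alpha values shrink it toward zero.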

**Classification Metrics**
- Confusion Matrix = counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN)
- Accuracy = (TP + TN) / All
- Precision = TP / (TP + FP)
- Recall = Sensitivity = TP / (TP + FN)
- F1 Score = 2 x Recall x Precision / (Recall + Precision)
- Support = number of instances of each class according to the true labels

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
#....make and train model, predict test values
matrix = confusion_matrix(y_test, y_pred) #gives confusion matrix
report = classification_report(y_test, y_pred) #gives f1 score, precision, recall and support
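
As a worked example of the metric definitions, the sketch below computes them from made-up confusion-matrix counts (the four counts are assumptions for illustration):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Precision (0.8) and recall (about 0.67) differ here because the model misses more positives than it raises false alarms; F1 sits between the two.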

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
#....make and train model (logreg) , predict test values
y_pred_prob = logreg.predict_proba(X_test)[:,1]
#predict_proba returns one column per class; column 1 is the probability of the positive class, i.e. p_k(X)

**ROC Curve**
- Each point on the curve corresponds to one threshold on p_k(X); changing the threshold moves along the curve.
- tpr = true positive rate = TP/ (P) = Sensitivity = TP/(TP + FN)
- fpr = false positive rate = FP/ (N) = FP/ (FP + TN)
- ROC curve= tpr(y) vs fpr(x)
- AUC = Area under curve


In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob) #all three values are calculated

from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, y_pred_prob) #want it to be as close to 1 as possible
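
The definitions of tpr, fpr and AUC can be checked by hand on a tiny example. The sketch below uses four made-up scores and computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, which is an equivalent way of expressing the area under the curve. It is an illustration, not scikit-learn's implementation.

```python
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

def rates(threshold):
    """Return (fpr, tpr) when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

roc_points = [rates(th) for th in (0.9, 0.5, 0.3, 0.0)]

# AUC as the fraction of (positive, negative) pairs ranked correctly (ties count half)
pos = [s for t, s in zip(y_true, scores) if t == 1]
neg = [s for t, s in zip(y_true, scores) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
print(roc_points, auc)
```

Lowering the threshold moves the point from (0, 0) toward (1, 1); three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.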

### Hyperparameter Tuning aka optimizing models
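- **GridSearchCV** = cross validates every combination in a grid of hyperparameter values. NOT SCALABLE: the number of fits multiplies with every added hyperparameter
- **RandomizedSearchCV** = cross validates a fixed number (n_iter) of randomly sampled hyperparameter combinations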


In [None]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # random_state: any int

#make a dictionary of parameters: {"parameter name": values to try}
param_grid = {"alpha": np.arange(0.0001, 1, 0.1),  # example range: np.arange(start, stop, step)
              "solver": ["sag", "lsqr"]}           # example list of options
ridge = Ridge() #initialize model
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)

ridge_cv.fit(X_train, y_train) #fit the model to training data

best_values = ridge_cv.best_params_ #gives best values
score = ridge_cv.best_score_ #best score corresponding to best params

#########################################################################
from sklearn.model_selection import RandomizedSearchCV
# ...same kf, param_grid and ridge as above...
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)  # tries only n_iter random combinations
ridge_cv.fit(X_train, y_train)
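
The cost difference between the two searches comes down to counting model fits. The sketch below does that count for a made-up grid (the parameter names, values and fold count are assumptions for illustration); it only enumerates combinations and trains nothing.

```python
import itertools
import random

# Hypothetical grid: 10 values of one hyperparameter, 2 of another
param_grid = {"alpha": [round(0.1 * i, 1) for i in range(1, 11)],
              "solver": ["sag", "lsqr"]}
n_folds = 5

# Grid search: every combination is cross validated
all_combinations = list(itertools.product(*param_grid.values()))
grid_fits = len(all_combinations) * n_folds

# Randomized search: only n_iter sampled combinations are cross validated
n_iter = 2
sampled = random.Random(0).sample(all_combinations, n_iter)
random_fits = len(sampled) * n_folds

print(grid_fits, random_fits)
```

The grid needs 20 x 5 = 100 fits, the randomized search 2 x 5 = 10, and adding a third hyperparameter would multiply only the first number.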


## XGBoost

Use when:
- Large number of training examples (1000+) and comparatively few features (~100)
- Categorical or numeric data

DO NOT USE when:
- Computer Vision/ NLP, etc
- Small training sets
- Training examples << features

In [None]:
import xgboost as xgb

#Convert to DMatrix to use XGB
data_dmatrix = xgb.DMatrix(data = data_db.iloc[:,:-1], label = data_db.iloc[:,-1]) 

#Defining Params
params = {"objective":'binary:logistic', "max_depth":4}

#Cross Validation
cv_results = xgb.cv(dtrain = data_dmatrix, params = params, nfold = 4, num_boost_round=10, metrics = "error", as_pandas = True)


Depending on the type of base learner, one can use either the scikit-learn compatible API or xgboost's own learning API.

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

#Using Trees as Base Learners using Scikit-learn based API

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, random_state=123) #'reg:linear' is the deprecated name of this objective

xg_reg.fit(X_train, y_train)
pred = xg_reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))

#Using Linear base learners, here through the learning API

DM_train = xgb.DMatrix(data = X_train, label = y_train) #need to convert to DMatrices
DM_test = xgb.DMatrix(data = X_test, label = y_test)

params = {"booster":'gblinear', "objective":"reg:squarederror"}

xg_reg = xgb.train(dtrain = DM_train, params = params,num_boost_round=10) #Notice train instead of fit

pred = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))



Plotting trees and feature importance

In [None]:
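import matplotlib.pyplot as plt

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

# Plot the last tree sideways (needs a tree-based model and the graphviz package)
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR") #num_trees is the 0-based index of the tree: 9 is the last of 10 boosting rounds
plt.show()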

Cross validation using the learning API

In [None]:
# Booster parameters go in the params dict, not in the xgb.cv call
params = {"max_depth": 5,
          "subsample": 0.8,         # fraction of samples used for each tree (example value)
          "colsample_bytree": 0.8}  # fraction of features used for each tree (example value)
          # other parameters: objective, learning rate (eta), gamma, lambda, alpha, etc.

cv_results = xgb.cv(dtrain = housing_dmatrix,
                    params=params,
                    early_stopping_rounds=5, #stops early if no improvement in model performance is seen
                    nfold=3,
                    num_boost_round=10,
                    metrics='rmse', #metric
                    as_pandas=True, # returns the results as a pandas DataFrame
                    seed=123)
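
The idea behind early_stopping_rounds can be sketched in a few lines. This is a simplified illustration of the stopping rule on a made-up metric history (best_round and rmse_history are hypothetical names), not xgboost's own code:

```python
def best_round(metric_per_round, early_stopping_rounds):
    """Return the index of the best round, stopping once the metric has not
    improved for `early_stopping_rounds` consecutive rounds (lower is better)."""
    best_idx = 0
    for i, value in enumerate(metric_per_round):
        if value < metric_per_round[best_idx]:
            best_idx = i
        elif i - best_idx >= early_stopping_rounds:
            break  # no improvement for the allowed number of rounds
    return best_idx

# Hypothetical validation RMSE after each boosting round
rmse_history = [10.0, 8.0, 7.0, 7.1, 7.2, 7.05, 7.3, 7.4, 6.0]
print(best_round(rmse_history, early_stopping_rounds=5))
```

With a patience of 5 the loop stops after five rounds without improvement and keeps round 2, so it never reaches the later value of 6.0. That is the trade-off: less training time in exchange for possibly missing a late improvement.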

Using Grid Search CV or Random Search CV

In [None]:
######GRID SEARCH###########
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, 
                        param_grid = gbm_param_grid, 
                        scoring = 'neg_mean_squared_error', 
                        cv=4, 
                        verbose=1) #prints progress messages while fitting
# Fit grid_mse to the data
grid_mse.fit(X, y)

########RANDOM SEARCH###########

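# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator = gbm,
                                    param_distributions = gbm_param_grid, #lists (or distributions) to sample from
                                    scoring='neg_mean_squared_error', 
                                    n_iter=5,    #number of sampled combinations; this grid has only 4, so sklearn warns and tries all 4
                                    cv=4, 
                                    verbose=1)
# Fit randomized_mse to the data
randomized_mse.fit(X, y)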
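
With scoring='neg_mean_squared_error' the reported best score is negative, because scikit-learn scorers follow a "greater is better" convention and therefore negate error metrics. The sketch below converts such a score back to an RMSE; the value -16.0 is made up for illustration.

```python
import math

best_score = -16.0  # hypothetical value of best_score_ under 'neg_mean_squared_error'

best_mse = -best_score           # undo the sign flip
best_rmse = math.sqrt(best_mse)  # back to the units of the target
print(best_rmse)
```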

Using DataFrameMapper in a Pipeline

In [None]:
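# Import necessary modules
import xgboost as xgb
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer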

# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
                                            [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
                                            input_df=True,
                                            df_out=True
                                           )

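# Apply categorical imputer (most_frequent works on string columns; the default mean strategy does not)
categorical_imputation_mapper = DataFrameMapper(
                                                [([category_feature], SimpleImputer(strategy="most_frequent")) for category_feature in categorical_columns],
                                                input_df=True,
                                                df_out=True
                                               )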

# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])
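# Dictifier is a custom transformer, not part of scikit-learn; it has to be defined before the pipeline is built
pipeline = Pipeline([
                     ("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier(max_depth=3))
                    ])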
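
The Dictifier step used in the pipeline is not a scikit-learn class, and the notes do not show its definition. One plausible minimal version, written here as an assumption, is a transformer that turns each row into a dictionary so that DictVectorizer can consume it:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dictifier(BaseEstimator, TransformerMixin):
    """Convert a DataFrame (or 2-D array) into a list of row dictionaries.
    Hypothetical helper: the original definition is not part of these notes."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.to_dict("records")

rows = Dictifier().fit_transform(pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}))
print(rows)
```

Each row becomes a {column: value} dictionary, which is the input format DictVectorizer expects.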


# Word Vectorization and Weighting

- Stemming strips the last few characters from a word, which often produces incorrect meanings and spellings. (Caring to Car)
- Lemmatization considers the context and converts the word to its meaningful base form, called the lemma. (Caring to Care)

- Stopwords are very common words (such as 'it' and 'the') that carry little meaning and are usually removed before vectorizing.
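
The three ideas can be illustrated with a toy example. The suffix stripper, the lemma lookup table and the stopword list below are all made up for illustration; real stemmers and lemmatizers (for example those in NLTK or spaCy) use far more elaborate rules and dictionaries.

```python
STOPWORDS = {"the", "it", "is", "a", "for"}  # tiny illustrative list
LEMMAS = {"caring": "care", "cars": "car"}   # hypothetical lookup table

def crude_stem(word):
    """Strip a common suffix without checking that the result is a real word."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)

tokens = "caring for the cars is a chore".split()
content = [t for t in tokens if t not in STOPWORDS]

stems = [crude_stem(t) for t in content]
lemmas = [lemmatize(t) for t in content]
print(stems)
print(lemmas)
```

The crude stemmer maps both "caring" and "cars" to "car", while the lookup keeps them apart as "care" and "car".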

## TF-IDF or Term Frequency - Inverse Document Frequency

High-frequency words might be important, but words like 'it' and 'the' need to be weighted down.
***idf*** is used to weight down commonly occurring words.

- weight = tf x idf
- tf = frequency of term **t** in a document **d** = log_10 (count (t,d) + 1)
- df_t = document frequency = number of documents **t** occurs in.
- idf = log_10 (**N**/**df_t**) where **N** is the number of documents in the collection. 
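
The formulas above can be applied by hand to a tiny collection. The sketch below uses three made-up documents and implements the definitions exactly as written here. Note that TfidfVectorizer in scikit-learn uses a different variant by default (natural logarithm, smoothed idf and L2-normalized rows), so its numbers will not match these.

```python
import math

# Hypothetical three-document collection
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    return math.log10(doc_tokens.count(term) + 1)

def idf(term):
    df_t = sum(1 for tokens in tokenized if term in tokens)  # document frequency
    return math.log10(N / df_t)

def weight(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

w_the = weight("the", tokenized[0])  # occurs in every document, so idf = 0
w_mat = weight("mat", tokenized[0])  # occurs in one document only
print(w_the, w_mat)
```

'the' is the most frequent word in the first document, yet its weight is 0 because it appears in all three documents; the rarer 'mat' gets a positive weight.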

In [None]:
# Word Vectorization and Weighting

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tfid_vec = TfidfVectorizer()
text_vec = tfid_vec.fit_transform(documents)

X_train, X_test, y_train, y_test = train_test_split(text_vec.toarray(), y, stratify=y, random_state=42) 
# toarray() turns the sparse matrix into a dense array; this is needed only for estimators that do not accept sparse input


################# Finding Weights of words ################
tfid_vec.vocabulary_ # dictionary mapping each word to its column index (an attribute of the vectorizer, not of the matrix)

vocab = {v:k for k,v in tfid_vec.vocabulary_.items()} # reversed dictionary: column index -> word

text_vec[3].data # weights of the 4th row of the vector
text_vec[3].indices # indices of the words whose weights are listed

zipped_row = dict(zip(text_vec[3].indices,text_vec[3].data))

#EXAMPLE: return the vocabulary indices of the top_n highest-weighted words in one row
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the top weighted words (as vocabulary indices)
print(return_weights(vocab, tfid_vec.vocabulary_, text_vec, 8, 3)) #top 3 words in row 9