#Data Science Reference Guide

##The DRY Principle: Don't Repeat Yourself

DRY stands for "Don't Repeat Yourself," a basic principle of software development aimed at reducing repetition of information. The DRY principle is stated as, "Every piece of knowledge or logic must have a single, unambiguous representation within a system."
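As a minimal sketch of DRY in a notebook setting, the cleaning rule below lives in one function instead of being copied per column. The helper name `clean_text_column` and the toy data are assumptions made up for illustration.

```python
import pandas as pd

def clean_text_column(series):
    """Strip whitespace and lowercase a text column (hypothetical helper)."""
    return series.str.strip().str.lower()

# Toy data, invented for this example
df = pd.DataFrame({"city": [" Austin", "DALLAS "], "state": ["TX ", " tx"]})

# One definition of the cleaning rule, applied to every column that needs it
for col in ["city", "state"]:
    df[col] = clean_text_column(df[col])

print(df)
```

If the rule changes later (say, title case instead of lowercase), only the function body needs editing.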

###Loading Libraries

Load libraries first so the reader knows upfront what kind of work will be done in your notebook.

In [None]:
# Library imports to run library functions
import numpy as np 
import pandas as pd

##Importing Data & Cleaning It

In [None]:
df = pd.read_csv(url_of_data, header=None, names=column_headers)

The pd.read_csv documentation is worth reading. Two highlights: header=None tells pandas not to use the top row as column headers, and names= lets you supply the column headers on import. Useful.
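A small sketch of those two arguments, using an in-memory CSV so it runs without a URL. The data and the column names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header line
raw = io.StringIO("5.1,3.5,setosa\n6.2,2.9,versicolor\n")
column_headers = ["sepal_length", "sepal_width", "species"]

# header=None: the first line is data, not column names
# names=...:   supply the column names ourselves
df = pd.read_csv(raw, header=None, names=column_headers)
print(df)
```

Without header=None, pandas would treat the first data row as the header and silently lose it as a record.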

To read the docs, you can hover over the command, look it up on Google, or use this command:

In [None]:
?pd.read_csv

###Scoping Details of Data

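df.describe() summarizes numeric columns by default. Pass exclude='number' to summarize only the non-numeric columns, or include='all' to summarize every column.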

In [None]:
df.shape
df.head()
df.dtypes
df.info()
df.describe()
####Counting & sorting within a column
df['column'].value_counts() #count how often each unique value appears, most frequent first
df['column'].value_counts().nlargest(15) #keep only the 15 most frequent values
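To see what the include= and exclude= arguments of df.describe change, here is a short sketch on a toy frame (the columns are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [7.5, 6.0, 8.25],
    "kind": ["asada", "carnitas", "asada"],
})

numeric_summary = df.describe()                # numeric columns only (the default)
text_summary = df.describe(exclude="number")   # non-numeric columns only
full_summary = df.describe(include="all")      # every column; stats that do not apply are NaN

print(full_summary)
```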

###Nulls, or why isnull versus isna (they do the same thing!)

But why have two methods with different names that do the same thing?
This is because pandas' DataFrames are based on R's data frames. In R, NA and NULL are two separate things.

However, in Python, pandas is built on top of numpy, which has neither NA nor NULL values. Instead, numpy has NaN values (NaN stands for "Not a Number"). Consequently, pandas also uses NaN values.

In short:
To detect NaN values, numpy uses np.isnan().

To detect NaN values, pandas uses either .isna() or .isnull().
The NaN values are inherited from the fact that pandas is built on top of numpy, while the two methods' names originate from R's data frames, whose structure and functionality pandas tried to mimic.
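A quick check of the claim that the two pandas methods agree, on a toy frame invented for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# Both methods return the same boolean mask
same_mask = df.isna().equals(df.isnull())
print(same_mask)

# np.isnan works on float arrays; the pandas methods also handle text columns
float_mask = np.isnan(df["a"].to_numpy())
missing_per_column = df.isna().sum()
print(missing_per_column)
```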

In [None]:
####Nulls
df.isnull().sum() #counts the missing values in each column
df.isna().sum() #same result; isnull is an alias for isna
####Resolving nulls
df.fillna(0) #replace missing values with a fill value (0 is only a placeholder choice)
df.dropna() #drop every row that contains a missing value
df = df.drop(columns=['Column1', 'Column2', 'etc']) #drop whole columns by name
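The calls above take arguments that control what gets filled or dropped. A sketch of a few common variations, on made-up data; which strategy is appropriate depends on the data set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rating": [4.0, np.nan, 3.5, np.nan],
    "notes": ["good", None, "ok", "fine"],
})

# Fill one column with its median, leave the others alone
filled = df.fillna({"rating": df["rating"].median()})

# Drop rows with any missing value
rows_dropped = df.dropna()

# Drop rows only when a specific column is missing
partial = df.dropna(subset=["notes"])

# Drop columns that have fewer than 3 non-missing values
sparse_removed = df.dropna(axis="columns", thresh=3)
```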

###A quick way to find unique values

In [None]:
y = df['column']
y.nunique() #this finds number of unique values
y.unique() #this generates the contents of the array of unique values

###Rename the index of a dataframe or determine datatype of index

In [None]:
df.index.name = 'foo'
type(df.index)

###Replace placeholder characters with numpy NaN

In [None]:
df['column'] = df['column'].replace("?", np.nan)
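A short sketch of the replacement on a toy column. The column name `horsepower` is an assumption for illustration. Note that the column stays text after the replacement, so a numeric conversion is often the next step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": ["130", "?", "165"]})

# Turn the "?" placeholder into a real missing value
df["horsepower"] = df["horsepower"].replace("?", np.nan)
missing = df["horsepower"].isna().sum()

# Convert the remaining strings to numbers
df["horsepower"] = pd.to_numeric(df["horsepower"])
print(missing, df["horsepower"].dtype)
```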

###From the great burrito data set in the sky: regroup many labels into a few favored buckets

In [None]:
# Let's combine burrito categories to make this more useful
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

###Datetime tools

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
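Once a column is a datetime, the `.dt` accessor exposes its parts. A minimal sketch with invented dates:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2016-01-18", "2017-05-02", "2018-11-30"]})
df["Date"] = pd.to_datetime(df["Date"])

# Pull out pieces of the date as new columns
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["weekday"] = df["Date"].dt.day_name()

print(df)
```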

### Drop rows based on index

In [None]:
rows_to_drop = np.array([0, 3]) #index labels of the rows to remove
df_frame = pd.DataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']])
# drop accepts a list or array of index labels, so no loop is needed
df_frame = df_frame.drop(index=rows_to_drop)
print(df_frame.head())

###Setting maximum number of display rows when looking at head, list, etc.

In [None]:
pd.set_option('display.max_rows', 1567)

###List Comprehension


In [None]:
[col for col in df if col.endswith('_d')]
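The comprehension works because iterating over a DataFrame yields its column names. A sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame(columns=["price_d", "rating", "distance_d", "name"])

# Keep the names of columns that end with the suffix
suffix_cols = [col for col in df if col.endswith("_d")]
print(suffix_cols)

# The resulting list can be used to select those columns
subset = df[suffix_cols]
```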

###Covariance###


In [None]:
# Python code to demonstrate the use of numpy.cov 
x = np.array([[0, 1], [1, 0]]) 
# Note this is a numpy array with a shape of 2x2. np.cov treats each row as a
# variable and each column as an observation; the input need not be square.
print("Shape of array:\n", np.shape(x)) 
print("Covariance matrix of x:\n", np.cov(x)) 
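To make the row/column convention of np.cov concrete, here is a sketch with the same 2x2 array plus a non-square example (the second array is invented for illustration):

```python
import numpy as np

x = np.array([[0, 1], [1, 0]])
cov_rows = np.cov(x)                 # each row is a variable (the default)
cov_cols = np.cov(x, rowvar=False)   # each column is a variable

# Non-square input: 2 variables with 4 observations each
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])
cov_data = np.cov(data)              # result is 2x2, one row/column per variable

print(cov_rows)
print(cov_data)
```

The diagonal holds each variable's sample variance (dividing by n - 1), and the off-diagonal entries hold the covariances between pairs of variables.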

### Dot Product ###


In [None]:
# input two matrices 
matrix1 = ([1, 6, 5],[3 ,4, 8],[2, 12, 3]) 
matrix2 = ([3, 4, 6],[5, 6, 7],[6, 56, 7]) 
# The inputs are tuples of lists, which np.dot converts to arrays. The matrices
# do not need to be square, but the inner dimensions must match: an (m x n)
# matrix times an (n x p) matrix gives an (m x p) result.
# For 2-D inputs, np.dot returns the matrix product
result = np.dot(matrix1,matrix2) 
# print resulted matrix 
print(result)
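A sketch of the inner-dimension rule with non-square matrices, plus the 1-D case where np.dot gives the ordinary dot product. The arrays are made up for the example.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
b = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # shape (3, 2)

product = np.dot(a, b)           # inner dimensions 3 and 3 match, result is (2, 2)

# For 1-D vectors, np.dot is the sum of elementwise products
vector_dot = np.dot([1, 2, 3], [4, 5, 6])

# Mismatched inner dimensions raise a ValueError
mismatch_error = None
try:
    np.dot(a, a)                 # (2, 3) times (2, 3): 3 does not match 2
except ValueError as err:
    mismatch_error = str(err)

print(product, vector_dot)
```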

### Inverse of a Matrix

In [None]:
x = np.array([[1,3],[2,1]])
print(x)
y = np.linalg.inv(x)
print(y)
print(np.matmul(x,y))

[[1 3]
 [2 1]]
[[-0.2  0.6]
 [ 0.4 -0.2]]
[[ 1.00000000e+00 -5.55111512e-17]
 [ 0.00000000e+00  1.00000000e+00]]
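The cell above inverts a matrix and checks that the matrix times its inverse is the identity, up to floating-point error. For the magnitude of a vector itself, a minimal sketch, assuming the Euclidean (L2) norm is what is wanted:

```python
import numpy as np

v = np.array([3.0, 4.0])

magnitude = np.linalg.norm(v)          # Euclidean norm
by_hand = np.sqrt(np.sum(v ** 2))      # same thing, written out
unit = v / magnitude                   # unit vector in the same direction

print(magnitude, unit)
```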


In [None]:
type(df['cats_allowed'])

#Confusion Matrix Code

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
def my_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    table = pd.DataFrame(confusion_matrix(y_true, y_pred), 
                         columns=columns, index=index)
    return sns.heatmap(table, annot=True, fmt='d', cmap='viridis')

#Tracking Nulls

Standard count of missing values per column:

In [None]:
df.isnull().sum()

In [None]:
# Zeros sometimes stand in for missing values, so count the zeros per column too
(df == 0).sum().sort_values(ascending=False)

###Split data for machine learning

In [None]:
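from sklearn.model_selection import train_test_split
# Hold out 20% of train for validation, keeping the class proportions the same
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)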

In [None]:
#split on pandas datetime
train = df[df['Date'].dt.year <= 2016]
val = df[df['Date'].dt.year == 2017]
test = df[df['Date'].dt.year >= 2018]
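The same year-based split on a toy frame, as a sketch for checking that the three pieces cover every row and do not overlap in time. The dates and the `Great` column values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-06-01", "2016-12-31", "2017-03-15",
                            "2018-01-01", "2019-07-04"]),
    "Great": [True, False, True, False, True],
})

train = df[df["Date"].dt.year <= 2016]
val = df[df["Date"].dt.year == 2017]
test = df[df["Date"].dt.year >= 2018]

# Every row should land in exactly one split
assert len(train) + len(val) + len(test) == len(df)
print(len(train), len(val), len(test))
```

Splitting on time this way keeps the model from being validated on data older than what it was trained on.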

###List columns

In [None]:
list(df.columns.values)

###Determine Baseline of Classification Problem

In [None]:
# We have 2 classes, as expected, so this is binary classification
# How do we check if the classes are imbalanced?
y.value_counts(normalize=True)
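The largest share from value_counts(normalize=True) is also the accuracy of a model that always predicts the majority class. A sketch with a made-up target:

```python
import pandas as pd

# Hypothetical binary target: 6 False, 2 True
y = pd.Series([True, False, False, False, True, False, False, False])

class_share = y.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this baseline on the validation set.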

###Classifier functions

### Try a shallow decision tree as a fast, first model

Make fast first models to clarify where you are with things.

For classification problems:

As a rough rule of thumb, if your majority class frequency is >= 50% and < 70% then you can just use accuracy if you want. Outside that range, accuracy could be misleading — so what evaluation metric will you choose, in addition to or instead of accuracy? For example:

Precision?
Recall?
ROC AUC?
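As a reminder of what precision and recall measure, here is a sketch that computes them by hand from true and predicted labels. The labels are invented; in practice sklearn.metrics provides precision_score and recall_score.

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
accuracy = np.mean(y_true == y_pred)

print(precision, recall, accuracy)
```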

###ROC AUC
Let's also review ROC AUC (Receiver Operating Characteristic, Area Under the Curve).

Wikipedia explains, "A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings."

ROC AUC is the area under the ROC curve. It can be interpreted as "the expectation that a uniformly drawn random positive is ranked before a uniformly drawn random negative."

ROC AUC measures how well a classifier ranks predicted probabilities. So, when you get your classifier’s ROC AUC score, you need to use predicted probabilities, not discrete predictions.

ROC AUC ranges from 0 to 1. Higher is better. A naive majority class baseline will have an ROC AUC score of 0.5, regardless of class (im)balance.
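The ranking interpretation above can be checked directly by comparing every positive against every negative. This is a sketch for intuition, not how scikit-learn computes the score; the function name `pairwise_auc` and the sample data are assumptions.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half. Hypothetical helper for illustration.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive minus every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8]

auc_model = pairwise_auc(y_true, proba)
# A majority class baseline gives every row the same score, so every pair ties
auc_baseline = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])

print(auc_model, auc_baseline)
```

Three of the four positive/negative pairs are ranked correctly, giving 0.75, while the constant baseline lands at 0.5 regardless of class balance.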

In [None]:
from sklearn.metrics import roc_auc_score
# pipeline, X_val, and y_val come from the decision tree cell below; run that cell first
y_pred_proba = pipeline.predict_proba(X_val)[:, -1] # Probability for the last class
roc_auc_score(y_val, y_pred_proba)

In [None]:
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

target = 'Great'
features = train.columns.drop([target, 'Date'])
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    DecisionTreeClassifier(max_depth=3)
)

pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

###Quick decision tree graph

In [None]:
import graphviz
from sklearn.tree import export_graphviz

tree = pipeline.named_steps['decisiontreeclassifier']

dot_data = export_graphviz(
    tree, 
    out_file=None, 
    feature_names=list(X_train.columns), 
    class_names=[str(c) for c in tree.classes_],  # same order the tree uses for its classes
    filled=True, 
    impurity=False,
    proportion=True
)

graphviz.Source(dot_data)