# Linear Regression
- Assumes the data is clean and that the features and target are columns of a pandas DataFrame

##### View/prep the data
```python
import seaborn as sns

sns.pairplot(df) # view pairwise relationships between the features and target
sns.histplot(df['target'], kde=True) # view target distribution (distplot is deprecated in recent seaborn)
df.corr() # make a correlation table
sns.heatmap(df.corr(), annot=True) # make a correlation heatmap
```

##### Assign, split and fit
```python
X = df[['feature_0', 'feature_1']]
y = df['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101) # Tuple unpacking

from sklearn.linear_model import LinearRegression
lm = LinearRegression() # Instantiate the object
lm.fit(X_train, y_train) # fit the model
```

##### Coefficients, Predictions and Residuals
```python
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff']) # table of fitted coefficients, one row per feature
predictions = lm.predict(X_test) # predict on X_test with the model fitted on X_train/y_train
sns.jointplot(x=y_test, y=predictions) # plot predictions vs actual y_test values
sns.histplot(y_test - predictions, kde=True) # histogram of residuals
```

##### Evaluation Metrics
```python
import numpy as np
from sklearn import metrics
metrics.mean_absolute_error(y_test, predictions) # Mean Absolute Error
metrics.mean_squared_error(y_test, predictions) # Mean Squared Error
np.sqrt(metrics.mean_squared_error(y_test, predictions)) # Root Mean Squared Error
metrics.r2_score(y_test, predictions) # R^2. How much variance the model explains
```
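
As a sanity check, the same metrics can be reproduced with plain NumPy. This is a minimal sketch: `y_true` and `y_pred` are made-up values for illustration, not results from any dataset in these notes.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred          # residuals
mae = np.mean(np.abs(errors))     # Mean Absolute Error
mse = np.mean(errors ** 2)        # Mean Squared Error
rmse = np.sqrt(mse)               # Root Mean Squared Error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same units as the target, which usually makes them easier to interpret than MSE.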

# Logistic Regression
- A classification method that outputs a probability between 0 and 1
- Used for discrete, non-continuous targets, e.g. binary groups

### $\theta(z) = \frac{1}{1 + e^{-z}}$

Logistic curve with a linear model substituted for $z$, i.e. $z = b_0 + b_1x$:

### $p = \frac{1}{1 + e^{-(b_0 + b_1x)}}$
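
A minimal sketch of the logistic function in NumPy, using the standard form with a negative exponent. The intercept `b0`, slope `b1` and inputs `x` are made-up values chosen only to show how a linear score is turned into a probability and then a class label.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and slope, for illustration only
b0, b1 = -4.0, 2.0
x = np.array([0.0, 2.0, 4.0])

p = sigmoid(b0 + b1 * x)          # predicted probability of class 1
labels = (p >= 0.5).astype(int)   # threshold at 0.5 to get a class
print(p, labels)
```

The probability is exactly 0.5 where $b_0 + b_1x = 0$, which is the decision boundary for a 0.5 threshold.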

##### View/prep the data
```python
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis') # view all missing data
sns.countplot(x='Survived', data=train) # view target class balance

# Impute some data
def impute_age(cols):
    # Fill a missing age with a hard-coded value per passenger class
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age
        
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1) # Fill in age

sex = pd.get_dummies(train['Sex'], drop_first=True)
embark = pd.get_dummies(train['Embarked'], drop_first=True)
train = pd.concat([train, sex, embark], axis=1)

train.drop(['PassengerId','Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True) # Drop identifier and non-numeric columns
```

##### Assign, split and fit
```python
# assign
X = train.drop('Survived', axis=1)
y = train['Survived']

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# fit
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
```

##### Predictions and Evaluations
```python
# Predictions
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```

# K Nearest Neighbours

##### Scale data
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit scaler object to feature columns only
scaler.fit(c_data.drop('TARGET CLASS', axis=1)) 

# Use scaler object to do a transformation
scaled_features = scaler.transform(c_data.drop('TARGET CLASS', axis=1))

# Convert transformation array to a pandas dataframe, keeping the feature column names
df_feat = pd.DataFrame(scaled_features, columns=c_data.drop('TARGET CLASS', axis=1).columns)
```
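
Scaling matters for KNN because distances are dominated by whichever feature has the largest range. The sketch below, on a made-up `features` array, shows that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 3 rows, 2 features on very different scales
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0]])

scaled = StandardScaler().fit_transform(features)

# Same result by hand: subtract the column mean, divide by the column std
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
```
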
##### Assign, split, fit - First run through
```python
# Assign
X = df_feat
y = c_data['TARGET CLASS']

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit one model per k and record the test error rate
from sklearn.neighbors import KNeighborsClassifier
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test)) # fraction of misclassified test rows
```

##### Plot
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(1, 40),
        error_rate,
        color='blue',
        linestyle='dashed',
        marker='o',
        markerfacecolor='red',
        markersize=10)
ax.set_xlabel('K')
ax.set_ylabel('Error rate')
```

##### Refit with best K
```python
# Refit using the k with the lowest error rate in the plot
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train, y_train)
```
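
Instead of reading the best k off the plot, it can be picked programmatically. A minimal sketch, where the short `error_rate` list is a made-up stand-in for the one built in the loop over k:

```python
import numpy as np

# Hypothetical error rates for k = 1..5, standing in for the `error_rate` list
error_rate = [0.10, 0.08, 0.06, 0.07, 0.06]
k_values = range(1, len(error_rate) + 1)

# np.argmin returns the first index of the minimum, so ties go to the smaller k
best_k = k_values[int(np.argmin(error_rate))]
print(best_k)
```

Note that choosing k from test-set error makes the final test score optimistic; a separate validation split or cross-validation would be a cleaner way to tune k.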

##### Predictions and Evaluations
```python
from sklearn.metrics import classification_report, confusion_matrix
predictions = knn.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

```

# Decision Trees

https://www.youtube.com/watch?v=LDRbO9a6XPU&t=382s

1. Quantify how much a question serves to unmix the labels
2. Gini impurity = Amount of uncertainty at a node
3. We can quantify how much a question reduces Gini uncertainty by measuring information gain
4. We can use information gain to dictate which question to ask at each node
5. Thus we need to know 
    - which questions to ask
    - when to ask them

Each node takes a list of rows and iterates over every value of every feature. Each feature-value can be used as a threshold to partition the data in the form of a question.

The best question is the one that reduces our uncertainty the most. Gini impurity quantifies the uncertainty at a node; it ranges from 0 (pure node) up to $1 - 1/c$ for $c$ evenly mixed classes. Information gain quantifies how much a question reduces that uncertainty.

Gini impurity: the chance of being INCORRECT if you label a randomly chosen example from the set with a label drawn at random according to the class proportions in that same set.

### $\mathit{Gini}(E) = 1 - \sum_{j=1}^{c}p_j^2$

### $ = \sum_{j=1}^{c}p_j(1 - p_j)$
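
A minimal sketch of the formula $1 - \sum_j p_j^2$ in plain Python. The function name `gini` and the toy label lists are illustrative, not taken from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_j^2)."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

print(gini(['a', 'a', 'a', 'a']))   # pure node
print(gini(['a', 'a', 'b', 'b']))   # evenly mixed two-class node
```

A pure node has impurity 0, and an evenly split two-class node has impurity 0.5, the maximum for two classes.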

Which questions to ask?
1. Find the Gini impurity of current node
2. Look at dataset in that node, choose a question (we will iterate over all questions)
3. Partition child nodes for that question. Calculate the weighted average of the uncertainty in those child nodes.
4. Subtract this uncertainty from our starting uncertainty to yield the information gain.
5. Iterate to the next question in the list and repeat the same steps.
6. We choose to apply the question that yields the largest information gain.
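
The steps above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how scikit-learn implements it: features are numeric, each question has the form "is feature >= value?", and the helper names `gini`, `info_gain` and `best_split` are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_j^2)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return gini(parent) - (w_left * gini(left) + w_right * gini(right))

def best_split(rows, labels):
    """Try every feature/value pair as a threshold; return the best question."""
    best = (0.0, None)  # (gain, (feature_index, threshold))
    for col in range(len(rows[0])):
        for value in {row[col] for row in rows}:
            left = [y for row, y in zip(rows, labels) if row[col] >= value]
            right = [y for row, y in zip(rows, labels) if row[col] < value]
            if not left or not right:
                continue  # this question does not divide the data
            gain = info_gain(labels, left, right)
            if gain > best[0]:
                best = (gain, (col, value))
    return best

# Hypothetical toy data: one numeric feature, two classes
rows = [[1], [2], [3], [4]]
labels = ['no', 'no', 'yes', 'yes']
print(best_split(rows, labels))
```

On this toy data the question "feature 0 >= 3?" separates the classes perfectly, so it removes all of the parent's impurity.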

##### Assign, split, fit
```python
# Assign
X = data.drop('Kyphosis', axis=1)
y = data['Kyphosis']

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
```

##### Predictions and Evaluations
```python
# Predictions
predictions = dtree.predict(X_test)

# Evaluations
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```

##### Visualisation
```python
from io import StringIO # sklearn.externals.six no longer ships with recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydot

features = list(X.columns) # feature names in the order the tree was trained on

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=features, filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())

```

# Random Forests

##### Assign, split, fit
```python
# Assign
X = data.drop('Kyphosis', axis=1)
y = data['Kyphosis']

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
```

##### Predictions and Evaluations
```python
rfc_predictions = rfc.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, rfc_predictions))
print(classification_report(y_test, rfc_predictions))

```

# Support Vector Machines

```python

```

# K means clustering

```python

```

# Principal Component Analysis

```python

```

# Recommender Systems

```python

```

# Natural Language Processing

```python

```

# Big Data & Spark

```python

```

# Neural Nets & Deep Learning

```python

```

# Naive Bayes

```python

```

# Hippocampus Bayes

```python

```