# Day 5

- Today we will review three more topics:
    1. Text analysis
    2. Machine learning
    3. Webscraping


## Lecture 5.1: Machine Learning

- This lecture has been modified from a tutorial at [pythonforengineers.com](https://www.pythonforengineers.com/machine-learning-for-complete-beginners/).
- **If you have any questions over the course of this lecture, please post them to the 'Day 5 Lecture Questions' assignment on the Canvas course page.**

- Steps:
    1. Gather data
        - Web scraping
        - Compiling text data
        - Conduct survey
    2. Clean data
    3. **Prepare data for machine learning**
    4. **Run ML algorithms**
    5. **Test model**
   

## Load the data

- For this lecture we have some statistics about the survivors and fatalities of the Titanic shipwreck.
- We will load the usual packages.

In [10]:
import numpy as np
import pandas as pd

data = pd.read_csv("titanic_lecture.csv")
data.head()

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,999,3rd,1,"McCarthy, Miss Katie",,,,,,,female
1,180,1st,0,"Millet, Mr Francis Davis",65.0,Southampton,"East Bridgewater, MA",,,-249.0,male
2,557,2nd,0,"Sjostedt, Mr Ernst Adolf",59.0,Southampton,"Sault St Marie, ON",,,,male
3,175,1st,0,"McCaffry, Mr Thomas Francis",46.0,Cherbourg,"Vancouver, BC",,,-292.0,male
4,1233,3rd,0,"Strilic, Mr Ivan",,,,,,,male


## Edit the data.
- Now we can look at some of our data and change it so that it is suitable for analysis.

In [11]:
data.columns

Index(['row.names', 'pclass', 'survived', 'name', 'age', 'embarked',
       'home.dest', 'room', 'ticket', 'boat', 'sex'],
      dtype='object')

### We are going to grab the median age and fill in the missing data with this value

In [12]:
median_age = data['age'].median()
print("Median age is {}".format(median_age))

Median age is 29.0


In [13]:
# Assign the result back instead of filling in place on a column selection
data['age'] = data['age'].fillna(median_age)
data['age'].head()


0    29.0
1    65.0
2    59.0
3    46.0
4    29.0
Name: age, dtype: float64
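
A minimal sketch of why the median is a reasonable fill value, using a small made-up `ages` Series (not the Titanic data). One extreme age pulls the mean upward, while the median stays near the typical passenger.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values and one extreme value
ages = pd.Series([22.0, np.nan, 30.0, 28.0, np.nan, 80.0])

median_age_demo = ages.median()   # missing values are skipped by default
mean_age_demo = ages.mean()

# Fill the gaps with the median and keep the result in a new Series
filled = ages.fillna(median_age_demo)

print("median:", median_age_demo, "mean:", mean_age_demo)
print(filled.tolist())
```

Here the median is 29.0 and the mean is 40.0, so filling with the mean would assign every unknown passenger an age older than most of the observed ones.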

### Get rid of string values.
- We will replace text representation of the ticket class each passenger had with numeric values.
- We will replace `female` and `male` with numeric indicators where 0 == Female and 1 == Male.

In [14]:

# Map each text label to its integer class in one pass.
# Note: any label not listed here would become NaN.
data["pclass"] = data["pclass"].map({"3rd": 3, "2nd": 2, "1st": 1})

In [15]:
data["sex"] = np.where(data["sex"] == "female", 0, 1)
data.head()

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,999,3,1,"McCarthy, Miss Katie",29.0,,,,,,0
1,180,1,0,"Millet, Mr Francis Davis",65.0,Southampton,"East Bridgewater, MA",,,-249.0,1
2,557,2,0,"Sjostedt, Mr Ernst Adolf",59.0,Southampton,"Sault St Marie, ON",,,,1
3,175,1,0,"McCaffry, Mr Thomas Francis",46.0,Cherbourg,"Vancouver, BC",,,-292.0,1
4,1233,3,0,"Strilic, Mr Ivan",29.0,,,,,,1


### Splitting the data

- For most calculations we are going to need X and Y variables in separate datasets.

In [16]:
Y = data[["survived"]].copy()
Y.head()

Unnamed: 0,survived
0,1
1,0
2,0
3,0
4,0


In [17]:
X = data[["pclass", "age", "sex"]].copy()
X.head()

Unnamed: 0,pclass,age,sex
0,3,29.0,0
1,1,65.0,1
2,2,59.0,1
3,1,46.0,1
4,3,29.0,1


### Selecting variables
- An important part of machine learning is [selecting your variables](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/).
    - Here we selected the class the passenger was in, age and sex.
    - Variables such as an individual's name are essentially arbitrary and unlikely to have any predictive power.
- In machine learning our variables are 'features' of the data. By selecting only the relevant features we reduce the noise which comes from irrelevant data.


### Methods of feature selection
- You can use theoretical explanations to select features.
- You can use a process of elimination: run the model with all the features, then exclude them one at a time and observe the changes in performance.
    - Alternatively, you can start with one variable, add a new variable on each rerun, and observe the changes in performance.
- You can filter using various techniques:
    - Use a chi-square test to check whether a feature and the outcome are independent.
    - You can remove all features that have zero variance, since a constant feature cannot help predict the outcome.
- You can explore the best features while running your models:
    - Lasso 
    - Ridge regression
    - Elastic net
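
The two filter techniques above can be sketched with scikit-learn's `VarianceThreshold` and `chi2`. The toy arrays `X_toy` and `y_toy` are made up for illustration; the first feature copies the outcome, the second is only weakly related, and the third never changes.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, chi2

# Hypothetical toy data: 3 non-negative features, binary outcome
X_toy = np.array([
    [1, 0, 5],
    [1, 1, 5],
    [0, 0, 5],
    [0, 1, 5],
    [1, 0, 5],
    [0, 1, 5],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Filter 1: drop features that never vary (the third column is constant)
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_toy)
kept = selector.get_support()

# Filter 2: chi-square statistic of each remaining feature against the outcome
chi2_stats, p_values = chi2(X_reduced, y_toy)

print("features kept:", kept)
print("chi-square statistics:", chi2_stats)
print("p-values:", p_values)
```

A larger chi-square statistic (smaller p-value) suggests the feature is less likely to be independent of the outcome. With only six rows this is purely illustrative, not a meaningful test.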

### Training and test data.

- One of the most important parts of machine learning is ascertaining how well a certain model is performing.
    - To do this we split our dataset so that roughly 70% is used for training.
        - We take the training data and we run a model on this data. The model is then saved.
        - Our models learn about our data by establishing coefficients.
    - The other 30% is our testing data.
        - We take the model which we have created and we can see how well it performs on 'out of sample' predictions.
        - The model is tested for accuracy by seeing how it predicts data that it was not trained on.


- This whole structure is meant to prevent overfitting the data.
    - For example, you may have a very large dataset with a lot of variables, so you create a model based on those factors.
        - However, once you apply the same model to a different dataset you may find it performs very poorly.
- With our training and test data we can gauge how well a specific model performs and compare it to other models.

In [18]:
import sklearn as sk
import sklearn.model_selection

X_train, X_test, Y_train, Y_test   = sk.model_selection.train_test_split(X, Y, test_size = 0.33, random_state = 42)

print(X_train.head())
print(Y_train.head())

     pclass   age  sex
93        3  29.0    1
268       3  29.0    1
335       3  29.0    1
340       3  22.0    0
598       2  34.0    1
     survived
93          0
268         0
335         0
340         1
598         0
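
To make the overfitting point concrete, here is a small sketch on made-up data where the labels are pure noise, so there is nothing real to learn. The variable names (`X_noise`, `y_noise`) and the choice of an unrestricted decision tree are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 300 rows of noise, labels unrelated to the features
X_noise = rng.normal(size=(300, 5))
y_noise = rng.integers(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.33, random_state=42
)

# An unrestricted tree can memorize every training row
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The tree scores perfectly on the rows it was trained on, but that tells us nothing about new data. Only the held-out test score reveals that the model has learned noise.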


## Modeling the data
- We will try several different types of models to see which one predicts best who survived on the Titanic.
- We don't have the time to do an in-depth exploration into the math behind each of the models I present.
    - I will provide a link for each model which gives you an explanation of the model at hand. 

### Logistic regression (statsmodels)
- Yesterday we used the `statsmodels` package to run OLS. However, because we are looking at a binary variable (1 == survived; 0 == died), we will use a logistic regression.
- We will look at another package later, but `statsmodels` has nice summary tables.

In [19]:
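import statsmodels as sm
import statsmodels.api  # importing the api loads the submodules used below

# endog is the outcome, exog holds the predictors (no intercept is added)
lm = sm.discrete.discrete_model.Logit(endog = Y_train, exog = X_train).fit()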
print(lm.summary())


Optimization terminated successfully.
         Current function value: 0.508727
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               survived   No. Observations:                  439
Model:                          Logit   Df Residuals:                      436
Method:                           MLE   Df Model:                            2
Date:                Sat, 13 Jun 2020   Pseudo R-squ.:                  0.2078
Time:                        15:05:21   Log-Likelihood:                -223.33
converged:                       True   LL-Null:                       -281.90
Covariance Type:            nonrobust   LLR p-value:                 3.654e-26
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
pclass        -0.3704      0.093     -3.975      0.000      -0.553      -0.188
age            0.0364      0.

### `statsmodels`
- `statsmodels` is not specifically built for making predictions, so there is no built-in way to see how well the model performed. However, it is quite easy to calculate.
- Even though we ran a Logit model, the `predict` method of the fitted model returns a probability of survival, so we round this value to get a 0/1 prediction.

192     True
265     True
101     True
625     True
523     True
       ...  
304     True
488     True
203     True
196     True
352    False
Length: 217, dtype: bool

In [20]:
predictions = lm.predict(X_test)
print('The prediction rate of success is:',sum(round(predictions) == Y_test['survived'])/len(predictions))

The prediction rate of success is: 0.7788018433179723


### What if we ran OLS?

In [None]:
import statsmodels.formula.api as smf

# Fit on the training rows only so the test rows stay out of sample
train = X_train.join(Y_train)
lm2 = smf.ols("survived ~ C(pclass) + age + sex", data = train).fit()
print(lm2.summary())
predictions2 = lm2.predict(X_test)
results = ['OLS Prediction rate of success: {}%'.format(sum(round(predictions2) == Y_test['survived'])/len(predictions2) * 100)]
results

### Logistic regression (sklearn)
- `sklearn` *is* a package created for prediction.
- The tradeoff here is there is no clean way to look at a summary of our regression.
    - We can look at the coefficients (`coef_`) and the accuracy (`score`) separately. For a classifier, `score` returns accuracy rather than R-squared.

In [24]:
## Different coefficients?

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train, Y_train.values.ravel())
logreg.coef_

array([[-1.1164483 , -0.03100704, -2.19867651]])

In [25]:
# Turn off regularization and do not fit the intercepts!
logreg2 = sk.linear_model.LogisticRegression(C=1e9,fit_intercept = False)
logreg2.fit(X_train, Y_train.values.ravel())
logreg2.coef_

array([[-0.37044427,  0.03639955, -1.90261196]])
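
The coefficients change between the two fits partly because scikit-learn's `LogisticRegression` applies an L2 penalty by default, and `C` is the inverse of the penalty strength. The sketch below shows the shrinking effect on made-up data (`X_demo`, `y_demo` are hypothetical, not the Titanic data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: two features and a noisy binary outcome
X_demo = rng.normal(size=(200, 2))
signal = 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (signal + rng.normal(size=200) > 0).astype(int)

# Small C = strong penalty; very large C = almost no penalty
strong = LogisticRegression(C=0.01).fit(X_demo, y_demo)
weak = LogisticRegression(C=1e9).fit(X_demo, y_demo)

print("C=0.01:", strong.coef_)
print("C=1e9: ", weak.coef_)
```

The strongly penalized coefficients are pulled toward zero, which is why a default scikit-learn fit will generally not reproduce unpenalized `statsmodels` estimates.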

In [26]:
#manual
y_pred = logreg.predict(X_test)
sum(y_pred == Y_test['survived'])/len(y_pred)

0.8064516129032258

In [27]:
#built-in attributes
results = []

logreg.score(X_test,Y_test)
results.append('LR Prediction rate of success: {}%'.format(logreg.score(X_test,Y_test)*100))
results

['LR Prediction rate of success: 80.64516129032258%']
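
The cells above import `accuracy_score` without calling it. As a quick check on made-up labels (`y_true` and `y_hat` are hypothetical), the built-in function gives the same number as the manual calculation:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_hat = [1, 0, 1, 1, 0, 0, 0, 0]

# Manual: share of positions where prediction and truth agree
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Built-in equivalent
builtin = accuracy_score(y_true, y_hat)

print(manual, builtin)
```

Six of the eight predictions match, so both values are 0.75.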

### Random Forest
- A random forest is an ensemble of [decision trees](https://www.youtube.com/watch?v=DCZ3tsQIoGU&feature=youtu.be&t=389), which are a method of classifying data.
- A single [decision tree](https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/) is fit on all the data (all observations and variables/features).

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/rfc_vs_dt11.png" width="400">



- [RF algorithms](https://www.youtube.com/watch?v=D_2LkhMJcfY) are among the most popular tree-based methods.
- The difference between a random forest and a general decision tree is that a random forest randomly selects subsets of observations and variables and builds multiple decision trees.
    - Trees are built independently.
    - The outcomes of these trees are then averaged.
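
The idea can be sketched by hand with plain decision trees. This is a simplified illustration on made-up data, not how `RandomForestClassifier` is implemented: here each tree gets one random feature subset, whereas scikit-learn draws a fresh feature subset at every split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 4 features, outcome driven by the first two
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

n_trees, n_rows, n_cols = 25, X_demo.shape[0], X_demo.shape[1]
forest = []
for i in range(n_trees):
    rows = rng.integers(0, n_rows, size=n_rows)       # bootstrap sample of observations
    cols = rng.choice(n_cols, size=2, replace=False)  # random subset of features
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_demo[rows][:, cols], y_demo[rows])     # each tree is built independently
    forest.append((tree, cols))

# Average the trees' class-1 probabilities, then threshold at 0.5
avg_prob = np.mean(
    [tree.predict_proba(X_demo[:, cols])[:, 1] for tree, cols in forest], axis=0
)
forest_pred = (avg_prob >= 0.5).astype(int)
print("training accuracy of the hand-built forest:", (forest_pred == y_demo).mean())
```

Individual trees are noisy, and some only ever see uninformative features, but averaging many of them tends to smooth out their individual mistakes.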

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, Y_train.values.ravel())
results.append(('RF Prediction rate of success: {}%'.format(rf.score(X_test, Y_test) * 100)))
results

['LR Prediction rate of success: 80.64516129032258%',
 'RF Prediction rate of success: 77.41935483870968%']

### Gradient boosting
- Gradient boosting is another method that combines many decision trees.
- Rather than averaging out the results at the end, gradient boosting is iterative.
    - Each tree learns from the last and tries to make corrections.
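
A minimal sketch of the "learn from the last tree" idea, assuming a squared-error regression problem on made-up data rather than the classifier used in this lecture. Each new tree is fit to the residuals (the current errors) of the ensemble so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical 1-D regression problem
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant guess
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                    # the next tree targets those errors
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("training MSE at start:", mse_history[0])
print("training MSE at end:  ", mse_history[-1])
```

The training error falls with each added tree, which is also why boosting can overfit if it runs for too many rounds; the test set remains the real check.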

In [29]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
GBM = GradientBoostingClassifier()
GBM.fit(X_train, Y_train.values.ravel())
results.append('GBM Prediction rate of success: {}%'.format(GBM.score(X_test,Y_test)*100))
results

['LR Prediction rate of success: 80.64516129032258%',
 'RF Prediction rate of success: 77.41935483870968%',
 'GBM Prediction rate of success: 78.80184331797236%']

### SVM
- [Support vector machines](https://www.youtube.com/watch?time_continue=2&v=Y6RRHw9uN9o) work by drawing a boundary (a hyperplane) that separates the classes in our training data.
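
A small sketch of the boundary idea on six made-up 2-D points. It assumes a linear kernel so that the boundary is a straight line; note that `SVC()` with no arguments uses an RBF kernel instead, which gives a curved boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separated groups of points
X_demo = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

svm_demo = SVC(kernel="linear")
svm_demo.fit(X_demo, y_demo)

new_points = np.array([[0.5, 0.5], [4.5, 4.5]])
scores = svm_demo.decision_function(new_points)   # sign tells which side of the boundary
labels = svm_demo.predict(new_points)

print("support vectors:\n", svm_demo.support_vectors_)
print("scores:", scores)
print("predicted classes:", labels)
```

Only the points closest to the boundary (the support vectors) determine where it sits; the other training points could move without changing the model.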



In [30]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm =SVC()
svm.fit(X_train,Y_train.values.ravel())
results.append(('SVM Prediction rate of success: {}%'.format(svm.score(X_test, Y_test) * 100)))
results

['LR Prediction rate of success: 80.64516129032258%',
 'RF Prediction rate of success: 77.41935483870968%',
 'GBM Prediction rate of success: 78.80184331797236%',
 'SVM Prediction rate of success: 68.20276497695853%']

### Naive Bayes

- The [Naive Bayes Classifier](https://www.youtube.com/watch?v=CPqOCI0ahss) relies on Bayes' theorem.
    - $ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $
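
A worked instance of the formula, with A = "survived" and B = "passenger is female". All three input probabilities are made up purely to show the arithmetic; they are not estimated from the Titanic data.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
p_survived = 0.40                 # P(A): prior probability of survival
p_female_given_survived = 0.70    # P(B | A)
p_female_given_died = 0.15        # P(B | not A)

# P(B) by the law of total probability
p_female = (p_female_given_survived * p_survived
            + p_female_given_died * (1 - p_survived))

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_survived_given_female = p_female_given_survived * p_survived / p_female
print(round(p_survived_given_female, 3))
```

The classifier is called "naive" because it applies this rule while treating the features as independent of one another within each class.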


In [31]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb= GaussianNB()
nb.fit(X_train, Y_train.values.ravel())
results.append('NB Prediction rate of success: {}%'.format(nb.score(X_test,Y_test)*100))
results

['LR Prediction rate of success: 80.64516129032258%',
 'RF Prediction rate of success: 77.41935483870968%',
 'GBM Prediction rate of success: 78.80184331797236%',
 'SVM Prediction rate of success: 68.20276497695853%',
 'NB Prediction rate of success: 76.036866359447%']

### K-nearest neighbors

- [KNN](https://www.youtube.com/watch?v=MDniRwXizWo) is a non-parametric method that classifies each observation by looking at its 'k' closest examples, so observations that are close together are classified similarly.
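
A from-scratch sketch of the voting rule, using only the standard library and six made-up (age, pclass) points. The helper `knn_predict` is hypothetical and uses k = 3, whereas `KNeighborsClassifier()` defaults to 5 neighbors.

```python
import math
from collections import Counter

# Hypothetical training points: (age, pclass) -> survived
train_points = [(22, 1), (25, 1), (30, 1), (40, 3), (45, 3), (50, 3)]
train_labels = [1, 1, 1, 0, 0, 0]

def knn_predict(point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, other), label)
        for other, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances)[:k]                 # k smallest Euclidean distances
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((27, 1)))   # close to the first group
print(knn_predict((44, 3)))   # close to the second group
```

Because the distance is Euclidean, a feature with a wide range (age) dominates one with a narrow range (class), so rescaling the features is often worth trying before fitting KNN.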

In [32]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knear =  KNeighborsClassifier()
knear.fit(X_train, Y_train.values.ravel())
results.append('KNN Prediction rate of success: {}%'.format(knear.score(X_test,Y_test)*100))
results

['LR Prediction rate of success: 80.64516129032258%',
 'RF Prediction rate of success: 77.41935483870968%',
 'GBM Prediction rate of success: 78.80184331797236%',
 'SVM Prediction rate of success: 68.20276497695853%',
 'NB Prediction rate of success: 76.036866359447%',
 'KNN Prediction rate of success: 70.96774193548387%']