
Import packages needed for the exercise

In [None]:
import pandas as pd  #package for managing data
import numpy as np   #package for array-processing
from sklearn import tree, preprocessing  #package for machine learning

Read in two datasets from Kaggle: one for testing data, one for training

In [None]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# pd.read_csv() means “please invoke the read_csv() function, which lives in
# the pd (pandas) library.” read_csv() is a function of the pandas module, not a
# DataFrame method: it reads the file and returns a new DataFrame object.
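As a small offline illustration of what `pd.read_csv()` gives back, the sketch below reads a made-up three-row CSV from an in-memory string. `io.StringIO` stands in for the URL and the rows are invented for the example.

```python
import io
import pandas as pd

# Hypothetical mini CSV; the real exercise reads the file from a URL instead.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,\n"

# read_csv is a function of the pandas module and returns a DataFrame
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy).__name__)        # the class of the returned object
print(toy.shape)                 # (rows, columns)
print(toy["Age"].isna().sum())   # empty fields are read in as missing values (NaN)
```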

Call head() on the train dataset to get its first five rows, then print() them

In [None]:
print(train.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Do the same for the test dataset

In [None]:
print(test.head())

   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  


In [None]:
# Question 1: Which variable is missing from the test set?

# Answer: Survived

describe() gives statistical information about each numerical variable

In [None]:
train.describe()

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200


In [None]:
# Question 2: Based on the table above, what is the average age of Titanic passengers?

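# Answer: About 29.7 years (the mean of Age in the training set, computed over the 714 passengers whose age is recorded)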
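In the table above, the count for Age (714) is lower than for the other columns (891) because describe() skips missing values. The sketch below shows the same behavior on a four-value toy series; the ages are made up.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value (invented numbers)
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

summary = ages.describe()
print(len(ages))          # total number of rows, including the missing one
print(summary["count"])   # number of non-missing values only
print(summary["mean"])    # mean over the non-missing values
```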

Focusing on the Pclass variable --> count how many passengers are in each passenger class (each unique value of Pclass)

In [None]:
train["Pclass"].value_counts()
#How many passenger classes are there?
#Which class has the most passengers?

3    491
1    216
2    184
Name: Pclass, dtype: int64

Do the same for Survived: count how many passengers survived vs. how many didn't

In [None]:
train["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [None]:
# Question 3: What does a value of 0 mean in the example above?

# Answer: A value of 0 means the passenger did not survive; 549 passengers in the training set have that value

Give the proportion of passengers who died and survived (fractions that add up to 1; multiply by 100 for percentages)

In [None]:
print(train["Survived"].value_counts(normalize = True))  #normalize means values will add up to 1

0    0.616162
1    0.383838
Name: Survived, dtype: float64


In [None]:
# Question 4: What % of passengers survived?

# Answer: About 38% of passengers in the training set survived
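To see where that percentage comes from, the sketch below rebuilds a Survived column from the counts printed earlier (549 zeros and 342 ones) and converts the normalized count into a percentage. The variable names are illustrative only.

```python
import pandas as pd

# Rebuild the column from the counts shown above: 549 did not survive, 342 survived
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

proportions = survived.value_counts(normalize=True)  # fractions that sum to 1
percent_survived = proportions.loc[1] * 100          # fraction of 1's, as a percentage
print(round(percent_survived, 1))
```

Because Survived is coded as 0/1, `survived.mean() * 100` gives the same number.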

Filter the 'Survived' column by the 'Sex' column to count how many men survived/died

In [None]:
print(train["Survived"][train["Sex"] == 'male'].value_counts())

0    468
1    109
Name: Survived, dtype: int64


Same for women

In [None]:
print(train["Survived"][train["Sex"] == 'female'].value_counts())

1    233
0     81
Name: Survived, dtype: int64


In [None]:
# Question 5: Based on the stats above, would you rather be a man or a woman on the Titanic? Why?
# What would the authors of Data Feminism say about the way of storing data in the "Sex" column in this dataset?

#Answer: I would rather be a woman on the Titanic because women had a much higher survival rate: 233 of 314 women survived (about 74%), compared with 109 of 577 men (about 19%).
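The two filtered counts can also be turned into survival rates in one step. A minimal sketch, rebuilding the data from the counts printed above (men: 468 died, 109 survived; women: 81 died, 233 survived) and using `groupby`, which the workbook itself does not use:

```python
import pandas as pd

# Rebuilt from the value_counts() outputs shown above
toy = pd.DataFrame({
    "Sex": ["male"] * 577 + ["female"] * 314,
    "Survived": [0] * 468 + [1] * 109 + [0] * 81 + [1] * 233,
})

# The mean of a 0/1 column is the share of 1's, i.e. the survival rate per group
rates = toy.groupby("Sex")["Survived"].mean()
print(rates.round(3))
```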

**Where Age column is missing a value, fill it with median value of "Age" variable**
(many ML algorithms cannot be trained on data that contains missing values)

In [None]:
train["Age"] = train["Age"].fillna(train["Age"].median())
#Does this operation mean we know the exact age of the passengers whose age is missing? Or are we simply filling in a reasonable value to replace the missing ones?
#This happens a lot when we work with algorithms - we approximate many values instead of knowing exactly what they are.
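Since the filled-in ages are approximations, it can be useful to remember which rows were filled. The sketch below does the same median fill on a toy column and keeps a flag for the rows that were originally missing; the ages and the `AgeWasMissing` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Record which ages were unknown before overwriting them
toy["AgeWasMissing"] = toy["Age"].isna()

# median() ignores NaN, so this is the median of 22, 26 and 35
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)
print(toy)
```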

Define the "target" array from the Survived column of the training data (the target is what we want the model to predict: survival)

In [None]:
# Create the target numpy array (the features array, features_one, is created further below)
target = train["Survived"].values #This command tells Python the variable we are trying to predict in our dataset is "Survived"

All predictors need to be defined in terms of numbers --> need to transform "Sex" column into a numeric variable (0 and 1) to use it in the model

In [None]:
train["Sex"].value_counts() #We can see "Sex" is coded as "female" and "male" (categorical variable) - need to assign numbers to these two categories

male      577
female    314
Name: Sex, dtype: int64

In [None]:
# Preprocess
encoded_sex = preprocessing.LabelEncoder()

# Convert into numbers
train.Sex = encoded_sex.fit_transform(train.Sex)

Let's check if "Sex" was successfully transformed:

In [None]:
train["Sex"].value_counts()

1    577
0    314
Name: Sex, dtype: int64

In [None]:
# Question 6: What does a value of 1 correspond to - "male" or "female"?

# Answer: The value 1 corresponds to male
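The mapping does not have to be guessed from the counts: a fitted `LabelEncoder` stores its categories in `classes_`, sorted alphabetically, and each category's position is its code. A minimal sketch on a four-value toy list (the list is made up):

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, so position 0 is "female" and position 1 is "male"
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))
print(list(encoder.inverse_transform([0, 1])))  # codes back to labels
```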

The next step is to pick the predictors, aka "features," that will help us predict the target variable "Survived":

In [None]:
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

In [None]:
# Question 7: How many predictors will the ML model have?

# Answer: 4 predictors

The cell above selects **four predictors** to determine whether someone survived or not; the cell below fits the "DecisionTreeClassifier" ML algorithm on them


We are now ready to "train" and "fit" the model on the train set:

In [None]:
# Fit the first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

In [None]:
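#This command shows the accuracy of the model in the training phase, i.e. what proportion of the training passengers is classified correctly.
#Here the training accuracy is high (an error rate of only about 2%), but a decision tree with no depth limit can nearly memorize its training data,
#so this number says little about how well the model will predict passengers it has not seen.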
print(my_tree_one.score(features_one,target))

0.9775533108866442
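The score above is measured on the same passengers the tree was trained on. One common way to get a more honest estimate, not used in this workbook, is to hold back part of the data for validation. A minimal sketch on synthetic data: the arrays, the noise level and the 70/30 split are all assumptions for illustration, and `train_test_split` comes from `sklearn.model_selection`.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 400 rows, 4 numeric features
X = rng.normal(size=(400, 4))
# The label depends on the first feature plus noise, so it cannot be predicted perfectly
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hold back 30% of the rows; the tree never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
train_acc = model.score(X_fit, y_fit)  # accuracy on the rows used for fitting
val_acc = model.score(X_val, y_val)    # accuracy on the held-out rows
print(train_acc, val_acc)
```

The gap between the two numbers is a rough indication of how much the tree has memorized rather than learned.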


Let's see which variables are most important in predicting whether a passenger survived (they are printed in the same order as the columns of the features_one array: "Pclass", "Sex", "Age", "Fare"):

In [None]:
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)

[0.12797088 0.31274009 0.23404358 0.32524545]


In [None]:
# Question 8: Which are the top 2 most important variables in predicting whether a passenger survived?

# Answer: Fare (about 0.33) and Sex (about 0.31)
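Matching bare numbers to column names by position is error-prone, so it can help to pair them up and sort. The sketch below uses the importance values printed above; the variable names are illustrative.

```python
import numpy as np

feature_names = ["Pclass", "Sex", "Age", "Fare"]
# Values copied from the feature_importances_ output above
importances = np.array([0.12797088, 0.31274009, 0.23404358, 0.32524545])

# Pair each name with its importance and sort from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```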

At this point, we have a fully trained supervised ML model which we can now use on our test data to make predictions about who will survive the disaster.

In [None]:
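# Just as with the train set, we need to "clean" the test data by encoding "Sex" as a binary and filling the missing values in "Fare" and "Age"
# Use transform() rather than fit_transform() so the test set reuses the female=0 / male=1 mapping learned from the train set
test.Sex = encoded_sex.transform(test.Sex)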

test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
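When encoding new data, reusing the encoder fitted on the training data keeps the codes consistent. The sketch below shows what can go wrong otherwise, using a hypothetical batch that happens to contain only one category: refitting assigns "male" the code 0, while the reused encoder keeps it at 1.

```python
from sklearn import preprocessing

# Stands in for the encoder fitted on the training column
encoder = preprocessing.LabelEncoder()
encoder.fit(["male", "female"])

test_batch = ["male", "male"]  # hypothetical batch with a single category

reused = encoder.transform(test_batch)                          # keeps the training mapping
refit = preprocessing.LabelEncoder().fit_transform(test_batch)  # learns a new mapping
print(list(reused), list(refit))
```

With both categories present in the new data, the two approaches give the same codes, but `transform()` does not rely on that.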

In [None]:
# Next, we create the list of features our prediction model will use - the same ones we used in training
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

In [None]:
# We are now ready to predict the fate of each passenger in the test set.
# The command below displays our prediction - an array of 1's and 0's where 1 means that particular passenger is predicted to survive and 0 means they are predicted not to survive.

my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0
 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]


In [None]:
# Question 9: Based on the prediction column, what does the model predict will happen to the 3rd passenger in the test set?

# Answer: The model predicts that they survive (the 3rd value in the prediction array is 1)
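Counting positions in a long printed array is easy to get wrong. A minimal sketch that pairs each prediction with its PassengerId, using only the first five ids from `test.head()` and the first five predicted values shown above (the `result` name is illustrative):

```python
import numpy as np
import pandas as pd

passenger_ids = [892, 893, 894, 895, 896]        # first five PassengerId values in the test set
first_predictions = np.array([0, 0, 1, 1, 1])    # first five values of the printed prediction

result = pd.DataFrame({"PassengerId": passenger_ids, "Survived": first_predictions})

third = result.iloc[2]  # positions start at 0, so the 3rd passenger is at position 2
print(result)
print(int(third["PassengerId"]), int(third["Survived"]))
```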

How can we improve the model? Can we include more of the variables as predictors? Here is a full list of the variables in the dataset:

In [None]:
train.columns
# Feel free to repeat the training steps above with an expanded list of features. This is not required for this class, though.

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
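If you do try an expanded feature list, text columns such as Embarked have to be turned into numbers first. One possible approach, not part of this workbook, is one-hot encoding with `pd.get_dummies`. The sketch below uses four made-up rows, and the choice of columns is only an example.

```python
import pandas as pd

# Toy rows (invented); Embarked holds a port letter such as S, C or Q
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Replace Embarked with one 0/1 column per port
features = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(features.columns.tolist())
print(features)
```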

In [None]:
# Question 10: We could create a better model by including more of the variables in the dataset, and yet, even with all of them in the model,
# can you think of passenger characteristics that might be important for survival but are not part of the data?

# Answer: A passenger characteristic that might be important for survival is their location at the time of the collision. For example, someone who was near the lifeboats would have been more likely to survive.