# Goal

It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
## Metric

Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
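As a small illustration of the metric, accuracy is the number of matching predictions divided by the total number of predictions. The sketch below uses made-up labels, and `accuracy` is a hypothetical helper, not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up example: 4 of the 5 predictions match the true labels.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)
print(score)
```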
## Submission File Format

You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
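The format rules above can be checked before uploading. This is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper written for illustration, and the sample text is made up.

```python
import csv
import io

def check_submission(csv_text, expected_rows=418):
    """Return a list of problems found in a submission; empty means it looks valid."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["PassengerId", "Survived"]:
        problems.append("header must be exactly PassengerId,Survived")
        return problems
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    for line_no, row in enumerate(body, start=2):
        if len(row) != 2 or row[1] not in ("0", "1"):
            problems.append(f"line {line_no}: need an id and a 0/1 prediction")
    return problems

# Tiny made-up submission with 3 rows instead of 418.
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(sample, expected_rows=3))  # checked against 3 rows
print(check_submission(sample))                   # checked against the real 418
```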


In [1]:
# Import the Pandas library
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

In [2]:
#Print the `head` of the train and test dataframes
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [5]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
test.shape

(418, 11)

In [7]:
train.shape

(891, 12)

In [8]:

# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())

# As proportions
print(train["Survived"].value_counts(normalize = True))

# Males that survived vs males that passed away
print(train["Survived"][train['Sex'] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train['Sex'] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train['Sex'] == 'male'].value_counts(normalize = True))

# Normalized female survival
print(train["Survived"][train['Sex'] == 'female'].value_counts(normalize = True))


0    549
1    342
Name: Survived, dtype: int64
0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    468
1    109
Name: Survived, dtype: int64
1    233
0     81
Name: Survived, dtype: int64
0    0.811092
1    0.188908
Name: Survived, dtype: float64
1    0.742038
0    0.257962
Name: Survived, dtype: float64
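The per-sex survival rates printed above can also be obtained in one step with `groupby`. The sketch below runs on a made-up miniature table rather than the real `train` DataFrame, so the numbers differ from those above.

```python
import pandas as pd

# Made-up miniature stand-in for the train DataFrame.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

On the real data the same expression would be `train.groupby("Sex")["Survived"].mean()`.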


In [9]:
# Passengers that survived vs passengers that passed away
print("Passengers that survived {}".format(train["Survived"].value_counts()[1]))

# As proportions
print("Passengers that survived {}%".format(round(train["Survived"].value_counts(normalize = True)[1]*100, 4)))

# Males that survived vs males that passed away
print("Males that survived {}".format(train["Survived"][train['Sex'] == 'male'].value_counts()[1]))

# Females that survived vs Females that passed away
print("Females that survived {}".format(train["Survived"][train['Sex'] == 'female'].value_counts()[1]))

# Normalized male survival
print("Normalized male survival {}%".format(round(train["Survived"][train['Sex'] == 'male'].value_counts(normalize = True)[1]*100,4)))

# Normalized female survival
print("Normalized female survival {}%".format(round(train["Survived"][train['Sex'] == 'female'].value_counts(normalize = True)[1]*100,4)))

Passengers that survived 342
Passengers that survived 38.3838%
Males that survived 109
Females that survived 233
Normalized male survival 18.8908%
Normalized female survival 74.2038%


In [10]:
print("Passengers under 18 are {}".format(train['Age'][train['Age'] < 18].shape[0]))

Passengers under 18 are 113


In [11]:
print("Passengers 18 or older are {}".format(train['Age'][train['Age'] >= 18].shape[0]))

Passengers 18 or older are 601


In [12]:
train["Child"] = float('NaN')

In [13]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Child
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,


In [14]:
train.loc[train["Age"] >= 18, "Child"] = 0


In [15]:
train.loc[train["Age"] < 18, "Child"] = 1


In [16]:
train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Child
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,1.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,0.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,1.0
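The `Child` column shown above (1 for under 18, 0 for 18 or older, NaN where `Age` is missing) can also be derived in a single expression. This is a sketch on made-up ages, not the notebook's own method.

```python
import pandas as pd

# Made-up ages, including one missing value.
toy = pd.DataFrame({"Age": [22.0, 2.0, None, 14.0, 54.0]})

# 1.0 for under 18, 0.0 for 18 or older.
is_child = (toy["Age"] < 18).astype(float)

# Series.where keeps values where the condition holds and puts NaN elsewhere,
# so rows with a missing Age get a missing Child as well.
toy["Child"] = is_child.where(toy["Age"].notna())
print(toy)
```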


In [17]:
train["Age"].isnull().value_counts()

False    714
True     177
Name: Age, dtype: int64

In [18]:
# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train["Child"].value_counts())

print("Survived people: \n{}".format(train["Survived"].value_counts()))

# Print normalized Survival Rates for passengers under 18
print("Survived child \n{}".format(train["Survived"][train["Child"] == 1].value_counts()))
print("Survived child percent {}".format(round(train["Survived"][train["Child"] == 1].value_counts(normalize = True)[1]*100,4)))

# Print normalized Survival Rates for passengers 18 or older
print("Survived older \n{}".format(train["Survived"][train["Child"] == 0].value_counts()))
print("Survived older percent {}".format(round(train["Survived"][train["Child"] == 0].value_counts(normalize = True)[1]*100,4)))


0.0    601
1.0    113
Name: Child, dtype: int64
Survived people: 
0    549
1    342
Name: Survived, dtype: int64
Survived child 
1    61
0    52
Name: Survived, dtype: int64
Survived child percent 53.9823
Survived older 
0    372
1    229
Name: Survived, dtype: int64
Survived older percent 38.1032


In [19]:
print("People survived with NaN age")
train["Survived"][train["Child"].isnull()].value_counts()

People survived with NaN age


0    125
1     52
Name: Survived, dtype: int64

In [20]:
test_one = test

In [21]:
test_one["Survived"] = 0

In [22]:
test_one.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0


_From observation alone we can deduce that men have less than a 50% chance of surviving, and women have more than a 50% chance._

__For the first prediction, we can assume that in the test set roughly 100% of the women survive and 0% of the men survive.__
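A minimal sketch of this "women survive, men do not" rule, on a made-up stand-in for the test set. It uses `.copy()` so the original table is left untouched, which is an assumption about what is wanted rather than what the cells in this notebook do.

```python
import pandas as pd

# Made-up stand-in for the test DataFrame.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# .copy() gives an independent DataFrame, so toy_test is left unchanged.
baseline = toy_test.copy()
baseline["Survived"] = (baseline["Sex"] == "female").astype(int)
print(baseline[["PassengerId", "Survived"]])

# Training accuracy of this rule, from the counts printed earlier:
# 233 women survived and 468 men did not, out of 891 passengers.
baseline_train_accuracy = (233 + 468) / 891
print(round(baseline_train_accuracy, 4))
```

Any later model is worth comparing against this figure, since the rule needs no fitting at all.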

In [23]:
# Bind test_one to test (plain assignment does not copy, so test also gains the Survived column)
test_one = test

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1
test_one["Survived"]


0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 418, dtype: int64

## Decision tree

In [24]:
# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn import tree

### Cleaning and Formatting your Data

As seen earlier, the "Age" feature has missing values. Substitute the median age for these nulls.

In [25]:
train["Age"].isnull().value_counts()

False    714
True     177
Name: Age, dtype: int64

In [26]:
train["Age"] = train["Age"].fillna(train["Age"].median())

In [27]:
train["Age"].isnull().value_counts()

False    891
Name: Age, dtype: int64
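One possible variant of the median imputation above is to compute the median once on the training data and reuse it for the test data. This is an assumption for illustration (the notebook fills each set with its own median), shown on made-up values.

```python
import pandas as pd

# Made-up miniature train and test sets with missing ages.
toy_train = pd.DataFrame({"Age": [20.0, None, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [None, 50.0]})

# Assumption: reuse the training median for both sets, so the test set
# is filled with a statistic learned only from the training data.
train_median = toy_train["Age"].median()
toy_train["Age"] = toy_train["Age"].fillna(train_median)
toy_test["Age"] = toy_test["Age"].fillna(train_median)
print(train_median, toy_train["Age"].tolist(), toy_test["Age"].tolist())
```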

## :)

Convert the qualitative variables to quantitative ones.

In [28]:
train["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [29]:
# Convert the male and female groups to integer form
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1


In [30]:
train["Sex"].value_counts()

0    577
1    314
Name: Sex, dtype: int64

In [31]:
train["Embarked"].isnull().value_counts()

False    889
True       2
Name: Embarked, dtype: int64

In [32]:
train["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [33]:
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
train["Embarked"].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

Replace each class of Embarked with a unique integer: 0 for S, 1 for C, and 2 for Q.

In [34]:
# Convert the Embarked classes to integer form
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
train["Embarked"].value_counts()


0    646
1    168
2     77
Name: Embarked, dtype: int64
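The same category-to-integer conversion can be sketched with `Series.map` and a dictionary, which states the whole encoding in one place. The table below is made up; the codes (male 0, female 1; S 0, C 1, Q 2) are the ones used above.

```python
import pandas as pd

# Made-up rows, including one missing port of embarkation.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# Fill the missing port with "S" (the most common value above), then map to codes.
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
print(toy)
```

Keeping the dictionaries in named variables would let the train and test sets share exactly the same encoding.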

In [35]:
train[["Sex","Embarked"]].head()

Unnamed: 0,Sex,Embarked
0,0,0
1,1,1
2,1,0
3,1,0
4,0,0


# Fit my model

In [36]:
# Print the train data to see the available features
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Child
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,0,0.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,1,0.0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,0,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,0,0.0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,0,0.0


In [37]:
print("""
Pclass null:
{}
Sex null:
{}
Age null:
{}
Fare null:
{}""".format(train["Pclass"].isnull().value_counts(),
             train["Sex"].isnull().value_counts(),
             train["Age"].isnull().value_counts(),
             train["Fare"].isnull().value_counts()))


Pclass null:
False    891
Name: Pclass, dtype: int64
Sex null:
False    891
Name: Sex, dtype: int64
Age null:
False    891
Name: Age, dtype: int64
Fare null:
False    891
Name: Fare, dtype: int64


In [38]:
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

In [39]:
# Fit my first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

In [40]:
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

[0.12566688 0.31274009 0.23204479 0.32954824]
0.9775533108866442


In [41]:
# Importance for each feature
print(" Pclass\t\tSex\tAge\t  Fare")
print(my_tree_one.feature_importances_)

 Pclass		Sex	Age	  Fare
[0.12566688 0.31274009 0.23204479 0.32954824]


In [42]:
# Returns the mean accuracy on the given data and labels (here the training set, so this is training accuracy).
print(my_tree_one.score(features_one, target))

0.9775533108866442
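The score of about 0.98 above is measured on the same rows the tree was fitted on, so it says little about unseen passengers. A minimal sketch of a held-out check follows, assuming scikit-learn's `train_test_split` and synthetic stand-in data rather than the Titanic features.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 4 features, noisy binary target.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = ((X[:, 0] + 0.3 * rng.randn(200)) > 0.5).astype(int)

# Hold out 25% of the rows; the tree never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
train_acc = clf.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_acc = clf.score(X_val, y_val)    # accuracy on held-out rows
print(train_acc, val_acc)
```

With the real data, `features_one` and `target` would take the place of `X` and `y`.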


In [43]:
test["Fare"].notnull().value_counts()

True     417
False      1
Name: Fare, dtype: int64

In [44]:
test["Pclass"].notnull().value_counts()

True    418
Name: Pclass, dtype: int64

In [45]:
test["Sex"].notnull().value_counts()

True    418
Name: Sex, dtype: int64

In [46]:
print("?")
test["Age"].notnull().value_counts()

?


True     332
False     86
Name: Age, dtype: int64

In [47]:
test[test["Fare"].isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,0


In [48]:
# Impute the missing value with the median
test.loc[152, "Fare"] = test["Fare"].median()


In [49]:
test["Fare"].isnull().value_counts()

False    418
Name: Fare, dtype: int64

In [50]:
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Age"].notnull().value_counts()

True    418
Name: Age, dtype: int64

### Convert my labels to quantitative labels

In [51]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1


In [52]:
# Convert the male and female groups to integer form
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1


In [53]:
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

In [54]:
# Make my prediction using the test set
my_prediction = my_tree_one.predict(test_features)

In [55]:
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
my_solution

Unnamed: 0,Survived
892,0
893,0
894,1
895,1
896,1
897,0
898,0
899,0
900,1
901,0


In [56]:
# Check my data frame has 418 entries
my_solution.shape

(418, 1)

In [57]:
# Write my solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
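An alternative way to produce the two-column file is to build both columns explicitly and write without the index. This is a sketch with made-up ids and predictions; it writes to an in-memory buffer so the result can be inspected without touching disk.

```python
import io
import pandas as pd

# Made-up ids and predictions standing in for test["PassengerId"] and my_prediction.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```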

## Overfitting and how to control it - Second Tree

In [58]:
# Create a new array with the added features: features_two
print(train.head())
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
# Maybe we can improve the overfit model by making it less complex? In DecisionTreeClassifier, two parameters limit how far the tree grows:

#    the max_depth parameter sets the maximum depth, so splitting stops once a branch reaches it.
#    the min_samples_split parameter sets the minimum number of observations a node must contain to be split further (e.g. with a minimum of 10 passengers, smaller nodes are not split).

# By limiting the complexity of your decision tree you increase its generality and thus its usefulness for prediction!
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the feature importances (one value per feature, in the order of features_two)
# and the score of the new decision tree
print(my_tree_two.feature_importances_)
print(my_tree_two.score(features_two, target))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris   0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   1  38.0      1      0   
2                             Heikkinen, Miss. Laina   1  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35.0      1      0   
4                           Allen, Mr. William Henry   0  35.0      0      0   

             Ticket     Fare Cabin Embarked  Child  
0         A/5 21171   7.2500   NaN        0    0.0  
1          PC 17599  71.2833   C85        1    0.0  
2  STON/O2. 3101282   7.9250   NaN        0    0.0  
3            113803  53.1000  C123        0    0.0  
4            373450   8.0500   NaN   
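One way to judge whether a depth limit actually helps is to compare cross-validated accuracy across a few settings. The sketch below assumes scikit-learn's `cross_val_score` and runs on synthetic stand-in data, so its scores say nothing about the Titanic set; the depth values tried are arbitrary.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the Titanic features.
rng = np.random.RandomState(1)
X = rng.rand(300, 7)
y = ((X[:, 2] + 0.3 * rng.randn(300)) > 0.5).astype(int)

# Mean 5-fold cross-validated accuracy for a few depth limits (None = unlimited).
cv_scores = {}
for depth in (2, 5, 10, None):
    clf = tree.DecisionTreeClassifier(max_depth=depth, min_samples_split=5, random_state=1)
    cv_scores[depth] = cross_val_score(clf, X, y, cv=5).mean()

for depth, score in cv_scores.items():
    print(depth, round(score, 3))
```

With the real data, `features_two` and `target` would take the place of `X` and `y`.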

# Feature engineering - Third Tree

In [59]:
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train["SibSp"].values + train["Parch"].values + 1
print(train_two.head(10))
# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris   0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   1  38.0      1      0   
2                             Heikkinen, Miss. Laina   1  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35.0      1      0   
4                           Allen, Mr. William Henry   0  35.0      0      0   
5                                   Moran, Mr. James   0  28.0      0      0   
6                            McCarthy, Mr. 

# With Random Forest

In [61]:
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier
# Convert the Embarked classes to integer form on test set
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))


0.9393939393939394
418


# Interpreting and Comparing

In [62]:
#Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

#Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_forest, target))

[0.14130255 0.17906027 0.41616727 0.17938711 0.05039699 0.01923751
 0.0144483 ]
[0.10384741 0.20139027 0.31989322 0.24602858 0.05272693 0.04159232
 0.03452128]
0.9057239057239057
0.9393939393939394
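The raw importance arrays above are easier to read when paired with feature names. The sketch below copies the printed values and the column order used for `features_two` and `features_forest`; `ranked` is a hypothetical helper.

```python
# Feature order used for features_two and features_forest above.
feature_names = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]

# Values copied from the printed output above.
tree_two_importances = [0.14130255, 0.17906027, 0.41616727, 0.17938711,
                        0.05039699, 0.01923751, 0.0144483]
forest_importances = [0.10384741, 0.20139027, 0.31989322, 0.24602858,
                      0.05272693, 0.04159232, 0.03452128]

def ranked(names, importances):
    """Pair each feature with its importance, most important first."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

tree_ranking = ranked(feature_names, tree_two_importances)
forest_ranking = ranked(feature_names, forest_importances)
for name, value in tree_ranking:
    print(f"{name:10s} {value:.3f}")
```

In both models Sex ranks first, followed by Fare and Age, with Embarked last.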
