<h1>Chapter 1. First Python code.</h1>
Let us build a simple model. Our goal (this is the first phase of the CRISP-DM process, business understanding) is to understand the factors that affected passengers' chances of survival on board the Titanic. The Excel file Titanic.xlsx contains data on all passengers, including records of their survival. 
<br>We will employ a few Python libraries frequently used in machine learning:
    <ul><li>pandas: data analysis and manipulation library;
    <li>sklearn: machine learning library;
    <li>matplotlib: data visualization library.</ul>
<br> Notice a list of files on the left. 
    
<br>Anaconda notebooks are organized in "cells". This text is in a "Markdown cell". <b>The next cell is a "code cell". Click it to select. </b> 
<br>At the top of this window, notice a row of icons. <b>Find the "Run" icon ►. Click this button to run the code in the selected code cell. </b>
<br>Alternatively, you can press Shift+Enter.

In [None]:
# When a line starts with the "#" symbol, it is a comment.
# Use comments to explain what your code is doing. 

# The first command loads the pandas library. Now you can use this library in your program. 
# The "as pd" part of the command means that you can refer to this library as "pd", not "pandas". 
# This is a common shortcut in Python for frequently used libraries. 
# We also install the openpyxl library, which pandas uses to read .xlsx Excel files
# (xlrd, since version 2.0, reads only the older .xls format). 

import pandas as pd
!pip install openpyxl

# Read the Titanic data from the file TitanicCleaned.xlsx. This file is stored on the course's GitHub site.
# Notice that the file is in Excel format. To read the data we are using the pandas function "read_excel". 
# We can do that because we have already loaded the pandas library in the previous command.

titanic_data = pd.read_excel('https://raw.githubusercontent.com/andykir/DataMiningLectures/main/TitanicCleaned.xlsx')

# Pandas loaded the data into a variable named "titanic_data". 
# Variables in Python have types - for example, "int" (integer), "str" (text), etc. 
# There are more complex types as well. The type of "titanic_data" is "DataFrame". 
# It combines the data (usually, a table) and operations (functions) that can be performed on this data. 
# For example, the function "head()" displays the first five rows of the DataFrame.  
 
titanic_data.head()
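
To make the idea of a DataFrame more concrete, here is a minimal sketch on a tiny hand-made table. The variable name toy_data, the column names, and all values are invented for illustration and are not taken from the course file.

```python
import pandas as pd

# Hypothetical three-row table; names and values are invented for illustration
toy_data = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [29, 41, 7],
    "Survived": [1, 0, 1],
})

# Every variable has a type; this one is a DataFrame
print(type(toy_data))

# shape is a (rows, columns) pair
print(toy_data.shape)

# dtypes lists the type of the values stored in each column
print(toy_data.dtypes)

# head(2) returns only the first two rows; head() without a number returns five
print(toy_data.head(2))
```

The same calls should work on titanic_data once it has been loaded.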


<h2>Data Understanding </h2>
Explore this data. Notice the column names and the values they hold. The "Lifeboat" column contains the lifeboat numbers. 
Notice that for some passengers the lifeboat number is missing ("NaN"), because these passengers did not survive. 
<br>Our goal is to predict the value of the "Survived" column (the dependent variable) from relevant characteristics of the passengers (the independent variables). In machine learning, the dependent variable is called the "target", and the independent variables are called "features", "predictors", or "attributes". 
<br> Let us use the following features: 'Sex_Female0Male1', 'Passenger Class', 'Passenger Fare', and 'No of Siblings or Spouses on Board'.
<br> Notice how the "print" function is used in the next cell.
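
Missing values can be counted rather than spotted by eye. The sketch below is only an illustration: it uses a small made-up table (the column names mirror the ones described above, but the values are invented) together with the pandas methods isna() and sum() and the function crosstab().

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are invented for illustration
toy_data = pd.DataFrame({
    "Survived": [1, 0, 0, 1],
    "Lifeboat": ["4", np.nan, np.nan, "11"],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy_data.isna().sum()
print(missing_per_column)

# Cross-tabulate survival against "lifeboat number is missing"
no_lifeboat = toy_data["Lifeboat"].isna()
print(pd.crosstab(toy_data["Survived"], no_lifeboat))
```

Running the same two steps on titanic_data would show how closely the missing lifeboat numbers follow the "Survived" column in the real file.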

In [None]:
# Selecting the feature variables (X)
X = titanic_data[['Sex_Female0Male1', 'Passenger Class', 'Passenger Fare', 'No of Siblings or Spouses on Board']]
# Selecting the target variable (y)
y = titanic_data['Survived']

# Display the variables used for modeling
print("Features")
print(X.head())

print("Target")
print(y.head())
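
Notice the double brackets used for X and the single brackets used for y in the cell above. The following sketch, on a made-up table with invented values, illustrates the difference: a list of column names gives back a table (a DataFrame), while a single column name gives back one column (a Series).

```python
import pandas as pd

# Hypothetical table; the values are invented for illustration
toy_data = pd.DataFrame({
    "Passenger Class": [1, 3, 2],
    "Passenger Fare": [80.0, 7.25, 13.0],
    "Survived": [1, 0, 1],
})

# A list of column names inside the brackets returns a DataFrame (a table)
X_toy = toy_data[["Passenger Class", "Passenger Fare"]]
print(type(X_toy), X_toy.shape)

# A single column name returns a Series (one column)
y_toy = toy_data["Survived"]
print(type(y_toy), y_toy.shape)
```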



<h2>Modeling</h2>
We will use the DecisionTreeClassifier model from the machine learning library sklearn.
Training the model may take a minute or two. 

In [None]:
# Import the library
import sklearn.tree

# Creating and fitting the decision tree model
model = sklearn.tree.DecisionTreeClassifier()
model.fit(X, y)
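
Once a model has been fitted, it can be asked for predictions. The sketch below is an illustration on an invented six-row table, not on the Titanic file: the names toy_X, toy_y, toy_model, and new_passengers are hypothetical, and the data are constructed so that only the sex column separates the two classes.

```python
import pandas as pd
import sklearn.tree

# Hypothetical training data: in this invented table, only women (0) survive
toy_X = pd.DataFrame({
    "Sex_Female0Male1": [0, 0, 0, 1, 1, 1],
    "Passenger Class": [1, 2, 3, 1, 2, 3],
})
toy_y = pd.Series([1, 1, 1, 0, 0, 0], name="Survived")

# random_state makes the result repeatable from run to run
toy_model = sklearn.tree.DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_X, toy_y)

# Ask the trained model about two new, made-up passengers:
# a woman in class 3 and a man in class 1
new_passengers = pd.DataFrame({
    "Sex_Female0Male1": [0, 1],
    "Passenger Class": [3, 1],
})
predictions = toy_model.predict(new_passengers)
print(predictions)
```

The same predict() call can be applied to the model fitted on the Titanic data, as long as the new table has the same feature columns in the same order.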


<h2>Visualization of results</h2>
For visualization, we will use the matplotlib library. Each internal node of a decision tree contains a condition on one independent variable, and the branches leaving the node correspond to the condition being true or false. Again, this will take a bit of time. 

In [None]:
# Visualizing the decision tree
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
sklearn.tree.plot_tree(model, feature_names=list(X.columns), class_names=['Not Survived', 'Survived'], filled=True)
plt.show()
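
The conditions stored in a tree can also be printed as plain text, which is sometimes easier to read than a large picture. This is a minimal sketch, assuming the sklearn function export_text is available in your installation; the toy table and the names toy_X, toy_y, and toy_model are invented for illustration.

```python
import pandas as pd
import sklearn.tree

# Hypothetical training data, invented for illustration
toy_X = pd.DataFrame({
    "Sex_Female0Male1": [0, 0, 0, 1, 1, 1],
    "Passenger Class": [1, 2, 3, 1, 2, 3],
})
toy_y = pd.Series([1, 1, 1, 0, 0, 0], name="Survived")

toy_model = sklearn.tree.DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_X, toy_y)

# export_text returns the tree as indented text, one condition per line
rules = sklearn.tree.export_text(toy_model, feature_names=list(toy_X.columns))
print(rules)
```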

Unfortunately, the model is far too complex and hence not very useful. Let us simplify the model by limiting the depth of the tree to 3 levels. I needed to modify only one line of the code (can you see which one?). Notice that I don't need to import the libraries again: they have already been loaded.

In [None]:
# Creating and fitting the decision tree model
model = sklearn.tree.DecisionTreeClassifier(max_depth=3)
model.fit(X, y)

# Visualizing the decision tree
plt.figure(figsize=(15, 10))
sklearn.tree.plot_tree(model, feature_names=list(X.columns), class_names=['Not Survived', 'Survived'], filled=True)
plt.show()
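
The effect of max_depth can also be expressed in numbers. The sketch below is an illustration on randomly generated data (not the Titanic file): it compares an unrestricted tree with one limited to depth 3, using the methods get_depth() and get_n_leaves(). All variable names here are hypothetical.

```python
import numpy as np
import sklearn.tree

# Randomly generated, noisy data; invented for illustration only
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 4))
toy_y = (toy_X[:, 0] + rng.normal(size=200) > 0).astype(int)

# An unrestricted tree keeps splitting until every leaf is pure
full_tree = sklearn.tree.DecisionTreeClassifier(random_state=0)
full_tree.fit(toy_X, toy_y)

# A tree limited to depth 3 can have at most 2**3 = 8 leaves
shallow_tree = sklearn.tree.DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(toy_X, toy_y)

print("Unrestricted:", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("max_depth=3: ", shallow_tree.get_depth(), "levels,", shallow_tree.get_n_leaves(), "leaves")
```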

Read the explanation of the results in the lab assignment.