## Jupyter Notebook

First things first, let's get some terminology straight.
- The *language* we're working in – Python 3.7 
- The *editor* we're using is Google Colab – the code runs on Google's servers, and the results show up in our browser
- This file is an interactive Python notebook, a `.ipynb` file. These are pretty special, also known as **Jupyter notebooks**. 

Jupyter notebooks have a few special properties that make them ideal for working with data:
 - Code is organized into cells, which can be **code** or **markdown**
 - We can run the cells in **any order**. Try it out!
 - The value of the last expression in a cell is displayed automatically, so there's no need to wrap it in `print()`

In [None]:
#Set a variable

In [None]:
#Return it with print
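
As a rough sketch, the two cells above might look like this. The variable name `greeting` and its value are made up for illustration.

```python
# Set a variable
greeting = "Hello, Titanic!"

# Display it explicitly with print()
print(greeting)

# In a notebook, a bare expression on the last line of a cell is displayed too
greeting
```

Outside a notebook (in a plain `.py` script), only the `print()` call produces output.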

Anything you can do in Python, you can do here! 

1. Write a function that takes a string as input, and does something to it 
2. In a new cell, call the function and test it out

In [None]:
#Write a function

In [None]:
#Call it
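
One possible answer, purely as an example: the function name `shout` and what it does are invented here, and any string manipulation would do.

```python
def shout(text: str) -> str:
    """Return the text in upper case with an exclamation mark appended."""
    return text.upper() + "!"


# Call it: in a notebook the returned value is displayed automatically
shout("all aboard")
```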

## Importing packages

We use the `pandas` package to easily work with data as tables.
<br>The `numpy` package allows us to work with some other special data types, like missing values.
<br><br>We'll rename these as `pd` and `np`, just so it's easier to refer to them later on.

In [None]:
#import pandas and numpy
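
A minimal sketch of the imports with their usual aliases, plus a tiny illustration of a missing value. The `ages` series is made-up example data.

```python
import numpy as np
import pandas as pd

# np.nan is the value pandas uses to mark missing numeric data
ages = pd.Series([34.5, np.nan, 47.0])

# Count how many entries are missing
ages.isna().sum()
```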

## Importing data (Titanic Dataset)

We'll be looking at a dataset of the passengers of the Titanic, with their information and whether they survived.

For this semester, we'll typically work with data in *tabular* format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a `.csv` file ending, short for comma-separated values.

For example, a CSV file could look something like...

```
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
```

To import this, let's use the `pd.read_csv()` function:

In [None]:
#Read in the dataframe
url = 'https://raw.githubusercontent.com/dt3zjy/node/master/week-1/workshop/titanic.csv'

Here, we've saved the data to a `DataFrame` object named `titanic`.
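
Here is a small sketch of the same call. So that it runs without a network connection, it reads the two sample rows shown earlier from an in-memory string (the `sample_csv` name is an assumption for this example); in the workshop cell you would pass `url` to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# The two sample rows from the CSV example, held in memory so the sketch runs offline
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# In the workshop cell this would be pd.read_csv(url)
titanic = pd.read_csv(sample_csv)

# Check out the type
type(titanic)
```

Note that the quoted names contain commas, and `read_csv` still keeps each name in a single column.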

In [None]:
#Check out the type

DataFrames hold our data in little "spreadsheet"-like structures. Whatever manipulation you can think of doing to the data, a quick search will likely show you how to do it.

## Exploring dataframes

Let's take a look at the data. We'll use the `.head()` method to display the first 5 rows.

In [None]:
#Take a peek at the data

- Pclass: Different classes within the ship (first class, etc.)
- SibSp: Number of siblings or spouses that passenger had aboard
- Parch: Number of parents or children that passenger had aboard
- Embarked: Where the passenger boarded: Q is Queenstown, S is Southampton, C is Cherbourg

How big is the dataset? `.shape` returns a tuple with the dimensions as (rows, columns)

In [None]:
#Show shape
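
To see `.head()` and `.shape` in action without downloading anything, here is a sketch on a tiny stand-in table. The `toy` rows are invented for illustration and are not real passenger records.

```python
import pandas as pd

# Toy rows invented for illustration; they are not real passenger records
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Embarked": ["S", "C", "S", "Q", "S", "Q"],
})

first_rows = toy.head()  # first 5 rows by default
dimensions = toy.shape   # (rows, columns); an attribute, so no parentheses

print(first_rows)
print(dimensions)
```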

Let's try to understand our data a bit better. 
- How many different classes are in the dataset? 

In [None]:
#Number of unique

- Where did the passengers board? How many boarded at each port?

In [None]:
#Value counts

Show the oldest passenger by sorting the dataframe:

In [None]:
#Sort values
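
The three questions above map onto `.nunique()`, `.value_counts()` and `.sort_values()`. A sketch on a small invented table (the `toy` rows are not real passenger records):

```python
import pandas as pd

# Toy rows invented for illustration; they are not real passenger records
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Embarked": ["S", "C", "S", "Q", "S", "Q"],
})

# Number of distinct values in a column
n_classes = toy["Pclass"].nunique()

# How many rows there are for each distinct value
port_counts = toy["Embarked"].value_counts()

# Sort from oldest to youngest and keep the top row
oldest = toy.sort_values("Age", ascending=False).head(1)

print(n_classes)
print(port_counts)
print(oldest)
```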

### Subsetting

Subsetting is a super helpful tool. We'll look at this in more depth next week, but for now, here are the basics:

We can filter rows from a dataframe based on some condition

- Show passengers that boarded from `Q` (Queenstown)

In [None]:
#Subset by Q
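
Filtering works in two steps: a comparison on a column produces a Series of `True`/`False` values, and passing that Series inside square brackets keeps only the `True` rows. A sketch on invented data (the `toy` rows are not real passenger records):

```python
import pandas as pd

# Toy rows invented for illustration; they are not real passenger records
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Embarked": ["S", "C", "S", "Q", "S", "Q"],
})

# Step 1: the comparison gives one True/False value per row
boarded_at_q = toy["Embarked"] == "Q"

# Step 2: use that mask to keep only the matching rows
queenstown = toy[boarded_at_q]

print(queenstown)
```

The two steps are usually written on one line, as `toy[toy["Embarked"] == "Q"]`.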

How would you show all passengers that were 21+ years old?

Hint: Use a comparison operator, just as you would in a Python `if` statement, and mirror the syntax above.

In [None]:
#Your Turn!

## Data Manipulation

What percentage of passengers had at least one sibling or spouse aboard (`SibSp` greater than 0)?

In [None]:
#Find percentage
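
One way to get a percentage is to take the mean of a True/False column, since `True` counts as 1 and `False` as 0. The sketch below uses `SibSp`, which counts siblings and spouses together, so it is the closest column available for this question. The `toy` values are invented for illustration.

```python
import pandas as pd

# Toy values invented for illustration; they are not real passenger records
toy = pd.DataFrame({"SibSp": [1, 1, 0, 0, 0, 3]})

# True where the passenger had at least one sibling or spouse aboard
has_sibsp = toy["SibSp"] > 0

# The mean of a boolean Series is the fraction of True values
pct_with_sibsp = has_sibsp.mean() * 100

print(pct_with_sibsp)
```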

## Visualization

First things first, let's import the package to help us visualize the data, `plotly`.

If this package isn't installed yet, we can install it using `!pip install plotly`. More on this in week 5.

In [None]:
#Import px

Note that we're using a subpackage of the broader `plotly` package, called Plotly Express. This simplifies a lot of the more difficult steps.

Plotly Express has a broad range of options to play with. Let's take a look at the documentation.
<br>Do a quick Google search to pull up the documentation for `px.scatter` OR run `px.scatter?` in a Jupyter cell

In [None]:
#Find more info on px.histogram

Let's look at the age distribution

In [None]:
#Histogram

Look at the stacked histogram. What does the data tell you?

### Machine Learning Showcase

Let's take this data and apply machine learning to it. Can we predict whether someone survived based on their information?

In [None]:
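from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from graphviz import Source

# Drop the columns we won't use as features
titanic = titanic.drop(columns=['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'])

# Remove rows with missing values, then one-hot encode text columns such as Sex
titanic = titanic.dropna()
titanic = pd.get_dummies(titanic)

# Separate the features (X) from the label we want to predict (y)
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']

# Hold out 20% of the rows to test the model on data it hasn't seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a shallow decision tree so the drawing stays readable
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

# Check how often the tree is right on the held-out rows
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))

# Draw the fitted tree (feature names are passed as a plain list of strings)
graph = Source(export_graphviz(clf,
                        out_file=None,
                        feature_names=list(X_train.columns),
                        class_names=['Not Survived', 'Survived'],
                        filled=True,
                        rounded=True))
graph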