<img style="float: left;" src="https://upload.wikimedia.org/wikipedia/commons/2/24/IBM_Cloud_logo.png">

# Introduction to Python for Data Analytics—Titanic Mortality Dataset Challenge

We hope you enjoyed our introductory session on Python for Data Analytics. As a recap, here are some of the things we covered:

* Python as a programming language, and what makes it applicable to data science and analytics
* Jupyter notebooks as "document editors" for writing and running code, just like this one!
* Essential Python libraries used by data practitioners, including pandas and matplotlib
* Key functions within those libraries, such as methods for filtering and grouping dataframes in pandas, and making simple plots with matplotlib
* The wide array of applications on IBM Cloud Pak for Data, most notably Watson Natural Language Understanding

Having been introduced to these topics, you're now ready to try your hand at your own short data analytics project in Python. This informal competition is designed to provide that opportunity!

### The Dataset

For this exercise you'll be using the famous "Titanic - Machine Learning from Disaster" dataset hosted on [Kaggle](https://www.kaggle.com/c/titanic/data). The dataset provides information on passengers aboard the ocean liner Titanic, most notably which passengers died in the tragic sinking of the ship in 1912.

When approaching any new dataset, it's always a good idea to gather as much information as possible about the nature of the data, its source, and its contents. The full documentation for this dataset can be found [here](https://www.kaggle.com/c/titanic/data), but here are a few highlights:

#### Variables

* survival = Survival (0 = No, 1 = Yes)
* pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* sex = Sex
* age = Age in years
* sibsp = # of siblings / spouses aboard the Titanic
* parch = # of parents / children aboard the Titanic
* ticket = Ticket number
* fare = Passenger fare
* cabin = Cabin number
* embarked = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

#### Variable Notes
**pclass**: A proxy for socio-economic status (SES)

* 1st = Upper
* 2nd = Middle
* 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

**sibsp**: The dataset defines family relations in this way:

* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way:

* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so parch = 0 for them.
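
To make the variable notes concrete, here is a minimal sketch of how they might translate into derived columns. It assumes the capitalized column names used in the loaded data (`Pclass`, `SibSp`, `Parch`, `Age`). The `sample` DataFrame is a stand-in built from the first five passengers in the dataset, and the column names `SES`, `Relatives`, and `AgeEstimated` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Stand-in for the full dataset: the first five passengers only.
# In your own analysis, apply the same steps to the full `df`.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Ticket class as a socio-economic label, following the variable notes
ses_labels = {1: "Upper", 2: "Middle", 3: "Lower"}
sample["SES"] = sample["Pclass"].map(ses_labels)

# Hypothetical helper column: relatives aboard
# (siblings/spouses plus parents/children)
sample["Relatives"] = sample["SibSp"] + sample["Parch"]

# Hypothetical flag: ages of 1 or more that end in .5 are estimates
sample["AgeEstimated"] = (sample["Age"] >= 1) & (sample["Age"] % 1 == 0.5)

print(sample[["Pclass", "SES", "Relatives", "AgeEstimated"]])
```

Whether columns like these help your analysis is a judgment call; they are one way to turn the documentation into something you can group and plot by.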



### The Challenge

With these data, your challenge is simple: **Conduct a short exploratory data analysis (EDA) using Python to help explain who lived and who died on the Titanic.**

Specifically, we'd like you to use this Python notebook to conduct your EDA, writing code to explore, manipulate, and visualize the data as you see fit. **Format and present your findings in this same notebook with code, comments, and at least one or two visualizations explaining your findings.**

You can approach this challenge in a number of ways; there's no single right way to go about it. You may choose to build your report from descriptive statistics and basic plots, or you may use machine learning to classify passengers by whether or not they survived. Just note that **we'll be judging submissions mainly on how well they explain and show the analyst's findings, rather than on the complexity of the code and approaches chosen.**
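
As one possible starting point for the descriptive route, the sketch below computes a survival rate per group and draws a simple bar chart. It is an illustration rather than a recommended analysis. The `sample` DataFrame is a stand-in built from the first five passengers in the dataset, so its numbers are not representative; with the real data you would use the full `df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the full dataset: the first five passengers only
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# The mean of a 0/1 column is the share of 1s,
# so this gives the survival rate within each group
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# Simple bar chart of the grouped result
ax = survival_by_sex.plot(kind="bar", rot=0)
ax.set_xlabel("Sex")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by sex (five-row sample)")
ax.set_ylim(0, 1)

# In a notebook with %matplotlib inline, the figure renders below the cell
plt.tight_layout()
```

The same pattern works for other grouping columns such as `Pclass`, or for several columns at once by passing a list to `groupby`.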

### Submission

Once you're ready to submit, please download your notebook as a .ipynb file (the default format) using the download button on the top right of Watson Studio. 

***Then send the .ipynb file as an email attachment to alexamari@ibm.com by 5:00 PM Pacific Time on Wednesday June 23, with the subject line "Titanic Data Competition Submission" or something similar.***

Next, be sure to join us for our review of the competition results on Thursday June 24 at 11:00 AM Pacific Time.

### Tips for Getting Started

##### #1
Remember that you can refer back to the Python notebooks from the introductory session. They cover basic Python objects, operations, functions, and methods, as well as building, manipulating, and visualizing dataframes with pandas and matplotlib.

* [Intro to base Python](https://github.com/IBM/python-and-analytics/blob/master/notebooks/learning-python-3.ipynb)
* [Intro to pandas and matplotlib in Python](https://github.com/IBM/python-and-analytics/blob/master/notebooks/UK-workshop-pandas.ipynb) 

Feel free to copy the approaches and code in these notebooks and apply them to the Titanic dataset here. 

##### #2 
Attend our office hour on Tuesday June 22 at 11:00 AM to discuss your ideas and approaches for the challenge with Alex and Ray, and to review any concepts from the introductory session that you'd like to revisit or explore further.

##### #3
Use the Python notebook to its full potential! In addition to making code cells, consider adding text to your report with Markdown cells. Markdown cells contain text and images, rather than code, to help guide a notebook user. For example, the text you're reading right now is contained in a Markdown cell! To create a Markdown cell, create a new cell with the circled plus sign, then select "Markdown" in the Format dropdown menu above. For tips on how to format your Markdown cells, see the guide below.

* [Quick guide to Markdown formatting](https://www.markdownguide.org/basic-syntax/)

##### #4 
In addition to the documentation for base Python, be sure to use the documentation for libraries like pandas and matplotlib to see what they are capable of, as well as how to use their functions and methods. Also, don't forget that you can use Stack Overflow to ask and view questions, answers, and code patterns related to these libraries and Python in general!

* [Python 3 documentation](https://docs.python.org/3/)
* [pandas documentation](https://pandas.pydata.org/docs/)
* [matplotlib documentation](https://matplotlib.org/stable/contents.html)
* [seaborn documentation](https://seaborn.pydata.org/)
* [Stack Overflow](https://stackoverflow.com/)

### Code to Get Started

Run the following cell to import some relevant libraries, as well as load the Titanic dataset into a dataframe. Of course, feel free to import more libraries as you see fit.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Assigning the Titanic dataset from Kaggle to a pandas DataFrame, df

df = pd.read_csv('https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/fa71405126017e6a37bea592440b4bee94bf7b9e/titanic.csv')

## Showing the top 5 rows of the DataFrame using the head method

df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
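
The preview already hints that some columns have gaps: `Cabin` is missing for three of the five passengers shown. A quick way to measure this across all columns is sketched below. It is a minimal illustration; the `sample` DataFrame is a stand-in that copies a few columns from the five previewed rows, and with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Stand-in for the full dataset: a few columns from the five previewed rows
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Knowing which columns are sparse helps you decide whether to drop them, fill them in, or simply interpret them with caution.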


In [11]:
## Start your code here and add as many code/markdown cells as you like! 
## Hint: consider beginning with the pandas df.shape attribute and the df.describe() method as first steps in exploring these data :)
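
A minimal sketch of those first steps follows. Note that `shape` is an attribute (no parentheses), while `describe()` and `value_counts()` are methods. The `sample` DataFrame is a stand-in built from the five previewed rows; in your own cell you would use `df` directly.

```python
import pandas as pd

# Stand-in for the full dataset: a few columns from the five previewed rows
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# shape is an attribute: a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")

# describe() summarizes the numeric columns (count, mean, std, quartiles, ...)
summary = sample.describe()
print(summary)

# value_counts() tallies how often each value appears in a column
class_counts = sample["Pclass"].value_counts()
print(class_counts)
```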