# Five Minute Python Refresher

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.

# Data Structures

### data structure
An organization of data for the purpose of making it easier to use.
### immutable data value
A data value which cannot be modified. Assignments to elements or slices (sub-parts) of immutable values cause a runtime error.
### mutable data value
A data value which can be modified. The types of all mutable values are compound types. Lists and dictionaries are mutable; strings and tuples are not.


## Following are the primary Data Structures that we will study:
**1. Strings**
An indexed, immutable collection of Unicode characters, e.g. "hello world!"

**2. Lists**
An indexed, mutable collection of elements that allows duplicates, e.g. \[10, 20, 30\]

**3. Tuple**
An indexed but immutable collection, e.g. ("apple", 250)

**4. Set**
An unordered collection of elements that doesn't allow repetitions, e.g. {"apple", "orange"}

**5. Dictionary**
A collection of key-value pairs, e.g. {key: value}
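
The five structures above can be tried out in a few lines. This is a minimal sketch; the variable names and values are made up for illustration and are not part of the course data.

```python
# Illustrative values only; none of these names come from the course material.
greeting = "hello world!"            # string: indexed, immutable
numbers = [10, 20, 30]               # list: indexed, mutable, allows duplicates
fruit_price = ("apple", 250)         # tuple: indexed, immutable
fruits = {"apple", "orange"}         # set: unordered, no repetitions
capitals = {"India": "New Delhi"}    # dictionary: key-value pairs

numbers[0] = 99                      # lists can be changed in place
fruits.add("apple")                  # adding a duplicate leaves the set unchanged

try:
    fruit_price[0] = "pear"          # tuples reject item assignment
except TypeError as error:
    print("tuple is immutable:", error)

print(greeting[0], numbers, len(fruits), capitals["India"])
```

The `try`/`except` shows the runtime error mentioned in the definition of an immutable value.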

# Control Flow

A program’s control flow is the order in which the program’s code executes. The control flow of a Python program is regulated by conditional statements, loops, and function calls. This section covers the if statement and for and while loops; functions are covered in the next class.

![](https://www.researchgate.net/profile/Kay_Smarsly/publication/322509045/figure/fig1/AS:583153716215809@1516046088625/Control-flow-of-elementary-control-structures.png)

# The if Statement

Often, you need to execute some statements only if some condition holds, or choose statements to execute depending on several mutually exclusive conditions. The Python compound statement if, which uses if, elif, and else clauses, lets you conditionally execute blocks of statements. 

## Comparison Operators
- x == y
- x != y
- x < y
- x <= y
- x > y
- x >= y
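
Each of these operators evaluates to `True` or `False`, which is what an `if` statement tests. A small sketch of `if`, `elif` and `else` working together (the function name `compare` is made up for illustration):

```python
def compare(x, y):
    """Describe how x relates to y using comparison operators."""
    if x == y:
        return "equal"
    elif x < y:
        return "smaller"
    else:
        # reached only when x is neither equal to nor smaller than y
        return "larger"

print(compare(1, 1), compare(1, 2), compare(3, 2))
```

Only the first branch whose condition holds is executed; the remaining branches are skipped.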

## Python Indentation
In Python, code blocks are defined by a consistent number of leading spaces. This is called indentation.

A block ends at the first line that is indented less than the block itself.

The best practice, following the PEP 8 style guide, is to use four spaces per indentation level and never to mix tabs and spaces.

##  Quick Task 1.1: Fizz Buzz
Write code to print all the numbers up to a given number "N", replacing every multiple of 3 with the word "fizz", every multiple of 5 with the word "buzz", and every multiple of both 3 and 5 with "fizzbuzz".

For N=7, print the output:\
1 2 fizz 4 buzz fizz 7

All the items are printed on the same line.

In [12]:
max_num = 16

for number in range(1, max_num + 1):
    # check the combined case first, otherwise the single-factor branches would catch it
    if number % 3 == 0 and number % 5 == 0:
        word = "fizzbuzz"
    elif number % 5 == 0:
        word = "buzz"
    elif number % 3 == 0:
        word = "fizz"
    else:
        word = number
    print(word, end=" ")   # end=" " keeps all the items on the same line

1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16


In [4]:
7 % 

1

## Quick Task 1.2: Wordcount

Given a sentence, find how many times each word occurred.

sentence = "hello this is a nice morning we have this class very morning"\
this: 2\
nice: 1\
morning: 2\
... and so on

In [1]:
sentence = "hello this is a nice morning we have this class very morning"

worddict = dict()

words = sentence.split(' ')

for word in words:
    #print('reading ' + word)
    if word in worddict:
        worddict[word] = worddict[word] + 1
    else:
        worddict[word] = 1
    #input()

    
worddict

{'a': 1,
 'class': 1,
 'have': 1,
 'hello': 1,
 'is': 1,
 'morning': 2,
 'nice': 1,
 'this': 2,
 'very': 1,
 'we': 1}
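
The standard library can do the same counting. As an alternative sketch (not the approach used in the cell above), `collections.Counter` builds the word-to-count mapping in one step:

```python
from collections import Counter

sentence = "hello this is a nice morning we have this class very morning"

# split() with no argument splits on any run of whitespace
word_counts = Counter(sentence.split())

print(word_counts["morning"])        # count for a single word
print(word_counts.most_common(2))    # the two most frequent words with their counts
```

A `Counter` behaves like a dictionary, so it can be indexed and looped over in the same way as `worddict`.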

# Functions in Python

A function is a group of statements that performs a specific task.

## Advantages:

- Makes your code more organized and manageable.
- Brings code reusability.

## Function Syntax:
```
def <function name>([parameters]):
    '''Doc string'''
    Logic/statements
    ...
    ...
    return value/print(value)
```
**def** - marks the start of the function header.\
**function name** - a unique name to identify the function; naming follows the same checklist we learnt for variable naming.\
**parameters/arguments** - used to pass a value to the function while calling. These are optional.\
**Doc String** - a short description of the function. This is optional.\
**Logic/statements** - one or more valid Python statements to perform the required task.\
**return** - this will return a value from the function. Optional.\
**print** - to display a value from the function. Optional.


## Quick Task 1.3: Create a Loan Interest Calculator Using Simple Interest

Create a function that takes the price, the down payment, the rate of interest (in percent per year) and the duration (in years). Return the total amount the user has to pay, interest included, excluding the down payment.

In [2]:
def get_amount(price, downpayment, rate, time):
    # simple interest: rate is a yearly percentage, time is in years
    principal = price - downpayment
    interest = principal * rate * time / 100
    amount = principal + interest

    return amount

get_amount(1000, 200, 4, 10)

1120.0

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

## Why do we do EDA?
 - to understand nature and volume of data
 - critical analysis
 - find insights and patterns
 - find missing data

![Data Science Life Cycle](http://www.cortell.co.za/wp-content/uploads/2018/06/chart.png)

# Univariate Analysis

Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable. Since it's a single variable it doesn’t deal with causes or relationships.  The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

You can think of the variable as a category that your data falls into. One example of a variable in univariate analysis might be "age". Another might be "height". Univariate analysis would not look at these two variables at the same time, nor would it look at the relationship between them.

Some ways you can describe patterns found in univariate data include looking at mean, mode, median, range, variance, maximum, minimum, quartiles, and standard deviation. Additionally, some ways you may display univariate data include frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
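
Several of these summary statistics are available in Python's standard `statistics` module. A minimal sketch on a made-up list of ages (the values are for illustration only):

```python
import statistics

# Made-up ages, used only to illustrate the summary statistics
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14]

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "minimum": min(ages),
    "maximum": max(ages),
    "range": max(ages) - min(ages),
    "standard deviation": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```

All of these describe a single variable, which is what makes the analysis univariate.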

# Bivariate Analysis
Bivariate analysis is used to find out if there is a relationship between two different variables. Something as simple as creating a scatterplot by plotting one variable against another on a Cartesian plane (think X and Y axis) can sometimes give you a picture of what the data is trying to tell you. If the data seems to fit a line or curve then there is a relationship or correlation between the two variables.  For example, one might choose to plot caloric intake versus weight.
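
How well two variables fit a line can be summarised with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the helper name `pearson_correlation` and the calorie and weight figures are made up for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how the two variables move together around their means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # how much each variable varies on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up numbers: daily caloric intake and body weight for five people
calories = [1800, 2000, 2200, 2500, 3000]
weight_kg = [60, 64, 70, 75, 88]

r = pearson_correlation(calories, weight_kg)
print(round(r, 3))
```

A value close to 1 or -1 means the points lie near a straight line; a value near 0 means there is no linear relationship, although a curved one may still exist.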

# Multivariate Analysis
Multivariate analysis is the analysis of three or more variables.  There are many ways to perform multivariate analysis depending on your goals.

This is relatively more complex and requires advanced understanding of statistics and data analysis methods.

# Univariate Analysis vs Bivariate Analysis
| **Univariate Analysis**                           | **Bivariate Analysis**                                           |
|-----------------------------------------------|--------------------------------------------------------------|
| Involves a single variable                    | Involves two variables                                       |
| Deals with intrinsic property of the data     | Deals with cause and relationships between the two variables |
| Major purpose is to describe                  | Major purpose is to explain                                  |
| Mean, Median, Mode, Range, Standard Deviation | Correlation, Relationships, Causal Explanations              |


Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples. In 1962, John W. Tukey called for a reformation of statistics in his seminal paper “The Future of Data Analysis”. He proposed a new scientific discipline called data analysis that included statistical inference as just one component. Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software), and his original tenets are surprisingly durable and form part of the foundation for data science. The field of exploratory data analysis was established with Tukey’s 1977 now-classic book Exploratory Data Analysis.

People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data. They find looking at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate). Each category of EDA has further divisions based on the role (outcome or explanatory) and type (categorical or quantitative) of the variable(s) being examined.

# The Dataset

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this experiment, we will analyze and find patterns that indicate who might have a better chance at survival.

The dataset shared here is public information, though these files are sourced from Kaggle. After this 2-Day course, you are encouraged to attempt the open challenge at https://www.kaggle.com/c/titanic


# Reading a File

We have shared a file in CSV format. How do we read it in Python?

## Quick Task 1.4: Using the open() method

In [4]:
f = open("C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv") 
#f = open("data/train.csv", 'r')

If there was an error, either the file isn't at the right place or you need to write the path properly.

In [5]:
f

<_io.TextIOWrapper name='C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv' mode='r' encoding='cp1252'>

In [6]:
f.read()

'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S\n5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S\n6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q\n7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S\n8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S\n9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S\n10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C\n11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7,G6,S\n12,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S\n13,0,3,"Saundercock, Mr. William Henry",

### Discussion: 
What are the issues if we continue using this method?

## Easier Option
Don't re-invent the wheel
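
One issue with handling the raw text yourself is that passenger names contain commas, so splitting a line on `,` produces the wrong number of fields. The sketch below uses the header and first data row from the file contents shown above, held in an in-memory string so that no file path is needed.

```python
import csv
import io

# Header and first data row, as they appear in the file contents shown above
raw = (
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
)

header_line, first_row_line = raw.splitlines()

# Naive approach: the comma inside the quoted name creates an extra field
naive_fields = first_row_line.split(',')
print(len(header_line.split(',')), len(naive_fields))

# csv.reader understands quoted fields, so the name stays in one piece
parsed_rows = list(csv.reader(io.StringIO(raw)))
print(len(parsed_rows[1]), parsed_rows[1][3])
```

The header has 12 columns but the naive split yields 13 fields for the data row, whereas the `csv` module returns 12.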

In [16]:
import csv

titanic = {}

with open('C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv', mode='r') as data:
    titanic = csv.DictReader(data)
    for line in titanic:
        #print(line)
        print(line['Survived'])

0
1
1
1
0
0
0
0
1
1
1
1
0
0
0
1
0
1
0
1
0
1
1
1
0
1
0
0
1
0
0
1
1
0
0
0
1
0
0
1
0
0
0
1
1
0
0
1
0
0
0
0
1
1
0
1
1
0
1
0
0
1
0
0
0
1
1
0
1
0
0
0
0
0
1
0
0
0
1
1
0
1
1
0
1
1
0
0
1
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
1
1
0
0
0
0
1
0
0
1
0
0
0
0
1
1
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
1
0
1
1
1
1
0
0
1
0
0
0
0
0
1
0
0
1
1
1
0
1
0
0
0
1
1
0
1
0
1
0
0
0
1
0
1
0
0
0
1
0
0
1
0
0
0
1
0
0
0
1
0
0
0
0
0
1
1
0
0
0
0
0
0
1
1
1
1
1
0
1
0
0
0
0
0
1
1
1
0
1
1
0
1
1
0
0
0
1
0
0
0
1
0
0
1
0
1
1
1
1
0
0
0
0
0
0
1
1
1
1
0
1
0
1
1
1
0
1
1
1
0
0
0
1
1
0
1
1
0
0
1
1
0
1
0
1
1
1
1
0
0
0
1
0
0
1
1
0
1
1
0
0
0
1
1
1
1
0
0
0
0
0
0
0
1
0
1
1
0
0
0
0
0
0
1
1
1
1
1
0
0
0
0
1
1
0
0
0
1
1
0
1
0
0
0
1
0
1
1
1
0
1
1
0
0
0
0
1
1
0
0
0
0
0
0
1
0
0
0
0
1
0
1
0
1
1
0
0
0
0
0
0
0
0
1
1
0
1
1
1
1
0
0
1
0
1
0
0
1
0
0
1
1
1
1
1
1
1
0
0
0
1
0
1
0
1
1
0
1
0
0
0
0
0
0
0
0
1
0
0
1
1
0
0
0
0
0
1
0
0
0
1
1
0
1
0
0
1
0
0
0
0
0
0
1
0
0
0


### Discussion: 
Take 2 minutes and explore these with the person(s) sitting next to you. Find the answers to the questions below and the code blocks that follow.

- How many rows are there in this part of the dataset?
- How many columns are there?
- What are the columns called?
- What does each column stand for? Feel free to google to find the answer to this.

## Task 2.1: Find what percentage of people survived

In [19]:
survived = 0
total = 0

with open('C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv', mode='r') as data:
    titanic = csv.DictReader(data)
    for line in titanic:
        # csv values are strings, so compare against '1' and '0'
        if line['Survived'] == '1':
            survived = survived + 1
            total = total + 1
        elif line['Survived'] == '0':
            total = total + 1

print(100 * survived / total)

38.38383838383838


In [14]:
line['Survived']

'0'

## Task 2.2: Find the oldest person

In [None]:
max_age = 0
oldestperson = None

with open('C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv', mode='r') as data:
    titanic = csv.DictReader(data)
    for line in titanic:
        # write your code here
        pass
    
            
print (oldestperson)
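
One pitfall in this task is that the `Age` field is blank for some passengers (for example "Moran, Mr. James" in the file contents shown earlier), and the `csv` module returns every value as a string. A possible way to handle both, sketched on a small in-memory sample rather than the real file (the variable name `oldest_person` and the trimmed two-column layout are assumptions for illustration):

```python
import csv
import io

# A small in-memory sample in the style of train.csv, trimmed to two columns
sample = io.StringIO(
    'Name,Age\n'
    '"Allen, Mr. William Henry",35\n'
    '"Moran, Mr. James",\n'
    '"McCarthy, Mr. Timothy J",54\n'
)

max_age = 0.0
oldest_person = None

for row in csv.DictReader(sample):
    if row['Age'] == '':
        continue                      # skip passengers whose age is missing
    age = float(row['Age'])           # convert the string before comparing
    if age > max_age:
        max_age = age
        oldest_person = row['Name']

print(oldest_person, max_age)
```

Without the blank check, `float('')` would raise a `ValueError` on the first passenger with a missing age.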

## Task 2.3: Find what percentage of people travelling in First Class survived

In [None]:
survived=0
total = 0

with open('C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv', mode='r') as data:
    titanic = csv.DictReader(data)
    for line in titanic:
        # write your code here
        pass

print (100*survived/total)

# 3. Data as a Table

**Pandas** is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. The name is a contraction of **pan**el **da**ta.

Not only is the pandas library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection. Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.


### Spreadsheets to DataFrame
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.

![](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)



In [1]:
import pandas as pd
print (pd.__version__)

0.25.3



The following command must be run outside of the IPython shell:

    $ pip install --upgrade pandas

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more informations on how to install packages:

    https://docs.python.org/3/installing/


If you see any error while running the previous line, there was possibly some issue with pandas installation.

## Task 3.1 Read the dataset using Pandas

In [4]:
titanic = None
filepath = "C:/Users/elamp/Documents/RIT/data opleidingsdagen/Python_Leergang/datascience-crashcourse/data/train.csv"
titanic  = pd.read_csv(filepath)

In [23]:
titanic

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


### 3.2 What is the datatype of this object?

In [None]:
type(titanic)

### 3.3 Read first 3 rows of the dataset

In [None]:
titanic.head(3)

### 3.4 Read last 3 rows of the dataset

In [None]:
titanic.tail(3)

### 3.5 What are all the columns?

In [25]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

### 3.6 Short Side-Quest

Convert dataset into a two-dimensional numpy array using `df.values` and find the number of people who survived.

In [29]:
#titanic.loc[titanic['Survived']==1]
titanic.values


array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
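
Once the data is a NumPy array, columns are addressed by position rather than by name. A possible way to count survivors, sketched on a tiny stand-in table (the `toy` DataFrame and its values are made up; only the column names mirror the real data):

```python
import pandas as pd

# Tiny stand-in for the Titanic table; the values are invented
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["A", "B", "C", "D"],
})

values = toy.values                                  # two-dimensional NumPy array
survived_column = toy.columns.get_loc("Survived")    # position of the column

# Select every row of that column and count the entries equal to 1
survivor_count = int((values[:, survived_column] == 1).sum())
print(values.shape, survivor_count)
```

Because the columns hold different types, the array has `dtype=object`, as in the output above.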

# 4. Pandas for Data Analysis in Python

### Why Pandas?
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.


### Multi-Format Support
Pandas integrates with many file formats and data sources out of the box (csv, excel, sql, json, parquet,…). Importing data from each of these sources is done with a function whose name starts with `read_*`. Similarly, the `to_*` methods are used to store data.

![https://pandas.pydata.org/docs/_images/02_io_readwrite.svg](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)
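
A short round trip illustrates the `read_*` and `to_*` naming. This is a sketch that writes to an in-memory string instead of a file; the `countries` table is made up for illustration.

```python
import io
import pandas as pd

countries = pd.DataFrame({
    "Country": ["Belgium", "India"],
    "Capital": ["Brussels", "New Delhi"],
})

# to_csv with no path returns the CSV text; index=False leaves out the row labels
csv_text = countries.to_csv(index=False)

# read_csv accepts a path or any file-like object, here an in-memory buffer
restored = pd.read_csv(io.StringIO(csv_text))

# to_json is another member of the to_* family
json_text = countries.to_json(orient="records")

print(csv_text)
print(restored)
print(json_text)
```

Passing a file name instead, for example `countries.to_csv("countries.csv", index=False)`, writes to disk in the same way.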



### Analysis and Visualization
Basic statistics (mean, median, min, max, counts…) are easily calculable. These or custom aggregations can be applied on the entire data set, a sliding window of the data or grouped by categories. The latter is also known as the split-apply-combine approach.
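
As a minimal sketch of split-apply-combine, `groupby` splits the rows by a category, an aggregation is applied to each group, and the results are combined into one Series. The `toy` table below is made up; only its column names mirror the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0],
    "Fare": [71.0, 53.0, 7.0, 8.0, 8.0],
})

# Statistics over the entire data set
print(toy["Fare"].mean(), toy["Fare"].median())

# Split by class, apply the mean to each group, combine into one result
survival_by_class = toy.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

The mean of a 0/1 column is the fraction of ones, so `survival_by_class` holds the survival rate per class.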

Pandas can plot your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter, bar, boxplot,…) that suits your data.

![](https://pandas.pydata.org/docs/_images/04_plot_overview.svg)


# Pandas Data Structures

The core components of Pandas are **DataFrame** and **Series**. 

![](https://www.kdnuggets.com/wp-content/uploads/pandas-02.png)



In [2]:
data = {"Country":["The Netherlands", "Belgium","India","Brazil"],
       "Capital":["Amsterdam", "Brussels","New Delhi","Brasilia"],
       "Population":[17000000,  11122233, 1300000000, 207855000]}
df = pd.DataFrame(data, columns=['Country','Capital','Population'])
df

|   | Country | Capital | Population |
|---|---|---|---|
| 0 | The Netherlands | Amsterdam | 17000000 |
| 1 | Belgium | Brussels | 11122233 |
| 2 | India | New Delhi | 1300000000 |
| 3 | Brazil | Brasilia | 207855000 |
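
The relationship between the two structures shows up when selecting columns: a single column is a Series, while a list of columns gives a DataFrame. A small sketch using a shortened version of the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["The Netherlands", "Belgium"],
    "Population": [17000000, 11122233],
})

population = df["Population"]      # a single column is a Series
subset = df[["Country"]]           # a list of column names gives a DataFrame

print(type(population).__name__, type(subset).__name__)
print(population.max())
```

Note the double brackets in `df[["Country"]]`: the inner pair is an ordinary Python list of column names.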


### 4.1 Print summary statistics of the dataset

In [5]:
titanic.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


# Data Exploration and Summarization

In [7]:
titanic['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [None]:
titanic['Survived']==1

In [None]:
titanic[titanic['Survived']==1]

In [None]:
titanic[titanic['Survived']==1][['Name','Age','Sex']]

### 4.2 Find average age of people who survived

In [8]:
titanic.loc[titanic['Survived']==1, 'Age'].mean()

28.343689655172415

In [9]:
titanic.loc[titanic['Survived']==0, 'Age'].mean()

30.62617924528302

### 4.3 Find how many people whose name contains 'Mark' survived

In [None]:
titanic[titanic['Name'].str.contains("Mark")]

### 4.4 Find survival rate of all the people who are either female or child (age<12)

In [None]:
titanic['Sex']=='female'

In [None]:
(titanic['Sex']=='female') | (titanic['Age']<12)
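
A combined mask like this can be passed to `.loc` to pick rows, and the mean of the 0/1 `Survived` column then gives the survival rate. A sketch on a made-up table (the `toy` DataFrame and its values are for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male"],
    "Age": [30.0, 8.0, 45.0, None, 22.0],
    "Survived": [0, 1, 0, 1, 0],
})

# | is element-wise OR; each comparison needs its own parentheses
female_or_child = (toy["Sex"] == "female") | (toy["Age"] < 12)

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rate = toy.loc[female_or_child, "Survived"].mean()
print(female_or_child.tolist(), rate)
```

A missing age compares as `False` in `toy["Age"] < 12`, so the fourth passenger is selected only because of her sex.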

### 4.5 Find the opposite

## Modifying the Dataframe

### 4.6 Add new column based on other columns

Create a new column so that we have information about whether a person is a child, or an older male or a female.

In [None]:
# This block shows a function that returns a value, but it doesn't work with a dataframe

def male_female_child(age, sex):
    if age<12:
        return 'child'
    else:
        return sex

In [None]:
male_female_child(15, 'male')

In [None]:
male_female_child(11, 'male')

In [None]:
male_female_child(16, 'female')

In [None]:
def male_female_child(passenger):
    # Change the function so that it creates the new column
    return None


# The function should be called like the following. You can use the same line to run the function.

titanic[['Age','Sex']].apply(male_female_child, axis=1) 
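
With `axis=1`, `apply` calls the function once per row and passes that row as a Series, so the function has to read its inputs from the row by column name. One possible shape for such a function, sketched on a made-up table (the name `label_passenger` is hypothetical and the `toy` values are invented):

```python
import pandas as pd

def label_passenger(passenger):
    """Return 'child' for passengers under 12, otherwise their sex.

    `passenger` is one row, received as a Series holding 'Age' and 'Sex'.
    """
    if passenger["Age"] < 12:
        return "child"
    return passenger["Sex"]

toy = pd.DataFrame({
    "Age": [15.0, 11.0, None],
    "Sex": ["male", "male", "female"],
})

# axis=1 applies the function to each row; the result becomes a new column
toy["person"] = toy[["Age", "Sex"]].apply(label_passenger, axis=1)
print(toy)
```

A missing age is never less than 12, so passengers without an age fall through to their sex.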

### 4.7 Find the survival rate of men above the age of 40 (excluding 40)

Work with your neighbours to find a solution to this. The final answer should be a percentage.

This can also be rephrased as "If you were a man above the age of 40, what was your percentage chance of surviving the Titanic disaster?"
or
"What percentage of men older than 40 aboard the Titanic survived?"

In [None]:
# Show the survival data

###  4.8 Find the survival rate of people in the first class

Can you conclude anything based on this?

In [None]:
# Show the survival data

# 5. Visualization

Visualizations help us get a more detailed yet crisp picture of the dataset. 

### Plot a histogram of Ages

In [None]:
titanic['Age'].plot(kind='hist')

### Plot a histogram of Fares

# Introducing Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

![](https://livecodestream.dev/post/how-to-build-beautiful-plots-with-python-and-seaborn/featured.jpg)


### Import (if required, install) Seaborn

In [None]:
# %pip install seaborn

In [None]:
import seaborn as sns

### Show the count of men, women and children across each passenger class

In [None]:
titanic['person'] = titanic[['Age','Sex']].apply(male_female_child, axis=1) 

sns.catplot(x='Pclass', data=titanic, kind='count', hue='person')

#### Alternate way of looking at the same data

Notice how the cross-tabulation function gives us a crisp perspective on data that is in categorical form.

In [None]:
pd.crosstab(index = titanic['Pclass'], columns = titanic['person'])

In [None]:
pd.crosstab(index = titanic['Pclass'], columns = titanic['person']).plot(kind='bar', stacked=True)

### Show a swarmplot showing ages of people who survived

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
# Put your code here
plt.show()

### Show bar-charts indicating number of people across each class, further divided across survival status

In [None]:
pd.crosstab( titanic.Survived ,titanic.Pclass).plot(kind="bar")

### Find out whether wealth determined survival

A box plot is another way to understand a distribution. It shows where the median (the exact middle of the data) and the 1st and 3rd quartiles lie, and it draws outliers as individual points beyond the whiskers rather than hiding them. It is a crisp yet detailed way to understand the data.
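
The quantities a box plot draws can also be computed directly. The sketch below uses made-up fares and the 1.5 × IQR whisker rule that Matplotlib applies by default:

```python
import pandas as pd

# Made-up fares; one value is deliberately extreme
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 53.1, 512.3])

q1 = fares.quantile(0.25)       # 1st quartile: bottom edge of the box
median = fares.quantile(0.50)   # line inside the box
q3 = fares.quantile(0.75)       # 3rd quartile: top edge of the box
iqr = q3 - q1                   # interquartile range: height of the box

# Points further than 1.5 * IQR from the box are drawn individually as outliers
outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]
print(q1, median, q3)
print(outliers.tolist())
```

Here only the largest fare falls outside the whiskers, so it would appear as a single point above the box.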

In [None]:
plt.figure(figsize=(12,10))
# Create a boxplot with x=survived, y=fare
plt.xticks(rotation=45)
plt.show()

### Back to the original question

Use countplot to show how many men, women and children survived or didn't survive.

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x='person', hue='Survived', data=titanic)
plt.show()

### Point Plot

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable. The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.

### Create a point plot between Passenger class and Survived

### Create a point plot between Passenger class and Survived grouped by Sex

### Create a linear regression model based plot between Age and Survival

Note: Linear regression is not best suited for these problems and we will look at alternatives later, however this indicates the general trend in the data that we wish to find. 

In [None]:
sns.lmplot(x='Age', y='Survived', data=titanic, hue='Sex')

## Task: Is it better to travel with family?

Create a new column called **Alone**. If the person is a parent travelling with a child, a child travelling with a parent, or someone travelling with a sibling or spouse, they are travelling with family; otherwise they are travelling alone. The column should contain the value "with family" if someone is travelling with parents, children, siblings, or a spouse, and "alone" if travelling alone.

In [None]:
# Modify this line. It might require more than one line of code

titanic['Alone'] = None

In [None]:
sns.lmplot(x='Age', y='Survived', data=titanic, hue='Alone')

### Does the city where you embarked have any impact?

In [None]:
plt.figure(figsize=(12,10))
# Complete the following line
sns.catplot()
plt.show()

# Conclusion

Classroom discussion on what we can conclude from this analysis of the dataset.

- 