# Pandas for Exploratory Data Analysis 
Author: Danny Malter

Lesson plan can be found on my GitHub profile - @danmalter <br>
https://github.com/danmalter/Malter-Analytics


Pandas is a popular Python library for exploratory data analysis (EDA).  The functions within the pandas library are important for understanding, formatting and preparing data.

This notebook will cover:
 - How to read in a dataset with Pandas

## About the Dataset: Titanic

Let's take a closer look at the Titanic table [data dictionary](https://www.kaggle.com/c/titanic), which is a description of the fields (columns) in the table (the .csv file we will import below):

**Variable** - Definition; Key
- **survival** - Survival; 0 = No, 1 = Yes
- **pclass** - Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
- **sex** - Sex
- **Age** - Age in years
- **sibsp** - # of siblings / spouses aboard the Titanic
- **parch** - # of parents / children aboard the Titanic
- **ticket** - Ticket number
- **fare** - Passenger fare
- **cabin** - Cabin number
- **embarked** - Port of Embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

## Importing Pandas

To [import a library](https://docs.python.org/3/reference/import.html), we write `import` and the library name. For Pandas, it is common to alias the library as `pd`, so that when we reference a function from the Pandas library, we only write `pd` to reference the aliased [namespace](https://docs.python.org/3/tutorial/classes.html#python-scopes-and-namespaces) -- not `pandas`.

In [4]:
import pandas as pd

## Reading in Data

Pandas dramatically simplifies the process of reading in data. When we say "reading in data," we mean loading a file into our machine's memory.

When you have a CSV, for example, and then you double-click to open it in Microsoft Excel, the open file is "read into memory." You can now manipulate the CSV.

When we read data into memory in Python, we are creating an object. We will soon explore this object.

Typically, if we are working with a CSV, we use the [read CSV](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. In this example, we'll read the Titanic data from a local file, `titanic.csv`.

A [delimiter](https://en.wikipedia.org/wiki/Delimiter-separated_values) is a character that separates fields (columns) in the imported file. Just because a file says `.csv` does not necessarily mean that a comma is used as the delimiter. If a file used a tab character instead, we would pass `sep='\t'` to tell pandas to 'cut' the columns every time it sees a [tab character in the file](http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character). In this case, our file is comma-delimited, so we pass `sep=','`, which is also the default.

In [13]:
df = pd.read_csv('titanic.csv', sep=',')
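
To see why the `sep` argument matters, here is a minimal sketch using small in-memory samples instead of a file on disk. The sample strings and the variable names (`tab_text`, `comma_text`, and so on) are made up for illustration; only `pd.read_csv` and `io.StringIO` are real library calls.

```python
import io

import pandas as pd

# Hypothetical samples: the same two rows, tab-delimited and comma-delimited
tab_text = "PassengerId\tSurvived\tPclass\n1\t0\t3\n2\t1\t1\n"
comma_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# Tell pandas which character separates the columns
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")
comma_df = pd.read_csv(io.StringIO(comma_text))  # sep="," is the default

# The wrong separator does not raise an error here;
# the whole line is simply read as one wide column
wrong_df = pd.read_csv(io.StringIO(tab_text), sep=",")

print(tab_df.shape)    # 2 rows, 3 columns
print(wrong_df.shape)  # 2 rows, 1 column
```

Checking `.shape` right after reading a file is a quick way to catch a mismatched delimiter.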

*Documentation Pause*

How did we know how to use `pd.read_csv`? Let's take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Note the first (and only required) argument, `filepath_or_buffer`.
> Take a moment to dissect other arguments and options when reading in data.
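
As one possible starting point for that exploration, the sketch below tries three other `read_csv` arguments: `usecols`, `index_col`, and `nrows`. The in-memory sample `csv_text` is an assumption for illustration (its values mirror the first Titanic rows); with a real file you would pass the file path instead.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for a CSV file on disk
csv_text = (
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22.0\n"
    "2,1,1,female,38.0\n"
    "3,1,3,female,26.0\n"
)

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["PassengerId", "Survived", "Age"],  # load only these columns
    index_col="PassengerId",                     # use this column as the row index
    nrows=2,                                     # read only the first two data rows
)

print(subset)
```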

We have just created a data structure called a `DataFrame`. See?

In [14]:
type(df)

pandas.core.frame.DataFrame

## Inspecting our DataFrame: The basics

We'll now perform basic operations on the DataFrame, denoted with comments.

In [15]:
# print the first 5 rows
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Notice that `.head()` is a method (denoted by parentheses), so it can take arguments, such as the number of rows to return.
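
A short sketch of that idea, using a small made-up DataFrame named `sample` rather than the Titanic data: `.head()` and `.tail()` are methods that accept a row count, while `.shape` is an attribute and is written without parentheses.

```python
import pandas as pd

# Hypothetical 8-row frame, used only to keep the example self-contained
sample = pd.DataFrame({"row": range(8), "value": list("abcdefgh")})

default_head = sample.head()  # no argument: returns the first 5 rows
first_two = sample.head(2)    # first 2 rows
last_two = sample.tail(2)     # last 2 rows

# .shape is an attribute, not a method, so it takes no parentheses
n_rows, n_cols = sample.shape

print(first_two)
print(last_two)
print(n_rows, n_cols)
```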

## Writing the Dataset to a Folder

To write a file back out, we can use the [to CSV](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) method.

In [9]:
df.to_csv('../read-data-with-pandas/titanic_new.csv', header=True)
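
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below compares writing with and without `index=False`; the frame `sample` and the temporary file names are assumptions for illustration, not part of the lesson's files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the Titanic DataFrame
sample = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    # Default behavior: the index is written as an extra, unnamed column
    sample.to_csv(with_index, header=True)
    # index=False leaves the index out of the file
    sample.to_csv(without_index, header=True, index=False)

    reread_with_index = pd.read_csv(with_index)
    reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

Whether to keep the index depends on whether it carries meaning; for a default 0, 1, 2, ... index, `index=False` usually gives a cleaner round trip.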