# Workshop PR01: Setting up, an intro to pandas, & data basics

## Learning Objectives
* Set up a working Anaconda3 environment
* Sign up to Kaggle
* Understand the terminology we use to talk about data sets
* Gain a cursory understanding of coding with the `pandas` module

## Setting up your Anaconda3
1. **Install Anaconda3**. Download Anaconda (Python 3.7) from https://www.anaconda.com/distribution/ for whichever operating system you're using.
2. **Install `pandas`**. Open Anaconda3 navigator, and open the console (Environments --> root). Type `conda install pandas` into the console, and wait for it to install.
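3. **Install `scikit-learn`**. Same as step 2, but type `conda install scikit-learn`. (The package is imported in Python as `sklearn`, but its conda package name is `scikit-learn`.)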
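Once the installs have finished, a quick sanity check is to ask Python whether it can find both packages. The sketch below is one way to do that using only the standard library; the helper name `check_installed` is our own invention, not part of Anaconda or `pandas`. It assumes the import names `pandas` and `sklearn`.

```python
import importlib.util


def check_installed(module_names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # "pandas" and "sklearn" are the names used in `import` statements
    for name, found in check_installed(["pandas", "sklearn"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either package is reported as missing, re-run the corresponding `conda install` step in the same environment.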

## Sign up to Kaggle
We use a lot of data sets provided by Kaggle (https://www.kaggle.com/), including the one we'll be using in the next session. Sign up to Kaggle so you've got access to these.

## Intro to pandas
In this section, we'll take a look at using `pandas` to read in and mess around with some data. There are two things we're trying to do here: (1) get familiar with some must-know `pandas` syntax, and (2) pick up some of the terminology that gets used to talk about data.

We'll be using the data from the Kaggle competition, "Titanic: Machine Learning from Disaster" (https://www.kaggle.com/c/titanic). **Download 'train.csv' and put it in the same directory as this Jupyter notebook.**

### The pandas dataframe

If we want to use Python to take a look at our Titanic data, then we need to get the data into some kind of in-memory Python object. We're going to use the `DataFrame` provided by the `pandas` module.

Think of a `DataFrame` as pretty much the same thing as a table (like in Excel). Columns correspond to features (e.g. age, weight) and rows to data objects (e.g. people).
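To make the table analogy concrete, here is a minimal sketch that builds a small `DataFrame` by hand from a dictionary of columns. The values are those of the first three passengers in the Titanic training data; the variable name `toy_df` is ours, chosen just for this illustration.

```python
import pandas as pd

# each dictionary key becomes a column; each list position becomes a row
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # the column names
```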

Let's start out by **reading the data into a pandas dataframe**.

```Python
import pandas as pd
```

```Python
# read the data from a .csv into an in-memory dataframe
# (change this path to wherever you saved train.csv)
path_to_data = "../../kg-data/titanic/train.csv"
df = pd.read_csv(path_to_data)
```

```Python
# take a look at the first few rows
df.head()
```

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Aside: Data set terminology
* The rows, each of which contains data pertaining to a single real-world object of interest (here, a passenger), are referred to as **instances**.
* The columns, containing data that describe each of the instances (e.g. "Pclass", "Sex", "Age"), are referred to as **features**.
* In the context of using data to build predictive models or classifiers, the property (column) that we're trying to predict/classify is called the **label**. In this competition, the label we're interested in predicting is "Survived".
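In code, this terminology usually shows up as a split between a label column and the remaining feature columns. The following is a rough sketch of that split on three rows taken from the Titanic training data; the names `toy_df`, `label` and `features` are our own choices for illustration.

```python
import pandas as pd

# three instances (rows) from the first few rows of train.csv
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

label = toy_df["Survived"]                    # the column we want to predict
features = toy_df.drop(["Survived"], axis=1)  # every other column

print(len(toy_df))             # number of instances
print(list(features.columns))  # names of the features
```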

### Stuff you can do with a `pandas` dataframe

There are tonnes of things you can do with the `DataFrame`. We've listed a few of the most common things here, but check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html if you want to know more.

#### Summarize the data
```Python
df.describe() # generates summary statistics over the data
df.info() # prints dataframe metadata (e.g. data types of each column)
df["Survived"].unique() # lists all the unique values in a column
df["Survived"].value_counts() # counts up the number of times each unique value occurs
```
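As a small worked example of the summary calls above, the sketch below runs `value_counts()` on the "Survived" values of the five passengers shown by `df.head()`. Passing `normalize=True` turns the counts into fractions of the total. `toy_df` is a name made up for this illustration.

```python
import pandas as pd

# "Survived" values for the first five passengers in train.csv
toy_df = pd.DataFrame({"Survived": [0, 1, 1, 1, 0]})

counts = toy_df["Survived"].value_counts()                   # how often each value occurs
fractions = toy_df["Survived"].value_counts(normalize=True)  # same, as a share of all rows

print(counts)
print(fractions)
```

On these five rows, three passengers have `Survived == 1` and two have `Survived == 0`, so the fractions come out as 0.6 and 0.4.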

#### Checking for NaNs
```Python
df["Cabin"].isnull().sum() # counts the missing (NaN) values in the "Cabin" column
```
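The same idea extends to the whole table: calling `isnull().sum()` on the dataframe itself gives one count per column. Here is a minimal sketch on the "Age" and "Cabin" values of the five passengers shown by `df.head()`; `toy_df` and the two result names are ours, not part of `pandas`.

```python
import pandas as pd

# None is treated as a missing value, just like NaN
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
})

missing_per_column = toy_df.isnull().sum()           # missing-value count for every column
missing_fraction = toy_df["Cabin"].isnull().mean()   # share of rows where "Cabin" is missing

print(missing_per_column)
print(missing_fraction)
```

Taking the mean of a boolean column works because `True` counts as 1 and `False` as 0.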

#### Selecting data
```Python
survived = df["Survived"] # to select a single column (returns Series, not DataFrame)
age_sex = df[["Age", "Sex"]] # to select multiple columns (returns DataFrame)
```
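Selecting rows works a little differently from selecting columns: a common approach is to build a boolean mask and index with it. The sketch below shows this on the five passengers from `df.head()`; the names `toy_df`, `is_female`, `females` and `over_30_ages` are made up for this example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

is_female = toy_df["Sex"] == "female"  # boolean Series: True where the condition holds
females = toy_df[is_female]            # keep only the rows where the mask is True

# filter rows and pick a single column in one step
over_30_ages = toy_df.loc[toy_df["Age"] > 30, "Age"]

print(females)
print(over_30_ages)
```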

#### Creating new columns
```Python
df["num_family_members"] = df["SibSp"] + df["Parch"] # create new column by summing two existing ones
df["sex_as_number"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1) # create new column by applying the lambda function to each value in "Sex"
```
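One possible alternative to the lambda is `Series.map` with a dictionary, sketched below on a three-row toy column (`toy_df` is a made-up name). Note the behavioural difference: the lambda above sends every value other than "male" to 1, whereas `map` leaves NaN for any value that is not a key in the dictionary.

```python
import pandas as pd

toy_df = pd.DataFrame({"Sex": ["male", "female", "female"]})

# look each value up in the dictionary; values without a matching key become NaN
toy_df["sex_as_number"] = toy_df["Sex"].map({"male": 0, "female": 1})

print(toy_df)
```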

#### Dropping existing columns
```Python
df = df.drop(["Name"], axis=1) # drop the "Name" column from the data set
```