# Pandas & Dataframe Basics

- API Reference: https://pandas.pydata.org/docs/reference/index.html

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

## 1. Dataframes

### 1a. Download and import dataset with pandas
- `pd.read_csv(file_path)`
- Download the data from: https://www.kaggle.com/c/titanic
    - `test.csv`
    - `train.csv`
    - `gender_submission.csv`
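
As a minimal sketch of what `pd.read_csv` does, the block below reads a tiny in-memory CSV instead of the downloaded files, so it runs without the Kaggle data. The three rows and the name `csv_text` are made up for illustration; with the real files you would pass a path such as `'titanic/train.csv'`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for train.csv (made-up values)
csv_text = """PassengerId,Survived,Pclass,Fare
1,0,3,7.25
2,1,1,71.28
3,1,3,7.92
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 4)
```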

In [None]:
# Read csv file


### 1b. Show Data
- `dataframe.head(n)`: returns the first n rows of a dataframe
- `dataframe.tail(n)`: returns the last n rows of a dataframe
- by default, both return 5 rows
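
A short sketch of both methods on a made-up ten-row dataframe (the values are illustrative, not taken from the Titanic files):

```python
import pandas as pd

# Made-up frame with ten rows
df = pd.DataFrame({'PassengerId': range(1, 11),
                   'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2]})

first_five = df.head()   # no argument: first 5 rows
last_three = df.tail(3)  # explicit argument: last 3 rows
print(first_five)
print(last_three)
```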

In [None]:
# dataframe.head()


In [None]:
# dataframe.tail()


### 1c. Count Values
- `series.value_counts()`
- Counts how many times each distinct value occurs in a **Series**
- Select a single column (a Series) from the dataframe, then call this method on it
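
The sketch below shows the column-to-Series step and the count, using a made-up `Pclass` column rather than the real data:

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 1]})

pclass = df['Pclass']            # selecting one column gives a Series
counts = pclass.value_counts()   # most frequent value first
print(type(pclass).__name__)     # Series
print(counts)
```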

In [None]:
# What is a Series?
# A dataframe is composed of multiple Series, one per column


In [None]:
# number of occurrences


In [None]:
"""
Try this:
Count the number of occurrences of each value in 'Embarked'
"""


### 1d. Creating a Dataframe with lists
- `pd.DataFrame(list or array, columns=column names)`
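
A minimal sketch, assuming a nested list where each inner list is one row. The values and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Each inner list is one row: 2 rows, 3 values per row (made-up numbers)
rows = [[1, 22, 7.25],
        [2, 38, 71.28]]
print(np.array(rows).shape)  # (2, 3)

# The number of column names must match the number of values per row
people = pd.DataFrame(rows, columns=['Id', 'Age', 'Fare'])
print(people)
```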

In [None]:
# list shape


In [None]:
# creating a dataframe with list


### 1e. Creating a Dataframe with dictionaries
- `pd.DataFrame(dictionary)`
- A dictionary already maps keys to values, so no separate list of column names is needed
- key : column name
- value: data
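
As a sketch of the key-to-column mapping, with a made-up dictionary named `fruit`:

```python
import pandas as pd

# Keys become column names, values become the column data (made-up values)
fruit = {'Name': ['apple', 'banana', 'melon'],
         'Price': [1.0, 0.5, 3.0]}

fruit_df = pd.DataFrame(fruit)
print(fruit_df)
```

All value lists need the same length, otherwise pandas raises an error.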

## 2. Manipulating Dataframes

### 2a. Adding Columns to an existing Dataframe
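
Assigning to a column name that does not exist yet creates that column. The sketch below uses a made-up `Age` column; the names `Zero` and `Note` and the `+ 1` rule for `Age_new` are hypothetical choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0, 26.0]})  # made-up values

df['Zero'] = 0                 # a single value is repeated for every row
df['Note'] = ''                # same idea with an empty string
df['Age_new'] = df['Age'] + 1  # computed row by row from an existing column
print(df)
```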

In [None]:
# Fill new column with 0


In [None]:
# Fill new column with an empty string


In [None]:
# Creating new columns from an existing column


In [None]:
"""
Try This:
Create a new column 'Fare_DC' which has 10 % discounted values from the 'Fare' column
"""



### 2b. Updating rows in an existing Dataframe
- Assign the result back to the same column: original column = original column + something
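
A minimal sketch of the pattern, assuming a made-up `Age` column and an arbitrary increase of 10:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 38.0]})  # made-up values

# Overwrite the column with a value computed from itself
df['Age'] = df['Age'] + 10
print(df)
```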

### 2c. Deleting rows or columns in an existing Dataframe
- If deleting a column, axis = 1
- `dataframe.drop(column_name, axis = 1)`

In [None]:
# deleting column 'Age_new'


- If deleting a row, axis = 0 (default)
- `dataframe.drop(index_label, axis = 0)`
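
The sketch below drops one column and one row from a made-up dataframe. Note that `drop` returns a new dataframe and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                   'Age_new': [23.0, 39.0, 27.0]})

no_col = df.drop('Age_new', axis=1)  # remove a column by name
no_row = df.drop(0, axis=0)          # remove the row whose index label is 0

print(no_col.columns.tolist())  # ['Age']
print(no_row.index.tolist())    # [1, 2]
print(df.shape)                 # (3, 2), the original is untouched
```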

In [None]:
# deleting the first row


In [None]:
"""
Try This:
Delete the column 'Fare_DC'
"""


## 3. Indexing in Dataframe
- Index: the labels that identify each row (and each column) of a dataframe; row labels default to integers starting at 0
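
A short sketch of how to inspect the index, using a made-up three-row dataframe instead of the Titanic data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Age': [22, 38, 26]})  # made-up values

df.info()                     # column names, non-null counts, dtypes
print(df.index)               # RangeIndex(start=0, stop=3, step=1)
print(df.columns.tolist())    # ['Name', 'Age']

# Use a row's index label together with a column name to read one value
second_name = df.loc[1, 'Name']
print(second_name)            # B
```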


In [None]:
data = pd.read_csv('titanic/train.csv')
data.head(5)

In [None]:
# get info of data


In [None]:
# get index


In [None]:
# get the element at a certain position of a list
lst = ['apple', 'banana', 'melon']


In [None]:
# get certain value from a certain row by indexing


- `dataframe.reset_index()`
- Resets the index to the default 0, 1, 2, ... and moves the old index into a new column
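
The effect is easiest to see after filtering, when the remaining index labels have gaps. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 38, 26]})  # made-up values

subset = df[df['Age'] > 25]                  # keeps index labels 1 and 2
renumbered = subset.reset_index()            # old labels go to a column named 'index'
dropped = subset.reset_index(drop=True)      # old labels are discarded

print(subset.index.tolist())       # [1, 2]
print(renumbered.columns.tolist()) # ['index', 'Age']
print(dropped.index.tolist())      # [0, 1]
```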

In [None]:
data.head()

In [None]:
data.reset_index()

## 4. Data Selection / Filtering

In [None]:
"""
Try importing the titanic train.csv file with pandas
"""


- Data Selection with "[]" operator
- `dataframe[column_name]`
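
A minimal sketch of the two forms of `[]` selection, on a made-up dataframe with Titanic-style column names:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1, 1],
                   'Pclass': [3, 1, 3],
                   'Age': [22.0, 38.0, 26.0]})  # made-up values

one = df['Pclass']                # one name: returns a Series
two = df[['Survived', 'Pclass']]  # list of names: returns a DataFrame

print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame
```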

In [None]:
# Get certain column from a dataframe
# Returns a series type


In [None]:
# Get multiple columns from a dataframe
# Returns a dataframe type


- A column cannot be selected by its position number with `[]`

In [None]:
# If called with a number, raises a KeyError (no column is named 0)
data[0]

- However, slicing with numbers is possible and selects rows
- `dataframe[0:3]`
    - Getting the first three rows of a dataframe
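
The sketch below contrasts a row slice with a single number inside `[]`, using a made-up five-row dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1, 1, 1, 0],
                   'Pclass': [3, 1, 3, 1, 3],
                   'Age': [22.0, 38.0, 26.0, 35.0, 35.0]})  # made-up values

first_three = df[0:3]                     # rows at positions 0, 1, 2
picked = df[['Survived', 'Pclass']][0:3]  # choose columns, then slice rows

try:
    df[0]  # a single number is treated as a column name
except KeyError:
    print('KeyError: there is no column named 0')
```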

In [None]:
# Get the first three rows of columns 'Survived', 'Pclass'


- Boolean indexing is also possible
    - A condition produces a True or False value for each row, and only the rows marked True are kept
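
A sketch of the two steps (build the True/False mask, then filter with it), on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 3, 2],
                   'Age': [22.0, 38.0, 26.0, 35.0]})  # made-up values

mask = df['Pclass'] == 3   # Series of True/False, one per row
third = df[mask]           # keeps only the rows where mask is True

print(mask.tolist())         # [True, False, True, False]
print(third.index.tolist())  # [0, 2]
```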

In [None]:
# Get rows with Pclass having value of 3


In [None]:
"""
Try this:
Get the first 10 rows where Pclass equals 2
"""


**Indexing functions**
- `loc`: Label-based indexing
    - Refers to rows and columns by their labels (index labels and column names)
- `iloc`: Position-based indexing
    - Refers to rows and columns by their integer positions
    - `dataframe.iloc[row, column]`
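
A minimal sketch of both accessors on a made-up dataframe with the default index. One difference worth noticing: an `iloc` slice excludes its end position, while a `loc` slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Age': [22, 38, 26]})  # made-up values

by_pos = df.iloc[0, 0]         # row position 0, column position 0
by_label = df.loc[0, 'Name']   # row label 0, column label 'Name'

pos_slice = df.iloc[0:2, 1]        # positions 0 and 1 (end excluded)
label_slice = df.loc[0:2, 'Age']   # labels 0, 1 and 2 (end included)

print(by_pos, by_label)                  # A A
print(len(pos_slice), len(label_slice))  # 2 3
```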

In [None]:
# iloc only allows integer positions as inputs, so passing the label 'Name' raises an error
data.iloc[0, 'Name']

In [None]:
# loc allows both integers and strings as inputs
# however, an integer passed to loc is not a position but the index label of a row



In [None]:
test = {'Color': ['Yellow', 'Red', 'Green'],
        'Name': ['one', 'two', 'three'],
        'Year': [1999, 1909, 2011]
       }
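
One way to use this dictionary is to build a dataframe whose index labels differ from the row positions, which makes the difference between `loc` and `iloc` visible. The index labels `10, 20, 30` are a hypothetical choice for illustration.

```python
import pandas as pd

test = {'Color': ['Yellow', 'Red', 'Green'],
        'Name': ['one', 'two', 'three'],
        'Year': [1999, 1909, 2011]}

# Hypothetical index labels, chosen so that labels and positions differ
test_df = pd.DataFrame(test, index=[10, 20, 30])

from_label = test_df.loc[10, 'Color']  # row labelled 10
from_pos = test_df.iloc[0, 0]          # first row, first column
print(from_label, from_pos)            # Yellow Yellow

try:
    test_df.loc[0, 'Color']  # no row is labelled 0
except KeyError:
    print('KeyError: loc looks up labels, not positions')
```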



**Boolean indexing**

In [None]:
# Get passengers over age 60



**Operators**
- `<`, `>`
    - less than / greater than
- `<=`, `>=`
    - less than or equal to / greater than or equal to

In [None]:
myage = 10
yourage = 20
hisage = 20
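
Using the three ages defined above (repeated here so the block runs on its own), each comparison evaluates to a single True or False:

```python
myage = 10
yourage = 20
hisage = 20

younger = myage < yourage      # 10 < 20
older = yourage > hisage       # 20 > 20
at_least = yourage >= hisage   # 20 >= 20
at_most = yourage <= myage     # 20 <= 10

print(younger, older, at_least, at_most)  # True False True False
```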



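- `&`, `and`
    - and operator: use `&` between dataframe conditions (wrap each condition in parentheses); `and` only works on single True/False values
- `|`, `or`
    - or operator: use `|` between dataframe conditions; `or` only works on single True/False values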
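
A sketch of combining two conditions on a made-up dataframe. The thresholds are arbitrary; the point is that `&` and `|` work row by row, while `and` between two Series raises a `ValueError`.

```python
import pandas as pd

df = pd.DataFrame({'Age': [65.0, 30.0, 70.0, 25.0],
                   'Pclass': [1, 1, 3, 2]})  # made-up values

both = df[(df['Age'] > 60) & (df['Pclass'] == 1)]    # both conditions True
either = df[(df['Age'] > 60) | (df['Pclass'] == 1)]  # at least one True

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 1, 2]

try:
    (df['Age'] > 60) and (df['Pclass'] == 1)
except ValueError:
    print('ValueError: use & instead of and for row-wise conditions')
```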

In [None]:
"""
Try this:
Get passengers with 'Age' older than 50 and 'Pclass' equal to 1
"""

In [None]:
data[(data['Age'] > 50) & (data['Pclass'] == 1)]