# Pandas - Data Analysis Library

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

## Importing Pandas

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.__version__

## Series

In [None]:
pd.Series(data = [7,2,3,4])

In [None]:
pd.Series(np.random.randn(1500))

## DataFrame

In [None]:
pd.DataFrame(data = {"Nama" : ["Selly", "Emir"], "Umur": [12, 13]}) # build the DataFrame from a dictionary

| | Nama | Umur |
|---|---|---|
| 0 | Selly | 12 |
| 1 | Emir | 13 |


![](series-and-dataframe.width-1200.png)

In [None]:
array = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index= ['First', 'Second', 'Third'],
                     columns = ['Team A', 'Team B', 'Team C'])
array

## Creating DataFrame from dictionary

In [None]:
ta = {'Tarantula': ['Alam', 'Ansyar', 'Rasfi'], 'Anyelir': ['Ipe', 'Intan', 'Chaca']}
ta

In [None]:
ta1 = pd.DataFrame(ta)
ta1

In [None]:
ta2 = pd.DataFrame.from_dict(ta, orient='columns') # from_dict takes the dictionary `ta`, not the DataFrame `ta1`
ta2

Specify orient='index' to create the DataFrame using dictionary keys as rows:

In [None]:
ta3 = pd.DataFrame.from_dict(ta, orient='index') # dictionary keys become the rows
ta3

When using `orient='index'`, the column names can be specified manually:

In [None]:
ta4 = pd.DataFrame.from_dict(ta, orient='index',
                       columns=['Ketua', 'Wakil', 'Sekretaris'])
ta4
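The two orientations can be compared side by side. The following is a minimal sketch that rebuilds the same dictionary under the hypothetical name `teams`, so it runs on its own; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical copy of the `ta` dictionary used above.
teams = {"Tarantula": ["Alam", "Ansyar", "Rasfi"],
         "Anyelir": ["Ipe", "Intan", "Chaca"]}

# orient="columns": each dictionary key becomes a column.
by_columns = pd.DataFrame.from_dict(teams, orient="columns")

# orient="index": each dictionary key becomes a row label,
# and the column names can be supplied explicitly.
by_index = pd.DataFrame.from_dict(teams, orient="index",
                                  columns=["Ketua", "Wakil", "Sekretaris"])

print(by_columns.shape)  # (3, 2)
print(by_index.shape)    # (2, 3)
print(by_index.loc["Tarantula", "Ketua"])  # Alam
```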

We can also rename the columns:

In [None]:
ta4.columns

In [None]:
ta4.columns.values[0]

In [None]:
ta4.columns = ["Sekretaris", "Ketua", 'Wakil']
ta4

In [None]:
ta4.columns.values[0]

In [None]:
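# rename through the DataFrame API; writing into columns.values mutates the Index in place and is not reliable
ta4 = ta4.rename(columns={ta4.columns[0]: "President"})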

In [None]:
ta4
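A single column can be renamed without retyping the whole list by passing a mapping to `DataFrame.rename`. This is a small sketch on a hypothetical one-row frame `df`; it also shows that `rename` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical frame with the same column names as ta4.
df = pd.DataFrame({"Ketua": ["Alam"], "Wakil": ["Ansyar"], "Sekretaris": ["Rasfi"]})

# Map old name -> new name; columns that are not listed keep their names.
renamed = df.rename(columns={"Ketua": "President"})

print(list(renamed.columns))  # ['President', 'Wakil', 'Sekretaris']
print(list(df.columns))       # ['Ketua', 'Wakil', 'Sekretaris']
```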

### Exercise 1

1. Create the following dataframe
![](ex00.png)

In [None]:
a = pd.DataFrame(data = {"Age": [24, 13, 53], 
                         "Location": ["New York", "Paris", "Berlin"], 
                         "Name": ["George", "Anna", "Peter"]})
a

2. Change "Location" into "City"

In [None]:
a = a.rename(columns={"Location": "City"}) # rename one column by name
a

3. Rename all columns to `No. Participant`, `Delegates From`, `Participant`.

In [None]:
a.columns

In [None]:
a.columns = ['No. Participant', 'Delegates From', 'Participant']
a

# Open CSV file

We will use the 2016 Uber drives dataset, which can be downloaded from Kaggle (https://www.kaggle.com/zusmani/uberdrives).

In [None]:
data = pd.read_csv("My Uber Drives - 2016.csv")
data

### Basic Operation

In [None]:
data.head() # show top 5

In [None]:
data.tail() # show the last 5 rows

In [None]:
data.shape

In [None]:
data.dtypes

### Convert data type

The output shows that `START_DATE*` and `END_DATE*` are stored as object type, even though they actually contain dates. We first practice type conversion on a small DataFrame.

In [None]:
data1 = pd.DataFrame({"Cost":["5","5","7"],
                      "Amount":[11,12,13],
                      "Date": ["11-10-2020","12-10-2020","13-10-2020"]})
data1

In [None]:
data1.dtypes

In [None]:
data1["Cost"] = pd.to_numeric(data1["Cost"])

In [None]:
data1["Date"] = pd.to_datetime(data1["Date"], format="%d-%m-%Y") # dates are day-month-year (13 cannot be a month)

In [None]:
data1

In [None]:
data1.dtypes

In [None]:
data1["Cost"] = data1["Cost"].map(str)

In [None]:
data1.dtypes

In [None]:
data1["Amount"] = data1["Amount"].astype(str)

In [None]:
data1.dtypes

In [None]:
data1['Amount'] = data1['Amount'].astype(int) # back into integer

In [None]:
data1.dtypes
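The conversions above assume that every value can be parsed. As a hedged sketch of what happens otherwise, the hypothetical frame `df` below contains one invalid entry (`"x"`); `pd.to_numeric` with `errors="coerce"` turns it into `NaN` instead of raising an error, which also makes the column a float column.

```python
import pandas as pd

# Hypothetical data: "x" cannot be interpreted as a number.
df = pd.DataFrame({"Cost": ["5", "5", "x"], "Amount": [11, 12, 13]})

# Invalid entries become NaN; the column dtype becomes float.
df["Cost"] = pd.to_numeric(df["Cost"], errors="coerce")

# astype(str) converts every integer to its text form.
df["Amount"] = df["Amount"].astype(str)

print(df["Cost"].isna().sum())  # 1
print(df.dtypes)
```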

#### Apply to our dataframe

In [None]:
# convert the column to datetime; this raises an error because the last row contains the text 'Totals'
pd.to_datetime(data["START_DATE*"], format='%m/%d/%Y %H:%M')

In [None]:
data.tail()

In [None]:
pd.to_datetime(data["START_DATE*"],
               format='%m/%d/%Y %H:%M', # month/day/year hour:minute
               errors = 'coerce') # values that cannot be parsed, such as 'Totals', become NaT instead of raising an error
# the result is not saved back into the DataFrame `data`

In [None]:
data.dtypes

In [None]:
data.tail()

Why is `START_DATE*` still of object type? Because `pd.to_datetime` returns a new Series; the DataFrame only changes once the result is assigned back to the column:

In [None]:
data["START_DATE*"] = pd.to_datetime(data["START_DATE*"],format='%m/%d/%Y %H:%M', errors = 'coerce')
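The same behaviour can be reproduced on a tiny frame. This sketch uses a hypothetical three-row DataFrame `trips` whose last value imitates the `Totals` row; the column only becomes datetime after the parsed Series is assigned back.

```python
import pandas as pd

# Hypothetical sample that mimics the START_DATE* column, including a non-date row.
trips = pd.DataFrame({"START_DATE*": ["1/1/2016 21:11", "12/31/2016 1:07", "Totals"]})

# to_datetime returns a new Series; 'Totals' is coerced to NaT.
parsed = pd.to_datetime(trips["START_DATE*"], format="%m/%d/%Y %H:%M", errors="coerce")

# At this point the DataFrame column still holds the original text.
dtype_before = trips["START_DATE*"].dtype

# Assigning the result back is what changes the column type.
trips["START_DATE*"] = parsed
dtype_after = trips["START_DATE*"].dtype

print(dtype_before, "->", dtype_after)
print(parsed.isna().sum())  # 1
```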

In [None]:
data.dtypes

In [None]:
data["END_DATE*"] = pd.to_datetime(data["END_DATE*"],format='%m/%d/%Y %H:%M', errors = 'coerce')

In [None]:
data.dtypes

In [None]:
data.tail()

### Dataset summarization

In [None]:
data.describe()  # generate descriptive statistics

In [None]:
data.describe(include='all')

In [None]:
data.info()

In [None]:
# count of unique start locations
data["START*"].value_counts()

### > Exercise 2

1. Create the following DataFrame with "Umur" stored as object type, then convert it to integer
![](ex1.png)

In [None]:
tim_a = pd.DataFrame({'Nama':['Ahmad', 'Joko', 'Adi'],
                     'Umur':['12', '13', '15'],
                     'Kelas':['6', '7', '8']})
tim_a

In [None]:
tim_a.dtypes

In [None]:
# method 1
tim_a['Umur'] = pd.to_numeric(tim_a['Umur'])
tim_a.dtypes

In [None]:
# method 2
tim_a['Kelas'] = tim_a['Kelas'].astype(int)
tim_a.dtypes

In [None]:
tim_a

2. Go to Kaggle, download the Titanic dataset, and explore it with the basic operations:\
head, tail, describe, info, size, shape

## Data Manipulation Tasks

There are five common data manipulation tasks:
1. Selecting/Indexing
2. Filtering
3. Sorting
4. Mutating/conditionally adding columns
5. Groupby/summarize
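The five tasks listed above can be sketched on a small frame before applying them to the Uber data. The DataFrame `trips` and its values are hypothetical; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Uber data.
trips = pd.DataFrame({
    "START*": ["Cary", "Cary", "Morrisville", "Durham"],
    "MILES*": [5.0, 12.0, 3.5, 20.5],
})

selected = trips.loc[:, ["MILES*"]]                     # 1. selecting a column
long_trips = trips[trips["MILES*"] > 10]                # 2. filtering rows
ordered = trips.sort_values("MILES*", ascending=False)  # 3. sorting
trips["DISTANCE"] = np.where(trips["MILES*"] > 5,       # 4. conditionally adding a column
                             "Long trip", "Short trip")
per_start = trips.groupby("START*")["MILES*"].sum()     # 5. groupby/summarize

print(per_start)
```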

## 1. Selecting/Indexing

### `loc` and `iloc`

![](loc.png)

In [None]:
data.head()

### Positional indexing

Indexing rows only (a single axis):

In [None]:
# index row n with a scalar integer; the result is a Series (n = 4)
print(data.iloc[4])
type(data.iloc[4])

In [None]:
# index row n with a list of integers; the result is a DataFrame (n = 3)
print(type(data.iloc[[3]]))
(data.iloc[[3]])

In [None]:
data.iloc[[0, 3, 99]]

In [None]:
data.iloc[:4]

Indexing both axes.

In [None]:
data.head()

In [None]:
# [row, columns]
data.iloc[1:4, 2:6]

In [None]:
# with slice object
data.iloc[15:21, 2:7]

### Label indexing

In [None]:
data.columns

In [None]:
x = data.loc[:, 'STOP*']
x

In [None]:
data1 = data.loc[1:3, "START*"]
data1

In [None]:
type(data1)

In [None]:
b = data.loc[:, "START_DATE*":'START*']
b

In [None]:
type(b)

In [None]:
a = data.loc[:25, ["START_DATE*", "MILES*"]]
a

In [None]:
type(a)

In [None]:
e = data.loc[:10, ['START_DATE*', 'START*', 'MILES*', 'STOP*', 'END_DATE*']]
e

In [None]:
c = data.loc[:, ["START*"]].head()
c

In [None]:
type(c)
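Two details of the indexers are easy to miss: `iloc` slices exclude the end position, while `loc` slices include the end label, and a single column label returns a Series while a list of labels returns a DataFrame. The sketch below illustrates both on a hypothetical frame `df` with the default integer index.

```python
import pandas as pd

# Hypothetical frame with the default index 0..4.
df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": list("abcde")})

by_position = df.iloc[1:3]  # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]      # labels 1, 2 and 3 (end included)

single = df.loc[:, "A"]     # one label  -> Series
framed = df.loc[:, ["A"]]   # list of labels -> DataFrame

print(len(by_position), len(by_label))  # 2 3
print(type(single).__name__, type(framed).__name__)  # Series DataFrame
```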

##### The row-and-column selections above apply to a DataFrame; a Series has only one axis

### > Exercise 3

1. Select columns: `START_DATE*, START*, STOP*`

In [None]:
data.columns

In [1]:
index1 = data.loc[:, ['START_DATE*', 'START*', 'STOP*']]
index1

2. Extract the first & last 10 rows of the previous columns

In [None]:
# first and last 10 rows of the selected columns
index2 = pd.concat([index1.head(10), index1.tail(10)])
index2

In [None]:
data.loc[data["START*"] == "New York"]

In [None]:
st = data[data["START*"].isin(["Cary"])]
st

In [None]:
data.loc[(data["MILES*"] > 10) & (data["START*"].isin(["New York", "Morris"]))]

In [None]:
data["DISTANCE"] = np.where(data["MILES*"] > 5, "Long trip", "Short trip")
data.head()

![](Screenshot(9).png)

Find all trips longer than 10 miles that originated from New York or Morris.
Hint: combine the two conditions with `&`.
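Conditions on a column are combined element by element with `&` (and) or `|` (or), each wrapped in parentheses; the Python keyword `and` does not work on a Series. A minimal sketch on a hypothetical frame `trips`:

```python
import pandas as pd

# Hypothetical trips; only row 0 is both long and from one of the two cities.
trips = pd.DataFrame({
    "START*": ["New York", "Morris", "Cary", "New York"],
    "MILES*": [15.0, 8.0, 30.0, 4.0],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (trips["MILES*"] > 10) & (trips["START*"].isin(["New York", "Morris"]))
result = trips[mask]

print(result)
```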


In [None]:
data[(data["MILES*"] > 10) & (data["START*"].isin(["New York", "Morris"]))]