# Week 1: Day 4 AM // Pandas Basic

# Pandas Introduction

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in Pandas are implemented with the Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure (a table) where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (objects, observations, etc.), and columns correspond to features for each of the instances.
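As a minimal sketch of the "dictionary of Series" idea, the block below builds a small DataFrame and pulls one column back out. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data, not taken from a real dataset.
people = pd.DataFrame({
    "height_cm": [170.5, 182.0, 165.2],
    "name": ["Ana", "Budi", "Chen"],
})

# Selecting one column gives back a Series with its own dtype.
height = people["height_cm"]
print(type(height).__name__)  # Series
print(people.dtypes)

# Every column shares the DataFrame's row index.
print(height.index.equals(people.index))  # True
```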

> Pandas is like Excel in Python: it uses tables (namely DataFrame) and operates transformations on the data. But it can do a lot more.

**Installing Pandas**

`python -m pip install pandas`

or

`conda install pandas`

## Using the Pandas Python Library


In [1]:
import numpy as np
import pandas as pd

## Getting to Know Pandas’ Data Structures

### Understanding Series Objects

Python’s most basic data structure is the list, which is also a good starting point for getting to know pandas.Series objects. Create a new Series object based on a list:

In [2]:
revenues = pd.Series([5555, 7000, 1980])

In [4]:
revenues

0    5555
1    7000
2    1980
dtype: int64

In [5]:
# Build the same Series from a NumPy array
data = np.array([5555, 7000, 1980])
revenues = pd.Series(data)
print(revenues)

0    5555
1    7000
2    1980
dtype: int32


You’ve used the list `[5555, 7000, 1980]` to create a Series object called revenues. A Series object wraps two components:

- A sequence of values
- A sequence of identifiers, which is the index

You can access these components with `.values` and `.index`, respectively:

In [None]:
revenues

0    5555
1    7000
2    1980
dtype: int64

`revenues.values` returns the values in the Series, whereas `revenues.index` returns the positional index.

In [6]:
revenues.values

array([5555, 7000, 1980])

In [None]:
revenues.index

RangeIndex(start=0, stop=3, step=1)

While Pandas builds on NumPy, a significant difference is in their indexing. Just like a NumPy array, a Pandas Series also has an integer index that’s implicitly defined. This implicit index indicates the element’s position in the Series.

However, a Series can also have an arbitrary type of index. You can think of this explicit index as labels for a specific row:

In [7]:
city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"]
)
city_revenues

Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

Here, the index is a list of city names represented by strings. You may have noticed that Python dictionaries use string indices as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series:

- revenues: This Series behaves like a Python list because it only has a positional index.
- city_revenues: This Series acts like a Python dictionary because it features both a positional and a label index.

Here’s how to construct a Series with a label index from a Python dictionary:

In [8]:
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})
city_employee_count

Amsterdam    5
Tokyo        8
dtype: int64

The dictionary keys become the index, and the dictionary values are the Series values.

Just like dictionaries, Series also support `.keys()` and the `in` keyword:

In [9]:
city_employee_count.keys()

Index(['Amsterdam', 'Tokyo'], dtype='object')

In [10]:
# Check whether a label exists in the index

"Tokyo" in city_employee_count

True

In [None]:
"New York" in city_employee_count

False

You can use these methods to answer questions about your dataset quickly.
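One detail worth keeping in mind is that `in` tests the index labels, not the stored values, just as it tests keys on a dictionary. The sketch below reuses `city_employee_count` from above to show the difference.

```python
import pandas as pd

city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

# `in` looks at the index labels, like dictionary keys.
label_found = "Tokyo" in city_employee_count
# 8 is a stored value, not a label, so this is False.
value_as_label = 8 in city_employee_count

# To test for a value, look at the values explicitly.
value_found = 8 in city_employee_count.values

print(label_found, value_as_label, value_found)  # True False True
```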

### Understanding DataFrame Objects

While a Series is a pretty powerful data structure, it has its limitations. For example, you can only store one attribute per key. 

If you’ve followed along with the Series examples, then you should already have two Series objects with cities as keys:

- city_revenues
- city_employee_count

You can combine these objects into a DataFrame by providing a dictionary in the constructor. The dictionary keys will become the column names, and the values should contain the Series objects:

In [12]:
pd.DataFrame({'kolom 1': ['a', 'b' , 'c', 'd'],
            'kolom 2': ['a', 'b' , 'c', 'd']})

  kolom 1 kolom 2
0       a       a
1       b       b
2       c       c
3       d       d


In [13]:
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count
})

In [14]:
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


Note how Pandas replaced the missing employee_count value for Toronto with NaN.
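A side effect of the inserted NaN is that `employee_count` is stored as floating point, since a plain integer column cannot hold NaN. The following sketch rebuilds `city_data` and uses `.isna()` to locate the missing entry.

```python
import pandas as pd

city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"],
)
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count,
})

# revenue stays integer; employee_count becomes float because of the NaN.
print(city_data.dtypes)

# Boolean Series marking which rows lack an employee count.
missing = city_data["employee_count"].isna()
print(missing)
```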

The new DataFrame index is the union of the two Series indices:

In [15]:
city_data.index

Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')

Just like a Series, a DataFrame also stores its values in a NumPy array:

In [16]:
city_data.values

array([[4.2e+03, 5.0e+00],
       [6.5e+03, 8.0e+00],
       [8.0e+03,     nan]])

You can also refer to the 2 dimensions of a DataFrame as axes:

In [None]:
city_data.axes

[Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'),
 Index(['revenue', 'employee_count'], dtype='object')]

In [None]:
city_data.axes[0]

Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')

In [None]:
city_data.axes[1]

Index(['revenue', 'employee_count'], dtype='object')

The axis marked with 0 is the row index, and the axis marked with 1 is the column index. This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.
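To make the `axis` parameter concrete, here is a small sketch using `.sum()`. The `scores` DataFrame is a hypothetical example that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.DataFrame(
    {"math": [90, 80], "physics": [70, 60]},
    index=["alice", "bob"],
)

# axis=0 collapses the rows: one result per column.
per_column = scores.sum(axis=0)
print(per_column)  # math 170, physics 130

# axis=1 collapses the columns: one result per row.
per_row = scores.sum(axis=1)
print(per_row)  # alice 160, bob 140
```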

A DataFrame is also a dictionary-like data structure, so it also supports `.keys()` and the `in` keyword. However, for a DataFrame these don’t relate to the index, but to the columns:

In [17]:
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


In [18]:
city_data.keys()

Index(['revenue', 'employee_count'], dtype='object')

In [19]:
"Amsterdam" in city_data  # False: "Amsterdam" is a row label, and `in` only checks the column names

False

In [20]:
"revenue" in city_data

True

### Accessing Series Elements

In the section above, you’ve created a Pandas Series based on a Python list and compared the two data structures. You’ve seen how a Series object is similar to lists and dictionaries in several ways. A further similarity is that you can use the indexing operator (`[]`) for Series as well.

You’ll also learn how to use two Pandas-specific access methods:

- `.loc`
- `.iloc`

You’ll see that these data access methods can be much more readable than the indexing operator.

**Using the Indexing Operator**

Recall that a Series has two indices:

- A positional or implicit index, which is always a RangeIndex
- A label or explicit index, which can contain any hashable objects

Next, revisit the `city_revenues` object:

In [21]:
city_revenues

Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

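You can conveniently access the values in a Series with both the label and positional indices. Note that pandas 2.x releases emit a deprecation warning when you pass an integer position to `[]` on a Series with a non-integer index, so `.iloc` (covered below) is the safer choice for positional access: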

In [22]:
city_revenues["Toronto"]

8000

In [25]:
city_revenues[1]

8000

You can also use negative indices and slices, just like you would for a list:

In [26]:
city_revenues[-1]

6500

In [27]:
city_revenues[1:]

Toronto    8000
Tokyo      6500
dtype: int64

In [28]:
city_revenues["Toronto":]

Toronto    8000
Tokyo      6500
dtype: int64

**Using .loc and .iloc**

The indexing operator (`[]`) is convenient, but there’s a caveat. What if the labels are also numbers? Say you have to work with a Series object like this:

In [29]:
colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8]
)

In [30]:
colors

1       red
2    purple
3      blue
5     green
8    yellow
dtype: object

In [31]:
colors[1]

'red'

What will `colors[1]` return? For a positional index, `colors[1]` is "purple". However, if you go by the label index, then `colors[1]` is referring to "red".

The good news is, you don’t have to figure it out! Instead, to avoid confusion, the Pandas Python library provides two data access methods:

- `.loc` refers to the label index.
- `.iloc` refers to the positional index.

These data access methods are much more readable:

In [None]:
colors.loc[1]

'red'

In [None]:
colors.iloc[1]

'purple'

In [32]:
colors.iloc[1:3]

2    purple
3      blue
dtype: object

`colors.loc[1]` returned "red", the element with the label 1. `colors.iloc[1]` returned "purple", the element at positional index 1.

It’s easier to keep in mind the distinction between `.loc` and `.iloc` than it is to figure out what the indexing operator will return. Even if you’re familiar with all the quirks of the indexing operator, it can be dangerous to assume that everybody who reads your code has internalized those rules as well!

`.loc` and `.iloc` also support the features you would expect from indexing operators, like slicing. However, these data access methods have an important difference. While `.iloc` excludes the closing element, `.loc` includes it. Take a look at this code block:

In [33]:
# Return the elements with the implicit index: 1, 2

colors.iloc[1:3]

2    purple
3      blue
dtype: object

Here, `colors.iloc[1:3]` returns the elements with the positional indices of 1 and 2. The closing item "green" with a positional index of 3 is excluded.

On the other hand, `.loc` includes the closing element:

In [34]:
# Return the elements with the explicit index between 3 and 8

colors.loc[3:8]

3      blue
5     green
8    yellow
dtype: object

This code block says to return all elements with a label index between 3 and 8. Here, the closing item "yellow" has a label index of 8 and is included in the output.

You can also pass a negative positional index to `.iloc`:

In [35]:
colors.iloc[-2]

'green'

You count from the end of the Series, so `-2` returns the second-to-last element.

You can use the code blocks above to distinguish between two Series behaviors:

- You can use `.iloc` on a Series similar to using `[]` on a list.
- You can use `.loc` on a Series similar to using `[]` on a dictionary.
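The list and dictionary analogy also carries over to failed lookups. As a small sketch using the `colors` Series from above, a missing label raises `KeyError` with `.loc`, while an out-of-range position raises `IndexError` with `.iloc`.

```python
import pandas as pd

colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8],
)

# .loc behaves like a dictionary lookup: an unknown label raises KeyError.
try:
    colors.loc[4]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# .iloc behaves like list indexing: position 5 is past the end (length is 5).
try:
    colors.iloc[5]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"

print(loc_error, iloc_error)  # KeyError IndexError
```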

### Accessing DataFrame Elements

Since a DataFrame consists of Series objects, you can use the very same tools to access its elements. The crucial difference is the additional dimension of the DataFrame. You’ll use the indexing operator for the columns and the access methods `.loc` and `.iloc` on the rows.

**Using the Indexing Operator**

If you think of a DataFrame as a dictionary whose values are Series, then it makes sense that you can access its columns with the indexing operator:

In [36]:
city_data["revenue"]

Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

Here, you use the indexing operator to select the column labeled "revenue".

If the column name is a string, then you can use attribute-style accessing with dot notation as well:

In [37]:
city_data.revenue

Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

`city_data["revenue"]` and `city_data.revenue` return the same output.

There’s one situation where accessing DataFrame elements with dot notation may not work or may lead to surprises. This is when a column name coincides with a DataFrame attribute or method name:

In [38]:
toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"}
])

In [None]:
toys["shape"]

0    sphere
1      cube
Name: shape, dtype: object

In [39]:
toys.shape  # (2, 2) means 2 rows and 2 columns

(2, 2)

The indexing operation `toys["shape"]` returns the correct data, but the attribute-style operation `toys.shape` still returns the shape of the DataFrame. You should only use attribute-style accessing in interactive sessions or for read operations. You shouldn’t use it for production code or for manipulating data (such as defining new columns).
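The warning about defining new columns can be checked directly. In the sketch below, the `price` column and its values are hypothetical. Assigning through dot notation only sets a Python attribute on the object (pandas emits a warning, silenced here), whereas the indexing operator creates a real column.

```python
import warnings

import pandas as pd

toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"},
])

# Attribute assignment does NOT create a column; pandas warns about it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toys.price = [5, 12]  # hypothetical values

created_by_attribute = "price" in toys.columns
print(created_by_attribute)  # False

# The indexing operator is the reliable way to add a column.
toys["price"] = [5, 12]
created_by_indexing = "price" in toys.columns
print(created_by_indexing)  # True
```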

**Using .loc and .iloc**

Similar to Series, a DataFrame also provides `.loc` and `.iloc` data access methods. Remember, `.loc` uses the label and `.iloc` the positional index:

In [43]:
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


In [40]:
city_data.loc["Amsterdam"]

revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64

In [44]:
city_data.loc["Amsterdam", :]  # the colon selects all columns, which is also the default

revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64

In [45]:
city_data.loc["Amsterdam":"Toronto"]

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


In [41]:
city_data.loc["Tokyo": "Toronto"]

         revenue  employee_count
Tokyo       6500             8.0
Toronto     8000             NaN


In [42]:
city_data.iloc[1]

revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64

In [46]:
city_data.iloc[2]

revenue           8000.0
employee_count       NaN
Name: Toronto, dtype: float64

Each line of code selects a different row from city_data:

- `city_data.loc["Amsterdam"]` selects the row with the label index "Amsterdam".
- `city_data.loc["Tokyo": "Toronto"]` selects the rows with label indices from "Tokyo" to "Toronto". Remember, `.loc` is inclusive.
- `city_data.iloc[1]` selects the row with the positional index 1, which is "Tokyo".



For a DataFrame, the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the indices, the second parameter selects the columns. You can use these parameters together to select a subset of rows and columns from your DataFrame:

In [47]:
city_data.loc["Amsterdam": "Tokyo", "revenue"]

Amsterdam    4200
Tokyo        6500
Name: revenue, dtype: int64

Note that you separate the parameters with a comma (`,`). The first parameter, `"Amsterdam": "Tokyo"`, says to select all rows from the first label through the second, inclusive. The second parameter comes after the comma and says to select the `"revenue"` column.
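The same two-parameter form works with `.iloc`, using positions instead of labels. The sketch below rebuilds `city_data` from literal values so that it runs on its own, then compares both access methods and shows that lists of labels are accepted as well.

```python
import pandas as pd

city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5.0, 8.0, None]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# Label-based: rows "Amsterdam" through "Tokyo" (inclusive), column "revenue".
by_label = city_data.loc["Amsterdam":"Tokyo", "revenue"]

# Position-based: rows 0 and 1 (the stop position 2 is excluded), column 0.
by_position = city_data.iloc[0:2, 0]

print(by_label.equals(by_position))  # True

# Lists of labels select specific rows and columns and return a DataFrame.
subset = city_data.loc[["Amsterdam", "Toronto"], ["revenue"]]
print(subset)
```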

In [None]:
city_revenues.sum()

18700

In [None]:
city_revenues.max()

8000

Series objects also provide aggregation methods. Above, `.sum()` returns the total of `city_revenues`, while `.max()` returns the largest value. There are other methods you can use, like `.min()` and `.mean()`.
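As a short sketch of the other aggregations mentioned, the block below applies `.min()`, `.mean()`, and `.idxmax()` to `city_revenues`. It also shows that these methods skip NaN values by default, using the employee counts from `city_data`.

```python
import pandas as pd

city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"],
)

print(city_revenues.min())     # 4200
print(city_revenues.mean())    # about 6233.33
print(city_revenues.idxmax())  # Toronto, the label of the largest value

# Aggregations ignore NaN by default (skipna=True).
employee_count = pd.Series(
    [5.0, 8.0, float("nan")],
    index=["Amsterdam", "Tokyo", "Toronto"],
)
print(employee_count.mean())   # 6.5
```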

## Filter Pandas DataFrame with NumPy

In [58]:
dataFrame = pd.DataFrame({"Product": ["SmartTV", "ChromeCast", "Speaker", "Earphone"],
                          "Opening_Stock": [300, 700, 1200, 1500],
                          "Closing_Stock": [200, 500, 1000, 900]})

print("DataFrame...\n",dataFrame)

# using numpy where() to filter DataFrame with 2 Conditions
resValues1 = np.where((dataFrame['Opening_Stock']>=700) & (dataFrame['Closing_Stock']< 1000))

print("\nFiltered DataFrame Value = \n",dataFrame.loc[resValues1])

# using numpy where() to filter DataFrame with 3 conditions
resValues2 = np.where((dataFrame['Opening_Stock']>=500) & (dataFrame['Closing_Stock']< 1000) & (dataFrame['Product'].str.startswith('C')))

print("\nFiltered DataFrame Value = \n",dataFrame.loc[resValues2])

DataFrame...
       Product  Opening_Stock  Closing_Stock
0     SmartTV            300            200
1  ChromeCast            700            500
2     Speaker           1200           1000
3    Earphone           1500            900

Filtered DataFrame Value = 
       Product  Opening_Stock  Closing_Stock
1  ChromeCast            700            500
3    Earphone           1500            900

Filtered DataFrame Value = 
       Product  Opening_Stock  Closing_Stock
1  ChromeCast            700            500
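`np.where()` with a single argument returns the positions where the condition holds. A common alternative, sketched below, is to pass the boolean mask to the DataFrame directly, which avoids the extra NumPy step. The variable name `stock` is chosen for this example only.

```python
import pandas as pd

stock = pd.DataFrame({
    "Product": ["SmartTV", "ChromeCast", "Speaker", "Earphone"],
    "Opening_Stock": [300, 700, 1200, 1500],
    "Closing_Stock": [200, 500, 1000, 900],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (stock["Opening_Stock"] >= 700) & (stock["Closing_Stock"] < 1000)
filtered = stock[mask]
print(filtered)

# DataFrame.query() expresses the same filter as a string.
same = stock.query("Opening_Stock >= 700 and Closing_Stock < 1000")
print(filtered.equals(same))  # True
```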


In [50]:
dataFrame

      Product  Opening_Stock  Closing_Stock
0     SmartTV            300            200
1  ChromeCast            700            500
2     Speaker           1200           1000
3    Earphone           1500            900


In [55]:
dataFrame["Opening_Stock"] > 1000

0    False
1    False
2     True
3     True
Name: Opening_Stock, dtype: bool

In [56]:
dataFrame[dataFrame["Opening_Stock"] > 1000]

    Product  Opening_Stock  Closing_Stock
2   Speaker           1200           1000
3  Earphone           1500            900


In [61]:
dataFrame[dataFrame["Opening_Stock"] > 1000][['Product', 'Opening_Stock']]

    Product  Opening_Stock
2   Speaker           1200
3  Earphone           1500


In [63]:
df = dataFrame.set_index('Product')

df

            Opening_Stock  Closing_Stock
Product
SmartTV               300            200
ChromeCast            700            500
Speaker              1200           1000
Earphone             1500            900


In [64]:
# .loc + condition
# return the Closing_Stock values that are under 600

df.loc[df['Closing_Stock'] < 600, 'Closing_Stock']

Product
SmartTV       200
ChromeCast    500
Name: Closing_Stock, dtype: int64

In [65]:
# .iloc with integer positions (no condition here)
# row 1 (ChromeCast), column 1 (Closing_Stock)

df.iloc[1,1]

500
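If you do want to combine `.iloc` with a condition, note that `.iloc` does not accept a boolean Series as a mask. A minimal sketch of one workaround is to convert the mask to a NumPy array with `.to_numpy()` first:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Opening_Stock": [300, 700, 1200, 1500],
        "Closing_Stock": [200, 500, 1000, 900],
    },
    index=["SmartTV", "ChromeCast", "Speaker", "Earphone"],
)

# Convert the boolean Series to a plain array before passing it to .iloc.
mask = (df["Closing_Stock"] < 600).to_numpy()

# Rows where the mask is True, column at position 1 (Closing_Stock).
low_closing = df.iloc[mask, 1]
print(low_closing)
```

The result matches `df.loc[df['Closing_Stock'] < 600, 'Closing_Stock']`, so in most cases `.loc` remains the more direct choice for condition-based selection.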

## Combining Multiple Datasets

Real-world data often comes in multiple pieces. In this section, you’ll learn how to grab those pieces and combine them into one dataset that’s ready for analysis.

Earlier, you combined two Series objects into a DataFrame based on their indices. Now, you’ll take this one step further and use `pd.concat()` to combine city_data with another DataFrame. Say you’ve managed to gather some data on two more cities:

In [66]:
further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count": [2, 2]},
    index=["New York", "Barcelona"]
)

In [67]:
further_city_data

           revenue  employee_count
New York      7000               2
Barcelona     3400               2


In [68]:
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


This second DataFrame contains info on the cities "New York" and "Barcelona".

You can add these cities to city_data using `pd.concat()`:

In [69]:
all_city_data = pd.concat([city_data, further_city_data], sort=False)

In [70]:
all_city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
New York      7000             2.0
Barcelona     3400             2.0


Now, the new variable `all_city_data` contains the values from both DataFrame objects.

By default, `pd.concat()` combines along `axis=0`. In other words, it appends rows. You can also use it to append columns by supplying the parameter `axis=1`:

In [71]:
city_countries = pd.DataFrame({
    "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
    "capital": [1, 1, 0, 0, 0]},
    index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
)

In [72]:
city_countries

           country  capital
Amsterdam  Holland        1
Tokyo        Japan        1
Rotterdam  Holland        0
Toronto     Canada        0
Barcelona    Spain        0


In [73]:
all_city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
New York      7000             2.0
Barcelona     3400             2.0


In [74]:
cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)

In [75]:
cities

           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0


Note how Pandas added NaN for the missing values.

If you want to combine only the cities that appear in both DataFrame objects, then you can set the `join` parameter to `"inner"`:

In [76]:
pd.concat([all_city_data, city_countries], axis=1, join="inner")

           revenue  employee_count  country  capital
Amsterdam     4200             5.0  Holland        1
Tokyo         6500             8.0    Japan        1
Toronto       8000             NaN   Canada        0
Barcelona     3400             2.0    Spain        0
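To see the effect of `join` in isolation, the sketch below concatenates two tiny DataFrames along `axis=1` and compares the resulting row labels. The frames `left` and `right` are reduced, hypothetical versions of the city data.

```python
import pandas as pd

# Reduced, hypothetical versions of the city tables.
left = pd.DataFrame({"revenue": [4200, 7000]}, index=["Amsterdam", "New York"])
right = pd.DataFrame(
    {"country": ["Holland", "Holland"]},
    index=["Amsterdam", "Rotterdam"],
)

# Default join="outer": union of the row labels, NaN where data is missing.
outer = pd.concat([left, right], axis=1)
print(sorted(outer.index))  # ['Amsterdam', 'New York', 'Rotterdam']

# join="inner": only the labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")
print(list(inner.index))    # ['Amsterdam']
```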


While it’s most straightforward to combine data based on the index, it’s not the only possibility. You can use `.merge()` to implement a join operation similar to the one from SQL:

In [77]:
countries = pd.DataFrame({
    "population_millions": [17, 127, 37],
    "continent": ["Europe", "Asia", "North America"]
}, index=["Holland", "Japan", "Canada"])

In the merge call that follows, you pass the parameter `left_on="country"` to `.merge()` to indicate which column of the left DataFrame to join on, and `right_index=True` to match it against the index of the right DataFrame. The result is a bigger DataFrame that contains not only city data, but also the population and continent of the respective countries:

In [78]:
countries

         population_millions      continent
Holland                   17         Europe
Japan                    127           Asia
Canada                    37  North America


In [79]:
cities

           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0


In [82]:
pd.merge(cities, countries, left_on="country", right_index=True)  # argument order matters: the left frame is matched on its "country" column, the right frame on its index

           revenue  employee_count  country  capital  population_millions      continent
Amsterdam   4200.0             5.0  Holland      1.0                   17         Europe
Rotterdam      NaN             NaN  Holland      0.0                   17         Europe
Tokyo       6500.0             8.0    Japan      1.0                  127           Asia
Toronto     8000.0             NaN   Canada      0.0                   37  North America


In [85]:
pd.merge(cities, countries, left_on="country", right_index=True, how='left')

           revenue  employee_count  country  capital  population_millions      continent
Amsterdam   4200.0             5.0  Holland      1.0                 17.0         Europe
Tokyo       6500.0             8.0    Japan      1.0                127.0           Asia
Toronto     8000.0             NaN   Canada      0.0                 37.0  North America
New York    7000.0             2.0      NaN      NaN                  NaN            NaN
Barcelona   3400.0             2.0    Spain      0.0                  NaN            NaN
Rotterdam      NaN             NaN  Holland      0.0                 17.0         Europe


Note that the result of the first merge call, the one without `how`, contains only the cities where the country is known and appears in the joined DataFrame.

`.merge()` performs an inner join by default. If you want to include all cities in the result, then you need to provide the `how` parameter:

In [None]:
pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how="left"
)

           revenue  employee_count  country  capital  population_millions      continent
Amsterdam   4200.0             5.0  Holland      1.0                 17.0         Europe
Tokyo       6500.0             8.0    Japan      1.0                127.0           Asia
Toronto     8000.0             NaN   Canada      0.0                 37.0  North America
New York    7000.0             2.0      NaN      NaN                  NaN            NaN
Barcelona   3400.0             2.0    Spain      0.0                  NaN            NaN
Rotterdam      NaN             NaN  Holland      0.0                 17.0         Europe


With this left join, you’ll see all the cities, including those without country data.
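When checking a left join, it can help to see which rows actually found a match. One way to do this, sketched here on reduced and hypothetical versions of `cities` and `countries`, is the `indicator=True` parameter of `pd.merge()`, which adds a `_merge` column to the result.

```python
import pandas as pd

# Reduced, hypothetical versions of the tables used above.
cities = pd.DataFrame(
    {"country": ["Holland", "Japan", "Spain"]},
    index=["Amsterdam", "Tokyo", "Barcelona"],
)
countries = pd.DataFrame(
    {"continent": ["Europe", "Asia"]},
    index=["Holland", "Japan"],
)

merged = pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only" here
)
print(merged)

# Cities whose country has no entry in `countries`.
unmatched = merged[merged["_merge"] == "left_only"]
print(list(unmatched.index))  # ['Barcelona']
```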