# Data Manipulation using Pandas

Author: Andreas Chandra \
[Email](mailto:andreas@jakartaresearch.com) [Github](https://github.com/andreaschandra) [Blog](https://datafolksid.xyz/andreas)

## Contents

Day 1
- A Brief Overview of Pandas
- Read/Write Pandas
- Creating DataFrame from Dict/List
- Basic Functionalities and Attributes (Head, Tail, Dtype, Shape, Describe, Missing Values)
- Type Casting
- Renaming Column
- Slicing and Dicing DataFrame (Filtering)

Day 2
- Reindexing
- Dropping and Popping
- Duplicate data
- Numeric Calculation
- String Operation

Day 3
- Sorting
- Grouping
- Pandas Apply and Map Function
- Appending, Joining, Merging, and Concatenating Two or More DataFrames
- Pivot and Stack

Day 4
- A Brief Introduction to Time Series
- Window Function
- Basic Plotting

## Day 1

### Overview of Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

Installation \
`pip install pandas`

Repo: https://github.com/pandas-dev/pandas

In [None]:
# Import the library
import pandas as pd

### Read/Write Functions

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Common read functions \
`read_csv()` `read_excel()` `read_table()` `read_json()`

In [None]:
d_data = pd.read_csv("telcom_user_extended.csv")
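`read_csv()` accepts a file path or any file-like object, so a small sketch can run without the course data file. The CSV text below is made-up sample data (not rows from `telcom_user_extended.csv`), and `usecols` and `nrows` are shown only as two commonly used options.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk
csv_text = """customerID,gender,tenure,TotalCharges
0001-A,Female,12,350.5
0002-B,Male,3,
0003-C,Female,24,1200.0
"""

# Read the whole "file"; the empty TotalCharges field becomes NaN
sample = pd.read_csv(io.StringIO(csv_text))

# Read only selected columns and only the first two rows
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["customerID", "tenure"],
    nrows=2,
)

print(sample.shape)  # (3, 4)
print(subset.shape)  # (2, 2)
```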

Common write functions \
`to_csv` `to_json` `to_excel`

In [None]:
d_data.to_csv("telecom_users_2.csv", index=False)
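The effect of `index=False` is easiest to see on a tiny frame. In this sketch the DataFrame is invented for illustration; calling `to_csv()` without a path returns the CSV text as a string instead of writing a file.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [101, 102], "gender": ["L", "P"]})

# Without a path argument, to_csv returns the CSV content as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,id,gender  (extra unnamed index column)
print(without_index.splitlines()[0])  # id,gender

# Reading the index-free text back gives the same table
restored = pd.read_csv(io.StringIO(without_index))
print(restored.equals(df))  # True
```

Leaving the index in the file is the usual reason an `Unnamed: 0` column appears when the CSV is read back.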

### Creating DataFrame from List/Dictionary

From a list of dictionaries

In [None]:
user_profile = [
    {"id": 101, "gender": "L", "age": 20, "last education": "high school", "is_married": True},
    {"id": 102, "gender": "P", "age": 18, "last education": "middle school", "is_married": False},
    {"id": 103, "gender": "L", "age": 19, "last education": "high school", "is_married": True},
    {"id": 104, "gender": "P", "age": 28, "last education": "master's degree", "is_married": False},
    {"id": 105, "gender": None, "age": 21, "last education": "bachelor's degree", "is_married": True}
]

In [None]:
pd.DataFrame(user_profile)

From a list of lists

In [None]:
number_list_only = [
    [101,"L",20,'high school', True], 
    [102,'P',18,'middle school', False],
    [103,'L',19,'high school', True],
    [104,'P',28,"master's degree", False],
    [105,None,21,"bachelor's degree", True],
]

In [None]:
pd.DataFrame(data=number_list_only, columns=["id", "gender", "age", "last education", 'is_married'])

From a dictionary of lists

In [None]:
user_profile_dict = {
    'id': [101,102,103,104,105],
    'gender': ["L", "P", "L", "P", None],
    'age': [20, 18, 19, 28, 21],
    'last education': ["high school", "middle school", "high school", "master's degree", "bachelor's degree"],
    'is_married': [True, False, True, False, True]
}

In [None]:
pd.DataFrame(user_profile_dict)
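The three input shapes above describe the same table in different ways: one dictionary per row, one list per row plus column names, or one list per column. A minimal sketch with a reduced, two-column version of the profile data shows that they produce identical DataFrames.

```python
import pandas as pd

# One dictionary per row
records = [{"id": 101, "age": 20}, {"id": 102, "age": 18}]
# One list per row; column names are supplied separately
rows = [[101, 20], [102, 18]]
# One list per column
columns = {"id": [101, 102], "age": [20, 18]}

from_records = pd.DataFrame(records)
from_rows = pd.DataFrame(rows, columns=["id", "age"])
from_columns = pd.DataFrame(columns)

print(from_records.equals(from_rows))   # True
print(from_rows.equals(from_columns))   # True
```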

### Basic Functionalities

In [None]:
d_data.head()

In [None]:
d_data.tail()

In [None]:
d_data.shape

In [None]:
d_data.dtypes

Descriptive statistics for numeric columns

In [None]:
d_data.describe()

In [None]:
d_data.info()

Counting missing values

In [None]:
d_data.isna().sum()
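`isna()` returns a boolean frame, so summing it counts missing values per column. A small sketch on invented data also shows two related patterns that may be useful: the share of missing values per column and the rows that contain at least one missing value.

```python
import pandas as pd

# Made-up data with a few missing entries
df = pd.DataFrame({
    "gender": ["L", "P", None, "P"],
    "age": [20, None, 19, None],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column
rows_with_missing = df[df.isna().any(axis=1)]  # rows with at least one NaN

print(missing_count["age"])    # 2
print(missing_share["gender"]) # 0.25
print(len(rows_with_missing))  # 3
```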

### Fill Missing Values

using `Series.fillna(value)` \
using `DataFrame.fillna(value)`

In [None]:
# Assign the result back: fillna(inplace=True) on a selected column may act on a copy
d_data["Partner"] = d_data["Partner"].fillna("No")

In [None]:
d_data.Partner.isna().sum()
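The difference between the two forms can be sketched on a small invented frame: `Series.fillna` fills one column, while `DataFrame.fillna` accepts a dictionary that maps column names to fill values. The fill values here are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Partner": ["Yes", None, "No"],
    "tenure": [12.0, None, 3.0],
})

# Series.fillna: fill a single column and assign the result back
df["Partner"] = df["Partner"].fillna("No")

# DataFrame.fillna: a dict gives a separate fill value per column
# and returns a new DataFrame, leaving df itself unchanged
filled = df.fillna({"tenure": 0})

print(df["Partner"].tolist())     # ['Yes', 'No', 'No']
print(filled["tenure"].tolist())  # [12.0, 0.0, 3.0]
```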

### Type Casting

https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes

using `DataFrame.astype({'col': int, 'col2': str})` \
using `Series.astype(int|str|float)`

In [None]:
# Blank strings cannot be cast to float, so mark them as missing (NaN) first
d_data["TotalCharges"] = d_data["TotalCharges"].replace(" ", float("nan"))

In [None]:
d_data.TotalCharges = d_data.TotalCharges.astype(float)

In [None]:
d_data.dtypes
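An alternative to replacing blanks by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` in one step. The sketch below uses invented values; the dictionary form of `DataFrame.astype` is shown on a second made-up frame.

```python
import pandas as pd

# Strings as they might appear in a raw CSV column (illustrative values)
charges = pd.Series(["29.85", " ", "108.15", "n/a"])

# Unparseable entries become NaN instead of raising an error
as_float = pd.to_numeric(charges, errors="coerce")
print(as_float.isna().sum())  # 2

# DataFrame.astype with a dict casts several columns at once
df = pd.DataFrame({"id": ["1", "2"], "score": [1, 2]})
typed = df.astype({"id": int, "score": float})
print(typed["id"].tolist())     # [1, 2]
print(typed["score"].tolist())  # [1.0, 2.0]
```

Note that `errors="coerce"` silently hides bad values, so it is worth counting the resulting `NaN`s before relying on the column.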

### Renaming Columns

In [None]:
d_data.head()

In [None]:
d_data.rename(columns={'customerID':'customer_id',
                       'SeniorCitizen': 'senior_citizen', 
                       'PhoneService': 'phone_service'}, inplace=True)

In [None]:
d_data.head()
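`rename` also accepts a function in place of a dictionary, which helps when many columns follow the same naming pattern. The helper `to_snake_case` below is a hypothetical function written for this sketch, not part of pandas, and it only handles simple camel-case names like the ones in this dataset.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Insert '_' between a lowercase and an uppercase letter, then lowercase."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()


# An empty frame is enough to demonstrate column renaming
df = pd.DataFrame(columns=["customerID", "SeniorCitizen", "PhoneService"])

# Without inplace=True, rename returns a new DataFrame
renamed = df.rename(columns=to_snake_case)

print(list(renamed.columns))  # ['customer_id', 'senior_citizen', 'phone_service']
print(list(df.columns))       # ['customerID', 'SeniorCitizen', 'PhoneService']
```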

### Duplicate Data

Find duplicate entries using `DataFrame.duplicated()` and remove them using `DataFrame.drop_duplicates()`

In [None]:
d_data[d_data.duplicated(subset='customer_id')]

In [None]:
d_data.drop_duplicates(subset='customer_id', inplace=True)

In [None]:
d_data.shape
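Both `duplicated()` and `drop_duplicates()` take a `keep` parameter that decides which occurrence counts as the original. A short sketch on invented customer IDs shows the default (`keep="first"`), `keep=False`, and `keep="last"`.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["A1", "B2", "A1", "C3", "B2"],
    "tenure": [1, 5, 2, 7, 5],
})

# Default keep="first": later repeats are flagged, the first occurrence is not
first_kept = df.duplicated(subset="customer_id")
print(first_kept.tolist())  # [False, False, True, False, True]

# keep=False flags every row whose customer_id appears more than once
all_marked = df.duplicated(subset="customer_id", keep=False)
print(all_marked.tolist())  # [True, True, True, False, True]

# keep="last" keeps the final occurrence of each customer_id
deduped = df.drop_duplicates(subset="customer_id", keep="last").reset_index(drop=True)
print(deduped["customer_id"].tolist())  # ['A1', 'C3', 'B2']
```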

### Slicing

Slicing and dicing in pandas can be done with `.loc` (label-based), `.iloc` (position-based), `.at` and `.iat` (single values), or plain bracket indexing.

In [None]:
# .loc slices by index label and includes the end label
d_data.loc[:5, ['gender', 'senior_citizen', 'Partner']]

In [None]:
d_data.gender.unique()

In [None]:
d_data[d_data.gender == 'Female']
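The accessors differ mainly in whether they use labels or positions and whether the end of a slice is included. The sketch below uses an invented frame with a non-default index so the difference is visible, and it ends with a filter that combines two conditions.

```python
import pandas as pd

df = pd.DataFrame(
    {"gender": ["Female", "Male", "Female", "Male"], "tenure": [12, 3, 24, 8]},
    index=[10, 11, 12, 13],
)

by_label = df.loc[10:12, ["gender"]]  # labels 10..12, end label included
by_position = df.iloc[0:2, [0]]       # positions 0..1, end position excluded

one_by_label = df.at[11, "tenure"]    # single value by row label and column name
one_by_position = df.iat[2, 1]        # single value by row and column position

# Combine conditions with & (and) or | (or); each condition needs parentheses
filtered = df[(df["gender"] == "Female") & (df["tenure"] > 12)]

print(len(by_label), len(by_position))  # 3 2
print(one_by_label, one_by_position)    # 3 24
print(filtered.index.tolist())          # [12]
```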

### Assigning New Columns and Replacing Values

In [None]:
d_data['is_married'] = 'No'

In [None]:
d_data.head()

Replace values
- Replace `No` with `0` in `senior_citizen`

In [None]:
d_data.senior_citizen.unique()

In [None]:
d_data.loc[d_data.senior_citizen=='No', 'senior_citizen'] = 0

- Replace `No` with `Wireless` in `InternetService`

In [None]:
d_data.InternetService.unique()

In [None]:
d_data.loc[d_data.InternetService=='No', 'InternetService'] = 'Wireless'

In [None]:
d_data.head()
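Besides boolean indexing with `.loc`, the same replacements can be sketched with `Series.map` and `Series.replace`. The frame below is invented, and the `"Yes": 1` mapping is an assumption for illustration; the steps above only recode `No`.

```python
import pandas as pd

df = pd.DataFrame({
    "senior_citizen": ["No", "Yes", "No"],
    "InternetService": ["DSL", "No", "Fiber optic"],
})

# map recodes every value through the dict; values missing from the dict become NaN
df["senior_citizen"] = df["senior_citizen"].map({"No": 0, "Yes": 1})

# replace changes only the listed values and leaves the rest untouched
df["InternetService"] = df["InternetService"].replace({"No": "Wireless"})

print(df["senior_citizen"].tolist())   # [0, 1, 0]
print(df["InternetService"].tolist())  # ['DSL', 'Wireless', 'Fiber optic']
```

`map` suits a full recoding of a column, while `replace` or a `.loc` assignment suits changing a few specific values.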

### Save the Latest Data to CSV for the Next Day

In [None]:
d_data.to_csv("telcom_user_extended_day2.csv", index=False)