# Data Manipulation using Pandas

Author: Andreas Chandra \
[Email](mailto:andreas@jakartaresearch.com) [Github](https://github.com/andreaschandra) [Blog](https://datafolksid.xyz/andreas) \
If you would like to talk with me, propose a time [here](https://calendly.com/andreaschandra/)

## Contents

Day 4
- Datetime
- Brief Timeseries
- Window Functions
- Basic Plotting

## Day 4

In [None]:
import pandas as pd

In [None]:
# Load the dataset into d_data, which the rest of the notebook uses
d_data = pd.read_csv("telcom_user_extended_day4.csv")
d_data.head()

### Datetime

1. String to datetime format

In [None]:
d_data['RecordedDate'].head()

In [None]:
## YYYY-MM-DD
d_data['RecordedDate_updated'] = pd.to_datetime(d_data['RecordedDate'])
d_data['RecordedDate_updated'].head()
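
When the input strings are not perfectly clean, it can help to state the expected format explicitly and decide what happens to unparseable values. The sketch below uses made-up sample values (not the real `RecordedDate` column) and assumes the `YYYY-MM-DD` layout mentioned above.

```python
import pandas as pd

# Hypothetical sample values; the real RecordedDate column may look different.
raw = pd.Series(["2020-01-15", "2020-02-01", "not a date"])

# An explicit format avoids ambiguous day/month guessing.
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see how many rows failed to parse.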

2. Datetime to string format

In [None]:
## MM-DD-YYYY
d_data['RecordedDate_updated_2'] = d_data['RecordedDate_updated'].dt.strftime('%m-%d-%Y')
d_data['RecordedDate_updated_2'].head()

In [None]:
## MM/DD/YYYY
d_data['RecordedDate_updated_3'] = d_data['RecordedDate_updated'].dt.strftime('%m/%d/%Y')
d_data['RecordedDate_updated_3'].head()
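
One thing to keep in mind: `dt.strftime` returns plain strings, so the result no longer behaves like a date. The small sketch below, with made-up dates, shows how sorting the formatted text gives a different order than sorting the original datetimes.

```python
import pandas as pd

# Hypothetical dates chosen so that text order and date order disagree.
dates = pd.Series(pd.to_datetime(["2021-01-05", "2020-12-31"]))

# Formatting produces strings in MM-DD-YYYY layout.
as_text = dates.dt.strftime("%m-%d-%Y")

# Sorting the strings is alphabetical, so "01-05-2021" comes first.
text_sorted = as_text.sort_values().tolist()

# Sorting the datetimes first, then formatting, keeps chronological order.
date_sorted = dates.sort_values().dt.strftime("%m-%d-%Y").tolist()

print(text_sorted)
print(date_sorted)
```

For this reason it is usually safer to keep the datetime column for computation and format to string only for display or export.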

3. Timedelta

In NumPy, the date units are years ('Y'), months ('M'), weeks ('W'), and days ('D'), while the time units include hours ('h'), minutes ('m'), seconds ('s'), and milliseconds ('ms'), among others.

Source: https://numpy.org/doc/stable/reference/arrays.datetime.html

In [None]:
sample_time = d_data['RecordedDate_updated'][:2].values
sample_time

In [None]:
# The time delta is in nanoseconds (1 ns = 10^-9 seconds), so the number looks very large; let us change the units...

timedelta = sample_time[1] - sample_time[0]
timedelta

In [None]:
# Time delta in days

timedelta_days = timedelta.astype('timedelta64[D]')
timedelta_days

In [None]:
# Time delta in weeks

timedelta_weeks = timedelta.astype('timedelta64[W]')
timedelta_weeks
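
Note that converting a NumPy timedelta to a coarser unit with `astype` keeps only whole units. If fractional days or weeks are needed, dividing by a reference `pd.Timedelta` is one alternative. The timestamps below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, 10.5 days apart.
start = pd.Timestamp("2020-01-01 00:00")
end = pd.Timestamp("2020-01-11 12:00")
delta = end - start  # pandas Timedelta

# Dividing by a unit Timedelta gives a plain float, fractions included.
days_fractional = delta / pd.Timedelta(days=1)
weeks_fractional = delta / pd.Timedelta(weeks=1)

# The NumPy route used above keeps only whole days.
np_days = delta.to_timedelta64().astype("timedelta64[D]")

print(days_fractional, weeks_fractional, np_days)
```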

4. Timezone

In [None]:
import pytz
import datetime

In [None]:
pytz.all_timezones

In [None]:
# tz_localize attaches the Asia/Jakarta timezone (UTC+07) to the timezone-naive timestamps
d_data['RecordedDate_updated'].dt.tz_localize('Asia/Jakarta')

In [None]:
# Let's convert to Kuala Lumpur time (UTC+08)
d_data['RecordedDate_updated'].dt.tz_localize('Asia/Jakarta').dt.tz_convert('Asia/Kuala_Lumpur')
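
The difference between the two calls is easy to mix up: `tz_localize` labels naive timestamps with a timezone without shifting the clock time, while `tz_convert` keeps the same instant and changes how it is displayed. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical timezone-naive timestamps.
naive = pd.Series(pd.to_datetime(["2020-01-01 09:00", "2020-06-01 18:30"]))

# Localize: the clock time stays 09:00, now labelled as Jakarta time (UTC+07).
jakarta = naive.dt.tz_localize("Asia/Jakarta")

# Convert: same instant, shown on the Kuala Lumpur clock (UTC+08).
kuala_lumpur = jakarta.dt.tz_convert("Asia/Kuala_Lumpur")

print(jakarta)
print(kuala_lumpur)
```

Calling `tz_convert` directly on naive timestamps raises an error, which is why the localize step comes first.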

### Brief Timeseries

Resampling

In [None]:
# Using the datetime as index first...
dt_data = d_data.set_index('RecordedDate_updated')
dt_data.head()

In [None]:
# Sum monthly charges 
dt_data['MonthlyCharges'].resample("1M").sum()

In [None]:
# Mean monthly Netflix usage (in MB) 
dt_data['netflix_usage_megabytes'].resample("1M").mean()

In [None]:
# Min monthly internet speed  
dt_data['average_internet_speed_in_megabytes'].resample("1M").min()

In [None]:
# Min daily internet speed  
dt_data['average_internet_speed_in_megabytes'].resample("1D").min()

In [None]:
# Min daily internet speed, dropping the days that have no records
dt_data['average_internet_speed_in_megabytes'].resample("1D").min().dropna()
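
To see why `dropna()` matters after resampling, here is a small self-contained sketch on made-up data. It uses the `"MS"` (month start) alias as an assumption for portability, because recent pandas releases deprecate the `"M"` alias in favour of `"ME"`; the grouping by calendar month is the same, only the bin labels differ.

```python
import pandas as pd

# Hypothetical charges; note that March has no records at all.
idx = pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2020-04-01"])
charges = pd.Series([10.0, 20.0, 5.0, 7.0], index=idx)

# Each calendar month becomes one bin, labelled by its first day.
monthly_sum = charges.resample("MS").sum()    # an empty month sums to 0.0
monthly_mean = charges.resample("MS").mean()  # an empty month becomes NaN

print(monthly_sum)
print(monthly_mean)
print(monthly_mean.dropna())
```

Empty bins behave differently depending on the aggregation, so it is worth checking whether a zero or a missing value is the right answer for the question being asked.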

Aggregates

In [None]:
# Comparing internet speed with monthly charges
dt_data[['average_internet_speed_in_megabytes', 'MonthlyCharges']].resample("1M").min().head()

## In these first rows, months with a higher minimum internet speed also tend to show higher minimum monthly charges
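
An observation like this can be checked a little more carefully than by reading the first few rows. The sketch below is one possible approach, not part of the original analysis: it applies a different aggregation per column with `agg` and then computes the correlation between the monthly values. The column names `speed` and `charges` and all the numbers are made up.

```python
import pandas as pd

# Hypothetical data: two records per month for three months.
idx = pd.to_datetime([
    "2020-01-05", "2020-01-20",
    "2020-02-03", "2020-02-25",
    "2020-03-10", "2020-03-28",
])
df = pd.DataFrame(
    {
        "speed": [10, 12, 20, 22, 30, 35],
        "charges": [50.0, 55.0, 70.0, 72.0, 90.0, 95.0],
    },
    index=idx,
)

# A dict passed to agg applies a different function to each column.
monthly = df.resample("MS").agg({"speed": "min", "charges": "mean"})

# Pearson correlation between the two monthly series.
corr = monthly["speed"].corr(monthly["charges"])

print(monthly)
print("Correlation:", round(corr, 3))
```

A correlation close to 1 supports the pattern, although with only a handful of months it remains a rough indication rather than evidence of a causal link.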

### Window Functions

Basic Rolling window

Source: https://pandas.pydata.org/pandas-docs/stable/reference/window.html

In [None]:
# Rolling sum of the monthly mean Netflix usage (in MB) over a 2-month window
monthly_netflix = dt_data['netflix_usage_megabytes'].resample("1M").mean()

monthly_netflix.rolling(2).sum().head()

In [None]:
# Rolling mean of monthly Netflix usage (in MB) - 2 months
monthly_netflix.rolling(2).mean().head()
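
The first value of a rolling result is `NaN` because a window of 2 needs two observations. The sketch below, on made-up numbers, shows this and how the `min_periods` parameter relaxes the requirement.

```python
import pandas as pd

# Hypothetical monthly values.
usage = pd.Series([10.0, 20.0, 30.0, 40.0])

# Default: a result is produced only once the window is full.
rolling_mean = usage.rolling(2).mean()

# min_periods=1 allows a partial window at the start of the series.
rolling_partial = usage.rolling(2, min_periods=1).mean()

print(rolling_mean.tolist())     # first entry is NaN
print(rolling_partial.tolist())  # first entry is 10.0
```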

### Basic Plotting

- Line
- Title
- XY axis label
- Styling (color, shape)

In [None]:
# Basic plot
monthly_netflix.plot()

In [None]:
# Let's add a title

monthly_netflix.plot(title="Monthly netflix usage in MB")

In [None]:
# Let's label the x-axis and y-axis

ax = monthly_netflix.plot(title="Monthly netflix usage in MB")
ax.set_xlabel("Months")
ax.set_ylabel("Netflix Usage in MB")

In [None]:
ax = monthly_netflix.plot(title="Monthly netflix usage in MB")
ax.set_xlabel("Months")
ax.set_ylabel("Netflix Usage in MB")

# Styling: change the color of the first (and only) line on the axes
ax.get_lines()[0].set_color("red")
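
Color and shape can also be set directly in the `plot` call, since styling keywords such as `color`, `linestyle`, and `marker` are passed through to matplotlib. The sketch below uses made-up monthly values in place of `monthly_netflix` and assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical monthly values standing in for monthly_netflix.
usage = pd.Series(
    [100.0, 120.0, 90.0],
    index=pd.date_range("2020-01-01", periods=3, freq="MS"),
)

# Styling keywords are forwarded to matplotlib's line plot.
ax = usage.plot(
    title="Monthly netflix usage in MB",
    color="green",
    linestyle="--",
    marker="o",
)
ax.set_xlabel("Months")
ax.set_ylabel("Netflix Usage in MB")
```

Setting the style in the call avoids having to look up the line object afterwards with `ax.get_lines()`.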

### Exercise

1. Convert RecordedDate to the YYYY-MM-DD format and store it in the 'RecordedDate_updated_exercise' column
ANS: d_data['RecordedDate_updated_exercise'] = d_data['RecordedDate_updated'].dt.strftime('%Y-%m-%d')
2. Express the timedelta in units of 2 days
ANS: timedelta_2days = timedelta.astype('timedelta64[2D]')
3. Change the rolling mean of Netflix usage from a 2-month to a 3-month window
ANS: monthly_netflix.rolling(3).mean()
4. Change the line color to green
ANS:
'''
ax = monthly_netflix.plot(title="Monthly netflix usage in MB")
ax.set_xlabel("Months")
ax.set_ylabel("Netflix Usage in MB")
ax.get_lines()[0].set_color("green")
'''