# TELE 6500 LAB 1: Introduction and Exploratory Data Analysis

Welcome to the first lab of TELE 6500.  You will practice the following topics.

1. How is time handled in Python?
2. EDA of pollution dataset: importing the data and preliminaries
3. EDA: Visualization
4. EDA: ACF/PACF
5. EDA: Handling Missing Values
6. ACF/PACF revisited

## 1. How is time handled in Python?

In this section, we review the fundamentals of managing time in Python.  Let's begin by importing the time module from the standard library.

In [None]:
# Python’s standard library includes a module called time
# that can print the number of seconds since the Unix epoch:
import time


Now, we can print the number of seconds elapsed since the Unix epoch, which began at midnight UTC on January 1, 1970.

In [None]:
time.time()

As we can see, this module is mainly used to create timestamps and is a good fit for measuring the running time of code.
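
As a minimal sketch of timing code with this module, the snippet below brackets a small computation with two clock readings. The function `sum_of_squares` is a hypothetical workload invented for illustration, and `time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring short durations.

```python
import time


def sum_of_squares(n):
    """Hypothetical workload, used only so there is something to time."""
    return sum(i * i for i in range(n))


start = time.perf_counter()            # first clock reading
result = sum_of_squares(100_000)
elapsed = time.perf_counter() - start  # elapsed wall-clock time in seconds

print(f"result = {result}, elapsed = {elapsed:.6f} s")
```

The elapsed value varies from run to run, so repeated measurements are usually averaged.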

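Now, let us import datetime, which offers date, time and datetime classes.  Note that this import rebinds the name time to the datetime.time class, so the time module imported earlier is no longer reachable under that name.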

In [None]:
from datetime import date, time, datetime

We can create a datetime object from the timestamp obtained above.

In [None]:
datetime.fromtimestamp(1726669469.8785067)

Next, we can create date and time objects using the respective constructors.

In [None]:
date(year=2020, month=1, day=31)

In [None]:
time(hour=13, minute=14, second=31)

As we see, date and time objects on their own are not particularly useful.  Let's create two datetime objects.

In [None]:
t1 = datetime(year=2020, month=2, day=29, hour=13, minute=14, second=31)
t2 = datetime(year=2020, month=2, day=29, hour=14, minute=14, second=31)

Let's take the difference of the two datetime objects. 

In [None]:
t2 - t1

The difference of two datetime objects is always a timedelta object.
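
To make this concrete, here is a short sketch that repeats the subtraction and inspects the result. The variable names `start`, `end` and `gap` are illustrative only.

```python
from datetime import datetime, timedelta

start = datetime(year=2020, month=2, day=29, hour=13, minute=14, second=31)
end = datetime(year=2020, month=2, day=29, hour=14, minute=14, second=31)

gap = end - start                   # subtraction yields a timedelta

print(type(gap).__name__)           # timedelta
print(gap.total_seconds())          # 3600.0
print(gap == timedelta(hours=1))    # True
```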

Now, let's work with timezones.

In [None]:
from dateutil import tz, parser
from datetime import timedelta

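We can obtain the current UTC time in two ways: with utcnow(), or with now() and an explicit timezone.  Note that utcnow() returns a naive datetime (no timezone attached) and is deprecated as of Python 3.12, whereas now(tz=tz.UTC) returns a timezone-aware object.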

In [None]:
now = datetime.utcnow()

In [None]:
print(now)

In [None]:
datetime.now(tz=tz.UTC)

Next, let's create a datetime object using a string and then assign it a timezone.

In [None]:
MY_DATE = parser.parse('May 15, 2024 8:00 AM')

print(f'Before assignment, timezone = {MY_DATE.tzname()}')

MY_DATE = MY_DATE.replace(tzinfo=tz.gettz('America/New_York'))

print(f'After assignment, timezone={MY_DATE.tzname()}')

In [None]:
MY_DATE
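
Once a datetime is timezone-aware, it can be converted to another zone with astimezone. The following is a minimal sketch, assuming the timezone database used by dateutil is available on the machine; the names `local_dt` and `utc_dt` are illustrative.

```python
from dateutil import parser, tz

# Parse a naive datetime and attach the New York timezone, as in the cells above.
local_dt = parser.parse('May 15, 2024 8:00 AM').replace(
    tzinfo=tz.gettz('America/New_York'))

# Convert the same instant to UTC. New York observes daylight saving time
# in May (UTC-4), so 08:00 local time corresponds to 12:00 UTC.
utc_dt = local_dt.astimezone(tz.UTC)

print(local_dt.isoformat())
print(utc_dt.isoformat())
```

Note that replace(tzinfo=...) only attaches a label to the existing wall-clock time, while astimezone converts the instant to a different wall-clock time.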

Next, let's do some timezone arithmetic.

In [None]:
now = datetime.now(tz=tz.tzlocal())

In [None]:
print(now + timedelta(days=1))

In [None]:
# This raises a TypeError: timedelta does not accept a months argument.
print(now + timedelta(months=10))

In [None]:
print(now + timedelta(days=200))

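As we studied in class, timedelta only supports fixed-length units (weeks, days, hours, minutes, seconds and smaller), so it cannot express months or years.  For calendar-aware arithmetic, we need relativedelta from dateutil.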

In [None]:
from dateutil.relativedelta import relativedelta

In [None]:
now = datetime.now()

In [None]:
delta = relativedelta(years=+5, months=+1, days=+3, hours=-4, minutes=-30)

In [None]:
now + delta

In [None]:
now + delta - now
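
The difference between the two classes shows up most clearly at month boundaries. The sketch below uses an arbitrary example date (`base`, chosen for illustration) to compare a fixed 30-day shift with a one-month calendar shift.

```python
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

base = datetime(2020, 1, 31)

# timedelta has no notion of months: it only shifts by a fixed duration.
plus_30_days = base + timedelta(days=30)

# relativedelta shifts the calendar month and clamps the day to the last
# valid day of the target month (2020 is a leap year).
plus_one_month = base + relativedelta(months=1)

print(plus_30_days)     # 2020-03-01 00:00:00
print(plus_one_month)   # 2020-02-29 00:00:00
```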

Next, we move on to the Beijing pollution dataset and explore the various patterns.

## 2. EDA of pollution dataset

Let's begin by importing the libraries and the data.

In [None]:
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
from datetime import datetime

import pandas as pd

In [None]:
df = pd.read_csv('PRSA_Data_Dingling_20130301-20170228.csv')

In [None]:
df.head()

As we see, the year, month, day and hour are imported as separate columns.  We need to combine these date-related columns into a single datetime column.

In [None]:
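# Combine year, month, day and hour into one datetime column while keeping the originals.
# Note: date_parser, keep_date_col and nested parse_dates lists are deprecated in pandas 2.x.
df = pd.read_csv('PRSA_Data_Dingling_20130301-20170228.csv',
                 parse_dates=[['year', 'month', 'day', 'hour']],
                 date_parser=lambda x: datetime.strptime(x, '%Y %m %d %H'),
                 keep_date_col=True)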
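
Because the date_parser and keep_date_col arguments of read_csv are deprecated in recent pandas releases, a possible alternative is to read the file normally and assemble the datetime column afterwards with pd.to_datetime. The sketch below uses a hypothetical three-row frame named `raw` in place of the real file; the original date-part columns keep their integer dtype in this approach.

```python
import pandas as pd

# Hypothetical miniature of the raw file: date parts live in separate columns.
raw = pd.DataFrame({
    'year': [2013, 2013, 2013],
    'month': [3, 3, 3],
    'day': [1, 1, 1],
    'hour': [0, 1, 2],
    'PM2.5': [4.0, 7.0, 5.0],
})

# pd.to_datetime can assemble datetimes from columns named year, month, day, hour.
raw['year_month_day_hour'] = pd.to_datetime(raw[['year', 'month', 'day', 'hour']])

print(raw.dtypes)
print(raw.head())
```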


Let's inspect the first few rows of the imported dataframe.

In [None]:
df.head(10)

This looks much better!  Let's inspect the data types for the columns and the number of entries they have.

In [None]:
df.info()

Hmm... most of the data types look okay.  Note that the number of non-null entries differs across columns, which means several of the columns have missing data.

We can also check for the number of unique entries in each column.

In [None]:
df.nunique()
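
The amount of missing data can also be counted directly. A minimal sketch, using a small hypothetical frame `demo` in place of the pollution dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the pollution data.
demo = pd.DataFrame({
    'PM2.5': [4.0, np.nan, 7.0, 5.0],
    'TEMP': [-0.7, -1.1, np.nan, np.nan],
    'RAIN': [0.0, 0.0, 0.0, 0.0],
})

missing_count = demo.isna().sum()          # number of NaN entries per column
missing_share = demo.isna().mean() * 100   # percentage of NaN entries per column

print(missing_count)
print(missing_share)
```

The same two expressions can be applied to the full dataframe to see which columns are most affected.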

### Different types of visualization of given data (no resampling)

Next, we create different types of visualizations.

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots

In [None]:
fig = px.line(df,
              x='year_month_day_hour',
              y='PM2.5',
              title='PM2.5 with Slider')

# Add a range slider and buttons that zoom to the last 1, 2 or 3 years.
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=[
            dict(count=1, label='1y', step='year', stepmode='backward'),
            dict(count=2, label='2y', step='year', stepmode='backward'),
            dict(count=3, label='3y', step='year', stepmode='backward'),
            dict(step='all'),
        ]
    )
)
fig.show()

fig.write_html("PM25_slider.html")

In [None]:
df = df.set_index('year_month_day_hour')

In [None]:
df.head()

In [None]:
df['PM2.5'].plot(grid=True)

In [None]:
df.loc['2015', 'PM2.5'].plot(grid=True, figsize=(12, 8))

In [None]:
df.loc['2014':'2016', ['month', 'O3']].boxplot(by='month',
                                               showfliers=False,
                                               showmeans=True,
                                               positions=[1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9],
                                               figsize=(12, 8))

In [None]:
df.loc['2015'][['PM2.5', 'TEMP']].plot(subplots=True, figsize=(16, 9))

In [None]:
multi_data = df[['TEMP', 'PRES', 'DEWP', 'RAIN', 'PM2.5']]
multi_data.plot(subplots=True, figsize=(16, 9))

In [None]:
g = sns.pairplot(df[['SO2', 'NO2', 'O3', 'CO', 'PM2.5', 'year']], hue='year')

In [None]:
pd.plotting.lag_plot(df['TEMP'], lag=100)

### Visualization at different time scales (resampling)

In [None]:
df

In [None]:
data_columns = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP']
df_daily_mean = df[data_columns].resample('D').mean()

In [None]:
# Start and end of the date range to extract
start, end = '2015-01', '2015-03'

# Plot the hourly series and its daily mean together
fig, ax = plt.subplots(figsize=(16, 9))
ax.plot(df.loc[start:end, 'PM2.5'], marker='.', linestyle='-', linewidth=0.5, label='Hourly')
ax.plot(df_daily_mean.loc[start:end, 'PM2.5'], marker='o', markersize=8, linestyle='-', label='Daily Mean Resample')
ax.set_ylabel('PM2.5', fontsize=12)
ax.legend();

Try creating more plots by resampling at different frequencies and with different aggregation methods.
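
As a starting point, the sketch below resamples a hypothetical hourly series (`hourly`, generated here for illustration rather than taken from the dataset) at two frequencies with three aggregation methods.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering 14 days, standing in for a pollutant column.
index = pd.date_range('2015-01-01', periods=14 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(len(index), dtype=float), index=index, name='PM2.5')

daily_mean = hourly.resample('D').mean()        # one value per calendar day
daily_max = hourly.resample('D').max()          # different aggregation, same frequency
weekly_median = hourly.resample('W').median()   # one value per week (weeks end on Sunday)

print(daily_mean.head())
print(daily_max.head())
print(weekly_median)
```

Each result is a regular pandas Series, so it can be passed to ax.plot in the same way as the daily mean above.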

## ACF and PACF Plots


Let's try to get the ACF and PACF plot.  

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(df['TEMP'], lags=90)

The plot is empty because TEMP contains missing values.  By default, plot_acf does not handle NaNs, so every autocorrelation evaluates to NaN.  ACF and PACF plots require a series without gaps.

We will learn how to fill the missing data next.
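
To see why a single gap is enough to break the ACF, consider the textbook sample autocorrelation estimator. The function `sample_acf` below is a hypothetical helper written for this illustration; it is not the statsmodels implementation, but the NaN propagation works the same way.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()              # a single NaN makes the mean, and hence d, NaN
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(nlags + 1)])


clean = np.sin(np.linspace(0, 4 * np.pi, 50))
with_gap = clean.copy()
with_gap[10] = np.nan             # introduce one missing observation

acf_clean = sample_acf(clean, nlags=5)
acf_gap = sample_acf(with_gap, nlags=5)

print(acf_clean)
print(acf_gap)                    # every entry is NaN
```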

## Handling Missing Data

In [None]:
df_temp_sample = df['2015-02-21 10:00:00':'2015-02-21 23:00:00']['TEMP']

In [None]:
df_temp_sample = pd.DataFrame(df_temp_sample)

In [None]:
# Forward fill: carry the last valid observation forward (fillna(method=...) is deprecated).
df_temp_sample['TEMP_FFILL'] = df_temp_sample['TEMP'].ffill()


In [None]:
df_temp_sample

In [None]:
# Backward fill: pull the next valid observation backward.
df_temp_sample['TEMP_BFILL'] = df_temp_sample['TEMP'].bfill()


In [None]:
df_temp_sample['TEMP_ROLLING'] = df_temp_sample['TEMP'].rolling(window=2, min_periods=1).mean()

In [None]:
df_temp_sample = df_temp_sample.reset_index()

# Fill each missing TEMP with the value recorded at the same timestamp one year earlier.
df_temp_sample['TEMP_PREV_YEAR'] = df_temp_sample.apply(
    lambda x: df.loc[x['year_month_day_hour'] - pd.offsets.DateOffset(years=1), 'TEMP']
    if pd.isna(x['TEMP']) else x['TEMP'],
    axis=1)


In [None]:
df_temp_sample

In [None]:
df_temp_sample['TEMP_LINEAR'] = df_temp_sample['TEMP'].interpolate(method='linear')

In [None]:
df_temp_sample

In [None]:
df['TEMP']
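
The fill strategies used above can be compared side by side on a tiny series. This is a sketch with made-up values; `s` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = pd.DataFrame({
    'original': s,
    'ffill': s.ffill(),                         # carry the last valid value forward
    'bfill': s.bfill(),                         # pull the next valid value backward
    'linear': s.interpolate(method='linear'),   # straight line between neighbours
})

print(filled)
```

Forward and backward fill produce flat steps, while linear interpolation yields 2.0 and 3.0 for the two gaps, which is usually a better fit for a smoothly varying quantity such as temperature.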

Let's get back to plotting the ACF and PACF.

In [None]:
df['TEMP_FILLED'] = df['TEMP'].interpolate(method='linear')

In [None]:
plot_acf(df['TEMP_FILLED']['2015-02-01 00:00:00':'2015-02-07 23:59:00'], lags=40)

In [None]:
plot_pacf(df['TEMP_FILLED']['2015-02-01 00:00:00':'2015-02-07 23:59:00'], lags=40)

We were able to create the ACF and PACF plots after filling the missing data.