# Wk18 Lecture02 CodeAlong: UFOs

## Learning Objectives

- By the end of this CodeAlong, students will be able to:
   - Calculate time series statistics (rolling mean/std/diff/pct_change)
   - Perform feature engineering for time series EDA 
   - Aggregate time series using date parts to answer stakeholder questions.

    

# 🕹️Part 1) Preparing Irregular-Interval Time Series

### Overview from Last Lecture

- 1) [ ] Convert the dates & times to a single column (if needed).
- 2) [ ] Convert the datetime column (most likely a string) to a datetime data type.
- 3) [ ] Set the datetime column as the Series/DataFrame index
- 4) [ ] Resample the time series to the desired/correct frequency using the desired/correct aggregation method.
- 5) [ ] Impute null values (if required)
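
As a quick refresher, the five steps above can be sketched on a tiny made-up dataset. The column names (`date`, `time`) and the values are assumptions for illustration only; they are not the UFO data.

```python
import pandas as pd

# Hypothetical toy data: irregular events with separate date and time strings
raw = pd.DataFrame({
    "date": ["01/01/2020", "01/01/2020", "01/03/2020"],
    "time": ["08:30", "21:15", "23:59"],
})

# 1) Combine the date and time into a single column
raw["datetime"] = raw["date"] + " " + raw["time"]

# 2) Convert the string column to a datetime dtype
raw["datetime"] = pd.to_datetime(raw["datetime"], format="%m/%d/%Y %H:%M")

# 3) Set the datetime column as the index
toy_ts = raw.set_index("datetime")

# 4) Resample to daily frequency, counting the events in each day
daily = toy_ts.resample("D").size()

# 5) Impute nulls if required (counts have none: empty days are already 0)
daily = daily.fillna(0)
print(daily)
```

Because the aggregation here is a count, days without events come back as 0 rather than null. Other aggregations (such as a mean) would leave nulls that need a deliberate imputation choice.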


### UFO Sightings

- UFO Sightings: https://www.kaggle.com/datasets/NUFORC/ufo-sightings 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticks
import seaborn as sns


import missingno as miss
import datetime as dt
import statsmodels.tsa.api as tsa

plt.rcParams['figure.figsize'] = [10,5]

In [None]:
ufo  = pd.read_csv("Data/ufos-kaggle/scrubbed.csv", low_memory=False)
ufo

## Is this regular or irregular (events)?

In [None]:
ufo.info()

## 1) Convert the dates & times to a single column (if needed).

Datetime is already one column.  Nothing to do here.

## 2) Converting Date Cols to Datetime

In [None]:
## Investigate the date format
ufo.loc[0,'datetime']

In [None]:
## Set the date format
fmt = '%m/%d/%Y %H:%M'

In [None]:
## Convert the datetime column to a datetime dtype (expect an error here; see the note below)
ufo['datetime'] = pd.to_datetime(ufo['datetime'], format=fmt)

> Pandas is confused by 24:00. The `%H` directive only accepts hours 00-23, and pandas doesn't know whether we mean 0:00 of the NEXT day or 11:59 pm (23:59) of the same day.

#### Handling Errors with pd.to_datetime

- Can use the `errors` argument for pd.to_datetime:
    - "raise" (default): raise an exception when errors happen
    - 'ignore': ignore the errors and return the original input unchanged.
        - NOT RECOMMENDED: the column will not be converted to a datetime dtype. (Recent pandas versions also deprecate this option.)
    - 'coerce': convert any bad datetime values to null values (`NaT`, "Not a Time")
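
A minimal sketch of the difference between "raise" and "coerce", using two made-up strings (one with the problematic `24:00`). The variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the second value uses the out-of-range hour "24"
raw_times = pd.Series(["10/10/1949 20:30", "10/11/2006 24:00"])
fmt = "%m/%d/%Y %H:%M"

# errors="raise" (the default) stops at the first bad value
try:
    pd.to_datetime(raw_times, format=fmt)
    raised = False
except ValueError:
    raised = True

# errors="coerce" converts values that cannot be parsed to NaT instead
coerced = pd.to_datetime(raw_times, format=fmt, errors="coerce")
n_bad = coerced.isna().sum()
print(raised, n_bad)
```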

>- **Branch point: we have a choice on how we deal with the bad timestamps.**
    - Do we coerce them, make them null values, and drop them? Potentially losing a lot of data.
    - Or do we investigate a bit more to see if we can fix the problem without losing data.
    
    
- Let's see how much data we would lose if we chose to coerce the bad values:

In [None]:
## Check missing data before coerce
ufo['datetime'].isna().sum()

In [None]:
## Check missing data after coerce
coerced_dt = pd.to_datetime(ufo["datetime"], format=fmt, errors='coerce')
coerced_dt.isna().sum() / len(ufo)

Should we drop the rows, or try to fix the times?

In [None]:
## Drop the rows


In [None]:
## Fix the errors: 
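
Both options can be sketched on made-up strings. Option B below assumes that `24:00` means midnight at the start of the NEXT day; that interpretation is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Hypothetical sample with one "24:00" timestamp
raw_times = pd.Series(["10/10/1949 20:30", "10/11/2006 24:00"])
fmt = "%m/%d/%Y %H:%M"

# Flag the rows that use the "24:00" convention
is_midnight = raw_times.str.contains(" 24:00", regex=False)

# Option A: coerce the bad values to NaT, then drop them
dropped = pd.to_datetime(raw_times, format=fmt, errors="coerce").dropna()

# Option B: rewrite 24:00 as 00:00, parse, then shift those rows forward one day
fixed = pd.to_datetime(
    raw_times.str.replace(" 24:00", " 00:00", regex=False), format=fmt
)
fixed = fixed + pd.to_timedelta(is_midnight.astype(int), unit="D")
print(fixed)
```

Option B keeps every row, at the cost of committing to one reading of the ambiguous timestamp.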


## 3) Setting datetime index

In [None]:
## Create ufo_ts by setting the datetime index
ufo_ts = ufo.set_index('datetime')
ufo_ts

In [None]:
## Check index and frequency
ufo_ts.index

## 4) Resample Data to Desired Frequency

What frequency should we resample our data to?  This requires some thinking.

### Let's visualize Our Data

In [None]:
## Plot the full dataset
ufo_ts.plot()

> Hmmmm.... what are we *trying* to visualize?



**What do we really want to know about UFOs?**
- Duration of sighting?
- Location of sighting?
- Number of sightings?

### Converting to Daily Frequency

**We want to quantify the number of events that occurred within each interval.**

>- Q: How could we do this?

Resample by day and aggregate by the number of entries for each day
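
One way to count events per interval is `.resample("D").size()`, sketched here on a made-up frame with a datetime index (the `shape` values are placeholders).

```python
import pandas as pd

# Hypothetical events: two on July 4, none on July 5, two on July 6
toy = pd.DataFrame(
    {"shape": ["light", "disk", "light", "circle"]},
    index=pd.to_datetime([
        "2010-07-04 21:00", "2010-07-04 22:30",
        "2010-07-06 01:15", "2010-07-06 23:00",
    ]),
)

# .size() counts the rows in each daily bin, including empty bins (count 0)
daily_counts = toy.resample("D").size()
print(daily_counts)
```

`.size()` returns a single Series of row counts, whereas `.count()` would return one column of non-null counts per original column.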


### Make `ts` from ufo_ts

In [None]:
# Resample by day and number of sightings
ts = 

In [None]:
# plot the ts
ts.plot()

In [None]:
## Change figsize to 10,5
plt.rcParams['figure.figsize'] = [10,5]

In [None]:
## Plot again
ts.plot()

Is all of this data relevant and interesting?  When did sightings really start becoming significant?


In [None]:
## keep only recent data
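
A datetime index supports slicing with partial date strings. The sketch below uses made-up counts and an assumed cutoff of 1950; the right cutoff for the real data depends on what the plot above shows.

```python
import pandas as pd

# Hypothetical counts on a sorted datetime index
ts_toy = pd.Series(
    [1, 3, 2, 5],
    index=pd.to_datetime(["1949-12-31", "1950-01-01", "1975-06-01", "2010-01-01"]),
)

# Keep everything from the start of 1950 onward (assumed cutoff)
recent = ts_toy.loc["1950":]
print(recent)
```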


In [None]:
## Plot again
ts.plot()

# Part 2) Aggregating Full Dataset Using Date Parts

## 📝 **Stakeholder Questions to Answer**

**ANSWER TOGETHER:**
- 1) What Month and Year had the most sightings? (and how many sightings were there?)

- 2) Which month of the year has had the highest total number of reported sightings?
- 3) Is there a seasonal pattern to UFO sightings? If so, how long is the season?

- 4) Which US holiday has the largest number of sightings?
___
**ANSWER SELECTED Q's IN BREAKOUT ROOMS**

- 5) Which year had the highest % increase in sightings compared to the previous year? (since 1950)

- 6) What day of the week has the highest reported sightings?

- 7) At what time of day (hour) do most sightings occur?

- 8) Which US state has the most sightings?

- 9) Which country had the largest proportion of sightings for the year 2000?

- 10) Have the types/shapes of UFOs witnessed changed over time?
    - Tip: use only the 4 most common shapes

### Making `eda_df` for answering questions

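To use the `.dt` accessor in Pandas, we need `datetime` as a regular column again: `.dt` works on a Series, not on an index. (A DatetimeIndex exposes the same attributes directly, e.g. `ufo_ts.index.year`, but keeping the date as a column makes the feature engineering below more uniform.)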

In [None]:
## making eda_df with date as a column instead of index
eda_df = ufo_ts.reset_index()
eda_df

## Feature Engineering: Date Parts

- Datetime objects have:
    - year
    - month
    - month_name()
    - day
    - day_name()
    - hour
    - second
    
- Pandas has a `.dt.` accessor to use datetime methods on an entire column at once.
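
A small sketch of the `.dt` accessor on a made-up datetime column. The column names mirror the cell below; storing the month as a name rather than a number is one possible choice, not a requirement.

```python
import pandas as pd

# Hypothetical frame with a datetime column
toy = pd.DataFrame(
    {"datetime": pd.to_datetime(["2004-07-04 21:30", "2011-12-31 23:05"])}
)

# Attributes (no parentheses) and methods (with parentheses) via .dt
toy["year"] = toy["datetime"].dt.year
toy["month"] = toy["datetime"].dt.month_name()
toy["day of month"] = toy["datetime"].dt.day
toy["day of week"] = toy["datetime"].dt.day_name()
toy["hour"] = toy["datetime"].dt.hour
print(toy)
```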

In [None]:
## feature engineering for dates
eda_df['year'] = eda_df['datetime'] ##
eda_df['month'] = eda_df['datetime'] ##
eda_df['day of month'] = eda_df['datetime'] ##
eda_df['day of week'] = eda_df['datetime'] ##
eda_df['hour'] = eda_df['datetime'] ##
eda_df.head()

> Let's add a "weekend" feature that will be True if the day was a Saturday or Sunday.
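
One possible way to build such a flag, sketched on made-up dates (a Friday, a Saturday, and a Sunday):

```python
import pandas as pd

# Hypothetical dates: 2014-05-02 is a Friday, followed by Saturday and Sunday
toy = pd.DataFrame(
    {"datetime": pd.to_datetime(["2014-05-02", "2014-05-03", "2014-05-04"])}
)

# True when the day name is Saturday or Sunday
toy["weekend"] = toy["datetime"].dt.day_name().isin(["Saturday", "Sunday"])

# Equivalent alternative: Monday=0 ... Sunday=6, so weekends are >= 5
weekend_alt = toy["datetime"].dt.dayofweek >= 5
print(toy)
```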

In [None]:
## let's add a weekend feature
eda_df['weekend'] = ##
eda_df.head()

#### Let's add a column for the decade

In [None]:
## Calculate decade by subtracting (year modulo 10) from the year
eda_df['decade'] = ##
eda_df.head()
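
The modulo trick in a minimal form, using a few made-up years:

```python
import pandas as pd

# Hypothetical years
years = pd.Series([1997, 2004, 2010])

# year % 10 is the last digit; subtracting it rounds down to the decade
decade = years - years % 10
print(decade.tolist())
```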

## 🕹️ Answering Stakeholder Questions (Together)

### Making `eda_ts` & `ts`

### 1) What Month/Year had the most sightings? (and how many sightings were there?)


In [None]:
## make a ts that is resampled to monthly
eda_ts = 
eda_ts.head()

In [None]:
## Resample to correct frequency


In [None]:
## get the date of the max sightings


In [None]:
# how many sightings?
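
A sketch of the monthly-count pattern on made-up events. The names `events`, `m_counts`, `date_most`, and `n_most` are hypothetical. The `"MS"` (month start) alias is used here because it works across pandas versions; `"M"`/`"ME"` would label each bin by month end instead.

```python
import pandas as pd

# Hypothetical events: 1 in June, 3 in July, 2 in August 2012
events = pd.DataFrame(
    {"city": ["a", "b", "c", "d", "e", "f"]},
    index=pd.to_datetime([
        "2012-06-03", "2012-07-04", "2012-07-04",
        "2012-07-20", "2012-08-01", "2012-08-15",
    ]),
)

# Count the events in each month
m_counts = events.resample("MS").size()

# idxmax() gives the index label (month) of the largest count
date_most = m_counts.idxmax()
n_most = m_counts.max()
print(date_most.strftime("%m/%Y"), n_most)
```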


In [None]:
## Plot the ts and add vertical line at month with most sightings
ax = m_ts.plot();
fmt = "%m/%Y"
ax.axvline(date_most_ufos, ls='--',color='k', 
           label=f"{date_most_ufos.strftime(fmt)} had {m_ts.loc[date_most_ufos]}")
ax.legend()

### 2) Which month of the year has had the highest total number of reported sightings?

In [None]:
## Check value counts of months in eda_df
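
Unlike question 1, this question pools the same calendar month across all years, so a plain `value_counts()` on the month name is enough. A sketch on made-up dates:

```python
import pandas as pd

# Hypothetical dates: two Julys (different years) and one January
toy = pd.DataFrame(
    {"datetime": pd.to_datetime(["2010-07-04", "2011-07-15", "2011-01-20"])}
)

# Count rows per calendar month, ignoring the year
month_counts = toy["datetime"].dt.month_name().value_counts()
top_month = month_counts.idxmax()
print(month_counts)
```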


### 3) Is there a seasonal pattern to recent UFO sightings? If so, how long is the season?

### Seasonality

In [None]:
import statsmodels.tsa.api as tsa

In [None]:
## Isolate just years since 2000 to capture recent trends


In [None]:
## plot the sliced ts


In [None]:
## Isolate trend and seasonal components with seasonal_decompose()


In [None]:
## Plot the decomposition


In [None]:
## separate seasonal component and plot
seasonal = decomp.seasonal
ax = seasonal.plot(figsize=(12,3))
ax.set(ylabel='Change in # of Sightings',
      title='Seasonal Component of Sightings');


### 4) Which US holiday has the largest number of sightings?

#### Feature Engineering: Holidays

In [None]:
# !pip install holidays
import holidays
import datetime as dt
from holidays import country_holidays

In [None]:
## Create an instance of the US country holidays.


In [None]:
## create a test holiday 
test = "01/01/2015"
test

In [None]:
## use .get() to test the api 


In [None]:
## Map the api's .get method onto the df to get all holidays

## Check the unique holidays


Apparently Juneteenth has not made it on there yet.

#### Answer to which holiday has most sightings:

In [None]:
## Plot count of holiday sightings using sns.countplot()


#### Wait...when did **that** movie come out?

In [None]:
release_date= '1997-07-03'

In [None]:
## Plot the # of sightings over time and annotate the release date


### 7-day Rolling Mean of Daily Sightings
This is so noisy, let's try plotting a rolling mean.
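
A minimal sketch of a 7-day rolling mean on made-up daily counts that alternate between 0 and 7:

```python
import pandas as pd

# Hypothetical noisy daily counts
daily = pd.Series(
    [0, 7, 0, 7, 0, 7, 0, 7, 0, 7],
    index=pd.date_range("2014-01-01", periods=10, freq="D"),
)

# Each value is the mean of the current day and the 6 days before it;
# the first 6 days are NaN because the window is not yet full
rolling_7 = daily.rolling(7).mean()
print(rolling_7)
```

The smoothed values hover around 3 to 4 while the raw series swings between 0 and 7, which is why the rolling mean is easier to read on a plot.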

### 5) Which year had the highest % increase in sightings compared to the previous year? (since 1950)

In [None]:
## Resample monthly ts as yearly

## Calculate percent change

In [None]:
## Find the year with the biggest percent change (absolute value)
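
A sketch of the yearly percent-change calculation on made-up monthly counts. The `"YS"` (year start) alias is used because it works across pandas versions.

```python
import pandas as pd

# Hypothetical monthly counts spanning three years
monthly = pd.Series(
    [10, 10, 15, 15, 30, 60],
    index=pd.to_datetime([
        "1950-01-01", "1950-07-01", "1951-01-01",
        "1951-07-01", "1952-01-01", "1952-07-01",
    ]),
)

# Yearly totals: 20, 30, 90
yearly = monthly.resample("YS").sum()

# Percent change vs. the previous year: NaN, 50%, 200%
pct = yearly.pct_change() * 100

# Year with the largest increase
year_biggest = pct.idxmax().year
print(pct.round(1), year_biggest)
```

The first year is always NaN because there is no earlier year to compare against.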


In [None]:
## Plot the percent changes and add a line at the biggest


## 🏓**Breakout Rooms: Answering Stakeholder Questions**

**Choose 1-2 of the remaining questions and work in breakout rooms to answer them:**
- 5) Which year had the highest % increase in sightings compared to the previous year?
- 6) What day of the week has the highest reported sightings?
- 7) At what time of day (hour) do most sightings occur?
- 8) Which US state has the most sightings?
- 9) Which country had the largest proportion of sightings for the year 2000?
- 10) Have the types/shapes of UFOs witnessed changed over time?
    - Tip: use only the 4 most common shapes
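
Two patterns that may help with several of these questions, sketched on a made-up frame (the column values are placeholders): `value_counts(normalize=True)` for proportions, and `pd.crosstab` for counts of one category across another.

```python
import pandas as pd

# Hypothetical sightings
toy = pd.DataFrame({
    "year": [1995, 1998, 2000, 2000, 2000, 2000],
    "decade": [1990, 1990, 2000, 2000, 2000, 2000],
    "shape": ["disk", "light", "light", "light", "triangle", "disk"],
    "country": ["us", "us", "us", "ca", "us", "gb"],
})

# Proportion of sightings per country for a single year
country_share = toy.loc[toy["year"] == 2000, "country"].value_counts(normalize=True)

# Counts of the most common shapes per decade (top 2 here; the tip suggests 4)
top_shapes = list(toy["shape"].value_counts().head(2).index)
shape_by_decade = pd.crosstab(toy["decade"], toy["shape"])[top_shapes]
print(country_share)
print(shape_by_decade)
```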



### 5) Which year had the highest % increase in sightings compared to the previous year? (since 1950)

### 6) What day of the week has the highest reported sightings?

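### 7) At what time of day (hour) do most sightings occur?

### 8) Which US state has the most sightings?

### 9) Which country had the largest proportion of sightings for the year 2000?

### 10) Have the types/shapes of UFOs witnessed changed over time?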

___
# Bonus: Plotly Express

In [None]:
import plotly.express as px
import plotly.io as pio

### Map Over Time

In [None]:
eda_df = eda_df.sort_values('decade')
eda_df.columns = eda_df.columns.str.strip()
eda_df['latitude'] = pd.to_numeric(eda_df['latitude'], errors='coerce')
eda_df.head()

In [None]:
px.scatter_geo(data_frame=eda_df, lat='latitude',lon='longitude', animation_frame="decade",
              template='ggplot2')