# Event Prediction

Observations come from two data streams (counts of people flowing into and out of the building), recorded over 15 weeks with 48 time slices per day (half-hour count aggregates). The goal is to predict the presence of an event in the building, such as a conference, which shows up as unusually high people counts for that day and time period.
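As a quick sanity check on the description above, the expected number of count rows can be worked out by hand. This is a minimal sketch that assumes every half-hour slot is present for both streams; the constant names are illustrative and the real file may contain gaps.

```python
# Expected size of the count data, assuming no half-hour slot is missing
# for either stream (an assumption, not something the source guarantees).
N_STREAMS = 2         # people flowing in and people flowing out
N_WEEKS = 15
DAYS_PER_WEEK = 7
SLICES_PER_DAY = 48   # 24 hours * 2 half-hour slices

slices_per_stream = N_WEEKS * DAYS_PER_WEEK * SLICES_PER_DAY
expected_rows = N_STREAMS * slices_per_stream

print(slices_per_stream)  # 5040
print(expected_rows)      # 10080
```

Comparing this figure with the length of the loaded count table is a cheap first test for missing or duplicated time slices.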


**Source**: https://archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts

<img src="https://novotel.accor.com/imagerie/business-meeting-hotel/seminars-picture.jpg">

## Goals:

### Understand the dataset
- How are the features related to each other?
- Are there redundant features?
- Are there outliers?
- Is there missing data?
- Are data types adequate for analysis?
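
The questions above map onto a handful of pandas checks. The sketch below runs them on a small made-up frame rather than on the real data; the column names (`count_in`, `count_out`, `count_in_copy`) and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for half-hour counts; the values are made up for illustration.
toy = pd.DataFrame({
    "count_in":  [2, 3, np.nan, 4, 3, 60, 2, 3],
    "count_out": [1, 2, 2, 2, 2, 55, 1, 2],
})
toy["count_in_copy"] = toy["count_in"]  # deliberately redundant column

# Missing data: number of NaN values per column.
missing_per_column = toy.isna().sum()

# Data types: a count column holding NaN is stored as float64, not integer.
dtypes = toy.dtypes

# Relationships and redundancy: pairwise Pearson correlation;
# a value of 1 between two columns flags a duplicate.
corr = toy.corr()

# Outliers: flag values above Q3 + 1.5 * IQR (Tukey's rule of thumb).
q1, q3 = toy["count_out"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["count_out"] > upper_fence]

print(missing_per_column, dtypes, corr, outliers, sep="\n\n")
```

On count data an unusually high value may be exactly the event signal being looked for, so flagged rows are worth inspecting before any are dropped.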

### Understand the problem
- Which features are correlated with the target feature?
- Is it possible to create new features that are correlated with the target feature?
- Answer [questions](https://en.wikipedia.org/wiki/Data_analysis#Analytical_activities_of_data_users) using the data
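
To make the idea of a derived feature concrete, here is a hedged sketch on made-up numbers. It assumes a binary target `is_event` and a hypothetical `slot` label for the time of day. Because normal traffic differs by time of day, the raw count is only weakly related to the target, while the excess over the typical count for the same slot tracks it much more closely.

```python
import pandas as pd

# Made-up counts for two time slots on four days; is_event marks the one
# slot in which an event took place (hypothetical labels).
toy = pd.DataFrame({
    "slot":     ["morning", "noon"] * 4,
    "count":    [5, 30, 6, 31, 20, 29, 4, 30],
    "is_event": [0, 0, 0, 0, 1, 0, 0, 0],
})

# Engineered feature: how far each count sits above the median count
# observed for the same time slot.
slot_median = toy.groupby("slot")["count"].transform("median")
toy["excess"] = toy["count"] - slot_median

# Pearson correlation of each numeric feature with the target.
corr_with_target = (
    toy[["count", "excess", "is_event"]].corr()["is_event"].drop("is_event")
)
print(corr_with_target)
```

The same pattern extends to other baselines, such as the median for a given weekday and time slice.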

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline

In [2]:
# Show every column when displaying wide DataFrames.
pd.set_option("display.max_columns", None)

In [3]:
# The raw Dodgers files are read without a header row, so column names are supplied here.
dodgers_events_cols = ["date", "begin_event_time", "end_event_time", "game_attendance", "away team", "win_lose_score"]
dodgers_counts_cols = ["datetime", "count"]

dodgers_counts = pd.read_csv("data/Dodgers.data",   header=None, names=dodgers_counts_cols)
dodgers_events = pd.read_csv("data/Dodgers.events", header=None, names=dodgers_events_cols)

In [4]:
# The CalIt2 files are loaded the same way: no header row, explicit column names.
calit2_events_cols  = ["date", "begin_event_time", "end_event_time", "event_name"]
calit2_counts_cols  = ["date", "time", "count"]

calit2_counts  = pd.read_csv("data/CalIt2.data",    header=None, names=calit2_counts_cols)
calit2_events  = pd.read_csv("data/CalIt2.events",  header=None, names=calit2_events_cols)