# NFL Dataset Wrangling

## A. Introduction
### A.1. **Dataset description**
Lorem ipsum

### A.2. **Why the dataset is interesting**
Lorem ipsum

### A.3. **Analysis questions/goals**
Lorem ipsum

## B. Setup & Packages
### B.1. **All imports at the top of the notebook**
We import the libraries in the following cell:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

### B.2. **File paths stored in variables**
We store the file paths in the following cell:

In [21]:
raw_path = {
    'attendance': '../data/raw/attendance.csv',
    'games': '../data/raw/games.csv',
    'standings': '../data/raw/standings.csv'
}

clean_path = '../data/processed'


### B.3. **Brief explanation of each package's role**
Lorem ipsum

## C. Data Preparation
### C.1. **Import dataset**
The datasets are imported as follows:

In [22]:
attendance = pd.read_csv(raw_path['attendance'])
games = pd.read_csv(raw_path['games'])
standings = pd.read_csv(raw_path['standings'])

### C.2. **Cleaning and data wrangling steps**

1. A function that checks each dataset's shape, data types, *missing values*, and *duplicates*.

In [23]:
def cek_data(df, nama):
    """Print the shape, dtypes, missing-value counts, and duplicate-row count of df."""
    print(f"--- Tahap Pengecekan 1: {nama} ---")
    print(f"Shape: {df.shape}")
    print("\nTipe Data:\n", df.dtypes)
    print("\nMissing Values:\n", df.isnull().sum())
    print("\nDuplicates:", df.duplicated().sum())
    print('-'*15)

cek_data(games, 'Games')
cek_data(attendance, 'Attendance')
cek_data(standings, 'Standings')

--- Tahap Pengecekan 1: Games ---
Shape: (5324, 19)

Tipe Data:
 year               int64
week              object
home_team         object
away_team         object
winner            object
tie               object
day               object
date              object
time              object
pts_win            int64
pts_loss           int64
yds_win            int64
turnovers_win      int64
yds_loss           int64
turnovers_loss     int64
home_team_name    object
home_team_city    object
away_team_name    object
away_team_city    object
dtype: object

Missing Values:
 year                 0
week                 0
home_team            0
away_team            0
winner               0
tie               5314
day                  0
date                 0
time                 0
pts_win              0
pts_loss             0
yds_win              0
turnovers_win        0
yds_loss             0
turnovers_loss       0
home_team_name       0
home_team_city       0
away_team_name       0
away_team_city

2. Issues identified in step **C1**:

`games` data:
- The `date` column is still of type `object` and must be converted to `datetime`.
- The `week` column is of type `object` because the later rows use round names (`WildCard`, `Division`, `ConfChamp`, `SuperBowl`) instead of numbers; we add a numeric column, `week_num`.
- The `tie` column is missing in 5314 of 5324 rows; we fill these with 0 to mark games that did not end in a tie.
- The `time` column is of type `object`. We leave it unchanged, because a time of day is ambiguous without its date, and add a new column, `kickoff_time`, that combines both.

In [30]:
# Handle the games data
games = pd.read_csv(raw_path['games'])  # reload so the cell can be re-run without errors

# Helper column: join the month/day string with the year so the date can be parsed
games['full_date_str'] = games['date'].astype(str) + ', ' + games['year'].astype(str)

# Convert date to datetime
games['date'] = pd.to_datetime(games['full_date_str'])

# Create the kickoff_time column (values that cannot be parsed become NaT)
games['kickoff_time'] = pd.to_datetime(
    games['date'].astype(str) + ' ' + games['time'],
    errors='coerce'
)

# Map the named playoff rounds in week to integers
week_map = {
    'WildCard': 18,
    'Division': 19,
    'ConfChamp': 20,
    'SuperBowl': 21
}

# Create the week_num column
games['week_num'] = games['week'].replace(week_map).astype(int)

# Fill missing tie values with 0
games['tie'] = games['tie'].fillna(0)

cek_data(games, 'Games')
games

--- Tahap Pengecekan 1: Games ---
Shape: (5324, 22)

Tipe Data:
 year                       int64
week                      object
home_team                 object
away_team                 object
winner                    object
tie                       object
day                       object
date              datetime64[ns]
time                      object
pts_win                    int64
pts_loss                   int64
yds_win                    int64
turnovers_win              int64
yds_loss                   int64
turnovers_loss             int64
home_team_name            object
home_team_city            object
away_team_name            object
away_team_city            object
full_date_str             object
kickoff_time      datetime64[ns]
week_num                   int64
dtype: object

Missing Values:
 year              0
week              0
home_team         0
away_team         0
winner            0
tie               0
day               0
date              0
time             


| | year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city | full_date_str | kickoff_time | week_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | 0 | Sun | 2000-09-03 | 1:00PM | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago | September 3, 2000 | 2000-09-03 13:00:00 | 1 |
| 1 | 2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | 0 | Sun | 2000-09-03 | 1:00PM | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis | September 3, 2000 | 2000-09-03 13:00:00 | 1 |
| 2 | 2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | 0 | Sun | 2000-09-03 | 1:01PM | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina | September 3, 2000 | 2000-09-03 13:01:00 | 1 |
| 3 | 2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | 0 | Sun | 2000-09-03 | 1:02PM | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco | September 3, 2000 | 2000-09-03 13:02:00 | 1 |
| 4 | 2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | 0 | Sun | 2000-09-03 | 1:02PM | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore | September 3, 2000 | 2000-09-03 13:02:00 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5319 | 2019 | Division | Kansas City Chiefs | Houston Texans | Kansas City Chiefs | 0 | Sun | 2019-01-12 | 3:05PM | 51 | 31 | 434 | 1 | 442 | 1 | Chiefs | Kansas City | Texans | Houston | January 12, 2019 | 2019-01-12 15:05:00 | 19 |
| 5320 | 2019 | Division | Green Bay Packers | Seattle Seahawks | Green Bay Packers | 0 | Sun | 2019-01-12 | 6:40PM | 28 | 23 | 344 | 0 | 375 | 0 | Packers | Green Bay | Seahawks | Seattle | January 12, 2019 | 2019-01-12 18:40:00 | 19 |
| 5321 | 2019 | ConfChamp | Kansas City Chiefs | Tennessee Titans | Kansas City Chiefs | 0 | Sun | 2019-01-19 | 3:05PM | 35 | 24 | 404 | 0 | 295 | 0 | Chiefs | Kansas City | Titans | Tennessee | January 19, 2019 | 2019-01-19 15:05:00 | 20 |
| 5322 | 2019 | ConfChamp | San Francisco 49ers | Green Bay Packers | San Francisco 49ers | 0 | Sun | 2019-01-19 | 6:40PM | 37 | 20 | 354 | 0 | 358 | 3 | 49ers | San Francisco | Packers | Green Bay | January 19, 2019 | 2019-01-19 18:40:00 | 20 |
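
One detail in the output above may be worth a second look: the 2019 playoff rows are parsed as January 2019, yet their `day` value is `Sun`, and 12 January 2019 fell on a Saturday. This suggests that `year` holds the season rather than the calendar year, so games played in January or February would belong to the following calendar year. Below is a minimal sketch of one possible correction on a toy frame shaped like `games`. The helper name `parse_game_date`, the `rollover_months` cutoff, and the `date_fixed` and `day_matches` columns are our own assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def parse_game_date(df: pd.DataFrame, rollover_months=(1, 2)) -> pd.Series:
    """Parse 'date' (e.g. 'January 12') while treating 'year' as the season year.

    Hypothetical helper: games in `rollover_months` are assumed to be played
    in the calendar year after the season year.
    """
    parsed = pd.to_datetime(
        df["date"].astype(str) + ", " + df["year"].astype(str),
        format="%B %d, %Y",
    )
    # Shift January/February games forward by one calendar year
    late = parsed.dt.month.isin(rollover_months)
    return parsed.where(~late, parsed + pd.DateOffset(years=1))


# Toy rows mirroring the first and one of the last rows shown above
toy = pd.DataFrame({
    "year": [2000, 2019],
    "date": ["September 3", "January 12"],
    "day": ["Sun", "Sun"],
})
toy["date_fixed"] = parse_game_date(toy)

# Sanity check: the parsed weekday should agree with the `day` column
toy["day_matches"] = toy["date_fixed"].dt.strftime("%a") == toy["day"]
print(toy)
```

Comparing the parsed weekday with the `day` column gives a quick way to check whether the rollover assumption holds for the full dataset before applying it.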


`attendance` data:
- The `weekly_attendance` column has 638 missing values; these rows must be dropped because no game was played in those weeks.
- The same column is still of type `float64`; once the missing rows are dropped, it should be converted to `int64`.
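
The two attendance steps above could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual cleaning cell: only the `weekly_attendance` column name comes from the dataset, while `toy_attendance`, the other columns, and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `attendance`; the values are invented for illustration
toy_attendance = pd.DataFrame({
    "team_name": ["Cardinals", "Cardinals", "Cardinals"],
    "week": [1, 2, 3],
    "weekly_attendance": [77434.0, np.nan, 66009.0],
})

# Drop the rows without a game first, then cast the now-complete column to int64.
# The order matters: NaN cannot be represented in an int64 column.
cleaned = (
    toy_attendance
    .dropna(subset=["weekly_attendance"])
    .astype({"weekly_attendance": "int64"})
    .reset_index(drop=True)
)

print(cleaned.dtypes)
print(len(toy_attendance) - len(cleaned), "row(s) dropped")
```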

`standings` data:
- No anomalies found.