# Introduction
In this notebook, we'll focus on data wrangling. Let's outline the steps and the goals of each step.

|Step               |Goal                                                                   |
|-------------------|-----------------------------------------------------------------------|
|Data Collection    |Gather and join the data to streamline the next steps of the capstone  |
|Data Organization  |Establish the file structure and version control                       |
|Data Definition    |Understand, annotate and clean the data in preparation for future work |
|Data Cleaning      |Check for missing or incorrect data, and handle it appropriately       |

# Data Collection

## Data Loading

We'll import the data from Kaggle. 

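Kaggle requires users to sign in and generate an API key. Make sure your API key file (`kaggle.json`) is in the location the Kaggle client expects (by default the `.kaggle` folder in your home directory) before running the next cell. If necessary, also make sure that the `kaggle` package has been installed before continuing.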
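Before running the download, it can help to confirm that the key file is where the Kaggle client looks for it. The sketch below assumes the client's default location, `~/.kaggle/kaggle.json`, with an optional override through the `KAGGLE_CONFIG_DIR` environment variable. The helper name `find_kaggle_json` is hypothetical and not part of the Kaggle package.

```python
import os
import tempfile
from pathlib import Path


def find_kaggle_json(home=None, env=None):
    """Return the path to kaggle.json if it exists, otherwise None.

    Hypothetical helper: `home` and `env` are parameters so the check can be
    exercised without touching the real home directory or environment.
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    # Assumption: KAGGLE_CONFIG_DIR, when set, replaces the default ~/.kaggle
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    candidate = config_dir / "kaggle.json"
    return candidate if candidate.is_file() else None


# Demonstration against a throwaway directory rather than the real home folder
with tempfile.TemporaryDirectory() as tmp:
    missing = find_kaggle_json(home=tmp, env={})      # no key file yet
    key_dir = Path(tmp) / ".kaggle"
    key_dir.mkdir()
    (key_dir / "kaggle.json").write_text("{}")
    present = find_kaggle_json(home=tmp, env={})      # key file now exists
    present_name = present.name
```

Calling `find_kaggle_json()` with no arguments checks the real home directory and environment.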

In [8]:
# Download the zipped data using the Kaggle API 
!kaggle competitions download -c riiid-test-answer-prediction -p "..\data\raw"

Downloading riiid-test-answer-prediction.zip to ..\data\raw





In [39]:
#Unzip the downloaded file
from zipfile import ZipFile
from pathlib import Path

zipped_data_path = Path('../data/raw/riiid-test-answer-prediction.zip')
unzip_destination_folder_path = Path('../data/interim')

with ZipFile(zipped_data_path, 'r') as zf:
    # Save list of file names in zip file to a list
    zf_names = zf.namelist()
    # Extract all files
    zf.extractall(unzip_destination_folder_path)

In [50]:
#Check the names of the unzipped files for the file names containing our data.
zf_names

['example_sample_submission.csv',
 'example_test.csv',
 'lectures.csv',
 'questions.csv',
 'riiideducation/__init__.py',
 'riiideducation/competition.cpython-37m-x86_64-linux-gnu.so',
 'train.csv']

The data files are `lectures.csv`, `questions.csv`, and `train.csv`.
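The same selection can be made programmatically from the archive listing. This is a minimal sketch, assuming that data files are top-level `.csv` members whose names do not start with `example_`. The function name `select_data_csvs` is hypothetical, and the list is copied from the output above.

```python
def select_data_csvs(names):
    """Keep top-level CSV members that are not example or submission templates."""
    return [
        name
        for name in names
        if name.endswith(".csv")
        and "/" not in name                    # skip files inside sub-folders
        and not name.startswith("example_")    # assumption: examples are not data
    ]


# Archive listing copied from the cell output above
archive_names = [
    "example_sample_submission.csv",
    "example_test.csv",
    "lectures.csv",
    "questions.csv",
    "riiideducation/__init__.py",
    "riiideducation/competition.cpython-37m-x86_64-linux-gnu.so",
    "train.csv",
]

data_csv_files = select_data_csvs(archive_names)
print(data_csv_files)
```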

In [64]:
import pandas as pd

#Define data paths
lectures_csv_path = Path('../data/interim/lectures.csv')
questions_csv_path = Path('../data/interim/questions.csv')
train_csv_path = Path('../data/interim/train.csv')

# For these small csv files, import them directly with pandas.read_csv()
df_lectures = pd.read_csv(lectures_csv_path)
df_questions = pd.read_csv(questions_csv_path)

`train.csv` is a large csv file, over 7 GB and 100M rows, so we need to load it in chunks.

In [None]:
# Make a DataFrame generator that reads the CSV in chunks of 10 million rows
df_generator = pd.read_csv(train_csv_path, chunksize=10_000_000)

# Concatenate all chunks into a single DataFrame: df_train
# (pd.concat accepts any iterable of DataFrames, including the chunk reader)
df_train = pd.concat(df_generator, ignore_index=True)
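Chunked reading can be tried on a small in-memory CSV before running it on the full file. The sketch below uses a made-up two-column table as a stand-in for `train.csv`. It reads four rows at a time and concatenates the chunks once, rather than growing a DataFrame inside a loop.

```python
import io

import pandas as pd

# Small in-memory stand-in for train.csv (illustrative columns and values)
csv_text = "row_id,answered_correctly\n" + "\n".join(
    f"{i},{i % 2}" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Concatenate the chunks in one call
df_chunked = pd.concat(chunks, ignore_index=True)

# A single-pass read of the same text, for comparison
df_direct = pd.read_csv(io.StringIO(csv_text))
print(len(df_chunked), len(df_direct))
```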


After importing the data into DataFrames, save each DataFrame as a binary (pickle) file so that we can quickly reload it and resume.


In [None]:
# Define paths
lectures_pkl_path = Path('../data/interim/lectures.pkl.gzip')
questions_pkl_path = Path('../data/interim/questions.pkl.gzip')
train_pkl_path = Path('../data/interim/train.pkl.gzip')

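# Save DataFrames as pickle files
# Note: pandas infers compression from suffixes such as '.gz';
# '.gzip' is not one of them, so these files are written uncompressed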
df_lectures.to_pickle(lectures_pkl_path)
df_questions.to_pickle(questions_pkl_path)
df_train.to_pickle(train_pkl_path)

## Resuming data analysis without reimporting data
After the binary files have been saved, we can quickly resume by loading them rather than downloading, unzipping, and reading the csv files in chunks again.

In [None]:
import pickle
from pathlib import Path

import pandas as pd

# Define paths
lectures_pkl_path = Path('../data/interim/lectures.pkl.gzip')
questions_pkl_path = Path('../data/interim/questions.pkl.gzip')
train_pkl_path = Path('../data/interim/train.pkl.gzip')

with open(lectures_pkl_path, 'rb') as f:
    df_lectures = pickle.load(f)
    
with open(questions_pkl_path, 'rb') as f:
    df_questions = pickle.load(f)
    
with open(train_pkl_path, 'rb') as f:
    df_train = pickle.load(f)
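pandas also provides `read_pickle`, which reverses `to_pickle` in one call and uses the same suffix-based compression inference. Below is a minimal round-trip sketch using a throwaway file and made-up values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for one of the small DataFrames
df_small = pd.DataFrame({"lecture_id": [89, 100], "part": [5, 1]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "lectures_demo.pkl"
    df_small.to_pickle(pkl_path)            # write the binary file
    df_reloaded = pd.read_pickle(pkl_path)  # read it back in one call

print(df_reloaded.equals(df_small))
```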

## Data Joining

The training data is available as one large file, so there is no need to join data other than concatenating the chunks of that file while importing it.

## Data Subsetting with the larger dataframes
As the data is quite large, it might be useful to subset the data during exploratory data analysis to speed up the process.
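Subsetting can also happen at read time, before the full file is ever held in memory. `pd.read_csv` accepts a callable for `skiprows`, which receives each row index and returns `True` for rows to skip. The sketch below assumes the header is on line 0 and keeps every nth data row. The helper name `keep_every_nth_row` and the sample CSV are illustrative.

```python
import io

import pandas as pd


def keep_every_nth_row(n):
    """Build a skiprows callable that keeps the header and every nth data row."""
    def should_skip(idx):
        # idx 0 is the header line; data rows start at idx 1
        return idx != 0 and (idx - 1) % n != 0
    return should_skip


# Made-up CSV with 20 data rows
csv_text = "row_id,value\n" + "\n".join(f"{i},{i * i}" for i in range(20))

df_every_5th = pd.read_csv(io.StringIO(csv_text), skiprows=keep_every_nth_row(5))
kept_ids = df_every_5th["row_id"].tolist()
print(kept_ids)
```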

In [None]:
#Define the row skip logic

#Skip every row whose index is not a multiple of n, i.e. keep every nth row
def skip_all_but_nth_rows(n, idx):
    return idx % n != 0

#Skip random rows: returns True with probability 1/n
import random
def rand_1_in_n(n, idx):
    return random.randrange(n) == 0


#Create the subsets 

#Define a Dataframe with 1/10 of the data
df_train_1_10 = df_train[df_train.index % 10 == 0]

#Define a DataFrame with 1/100 of the data
df_train_1_100 = df_train[df_train.index % 100 == 0]

#Define a DataFrame with 1/1000 of the data
df_train_1_1000 = df_train[df_train.index % 1000 == 0]

#Define a Dataframe with 1/10000 of the data
df_train_1_10000 = df_train[df_train.index % 10000 == 0]

#Define a Dataframe with 1/100000 of the data
df_train_1_100000 = df_train[df_train.index % 100000 == 0]

#Define a Dataframe with 1/1000000 of the data
df_train_1_1000000 = df_train[df_train.index % 1000000 == 0]

#Define a Dataframe with 1/10000000 of the data
df_train_1_10000000 = df_train[df_train.index % 10000000 == 0]


#Define subset paths
train_pkl_path_1_10 = Path('../data/interim/train_1_10.pkl.gzip')
train_pkl_path_1_100 = Path('../data/interim/train_1_100.pkl.gzip')
train_pkl_path_1_1000 = Path('../data/interim/train_1_1000.pkl.gzip')
train_pkl_path_1_10000 = Path('../data/interim/train_1_10000.pkl.gzip')
train_pkl_path_1_100000 = Path('../data/interim/train_1_100000.pkl.gzip')
train_pkl_path_1_1000000 = Path('../data/interim/train_1_1000000.pkl.gzip')
train_pkl_path_1_10000000 = Path('../data/interim/train_1_10000000.pkl.gzip')

#Save subset dataframes to pkl (the full DataFrame was already saved to train_pkl_path)
df_train_1_10.to_pickle(train_pkl_path_1_10)
df_train_1_100.to_pickle(train_pkl_path_1_100)
df_train_1_1000.to_pickle(train_pkl_path_1_1000)
df_train_1_10000.to_pickle(train_pkl_path_1_10000)
df_train_1_100000.to_pickle(train_pkl_path_1_100000)
df_train_1_1000000.to_pickle(train_pkl_path_1_1000000)
df_train_1_10000000.to_pickle(train_pkl_path_1_10000000)
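The repeated definitions above follow one pattern, so they could be generated in a loop. This sketch assumes a default integer index, as in `df_train`. The function name `make_subsets` and the demo DataFrame are hypothetical.

```python
import pandas as pd


def make_subsets(df, steps):
    """Return {step: rows of df whose index is a multiple of step}."""
    return {step: df[df.index % step == 0] for step in steps}


# Demo DataFrame with a default integer index 0..999
df_index_demo = pd.DataFrame({"value": range(1000)})

subsets = make_subsets(df_index_demo, steps=[10, 100, 1000])
sizes = {step: len(subset) for step, subset in subsets.items()}
print(sizes)

# Each subset could then be saved with a path built from its step, e.g.
# Path(f'../data/interim/train_1_{step}.pkl.gzip')
```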

# Data Organization

## File Structure

We'll use the default [file structure template for data science from cookiecutter data science](https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e).

The data files have already been imported according to this template.




|Path|Description|
|----|-----------|
|├── README.md|Front page of the project. Let everyone know the major points.|
|├── models|Trained and serialized models, model predictions, or model summaries.|
|├── notebooks|Jupyter notebooks. Use a set naming convention, e.g. `1.2-rd-data-exploration`.|
|├── reports|HTML, PDF, and LaTeX.|
|│   └── figures|Generated figures.|
|├── requirements.txt|File for reproducing the environment: `$ pip freeze > requirements.txt`|
|├── data| |
|│   ├── external|Third party sources.|
|│   ├── interim|In-progress intermediate data.|
|│   ├── processed|The final data sets for modelling.|
|│   └── raw|The original, immutable data.|
|└── src|Source code for use in this project.|
|    ├── `__init__.py`|Makes src a Python module.|
|    ├── custom_func.py|Various custom functions to import.|
|    ├── data|Scripts to download or generate data.|
|    │   └── make_dataset.py| |
|    ├── features|Scripts to turn raw data into features for modeling.|
|    │   └── build_features.py| |
|    ├── models|Scripts to train models and then use trained models to make predictions.|
|    │   ├── predict_model.py| |
|    │   └── train_model.py| |
|    └── viz|Scripts to create visualizations.|
|        └── viz.py| |
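The cookiecutter tool generates this layout automatically. As a rough sketch of what the directory part of the template amounts to, the folders can be created with `pathlib`. The folder list is taken from the table above, and the function name `create_skeleton` is hypothetical.

```python
import tempfile
from pathlib import Path

# Folders from the template above (files such as README.md are not created here)
PROJECT_DIRS = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "models", "notebooks", "reports/figures",
    "src/data", "src/features", "src/models", "src/viz",
]


def create_skeleton(root):
    """Create the project folders under root and return them as sorted posix paths."""
    root = Path(root)
    for rel in PROJECT_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(tmp)

print(created)
```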

## Version Control

This notebook and its related files will be stored in a local repository and on GitHub at:
https://github.com/allen44/capstone-2

## Environment variables
Following the best practices outlined in [The Twelve-Factor App](https://12factor.net/), environment variables will be excluded from version control.

For this notebook, that means that any user wishing to reproduce the data loading steps will need their own Kaggle API key.
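As an alternative to a key file, the Kaggle client can read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which keeps the secret out of the repository. The sketch below reports which of the two are missing. The helper name `missing_kaggle_env` is hypothetical, and the sample values are placeholders.

```python
import os

# Environment variables the Kaggle client reads for credentials
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Placeholder mappings instead of the real environment
partial = missing_kaggle_env({"KAGGLE_USERNAME": "someone"})
complete = missing_kaggle_env({"KAGGLE_USERNAME": "someone", "KAGGLE_KEY": "secret"})
print(partial, complete)
```

Calling `missing_kaggle_env()` with no arguments checks the real environment.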

# Data Definition

Kaggle lists the definitions of the data on the [competition webpage](https://www.kaggle.com/c/riiid-test-answer-prediction/data).

Here's an excerpt of the relevant section:


### lectures.csv: metadata for the lectures watched by users as they progress in their education.
>`lecture_id`: foreign key for the train/test content_id column, when the content type is lecture (1).

>`part`: top level category code for the lecture.

>`tag`: one tag codes for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together.

>`type_of`: brief description of the core purpose of the lecture

### questions.csv: metadata for the questions posed to users.
>`question_id`: foreign key for the train/test content_id column, when the content type is question (0).

>`bundle_id`: code for which questions are served together.

>`correct_answer`: the answer to the question. Can be compared with the train user_answer column to check if the user was right.

>`part`: the relevant section of the TOEIC test.

>`tags`: one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.

### train.csv 
>`content_id`: (int16) ID code for the user interaction

>`content_type_id`: (int8) 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.

>`task_container_id`: (int16) Id code for the batch of questions or lectures. For example, a user might see three questions in a row before seeing the explanations for any of them. Those three would all share a task_container_id.

>`user_answer`: (int8) the user's answer to the question, if any. Read -1 as null, for lectures.

>`answered_correctly`: (int8) if the user responded correctly. Read -1 as null, for lectures.

>`prior_question_elapsed_time`: (float32) The average time in milliseconds it took a user to answer each question in the previous question bundle, ignoring any lectures in between. Is null for a user's first question bundle or lecture. Note that the time is the average time a user took to solve each question in the previous bundle.

>`prior_question_had_explanation`: (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a single question bundle, and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback.


# Data Cleaning


### Pandas Profiling
The Pandas Profiling module is a quick way to get an overview of the data sets.

In [None]:
# Install updated pandas profiling

%load_ext autoreload
%autoreload 2

import sys
!{sys.executable} -m pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
!jupyter nbextension enable --py widgetsnbextension

%autoreload 0

## Check for missing data

Based on the excerpt above, we have enough info to set the dtypes on the DataFrames, and label missing or null data.
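Because the excerpt describes two kinds of missing values, true nulls and the `-1` sentinel used for lecture rows, a quick count of both per column is a useful first check. This is a minimal sketch on made-up rows. The function name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df, sentinel=-1):
    """Count nulls and sentinel values per column."""
    return pd.DataFrame({
        "n_null": df.isna().sum(),
        "n_sentinel": (df == sentinel).sum(),
    })


# Made-up events: two questions followed by one lecture
df_events = pd.DataFrame({
    "content_type_id": [0, 0, 1],
    "user_answer": [3, 1, -1],
    "answered_correctly": [1, 0, -1],
    "prior_question_elapsed_time": [None, 24000.0, None],
})

summary = missing_summary(df_events)
print(summary)
```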

## Set the dtypes

In [60]:
lectures_dtypes = {'lecture_id': 'category',
                   'part': 'category',
                   'tag': 'category',
                   'type_of': 'string'}

questions_dtypes = {'question_id': 'category',
                    'bundle_id': 'category',
                    'correct_answer': 'category',
                    'part': 'category',
                    'tags': 'string'}

train_dtypes = {'content_id': 'category', 
                'content_type_id': 'category',
                'task_container_id': 'category', 
                'user_answer': 'category',
                'answered_correctly': 'boolean',
                'prior_question_elapsed_time': 'float',
                'prior_question_had_explanation': 'boolean'}
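One caveat when applying these dtypes: lecture rows store `-1` in `answered_correctly`, and a direct cast to the nullable `boolean` dtype is expected to reject that value, so the sentinel has to become a null first. The sketch below shows one way to do this on a made-up three-row sample. It is an illustration under those assumptions, not the notebook's final cleaning step.

```python
import io

import pandas as pd

# Made-up sample: two question rows and one lecture row (-1 sentinel)
csv_text = (
    "content_type_id,answered_correctly,prior_question_had_explanation\n"
    "0,1,\n"
    "0,0,True\n"
    "1,-1,False\n"
)
df_raw = pd.read_csv(io.StringIO(csv_text))

df_typed = df_raw.copy()

# Map 1/0 to True/False; -1 is not in the mapping, so it becomes null
df_typed["answered_correctly"] = (
    df_raw["answered_correctly"].map({1: True, 0: False}).astype("boolean")
)

# Blank cells are read as null and stay null in the nullable boolean dtype
df_typed["prior_question_had_explanation"] = (
    df_raw["prior_question_had_explanation"].astype("boolean")
)

df_typed["content_type_id"] = df_raw["content_type_id"].astype("category")

print(df_typed.dtypes)
```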
