# Processing data for linear regression

This notebook takes the JSON data provided for the Kaggle competition Random Acts of Pizza and transforms it into a dataframe containing all the features of interest, so that we can fit a linear regression and predict who will get a pizza.

## Steps to cover

1. Data exploration
2. Feature selection
3. Data wrangling
4. Output file

# Data exploration

The data comes in two JSON files: train and test. The train data includes the outcome of interest, "requester received pizza", plus a lot of other information that we can use for prediction. Let's look at the data.

In [4]:
# First let's import libraries of interest
import json
import pandas as pd

In [16]:
with open("../data/external/train.json") as f:
    data = json.load(f)
print(json.dumps(data[1], indent=4, sort_keys=True))

{
    "giver_username_if_known": "N/A",
    "number_of_downvotes_of_request_at_retrieval": 2,
    "number_of_upvotes_of_request_at_retrieval": 5,
    "post_was_edited": false,
    "request_id": "t3_rcb83",
    "request_number_of_comments_at_retrieval": 0,
    "request_text": "I spent the last money I had on gas today. Im broke until next Thursday :(",
    "request_text_edit_aware": "I spent the last money I had on gas today. Im broke until next Thursday :(",
    "request_title": "[Request] California, No cash and I could use some dinner",
    "requester_account_age_in_days_at_request": 501.11109953703703,
    "requester_account_age_in_days_at_retrieval": 1122.279837962963,
    "requester_days_since_first_post_on_raop_at_request": 0.0,
    "requester_days_since_first_post_on_raop_at_retrieval": 621.1270717592593,
    "requester_number_of_comments_at_request": 0,
    "requester_number_of_comments_at_retrieval": 1000,
    "requester_number_of_comments_in_raop_at_request": 0,
    "requeste
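
Once the records are loaded, a quick way to explore them is to put them in a dataframe and check the column types and how often a request succeeds. The snippet below is only a sketch: the three records are hand-made stand-ins for `data`, their values are illustrative, and the key `requester_received_pizza` is an assumed spelling of the outcome field.

```python
import pandas as pd

# Hand-made records that mimic the structure of the JSON shown above.
# Values are illustrative; "requester_received_pizza" is an assumed key name.
sample = [
    {"request_id": "example_1", "number_of_upvotes_of_request_at_retrieval": 5,
     "post_was_edited": False, "requester_received_pizza": False},
    {"request_id": "example_2", "number_of_upvotes_of_request_at_retrieval": 12,
     "post_was_edited": True, "requester_received_pizza": True},
    {"request_id": "example_3", "number_of_upvotes_of_request_at_retrieval": 1,
     "post_was_edited": False, "requester_received_pizza": False},
]

# Each dictionary becomes one row; each key becomes one column.
df = pd.DataFrame(sample)

# Column types show which fields are numeric, boolean or text.
print(df.dtypes)

# Share of requests that received a pizza (mean of a boolean column).
success_rate = df["requester_received_pizza"].mean()
print(success_rate)
```

With the real file, `pd.DataFrame(data)` should work the same way on the list returned by `json.load`.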

## What does it mean?

For those who are not familiar with Reddit and Random Acts of Pizza (I wasn't either when I started the project), here is what this data looks like on the site. Not all the information shown here matches the data, because the dataset also includes information linked to the requester beyond the post itself. See figure 1 for reference.

![raop-example](./figures/01-raop-1.png)

Looking at the data, what stands out to me the most is:
- Upvotes/Downvotes
- Number of comments
- Time since publication
- Some say thanks, but are not tagged as "fulfilled"

Now let's look at a single post to make the reference clearer.

![raop-example](./figures/01-raop-2.png)

Inside the post we see some extra information, such as the full title of the request, the status (tagged as fulfilled), and the full post text. The comments below the post can also be read, and they reveal the request status: who is willing to fulfill the request, how much money they are sending versus how much was actually sent, which pizza place, etc.

What stands out most:
- Title
- Text
- Upvotes
- Percentage upvoted
- Number of comments
- Days since posted
- Tags (fulfilled, NSFW)

Now, we have access to the requester information:
- User subreddits
- User edited the post
- User flair

## Feature selection

To select features for our linear model, we should choose features that are independent of each other and that we believe will be good predictors.
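
One rough way to check whether candidate features are related to each other is to look at their pairwise Pearson correlations. This is a sketch, not the notebook's actual procedure: the column names, the made-up values, the helper `correlated_pairs`, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up values for illustration only; the column names are hypothetical.
features = pd.DataFrame({
    "post_upvotes": [5, 12, 3, 40, 8],
    "text_word_count": [16, 120, 45, 300, 60],
    "account_age_days": [501.1, 30.0, 1200.5, 15.2, 365.0],
})


def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |r|) for column pairs whose absolute
    Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once.
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return pairs


print(correlated_pairs(features))
```

A high correlation does not by itself mean a feature must be dropped, and a low one does not prove independence, but pairs flagged here are worth a second look before fitting the model.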

These are the features of interest after exploring the available information and looking at the actual post examples:

- Post upvotes
- Text length (word count)
- Text compound sentiment
- Title length (word count)
- Title sentiment
- Days since request
- User activity on Reddit (count of subreddits)
- Comment count after average request time
- User's posts on Reddit at request (count)
- User's posts on RAOP (count)
- Account age (days)

The response variable is a boolean: whether the requester received a pizza (true) or not (false).
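
As a rough illustration of how some of the listed features could be derived from one raw record, the sketch below uses the fields printed earlier in this notebook. The helper names `word_count` and `extract_features` and the output column names are hypothetical. Sentiment is left out because it needs an external library, and word counts here are simple whitespace splits.

```python
# One record, reduced to the fields used below (values taken from the
# example printed earlier in this notebook).
record = {
    "request_title": "[Request] California, No cash and I could use some dinner",
    "request_text": "I spent the last money I had on gas today. "
                    "Im broke until next Thursday :(",
    "number_of_upvotes_of_request_at_retrieval": 5,
    "requester_account_age_in_days_at_request": 501.11109953703703,
}


def word_count(text):
    """Count whitespace-separated tokens; punctuation stays attached."""
    return len(text.split())


def extract_features(rec):
    """Map one raw request record to a flat dictionary of features."""
    return {
        "post_upvotes": rec["number_of_upvotes_of_request_at_retrieval"],
        "text_word_count": word_count(rec["request_text"]),
        "title_word_count": word_count(rec["request_title"]),
        "account_age_days": rec["requester_account_age_in_days_at_request"],
    }


features = extract_features(record)
print(features)
```

Applying `extract_features` to every record and passing the resulting list of dictionaries to `pd.DataFrame` would give one row per request.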