# Processing data for logistic regression

This notebook takes the JSON data provided for the Kaggle competition Random Acts of Pizza and transforms it into a dataframe containing the features of interest, so that a logistic regression can predict who will receive a pizza.

## Steps to cover

1. Data exploration
2. Feature selection
3. Data Wrangling
4. Output file

# Data exploration

The data comes as two JSON files: train and test. The train data includes the outcome of interest, "requester received pizza", plus plenty of other information that we can use as predictors. Let's look at the data.

In [132]:
# First let's import libraries of interest
import json
import pandas as pd

# Install Libraries
!pip install textblob
!pip install langdetect

# Import libraries - Sentiment
from textblob import TextBlob
import sys
import matplotlib.pyplot as plt
import numpy as np
import os
import nltk
nltk.download('vader_lexicon')
import re
import string
# from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer




[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Zaptetra\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [133]:
with open("../data/external/train.json") as f:
    data = json.load(f)
print(json.dumps(data[1], indent=4, sort_keys=True))

{
    "giver_username_if_known": "N/A",
    "number_of_downvotes_of_request_at_retrieval": 2,
    "number_of_upvotes_of_request_at_retrieval": 5,
    "post_was_edited": false,
    "request_id": "t3_rcb83",
    "request_number_of_comments_at_retrieval": 0,
    "request_text": "I spent the last money I had on gas today. Im broke until next Thursday :(",
    "request_text_edit_aware": "I spent the last money I had on gas today. Im broke until next Thursday :(",
    "request_title": "[Request] California, No cash and I could use some dinner",
    "requester_account_age_in_days_at_request": 501.11109953703703,
    "requester_account_age_in_days_at_retrieval": 1122.279837962963,
    "requester_days_since_first_post_on_raop_at_request": 0.0,
    "requester_days_since_first_post_on_raop_at_retrieval": 621.1270717592593,
    "requester_number_of_comments_at_request": 0,
    "requester_number_of_comments_at_retrieval": 1000,
    "requester_number_of_comments_in_raop_at_request": 0,
    "requeste

## What does it mean?

For those who are not familiar with Reddit and Random Acts of Pizza (I wasn't either when I started the project), here is how this data looks on the site. Not everything shown on the site matches the data, because the dataset also carries information linked to the requester beyond the post itself. See figure 1 for reference.

![raop-example](./figures/01-raop-1.png)

Looking at the data, what stands out to me the most is:
- Upvotes/Downvotes
- Number of comments
- Time since publication
- Some say thanks, but are not tagged as "fulfilled"

Now let's look at a single post to make the reference clearer.

![raop-example](./figures/01-raop-2.png)

Inside the post we get to see some extra information, like the full title of the request, the status (tagged as fulfilled) and the full post text. Below the post we can also read the comments, which reveal the request status: who is willing to fulfill the request, how much money they are sending vs how much was actually sent, what pizza place, etc.

What stands out most:
- Title
- Text
- Upvotes
- Percentage of upvotes
- Number of comments
- Days since posted
- Tags (fulfilled, NSFW)

Now, we have access to the requester information:
- User subreddits
- User edited the post
- User flair

## Feature selection - Baseline model

To select features for our logistic regression model, we want features that are as independent from each other as possible and that we believe will be good predictors.

These are the features of interest after exploring the information available and looking at the actual post examples:

- Post upvotes
- Text length (word count)
- Text compound sentiment
- Title length (word count)
- Title sentiment
- Days since request
- User activity on Reddit (count of subreddits)
- Comment count after average request time
- User posts on Reddit at request (count)
- User posts on RAOP (count)
- Account age (days)

The response variable is binary: true if the requester received a pizza, false otherwise.
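
Since the baseline model works best with features that are roughly independent, one quick check is a pairwise correlation matrix. The snippet below is only a minimal sketch: the column names mirror some of the features listed above, but the values are made up for illustration and the 0.8 cut-off is an assumption, not a rule from this project.

```python
import pandas as pd

# Toy values for illustration only; not taken from the RAOP data
toy = pd.DataFrame({
    "post_upvotes":     [1, 3, 3, 1, 0, 7],
    "text_word_count":  [67, 16, 59, 30, 103, 45],
    "title_word_count": [6, 10, 10, 11, 14, 9],
})

# Pearson correlation between every pair of candidate features
corr = toy.corr()

# Collect pairs whose absolute correlation exceeds a hypothetical threshold
threshold = 0.8
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("Highly correlated pairs:", high)
```

Pairs that show up in `high` would be candidates for dropping or combining before fitting the model.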

# Data Wrangling

The next step is to process the input data to create a dataframe with the features of interest for the baseline model.

In [134]:
# Create a df from json

df = pd.json_normalize(data)
df.head()

Unnamed: 0,giver_username_if_known,number_of_downvotes_of_request_at_retrieval,number_of_upvotes_of_request_at_retrieval,post_was_edited,request_id,request_number_of_comments_at_retrieval,request_text,request_text_edit_aware,request_title,requester_account_age_in_days_at_request,...,requester_received_pizza,requester_subreddits_at_request,requester_upvotes_minus_downvotes_at_request,requester_upvotes_minus_downvotes_at_retrieval,requester_upvotes_plus_downvotes_at_request,requester_upvotes_plus_downvotes_at_retrieval,requester_user_flair,requester_username,unix_timestamp_of_request,unix_timestamp_of_request_utc
0,,0,1,False,t3_l25d7,0,Hi I am in need of food for my 4 children we a...,Hi I am in need of food for my 4 children we a...,Request Colorado Springs Help Us Please,0.0,...,False,[],0,1,0,1,,nickylvst,1317853000.0,1317849000.0
1,,2,5,False,t3_rcb83,0,I spent the last money I had on gas today. Im ...,I spent the last money I had on gas today. Im ...,"[Request] California, No cash and I could use ...",501.1111,...,False,"[AskReddit, Eve, IAmA, MontereyBay, RandomKind...",34,4258,116,11168,,fohacidal,1332652000.0,1332649000.0
2,,0,3,False,t3_lpu5j,0,My girlfriend decided it would be a good idea ...,My girlfriend decided it would be a good idea ...,"[Request] Hungry couple in Dundee, Scotland wo...",0.0,...,False,[],0,3,0,3,,jacquibatman7,1319650000.0,1319646000.0
3,,0,1,True,t3_mxvj3,4,"It's cold, I'n hungry, and to be completely ho...","It's cold, I'n hungry, and to be completely ho...","[Request] In Canada (Ontario), just got home f...",6.518438,...,False,"[AskReddit, DJs, IAmA, Random_Acts_Of_Pizza]",54,59,76,81,,4on_the_floor,1322855000.0,1322855000.0
4,,6,6,False,t3_1i6486,5,hey guys:\n I love this sub. I think it's grea...,hey guys:\n I love this sub. I think it's grea...,[Request] Old friend coming to visit. Would LO...,162.063252,...,False,"[GayBrosWeightLoss, RandomActsOfCookies, Rando...",1121,1225,1733,1887,,Futuredogwalker,1373658000.0,1373654000.0


In [135]:
# let's check all the variables
df.columns

Index(['giver_username_if_known',
       'number_of_downvotes_of_request_at_retrieval',
       'number_of_upvotes_of_request_at_retrieval', 'post_was_edited',
       'request_id', 'request_number_of_comments_at_retrieval', 'request_text',
       'request_text_edit_aware', 'request_title',
       'requester_account_age_in_days_at_request',
       'requester_account_age_in_days_at_retrieval',
       'requester_days_since_first_post_on_raop_at_request',
       'requester_days_since_first_post_on_raop_at_retrieval',
       'requester_number_of_comments_at_request',
       'requester_number_of_comments_at_retrieval',
       'requester_number_of_comments_in_raop_at_request',
       'requester_number_of_comments_in_raop_at_retrieval',
       'requester_number_of_posts_at_request',
       'requester_number_of_posts_at_retrieval',
       'requester_number_of_posts_on_raop_at_request',
       'requester_number_of_posts_on_raop_at_retrieval',
       'requester_number_of_subreddits_at_request', 'r

## Data Transformation

Let's review the appropriate transformation for each variable of interest:

### Outcome variable - True False
**y** = requester_received_pizza


### Independent variables
**Post upvotes** = number_of_upvotes_of_request_at_retrieval - number_of_downvotes_of_request_at_retrieval

**Text length (word count)** = count(request_text)

**Text compound sentiment** = sentiment function (request_text)

**Title length (word count)** = count (request_title)

**Title sentiment** = sentiment function (request_title)

**Days since request** = requester_days_since_first_post_on_raop_at_retrieval - requester_days_since_first_post_on_raop_at_request

**User subreddits (count)** = requester_number_of_subreddits_at_request

**User activity comments** = requester_number_of_comments_at_retrieval - requester_number_of_comments_at_request

**User activity comments raop** = requester_number_of_comments_in_raop_at_retrieval - requester_number_of_comments_in_raop_at_request

**User posts reddit** = requester_number_of_posts_at_retrieval - requester_number_of_posts_at_request

**User posts raop** = requester_number_of_posts_on_raop_at_retrieval - requester_number_of_posts_on_raop_at_request

**Comment count** = request_number_of_comments_at_retrieval 

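**User rate start** = requester_upvotes_plus_downvotes_at_request / requester_upvotes_plus_downvotes_at_request (as written, numerator and denominator are identical, so this ratio is 1 wherever it is defined; by analogy with User rate end, the intended numerator is presumably requester_upvotes_minus_downvotes_at_request)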

**User rate end** = requester_upvotes_minus_downvotes_at_retrieval / requester_upvotes_plus_downvotes_at_retrieval

**Account age (days)** = requester_account_age_in_days_at_retrieval

Unused variables:
- post_was_edited
- request_id
- request_text_edit_aware
- unix_timestamp_of_request
- unix_timestamp_of_request_utc

Requester info unused variables:
- requester_account_age_in_days_at_request
- requester_subreddits_at_request
- requester_upvotes_minus_downvotes_at_request
- requester_user_flair
- requester_username
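
As a worked example, the definitions above can be applied by hand to the sample record printed earlier (request_id `t3_rcb83`). This is just a sketch in plain Python; the values are copied from that record and from the first dataframe preview, and only a handful of the transformations are shown.

```python
# Values copied from the sample record (request_id t3_rcb83)
record = {
    "number_of_upvotes_of_request_at_retrieval": 5,
    "number_of_downvotes_of_request_at_retrieval": 2,
    "request_text": "I spent the last money I had on gas today. Im broke until next Thursday :(",
    "requester_days_since_first_post_on_raop_at_request": 0.0,
    "requester_days_since_first_post_on_raop_at_retrieval": 621.1270717592593,
    "requester_upvotes_minus_downvotes_at_retrieval": 4258,
    "requester_upvotes_plus_downvotes_at_retrieval": 11168,
}

# Post upvotes: upvotes minus downvotes
post_upvotes = (record["number_of_upvotes_of_request_at_retrieval"]
                - record["number_of_downvotes_of_request_at_retrieval"])

# Text length: number of whitespace-separated words
text_word_count = len(record["request_text"].split())

# Days since request: time elapsed between the request and data retrieval
days_since_request = (record["requester_days_since_first_post_on_raop_at_retrieval"]
                      - record["requester_days_since_first_post_on_raop_at_request"])

# User rate end: net votes over total votes at retrieval
user_rate_end = (record["requester_upvotes_minus_downvotes_at_retrieval"]
                 / record["requester_upvotes_plus_downvotes_at_retrieval"])

print(post_upvotes)                  # 3
print(text_word_count)               # 16
print(round(days_since_request, 2))  # 621.13
print(round(user_rate_end, 2))       # 0.38
```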


In [136]:
# functions required

def count_words(text):
    """Return the number of whitespace-separated words in a string."""
    return len(text.split())

# test function
# count_words('I love chocolate and beef, please!')

# Create the VADER analyzer once and reuse it across calls,
# instead of rebuilding it for every row
analyzer = SentimentIntensityAnalyzer()

def sentiment_scores(sentence):
    """Return the VADER compound sentiment of a text, between -1 (negative) and 1 (positive)."""
    sent_dict = analyzer.polarity_scores(sentence)

    # The compound score can be bucketed as 'Positive' (>= 0.2),
    # 'Negative' (<= -0.2) or 'Neutral' otherwise; only the raw
    # compound value is used as a feature here.
    return sent_dict['compound']

# Check the function with an example
# sentiment_scores("70% of proud Georgians said they would ecstatically vote\
#                  and give their life for the great leader Trump in 2024.")
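
The 0.2 and -0.2 cut-offs used to label a compound score can be sketched as a small standalone helper. The name `classify_compound` is hypothetical and not part of the notebook, and the scores fed to it are plain example values rather than new results.

```python
def classify_compound(compound, threshold=0.2):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Example compound values
for score in (0.6124, -0.296, 0.0):
    print(score, classify_compound(score))
```

Only the raw compound score enters the feature set; a label like this is mainly useful for eyeballing individual requests.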

In [137]:
## Update dataframe with constructed variables

df['post_upvotes'] = df['number_of_upvotes_of_request_at_retrieval'] \
                     - df['number_of_downvotes_of_request_at_retrieval']
    
df['text_word_count'] = df['request_text'].apply(count_words)
df['text_sentiment'] = df['request_text'].apply(sentiment_scores)
df['title_word_count'] = df['request_title'].apply(count_words)
df['title_sentiment'] = df['request_title'].apply(sentiment_scores)
df['days_since_request'] = df['requester_days_since_first_post_on_raop_at_retrieval'] \
                           - df['requester_days_since_first_post_on_raop_at_request']
    
df['user_activity_comments'] = df['requester_number_of_comments_at_retrieval'] \
                               - df['requester_number_of_comments_at_request']
df['user_activity_comments_raop'] = df['requester_number_of_comments_in_raop_at_retrieval']\
                                    - df['requester_number_of_comments_in_raop_at_request']
    
df['user_posts_reddit'] = df['requester_number_of_posts_at_retrieval'] \
                          - df['requester_number_of_posts_at_request']

df['user_posts_raop'] = df['requester_number_of_posts_on_raop_at_retrieval'] \
                    - df['requester_number_of_posts_on_raop_at_request']

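# NOTE: numerator and denominator are the same column here, so this ratio is 1.0
# (or NaN where the denominator is 0), as the output below shows. The intended
# numerator is presumably 'requester_upvotes_minus_downvotes_at_request'.
df['user_rate_start'] = df['requester_upvotes_plus_downvotes_at_request'] \
                        / df['requester_upvotes_plus_downvotes_at_request']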

df['user_rate_end'] = df['requester_upvotes_minus_downvotes_at_retrieval'] \
                      / df['requester_upvotes_plus_downvotes_at_retrieval']

# NOTE: rename() and drop() return new DataFrames. Their results are not assigned
# here, so df keeps its original column names and columns (see the output below);
# the actual column selection happens in the next cell.
df.rename(columns={'requester_number_of_subreddits_at_request':'num_subreddits',
                  'request_number_of_comments_at_retrieval':'request_comment_counts',
                  'requester_account_age_in_days_at_retrieval':'account_age'})

df.drop(columns=['post_was_edited',
                 'request_id',
                 'request_text_edit_aware',
                 'unix_timestamp_of_request',
                 'unix_timestamp_of_request_utc',
                 'requester_account_age_in_days_at_request',
                 'requester_subreddits_at_request',
                 'requester_upvotes_minus_downvotes_at_request',
                 'requester_user_flair',
                 'requester_username',
                 'giver_username_if_known',
                ])
df.head()

Unnamed: 0,giver_username_if_known,number_of_downvotes_of_request_at_retrieval,number_of_upvotes_of_request_at_retrieval,post_was_edited,request_id,request_number_of_comments_at_retrieval,request_text,request_text_edit_aware,request_title,requester_account_age_in_days_at_request,...,text_sentiment,title_word_count,title_sentiment,days_since_request,user_activity_comments,user_activity_comments_raop,user_posts_reddit,user_posts_raop,user_rate_start,user_rate_end
0,,0,1,False,t3_l25d7,0,Hi I am in need of food for my 4 children we a...,Hi I am in need of food for my 4 children we a...,Request Colorado Springs Help Us Please,0.0,...,0.8323,6,0.6124,792.420405,0,0,1,1,,1.0
1,,2,5,False,t3_rcb83,0,I spent the last money I had on gas today. Im ...,I spent the last money I had on gas today. Im ...,"[Request] California, No cash and I could use ...",501.1111,...,-0.6908,10,-0.296,621.127072,1000,0,11,2,1.0,0.381268
2,,0,3,False,t3_lpu5j,0,My girlfriend decided it would be a good idea ...,My girlfriend decided it would be a good idea ...,"[Request] Hungry couple in Dundee, Scotland wo...",0.0,...,0.8074,10,0.6696,771.616181,0,0,1,1,,1.0
3,,0,1,True,t3_mxvj3,4,"It's cold, I'n hungry, and to be completely ho...","It's cold, I'n hungry, and to be completely ho...","[Request] In Canada (Ontario), just got home f...",6.518438,...,0.5154,11,0.0,734.517164,5,2,1,1,1.0,0.728395
4,,6,6,False,t3_1i6486,5,hey guys:\n I love this sub. I think it's grea...,hey guys:\n I love this sub. I think it's grea...,[Request] Old friend coming to visit. Would LO...,162.063252,...,0.9778,14,0.8455,146.570567,38,2,2,1,1.0,0.649179
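
The output above still lists columns such as `giver_username_if_known` and `post_was_edited`, because `rename` and `drop` return new DataFrames rather than modifying `df` in place. A minimal sketch on a toy frame (the values are hypothetical) shows the difference between discarding and assigning the result:

```python
import pandas as pd

toy = pd.DataFrame({
    "requester_number_of_subreddits_at_request": [0, 4],
    "request_id": ["a", "b"],
})
new_names = {"requester_number_of_subreddits_at_request": "num_subreddits"}

# Without assignment the original frame is untouched
toy.rename(columns=new_names)
unchanged = list(toy.columns)

# Assigning the result keeps both the rename and the drop
renamed = toy.rename(columns=new_names).drop(columns=["request_id"])

print(unchanged)
print(list(renamed.columns))
```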


In [196]:
## Reduce the dataset to the variables of interest

df_2 = df.copy()

# Remove columns not used
df_2 = df_2.drop(columns=['giver_username_if_known',
       'number_of_downvotes_of_request_at_retrieval',
       'number_of_upvotes_of_request_at_retrieval', 
       'request_number_of_comments_at_retrieval', 
       'request_text_edit_aware', 
       'requester_account_age_in_days_at_request',
       'requester_account_age_in_days_at_retrieval',
       'requester_days_since_first_post_on_raop_at_request',
       'requester_days_since_first_post_on_raop_at_retrieval',
       'requester_number_of_comments_at_request',
       'requester_number_of_comments_at_retrieval',
       'requester_number_of_comments_in_raop_at_request',
       'requester_number_of_comments_in_raop_at_retrieval',
       'requester_number_of_posts_at_request',
       'requester_number_of_posts_at_retrieval',
       'requester_number_of_posts_on_raop_at_request',
       'requester_number_of_posts_on_raop_at_retrieval',
       'requester_number_of_subreddits_at_request',
       'requester_subreddits_at_request',
       'requester_upvotes_minus_downvotes_at_request',
       'requester_upvotes_minus_downvotes_at_retrieval',
       'requester_upvotes_plus_downvotes_at_request',
       'requester_upvotes_plus_downvotes_at_retrieval', 'requester_user_flair',
       'unix_timestamp_of_request','unix_timestamp_of_request_utc',
       'request_text','request_title','requester_username','request_id' ], axis=1)

# Clean values
# df_2['requester_received_pizza'] = df_2['requester_received_pizza'].apply(lambda x: 1 if (x==True) else 0 )
df_2['post_was_edited'] = df_2['post_was_edited'].apply(lambda x: 0 if (x==False) else 1)
df_2['user_rate_start'] = df_2['user_rate_start'].fillna(df_2['user_rate_start'].median())
df_2['user_rate_end'] = df_2['user_rate_end'].fillna(df_2['user_rate_end'].median())
df_2.insert(len(df_2.columns)-1, 'requester_received_pizza', df_2.pop('requester_received_pizza'))

# round data
df_2 = df_2.round(decimals=2)

df_2.head()

| | post_was_edited | post_upvotes | text_word_count | text_sentiment | title_word_count | title_sentiment | days_since_request | user_activity_comments | user_activity_comments_raop | user_posts_reddit | user_posts_raop | user_rate_start | user_rate_end | requester_received_pizza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 67 | 0.83 | 6 | 0.61 | 792.42 | 0 | 0 | 1 | 1 | 1.0 | 1.0 | False |
| 1 | 0 | 3 | 16 | -0.69 | 10 | -0.3 | 621.13 | 1000 | 0 | 11 | 2 | 1.0 | 0.38 | False |
| 2 | 0 | 3 | 59 | 0.81 | 10 | 0.67 | 771.62 | 0 | 0 | 1 | 1 | 1.0 | 1.0 | False |
| 3 | 1 | 1 | 30 | 0.52 | 11 | 0.0 | 734.52 | 5 | 2 | 1 | 1 | 1.0 | 0.73 | False |
| 4 | 0 | 0 | 103 | 0.98 | 14 | 0.85 | 146.57 | 38 | 2 | 2 | 1 | 1.0 | 0.65 | False |
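
The median fill in the cell above matters because a user with no votes at all gives a 0 / 0 ratio, which pandas stores as NaN (the earlier preview shows NaN in `user_rate_start` for such rows). A small sketch with toy vote totals, assuming the same fill strategy:

```python
import pandas as pd

# Toy vote totals; the first user has no votes at all
minus = pd.Series([0, 4258, 59])
plus = pd.Series([0, 11168, 81])

# 0 / 0 yields NaN rather than raising an error
rate = minus / plus

# median() skips NaN by default, so the fill value comes from the defined ratios
filled = rate.fillna(rate.median())

print(rate.round(2).tolist())
print(filled.round(2).tolist())
```

Filling with the median keeps every row in the dataset, at the cost of treating users without votes as if they were typical.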


In [197]:
df_2.describe()

| | post_was_edited | post_upvotes | text_word_count | text_sentiment | title_word_count | title_sentiment | days_since_request | user_activity_comments | user_activity_comments_raop | user_posts_reddit | user_posts_raop | user_rate_start | user_rate_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 | 4040.0 |
| mean | 0.159653 | 3.755941 | 77.529208 | 0.512564 | 12.422525 | 0.053941 | 502.576198 | 174.327475 | 1.981683 | 19.550248 | 1.175495 | 1.0 | 0.492458 |
| std | 0.36633 | 9.048429 | 71.281563 | 0.519856 | 6.965016 | 0.412741 | 270.740701 | 293.440622 | 4.883963 | 47.282088 | 0.481542 | 0.0 | 0.219908 |
| min | 0.0 | -7.0 | 0.0 | -0.99 | 1.0 | -0.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.56 |
| 25% | 0.0 | 1.0 | 35.0 | 0.25 | 8.0 | -0.25 | 250.4725 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.36 |
| 50% | 0.0 | 2.0 | 59.0 | 0.74 | 11.0 | 0.0 | 506.305 | 28.0 | 1.0 | 4.0 | 1.0 | 1.0 | 0.53 |
| 75% | 0.0 | 4.0 | 97.0 | 0.92 | 16.0 | 0.4 | 761.38 | 185.0 | 2.0 | 18.0 | 1.0 | 1.0 | 0.63 |
| max | 1.0 | 314.0 | 854.0 | 1.0 | 52.0 | 0.97 | 1025.41 | 1000.0 | 135.0 | 998.0 | 9.0 | 1.0 | 1.0 |


In [198]:
# Now save for future use
# Will save as csv in data/interim

df_2.to_csv('../data/interim/logit_sentiments.csv', index=False)
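
When the saved file is picked up later, the outcome column has to be separated from the predictors. The sketch below reads a tiny in-memory CSV instead of the real file so it runs on its own: the two rows are copied from the `df_2.head()` output, the column set is trimmed for brevity, and the names `X` and `y` are assumptions for illustration.

```python
import io
import pandas as pd

# Two rows from df_2.head(), with a trimmed set of columns
csv_text = """post_was_edited,post_upvotes,text_word_count,text_sentiment,requester_received_pizza
0,1,67,0.83,False
1,1,30,0.52,False
"""

loaded = pd.read_csv(io.StringIO(csv_text))

# Outcome as 0/1 and predictors as everything else
y = loaded["requester_received_pizza"].astype(int)
X = loaded.drop(columns=["requester_received_pizza"])

print(X.shape)
print(y.tolist())
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path used in `to_csv` above.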