### Data Cleaning

In [73]:
import pandas as pd
import numpy as np

In [74]:
# read in scifi data

scifi = pd.read_csv('./data/scifi_new-cleaned.csv')

In [75]:
# read in askscience data

science = pd.read_csv('./data/askscience-cleaned.csv')

In [76]:
# inspect scifi data - first and last five rows
# (a notebook cell renders only its last expression, so only tail() is shown below)

scifi.head()
scifi.tail()

| | created_utc | id | permalink | selftext | subreddit | subreddit_id | title |
|---|---|---|---|---|---|---|---|
| 96630 | 1202244450 | 67rqj | /r/scifi/comments/67rqj/whos_the_worst_bad_guy... | [deleted] | scifi | t5_2qh2z | Who's the worst bad guy of all time? You gues... |
| 96631 | 1201617121 | 66zem | /r/scifi/comments/66zem/understanding_manga_an... | | scifi | t5_2qh2z | Understanding Manga: An Interview with Robin B... |
| 96632 | 1201592426 | 66yco | /r/scifi/comments/66yco/escape_pod_weekly_sf_s... | | scifi | t5_2qh2z | Escape Pod - weekly SF short story podcast |
| 96633 | 1201503994 | 66tw5 | /r/scifi/comments/66tw5/io9_gawkers_science_fi... | | scifi | t5_2qh2z | io9 - Gawker's science fiction site |
| 96634 | 1201450476 | 66rql | /r/scifi/comments/66rql/clive_thompson_on_why_... | | scifi | t5_2qh2z | Clive Thompson on Why Sci-Fi Is the Last Basti... |


In [77]:
# inspect science data - first and last five rows
# (a notebook cell renders only its last expression, so only tail() is shown below)

science.head()
science.tail()

| | created_utc | id | permalink | selftext | subreddit | subreddit_id | title |
|---|---|---|---|---|---|---|---|
| 805808 | 1413243425 | 2j5yi3 | /r/askscience/comments/2j5yi3/any_safety_conce... | | askscience | t5_2qm4e | Any safety concerns for infant in vibrating ro... |
| 805809 | 1413243260 | 2j5y7b | /r/askscience/comments/2j5y7b/what_happens_whe... | | askscience | t5_2qm4e | What happens when a heterochronic parabiont is... |
| 805810 | 1413242837 | 2j5xhr | /r/askscience/comments/2j5xhr/how_come_all_obj... | | askscience | t5_2qm4e | How come all objects in the solar system orbit... |
| 805811 | 1413242735 | 2j5xbw | /r/askscience/comments/2j5xbw/falling/ | | askscience | t5_2qm4e | Falling |
| 805812 | 1413242651 | 2j5x6o | /r/askscience/comments/2j5x6o/how_come_when_i_... | | askscience | t5_2qm4e | How come when I go to bed hungry, I'm not hung... |


**Time Stamps**

In [78]:
# maximum and minimum timestamps of scifi

print('scifi')
print('Min', scifi['created_utc'].min())
print('Max', scifi['created_utc'].max())
print('')

# maximum and minimum timestamps of science

print('science')
print('Min', science['created_utc'].min())
print('Max', science['created_utc'].max())

scifi
Min 1201450476
Max 1587328409

science
Min 1413242651
Max 1587224485
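
The raw values are hard to read as they stand. A minimal sketch of converting them to dates, assuming `created_utc` holds Unix epoch seconds in UTC (the usual convention for Reddit data, but not confirmed in this notebook); the helper name `to_utc` is hypothetical:

```python
import pandas as pd

def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to timezone-aware UTC timestamps."""
    return pd.to_datetime(epoch_seconds, unit='s', utc=True)

# the minimum and maximum values printed above
bounds = pd.Series(
    [1201450476, 1587328409, 1413242651, 1587224485],
    index=['scifi min', 'scifi max', 'science min', 'science max'],
)

readable = to_utc(bounds)
print(readable)
```

Under that assumption, the scifi submissions run from January 2008 to April 2020 and the askscience submissions from October 2014 to April 2020.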


#### Drop Duplicate Rows

**Science dataframe**

In [79]:
# drop duplicates from dataset based on created_utc
# keep first instances 

science.drop_duplicates('created_utc', inplace=True)

**Scifi dataframe**

In [80]:
# drop duplicates from scifi based on 'created_utc'

scifi.drop_duplicates('created_utc', inplace=True)

# source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
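
Deduplicating on `created_utc` alone treats any two submissions posted in the same second as duplicates. The sketch below uses made-up toy data to show the difference between that and deduplicating on `id`; whether `id` is the better key for the real data is an assumption that would need checking.

```python
import pandas as pd

# toy data: two different posts that share a timestamp, plus one true duplicate
posts = pd.DataFrame({
    'created_utc': [100, 100, 200, 200],
    'id': ['a1', 'b2', 'c3', 'c3'],
    'title': ['first post', 'other post', 'repeat', 'repeat'],
})

# dedupe on created_utc only (as in the cells above): keeps the first row per timestamp
by_time = posts.drop_duplicates('created_utc')

# dedupe on id instead: keeps distinct posts that were created in the same second
by_id = posts.drop_duplicates('id')

print(len(by_time), len(by_id))
```

Here the timestamp-based version also discards post `b2`, which is not a duplicate of `a1`.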

#### Inspect the Dataframes

In [81]:
# create function - to inspect data frame

def inspect(dataframe):
    print('Rows, columns:', dataframe.shape)
    print('')
    print(dataframe.info())      

**Scifi dataframe**

In [82]:
# print info about scifi dataframe

inspect(scifi)

Rows, columns: (96113, 7)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96113 entries, 0 to 96634
Data columns (total 7 columns):
created_utc     96113 non-null int64
id              96113 non-null object
permalink       96113 non-null object
selftext        32266 non-null object
subreddit       96113 non-null object
subreddit_id    96113 non-null object
title           96113 non-null object
dtypes: int64(1), object(6)
memory usage: 5.9+ MB
None


**Science dataframe**

In [83]:
# print info about science dataframe

inspect(science)

Rows, columns: (805813, 7)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 805813 entries, 0 to 805812
Data columns (total 7 columns):
created_utc     805813 non-null int64
id              805813 non-null object
permalink       805813 non-null object
selftext        663535 non-null object
subreddit       805810 non-null object
subreddit_id    805810 non-null object
title           805810 non-null object
dtypes: int64(1), object(6)
memory usage: 49.2+ MB
None


#### Missing Data

In [84]:
# create function to count missing values in a column

def missings(column):
    """Print the count and rate of NaN and '[removed]' values in a column."""

    # count of missing data in column
    nans = column.isna().sum()
    print(f'Number of NaNs: {nans}')

    # create boolean array of column
    # True = text was '[removed]'
    removed = column == '[removed]'

    # print number of missings because [removed]
    print(f'Number of text [removed]: {removed.sum()}')
    
    # total number of rows where text is missing in a column
    total = nans + removed.sum()
    # print the number of rows where text is missing
    print(f'Total missings: {total}')
    
    # print rate of missings in a column
    print(f'Rate of missings: {round((total/column.shape[0])*100, 2)}%')

    print('')

In [85]:
# call missings function on scifi['selftext']
print("Scifi['selftext']")
missings(scifi['selftext'])

# call missings function on scifi['title']
print("Scifi['title']")
missings(scifi['title'])

# call missings function on science['selftext']
print("Science['selftext']")
missings(science['selftext'])

# call missings function on science['title']
print("Science['title']")
missings(science['title'])

Scifi['selftext']
Number of NaNs: 63847
Number of text [removed]: 1450
Total missings: 65297
Rate of missings: 67.94%

Scifi['title']
Number of NaNs: 0
Number of text [removed]: 0
Total missings: 0
Rate of missings: 0.0%

Science['selftext']
Number of NaNs: 142278
Number of text [removed]: 604801
Total missings: 747079
Rate of missings: 92.71%

Science['title']
Number of NaNs: 3
Number of text [removed]: 1
Total missings: 4
Rate of missings: 0.0%



Selftext (the body text of a submission) is missing or removed far more often than the title: about 68% of the scifi rows and about 93% of the askscience rows have a NaN or `[removed]` selftext, while titles are almost never missing. Selftext is, however, richer in context-bearing words than the title, so I use the selftext for model building.
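
The same comparison can be collected into one table instead of printed text. This is a sketch only: `missing_summary` is a hypothetical helper, the toy frame stands in for the real data, and unlike `missings` it also counts `[deleted]` as a placeholder.

```python
import numpy as np
import pandas as pd

def missing_summary(df, columns, placeholders=('[removed]', '[deleted]')):
    """Return per-column counts of NaN and placeholder values, plus the missing rate in percent."""
    rows = {}
    for col in columns:
        n_nan = int(df[col].isna().sum())
        n_placeholder = int(df[col].isin(list(placeholders)).sum())
        total = n_nan + n_placeholder
        rows[col] = {
            'nan': n_nan,
            'placeholder': n_placeholder,
            'total': total,
            'rate_pct': round(total / len(df) * 100, 2),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# toy frame standing in for the real data
toy = pd.DataFrame({
    'selftext': [np.nan, '[removed]', 'real text', '[deleted]'],
    'title': ['a', 'b', 'c', 'd'],
})

summary = missing_summary(toy, ['selftext', 'title'])
print(summary)
```

Returning a dataframe makes it easy to compare columns side by side or to reuse the counts later.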

#### Deleting Rows with Missing Data

**Scifi dataframe - selftext:**

In [86]:
# scifi - selftext
# remove rows where selftext is np.nan
# remove rows where selftext is [removed]
# remove rows where selftext is [deleted]
# remove rows where selftext is removed with escaped brackets ("\[removed\]")
# .copy() avoids a SettingWithCopyWarning when the column is reassigned later

scifi = scifi.loc[scifi['selftext'].notna()
                  & (scifi['selftext'] != '[removed]')
                  & (scifi['selftext'] != '[deleted]')
                  & (scifi['selftext'] != r'\[removed\]')].copy()
scifi.shape

(24171, 7)

In [87]:
# define function to
# turn values into np.nan where dataframe[column] contains [removed]

def turn_nan(text):
    if '[removed]' in text:
        return np.nan
    else:
        return text

# code taken from Greg Dye

In [88]:
# run turn_nan function on scifi['selftext'] column

scifi['selftext'] = scifi['selftext'].apply(turn_nan)

In [89]:
# delete rows where scifi['selftext'] is np.nan

scifi = scifi.loc[scifi['selftext'].notna()]
scifi.shape

(24167, 7)
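
The exact-match filter and the `turn_nan` pass could also be expressed as a single vectorized mask. A sketch under the assumption that the column holds only strings and NaN; `usable_text_mask` is a hypothetical name and the sample texts are made up:

```python
import numpy as np
import pandas as pd

def usable_text_mask(series):
    """True where the text is present, is not a placeholder, and has no embedded '[removed]'."""
    # exact placeholders, including the backslash-escaped variant
    exact = series.isin(['[deleted]', r'\[removed\]'])
    # regex=False matches the brackets literally; na=False makes NaN rows count as no match
    embedded = series.str.contains('[removed]', regex=False, na=False)
    return series.notna() & ~exact & ~embedded

texts = pd.Series([
    np.nan,
    '[removed]',
    '[deleted]',
    r'\[removed\]',
    'edit: [removed] by mods',
    'A real question',
])

mask = usable_text_mask(texts)
print(texts[mask])
```

Only the last toy entry survives, which mirrors what the two-step approach above keeps.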

**Science dataframe - selftext:**

In [90]:
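# science
# remove rows where selftext is np.nan
# remove rows where selftext is [removed]
# remove rows where selftext is [deleted]
# remove rows where selftext is removed with escaped brackets ("\[removed\]")
# .copy() avoids a SettingWithCopyWarning when the column is reassigned later

science = science.loc[science['selftext'].notna()
                      & (science['selftext'] != '[removed]')
                      & (science['selftext'] != '[deleted]')
                      & (science['selftext'] != r'\[removed\]')].copy()
science.shape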

(56073, 7)

In [91]:
# run turn_nan function on science['selftext'] column

science['selftext'] = science['selftext'].apply(turn_nan)

# check if there are rows where science['selftext'] has missing values

science['selftext'].isna().sum()

0

**Scifi dataframe - title:**

In [92]:
# call missing values function on scifi['title']
print("Scifi['title']")
missings(scifi['title'])

Scifi['title']
Number of NaNs: 0
Number of text [removed]: 0
Total missings: 0
Rate of missings: 0.0%



In [93]:
# check if there are rows where [removed] is embedded in the text
# if number of missings > 0 ==> delete rows where scifi['title'] is missing
# apply() returns a new Series, so count the NaNs on its result

scifi['title'].apply(turn_nan).isna().sum()

0

There are no missing values in the scifi dataframe's title column.

**Science dataframe - title:**

In [94]:
# call missings function on science['title']
print("Science['title']")
missings(science['title'])

Science['title']
Number of NaNs: 3
Number of text [removed]: 0
Total missings: 3
Rate of missings: 0.01%



In [95]:
# remove the 3 rows with np.nan

science = science.loc[science['title'].notna()]
science.shape

(56070, 7)

In [96]:
# check if science['title'] has rows that contain '[removed]' embedded in the text
# if number of missings > 0 ==> delete rows where science['title'] is missing
# apply() returns a new Series, so count the NaNs on its result

science['title'].apply(turn_nan).isna().sum()

0

#### Create Combined Dataframe

In [97]:
# dataframe size

print(f'scifi: {scifi.shape[0]}')
print(f'science: {science.shape[0]}')

scifi: 24167
science: 56070


**Data**<br>
**r/askscience:** the dataframe has 56_070 rows with timestamps between 1413242651 and 1587224485

**r/scifi:** the dataframe has 24_167 rows with timestamps between 1201450476 and 1587328409

Let's create a combined dataset by drawing a random sample of 24_167 rows from the r/askscience dataframe and appending it to the r/scifi dataframe. This way, exactly half of the combined dataset comes from the scifi subreddit and half from the askscience subreddit.

In [98]:
# select a random sample of 24_167 rows from science dataframe
# set random_state at 1 for reproducibility

science_small = science.sample(n=24_167, random_state=1)
science_small.shape

(24167, 7)

In [99]:
# create combined dataframe
# use continuous indexing, drop original indexing

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = pd.concat([scifi, science_small], ignore_index=True)
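
The sample-then-combine steps above can be wrapped in a small helper that works regardless of which frame is larger. This is a sketch with made-up toy frames; `balance_two` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

def balance_two(df_a, df_b, random_state=1):
    """Downsample the larger frame to the size of the smaller one and stack them."""
    n = min(len(df_a), len(df_b))
    parts = [
        df.sample(n=n, random_state=random_state) if len(df) > n else df
        for df in (df_a, df_b)
    ]
    # ignore_index=True gives the combined frame a fresh 0..N-1 index
    return pd.concat(parts, ignore_index=True)

small = pd.DataFrame({'subreddit': ['scifi'] * 3, 'selftext': ['a', 'b', 'c']})
large = pd.DataFrame({'subreddit': ['askscience'] * 10, 'selftext': list('defghijklm')})

combined = balance_two(small, large)
counts = combined['subreddit'].value_counts()
print(counts)
```

Checking `value_counts()` on the `subreddit` column is a quick way to confirm that the two classes ended up the same size.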

In [101]:
# export data to csv file

data.to_csv('./data/reddit_working.csv', index=False)