## Data Cleaning

With the SpaceX and NASA subreddit data saved in both JSON and CSV formats, I will now clean the data and prepare it for exploratory data analysis (EDA).

In [15]:
import numpy as np
import pandas as pd

### Step # 1: Load the Data

Load the saved CSV files containing the subreddit data for SpaceX and NASA.

In [16]:
spacex_df = pd.read_csv('../data/spacex_df.csv')
nasa_df   = pd.read_csv('../data/nasa_df.csv')

### Step # 2: Clean the data

#### Create one dataframe

Because the same API was used to scrape both the SpaceX and NASA subreddits, I want to combine the two into a single dataframe. This lets me analyze the data together to identify unneeded columns and null cells, and to decide which column(s) are best suited for NLP.

In [17]:
df = pd.concat([spacex_df, nasa_df], sort=True)
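
As a minimal sketch of what `pd.concat` with `sort=True` does here, the toy frames below (hypothetical stand-ins, not the scraped data) have slightly different columns. The result should contain the union of the columns in alphabetical order, with missing values where a frame lacked a column, and each frame's original row labels are kept.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped frames, not the real posts
a = pd.DataFrame({"title": ["Falcon 9 launch"], "ups": [10], "subreddit": ["spacex"]})
b = pd.DataFrame({"title": ["Hubble image"], "subreddit": ["nasa"], "media": ["img"]})

# Rows are stacked; columns are the union of both frames, sorted by name
combined = pd.concat([a, b], sort=True)

print(combined.columns.tolist())  # ['media', 'subreddit', 'title', 'ups']
print(combined.index.tolist())    # [0, 0] because each frame keeps its own labels
```

If unique row labels are needed later, passing `ignore_index=True` to `pd.concat` renumbers the rows from 0.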

#### Examine the columns

Review the columns to see which features are best suited for NLP.

In [18]:
df.columns

Index(['Unnamed: 0', 'approved_at_utc', 'approved_by', 'archived', 'author',
       'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category',
       'clicked', 'content_categories', 'contest_mode', 'created',
       'created_utc', 'crosspost_parent', 'crosspost_parent_list',
       'distinguished', 'domain', 'downs', 'edited', 'gilded', 'hidden',
       'hide_score', 'id', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_self', 'is_video',
       'likes', 'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media',
       'media_embed', 'media_metadata', 

#### Remove unneeded columns and remove duplicate posts

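Next I remove the unneeded `Unnamed: 0` column (most likely a row index left over from saving the CSV) and drop duplicate posts so the model doesn't receive the same information twice. The shapes before and after show that this removes one column and 26 duplicate rows (2,474 down to 2,448).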

In [20]:
df.shape

(2474, 99)

In [21]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [22]:
df.drop_duplicates(inplace=True)

In [23]:
df.shape

(2448, 98)
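
The order of these two steps matters, which the sketch below illustrates on hypothetical toy data: `drop_duplicates` only removes rows that match in every column, so a per-row counter such as `Unnamed: 0` would make otherwise identical posts look different.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are the same post scraped twice
posts = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": ["a1", "a1", "b2"],
    "title": ["Launch thread", "Launch thread", "Mars rover update"],
})

# With the counter column present, no two rows are fully identical
with_counter = posts.drop_duplicates()

# After dropping the counter, the repeated post is detected and removed
without_counter = posts.drop(columns="Unnamed: 0").drop_duplicates()

print(len(with_counter), len(without_counter))  # 3 2
```

An alternative, if the API's `id` column is trusted as a unique post identifier, would be `drop_duplicates(subset="id")`; that is an assumption about the data rather than something checked in this notebook.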

In [24]:
df.head()

| | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | ... | thumbnail_height | thumbnail_width | title | ups | url | user_reports | view_count | visited | whitelist_status | wls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | False | ElongatedMuskrat | | | contributor | [] | | r/SpaceX Bot | ... | | | Telstar 18V / APStar 5C Launch Campaign Thread | 246 | https://www.reddit.com/r/spacex/comments/95cte... | [] | | False | all_ads | 6 |
| 1 | | | False | ElongatedMuskrat | | | contributor | [] | | r/SpaceX Bot | ... | | | r/SpaceX Discusses [September 2018, #48] | 170 | https://www.reddit.com/r/spacex/comments/9ckoe... | [] | | False | all_ads | 6 |
| 2 | | | False | soldato_fantasma | | | contributor | [] | | Host of BulgariaSat-1 | ... | 84.0 | 140.0 | SpaceX granted patents for custom-built Starli... | 916 | https://www.teslarati.com/spacex-custom-built-... | [] | | False | all_ads | 6 |
| 3 | | | False | jclishman | | | contributor | [] | | Host of Inmarsat-5 Flight 4 | ... | 140.0 | 140.0 | Jeff Foust on Twitter - "Shotwell: think we’ll... | 210 | https://twitter.com/jeff_foust/status/10378148... | [] | | False | all_ads | 6 |
| 4 | | | False | Bossdude234 | | | | [] | | | ... | 140.0 | 140.0 | SpaceX on Twitter - "Now targeting September 9... | 79 | https://twitter.com/SpaceX/status/103784391187... | [] | | False | all_ads | 6 |


#### Features for NLP

From my review of the data, there are two columns to which I can apply NLP:
- Selftext
    - The body of the subreddit post
- Title
    - The title of the subreddit post, with initial details about its topic
    
#### Analysis of the columns

Counting the null cells in the Title and Selftext columns with `isnull()` shows:
- Selftext has 2,061 null cells, about 84% of the 2,448 posts
- Title has 0 null cells

Because the majority of the cells in the Selftext column are empty, I will use only the Title column from the SpaceX and NASA subreddits for NLP.

In [25]:
print(df.selftext.isnull().sum())
print(df.selftext.isnull().sum() / len(df))

2061
0.8419117647058824


In [26]:
df.title.isnull().sum()

0
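
The same check can be done for both candidate columns at once. The following is a small sketch on hypothetical toy data (the `n_null` and `share_null` names are my own labels); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy data: three of the four posts have no body text
toy = pd.DataFrame({
    "selftext": [None, "Full launch discussion", None, None],
    "title": ["a", "b", "c", "d"],
})

# Boolean mask of missing cells for the two candidate text columns
nulls = toy[["selftext", "title"]].isnull()

# sum() counts the missing cells, mean() gives their share of all rows
summary = pd.DataFrame({"n_null": nulls.sum(), "share_null": nulls.mean()})
print(summary)
```

On the real dataframe, replacing `toy` with `df` should reproduce the counts printed in the cells above.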

In [27]:
df.shape

(2448, 98)

In [28]:
df.subreddit.value_counts()

nasa      1227
spacex    1221
Name: subreddit, dtype: int64

### Step # 3: Save the clean dataframe

In [12]:
df.to_csv('../data/df_final.csv', index=False)
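
Passing `index=False` here keeps the row index out of the file. The sketch below, which uses an in-memory buffer and hypothetical toy data instead of the real file, shows how a saved index comes back as an `Unnamed: 0` column when the CSV is read again, which is the kind of column that had to be dropped earlier.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"title": ["a", "b"]})

# Default to_csv() writes the row index as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())  # ['Unnamed: 0', 'title']
print(no_index.columns.tolist())    # ['title']
```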