## Add the Target Variable
Now that the data from r/democrats and r/Conservative have been acquired through the API, the target variable must be added.  The study is not trying to predict one group in particular, so the assignment of values to the target variable is arbitrary.  In this case r/democrats is assigned 0 and r/Conservative is assigned 1.

In [132]:
import pandas as pd

In [133]:
dems = pd.read_csv('./data/democratic.csv')
reps = pd.read_csv('./data/republican.csv')


In [134]:
dems.head(1)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,backpackwayne,,,,[],,,...,,,THE TIME FOR UNITY IS NOW - A Progressive and ...,80,https://www.reddit.com/r/democrats/comments/95...,[],,False,all_ads,6


In [135]:
dems.shape


(876, 97)

In [136]:
len(dems.title)

876

In [137]:
len(set(dems.title))  #seem to be 3 duplicate titles

873
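
Comparing `len(dems.title)` with `len(set(dems.title))` gives only a count of the repeats. As a minimal sketch, the repeated titles themselves can be listed with `duplicated`. The `toy` frame and its titles below are hypothetical stand-ins, not the Reddit data.

```python
import pandas as pd

# Hypothetical stand-in for the scraped posts; not the real data.
toy = pd.DataFrame({"title": ["post a", "post b", "post a", "post c", "post b"]})

# Count rows whose title has already appeared earlier in the frame.
n_repeats = toy["title"].duplicated().sum()

# Select every row that shares its title with at least one other row.
repeated = toy[toy["title"].duplicated(keep=False)]

print(n_repeats)
print(sorted(repeated["title"].unique()))
```

The same two calls could be applied to `dems.title` or `reps.title` to see which posts are repeated before deciding whether to drop them.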

In [138]:
reps.head(3)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,Yosoff,,Libertarian,"[{'e': 'text', 't': 'First Principles'}]",771ee808-8b38-11e1-8706-12313d096aae,First Principles,dark,...,,,U.S. Constitution Discussion - Week 10 of 52 (...,31,https://www.reddit.com/r/Conservative/comments...,[],,False,all_ads,6
1,,,False,thatrightwinger,,Far-Right,"[{'e': 'text', 't': ""Don't Tread on Me""}]",9b86186a-8b38-11e1-8f58-12313d2c1af1,Don't Tread on Me,dark,...,81.0,140.0,Burt Reynolds: Leading Man &amp; Distillation ...,18,https://www.nationalreview.com/2018/03/burt-re...,[],,False,all_ads,6
2,,,False,chabanais,,Bold,"[{'e': 'text', 't': 'Stronger than derp'}]",,Stronger than derp,dark,...,105.0,140.0,Nike's best trick...,1997,https://i.redd.it/94w2w7ioumk11.jpg,[],,False,all_ads,6


In [139]:
reps.shape

(799, 96)

In [140]:
len(set(reps.title))

788

The two data sets do not quite match.  r/democrats has 876 entries and 97 variables, while r/Conservative has 799 entries and 96 variables.
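
The `shape` outputs above report different column counts for the two frames. One way to find which columns are present in only one frame is `Index.difference`. This is a sketch on hypothetical miniature frames (`left` and `right` are made up, with column names borrowed from the output above), not a check that was run on the real data.

```python
import pandas as pd

# Hypothetical miniature frames; only the column names echo the real output.
left = pd.DataFrame({"author": ["a"], "author_cakeday": [True], "title": ["t1"]})
right = pd.DataFrame({"author": ["b"], "title": ["t2"]})

# Columns that appear in one frame but not in the other.
only_left = left.columns.difference(right.columns)
only_right = right.columns.difference(left.columns)

print(list(only_left))
print(list(only_right))
```

Running the same comparison on `dems.columns` and `reps.columns` would show which variable accounts for the mismatch.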

### Adding target value

In [141]:
dems['target'] = 0
dems['target'].head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

In [142]:
reps['target'] = 1
reps['target'].head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### Merging the dataframes

Each data frame has now gained one column (98 for r/democrats, 97 for r/Conservative), and the two can be merged for analysis.

In [143]:
frames = [dems, reps]
comb = pd.concat(frames, axis = 0)

In [144]:
comb.shape

(1675, 98)

The data frames appear to have combined correctly.  The result has 98 features and 1675 cases, which is 876 + 799.
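
As a minimal sketch of what `pd.concat` does when the frames do not share every column, the example below stacks two hypothetical frames (`a` and `b` are made up for illustration). With the default outer join, a column missing from one frame is filled with NaN, and the original row labels are kept unless `ignore_index=True` is passed.

```python
import pandas as pd

# Hypothetical frames: `a` has one column that `b` lacks.
a = pd.DataFrame({"title": ["t1", "t2"], "author_cakeday": [True, True], "target": 0})
b = pd.DataFrame({"title": ["t3"], "target": 1})

# Default behaviour: union of columns, original row labels preserved.
stacked = pd.concat([a, b], axis=0)

# Same data, but with a fresh 0..n-1 index.
reindexed = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.shape)
print(list(stacked.index))
print(list(reindexed.index))
```

Repeated row labels are harmless for writing the titles to a csv, but `ignore_index=True` may be worth considering if rows are later selected by label.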

Next, the excess variables will be stripped and the DataFrame will be saved to a csv to be read into an analysis notebook.  The vectorization will be done with the analysis, since the parameters of the vectorization will be altered to maximize both the prediction and the explanatory power of the model.

In [145]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1675 entries, 0 to 798
Data columns (total 98 columns):
approved_at_utc                  0 non-null float64
approved_by                      0 non-null float64
archived                         1675 non-null bool
author                           1675 non-null object
author_cakeday                   1 non-null object
author_flair_background_color    0 non-null float64
author_flair_css_class           246 non-null object
author_flair_richtext            1672 non-null object
author_flair_template_id         157 non-null object
author_flair_text                508 non-null object
author_flair_text_color          511 non-null object
author_flair_type                1672 non-null object
author_fullname                  1672 non-null object
banned_at_utc                    0 non-null float64
banned_by                        0 non-null float64
can_gild                         1675 non-null bool
can_mod_post                     1675 non-null bool

In [146]:
# Drop columns where every value is missing; this removes 17 columns (98 -> 81) from the DataFrame
comb = comb.dropna(axis = 1, how = "all")
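
The effect of `how="all"` can be seen on a small example. The sketch below uses a hypothetical `toy` frame (the values are made up) to show that only a column with no values at all is dropped, while a mostly empty column survives.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully empty and one partly empty column.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "approved_by": [np.nan, np.nan, np.nan],   # entirely missing
    "author_cakeday": [np.nan, True, np.nan],  # mostly missing, but not all
})

# Names of the columns in which every value is missing.
all_missing = toy.columns[toy.isna().all()]

# Drop only those columns; partly filled columns are kept.
trimmed = toy.dropna(axis=1, how="all")

print(list(all_missing))
print(list(trimmed.columns))
```

Listing `all_missing` before dropping is one way to confirm how many columns will be removed.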


In [147]:
comb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1675 entries, 0 to 798
Data columns (total 81 columns):
archived                    1675 non-null bool
author                      1675 non-null object
author_cakeday              1 non-null object
author_flair_css_class      246 non-null object
author_flair_richtext       1672 non-null object
author_flair_template_id    157 non-null object
author_flair_text           508 non-null object
author_flair_text_color     511 non-null object
author_flair_type           1672 non-null object
author_fullname             1672 non-null object
can_gild                    1675 non-null bool
can_mod_post                1675 non-null bool
clicked                     1675 non-null bool
contest_mode                1675 non-null bool
created                     1675 non-null float64
created_utc                 1675 non-null float64
crosspost_parent            104 non-null object
crosspost_parent_list       104 non-null object
distinguished               4

In [148]:
comb.shape


(1675, 81)

### Search for Features to be included in analysis

In [150]:
comb.title.head(4)

0    THE TIME FOR UNITY IS NOW - A Progressive and ...
1    Nation Stunned That There Is Someone in White ...
2             Alex Jones Permanently Banned By Twitter
3    It's About Time Senate Democrats Showed Some D...
Name: title, dtype: object

The title contains information that will help identify the issues and words that distinguish Democratic posts from Conservative posts.

In [151]:
features = ['title', 'target']
test_data = comb[features]

In [152]:
test_data.to_csv('./data/combined_data.csv', index = False)
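
The `index = False` argument matters when the file is read back. As a sketch on a hypothetical two-row frame, using in-memory buffers instead of files, writing with the default index produces an extra `Unnamed: 0` column on reload, like the one visible in the `head()` outputs earlier.

```python
import io
import pandas as pd

# Hypothetical frame with the same two columns that are saved above.
toy = pd.DataFrame({"title": ["t1", "t2"], "target": [0, 1]})

# Write with the default index=True, then rewind the buffer for reading.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

# Write without the index.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```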